From yaronh at voltaire.com Tue Mar 1 02:29:46 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Tue, 1 Mar 2005 12:29:46 +0200 Subject: [openib-general] IB Address Translation service Message-ID: <35EA21F54A45CB47B879F21A91F4862F3FACB5@taurus.voltaire.com> Eric, let me correct some of your assumptions Which this API is actually targeting to protect against, see below > -----Original Message----- > From: Eric W. Biederman [mailto:eric at lnxi.com] On Behalf Of Eric W. > Biederman > Sent: Tuesday, March 01, 2005 9:18 AM > To: Yaron Haviv > Cc: Roland Dreier; shaharf; openib-general at openib.org > Subject: Re: [openib-general] IB Address Translation service > > "Yaron Haviv" writes: > > > > -----Original Message----- > > > From: openib-general-bounces at openib.org [mailto:openib-general- > > > bounces at openib.org] On Behalf Of Roland Dreier > > > Sent: Monday, February 28, 2005 7:13 PM > > > To: shaharf > > > Cc: openib-general at openib.org > > > Subject: Re: [openib-general] IB Address Translation service > > > > > > This API seems overly complex and at the same time too inflexible to > > > me. However, rather than getting bogged down nitpicking about APIs, I > > > think we have to take a few steps back. > > > > I believe the API is very flexible, but we are pretty open to here what > > you think is needed in addition > > > > > First, let's understand the problem we're trying to solve. Who are > > > the consumers of this address translation service? > > > > The first problem is that most ULPs use valid IP addresses for > > simplicity (DAPL, iSER, NFS/RDMA, SDP, MPI, etc') and someone needs to > > resolve it to an IB address and device to use IB. This should take into > > account cases where there are more than one HCAs in the system. > > Preferable/optionally the ULP would like to know which partition to use > > if there is more than one, and leverage on the IP subnetting done by > > IPoIB. > > I am confused. 
In any sane network the translation is:
> Hostname -> address.
>
> IP because it spans multiple networks does:
> Hostname -> IP address -> hw address.
>
> IB because it can span multiple IB networks does:
> GUID+QPN -> LID + QPN.
>
> So what is wrong with simply doing:
> Hostname -> GUID
> ???

1. In standard protocols such as SDP, iSER, NFS/RDMA, Oracle, ... (unlike OSU MPICH), the name service is one of the standard IP name services mapping host names to IP addresses, and the ULP accepts a destination IP and NOT a host name.

2. The InfiniBand hardware address is a GID, not a LID. A LID is a path attribute, implemented to avoid the slow 48-bit lookup done in Ethernet and to enable multi-pathing. A LID is dynamically allocated, and a port may have multiple LIDs. (The OSU MPICH implementation is a bad example of IB citizenship.)

So to summarize:

Ethernet:   Host Name -> IP -> MAC Address
InfiniBand: Host Name -> IP -> GID Address -> Path (LID, SL, ..)

So if we intend to rely on standard name services, we can start with IP (or implement a proprietary name service for Name->HW Addr if we wish).

Then we need to translate an IP to a HW address (GID/GUID) and the equivalent of VLANs (partitions). This is provided by the ib_at_route_by_ip call. Internally it is based on IP and IPoIB mechanisms, similar to how Libor implemented it in SDP (and optionally, if we see a need, using ATS).

Then in IB we need to resolve a GID to path attributes, which consist of LID, SL/VL, MTU, etc. The inputs are the source, destination, partition, and QoS attributes, and the result is a path. Since IB also supports multi-pathing, a user may receive multiple paths that can be used for high availability, performance aggregation, or source-based routing. A path may also travel through isolated congestion domains using VLs.

The ib_at_paths_by_route call allows resolving a HW address + preferences to one or more path records that are then used by the ULP & CM.
It can also be used by non-IP-based ULPs such as SRP or MPICH, which is why the API, unlike the current SDP implementation, is divided into two calls: one for the HW address and one for the path.

Currently OSU MPICH uses proprietary name and LID+QP assignment. It doesn't work the standard IB way with SA & CM, so it fails to use many IB capabilities and is also more static and less robust. I wouldn't use it as the example for ULP implementation. The MPI layer, which has no idea about fabric routing/utilization/availability, is determining the path.

Another simple scenario: your application needs to run MPI and NFS on different IB VLs. Today you need to manually configure (recompile) that in each ULP; with this proposal it can be done automatically with a central configuration on the SM.

SDP, on the other hand, uses the same mechanisms; however, we cannot use it for other ULPs (e.g. kDAPL), and it is also missing functionality that many of our users need. The proposal calls for one set of calls for current and future ULPs.

> It would be brain damaged for DAPL to require IP addresses. Not that
> DAPL hasn't shown some brain damage already.

DAPL uses IP addresses since it is a common API for IB & Ethernet/RDMA. I'm not sure what is wrong with IP; millions use it and are familiar with it, which is something I can't say about GIDs & LIDs.

> You can't do GUID -> IP because there is not a requirement on
> a 1 to 1 mapping. And in general there is no fixed IP -> GUID mapping.

If you dig into the call, it returns an array of IPs; you can also specify the VLAN (P_Key).

>
> What are the semantics in the upper levels when the IP -> GUID mapping
> changes? Does you connection properly follow the IP to the new GUID?
>

That's a ULP implementation question; I believe in general it shouldn't.

> Just FYI IPv6 doesn't use arp.

The implementation will depend on the IP stack to provide the IP->GID mapping, so it supports both IPv4 & IPv6.
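
A minimal sketch of the two-stage lookup described in this message (IP -> GID and partition, then GID -> one or more paths). The thread does not show the signatures of ib_at_route_by_ip or ib_at_paths_by_route, so every toy_* name, structure, and table below is hypothetical; this is a userspace model of the data flow only, not the proposed kernel API.

```c
/* Toy userspace model of the two-stage lookup: IP -> (GID, P_Key) -> paths.
 * All types, tables and function names are hypothetical illustrations;
 * they are NOT the proposed ib_at_* API. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_route {            /* result of stage 1 */
    uint8_t  dgid[16];        /* destination GID (the IB "hardware address") */
    uint16_t pkey;            /* partition, the IB analogue of a VLAN */
};

struct toy_path {             /* result of stage 2 */
    uint16_t dlid;            /* destination LID: a path attribute, not an address */
    uint8_t  sl;              /* service level */
    uint8_t  mtu;             /* encoded path MTU */
};

/* Stage-1 table: IPv4 address (host byte order) -> GID and partition. */
static const struct { uint32_t ip; struct toy_route route; } ip_table[] = {
    { 0x0A000001u, { { 0xfe, 0x80, 0,0,0,0,0,0, 0,0,0,0,0,0,0, 1 }, 0xffff } },
    { 0x0A000002u, { { 0xfe, 0x80, 0,0,0,0,0,0, 0,0,0,0,0,0,0, 2 }, 0x8001 } },
};

/* Stage-2 table: several paths may exist for one GID (multi-pathing). */
static const struct { uint8_t last_gid_byte; struct toy_path path; } path_table[] = {
    { 1, { 0x0010, 0, 4 } },
    { 1, { 0x0011, 1, 4 } },
    { 2, { 0x0020, 0, 4 } },
};

/* Stage 1: IP -> route. Returns 0 on success, -1 if the IP is unknown. */
static int toy_route_by_ip(uint32_t ip, struct toy_route *out)
{
    for (size_t i = 0; i < sizeof(ip_table) / sizeof(ip_table[0]); i++) {
        if (ip_table[i].ip == ip) {
            *out = ip_table[i].route;
            return 0;
        }
    }
    return -1;
}

/* Stage 2: route -> up to max paths. Returns the number of paths written. */
static int toy_paths_by_route(const struct toy_route *route,
                              struct toy_path *paths, int max)
{
    int n = 0;

    for (size_t i = 0; i < sizeof(path_table) / sizeof(path_table[0]); i++) {
        if (path_table[i].last_gid_byte == route->dgid[15] && n < max)
            paths[n++] = path_table[i].path;
    }
    return n;
}
```

Keeping the stages separate means a consumer that already holds a GID (for example a non-IP ULP) could call only the second stage, which is the reason given above for splitting the API into two calls.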
Yaron

From shaharf at voltaire.com Tue Mar 1 06:33:46 2005
From: shaharf at voltaire.com (shaharf)
Date: Tue, 1 Mar 2005 16:33:46 +0200
Subject: [openib-general] IB Address Translation service
Message-ID:

>
> Roland> First, let's understand the problem we're trying to solve.
> Roland> Who are the consumers of this address translation service?
>
> shaharf> Any ULPs at user & kernel, and also some
> shaharf> applications.
>
> I think this is too general an answer. We should be designing based
> on specific ULPs and applications. For example, I don't see anything
> particularly useful to IPoIB in this API. Perhaps Libor can comment
> on how this API works for SDP.
>

You are right about IPoIB. I think that IPoIB should not use this API (or at least the functions that may use ARP) because this creates a circular dependency in the architecture. Of course this can be solved, but I think that it is really unnecessary. IPoIB also has relatively modest resolution requirements, and I don't see why we should complicate things.

SDP, kDAPL and maybe others are a different story. As Libor already mentioned in a different mail, SDP already does a very similar lookup. In fact one of my internal goals was to be able to fulfill the SDP requirements. The internal resolution should be very similar to the current SDP implementation, except that the ATS option is to be supported.

The ATS issue is orthogonal to this API. As long as there are ULPs (such as kDAPL) that require it (even just for reverse mapping), we should provide it. My personal opinion is that the IB-ARP + ATS combination is twisted. As Libor wrote, it brings up many issues regarding distributed mechanisms vs. central mechanisms and databases. I guess that up to here there is a consensus. But my (personal, not Voltaire's) take is that the redundant mechanism is ARP and not ATS. My reasoning is simple: IB management is centralized. I don't like it but that's the way it is.
Adding contradicting mechanisms does not solve the problem; it just makes everything more complex. As I understand it, the ARP reasoning is that because the resolving process has two stages (IP->GID, GID->LID), it is reasonable to use a separate and well-known mechanism for IP to IB resolution. Another argument is that it is distributed and therefore doesn't require the SM (at least when ignoring the multicast setup). I think that since IP is tunneled over IB, it is not reasonable to use ARP, and its distributed nature is a problem, not a feature: the SM is still required for the path record and the multicast management. The correct solution for the centralized IB management is to distribute the SM, not the underlying mechanisms. I think that it is not too hard to distribute the SM, or at least the SA part of it. The SM/SA can also cache the requests much better than the clients. Furthermore, a unified ATS + path query can be defined to resolve everything in one stage. This will simplify many aspects of the resolution.

But again, this is not really the main issue.

> What application would use functions like ib_at_ips_by_gid() or
> ib_at_ips_by_subnet()?
>
> shaharf> My take right now is to implement a kernel based
> shaharf> mechanism and a user mode library to interface it. There
> shaharf> are other feasible solutions. I would really like you
> shaharf> have your suggestions and preferences.
>
> Unless there is a real kernel consumer that needs something this
> elaborate, I would prefer to implement this sort of caching service as
> a userspace daemon/library. This allows for more sophisticated
> implementations (eg persistent caches) and also makes debugging and
> maintenance easier.
>

ib_at_ips_by_gid() is intended for reverse resolution, i.e.
if you have a GID and you want to resolve it back to an IP/device, and ib_at_ips_by_subnet() lets you resolve all IB devices (and GIDs) on a subnet, for example for application-level load balancing/failover. ib_at_ips_by_gid() is required by kDAPL.

I totally agree that overengineering is bad. This means that some of the functions (such as ib_at_ips_by_subnet) may be implemented only in usermode at the first stage.

> shaharf> I think that starting with the APIs is a valid approach
> shaharf> that has its own advantages and disadvantages.
>
> Sure, it's always good to have code in hand to start a discussion.
> But in this case the API seems to be far ahead of its consumers, so it
> ends up feeling overengineered to me.
>

You are completely right. The proposed API is designed to cover the (near) future requirements of iSER, NFS-RDMA, kDAPL, SDP, and others. It attempts to cover the following issues:

	Resolution
	Back resolution
	Multi-pathing
	Fail-over
	TOS/QOS
	Partitioning

These are not visionary requirements. They are present or very near-future requirements. The API attempts to show the "correct" solutions for some common problems. Without it, we may end up with several different and mismatched solutions to the same problems. We don't want the ULPs to reinvent the wheel every time.

The only "over-engineering" IMO is the caching support. I think that caching is very likely to happen, so it is best for the API to let the clients know: "beware, these functions may return cached results". Some applications may care. Note that the caching impact is only a few flags and an invalidate function. This is not a very big overhead.

As for the usermode/kernel mode issue, I would be happy to implement everything in usermode. This leaves just the small issue of an efficient kernel-to-user request interface...
Personally, I think that it is a legitimate architecture (a user mode daemon serving the kernel), especially when you keep the caches within the kernel so the fast paths do not require usermode intervention, and let the usermode daemons maintain the caches and do the slow path tasks, where the extra context switch overhead will be insignificant relative to the entire slow path latency. I am not sure that my approach is very popular...

> - R.

Shahar

From halr at voltaire.com Tue Mar 1 06:50:25 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Mar 2005 09:50:25 -0500
Subject: [openib-general] Question
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EEEF1@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EEEF1@mtlex01.yok.mtl.com>
Message-ID: <1109688422.23094.1265.camel@localhost.localdomain>

Hi Eitan,

On Tue, 2005-03-01 at 01:26, Eitan Zahavi wrote:
> Hal wrote:
> > So this looks like a workaround for a bug. Not sure what any of the other symptoms
> > are but I'm real curious now. Can someone comment more on this ?
>
> The ERR 3610 is really just a warning. It is caused by the Anafa1 chip
> responding with a LinearFDBTop 0xC000.

Are you sure that is the only problem when Anafa1 gets into this state? Does it continue to forward LR packets? What happens with all the other LIDs now theoretically in play?

I also presume there's no fix for this with Anafa 1. Is that correct?

> OpenSM does know how to handle that case and fix it.

Right, the workaround resets LinearFDBTop.

> > At a minimum, the SMA is reporting an invalid value for PortInfo::LinearFDBTop. I
> > wonder if it also is incapable of forwarding DR MADs as well. That would explain
> > this.
> There are no issues with that switch ability to do DR mads.
-- Hal From jlentini at netapp.com Tue Mar 1 07:16:03 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 1 Mar 2005 10:16:03 -0500 (EST) Subject: [openib-general] MTHCA features In-Reply-To: <52bra4s8w1.fsf@topspin.com> References: <52bra4s8w1.fsf@topspin.com> Message-ID: roland> James> It is my understanding that the current MTHCA driver does roland> James> not support InfiniBand memory windows or memory roland> James> registration using virtual addresses. roland> roland> James> Is this information correct? If so, when will these roland> James> features be supported? roland> roland> Well, memory registration is pretty complete. By design, we only roland> support memory registration with physical addresses for kernel roland> consumers even at the verbs API level (ie there are no mthca-specific roland> limitations). In the kernel, registration by virtual address is not roland> very useful. For userspace verbs, only registration by virtual roland> address is supported for obvious reasons. roland> roland> Memory windows are not implemented for mthca. It wouldn't be a lot of roland> work for someone with access to Mellanox documentation to implement roland> them, but they're not particularly useful due to their performance roland> characteristics. Is anyone on this list working on memory window support? I ask because the DAT API contains interfaces that allow users to interact with memory windows. From eitan at mellanox.co.il Tue Mar 1 07:45:01 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 1 Mar 2005 17:45:01 +0200 Subject: [openib-general] Question Message-ID: <506C3D7B14CDD411A52C00025558DED6047EEEFD@mtlex01.yok.mtl.com> The bug is only in the meaning of the report. No other issue was found with it. The Anafa1 will report this wrong value only after reboot. EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, March 01, 2005 4:50 PM > To: Eitan Zahavi > Cc: Ronald G. Minnich; openib-general at openib.org > Subject: RE: [openib-general] Question > > Hi Eitan, > > On Tue, 2005-03-01 at 01:26, Eitan Zahavi wrote: > > Hal wrote: > > > So this looks like a workaround for a bug. Not sure what any of the > > other symptoms > > > are but I'm real curious now. Can someone comment more on this ? > > > > The ERR 3610 is really just a warning. It is caused by the Anafa1 chip > > responding with a LinearFDBTop 0xC000. > > Are you sure that the only problem when Anafa1 gets into this state ? > Does it continue to forward LR packets ? What happens with all the other > LIDs now theoretically in play ? > > I also presume there's no fix for this with Anafa 1. Is that correct ? > > > OpenSM does know how to handle that case and fix it. > > Right, the workaround resets LinearFDBTop. > > > > At a minimum, the SMA is reporting an invalid value for > > PortInfo::LinearFDBTop. I > > > wonder if it also is incapable of forwarding DR MADs as well. That > > would explain > > > this. > > There are no issues with that switch ability to do DR mads. > > -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland at topspin.com Tue Mar 1 08:03:19 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 08:03:19 -0800 Subject: [openib-general] MTHCA features In-Reply-To: (James Lentini's message of "Tue, 1 Mar 2005 10:16:03 -0500 (EST)") References: <52bra4s8w1.fsf@topspin.com> Message-ID: <521xazqrp4.fsf@topspin.com> James> I ask because the DAT API contains interfaces that allow James> users to interact with memory windows. Are there any real applications that use those interfaces? 
Thanks, Roland From roland at topspin.com Tue Mar 1 08:03:47 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 08:03:47 -0800 Subject: [openib-general] MTHCA features In-Reply-To: (Or Gerlitz's message of "Tue, 1 Mar 2005 08:18:29 +0200") References: Message-ID: <52wtsrpd3w.fsf@topspin.com> Or> By "performance characteristics" do you mean the extra Or> overhead to generate another rkey for the already registered Or> address range (and also to create/free the mw)? No, I mean the performance cost of binding/unbinding the MW. - R. From roland at topspin.com Tue Mar 1 08:42:47 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 08:42:47 -0800 Subject: [openib-general] [PATCH] Add PCI device ID for new Mellanox HCA In-Reply-To: <52fyzfrk29.fsf@topspin.com> (Roland Dreier's message of "Mon, 28 Feb 2005 21:50:38 -0800") References: <52fyzfrk29.fsf@topspin.com> Message-ID: <52oee3pbaw.fsf@topspin.com> Hi Greg, It turns out that Mellanox decided to change the device ID at the last minute. So of course there will be parts with both IDs. Here's an updated patch that includes both IDs. Please use this instead. Thanks, Roland Add PCI device IDs for new Mellanox "Sinai" InfiniHost III Lx HCA. 
Signed-off-by: Roland Dreier

--- linux-svn.orig/include/linux/pci_ids.h 2005-02-28 21:10:53.000000000 -0800
+++ linux-svn/include/linux/pci_ids.h 2005-03-01 08:39:49.766178558 -0800
@@ -1992,6 +1992,8 @@
 #define PCI_DEVICE_ID_MELLANOX_TAVOR 0x5a44
 #define PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT 0x6278
 #define PCI_DEVICE_ID_MELLANOX_ARBEL 0x6282
+#define PCI_DEVICE_ID_MELLANOX_SINAI_OLD 0x5e8c
+#define PCI_DEVICE_ID_MELLANOX_SINAI 0x6274
 
 #define PCI_VENDOR_ID_PDC 0x15e9
 #define PCI_DEVICE_ID_PDC_1841 0x1841

From mshefty at ichips.intel.com Tue Mar 1 09:11:55 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 01 Mar 2005 09:11:55 -0800
Subject: [openib-general] Re: [CM] destroy_cm_id
In-Reply-To: <20050228173307.K21494@topspin.com>
References: <20050228105036.122e3b1c.mshefty@ichips.intel.com> <20050228173307.K21494@topspin.com>
Message-ID: <4224A25B.4010500@ichips.intel.com>

Libor Michalek wrote:
> Is it ever allowed to call ib_destroy_cm_id() from a CM callback?
> For some reason I thought that this was OK from only the IDLE callback,
> but if I destroy from IDLE I get a hang on cm_id_priv->lock, I believe.
> Should the normal mode of operation in the case be to return an error
> from IDLE to ensure that cm_id gets cleaned-up?

You cannot call ib_destroy_cm_id from a callback. A reference is held on the cm_id while the callback is in progress, so the call to ib_destroy_cm_id will always block forever. The solution is to return a non-zero value from the callback itself, which will destroy the cm_id.

Note that you can destroy the cm_id at any time. You don't need to wait for it to transition to IDLE. (The CM maintains the timewait state separate from the cm_id itself.)
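
The rule above can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions: the toy_* names and the reference counting scheme are hypothetical and are not the ib_cm implementation. Where the real ib_destroy_cm_id would block, the model returns -1 so the behaviour can be observed.

```c
/* Userspace model of "do not destroy from the callback; return non-zero".
 * Every name here is hypothetical; this is not the ib_cm code. */
#include <assert.h>
#include <stdbool.h>

struct toy_cm_id;
typedef int (*toy_cm_handler)(struct toy_cm_id *id, int event);

struct toy_cm_id {
    int refcount;             /* 1 = owner only; +1 while a callback runs */
    bool destroyed;           /* set once the id has been torn down */
    toy_cm_handler handler;   /* consumer callback */
};

/* Destroy completes only when no callback holds a reference. The real call
 * would wait here; the model reports "would block" with -1 instead. */
static int toy_destroy_cm_id(struct toy_cm_id *id)
{
    if (id->refcount > 1)
        return -1;
    id->destroyed = true;
    return 0;
}

/* Deliver an event: hold a reference for the duration of the callback and
 * destroy the id on the consumer's behalf if the callback returns non-zero. */
static void toy_cm_dispatch(struct toy_cm_id *id, int event)
{
    int ret;

    id->refcount++;                 /* callback in progress */
    ret = id->handler(id, event);
    id->refcount--;                 /* callback finished */

    if (ret)
        toy_destroy_cm_id(id);      /* safe: the callback reference is gone */
}

static int last_destroy_result;

/* Wrong: destroys from inside the callback, while the reference is held. */
static int bad_handler(struct toy_cm_id *id, int event)
{
    (void)event;
    last_destroy_result = toy_destroy_cm_id(id);
    return 0;
}

/* Right: a non-zero return asks the dispatcher to destroy the id. */
static int good_handler(struct toy_cm_id *id, int event)
{
    (void)id;
    (void)event;
    return 1;
}
```

In the model, bad_handler can never succeed because the dispatcher's reference is still held when it runs, which mirrors the hang Libor describes; good_handler leaves the teardown to the dispatcher after the reference is dropped.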
- Sean From krause at cup.hp.com Tue Mar 1 09:11:52 2005 From: krause at cup.hp.com (Michael Krause) Date: Tue, 01 Mar 2005 09:11:52 -0800 Subject: [openib-general] IB Address Translation service In-Reply-To: References: <35EA21F54A45CB47B879F21A91F4862F3FAC5D@taurus.voltaire.com> Message-ID: <6.2.0.14.2.20050301090420.02c34638@esmail.cup.hp.com> At 11:17 PM 2/28/2005, Eric W. Biederman wrote: >"Yaron Haviv" writes: > > > > -----Original Message----- > > > From: openib-general-bounces at openib.org [mailto:openib-general- > > > bounces at openib.org] On Behalf Of Roland Dreier > > > Sent: Monday, February 28, 2005 7:13 PM > > > To: shaharf > > > Cc: openib-general at openib.org > > > Subject: Re: [openib-general] IB Address Translation service > > > > > > This API seems overly complex and at the same time too inflexible to > > > me. However, rather than getting bogged down nitpicking about APIs, I > > > think we have to take a few steps back. > > > > I believe the API is very flexible, but we are pretty open to here what > > you think is needed in addition > > > > > First, let's understand the problem we're trying to solve. Who are > > > the consumers of this address translation service? > > > > The first problem is that most ULPs use valid IP addresses for > > simplicity (DAPL, iSER, NFS/RDMA, SDP, MPI, etc') and someone needs to > > resolve it to an IB address and device to use IB. This should take into > > account cases where there are more than one HCAs in the system. > > Preferable/optionally the ULP would like to know which partition to use > > if there is more than one, and leverage on the IP subnetting done by > > IPoIB. > >I am confused. In any sane network the translation is: >Hostname -> address. > >IP because it spans multiple networks does: >Hostname -> IP address -> hw address. > >IB because it can span multiple IB networks does: >GUID+QPN -> LID + QPN. > >So what is wrong with simply doing: >Hostname -> GUID >??? 
> >Then all the kernel needs to be passed GUID + QPN. > >I am certain MPI does not care about IP addresses. It is the job >of the mpi launcher to resolve where all of the pieces are. Generally >mpirun is done over IP and it just needs to collect the native network >addresses before it leaves. That still does not eliminate the need to resolve some form of address. >It would be brain damaged for DAPL to require IP addresses. Not that >DAPL hasn't shown some brain damage already. I don't believe the IT API requires ATS. It is a bit more flexible and matches better with applications I think. >Please, please remember that IP addresses > > > It is possible to replicate the same code you have in SDP (which is also > > not complete) across all ULP's, I assume a better way is to provide it > > in one central place. > >How about not even worrying about it. It is an extra step that >introduces latency and confusion. > >You can't do GUID -> IP because there is not a requirement on >a 1 to 1 mapping. And in general there is no fixed IP -> GUID mapping. > >What are the semantics in the upper levels when the IP -> GUID mapping >changes? Does you connection properly follow the IP to the new GUID? It should follow a new mapping if done right. >I don't see this making sense anywhere except user space. > > > There are also two proposed address resolution mechanisms, one is ARP > > used by SDP, and one is ATS used by some DAPL consumers, and we believe > > it is better to combine them under the same API. > >Just FYI IPv6 doesn't use arp. ND or ARP for this point is less an issue. > > The second problem relates to mapping of IB GID to one or more Path > > records > > This is also something needed for ALL ULP's. today each ULP provides the > > minimal subset of path resolution functionality without taking into > > account topics such as partitioning, QoS, source routing and > > multi-pathing. 
> > Some of these require using special SA queries (such as SA Multipath > > Record query and QoSPath Query). > > I don't think it make sense to put all this functionality into each ULP > > as well. > >That part is reasonable. Although the fact it is easy to knock >OpenSM down concerns me. However that looks to be a separate >problem. > > > Than we can also discuss, does it make sense to have each path > > resolution call lead us to the sa, or does it make more sense to cache > > those paths. > > And if we cache, doesn't it make more sense to cache/invalidate the > > routes to all ULP's rather implementing/having it in each ULP. > > Also not sure how a 1000 node cluster functions without the caching. > > > > And the last problem is related to reverse resolution from IB to IP > > addresses that is needed for DAPL, as well as for different management > > and diagnostic tools that want to know what is really that node/port > > behind that GID addresses. > > > > So how would you suggest to go about it ? > > Duplicate all of that in each ULP ? > > Refrain from implementing advanced routing, partitioning, QoS (we cant > > really maintain all that advanced code for each ULP) ? > >One small step at a time. Where each step is obviously correct. > >One giant leap only works well for internal use. Not for things >that are heavily used. > > > Our idea is to provide those few helper functions that enable people to > > make full use of IB and its features without reading all the IB spec, > > and a Phd. > > If you clear all the remarks from the library, you will see it is very > > slim, and for my understanding includes all the relevant input and > > output parameters for each of the 3 functions I mentioned. > >But an interface like that is usually provided by glibc not by the kernel. >At the mixing of levels in that proposed API is absolutely horrible. 
>
>
>Eric
>_______________________________________________
>openib-general mailing list
>openib-general at openib.org
>http://openib.org/mailman/listinfo/openib-general
>
>To unsubscribe, please visit
>http://openib.org/mailman/listinfo/openib-general

From tduffy at sun.com Tue Mar 1 09:21:56 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 01 Mar 2005 09:21:56 -0800
Subject: [openib-general] [PATCH][IPOIB] data_debug_level should be declared static
Message-ID: <1109697716.11800.2.camel@duffman>

Signed-off-by: Tom Duffy

Index: drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- drivers/infiniband/ulp/ipoib/ipoib_ib.c (revision 1927)
+++ drivers/infiniband/ulp/ipoib/ipoib_ib.c (working copy)
@@ -40,7 +40,7 @@
 #include "ipoib.h"
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
-int data_debug_level;
+static int data_debug_level;
 module_param(data_debug_level, int, 0644);
 MODULE_PARM_DESC(data_debug_level,

From tduffy at sun.com Tue Mar 1 09:46:26 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 01 Mar 2005 09:46:26 -0800
Subject: [openib-general] Re: [openib-commits] r1932 - gen2/trunk/src/linux-kernel/infiniband/ulp/sdp
In-Reply-To: <20050301174315.F23FD22834D@openib.ca.sandia.gov>
References: <20050301174315.F23FD22834D@openib.ca.sandia.gov>
Message-ID: <1109699186.11800.13.camel@duffman>

On Tue, 2005-03-01 at 09:43 -0800, libor at openib.org wrote:
> Modified: gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_buff_p.h
> ===================================================================
> --- gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_buff_p.h 2005-02-28 23:43:10 UTC (rev 1931)
> +++ gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_buff_p.h 2005-03-01 17:43:14 UTC (rev 1932)
> @@ -74,3 +74,15 @@
> };
>
> #endif /* _SDP_BUFF_P_H */
> +
> +
> +
> +
> +
> +
> +
> +
> +
> +
> +
> +

Checkin turd.
-tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Tue Mar 1 09:56:09 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 01 Mar 2005 09:56:09 -0800 Subject: [openib-general] [PATCH][SDP] lnx_stream_ops should be declared static Message-ID: <1109699770.11800.17.camel@duffman> lnx_stream_ops should be static. Also, fix one more static name in sdp_proc.c Signed-off-by: Tom Duffy Index: drivers/infiniband/ulp/sdp/sdp_inet.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_inet.c (revision 1932) +++ drivers/infiniband/ulp/sdp/sdp_inet.c (working copy) @@ -1373,7 +1373,7 @@ static int sdp_inet_shutdown(struct sock /* * Primary socket initialization */ -struct proto_ops _lnx_stream_ops = { +static struct proto_ops lnx_stream_ops = { .family = AF_INET_SDP, .release = sdp_inet_release, .bind = sdp_inet_bind, @@ -1419,7 +1419,7 @@ static int sdp_inet_create(struct socket return -ENOMEM; } - sock->ops = &_lnx_stream_ops; + sock->ops = &lnx_stream_ops; sock->state = SS_UNCONNECTED; sock_graft(conn->sk, sock); Index: drivers/infiniband/ulp/sdp/sdp_proc.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_proc.c (revision 1932) +++ drivers/infiniband/ulp/sdp/sdp_proc.c (working copy) @@ -81,7 +81,7 @@ static int sdp_proc_read_parse(char *pag * (anything that is not a module) should create an entry and define read * write function. 
*/ -static struct sdpc_proc_ent _file_entry_list[SDP_PROC_ENTRIES] = { +static struct sdpc_proc_ent file_entry_list[SDP_PROC_ENTRIES] = { { .entry = NULL, .type = SDP_PROC_ENTRY_MAIN_BUFF, @@ -136,7 +136,7 @@ int sdp_main_proc_cleanup(void) * first clean-up the frameworks tables */ for (counter = 0; counter < SDP_PROC_ENTRIES; counter++) { - sub_entry = &_file_entry_list[counter]; + sub_entry = &file_entry_list[counter]; if (sub_entry->entry) { remove_proc_entry(sub_entry->name, dir_root); sub_entry->entry = NULL; @@ -189,7 +189,7 @@ int sdp_main_proc_init(void) dir_root->owner = THIS_MODULE; for (counter = 0; counter < SDP_PROC_ENTRIES; counter++) { - sub_entry = &_file_entry_list[counter]; + sub_entry = &file_entry_list[counter]; if (sub_entry->type != counter) { result = -EFAULT; goto error; From libor at topspin.com Tue Mar 1 09:58:21 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 1 Mar 2005 09:58:21 -0800 Subject: [openib-general] Re: [openib-commits] r1932 - gen2/trunk/src/linux-kernel/infiniband/ulp/sdp In-Reply-To: <1109699186.11800.13.camel@duffman>; from tduffy@sun.com on Tue, Mar 01, 2005 at 09:46:26AM -0800 References: <20050301174315.F23FD22834D@openib.ca.sandia.gov> <1109699186.11800.13.camel@duffman> Message-ID: <20050301095821.A27810@topspin.com> On Tue, Mar 01, 2005 at 09:46:26AM -0800, Tom Duffy wrote: > On Tue, 2005-03-01 at 09:43 -0800, libor at openib.org wrote: > > Modified: gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_buff_p.h > > =================================================================== > > --- gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_buff_p.h 2005-02-28 23:43:10 UTC (rev 1931) > > +++ gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_buff_p.h 2005-03-01 17:43:14 UTC (rev 1932) > > @@ -74,3 +74,15 @@ > > }; > > Checkin turd. That's odd. It doesn't make much sense to have a seperate header file for a single structure which is used in the same place as sdp_buff.h and four constants. 
Here's a patch to get rid of the file. 4 files changed, 28 insertions(+), 91 deletions(-) -Libor Signed-off-by: Libor Michalek Index: sdp_main.h =================================================================== --- sdp_main.h (revision 1932) +++ sdp_main.h (working copy) @@ -115,6 +115,4 @@ #include "sdp_advt.h" #include "sdp_iocb.h" -#include "sdp_buff_p.h" - #endif /* _SDP_MAIN_H */ Index: sdp_dev.h =================================================================== --- sdp_dev.h (revision 1932) +++ sdp_dev.h (working copy) @@ -111,8 +111,14 @@ #define SDP_SEND_POST_FRACTION 0x06 #define SDP_SEND_POST_SLOW 0x01 #define SDP_SEND_POST_COUNT 0x0A - /* + * Buffer pool initialization defaul values. + */ +#define SDP_BUFF_POOL_COUNT_MIN 1024 +#define SDP_BUFF_POOL_COUNT_MAX 1048576 +#define SDP_BUFF_POOL_COUNT_INC 128 +#define SDP_BUFF_POOL_FREE_MARK 1024 +/* * SDP experimental parameters. */ Index: sdp_buff.h =================================================================== --- sdp_buff.h (revision 1932) +++ sdp_buff.h (working copy) @@ -76,6 +76,27 @@ u32 lkey; /* component of scather/gather list (key) */ }; +struct sdpc_buff_root { + /* + * variant + */ + struct sdpc_buff_q pool; /* actual pool of buffers */ + spinlock_t lock; /* spin lock for pool access */ + /* + * invariant + */ + kmem_cache_t *pool_cache; /* cache of pool objects */ + kmem_cache_t *buff_cache; /* cache of buffer descriptor objects */ + + int buff_min; /* minimum allocated buffers */ + int buff_max; /* maximum allocated buffers */ + int buff_cur; /* total allocated buffers */ + int buff_size; /* size of each buffer in the pool */ + + int alloc_inc; /* allocation increment */ + int free_mark; /* start freeing unused buffers */ +}; + /* * buffer flag defintions */ Index: sdp_buff_p.h =================================================================== --- sdp_buff_p.h (revision 1932) +++ sdp_buff_p.h (working copy) @@ -1,88 +0,0 @@ -/* - * Copyright (c) 2005 Topspin Communications. 
All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - * $Id$ - */ - -#ifndef _SDP_BUFF_P_H -#define _SDP_BUFF_P_H -/* - * linux types - */ -#include -#include -#include - -#include "sdp_buff.h" -/* - * definitions - */ -#define SDP_BUFF_POOL_COUNT_MIN 1024 -#define SDP_BUFF_POOL_COUNT_MAX 1048576 -#define SDP_BUFF_POOL_COUNT_INC 128 -#define SDP_BUFF_POOL_FREE_MARK 1024 -/* - * structures - */ -struct sdpc_buff_root { - /* - * variant - */ - struct sdpc_buff_q pool; /* actual pool of buffers */ - spinlock_t lock; /* spin lock for pool access */ - /* - * invariant - */ - kmem_cache_t *pool_cache; /* cache of pool objects */ - kmem_cache_t *buff_cache; /* cache of buffer descriptor objects */ - - int buff_min; /* minimum allocated buffers */ - int buff_max; /* maximum allocated buffers */ - int buff_cur; /* total allocated buffers */ - int buff_size; /* size of each buffer in the pool */ - - int alloc_inc; /* allocation increment */ - int free_mark; /* start freeing unused buffers */ -}; - -#endif /* _SDP_BUFF_P_H */ - - - - - - - - - - - - From tduffy at sun.com Tue Mar 1 10:10:18 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 01 Mar 2005 10:10:18 -0800 Subject: [openib-general] Re: [openib-commits] r1932 - gen2/trunk/src/linux-kernel/infiniband/ulp/sdp In-Reply-To: <20050301095821.A27810@topspin.com> References: <20050301174315.F23FD22834D@openib.ca.sandia.gov> <1109699186.11800.13.camel@duffman> <20050301095821.A27810@topspin.com> Message-ID: <1109700619.11800.18.camel@duffman> On Tue, 2005-03-01 at 09:58 -0800, Libor Michalek wrote: > That's odd. It doesn't make much sense to have a seperate header file > for a single structure which is used in the same place as sdp_buff.h and > four constants. Here's a patch to get rid of the file. That's one way to get rid of the turd ;) Looks good. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Tue Mar 1 10:12:35 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 01 Mar 2005 10:12:35 -0800 Subject: [openib-general] [PATCH][CORE] fix sparse warnings about static variables Message-ID: <1109700755.11800.21.camel@duffman> This gets rid of the new sparse warnings like: /build1/tduffy/openib-work/linux-2.6.10-openib/drivers/infiniband/core/mad.c:50:14: warning: symbol 'ib_mad_cache' was not declared. Should it be static? Signed-off-by: Tom Duffy Index: drivers/infiniband/core/agent.c =================================================================== --- drivers/infiniband/core/agent.c (revision 1922) +++ drivers/infiniband/core/agent.c (working copy) @@ -45,14 +45,11 @@ #include "smi.h" #include "agent_priv.h" #include "mad_priv.h" - +#include "agent.h" spinlock_t ib_agent_port_list_lock; static LIST_HEAD(ib_agent_port_list); -extern kmem_cache_t *ib_mad_cache; - - /* * Caller must hold ib_agent_port_list_lock */ Index: drivers/infiniband/core/cache.c =================================================================== --- drivers/infiniband/core/cache.c (revision 1922) +++ drivers/infiniband/core/cache.c (working copy) @@ -38,6 +38,7 @@ #include #include "core_priv.h" +#include "ib_cache.h" struct ib_pkey_cache { int table_len; Index: drivers/infiniband/core/mad_priv.h =================================================================== --- drivers/infiniband/core/mad_priv.h (revision 1922) +++ drivers/infiniband/core/mad_priv.h (working copy) @@ -194,4 +194,6 @@ struct ib_mad_port_private { struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; }; +extern kmem_cache_t *ib_mad_cache; + #endif /* __IB_MAD_PRIV_H__ */ Index: drivers/infiniband/core/smi.c =================================================================== --- drivers/infiniband/core/smi.c (revision 1922) +++ drivers/infiniband/core/smi.c (working copy) @@ -37,7 
+37,7 @@ */ #include - +#include "smi.h" /* * Fixup a directed route SMP for sending From mshefty at ichips.intel.com Tue Mar 1 10:27:37 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Mar 2005 10:27:37 -0800 Subject: [openib-general] [MAD] RMPP reassembly Message-ID: <4224B419.1080601@ichips.intel.com> I'm studying the RMPP implementation requirements for reassembly, and there are a couple of issues/questions. * What is an appropriate window size for the receiver to use? My initial thought was to use 1/8th of the receive queue size, but this would be easy to change. * For the total transaction timeout, the equation given to calculate the value would probably require 1000+ lines of code, and the default value given is 40 seconds, which seems long. Any opinions on what approach to take here? I can either go with a total reassembly timeout value, or a timeout relative to the last received segment. I'm leaning towards whichever ends up being easier to implement. * Have people found it necessary to keep the context of a reassembled MAD around after reassembly has completed? - Sean From mshefty at ichips.intel.com Tue Mar 1 10:34:16 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Mar 2005 10:34:16 -0800 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <4224B419.1080601@ichips.intel.com> References: <4224B419.1080601@ichips.intel.com> Message-ID: <4224B5A8.4000109@ichips.intel.com> Sean Hefty wrote: > I'm studying the RMPP implementation requirements for reassembly, and > there are a couple of issues/questions. Also, does anyone know of any existing RMPP implementations outside of the SourceForge IB stack? 
- Sean From halr at voltaire.com Tue Mar 1 10:41:50 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Mar 2005 13:41:50 -0500 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <4224B5A8.4000109@ichips.intel.com> References: <4224B419.1080601@ichips.intel.com> <4224B5A8.4000109@ichips.intel.com> Message-ID: <1109702314.23094.1763.camel@localhost.localdomain> Hi Sean, On Tue, 2005-03-01 at 13:34, Sean Hefty wrote: > Also, does anyone know of any existing RMPP implementations outside of > the SourceForge IB stack? Voltaire has one in its gen1 stack. -- Hal From mshefty at ichips.intel.com Tue Mar 1 10:52:05 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Mar 2005 10:52:05 -0800 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <1109702314.23094.1763.camel@localhost.localdomain> References: <4224B419.1080601@ichips.intel.com> <4224B5A8.4000109@ichips.intel.com> <1109702314.23094.1763.camel@localhost.localdomain> Message-ID: <4224B9D5.2010205@ichips.intel.com> Hal Rosenstock wrote: > Hi Sean, > > On Tue, 2005-03-01 at 13:34, Sean Hefty wrote: > >>Also, does anyone know of any existing RMPP implementations outside of >>the SourceForge IB stack? > > > Voltaire has one in its gen1 stack. (Resending to list) Can you send me a link to the directory? - Sean From roland at topspin.com Tue Mar 1 11:48:46 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 11:48:46 -0800 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <4224B419.1080601@ichips.intel.com> (Sean Hefty's message of "Tue, 01 Mar 2005 10:27:37 -0800") References: <4224B419.1080601@ichips.intel.com> Message-ID: <527jkrp2ox.fsf@topspin.com> Sean> * For the total transaction timeout, the equation given to Sean> calculate the value would probably require 1000+ lines of Sean> code, and the default value given is 40 seconds, which seems Sean> long. Any opinions on what approach to take here? 
I can Sean> either go with a total reassembly timeout value, or a Sean> timeout relative to the last received segment. I'm leaning Sean> towards whichever ends up being easier to implement. I'd be somewhat scared to tinker with the timeout calculations without doing some heavy-duty research into how the modified version interacts with a spec-compliant implementation. Experience with TCP shows that protocol behavior in the face of packet loss can be complex and unpredictable and that minor changes in the protocol can lead to large degradations in performance. - R. From mshefty at ichips.intel.com Tue Mar 1 11:57:27 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Mar 2005 11:57:27 -0800 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <527jkrp2ox.fsf@topspin.com> References: <4224B419.1080601@ichips.intel.com> <527jkrp2ox.fsf@topspin.com> Message-ID: <4224C927.80206@ichips.intel.com> Roland Dreier wrote: > Sean> * For the total transaction timeout, the equation given to > Sean> calculate the value would probably require 1000+ lines of > Sean> code, and the default value given is 40 seconds, which seems > Sean> long. Any opinions on what approach to take here? I can > Sean> either go with a total reassembly timeout value, or a > Sean> timeout relative to the last received segment. I'm leaning > Sean> towards whichever ends up being easier to implement. > > I'd be somewhat scared to tinker with the timeout calculations without > doing some heavy-duty research into how the modified version interacts > with a spec-compliant implementation. Experience with TCP shows that > protocol behavior in the face of packet loss can be complex and > unpredictable and that minor changes in the protocol can lead to large > degradations in performance. 
I would tend to agree, except that the IB spec gives this beauty of an equation for calculating total transaction timeout:

  4.096 us x 8 x ceiling(payload length/220) x
    (2 ^ packet lifetime from sender to receiver +
     2 ^ packet lifetime from receiver to sender +
     2 ^ receiver response time value (ClassPortInfo:RespTimeValue or 20) +
     2 ^ sender response time value (ClassPortInfo:RespTimeValue or 20))

Getting from receiving the first segment of an RMPP MAD to this value is non-trivial, and doing so before the sender times out is even more difficult. Is there a spec-compliant implementation of this in existence? If so, I'd be interested in seeing it.

- Sean

From roland at topspin.com Tue Mar 1 12:00:50 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 01 Mar 2005 12:00:50 -0800
Subject: [openib-general] [MAD] RMPP reassembly
In-Reply-To: <4224C927.80206@ichips.intel.com> (Sean Hefty's message of "Tue, 01 Mar 2005 11:57:27 -0800")
References: <4224B419.1080601@ichips.intel.com> <527jkrp2ox.fsf@topspin.com> <4224C927.80206@ichips.intel.com>
Message-ID: <523bvfp24t.fsf@topspin.com>

    Sean> Getting from receiving the first segment of an RMPP MAD to
    Sean> this value is non-trivial, and doing so before the sender
    Sean> times out is even more difficult. Is there a spec-compliant
    Sean> implementation of this in existence? If so, I'd be
    Sean> interested in seeing it.

Yeah, I know that equation. It doesn't seem that bad to calculate -- I guess the worst part is dividing by 220, but that shouldn't be more than a few hundred cycles.

 - R.
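
For reference, the arithmetic in the equation Sean quotes comes down to one integer division and four shifts. The following is only a minimal sketch of that arithmetic in C, under stated assumptions: the function and parameter names are hypothetical (they are not taken from the OpenIB tree or from the IB spec), the four exponents are assumed to be already known 5-bit values, and the formula is used exactly as quoted in this thread rather than checked against the specification text. It does not address the harder problem Sean raises, which is obtaining those four values in the first place.

```c
#include <stdint.h>

/* Bytes of payload per RMPP segment, as used in the quoted equation. */
#define RMPP_SEG_PAYLOAD 220u

/*
 * Hypothetical helper (not an OpenIB API): total transaction timeout in
 * nanoseconds, following the equation as quoted in the thread:
 *
 *   4.096 us * 8 * ceil(payload_len / 220) *
 *       (2^plt_s2r + 2^plt_r2s + 2^resp_recv + 2^resp_send)
 *
 * plt_s2r / plt_r2s : packet lifetime exponents, sender->receiver and back
 * resp_recv / resp_send : response time exponents (the thread gives 20 as
 *                         the fallback when ClassPortInfo is not available)
 *
 * Assumption: exponents are <= 31 and the payload is at most about a
 * megabyte, so the 64-bit product cannot overflow. No overflow check is done.
 */
static uint64_t rmpp_total_timeout_ns(uint32_t payload_len,
				      unsigned int plt_s2r,
				      unsigned int plt_r2s,
				      unsigned int resp_recv,
				      unsigned int resp_send)
{
	/* ceiling(payload_len / 220) using integer arithmetic only */
	uint64_t segs = ((uint64_t)payload_len + RMPP_SEG_PAYLOAD - 1) /
			RMPP_SEG_PAYLOAD;
	uint64_t sum = (1ULL << plt_s2r) + (1ULL << plt_r2s) +
		       (1ULL << resp_recv) + (1ULL << resp_send);

	/* 4.096 us is 4096 ns, so the whole result stays in integers */
	return 4096ULL * 8 * segs * sum;
}
```

With a single 220-byte segment and all four exponents set to zero this evaluates to 131072 ns; the result scales linearly with the segment count.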
From mshefty at ichips.intel.com Tue Mar 1 12:05:06 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Mar 2005 12:05:06 -0800 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <523bvfp24t.fsf@topspin.com> References: <4224B419.1080601@ichips.intel.com> <527jkrp2ox.fsf@topspin.com> <4224C927.80206@ichips.intel.com> <523bvfp24t.fsf@topspin.com> Message-ID: <4224CAF2.1070002@ichips.intel.com> Roland Dreier wrote: > Sean> Getting from receiving the first segment of an RMPP MAD to > Sean> this value is non-trivial, and doing so before the sender > Sean> times out is even more difficult. Is there spec compliant > Sean> implementation of this in existence? If so, I'd be > Sean> interested in seeing it. > > Yeah, I know that equation. It doesn't seem that bad to calculate -- > I guess the worst part is dividing by 220, but that shouldn't be more > than a few hundred cycles. I'm more concerned about getting the necessary data than performing the actual calculation. - Sean From jlentini at netapp.com Tue Mar 1 12:58:15 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 1 Mar 2005 15:58:15 -0500 (EST) Subject: [openib-general] MTHCA features In-Reply-To: <521xazqrp4.fsf@topspin.com> References: <52bra4s8w1.fsf@topspin.com> <521xazqrp4.fsf@topspin.com> Message-ID: NFS/RDMA doesn't require memory windows but it will make use of them if they are available. -james On Tue, 1 Mar 2005, Roland Dreier wrote: > James> I ask because the DAT API contains interfaces that allow > James> users to interact with memory windows. > > Are there any real applications that use those interfaces? > > Thanks, > Roland > From hch at lst.de Tue Mar 1 14:05:43 2005 From: hch at lst.de (Christoph Hellwig) Date: Tue, 1 Mar 2005 23:05:43 +0100 Subject: [openib-general] putting in dead wood for DAPL and similar abomination Message-ID: <20050301220543.GA16443@lst.de> Please don't put in things like the address translation service or memory windows for DAPL folks. 
The IB code in the kernel already has far too much unused stuff and adding more will not go past reviews for kernel inclusions - as will DAPL itself exactly because of such utter stupidities. Similar hint to the NFS over RDMA folks at CITI - if you want your stuff to go in use the openib helper directly below the transport switch - differnet RDMA transports are too diverse to be sanely abstracted out and DAPL does a horrible job at that. If we need to consolidate code for differnt transports we can put it into a library later on. From tduffy at sun.com Tue Mar 1 14:13:27 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 01 Mar 2005 14:13:27 -0800 Subject: [openib-general] IB Address Translation service In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> Message-ID: <1109715208.11800.41.camel@duffman> On Tue, 2005-03-01 at 08:07 +0200, Yaron Haviv wrote: > The one thing that ATS provide and is not possible with ARP is reverse > resolution GID->IP, any ideas how to achieve that without ATS ? RARP. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mst at mellanox.co.il Tue Mar 1 14:53:18 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Mar 2005 00:53:18 +0200 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <52u0o4pfe8.fsf@topspin.com> References: <52u0o4pfe8.fsf@topspin.com> Message-ID: <20050301225318.GA16946@mellanox.co.il> Quoting r. Roland Dreier : > Subject: ANNOUNCE: First usable version of userspace verbs > > I'm happy to announce the initial availability of userspace verbs > support for brave testers. 
>
> To try this out, check out the roland-uverbs subversion branch:
>
> svn co https://openib.org/svn/gen2/branches/roland-uverbs
>
> and build as usual. Select CONFIG_INFINIBAND_USER_VERBS to build
> userspace verbs support.
>
> If you want to use a linux-2.6.10 kernel, you will need to apply the
> new linux-2.6.10-backports.diff patch from the branch (which just
> exports get_sb_pseudo()). No patches at all are required for an
> up-to-date BK or linux-2.6.11-rc4 tree.
>
> If you use udev, add the rule
>
> KERNEL="uverbs*", NAME="infiniband/%k", MODE="0666"
>
> to your configuration. Otherwise, create the required device files:
>
> mknod /dev/infiniband/uverbs0 c 231 128
> mknod /dev/infiniband/uverbs1 c 231 129
>
> and so on for as many HCAs as you have installed.
>
> Then build the userspace libraries in src/userspace/libibverbs and
> src/userspace/libmthca with the usual
>
> ./autogen.sh && ./configure && make && sudo make install
>
> passing whatever parameters to configure you want; you can use
> --prefix to install to another location. If you set a non-standard
> prefix, it may be useful to pass a -I in CPPFLAGS to the
> configure for libmthca.
>
> Once you have the libraries built and installed, load the ib_mthca and
> ib_uverbs modules. By default, libibverbs will search for driver
> libraries in /lib/infiniband; if you installed libmthca
> somewhere else, set the OPENIB_DRIVER_PATH environment variable to
> point to the directory with mthca.so.
>
> To actually try things out, you can use the ibv_pingpong program
> shipped as part of the libibverbs package. For example, on one
> system start the server side
>
> $ ibv_pingpong
>
> and on another system start the client by passing the address of the
> server (in this example I use IPv6 over IPoIB):
>
> $ ibv_pingpong fe80::202:c901:7fc:c711%ib0
>
> The pingpong program has a number of options -- run ibv_pingpong -h to
> see a list of the switches you can try.
> > The current code is stable for me, but all that means is that my tiny > selection of tests and test systems has not uncovered any of the bugs > that are undoubtedly present. Some of the limitations I know about: > > - Only RC is implemented. There are not even any functions to call > to create UD address handles yet. > - Only Tavor mode is supported -- PCI Express HCAs will not work if > they are running mem-free firmware. > - On x86, only CPUs with SSE will work now. I'd be surprised if > anyone has x86 system with an HCA that doesn't have SSE. > > Also, I've only tried 32-bit i386 userspace running on i386 and x86_64 > kernels -- I don't expect any portability problems but I haven't even > built for other architectures. > > In any case, please give this a spin and let me know how it looks to you. > > My short- and medium-term plans are: > > 1. Catch up on reviewing and applying the patche queue I'm sitting on. > 2. Land the Arbel mem-free mode support from the roland-uverbs branch > onto the main trunk (and merge it upstream once 2.6.11 is out and > 2.6.12 opens). > 3. Implement UD support for userspace. I should have this done before > the end of next week. > 4. Implement mem-free support for userspace. > > Thanks, > Roland Roland, I have implemented a small test for the rdma functionality. I based it on the pingpong test, the main change being polling on data instead of completions (but I also changed the clock sampling to use the realtime clock from -lrt, since it gives a more consistent timing results on my system). This is useful as an example of using rdma, and is also useful as a post send latency benchmark, for tuning (nicer than the send test in that it let us measure post send separately from poll cq). Do you want such stuff under libibverbs/examples, or somewhere else? -- MST - Michael S. 
Tsirkin From tduffy at sun.com Tue Mar 1 15:01:46 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 01 Mar 2005 15:01:46 -0800 Subject: [openib-general] IB Address Translation service In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F3FAD2D@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAD2D@taurus.voltaire.com> Message-ID: <1109718106.11800.51.camel@duffman> [ putting back on list ] On Wed, 2005-03-02 at 00:29 +0200, Yaron Haviv wrote: > Did you try RARP with IPoIB ? I have not. > I thought that there is some issue that it doesn't work Currently, the rarpd only works with ethernet, but I don't see why this couldn't be fixed. > Also I hope you can comment on the other ib_at capabilities which are > more important than ATS I don't mind the idea of abstracting out address translation. I think maybe this is a premature optimization and we should see how each ULP uses/does it first, then abstract out common code. Otherwise, I feel neither strongly for or against your proposal. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Tue Mar 1 15:02:14 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 15:02:14 -0800 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <20050301225318.GA16946@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Mar 2005 00:53:18 +0200") References: <52u0o4pfe8.fsf@topspin.com> <20050301225318.GA16946@mellanox.co.il> Message-ID: <52y8d7nf61.fsf@topspin.com> Michael> I have implemented a small test for the rdma Michael> functionality. 
I based it on the pingpong test, the main Michael> change being polling on data instead of completions (but Michael> I also changed the clock sampling to use the realtime Michael> clock from -lrt, since it gives a more consistent timing Michael> results on my system). Sounds great, thanks. Michael> Do you want such stuff under libibverbs/examples, or Michael> somewhere else? Please generate a patch putting it under libibverbs/examples. If it makes sense to share code from pingpong.c, feel free to split pingpong.c into multiple source files and share the code. Thanks, Roland From roland at topspin.com Tue Mar 1 15:06:48 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 15:06:48 -0800 Subject: [openib-general] putting in dead wood for DAPL and similar abomination In-Reply-To: <20050301220543.GA16443@lst.de> (Christoph Hellwig's message of "Tue, 1 Mar 2005 23:05:43 +0100") References: <20050301220543.GA16443@lst.de> Message-ID: <52u0nvneyf.fsf@topspin.com> Christoph> Please don't put in things like the address translation Christoph> service or memory windows for DAPL folks. The IB code Christoph> in the kernel already has far too much unused stuff and Christoph> adding more will not go past reviews for kernel Christoph> inclusions - as will DAPL itself exactly because of Christoph> such utter stupidities. Similar hint to the NFS over Christoph> RDMA folks at CITI - if you want your stuff to go in Christoph> use the openib helper directly below the transport Christoph> switch - differnet RDMA transports are too diverse to Christoph> be sanely abstracted out and DAPL does a horrible job Christoph> at that. If we need to consolidate code for differnt Christoph> transports we can put it into a library later on. I agree with this sentiment. (Notice how I asked if any real applications are using memory windows?) 
I also agree that it makes sense to build abstractions by looking at multiple real implementations, rather than trying to design the abstractions in advance. We're just now beginning to understand how a clean InfiniBand stack should look, and I haven't seen any free software for other RDMA transports. By the way, at least for the code I wrote, anything that doesn't have a kernel user yet is there because it is used by a real protocol that should make it upstream eventually. - R. From roland at topspin.com Tue Mar 1 15:13:29 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 15:13:29 -0800 Subject: [openib-general] [PATCH][IPOIB] data_debug_level should be declared static In-Reply-To: <1109697716.11800.2.camel@duffman> (Tom Duffy's message of "Tue, 01 Mar 2005 09:21:56 -0800") References: <1109697716.11800.2.camel@duffman> Message-ID: <52ll97nena.fsf@topspin.com> Thanks, applied. - R. From roland at topspin.com Tue Mar 1 15:22:30 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 01 Mar 2005 15:22:30 -0800 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <20050301225318.GA16946@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Mar 2005 00:53:18 +0200") References: <52u0o4pfe8.fsf@topspin.com> <20050301225318.GA16946@mellanox.co.il> Message-ID: <52hdjvne89.fsf@topspin.com> mst> but I also changed the clock sampling to use the realtime mst> clock from -lrt, since it gives a more consistent timing mst> results on my system. By the way, what exactly are you using? clock_gettime() with CLOCK_REALTIME? Do you know what the difference from gettimeofday is? I haven't followed Linux timekeeping development too closely but there should be some portable libc way to get high-resolution time without a system call (ie rdtsc on x86, mftb on ppc, etc). - R. 
From yaronh at voltaire.com Tue Mar 1 15:38:50 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Wed, 2 Mar 2005 01:38:50 +0200 Subject: [openib-general] putting in dead wood for DAPL and similarabomination Message-ID: <35EA21F54A45CB47B879F21A91F4862F3FAD30@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Christoph Hellwig > Sent: Wednesday, March 02, 2005 12:06 AM > To: openib-general at openib.org > Subject: [openib-general] putting in dead wood for DAPL and > similarabomination > > Please don't put in things like the address translation service or > memory windows for DAPL folks. The IB code in the kernel already > has far too much unused stuff and adding more will not go past reviews > for kernel inclusions - as will DAPL itself exactly because of such > utter stupidities. Even if your approach to DAPL was right you still have address translation service in SDP, and would need one for NFS/RDMA, and another one to iSER and another one for Lustre, etc' (even if they are coded directly to the verbs) Not to mention other protocols that access the SA (e.g. SRP, ..). So is your idea to duplicate that functionality for all the ULPs ? Would that make the code simpler and easier to maintain ? 
Yaron From hch at lst.de Tue Mar 1 15:46:44 2005 From: hch at lst.de (Christoph Hellwig) Date: Wed, 2 Mar 2005 00:46:44 +0100 Subject: [openib-general] putting in dead wood for DAPL and similarabomination In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F3FAD30@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAD30@taurus.voltaire.com> Message-ID: <20050301234644.GA18115@lst.de> On Wed, Mar 02, 2005 at 01:38:50AM +0200, Yaron Haviv wrote: > Even if your approach to DAPL was right you still have address > translation service in SDP, and would need one for NFS/RDMA, and another > one to iSER and another one for Lustre, etc' (even if they are coded > directly to the verbs) Not to mention other protocols that access the SA > (e.g. SRP, ..). > > So is your idea to duplicate that functionality for all the ULPs ? > Would that make the code simpler and easier to maintain ? Get the code out first and then see what can be shared and what not, there's no way to find a sane API otherwise. From libor at topspin.com Tue Mar 1 15:52:43 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 1 Mar 2005 15:52:43 -0800 Subject: [openib-general] IB Address Translation service In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F3FAC6A@taurus.voltaire.com>; from yaronh@voltaire.com on Mon, Feb 28, 2005 at 11:55:54PM +0200 References: <35EA21F54A45CB47B879F21A91F4862F3FAC6A@taurus.voltaire.com> Message-ID: <20050301155243.B27810@topspin.com> On Mon, Feb 28, 2005 at 11:55:54PM +0200, Yaron Haviv wrote: > From: Libor Michalek > > > > SDP does implement a subset of the proposed functionality for > > resolving IP addresses to PathRecords which can then be used in > > a CM REQ request, plus some basic caching. All the code is isolated > > to a single file, sdp_link.c. 
There's really only a single entry
> > point API, plus a completion function:
> >
> > int sdp_link_path_lookup(u32 dst_addr,
> >                          u32 src_addr,
> >                          int bound_dev_if,
> >                          void (*completion)(u64 id,
> >                                             int status,
> >                                             u32 dst_addr,
> >                                             u32 src_addr,
> >                                             u8 hw_port,
> >                                             struct ib_device *ca,
> >                                             struct ib_sa_path_rec *path,
> >                                             void *arg),
> >                          void *arg,
> >                          u64 *id);
> >
> > The values are based on strictly what is needed by either the Linux
> > routing code to resolve the address, or the IB APIs to establish the
> > connection. The implementation has three stages:
> >
> > - src/dst IP address -> IPoIB net_device, IB ca, IB port, IB pkey.
> > - dst IP address and IPoIB net_device -> dst GID using IPoIB ARP
> > - dst GID -> PathRecord using ib_sa.
>
> Libor the idea is that ib_at provides similar functionality
> Sahar looked through your SDP code prior to proposing the API
> We would like to have a common API for all the ULP's that provide that
> functionality, and specifically now when we implement kDAPL over OpenIB.

Sure, it does make sense to break this code into its own module if there are multiple ULPs that need to use the code, and sounds like we are getting close to having another ULP which needs this resolution. However, the API feels like it is intended to provide every possible bell and whistle imaginable. It is far better to start with a simple clean minimum of features and add to the functionality as new ULPs are introduced or old ULPs are improved. I may be wrong, and people have intentions for each function and parameter that you proposed, but it feels so large that it would be good to hear which ULPs you envision using each of the functions, especially some of the less obvious ones. Remember, this API does not need to be frozen out of the gate, changes can and will be made, incompatibilities will be introduced.
I would like to see the feature set, if possible, split between user and kernel space, we should minimize what's in the kernel, and features that are only needed in userspace, should only be implemented in userspace. I also see kDAPL as weak justification for a feature. (notice I did not say uDAPL) It would be better to see a kDAPL proposal, by which I mean code, that had a chance before we start including features for it in surrounding code. As it stands it has an uphill battle, and not just because of the API itself.

> To summarize the differences:
>
> The reasons we broke it into two functions (IP->GID, GID->Path) and not
> have an IP->Path API (like we also used to have in our gen1 stack) are:
>
> a. some consumers will only need the 1st part (e.g. just to know which
> HCA to use)
> b. some may use only the 2nd part (e.g. IPoIB, SRP)
> c. you can get parameters from the first part (e.g. P_Key, and decide to
> overwrite it with your own P_Key, etc')
> d. the 2nd function provides more options for multipath, partitioning,
> QoS
> e. we can now more easily use different IP resolution mechanisms without
> changing the 2nd function (ARP or ATS).

I have no real problem with splitting the two halves of the resolution into two functions, as long as the common case of IP->Path is easy to perform. By which I mean that all the parameters I need for GID->Path are either in the IP->GID result or are obvious. Which it sounds like from your later comment.

> We added source IP and TOS as optional parameters for the IP->GID, just
> because IP route can be defined for Src/dst/TOS, and it's already part
> of Linux.

OK, sounds good. I'm using source IP now, since it's possible to bind a socket to a specific source address before connecting.

-Libor

From mst at mellanox.co.il Tue Mar 1 16:22:19 2005
From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 2 Mar 2005 02:22:19 +0200 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <52hdjvne89.fsf@topspin.com> References: <52u0o4pfe8.fsf@topspin.com> <20050301225318.GA16946@mellanox.co.il> <52hdjvne89.fsf@topspin.com> Message-ID: <20050302002219.GB16946@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ANNOUNCE: First usable version of userspace verbs > > mst> but I also changed the clock sampling to use the realtime > mst> clock from -lrt, since it gives a more consistent timing > mst> results on my system. > > By the way, what exactly are you using? clock_gettime() with > CLOCK_REALTIME? Yes. > Do you know what the difference from gettimeofday is? I didn't investigate; all I know is that with clock_gettime I seem to get consistent results across runs, not so with gettimeofday. > I haven't followed Linux timekeeping development too closely but there > should be some portable libc way to get high-resolution time without a > system call (ie rdtsc on x86, mftb on ppc, etc). > > - R. I can look up the librt source, I guess. -- MST - Michael S. Tsirkin From mst at mellanox.co.il Tue Mar 1 16:38:36 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Mar 2005 02:38:36 +0200 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <52hdjvne89.fsf@topspin.com> References: <52u0o4pfe8.fsf@topspin.com> <20050301225318.GA16946@mellanox.co.il> <52hdjvne89.fsf@topspin.com> Message-ID: <20050302003836.GA17646@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ANNOUNCE: First usable version of userspace verbs > > mst> but I also changed the clock sampling to use the realtime > mst> clock from -lrt, since it gives a more consistent timing > mst> results on my system. > > By the way, what exactly are you using? clock_gettime() with > CLOCK_REALTIME? Do you know what the difference from gettimeofday is?
> > I haven't followed Linux timekeeping development too closely but there > should be some portable libc way to get high-resolution time without a > system call (ie rdtsc on x86, mftb on ppc, etc). > > - R. > Looking at libc sources (glibc-2.3.2-200304020432) there appears to be an internal macro for it, but I don't see it exported; it seems to be used for the ./malloc/memusage.c implementation. I'll look for a library outside of libc that we can use. clock_gettime is a syscall, so it has overhead, of course. Still, since we call it once per 1000 iterations, the overhead isn't big. -- MST - Michael S. Tsirkin From tziporet at mellanox.co.il Wed Mar 2 00:33:56 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 2 Mar 2005 10:33:56 +0200 Subject: [openib-general] MTHCA features Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF063@mtlex01.yok.mtl.com> We recommend working with FMRs and not memory windows, for performance reasons. FMRs are much faster and available for kernel modules only. They are not yet implemented in mthca but it is possible to add them. Tziporet -----Original Message----- From: James Lentini [mailto:jlentini at netapp.com] Sent: Tuesday, March 01, 2005 10:58 PM To: Roland Dreier Cc: openib-general at openib.org Subject: Re: [openib-general] MTHCA features NFS/RDMA doesn't require memory windows but it will make use of them if they are available. -james From mst at mellanox.co.il Wed Mar 2 01:27:51 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Mar 2005 11:27:51 +0200 Subject: [openib-general] [PATCH] uverbs: whitespace fix Message-ID: <20050302092751.GB25029@mellanox.co.il> Whitespace fix. Signed-off-by: Michael S.
Tsirkin Index: mthca_provider.c =================================================================== --- mthca_provider.c (revision 1895) +++ mthca_provider.c (working copy) @@ -674,7 +674,7 @@ static struct ib_mr *mthca_reg_user_mr(s return ERR_PTR(-ENOMEM); list_for_each_entry(chunk, &region->chunk_list, list) - npages += chunk->nents; + npages += chunk->nents; page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); if (!page_list) { -- MST - Michael S. Tsirkin From mst at mellanox.co.il Wed Mar 2 02:12:12 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Mar 2005 12:12:12 +0200 Subject: [openib-general] [PATCH] uverbs_mem printk Message-ID: <20050302101212.GC25029@mellanox.co.il> Since userspace can trivially trigger get_user_pages failure by passing in an illegal virtual address/size pair, I suggest removing the printk when this happens: I think kernel messages should reflect kernel problems, not user level application bugs. Signed-off-by: Michael S. Tsirkin Index: core/uverbs_mem.c =================================================================== --- core/uverbs_mem.c (revision 1895) +++ core/uverbs_mem.c (working copy) @@ -69,11 +69,6 @@ int ib_umem_get(struct ib_device *dev, s PAGE_SIZE / sizeof (struct page *)), 1, 0, page_list, NULL); - if (ret < 0) { - printk(KERN_ERR "get_user_pages: %d\n", ret); - printk(KERN_ERR "failed at cur_base %lx\n", cur_base); - } - if (ret < 0) goto out; -- MST - Michael S. Tsirkin From mst at mellanox.co.il Wed Mar 2 02:43:11 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Mar 2005 12:43:11 +0200 Subject: [openib-general] mr_table.max_mtt_order Message-ID: <20050302104311.GD25029@mellanox.co.il> Roland, two questions: 1. I'm looking at mthca_init_mr_table.
The following loop: for (i = 1, dev->mr_table.max_mtt_order = 0; i < dev->limits.num_mtt_segs; i <<= 1, ++dev->mr_table.max_mtt_order) ; /* nothing */ Seems to exit the first time when (1 << (dev->mr_table.max_mtt_order) ) >= dev->limits.num_mtt_segs So if dev->limits.num_mtt_segs is not a power of 2, (1 << (dev->mr_table.max_mtt_order) ) > dev->limits.num_mtt_segs and so max_mtt_order seems to be too large by 1? Did I misunderstand something, or is there something that forces dev->limits.num_mtt_segs to be a power of 2? 2. There are some places in mthca where we try to round some value up to the power of 2, some done by loops like this one. I find them error-prone. Will you accept a patch replacing them with an inline function? Using fls, this function will also be more efficient than a linear loop. -- MST - Michael S. Tsirkin From Arkady.Kanevsky at netapp.com Wed Mar 2 03:44:22 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 2 Mar 2005 06:44:22 -0500 Subject: [openib-general] IB Address Translation service Message-ID: Some historical perspective - ATS was defined prior to IPoIB. The requirements. DAT has two needs: 1. forward translation: given an IP address returns back IB GID/LID. 2. reverse translation: given IB GID/LID returns back an IP address of the requestor. ULPs: NFS, DAFS. SDP encoded IP addresses into its headers. But DAT is API and cannot define a protocol for it. Abstract address translation is a good idea. For IB we can use ATS or IPoIB. For iWARP it will be no-op. We must ensure that the DAPL that we submit to Linux can be layered on top of all RDMA transports. Since IPoIB had not had plugfest/connectathon or some other interop that demonstrate ARP and RARP I suggest we have both ATS and IPoIB support. ATS has been fully successfully tested at DAPL Plugfest. In DAPL we had not assessed the HA requirements implications on address translations which is currently under discussion.
Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Tom Duffy [mailto:tduffy at sun.com] > Sent: Tuesday, March 01, 2005 6:02 PM > To: Yaron Haviv > Cc: openib-general at openib.org > Subject: RE: [openib-general] IB Address Translation service > > > [ putting back on list ] > > On Wed, 2005-03-02 at 00:29 +0200, Yaron Haviv wrote: > > Did you try RARP with IPoIB ? > > I have not. > > > I thought that there is some issue that it doesn't work > > Currently, the rarpd only works with ethernet, but I don't > see why this couldn't be fixed. > > > Also I hope you can comment on the other ib_at capabilities > which are > > more important than ATS > > I don't mind the idea of abstracting out address translation. > I think maybe this is a premature optimization and we should > see how each ULP uses/does it first, then abstract out common > code. Otherwise, I feel neither strongly for or against your > proposal. > > -tduffy > From gdror at mellanox.co.il Wed Mar 2 05:35:59 2005 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Wed, 2 Mar 2005 15:35:59 +0200 Subject: [dat-discussions] RE: [openib-general] IB Address Translation service Message-ID: <506C3D7B14CDD411A52C00025558DED607211F55@mtlex01.yok.mtl.com> From: Kanevsky, Arkady [mailto:arkady at netapp.com] Sent: Wednesday, March 02, 2005 1:44 PM > > Some historical perspective - ATS was defined prior to IPoIB. > > The requirements. > DAT has two needs: > 1. forward translation: given an IP address returns back IB > GID/LID. 2. reverse translation: given IB GID/LID returns > back an IP address of the requestor. > > ULPs: NFS, DAFS. > > SDP encoded IP addresses into its headers. Arkady, you meant that SDP placed the IP addresses into the private data of the CM REQ message. This message just go once when the connection is established. Right ? 
In other words, if one wants to perform reverse lookup when not using ATS, then the private data of the REQ message in DAPL has to change so that the connecting node can send its IP address. > But DAT is API and cannot define a protocol for it. > > Abstract address translation is a good idea. > For IB we can use ATS or IPoIB. > For iWARP it will be no-op. > We must ensure that the DAPL that we submit to Linux can be > layered on top of all RDMA transports. > > Since IPoIB had not had plugfest/connectathon or some other > interop that demonstrate ARP and RARP I suggest we have both > ATS and IPoIB support. ATS has been fully successfully tested > at DAPL Plugfest. As far as I know IPoIB was tested for interop to some degree at the last plugfest. I don't know the details. Note that it was tested as a standalone module and not as an address resolution mechanism for DAPL. > > In DAPL we had not assessed the HA requirements implications > on address translations which is currently under discussion. > > Arkady Kanevsky email: arkady at netapp.com > Network Appliance phone: 781-768-5395 > 375 Totten Pond Rd. Fax: 781-895-1195 > Waltham, MA 02451-2010 central phone: 781-768-5300 > > > > > -----Original Message----- > > From: Tom Duffy [mailto:tduffy at sun.com] > > Sent: Tuesday, March 01, 2005 6:02 PM > > To: Yaron Haviv > > Cc: openib-general at openib.org > > Subject: RE: [openib-general] IB Address Translation service > > > > > > [ putting back on list ] > > > > On Wed, 2005-03-02 at 00:29 +0200, Yaron Haviv wrote: > > > Did you try RARP with IPoIB ? > > > > I have not. > > > > > I thought that there is some issue that it doesn't work > > > > Currently, the rarpd only works with ethernet, but I don't > > see why this couldn't be fixed. > > > > > Also I hope you can comment on the other ib_at capabilities > > which are > > > more important than ATS > > > > I don't mind the idea of abstracting out address translation.
> > I think maybe this is a premature optimization and we should > > see how each ULP uses/does it first, then abstract out common > > code. Otherwise, I feel neither strongly for or against your > > proposal. > > > > -tduffy From mst at mellanox.co.il Wed Mar 2 07:14:46 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Mar 2005 17:14:46 +0200 Subject: [openib-general] Fwd: Linux 2.6.11 Message-ID: <20050302151446.GA26194@mellanox.co.il> To little fanfare, the first mainline kernel with InfiniBand support has been released: # ls linux-2.6.10/drivers/infiniband /bin/ls: linux-2.6.10/drivers/infiniband: No such file or directory # ls linux-2.6.11/drivers/infiniband . .. Kconfig Makefile core hw include ulp And its _all_ officially bug free (see attachment), which must mean gen2 code is officially bug free too! -- MST - Michael S. Tsirkin -------------- next part -------------- An embedded message was scrubbed... From: Linus Torvalds Subject: Linux 2.6.11 Date: no date Size: 5140 URL: From jlentini at netapp.com Wed Mar 2 08:11:35 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 2 Mar 2005 11:11:35 -0500 (EST) Subject: [openib-general] putting in dead wood for DAPL and similar abomination In-Reply-To: <20050301220543.GA16443@lst.de> References: <20050301220543.GA16443@lst.de> Message-ID: hch> Please don't put in things like the address translation service or hch> memory windows for DAPL folks.
The IB code in the kernel already hch> has far too much unused stuff and adding more will not go past reviews hch> for kernel inclusions - as will DAPL itself exactly because of such hch> utter stupidities. Similar hint to the NFS over RDMA folks at CITI - hch> if you want your stuff to go in use the openib helper directly below hch> the transport switch - differnet RDMA transports are too diverse to hch> be sanely abstracted out and DAPL does a horrible job at that. DAPL has been efficiently supported on top of InfiniBand, iWARP, the Virtual Interface Architecture, Quadrics, and Myrinet. hch> If we need to consolidate code for differnt transports we can put hch> it into a library later on. From jlentini at netapp.com Wed Mar 2 08:26:49 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 2 Mar 2005 11:26:49 -0500 (EST) Subject: [openib-general] IB Address Translation service In-Reply-To: <1109715208.11800.41.camel@duffman> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> Message-ID: tduffy> > The one thing that ATS provide and is not possible with tduffy> > ARP is reverse resolution GID->IP, any ideas how to achieve tduffy> > that without ATS ? tduffy> tduffy> RARP. Where is the encapsulation of RARP packets on IB defined? The "Transmission of IP over InfiniBand" IETF draft specifies the procedure for ARP and Neighbor Discovery, but not RARP. From tduffy at sun.com Wed Mar 2 10:11:34 2005 From: tduffy at sun.com (Tom Duffy) Date: Wed, 02 Mar 2005 10:11:34 -0800 Subject: [openib-general] [PATCH][SDP] Make sdp compile on 2.6.11 Message-ID: <1109787094.4913.7.camel@duffman> Now that 2.6.11 is out, need to make sdp compile with 2.6.11. 
Signed-off-by: Tom Duffy Index: drivers/infiniband/ulp/sdp/sdp_iocb.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_iocb.c (revision 1937) +++ drivers/infiniband/ulp/sdp/sdp_iocb.c (working copy) @@ -141,6 +141,7 @@ static int sdp_iocb_page_save(struct sdp struct page *page; unsigned long pfn; pgd_t *pgd; + pud_t *pud; pmd_t *pmd; pte_t *ptep; pte_t pte; @@ -182,8 +183,12 @@ static int sdp_iocb_page_save(struct sdp pgd = pgd_offset_gate(iocb->mm, addr); if (!pgd || pgd_none(*pgd)) break; + + pud = pud_offset(pgd, addr); + if (!pud || pud_none(*pud)) + break; - pmd = pmd_offset(pgd, addr); + pmd = pmd_offset(pud, addr); if (!pmd || pmd_none(*pmd)) break; From yaronh at voltaire.com Wed Mar 2 10:15:52 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Wed, 2 Mar 2005 20:15:52 +0200 Subject: [openib-general] IB Address Translation service Message-ID: <35EA21F54A45CB47B879F21A91F4862F3FAD96@taurus.voltaire.com> > -----Original Message----- > From: Tom Duffy [mailto:tduffy at sun.com] > Sent: Wednesday, March 02, 2005 1:02 AM > To: Yaron Haviv > Cc: openib-general at openib.org > Subject: RE: [openib-general] IB Address Translation service > > [ putting back on list ] > > On Wed, 2005-03-02 at 00:29 +0200, Yaron Haviv wrote: > > Did you try RARP with IPoIB ? > > I have not. > > > I thought that there is some issue that it doesn't work > > Currently, the rarpd only works with ethernet, but I don't see why this > couldn't be fixed. > Tom, IPoIB HW Address consists of GID+QPN+.. In order to issue a RARP I believe you should supply the full HW address to get the IP address back, how would you know the remote IPoIB QPN ? or can you do it without a QPN ? 
Yaron From Thomas.Talpey at netapp.com Wed Mar 2 10:40:32 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 02 Mar 2005 13:40:32 -0500 Subject: [openib-general] putting in dead wood for DAPL and similar abomination In-Reply-To: <20050301220543.GA16443@lst.de> References: <20050301220543.GA16443@lst.de> Message-ID: <6.2.1.2.2.20050301214342.01403700@exnane01.nane.netapp.com> At 05:05 PM 3/1/2005, Christoph Hellwig wrote: >Similar hint to the NFS over RDMA folks at CITI - >if you want your stuff to go in use the openib helper directly below >the transport switch - differnet RDMA transports are too diverse to >be sanely abstracted out and DAPL does a horrible job at that. If >we need to consolidate code for differnt transports we can put it >into a library later on. Ok, I'll speak for the NFS over RDMA implementation. (I've brought your hint to the attention of the CITI folks - we are working together this week here at the NFS Connectathon). The NFS/RDMA client, and soon the server, use kDAPL for a simple reason - we need an RDMA API which allows us to plug in RDMA NICs without also having to modify NFS client, server and RPC code. You're trying to sentence us to coding NFS to individual hardware. It's unacceptable to have to modify NFS and RPC just because a new adapter has been attached. It's the same NFS/RDMA protocol over IB, iWARP, and even VI. Offering "consolidation" "later on" is an enormous step backward from what we're using (successfully) today. Tom. 
From Thomas.Talpey at netapp.com Wed Mar 2 10:49:00 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 02 Mar 2005 13:49:00 -0500 Subject: [openib-general] IB Address Translation service In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F3FAD96@taurus.voltaire.com > References: <35EA21F54A45CB47B879F21A91F4862F3FAD96@taurus.voltaire.com> Message-ID: <6.2.1.2.2.20050302134205.05a92eb0@exnane01.nane.netapp.com> At 01:15 PM 3/2/2005, Yaron Haviv wrote: >In order to issue a RARP I believe you should supply the full HW address >to get the IP address back, how would you know the remote IPoIB QPN ? or >can you do it without a QPN ? To say nothing of the fact that there must be a RARPD, administered and secured on each subnet. Aren't there enough daemons needed to support this stuff as it is? The advantage of ATS is that it "just works" whether wired point to point, or via a switch, or whatever. It requires no central administration, works as transparently as ARP and ND, and supports IP addressing so applications don't have any ambiguity in how they resolve names. If we get rid of ATS, what do we replace it with? Raw IB GID's from the application?? Tom. From robert.j.woodruff at intel.com Wed Mar 2 11:22:56 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 2 Mar 2005 11:22:56 -0800 Subject: [openib-general] putting in dead wood for DAPL and similarabomination Message-ID: <1AC79F16F5C5284499BB9591B33D6F0003B54E09@orsmsx408> James wrote:, >DAPL has been efficiently supported on top of InfiniBand, iWARP, the >Virtual Interface Architecture, Quadrics, and Myrinet. I think the point is that only one of those interconnects (IB) is in the kernel, the rest are proprietary. Do any of the other RDMA interconnect vendors plan to submit their code for inclusion into Linux in the near future ? 
woody From halr at voltaire.com Wed Mar 2 11:32:21 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Mar 2005 14:32:21 -0500 Subject: [openib-general] [PATCH][CORE] fix sparse warnings about static variables In-Reply-To: <1109700755.11800.21.camel@duffman> References: <1109700755.11800.21.camel@duffman> Message-ID: <1109791941.4645.17.camel@localhost.localdomain> On Tue, 2005-03-01 at 13:12, Tom Duffy wrote: > This gets rid of the new sparse warnings like: > > /build1/tduffy/openib-work/linux-2.6.10-openib/drivers/infiniband/core/mad.c:50:14: warning: symbol 'ib_mad_cache' was not declared. Should it be static? Thanks. Applied. -- Hal From sean.hefty at intel.com Wed Mar 2 12:14:28 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 2 Mar 2005 12:14:28 -0800 Subject: [openib-general] [MAD] RMPP reassembly In-Reply-To: <4224B419.1080601@ichips.intel.com> Message-ID: >I'm studying the RMPP implementation requirements for reassembly, and >there are a couple of issues/questions. A couple more comments while coding up the reassembly: struct ib_mad_recv_buf contains struct ib_mad *mad. I'm wondering if it makes sense to change this to a union of ib_mad *, ib_rmpp_mad *, ib_vendor_mad *, ib_smp *, ib_sa_mad *. Currently, the user casts the returned MAD to the correct format. This would be a minor, but visible change to all current MAD users... Has anyone given thought on how to best expose RMPP to user mode? - Sean From halr at voltaire.com Wed Mar 2 12:17:06 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Mar 2005 15:17:06 -0500 Subject: [Fwd: [openib-general] [RFC] Diagnostic tree structure] Message-ID: <1109794536.4645.22.camel@localhost.localdomain> The diagnostics tree structure has been flattened one level. There are no longer host and net subdirectories. The diag tools have been moved up one level in the tree. Let me know if there are any problems I introduced by doing this. Thanks.
-- Hal -----Forwarded Message----- From: Hal Rosenstock To: openib-general at openib.org Subject: [openib-general] [RFC] Diagnostic tree structure Date: 17 Feb 2005 09:06:39 -0500 Hi, The current userspace diagnostics tree structure has host and net subdirectories. The distinction between the two is blurring so we would like to flatten the tree and just have all the tools under diags. Any objections ? Thanks. -- Hal _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From krause at cup.hp.com Wed Mar 2 12:24:47 2005 From: krause at cup.hp.com (Michael Krause) Date: Wed, 02 Mar 2005 12:24:47 -0800 Subject: [openib-general] putting in dead wood for DAPL and similarabomination In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003B54E09@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0003B54E09@orsmsx408> Message-ID: <6.2.0.14.2.20050302122242.02994f68@esmail.cup.hp.com> At 11:22 AM 3/2/2005, Woodruff, Robert J wrote: > James wrote:, > >DAPL has been efficiently supported on top of InfiniBand, iWARP, the > >Virtual Interface Architecture, Quadrics, and Myrinet. > >I think the point is that only one of those interconnects (IB) is >in the kernel, the rest are proprietary. Do any of the other RDMA >interconnect vendors plan to submit their code for inclusion into Linux >in the near future ? I know some people are working on iWARP devices though this could be done on the OpenRDMA source work (still a work in progress) which supports both iWARP and IB. BTW, I second Tom, et. al. push to use an API to abstract this and avoid having to permute every subsystem to work for a given device. The RNIC PI is intended to provide abstraction for iWARP / IB hardware to a very large extent (think of this as a standard verbs interface). 
IT API / DAPL provide another layer of abstraction and can be used to integrate subsystems either over the RNIC PI or whatever verbs API people desire. Mike From hch at lst.de Wed Mar 2 13:48:47 2005 From: hch at lst.de (Christoph Hellwig) Date: Wed, 2 Mar 2005 22:48:47 +0100 Subject: [openib-general] putting in dead wood for DAPL and similar abomination In-Reply-To: References: <20050301220543.GA16443@lst.de> Message-ID: <20050302214847.GA4253@lst.de> On Wed, Mar 02, 2005 at 11:11:35AM -0500, James Lentini wrote: > DAPL has been efficiently supported on top of InfiniBand, iWARP, the > Virtual Interface Architecture, Quadrics, and Myrinet. And I've not seen any kernel submission for any of them - and, more importantly, not a single kDAPL application that actually shows any benefit that way. Voltaire's iSER implementation would surely be smaller when directly written to the OpenIB interface, and is already smaller than the whole kDAPL layer. From yaronh at voltaire.com Wed Mar 2 14:26:33 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Thu, 3 Mar 2005 00:26:33 +0200 Subject: [openib-general] putting in dead wood for DAPL and similarabomination Message-ID: <35EA21F54A45CB47B879F21A91F4862F3FAD9E@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Christoph Hellwig > Sent: Wednesday, March 02, 2005 11:49 PM > To: James Lentini > Cc: Christoph Hellwig; openib-general at openib.org > Subject: Re: [openib-general] putting in dead wood for DAPL and > similarabomination > > On Wed, Mar 02, 2005 at 11:11:35AM -0500, James Lentini wrote: > > DAPL has been efficiently supported on top of InfiniBand, iWARP, the > > Virtual Interface Architecture, Quadrics, and Myrinet.
> > And I've not seen any kernel submittsion for either of them - and what's > important no single kDAPL application that actually shows any benefit > that way. Volatair's iSER implementation would surely be smaller when > directly written to the OpenIB interface, and is already smaller than > the whole kDAPL layer. Christoph, the reason the iSER code is very thin is that it is using kDAPL (and Linux iSCSI); it doesn't need to deal with SA calls, CM calls, LIDs, GIDs, and a bunch of other things. Besides being RDMA transport independent, DAPL enables people to code to RDMA without being intimately familiar with the HW. We saw people coding to it in days, which I can't say for Verbs. Abstract layers are not new to Linux: Sockets is another type of abstraction with multiple protocols/families underneath, or even Ethernet. Why aren't you suggesting a TCP implementation for ATM cards, and one for PPP, etc.? Yaron From jon-openib at umich.edu Wed Mar 2 14:32:22 2005 From: jon-openib at umich.edu (Jon Bauman) Date: Wed, 2 Mar 2005 14:32:22 -0800 Subject: [openib-general] putting in dead wood for DAPL and similar abomination Message-ID: <92a7dc786ef0764d64c329a62c42a3aa@umich.edu> At 05:05 PM 3/1/2005, Christoph Hellwig wrote: > Similar hint to the NFS over RDMA folks at CITI - > if you want your stuff to go in use the openib helper directly below > the transport switch - differnet RDMA transports are too diverse to > be sanely abstracted out and DAPL does a horrible job at that. If > we need to consolidate code for differnt transports we can put it > into a library later on. CITI folk here.
I'm not familiar with the openib helper you refer to, but since you mention the transport switch, I'll assume you're referring to the client. I'm currently working on the NFS over RPCRDMA server, so this isn't of much help to me. While I'd agree that DAPL has its shortcomings, it's not finalized yet, and I know of no other alternatives. On the other hand, I don't agree that the different RDMA transports are necessarily too diverse to provide a reasonable API for them. It seems silly to invest a lot of effort writing directly for IB, since we couldn't reuse the code for other transports. Why create a nonstandard library after the fact when so much work has gone into DAPL already? Even if DAPL needs to change, we can later make our changes just once at that layer. We should have basic functionality with NFS atop DAPL in the near future that will enable us to plug in different transports without changing the ULP code. Would that convince you that DAPL is at least a useful starting point? From tduffy at sun.com Wed Mar 2 15:39:48 2005 From: tduffy at sun.com (Tom Duffy) Date: Wed, 02 Mar 2005 15:39:48 -0800 Subject: [openib-general] "arping" failing over ipoib Message-ID: <1109806789.4913.43.camel@duffman> I am trying to configure my fedora box to bring up ib0 on startup. Unfortunately it is failing, saying that the IP address is already taken on the network -- no matter what IP address I use. I traced this down to the fact that the ifup script uses "arping" to test this condition. It appears arping is failing something like this: # arping -c 2 -w 3 -D -I ib0 192.168.0.62 ARPING 192.168.0.62 from 0.0.0.0 ib0 Sent 2 probes (2 broadcast(s)) Received -1 response(s) where I can ping: # ping 192.168.0.62 PING 192.168.0.62 (192.168.0.62) 56(84) bytes of data.
64 bytes from 192.168.0.62: icmp_seq=0 ttl=64 time=0.079 ms --- 192.168.0.62 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.079/0.079/0.079/0.000 ms, pipe 2 [root at flopteron iputils]# arp -a ? (192.168.0.62) at 00:00:00:14:FE:80:00:00:00 [infiniband] on ib0 ? (10.6.98.1) at 00:00:0C:07:AC:00 [ether] on eth0 # ifconfig ib0 ib0 Link encap:InfiniBand HWaddr 00:00:00:84:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:192.168.0.25 Bcast:192.168.0.255 Mask:255.255.255.0 inet6 addr: fe80::202:c901:a99:e0a1/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:39 errors:0 dropped:0 overruns:0 frame:0 TX packets:39 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:3164 (3.0 KiB) TX bytes:3320 (3.2 KiB) I have looked at the arping code, and it seems to be crafting the packet correctly, even using 32 in the type field, so I am a bit befuddled as to why this isn't working. Any ideas? -tduffy From halr at voltaire.com Wed Mar 2 16:31:42 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Mar 2005 19:31:42 -0500 Subject: [openib-general] "arping" failing over ipoib In-Reply-To: <1109806789.4913.43.camel@duffman> References: <1109806789.4913.43.camel@duffman> Message-ID: <1109809902.4645.48.camel@localhost.localdomain> On Wed, 2005-03-02 at 18:39, Tom Duffy wrote: > I am trying to configure my fedora box to bring up ib0 on startup. > Unfortunately it is failing, saying that the IP address is already taken > on the network -- no matter what IP address I use. I traced this down > to the fact that the ifup script uses "arping" to test this condition.
> It appears arping is failing something like this: > > # arping -c 2 -w 3 -D -I ib0 192.168.0.62 > ARPING 192.168.0.62 from 0.0.0.0 ib0 > Sent 2 probes (2 broadcast(s)) > Received -1 response(s) Here's what I see going on: arping appears to cause a join (with component mask 0x10083) to MGID 0xFFFF:FFFF:FFFF:0742:0070:26C2:C43F:81C0. That does not look like an IPoIB MGID to me. Not sure how this MGID is generated. The SM refuses this with status 0x0600 (ERR_REQ_INSUFFICIENT_COMPONENTS) which is what a join request gets when the group is not already created. -- Hal From hch at lst.de Wed Mar 2 19:48:27 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 3 Mar 2005 04:48:27 +0100 Subject: [openib-general] putting in dead wood for DAPL and similarabomination In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F3FAD9E@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAD9E@taurus.voltaire.com> Message-ID: <20050303034827.GA9092@lst.de> On Thu, Mar 03, 2005 at 12:26:33AM +0200, Yaron Haviv wrote: > > And I've not seen any kernel submittsion for either of them - and > what's > > important no single kDAPL application that actually shows any benefit > > that way. Volatair's iSER implementation would surely be smaller when > > directly written to the OpenIB interface, and is already smaller than > > the whole kDAPL layer. > > Christoph, the reason the iSER code is very thin is that it is using > kDAPL > (and Linux iSCSI), it doesn't need to deal with SA calls, CM calls, > LIDs, GIDs, and a bunch of other things. Umm, no - it's not small. In its current form it's freakin' huge. That's partially a fault of stupid kDAPL APIs, the cisco iscsi code and the broken implementation of the iscsi transport switch, but also because it's pretty bad code. The current iSER code is 10928 LOC, add to that 22155 LOC of kDAPL (not including the actual provider for IB) and 5822 LOC linux-iscsi kernel code.
Compare that to the 25412 LOC total for drivers/infiniband in Linux
2.6.11.

Here's the challenge: if someone gets me the funding I'll write a
complete iSER over IB implementation in less than 10k LOC based on the
open-iscsi code.

> Besides being RDMA transport independent, DAPL enables people to code to
> RDMA without being intimately familiar with the HW, we saw people coding
> to it in days, which I can't say the same for Verbs.

Which means they'll hack up total crap code.

> Abstract layers are not new to Linux, Sockets is another type of
> abstraction with multiple protocols/families underneath,

And you forgot that sockets are a really small abstraction layer, which
kDAPL is not.  And even though sockets provide a really nice abstraction
for the data transmission you need to know about address families for
connection establishment and control.  Really bad analogy, you lost :)

From roland at topspin.com Wed Mar 2 20:32:42 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 02 Mar 2005 20:32:42 -0800
Subject: [openib-general] [PATCH] uverbs: whitespace fix
In-Reply-To: <20050302092751.GB25029@mellanox.co.il> (Michael S.
	Tsirkin's message of "Wed, 2 Mar 2005 11:27:51 +0200")
References: <20050302092751.GB25029@mellanox.co.il>
Message-ID: <52ekexs61h.fsf@topspin.com>

Thanks, applied.

From roland at topspin.com Wed Mar 2 20:34:04 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 02 Mar 2005 20:34:04 -0800
Subject: [openib-general] [PATCH] uverbs_mem printk
In-Reply-To: <20050302101212.GC25029@mellanox.co.il> (Michael S.
	Tsirkin's message of "Wed, 2 Mar 2005 12:12:12 +0200")
References: <20050302101212.GC25029@mellanox.co.il>
Message-ID: <52acpls5z7.fsf@topspin.com>

Thanks, applied.  That was really just some left over debugging code
anyway.
From roland at topspin.com Wed Mar 2 20:37:13 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 02 Mar 2005 20:37:13 -0800
Subject: [openib-general] mr_table.max_mtt_order
In-Reply-To: <20050302104311.GD25029@mellanox.co.il> (Michael S.
	Tsirkin's message of "Wed, 2 Mar 2005 12:43:11 +0200")
References: <20050302104311.GD25029@mellanox.co.il>
Message-ID: <526509s5ty.fsf@topspin.com>

    Michael> Did I misunderstand something, or is there something that
    Michael> forces dev->limits.num_mtt_segs to be a power of 2?

Well, right now it's essentially hard coded to 1<<20, so it's OK for
now.  In general the buddy allocator used for allocating MTT entries
will break if

    Michael> 2. There are some places in mthca where we try to round
    Michael> some value up to the power of 2, some done by loops like
    Michael> this one. I find them error-prone.  Will you accept a
    Michael> patch replacing them with an inline function?  Using fls,
    Michael> this function will also be more efficient than a linear
    Michael> loop.

I thought about this a little.  I think that any inline function
forces someone reading the code to look up what it does, no matter how
descriptive the name we come up with.  I think it would be better to
use fls() directly.  I'm already in the habit of using ffs() to
compute log_2 of powers of two but for some reason I never remember
fls().

 - R.

From roland at topspin.com Wed Mar 2 20:40:11 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 02 Mar 2005 20:40:11 -0800
Subject: [openib-general] "arping" failing over ipoib
In-Reply-To: <1109806789.4913.43.camel@duffman> (Tom Duffy's message of
	"Wed, 02 Mar 2005 15:39:48 -0800")
References: <1109806789.4913.43.camel@duffman>
Message-ID: <521xaxs5p0.fsf@topspin.com>

    Tom> # arping -c 2 -w 3 -D -I ib0 192.168.0.62
    Tom> ARPING 192.168.0.62 from 0.0.0.0 ib0
    Tom> Sent 2 probes (2 broadcast(s))
    Tom> Received -1 response(s)

What are the network startup scripts expecting?
How is arping getting so confused that it reports -1 responses?
(Sorry, haven't had a chance to look at the code).

    Tom> I have looked at the arping code, and it seems to be crafting
    Tom> the packet correctly, even using 32 in the type field, so I
    Tom> am a bit befuddled as to why this isn't working.

What packet does arping create?  Unfortunately, because a "normal"
IPoIB packet doesn't include any encapsulation beyond the 4 bytes of
ethertype/reserved, it's a little difficult for userspace to send
broadcast packets.  If arping tries to create an ethernet-like header,
then the IPoIB driver is going to get a little confused.

 - R.

From tduffy at sun.com Wed Mar 2 21:24:08 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 02 Mar 2005 21:24:08 -0800
Subject: [openib-general] "arping" failing over ipoib
In-Reply-To: <521xaxs5p0.fsf@topspin.com>
References: <1109806789.4913.43.camel@duffman>
	<521xaxs5p0.fsf@topspin.com>
Message-ID: <1109827448.26501.3.camel@duffman>

On Wed, 2005-03-02 at 20:40 -0800, Roland Dreier wrote:
> Tom> # arping -c 2 -w 3 -D -I ib0 192.168.0.62
> Tom> ARPING 192.168.0.62 from 0.0.0.0 ib0
> Tom> Sent 2 probes (2 broadcast(s))
> Tom> Received -1 response(s)
>
> What are the network startup scripts expecting?  How is arping getting
> so confused that it reports -1 responses?  (Sorry, haven't had a
> chance to look at the code).

That is a good question.  I think the code is b0rked.  -1 is coming from
the "received" variable.  This is never initialized.  Initializing it to
0 at least causes arping to fail gracefully, letting ifup continue.  I
have opened fedora bug 150156 regarding this and emailed the arping
maintainer.

> Tom> I have looked at the arping code, and it seems to be crafting
> Tom> the packet correctly, even using 32 in the type field, so I
> Tom> am a bit befuddled as to why this isn't working.
>
> What packet does arping create?
> Unfortunately, because a "normal"
> IPoIB packet doesn't include any encapsulation beyond the 4 bytes of
> ethertype/reserved, it's a little difficult for userspace to send
> broadcast packets.  If arping tries to create an ethernet-like header,
> then the IPoIB driver is going to get a little confused.

OK, well I thought it was right according to the ARP IPoIB encapsulation
IETF draft.  I will look at it a bit more in depth tomorrow.

In any event, what is the right format of the packet for userspace to
craft?

Thanks,

-tduffy

From roland at topspin.com Wed Mar 2 21:31:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 2 Mar 2005 21:31:22 -0800
Subject: [openib-general] [PATCH][2/11] IB: fix vendor MAD deregistration
In-Reply-To: <2005322131.pkxanHLh4SQ8X31k@topspin.com>
Message-ID: <2005322131.5pgryiWlkZPYdcE7@topspin.com>

From: Shahar Frank

Fix bug when deregistering a vendor class MAD agent.
Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/core/mad.c	2005-03-02 20:26:03.185796628 -0800
+++ linux-export/drivers/infiniband/core/mad.c	2005-03-02 20:26:10.980104746 -0800
@@ -41,7 +41,6 @@
 #include "smi.h"
 #include "agent.h"

-
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_DESCRIPTION("kernel IB MAD API");
 MODULE_AUTHOR("Hal Rosenstock");
@@ -490,6 +489,7 @@
 	cancel_mads(mad_agent_priv);

 	port_priv = mad_agent_priv->qp_info->port_priv;
+
 	cancel_delayed_work(&mad_agent_priv->timed_work);
 	flush_workqueue(port_priv->wq);

@@ -1266,12 +1266,12 @@
 	}

 	port_priv = agent_priv->qp_info->port_priv;
+	mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class);
 	class = port_priv->version[
 			agent_priv->reg_req->mgmt_class_version].class;
 	if (!class)
 		goto vendor_check;

-	mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class);
 	method = class->method_table[mgmt_class];
 	if (method) {
 		/* Remove any methods for this mad agent */
@@ -1293,16 +1293,21 @@
 	}

 vendor_check:
+	if (!is_vendor_class(mgmt_class))
+		goto out;
+
+	/* normalize mgmt_class to vendor range 2 */
+	mgmt_class = vendor_class_index(agent_priv->reg_req->mgmt_class);
 	vendor = port_priv->version[
 			agent_priv->reg_req->mgmt_class_version].vendor;
+
 	if (!vendor)
 		goto out;

-	mgmt_class = vendor_class_index(agent_priv->reg_req->mgmt_class);
 	vendor_class = vendor->vendor_class[mgmt_class];
 	if (vendor_class) {
 		index = find_vendor_oui(vendor_class, agent_priv->reg_req->oui);
-		if (index == -1)
+		if (index < 0)
 			goto out;
 		method = vendor_class->method_table[index];
 		if (method) {
--- linux-export.orig/drivers/infiniband/core/mad_priv.h	2005-03-02 20:26:03.185796628 -0800
+++ linux-export/drivers/infiniband/core/mad_priv.h	2005-03-02 20:26:10.980104746 -0800
@@ -58,8 +58,8 @@
 #define MAX_MGMT_CLASS		80
 #define MAX_MGMT_VERSION	8
 #define MAX_MGMT_OUI		8
-#define MAX_MGMT_VENDOR_RANGE2	IB_MGMT_CLASS_VENDOR_RANGE2_END - \
-				IB_MGMT_CLASS_VENDOR_RANGE2_START + 1
+#define MAX_MGMT_VENDOR_RANGE2
(IB_MGMT_CLASS_VENDOR_RANGE2_END - \
+				IB_MGMT_CLASS_VENDOR_RANGE2_START + 1)

 struct ib_mad_list_head {
 	struct list_head list;

From roland at topspin.com Wed Mar 2 21:31:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 2 Mar 2005 21:31:22 -0800
Subject: [openib-general] [PATCH][0/11] InfiniBand fixes
Message-ID: <2005322131.J5dPz9nJYwSlaHs6@topspin.com>

Here is a batch of fixes from the OpenIB subversion tree for merging.

Thanks,
  Roland

From roland at topspin.com Wed Mar 2 21:31:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 2 Mar 2005 21:31:22 -0800
Subject: [openib-general] [PATCH][3/11] IB: sparse fixes
In-Reply-To: <2005322131.5pgryiWlkZPYdcE7@topspin.com>
Message-ID: <2005322131.O2Ym8iporsXeypcV@topspin.com>

From: Tom Duffy

Fix some sparse warnings by making sure we have appropriate "extern"
declarations visible.

Signed-off-by: Tom Duffy
Signed-off-by: Hal Rosenstock
Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/core/agent.c	2005-03-02 20:26:10.599187430 -0800
+++ linux-export/drivers/infiniband/core/agent.c	2005-03-02 20:26:11.456001445 -0800
@@ -45,14 +45,11 @@
 #include "smi.h"
 #include "agent_priv.h"
 #include "mad_priv.h"
-
+#include "agent.h"

 spinlock_t ib_agent_port_list_lock;
 static LIST_HEAD(ib_agent_port_list);

-extern kmem_cache_t *ib_mad_cache;
-
-
 /*
  * Caller must hold ib_agent_port_list_lock
  */
--- linux-export.orig/drivers/infiniband/core/cache.c	2005-03-02 20:26:03.085818330 -0800
+++ linux-export/drivers/infiniband/core/cache.c	2005-03-02 20:26:11.456001445 -0800
@@ -37,6 +37,8 @@
 #include
 #include

+#include
+
 #include "core_priv.h"

 struct ib_pkey_cache {
--- linux-export.orig/drivers/infiniband/core/mad_priv.h	2005-03-02 20:26:10.980104746 -0800
+++ linux-export/drivers/infiniband/core/mad_priv.h	2005-03-02 20:26:11.457001228 -0800
@@ -192,4 +192,6 @@
 	struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE];
 };

+extern kmem_cache_t *ib_mad_cache;
+
 #endif	/* __IB_MAD_PRIV_H__ */
---
linux-export.orig/drivers/infiniband/core/smi.c	2005-03-02 20:26:03.085818330 -0800
+++ linux-export/drivers/infiniband/core/smi.c	2005-03-02 20:26:11.458001011 -0800
@@ -37,7 +37,7 @@
  */

 #include
-
+#include "smi.h"

 /*
  * Fixup a directed route SMP for sending

From roland at topspin.com Wed Mar 2 21:31:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 2 Mar 2005 21:31:22 -0800
Subject: [openib-general] [PATCH][1/11] IB: simplify MAD code
In-Reply-To: <2005322131.J5dPz9nJYwSlaHs6@topspin.com>
Message-ID: <2005322131.pkxanHLh4SQ8X31k@topspin.com>

From: Hal Rosenstock

Remove unneeded MAD agent registration by using a single agent for
both directed-route and LID-routed MADs.

Signed-off-by: Hal Rosenstock
Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/core/agent.c	2005-03-02 20:26:03.280776011 -0800
+++ linux-export/drivers/infiniband/core/agent.c	2005-03-02 20:26:10.599187430 -0800
@@ -66,14 +66,13 @@

 	if (device) {
 		list_for_each_entry(entry, &ib_agent_port_list, port_list) {
-			if (entry->dr_smp_agent->device == device &&
+			if (entry->smp_agent->device == device &&
 			    entry->port_num == port_num)
 				return entry;
 		}
 	} else {
 		list_for_each_entry(entry, &ib_agent_port_list, port_list) {
-			if ((entry->dr_smp_agent == mad_agent) ||
-			    (entry->lr_smp_agent == mad_agent) ||
+			if ((entry->smp_agent == mad_agent) ||
 			    (entry->perf_mgmt_agent == mad_agent))
 				return entry;
 		}
@@ -111,7 +110,7 @@
 		return 1;
 	}

-	return smi_check_local_smp(port_priv->dr_smp_agent, smp);
+	return smi_check_local_smp(port_priv->smp_agent, smp);
 }

 static int agent_mad_send(struct ib_mad_agent *mad_agent,
@@ -231,10 +230,8 @@
 	/* Get mad agent based on mgmt_class in MAD */
 	switch (mad->mad.mad.mad_hdr.mgmt_class) {
 	case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE:
-		mad_agent = port_priv->dr_smp_agent;
-		break;
 	case IB_MGMT_CLASS_SUBN_LID_ROUTED:
-		mad_agent = port_priv->lr_smp_agent;
+		mad_agent = port_priv->smp_agent;
 		break;
 	case IB_MGMT_CLASS_PERF_MGMT:
 		mad_agent =
port_priv->perf_mgmt_agent;
@@ -284,7 +281,6 @@
 {
 	int ret;
 	struct ib_agent_port_private *port_priv;
-	struct ib_mad_reg_req reg_req;
 	unsigned long flags;

 	/* First, check if port already open for SMI */
@@ -308,35 +304,19 @@
 	spin_lock_init(&port_priv->send_list_lock);
 	INIT_LIST_HEAD(&port_priv->send_posted_list);

-	/* Obtain MAD agent for directed route SM class */
-	reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE;
-	reg_req.mgmt_class_version = 1;
-
-	port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num,
-							IB_QPT_SMI,
-							NULL, 0,
-							&agent_send_handler,
-							NULL, NULL);
+	/* Obtain send only MAD agent for SM class (SMI QP) */
+	port_priv->smp_agent = ib_register_mad_agent(device, port_num,
+						     IB_QPT_SMI,
+						     NULL, 0,
+						     &agent_send_handler,
+						     NULL, NULL);

-	if (IS_ERR(port_priv->dr_smp_agent)) {
-		ret = PTR_ERR(port_priv->dr_smp_agent);
+	if (IS_ERR(port_priv->smp_agent)) {
+		ret = PTR_ERR(port_priv->smp_agent);
 		goto error2;
 	}

-	/* Obtain MAD agent for LID routed SM class */
-	reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED;
-	port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num,
-							IB_QPT_SMI,
-							NULL, 0,
-							&agent_send_handler,
-							NULL, NULL);
-	if (IS_ERR(port_priv->lr_smp_agent)) {
-		ret = PTR_ERR(port_priv->lr_smp_agent);
-		goto error3;
-	}
-
-	/* Obtain MAD agent for PerfMgmt class */
-	reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT;
+	/* Obtain send only MAD agent for PerfMgmt class (GSI QP) */
 	port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num,
 							   IB_QPT_GSI,
 							   NULL, 0,
@@ -344,15 +324,15 @@
 							   NULL, NULL);
 	if (IS_ERR(port_priv->perf_mgmt_agent)) {
 		ret = PTR_ERR(port_priv->perf_mgmt_agent);
-		goto error4;
+		goto error3;
 	}

-	port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd,
+	port_priv->mr = ib_get_dma_mr(port_priv->smp_agent->qp->pd,
 				      IB_ACCESS_LOCAL_WRITE);
 	if (IS_ERR(port_priv->mr)) {
 		printk(KERN_ERR SPFX "Couldn't get DMA MR\n");
 		ret = PTR_ERR(port_priv->mr);
-		goto error5;
+		goto error4;
 	}

 	spin_lock_irqsave(&ib_agent_port_list_lock, flags);
@@ -361,12 +341,10 @@

 	return 0;

-error5:
-	ib_unregister_mad_agent(port_priv->perf_mgmt_agent);
 error4:
-	ib_unregister_mad_agent(port_priv->lr_smp_agent);
+	ib_unregister_mad_agent(port_priv->perf_mgmt_agent);
 error3:
-	ib_unregister_mad_agent(port_priv->dr_smp_agent);
+	ib_unregister_mad_agent(port_priv->smp_agent);
 error2:
 	kfree(port_priv);
 error1:
@@ -391,8 +369,7 @@
 	ib_dereg_mr(port_priv->mr);

 	ib_unregister_mad_agent(port_priv->perf_mgmt_agent);
-	ib_unregister_mad_agent(port_priv->lr_smp_agent);
-	ib_unregister_mad_agent(port_priv->dr_smp_agent);
+	ib_unregister_mad_agent(port_priv->smp_agent);
 	kfree(port_priv);

 	return 0;
--- linux-export.orig/drivers/infiniband/core/agent_priv.h	2005-03-02 20:26:03.280776011 -0800
+++ linux-export/drivers/infiniband/core/agent_priv.h	2005-03-02 20:26:10.599187430 -0800
@@ -55,8 +55,7 @@
 	struct list_head send_posted_list;
 	spinlock_t send_list_lock;
 	int port_num;
-	struct ib_mad_agent *dr_smp_agent;	/* DR SM class */
-	struct ib_mad_agent *lr_smp_agent;	/* LR SM class */
+	struct ib_mad_agent *smp_agent;		/* SM class */
 	struct ib_mad_agent *perf_mgmt_agent;	/* PerfMgmt class */
 	struct ib_mr *mr;
 };

From roland at topspin.com Wed Mar 2 21:31:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 2 Mar 2005 21:31:22 -0800
Subject: [openib-general] [PATCH][4/11] IB/mthca: add missing break
In-Reply-To: <2005322131.O2Ym8iporsXeypcV@topspin.com>
Message-ID: <2005322131.oecVhU1CS3swCooO@topspin.com>

Add missing break statements in switch in mthca_profile.c (pointed out
by Michael Tsirkin).
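
The fall-through problem described in this patch can be illustrated with a minimal userspace sketch. All names here (res_type, settings, apply_resource) are hypothetical and are not the mthca definitions; the point is only that each case must end in a break so that handling one resource type does not also run the next case's assignments.

```c
#include <assert.h>

/* Hypothetical resource kinds and settings; not the mthca structures. */
enum res_type { RES_UDAV, RES_UARC, RES_OTHER };

struct settings {
	unsigned long av_base;
	unsigned long uarc_base;
};

/* Record the start address for one resource type. */
static void apply_resource(struct settings *s, enum res_type type,
			   unsigned long start)
{
	switch (type) {
	case RES_UDAV:
		s->av_base = start;
		break;	/* without this, control falls into RES_UARC */
	case RES_UARC:
		s->uarc_base = start;
		break;
	default:
		break;
	}
}
```

With the breaks in place, applying RES_UDAV leaves uarc_base untouched; if the first break were removed, the same call would overwrite both fields.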
Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c	2005-03-02 20:26:03.023831785 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c	2005-03-02 20:26:11.904904003 -0800
@@ -241,10 +241,12 @@
 		case MTHCA_RES_UDAV:
 			dev->av_table.ddr_av_base = profile[i].start;
 			dev->av_table.num_ddr_avs = profile[i].num;
+			break;
 		case MTHCA_RES_UARC:
 			init_hca->uarc_base   = profile[i].start;
 			init_hca->log_uarc_sz = ffs(request->uarc_size) - 13;
 			init_hca->log_uar_sz  = ffs(request->num_uar) - 1;
+			break;
 		default:
 			break;
 		}

From roland at topspin.com Wed Mar 2 21:31:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 2 Mar 2005 21:31:22 -0800
Subject: [openib-general] [PATCH][5/11] IB/mthca: fix reset value endianness
In-Reply-To: <2005322131.oecVhU1CS3swCooO@topspin.com>
Message-ID: <2005322131.ube7cIPz9y7840bB@topspin.com>

MTHCA_RESET_VALUE must always be swapped, since the HCA expects to see
it in big-endian order and we write it with writel.  This means on
little-endian systems we have to swap it to big-endian order before
writing, and on big-endian systems we need to swap it to make up for
the additional swap that writel will do.  This fixes resetting the HCA
on big-endian machines.

Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_reset.c	2005-03-02 20:26:02.970843287 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_reset.c	2005-03-02 20:26:12.219835642 -0800
@@ -50,7 +50,7 @@
 	struct pci_dev *bridge = NULL;

 #define MTHCA_RESET_OFFSET 0xf0010
-#define MTHCA_RESET_VALUE  cpu_to_be32(1)
+#define MTHCA_RESET_VALUE  swab32(1)

 	/*
 	 * Reset the chip.
This is somewhat ugly because we have to

From roland at topspin.com Wed Mar 2 21:31:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 2 Mar 2005 21:31:22 -0800
Subject: [openib-general] [PATCH][6/11] IB/ipoib: fix rx memory leak
In-Reply-To: <2005322131.ube7cIPz9y7840bB@topspin.com>
Message-ID: <2005322131.6N8qBqgz1WuD4wnL@topspin.com>

Fix memory leak when posting a receive buffer (pointed out by Shirley
Ma).

Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2005-03-02 20:26:02.919854355 -0800
+++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2005-03-02 20:26:12.514771621 -0800
@@ -137,6 +137,9 @@
 	if (ret) {
 		ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n",
 			   id, ret);
+		dma_unmap_single(priv->ca->dma_device, addr,
+				 IPOIB_BUF_SIZE, DMA_FROM_DEVICE);
+		dev_kfree_skb_any(skb);
 		priv->rx_ring[id].skb = NULL;
 	}

From roland at topspin.com Wed Mar 2 21:31:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 2 Mar 2005 21:31:22 -0800
Subject: [openib-general] [PATCH][7/11] IB/ipoib: use list_for_each_entry_safe when required
In-Reply-To: <2005322131.6N8qBqgz1WuD4wnL@topspin.com>
Message-ID: <2005322131.K2SnvQsocHnkTwPm@topspin.com>

From: Shirley Ma

Change uses of list_for_each_entry() where the loop variable is freed
inside the loop to list_for_each_entry_safe().
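
The reason the "safe" iterator is needed can be sketched in plain userspace C. This is not the kernel's list implementation; struct item, push_item() and free_all_items() are hypothetical names. The idea is the same, though: the successor pointer is saved before the current element is freed, so the loop never reads from freed memory when it advances.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical singly linked list node, standing in for a list entry. */
struct item {
	struct item *next;
	int id;
};

/* Push a new node on the front of the list; returns the new head. */
static struct item *push_item(struct item *head, int id)
{
	struct item *it = malloc(sizeof(*it));

	if (!it)
		return head;
	it->id = id;
	it->next = head;
	return it;
}

/* Free every node and return how many were freed. */
static int free_all_items(struct item *head)
{
	struct item *cur, *next;
	int freed = 0;

	for (cur = head; cur; cur = next) {
		next = cur->next;	/* save successor before freeing cur */
		free(cur);
		freed++;
	}
	return freed;
}
```

Writing the loop as `cur = cur->next` in the for-statement instead would read cur->next after free(cur), which is the pattern the patch removes.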
Signed-off-by: Shirley Ma
Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2005-03-02 20:26:02.832873236 -0800
+++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2005-03-02 20:26:12.799709771 -0800
@@ -790,7 +790,7 @@

 	spin_unlock_irqrestore(&priv->lock, flags);

-	list_for_each_entry(mcast, &remove_list, list) {
+	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
 		ipoib_mcast_leave(dev, mcast);
 		ipoib_mcast_free(mcast);
 	}
@@ -902,7 +902,7 @@
 	spin_unlock_irqrestore(&priv->lock, flags);

 	/* We have to cancel outside of the spinlock */
-	list_for_each_entry(mcast, &remove_list, list) {
+	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
 		ipoib_mcast_leave(mcast->dev, mcast);
 		ipoib_mcast_free(mcast);
 	}

From roland at topspin.com Wed Mar 2 21:31:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 2 Mar 2005 21:31:22 -0800
Subject: [openib-general] [PATCH][8/11] IB/ipoib: rename global symbols
In-Reply-To: <2005322131.K2SnvQsocHnkTwPm@topspin.com>
Message-ID: <2005322131.OKEJHXn13XfMX2Aa@topspin.com>

Make IPoIB data_debug_level module parameter static to the single file
where it is used.  Also rename the IPoIB module parameter variable from
"debug_level" to "ipoib_debug_level".  This avoids possible name
clashes if IPoIB is built into the kernel.  We use module_param_named
so that the user-visible parameter names remain the same.

Signed-off-by: Tom Duffy
Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2005-03-02 20:26:02.744892334 -0800
+++ linux-export/drivers/infiniband/ulp/ipoib/ipoib.h	2005-03-02 20:26:13.207621227 -0800
@@ -308,11 +308,11 @@

 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG

-extern int debug_level;
+extern int ipoib_debug_level;

 #define ipoib_dbg(priv, format, arg...)			\
 	do {							\
-		if (debug_level > 0)				\
+		if (ipoib_debug_level > 0)			\
 			ipoib_printk(KERN_DEBUG, priv, format , ## arg); \
 	} while (0)
 #define ipoib_dbg_mcast(priv, format, arg...)
\
--- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2005-03-02 20:26:12.514771621 -0800
+++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2005-03-02 20:26:13.208621010 -0800
@@ -40,7 +40,7 @@
 #include "ipoib.h"

 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
-int data_debug_level;
+static int data_debug_level;

 module_param(data_debug_level, int, 0644);
 MODULE_PARM_DESC(data_debug_level,
--- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2005-03-02 20:26:02.744892334 -0800
+++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_main.c	2005-03-02 20:26:13.207621227 -0800
@@ -51,9 +51,9 @@
 MODULE_LICENSE("Dual BSD/GPL");

 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
-int debug_level;
+int ipoib_debug_level;

-module_param(debug_level, int, 0644);
+module_param_named(debug_level, ipoib_debug_level, int, 0644);
 MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0");
 #endif

From roland at topspin.com Wed Mar 2 21:31:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 2 Mar 2005 21:31:22 -0800
Subject: [openib-general] [PATCH][9/11] IB/ipoib: small fixes
In-Reply-To: <2005322131.OKEJHXn13XfMX2Aa@topspin.com>
Message-ID: <2005322131.kDy0lnKe0rjDV0tv@topspin.com>

From: Shirley Ma

IPoIB small fixes: Initialize path->ah to NULL, and fix dereference
after free of neigh in error path of neigh_add_path().
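
The second fix is an ordering issue: the last access through a pointer has to happen before the object is freed. A minimal userspace sketch of that ordering, assuming hypothetical types (struct owner, struct entry) rather than the IPoIB neighbour structures, might look like this:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a neighbour and the entry that points at it. */
struct owner {
	void (*destructor)(void *);
};

struct entry {
	struct owner *owner;
};

/* Placeholder callback so the pointer starts out non-NULL. */
static void noop_destructor(void *p)
{
	(void)p;
}

/* Tear down an entry: clear the owner's callback first, then free. */
static void drop_entry(struct entry *e)
{
	e->owner->destructor = NULL;	/* last access through e */
	free(e);			/* e must not be touched after this */
}
```

Swapping the two statements in drop_entry() would dereference e after it has been freed, which is the kind of error the patch corrects.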
Signed-off-by: Shirley Ma
Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2005-03-02 20:26:13.207621227 -0800
+++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_main.c	2005-03-02 20:26:13.653524436 -0800
@@ -346,8 +346,9 @@
 	if (!path)
 		return NULL;

-	path->dev = dev;
+	path->dev = dev;
 	path->pathrec.dlid = 0;
+	path->ah = NULL;

 	skb_queue_head_init(&path->queue);

@@ -450,8 +451,8 @@
 err:
 	*to_ipoib_neigh(skb->dst->neighbour) = NULL;
 	list_del(&neigh->list);
-	kfree(neigh);
 	neigh->neighbour->ops->destructor = NULL;
+	kfree(neigh);

 	++priv->stats.tx_dropped;
 	dev_kfree_skb_any(skb);

From roland at topspin.com Wed Mar 2 21:31:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 2 Mar 2005 21:31:22 -0800
Subject: [openib-general] [PATCH][11/11] IB/ipoib: fix locking on path deletion
In-Reply-To: <2005322131.HYDDjSPPN3QdHwmF@topspin.com>
Message-ID: <2005322131.6juV8g9K5T9OJ7gu@topspin.com>

Fix up locking for IPoIB path table.  Make sure that destruction of
address handles, neighbour info and path structs is locked properly to
avoid races and deadlocks.
(Problem originally diagnosed by Shirley Ma)

Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2005-03-02 20:26:13.977454122 -0800
+++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_main.c	2005-03-02 20:26:14.301383808 -0800
@@ -215,16 +215,25 @@
 	return 0;
 }

-static void __path_free(struct net_device *dev, struct ipoib_path *path)
+static void path_free(struct net_device *dev, struct ipoib_path *path)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_neigh *neigh, *tn;
 	struct sk_buff *skb;
+	unsigned long flags;

 	while ((skb = __skb_dequeue(&path->queue)))
 		dev_kfree_skb_irq(skb);

+	spin_lock_irqsave(&priv->lock, flags);
+
 	list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
+		/*
+		 * It's safe to call ipoib_put_ah() inside priv->lock
+		 * here, because we know that path->ah will always
+		 * hold one more reference, so ipoib_put_ah() will
+		 * never do more than decrement the ref count.
+		 */
 		if (neigh->ah)
 			ipoib_put_ah(neigh->ah);
 		*to_ipoib_neigh(neigh->neighbour) = NULL;
@@ -232,11 +241,11 @@
 		kfree(neigh);
 	}

+	spin_unlock_irqrestore(&priv->lock, flags);
+
 	if (path->ah)
 		ipoib_put_ah(path->ah);

-	rb_erase(&path->rb_node, &priv->path_tree);
-	list_del(&path->list);
 	kfree(path);
 }

@@ -248,15 +257,20 @@
 	unsigned long flags;

 	spin_lock_irqsave(&priv->lock, flags);
+
 	list_splice(&priv->path_list, &remove_list);
 	INIT_LIST_HEAD(&priv->path_list);
+
+	list_for_each_entry(path, &remove_list, list)
+		rb_erase(&path->rb_node, &priv->path_tree);
+
 	spin_unlock_irqrestore(&priv->lock, flags);

 	list_for_each_entry_safe(path, tp, &remove_list, list) {
 		if (path->query)
 			ib_sa_cancel_query(path->query_id, path->query);
 		wait_for_completion(&path->done);
-		__path_free(dev, path);
+		path_free(dev, path);
 	}
 }

@@ -361,8 +375,6 @@
 	path->pathrec.pkey      = cpu_to_be16(priv->pkey);
 	path->pathrec.numb_path = 1;

-	__path_add(dev, path);
-
 	return path;
 }

@@ -422,6 +434,8 @@
 					 (union ib_gid *) (skb->dst->neighbour->ha + 4));
 		if (!path)
 			goto
err;
+
+		__path_add(dev, path);
 	}

 	list_add_tail(&neigh->list, &path->neigh_list);
@@ -497,8 +511,12 @@
 			skb_push(skb, sizeof *phdr);
 			__skb_queue_tail(&path->queue, skb);

-			if (path_rec_start(dev, path))
-				__path_free(dev, path);
+			if (path_rec_start(dev, path)) {
+				spin_unlock(&priv->lock);
+				path_free(dev, path);
+				return;
+			} else
+				__path_add(dev, path);
 		} else {
 			++priv->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
@@ -658,7 +676,7 @@

 static void ipoib_neigh_destructor(struct neighbour *n)
 {
-	struct ipoib_neigh *neigh = *to_ipoib_neigh(n);
+	struct ipoib_neigh *neigh;
 	struct ipoib_dev_priv *priv = netdev_priv(n->dev);
 	unsigned long flags;
 	struct ipoib_ah *ah = NULL;
@@ -670,6 +688,7 @@

 	spin_lock_irqsave(&priv->lock, flags);

+	neigh = *to_ipoib_neigh(n);
 	if (neigh) {
 		if (neigh->ah)
 			ah = neigh->ah;

From roland at topspin.com Wed Mar 2 21:31:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 2 Mar 2005 21:31:22 -0800
Subject: [openib-general] [PATCH][10/11] IB/ipoib: don't call ipoib_put_ah with lock held
In-Reply-To: <2005322131.kDy0lnKe0rjDV0tv@topspin.com>
Message-ID: <2005322131.HYDDjSPPN3QdHwmF@topspin.com>

From: Shirley Ma

ipoib_put_ah() may call ipoib_free_ah(), which might take the device's
lock.  Therefore we need to make sure we don't call ipoib_put_ah()
when holding the lock already.
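
The pattern described here (remember the reference while the lock is held, drop it only after unlocking) can be sketched in userspace as follows. Everything in the sketch is an assumption for illustration: toy_dev, toy_ah and the toy_* helpers are hypothetical, and the "lock" is a plain flag with assertions that models a non-recursive lock, not a real spinlock.

```c
#include <assert.h>
#include <stddef.h>

/* Toy device with a non-recursive "lock" modelled as a flag. */
struct toy_dev {
	int lock_held;
	int ah_freed;
};

/* Toy reference-counted address handle. */
struct toy_ah {
	int refs;
	struct toy_dev *dev;
};

static void toy_lock(struct toy_dev *d)
{
	assert(!d->lock_held);	/* taking it twice would be a deadlock */
	d->lock_held = 1;
}

static void toy_unlock(struct toy_dev *d)
{
	assert(d->lock_held);
	d->lock_held = 0;
}

/* Dropping the last reference needs the device lock to free the handle. */
static void toy_put_ah(struct toy_ah *ah)
{
	if (--ah->refs == 0) {
		toy_lock(ah->dev);
		ah->dev->ah_freed++;
		toy_unlock(ah->dev);
	}
}

/* Detach the handle under the lock, but drop the reference afterwards. */
static void toy_neigh_destroy(struct toy_dev *d, struct toy_ah **slot)
{
	struct toy_ah *ah = NULL;

	toy_lock(d);
	if (*slot) {
		ah = *slot;	/* remember it; do not put while locked */
		*slot = NULL;
	}
	toy_unlock(d);

	if (ah)
		toy_put_ah(ah);
}
```

Calling toy_put_ah() between toy_lock() and toy_unlock() in toy_neigh_destroy() would trip the assertion in toy_lock(), which stands in for the deadlock the patch avoids.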
Signed-off-by: Shirley Ma
Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2005-03-02 20:26:13.653524436 -0800
+++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_main.c	2005-03-02 20:26:13.977454122 -0800
@@ -661,6 +661,7 @@
 	struct ipoib_neigh *neigh = *to_ipoib_neigh(n);
 	struct ipoib_dev_priv *priv = netdev_priv(n->dev);
 	unsigned long flags;
+	struct ipoib_ah *ah = NULL;

 	ipoib_dbg(priv,
 		  "neigh_destructor for %06x " IPOIB_GID_FMT "\n",
@@ -671,13 +672,16 @@

 	if (neigh) {
 		if (neigh->ah)
-			ipoib_put_ah(neigh->ah);
+			ah = neigh->ah;
 		list_del(&neigh->list);
 		*to_ipoib_neigh(n) = NULL;
 		kfree(neigh);
 	}

 	spin_unlock_irqrestore(&priv->lock, flags);
+
+	if (ah)
+		ipoib_put_ah(ah);
 }

 static int ipoib_neigh_setup(struct neighbour *neigh)
--- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2005-03-02 20:26:12.799709771 -0800
+++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2005-03-02 20:26:13.977454122 -0800
@@ -93,6 +93,8 @@
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_neigh *neigh, *tmp;
 	unsigned long flags;
+	LIST_HEAD(ah_list);
+	struct ipoib_ah *ah, *tah;

 	ipoib_dbg_mcast(netdev_priv(dev),
 			"deleting multicast group " IPOIB_GID_FMT "\n",
@@ -101,7 +103,8 @@
 	spin_lock_irqsave(&priv->lock, flags);

 	list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) {
-		ipoib_put_ah(neigh->ah);
+		if (neigh->ah)
+			list_add_tail(&neigh->ah->list, &ah_list);
 		*to_ipoib_neigh(neigh->neighbour) = NULL;
 		neigh->neighbour->ops->destructor = NULL;
 		kfree(neigh);
@@ -109,6 +112,9 @@

 	spin_unlock_irqrestore(&priv->lock, flags);

+	list_for_each_entry_safe(ah, tah, &ah_list, list)
+		ipoib_put_ah(ah);
+
 	if (mcast->ah)
 		ipoib_put_ah(mcast->ah);

From halr at voltaire.com Wed Mar 2 22:07:31 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Mar 2005 01:07:31 -0500
Subject: [openib-general] "arping" failing over ipoib
In-Reply-To: <521xaxs5p0.fsf@topspin.com>
References: <1109806789.4913.43.camel@duffman>
	<521xaxs5p0.fsf@topspin.com>
Message-ID: <1109829887.4645.132.camel@localhost.localdomain>

On Wed, 2005-03-02 at 23:40, Roland Dreier wrote:
> What packet does arping create?

I think it creates an ARP first. It does everything "directly".

> Unfortunately, because a "normal" IPoIB packet doesn't include
> any encapsulation beyond the 4 bytes of ethertype/reserved,
> it's a little difficult for userspace to send broadcast packets.

For ARP specifically (ethertype 0x0806), there is no way without
additional information for the IPoIB driver to know whether it is
intended to be broadcast or unicast.

I'm also not sure how responses would make it back to user space either
as I would expect arping to be looking for the response or lack thereof.

> If arping tries to create an ethernet-like header,
> then the IPoIB driver is going to get a little confused.

That's what is currently going on and is why the group gets confused.
The driver is likely looking into this different packet incorrectly. It
does recognize that it is not a unicast packet.

-- Hal

From shaharf at voltaire.com Thu Mar 3 02:45:03 2005
From: shaharf at voltaire.com (shaharf)
Date: Thu, 3 Mar 2005 12:45:03 +0200
Subject: [openib-general] IB Address Translation service
Message-ID:

>
> The advantage of ATS is that it "just works" whether wired point to
> point, or via a switch, or whatever. It requires no central
> administration,
> works as transparently as ARP and ND, and supports IP addressing so
> applications don't have any ambiguity in how they resolve names.
>
> If we get rid of ATS, what do we replace it with? Raw IB GID's from
> the application??
>
> Tom.
>

Tom, I am with you regarding that subject.
Even though both IB-ARP and ATS should be considered hacks, I think that
IB-ARP is much more problematic and it does not deliver a complete
solution: it doesn't contain all the data required, it does not solve the
SM load problem (due to the requirement for the path record query) and it
is a mechanism that contradicts the IB management architecture.

I would fix ATS to include some missing fields, and maybe define a
unified ATS + path query for performance.

The SM/SA scalability problem should be solved by distributing the SA
part of it, probably using a single-writer/multiple-reader model and a
simple cache coherency protocol to allow efficient caching by sub SA
agents or even hosts. This type of distribution is also requested
clearly in the SOW section 1.3.1.

Shahar

From Thomas.Talpey at netapp.com Thu Mar 3 06:46:53 2005
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Thu, 03 Mar 2005 09:46:53 -0500
Subject: [openib-general] putting in dead wood for DAPL and similarabomination
In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003B54E09@orsmsx408>
References: <1AC79F16F5C5284499BB9591B33D6F0003B54E09@orsmsx408>
Message-ID: <6.2.1.2.2.20050303094159.050933a0@exnane01.nane.netapp.com>

At 02:22 PM 3/2/2005, Woodruff, Robert J wrote:
>I think the point is that only one of those interconnects (IB) is
>in the kernel, the rest are proprietary. Do any of the other RDMA
>interconnect vendors plan to submit their code for inclusion into Linux
>in the near future ?

Yes - take a look at where you can freely download their complete
stack, including drivers for their iWARP NIC, plus MPI and DAPL API
libraries. It runs on 2.6.10 and many versions back (including 2.4.x).

I'll let Clem speak for his plans to submit it, however. I'm just a
satisfied user.

Tom.
From Thomas.Talpey at netapp.com Thu Mar 3 06:56:49 2005
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Thu, 03 Mar 2005 09:56:49 -0500
Subject: [openib-general] putting in dead wood for DAPL and similarabomination
In-Reply-To: <20050303034827.GA9092@lst.de>
References: <35EA21F54A45CB47B879F21A91F4862F3FAD9E@taurus.voltaire.com> <20050303034827.GA9092@lst.de>
Message-ID: <6.2.1.2.2.20050303094748.03422620@exnane01.nane.netapp.com>

At 10:48 PM 3/2/2005, Christoph Hellwig wrote:
>The current iSER code is 10928 LOC, add to that 22155 LOC of kDAPL (not
>including the actual provider for IB) and 5822 LOC linux-iscsi kernel
>code. Compare that to the 25412 LOC total for drivers/infiniband in Linux
>2.6.11.

Is this just about LOCs? I think you should wait to see how large kDAPL is *after* it has been properly integrated into the kernel before judging that. At present, the code is heavily commented and fully generalized to aid porting to multiple operating systems. It will look quite different once it is freed of these attributes. Also, I'll point out that there is extensive debug and trace code throughout, all of which is optional.

BTW I agree with Yaron that one copy of code in DAPL replaces N copies of the same code in all RDMA drivers (IB, iWARP etc), or worse, upper layers. Which is why NFS/RDMA needs it.

Tom.

From clemc at ammasso.com Thu Mar 3 07:24:35 2005
From: clemc at ammasso.com (Clem Cole)
Date: Thu, 3 Mar 2005 10:24:35 -0500
Subject: [openib-general] putting in dead wood for DAPL and similarabomination
Message-ID: <8E9D028761D8264D910612167E8457E8742C24@mail2>

Thanks for the kind words Tom. Indeed, Ammasso is fully open source, and we are thrilled at the idea of getting DAPL into the mainline. We went GA with our 1.2 release recently and have had our AMSO100 hardware and associated iWARP software in the hands of about 60 different sites - both HPTC and commercial. Feel free to download the code and have a look.
As Tom says, it has been tested on kernels as far back as those for RH 7.3 and as modern as kernel.org's 2.6.10. We have tested on both x86 and x86-64.

As Tom says, we too (as well as our customers) are satisfied users of the DAPL interface. DAPL certainly continues to show that it works pretty well for us and is easier to use than the low level QP verbs layer. Recently an ISV moved a hunk of code from another interface (we do not know which) to our DAPL implementation in about 2.5 weeks.

For the record, our kDAPL and uDAPL are derived from the DAT reference code. Besides writing the Ammasso specific provider code, since IB != iWARP we did have to make some small changes to common code to handle iWARP-specific differences. We are working with Tom, Arkady and the rest of the DAPL community to get the iWARP changes back into the base to help ensure that the DAPL interface is not considered just an IB thing or that people come to the incorrect conclusion that IB verbs are ``good enough.''

For whatever it's worth, we are actively working on not only being DAPL/iWARP >>compliant<< (working with UNH etc) but also >>compatible<< - going to plugfests and working with actual ISVs that have written Linux code that relies on DAPL/iWARP -- i.e. not only do we pass the full uDAPL/kDAPL test suite, we have been working with a number of different large commercial vendors (whose source code we have never seen) to get their tests as well as their >>applications<<, which were designed to run over DAPL, working. Since these codes had previously been tested only on IB (i.e. we are the first shipping iWARP provider), we feel pretty good that our DAPL works as expected.

Note: our plan is to increase the number of applications that use the code as quickly as possible; but we are a small startup and are bandwidth-limited by the number of ISVs we can work with directly at one time.

If you have specific questions, feel free to take them off line to me.

Clem Cole Dist.
Eng PS If you are interested in getting your hands on hardware, drop me a line and I'll make connection to our sales guys. -----Original Message----- From: Talpey, Thomas [mailto:Thomas.Talpey at netapp.com] Sent: Thursday, March 03, 2005 9:47 AM To: openib-general at openib.org Cc: Clem Cole Subject: RE: [openib-general] putting in dead wood for DAPL and similarabomination At 02:22 PM 3/2/2005, Woodruff, Robert J wrote: >I think the point is that only one of those interconnects (IB) is >in the kernel, the rest are proprietary. Do any of the other RDMA >interconnect vendors plan to submit their code for inclusion into Linux >in the near future ? Yes - take a look at where you can freely download their complete stack, including drivers for their iWARP NIC, plus MPI and DAPL API libraries. It runs on 2.6.10 and many versions back (including 2.4.x). I'll let Clem speak for his plans to submit it, however. I'm just a satisfied user. Tom. From yaronh at voltaire.com Thu Mar 3 07:23:00 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Thu, 3 Mar 2005 17:23:00 +0200 Subject: [openib-general] putting in dead wood for DAPL and similarabomination Message-ID: <35EA21F54A45CB47B879F21A91F4862F3FAE4E@taurus.voltaire.com> > -----Original Message----- > From: Christoph Hellwig [mailto:hch at lst.de] > Sent: Thursday, March 03, 2005 5:48 AM > To: Yaron Haviv > Cc: Christoph Hellwig; James Lentini; openib-general at openib.org > Subject: Re: [openib-general] putting in dead wood for DAPL and > similarabomination > > The current iSER code is 10928 LOC, add to that 22155 LOC of kDAPL (not > including the actual provider for IB) and 5822 LOC linux-iscsi kernel > code. Compare that to the 25412 LOC total for drivers/infiniband in Linux > 2.6.11. As Tom indicated we expect a significant code shrink for kDAPL, it will be much more Linux friendly when we are done with it, some parts will be re-written. 
Also the iSER code is not optimal in terms of LOC, and we can clean up some redundant code if we are in an LOC contest. I believe that after we glue all the layers together we will focus on reducing LOC and test code.

> Here's the challenge: if someone gets me the funding I'll write
> complete iSER of IB implementation in less than 10k LOC based on the
> open-iscsi code if someone gets me the funding.

You know there is also the challenge of making it work, perform, interoperate, and support some features; not everything is about LOC :)

Anyway, thanks for offering us support; we may take you up on that some day.

Yaron

From roland at topspin.com Thu Mar 3 12:43:48 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 12:43:48 -0800
Subject: [openib-general] mr_table.max_mtt_order
In-Reply-To: <526509s5ty.fsf@topspin.com> (Roland Dreier's message of "Wed, 02 Mar 2005 20:37:13 -0800")
References: <20050302104311.GD25029@mellanox.co.il> <526509s5ty.fsf@topspin.com>
Message-ID: <52hdjspiij.fsf@topspin.com>

Roland> I thought about this a little. I think that any inline
Roland> function forces someone reading the code to look up what
Roland> it does, no matter how descriptive the name we come up
Roland> with. I think it would be better to use fls() directly.
Roland> I'm already in the habit of using ffs() to compute log_2
Roland> of powers of two but for some reason I never remember
Roland> fls().

Actually, since there's already roundup_pow_of_two() in kernel.h, let's use that.

- R.

From mst at mellanox.co.il Thu Mar 3 13:05:25 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 Mar 2005 23:05:25 +0200
Subject: [openib-general] mr_table.max_mtt_order
In-Reply-To: <52hdjspiij.fsf@topspin.com>
References: <20050302104311.GD25029@mellanox.co.il> <526509s5ty.fsf@topspin.com> <52hdjspiij.fsf@topspin.com>
Message-ID: <20050303210525.GA4022@mellanox.co.il>

Quoting r.
Roland Dreier : > Subject: Re: [openib-general] mr_table.max_mtt_order > > Roland> I thought about this a little. I think that any inline > Roland> function forces someone reading the code to look up what > Roland> it does, no matter how descriptive the name we come up > Roland> with. I think it would be better to use fls() directly. > Roland> I'm already in the habit of using ffs() to compute log_2 > Roland> of powers of two but for some reason I never remember > Roland> fls(). > > Actually, since there's already roundup_pow_of_two() in kernel.h, > let's use that. > > - R. > I plan to post a patch on Sunday. -- MST - Michael S. Tsirkin From roland at topspin.com Thu Mar 3 15:20:26 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:26 -0800 Subject: [openib-general] [PATCH][0/26] InfiniBand merge Message-ID: <2005331520.b7ycIGGfSwBBRSED@topspin.com> Here's another series of patches that applies on top of the fixes I posted yesterday. This series syncs the kernel with everything ready for merging from the OpenIB subversion tree. Most of these patches add more support for "mem-free" mode to mthca. This allows PCI Express HCAs to operate by storing context in the host system's memory rather than in dedicated memory attached to the HCA. With this series of patches, mem-free mode is usable -- in fact, this series of patches is being posted from a system whose only network connection is IP-over-IB running on a mem-free HCA. Thanks, Roland From roland at topspin.com Thu Mar 3 15:20:26 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:26 -0800 Subject: [openib-general] [PATCH][1/26] IB: fix ib_find_cached_gid() port numbering In-Reply-To: <2005331520.b7ycIGGfSwBBRSED@topspin.com> Message-ID: <2005331520.OFd0tTycEIjc5XlW@topspin.com> From: Sean Hefty Fix ib_find_cached_gid() to return the correct port number relative to the port numbering used by the device. 
Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/cache.c 2005-03-02 20:53:21.000000000 -0800 +++ linux-export/drivers/infiniband/core/cache.c 2005-03-03 15:02:57.180310444 -0800 @@ -114,7 +114,7 @@ cache = device->cache.gid_cache[p]; for (i = 0; i < cache->table_len; ++i) { if (!memcmp(gid, &cache->table[i], sizeof *gid)) { - *port_num = p; + *port_num = p + start_port(device); if (index) *index = i; ret = 0; From roland at topspin.com Thu Mar 3 15:20:26 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:26 -0800 Subject: [openib-general] [PATCH][2/26] IB/mthca: CQ minor tweaks In-Reply-To: <2005331520.OFd0tTycEIjc5XlW@topspin.com> Message-ID: <2005331520.GI2ijwUAkM9zyNyy@topspin.com> From: "Michael S. Tsirkin" Clean up CQ code so that we only calculate the address of a CQ entry once when using it. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-02-03 16:59:43.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:51.832670421 -0800 @@ -147,20 +147,21 @@ + (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE; } -static inline int cqe_sw(struct mthca_cq *cq, int i) +static inline struct mthca_cqe *cqe_sw(struct mthca_cq *cq, int i) { - return !(MTHCA_CQ_ENTRY_OWNER_HW & - get_cqe(cq, i)->owner); + struct mthca_cqe *cqe; + cqe = get_cqe(cq, i); + return (MTHCA_CQ_ENTRY_OWNER_HW & cqe->owner) ? 
NULL : cqe; } -static inline int next_cqe_sw(struct mthca_cq *cq) +static inline struct mthca_cqe *next_cqe_sw(struct mthca_cq *cq) { return cqe_sw(cq, cq->cons_index); } -static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +static inline void set_cqe_hw(struct mthca_cqe *cqe) { - get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; + cqe->owner = MTHCA_CQ_ENTRY_OWNER_HW; } static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, @@ -388,7 +389,8 @@ int free_cqe = 1; int err = 0; - if (!next_cqe_sw(cq)) + cqe = next_cqe_sw(cq); + if (!cqe) return -EAGAIN; /* @@ -397,8 +399,6 @@ */ rmb(); - cqe = get_cqe(cq, cq->cons_index); - if (0) { mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), @@ -509,8 +509,8 @@ entry->status = IB_WC_SUCCESS; out: - if (free_cqe) { - set_cqe_hw(cq, cq->cons_index); + if (likely(free_cqe)) { + set_cqe_hw(cqe); ++(*freed); cq->cons_index = (cq->cons_index + 1) & cq->ibcq.cqe; } @@ -655,7 +655,7 @@ } for (i = 0; i < nent; ++i) - set_cqe_hw(cq, i); + set_cqe_hw(get_cqe(cq, i)); cq->cqn = mthca_alloc(&dev->cq_table.alloc); if (cq->cqn == -1) @@ -773,7 +773,7 @@ int j; printk(KERN_ERR "context for CQN %x (cons index %x, next sw %d)\n", - cq->cqn, cq->cons_index, next_cqe_sw(cq)); + cq->cqn, cq->cons_index, !!next_cqe_sw(cq)); for (j = 0; j < 16; ++j) printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); } From roland at topspin.com Thu Mar 3 15:20:26 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:26 -0800 Subject: [openib-general] [PATCH][3/26] IB/mthca: improve CQ locking part 1 In-Reply-To: <2005331520.GI2ijwUAkM9zyNyy@topspin.com> Message-ID: <2005331520.cHJfJcRbBu1fFgB6@topspin.com> From: Michael S. Tsirkin Avoid taking the CQ table lock in the fast path by using synchronize_irq() after removing a CQ from the table to make sure that no completion events are still in progress.
This gets a nice speedup (about 4%) in IP over IB on my hardware. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:51.832670421 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:52.368554099 -0800 @@ -33,6 +33,7 @@ */ #include +#include #include @@ -181,11 +182,7 @@ { struct mthca_cq *cq; - spin_lock(&dev->cq_table.lock); cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); - if (cq) - atomic_inc(&cq->refcount); - spin_unlock(&dev->cq_table.lock); if (!cq) { mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); @@ -193,9 +190,6 @@ } cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); - - if (atomic_dec_and_test(&cq->refcount)) - wake_up(&cq->wait); } void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) @@ -783,6 +777,11 @@ cq->cqn & (dev->limits.num_cqs - 1)); spin_unlock_irq(&dev->cq_table.lock); + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) + synchronize_irq(dev->eq_table.eq[MTHCA_EQ_COMP].msi_x_vector); + else + synchronize_irq(dev->pdev->irq); + atomic_dec(&cq->refcount); wait_event(cq->wait, !atomic_read(&cq->refcount)); From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][4/26] IB/mthca: improve CQ locking part 2 In-Reply-To: <2005331520.cHJfJcRbBu1fFgB6@topspin.com> Message-ID: <2005331520.lXKA9W9JoVIrmqB8@topspin.com> From: Michael S. Tsirkin Locking during the poll cq operation can be reduced by locking the cq while qp is being removed from the qp array. This also avoids an extra atomic operation for reference counting. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:52.368554099 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:52.923433653 -0800 @@ -418,14 +418,14 @@ spin_unlock(&(*cur_qp)->lock); } - spin_lock(&dev->qp_table.lock); + /* + * We do not have to take the QP table lock here, + * because CQs will be locked while QPs are removed + * from the table. + */ *cur_qp = mthca_array_get(&dev->qp_table.qp, be32_to_cpu(cqe->my_qpn) & (dev->limits.num_qps - 1)); - if (*cur_qp) - atomic_inc(&(*cur_qp)->refcount); - spin_unlock(&dev->qp_table.lock); - if (!*cur_qp) { mthca_warn(dev, "CQ entry for unknown QP %06x\n", be32_to_cpu(cqe->my_qpn) & 0xffffff); @@ -537,12 +537,8 @@ inc_cons_index(dev, cq, freed); } - if (qp) { + if (qp) spin_unlock(&qp->lock); - if (atomic_dec_and_test(&qp->refcount)) - wake_up(&qp->wait); - } - spin_unlock_irqrestore(&cq->lock, flags); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-02-03 16:59:28.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:12:52.924433436 -0800 @@ -1083,9 +1083,21 @@ return 0; err_out_free: - spin_lock_irq(&dev->qp_table.lock); + /* + * Lock CQs here, so that CQ polling code can do QP lookup + * without taking a lock. 
+ */ + spin_lock_irq(&send_cq->lock); + if (send_cq != recv_cq) + spin_lock(&recv_cq->lock); + + spin_lock(&dev->qp_table.lock); mthca_array_clear(&dev->qp_table.qp, mqpn); - spin_unlock_irq(&dev->qp_table.lock); + spin_unlock(&dev->qp_table.lock); + + if (send_cq != recv_cq) + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); err_out: dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, @@ -1100,11 +1112,28 @@ u8 status; int size; int i; + struct mthca_cq *send_cq; + struct mthca_cq *recv_cq; + + send_cq = to_mcq(qp->ibqp.send_cq); + recv_cq = to_mcq(qp->ibqp.recv_cq); - spin_lock_irq(&dev->qp_table.lock); + /* + * Lock CQs here, so that CQ polling code can do QP lookup + * without taking a lock. + */ + spin_lock_irq(&send_cq->lock); + if (send_cq != recv_cq) + spin_lock(&recv_cq->lock); + + spin_lock(&dev->qp_table.lock); mthca_array_clear(&dev->qp_table.qp, qp->qpn & (dev->limits.num_qps - 1)); - spin_unlock_irq(&dev->qp_table.lock); + spin_unlock(&dev->qp_table.lock); + + if (send_cq != recv_cq) + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); atomic_dec(&qp->refcount); wait_event(qp->wait, !atomic_read(&qp->refcount)); From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][5/26] IB/mthca: CQ cleanups In-Reply-To: <2005331520.lXKA9W9JoVIrmqB8@topspin.com> Message-ID: <2005331520.bkPiyqSCQe0LOju5@topspin.com> Simplify some of the code for CQ handling slightly. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:52.923433653 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:53.538300187 -0800 @@ -150,9 +150,8 @@ static inline struct mthca_cqe *cqe_sw(struct mthca_cq *cq, int i) { - struct mthca_cqe *cqe; - cqe = get_cqe(cq, i); - return (MTHCA_CQ_ENTRY_OWNER_HW & cqe->owner) ? 
NULL : cqe; + struct mthca_cqe *cqe = get_cqe(cq, i); + return MTHCA_CQ_ENTRY_OWNER_HW & cqe->owner ? NULL : cqe; } static inline struct mthca_cqe *next_cqe_sw(struct mthca_cq *cq) @@ -378,7 +377,7 @@ struct mthca_wq *wq; struct mthca_cqe *cqe; int wqe_index; - int is_error = 0; + int is_error; int is_send; int free_cqe = 1; int err = 0; @@ -401,12 +400,9 @@ dump_cqe(cqe); } - if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == - MTHCA_ERROR_CQE_OPCODE_MASK) { - is_error = 1; - is_send = cqe->opcode & 1; - } else - is_send = cqe->is_send & 0x80; + is_error = (cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK; + is_send = is_error ? cqe->opcode & 0x01 : cqe->is_send & 0x80; if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { if (*cur_qp) { From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][6/26] IB: remove unsignaled receives In-Reply-To: <2005331520.bkPiyqSCQe0LOju5@topspin.com> Message-ID: <2005331520.psAuTRchMaqO6dem@topspin.com> From: Michael S. Tsirkin Remove support for unsignaled receive requests. This is a non-standard extension to the IB spec that is not used by any known applications or protocols, and is not supported by newer hardware. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/mad.c 2005-03-02 20:53:21.000000000 -0800 +++ linux-export/drivers/infiniband/core/mad.c 2005-03-03 14:12:54.671054304 -0800 @@ -2191,7 +2191,6 @@ recv_wr.next = NULL; recv_wr.sg_list = &sg_list; recv_wr.num_sge = 1; - recv_wr.recv_flags = IB_RECV_SIGNALED; do { /* Allocate and map receive buffer */ @@ -2386,7 +2385,6 @@ qp_init_attr.send_cq = qp_info->port_priv->cq; qp_init_attr.recv_cq = qp_info->port_priv->cq; qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-01-25 20:48:48.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:54.672054087 -0800 @@ -369,14 +369,12 @@ struct mthca_cq *recv_cq, enum ib_qp_type type, enum ib_sig_type send_policy, - enum ib_sig_type recv_policy, struct mthca_qp *qp); int mthca_alloc_sqp(struct mthca_dev *dev, struct mthca_pd *pd, struct mthca_cq *send_cq, struct mthca_cq *recv_cq, enum ib_sig_type send_policy, - enum ib_sig_type recv_policy, int qpn, int port, struct mthca_sqp *sqp); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-01-25 20:49:23.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:12:54.673053870 -0800 @@ -343,7 +343,7 @@ to_mcq(init_attr->send_cq), to_mcq(init_attr->recv_cq), init_attr->qp_type, init_attr->sq_sig_type, - init_attr->rq_sig_type, qp); + qp); qp->ibqp.qp_num = qp->qpn; break; } @@ -364,7 +364,7 @@ err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), to_mcq(init_attr->send_cq), to_mcq(init_attr->recv_cq), - init_attr->sq_sig_type, init_attr->rq_sig_type, + init_attr->sq_sig_type, qp->ibqp.qp_num, init_attr->port_num, to_msqp(qp)); break; --- 
linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-01-25 20:47:46.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:12:54.674053653 -0800 @@ -154,7 +154,6 @@ void *last; int max_gs; int wqe_shift; - enum ib_sig_type policy; }; struct mthca_qp { @@ -172,6 +171,7 @@ struct mthca_wq rq; struct mthca_wq sq; + enum ib_sig_type sq_policy; int send_wqe_offset; u64 *wrid; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:12:52.924433436 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:12:54.675053436 -0800 @@ -690,7 +690,7 @@ MTHCA_QP_BIT_SRE | MTHCA_QP_BIT_SWE | MTHCA_QP_BIT_SAE); - if (qp->sq.policy == IB_SIGNAL_ALL_WR) + if (qp->sq_policy == IB_SIGNAL_ALL_WR) qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); if (attr_mask & IB_QP_RETRY_CNT) { qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); @@ -778,8 +778,8 @@ qp->resp_depth = attr->max_rd_atomic; } - if (qp->rq.policy == IB_SIGNAL_ALL_WR) - qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); @@ -977,7 +977,6 @@ struct mthca_cq *send_cq, struct mthca_cq *recv_cq, enum ib_sig_type send_policy, - enum ib_sig_type recv_policy, struct mthca_qp *qp) { int err; @@ -987,8 +986,7 @@ qp->state = IB_QPS_RESET; qp->atomic_rd_en = 0; qp->resp_depth = 0; - qp->sq.policy = send_policy; - qp->rq.policy = recv_policy; + qp->sq_policy = send_policy; qp->rq.cur = 0; qp->sq.cur = 0; qp->rq.next = 0; @@ -1008,7 +1006,6 @@ struct mthca_cq *recv_cq, enum ib_qp_type type, enum ib_sig_type send_policy, - enum ib_sig_type recv_policy, struct mthca_qp *qp) { int err; @@ -1025,7 +1022,7 @@ return -ENOMEM; err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, - send_policy, recv_policy, 
qp); + send_policy, qp); if (err) { mthca_free(&dev->qp_table.alloc, qp->qpn); return err; @@ -1044,7 +1041,6 @@ struct mthca_cq *send_cq, struct mthca_cq *recv_cq, enum ib_sig_type send_policy, - enum ib_sig_type recv_policy, int qpn, int port, struct mthca_sqp *sqp) @@ -1073,8 +1069,7 @@ sqp->qp.transport = MLX; err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, - send_policy, recv_policy, - &sqp->qp); + send_policy, &sqp->qp); if (err) goto err_out_free; @@ -1495,9 +1490,7 @@ ((struct mthca_next_seg *) wqe)->nda_op = 0; ((struct mthca_next_seg *) wqe)->ee_nds = cpu_to_be32(MTHCA_NEXT_DBD); - ((struct mthca_next_seg *) wqe)->flags = - (wr->recv_flags & IB_RECV_SIGNALED) ? - cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + ((struct mthca_next_seg *) wqe)->flags = 0; wqe += sizeof (struct mthca_next_seg); size = sizeof (struct mthca_next_seg) / 16; --- linux-export.orig/drivers/infiniband/include/ib_verbs.h 2005-01-25 20:47:00.000000000 -0800 +++ linux-export/drivers/infiniband/include/ib_verbs.h 2005-03-03 14:12:54.669054738 -0800 @@ -73,7 +73,6 @@ IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), IB_DEVICE_SRQ_RESIZE = (1<<13), IB_DEVICE_N_NOTIFY_CQ = (1<<14), - IB_DEVICE_RQ_SIG_TYPE = (1<<15) }; enum ib_atomic_cap { @@ -408,7 +407,6 @@ struct ib_srq *srq; struct ib_qp_cap cap; enum ib_sig_type sq_sig_type; - enum ib_sig_type rq_sig_type; enum ib_qp_type qp_type; u8 port_num; /* special QP types only */ }; @@ -533,10 +531,6 @@ IB_SEND_INLINE = (1<<3) }; -enum ib_recv_flags { - IB_RECV_SIGNALED = 1 -}; - struct ib_sge { u64 addr; u32 length; @@ -579,7 +573,6 @@ u64 wr_id; struct ib_sge *sg_list; int num_sge; - int recv_flags; }; enum ib_access_flags { --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-03-02 20:53:21.000000000 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-03-03 14:12:54.668054955 -0800 @@ -105,7 +105,6 @@ .wr_id = wr_id | IPOIB_OP_RECV, .sg_list = &list, .num_sge = 1, - .recv_flags = IB_RECV_SIGNALED }; struct ib_recv_wr 
*bad_wr; --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2005-01-15 15:19:59.000000000 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2005-03-03 14:12:54.667055172 -0800 @@ -165,7 +165,6 @@ .max_recv_sge = 1 }, .sq_sig_type = IB_SIGNAL_ALL_WR, - .rq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_UD }; From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][7/26] IB/mthca: map registers for mem-free mode In-Reply-To: <2005331520.psAuTRchMaqO6dem@topspin.com> Message-ID: <2005331520.q2c4004P8DuwgJEx@topspin.com> Move the request/ioremap of regions related to event handling into mthca_eq.c. Map the correct regions depending on whether we're in Tavor or native mem-free mode. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_config_reg.h 2005-01-25 20:48:48.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_config_reg.h 2005-03-03 14:12:55.516870705 -0800 @@ -46,5 +46,6 @@ #define MTHCA_MAP_ECR_SIZE (MTHCA_ECR_SIZE + MTHCA_ECR_CLR_SIZE) #define MTHCA_CLR_INT_BASE 0xf00d8 #define MTHCA_CLR_INT_SIZE 0x00008 +#define MTHCA_EQ_SET_CI_SIZE (8 * 32) #endif /* MTHCA_CONFIG_REG_H */ --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:54.672054087 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:55.515870922 -0800 @@ -237,9 +237,17 @@ struct semaphore cap_mask_mutex; void __iomem *hcr; - void __iomem *ecr_base; - void __iomem *clr_base; void __iomem *kar; + void __iomem *clr_base; + union { + struct { + void __iomem *ecr_base; + } tavor; + struct { + void __iomem *eq_arm; + void __iomem *eq_set_ci_base; + } arbel; + } eq_regs; struct mthca_cmd cmd; struct mthca_limits limits; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2005-01-25 20:48:48.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_eq.c 
2005-03-03 14:12:55.516870705 -0800 @@ -366,10 +366,10 @@ if (dev->eq_table.clr_mask) writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); - if ((ecr = readl(dev->ecr_base + 4)) != 0) { + if ((ecr = readl(dev->eq_regs.tavor.ecr_base + 4)) != 0) { work = 1; - writel(ecr, dev->ecr_base + + writel(ecr, dev->eq_regs.tavor.ecr_base + MTHCA_ECR_CLR_BASE - MTHCA_ECR_BASE + 4); for (i = 0; i < MTHCA_NUM_EQ; ++i) @@ -578,6 +578,129 @@ dev->eq_table.eq + i); } +static int __devinit mthca_map_reg(struct mthca_dev *dev, + unsigned long offset, unsigned long size, + void __iomem **map) +{ + unsigned long base = pci_resource_start(dev->pdev, 0); + + if (!request_mem_region(base + offset, size, DRV_NAME)) + return -EBUSY; + + *map = ioremap(base + offset, size); + if (!*map) { + release_mem_region(base + offset, size); + return -ENOMEM; + } + + return 0; +} + +static void mthca_unmap_reg(struct mthca_dev *dev, unsigned long offset, + unsigned long size, void __iomem *map) +{ + unsigned long base = pci_resource_start(dev->pdev, 0); + + release_mem_region(base + offset, size); + iounmap(map); +} + +static int __devinit mthca_map_eq_regs(struct mthca_dev *dev) +{ + unsigned long mthca_base; + + mthca_base = pci_resource_start(dev->pdev, 0); + + if (dev->hca_type == ARBEL_NATIVE) { + /* + * We assume that the EQ arm and EQ set CI registers + * fall within the first BAR. We can't trust the + * values firmware gives us, since those addresses are + * valid on the HCA's side of the PCI bus but not + * necessarily the host side. + */ + if (mthca_map_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.clr_int_base, MTHCA_CLR_INT_SIZE, + &dev->clr_base)) { + mthca_err(dev, "Couldn't map interrupt clear register, " + "aborting.\n"); + return -ENOMEM; + } + + /* + * Add 4 because we limit ourselves to EQs 0 ... 31, + * so we only need the low word of the register. 
+ */ + if (mthca_map_reg(dev, ((pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.eq_arm_base) + 4, 4, + &dev->eq_regs.arbel.eq_arm)) { + mthca_err(dev, "Couldn't map interrupt clear register, " + "aborting.\n"); + mthca_unmap_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.clr_int_base, MTHCA_CLR_INT_SIZE, + dev->clr_base); + return -ENOMEM; + } + + if (mthca_map_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.eq_set_ci_base, + MTHCA_EQ_SET_CI_SIZE, + &dev->eq_regs.arbel.eq_set_ci_base)) { + mthca_err(dev, "Couldn't map interrupt clear register, " + "aborting.\n"); + mthca_unmap_reg(dev, ((pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.eq_arm_base) + 4, 4, + dev->eq_regs.arbel.eq_arm); + mthca_unmap_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.clr_int_base, MTHCA_CLR_INT_SIZE, + dev->clr_base); + return -ENOMEM; + } + } else { + if (mthca_map_reg(dev, MTHCA_CLR_INT_BASE, MTHCA_CLR_INT_SIZE, + &dev->clr_base)) { + mthca_err(dev, "Couldn't map interrupt clear register, " + "aborting.\n"); + return -ENOMEM; + } + + if (mthca_map_reg(dev, MTHCA_ECR_BASE, + MTHCA_ECR_SIZE + MTHCA_ECR_CLR_SIZE, + &dev->eq_regs.tavor.ecr_base)) { + mthca_err(dev, "Couldn't map ecr register, " + "aborting.\n"); + mthca_unmap_reg(dev, MTHCA_CLR_INT_BASE, MTHCA_CLR_INT_SIZE, + dev->clr_base); + return -ENOMEM; + } + } + + return 0; + +} + +static void __devexit mthca_unmap_eq_regs(struct mthca_dev *dev) +{ + if (dev->hca_type == ARBEL_NATIVE) { + mthca_unmap_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.eq_set_ci_base, + MTHCA_EQ_SET_CI_SIZE, + dev->eq_regs.arbel.eq_set_ci_base); + mthca_unmap_reg(dev, ((pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.eq_arm_base) + 4, 4, + dev->eq_regs.arbel.eq_arm); + mthca_unmap_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) & + dev->fw.arbel.clr_int_base, MTHCA_CLR_INT_SIZE, + dev->clr_base); + } else { + mthca_unmap_reg(dev, MTHCA_ECR_BASE, + MTHCA_ECR_SIZE + 
MTHCA_ECR_CLR_SIZE, + dev->eq_regs.tavor.ecr_base); + mthca_unmap_reg(dev, MTHCA_CLR_INT_BASE, MTHCA_CLR_INT_SIZE, + dev->clr_base); + } +} + int __devinit mthca_map_eq_icm(struct mthca_dev *dev, u64 icm_virt) { int ret; @@ -636,6 +759,10 @@ if (err) return err; + err = mthca_map_eq_regs(dev); + if (err) + goto err_out_free; + if (dev->mthca_flags & MTHCA_FLAG_MSI || dev->mthca_flags & MTHCA_FLAG_MSI_X) { dev->eq_table.clr_mask = 0; @@ -653,7 +780,7 @@ (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, &dev->eq_table.eq[MTHCA_EQ_COMP]); if (err) - goto err_out_free; + goto err_out_unmap; err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, @@ -720,6 +847,9 @@ err_out_comp: mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); +err_out_unmap: + mthca_unmap_eq_regs(dev); + err_out_free: mthca_alloc_cleanup(&dev->eq_table.alloc); return err; @@ -740,5 +870,7 @@ for (i = 0; i < MTHCA_NUM_EQ; ++i) mthca_free_eq(dev, &dev->eq_table.eq[i]); + mthca_unmap_eq_regs(dev); + mthca_alloc_cleanup(&dev->eq_table.alloc); } --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-01-25 20:49:05.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:55.516870705 -0800 @@ -686,37 +686,18 @@ int err; /* - * We request our first BAR in two chunks, since the MSI-X - * vector table is right in the middle. + * We can't just use pci_request_regions() because the MSI-X + * table is right in the middle of the first BAR. If we did + * pci_request_region and grab all of the first BAR, then + * setting up MSI-X would fail, since the PCI core wants to do + * request_mem_region on the MSI-X vector table. * - * This is why we can't just use pci_request_regions() -- if - * we did then setting up MSI-X would fail, since the PCI core - * wants to do request_mem_region on the MSI-X vector table. + * So just request what we need right now, and request any + * other regions we need when setting up EQs. 
*/ - if (!request_mem_region(pci_resource_start(pdev, 0) + - MTHCA_HCR_BASE, - MTHCA_HCR_SIZE, - DRV_NAME)) { - err = -EBUSY; - goto err_hcr_failed; - } - - if (!request_mem_region(pci_resource_start(pdev, 0) + - MTHCA_ECR_BASE, - MTHCA_MAP_ECR_SIZE, - DRV_NAME)) { - err = -EBUSY; - goto err_ecr_failed; - } - - if (!request_mem_region(pci_resource_start(pdev, 0) + - MTHCA_CLR_INT_BASE, - MTHCA_CLR_INT_SIZE, - DRV_NAME)) { - err = -EBUSY; - goto err_int_failed; - } - + if (!request_mem_region(pci_resource_start(pdev, 0) + MTHCA_HCR_BASE, + MTHCA_HCR_SIZE, DRV_NAME)) + return -EBUSY; err = pci_request_region(pdev, 2, DRV_NAME); if (err) @@ -731,24 +712,11 @@ return 0; err_bar4_failed: - pci_release_region(pdev, 2); -err_bar2_failed: - - release_mem_region(pci_resource_start(pdev, 0) + - MTHCA_CLR_INT_BASE, - MTHCA_CLR_INT_SIZE); -err_int_failed: - - release_mem_region(pci_resource_start(pdev, 0) + - MTHCA_ECR_BASE, - MTHCA_MAP_ECR_SIZE); -err_ecr_failed: - release_mem_region(pci_resource_start(pdev, 0) + - MTHCA_HCR_BASE, +err_bar2_failed: + release_mem_region(pci_resource_start(pdev, 0) + MTHCA_HCR_BASE, MTHCA_HCR_SIZE); -err_hcr_failed: return err; } @@ -761,16 +729,7 @@ pci_release_region(pdev, 2); - release_mem_region(pci_resource_start(pdev, 0) + - MTHCA_CLR_INT_BASE, - MTHCA_CLR_INT_SIZE); - - release_mem_region(pci_resource_start(pdev, 0) + - MTHCA_ECR_BASE, - MTHCA_MAP_ECR_SIZE); - - release_mem_region(pci_resource_start(pdev, 0) + - MTHCA_HCR_BASE, + release_mem_region(pci_resource_start(pdev, 0) + MTHCA_HCR_BASE, MTHCA_HCR_SIZE); } @@ -941,31 +900,13 @@ goto err_free_dev; } - mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, - MTHCA_CLR_INT_SIZE); - if (!mdev->clr_base) { - mthca_err(mdev, "Couldn't map interrupt clear register, " - "aborting.\n"); - err = -ENOMEM; - goto err_iounmap; - } - - mdev->ecr_base = ioremap(mthca_base + MTHCA_ECR_BASE, - MTHCA_ECR_SIZE + MTHCA_ECR_CLR_SIZE); - if (!mdev->ecr_base) { - mthca_err(mdev, "Couldn't map ecr 
register, " - "aborting.\n"); - err = -ENOMEM; - goto err_iounmap_clr; - } - mthca_base = pci_resource_start(pdev, 2); mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); if (!mdev->kar) { mthca_err(mdev, "Couldn't map kernel access region, " "aborting.\n"); err = -ENOMEM; - goto err_iounmap_ecr; + goto err_iounmap; } err = mthca_tune_pci(mdev); @@ -1014,12 +955,6 @@ err_iounmap_kar: iounmap(mdev->kar); -err_iounmap_ecr: - iounmap(mdev->ecr_base); - -err_iounmap_clr: - iounmap(mdev->clr_base); - err_iounmap: iounmap(mdev->hcr); @@ -1067,9 +1002,8 @@ mthca_close_hca(mdev); + iounmap(mdev->kar); iounmap(mdev->hcr); - iounmap(mdev->ecr_base); - iounmap(mdev->clr_base); if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) pci_disable_msix(pdev); From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][9/26] IB/mthca: dynamic context memory mapping for mem-free mode In-Reply-To: <2005331520.qwFp6OBqldRd6oo8@topspin.com> Message-ID: <2005331520.nfKPjEcWG6DlwOqo@topspin.com> Add support for mapping more memory into HCA's context to cover context tables when new objects are allocated. Pass the object size into mthca_alloc_icm_table(), reference count the ICM chunks, and add new mthca_table_get() and mthca_table_put() functions to handle mapping memory when allocating or destroying objects. 
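The get/put scheme described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: the demo_* names, DEMO_CHUNK_SIZE and DEMO_MAX_CHUNKS are hypothetical, calloc()/free() stand in for allocating and mapping an ICM chunk, and the locking the patch does around the table is left out.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_CHUNK_SIZE 4096	/* hypothetical stand-in for MTHCA_TABLE_CHUNK_SIZE */
#define DEMO_MAX_CHUNKS 64	/* hypothetical fixed upper bound for the demo */

struct demo_chunk {
	int refcount;		/* number of live objects backed by this chunk */
};

struct demo_table {
	int num_obj;		/* must be a power of two */
	int obj_size;		/* bytes per object */
	struct demo_chunk *chunk[DEMO_MAX_CHUNKS];
};

/* Map an object number to the chunk that holds its context entry. */
int demo_chunk_index(const struct demo_table *t, int obj)
{
	return (obj & (t->num_obj - 1)) * t->obj_size / DEMO_CHUNK_SIZE;
}

/* Take a reference on the chunk backing obj, creating it on first use. */
int demo_table_get(struct demo_table *t, int obj)
{
	int i = demo_chunk_index(t, obj);

	if (!t->chunk[i]) {
		t->chunk[i] = calloc(1, sizeof *t->chunk[i]);
		if (!t->chunk[i])
			return -1;
	}

	++t->chunk[i]->refcount;
	return 0;
}

/* Drop a reference; the chunk goes away when the last object is destroyed. */
void demo_table_put(struct demo_table *t, int obj)
{
	int i = demo_chunk_index(t, obj);

	assert(t->chunk[i] && t->chunk[i]->refcount > 0);

	if (--t->chunk[i]->refcount == 0) {
		free(t->chunk[i]);
		t->chunk[i] = NULL;
	}
}
```

With 64-byte objects and a 4096-byte chunk, objects 0 through 63 share chunk 0, so the chunk stays allocated until the last of them is put. A chunk holding reserved objects would simply be given one extra reference at table creation so that it is never released.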
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:56.152732681 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:56.772598129 -0800 @@ -363,10 +363,9 @@ } mdev->mr_table.mtt_table = mthca_alloc_icm_table(mdev, init_hca->mtt_base, - mdev->limits.num_mtt_segs * init_hca->mtt_seg_sz, - mdev->limits.reserved_mtts * - init_hca->mtt_seg_sz, 1); + mdev->limits.num_mtt_segs, + mdev->limits.reserved_mtts, 1); if (!mdev->mr_table.mtt_table) { mthca_err(mdev, "Failed to map MTT context memory, aborting.\n"); err = -ENOMEM; @@ -374,10 +373,9 @@ } mdev->mr_table.mpt_table = mthca_alloc_icm_table(mdev, init_hca->mpt_base, - mdev->limits.num_mpts * dev_lim->mpt_entry_sz, - mdev->limits.reserved_mrws * - dev_lim->mpt_entry_sz, 1); + mdev->limits.num_mpts, + mdev->limits.reserved_mrws, 1); if (!mdev->mr_table.mpt_table) { mthca_err(mdev, "Failed to map MPT context memory, aborting.\n"); err = -ENOMEM; @@ -385,10 +383,9 @@ } mdev->qp_table.qp_table = mthca_alloc_icm_table(mdev, init_hca->qpc_base, - mdev->limits.num_qps * dev_lim->qpc_entry_sz, - mdev->limits.reserved_qps * - dev_lim->qpc_entry_sz, 1); + mdev->limits.num_qps, + mdev->limits.reserved_qps, 0); if (!mdev->qp_table.qp_table) { mthca_err(mdev, "Failed to map QP context memory, aborting.\n"); err = -ENOMEM; @@ -396,10 +393,9 @@ } mdev->qp_table.eqp_table = mthca_alloc_icm_table(mdev, init_hca->eqpc_base, - mdev->limits.num_qps * dev_lim->eqpc_entry_sz, - mdev->limits.reserved_qps * - dev_lim->eqpc_entry_sz, 1); + mdev->limits.num_qps, + mdev->limits.reserved_qps, 0); if (!mdev->qp_table.eqp_table) { mthca_err(mdev, "Failed to map EQP context memory, aborting.\n"); err = -ENOMEM; @@ -407,10 +403,9 @@ } mdev->cq_table.table = mthca_alloc_icm_table(mdev, init_hca->cqc_base, - mdev->limits.num_cqs * dev_lim->cqc_entry_sz, - mdev->limits.reserved_cqs * - dev_lim->cqc_entry_sz, 1); + mdev->limits.num_cqs, + mdev->limits.reserved_cqs, 0); 
if (!mdev->cq_table.table) { mthca_err(mdev, "Failed to map CQ context memory, aborting.\n"); err = -ENOMEM; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-01-25 20:46:29.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-03-03 14:12:56.773597912 -0800 @@ -79,6 +79,7 @@ if (!icm) return icm; + icm->refcount = 0; INIT_LIST_HEAD(&icm->chunk_list); cur_order = get_order(MTHCA_ICM_ALLOC_SIZE); @@ -138,9 +139,62 @@ return NULL; } +int mthca_table_get(struct mthca_dev *dev, struct mthca_icm_table *table, int obj) +{ + int i = (obj & (table->num_obj - 1)) * table->obj_size / MTHCA_TABLE_CHUNK_SIZE; + int ret = 0; + u8 status; + + down(&table->mutex); + + if (table->icm[i]) { + ++table->icm[i]->refcount; + goto out; + } + + table->icm[i] = mthca_alloc_icm(dev, MTHCA_TABLE_CHUNK_SIZE >> PAGE_SHIFT, + (table->lowmem ? GFP_KERNEL : GFP_HIGHUSER) | + __GFP_NOWARN); + if (!table->icm[i]) { + ret = -ENOMEM; + goto out; + } + + if (mthca_MAP_ICM(dev, table->icm[i], table->virt + i * MTHCA_TABLE_CHUNK_SIZE, + &status) || status) { + mthca_free_icm(dev, table->icm[i]); + table->icm[i] = NULL; + ret = -ENOMEM; + goto out; + } + + ++table->icm[i]->refcount; + +out: + up(&table->mutex); + return ret; +} + +void mthca_table_put(struct mthca_dev *dev, struct mthca_icm_table *table, int obj) +{ + int i = (obj & (table->num_obj - 1)) * table->obj_size / MTHCA_TABLE_CHUNK_SIZE; + u8 status; + + down(&table->mutex); + + if (--table->icm[i]->refcount == 0) { + mthca_UNMAP_ICM(dev, table->virt + i * MTHCA_TABLE_CHUNK_SIZE, + MTHCA_TABLE_CHUNK_SIZE >> 12, &status); + mthca_free_icm(dev, table->icm[i]); + table->icm[i] = NULL; + } + + up(&table->mutex); +} + struct mthca_icm_table *mthca_alloc_icm_table(struct mthca_dev *dev, - u64 virt, unsigned size, - unsigned reserved, + u64 virt, int obj_size, + int nobj, int reserved, int use_lowmem) { struct mthca_icm_table *table; @@ -148,20 +202,23 @@ int i; u8 status; - num_icm = size / 
MTHCA_TABLE_CHUNK_SIZE; + num_icm = obj_size * nobj / MTHCA_TABLE_CHUNK_SIZE; table = kmalloc(sizeof *table + num_icm * sizeof *table->icm, GFP_KERNEL); if (!table) return NULL; - table->virt = virt; - table->num_icm = num_icm; - init_MUTEX(&table->sem); + table->virt = virt; + table->num_icm = num_icm; + table->num_obj = nobj; + table->obj_size = obj_size; + table->lowmem = use_lowmem; + init_MUTEX(&table->mutex); for (i = 0; i < num_icm; ++i) table->icm[i] = NULL; - for (i = 0; i < (reserved + MTHCA_TABLE_CHUNK_SIZE - 1) / MTHCA_TABLE_CHUNK_SIZE; ++i) { + for (i = 0; i * MTHCA_TABLE_CHUNK_SIZE < reserved * obj_size; ++i) { table->icm[i] = mthca_alloc_icm(dev, MTHCA_TABLE_CHUNK_SIZE >> PAGE_SHIFT, (use_lowmem ? GFP_KERNEL : GFP_HIGHUSER) | __GFP_NOWARN); @@ -173,6 +230,12 @@ table->icm[i] = NULL; goto err; } + + /* + * Add a reference to this ICM chunk so that it never + * gets freed (since it contains reserved firmware objects). + */ + ++table->icm[i]->refcount; } return table; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-01-25 20:46:29.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-03-03 14:12:56.773597912 -0800 @@ -53,12 +53,16 @@ struct mthca_icm { struct list_head chunk_list; + int refcount; }; struct mthca_icm_table { u64 virt; int num_icm; - struct semaphore sem; + int num_obj; + int obj_size; + int lowmem; + struct semaphore mutex; struct mthca_icm *icm[0]; }; @@ -75,10 +79,12 @@ void mthca_free_icm(struct mthca_dev *dev, struct mthca_icm *icm); struct mthca_icm_table *mthca_alloc_icm_table(struct mthca_dev *dev, - u64 virt, unsigned size, - unsigned reserved, + u64 virt, int obj_size, + int nobj, int reserved, int use_lowmem); void mthca_free_icm_table(struct mthca_dev *dev, struct mthca_icm_table *table); +int mthca_table_get(struct mthca_dev *dev, struct mthca_icm_table *table, int obj); +void mthca_table_put(struct mthca_dev *dev, struct mthca_icm_table *table, int obj); static inline void 
mthca_icm_first(struct mthca_icm *icm, struct mthca_icm_iter *iter) From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][8/26] IB/mthca: add UAR allocation In-Reply-To: <2005331520.q2c4004P8DuwgJEx@topspin.com> Message-ID: <2005331520.qwFp6OBqldRd6oo8@topspin.com> Add support for allocating user access regions (UARs). Use this to allocate a region for the kernel at driver init instead of using the hard-coded MTHCA_KAR_PAGE index. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/Makefile 2005-01-15 15:16:40.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/Makefile 2005-03-03 14:12:56.155732030 -0800 @@ -9,4 +9,4 @@ ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ mthca_allocator.o mthca_eq.o mthca_pd.o mthca_cq.o \ mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ - mthca_provider.o mthca_memfree.o + mthca_provider.o mthca_memfree.o mthca_uar.o --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:53.538300187 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:56.153732464 -0800 @@ -666,7 +666,7 @@ MTHCA_CQ_FLAG_TR); cq_context->start = cpu_to_be64(0); cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | - MTHCA_KAR_PAGE); + dev->driver_uar.index); cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:55.515870922 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:56.152732681 -0800 @@ -65,7 +65,6 @@ }; enum { - MTHCA_KAR_PAGE = 1, MTHCA_MAX_PORTS = 2 }; @@ -108,6 +107,7 @@ int gid_table_len; int pkey_table_len; int local_ca_ack_delay; + int num_uars; int max_sg; int num_qps; int 
reserved_qps; @@ -148,6 +148,12 @@ } *page_list; }; +struct mthca_uar_table { + struct mthca_alloc alloc; + u64 uarc_base; + int uarc_size; +}; + struct mthca_pd_table { struct mthca_alloc alloc; }; @@ -252,6 +258,7 @@ struct mthca_cmd cmd; struct mthca_limits limits; + struct mthca_uar_table uar_table; struct mthca_pd_table pd_table; struct mthca_mr_table mr_table; struct mthca_eq_table eq_table; @@ -260,6 +267,7 @@ struct mthca_av_table av_table; struct mthca_mcg_table mcg_table; + struct mthca_uar driver_uar; struct mthca_pd driver_pd; struct mthca_mr driver_mr; @@ -318,6 +326,7 @@ int mthca_array_init(struct mthca_array *array, int nent); void mthca_array_cleanup(struct mthca_array *array, int nent); +int mthca_init_uar_table(struct mthca_dev *dev); int mthca_init_pd_table(struct mthca_dev *dev); int mthca_init_mr_table(struct mthca_dev *dev); int mthca_init_eq_table(struct mthca_dev *dev); @@ -326,6 +335,7 @@ int mthca_init_av_table(struct mthca_dev *dev); int mthca_init_mcg_table(struct mthca_dev *dev); +void mthca_cleanup_uar_table(struct mthca_dev *dev); void mthca_cleanup_pd_table(struct mthca_dev *dev); void mthca_cleanup_mr_table(struct mthca_dev *dev); void mthca_cleanup_eq_table(struct mthca_dev *dev); @@ -337,6 +347,9 @@ int mthca_register_device(struct mthca_dev *dev); void mthca_unregister_device(struct mthca_dev *dev); +int mthca_uar_alloc(struct mthca_dev *dev, struct mthca_uar *uar); +void mthca_uar_free(struct mthca_dev *dev, struct mthca_uar *uar); + int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2005-03-03 14:12:55.516870705 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_eq.c 2005-03-03 14:12:56.154732247 -0800 @@ -469,7 +469,7 @@ MTHCA_EQ_FLAG_TR); eq_context->start = cpu_to_be64(0); eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | - MTHCA_KAR_PAGE); + dev->driver_uar.index); 
eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); eq_context->intr = intr; eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:55.516870705 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:56.152732681 -0800 @@ -570,13 +570,35 @@ MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); - err = mthca_init_pd_table(dev); + err = mthca_init_uar_table(dev); if (err) { mthca_err(dev, "Failed to initialize " - "protection domain table, aborting.\n"); + "user access region table, aborting.\n"); return err; } + err = mthca_uar_alloc(dev, &dev->driver_uar); + if (err) { + mthca_err(dev, "Failed to allocate driver access region, " + "aborting.\n"); + goto err_uar_table_free; + } + + dev->kar = ioremap(dev->driver_uar.pfn << PAGE_SHIFT, PAGE_SIZE); + if (!dev->kar) { + mthca_err(dev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_uar_free; + } + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + goto err_kar_unmap; + } + err = mthca_init_mr_table(dev); if (err) { mthca_err(dev, "Failed to initialize " @@ -677,7 +699,16 @@ err_pd_table_free: mthca_cleanup_pd_table(dev); - return err; + +err_kar_unmap: + iounmap(dev->kar); + +err_uar_free: + mthca_uar_free(dev, &dev->driver_uar); + +err_uar_table_free: + mthca_cleanup_uar_table(dev); + return err; } static int __devinit mthca_request_regions(struct pci_dev *pdev, @@ -789,7 +820,6 @@ static int mthca_version_printed = 0; int ddr_hidden = 0; int err; - unsigned long mthca_base; struct mthca_dev *mdev; if (!mthca_version_printed) { @@ -891,8 +921,7 @@ sema_init(&mdev->cmd.poll_sem, 1); mdev->cmd.use_events = 0; - mthca_base = pci_resource_start(pdev, 0); - mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_HCR_SIZE); + mdev->hcr = ioremap(pci_resource_start(pdev, 0) + MTHCA_HCR_BASE, MTHCA_HCR_SIZE); if 
(!mdev->hcr) { mthca_err(mdev, "Couldn't map command register, " "aborting.\n"); @@ -900,22 +929,13 @@ goto err_free_dev; } - mthca_base = pci_resource_start(pdev, 2); - mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); - if (!mdev->kar) { - mthca_err(mdev, "Couldn't map kernel access region, " - "aborting.\n"); - err = -ENOMEM; - goto err_iounmap; - } - err = mthca_tune_pci(mdev); if (err) - goto err_iounmap_kar; + goto err_iounmap; err = mthca_init_hca(mdev); if (err) - goto err_iounmap_kar; + goto err_iounmap; err = mthca_setup_hca(mdev); if (err) @@ -948,13 +968,11 @@ mthca_cleanup_mr_table(mdev); mthca_cleanup_pd_table(mdev); + mthca_cleanup_uar_table(mdev); err_close: mthca_close_hca(mdev); -err_iounmap_kar: - iounmap(mdev->kar); - err_iounmap: iounmap(mdev->hcr); @@ -1000,9 +1018,12 @@ mthca_cleanup_mr_table(mdev); mthca_cleanup_pd_table(mdev); + iounmap(mdev->kar); + mthca_uar_free(mdev, &mdev->driver_uar); + mthca_cleanup_uar_table(mdev); + mthca_close_hca(mdev); - iounmap(mdev->kar); iounmap(mdev->hcr); if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c 2005-03-02 20:53:21.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c 2005-03-03 14:12:56.153732464 -0800 @@ -236,6 +236,7 @@ init_hca->mtt_seg_sz = ffs(dev_lim->mtt_seg_sz) - 7; break; case MTHCA_RES_UAR: + dev->limits.num_uars = profile[i].num; init_hca->uar_scratch_base = profile[i].start; break; case MTHCA_RES_UDAV: --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:12:54.674053653 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:12:56.153732464 -0800 @@ -49,6 +49,11 @@ DECLARE_PCI_UNMAP_ADDR(mapping) }; +struct mthca_uar { + unsigned long pfn; + int index; +}; + struct mthca_mr { struct ib_mr ibmr; int order; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:12:54.675053436 -0800 +++ 
linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:12:56.155732030 -0800 @@ -625,7 +625,7 @@ qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | (31 << 24)); } - qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->usr_page = cpu_to_be32(dev->driver_uar.index); qp_context->local_qpn = cpu_to_be32(qp->qpn); if (attr_mask & IB_QP_DEST_QPN) { qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-export/drivers/infiniband/hw/mthca/mthca_uar.c 2005-03-03 14:12:56.152732681 -0800 @@ -0,0 +1,69 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include "mthca_dev.h" + +int mthca_uar_alloc(struct mthca_dev *dev, struct mthca_uar *uar) +{ + uar->index = mthca_alloc(&dev->uar_table.alloc); + if (uar->index == -1) + return -ENOMEM; + + uar->pfn = (pci_resource_start(dev->pdev, 2) >> PAGE_SHIFT) + uar->index; + + return 0; +} + +void mthca_uar_free(struct mthca_dev *dev, struct mthca_uar *uar) +{ + mthca_free(&dev->uar_table.alloc, uar->index); +} + +int mthca_init_uar_table(struct mthca_dev *dev) +{ + int ret; + + ret = mthca_alloc_init(&dev->uar_table.alloc, + dev->limits.num_uars, + dev->limits.num_uars - 1, + dev->limits.reserved_uars); + + return ret; +} + +void mthca_cleanup_uar_table(struct mthca_dev *dev) +{ + /* XXX check if any UARs are still allocated? */ + mthca_alloc_cleanup(&dev->uar_table.alloc); +} From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][10/26] IB/mthca: mem-free memory region support In-Reply-To: <2005331520.nfKPjEcWG6DlwOqo@topspin.com> Message-ID: <2005331520.6xlwh79w94Kl0EpH@topspin.com> Add support for mem-free mode to memory region code. This mostly amounts to properly munging between keys and indices. 
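The key/index munging can be tried out in isolation. The sketch below mirrors the two conversions as 32-bit rotations by 8 bits in opposite directions; the demo_* names and the mem_free flag (standing in for the hca_type check) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * In mem-free mode the key is the hardware index rotated left by 8 bits;
 * otherwise key and index are the same value.
 */
uint32_t demo_index_to_key(int mem_free, uint32_t ind)
{
	if (mem_free)
		return (ind >> 24) | (ind << 8);
	else
		return ind;
}

/* Inverse of demo_index_to_key(): rotate right by 8 bits in mem-free mode. */
uint32_t demo_key_to_index(int mem_free, uint32_t key)
{
	if (mem_free)
		return (key << 24) | (key >> 8);
	else
		return key;
}

/* Check that converting an index to a key and back is lossless. */
void demo_check_roundtrip(int mem_free, uint32_t ind)
{
	assert(demo_key_to_index(mem_free, demo_index_to_key(mem_free, ind)) == ind);
}
```

For example, index 0x12 becomes key 0x1200 in mem-free mode and stays 0x12 otherwise. Because the two functions are exact inverses, code that only has a key in hand can always recover the index it needs for the allocator and for the low bits passed to the firmware command.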
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c 2005-01-15 15:16:11.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c 2005-03-03 14:12:57.165512841 -0800 @@ -53,7 +53,8 @@ u32 window_count; u32 window_count_limit; u64 mtt_seg; - u32 reserved[3]; + u32 mtt_sz; /* Arbel only */ + u32 reserved[2]; } __attribute__((packed)); #define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) @@ -121,21 +122,38 @@ spin_unlock(&dev->mr_table.mpt_alloc.lock); } +static inline u32 hw_index_to_key(struct mthca_dev *dev, u32 ind) +{ + if (dev->hca_type == ARBEL_NATIVE) + return (ind >> 24) | (ind << 8); + else + return ind; +} + +static inline u32 key_to_hw_index(struct mthca_dev *dev, u32 key) +{ + if (dev->hca_type == ARBEL_NATIVE) + return (key << 24) | (key >> 8); + else + return key; +} + int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, u32 access, struct mthca_mr *mr) { void *mailbox; struct mthca_mpt_entry *mpt_entry; + u32 key; int err; u8 status; might_sleep(); mr->order = -1; - mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); - if (mr->ibmr.lkey == -1) + key = mthca_alloc(&dev->mr_table.mpt_alloc); + if (key == -1) return -ENOMEM; - mr->ibmr.rkey = mr->ibmr.lkey; + mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); @@ -151,7 +169,7 @@ MTHCA_MPT_FLAG_REGION | access); mpt_entry->page_size = 0; - mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->key = cpu_to_be32(key); mpt_entry->pd = cpu_to_be32(pd); mpt_entry->start = 0; mpt_entry->length = ~0ULL; @@ -160,7 +178,7 @@ sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); err = mthca_SW2HW_MPT(dev, mpt_entry, - mr->ibmr.lkey & (dev->limits.num_mpts - 1), + key & (dev->limits.num_mpts - 1), &status); if (err) mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); @@ -182,6 +200,7 @@ void *mailbox; u64 *mtt_entry; struct mthca_mpt_entry *mpt_entry; + u32 key; int err = 
-ENOMEM; u8 status; int i; @@ -189,10 +208,10 @@ might_sleep(); WARN_ON(buffer_size_shift >= 32); - mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); - if (mr->ibmr.lkey == -1) + key = mthca_alloc(&dev->mr_table.mpt_alloc); + if (key == -1) return -ENOMEM; - mr->ibmr.rkey = mr->ibmr.lkey; + mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; i < list_len; @@ -254,7 +273,7 @@ access); mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); - mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->key = cpu_to_be32(key); mpt_entry->pd = cpu_to_be32(pd); mpt_entry->start = cpu_to_be64(iova); mpt_entry->length = cpu_to_be64(total_size); @@ -275,7 +294,7 @@ } err = mthca_SW2HW_MPT(dev, mpt_entry, - mr->ibmr.lkey & (dev->limits.num_mpts - 1), + key & (dev->limits.num_mpts - 1), &status); if (err) mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); @@ -307,7 +326,8 @@ might_sleep(); err = mthca_HW2SW_MPT(dev, NULL, - mr->ibmr.lkey & (dev->limits.num_mpts - 1), + key_to_hw_index(dev, mr->ibmr.lkey) & + (dev->limits.num_mpts - 1), &status); if (err) mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); @@ -318,7 +338,7 @@ if (mr->order >= 0) mthca_free_mtt(dev, mr->first_seg, mr->order); - mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, mr->ibmr.lkey)); } int __devinit mthca_init_mr_table(struct mthca_dev *dev) From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][11/26] IB/mthca: mem-free EQ initialization In-Reply-To: <2005331520.6xlwh79w94Kl0EpH@topspin.com> Message-ID: <2005331520.nW52EhJhFo4sAhLI@topspin.com> Add code to initialize EQ context properly in both Tavor and mem-free mode. 
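As a rough illustration of the two initialization paths, here is a userspace sketch. The struct is a cut-down, hypothetical version of the EQ context, htonl() stands in for cpu_to_be32(), and the demo_* names are assumptions made for the example.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <strings.h>

#define DEMO_EQ_STATE_ARBEL (8u << 8)	/* extra state flag set in mem-free mode */

/* Cut-down, hypothetical EQ context; all fields are stored big-endian. */
struct demo_eq_context {
	uint32_t flags;
	uint32_t logsize_usrpage;
	uint32_t tavor_pd;	/* used in Tavor mode only */
	uint32_t arbel_pd;	/* used in mem-free mode only */
};

void demo_fill_eq_context(struct demo_eq_context *ctx, int mem_free,
			  int nent, uint32_t uar_index, uint32_t pd_num,
			  uint32_t base_flags)
{
	memset(ctx, 0, sizeof *ctx);

	ctx->flags = htonl(base_flags);
	if (mem_free)
		ctx->flags |= htonl(DEMO_EQ_STATE_ARBEL);

	/* log2 of the entry count goes in the top byte (nent is a power of two) */
	ctx->logsize_usrpage = htonl((uint32_t) (ffs(nent) - 1) << 24);

	if (mem_free) {
		ctx->arbel_pd = htonl(pd_num);
	} else {
		/* Tavor mode also records the UAR page and uses the other PD slot */
		ctx->logsize_usrpage |= htonl(uar_index);
		ctx->tavor_pd = htonl(pd_num);
	}
}
```

The point of the split is that the same PD number lands in a different field depending on the mode, and only Tavor mode puts the UAR index into logsize_usrpage.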
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2005-03-03 14:12:56.154732247 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_eq.c 2005-03-03 14:12:57.462448386 -0800 @@ -54,10 +54,10 @@ u32 flags; u64 start; u32 logsize_usrpage; - u32 pd; + u32 tavor_pd; /* reserved for Arbel */ u8 reserved1[3]; u8 intr; - u32 lost_count; + u32 arbel_pd; /* lost_count for Tavor */ u32 lkey; u32 reserved2[2]; u32 consumer_index; @@ -75,6 +75,7 @@ #define MTHCA_EQ_STATE_ARMED ( 1 << 8) #define MTHCA_EQ_STATE_FIRED ( 2 << 8) #define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) +#define MTHCA_EQ_STATE_ARBEL ( 8 << 8) enum { MTHCA_EVENT_TYPE_COMP = 0x00, @@ -467,10 +468,16 @@ MTHCA_EQ_OWNER_HW | MTHCA_EQ_STATE_ARMED | MTHCA_EQ_FLAG_TR); - eq_context->start = cpu_to_be64(0); - eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | - dev->driver_uar.index); - eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + if (dev->hca_type == ARBEL_NATIVE) + eq_context->flags |= cpu_to_be32(MTHCA_EQ_STATE_ARBEL); + + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24); + if (dev->hca_type == ARBEL_NATIVE) { + eq_context->arbel_pd = cpu_to_be32(dev->driver_pd.pd_num); + } else { + eq_context->logsize_usrpage |= cpu_to_be32(dev->driver_uar.index); + eq_context->tavor_pd = cpu_to_be32(dev->driver_pd.pd_num); + } eq_context->intr = intr; eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][12/26] IB/mthca: mem-free interrupt handling In-Reply-To: <2005331520.nW52EhJhFo4sAhLI@topspin.com> Message-ID: <2005331520.KR3jHRDWtXI3rzl6@topspin.com> Update interrupt handling code to handle mem-free mode. While we're at it, improve the Tavor interrupt handling to avoid an extra MMIO read of the event cause register. 
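The general shape of a Tavor-style handler (read the event cause register once, then use that single value both as the dispatch mask and as the handled/not-handled result) can be simulated in userspace. Everything below is a hypothetical model: the demo_* names, the counted "MMIO" read, and the plain bit masks (no byte swapping of the per-EQ mask) are assumptions for illustration only.

```c
#include <stdint.h>

#define DEMO_NUM_EQ 3

struct demo_eq {
	uint32_t eqn_mask;	/* bit identifying this EQ in the cause register */
	int pending;		/* simulated entries waiting to be consumed */
	uint32_t cons_index;	/* consumer index, advanced as entries are handled */
	int rearmed;		/* how many times this EQ was re-armed */
};

struct demo_dev {
	uint32_t ecr;		/* simulated event cause register */
	int mmio_reads;		/* counts simulated register reads */
	struct demo_eq eq[DEMO_NUM_EQ];
};

uint32_t demo_read_ecr(struct demo_dev *dev)
{
	++dev->mmio_reads;
	return dev->ecr;
}

/* Consume all pending entries; return nonzero if any were found. */
int demo_eq_poll(struct demo_eq *eq)
{
	int found = eq->pending > 0;

	eq->cons_index += eq->pending;
	eq->pending = 0;
	return found;
}

/* Returns nonzero if the interrupt was ours. */
int demo_interrupt(struct demo_dev *dev)
{
	uint32_t ecr = demo_read_ecr(dev);	/* the only register read */
	int i;

	if (ecr) {
		dev->ecr &= ~ecr;		/* acknowledge the bits we saw */

		for (i = 0; i < DEMO_NUM_EQ; ++i)
			if ((ecr & dev->eq[i].eqn_mask) &&
			    demo_eq_poll(&dev->eq[i]))
				++dev->eq[i].rearmed;	/* update CI and re-arm here */
	}

	return ecr != 0;
}
```

Only EQs whose bit is set in the value read are polled, and an EQ is re-armed only when the poll actually found entries, so the per-EQ work stays out of the path for queues that did not fire.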
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:56.152732681 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:57.857362663 -0800 @@ -171,6 +171,7 @@ struct mthca_alloc alloc; void __iomem *clr_int; u32 clr_mask; + u32 arm_mask; struct mthca_eq eq[MTHCA_NUM_EQ]; u64 icm_virt; struct page *icm_page; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2005-03-03 14:12:57.462448386 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_eq.c 2005-03-03 14:12:57.859362229 -0800 @@ -165,19 +165,46 @@ MTHCA_ASYNC_EVENT_MASK; } -static inline void set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci) +static inline void tavor_set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci) { u32 doorbell[2]; doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eq->eqn); doorbell[1] = cpu_to_be32(ci & (eq->nent - 1)); + /* + * This barrier makes sure that all updates to ownership bits + * done by set_eqe_hw() hit memory before the consumer index + * is updated. set_eq_ci() allows the HCA to possibly write + * more EQ entries, and we want to avoid the exceedingly + * unlikely possibility of the HCA writing an entry and then + * having set_eqe_hw() overwrite the owner field. + */ + wmb(); mthca_write64(doorbell, dev->kar + MTHCA_EQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } -static inline void eq_req_not(struct mthca_dev *dev, int eqn) +static inline void arbel_set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci) +{ + /* See comment in tavor_set_eq_ci() above. 
*/ + wmb(); + __raw_writel(cpu_to_be32(ci), dev->eq_regs.arbel.eq_set_ci_base + + eq->eqn * 8); + /* We still want ordering, just not swabbing, so add a barrier */ + mb(); +} + +static inline void set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci) +{ + if (dev->hca_type == ARBEL_NATIVE) + arbel_set_eq_ci(dev, eq, ci); + else + tavor_set_eq_ci(dev, eq, ci); +} + +static inline void tavor_eq_req_not(struct mthca_dev *dev, int eqn) { u32 doorbell[2]; @@ -189,16 +216,23 @@ MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } +static inline void arbel_eq_req_not(struct mthca_dev *dev, u32 eqn_mask) +{ + writel(eqn_mask, dev->eq_regs.arbel.eq_arm); +} + static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) { - u32 doorbell[2]; + if (dev->hca_type != ARBEL_NATIVE) { + u32 doorbell[2]; - doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); - doorbell[1] = cpu_to_be32(cqn); + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); - mthca_write64(doorbell, - dev->kar + MTHCA_EQ_DOORBELL, - MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } } static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, u32 entry) @@ -233,7 +267,7 @@ ib_dispatch_event(&record); } -static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +static int mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) { struct mthca_eqe *eqe; int disarm_cqn; @@ -334,60 +368,93 @@ ++eq->cons_index; eqes_found = 1; - if (set_ci) { - wmb(); /* see comment below */ + if (unlikely(set_ci)) { + /* + * Conditional on hca_type is OK here because + * this is a rare case, not the fast path. + */ set_eq_ci(dev, eq, eq->cons_index); set_ci = 0; } } /* - * This barrier makes sure that all updates to - * ownership bits done by set_eqe_hw() hit memory - * before the consumer index is updated. 
set_eq_ci() - * allows the HCA to possibly write more EQ entries, - * and we want to avoid the exceedingly unlikely - * possibility of the HCA writing an entry and then - * having set_eqe_hw() overwrite the owner field. + * Rely on caller to set consumer index so that we don't have + * to test hca_type in our interrupt handling fast path. */ - if (likely(eqes_found)) { - wmb(); - set_eq_ci(dev, eq, eq->cons_index); - } - eq_req_not(dev, eq->eqn); + return eqes_found; } -static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +static irqreturn_t mthca_tavor_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) { struct mthca_dev *dev = dev_ptr; u32 ecr; - int work = 0; int i; if (dev->eq_table.clr_mask) writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); - if ((ecr = readl(dev->eq_regs.tavor.ecr_base + 4)) != 0) { - work = 1; - + ecr = readl(dev->eq_regs.tavor.ecr_base + 4); + if (ecr) { writel(ecr, dev->eq_regs.tavor.ecr_base + MTHCA_ECR_CLR_BASE - MTHCA_ECR_BASE + 4); for (i = 0; i < MTHCA_NUM_EQ; ++i) - if (ecr & dev->eq_table.eq[i].ecr_mask) - mthca_eq_int(dev, &dev->eq_table.eq[i]); + if (ecr & dev->eq_table.eq[i].eqn_mask && + mthca_eq_int(dev, &dev->eq_table.eq[i])) { + tavor_set_eq_ci(dev, &dev->eq_table.eq[i], + dev->eq_table.eq[i].cons_index); + tavor_eq_req_not(dev, dev->eq_table.eq[i].eqn); + } } - return IRQ_RETVAL(work); + return IRQ_RETVAL(ecr); } -static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, +static irqreturn_t mthca_tavor_msi_x_interrupt(int irq, void *eq_ptr, struct pt_regs *regs) { struct mthca_eq *eq = eq_ptr; struct mthca_dev *dev = eq->dev; mthca_eq_int(dev, eq); + tavor_set_eq_ci(dev, eq, eq->cons_index); + tavor_eq_req_not(dev, eq->eqn); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static irqreturn_t mthca_arbel_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + 
writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (mthca_eq_int(dev, &dev->eq_table.eq[i])) { + work = 1; + arbel_set_eq_ci(dev, &dev->eq_table.eq[i], + dev->eq_table.eq[i].cons_index); + } + + arbel_eq_req_not(dev, dev->eq_table.arm_mask); + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_arbel_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + mthca_eq_int(dev, eq); + arbel_set_eq_ci(dev, eq, eq->cons_index); + arbel_eq_req_not(dev, eq->eqn_mask); /* MSI-X vectors always belong to us */ return IRQ_HANDLED; @@ -496,10 +563,10 @@ kfree(dma_list); kfree(mailbox); - eq->ecr_mask = swab32(1 << eq->eqn); + eq->eqn_mask = swab32(1 << eq->eqn); eq->cons_index = 0; - eq_req_not(dev, eq->eqn); + dev->eq_table.arm_mask |= eq->eqn_mask; mthca_dbg(dev, "Allocated EQ %d with %d entries\n", eq->eqn, nent); @@ -551,6 +618,8 @@ mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", status); + dev->eq_table.arm_mask &= ~eq->eqn_mask; + if (0) { mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { @@ -562,7 +631,6 @@ } } - mthca_free_mr(dev, &eq->mr); for (i = 0; i < npages; ++i) pci_free_consistent(dev->pdev, PAGE_SIZE, @@ -780,6 +848,8 @@ (dev->eq_table.inta_pin < 31 ? 4 : 0); } + dev->eq_table.arm_mask = 0; + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? 128 : dev->eq_table.inta_pin; @@ -810,15 +880,20 @@ for (i = 0; i < MTHCA_NUM_EQ; ++i) { err = request_irq(dev->eq_table.eq[i].msi_x_vector, - mthca_msi_x_interrupt, 0, - eq_name[i], dev->eq_table.eq + i); + dev->hca_type == ARBEL_NATIVE ? 
+ mthca_arbel_msi_x_interrupt : + mthca_tavor_msi_x_interrupt, + 0, eq_name[i], dev->eq_table.eq + i); if (err) goto err_out_cmd; dev->eq_table.eq[i].have_irq = 1; } } else { - err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ, - DRV_NAME, dev); + err = request_irq(dev->pdev->irq, + dev->hca_type == ARBEL_NATIVE ? + mthca_arbel_interrupt : + mthca_tavor_interrupt, + SA_SHIRQ, DRV_NAME, dev); if (err) goto err_out_cmd; dev->eq_table.have_irq = 1; @@ -842,6 +917,12 @@ mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); + for (i = 0; i < MTHCA_EQ_CMD; ++i) + if (dev->hca_type == ARBEL_NATIVE) + arbel_eq_req_not(dev, dev->eq_table.eq[i].eqn_mask); + else + tavor_eq_req_not(dev, dev->eq_table.eq[i].eqn); + return 0; err_out_cmd: --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:56.772598129 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:57.858362446 -0800 @@ -608,13 +608,6 @@ goto err_mr_table_free; } - if (dev->hca_type == ARBEL_NATIVE) { - mthca_warn(dev, "Sorry, native MT25208 mode support is not done, " - "aborting.\n"); - err = -ENODEV; - goto err_pd_free; - } - err = mthca_init_eq_table(dev); if (err) { mthca_err(dev, "Failed to initialize " @@ -638,8 +631,16 @@ mthca_err(dev, "BIOS or ACPI interrupt routing problem?\n"); goto err_cmd_poll; - } else - mthca_dbg(dev, "NOP command IRQ test passed\n"); + } + + mthca_dbg(dev, "NOP command IRQ test passed\n"); + + if (dev->hca_type == ARBEL_NATIVE) { + mthca_warn(dev, "Sorry, native MT25208 mode support is not complete, " + "aborting.\n"); + err = -ENODEV; + goto err_cmd_poll; + } err = mthca_init_cq_table(dev); if (err) { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:12:56.153732464 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:12:57.858362446 -0800 @@ -70,7 +70,7 @@ struct mthca_eq { struct mthca_dev *dev; int eqn; - 
u32 ecr_mask; + u32 eqn_mask; u32 cons_index; u16 msi_x_vector; u16 msi_x_entry; From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][13/26] IB/mthca: tweak firmware command debug messages In-Reply-To: <2005331520.KR3jHRDWtXI3rzl6@topspin.com> Message-ID: <2005331520.wYWJriF1rMlIY4lJ@topspin.com> Slightly improve debugging output for UNMAP_ICM and MODIFY_QP firmware commands. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-01-25 20:48:02.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-03-03 14:12:58.283270213 -0800 @@ -1305,6 +1305,9 @@ int mthca_UNMAP_ICM(struct mthca_dev *dev, u64 virt, u32 page_count, u8 *status) { + mthca_dbg(dev, "Unmapping %d pages at %llx from ICM.\n", + page_count, (unsigned long long) virt); + return mthca_cmd(dev, virt, page_count, 0, CMD_UNMAP_ICM, CMD_TIME_CLASS_B, status); } @@ -1538,10 +1541,10 @@ if (0) { int i; mthca_dbg(dev, "Dumping QP context:\n"); - printk(" %08x\n", be32_to_cpup(qp_context)); + printk(" opt param mask: %08x\n", be32_to_cpup(qp_context)); for (i = 0; i < 0x100 / 4; ++i) { if (i % 8 == 0) - printk("[%02x] ", i * 4); + printk(" [%02x] ", i * 4); printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); if ((i + 1) % 8 == 0) printk("\n"); From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][14/26] IB/mthca: tweak MAP_ICM_page firmware command In-Reply-To: <2005331520.wYWJriF1rMlIY4lJ@topspin.com> Message-ID: <2005331520.6eBThkRRWYJ5HE5s@topspin.com> Have MAP_ICM_page() firmware command map assume pages are always the HCA-native 4K size rather than using the kernel's page size. This will make handling doorbell pages for mem-free mode simpler. 
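(Illustrative note, not part of the patch.) The change described above can be sketched in plain C. This is a minimal userspace sketch under stated assumptions: the names icm_entry_old(), icm_entry_new() and HCA_PAGE_SHIFT are hypothetical, the big-endian conversion done by cpu_to_be64() is left out, and the reading that the low bits carried log2(page size / 4K) is inferred from the diff rather than from firmware documentation.

```c
#include <stdint.h>

/* Hypothetical constant for this sketch: the HCA-native page is 4K (1 << 12). */
#define HCA_PAGE_SHIFT 12

/*
 * Old behavior, as the removed line suggests: the kernel's page shift,
 * relative to 4K, is OR'ed into the low bits of the DMA address.
 */
static uint64_t icm_entry_old(uint64_t dma_addr, unsigned int kernel_page_shift)
{
	return dma_addr | (uint64_t)(kernel_page_shift - HCA_PAGE_SHIFT);
}

/*
 * New behavior: the address is passed through unchanged, so the low bits
 * stay zero and the mapping is always treated as one 4K page.
 */
static uint64_t icm_entry_new(uint64_t dma_addr)
{
	return dma_addr;
}
```

With 4K kernel pages both forms give the same value; they differ only when the kernel page size is larger, which seems to be the case the patch wants to stop depending on.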
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-03-03 14:12:58.283270213 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-03-03 14:12:58.619197294 -0800 @@ -1290,7 +1290,7 @@ return -ENOMEM; inbox[0] = cpu_to_be64(virt); - inbox[1] = cpu_to_be64(dma_addr | (PAGE_SHIFT - 12)); + inbox[1] = cpu_to_be64(dma_addr); err = mthca_cmd(dev, indma, 1, 0, CMD_MAP_ICM, CMD_TIME_CLASS_B, status); From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][15/26] IB/mthca: mem-free doorbell record allocation In-Reply-To: <2005331520.6eBThkRRWYJ5HE5s@topspin.com> Message-ID: <2005331520.dH2BeQ6Ko7h8SaKM@topspin.com> Mem-free mode requires the driver to allocate additional doorbell pages for each user access region. Add support for this in mthca_memfree.c, and have the driver allocate a table in db_tab for kernel use. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:57.857362663 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:59.077097900 -0800 @@ -268,9 +268,10 @@ struct mthca_av_table av_table; struct mthca_mcg_table mcg_table; - struct mthca_uar driver_uar; - struct mthca_pd driver_pd; - struct mthca_mr driver_mr; + struct mthca_uar driver_uar; + struct mthca_db_table *db_tab; + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-03-03 14:12:56.773597912 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-03-03 14:12:59.079097466 -0800 @@ -267,3 +267,199 @@ kfree(table); } + +static u64 mthca_uarc_virt(struct mthca_dev *dev, int page) +{ + return dev->uar_table.uarc_base + + dev->driver_uar.index * dev->uar_table.uarc_size + + page * 
4096; +} + +int mthca_alloc_db(struct mthca_dev *dev, int type, u32 qn, u32 **db) +{ + int group; + int start, end, dir; + int i, j; + struct mthca_db_page *page; + int ret = 0; + u8 status; + + down(&dev->db_tab->mutex); + + switch (type) { + case MTHCA_DB_TYPE_CQ_ARM: + case MTHCA_DB_TYPE_SQ: + group = 0; + start = 0; + end = dev->db_tab->max_group1; + dir = 1; + break; + + case MTHCA_DB_TYPE_CQ_SET_CI: + case MTHCA_DB_TYPE_RQ: + case MTHCA_DB_TYPE_SRQ: + group = 1; + start = dev->db_tab->npages - 1; + end = dev->db_tab->min_group2; + dir = -1; + break; + + default: + return -1; + } + + for (i = start; i != end; i += dir) + if (dev->db_tab->page[i].db_rec && + !bitmap_full(dev->db_tab->page[i].used, + MTHCA_DB_REC_PER_PAGE)) { + page = dev->db_tab->page + i; + goto found; + } + + if (dev->db_tab->max_group1 >= dev->db_tab->min_group2 - 1) { + ret = -ENOMEM; + goto out; + } + + page = dev->db_tab->page + end; + page->db_rec = dma_alloc_coherent(&dev->pdev->dev, 4096, + &page->mapping, GFP_KERNEL); + if (!page->db_rec) { + ret = -ENOMEM; + goto out; + } + memset(page->db_rec, 0, 4096); + + ret = mthca_MAP_ICM_page(dev, page->mapping, mthca_uarc_virt(dev, i), &status); + if (!ret && status) + ret = -EINVAL; + if (ret) { + dma_free_coherent(&dev->pdev->dev, 4096, + page->db_rec, page->mapping); + goto out; + } + + bitmap_zero(page->used, MTHCA_DB_REC_PER_PAGE); + if (group == 0) + ++dev->db_tab->max_group1; + else + --dev->db_tab->min_group2; + +found: + j = find_first_zero_bit(page->used, MTHCA_DB_REC_PER_PAGE); + set_bit(j, page->used); + + if (group == 1) + j = MTHCA_DB_REC_PER_PAGE - 1 - j; + + ret = i * MTHCA_DB_REC_PER_PAGE + j; + + page->db_rec[j] = cpu_to_be64((qn << 8) | (type << 5)); + + *db = (u32 *) &page->db_rec[j]; + +out: + up(&dev->db_tab->mutex); + + return ret; +} + +void mthca_free_db(struct mthca_dev *dev, int type, int db_index) +{ + int i, j; + struct mthca_db_page *page; + u8 status; + + i = db_index / MTHCA_DB_REC_PER_PAGE; + j = db_index % 
MTHCA_DB_REC_PER_PAGE; + + page = dev->db_tab->page + i; + + down(&dev->db_tab->mutex); + + page->db_rec[j] = 0; + if (i >= dev->db_tab->min_group2) + j = MTHCA_DB_REC_PER_PAGE - 1 - j; + clear_bit(j, page->used); + + if (bitmap_empty(page->used, MTHCA_DB_REC_PER_PAGE) && + i >= dev->db_tab->max_group1 - 1) { + mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, i), 1, &status); + + dma_free_coherent(&dev->pdev->dev, 4096, + page->db_rec, page->mapping); + page->db_rec = NULL; + + if (i == dev->db_tab->max_group1) { + --dev->db_tab->max_group1; + /* XXX may be able to unmap more pages now */ + } + if (i == dev->db_tab->min_group2) + ++dev->db_tab->min_group2; + } + + up(&dev->db_tab->mutex); +} + +int mthca_init_db_tab(struct mthca_dev *dev) +{ + int i; + + if (dev->hca_type != ARBEL_NATIVE) + return 0; + + dev->db_tab = kmalloc(sizeof *dev->db_tab, GFP_KERNEL); + if (!dev->db_tab) + return -ENOMEM; + + init_MUTEX(&dev->db_tab->mutex); + + dev->db_tab->npages = dev->uar_table.uarc_size / PAGE_SIZE; + dev->db_tab->max_group1 = 0; + dev->db_tab->min_group2 = dev->db_tab->npages - 1; + + dev->db_tab->page = kmalloc(dev->db_tab->npages * + sizeof *dev->db_tab->page, + GFP_KERNEL); + if (!dev->db_tab->page) { + kfree(dev->db_tab); + return -ENOMEM; + } + + for (i = 0; i < dev->db_tab->npages; ++i) + dev->db_tab->page[i].db_rec = NULL; + + return 0; +} + +void mthca_cleanup_db_tab(struct mthca_dev *dev) +{ + int i; + u8 status; + + if (dev->hca_type != ARBEL_NATIVE) + return; + + /* + * Because we don't always free our UARC pages when they + * become empty to make mthca_free_db() simpler we need to + * make a sweep through the doorbell pages and free any + * leftover pages now. 
+ */ + for (i = 0; i < dev->db_tab->npages; ++i) { + if (!dev->db_tab->page[i].db_rec) + continue; + + if (!bitmap_empty(dev->db_tab->page[i].used, MTHCA_DB_REC_PER_PAGE)) + mthca_warn(dev, "Kernel UARC page %d not empty\n", i); + + mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, i), 1, &status); + + dma_free_coherent(&dev->pdev->dev, 4096, + dev->db_tab->page[i].db_rec, + dev->db_tab->page[i].mapping); + } + + kfree(dev->db_tab->page); + kfree(dev->db_tab); +} --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-03-03 14:12:56.773597912 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-03-03 14:12:59.078097683 -0800 @@ -125,4 +125,37 @@ return sg_dma_len(&iter->chunk->mem[iter->page_idx]); } +enum { + MTHCA_DB_REC_PER_PAGE = 4096 / 8 +}; + +struct mthca_db_page { + DECLARE_BITMAP(used, MTHCA_DB_REC_PER_PAGE); + u64 *db_rec; + dma_addr_t mapping; +}; + +struct mthca_db_table { + int npages; + int max_group1; + int min_group2; + struct mthca_db_page *page; + struct semaphore mutex; +}; + +enum { + MTHCA_DB_TYPE_INVALID = 0x0, + MTHCA_DB_TYPE_CQ_SET_CI = 0x1, + MTHCA_DB_TYPE_CQ_ARM = 0x2, + MTHCA_DB_TYPE_SQ = 0x3, + MTHCA_DB_TYPE_RQ = 0x4, + MTHCA_DB_TYPE_SRQ = 0x5, + MTHCA_DB_TYPE_GROUP_SEP = 0x7 +}; + +int mthca_init_db_tab(struct mthca_dev *dev); +void mthca_cleanup_db_tab(struct mthca_dev *dev); +int mthca_alloc_db(struct mthca_dev *dev, int type, u32 qn, u32 **db); +void mthca_free_db(struct mthca_dev *dev, int type, int db_index); + #endif /* MTHCA_MEMFREE_H */ --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c 2005-03-03 14:12:56.153732464 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c 2005-03-03 14:12:59.078097683 -0800 @@ -244,9 +244,11 @@ dev->av_table.num_ddr_avs = profile[i].num; break; case MTHCA_RES_UARC: - init_hca->uarc_base = profile[i].start; - init_hca->log_uarc_sz = ffs(request->uarc_size) - 13; - init_hca->log_uar_sz = ffs(request->num_uar) - 1; + dev->uar_table.uarc_size = 
request->uarc_size; + dev->uar_table.uarc_base = profile[i].start; + init_hca->uarc_base = profile[i].start; + init_hca->log_uarc_sz = ffs(request->uarc_size) - 13; + init_hca->log_uar_sz = ffs(request->num_uar) - 1; break; default: break; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_uar.c 2005-03-03 14:12:56.152732681 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_uar.c 2005-03-03 14:12:59.078097683 -0800 @@ -33,6 +33,7 @@ */ #include "mthca_dev.h" +#include "mthca_memfree.h" int mthca_uar_alloc(struct mthca_dev *dev, struct mthca_uar *uar) { @@ -58,12 +59,20 @@ dev->limits.num_uars, dev->limits.num_uars - 1, dev->limits.reserved_uars); + if (ret) + return ret; + + ret = mthca_init_db_tab(dev); + if (ret) + mthca_alloc_cleanup(&dev->uar_table.alloc); return ret; } void mthca_cleanup_uar_table(struct mthca_dev *dev) { + mthca_cleanup_db_tab(dev); + /* XXX check if any UARs are still allocated? */ mthca_alloc_cleanup(&dev->uar_table.alloc); } From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][17/26] IB/mthca: refactor CQ buffer allocate/free In-Reply-To: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> Message-ID: <2005331520.tIEkOrHOmFDGvQZK@topspin.com> Factor the allocation and freeing of completion queue buffers into mthca_alloc_cq_buf() and mthca_free_cq_buf(). This makes the code more readable and will eventually make handling userspace CQs simpler (the kernel doesn't have to allocate a buffer at all). 
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:56.153732464 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:59.925913650 -0800 @@ -557,32 +557,40 @@ MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } -int mthca_init_cq(struct mthca_dev *dev, int nent, - struct mthca_cq *cq) +static void mthca_free_cq_buf(struct mthca_dev *dev, struct mthca_cq *cq) { - int size = nent * MTHCA_CQ_ENTRY_SIZE; - dma_addr_t t; - void *mailbox = NULL; - int npages, shift; - u64 *dma_list = NULL; - struct mthca_cq_context *cq_context; - int err = -ENOMEM; - u8 status; int i; + int size; - might_sleep(); + if (cq->is_direct) + pci_free_consistent(dev->pdev, + (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + size = (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE; + for (i = 0; i < (size + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); - mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, - GFP_KERNEL); - if (!mailbox) - goto err_out; + kfree(cq->queue.page_list); + } +} - cq_context = MAILBOX_ALIGN(mailbox); +static int mthca_alloc_cq_buf(struct mthca_dev *dev, int size, + struct mthca_cq *cq) +{ + int err = -ENOMEM; + int npages, shift; + u64 *dma_list = NULL; + dma_addr_t t; + int i; if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { - if (0) - mthca_dbg(dev, "Creating direct CQ of size %d\n", size); - cq->is_direct = 1; npages = 1; shift = get_order(size) + PAGE_SHIFT; @@ -590,7 +598,7 @@ cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); if (!cq->queue.direct.buf) - goto err_out; + return -ENOMEM; pci_unmap_addr_set(&cq->queue.direct, mapping, t); @@ -603,7 +611,7 @@ dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); if (!dma_list) - goto 
err_out_free; + goto err_free; for (i = 0; i < npages; ++i) dma_list[i] = t + i * (1 << shift); @@ -612,12 +620,9 @@ npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; shift = PAGE_SHIFT; - if (0) - mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); - dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); if (!dma_list) - goto err_out; + return -ENOMEM; cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, GFP_KERNEL); @@ -631,7 +636,7 @@ cq->queue.page_list[i].buf = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); if (!cq->queue.page_list[i].buf) - goto err_out_free; + goto err_free; dma_list[i] = t; pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); @@ -640,13 +645,6 @@ } } - for (i = 0; i < nent; ++i) - set_cqe_hw(get_cqe(cq, i)); - - cq->cqn = mthca_alloc(&dev->cq_table.alloc); - if (cq->cqn == -1) - goto err_out_free; - err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, dma_list, shift, npages, 0, size, @@ -654,7 +652,52 @@ MTHCA_MPT_FLAG_LOCAL_READ, &cq->mr); if (err) - goto err_out_free_cq; + goto err_free; + + kfree(dma_list); + + return 0; + +err_free: + mthca_free_cq_buf(dev, cq); + +err_out: + kfree(dma_list); + + return err; +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + void *mailbox = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + cq->ibcq.cqe = nent - 1; + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + return -ENOMEM; + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + err = mthca_alloc_cq_buf(dev, size, cq); + if (err) + goto err_out_mailbox; + + for (i = 0; i < nent; ++i) + set_cqe_hw(get_cqe(cq, i)); spin_lock_init(&cq->lock); atomic_set(&cq->refcount, 1); @@ -697,37 +740,20 @@ cq->cons_index = 0; - kfree(dma_list); kfree(mailbox); 
return 0; - err_out_free_mr: +err_out_free_mr: mthca_free_mr(dev, &cq->mr); + mthca_free_cq_buf(dev, cq); - err_out_free_cq: - mthca_free(&dev->cq_table.alloc, cq->cqn); - - err_out_free: - if (cq->is_direct) - pci_free_consistent(dev->pdev, size, - cq->queue.direct.buf, - pci_unmap_addr(&cq->queue.direct, mapping)); - else { - for (i = 0; i < npages; ++i) - if (cq->queue.page_list[i].buf) - pci_free_consistent(dev->pdev, PAGE_SIZE, - cq->queue.page_list[i].buf, - pci_unmap_addr(&cq->queue.page_list[i], - mapping)); - - kfree(cq->queue.page_list); - } - - err_out: - kfree(dma_list); +err_out_mailbox: kfree(mailbox); +err_out: + mthca_free(&dev->cq_table.alloc, cq->cqn); + return err; } @@ -778,27 +804,7 @@ wait_event(cq->wait, !atomic_read(&cq->refcount)); mthca_free_mr(dev, &cq->mr); - - if (cq->is_direct) - pci_free_consistent(dev->pdev, - (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE, - cq->queue.direct.buf, - pci_unmap_addr(&cq->queue.direct, - mapping)); - else { - int i; - - for (i = 0; - i < ((cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / - PAGE_SIZE; - ++i) - pci_free_consistent(dev->pdev, PAGE_SIZE, - cq->queue.page_list[i].buf, - pci_unmap_addr(&cq->queue.page_list[i], - mapping)); - - kfree(cq->queue.page_list); - } + mthca_free_cq_buf(dev, cq); mthca_free(&dev->cq_table.alloc, cq->cqn); kfree(mailbox); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:12:54.673053870 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:12:59.925913650 -0800 @@ -408,8 +408,7 @@ if (err) { kfree(cq); cq = ERR_PTR(err); - } else - cq->ibcq.cqe = nent - 1; + } return &cq->ibcq; } From roland at topspin.com Thu Mar 3 15:20:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:27 -0800 Subject: [openib-general] [PATCH][16/26] IB/mthca: mem-free doorbell record writing In-Reply-To: <2005331520.dH2BeQ6Ko7h8SaKM@topspin.com> Message-ID: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> Add a 
mthca_write_db_rec() to wrap writing doorbell records. On 64-bit archs, this is just a 64-bit write, while on 32-bit archs it splits the write into two 32-bit writes with a memory barrier to make sure the two halves of the record are written in the correct order. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_doorbell.h 2005-01-25 20:49:05.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_doorbell.h 2005-03-03 14:12:59.570990692 -0800 @@ -57,6 +57,11 @@ __raw_writeq(*(u64 *) val, dest); } +static inline void mthca_write_db_rec(u32 val[2], u32 *db) +{ + *(u64 *) db = *(u64 *) val; +} + #else /* @@ -80,4 +85,11 @@ spin_unlock_irqrestore(doorbell_lock, flags); } +static inline void mthca_write_db_rec(u32 val[2], u32 *db) +{ + db[0] = val[0]; + wmb(); + db[1] = val[1]; +} + #endif From roland at topspin.com Thu Mar 3 15:20:28 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:28 -0800 Subject: [openib-general] [PATCH][18/26] IB/mthca: mem-free CQ initialization In-Reply-To: <2005331520.tIEkOrHOmFDGvQZK@topspin.com> Message-ID: <2005331520.xvxJqi7Nfv5UdZpQ@topspin.com> Update CQ initialization and cleanup to handle mem-free mode: we need to make sure the HCA has memory mapped for the entry in the CQ context table we will use and also allocate doorbell records. 
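(Illustrative note, not part of the patch.) The extra mem-free setup is acquired in a fixed order (ICM table reference, then the set-CI doorbell record, then the arm doorbell record) and has to be released in reverse order on failure. The sketch below shows only that goto-unwind pattern with toy stand-ins: toy_acquire(), toy_release(), struct toy_log and the STEP_* names are hypothetical and do not exist in the driver.

```c
#include <stddef.h>

/* Hypothetical step names mirroring the order used for mem-free CQ setup. */
enum { STEP_ICM, STEP_SET_CI_DB, STEP_ARM_DB, STEP_COUNT };

/* Records which steps were released, in the order they were released. */
struct toy_log {
	int released[STEP_COUNT];
	size_t n;
};

/* Pretend to acquire a resource; fail when step == fail_at. */
static int toy_acquire(int step, int fail_at)
{
	return step == fail_at ? -1 : 0;
}

static void toy_release(struct toy_log *log, int step)
{
	log->released[log->n++] = step;
}

/* Acquire in order; on failure, fall through the labels to undo in reverse. */
static int toy_init_cq(int fail_at, struct toy_log *log)
{
	int err;

	log->n = 0;

	err = toy_acquire(STEP_ICM, fail_at);
	if (err)
		goto err_out;

	err = toy_acquire(STEP_SET_CI_DB, fail_at);
	if (err)
		goto err_out_icm;

	err = toy_acquire(STEP_ARM_DB, fail_at);
	if (err)
		goto err_out_ci;

	return 0;

err_out_ci:
	toy_release(log, STEP_SET_CI_DB);
err_out_icm:
	toy_release(log, STEP_ICM);
err_out:
	return err;
}
```

Each label undoes exactly the steps that succeeded before the failing one, which is why the order of the labels is the mirror image of the order of the acquisitions.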
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:12:59.925913650 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:13:00.312829664 -0800 @@ -39,6 +39,7 @@ #include "mthca_dev.h" #include "mthca_cmd.h" +#include "mthca_memfree.h" enum { MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE @@ -55,7 +56,7 @@ u32 flags; u64 start; u32 logsize_usrpage; - u32 error_eqn; + u32 error_eqn; /* Tavor only */ u32 comp_eqn; u32 pd; u32 lkey; @@ -64,7 +65,9 @@ u32 consumer_index; u32 producer_index; u32 cqn; - u32 reserved[3]; + u32 ci_db; /* Arbel only */ + u32 state_db; /* Arbel only */ + u32 reserved; } __attribute__((packed)); #define MTHCA_CQ_STATUS_OK ( 0 << 28) @@ -685,10 +688,30 @@ if (cq->cqn == -1) return -ENOMEM; + if (dev->hca_type == ARBEL_NATIVE) { + cq->arm_sn = 1; + + err = mthca_table_get(dev, dev->cq_table.table, cq->cqn); + if (err) + goto err_out; + + err = -ENOMEM; + + cq->set_ci_db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, + cq->cqn, &cq->set_ci_db); + if (cq->set_ci_db_index < 0) + goto err_out_icm; + + cq->arm_db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_CQ_ARM, + cq->cqn, &cq->arm_db); + if (cq->arm_db_index < 0) + goto err_out_ci; + } + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); if (!mailbox) - goto err_out; + goto err_out_mailbox; cq_context = MAILBOX_ALIGN(mailbox); @@ -716,6 +739,11 @@ cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); cq_context->cqn = cpu_to_be32(cq->cqn); + if (dev->hca_type == ARBEL_NATIVE) { + cq_context->ci_db = cpu_to_be32(cq->set_ci_db_index); + cq_context->state_db = cpu_to_be32(cq->arm_db_index); + } + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); if (err) { mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); @@ -751,6 +779,14 @@ err_out_mailbox: kfree(mailbox); + mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index); + +err_out_ci: + mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, 
cq->set_ci_db_index); + +err_out_icm: + mthca_table_put(dev, dev->cq_table.table, cq->cqn); + err_out: mthca_free(&dev->cq_table.alloc, cq->cqn); @@ -806,6 +842,12 @@ mthca_free_mr(dev, &cq->mr); mthca_free_cq_buf(dev, cq); + if (dev->hca_type == ARBEL_NATIVE) { + mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index); + mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index); + mthca_table_put(dev, dev->cq_table.table, cq->cqn); + } + mthca_free(&dev->cq_table.alloc, cq->cqn); kfree(mailbox); } --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:12:57.858362446 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:00.312829664 -0800 @@ -143,6 +143,14 @@ int cqn; int cons_index; int is_direct; + + /* Next fields are Arbel only */ + int set_ci_db_index; + u32 *set_ci_db; + int arm_db_index; + u32 *arm_db; + int arm_sn; + union { struct mthca_buf_list direct; struct mthca_buf_list *page_list; From roland at topspin.com Thu Mar 3 15:20:28 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:28 -0800 Subject: [openib-general] [PATCH][19/26] IB/mthca: mem-free CQ operations In-Reply-To: <2005331520.xvxJqi7Nfv5UdZpQ@topspin.com> Message-ID: <2005331520.VEavoMG964z0bUT1@topspin.com> Add support for CQ data path operations (request notification, update consumer index) in mem-free mode. 
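(Illustrative note, not part of the patch.) One visible effect of this change is that the consumer index becomes a free-running 32-bit counter that is masked only when a CQ entry is looked up, instead of being wrapped on every increment. A minimal sketch of that idea follows; struct toy_cq and the toy_* helpers are hypothetical, and the sketch assumes the number of entries is a power of two so that (nent - 1) works as a mask.

```c
#include <stdint.h>

/* Hypothetical, stripped-down CQ state for this sketch. */
struct toy_cq {
	uint32_t cons_index;	/* free-running, never wrapped by hand */
	uint32_t cqe;		/* nent - 1, with nent assumed a power of two */
};

/* Slot of the next entry to examine: mask only at lookup time. */
static uint32_t toy_cq_slot(const struct toy_cq *cq)
{
	return cq->cons_index & cq->cqe;
}

/* Consume n entries: plain unsigned addition, wrapping at 2^32. */
static void toy_cq_consume(struct toy_cq *cq, uint32_t n)
{
	cq->cons_index += n;
}
```

Because 2^32 is a multiple of any power-of-two queue size, the masked slot stays consistent even when the 32-bit counter itself overflows.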
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:13:00.312829664 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:13:01.214633912 -0800 @@ -136,11 +136,15 @@ #define MTHCA_CQ_ENTRY_OWNER_SW (0 << 7) #define MTHCA_CQ_ENTRY_OWNER_HW (1 << 7) -#define MTHCA_CQ_DB_INC_CI (1 << 24) -#define MTHCA_CQ_DB_REQ_NOT (2 << 24) -#define MTHCA_CQ_DB_REQ_NOT_SOL (3 << 24) -#define MTHCA_CQ_DB_SET_CI (4 << 24) -#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24) +#define MTHCA_TAVOR_CQ_DB_INC_CI (1 << 24) +#define MTHCA_TAVOR_CQ_DB_REQ_NOT (2 << 24) +#define MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL (3 << 24) +#define MTHCA_TAVOR_CQ_DB_SET_CI (4 << 24) +#define MTHCA_TAVOR_CQ_DB_REQ_NOT_MULT (5 << 24) + +#define MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL (1 << 24) +#define MTHCA_ARBEL_CQ_DB_REQ_NOT (2 << 24) +#define MTHCA_ARBEL_CQ_DB_REQ_NOT_MULT (3 << 24) static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) { @@ -159,7 +163,7 @@ static inline struct mthca_cqe *next_cqe_sw(struct mthca_cq *cq) { - return cqe_sw(cq, cq->cons_index); + return cqe_sw(cq, cq->cons_index & cq->ibcq.cqe); } static inline void set_cqe_hw(struct mthca_cqe *cqe) @@ -167,17 +171,26 @@ cqe->owner = MTHCA_CQ_ENTRY_OWNER_HW; } -static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, - int nent) +/* + * incr is ignored in native Arbel (mem-free) mode, so cq->cons_index + * should be correct before calling update_cons_index(). 
+ */ +static inline void update_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, + int incr) { u32 doorbell[2]; - doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn); - doorbell[1] = cpu_to_be32(nent - 1); + if (dev->hca_type == ARBEL_NATIVE) { + *cq->set_ci_db = cpu_to_be32(cq->cons_index); + wmb(); + } else { + doorbell[0] = cpu_to_be32(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn); + doorbell[1] = cpu_to_be32(incr - 1); - mthca_write64(doorbell, - dev->kar + MTHCA_CQ_DOORBELL, - MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } } void mthca_cq_event(struct mthca_dev *dev, u32 cqn) @@ -191,6 +204,8 @@ return; } + ++cq->arm_sn; + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); } @@ -247,8 +262,8 @@ if (nfreed) { wmb(); - inc_cons_index(dev, cq, nfreed); - cq->cons_index = (cq->cons_index + nfreed) & cq->ibcq.cqe; + cq->cons_index += nfreed; + update_cons_index(dev, cq, nfreed); } spin_unlock_irq(&cq->lock); @@ -341,7 +356,7 @@ break; } - err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + err = mthca_free_err_wqe(dev, qp, is_send, wqe_index, &dbd, &new_wqe); if (err) return err; @@ -411,7 +426,7 @@ if (*cur_qp) { if (*freed) { wmb(); - inc_cons_index(dev, cq, *freed); + update_cons_index(dev, cq, *freed); *freed = 0; } spin_unlock(&(*cur_qp)->lock); @@ -505,7 +520,7 @@ if (likely(free_cqe)) { set_cqe_hw(cqe); ++(*freed); - cq->cons_index = (cq->cons_index + 1) & cq->ibcq.cqe; + ++cq->cons_index; } return err; @@ -533,7 +548,7 @@ if (freed) { wmb(); - inc_cons_index(dev, cq, freed); + update_cons_index(dev, cq, freed); } if (qp) @@ -544,20 +559,57 @@ return err == 0 || err == -EAGAIN ? npolled : err; } -void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, - int solicited) +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify) { u32 doorbell[2]; - doorbell[0] = cpu_to_be32((solicited ? 
- MTHCA_CQ_DB_REQ_NOT_SOL : - MTHCA_CQ_DB_REQ_NOT) | - cq->cqn); + doorbell[0] = cpu_to_be32((notify == IB_CQ_SOLICITED ? + MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL : + MTHCA_TAVOR_CQ_DB_REQ_NOT) | + to_mcq(cq)->cqn); doorbell[1] = 0xffffffff; mthca_write64(doorbell, - dev->kar + MTHCA_CQ_DOORBELL, - MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + to_mdev(cq->device)->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&to_mdev(cq->device)->doorbell_lock)); + + return 0; +} + +int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +{ + struct mthca_cq *cq = to_mcq(ibcq); + u32 doorbell[2]; + u32 sn; + u32 ci; + + sn = cq->arm_sn & 3; + ci = cpu_to_be32(cq->cons_index); + + doorbell[0] = ci; + doorbell[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) | + (notify == IB_CQ_SOLICITED ? 1 : 2)); + + mthca_write_db_rec(doorbell, cq->arm_db); + + /* + * Make sure that the doorbell record in host memory is + * written before ringing the doorbell via PCI MMIO. + */ + wmb(); + + doorbell[0] = cpu_to_be32((sn << 28) | + (notify == IB_CQ_SOLICITED ? 
+ MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL : + MTHCA_ARBEL_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = ci; + + mthca_write64(doorbell, + to_mdev(ibcq->device)->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&to_mdev(ibcq->device)->doorbell_lock)); + + return 0; } static void mthca_free_cq_buf(struct mthca_dev *dev, struct mthca_cq *cq) --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:12:59.077097900 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:01.213634129 -0800 @@ -368,8 +368,8 @@ int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, - int solicited); +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); +int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); int mthca_init_cq(struct mthca_dev *dev, int nent, struct mthca_cq *cq); void mthca_free_cq(struct mthca_dev *dev, @@ -384,7 +384,7 @@ struct ib_send_wr **bad_wr); int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, struct ib_recv_wr **bad_wr); -int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, +int mthca_free_err_wqe(struct mthca_dev *dev, struct mthca_qp *qp, int is_send, int index, int *dbd, u32 *new_wqe); int mthca_alloc_qp(struct mthca_dev *dev, struct mthca_pd *pd, --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:12:59.925913650 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:13:01.213634129 -0800 @@ -421,13 +421,6 @@ return 0; } -static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) -{ - mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), - notify == IB_CQ_SOLICITED); - return 0; -} - static inline u32 convert_access(int acc) { return (acc & IB_ACCESS_REMOTE_ATOMIC ? 
MTHCA_MPT_FLAG_ATOMIC : 0) | @@ -625,7 +618,6 @@ dev->ib_dev.create_cq = mthca_create_cq; dev->ib_dev.destroy_cq = mthca_destroy_cq; dev->ib_dev.poll_cq = mthca_poll_cq; - dev->ib_dev.req_notify_cq = mthca_req_notify_cq; dev->ib_dev.get_dma_mr = mthca_get_dma_mr; dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; dev->ib_dev.dereg_mr = mthca_dereg_mr; @@ -633,6 +625,11 @@ dev->ib_dev.detach_mcast = mthca_multicast_detach; dev->ib_dev.process_mad = mthca_process_mad; + if (dev->hca_type == ARBEL_NATIVE) + dev->ib_dev.req_notify_cq = mthca_arbel_arm_cq; + else + dev->ib_dev.req_notify_cq = mthca_tavor_arm_cq; + init_MUTEX(&dev->cap_mask_mutex); ret = ib_register_device(&dev->ib_dev); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:00.312829664 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:01.213634129 -0800 @@ -141,7 +141,7 @@ spinlock_t lock; atomic_t refcount; int cqn; - int cons_index; + u32 cons_index; int is_direct; /* Next fields are Arbel only */ --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:12:56.155732030 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:01.215633695 -0800 @@ -1551,7 +1551,7 @@ return err; } -int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, +int mthca_free_err_wqe(struct mthca_dev *dev, struct mthca_qp *qp, int is_send, int index, int *dbd, u32 *new_wqe) { struct mthca_next_seg *next; @@ -1561,7 +1561,10 @@ else next = get_recv_wqe(qp, index); - *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (dev->hca_type == ARBEL_NATIVE) + *dbd = 1; + else + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); if (next->ee_nds & cpu_to_be32(0x3f)) *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | (next->ee_nds & cpu_to_be32(0x3f)); From roland at topspin.com Thu Mar 3 15:20:28 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:28 -0800 Subject: [openib-general] [PATCH][21/26] 
IB/mthca: mem-free address vectors In-Reply-To: <2005331520.7k4CdyDk307HOUr6@topspin.com> Message-ID: <2005331520.kqgduGt72iMbbNeg@topspin.com> Update address vector handling to support mem-free mode. In mem-free mode, the address vector (in hardware format) is copied by the driver into each send work queue entry, so our address handle creation can become pretty trivial: we just kmalloc() a buffer to hold the formatted address vector. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_av.c 2005-01-15 15:19:30.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_av.c 2005-03-03 14:13:02.121437076 -0800 @@ -60,27 +60,34 @@ u32 index = -1; struct mthca_av *av = NULL; - ah->on_hca = 0; + ah->type = MTHCA_AH_PCI_POOL; - if (!atomic_read(&pd->sqp_count) && - !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + if (dev->hca_type == ARBEL_NATIVE) { + ah->av = kmalloc(sizeof *ah->av, GFP_KERNEL); + if (!ah->av) + return -ENOMEM; + + ah->type = MTHCA_AH_KMALLOC; + av = ah->av; + } else if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { index = mthca_alloc(&dev->av_table.alloc); /* fall back to allocate in host memory */ if (index == -1) - goto host_alloc; + goto on_hca_fail; av = kmalloc(sizeof *av, GFP_KERNEL); if (!av) - goto host_alloc; + goto on_hca_fail; - ah->on_hca = 1; + ah->type = MTHCA_AH_ON_HCA; ah->avdma = dev->av_table.ddr_av_base + index * MTHCA_AV_SIZE; } - host_alloc: - if (!ah->on_hca) { +on_hca_fail: + if (ah->type == MTHCA_AH_PCI_POOL) { ah->av = pci_pool_alloc(dev->av_table.pool, SLAB_KERNEL, &ah->avdma); if (!ah->av) @@ -123,7 +130,7 @@ j * 4, be32_to_cpu(((u32 *) av)[j])); } - if (ah->on_hca) { + if (ah->type == MTHCA_AH_ON_HCA) { memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, av, MTHCA_AV_SIZE); kfree(av); @@ -134,12 +141,21 @@ int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) { - if (ah->on_hca) + switch (ah->type) { + case MTHCA_AH_ON_HCA: 
mthca_free(&dev->av_table.alloc, (ah->avdma - dev->av_table.ddr_av_base) / MTHCA_AV_SIZE); - else + break; + + case MTHCA_AH_PCI_POOL: pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + break; + + case MTHCA_AH_KMALLOC: + kfree(ah->av); + break; + } return 0; } @@ -147,7 +163,7 @@ int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, struct ib_ud_header *header) { - if (ah->on_hca) + if (ah->type == MTHCA_AH_ON_HCA) return -EINVAL; header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; @@ -176,6 +192,9 @@ { int err; + if (dev->hca_type == ARBEL_NATIVE) + return 0; + err = mthca_alloc_init(&dev->av_table.alloc, dev->av_table.num_ddr_avs, dev->av_table.num_ddr_avs - 1, @@ -212,6 +231,9 @@ void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) { + if (dev->hca_type == ARBEL_NATIVE) + return; + if (dev->av_table.av_map) iounmap(dev->av_table.av_map); pci_pool_destroy(dev->av_table.pool); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:01.712525837 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:02.120437293 -0800 @@ -82,12 +82,18 @@ struct mthca_av; +enum mthca_ah_type { + MTHCA_AH_ON_HCA, + MTHCA_AH_PCI_POOL, + MTHCA_AH_KMALLOC +}; + struct mthca_ah { - struct ib_ah ibah; - int on_hca; - u32 key; - struct mthca_av *av; - dma_addr_t avdma; + struct ib_ah ibah; + enum mthca_ah_type type; + u32 key; + struct mthca_av *av; + dma_addr_t avdma; }; /* From roland at topspin.com Thu Mar 3 15:20:28 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:28 -0800 Subject: [openib-general] [PATCH][20/26] IB/mthca: mem-free QP initialization In-Reply-To: <2005331520.VEavoMG964z0bUT1@topspin.com> Message-ID: <2005331520.7k4CdyDk307HOUr6@topspin.com> Update QP initialization and cleanup to handle mem-free mode. 
In mem-free mode, work queue sizes have to be rounded up to a power of 2, we need to allocate doorbells, there must be memory mapped for the entries in the QP and extended QP context table that we use, and the entries of the receive queue must be initialized. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:01.213634129 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:01.712525837 -0800 @@ -167,6 +167,9 @@ void *last; int max_gs; int wqe_shift; + + int db_index; /* Arbel only */ + u32 *db; }; struct mthca_qp { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:01.215633695 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:01.713525620 -0800 @@ -40,6 +40,7 @@ #include "mthca_dev.h" #include "mthca_cmd.h" +#include "mthca_memfree.h" enum { MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, @@ -105,8 +106,11 @@ struct mthca_qp_context { u32 flags; - u32 sched_queue; - u32 mtu_msgmax; + u32 tavor_sched_queue; /* Reserved on Arbel */ + u8 mtu_msgmax; + u8 rq_size_stride; /* Reserved on Tavor */ + u8 sq_size_stride; /* Reserved on Tavor */ + u8 rlkey_arbel_sched_queue; /* Reserved on Tavor */ u32 usr_page; u32 local_qpn; u32 remote_qpn; @@ -121,18 +125,22 @@ u32 reserved2; u32 next_send_psn; u32 cqn_snd; - u32 next_snd_wqe[2]; + u32 snd_wqe_base_l; /* Next send WQE on Tavor */ + u32 snd_db_index; /* (debugging only entries) */ u32 last_acked_psn; u32 ssn; u32 params2; u32 rnr_nextrecvpsn; u32 ra_buff_indx; u32 cqn_rcv; - u32 next_rcv_wqe[2]; + u32 rcv_wqe_base_l; /* Next recv WQE on Tavor */ + u32 rcv_db_index; /* (debugging only entries) */ u32 qkey; u32 srqn; u32 rmsn; - u32 reserved3[19]; + u16 rq_wqe_counter; /* reserved on Tavor */ + u16 sq_wqe_counter; /* reserved on Tavor */ + u32 reserved3[18]; } __attribute__((packed)); struct mthca_qp_param { @@ -193,7 +201,7 @@ u32 imm; /* immediate data */ }; -struct mthca_ud_seg { 
+struct mthca_tavor_ud_seg { u32 reserved1; u32 lkey; u64 av_addr; @@ -203,6 +211,13 @@ u32 reserved3[2]; }; +struct mthca_arbel_ud_seg { + u32 av[8]; + u32 dqpn; + u32 qkey; + u32 reserved[2]; +}; + struct mthca_bind_seg { u32 flags; /* [31] Atomic [30] rem write [29] rem read */ u32 reserved; @@ -617,14 +632,24 @@ break; } } - /* leave sched_queue as 0 */ + + /* leave tavor_sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) - qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | - (11 << 24)); + qp_context->mtu_msgmax = (IB_MTU_2048 << 5) | 11; else if (attr_mask & IB_QP_PATH_MTU) { - qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | - (31 << 24)); + qp_context->mtu_msgmax = (attr->path_mtu << 5) | 31; + } + + if (dev->hca_type == ARBEL_NATIVE) { + qp_context->rq_size_stride = + ((ffs(qp->rq.max) - 1) << 3) | (qp->rq.wqe_shift - 4); + qp_context->sq_size_stride = + ((ffs(qp->sq.max) - 1) << 3) | (qp->sq.wqe_shift - 4); } + + /* leave arbel_sched_queue as 0 */ + qp_context->usr_page = cpu_to_be32(dev->driver_uar.index); qp_context->local_qpn = cpu_to_be32(qp->qpn); if (attr_mask & IB_QP_DEST_QPN) { @@ -708,6 +733,11 @@ qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + if (dev->hca_type == ARBEL_NATIVE) { + qp_context->snd_wqe_base_l = cpu_to_be32(qp->send_wqe_offset); + qp_context->snd_db_index = cpu_to_be32(qp->sq.db_index); + } + if (attr_mask & IB_QP_ACCESS_FLAGS) { /* * Only enable RDMA/atomics if we have responder @@ -787,12 +817,16 @@ if (attr_mask & IB_QP_RQ_PSN) qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); - qp_context->ra_buff_indx = dev->qp_table.rdb_base + - ((qp->qpn & (dev->limits.num_qps - 1)) * MTHCA_RDB_ENTRY_SIZE << - dev->qp_table.rdb_shift); + qp_context->ra_buff_indx = + cpu_to_be32(dev->qp_table.rdb_base + + ((qp->qpn & (dev->limits.num_qps - 1)) * MTHCA_RDB_ENTRY_SIZE << + dev->qp_table.rdb_shift)); qp_context->cqn_rcv = 
cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + if (dev->hca_type == ARBEL_NATIVE) + qp_context->rcv_db_index = cpu_to_be32(qp->rq.db_index); + if (attr_mask & IB_QP_QKEY) { qp_context->qkey = cpu_to_be32(attr->qkey); qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); @@ -860,12 +894,20 @@ size = sizeof (struct mthca_next_seg) + qp->sq.max_gs * sizeof (struct mthca_data_seg); - if (qp->transport == MLX) + switch (qp->transport) { + case MLX: size += 2 * sizeof (struct mthca_data_seg); - else if (qp->transport == UD) - size += sizeof (struct mthca_ud_seg); - else /* bind seg is as big as atomic + raddr segs */ + break; + case UD: + if (dev->hca_type == ARBEL_NATIVE) + size += sizeof (struct mthca_arbel_ud_seg); + else + size += sizeof (struct mthca_tavor_ud_seg); + break; + default: + /* bind seg is as big as atomic + raddr segs */ size += sizeof (struct mthca_bind_seg); + } for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; qp->sq.wqe_shift++) @@ -942,7 +984,6 @@ err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, npages, 0, size, - MTHCA_MPT_FLAG_LOCAL_WRITE | MTHCA_MPT_FLAG_LOCAL_READ, &qp->mr); if (err) @@ -972,6 +1013,60 @@ return err; } +static int mthca_alloc_memfree(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + int ret = 0; + + if (dev->hca_type == ARBEL_NATIVE) { + ret = mthca_table_get(dev, dev->qp_table.qp_table, qp->qpn); + if (ret) + return ret; + + ret = mthca_table_get(dev, dev->qp_table.eqp_table, qp->qpn); + if (ret) + goto err_qpc; + + qp->rq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_RQ, + qp->qpn, &qp->rq.db); + if (qp->rq.db_index < 0) { + ret = -ENOMEM; + goto err_eqpc; + } + + qp->sq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_SQ, + qp->qpn, &qp->sq.db); + if (qp->sq.db_index < 0) { + ret = -ENOMEM; + goto err_rq_db; + } + } + + return 0; + +err_rq_db: + mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index); + +err_eqpc: + mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn); + +err_qpc: + mthca_table_put(dev, 
dev->qp_table.qp_table, qp->qpn); + + return ret; +} + +static void mthca_free_memfree(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + if (dev->hca_type == ARBEL_NATIVE) { + mthca_free_db(dev, MTHCA_DB_TYPE_SQ, qp->sq.db_index); + mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index); + mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn); + mthca_table_put(dev, dev->qp_table.qp_table, qp->qpn); + } +} + static int mthca_alloc_qp_common(struct mthca_dev *dev, struct mthca_pd *pd, struct mthca_cq *send_cq, @@ -979,7 +1074,9 @@ enum ib_sig_type send_policy, struct mthca_qp *qp) { - int err; + struct mthca_next_seg *wqe; + int ret; + int i; spin_lock_init(&qp->lock); atomic_set(&qp->refcount, 1); @@ -996,8 +1093,51 @@ qp->rq.last = NULL; qp->sq.last = NULL; - err = mthca_alloc_wqe_buf(dev, pd, qp); - return err; + ret = mthca_alloc_memfree(dev, qp); + if (ret) + return ret; + + ret = mthca_alloc_wqe_buf(dev, pd, qp); + if (ret) { + mthca_free_memfree(dev, qp); + return ret; + } + + if (dev->hca_type == ARBEL_NATIVE) { + for (i = 0; i < qp->rq.max; ++i) { + wqe = get_recv_wqe(qp, i); + wqe->nda_op = cpu_to_be32(((i + 1) & (qp->rq.max - 1)) << + qp->rq.wqe_shift); + wqe->ee_nds = cpu_to_be32(1 << (qp->rq.wqe_shift - 4)); + } + + for (i = 0; i < qp->sq.max; ++i) { + wqe = get_send_wqe(qp, i); + wqe->nda_op = cpu_to_be32((((i + 1) & (qp->sq.max - 1)) << + qp->sq.wqe_shift) + + qp->send_wqe_offset); + } + } + + return 0; +} + +static void mthca_align_qp_size(struct mthca_dev *dev, struct mthca_qp *qp) +{ + int i; + + if (dev->hca_type != ARBEL_NATIVE) + return; + + for (i = 0; 1 << i < qp->rq.max; ++i) + ; /* nothing */ + + qp->rq.max = 1 << i; + + for (i = 0; 1 << i < qp->sq.max; ++i) + ; /* nothing */ + + qp->sq.max = 1 << i; } int mthca_alloc_qp(struct mthca_dev *dev, @@ -1010,6 +1150,8 @@ { int err; + mthca_align_qp_size(dev, qp); + switch (type) { case IB_QPT_RC: qp->transport = RC; break; case IB_QPT_UC: qp->transport = UC; break; @@ -1048,6 +1190,8 @@ int err = 
0; u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + mthca_align_qp_size(dev, &sqp->qp); + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, &sqp->header_dma, GFP_KERNEL); @@ -1160,14 +1304,15 @@ kfree(qp->wrid); + mthca_free_memfree(dev, qp); + if (is_sqp(dev, qp)) { atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); dma_free_coherent(&dev->pdev->dev, to_msqp(qp)->header_buf_size, to_msqp(qp)->header_buf, to_msqp(qp)->header_dma); - } - else + } else mthca_free(&dev->qp_table.alloc, qp->qpn); } @@ -1350,17 +1495,17 @@ break; case UD: - ((struct mthca_ud_seg *) wqe)->lkey = + ((struct mthca_tavor_ud_seg *) wqe)->lkey = cpu_to_be32(to_mah(wr->wr.ud.ah)->key); - ((struct mthca_ud_seg *) wqe)->av_addr = + ((struct mthca_tavor_ud_seg *) wqe)->av_addr = cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); - ((struct mthca_ud_seg *) wqe)->dqpn = + ((struct mthca_tavor_ud_seg *) wqe)->dqpn = cpu_to_be32(wr->wr.ud.remote_qpn); - ((struct mthca_ud_seg *) wqe)->qkey = + ((struct mthca_tavor_ud_seg *) wqe)->qkey = cpu_to_be32(wr->wr.ud.remote_qkey); - wqe += sizeof (struct mthca_ud_seg); - size += sizeof (struct mthca_ud_seg) / 16; + wqe += sizeof (struct mthca_tavor_ud_seg); + size += sizeof (struct mthca_tavor_ud_seg) / 16; break; case MLX: From roland at topspin.com Thu Mar 3 15:20:28 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:28 -0800 Subject: [openib-general] [PATCH][22/26] IB/mthca: mem-free work request posting In-Reply-To: <2005331520.kqgduGt72iMbbNeg@topspin.com> Message-ID: <2005331520.ADYAIRdSQBiHhYiD@topspin.com> Implement posting send and receive work requests for mem-free mode. Also tidy up a few things in send/receive posting for Tavor mode (fix smp_wmb()s that should really be just wmb()s, annotate tests in the fast path with likely()/unlikely()). 
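For readers following the mem-free posting paths in the diff below, here is a minimal user-space sketch of the index arithmetic they rely on. It assumes a queue whose size has already been rounded up to a power of two, a free-running `next` counter that is masked to find the slot, and a doorbell record holding the low 16 bits of that counter. The names `struct ring`, `ring_init`, `ring_slot`, `ring_post` and `ring_db_record` are hypothetical and do not appear in the driver; the `likely()`/`unlikely()` macros are user-space stand-ins built on the GCC/Clang `__builtin_expect` builtin. Memory barriers, WQE contents and the MMIO doorbell write are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's branch hints (GCC/Clang builtin). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical model of one work queue; this is not the driver's struct. */
struct ring {
	unsigned max;   /* number of slots, must be a power of two */
	unsigned cur;   /* entries currently outstanding */
	unsigned next;  /* free-running count of entries ever posted */
};

static void ring_init(struct ring *r, unsigned max)
{
	/* The masking below only works for power-of-two sizes. */
	assert(max && !(max & (max - 1)));
	r->max  = max;
	r->cur  = 0;
	r->next = 0;
}

/* Slot for the next entry: mask the free-running counter. */
static unsigned ring_slot(const struct ring *r)
{
	return r->next & (r->max - 1);
}

/* Account for nreq new entries; returns 0, or -1 if they do not fit. */
static int ring_post(struct ring *r, unsigned nreq)
{
	if (unlikely(r->cur + nreq > r->max))
		return -1;

	r->cur  += nreq;
	r->next += nreq;
	return 0;
}

/* Value that would be stored in the doorbell record: low 16 bits. */
static uint32_t ring_db_record(const struct ring *r)
{
	return r->next & 0xffff;
}
```

In this sketch the counter is never wrapped by hand: the mask yields the slot, and the 16-bit truncation yields the doorbell record value.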
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:01.213634129 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:02.565340719 -0800 @@ -380,10 +380,14 @@ void mthca_qp_event(struct mthca_dev *dev, u32 qpn, enum ib_event_type event_type); int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); -int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, - struct ib_send_wr **bad_wr); -int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, - struct ib_recv_wr **bad_wr); +int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_arbel_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); int mthca_free_err_wqe(struct mthca_dev *dev, struct mthca_qp *qp, int is_send, int index, int *dbd, u32 *new_wqe); int mthca_alloc_qp(struct mthca_dev *dev, --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:13:01.213634129 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:13:02.566340502 -0800 @@ -613,8 +613,6 @@ dev->ib_dev.create_qp = mthca_create_qp; dev->ib_dev.modify_qp = mthca_modify_qp; dev->ib_dev.destroy_qp = mthca_destroy_qp; - dev->ib_dev.post_send = mthca_post_send; - dev->ib_dev.post_recv = mthca_post_receive; dev->ib_dev.create_cq = mthca_create_cq; dev->ib_dev.destroy_cq = mthca_destroy_cq; dev->ib_dev.poll_cq = mthca_poll_cq; @@ -625,10 +623,15 @@ dev->ib_dev.detach_mcast = mthca_multicast_detach; dev->ib_dev.process_mad = mthca_process_mad; - if (dev->hca_type == ARBEL_NATIVE) + if (dev->hca_type == ARBEL_NATIVE) { dev->ib_dev.req_notify_cq = mthca_arbel_arm_cq; - else + dev->ib_dev.post_send = 
mthca_arbel_post_send; + dev->ib_dev.post_recv = mthca_arbel_post_receive; + } else { dev->ib_dev.req_notify_cq = mthca_tavor_arm_cq; + dev->ib_dev.post_send = mthca_tavor_post_send; + dev->ib_dev.post_recv = mthca_tavor_post_receive; + } init_MUTEX(&dev->cap_mask_mutex); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:01.713525620 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:02.567340285 -0800 @@ -253,6 +253,16 @@ u16 vcrc; }; +static const u8 mthca_opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, +}; + static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) { return qp->qpn >= dev->qp_table.sqp_start && @@ -637,9 +647,8 @@ if (qp->transport == MLX || qp->transport == UD) qp_context->mtu_msgmax = (IB_MTU_2048 << 5) | 11; - else if (attr_mask & IB_QP_PATH_MTU) { + else if (attr_mask & IB_QP_PATH_MTU) qp_context->mtu_msgmax = (attr->path_mtu << 5) | 31; - } if (dev->hca_type == ARBEL_NATIVE) { qp_context->rq_size_stride = @@ -1385,8 +1394,8 @@ return 0; } -int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, - struct ib_send_wr **bad_wr) +int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); @@ -1402,16 +1411,6 @@ int ind; u8 op0 = 0; - static const u8 opcode[] = { - [IB_WR_SEND] = MTHCA_OPCODE_SEND, - [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, - [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, - [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, - [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, - [IB_WR_ATOMIC_CMP_AND_SWP] = 
MTHCA_OPCODE_ATOMIC_CS, - [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, - }; - spin_lock_irqsave(&qp->lock, flags); /* XXX check that state is OK to post send */ @@ -1550,7 +1549,7 @@ qp->wrid[ind + qp->rq.max] = wr->wr_id; - if (wr->opcode >= ARRAY_SIZE(opcode)) { + if (wr->opcode >= ARRAY_SIZE(mthca_opcode)) { mthca_err(dev, "opcode invalid\n"); err = -EINVAL; *bad_wr = wr; @@ -1561,15 +1560,15 @@ ((struct mthca_next_seg *) prev_wqe)->nda_op = cpu_to_be32(((ind << qp->sq.wqe_shift) + qp->send_wqe_offset) | - opcode[wr->opcode]); - smp_wmb(); + mthca_opcode[wr->opcode]); + wmb(); ((struct mthca_next_seg *) prev_wqe)->ee_nds = cpu_to_be32((size0 ? 0 : MTHCA_NEXT_DBD) | size); } if (!size0) { size0 = size; - op0 = opcode[wr->opcode]; + op0 = mthca_opcode[wr->opcode]; } ++ind; @@ -1578,7 +1577,7 @@ } out: - if (nreq) { + if (likely(nreq)) { u32 doorbell[2]; doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + @@ -1599,8 +1598,8 @@ return err; } -int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, - struct ib_recv_wr **bad_wr) +int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); @@ -1621,7 +1620,7 @@ ind = qp->rq.next; for (nreq = 0; wr; ++nreq, wr = wr->next) { - if (qp->rq.cur + nreq >= qp->rq.max) { + if (unlikely(qp->rq.cur + nreq >= qp->rq.max)) { mthca_err(dev, "RQ %06x full\n", qp->qpn); err = -ENOMEM; *bad_wr = wr; @@ -1640,7 +1639,7 @@ wqe += sizeof (struct mthca_next_seg); size = sizeof (struct mthca_next_seg) / 16; - if (wr->num_sge > qp->rq.max_gs) { + if (unlikely(wr->num_sge > qp->rq.max_gs)) { err = -EINVAL; *bad_wr = wr; goto out; @@ -1659,10 +1658,10 @@ qp->wrid[ind] = wr->wr_id; - if (prev_wqe) { + if (likely(prev_wqe)) { ((struct mthca_next_seg *) prev_wqe)->nda_op = cpu_to_be32((ind << qp->rq.wqe_shift) | 1); - smp_wmb(); + wmb(); ((struct mthca_next_seg *) prev_wqe)->ee_nds = 
cpu_to_be32(MTHCA_NEXT_DBD | size); } @@ -1676,7 +1675,7 @@ } out: - if (nreq) { + if (likely(nreq)) { u32 doorbell[2]; doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); @@ -1696,6 +1695,247 @@ return err; } +int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next & (qp->sq.max - 1); + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+ cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + cpu_to_be32(1); + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ((struct mthca_next_seg *) wqe)->imm = wr->imm_data; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + switch (qp->transport) { + case UD: + memcpy(((struct mthca_arbel_ud_seg *) wqe)->av, + to_mah(wr->wr.ud.ah)->av, MTHCA_AV_SIZE); + ((struct mthca_arbel_ud_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mthca_arbel_ud_seg *) wqe)->qkey = + cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mthca_arbel_ud_seg); + size += sizeof (struct mthca_arbel_ud_seg) / 16; + break; + + case MLX: + err = build_mlx_header(dev, to_msqp(qp), ind, wr, + wqe - sizeof (struct mthca_next_seg), + wqe); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + break; + } + + if (wr->num_sge > qp->sq.max_gs) { + mthca_err(dev, "too many gathers\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC */ + if (qp->transport == MLX) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind + qp->rq.max] = wr->wr_id; + + if (wr->opcode >= ARRAY_SIZE(mthca_opcode)) { + mthca_err(dev, "opcode invalid\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + if (likely(prev_wqe)) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = +
cpu_to_be32(((ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | + mthca_opcode[wr->opcode]); + wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) { + size0 = size; + op0 = mthca_opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (likely(nreq)) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((nreq << 24) | + ((qp->sq.next & 0xffff) << 8) | + f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + qp->sq.cur += nreq; + qp->sq.next += nreq; + + /* + * Make sure that descriptors are written before + * doorbell record. + */ + wmb(); + *qp->sq.db = cpu_to_be32(qp->sq.next & 0xffff); + + /* + * Make sure doorbell record is written before we + * write MMIO send doorbell. + */ + wmb(); + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_arbel_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int ind; + int i; + void *wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next & (qp->rq.max - 1); + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (unlikely(qp->rq.cur + nreq >= qp->rq.max)) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + + ((struct mthca_next_seg *) wqe)->flags = 0; + + wqe += sizeof (struct mthca_next_seg); + + if (unlikely(wr->num_sge > qp->rq.max_gs)) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + 
cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + } + + if (i < qp->rq.max_gs) { + ((struct mthca_data_seg *) wqe)->byte_count = 0; + ((struct mthca_data_seg *) wqe)->lkey = cpu_to_be32(0x100); + ((struct mthca_data_seg *) wqe)->addr = 0; + } + + qp->wrid[ind] = wr->wr_id; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } +out: + if (likely(nreq)) { + qp->rq.cur += nreq; + qp->rq.next += nreq; + + /* + * Make sure that descriptors are written before + * doorbell record. + */ + wmb(); + *qp->rq.db = cpu_to_be32(qp->rq.next & 0xffff); + } + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + int mthca_free_err_wqe(struct mthca_dev *dev, struct mthca_qp *qp, int is_send, int index, int *dbd, u32 *new_wqe) { From roland at topspin.com Thu Mar 3 15:20:28 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:28 -0800 Subject: [openib-general] [PATCH][24/26] IB/mthca: QP locking optimization In-Reply-To: <2005331520.kVkmRDQ3e4IStEy9@topspin.com> Message-ID: <2005331520.i9PPmMDNBr0DxH5I@topspin.com> From: Michael S. Tsirkin 1. Split the QP spinlock into separate send and receive locks. The only place where we have to lock both is upon modify_qp, and that is not on data path. 2. Avoid taking any QP locks when polling CQ. This last part is achieved by getting rid of the cur field in mthca_wq, and calculating the number of outstanding WQEs by comparing the head and tail fields. head is only updated by post, tail is only updated by poll. In a rare case where an overrun is detected, a CQ is locked and the overrun condition is re-tested, to avoid any potential for stale tail values. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:13:01.214633912 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-03 14:13:03.417155819 -0800 @@ -423,15 +423,6 @@ is_send = is_error ? cqe->opcode & 0x01 : cqe->is_send & 0x80; if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { - if (*cur_qp) { - if (*freed) { - wmb(); - update_cons_index(dev, cq, *freed); - *freed = 0; - } - spin_unlock(&(*cur_qp)->lock); - } - /* * We do not have to take the QP table lock here, * because CQs will be locked while QPs are removed @@ -446,8 +437,6 @@ err = -EINVAL; goto out; } - - spin_lock(&(*cur_qp)->lock); } entry->qp_num = (*cur_qp)->qpn; @@ -465,9 +454,9 @@ } if (wq->last_comp < wqe_index) - wq->cur -= wqe_index - wq->last_comp; + wq->tail += wqe_index - wq->last_comp; else - wq->cur -= wq->max - wq->last_comp + wqe_index; + wq->tail += wqe_index + wq->max - wq->last_comp; wq->last_comp = wqe_index; @@ -551,9 +540,6 @@ update_cons_index(dev, cq, freed); } - if (qp) - spin_unlock(&qp->lock); - spin_unlock_irqrestore(&cq->lock, flags); return err == 0 || err == -EAGAIN ? 
npolled : err; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:02.120437293 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-03 14:13:03.416156036 -0800 @@ -166,21 +166,22 @@ }; struct mthca_wq { - int max; - int cur; - int next; - int last_comp; - void *last; - int max_gs; - int wqe_shift; + spinlock_t lock; + int max; + unsigned next_ind; + unsigned last_comp; + unsigned head; + unsigned tail; + void *last; + int max_gs; + int wqe_shift; - int db_index; /* Arbel only */ - u32 *db; + int db_index; /* Arbel only */ + u32 *db; }; struct mthca_qp { struct ib_qp ibqp; - spinlock_t lock; atomic_t refcount; u32 qpn; int is_direct; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:02.567340285 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-03 14:13:03.418155602 -0800 @@ -577,9 +577,11 @@ else cur_state = attr->cur_qp_state; } else { - spin_lock_irq(&qp->lock); + spin_lock_irq(&qp->sq.lock); + spin_lock(&qp->rq.lock); cur_state = qp->state; - spin_unlock_irq(&qp->lock); + spin_unlock(&qp->rq.lock); + spin_unlock_irq(&qp->sq.lock); } if (attr_mask & IB_QP_STATE) { @@ -1076,6 +1078,16 @@ } } +static void mthca_wq_init(struct mthca_wq* wq) +{ + spin_lock_init(&wq->lock); + wq->next_ind = 0; + wq->last_comp = wq->max - 1; + wq->head = 0; + wq->tail = 0; + wq->last = NULL; +} + static int mthca_alloc_qp_common(struct mthca_dev *dev, struct mthca_pd *pd, struct mthca_cq *send_cq, @@ -1087,20 +1099,13 @@ int ret; int i; - spin_lock_init(&qp->lock); atomic_set(&qp->refcount, 1); qp->state = IB_QPS_RESET; qp->atomic_rd_en = 0; qp->resp_depth = 0; qp->sq_policy = send_policy; - qp->rq.cur = 0; - qp->sq.cur = 0; - qp->rq.next = 0; - qp->sq.next = 0; - qp->rq.last_comp = qp->rq.max - 1; - qp->sq.last_comp = qp->sq.max - 1; - qp->rq.last = NULL; - qp->sq.last = NULL; + mthca_wq_init(&qp->sq); + mthca_wq_init(&qp->rq); ret = mthca_alloc_memfree(dev, qp); if (ret) @@ -1394,6 
+1399,24 @@ return 0; } +static inline int mthca_wq_overflow(struct mthca_wq *wq, int nreq, + struct ib_cq *ib_cq) +{ + unsigned cur; + struct mthca_cq *cq; + + cur = wq->head - wq->tail; + if (likely(cur + nreq < wq->max)) + return 0; + + cq = to_mcq(ib_cq); + spin_lock(&cq->lock); + cur = wq->head - wq->tail; + spin_unlock(&cq->lock); + + return cur + nreq >= wq->max; +} + int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr) { @@ -1411,16 +1434,18 @@ int ind; u8 op0 = 0; - spin_lock_irqsave(&qp->lock, flags); + spin_lock_irqsave(&qp->sq.lock, flags); /* XXX check that state is OK to post send */ - ind = qp->sq.next; + ind = qp->sq.next_ind; for (nreq = 0; wr; ++nreq, wr = wr->next) { - if (qp->sq.cur + nreq >= qp->sq.max) { - mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", - qp->sq.cur, qp->sq.max, nreq); + if (mthca_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { + mthca_err(dev, "SQ %06x full (%u head, %u tail," + " %d max, %d nreq)\n", qp->qpn, + qp->sq.head, qp->sq.tail, + qp->sq.max, nreq); err = -ENOMEM; *bad_wr = wr; goto out; @@ -1580,7 +1605,7 @@ if (likely(nreq)) { u32 doorbell[2]; - doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + doorbell[0] = cpu_to_be32(((qp->sq.next_ind << qp->sq.wqe_shift) + qp->send_wqe_offset) | f0 | op0); doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); @@ -1591,10 +1616,10 @@ MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } - qp->sq.cur += nreq; - qp->sq.next = ind; + qp->sq.next_ind = ind; + qp->sq.head += nreq; - spin_unlock_irqrestore(&qp->lock, flags); + spin_unlock_irqrestore(&qp->sq.lock, flags); return err; } @@ -1613,15 +1638,18 @@ void *wqe; void *prev_wqe; - spin_lock_irqsave(&qp->lock, flags); + spin_lock_irqsave(&qp->rq.lock, flags); /* XXX check that state is OK to post receive */ - ind = qp->rq.next; + ind = qp->rq.next_ind; for (nreq = 0; wr; ++nreq, wr = wr->next) { - if (unlikely(qp->rq.cur + nreq >= qp->rq.max)) { - mthca_err(dev, "RQ %06x 
full\n", qp->qpn); + if (mthca_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) { + mthca_err(dev, "RQ %06x full (%u head, %u tail," + " %d max, %d nreq)\n", qp->qpn, + qp->rq.head, qp->rq.tail, + qp->rq.max, nreq); err = -ENOMEM; *bad_wr = wr; goto out; @@ -1678,7 +1706,7 @@ if (likely(nreq)) { u32 doorbell[2]; - doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0); doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); wmb(); @@ -1688,10 +1716,10 @@ MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } - qp->rq.cur += nreq; - qp->rq.next = ind; + qp->rq.next_ind = ind; + qp->rq.head += nreq; - spin_unlock_irqrestore(&qp->lock, flags); + spin_unlock_irqrestore(&qp->rq.lock, flags); return err; } @@ -1712,16 +1740,18 @@ int ind; u8 op0 = 0; - spin_lock_irqsave(&qp->lock, flags); + spin_lock_irqsave(&qp->sq.lock, flags); /* XXX check that state is OK to post send */ - ind = qp->sq.next & (qp->sq.max - 1); + ind = qp->sq.head & (qp->sq.max - 1); for (nreq = 0; wr; ++nreq, wr = wr->next) { - if (qp->sq.cur + nreq >= qp->sq.max) { - mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", - qp->sq.cur, qp->sq.max, nreq); + if (mthca_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { + mthca_err(dev, "SQ %06x full (%u head, %u tail," + " %d max, %d nreq)\n", qp->qpn, + qp->sq.head, qp->sq.tail, + qp->sq.max, nreq); err = -ENOMEM; *bad_wr = wr; goto out; @@ -1831,19 +1861,18 @@ u32 doorbell[2]; doorbell[0] = cpu_to_be32((nreq << 24) | - ((qp->sq.next & 0xffff) << 8) | + ((qp->sq.head & 0xffff) << 8) | f0 | op0); doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); - qp->sq.cur += nreq; - qp->sq.next += nreq; + qp->sq.head += nreq; /* * Make sure that descriptors are written before * doorbell record. 
*/ wmb(); - *qp->sq.db = cpu_to_be32(qp->sq.next & 0xffff); + *qp->sq.db = cpu_to_be32(qp->sq.head & 0xffff); /* * Make sure doorbell record is written before we @@ -1855,7 +1884,7 @@ MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } - spin_unlock_irqrestore(&qp->lock, flags); + spin_unlock_irqrestore(&qp->sq.lock, flags); return err; } @@ -1871,15 +1900,18 @@ int i; void *wqe; - spin_lock_irqsave(&qp->lock, flags); + spin_lock_irqsave(&qp->rq.lock, flags); /* XXX check that state is OK to post receive */ - ind = qp->rq.next & (qp->rq.max - 1); + ind = qp->rq.head & (qp->rq.max - 1); for (nreq = 0; wr; ++nreq, wr = wr->next) { - if (unlikely(qp->rq.cur + nreq >= qp->rq.max)) { - mthca_err(dev, "RQ %06x full\n", qp->qpn); + if (mthca_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) { + mthca_err(dev, "RQ %06x full (%u head, %u tail," + " %d max, %d nreq)\n", qp->qpn, + qp->rq.head, qp->rq.tail, + qp->rq.max, nreq); err = -ENOMEM; *bad_wr = wr; goto out; @@ -1921,18 +1953,17 @@ } out: if (likely(nreq)) { - qp->rq.cur += nreq; - qp->rq.next += nreq; + qp->rq.head += nreq; /* * Make sure that descriptors are written before * doorbell record. */ wmb(); - *qp->rq.db = cpu_to_be32(qp->rq.next & 0xffff); + *qp->rq.db = cpu_to_be32(qp->rq.head & 0xffff); } - spin_unlock_irqrestore(&qp->lock, flags); + spin_unlock_irqrestore(&qp->rq.lock, flags); return err; } From roland at topspin.com Thu Mar 3 15:20:28 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:28 -0800 Subject: [openib-general] [PATCH][25/26] IB/mthca: implement query of device caps In-Reply-To: <2005331520.i9PPmMDNBr0DxH5I@topspin.com> Message-ID: <2005331520.mctunM7QrSZHM8mX@topspin.com> From: Michael S. Tsirkin Set device_cap_flags field in mthca's query_device method. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.h 2005-01-25 20:48:02.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.h 2005-03-03 14:13:03.934043620 -0800 @@ -95,7 +95,21 @@ }; enum { - DEV_LIM_FLAG_SRQ = 1 << 6 + DEV_LIM_FLAG_RC = 1 << 0, + DEV_LIM_FLAG_UC = 1 << 1, + DEV_LIM_FLAG_UD = 1 << 2, + DEV_LIM_FLAG_RD = 1 << 3, + DEV_LIM_FLAG_RAW_IPV6 = 1 << 4, + DEV_LIM_FLAG_RAW_ETHER = 1 << 5, + DEV_LIM_FLAG_SRQ = 1 << 6, + DEV_LIM_FLAG_BAD_PKEY_CNTR = 1 << 8, + DEV_LIM_FLAG_BAD_QKEY_CNTR = 1 << 9, + DEV_LIM_FLAG_MW = 1 << 16, + DEV_LIM_FLAG_AUTO_PATH_MIG = 1 << 17, + DEV_LIM_FLAG_ATOMIC = 1 << 18, + DEV_LIM_FLAG_RAW_MULTI = 1 << 19, + DEV_LIM_FLAG_UD_AV_PORT_ENFORCE = 1 << 20, + DEV_LIM_FLAG_UD_MULTI = 1 << 21, }; struct mthca_dev_lim { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:03.005245231 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:03.932044054 -0800 @@ -218,6 +218,7 @@ int hca_type; unsigned long mthca_flags; + unsigned long device_cap_flags; u32 rev_id; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:13:03.005245231 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:13:03.933043837 -0800 @@ -171,6 +171,33 @@ mdev->limits.reserved_uars = dev_lim->reserved_uars; mdev->limits.reserved_pds = dev_lim->reserved_pds; + /* IB_DEVICE_RESIZE_MAX_WR not supported by driver. + May be doable since hardware supports it for SRQ. + + IB_DEVICE_N_NOTIFY_CQ is supported by hardware but not by driver. + + IB_DEVICE_SRQ_RESIZE is supported by hardware but SRQ is not + supported by driver. 
*/ + mdev->device_cap_flags = IB_DEVICE_CHANGE_PHY_PORT | + IB_DEVICE_PORT_ACTIVE_EVENT | + IB_DEVICE_SYS_IMAGE_GUID | + IB_DEVICE_RC_RNR_NAK_GEN; + + if (dev_lim->flags & DEV_LIM_FLAG_BAD_PKEY_CNTR) + mdev->device_cap_flags |= IB_DEVICE_BAD_PKEY_CNTR; + + if (dev_lim->flags & DEV_LIM_FLAG_BAD_QKEY_CNTR) + mdev->device_cap_flags |= IB_DEVICE_BAD_QKEY_CNTR; + + if (dev_lim->flags & DEV_LIM_FLAG_RAW_MULTI) + mdev->device_cap_flags |= IB_DEVICE_RAW_MULTI; + + if (dev_lim->flags & DEV_LIM_FLAG_AUTO_PATH_MIG) + mdev->device_cap_flags |= IB_DEVICE_AUTO_PATH_MIG; + + if (dev_lim->flags & DEV_LIM_FLAG_UD_AV_PORT_ENFORCE) + mdev->device_cap_flags |= IB_DEVICE_UD_AV_PORT_ENFORCE; + if (dev_lim->flags & DEV_LIM_FLAG_SRQ) mdev->mthca_flags |= MTHCA_FLAG_SRQ; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:13:02.566340502 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-03 14:13:03.933043837 -0800 @@ -43,6 +43,8 @@ struct ib_smp *in_mad = NULL; struct ib_smp *out_mad = NULL; int err = -ENOMEM; + struct mthca_dev* mdev = to_mdev(ibdev); + u8 status; in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); @@ -50,7 +52,7 @@ if (!in_mad || !out_mad) goto out; - props->fw_ver = to_mdev(ibdev)->fw_ver; + props->fw_ver = mdev->fw_ver; memset(in_mad, 0, sizeof *in_mad); in_mad->base_version = 1; @@ -59,7 +61,7 @@ in_mad->method = IB_MGMT_METHOD_GET; in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; - err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, + err = mthca_MAD_IFC(mdev, 1, 1, 1, NULL, NULL, in_mad, out_mad, &status); if (err) @@ -69,10 +71,11 @@ goto out; } - props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 36)) & + props->device_cap_flags = mdev->device_cap_flags; + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 36)) & 0xffffff; - props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 30)); - props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 32)); + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 30)); 
+ props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 32)); memcpy(&props->sys_image_guid, out_mad->data + 4, 8); memcpy(&props->node_guid, out_mad->data + 12, 8); From roland at topspin.com Thu Mar 3 15:20:28 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:28 -0800 Subject: [openib-general] [PATCH][23/26] IB/mthca: mem-free multicast table In-Reply-To: <2005331520.ADYAIRdSQBiHhYiD@topspin.com> Message-ID: <2005331520.kVkmRDQ3e4IStEy9@topspin.com> Tie up one last loose end by mapping enough context memory to cover the whole multicast table during initialization, and then enable mem-free mode. mthca now supports enough of mem-free mode so that IPoIB works with a mem-free HCA. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:02.565340719 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-03 14:13:03.005245231 -0800 @@ -207,8 +207,9 @@ }; struct mthca_mcg_table { - struct semaphore sem; - struct mthca_alloc alloc; + struct semaphore sem; + struct mthca_alloc alloc; + struct mthca_icm_table *table; }; struct mthca_dev { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:12:57.858362446 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-03 14:13:03.005245231 -0800 @@ -412,8 +412,29 @@ goto err_unmap_eqp; } + /* + * It's not strictly required, but for simplicity just map the + * whole multicast group table now. The table isn't very big + * and it's a lot easier than trying to track ref counts. 
+ */ + mdev->mcg_table.table = mthca_alloc_icm_table(mdev, init_hca->mc_base, + MTHCA_MGM_ENTRY_SIZE, + mdev->limits.num_mgms + + mdev->limits.num_amgms, + mdev->limits.num_mgms + + mdev->limits.num_amgms, + 0); + if (!mdev->mcg_table.table) { + mthca_err(mdev, "Failed to map MCG context memory, aborting.\n"); + err = -ENOMEM; + goto err_unmap_cq; + } + return 0; +err_unmap_cq: + mthca_free_icm_table(mdev, mdev->cq_table.table); + err_unmap_eqp: mthca_free_icm_table(mdev, mdev->qp_table.eqp_table); @@ -587,7 +608,7 @@ goto err_uar_free; } - err = mthca_init_pd_table(dev); + err = mthca_init_pd_table(dev); if (err) { mthca_err(dev, "Failed to initialize " "protection domain table, aborting.\n"); @@ -635,13 +656,6 @@ mthca_dbg(dev, "NOP command IRQ test passed\n"); - if (dev->hca_type == ARBEL_NATIVE) { - mthca_warn(dev, "Sorry, native MT25208 mode support is not complete, " - "aborting.\n"); - err = -ENODEV; - goto err_cmd_poll; - } - err = mthca_init_cq_table(dev); if (err) { mthca_err(dev, "Failed to initialize " @@ -704,7 +718,7 @@ err_uar_table_free: mthca_cleanup_uar_table(dev); - return err; + return err; } static int __devinit mthca_request_regions(struct pci_dev *pdev, @@ -814,6 +828,7 @@ const struct pci_device_id *id) { static int mthca_version_printed = 0; + static int mthca_memfree_warned = 0; int ddr_hidden = 0; int err; struct mthca_dev *mdev; @@ -893,6 +908,10 @@ mdev->pdev = pdev; mdev->hca_type = id->driver_data; + if (mdev->hca_type == ARBEL_NATIVE && !mthca_memfree_warned++) + mthca_warn(mdev, "Warning: native MT25208 mode support is incomplete. 
" + "Your HCA may not work properly.\n"); + if (ddr_hidden) mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; From roland at topspin.com Thu Mar 3 15:20:28 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 3 Mar 2005 15:20:28 -0800 Subject: [openib-general] [PATCH][26/26] IB: MAD cancel callbacks from thread In-Reply-To: <2005331520.mctunM7QrSZHM8mX@topspin.com> Message-ID: <2005331520.zA1xypugai2bUq7X@topspin.com> From: Sean Hefty Modify ib_cancel_mad() to invoke a user's send completion callback from a different thread context than that used by the caller. This allows a caller to hold a lock while calling cancel that is also acquired from their send handler. Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/mad.c 2005-03-03 14:12:54.671054304 -0800 +++ linux-export/drivers/infiniband/core/mad.c 2005-03-03 14:13:04.375947697 -0800 @@ -68,6 +68,7 @@ static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); +static void cancel_sends(void *data); static void local_completions(void *data); static int solicited_mad(struct ib_mad *mad); static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, @@ -341,6 +342,8 @@ INIT_LIST_HEAD(&mad_agent_priv->local_list); INIT_WORK(&mad_agent_priv->local_work, local_completions, mad_agent_priv); + INIT_LIST_HEAD(&mad_agent_priv->canceled_list); + INIT_WORK(&mad_agent_priv->canceled_work, cancel_sends, mad_agent_priv); atomic_set(&mad_agent_priv->refcount, 1); init_waitqueue_head(&mad_agent_priv->wait); @@ -2004,12 +2007,44 @@ return NULL; } +void cancel_sends(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + + 
spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->canceled_list)) { + mad_send_wr = list_entry(mad_agent_priv->canceled_list.next, + struct ib_mad_send_wr_private, + agent_list); + + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + void ib_cancel_mad(struct ib_mad_agent *mad_agent, u64 wr_id) { struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_send_wr_private *mad_send_wr; - struct ib_mad_send_wc mad_send_wc; unsigned long flags; mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, @@ -2031,19 +2066,12 @@ } list_del(&mad_send_wr->agent_list); + list_add_tail(&mad_send_wr->agent_list, &mad_agent_priv->canceled_list); adjust_timeout(mad_agent_priv); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - mad_send_wc.status = IB_WC_WR_FLUSH_ERR; - mad_send_wc.vendor_err = 0; - mad_send_wc.wr_id = mad_send_wr->wr_id; - mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, - &mad_send_wc); - - kfree(mad_send_wr); - if (atomic_dec_and_test(&mad_agent_priv->refcount)) - wake_up(&mad_agent_priv->wait); - + queue_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->canceled_work); out: return; } --- linux-export.orig/drivers/infiniband/core/mad_priv.h 2005-03-02 20:53:21.000000000 -0800 +++ linux-export/drivers/infiniband/core/mad_priv.h 2005-03-03 14:13:04.375947697 -0800 @@ -95,6 +95,8 @@ unsigned long timeout; struct list_head local_list; struct work_struct local_work; + struct list_head canceled_list; + struct work_struct canceled_work; atomic_t refcount; wait_queue_head_t wait; From jgarzik at pobox.com 
Thu Mar 3 16:04:22 2005 From: jgarzik at pobox.com (Jeff Garzik) Date: Thu, 03 Mar 2005 19:04:22 -0500 Subject: [openib-general] Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing In-Reply-To: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> References: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> Message-ID: <4227A606.50703@pobox.com> Roland Dreier wrote: > Add a mthca_write_db_rec() to wrap writing doorbell records. On > 64-bit archs, this is just a 64-bit write, while on 32-bit archs it > splits the write into two 32-bit writes with a memory barrier to make > sure the two halves of the record are written in the correct order. > +static inline void mthca_write_db_rec(u32 val[2], u32 *db) > +{ > + db[0] = val[0]; > + wmb(); > + db[1] = val[1]; > +} > + Are you concerned about ordering, or write-combining? I am unaware of a situation where writes are re-ordered into a reversed, descending order for no apparent reason. Jeff From jgarzik at pobox.com Thu Mar 3 16:07:43 2005 From: jgarzik at pobox.com (Jeff Garzik) Date: Thu, 03 Mar 2005 19:07:43 -0500 Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks from thread In-Reply-To: <2005331520.zA1xypugai2bUq7X@topspin.com> References: <2005331520.zA1xypugai2bUq7X@topspin.com> Message-ID: <4227A6CF.6080805@pobox.com> Roland Dreier wrote: > +void cancel_sends(void *data) > +{ > + struct ib_mad_agent_private *mad_agent_priv; > + struct ib_mad_send_wr_private *mad_send_wr; > + struct ib_mad_send_wc mad_send_wc; > + unsigned long flags; > + > + mad_agent_priv = (struct ib_mad_agent_private *)data; don't add casts to a void pointer, that's silly. 
> +	mad_send_wc.status = IB_WC_WR_FLUSH_ERR;
> +	mad_send_wc.vendor_err = 0;
> +
> +	spin_lock_irqsave(&mad_agent_priv->lock, flags);
> +	while (!list_empty(&mad_agent_priv->canceled_list)) {
> +		mad_send_wr = list_entry(mad_agent_priv->canceled_list.next,
> +					 struct ib_mad_send_wr_private,
> +					 agent_list);
> +
> +		list_del(&mad_send_wr->agent_list);
> +		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
> +
> +		mad_send_wc.wr_id = mad_send_wr->wr_id;
> +		mad_agent_priv->agent.send_handler(&mad_agent_priv->agent,
> +						   &mad_send_wc);
> +
> +		kfree(mad_send_wr);
> +		if (atomic_dec_and_test(&mad_agent_priv->refcount))
> +			wake_up(&mad_agent_priv->wait);
> +		spin_lock_irqsave(&mad_agent_priv->lock, flags);
> +	}
> +	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);

dumb question... why is the lock dropped? is it just for the
send_handler(), or also for wr_id assigned, kfree, and wake_up() ?

From libor at topspin.com Thu Mar 3 16:21:07 2005
From: libor at topspin.com (Libor Michalek)
Date: Thu, 3 Mar 2005 16:21:07 -0800
Subject: [openib-general] [RFC] userspace CM/verbs QP
Message-ID: <20050303162107.A18428@topspin.com>

Roland,

  As it currently stands, the userspace CM needs to pass a QP from
userspace to kernel space in order to pass it on to the kernel CM.
I'm thinking that the best way to handle this is for the uCM library
to pass the kernel uCM the uverbs QP handle, and then have the kernel
uCM look up the QP from ib_uverbs. Unfortunately, this means ib_uverbs
would need to export a lookup function.

  Would you like a patch, or do you have some other idea?

-Libor

From roland at topspin.com Thu Mar 3 16:30:03 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 16:30:03 -0800
Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks from thread
In-Reply-To: <4227A6CF.6080805@pobox.com> (Jeff Garzik's message of "Thu, 03 Mar 2005 19:07:43 -0500")
References: <2005331520.zA1xypugai2bUq7X@topspin.com> <4227A6CF.6080805@pobox.com>
Message-ID: <52zmxknth0.fsf@topspin.com>

    Jeff> don't add casts to a void pointer, that's silly.

Fair enough...

    Jeff> dumb question... why is the lock dropped? is it just for
    Jeff> the send_handler(), or also for wr_id assigned, kfree, and
    Jeff> wake_up() ?

Not sure... Sean?

 - R.

From roland at topspin.com Thu Mar 3 16:33:15 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 16:33:15 -0800
Subject: [openib-general] Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing
In-Reply-To: <4227A606.50703@pobox.com> (Jeff Garzik's message of "Thu, 03 Mar 2005 19:04:22 -0500")
References: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> <4227A606.50703@pobox.com>
Message-ID: <52vf88ntbo.fsf@topspin.com>

    Jeff> Are you concerned about ordering, or write-combining?

ordering... write combining would be fine.

    Jeff> I am unaware of a situation where writes are re-ordered into
    Jeff> a reversed, descending order for no apparent reason.

Hmm... I've seen ppc64 do some pretty freaky reordering but on the
other hand that's a 64-bit arch so we don't care in this case. I
guess I'd rather keep the barrier there so we don't have the
possibility of a rare hardware crash when the HCA just happens to read
the doorbell record in a corrupt state.

 - R.

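The ordering requirement discussed above (two 32-bit halves of a doorbell record that must become visible in order) can be modelled in portable userspace C. This is only a sketch under stated assumptions: `write_db_rec_split()` is a hypothetical name, the record is modelled as two C11 atomics rather than HCA-visible memory, and `atomic_thread_fence(memory_order_release)` merely stands in for the kernel's `wmb()`; it is not the driver's actual code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of a split doorbell record write.
 * The release fence plays the role that wmb() plays in the patch:
 * the store to db[0] may not be reordered after the store to db[1].
 */
static inline void write_db_rec_split(const uint32_t val[2],
				      _Atomic uint32_t *db)
{
	assert(val && db);

	/* First half of the record. */
	atomic_store_explicit(&db[0], val[0], memory_order_relaxed);

	/* Stand-in for wmb(): order the first store before the second. */
	atomic_thread_fence(memory_order_release);

	/* Second half; an observer that sees it also sees the first. */
	atomic_store_explicit(&db[1], val[1], memory_order_relaxed);
}
```

On a 64-bit architecture the patch description says a single 64-bit store is used instead, so no fence between halves is needed there; the sketch covers only the 32-bit case.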
From sean.hefty at intel.com Thu Mar 3 16:34:43 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 3 Mar 2005 16:34:43 -0800
Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks fromthread
In-Reply-To: <4227A6CF.6080805@pobox.com>
Message-ID:

>Roland Dreier wrote:
>> +void cancel_sends(void *data)
>> +{
>> +	struct ib_mad_agent_private *mad_agent_priv;
>> +	struct ib_mad_send_wr_private *mad_send_wr;
>> +	struct ib_mad_send_wc mad_send_wc;
>> +	unsigned long flags;
>> +
>> +	mad_agent_priv = (struct ib_mad_agent_private *)data;
>
>don't add casts to a void pointer, that's silly.

This is my bad.

>> +	mad_send_wc.status = IB_WC_WR_FLUSH_ERR;
>> +	mad_send_wc.vendor_err = 0;
>> +
>> +	spin_lock_irqsave(&mad_agent_priv->lock, flags);
>> +	while (!list_empty(&mad_agent_priv->canceled_list)) {
>> +		mad_send_wr = list_entry(mad_agent_priv->canceled_list.next,
>> +					 struct ib_mad_send_wr_private,
>> +					 agent_list);
>> +
>> +		list_del(&mad_send_wr->agent_list);
>> +		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
>> +
>> +		mad_send_wc.wr_id = mad_send_wr->wr_id;
>> +		mad_agent_priv->agent.send_handler(&mad_agent_priv->agent,
>> +						   &mad_send_wc);
>> +
>> +		kfree(mad_send_wr);
>> +		if (atomic_dec_and_test(&mad_agent_priv->refcount))
>> +			wake_up(&mad_agent_priv->wait);
>> +		spin_lock_irqsave(&mad_agent_priv->lock, flags);
>> +	}
>> +	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
>
>dumb question... why is the lock dropped? is it just for the
>send_handler(), or also for wr_id assigned, kfree, and wake_up() ?

The lock is dropped to avoid calling the user back with it held. The if
statement / wake_up call near the bottom of the loop can be replaced with
a simple atomic_dec. The test should always fail. The lock is to protect
access to the canceled_list.

(Sorry about the mailer...)

- Sean

From jgarzik at pobox.com Thu Mar 3 16:35:00 2005
From: jgarzik at pobox.com (Jeff Garzik)
Date: Thu, 03 Mar 2005 19:35:00 -0500
Subject: [openib-general] Re: [PATCH][3/26] IB/mthca: improve CQ locking part 1
In-Reply-To: <2005331520.cHJfJcRbBu1fFgB6@topspin.com>
References: <2005331520.cHJfJcRbBu1fFgB6@topspin.com>
Message-ID: <4227AD34.4050002@pobox.com>

Roland Dreier wrote:
> @@ -783,6 +777,11 @@
>  		  cq->cqn & (dev->limits.num_cqs - 1));
>  	spin_unlock_irq(&dev->cq_table.lock);
>
> +	if (dev->mthca_flags & MTHCA_FLAG_MSI_X)
> +		synchronize_irq(dev->eq_table.eq[MTHCA_EQ_COMP].msi_x_vector);
> +	else
> +		synchronize_irq(dev->pdev->irq);
> +

Tangent: I think we need a pci_irq_sync() rather than putting the above
code into each driver.

	Jeff

From roland at topspin.com Thu Mar 3 16:37:45 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 16:37:45 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <20050303162107.A18428@topspin.com> (Libor Michalek's message of "Thu, 3 Mar 2005 16:21:07 -0800")
References: <20050303162107.A18428@topspin.com>
Message-ID: <52r7iwnt46.fsf@topspin.com>

    Libor> Roland, As it currently stands, the userspace CM needs to
    Libor> pass a QP from userspace to kernel space in order to pass
    Libor> it on to the kernel CM. I'm thinking that the best way to
    Libor> handle this is for the uCM library to pass the kernel uCM
    Libor> the uverbs QP handle, and then have the kernel uCM look up
    Libor> the QP from ib_uverbs. Unfortunately, this means ib_uverbs
    Libor> would need to export a lookup function.

    Libor> Would you like a patch, or do you have some other idea?

Hmm, I guess that would be OK. It does mean you have to hold a mutex
to avoid another userspace thread killing the QP out from under you,
which is a little ugly to expose...

What do you really need to do with the QP? Can you just have
userspace pass the information like QP number that you need?

 - R.

From sean.hefty at intel.com Thu Mar 3 16:39:02 2005
From: sean.hefty at intel.com (Hefty, Sean)
Date: Thu, 3 Mar 2005 16:39:02 -0800
Subject: [openib-general] [RFC] userspace CM/verbs QP
Message-ID:

>  As it currently stands, the userspace CM needs to pass a QP from
>userspace to kernel space in order to pass it on to the kernel CM.
>I'm thinking that the best way to handle this is for the uCM library
>to pass the kernel uCM the uverbs QP handle, and then have the kernel
>uCM look up the QP from ib_uverbs. Unfortunately, this means ib_uverbs
>would need to export a lookup function.
>
>  Would you like a patch, or do you have some other idea?

As an FYI, the kernel CM uses the QP to get the QP number, the QP type,
whether it uses an SRQ, and which device the QP is located on.

- Sean

From roland at topspin.com Thu Mar 3 16:40:14 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 16:40:14 -0800
Subject: [openib-general] Re: [PATCH][3/26] IB/mthca: improve CQ locking part 1
In-Reply-To: <4227AD34.4050002@pobox.com> (Jeff Garzik's message of "Thu, 03 Mar 2005 19:35:00 -0500")
References: <2005331520.cHJfJcRbBu1fFgB6@topspin.com> <4227AD34.4050002@pobox.com>
Message-ID: <52mztknt01.fsf@topspin.com>

    > @@ -783,6 +777,11 @@
    >  		  cq->cqn & (dev->limits.num_cqs - 1));
    >  	spin_unlock_irq(&dev->cq_table.lock);
    > +	if (dev->mthca_flags & MTHCA_FLAG_MSI_X)
    > +		synchronize_irq(dev->eq_table.eq[MTHCA_EQ_COMP].msi_x_vector);
    > +	else
    > +		synchronize_irq(dev->pdev->irq);
    > +

    Jeff> Tangent: I think we need a pci_irq_sync() rather than
    Jeff> putting the above code into each driver.

The problem with trying to make it generic is that mthca has multiple
MSI-X vectors, and only the driver author could know that we only need
to synchronize with the completion event vector.

 - R.

From jgarzik at pobox.com Thu Mar 3 16:41:06 2005
From: jgarzik at pobox.com (Jeff Garzik)
Date: Thu, 03 Mar 2005 19:41:06 -0500
Subject: [openib-general] Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing
In-Reply-To: <52vf88ntbo.fsf@topspin.com>
References: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> <4227A606.50703@pobox.com> <52vf88ntbo.fsf@topspin.com>
Message-ID: <4227AEA2.8060007@pobox.com>

Roland Dreier wrote:
>     Jeff> Are you concerned about ordering, or write-combining?
>
> ordering... write combining would be fine.
>
>     Jeff> I am unaware of a situation where writes are re-ordered into
>     Jeff> a reversed, descending order for no apparent reason.
>
> Hmm... I've seen ppc64 do some pretty freaky reordering but on the
> other hand that's a 64-bit arch so we don't care in this case. I
> guess I'd rather keep the barrier there so we don't have the
> possibility of a rare hardware crash when the HCA just happens to read
> the doorbell record in a corrupt state.

Well, we don't just add code to "hope and pray" for an event that nobody
is sure can even occur...

Does someone have a concrete case where this could happen? ever?

	Jeff

From roland at topspin.com Thu Mar 3 16:43:26 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 16:43:26 -0800
Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks fromthread
In-Reply-To: (Sean Hefty's message of "Thu, 3 Mar 2005 16:34:43 -0800")
References:
Message-ID: <52fyzcnsup.fsf@topspin.com>

    >> don't add casts to a void pointer, that's silly.

How should we handle this nit? Should I post a new version of this
patch or an incremental diff that fixes it up?

 - R.

From roland at topspin.com Thu Mar 3 16:50:59 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 16:50:59 -0800
Subject: [openib-general] Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing
In-Reply-To: <4227AEA2.8060007@pobox.com> (Jeff Garzik's message of "Thu, 03 Mar 2005 19:41:06 -0500")
References: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> <4227A606.50703@pobox.com> <52vf88ntbo.fsf@topspin.com> <4227AEA2.8060007@pobox.com>
Message-ID: <52bra0nsi4.fsf@topspin.com>

    Jeff> Well, we don't just add code to "hope and pray" for an event
    Jeff> that nobody is sure can even occur...

The hardware requires that if the record is written in two 32-bit
chunks, then they must be written in order. Of course the hardware
probably won't be reading just as we're writing, so almost all of the
time we won't notice the problem. It feels more like "hope and pray"
to me to leave the barrier out and assume that every possible
implementation of every architecture will always write them in order.

    Jeff> Does someone have a concrete case where this could happen? ever?

I don't see how you can rule it out on out-of-order architectures. If
the second word becomes ready before the first, then the CPU may
execute the second write before the first.

It's not precisely the same situation, but if you look at mthca_eq.c
you'll see an rmb() in mthca_eq_int(). That's there because on ppc64,
I really saw a situation where code like:

	while (foo->x) {
		switch (foo->y) {

was behaving as if foo->y was being read before foo->x. Even though
both foo->x and foo->y are in the same cache line, and foo->x was
written by the hardware after foo->y.

 - R.

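The reader-side hazard described above (a consumer checking `foo->x` and then acting on `foo->y`) can be illustrated with a small userspace model. Everything here is an assumption for illustration: `struct event_entry`, `publish_event()` and `poll_event()` are hypothetical names, the "hardware" producer is ordinary C code, and C11 acquire/release fences stand in for the kernel's `rmb()`/`wmb()`. It is a sketch of the pattern, not the code in mthca_eq.c.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for an event queue entry. */
struct event_entry {
	_Atomic uint32_t type;	/* payload, written first   ("foo->y") */
	_Atomic uint32_t owner;	/* nonzero once entry valid ("foo->x") */
};

/* Producer side: payload first, then the valid flag. */
static void publish_event(struct event_entry *e, uint32_t type)
{
	atomic_store_explicit(&e->type, type, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* wmb() stand-in */
	atomic_store_explicit(&e->owner, 1, memory_order_relaxed);
}

/*
 * Consumer side: returns 1 and fills *type if an entry is ready,
 * 0 otherwise. Without the acquire fence, the load of e->type could
 * be satisfied before the load of e->owner and return a stale payload.
 */
static int poll_event(struct event_entry *e, uint32_t *type)
{
	if (!atomic_load_explicit(&e->owner, memory_order_relaxed))
		return 0;

	atomic_thread_fence(memory_order_acquire);	/* rmb() stand-in */

	*type = atomic_load_explicit(&e->type, memory_order_relaxed);

	/* Hand the entry back to the producer. */
	atomic_store_explicit(&e->owner, 0, memory_order_relaxed);
	return 1;
}
```

The fence sits between the flag check and the payload read, which mirrors where the message says the `rmb()` is needed.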
From libor at topspin.com Thu Mar 3 16:57:05 2005
From: libor at topspin.com (Libor Michalek)
Date: Thu, 3 Mar 2005 16:57:05 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <52r7iwnt46.fsf@topspin.com>; from roland@topspin.com on Thu, Mar 03, 2005 at 04:37:45PM -0800
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com>
Message-ID: <20050303165705.B18428@topspin.com>

On Thu, Mar 03, 2005 at 04:37:45PM -0800, Roland Dreier wrote:
>     Libor> Roland, As it currently stands, the userspace CM needs to
>     Libor> pass a QP from userspace to kernel space in order to pass
>     Libor> it on to the kernel CM. I'm thinking that the best way to
>     Libor> handle this is for the uCM library to pass the kernel uCM
>     Libor> the uverbs QP handle, and then have the kernel uCM look up
>     Libor> the QP from ib_uverbs. Unfortunately, this means ib_uverbs
>     Libor> would need to export a lookup function.
>
>     Libor> Would you like a patch, or do you have some other idea?
>
> Hmm, I guess that would be OK. It does mean you have to hold a mutex
> to avoid another userspace thread killing the QP out from under you,
> which is a little ugly to expose...

  I was thinking of maybe ref counting the access, either in ib_qp or
ib_uobject. Adding a pair of functions, lookup and return, to manage
the ref count...

> What do you really need to do with the QP? Can you just have
> userspace pass the information like QP number that you need?

  I thought about that, here's the current list, but we would need to
look up the PD anyway:

  qp->pd,
  qp->qp_num
  qp->qp_type
  qp->srq
  qp->device

-Libor

From greg at kroah.com Thu Mar 3 16:58:24 2005
From: greg at kroah.com (Greg KH)
Date: Thu, 3 Mar 2005 16:58:24 -0800
Subject: [openib-general] Re: [PATCH][3/26] IB/mthca: improve CQ locking part 1
In-Reply-To: <4227AD34.4050002@pobox.com>
References: <2005331520.cHJfJcRbBu1fFgB6@topspin.com> <4227AD34.4050002@pobox.com>
Message-ID: <20050304005824.GA18411@kroah.com>

On Thu, Mar 03, 2005 at 07:35:00PM -0500, Jeff Garzik wrote:
> Roland Dreier wrote:
> >@@ -783,6 +777,11 @@
> > 		  cq->cqn & (dev->limits.num_cqs - 1));
> > 	spin_unlock_irq(&dev->cq_table.lock);
> >
> >+	if (dev->mthca_flags & MTHCA_FLAG_MSI_X)
> >+		synchronize_irq(dev->eq_table.eq[MTHCA_EQ_COMP].msi_x_vector);
> >+	else
> >+		synchronize_irq(dev->pdev->irq);
> >+
>
>
> Tangent: I think we need a pci_irq_sync() rather than putting the above
> code into each driver.

Sure, I have no problem accepting that into the pci core.

thanks,

greg k-h

From sean.hefty at intel.com Thu Mar 3 17:00:01 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 3 Mar 2005 17:00:01 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <20050303165705.B18428@topspin.com>
Message-ID:

>I thought about that, here's the current list, but we would need to
>look up the PD anyway:
>
>  qp->pd,
>  qp->qp_num
>  qp->qp_type
>  qp->srq
>  qp->device

The kernel CM uses the PD of an internal mad_agent, and not the PD of the
user's QP. So, I don't think it's needed.

- Sean

From akpm at osdl.org Thu Mar 3 17:01:09 2005
From: akpm at osdl.org (Andrew Morton)
Date: Thu, 3 Mar 2005 17:01:09 -0800
Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks fromthread
In-Reply-To: <52fyzcnsup.fsf@topspin.com>
References: <52fyzcnsup.fsf@topspin.com>
Message-ID: <20050303170109.72e8a3f2.akpm@osdl.org>

Roland Dreier wrote:
>
>     >> don't add casts to a void pointer, that's silly.
>
> How should we handle this nit? Should I post a new version of this
> patch or an incremental diff that fixes it up?
>

I'll fix it up.

From roland at topspin.com Thu Mar 3 17:02:36 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 17:02:36 -0800
Subject: [openib-general] Re: [PATCH][3/26] IB/mthca: improve CQ locking part 1
In-Reply-To: <20050304005824.GA18411@kroah.com> (Greg KH's message of "Thu, 3 Mar 2005 16:58:24 -0800")
References: <2005331520.cHJfJcRbBu1fFgB6@topspin.com> <4227AD34.4050002@pobox.com> <20050304005824.GA18411@kroah.com>
Message-ID: <527jkonryr.fsf@topspin.com>

    Greg> Sure, I have no problem accepting that into the pci core.

What would pci_irq_sync() do exactly?

 - R.

From akpm at osdl.org Thu Mar 3 17:07:52 2005
From: akpm at osdl.org (Andrew Morton)
Date: Thu, 3 Mar 2005 17:07:52 -0800
Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks fromthread
In-Reply-To: <20050303170109.72e8a3f2.akpm@osdl.org>
References: <52fyzcnsup.fsf@topspin.com> <20050303170109.72e8a3f2.akpm@osdl.org>
Message-ID: <20050303170752.7bc42e86.akpm@osdl.org>

Andrew Morton wrote:
>
> Roland Dreier wrote:
> >
> >     >> don't add casts to a void pointer, that's silly.
> >
> > How should we handle this nit? Should I post a new version of this
> > patch or an incremental diff that fixes it up?
> >
>
> I'll fix it up.

Actually, seeing as 15/26 has vanished into the ether and there have been
quite a few comments, please resend everything.

From akpm at osdl.org Thu Mar 3 17:22:12 2005
From: akpm at osdl.org (Andrew Morton)
Date: Thu, 3 Mar 2005 17:22:12 -0800
Subject: [openib-general] Re: [PATCH][26/26] IB: MAD cancel callbacks fromthread
In-Reply-To: <20050303170752.7bc42e86.akpm@osdl.org>
References: <52fyzcnsup.fsf@topspin.com> <20050303170109.72e8a3f2.akpm@osdl.org> <20050303170752.7bc42e86.akpm@osdl.org>
Message-ID: <20050303172212.27da9009.akpm@osdl.org>

Andrew Morton wrote:
>
> Andrew Morton wrote:
> >
> > Roland Dreier wrote:
> > >
> > >     >> don't add casts to a void pointer, that's silly.
> > >
> > > How should we handle this nit? Should I post a new version of this
> > > patch or an incremental diff that fixes it up?
> > >
> >
> > I'll fix it up.
>
> Actually, seeing as 15/26 has vanished into the ether and there have been
> quite a few comments, please resend everything.

I seem to have forgotten how to operate this computer thingy. I have all
26 patches.

From roland at topspin.com Thu Mar 3 18:02:20 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 18:02:20 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <20050303165705.B18428@topspin.com> (Libor Michalek's message of "Thu, 3 Mar 2005 16:57:05 -0800")
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com>
Message-ID: <52zmxkmamr.fsf@topspin.com>

    Libor> I was thinking of maybe ref counting the access, either in
    Libor> ib_qp or ib_uobject. Adding a pair of functions, lookup and
    Libor> return, to manage the ref count...

I think it makes sense to put a ref count in struct ib_uobject. That
would make it easier to enforce things like "make sure my CQs don't
get freed while I create this QP" also. Then I could encapsulate the
whole IDR locking mess.

 - R.

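The "lookup and return" pair with a reference count proposed in this thread can be sketched as a userspace model. All names here are hypothetical (`struct uobject`, `uobject_add()`, `uobject_lookup()`, `uobject_put()`, `uobject_remove()`), the handle table is a fixed array rather than an IDR, and a tiny `atomic_flag` spinlock stands in for whatever lock protects the table; none of this is ib_uverbs code. The point it illustrates is that the table lock is held only across the lookup itself, while the reference count keeps the object alive afterwards.

```c
#include <stdatomic.h>
#include <stddef.h>

#define TABLE_SIZE 8

/* Hypothetical object, loosely modelled on the idea of ib_uobject. */
struct uobject {
	atomic_int refcount;
	int freed;	/* set when the last reference is dropped;
			 * a real implementation would free memory here */
};

static struct uobject *table[TABLE_SIZE];
static atomic_flag table_lock = ATOMIC_FLAG_INIT;

static void lock_table(void)
{
	while (atomic_flag_test_and_set_explicit(&table_lock,
						 memory_order_acquire))
		;	/* spin */
}

static void unlock_table(void)
{
	atomic_flag_clear_explicit(&table_lock, memory_order_release);
}

/* "return": drop one reference; the last one releases the object. */
static void uobject_put(struct uobject *obj)
{
	if (atomic_fetch_sub(&obj->refcount, 1) == 1)
		obj->freed = 1;
}

/* Publish an object under a handle; the table owns one reference. */
static int uobject_add(int id, struct uobject *obj)
{
	int ret = -1;

	if (id < 0 || id >= TABLE_SIZE)
		return -1;
	atomic_init(&obj->refcount, 1);
	obj->freed = 0;
	lock_table();
	if (!table[id]) {
		table[id] = obj;
		ret = 0;
	}
	unlock_table();
	return ret;
}

/* "lookup": take a reference while the table lock is held. */
static struct uobject *uobject_lookup(int id)
{
	struct uobject *obj;

	if (id < 0 || id >= TABLE_SIZE)
		return NULL;
	lock_table();
	obj = table[id];
	if (obj)
		atomic_fetch_add(&obj->refcount, 1);
	unlock_table();
	return obj;
}

/* Destroy path: unpublish the handle, then drop the table's reference. */
static void uobject_remove(int id)
{
	struct uobject *obj;

	if (id < 0 || id >= TABLE_SIZE)
		return;
	lock_table();
	obj = table[id];
	table[id] = NULL;
	unlock_table();
	if (obj)
		uobject_put(obj);
}
```

With this shape, a caller that has looked up an object (say a PD) can go on to create a dependent object without holding the table lock, and a concurrent destroy only unpublishes the handle; the object itself is released when the last holder calls `uobject_put()`.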

From libor at topspin.com Thu Mar 3 18:21:21 2005
From: libor at topspin.com (Libor Michalek)
Date: Thu, 3 Mar 2005 18:21:21 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <52zmxkmamr.fsf@topspin.com>; from roland@topspin.com on Thu, Mar 03, 2005 at 06:02:20PM -0800
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com>
Message-ID: <20050303182121.C18428@topspin.com>

On Thu, Mar 03, 2005 at 06:02:20PM -0800, Roland Dreier wrote:
>     Libor> I was thinking of maybe ref counting the access, either in
>     Libor> ib_qp or ib_uobject. Adding a pair of functions, lookup and
>     Libor> return, to manage the ref count...
>
> I think it makes sense to put a ref count in struct ib_uobject.  That
> would make it easier to enforce things like "make sure my CQs don't
> get freed while I create this QP" also.  Then I could encapsulate the
> whole IDR locking mess.

  When you say locking mess, do you mean accessing and potentially
deleting the object which is referenced by the IDR table outside
of the lock used to access the IDR?

-Libor

From roland at topspin.com Thu Mar 3 18:59:21 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 03 Mar 2005 18:59:21 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <20050303182121.C18428@topspin.com> (Libor Michalek's message of "Thu, 3 Mar 2005 18:21:21 -0800")
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com>
Message-ID: <52vf88m7zq.fsf@topspin.com>

    Libor> When you say locking mess, do you mean accessing and
    Libor> potentially deleting the object which is referenced by the
    Libor> IDR table outside of the lock used to access the IDR?

I just meant the fact that right now I have to hold the idr mutex over
both looking up an old object (eg a PD) and creating a new object that
uses the old object (eg an MR).  It makes the cleanup paths and so on
potentially tricky.

 - R.

From Thomas.Talpey at netapp.com Thu Mar 3 16:46:50 2005
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Thu, 03 Mar 2005 19:46:50 -0500
Subject: kDAPL code size Re: [openib-general] putting in dead wood for DAPL and similar abomination
In-Reply-To: <6.2.1.2.2.20050303094748.03422620@exnane01.nane.netapp.com>
References: <35EA21F54A45CB47B879F21A91F4862F3FAD9E@taurus.voltaire.com> <20050303034827.GA9092@lst.de> <6.2.1.2.2.20050303094748.03422620@exnane01.nane.netapp.com>
Message-ID: <6.2.1.2.2.20050303193401.01bc9eb0@exnane01.nane.netapp.com>

At 09:56 AM 3/3/2005, Talpey, Thomas wrote:
>that. At present, the code is heavily commented and fully generalized to
>aid porting to multiple operating systems. It will look quite different once
>it is freed of these attributes. Also, I'll point out there is extensive debug
>and trace throughout the code, which are optional.

I did a quick check of the source and I can report that over half
the lines of kDAPL are comments, taking 22KLOC to around 10KLOC.
Debug and kDAPL/uDAPL ifdefs are another ~500, and the dapl_os_*
portability glue ~2000.

By the way, the NFS/RDMA client code is only 3KLOC. I could guess
it would take another few KLOC if it had to interface directly to verbs.
And that's just the NFS/RDMA client. Repeat for server, repeat for
other upper layers such as iSER. Repeat all for iWARP. Ouch.

Tom.

From mst at mellanox.co.il Fri Mar 4 06:38:56 2005
From: mst at mellanox.co.il (Michael S.
 Tsirkin)
Date: Fri, 4 Mar 2005 16:38:56 +0200
Subject: [openib-general] Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing
In-Reply-To: <52bra0nsi4.fsf@topspin.com>
References: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> <4227A606.50703@pobox.com> <52vf88ntbo.fsf@topspin.com> <4227AEA2.8060007@pobox.com> <52bra0nsi4.fsf@topspin.com>
Message-ID: <20050304143855.GD13804@mellanox.co.il>

Quoting r. Roland Dreier:
> Subject: Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing
>
>     Jeff> Well, we don't just add code to "hope and pray" for an event
>     Jeff> that nobody is sure can even occur...
>
> The hardware requires that if the record is written in two 32-bit
> chunks, then they must be written in order.  Of course the hardware
> probably won't be reading just as we're writing, so almost all of the
> time we won't notice the problem.

It's not necessarily related to reads. Writes must arrive in order, even
if the card is not reading at that time.

--
MST - Michael S. Tsirkin

From vonwyl at EIG.UNIGE.CH Fri Mar 4 06:46:20 2005
From: vonwyl at EIG.UNIGE.CH (Marc von Wyl)
Date: Fri, 04 Mar 2005 15:46:20 +0100
Subject: [openib-general] Problem when trying to compile management and libibverb
In-Reply-To: <20050222194001.GD25382@mellanox.co.il>
References: <52u0o4pfe8.fsf@topspin.com> <20050222194001.GD25382@mellanox.co.il>
Message-ID: <422874BC.2010903@eig.unige.ch>

Hi,

I get some troubles when trying to install the openib userspace gen2...

When I try to compile the management part in
gen2/trunk/src/userspace/management/libibcommon after the autogen and
configure part I get an error with the Makefile:
make[2]: rpath : command not found
It seems that the LINK variable has no value (I tried with ld and
./libtool --mode=link gcc -g but still nothing...).

And for the libibverb part, using autogen configure and make too, I get:
make: *** No rule to make target src/libibverbs.la , necessary for
all-am . Stop

Thanks...

From halr at voltaire.com Fri Mar 4 07:00:17 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 Mar 2005 10:00:17 -0500
Subject: [openib-general] Problem when trying to compile management and libibverb
In-Reply-To: <422874BC.2010903@eig.unige.ch>
References: <52u0o4pfe8.fsf@topspin.com> <20050222194001.GD25382@mellanox.co.il> <422874BC.2010903@eig.unige.ch>
Message-ID: <1109948416.4645.1546.camel@localhost.localdomain>

On Fri, 2005-03-04 at 09:46, Marc von Wyl wrote:
> I get some troubles when trying to install the openib userspace gen2...
>
> When I try to compile the management part in
> gen2/trunk/src/userspace/management/libibcommon after the autogen and
> configure part I get an error with the Makefile :
> make[2]: rpath : command not found
> It seems that the LINK variable has no value (I tried with ld and
> ./libtool --mode=link gcc -g but still nothing...).
>
> And for the libibverb part, using autogen configure and make too, I get :
> make: *** No rule to make target src/libibverbs.la , necessary for
> all-am . Stop

Not sure I totally follow what you did. I presume you followed the
instructions in management/README and ran autogen.sh and configure in
the library directories to generate your makefiles before running make.

If so, not sure why LINK would not be defined. It gets generated in my
Makefile as:
LINK = $(LIBTOOL) --mode=link $(CCLD) $(AM_CFLAGS) $(CFLAGS) \
        $(AM_LDFLAGS) $(LDFLAGS) -o $@
where
CCLD = $(CC)

Have you built this before or is this the first time ?

What distribution are you using ?

Can you send your Makefile which was generated for one of these
libraries ?

Thanks.

-- Hal

From vonwyl at EIG.UNIGE.CH Fri Mar 4 07:41:40 2005
From: vonwyl at EIG.UNIGE.CH (Marc von Wyl)
Date: Fri, 04 Mar 2005 16:41:40 +0100
Subject: [openib-general] Problem when trying to compile management and libibverb
In-Reply-To: <1109948416.4645.1546.camel@localhost.localdomain>
References: <52u0o4pfe8.fsf@topspin.com> <20050222194001.GD25382@mellanox.co.il> <422874BC.2010903@eig.unige.ch> <1109948416.4645.1546.camel@localhost.localdomain>
Message-ID: <422881B4.6070601@eig.unige.ch>

Hal Rosenstock wrote:

>Not sure I totally follow what you did. I presume you followed the
>instructions in management/README and ran autogen.sh and configure in
>the library directories to generate your makefiles before running make.
>
>If so, not sure why LINK would not be defined. It gets generated in my
>Makefile as:
>LINK = $(LIBTOOL) --mode=link $(CCLD) $(AM_CFLAGS) $(CFLAGS) \
>        $(AM_LDFLAGS) $(LDFLAGS) -o $@
>where
>CCLD = $(CC)
>
>Have you built this before or is this the first time ?
>
>What distribution are you using ?
>
>Can you send your Makefile which was generated for one of these
>libraries ?
>
>Thanks.
>
>-- Hal
>
>
>
I found the problem... I had been looking in the wrong direction for two
days... It was a problem with automake.

Thanks and sorry for the disturbance.

From greg at kroah.com Fri Mar 4 08:33:58 2005
From: greg at kroah.com (Greg KH)
Date: Fri, 4 Mar 2005 08:33:58 -0800
Subject: [openib-general] Re: [PATCH][3/26] IB/mthca: improve CQ locking part 1
In-Reply-To: <527jkonryr.fsf@topspin.com>
References: <2005331520.cHJfJcRbBu1fFgB6@topspin.com> <4227AD34.4050002@pobox.com> <20050304005824.GA18411@kroah.com> <527jkonryr.fsf@topspin.com>
Message-ID: <20050304163357.GB28179@kroah.com>

On Thu, Mar 03, 2005 at 05:02:36PM -0800, Roland Dreier wrote:
>     Greg> Sure, I have no problem accepting that into the pci core.
>
> What would pci_irq_sync() do exactly?

Consolidate common code like this?
:)

thanks,

greg k-h

From roland at topspin.com Fri Mar 4 08:34:50 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 04 Mar 2005 08:34:50 -0800
Subject: [openib-general] Re: [PATCH][16/26] IB/mthca: mem-free doorbell record writing
In-Reply-To: <20050304143855.GD13804@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 4 Mar 2005 16:38:56 +0200")
References: <2005331520.WW3zbnVIUjZ4q0Ov@topspin.com> <4227A606.50703@pobox.com> <52vf88ntbo.fsf@topspin.com> <4227AEA2.8060007@pobox.com> <52bra0nsi4.fsf@topspin.com> <20050304143855.GD13804@mellanox.co.il>
Message-ID: <52acpjmkt1.fsf@topspin.com>

    Michael> It's not necessarily related to reads. Writes must arrive
    Michael> in order, even if the card is not reading at that time.

We're talking about doorbell records in host memory, not MMIO doorbell
writes.  So there's no way for the HCA to know that host memory was
written out of order unless it happens to read in the middle.

 - R.

From roland at topspin.com Fri Mar 4 08:43:06 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 04 Mar 2005 08:43:06 -0800
Subject: [openib-general] Re: [PATCH][3/26] IB/mthca: improve CQ locking part 1
In-Reply-To: <20050304163357.GB28179@kroah.com> (Greg KH's message of "Fri, 4 Mar 2005 08:33:58 -0800")
References: <2005331520.cHJfJcRbBu1fFgB6@topspin.com> <4227AD34.4050002@pobox.com> <20050304005824.GA18411@kroah.com> <527jkonryr.fsf@topspin.com> <20050304163357.GB28179@kroah.com>
Message-ID: <521xavmkf9.fsf@topspin.com>

    Roland> What would pci_irq_sync() do exactly?

    Greg> Consolidate common code like this?  :)

I don't see how one can do that.  As I pointed out in my reply to
Jeff, it actually requires understanding how the driver uses the
different MSI-X vectors to know which vector we need to synchronize
against.  So it seems pci_irq_sync() would have to be psychic.

If we can figure out how to do that, maybe we can consolidate a lot
more code into an API like

	void do_what_i_mean(void);

;)

 - R.
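
The ordering requirement from the doorbell-record thread above (a record
written as two 32-bit chunks must have its chunks written in order) can be
sketched in portable C11. This is an illustration, not the mthca code: the
struct layout, the names, and the choice of which word goes first are all
assumptions, and C11 atomics only model ordering between CPU threads. A
kernel driver would normally use a write barrier such as wmb() between the
two stores so that a device reading host memory sees them in order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: a doorbell record of two 32-bit words in host memory. */
struct db_record_sketch {
	_Atomic uint32_t word[2];
};

/* Write both halves so that word[0] becomes visible no later than word[1].
 * Which half must come first is an assumption made for this sketch. */
void db_record_write(struct db_record_sketch *rec,
		     uint32_t first, uint32_t second)
{
	atomic_store_explicit(&rec->word[0], first, memory_order_relaxed);
	/* Release ordering: the store above cannot be reordered after this
	 * one, so an observer that sees word[1] also sees word[0]. */
	atomic_store_explicit(&rec->word[1], second, memory_order_release);
}
```

A reader that happens to look in the middle of the update then sees either
the old record, or the new first word with the old second word, but never
the new second word paired with the old first word.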

From roland at topspin.com Fri Mar 4 08:58:41 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 04 Mar 2005 08:58:41 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <52vf88m7zq.fsf@topspin.com> (Roland Dreier's message of "Thu, 03 Mar 2005 18:59:21 -0800")
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com>
Message-ID: <52sm3bl54u.fsf@topspin.com>

I thought about this a little more.  There is a problem with letting
the CM module look up QPs in the userspace verbs table: it becomes
very awkward to check that the QP belongs to a context (==
userspace verbs file descriptor) owned by the CM user.

I see the following solutions:

 1. Don't worry about checking.  There's nothing too evil a CM user
    can do with a QP beyond getting another QP to connect to it, since
    the CM user can't modify a QP unless it legitimately owns it.  And
    an evil user can always guess the QPN instead of the QP handle anyway.

 2. Change the CM API so that it just takes the QPN, QP type, SRQ
    status and device directly rather than reading it out of the QP.
    This lets the userspace CM just get the info from userspace
    without needing to look at the QP at all.  Of course it does raise
    the issue of how userspace should specify the device.

 3. Merge the userspace CM into userspace verbs support so they use
    the same context.  Ugh.

Personally I would lean slightly towards #2, since it feels to me like
even the kernel CM API would be cleaner that way.  However I don't
have a good answer for how userspace should specify which device to
use.

 - R.
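
Option 2 above can be pictured as a parameter block that carries plain
values instead of a QP pointer. The following is a minimal sketch under
stated assumptions: none of these types or functions exist in the CM code,
all names are hypothetical, and the device is deliberately left out because
how userspace should name it is exactly the open question in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical QP type tag; real verbs code has its own enum. */
enum qp_type_sketch { QP_TYPE_RC_SKETCH, QP_TYPE_UC_SKETCH };

/* Hypothetical kernel-side QP, reduced to the fields the CM would read. */
struct qp_sketch {
	uint32_t qp_num;
	enum qp_type_sketch type;
	int uses_srq;
};

/* Hypothetical option-2 parameters: values only, no QP pointer.
 * The device is omitted on purpose (unresolved in the thread). */
struct cm_req_fields_sketch {
	uint32_t qp_num;
	enum qp_type_sketch qp_type;
	int srq;
};

/* Kernel consumer: copies the values out of a QP it already owns. */
struct cm_req_fields_sketch cm_fields_from_qp(const struct qp_sketch *qp)
{
	struct cm_req_fields_sketch f = {
		.qp_num = qp->qp_num,
		.qp_type = qp->type,
		.srq = qp->uses_srq,
	};
	return f;
}

/* Userspace consumer: supplies the same values directly, so the CM never
 * has to look the QP up in the userspace verbs table. */
struct cm_req_fields_sketch cm_fields_from_user(uint32_t qp_num,
						enum qp_type_sketch type,
						int srq)
{
	struct cm_req_fields_sketch f = {
		.qp_num = qp_num,
		.qp_type = type,
		.srq = srq,
	};
	return f;
}
```

Both callers end up handing the CM the same block, which is the sense in
which the kernel API might come out cleaner; the ownership check from
option 1 is simply not performed in this sketch.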

From libor at topspin.com Fri Mar 4 10:08:38 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 4 Mar 2005 10:08:38 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <52sm3bl54u.fsf@topspin.com>; from roland@topspin.com on Fri, Mar 04, 2005 at 08:58:41AM -0800
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com>
Message-ID: <20050304100838.A19903@topspin.com>

On Fri, Mar 04, 2005 at 08:58:41AM -0800, Roland Dreier wrote:
> I thought about this a little more.  There is a problem with letting
> the CM module look up QPs in the userspace verbs table: it becomes
> very awkward to check that the QP belongs to a context (==
> userspace verbs file descriptor) owned by the CM user.
>
> I see the following solutions:
>
>  1. Don't worry about checking.  There's nothing too evil a CM user
>     can do with a QP beyond getting another QP to connect to it, since
>     the CM user can't modify a QP unless it legitimately owns it.  And
>     an evil user can always guess the QPN instead of the QP handle anyway.
>
>  2. Change the CM API so that it just takes the QPN, QP type, SRQ
>     status and device directly rather than reading it out of the QP.
>     This lets the userspace CM just get the info from userspace
>     without needing to look at the QP at all.  Of course it does raise
>     the issue of how userspace should specify the device.

  As you say, the second solution does not resolve the issue about which
you are worried in the first solution. The device issue I think still
creates a dependency between the kernel components of uverbs and ucm,
it's just moved down the chain, since the only thing that has the user
to kernel mapping of the device handle is uverbs. Unless that code is
duplicated, and then it becomes a software maintenance dependency...

> Personally I would lean slightly towards #2, since it feels to me like
> even the kernel CM API would be cleaner that way.  However I don't
> have a good answer for how userspace should specify which device to
> use.

  I'm still leaning towards #1, if it comes down to a choice between
device or QP that needs to be exposed, the QP seems more intuitive to me.

-Libor

From roland at topspin.com Fri Mar 4 10:27:39 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 04 Mar 2005 10:27:39 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <20050304100838.A19903@topspin.com> (Libor Michalek's message of "Fri, 4 Mar 2005 10:08:38 -0800")
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <20050304100838.A19903@topspin.com>
Message-ID: <52k6onl10k.fsf@topspin.com>

    Libor> As you say, the second solution does not resolve the
    Libor> issue about which you are worried in the first
    Libor> solution. The device issue I think still creates a
    Libor> dependency between the kernel components of uverbs and ucm,
    Libor> it's just moved down the chain, since the only thing that
    Libor> has the user to kernel mapping of the device handle is
    Libor> uverbs. Unless that code is duplicated, and then it becomes
    Libor> a software maintenance dependency...

Yeah, you're right.  OK, let's just export the QP handle
lookup/release stuff from uverbs.

 - R.

From mshefty at ichips.intel.com Fri Mar 4 10:50:16 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 04 Mar 2005 10:50:16 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <20050304100838.A19903@topspin.com>
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <20050304100838.A19903@topspin.com>
Message-ID: <4228ADE8.70109@ichips.intel.com>

Libor Michalek wrote:
>> 2. Change the CM API so that it just takes the QPN, QP type, SRQ
>>    status and device directly rather than reading it out of the QP.
>>    This lets the userspace CM just get the info from userspace
>>    without needing to look at the QP at all.  Of course it does raise
>>    the issue of how userspace should specify the device.
>
>   As you say, the second solution does not resolve the issue about which
> you are worried in the first solution. The device issue I think still
> creates a dependency between the kernel components of uverbs and ucm,
> it's just moved down the chain, since the only thing that has the user
> to kernel mapping of the device handle is uverbs. Unless that code is
> duplicated, and then it becomes a software maintenance dependency...

The CM needs the device in order to locate which port to send out the
connection REQ on.  We could let the CM locate the device in the kernel
based on the user's path record.  This goes back a little to the
discussion of having a cm_path field in the REQ parameter that the CM
can use when sending the REQ.  On the receiving side of the REQ, the CM
knows the device based on which port the REQ came in on.

In order to make this work, there needs to be a call similar to:

ib_find_cached_device_gid(gid, &device, &port_num, &index);

The CM already needs something like this call in order to perform SIDR.
(See cm_find_device() in cm.c.)
Support for this call requires some changes/exposure to the known
device list.

So, I think that we may be able to change the kernel CM API to take the
necessary fields, rather than the QP pointer.  I can make doing these
changes a priority over RMPP if we decide to go this route.

- Sean

From tduffy at sun.com Fri Mar 4 11:58:33 2005
From: tduffy at sun.com (Tom Duffy)
Date: Fri, 04 Mar 2005 11:58:33 -0800
Subject: [openib-general] IB Address Translation service
In-Reply-To:
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman>
Message-ID: <1109966313.20238.11.camel@duffman>

On Wed, 2005-03-02 at 11:26 -0500, James Lentini wrote:
> tduffy> > The one thing that ATS provide and is not possible with
> tduffy> > ARP is reverse resolution GID->IP, any ideas how to achieve
> tduffy> > that without ATS ?
> tduffy>
> tduffy> RARP.
>
> Where is the encapsulation of RARP packets on IB defined? The
> "Transmission of IP over InfiniBand" IETF draft specifies the
> procedure for ARP and Neighbor Discovery, but not RARP.

I do see some mention of RARP in the ipoib IETF draft, but it may not be
fully fleshed out.

In any event, I think being able to plop an IB network in an Ethernet
world will require things like RARP to work.  If there is no spec now,
it should be written.

-tduffy

From halr at voltaire.com Fri Mar 4 12:53:22 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 Mar 2005 15:53:22 -0500
Subject: [Fwd: Re: [openib-general] Solaris IPoIB MTU with OpenSM]
Message-ID: <1109969601.4648.32.camel@erez-s.us.voltaire.com>

Hi again Nitin,

Finally got a chance to work on this. I have a workaround for you for
now. Real patch later... Let me know if this does the trick for you. It
did for me.

-- Hal

Index: osm_sa_mcmember_record.c
===================================================================
--- osm_sa_mcmember_record.c	(revision 1953)
+++ osm_sa_mcmember_record.c	(working copy)
@@ -1522,9 +1522,11 @@
   if ((IB_MCR_COMPMASK_PROXY & comp_mask) &&
       (p_rcvd_rec->proxy_join != p_mgrp->mcmember_rec.proxy_join))
     goto Exit;
+#if 0	/* if defined MUST match exactly !*/
   if ((IB_MCR_COMPMASK_MTU_SEL & comp_mask) &&
       ((p_rcvd_rec->mtu >> 6) != (p_mgrp->mcmember_rec.mtu >> 6)))
     goto Exit;
+#endif
   if ((IB_MCR_COMPMASK_MTU & comp_mask) &&
       ((p_rcvd_rec->mtu & 0x3F) != (p_mgrp->mcmember_rec.mtu & 0x3F)))
     goto Exit;

-----Forwarded Message-----
From: Hal Rosenstock
To: Nitin Hande
Cc: openib, Tom Duffy
Subject: Re: [openib-general] Solaris IPoIB MTU with OpenSM
Date: 24 Feb 2005 08:42:23 -0500

Hi Nitin,

On Wed, 2005-02-23 at 17:19, Nitin Hande wrote:
> Hal,
>
> [comments below]
> On Wed, 2005-02-23 at 02:19, Hal Rosenstock wrote:
> > On Tue, 2005-02-22 at 22:56, Nitin Hande wrote:
> > > So I tried the latest patches and preliminarily things seem to be
> > > working fine.
> >
> > Yipee.
> [snip..]
> >
> > >
> > > So after this test above, I try to run snoop on the solaris interface
> > > and get the following error message from the layer below IPoIB:
> > >
> > > Feb 22 19:50:25 dongon.SFBay.Sun.COM ibd: [ID 517869 kern.info] NOTICE:
> > > ibd0: HCA GUID 0002c901097651d0 port 1 PKEY ffff Could not get list of
> > > IBA multicast groups
> > >
> > > My preliminary assumption is that OpenSm is not returning the list of
> > > multicast groups that the ibd interface has joined. I will look at the
> > > MAD's tomorrow and try to ascertain that.
> >
> > How does S10 request this ? Remember that if it is a GetTable and
> > doesn't fit in a single MAD, it will be broken now. If that is the case,
> > we will live with this until we have real RMPP.
> Below is an example of a single GetTable request and response between
> Solaris and OpenSM.
> OpenSM is not reporting the MCgroups in case of a
> single request/response. I have also provided a MAD output between
> Solaris IPoIB driver and IBSRM single GetTable request response below
> this example.
>
> Here is the MAD trace between solaris and OpenSM:
> Outgoing MAD:
> BaseVersion: 0x1
> MgmtClass: 0x3 - SubnAdm
> ClassVersion: 0x2
> R_Method: 0x12 - SubnAdmGetTable()
> Status: 0x0 - NO_ERROR
> ClassSpecific: 0x0
> TransactionID: 0x97651d1000000ec
> AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID
>
>      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
>  0: 01 03 02 12 00 00 00 00 09 76 51 d1 00 00 00 ec  .........vQ.....
> 10: 00 38 00 00 ff ff ff ff 00 00 00 00 00 00 00 00  .8..............
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 30: 00 00 00 00 00 00 80 b4 00 00 00 00 00 00 00 00  ................
> 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 50: 00 00 00 00 00 00 00 00 00 00 0b 1b 00 00 84 00  ................
> 60: ff ff 00 00 00 00 00 00 20 00 00 00 00 00 00 00  ........ .......
> 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> Incoming MAD:
> BaseVersion: 0x1
> MgmtClass: 0x3 - SubnAdm
> ClassVersion: 0x2
> R_Method: 0x92 -
> Status: 0x0 - NO_ERROR
> ClassSpecific: 0x0
> TransactionID: 0x97651d1000000ec
> AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID
>
>      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
>  0: 01 03 02 92 00 00 00 00 09 76 51 d1 00 00 00 ec  .........vQ.....
> 10: 00 38 00 00 ff ff ff ff 01 01 77 00 00 00 00 01  .8........w.....
> 20: 00 00 00 14 00 00 00 00 00 00 00 00 00 07 00 00  ................
> 30: 00 00 00 00 00 00 80 b4 00 00 00 00 00 00 00 00  ................
> 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

It is likely failing the component checking in
osm_sa_mcmember_record.c::__osm_sa_mcm_by_comp_mask_cb due to an endian
issue. Either you can debug this code or I will early next week.

The component mask in the request is 0x80b4 so the only components
checked are QKey (0xb1b), MTU (exactly 2048 (4)), PKey (0xffff), and
scope (2).

If I don't hear anything by next week, I will work on this then. Thanks.

-- Hal

> Here is the transaction between IBSRM and Solaris IPoIB driver.
>
> Outgoing MAD:
> BaseVersion: 0x1
> MgmtClass: 0x3 - SubnAdm
> ClassVersion: 0x2
> R_Method: 0x12 - SubnAdmGetTable()
> Status: 0x0 - NO_ERROR
> ClassSpecific: 0x0
> TransactionID: 0x8fecc610000009a
> AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID
>
>      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
>  0: 01 03 02 12 00 00 00 00 08 fe cc 61 00 00 00 9a  ...........a....
> 10: 00 38 00 00 ff ff ff ff 00 00 00 00 00 00 00 00  .8..............
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 30: 00 00 00 00 00 00 80 b4 00 00 00 00 00 00 00 00  ................
> 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 50: 00 00 00 00 00 00 00 00 81 23 45 68 00 00 84 00  .........#Eh....
> 60: 80 01 00 00 00 00 00 00 20 00 00 00 00 00 00 00  ........ .......
> 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> Incoming MAD:
> BaseVersion: 0x1
> MgmtClass: 0x3 - SubnAdm
> ClassVersion: 0x2
> R_Method: 0x92 -
> Status: 0x0 - NO_ERROR
> ClassSpecific: 0x0
> TransactionID: 0x8fecc610000009a
> AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID
>
>      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
>  0: 01 03 02 92 00 00 00 00 08 fe cc 61 00 00 00 9a  ...........a....
> 10: 00 38 00 00 00 00 00 00 01 01 73 00 00 00 00 01  .8........s.....
> 20: 00 00 01 40 00 00 00 00 00 00 00 00 00 07 00 00  ...@............
> 30: 00 00 00 00 00 00 00 00 ff 12 40 1b 80 01 00 00  ..........@.....
> 40: 00 00 00 00 00 00 00 09 00 00 00 00 00 00 00 00  ................
> 50: 00 00 00 00 00 00 00 00 81 23 45 68 c0 04 84 00  .........#Eh....
> 60: 80 01 83 8d 00 00 00 00 20 00 00 00 00 00 00 00  ........ .......
> 70: ff 12 40 1b 80 01 00 00 00 00 00 00 00 00 00 01  ..@.............
> 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 90: 81 23 45 68 c0 03 84 00 80 01 83 8d 00 00 00 00  .#Eh............
> a0: 20 00 00 00 00 00 00 00 ff 12 40 1b 80 01 00 00   .........@.....
> b0: 00 00 00 00 ff ff ff ff 00 00 00 00 00 00 00 00  ................
> c0: 00 00 00 00 00 00 00 00 81 23 45 68 c0 00 84 00  .........#Eh....
> d0: 80 01 83 8d 00 00 00 00 20 00 00 00 00 00 00 00  ........ .......
> e0: ff 12 60 1b 80 01 00 00 00 00 00 01 ff 76 5b 01  ..`..........v[.
> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>
> Thanks
> Nitin

From Thomas.Talpey at netapp.com Fri Mar 4 13:06:01 2005
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Fri, 04 Mar 2005 16:06:01 -0500
Subject: [openib-general] IB Address Translation service
In-Reply-To: <1109966313.20238.11.camel@duffman>
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman>
Message-ID: <6.2.1.2.2.20050304160204.05e1d0e0@exnane01.nane.netapp.com>

At 02:58 PM 3/4/2005, Tom Duffy wrote:
>In any event, I think being able to plop an IB network in an Ethernet
>world will require things like RARP to work.  If there is no spec now,
>it should be written.

I can't remember the last time I saw a machine RARP. Well, maybe
I do but it was like 1980-something. Since DHCP, I don't think there's
a reason for hosts to do it.

Here's what comes out when I type "man rarp" on a 2.6 system.
"Obsolete".

>RARP(8)              Linux Programmer's Manual              RARP(8)
>
>
>
>NAME
>       rarp - manipulate the system RARP table
>
>SYNOPSIS
>       rarp [-V] [--version] [-h] [--help]
>       rarp -a
>       rarp [-v] -d hostname ...
>       rarp [-v] [-t type] -s hostname hw_addr
>
>NOTE
>       This program is obsolete.  From version 2.3, the Linux
>       kernel no longer contains RARP support.  For a replacement
>       RARP daemon, see ftp://ftp.dementia.org/pub/net-tools

Tom.

From mshefty at ichips.intel.com Fri Mar 4 16:09:14 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 04 Mar 2005 16:09:14 -0800
Subject: [openib-general] [MAD] RMPP reassembly
In-Reply-To: <4224B419.1080601@ichips.intel.com>
References: <4224B419.1080601@ichips.intel.com>
Message-ID: <4228F8AA.2060102@ichips.intel.com>

Sean Hefty wrote:
> I'm studying the RMPP implementation requirements for reassembly, and
> there are a couple of issues/questions.

* In order to send RMPP ACKs, etc. the RMPP code needs access to a MR
(LKey actually) usable with the registered mad_agent.  Both the CM and
SA query code call ib_get_dma_mr() after calling
ib_register_mad_agent(), and I would expect that other code will be
similar.  I was considering adding an ib_mr* field to the mad_agent
structure and returning it to the user.  Any objections or comments?

- Sean

From roland at topspin.com Fri Mar 4 16:31:06 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 04 Mar 2005 16:31:06 -0800
Subject: [openib-general] [MAD] RMPP reassembly
In-Reply-To: <4228F8AA.2060102@ichips.intel.com> (Sean Hefty's message of "Fri, 04 Mar 2005 16:09:14 -0800")
References: <4224B419.1080601@ichips.intel.com> <4228F8AA.2060102@ichips.intel.com>
Message-ID: <52y8d3j5md.fsf@topspin.com>

    Sean> * In order to send RMPP ACKs, etc. the RMPP code needs
    Sean> access to a MR (LKey actually) usable with the registered
    Sean> mad_agent.  Both the CM and SA query code call
    Sean> ib_get_dma_mr() after calling ib_register_mad_agent(), and I
    Sean> would expect that other code will be similar.  I was
    Sean> considering adding an ib_mr* field to the mad_agent
    Sean> structure and returning it to the user.  Any objections or
    Sean> comments?

We discussed this once back in September of last year.  For some
reason we decided that the consumer was responsible for managing
memory registration, but I don't remember why.

 - R.

From mshefty at ichips.intel.com Fri Mar 4 16:46:21 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 04 Mar 2005 16:46:21 -0800
Subject: [openib-general] [MAD] RMPP reassembly
In-Reply-To: <52y8d3j5md.fsf@topspin.com>
References: <4224B419.1080601@ichips.intel.com> <4228F8AA.2060102@ichips.intel.com> <52y8d3j5md.fsf@topspin.com>
Message-ID: <4229015D.6000005@ichips.intel.com>

Roland Dreier wrote:
>     Sean> * In order to send RMPP ACKs, etc. the RMPP code needs
>     Sean> access to a MR (LKey actually) usable with the registered
>     Sean> mad_agent.
>
> We discussed this once back in September of last year.  For some
> reason we decided that the consumer was responsible for managing
> memory registration, but I don't remember why.

I can't remember either, but I don't know if we thought about internally
generated MADs sent on behalf of the user.  I will continue to try to
find that discussion.

At a minimum, the RMPP code either needs the user to provide a MR that
it can use when registering for the MAD service, or it needs to allocate
one internally.  If the latter approach is used, exposing it to the user
seems to make sense.

For QP 0/1 traffic, the RMPP layer could cheat a little and allocate a
single MR per port, rather than per mad_agent, similar to what the CM
and SA module do.  But then a different method would be needed if we
ever wanted to support RMPP on a redirected QP.
- Sean From beng at isilon.com Fri Mar 4 17:01:35 2005 From: beng at isilon.com (Brian Eng) Date: Fri, 04 Mar 2005 17:01:35 -0800 Subject: [openib-general] Re: Incorrect endian in GUID comparison/SM master selection In-Reply-To: <1108607846.27002.113.camel@bengbsd.isilon.com> References: <1108607846.27002.113.camel@bengbsd.isilon.com> Message-ID: <1109984495.62825.4849.camel@bengbsd.isilon.com> Hello again, I found another place in OpenSM where it compares two SM's. I suggest the following both to fix it and to form a common comparison routine: --- osm_sminfo_rcv.c 4 Mar 2005 23:21:53 -0000 1.2.2.4 +++ osm_sminfo_rcv.c 5 Mar 2005 00:45:03 -0000 @@ -156,30 +156,15 @@ osm_sminfo_rcv_init( By higher - we mean: SM with higher priority or with same priority and lower GUID. **********************************************************************/ -boolean_t +inline boolean_t __osm_sminfo_rcv_remote_sm_is_higher ( IN const osm_sminfo_rcv_t* p_rcv, IN const ib_sm_info_t* p_remote_sm ) { - - - if( ib_sminfo_get_priority( p_remote_sm ) > - p_rcv->p_subn->opt.sm_priority ) - { - return( TRUE ); - } - else - { - if( ib_sminfo_get_priority( p_remote_sm ) == - p_rcv->p_subn->opt.sm_priority ) - { - if( p_remote_sm->guid < p_rcv->p_subn->sm_port_guid ) - { - return( TRUE ); - } - } - } - return( FALSE ); + return( osm_sm_is_greater_than( ib_sminfo_get_priority( p_remote_sm ), + p_remote_sm->guid, + p_rcv->p_subn->opt.sm_priority, + p_rcv->p_subn->sm_port_guid) ); } /********************************************************************** --- osm_state_mgr.c 4 Mar 2005 23:15:42 -0000 1.2.2.6 +++ osm_state_mgr.c 5 Mar 2005 00:45:03 -0000 @@ -1563,12 +1563,19 @@ __osm_state_mgr_get_highest_sm( { cl_qmap_t* p_sm_tbl; osm_remote_sm_t* p_sm = NULL; - osm_remote_sm_t* p_highest_sm = NULL; + osm_remote_sm_t* p_highest_sm; + uint8_t highest_sm_priority; + ib_net64_t highest_sm_guid; OSM_LOG_ENTER( p_mgr->p_log, __osm_state_mgr_get_highest_sm ); p_sm_tbl = &p_mgr->p_subn->sm_guid_tbl; + /* 
Start with the local sm as the standard */ + p_highest_sm = NULL; + highest_sm_priority = p_mgr->p_subn->opt.sm_priority; + highest_sm_guid = p_mgr->p_subn->sm_port_guid; + /* go over all the remote SMs */ for( p_sm = (osm_remote_sm_t*)cl_qmap_head( p_sm_tbl ); p_sm != (osm_remote_sm_t*)cl_qmap_end( p_sm_tbl ); @@ -1579,55 +1586,19 @@ __osm_state_mgr_get_highest_sm( if (ib_sminfo_get_state(&p_sm->smi) == IB_SMINFO_STATE_NOTACTIVE ) continue; - if ( p_highest_sm == NULL) + if ( osm_sm_is_greater_than( ib_sminfo_get_priority(&p_sm->smi), + p_sm->smi.guid, highest_sm_priority, highest_sm_guid ) ) { + /* the new p_sm is with higher priority - update the highest_sm */ + /* to this sm */ p_highest_sm = p_sm; - } - else - { - if ( ib_sminfo_get_priority(&p_sm->smi) > - ib_sminfo_get_priority(&p_highest_sm->smi) ) - { - /* the new p_sm is with higher priority - update the highest_sm */ - /* to this sm */ - p_highest_sm = p_sm; - } - else - { - if ( ib_sminfo_get_priority(&p_sm->smi) == - ib_sminfo_get_priority(&p_highest_sm->smi) ) - { - /* both SMs are with same priority - compare GUIDs */ - if ( p_sm->smi.guid < p_highest_sm->smi.guid ) - { - /* the new p_sm is with same priority but lower GUID - */ - /* update the highest sm to this sm */ - p_highest_sm = p_sm; - } - } - } + highest_sm_priority = ib_sminfo_get_priority(&p_sm->smi); + highest_sm_guid = p_sm->smi.guid; } } - /* compare the p_highest_sm to the local sm */ + if ( p_highest_sm != NULL ) { - /* check if this SM is higher then us */ - if ( p_mgr->p_subn->opt.sm_priority > - ib_sminfo_get_priority(&p_highest_sm->smi) ) - { - /* the local SM has higher priority */ - return( NULL ); - } - else - { - if( ib_sminfo_get_priority(&p_highest_sm->smi) == - p_mgr->p_subn->opt.sm_priority && - p_highest_sm->smi.guid > p_mgr->p_subn->sm_port_guid ) - { - /* they have same priority. 
Local SM has lower GUID */ - return( NULL ); - } - } osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_state_mgr_get_highest_sm: " "Found higher SM with guid: %016" PRIx64 "\n", --- osm_state_mgr.h 23 Sep 2004 18:43:08 -0000 1.2 +++ osm_state_mgr.h 5 Mar 2005 00:45:03 -0000 @@ -495,6 +495,62 @@ osm_state_mgr_process( * SEE ALSO * State Manager *********/ +/****f* OpenSM: State Manager/osm_sm_is_greater_than +* NAME +* osm_sm_is_greater_than +* +* DESCRIPTION +* Compares two SM's (14.4.1.2) +* +* SYNOPSIS +*/ +static inline boolean_t +osm_sm_is_greater_than ( + IN const uint8_t l_priority, + IN const ib_net64_t l_guid, + IN const uint8_t r_priority, + IN const ib_net64_t r_guid ) +{ + if( l_priority > r_priority ) + { + return( TRUE ); + } + else + { + if( l_priority == r_priority ) + { + if( cl_ntoh64(l_guid) < cl_ntoh64(r_guid) ) + { + return( TRUE ); + } + } + } + return( FALSE ); +} +/* +* PARAMETERS +* l_priority +* [in] Priority of the SM on the "left" +* +* l_guid +* [in] GUID of the SM on the "left" +* +* r_priority +* [in] Priority of the SM on the "right" +* +* r_guid +* [in] GUID of the SM on the "right" +* +* RETURN VALUES +* Return TRUE if an sm with l_priority and l_guid is higher than an sm +* with r_priority and r_guid, +* return FALSE otherwise. +* +* NOTES +* +* SEE ALSO +* State Manager +*********/ From lindahl at pathscale.com Fri Mar 4 18:04:03 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Fri, 4 Mar 2005 18:04:03 -0800 Subject: [openib-general] IB Address Translation service In-Reply-To: <1109966313.20238.11.camel@duffman> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> Message-ID: <20050305020402.GA3297@greglaptop.internal.keyresearch.com> On Fri, Mar 04, 2005 at 11:58:33AM -0800, Tom Duffy wrote: > In any event, I think being able to plop an IB network in an Ethernet > world will require things like RARP to work. 
If there is no spec now, > it should be written. Much more important is understanding the role of RARP in the ethernet world. It is *not* something you do to find _someone else's_ IP addr from their MAC addr. It's what you do to find your _own_ IP addr because you're booting. Ethernet protocols such as IP include enough IP information to talk back to someone who sent you a packet. So you don't need to find out an IP addr from a MAC for remote nodes on a regular basis. Instead, you find out a MAC addr from an IP address, which is ARP. RARP is little used now that DHCP is popular. Now it would be nice for ethernet broadcast packets to just work(tm) with IPoIB. "ping -b" is an example of a user-level program that generates a broadcast packet. DHCP clients also generate such packets, and DHCP servers listen for them. Getting a RARP client and server to work ought to be the same as a DHCP client and server. -- greg From tduffy at sun.com Fri Mar 4 18:30:48 2005 From: tduffy at sun.com (Tom Duffy) Date: Fri, 04 Mar 2005 18:30:48 -0800 Subject: [openib-general] IB Address Translation service In-Reply-To: <6.2.1.2.2.20050304160204.05e1d0e0@exnane01.nane.netapp.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <6.2.1.2.2.20050304160204.05e1d0e0@exnane01.nane.netapp.com> Message-ID: <1109989848.20238.16.camel@duffman> On Fri, 2005-03-04 at 16:06 -0500, Talpey, Thomas wrote: > At 02:58 PM 3/4/2005, Tom Duffy wrote: > >In any event, I think being able to plop an IB network in an Ethernet > >world will require things like RARP to work. If there is no spec now, > >it should be written. > > I can't remember the last time I saw a machine RARP. Well, maybe I do > but it was like 1980-something. Since DHCP, I don't think there's a > reason for hosts to do it. I guess Sun is stuck in the 80's -- big hair and new age music. All the sparc openboot systems rarp/bootp to network start. 
-tduffy From cap at nsc.liu.se Sat Mar 5 02:24:18 2005 From: cap at nsc.liu.se (Peter Kjellström) Date: Sat, 5 Mar 2005 11:24:18 +0100 Subject: [openib-general] IB Address Translation service In-Reply-To: <6.2.1.2.2.20050304160204.05e1d0e0@exnane01.nane.netapp.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109966313.20238.11.camel@duffman> <6.2.1.2.2.20050304160204.05e1d0e0@exnane01.nane.netapp.com> Message-ID: <200503051124.19440.cap@nsc.liu.se> On Friday 04 March 2005 22.06, Talpey, Thomas wrote: > At 02:58 PM 3/4/2005, Tom Duffy wrote: > >In any event, I think being able to plop an IB network in an Ethernet > >world will require things like RARP to work. If there is no spec now, > >it should be written. > > I can't remember the last time I saw a machine RARP. Well, maybe I do > but it was like 1980-something. Since DHCP, I don't think there's a > reason for hosts to do it. If my memory serves me right, Clustermatic uses rarp instead of dhcp since it's much lighter and simpler. I can imagine that if you have to code something that's not in a full OS, implementing a rarp-based find-my-ip function will seem a lot more fun than implementing a dhcp-client (or porting one...). /Peter -- ------------------------------------------------------------ Peter Kjellström | E-mail: cap at nsc.liu.se National Supercomputer Centre | Sweden | http://www.nsc.liu.se From David.Brean at Sun.COM Sat Mar 5 07:22:08 2005 From: David.Brean at Sun.COM (David M. 
Brean) Date: Sat, 05 Mar 2005 10:22:08 -0500 Subject: [openib-general] IB Address Translation service In-Reply-To: <20050305020402.GA3297@greglaptop.internal.keyresearch.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <20050305020402.GA3297@greglaptop.internal.keyresearch.com> Message-ID: <4229CEA0.7060904@sun.com> Greg Lindahl wrote: >On Fri, Mar 04, 2005 at 11:58:33AM -0800, Tom Duffy wrote: > > > >>In any event, I think being able to plop an IB network in an Ethernet >>world will require things like RARP to work. If there is no spec now, >>it should be written. >> >> > >Much more important is understanding the role of RARP in the ethernet >world. > >It is *not* something you do to find _someone else's_ IP addr from >their MAC addr. It's what you do to find your _own_ IP addr because >you're booting. Ethernet protocols such as IP include enough IP >information to talk back to someone who sent you a packet. So you >don't need to find out an IP addr from a MAC for remote nodes on a >regular basis. Instead, you find out a MAC addr from an IP address, >which is ARP. > > > Right, RARP won't satisfy the reverse lookup requirement being put forward, so I don't think it's relevant to this address resolution discussion. >RARP is little used now that DHCP is popular. > >Now it would be nice for ethernet broadcast packets to just work(tm) >with IPoIB. "ping -b" is an example of a user-level program that >generates a broadcast packet. DHCP clients also generate such >packets, and DHCP servers listen for them. Getting a RARP client and >server to work ought to be the same as a DHCP client and server. > > > There is an I-D for DHCP on IB. IPoIB defines a "broadcast" address and DHCP (and ARP) on IB use it. Could make RARP work using this mechanism, but as someone else pointed out, the IB hardware address contains a QPN. 
The I-D for IPoIB says something like: The link-layer address for IPoIB includes the QPN which might not be constant across reboots or even across network interface resets. Cached QPN entries, such as in static ARP entries or in RARP servers will only work if the implementation(s) using these options ensure that the QPN associated with an interface is invariant across reboots/network resets. So, there are requirements on the IPoIB implementation to make RARP work. Folks in the IPoIB work group decided not to go much further than these statements for RARP support since most folks felt that DHCP is (de facto) replacement. -David >-- greg From David.Brean at Sun.COM Sat Mar 5 07:52:16 2005 From: David.Brean at Sun.COM (David M. Brean) Date: Sat, 05 Mar 2005 10:52:16 -0500 Subject: [openib-general] IB Address Translation service In-Reply-To: <1109989848.20238.16.camel@duffman> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <6.2.1.2.2.20050304160204.05e1d0e0@exnane01.nane.netapp.com> <1109989848.20238.16.camel@duffman> Message-ID: <4229D5B0.30205@sun.com> Tom Duffy wrote: >On Fri, 2005-03-04 at 16:06 -0500, Talpey, Thomas wrote: > > >>At 02:58 PM 3/4/2005, Tom Duffy wrote: >> >> >>>In any event, I think being able to plop an IB network in an Ethernet >>>world will require things like RARP to work. If there is no spec now, >>>it should be written. >>> >>> >>I can't remember the last time I saw a machine RARP. Well, maybe I do >>but it was like 1980-something. Since DHCP, I don't think there's a >>reason for hosts to do it. >> >> > >I guess Sun is stuck in the 80's -- big hair and new age music. 
All the >sparc openboot systems rarp/bootp to network start. > > > And DHCP, too. -David >-tduffy From halr at voltaire.com Sat Mar 5 08:13:39 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Mar 2005 11:13:39 -0500 Subject: [openib-general] IB Address Translation service In-Reply-To: <20050305020402.GA3297@greglaptop.internal.keyresearch.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <20050305020402.GA3297@greglaptop.internal.keyresearch.com> Message-ID: <1110039219.4648.709.camel@localhost.localdomain> On Fri, 2005-03-04 at 21:04, Greg Lindahl wrote: > Now it would be nice for ethernet broadcast packets to just work(tm) > with IPoIB. "ping -b" is an example of a user-level program that > generates a broadcast packet. Isn't ping -b a broadcast at the IP (ICMP) level which indirectly causes a broadcast at the link level? This is different from the arping case which directly wants to send (and receive) link level broadcasts from user space rather than have the kernel do it on its behalf. > DHCP clients also generate such packets, and DHCP servers listen for them. Getting a RARP client and > server to work ought to be the same as a DHCP client and server. DHCP uses UDP so is similar to ping -b in that regard.
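The IP-level case can be sketched from user space as below: a UDP socket (the transport DHCP uses) only needs SO_BROADCAST set, and the kernel then takes care of the link-level broadcast. This is a sketch under that assumption only; the helper names are made up for illustration, and the link-level (arping-style) case is not shown.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Open a UDP socket that is allowed to send to a broadcast address.
 * Returns the descriptor, or -1 on failure. */
static int open_broadcast_udp_socket(void)
{
    int on = 1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Report whether SO_BROADCAST is set: 1 if set, 0 if not, -1 on error. */
static int broadcast_enabled(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);

    if (getsockopt(fd, SOL_SOCKET, SO_BROADCAST, &on, &len) < 0)
        return -1;
    return on != 0;
}
```

The caller never builds a link-level frame here; it would simply sendto() a broadcast IP address, which is the sense in which DHCP and ping -b are alike.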
-- Hal From halr at voltaire.com Sat Mar 5 08:17:39 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Mar 2005 11:17:39 -0500 Subject: [openib-general] IB Address Translation service In-Reply-To: <4229CEA0.7060904@sun.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <20050305020402.GA3297@greglaptop.internal.keyresearch.com> <4229CEA0.7060904@sun.com> Message-ID: <1110039458.4648.727.camel@localhost.localdomain> On Sat, 2005-03-05 at 10:22, David M. Brean wrote: > There is an I-D for DHCP on IB. IPoIB defines a "broadcast" address and > DHCP (and ARP) on IB use it. Could make RARP work using this mechanism, > but as someone else pointed out, the IB hardware address contains a > QPN. The I-D for IPoIB says something like: > > The link-layer address for IPoIB includes the QPN which might not be > constant across reboots or even across network interface resets. > Cached QPN entries, such as in static ARP entries or in RARP servers > will only work if the implementation(s) using these options ensure > that the QPN associated with an interface is invariant across > reboots/network resets. That may be the requirement but I think there are some issues with keeping the QPN invariant. Quoting Dror Goldenberg (http://openib.org/pipermail/openib-general/2004-November/006765.html): "Assigning specific QPN for ipoib requires allocation of QPN space which is beyond IB spec verbs. Current verbs do not allow it. I don't have any objection for that, except that you have to hold a set of preallocated QPs with specific numbers and hand them over to privileged consumer when requested to. I wouldn't commit that it will work on any HCA architecture." -- Hal > > So, there are requirements on the IPoIB implementation to make RARP > work. 
Folks in the IPoIB work group decided not to go much further than > these statements for RARP support since most folks felt that DHCP is (de > facto) replacement. > > -David > > > >-- greg From yaronh at voltaire.com Sat Mar 5 10:42:45 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Sat, 5 Mar 2005 20:42:45 +0200 Subject: [openib-general] IB Address Translation service Message-ID: <35EA21F54A45CB47B879F21A91F4862F3FAEA7@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Hal Rosenstock > Sent: Saturday, March 05, 2005 6:18 PM > To: David M. Brean > Cc: openib-general at openib.org > Subject: Re: [openib-general] IB Address Translation service > > On Sat, 2005-03-05 at 10:22, David M. Brean wrote: > > There is an I-D for DHCP on IB. IPoIB defines a "broadcast" address and > > DHCP (and ARP) on IB use it. Could make RARP work using this mechanism, > > but as someone else pointed out, the IB hardware address contains a > > QPN. The I-D for IPoIB says something like: > > > > The link-layer address for IPoIB includes the QPN which might not be > > constant across reboots or even across network interface resets. > > Cached QPN entries, such as in static ARP entries or in RARP servers > > will only work if the implementation(s) using these options ensure > > that the QPN associated with an interface is invariant across > > reboots/network resets. 
> > That may be the requirement but I think there are some issues with > keeping the QPN invariant. Quoting Dror Goldenberg > (http://openib.org/pipermail/openib-general/2004-November/006765.html): > "Assigning specific QPN for ipoib requires allocation of QPN space which > is beyond IB spec verbs. Current verbs do not allow it. I don't have any > objection for that, except that you have to hold a set of preallocated > QPs with specific numbers and hand them over to privileged consumer when > requested to. I wouldn't commit that it will work on any HCA > architecture." > > -- Hal > Just to add to Hal and Dave: it is not only that the QPN may not be constant. You can actually have a few valid QPNs, one or more per partition. Since each partition reflects the notion of an IP VLAN/network, RARP should return a different IP per partition, and the RARP caller should use a different QPN in each case. I believe all the emails in this thread clarify why RARP is not a valid approach. Yaron From mst at mellanox.co.il Sun Mar 6 02:38:40 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 6 Mar 2005 12:38:40 +0200 Subject: [PATCH] might_sleep on con_lock (was Re: [openib-general] SDP_CONN_LOCK) Message-ID: <20050306103840.GR26194@mellanox.co.il> Quoting r. Libor Michalek : > Subject: Re: [openib-general] SDP_CONN_LOCK > > > They do implement exclusive access to the socket, but they implement > > > exclusive access from both process and irq context, which is why a > > > semaphore was not used. In interrupt context SDP_CONN_LOCK_BH is used > > > to lock the connection, look in sdp_cq_event_handler() for it's use, > > > and in process context SDP_CONN_LOCK is used. > > > > I dont really understand how it works. > > When an interrupt arrives while users != 0, it seems you are > > calling scheduler(). > > What is sdp_conn_internal_lock doing? I understand it is to be called > > from interrupt context, but how can it call scheduler() then? 
> > SDP_CONN_LOCK and SDP_CONN_UNLOCK are never called from interrupt > context, only from process context. So how about this patch, to make sure we get a stack dump if they are: Signed-off-by: Michael S. Tsirkin Index: sdp_conn.h =================================================================== --- sdp_conn.h (revision 1953) +++ sdp_conn.h (working copy) @@ -477,6 +477,8 @@ static inline void sdp_conn_lock(struct { unsigned long flags; + might_sleep(); + spin_lock_irqsave(&conn->lock.slock, flags); if (conn->lock.users != 0) { -- MST - Michael S. Tsirkin From mst at mellanox.co.il Sun Mar 6 08:06:11 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 6 Mar 2005 18:06:11 +0200 Subject: [openib-general] rev 1954 - new flint uploaded Message-ID: <20050306160611.GX26194@mellanox.co.il> With revision 1954 I have uploaded new flint code, synched with current mellanox code. There are lots of changes, since this adds support for burning more flash types. Unfortunately I was unable to test this code on a big endian machine. I hope nothing was broken. ppc/sparc guys, please test and let me know. thanks, -- MST - Michael S. Tsirkin From mst at mellanox.co.il Sun Mar 6 12:28:45 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 6 Mar 2005 22:28:45 +0200 Subject: [openib-general] [PATCH] disable MSI for AMD-8131 Message-ID: <20050306202845.GE8486@mellanox.co.il> Greg, Martin, The AMD-8131 I/O APIC (device id 1022:7450/7451) does not support message signalled interrupts. Thus, if a device driver attempts to enable msi, it will succeed, but interrupts are not actually delivered to the cpu. The Nforce chipsets do not seem to have this limitation. AMD confirmed that MSI mode is unsupported with this APIC. The following patch adds a flag to pci quirks to detect this and disable msi. Please let me know what you think. Signed-off-by: Michael S. 
Tsirkin diff -rup linux-2.6.11/drivers/pci/msi.c linux-2.6.11-msi/drivers/pci/msi.c --- linux-2.6.11/drivers/pci/msi.c 2005-03-02 09:38:26.000000000 +0200 +++ linux-2.6.11-msi/drivers/pci/msi.c 2005-05-21 23:29:08.000000000 +0300 @@ -20,6 +20,7 @@ #include #include +#include "pci.h" #include "msi.h" static DEFINE_SPINLOCK(msi_lock); @@ -372,6 +373,13 @@ static int msi_init(void) if (!status) return status; + if (pci_msi_quirk) { + pci_msi_enable = 0; + printk(KERN_WARNING "PCI: MSI quirk detected. MSI disabled.\n"); + status = -EINVAL; + return status; + } + if ((status = msi_cache_init()) < 0) { pci_msi_enable = 0; printk(KERN_WARNING "PCI: MSI cache init failed\n"); diff -rup linux-2.6.11/drivers/pci/pci.h linux-2.6.11-msi/drivers/pci/pci.h --- linux-2.6.11/drivers/pci/pci.h 2005-03-02 09:37:55.000000000 +0200 +++ linux-2.6.11-msi/drivers/pci/pci.h 2005-05-21 22:28:21.000000000 +0300 @@ -65,6 +65,7 @@ extern void pci_remove_legacy_files(stru extern spinlock_t pci_bus_lock; extern int pcie_mch_quirk; +extern int pci_msi_quirk; extern struct device_attribute pci_dev_attrs[]; extern struct class_device_attribute class_device_attr_cpuaffinity; diff -rup linux-2.6.11/drivers/pci/quirks.c linux-2.6.11-msi/drivers/pci/quirks.c --- linux-2.6.11/drivers/pci/quirks.c 2005-03-02 09:37:31.000000000 +0200 +++ linux-2.6.11-msi/drivers/pci/quirks.c 2005-05-21 22:35:45.000000000 +0300 @@ -429,6 +429,8 @@ static void __init quirk_ioapic_rmw(stru } DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_SI, PCI_ANY_ID, quirk_ioapic_rmw ); +int pci_msi_quirk; + #define AMD8131_revA0 0x01 #define AMD8131_revB0 0x11 #define AMD8131_MISC 0x40 @@ -437,6 +439,9 @@ static void __init quirk_amd_8131_ioapic { unsigned char revid, tmp; + pci_msi_quirk = 1; + printk(KERN_WARNING "PCI: MSI quirk detected. pci_msi_quirk set.\n"); + if (nr_ioapics == 0) return; From mst at mellanox.co.il Mon Mar 7 07:38:49 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 7 Mar 2005 17:38:49 +0200 Subject: [openib-general] Re: rev 1954 - new flint uploaded In-Reply-To: <20050306160611.GX26194@mellanox.co.il> References: <20050306160611.GX26194@mellanox.co.il> Message-ID: <20050307153849.GI26194@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: rev 1954 - new flint uploaded > > With revision 1954 I have uploaded new flint code, > synched with current mellanox code. I made some last minute fixes and uploaded rev 1957. Enjoy! > There are lots of changes, since this adds support for > burning more flash types. > > Unfortunately I was unable to test this code on > a big endian machine. I hope nothing was broken. > ppc/sparc guys, please test and let me know. -- MST - Michael S. Tsirkin From halr at voltaire.com Mon Mar 7 08:13:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Mar 2005 11:13:45 -0500 Subject: [openib-general] mthca query_device does not fill in struct ib_device_attr Message-ID: <1110212025.4648.38.camel@localhost.localdomain> Hi Roland, It appears that mthca_provider.c::mthca_query_device does not fill in many of the device attributes in struct ib_device_attr. When can the remainder of these be completed ? Thanks. -- Hal From roland at topspin.com Mon Mar 7 08:32:43 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 07 Mar 2005 08:32:43 -0800 Subject: [openib-general] Re: mthca query_device does not fill in struct ib_device_attr In-Reply-To: <1110212025.4648.38.camel@localhost.localdomain> (Hal Rosenstock's message of "07 Mar 2005 11:13:45 -0500") References: <1110212025.4648.38.camel@localhost.localdomain> Message-ID: <52fyz7ju1g.fsf@topspin.com> Hal> It appears that mthca_provider.c::mthca_query_device does not Hal> fill in many of the device attributes in struct Hal> ib_device_attr. When can the remainder of these be completed? Any time... which ones are needed now? - R. 
From halr at voltaire.com Mon Mar 7 08:45:01 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Mar 2005 11:45:01 -0500 Subject: [openib-general] Re: mthca query_device does not fill in struct ib_device_attr In-Reply-To: <52fyz7ju1g.fsf@topspin.com> References: <1110212025.4648.38.camel@localhost.localdomain> <52fyz7ju1g.fsf@topspin.com> Message-ID: <1110213900.4648.79.camel@localhost.localdomain> On Mon, 2005-03-07 at 11:32, Roland Dreier wrote: > Hal> It appears that mthca_provider.c::mthca_query_device does not > Hal> fill in many of the device attributes in struct > Hal> ib_device_attr. When can the remainder of these be completed? > > Any time... which ones are needed now? Certainly the following ones: u64 max_mr_size; int max_qp; int max_qp_wr; int max_sge; int max_cq; int max_cqe; int max_mr; int max_pd; int max_qp_rd_atom; Thanks. -- Hal From roland at topspin.com Mon Mar 7 08:55:18 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 07 Mar 2005 08:55:18 -0800 Subject: [openib-general] Re: mthca query_device does not fill in struct ib_device_attr In-Reply-To: <1110213900.4648.79.camel@localhost.localdomain> (Hal Rosenstock's message of "07 Mar 2005 11:45:01 -0500") References: <1110212025.4648.38.camel@localhost.localdomain> <52fyz7ju1g.fsf@topspin.com> <1110213900.4648.79.camel@localhost.localdomain> Message-ID: <52acpfjszt.fsf@topspin.com> Hal> Certainly the following ones: OK, it won't be hard to fill out those entries. What application is using this info? - R. 
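As a rough illustration of what filling out those entries could look like, here is a self-contained C sketch. The field names in dev_attr_subset follow the list in the message above; the hw_limits structure, its field names, and the choice to subtract reserved resources are assumptions made for the example and are not taken from mthca.

```c
#include <stdint.h>
#include <string.h>

/* The attributes named in the thread; only this subset is modelled. */
struct dev_attr_subset {
    uint64_t max_mr_size;
    int max_qp;
    int max_qp_wr;
    int max_sge;
    int max_cq;
    int max_cqe;
    int max_mr;
    int max_pd;
    int max_qp_rd_atom;
};

/* Hypothetical limits a driver might cache at init time. */
struct hw_limits {
    uint64_t max_mr_size;
    int num_qps, reserved_qps;
    int max_wqes;
    int max_sg;
    int num_cqs, reserved_cqs;
    int max_cqes;
    int num_mrs, reserved_mrs;
    int num_pds, reserved_pds;
    int max_rd_atom;
};

/* Fill the consumer-visible attributes from the cached limits.
 * Zeroing first means any field not set below reads as 0 rather
 * than as stack garbage. */
static void fill_dev_attr(struct dev_attr_subset *attr,
                          const struct hw_limits *lim)
{
    memset(attr, 0, sizeof(*attr));
    attr->max_mr_size    = lim->max_mr_size;
    attr->max_qp         = lim->num_qps - lim->reserved_qps;
    attr->max_qp_wr      = lim->max_wqes;
    attr->max_sge        = lim->max_sg;
    attr->max_cq         = lim->num_cqs - lim->reserved_cqs;
    attr->max_cqe        = lim->max_cqes;
    attr->max_mr         = lim->num_mrs - lim->reserved_mrs;
    attr->max_pd         = lim->num_pds - lim->reserved_pds;
    attr->max_qp_rd_atom = lim->max_rd_atom;
}
```

A consumer such as uDAPL would then read these fields to size its own QPs and CQs instead of guessing.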
From halr at voltaire.com Mon Mar 7 08:59:03 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Mar 2005 11:59:03 -0500 Subject: [openib-general] Re: mthca query_device does not fill in struct ib_device_attr In-Reply-To: <52acpfjszt.fsf@topspin.com> References: <1110212025.4648.38.camel@localhost.localdomain> <52fyz7ju1g.fsf@topspin.com> <1110213900.4648.79.camel@localhost.localdomain> <52acpfjszt.fsf@topspin.com> Message-ID: <1110214736.4648.90.camel@localhost.localdomain> On Mon, 2005-03-07 at 11:55, Roland Dreier wrote: > OK, it won't be hard to fill out those entries. What application is > using this info? uDAPL (and kDAPL) use these device attributes currently. -- Hal From xma at us.ibm.com Mon Mar 7 09:38:32 2005 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 7 Mar 2005 09:38:32 -0800 Subject: [openib-general] mthca drvier on PPC platform Message-ID: I am starting to test mthca driver on PPC64. I want to know whether someone has tested mthca on any PPC platform? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From roland at topspin.com Mon Mar 7 09:41:41 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 07 Mar 2005 09:41:41 -0800 Subject: [openib-general] userspace verbs UD support Message-ID: <52wtsjica2.fsf@topspin.com> I've just committed support for UD address handles to libibverbs and libmthca on the roland-uverbs branch. This makes it possible to use UD from userspace. libibverbs/examples has a new ud-pingpong.c demo program that shows how this works. There are no new changes to the kernel uverbs support on the roland-uverbs branch to handle UD, so any modules built with last week's code should still work. - R. 
From roland at topspin.com Mon Mar 7 09:42:33 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 07 Mar 2005 09:42:33 -0800 Subject: [openib-general] mthca drvier on PPC platform In-Reply-To: (Shirley Ma's message of "Mon, 7 Mar 2005 09:38:32 -0800") References: Message-ID: <52sm37ic8m.fsf@topspin.com> Shirley> I am starting to test mthca driver on PPC64. I want to Shirley> know whether someone has tested mthca on any PPC platform? Yes, I have used it successfully on IBM p630 and JS20 systems. - Roland From krause at cup.hp.com Mon Mar 7 09:48:06 2005 From: krause at cup.hp.com (Michael Krause) Date: Mon, 07 Mar 2005 09:48:06 -0800 Subject: [openib-general] IB Address Translation service In-Reply-To: <1110039458.4648.727.camel@localhost.localdomain> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <20050305020402.GA3297@greglaptop.internal.keyresearch.com> <4229CEA0.7060904@sun.com> <1110039458.4648.727.camel@localhost.localdomain> Message-ID: <6.2.0.14.2.20050307094517.020916f8@esmail.cup.hp.com> Just to make this clear: - There are only two QP that are defined with specific intention - QP0 and QP1. All other QP may vary throughout the entire QP space. - All ULP built on top of IB must assume that the QP are variant and must discover these through various protocol such as the service ID protocol or for IPoIB, the ARP / ND exchange. - Multiple QP may be used for a given service allowing both finer grain partitioning as well as scaling opportunities. So, this isn't something open to debate. It is how we designed the technology to allow flexibility and performance. Mike At 08:17 AM 3/5/2005, Hal Rosenstock wrote: >On Sat, 2005-03-05 at 10:22, David M. Brean wrote: > > There is an I-D for DHCP on IB. IPoIB defines a "broadcast" address and > > DHCP (and ARP) on IB use it. 
Could make RARP work using this mechanism, > > but as someone else pointed out, the IB hardware address contains a > > QPN. The I-D for IPoIB says something like: > > > > The link-layer address for IPoIB includes the QPN which might not be > > constant across reboots or even across network interface resets. > > Cached QPN entries, such as in static ARP entries or in RARP servers > > will only work if the implementation(s) using these options ensure > > that the QPN associated with an interface is invariant across > > reboots/network resets. > >That may be the requirement but I think there are some issues with >keeping the QPN invariant. Quoting Dror Goldenberg >(http://openib.org/pipermail/openib-general/2004-November/006765.html): >"Assigning specific QPN for ipoib requires allocation of QPN space which >is beyond IB spec verbs. Current verbs do not allow it. I don't have any >objection for that, except that you have to hold a set of preallocated >QPs with specific numbers and hand them over to privileged consumer when >requested to. I wouldn't commit that it will work on any HCA >architecture." > >-- Hal > > > > > > So, there are requirements on the IPoIB implementation to make RARP > > work. Folks in the IPoIB work group decided not to go much further than > > these statements for RARP support since most folks felt that DHCP is (de > > facto) replacement. 
> >
> > -David
> >
> > >-- greg
> > >
> > >_______________________________________________
> > >openib-general mailing list
> > >openib-general at openib.org
> > >http://openib.org/mailman/listinfo/openib-general
> > >
> > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From mshefty at ichips.intel.com Mon Mar 7 11:14:07 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 07 Mar 2005 11:14:07 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <52sm3bl54u.fsf@topspin.com>
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com>
Message-ID: <422CA7FF.1070506@ichips.intel.com>

Roland Dreier wrote:
>  1. Don't worry about checking.  There's nothing too evil a CM user
>     can do with a QP beyond getting another QP to connect to it, since
>     the CM user can't modify a QP unless it legitimately owns it.  And
>     an evil user can always guess the QPN instead of the QP handle anyway.
>
>  2. Change the CM API so that it just takes the QPN, QP type, SRQ
>     status and device directly rather than reading it out of the QP.
>     This lets the userspace CM just get the info from userspace
>     without needing to look at the QP at all.
Of course it does raise
>     the issue of how userspace should specify the device.
>
>  3. Merge the userspace CM into userspace verbs support so they use
>     the same context.  Ugh.
>
> Personally I would lean slightly towards #2, since it feels to me like
> even the kernel CM API would be cleaner that way.  However I don't
> have a good answer for how userspace should specify which device to
> use.

Which of these options is being used?  It seems like option #2 would
work as long as there's a way to locate a device based on a GID.

- Sean

From libor at topspin.com Mon Mar 7 12:33:46 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 7 Mar 2005 12:33:46 -0800
Subject: [openib-general] Re: [PATCH][SDP] Make sdp compile on 2.6.11
In-Reply-To: <1109787094.4913.7.camel@duffman>; from tduffy@sun.com on Wed, Mar 02, 2005 at 10:11:34AM -0800
References: <1109787094.4913.7.camel@duffman>
Message-ID: <20050307123346.A27729@topspin.com>

On Wed, Mar 02, 2005 at 10:11:34AM -0800, Tom Duffy wrote:
> Now that 2.6.11 is out, need to make sdp compile with 2.6.11.
>

Thanks Tom, applied and committed.

-Libor

From libor at topspin.com Mon Mar 7 13:20:12 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 7 Mar 2005 13:20:12 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <422CA7FF.1070506@ichips.intel.com>; from mshefty@ichips.intel.com on Mon, Mar 07, 2005 at 11:14:07AM -0800
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <422CA7FF.1070506@ichips.intel.com>
Message-ID: <20050307132012.B27729@topspin.com>

On Mon, Mar 07, 2005 at 11:14:07AM -0800, Sean Hefty wrote:
> Roland Dreier wrote:
> > 1. Don't worry about checking.
There's nothing too evil a CM user > > can do with a QP beyond getting another QP to connect to it, since > > the CM user can't modify a QP unless it legitimately owns it. And > > an evil user can always guess the QPN instead of the QP handle anyway. > > > > 2. Change the CM API so that it just takes the QPN, QP type, SRQ > > status and device directly rather than reading it out of the QP. > > This lets the userspace CM just get the info from userspace > > without needing to look at the QP at all. Of course it does raise > > the issue of how userspace should specify the device. > > > > 3. Merge the userspace CM into userspace verbs support so they use > > the same context. Ugh. > > > > Personally I would lean slightly towards #2, since it feels to me like > > even the kernel CM API would be cleaner that way. However I don't > > have a good answer for how userspace should specify which device to > > use. > > Which of these options is being used? It seems like option #2 would > work as long as there's a way to locate a device based on a GID. Sean, I'm not sure there's an easy way to perform a port source GID to device lookup, did you have something specific in mind for the lookup? Unless there's an easy way to do this, I was going to go ahead with #1... -Libor From libor at topspin.com Mon Mar 7 13:23:02 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 7 Mar 2005 13:23:02 -0800 Subject: [PATCH] might_sleep on con_lock (was Re: [openib-general] SDP_CONN_LOCK) In-Reply-To: <20050306103840.GR26194@mellanox.co.il>; from mst@mellanox.co.il on Sun, Mar 06, 2005 at 12:38:40PM +0200 References: <20050306103840.GR26194@mellanox.co.il> Message-ID: <20050307132302.C27729@topspin.com> On Sun, Mar 06, 2005 at 12:38:40PM +0200, Michael S. Tsirkin wrote: > Quoting r. 
Libor Michalek : > > Subject: Re: [openib-general] SDP_CONN_LOCK > > > > They do implement exclusive access to the socket, but they implement > > > > exclusive access from both process and irq context, which is why a > > > > semaphore was not used. In interrupt context SDP_CONN_LOCK_BH is used > > > > to lock the connection, look in sdp_cq_event_handler() for it's use, > > > > and in process context SDP_CONN_LOCK is used. > > > > > > I dont really understand how it works. > > > When an interrupt arrives while users != 0, it seems you are > > > calling scheduler(). > > > What is sdp_conn_internal_lock doing? I understand it is to be called > > > from interrupt context, but how can it call scheduler() then? > > > > SDP_CONN_LOCK and SDP_CONN_UNLOCK are never called from interrupt > > context, only from process context. > > So how about this patch, to make sure we get a stack dump if they are: Seems reasonable, I've applied and committed the patch. Thanks Michael. -Libor From mshefty at ichips.intel.com Mon Mar 7 13:26:19 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 07 Mar 2005 13:26:19 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <20050307132012.B27729@topspin.com> References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <422CA7FF.1070506@ichips.intel.com> <20050307132012.B27729@topspin.com> Message-ID: <422CC6FB.9070206@ichips.intel.com> Libor Michalek wrote: > Sean, I'm not sure there's an easy way to perform a port source GID > to device lookup, did you have something specific in mind for the > lookup? Unless there's an easy way to do this, I was going to go ahead > with #1... There's a device_list maintained in device.c that's used when ib_register_client() is called to report all available devices to a client. 
My thinking was to make this list available to cache.c for use calling a function such as ib_get_cached_gid(). - Sean From libor at topspin.com Mon Mar 7 15:11:31 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 7 Mar 2005 15:11:31 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <422CC6FB.9070206@ichips.intel.com>; from mshefty@ichips.intel.com on Mon, Mar 07, 2005 at 01:26:19PM -0800 References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <422CA7FF.1070506@ichips.intel.com> <20050307132012.B27729@topspin.com> <422CC6FB.9070206@ichips.intel.com> Message-ID: <20050307151131.D27729@topspin.com> On Mon, Mar 07, 2005 at 01:26:19PM -0800, Sean Hefty wrote: > Libor Michalek wrote: > > Sean, I'm not sure there's an easy way to perform a port source GID > > to device lookup, did you have something specific in mind for the > > lookup? Unless there's an easy way to do this, I was going to go ahead > > with #1... > > There's a device_list maintained in device.c that's used when > ib_register_client() is called to report all available devices to a > client. My thinking was to make this list available to cache.c for use > calling a function such as ib_get_cached_gid(). OK, that does make sense. There's no other reason to stick with solution #1, since getting the device was the only remaining reason to use the QP. So I'm in favor of going with solution #2 and passing in the necessary values directly. This solution also decreases the number of module dependencies, which is always nice. 
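
[Editorial sketch, not part of the original message.] Solution #2 above can be pictured as a request-parameter structure that carries the QPN, QP type and SRQ flag as plain values, so the CM never has to dereference a QP. The sketch below is a userspace illustration under stated assumptions: every `demo_` name is hypothetical, the 92-byte private-data limit is used only as a placeholder bound, and the 24-bit QPN check is an extra sanity test added for the example rather than something taken from the thread.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; not the real ib_cm API. */
enum demo_qp_type { DEMO_QPT_RC, DEMO_QPT_UC, DEMO_QPT_UD };

struct demo_path_rec {
	uint16_t slid;
	uint16_t dlid;
};

/* Placeholder bound on REQ private data (assumed value for this sketch). */
#define DEMO_REQ_PRIVATE_DATA_SIZE 92

/* Request parameters hold plain values instead of a QP pointer. */
struct demo_req_param {
	const struct demo_path_rec *primary_path;
	uint32_t qp_num;            /* QPNs are 24 bits wide */
	enum demo_qp_type qp_type;
	bool srq;                   /* true if the QP uses a shared receive queue */
	const void *private_data;
	uint8_t private_data_len;
};

/* Validate the parameters without ever touching a QP structure. */
int demo_validate_req_param(const struct demo_req_param *param)
{
	if (!param->primary_path)
		return -EINVAL;

	/* Only connected QP types make sense for a connection request. */
	if (param->qp_type != DEMO_QPT_RC && param->qp_type != DEMO_QPT_UC)
		return -EINVAL;

	/* Extra check added for this sketch: reject QPNs wider than 24 bits. */
	if (param->qp_num > 0xFFFFFF)
		return -EINVAL;

	if (param->private_data &&
	    param->private_data_len > DEMO_REQ_PRIVATE_DATA_SIZE)
		return -EINVAL;

	return 0;
}
```

Because the caller supplies the values itself, a userspace CM could fill this structure from data it already holds, which is the reduction in module dependencies mentioned above.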
-Libor

From timur.tabi at ammasso.com Mon Mar 7 15:17:11 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 07 Mar 2005 17:17:11 -0600
Subject: [openib-general] Getting the code and locking user space memory regions
Message-ID: <422CE0F7.7060000@ammasso.com>

Hi,

A long time ago, the openib driver used a hack to call sys_mlock() to
lock down a user space memory region.  This was because get_user_pages()
wasn't completely locking the region like it was supposed to.

I haven't paid much attention to the openib stuff since then, but now I
want to know what the current development status is.  I know that a
driver is part of the 2.6.11 kernel, but that driver doesn't have any
user-mode support in it.  I tried to download the latest code from
openib.org, but all I could find was a web interface to "Subversion".
Obviously, this is too cumbersome for downloading everything, so is
there another way to get all the code?

Also, I'd like to know what the current code does to lock memory
regions.  Does the driver still call sys_mlock()?  Has get_user_pages()
been fixed (my tests show it hasn't)?  Is there another technique used?

--
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

From halr at voltaire.com Tue Mar 8 06:10:19 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Mar 2005 09:10:19 -0500
Subject: [openib-general] Re: Incorrect endian in GUID comparison/SM master selection
In-Reply-To: <1109984495.62825.4849.camel@bengbsd.isilon.com>
References: <1108607846.27002.113.camel@bengbsd.isilon.com> <1109984495.62825.4849.camel@bengbsd.isilon.com>
Message-ID: <1110291019.4650.796.camel@localhost.localdomain>

Hi Brian,

On Fri, 2005-03-04 at 20:01, Brian Eng wrote:
> I found another place in OpenSM where it compares two SM's.  I suggest
> the following both to fix it and to form a common comparison routine:

I had a couple of problems with this patch but worked around them
manually.  Please review the changes.
In the future, please make sure the patch is preformatted.  Also, not
sure why the line numbers were so different.

Thanks.  Applied.

-- Hal

From mst at mellanox.co.il Tue Mar 8 07:52:16 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 Mar 2005 17:52:16 +0200
Subject: [openib-general] [PATCH] rq formatting for arbel-native
Message-ID: <20050308155216.GG26194@mellanox.co.il>

For arbel native, the ee_nds field in the receive queue must be computed
from the rq max_gs value (plus header), rather than from the wqe shift
value.  This differs from the current documentation, which will be
updated.

This patch is required to make memfree work on MT25204.

Signed-off-by: Michael S. Tsirkin

Index: hw/mthca/mthca_qp.c
===================================================================
--- hw/mthca/mthca_qp.c	(revision 1943)
+++ hw/mthca/mthca_qp.c	(working copy)
@@ -1119,10 +1119,15 @@ static int mthca_alloc_qp_common(struct

 	if (dev->hca_type == ARBEL_NATIVE) {
 		for (i = 0; i < qp->rq.max; ++i) {
+			int size;
 			wqe = get_recv_wqe(qp, i);
 			wqe->nda_op = cpu_to_be32(((i + 1) & (qp->rq.max - 1)) <<
 						  qp->rq.wqe_shift);
-			wqe->ee_nds = cpu_to_be32(1 << (qp->rq.wqe_shift - 4));
+
+			size = sizeof (struct mthca_next_seg) +
+				qp->rq.max_gs * sizeof (struct mthca_data_seg);
+
+			wqe->ee_nds = cpu_to_be32(size / 16);
 		}

 		for (i = 0; i < qp->sq.max; ++i) {

--
MST - Michael S. Tsirkin

From mst at mellanox.co.il Tue Mar 8 07:58:02 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 Mar 2005 17:58:02 +0200
Subject: [openib-general] [PATCH] register mthca for sinai device id
Message-ID: <20050308155802.GH26194@mellanox.co.il>

Now that memfree support is merged to trunk, register mthca for MT25204
device ids.  Use numeric values to make it work on 2.6.11, until the
symbolic names make it upstream.

With this and the previous patch in place, IP over IB now seems to work
for me on MT25204.

Signed-off-by: Michael S.
Tsirkin Index: mthca_main.c =================================================================== --- mthca_main.c (revision 1964) +++ mthca_main.c (working copy) @@ -1094,6 +1094,10 @@ static struct pci_device_id mthca_pci_ta .driver_data = ARBEL_NATIVE }, { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, 0x5e8c), /* Sinai old */ + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, 0x6274), /* Sinai */ + .driver_data = ARBEL_NATIVE }, { 0, } }; -- MST - Michael S. Tsirkin From mshefty at ichips.intel.com Tue Mar 8 10:21:11 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Mar 2005 10:21:11 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <20050307151131.D27729@topspin.com> References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <422CA7FF.1070506@ichips.intel.com> <20050307132012.B27729@topspin.com> <422CC6FB.9070206@ichips.intel.com> <20050307151131.D27729@topspin.com> Message-ID: <422DED17.5070709@ichips.intel.com> Libor Michalek wrote: >>There's a device_list maintained in device.c that's used when >>ib_register_client() is called to report all available devices to a >>client. My thinking was to make this list available to cache.c for use >>calling a function such as ib_get_cached_gid(). > > > OK, that does make sense. There's no other reason to stick with > solution #1, since getting the device was the only remaining reason > to use the QP. So I'm in favor of going with solution #2 and passing > in the necessary values directly. This solution also decreases the > number of module dependencies, which is always nice. Roland, do you have a preference for exposing the device_list and device_sem in device.c? 
I can put ib_get_cached_gid() directly in device.c, but that separates the caching functions. I could also change cache.c to maintain its own device list, which encapsulates the changes more, but duplicates the list. - Sean From mshefty at ichips.intel.com Tue Mar 8 10:51:49 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Mar 2005 10:51:49 -0800 Subject: [openib-general] Re: [RFC] userspace CM/verbs QP In-Reply-To: <422DED17.5070709@ichips.intel.com> References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <422CA7FF.1070506@ichips.intel.com> <20050307132012.B27729@topspin.com> <422CC6FB.9070206@ichips.intel.com> <20050307151131.D27729@topspin.com> <422DED17.5070709@ichips.intel.com> Message-ID: <422DF445.1070806@ichips.intel.com> Sean Hefty wrote: > Roland, do you have a preference for exposing the device_list and > device_sem in device.c? I can put ib_get_cached_gid() directly in > device.c, but that separates the caching functions. I could also change > cache.c to maintain its own device list, which encapsulates the changes > more, but duplicates the list. Uhm... thinking about this more, I think that trying to provide this functionality from the cache exposes the potential for a client to access a device after its removal... We can still make the changes to the CM API; it just requires that the CM maintain a list of devices that it can access. - Sean From halr at voltaire.com Tue Mar 8 11:45:37 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Mar 2005 14:45:37 -0500 Subject: [openib-general] A Couple of CM Questions Message-ID: <1110311137.4645.28.camel@localhost.localdomain> Hi Sean, My main question has to do with an error path in cm_req_handler. 
If cm_init_av fails (lines 1098 or 1103), I get the following crash:

Mar 8 14:19:04 localhost kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Mar 8 14:19:04 localhost kernel: printing eip:
Mar 8 14:19:04 localhost kernel: d09db042
Mar 8 14:19:04 localhost kernel: *pde = 0ba1d067
Mar 8 14:19:04 localhost kernel: *pte = 00000000
Mar 8 14:19:04 localhost kernel: Oops: 0000 [#1]
Mar 8 14:19:04 localhost kernel: Modules linked in: ib_cm ib_umad ide_cd cdrom lp ipv6 autofs parport_pc parport uhci_hcd ehci_hcd ib_mthca ib_mad ib_core ohci_hcd eepro100 mii evdev usbcore
Mar 8 14:19:04 localhost kernel: CPU: 0
Mar 8 14:19:04 localhost kernel: EIP: 0060:[] Tainted: P VLI
Mar 8 14:19:04 localhost kernel: EFLAGS: 00010286 (2.6.10)
Mar 8 14:19:04 localhost kernel: EIP is at cm_alloc_msg+0x42/0x100 [ib_cm]
Mar 8 14:19:04 localhost kernel: eax: 00000000 ebx: cf641800 ecx: 00000000 edx: cfffa340
Mar 8 14:19:04 localhost kernel: esi: c1ca9400 edi: cf641958 ebp: 00000000 esp: c30d5e38
Mar 8 14:19:04 localhost kernel: ds: 007b es: 007b ss: 0068
Mar 8 14:19:04 localhost kernel: Process ib_cm/0 (pid: 4948, threadinfo=c30d4000 task=c2863aa0)
Mar 8 14:19:04 localhost kernel: Stack: cffff560 000000d0 00000028 00000000 c1ca9400 00000000 00000004 d09e0330
Mar 8 14:19:04 localhost kernel: c1ca9400 c30d5e80 33215650 000040a9 0000407e 00000296 ffffffc2 00000282
Mar 8 14:19:04 localhost kernel: 0400407e 000040a9 00000246 00000292 c1ca9400 ffffffea c1c90ea8 d09dc531
Mar 8 14:19:04 localhost kernel: Call Trace:
Mar 8 14:19:04 localhost kernel: [] ib_send_cm_rej+0x70/0x2d0 [ib_cm]
Mar 8 14:19:04 localhost kernel: [] ib_destroy_cm_id+0x3b1/0x780 [ib_cm]
Mar 8 14:19:04 localhost kernel: [] rb_erase+0x4b/0xf0
Mar 8 14:19:04 localhost kernel: [] cm_req_handler+0x16c/0x780 [ib_cm]
Mar 8 14:19:04 localhost kernel: [] cm_work_handler+0x0/0x130 [ib_cm]
Mar 8 14:19:04 localhost kernel: [] cm_work_handler+0x32/0x130 [ib_cm]
Mar 8 14:19:04 localhost kernel: [] worker_thread+0x251/0x470
Mar 8 14:19:04 localhost kernel: [] default_wake_function+0x0/0x20
Mar 8 14:19:04 localhost kernel: [] default_wake_function+0x0/0x20
Mar 8 14:19:04 localhost kernel: [] worker_thread+0x0/0x470
Mar 8 14:19:04 localhost kernel: [] kthread+0xaa/0xb0
Mar 8 14:19:04 localhost kernel: [] kthread+0x0/0xb0
Mar 8 14:19:04 localhost kernel: [] kernel_thread_helper+0x5/0x10
Mar 8 14:19:04 localhost kernel: Code: 74 24 20 89 04 24 e8 be 05 77 ef 89 c3 b8 f4 ff ff ff 85 db 0f 84 b5 00 00 00 b9 56 00 00 00 89 df 89 e8 f3 ab 8b 86 8c 00 00 00 <8b> 10 8d 86 a0 00 00 00 89 44 24 04 8b 42 04 8b 40 04 89 04 24

Also, it appears to me that the comm IDs in the CM messages are not
endianized on the IB "wire".  This causes no issue with interoperability
but is slightly less clean to look at.

Thanks for your help with this.

-- Hal

From mshefty at ichips.intel.com Tue Mar 8 12:10:28 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 08 Mar 2005 12:10:28 -0800
Subject: [openib-general] Re: A Couple of CM Questions
In-Reply-To: <1110311137.4645.28.camel@localhost.localdomain>
References: <1110311137.4645.28.camel@localhost.localdomain>
Message-ID: <422E06B4.8010608@ichips.intel.com>

Hal Rosenstock wrote:
> My main question has to do with an error path in cm_req_handler.  If
> cm_init_av fails (lines 1098 or 1103), I get the following crash:

I think I see what's happening here.  Since the cm_init_av fails, the
cm_id doesn't have the information that it needs in order to send back
any sort of reply (including a REJ message) to the sender of the REQ.  A
quick fix would be to not send the REJ when destroying the cm_id.

I'm not sure what the better fix would be at the moment.  The CM tries
to send replies (including REJ messages) to a received MAD using the
path record stored in the REQ.  This could be changed to use the path of
the received MAD instead.
But a failure could still occur, so the
destruction of the cm_id in the REQ_RCVD state needs some additional
error handling.

I will queue up trying to get a fix for this after I finish the
modifications for the CM to support user-space.  Will that work okay?

> Also, it appears to me that the comm IDs in the CM messages are not
> endianized on the IB "wire".  This causes no issue with interoperability
> but is slightly less clean to look at.

I didn't swap them simply because they didn't need to be.

- Sean

From roland at topspin.com Tue Mar 8 12:13:03 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 08 Mar 2005 12:13:03 -0800
Subject: [openib-general] Re: [RFC] userspace CM/verbs QP
In-Reply-To: <422DED17.5070709@ichips.intel.com> (Sean Hefty's message of "Tue, 08 Mar 2005 10:21:11 -0800")
References: <20050303162107.A18428@topspin.com> <52r7iwnt46.fsf@topspin.com> <20050303165705.B18428@topspin.com> <52zmxkmamr.fsf@topspin.com> <20050303182121.C18428@topspin.com> <52vf88m7zq.fsf@topspin.com> <52sm3bl54u.fsf@topspin.com> <422CA7FF.1070506@ichips.intel.com> <20050307132012.B27729@topspin.com> <422CC6FB.9070206@ichips.intel.com> <20050307151131.D27729@topspin.com> <422DED17.5070709@ichips.intel.com>
Message-ID: <52acpdhp68.fsf@topspin.com>

    Sean> Roland, do you have a preference for exposing the
    Sean> device_list and device_sem in device.c?  I can put
    Sean> ib_get_cached_gid() directly in device.c, but that separates
    Sean> the caching functions.  I could also change cache.c to
    Sean> maintain its own device list, which encapsulates the changes
    Sean> more, but duplicates the list.

I think it makes sense to expose the list and sem to cache.c.  Of
course we should change the names so that they're less generic (i.e.
add an "ib_" prefix) to avoid clashes if someone builds IB support into
a monolithic kernel.  Also it probably makes sense to turn device_sem
into an rwsem and use down_read() in cache.c to allow concurrent
lookups by the CM.

 - R.
From halr at voltaire.com Tue Mar 8 12:11:58 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Mar 2005 15:11:58 -0500 Subject: [openib-general] Re: A Couple of CM Questions In-Reply-To: <422E06B4.8010608@ichips.intel.com> References: <1110311137.4645.28.camel@localhost.localdomain> <422E06B4.8010608@ichips.intel.com> Message-ID: <1110312718.4648.8.camel@localhost.localdomain> On Tue, 2005-03-08 at 15:10, Sean Hefty wrote: > I will queue up trying to get a fix for this after I finish the > modifications for the CM to support user-space. Will that work okay? Sure. I can get back to making sure this error path works. Thanks. -- Hal From roland at topspin.com Tue Mar 8 12:43:04 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 08 Mar 2005 12:43:04 -0800 Subject: [openib-general] [PATCH] rq formatting for arbel-native In-Reply-To: <20050308155216.GG26194@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 8 Mar 2005 17:52:16 +0200") References: <20050308155216.GG26194@mellanox.co.il> Message-ID: <526501hns7.fsf@topspin.com> Thanks... as soon as I have some Sinai HCAs to test with I'll roll Sinai support into mthca. - R. From timur.tabi at ammasso.com Tue Mar 8 15:12:27 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Tue, 08 Mar 2005 17:12:27 -0600 Subject: [openib-general] http://openib.org/downloads/ Message-ID: <422E315B.8010708@ammasso.com> Why is this directory empty? I'm trying to download all the openib code (or at least, all the driver code), but I can't find any tarballs. Can anyone tell me where I can download the OpenIB software? 
--
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

From libor at topspin.com Tue Mar 8 15:28:08 2005
From: libor at topspin.com (Libor Michalek)
Date: Tue, 8 Mar 2005 15:28:08 -0800
Subject: [openib-general] http://openib.org/downloads/
In-Reply-To: <422E315B.8010708@ammasso.com>; from timur.tabi@ammasso.com on Tue, Mar 08, 2005 at 05:12:27PM -0600
References: <422E315B.8010708@ammasso.com>
Message-ID: <20050308152808.B28988@topspin.com>

On Tue, Mar 08, 2005 at 05:12:27PM -0600, Timur Tabi wrote:
> Why is this directory empty?  I'm trying to download all the openib code
> (or at least, all the driver code), but I can't find any tarballs.  Can
> anyone tell me where I can download the OpenIB software?

  If you want code which is newer than what is in 2.6.11 then you'll
need to check it out from the subversion repository.  For the head of
tree kernel code try this:

  svn co https://openib.org/svn/gen2/trunk/src/linux-kernel

-Libor

From mlleinin at hpcn.ca.sandia.gov Tue Mar 8 15:31:48 2005
From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger)
Date: Tue, 08 Mar 2005 15:31:48 -0800
Subject: [openib-general] http://openib.org/downloads/
In-Reply-To: <422E315B.8010708@ammasso.com>
References: <422E315B.8010708@ammasso.com>
Message-ID: <1110324708.8595.223.camel@localhost>

On Tue, 2005-03-08 at 17:12 -0600, Timur Tabi wrote:
> Why is this directory empty?  I'm trying to download all the openib code
> (or at least, all the driver code), but I can't find any tarballs.  Can
> anyone tell me where I can download the OpenIB software?
>

  You can grab the openib source code from the subversion repository.
See http://www.openib.org/tools.html.  If you want everything run

  'svn co https://openib.org/svn'

  Most of the work to date has been for kernel-space IB support (now in
the 2.6.11 kernel).
At some point, in the near future, the user-space support will be stable/tested enough that we _may_ start posting tar files, but until then subversion checkout is the best way to get the source. - Matt From mshefty at ichips.intel.com Tue Mar 8 15:59:25 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 8 Mar 2005 15:59:25 -0800 Subject: [openib-general] [PATCH] [CM] Replace QP pointer with necessary values only Message-ID: <20050308155925.43eee30d.mshefty@ichips.intel.com> This patch modifies the CM API to take the necessary QP values, rather than the QP pointer itself. This should simplify the implementation of the usermode CM. I did _not_ change the device caching information for this until I can convince myself that clients should be able to gain access to the device structure in this fashion. As a side effect of this change, SIDR now works in theory. The CM continues to try to send MADs using the same path as that provided in the CM REQ message. Signed-off-by: Sean Hefty Index: infiniband/core/cm.c =================================================================== --- infiniband/core/cm.c (revision 1964) +++ infiniband/core/cm.c (working copy) @@ -61,6 +61,8 @@ static struct ib_client cm_client = { static struct ib_cm { spinlock_t lock; + struct list_head device_list; + rwlock_t device_lock; struct rb_root listen_service_table; /* struct rb_root peer_service_table; todo: fix peer to peer */ struct rb_root remote_qp_table; @@ -71,13 +73,19 @@ static struct ib_cm { } cm; struct cm_port { + struct cm_device *cm_dev; struct ib_mad_agent *mad_agent; - u64 ca_guid; - spinlock_t lock; struct ib_mr *mr; u8 port_num; }; +struct cm_device { + struct list_head list; + struct ib_device *device; + u64 ca_guid; + struct cm_port port[0]; +}; + struct cm_msg { struct cm_id_private *cm_id_priv; struct ib_send_wr send_wr; @@ -214,48 +222,6 @@ static void cm_free_msg(struct cm_msg *m kfree(msg); } -static struct cm_port * cm_find_port(struct ib_device *device, - 
union ib_gid *gid) -{ - struct cm_port *port; - int ret; - u8 p; - - port = (struct cm_port *)ib_get_client_data(device, &cm_client); - if (!port) - return NULL; - - ret = ib_find_cached_gid(device, gid, &p, NULL); - if (ret) - port = NULL; - else - port = &port[p-1]; - - return port; -} - -static int cm_find_device(union ib_gid *gid, struct ib_device **device, - struct cm_port **port) -{ - int ret; - u8 p; - - /* todo: (high priority if SIDR is needed, low otherwise) - write me - need call in ib_cache_* stuff? */ - /* see static device_list in device.c */ - /* ret = ib_find_cached_device_gid(gid, device, &p, NULL); */ - ret = -EINVAL; - if (ret) - return ret; - - *port = (struct cm_port *)ib_get_client_data(*device, &cm_client); - if (!*port) - return -EINVAL; - - *port = &(*port[p-1]); - return 0; -} - static void cm_set_ah_attr(struct ib_ah_attr *ah_attr, u8 port_num, u16 dlid, u8 sl, u16 src_path_bits) { @@ -266,20 +232,33 @@ static void cm_set_ah_attr(struct ib_ah_ ah_attr->port_num = port_num; } -static int cm_init_av(struct ib_device *device, struct ib_sa_path_rec *path, - struct cm_av *av) +static int cm_init_av(struct ib_sa_path_rec *path, struct cm_av *av) { + struct cm_device *cm_dev; + struct cm_port *port = NULL; + unsigned long flags; int ret; + u8 p; + + read_lock_irqsave(&cm.device_lock, flags); + list_for_each_entry(cm_dev, &cm.device_list, list) { + if (!ib_find_cached_gid(cm_dev->device, &path->sgid, + &p, NULL)) { + port = &cm_dev->port[p-1]; + break; + } + } + read_unlock_irqrestore(&cm.device_lock, flags); - av->port = cm_find_port(device, &path->sgid); - if (!av->port) + if (!port) return -EINVAL; - ret = ib_find_cached_pkey(device, av->port->port_num, path->pkey, - &av->pkey_index); + ret = ib_find_cached_pkey(cm_dev->device, port->port_num, + be16_to_cpu(path->pkey), &av->pkey_index); if (ret) return ret; + av->port = port; av->dgid = path->dgid; cm_set_ah_attr(&av->ah_attr, av->port->port_num, path->dlid, path->sl, path->slid & 0x7F); @@ 
-660,8 +639,9 @@ retest: case IB_CM_MRA_REP_RCVD: spin_unlock_irqrestore(&cm_id_priv->lock, flags); ib_send_cm_rej(cm_id, IB_CM_REJ_TIMEOUT, - &cm_id_priv->av.port->ca_guid, - sizeof &cm_id_priv->av.port->ca_guid, NULL, 0); + &cm_id_priv->av.port->cm_dev->ca_guid, + sizeof &cm_id_priv->av.port->cm_dev->ca_guid, + NULL, 0); break; case IB_CM_ESTABLISHED: spin_unlock_irqrestore(&cm_id_priv->lock, flags); @@ -744,13 +724,13 @@ static void cm_format_req(struct cm_req_ req_msg->local_comm_id = cm_id_priv->id.local_id; req_msg->service_id = param->service_id; - req_msg->local_ca_guid = cm_id_priv->av.port->ca_guid; - cm_req_set_local_qpn(req_msg, cpu_to_be32(param->qp->qp_num)); + req_msg->local_ca_guid = cm_id_priv->av.port->cm_dev->ca_guid; + cm_req_set_local_qpn(req_msg, cpu_to_be32(param->qp_num)); cm_req_set_resp_res(req_msg, param->responder_resources); cm_req_set_init_depth(req_msg, param->initiator_depth); cm_req_set_remote_resp_timeout(req_msg, param->remote_cm_response_timeout); - cm_req_set_qp_type(req_msg, param->qp->qp_type); + cm_req_set_qp_type(req_msg, param->qp_type); cm_req_set_flow_ctrl(req_msg, param->flow_control); cm_req_set_starting_psn(req_msg, cpu_to_be32(param->starting_psn)); cm_req_set_local_resp_timeout(req_msg, @@ -760,7 +740,7 @@ static void cm_format_req(struct cm_req_ cm_req_set_path_mtu(req_msg, param->primary_path->mtu); cm_req_set_rnr_retry_count(req_msg, param->rnr_retry_count); cm_req_set_max_cm_retries(req_msg, param->max_cm_retries); - cm_req_set_srq(req_msg, (param->qp->srq != NULL)); + cm_req_set_srq(req_msg, param->srq); req_msg->primary_local_lid = param->primary_path->slid; req_msg->primary_remote_lid = param->primary_path->dlid; @@ -798,10 +778,10 @@ static void cm_format_req(struct cm_req_ static inline int cm_validate_req_param(struct ib_cm_req_param *param) { - if (!param->qp || !param->primary_path) + if (!param->primary_path) return -EINVAL; - if (param->qp->qp_type != IB_QPT_RC && param->qp->qp_type != IB_QPT_UC) + if 
(param->qp_type != IB_QPT_RC && param->qp_type != IB_QPT_UC) return -EINVAL; if (param->private_data && @@ -839,13 +819,11 @@ int ib_send_cm_req(struct ib_cm_id *cm_i } spin_unlock_irqrestore(&cm_id_priv->lock, flags); - ret = cm_init_av(param->qp->device, param->primary_path, - &cm_id_priv->av); + ret = cm_init_av(param->primary_path, &cm_id_priv->av); if (ret) goto out; if (param->alternate_path) { - ret = cm_init_av(param->qp->device, param->alternate_path, - &cm_id_priv->alt_av); + ret = cm_init_av(param->alternate_path, &cm_id_priv->alt_av); if (ret) goto out; } @@ -1078,13 +1056,11 @@ static int cm_req_handler(struct cm_work cm_id_priv->id.service_mask = ~0ULL; cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]); - ret = cm_init_av(work->port->mad_agent->device, &work->path[0], - &cm_id_priv->av); + ret = cm_init_av(&work->path[0], &cm_id_priv->av); if (ret) goto error3; if (req_msg->alt_local_lid) { - ret = cm_init_av(work->port->mad_agent->device, &work->path[1], - &cm_id_priv->alt_av); + ret = cm_init_av(&work->path[1], &cm_id_priv->alt_av); if (ret) goto error3; } @@ -1124,7 +1100,7 @@ static void cm_format_rep(struct cm_rep_ rep_msg->local_comm_id = cm_id_priv->id.local_id; rep_msg->remote_comm_id = cm_id_priv->id.remote_id; - cm_rep_set_local_qpn(rep_msg, cpu_to_be32(param->qp->qp_num)); + cm_rep_set_local_qpn(rep_msg, cpu_to_be32(param->qp_num)); cm_rep_set_starting_psn(rep_msg, cpu_to_be32(param->starting_psn)); rep_msg->resp_resources = param->responder_resources; rep_msg->initiator_depth = param->initiator_depth; @@ -1132,29 +1108,14 @@ static void cm_format_rep(struct cm_rep_ cm_rep_set_failover(rep_msg, param->failover_accepted); cm_rep_set_flow_ctrl(rep_msg, param->flow_control); cm_rep_set_rnr_retry_count(rep_msg, param->rnr_retry_count); - cm_rep_set_srq(rep_msg, (param->qp->srq != NULL)); - rep_msg->local_ca_guid = cm_id_priv->av.port->ca_guid; + cm_rep_set_srq(rep_msg, param->srq); + rep_msg->local_ca_guid = 
cm_id_priv->av.port->cm_dev->ca_guid; if (param->private_data && param->private_data_len) memcpy(rep_msg->private_data, param->private_data, param->private_data_len); } -static inline int cm_validate_rep_param(struct ib_cm_rep_param *param) -{ - if (!param->qp) - return -EINVAL; - - if (param->qp->qp_type != IB_QPT_RC && param->qp->qp_type != IB_QPT_UC) - return -EINVAL; - - if (param->private_data && - param->private_data_len > IB_CM_REP_PRIVATE_DATA_SIZE) - return -EINVAL; - - return 0; -} - int ib_send_cm_rep(struct ib_cm_id *cm_id, struct ib_cm_rep_param *param) { @@ -1165,9 +1126,11 @@ int ib_send_cm_rep(struct ib_cm_id *cm_i unsigned long flags; int ret; - ret = cm_validate_rep_param(param); - if (ret) + if (param->private_data && + param->private_data_len > IB_CM_REP_PRIVATE_DATA_SIZE) { + ret = -EINVAL; goto out; + } cm_id_priv = container_of(cm_id, struct cm_id_private, id); ret = cm_alloc_msg(cm_id_priv, &msg); @@ -2313,7 +2276,6 @@ static void cm_format_sidr_req(struct cm int ib_send_cm_sidr_req(struct ib_cm_id *cm_id, struct ib_cm_sidr_req_param *param) { - struct ib_device *device; struct cm_id_private *cm_id_priv; struct cm_msg *msg; struct ib_send_wr *bad_send_wr; @@ -2325,11 +2287,7 @@ int ib_send_cm_sidr_req(struct ib_cm_id return -EINVAL; cm_id_priv = container_of(cm_id, struct cm_id_private, id); - ret = cm_find_device(&param->path->sgid, &device, &cm_id_priv->av.port); - if (ret) - goto out; - - ret = cm_init_av(device, param->path, &cm_id_priv->av); + ret = cm_init_av(param->path, &cm_id_priv->av); if (ret) goto out; @@ -2966,7 +2924,8 @@ static u64 cm_get_ca_guid(struct ib_devi static void cm_add_one(struct ib_device *device) { - struct cm_port *port_array, *port; + struct cm_device *cm_dev; + struct cm_port *port; struct ib_mad_reg_req reg_req = { .mgmt_class = IB_MGMT_CLASS_CM, .mgmt_class_version = IB_CM_CLASS_VERSION @@ -2974,24 +2933,25 @@ static void cm_add_one(struct ib_device struct ib_port_modify port_modify = { .set_port_cap_mask = 
IB_PORT_CM_SUP }; - u64 ca_guid; - u8 i; + unsigned long flags; int ret; + u8 i; - ca_guid = cm_get_ca_guid(device); - if (!ca_guid) + cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * + device->phys_port_cnt, GFP_KERNEL); + if (!cm_dev) return; - port_array = kmalloc(sizeof *port * device->phys_port_cnt, GFP_KERNEL); - if (!port_array) - return; + cm_dev->device = device; + cm_dev->ca_guid = cm_get_ca_guid(device); + if (!cm_dev->ca_guid) + goto error1; set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); - for (i = 1, port = port_array; i <= device->phys_port_cnt; i++, port++){ - spin_lock_init(&port->lock); - port->ca_guid = ca_guid; + for (i = 1; i <= device->phys_port_cnt; i++) { + port = &cm_dev->port[i-1]; + port->cm_dev = cm_dev; port->port_num = i; - port->mad_agent = ib_register_mad_agent(device, i, IB_QPT_GSI, &reg_req, @@ -3000,54 +2960,64 @@ static void cm_add_one(struct ib_device cm_recv_handler, port); if (IS_ERR(port->mad_agent)) - goto error1; + goto error2; port->mr = ib_get_dma_mr(port->mad_agent->qp->pd, IB_ACCESS_LOCAL_WRITE); if (IS_ERR(port->mr)) - goto error2; + goto error3; ret = ib_modify_port(device, i, 0, &port_modify); if (ret) - goto error3; + goto error4; } - ib_set_client_data(device, &cm_client, port_array); + ib_set_client_data(device, &cm_client, cm_dev); + + write_lock_irqsave(&cm.device_lock, flags); + list_add_tail(&cm_dev->list, &cm.device_list); + write_unlock_irqrestore(&cm.device_lock, flags); return; -error3: +error4: ib_dereg_mr(port->mr); -error2: +error3: ib_unregister_mad_agent(port->mad_agent); -error1: +error2: port_modify.set_port_cap_mask = 0; port_modify.clr_port_cap_mask = IB_PORT_CM_SUP; - while (port != port_array) { - --port; + while (--i) { + port = &cm_dev->port[i]; ib_modify_port(device, port->port_num, 0, &port_modify); - ib_dereg_mr(port->mr); ib_unregister_mad_agent(port->mad_agent); } - kfree(port_array); +error1: + kfree(cm_dev); } static void cm_remove_one(struct ib_device *device) { - struct cm_port 
*port_array, *port; + struct cm_device *cm_dev; + struct cm_port *port; struct ib_port_modify port_modify = { .clr_port_cap_mask = IB_PORT_CM_SUP }; + unsigned long flags; int i; - port_array = (struct cm_port *)ib_get_client_data(device, &cm_client); - if (!port_array) + cm_dev = ib_get_client_data(device, &cm_client); + if (!cm_dev) return; - for (i = 1, port = port_array; i <= device->phys_port_cnt; i++, port++){ + write_lock_irqsave(&cm.device_lock, flags); + list_del(&cm_dev->list); + write_unlock_irqrestore(&cm.device_lock, flags); + + for (i = 1; i <= device->phys_port_cnt; i++) { + port = &cm_dev->port[i]; ib_modify_port(device, port->port_num, 0, &port_modify); - ib_dereg_mr(port->mr); ib_unregister_mad_agent(port->mad_agent); } - kfree(port_array); + kfree(cm_dev); } static int __init ib_cm_init(void) @@ -3055,6 +3025,8 @@ static int __init ib_cm_init(void) int ret; memset(&cm, 0, sizeof cm); + INIT_LIST_HEAD(&cm.device_list); + rwlock_init(&cm.device_lock); spin_lock_init(&cm.lock); cm.listen_service_table = RB_ROOT; cm.remote_id_table = RB_ROOT; From mshefty at ichips.intel.com Tue Mar 8 16:08:48 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 8 Mar 2005 16:08:48 -0800 Subject: [openib-general] [PATCH] [SDP] Updated to new CM API Message-ID: <20050308160848.010c39ff.mshefty@ichips.intel.com> Libor, Here's a patch that updates SDP to the new CM API. I didn't actually test this though. (I did test the CM changes, just not SDP.) Signed-off-by: Sean Hefty Index: infiniband/ulp/sdp/sdp_actv.c =================================================================== --- infiniband/ulp/sdp/sdp_actv.c (revision 1964) +++ infiniband/ulp/sdp/sdp_actv.c (working copy) @@ -472,9 +472,11 @@ static void sdp_cm_path_complete(u64 id, /* * set QP/CM parameters. 
*/ - memset(&param, 0, sizeof(struct ib_cm_req_param)); + memset(&param, 0, sizeof param); - param.qp = conn->qp; + param.qp_num = conn->qp->qp_num; + param.qp_type = conn->qp->qp_type; + param.srq = (conn->qp->srq != NULL); param.primary_path = path; param.alternate_path = NULL; param.service_id = cpu_to_be64(SDP_PORT_TO_SID(conn->dst_port)); Index: infiniband/ulp/sdp/sdp_pass.c =================================================================== --- infiniband/ulp/sdp/sdp_pass.c (revision 1964) +++ infiniband/ulp/sdp/sdp_pass.c (working copy) @@ -263,7 +263,8 @@ static int sdp_cm_accept(struct sdp_opt /* * send REP message to remote CM to continue connection. */ - param.qp = conn->qp; + param.qp_num = conn->qp->qp_num; + param.srq = (conn->qp->srq != NULL); param.starting_psn = conn->rq_psn; param.private_data = hello_ack; /* From libor at topspin.com Tue Mar 8 16:28:06 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 8 Mar 2005 16:28:06 -0800 Subject: [openib-general] Re: [PATCH] [SDP] Updated to new CM API In-Reply-To: <20050308160848.010c39ff.mshefty@ichips.intel.com>; from mshefty@ichips.intel.com on Tue, Mar 08, 2005 at 04:08:48PM -0800 References: <20050308160848.010c39ff.mshefty@ichips.intel.com> Message-ID: <20050308162806.C28988@topspin.com> On Tue, Mar 08, 2005 at 04:08:48PM -0800, Sean Hefty wrote: > Libor, > > Here's a patch that updates SDP to the new CM API. I didn't actually > test this though. (I did test the CM changes, just not SDP.) Thanks. I just tested it and works correctly. Feel free to commit it at the same time that you commit the CM changes.
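A minimal sketch of the API change in the SDP patch above: the request parameters no longer carry a QP pointer, and the caller copies qp_num, qp_type and an srq flag out of the QP instead. The types and helpers here (mock_qp, mock_cm_req_param, mock_fill_req_param, mock_validate_req_param) are hypothetical stand-ins so the fragment builds in user space. They are not the kernel's struct ib_qp or struct ib_cm_req_param, and the real validation also checks the primary path and the private data length.

```c
/* Standalone sketch with mock types; NOT the kernel's ib_verbs.h / ib_cm.h. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum mock_qp_type { MOCK_QPT_RC, MOCK_QPT_UC, MOCK_QPT_UD };

struct mock_srq { int unused; };

/* Hypothetical stand-in for the QP a ULP such as SDP already owns. */
struct mock_qp {
	uint32_t qp_num;
	enum mock_qp_type qp_type;
	struct mock_srq *srq;	/* NULL when no shared receive queue is attached */
};

/* Only the three fields that replace the old 'qp' pointer are modelled. */
struct mock_cm_req_param {
	uint32_t qp_num;
	enum mock_qp_type qp_type;
	uint8_t srq;
};

/* Caller side: copy plain values out of the QP, as the SDP patch does. */
static void mock_fill_req_param(struct mock_cm_req_param *param,
				const struct mock_qp *qp)
{
	memset(param, 0, sizeof *param);
	param->qp_num = qp->qp_num;
	param->qp_type = qp->qp_type;
	param->srq = (qp->srq != NULL);
}

/*
 * CM side: the QP-type part of the check now reads param->qp_type directly.
 * Returns 0 if accepted, -1 otherwise (the kernel code returns -EINVAL).
 */
static int mock_validate_req_param(const struct mock_cm_req_param *param)
{
	if (param->qp_type != MOCK_QPT_RC && param->qp_type != MOCK_QPT_UC)
		return -1;
	return 0;
}
```

One consequence visible in the CM patch is that the CM no longer dereferences param->qp, so cm_init_av has to locate the device from the path's SGID rather than from the QP.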
-Libor From tduffy at sun.com Tue Mar 8 16:27:55 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 08 Mar 2005 16:27:55 -0800 Subject: [openib-general] [PATCH][libsdp] update to use correct protocol family number (27) Message-ID: <1110328075.20262.69.camel@duffman> Signed-off-by: Tom Duffy Index: gen2/trunk/src/userspace/libsdp/src/sdp_inet.h =================================================================== --- gen2/trunk/src/userspace/libsdp/src/sdp_inet.h (revision 1966) +++ gen2/trunk/src/userspace/libsdp/src/sdp_inet.h (working copy) @@ -27,7 +27,7 @@ /* * constants shared between user and kernel space. */ -#define AF_INET_SDP 26 /* SDP socket protocol family */ +#define AF_INET_SDP 27 /* SDP socket protocol family */ #define AF_INET_STR "AF_INET_SDP" /* SDP enabled environment variable */ #endif /* _TS_SDP_INET_H */ From mshefty at ichips.intel.com Tue Mar 8 16:31:00 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Mar 2005 16:31:00 -0800 Subject: [openib-general] Re: [PATCH] [SDP] Updated to new CM API In-Reply-To: <20050308162806.C28988@topspin.com> References: <20050308160848.010c39ff.mshefty@ichips.intel.com> <20050308162806.C28988@topspin.com> Message-ID: <422E43C4.9050909@ichips.intel.com> Libor Michalek wrote: > On Tue, Mar 08, 2005 at 04:08:48PM -0800, Sean Hefty wrote: > >>Libor, >> >>Here's a patch that updates SDP to the new CM API. I didn't actually >>test this though. (I did test the CM changes, just not SDP.) > > > Thanks. I just tested it and works correctly. Feel free to commit it > at the same time that you commit the CM changes. All CM changes related to this have been committed. 
- Sean From mshefty at ichips.intel.com Tue Mar 8 17:01:46 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Mar 2005 17:01:46 -0800 Subject: [openib-general] Re: [PATCH] [SDP] Updated to new CM API In-Reply-To: <20050308162806.C28988@topspin.com> References: <20050308160848.010c39ff.mshefty@ichips.intel.com> <20050308162806.C28988@topspin.com> Message-ID: <422E4AFA.1090507@ichips.intel.com> Libor Michalek wrote: > On Tue, Mar 08, 2005 at 04:08:48PM -0800, Sean Hefty wrote: > >>Libor, >> >>Here's a patch that updates SDP to the new CM API. I didn't actually >>test this though. (I did test the CM changes, just not SDP.) > > > Thanks. I just tested it and works correctly. Feel free to commit it > at the same time that you commit the CM changes. FYI - I just found a bug in the CM module unload code. I'll submit a patch shortly. - Sean From iod00d at hp.com Tue Mar 8 18:56:47 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 8 Mar 2005 18:56:47 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <1110324708.8595.223.camel@localhost> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> Message-ID: <20050309025647.GN5502@esmail.cup.hp.com> On Tue, Mar 08, 2005 at 03:31:48PM -0800, Matt Leininger wrote: > You can grab the openib source code from the subversion repository. > See http://www.openib.org/tools.html. If you want everything run 'svn > co https://openib.org/svn' Matt, probably best to just add a short blurb to tools.html that includes an example using gen2 branch. That's what we want people to focus on I think. grant From halr at voltaire.com Wed Mar 9 02:06:35 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 05:06:35 -0500 Subject: [openib-general] Kernel oops when unloading ib_cm module with latest CM Message-ID: <1110362795.4645.16.camel@localhost.localdomain> This didn't occur before yesterday's CM change.
Mar 9 05:03:34 localhost kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000018 Mar 9 05:03:34 localhost kernel: printing eip: Mar 9 05:03:34 localhost kernel: d09472a7 Mar 9 05:03:34 localhost kernel: *pde = 00000000 Mar 9 05:03:34 localhost kernel: Oops: 0000 [#1] Mar 9 05:03:34 localhost kernel: Modules linked in: ib_cm ib_umad ide_cd cdrom lp ipv6 autofs parport_pc parport uhci_hcd ehci_hcd ib_mthca ib_mad ib_core ohci_hcd eepro100 mii evdev usbcore Mar 9 05:03:34 localhost kernel: CPU: 0 Mar 9 05:03:34 localhost kernel: EIP: 0060:[] Tainted: P VLI Mar 9 05:03:34 localhost kernel: EFLAGS: 00010286 (2.6.10) Mar 9 05:03:34 localhost kernel: EIP is at ib_unregister_mad_agent+0x7/0x30 [ib_mad] Mar 9 05:03:34 localhost kernel: eax: 00000000 ebx: c2eefa24 ecx: c1054660 edx: 00000000 Mar 9 05:03:34 localhost kernel: esi: 00000003 edi: c2eef9e0 ebp: cee7f000 esp: ce783ef0 Mar 9 05:03:34 localhost kernel: ds: 007b es: 007b ss: 0068 Mar 9 05:03:34 localhost kernel: Process modprobe (pid: 5105, threadinfo=ce782000 task=c147f550) Mar 9 05:03:34 localhost kernel: Stack: ce783f08 d09e42cf 00000000 00000001 00000000 ce783f08 00000000 00010000 Mar 9 05:03:34 localhost kernel: 00000000 00000000 d09e6200 cee7f000 08049178 d09e61e4 d08c0cb1 cee7f000 Mar 9 05:03:34 localhost kernel: cedb0160 c012f4b8 c2e55000 00000000 c015892d 00000286 00000000 d09e6200 Mar 9 05:03:34 localhost kernel: Call Trace: Mar 9 05:03:34 localhost kernel: [] cm_remove_one+0x9f/0xd0 [ib_cm] Mar 9 05:03:34 localhost kernel: [] ib_unregister_client+0x211/0x220 [ib_core] Mar 9 05:03:34 localhost kernel: [] destroy_workqueue+0x58/0x1c0 Mar 9 05:03:34 localhost kernel: [] unmap_region+0x9d/0xf0 Mar 9 05:03:34 localhost kernel: [] ib_cm_cleanup+0x26/0x28 [ib_cm] Mar 9 05:03:34 localhost kernel: [] sys_delete_module+0x158/0x190 Mar 9 05:03:34 localhost kernel: [] sys_munmap+0x44/0x70 Mar 9 05:03:34 localhost kernel: [] sysenter_past_esp+0x52/0x75 Mar 9 05:03:34 localhost kernel: Code: 
40 df 94 d0 c7 44 24 08 09 02 00 00 c7 44 24 04 40 de 94 d0 89 44 24 0c e8 b7 0c 7d ef e9 60 fe ff ff 89 f6 83 ec 04 8b 44 24 08 <8b> 48 18 85 c9 74 12 83 e8 08 89 04 24 e8 a7 fb ff ff 5a 31 c0 From halr at voltaire.com Wed Mar 9 02:35:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 05:35:17 -0500 Subject: [openib-general] [PATCH] [TRIVIAL] SDP: Eliminate uneeded initialization and fix some typos Message-ID: <1110364516.4645.22.camel@localhost.localdomain> SDP: Eliminate uneeded initialization and fix some typos Signed-off-by: Hal Rosenstock Index: sdp_link.c =================================================================== --- sdp_link.c (revision 1967) +++ sdp_link.c (working copy) @@ -232,7 +232,7 @@ if (!status) { /* - * on sucess save path record, stop waiting for info, + * on success save path record, stop waiting for info, * and complete all waiting IOs */ info->flags &= ~SDP_LINK_F_PATH; @@ -447,7 +447,6 @@ goto path; } - if ((NUD_CONNECTED|NUD_DELAY|NUD_PROBE) & rt->u.dst.neighbour->nud_state) { memcpy(&info->path.dgid, Index: sdp_actv.c =================================================================== --- sdp_actv.c (revision 1967) +++ sdp_actv.c (working copy) @@ -486,7 +486,6 @@ * no endian swap needed for single byte values. */ param.private_data_len = (u8)(buff->tail - buff->data); - param.peer_to_peer = 0; param.responder_resources = 4; param.initiator_depth = 4; param.remote_cm_response_timeout = 20; Index: sdp_pass.c =================================================================== --- sdp_pass.c (revision 1967) +++ sdp_pass.c (working copy) @@ -143,7 +143,7 @@ return result; } /* - * Functions to handle incomming passive connection requests. (REQ) + * Functions to handle incoming passive connection requests. 
(REQ) */ static int sdp_cm_accept(struct sdp_opt *conn) { From halr at voltaire.com Wed Mar 9 02:42:22 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 05:42:22 -0500 Subject: [openib-general] [PATCH] [TRIVIAL] libsdp: Change TS_ to OPENIB_ Message-ID: <1110364671.4645.27.camel@localhost.localdomain> libsdp: Change TS_ to OPENIB_ Signed-off-by: Hal Rosenstock Index: src/port.c =================================================================== --- src/port.c (revision 1967) +++ src/port.c (working copy) @@ -398,7 +398,7 @@ int protocol ) { -#ifdef _TS_VERBOSE_PRELOAD +#ifdef _OPENIB_VERBOSE_PRELOAD FILE *fd; #endif struct sdp_socket_info *sdp_sock_info; Index: src/socket.c =================================================================== --- src/socket.c (revision 1967) +++ src/socket.c (working copy) @@ -55,7 +55,7 @@ #include #if 0 -#define _TS_VERBOSE_PRELOAD +#define _OPENIB_VERBOSE_PRELOAD #endif #define SOCKOP_socket 1 @@ -98,7 +98,7 @@ char *inet; char **tenviron; -#ifdef _TS_VERBOSE_PRELOAD +#ifdef _OPENIB_VERBOSE_PRELOAD FILE *fd; #endif /* @@ -128,7 +128,7 @@ } /* if */ } /* if */ -#ifdef _TS_VERBOSE_PRELOAD +#ifdef _OPENIB_VERBOSE_PRELOAD fd = fopen("/tmp/libsdp.log.txt", "a+"); fprintf(fd, "SOCKET: <%s> domain <%d> type <%d> protocol <%d>\n", Index: src/sdp_inet.h =================================================================== --- src/sdp_inet.h (revision 1967) +++ src/sdp_inet.h (working copy) @@ -21,8 +21,8 @@ $Id$ */ -#ifndef _TS_SDP_INET_H -#define _TS_SDP_INET_H +#ifndef _SDP_INET_H +#define _SDP_INET_H /* * constants shared between user and kernel space. @@ -30,4 +30,4 @@ #define AF_INET_SDP 26 /* SDP socket protocol family */ #define AF_INET_STR "AF_INET_SDP" /* SDP enabled environment variable */ -#endif /* _TS_SDP_INET_H */ +#endif /* _SDP_INET_H */ From mst at mellanox.co.il Wed Mar 9 04:11:50 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 9 Mar 2005 14:11:50 +0200 Subject: [openib-general] Re: [PATCH] [TRIVIAL] libsdp: Change TS_ to OPENIB_ In-Reply-To: <1110364671.4645.27.camel@localhost.localdomain> References: <1110364671.4645.27.camel@localhost.localdomain> Message-ID: <20050309121150.GC1826@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: [PATCH] [TRIVIAL] libsdp: Change TS_ to OPENIB_ > > libsdp: Change TS_ to OPENIB_ > > Signed-off-by: Hal Rosenstock Thanks! -- MST - Michael S. Tsirkin From mst at mellanox.co.il Wed Mar 9 04:12:08 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Mar 2005 14:12:08 +0200 Subject: [openib-general] Re: [PATCH][libsdp] update to use correct protocol family number (27) In-Reply-To: <1110328075.20262.69.camel@duffman> References: <1110328075.20262.69.camel@duffman> Message-ID: <20050309121208.GD1826@mellanox.co.il> Quoting r. Tom Duffy : > Subject: [PATCH][libsdp] update to use correct protocol family number (27) > > Signed-off-by: Tom Duffy Thanks! -- MST - Michael S. Tsirkin From mst at mellanox.co.il Wed Mar 9 04:27:00 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Mar 2005 14:27:00 +0200 Subject: [openib-general] [PATCH] uverbs rdma example Message-ID: <20050309122700.GA2352@mellanox.co.il> Here is a small test for the rdma functionality. I based it on the pingpong test, the main change being polling on data instead of receive completions. This is useful as an example of using rdma, and is also useful as a post send latency benchmark, for tuning (nicer than the send test in that it lets us measure post send separately from poll cq). Code is originally based on the pingpong test. I intentionally did not rename functions from pingpong_ to rdma_ to make it easier to share some code later if we decide it is useful. Roland, I also noticed a race in the pingpong test: you exchange connection data over the socket when the qp is still in INIT.
Then the client may immediately move it to RTR, then to RTS, and start posting work requests. If the client is fast enough, a send may arrive at the server while the server qp is still in INIT, and an error will be generated. For now I gave up on measuring time with rdtsc for benchmarking, since the results seem quite close to what I get with simple gettimeofday, and the latter is more portable. I have the relevant code available if someone wants it. I fixed the race in this new test by calling the exch routine a second time (and I had to split this routine, to avoid closing the socket). I guess the pingpong test must be fixed, too. Signed-off-by: Michael S. Tsirkin Index: Makefile.am =================================================================== --- Makefile.am (revision 1967) +++ Makefile.am (working copy) @@ -20,7 +20,8 @@ src_libibverbs_la_LDFLAGS = -version-inf src_libibverbs_la_DEPENDENCIES = $(srcdir)/src/libibverbs.map bin_PROGRAMS = examples/ibv_devices examples/ibv_asyncwatch \ - examples/ibv_pingpong examples/ibv_ud_pingpong + examples/ibv_pingpong examples/ibv_ud_pingpong \ + examples/ibv_rdma examples_ibv_devices_SOURCES = examples/device_list.c examples_ibv_devices_LDADD = $(top_builddir)/src/libibverbs.la examples_ibv_pingpong_SOURCES = examples/pingpong.c @@ -29,6 +30,8 @@ examples_ibv_ud_pingpong_SOURCES = examp examples_ibv_ud_pingpong_LDADD = $(top_builddir)/src/libibverbs.la examples_ibv_asyncwatch_SOURCES = examples/asyncwatch.c examples_ibv_asyncwatch_LDADD = $(top_builddir)/src/libibverbs.la +examples_ibv_rdma_SOURCES = examples/rdma.c +examples_ibv_rdma_LDADD = $(top_builddir)/src/libibverbs.la libibverbsincludedir = $(includedir)/infiniband Index: examples/rdma.c =================================================================== --- examples/rdma.c (revision 0) +++ examples/rdma.c (revision 0) @@ -0,0 +1,686 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies Ltd. 
All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +enum { + PINGPONG_RDMA_WRID = 3, +}; + +static int page_size; + +struct pingpong_context { + struct ibv_context *context; + struct ibv_pd *pd; + struct ibv_mr *mr; + struct ibv_cq *cq; + struct ibv_qp *qp; + void *buf; + volatile char *post_buf; + volatile char *poll_buf; + int size; + int rx_depth; + int tx_depth; +}; + +struct pingpong_dest { + int lid; + int qpn; + int psn; + unsigned rkey; + unsigned long long vaddr; +}; + +/* + * pp_get_local_lid() uses a pretty bogus method for finding the LID + * of a local port. Please don't copy this into your app (or if you + * do, please rip it out soon). + */ +static uint16_t pp_get_local_lid(struct ibv_device *dev, int port) +{ + char path[256]; + char val[16]; + char *name; + + if (sysfs_get_mnt_path(path, sizeof path)) { + fprintf(stderr, "Couldn't find sysfs mount.\n"); + return 0; + } + + asprintf(&name, "%s/class/infiniband/%s/ports/%d/lid", path, + ibv_get_device_name(dev), port); + + if (sysfs_read_attribute_value(name, val, sizeof val)) { + fprintf(stderr, "Couldn't read LID at %s\n", name); + return 0; + } + + return strtol(val, NULL, 0); +} + +static int pp_client_connect(const char *servername, int port) +{ + struct addrinfo *res, *t; + struct addrinfo hints = { + .ai_family = AF_UNSPEC, + .ai_socktype = SOCK_STREAM + }; + char *service; + int n; + int sockfd = -1; + + asprintf(&service, "%d", port); + n = getaddrinfo(servername, service, &hints, &res); + + if (n < 0) { + fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), servername, port); + return n; + } + + for (t = res; t; t = t->ai_next) { + sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); + if (sockfd >= 0) { + if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) + break; + close(sockfd); + sockfd = -1; + } + } + + 
freeaddrinfo(res); + + if (sockfd < 0) { + fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port); + return sockfd; + } + return sockfd; +} + +struct pingpong_dest * pp_client_exch_dest(int sockfd, + const struct pingpong_dest *my_dest) +{ + struct pingpong_dest *rem_dest = NULL; + char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; + + sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, my_dest->psn,my_dest->rkey,my_dest->vaddr); + if (write(sockfd, msg, sizeof msg) != sizeof msg) { + perror("client write"); + fprintf(stderr, "Couldn't send local address\n"); + goto out; + } + + if (read(sockfd, msg, sizeof msg) != sizeof msg) { + perror("client read"); + fprintf(stderr, "Couldn't read remote address\n"); + goto out; + } + + rem_dest = malloc(sizeof *rem_dest); + if (!rem_dest) + goto out; + + sscanf(msg, "%x:%x:%x:%x:%Lx", &rem_dest->lid, &rem_dest->qpn, &rem_dest->psn,&rem_dest->rkey,&rem_dest->vaddr); + +out: + return rem_dest; +} + +int pp_server_connect(int port) +{ + struct addrinfo *res, *t; + struct addrinfo hints = { + .ai_flags = AI_PASSIVE, + .ai_family = AF_UNSPEC, + .ai_socktype = SOCK_STREAM + }; + char *service; + int sockfd = -1, connfd; + int n; + + asprintf(&service, "%d", port); + n = getaddrinfo(NULL, service, &hints, &res); + + if (n < 0) { + fprintf(stderr, "%s for port %d\n", gai_strerror(n), port); + return n; + } + + for (t = res; t; t = t->ai_next) { + sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); + if (sockfd >= 0) { + n = 1; + + setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &n, sizeof n); + + if (!bind(sockfd, t->ai_addr, t->ai_addrlen)) + break; + close(sockfd); + sockfd = -1; + } + } + + freeaddrinfo(res); + + if (sockfd < 0) { + fprintf(stderr, "Couldn't listen to port %d\n", port); + return sockfd; + } + + listen(sockfd, 1); + connfd = accept(sockfd, NULL, 0); + if (connfd < 0) { + perror("server accept"); + fprintf(stderr, "accept() failed\n"); + close(sockfd); + return 
connfd; + } + + close(sockfd); + return connfd; +} + +static struct pingpong_dest *pp_server_exch_dest(int connfd, const struct pingpong_dest *my_dest) +{ + char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; + struct pingpong_dest *rem_dest = NULL; + int parsed; + int n; + + n = read(connfd, msg, sizeof msg); + if (n != sizeof msg) { + perror("server read"); + fprintf(stderr, "%d/%d: Couldn't read remote address\n", n, (int) sizeof msg); + goto out; + } + + rem_dest = malloc(sizeof *rem_dest); + if (!rem_dest) + goto out; + + parsed = sscanf(msg, "%x:%x:%x:%x:%Lx", &rem_dest->lid, &rem_dest->qpn, &rem_dest->psn,&rem_dest->rkey,&rem_dest->vaddr); + if (parsed != 5) { + fprintf(stderr, "Couldn't parse line <%.*s>\n",(int)sizeof msg, + msg); + free(rem_dest); + rem_dest = NULL; + goto out; + } + + sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, my_dest->psn,my_dest->rkey,my_dest->vaddr); + if (write(connfd, msg, sizeof msg) != sizeof msg) { + perror("server write"); + fprintf(stderr, "Couldn't send local address\n"); + free(rem_dest); + rem_dest = NULL; + goto out; + } +out: + return rem_dest; +} + +static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, + int tx_depth, int rx_depth, int port) +{ + struct pingpong_context *ctx; + + ctx = malloc(sizeof *ctx); + if (!ctx) + return NULL; + + ctx->size = size; + ctx->rx_depth = rx_depth; + ctx->tx_depth = tx_depth; + + ctx->buf = memalign(page_size, size * 2); + if (!ctx->buf) { + fprintf(stderr, "Couldn't allocate work buf.\n"); + return NULL; + } + + memset(ctx->buf, 0, size * 2); + + ctx->post_buf = (char*)ctx->buf + (size - 1); + ctx->poll_buf = (char*)ctx->buf + (2 * size - 1); + + ctx->context = ibv_open_device(ib_dev); + if (!ctx->context) { + fprintf(stderr, "Couldn't get context for %s\n", + ibv_get_device_name(ib_dev)); + return NULL; + } + + ctx->pd = ibv_alloc_pd(ctx->context); + if (!ctx->pd) { + fprintf(stderr, "Couldn't allocate PD\n"); + return 
NULL; + } + + ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size * 2, + IBV_ACCESS_REMOTE_WRITE); + if (!ctx->mr) { + fprintf(stderr, "Couldn't allocate MR\n"); + return NULL; + } + + ctx->cq = ibv_create_cq(ctx->context, rx_depth + tx_depth, NULL); + if (!ctx->cq) { + fprintf(stderr, "Couldn't create CQ\n"); + return NULL; + } + + { + struct ibv_qp_init_attr attr = { + .send_cq = ctx->cq, + .recv_cq = ctx->cq, + .cap = { + .max_send_wr = tx_depth, + .max_recv_wr = rx_depth, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .qp_type = IBV_QPT_RC + }; + + ctx->qp = ibv_create_qp(ctx->pd, &attr); + if (!ctx->qp) { + fprintf(stderr, "Couldn't create QP\n"); + return NULL; + } + } + + { + struct ibv_qp_attr attr; + + attr.qp_state = IBV_QPS_INIT; + attr.pkey_index = 0; + attr.port_num = port; + attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE; + + if (ibv_modify_qp(ctx->qp, &attr, + IBV_QP_STATE | + IBV_QP_PKEY_INDEX | + IBV_QP_PORT | + IBV_QP_ACCESS_FLAGS)) { + fprintf(stderr, "Failed to modify QP to INIT\n"); + return NULL; + } + } + + return ctx; +} + +static int pp_post_rdma(struct pingpong_context *ctx, + struct pingpong_dest* rem_dest) +{ + struct ibv_sge list = { + .addr = (uintptr_t) ctx->buf, + .length = ctx->size, + .lkey = ctx->mr->lkey + }; + struct ibv_send_wr wr = { + .wr_id = PINGPONG_RDMA_WRID, + .sg_list = &list, + .num_sge = 1, + .opcode = IBV_WR_RDMA_WRITE, + .send_flags = IBV_SEND_SIGNALED, + .wr.rdma.remote_addr = rem_dest->vaddr, + .wr.rdma.rkey = rem_dest->rkey + }; + struct ibv_send_wr *bad_wr; + + return ibv_post_send(ctx->qp, &wr, &bad_wr); +} + +static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, + struct pingpong_dest *dest) +{ + struct ibv_qp_attr attr; + + attr.qp_state = IBV_QPS_RTR; + attr.path_mtu = IBV_MTU_1024; + attr.dest_qp_num = dest->qpn; + attr.rq_psn = dest->psn; + attr.max_dest_rd_atomic = 1; + attr.min_rnr_timer = 12; + attr.ah_attr.is_global = 0; + attr.ah_attr.dlid = dest->lid; + attr.ah_attr.sl = 0; + 
attr.ah_attr.src_path_bits = 0; + attr.ah_attr.port_num = port; + if (ibv_modify_qp(ctx->qp, &attr, + IBV_QP_STATE | + IBV_QP_AV | + IBV_QP_PATH_MTU | + IBV_QP_DEST_QPN | + IBV_QP_RQ_PSN | + IBV_QP_MAX_DEST_RD_ATOMIC | + IBV_QP_MIN_RNR_TIMER)) { + fprintf(stderr, "Failed to modify QP to RTR\n"); + return 1; + } + + attr.qp_state = IBV_QPS_RTS; + attr.timeout = 14; + attr.retry_cnt = 7; + attr.rnr_retry = 7; + attr.sq_psn = my_psn; + attr.max_rd_atomic = 1; + if (ibv_modify_qp(ctx->qp, &attr, + IBV_QP_STATE | + IBV_QP_TIMEOUT | + IBV_QP_RETRY_CNT | + IBV_QP_RNR_RETRY | + IBV_QP_SQ_PSN | + IBV_QP_MAX_QP_RD_ATOMIC)) { + fprintf(stderr, "Failed to modify QP to RTS\n"); + return 1; + } + + return 0; +} + +static void usage(const char *argv0) +{ + printf("Usage:\n"); + printf(" %s start a server and wait for connection\n", argv0); + printf(" %s connect to server at \n", argv0); + printf("\n"); + printf("Options:\n"); + printf(" -p, --port= listen on/connect to port (default 18515)\n"); + printf(" -d, --ib-dev= use IB device (default first device found)\n"); + printf(" -i, --ib-port= use port of IB device (default 1)\n"); + printf(" -s, --size= size of message to exchange (default 4096)\n"); + printf(" -t, --tx-depth= size of tx queue (default 50)\n"); + printf(" -n, --iters= number of exchanges (default 1000)\n"); +} + +int main(int argc, char *argv[]) +{ + struct dlist *dev_list; + struct ibv_device *ib_dev; + struct pingpong_context *ctx; + struct pingpong_dest my_dest; + struct pingpong_dest *rem_dest; + struct timeval start, end; + char *ib_devname = NULL; + char *servername = NULL; + int port = 18515; + int ib_port = 1; + int size = 1; + int rx_depth = 1; + int tx_depth = 50; + int iters = 1000; + int scnt, rcnt, ccnt; + int client_first_post; + int sockfd; + + srand48(getpid() * time(NULL)); + + while (1) { + int c; + + static struct option long_options[] = { + { .name = "port", .has_arg = 1, .val = 'p' }, + { .name = "ib-dev", .has_arg = 1, .val = 'd' }, + { .name 
= "ib-port", .has_arg = 1, .val = 'i' }, + { .name = "size", .has_arg = 1, .val = 's' }, + { .name = "iters", .has_arg = 1, .val = 'n' }, + { .name = "tx-depth",.has_arg = 1, .val = 't' }, + { 0 } + }; + + c = getopt_long(argc, argv, "p:d:i:s:t:n:e", long_options, NULL); + if (c == -1) + break; + + switch (c) { + case 'p': + port = strtol(optarg, NULL, 0); + if (port < 0 || port > 65535) { + usage(argv[0]); + return 1; + } + break; + + case 'd': + ib_devname = strdupa(optarg); + break; + + case 'i': + ib_port = strtol(optarg, NULL, 0); + if (ib_port < 0) { + usage(argv[0]); + return 1; + } + break; + + case 's': + size = strtol(optarg, NULL, 0); + break; + + case 't': + tx_depth = strtol(optarg, NULL, 0); + break; + + case 'n': + iters = strtol(optarg, NULL, 0); + break; + + default: + usage(argv[0]); + return 1; + } + } + + if (optind == argc - 1) + servername = strdupa(argv[optind]); + else if (optind < argc) { + usage(argv[0]); + return 1; + } + + page_size = sysconf(_SC_PAGESIZE); + + dev_list = ibv_get_devices(); + + dlist_start(dev_list); + if (!ib_devname) { + ib_dev = dlist_next(dev_list); + if (!ib_dev) { + fprintf(stderr, "No IB devices found\n"); + return 1; + } + } else { + dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) + break; + if (!ib_dev) { + fprintf(stderr, "IB device %s not found\n", ib_devname); + return 1; + } + } + + ctx = pp_init_ctx(ib_dev, size, iters, rx_depth, ib_port); + if (!ctx) + return 1; + + my_dest.lid = pp_get_local_lid(ib_dev, ib_port); + my_dest.qpn = ctx->qp->qp_num; + my_dest.psn = lrand48() & 0xffffff; + if (!my_dest.lid) { + fprintf(stderr, "Local lid 0x0 detected.
Is an SM running?\n"); + return 1; + } + my_dest.rkey = ctx->mr->rkey; + my_dest.vaddr = (uintptr_t)ctx->buf + ctx->size; + + printf(" local address: LID %#04x, QPN %#06x, PSN %#06x " + "RKey %#08x VAddr %#016Lx\n", + my_dest.lid, my_dest.qpn, my_dest.psn, + my_dest.rkey, my_dest.vaddr); + + + if (servername) { + sockfd = pp_client_connect(servername, port); + } else { + sockfd = pp_server_connect(port); + } + if (sockfd < 0) + return 1; + + if (servername) { + rem_dest = pp_client_exch_dest(sockfd, &my_dest); + } else { + rem_dest = pp_server_exch_dest(sockfd, &my_dest); + } + + if (!rem_dest) + return 1; + + printf(" remote address: LID %#04x, QPN %#06x, PSN %#06x, " + "RKey %#08x VAddr %#016Lx\n", + rem_dest->lid, rem_dest->qpn, rem_dest->psn, + rem_dest->rkey, rem_dest->vaddr); + + if (pp_connect_ctx(ctx, ib_port, my_dest.psn, rem_dest)) + return 1; + + /* An additional handshake is required *after* moving qp to RTR. + Arbitrarily reuse exch_dest for this purpose. */ + if (servername) { + rem_dest = pp_client_exch_dest(sockfd, &my_dest); + } else { + rem_dest = pp_server_exch_dest(sockfd, &my_dest); + } + + write(sockfd, "done", sizeof "done"); + close(sockfd); + + if (gettimeofday(&start, NULL)) { + perror("gettimeofday"); + return 1; + } + + scnt = 0; + rcnt = 0; + ccnt = 0; + if (servername) + client_first_post = 1; + else + client_first_post = 0; + + while (scnt < iters || ccnt < iters || rcnt < iters) { + + /* Wait till buffer changes. */ + if (rcnt < iters && ! client_first_post) { + ++rcnt; + while (*ctx->poll_buf != (char)rcnt) { + } + /* Here the data is already in the physical memory. + If we wanted to actually use it, we may need + a read memory barrier here. 
*/ + } else + client_first_post = 0; + + if (scnt < iters) { + *ctx->post_buf = (char)++scnt; + if (pp_post_rdma(ctx, rem_dest)) { + fprintf(stderr, "Couldn't post send: scnt=%d\n", + scnt); + return 1; + } + } + + if (ccnt < iters) { + struct ibv_wc wc; + int ne; + ++ccnt; + do { + ne = ibv_poll_cq(ctx->cq, 1, &wc); + } while (ne == 0); + + if (ne < 0) { + fprintf(stderr, "poll CQ failed %d\n", ne); + return 1; + } + if (wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, "Completion with error at %s:\n", + servername?"client":"server"); + fprintf(stderr, "Failed status %d: wr_id %d\n", + wc.status, (int) wc.wr_id); + fprintf(stderr, "scnt=%d, rcnt=%d, ccnt=%d\n", + scnt, rcnt, ccnt); + return 1; + } + } + } + + if (gettimeofday(&end, NULL)) { + perror("gettimeofday"); + return 1; + } + + { + float usec = (end.tv_sec - start.tv_sec) * 1000000 + + (end.tv_usec - start.tv_usec); + long long bytes = (long long) size * iters; + + printf("%lld bytes in %.2f seconds = %.2f Mbit/sec\n", + bytes, usec / 1000000., bytes * 8. / usec); + printf("%d iters in %.2f seconds = %.2f usec/iter\n", + iters, usec / 1000000., usec / iters); + } + + return 0; +} -- MST - Michael S. Tsirkin From mst at mellanox.co.il Wed Mar 9 04:34:36 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Mar 2005 14:34:36 +0200 Subject: [openib-general] Re: [PATCH] [TRIVIAL] libsdp: Change TS_ to OPENIB_ In-Reply-To: <1110364671.4645.27.camel@localhost.localdomain> References: <1110364671.4645.27.camel@localhost.localdomain> Message-ID: <20050309123436.GA2586@mellanox.co.il> Quoting r.
Hal Rosenstock : > Subject: [PATCH] [TRIVIAL] libsdp: Change TS_ to OPENIB_ > > libsdp: Change TS_ to OPENIB_ > > Signed-off-by: Hal Rosenstock > > Index: src/port.c > =================================================================== > --- src/port.c (revision 1967) > +++ src/port.c (working copy) > @@ -398,7 +398,7 @@ > int protocol > ) > { > -#ifdef _TS_VERBOSE_PRELOAD > +#ifdef _OPENIB_VERBOSE_PRELOAD > FILE *fd; > #endif I decided to rename this one to _SDP_VERBOSE_PRELOAD. -- MST - Michael S. Tsirkin From halr at voltaire.com Wed Mar 9 06:43:18 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 09:43:18 -0500 Subject: [openib-general] [PATCH] [TRIVIAL] SDP: sdp_actv.c remove redundant initialization Message-ID: <1110379398.4645.46.camel@localhost.localdomain> SDP: sdp_actv.c remove redundant initialization qp_attr->min_rnr_timer is already initialized to 0 by cm_init_qp_rtr_attr in cm.c Is this really intended to be IB_RNR_TIMER_122_88 instead ? Signed-off-by: Hal Rosenstock Index: sdp_actv.c =================================================================== --- sdp_actv.c (revision 1970) +++ sdp_actv.c (working copy) @@ -133,10 +133,9 @@ goto done; } - qp_attr->min_rnr_timer = 0; /* IB_RNR_TIMER_122_88; */ qp_attr->rq_psn = conn->rq_psn; - attr_mask |= (IB_QP_MIN_RNR_TIMER | IB_QP_RQ_PSN); + attr_mask |= IB_QP_RQ_PSN; result = ib_modify_qp(conn->qp, qp_attr, attr_mask); if (result) { From halr at voltaire.com Wed Mar 9 07:25:57 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 10:25:57 -0500 Subject: [openib-general] PATCH] [TRIVIAL] SDP: sdp_pass.c remove redundant initialization Message-ID: <1110381957.4645.1.camel@localhost.localdomain> SDP: sdp_pass.c remove redundant initialization (Similar to previous sdp_actv.c patch) qp_attr->min_rnr_timer is already initialized to 0 by cm_init_qp_rtr_attr in cm.c Is this really intended to be IB_RNR_TIMER_122_88 instead ? 
Signed-off-by: Hal Rosenstock Index: sdp_pass.c =================================================================== --- sdp_pass.c (revision 1970) +++ sdp_pass.c (working copy) @@ -235,10 +235,9 @@ goto error; } - qp_attr->min_rnr_timer = 0; /* IB_RNR_TIMER_122_88; */ qp_attr->rq_psn = conn->rq_psn; - qp_mask |= (IB_QP_MIN_RNR_TIMER | IB_QP_RQ_PSN); + qp_mask |= IB_QP_RQ_PSN; result = ib_modify_qp(conn->qp, qp_attr, qp_mask); kfree(qp_attr); From halr at voltaire.com Wed Mar 9 08:00:42 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 11:00:42 -0500 Subject: [openib-general] Re: mthca query_device does not fill in struct ib_device_attr In-Reply-To: <52acpfjszt.fsf@topspin.com> References: <1110212025.4648.38.camel@localhost.localdomain> <52fyz7ju1g.fsf@topspin.com> <1110213900.4648.79.camel@localhost.localdomain> <52acpfjszt.fsf@topspin.com> Message-ID: <1110383801.4648.1.camel@localhost.localdomain> On Mon, 2005-03-07 at 11:55, Roland Dreier wrote: > OK, it won't be hard to fill out those entries. Any idea on when this change will be made ? Thanks. -- Hal From timur.tabi at ammasso.com Wed Mar 9 08:58:47 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Wed, 09 Mar 2005 10:58:47 -0600 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <1110324708.8595.223.camel@localhost> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> Message-ID: <422F2B47.80809@ammasso.com> Matt Leininger wrote: > Most of the work to date has been for kernel-space IB support (now in > the 2.6.11 kernel). At some point, in the near future, the user-space > support will be stable/tested enough that we _may_ start posting tar > files, but until then subversion checkout is the best way to get the > source. Just to be clear - the current user-space stuff, whatever it is, is in the subversion repository? 
-- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From halr at voltaire.com Wed Mar 9 09:12:53 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 12:12:53 -0500 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F2B47.80809@ammasso.com> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> Message-ID: <1110388190.4647.3.camel@localhost.localdomain> On Wed, 2005-03-09 at 11:58, Timur Tabi wrote: > Just to be clear - the current user-space stuff, whatever it is, is in > the subversion repository? The latest user space verbs is on the roland-uverbs branch in the repository (https://openib.org/svn/gen2/branches/roland-uverbs/). It will be merged back to the mainline (https://openib.org/svn/gen2/trunk/src/userspace/) but an earlier version is there presently. -- Hal From roland at topspin.com Wed Mar 9 09:28:17 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 09 Mar 2005 09:28:17 -0800 Subject: [openib-general] Re: mthca query_device does not fill in struct ib_device_attr In-Reply-To: <1110383801.4648.1.camel@localhost.localdomain> (Hal Rosenstock's message of "09 Mar 2005 11:00:42 -0500") References: <1110212025.4648.38.camel@localhost.localdomain> <52fyz7ju1g.fsf@topspin.com> <1110213900.4648.79.camel@localhost.localdomain> <52acpfjszt.fsf@topspin.com> <1110383801.4648.1.camel@localhost.localdomain> Message-ID: <527jkgenke.fsf@topspin.com> Hal> Any idea on when this change will be made ? I should be able to get to it before the end of this week. Keep in mind that this won't help uDAPL, since filling out this function in the kernel does nothing to get the information to userspace. If this is blocking you, you should be able to fill in reasonable defaults to make progress. I can't imagine an application depends on knowing the exact values of these limits. - R. 
From mshefty at ichips.intel.com Wed Mar 9 09:38:48 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 09 Mar 2005 09:38:48 -0800 Subject: [openib-general] Re: Kernel oops when unloading ib_cm module with latest CM In-Reply-To: <1110362795.4645.16.camel@localhost.localdomain> References: <1110362795.4645.16.camel@localhost.localdomain> Message-ID: <422F34A8.2080200@ichips.intel.com> Hal Rosenstock wrote: > This didn't occur before yesterday's CM change. Yeah, I saw this right before I left work yesterday. I'll have a fix in a couple of hours. - Sean From timur.tabi at ammasso.com Wed Mar 9 09:51:51 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Wed, 09 Mar 2005 11:51:51 -0600 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <1110388190.4647.3.camel@localhost.localdomain> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> Message-ID: <422F37B7.7060609@ammasso.com> Hal Rosenstock wrote: > The latest user space verbs is on the roland-uverbs branch in the > repository (https://openib.org/svn/gen2/branches/roland-uverbs/). > > It will be merged back to the mainline > (https://openib.org/svn/gen2/trunk/src/userspace/) but an earlier > version is there presently. I see that function ibv_lock_range() in libibverbs calls the mlock() system call. mlock() can only be called by a process that has root privileges. Does this mean the user-space verbs support is only available to applications that run as root?
-- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From roland at topspin.com Wed Mar 9 09:56:25 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 09 Mar 2005 09:56:25 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F37B7.7060609@ammasso.com> (Timur Tabi's message of "Wed, 09 Mar 2005 11:51:51 -0600") References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> Message-ID: <52y8cwd7p2.fsf@topspin.com> Timur> I see that function ibv_lock_range() in libibverbs calls Timur> the mlock() system call. mlock() can only be called by a Timur> process that has root privileges. Does this mean the Timur> user-space verbs support is only available to applications Timur> that run as root? Actually this isn't true. Any process can call mlock() (try it and see). - R. From timur.tabi at ammasso.com Wed Mar 9 09:56:23 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Wed, 09 Mar 2005 11:56:23 -0600 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <52y8cwd7p2.fsf@topspin.com> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> Message-ID: <422F38C7.805@ammasso.com> Roland Dreier wrote: > Timur> I see that function ibv_lock_range() in libibverbs calls > Timur> the mlock() system call. mlock() can only be called by a > Timur> process that has root privileges. Does this mean the > Timur> user-space verbs support is only available to applications > Timur> that run as root? > > Actually this isn't true. Any process can call mlock() (try it and see). Since when? man mlock: ERRORS EPERM The calling process does not have appropriate privileges. Only root processes are allowed to lock pages. 
-- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From roland at topspin.com Wed Mar 9 10:01:15 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 09 Mar 2005 10:01:15 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F38C7.805@ammasso.com> (Timur Tabi's message of "Wed, 09 Mar 2005 11:56:23 -0600") References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> Message-ID: <52u0nkd7h0.fsf@topspin.com> Timur> Since when? According to my kernel tree, a change "rlimit-based mlocks for unprivileged users" was committed around August of last year. Timur> man mlock: I guess the man page is out of date. As I said, try it and see. - R. From timur.tabi at ammasso.com Wed Mar 9 10:00:56 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Wed, 09 Mar 2005 12:00:56 -0600 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <52u0nkd7h0.fsf@topspin.com> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> <52u0nkd7h0.fsf@topspin.com> Message-ID: <422F39D8.8050302@ammasso.com> Roland Dreier wrote: > Timur> Since when? > > According to my kernel tree, a change "rlimit-based mlocks for > unprivileged users" was committed around August of last year. Can you tell me which kernel version in particular? 
-- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From bill at strahm.net Wed Mar 9 10:11:17 2005 From: bill at strahm.net (Bill Strahm) Date: Wed, 09 Mar 2005 10:11:17 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F38C7.805@ammasso.com> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> Message-ID: <422F3C45.4010003@strahm.net> Timur Tabi wrote: > Roland Dreier wrote: > >> Timur> I see that function ibv_lock_range() in libibverbs calls >> Timur> the mlock() system call. mlock() can only be called by a >> Timur> process that has root privileges. Does this mean the >> Timur> user-space verbs support is only available to applications >> Timur> that run as root? >> >> Actually this isn't true. Any process can call mlock() (try it and >> see). > > > Since when? > > man mlock: > > ERRORS > > EPERM The calling process does not have appropriate > privileges. Only root processes are allowed to lock pages. > Well, you can CALL it, it just returns an expected error code. Bill From mshefty at ichips.intel.com Wed Mar 9 10:20:11 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 9 Mar 2005 10:20:11 -0800 Subject: [openib-general] [PATCH] [CM] fix unload issue, crash rejecting a REQ after an error Message-ID: <20050309102011.1183256b.mshefty@ichips.intel.com> This patch fixes the CM unload issue added by the previous patch. It should also allow sending a REJ message in response to a REQ after an error has occurred. 
Signed-off-by: Sean Hefty Index: infiniband/core/cm.c =================================================================== --- infiniband/core/cm.c (revision 1965) +++ infiniband/core/cm.c (working copy) @@ -232,7 +232,16 @@ static void cm_set_ah_attr(struct ib_ah_ ah_attr->port_num = port_num; } -static int cm_init_av(struct ib_sa_path_rec *path, struct cm_av *av) +static void cm_init_av_for_response(struct cm_port *port, + struct ib_wc *wc, struct cm_av *av) +{ + av->port = port; + av->pkey_index = wc->pkey_index; + cm_set_ah_attr(&av->ah_attr, port->port_num, wc->slid, wc->sl, + wc->dlid_path_bits); +} + +static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av) { struct cm_device *cm_dev; struct cm_port *port = NULL; @@ -259,7 +268,6 @@ static int cm_init_av(struct ib_sa_path_ return ret; av->port = port; - av->dgid = path->dgid; cm_set_ah_attr(&av->ah_attr, av->port->port_num, path->dlid, path->sl, path->slid & 0x7F); return 0; @@ -819,11 +827,12 @@ int ib_send_cm_req(struct ib_cm_id *cm_i } spin_unlock_irqrestore(&cm_id_priv->lock, flags); - ret = cm_init_av(param->primary_path, &cm_id_priv->av); + ret = cm_init_av_by_path(param->primary_path, &cm_id_priv->av); if (ret) goto out; if (param->alternate_path) { - ret = cm_init_av(param->alternate_path, &cm_id_priv->alt_av); + ret = cm_init_av_by_path(param->alternate_path, + &cm_id_priv->alt_av); if (ret) goto out; } @@ -1012,6 +1021,8 @@ static int cm_req_handler(struct cm_work cm_id_priv = container_of(cm_id, struct cm_id_private, id); cm_id_priv->id.remote_id = req_msg->local_comm_id; + cm_init_av_for_response(work->port, work->mad_recv_wc->wc, + &cm_id_priv->av); cm_id_priv->timewait_info = cm_create_timewait_info( cm_id_priv->id.local_id, cm_id_priv->id.remote_id, @@ -1056,11 +1067,11 @@ static int cm_req_handler(struct cm_work cm_id_priv->id.service_mask = ~0ULL; cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]); - ret = cm_init_av(&work->path[0], &cm_id_priv->av); + ret = 
cm_init_av_by_path(&work->path[0], &cm_id_priv->av); if (ret) goto error3; if (req_msg->alt_local_lid) { - ret = cm_init_av(&work->path[1], &cm_id_priv->alt_av); + ret = cm_init_av_by_path(&work->path[1], &cm_id_priv->alt_av); if (ret) goto error3; } @@ -2287,7 +2298,7 @@ int ib_send_cm_sidr_req(struct ib_cm_id return -EINVAL; cm_id_priv = container_of(cm_id, struct cm_id_private, id); - ret = cm_init_av(param->path, &cm_id_priv->av); + ret = cm_init_av_by_path(param->path, &cm_id_priv->av); if (ret) goto out; @@ -2359,6 +2370,8 @@ static int cm_sidr_req_handler(struct cm wc = work->mad_recv_wc->wc; cm_id_priv->av.dgid.global.subnet_prefix = wc->slid; cm_id_priv->av.dgid.global.interface_id = 0; + cm_init_av_for_response(work->port, work->mad_recv_wc->wc, + &cm_id_priv->av); cm_id_priv->id.remote_id = sidr_req_msg->request_id; cm_id_priv->id.state = IB_CM_SIDR_REQ_RCVD; atomic_inc(&cm_id_priv->work_count); @@ -2383,10 +2396,6 @@ static int cm_sidr_req_handler(struct cm cm_id_priv->id.context = cur_cm_id_priv->id.context; cm_id_priv->id.service_id = sidr_req_msg->service_id; cm_id_priv->id.service_mask = ~0ULL; - cm_id_priv->av.port = work->port; - cm_id_priv->av.pkey_index = wc->pkey_index; - cm_set_ah_attr(&cm_id_priv->av.ah_attr, work->port->port_num, - wc->slid, wc->sl, wc->dlid_path_bits); cm_format_sidr_req_event(work, &cur_cm_id_priv->id); cm_process_work(cm_id_priv, work); @@ -3013,7 +3022,8 @@ static void cm_remove_one(struct ib_devi write_unlock_irqrestore(&cm.device_lock, flags); for (i = 1; i <= device->phys_port_cnt; i++) { - port = &cm_dev->port[i]; + port = &cm_dev->port[i-1]; + ib_dereg_mr(port->mr); ib_modify_port(device, port->port_num, 0, &port_modify); ib_unregister_mad_agent(port->mad_agent); } From mshefty at ichips.intel.com Wed Mar 9 10:25:26 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 09 Mar 2005 10:25:26 -0800 Subject: [openib-general] Re: A Couple of CM Questions In-Reply-To: 
<1110311137.4645.28.camel@localhost.localdomain> References: <1110311137.4645.28.camel@localhost.localdomain> Message-ID: <422F3F96.4050803@ichips.intel.com> Hal Rosenstock wrote: > Hi Sean, > > My main question has to do with an error path in cm_req_handler. If > cm_init_av fails (lines 1098 or 1103), I get the following crash: I don't have an easy way to test this, but the patch that I just submitted for the CM should correct this problem. Please let me know if you run into any more problems, or if this patch doesn't work. - Sean From iod00d at hp.com Wed Mar 9 10:30:55 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 9 Mar 2005 10:30:55 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F38C7.805@ammasso.com> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> Message-ID: <20050309183055.GH10338@esmail.cup.hp.com> On Wed, Mar 09, 2005 at 11:56:23AM -0600, Timur Tabi wrote: > man mlock: > > ERRORS > > EPERM The calling process does not have appropriate > privileges. Only root processes are allowed to lock pages. mine says: EPERM The calling process has insufficient privilege to call mlock. Under Linux the CAP_IPC_LOCK capability is required. Assuming roland tried it, I'll guess that all processes are given CAP_IPC_LOCK by default. 
Further, ulimit -a output is enlightening: grundler at gsyprf11:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited file size (blocks, -f) unlimited max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) unlimited virtual memory (kbytes, -v) unlimited grant From halr at voltaire.com Wed Mar 9 11:20:18 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Mar 2005 14:20:18 -0500 Subject: [openib-general] Re: A Couple of CM Questions In-Reply-To: <422F3F96.4050803@ichips.intel.com> References: <1110311137.4645.28.camel@localhost.localdomain> <422F3F96.4050803@ichips.intel.com> Message-ID: <1110396018.4645.33.camel@localhost.localdomain> On Wed, 2005-03-09 at 13:25, Sean Hefty wrote: > > My main question has to do with an error path in cm_req_handler. If > > cm_init_av fails (lines 1098 or 1103), I get the following crash: > > I don't have an easy way to test this, but the patch that I just > submitted for the CM should correct this problem. Please let me know > if you run into any more problems, or if this patch doesn't work. CM unload problem appears to be fixed. Also, this fixes the crash when this occurs but the removal of the CM module now hangs. Any easy way to reproduce this is to clear out the path record DGID before sending REP. Thanks. 
-- Hal From mshefty at ichips.intel.com Wed Mar 9 11:31:03 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 09 Mar 2005 11:31:03 -0800 Subject: [openib-general] Re: A Couple of CM Questions In-Reply-To: <1110396018.4645.33.camel@localhost.localdomain> References: <1110311137.4645.28.camel@localhost.localdomain> <422F3F96.4050803@ichips.intel.com> <1110396018.4645.33.camel@localhost.localdomain> Message-ID: <422F4EF7.9070500@ichips.intel.com> Hal Rosenstock wrote: > Also, this fixes the crash when this occurs but the removal of the CM > module now hangs. > > Any easy way to reproduce this is to clear out the path record DGID > before sending REP. Thanks for the info. I'll try to isolate what's causing the hang. - Sean From roland at topspin.com Wed Mar 9 11:37:45 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 09 Mar 2005 11:37:45 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F39D8.8050302@ammasso.com> (Timur Tabi's message of "Wed, 09 Mar 2005 12:00:56 -0600") References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> <52u0nkd7h0.fsf@topspin.com> <422F39D8.8050302@ammasso.com> Message-ID: <52ll8wd306.fsf@topspin.com> Timur> Can you tell me which kernel version in particular? Sorry, I don't have that info handy, but if you have a bk tree it should be easy to figure out. Or searching the web for the string "rlimit-based mlocks for unprivileged users" should work. - R. 
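[Editorial sketch, not part of any posting.] The "try it and see" suggestion in the mlock() exchange can be checked with a few lines of C. What follows is only a minimal sketch under stated assumptions: the helper names try_mlock_one_page and print_memlock_limit are made up for illustration, and the only system interfaces used are the standard mlock(), munlock(), getrlimit() and posix_memalign() calls. On a kernel with rlimit-based mlock, an unprivileged process would be expected to get 0 back as long as one page fits within its RLIMIT_MEMLOCK soft limit; on an older kernel the same call would be expected to fail with EPERM.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper (not from the thread): lock and unlock one page.
 * Returns 0 on success, or the errno value of the failing call. */
static int try_mlock_one_page(void)
{
	long page = sysconf(_SC_PAGESIZE);
	void *buf;
	int err = 0;

	if (page <= 0)
		return EINVAL;
	/* Page-aligned buffer so exactly one page is locked. */
	if (posix_memalign(&buf, (size_t) page, (size_t) page))
		return ENOMEM;

	if (mlock(buf, (size_t) page))
		err = errno;	/* EPERM, ENOMEM or EAGAIN are the usual failures */
	else
		munlock(buf, (size_t) page);

	free(buf);
	return err;
}

/* Hypothetical helper: print the RLIMIT_MEMLOCK soft and hard limits
 * in bytes. Returns 0 on success, or errno if getrlimit() fails. */
static int print_memlock_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl))
		return errno;
	printf("RLIMIT_MEMLOCK soft=%llu hard=%llu\n",
	       (unsigned long long) rl.rlim_cur,
	       (unsigned long long) rl.rlim_max);
	return 0;
}
```

Calling both helpers from a small main() as an ordinary user, and comparing the printed limit with the "max locked memory" line of ulimit -a, should show which of the two behaviours described in the thread a given kernel implements.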
From roland at topspin.com Wed Mar 9 11:41:46 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 09 Mar 2005 11:41:46 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <20050309183055.GH10338@esmail.cup.hp.com> (Grant Grundler's message of "Wed, 9 Mar 2005 10:30:55 -0800") References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> <20050309183055.GH10338@esmail.cup.hp.com> Message-ID: <52fyz4d2th.fsf@topspin.com> Actually, my mlock(2) man page says: EPERM (Linux 2.6.9 and later) the caller was not privileged (CAP_IPC_LOCK) and its RLIMIT_MEMLOCK soft resource limit was 0. EPERM (Linux 2.6.8 and earlier) The calling process has insufficient privilege to call munlockall. Under Linux the CAP_IPC_LOCK capability is required. so I guess the change was made in version 2.6.9. - R. From mshefty at ichips.intel.com Wed Mar 9 13:49:49 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 09 Mar 2005 13:49:49 -0800 Subject: [openib-general] Re: A Couple of CM Questions In-Reply-To: <1110396018.4645.33.camel@localhost.localdomain> References: <1110311137.4645.28.camel@localhost.localdomain> <422F3F96.4050803@ichips.intel.com> <1110396018.4645.33.camel@localhost.localdomain> Message-ID: <422F6F7D.4040000@ichips.intel.com> Hal Rosenstock wrote: >>>My main question has to do with an error path in cm_req_handler. If >>>cm_init_av fails (lines 1098 or 1103), I get the following crash: >> > Also, this fixes the crash when this occurs but the removal of the CM > module now hangs. Hal, how many CPUs is your system running on? 
- Sean From tduffy at sun.com Wed Mar 9 15:22:31 2005 From: tduffy at sun.com (Tom Duffy) Date: Wed, 09 Mar 2005 15:22:31 -0800 Subject: [openib-general] http://openib.org/downloads/ In-Reply-To: <422F3C45.4010003@strahm.net> References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <422F2B47.80809@ammasso.com> <1110388190.4647.3.camel@localhost.localdomain> <422F37B7.7060609@ammasso.com> <52y8cwd7p2.fsf@topspin.com> <422F38C7.805@ammasso.com> <422F3C45.4010003@strahm.net> Message-ID: <1110410551.7648.8.camel@duffman> On Wed, 2005-03-09 at 10:11 -0800, Bill Strahm wrote: > Well, you can CALL it, it just returns an expected error code. Bill took a big dose of LFP's for breakfast today as required by the IETF ;) -tduffy From mshefty at ichips.intel.com Wed Mar 9 16:37:44 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 09 Mar 2005 16:37:44 -0800 Subject: [openib-general] Re: A Couple of CM Questions In-Reply-To: <1110396018.4645.33.camel@localhost.localdomain> References: <1110311137.4645.28.camel@localhost.localdomain> <422F3F96.4050803@ichips.intel.com> <1110396018.4645.33.camel@localhost.localdomain> Message-ID: <422F96D8.3070207@ichips.intel.com> Hal Rosenstock wrote: >>>My main question has to do with an error path in cm_req_handler. If >>>cm_init_av fails (lines 1098 or 1103), I get the following crash: >> > Also, this fixes the crash when this occurs but the removal of the CM > module now hangs. > > Any easy way to reproduce this is to clear out the path record DGID > before sending REP. an update... I've been able to reproduce this, and what's happening is that the cm_id that the CM created to handle the REQ is hanging waiting for its reference count to go to 0, but I'm not entirely sure why yet.
The REQ is received and processed in a CM controlled work queue. After seeing the error, the CM sends a REJ message to the sender. (The code to set the proper reject code is not there yet, but a REJ should still be delivered.) As a result of sending the REJ, the reference count on the cm_id is incremented. The CM then waits in the CM work queue thread for the send to complete, which would decrement the reference count. The send completion should be processed from the context of the MAD layer controlled work queue, so I'm not sure why it's not getting called. My planned long term fix is to allow the REJ to be sent without holding a reference on the cm_id. But there's a similar issue sending a DREQ or DREP when destroying a cm_id. So, I'm trying to understand this more. - Sean From mplee at hpcn.ca.sandia.gov Thu Mar 10 00:05:54 2005 From: mplee at hpcn.ca.sandia.gov (Michael Lee) Date: Thu, 10 Mar 2005 00:05:54 -0800 Subject: [openib-general] openib.org services unavailable on 3/10/2005 Message-ID: <1110441954.9018.9.camel@acheron.ca.sandia.gov> Due to a last-minute planned power outage in our computer lab, the openib.org server will be unavailable from 7:00am to 7:30am PST on Thursday, 3/10/2005. -- Michael Lee HPCN Sandia From mst at mellanox.co.il Thu Mar 10 02:42:11 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 Mar 2005 12:42:11 +0200 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <52acpotwd9.fsf@topspin.com> References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> Message-ID: <20050310104211.GF2586@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ANNOUNCE: First usable version of userspace verbs > > Michael> Interestingly, if I rebuild libmthca with -O3 compiler > Michael> flag, the pingpong program does not make progress. > Michael> Building libibverbs or the test itself with -O3 has no > Michael> such effect. 
> > I can't reproduce this on either i386 or x86-64 (Intel Nocona system). > > $ gcc --version > gcc (GCC) 3.3.5 (Debian 1:3.3.5-8) > Copyright (C) 2003 Free Software Foundation, Inc. > This is free software; see the source for copying conditions. There is NO > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. > > - R. > I think I have discovered the problem. It seems that with -O3 my compiler may reorder the WQE (and possibly CQE) write with respect to the doorbell. This won't happen on i386 with consistent i/o ordering since the doorbell is done in assembly, and probably not on other 32 bit architectures since the mutex is likely to include a memory barrier. Applying the following patch fixes the problem for me for x86_64. Signed-off-by: Michael S. Tsirkin Index: src/userspace/libmthca/src/doorbell.h =================================================================== --- src/userspace/libmthca/src/doorbell.h (revision 1972) +++ src/userspace/libmthca/src/doorbell.h (working copy) @@ -56,6 +56,9 @@ static inline void mthca_write64(uint32_ static inline void mthca_write64(uint32_t val[2], struct mthca_context *ctx, int offset) { + /* Sufficient for x86_64. + * Other architectures may need a memory barrier here. */ + asm volatile("" ::: "memory"); *(volatile uint64_t *) (ctx->uar + offset) = *(uint64_t *) val; } -- MST - Michael S. Tsirkin From mst at mellanox.co.il Thu Mar 10 02:47:22 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 Mar 2005 12:47:22 +0200 Subject: [openib-general] uverbs: pthread_mutex -> pthread_spinlock ? Message-ID: <20050310104722.GG2586@mellanox.co.il> I am able to shave about 200ns off the rdma post latency, by using pthread_spinlock instead of pthread_mutex for protecting the qp post op in libmthca. I'm aware of course that a context switch when spinlock is held may waste a whole timeslice, but maybe for short operations such as this it's reasonable to use spinlocks? -- MST - Michael S.
Tsirkin From mst at mellanox.co.il Thu Mar 10 04:19:41 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 Mar 2005 14:19:41 +0200 Subject: [openib-general] [PATCH] AIO code to use get_user_pages Message-ID: <20050310121941.GI2586@mellanox.co.il> Well, I went ahead and modified the AIO code to use get_user_pages. Since we don't yet have fmr support, this patch is untested, but it does compile :) Please let me know what you think. Another approach (instead of waiting for fmr support) could be to add a fall-back option to use a regular memory region. A todo item is to add zcopy support for synchronous operations. Signed-off-by: Michael S. Tsirkin Index: sdp_send.c =================================================================== --- sdp_send.c (revision 1972) +++ sdp_send.c (working copy) @@ -2195,6 +2195,7 @@ skip: /* entry point for IOCB based tran iocb->req = req; iocb->key = req->ki_key; iocb->addr = (unsigned long)msg->msg_iov->iov_base; + iocb->is_receive = 0; req->ki_cancel = sdp_inet_write_cancel; Index: sdp_recv.c =================================================================== --- sdp_recv.c (revision 1972) +++ sdp_recv.c (working copy) @@ -1459,6 +1459,7 @@ int sdp_inet_recv(struct kiocb *req, st iocb->req = req; iocb->key = req->ki_key; iocb->addr = (unsigned long)msg->msg_iov->iov_base; + iocb->is_receive = 1; req->ki_cancel = sdp_inet_read_cancel; Index: sdp_iocb.c =================================================================== --- sdp_iocb.c (revision 1972) +++ sdp_iocb.c (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. * * This software is available to you under a choice of one of two * licenses.
You may choose to be licensed under the terms of the GNU @@ -31,89 +32,107 @@ * * $Id$ */ - +#include #include "sdp_main.h" static kmem_cache_t *sdp_iocb_cache = NULL; -/* - * memory locking functions - */ -#include - -typedef int (*do_mlock_ptr_t)(unsigned long, size_t, int); -static do_mlock_ptr_t mlock_ptr = NULL; +static void sdp_copy_one_page(struct page *from, struct page* to, + unsigned long iocb_addr, size_t iocb_size, + unsigned long uaddr) +{ + size_t size_left = iocb_addr + iocb_size - uaddr; + size_t size = min(size_left,PAGE_SIZE); + unsigned long offset = uaddr % PAGE_SIZE; + void* fptr; + void* tptr; + + fptr = kmap_atomic(from, KM_USER0); + tptr = kmap_atomic(to, KM_USER0); + + memcpy(tptr + offset, fptr + offset, size); + + kunmap_atomic(tptr, KM_USER0); + kunmap_atomic(fptr, KM_USER0); + set_page_dirty_lock(to); +} /* - * do_iocb_unlock - unlock the memory for an IOCB + * sdp_iocb_unlock - unlock the memory for an IOCB + * Copy if pages moved since. */ -static int do_iocb_unlock(struct sdpc_iocb *iocb) +int sdp_iocb_unlock(struct sdpc_iocb *iocb) { - struct vm_area_struct *vma; + int result = 0; + struct page ** pages = NULL; + unsigned long uaddr; + int i; - vma = find_vma(iocb->mm, (iocb->addr & PAGE_MASK)); - if (!vma) - sdp_warn("No VMA for IOCB <%lx:%Zu> unlock", - iocb->addr, iocb->size); + if (!(iocb->flags & SDP_IOCB_F_LOCKED)) + return 0; - while (vma) { - sdp_dbg_data(NULL, - "unmark <%lx> <%p> <%08lx:%08lx> <%08lx> <%ld>", - iocb->addr, vma, vma->vm_start, vma->vm_end, - vma->vm_flags, (long)vma->vm_private_data); - - spin_lock(&iocb->mm->page_table_lock); - /* - * if there are no more references to the vma - */ - vma->vm_private_data--; - - if (!vma->vm_private_data) { - /* - * modify VM flags. 
- */ - vma->vm_flags &= ~(VM_DONTCOPY|VM_LOCKED); - /* - * adjust locked page count - */ - vma->vm_mm->locked_vm -= ((vma->vm_end - - vma->vm_start) >> - PAGE_SHIFT); - } + /* For read, unlock and we are done */ + if (!iocb->is_receive) { + for (i = 0;i < iocb->page_count; ++i) + page_cache_release(iocb->page_array[i]); + goto done; + } - spin_unlock(&iocb->mm->page_table_lock); - /* - * continue if the buffer continues onto the next vma - */ - if ((iocb->addr + iocb->size) > vma->vm_end) - vma = vma->vm_next; - else - vma = NULL; + /* For write, we must check the virtual pages did not get remapped */ + + /* As an optimisation (to avoid scanning the vma tree each time), + * try to get all pages in one go. */ + /* TODO: use cache for allocations? Allocate by chunks? */ + + pages = kmalloc((sizeof(struct page *) * + iocb->page_count), GFP_KERNEL); + + down_read(&iocb->mm->mmap_sem); + + if (pages) { + result=get_user_pages(iocb->tsk, iocb->mm, + iocb->addr, + iocb->page_count , iocb->is_receive, 0, + pages, NULL); + + if (result != iocb->page_count) { + kfree(pages); + pages = NULL; + } } - return 0; -} + for (i = 0, uaddr = iocb->addr; i < iocb->page_count; + ++i, uaddr = (uaddr & PAGE_MASK) + PAGE_SIZE) + { + struct page* page; + set_page_dirty_lock(iocb->page_array[i]); + + if (pages) + page = pages[i]; + else { + result=get_user_pages(iocb->tsk, iocb->mm, + uaddr & PAGE_MASK, + 1 , 1, 0, &page, NULL); + if (result != 1) { + page = NULL; + } + } -/* - * sdp_iocb_unlock - unlock the memory for an IOCB - */ -int sdp_iocb_unlock(struct sdpc_iocb *iocb) -{ - int result; + if (page && iocb->page_array[i] != page) + sdp_copy_one_page(iocb->page_array[i], page, + iocb->addr, iocb->size, uaddr); - /* - * check if IOCB is locked. - */ - if (!(iocb->flags & SDP_IOCB_F_LOCKED)) - return 0; - /* - * spin lock since this could be from interrupt context. 
- */ - down_write(&iocb->mm->mmap_sem); - - result = do_iocb_unlock(iocb); + if (page) + page_cache_release(page); + page_cache_release(iocb->page_array[i]); + } + + up_read(&iocb->mm->mmap_sem); - up_write(&iocb->mm->mmap_sem); + if (pages) + kfree(pages); + +done: kfree(iocb->page_array); kfree(iocb->addr_array); @@ -121,37 +140,41 @@ int sdp_iocb_unlock(struct sdpc_iocb *io iocb->page_array = NULL; iocb->addr_array = NULL; iocb->mm = NULL; - /* - * mark IOCB unlocked. - */ + iocb->tsk = NULL; + iocb->flags &= ~SDP_IOCB_F_LOCKED; return result; } /* - * sdp_iocb_page_save - save page information for an IOCB + * sdp_iocb_lock - lock the memory for an IOCB + * We do not take a reference on the mm, AIO handles this for us. */ -static int sdp_iocb_page_save(struct sdpc_iocb *iocb) +int sdp_iocb_lock(struct sdpc_iocb *iocb) { - unsigned int counter; + int result = -ENOMEM; unsigned long addr; size_t size; - int result = -ENOMEM; - struct page *page; - unsigned long pfn; - pgd_t *pgd; - pud_t *pud; - pmd_t *pmd; - pte_t *ptep; - pte_t pte; + int i; + /* + * iocb->addr - buffer start address + * iocb->size - buffer length + * addr - page aligned + * size - page multiple + */ + addr = iocb->addr & PAGE_MASK; + size = PAGE_ALIGN(iocb->size + (iocb->addr & ~PAGE_MASK)); - if (iocb->page_count <= 0 || iocb->size <= 0 || !iocb->addr) - return -EINVAL; + iocb->page_offset = iocb->addr - addr; + + iocb->page_count = size >> PAGE_SHIFT; /* * create array to hold page value which are later needed to register * the buffer with the HCA */ + + /* TODO: use cache for allocations? Allocate by chunks? 
*/ iocb->addr_array = kmalloc((sizeof(u64) * iocb->page_count), GFP_KERNEL); if (!iocb->addr_array) @@ -161,259 +184,41 @@ static int sdp_iocb_page_save(struct sdp GFP_KERNEL); if (!iocb->page_array) goto err_page; - /* - * iocb->addr - buffer start address - * iocb->size - buffer length - * addr - page aligned - * size - page multiple - */ - addr = iocb->addr & PAGE_MASK; - size = PAGE_ALIGN(iocb->size + (iocb->addr & ~PAGE_MASK)); - iocb->page_offset = iocb->addr - addr; - /* - * Find pages used within the buffer which will then be registered - * for RDMA - */ - spin_lock(&iocb->mm->page_table_lock); + down_write(&current->mm->mmap_sem); - for (counter = 0; - size > 0; - counter++, addr += PAGE_SIZE, size -= PAGE_SIZE) { - pgd = pgd_offset_gate(iocb->mm, addr); - if (!pgd || pgd_none(*pgd)) - break; - - pud = pud_offset(pgd, addr); - if (!pud || pud_none(*pud)) - break; - - pmd = pmd_offset(pud, addr); - if (!pmd || pmd_none(*pmd)) - break; - - ptep = pte_offset_map(pmd, addr); - if (!ptep) - break; - - pte = *ptep; - pte_unmap(ptep); - - if (!pte_present(pte)) - break; - - pfn = pte_pfn(pte); - if (!pfn_valid(pfn)) - break; - - page = pfn_to_page(pfn); - - iocb->page_array[counter] = page; - iocb->addr_array[counter] = page_to_phys(page); + result=get_user_pages(current, current->mm, iocb->addr, + iocb->page_count , iocb->is_receive, 0, + iocb->page_array, NULL); + + up_read(&current->mm->mmap_sem); + + if (result != iocb->page_count) { + sdp_dbg_err("unable to lock <%lx:%Zu> error <%d> <%d>", + iocb->addr, iocb->size, result, iocb->page_count); + goto err_get; } - spin_unlock(&iocb->mm->page_table_lock); - - if (size > 0) { - result = -EFAULT; - goto err_find; - } - - return 0; -err_find: - - kfree(iocb->page_array); - iocb->page_array = NULL; -err_page: - - kfree(iocb->addr_array); - iocb->addr_array = NULL; -err_addr: - - return result; -} - -/* - * sdp_iocb_lock - lock the memory for an IOCB - */ -int sdp_iocb_lock(struct sdpc_iocb *iocb) -{ - struct vm_area_struct
*vma; - kernel_cap_t real_cap; - unsigned long limit; - int result = -ENOMEM; - unsigned long addr; - size_t size; - - /* - * mark IOCB as locked. We do not take a reference on the mm, AIO - * handles this for us. - */ iocb->flags |= SDP_IOCB_F_LOCKED; iocb->mm = current->mm; - /* - * save and raise capabilities - */ - real_cap = cap_t(current->cap_effective); - cap_raise(current->cap_effective, CAP_IPC_LOCK); - - size = PAGE_ALIGN(iocb->size + (iocb->addr & ~PAGE_MASK)); - addr = iocb->addr & PAGE_MASK; - - iocb->page_count = size >> PAGE_SHIFT; + iocb->tsk = current; - limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur; - limit >>= PAGE_SHIFT; - /* - * lock the mm, if within the limit lock the address range. - */ - down_write(&iocb->mm->mmap_sem); - if (!((iocb->page_count + current->mm->locked_vm) > limit)) - result = (*mlock_ptr)(addr, size, 1); - /* - * process result - */ - if (result) { - sdp_dbg_err("VMA lock <%lx:%Zu> error <%d> <%d:%lu:%lu>", - iocb->addr, iocb->size, result, - iocb->page_count, iocb->mm->locked_vm, limit); - goto err_lock; + for (i = 0; i< iocb->page_count; ++i) { + iocb->addr_array[i] = page_to_phys(iocb->page_array[i]); } - /* - * look up the head of the vma queue, loop through the vmas, marking - * them do not copy, reference counting, and saving them. - */ - vma = find_vma(iocb->mm, addr); - if (!vma) - /* - * sanity check. - */ - sdp_warn("No VMA for IOCB! <%lx:%Zu> lock", - iocb->addr, iocb->size); - - while (vma) { - spin_lock(&iocb->mm->page_table_lock); - - if (!(VM_LOCKED & vma->vm_flags)) - sdp_warn("Unlocked vma! <%08lx>", vma->vm_flags); - - if (PAGE_SIZE < (unsigned long)vma->vm_private_data) - sdp_dbg_err("VMA: private daya in use! 
<%08lx>", - (unsigned long)vma->vm_private_data); - - vma->vm_flags |= VM_DONTCOPY; - vma->vm_private_data++; - - spin_unlock(&iocb->mm->page_table_lock); - - sdp_dbg_data(NULL, - "mark <%lx> <0x%p> <%08lx:%08lx> <%08lx> <%ld>", - iocb->addr, vma, vma->vm_start, vma->vm_end, - vma->vm_flags, (long)vma->vm_private_data); - - if ((addr + size) > vma->vm_end) - vma = vma->vm_next; - else - vma = NULL; - } - - result = sdp_iocb_page_save(iocb); - if (result) { - sdp_dbg_err("Error <%d> saving pages for IOCB <%lx:%Zu>", - result, iocb->addr, iocb->size); - goto err_save; - } - - up_write(&iocb->mm->mmap_sem); - cap_t(current->cap_effective) = real_cap; return 0; -err_save: - - (void)do_iocb_unlock(iocb); -err_lock: - /* - * unlock the mm and restore capabilities. - */ - up_write(&iocb->mm->mmap_sem); - cap_t(current->cap_effective) = real_cap; - - iocb->flags &= ~SDP_IOCB_F_LOCKED; - iocb->mm = NULL; +err_get: + kfree(iocb->page_array); +err_page: + kfree(iocb->addr_array); +err_addr: return result; } /* - * IOCB memory locking init functions - */ -struct kallsym_iter { - loff_t pos; - struct module *owner; - unsigned long value; - unsigned int nameoff; /* If iterating in core kernel symbols */ - char type; - char name[128]; -}; - -/* - * sdp_mem_lock_init - initialize the userspace memory locking - */ -static int sdp_mem_lock_init(void) -{ - struct file *kallsyms; - struct seq_file *seq; - struct kallsym_iter *iter; - loff_t pos = 0; - int ret = -EINVAL; - - sdp_dbg_init("Memory Locking initialization."); - - kallsyms = filp_open("/proc/kallsyms", O_RDONLY, 0); - if (!kallsyms) { - sdp_warn("Failed to open /proc/kallsyms"); - goto done; - } - - seq = (struct seq_file *)kallsyms->private_data; - if (!seq) { - sdp_warn("Failed to fetch sequential file."); - goto err_close; - } - - for (iter = seq->op->start(seq, &pos); - iter != NULL; - iter = seq->op->next(seq, iter, &pos)) - if (!strcmp(iter->name, "do_mlock")) - mlock_ptr = (do_mlock_ptr_t)iter->value; - - if 
(!mlock_ptr) - sdp_warn("Failed to find lock pointer."); - else - ret = 0; - -err_close: - filp_close(kallsyms, NULL); -done: - return ret; -} - -/* - * sdp_mem_lock_cleanup - cleanup the memory locking tables - */ -static int sdp_mem_lock_cleanup(void) -{ - sdp_dbg_init("Memory Locking cleanup."); - /* - * null out entries. - */ - mlock_ptr = NULL; - - return 0; -} - -/* * IOCB memory registration functions */ @@ -831,28 +636,12 @@ void sdp_iocb_q_clear(struct sdpc_iocb_q } /* - * primary initialization/cleanup functions - */ - -/* * sdp_main_iocb_init - initialize the advertisment caches */ int sdp_main_iocb_init(void) { - int result; - sdp_dbg_init("IOCB cache initialization."); - /* - * initialize locking code. - */ - result = sdp_mem_lock_init(); - if (result < 0) { - sdp_warn("Error <%d> initializing memory locking.", result); - return result; - } - /* - * initialize the caches only once. - */ + if (sdp_iocb_cache) { sdp_warn("IOCB caches already initialized."); return -EINVAL; @@ -862,15 +651,10 @@ int sdp_main_iocb_init(void) sizeof(struct sdpc_iocb), 0, SLAB_HWCACHE_ALIGN, NULL, NULL); - if (!sdp_iocb_cache) { - result = -ENOMEM; - goto error_iocb_c; - } + if (!sdp_iocb_cache) + return -ENOMEM; return 0; -error_iocb_c: - (void)sdp_mem_lock_cleanup(); - return result; } /* @@ -879,16 +663,6 @@ error_iocb_c: void sdp_main_iocb_cleanup(void) { sdp_dbg_init("IOCB cache cleanup."); - /* - * cleanup the caches - */ kmem_cache_destroy(sdp_iocb_cache); - /* - * null out entries. - */ sdp_iocb_cache = NULL; - /* - * cleanup memory locking - */ - (void)sdp_mem_lock_cleanup(); } Index: sdp_iocb.h =================================================================== --- sdp_iocb.h (revision 1972) +++ sdp_iocb.h (working copy) @@ -99,9 +99,11 @@ struct sdpc_iocb { /* * page list. 
data for locking/registering userspace */ - struct mm_struct *mm; /* user mm struct */ - unsigned long addr; /* user space address */ - size_t size; /* total size of the user buffer */ + struct mm_struct *mm; /* user mm struct */ + struct task_struct *tsk; + unsigned long addr; /* user space address */ + size_t size; /* total size of the user buffer */ + int is_receive; struct page **page_array; /* list of page structure pointers. */ u64 *addr_array; /* list of physical page addresses. */ -- MST - Michael S. Tsirkin From tziporet at mellanox.co.il Thu Mar 10 04:18:26 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 10 Mar 2005 14:18:26 +0200 Subject: [openib-general] uverbs: pthread_mutex -> pthread_spinlock ? Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF104@mtlex01.yok.mtl.com> In VAPI we used spinlocks for this reason on all data-path verbs and it gave us better performance. Tziporet -----Original Message----- From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] Sent: Thursday, March 10, 2005 12:47 PM To: openib-general at openib.org Subject: [openib-general] uverbs: pthread_mutex -> pthread_spinlock ? I am able to shave about 200ns off the rdma post latency by using pthread_spinlock instead of pthread_mutex for protecting the qp post op in libmthca. I'm aware of course that a context switch while the spinlock is held may waste a whole timeslice, but maybe for short operations such as this it's reasonable to use spinlocks? -- MST - Michael S. Tsirkin From mst at mellanox.co.il Thu Mar 10 04:31:29 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 10 Mar 2005 14:31:29 +0200 Subject: [openib-general] [PATCH] uverbs rdma example (updated) In-Reply-To: <20050309122700.GA2352@mellanox.co.il> References: <20050309122700.GA2352@mellanox.co.il> Message-ID: <20050310123129.GA12542@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: [PATCH] uverbs rdma example > > Here is a small test for the rdma functionality. > I based it on the pingpong test, the main change being polling on data > instead of receive completions. > > This is useful as an example of using rdma, and is also useful > as a post send latency benchmark, for tuning (nicer than the send test > in that it lets us measure post send separately from poll cq). > > Code is originally based on the pingpong test. > I intentionally did not rename functions from pingpong_ to rdma_ > to make it easier to share some code later if we decide it is useful. > > [...] That code had a typo, and some whitespace badness. The sscanf result also has to be checked. Since that patch wasn't yet committed, here's an updated version to replace it. Let me know what you think. Signed-off-by: Michael S.
Tsirkin Index: Makefile.am =================================================================== --- Makefile.am (revision 1970) +++ Makefile.am (working copy) @@ -20,7 +20,8 @@ src_libibverbs_la_LDFLAGS = -version-inf src_libibverbs_la_DEPENDENCIES = $(srcdir)/src/libibverbs.map bin_PROGRAMS = examples/ibv_devices examples/ibv_asyncwatch \ - examples/ibv_pingpong examples/ibv_ud_pingpong + examples/ibv_pingpong examples/ibv_ud_pingpong \ + examples/ibv_rdma examples_ibv_devices_SOURCES = examples/device_list.c examples_ibv_devices_LDADD = $(top_builddir)/src/libibverbs.la examples_ibv_pingpong_SOURCES = examples/pingpong.c @@ -29,6 +30,8 @@ examples_ibv_ud_pingpong_SOURCES = examp examples_ibv_ud_pingpong_LDADD = $(top_builddir)/src/libibverbs.la examples_ibv_asyncwatch_SOURCES = examples/asyncwatch.c examples_ibv_asyncwatch_LDADD = $(top_builddir)/src/libibverbs.la +examples_ibv_rdma_SOURCES = examples/rdma.c +examples_ibv_rdma_LDADD = $(top_builddir)/src/libibverbs.la libibverbsincludedir = $(includedir)/infiniband Index: examples/rdma.c =================================================================== --- examples/rdma.c (revision 0) +++ examples/rdma.c (revision 0) @@ -0,0 +1,698 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +enum { + PINGPONG_RDMA_WRID = 3, +}; + +static int page_size; + +struct pingpong_context { + struct ibv_context *context; + struct ibv_pd *pd; + struct ibv_mr *mr; + struct ibv_cq *cq; + struct ibv_qp *qp; + void *buf; + volatile char *post_buf; + volatile char *poll_buf; + int size; + int rx_depth; + int tx_depth; +}; + +struct pingpong_dest { + int lid; + int qpn; + int psn; + unsigned rkey; + unsigned long long vaddr; +}; + +/* + * pp_get_local_lid() uses a pretty bogus method for finding the LID + * of a local port. Please don't copy this into your app (or if you + * do, please rip it out soon). 
+ */ +static uint16_t pp_get_local_lid(struct ibv_device *dev, int port) +{ + char path[256]; + char val[16]; + char *name; + + if (sysfs_get_mnt_path(path, sizeof path)) { + fprintf(stderr, "Couldn't find sysfs mount.\n"); + return 0; + } + + asprintf(&name, "%s/class/infiniband/%s/ports/%d/lid", path, + ibv_get_device_name(dev), port); + + if (sysfs_read_attribute_value(name, val, sizeof val)) { + fprintf(stderr, "Couldn't read LID at %s\n", name); + return 0; + } + + return strtol(val, NULL, 0); +} + +static int pp_client_connect(const char *servername, int port) +{ + struct addrinfo *res, *t; + struct addrinfo hints = { + .ai_family = AF_UNSPEC, + .ai_socktype = SOCK_STREAM + }; + char *service; + int n; + int sockfd = -1; + + asprintf(&service, "%d", port); + n = getaddrinfo(servername, service, &hints, &res); + + if (n < 0) { + fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), servername, port); + return n; + } + + for (t = res; t; t = t->ai_next) { + sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); + if (sockfd >= 0) { + if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) + break; + close(sockfd); + sockfd = -1; + } + } + + freeaddrinfo(res); + + if (sockfd < 0) { + fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port); + return sockfd; + } + return sockfd; +} + +struct pingpong_dest * pp_client_exch_dest(int sockfd, + const struct pingpong_dest *my_dest) +{ + struct pingpong_dest *rem_dest = NULL; + char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; + int parsed; + + sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, + my_dest->psn,my_dest->rkey,my_dest->vaddr); + if (write(sockfd, msg, sizeof msg) != sizeof msg) { + perror("client write"); + fprintf(stderr, "Couldn't send local address\n"); + goto out; + } + + if (read(sockfd, msg, sizeof msg) != sizeof msg) { + perror("client read"); + fprintf(stderr, "Couldn't read remote address\n"); + goto out; + } + + rem_dest = malloc(sizeof *rem_dest); + if 
(!rem_dest) + goto out; + + parsed = sscanf(msg, "%x:%x:%x:%x:%Lx", &rem_dest->lid, &rem_dest->qpn, + &rem_dest->psn,&rem_dest->rkey,&rem_dest->vaddr); + + if (parsed != 5) { + fprintf(stderr, "Couldn't parse line <%.*s>\n",(int)sizeof msg, + msg); + free(rem_dest); + rem_dest = NULL; + goto out; + } +out: + return rem_dest; +} + +int pp_server_connect(int port) +{ + struct addrinfo *res, *t; + struct addrinfo hints = { + .ai_flags = AI_PASSIVE, + .ai_family = AF_UNSPEC, + .ai_socktype = SOCK_STREAM + }; + char *service; + int sockfd = -1, connfd; + int n; + + asprintf(&service, "%d", port); + n = getaddrinfo(NULL, service, &hints, &res); + + if (n < 0) { + fprintf(stderr, "%s for port %d\n", gai_strerror(n), port); + return n; + } + + for (t = res; t; t = t->ai_next) { + sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); + if (sockfd >= 0) { + n = 1; + + setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &n, sizeof n); + + if (!bind(sockfd, t->ai_addr, t->ai_addrlen)) + break; + close(sockfd); + sockfd = -1; + } + } + + freeaddrinfo(res); + + if (sockfd < 0) { + fprintf(stderr, "Couldn't listen to port %d\n", port); + return sockfd; + } + + listen(sockfd, 1); + connfd = accept(sockfd, NULL, 0); + if (connfd < 0) { + perror("server accept"); + fprintf(stderr, "accept() failed\n"); + close(sockfd); + return connfd; + } + + close(sockfd); + return connfd; +} + +static struct pingpong_dest *pp_server_exch_dest(int connfd, const struct pingpong_dest *my_dest) +{ + char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; + struct pingpong_dest *rem_dest = NULL; + int parsed; + int n; + + n = read(connfd, msg, sizeof msg); + if (n != sizeof msg) { + perror("server read"); + fprintf(stderr, "%d/%d: Couldn't read remote address\n", n, (int) sizeof msg); + goto out; + } + + rem_dest = malloc(sizeof *rem_dest); + if (!rem_dest) + goto out; + + parsed = sscanf(msg, "%x:%x:%x:%x:%Lx", &rem_dest->lid, &rem_dest->qpn, + &rem_dest->psn, &rem_dest->rkey, 
&rem_dest->vaddr); + if (parsed != 5) { + fprintf(stderr, "Couldn't parse line <%.*s>\n",(int)sizeof msg, + msg); + free(rem_dest); + rem_dest = NULL; + goto out; + } + + sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, + my_dest->psn, my_dest->rkey, my_dest->vaddr); + if (write(connfd, msg, sizeof msg) != sizeof msg) { + perror("server write"); + fprintf(stderr, "Couldn't send local address\n"); + free(rem_dest); + rem_dest = NULL; + goto out; + } +out: + return rem_dest; +} + +static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, + int tx_depth, int rx_depth, int port) +{ + struct pingpong_context *ctx; + + ctx = malloc(sizeof *ctx); + if (!ctx) + return NULL; + + ctx->size = size; + ctx->rx_depth = rx_depth; + ctx->tx_depth = tx_depth; + + ctx->buf = memalign(page_size, size * 2); + if (!ctx->buf) { + fprintf(stderr, "Couldn't allocate work buf.\n"); + return NULL; + } + + memset(ctx->buf, 0, size * 2); + + ctx->post_buf = (char*)ctx->buf + (size - 1); + ctx->poll_buf = (char*)ctx->buf + (2 * size - 1); + + ctx->context = ibv_open_device(ib_dev); + if (!ctx->context) { + fprintf(stderr, "Couldn't get context for %s\n", + ibv_get_device_name(ib_dev)); + return NULL; + } + + ctx->pd = ibv_alloc_pd(ctx->context); + if (!ctx->pd) { + fprintf(stderr, "Couldn't allocate PD\n"); + return NULL; + } + + ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size * 2, + IBV_ACCESS_REMOTE_WRITE); + if (!ctx->mr) { + fprintf(stderr, "Couldn't allocate MR\n"); + return NULL; + } + + ctx->cq = ibv_create_cq(ctx->context, rx_depth + tx_depth, NULL); + if (!ctx->cq) { + fprintf(stderr, "Couldn't create CQ\n"); + return NULL; + } + + { + struct ibv_qp_init_attr attr = { + .send_cq = ctx->cq, + .recv_cq = ctx->cq, + .cap = { + .max_send_wr = tx_depth, + .max_recv_wr = rx_depth, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .qp_type = IBV_QPT_RC + }; + + ctx->qp = ibv_create_qp(ctx->pd, &attr); + if (!ctx->qp) { + fprintf(stderr, "Couldn't 
create QP\n"); + return NULL; + } + } + + { + struct ibv_qp_attr attr; + + attr.qp_state = IBV_QPS_INIT; + attr.pkey_index = 0; + attr.port_num = port; + attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE; + + if (ibv_modify_qp(ctx->qp, &attr, + IBV_QP_STATE | + IBV_QP_PKEY_INDEX | + IBV_QP_PORT | + IBV_QP_ACCESS_FLAGS)) { + fprintf(stderr, "Failed to modify QP to INIT\n"); + return NULL; + } + } + + return ctx; +} + +static int pp_post_rdma(struct pingpong_context *ctx, + struct pingpong_dest* rem_dest) +{ + struct ibv_sge list = { + .addr = (uintptr_t) ctx->buf, + .length = ctx->size, + .lkey = ctx->mr->lkey + }; + struct ibv_send_wr wr = { + .wr_id = PINGPONG_RDMA_WRID, + .sg_list = &list, + .num_sge = 1, + .opcode = IBV_WR_RDMA_WRITE, + .send_flags = IBV_SEND_SIGNALED, + .wr.rdma.remote_addr = rem_dest->vaddr, + .wr.rdma.rkey = rem_dest->rkey + }; + struct ibv_send_wr *bad_wr; + + return ibv_post_send(ctx->qp, &wr, &bad_wr); +} + +static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, + struct pingpong_dest *dest) +{ + struct ibv_qp_attr attr; + + attr.qp_state = IBV_QPS_RTR; + attr.path_mtu = IBV_MTU_1024; + attr.dest_qp_num = dest->qpn; + attr.rq_psn = dest->psn; + attr.max_dest_rd_atomic = 1; + attr.min_rnr_timer = 12; + attr.ah_attr.is_global = 0; + attr.ah_attr.dlid = dest->lid; + attr.ah_attr.sl = 0; + attr.ah_attr.src_path_bits = 0; + attr.ah_attr.port_num = port; + if (ibv_modify_qp(ctx->qp, &attr, + IBV_QP_STATE | + IBV_QP_AV | + IBV_QP_PATH_MTU | + IBV_QP_DEST_QPN | + IBV_QP_RQ_PSN | + IBV_QP_MAX_DEST_RD_ATOMIC | + IBV_QP_MIN_RNR_TIMER)) { + fprintf(stderr, "Failed to modify QP to RTR\n"); + return 1; + } + + attr.qp_state = IBV_QPS_RTS; + attr.timeout = 14; + attr.retry_cnt = 7; + attr.rnr_retry = 7; + attr.sq_psn = my_psn; + attr.max_rd_atomic = 1; + if (ibv_modify_qp(ctx->qp, &attr, + IBV_QP_STATE | + IBV_QP_TIMEOUT | + IBV_QP_RETRY_CNT | + IBV_QP_RNR_RETRY | + IBV_QP_SQ_PSN | + IBV_QP_MAX_QP_RD_ATOMIC)) { + fprintf(stderr, 
"Failed to modify QP to RTS\n"); + return 1; + } + + return 0; +} + +static void usage(const char *argv0) +{ + printf("Usage:\n"); + printf(" %s start a server and wait for connection\n", argv0); + printf(" %s connect to server at \n", argv0); + printf("\n"); + printf("Options:\n"); + printf(" -p, --port= listen on/connect to port (default 18515)\n"); + printf(" -d, --ib-dev= use IB device (default first device found)\n"); + printf(" -i, --ib-port= use port of IB device (default 1)\n"); + printf(" -s, --size= size of message to exchange (default 4096)\n"); + printf(" -t, --tx-depth= size of tx queue (default 50)\n"); + printf(" -n, --iters= number of exchanges (default 1000)\n"); +} + +int main(int argc, char *argv[]) +{ + struct dlist *dev_list; + struct ibv_device *ib_dev; + struct pingpong_context *ctx; + struct pingpong_dest my_dest; + struct pingpong_dest *rem_dest; + struct timeval start, end; + char *ib_devname = NULL; + char *servername = NULL; + int port = 18515; + int ib_port = 1; + int size = 1; + int rx_depth = 1; + int tx_depth = 50; + int iters = 1000; + int scnt, rcnt, ccnt; + int client_first_post; + int sockfd; + + srand48(getpid() * time(NULL)); + + while (1) { + int c; + + static struct option long_options[] = { + { .name = "port", .has_arg = 1, .val = 'p' }, + { .name = "ib-dev", .has_arg = 1, .val = 'd' }, + { .name = "ib-port", .has_arg = 1, .val = 'i' }, + { .name = "size", .has_arg = 1, .val = 's' }, + { .name = "iters", .has_arg = 1, .val = 'n' }, + { .name = "tx-depth",.has_arg = 1, .val = 't' }, + { 0 } + }; + + c = getopt_long(argc, argv, "p:d:i:s:t:n:e", long_options, NULL); + if (c == -1) + break; + + switch (c) { + case 'p': + port = strtol(optarg, NULL, 0); + if (port < 0 || port > 65535) { + usage(argv[0]); + return 1; + } + break; + + case 'd': + ib_devname = strdupa(optarg); + break; + + case 'i': + ib_port = strtol(optarg, NULL, 0); + if (port < 0) { + usage(argv[0]); + return 1; + } + break; + + case 's': + size = strtol(optarg, 
NULL, 0); + break; + + case 't': + tx_depth = strtol(optarg, NULL, 0); + break; + + case 'n': + iters = strtol(optarg, NULL, 0); + break; + + default: + usage(argv[0]); + return 1; + } + } + + if (optind == argc - 1) + servername = strdupa(argv[optind]); + else if (optind < argc) { + usage(argv[0]); + return 1; + } + + page_size = sysconf(_SC_PAGESIZE); + + dev_list = ibv_get_devices(); + + dlist_start(dev_list); + if (!ib_devname) { + ib_dev = dlist_next(dev_list); + if (!ib_dev) { + fprintf(stderr, "No IB devices found\n"); + return 1; + } + } else { + dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) + break; + if (!ib_dev) { + fprintf(stderr, "IB device %s not found\n", ib_devname); + return 1; + } + } + + ctx = pp_init_ctx(ib_dev, size, iters, rx_depth, ib_port); + if (!ctx) + return 1; + + my_dest.lid = pp_get_local_lid(ib_dev, ib_port); + my_dest.qpn = ctx->qp->qp_num; + my_dest.psn = lrand48() & 0xffffff; + if (!my_dest.lid) { + fprintf(stderr, "Local lid 0x0 detected. Is an SM running?\n"); + return 1; + } + my_dest.rkey = ctx->mr->rkey; + my_dest.vaddr = (uintptr_t)ctx->buf + ctx->size; + + printf(" local address: LID %#04x, QPN %#06x, PSN %#06x " + "RKey %#08x VAddr %#016Lx\n", + my_dest.lid, my_dest.qpn, my_dest.psn, + my_dest.rkey, my_dest.vaddr); + + + if (servername) { + sockfd = pp_client_connect(servername, port); + } else { + sockfd = pp_server_connect(port); + } + if (sockfd < 0) + return 1; + + if (servername) { + rem_dest = pp_client_exch_dest(sockfd, &my_dest); + } else { + rem_dest = pp_server_exch_dest(sockfd, &my_dest); + } + + if (!rem_dest) + return 1; + + printf(" remote address: LID %#04x, QPN %#06x, PSN %#06x, " + "RKey %#08x VAddr %#016Lx\n", + rem_dest->lid, rem_dest->qpn, rem_dest->psn, + rem_dest->rkey, rem_dest->vaddr); + + if (pp_connect_ctx(ctx, ib_port, my_dest.psn, rem_dest)) + return 1; + + /* An additional handshake is required *after* moving qp to RTR. 
+ Arbitrarily reuse exch_dest for this purpose. */ + if (servername) { + rem_dest = pp_client_exch_dest(sockfd, &my_dest); + } else { + rem_dest = pp_server_exch_dest(sockfd, &my_dest); + } + + write(sockfd, "done", sizeof "done"); + close(sockfd); + + if (gettimeofday(&start, NULL)) { + perror("gettimeofday"); + return 1; + } + + scnt = 0; + rcnt = 0; + ccnt = 0; + if (servername) + client_first_post = 1; + else + client_first_post = 0; + + while (scnt < iters || ccnt < iters || rcnt < iters) { + + /* Wait till buffer changes. */ + if (rcnt < iters && ! client_first_post) { + ++rcnt; + while (*ctx->poll_buf != (char)rcnt) { + } + /* Here the data is already in the physical memory. + If we wanted to actually use it, we may need + a read memory barrier here. */ + } else + client_first_post = 0; + + if (scnt < iters) { + *ctx->post_buf = (char)++scnt; + if (pp_post_rdma(ctx, rem_dest)) { + fprintf(stderr, "Couldn't post send: scnt=%d\n", + scnt); + return 1; + } + } + + if (ccnt < iters) { + struct ibv_wc wc; + int ne; + ++ccnt; + do { + ne = ibv_poll_cq(ctx->cq, 1, &wc); + } while (ne == 0); + + if (ne < 0) { + fprintf(stderr, "poll CQ failed %d\n", ne); + return 1; + } + if (wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, "Completion wth error at %s:\n", + servername?"client":"server"); + fprintf(stderr, "Failed status %d: wr_id %d\n", + wc.status, (int) wc.wr_id); + fprintf(stderr, "scnt=%d, rcnt=%d, ccnt=%d\n", + scnt, rcnt, ccnt); + return 1; + } + } + } + + if (gettimeofday(&end, NULL)) { + perror("gettimeofday"); + return 1; + } + + { + float usec = (end.tv_sec - start.tv_sec) * 1000000 + + (end.tv_usec - start.tv_usec); + long long bytes = (long long) size * iters; + + printf("%lld bytes in %.2f seconds = %.2f Mbit/sec\n", + bytes, usec / 1000000., bytes * 8. / usec); + printf("%d iters in %.2f seconds = %.2f usec/iter\n", + iters, usec / 1000000., usec / iters); + } + + return 0; +} -- MST - Michael S. 
Tsirkin From mst at mellanox.co.il Thu Mar 10 06:17:01 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 Mar 2005 16:17:01 +0200 Subject: [openib-general] Re: [PATCH] AIO code to use get_user_pages In-Reply-To: <20050310121941.GI2586@mellanox.co.il> References: <20050310121941.GI2586@mellanox.co.il> Message-ID: <20050310141701.GB12542@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: [PATCH] AIO code to use get_user_pages > > Well, I went ahead and modified the AIO code to use get_user_pages. > Since we don't yet have fmr support, this patch is untested, but it > does compile :) Please let me know what you think. > > Another approach (instead of waiting for fmr support) could be > to add a fall-back option to use a regular memory region. > A todo item is to add zcopy support for synchronous operations. It seems I tried to use the same kmap slot twice. That's not right. Here's a small patch to apply on top of that. Signed-off-by: Michael S. Tsirkin --- sdp/sdp_iocb.c 2005-03-10 11:28:02.000000000 +0200 +++ sdp-fix//sdp_iocb.c 2005-03-10 13:36:28.000000000 +0200 @@ -48,11 +48,11 @@ static void sdp_copy_one_page(struct pag void* tptr; fptr = kmap_atomic(from, KM_USER0); - tptr = kmap_atomic(to, KM_USER0); + tptr = kmap_atomic(to, KM_USER1); memcpy(tptr + offset, fptr + offset, size); - kunmap_atomic(tptr, KM_USER0); + kunmap_atomic(tptr, KM_USER1); kunmap_atomic(fptr, KM_USER0); set_page_dirty_lock(to); } -- MST - Michael S. Tsirkin From halr at voltaire.com Thu Mar 10 07:31:38 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Mar 2005 10:31:38 -0500 Subject: [openib-general] CM REP sent and local QPN setting Message-ID: <1110468698.4659.19.camel@localhost.localdomain> Hi Sean, Doesn't the passive CM need to set the local_qpn in cm_id_private when sending the REP?
Inside of ib_send_cm_rep(), shouldn't there be the following line of code: cm_id_priv->local_qpn = cm_rep_get_local_qpn(rep_msg); Otherwise, if the active side disconnects, the remote QPN in the DREQ cannot be matched. -- Hal From mshefty at ichips.intel.com Thu Mar 10 09:20:51 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 10 Mar 2005 09:20:51 -0800 Subject: [openib-general] Re: CM REP sent and local QPN setting In-Reply-To: <1110468698.4659.19.camel@localhost.localdomain> References: <1110468698.4659.19.camel@localhost.localdomain> Message-ID: <423081F3.7040003@ichips.intel.com> Hal Rosenstock wrote: > Hi Sean, > > Doesn't the passive CM need to set the local_qpn in cm_id_private when > sending the REP? Inside of ib_send_cm_rep(), shouldn't there be the > following line of code: > > cm_id_priv->local_qpn = cm_rep_get_local_qpn(rep_msg); The passive side needs to store the local QPN somewhere. I'll take a look to see if this is done anywhere else (I'm assuming not by your message), and if not add it in. Thanks for the info. - Sean From mshefty at ichips.intel.com Thu Mar 10 09:34:45 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 10 Mar 2005 09:34:45 -0800 Subject: [openib-general] Re: CM REP sent and local QPN setting In-Reply-To: <423081F3.7040003@ichips.intel.com> References: <1110468698.4659.19.camel@localhost.localdomain> <423081F3.7040003@ichips.intel.com> Message-ID: <42308535.4020100@ichips.intel.com> >> Doesn't the passive CM need to set the local_qpn in cm_id_private when >> sending the REP? Inside of ib_send_cm_rep(), shouldn't there be the >> following line of code: >> >> cm_id_priv->local_qpn = cm_rep_get_local_qpn(rep_msg); > > > The passive side needs to store the local QPN somewhere. I'll take a > look to see if this is done anywhere else (I'm assuming not by your > message), and if not add it in. Thanks for the info. I've committed a patch to fix this using your suggestion above.
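[Editorial sketch, not part of the original thread.] The reasoning in this exchange can be modelled in a few lines of C: the passive side has to remember the QPN it put in the REP, because a later DREQ from the active side names that QPN and it must be compared against stored state. Everything below (toy_cm_id, toy_rep_msg, toy_dreq_msg, toy_send_rep, toy_dreq_matches) is a hypothetical stand-in invented for illustration; it is not the actual cm.c code or the committed patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the CM's per-connection state. */
struct toy_cm_id {
    uint32_t local_qpn;   /* QPN this side advertised to the peer */
    uint32_t remote_qpn;  /* QPN the peer advertised to this side */
};

/* Hypothetical messages: a REP carries the sender's QPN, and a DREQ
 * names the QPN of the side that receives it. */
struct toy_rep_msg  { uint32_t local_qpn; };
struct toy_dreq_msg { uint32_t remote_qpn; };

/* Passive side: record the QPN placed in the REP at send time. */
static void toy_send_rep(struct toy_cm_id *id, const struct toy_rep_msg *rep)
{
    assert(id != NULL && rep != NULL);
    id->local_qpn = rep->local_qpn;
}

/* A received DREQ matches only if it names the QPN we remembered. */
static int toy_dreq_matches(const struct toy_cm_id *id,
                            const struct toy_dreq_msg *dreq)
{
    return id->local_qpn == dreq->remote_qpn;
}
```

If toy_send_rep() is never called, local_qpn keeps its initial value and a DREQ naming the real QPN fails to match, which is the symptom Hal describes.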
- Sean From roland at topspin.com Thu Mar 10 09:49:16 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 10 Mar 2005 09:49:16 -0800 Subject: [openib-general] uverbs: pthread_mutex -> pthread_spinlock ? In-Reply-To: <20050310104722.GG2586@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 10 Mar 2005 12:47:22 +0200") References: <20050310104722.GG2586@mellanox.co.il> Message-ID: <52acpbbdcz.fsf@topspin.com> Michael> I am able to shave about 200ns off the rdma post latency, Michael> by using pthread_spinlock instead of pthread_mutex for Michael> protecting the qp post op in libmthca. Michael> I'm aware of course that a context switch while a spinlock Michael> is held may waste a whole timeslice, but maybe for short Michael> operations such as this it's reasonable to use spinlocks? You're right, this is a significant performance boost. I had believed that since pthread_mutex_lock and pthread_mutex_unlock can be done completely in userspace with NPTL and futexes (with only a single locked instruction when there is no contention), doing pthread_spin_lock/pthread_spin_unlock instead would be roughly equivalent. However, a quick synthetic benchmark shows that I was completely wrong: uncontended pthread_mutex_t operations are much slower than uncontended pthread_spinlock_t operations on i386, x86_64, ppc64 and ia64. I've committed a changeset for libmthca that converts from pthread_mutex_t to pthread_spinlock_t, which results in a measurable improvement on real pingpong tests. Thanks, Roland From roland at topspin.com Thu Mar 10 10:29:20 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 10 Mar 2005 10:29:20 -0800 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <20050310104211.GF2586@mellanox.co.il> (Michael S.
Tsirkin's message of "Thu, 10 Mar 2005 12:42:11 +0200") References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> Message-ID: <521xanbbi7.fsf@topspin.com> Michael> I think I have discovered the problem. It seems that with Michael> -O3 my compiler may reorder the WQE (and possibly CQE) Michael> write with respect to the doorbell. This won't happen on Michael> i386 with consistent I/O ordering since the doorbell is Michael> done in assembly, and probably not on other 32-bit Michael> architectures since the mutex is likely to include a Michael> memory barrier. Michael> Applying the following patch fixes the problem for me for Michael> x86_64. Thanks for diagnosing this. I think I want to work on a more general fix though. - R. From mst at mellanox.co.il Thu Mar 10 10:42:30 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 10 Mar 2005 20:42:30 +0200 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <521xanbbi7.fsf@topspin.com> References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> <521xanbbi7.fsf@topspin.com> Message-ID: <20050310184230.GA13051@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ANNOUNCE: First usable version of userspace verbs > > Michael> I think I have discovered the problem. It seems that with > Michael> -O3 my compiler may reorder the WQE (and possibly CQE) > Michael> write with respect to the doorbell. This won't happen on > Michael> i386 with consistent I/O ordering since the doorbell is > Michael> done in assembly, and probably not on other 32-bit > Michael> architectures since the mutex is likely to include a > Michael> memory barrier. > > Michael> Applying the following patch fixes the problem for me for > Michael> x86_64. > > Thanks for diagnosing this.
I think I want to work on a more general > fix though. > > - R. > Generally, I think you'll need to implement a write memory barrier, and use it before each doorbell. I didn't find an efficient portable way to do this. I suggest implementing it for ppc with eieio, and simply using a spinlock instead of a barrier for anything else. -- MST - Michael S. Tsirkin From mshefty at ichips.intel.com Thu Mar 10 15:37:08 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 10 Mar 2005 15:37:08 -0800 Subject: [openib-general] MAD receive work completion Message-ID: <4230DA24.3040801@ichips.intel.com> I'm hitting an issue in the CM where I need to access work completion information about a received MAD. The CM takes the received MAD and queues it to a CM-owned work queue for processing. It then accesses the wc field from ib_mad_recv_wc shown below. struct ib_mad_recv_wc { struct ib_wc *wc; struct ib_mad_recv_buf recv_buf; int mad_len; }; The ib_mad_recv_wc and referenced data buffers are owned by the CM until it calls ib_free_recv_mad(); however, the wc field references an item that is declared on the stack. I see two main solutions. The CM can allocate its own ib_wc structure and copy the contents of the returned work completion. Or the MAD layer can avoid allocating the work completion on the stack. Thoughts? - Sean From mshefty at ichips.intel.com Thu Mar 10 15:48:57 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 10 Mar 2005 15:48:57 -0800 Subject: [openib-general] [PATCH] [CM] fix CM unload after receiving a bad REQ Message-ID: <20050310154857.64f71b8a.mshefty@ichips.intel.com> The following patch fixes the issue of unloading the CM after receiving a bad REQ.
Signed-off-by: Sean Hefty Index: cm.c =================================================================== --- cm.c (revision 1974) +++ cm.c (working copy) @@ -237,8 +237,8 @@ { av->port = port; av->pkey_index = wc->pkey_index; - cm_set_ah_attr(&av->ah_attr, port->port_num, wc->slid, wc->sl, - wc->dlid_path_bits); + cm_set_ah_attr(&av->ah_attr, port->port_num, cpu_to_be16(wc->slid), + wc->sl, wc->dlid_path_bits); } static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av) @@ -648,7 +648,7 @@ spin_unlock_irqrestore(&cm_id_priv->lock, flags); ib_send_cm_rej(cm_id, IB_CM_REJ_TIMEOUT, &cm_id_priv->av.port->cm_dev->ca_guid, - sizeof &cm_id_priv->av.port->cm_dev->ca_guid, + sizeof cm_id_priv->av.port->cm_dev->ca_guid, NULL, 0); break; case IB_CM_ESTABLISHED: @@ -1038,14 +1038,14 @@ if (cm_insert_remote_id(cm_id_priv->timewait_info)) { spin_unlock_irqrestore(&cm.lock, flags); ret = -EINVAL; - goto error2; + goto error1; } /* Check for a stale connection. */ if (cm_insert_remote_qpn(cm_id_priv->timewait_info)) { spin_unlock_irqrestore(&cm.lock, flags); /* todo: reject as stale */ ret = -EINVAL; - goto error2; + goto error1; } /* Find matching listen request. 
*/ listen_cm_id_priv = cm_find_listen(req_msg->service_id); @@ -1053,7 +1053,7 @@ spin_unlock_irqrestore(&cm.lock, flags); /* todo: reject with no match */ ret = -EINVAL; - goto error2; + goto error1; } atomic_inc(&listen_cm_id_priv->refcount); atomic_inc(&cm_id_priv->refcount); @@ -1069,11 +1069,11 @@ cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]); ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av); if (ret) - goto error3; + goto error2; if (req_msg->alt_local_lid) { ret = cm_init_av_by_path(&work->path[1], &cm_id_priv->alt_av); if (ret) - goto error3; + goto error2; } cm_id_priv->timeout_ms = cm_convert_to_ms( cm_req_get_local_resp_timeout(req_msg)); @@ -1092,11 +1092,9 @@ cm_process_work(cm_id_priv, work); cm_deref_id(listen_cm_id_priv); return 0; -error3: - cm_deref_id(listen_cm_id_priv); error2: - cm_cleanup_timewait(cm_id_priv->timewait_info); - kfree(cm_id_priv->timewait_info); + atomic_dec(&cm_id_priv->refcount); + cm_deref_id(listen_cm_id_priv); error1: ib_destroy_cm_id(&cm_id_priv->id); return ret; @@ -2592,7 +2590,7 @@ } static void cm_send_handler(struct ib_mad_agent *mad_agent, - struct ib_mad_send_wc *mad_send_wc) + struct ib_mad_send_wc *mad_send_wc) { struct cm_msg *msg; From ftillier at infiniconsys.com Thu Mar 10 15:54:15 2005 From: ftillier at infiniconsys.com (Fab Tillier) Date: Thu, 10 Mar 2005 15:54:15 -0800 Subject: [openib-general] MAD receive work completion In-Reply-To: <4230DA24.3040801@ichips.intel.com> Message-ID: <000201c525cc$78423c50$8d5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > I'm hitting an issue in the CM where I need to access work completion > information about a received MAD. The CM takes the received MAD and > queues it to a CM owned work queue for processing. It then accesses > the wc field from ib_mad_recv_wc shown below. 
> > struct ib_mad_recv_wc { > struct ib_wc *wc; > struct ib_mad_recv_buf recv_buf; > int mad_len; > }; > > The ib_mad_recv_wc and referenced data buffers are owned by the CM > until it calls ib_free_recv_mad(), however the wc field references an > item that is declared on the stack. > > I see two main solutions. The CM can allocate its own ib_wc structure > and copy the contents of the returned work completion. Or the MAD > layer can avoid allocating the work completion on the stack. Thoughts? I assume you need the WC information to help you reply, right? I would say change the ib_wc embedded in ib_mad_recv_wc from a pointer to the structure, and then use that when you poll. That way you avoid an extra allocation in the MAD layer, and avoid the data copy in the CM. Note that I may have completely missed your point, in which case ignore me. - Fab From mshefty at ichips.intel.com Thu Mar 10 16:03:07 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 10 Mar 2005 16:03:07 -0800 Subject: [openib-general] MAD receive work completion In-Reply-To: <000201c525cc$78423c50$8d5aa8c0@infiniconsys.com> References: <000201c525cc$78423c50$8d5aa8c0@infiniconsys.com> Message-ID: <4230E03B.9070008@ichips.intel.com> Fab Tillier wrote: > I assume you need the WC information to help you reply, right? > > I would say change the ib_wc embedded in ib_mad_recv_wc from a pointer to > the structure, and then use that when you poll. That way you avoid an extra > allocation in the MAD layer, and avoid the data copy in the CM. > > Note that I may have completely missed your point, in which case ignore me. You got the point. I need to generate a reply. Moving the wc from a pointer to a structure means that the MAD layer needs to know a.) that the next completion is a receive, and b.) which data buffer received the data. Currently, the MAD layer uses a single CQ for sends and receives on QP 0 and 1. (This shouldn't be overly difficult to change, however.) 
- Sean From roland at topspin.com Thu Mar 10 16:04:53 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 10 Mar 2005 16:04:53 -0800 Subject: [openib-general] MAD receive work completion In-Reply-To: <4230DA24.3040801@ichips.intel.com> (Sean Hefty's message of "Thu, 10 Mar 2005 15:37:08 -0800") References: <4230DA24.3040801@ichips.intel.com> Message-ID: <5264zz9hei.fsf@topspin.com> Sean> I'm hitting an issue in the CM where I need to access work Sean> completion information about a received MAD. The CM takes Sean> the received MAD and queues it to a CM owned work queue for Sean> processing. It then accesses the wc field from Sean> ib_mad_recv_wc shown below. This doesn't seem worth an API change to me. I think the simplest and best solution is just to copy the work completion information you need into the work structure you put onto your workqueue. - R. From mshefty at ichips.intel.com Thu Mar 10 16:35:39 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 10 Mar 2005 16:35:39 -0800 Subject: [openib-general] MAD receive work completion In-Reply-To: <5264zz9hei.fsf@topspin.com> References: <4230DA24.3040801@ichips.intel.com> <5264zz9hei.fsf@topspin.com> Message-ID: <4230E7DB.7020905@ichips.intel.com> Roland Dreier wrote: > Sean> I'm hitting an issue in the CM where I need to access work > Sean> completion information about a received MAD. The CM takes > Sean> the received MAD and queues it to a CM owned work queue for > Sean> processing. It then accesses the wc field from > Sean> ib_mad_recv_wc shown below. > > This doesn't seem worth an API change to me. I think the simplest and > best solution is just to copy the work completion information you need > into the work structure you put onto your workqueue. It could be done without an API change, likely changing 3-4 lines of code, with the result that the work completion would be copied for all received MADs. 
(The copy could be avoided with a more extensive change, but I would go with a simpler solution for now.) To me, it seems that the behavior isn't what a user would expect given the current API. The ib_mad_recv_wc belongs to the user until it is freed, but one of the fields in it exists only during the callback. Is this the behavior that we want? - Sean From ftillier at infiniconsys.com Thu Mar 10 16:49:41 2005 From: ftillier at infiniconsys.com (Fab Tillier) Date: Thu, 10 Mar 2005 16:49:41 -0800 Subject: [openib-general] MAD receive work completion In-Reply-To: <4230E7DB.7020905@ichips.intel.com> Message-ID: <000301c525d4$36d17bc0$8d5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > To me, it seems that the behavior isn't what a user would expect given > the current API. The ib_mad_recv_wc belongs to the user until it is > freed, but one of the fields in it exists only during the callback. Is > this the behavior that we want? It depends on the usage model. If the majority of clients always process the completion in a different thread context (i.e. using a workqueue like the CM does), then I would say make the copy always. Otherwise, I agree with Roland and it should be up to the client. However, if the CM needs to have the WC information beyond the work it does in its work queue *and* saves the received mad structure already, then it will copy it once from the receive callback into the work structure, and then again into the cid structure. Note that this only applies to the case where the received MAD is saved beyond both the MAD completion callback and the CM workqueue handler. 
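[Editorial sketch, not part of the original thread.] The "copy the work completion into the work structure" option discussed above can be illustrated with a small model. The types and names here (toy_wc, toy_mad_recv_wc, toy_cm_work, toy_recv_handler) are hypothetical, trimmed-down stand-ins for ib_wc, ib_mad_recv_wc and the CM's work item; the field selection is an assumption made for brevity, not the real structure layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a work completion with a few fields. */
struct toy_wc {
    uint16_t slid;
    uint8_t  sl;
    uint16_t pkey_index;
};

/* Hypothetical stand-in for the receive descriptor: wc is only a
 * pointer, and in the scenario described it points at stack memory
 * that is valid only for the duration of the receive callback. */
struct toy_mad_recv_wc {
    struct toy_wc *wc;
    int mad_len;
};

/* Work item queued for deferred processing; it embeds its own copy
 * of the completion so nothing dangles after the callback returns. */
struct toy_cm_work {
    struct toy_mad_recv_wc *recv;
    struct toy_wc wc_copy;
};

/* Receive callback: snapshot the completion while recv->wc is valid. */
static void toy_recv_handler(struct toy_cm_work *work,
                             struct toy_mad_recv_wc *recv)
{
    assert(work != NULL && recv != NULL && recv->wc != NULL);
    work->recv = recv;
    work->wc_copy = *recv->wc;  /* plain struct copy */
}
```

The deferred handler would then read work->wc_copy and never dereference recv->wc. The cost is one small struct copy per received MAD, which is the trade-off Sean and Fab are weighing against changing the wc member to an embedded structure.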
- Fab From iod00d at hp.com Thu Mar 10 17:01:04 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 10 Mar 2005 17:01:04 -0800 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <521xanbbi7.fsf@topspin.com> References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> <521xanbbi7.fsf@topspin.com> Message-ID: <20050311010104.GC16523@esmail.cup.hp.com> On Thu, Mar 10, 2005 at 10:29:20AM -0800, Roland Dreier wrote: > Michael> Applying the folowing patch fixes the problem for me for > Michael> x86_64. > > Thanks for diagnosing this. I think I want to work on a more general > fix though. Crap. /usr/include/linux/compiler.h shows "barrier" but on debian, it's #ifdef __KERNEL__ only. I'm wondering if xf86 has the same problem for graphics drivers talking to cards. I'm not certain barrier would be the "perfect" solution, but expect it should DTRT. grant From roland at topspin.com Thu Mar 10 17:00:21 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 10 Mar 2005 17:00:21 -0800 Subject: [openib-general] MAD receive work completion In-Reply-To: <4230E7DB.7020905@ichips.intel.com> (Sean Hefty's message of "Thu, 10 Mar 2005 16:35:39 -0800") References: <4230DA24.3040801@ichips.intel.com> <5264zz9hei.fsf@topspin.com> <4230E7DB.7020905@ichips.intel.com> Message-ID: <521xan9eu2.fsf@topspin.com> Sean> It could be done without an API change, likely changing 3-4 Sean> lines of code, with the result that the work completion Sean> would be copied for all received MADs. (The copy could be Sean> avoided with a more extensive change, but I would go with a Sean> simpler solution for now.) Sorry, you're right. I hadn't gone back and looked at the actual API, and I didn't remember that ib_free_recv_mad() takes the whole struct ib_mad_recv_wc (I thought it just took the ib_mad_recv_buf). 
Sean> To me, it seems that the behavior isn't what a user would Sean> expect given the current API. The ib_mad_recv_wc belongs to Sean> the user until it is freed, but one of the fields in it Sean> exists only during the callback. Is this the behavior that Sean> we want? I agree, this is rather ugly. Now I'm not sure whether it makes sense to change the wc member of struct ib_mad_recv_wc from a struct ib_wc * to just a struct ib_wc. On the one hand it makes the API slightly cleaner, but on the other hand it is an incompatible change that may limit the internal implementation of handling MAD receives. - Roland From roland at topspin.com Thu Mar 10 20:50:27 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 10 Mar 2005 20:50:27 -0800 Subject: [openib-general] [Andrew Morton] inappropriate use of in_atomic() Message-ID: <52oedq946k.fsf@topspin.com> drivers/infiniband/core/mad.c is in Andrew's list... From looking at the code, the best fix I can come up with is just to always use GFP_ATOMIC ... worst case we drop a MAD under memory pressure. The other option is to change ib_post_send_mad() to take a GFP_ mask as a parameter, but that doesn't seem worth doing... --- infiniband/core/mad.c (revision 1975) +++ infiniband/core/mad.c (working copy) @@ -666,11 +666,7 @@ static int handle_outgoing_dr_smp(struct if (!ret || !device->process_mad) goto out; - if (in_atomic() || irqs_disabled()) - alloc_flags = GFP_ATOMIC; - else - alloc_flags = GFP_KERNEL; - local = kmalloc(sizeof *local, alloc_flags); + local = kmalloc(sizeof *local, GFP_ATOMIC); if (!local) { ret = -ENOMEM; printk(KERN_ERR PFX "No memory for ib_mad_local_private\n"); @@ -860,9 +856,7 @@ int ib_post_send_mad(struct ib_mad_agent } /* Allocate MAD send WR tracking structure */ - mad_send_wr = kmalloc(sizeof *mad_send_wr, - (in_atomic() || irqs_disabled()) ?
- GFP_ATOMIC : GFP_KERNEL); + mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_ATOMIC); if (!mad_send_wr) { printk(KERN_ERR PFX "No memory for " "ib_mad_send_wr_private\n"); -------------- next part -------------- An embedded message was scrubbed... From: Andrew Morton Subject: inappropriate use of in_atomic() Date: Thu, 10 Mar 2005 20:40:06 -0800 Size: 3123 URL: From mst at mellanox.co.il Thu Mar 10 23:17:02 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 11 Mar 2005 09:17:02 +0200 Subject: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <20050311010104.GC16523@esmail.cup.hp.com> References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> <521xanbbi7.fsf@topspin.com> <20050311010104.GC16523@esmail.cup.hp.com> Message-ID: <20050311071702.GA20891@mellanox.co.il> Quoting r. Grant Grundler : > Subject: Re: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs > > On Thu, Mar 10, 2005 at 10:29:20AM -0800, Roland Dreier wrote: > > Michael> Applying the following patch fixes the problem for me for > > Michael> x86_64. > > > > Thanks for diagnosing this. I think I want to work on a more general > > fix though. > > Crap. /usr/include/linux/compiler.h shows "barrier" but on debian, > it's #ifdef __KERNEL__ only. I think it's because it's a kernel header. Current distributions seem to put a copy of kernel headers under /usr/include/linux and /usr/include/asm. I suspect using these isn't allowed. > I'm wondering if xf86 has the same problem > for graphics drivers talking to cards. > > I'm not certain barrier would be the "perfect" solution, but expect > it should DTRT. > > grant > Actually, I think we need a wmb here. It just happens to be equivalent to barrier on x86_64. -- MST - Michael S. Tsirkin From mst at mellanox.co.il Thu Mar 10 23:31:08 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Fri, 11 Mar 2005 09:31:08 +0200 Subject: [openib-general] Re: [Andrew Morton] inappropriate use of in_atomic() In-Reply-To: <52oedq946k.fsf@topspin.com> References: <52oedq946k.fsf@topspin.com> Message-ID: <20050311073108.GA20989@mellanox.co.il> Quoting r. Roland Dreier : > Subject: [Andrew Morton] inappropriate use of in_atomic() > > drivers/infiniband/core/mad.c is in Andrew's list... > > >From looking at the code, the best fix I can come up with is just to > always use GFP_ATOMIC ... worst case we drop a MAD under memory > pressure. The other option is to change ib_post_send_mad() to take a > GFP_ mask as a parameter, but that doesn't seem worth doing... > > --- infiniband/core/mad.c (revision 1975) > +++ infiniband/core/mad.c (working copy) > @@ -666,11 +666,7 @@ static int handle_outgoing_dr_smp(struct > if (!ret || !device->process_mad) > goto out; > > - if (in_atomic() || irqs_disabled()) > - alloc_flags = GFP_ATOMIC; > - else > - alloc_flags = GFP_KERNEL; > - local = kmalloc(sizeof *local, alloc_flags); > + local = kmalloc(sizeof *local, GFP_ATOMIC); > if (!local) { > ret = -ENOMEM; > printk(KERN_ERR PFX "No memory for ib_mad_local_private\n"); > @@ -860,9 +856,7 @@ int ib_post_send_mad(struct ib_mad_agent > } > > /* Allocate MAD send WR tracking structure */ > - mad_send_wr = kmalloc(sizeof *mad_send_wr, > - (in_atomic() || irqs_disabled()) ? > - GFP_ATOMIC : GFP_KERNEL); > + mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_ATOMIC); > if (!mad_send_wr) { > printk(KERN_ERR PFX "No memory for " > "ib_mad_send_wr_private\n"); > > > > > > Date: Thu, 10 Mar 2005 20:40:06 -0800 > From: Andrew Morton > Subject: inappropriate use of in_atomic() > > > in_atomic() is not a reliable indication of whether it is currently safe > to call schedule(). > > This is because the lockdepth beancounting which in_atomic() uses is only > accumulated if CONFIG_PREEMPT=y. in_atomic() will return false inside > spinlocks if CONFIG_PREEMPT=n. 
> > Consequently the use of in_atomic() in the below files is probably > deadlocky if CONFIG_PREEMPT=n: > > arch/ppc64/kernel/viopath.c > drivers/net/irda/sir_kthread.c > drivers/net/wireless/airo.c > drivers/video/amba-clcd.c > drivers/acpi/osl.c > drivers/ieee1394/ieee1394_transactions.c > drivers/infiniband/core/mad.c > > Note that the same beancounting is used for the "scheduling while atomic" > warning, so if the code calls schedule with locks held, we won't get a > warning. Both are tied to CONFIG_PREEMPT=y. > > The kernel provides no reliable runtime way of detecting whether or not it > is safe to call schedule(). > > Can we please find ways to change the above code to not use in_atomic()? > Then we can whack #ifndef MODULE around its definition to reduce > reoccurrences. Will probably rename it to something more scary as well. > > Thanks. > Sdp also has a couple of uses. Maybe we can use the atomic branch in all cases here, as well? Libor? -- MST - Michael S. Tsirkin From mst at mellanox.co.il Fri Mar 11 05:14:46 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 11 Mar 2005 15:14:46 +0200 Subject: [openib-general] fmr support in mthca In-Reply-To: <526507mkmm.fsf@topspin.com> References: <2005331520.b7ycIGGfSwBBRSED@topspin.com> <20050304140155.GC13804@mellanox.co.il> <526507mkmm.fsf@topspin.com> Message-ID: <20050311131446.GC20989@mellanox.co.il> Roland, would you like me to implement FMRs in mthca? It is needed by SDP for zero copy support. -- MST - Michael S. Tsirkin From mst at mellanox.co.il Fri Mar 11 05:23:12 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Fri, 11 Mar 2005 15:23:12 +0200 Subject: [openib-general] Re: Re: ANNOUNCE: First usable version of userspace verbs In-Reply-To: <20050311071702.GA20891@mellanox.co.il> References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> <521xanbbi7.fsf@topspin.com> <20050311010104.GC16523@esmail.cup.hp.com> <20050311071702.GA20891@mellanox.co.il> Message-ID: <20050311132311.GD20989@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: Re: ANNOUNCE: First usable version of userspace verbs > > Quoting r. Grant Grundler : > > Subject: Re: [openib-general] Re: ANNOUNCE: First usable version of userspace verbs > > > > On Thu, Mar 10, 2005 at 10:29:20AM -0800, Roland Dreier wrote: > > > Michael> Applying the folowing patch fixes the problem for me for > > > Michael> x86_64. > > > > > > Thanks for diagnosing this. I think I want to work on a more general > > > fix though. > > > > Crap. /usr/include/linux/compiler.h shows "barrier" but on debian, > > it's #ifdef __KERNEL__ only. > I think its because its a kernel header. current distributions seem to > put a copy of kernel headers under /usr/include/linux and /usr/include/asm. > I suspect using these isnt allowed. For example, wmb is in /usr/include/asm/system.h, including this on SuSe 9.1 gives you: ~>cat foo.c #include ~>gcc foo.c In file included from /usr/include/asm/system.h:4, from foo.c:1: /usr/include/asm-x86_64/system.h: In function `__cmpxchg': /usr/include/asm-x86_64/system.h:249: error: `LOCK_PREFIX' undeclared (first use in this function) /usr/include/asm-x86_64/system.h:249: error: (Each undeclared identifier is reported only once /usr/include/asm-x86_64/system.h:249: error: for each function it appears in.) 
/usr/include/asm-x86_64/system.h:249: error: parse error before string constant /usr/include/asm-x86_64/system.h:255: error: parse error before string constant /usr/include/asm-x86_64/system.h:261: error: parse error before string constant /usr/include/asm-x86_64/system.h:267: error: parse error before string constant /usr/include/asm-x86_64/system.h: At top level: /usr/include/asm-x86_64/system.h:279: error: parse error before "cmpxchg4_locked" /usr/include/asm-x86_64/system.h:279: error: parse error before '*' token /usr/include/asm-x86_64/system.h: In function `cmpxchg4_locked': /usr/include/asm-x86_64/system.h:282: error: `new' undeclared (first use in this function) /usr/include/asm-x86_64/system.h:282: error: `old' undeclared (first use in this function) /usr/include/asm-x86_64/system.h:282: error: `__u32' undeclared (first use in this function) /usr/include/asm-x86_64/system.h:282: error: parse error before ')' token -- MST - Michael S. Tsirkin From halr at voltaire.com Fri Mar 11 05:17:15 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Mar 2005 08:17:15 -0500 Subject: [openib-general] [Andrew Morton] inappropriate use of in_atomic() In-Reply-To: <52oedq946k.fsf@topspin.com> References: <52oedq946k.fsf@topspin.com> Message-ID: <1110547035.4659.130.camel@localhost.localdomain> On Thu, 2005-03-10 at 23:50, Roland Dreier wrote: > drivers/infiniband/core/mad.c is in Andrew's list... > > From looking at the code, the best fix I can come up with is just to > always use GFP_ATOMIC ... worst case we drop a MAD under memory > pressure. That could be bad if this persists, but I suppose there are other ill effects of this. > The other option is to change ib_post_send_mad() to take a > GFP_ mask as a parameter, but that doesn't seem worth doing... There aren't that many places this is called. Also, it appears to me that sa_query.c is already doing this for some of its memory allocation and this could be passed down to ib_post_send_mad.

int ib_sa_path_rec_get(struct ib_device *device, u8 port_num,
                       struct ib_sa_path_rec *rec,
                       ib_sa_comp_mask comp_mask,
                       int timeout_ms, int gfp_mask,
                       ...

This approach seems better to me from a robustness standpoint.

Is the difficulty determining what to set the mask to for each call ? If
they all end up being GFP_ATOMIC, this reduces to your preferred
solution.

The biggest impact appears to be on CM (at least currently).

-- Hal

> --- infiniband/core/mad.c	(revision 1975)
> +++ infiniband/core/mad.c	(working copy)
> @@ -666,11 +666,7 @@ static int handle_outgoing_dr_smp(struct
>  	if (!ret || !device->process_mad)
>  		goto out;
>
> -	if (in_atomic() || irqs_disabled())
> -		alloc_flags = GFP_ATOMIC;
> -	else
> -		alloc_flags = GFP_KERNEL;
> -	local = kmalloc(sizeof *local, alloc_flags);
> +	local = kmalloc(sizeof *local, GFP_ATOMIC);
>  	if (!local) {
>  		ret = -ENOMEM;
>  		printk(KERN_ERR PFX "No memory for ib_mad_local_private\n");
> @@ -860,9 +856,7 @@ int ib_post_send_mad(struct ib_mad_agent
>  	}
>
>  	/* Allocate MAD send WR tracking structure */
> -	mad_send_wr = kmalloc(sizeof *mad_send_wr,
> -			      (in_atomic() || irqs_disabled()) ?
> -			      GFP_ATOMIC : GFP_KERNEL);
> +	mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_ATOMIC);
>  	if (!mad_send_wr) {
>  		printk(KERN_ERR PFX "No memory for "
>  		       "ib_mad_send_wr_private\n");
>
>
>
>
> ______________________________________________________________________
>
> From: Andrew Morton
> To: Paul Mackerras , Jean Tourrilhes , javier at tudela.mad.ttd.net, linux-fbdev-devel at lists.sourceforge.net, acpi-devel at lists.sourceforge.net, linux1394-devel at lists.sourceforge.net, Roland Dreier
> Cc: linux-kernel at vger.kernel.org
> Subject: inappropriate use of in_atomic()
> Date: 10 Mar 2005 20:40:06 -0800
>
>
> in_atomic() is not a reliable indication of whether it is currently safe
> to call schedule().
>
> This is because the lockdepth beancounting which in_atomic() uses is only
> accumulated if CONFIG_PREEMPT=y.
> in_atomic() will return false inside
> spinlocks if CONFIG_PREEMPT=n.
>
> Consequently the use of in_atomic() in the below files is probably
> deadlocky if CONFIG_PREEMPT=n:
>
> 	arch/ppc64/kernel/viopath.c
> 	drivers/net/irda/sir_kthread.c
> 	drivers/net/wireless/airo.c
> 	drivers/video/amba-clcd.c
> 	drivers/acpi/osl.c
> 	drivers/ieee1394/ieee1394_transactions.c
> 	drivers/infiniband/core/mad.c
>
> Note that the same beancounting is used for the "scheduling while atomic"
> warning, so if the code calls schedule with locks held, we won't get a
> warning. Both are tied to CONFIG_PREEMPT=y.
>
> The kernel provides no reliable runtime way of detecting whether or not it
> is safe to call schedule().
>
> Can we please find ways to change the above code to not use in_atomic()?
> Then we can whack #ifndef MODULE around its definition to reduce
> reoccurrences. Will probably rename it to something more scary as well.
>
> Thanks.
>
>
> ______________________________________________________________________
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From mst at mellanox.co.il Fri Mar 11 05:28:50 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 11 Mar 2005 15:28:50 +0200
Subject: [openib-general] Re: [Andrew Morton] inappropriate use of in_atomic()
In-Reply-To: <1110547035.4659.130.camel@localhost.localdomain>
References: <52oedq946k.fsf@topspin.com> <1110547035.4659.130.camel@localhost.localdomain>
Message-ID: <20050311132850.GE20989@mellanox.co.il>

Quoting r. Hal Rosenstock :
> Subject: Re: [Andrew Morton] inappropriate use of in_atomic()
>
> On Thu, 2005-03-10 at 23:50, Roland Dreier wrote:
> > drivers/infiniband/core/mad.c is in Andrew's list...
> >
> > From looking at the code, the best fix I can come up with is just to
> > always use GFP_ATOMIC ...
> > worst case we drop a MAD under memory
> > pressure.
>
> That could be bad if this persists but I suppose there are other ill
> effects of this.
>
> > The other option is to change ib_post_send_mad() to take a
> > GFP_ mask as a parameter, but that doesn't seem worth doing...
>
> There aren't that many places this is called. Also, it appears to me
> that sa_query.c is already doing this for some of its memory allocation
> and this could be passed down to ib_post_send_mad.
>
> int ib_sa_path_rec_get(struct ib_device *device, u8 port_num,
>                        struct ib_sa_path_rec *rec,
>                        ib_sa_comp_mask comp_mask,
>                        int timeout_ms, int gfp_mask,
>                        ...
>
> This approach seems better to me from a robustness standpoint.
>
> Is the difficulty determining what to set the mask to for each call ? If
> they all end up being GFP_ATOMIC, this reduces to your preferred
> solution.
>
> The biggest impact appears to be on CM (at least currently).

As far as I remember most CM code is thread level anyway, isn't it?

-- MST - Michael S. Tsirkin

From halr at voltaire.com Fri Mar 11 05:27:53 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Mar 2005 08:27:53 -0500
Subject: [openib-general] [PATCH] [CM] fix CM unload after receiving a bad REQ
In-Reply-To: <20050310154857.64f71b8a.mshefty@ichips.intel.com>
References: <20050310154857.64f71b8a.mshefty@ichips.intel.com>
Message-ID: <1110547672.4659.138.camel@localhost.localdomain>

On Thu, 2005-03-10 at 18:48, Sean Hefty wrote:
> The following patch fixes the issue of unloading the CM after
> receiving a bad REQ.

Works for me :-) Thanks.

-- Hal

From roland at topspin.com Fri Mar 11 08:56:37 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 11 Mar 2005 08:56:37 -0800
Subject: [openib-general] [Andrew Morton] inappropriate use of in_atomic()
In-Reply-To: <1110547035.4659.130.camel@localhost.localdomain> (Hal Rosenstock's message of "11 Mar 2005 08:17:15 -0500")
References: <52oedq946k.fsf@topspin.com> <1110547035.4659.130.camel@localhost.localdomain>
Message-ID: <527jke86ka.fsf@topspin.com>

    Hal> There aren't that many places this is called. Also, it
    Hal> appears to me that sa_query.c is already doing this for some
    Hal> of its memory allocation and this could be passed down to
    Hal> ib_post_send_mad.

Yes, that's an alternate solution. The question is whether it's worth
changing the API so that some callers can use GFP_KERNEL.

 - R.

From roland at topspin.com Fri Mar 11 08:57:43 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 11 Mar 2005 08:57:43 -0800
Subject: [openib-general] fmr support in mthca
In-Reply-To: <20050311131446.GC20989@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 11 Mar 2005 15:14:46 +0200")
References: <2005331520.b7ycIGGfSwBBRSED@topspin.com> <20050304140155.GC13804@mellanox.co.il> <526507mkmm.fsf@topspin.com> <20050311131446.GC20989@mellanox.co.il>
Message-ID: <523bv286ig.fsf@topspin.com>

    Michael> Roland, would you like me to implement FMRs in mthca? It
    Michael> is needed by SDP for zero copy support.

Yes, that would be great. BTW, for mem-free mode I put the MPT and
MTT in lowmem to make FMRs simpler to use.

 - R.
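
The two allocation strategies weighed in the in_atomic() thread above
(always use GFP_ATOMIC, or let the caller pass a GFP mask) can be modeled
in plain userspace C. This is only a sketch, not kernel code: mock_alloc,
struct mock_env, MOCK_GFP_ATOMIC/MOCK_GFP_KERNEL, post_send_always_atomic
and post_send_with_mask are hypothetical names invented for illustration,
and the allocator model (atomic allocations fail under memory pressure,
sleeping allocations succeed but may block) is a simplifying assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for GFP_ATOMIC and GFP_KERNEL. */
enum mock_gfp { MOCK_GFP_ATOMIC, MOCK_GFP_KERNEL };

/* Simulated caller context; not a real kernel structure. */
struct mock_env {
	bool in_spinlock;       /* caller holds a spinlock */
	bool memory_pressure;   /* no memory free without reclaim */
	int sleeps_under_lock;  /* times we would have slept with a lock held */
};

/*
 * Assumed model: an atomic allocation never sleeps, so it fails under
 * memory pressure; a sleeping allocation waits for reclaim and succeeds,
 * which is a bug if the caller holds a spinlock.
 */
static void *mock_alloc(struct mock_env *env, size_t size, enum mock_gfp flags)
{
	if (flags == MOCK_GFP_ATOMIC)
		return env->memory_pressure ? NULL : malloc(size);

	if (env->memory_pressure && env->in_spinlock)
		env->sleeps_under_lock++;
	return malloc(size);
}

/* Option 1: always allocate atomically; worst case the MAD is dropped. */
static int post_send_always_atomic(struct mock_env *env)
{
	void *wr = mock_alloc(env, 64, MOCK_GFP_ATOMIC);

	if (!wr)
		return -ENOMEM;
	free(wr);
	return 0;
}

/* Option 2: the caller, who knows its own context, supplies the mask. */
static int post_send_with_mask(struct mock_env *env, enum mock_gfp flags)
{
	void *wr = mock_alloc(env, 64, flags);

	if (!wr)
		return -ENOMEM;
	free(wr);
	return 0;
}
```

In this model, option 1 can never sleep under a lock but drops the send
whenever memory is tight. Option 2 lets thread-level callers (the CM was
suggested as one) pass the sleeping mask and avoid the drop, at the cost
of an API change and of trusting each caller to pick the right mask.
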

From mshefty at ichips.intel.com Fri Mar 11 09:36:02 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 11 Mar 2005 09:36:02 -0800
Subject: [openib-general] MAD receive work completion
In-Reply-To: <521xan9eu2.fsf@topspin.com>
References: <4230DA24.3040801@ichips.intel.com> <5264zz9hei.fsf@topspin.com> <4230E7DB.7020905@ichips.intel.com> <521xan9eu2.fsf@topspin.com>
Message-ID: <4231D702.2070709@ichips.intel.com>

Roland Dreier wrote:
> Now I'm not sure whether it makes sense to change the wc member of
> struct ib_mad_recv_wc from a struct ib_wc * to just a struct ib_wc.
> On the one hand it makes the API slightly cleaner, but on the other
> hand it is an incompatible change that may limit the internal
> implementation of handling MAD receives.

Right now, I'm leaning towards no API change, but changing the
implementation in the MAD layer to ensure that the ib_wc is valid after
the callback returns. This would avoid limiting the MAD layer
implementation, while also preventing changes to the existing users.

I'll generate a patch for this to clarify the idea.

- Sean

From mshefty at ichips.intel.com Fri Mar 11 09:59:40 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 11 Mar 2005 09:59:40 -0800
Subject: [openib-general] [PATCH] [MAD] fix handing user WC structure on the stack
Message-ID: <20050311095940.1edb0475.mshefty@ichips.intel.com>

This patch changes the ib_wc *wc field in ib_mad_recv_wc from pointing
to a structure on the stack to one allocated with the received MAD buffer.
This allows clients to access the field after their receive completion
handler has returned.

Signed-off-by: Sean Hefty

Index: mad.c
===================================================================
--- mad.c	(revision 1964)
+++ mad.c	(working copy)
@@ -1606,7 +1606,8 @@ static void ib_mad_recv_done_handler(str
 			 DMA_FROM_DEVICE);
 
 	/* Setup MAD receive work completion from "normal" work completion */
-	recv->header.recv_wc.wc = wc;
+	recv->header.wc = *wc;
+	recv->header.recv_wc.wc = &recv->header.wc;
 	recv->header.recv_wc.mad_len = sizeof(struct ib_mad);
 	recv->header.recv_wc.recv_buf.mad = &recv->mad.mad;
 	recv->header.recv_wc.recv_buf.grh = &recv->grh;
Index: mad_priv.h
===================================================================
--- mad_priv.h	(revision 1964)
+++ mad_priv.h	(working copy)
@@ -69,6 +69,7 @@ struct ib_mad_list_head {
 struct ib_mad_private_header {
 	struct ib_mad_list_head mad_list;
 	struct ib_mad_recv_wc recv_wc;
+	struct ib_wc wc;
 	DECLARE_PCI_UNMAP_ADDR(mapping)
 } __attribute__ ((packed));
 

From roland at topspin.com Fri Mar 11 13:07:45 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 11 Mar 2005 13:07:45 -0800
Subject: [openib-general] [Andrew Morton] inappropriate use of in_atomic()
In-Reply-To: <52oedq946k.fsf@topspin.com> (Roland Dreier's message of "Thu, 10 Mar 2005 20:50:27 -0800")
References: <52oedq946k.fsf@topspin.com>
Message-ID: <52mzt97uxq.fsf@topspin.com>

Does anyone have a patch that they would prefer to this solution
(unconditionally use GFP_ATOMIC)? If not I'll send this upstream so
that we at least don't have the potential for deadlock with
CONFIG_PREEMPT=n. We can always update the API to pass in a gfp_mask
later, if this causes problems.

 - R.

Here's the diff I have:

Index: infiniband/core/mad.c
===================================================================
--- infiniband/core/mad.c	(revision 1975)
+++ infiniband/core/mad.c	(working copy)
@@ -666,11 +666,7 @@ static int handle_outgoing_dr_smp(struct
 	if (!ret || !device->process_mad)
 		goto out;
 
-	if (in_atomic() || irqs_disabled())
-		alloc_flags = GFP_ATOMIC;
-	else
-		alloc_flags = GFP_KERNEL;
-	local = kmalloc(sizeof *local, alloc_flags);
+	local = kmalloc(sizeof *local, GFP_ATOMIC);
 	if (!local) {
 		ret = -ENOMEM;
 		printk(KERN_ERR PFX "No memory for ib_mad_local_private\n");
@@ -860,9 +856,7 @@ int ib_post_send_mad(struct ib_mad_agent
 	}
 
 	/* Allocate MAD send WR tracking structure */
-	mad_send_wr = kmalloc(sizeof *mad_send_wr,
-			      (in_atomic() || irqs_disabled()) ?
-			      GFP_ATOMIC : GFP_KERNEL);
+	mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_ATOMIC);
 	if (!mad_send_wr) {
 		printk(KERN_ERR PFX "No memory for "
 		       "ib_mad_send_wr_private\n");

From roland at topspin.com Fri Mar 11 13:08:16 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 11 Mar 2005 13:08:16 -0800
Subject: [openib-general] [PATCH] [MAD] fix handing user WC structure on the stack
In-Reply-To: <20050311095940.1edb0475.mshefty@ichips.intel.com> (Sean Hefty's message of "Fri, 11 Mar 2005 09:59:40 -0800")
References: <20050311095940.1edb0475.mshefty@ichips.intel.com>
Message-ID: <52is3x7uwv.fsf@topspin.com>

This looks good to me.

 - R.

From roland at topspin.com Fri Mar 11 13:35:34 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 11 Mar 2005 13:35:34 -0800
Subject: PATCH: use GFP_ATOMIC instead of in_atomic() (was Re: [openib-general] [Andrew Morton] inappropriate use of in_atomic())
In-Reply-To: <52mzt97uxq.fsf@topspin.com> (Roland Dreier's message of "Fri, 11 Mar 2005 13:07:45 -0800")
References: <52oedq946k.fsf@topspin.com> <52mzt97uxq.fsf@topspin.com>
Message-ID: <52d5u57tnd.fsf_-_@topspin.com>

Err, here's a fixed diff that doesn't use an uninitialized alloc_flags.

Any comments?

 - R.

Index: infiniband/core/mad.c
===================================================================
--- infiniband/core/mad.c	(revision 1977)
+++ infiniband/core/mad.c	(working copy)
@@ -646,7 +646,7 @@ static int handle_outgoing_dr_smp(struct
 				  struct ib_smp *smp,
 				  struct ib_send_wr *send_wr)
 {
-	int ret, alloc_flags, solicited;
+	int ret, solicited;
 	unsigned long flags;
 	struct ib_mad_local_private *local;
 	struct ib_mad_private *mad_priv;
@@ -666,11 +666,7 @@ static int handle_outgoing_dr_smp(struct
 	if (!ret || !device->process_mad)
 		goto out;
 
-	if (in_atomic() || irqs_disabled())
-		alloc_flags = GFP_ATOMIC;
-	else
-		alloc_flags = GFP_KERNEL;
-	local = kmalloc(sizeof *local, alloc_flags);
+	local = kmalloc(sizeof *local, GFP_ATOMIC);
 	if (!local) {
 		ret = -ENOMEM;
 		printk(KERN_ERR PFX "No memory for ib_mad_local_private\n");
@@ -678,7 +674,7 @@ static int handle_outgoing_dr_smp(struct
 	}
 	local->mad_priv = NULL;
 	local->recv_mad_agent = NULL;
-	mad_priv = kmem_cache_alloc(ib_mad_cache, alloc_flags);
+	mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_ATOMIC);
 	if (!mad_priv) {
 		ret = -ENOMEM;
 		printk(KERN_ERR PFX "No memory for local response MAD\n");
@@ -860,9 +856,7 @@ int ib_post_send_mad(struct ib_mad_agent
 	}
 
 	/* Allocate MAD send WR tracking structure */
-	mad_send_wr = kmalloc(sizeof *mad_send_wr,
-			      (in_atomic() || irqs_disabled()) ?
-			      GFP_ATOMIC : GFP_KERNEL);
+	mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_ATOMIC);
 	if (!mad_send_wr) {
 		printk(KERN_ERR PFX "No memory for "
 		       "ib_mad_send_wr_private\n");

From sean.hefty at intel.com Fri Mar 11 14:27:25 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 11 Mar 2005 14:27:25 -0800
Subject: PATCH: use GFP_ATOMIC instead of in_atomic() (was Re:[openib-general] [Andrew Morton] inappropriate use of in_atomic())
In-Reply-To: <52d5u57tnd.fsf_-_@topspin.com>
Message-ID:

>Err, here's a fixed diff that doesn't use an uninitialized alloc_flags.
>
>Any comments?

Looks fine to me.
I agree that we can change the API later if it becomes an issue.

- Sean

From libor at topspin.com Fri Mar 11 15:09:46 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 11 Mar 2005 15:09:46 -0800
Subject: [openib-general] [RFC] Userspace CM access.
Message-ID: <20050311150946.A31689@topspin.com>

Below is the source for the kernel portion of the userspace CM. I've
got enough of the userspace library to verify the basic functionality,
but it's not yet ready for general use. However, I wanted to get the
kernel portion posted for comment and checked-in now that the bulk of
it is complete.

The code is for the most part a pass through from userspace to the
kernel CM, plus synchronization, sanity checking, and the event model
is turned into a "get next event" interface.

Next step is the userspace library.

-Libor

Signed-off-by: Libor Michalek

Index: infiniband/core/Makefile
===================================================================
--- infiniband/core/Makefile	(revision 1979)
+++ infiniband/core/Makefile	(working copy)
@@ -1,6 +1,7 @@
 EXTRA_CFLAGS += -Idrivers/infiniband/include
 
-obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_cm.o ib_sa.o ib_umad.o
+obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_cm.o ib_sa.o ib_umad.o \
+	ib_ucm.o
 
 ib_core-y := packer.o ud_header.o verbs.o sysfs.o \
 	device.o fmr_pool.o cache.o
@@ -12,3 +13,5 @@
 ib_sa-y := sa_query.o
 
 ib_umad-y := user_mad.o
+
+ib_ucm-y := ucm.o
Index: infiniband/include/ib_user_cm.h
===================================================================
--- infiniband/include/ib_user_cm.h	(revision 0)
+++ infiniband/include/ib_user_cm.h	(revision 0)
@@ -0,0 +1,326 @@
+/*
+ * Copyright (c) 2005 Topspin Communications. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_user_verbs.h 1852 2005-02-21 22:21:01Z roland $ + */ + +#ifndef IB_USER_CM_H +#define IB_USER_CM_H + +#include + +#define IB_USER_CM_ABI_VERSION 1 + +enum { + IB_USER_CM_CMD_CREATE_ID, + IB_USER_CM_CMD_DESTORY_ID, + IB_USER_CM_CMD_ATTR_ID, + + IB_USER_CM_CMD_LISTEN, + IB_USER_CM_CMD_ESTABLISH, + + IB_USER_CM_CMD_SEND_REQ, + IB_USER_CM_CMD_SEND_REP, + IB_USER_CM_CMD_SEND_RTU, + IB_USER_CM_CMD_SEND_DREQ, + IB_USER_CM_CMD_SEND_DREP, + IB_USER_CM_CMD_SEND_REJ, + IB_USER_CM_CMD_SEND_MRA, + IB_USER_CM_CMD_SEND_LAP, + IB_USER_CM_CMD_SEND_APR, + IB_USER_CM_CMD_SEND_SIDR_REQ, + IB_USER_CM_CMD_SEND_SIDR_REP, + IB_USER_CM_CMD_QP_ATTR, + + IB_USER_CM_CMD_EVENT, +}; +/* + * command ABI structures. 
+ */ +struct ib_ucm_cmd_hdr { + __u32 cmd; + __u16 in; + __u16 out; +}; + +struct ib_ucm_create_id { + __u64 response; +}; + +struct ib_ucm_create_id_resp { + __u32 id; +}; + +struct ib_ucm_destroy_id { + __u32 id; +}; + +struct ib_ucm_attr_id { + __u64 response; + __u32 id; +}; + +struct ib_ucm_attr_id_resp { + __u64 service_id; + __u64 service_mask; + __u32 state; + __u32 lap_state; + __u32 local_id; + __u32 remote_id; +}; + +struct ib_ucm_listen { + __u64 service_id; + __u64 service_mask; + __u32 id; +}; + +struct ib_ucm_establish { + __u32 id; +}; + +struct ib_ucm_private_data { + __u64 data; + __u32 id; + __u8 len; + __u8 reserved[3]; +}; + +struct ib_ucm_path_rec { + __u8 dgid[16]; + __u8 sgid[16]; + __u16 dlid; + __u16 slid; + __u32 raw_traffic; + __u32 flow_label; + __u32 reversible; + __u32 mtu; + __u16 pkey; + __u8 hop_limit; + __u8 traffic_class; + __u8 numb_path; + __u8 sl; + __u8 mtu_selector; + __u8 rate_selector; + __u8 rate; + __u8 packet_life_time_selector; + __u8 packet_life_time; + __u8 preference; +}; + +struct ib_ucm_req { + __u32 id; + __u32 qpn; + __u32 qp_type; + __u32 psn; + __u64 sid; + + __u64 primary_path; + __u64 alternate_path; + __u8 len; + __u8 peer_to_peer; + __u8 responder_resources; + __u8 initiator_depth; + __u8 remote_cm_response_timeout; + __u8 flow_control; + __u8 local_cm_response_timeout; + __u8 retry_count; + __u8 rnr_retry_count; + __u8 max_cm_retries; + __u8 srq; + __u8 reserved[1]; +}; + +struct ib_ucm_rep { + __u64 data; + __u32 id; + __u32 qpn; + __u32 psn; + __u8 len; + __u8 responder_resources; + __u8 initiator_depth; + __u8 target_ack_delay; + __u8 failover_accepted; + __u8 flow_control; + __u8 rnr_retry_count; + __u8 srq; +}; + +struct ib_ucm_info { + __u32 id; + __u32 status; + __u64 info; + __u64 data; + __u8 info_len; + __u8 data_len; + __u8 reserved[2]; +}; + +struct ib_ucm_mra { + __u64 data; + __u32 id; + __u8 len; + __u8 timeout; + __u8 reserved[2]; +}; + +struct ib_ucm_lap { + __u64 path; + __u64 data; + 
__u32 id; + __u8 len; + __u8 reserved[3]; +}; + +struct ib_ucm_sidr_req { + __u32 id; + __u32 timeout; + __u64 sid; + __u64 data; + __u64 path; + __u16 pkey; + __u8 len; + __u8 max_cm_retries; +}; + +struct ib_ucm_sidr_rep { + __u32 id; + __u32 qpn; + __u32 qkey; + __u32 status; + __u64 info; + __u64 data; + __u8 info_len; + __u8 data_len; + __u8 reserved[2]; +}; +/* + * event notification ABI structures. + */ +struct ib_ucm_event_get { + __u64 response; + __u64 data; + __u64 info; + __u8 data_len; + __u8 info_len; + __u8 reserved[2]; +}; + +struct ib_ucm_req_event_resp { + __u32 listen_id; + /* device */ + /* port */ + struct ib_ucm_path_rec primary_path; + struct ib_ucm_path_rec alternate_path; + __u64 remote_ca_guid; + __u32 remote_qkey; + __u32 remote_qpn; + __u32 qp_type; + __u32 starting_psn; + __u8 responder_resources; + __u8 initiator_depth; + __u8 local_cm_response_timeout; + __u8 flow_control; + __u8 remote_cm_response_timeout; + __u8 retry_count; + __u8 rnr_retry_count; + __u8 srq; +}; + +struct ib_ucm_rep_event_resp { + __u64 remote_ca_guid; + __u32 remote_qkey; + __u32 remote_qpn; + __u32 starting_psn; + __u8 responder_resources; + __u8 initiator_depth; + __u8 target_ack_delay; + __u8 failover_accepted; + __u8 flow_control; + __u8 rnr_retry_count; + __u8 srq; + __u8 reserved[1]; +}; + +struct ib_ucm_rej_event_resp { + __u32 reason; + /* ari in ib_ucm_event_get info field. */ +}; + +struct ib_ucm_mra_event_resp { + __u8 timeout; + __u8 reserved[3]; +}; + +struct ib_ucm_lap_event_resp { + struct ib_ucm_path_rec path; +}; + +struct ib_ucm_apr_event_resp { + __u32 status; + /* apr info in ib_ucm_event_get info field. */ +}; + +struct ib_ucm_sidr_req_event_resp { + __u32 listen_id; + /* device */ + /* port */ + __u16 pkey; + __u8 reserved[2]; +}; + +struct ib_ucm_sidr_rep_event_resp { + __u32 status; + __u32 qkey; + __u32 qpn; + /* info in ib_ucm_event_get info field. 
*/ +}; + +struct ib_ucm_event_resp { + __u32 id; + __u32 state; + __u32 event; + union { + struct ib_ucm_req_event_resp req_resp; + struct ib_ucm_rep_event_resp rep_resp; + struct ib_ucm_rej_event_resp rej_resp; + struct ib_ucm_mra_event_resp mra_resp; + struct ib_ucm_lap_event_resp lap_resp; + struct ib_ucm_apr_event_resp apr_resp; + + struct ib_ucm_sidr_req_event_resp sidr_req_resp; + struct ib_ucm_sidr_rep_event_resp sidr_rep_resp; + + __u32 send_status; + } u; +}; + +#endif /* IB_USER_CM_H */ Index: infiniband/core/ucm.c =================================================================== --- infiniband/core/ucm.c (revision 0) +++ infiniband/core/ucm.c (revision 0) @@ -0,0 +1,1388 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "ucm.h" + +MODULE_AUTHOR("Libor Michalek"); +MODULE_DESCRIPTION("InfiniBand userspace Connection Manager access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UCM_MAJOR = 231, + IB_UCM_MINOR = 255 +}; + +#define IB_UCM_DEV MKDEV(IB_UCM_MAJOR, IB_UCM_MINOR) + +static struct semaphore ctx_id_mutex; +static struct idr ctx_id_table; +static int ctx_id_rover = 0; + +static struct ib_ucm_context *ib_ucm_ctx_get(int id) +{ + struct ib_ucm_context *ctx; + + down(&ctx_id_mutex); + ctx = idr_find(&ctx_id_table, id); + if (ctx) + ctx->ref++; + up(&ctx_id_mutex); + + return ctx; +} + +static void ib_ucm_ctx_put(struct ib_ucm_context *ctx) +{ + struct ib_ucm_event *uevent; + + down(&ctx_id_mutex); + + ctx->ref--; + if (!ctx->ref) + idr_remove(&ctx_id_table, ctx->id); + + up(&ctx_id_mutex); + + if (ctx->ref) + return; + + down(&ctx->file->mutex); + + list_del(&ctx->file_list); + while (!list_empty(&ctx->events)) { + + uevent = list_entry(ctx->events.next, + struct ib_ucm_event, ctx_list); + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + + kfree(uevent); + } + + up(&ctx->file->mutex); + + printk(KERN_ERR "UCM: Destroyed CM ID <%d>\n", ctx->id); + + (void)ib_destroy_cm_id(ctx->cm_id); + kfree(ctx); +} + +static struct ib_ucm_context *ib_ucm_ctx_alloc(struct ib_ucm_file *file) +{ + struct ib_ucm_context *ctx; + int result; + + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + ctx->ref = 1; /* user reference */ + ctx->file = file; + + INIT_LIST_HEAD(&ctx->events); + init_MUTEX(&ctx->mutex); + + list_add_tail(&ctx->file_list, 
&file->ctxs); + + ctx_id_rover = (ctx_id_rover + 1) & INT_MAX; +retry: + result = idr_pre_get(&ctx_id_table, GFP_KERNEL); + if (!result) + goto error; + + down(&ctx_id_mutex); + result = idr_get_new_above(&ctx_id_table, ctx, ctx_id_rover, &ctx->id); + up(&ctx_id_mutex); + + if (result == -EAGAIN) + goto retry; + if (result) + goto error; + + printk(KERN_ERR "UCM: Allocated CM ID <%d>\n", ctx->id); + + return ctx; +error: + list_del(&ctx->file_list); + kfree(ctx); + + return NULL; +} +/* + * Event portion of the API, handle CM events + * and allow event polling. + */ +static void ib_ucm_event_path_get(struct ib_ucm_path_rec *upath, + struct ib_sa_path_rec *kpath) +{ + memcpy(upath->dgid, kpath->dgid.raw, sizeof(union ib_gid)); + memcpy(upath->sgid, kpath->sgid.raw, sizeof(union ib_gid)); + + upath->dlid = kpath->dlid; + upath->slid = kpath->slid; + upath->raw_traffic = kpath->raw_traffic; + upath->flow_label = kpath->flow_label; + upath->hop_limit = kpath->hop_limit; + upath->traffic_class = kpath->traffic_class; + upath->reversible = kpath->reversible; + upath->numb_path = kpath->numb_path; + upath->pkey = kpath->pkey; + upath->sl = kpath->sl; + upath->mtu_selector = kpath->mtu_selector; + upath->mtu = kpath->mtu; + upath->rate_selector = kpath->rate_selector; + upath->rate = kpath->rate; + upath->packet_life_time = kpath->packet_life_time; + upath->preference = kpath->preference; + + upath->packet_life_time_selector = + kpath->packet_life_time_selector; +} + +static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq, + struct ib_cm_req_event_param *kreq) +{ + ureq->listen_id = (int)kreq->listen_id->context; + + ureq->remote_ca_guid = kreq->remote_ca_guid; + ureq->remote_qkey = kreq->remote_qkey; + ureq->remote_qpn = kreq->remote_qpn; + ureq->qp_type = kreq->qp_type; + ureq->starting_psn = kreq->starting_psn; + ureq->responder_resources = kreq->responder_resources; + ureq->initiator_depth = kreq->initiator_depth; + ureq->local_cm_response_timeout = 
kreq->local_cm_response_timeout; + ureq->flow_control = kreq->flow_control; + ureq->remote_cm_response_timeout = kreq->remote_cm_response_timeout; + ureq->retry_count = kreq->retry_count; + ureq->rnr_retry_count = kreq->rnr_retry_count; + ureq->srq = kreq->srq; + + ib_ucm_event_path_get(&ureq->primary_path, kreq->primary_path); + ib_ucm_event_path_get(&ureq->alternate_path, kreq->alternate_path); +} + +static void ib_ucm_event_rep_get(struct ib_ucm_rep_event_resp *urep, + struct ib_cm_rep_event_param *krep) +{ + urep->remote_ca_guid = krep->remote_ca_guid; + urep->remote_qkey = krep->remote_qkey; + urep->remote_qpn = krep->remote_qpn; + urep->starting_psn = krep->starting_psn; + urep->responder_resources = krep->responder_resources; + urep->initiator_depth = krep->initiator_depth; + urep->target_ack_delay = krep->target_ack_delay; + urep->failover_accepted = krep->failover_accepted; + urep->flow_control = krep->flow_control; + urep->rnr_retry_count = krep->rnr_retry_count; + urep->srq = krep->srq; +} + +static void ib_ucm_event_rej_get(struct ib_ucm_rej_event_resp *urej, + struct ib_cm_rej_event_param *krej) +{ + urej->reason = krej->reason; +} + +static void ib_ucm_event_mra_get(struct ib_ucm_mra_event_resp *umra, + struct ib_cm_mra_event_param *kmra) +{ + umra->timeout = kmra->service_timeout; +} + +static void ib_ucm_event_lap_get(struct ib_ucm_lap_event_resp *ulap, + struct ib_cm_lap_event_param *klap) +{ + ib_ucm_event_path_get(&ulap->path, klap->alternate_path); +} + +static void ib_ucm_event_apr_get(struct ib_ucm_apr_event_resp *uapr, + struct ib_cm_apr_event_param *kapr) +{ + uapr->status = kapr->ap_status; +} + +static void ib_ucm_event_sidr_req_get(struct ib_ucm_sidr_req_event_resp *ureq, + struct ib_cm_sidr_req_event_param *kreq) +{ + ureq->listen_id = (int)kreq->listen_id->context; + ureq->pkey = kreq->pkey; +} + +static void ib_ucm_event_sidr_rep_get(struct ib_ucm_sidr_rep_event_resp *urep, + struct ib_cm_sidr_rep_event_param *krep) +{ + urep->status = 
krep->status; + urep->qkey = krep->qkey; + urep->qpn = krep->qpn; +}; + +static int ib_ucm_event_process(struct ib_cm_event *evt, + struct ib_ucm_event *uvt) +{ + void *info = NULL; + int result; + + switch (evt->event) { + case IB_CM_REQ_RECEIVED: + ib_ucm_event_req_get(&uvt->resp.u.req_resp, + &evt->param.req_rcvd); + uvt->data_len = IB_CM_REQ_PRIVATE_DATA_SIZE; + + break; + case IB_CM_REP_RECEIVED: + ib_ucm_event_rep_get(&uvt->resp.u.rep_resp, + &evt->param.rep_rcvd); + uvt->data_len = IB_CM_REP_PRIVATE_DATA_SIZE; + + break; + case IB_CM_RTU_RECEIVED: + uvt->data_len = IB_CM_RTU_PRIVATE_DATA_SIZE; + + break; + case IB_CM_DREQ_RECEIVED: + uvt->data_len = IB_CM_DREQ_PRIVATE_DATA_SIZE; + + break; + case IB_CM_DREP_RECEIVED: + uvt->data_len = IB_CM_DREP_PRIVATE_DATA_SIZE; + + break; + case IB_CM_MRA_RECEIVED: + ib_ucm_event_mra_get(&uvt->resp.u.mra_resp, + &evt->param.mra_rcvd); + uvt->data_len = IB_CM_MRA_PRIVATE_DATA_SIZE; + + break; + case IB_CM_REJ_RECEIVED: + ib_ucm_event_rej_get(&uvt->resp.u.rej_resp, + &evt->param.rej_rcvd); + uvt->data_len = IB_CM_REJ_PRIVATE_DATA_SIZE; + uvt->info_len = evt->param.rej_rcvd.ari_length; + info = evt->param.rej_rcvd.ari; + + break; + case IB_CM_LAP_RECEIVED: + ib_ucm_event_lap_get(&uvt->resp.u.lap_resp, + &evt->param.lap_rcvd); + uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE; + + break; + case IB_CM_APR_RECEIVED: + ib_ucm_event_apr_get(&uvt->resp.u.apr_resp, + &evt->param.apr_rcvd); + uvt->data_len = IB_CM_APR_PRIVATE_DATA_SIZE; + uvt->info_len = evt->param.apr_rcvd.info_len; + info = evt->param.apr_rcvd.apr_info; + + break; + case IB_CM_SIDR_REQ_RECEIVED: + ib_ucm_event_sidr_req_get(&uvt->resp.u.sidr_req_resp, + &evt->param.sidr_req_rcvd); + uvt->data_len = IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE; + + break; + case IB_CM_SIDR_REP_RECEIVED: + ib_ucm_event_sidr_rep_get(&uvt->resp.u.sidr_rep_resp, + &evt->param.sidr_rep_rcvd); + uvt->data_len = IB_CM_SIDR_REP_PRIVATE_DATA_SIZE; + uvt->info_len = evt->param.sidr_rep_rcvd.info_len; + info 
= evt->param.sidr_rep_rcvd.info; + + break; + default: + uvt->resp.u.send_status = evt->param.send_status; + + break; + } + + if (uvt->data_len && evt->private_data) { + + uvt->data = kmalloc(uvt->data_len, GFP_KERNEL); + if (!uvt->data) { + result = -ENOMEM; + goto error; + } + + memcpy(uvt->data, evt->private_data, uvt->data_len); + } + + if (uvt->info_len && info) { + + uvt->info = kmalloc(uvt->info_len, GFP_KERNEL); + if (!uvt->info) { + result = -ENOMEM; + goto error; + } + + memcpy(uvt->info, info, uvt->info_len); + } + + return 0; +error: + if (uvt->info) + kfree(uvt->info); + if (uvt->data) + kfree(uvt->data); + return result; +} + +static int ib_ucm_event_handler(struct ib_cm_id *cm_id, + struct ib_cm_event *event) +{ + struct ib_ucm_event *uevent; + struct ib_ucm_context *ctx; + int result = 0; + int id; + + /* + * lookup correct context based on event type. + */ + switch (event->event) { + case IB_CM_REQ_RECEIVED: + id = (int)event->param.req_rcvd.listen_id->context; + break; + case IB_CM_SIDR_REQ_RECEIVED: + id = (int)event->param.sidr_req_rcvd.listen_id->context; + break; + default: + id = (int)cm_id->context; + break; + } + + ctx = ib_ucm_ctx_get(id); + if (!ctx) + return -ENOENT; + + if (event->event == IB_CM_REQ_RECEIVED || + event->event == IB_CM_SIDR_REQ_RECEIVED) + id = IB_UCM_CM_ID_INVALID; + + uevent = kmalloc(sizeof(*uevent), GFP_KERNEL); + if (!uevent) { + result = -ENOMEM; + goto done; + } + + memset(uevent, 0, sizeof(*uevent)); + + uevent->resp.id = id; + uevent->resp.event = event->event; + uevent->resp.state = cm_id->state; + + result = ib_ucm_event_process(event, uevent); + if (result) + goto done; + + uevent->ctx = ctx; + + down(&ctx->file->mutex); + + list_add_tail(&uevent->file_list, &ctx->file->events); + list_add_tail(&uevent->ctx_list, &ctx->events); + + wake_up_interruptible(&ctx->file->poll_wait); + + up(&ctx->file->mutex); +done: + ctx->error = result; + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static 
ssize_t ib_ucm_qp_event(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_context *ctx; + struct ib_ucm_event_get cmd; + struct ib_ucm_event *uevent = NULL; + int result = 0; + DEFINE_WAIT(wait); + + if (out_len < sizeof(struct ib_ucm_event_resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + /* + * wait + */ + down(&file->mutex); + + while (list_empty(&file->events)) { + + if (file->filp->f_flags & O_NONBLOCK) { + result = -EAGAIN; + break; + } + + if (signal_pending(current)) { + result = -ERESTARTSYS; + break; + } + + prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE); + + up(&file->mutex); + schedule(); + down(&file->mutex); + + finish_wait(&file->poll_wait, &wait); + } + + if (result) + goto done; + + uevent = list_entry(file->events.next, struct ib_ucm_event, file_list); + + if (uevent->resp.id != IB_UCM_CM_ID_INVALID) + goto user; + + ctx = ib_ucm_ctx_alloc(file); + if (!ctx) { + result = -ENOMEM; + goto done; + } + + uevent->resp.id = ctx->id; + +user: + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &uevent->resp, sizeof(uevent->resp))) { + result = -EFAULT; + goto done; + } + + if (uevent->data) { + + if (cmd.data_len < uevent->data_len) { + result = -ENOMEM; + goto done; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.data, + uevent->data, uevent->data_len)) { + result = -EFAULT; + goto done; + } + } + + if (uevent->info) { + + if (cmd.info_len < uevent->info_len) { + result = -ENOMEM; + goto done; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.info, + uevent->info, uevent->info_len)) { + result = -EFAULT; + goto done; + } + } + + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + + if (uevent->data) + kfree(uevent->data); + if (uevent->info) + kfree(uevent->info); + kfree(uevent); +done: + up(&file->mutex); + return result; +} + + +static ssize_t ib_ucm_create_id(struct ib_ucm_file *file, + const char __user
*inbuf, + int in_len, int out_len) +{ + struct ib_ucm_create_id cmd; + struct ib_ucm_create_id_resp resp; + struct ib_ucm_context *ctx; + int result; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_alloc(file); + if (!ctx) + return -ENOMEM; + + ctx->cm_id = ib_create_cm_id(ib_ucm_event_handler, + (void *)(unsigned long)ctx->id); + if (!ctx->cm_id) { + result = -ENOMEM; + goto err_cm; + } + + resp.id = ctx->id; + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) { + result = -EFAULT; + goto err_ret; + } + + return 0; +err_ret: + (void)ib_destroy_cm_id(ctx->cm_id); +err_cm: + ib_ucm_ctx_put(ctx); /* user reference */ + + return result; +} + +static ssize_t ib_ucm_destroy_id(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_destroy_id cmd; + struct ib_ucm_context *ctx; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + ib_ucm_ctx_put(ctx); /* user reference */ + ib_ucm_ctx_put(ctx); /* func reference */ + + return 0; +} + +static ssize_t ib_ucm_attr_id(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_attr_id_resp resp; + struct ib_ucm_attr_id cmd; + struct ib_ucm_context *ctx; + int result = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + down(&ctx->file->mutex); + if (ctx->file != file) { + result = -EINVAL; + goto done; + } + + resp.service_id = ctx->cm_id->service_id; + resp.service_mask = ctx->cm_id->service_mask; + resp.state = ctx->cm_id->state; + resp.lap_state = ctx->cm_id->lap_state; + resp.local_id = ctx->cm_id->local_id; + resp.remote_id = ctx->cm_id->remote_id; + + if (copy_to_user((void __user *)(unsigned 
long)cmd.response, + &resp, sizeof(resp))) + result = -EFAULT; + +done: + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static ssize_t ib_ucm_listen(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_listen cmd; + struct ib_ucm_context *ctx; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_cm_listen(ctx->cm_id, cmd.service_id, + cmd.service_mask); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static ssize_t ib_ucm_establish(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_establish cmd; + struct ib_ucm_context *ctx; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_cm_establish(ctx->cm_id); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static int ib_ucm_alloc_data(void **dest, u64 src, u32 len) +{ + void *data; + + *dest = NULL; + + if (!len) + return 0; + + data = kmalloc(len, GFP_KERNEL); + if (!data) + return -ENOMEM; + + if (copy_from_user(data, (void __user *)(unsigned long)src, len)) { + kfree(data); + return -EFAULT; + } + + *dest = data; + return 0; +} + +static int ib_ucm_path_get(struct ib_sa_path_rec **path, u64 src) +{ + struct ib_ucm_path_rec ucm_path; + struct ib_sa_path_rec *sa_path; + + *path = NULL; + + if (!src) + return 0; + + sa_path = kmalloc(sizeof(*sa_path), GFP_KERNEL); + if (!sa_path) + return -ENOMEM; + + if (copy_from_user(&ucm_path, (void __user *)(unsigned long)src, + sizeof(ucm_path))) { + + kfree(sa_path); + return -EFAULT; 
+ } + + memcpy(sa_path->dgid.raw, ucm_path.dgid, sizeof(union ib_gid)); + memcpy(sa_path->sgid.raw, ucm_path.sgid, sizeof(union ib_gid)); + + sa_path->dlid = ucm_path.dlid; + sa_path->slid = ucm_path.slid; + sa_path->raw_traffic = ucm_path.raw_traffic; + sa_path->flow_label = ucm_path.flow_label; + sa_path->hop_limit = ucm_path.hop_limit; + sa_path->traffic_class = ucm_path.traffic_class; + sa_path->reversible = ucm_path.reversible; + sa_path->numb_path = ucm_path.numb_path; + sa_path->pkey = ucm_path.pkey; + sa_path->sl = ucm_path.sl; + sa_path->mtu_selector = ucm_path.mtu_selector; + sa_path->mtu = ucm_path.mtu; + sa_path->rate_selector = ucm_path.rate_selector; + sa_path->rate = ucm_path.rate; + sa_path->packet_life_time = ucm_path.packet_life_time; + sa_path->preference = ucm_path.preference; + + sa_path->packet_life_time_selector = + ucm_path.packet_life_time_selector; + + *path = sa_path; + return 0; +} + +static ssize_t ib_ucm_send_req(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_cm_req_param param; + struct ib_ucm_context *ctx; + struct ib_ucm_req cmd; + int result; + + param.private_data = NULL; + param.primary_path = NULL; + param.alternate_path = NULL; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&param.private_data, cmd.data, cmd.len); + if (result) + goto done; + + result = ib_ucm_path_get(&param.primary_path, cmd.primary_path); + if (result) + goto done; + + result = ib_ucm_path_get(&param.alternate_path, cmd.alternate_path); + if (result) + goto done; + + param.private_data_len = cmd.len; + param.service_id = cmd.sid; + param.qp_num = cmd.qpn; + param.qp_type = cmd.qp_type; + param.starting_psn = cmd.psn; + param.peer_to_peer = cmd.peer_to_peer; + param.responder_resources = cmd.responder_resources; + param.initiator_depth = cmd.initiator_depth; + param.remote_cm_response_timeout = cmd.remote_cm_response_timeout; + param.flow_control = cmd.flow_control; +
param.local_cm_response_timeout = cmd.local_cm_response_timeout; + param.retry_count = cmd.retry_count; + param.rnr_retry_count = cmd.rnr_retry_count; + param.max_cm_retries = cmd.max_cm_retries; + param.srq = cmd.srq; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_req(ctx->cm_id, &param); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (param.private_data) + kfree(param.private_data); + if (param.primary_path) + kfree(param.primary_path); + if (param.alternate_path) + kfree(param.alternate_path); + + return result; +} + +static ssize_t ib_ucm_send_rep(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_cm_rep_param param; + struct ib_ucm_context *ctx; + struct ib_ucm_rep cmd; + int result; + + param.private_data = NULL; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&param.private_data, cmd.data, cmd.len); + if (result) + return result; + + param.qp_num = cmd.qpn; + param.starting_psn = cmd.psn; + param.private_data_len = cmd.len; + param.responder_resources = cmd.responder_resources; + param.initiator_depth = cmd.initiator_depth; + param.target_ack_delay = cmd.target_ack_delay; + param.failover_accepted = cmd.failover_accepted; + param.flow_control = cmd.flow_control; + param.rnr_retry_count = cmd.rnr_retry_count; + param.srq = cmd.srq; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_rep(ctx->cm_id, &param); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (param.private_data) + kfree(param.private_data); + + return result; +} + +static ssize_t ib_ucm_send_private_data(struct ib_ucm_file *file, + const char __user *inbuf, int in_len, +
int (*func)(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len)) +{ + struct ib_ucm_private_data cmd; + struct ib_ucm_context *ctx; + void *private_data = NULL; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&private_data, cmd.data, cmd.len); + if (result) + return result; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = func(ctx->cm_id, private_data, cmd.len); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (private_data) + kfree(private_data); + + return result; +} + +static ssize_t ib_ucm_send_rtu(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_private_data(file, inbuf, in_len, ib_send_cm_rtu); +} + +static ssize_t ib_ucm_send_dreq(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_private_data(file, inbuf, in_len, ib_send_cm_dreq); +} + +static ssize_t ib_ucm_send_drep(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_private_data(file, inbuf, in_len, ib_send_cm_drep); +} + +static ssize_t ib_ucm_send_info(struct ib_ucm_file *file, + const char __user *inbuf, int in_len, + int (*func)(struct ib_cm_id *cm_id, + int status, + void *info, + u8 info_len, + void *data, + u8 data_len)) +{ + struct ib_ucm_context *ctx; + struct ib_ucm_info cmd; + void *data = NULL; + void *info = NULL; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&data, cmd.data, cmd.data_len); + if (result) + goto done; + + result = ib_ucm_alloc_data(&info, cmd.info, cmd.info_len); + if (result) + goto done; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if 
(ctx->file != file) + result = -EINVAL; + else + result = func(ctx->cm_id, cmd.status, + info, cmd.info_len, + data, cmd.data_len); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (data) + kfree(data); + if (info) + kfree(info); + + return result; +} + +static ssize_t ib_ucm_send_rej(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_info(file, inbuf, in_len, (void *)ib_send_cm_rej); +} + +static ssize_t ib_ucm_send_apr(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_info(file, inbuf, in_len, (void *)ib_send_cm_apr); +} + +static ssize_t ib_ucm_send_mra(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_context *ctx; + struct ib_ucm_mra cmd; + void *data = NULL; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&data, cmd.data, cmd.len); + if (result) + return result; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_mra(ctx->cm_id, cmd.timeout, + data, cmd.len); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (data) + kfree(data); + + return result; +} + +static ssize_t ib_ucm_send_lap(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_context *ctx; + struct ib_sa_path_rec *path = NULL; + struct ib_ucm_lap cmd; + void *data = NULL; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&data, cmd.data, cmd.len); + if (result) + goto done; + + result = ib_ucm_path_get(&path, cmd.path); + if (result) + goto done; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if 
(ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_lap(ctx->cm_id, path, data, cmd.len); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (data) + kfree(data); + if (path) + kfree(path); + + return result; +} + +static ssize_t ib_ucm_send_sidr_req(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_cm_sidr_req_param param; + struct ib_ucm_context *ctx; + struct ib_ucm_sidr_req cmd; + int result; + + param.private_data = NULL; + param.path = NULL; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&param.private_data, cmd.data, cmd.len); + if (result) + goto done; + + result = ib_ucm_path_get(&param.path, cmd.path); + if (result) + goto done; + + param.private_data_len = cmd.len; + param.service_id = cmd.sid; + param.timeout_ms = cmd.timeout; + param.max_cm_retries = cmd.max_cm_retries; + param.pkey = cmd.pkey; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_sidr_req(ctx->cm_id, &param); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (param.private_data) + kfree(param.private_data); + if (param.path) + kfree(param.path); + + return result; +} + +static ssize_t ib_ucm_send_sidr_rep(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_cm_sidr_rep_param param; + struct ib_ucm_sidr_rep cmd; + struct ib_ucm_context *ctx; + int result; + + param.info = NULL; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&param.private_data, + cmd.data, cmd.data_len); + if (result) + goto done; + + result = ib_ucm_alloc_data(&param.info, cmd.info, cmd.info_len); + if (result) + goto done; + + param.qp_num = cmd.qpn; + param.qkey = cmd.qkey; + param.status = cmd.status; + param.info_length =
cmd.info_len; + param.private_data_len = cmd.data_len; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_sidr_rep(ctx->cm_id, &param); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (param.private_data) + kfree(param.private_data); + if (param.info) + kfree(param.info); + + return result; +} + +static ssize_t ib_ucm_qp_attr(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return 0; +} + +static ssize_t (*ucm_cmd_table[])(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) = { + [IB_USER_CM_CMD_CREATE_ID] = ib_ucm_create_id, + [IB_USER_CM_CMD_DESTORY_ID] = ib_ucm_destroy_id, + [IB_USER_CM_CMD_ATTR_ID] = ib_ucm_attr_id, + [IB_USER_CM_CMD_LISTEN] = ib_ucm_listen, + [IB_USER_CM_CMD_ESTABLISH] = ib_ucm_establish, + [IB_USER_CM_CMD_SEND_REQ] = ib_ucm_send_req, + [IB_USER_CM_CMD_SEND_REP] = ib_ucm_send_rep, + [IB_USER_CM_CMD_SEND_RTU] = ib_ucm_send_rtu, + [IB_USER_CM_CMD_SEND_DREQ] = ib_ucm_send_dreq, + [IB_USER_CM_CMD_SEND_DREP] = ib_ucm_send_drep, + [IB_USER_CM_CMD_SEND_REJ] = ib_ucm_send_rej, + [IB_USER_CM_CMD_SEND_MRA] = ib_ucm_send_mra, + [IB_USER_CM_CMD_SEND_LAP] = ib_ucm_send_lap, + [IB_USER_CM_CMD_SEND_APR] = ib_ucm_send_apr, + [IB_USER_CM_CMD_SEND_SIDR_REQ] = ib_ucm_send_sidr_req, + [IB_USER_CM_CMD_SEND_SIDR_REP] = ib_ucm_send_sidr_rep, + [IB_USER_CM_CMD_QP_ATTR] = ib_ucm_qp_attr, + [IB_USER_CM_CMD_EVENT] = ib_ucm_qp_event, +}; + +static ssize_t ib_ucm_write(struct file *filp, const char __user *buf, + size_t len, loff_t *pos) +{ + struct ib_ucm_file *file = filp->private_data; + struct ib_ucm_cmd_hdr hdr; + ssize_t result; + + if (len < sizeof(hdr)) + return -EINVAL; + + if (copy_from_user(&hdr, buf, sizeof(hdr))) + return -EFAULT; + + printk(KERN_ERR "UCM: Write.
cmd <%d> in <%d> out <%d> len <%d>\n", + hdr.cmd, hdr.in, hdr.out, len); + + if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucm_cmd_table)) + return -EINVAL; + + if (hdr.in + sizeof(hdr) > len) + return -EINVAL; + + result = ucm_cmd_table[hdr.cmd](file, buf + sizeof(hdr), + hdr.in, hdr.out); + if (!result) + result = len; + + return result; +} + +static unsigned int ib_ucm_poll(struct file *filp, + struct poll_table_struct *wait) +{ + struct ib_ucm_file *file = filp->private_data; + unsigned int mask = 0; + + poll_wait(filp, &file->poll_wait, wait); + + if (!list_empty(&file->events)) + mask = POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_ucm_open(struct inode *inode, struct file *filp) +{ + struct ib_ucm_file *file; + + file = kmalloc(sizeof(*file), GFP_KERNEL); + if (!file) + return -ENOMEM; + + INIT_LIST_HEAD(&file->events); + INIT_LIST_HEAD(&file->ctxs); + init_waitqueue_head(&file->poll_wait); + + init_MUTEX(&file->mutex); + + filp->private_data = file; + file->filp = filp; + + printk(KERN_ERR "UCM: Created struct\n"); + + return 0; +} + +static int ib_ucm_close(struct inode *inode, struct file *filp) +{ + struct ib_ucm_file *file = filp->private_data; + struct ib_ucm_context *ctx; + + down(&file->mutex); + + while (!list_empty(&file->ctxs)) { + + ctx = list_entry(file->ctxs.next, + struct ib_ucm_context, file_list); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* user reference */ + down(&file->mutex); + } + + up(&file->mutex); + + kfree(file); + + printk(KERN_ERR "UCM: Deleted struct\n"); + return 0; +} + +static struct file_operations ib_ucm_fops = { + .owner = THIS_MODULE, + .open = ib_ucm_open, + .release = ib_ucm_close, + .write = ib_ucm_write, + .poll = ib_ucm_poll, +}; + + +static struct class_simple *ib_ucm_class; +static struct cdev ib_ucm_cdev; + +static int __init ib_ucm_init(void) +{ + int result; + + result = register_chrdev_region(IB_UCM_DEV, 1, "infiniband_cm"); + if (result) { + printk(KERN_ERR "UCM: Error <%d> registering dev\n", 
result); + goto err_chr; + } + + cdev_init(&ib_ucm_cdev, &ib_ucm_fops); + + result = cdev_add(&ib_ucm_cdev, IB_UCM_DEV, 1); + if (result) { + printk(KERN_ERR "UCM: Error <%d> adding cdev\n", result); + goto err_cdev; + } + + ib_ucm_class = class_simple_create(THIS_MODULE, "ucm"); + if (IS_ERR(ib_ucm_class)) { + result = PTR_ERR(ib_ucm_class); + printk(KERN_ERR "UCM: Error <%d> creating class\n", result); + goto err_class; + } + + class_simple_device_add(ib_ucm_class, + IB_UCM_DEV, + NULL, + "ucm"); + + devfs_mk_cdev(IB_UCM_DEV, + S_IFCHR|S_IRUGO|S_IWUGO, + "infiniband/ucm"); + + idr_init(&ctx_id_table); + init_MUTEX(&ctx_id_mutex); + + return 0; +err_class: + cdev_del(&ib_ucm_cdev); +err_cdev: + unregister_chrdev_region(IB_UCM_DEV, 1); +err_chr: + return result; +} + +static void __exit ib_ucm_cleanup(void) +{ + devfs_remove("infiniband/ucm"); + class_simple_device_remove(IB_UCM_DEV); + class_simple_destroy(ib_ucm_class); + cdev_del(&ib_ucm_cdev); + unregister_chrdev_region(IB_UCM_DEV, 1); +} + +module_init(ib_ucm_init); +module_exit(ib_ucm_cleanup); Index: infiniband/core/ucm.h =================================================================== --- infiniband/core/ucm.h (revision 0) +++ infiniband/core/ucm.h (revision 0) @@ -0,0 +1,84 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef UCM_H +#define UCM_H + +#include +#include +#include +#include + +#include +#include + +#define IB_UCM_CM_ID_INVALID 0xffffffff + +struct ib_ucm_file { + struct semaphore mutex; + struct file *filp; + /* + * list of pending events + */ + struct list_head ctxs; /* list of active connections */ + struct list_head events; /* list of pending events */ + wait_queue_head_t poll_wait; +}; + +struct ib_ucm_context { + int id; + int ref; + int error; + + struct ib_ucm_file *file; + struct ib_cm_id *cm_id; + struct semaphore mutex; + + struct list_head events; /* list of pending events. 
*/ + struct list_head file_list; /* member in file ctx list */ +}; + +struct ib_ucm_event { + struct ib_ucm_context *ctx; + struct list_head file_list; /* member in file event list */ + struct list_head ctx_list; /* member in ctx event list */ + + struct ib_ucm_event_resp resp; + void *data; + void *info; + int data_len; + int info_len; +}; + +#endif /* UCM_H */ From libor at topspin.com Fri Mar 11 15:17:48 2005 From: libor at topspin.com (Libor Michalek) Date: Fri, 11 Mar 2005 15:17:48 -0800 Subject: [openib-general] Re: [PATCH] [TRIVIAL] SDP: Eliminate uneeded initialization and fix some typos In-Reply-To: <1110364516.4645.22.camel@localhost.localdomain>; from halr@voltaire.com on Wed, Mar 09, 2005 at 05:35:17AM -0500 References: <1110364516.4645.22.camel@localhost.localdomain> Message-ID: <20050311151748.B31689@topspin.com> On Wed, Mar 09, 2005 at 05:35:17AM -0500, Hal Rosenstock wrote: > SDP: Eliminate uneeded initialization and fix some typos Thanks, applied and committed. -Libor From libor at topspin.com Fri Mar 11 15:22:13 2005 From: libor at topspin.com (Libor Michalek) Date: Fri, 11 Mar 2005 15:22:13 -0800 Subject: [openib-general] [RFC] Userspace CM access. In-Reply-To: <20050311150946.A31689@topspin.com>; from libor@topspin.com on Fri, Mar 11, 2005 at 03:09:46PM -0800 References: <20050311150946.A31689@topspin.com> Message-ID: <20050311152213.C31689@topspin.com> On Fri, Mar 11, 2005 at 03:09:46PM -0800, Libor Michalek wrote: > > Below is the source for the kernel portion of the userspace CM. I've > got enough of the userspace library to verify the basic functionality, > but it's not yet ready for general use. However, I wanted to get the > kernel portion posted for comment and checked-in now that the bulk of > it is complete. The code is for the most part a pass through from > userspace to the kernel CM, plus synchronization, sanity checking, > and the event model is turned into a "get next event" interface. OK.
Not sure how one of the structure fields disappeared, but here's a resend, that actually builds. -Libor Signed-off-by: Libor Michalek Index: infiniband/core/Makefile =================================================================== --- infiniband/core/Makefile (revision 1979) +++ infiniband/core/Makefile (working copy) @@ -1,6 +1,7 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_cm.o ib_sa.o ib_umad.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_cm.o ib_sa.o ib_umad.o \ + ib_ucm.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o @@ -12,3 +13,5 @@ ib_sa-y := sa_query.o ib_umad-y := user_mad.o + +ib_ucm-y := ucm.o Index: infiniband/include/ib_user_cm.h =================================================================== --- infiniband/include/ib_user_cm.h (revision 0) +++ infiniband/include/ib_user_cm.h (revision 0) @@ -0,0 +1,326 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_user_verbs.h 1852 2005-02-21 22:21:01Z roland $ + */ + +#ifndef IB_USER_CM_H +#define IB_USER_CM_H + +#include + +#define IB_USER_CM_ABI_VERSION 1 + +enum { + IB_USER_CM_CMD_CREATE_ID, + IB_USER_CM_CMD_DESTORY_ID, + IB_USER_CM_CMD_ATTR_ID, + + IB_USER_CM_CMD_LISTEN, + IB_USER_CM_CMD_ESTABLISH, + + IB_USER_CM_CMD_SEND_REQ, + IB_USER_CM_CMD_SEND_REP, + IB_USER_CM_CMD_SEND_RTU, + IB_USER_CM_CMD_SEND_DREQ, + IB_USER_CM_CMD_SEND_DREP, + IB_USER_CM_CMD_SEND_REJ, + IB_USER_CM_CMD_SEND_MRA, + IB_USER_CM_CMD_SEND_LAP, + IB_USER_CM_CMD_SEND_APR, + IB_USER_CM_CMD_SEND_SIDR_REQ, + IB_USER_CM_CMD_SEND_SIDR_REP, + IB_USER_CM_CMD_QP_ATTR, + + IB_USER_CM_CMD_EVENT, +}; +/* + * command ABI structures. 
+ */ +struct ib_ucm_cmd_hdr { + __u32 cmd; + __u16 in; + __u16 out; +}; + +struct ib_ucm_create_id { + __u64 response; +}; + +struct ib_ucm_create_id_resp { + __u32 id; +}; + +struct ib_ucm_destroy_id { + __u32 id; +}; + +struct ib_ucm_attr_id { + __u64 response; + __u32 id; +}; + +struct ib_ucm_attr_id_resp { + __u64 service_id; + __u64 service_mask; + __u32 state; + __u32 lap_state; + __u32 local_id; + __u32 remote_id; +}; + +struct ib_ucm_listen { + __u64 service_id; + __u64 service_mask; + __u32 id; +}; + +struct ib_ucm_establish { + __u32 id; +}; + +struct ib_ucm_private_data { + __u64 data; + __u32 id; + __u8 len; + __u8 reserved[3]; +}; + +struct ib_ucm_path_rec { + __u8 dgid[16]; + __u8 sgid[16]; + __u16 dlid; + __u16 slid; + __u32 raw_traffic; + __u32 flow_label; + __u32 reversible; + __u32 mtu; + __u16 pkey; + __u8 hop_limit; + __u8 traffic_class; + __u8 numb_path; + __u8 sl; + __u8 mtu_selector; + __u8 rate_selector; + __u8 rate; + __u8 packet_life_time_selector; + __u8 packet_life_time; + __u8 preference; +}; + +struct ib_ucm_req { + __u32 id; + __u32 qpn; + __u32 qp_type; + __u32 psn; + __u64 sid; + __u64 data; + __u64 primary_path; + __u64 alternate_path; + __u8 len; + __u8 peer_to_peer; + __u8 responder_resources; + __u8 initiator_depth; + __u8 remote_cm_response_timeout; + __u8 flow_control; + __u8 local_cm_response_timeout; + __u8 retry_count; + __u8 rnr_retry_count; + __u8 max_cm_retries; + __u8 srq; + __u8 reserved[1]; +}; + +struct ib_ucm_rep { + __u64 data; + __u32 id; + __u32 qpn; + __u32 psn; + __u8 len; + __u8 responder_resources; + __u8 initiator_depth; + __u8 target_ack_delay; + __u8 failover_accepted; + __u8 flow_control; + __u8 rnr_retry_count; + __u8 srq; +}; + +struct ib_ucm_info { + __u32 id; + __u32 status; + __u64 info; + __u64 data; + __u8 info_len; + __u8 data_len; + __u8 reserved[2]; +}; + +struct ib_ucm_mra { + __u64 data; + __u32 id; + __u8 len; + __u8 timeout; + __u8 reserved[2]; +}; + +struct ib_ucm_lap { + __u64 path; + 
__u64 data; + __u32 id; + __u8 len; + __u8 reserved[3]; +}; + +struct ib_ucm_sidr_req { + __u32 id; + __u32 timeout; + __u64 sid; + __u64 data; + __u64 path; + __u16 pkey; + __u8 len; + __u8 max_cm_retries; +}; + +struct ib_ucm_sidr_rep { + __u32 id; + __u32 qpn; + __u32 qkey; + __u32 status; + __u64 info; + __u64 data; + __u8 info_len; + __u8 data_len; + __u8 reserved[2]; +}; +/* + * event notification ABI structures. + */ +struct ib_ucm_event_get { + __u64 response; + __u64 data; + __u64 info; + __u8 data_len; + __u8 info_len; + __u8 reserved[2]; +}; + +struct ib_ucm_req_event_resp { + __u32 listen_id; + /* device */ + /* port */ + struct ib_ucm_path_rec primary_path; + struct ib_ucm_path_rec alternate_path; + __u64 remote_ca_guid; + __u32 remote_qkey; + __u32 remote_qpn; + __u32 qp_type; + __u32 starting_psn; + __u8 responder_resources; + __u8 initiator_depth; + __u8 local_cm_response_timeout; + __u8 flow_control; + __u8 remote_cm_response_timeout; + __u8 retry_count; + __u8 rnr_retry_count; + __u8 srq; +}; + +struct ib_ucm_rep_event_resp { + __u64 remote_ca_guid; + __u32 remote_qkey; + __u32 remote_qpn; + __u32 starting_psn; + __u8 responder_resources; + __u8 initiator_depth; + __u8 target_ack_delay; + __u8 failover_accepted; + __u8 flow_control; + __u8 rnr_retry_count; + __u8 srq; + __u8 reserved[1]; +}; + +struct ib_ucm_rej_event_resp { + __u32 reason; + /* ari in ib_ucm_event_get info field. */ +}; + +struct ib_ucm_mra_event_resp { + __u8 timeout; + __u8 reserved[3]; +}; + +struct ib_ucm_lap_event_resp { + struct ib_ucm_path_rec path; +}; + +struct ib_ucm_apr_event_resp { + __u32 status; + /* apr info in ib_ucm_event_get info field. */ +}; + +struct ib_ucm_sidr_req_event_resp { + __u32 listen_id; + /* device */ + /* port */ + __u16 pkey; + __u8 reserved[2]; +}; + +struct ib_ucm_sidr_rep_event_resp { + __u32 status; + __u32 qkey; + __u32 qpn; + /* info in ib_ucm_event_get info field. 
*/ +}; + +struct ib_ucm_event_resp { + __u32 id; + __u32 state; + __u32 event; + union { + struct ib_ucm_req_event_resp req_resp; + struct ib_ucm_rep_event_resp rep_resp; + struct ib_ucm_rej_event_resp rej_resp; + struct ib_ucm_mra_event_resp mra_resp; + struct ib_ucm_lap_event_resp lap_resp; + struct ib_ucm_apr_event_resp apr_resp; + + struct ib_ucm_sidr_req_event_resp sidr_req_resp; + struct ib_ucm_sidr_rep_event_resp sidr_rep_resp; + + __u32 send_status; + } u; +}; + +#endif /* IB_USER_CM_H */ Index: infiniband/core/ucm.c =================================================================== --- infiniband/core/ucm.c (revision 0) +++ infiniband/core/ucm.c (revision 0) @@ -0,0 +1,1388 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "ucm.h" + +MODULE_AUTHOR("Libor Michalek"); +MODULE_DESCRIPTION("InfiniBand userspace Connection Manager access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UCM_MAJOR = 231, + IB_UCM_MINOR = 255 +}; + +#define IB_UCM_DEV MKDEV(IB_UCM_MAJOR, IB_UCM_MINOR) + +static struct semaphore ctx_id_mutex; +static struct idr ctx_id_table; +static int ctx_id_rover = 0; + +static struct ib_ucm_context *ib_ucm_ctx_get(int id) +{ + struct ib_ucm_context *ctx; + + down(&ctx_id_mutex); + ctx = idr_find(&ctx_id_table, id); + if (ctx) + ctx->ref++; + up(&ctx_id_mutex); + + return ctx; +} + +static void ib_ucm_ctx_put(struct ib_ucm_context *ctx) +{ + struct ib_ucm_event *uevent; + + down(&ctx_id_mutex); + + ctx->ref--; + if (!ctx->ref) + idr_remove(&ctx_id_table, ctx->id); + + up(&ctx_id_mutex); + + if (ctx->ref) + return; + + down(&ctx->file->mutex); + + list_del(&ctx->file_list); + while (!list_empty(&ctx->events)) { + + uevent = list_entry(ctx->events.next, + struct ib_ucm_event, ctx_list); + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + + kfree(uevent); + } + + up(&ctx->file->mutex); + + printk(KERN_ERR "UCM: Destroyed CM ID <%d>\n", ctx->id); + + (void)ib_destroy_cm_id(ctx->cm_id); + kfree(ctx); +} + +static struct ib_ucm_context *ib_ucm_ctx_alloc(struct ib_ucm_file *file) +{ + struct ib_ucm_context *ctx; + int result; + + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + ctx->ref = 1; /* user reference */ + ctx->file = file; + + INIT_LIST_HEAD(&ctx->events); + init_MUTEX(&ctx->mutex); + + list_add_tail(&ctx->file_list, 
&file->ctxs); + + ctx_id_rover = (ctx_id_rover + 1) & INT_MAX; +retry: + result = idr_pre_get(&ctx_id_table, GFP_KERNEL); + if (!result) + goto error; + + down(&ctx_id_mutex); + result = idr_get_new_above(&ctx_id_table, ctx, ctx_id_rover, &ctx->id); + up(&ctx_id_mutex); + + if (result == -EAGAIN) + goto retry; + if (result) + goto error; + + printk(KERN_ERR "UCM: Allocated CM ID <%d>\n", ctx->id); + + return ctx; +error: + list_del(&ctx->file_list); + kfree(ctx); + + return NULL; +} +/* + * Event portion of the API, handle CM events + * and allow event polling. + */ +static void ib_ucm_event_path_get(struct ib_ucm_path_rec *upath, + struct ib_sa_path_rec *kpath) +{ + memcpy(upath->dgid, kpath->dgid.raw, sizeof(union ib_gid)); + memcpy(upath->sgid, kpath->sgid.raw, sizeof(union ib_gid)); + + upath->dlid = kpath->dlid; + upath->slid = kpath->slid; + upath->raw_traffic = kpath->raw_traffic; + upath->flow_label = kpath->flow_label; + upath->hop_limit = kpath->hop_limit; + upath->traffic_class = kpath->traffic_class; + upath->reversible = kpath->reversible; + upath->numb_path = kpath->numb_path; + upath->pkey = kpath->pkey; + upath->sl = kpath->sl; + upath->mtu_selector = kpath->mtu_selector; + upath->mtu = kpath->mtu; + upath->rate_selector = kpath->rate_selector; + upath->rate = kpath->rate; + upath->packet_life_time = kpath->packet_life_time; + upath->preference = kpath->preference; + + upath->packet_life_time_selector = + kpath->packet_life_time_selector; +} + +static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq, + struct ib_cm_req_event_param *kreq) +{ + ureq->listen_id = (int)kreq->listen_id->context; + + ureq->remote_ca_guid = kreq->remote_ca_guid; + ureq->remote_qkey = kreq->remote_qkey; + ureq->remote_qpn = kreq->remote_qpn; + ureq->qp_type = kreq->qp_type; + ureq->starting_psn = kreq->starting_psn; + ureq->responder_resources = kreq->responder_resources; + ureq->initiator_depth = kreq->initiator_depth; + ureq->local_cm_response_timeout = 
kreq->local_cm_response_timeout; + ureq->flow_control = kreq->flow_control; + ureq->remote_cm_response_timeout = kreq->remote_cm_response_timeout; + ureq->retry_count = kreq->retry_count; + ureq->rnr_retry_count = kreq->rnr_retry_count; + ureq->srq = kreq->srq; + + ib_ucm_event_path_get(&ureq->primary_path, kreq->primary_path); + ib_ucm_event_path_get(&ureq->alternate_path, kreq->alternate_path); +} + +static void ib_ucm_event_rep_get(struct ib_ucm_rep_event_resp *urep, + struct ib_cm_rep_event_param *krep) +{ + urep->remote_ca_guid = krep->remote_ca_guid; + urep->remote_qkey = krep->remote_qkey; + urep->remote_qpn = krep->remote_qpn; + urep->starting_psn = krep->starting_psn; + urep->responder_resources = krep->responder_resources; + urep->initiator_depth = krep->initiator_depth; + urep->target_ack_delay = krep->target_ack_delay; + urep->failover_accepted = krep->failover_accepted; + urep->flow_control = krep->flow_control; + urep->rnr_retry_count = krep->rnr_retry_count; + urep->srq = krep->srq; +} + +static void ib_ucm_event_rej_get(struct ib_ucm_rej_event_resp *urej, + struct ib_cm_rej_event_param *krej) +{ + urej->reason = krej->reason; +} + +static void ib_ucm_event_mra_get(struct ib_ucm_mra_event_resp *umra, + struct ib_cm_mra_event_param *kmra) +{ + umra->timeout = kmra->service_timeout; +} + +static void ib_ucm_event_lap_get(struct ib_ucm_lap_event_resp *ulap, + struct ib_cm_lap_event_param *klap) +{ + ib_ucm_event_path_get(&ulap->path, klap->alternate_path); +} + +static void ib_ucm_event_apr_get(struct ib_ucm_apr_event_resp *uapr, + struct ib_cm_apr_event_param *kapr) +{ + uapr->status = kapr->ap_status; +} + +static void ib_ucm_event_sidr_req_get(struct ib_ucm_sidr_req_event_resp *ureq, + struct ib_cm_sidr_req_event_param *kreq) +{ + ureq->listen_id = (int)kreq->listen_id->context; + ureq->pkey = kreq->pkey; +} + +static void ib_ucm_event_sidr_rep_get(struct ib_ucm_sidr_rep_event_resp *urep, + struct ib_cm_sidr_rep_event_param *krep) +{ + urep->status = 
krep->status; + urep->qkey = krep->qkey; + urep->qpn = krep->qpn; +}; + +static int ib_ucm_event_process(struct ib_cm_event *evt, + struct ib_ucm_event *uvt) +{ + void *info = NULL; + int result; + + switch (evt->event) { + case IB_CM_REQ_RECEIVED: + ib_ucm_event_req_get(&uvt->resp.u.req_resp, + &evt->param.req_rcvd); + uvt->data_len = IB_CM_REQ_PRIVATE_DATA_SIZE; + + break; + case IB_CM_REP_RECEIVED: + ib_ucm_event_rep_get(&uvt->resp.u.rep_resp, + &evt->param.rep_rcvd); + uvt->data_len = IB_CM_REP_PRIVATE_DATA_SIZE; + + break; + case IB_CM_RTU_RECEIVED: + uvt->data_len = IB_CM_RTU_PRIVATE_DATA_SIZE; + + break; + case IB_CM_DREQ_RECEIVED: + uvt->data_len = IB_CM_DREQ_PRIVATE_DATA_SIZE; + + break; + case IB_CM_DREP_RECEIVED: + uvt->data_len = IB_CM_DREP_PRIVATE_DATA_SIZE; + + break; + case IB_CM_MRA_RECEIVED: + ib_ucm_event_mra_get(&uvt->resp.u.mra_resp, + &evt->param.mra_rcvd); + uvt->data_len = IB_CM_MRA_PRIVATE_DATA_SIZE; + + break; + case IB_CM_REJ_RECEIVED: + ib_ucm_event_rej_get(&uvt->resp.u.rej_resp, + &evt->param.rej_rcvd); + uvt->data_len = IB_CM_REJ_PRIVATE_DATA_SIZE; + uvt->info_len = evt->param.rej_rcvd.ari_length; + info = evt->param.rej_rcvd.ari; + + break; + case IB_CM_LAP_RECEIVED: + ib_ucm_event_lap_get(&uvt->resp.u.lap_resp, + &evt->param.lap_rcvd); + uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE; + + break; + case IB_CM_APR_RECEIVED: + ib_ucm_event_apr_get(&uvt->resp.u.apr_resp, + &evt->param.apr_rcvd); + uvt->data_len = IB_CM_APR_PRIVATE_DATA_SIZE; + uvt->info_len = evt->param.apr_rcvd.info_len; + info = evt->param.apr_rcvd.apr_info; + + break; + case IB_CM_SIDR_REQ_RECEIVED: + ib_ucm_event_sidr_req_get(&uvt->resp.u.sidr_req_resp, + &evt->param.sidr_req_rcvd); + uvt->data_len = IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE; + + break; + case IB_CM_SIDR_REP_RECEIVED: + ib_ucm_event_sidr_rep_get(&uvt->resp.u.sidr_rep_resp, + &evt->param.sidr_rep_rcvd); + uvt->data_len = IB_CM_SIDR_REP_PRIVATE_DATA_SIZE; + uvt->info_len = evt->param.sidr_rep_rcvd.info_len; + info 
= evt->param.sidr_rep_rcvd.info; + + break; + default: + uvt->resp.u.send_status = evt->param.send_status; + + break; + } + + if (uvt->data_len && evt->private_data) { + + uvt->data = kmalloc(uvt->data_len, GFP_KERNEL); + if (!uvt->data) { + result = -ENOMEM; + goto error; + } + + memcpy(uvt->data, evt->private_data, uvt->data_len); + } + + if (uvt->info_len && info) { + + uvt->info = kmalloc(uvt->info_len, GFP_KERNEL); + if (!uvt->info) { + result = -ENOMEM; + goto error; + } + + memcpy(uvt->info, info, uvt->info_len); + } + + return 0; +error: + if (uvt->info) + kfree(uvt->info); + if (uvt->data) + kfree(uvt->data); + return result; +} + +static int ib_ucm_event_handler(struct ib_cm_id *cm_id, + struct ib_cm_event *event) +{ + struct ib_ucm_event *uevent; + struct ib_ucm_context *ctx; + int result = 0; + int id; + + /* + * lookup correct context based on event type. + */ + switch (event->event) { + case IB_CM_REQ_RECEIVED: + id = (int)event->param.req_rcvd.listen_id->context; + break; + case IB_CM_SIDR_REQ_RECEIVED: + id = (int)event->param.sidr_req_rcvd.listen_id->context; + break; + default: + id = (int)cm_id->context; + break; + } + + ctx = ib_ucm_ctx_get(id); + if (!ctx) + return -ENOENT; + + if (event->event == IB_CM_REQ_RECEIVED || + event->event == IB_CM_SIDR_REQ_RECEIVED) + id = IB_UCM_CM_ID_INVALID; + + uevent = kmalloc(sizeof(*uevent), GFP_KERNEL); + if (!uevent) { + result = -ENOMEM; + goto done; + } + + memset(uevent, 0, sizeof(*uevent)); + + uevent->resp.id = id; + uevent->resp.event = event->event; + uevent->resp.state = cm_id->state; + + result = ib_ucm_event_process(event, uevent); + if (result) + goto done; + + uevent->ctx = ctx; + + down(&ctx->file->mutex); + + list_add_tail(&uevent->file_list, &ctx->file->events); + list_add_tail(&uevent->ctx_list, &ctx->events); + + wake_up_interruptible(&ctx->file->poll_wait); + + up(&ctx->file->mutex); +done: + ctx->error = result; + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static 
ssize_t ib_ucm_qp_event(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_context *ctx; + struct ib_ucm_event_get cmd; + struct ib_ucm_event *uevent = NULL; + int result = 0; + DEFINE_WAIT(wait); + + if (out_len < sizeof(struct ib_ucm_event_resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + /* + * wait + */ + down(&file->mutex); + + while (list_empty(&file->events)) { + + if (file->filp->f_flags & O_NONBLOCK) { + result = -EAGAIN; + break; + } + + if (signal_pending(current)) { + result = -ERESTARTSYS; + break; + } + + prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE); + + up(&file->mutex); + schedule(); + down(&file->mutex); + + finish_wait(&file->poll_wait, &wait); + } + + if (result) + goto done; + + uevent = list_entry(file->events.next, struct ib_ucm_event, file_list); + + if (uevent->resp.id != IB_UCM_CM_ID_INVALID) + goto user; + + ctx = ib_ucm_ctx_alloc(file); + if (!ctx) { + result = -ENOMEM; + goto done; + } + + uevent->resp.id = ctx->id; + +user: + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &uevent->resp, sizeof(uevent->resp))) { + result = -EFAULT; + goto done; + } + + if (uevent->data) { + + if (cmd.data_len < uevent->data_len) { + result = -ENOMEM; + goto done; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.data, + uevent->data, cmd.data_len)) { + result = -EFAULT; + goto done; + } + } + + if (uevent->info) { + + if (cmd.info_len < uevent->info_len) { + result = -ENOMEM; + goto done; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.info, + uevent->info, cmd.info_len)) { + result = -EFAULT; + goto done; + } + } + + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + + if (uevent->data) + kfree(uevent->data); + if (uevent->info) + kfree(uevent->info); + kfree(uevent); +done: + up(&file->mutex); + return result; +} + + +static ssize_t ib_ucm_create_id(struct ib_ucm_file *file, + const char __user 
*inbuf, + int in_len, int out_len) +{ + struct ib_ucm_create_id cmd; + struct ib_ucm_create_id_resp resp; + struct ib_ucm_context *ctx; + int result; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_alloc(file); + if (!ctx) + return -ENOMEM; + + ctx->cm_id = ib_create_cm_id(ib_ucm_event_handler, + (void *)(unsigned long)ctx->id); + if (!ctx->cm_id) { + result = -ENOMEM; + goto err_cm; + } + + resp.id = ctx->id; + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) { + result = -EFAULT; + goto err_ret; + } + + return 0; +err_ret: + (void)ib_destroy_cm_id(ctx->cm_id); +err_cm: + ib_ucm_ctx_put(ctx); /* user reference */ + + return result; +} + +static ssize_t ib_ucm_destroy_id(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_destroy_id cmd; + struct ib_ucm_context *ctx; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + ib_ucm_ctx_put(ctx); /* user reference */ + ib_ucm_ctx_put(ctx); /* func reference */ + + return 0; +} + +static ssize_t ib_ucm_attr_id(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_attr_id_resp resp; + struct ib_ucm_attr_id cmd; + struct ib_ucm_context *ctx; + int result = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + down(&ctx->file->mutex); + if (ctx->file != file) { + result = -EINVAL; + goto done; + } + + resp.service_id = ctx->cm_id->service_id; + resp.service_mask = ctx->cm_id->service_mask; + resp.state = ctx->cm_id->state; + resp.lap_state = ctx->cm_id->lap_state; + resp.local_id = ctx->cm_id->local_id; + resp.remote_id = ctx->cm_id->remote_id; + + if (copy_to_user((void __user *)(unsigned 
long)cmd.response, + &resp, sizeof(resp))) + result = -EFAULT; + +done: + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static ssize_t ib_ucm_listen(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_listen cmd; + struct ib_ucm_context *ctx; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_cm_listen(ctx->cm_id, cmd.service_id, + cmd.service_mask); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static ssize_t ib_ucm_establish(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_establish cmd; + struct ib_ucm_context *ctx; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) + return -ENOENT; + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_cm_establish(ctx->cm_id); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ + return result; +} + +static int ib_ucm_alloc_data(void **dest, u64 src, u32 len) +{ + void *data; + + *dest = NULL; + + if (!len) + return 0; + + data = kmalloc(len, GFP_KERNEL); + if (!data) + return -ENOMEM; + + if (copy_from_user(data, (void __user *)(unsigned long)src, len)) { + kfree(data); + return -EFAULT; + } + + *dest = data; + return 0; +} + +static int ib_ucm_path_get(struct ib_sa_path_rec **path, u64 src) +{ + struct ib_ucm_path_rec ucm_path; + struct ib_sa_path_rec *sa_path; + + *path = NULL; + + if (!src) + return 0; + + sa_path = kmalloc(sizeof(*sa_path), GFP_KERNEL); + if (!sa_path) + return -ENOMEM; + + if (copy_from_user(&ucm_path, (void __user *)(unsigned long)src, + sizeof(ucm_path))) { + + kfree(sa_path); + return -EFAULT; 
+ } + + memcpy(sa_path->dgid.raw, ucm_path.dgid, sizeof(union ib_gid)); + memcpy(sa_path->sgid.raw, ucm_path.sgid, sizeof(union ib_gid)); + + sa_path->dlid = ucm_path.dlid; + sa_path->slid = ucm_path.slid; + sa_path->raw_traffic = ucm_path.raw_traffic; + sa_path->flow_label = ucm_path.flow_label; + sa_path->hop_limit = ucm_path.hop_limit; + sa_path->traffic_class = ucm_path.traffic_class; + sa_path->reversible = ucm_path.reversible; + sa_path->numb_path = ucm_path.numb_path; + sa_path->pkey = ucm_path.pkey; + sa_path->sl = ucm_path.sl; + sa_path->mtu_selector = ucm_path.mtu_selector; + sa_path->mtu = ucm_path.mtu; + sa_path->rate_selector = ucm_path.rate_selector; + sa_path->rate = ucm_path.rate; + sa_path->packet_life_time = ucm_path.packet_life_time; + sa_path->preference = ucm_path.preference; + + sa_path->packet_life_time_selector = + ucm_path.packet_life_time_selector; + + *path = sa_path; + return 0; +} + +static ssize_t ib_ucm_send_req(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_cm_req_param param; + struct ib_ucm_context *ctx; + struct ib_ucm_req cmd; + int result; + + param.private_data = NULL; + param.primary_path = NULL; + param.alternate_path = NULL; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&param.private_data, cmd.data, cmd.len); + if (result) + goto done; + + result = ib_ucm_path_get(&param.primary_path, cmd.primary_path); + if (result) + goto done; + + result = ib_ucm_path_get(&param.alternate_path, cmd.alternate_path); + if (result) + goto done; + + param.private_data_len = cmd.len; + param.service_id = cmd.sid; + param.qp_num = cmd.qpn; + param.qp_type = cmd.qp_type; + param.starting_psn = cmd.psn; + param.peer_to_peer = cmd.peer_to_peer; + param.responder_resources = cmd.responder_resources; + param.initiator_depth = cmd.initiator_depth; + param.remote_cm_response_timeout = cmd.remote_cm_response_timeout; + param.flow_control = cmd.flow_control; + 
param.local_cm_response_timeout = cmd.local_cm_response_timeout; + param.retry_count = cmd.retry_count; + param.rnr_retry_count = cmd.rnr_retry_count; + param.max_cm_retries = cmd.max_cm_retries; + param.srq = cmd.srq; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_req(ctx->cm_id, &param); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (param.private_data) + kfree(param.private_data); + if (param.primary_path) + kfree(param.primary_path); + if (param.alternate_path) + kfree(param.alternate_path); + + return result; +} + +static ssize_t ib_ucm_send_rep(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_cm_rep_param param; + struct ib_ucm_context *ctx; + struct ib_ucm_rep cmd; + int result; + + param.private_data = NULL; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&param.private_data, cmd.data, cmd.len); + if (result) + return result; + + param.qp_num = cmd.qpn; + param.starting_psn = cmd.psn; + param.private_data_len = cmd.len; + param.responder_resources = cmd.responder_resources; + param.initiator_depth = cmd.initiator_depth; + param.target_ack_delay = cmd.target_ack_delay; + param.failover_accepted = cmd.failover_accepted; + param.flow_control = cmd.flow_control; + param.rnr_retry_count = cmd.rnr_retry_count; + param.srq = cmd.srq; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_rep(ctx->cm_id, &param); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (param.private_data) + kfree(param.private_data); + + return result; +} + +static ssize_t ib_ucm_send_private_data(struct ib_ucm_file *file, + const char __user *inbuf, int in_len, + 
int (*func)(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len)) +{ + struct ib_ucm_private_data cmd; + struct ib_ucm_context *ctx; + void *private_data = NULL; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&private_data, cmd.data, cmd.len); + if (result) + return result; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = func(ctx->cm_id, private_data, cmd.len); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (private_data) + kfree(private_data); + + return result; +} + +static ssize_t ib_ucm_send_rtu(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_private_data(file, inbuf, in_len, ib_send_cm_rtu); +} + +static ssize_t ib_ucm_send_dreq(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_private_data(file, inbuf, in_len, ib_send_cm_dreq); +} + +static ssize_t ib_ucm_send_drep(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_private_data(file, inbuf, in_len, ib_send_cm_drep); +} + +static ssize_t ib_ucm_send_info(struct ib_ucm_file *file, + const char __user *inbuf, int in_len, + int (*func)(struct ib_cm_id *cm_id, + int status, + void *info, + u8 info_len, + void *data, + u8 data_len)) +{ + struct ib_ucm_context *ctx; + struct ib_ucm_info cmd; + void *data = NULL; + void *info = NULL; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&data, cmd.data, cmd.data_len); + if (result) + goto done; + + result = ib_ucm_alloc_data(&info, cmd.info, cmd.info_len); + if (result) + goto done; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if 
(ctx->file != file) + result = -EINVAL; + else + result = func(ctx->cm_id, cmd.status, + info, cmd.info_len, + data, cmd.data_len); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (data) + kfree(data); + if (info) + kfree(info); + + return result; +} + +static ssize_t ib_ucm_send_rej(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_info(file, inbuf, in_len, (void *)ib_send_cm_rej); +} + +static ssize_t ib_ucm_send_apr(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return ib_ucm_send_info(file, inbuf, in_len, (void *)ib_send_cm_apr); +} + +static ssize_t ib_ucm_send_mra(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_context *ctx; + struct ib_ucm_mra cmd; + void *data = NULL; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&data, cmd.data, cmd.len); + if (result) + return result; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_mra(ctx->cm_id, cmd.timeout, + data, cmd.len); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (data) + kfree(data); + + return result; +} + +static ssize_t ib_ucm_send_lap(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_context *ctx; + struct ib_sa_path_rec *path = NULL; + struct ib_ucm_lap cmd; + void *data = NULL; + int result; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&data, cmd.data, cmd.len); + if (result) + goto done; + + result = ib_ucm_path_get(&path, cmd.path); + if (result) + goto done; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if 
(ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_lap(ctx->cm_id, path, data, cmd.len); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (data) + kfree(data); + if (path) + kfree(path); + + return result; +} + +static ssize_t ib_ucm_send_sidr_req(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_cm_sidr_req_param param; + struct ib_ucm_context *ctx; + struct ib_ucm_sidr_req cmd; + int result; + + param.private_data = NULL; + param.path = NULL; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&param.private_data, cmd.data, cmd.len); + if (result) + goto done; + + result = ib_ucm_path_get(&param.path, cmd.path); + if (result) + goto done; + + param.private_data_len = cmd.len; + param.service_id = cmd.sid; + param.timeout_ms = cmd.timeout; + param.max_cm_retries = cmd.max_cm_retries; + param.pkey = cmd.pkey; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_sidr_req(ctx->cm_id, &param); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (param.private_data) + kfree(param.private_data); + if (param.path) + kfree(param.path); + + return result; +} + +static ssize_t ib_ucm_send_sidr_rep(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_cm_sidr_rep_param param; + struct ib_ucm_sidr_rep cmd; + struct ib_ucm_context *ctx; + int result; + + param.info = NULL; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + result = ib_ucm_alloc_data(&param.private_data, + cmd.data, cmd.data_len); + if (result) + goto done; + + result = ib_ucm_alloc_data(&param.info, cmd.info, cmd.info_len); + if (result) + goto done; + + param.qp_num = cmd.qpn; + param.qkey = cmd.qkey; + param.status = cmd.status; + param.info_length = 
cmd.info_len; + param.private_data_len = cmd.data_len; + + ctx = ib_ucm_ctx_get(cmd.id); + if (!ctx) { + result = -ENOENT; + goto done; + } + + down(&ctx->file->mutex); + if (ctx->file != file) + result = -EINVAL; + else + result = ib_send_cm_sidr_rep(ctx->cm_id, &param); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* func reference */ +done: + if (param.private_data) + kfree(param.private_data); + if (param.info) + kfree(param.info); + + return result; +} + +static ssize_t ib_ucm_qp_attr(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + return 0; +} + +static ssize_t (*ucm_cmd_table[])(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) = { + [IB_USER_CM_CMD_CREATE_ID] = ib_ucm_create_id, + [IB_USER_CM_CMD_DESTORY_ID] = ib_ucm_destroy_id, + [IB_USER_CM_CMD_ATTR_ID] = ib_ucm_attr_id, + [IB_USER_CM_CMD_LISTEN] = ib_ucm_listen, + [IB_USER_CM_CMD_ESTABLISH] = ib_ucm_establish, + [IB_USER_CM_CMD_SEND_REQ] = ib_ucm_send_req, + [IB_USER_CM_CMD_SEND_REP] = ib_ucm_send_rep, + [IB_USER_CM_CMD_SEND_RTU] = ib_ucm_send_rtu, + [IB_USER_CM_CMD_SEND_DREQ] = ib_ucm_send_dreq, + [IB_USER_CM_CMD_SEND_DREP] = ib_ucm_send_drep, + [IB_USER_CM_CMD_SEND_REJ] = ib_ucm_send_rej, + [IB_USER_CM_CMD_SEND_MRA] = ib_ucm_send_mra, + [IB_USER_CM_CMD_SEND_LAP] = ib_ucm_send_lap, + [IB_USER_CM_CMD_SEND_APR] = ib_ucm_send_apr, + [IB_USER_CM_CMD_SEND_SIDR_REQ] = ib_ucm_send_sidr_req, + [IB_USER_CM_CMD_SEND_SIDR_REP] = ib_ucm_send_sidr_rep, + [IB_USER_CM_CMD_QP_ATTR] = ib_ucm_qp_attr, + [IB_USER_CM_CMD_EVENT] = ib_ucm_qp_event, +}; + +static ssize_t ib_ucm_write(struct file *filp, const char __user *buf, + size_t len, loff_t *pos) +{ + struct ib_ucm_file *file = filp->private_data; + struct ib_ucm_cmd_hdr hdr; + ssize_t result; + + if (len < sizeof(hdr)) + return -EINVAL; + + if (copy_from_user(&hdr, buf, sizeof(hdr))) + return -EFAULT; + + printk(KERN_ERR "UCM: Write. 
cmd <%d> in <%d> out <%d> len <%d>\n", + hdr.cmd, hdr.in, hdr.out, len); + + if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucm_cmd_table)) + return -EINVAL; + + if (hdr.in + sizeof(hdr) > len) + return -EINVAL; + + result = ucm_cmd_table[hdr.cmd](file, buf + sizeof(hdr), + hdr.in, hdr.out); + if (!result) + result = len; + + return result; +} + +static unsigned int ib_ucm_poll(struct file *filp, + struct poll_table_struct *wait) +{ + struct ib_ucm_file *file = filp->private_data; + unsigned int mask = 0; + + poll_wait(filp, &file->poll_wait, wait); + + if (!list_empty(&file->events)) + mask = POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_ucm_open(struct inode *inode, struct file *filp) +{ + struct ib_ucm_file *file; + + file = kmalloc(sizeof(*file), GFP_KERNEL); + if (!file) + return -ENOMEM; + + INIT_LIST_HEAD(&file->events); + INIT_LIST_HEAD(&file->ctxs); + init_waitqueue_head(&file->poll_wait); + + init_MUTEX(&file->mutex); + + filp->private_data = file; + file->filp = filp; + + printk(KERN_ERR "UCM: Created struct\n"); + + return 0; +} + +static int ib_ucm_close(struct inode *inode, struct file *filp) +{ + struct ib_ucm_file *file = filp->private_data; + struct ib_ucm_context *ctx; + + down(&file->mutex); + + while (!list_empty(&file->ctxs)) { + + ctx = list_entry(file->ctxs.next, + struct ib_ucm_context, file_list); + + up(&ctx->file->mutex); + ib_ucm_ctx_put(ctx); /* user reference */ + down(&file->mutex); + } + + up(&file->mutex); + + kfree(file); + + printk(KERN_ERR "UCM: Deleted struct\n"); + return 0; +} + +static struct file_operations ib_ucm_fops = { + .owner = THIS_MODULE, + .open = ib_ucm_open, + .release = ib_ucm_close, + .write = ib_ucm_write, + .poll = ib_ucm_poll, +}; + + +static struct class_simple *ib_ucm_class; +static struct cdev ib_ucm_cdev; + +static int __init ib_ucm_init(void) +{ + int result; + + result = register_chrdev_region(IB_UCM_DEV, 1, "infiniband_cm"); + if (result) { + printk(KERN_ERR "UCM: Error <%d> registering dev\n", 
result); + goto err_chr; + } + + cdev_init(&ib_ucm_cdev, &ib_ucm_fops); + + result = cdev_add(&ib_ucm_cdev, IB_UCM_DEV, 1); + if (result) { + printk(KERN_ERR "UCM: Error <%d> adding cdev\n", result); + goto err_cdev; + } + + ib_ucm_class = class_simple_create(THIS_MODULE, "ucm"); + if (IS_ERR(ib_ucm_class)) { + result = PTR_ERR(ib_ucm_class); + printk(KERN_ERR "UCM: Error <%d> creating class\n", result); + goto err_class; + } + + class_simple_device_add(ib_ucm_class, + IB_UCM_DEV, + NULL, + "ucm"); + + devfs_mk_cdev(IB_UCM_DEV, + S_IFCHR|S_IRUGO|S_IWUGO, + "infiniband/ucm"); + + idr_init(&ctx_id_table); + init_MUTEX(&ctx_id_mutex); + + return 0; +err_class: + cdev_del(&ib_ucm_cdev); +err_cdev: + unregister_chrdev_region(IB_UCM_DEV, 1); +err_chr: + return result; +} + +static void __exit ib_ucm_cleanup(void) +{ + devfs_remove("infiniband/ucm"); + class_simple_device_remove(IB_UCM_DEV); + class_simple_destroy(ib_ucm_class); + cdev_del(&ib_ucm_cdev); + unregister_chrdev_region(IB_UCM_DEV, 1); +} + +module_init(ib_ucm_init); +module_exit(ib_ucm_cleanup); Index: infiniband/core/ucm.h =================================================================== --- infiniband/core/ucm.h (revision 0) +++ infiniband/core/ucm.h (revision 0) @@ -0,0 +1,84 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef UCM_H +#define UCM_H + +#include +#include +#include +#include + +#include +#include + +#define IB_UCM_CM_ID_INVALID 0xffffffff + +struct ib_ucm_file { + struct semaphore mutex; + struct file *filp; + /* + * list of pending events + */ + struct list_head ctxs; /* list of active connections */ + struct list_head events; /* list of pending events */ + wait_queue_head_t poll_wait; +}; + +struct ib_ucm_context { + int id; + int ref; + int error; + + struct ib_ucm_file *file; + struct ib_cm_id *cm_id; + struct semaphore mutex; + + struct list_head events; /* list of pending events. 
*/ + struct list_head file_list; /* member in file ctx list */ +}; + +struct ib_ucm_event { + struct ib_ucm_context *ctx; + struct list_head file_list; /* member in file event list */ + struct list_head ctx_list; /* member in ctx event list */ + + struct ib_ucm_event_resp resp; + void *data; + void *info; + int data_len; + int info_len; +}; + +#endif /* UCM_H */ From libor at topspin.com Fri Mar 11 15:26:28 2005 From: libor at topspin.com (Libor Michalek) Date: Fri, 11 Mar 2005 15:26:28 -0800 Subject: [openib-general] Re: [PATCH] [TRIVIAL] SDP: sdp_actv.c remove redundant initialization In-Reply-To: <1110379398.4645.46.camel@localhost.localdomain>; from halr@voltaire.com on Wed, Mar 09, 2005 at 09:43:18AM -0500 References: <1110379398.4645.46.camel@localhost.localdomain> Message-ID: <20050311152628.D31689@topspin.com> On Wed, Mar 09, 2005 at 09:43:18AM -0500, Hal Rosenstock wrote: > SDP: sdp_actv.c remove redundant initialization > > qp_attr->min_rnr_timer is already initialized to 0 by > cm_init_qp_rtr_attr in cm.c > > Is this really intended to be IB_RNR_TIMER_122_88 instead ? No, the RNR timer can be set to 0. SDP should never need RNR since the protocol ensures that buffers are posted for receive before the remote connection peer sends any data. Applied and commited along with the patch for sdp_pass.c Thanks. -Libor From libor at topspin.com Fri Mar 11 15:43:16 2005 From: libor at topspin.com (Libor Michalek) Date: Fri, 11 Mar 2005 15:43:16 -0800 Subject: [openib-general] Re: [Andrew Morton] inappropriate use of in_atomic() In-Reply-To: <20050311073108.GA20989@mellanox.co.il>; from mst@mellanox.co.il on Fri, Mar 11, 2005 at 09:31:08AM +0200 References: <52oedq946k.fsf@topspin.com> <20050311073108.GA20989@mellanox.co.il> Message-ID: <20050311154316.E31689@topspin.com> On Fri, Mar 11, 2005 at 09:31:08AM +0200, Michael S. Tsirkin wrote: > > Sdp also has a couple of uses. > Maybe we can use the atomic branch in all cases here, as well? > Libor? 
Yes, the case in sdp_iocb.c can probably always take the atomic path. The kmap/kunmap cases really only care whether we're in an interrupt, so switching to in_interrupt() should be sufficient.

-Libor

From roland at topspin.com Fri Mar 11 15:45:31 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 11 Mar 2005 15:45:31 -0800
Subject: [openib-general] [RFC] Userspace CM access.
In-Reply-To: <20050311152213.C31689@topspin.com> (Libor Michalek's message of "Fri, 11 Mar 2005 15:22:13 -0800")
References: <20050311150946.A31689@topspin.com> <20050311152213.C31689@topspin.com>
Message-ID: <524qfh7nms.fsf@topspin.com>

I suggest tabifying the file -- there seem to be some whitespace problems like:

+ ctx = ib_ucm_ctx_get(cmd.id);
+ if (!ctx)

(spaces on one line, tabs the next).

More substantive comments later...

- R.

From hozer at hozed.org Fri Mar 11 17:15:27 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Fri, 11 Mar 2005 19:15:27 -0600
Subject: [openib-general] kernel 2.6.11 and userland packages?
Message-ID: <20050312011527.GC9768@kalmia.hozed.org>

I have in my office a shiny new kernel.org 2.6.11 64 bit kernel running on my Mac G5, with the drivers/infiniband modules loaded. What do I need to do to verify this all works?

Also, I'd really like to make debian packages of the userland utilities and libraries, and get a debian/ subdirectory into the subversion release, so the packages can be rebuilt easily. Where should I start on this?

--
--------------------------------------------------------------------------
Troy Benjegerdes 'da hozer' hozer at hozed.org

Someone asked me why I work on this free (http://www.fsf.org/philosophy/) software stuff and not get a real job. Charles Shultz had the best answer: "Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life."
-- Charles Shultz

From hozer at hozed.org Fri Mar 11 21:55:09 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Fri, 11 Mar 2005 23:55:09 -0600
Subject: [openib-general] http://openib.org/downloads/
In-Reply-To: <20050309025647.GN5502@esmail.cup.hp.com>
References: <422E315B.8010708@ammasso.com> <1110324708.8595.223.camel@localhost> <20050309025647.GN5502@esmail.cup.hp.com>
Message-ID: <20050312055509.GH9768@kalmia.hozed.org>

On Tue, Mar 08, 2005 at 06:56:47PM -0800, Grant Grundler wrote:
> On Tue, Mar 08, 2005 at 03:31:48PM -0800, Matt Leininger wrote:
> > You can grab the openib source code from the subversion repository.
> > See http://www.openib.org/tools.html. If you want everything run 'svn
> > co https://openib.org/svn'
>
> Matt,
> probably best to just add a short blurb to tools.html
> that includes an example using gen2 branch. That's what
> we want people to focus on I think.

Having just waded into this stuff, I'd really like a "just_build_it_all.sh" script.

Well, actually, what I'd really like is to do:

svn co https://openib.org/some/path
cd some/path
fakeroot dpkg-buildpackage

and get me some debian packages ;)

FYI, I'm hoping that at least the PPC debian 2.6.11 kernel packages will have IB modules enabled in the .config

From mst at mellanox.co.il Mon Mar 14 06:46:50 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 14 Mar 2005 16:46:50 +0200
Subject: [openib-general] [PATCH] alignment check in reg_phys_mr
Message-ID: <20050314144650.GF16749@mellanox.co.il>

Apparently reg_phys_mr in mthca requires that the start address is page aligned. Seems like a bug to me. Roland?

Signed-off-by: Michael S.
Tsirkin Index: drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- drivers/infiniband/hw/mthca/mthca_provider.c (revision 1983) +++ drivers/infiniband/hw/mthca/mthca_provider.c (working copy) @@ -494,7 +494,7 @@ static struct ib_mr *mthca_reg_phys_mr(s mask = 0; total_size = 0; for (i = 0; i < num_phys_buf; ++i) { - if (buffer_list[i].addr & ~PAGE_MASK) + if (i != 0 && buffer_list[i].addr & ~PAGE_MASK) return ERR_PTR(-EINVAL); if (i != 0 && i != num_phys_buf - 1 && (buffer_list[i].size & ~PAGE_MASK)) -- MST - Michael S. Tsirkin From halr at voltaire.com Mon Mar 14 08:07:25 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Mar 2005 11:07:25 -0500 Subject: [openib-general] Re: [openib-commits] r1983 - gen2/trunk/src/linux-kernel/infiniband/hw/mthca In-Reply-To: <20050312203709.34B3522834D@openib.ca.sandia.gov> References: <20050312203709.34B3522834D@openib.ca.sandia.gov> Message-ID: <1110816445.4645.31.camel@localhost.localdomain> On Sat, 2005-03-12 at 15:37, roland at openib.org wrote: > + props->max_qp_rd_atom = 1 << mdev->qp_table.rdb_shift; > + props->max_qp_init_rd_atom = 1 << mdev->qp_table.rdb_shift; These are getting set to 1. That's not what I was expecting. Thanks. -- Hal From gshipman at cs.unm.edu Mon Mar 14 09:21:56 2005 From: gshipman at cs.unm.edu (gshipman) Date: Mon, 14 Mar 2005 10:21:56 -0700 Subject: [openib-general] vstat error on bproc slave node (VAPI_EGEN) Message-ID: <6b97df984990ccc634d3ab93673a52fd@cs.unm.edu> I am relatively new to openib so here goes: I am attempting to configure our small cluster to use bproc and openib. Note I am using gen1 on kernel 2.6.6 patched with the clustermatic stuff, (should I be using gen2, is it stable for general use?). I have successfully gotten things going on the head node including opensm. 
I have successfully gotten the slave nodes to run the patched kernel, load the appropriate modules as well as the various user level libraries but I am having an issue on the slave nodes: If I run: $bpsh 13 /usr/mellanox/bin/vstat 1 HCA found: hca_id=InfiniHost0 Error: Could not retrieve handle to the HCA InfiniHost0 (VAPI_EGEN) On the head node I get: $/usr/mellanox/bin/vstat 1 HCA found: hca_id=InfiniHost0 vendor_id=0x02C9 vendor_part_id=0x5A44 hw_ver=0xA1 fw_ver=0x300020000 num_phys_ports=2 port=1 port_state=PORT_DOWN sm_lid=0x0000 port_lid=0x0353 port_lmc=0x00 max_mtu=2048 port=2 port_state=PORT_ACTIVE sm_lid=0x0354 port_lid=0x0354 port_lmc=0x00 max_mtu=2048 I can run ifconfig on the slave I see ib0 properly: $bpsh 13 ifconfig ib0 ib0 Link encap:Ethernet HWaddr 00:00:00:00:00:00 BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Thanks, Galen From roland at topspin.com Mon Mar 14 09:28:38 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 09:28:38 -0800 Subject: [openib-general] [PATCH] Make SDP compile with gcc-2.95 Message-ID: <52br9m2l2x.fsf@topspin.com> This trivial patch seems to be required to get SDP to compile with gcc 2.95. It seems to be working around a bug with handling empty "arg..." parameters to macros (without this change, gcc 2.95 eats x->state in addition to the comma following it when arg is empty). - R. 
Index: infiniband/ulp/sdp/sdp_proto.h =================================================================== --- infiniband/ulp/sdp/sdp_proto.h (revision 1977) +++ infiniband/ulp/sdp/sdp_proto.h (working copy) @@ -482,7 +482,7 @@ extern int sdp_debug_level; if (x) { \ sdp_dbg_out(level, type, \ "<%d> <%04x:%04x> " format, \ - x->hashent, x->istate, x->state, \ + x->hashent, x->istate, x->state , \ ## arg); \ } \ else { \ From rminnich at lanl.gov Mon Mar 14 09:37:55 2005 From: rminnich at lanl.gov (Ronald G. Minnich) Date: Mon, 14 Mar 2005 10:37:55 -0700 (MST) Subject: [openib-general] vstat error on bproc slave node (VAPI_EGEN) In-Reply-To: <6b97df984990ccc634d3ab93673a52fd@cs.unm.edu> References: <6b97df984990ccc634d3ab93673a52fd@cs.unm.edu> Message-ID: On Mon, 14 Mar 2005, gshipman wrote: > I am attempting to configure our small cluster to use bproc and openib. > Note I am using gen1 on kernel 2.6.6 patched with the clustermatic > stuff, (should I be using gen2, is it stable for general use?). use gen2. I have tested it and it is ok. > > I have successfully gotten things going on the head node including opensm. I > have successfully gotten the slave nodes to run the patched kernel, load the > appropriate modules as well as the various user level libraries but I am > having an issue on the slave nodes: > > If I run: > $bpsh 13 /usr/mellanox/bin/vstat > 1 HCA found: > hca_id=InfiniHost0 > Error: Could not retrieve handle to the HCA InfiniHost0 (VAPI_EGEN) arg. I used to have this a lot. It's a mellanox issue, and it's a pain to work around. Can you just cut to gen2 and stop using gen1? I would really recommend for future use only using gen2 and not using any of the mellanox stuff. I realize user level is not there yet but I think it is worth just waiting for. 
ron From roland at topspin.com Mon Mar 14 10:05:48 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 10:05:48 -0800 Subject: [openib-general] Re: [openib-commits] r1983 - gen2/trunk/src/linux-kernel/infiniband/hw/mthca In-Reply-To: <1110816445.4645.31.camel@localhost.localdomain> (Hal Rosenstock's message of "14 Mar 2005 11:07:25 -0500") References: <20050312203709.34B3522834D@openib.ca.sandia.gov> <1110816445.4645.31.camel@localhost.localdomain> Message-ID: <52oedm14sj.fsf@topspin.com> > > + props->max_qp_rd_atom = 1 << mdev->qp_table.rdb_shift; > > + props->max_qp_init_rd_atom = 1 << mdev->qp_table.rdb_shift; > These are getting set to 1. That's not what I was expecting. i.e. rdb_shift == 0. Hmm... OK, should be fixed now. - R. From mst at mellanox.co.il Mon Mar 14 10:13:45 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 Mar 2005 20:13:45 +0200 Subject: [openib-general] Re: fmr support in mthca In-Reply-To: <523bv286ig.fsf@topspin.com> References: <2005331520.b7ycIGGfSwBBRSED@topspin.com> <20050304140155.GC13804@mellanox.co.il> <526507mkmm.fsf@topspin.com> <20050311131446.GC20989@mellanox.co.il> <523bv286ig.fsf@topspin.com> Message-ID: <20050314181345.GB17668@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: fmr support in mthca > > Michael> Roland, would you like me to implement FMRs in mthca? It > Michael> is needed by SDP for zero copy support. > > Yes, that would be great. > > BTW, for mem-free mode I put the MPT and MTT in lowmem to make FMRs > simpler to use. > > - R. > OK, I have done the implementation, will test and post tomorrow. -- MST - Michael S. 
Tsirkin From halr at voltaire.com Mon Mar 14 10:09:11 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Mar 2005 13:09:11 -0500 Subject: [openib-general] Re: [openib-commits] r1983 - gen2/trunk/src/linux-kernel/infiniband/hw/mthca In-Reply-To: <52oedm14sj.fsf@topspin.com> References: <20050312203709.34B3522834D@openib.ca.sandia.gov> <1110816445.4645.31.camel@localhost.localdomain> <52oedm14sj.fsf@topspin.com> Message-ID: <1110823751.4645.5.camel@localhost.localdomain> On Mon, 2005-03-14 at 13:05, Roland Dreier wrote: > > > + props->max_qp_rd_atom = 1 << mdev->qp_table.rdb_shift; > > > + props->max_qp_init_rd_atom = 1 << mdev->qp_table.rdb_shift; > > > These are getting set to 1. That's not what I was expecting. > > i.e. rdb_shift == 0. Hmm... > > OK, should be fixed now. This is now getting set to 4 (rdb_shift = 2). Still not what I was expecting :-( -- Hal From roland at topspin.com Mon Mar 14 10:18:18 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 10:18:18 -0800 Subject: [openib-general] Re: [openib-commits] r1983 - gen2/trunk/src/linux-kernel/infiniband/hw/mthca In-Reply-To: <1110823751.4645.5.camel@localhost.localdomain> (Hal Rosenstock's message of "14 Mar 2005 13:09:11 -0500") References: <20050312203709.34B3522834D@openib.ca.sandia.gov> <1110816445.4645.31.camel@localhost.localdomain> <52oedm14sj.fsf@topspin.com> <1110823751.4645.5.camel@localhost.localdomain> Message-ID: <527jka147p.fsf@topspin.com> Hal> This is now getting set to 4 (rdb_shift = 2). Still not what Hal> I was expecting :-( What were you expecting? - R. 
From halr at voltaire.com Mon Mar 14 10:17:21 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Mar 2005 13:17:21 -0500 Subject: [openib-general] Re: [openib-commits] r1983 - gen2/trunk/src/linux-kernel/infiniband/hw/mthca In-Reply-To: <527jka147p.fsf@topspin.com> References: <20050312203709.34B3522834D@openib.ca.sandia.gov> <1110816445.4645.31.camel@localhost.localdomain> <52oedm14sj.fsf@topspin.com> <1110823751.4645.5.camel@localhost.localdomain> <527jka147p.fsf@topspin.com> Message-ID: <1110824241.4645.9.camel@localhost.localdomain> On Mon, 2005-03-14 at 13:18, Roland Dreier wrote: > Hal> This is now getting set to 4 (rdb_shift = 2). Still not what > Hal> I was expecting :-( > > What were you expecting? I thought this would be a larger number, around 64K. That's what I think gen1 sees. Is 4 correct (for gen2) ? -- Hal From roland at topspin.com Mon Mar 14 10:51:04 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 10:51:04 -0800 Subject: [openib-general] Re: [openib-commits] r1983 - gen2/trunk/src/linux-kernel/infiniband/hw/mthca In-Reply-To: <1110824241.4645.9.camel@localhost.localdomain> (Hal Rosenstock's message of "14 Mar 2005 13:17:21 -0500") References: <20050312203709.34B3522834D@openib.ca.sandia.gov> <1110816445.4645.31.camel@localhost.localdomain> <52oedm14sj.fsf@topspin.com> <1110823751.4645.5.camel@localhost.localdomain> <527jka147p.fsf@topspin.com> <1110824241.4645.9.camel@localhost.localdomain> Message-ID: <52psy2ysbr.fsf@topspin.com> Hal> I thought this would be a larger number, around 64K. That's Hal> what I think gen1 sees. Is 4 correct (for gen2) ? The initiator number may be slightly bogus but the target number is correct. Each RDMA request takes 32 bytes of context memory at the target, so I don't see how a driver could support 64K outstanding RDMAs per QP (that would be 64K * 32 bytes * ~64K possible QPs = 128GB of context memory). - R. 
From roland at topspin.com Mon Mar 14 10:51:18 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 10:51:18 -0800 Subject: [openib-general] Re: fmr support in mthca In-Reply-To: <20050314181345.GB17668@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 Mar 2005 20:13:45 +0200") References: <2005331520.b7ycIGGfSwBBRSED@topspin.com> <20050304140155.GC13804@mellanox.co.il> <526507mkmm.fsf@topspin.com> <20050311131446.GC20989@mellanox.co.il> <523bv286ig.fsf@topspin.com> <20050314181345.GB17668@mellanox.co.il> Message-ID: <52ll8qysbd.fsf@topspin.com> Michael> OK, I have done the implementation, will test and post tomorrow. Excellent, I'm looking forward to seeing it. - R. From roland at topspin.com Mon Mar 14 10:53:10 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 10:53:10 -0800 Subject: [openib-general] [PATCH] uverbs rdma example (updated) In-Reply-To: <20050310123129.GA12542@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 10 Mar 2005 14:31:29 +0200") References: <20050309122700.GA2352@mellanox.co.il> <20050310123129.GA12542@mellanox.co.il> Message-ID: <52hdjeys89.fsf@topspin.com> I started looking over this code. As far as I can see, neither tx_depth nor rx_depth is used for anything. Is this correct? Should we just get rid of the options? Also would it make sense to change the RDMA operation to be unsignaled and just poll the destination buffer (ignore completions)? I realize this is a Mellanox extension to the spec but it might be more interesting than yet another variation on the pingpong code. - R. From tduffy at sun.com Mon Mar 14 10:57:13 2005 From: tduffy at sun.com (Tom Duffy) Date: Mon, 14 Mar 2005 10:57:13 -0800 Subject: [openib-general] .org pavilion spot in LW 2005 in SF Message-ID: <1110826633.21708.8.camel@duffman> Duncan, Hello, I am contacting you as a representative from the OpenIB.org alliance. 
We are a non-profit organization that is dedicated to providing an open-source, multi-vendor, best-of-breed InfiniBand stack for the Linux kernel as well as all the related userland libraries and utilities. Our website is http://www.openib.org. All of our projects are available under the GPL as well as a BSD license.

We would like a slot in the .org pavilion for LinuxWorld 2005 in San Francisco. The booth will have demos of InfiniBand in action using the recently accepted code in the 2.6.11 kernel running on multiple vendors' hardware.

Please "reply all" as I have CC'ed the developer list for OpenIB.

Thanks,

-tduffy

From mst at mellanox.co.il Mon Mar 14 11:10:11 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 14 Mar 2005 21:10:11 +0200
Subject: [openib-general] [PATCH] uverbs rdma example (updated)
In-Reply-To: <52hdjeys89.fsf@topspin.com>
References: <20050309122700.GA2352@mellanox.co.il> <20050310123129.GA12542@mellanox.co.il> <52hdjeys89.fsf@topspin.com>
Message-ID: <20050314191011.GD17668@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [openib-general] [PATCH] uverbs rdma example (updated)
>
> I started looking over this code. As far as I can see, neither
> tx_depth nor rx_depth is used for anything. Is this correct? Should
> we just get rid of the options?

Hmm. rx_depth is unused. tx_depth is used.

> Also would it make sense to change the RDMA operation to be unsignaled
> and just poll the destination buffer (ignore completions)?

Hmm. That's what I do for receive - polling on data. You can't assume the hardware will not read the buffer until you get a send completion, so you won't be able to re-use the send buffer. Since polling cq is done after post, it does not affect the latency in any way.
> I realize > this is a Mellanox extension to the spec but it might be more > interesting than yet another variation on the pingpong code. > > - R. > What do you refer to as extension? -- MST - Michael S. Tsirkin From roland at topspin.com Mon Mar 14 11:57:22 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 14 Mar 2005 11:57:22 -0800 Subject: [openib-general] [PATCH] uverbs rdma example (updated) In-Reply-To: <20050314191011.GD17668@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 Mar 2005 21:10:11 +0200") References: <20050309122700.GA2352@mellanox.co.il> <20050310123129.GA12542@mellanox.co.il> <52hdjeys89.fsf@topspin.com> <20050314191011.GD17668@mellanox.co.il> Message-ID: <52u0nexaot.fsf@topspin.com> Michael> Hmm. rx_depth is unused. tx_depth is used. Where? If I search through the whole patch for "tx_depth," the only place I see it do anything at all is in + ctx->cq = ibv_create_cq(ctx->context, rx_depth + tx_depth, NULL); but I don't see how more than one send can be outstanding. Michael> Hmm. Thats what I do for receieve - polling on data. Michael> You cant assume the hardware will not read the buffer Michael> until you get a send completion, so you wont be able to Michael> re-use the send buffer. Since polling cq is done after Michael> post, it does not affect the latency in any way. That makes sense. Also I forgot that without a completion we can never clean up the WQE buffer. - R. From hozer at hozed.org Mon Mar 14 15:01:18 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Mon, 14 Mar 2005 17:01:18 -0600 Subject: [openib-general] Getting rid of pinned memory requirement Message-ID: <20050314230118.GP9768@kalmia.hozed.org> The current InfiniBand model of using 'mlock()' to maintain a constant virtual to physical mapping for registered memory pages is not going to work with NUMA page migration and memory hotplug. 
I want to get some discussion started on this list, and once we have an idea what's feasible from the InfiniBand side, to bring up the discussion on linux-kernel, and get the memory hotplug and NUMA page migration people involved as well.

I think the following list covers the major points. Are there any big "gotchas" involved?

* Add "registered" flag to linux/mm.h (VM_REGISTERED 0x01000000)

* Need to define a 'registered memory' API. Maybe using 'madvise()'?

* Kernel needs to be able to unpin registered memory and shoot down cached mappings in network cards (treat IB/iWARP cards like a TLB)

* Requires IB/iWARP card to dispatch an interrupt on a mapping 'miss'

* This model allows applications to register more memory than physically exists, and the kernel manages what is actually pinned.

* Requires adding hooks in MM code to dispatch driver mapping shootdowns. (A per-VM area list of adapters to be notified for the mapping?)

I know that having the card dispatch an interrupt on an incoming packet that's not mapped is outside the spec. The alternative is that if the kernel wants to move some memory around that's registered, it's got to have some way to either kill the application, or tear down and re-establish all the QPs. I suppose another alternative might work: a "SIG_I_KILLED_YOUR_MAPPINGS" type signal to tell the application (or library) that it needs to re-establish all its pinned memory.
From caitlinb at siliquent.com Mon Mar 14 15:29:06 2005 From: caitlinb at siliquent.com (Caitlin Bestler) Date: Mon, 14 Mar 2005 15:29:06 -0800 Subject: [openib-general] Getting rid of pinned memory requirement Message-ID: <8508251A6FC08A489844A94261D3693A039002@fiona.siliquent.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Troy > Benjegerdes > Sent: Monday, March 14, 2005 3:01 PM > To: openib-general at openib.org > Subject: [openib-general] Getting rid of pinned memory requirement > > The current InfiniBand model of using 'mlock()' to maintain a > constant virtual to physical mapping for registered memory > pages is not going to work with NUMA page migration and > memory hotplug. > > I want to get some discussion started on this list, and once > we have an idea what's feasable from the infiniband side, to > bring up the discussion on linux-kernel, and get the memory > hotplug and numa page migration people involved as well. > > I think the following list covers the major points. Are there > any big "gotcha's" involved? > > * Add "registered" flag to linux/mm.h (VM_REGISTERED 0x01000000) > > * Need to define a 'registered memory' api. Maybe using 'madvise()' ? > > * Kernel needs to be able to unpin registered memory and > shoot down cached > mappings in network cards (treat IB/Iwarp cards like a TLB) > > * Requires IB/Iwarp card to dispatch an interrupt on a mapping 'miss' > The point of requiring that the memory be pinned is so that the IB/iWARP card does not have to deal with the kernel on a per-placement basis. That includes having to double-check any host memory resources to see if there is anything to 'miss' in the mapping. Once a memory region is registered the HCA/RNIC is entitled to assume that the mapping from LKey/Address (STag/TO) to physical memory is not subject to change. 
Enhancement protocols have been discussed in both DAPL and RNIC-PI to allow kernels to rearrange memory, but they involve the host explicitly telling the HCA/RNIC to suspend access to a memory region *and*, when possible, taking action to quiesce the connections using the memory region.

> * This model allows applications to register more memory than
> physically exists, and the kernel manages what is actually pinned.
>

Fundamental to any definition of RDMA is that the application controls the availability of target memory -- not the kernel. That is why traditional buffer flow controls do not apply.

> * Requires adding hooks in MM code to dispatch driver mapping
> shootdowns. (A
> per-VM area list of adapters to be notified for the mapping?)
>
>
> I know that having the card dispatch an interrupt on an
> incoming packet that's not mapped is outside the spec. The
> alternative is that if the kernel wants to move some memory
> around that's registered, it's got to have some way to either
> kill the application, or tear down and re-establish all the
> QP's. I suppose an alternative would be a
> "SIG_I_KILLED_YOUR_MAPPINGS" type signal to tell the
> application (or library) that it needs to re-establish all
> it's pinned memory might work.
>

Only if you are re-arranging memory for a bunch of connections that were taking a nice nap. If you did this for active connections they could be dead before you could reregister the memory. And even if you could reregister it, how do you redistribute the RKeys?
From hozer at hozed.org Mon Mar 14 15:56:05 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Mon, 14 Mar 2005 17:56:05 -0600 Subject: [openib-general] Getting rid of pinned memory requirement In-Reply-To: <8508251A6FC08A489844A94261D3693A039002@fiona.siliquent.com> References: <8508251A6FC08A489844A94261D3693A039002@fiona.siliquent.com> Message-ID: <20050314235605.GS9768@kalmia.hozed.org> On Mon, Mar 14, 2005 at 03:29:06PM -0800, Caitlin Bestler wrote: > > > > -----Original Message----- > > From: openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] On Behalf Of Troy > > Benjegerdes > > Sent: Monday, March 14, 2005 3:01 PM > > To: openib-general at openib.org > > Subject: [openib-general] Getting rid of pinned memory requirement > > > > The current InfiniBand model of using 'mlock()' to maintain a > > constant virtual to physical mapping for registered memory > > pages is not going to work with NUMA page migration and > > memory hotplug. > > > > I want to get some discussion started on this list, and once > > we have an idea what's feasable from the infiniband side, to > > bring up the discussion on linux-kernel, and get the memory > > hotplug and numa page migration people involved as well. > > > > I think the following list covers the major points. Are there > > any big "gotcha's" involved? > > > > * Add "registered" flag to linux/mm.h (VM_REGISTERED 0x01000000) > > > > * Need to define a 'registered memory' api. Maybe using 'madvise()' ? > > > > * Kernel needs to be able to unpin registered memory and > > shoot down cached > > mappings in network cards (treat IB/Iwarp cards like a TLB) > > > > * Requires IB/Iwarp card to dispatch an interrupt on a mapping 'miss' > > > > The point of requiring that the memory be pinned is so that > the IB/iWARP card does not have to deal with the kernel on > a per-placement basis. 
> > That includes having to double-check any host memory resources > to see if there is anything to 'miss' in the mapping. I guess I wasn't implying any 'double-checking'.. What I want is for the kernel to be able to unpin memory and tell the card it did so, instead of being locked into never being able to move that memory around. This requires no host memory interaction. By doing this, I can register a whole lot *more* memory, and the kernel can still keep buggy applications from trashing the whole system. [snip] > Fundamental to any definition of RDMA is that the application > controls the avialability of target memory -- not the kernel. > That is why traditional buffer flow controls do not apply. While hardware designers may like this idea, I would like to make the point that if you want the application to *absolutely* control the availability of physical memory, you shouldn't be writing userspace applications that run on Linux. There's always going to be a limit on how much memory you can mlock. And right now the only option the kernel has for unlocking that memory is to kill the application. I think there's got to be a reasonable way to deal with this that doesn't make the application responsible for everything in the world. We don't want to have to rewrite every RDMA application to be able to support memory hotplug. This is an obvious layer that can and should be abstracted by the kernel. From mshefty at ichips.intel.com Mon Mar 14 16:22:58 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 14 Mar 2005 16:22:58 -0800 Subject: [openib-general] [PATCH] [MAD] API changes and updates to support RMPP Message-ID: <20050314162258.1bedff07.mshefty@ichips.intel.com> This patch updates the MAD API to help provide support for the RMPP implementation and clients. Notable changes: * A valid memory region (MR) is returned as part of the mad_agent registration process. The agent, CM, and SA query modules were updated to use the returned MR. 
* A list_head structure was added to ib_mad_recv_wc to make walking the list of received MAD buffers easier. As part of this change, a bug was fixed where freed memory could have been accessed in ib_free_recv_mad() if RMPP were enabled. This change is unlikely to affect existing clients. Please respond with any comments. The received RMPP support code* is currently dependent on these changes. Signed-off-by: Sean Hefty *not included, some assembly required... Index: core/agent.c =================================================================== --- core/agent.c (revision 1964) +++ core/agent.c (working copy) @@ -135,7 +135,7 @@ static int agent_mad_send(struct ib_mad_ sizeof(mad_priv->mad), DMA_TO_DEVICE); gather_list.length = sizeof(mad_priv->mad); - gather_list.lkey = (*port_priv->mr).lkey; + gather_list.lkey = mad_agent->mr->lkey; send_wr.next = NULL; send_wr.opcode = IB_WR_SEND; @@ -324,22 +324,12 @@ int ib_agent_port_open(struct ib_device goto error3; } - port_priv->mr = ib_get_dma_mr(port_priv->smp_agent->qp->pd, - IB_ACCESS_LOCAL_WRITE); - if (IS_ERR(port_priv->mr)) { - printk(KERN_ERR SPFX "Couldn't get DMA MR\n"); - ret = PTR_ERR(port_priv->mr); - goto error4; - } - spin_lock_irqsave(&ib_agent_port_list_lock, flags); list_add_tail(&port_priv->port_list, &ib_agent_port_list); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); return 0; -error4: - ib_unregister_mad_agent(port_priv->perf_mgmt_agent); error3: ib_unregister_mad_agent(port_priv->smp_agent); error2: @@ -363,8 +353,6 @@ int ib_agent_port_close(struct ib_device list_del(&port_priv->port_list); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); - ib_dereg_mr(port_priv->mr); - ib_unregister_mad_agent(port_priv->perf_mgmt_agent); ib_unregister_mad_agent(port_priv->smp_agent); kfree(port_priv); Index: core/cm.c =================================================================== --- core/cm.c (revision 1977) +++ core/cm.c (working copy) @@ -75,7 +75,6 @@ static struct ib_cm { struct cm_port 
{ struct cm_device *cm_dev; struct ib_mad_agent *mad_agent; - struct ib_mr *mr; u8 port_num; }; @@ -191,7 +190,7 @@ static int cm_alloc_msg(struct cm_id_pri DMA_TO_DEVICE); pci_unmap_addr_set(m, mapping, m->sge.addr); m->sge.length = sizeof m->mad; - m->sge.lkey = cm_id_priv->av.port->mr->lkey; + m->sge.lkey = cm_id_priv->av.port->mad_agent->mr->lkey; m->send_wr.wr_id = (unsigned long) m; m->send_wr.sg_list = &m->sge; @@ -2970,14 +2969,9 @@ static void cm_add_one(struct ib_device if (IS_ERR(port->mad_agent)) goto error2; - port->mr = ib_get_dma_mr(port->mad_agent->qp->pd, - IB_ACCESS_LOCAL_WRITE); - if (IS_ERR(port->mr)) - goto error3; - ret = ib_modify_port(device, i, 0, &port_modify); if (ret) - goto error4; + goto error3; } ib_set_client_data(device, &cm_client, cm_dev); @@ -2986,15 +2980,13 @@ static void cm_add_one(struct ib_device write_unlock_irqrestore(&cm.device_lock, flags); return; -error4: - ib_dereg_mr(port->mr); error3: ib_unregister_mad_agent(port->mad_agent); error2: port_modify.set_port_cap_mask = 0; port_modify.clr_port_cap_mask = IB_PORT_CM_SUP; while (--i) { - port = &cm_dev->port[i]; + port = &cm_dev->port[i-1]; ib_modify_port(device, port->port_num, 0, &port_modify); ib_unregister_mad_agent(port->mad_agent); } @@ -3022,7 +3014,6 @@ static void cm_remove_one(struct ib_devi for (i = 1; i <= device->phys_port_cnt; i++) { port = &cm_dev->port[i-1]; - ib_dereg_mr(port->mr); ib_modify_port(device, port->port_num, 0, &port_modify); ib_unregister_mad_agent(port->mad_agent); } Index: core/mad.c =================================================================== --- core/mad.c (revision 1980) +++ core/mad.c (working copy) @@ -35,8 +35,6 @@ #include #include -#include - #include "mad_priv.h" #include "smi.h" #include "agent.h" @@ -264,22 +262,29 @@ struct ib_mad_agent *ib_register_mad_age ret = ERR_PTR(-ENOMEM); goto error1; } + memset(mad_agent_priv, 0, sizeof *mad_agent_priv); + + mad_agent_priv->agent.mr = ib_get_dma_mr(port_priv->qp_info[qpn].qp->pd, 
+ IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(mad_agent_priv->agent.mr)) { + ret = ERR_PTR(-ENOMEM); + goto error2; + } if (mad_reg_req) { reg_req = kmalloc(sizeof *reg_req, GFP_KERNEL); if (!reg_req) { ret = ERR_PTR(-ENOMEM); - goto error2; + goto error3; } /* Make a copy of the MAD registration request */ memcpy(reg_req, mad_reg_req, sizeof *reg_req); } /* Now, fill in the various structures */ - memset(mad_agent_priv, 0, sizeof *mad_agent_priv); mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; mad_agent_priv->reg_req = reg_req; - mad_agent_priv->rmpp_version = rmpp_version; + mad_agent_priv->agent.rmpp_version = rmpp_version; mad_agent_priv->agent.device = device; mad_agent_priv->agent.recv_handler = recv_handler; mad_agent_priv->agent.send_handler = send_handler; @@ -304,7 +309,7 @@ struct ib_mad_agent *ib_register_mad_age if (method) { if (method_in_use(&method, mad_reg_req)) - goto error3; + goto error4; } } ret2 = add_nonoui_reg_req(mad_reg_req, mad_agent_priv, @@ -320,14 +325,14 @@ struct ib_mad_agent *ib_register_mad_age if (is_vendor_method_in_use( vendor_class, mad_reg_req)) - goto error3; + goto error4; } } ret2 = add_oui_reg_req(mad_reg_req, mad_agent_priv); } if (ret2) { ret = ERR_PTR(ret2); - goto error3; + goto error4; } } @@ -349,11 +354,13 @@ struct ib_mad_agent *ib_register_mad_age return &mad_agent_priv->agent; -error3: +error4: spin_unlock_irqrestore(&port_priv->reg_lock, flags); kfree(reg_req); -error2: +error3: kfree(mad_agent_priv); +error2: + ib_dereg_mr(mad_agent_priv->agent.mr); error1: return ret; } @@ -490,18 +497,16 @@ static void unregister_mad_agent(struct * MADs, preventing us from queuing additional work */ cancel_mads(mad_agent_priv); - port_priv = mad_agent_priv->qp_info->port_priv; - cancel_delayed_work(&mad_agent_priv->timed_work); - flush_workqueue(port_priv->wq); spin_lock_irqsave(&port_priv->reg_lock, flags); remove_mad_reg_req(mad_agent_priv); list_del(&mad_agent_priv->agent_list); spin_unlock_irqrestore(&port_priv->reg_lock, 
flags); - /* XXX: Cleanup pending RMPP receives for this agent */ + flush_workqueue(port_priv->wq); + /* ib_cancel_rmpp_recvs(mad_agent_priv); */ atomic_dec(&mad_agent_priv->refcount); wait_event(mad_agent_priv->wait, @@ -509,6 +514,7 @@ static void unregister_mad_agent(struct if (mad_agent_priv->reg_req) kfree(mad_agent_priv->reg_req); + ib_dereg_mr(mad_agent_priv->agent.mr); kfree(mad_agent_priv); } @@ -757,7 +763,7 @@ static int handle_outgoing_dr_smp(struct list_add_tail(&local->completion_list, &mad_agent_priv->local_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); queue_work(mad_agent_priv->qp_info->port_priv->wq, - &mad_agent_priv->local_work); + &mad_agent_priv->local_work); ret = 1; out: return ret; @@ -919,31 +925,33 @@ EXPORT_SYMBOL(ib_post_send_mad); */ void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) { - struct ib_mad_recv_buf *entry; + struct ib_mad_recv_buf *mad_recv_buf, *temp_recv_buf; struct ib_mad_private_header *mad_priv_hdr; struct ib_mad_private *priv; + struct list_head free_list; - mad_priv_hdr = container_of(mad_recv_wc, - struct ib_mad_private_header, - recv_wc); - priv = container_of(mad_priv_hdr, struct ib_mad_private, header); - - /* - * Walk receive buffer list associated with this WC - * No need to remove them from list of receive buffers - */ - list_for_each_entry(entry, &mad_recv_wc->recv_buf.list, list) { - /* Free previous receive buffer */ - kmem_cache_free(ib_mad_cache, priv); + if (mad_recv_wc->mad_len <= sizeof(struct ib_mad)) { mad_priv_hdr = container_of(mad_recv_wc, struct ib_mad_private_header, recv_wc); priv = container_of(mad_priv_hdr, struct ib_mad_private, header); - } + kmem_cache_free(ib_mad_cache, priv); + } else { + INIT_LIST_HEAD(&free_list); + list_splice_init(&mad_recv_wc->rmpp_list, &free_list); - /* Free last buffer */ - kmem_cache_free(ib_mad_cache, priv); + list_for_each_entry_safe(mad_recv_buf, temp_recv_buf, + &free_list, list) { + mad_priv_hdr = container_of(mad_recv_wc, + struct 
ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, + struct ib_mad_private, + header); + kmem_cache_free(ib_mad_cache, priv); + } + } } EXPORT_SYMBOL(ib_free_recv_mad); @@ -1486,16 +1494,19 @@ out: return valid; } -/* - * Return start of fully reassembled MAD, or NULL, if MAD isn't assembled yet - */ -static struct ib_mad_private * -reassemble_recv(struct ib_mad_agent_private *mad_agent_priv, - struct ib_mad_private *recv) -{ - /* Until we have RMPP, all receives are reassembled!... */ - INIT_LIST_HEAD(&recv->header.recv_wc.recv_buf.list); - return recv; +static struct ib_mad_recv_wc * +process_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_recv_wc *mad_recv_wc) +{ + INIT_LIST_HEAD(&mad_recv_wc->rmpp_list); + list_add(&mad_recv_wc->recv_buf.list, &mad_recv_wc->rmpp_list); + + /* + if (mad_agent_priv->agent.rmpp_version) + return ib_process_rmpp_recv(mad_agent_priv, mad_recv_wc); + else + */ + return mad_recv_wc; } static struct ib_mad_send_wr_private* @@ -1526,16 +1537,17 @@ find_send_req(struct ib_mad_agent_privat } static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv, - struct ib_mad_private *recv, + struct ib_mad_recv_wc *mad_recv_wc, int solicited) { struct ib_mad_send_wr_private *mad_send_wr; struct ib_mad_send_wc mad_send_wc; unsigned long flags; + u64 tid; - /* Fully reassemble receive before processing */ - recv = reassemble_recv(mad_agent_priv, recv); - if (!recv) { + /* Process the receive before giving it to the user. 
*/ + mad_recv_wc = process_recv(mad_agent_priv, mad_recv_wc); + if (!mad_recv_wc) { if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); return; @@ -1543,12 +1555,12 @@ static void ib_mad_complete_recv(struct /* Complete corresponding request */ if (solicited) { + tid = mad_recv_wc->recv_buf.mad->mad_hdr.tid; spin_lock_irqsave(&mad_agent_priv->lock, flags); - mad_send_wr = find_send_req(mad_agent_priv, - recv->mad.mad.mad_hdr.tid); + mad_send_wr = find_send_req(mad_agent_priv, tid); if (!mad_send_wr) { spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - ib_free_recv_mad(&recv->header.recv_wc); + ib_free_recv_mad(mad_recv_wc); if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); return; @@ -1558,10 +1570,9 @@ static void ib_mad_complete_recv(struct spin_unlock_irqrestore(&mad_agent_priv->lock, flags); /* Defined behavior is to complete response before request */ - recv->header.recv_wc.wc->wr_id = mad_send_wr->wr_id; - mad_agent_priv->agent.recv_handler( - &mad_agent_priv->agent, - &recv->header.recv_wc); + mad_recv_wc->wc->wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.recv_handler(&mad_agent_priv->agent, + mad_recv_wc); atomic_dec(&mad_agent_priv->refcount); mad_send_wc.status = IB_WC_SUCCESS; @@ -1569,9 +1580,8 @@ static void ib_mad_complete_recv(struct mad_send_wc.wr_id = mad_send_wr->wr_id; ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); } else { - mad_agent_priv->agent.recv_handler( - &mad_agent_priv->agent, - &recv->header.recv_wc); + mad_agent_priv->agent.recv_handler(&mad_agent_priv->agent, + mad_recv_wc); if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); } @@ -1675,7 +1685,8 @@ local: solicited = solicited_mad(&recv->mad.mad); mad_agent = find_mad_agent(port_priv, &recv->mad.mad, solicited); if (mad_agent) { - ib_mad_complete_recv(mad_agent, recv, solicited); + ib_mad_complete_recv(mad_agent, &recv->header.recv_wc, + solicited); /* * recv is freed up 
in error cases in ib_mad_complete_recv * or via recv_handler in ib_mad_complete_recv() @@ -1757,10 +1768,18 @@ static void ib_mad_complete_send_wr(stru { struct ib_mad_agent_private *mad_agent_priv; unsigned long flags; + enum ib_mad_result ret; mad_agent_priv = container_of(mad_send_wr->agent, struct ib_mad_agent_private, agent); + /* + if (mad_agent_priv->agent.rmpp_version) + ret = process_rmpp_send_wc(mad_send_wr, mad_send_wc); + else + */ + ret = IB_MAD_RESULT_SUCCESS; + spin_lock_irqsave(&mad_agent_priv->lock, flags); if (mad_send_wc->status != IB_WC_SUCCESS && mad_send_wr->status == IB_WC_SUCCESS) { @@ -1784,8 +1803,9 @@ static void ib_mad_complete_send_wr(stru if (mad_send_wr->status != IB_WC_SUCCESS ) mad_send_wc->status = mad_send_wr->status; - mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, - mad_send_wc); + if (ret == IB_MAD_RESULT_SUCCESS) + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + mad_send_wc); /* Release reference on agent taken when sending */ if (atomic_dec_and_test(&mad_agent_priv->refcount)) @@ -2034,8 +2054,7 @@ void cancel_sends(void *data) &mad_send_wc); kfree(mad_send_wr); - if (atomic_dec_and_test(&mad_agent_priv->refcount)) - wake_up(&mad_agent_priv->wait); + atomic_dec(&mad_agent_priv->refcount); spin_lock_irqsave(&mad_agent_priv->lock, flags); } spin_unlock_irqrestore(&mad_agent_priv->lock, flags); Index: core/agent_priv.h =================================================================== --- core/agent_priv.h (revision 1964) +++ core/agent_priv.h (working copy) @@ -57,7 +57,6 @@ struct ib_agent_port_private { int port_num; struct ib_mad_agent *smp_agent; /* SM class */ struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ - struct ib_mr *mr; }; #endif /* __IB_AGENT_PRIV_H__ */ Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 1980) +++ core/mad_priv.h (working copy) @@ -101,7 +101,6 @@ struct ib_mad_agent_private { atomic_t 
refcount; wait_queue_head_t wait; - u8 rmpp_version; }; struct ib_mad_snoop_private { Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 1964) +++ core/sa_query.c (working copy) @@ -77,7 +77,6 @@ struct ib_sa_sm_ah { struct ib_sa_port { struct ib_mad_agent *agent; - struct ib_mr *mr; struct ib_sa_sm_ah *sm_ah; struct work_struct update_task; spinlock_t ah_lock; @@ -492,7 +491,7 @@ retry: sizeof (struct ib_sa_mad), DMA_TO_DEVICE); gather_list.length = sizeof (struct ib_sa_mad); - gather_list.lkey = port->mr->lkey; + gather_list.lkey = port->agent->mr->lkey; pci_unmap_addr_set(query, mapping, gather_list.addr); ret = ib_post_send_mad(port->agent, &wr, &bad_wr); @@ -771,7 +770,6 @@ static void ib_sa_add_one(struct ib_devi sa_dev->end_port = e; for (i = 0; i <= e - s; ++i) { - sa_dev->port[i].mr = NULL; sa_dev->port[i].sm_ah = NULL; sa_dev->port[i].port_num = i + s; spin_lock_init(&sa_dev->port[i].ah_lock); @@ -783,13 +781,6 @@ static void ib_sa_add_one(struct ib_devi if (IS_ERR(sa_dev->port[i].agent)) goto err; - sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, - IB_ACCESS_LOCAL_WRITE); - if (IS_ERR(sa_dev->port[i].mr)) { - ib_unregister_mad_agent(sa_dev->port[i].agent); - goto err; - } - INIT_WORK(&sa_dev->port[i].update_task, update_sm_ah, &sa_dev->port[i]); } @@ -813,10 +804,8 @@ static void ib_sa_add_one(struct ib_devi return; err: - while (--i >= 0) { - ib_dereg_mr(sa_dev->port[i].mr); + while (--i >= 0) ib_unregister_mad_agent(sa_dev->port[i].agent); - } kfree(sa_dev); Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 1964) +++ include/ib_mad.h (working copy) @@ -70,9 +70,37 @@ #define IB_MGMT_MAX_METHODS 128 +/* RMPP information */ +#define IB_MGMT_RMPP_VERSION 1 + +#define IB_MGMT_RMPP_TYPE_DATA 1 +#define IB_MGMT_RMPP_TYPE_ACK 2 +#define IB_MGMT_RMPP_TYPE_STOP 3 +#define IB_MGMT_RMPP_TYPE_ABORT 4 + 
+#define IB_MGMT_RMPP_FLAG_ACTIVE 1 +#define IB_MGMT_RMPP_FLAG_FIRST (1<<1) +#define IB_MGMT_RMPP_FLAG_LAST (1<<2) + +#define IB_MGMT_RMPP_NO_RESPTIME 0x1F + +#define IB_MGMT_RMPP_STATUS_SUCCESS 0 +#define IB_MGMT_RMPP_STATUS_RESX 1 +#define IB_MGMT_RMPP_STATUS_T2L 118 +#define IB_MGMT_RMPP_STATUS_BAD_LEN 119 +#define IB_MGMT_RMPP_STATUS_BAD_SEG 120 +#define IB_MGMT_RMPP_STATUS_BADT 121 +#define IB_MGMT_RMPP_STATUS_W2S 122 +#define IB_MGMT_RMPP_STATUS_S2B 123 +#define IB_MGMT_RMPP_STATUS_BAD_STATUS 124 +#define IB_MGMT_RMPP_STATUS_UNV 125 +#define IB_MGMT_RMPP_STATUS_TMR 126 +#define IB_MGMT_RMPP_STATUS_UNSPEC 127 + #define IB_QP0 0 #define IB_QP1 __constant_htonl(1) #define IB_QP1_QKEY 0x80010000 +#define IB_QP_SET_QKEY 0x80000000 struct ib_grh { u32 version_tclass_flow; @@ -124,6 +152,45 @@ struct ib_vendor_mad { u8 data[216]; } __attribute__ ((packed)); +/** + * ib_get_rmpp_resptime - Returns the RMPP response time. + * @rmpp_hdr: An RMPP header. + */ +static inline u8 ib_get_rmpp_resptime(struct ib_rmpp_hdr *rmpp_hdr) +{ + return rmpp_hdr->rmpp_rtime_flags >> 3; +} + +/** + * ib_get_rmpp_flags - Returns the RMPP flags. + * @rmpp_hdr: An RMPP header. + */ +static inline u8 ib_get_rmpp_flags(struct ib_rmpp_hdr *rmpp_hdr) +{ + return rmpp_hdr->rmpp_rtime_flags & 0x7; +} + +/** + * ib_set_rmpp_resptime - Sets the response time in an RMPP header. + * @rmpp_hdr: An RMPP header. + * @rtime: The response time to set. + */ +static inline void ib_set_rmpp_resptime(struct ib_rmpp_hdr *rmpp_hdr, u8 rtime) +{ + rmpp_hdr->rmpp_rtime_flags = ib_get_rmpp_flags(rmpp_hdr) | (rtime << 3); +} + +/** + * ib_set_rmpp_flags - Sets the flags in an RMPP header. + * @rmpp_hdr: An RMPP header. + * @flags: The flags to set. 
+ */ +static inline void ib_set_rmpp_flags(struct ib_rmpp_hdr *rmpp_hdr, u8 flags) +{ + rmpp_hdr->rmpp_rtime_flags = (rmpp_hdr->rmpp_rtime_flags & 0xF1) | + (flags & 0x7); +} + struct ib_mad_agent; struct ib_mad_send_wc; struct ib_mad_recv_wc; @@ -168,6 +235,7 @@ typedef void (*ib_mad_recv_handler)(stru * ib_mad_agent - Used to track MAD registration with the access layer. * @device: Reference to device registration is on. * @qp: Reference to QP used for sending and receiving MADs. + * @mr: Memory region for system memory usable for DMA. * @recv_handler: Callback handler for a received MAD. * @send_handler: Callback handler for a sent MAD. * @snoop_handler: Callback handler for snooped sent MADs. @@ -176,16 +244,19 @@ typedef void (*ib_mad_recv_handler)(stru * Unsolicited MADs sent by this client will have the upper 32-bits * of their TID set to this value. * @port_num: Port number on which QP is registered + * @rmpp_version: If set, indicates the RMPP version used by this agent. */ struct ib_mad_agent { struct ib_device *device; struct ib_qp *qp; + struct ib_mr *mr; ib_mad_recv_handler recv_handler; ib_mad_send_handler send_handler; ib_mad_snoop_handler snoop_handler; void *context; u32 hi_tid; u8 port_num; + u8 rmpp_version; }; /** @@ -219,6 +290,7 @@ struct ib_mad_recv_buf { * ib_mad_recv_wc - received MAD information. * @wc: Completion information for the received data. * @recv_buf: Specifies the location of the received data buffer(s). + * @rmpp_list: Specifies a list of RMPP reassembled received MAD buffers. * @mad_len: The length of the received MAD, without duplicated headers. 
* * For received response, the wr_id field of the wc is set to the wr_id @@ -227,6 +299,7 @@ struct ib_mad_recv_buf { struct ib_mad_recv_wc { struct ib_wc *wc; struct ib_mad_recv_buf recv_buf; + struct list_head rmpp_list; int mad_len; }; From caitlinb at siliquent.com Mon Mar 14 16:33:19 2005 From: caitlinb at siliquent.com (Caitlin Bestler) Date: Mon, 14 Mar 2005 16:33:19 -0800 Subject: [openib-general] Getting rid of pinned memory requirement Message-ID: <8508251A6FC08A489844A94261D3693A039009@fiona.siliquent.com> > > While hardware designers may like this idea, I would like to > make the point that if you want the application to > *absolutely* control the availability of physical memory, you > shouldn't be writing userspace applications that run on Linux. > This is not just a hardware design issue. It is fundamental to why RDMA is able to optimize end-to-end traffic flow. The application is directly advertising the availability of buffers (through RKeys) to the other side. It is bad network engineering for the kernel to revoke that good-faith advertisement and count on the HCA/RNIC to say "oops" when the data does arrive but the targeted buffer is not in memory. But that does not mean that you cannot design mechanisms below the application to allow the kernel to re-organize physical memory -- it just means that the kernel had best not be playing overcommit tricks behind the application's back. To use a banking analogy, an advertised RKey is like a certified check. The application has sent this RKey to its peer, and it expects the HCA/RNIC to honor that check when RDMA Writes are made to that memory. But just as a bank does not have to guarantee in advance which specific bills will be used to cash a guaranteed check, there is nothing to say that the virtual-to-physical mappings are permanent and immutable. It would be possible to design an interface that allowed the kernel to: a) suspend the use of a memory region.
1) outputs referencing the suspended LKey would be temporarily held by the HCA/RNIC. 2) inputs referencing the suspended memory region would be delayed (RNR NAK, internal buffers, etc.) 3) possibly ask the peer to similarly suspend sending. This is trickier though. b) Update the virtual-to-physical mappings, or at least provide the RDMA layer with "physical page X replaced by physical page Y". c) unsuspend the memory region. The key is that the entire operation either has to be fast enough so that no connection or application session layer time-outs occur, or an end-to-end agreement to suspend the connection is a requirement. The first option seems more plausible to me; the second essentially requires extending the CM protocol. That's a tall order even for InfiniBand, and it's even worse for iWARP where the CM functionality typically ends when the connection is established. > There's always going to be a limit on how much memory you can > mlock. And right now the only option the kernel has for > unlocking that memory is to kill the application. I think > there's got to be a reasonable way to deal with this that > doesn't make the application responsible for everything in > the world. We don't want to have to rewrite every RDMA > application to be able to support memory hotplug. This is an > obvious layer that can and should be abstracted by the kernel. > Yes, there are limits on how much memory you can mlock, or even allocate. Applications are required to register memory precisely because the required guarantees are not there by default. Eliminating those guarantees *is* effectively rewriting every RDMA application without even letting them know.
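[Editorial sketch, not part of the original message.] The suspend / update-mappings / unsuspend sequence proposed in the message above can be illustrated with a small toy model in C. This is only a sketch under stated assumptions: none of these names (sketch_mr, sketch_mr_suspend, sketch_mr_replace_page, sketch_mr_resume, sketch_mr_translate) exist in OpenIB or in any driver; they are hypothetical and model the state machine in plain memory, with -EAGAIN standing in for "delay the access and retry later" (the role RNR NAK or internal buffering would play on real hardware).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model: number of pages in one memory region. */
#define SKETCH_MR_PAGES 4

enum sketch_mr_state { SKETCH_MR_ACTIVE, SKETCH_MR_SUSPENDED };

/* Hypothetical stand-in for a registered memory region. */
struct sketch_mr {
	enum sketch_mr_state state;
	uint64_t phys_page[SKETCH_MR_PAGES]; /* virtual page index -> physical page */
};

/* Step a): suspend use of the region; accesses are delayed from now on. */
static void sketch_mr_suspend(struct sketch_mr *mr)
{
	assert(mr != NULL);
	mr->state = SKETCH_MR_SUSPENDED;
}

/*
 * Step b): "physical page X replaced by physical page Y".
 * Only allowed while the region is suspended. Returns the number of
 * mapping entries updated, or -EBUSY if the region is still active.
 */
static int sketch_mr_replace_page(struct sketch_mr *mr,
				  uint64_t old_page, uint64_t new_page)
{
	size_t i;
	int updated = 0;

	assert(mr != NULL);
	if (mr->state != SKETCH_MR_SUSPENDED)
		return -EBUSY;

	for (i = 0; i < SKETCH_MR_PAGES; i++) {
		if (mr->phys_page[i] == old_page) {
			mr->phys_page[i] = new_page;
			updated++;
		}
	}
	return updated;
}

/* Step c): unsuspend the region; accesses proceed with the new mappings. */
static void sketch_mr_resume(struct sketch_mr *mr)
{
	assert(mr != NULL);
	mr->state = SKETCH_MR_ACTIVE;
}

/*
 * Models an access that references the region: returns 0 and the physical
 * page on success, -EAGAIN while suspended (caller must retry later),
 * -EINVAL for an out-of-range page index.
 */
static int sketch_mr_translate(const struct sketch_mr *mr, size_t vpage,
			       uint64_t *phys)
{
	assert(mr != NULL && phys != NULL);
	if (vpage >= SKETCH_MR_PAGES)
		return -EINVAL;
	if (mr->state == SKETCH_MR_SUSPENDED)
		return -EAGAIN;
	*phys = mr->phys_page[vpage];
	return 0;
}
```

In this model the remote peer never sees the region disappear: an access arriving during the window gets a retryable result rather than a fatal one, and the RKey stays valid across the remap. What the sketch does not capture is the point the message stresses, namely that the whole window must be shorter than any connection or session-layer time-out.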
From hozer at hozed.org Mon Mar 14 17:06:19 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Mon, 14 Mar 2005 19:06:19 -0600 Subject: [openib-general] Getting rid of pinned memory requirement In-Reply-To: <8508251A6FC08A489844A94261D3693A039009@fiona.siliquent.com> References: <8508251A6FC08A489844A94261D3693A039009@fiona.siliquent.com> Message-ID: <20050315010619.GT9768@kalmia.hozed.org> On Mon, Mar 14, 2005 at 04:33:19PM -0800, Caitlin Bestler wrote: > > > > While hardware designers may like this idea, I would like to > > make the point that if you want the application to > > *absolutely* control the availability of physical memory, you > > shouldn't be writing userspace applications that run on Linux. > > > > This is not just a hardware design issue. It is fundamental to > why RDMA is able to optimize end-to-end traffic flow. The application > is directly advertising the availability of buffers (through RKeys) > to the other side. It is bad network engineering for the kernel > to revoke that good-faith advertisement and count on the HCA/RNIC > to say "oops" when the data does arrive but the targeted buffer > is not in memory. > > But that does not mean that you cannot design mechanisms below > the application to allow the kernel to re-organize physical > memory -- it just means that the kernel had best not be playing > overcommit tricks behind the application's back. > > To use a banking analogy, an advertised RKey is like a certified > check. The application has sent this RKey to its peer, and it > expects the HCA/RNIC to honor that check when RDMA Writes are > made to that memory. But just as a bank does not have to > guarantee in advance which specific bills will be used to > cash a guaranteed check, there is nothing to say that the > virtual-to-physical mappings are permanent and immutable. > > It would be possible to design an interface that allowed > the kernel to: > > a) suspend the use of a memory region.
> 1) outputs referencing the suspended LKey would be > temporarily held by the HCA/RNIC. > 2) inputs referencing the suspended memory region > would be delayed (RNR NAK, internal buffers, > etc.) > 3) possibly ask the peer to similarly suspend > sending. This is trickier though. > b) Update the virtual-to-physical mappings, or at least > provide the RDMA layer with "physical page X replaced > by physical page Y". > c) unsuspend the memory region. > > The key is that the entire operation either has to be fast > enough so that no connection or application session layer > time-outs occur, or an end-to-end agreement to suspend the > connection is a requirement. The first option seems more > plausible to me; the second essentially requires extending > the CM protocol. That's a tall order even for InfiniBand, > and it's even worse for iWARP where the CM functionality > typically ends when the connection is established. I'll buy the good network design argument. I suppose if the kernel wants to revoke a card's pinned memory, we should be able to guarantee that it gets new pinned memory within a bounded time. What sort of timing do we need? Milliseconds? Microseconds? In the case of iWARP, isn't this just TCP underneath? If so, can't we just drop any packets in the pipe on the floor and let them get retransmitted? (I suppose the same argument goes for InfiniBand: what sort of a time window do we have for retransmission?) What are the limits on end-to-end flow control in IB and iWARP? > > > > > There's always going to be a limit on how much memory you can > > mlock. And right now the only option the kernel has for > > unlocking that memory is to kill the application. I think > > there's got to be a reasonable way to deal with this that > > doesn't make the application responsible for everything in > > the world. We don't want to have to rewrite every RDMA > > application to be able to support memory hotplug.
This is an > > obvious layer that can and should be abstracted by the kernel. > > > > Yes, there are limits on how much memory you can mlock, or > even allocate. Applications are required to register memory > precisely because the required guarantees are not there by > default. Eliminating those guarantees *is* effectively > rewriting every RDMA application without even letting > them know. Some of this argument is a policy issue, which I would argue shouldn't be hard-coded in the code or in the network hardware. At least in my view, the guarantees are only there to make applications go fast. We are getting low latency and high performance with InfiniBand by making memory registration go really, really slow. If, to make big HPC simulation applications work, we wind up doing memcpy() to put the data into a registered buffer because we can't register half of physical memory, the application isn't going very fast. From caitlinb at siliquent.com Mon Mar 14 17:35:31 2005 From: caitlinb at siliquent.com (Caitlin Bestler) Date: Mon, 14 Mar 2005 17:35:31 -0800 Subject: [openib-general] Getting rid of pinned memory requirement Message-ID: <8508251A6FC08A489844A94261D3693A03900B@fiona.siliquent.com> > -----Original Message----- > From: Troy Benjegerdes [mailto:hozer at hozed.org] > Sent: Monday, March 14, 2005 5:06 PM > To: Caitlin Bestler > Cc: openib-general at openib.org > Subject: Re: [openib-general] Getting rid of pinned memory requirement > > > > > The key is that the entire operation either has to be fast > > enough so that no connection or application session layer > > time-outs occur, or an end-to-end agreement to suspend the > > connection is a requirement. The first option seems more > > plausible to me; the second essentially > > requires extending the CM protocol. That's a tall order even for > > InfiniBand, and it's even worse for iWARP where the CM > > functionality typically ends when the connection is established.
> > I'll buy the good network design argument. > > I suppose if the kernel wants to revoke a card's pinned > memory, we should be able to guarantee that it gets new > pinned memory within a bounded time. What sort of timing do > we need? Milliseconds? > Microseconds? > > In the case of iWARP, isn't this just TCP underneath? If so, > can't we just drop any packets in the pipe on the floor and > let them get retransmitted? (I suppose the same argument goes > for InfiniBand: > what sort of a time window do we have for retransmission?) > > What are the limits on end-to-end flow control in IB and iWARP? > From the RDMA Provider's perspective, the short answer is "quick enough so that I don't have to do anything heroic to keep the connection alive." With TCP you also have to add "and healthy". If you've ever had a long download that got effectively stalled by a burst of noise and you just hit the 'reload' button on your browser then you know what I'm talking about. But in transport-neutral terms I would think that one RTT is definitely safe -- that much data could have been dropped by one switch failure or one nasty spike in inbound noise. > > > > Yes, there are limits on how much memory you can mlock, or even > > allocate. Applications are required to register memory precisely > > because the required guarantees are not there by default. > Eliminating > > those guarantees *is* effectively rewriting every RDMA application > > without even letting them know. > > Some of this argument is a policy issue, which I would argue > shouldn't be hard-coded in the code or in the network hardware. > > At least in my view, the guarantees are only there to make > applications go fast. We are getting low latency and high > performance with InfiniBand by making memory registration go > really, really slow.
If, to make big HPC simulation > applications work, we wind up doing memcpy() to put the data > into a registered buffer because we can't register half of > physical memory, the application isn't going very fast. > What you are looking for is a distinction between registering memory to *enable* the RNIC to optimize local access and registering memory to enable its being advertised to the remote end. Early implementations of RDMA, both IB and iWARP, have not distinguished between the two. But theoretically *applications* do not need memory regions that are not enabled for remote access to be pinned. That is an RNIC requirement that could evolve. But applications themselves *do* need remotely accessible memory regions, portions of which they intend to advertise with RKeys, to be truly available (i.e., pinned). You are also making a policy assumption that an application that actually needs half of physical memory should be using paged memory. Memory is cheap, and if performance is critical why should this memory be swapped out to disk? Is the limitation on not being able to register half of physical memory based upon some assumption that swapping is a requirement? Or is it a limitation in the memory region size? If it's the latter, you need to get the OS to support larger page sizes. From abhijitngpune at indiatimes.com Mon Mar 14 21:32:34 2005 From: abhijitngpune at indiatimes.com (abhijitngpune) Date: Tue, 15 Mar 2005 11:02:34 +0530 Subject: [openib-general] openSM Message-ID: <200503150518.KAA03171@WS0005.indiatimes.com> Hi, Does openSM support non-fat-tree (irregular, such as graph) topologies? Abhijeet From mst at mellanox.co.il Mon Mar 14 21:55:42 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 15 Mar 2005 07:55:42 +0200 Subject: [openib-general] [PATCH] uverbs rdma example (updated) In-Reply-To: <52u0nexaot.fsf@topspin.com> References: <20050309122700.GA2352@mellanox.co.il> <20050310123129.GA12542@mellanox.co.il> <52hdjeys89.fsf@topspin.com> <20050314191011.GD17668@mellanox.co.il> <52u0nexaot.fsf@topspin.com> Message-ID: <20050315055542.GA18928@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] [PATCH] uverbs rdma example (updated) > > Michael> Hmm. rx_depth is unused. tx_depth is used. > > Where? If I search through the whole patch for "tx_depth," the only > place I see it do anything at all is in > > + ctx->cq = ibv_create_cq(ctx->context, rx_depth + tx_depth, NULL); It also sets the QP depth; not sure why you don't see it. Attached please find my latest version of the test. > but I don't see how more than one send can be outstanding. Only one may be outstanding at a time, but the tx_depth option makes it possible to study the effect of QP/CQ size on the latency. mst -- MST - Michael S. Tsirkin -------------- next part -------------- /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Mellanox Technologies Ltd. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer.
* * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * * $Id$ */ #if HAVE_CONFIG_H # include #endif /* HAVE_CONFIG_H */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include enum { PINGPONG_RDMA_WRID = 3, }; static int page_size; struct pingpong_context { struct ibv_context *context; struct ibv_pd *pd; struct ibv_mr *mr; struct ibv_cq *cq; struct ibv_qp *qp; void *buf; volatile char *post_buf; volatile char *poll_buf; int size; int rx_depth; int tx_depth; struct ibv_sge list; struct ibv_send_wr wr; }; struct pingpong_dest { int lid; int qpn; int psn; unsigned rkey; unsigned long long vaddr; }; /* * pp_get_local_lid() uses a pretty bogus method for finding the LID * of a local port. Please don't copy this into your app (or if you * do, please rip it out soon). 
*/ static uint16_t pp_get_local_lid(struct ibv_device *dev, int port) { char path[256]; char val[16]; char *name; if (sysfs_get_mnt_path(path, sizeof path)) { fprintf(stderr, "Couldn't find sysfs mount.\n"); return 0; } asprintf(&name, "%s/class/infiniband/%s/ports/%d/lid", path, ibv_get_device_name(dev), port); if (sysfs_read_attribute_value(name, val, sizeof val)) { fprintf(stderr, "Couldn't read LID at %s\n", name); return 0; } return strtol(val, NULL, 0); } static int pp_client_connect(const char *servername, int port) { struct addrinfo *res, *t; struct addrinfo hints = { .ai_family = AF_UNSPEC, .ai_socktype = SOCK_STREAM }; char *service; int n; int sockfd = -1; asprintf(&service, "%d", port); n = getaddrinfo(servername, service, &hints, &res); if (n < 0) { fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), servername, port); return n; } for (t = res; t; t = t->ai_next) { sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd >= 0) { if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) break; close(sockfd); sockfd = -1; } } freeaddrinfo(res); if (sockfd < 0) { fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port); return sockfd; } return sockfd; } struct pingpong_dest * pp_client_exch_dest(int sockfd, const struct pingpong_dest *my_dest) { struct pingpong_dest *rem_dest = NULL; char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; int parsed; sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, my_dest->psn,my_dest->rkey,my_dest->vaddr); if (write(sockfd, msg, sizeof msg) != sizeof msg) { perror("client write"); fprintf(stderr, "Couldn't send local address\n"); goto out; } if (read(sockfd, msg, sizeof msg) != sizeof msg) { perror("client read"); fprintf(stderr, "Couldn't read remote address\n"); goto out; } rem_dest = malloc(sizeof *rem_dest); if (!rem_dest) goto out; parsed = sscanf(msg, "%x:%x:%x:%x:%Lx", &rem_dest->lid, &rem_dest->qpn, &rem_dest->psn,&rem_dest->rkey,&rem_dest->vaddr); if (parsed != 5) { 
fprintf(stderr, "Couldn't parse line <%.*s>\n",(int)sizeof msg, msg); free(rem_dest); rem_dest = NULL; goto out; } out: return rem_dest; } int pp_server_connect(int port) { struct addrinfo *res, *t; struct addrinfo hints = { .ai_flags = AI_PASSIVE, .ai_family = AF_UNSPEC, .ai_socktype = SOCK_STREAM }; char *service; int sockfd = -1, connfd; int n; asprintf(&service, "%d", port); n = getaddrinfo(NULL, service, &hints, &res); if (n < 0) { fprintf(stderr, "%s for port %d\n", gai_strerror(n), port); return n; } for (t = res; t; t = t->ai_next) { sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); if (sockfd >= 0) { n = 1; setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &n, sizeof n); if (!bind(sockfd, t->ai_addr, t->ai_addrlen)) break; close(sockfd); sockfd = -1; } } freeaddrinfo(res); if (sockfd < 0) { fprintf(stderr, "Couldn't listen to port %d\n", port); return sockfd; } listen(sockfd, 1); connfd = accept(sockfd, NULL, 0); if (connfd < 0) { perror("server accept"); fprintf(stderr, "accept() failed\n"); close(sockfd); return connfd; } close(sockfd); return connfd; } static struct pingpong_dest *pp_server_exch_dest(int connfd, const struct pingpong_dest *my_dest) { char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; struct pingpong_dest *rem_dest = NULL; int parsed; int n; n = read(connfd, msg, sizeof msg); if (n != sizeof msg) { perror("server read"); fprintf(stderr, "%d/%d: Couldn't read remote address\n", n, (int) sizeof msg); goto out; } rem_dest = malloc(sizeof *rem_dest); if (!rem_dest) goto out; parsed = sscanf(msg, "%x:%x:%x:%x:%Lx", &rem_dest->lid, &rem_dest->qpn, &rem_dest->psn, &rem_dest->rkey, &rem_dest->vaddr); if (parsed != 5) { fprintf(stderr, "Couldn't parse line <%.*s>\n",(int)sizeof msg, msg); free(rem_dest); rem_dest = NULL; goto out; } sprintf(msg, "%04x:%06x:%06x:%08x:%016Lx", my_dest->lid, my_dest->qpn, my_dest->psn, my_dest->rkey, my_dest->vaddr); if (write(connfd, msg, sizeof msg) != sizeof msg) { perror("server write"); 
fprintf(stderr, "Couldn't send local address\n"); free(rem_dest); rem_dest = NULL; goto out; } out: return rem_dest; } static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, int tx_depth, int rx_depth, int port) { struct pingpong_context *ctx; ctx = malloc(sizeof *ctx); if (!ctx) return NULL; ctx->size = size; ctx->rx_depth = rx_depth; ctx->tx_depth = tx_depth; ctx->buf = memalign(page_size, size * 2); if (!ctx->buf) { fprintf(stderr, "Couldn't allocate work buf.\n"); return NULL; } memset(ctx->buf, 0, size * 2); ctx->post_buf = (char*)ctx->buf + (size - 1); ctx->poll_buf = (char*)ctx->buf + (2 * size - 1); ctx->context = ibv_open_device(ib_dev); if (!ctx->context) { fprintf(stderr, "Couldn't get context for %s\n", ibv_get_device_name(ib_dev)); return NULL; } ctx->pd = ibv_alloc_pd(ctx->context); if (!ctx->pd) { fprintf(stderr, "Couldn't allocate PD\n"); return NULL; } ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size * 2, IBV_ACCESS_REMOTE_WRITE); if (!ctx->mr) { fprintf(stderr, "Couldn't allocate MR\n"); return NULL; } ctx->cq = ibv_create_cq(ctx->context, rx_depth + tx_depth, NULL); if (!ctx->cq) { fprintf(stderr, "Couldn't create CQ\n"); return NULL; } { struct ibv_qp_init_attr attr = { .send_cq = ctx->cq, .recv_cq = ctx->cq, .cap = { .max_send_wr = tx_depth, .max_recv_wr = rx_depth, .max_send_sge = 1, .max_recv_sge = 1 }, .qp_type = IBV_QPT_RC }; ctx->qp = ibv_create_qp(ctx->pd, &attr); if (!ctx->qp) { fprintf(stderr, "Couldn't create QP\n"); return NULL; } } { struct ibv_qp_attr attr; attr.qp_state = IBV_QPS_INIT; attr.pkey_index = 0; attr.port_num = port; attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS)) { fprintf(stderr, "Failed to modify QP to INIT\n"); return NULL; } } ctx->wr.wr_id = PINGPONG_RDMA_WRID; ctx->wr.sg_list = &ctx->list; ctx->wr.num_sge = 1; ctx->wr.opcode = IBV_WR_RDMA_WRITE; ctx->wr.send_flags = IBV_SEND_SIGNALED; 
return ctx; } static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, struct pingpong_dest *dest) { struct ibv_qp_attr attr; attr.qp_state = IBV_QPS_RTR; attr.path_mtu = IBV_MTU_1024; attr.dest_qp_num = dest->qpn; attr.rq_psn = dest->psn; attr.max_dest_rd_atomic = 1; attr.min_rnr_timer = 12; attr.ah_attr.is_global = 0; attr.ah_attr.dlid = dest->lid; attr.ah_attr.sl = 0; attr.ah_attr.src_path_bits = 0; attr.ah_attr.port_num = port; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER)) { fprintf(stderr, "Failed to modify QP to RTR\n"); return 1; } attr.qp_state = IBV_QPS_RTS; attr.timeout = 14; attr.retry_cnt = 7; attr.rnr_retry = 7; attr.sq_psn = my_psn; attr.max_rd_atomic = 1; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC)) { fprintf(stderr, "Failed to modify QP to RTS\n"); return 1; } return 0; } static void usage(const char *argv0) { printf("Usage:\n"); printf(" %s start a server and wait for connection\n", argv0); printf(" %s connect to server at \n", argv0); printf("\n"); printf("Options:\n"); printf(" -p, --port= listen on/connect to port (default 18515)\n"); printf(" -d, --ib-dev= use IB device (default first device found)\n"); printf(" -i, --ib-port= use port of IB device (default 1)\n"); printf(" -s, --size= size of message to exchange (default 4096)\n"); printf(" -t, --tx-depth= size of tx queue (default 50)\n"); printf(" -n, --iters= number of exchanges (default 1000)\n"); } int main(int argc, char *argv[]) { struct dlist *dev_list; struct ibv_device *ib_dev; struct pingpong_context *ctx; struct pingpong_dest my_dest; struct pingpong_dest *rem_dest; struct timeval start, end; char *ib_devname = NULL; char *servername = NULL; int port = 18515; int ib_port = 1; int size = 1; int rx_depth = 1; int tx_depth = 50; int iters = 1000; 
	int scnt, rcnt, ccnt;
	int client_first_post;
	int sockfd;
	struct ibv_qp *qp;
	struct ibv_send_wr *wr;
	volatile char *poll_buf;
	volatile char *post_buf;

	srand48(getpid() * time(NULL));

	while (1) {
		int c;

		static struct option long_options[] = {
			{ .name = "port", .has_arg = 1, .val = 'p' },
			{ .name = "ib-dev", .has_arg = 1, .val = 'd' },
			{ .name = "ib-port", .has_arg = 1, .val = 'i' },
			{ .name = "size", .has_arg = 1, .val = 's' },
			{ .name = "iters", .has_arg = 1, .val = 'n' },
			{ .name = "tx-depth",.has_arg = 1, .val = 't' },
			{ 0 }
		};

		c = getopt_long(argc, argv, "p:d:i:s:t:n:e", long_options, NULL);
		if (c == -1)
			break;

		switch (c) {
		case 'p':
			port = strtol(optarg, NULL, 0);
			if (port < 0 || port > 65535) {
				usage(argv[0]);
				return 1;
			}
			break;

		case 'd':
			ib_devname = strdupa(optarg);
			break;

		case 'i':
			ib_port = strtol(optarg, NULL, 0);
			/* Validate the IB port that was just parsed, not the TCP port. */
			if (ib_port < 0) {
				usage(argv[0]);
				return 1;
			}
			break;

		case 's':
			size = strtol(optarg, NULL, 0);
			break;

		case 't':
			tx_depth = strtol(optarg, NULL, 0);
			break;

		case 'n':
			iters = strtol(optarg, NULL, 0);
			break;

		default:
			usage(argv[0]);
			return 1;
		}
	}

	if (optind == argc - 1)
		servername = strdupa(argv[optind]);
	else if (optind < argc) {
		usage(argv[0]);
		return 1;
	}

	page_size = sysconf(_SC_PAGESIZE);

	dev_list = ibv_get_devices();

	dlist_start(dev_list);
	if (!ib_devname) {
		ib_dev = dlist_next(dev_list);
		if (!ib_dev) {
			fprintf(stderr, "No IB devices found\n");
			return 1;
		}
	} else {
		dlist_for_each_data(dev_list, ib_dev, struct ibv_device)
			if (!strcmp(ibv_get_device_name(ib_dev), ib_devname))
				break;
		if (!ib_dev) {
			fprintf(stderr, "IB device %s not found\n", ib_devname);
			return 1;
		}
	}

	/* tx_depth (-t) sizes the send queue and, together with rx_depth, the CQ. */
	ctx = pp_init_ctx(ib_dev, size, tx_depth, rx_depth, ib_port);
	if (!ctx)
		return 1;

	my_dest.lid = pp_get_local_lid(ib_dev, ib_port);
	my_dest.qpn = ctx->qp->qp_num;
	my_dest.psn = lrand48() & 0xffffff;
	if (!my_dest.lid) {
		fprintf(stderr, "Local lid 0x0 detected.
Is an SM running?\n"); return 1; } my_dest.rkey = ctx->mr->rkey; my_dest.vaddr = (uintptr_t)ctx->buf + ctx->size; printf(" local address: LID %#04x, QPN %#06x, PSN %#06x " "RKey %#08x VAddr %#016Lx\n", my_dest.lid, my_dest.qpn, my_dest.psn, my_dest.rkey, my_dest.vaddr); if (servername) { sockfd = pp_client_connect(servername, port); } else { sockfd = pp_server_connect(port); } if (sockfd < 0) return 1; if (servername) { rem_dest = pp_client_exch_dest(sockfd, &my_dest); } else { rem_dest = pp_server_exch_dest(sockfd, &my_dest); } if (!rem_dest) return 1; printf(" remote address: LID %#04x, QPN %#06x, PSN %#06x, " "RKey %#08x VAddr %#016Lx\n", rem_dest->lid, rem_dest->qpn, rem_dest->psn, rem_dest->rkey, rem_dest->vaddr); if (pp_connect_ctx(ctx, ib_port, my_dest.psn, rem_dest)) return 1; /* An additional handshake is required *after* moving qp to RTR. Arbitrarily reuse exch_dest for this purpose. */ if (servername) { rem_dest = pp_client_exch_dest(sockfd, &my_dest); } else { rem_dest = pp_server_exch_dest(sockfd, &my_dest); } write(sockfd, "done", sizeof "done"); close(sockfd); wr = &ctx->wr; ctx->list.addr = (uintptr_t) ctx->buf; ctx->list.length = ctx->size; ctx->list.lkey = ctx->mr->lkey; wr->wr.rdma.remote_addr = rem_dest->vaddr; wr->wr.rdma.rkey = rem_dest->rkey; scnt = 0; rcnt = 0; ccnt = 0; if (servername) client_first_post = 1; else client_first_post = 0; poll_buf = ctx->poll_buf; post_buf = ctx->post_buf; qp = ctx->qp; if (gettimeofday(&start, NULL)) { perror("gettimeofday"); return 1; } while (scnt < iters || ccnt < iters || rcnt < iters) { /* Wait till buffer changes. */ if (rcnt < iters && ! client_first_post) { ++rcnt; while (*poll_buf != (char)rcnt) { } /* Here the data is already in the physical memory. If we wanted to actually use it, we may need a read memory barrier here. 
			*/
		} else
			client_first_post = 0;

		if (scnt < iters) {
			struct ibv_send_wr *bad_wr;

			/* Stamp the last byte of the buffer so the peer can detect arrival. */
			*post_buf = (char)++scnt;
			if (ibv_post_send(qp, wr, &bad_wr)) {
				fprintf(stderr, "Couldn't post send: scnt=%d\n", scnt);
				return 1;
			}
		}

		if (ccnt < iters) {
			struct ibv_wc wc;
			int ne;

			++ccnt;
			/* Busy-poll the CQ until the send completion shows up. */
			do {
				ne = ibv_poll_cq(ctx->cq, 1, &wc);
			} while (ne == 0);

			if (ne < 0) {
				fprintf(stderr, "poll CQ failed %d\n", ne);
				return 1;
			}
			if (wc.status != IBV_WC_SUCCESS) {
				fprintf(stderr, "Completion with error at %s:\n",
					servername?"client":"server");
				fprintf(stderr, "Failed status %d: wr_id %d\n",
					wc.status, (int) wc.wr_id);
				fprintf(stderr, "scnt=%d, rcnt=%d, ccnt=%d\n",
					scnt, rcnt, ccnt);
				return 1;
			}
		}
	}

	if (gettimeofday(&end, NULL)) {
		perror("gettimeofday");
		return 1;
	}

	{
		float usec = (end.tv_sec - start.tv_sec) * 1000000 +
			(end.tv_usec - start.tv_usec);
		long long bytes = (long long) size * iters;

		printf("%lld bytes in %.2f seconds = %.2f Mbit/sec\n",
		       bytes, usec / 1000000., bytes * 8. / usec);
		printf("%d iters in %.2f seconds = %.2f usec/iter\n",
		       iters, usec / 1000000., usec / iters);
	}

	return 0;
}

From mst at mellanox.co.il Mon Mar 14 22:23:52 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 Mar 2005 08:23:52 +0200
Subject: [openib-general] uverbs security
Message-ID: <20050315062352.GA19233@mellanox.co.il>

Hi, Roland!
Looking at the uverbs kernel module, I notice that in some instances
it passes some parameters from userspace directly to ib core, without
verifying their sanity.

One example of this is qp attributes in create and modify qp.

For example, modify qp and alloc qp will simply copy the attributes.
This might create issues since the core may assume it works against a
trusted kernel client, so it may get confused if passed illegal
parameter values.

For example, qp type could be IB_QPT_SMI or IB_QPT_GSI. Will this create
a problem? Hard for me to tell ...

I think the best approach is to validate *all* user-given parameters
before passing them on to core. What do you think?
--
MST - Michael S. Tsirkin

From shaharf at voltaire.com Tue Mar 15 00:51:26 2005
From: shaharf at voltaire.com (shaharf)
Date: Tue, 15 Mar 2005 10:51:26 +0200
Subject: [openib-general] openSM
Message-ID:

Hi,
Does OpenSM support non-fat-tree (irregular, such as graph) topologies?
Abhijeet
________________________________
[shaharf] Yes. OpenSM supports any type of mesh (connected graph).

From halr at voltaire.com Tue Mar 15 07:17:17 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Mar 2005 10:17:17 -0500
Subject: [openib-general] user_mad.c and 2.6.11
Message-ID: <1110899837.4662.578.camel@localhost.localdomain>

Hi Roland,

Just ran across this reminder:
Should user_mad.c be updated for the following:
/* XXX remove once 2.6.11 is released */

Thanks.

-- Hal

From hozer at hozed.org Tue Mar 15 07:38:00 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Tue, 15 Mar 2005 09:38:00 -0600
Subject: [openib-general] uverbs security
In-Reply-To: <20050315062352.GA19233@mellanox.co.il>
References: <20050315062352.GA19233@mellanox.co.il>
Message-ID: <20050315153759.GU9768@kalmia.hozed.org>

On Tue, Mar 15, 2005 at 08:23:52AM +0200, Michael S. Tsirkin wrote:
> Hi, Roland!
> Looking at uverbs kernel module, I notice that in some instances
> it passes some parameters from userspace directly to ib core, without
> verifying their sanity.
>
> One example of this is qp attributes in create and modify qp.
>
> For example, modify qp and alloc qp will simply copy the attributes.
> This might create issues since the core may assume it works against a
> trusted kernel client, so it may get confused if passed illegal
> parameter values.
>
> For example, qp type could be IB_QPT_SMI or IB_QPT_GSI. Will this create
> a problem? Hard for me to tell ...
>
> I think the best approach is to validate *all* user-given parameters
> before passing them on to core. What do you think?

Yes.
We should be validating all user parameters, and be thinking about
malicious userspace apps.

This is another reason I think we ought to have the Linux MM support a
'VM_REGISTERED' flag, so that things like SELinux can have different
security policies for registered memory vs. non-registered memory.

I think we should probably also have (possibly compile-time) options for
IB core to sanity check everything, regardless of whether it came from
userspace or kernelspace. (Kind of like CONFIG_DEBUG_KERNEL and the like.)

From mst at mellanox.co.il Tue Mar 15 08:18:57 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 Mar 2005 18:18:57 +0200
Subject: [openib-general] mstflint update
Message-ID: <20050315161857.GD16749@mellanox.co.il>

I have updated mstflint in the openib repository.
Revision 1990 fixes a crash and cleans up progress reporting
in the flash error recovery process.

Tested on x86/ia64/i686.

--
MST - Michael S. Tsirkin

From mst at mellanox.co.il Tue Mar 15 08:27:06 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 15 Mar 2005 18:27:06 +0200
Subject: [openib-general] [PATCH] set lkey in mthca mpt entry
Message-ID: <20050315162706.GG16749@mellanox.co.il>

lkey does not seem to be set in the mpt entry.
Does this look right?

Signed-off-by: Michael S.
Tsirkin

Index: hw/mthca/mthca_mr.c
===================================================================
--- hw/mthca/mthca_mr.c	(revision 1983)
+++ hw/mthca/mthca_mr.c	(working copy)
@@ -206,9 +206,9 @@ int mthca_mr_alloc_notrans(struct mthca_
 	mpt_entry->pd = cpu_to_be32(pd);
 	mpt_entry->start = 0;
 	mpt_entry->length = ~0ULL;
-
-	memset(&mpt_entry->lkey, 0,
-	       sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey));
+	mpt_entry->lkey = cpu_to_be32(mr->ibmr.lkey);
+	memset(&mpt_entry->window_count, 0,
+	       sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, window_count));
 
 	err = mthca_SW2HW_MPT(dev, mpt_entry,
 			      key & (dev->limits.num_mpts - 1),
@@ -327,8 +327,9 @@ int mthca_mr_alloc_phys(struct mthca_dev
 	mpt_entry->pd = cpu_to_be32(pd);
 	mpt_entry->start = cpu_to_be64(iova);
 	mpt_entry->length = cpu_to_be64(total_size);
-	memset(&mpt_entry->lkey, 0,
-	       sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey));
+	mpt_entry->lkey = cpu_to_be32(mr->ibmr.lkey);
+	memset(&mpt_entry->window_count, 0,
+	       sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, window_count));
 
 	mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base +
 					 mr->first_seg * dev->limits.mtt_seg_size);

--
MST - Michael S. Tsirkin

From roland at topspin.com Tue Mar 15 08:41:58 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 15 Mar 2005 08:41:58 -0800
Subject: [openib-general] [PATCH] set lkey in mthca mpt entry
In-Reply-To: <20050315162706.GG16749@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 15 Mar 2005 18:27:06 +0200")
References: <20050315162706.GG16749@mellanox.co.il>
Message-ID: <52hdjczwrt.fsf@topspin.com>

    Michael> lkey does not seem to be set in the mpt entry. does this
    Michael> look right?

You would know better but my docs say that the lkey field should be
set to 0 for SW2HW_MPT and is only used to refer to the original
region for memory windows.

 - R.
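
The idiom the patch above relies on, zeroing everything from a given
field to the end of a structure with memset() and offsetof(), can be
sketched in isolation. This is only a minimal, hypothetical illustration:
struct demo_mpt_entry and both helper names are invented here and do not
reproduce the real struct mthca_mpt_entry layout. It takes no position on
whether lkey should be set for SW2HW_MPT, which is the open question in
the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an MPT entry; the field order is an assumption. */
struct demo_mpt_entry {
	uint32_t flags;
	uint32_t pd;
	uint64_t start;
	uint64_t length;
	uint32_t lkey;
	uint32_t window_count;
	uint32_t reserved[2];
};

/* Zero lkey and every field after it (the behaviour before the patch). */
static void demo_clear_from_lkey(struct demo_mpt_entry *e)
{
	memset(&e->lkey, 0,
	       sizeof *e - offsetof(struct demo_mpt_entry, lkey));
}

/* Store lkey, then zero window_count and every field after it
 * (the behaviour the patch proposes). */
static void demo_set_lkey_clear_rest(struct demo_mpt_entry *e, uint32_t lkey)
{
	e->lkey = lkey;
	memset(&e->window_count, 0,
	       sizeof *e - offsetof(struct demo_mpt_entry, window_count));
}
```

In both helpers the fields that precede the chosen starting field are
left untouched; only the tail of the structure is cleared, and the length
passed to memset() is the structure size minus the offset of that field.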

From roland at topspin.com Tue Mar 15 08:42:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 15 Mar 2005 08:42:22 -0800
Subject: [openib-general] Re: user_mad.c and 2.6.11
In-Reply-To: <1110899837.4662.578.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Mar 2005 10:17:17 -0500")
References: <1110899837.4662.578.camel@localhost.localdomain>
Message-ID: <52d5u0zwr5.fsf@topspin.com>

    Hal> Hi Roland, Just ran across this reminder:

    Hal> Should user_mad.c be updated for the following: /* XXX remove
    Hal> once 2.6.11 is released */

Yep, I'd apply that patch for sure.

 - R.

From tduffy at sun.com Tue Mar 15 09:16:24 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 15 Mar 2005 09:16:24 -0800
Subject: [openib-general] kernel 2.6.11 and userland packages?
In-Reply-To: <20050312011527.GC9768@kalmia.hozed.org>
References: <20050312011527.GC9768@kalmia.hozed.org>
Message-ID: <1110906984.28053.19.camel@duffman>

On Fri, 2005-03-11 at 19:15 -0600, Troy Benjegerdes wrote:
> I have in my office a shiny new kernel.org 2.6.11 64 bit kernel running
> on my Mac G5, with the drivers/infiniband modules loaded.
>
> What do I need to do to verify this all works?

Do you have the IB card plugged into an IB switch? Is that switch
running an SM? Do you have another machine connected to your G5?

You can see if the card is initializing on your machine by running
ibstatus. Or check out the /sys/class/infiniband/ directory manually.

Check the FAQ.

> Also, I'd really like to make debian packages of the userland utilities
> and libraries, and get a debian/ subdirectory into the subversion
> release, so the packages can be rebuilt easily.
>
> Where should I start on this?

Write the .deb, send it as a file or patch to the list.

-tduffy
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From krause at cup.hp.com Tue Mar 15 09:51:07 2005
From: krause at cup.hp.com (Michael Krause)
Date: Tue, 15 Mar 2005 09:51:07 -0800
Subject: [openib-general] Getting rid of pinned memory requirement
In-Reply-To: <8508251A6FC08A489844A94261D3693A03900B@fiona.siliquent.com>
References: <8508251A6FC08A489844A94261D3693A03900B@fiona.siliquent.com>
Message-ID: <6.2.0.14.2.20050315093749.02a06518@esmail.cup.hp.com>

At 05:35 PM 3/14/2005, Caitlin Bestler wrote:
> >
> > -----Original Message-----
> > From: Troy Benjegerdes [mailto:hozer at hozed.org]
> > Sent: Monday, March 14, 2005 5:06 PM
> > To: Caitlin Bestler
> > Cc: openib-general at openib.org
> > Subject: Re: [openib-general] Getting rid of pinned memory requirement
> >
> > >
> > > The key is that the entire operation either has to be fast
> > > enough so that no connection or application session layer
> > > time-outs occur, or an end-to-end agreement to suspend the
> > > connection is a requirement. The first option seems more
> > > plausible to me, the second essentially
> > > requires extending the CM protocol. That's a tall order even for
> > > InfiniBand, and it's even worse for iWARP where the CM
> > > functionality typically ends when the connection is established.
> >
> > I'll buy the good network design argument.

I and others designed InfiniBand RNR (Receiver Not Ready) operations to
allow one to adjust V-to-P mappings (not change the address that was
advertised) in order to allow an OS to safely play some games with memory
and not drop a connection. The time values associated with RNR allow a
solution to tolerate up to an infinite amount of time to perform such
operations, but the envisioned goal was to do this on the order of a
handful of milliseconds in the worst case.
For iWARP, there was no support for defining RNR functionality, as indeed
many people claimed one could just drop in-bound segments and allow the
retransmission protocol to deal with the delay (even if this has
performance implications due to back-off algorithms, though some claim
SACK would minimize this to a large extent). Again, the idea was to
minimize the worst case to milliseconds of down time. BTW, all of this
assumed that the OS would not perform these types of changes that often,
so the long-term impact on an application would be minimal.

> >
> > I suppose if the kernel wants to revoke a card's pinned
> > memory, we should be able to guarantee that it gets new
> > pinned memory within a bounded time. What sort of timing do
> > we need? Milliseconds?
> > Microseconds?
> >
> > In the case of iWarp, isn't this just TCP underneath? If so,
> > can't we just drop any packets in the pipe on the floor and
> > let them get retransmitted? (I suppose the same argument goes
> > for infiniband..
> > what sort of a time window do we have for retransmission?)
> >
> > What are the limits on end-to-end flow control in IB and iWarp?
> >
> >
>From the RDMA Provider's perspective, the short answer is "quick enough
> so that I don't have to do anything heroic to keep the connection alive."

It should not require anything heroic. What it does require is a local
method to suspend the local QP(s) so that it cannot place or read memory
in the affected area. That can take some time depending upon the
implementation. There is then the time to overwrite the mappings, which
again, depending upon the implementation and the number of mappings, could
be milliseconds in length.

>With TCP you also have to add "and healthy". If you've ever had a long
>download that got effectively stalled by a burst of noise and you just hit
>the 'reload' button on your browser then you know what I'm talking about.
>
>But in transport neutral terms I would think that one RTT is definitely
>safe -- that much data could have
>been dropped by one switch failure or one nasty spike in inbound noise.
>
> > >
> > > Yes, there are limits on how much memory you can mlock, or even
> > > allocate. Applications are required to register memory precisely
> > > because the required guarantees are not there by default.
> > Eliminating
> > > those guarantees *is* effectively rewriting every RDMA application
> > > without even letting them know.
> >
> > Some of this argument is a policy issue, which I would argue
> > shouldn't be hard-coded in the code or in the network hardware.
> >
> > At least in my view, the guarantees are only there to make
> > applications go fast. We are getting low latency and high
> > performance with infiniband by making memory registration go
> > really really slow. If, to make big HPC simulation
> > applications work, we wind up doing memcpy() to put the data
> > into a registered buffer because we can't register half of
> > physical memory, the application isn't going very fast.
> >
>
>What you are looking for is a distinction between registering
>memory to *enable* the RNIC to optimize local access and
>registering memory to enable its being advertised to the
>remote end.
>
>Early implementations of RDMA, both IB and iWARP, have not
>distinguished between the two. But theoretically *applications*
>do not need memory regions that are not enabled for remote
>access to be pinned. That is an RNIC requirement that could
>evolve. But applications themselves *do* need remotely
>accessible memory regions, portions of which they intend
>to advertise with RKeys, to be truly available (i.e., pinned).
>
>You are also making a policy assumption that an application
>that actually needs half of physical memory should be using
>paged memory. Memory is cheap, and if performance is critical
>why should this memory be swapped out to disk?
>
>Is the limitation on not being able to register half of
>physical memory based upon some assumption that swapping
>is a requirement? Or is it a limitation in the memory region
>size? If it's the latter, you need to get the OS to support
>larger page sizes.

For some OSes, you can pin very large areas. I've seen 15/16 of memory
being able to be pinned with no adverse impacts on the applications. For
these OSes, kernel memory is effectively pinned memory. As such, depending
upon the mix of services being provided, the system may operate quite
nicely with such large amounts of memory being pinned. As more services
are "ported" to operate over RDMA technologies, memory management isn't
necessarily any harder; it just becomes something people have to think
more about. Today's VM designs have allowed people to get sloppy as they
assume that swapping will occur, and since many platforms are not that
loaded, they don't see any real adverse impacts. User-space RDMA
applications require people to think once again about memory management
and to accept that swapping isn't a get-out-of-jail card. One needs to
develop resource management tools to determine who obtains specified
amounts of resources and their priorities. For the most part, this is
somewhat a re-invention of some thinking that went into the micro-kernel
work in past years. These problems are not intractable; they are only
constrained by the legacy inertia inherent in all technologies today.

Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From halr at voltaire.com Tue Mar 15 11:12:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Mar 2005 14:12:24 -0500
Subject: [openib-general] Re: [PATCH] [MAD] API changes and updates to support RMPP
In-Reply-To: <20050314162258.1bedff07.mshefty@ichips.intel.com>
References: <20050314162258.1bedff07.mshefty@ichips.intel.com>
Message-ID: <1110913944.4662.666.camel@localhost.localdomain>

On Mon, 2005-03-14 at 19:22, Sean Hefty wrote:
> This patch updates the MAD API to help provide support for the RMPP
> implementation and clients. Notable changes:

Wouldn't this change also impact ib_user_mad.h and user_mad.c ?

-- Hal

From roland at topspin.com Tue Mar 15 11:27:11 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 15 Mar 2005 11:27:11 -0800
Subject: [openib-general] uverbs security
In-Reply-To: <20050315062352.GA19233@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 15 Mar 2005 08:23:52 +0200")
References: <20050315062352.GA19233@mellanox.co.il>
Message-ID: <52acp4yak0.fsf@topspin.com>

    Michael> Hi, Roland! Looking at uverbs kernel module, I notice
    Michael> that in some instances it passes some parameters from
    Michael> userspace directly to ib core, without verifying their
    Michael> sanity.

    Michael> One example of this is qp attributes in create and modify
    Michael> qp.

    Michael> For example, modify qp and alloc qp will simply copy the
    Michael> attributes. This might create issues since the core may
    Michael> assume it works against a trusted kernel client, so it
    Michael> may get confused if passed illegal parameter values.

    Michael> For example, qp type could be IB_QPT_SMI or
    Michael> IB_QPT_GSI. Will this create a problem? Hard for me to
    Michael> tell ...

This particular example is OK, because mthca_provider.c has:

	case IB_QPT_SMI:
	case IB_QPT_GSI:
	{
		/* Don't allow userspace to create special QPs */
		if (pd->uobject)
			return ERR_PTR(-EINVAL);

but I agree it might be better to check this in the uverbs module.

    Michael> I think the best approach is to validate *all* user-given
    Michael> parameters before passing them on to core. What do you
    Michael> think?

Yes, we should do as much validation as possible, although I'm not
very worried about bad values that have no effect on anyone other than
the userspace process itself.

 - R.

From mshefty at ichips.intel.com Tue Mar 15 11:27:32 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 15 Mar 2005 11:27:32 -0800
Subject: [openib-general] Re: [PATCH] [MAD] API changes and updates to support RMPP
In-Reply-To: <1110913944.4662.666.camel@localhost.localdomain>
References: <20050314162258.1bedff07.mshefty@ichips.intel.com> <1110913944.4662.666.camel@localhost.localdomain>
Message-ID: <42373724.9070507@ichips.intel.com>

Hal Rosenstock wrote:
> On Mon, 2005-03-14 at 19:22, Sean Hefty wrote:
>
>>This patch updates the MAD API to help provide support for the RMPP
>>implementation and clients. Notable changes:
>
>
> Wouldn't this change also impact ib_user_mad.h and user_mad.c ?

I don't think that they affect those files directly. BUT, I didn't test
against these two files, and I don't even think that I included them in my
compile, which is an obvious oversight.

It might be possible to remove the internal MR in user_mad.c, but that
could still come in a separate patch. Something needs to be done to
support RMPP in usermode, but I haven't thought that far ahead yet.

- Sean

From roland at topspin.com Tue Mar 15 12:27:58 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 15 Mar 2005 12:27:58 -0800
Subject: [openib-general] [PATCH] InfiniBand: remove unsafe use of in_atomic()
Message-ID: <52zmx4wt69.fsf@topspin.com>

Using in_atomic() to decide between GFP_KERNEL and GFP_ATOMIC is not
safe (it doesn't work if CONFIG_PREEMPT=n). Change to just always
allocating with GFP_ATOMIC, since we don't know if we can sleep or
not.

Signed-off-by: Roland Dreier

--- linux-export.orig/drivers/infiniband/core/mad.c	2005-03-15 12:23:32.640868259 -0800
+++ linux-export/drivers/infiniband/core/mad.c	2005-03-15 12:26:56.311553460 -0800
@@ -646,7 +646,7 @@
 			  struct ib_smp *smp,
 			  struct ib_send_wr *send_wr)
 {
-	int ret, alloc_flags, solicited;
+	int ret, solicited;
 	unsigned long flags;
 	struct ib_mad_local_private *local;
 	struct ib_mad_private *mad_priv;
@@ -666,11 +666,7 @@
 	if (!ret || !device->process_mad)
 		goto out;
 
-	if (in_atomic() || irqs_disabled())
-		alloc_flags = GFP_ATOMIC;
-	else
-		alloc_flags = GFP_KERNEL;
-	local = kmalloc(sizeof *local, alloc_flags);
+	local = kmalloc(sizeof *local, GFP_ATOMIC);
 	if (!local) {
 		ret = -ENOMEM;
 		printk(KERN_ERR PFX "No memory for ib_mad_local_private\n");
@@ -678,7 +674,7 @@
 	}
 	local->mad_priv = NULL;
 	local->recv_mad_agent = NULL;
-	mad_priv = kmem_cache_alloc(ib_mad_cache, alloc_flags);
+	mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_ATOMIC);
 	if (!mad_priv) {
 		ret = -ENOMEM;
 		printk(KERN_ERR PFX "No memory for local response MAD\n");
@@ -860,9 +856,7 @@
 	}
 
 	/* Allocate MAD send WR tracking structure */
-	mad_send_wr = kmalloc(sizeof *mad_send_wr,
-			      (in_atomic() || irqs_disabled()) ?
-			      GFP_ATOMIC : GFP_KERNEL);
+	mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_ATOMIC);
 	if (!mad_send_wr) {
 		printk(KERN_ERR PFX "No memory for "
 		       "ib_mad_send_wr_private\n");

From Nitin.Hande at Sun.COM Tue Mar 15 13:15:57 2005
From: Nitin.Hande at Sun.COM (Nitin Hande)
Date: Tue, 15 Mar 2005 13:15:57 -0800
Subject: [Fwd: Re: [openib-general] Solaris IPoIB MTU with OpenSM]
In-Reply-To: <1109969601.4648.32.camel@erez-s.us.voltaire.com>
References: <1109969601.4648.32.camel@erez-s.us.voltaire.com>
Message-ID: <1110921356.7768.447.camel@sr1-umpk-01>

Hal,

On Fri, 2005-03-04 at 12:53, Hal Rosenstock wrote:
> Hi again Nitin,
>
> Finally got a chance to work on this. I have a workaround for you for
> now. Real patch later... Let me know if this does the trick for you. It
> did for me.
>
> -- Hal
>
> Index: osm_sa_mcmember_record.c
> ===================================================================
> --- osm_sa_mcmember_record.c (revision 1953)
> +++ osm_sa_mcmember_record.c (working copy)
> @@ -1522,9 +1522,11 @@
>    if ((IB_MCR_COMPMASK_PROXY & comp_mask) &&
>        (p_rcvd_rec->proxy_join != p_mgrp->mcmember_rec.proxy_join)) goto Exit;
>
> +#if 0
>    /* if defined MUST match exactly !*/
>    if ((IB_MCR_COMPMASK_MTU_SEL & comp_mask) &&
>        ((p_rcvd_rec->mtu >> 6) != (p_mgrp->mcmember_rec.mtu >> 6))) goto Exit;
> +#endif
>
>    if ((IB_MCR_COMPMASK_MTU & comp_mask) &&
>        ((p_rcvd_rec->mtu & 0x3F) != (p_mgrp->mcmember_rec.mtu & 0x3F))) goto Exit;

This is cool, I have got Solaris IPoIB happily working with the OpenSM now. It plumbs, pings and snoops on the 0xffff pkey. Here is some output:

[root at dongon ~]# cat /etc/path_to_inst | grep ibd
"/pci at 8,600000/pci at 1/pci15b3,5a44 at 0/ibport at 1,ffff,ipib" 0 "ibd"
"/pci at 8,600000/pci at 1/pci15b3,5a44 at 0/ibport at 2,ffff,ipib" 1 "ibd"
[root at dongon ~]# ifconfig ibd0
ibd0: flags=1000843 mtu 2044 index 3
        inet 192.168.100.111 netmask ffffff00 broadcast 192.168.100.255
        ipib 0:0:0:16:fe:80:0:0:0:0:0:0:0:2:c9:1:9:76:51:d1
[root at dongon ~]# ping 192.168.100.112
192.168.100.112 is alive
[root at dongon ~]# snoop -d ibd1
192.168.100.112 -> * ARP C Who is 192.168.100.111, 192.168.100.111 ?
192.168.100.111 -> 192.168.100.112 ARP R 192.168.100.111, 192.168.100.111 is 0:0:0:16:fe:80:0:0:0:0:0:0:0:2:c9:1:9:76:51:d1
192.168.100.111 -> 192.168.100.112 ICMP Echo request (ID: 641 Sequence number: 0)
192.168.100.112 -> 192.168.100.111 ICMP Echo reply (ID: 641 Sequence number: 0)

This is fantastic. Thanks, Hal! BTW, I have not tested it with a multi-packet GetTable response (RMPP).

On the other hand, on my Linux node, if I try to use the 8001 partition and configure the IB interface with an IP address (at the same time as ib0 is using the 0xffff pkey), I get the following error; you may want to investigate that....
[root at flopteron2 ~]# echo 0x8001 > /sys/class/net/ib0/create_child [root at flopteron2 ~]# ifconfig ib0.8001 10.10.1.1 [root at flopteron2ib0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 ~]# ib0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 [root at flopteron2 ~]# ib0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 b0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 b0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 b0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 b0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 b0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 b0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 b0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 b0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 b0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, status -22 Thanks Nitin > > > -----Forwarded Message----- > > From: Hal Rosenstock > To: Nitin Hande > Cc: openib , Tom Duffy > Subject: Re: [openib-general] Solaris IPoIB MTU with OpenSM > Date: 24 Feb 2005 08:42:23 -0500 > > Hi Nitin, > > On Wed, 2005-02-23 at 17:19, Nitin Hande wrote: > > Hal, > > > > [comments below] > > On Wed, 2005-02-23 at 02:19, Hal Rosenstock wrote: > > > On Tue, 2005-02-22 at 22:56, Nitin Hande wrote: > > > > So I tried the latest patches and preliminarily things seem to be > > > > working fine. > > > > > > Yipee. > > [snip..] 
> > > > > > > > > > > So after this test above, I try to run snoop on the solaris interface > > > > and get the following error message from the layer below IPoIB: > > > > > > > > Feb 22 19:50:25 dongon.SFBay.Sun.COM ibd: [ID 517869 kern.info] NOTICE: > > > > ibd0: HCA GUID 0002c901097651d0 port 1 PKEY ffff Could not get list of > > > > IBA multicast groups > > > > > > > > My preliminary assumption is that OpenSm is not returning the list of > > > > multicast groups that the ibd interface has joined. I will look at the > > > > MAD's tomorrow and try to ascertain that. > > > > > > How does S10 request this ? Remember that if it is a GetTable and > > > doesn't fit in a single MAD, it will be broken now. If that is the case, > > > we will live with this until we have real RMPP. > > Below is an an example of a single GetTable request and response between > > Solaris and OpenSM. OpenSM is not reporting the MCgroups in case of a > > single request/response. I have also provided a MAD output between > > Solaris IPoIB driver and IBSRM single GetTable request response below > > this example. > > > > Here is the MAD trace between solaris and OpenSM: > > Outgoing MAD: > > BaseVersion: 0x1 > > MgmtClass: 0x3 - SubnAdm > > ClassVersion: 0x2 > > R_Method: 0x12 - SubnAdmGetTable() > > Status: 0x0 - NO_ERROR > > ClassSpecific: 0x0 > > TransactionID: 0x97651d1000000ec > > AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID > > > > 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef > > 0: 01 03 02 12 00 00 00 00 09 76 51 d1 00 00 00 ec .........vQ..... > > 10: 00 38 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 .8.............. > > 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 30: 00 00 00 00 00 00 80 b4 00 00 00 00 00 00 00 00 ................ > > 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 50: 00 00 00 00 00 00 00 00 00 00 0b 1b 00 00 84 00 ................ > > 60: ff ff 00 00 00 00 00 00 20 00 00 00 00 00 00 00 ........ ....... 
> > 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > Incoming MAD: > > BaseVersion: 0x1 > > MgmtClass: 0x3 - SubnAdm > > ClassVersion: 0x2 > > R_Method: 0x92 - > > Status: 0x0 - NO_ERROR > > ClassSpecific: 0x0 > > TransactionID: 0x97651d1000000ec > > AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID > > > > 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef > > 0: 01 03 02 92 00 00 00 00 09 76 51 d1 00 00 00 ec .........vQ..... > > 10: 00 38 00 00 ff ff ff ff 01 01 77 00 00 00 00 01 .8........w..... > > 20: 00 00 00 14 00 00 00 00 00 00 00 00 00 07 00 00 ................ > > 30: 00 00 00 00 00 00 80 b4 00 00 00 00 00 00 00 00 ................ > > 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
> > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > It is likely failing the component checking in > osm_sa_mcmember_record.c::__osm_sa_mcm_by_comp_mask_cb due to an endian > issue. Either you can debug this code or I will early next week. > > The component mask in the request is 0x80b4 so the only components > checked are QKey (0xb1b), MTU (exactly 2048 (4)), PKey (0xffff), and > scope (2). > > If I don't hear anything by next week, I will work on this then. > > Thanks. > > -- Hal > > > Here is the transaction between IBSRM and Solaris IPoIB driver. > > > > Outgoing MAD: > > BaseVersion: 0x1 > > MgmtClass: 0x3 - SubnAdm > > ClassVersion: 0x2 > > R_Method: 0x12 - SubnAdmGetTable() > > Status: 0x0 - NO_ERROR > > ClassSpecific: 0x0 > > TransactionID: 0x8fecc610000009a > > AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID > > > > 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef > > 0: 01 03 02 12 00 00 00 00 08 fe cc 61 00 00 00 9a ...........a.... > > 10: 00 38 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 .8.............. > > 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 30: 00 00 00 00 00 00 80 b4 00 00 00 00 00 00 00 00 ................ > > 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 50: 00 00 00 00 00 00 00 00 81 23 45 68 00 00 84 00 .........#Eh.... > > 60: 80 01 00 00 00 00 00 00 20 00 00 00 00 00 00 00 ........ ....... > > 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
> > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > Incoming MAD: > > BaseVersion: 0x1 > > MgmtClass: 0x3 - SubnAdm > > ClassVersion: 0x2 > > R_Method: 0x92 - > > Status: 0x0 - NO_ERROR > > ClassSpecific: 0x0 > > TransactionID: 0x8fecc610000009a > > AttributeID: 0x38 - SA_MCMEMBERRECORD_ATTRID > > > > 0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef > > 0: 01 03 02 92 00 00 00 00 08 fe cc 61 00 00 00 9a ...........a.... > > 10: 00 38 00 00 00 00 00 00 01 01 73 00 00 00 00 01 .8........s..... > > 20: 00 00 01 40 00 00 00 00 00 00 00 00 00 07 00 00 ... at ............ > > 30: 00 00 00 00 00 00 00 00 ff 12 40 1b 80 01 00 00 .......... at ..... > > 40: 00 00 00 00 00 00 00 09 00 00 00 00 00 00 00 00 ................ > > 50: 00 00 00 00 00 00 00 00 81 23 45 68 c0 04 84 00 .........#Eh.... > > 60: 80 01 83 8d 00 00 00 00 20 00 00 00 00 00 00 00 ........ ....... > > 70: ff 12 40 1b 80 01 00 00 00 00 00 00 00 00 00 01 .. at ............. > > 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 90: 81 23 45 68 c0 03 84 00 80 01 83 8d 00 00 00 00 .#Eh............ > > a0: 20 00 00 00 00 00 00 00 ff 12 40 1b 80 01 00 00 ......... at ..... > > b0: 00 00 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 ................ > > c0: 00 00 00 00 00 00 00 00 81 23 45 68 c0 00 84 00 .........#Eh.... > > d0: 80 01 83 8d 00 00 00 00 20 00 00 00 00 00 00 00 ........ ....... > > e0: ff 12 60 1b 80 01 00 00 00 00 00 01 ff 76 5b 01 ..`..........v[. > > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
> >
> > Thanks
> > Nitin
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From roland at topspin.com Tue Mar 15 13:23:51 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 15 Mar 2005 13:23:51 -0800
Subject: [Fwd: Re: [openib-general] Solaris IPoIB MTU with OpenSM]
In-Reply-To: <1110921356.7768.447.camel@sr1-umpk-01> (Nitin Hande's message of "Tue, 15 Mar 2005 13:15:57 -0800")
References: <1109969601.4648.32.camel@erez-s.us.voltaire.com> <1110921356.7768.447.camel@sr1-umpk-01>
Message-ID: <52vf7swql4.fsf@topspin.com>

    Nitin> On other hand, on my linux node, if I try to use 8001
    Nitin> partition and configure IB interface with IP addr (same
    Nitin> time while ib0 is using 0xffff pkey), I get the following
    Nitin> error, you may want to investigate that....

I think this is probably an OpenSM issue (does OpenSM support multiple partitions?). On my fabric, running Topspin's embedded SM on a switch, I can do:

    # modprobe ib_ipoib
    # echo 0x8001 > /sys/class/net/ib0/create_child
    # ifconfig ib0.8001 up

on both systems.
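For reference, the group address in the quoted "multicast join failed" messages (ff12:401b:8001:0:0:0:ffff:ffff) has the shape of the IPoIB IPv4 broadcast group for the child interface's partition: a multicast prefix with link-local scope, the IPv4 signature 0x401b, the P_Key, and an all-ones low 32 bits. The following is a minimal sketch of that layout as described in the IPoIB specification; the function names are hypothetical and this is not the kernel's ipoib code.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Fill in a 16-byte MGID for the IPoIB IPv4 broadcast group.
 * 'scope' is the 4-bit multicast scope (2 = link-local), 'pkey' is the
 * 16-bit partition key exactly as it should appear in the address.
 */
static void sketch_ipoib_bcast_mgid(uint16_t pkey, uint8_t scope, uint8_t mgid[16])
{
    memset(mgid, 0, 16);
    mgid[0]  = 0xff;                                  /* multicast GID prefix */
    mgid[1]  = (uint8_t)(0x10 | (scope & 0x0f));      /* flags nibble 1, then scope */
    mgid[2]  = 0x40;                                  /* IPv4 signature, high byte */
    mgid[3]  = 0x1b;                                  /* IPv4 signature, low byte */
    mgid[4]  = (uint8_t)(pkey >> 8);                  /* P_Key, high byte */
    mgid[5]  = (uint8_t)(pkey & 0xff);                /* P_Key, low byte */
    mgid[12] = mgid[13] = mgid[14] = mgid[15] = 0xff; /* 255.255.255.255 */
}

/* Print the MGID as eight 16-bit groups, the same style as the log lines. */
static void sketch_format_mgid(const uint8_t mgid[16], char *out, size_t len)
{
    snprintf(out, len, "%x:%x:%x:%x:%x:%x:%x:%x",
             (mgid[0]  << 8) | mgid[1],  (mgid[2]  << 8) | mgid[3],
             (mgid[4]  << 8) | mgid[5],  (mgid[6]  << 8) | mgid[7],
             (mgid[8]  << 8) | mgid[9],  (mgid[10] << 8) | mgid[11],
             (mgid[12] << 8) | mgid[13], (mgid[14] << 8) | mgid[15]);
}
```

Seen this way, the failing join is for the broadcast group of partition 0x8001, which is consistent with the question of whether the SM in use handles a second partition.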
On system #1 I have:

    # ifconfig ib0.8001
    ib0.8001  Link encap:UNSPEC  HWaddr 00-13-04-06-FE-80-00-00-00-00-00-00-00-00-00-00
              inet6 addr: fe80::202:c901:7fc:c711/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:128
              RX bytes:0 (0.0 b)  TX bytes:300 (300.0 b)

and on system #2 I'm able to do:

    # ping6 -I ib0.8001 fe80::202:c901:7fc:c711
    PING fe80::202:c901:7fc:c711(fe80::202:c901:7fc:c711) from fe80::202:c901:78c:e461 ib0.8001: 56 data bytes
    64 bytes from fe80::202:c901:7fc:c711: icmp_seq=1 ttl=64 time=4.56 ms
    64 bytes from fe80::202:c901:7fc:c711: icmp_seq=2 ttl=64 time=0.077 ms
    64 bytes from fe80::202:c901:7fc:c711: icmp_seq=3 ttl=64 time=0.065 ms

 - R.

From roland at topspin.com Tue Mar 15 14:06:29 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 15 Mar 2005 14:06:29 -0800
Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs
Message-ID: <52k6o8wom2.fsf@topspin.com>

I just spent a little time creating a new "ibv" module for NetPIPE that runs on top of the userspace verbs I've been developing on the roland-uverbs branch. This is pretty much a straight port of the current Mellanox VAPI "ib" module, with the main changes coming from the fact that OpenIB doesn't support the non-standard "unsignaled receive" extension, and the fact that a completion event thread is no longer created automatically.

I found several bugs in the verbs support while making this work, but it seems quite stable now, although I haven't tried all option combinations. I also have not had a chance to compare Mellanox VAPI and OpenIB verbs performance on identical hardware -- it would be very useful to see this comparison on a variety of systems.

The new ibv module is contained in the patch included below.
Thanks, Roland --- NetPIPE_3.6.2.orig/makefile 2004-06-09 12:46:35.000000000 -0700 +++ NetPIPE_3.6.2/makefile 2005-03-15 13:58:08.000000000 -0800 @@ -229,6 +229,10 @@ -DINFINIBAND -DTCP -I $(VAPI_INC) -L $(VAPI_LIB) \ -lmpga -lvapi -lpthread +ibv: $(SRC)/ibv.c $(SRC)/netpipe.c $(SRC)/netpipe.h + $(CC) $(CFLAGS) $(SRC)/ibv.c $(SRC)/netpipe.c -o NPibv \ + -DOPENIB -DTCP -libverbs + atoll: $(SRC)/atoll.c $(SRC)/netpipe.c $(SRC)/netpipe.h $(CC) $(CFLAGS) -DATOLL $(SRC)/netpipe.c \ $(SRC)/atoll.c -o NPatoll \ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ NetPIPE_3.6.2/src/ibv.c 2005-03-15 13:30:03.000000000 -0800 @@ -0,0 +1,1072 @@ +/*****************************************************************************/ +/* "NetPIPE" -- Network Protocol Independent Performance Evaluator. */ +/* Copyright 1997, 1998 Iowa State University Research Foundation, Inc. */ +/* */ +/* This program is free software; you can redistribute it and/or modify */ +/* it under the terms of the GNU General Public License as published by */ +/* the Free Software Foundation. You should have received a copy of the */ +/* GNU General Public License along with this program; if not, write to the */ +/* Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ +/* */ +/* ibv.c ---- Infiniband module for OpenIB verbs */ +/*****************************************************************************/ + +#define USE_VOLATILE_RPTR /* needed for polling on last byte of recv buffer */ +#include "netpipe.h" +#include +#include +#include + +/* Debugging output macro */ + +FILE* logfile; + +#if 0 +#define LOGPRINTF(_format, _aa...) fprintf(logfile, "%s: " _format, __func__ , ##_aa); fflush(logfile) +#else +#define LOGPRINTF(_format, _aa...) 
+#endif + +/* Header files needed for Infiniband */ + +#include + +/* Global vars */ + +static struct ibv_device *hca; +static struct ibv_context *ctx; +static struct ibv_port_attr hca_port; +static int port_num; +static uint16_t lid; +static uint16_t d_lid; +static struct ibv_pd *pd_hndl; +static int num_cqe; +static int act_num_cqe; +static struct ibv_cq *s_cq_hndl; +static struct ibv_cq *r_cq_hndl; +static struct ibv_mr *s_mr_hndl; +static struct ibv_mr *r_mr_hndl; +static struct ibv_qp_init_attr qp_init_attr; +static struct ibv_qp *qp_hndl; +static uint32_t d_qp_num; +static struct ibv_qp_attr qp_attr; +static struct ibv_wc wc; +static int max_wq=50000; +static void* remote_address; +static uint32_t remote_key; +static volatile int receive_complete; +static pthread_t thread; + +/* Function definitions */ + +void Init(ArgStruct *p, int* pargc, char*** pargv) +{ + /* Set defaults + */ + p->prot.ib_mtu = IBV_MTU_1024; /* 1024 Byte MTU */ + p->prot.commtype = NP_COMM_RDMAWRITE; /* Use RDMA write communications */ + p->prot.comptype = NP_COMP_LOCALPOLL; /* Use local polling for completion */ + p->tr = 0; /* I am not the transmitter */ + p->rcv = 1; /* I am the receiver */ +} + +void Setup(ArgStruct *p) +{ + + int one = 1; + int sockfd; + struct sockaddr_in *lsin1, *lsin2; /* ptr to sockaddr_in in ArgStruct */ + char *host; + struct hostent *addr; + struct protoent *proto; + int send_size, recv_size, sizeofint = sizeof(int); + struct sigaction sigact1; + char logfilename[80]; + + /* Sanity check */ + if( p->prot.commtype == NP_COMM_RDMAWRITE && + p->prot.comptype != NP_COMP_LOCALPOLL ) { + fprintf(stderr, "Error, RDMA Write may only be used with local polling.\n"); + fprintf(stderr, "Try using RDMA Write With Immediate Data with vapi polling\n"); + fprintf(stderr, "or event completion\n"); + exit(-1); + } + + if( p->prot.commtype != NP_COMM_RDMAWRITE && + p->prot.comptype == NP_COMP_LOCALPOLL ) { + fprintf(stderr, "Error, local polling may only be used with RDMA 
Write.\n"); + fprintf(stderr, "Try using vapi polling or event completion\n"); + exit(-1); + } + + /* Open log file */ + sprintf(logfilename, ".iblog%d", 1 - p->tr); + logfile = fopen(logfilename, "w"); + + host = p->host; /* copy ptr to hostname */ + + lsin1 = &(p->prot.sin1); + lsin2 = &(p->prot.sin2); + + bzero((char *) lsin1, sizeof(*lsin1)); + bzero((char *) lsin2, sizeof(*lsin2)); + + if ( (sockfd = socket(AF_INET, SOCK_STREAM, 0)) < 0){ + printf("NetPIPE: can't open stream socket! errno=%d\n", errno); + exit(-4); + } + + if(!(proto = getprotobyname("tcp"))){ + printf("NetPIPE: protocol 'tcp' unknown!\n"); + exit(555); + } + + if (p->tr){ /* if client i.e., Sender */ + + + if (atoi(host) > 0) { /* Numerical IP address */ + lsin1->sin_family = AF_INET; + lsin1->sin_addr.s_addr = inet_addr(host); + + } else { + + if ((addr = gethostbyname(host)) == NULL){ + printf("NetPIPE: invalid hostname '%s'\n", host); + exit(-5); + } + + lsin1->sin_family = addr->h_addrtype; + bcopy(addr->h_addr, (char*) &(lsin1->sin_addr.s_addr), addr->h_length); + } + + lsin1->sin_port = htons(p->port); + + } else { /* we are the receiver (server) */ + + bzero((char *) lsin1, sizeof(*lsin1)); + lsin1->sin_family = AF_INET; + lsin1->sin_addr.s_addr = htonl(INADDR_ANY); + lsin1->sin_port = htons(p->port); + + if (bind(sockfd, (struct sockaddr *) lsin1, sizeof(*lsin1)) < 0){ + printf("NetPIPE: server: bind on local address failed! 
errno=%d", errno); + exit(-6); + } + + } + + if(p->tr) + p->commfd = sockfd; + else + p->servicefd = sockfd; + + + + /* Establish tcp connections */ + + establish(p); + + /* Initialize Mellanox Infiniband */ + + if(initIB(p) == -1) { + CleanUp(p); + exit(-1); + } +} + +void event_handler(struct ibv_cq *cq); + +void *EventThread(void *unused) +{ + struct ibv_cq *cq; + void *data; + + while (1) { + if (ibv_get_cq_event(ctx, 0, &cq, &data)) { + fprintf(stderr, "Failed to get CQ event\n"); + return NULL; + } + event_handler(cq); + } +} + +int initIB(ArgStruct *p) +{ + struct dlist *dev_list; + int ret; + + dev_list = ibv_get_devices(); + dlist_start(dev_list); + hca = dlist_next(dev_list); + if (!hca) { + fprintf(stderr, "Couldn't find any InfiniBand devices\n"); + return -1; + } else { + LOGPRINTF("Found Infiniband HCA %s\n", ibv_get_device_name(hca)); + } + + ctx = ibv_open_device(hca); + if (!ctx) { + fprintf(stderr, "Couldn't create InfiniBand context\n"); + return -1; + } else { + LOGPRINTF("Found Infiniband HCA %s\n", ibv_get_device_name(hca)); + } + + /* Get HCA properties */ + + port_num=1; + ret = ibv_query_port(ctx, port_num, &hca_port); + if(ret) { + fprintf(stderr, "Error querying Infiniband HCA\n"); + return -1; + } else { + LOGPRINTF("Queried Infiniband HCA\n"); + } + lid = hca_port.lid; + LOGPRINTF(" lid = %d\n", lid); + + + /* Allocate Protection Domain */ + + pd_hndl = ibv_alloc_pd(ctx); + if(!pd_hndl) { + fprintf(stderr, "Error allocating PD\n"); + return -1; + } else { + LOGPRINTF("Allocated Protection Domain\n"); + } + + + /* Create send completion queue */ + + num_cqe = 30000; /* Requested number of completion q elements */ + s_cq_hndl = ibv_create_cq(ctx, num_cqe, NULL); + if(!s_cq_hndl) { + fprintf(stderr, "Error creating send CQ\n"); + return -1; + } else { + act_num_cqe = s_cq_hndl->cqe; + LOGPRINTF("Created Send Completion Queue with %d elements\n", act_num_cqe); + } + + + /* Create recv completion queue */ + + num_cqe = 20000; /* Requested 
number of completion q elements */ + r_cq_hndl = ibv_create_cq(ctx, num_cqe, NULL); + if(!r_cq_hndl) { + fprintf(stderr, "Error creating recv CQ\n"); + return -1; + } else { + act_num_cqe = r_cq_hndl->cqe; + LOGPRINTF("Created Recv Completion Queue with %d elements\n", act_num_cqe); + } + + + /* Placeholder for MR */ + + + /* Create Queue Pair */ + + qp_init_attr.cap.max_recv_wr = max_wq; /* Max outstanding WR on RQ */ + qp_init_attr.cap.max_send_wr = max_wq; /* Max outstanding WR on SQ */ + qp_init_attr.cap.max_recv_sge = 1; /* Max scatter/gather entries on RQ */ + qp_init_attr.cap.max_send_sge = 1; /* Max scatter/gather entries on SQ */ + qp_init_attr.recv_cq = r_cq_hndl; /* CQ handle for RQ */ + qp_init_attr.send_cq = s_cq_hndl; /* CQ handle for SQ */ + qp_init_attr.sq_sig_all = 0; /* Signalling type */ + qp_init_attr.qp_type = IBV_QPT_RC; /* Transmission type */ + + qp_hndl = ibv_create_qp(pd_hndl, &qp_init_attr); + if(!qp_hndl) { + fprintf(stderr, "Error creating Queue Pair\n"); + return -1; + } else { + LOGPRINTF("Created Queue Pair\n"); + } + + + /* Exchange lid and qp_num with other node */ + + if( write(p->commfd, &lid, sizeof(lid) ) != sizeof(lid) ) { + fprintf(stderr, "Failed to send lid over socket\n"); + return -1; + } + if( write(p->commfd, &qp_hndl->qp_num, sizeof(qp_hndl->qp_num) ) != sizeof(qp_hndl->qp_num) ) { + fprintf(stderr, "Failed to send qpnum over socket\n"); + return -1; + } + if( read(p->commfd, &d_lid, sizeof(d_lid) ) != sizeof(d_lid) ) { + fprintf(stderr, "Failed to read lid from socket\n"); + return -1; + } + if( read(p->commfd, &d_qp_num, sizeof(d_qp_num) ) != sizeof(d_qp_num) ) { + fprintf(stderr, "Failed to read qpnum from socket\n"); + return -1; + } + + LOGPRINTF("Local: lid=%d qp_num=%d Remote: lid=%d qp_num=%d\n", + lid, qp_hndl->qp_num, d_lid, d_qp_num); + + + /* Bring up Queue Pair */ + + /******* INIT state ******/ + + qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.pkey_index = 0; + qp_attr.port_num = port_num; +
qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ; + + ret = ibv_modify_qp(qp_hndl, &qp_attr, + IBV_QP_STATE | + IBV_QP_PKEY_INDEX | + IBV_QP_PORT | + IBV_QP_ACCESS_FLAGS); + if(ret) { + fprintf(stderr, "Error modifying QP to INIT\n"); + return -1; + } + + LOGPRINTF("Modified QP to INIT\n"); + + /******* RTR (Ready-To-Receive) state *******/ + + qp_attr.qp_state = IBV_QPS_RTR; + qp_attr.max_dest_rd_atomic = 1; + qp_attr.dest_qp_num = d_qp_num; + qp_attr.ah_attr.sl = 0; + qp_attr.ah_attr.is_global = 0; + qp_attr.ah_attr.dlid = d_lid; + qp_attr.ah_attr.static_rate = 0; + qp_attr.ah_attr.src_path_bits = 0; + qp_attr.ah_attr.port_num = port_num; + qp_attr.path_mtu = p->prot.ib_mtu; + qp_attr.rq_psn = 0; + qp_attr.pkey_index = 0; + qp_attr.min_rnr_timer = 5; + + ret = ibv_modify_qp(qp_hndl, &qp_attr, + IBV_QP_STATE | + IBV_QP_AV | + IBV_QP_PATH_MTU | + IBV_QP_DEST_QPN | + IBV_QP_RQ_PSN | + IBV_QP_MAX_DEST_RD_ATOMIC | + IBV_QP_MIN_RNR_TIMER); + + if(ret) { + fprintf(stderr, "Error modifying QP to RTR\n"); + return -1; + } + + LOGPRINTF("Modified QP to RTR\n"); + + /* Sync before going to RTS state */ + Sync(p); + + /******* RTS (Ready-to-Send) state *******/ + + qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.sq_psn = 0; + qp_attr.timeout = 31; + qp_attr.retry_cnt = 1; + qp_attr.rnr_retry = 1; + qp_attr.max_rd_atomic = 1; + + ret = ibv_modify_qp(qp_hndl, &qp_attr, + IBV_QP_STATE | + IBV_QP_TIMEOUT | + IBV_QP_RETRY_CNT | + IBV_QP_RNR_RETRY | + IBV_QP_SQ_PSN | + IBV_QP_MAX_QP_RD_ATOMIC); + + if(ret) { + fprintf(stderr, "Error modifying QP to RTS\n"); + return -1; + } + + LOGPRINTF("Modified QP to RTS\n"); + + /* If using event completion, request the initial notification */ + if( p->prot.comptype == NP_COMP_EVENT ) { + if (pthread_create(&thread, NULL, EventThread, NULL)) { + fprintf(stderr, "Couldn't start event thread\n"); + return -1; + } + ibv_req_notify_cq(r_cq_hndl, 0); + } + + return 0; +} + +int finalizeIB(ArgStruct *p) +{ + int ret; + + 
LOGPRINTF("Finalizing IB stuff\n"); + + if(qp_hndl) { + LOGPRINTF("Destroying QP\n"); + ret = ibv_destroy_qp(qp_hndl); + if(ret) { + fprintf(stderr, "Error destroying Queue Pair\n"); + } + } + + if(r_cq_hndl) { + LOGPRINTF("Destroying Recv CQ\n"); + ret = ibv_destroy_cq(r_cq_hndl); + if(ret) { + fprintf(stderr, "Error destroying recv CQ\n"); + } + } + + if(s_cq_hndl) { + LOGPRINTF("Destroying Send CQ\n"); + ret = ibv_destroy_cq(s_cq_hndl); + if(ret) { + fprintf(stderr, "Error destroying send CQ\n"); + } + } + + /* Check memory registrations just in case user bailed out */ + if(s_mr_hndl) { + LOGPRINTF("Deregistering send buffer\n"); + ret = ibv_dereg_mr(s_mr_hndl); + if(ret) { + fprintf(stderr, "Error deregistering send mr\n"); + } + } + + if(r_mr_hndl) { + LOGPRINTF("Deregistering recv buffer\n"); + ret = ibv_dereg_mr(r_mr_hndl); + if(ret) { + fprintf(stderr, "Error deregistering recv mr\n"); + } + } + + if(pd_hndl) { + LOGPRINTF("Deallocating PD\n"); + ret = ibv_dealloc_pd(pd_hndl); + if(ret) { + fprintf(stderr, "Error deallocating PD\n"); + } + } + + /* Application code should not close HCA, just release handle */ + + if(ctx) { + LOGPRINTF("Releasing HCA\n"); + ret = ibv_close_device(ctx); + if(ret) { + fprintf(stderr, "Error releasing HCA\n"); + } + } + + return 0; +} + +void event_handler(struct ibv_cq *cq) +{ + int ret; + + while(1) { + + ret = ibv_poll_cq(cq, 1, &wc); + + if(ret == 0) { + LOGPRINTF("Empty completion queue, requesting next notification\n"); + ibv_req_notify_cq(r_cq_hndl, 0); + return; + } else if(ret < 0) { + fprintf(stderr, "Error in event_handler, polling cq\n"); + exit(-1); + } else if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, "Error in event_handler, on returned work completion " + "status: %d\n", wc.status); + exit(-1); + } + + LOGPRINTF("Retrieved work completion\n"); + + /* For ping-pong mode at least, this check shouldn't be needed for + * normal operation, but it will help catch any bugs with multiple + * sends coming through 
when we're only expecting one. + */ + if(receive_complete == 1) { + + while(receive_complete != 0) sched_yield(); + + } + + receive_complete = 1; + + } + +} + +static int +readFully(int fd, void *obuf, int len) +{ + int bytesLeft = len; + char *buf = (char *) obuf; + int bytesRead = 0; + + while (bytesLeft > 0 && + (bytesRead = read(fd, (void *) buf, bytesLeft)) > 0) + { + bytesLeft -= bytesRead; + buf += bytesRead; + } + if (bytesRead <= 0) + return bytesRead; + return len; +} + +void Sync(ArgStruct *p) +{ + char s[] = "SyncMe"; + char response[7]; + + if (write(p->commfd, s, strlen(s)) < 0 || + readFully(p->commfd, response, strlen(s)) < 0) + { + perror("NetPIPE: error writing or reading synchronization string"); + exit(3); + } + if (strncmp(s, response, strlen(s))) + { + fprintf(stderr, "NetPIPE: Synchronization string incorrect!\n"); + exit(3); + } +} + +void PrepareToReceive(ArgStruct *p) +{ + int ret; /* Return code */ + struct ibv_recv_wr rr; /* Receive request */ + struct ibv_recv_wr *bad_wr; + struct ibv_sge sg_entry; /* Scatter/Gather list - holds buff addr */ + + /* We don't need to post a receive if doing RDMA write with local polling */ + + if( p->prot.commtype == NP_COMM_RDMAWRITE && + p->prot.comptype == NP_COMP_LOCALPOLL ) + return; + + rr.num_sge = 1; + rr.sg_list = &sg_entry; + rr.next = NULL; + + sg_entry.lkey = r_mr_hndl->lkey; + sg_entry.length = p->bufflen; + sg_entry.addr = (uintptr_t)p->r_ptr; + + ret = ibv_post_recv(qp_hndl, &rr, &bad_wr); + if(ret) { + fprintf(stderr, "Error posting recv request\n"); + CleanUp(p); + exit(-1); + } else { + LOGPRINTF("Posted recv request\n"); + } + + /* Set receive flag to zero and request event completion + * notification for this receive so the event handler will + * be triggered when the receive completes. 
+ */ + if( p->prot.comptype == NP_COMP_EVENT ) { + receive_complete = 0; + } +} + +void SendData(ArgStruct *p) +{ + int ret; /* Return code */ + struct ibv_send_wr sr; /* Send request */ + struct ibv_send_wr *bad_wr; + struct ibv_sge sg_entry; /* Scatter/Gather list - holds buff addr */ + + /* Fill in send request struct */ + + if(p->prot.commtype == NP_COMM_SENDRECV) { + sr.opcode = IBV_WR_SEND; + LOGPRINTF("Doing regular send\n"); + } else if(p->prot.commtype == NP_COMM_SENDRECV_WITH_IMM) { + sr.opcode = IBV_WR_SEND_WITH_IMM; + LOGPRINTF("Doing regular send with imm\n"); + } else if(p->prot.commtype == NP_COMM_RDMAWRITE) { + sr.opcode = IBV_WR_RDMA_WRITE; + sr.wr.rdma.remote_addr = (uintptr_t)(remote_address + (p->s_ptr - p->s_buff)); + sr.wr.rdma.rkey = remote_key; + LOGPRINTF("Doing RDMA write (raddr=%p)\n", sr.wr.rdma.remote_addr); + } else if(p->prot.commtype == NP_COMM_RDMAWRITE_WITH_IMM) { + sr.opcode = IBV_WR_RDMA_WRITE_WITH_IMM; + sr.wr.rdma.remote_addr = (uintptr_t)(remote_address + (p->s_ptr - p->s_buff)); + sr.wr.rdma.rkey = remote_key; + LOGPRINTF("Doing RDMA write with imm (raddr=%p)\n", sr.wr.rdma.remote_addr); + } else { + fprintf(stderr, "Error, invalid communication type in SendData\n"); + exit(-1); + } + + sr.send_flags = 0; /* This needed due to a bug in Mellanox HW rel a-0 */ + + sr.num_sge = 1; + sr.sg_list = &sg_entry; + sr.next = NULL; + + sg_entry.lkey = s_mr_hndl->lkey; /* Local memory region key */ + sg_entry.length = p->bufflen; + sg_entry.addr = (uintptr_t)p->s_ptr; + + ret = ibv_post_send(qp_hndl, &sr, &bad_wr); + if(ret) { + fprintf(stderr, "Error posting send request\n"); + } else { + LOGPRINTF("Posted send request\n"); + } + +} + +void RecvData(ArgStruct *p) +{ + int ret; + + /* Busy wait for incoming data */ + + LOGPRINTF("Receiving at buffer address %p\n", p->r_ptr); + + /* + * Unsignaled receives are not supported, so we must always poll the + * CQ, except when using RDMA writes. 
+ */ + if( p->prot.commtype == NP_COMM_RDMAWRITE ) { + + /* Poll for receive completion locally on the receive data */ + + LOGPRINTF("Waiting for last byte of data to arrive\n"); + + while(p->r_ptr[p->bufflen-1] != 'a' + (p->cache ? 1 - p->tr : 1) ) + { + /* BUSY WAIT -- this should be fine since we + * declared r_ptr with volatile qualifier */ + } + + /* Reset last byte */ + p->r_ptr[p->bufflen-1] = 'a' + (p->cache ? p->tr : 0); + + LOGPRINTF("Received all of data\n"); + + } else if( p->prot.comptype != NP_COMP_EVENT ) { + + /* Poll for receive completion using VAPI poll function */ + + LOGPRINTF("Polling completion queue for VAPI work completion\n"); + + ret = 0; + while(ret == 0) + ret = ibv_poll_cq(r_cq_hndl, 1, &wc); + + if(ret < 0) { + fprintf(stderr, "Error in RecvData, polling for completion\n"); + exit(-1); + } + + if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, "Error in status of returned completion: %d\n", + wc.status); + exit(-1); + } + + LOGPRINTF("Retrieved successful completion\n"); + + } else if( p->prot.comptype == NP_COMP_EVENT ) { + + /* Instead of polling directly on data or VAPI completion queue, + * let the VAPI event completion handler set a flag when the receive + * completes, and poll on that instead. Could try using semaphore here + * as well to eliminate busy polling + */ + + LOGPRINTF("Polling receive flag\n"); + + while( receive_complete == 0 ) + { + /* BUSY WAIT */ + } + + /* If in prepost-burst mode, we won't be calling PrepareToReceive + * between ping-pongs, so we need to reset the receive_complete + * flag here. 
+ */ + if( p->preburst ) receive_complete = 0; + + LOGPRINTF("Receive completed\n"); + } +} + +/* Reset is used after a trial to empty the work request queues so we + have enough room for the next trial to run */ +void Reset(ArgStruct *p) +{ + + int ret; /* Return code */ + struct ibv_send_wr sr; /* Send request */ + struct ibv_send_wr *bad_sr; + struct ibv_recv_wr rr; /* Recv request */ + struct ibv_recv_wr *bad_rr; + + /* If comptype is event, then we'll use event handler to detect receive, + * so initialize receive_complete flag + */ + if(p->prot.comptype == NP_COMP_EVENT) receive_complete = 0; + + /* Prepost receive */ + rr.num_sge = 0; + rr.next = NULL; + + LOGPRINTF("Posting recv request in Reset\n"); + ret = ibv_post_recv(qp_hndl, &rr, &bad_rr); + if(ret) { + fprintf(stderr, " Error posting recv request\n"); + CleanUp(p); + exit(-1); + } + + /* Make sure both nodes have preposted receives */ + Sync(p); + + /* Post Send */ + sr.opcode = IBV_WR_SEND; + sr.send_flags = IBV_SEND_SIGNALED; + sr.num_sge = 0; + sr.next = NULL; + + LOGPRINTF("Posting send request \n"); + ret = ibv_post_send(qp_hndl, &sr, &bad_sr); + if(ret) { + fprintf(stderr, " Error posting send request in Reset\n"); + exit(-1); + } + if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, " Error in completion status: %d\n", + wc.status); + exit(-1); + } + + LOGPRINTF("Polling for completion of send request\n"); + ret = 0; + while(ret == 0) + ret = ibv_poll_cq(s_cq_hndl, 1, &wc); + + if(ret < 0) { + fprintf(stderr, "Error polling CQ for send in Reset\n"); + exit(-1); + } + if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, " Error in completion status: %d\n", + wc.status); + exit(-1); + } + + LOGPRINTF("Status of send completion: %d\n", wc.status); + + if(p->prot.comptype == NP_COMP_EVENT) { + /* If using event completion, the event handler will set receive_complete + * when it gets the completion event. 
+ */ + LOGPRINTF("Waiting for receive_complete flag\n"); + while(receive_complete == 0) { /* BUSY WAIT */ } + } else { + LOGPRINTF("Polling for completion of recv request\n"); + ret = 0; + while(ret == 0) + ret = ibv_poll_cq(r_cq_hndl, 1, &wc); + + if(ret < 0) { + fprintf(stderr, "Error polling CQ for recv in Reset\n"); + exit(-1); + } + if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, " Error in completion status: %d\n", + wc.status); + exit(-1); + } + + LOGPRINTF("Status of recv completion: %d\n", wc.status); + } + LOGPRINTF("Done with reset\n"); +} + +void SendTime(ArgStruct *p, double *t) +{ + uint32_t ltime, ntime; + + /* + Multiply the number of seconds by 1e6 to get time in microseconds + and convert value to an unsigned 32-bit integer. + */ + ltime = (uint32_t)(*t * 1.e6); + + /* Send time in network order */ + ntime = htonl(ltime); + if (write(p->commfd, (char *)&ntime, sizeof(uint32_t)) < 0) + { + printf("NetPIPE: write failed in SendTime: errno=%d\n", errno); + exit(301); + } +} + +void RecvTime(ArgStruct *p, double *t) +{ + uint32_t ltime, ntime; + int bytesRead; + + bytesRead = readFully(p->commfd, (void *)&ntime, sizeof(uint32_t)); + if (bytesRead < 0) + { + printf("NetPIPE: read failed in RecvTime: errno=%d\n", errno); + exit(302); + } + else if (bytesRead != sizeof(uint32_t)) + { + fprintf(stderr, "NetPIPE: partial read in RecvTime of %d bytes\n", + bytesRead); + exit(303); + } + ltime = ntohl(ntime); + + /* Result is ltime (in microseconds) divided by 1.0e6 to get seconds */ + *t = (double)ltime / 1.0e6; +} + +void SendRepeat(ArgStruct *p, int rpt) +{ + uint32_t lrpt, nrpt; + + lrpt = rpt; + /* Send repeat count as a long in network order */ + nrpt = htonl(lrpt); + if (write(p->commfd, (void *) &nrpt, sizeof(uint32_t)) < 0) + { + printf("NetPIPE: write failed in SendRepeat: errno=%d\n", errno); + exit(304); + } +} + +void RecvRepeat(ArgStruct *p, int *rpt) +{ + uint32_t lrpt, nrpt; + int bytesRead; + + bytesRead = readFully(p->commfd, (void
*)&nrpt, sizeof(uint32_t)); + if (bytesRead < 0) + { + printf("NetPIPE: read failed in RecvRepeat: errno=%d\n", errno); + exit(305); + } + else if (bytesRead != sizeof(uint32_t)) + { + fprintf(stderr, "NetPIPE: partial read in RecvRepeat of %d bytes\n", + bytesRead); + exit(306); + } + lrpt = ntohl(nrpt); + + *rpt = lrpt; +} + +void establish(ArgStruct *p) +{ + socklen_t clen; /* accept() takes a socklen_t *, not an int * */ + int one = 1; + struct protoent; + + clen = sizeof(p->prot.sin2); + if(p->tr){ + if(connect(p->commfd, (struct sockaddr *) &(p->prot.sin1), + sizeof(p->prot.sin1)) < 0){ + printf("Client: Cannot Connect! errno=%d\n",errno); + exit(-10); + } + } + else { + /* SERVER */ + listen(p->servicefd, 5); + p->commfd = accept(p->servicefd, (struct sockaddr *) &(p->prot.sin2), + &clen); + + if(p->commfd < 0){ + printf("Server: Accept Failed! errno=%d\n",errno); + exit(-12); + } + } +} + +void CleanUp(ArgStruct *p) +{ + char quit[5] = "QUIT"; /* writable buffer: read() below stores into it */ + if (p->tr) + { + write(p->commfd,quit, 5); + read(p->commfd, quit, 5); + close(p->commfd); + } + else + { + read(p->commfd,quit, 5); + write(p->commfd,quit,5); + close(p->commfd); + close(p->servicefd); + } + + finalizeIB(p); +} + + +void AfterAlignmentInit(ArgStruct *p) +{ + int bytesRead; + + /* Exchange buffer pointers and remote infiniband keys if doing rdma. Do + * the exchange in this function because this will happen after any + * memory alignment is done, which is important for getting the + * correct remote address.
+ */ + if( p->prot.commtype == NP_COMM_RDMAWRITE || + p->prot.commtype == NP_COMM_RDMAWRITE_WITH_IMM ) { + + /* Send my receive buffer address + */ + if(write(p->commfd, (void *)&p->r_buff, sizeof(void*)) < 0) { + perror("NetPIPE: write of buffer address failed in AfterAlignmentInit"); + exit(-1); + } + + LOGPRINTF("Sent buffer address: %p\n", p->r_buff); + + /* Send my remote key for accessing + * my remote buffer via IB RDMA + */ + if(write(p->commfd, (void *)&r_mr_hndl->rkey, sizeof(uint32_t)) < 0) { + perror("NetPIPE: write of remote key failed in AfterAlignmentInit"); + exit(-1); + } + + LOGPRINTF("Sent remote key: %d\n", r_mr_hndl->rkey); + + /* Read the sent data + */ + bytesRead = readFully(p->commfd, (void *)&remote_address, sizeof(void*)); + if (bytesRead < 0) { + perror("NetPIPE: read of buffer address failed in AfterAlignmentInit"); + exit(-1); + } else if (bytesRead != sizeof(void*)) { + perror("NetPIPE: partial read of buffer address in AfterAlignmentInit"); + exit(-1); + } + + LOGPRINTF("Received remote address from other node: %p\n", remote_address); + + bytesRead = readFully(p->commfd, (void *)&remote_key, sizeof(uint32_t)); + if (bytesRead < 0) { + perror("NetPIPE: read of remote key failed in AfterAlignmentInit"); + exit(-1); + } else if (bytesRead != sizeof(uint32_t)) { + perror("NetPIPE: partial read of remote key in AfterAlignmentInit"); + exit(-1); + } + + LOGPRINTF("Received remote key from other node: %d\n", remote_key); + + } +} + + +void MyMalloc(ArgStruct *p, int bufflen, int soffset, int roffset) +{ + /* Allocate buffers */ + + p->r_buff = malloc(bufflen+MAX(soffset,roffset)); + if(p->r_buff == NULL) { + fprintf(stderr, "Error malloc'ing buffer\n"); + exit(-1); + } + + if(p->cache) { + + /* Infiniband spec says we can register same memory region + * more than once, so just copy buffer address. We will register + * the same buffer twice with Infiniband. 
+ */ + p->s_buff = p->r_buff; + + } else { + + p->s_buff = malloc(bufflen+soffset); + if(p->s_buff == NULL) { + fprintf(stderr, "Error malloc'ing buffer\n"); + exit(-1); + } + + } + + /* Register buffers with Infiniband */ + + r_mr_hndl = ibv_reg_mr(pd_hndl, p->r_buff, bufflen + MAX(soffset, roffset), + IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE); + if(!r_mr_hndl) + { + fprintf(stderr, "Error registering recv buffer\n"); + exit(-1); + } + else + { + LOGPRINTF("Registered Recv Buffer\n"); + } + + s_mr_hndl = ibv_reg_mr(pd_hndl, p->s_buff, bufflen+soffset, IBV_ACCESS_LOCAL_WRITE); + if(!s_mr_hndl) { + fprintf(stderr, "Error registering send buffer\n"); + exit(-1); + } else { + LOGPRINTF("Registered Send Buffer\n"); + } + +} +void FreeBuff(char *buff1, char *buff2) +{ + int ret; + + if(s_mr_hndl) { + LOGPRINTF("Deregistering send buffer\n"); + ret = ibv_dereg_mr(s_mr_hndl); + if(ret) { + fprintf(stderr, "Error deregistering send mr\n"); + } else { + s_mr_hndl = NULL; + } + } + + if(r_mr_hndl) { + LOGPRINTF("Deregistering recv buffer\n"); + ret = ibv_dereg_mr(r_mr_hndl); + if(ret) { + fprintf(stderr, "Error deregistering recv mr\n"); + } else { + r_mr_hndl = NULL; + } + } + + if(buff1 != NULL) + free(buff1); + + if(buff2 != NULL) + free(buff2); +} + --- NetPIPE_3.6.2.orig/src/netpipe.c 2004-06-22 12:38:41.000000000 -0700 +++ NetPIPE_3.6.2/src/netpipe.c 2005-03-15 12:36:44.000000000 -0800 @@ -142,7 +142,7 @@ case 's': streamopt = 1; printf("Streaming in one direction only.\n\n"); -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! defined(INFINIBAND) && !defined(OPENIB) printf("Sockets are reset between trials to avoid\n"); printf("degradation from a collapsing window size.\n\n"); #endif @@ -168,7 +168,7 @@ case 'u': end = atoi(optarg); break; -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! 
defined(INFINIBAND) && !defined(OPENIB) case 'b': /* -b # resets the buffer size, -b 0 keeps system defs */ args.prot.sndbufsz = args.prot.rcvbufsz = atoi(optarg); break; @@ -178,7 +178,7 @@ /* end will be maxed at sndbufsz+rcvbufsz */ printf("Passing data in both directions simultaneously.\n"); printf("Output is for the combined bandwidth.\n"); -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! defined(INFINIBAND) && !defined(OPENIB) printf("The socket buffer size limits the maximum test size.\n\n"); #endif if( streamopt ) { @@ -270,7 +270,29 @@ exit(-1); } break; +#endif + +#if defined(OPENIB) + case 'm': switch(atoi(optarg)) { + case 256: args.prot.ib_mtu = IBV_MTU_256; + break; + case 512: args.prot.ib_mtu = IBV_MTU_512; + break; + case 1024: args.prot.ib_mtu = IBV_MTU_1024; + break; + case 2048: args.prot.ib_mtu = IBV_MTU_2048; + break; + case 4096: args.prot.ib_mtu = IBV_MTU_4096; + break; + default: + fprintf(stderr, "Invalid MTU size, must be one of " + "256, 512, 1024, 2048, 4096\n"); + exit(-1); + } + break; +#endif +#if defined(OPENIB) || defined(INFINIBAND) case 't': if( !strcmp(optarg, "send_recv") ) { printf("Using Send/Receive communications\n"); args.prot.commtype = NP_COMM_SENDRECV; @@ -317,7 +339,7 @@ case 'n': nrepeat_const = atoi(optarg); break; -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! defined(INFINIBAND) && !defined(OPENIB) case 'r': args.reset_conn = 1; printf("Resetting connection after every trial\n"); break; @@ -331,7 +353,7 @@ #endif /* ! defined TCGMSG */ -#if defined(INFINIBAND) +#if defined(OPENIB) || defined(INFINIBAND) asyncReceive = 1; fprintf(stderr, "Preposting asynchronous receives (required for Infiniband)\n"); if(args.bidir && ( @@ -377,7 +399,7 @@ end = args.upper; if( args.tr ) { printf("The upper limit is being set to %d Bytes\n", end); -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! 
defined(INFINIBAND) && !defined(OPENIB) printf("due to socket buffer size limitations\n\n"); #endif } } @@ -990,7 +1012,7 @@ void PrintUsage() { printf("\n NETPIPE USAGE \n\n"); -#if ! defined(INFINIBAND) +#if ! defined(INFINIBAND) && !defined(OPENIB) printf("a: asynchronous receive (a.k.a. preposted receive)\n"); #endif printf("B: burst all preposts before measuring performance\n"); @@ -998,7 +1020,7 @@ printf("b: specify TCP send/receive socket buffer sizes\n"); #endif -#if defined(INFINIBAND) +#if defined(INFINIBAND) || defined(OPENIB) printf("c: specify type of completion <-c type>\n" " valid types: local_poll, vapi_poll, event\n" " default: local_poll\n"); @@ -1010,7 +1032,7 @@ printf(" all MPI-2 implementations\n"); #endif -#if defined(TCP) || defined(INFINIBAND) +#if defined(TCP) || defined(INFINIBAND) || defined(OPENIB) printf("h: specify hostname of the receiver <-h host>\n"); #endif @@ -1019,7 +1041,7 @@ printf("i: Do an integrity check instead of measuring performance\n"); printf("l: lower bound start value e.g. <-l 1>\n"); -#if defined(INFINIBAND) +#if defined(INFINIBAND) || defined(OPENIB) printf("m: set MTU for Infiniband adapter <-m mtu_size>\n"); printf(" valid sizes: 256, 512, 1024, 2048, 4096 (default 1024)\n"); #endif @@ -1030,7 +1052,7 @@ printf("p: set the perturbation number <-p 1>\n" " (default = 3 Bytes, set to 0 for no perturbations)\n"); -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! 
defined(INFINIBAND) && !defined(OPENIB) printf("r: reset sockets for every trial\n"); #endif @@ -1039,7 +1061,7 @@ printf("S: Use synchronous sends.\n"); #endif -#if defined(INFINIBAND) +#if defined(INFINIBAND) || defined(OPENIB) printf("t: specify type of communications <-t type>\n" " valid types: send_recv, send_recv_with_imm,\n" " rdma_write, rdma_write_with_imm\n" @@ -1056,7 +1078,7 @@ #if defined(MPI) printf(" May need to use -a to choose asynchronous communications for MPI/n"); #endif -#if defined(TCP) && !defined(INFINIBAND) +#if defined(TCP) && !defined(INFINIBAND) && !defined(OPENIB) printf(" The maximum test size is limited by the TCP buffer size/n"); #endif printf("\n"); @@ -1131,7 +1153,7 @@ memset(p->s_buff, 'b', nbytes+soffset); } -#if !defined(INFINIBAND) && !defined(ARMCI) && !defined(LAPI) && !defined(GPSHMEM) && !defined(SHMEM) && !defined(GM) +#if !defined(OPENIB) && !defined(INFINIBAND) && !defined(ARMCI) && !defined(LAPI) && !defined(GPSHMEM) && !defined(SHMEM) && !defined(GM) void MyMalloc(ArgStruct *p, int bufflen, int soffset, int roffset) { --- NetPIPE_3.6.2.orig/src/netpipe.h 2004-06-22 12:38:41.000000000 -0700 +++ NetPIPE_3.6.2/src/netpipe.h 2005-03-14 16:20:30.000000000 -0800 @@ -27,6 +27,10 @@ #include /* ib_mtu_t */ #endif +#ifdef OPENIB +#include /* enum ibv_mtu */ +#endif + #ifdef FINAL #define TRIALS 7 #define RUNTM 0.25 @@ -73,9 +77,14 @@ int commtype; /* Communications type */ int comptype; /* Completion type */ #endif +#if defined(OPENIB) + enum ibv_mtu ib_mtu; /* MTU Size for Infiniband HCA */ + int commtype; /* Communications type */ + int comptype; /* Completion type */ +#endif }; -#if defined(INFINIBAND) +#if defined(INFINIBAND) || defined(OPENIB) enum completion_types { NP_COMP_LOCALPOLL, /* Poll locally on last byte of data */ NP_COMP_VAPIPOLL, /* Poll using vapi function */ From halr at voltaire.com Tue Mar 15 14:22:38 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Mar 2005 17:22:38 -0500 Subject: [Fwd: Re: 
[openib-general] Solaris IPoIB MTU with OpenSM] In-Reply-To: <1110921356.7768.447.camel@sr1-umpk-01> References: <1109969601.4648.32.camel@erez-s.us.voltaire.com> <1110921356.7768.447.camel@sr1-umpk-01> Message-ID: <1110925357.4662.682.camel@localhost.localdomain> Hi Nitin, On Tue, 2005-03-15 at 16:15, Nitin Hande wrote: > This is cool, I have got Solaris IPoIB happily working with the > OpenSM now. It plumbs, pings and snoops on 0xffff pkey. Great. That's good news. I'll work on a real fix for this now. > On other hand, on my linux node, if I try to use 8001 partition and > configure IB interface with IP addr (same time while ib0 is using 0xffff > pkey), I get the following error, you may want to investigate that.... > > [root at flopteron2 ~]# echo 0x8001 > /sys/class/net/ib0/create_child > [root at flopteron2 ~]# ifconfig ib0.8001 10.10.1.1 > [root at flopteron2ib0.8001: multicast join failed for > ff12:401b:8001:0:0:0:ffff:ffff, status -22 > ~]# ib0.8001: multicast join failed for ff12:401b:8001:0:0:0:ffff:ffff, > status -22 I will look into this but I suspect this is caused by the response to some request in the join "flow" to be more than 1 RMPP packet. Remember that OpenSM is currently hamstrung in this manner until there is sufficient RMPP for SA GetTableResps. Thanks. -- Hal From robert.j.woodruff at intel.com Tue Mar 15 14:25:50 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 15 Mar 2005 14:25:50 -0800 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs Message-ID: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> Roland> I just spent a little time creating a new "ibv" module for NetPIPE >that runs on top of the userspace verbs I've been developing on the >roland-uverbs branch. Cool. this will be very useful. Any idea if/when the netpipe folks will release a version of netpipe that has this patch included ? 
woody From roland at topspin.com Tue Mar 15 15:03:09 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 15 Mar 2005 15:03:09 -0800 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> (Robert J. Woodruff's message of "Tue, 15 Mar 2005 14:25:50 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> Message-ID: <52fyywwlzm.fsf@topspin.com> Robert> Cool. this will be very useful. Any idea if/when the Robert> netpipe folks will release a version of netpipe that has Robert> this patch included ? That's up to the netpipe folks. Posting the patch was the first contact I've made beyond downloading the source yesterday. It might be reasonable to wait until the APIs are a little more frozen and the support has landed on the OpenIB trunk (as I said, userspace verbs are still only on the roland-uverbs branch). I would estimate a time frame on the order of weeks for that to happen. - R. From robert.j.woodruff at intel.com Tue Mar 15 15:11:21 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 15 Mar 2005 15:11:21 -0800 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <52fyywwlzm.fsf@topspin.com> Message-ID: >It might be reasonable to wait until the APIs are a little more frozen >and the support has landed on the OpenIB trunk (as I said, userspace >verbs are still only on the roland-uverbs branch). I would estimate a >time frame on the order of weeks for that to happen. > - R. Good point, probably a little early for them to start to integrate until things settle down and the usermode verbs move to the trunk. On another note, Arlin says he is making good progress on the uDAPL port so we should have another test vehicle for the user-mode verbs pretty soon. Any idea when the user-mode CM support will show up ? 
From halr at voltaire.com Tue Mar 15 15:16:41 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Mar 2005 18:16:41 -0500 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: References: Message-ID: <1110928601.4662.721.camel@localhost.localdomain> On Tue, 2005-03-15 at 18:11, Bob Woodruff wrote: > Any idea when the user-mode CM support will show up ? I think it should be there in about a couple of weeks. -- Hal From roland at topspin.com Tue Mar 15 17:01:51 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 15 Mar 2005 17:01:51 -0800 Subject: [openib-general] [PATCH] alignment check in reg_phys_mr In-Reply-To: <20050314144650.GF16749@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 Mar 2005 16:46:50 +0200") References: <20050314144650.GF16749@mellanox.co.il> Message-ID: <52br9kwghs.fsf@topspin.com> Thanks, applied. - R. From mst at mellanox.co.il Tue Mar 15 21:18:48 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Mar 2005 07:18:48 +0200 Subject: [openib-general] [PATCH] set lkey in mthca mpt entry In-Reply-To: <52hdjczwrt.fsf@topspin.com> References: <20050315162706.GG16749@mellanox.co.il> <52hdjczwrt.fsf@topspin.com> Message-ID: <20050316051848.GA3950@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] [PATCH] set lkey in mthca mpt entry > > Michael> lkey does not seem to be set in the mpt entry. does this > Michael> look right? > > You would know better but my docs say that the lkey field should be > set to 0 for SW2HW_MPT and is only used to refer to the original > region for memory windows. > > - R. > Correct, sorry. lkey is for query only, I confused it with memkey. -- MST - Michael S. 
Tsirkin From hozer at hozed.org Tue Mar 15 22:39:28 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Wed, 16 Mar 2005 00:39:28 -0600 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> Message-ID: <20050316063928.GV9768@kalmia.hozed.org> On Tue, Mar 15, 2005 at 02:25:50PM -0800, Woodruff, Robert J wrote: > > Roland> I just spent a little time creating a new "ibv" module for > NetPIPE > >that runs on top of the userspace verbs I've been developing on the > >roland-uverbs branch. > > Cool. this will be very useful. Any idea if/when the netpipe folks will > release a version of netpipe that has this patch included ? I'll ask Dave Turner what he wants to do about this.. Once I get it built and tested locally, I'll probably stick some results and a link up at http://scl.ameslab.gov/Projects/InfiniBand/ Sooo... what's the easiest way for me to test this if I have opterons with 2.6.11.4 kernels? (aka, just replace drivers/infiniband from the roland-uverbs branch? And does anyone have a clean way of building all the userspace stuff? What I've seen so far is pretty tedious) From mark_seuss at yahoo.com Tue Mar 15 22:39:31 2005 From: mark_seuss at yahoo.com (Mark Seuss) Date: Tue, 15 Mar 2005 22:39:31 -0800 (PST) Subject: [openib-general] How come the CM doesn't implement the state machine? Message-ID: <20050316063931.87938.qmail@web61306.mail.yahoo.com> I have a basic question about the CM. It looks like the gen2 CM doesn't implement the CM state machine as defined in the IB spec. It doesn't perform retransmissions, handle timeouts, etc. Is the current CM API intended as the final API, or is this just an intermediate step on the way to implementing a full CM such as the gen1 CM?
From paul.baxter at dsl.pipex.com Wed Mar 16 00:58:54 2005 From: paul.baxter at dsl.pipex.com (Paul Baxter) Date: Wed, 16 Mar 2005 08:58:54 -0000 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs References: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> <20050316063928.GV9768@kalmia.hozed.org> Message-ID: <001501c52a06$626c6b10$8000000a@blorp> From: "Troy Benjegerdes" > Once I get it built and tested locally, I'll probably stick some results > and a link up at http://scl.ameslab.gov/Projects/InfiniBand/ > > Sooo... what's the easiest way for me to test this if I have opterons > with 2.6.11.4 kernels? > > (aka, just replace drivers/infiniband from the roland-uverbs branch? And > does anyone have a clean way of building all the userspace stuff? What > I've seen so far is pretty tedious) Troy, While I appreciate your keenness, I think it's a little unfair to criticise the build status and organisation of code that is still being written and is subject to change. I'd far rather everyone gets a working core before worrying so much about how it might be packaged. That does need to be addressed, of course. Your comments at your URL regarding complexity and size of the software stack making progress slow are IMHO unfair to openib. They've worked hard on getting a streamlined set of functionality into the kernel and now need to finish off key parts of userspace support and only then 'package' it so that you will find it easier to compile and test. They will also need to get some reasonable documentation together (update the material at the sourceforge IB project?) and start adding other user-space/kernel functionality, but right now patience is a virtue :) PS I'm looking forward to another of your excellent writeups when you do get this working. I hope its current status doesn't colour or frustrate your view of this promising 'alpha' userspace code.
Regards Paul Baxter From halr at voltaire.com Wed Mar 16 08:16:43 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Mar 2005 11:16:43 -0500 Subject: [openib-general] Re: [openib-commits] r1895 - in gen1/trunk/src/userspace/osm: . complib iba opensm osmsh osmtest utils In-Reply-To: <20050224143819.15CE22284D7@openib.ca.sandia.gov> References: <20050224143819.15CE22284D7@openib.ca.sandia.gov> Message-ID: <1110989803.4662.2038.camel@localhost.localdomain> On Thu, 2005-02-24 at 09:38, eitan at openib.org wrote: > Author: eitan > Date: 2005-02-24 06:38:16 -0800 (Thu, 24 Feb 2005) > New Revision: 1895 > Log: > OpenSM Rev 1.8.0 Gen1 release Do you mean 1.7.0 rather than 1.8.0 release ? Thanks. -- Hal From roland at topspin.com Wed Mar 16 08:52:56 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 16 Mar 2005 08:52:56 -0800 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <20050316063928.GV9768@kalmia.hozed.org> (Troy Benjegerdes's message of "Wed, 16 Mar 2005 00:39:28 -0600") References: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> <20050316063928.GV9768@kalmia.hozed.org> Message-ID: <523buvwn13.fsf@topspin.com> Troy> Sooo... what's the easiest way for me to test this if I have Troy> opterons with 2.6.11.4 kernels? Troy> (aka, just replace drivers/infiniband from the roland-uverbs Troy> branch? And does anyone have a clean way of building all the Troy> userspace stuff? What I've seen so far is pretty tedious) Yes, the roland-uverbs src/linux-kernel/infiniband directory should just drop in and replace the existing drivers/infiniband. You'll want to turn on CONFIG_INFINIBAND_USER_VERBS in your config (a new option) to enable userspace verbs, load the ib_uverbs module (if you don't build support into your kernel), and create /dev/infiniband/uverbs device nodes (easiest way is to add KERNEL="uverbs*", NAME="infiniband/%k", MODE="0666" to your udev rules). 
To build the userspace verbs support, you just need to build libibverbs and libmthca libraries (using the usual "./autogen.sh && ./configure && make && make install" recipe). I agree that the management subdirectory has a few too many little pieces right now, but it's not needed if you already have a subnet manager running somewhere. - R. From mst at mellanox.co.il Wed Mar 16 08:58:41 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Mar 2005 18:58:41 +0200 Subject: [openib-general] userspace doorbells In-Reply-To: <521xanbbi7.fsf@topspin.com> References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> <521xanbbi7.fsf@topspin.com> Message-ID: <20050316165841.GP16749@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ANNOUNCE: First usable version of userspace verbs > > Michael> I think I have discovered the problem. It seems that with > Michael> -O3 my compiler may reorder the WQE (and possibly CQE) > Michael> write with respect to the doorbell. This won't happen on > Michael> i386 with consistent i/o ordering since the doorbell is > Michael> done in assembly, and probably not on other 32 bit > Michael> architectures since the mutex is likely to include a > Michael> memory barrier. > > Michael> Applying the following patch fixes the problem for me for > Michael> x86_64. > > Thanks for diagnosing this. I think I want to work on a more general > fix though. > > - R. > Roland, I see you have made the doorbell page volatile. This makes sense, and must be enough on x86_64, but for this to work on PPC, won't you still need to insert a write memory barrier, to guard against the CPU re-ordering writes to hardware and to the WQE? Since you do it in kernel, why not in userspace? -- MST - Michael S.
Tsirkin From roland at topspin.com Wed Mar 16 09:01:06 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 16 Mar 2005 09:01:06 -0800 Subject: [openib-general] Re: userspace doorbells In-Reply-To: <20050316165841.GP16749@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 16 Mar 2005 18:58:41 +0200") References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> <521xanbbi7.fsf@topspin.com> <20050316165841.GP16749@mellanox.co.il> Message-ID: <52oedjv831.fsf@topspin.com> Michael> Roland, I see you have made the doorbell page volatile. Michael> This makes sense, and must be enough on x86_64, but for Michael> this to work on PPC, won't you still need to insert a Michael> write memory barrier, to guard against the CPU Michael> re-ordering writes to hardware and to the WQE? Since you Michael> do it in kernel, why not in userspace? I'm working on it... see the file I added to libibverbs for the start of my plan. - R. From halr at voltaire.com Wed Mar 16 08:59:24 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Mar 2005 11:59:24 -0500 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <523buvwn13.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> <20050316063928.GV9768@kalmia.hozed.org> <523buvwn13.fsf@topspin.com> Message-ID: <1110992364.4662.2070.camel@localhost.localdomain> On Wed, 2005-03-16 at 11:52, Roland Dreier wrote: > I agree that the > management subdirectory has a few too many little pieces right now, > but it's not needed if you already have a subnet manager running > somewhere. And you don't need all the pieces if all you want to do is run OpenSM and don't care about the diagnostics.
-- Hal From mshefty at ichips.intel.com Wed Mar 16 09:25:49 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 Mar 2005 09:25:49 -0800 Subject: [openib-general] How come the CM doesn't implement the state machine? In-Reply-To: <20050316063931.87938.qmail@web61306.mail.yahoo.com> References: <20050316063931.87938.qmail@web61306.mail.yahoo.com> Message-ID: <42386C1D.4000206@ichips.intel.com> Mark Seuss wrote: > I have a basic question about the CM. It looks like the gen2 > CM doesn't implement the CM state machine as defined in the IB spec. The gen2 CM implements the state machine as defined by the IB spec. The states are defined in ib_cm.h, and the CM uses these when processing sent or received MADs and to handle timewait. > It doesn't perform retransmissions, handle timeouts, etc. Is the The CM will retry requests and perform timeouts. See cm_process_send_timeout(). > current CM API intended as the final API, or is this just an > intermediate step on the way to implementing a full CM such as the > gen1 CM? The gen2 CM is a full CM. It does have some missing functionality, but nothing that should prevent it from operating. - Sean From mkowalski01 at gmail.com Wed Mar 16 09:30:43 2005 From: mkowalski01 at gmail.com (mark kowalski) Date: Wed, 16 Mar 2005 11:30:43 -0600 Subject: [openib-general] failure using second hca via udapl Message-ID: I've been writing some code using udapl and recently added a second hca to my machine. 
Both hca's are mellanox cards: 0000:02:03.0 PCI bridge: Mellanox Technology MT23108 InfiniHost HCA bridge (rev a1) 0000:03:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost HCA (rev a1) 0000:04:04.0 PCI bridge: Mellanox Technology MT23108 InfiniHost HCA bridge (rev a0) 0000:05:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost HCA (rev a0) Both cards are recognized and seem to initialize fine: Mar 15 17:40:37 kernel: Mellanox Tavor Device Driver is creating device "InfiniHost0" (bus=03, devfn=00) Mar 15 17:40:37 kernel: Mellanox Tavor Device Driver is creating device "InfiniHost1" (bus=05, devfn=00) The problem is that when I try to access ports on the second hca I get this failure: EVAPI_k_get_qp_hndl returns -244 (Invalid HCA Handle.) tsIbUCmAccept failed: -5 I noticed this comment in the tsIbUCmAccept routine: /* FIXME: Don't hardcode HCA handle for EVAPI_k_get_qp_hndl and _tsIbUQpRegister */ after it set the qp_handle variable to 0. I modified the code to pass the hca_handle that is input to the tsIbUCmAccept function (VAPI_hca_hndl_t hca_handle) on the call to EVAPI_k_get_qp_hndl instead of the qp_handle variable. (I put some print statements in the hca initialization code and the hca_handle for the first hca was 0 and the hca_handle for the second hca was 1, so this seemed like a reasonable thing to do since the hca_handle is an index into the hca_tbl). Anyway, I get past the EVAPI_k_get_qp_hndl failure but instead I get this failure: kernel: [KERNEL_IB][tsIbCmUserAccept][/var/tmp/IBGD/lib/modules/2.6.4-52-smp/build/drivers/infiniband/core/useraccess_cm.c:877]EVAPI_k_set_destroy_qp_cbk, return code = -251 (Resource is busy) in routine tsIbCmUserAccept in the device driver code during the call to this routine: EVAPI_k_set_destroy_qp_cbk.
I've gone through the initialization code and it seems that everything that is done for the first hca is done for the second so it would seem that once I passed in the correct hca_tbl index everything should work, but it doesn't. Anyone have 2 hca's working via udapl out there? Thanks, Mark Kowalski From mshefty at ichips.intel.com Wed Mar 16 09:34:22 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 Mar 2005 09:34:22 -0800 Subject: [openib-general] [PATCH] [MAD] API changes and updates to support RMPP In-Reply-To: <20050314162258.1bedff07.mshefty@ichips.intel.com> References: <20050314162258.1bedff07.mshefty@ichips.intel.com> Message-ID: <42386E1E.1070701@ichips.intel.com> Sean Hefty wrote: > This patch updates the MAD API to help provide support for the RMPP > implementation and clients. Notable changes: > > * A valid memory region (MR) is returned as part of the mad_agent > registration process. The agent, CM, and SA query modules were > updated to use the returned MR. > * A list_head structure was added to ib_mad_recv_wc to make walking > the list of received MAD buffers easier. As part of this change, a > bug was fixed where freed memory could have been accessed in > ib_free_recv_mad() if RMPP were enabled. This change is unlikely > to affect existing clients. If no one objects, I will commit these changes later today, so I can push in the RMPP changes. - Sean From eitan at mellanox.co.il Wed Mar 16 09:47:39 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 16 Mar 2005 19:47:39 +0200 Subject: [openib-general] RE: [openib-commits] r1895 - in gen1/trunk/src/userspace/osm: . complib iba opensm osmsh osmtest utils Message-ID: <506C3D7B14CDD411A52C00025558DED6047EEF9F@mtlex01.yok.mtl.com> The OpenSM of the IBGD 1.7.0 is named 1.8.0 due to the many bug fixes it has. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O.
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, March 16, 2005 6:17 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [openib-commits] r1895 - in gen1/trunk/src/userspace/osm: . complib iba > opensm osmsh osmtest utils > > On Thu, 2005-02-24 at 09:38, eitan at openib.org wrote: > > Author: eitan > > Date: 2005-02-24 06:38:16 -0800 (Thu, 24 Feb 2005) > > New Revision: 1895 > > > Log: > > OpenSM Rev 1.8.0 Gen1 release > > Do you mean 1.7.0 rather than 1.8.0 release ? > > Thanks. > > -- Hal From mkowalski01 at gmail.com Wed Mar 16 15:15:43 2005 From: mkowalski01 at gmail.com (mark kowalski) Date: Wed, 16 Mar 2005 17:15:43 -0600 Subject: [openib-general] how to turn on kernel level tracing (mtl_log) Message-ID: I've been trying, unfortunately without success, to turn on tracing within the kernel components of openib. I've used the logset mtl_log_dbg_print command to toggle the "debug_print" variable used in the mtl_log command. I've also used logset to set the list of severities to be printed to 8 (12345678). I've also turned on printing for any debug or error messages (MTL_DEBUG and MTL_ERROR). I even went so far as to add print info structure records for every module_name I could grep in the source code (VIP, HCA, VIPKL, etc). None of this had any effect on getting any kind of trace records printed out of the kernel. The only messages I got were sev 1 messages when I shut the system down.
<4>***** mtl_log:DEBUG: layer 'VIPKL', type 2, sev '1' <4>***** mtl_log:DEBUG: Found layer 'VIPKL', Name="error", sev="12345678" <4>***** mtl_log:DEBUG: print string <1> VIPKL(1): var/tmp/IBGD/lib/modules/2.6.4-52-smp/build/drivers/infiniband/hw/mellanox-hca/vipkl/em.c[87]: EM delete:found unreleased async object <4>***** mtl_log:DEBUG: layer 'VIPKL', type 2, sev '1' <4>***** mtl_log:DEBUG: Found layer 'VIPKL', Name="error", sev="12345678" <4>***** mtl_log:DEBUG: print string <1> VIPKL(1): var/tmp/IBGD/lib/modules/2.6.4-52-smp/build/drivers/infiniband/hw/mellanox-hca/vipkl/em.c[87]: EM delete:found unreleased async object Is the code originally compiled with the MAX_TRACE variable set to 1? Is there a way this could be changed or bypassed without recompiling the source? I also added the MTL_LOG environment variable to get non-kernel messages but that didn't produce any messages either. any help would be appreciated. Thanks Mark From sean.hefty at intel.com Wed Mar 16 15:39:13 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 16 Mar 2005 15:39:13 -0800 Subject: [openib-general] [PATCH] [MAD] API changes and updates to supportRMPP In-Reply-To: <42386E1E.1070701@ichips.intel.com> Message-ID: >> This patch updates the MAD API to help provide support for the RMPP >> implementation and clients. Notable changes: >> >> * A valid memory region (MR) is returned as part of the mad_agent >> registration process. The agent, CM, and SA query modules were >> updated to use the returned MR. >> * A list_head structure was added to ib_mad_recv_wc to make walking >> the list of received MAD buffers easier. As part of this change, a >> bug was fixed where freed memory could have been accessed in >> ib_free_recv_mad() if RMPP were enabled. This change is unlikely >> to affect existing clients. > >If no one objects, I will commit these changes later today, so I can >push in the RMPP changes. I've committed these changes. 
A patch for RMPP receive handling will be posted as soon as I finish removing the bugs that I so carefully designed into it. - Sean From hozer at hozed.org Wed Mar 16 20:58:17 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Wed, 16 Mar 2005 22:58:17 -0600 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <52k6o8wom2.fsf@topspin.com> References: <52k6o8wom2.fsf@topspin.com> Message-ID: <20050317045817.GZ9768@kalmia.hozed.org> On Tue, Mar 15, 2005 at 02:06:29PM -0800, Roland Dreier wrote: > I just spent a little time creating a new "ibv" module for NetPIPE > that runs on top of the userspace verbs I've been developing on the > roland-uverbs branch. This is pretty much a straight port of the > current Mellanox VAPI "ib" module, with the main changes coming from > the fact that OpenIB doesn't support the non-standard "unsignaled > receive" extension, and the fact that a completion event thread is no > longer created automatically. > > I found several bugs in the verbs support while making this work, but > it seems quite stable now, although I haven't tried all option > combinations. I also have not had a chance to compare Mellanox VAPI > and OpenIB verbs performance on identical hardware -- it would be very > useful to see this comparison on a variety of systems. > I'm having trouble building opensm et al. from roland-uverb...
(and I can't really test NetPIPE without an SM ) .o -o opensm -L/afs/scl/project/infiniband/openib/roland-uverbs/src/lib -lpthread /afs/scl/project/infiniband/openib/roland-uverbs/src/lib/libumad.so /afs/scl/project/infiniband/openib/roland-uverbs/src/lib/libcomplib.so /afs/scl/project/infiniband/openib/roland-uverbs/src/lib/libcommon.so -Wl,--rpath -Wl,/afs/scl/project/infiniband/openib/roland-uverbs/src/lib -Wl,--rpath -Wl,/afs/scl/project/infiniband/openib/roland-uverbs/src/lib osm_switch.o(.text+0x25): In function `osm_switch_init': /afs/scl/project/infiniband/openib/roland-uverbs/src/userspace/management/osm/opensm/osm_switch.c:111: multiple definition of `no symbol' osm_switch.o(.text+0x0):/afs/scl/project/infiniband/openib/roland-uverbs/src/userspace/management/osm/opensm/osm_switch.c:98: first defined here /usr/bin/ld: Warning: size of symbol `' changed from 37 in osm_switch.o to 242 in osm_switch.o osm_switch.o(.text+0x117): In function `osm_switch_destroy': /afs/scl/project/infiniband/openib/roland-uverbs/src/userspace/management/osm/opensm/osm_switch.c:163: multiple definition of `no symbol' osm_switch.o(.text+0x0):/afs/scl/project/infiniband/openib/roland-uverbs/src/userspace/management/osm/opensm/osm_switch.c:98: first defined here From hozer at hozed.org Wed Mar 16 21:10:58 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Wed, 16 Mar 2005 23:10:58 -0600 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <20050317045817.GZ9768@kalmia.hozed.org> References: <52k6o8wom2.fsf@topspin.com> <20050317045817.GZ9768@kalmia.hozed.org> Message-ID: <20050317051058.GA9768@kalmia.hozed.org> Note to self: check build twice in case your filesystem has bogons. it builds fine. I must have had a bogus object file. But, does this mean the port is up, or only that the physical link is active? 
troy at opteron1:/afs/scl/project/infiniband/openib/roland-uverbs/src/bin$ ./ibstat CA 'mthca0' CA type: MT23108 Number of ports: 2 Firmware version: 2.0.0 Hardware version: a1 Node GUID: 0x0002c90108cd8ba0 System image GUID: 0x0002c90108cd8ba3 Port 1: State: Active Physical state: LinkUp Rate: 10 Base lid: 1 LMC: 0 SM lid: 1 Capability mask: 0x00100a6a Port GUID: 0x0002c90108cd8ba1 On Wed, Mar 16, 2005 at 10:58:17PM -0600, Troy Benjegerdes wrote: > On Tue, Mar 15, 2005 at 02:06:29PM -0800, Roland Dreier wrote: > > I just spent a little time creating a new "ibv" module for NetPIPE > > that runs on top of the userspace verbs I've been developing on the > > roland-uverbs branch. This is pretty much a straight port of the > > current Mellanox VAPI "ib" module, with the main changes coming from > > the fact that OpenIB doesn't support the non-standard "unsignaled > > receive" extension, and the fact that a completion event thread is no > > longer created automatically. > > > > I found several bugs in the verbs support while making this work, but > > it seems quite stable now, although I haven't tried all option > > combinations. I also have not had a chance to compare Mellanox VAPI > > and OpenIB verbs performance on identical hardware -- it would be very > > useful to see this comparison on a variety of systems. > > > > I'm having trouble building opensm et all from roland-uverb... 
(and I > can't really test NetPIPE without an SM ) > > .o -o opensm -L/afs/scl/project/infiniband/openib/roland-uverbs/src/lib > -lpthread > /afs/scl/project/infiniband/openib/roland-uverbs/src/lib/libumad.so > /afs/scl/project/infiniband/openib/roland-uverbs/src/lib/libcomplib.so > /afs/scl/project/infiniband/openib/roland-uverbs/src/lib/libcommon.so > -Wl,--rpath -Wl,/afs/scl/project/infiniband/openib/roland-uverbs/src/lib > -Wl,--rpath -Wl,/afs/scl/project/infiniband/openib/roland-uverbs/src/lib > osm_switch.o(.text+0x25): In function `osm_switch_init': > /afs/scl/project/infiniband/openib/roland-uverbs/src/userspace/management/osm/opensm/osm_switch.c:111: > multiple definition of `no symbol' > osm_switch.o(.text+0x0):/afs/scl/project/infiniband/openib/roland-uverbs/src/userspace/management/osm/opensm/osm_switch.c:98: > first defined here > /usr/bin/ld: Warning: size of symbol `' changed from 37 in osm_switch.o > to 242 in osm_switch.o > osm_switch.o(.text+0x117): In function `osm_switch_destroy': > /afs/scl/project/infiniband/openib/roland-uverbs/src/userspace/management/osm/opensm/osm_switch.c:163: > multiple definition of `no symbol' > osm_switch.o(.text+0x0):/afs/scl/project/infiniband/openib/roland-uverbs/src/userspace/management/osm/opensm/osm_switch.c:98: > first defined here > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- -------------------------------------------------------------------------- Troy Benjegerdes 'da hozer' hozer at hozed.org Somone asked my why I work on this free (http://www.fsf.org/philosophy/) software stuff and not get a real job. Charles Shultz had the best answer: "Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. 
It's my life." -- Charles Shultz From hozer at hozed.org Wed Mar 16 21:20:03 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Wed, 16 Mar 2005 23:20:03 -0600 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <20050317045817.GZ9768@kalmia.hozed.org> References: <52k6o8wom2.fsf@topspin.com> <20050317045817.GZ9768@kalmia.hozed.org> Message-ID: <20050317052003.GB9768@kalmia.hozed.org> Okay, it's late. I have NetPIPE TCP running on ipoib; the peak is around 1.9 gigabits/sec or so with no socket buffer tuning (this means I have an SM and such). Tomorrow I will try NetPIPE on the uverbs. From itamar at mellanox.co.il Thu Mar 17 00:27:50 2005 From: itamar at mellanox.co.il (Itamar Rabenstein) Date: Thu, 17 Mar 2005 10:27:50 +0200 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs Message-ID: <91DB792C7985D411BEC300B40080D29CC35797@mtvex01.mtv.mtl.com> Just a small thing: FW version 2.0.0 is very old. You should upgrade it to the latest, which is now 3.3.2. -Itamar > -----Original Message----- > From: Troy Benjegerdes [mailto:hozer at hozed.org] > Sent: Thursday, March 17, 2005 7:11 AM > To: Roland Dreier > Cc: openib-general at openib.org > Subject: Re: [openib-general] Port of NetPIPE-3.6.2 to OpenIB > userspace > verbs > > > Note to self: check build twice in case your filesystem has bogons. > > it builds fine. I must have had a bogus object file. But, > does this mean > the port is up, or only that the physical link is active?
> > troy at opteron1:/afs/scl/project/infiniband/openib/roland-uverbs > /src/bin$ ./ibstat > CA 'mthca0' > CA type: MT23108 > Number of ports: 2 > Firmware version: 2.0.0 > Hardware version: a1 > Node GUID: 0x0002c90108cd8ba0 > System image GUID: 0x0002c90108cd8ba3 > Port 1: > State: Active > Physical state: LinkUp > Rate: 10 > Base lid: 1 > LMC: 0 > SM lid: 1 > Capability mask: 0x00100a6a > Port GUID: 0x0002c90108cd8ba1 > > > On Wed, Mar 16, 2005 at 10:58:17PM -0600, Troy Benjegerdes wrote: > > On Tue, Mar 15, 2005 at 02:06:29PM -0800, Roland Dreier wrote: > > > I just spent a little time creating a new "ibv" module for NetPIPE > > > that runs on top of the userspace verbs I've been > developing on the > > > roland-uverbs branch. This is pretty much a straight port of the > > > current Mellanox VAPI "ib" module, with the main changes > coming from > > > the fact that OpenIB doesn't support the non-standard "unsignaled > > > receive" extension, and the fact that a completion event > thread is no > > > longer created automatically. > > > > > > I found several bugs in the verbs support while making > this work, but > > > it seems quite stable now, although I haven't tried all option > > > combinations. I also have not had a chance to compare > Mellanox VAPI > > > and OpenIB verbs performance on identical hardware -- it > would be very > > > useful to see this comparison on a variety of systems. > > > > > > > I'm having trouble building opensm et all from > roland-uverb... 
(and I > > can't really test NetPIPE without an SM ) > > > > .o -o opensm > -L/afs/scl/project/infiniband/openib/roland-uverbs/src/lib > > -lpthread > > /afs/scl/project/infiniband/openib/roland-uverbs/src/lib/libumad.so > > > /afs/scl/project/infiniband/openib/roland-uverbs/src/lib/libcomplib.so > > > /afs/scl/project/infiniband/openib/roland-uverbs/src/lib/libcommon.so > > -Wl,--rpath > -Wl,/afs/scl/project/infiniband/openib/roland-uverbs/src/lib > > -Wl,--rpath > -Wl,/afs/scl/project/infiniband/openib/roland-uverbs/src/lib > > osm_switch.o(.text+0x25): In function `osm_switch_init': > > > /afs/scl/project/infiniband/openib/roland-uverbs/src/userspace > /management/osm/opensm/osm_switch.c:111: > > multiple definition of `no symbol' > > > osm_switch.o(.text+0x0):/afs/scl/project/infiniband/openib/rol > and-uverbs/src/userspace/management/osm/opensm/osm_switch.c:98: > > first defined here > > /usr/bin/ld: Warning: size of symbol `' changed from 37 in > osm_switch.o > > to 242 in osm_switch.o > > osm_switch.o(.text+0x117): In function `osm_switch_destroy': > > > /afs/scl/project/infiniband/openib/roland-uverbs/src/userspace > /management/osm/opensm/osm_switch.c:163: > > multiple definition of `no symbol' > > > osm_switch.o(.text+0x0):/afs/scl/project/infiniband/openib/rol > and-uverbs/src/userspace/management/osm/opensm/osm_switch.c:98: > > first defined here > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > -- > -------------------------------------------------------------- > ------------ > Troy Benjegerdes 'da hozer' > hozer at hozed.org > > Somone asked my why I work on this free > (http://www.fsf.org/philosophy/) > software stuff and not get a real job. 
Charles Shultz had the > best answer: > > "Why do musicians compose symphonies and poets write poems? They do it > because life wouldn't have any meaning for them if they > didn't. That's why > I draw cartoons. It's my life." -- Charles Shultz > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Thu Mar 17 01:12:58 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 17 Mar 2005 11:12:58 +0200 Subject: [openib-general] User Level Events - request for support Message-ID: <506C3D7B14CDD411A52C00025558DED6047EEFA3@mtlex01.yok.mtl.com> I would like to propose for gen2 stack to have a user level API supporting registration for notifications on unaffiliated asynchronous events. As an example in cases when the local IB port link goes down an SM that runs on top of this port needs to be notified and start a sweep when the port is back up. Missing such event, in user land, prevents the SM from knowing about the change. Currently gen1 and gen2 OpenSM is not registered to get these events. The SM will then fail to reconfigure the subnet in cases like: 1. The SM cable connects to a switch and the user changes the switch port the SM is connected to. In this case the SM might be in the middle of a sweep and do not even notice (due to the short time the change takes) that there was a change. Traps coming from the switch are forwarded to the old port the SM was connected to and are dropped (as the port is down). As OpenSM does not know there was any change in the subnet topology, it will not perform a heavy sweep. 2. The switch connected to the SM is rebooted. If the reboot happens so fast that it falls between two light sweeps, OpenSM will not be able to know the switch was reset (as all SMPs are DR). 
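A toy model of the second case may make the race easier to see. This is only a sketch, not OpenSM code: the tick-based trace and the names port_state, polling_detects_change and events_detect_change are invented for illustration. It compares sampling the local port state once per light sweep with being notified of every transition:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the local port: down or active. */
enum port_state { PORT_DOWN, PORT_ACTIVE };

/*
 * Light-sweep model: the port state is sampled only every
 * sweep_interval ticks and compared with the state at tick 0.
 * Returns true if any sample differs from that first state.
 */
bool polling_detects_change(const enum port_state *trace, size_t n,
			    size_t sweep_interval)
{
	assert(trace != NULL);
	if (n == 0 || sweep_interval == 0)
		return false;

	enum port_state last = trace[0];
	for (size_t t = sweep_interval; t < n; t += sweep_interval) {
		if (trace[t] != last)
			return true;
	}
	return false;
}

/*
 * Event model: every transition between consecutive ticks is
 * reported, so a down/up flap of any length is seen.
 */
bool events_detect_change(const enum port_state *trace, size_t n)
{
	assert(trace != NULL);
	for (size_t t = 1; t < n; t++) {
		if (trace[t] != trace[t - 1])
			return true;
	}
	return false;
}
```

With a trace in which the port drops for a single tick and a sweep interval of ten ticks, the polling check reports no change while the event check does, which is the situation described in the two cases above.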
Although one can write special case code to handle these cases, due to the asynchronous nature of things there are many races that cannot be resolved without a simple "port up/down" event that the SM should register for. Hope this provides enough reason behind my request for a user-level event notification mechanism. Eitan From mst at mellanox.co.il Thu Mar 17 02:03:09 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 Mar 2005 12:03:09 +0200 Subject: [openib-general] Re: User Level Events - request for support In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EEFA3@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6047EEFA3@mtlex01.yok.mtl.com> Message-ID: <20050317100309.GT16749@mellanox.co.il> Quoting r. Eitan Zahavi : > Subject: User Level Events - request for support > > I would like to propose for gen2 stack to have a user level API supporting > registration for notifications on unaffiliated asynchronous events. > > As an example in cases when the local IB port link goes down an SM that runs on > top of this port needs to be notified and start a sweep when the port is back > up. Missing such event, in user land, prevents the SM from knowing about the > change. > > Currently gen1 and gen2 OpenSM is not registered to get these events. The SM > will then fail to reconfigure the subnet in cases like: > > 1. The SM cable connects to a switch and the user changes the switch port > the SM is connected to. In this case the SM might be in the middle of a sweep > and do not even notice (due to the short time the change takes) that there was > a change. Traps coming from the switch are forwarded to the old port the SM was > connected to and are dropped (as the port is down). As OpenSM does not know > there was any change in the subnet topology, it will not perform a heavy sweep. > > 2. The switch connected to the SM is rebooted.
If the reboot happens so > fast that it falls between two light sweeps, OpenSM will not be able to know > the switch was reset (as all SMPs are DR). > > Although one can write special case code to handle these cases, due to the > asynchronous nature of things there are many races that can not be resolved > without a simple "port up/down" event that the SM should register to. > > Hope this provides enough reason behind my request for a user-level event > notification mechanism. > > Eitan > uverbs already has this capability, of course. However I think I agree it might make sense to add this capability to umad. Since the SM is already blocking on read from umad, it might be simplest to make read return with an error code. Does this make sense? Alternatively, we could try using kobject_uevent. -- MST - Michael S. Tsirkin From mst at mellanox.co.il Thu Mar 17 02:24:55 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 Mar 2005 12:24:55 +0200 Subject: [openib-general] Re: userspace doorbells In-Reply-To: <52oedjv831.fsf@topspin.com> References: <52u0o4pfe8.fsf@topspin.com> <20050228154254.GB31510@mellanox.co.il> <52acpotwd9.fsf@topspin.com> <20050310104211.GF2586@mellanox.co.il> <521xanbbi7.fsf@topspin.com> <20050316165841.GP16749@mellanox.co.il> <52oedjv831.fsf@topspin.com> Message-ID: <20050317102455.GV16749@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: userspace doorbells > > Michael> Roland, I see you have made the doorbell page volatile. > Michael> This makes sense, and must be enough on x86_64, but for > Michael> this to work on PPC, won't you still need to insert a > Michael> write memory barrier, to guard against the CPU > Michael> re-ordering writes to hardware and to the WQE? Since you > Michael> do it in kernel, why not in userspace? > > I'm working on it... see the file I added to > libibverbs for the start of my plan. > > - R. > OK, makes sense.
I expect you'll need rmb for CQE polling and wmb for doorbells, just like we do for kernel code. Hmm, and I expect pthread_spin_lock may provide a generic barrier implementation: it seems pthread_spin_lock just has to include an rmb, and pthread_spin_unlock a wmb. -- MST - Michael S. Tsirkin From tziporet at mellanox.co.il Thu Mar 17 05:05:34 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 17 Mar 2005 15:05:34 +0200 Subject: [openib-general] how to turn on kernel level tracing (mtl_log) Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF16F@mtlex01.yok.mtl.com> Regarding the low level driver (VAPI) only: The kernel debug messages can be enabled only at compile time. To generate them one needs to compile with the variable MTCONF set to debug. Once it is compiled, you can choose which modules to see with the command: logset trace:1234 (or all). To see user-level errors/traces, one should export MTL_LOG="error:1234". If you compiled with debug enabled, you can also do: export MTL_LOG="error:1234 debug:1234 trace:1234" Tziporet -----Original Message----- From: mark kowalski [mailto:mkowalski01 at gmail.com] Sent: Thursday, March 17, 2005 1:16 AM To: openib-general at openib.org Subject: [openib-general] how to turn on kernel level tracing (mtl_log) I've been trying, without success unfortunately to turn on tracing within the kernel components of openib. I've used the logset mtl_log_dbg_print command to toggle the "debug_print" variable used in the mtl_log command. I've also used logset to set the list of severities to be printed to 8 (12345678). I've also turned on printing for any debug or error messages (MTL_DEBUG and MTL_ERROR). I even went so far as to add print info structure records for every module_name I could grep in the source code (VIP, HCA, VIPKL, etc). None of this had any effect on getting any kind of trace records printed out of the kernel. the only messages I got were sev 1 messages when I shut the system down.
<4>***** mtl_log:DEBUG: layer 'VIPKL', type 2, sev '1' <4>***** mtl_log:DEBUG: Found layer 'VIPKL', Name="error", sev="12345678" <4>***** mtl_log:DEBUG: print string <1> VIPKL(1): var/tmp/IBGD/lib/modules/2.6.4-52-smp/build/drivers/infiniband/hw/mellanox-hca/vipkl/em.c[87]: EM delete:found unreleased async object <4>***** mtl_log:DEBUG: layer 'VIPKL', type 2, sev '1' <4>***** mtl_log:DEBUG: Found layer 'VIPKL', Name="error", sev="12345678" <4>***** mtl_log:DEBUG: print string <1> VIPKL(1): var/tmp/IBGD/lib/modules/2.6.4-52-smp/build/drivers/infiniband/hw/mellanox-hca/vipkl/em.c[87]: EM delete:found unreleased async object Is the code originally compiled with the MAX_TRACE variable set to 1? Is there a way this could be changed or bypassed without recompiling the source? I also added the MTL_LOG environment variable to get non-kernel messages but that didn't produce any messages either. any help would be appreciated. Thanks Mark From halr at voltaire.com Thu Mar 17 05:06:06 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Mar 2005 08:06:06 -0500 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <20050317051058.GA9768@kalmia.hozed.org> References: <52k6o8wom2.fsf@topspin.com> <20050317045817.GZ9768@kalmia.hozed.org> <20050317051058.GA9768@kalmia.hozed.org> Message-ID: <1111064765.4662.2563.camel@localhost.localdomain> On Thu, 2005-03-17 at 00:10, Troy Benjegerdes wrote: > But, does this mean > the port is up, or only that the physical link is active? The port is up. The port state must reach Active (which is what "port is up" means), and that cannot happen unless the port physical state first reaches LinkUp.
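The relationship between the two fields can be written down as a small check. This is only a sketch: the helper names port_is_up and port_waiting_for_sm are made up for illustration, and the numeric values are the PortInfo PortState and PortPhysicalState encodings as I recall them from the IBA spec, so treat them as an assumption to verify.

```c
#include <assert.h>
#include <stdbool.h>

/* PortInfo.PortState values (assumed from the IBA spec). */
enum {
	PORT_STATE_DOWN   = 1,
	PORT_STATE_INIT   = 2,
	PORT_STATE_ARMED  = 3,
	PORT_STATE_ACTIVE = 4
};

/* PortInfo.PortPhysicalState values (assumed from the IBA spec). */
enum {
	PHYS_STATE_SLEEP    = 1,
	PHYS_STATE_POLLING  = 2,
	PHYS_STATE_DISABLED = 3,
	PHYS_STATE_TRAINING = 4,
	PHYS_STATE_LINK_UP  = 5,
	PHYS_STATE_RECOVERY = 6
};

/* "Port is up": the link is trained and the logical state is Active. */
bool port_is_up(int state, int phys_state)
{
	assert(state >= PORT_STATE_DOWN && state <= PORT_STATE_ACTIVE);
	return phys_state == PHYS_STATE_LINK_UP && state == PORT_STATE_ACTIVE;
}

/*
 * The physical link is up but the logical state has not reached
 * Active yet, i.e. the port is still waiting to be brought up by an SM.
 */
bool port_waiting_for_sm(int state, int phys_state)
{
	assert(state >= PORT_STATE_DOWN && state <= PORT_STATE_ACTIVE);
	return phys_state == PHYS_STATE_LINK_UP && state != PORT_STATE_ACTIVE;
}
```

For the ibstat output quoted in this thread (State: Active, Physical state: LinkUp), port_is_up() returns true; a port that shows LinkUp with a state of Initialize would instead be reported by port_waiting_for_sm().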
-- Hal > troy at opteron1:/afs/scl/project/infiniband/openib/roland-uverbs/src/bin$ ./ibstat > CA 'mthca0' > CA type: MT23108 > Number of ports: 2 > Firmware version: 2.0.0 > Hardware version: a1 > Node GUID: 0x0002c90108cd8ba0 > System image GUID: 0x0002c90108cd8ba3 > Port 1: > State: Active > Physical state: LinkUp > Rate: 10 > Base lid: 1 > LMC: 0 > SM lid: 1 > Capability mask: 0x00100a6a > Port GUID: 0x0002c90108cd8ba1 From halr at voltaire.com Thu Mar 17 05:10:00 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Mar 2005 08:10:00 -0500 Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs In-Reply-To: <20050317045817.GZ9768@kalmia.hozed.org> References: <52k6o8wom2.fsf@topspin.com> <20050317045817.GZ9768@kalmia.hozed.org> Message-ID: <1111064822.4662.2569.camel@localhost.localdomain> On Wed, 2005-03-16 at 23:58, Troy Benjegerdes wrote: > I'm having trouble building opensm et all from roland-uverb... (and I > can't really test NetPIPE without an SM ) Please note that OpenSM changes are not being integrated on the roland-uverbs branch, so what is there is essentially a snapshot from when the branch was created some time ago. roland-uverbs will be coming back to the mainline trunk soon. -- Hal From halr at voltaire.com Thu Mar 17 05:16:27 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Mar 2005 08:16:27 -0500 Subject: [openib-general] OpenIB OUI Message-ID: <1111064936.4662.2576.camel@localhost.localdomain> From the latest http://standards.ieee.org/regauth/oui/oui.txt 00-14-05 (hex) OpenIB, Inc. 001405 (base 16) OpenIB, Inc.
M/S JF5-357
2111 NE 25th Avenue
Hillsboro OR 97124
UNITED STATES

From halr at voltaire.com  Thu Mar 17 06:10:51 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Mar 2005 09:10:51 -0500
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
Message-ID: <1111068651.4662.2686.camel@localhost.localdomain>

agent: Add IB ping server agent (used with ibping diagnostic tool)

Signed-off-by: Shahar Frank
Signed-off-by: Hal Rosenstock

Index: include/ib_mad.h
===================================================================
--- include/ib_mad.h	(revision 2012)
+++ include/ib_mad.h	(working copy)
@@ -56,6 +56,10 @@
 #define IB_MGMT_CLASS_VENDOR_RANGE2_START 0x30
 #define IB_MGMT_CLASS_VENDOR_RANGE2_END 0x4F
 
+#define IB_MGMT_CLASS_OPENIB_PING (IB_MGMT_CLASS_VENDOR_RANGE2_START+2)
+
+#define IB_OPENIB_OUI (0x001405)
+
 /* Management methods */
 #define IB_MGMT_METHOD_GET 0x01
 #define IB_MGMT_METHOD_SET 0x02
Index: core/agent_priv.h
===================================================================
--- core/agent_priv.h	(revision 2012)
+++ core/agent_priv.h	(working copy)
@@ -57,6 +57,7 @@
 	int port_num;
 	struct ib_mad_agent *smp_agent; /* SM class */
 	struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */
+	struct ib_mad_agent *pingd_agent; /* OpenIB Ping class */
 };
 
 #endif /* __IB_AGENT_PRIV_H__ */
Index: core/agent.c
===================================================================
--- core/agent.c	(revision 2012)
+++ core/agent.c	(working copy)
@@ -37,7 +37,7 @@
  */
 
 #include
-
+#include
 #include
 
 #include
@@ -70,7 +70,8 @@
 	} else {
 		list_for_each_entry(entry, &ib_agent_port_list, port_list) {
 			if ((entry->smp_agent == mad_agent) ||
-			    (entry->perf_mgmt_agent == mad_agent))
+			    (entry->perf_mgmt_agent == mad_agent) ||
+			    (entry->pingd_agent == mad_agent))
 				return entry;
 		}
 	}
@@ -151,7 +152,8 @@
 	ah_attr.sl = wc->sl;
 	ah_attr.static_rate = 0;
 	ah_attr.ah_flags = 0; /* No GRH */
-	if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) {
+	if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT ||
+	    mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_OPENIB_PING) {
 		if (wc->wc_flags & IB_WC_GRH) {
 			ah_attr.ah_flags = IB_AH_GRH;
 			/* Should sgid be looked up ? */
@@ -175,7 +177,8 @@
 	}
 
 	send_wr.wr.ud.ah = agent_send_wr->ah;
-	if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) {
+	if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT ||
+	    mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_OPENIB_PING) {
 		send_wr.wr.ud.pkey_index = wc->pkey_index;
 		send_wr.wr.ud.remote_qkey = IB_QP1_QKEY;
 	} else { /* for SMPs */
@@ -233,6 +236,9 @@
 	case IB_MGMT_CLASS_PERF_MGMT:
 		mad_agent = port_priv->perf_mgmt_agent;
 		break;
+	case IB_MGMT_CLASS_OPENIB_PING:
+		mad_agent = port_priv->pingd_agent;
+		break;
 	default:
 		return 1;
 	}
@@ -240,6 +246,42 @@
 	return agent_mad_send(mad_agent, port_priv, mad, grh, wc);
 }
 
+static void pingd_recv_handler(struct ib_mad_agent *mad_agent,
+			       struct ib_mad_recv_wc *mad_recv_wc)
+{
+	struct ib_agent_port_private *port_priv;
+	struct ib_vendor_mad *vend;
+	struct ib_mad_private *recv = container_of(mad_recv_wc,
+						   struct ib_mad_private,
+						   header.recv_wc);
+
+	/* Find matching MAD agent */
+	port_priv = ib_get_agent_port(NULL, 0, mad_agent);
+	if (!port_priv) {
+		kmem_cache_free(ib_mad_cache, recv);
+		printk(KERN_ERR SPFX "pingd_recv_handler: no matching MAD "
+		       "agent %p\n", mad_agent);
+		return;
+	}
+
+	vend = (struct ib_vendor_mad *)mad_recv_wc->recv_buf.mad;
+
+	vend->mad_hdr.method |= IB_MGMT_METHOD_RESP;
+	vend->mad_hdr.status = 0;
+	if (!system_utsname.domainname[0])
+		strncpy(vend->data, system_utsname.nodename, sizeof vend->data);
+	else
+		snprintf(vend->data, sizeof vend->data, "%s.%s",
+			 system_utsname.nodename, system_utsname.domainname);
+
+	/* Send response */
+	if (agent_mad_send(mad_agent, port_priv, recv,
+			   mad_recv_wc->recv_buf.grh, mad_recv_wc->wc)) {
+		kmem_cache_free(ib_mad_cache, recv);
+		printk(KERN_ERR SPFX "pingd_recv_handler: reply failed\n");
+	}
+}
+
 static void agent_send_handler(struct ib_mad_agent *mad_agent,
 			       struct ib_mad_send_wc *mad_send_wc)
 {
@@ -278,6 +320,7 @@
 {
 	int ret;
 	struct ib_agent_port_private *port_priv;
+	struct ib_mad_reg_req pingd_reg_req;
 	unsigned long flags;
 
 	/* First, check if port already open for SMI */
@@ -324,12 +367,33 @@
 		goto error3;
 	}
 
+	pingd_reg_req.mgmt_class = IB_MGMT_CLASS_OPENIB_PING;
+	pingd_reg_req.mgmt_class_version = 1;
+	pingd_reg_req.oui[0] = (IB_OPENIB_OUI >> 16) & 0xff;
+	pingd_reg_req.oui[1] = (IB_OPENIB_OUI >> 8) & 0xff;
+	pingd_reg_req.oui[2] = IB_OPENIB_OUI & 0xff;
+	set_bit(IB_MGMT_METHOD_GET, pingd_reg_req.method_mask);
+
+	/* Obtain server MAD agent for OpenIB Ping class (GSI QP) */
+	port_priv->pingd_agent = ib_register_mad_agent(device, port_num,
+						       IB_QPT_GSI,
+						       &pingd_reg_req, 0,
+						       &agent_send_handler,
+						       &pingd_recv_handler,
+						       NULL);
+	if (IS_ERR(port_priv->pingd_agent)) {
+		ret = PTR_ERR(port_priv->pingd_agent);
+		goto error4;
+	}
+
 	spin_lock_irqsave(&ib_agent_port_list_lock, flags);
 	list_add_tail(&port_priv->port_list, &ib_agent_port_list);
 	spin_unlock_irqrestore(&ib_agent_port_list_lock, flags);
 
 	return 0;
 
+error4:
+	ib_unregister_mad_agent(port_priv->perf_mgmt_agent);
 error3:
 	ib_unregister_mad_agent(port_priv->smp_agent);
 error2:
@@ -353,6 +417,7 @@
 	list_del(&port_priv->port_list);
 	spin_unlock_irqrestore(&ib_agent_port_list_lock, flags);
 
+	ib_unregister_mad_agent(port_priv->pingd_agent);
 	ib_unregister_mad_agent(port_priv->perf_mgmt_agent);
 	ib_unregister_mad_agent(port_priv->smp_agent);
 	kfree(port_priv);

From mst at mellanox.co.il  Thu Mar 17 06:44:46 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 Mar 2005 16:44:46 +0200
Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent
In-Reply-To: <1111068651.4662.2686.camel@localhost.localdomain>
References: <1111068651.4662.2686.camel@localhost.localdomain>
Message-ID: <20050317144446.GC12627@mellanox.co.il>

Quoting r.
Hal Rosenstock :
> Subject: [PATCH] agent: Add IB ping server agent
>
> agent: Add IB ping server agent (used with ibping diagnostic tool)
>
> Signed-off-by: Shahar Frank
> Signed-off-by: Hal Rosenstock

I think we want an option to compile this out of the kernel, at least
for documentation purposes.

I think ping is useful (at least here in mellanox I think
I was the one who first proposed adding it, before the workshop)
but after all it *is* a vendor MAD.

-- 
MST - Michael S. Tsirkin

From halr at voltaire.com  Thu Mar 17 06:56:08 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Mar 2005 09:56:08 -0500
Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent
In-Reply-To: <20050317144446.GC12627@mellanox.co.il>
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<20050317144446.GC12627@mellanox.co.il>
Message-ID: <1111070959.4662.2770.camel@localhost.localdomain>

On Thu, 2005-03-17 at 09:44, Michael S. Tsirkin wrote:
> I think we want an option to compile this out of the kernel, at least
> for documentation purposes.

Not sure what you mean for documentation purposes.

> I think ping is useful (at least here in mellanox I think
> I was the one who first proposed adding it, before the workshop)
> but after all it *is* a vendor MAD.

It is an open vendor MAD which everyone is free to implement. That was
part of the idea of getting an OUI for this, which can be used
openly. This MAD and others will be documented for anyone to implement
in their end port implementations.

-- Hal

From mst at mellanox.co.il  Thu Mar 17 07:08:22 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 Mar 2005 17:08:22 +0200
Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent
In-Reply-To: <1111070959.4662.2770.camel@localhost.localdomain>
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<20050317144446.GC12627@mellanox.co.il>
	<1111070959.4662.2770.camel@localhost.localdomain>
Message-ID: <20050317150822.GF12627@mellanox.co.il>

Quoting r. Hal Rosenstock :
> Subject: Re: [PATCH] agent: Add IB ping server agent
>
> On Thu, 2005-03-17 at 09:44, Michael S. Tsirkin wrote:
> > I think we want an option to compile this out of the kernel, at least
> > for documentation purposes.
>
> Not sure what you mean for documentation purposes.
>
> > I think ping is useful (at least here in mellanox I think
> > I was the one who first proposed adding it, before the workshop)
> > but after all it *is* a vendor MAD.
>
> It is an open vendor MAD which everyone is free to implement. That was
> part of the idea of getting an OUI for this, which can be used
> openly. This MAD and others will be documented for anyone to implement
> in their end port implementations.
>
> -- Hal
>

I agree, so maybe it will become standard at some point - then
we'll have to change the code to also support the standard one, right?
That's what I mean by "for documentation purposes" - so that things outside
the spec are easy to locate.

-- 
MST - Michael S. Tsirkin

From halr at voltaire.com  Thu Mar 17 07:05:54 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Mar 2005 10:05:54 -0500
Subject: [openib-general] Re: User Level Events - request for support
In-Reply-To: <20050317100309.GT16749@mellanox.co.il>
References: <506C3D7B14CDD411A52C00025558DED6047EEFA3@mtlex01.yok.mtl.com>
	<20050317100309.GT16749@mellanox.co.il>
Message-ID: <1111071225.4662.2784.camel@localhost.localdomain>

On Thu, 2005-03-17 at 05:03, Michael S. Tsirkin wrote:
> However I think I agree it might make sense to add this capability
> to umad. Since sm is already blocking on read from umad, it might
> be simplest to make read return with an error code. Does this make
> sense?
>
> Alternatively, we could try using kobject_uevent.

This has been on the OpenSM todo list
(https://openib.org/svn/gen2/trunk/src/userspace/management/osm/doc/todo).
This functionality in umad at the initial port to OpenIB and I believe
is on the todo list.

-- Hal

From roland at topspin.com  Thu Mar 17 07:11:31 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 17 Mar 2005 07:11:31 -0800
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <1111068651.4662.2686.camel@localhost.localdomain>
	(Hal Rosenstock's message of "17 Mar 2005 09:10:51 -0500")
References: <1111068651.4662.2686.camel@localhost.localdomain>
Message-ID: <527jk6s3x8.fsf@topspin.com>

I would suggest moving the ping server at least into its own source
file, if not into its own module. I'm not convinced that we want to
have a vendor-specific MAD handler unconditionally compiled into the
core MAD support.

It might make some sense to have the option of handling ping packets
in kernel space but I think it should also be possible to just run a
userspace ping server.

Also, is there documentation about the format of these ping MADs?

 - R.

From halr at voltaire.com  Thu Mar 17 07:25:29 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Mar 2005 10:25:29 -0500
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <527jk6s3x8.fsf@topspin.com>
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<527jk6s3x8.fsf@topspin.com>
Message-ID: <1111073129.4662.2821.camel@localhost.localdomain>

On Thu, 2005-03-17 at 10:11, Roland Dreier wrote:
> I would suggest moving the ping server at least into its own source
> file, if not into its own module. I'm not convinced that we want to
> have a vendor-specific MAD handler unconditionally compiled into the
> core MAD support.

Moving this into its own source file is straightforward. As to whether
it is unconditionally compiled in or not, it loses some of its value if
it is a compile-time option. But it is also easy to add another config
option for this if that is what the community desires.

> It might make some sense to have the option of handling ping packets
> in kernel space but I think it should also be possible to just run a
> userspace ping server.
>
> Also, is there documentation about the format of these ping MADs?

I believe this should be available by day's end.

-- Hal.

From halr at voltaire.com  Thu Mar 17 07:48:19 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Mar 2005 10:48:19 -0500
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <527jk6s3x8.fsf@topspin.com>
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<527jk6s3x8.fsf@topspin.com>
Message-ID: <1111074189.4662.2859.camel@localhost.localdomain>

On Thu, 2005-03-17 at 10:11, Roland Dreier wrote:
> I would suggest moving the ping server at least into its own source
> file, if not into its own module. I'm not convinced that we want to
> have a vendor-specific MAD handler unconditionally compiled into the
> core MAD support.
>
> It might make some sense to have the option of handling ping packets
> in kernel space but I think it should also be possible to just run a
> userspace ping server.

One more thing: This is analogous to the (ICMP) ping server built into
the kernel.

-- Hal

From roland at topspin.com  Thu Mar 17 08:38:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 17 Mar 2005 08:38:52 -0800
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <1111074189.4662.2859.camel@localhost.localdomain>
	(Hal Rosenstock's message of "17 Mar 2005 10:48:19 -0500")
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<527jk6s3x8.fsf@topspin.com>
	<1111074189.4662.2859.camel@localhost.localdomain>
Message-ID: <523buurzvn.fsf@topspin.com>

    Hal> One more thing: This is analogous to the (ICMP) ping server
    Hal> built into the kernel.

Sort of, except it's not part of the IB spec. ICMP and in particular
ICMP echo support is a "MUST" for any RFC-compliant IP implementation.

One other thought just occurred to me -- what advantages does this
vendor-specific ping MAD have over a LID-routed NodeDescription get query?

 - R.

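The ping handler in the patch under discussion fills the reply payload with the node name, or with "nodename.domainname" when a domain name is set. Below is a minimal user-space sketch of that formatting step only. The names ping_fill_reply and PING_DATA_LEN and the buffer size are hypothetical and do not come from the patch. The sketch uses snprintf on both branches because snprintf always NUL-terminates, whereas strncpy leaves the buffer unterminated when the source is as long as the destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical payload size; the real vendor MAD data area differs. */
#define PING_DATA_LEN 32

/*
 * Write "node" or "node.domain" into buf, truncating if needed.
 * Returns the number of bytes stored, not counting the NUL.
 */
static size_t ping_fill_reply(char *buf, size_t len,
                              const char *node, const char *domain)
{
        int n;

        assert(buf != NULL && len > 0 && node != NULL);

        if (domain == NULL || domain[0] == '\0')
                n = snprintf(buf, len, "%s", node);
        else
                n = snprintf(buf, len, "%s.%s", node, domain);

        if (n < 0) {            /* output error: hand back an empty string */
                buf[0] = '\0';
                return 0;
        }
        /* snprintf reports the untruncated length; clamp to what fit. */
        return (size_t)n < len ? (size_t)n : len - 1;
}
```

A caller would pass a buffer of PING_DATA_LEN bytes together with the host and domain names; the result is always a valid C string, truncated if the names do not fit.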
From eitan at mellanox.co.il  Thu Mar 17 08:55:54 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 17 Mar 2005 18:55:54 +0200
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EEFA5@mtlex01.yok.mtl.com>

Hi Hal,

If the response would have included the guid and lid of the responding
port this could potentially be used to validate path record cache in a
distributed manner...

Also could help in diagnostics.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Roland Dreier [mailto:roland at topspin.com]
> Sent: Thursday, March 17, 2005 6:39 PM
> To: Hal Rosenstock
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] [PATCH] agent: Add IB ping server agent
>
>     Hal> One more thing: This is analogous to the (ICMP) ping server
>     Hal> built into the kernel.
>
> Sort of, except it's not part of the IB spec. ICMP and in particular
> ICMP echo support is a "MUST" for any RFC-compliant IP implementation.
>
> One other thought just occurred to me -- what advantages does this
> vendor-specific ping MAD have over a LID-routed NodeDescription get query?
>
>  - R.

From roland at topspin.com  Thu Mar 17 09:00:48 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 17 Mar 2005 09:00:48 -0800
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EEFA5@mtlex01.yok.mtl.com>
	(Eitan Zahavi's message of "Thu, 17 Mar 2005 18:55:54 +0200")
References: <506C3D7B14CDD411A52C00025558DED6047EEFA5@mtlex01.yok.mtl.com>
Message-ID: <52u0naqkan.fsf@topspin.com>

    Eitan> Hi Hal, If the response would have included the guid and
    Eitan> lid of the responding port this could potentially be used
    Eitan> to validate path record cache in a distributed manner...
    Eitan> Also could help in diagnostics.

Sort of like a LID-routed query of PortInfo?

 - R.

From roland at topspin.com  Thu Mar 17 09:00:05 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 17 Mar 2005 09:00:05 -0800
Subject: [openib-general] [PATCH] Restore needed #include to mad.c
Message-ID: <52y8cmqkbu.fsf@topspin.com>

If we don't include , mad.c won't compile on e.g. sparc64.

Signed-off-by: Roland Dreier

Index: infiniband/core/mad.c
===================================================================
--- infiniband/core/mad.c	(revision 2016)
+++ infiniband/core/mad.c	(working copy)
@@ -32,6 +32,8 @@
  * $Id$
  */
 
+#include
+
 #include "mad_priv.h"
 #include "smi.h"
 #include "agent.h"

From iod00d at hp.com  Thu Mar 17 09:18:29 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 17 Mar 2005 09:18:29 -0800
Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs
In-Reply-To: <91DB792C7985D411BEC300B40080D29CC35797@mtvex01.mtv.mtl.com>
References: <91DB792C7985D411BEC300B40080D29CC35797@mtvex01.mtv.mtl.com>
Message-ID: <20050317171829.GD10752@esmail.cup.hp.com>

On Thu, Mar 17, 2005 at 10:27:50AM +0200, Itamar Rabenstein wrote:
> Just a small thing FW version 2.0.0 is very old FW.
> You should upgrade it to the latest which is now 3.3.2

Good point.
Troy, in order to use MSI or MSI-X, you need 3.3.2 firmware.
And in case it's not obvious, expect best perf with MSI-X
if your platform supports it.

Either msflint or tvflash should work for upgrading the cards.

grant

From roland at topspin.com  Thu Mar 17 09:28:09 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 17 Mar 2005 09:28:09 -0800
Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs
In-Reply-To: <20050317171829.GD10752@esmail.cup.hp.com>
	(Grant Grundler's message of "Thu, 17 Mar 2005 09:18:29 -0800")
References: <91DB792C7985D411BEC300B40080D29CC35797@mtvex01.mtv.mtl.com>
	<20050317171829.GD10752@esmail.cup.hp.com>
Message-ID: <52psxyqj12.fsf@topspin.com>

    Grant> Troy, in order to use MSI or MSI-X, you need 3.3.2
    Grant> firmware. And in case it's not obvious, expect best perf
    Grant> with MSI-X if your platform supports it. Either msflint or
    Grant> tvflash should work for upgrading the cards.

I don't think MSI is supported by Linux on ppc64 right now. In fact I
don't think you can even select CONFIG_PCI_MSI.

I'm not sure what ppc64 hardware can even do MSI. For example IBM
JS20 blades use the AMD 8131 for their PCI-X bridge, and we know that
part doesn't support MSI. I believe Apple G5 hardware uses the same
chip for its PCI-X slots.

On the other hand, all I see from lspci on a p630 system is

    PCI bridge: IBM: Unknown device 0188 (rev 02)

so who knows what that might do?

 - R.

From mshefty at ichips.intel.com  Thu Mar 17 09:38:18 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 17 Mar 2005 09:38:18 -0800
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <527jk6s3x8.fsf@topspin.com>
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<527jk6s3x8.fsf@topspin.com>
Message-ID: <4239C08A.7070309@ichips.intel.com>

Roland Dreier wrote:
> I would suggest moving the ping server at least into its own source
> file, if not into its own module. I'm not convinced that we want to
> have a vendor-specific MAD handler unconditionally compiled into the
> core MAD support.

I agree with this. This feels like it should be a separate module.

- Sean

From mst at mellanox.co.il  Thu Mar 17 09:45:52 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 Mar 2005 19:45:52 +0200
Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent
In-Reply-To: <523buurzvn.fsf@topspin.com>
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<527jk6s3x8.fsf@topspin.com>
	<1111074189.4662.2859.camel@localhost.localdomain>
	<523buurzvn.fsf@topspin.com>
Message-ID: <20050317174551.GA13775@mellanox.co.il>

Quoting r. Roland Dreier :
> One other thought just occurred to me -- what advantages does this
> vendor-specific ping MAD have over a LID-routed NodeDescription get query?
>
> - R.

I think GetResp will always go to the SM, while OPENIB_PING is point to
point.

-- 
MST - Michael S. Tsirkin

From halr at voltaire.com  Thu Mar 17 09:42:36 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Mar 2005 12:42:36 -0500
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <4239C08A.7070309@ichips.intel.com>
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<527jk6s3x8.fsf@topspin.com>
	<4239C08A.7070309@ichips.intel.com>
Message-ID: <1111081356.4662.3218.camel@localhost.localdomain>

On Thu, 2005-03-17 at 12:38, Sean Hefty wrote:
> Roland Dreier wrote:
> > I would suggest moving the ping server at least into its own source
> > file, if not into its own module. I'm not convinced that we want to
> > have a vendor-specific MAD handler unconditionally compiled into the
> > core MAD support.
>
> I agree with this. This feels like it should be a separate module.

OK but this will result in some code duplication. I should have this
later today and will revert back the changes to agent.c and agent_priv.h
of earlier today.

-- Hal

From mst at mellanox.co.il  Thu Mar 17 09:48:30 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 Mar 2005 19:48:30 +0200
Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent
In-Reply-To: <1111081356.4662.3218.camel@localhost.localdomain>
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<527jk6s3x8.fsf@topspin.com>
	<4239C08A.7070309@ichips.intel.com>
	<1111081356.4662.3218.camel@localhost.localdomain>
Message-ID: <20050317174830.GB13775@mellanox.co.il>

Quoting r. Hal Rosenstock :
> Subject: Re: [PATCH] agent: Add IB ping server agent
>
> On Thu, 2005-03-17 at 12:38, Sean Hefty wrote:
> > Roland Dreier wrote:
> > > I would suggest moving the ping server at least into its own source
> > > file, if not into its own module. I'm not convinced that we want to
> > > have a vendor-specific MAD handler unconditionally compiled into the
> > > core MAD support.
> >
> > I agree with this. This feels like it should be a separate module.
>
> OK but this will result in some code duplication. I should have this
> later today and will revert back the changes to agent.c and agent_priv.h
> of earlier today.
>
> -- Hal

I think a compile option is simpler, and would be sufficient.

-- 
MST - Michael S. Tsirkin

From mst at mellanox.co.il  Thu Mar 17 09:57:02 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 Mar 2005 19:57:02 +0200
Subject: [openib-general] [PATCH] nit in dereg_mr
Message-ID: <20050317175702.GA16399@mellanox.co.il>

It's cleaner to kfree mthca_mr, and not rely on the fact that ib_mr is
the first field in mthca_mr.

Signed-off-by: Michael S. Tsirkin

Index: hw/mthca/mthca_provider.c
===================================================================
--- hw/mthca/mthca_provider.c	(revision 2012)
+++ hw/mthca/mthca_provider.c	(working copy)
@@ -568,8 +568,9 @@
 
 static int mthca_dereg_mr(struct ib_mr *mr)
 {
-	mthca_free_mr(to_mdev(mr->device), to_mmr(mr));
-	kfree(mr);
+	struct mthca_mr *mmr = to_mmr(mr);
+	mthca_free_mr(to_mdev(mr->device), mmr);
+	kfree(mmr);
 	return 0;
 }
 
-- 
MST - Michael S. Tsirkin

From mshefty at ichips.intel.com  Thu Mar 17 09:57:39 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 17 Mar 2005 09:57:39 -0800
Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent
In-Reply-To: <20050317174830.GB13775@mellanox.co.il>
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<527jk6s3x8.fsf@topspin.com>
	<4239C08A.7070309@ichips.intel.com>
	<1111081356.4662.3218.camel@localhost.localdomain>
	<20050317174830.GB13775@mellanox.co.il>
Message-ID: <4239C513.1060005@ichips.intel.com>

Michael S. Tsirkin wrote:
> I think a compile option is simpler, and would be sufficient.

We can examine the different MAD agents to identify common functionality
that can be pulled into a library or set of helper functions. I've
already started trying to identify areas with the RMPP implementation.

E.g. we can add new calls such as ib_create_ah_from_wc() or
ib_alloc_send_mad() - to allocate the MAD and format the send_wr.

- Sean

From halr at voltaire.com  Thu Mar 17 09:55:35 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Mar 2005 12:55:35 -0500
Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent
In-Reply-To: <20050317174830.GB13775@mellanox.co.il>
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<527jk6s3x8.fsf@topspin.com>
	<4239C08A.7070309@ichips.intel.com>
	<1111081356.4662.3218.camel@localhost.localdomain>
	<20050317174830.GB13775@mellanox.co.il>
Message-ID: <1111081974.4662.3277.camel@localhost.localdomain>

On Thu, 2005-03-17 at 12:48, Michael S. Tsirkin wrote:
> Quoting r. Hal Rosenstock :
> > OK but this will result in some code duplication. I should have this
> > later today and will revert back the changes to agent.c and agent_priv.h
> > of earlier today.

> I think a compile option is simpler, and would be sufficient.

I agree. Is this an acceptable approach ?

-- Hal

From mst at mellanox.co.il  Thu Mar 17 10:01:10 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 Mar 2005 20:01:10 +0200
Subject: [openib-general] [PATCH] mthca_mr error handling
Message-ID: <20050317180110.GA17059@mellanox.co.il>

Fix error handling in mr allocation for arbel native:
mthca_free must get an mr index, not a key.

Signed-off-by: Michael S. Tsirkin

Index: hw/mthca/mthca_mr.c
===================================================================
--- hw/mthca/mthca_mr.c	(revision 2012)
+++ hw/mthca/mthca_mr.c	(working copy)
@@ -231,7 +231,7 @@
 err_out_table:
 	mthca_table_put(dev, dev->mr_table.mpt_table, key);
 err_out_mpt_free:
-	mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey);
+	mthca_free(&dev->mr_table.mpt_alloc, key);
 	kfree(mailbox);
 	return err;
 }
@@ -368,7 +368,7 @@
 err_out_table:
 	mthca_table_put(dev, dev->mr_table.mpt_table, key);
 err_out_mpt_free:
-	mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey);
+	mthca_free(&dev->mr_table.mpt_alloc, key);
 	return err;
 }
 
-- 
MST - Michael S. Tsirkin

From halr at voltaire.com  Thu Mar 17 10:21:05 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Mar 2005 13:21:05 -0500
Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent
In-Reply-To: <20050317174551.GA13775@mellanox.co.il>
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<527jk6s3x8.fsf@topspin.com>
	<1111074189.4662.2859.camel@localhost.localdomain>
	<523buurzvn.fsf@topspin.com>
	<20050317174551.GA13775@mellanox.co.il>
Message-ID: <1111083665.4662.3475.camel@localhost.localdomain>

On Thu, 2005-03-17 at 12:45, Michael S. Tsirkin wrote:
> I think GetResp will always go to the SM, while OPENIB_PING is point to
> point.

Response matching is transaction ID based with the MAD agent ID embedded
in the transaction ID so these can be demultiplexed to separate
entities.

-- Hal

From mst at mellanox.co.il  Thu Mar 17 10:29:57 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 Mar 2005 20:29:57 +0200
Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent
In-Reply-To: <1111083665.4662.3475.camel@localhost.localdomain>
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<527jk6s3x8.fsf@topspin.com>
	<1111074189.4662.2859.camel@localhost.localdomain>
	<523buurzvn.fsf@topspin.com>
	<20050317174551.GA13775@mellanox.co.il>
	<1111083665.4662.3475.camel@localhost.localdomain>
Message-ID: <20050317182957.GA14358@mellanox.co.il>

Quoting r. Hal Rosenstock :
> Subject: Re: [PATCH] agent: Add IB ping server agent
>
> On Thu, 2005-03-17 at 12:45, Michael S. Tsirkin wrote:
> > I think GetResp will always go to the SM, while OPENIB_PING is point to
> > point.
>
> Response matching is transaction ID based with the MAD agent ID embedded
> in the transaction ID so these can be demultiplexed to separate
> entities.
>
> -- Hal
>

I mean, they will go to the SM LID.

-- 
MST - Michael S. Tsirkin

From halr at voltaire.com  Thu Mar 17 10:34:20 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Mar 2005 13:34:20 -0500
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EEFA5@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EEFA5@mtlex01.yok.mtl.com>
Message-ID: <1111084460.4662.3550.camel@localhost.localdomain>

Hi Eitan,

On Thu, 2005-03-17 at 11:55, Eitan Zahavi wrote:
> If the response would have included the guid and lid of the responding
> port this could potentially be used to validate path record cache in a
> distributed manner...
>
> Also could help in diagnostics.

An entire range of IB ping types could be implemented analogous to ICMP
types. Also, as you indicate, a response that includes target port GUID
and LID could be used to validate an SA client cache.

-- Hal

From roland at topspin.com  Thu Mar 17 10:40:08 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 17 Mar 2005 10:40:08 -0800
Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent
In-Reply-To: <20050317182957.GA14358@mellanox.co.il>
	(Michael S. Tsirkin's message of "Thu, 17 Mar 2005 20:29:57 +0200")
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<527jk6s3x8.fsf@topspin.com>
	<1111074189.4662.2859.camel@localhost.localdomain>
	<523buurzvn.fsf@topspin.com>
	<20050317174551.GA13775@mellanox.co.il>
	<1111083665.4662.3475.camel@localhost.localdomain>
	<20050317182957.GA14358@mellanox.co.il>
Message-ID: <52br9iqfp3.fsf@topspin.com>

    Michael> I mean, they will go to the SM LID.

Are you sure? Where is this in the spec? I really thought that get
responses for SM class queries will be formed the same way as
responses for any other class. That is, the response will be sent to
source LID of the query, etc.

 - R.

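Hal's point earlier in this exchange, that responses are matched by transaction ID with the MAD agent ID embedded in it, can be illustrated with a small sketch. It assumes the agent ID sits in the upper 32 bits of the 64-bit transaction ID and a per-agent counter in the lower 32; the thread does not spell out the layout, and the names tid_make, tid_dispatch, and the agent table are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: agent ID in the upper 32 bits, per-agent counter below. */
static uint64_t tid_make(uint32_t agent_id, uint32_t seq)
{
        return ((uint64_t)agent_id << 32) | seq;
}

static uint32_t tid_agent(uint64_t tid) { return (uint32_t)(tid >> 32); }
static uint32_t tid_seq(uint64_t tid)   { return (uint32_t)tid; }

/* Hypothetical table of registered agents, indexed by agent ID. */
#define MAX_AGENTS 8
struct agent {
        int in_use;
        unsigned long responses;   /* responses delivered to this agent */
};
static struct agent agents[MAX_AGENTS];

/*
 * Route a response to the agent that issued the request.
 * Returns 0 on success, -1 if no such agent is registered.
 */
static int tid_dispatch(uint64_t tid)
{
        uint32_t id = tid_agent(tid);

        if (id >= MAX_AGENTS || !agents[id].in_use)
                return -1;
        agents[id].responses++;
        return 0;
}
```

With this scheme two clients on the same port can issue the same kind of query and still receive only their own responses, since the receive path needs nothing but the transaction ID to pick the target.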
From roland at topspin.com  Thu Mar 17 10:43:10 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 17 Mar 2005 10:43:10 -0800
Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent
In-Reply-To: <1111081974.4662.3277.camel@localhost.localdomain>
	(Hal Rosenstock's message of "17 Mar 2005 12:55:35 -0500")
References: <1111068651.4662.2686.camel@localhost.localdomain>
	<527jk6s3x8.fsf@topspin.com>
	<4239C08A.7070309@ichips.intel.com>
	<1111081356.4662.3218.camel@localhost.localdomain>
	<20050317174830.GB13775@mellanox.co.il>
	<1111081974.4662.3277.camel@localhost.localdomain>
Message-ID: <527jk6qfk1.fsf@topspin.com>

    Michael> I think a compile option is simpler, and would be sufficient.

    Hal> I agree. Is this an acceptable approach ?

I think I'd prefer to avoid having this #ifdef'ed. If we have OpenIB
vendor class handling nicely encapsulated, it's nice and maintainable,
and it makes it clear how vendor classes should be handled. If we set
the precedent for handling vendor classes in the core mad files, then
we're taking the first step on the road to insanity.

 - R.

From roland at topspin.com  Thu Mar 17 10:43:58 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 17 Mar 2005 10:43:58 -0800
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <1111084460.4662.3550.camel@localhost.localdomain>
	(Hal Rosenstock's message of "17 Mar 2005 13:34:20 -0500")
References: <506C3D7B14CDD411A52C00025558DED6047EEFA5@mtlex01.yok.mtl.com>
	<1111084460.4662.3550.camel@localhost.localdomain>
Message-ID: <523buuqfip.fsf@topspin.com>

    Hal> An entire range of IB ping types could be implemented
    Hal> analogous to ICMP types.

Seems like a good argument for splitting the support out into its own
source file at least.

 - R.

From halr at voltaire.com Thu Mar 17 10:41:00 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Mar 2005 13:41:00 -0500
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <523buurzvn.fsf@topspin.com>
References: <1111068651.4662.2686.camel@localhost.localdomain> <527jk6s3x8.fsf@topspin.com> <1111074189.4662.2859.camel@localhost.localdomain> <523buurzvn.fsf@topspin.com>
Message-ID: <1111084531.4662.3556.camel@localhost.localdomain>

On Thu, 2005-03-17 at 11:38, Roland Dreier wrote:
> One other thought just occurred to me -- what advantages does this
> vendor-specific ping MAD have over a LID-routed NodeDescription get query?

No MKey issue is the first thing that comes to mind.

Also, the ability to support various options such as:
length based query/response

-- Hal

From mshefty at ichips.intel.com Thu Mar 17 10:46:25 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 17 Mar 2005 10:46:25 -0800
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <1111084460.4662.3550.camel@localhost.localdomain>
References: <506C3D7B14CDD411A52C00025558DED6047EEFA5@mtlex01.yok.mtl.com> <1111084460.4662.3550.camel@localhost.localdomain>
Message-ID: <4239D081.4040807@ichips.intel.com>

Hal Rosenstock wrote:
> An entire range of IB ping types could be implemented analogous to ICMP
> types.

This is more of a reason to break this out into a separate module.
- Sean

From roland at topspin.com Thu Mar 17 10:50:55 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 17 Mar 2005 10:50:55 -0800
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <1111084531.4662.3556.camel@localhost.localdomain> (Hal Rosenstock's message of "17 Mar 2005 13:41:00 -0500")
References: <1111068651.4662.2686.camel@localhost.localdomain> <527jk6s3x8.fsf@topspin.com> <1111074189.4662.2859.camel@localhost.localdomain> <523buurzvn.fsf@topspin.com> <1111084531.4662.3556.camel@localhost.localdomain>
Message-ID: <52y8cmp0mo.fsf@topspin.com>

    Roland> One other thought just occurred to me -- what advantages
    Roland> does this vendor-specific ping MAD have over a LID-routed
    Roland> NodeDescription get query?

    Hal> No MKey issue is the first thing that comes to mind.

Hmm... is someone running their fabric with an M_Key protection level
above 1 going to want a backdoor query class without access control?
Especially if it's unconditionally enabled and reveals information
like hostname?

    Hal> Also, the ability to support various options such as: length
    Hal> based query/response

Given that MAD packets are always exactly 256 bytes long, is this useful?

 - R.

From roland at topspin.com Thu Mar 17 10:52:34 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 17 Mar 2005 10:52:34 -0800
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <1111084460.4662.3550.camel@localhost.localdomain> (Hal Rosenstock's message of "17 Mar 2005 13:34:20 -0500")
References: <506C3D7B14CDD411A52C00025558DED6047EEFA5@mtlex01.yok.mtl.com> <1111084460.4662.3550.camel@localhost.localdomain>
Message-ID: <52sm2up0jx.fsf@topspin.com>

    Hal> An entire range of IB ping types could be implemented
    Hal> analogous to ICMP types.

Given that the current code seems to send a reply without looking at
the attribute ID or anything else about the query packet, how are we
going to handle backwards compatibility?

 - R.
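To make the backwards-compatibility concern concrete, here is a minimal
sketch of a responder that inspects the request before answering, so
that a later revision can add new attributes or a new class version
without old responders replying blindly. It is not from the posted
patch: the header layout, the constants and the status codes are all
hypothetical, and a real implementation would use the MAD header fields
and status values defined by the IB specification.

```c
#include <stdint.h>

/* All names and values below are hypothetical, chosen for illustration. */
#define PING_CLASS_VERSION  1		/* version this responder implements */
#define PING_METHOD_GET     0x01	/* only method this responder accepts */
#define PING_ATTR_ECHO      0x0001	/* only attribute this responder knows */

enum ping_status {
	PING_OK = 0,
	PING_BAD_VERSION,	/* request uses a class version we do not speak */
	PING_BAD_METHOD,	/* method other than Get */
	PING_BAD_ATTR		/* attribute ID we do not implement */
};

/* Simplified stand-in for the parts of a MAD header that matter here. */
struct ping_hdr {
	uint8_t  class_version;
	uint8_t  method;
	uint16_t attr_id;	/* host byte order in this sketch */
};

/*
 * Decide whether a request may be answered.  Anything other than
 * PING_OK would be reported in the reply's status field instead of
 * sending a normal response.
 */
enum ping_status ping_check_request(const struct ping_hdr *hdr)
{
	if (hdr->class_version != PING_CLASS_VERSION)
		return PING_BAD_VERSION;
	if (hdr->method != PING_METHOD_GET)
		return PING_BAD_METHOD;
	if (hdr->attr_id != PING_ATTR_ECHO)
		return PING_BAD_ATTR;
	return PING_OK;
}
```

The order of the checks is a design choice: the version is tested first
because the meaning of the other fields may change between versions.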
From halr at voltaire.com Thu Mar 17 11:00:41 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Mar 2005 14:00:41 -0500
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <52y8cmp0mo.fsf@topspin.com>
References: <1111068651.4662.2686.camel@localhost.localdomain> <527jk6s3x8.fsf@topspin.com> <1111074189.4662.2859.camel@localhost.localdomain> <523buurzvn.fsf@topspin.com> <1111084531.4662.3556.camel@localhost.localdomain> <52y8cmp0mo.fsf@topspin.com>
Message-ID: <1111086041.4662.3648.camel@localhost.localdomain>

On Thu, 2005-03-17 at 13:50, Roland Dreier wrote:
> Hal> Also, the ability to support various options such as: length
> Hal> based query/response
>
> Given that MAD packets are always exactly 256 bytes long, is this useful?

The vendor MAD class range 2 supports RMPP so this is not necessarily
limited to 256 bytes.

-- Hal

From halr at voltaire.com Thu Mar 17 11:14:06 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Mar 2005 14:14:06 -0500
Subject: [openib-general] [PATCH] agent: Add IB ping server agent
In-Reply-To: <52sm2up0jx.fsf@topspin.com>
References: <506C3D7B14CDD411A52C00025558DED6047EEFA5@mtlex01.yok.mtl.com> <1111084460.4662.3550.camel@localhost.localdomain> <52sm2up0jx.fsf@topspin.com>
Message-ID: <1111086380.4662.3683.camel@localhost.localdomain>

On Thu, 2005-03-17 at 13:52, Roland Dreier wrote:
> Hal> An entire range of IB ping types could be implemented
> Hal> analogous to ICMP types.
>
> Given that the current code seems to send a reply without looking at
> the attribute ID or anything else about the query packet, how are we
> going to handle backwards compatibility?

It is not ready for upstream. Obviously version and type fields need to
be added to this.
-- Hal From halr at voltaire.com Thu Mar 17 11:33:41 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Mar 2005 14:33:41 -0500 Subject: [openib-general] Re: [PATCH] Restore needed #include to mad.c In-Reply-To: <52y8cmqkbu.fsf@topspin.com> References: <52y8cmqkbu.fsf@topspin.com> Message-ID: <1111088021.4662.3777.camel@localhost.localdomain> On Thu, 2005-03-17 at 12:00, Roland Dreier wrote: > If we don't include , mad.c won't compile on > e.g. sparc64. Thanks. Applied. -- Hal From halr at voltaire.com Thu Mar 17 11:44:36 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Mar 2005 14:44:36 -0500 Subject: [openib-general] [PATCH] agent: Remove IB ping server from agent Message-ID: <1111088676.4662.3816.camel@localhost.localdomain> agent: Remove IB ping server from agent Signed-off-by: Hal Rosenstock Index: agent_priv.h =================================================================== --- agent_priv.h (revision 2017) +++ agent_priv.h (working copy) @@ -57,7 +57,6 @@ int port_num; struct ib_mad_agent *smp_agent; /* SM class */ struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ - struct ib_mad_agent *pingd_agent; /* OpenIB Ping class */ }; #endif /* __IB_AGENT_PRIV_H__ */ Index: agent.c =================================================================== --- agent.c (revision 2017) +++ agent.c (working copy) @@ -37,7 +37,6 @@ */ #include -#include #include #include @@ -70,8 +69,7 @@ } else { list_for_each_entry(entry, &ib_agent_port_list, port_list) { if ((entry->smp_agent == mad_agent) || - (entry->perf_mgmt_agent == mad_agent) || - (entry->pingd_agent == mad_agent)) + (entry->perf_mgmt_agent == mad_agent)) return entry; } } @@ -152,8 +150,7 @@ ah_attr.sl = wc->sl; ah_attr.static_rate = 0; ah_attr.ah_flags = 0; /* No GRH */ - if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || - mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_OPENIB_PING) { + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { if 
(wc->wc_flags & IB_WC_GRH) { ah_attr.ah_flags = IB_AH_GRH; /* Should sgid be looked up ? */ @@ -177,8 +174,7 @@ } send_wr.wr.ud.ah = agent_send_wr->ah; - if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || - mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_OPENIB_PING) { + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { send_wr.wr.ud.pkey_index = wc->pkey_index; send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; } else { /* for SMPs */ @@ -236,9 +232,6 @@ case IB_MGMT_CLASS_PERF_MGMT: mad_agent = port_priv->perf_mgmt_agent; break; - case IB_MGMT_CLASS_OPENIB_PING: - mad_agent = port_priv->pingd_agent; - break; default: return 1; } @@ -246,42 +239,6 @@ return agent_mad_send(mad_agent, port_priv, mad, grh, wc); } -static void pingd_recv_handler(struct ib_mad_agent *mad_agent, - struct ib_mad_recv_wc *mad_recv_wc) -{ - struct ib_agent_port_private *port_priv; - struct ib_vendor_mad *vend; - struct ib_mad_private *recv = container_of(mad_recv_wc, - struct ib_mad_private, - header.recv_wc); - - /* Find matching MAD agent */ - port_priv = ib_get_agent_port(NULL, 0, mad_agent); - if (!port_priv) { - kmem_cache_free(ib_mad_cache, recv); - printk(KERN_ERR SPFX "pingd_recv_handler: no matching MAD " - "agent %p\n", mad_agent); - return; - } - - vend = (struct ib_vendor_mad *)mad_recv_wc->recv_buf.mad; - - vend->mad_hdr.method |= IB_MGMT_METHOD_RESP; - vend->mad_hdr.status = 0; - if (!system_utsname.domainname[0]) - strncpy(vend->data, system_utsname.nodename, sizeof vend->data); - else - snprintf(vend->data, sizeof vend->data, "%s.%s", - system_utsname.nodename, system_utsname.domainname); - - /* Send response */ - if (agent_mad_send(mad_agent, port_priv, recv, - mad_recv_wc->recv_buf.grh, mad_recv_wc->wc)) { - kmem_cache_free(ib_mad_cache, recv); - printk(KERN_ERR SPFX "pingd_recv_handler: reply failed\n"); - } -} - static void agent_send_handler(struct ib_mad_agent *mad_agent, struct ib_mad_send_wc *mad_send_wc) { @@ -320,7 +277,6 @@ { 
int ret; struct ib_agent_port_private *port_priv; - struct ib_mad_reg_req pingd_reg_req; unsigned long flags; /* First, check if port already open for SMI */ @@ -367,33 +323,12 @@ goto error3; } - pingd_reg_req.mgmt_class = IB_MGMT_CLASS_OPENIB_PING; - pingd_reg_req.mgmt_class_version = 1; - pingd_reg_req.oui[0] = (IB_OPENIB_OUI >> 16) & 0xff; - pingd_reg_req.oui[1] = (IB_OPENIB_OUI >> 8) & 0xff; - pingd_reg_req.oui[2] = IB_OPENIB_OUI & 0xff; - set_bit(IB_MGMT_METHOD_GET, pingd_reg_req.method_mask); - - /* Obtain server MAD agent for OpenIB Ping class (GSI QP) */ - port_priv->pingd_agent = ib_register_mad_agent(device, port_num, - IB_QPT_GSI, - &pingd_reg_req, 0, - &agent_send_handler, - &pingd_recv_handler, - NULL); - if (IS_ERR(port_priv->pingd_agent)) { - ret = PTR_ERR(port_priv->pingd_agent); - goto error4; - } - spin_lock_irqsave(&ib_agent_port_list_lock, flags); list_add_tail(&port_priv->port_list, &ib_agent_port_list); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); return 0; -error4: - ib_unregister_mad_agent(port_priv->perf_mgmt_agent); error3: ib_unregister_mad_agent(port_priv->smp_agent); error2: @@ -417,7 +352,6 @@ list_del(&port_priv->port_list); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); - ib_unregister_mad_agent(port_priv->pingd_agent); ib_unregister_mad_agent(port_priv->perf_mgmt_agent); ib_unregister_mad_agent(port_priv->smp_agent); kfree(port_priv); From mst at mellanox.co.il Thu Mar 17 11:49:17 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Thu, 17 Mar 2005 21:49:17 +0200
Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent
In-Reply-To: <52br9iqfp3.fsf@topspin.com>
References: <1111068651.4662.2686.camel@localhost.localdomain> <527jk6s3x8.fsf@topspin.com> <1111074189.4662.2859.camel@localhost.localdomain> <523buurzvn.fsf@topspin.com> <20050317174551.GA13775@mellanox.co.il> <1111083665.4662.3475.camel@localhost.localdomain> <20050317182957.GA14358@mellanox.co.il> <52br9iqfp3.fsf@topspin.com>
Message-ID: <20050317194917.GA14912@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [PATCH] agent: Add IB ping server agent
>
> Michael> I mean, they will go to the SM LID.
>
> Are you sure? Where is this in the spec? I really thought that get
> responses for SM class queries will be formed the same way as
> responses for any other class. That is, the response will be sent to
> source LID of the query, etc.
>
> - R.
>

I'm not necessarily talking about routing - it seems to me that these
MADs were designed for SM use, not for anything else.

For direct route SMPs we clearly have:

C14-5: Only a SM shall originate a directed route SMP.

For LID-routed SMPs it's not that clear, but still we have things like
Table 122 "SM MAD Sources and Destinations", which if I read it
correctly seems to say that responses shall go to the SM.

Generally, I see:

13.5.1.1 PROCESSING SUBNET MANAGEMENT PACKETS (SMPS)

The Subnet Management Interface (SMI) is associated with QP0. QP0 is
used exclusively for sending and receiving subnet management MADs.
Communications with the SMA in a channel adapter, switch, or router is
always through the SMI. If a channel adapter, switch, or router hosts a
SM, then communications between that SM and the SMA of each channel
adapter, switch, or router in the subnet is also through the SMI. Only
SMAs and SM communicate through this interface. No other entities may
do so.
One other issue I can see is mkey protection, although you always might
say the user must set the key protection bits to allow get port info ...

--
MST - Michael S. Tsirkin

From mst at mellanox.co.il Thu Mar 17 12:16:46 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 17 Mar 2005 22:16:46 +0200
Subject: [openib-general] [PATCH] fmr support in mthca
Message-ID: <20050317201646.GA15221@mellanox.co.il>

This patch implements FMR support. I also rolled into it two fixes for
regular mrs that I posted previously, let me know if it's a problem.

This seems to be working fine for me, although I only did relatively
basic tests. Both Tavor and Arbel Native modes are supported.

I made some tradeoffs for simplicity, let me know what you think:

- for tavor, I keep for each fmr two pages mapped: one for mpt and one
  for mtt access. This spends more kernel virtual memory than strictly
  needed, since many mpts could share one page. Alternatives are:
  map/unmap io memory on each fmr map/unmap request, or keep an
  intermediate table to map each page only once.

- icm that has the mpts/mtts is linearly scanned and this is repeated
  for each mtt on each fmr map. This may be improved somewhat with some
  kind of an iterator, but to really speed things up the icm data
  structure (list of arrays) would have to be replaced by some kind of
  tree.

Signed-off-by: Michael S.
Tsirkin Index: hw/mthca/mthca_dev.h =================================================================== --- hw/mthca/mthca_dev.h (revision 2012) +++ hw/mthca/mthca_dev.h (working copy) @@ -163,6 +163,7 @@ struct mthca_mr_table { int max_mtt_order; unsigned long **mtt_buddy; u64 mtt_base; + u64 mpt_base; /* Tavor only */ struct mthca_icm_table *mtt_table; struct mthca_icm_table *mpt_table; }; @@ -363,7 +364,17 @@ int mthca_mr_alloc_phys(struct mthca_dev u64 *buffer_list, int buffer_size_shift, int list_len, u64 iova, u64 total_size, u32 access, struct mthca_mr *mr); -void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_fmr_alloc(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_fmr *fmr); + +int mthca_fmr_map(struct mthca_dev *dev, struct mthca_fmr *fmr, + u64 *page_list, int list_len, u64 iova); + +void mthca_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr); + +void mthca_free_fmr(struct mthca_dev *dev, struct mthca_fmr *fmr); int mthca_map_eq_icm(struct mthca_dev *dev, u64 icm_virt); void mthca_unmap_eq_icm(struct mthca_dev *dev); Index: hw/mthca/mthca_memfree.h =================================================================== --- hw/mthca/mthca_memfree.h (revision 2012) +++ hw/mthca/mthca_memfree.h (working copy) @@ -90,6 +90,11 @@ int mthca_table_get_range(struct mthca_d void mthca_table_put_range(struct mthca_dev *dev, struct mthca_icm_table *table, int start, int end); +/* Nonblocking. Callers must make sure the object exists by serializing against + * callers of get/put. 
*/ +void *mthca_table_find(struct mthca_dev *dev, struct mthca_icm_table *table, + int obj); + static inline void mthca_icm_first(struct mthca_icm *icm, struct mthca_icm_iter *iter) { Index: hw/mthca/mthca_provider.c =================================================================== --- hw/mthca/mthca_provider.c (revision 2012) +++ hw/mthca/mthca_provider.c (working copy) @@ -568,8 +568,101 @@ static struct ib_mr *mthca_reg_phys_mr(s static int mthca_dereg_mr(struct ib_mr *mr) { - mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); - kfree(mr); + struct mthca_mr *mmr = to_mmr(mr); + mthca_free_mr(to_mdev(mr->device), mmr); + kfree(mmr); + return 0; +} + +static struct ib_fmr *mthca_alloc_fmr(struct ib_pd *pd, int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct mthca_fmr *fmr; + int err; + fmr = kmalloc(sizeof *fmr, GFP_KERNEL); + if (!fmr) + return ERR_PTR(-ENOMEM); + + memcpy(&fmr->attr, fmr_attr, sizeof *fmr_attr); + err = mthca_fmr_alloc(to_mdev(pd->device), to_mpd(pd)->pd_num, + convert_access(mr_access_flags), fmr); + + if (err) { + kfree(fmr); + return ERR_PTR(err); + } + return &fmr->ibmr; +} + +static int mthca_dealloc_fmr(struct ib_fmr *fmr) +{ + struct mthca_fmr *mfmr = to_mfmr(fmr); + mthca_free_fmr(to_mdev(fmr->device), mfmr); + kfree(mfmr); + return 0; +} + +static int mthca_map_phys_fmr(struct ib_fmr *fmr, u64 *page_list, int list_len, + u64 iova) +{ + int i, page_mask; + struct mthca_fmr *mfmr; + + mfmr = to_mfmr(fmr); + + if (list_len > mfmr->attr.max_pages) { + mthca_warn(to_mdev(fmr->device), "Attempt to map list length " + "%d to fmr with max %d pages\n", + list_len, mfmr->attr.max_pages); + return -EINVAL; + } + + page_mask = (1 << mfmr->attr.page_size) - 1; + + /* We are getting page lists, so va must be page aligned. 
*/ + if (iova & page_mask) { + mthca_warn(to_mdev(fmr->device), "Attempt to map fmr with page " + "shift %d at misaligned virtual address %016llx\n", + mfmr->attr.page_size, iova); + return -EINVAL; + } + + /* Trust the user not to pass misaligned data in page_list */ + if (0) + for (i = 0; i < list_len; ++i) { + if (page_list[i] & page_mask) { + mthca_warn(to_mdev(fmr->device), "Attempt to " + "map fmr with page shift %d at " + "address %016llx\n", + mfmr->attr.page_size, page_list[i]); + return -EINVAL; + } + } + + return mthca_fmr_map(to_mdev(fmr->device), to_mfmr(fmr), page_list, + list_len, iova); +} + +static int mthca_unmap_fmr(struct list_head *fmr_list) +{ + struct mthca_dev* mdev = NULL; + struct ib_fmr *fmr; + int err; + u8 status; + + list_for_each_entry(fmr, fmr_list, list) { + mdev = to_mdev(fmr->device); + mthca_fmr_unmap(mdev, to_mfmr(fmr)); + } + + if (!mdev) + return 0; + + err = mthca_SYNC_TPT(mdev, &status); + if (err) + return err; + if (status) + return -EINVAL; return 0; } @@ -636,6 +729,15 @@ int mthca_register_device(struct mthca_d dev->ib_dev.get_dma_mr = mthca_get_dma_mr; dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; dev->ib_dev.dereg_mr = mthca_dereg_mr; + + if (dev->hca_type == ARBEL_NATIVE || + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + dev->ib_dev.alloc_fmr = mthca_alloc_fmr; + dev->ib_dev.map_phys_fmr = mthca_map_phys_fmr; + dev->ib_dev.unmap_fmr = mthca_unmap_fmr; + dev->ib_dev.dealloc_fmr = mthca_dealloc_fmr; + } + dev->ib_dev.attach_mcast = mthca_multicast_attach; dev->ib_dev.detach_mcast = mthca_multicast_detach; dev->ib_dev.process_mad = mthca_process_mad; Index: hw/mthca/mthca_provider.h =================================================================== --- hw/mthca/mthca_provider.h (revision 2012) +++ hw/mthca/mthca_provider.h (working copy) @@ -60,6 +60,16 @@ struct mthca_mr { u32 first_seg; }; +struct mthca_fmr { + struct ib_fmr ibmr; + struct ib_fmr_attr attr; + int order; + u32 first_seg; + int maps; + struct 
mthca_mpt_entry __iomem *mpt; /* Tavor only */ + u64 __iomem *mtts; /* Tavor only */ +}; + struct mthca_pd { struct ib_pd ibpd; u32 pd_num; @@ -218,6 +228,11 @@ struct mthca_sqp { dma_addr_t header_dma; }; +static inline struct mthca_fmr *to_mfmr(struct ib_fmr *ibmr) +{ + return container_of(ibmr, struct mthca_fmr, ibmr); +} + static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) { return container_of(ibmr, struct mthca_mr, ibmr); Index: hw/mthca/mthca_profile.c =================================================================== --- hw/mthca/mthca_profile.c (revision 2012) +++ hw/mthca/mthca_profile.c (working copy) @@ -223,9 +223,10 @@ u64 mthca_make_profile(struct mthca_dev init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); break; case MTHCA_RES_MPT: - dev->limits.num_mpts = profile[i].num; - init_hca->mpt_base = profile[i].start; - init_hca->log_mpt_sz = profile[i].log_num; + dev->limits.num_mpts = profile[i].num; + dev->mr_table.mpt_base = profile[i].start; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; break; case MTHCA_RES_MTT: dev->limits.num_mtt_segs = profile[i].num; Index: hw/mthca/mthca_cmd.c =================================================================== --- hw/mthca/mthca_cmd.c (revision 2012) +++ hw/mthca/mthca_cmd.c (working copy) @@ -1406,6 +1406,11 @@ int mthca_WRITE_MTT(struct mthca_dev *de return err; } +int mthca_SYNC_TPT(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYNC_TPT, CMD_TIME_CLASS_B, status); +} + int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, int eq_num, u8 *status) { Index: hw/mthca/mthca_cmd.h =================================================================== --- hw/mthca/mthca_cmd.h (revision 2012) +++ hw/mthca/mthca_cmd.h (working copy) @@ -277,6 +277,7 @@ int mthca_HW2SW_MPT(struct mthca_dev *de int mpt_index, u8 *status); int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, int num_mtt, u8 *status); +int 
mthca_SYNC_TPT(struct mthca_dev *dev, u8 *status); int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, int eq_num, u8 *status); int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, Index: hw/mthca/mthca_mr.c =================================================================== --- hw/mthca/mthca_mr.c (revision 2012) +++ hw/mthca/mthca_mr.c (working copy) @@ -368,11 +368,13 @@ err_out_table: mthca_table_put(dev, dev->mr_table.mpt_table, key); err_out_mpt_free: - mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + mthca_free(&dev->mr_table.mpt_alloc, key); return err; } -void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +/* Free mr or fmr */ +static void mthca_free_region(struct mthca_dev *dev, u32 lkey, int order, + u32 first_seg) { int err; u8 status; @@ -380,7 +382,7 @@ void mthca_free_mr(struct mthca_dev *dev might_sleep(); err = mthca_HW2SW_MPT(dev, NULL, - key_to_hw_index(dev, mr->ibmr.lkey) & + key_to_hw_index(dev, lkey) & (dev->limits.num_mpts - 1), &status); if (err) @@ -389,15 +391,276 @@ void mthca_free_mr(struct mthca_dev *dev mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", status); - if (mr->order >= 0) - mthca_free_mtt(dev, mr->first_seg, mr->order); + if (order >= 0) + mthca_free_mtt(dev, first_seg, order); if (dev->hca_type == ARBEL_NATIVE) mthca_table_put(dev, dev->mr_table.mpt_table, - key_to_hw_index(dev, mr->ibmr.lkey)); - mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, mr->ibmr.lkey)); + key_to_hw_index(dev, lkey)); + mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, lkey)); +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + mthca_free_region(dev, mr->ibmr.lkey, mr->order, mr->first_seg); +} + +int mthca_fmr_alloc(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_fmr *mr) +{ + void *mailbox; + u64 mtt_seg; + struct mthca_mpt_entry *mpt_entry; + u32 key, idx; + int err = -ENOMEM; + u8 status; + int i; + int list_len = mr->attr.max_pages; + + might_sleep(); + 
+ BUG_ON(mr->attr.page_size < 12); + WARN_ON(mr->attr.page_size >= 32); + + mr->maps = 0; + + key = mthca_alloc(&dev->mr_table.mpt_alloc); + if (key == -1) + return -ENOMEM; + idx = key & (dev->limits.num_mpts - 1); + mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); + + if (dev->hca_type == ARBEL_NATIVE) { + err = mthca_table_get(dev, dev->mr_table.mpt_table, key); + if (err) + goto err_out_mpt_free; + } + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + ; /* nothing */ + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_table; + + mtt_seg = dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size; + + mr->mpt = NULL; + mr->mtts = NULL; + + if (dev->hca_type != ARBEL_NATIVE) { + mr->mpt = ioremap(dev->mr_table.mpt_base + + sizeof *(mr->mpt) * idx, sizeof *(mr->mpt)); + if (!mr->mpt) { + mthca_dbg(dev, "Couldn't map MPT entry for fmr %08x.\n", + mr->ibmr.lkey); + goto err_out_free_mtt; + } + mr->mtts = ioremap(mtt_seg, list_len * sizeof *(mr->mtts)); + if (!mr->mtts) { + mthca_dbg(dev, "Couldn't map MTT entry %016llx " + "(size %x) for fmr %08x.\n", mtt_seg, + list_len * sizeof u64, mr->ibmr.lkey); + goto err_out_free_mtt; + } + } + + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(mr->attr.page_size - 12); + mpt_entry->key = cpu_to_be32(key); + mpt_entry->pd = cpu_to_be32(pd); + memset(&mpt_entry->start, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, start)); + mpt_entry->mtt_seg = cpu_to_be64(mtt_seg); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 
4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + err = mthca_SW2HW_MPT(dev, mpt_entry, + key & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + +err_out_free_mtt: + if (mr->mtts) + iounmap(mr->mtts); + if (mr->mpt) + iounmap(mr->mpt); + mthca_free_mtt(dev, mr->first_seg, mr->order); + +err_out_table: + if (dev->hca_type == ARBEL_NATIVE) + mthca_table_put(dev, dev->mr_table.mpt_table, key); + +err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, key); + return err; +} + +void mthca_free_fmr(struct mthca_dev *dev, struct mthca_fmr *fmr) +{ + mthca_free_region(dev, fmr->ibmr.lkey, fmr->order, fmr->first_seg); +} + +#define MTHCA_MPT_STATUS_SW 0xF +#define MTHCA_MPT_STATUS_HW 0x0 + +static void mthca_tavor_fmr_map(struct mthca_dev *dev, struct mthca_fmr *fmr, + u64 *page_list, int list_len, u64 iova, u32 key) +{ + struct mthca_mpt_entry mpt_entry; + int i; + + writeb(MTHCA_MPT_STATUS_SW,fmr->mpt); + + wmb(); + + for (i = 0; i < list_len; ++i) { + u64 mtt_entry = cpu_to_be64(page_list[i] | + MTHCA_MTT_FLAG_PRESENT); + writeq(mtt_entry, fmr->mtts + i); + } + mpt_entry.lkey = cpu_to_be32(key); + mpt_entry.length = cpu_to_be64(((u64)list_len) * + (1 << fmr->attr.page_size)); + mpt_entry.start = cpu_to_be64(iova); + + writel(mpt_entry.lkey, &fmr->mpt->key); + memcpy_toio(&fmr->mpt->start, &mpt_entry.start, + offsetof(struct mthca_mpt_entry, window_count) - + offsetof(struct mthca_mpt_entry, start)); + + wmb(); + + writeb(MTHCA_MPT_STATUS_SW,fmr->mpt); + + wmb(); +} + +static int mthca_arbel_fmr_map(struct mthca_dev *dev, struct mthca_fmr *fmr, + u64 *page_list, int list_len, u64 iova, u32 key) +{ + void* mpt; + struct mthca_mpt_entry *mpt_entry; + u8 *mpt_status; + int i; + + mpt = mthca_table_find(dev, 
dev->mr_table.mpt_table, key); + if (!mpt) + return -EINVAL; + + mpt_status = mpt; + *mpt_status = MTHCA_MPT_STATUS_SW; + + wmb(); + + /* This is really dumb. We are rescanning the ICM on + * each mpt entry. We want some kind of iterator here. + * May be fine meanwhile, while things are small. */ + for (i = 0; i < list_len; ++i) { + u64 *mtt_entry = mthca_table_find(dev, dev->mr_table.mtt_table, + fmr->first_seg + i); + if (!mtt_entry) + return -EINVAL; + + *mtt_entry = cpu_to_be64(page_list[i] | MTHCA_MTT_FLAG_PRESENT); + } + + + mpt_entry = mpt; + mpt_entry->lkey = mpt_entry->key = cpu_to_be32(key); + mpt_entry->length = cpu_to_be64(((u64)list_len) * + (1 << fmr->attr.page_size)); + mpt_entry->start = cpu_to_be64(iova); + + wmb(); + + *mpt_status = MTHCA_MPT_STATUS_HW; + + wmb(); + return 0; +} + + +int mthca_fmr_map(struct mthca_dev *dev, struct mthca_fmr *fmr, + u64 *page_list, int list_len, u64 iova) +{ + u32 key; + + if (fmr->maps >= fmr->attr.max_maps) { + mthca_warn(dev, "Attempt to map fmr %d times, max_maps is %d\n", + fmr->maps, fmr->attr.max_maps); + return -EINVAL; + } + + key = key_to_hw_index(dev, fmr->ibmr.lkey) + dev->limits.num_mpts; + fmr->ibmr.lkey = fmr->ibmr.rkey = hw_index_to_key(dev, key); + fmr->maps++; + + if (dev->hca_type == ARBEL_NATIVE) { + return mthca_arbel_fmr_map(dev, fmr, page_list, list_len, + iova, key); + } else { + mthca_tavor_fmr_map(dev, fmr, page_list, list_len, + iova, key); + return 0; + } +} + +void mthca_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr) +{ + if (!fmr->maps) { + return; + } + + fmr->maps = 0; + + if (dev->hca_type == ARBEL_NATIVE) { + u32 key = key_to_hw_index(dev, fmr->ibmr.lkey); + u8 *mpt_status = mthca_table_find(dev, dev->mr_table.mpt_table, + key); + if (!mpt_status) + return; + + *mpt_status = MTHCA_MPT_STATUS_SW; + wmb(); + } else { + writeb(MTHCA_MPT_STATUS_SW,fmr->mpt); + wmb(); + } } + int __devinit mthca_init_mr_table(struct mthca_dev *dev) { int err; Index: hw/mthca/mthca_memfree.c 
=================================================================== --- hw/mthca/mthca_memfree.c (revision 2012) +++ hw/mthca/mthca_memfree.c (working copy) @@ -192,6 +192,47 @@ void mthca_table_put(struct mthca_dev *d up(&table->mutex); } +/* Nonblocking. Callers must make sure the object exists by serializing against + * callers of get/put. */ +void *mthca_table_find(struct mthca_dev *dev, struct mthca_icm_table *table, + int obj) +{ + int idx, offset, i; + struct mthca_icm_chunk *chunk; + struct mthca_icm *icm; + struct page *page = NULL; + + /* Supported only for low mem tables for now. */ + if (!table->lowmem) + return NULL; + + idx = (obj & (table->num_obj - 1)) * table->obj_size; + icm = table->icm[idx / MTHCA_TABLE_CHUNK_SIZE]; + offset = idx % MTHCA_TABLE_CHUNK_SIZE; + + if(!icm) + return NULL; + + /* Linear scan of ICM on each access. Since this is called on fmr + * registration which is on data path, eventually we may want to + * rearrange things to use some kind of tree. */ + + list_for_each_entry(chunk, &icm->chunk_list, list) { + for (i = 0; i < chunk->npages; ++i) { + if (chunk->mem[i].length >= offset) { + page = chunk->mem[i].page; + break; + } + offset -= chunk->mem[i].length; + } + } + + if (!page) + return NULL; + + return lowmem_page_address(page) + offset; +} + int mthca_table_get_range(struct mthca_dev *dev, struct mthca_icm_table *table, int start, int end) { -- MST - Michael S. Tsirkin From mst at mellanox.co.il Thu Mar 17 12:25:41 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 Mar 2005 22:25:41 +0200 Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent In-Reply-To: <52y8cmp0mo.fsf@topspin.com> References: <1111068651.4662.2686.camel@localhost.localdomain> <527jk6s3x8.fsf@topspin.com> <1111074189.4662.2859.camel@localhost.localdomain> <523buurzvn.fsf@topspin.com> <1111084531.4662.3556.camel@localhost.localdomain> <52y8cmp0mo.fsf@topspin.com> Message-ID: <20050317202541.GB15221@mellanox.co.il> Quoting r. 
Roland Dreier : > Subject: Re: [PATCH] agent: Add IB ping server agent > > Roland> One other thought just occurred to me -- what advantages > Roland> does this vendor-specific ping MAD have over a LID-routed > Roland> NodeDescription get query? > > Hal> No MKey issue is the first thing that comes to mind. > > Hmm... is someone running their fabric with an M_Key protection level > above 1 going to want a backdoor query class without access control? > Especially if it's unconditionally enabled and reveals information > like hostname? Out of curiosity - what is the M_Key protection level currently set by OpenSm? -- MST - Michael S. Tsirkin From halr at voltaire.com Thu Mar 17 12:36:05 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Mar 2005 15:36:05 -0500 Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent In-Reply-To: <20050317202541.GB15221@mellanox.co.il> References: <1111068651.4662.2686.camel@localhost.localdomain> <527jk6s3x8.fsf@topspin.com> <1111074189.4662.2859.camel@localhost.localdomain> <523buurzvn.fsf@topspin.com> <1111084531.4662.3556.camel@localhost.localdomain> <52y8cmp0mo.fsf@topspin.com> <20050317202541.GB15221@mellanox.co.il> Message-ID: <1111091765.4662.4013.camel@localhost.localdomain> On Thu, 2005-03-17 at 15:25, Michael S. Tsirkin wrote: > Out of curiosity - what is the M_Key protection level currently set by > OpenSm? PortInfo::M_KeyProtectBits = 0 From mst at mellanox.co.il Thu Mar 17 12:43:19 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 17 Mar 2005 22:43:19 +0200 Subject: [openib-general] Re: [PATCH] agent: Add IB ping server agent In-Reply-To: <1111091765.4662.4013.camel@localhost.localdomain> References: <1111068651.4662.2686.camel@localhost.localdomain> <527jk6s3x8.fsf@topspin.com> <1111074189.4662.2859.camel@localhost.localdomain> <523buurzvn.fsf@topspin.com> <1111084531.4662.3556.camel@localhost.localdomain> <52y8cmp0mo.fsf@topspin.com> <20050317202541.GB15221@mellanox.co.il> <1111091765.4662.4013.camel@localhost.localdomain> Message-ID: <20050317204319.GD15221@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: [PATCH] agent: Add IB ping server agent > > On Thu, 2005-03-17 at 15:25, Michael S. Tsirkin wrote: > > Out of curiosity - what is the M_Key protection level currently set by > > OpenSm? > > PortInfo::M_KeyProtectBits = 0 > No mkey issues then :) -- MST - Michael S. Tsirkin From halr at voltaire.com Thu Mar 17 12:39:40 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Mar 2005 15:39:40 -0500 Subject: [openib-general] [RFC] [PATCH] ping: Add IB ping server agent Message-ID: <1111091848.4662.4020.camel@localhost.localdomain> ping: Add IB ping server agent (used with ibping diagnostic tool) Signed-off-by: Shahar Frank Signed-off-by: Hal Rosenstock Index: ping_priv.h =================================================================== --- ping_priv.h (revision 0) +++ ping_priv.h (revision 0) @@ -0,0 +1,61 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef __IB_PING_PRIV_H__ +#define __IB_PING_PRIV_H__ + +#include + +#define SPFX "ib_ping: " + +struct ib_ping_send_wr { + struct list_head send_list; + struct ib_ah *ah; + struct ib_mad_private *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct ib_ping_port_private { + struct list_head port_list; + struct list_head send_posted_list; + spinlock_t send_list_lock; + int port_num; + struct ib_mad_agent *pingd_agent; /* OpenIB Ping class */ +}; + +#endif /* __IB_PING_PRIV_H__ */ Index: ping.h =================================================================== --- ping.h (revision 0) +++ ping.h (revision 0) @@ -0,0 +1,49 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. 
+ * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#ifndef __PING_H_ +#define __PING_H_ + +extern spinlock_t ib_ping_port_list_lock; + +extern int ib_ping_port_open(struct ib_device *device, + int port_num); + +extern int ib_ping_port_close(struct ib_device *device, int port_num); + +#endif /* __PING_H_ */ Index: ping.c =================================================================== --- ping.c (revision 0) +++ ping.c (revision 0) @@ -0,0 +1,336 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#include +#include +#include + +#include "ping_priv.h" +#include "mad_priv.h" +#include "ping.h" + +spinlock_t ib_ping_port_list_lock; +static LIST_HEAD(ib_ping_port_list); + +/* + * Caller must hold ib_ping_port_list_lock + */ +static inline struct ib_ping_port_private * +__ib_get_ping_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_ping_port_private *entry; + + BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ + + if (device) { + list_for_each_entry(entry, &ib_ping_port_list, port_list) { + if (entry->pingd_agent->device == device && + entry->port_num == port_num) + return entry; + } + } else { + list_for_each_entry(entry, &ib_ping_port_list, port_list) { + if (entry->pingd_agent == mad_agent) + return entry; + } + } + return NULL; +} + +static inline struct ib_ping_port_private * +ib_get_ping_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_ping_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_ping_port_list_lock, flags); + entry = __ib_get_ping_port(device, port_num, mad_agent); + spin_unlock_irqrestore(&ib_ping_port_list_lock, flags); + + return entry; +} + +static int ping_mad_send(struct ib_mad_agent *mad_agent, + struct ib_ping_port_private *port_priv, + struct ib_mad_private *mad_priv, + struct ib_grh *grh, + struct ib_wc *wc) +{ + struct ib_ping_send_wr *ping_send_wr; + struct ib_sge gather_list; + struct ib_send_wr send_wr; + struct ib_send_wr *bad_send_wr; + struct ib_ah_attr ah_attr; + unsigned long flags; + int ret = 1; + + ping_send_wr = kmalloc(sizeof(*ping_send_wr), GFP_KERNEL); + if (!ping_send_wr) + goto out; + 
ping_send_wr->mad = mad_priv; + + /* PCI mapping */ + gather_list.addr = dma_map_single(mad_agent->device->dma_device, + &mad_priv->mad, + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + gather_list.length = sizeof(mad_priv->mad); + gather_list.lkey = mad_agent->mr->lkey; + + send_wr.next = NULL; + send_wr.opcode = IB_WR_SEND; + send_wr.sg_list = &gather_list; + send_wr.num_sge = 1; + send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ + send_wr.wr.ud.timeout_ms = 0; + send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; + + ah_attr.dlid = wc->slid; + ah_attr.port_num = mad_agent->port_num; + ah_attr.src_path_bits = wc->dlid_path_bits; + ah_attr.sl = wc->sl; + ah_attr.static_rate = 0; + ah_attr.ah_flags = 0; /* No GRH */ + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_OPENIB_PING) { + if (wc->wc_flags & IB_WC_GRH) { + ah_attr.ah_flags = IB_AH_GRH; + /* Should sgid be looked up ? */ + ah_attr.grh.sgid_index = 0; + ah_attr.grh.hop_limit = grh->hop_limit; + ah_attr.grh.flow_label = be32_to_cpup( + &grh->version_tclass_flow) & 0xfffff; + ah_attr.grh.traffic_class = (be32_to_cpup( + &grh->version_tclass_flow) >> 20) & 0xff; + memcpy(ah_attr.grh.dgid.raw, + grh->sgid.raw, + sizeof(ah_attr.grh.dgid)); + } + } else { + printk(KERN_ERR SPFX "Not OpenIB ping class 0x%x\n", + mad_priv->mad.mad.mad_hdr.mgmt_class); + kfree(ping_send_wr); + goto out; + } + + ping_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); + if (IS_ERR(ping_send_wr->ah)) { + printk(KERN_ERR SPFX "No memory for address handle\n"); + kfree(ping_send_wr); + goto out; + } + + send_wr.wr.ud.ah = ping_send_wr->ah; + send_wr.wr.ud.pkey_index = wc->pkey_index; + send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; + send_wr.wr.ud.mad_hdr = &mad_priv->mad.mad.mad_hdr; + send_wr.wr_id = (unsigned long)ping_send_wr; + + pci_unmap_addr_set(ping_send_wr, mapping, gather_list.addr); + + /* Send */ + spin_lock_irqsave(&port_priv->send_list_lock, flags); + if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { 
+ spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(ping_send_wr, mapping), + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + ib_destroy_ah(ping_send_wr->ah); + kfree(ping_send_wr); + } else { + list_add_tail(&ping_send_wr->send_list, + &port_priv->send_posted_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; + } + +out: + return ret; +} + +static void pingd_recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_ping_port_private *port_priv; + struct ib_vendor_mad *vend; + struct ib_mad_private *recv = container_of(mad_recv_wc, + struct ib_mad_private, + header.recv_wc); + + /* Find matching MAD agent */ + port_priv = ib_get_ping_port(NULL, 0, mad_agent); + if (!port_priv) { + kmem_cache_free(ib_mad_cache, recv); + printk(KERN_ERR SPFX "pingd_recv_handler: no matching MAD " + "agent %p\n", mad_agent); + return; + } + + vend = (struct ib_vendor_mad *)mad_recv_wc->recv_buf.mad; + + vend->mad_hdr.method |= IB_MGMT_METHOD_RESP; + vend->mad_hdr.status = 0; + if (!system_utsname.domainname[0]) + strncpy(vend->data, system_utsname.nodename, sizeof vend->data); + else + snprintf(vend->data, sizeof vend->data, "%s.%s", + system_utsname.nodename, system_utsname.domainname); + + /* Send response */ + if (ping_mad_send(mad_agent, port_priv, recv, + mad_recv_wc->recv_buf.grh, mad_recv_wc->wc)) { + kmem_cache_free(ib_mad_cache, recv); + printk(KERN_ERR SPFX "pingd_recv_handler: reply failed\n"); + } +} + +static void pingd_send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_ping_port_private *port_priv; + struct ib_ping_send_wr *ping_send_wr; + unsigned long flags; + + /* Find matching MAD agent */ + port_priv = ib_get_ping_port(NULL, 0, mad_agent); + if (!port_priv) { + printk(KERN_ERR SPFX "pingd_send_handler: no matching MAD " + "agent %p\n", mad_agent); + return; + } + + ping_send_wr = 
(struct ib_ping_send_wr *)(unsigned long)mad_send_wc->wr_id; + spin_lock_irqsave(&port_priv->send_list_lock, flags); + /* Remove completed send from posted send MAD list */ + list_del(&ping_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + + /* Unmap PCI */ + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(ping_send_wr, mapping), + sizeof(ping_send_wr->mad->mad), + DMA_TO_DEVICE); + + ib_destroy_ah(ping_send_wr->ah); + + /* Release allocated memory */ + kmem_cache_free(ib_mad_cache, ping_send_wr->mad); + kfree(ping_send_wr); +} + +int ib_ping_port_open(struct ib_device *device, int port_num) +{ + int ret; + struct ib_ping_port_private *port_priv; + struct ib_mad_reg_req pingd_reg_req; + unsigned long flags; + + /* First, check if port already open */ + port_priv = ib_get_ping_port(device, port_num, NULL); + if (port_priv) { + printk(KERN_DEBUG SPFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR SPFX "No memory for ib_ping_port_private\n"); + ret = -ENOMEM; + goto error1; + } + + memset(port_priv, 0, sizeof *port_priv); + port_priv->port_num = port_num; + spin_lock_init(&port_priv->send_list_lock); + INIT_LIST_HEAD(&port_priv->send_posted_list); + + pingd_reg_req.mgmt_class = IB_MGMT_CLASS_OPENIB_PING; + pingd_reg_req.mgmt_class_version = 1; + pingd_reg_req.oui[0] = (IB_OPENIB_OUI >> 16) & 0xff; + pingd_reg_req.oui[1] = (IB_OPENIB_OUI >> 8) & 0xff; + pingd_reg_req.oui[2] = IB_OPENIB_OUI & 0xff; + set_bit(IB_MGMT_METHOD_GET, pingd_reg_req.method_mask); + + /* Obtain server MAD agent for OpenIB Ping class (GSI QP) */ + port_priv->pingd_agent = ib_register_mad_agent(device, port_num, + IB_QPT_GSI, + &pingd_reg_req, 0, + &pingd_send_handler, + &pingd_recv_handler, + NULL); + if (IS_ERR(port_priv->pingd_agent)) { + ret = PTR_ERR(port_priv->pingd_agent); + goto error2; + } + 
+ spin_lock_irqsave(&ib_ping_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_ping_port_list); + spin_unlock_irqrestore(&ib_ping_port_list_lock, flags); + + return 0; + +error2: + kfree(port_priv); +error1: + return ret; +} + +int ib_ping_port_close(struct ib_device *device, int port_num) +{ + struct ib_ping_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_ping_port_list_lock, flags); + port_priv = __ib_get_ping_port(device, port_num, NULL); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_ping_port_list_lock, flags); + printk(KERN_ERR SPFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_ping_port_list_lock, flags); + + ib_unregister_mad_agent(port_priv->pingd_agent); + kfree(port_priv); + + return 0; +} Index: mad.c =================================================================== --- mad.c (revision 2019) +++ mad.c (working copy) @@ -37,7 +37,9 @@ #include "mad_priv.h" #include "smi.h" #include "agent.h" +#include "ping.h" + MODULE_LICENSE("Dual BSD/GPL"); MODULE_DESCRIPTION("kernel IB MAD API"); MODULE_AUTHOR("Hal Rosenstock"); @@ -2624,6 +2626,12 @@ device->name, cur_port); goto error_device_open; } + ret = ib_ping_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d " + "for ping agent\n", + device->name, cur_port); + } } goto error_device_query; @@ -2631,6 +2639,12 @@ error_device_open: while (i > 0) { cur_port--; + ret2 = ib_ping_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for ping agent\n", + device->name, cur_port); + } ret2 = ib_agent_port_close(device, cur_port); if (ret2) { printk(KERN_ERR PFX "Couldn't close %s port %d " @@ -2661,6 +2675,12 @@ cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { + ret2 = ib_ping_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for ping agent\n", + 
device->name, cur_port); + } ret2 = ib_agent_port_close(device, cur_port); if (ret2) { printk(KERN_ERR PFX "Couldn't close %s port %d " @@ -2691,6 +2711,7 @@ spin_lock_init(&ib_mad_port_list_lock); spin_lock_init(&ib_agent_port_list_lock); + spin_lock_init(&ib_ping_port_list_lock); ib_mad_cache = kmem_cache_create("ib_mad", sizeof(struct ib_mad_private), Index: Makefile =================================================================== --- Makefile (revision 2017) +++ Makefile (working copy) @@ -5,7 +5,7 @@ ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o -ib_mad-y := mad.o smi.o agent.o +ib_mad-y := mad.o smi.o agent.o ping.o ib_cm-y := cm.o From mshefty at ichips.intel.com Thu Mar 17 12:55:27 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 17 Mar 2005 12:55:27 -0800 Subject: [openib-general] [RFC] [PATCH] ping: Add IB ping server agent In-Reply-To: <1111091848.4662.4020.camel@localhost.localdomain> References: <1111091848.4662.4020.camel@localhost.localdomain> Message-ID: <4239EEBF.8000006@ichips.intel.com> Hal Rosenstock wrote: > ping: Add IB ping server agent (used with ibping diagnostic tool) {snip} > + ret = ib_ping_port_open(device, cur_port); > + if (ret) { > + printk(KERN_ERR PFX "Couldn't open %s port %d " > + "for ping agent\n", > + device->name, cur_port); > + } > } My preference would still be to keep this a separate module. IMO, we need to improve the encapsulation of the MAD layer, not weaken it, possibly removing all MAD agents from mad.c. 
- Sean

From halr at voltaire.com Thu Mar 17 13:20:13 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Mar 2005 16:20:13 -0500
Subject: [openib-general] [RFC] [PATCH] ping: Add IB ping server agent
In-Reply-To: <4239EEBF.8000006@ichips.intel.com>
References: <1111091848.4662.4020.camel@localhost.localdomain> <4239EEBF.8000006@ichips.intel.com>
Message-ID: <1111093245.4662.4114.camel@localhost.localdomain>

On Thu, 2005-03-17 at 15:55, Sean Hefty wrote:
> My preference would still be to keep this a separate module. IMO, we
> need to improve the encapsulation of the MAD layer, not weaken it,
> possibly removing all MAD agents from mad.c.

OK but I don't see what would bring it in other than manually loading it
:-( Any ideas ?

-- Hal

From greg at kroah.com Thu Mar 17 10:14:17 2005
From: greg at kroah.com (Greg KH)
Date: Thu, 17 Mar 2005 10:14:17 -0800
Subject: [openib-general] [PATCH] Add PCI device ID for new Mellanox HCA
In-Reply-To: <52oee3pbaw.fsf@topspin.com>
References: <52fyzfrk29.fsf@topspin.com> <52oee3pbaw.fsf@topspin.com>
Message-ID: <20050317181417.GA3743@kroah.com>

On Tue, Mar 01, 2005 at 08:42:47AM -0800, Roland Dreier wrote:
> Hi Greg,
>
> It turns out that Mellanox decided to change the device ID at the last
> minute. So of course there will be parts with both IDs. Here's an
> updated patch that includes both IDs. Please use this instead.

Applied, thanks.
greg k-h

From hozer at hozed.org Thu Mar 17 15:46:05 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Thu, 17 Mar 2005 17:46:05 -0600
Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs
In-Reply-To: <001501c52a06$626c6b10$8000000a@blorp>
References: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> <20050316063928.GV9768@kalmia.hozed.org> <001501c52a06$626c6b10$8000000a@blorp>
Message-ID: <20050317234605.GC9768@kalmia.hozed.org>

On Wed, Mar 16, 2005 at 08:58:54AM -0000, Paul Baxter wrote:
> From: "Troy Benjegerdes"
> >Once I get it built and tested locally, I'll probably stick some results
> >and a link up at http://scl.ameslab.gov/Projects/InfiniBand/
> >
> >Sooo... what's the easiest way for me to test this if I have opterons
> >with 2.6.11.4 kernels?
> >
> >(aka, just replace drivers/infiniband from the roland-uverbs branch? And
> >does anyone have a clean way of building all the userspace stuff? What
> >I've seen so far is pretty tedious)
>
> Troy,
>
> While I appreciate your keenness, I think it's a little unfair to criticise
> the build status and organisation of code that is still being written and
> is subject to change. I'd far rather everyone gets a working core before
> worrying so much about how it might be packaged. That does need to be
> addressed, of course.
>
> Your comments at your URL regarding complexity and size of the software
> stack making progress slow are IMHO unfair to openib.
> They've worked hard on getting a streamlined set of functionality into the
> kernel and now need to finish off key parts of userspace support and only
> then 'package' it so that you will find it easier to compile and test.

Actually, my comments about InfiniBand software referred to all the
previous vendor "released" software stacks.

It's been about two days since I started looking at this stuff, and I'm
now able to run Netpipe. I haven't crashed the kernel once from the
infiniband driver. I'm quite impressed with openib progress.
OpenIB is relatively modular, and the individual modules are nicely
named, and much smaller than anything else I've seen for IB drivers. In
fact, the code size of the IB support looks to be smaller than the ipv6
module.

From hozer at hozed.org Thu Mar 17 16:11:56 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Thu, 17 Mar 2005 18:11:56 -0600
Subject: [openib-general] Port of NetPIPE-3.6.2 to OpenIB userspace verbs
In-Reply-To: <523buvwn13.fsf@topspin.com>
References: <1AC79F16F5C5284499BB9591B33D6F0003D455D0@orsmsx408> <20050316063928.GV9768@kalmia.hozed.org> <523buvwn13.fsf@topspin.com>
Message-ID: <20050318001156.GD9768@kalmia.hozed.org>

On Wed, Mar 16, 2005 at 08:52:56AM -0800, Roland Dreier wrote:
> Troy> Sooo... what's the easiest way for me to test this if I have
> Troy> opterons with 2.6.11.4 kernels?
>
> Troy> (aka, just replace drivers/infiniband from the roland-uverbs
> Troy> branch? And does anyone have a clean way of building all the
> Troy> userspace stuff? What I've seen so far is pretty tedious)
>
> Yes, the roland-uverbs src/linux-kernel/infiniband directory should
> just drop in and replace the existing drivers/infiniband. You'll want
> to turn on CONFIG_INFINIBAND_USER_VERBS in your config (a new option)
> to enable userspace verbs, load the ib_uverbs module (if you don't
> build support into your kernel), and create /dev/infiniband/uverbs
> device nodes (easiest way is to add
>
> KERNEL="uverbs*", NAME="infiniband/%k", MODE="0666"
>
> to your udev rules).
>
> To build the userspace verbs support, you just need to build
> libibverbs and libmthca libraries (using the usual "./autogen.sh &&
> ./configure && make && make install" recipe). I agree that the
> management subdirectory has a few too many little pieces right now,
> but it's not needed if you already have a subnet manager running
> somewhere.

The management/Makefile and management/Readme in roland-uverbs were very
useful.
The only real gotcha I had was I didn't install in /usr/local/lib, and
set LD_LIBRARY_PATH to point to the libraries; thus libibverbs couldn't
find libmthca.so, and the error messages weren't particularly
informative.

From roland at topspin.com Thu Mar 17 19:59:31 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 17 Mar 2005 19:59:31 -0800
Subject: [openib-general] [PATCH] fmr support in mthca
In-Reply-To: <20050317201646.GA15221@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 17 Mar 2005 22:16:46 +0200")
References: <20050317201646.GA15221@mellanox.co.il>
Message-ID: <521xadob8c.fsf@topspin.com>

Thanks for implementing this. You saved me a lot of work and Libor
will be very happy when he gets back next week.

Some comments from my first read through:

> This patch implements FMR support. I also rolled into it two fixes
> for regular mrs that I posted previously, let me know if it's a problem.

No problem although I'll apply them separately.

> This seems to be working fine for me, although I only did relatively basic
> tests. Both Tavor and Arbel Native modes are supported. I made some tradeoffs
> for simplicity, let me know what you think:
> - for tavor, I keep for each fmr two pages mapped: one for mpt and one for
>   mtt access. This spends more kernel virtual memory than could be used,
>   since many mpts could share one page. Alternatives are:
>   map/unmap io memory on each fmr map/unmap request, or
>   keep an intermediate table to map each page only once.

I don't think this is acceptable. Each ioremap has to map at least
one page plus a guard page. With two ioremaps per FMR, every FMR is
using 16K (or more) of vmalloc space. On 64 bit archs, this doesn't
matter, but on a large memory i386 machine, there's less than 128 MB
of vmalloc space available (possibly a lot less if someone is using a
video card with a big frame buffer or something). That means we're
limited to a few thousand FMRs, which isn't enough.
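
A rough check of the arithmetic above, as a sketch only: it assumes 4 KB
pages (so one mapped page plus one guard page costs 8 KB per ioremap) and
treats the full 128 MB of vmalloc space as available, which is the best
case described above. The macro and function names are illustrative and
do not come from mthca.

```c
/* Assumptions (not taken from the driver): 4 KB pages, 128 MB vmalloc. */
#define PAGE_BYTES        4096UL
#define VMALLOC_BYTES     (128UL * 1024UL * 1024UL)
#define IOREMAPS_PER_FMR  2UL	/* one for the MPT, one for the MTTs */

/* Each ioremap consumes at least one mapped page plus one guard page. */
static unsigned long vmalloc_bytes_per_fmr(void)
{
	return IOREMAPS_PER_FMR * 2UL * PAGE_BYTES;
}

/* Upper bound on FMRs if nothing else were using vmalloc space. */
static unsigned long max_fmrs(void)
{
	return VMALLOC_BYTES / vmalloc_bytes_per_fmr();
}
```

Under these assumptions each FMR costs 16 KB of vmalloc space and the
ceiling is 8192 FMRs, before anything else in the kernel takes its share.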
What if we just reserve something like 64K MPTs and MTTs for FMRs and ioremap everything at driver startup? That would only use a few MB of vmalloc space and probably simplify the code too. > - icm that has the mpts/mtts is linearly scanned and this is repeated > for each mtt on each fmr map. This may be improved somewhat > with some kind of an iterator, but to really speed things up > the icm data structure (list of arrays) would have to > be replaced by some kind of tree. I don't understand this. I'm probably missing something but the addresses don't change after we allocate the FMR, right? It seems we could just store the MPT/MTT address in the FMR structure the same way we do for Tavor mode. Some more nitpicky comments below... > > Signed-off-by: Michael S. Tsirkin > > Index: hw/mthca/mthca_dev.h > =================================================================== > --- hw/mthca/mthca_dev.h (revision 2012) > +++ hw/mthca/mthca_dev.h (working copy) > @@ -163,6 +163,7 @@ struct mthca_mr_table { > int max_mtt_order; > unsigned long **mtt_buddy; > u64 mtt_base; > + u64 mpt_base; /* Tavor only */ > struct mthca_icm_table *mtt_table; > struct mthca_icm_table *mpt_table; > }; > @@ -363,7 +364,17 @@ int mthca_mr_alloc_phys(struct mthca_dev > u64 *buffer_list, int buffer_size_shift, > int list_len, u64 iova, u64 total_size, > u32 access, struct mthca_mr *mr); > -void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); > +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); > + > +int mthca_fmr_alloc(struct mthca_dev *dev, u32 pd, > + u32 access, struct mthca_fmr *fmr); > + > +int mthca_fmr_map(struct mthca_dev *dev, struct mthca_fmr *fmr, > + u64 *page_list, int list_len, u64 iova); > + > +void mthca_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr); > + > +void mthca_free_fmr(struct mthca_dev *dev, struct mthca_fmr *fmr); > > int mthca_map_eq_icm(struct mthca_dev *dev, u64 icm_virt); > void mthca_unmap_eq_icm(struct mthca_dev *dev); > Index: 
hw/mthca/mthca_memfree.h > =================================================================== > --- hw/mthca/mthca_memfree.h (revision 2012) > +++ hw/mthca/mthca_memfree.h (working copy) > @@ -90,6 +90,11 @@ int mthca_table_get_range(struct mthca_d > void mthca_table_put_range(struct mthca_dev *dev, struct mthca_icm_table *table, > int start, int end); > > +/* Nonblocking. Callers must make sure the object exists by serializing against > + * callers of get/put. */ > +void *mthca_table_find(struct mthca_dev *dev, struct mthca_icm_table *table, > + int obj); Can we just make this use the table mutex and only call it when allocating an FMR? > + > static inline void mthca_icm_first(struct mthca_icm *icm, > struct mthca_icm_iter *iter) > { > Index: hw/mthca/mthca_provider.c > =================================================================== > --- hw/mthca/mthca_provider.c (revision 2012) > +++ hw/mthca/mthca_provider.c (working copy) > @@ -568,8 +568,101 @@ static struct ib_mr *mthca_reg_phys_mr(s > > static int mthca_dereg_mr(struct ib_mr *mr) > { > - mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); > - kfree(mr); > + struct mthca_mr *mmr = to_mmr(mr); > + mthca_free_mr(to_mdev(mr->device), mmr); > + kfree(mmr); > + return 0; > +} > + > +static struct ib_fmr *mthca_alloc_fmr(struct ib_pd *pd, int mr_access_flags, > + struct ib_fmr_attr *fmr_attr) > +{ > + struct mthca_fmr *fmr; > + int err; > + fmr = kmalloc(sizeof *fmr, GFP_KERNEL); > + if (!fmr) > + return ERR_PTR(-ENOMEM); > + > + memcpy(&fmr->attr, fmr_attr, sizeof *fmr_attr); > + err = mthca_fmr_alloc(to_mdev(pd->device), to_mpd(pd)->pd_num, > + convert_access(mr_access_flags), fmr); > + > + if (err) { > + kfree(fmr); > + return ERR_PTR(err); > + } > + return &fmr->ibmr; > +} > + > +static int mthca_dealloc_fmr(struct ib_fmr *fmr) > +{ > + struct mthca_fmr *mfmr = to_mfmr(fmr); > + mthca_free_fmr(to_mdev(fmr->device), mfmr); > + kfree(mfmr); > + return 0; > +} > + > +static int mthca_map_phys_fmr(struct ib_fmr 
*fmr, u64 *page_list, int list_len, > + u64 iova) > +{ > + int i, page_mask; > + struct mthca_fmr *mfmr; > + > + mfmr = to_mfmr(fmr); > + > + if (list_len > mfmr->attr.max_pages) { > + mthca_warn(to_mdev(fmr->device), "Attempt to map list length " > + "%d to fmr with max %d pages\n", > + list_len, mfmr->attr.max_pages); > + return -EINVAL; > + } > + > + page_mask = (1 << mfmr->attr.page_size) - 1; > + > + /* We are getting page lists, so va must be page aligned. */ > + if (iova & page_mask) { > + mthca_warn(to_mdev(fmr->device), "Attempt to map fmr with page " > + "shift %d at misaligned virtual address %016llx\n", > + mfmr->attr.page_size, iova); > + return -EINVAL; > + } > + > + /* Trust the user not to pass misaligned data in page_list */ > + if (0) > + for (i = 0; i < list_len; ++i) { > + if (page_list[i] & page_mask) { > + mthca_warn(to_mdev(fmr->device), "Attempt to " > + "map fmr with page shift %d at " > + "address %016llx\n", > + mfmr->attr.page_size, page_list[i]); > + return -EINVAL; > + } > + } > + > + return mthca_fmr_map(to_mdev(fmr->device), to_mfmr(fmr), page_list, > + list_len, iova); > +} > + > +static int mthca_unmap_fmr(struct list_head *fmr_list) > +{ > + struct mthca_dev* mdev = NULL; > + struct ib_fmr *fmr; > + int err; > + u8 status; > + > + list_for_each_entry(fmr, fmr_list, list) { > + mdev = to_mdev(fmr->device); > + mthca_fmr_unmap(mdev, to_mfmr(fmr)); > + } > + > + if (!mdev) > + return 0; > + > + err = mthca_SYNC_TPT(mdev, &status); > + if (err) > + return err; > + if (status) > + return -EINVAL; > return 0; > } > > @@ -636,6 +729,15 @@ int mthca_register_device(struct mthca_d > dev->ib_dev.get_dma_mr = mthca_get_dma_mr; > dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; > dev->ib_dev.dereg_mr = mthca_dereg_mr; > + > + if (dev->hca_type == ARBEL_NATIVE || > + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { > + dev->ib_dev.alloc_fmr = mthca_alloc_fmr; > + dev->ib_dev.map_phys_fmr = mthca_map_phys_fmr; > + dev->ib_dev.unmap_fmr = 
mthca_unmap_fmr; > + dev->ib_dev.dealloc_fmr = mthca_dealloc_fmr; > + } > + > dev->ib_dev.attach_mcast = mthca_multicast_attach; > dev->ib_dev.detach_mcast = mthca_multicast_detach; > dev->ib_dev.process_mad = mthca_process_mad; > Index: hw/mthca/mthca_provider.h > =================================================================== > --- hw/mthca/mthca_provider.h (revision 2012) > +++ hw/mthca/mthca_provider.h (working copy) > @@ -60,6 +60,16 @@ struct mthca_mr { > u32 first_seg; > }; > > +struct mthca_fmr { > + struct ib_fmr ibmr; > + struct ib_fmr_attr attr; > + int order; > + u32 first_seg; > + int maps; > + struct mthca_mpt_entry __iomem *mpt; /* Tavor only */ > + u64 __iomem *mtts; /* Tavor only */ > +}; > + > struct mthca_pd { > struct ib_pd ibpd; > u32 pd_num; > @@ -218,6 +228,11 @@ struct mthca_sqp { > dma_addr_t header_dma; > }; > > +static inline struct mthca_fmr *to_mfmr(struct ib_fmr *ibmr) > +{ > + return container_of(ibmr, struct mthca_fmr, ibmr); > +} > + > static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) > { > return container_of(ibmr, struct mthca_mr, ibmr); > Index: hw/mthca/mthca_profile.c > =================================================================== > --- hw/mthca/mthca_profile.c (revision 2012) > +++ hw/mthca/mthca_profile.c (working copy) > @@ -223,9 +223,10 @@ u64 mthca_make_profile(struct mthca_dev > init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); > break; > case MTHCA_RES_MPT: > - dev->limits.num_mpts = profile[i].num; > - init_hca->mpt_base = profile[i].start; > - init_hca->log_mpt_sz = profile[i].log_num; > + dev->limits.num_mpts = profile[i].num; > + dev->mr_table.mpt_base = profile[i].start; > + init_hca->mpt_base = profile[i].start; > + init_hca->log_mpt_sz = profile[i].log_num; > break; > case MTHCA_RES_MTT: > dev->limits.num_mtt_segs = profile[i].num; > Index: hw/mthca/mthca_cmd.c > =================================================================== > --- hw/mthca/mthca_cmd.c (revision 2012) > +++ 
hw/mthca/mthca_cmd.c (working copy) > @@ -1406,6 +1406,11 @@ int mthca_WRITE_MTT(struct mthca_dev *de > return err; > } > > +int mthca_SYNC_TPT(struct mthca_dev *dev, u8 *status) > +{ > + return mthca_cmd(dev, 0, 0, 0, CMD_SYNC_TPT, CMD_TIME_CLASS_B, status); > +} > + > int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, > int eq_num, u8 *status) > { > Index: hw/mthca/mthca_cmd.h > =================================================================== > --- hw/mthca/mthca_cmd.h (revision 2012) > +++ hw/mthca/mthca_cmd.h (working copy) > @@ -277,6 +277,7 @@ int mthca_HW2SW_MPT(struct mthca_dev *de > int mpt_index, u8 *status); > int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, > int num_mtt, u8 *status); > +int mthca_SYNC_TPT(struct mthca_dev *dev, u8 *status); > int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, > int eq_num, u8 *status); > int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, > Index: hw/mthca/mthca_mr.c > =================================================================== > --- hw/mthca/mthca_mr.c (revision 2012) > +++ hw/mthca/mthca_mr.c (working copy) > @@ -368,11 +368,13 @@ err_out_table: > mthca_table_put(dev, dev->mr_table.mpt_table, key); > > err_out_mpt_free: > - mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); > + mthca_free(&dev->mr_table.mpt_alloc, key); > return err; > } > > -void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) > +/* Free mr or fmr */ > +static void mthca_free_region(struct mthca_dev *dev, u32 lkey, int order, > + u32 first_seg) > { > int err; > u8 status; > @@ -380,7 +382,7 @@ void mthca_free_mr(struct mthca_dev *dev > might_sleep(); > > err = mthca_HW2SW_MPT(dev, NULL, > - key_to_hw_index(dev, mr->ibmr.lkey) & > + key_to_hw_index(dev, lkey) & > (dev->limits.num_mpts - 1), > &status); > if (err) > @@ -389,15 +391,276 @@ void mthca_free_mr(struct mthca_dev *dev > mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", > status); > > - if (mr->order >= 0) > - 
mthca_free_mtt(dev, mr->first_seg, mr->order); > + if (order >= 0) > + mthca_free_mtt(dev, first_seg, order); > > if (dev->hca_type == ARBEL_NATIVE) > mthca_table_put(dev, dev->mr_table.mpt_table, > - key_to_hw_index(dev, mr->ibmr.lkey)); > - mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, mr->ibmr.lkey)); > + key_to_hw_index(dev, lkey)); > + mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, lkey)); > +} > + > +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) > +{ > + mthca_free_region(dev, mr->ibmr.lkey, mr->order, mr->first_seg); > +} > + > +int mthca_fmr_alloc(struct mthca_dev *dev, u32 pd, > + u32 access, struct mthca_fmr *mr) > +{ > + void *mailbox; > + u64 mtt_seg; > + struct mthca_mpt_entry *mpt_entry; > + u32 key, idx; > + int err = -ENOMEM; > + u8 status; > + int i; > + int list_len = mr->attr.max_pages; > + > + might_sleep(); > + > + BUG_ON(mr->attr.page_size < 12); > + WARN_ON(mr->attr.page_size >= 32); Why is one a BUG and one a WARN? Why not just return an error in both cases? 
Something like: if (mr->attr.page_size < 12 || mr->attr.page_size >= 32) return -EINVAL; > + > + mr->maps = 0; > + > + key = mthca_alloc(&dev->mr_table.mpt_alloc); > + if (key == -1) > + return -ENOMEM; > + idx = key & (dev->limits.num_mpts - 1); > + mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); > + > + if (dev->hca_type == ARBEL_NATIVE) { > + err = mthca_table_get(dev, dev->mr_table.mpt_table, key); > + if (err) > + goto err_out_mpt_free; > + } > + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; > + i < list_len; > + i <<= 1, ++mr->order) > + ; /* nothing */ > + > + mr->first_seg = mthca_alloc_mtt(dev, mr->order); > + if (mr->first_seg == -1) > + goto err_out_table; > + > + mtt_seg = dev->mr_table.mtt_base + > + mr->first_seg * dev->limits.mtt_seg_size; > + > + mr->mpt = NULL; > + mr->mtts = NULL; > + > + if (dev->hca_type != ARBEL_NATIVE) { > + mr->mpt = ioremap(dev->mr_table.mpt_base + > + sizeof *(mr->mpt) * idx, sizeof *(mr->mpt)); > + if (!mr->mpt) { > + mthca_dbg(dev, "Couldn't map MPT entry for fmr %08x.\n", > + mr->ibmr.lkey); > + goto err_out_free_mtt; > + } > + mr->mtts = ioremap(mtt_seg, list_len * sizeof *(mr->mtts)); > + if (!mr->mtts) { > + mthca_dbg(dev, "Couldn't map MTT entry %016llx " > + "(size %x) for fmr %08x.\n", mtt_seg, > + list_len * sizeof u64, mr->ibmr.lkey); > + goto err_out_free_mtt; > + } > + } > + > + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, > + GFP_KERNEL); > + if (!mailbox) > + goto err_out_free_mtt; > + > + mpt_entry = MAILBOX_ALIGN(mailbox); > + > + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | > + MTHCA_MPT_FLAG_MIO | > + MTHCA_MPT_FLAG_REGION | > + access); > + > + mpt_entry->page_size = cpu_to_be32(mr->attr.page_size - 12); > + mpt_entry->key = cpu_to_be32(key); > + mpt_entry->pd = cpu_to_be32(pd); > + memset(&mpt_entry->start, 0, > + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, start)); > + mpt_entry->mtt_seg = cpu_to_be64(mtt_seg); > + > + if (0) { > + mthca_dbg(dev, 
"Dumping MPT entry %08x:\n", mr->ibmr.lkey); > + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { > + if (i % 4 == 0) > + printk("[%02x] ", i * 4); > + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); > + if ((i + 1) % 4 == 0) > + printk("\n"); > + } > + } > + > + err = mthca_SW2HW_MPT(dev, mpt_entry, > + key & (dev->limits.num_mpts - 1), > + &status); > + if (err) > + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); > + else if (status) { > + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", > + status); > + err = -EINVAL; > + } > + > + kfree(mailbox); > + return err; > + > +err_out_free_mtt: > + if (mr->mtts) > + iounmap(mr->mtts); > + if (mr->mpt) > + iounmap(mr->mpt); > + mthca_free_mtt(dev, mr->first_seg, mr->order); > + > +err_out_table: > + if (dev->hca_type == ARBEL_NATIVE) > + mthca_table_put(dev, dev->mr_table.mpt_table, key); > + > +err_out_mpt_free: > + mthca_free(&dev->mr_table.mpt_alloc, key); > + return err; > +} > + > +void mthca_free_fmr(struct mthca_dev *dev, struct mthca_fmr *fmr) > +{ > + mthca_free_region(dev, fmr->ibmr.lkey, fmr->order, fmr->first_seg); > +} > + > +#define MTHCA_MPT_STATUS_SW 0xF > +#define MTHCA_MPT_STATUS_HW 0x0 > + > +static void mthca_tavor_fmr_map(struct mthca_dev *dev, struct mthca_fmr *fmr, > + u64 *page_list, int list_len, u64 iova, u32 key) > +{ > + struct mthca_mpt_entry mpt_entry; > + int i; > + > + writeb(MTHCA_MPT_STATUS_SW,fmr->mpt); It's a minor nitpick but put a space after the "," please. (This appears a few more times below too) > + > + wmb(); Are the wmb()s required in this function? writeX() is already ordered. > + > + for (i = 0; i < list_len; ++i) { > + u64 mtt_entry = cpu_to_be64(page_list[i] | > + MTHCA_MTT_FLAG_PRESENT); > + writeq(mtt_entry, fmr->mtts + i); Don't use writeq() unconditionally. 32 bit archs don't have it. 
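A minimal sketch of one way around the unconditional writeq(), assuming the entry has already been converted to big-endian and that it is acceptable for the two halves to reach the device separately (plausible here only because the patch marks the MPT software-owned before it touches the MTTs). The helpers store32() and write64_as_two_32() are hypothetical stand-ins, not existing mthca or kernel functions, and the plain assignment stands in for whatever 32-bit MMIO accessor the driver would really use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a raw 32-bit MMIO store (user-space demo only). */
static void store32(volatile uint32_t *dst, uint32_t val)
{
	*dst = val;
}

/*
 * Store a 64-bit value as two 32-bit halves, lowest address first.
 * 'val' must already be in device byte order (for example after a
 * cpu_to_be64() conversion), so the halves are copied out of memory
 * as they are and never byte-swapped a second time.
 * 'dst' must be 4-byte aligned. The two stores are NOT atomic.
 */
static void write64_as_two_32(uint64_t val, volatile uint32_t *dst)
{
	uint32_t halves[2];

	assert(((uintptr_t)dst & 3) == 0);
	memcpy(halves, &val, sizeof halves);
	store32(dst, halves[0]);
	store32(dst + 1, halves[1]);
}
```

Where a 64-bit MMIO store is available the entry could still go out in a single write; the split path would only be selected on builds that lack one.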
> + } > + mpt_entry.lkey = cpu_to_be32(key); > + mpt_entry.length = cpu_to_be64(((u64)list_len) * > + (1 << fmr->attr.page_size)); > + mpt_entry.start = cpu_to_be64(iova); > + > + writel(mpt_entry.lkey, &fmr->mpt->key); > + memcpy_toio(&fmr->mpt->start, &mpt_entry.start, > + offsetof(struct mthca_mpt_entry, window_count) - > + offsetof(struct mthca_mpt_entry, start)); > + > + wmb(); > + > + writeb(MTHCA_MPT_STATUS_SW,fmr->mpt); Should this be STATUS_HW here? It seems we exit the function with the MPT marked invalid. > + > + wmb(); > +} > + > +static int mthca_arbel_fmr_map(struct mthca_dev *dev, struct mthca_fmr *fmr, > + u64 *page_list, int list_len, u64 iova, u32 key) > +{ > + void* mpt; Write this as "void *mpt" instead please. > + struct mthca_mpt_entry *mpt_entry; > + u8 *mpt_status; > + int i; > + > + mpt = mthca_table_find(dev, dev->mr_table.mpt_table, key); > + if (!mpt) > + return -EINVAL; > + > + mpt_status = mpt; > + *mpt_status = MTHCA_MPT_STATUS_SW; > + > + wmb(); > + > + /* This is really dumb. We are rescanning the ICM on > + * each mpt entry. We want some kind of iterator here. > + * May be fine meanwhile, while things are small. 
*/ > + for (i = 0; i < list_len; ++i) { > + u64 *mtt_entry = mthca_table_find(dev, dev->mr_table.mtt_table, > + fmr->first_seg + i); > + if (!mtt_entry) > + return -EINVAL; > + > + *mtt_entry = cpu_to_be64(page_list[i] | MTHCA_MTT_FLAG_PRESENT); > + } > + > + > + mpt_entry = mpt; > + mpt_entry->lkey = mpt_entry->key = cpu_to_be32(key); > + mpt_entry->length = cpu_to_be64(((u64)list_len) * > + (1 << fmr->attr.page_size)); > + mpt_entry->start = cpu_to_be64(iova); > + > + wmb(); > + > + *mpt_status = MTHCA_MPT_STATUS_HW; > + > + wmb(); > + return 0; > +} > + > + > +int mthca_fmr_map(struct mthca_dev *dev, struct mthca_fmr *fmr, > + u64 *page_list, int list_len, u64 iova) > +{ > + u32 key; > + > + if (fmr->maps >= fmr->attr.max_maps) { > + mthca_warn(dev, "Attempt to map fmr %d times, max_maps is %d\n", > + fmr->maps, fmr->attr.max_maps); > + return -EINVAL; > + } > + > + key = key_to_hw_index(dev, fmr->ibmr.lkey) + dev->limits.num_mpts; > + fmr->ibmr.lkey = fmr->ibmr.rkey = hw_index_to_key(dev, key); > + fmr->maps++; Maybe put this common part in a static function and have mthca_{arbel,tavor}_fmr_map() call it. Then we could just set the map_phys_fmr method in the device struct to the right function at initialization time. > + > + if (dev->hca_type == ARBEL_NATIVE) { > + return mthca_arbel_fmr_map(dev, fmr, page_list, list_len, > + iova, key); > + } else { > + mthca_tavor_fmr_map(dev, fmr, page_list, list_len, > + iova, key); > + return 0; > + } > +} > + > +void mthca_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr) > +{ > + if (!fmr->maps) { > + return; > + } Nitpick -- no { } needed here. 
> + > + fmr->maps = 0; > + > + if (dev->hca_type == ARBEL_NATIVE) { > + u32 key = key_to_hw_index(dev, fmr->ibmr.lkey); > + u8 *mpt_status = mthca_table_find(dev, dev->mr_table.mpt_table, > + key); > + if (!mpt_status) > + return; > + > + *mpt_status = MTHCA_MPT_STATUS_SW; > + wmb(); > + } else { > + writeb(MTHCA_MPT_STATUS_SW,fmr->mpt); > + wmb(); > + } > } > > + > int __devinit mthca_init_mr_table(struct mthca_dev *dev) > { > int err; > Index: hw/mthca/mthca_memfree.c > =================================================================== > --- hw/mthca/mthca_memfree.c (revision 2012) > +++ hw/mthca/mthca_memfree.c (working copy) > @@ -192,6 +192,47 @@ void mthca_table_put(struct mthca_dev *d > up(&table->mutex); > } > > +/* Nonblocking. Callers must make sure the object exists by serializing against > + * callers of get/put. */ > +void *mthca_table_find(struct mthca_dev *dev, struct mthca_icm_table *table, > + int obj) > +{ > + int idx, offset, i; > + struct mthca_icm_chunk *chunk; > + struct mthca_icm *icm; > + struct page *page = NULL; > + > + /* Supported only for low mem tables for now. */ > + if (!table->lowmem) > + return NULL; > + > + idx = (obj & (table->num_obj - 1)) * table->obj_size; > + icm = table->icm[idx / MTHCA_TABLE_CHUNK_SIZE]; > + offset = idx % MTHCA_TABLE_CHUNK_SIZE; What happened to the indendation here? > + > + if(!icm) > + return NULL; > + > + /* Linear scan of ICM on each access. Since this is called on fmr > + * registration which is on data path, eventually we may want to > + * rearrange things to use some kind of tree. 
*/ > + > + list_for_each_entry(chunk, &icm->chunk_list, list) { > + for (i = 0; i < chunk->npages; ++i) { > + if (chunk->mem[i].length >= offset) { > + page = chunk->mem[i].page; > + break; > + } > + offset -= chunk->mem[i].length; > + } > + } > + > + if (!page) > + return NULL; > + > + return lowmem_page_address(page) + offset; > +} > + > int mthca_table_get_range(struct mthca_dev *dev, struct mthca_icm_table *table, > int start, int end) > { From roland at topspin.com Thu Mar 17 20:08:26 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 17 Mar 2005 20:08:26 -0800 Subject: [openib-general] [PATCH] nit in dereg_mr In-Reply-To: <20050317175702.GA16399@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 17 Mar 2005 19:57:02 +0200") References: <20050317175702.GA16399@mellanox.co.il> Message-ID: <52wts5mw91.fsf@topspin.com> Thanks, applied. From roland at topspin.com Thu Mar 17 20:10:22 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 17 Mar 2005 20:10:22 -0800 Subject: [openib-general] [PATCH] mthca_mr error handling In-Reply-To: <20050317180110.GA17059@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 17 Mar 2005 20:01:10 +0200") References: <20050317180110.GA17059@mellanox.co.il> Message-ID: <52sm2tmw5t.fsf@topspin.com> Thanks, applied. From mst at mellanox.co.il Fri Mar 18 00:44:55 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 18 Mar 2005 10:44:55 +0200 Subject: [openib-general] [PATCH] fmr support in mthca In-Reply-To: <521xadob8c.fsf@topspin.com> References: <20050317201646.GA15221@mellanox.co.il> <521xadob8c.fsf@topspin.com> Message-ID: <20050318084455.GA23781@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] [PATCH] fmr support in mthca > > Thanks for implementing this. You saved me a lot of work and Libor > will be very happy when he gets back next week. Good, glad to help. I will try to address your comments next week (its already weekend here). 
> Some comments from my first read through:
>
> > This patch implements FMR support. I also rolled into it two fixes
> > for regular mrs that I posted previously, let me know if it's a problem.
>
> No problem although I'll apply them separately.
>
> > This seems to be working fine for me, although I only did relatively basic
> > tests. Both Tavor and Arbel Native modes are supported. I made some tradeoffs
> > for simplicity, let me know what you think:
> > - for tavor, I keep for each fmr two pages mapped: one for mpt and one for
> > mtt access. This spends more kernel virtual memory than could be used,
> > since many mpts could share one page. Alternatives are:
> > map/unmap io memory on each fmr map/unmap request, or
> > keep an intermediate table to map each page only once.
>
> I don't think this is acceptable. Each ioremap has to map at least
> one page plus a guard page. With two ioremaps per FMR, every FMR is
> using 16K (or more) of vmalloc space. On 64 bit archs, this doesn't
> matter, but on a large memory i386 machine, there's less than 128 MB
> of vmalloc space available (possibly a lot less if someone is using a
> video card with a big frame buffer or something). That means we're
> limited to a few thousand FMRs, which isn't enough.
>
> What if we just reserve something like 64K MPTs and MTTs for FMRs and
> ioremap everything at driver startup? That would only use a few MB of
> vmalloc space and probably simplify the code too.

I don't like these pre-allocations: if someone is only using SDP and IP over IB, it seems he will need hardly any regular regions. 64K MTTs with 4K page size cover up to 200 MByte of memory. My other problem with this approach is an implementation one: the existing allocator and table code can be passed a reserved parameter, but don't have the ability to allocate out of that pool. So we'd have to allocate out of a separate allocator, and take care that keys do not conflict. This gets a bit complicated.
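The sizes traded back and forth here can be checked with a little arithmetic. The sketch below is plain user-space C, not driver code; the constants are the ones quoted in the thread (4K pages, two ioremaps per FMR, one guard page per mapping, 128 MB as an optimistic vmalloc budget), and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants taken from the discussion, not from driver headers. */
#define PAGE_BYTES       4096u  /* i386 page size */
#define IOREMAPS_PER_FMR 2u     /* one mapping for the MPT entry, one for the MTTs */

/* vmalloc space used by one FMR: every ioremap costs at least one
 * mapped page plus one guard page. */
static uint64_t vmalloc_bytes_per_fmr(void)
{
	return (uint64_t)IOREMAPS_PER_FMR * (PAGE_BYTES + PAGE_BYTES);
}

/* Upper bound on the number of FMRs that fit in a given vmalloc budget. */
static uint64_t max_fmrs(uint64_t vmalloc_bytes)
{
	return vmalloc_bytes / vmalloc_bytes_per_fmr();
}

/* Memory reachable through num_mtts translation entries, each covering
 * one page of (1 << page_shift) bytes. */
static uint64_t mtt_coverage_bytes(uint64_t num_mtts, unsigned page_shift)
{
	assert(page_shift < 64);
	return num_mtts << page_shift;
}
```

Under these assumptions one FMR costs 16 KB of vmalloc space, a full 128 MB budget tops out at 8192 FMRs, and 64K MTTs at 4K pages translate 256 MB.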
Maybe do something separate for 32-bit kernels (for example, disable FMR support)?

> > - icm that has the mpts/mtts is linearly scanned and this is repeated
> > for each mtt on each fmr map. This may be improved somewhat
> > with some kind of an iterator, but to really speed things up
> > the icm data structure (list of arrays) would have to
> > be replaced by some kind of tree.
>
> I don't understand this. I'm probably missing something but the
> addresses don't change after we allocate the FMR, right? It seems we
> could just store the MPT/MTT address in the FMR structure the same way
> we do for Tavor mode.

Yes, but for MTTs the addresses may not be physically contiguous, unless we want to limit FMRs to PAGE_SIZE/8 MTTs, which means 512 MTTs, that is 2 MByte with 4K FMR page size. And it seems possible that even with this limitation the MTTs for a specific FMR start at a non-page-aligned boundary. So we'd need an array of pages per FMR, unlike Tavor. Do you think it's a good idea?

> Some more nitpicky comments below...
>
> > +/* Nonblocking. Callers must make sure the object exists by serializing against
> > + * callers of get/put. */
> > +void *mthca_table_find(struct mthca_dev *dev, struct mthca_icm_table *table,
> > + int obj);
>
> Can we just make this use the table mutex and only call it when
> allocating an FMR?

See above. But the restriction doesn't matter much for FMRs because the icm ref count is incremented when the FMR is created, so they satisfy this constraint.

The other comments need to be addressed. I'll start working on them when I am back on Sunday.

--
MST - Michael S.
Tsirkin From: Hal Rosenstock To: openib-general at openib.org Message-Id: <1111152373.4662.6585.camel at localhost.localdomain> Date: 18 Mar 2005 08:26:14 -0500 Subject: [openib-general] [PATCH] ping Add IB ping server agent ping: Add IB ping server agent as a separate module (used with ibping diagnostic tool) Signed-off-by: Shahar Frank Signed-off-by: Hal Rosenstock Index: ping.h =================================================================== --- ping.h (revision 0) +++ ping.h (revision 0) @@ -0,0 +1,49 @@ +/* + * Copyright (c) 2004, 2005 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004, 2005 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Topspin Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef __PING_H_ +#define __PING_H_ + +extern spinlock_t ib_ping_port_list_lock; + +extern int ib_ping_port_open(struct ib_device *device, + int port_num); + +extern int ib_ping_port_close(struct ib_device *device, int port_num); + +#endif /* __PING_H_ */ Index: ping_priv.h =================================================================== --- ping_priv.h (revision 0) +++ ping_priv.h (revision 0) @@ -0,0 +1,61 @@ +/* + * Copyright (c) 2004, 2005 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004, 2005 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Topspin Corporation. All rights reserved. 
+ * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#ifndef __IB_PING_PRIV_H__ +#define __IB_PING_PRIV_H__ + +#include + +#define SPFX "ib_ping: " + +struct ib_ping_send_wr { + struct list_head send_list; + struct ib_ah *ah; + struct ib_mad_private *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct ib_ping_port_private { + struct list_head port_list; + struct list_head send_posted_list; + spinlock_t send_list_lock; + int port_num; + struct ib_mad_agent *pingd_agent; /* OpenIB Ping class */ +}; + +#endif /* __IB_PING_PRIV_H__ */ Index: ping.c =================================================================== --- ping.c (revision 0) +++ ping.c (revision 0) @@ -0,0 +1,425 @@ +/* + * Copyright (c) 2004, 2005 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004, 2005 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Topspin Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#include +#include +#include + +#include "ping_priv.h" +#include "mad_priv.h" +#include "ping.h" + +spinlock_t ib_ping_port_list_lock; +static LIST_HEAD(ib_ping_port_list); + +/* + * Caller must hold ib_ping_port_list_lock + */ +static inline struct ib_ping_port_private * +__ib_get_ping_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_ping_port_private *entry; + + BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ + + if (device) { + list_for_each_entry(entry, &ib_ping_port_list, port_list) { + if (entry->pingd_agent->device == device && + entry->port_num == port_num) + return entry; + } + } else { + list_for_each_entry(entry, &ib_ping_port_list, port_list) { + if (entry->pingd_agent == mad_agent) + return entry; + } + } + return NULL; +} + +static inline struct ib_ping_port_private * +ib_get_ping_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_ping_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_ping_port_list_lock, flags); + entry = __ib_get_ping_port(device, port_num, mad_agent); + spin_unlock_irqrestore(&ib_ping_port_list_lock, flags); + + return entry; +} + +static int ping_mad_send(struct ib_mad_agent *mad_agent, + struct ib_ping_port_private *port_priv, + struct ib_mad_private *mad_priv, + struct ib_grh *grh, + struct ib_wc *wc) +{ + struct ib_ping_send_wr *ping_send_wr; + struct ib_sge gather_list; + struct ib_send_wr 
send_wr; + struct ib_send_wr *bad_send_wr; + struct ib_ah_attr ah_attr; + unsigned long flags; + int ret = 1; + + ping_send_wr = kmalloc(sizeof(*ping_send_wr), GFP_KERNEL); + if (!ping_send_wr) + goto out; + ping_send_wr->mad = mad_priv; + + /* PCI mapping */ + gather_list.addr = dma_map_single(mad_agent->device->dma_device, + &mad_priv->mad, + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + gather_list.length = sizeof(mad_priv->mad); + gather_list.lkey = mad_agent->mr->lkey; + + send_wr.next = NULL; + send_wr.opcode = IB_WR_SEND; + send_wr.sg_list = &gather_list; + send_wr.num_sge = 1; + send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ + send_wr.wr.ud.timeout_ms = 0; + send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; + + ah_attr.dlid = wc->slid; + ah_attr.port_num = mad_agent->port_num; + ah_attr.src_path_bits = wc->dlid_path_bits; + ah_attr.sl = wc->sl; + ah_attr.static_rate = 0; + ah_attr.ah_flags = 0; /* No GRH */ + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_OPENIB_PING) { + if (wc->wc_flags & IB_WC_GRH) { + ah_attr.ah_flags = IB_AH_GRH; + /* Should sgid be looked up ? 
*/ + ah_attr.grh.sgid_index = 0; + ah_attr.grh.hop_limit = grh->hop_limit; + ah_attr.grh.flow_label = be32_to_cpup( + &grh->version_tclass_flow) & 0xfffff; + ah_attr.grh.traffic_class = (be32_to_cpup( + &grh->version_tclass_flow) >> 20) & 0xff; + memcpy(ah_attr.grh.dgid.raw, + grh->sgid.raw, + sizeof(ah_attr.grh.dgid)); + } + } else { + printk(KERN_ERR SPFX "Not OpenIB ping class 0x%x\n", + mad_priv->mad.mad.mad_hdr.mgmt_class); + kfree(ping_send_wr); + goto out; + } + + ping_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); + if (IS_ERR(ping_send_wr->ah)) { + printk(KERN_ERR SPFX "No memory for address handle\n"); + kfree(ping_send_wr); + goto out; + } + + send_wr.wr.ud.ah = ping_send_wr->ah; + send_wr.wr.ud.pkey_index = wc->pkey_index; + send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; + send_wr.wr.ud.mad_hdr = &mad_priv->mad.mad.mad_hdr; + send_wr.wr_id = (unsigned long)ping_send_wr; + + pci_unmap_addr_set(ping_send_wr, mapping, gather_list.addr); + + /* Send */ + spin_lock_irqsave(&port_priv->send_list_lock, flags); + if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(ping_send_wr, mapping), + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + ib_destroy_ah(ping_send_wr->ah); + kfree(ping_send_wr); + } else { + list_add_tail(&ping_send_wr->send_list, + &port_priv->send_posted_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; + } + +out: + return ret; +} + +static void pingd_recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_ping_port_private *port_priv; + struct ib_vendor_mad *vend; + struct ib_mad_private *recv = container_of(mad_recv_wc, + struct ib_mad_private, + header.recv_wc); + + /* Find matching MAD agent */ + port_priv = ib_get_ping_port(NULL, 0, mad_agent); + if (!port_priv) { + kmem_cache_free(ib_mad_cache, recv); + printk(KERN_ERR SPFX 
"pingd_recv_handler: no matching MAD " + "agent %p\n", mad_agent); + return; + } + + vend = (struct ib_vendor_mad *)mad_recv_wc->recv_buf.mad; + + vend->mad_hdr.method |= IB_MGMT_METHOD_RESP; + vend->mad_hdr.status = 0; + if (!system_utsname.domainname[0]) + strncpy(vend->data, system_utsname.nodename, sizeof vend->data); + else + snprintf(vend->data, sizeof vend->data, "%s.%s", + system_utsname.nodename, system_utsname.domainname); + + /* Send response */ + if (ping_mad_send(mad_agent, port_priv, recv, + mad_recv_wc->recv_buf.grh, mad_recv_wc->wc)) { + kmem_cache_free(ib_mad_cache, recv); + printk(KERN_ERR SPFX "pingd_recv_handler: reply failed\n"); + } +} + +static void pingd_send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_ping_port_private *port_priv; + struct ib_ping_send_wr *ping_send_wr; + unsigned long flags; + + /* Find matching MAD agent */ + port_priv = ib_get_ping_port(NULL, 0, mad_agent); + if (!port_priv) { + printk(KERN_ERR SPFX "pingd_send_handler: no matching MAD " + "agent %p\n", mad_agent); + return; + } + + ping_send_wr = (struct ib_ping_send_wr *)(unsigned long)mad_send_wc->wr_id; + spin_lock_irqsave(&port_priv->send_list_lock, flags); + /* Remove completed send from posted send MAD list */ + list_del(&ping_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + + /* Unmap PCI */ + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(ping_send_wr, mapping), + sizeof(ping_send_wr->mad->mad), + DMA_TO_DEVICE); + + ib_destroy_ah(ping_send_wr->ah); + + /* Release allocated memory */ + kmem_cache_free(ib_mad_cache, ping_send_wr->mad); + kfree(ping_send_wr); +} + +int ib_ping_port_open(struct ib_device *device, int port_num) +{ + int ret; + struct ib_ping_port_private *port_priv; + struct ib_mad_reg_req pingd_reg_req; + unsigned long flags; + + /* First, check if port already open */ + port_priv = ib_get_ping_port(device, port_num, NULL); + if (port_priv) { + 
printk(KERN_DEBUG SPFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR SPFX "No memory for ib_ping_port_private\n"); + ret = -ENOMEM; + goto error1; + } + + memset(port_priv, 0, sizeof *port_priv); + port_priv->port_num = port_num; + spin_lock_init(&port_priv->send_list_lock); + INIT_LIST_HEAD(&port_priv->send_posted_list); + + pingd_reg_req.mgmt_class = IB_MGMT_CLASS_OPENIB_PING; + pingd_reg_req.mgmt_class_version = 1; + pingd_reg_req.oui[0] = (IB_OPENIB_OUI >> 16) & 0xff; + pingd_reg_req.oui[1] = (IB_OPENIB_OUI >> 8) & 0xff; + pingd_reg_req.oui[2] = IB_OPENIB_OUI & 0xff; + set_bit(IB_MGMT_METHOD_GET, pingd_reg_req.method_mask); + + /* Obtain server MAD agent for OpenIB Ping class (GSI QP) */ + port_priv->pingd_agent = ib_register_mad_agent(device, port_num, + IB_QPT_GSI, + &pingd_reg_req, 0, + &pingd_send_handler, + &pingd_recv_handler, + NULL); + if (IS_ERR(port_priv->pingd_agent)) { + ret = PTR_ERR(port_priv->pingd_agent); + goto error2; + } + + spin_lock_irqsave(&ib_ping_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_ping_port_list); + spin_unlock_irqrestore(&ib_ping_port_list_lock, flags); + + return 0; + +error2: + kfree(port_priv); +error1: + return ret; +} + +int ib_ping_port_close(struct ib_device *device, int port_num) +{ + struct ib_ping_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_ping_port_list_lock, flags); + port_priv = __ib_get_ping_port(device, port_num, NULL); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_ping_port_list_lock, flags); + printk(KERN_ERR SPFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_ping_port_list_lock, flags); + + ib_unregister_mad_agent(port_priv->pingd_agent); + kfree(port_priv); + + return 0; +} + +static void ib_ping_init_device(struct ib_device 
*device) +{ + int ret, num_ports, cur_port, i, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + + for (i = 0; i < num_ports; i++, cur_port++) { + ret = ib_ping_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR SPFX "Couldn't open %s port %d\n", + device->name, cur_port); + goto error_device_open; + } + } + goto error_device_query; + +error_device_open: + while (i > 0) { + cur_port--; + ret2 = ib_ping_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for ping agent\n", + device->name, cur_port); + } + i--; + } + +error_device_query: + return; +} + +static void ib_ping_remove_device(struct ib_device *device) +{ + int ret = 0, i, num_ports, cur_port, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret2 = ib_ping_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR SPFX "Couldn't close %s port %d " + "for ping agent\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + } +} + +static struct ib_client ping_client = { + .name = "ping", + .add = ib_ping_init_device, + .remove = ib_ping_remove_device +}; + +static int __init ib_ping_init_module(void) +{ + spin_lock_init(&ib_ping_port_list_lock); + INIT_LIST_HEAD(&ib_ping_port_list); + + if (ib_register_client(&ping_client)) { + printk(KERN_ERR SPFX "Couldn't register ib_ping client\n"); + return -EINVAL; + } + + return 0; +} + +static void __exit ib_ping_cleanup_module(void) +{ + ib_unregister_client(&ping_client); +} + +module_init(ib_ping_init_module) +module_exit(ib_ping_cleanup_module) + Index: mad.c =================================================================== --- mad.c (revision 2023) +++ mad.c (working copy) @@ -45,6 +45,8 @@ kmem_cache_t *ib_mad_cache; 
+EXPORT_SYMBOL(ib_mad_cache); + static struct list_head ib_mad_port_list; static u32 ib_mad_client_id = 0; Index: Makefile =================================================================== --- Makefile (revision 2023) +++ Makefile (working copy) @@ -1,12 +1,15 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_cm.o ib_sa.o ib_umad.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o \ + ib_cm.o ib_sa.o ib_umad.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o ib_mad-y := mad.o smi.o agent.o +ib_ping-y := ping.o + ib_cm-y := cm.o ib_sa-y := sa_query.o From mshefty at ichips.intel.com Fri Mar 18 09:05:16 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 18 Mar 2005 09:05:16 -0800 Subject: (SPAM?) [openib-general] [PATCH] ping Add IB ping server agent In-Reply-To: <1111152373.4662.6585.camel@localhost.localdomain> References: <1111152373.4662.6585.camel@localhost.localdomain> Message-ID: <423B0A4C.7040400@ichips.intel.com> Hal Rosenstock wrote: > + ping_send_wr = kmalloc(sizeof(*ping_send_wr), GFP_KERNEL); > + if (!ping_send_wr) > + goto out; > + ping_send_wr->mad = mad_priv; > + > + /* PCI mapping */ > + gather_list.addr = dma_map_single(mad_agent->device->dma_device, > + &mad_priv->mad, > + sizeof(mad_priv->mad), > + DMA_TO_DEVICE); > + gather_list.length = sizeof(mad_priv->mad); > + gather_list.lkey = mad_agent->mr->lkey; > + > + send_wr.next = NULL; > + send_wr.opcode = IB_WR_SEND; > + send_wr.sg_list = &gather_list; > + send_wr.num_sge = 1; > + send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ > + send_wr.wr.ud.timeout_ms = 0; > + send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; > + > + ah_attr.dlid = wc->slid; > + ah_attr.port_num = mad_agent->port_num; > + ah_attr.src_path_bits = wc->dlid_path_bits; > + ah_attr.sl = wc->sl; > + ah_attr.static_rate = 0; > + ah_attr.ah_flags = 0; /* No GRH */ > + if (mad_priv->mad.mad.mad_hdr.mgmt_class == 
IB_MGMT_CLASS_OPENIB_PING) { > + if (wc->wc_flags & IB_WC_GRH) { > + ah_attr.ah_flags = IB_AH_GRH; > + /* Should sgid be looked up ? */ > + ah_attr.grh.sgid_index = 0; > + ah_attr.grh.hop_limit = grh->hop_limit; > + ah_attr.grh.flow_label = be32_to_cpup( > + &grh->version_tclass_flow) & 0xfffff; > + ah_attr.grh.traffic_class = (be32_to_cpup( > + &grh->version_tclass_flow) >> 20) & 0xff; > + memcpy(ah_attr.grh.dgid.raw, > + grh->sgid.raw, > + sizeof(ah_attr.grh.dgid)); > + } We should start looking at moving code like this into a common function usable by multiple modules. This would require exposing the definition of some sort of send_mad structure, but I think that we can come up with something that would work for most agents. > + kmem_cache_free(ib_mad_cache, recv); Why doesn't this code just call ib_free_recv_mad() like other agents do? This assumes the implementation of the MAD layer, which I think can be avoided. > + /* Unmap PCI */ > + dma_unmap_single(mad_agent->device->dma_device, > + pci_unmap_addr(ping_send_wr, mapping), > + sizeof(ping_send_wr->mad->mad), > + DMA_TO_DEVICE); > + > + ib_destroy_ah(ping_send_wr->ah); > + > + /* Release allocated memory */ > + kmem_cache_free(ib_mad_cache, ping_send_wr->mad); > + kfree(ping_send_wr); I don't like that ib_mad_cache is being used for send MADs, when it stores and is used for MADs posted to the receive queue. The use of this cache by agents makes it more difficult to change the MAD layer implementation. > Index: mad.c > =================================================================== > --- mad.c (revision 2023) > +++ mad.c (working copy) > @@ -45,6 +45,8 @@ > > > kmem_cache_t *ib_mad_cache; > +EXPORT_SYMBOL(ib_mad_cache); > + I'm not sure about exporting the cache in this manner, versus providing additional functionality. What I think makes sense to do is to examine the SA query, CM, ping, and RMPP code to identify areas of re-use for sending MADs. 
This may get us back to providing a virtual pool of send MADs, which could later be combined with the receive MADs, allowing a received MAD to be turned around as a response. - Sean From mkowalski01 at gmail.com Fri Mar 18 09:24:53 2005 From: mkowalski01 at gmail.com (mark kowalski) Date: Fri, 18 Mar 2005 11:24:53 -0600 Subject: [openib-general] queue pair destruction on dat_ep_disconnect Message-ID: I've been doing some experimentation to see if a client using uDAPL can recover from a hard port failure if the second port on the HCA offers a path to the same destination. I've noticed a problem. When, on the client, I see a timeout waiting on a response from the server or I get a transport error from the evd_wait on the dat_ep_post_send, I will eventually dat_ep_disconnect the endpoint in preparation for recreating the endpoint and trying to connect to a new IA obtained from dat_ia_open on a name associated with the other port on the HCA. What I've noticed is that the ep_disconnect does not seem to destroy the underlying queue pair, and even though I issue a new dat_ep_create, access to the new end point fails with "resource busy" because the destroy_cbk field is still filled in. If I issue the dat_ep_free after the dat_ep_disconnect and then start the process of creating and connecting to a new end point, then it works fine. I've noticed in the dapls_ib_disconnect (not the openib one) that the call to VAPI_destroy_qp is ifdef'd out. In the openib dapls_ib_disconnect there is no call at all to VAPI_destroy_qp. Is this intentional? It seems that the dat_ep_disconnect should clean up the underlying queue pair and a dat_ep_free shouldn't be required. 
Thanks in advance for any help you can provide, Mark Kowalski From Steven.Sears at netapp.com Fri Mar 18 09:45:09 2005 From: Steven.Sears at netapp.com (Sears, Steven) Date: Fri, 18 Mar 2005 12:45:09 -0500 Subject: [openib-general] RE: [Dapl-devel] queue pair destruction on dat_ep_disconnect Message-ID: May I point out that, taking your analysis to its logical conclusion, you could only ever have a single EP on any machine if each EP has a unique QP. This is obviously false; I think you're looking in the wrong place. dat_ep_disconnect() is not supposed to destroy a QP, just transition it to a not-connected state (IB state ERROR). An EP, and by extension a QP, can have several different attributes; it wouldn't be efficient or intuitive to destroy the underlying QP just because you are disconnecting. The QP remains attached to the EP until you explicitly free it in dat_ep_free(); this is intentional and by design. If you look at the state diagram in the DAT spec, you will notice that you should dat_ep_reset() the EP before you try to use it again. This will transition the underlying QP from the ERROR state to INIT. But I don't think you're trying to reuse the EP, so I don't know why it's a problem. Getting back to your real problem, I'm not sure why you can't create a new EP on a different IA; they should be completely separate. If dat_ep_create() fails, something is hosed. I don't know about the destroy_cbk field as it isn't in the reference implementation, so I can't help you there. -Steve > -----Original Message----- > From: mark kowalski [mailto:mkowalski01 at gmail.com] > Sent: Friday, March 18, 2005 12:25 PM > To: openib-general at openib.org; dapl-devel at lists.sourceforge.net > Subject: [Dapl-devel] queue pair destruction on dat_ep_disconnect > > > I've been doing some experimentation to see if a client using udapl > can recover from a hard port failure if the second port on the hca > offers a path to the same destination. I've noticed a problem. 
> when on the client I see a timeout waiting on a response from > the server or I get a transport error from the evd_wait on the > dat_ep_post_send, I will eventually dat_ep_disconnect the endpoint in > preparation of recreating the endpoint and trying to connect to a new > IA obtained from data_ia_open on a name associated with the other port > on the hca. What I've noticed is that the ep_disconnect does not seem > to destroy the underlying queue pair and eventhough I issue a new > dat_ep_create, access to the new end point fails with "resource busy" > because the destroy_cbk field is still filled in. If I issue the > dat_ep_free after the dat_ep_disconnect and then start the process of > creating and connecting to a new end point then it works fine. > I've noticed in the dapls_ib_disconnect (not the openib one) that > the call to VAPI_destroy_qp is ifdef'd out. in the openib > dapls_ib_disconnect there is no call at all to VAPI_destroy_qp. Is > this intentional? It seems that the dat_ep_disconnect should cleanup > the underlying queue pair and a dat_ep_free shouldn't be required. > > Thanks in advance for any help you can provide, > Mark Kowalski > From roland at topspin.com Fri Mar 18 11:42:29 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 18 Mar 2005 11:42:29 -0800 Subject: [openib-general] [PATCH] fmr support in mthca In-Reply-To: <20050318084455.GA23781@mellanox.co.il> (Michael S. 
Tsirkin's message of "Fri, 18 Mar 2005 10:44:55 +0200") References: <20050317201646.GA15221@mellanox.co.il> <521xadob8c.fsf@topspin.com> <20050318084455.GA23781@mellanox.co.il> Message-ID: <52k6o4lp0a.fsf@topspin.com> Michael> Good, glad to help. I will try to address your comments Michael> next week (it's already the weekend here). No problem. Libor won't be back until Monday so I won't even try SDP until then. Roland> What if we just reserve something like 64K MPTs and MTTs Roland> for FMRs and ioremap everything at driver startup? That Roland> would only use a few MB of vmalloc space and probably Roland> simplify the code too. Michael> I don't like these pre-allocations - if someone is only Michael> using SDP and IP over IB, it seems he won't need almost Michael> any regular regions. 64K MTTs with 4K page size cover up Michael> to 200MByte of memory. We can bump up the numbers if you want. Right now the default allocation is 1 << 20 MTT segments (8 << 20 MTT entries). I see no problem with having 64K MPTs and 256K MTT segments reserved for FMRs by default. That should be more than enough for a single HCA -- 256K MTT segments means that 2 million pages or 8 GB of IO could be in flight at a time, which doesn't seem like a harsh limit to me. Ultimately we can make the allocations tunable at device init time, along with the rest of the parameters (number of QPs, number of CQs, etc). I haven't seen much pressure to do that so far but it is definitely in my plans. Michael> My other problem with this approach was implementational: Michael> existing allocator and table code can be passed reserved Michael> parameter, but don't have the ability to allocate out of Michael> that pool. So we'd have to allocate out of a separate Michael> allocator, and take care so that keys do not Michael> conflict. This gets a bit complicated. I think this is the way to go. 
Keys are easy to deal with -- in mthca_init_mr_table, we could just pass dev->limits.num_fmrs instead of dev->limits.reserved_mrws when initializing dev->mr_table.mpt_alloc, and then create a new table of size dev->limits.num_fmrs and reserve dev->limits.reserved_mrws out of that table. The buddy allocator is a little more work but it needs to be cleaned up and encapsulated better anyway. Once that's done we'd just have two buddy allocators. The first one would cover all the MTT segments, and we'd first take out a chunk of that one to cover the reserved MTTs and then allocate another chunk that can hold whatever number of MTT segments we decide to use for FMRs. Michael> Maybe do something separate for 32 bit kernels (like - Michael> disable FMR support)? No FMRs on 32-bit kernels isn't going to fly. It doesn't seem that hard to make things work on i386 so why not do it? Michael> Yes but for mtts the addresses may not be physically Michael> contiguous, unless we want to limit FMRs to PAGE_SIZE/8 Michael> MTTs, which means 512 MTTs, that is 2MByte with 4K FMR Michael> page size. And it seems possible that even with this Michael> limitation MTTs for a specific FMR start at a non-page- Michael> aligned boundary. I think it's fine to limit an FMR to 512 MTT entries. I'd have to look at the source to be sure of the exact numbers, but I know that for the Topspin stack, neither SDP nor SRP is using more than 32 entries per FMR. A limit of mapping 512 pages/2 MB per FMR seems fine. I don't know of anyone using FMRs even close to that big. Even if it turns out to be too small, I see no problem with adding a small array of something on the order of 2 or 4 MTT pages. If we use the buddy allocator for MTT entries for FMRs, then alignment is OK. The buddy allocator guarantees that objects will be aligned to their size, which means that the MTT segments will never cross a page boundary. - R. 
From rf at q-leap.de Fri Mar 18 15:51:08 2005 From: rf at q-leap.de (Roland Fehrenbacher) Date: Sat, 19 Mar 2005 00:51:08 +0100 Subject: [openib-general] opensm trouble Message-ID: <16955.26988.755836.49462@gargle.gargle.HOWL> Hi, I am having problems getting opensm to run using the latest gen1 version from https://openib.org/svn/gen1/trunk. I compiled the gen1 modules successfully against vanilla 2.6.11 (some minor fixes were necessary, patches appended), and can load them ok on Mellanox HCAs. Architecture is x86_64. # cat /proc/infiniband/core/ca1/info name: InfiniHost0 provider: tavor node GUID: 0002:c902:0040:12a0 ports: 2 vendor ID: 0x2c9 device ID: 0x5a44 HW revision: 0xa1 FW revision: 0x300020000 When starting opensm, I get # opensm ------------------------------------------------- OpenSM Rev:1.8.0 Command Line Arguments: Log File: /tmp/osm.log ------------------------------------------------- OpenSM Rev:1.8.0 Choose a local port number with which to bind: 1: GUID = 0x 0, lid = 0x03C9, state = INIT 2: GUID = 0x 0, lid = 0x03CA, state = DOWN Enter choice (1-2): 1 SM port is down. SM port is down. Obviously, the GUID is not read by opensm, and the subsequent connection fails. Excerpt from /tmp/osm.log: --------------------------------------------- Mar 19 00:36:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher. Mar 19 00:36:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GUID:0x0000000000000000,0x0000000000000000 Mar 19 00:36:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GUID:0x0000000000000000,0x0000000020f2ffff Mar 19 00:36:32 [4002] -> __osmv_txn_timeout_cb: ERR 6702: The transaction request (tid=0x2F4B13AB) timed out (after 4 retries). Invoking the error callback. Mar 19 00:36:32 [4002] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT). 
Mar 19 00:36:32 [4002] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) status..................0x0 hop_ptr.................0x0 hop_count...............0x0 trans_id................0x0 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0] Return path: [0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Mar 19 00:36:32 [18007] -> __osm_state_mgr_is_sm_port_down: ERR 3308: SM port GUID unknown. ??? 92 98311:00:1111188992 [645F7472] -> SM port is down. --------------------------------- What could be the reason for this? I see the same behaviour when using IBGD 1.6.1, 1.7.0-rc32, and also using vanilla 2.6.10. Cheers, Roland -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: patch-ql URL: From mshefty at ichips.intel.com Fri Mar 18 18:24:30 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 18 Mar 2005 18:24:30 -0800 Subject: [openib-general] [PATCH] [RMPP] receive RMPP support Message-ID: <20050318182430.0e6e8a16.mshefty@ichips.intel.com> This patch adds support to receive RMPP MADs. Notes: * A default timeout of 40 seconds is used to timeout reassembly. I.e. a sender has 40 seconds to complete the transmission. * A default timeout of 5 seconds is used to timeout cleanup of a reassembled MAD. This allows the receiver to re-send a lost final ACK. * The receive window is set to 1/8th of the QP RQ size. 
* The receiver sends an ACK under the following conditions: a MAD is received with a segment number less than the last segment number ACKed, if the last segment of a window has been received, or the last segment in a transfer has been received. * The receiver will store MADs received out of order (which is needed to support multi-threading). The code was tested by hacking the SA query code to send GET_TABLE requests to opensm running on the SourceForge stack. It seems to be working, but it has not been rigorously tested. Signed-off-by: Sean Hefty Index: include/ib_sa.h =================================================================== --- include/ib_sa.h (revision 2028) +++ include/ib_sa.h (working copy) @@ -41,9 +41,11 @@ #include enum { - IB_SA_CLASS_VERSION = 2, /* IB spec version 1.1/1.2 */ + IB_SA_CLASS_VERSION = 2, /* IB spec version 1.1/1.2 */ - IB_SA_METHOD_DELETE = 0x15 + IB_SA_METHOD_GET_TABLE = 0x12, + IB_SA_METHOD_GET_TABLE_RESP = 0x92, + IB_SA_METHOD_DELETE = 0x15 }; enum ib_sa_selector { Index: core/mad_rmpp.c =================================================================== --- core/mad_rmpp.c (revision 0) +++ core/mad_rmpp.c (revision 0) @@ -0,0 +1,572 @@ +/* + * Copyright (c) 2005 Intel Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mad_rmpp.c 1921 2005-03-02 22:58:44Z sean.hefty $ + */ + +#include "mad_rmpp.h" +#include "mad_priv.h" + +enum rmpp_state { + RMPP_STATE_ACTIVE, + RMPP_STATE_TIMEOUT, + RMPP_STATE_COMPLETE +}; + +struct mad_rmpp_recv { + struct ib_mad_agent_private *agent; + struct list_head list; + struct work_struct timeout_work; + struct work_struct cleanup_work; + wait_queue_head_t wait; + enum rmpp_state state; + spinlock_t lock; + atomic_t refcount; + + struct ib_ah *ah; + struct ib_mad_recv_wc *rmpp_wc; + struct ib_mad_recv_buf *cur_seg_buf; + int last_ack; + int seg_num; + int newwin; + + u64 tid; + u32 src_qp; + u16 slid; + u8 mgmt_class; + u8 class_version; + u8 method; +}; + +struct rmpp_msg { + struct ib_mad_agent *mad_agent; + struct ib_send_wr send_wr; + struct ib_sge sge; + DECLARE_PCI_UNMAP_ADDR(mapping) + struct ib_rmpp_mad mad; +}; + +static struct ib_ah * create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc, + u8 port_num) +{ + struct ib_ah_attr *ah_attr; + struct ib_ah *ah; + + ah_attr = kmalloc(sizeof *ah_attr, GFP_KERNEL); + if (!ah_attr) + return ERR_PTR(-ENOMEM); + + memset(ah_attr, 0, sizeof *ah_attr); + ah_attr->dlid = wc->slid; + ah_attr->sl = wc->sl; + ah_attr->src_path_bits = wc->dlid_path_bits; + ah_attr->port_num = port_num; + + 
ah = ib_create_ah(pd, ah_attr); + kfree(ah_attr); + return ah; +} + +static void destroy_rmpp_recv(struct mad_rmpp_recv *rmpp_recv) +{ + atomic_dec(&rmpp_recv->refcount); + wait_event(rmpp_recv->wait, !atomic_read(&rmpp_recv->refcount)); + ib_destroy_ah(rmpp_recv->ah); + kfree(rmpp_recv); +} + +void ib_cancel_rmpp_recvs(struct ib_mad_agent_private *agent) +{ + struct mad_rmpp_recv *rmpp_recv, *temp_rmpp_recv; + unsigned long flags; + + spin_lock_irqsave(&agent->lock, flags); + list_for_each_entry(rmpp_recv, &agent->rmpp_list, list) { + cancel_delayed_work(&rmpp_recv->timeout_work); + cancel_delayed_work(&rmpp_recv->cleanup_work); + } + spin_unlock_irqrestore(&agent->lock, flags); + + flush_workqueue(agent->qp_info->port_priv->wq); + + list_for_each_entry_safe(rmpp_recv, temp_rmpp_recv, + &agent->rmpp_list, list) { + list_del(&rmpp_recv->list); + if (rmpp_recv->state != RMPP_STATE_COMPLETE) + ib_free_recv_mad(rmpp_recv->rmpp_wc); + destroy_rmpp_recv(rmpp_recv); + } +} + +static void recv_timeout_handler(void *data) +{ + struct mad_rmpp_recv *rmpp_recv = data; + struct ib_mad_recv_wc *rmpp_wc; + unsigned long flags; + + spin_lock_irqsave(&rmpp_recv->agent->lock, flags); + if (rmpp_recv->state != RMPP_STATE_ACTIVE) { + spin_unlock_irqrestore(&rmpp_recv->agent->lock, flags); + return; + } + rmpp_recv->state = RMPP_STATE_TIMEOUT; + list_del(&rmpp_recv->list); + spin_unlock_irqrestore(&rmpp_recv->agent->lock, flags); + + /* TODO: send abort. 
*/ + rmpp_wc = rmpp_recv->rmpp_wc; + destroy_rmpp_recv(rmpp_recv); + ib_free_recv_mad(rmpp_wc); +} + +static void recv_cleanup_handler(void *data) +{ + struct mad_rmpp_recv *rmpp_recv = data; + unsigned long flags; + + spin_lock_irqsave(&rmpp_recv->agent->lock, flags); + list_del(&rmpp_recv->list); + spin_unlock_irqrestore(&rmpp_recv->agent->lock, flags); + destroy_rmpp_recv(rmpp_recv); +} + +static struct mad_rmpp_recv * +create_rmpp_recv(struct ib_mad_agent_private *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct mad_rmpp_recv *rmpp_recv; + struct ib_mad_hdr *mad_hdr; + + rmpp_recv = kmalloc(sizeof *rmpp_recv, GFP_KERNEL); + if (!rmpp_recv) + return NULL; + + rmpp_recv->ah = create_ah_from_wc(agent->agent.qp->pd, + mad_recv_wc->wc, + agent->agent.port_num); + if (IS_ERR(rmpp_recv->ah)) + goto error; + + rmpp_recv->agent = agent; + init_waitqueue_head(&rmpp_recv->wait); + INIT_WORK(&rmpp_recv->timeout_work, recv_timeout_handler, rmpp_recv); + INIT_WORK(&rmpp_recv->cleanup_work, recv_cleanup_handler, rmpp_recv); + spin_lock_init(&rmpp_recv->lock); + rmpp_recv->state = RMPP_STATE_ACTIVE; + atomic_set(&rmpp_recv->refcount, 1); + + rmpp_recv->rmpp_wc = mad_recv_wc; + rmpp_recv->cur_seg_buf = &mad_recv_wc->recv_buf; + rmpp_recv->newwin = 1; + rmpp_recv->seg_num = 1; + rmpp_recv->last_ack = 0; + + mad_hdr = &mad_recv_wc->recv_buf.mad->mad_hdr; + rmpp_recv->tid = mad_hdr->tid; + rmpp_recv->src_qp = mad_recv_wc->wc->src_qp; + rmpp_recv->slid = mad_recv_wc->wc->slid; + rmpp_recv->mgmt_class = mad_hdr->mgmt_class; + rmpp_recv->class_version = mad_hdr->class_version; + rmpp_recv->method = mad_hdr->method; + return rmpp_recv; + +error: kfree(rmpp_recv); + return NULL; +} + +static inline void deref_rmpp_recv(struct mad_rmpp_recv *rmpp_recv) +{ + if (atomic_dec_and_test(&rmpp_recv->refcount)) + wake_up(&rmpp_recv->wait); +} + +static struct mad_rmpp_recv * +find_rmpp_recv(struct ib_mad_agent_private *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct 
mad_rmpp_recv *rmpp_recv; + struct ib_mad_hdr *mad_hdr = &mad_recv_wc->recv_buf.mad->mad_hdr; + + list_for_each_entry(rmpp_recv, &agent->rmpp_list, list) { + if (rmpp_recv->tid == mad_hdr->tid && + rmpp_recv->src_qp == mad_recv_wc->wc->src_qp && + rmpp_recv->slid == mad_recv_wc->wc->slid && + rmpp_recv->mgmt_class == mad_hdr->mgmt_class && + rmpp_recv->class_version == mad_hdr->class_version && + rmpp_recv->method == mad_hdr->method) + return rmpp_recv; + } + return NULL; +} + +static struct mad_rmpp_recv * +acquire_rmpp_recv(struct ib_mad_agent_private *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct mad_rmpp_recv *rmpp_recv; + unsigned long flags; + + spin_lock_irqsave(&agent->lock, flags); + rmpp_recv = find_rmpp_recv(agent, mad_recv_wc); + if (rmpp_recv) + atomic_inc(&rmpp_recv->refcount); + spin_unlock_irqrestore(&agent->lock, flags); + return rmpp_recv; +} + +static struct mad_rmpp_recv * +insert_rmpp_recv(struct ib_mad_agent_private *agent, + struct mad_rmpp_recv *rmpp_recv) +{ + struct mad_rmpp_recv *cur_rmpp_recv; + + cur_rmpp_recv = find_rmpp_recv(agent, rmpp_recv->rmpp_wc); + if (!cur_rmpp_recv) + list_add_tail(&rmpp_recv->list, &agent->rmpp_list); + + return cur_rmpp_recv; +} + +static struct rmpp_msg * alloc_rmpp_msg(struct ib_mad_agent *mad_agent, + u32 remote_qpn, u16 pkey_index, + struct ib_ah *ah) +{ + struct rmpp_msg *msg; + + msg = kmalloc(sizeof *msg, GFP_KERNEL); + if (!msg) + return NULL; + memset(msg, 0, sizeof *msg); + + msg->sge.addr = dma_map_single(mad_agent->device->dma_device, + &msg->mad, sizeof msg->mad, + DMA_TO_DEVICE); + pci_unmap_addr_set(msg, mapping, msg->sge.addr); + msg->sge.length = sizeof msg->mad; + msg->sge.lkey = mad_agent->mr->lkey; + + msg->send_wr.wr_id = (unsigned long) msg; + msg->send_wr.sg_list = &msg->sge; + msg->send_wr.num_sge = 1; + msg->send_wr.opcode = IB_WR_SEND; + msg->send_wr.send_flags = IB_SEND_SIGNALED; + msg->send_wr.wr.ud.ah = ah; + msg->send_wr.wr.ud.mad_hdr = &msg->mad.mad_hdr; + 
msg->send_wr.wr.ud.remote_qpn = remote_qpn; + msg->send_wr.wr.ud.remote_qkey = IB_QP_SET_QKEY; + msg->send_wr.wr.ud.pkey_index = pkey_index; + + msg->mad_agent = mad_agent; + return msg; +} + +static void free_rmpp_msg(struct rmpp_msg *msg) +{ + dma_unmap_single(msg->mad_agent->device->dma_device, + pci_unmap_addr(msg, mapping), + sizeof msg->mad, DMA_TO_DEVICE); + kfree(msg); +} + +static void format_ack(struct ib_rmpp_mad *ack, + struct ib_rmpp_mad *data, + struct mad_rmpp_recv *rmpp_recv) +{ + unsigned long flags; + + ack->mad_hdr = data->mad_hdr; + ack->mad_hdr.method ^= IB_MGMT_METHOD_RESP; + ack->rmpp_hdr.rmpp_version = data->rmpp_hdr.rmpp_version; + ack->rmpp_hdr.rmpp_type = IB_MGMT_RMPP_TYPE_ACK; + ib_set_rmpp_resptime(&ack->rmpp_hdr, + ib_get_rmpp_resptime(&data->rmpp_hdr)); + ib_set_rmpp_flags(&ack->rmpp_hdr, IB_MGMT_RMPP_FLAG_ACTIVE); + + spin_lock_irqsave(&rmpp_recv->lock, flags); + rmpp_recv->last_ack = rmpp_recv->seg_num; + ack->rmpp_hdr.seg_num = cpu_to_be32(rmpp_recv->seg_num); + ack->rmpp_hdr.paylen_newwin = cpu_to_be32(rmpp_recv->newwin); + spin_unlock_irqrestore(&rmpp_recv->lock, flags); +} + +static void ack_recv(struct mad_rmpp_recv *rmpp_recv, + struct ib_mad_recv_wc *recv_wc) +{ + struct rmpp_msg *msg; + struct ib_send_wr *bad_send_wr; + int ret; + + msg = alloc_rmpp_msg(&rmpp_recv->agent->agent, recv_wc->wc->src_qp, + recv_wc->wc->pkey_index, rmpp_recv->ah); + if (!msg) + return; + + format_ack(&msg->mad, (struct ib_rmpp_mad *) recv_wc->recv_buf.mad, + rmpp_recv); + ret = ib_post_send_mad(&rmpp_recv->agent->agent, &msg->send_wr, + &bad_send_wr); + if (ret) + free_rmpp_msg(msg); +} + +static inline int get_last_flag(struct ib_mad_recv_buf *seg) +{ + struct ib_rmpp_mad *rmpp_mad; + + rmpp_mad = (struct ib_rmpp_mad *) seg->mad; + return ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & IB_MGMT_RMPP_FLAG_LAST; +} + +static inline int get_seg_num(struct ib_mad_recv_buf *seg) +{ + struct ib_rmpp_mad *rmpp_mad; + + rmpp_mad = (struct ib_rmpp_mad *) 
seg->mad; + return be32_to_cpu(rmpp_mad->rmpp_hdr.seg_num); +} + +static inline struct ib_mad_recv_buf * get_next_seg(struct list_head *rmpp_list, + struct ib_mad_recv_buf *seg) +{ + if (seg->list.next == rmpp_list) + return NULL; + + return container_of(seg->list.next, struct ib_mad_recv_buf, list); +} + +static inline int window_size(struct ib_mad_agent_private *agent) +{ + return max(agent->qp_info->recv_queue.max_active >> 3, 1); +} + +static struct ib_mad_recv_buf * find_seg_location(struct list_head *rmpp_list, + int seg_num) +{ + struct ib_mad_recv_buf *seg_buf; + int cur_seg_num; + + list_for_each_entry_reverse(seg_buf, rmpp_list, list) { + cur_seg_num = get_seg_num(seg_buf); + if (seg_num > cur_seg_num) + return seg_buf; + if (seg_num == cur_seg_num) + break; + } + return NULL; +} + +static void update_seg_num(struct mad_rmpp_recv *rmpp_recv, + struct ib_mad_recv_buf *new_buf) +{ + struct list_head *rmpp_list = &rmpp_recv->rmpp_wc->rmpp_list; + + while (new_buf && (get_seg_num(new_buf) == rmpp_recv->seg_num + 1)) { + rmpp_recv->cur_seg_buf = new_buf; + rmpp_recv->seg_num++; + new_buf = get_next_seg(rmpp_list, new_buf); + } +} + +static inline int get_mad_len(struct mad_rmpp_recv *rmpp_recv) +{ + int hdr_size; + + /* TODO: need to check for SA MADs - requires access to SA header */ + hdr_size = sizeof(struct ib_mad_hdr) + sizeof(struct ib_rmpp_hdr); + return rmpp_recv->seg_num * (sizeof(struct ib_mad) - hdr_size) + + hdr_size; +} + +static struct ib_mad_recv_wc * complete_rmpp(struct mad_rmpp_recv *rmpp_recv) +{ + struct ib_mad_recv_wc *rmpp_wc; + + ack_recv(rmpp_recv, rmpp_recv->rmpp_wc); + if (rmpp_recv->seg_num > 1) + cancel_delayed_work(&rmpp_recv->timeout_work); + + rmpp_wc = rmpp_recv->rmpp_wc; + rmpp_wc->mad_len = get_mad_len(rmpp_recv); + /* 5 seconds until we can find the packet lifetime */ + queue_delayed_work(rmpp_recv->agent->qp_info->port_priv->wq, + &rmpp_recv->cleanup_work, msecs_to_jiffies(5000)); + return rmpp_wc; +} + +static struct 
ib_mad_recv_wc * +continue_rmpp(struct ib_mad_agent_private *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct mad_rmpp_recv *rmpp_recv; + struct ib_mad_recv_buf *prev_buf; + struct ib_mad_recv_wc *done_wc; + int seg_num; + unsigned long flags; + + rmpp_recv = acquire_rmpp_recv(agent, mad_recv_wc); + if (!rmpp_recv) + goto drop1; + + seg_num = get_seg_num(&mad_recv_wc->recv_buf); + + spin_lock_irqsave(&rmpp_recv->lock, flags); + if ((rmpp_recv->state != RMPP_STATE_ACTIVE) || + (seg_num > rmpp_recv->newwin)) + goto drop3; + + if (seg_num <= rmpp_recv->last_ack) { + spin_unlock_irqrestore(&rmpp_recv->lock, flags); + ack_recv(rmpp_recv, mad_recv_wc); + goto drop2; + } + + prev_buf = find_seg_location(&rmpp_recv->rmpp_wc->rmpp_list, seg_num); + if (!prev_buf) + goto drop3; + + done_wc = NULL; + list_add(&mad_recv_wc->recv_buf.list, &prev_buf->list); + if (rmpp_recv->cur_seg_buf == prev_buf) { + update_seg_num(rmpp_recv, &mad_recv_wc->recv_buf); + if (get_last_flag(rmpp_recv->cur_seg_buf)) { + rmpp_recv->state = RMPP_STATE_COMPLETE; + spin_unlock_irqrestore(&rmpp_recv->lock, flags); + done_wc = complete_rmpp(rmpp_recv); + goto out; + } else if (rmpp_recv->seg_num == rmpp_recv->newwin) { + rmpp_recv->newwin += window_size(agent); + spin_unlock_irqrestore(&rmpp_recv->lock, flags); + ack_recv(rmpp_recv, mad_recv_wc); + goto out; + } + } + spin_unlock_irqrestore(&rmpp_recv->lock, flags); +out: + deref_rmpp_recv(rmpp_recv); + return done_wc; + +drop3: spin_unlock_irqrestore(&rmpp_recv->lock, flags); +drop2: deref_rmpp_recv(rmpp_recv); +drop1: ib_free_recv_mad(mad_recv_wc); + return NULL; +} + +static struct ib_mad_recv_wc * +start_rmpp(struct ib_mad_agent_private *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct mad_rmpp_recv *rmpp_recv; + unsigned long flags; + + rmpp_recv = create_rmpp_recv(agent, mad_recv_wc); + if (!rmpp_recv) { + ib_free_recv_mad(mad_recv_wc); + return NULL; + } + + spin_lock_irqsave(&agent->lock, flags); + if (insert_rmpp_recv(agent, 
rmpp_recv)) { + spin_unlock_irqrestore(&agent->lock, flags); + /* duplicate first MAD */ + destroy_rmpp_recv(rmpp_recv); + return continue_rmpp(agent, mad_recv_wc); + } + atomic_inc(&rmpp_recv->refcount); + spin_unlock_irqrestore(&agent->lock, flags); + + if (get_last_flag(&mad_recv_wc->recv_buf)) { + rmpp_recv->state = RMPP_STATE_COMPLETE; + complete_rmpp(rmpp_recv); + } else { + /* 40 seconds until we can find the packet lifetimes */ + queue_delayed_work(agent->qp_info->port_priv->wq, + &rmpp_recv->timeout_work, + msecs_to_jiffies(40000)); + rmpp_recv->newwin += window_size(agent); + ack_recv(rmpp_recv, mad_recv_wc); + mad_recv_wc = NULL; + } + deref_rmpp_recv(rmpp_recv); + return mad_recv_wc; +} + +struct ib_mad_recv_wc * +ib_process_rmpp_recv_wc(struct ib_mad_agent_private *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_rmpp_mad *rmpp_mad; + + rmpp_mad = (struct ib_rmpp_mad *)mad_recv_wc->recv_buf.mad; + if (!(rmpp_mad->rmpp_hdr.rmpp_rtime_flags & IB_MGMT_RMPP_FLAG_ACTIVE)) + return mad_recv_wc; + + switch (rmpp_mad->rmpp_hdr.rmpp_type) { + case IB_MGMT_RMPP_TYPE_DATA: + if (rmpp_mad->rmpp_hdr.seg_num == __constant_htonl(1)) + return start_rmpp(agent, mad_recv_wc); + else + return continue_rmpp(agent, mad_recv_wc); + case IB_MGMT_RMPP_TYPE_ACK: + /* process_rmpp_ack(agent, mad_recv_wc); */ + break; + case IB_MGMT_RMPP_TYPE_STOP: + case IB_MGMT_RMPP_TYPE_ABORT: + /* process_rmpp_nack(agent, mad_recv_wc); */ + break; + default: + break; + } + ib_free_recv_mad(mad_recv_wc); + return NULL; +} + + +enum ib_mad_result +ib_process_rmpp_send_wc(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_rmpp_mad *rmpp_mad; + struct rmpp_msg *msg; + + rmpp_mad = (struct ib_rmpp_mad *)mad_send_wr->send_wr.wr.ud.mad_hdr; + if (!(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & + IB_MGMT_RMPP_FLAG_ACTIVE)) + return IB_MAD_RESULT_SUCCESS; + + if (rmpp_mad->rmpp_hdr.rmpp_type != IB_MGMT_RMPP_TYPE_DATA) { + msg = (struct rmpp_msg *) 
(unsigned long) mad_send_wc->wr_id; + free_rmpp_msg(msg); + return IB_MAD_RESULT_CONSUMED; + } + + /* TODO: continue send until done - ACKed or we have a response */ + return IB_MAD_RESULT_SUCCESS; +} Index: core/mad.c =================================================================== --- core/mad.c (revision 2028) +++ core/mad.c (working copy) @@ -31,10 +31,9 @@ * * $Id$ */ - #include -#include "mad_priv.h" +#include "mad_rmpp.h" #include "smi.h" #include "agent.h" @@ -199,8 +198,8 @@ if (qpn == -1) goto error1; - if (rmpp_version) - goto error1; /* XXX: until RMPP implemented */ + if (rmpp_version && rmpp_version != IB_MGMT_RMPP_VERSION) + goto error1; /* Validate MAD registration request if supplied */ if (mad_reg_req) { @@ -344,6 +343,7 @@ spin_lock_init(&mad_agent_priv->lock); INIT_LIST_HEAD(&mad_agent_priv->send_list); INIT_LIST_HEAD(&mad_agent_priv->wait_list); + INIT_LIST_HEAD(&mad_agent_priv->rmpp_list); INIT_WORK(&mad_agent_priv->timed_work, timeout_sends, mad_agent_priv); INIT_LIST_HEAD(&mad_agent_priv->local_list); INIT_WORK(&mad_agent_priv->local_work, local_completions, @@ -507,7 +507,7 @@ spin_unlock_irqrestore(&port_priv->reg_lock, flags); flush_workqueue(port_priv->wq); - /* ib_cancel_rmpp_recvs(mad_agent_priv); */ + ib_cancel_rmpp_recvs(mad_agent_priv); atomic_dec(&mad_agent_priv->refcount); wait_event(mad_agent_priv->wait, @@ -925,27 +925,19 @@ struct ib_mad_private *priv; struct list_head free_list; - if (mad_recv_wc->mad_len <= sizeof(struct ib_mad)) { + INIT_LIST_HEAD(&free_list); + list_splice_init(&mad_recv_wc->rmpp_list, &free_list); + + list_for_each_entry_safe(mad_recv_buf, temp_recv_buf, + &free_list, list) { + mad_recv_wc = container_of(mad_recv_buf, struct ib_mad_recv_wc, + recv_buf); mad_priv_hdr = container_of(mad_recv_wc, struct ib_mad_private_header, recv_wc); priv = container_of(mad_priv_hdr, struct ib_mad_private, header); kmem_cache_free(ib_mad_cache, priv); - } else { - INIT_LIST_HEAD(&free_list); - 
list_splice_init(&mad_recv_wc->rmpp_list, &free_list); - - list_for_each_entry_safe(mad_recv_buf, temp_recv_buf, - &free_list, list) { - mad_priv_hdr = container_of(mad_recv_wc, - struct ib_mad_private_header, - recv_wc); - priv = container_of(mad_priv_hdr, - struct ib_mad_private, - header); - kmem_cache_free(ib_mad_cache, priv); - } } } EXPORT_SYMBOL(ib_free_recv_mad); @@ -1496,12 +1488,10 @@ INIT_LIST_HEAD(&mad_recv_wc->rmpp_list); list_add(&mad_recv_wc->recv_buf.list, &mad_recv_wc->rmpp_list); - /* if (mad_agent_priv->agent.rmpp_version) - return ib_process_rmpp_recv(mad_agent_priv, mad_recv_wc); + return ib_process_rmpp_recv_wc(mad_agent_priv, mad_recv_wc); else - */ - return mad_recv_wc; + return mad_recv_wc; } static struct ib_mad_send_wr_private* @@ -1768,12 +1758,10 @@ mad_agent_priv = container_of(mad_send_wr->agent, struct ib_mad_agent_private, agent); - /* if (mad_agent_priv->agent.rmpp_version) ret = ib_process_rmpp_send_wc(mad_send_wr, mad_send_wc); else - */ - ret = IB_MAD_RESULT_SUCCESS; + ret = IB_MAD_RESULT_SUCCESS; spin_lock_irqsave(&mad_agent_priv->lock, flags); if (mad_send_wc->status != IB_WC_SUCCESS && @@ -1800,7 +1788,7 @@ mad_send_wc->status = mad_send_wr->status; if (ret == IB_MAD_RESULT_SUCCESS) mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, - mad_send_wc); + mad_send_wc); /* Release reference on agent taken when sending */ if (atomic_dec_and_test(&mad_agent_priv->refcount)) Index: core/mad_rmpp.h =================================================================== --- core/mad_rmpp.h (revision 0) +++ core/mad_rmpp.h (revision 0) @@ -0,0 +1,50 @@ +/* + * Copyright (c) 2005 Intel Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ *
+ * $Id: mad_rmpp.h 1921 2005-02-25 22:58:44Z sean.hefty $
+ */
+
+#ifndef __MAD_RMPP_H__
+#define __MAD_RMPP_H__
+
+#include "mad_priv.h"
+
+struct ib_mad_recv_wc *
+ib_process_rmpp_recv_wc(struct ib_mad_agent_private *mad_agent_priv,
+			struct ib_mad_recv_wc *mad_recv_wc);
+
+enum ib_mad_result
+ib_process_rmpp_send_wc(struct ib_mad_send_wr_private *mad_send_wr,
+			struct ib_mad_send_wc *mad_send_wc);
+
+void ib_cancel_rmpp_recvs(struct ib_mad_agent_private *mad_agent_priv);
+
+#endif /* __MAD_RMPP_H__ */
Index: core/mad_priv.h
===================================================================
--- core/mad_priv.h	(revision 2028)
+++ core/mad_priv.h	(working copy)
@@ -98,6 +98,7 @@
 	struct work_struct local_work;
 	struct list_head canceled_list;
 	struct work_struct canceled_work;
+	struct list_head rmpp_list;
 
 	atomic_t refcount;
 	wait_queue_head_t wait;
Index: core/Makefile
===================================================================
--- core/Makefile	(revision 2028)
+++ core/Makefile	(working copy)
@@ -6,7 +6,7 @@
 ib_core-y := packer.o ud_header.o verbs.o sysfs.o \
 	device.o fmr_pool.o cache.o
 
-ib_mad-y := mad.o smi.o agent.o
+ib_mad-y := mad.o smi.o agent.o mad_rmpp.o
 
 ib_ping-y := ping.o

From hozer at hozed.org Fri Mar 18 18:38:50 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Fri, 18 Mar 2005 20:38:50 -0600
Subject: [openib-general] Causes of interrupt problems?
Message-ID: <20050319023850.GE9768@kalmia.hozed.org>

What would cause the following?

ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004)
ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:04:00.0)
ib_mthca 0000:04:00.0: NOP command failed to generate interrupt, aborting.
ib_mthca 0000:04:00.0: BIOS or ACPI interrupt routing problem?

I've seen this on two Opteron systems, one Tyan board, one Rioworks
HDAMA. Is there some bios setting I should look for? Things are working
fine on another Rioworks HDAMA board.
-- -------------------------------------------------------------------------- Troy Benjegerdes 'da hozer' hozer at hozed.org Somone asked my why I work on this free (http://www.fsf.org/philosophy/) software stuff and not get a real job. Charles Shultz had the best answer: "Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life." -- Charles Shultz From sean.hefty at intel.com Fri Mar 18 18:38:51 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 18 Mar 2005 18:38:51 -0800 Subject: [openib-general] [PATCH] [RMPP] receive RMPP support In-Reply-To: <20050318182430.0e6e8a16.mshefty@ichips.intel.com> Message-ID: On a separate note, I'd like to solicit comments about exposing the following (with slight modifications), in ib_verbs.h, ib_mad.h, and/or ib_mad_helper.h. - Sean >+struct rmpp_msg { >+ struct ib_mad_agent *mad_agent; >+ struct ib_send_wr send_wr; >+ struct ib_sge sge; >+ DECLARE_PCI_UNMAP_ADDR(mapping) >+ struct ib_rmpp_mad mad; >+}; >+ >+static struct ib_ah * create_ah_from_wc(struct ib_pd *pd, struct ib_wc >*wc, >+ u8 port_num) >+{ >+ struct ib_ah_attr *ah_attr; >+ struct ib_ah *ah; >+ >+ ah_attr = kmalloc(sizeof *ah_attr, GFP_KERNEL); >+ if (!ah_attr) >+ return ERR_PTR(-ENOMEM); >+ >+ memset(ah_attr, 0, sizeof *ah_attr); >+ ah_attr->dlid = wc->slid; >+ ah_attr->sl = wc->sl; >+ ah_attr->src_path_bits = wc->dlid_path_bits; >+ ah_attr->port_num = port_num; >+ >+ ah = ib_create_ah(pd, ah_attr); >+ kfree(ah_attr); >+ return ah; >+} >+ >+static struct rmpp_msg * alloc_rmpp_msg(struct ib_mad_agent *mad_agent, >+ u32 remote_qpn, u16 pkey_index, >+ struct ib_ah *ah) >+{ >+ struct rmpp_msg *msg; >+ >+ msg = kmalloc(sizeof *msg, GFP_KERNEL); >+ if (!msg) >+ return NULL; >+ memset(msg, 0, sizeof *msg); >+ >+ msg->sge.addr = dma_map_single(mad_agent->device->dma_device, >+ &msg->mad, sizeof msg->mad, >+ DMA_TO_DEVICE); >+ pci_unmap_addr_set(msg, 
mapping, msg->sge.addr); >+ msg->sge.length = sizeof msg->mad; >+ msg->sge.lkey = mad_agent->mr->lkey; >+ >+ msg->send_wr.wr_id = (unsigned long) msg; >+ msg->send_wr.sg_list = &msg->sge; >+ msg->send_wr.num_sge = 1; >+ msg->send_wr.opcode = IB_WR_SEND; >+ msg->send_wr.send_flags = IB_SEND_SIGNALED; >+ msg->send_wr.wr.ud.ah = ah; >+ msg->send_wr.wr.ud.mad_hdr = &msg->mad.mad_hdr; >+ msg->send_wr.wr.ud.remote_qpn = remote_qpn; >+ msg->send_wr.wr.ud.remote_qkey = IB_QP_SET_QKEY; >+ msg->send_wr.wr.ud.pkey_index = pkey_index; >+ >+ msg->mad_agent = mad_agent; >+ return msg; >+} >+ >+static void free_rmpp_msg(struct rmpp_msg *msg) >+{ >+ dma_unmap_single(msg->mad_agent->device->dma_device, >+ pci_unmap_addr(msg, mapping), >+ sizeof msg->mad, DMA_TO_DEVICE); >+ kfree(msg); From roland at topspin.com Fri Mar 18 20:15:04 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 18 Mar 2005 20:15:04 -0800 Subject: [openib-general] [PATCH] [RMPP] receive RMPP support In-Reply-To: (Sean Hefty's message of "Fri, 18 Mar 2005 18:38:51 -0800") References: Message-ID: <52u0n8jmpj.fsf@topspin.com> I'm not sure if/how RMPP will be used, so it's not clear to me how useful the RMPP functions are -- I don't feel qualified to have an opinion yet. >+static struct ib_ah * create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc, >+ u8 port_num) This function seems reasonably useful. >+{ >+ struct ib_ah_attr *ah_attr; >+ struct ib_ah *ah; >+ >+ ah_attr = kmalloc(sizeof *ah_attr, GFP_KERNEL); Is it really worth kmalloc() for ah_attr here? struct ib_ah_attr is only 32 bytes. Between the ah_attr pointer and the stack used by the call to kmalloc(), the current code is probably using at least 16 bytes. I'd trade 16 bytes of stack for smaller source and object code. 
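
As a rough illustration of the stack-allocation suggestion above, here is a user-space sketch. Everything with a _model suffix is a hypothetical stand-in invented for this example, not the kernel's struct ib_wc, struct ib_ah_attr, or ib_create_ah(); only the four field assignments mirror the quoted patch. It does not handle the GRH case.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space stand-ins for the kernel types (assumptions). */
struct wc_model {
	uint16_t slid;
	uint8_t  sl;
	uint8_t  dlid_path_bits;
};

struct ah_attr_model {
	uint16_t dlid;
	uint8_t  sl;
	uint8_t  src_path_bits;
	uint8_t  port_num;
};

struct ah_model {
	struct ah_attr_model attr;
};

/* Stand-in for ib_create_ah(): copies the attributes and always succeeds. */
static int create_ah_model(struct ah_model *ah,
			   const struct ah_attr_model *attr)
{
	ah->attr = *attr;
	return 0;
}

/*
 * The attribute struct lives on the stack, so there is no allocation,
 * no out-of-memory error path, and nothing to free afterwards.
 */
static int create_ah_from_wc_model(struct ah_model *ah,
				   const struct wc_model *wc,
				   uint8_t port_num)
{
	struct ah_attr_model attr;

	assert(ah != NULL && wc != NULL);

	memset(&attr, 0, sizeof attr);
	attr.dlid = wc->slid;			/* reply goes back to the sender */
	attr.sl = wc->sl;
	attr.src_path_bits = wc->dlid_path_bits;
	attr.port_num = port_num;

	return create_ah_model(ah, &attr);
}
```

Compared with the quoted version, the function loses one local pointer, one error branch, and the kfree() call, at the cost of a small struct on the stack.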
>+ if (!ah_attr) >+ return ERR_PTR(-ENOMEM); >+ >+ memset(ah_attr, 0, sizeof *ah_attr); >+ ah_attr->dlid = wc->slid; >+ ah_attr->sl = wc->sl; >+ ah_attr->src_path_bits = wc->dlid_path_bits; >+ ah_attr->port_num = port_num; What if the wc has IB_WC_GRH set? I'm not sure how useful a helper this is if it doesn't handle the GRH case. >+ >+ ah = ib_create_ah(pd, ah_attr); >+ kfree(ah_attr); >+ return ah; >+} Also, while we're looking at the code... >+ msg->sge.addr = dma_map_single(mad_agent->device->dma_device, >+ &msg->mad, sizeof msg->mad, >+ DMA_TO_DEVICE); it's somewhat risky to use dma_map_single() on fields in the middle of a structure, because you don't know that the field starts and ends on a cacheline boundary. In this case you're pretty safe because you're doing DMA_TO_DEVICE, but if you use the same type of code with DMA_FROM_DEVICE on a non-cache-coherent arch (e.g. PowerPC 4xx) then you can get into trouble. See for a very nice writeup of the problem. - R. From roland at topspin.com Fri Mar 18 20:23:12 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 18 Mar 2005 20:23:12 -0800 Subject: [openib-general] Causes of interrupt problems? In-Reply-To: <20050319023850.GE9768@kalmia.hozed.org> (Troy Benjegerdes's message of "Fri, 18 Mar 2005 20:38:50 -0600") References: <20050319023850.GE9768@kalmia.hozed.org> Message-ID: <52psxwjmbz.fsf@topspin.com> > What would cause the following? > ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) > ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:04:00.0) > ib_mthca 0000:04:00.0: NOP command failed to generate interrupt, aborting. > ib_mthca 0000:04:00.0: BIOS or ACPI interrupt routing problem? > I've seen this on two Opteron systems, one Tyan board, one Rioworks > HDAMA. Is there some bios setting I should look for? Things are working > fine on another Rioworks HDAMA board. 
It seems that the fact that the HCA appears as a PCI device with a
huge BAR behind a PCI bridge confuses some BIOS/ACPI implementations.

Looking at that error message I realize it might be nice to be able to
see what IRQ the driver is trying. If you change the line in
mthca_main.c that prints the error to something like

	mthca_err(dev, "NOP command failed to generate interrupt (IRQ %d), aborting.\n",
		  dev->mthca_flags & MTHCA_FLAG_MSI_X ?
		  dev->eq_table.eq[MTHCA_EQ_CMD].msi_x_vector :
		  dev->pdev->irq);

then you can see what IRQ the HCA driver is trying. Then you can put
another device like an ethernet in the same PCI slot and (assuming
that the device works) compare the IRQ it is using with the one that
mthca saw. If they're different then most likely you have a BIOS/ACPI
problem. Unfortunately I'm not much good at fixing that sort of
thing. The only thing I know to try is looking for a newer BIOS version.

Other things to check: do the two HDAMA boards have the same BIOS
revision? Is the HCA in the same slot in both boards?

 - R.

From eitan at mellanox.co.il Fri Mar 18 21:57:02 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 19 Mar 2005 07:57:02 +0200
Subject: [openib-general] opensm trouble
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EEFAD@mtlex01.yok.mtl.com>

Hi Roland,

>
> Choose a local port number with which to bind:
>
> 1: GUID = 0x 0, lid = 0x03C9, state = INIT
> 2: GUID = 0x 0, lid = 0x03CA, state = DOWN
>
> Enter choice (1-2): 1
> SM port is down.

The fact that the displayed GUID is 0 means that somehow OpenSM did not
get a "correct" (in its own eyes) result from VAPI_query_hca_gid_tbl.

I assume you were able to run vstat?

It seems there is either a bug in the OpenSM vendor layer
(osm_vendor_mlx_hca.c), or a mismatch between the evapi.h that OpenSM
is linked with and the one that vstat is linked with.

I will be able to try and reproduce the problem in our lab only Sunday.
But I guess if you have a chance to look into the above file it will not be very hard to find too. EZ -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean.hefty at intel.com Fri Mar 18 22:33:15 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 18 Mar 2005 22:33:15 -0800 Subject: [openib-general] [PATCH] [RMPP] receive RMPP support In-Reply-To: <52u0n8jmpj.fsf@topspin.com> Message-ID: >I'm not sure if/how RMPP will be used, so it's not clear to me how >useful the RMPP functions are -- I don't feel qualified to have an >opinion yet. I wouldn't have generic functions be named or specific to RMPP. In looking at the CM, RMPP, SA query, SMI, and now ping code, there's a general commonality between them for sending a MAD. There's similarity in structure definitions, along with the allocation and formatting of the send_wr. My thought is that if we can get an agreement on a send_mad structure, additional support by the MAD layer could make sending MADs a little easier without the need to change the existing APIs (for the most part, if at all). >>+{ >>+ struct ib_ah_attr *ah_attr; >>+ struct ib_ah *ah; >>+ >>+ ah_attr = kmalloc(sizeof *ah_attr, GFP_KERNEL); > >Is it really worth kmalloc() for ah_attr here? struct ib_ah_attr is >only 32 bytes. Between the ah_attr pointer and the stack used by the >call to kmalloc(), the current code is probably using at least 16 >bytes. I'd trade 16 bytes of stack for smaller source and object code. I'll remove the kmalloc here before committing the patch. Thanks. >>+ if (!ah_attr) >>+ return ERR_PTR(-ENOMEM); >>+ >>+ memset(ah_attr, 0, sizeof *ah_attr); >>+ ah_attr->dlid = wc->slid; >>+ ah_attr->sl = wc->sl; >>+ ah_attr->src_path_bits = wc->dlid_path_bits; >>+ ah_attr->port_num = port_num; > >What if the wc has IB_WC_GRH set? I'm not sure how useful a helper >this is if it doesn't handle the GRH case. I considered GRH, and my thought was to update the call for GRH support when it's needed. 
I can go back and verify that the API takes all the required information for GRH support. >>+ msg->sge.addr = dma_map_single(mad_agent->device->dma_device, >>+ &msg->mad, sizeof msg->mad, >>+ DMA_TO_DEVICE); > >it's somewhat risky to use dma_map_single() on fields in the middle of >a structure, because you don't know that the field starts and ends on >a cacheline boundary. In this case you're pretty safe because you're >doing DMA_TO_DEVICE, but if you use the same type of code with >DMA_FROM_DEVICE on a non-cache-coherent arch (e.g. PowerPC 4xx) then >you can get into trouble. See for a >very nice writeup of the problem. Thanks for the link. I'll look at updating this before committing the patch as well. - Sean From iod00d at hp.com Fri Mar 18 22:50:43 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 18 Mar 2005 22:50:43 -0800 Subject: [openib-general] [PATCH] [RMPP] receive RMPP support In-Reply-To: <52u0n8jmpj.fsf@topspin.com> References: <52u0n8jmpj.fsf@topspin.com> Message-ID: <20050319065043.GH15070@esmail.cup.hp.com> On Fri, Mar 18, 2005 at 08:15:04PM -0800, Roland Dreier wrote: > it's somewhat risky to use dma_map_single() on fields in the middle of > a structure, because you don't know that the field starts and ends on > a cacheline boundary. In this case you're pretty safe because you're > doing DMA_TO_DEVICE, but if you use the same type of code with > DMA_FROM_DEVICE on a non-cache-coherent arch (e.g. PowerPC 4xx) then > you can get into trouble. See for a > very nice writeup of the problem. Indeed. I never saw even that though I review DMA_mapping.txt and wrote the DMA support for parisc-linux port. Good catch and have to wonder if PCI pool is really the right choice (or not) for inbound DMA....need some way of enforcing that in order to catch it before it's a problem. 
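
One possible way to enforce the cacheline rule at build time is sketched below in user-space C11. This is an illustration under stated assumptions, not how the MAD layer does it: the 64-byte line size, the 256-byte buffer size, and every name with a _model suffix are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHELINE_MODEL 64	/* assumed line size; real hardware varies */

/* Hypothetical stand-in for the buffer that would be handed to the device. */
struct mad_model {
	uint8_t data[256];
};

/* Hypothetical message wrapper: bookkeeping first, DMA buffer last. */
struct rmpp_msg_model {
	void *agent;
	uint64_t mapping;
	_Alignas(CACHELINE_MODEL) struct mad_model mad;
};

/* Fail the build if the buffer could share a cacheline with other fields. */
_Static_assert(offsetof(struct rmpp_msg_model, mad) % CACHELINE_MODEL == 0,
	       "DMA buffer must start on a cacheline boundary");
_Static_assert(sizeof(struct mad_model) % CACHELINE_MODEL == 0,
	       "DMA buffer must end on a cacheline boundary");
_Static_assert(offsetof(struct rmpp_msg_model, mad) +
	       sizeof(struct mad_model) == sizeof(struct rmpp_msg_model),
	       "nothing may follow the DMA buffer inside the struct");

/* Runtime check for buffers whose layout is not known at compile time. */
static int spans_whole_cachelines(const void *buf, size_t len)
{
	return (uintptr_t)buf % CACHELINE_MODEL == 0 &&
	       len % CACHELINE_MODEL == 0;
}

/* The allocation itself must also be line-aligned for the checks to hold. */
static struct rmpp_msg_model *alloc_msg_model(void)
{
	struct rmpp_msg_model *msg;

	msg = aligned_alloc(CACHELINE_MODEL, sizeof *msg);
	if (msg)
		memset(msg, 0, sizeof *msg);
	return msg;
}
```

The static assertions catch a bad layout when the structure is edited, and the runtime helper could back a debug-only check at the point where a buffer is mapped.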
thanks, grant From rf at q-leap.de Sat Mar 19 03:00:14 2005 From: rf at q-leap.de (Roland Fehrenbacher) Date: Sat, 19 Mar 2005 12:00:14 +0100 Subject: [openib-general] Causes of interrupt problems? In-Reply-To: <20050319023850.GE9768@kalmia.hozed.org> References: <20050319023850.GE9768@kalmia.hozed.org> Message-ID: <16956.1598.291390.272184@gargle.gargle.HOWL> >>>>> "Troy" == Troy Benjegerdes writes: Troy> What would cause the following? > ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) > ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:04:00.0) > ib_mthca 0000:04:00.0: NOP command failed to generate interrupt, aborting. > ib_mthca 0000:04:00.0: BIOS or ACPI interrupt routing problem? kernel commandline parameter "pci=noacpi" might get around this. If not, try "noapic". Roland Troy> I've seen this on two Opteron systems, one Tyan board, one Troy> Rioworks HDAMA. Is there some bios setting I should look Troy> for? Things are working fine on another Rioworks HDAMA Troy> board. From rf at q-leap.de Sat Mar 19 03:02:02 2005 From: rf at q-leap.de (Roland Fehrenbacher) Date: Sat, 19 Mar 2005 12:02:02 +0100 Subject: [openib-general] opensm trouble In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EEFAD@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6047EEFAD@mtlex01.yok.mtl.com> Message-ID: <16956.1706.724345.206530@gargle.gargle.HOWL> >>>>> "Eitan" == Eitan Zahavi writes: Hi Eitan, > Choose a local port number with which to bind: > > 1: GUID = 0x 0, lid = 0x03C9, state = INIT > 2: GUID = 0x 0, lid = 0x03CA, state = DOWN > > Enter choice (1-2): 1 > SM port is down. Eitan> The fact the displayed GUID is 0 means that somehow OpenSM Eitan> did not get the "correct" (in it's own eyes) to Eitan> VAPI_query_hca_gid_tbl. Eitan> I assume you were able to do vstat ? yes. 
vstat output is: # vstat 1 HCA found: hca_id=InfiniHost0 vendor_id=0x02C9 vendor_part_id=0x5A44 hw_ver=0xA1 fw_ver=0x300020000 num_phys_ports=2 port=1 port_state=PORT_INITIALIZE sm_lid=0x0000 port_lid=0x03c9 port_lmc=0x00 max_mtu=2048 port=2 port_state=PORT_DOWN sm_lid=0x0000 port_lid=0x03ca port_lmc=0x00 max_mtu=2048 Eitan> Seems there is a bug either in the OpenSM vendor layer: Eitan> osm_vendor_mlx_hca.c Eitan> Or there is a miss-match between the evapi.h that OpenSM is Eitan> linked with and the one that vstat is linked with. Eitan> I will be able to try and reproduce the problem in our lab Eitan> only Sunday. But I guess if you have a chance to look into Eitan> the above file it will not be very hard to find too. Roland From hozer at hozed.org Sat Mar 19 14:31:33 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Sat, 19 Mar 2005 16:31:33 -0600 Subject: [openib-general] Causes of interrupt problems? In-Reply-To: <52psxwjmbz.fsf@topspin.com> References: <20050319023850.GE9768@kalmia.hozed.org> <52psxwjmbz.fsf@topspin.com> Message-ID: <20050319223133.GF9768@kalmia.hozed.org> On Fri, Mar 18, 2005 at 08:23:12PM -0800, Roland Dreier wrote: > > What would cause the following? > > > ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) > > ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:04:00.0) > > ib_mthca 0000:04:00.0: NOP command failed to generate interrupt, aborting. > > ib_mthca 0000:04:00.0: BIOS or ACPI interrupt routing problem? > > > I've seen this on two Opteron systems, one Tyan board, one Rioworks > > HDAMA. Is there some bios setting I should look for? Things are working > > fine on another Rioworks HDAMA board. > > It seems that the fact that the HCA appears as a PCI device with a > huge BAR behind a PCI bridge confuses some BIOS/ACPI implementations. > > Looking at that error message I realize it might be nice to be able to > see what IRQ the driver is trying. 
If you change the line in > mthca_main.c that prints the error to something like > > mthca_err(dev, "NOP command failed to generate interrupt (IRQ %d), aborting.\n", > dev->mthca_flags & MTHCA_FLAG_MSI_X ? > dev->eq_table.eq[MTHCA_EQ_CMD].msi_x_vector : > dev->pdev->irq); Can you add this, as well as a check for recent firmware and/or card revision? I have some cards with ancient firmware revisions, which seem like they don't implement NOP. The bios was actually fine on this machine, and everything was happy once I put a card with a newer firmware in. FYI, I've now got nfs over ipoib running, and I'm getting about 110-120 MB/sec read throughput from nfs using 'dd if=nfsfile of=/dev/null'. From mst at mellanox.co.il Sat Mar 19 14:42:12 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 20 Mar 2005 00:42:12 +0200 Subject: [openib-general] Re: Causes of interrupt problems? In-Reply-To: <52psxwjmbz.fsf@topspin.com> References: <20050319023850.GE9768@kalmia.hozed.org> <52psxwjmbz.fsf@topspin.com> Message-ID: <20050319224212.GB1741@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Causes of interrupt problems? > > > What would cause the following? > > > ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) > > ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:04:00.0) > > ib_mthca 0000:04:00.0: NOP command failed to generate interrupt, aborting. > > ib_mthca 0000:04:00.0: BIOS or ACPI interrupt routing problem? > > > I've seen this on two Opteron systems, one Tyan board, one Rioworks > > HDAMA. Is there some bios setting I should look for? Things are working > > fine on another Rioworks HDAMA board. > > It seems that the fact that the HCA appears as a PCI device with a > huge BAR behind a PCI bridge confuses some BIOS/ACPI implementations. > > Looking at that error message I realize it might be nice to be able to > see what IRQ the driver is trying. 
If you change the line in > mthca_main.c that prints the error to something like > > mthca_err(dev, "NOP command failed to generate interrupt (IRQ %d), aborting.\n", > dev->mthca_flags & MTHCA_FLAG_MSI_X ? > dev->eq_table.eq[MTHCA_EQ_CMD].msi_x_vector : > dev->pdev->irq); > > then you can see what IRQ the HCA driver is trying. Then you can put > another device like an ethernet in the same PCI slot and (assuming > that the device works) compare the IRQ it is using with the one that > mthca saw. If they're different then most likely you have a BIOS/ACPI > problem. Unfortunately I'm not much good at fixing that sort of > thing. The only thing I know to try is looking for a newer BIOS version. > > Other things to check: do the two HDAMA boards have the same BIOS > revision? Is the HCA in the same slot in both boards? > > - R. Another sort of problem one sometimes sees is hardware related spurios interrupt asserts, as the result the IRQ finally gets disabled by the kernel. Once you have the IRQ number, please try to look in /var/log/messages whether this interrupt was disabled by the kernel. These are messages like "no one cared". -- MST - Michael S. Tsirkin From shaharf at voltaire.com Sun Mar 20 02:29:16 2005 From: shaharf at voltaire.com (shaharf) Date: Sun, 20 Mar 2005 12:29:16 +0200 Subject: [openib-general] [PATCH] ping Add IB ping server agent Message-ID: Hi all, I saw that a lot of ping-pong flew around during (my) weekend ;-). I guess that I have few things to explain regarding the ibping. First, the origin of the utility is in pathforward SOW. As I understand it, the ibping (vping in the SOW) should imitate the ICMP ping and should be used for basic/sanity connection checks. 
NodeDesc packets are not enough because you cannot control their output
to include additional information, and you cannot be sure that the
kernel is OK when you get NodeDesc replies, because some IB devices
reply to the basic SMP mads without the kernel's knowledge (for example
many managed switches, and all non managed switches). I certainly do
not want to force the current openib gen2 host mad architecture, where
all mads are exposed to the kernel before the firmware has a chance to
reply to them.

Regarding the issue of whether to have it in a separate source file
and/or module, I am not sure it is really important as long as it is
loaded by default and therefore the diagnostics utilities may rely on
it.

I agree with Sean that additional shared functionality is required. I
felt it was a bit stupid of me to replicate (again) this functionality,
but I didn't want to change too much of the code, especially due to the
fact that these areas will be (and are) affected by the RMPP
functionality.

Sean also has a good point regarding the mad pool. I mimicked other
paths that Hal and others wrote, but now, after several paths use these
methods, it seems that we have to implement some cleaner send mad pool
instrumentation.

Sorry about (not calling) the ib_free_recv_mad() function - I am a bad
boy ;-)

Regarding the content of the ping packet, I encountered many ideas in
the list, most of them good and valid. My opinion is that the kernel
server should reply only to the most basic "ping queries" and any
further ping enhancements (for example, returning gid + lid, etc.)
should be implemented in a separate, probably user mode server. This
will be the most flexible solution and will reduce the kernel
pollution. In fact I started to implement a (usermode only) "ibsystat
service" that will supply such extensions. Currently, what I have in
mind is to provide basic host information: number of CPUs, memory,
utilization, etc., IB information: number of hcas, models, etc.
The idea is to ease some cluster wide operations common in large cluster setups. Any ideas and suggestions are welcome.

A word for Michael: it is true that QP0 should be kept for subnet management only, but even though it is written in the Spec., I wouldn't say that MADs of any sort should be assumed to originate from the SM. Specifically, the return address can be any lid and not just the SMLid - because there are times when the SMLid is not configured, and because many SMs may use direct mads (or any other type of mads) to discover the network and/or to communicate among themselves. In fact many of the utilities already existing in the gen2 diagnostics tree use direct mads for many purposes. Let's say that the Spec phrasing is just a sort of shortsightedness. Personally I extend the "SM" term to be "any subnet management oriented entity"... If you think this is some kind of invalid behavior, you may be formally right, but as long as multiple SMs are allowed to operate (one master, many standbys), you can not disallow that, so - "if you can't win them, join them"... (you can check and see that the diags are not performing anything that is not allowed for a standby SM).

Finally, the existing (kernel) code is really preliminary. It is supplied to get a foothold (meaning to allow the user mode diags to rely on it), and to share its implementation with the community (an overwhelming success indeed ;-).

Shahar

From eitan at mellanox.co.il Sun Mar 20 06:09:19 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 20 Mar 2005 16:09:19 +0200
Subject: [openib-general] IB MADs Level Management Simulator
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EEFC2@mtlex01.yok.mtl.com>

Hi All,

I have uploaded the IB MADs Level Management Simulator sources into gen2/utils/linux-user/ibdm and gen2/utils/linux-user/IBMgtSim. Please see the README files for details.

Eitan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From mst at mellanox.co.il Sun Mar 20 10:12:42 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 20 Mar 2005 20:12:42 +0200 Subject: [openib-general] Re: Re: [Andrew Morton] inappropriate use of in_atomic() In-Reply-To: <20050311154316.E31689@topspin.com> References: <52oedq946k.fsf@topspin.com> <20050311073108.GA20989@mellanox.co.il> <20050311154316.E31689@topspin.com> Message-ID: <20050320181242.GA18963@mellanox.co.il> Quoting r. Libor Michalek : > Subject: Re: Re: [Andrew Morton] inappropriate use of in_atomic() > > On Fri, Mar 11, 2005 at 09:31:08AM +0200, Michael S. Tsirkin wrote: > > > > Sdp also has a couple of uses. > > Maybe we can use the atomic branch in all cases here, as well? > > Libor? > > Yes, the case in sdp_iocb.c can probably always take the atomic > path. The kmap/kunmap cases really only care whether we're in an > interrupt, so switching to in_interrupt() should be sufficient. > > -Libor > Recent comments by Andrew indicate that it is better to always use kmap_atomic/kunmap_atomic if possible. This will also let us get rid of the wrapper function, which is good. Why do you think we need to kmap? -- MST - Michael S. Tsirkin From roland at topspin.com Mon Mar 21 07:12:13 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 21 Mar 2005 07:12:13 -0800 Subject: [openib-general] Causes of interrupt problems? References: <20050319023850.GE9768@kalmia.hozed.org> <52psxwjmbz.fsf@topspin.com> <20050319223133.GF9768@kalmia.hozed.org> Message-ID: <5264zlhw36.fsf@topspin.com> Troy> I have some cards with ancient firmware revisions, which Troy> seem like they don't implement NOP. The bios was actually Troy> fine on this machine, and everything was happy once I put a Troy> card with a newer firmware in. Hmm, interesting. I forgot that the NOP command was added after some firmware was released (and I'm surprised there are cards with such old FW still around). I think I'll add a warning message like HCA firmware version 1.13.0 is old. 
If you have problems, try updating FW. if mthca sees a card with firmware older than the newest version (3.3.2, 4.6.2 and 5.0.1 respectively). Thanks, Roland From rminnich at lanl.gov Mon Mar 21 07:30:55 2005 From: rminnich at lanl.gov (Ronald G. Minnich) Date: Mon, 21 Mar 2005 08:30:55 -0700 (MST) Subject: [openib-general] [PATCH] ping Add IB ping server agent In-Reply-To: <52br9dhw38.fsf@topspin.com> References: <52br9dhw38.fsf@topspin.com> Message-ID: On Mon, 21 Mar 2005, Roland Dreier wrote: > It doesn't seems like a good idea to reinvent cluster management in an > IB-specific way. I would rather see this sort of thing built on top of > an existing tool like Ganglia (http://ganglia.sf.net). While I don't agree taht ganglia is the right thing, I agree completely with this point. Reinventing tools for booting, console, etc. in an IB-specific way is a huge mistake. ron From roland at topspin.com Mon Mar 21 07:57:11 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 21 Mar 2005 07:57:11 -0800 Subject: [openib-general] [PATCH] ping Add IB ping server agent References: <1111152373.4662.6585.camel@localhost.localdomain> Message-ID: <52zmwxgffs.fsf@topspin.com> A few comments based on a quick read through of the code. > Index: ping_priv.h > =================================================================== > --- ping_priv.h (revision 0) > +++ ping_priv.h (revision 0) > +#include > + > +#define SPFX "ib_ping: " > + > +struct ib_ping_send_wr { > + struct list_head send_list; > + struct ib_ah *ah; > + struct ib_mad_private *mad; > + DECLARE_PCI_UNMAP_ADDR(mapping) > +}; > + > +struct ib_ping_port_private { > + struct list_head port_list; > + struct list_head send_posted_list; > + spinlock_t send_list_lock; > + int port_num; > + struct ib_mad_agent *pingd_agent; /* OpenIB Ping class */ > +}; Is it worth having a separate include file that is only included in one place for this small amount of declarations? 
> Index: ping.c > --- ping.c (revision 0) > +++ ping.c (revision 0) > +#include "mad_priv.h" It doesn't seem right for a different module to be including the mad module's private implementation details. > +/* > + * Caller must hold ib_ping_port_list_lock > + */ > +static inline struct ib_ping_port_private * > +__ib_get_ping_port(struct ib_device *device, int port_num, > + struct ib_mad_agent *mad_agent) > +{ > + struct ib_ping_port_private *entry; > + > + BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ Why have this complication of having a single function that can look up by device or mad_agent? For lookup by mad_agent it seems you should just use mad_agent->context directly and not even call a function to do that. > + > + if (device) { > + list_for_each_entry(entry, &ib_ping_port_list, port_list) { > + if (entry->pingd_agent->device == device && > + entry->port_num == port_num) > + return entry; > + } > + } else { > + list_for_each_entry(entry, &ib_ping_port_list, port_list) { > + if (entry->pingd_agent == mad_agent) > + return entry; > + } > + } > + return NULL; > +} > + /* PCI mapping */ > + gather_list.addr = dma_map_single(mad_agent->device->dma_device, Is this comment useful? It's pretty obvious what's going on here, and it's not necessarily PCI mapping (the HCA could be on some other type of bus). > +static void pingd_recv_handler(struct ib_mad_agent *mad_agent, > + struct ib_mad_recv_wc *mad_recv_wc) > +{ > + struct ib_ping_port_private *port_priv; > + struct ib_vendor_mad *vend; > + struct ib_mad_private *recv = container_of(mad_recv_wc, > + struct ib_mad_private, > + header.recv_wc); > + > + /* Find matching MAD agent */ > + port_priv = ib_get_ping_port(NULL, 0, mad_agent); just mad_agent->context as I said above. > + if (!port_priv) { > + kmem_cache_free(ib_mad_cache, recv); should ib_free_recv_mad() -- there's a defined API, we should use it. > + kmem_cache_free(ib_mad_cache, recv); ditto. 
> +static void pingd_send_handler(struct ib_mad_agent *mad_agent, > + struct ib_mad_send_wc *mad_send_wc) > +{ > + struct ib_ping_port_private *port_priv; > + struct ib_ping_send_wr *ping_send_wr; > + unsigned long flags; > + > + /* Find matching MAD agent */ > + port_priv = ib_get_ping_port(NULL, 0, mad_agent); mad_agent->context > + /* Unmap PCI */ > + dma_unmap_single(mad_agent->device->dma_device, Inaccurate and not helpful comment again. > + /* Release allocated memory */ > + kmem_cache_free(ib_mad_cache, ping_send_wr->mad); ib_free_recv_mad() > +int ib_ping_port_open(struct ib_device *device, int port_num) > +{ > + int ret; > + struct ib_ping_port_private *port_priv; > + struct ib_mad_reg_req pingd_reg_req; > + unsigned long flags; > + > + /* First, check if port already open */ > + port_priv = ib_get_ping_port(device, port_num, NULL); I think you can trust the core not to call you twice for the same device, no need to check this. > + /* Obtain server MAD agent for OpenIB Ping class (GSI QP) */ > + port_priv->pingd_agent = ib_register_mad_agent(device, port_num, > + IB_QPT_GSI, > + &pingd_reg_req, 0, > + &pingd_send_handler, > + &pingd_recv_handler, > + NULL); use port_priv instead of NULL for the context param. > +static void ib_ping_init_device(struct ib_device *device) > +{ > + int ret, num_ports, cur_port, i, ret2; Why do you need ret and ret2? > + > + if (device->node_type == IB_NODE_SWITCH) { > + num_ports = 1; > + cur_port = 0; > + } else { > + num_ports = device->phys_port_cnt; > + cur_port = 1; > + } > + > + for (i = 0; i < num_ports; i++, cur_port++) { > + ret = ib_ping_port_open(device, cur_port); > + if (ret) { > + printk(KERN_ERR SPFX "Couldn't open %s port %d\n", > + device->name, cur_port); > + goto error_device_open; > + } why not just if (ib_ping_port_open(device, cur_port) { ? You never use the value of ret for anything. > + } > + goto error_device_query; Just return here -- this is the success case so don't obfuscate it. 
> + > +error_device_open: > + while (i > 0) { > + cur_port--; > + ret2 = ib_ping_port_close(device, cur_port); > + if (ret2) { why not just if (ib_ping_port_close(device, cur_port) printk(... > + printk(KERN_ERR PFX "Couldn't close %s port %d " > + "for ping agent\n", > + device->name, cur_port); > + } > + i--; > + } > + > +error_device_query: don't need this label at all. > + return; > +} > + > +static void ib_ping_remove_device(struct ib_device *device) > +{ > + int ret = 0, i, num_ports, cur_port, ret2; Why do you need ret or ret2? > + > + if (device->node_type == IB_NODE_SWITCH) { > + num_ports = 1; > + cur_port = 0; > + } else { > + num_ports = device->phys_port_cnt; > + cur_port = 1; > + } > + for (i = 0; i < num_ports; i++, cur_port++) { > + ret2 = ib_ping_port_close(device, cur_port); > + if (ret2) { > + printk(KERN_ERR SPFX "Couldn't close %s port %d " > + "for ping agent\n", > + device->name, cur_port); > + if (!ret) > + ret = ret2; > + } How about just if (ib_ping_port_close(device, cur_port) printk(... You don't care about the return value in ret2 and you don't do anything with the value you put in ret as far as I can see. > + } > +} > + > +static struct ib_client ping_client = { > + .name = "ping", > + .add = ib_ping_init_device, > + .remove = ib_ping_remove_device > +}; > + > +static int __init ib_ping_init_module(void) > +{ > + spin_lock_init(&ib_ping_port_list_lock); > + INIT_LIST_HEAD(&ib_ping_port_list); INIT_LIST_HEAD() isn't needed if you declare your list with LIST_HEAD(). > Index: mad.c > =================================================================== > --- mad.c (revision 2023) > +++ mad.c (working copy) > @@ -45,6 +45,8 @@ > > > kmem_cache_t *ib_mad_cache; > +EXPORT_SYMBOL(ib_mad_cache); I don't think we should be exporting internals like this. - R. 
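The structural points in the review above (test the return value directly, return early on success, undo only the ports that were actually opened) can be tried outside the kernel. What follows is a minimal user-space sketch under stated assumptions, not the ping agent code: struct device, port_open(), port_close() and port_range() are hypothetical stand-ins for ib_device, ib_ping_port_open() and ib_ping_port_close(), and fail_port exists only to provoke the error path.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the kernel objects used in the patch. */
#define NODE_SWITCH 1
#define NODE_CA     2
#define MAX_PORTS   8

struct device {
	const char *name;
	int node_type;
	int phys_port_cnt;     /* must stay below MAX_PORTS in this sketch */
	int open[MAX_PORTS];   /* open[p] != 0 while port p is open */
	int fail_port;         /* port whose open should fail, or -1 */
};

static int port_open(struct device *dev, int port)
{
	if (port == dev->fail_port)
		return -1;
	dev->open[port] = 1;
	return 0;
}

static int port_close(struct device *dev, int port)
{
	if (!dev->open[port])
		return -1;
	dev->open[port] = 0;
	return 0;
}

/* Switches use only port 0; other nodes use ports 1..phys_port_cnt. */
static void port_range(const struct device *dev, int *first, int *count)
{
	if (dev->node_type == NODE_SWITCH) {
		*first = 0;
		*count = 1;
	} else {
		*first = 1;
		*count = dev->phys_port_cnt;
	}
}

static void init_device(struct device *dev)
{
	int first, count, i;

	port_range(dev, &first, &count);

	for (i = 0; i < count; i++) {
		/* No ret variable: the value is only tested, never reused. */
		if (port_open(dev, first + i)) {
			fprintf(stderr, "Couldn't open %s port %d\n",
				dev->name, first + i);
			goto err_open;
		}
	}
	return;	/* success case: plain return, no extra label */

err_open:
	/* Close only the ports that were opened before the failure. */
	while (i-- > 0) {
		if (port_close(dev, first + i))
			fprintf(stderr, "Couldn't close %s port %d\n",
				dev->name, first + i);
	}
}

static void remove_device(struct device *dev)
{
	int first, count, i;

	port_range(dev, &first, &count);

	for (i = 0; i < count; i++) {
		if (port_close(dev, first + i))
			fprintf(stderr, "Couldn't close %s port %d\n",
				dev->name, first + i);
	}
}
```

With one error label and a direct return on success, neither function needs ret or ret2, which is the simplification the review asks for.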
From shaharf at voltaire.com Mon Mar 21 08:30:38 2005
From: shaharf at voltaire.com (shaharf)
Date: Mon, 21 Mar 2005 18:30:38 +0200
Subject: [openib-general] [PATCH] ping Add IB ping server agent
Message-ID:

>
> > It doesn't seems like a good idea to reinvent cluster management in an
> > IB-specific way. I would rather see this sort of thing built on top of
> > an existing tool like Ganglia (http://ganglia.sf.net).
>
> While I don't agree taht ganglia is the right thing, I agree completely
> with this point. Reinventing tools for booting, console, etc. in an
> IB-specific way is a huge mistake.
>
> ron

Please note that ibping is not really connected to the question of whether providing remote sys stat is a good idea or not. Ibping is there to validate basic hardware and software functionality. Ibsysstat is an optional usermode service that you may run or not. If you don't like it and prefer running ganglia or anything else, please do. MHO is that this utility may be very helpful for diagnostics and monitoring. I am against ideas where fundamental IB management mechanisms would rely on such a mechanism, and I guess that we all agree on that.

Shahar

From mkowalski01 at gmail.com Mon Mar 21 08:50:33 2005
From: mkowalski01 at gmail.com (mark kowalski)
Date: Mon, 21 Mar 2005 10:50:33 -0600
Subject: [openib-general] (no subject)
Message-ID:

Thanks Caitlin,

This answer helps explain what I'm seeing.

>From a strict DAT specification point of view:
>After a dat_ep_reset the DAT Provider MAY rotate the underlying
>transport resources (QP) that the EP is associated with in order
>to avoid time-wait states. But I haven't seen any implementations
>actually do that. Note that I stated "rotate", not "release",
>it is not acceptable to suddenly find that you no longer have
>the required resources when the Consumer attempts to establish
>a new connection with the same EP.
>Generally a QP is associated with an EP from beginning to end.
>The only way to be certain that the resource is freed is to >free it, and you really should do a reset first. In no cases >can disconnecting alone be expected to release transport >layer resources, it merely changes their state From mkowalski01 at gmail.com Mon Mar 21 09:23:45 2005 From: mkowalski01 at gmail.com (mark kowalski) Date: Mon, 21 Mar 2005 11:23:45 -0600 Subject: [openib-general] Re: queue pair destruction on dat_ep_disconnect In-Reply-To: References: Message-ID: I guess "queue pair destruction" was a poor choice of words. What I meant was the freeing/cleaning up of the dapl structures that connect to the queue pair, not the actual queue pair itself. Sorry about that. Anyway, about the single EP on a machine. All I can do is report what I'm seeing. It doesn't make any sense to me either, but that is what's happening. Also, It is not the dat_ep_create that fails, it is the subsequent incoming connect request that would fail. Sorry if I was unclear about that. Thanks, Mark >May I point out that taking your analysis to its logical conclusion, you could only ever have >a single EP on any machine if each EP has a unique QP. This is obviously false, I think >you're looking in the wrong place. >dat_ep_disconnect() is not supposed to destroy a QP, just transition the state to a not->connected state (IB state ERROR). An EP, and by extension a QP, can have several >different attributes, it wouldn't be efficient or intuitive if you destroyed the underlying QP >just because you are disconnecting. The QP remains attached to the EP until you >explicitly free it in dat_ep_free(); this is intentional and by design. >If you look at the state diagram in the DAT spec, you will notice that you should >dat_ep_reset() the EP before you try to use it again. This will transition the underlying >QP from the ERROR state to INIT. But I don't think you're trying to reuse the EP, so I don't >know why it's a problem. 
>Getting back to your real problem, I'm not sure why you can't create a new EP on a >different IA, they should be completely separate. If dat_ep_create() fails, something is >hosed. I don't know about the destroy_cbk field as it isn't in the reference >implementation, so I can't help you there. >-Steve From ardavis at ichips.intel.com Mon Mar 21 16:41:28 2005 From: ardavis at ichips.intel.com (ardavis) Date: Mon, 21 Mar 2005 16:41:28 -0800 Subject: [openib-general] uverbs ibv_reg_mr return same lkey value Message-ID: <423F69B8.8060703@ichips.intel.com> Roland, I noticed ibv_reg_mr always returns the same lkey of 0 (per device) which presents a problem with udapl since the lkey is used as the context in the lmr hash table to avoid re-use. Is there a reason that you choose not to make this unique? -arlin From roland at topspin.com Mon Mar 21 19:00:23 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 21 Mar 2005 19:00:23 -0800 Subject: [openib-general] uverbs ibv_reg_mr return same lkey value In-Reply-To: <423F69B8.8060703@ichips.intel.com> (ardavis@ichips.intel.com's message of "Mon, 21 Mar 2005 16:41:28 -0800") References: <423F69B8.8060703@ichips.intel.com> Message-ID: <52k6o0fkqg.fsf@topspin.com> ardavis> I noticed ibv_reg_mr always returns the same lkey of 0 ardavis> (per device) which presents a problem with udapl since ardavis> the lkey is used as the context in the lmr hash table to ardavis> avoid re-use. Is there a reason that you choose not to ardavis> make this unique? I'm not sure I follow the question. On my system ibv_reg_mr() returns a different, non-zero L_Key for each region that is registered. The L_Key is guaranteed to be unique among all the memory regions currently registered with a given device. I don't see how it's possible to get L_Key 0 for any memory region (since that L_Key is reserved by Mellanox HCA firmware), and I'm actually not sure what you mean by "0 (per device)." 
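A minimal sketch of what per-device uniqueness implies for a lookup table, assuming a consumer that tracks registrations from more than one HCA: key the table on the (device, L_Key) pair rather than on the L_Key alone. The names lmr_table, lmr_entry, dev_id, lmr_insert() and lmr_lookup() are hypothetical and are not taken from uDAPL or libibverbs; dev_id stands in for whatever identifies the HCA.

```c
#include <stddef.h>
#include <stdint.h>

#define LMR_BUCKETS 64

/* Hypothetical table entry keyed on (dev_id, lkey). */
struct lmr_entry {
	uint32_t dev_id;          /* assumed per-HCA identifier */
	uint32_t lkey;            /* L_Key, unique only within one device */
	void *lmr;                /* caller's per-region context */
	struct lmr_entry *next;
};

struct lmr_table {
	struct lmr_entry *bucket[LMR_BUCKETS];
};

/* Mix the device id into the hash so equal L_Keys on two HCAs spread out. */
static size_t lmr_hash(uint32_t dev_id, uint32_t lkey)
{
	return (size_t)((lkey ^ (dev_id * 2654435761u)) % LMR_BUCKETS);
}

/* Returns 0 on success, -1 if (dev_id, lkey) is already in the table. */
static int lmr_insert(struct lmr_table *t, struct lmr_entry *e)
{
	size_t b = lmr_hash(e->dev_id, e->lkey);
	struct lmr_entry *cur;

	for (cur = t->bucket[b]; cur; cur = cur->next)
		if (cur->dev_id == e->dev_id && cur->lkey == e->lkey)
			return -1;

	e->next = t->bucket[b];
	t->bucket[b] = e;
	return 0;
}

/* Returns the stored context, or NULL if the pair is not registered. */
static void *lmr_lookup(const struct lmr_table *t, uint32_t dev_id,
			uint32_t lkey)
{
	const struct lmr_entry *cur;

	for (cur = t->bucket[lmr_hash(dev_id, lkey)]; cur; cur = cur->next)
		if (cur->dev_id == dev_id && cur->lkey == lkey)
			return cur->lmr;

	return NULL;
}
```

Two regions with the same L_Key on different devices can then coexist in the table, while a repeated key on one device is still rejected.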
If you are seeing mr->lkey == 0 after a call to ibv_reg_mr(), then something is going wrong either in your code or in libibverbs or libmthca. Can you post your code where you see that happen? If uDAPL is assuming that L_Keys are globally unique even with multiple HCAs, then uDAPL needs to be fixed, since a completely compliant verbs implementation may not satisfy that condition. - R. From mst at mellanox.co.il Tue Mar 22 08:33:44 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 22 Mar 2005 18:33:44 +0200 Subject: [openib-general] Re: [PATCH] fmr support in mthca In-Reply-To: <52k6o4lp0a.fsf@topspin.com> References: <20050317201646.GA15221@mellanox.co.il> <521xadob8c.fsf@topspin.com> <20050318084455.GA23781@mellanox.co.il> <52k6o4lp0a.fsf@topspin.com> Message-ID: <20050322163344.GX12627@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] fmr support in mthca > > Michael> Good, glad to help. I will try to address your comments > Michael> next week (its already weekend here). > > No problem. Libor won't be back until Monday so I won't even try SDP > until then. > > Roland> What if we just reserve something like 64K MPTs and MTTs > Roland> for FMRs and ioremap everything at driver startup? That > Roland> would only use a few MB of vmalloc space and probably > Roland> simplify the code too. > > Michael> I dont like these pre-allocations - if someone is only > Michael> using SDP and IP over IB, it seems he wont need almost > Michael> any regular regions. 64K MTTs with 4K page size cover up > Michael> to 200MByte of memory. > > We can bump up the numbers if you want. Right now the default > allocation is 1 << 20 MTT segments (8 << 20 MTT entries). I see no > problem with having 64K MPTs and 256 MTT segments reserved for FMRs by > default. That should be more than enough for a single HCA -- 256K MTT > segments means that 2 million pages or 8 GB of IO could be in flight > at a time, which doesn't seem like a harsh limit to me. 
> > Ultimately we can make the allocations tunable at device init time, > along with the rest of the parameters (number of QPs, number of CQs, > etc). I haven't seen much pressure to do that so far but it is > definitely in my plans. > > Michael> My other problem with this approach was implementational: > Michael> existing allocator and table code can be passed reserved > Michael> parameter, but dont have the ability to allocate out of > Michael> that pool. So we'd have to allocate out of a separate > Michael> allocator, and take care so that keys do not > Michael> conflict. This gets a bit complicated. > > I think this is the way to go. Keys are easy to deal with -- in > mthca_init_mr_table, we could just pass dev->limits.num_fmrs instead > of dev->limits.reserved_mrws when initializing dev->mr_table.mpt_alloc, > and then create a new table of size dev->limits.num_fmrs and reserve > dev->limits.reserved_mrws out of that table. > > The buddy allocator is a little more work but it needs to be cleaned > up and encapsulated better anyway. Once that's done we'd just have > two buddy allocators. The first one would cover all the MTT segments, > and we'd first take out a chunk of that one to cover the reserved MTTs > and then allocate another chunk that can hold whatever number of MTT > segments we decide to use for FMRs. > > Michael> Maybe do something separate for 32 bit kernels (like - > Michael> disable FMR support)? > > No FMRs on 32-bit kernels isn't going to fly. It doesn't seem that > hard to make things work on i386 so why not do it? > > Michael> Yes but for mtts the addresses may not be physically > Michael> contigious, unless we want to limit FMRs to PAGE_SIZE/8 > Michael> MTTs, which means 512 MTTs, that is 2MByte with 4K FMR > Michael> page size. And is it seems possible that even with this > Michael> limitation MTTs for a specific FMR start at non page > Michael> aligned boundary. > > I think it's fine to limit an FMR to 512 MTT entries. 
I'd have to > look at the source to be sure of the exact numbers, but I know that > for the Topspin stack, neither SDP nor SRP is using more than 32 > entries per FMR. A limit of mapping 512 pages/2 MB per FMR seems > fine. I don't know of anyone using FMRs even close to that big. > > Even if it turns out to be to small, I see no problem with adding a > small array of something on the order of 2 or 4 MTT pages. > > If we use the buddy allocator for MTT entries for FMRs, then alignment > is OK. The buddy allocator guarantees that objects will be aligned to > their size, which means that the MTT segments will never cross a page > boundary. > > - R. > OK. I thought about it and I buy this design. I'll prepare a patch along these lines. MST -- MST - Michael S. Tsirkin From roland at topspin.com Tue Mar 22 08:40:14 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 22 Mar 2005 08:40:14 -0800 Subject: [openib-general] uverbs ibv_reg_mr return same lkey value In-Reply-To: <52k6o0fkqg.fsf@topspin.com> (Roland Dreier's message of "Mon, 21 Mar 2005 19:00:23 -0800") References: <423F69B8.8060703@ichips.intel.com> <52k6o0fkqg.fsf@topspin.com> Message-ID: <524qf3fxch.fsf@topspin.com> Roland> I don't see how it's possible to get L_Key 0 for any Roland> memory region (since that L_Key is reserved by Mellanox Roland> HCA firmware) By the way, this is just a coincidence with the current HW and driver. With a different HCA, it may be possible to get L_Key 0 for a memory region, although of course each memory region will still have a different L_Key. - R. From vnguyen_777 at yahoo.com Tue Mar 22 09:01:47 2005 From: vnguyen_777 at yahoo.com (vinh nguyen) Date: Tue, 22 Mar 2005 09:01:47 -0800 (PST) Subject: [openib-general] vapi throughput Message-ID: <20050322170147.66061.qmail@web51002.mail.yahoo.com> Hi, Does anyone have throughput results for Mellanox IB Gold 1.6.1? I have Mellanox Cougar cards on Intel server boards dual Xeon 2.8 GHz. 
Here are my results for a point-to-point connection:

IPoIB: 200MB/sec
SDP: 580MB/sec
VAPI: 630MB/sec

The results for perf_main seem a little low. With the old Mellanox package I used to see perf_main produce around 800MB/sec. Is anyone else experiencing this?

Vinh Nguyen

From halr at voltaire.com Tue Mar 22 09:19:23 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 22 Mar 2005 12:19:23 -0500
Subject: [openib-general] CM oops on sending DREQ and DREP when not in proper state
Message-ID: <1111511963.4659.18.camel@localhost.localdomain>

Hi Sean,

In cm.c, in both ib_send_cm_dreq() and ib_send_cm_drep(), there are checks for the connection being in the proper state. When this check fails, the code tries to free the allocated message, but it does so through cm_id_priv->msg, which is never set in this error case; the local msg is the correct pointer to free.

int ib_send_cm_dreq
{
	...
	if (cm_id->state != IB_CM_ESTABLISHED) {
		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
		ret = -EINVAL;
		goto out;
	...
	if (!ret) {
		cm_id->state = IB_CM_DREQ_SENT;
		cm_id_priv->msg = msg;
	} else
		cm_enter_timewait(cm_id_priv);
	spin_unlock_irqrestore(&cm_id_priv->lock, flags);

out:
	if (!msg_ret && ret)
		cm_free_msg(cm_id_priv->msg);
	return ret;

Similarly for ib_send_cm_drep().

Thanks.

-- Hal

From halr at voltaire.com Tue Mar 22 11:13:14 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 22 Mar 2005 14:13:14 -0500
Subject: [openib-general] CM oops on sending DREQ and DREP when not inproper state
In-Reply-To:
References:
Message-ID: <1111518794.4659.0.camel@localhost.localdomain>

On Tue, 2005-03-22 at 13:51, Sean Hefty wrote:
> >In cm.c, in both ib_send_cm_dreq() and ib_send_cm_drep(), there are
> >checks for the connection being in the proper state.
When this check > >fails, the allocated message is attempted to be freed but it is done > >from cm_id_priv->msg despite that never being stored in this error case > >and just msg being correct. > > Thanks for finding this. If no one does/needs it sooner, I'll update the > code next week when I return from vacation. It can wait. I've worked around it. -- Hal From sean.hefty at intel.com Tue Mar 22 10:51:08 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 22 Mar 2005 10:51:08 -0800 Subject: [openib-general] CM oops on sending DREQ and DREP when not inproper state In-Reply-To: <1111511963.4659.18.camel@localhost.localdomain> Message-ID: >In cm.c, in both ib_send_cm_dreq() and ib_send_cm_drep(), there are >checks for the connection being in the proper state. When this check >fails, the allocated message is attempted to be freed but it is done >from cm_id_priv->msg despite that never being stored in this error case >and just msg being correct. Thanks for finding this. If no one does/needs it sooner, I'll update the code next week when I return from vacation. - Sean From ardavis at ichips.intel.com Tue Mar 22 13:55:20 2005 From: ardavis at ichips.intel.com (ardavis) Date: Tue, 22 Mar 2005 13:55:20 -0800 Subject: [openib-general] uverbs ibv_reg_mr return same lkey value In-Reply-To: <52k6o0fkqg.fsf@topspin.com> References: <423F69B8.8060703@ichips.intel.com> <52k6o0fkqg.fsf@topspin.com> Message-ID: <42409448.4030802@ichips.intel.com> Roland Dreier wrote: > ardavis> I noticed ibv_reg_mr always returns the same lkey of 0 > ardavis> (per device) which presents a problem with udapl since > ardavis> the lkey is used as the context in the lmr hash table to > ardavis> avoid re-use. Is there a reason that you choose not to > ardavis> make this unique? > >I'm not sure I follow the question. On my system ibv_reg_mr() returns >a different, non-zero L_Key for each region that is registered. 
The >L_Key is guaranteed to be unique among all the memory regions >currently registered with a given device. I don't see how it's >possible to get L_Key 0 for any memory region (since that L_Key is >reserved by Mellanox HCA firmware), and I'm actually not sure what you >mean by "0 (per device)." > >If you are seeing mr->lkey == 0 after a call to ibv_reg_mr(), then >something is going wrong either in your code or in libibverbs or >libmthca. Can you post your code where you see that happen? > >If uDAPL is assuming that L_Keys are globally unique even with >multiple HCAs, then uDAPL needs to be fixed, since a completely >compliant verbs implementation may not satisfy that condition. > > - R. > > > the only assumption is a unique lkey across the device. In my case, I am seeing a lkey of 0 returned for every memory region created on this device, however the rkey is changing. mr_register: lkey=0x0, rkey=0x5 ln=72 priv=1 pingpong looks ok... "ibv_reg_mr: mr 0x506ea0 mr->lkey a002c mr->rkey a002c" local address: LID 0x0012, QPN 0x010406, PSN 0xdd3db0 so nevermind...my bad! .I will dig a little deeper. -arlin From hozer at hozed.org Tue Mar 22 18:57:53 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Tue, 22 Mar 2005 20:57:53 -0600 Subject: [openib-general] NFS over IPoIB performance.. Message-ID: <20050323025753.GN9768@kalmia.hozed.org> Well, I'm quite impressed.. I've been running NFS over IPoIB to a server with a 3ware SATA raid card in it, and nothing's crashed ;) This is with a 2.4.11.4 kernel and roland-uverbs branch. 
(although I'm not using uverbs at the moment) da0:64bit:~$ ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-00-14-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:10.0.0.10 Bcast:10.255.255.255 Mask:255.255.0.0 inet6 addr: fe80::206:6a00:a000:43c/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:209004632 errors:0 dropped:0 overruns:0 frame:0 TX packets:959650153 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:309422700203 (288.1 GiB) TX bytes:1863296583515 (1.6 TiB) My only real problem is there seems to be no good way to set readahead for the NFS client. I'm averaging 75mb/sec or so, and I've seen peaks of 100MB/sec, with clients from two different machines running the GAMESS computational chemistry code. From mst at mellanox.co.il Wed Mar 23 06:25:03 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 23 Mar 2005 16:25:03 +0200 Subject: [openib-general] [PATCH] mthca: fill in missing fields in send completion Message-ID: <20050323142503.GA13701@mellanox.co.il> Fill in missing fields in the send completion. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Itamar Rabenstein Index: infiniband/hw/mthca/mthca_dev.h =================================================================== --- infiniband/hw/mthca/mthca_dev.h (revision 2035) +++ infiniband/hw/mthca/mthca_dev.h (working copy) @@ -88,6 +88,19 @@ enum { MTHCA_NUM_EQ }; +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + struct mthca_cmd { int use_events; struct semaphore hcr_sem; Index: infiniband/hw/mthca/mthca_cq.c =================================================================== --- infiniband/hw/mthca/mthca_cq.c (revision 2035) +++ infiniband/hw/mthca/mthca_cq.c (working copy) @@ -473,7 +473,46 @@ static inline int mthca_poll_one(struct } if (is_send) { - entry->opcode = IB_WC_SEND; /* XXX */ + switch (cqe->opcode) { + case MTHCA_OPCODE_RDMA_WRITE: + entry->opcode = IB_WC_RDMA_WRITE; + entry->wc_flags = 0; + break; + case MTHCA_OPCODE_RDMA_WRITE_IMM: + entry->opcode = IB_WC_RDMA_WRITE; + entry->wc_flags = IB_WC_WITH_IMM; + break; + case MTHCA_OPCODE_SEND: + entry->opcode = IB_WC_SEND; + entry->wc_flags = 0; + break; + case MTHCA_OPCODE_SEND_IMM: + entry->opcode = IB_WC_SEND; + entry->wc_flags = IB_WC_WITH_IMM; + break; + case MTHCA_OPCODE_RDMA_READ: + entry->opcode = IB_WC_RDMA_READ; + entry->wc_flags = 0; + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + break; + case MTHCA_OPCODE_ATOMIC_CS: + entry->opcode = IB_WC_COMP_SWAP; + entry->wc_flags = 0; + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + break; + case MTHCA_OPCODE_ATOMIC_FA: + entry->opcode = IB_WC_FETCH_ADD; + entry->wc_flags = 0; + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + break; + case MTHCA_OPCODE_BIND_MW: + entry->opcode = IB_WC_BIND_MW; + entry->wc_flags = 0; + break; + default: + 
entry->opcode = MTHCA_OPCODE_INVALID; + break; + } } else { entry->byte_len = be32_to_cpu(cqe->byte_cnt); switch (cqe->opcode & 0x1f) { Index: infiniband/hw/mthca/mthca_qp.c =================================================================== --- infiniband/hw/mthca/mthca_qp.c (revision 2035) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -171,19 +171,6 @@ enum { }; enum { - MTHCA_OPCODE_NOP = 0x00, - MTHCA_OPCODE_RDMA_WRITE = 0x08, - MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, - MTHCA_OPCODE_SEND = 0x0a, - MTHCA_OPCODE_SEND_IMM = 0x0b, - MTHCA_OPCODE_RDMA_READ = 0x10, - MTHCA_OPCODE_ATOMIC_CS = 0x11, - MTHCA_OPCODE_ATOMIC_FA = 0x12, - MTHCA_OPCODE_BIND_MW = 0x18, - MTHCA_OPCODE_INVALID = 0xff -}; - -enum { MTHCA_NEXT_DBD = 1 << 7, MTHCA_NEXT_FENCE = 1 << 6, MTHCA_NEXT_CQ_UPDATE = 1 << 3, -- MST - Michael S. Tsirkin From mkowalski01 at gmail.com Wed Mar 23 08:36:31 2005 From: mkowalski01 at gmail.com (mark kowalski) Date: Wed, 23 Mar 2005 10:36:31 -0600 Subject: [openib-general] 75 second timeout for endpoint state to go from Disconnect_Pending to Disconnected Message-ID: Hello, I've been doing some work with udapl trying to recover from port failures and have run into a problem. I have a simple test program that contains a server and a client, running on two different machines, sending data back and forth. When I have a physical connection problem on the client side (caused by pulling the ib cable from the inuse port on the hca) the server will see this and eventually issue a dat_ep_disconnect (gracefully) and then go and wait for the client to reconnect to it. The problem is that it is taking about 75 seconds for the end point on the server to go from DISCONNECT_PENDING state to DISCONNECTED. The TS_UDAPL_CM_RESPONSE_TIMEOUT field specifies a timeout of 4.x seconds and it looks like it is being setup correctly. 
The TS_UDAPL_MAX_CM_RETRIES is set to 15, so we thought that for some reason the disconnect request is being retried the max number of times before it completes, and that is why I'm seeing a 75 second wait. We have tried modifying the TS_UDAPL_MAX_CM_RETRIES in dapl_openib_cm.h from 15 to 2 to see if this would cause it to disconnect faster, but using a CATC tool to examine the packets as they went across the wire we found that 15 was still being passed as the max retry count. A side issue to this problem is how you can change the retry and timeout values and have them accepted. Changing the disconnect to ABRUPT doesn't matter because even though the endpoint status will be immediately displayed as DISCONNECTED, when the server tries to accept the reconnection request from the client the cr_accept fails. As long as the server waits until the status of the endpoint changes from DISCONNECT_PENDING to DISCONNECTED before processing the client connect request, the connection can be reestablished and data transmissions restarted. Does anyone know why it is taking so long for the server end point to disconnect or why the retry count change did not seem to be accepted? Thanks in advance for any help, Mark Kowalski From iod00d at hp.com Wed Mar 23 12:08:37 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 23 Mar 2005 12:08:37 -0800 Subject: [openib-general] ipoib/mthca broken on ia64? Message-ID: <20050323200837.GE31468@esmail.cup.hp.com> Hi, I wanted to run netpipe and the basics aren't working. I haven't tried the SVN tree in over a month. It could have been broken for ia64 for a while. Sorry for lagging on that... I'm running 2.6.11 kernel with TOB svn bits and building the IB modules "in tree". Just replaced the drivers/infiniband with the one from SVN. Both systems are connected to a TS-90 switch (12 port?) which implements its own SM. Does this need a firmware upgrade maybe?
iowa:/usr/src/linux-2.6# cat /sys/class/infiniband/mthca0/ports/*/state 1: DOWN 2: INIT iowa:/usr/src/linux-2.6# cat /sys/class/infiniband/mthca0/hw_rev a1 iowa:/usr/src/linux-2.6# cat /sys/class/infiniband/mthca0/fw_ver 3.3.2 ionize:~# cat /sys/class/infiniband/mthca0/ports/*/state 1: DOWN 2: INIT ionize:~# cat /sys/class/infiniband/mthca0/hw_rev a1 ionize:~# cat /sys/class/infiniband/mthca0/fw_ver 3.3.2 Modprobe output from ionize: ionize:~# modprobe ib_mthca ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:61:00.0) GSI 49 (level, low) -> CPU 0 (0x0000) vector 71 ACPI: PCI interrupt 0000:61:00.0[A] -> GSI 49 (level, low) -> IRQ 71 ionize:~# [ Probably want to roll the driver version and date again.] Any clues where I should be looking next? thanks, grant From roland at topspin.com Wed Mar 23 12:16:08 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 23 Mar 2005 12:16:08 -0800 Subject: [openib-general] ipoib/mthca broken on ia64? In-Reply-To: <20050323200837.GE31468@esmail.cup.hp.com> (Grant Grundler's message of "Wed, 23 Mar 2005 12:08:37 -0800") References: <20050323200837.GE31468@esmail.cup.hp.com> Message-ID: <521xa6ce47.fsf@topspin.com> > iowa:/usr/src/linux-2.6# cat /sys/class/infiniband/mthca0/ports/*/state > 1: DOWN > 2: INIT It looks like the driver is working but the SM isn't bringing the ports to the active state. The problem could still be on the host or the switch unfortunately. What do you see in the files /sys/class/infiniband/mthca0/ports/2/counters/port_rcv_packets /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_packets > Both systems are connected to a TS-90 switch (12 port?) which > implements it own SM. Does this need a firmware upgrade maybe? It's entirely possible. Have you used the same switch successfully in the past? You can check that the switch's SM is still working by connecting a cable directly from one switch port to another. 
You should see both LEDs come on for both ports (possibly after a couple second delay) if the SM is still running. I believe the switch ports will get to INIT even with the embedded CPU completely hosed, so it's possible your switch has crashed or failed completely. - R. From halr at voltaire.com Wed Mar 23 12:16:43 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Mar 2005 15:16:43 -0500 Subject: [openib-general] ipoib/mthca broken on ia64? In-Reply-To: <20050323200837.GE31468@esmail.cup.hp.com> References: <20050323200837.GE31468@esmail.cup.hp.com> Message-ID: <1111609003.4659.69.camel@localhost.localdomain> Hi Grant, On Wed, 2005-03-23 at 15:08, Grant Grundler wrote: > Hi, > I wanted to run netpipe and basics aren't working. I haven't > tried the SVN tree in over a month. It could have been broken > for ia64 for a while. Sorry for lagging on that... > > I'm running 2.6.11 kernel with TOB svn bits and building > the IB modules "in tree". Just replaced the drivers/infiniband > with the one from SVN. > > Both systems are connected to a TS-90 switch (12 port?) which > implements it own SM. Does this need a firmware upgrade maybe? > > > iowa:/usr/src/linux-2.6# cat /sys/class/infiniband/mthca0/ports/*/state > 1: DOWN > 2: INIT Looks like port 2 is plugged in. It needs to get to ACTIVE before IPoIB will work. Is the SM enabled in the TS switch ? 
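The sysfs state files quoted in this thread hold one line per port of the form `<number>: <NAME>` (for example `1: DOWN`, `2: INIT`, `4: ACTIVE`). As a minimal sketch, assuming that format holds, the "has the port reached ACTIVE yet?" check could be expressed in C roughly as follows. `parse_port_state` and `port_is_active` are hypothetical helper names made up for illustration; they are not part of the mthca driver or any OpenIB library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an OpenIB API): parse one sysfs port state
 * line such as "4: ACTIVE". Returns the numeric state, or -1 if the
 * line does not match the assumed "<number>: <NAME>" format. On
 * success the state name is copied into name (NUL-terminated). */
int parse_port_state(const char *line, char *name, size_t name_len)
{
    int num;
    char buf[32];

    assert(line != NULL);
    if (sscanf(line, "%d: %31s", &num, buf) != 2)
        return -1;
    if (name != NULL && name_len > 0) {
        strncpy(name, buf, name_len - 1);
        name[name_len - 1] = '\0';
    }
    return num;
}

/* Hypothetical helper: nonzero only when the line reports ACTIVE. */
int port_is_active(const char *line)
{
    char name[32];

    return parse_port_state(line, name, sizeof(name)) >= 0 &&
           strcmp(name, "ACTIVE") == 0;
}
```

Feeding it each line read from `/sys/class/infiniband/mthca0/ports/<N>/state` would flag ports that are still in INIT, the condition under which IPoIB is not expected to work.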
-- Hal > iowa:/usr/src/linux-2.6# cat /sys/class/infiniband/mthca0/hw_rev > a1 > iowa:/usr/src/linux-2.6# cat /sys/class/infiniband/mthca0/fw_ver > 3.3.2 > > > ionize:~# cat /sys/class/infiniband/mthca0/ports/*/state > 1: DOWN > 2: INIT > ionize:~# cat /sys/class/infiniband/mthca0/hw_rev > a1 > ionize:~# cat /sys/class/infiniband/mthca0/fw_ver > 3.3.2 > > > Modprobe output from ionize: > ionize:~# modprobe ib_mthca > ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) > ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:61:00.0) > GSI 49 (level, low) -> CPU 0 (0x0000) vector 71 > ACPI: PCI interrupt 0000:61:00.0[A] -> GSI 49 (level, low) -> IRQ 71 > ionize:~# > > [ Probably want to roll the driver version and date again.] > > Any clues where I should be looking next? > > thanks, > grant From iod00d at hp.com Wed Mar 23 12:27:45 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 23 Mar 2005 12:27:45 -0800 Subject: [openib-general] ipoib/mthca broken on ia64? In-Reply-To: <1111609003.4659.69.camel@localhost.localdomain> References: <20050323200837.GE31468@esmail.cup.hp.com> <1111609003.4659.69.camel@localhost.localdomain> Message-ID: <20050323202745.GF31468@esmail.cup.hp.com> On Wed, Mar 23, 2005 at 03:16:43PM -0500, Hal Rosenstock wrote: > Hi Grant, > > iowa:/usr/src/linux-2.6# cat /sys/class/infiniband/mthca0/ports/*/state > > 1: DOWN > > 2: INIT > > Looks like port 2 is plugged in. It needs to get to ACTIVE before IPoIB > will work. Is the SM enabled in the TS switch ? I haven't touched the switch (besides plugging cables into it) in ~6 months. It worked fine last time I tried in Jan/Feb.
Since the host openib SW stack changed, either it's a bug in the openib stack OR it exposed a bug/deficiency in the switch SM. thanks, grant From iod00d at hp.com Wed Mar 23 12:31:38 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 23 Mar 2005 12:31:38 -0800 Subject: [openib-general] ipoib/mthca broken on ia64? In-Reply-To: <521xa6ce47.fsf@topspin.com> References: <20050323200837.GE31468@esmail.cup.hp.com> <521xa6ce47.fsf@topspin.com> Message-ID: <20050323203138.GG31468@esmail.cup.hp.com> On Wed, Mar 23, 2005 at 12:16:08PM -0800, Roland Dreier wrote: > It looks like the driver is working but the SM isn't bringing the > ports to the active state. The problem could still be on the host or > the switch unfortunately. What do you see in the files > > /sys/class/infiniband/mthca0/ports/2/counters/port_rcv_packets > /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_packets ionize:~# cat /sys/class/infiniband/mthca0/ports/*/counters/port_rcv_packets 0 0 ionize:~# cat /sys/class/infiniband/mthca0/ports/*/counters/port_xmit_packets 0 0 Ditto for iowa. > > Both systems are connected to a TS-90 switch (12 port?) which > > implements its own SM. Does this need a firmware upgrade maybe? > > It's entirely possible. Have you used the same switch successfully in > the past? Yes. This is the only switch I've got. > You can check that the switch's SM is still working by > connecting a cable directly from one switch port to another. You > should see both LEDs come on for both ports (possibly after a couple > second delay) if the SM is still running. I'm still at home and will test that after lunch when I head down to the computer room (Cupertino). > I believe the switch ports will get to INIT even with the embedded CPU > completely hosed, so it's possible your switch has crashed or failed completely. *nod*. I try the above first...then cycle power and see if it comes back to life. The switch has been on since December or so.
thanks, grant From roland at topspin.com Wed Mar 23 12:33:05 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 23 Mar 2005 12:33:05 -0800 Subject: [openib-general] ipoib/mthca broken on ia64? In-Reply-To: <20050323203138.GG31468@esmail.cup.hp.com> (Grant Grundler's message of "Wed, 23 Mar 2005 12:31:38 -0800") References: <20050323200837.GE31468@esmail.cup.hp.com> <521xa6ce47.fsf@topspin.com> <20050323203138.GG31468@esmail.cup.hp.com> Message-ID: <52psxqayri.fsf@topspin.com> Grant> *nod*. I try the above first...then cycle power and see if Grant> it comes back to life. The switch has been on since Grant> December or so. If you have a serial console or ethernet configured for the switch, you can check if it still looks happy as well. It wouldn't really surprise me that much if the switch crashed sometime in the past couple of months. - R. From iod00d at hp.com Wed Mar 23 15:46:23 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 23 Mar 2005 15:46:23 -0800 Subject: [openib-general] ipoib/mthca broken on ia64? In-Reply-To: <52psxqayri.fsf@topspin.com> References: <20050323200837.GE31468@esmail.cup.hp.com> <521xa6ce47.fsf@topspin.com> <20050323203138.GG31468@esmail.cup.hp.com> <52psxqayri.fsf@topspin.com> Message-ID: <20050323234623.GD32615@esmail.cup.hp.com> On Wed, Mar 23, 2005 at 12:33:05PM -0800, Roland Dreier wrote: > Grant> *nod*. I try the above first...then cycle power and see if > Grant> it comes back to life. The switch has been on since > Grant> December or so. > > If you have a serial console or ethernet configured for the switch, > you can check if it still looks happy as well. It wouldn't really > surprise me that much if the switch crashed sometime in the past > couple of months. Yup. 
The switch was hosed and cycling power got it back to life again: grundler at ionize:~$ cat /sys/class/infiniband/mthca0/ports/*/state 4: ACTIVE 1: DOWN grundler at iowa:~$ cat /sys/class/infiniband/mthca0/ports/*/state 4: ACTIVE 4: ACTIVE Of course, ping works too. And my apologies - I likely instigated the failure. I connected the serial console out of the TS90 to the serial port of an rx2600 (ia64). At the time I thought: "Just in case LAN config fails". I didn't realize the rx2600 had a getty running. The TS90 wasn't able to teach getty to use "help". :^( And the getty didn't let the TS90 log in either. :^( I suppose that's an easy-to-set-up stress test for the switch user interface and serial port drivers. BTW, interesting to note that the TS90 switch is an "embedded linux" device running a 2.4.19 kernel. Console output appended. "PPC 440GP" makes me wonder if this switch exposed the problem of DMA to non-cacheline-aligned buffers on non-coherent platforms. Could just be a coincidence I guess. :^) And I found the user interface on the switch completely non-intuitive. E.g. the UI supports auto-completion. If a command only supports one option (e.g. show ib), then autocompletion should supply it. I never did figure out how to view port status from the command line. (I don't really need to normally and certainly not when sitting in front of the switch) Ok...back to our regular program... thanks, grant => reset PPCBoot 1.1.6 (Release 1.1.3hp releng #25 02/19/2004 12:05:03) (Feb 19 2004 - 1) Board: Topspin 90 Controller Card FPGA loading... FPGA Revision Register=0xf0100005 Rev=0x6 POST: FPGA Rev:0x6 : PASSED TS90 cntlr0: 1 fan, cntlr1: 2 fans Setting fan cntlr0 to MANUAL mode POST: ctlr0 wait for fans to speed-up... POST: ctlr0 wait for fans to slow-down... POST: fan cntlr0 PASSED Setting fan cntlr1 to MANUAL mode POST: ctlr1 wait for fans to speed-up... POST: ctlr1 wait for fans to slow-down...
POST: fan cntlr1 PASSED Setting fan cntlr0 to AUTO mode Setting fan cntlr1 to AUTO mode leds 0xf0100004 = 0xb0 Hit any key to stop autoboot: 0 Releasing Anafa 1 from reset... Releasing Anafa 2 from reset... Releasing Anafa 3 from reset... ENET Speed is 100 Mbps - FULL duplex connection Boot Regular Image from Disk Partition 0: boot sig=0xcc9e8160 Loading: ++++++++++++++++++++++++++++++++++++++++[ 1272K] ++++++++++++++++++++++++++++++++++++++++[ 2552K] ++++++++++++++++++++++++++++++++++++++++[ 3832K] ++++++++++++++++++++++++++++++++++++++++[ 5112K] ++++++++++++++++++++++++++++++++++++++++[ 6392K] ++++++++++++++++++++++++++++++++++++++++[ 7672K] ++++++++++++++++++ [ Done ] ## Booting image at 00200000 ... Verifying Checksum ... OK Uncompressing Multi-File Image ... OK Loading Ramdisk to 07a7a000, end 07eb69d4 ... OK Linux version 2.4.19-rc3 (releng at borg(Release 1.1.3hp releng #25 02/19/2004 12:4 Topspin 90 base board I2C TTY driver v1.1 iic_ibmocp_init: IBM on-chip iic adapter module RAMDISK: Compressed image found at block 0 VFS: Mounted root (ext2 filesystem) readonly. EXT2-fs warning: checktime reached, running e2fsck is recommended hostname. Mounting local filesystems... Partition check: fla: fla1 fla2 flb: flb1 EXT3-fs warning: checktime reached, running e2fsck is recommended EXT3-fs: recovery complete. Starting syslogd klogd. Starting portmap. Starting ntpd. Starting sshd. Starting inetd. Starting crond. Starting software Release-1.1.3hp/build025. Start Controller module. IBSM_PATH is now ../ppc440_lt Check for needed config files and software. Getting chassis-id, slot-id, and card-type information Load Mellanox drivers. MTHOME is now /topspin/images/Release-1.1.3hp/build025/exe/scripts/./../ppc440_x Add Mellanox DLLs to run-time ld.so search path. Start MDDK. 
mosal:Loading mosal [ OK ] mosal:Creating /dev/mosal [ OK ] mdd:Loading mdd mdd device registered successfully with major 253 [ OK ] mdd:Creating /dev/mdd_dev [ OK ] mdd: Version: Device name mt43132_pci0: ========================================================= Device ID = MT43132 Bus type = PCI Base Address = 0xfffe0000 FW Version = 5.1.0 FW Build = 0x0249 Device name mt43132_pci1: ========================================================= Device ID = MT43132 Bus type = PCI Base Address = 0xfbff0000 FW Version = 5.1.0 FW Build = 0x0249 Device name mt43132_pci2: ========================================================= Device ID = MT43132 Bus type = PCI Base Address = 0xf7ff0000 FW Version = 5.1.0 FW Build = 0x0249 Load lldr.o Creating convenience symlinks to Mellanox utils. starting to check firmware/microcode - 21:57:41 checking FPGA firmware - 21:57:41 Checking FPGA Status from PPCBoot ... FPGA Sanity Checking ... Sanity Check passed Update Checking ... FPGA rev = 6 File rev = 6 No update needed. Done. FPGA firmware done - 21:57:42 checking Anafa microcode - 21:57:42 AnafaInit: filename=/topspin/images/boot/exe/arch/ucode-43132-5.1.0 tmpname=/0 AnafaInit: Anafa0 fw 5.1.0 matches AnafaInit: filename=/topspin/images/boot/exe/arch/ucode-43132-5.1.0 tmpname=/0 AnafaInit: Anafa1 fw 5.1.0 matches AnafaInit: filename=/topspin/images/boot/exe/arch/ucode-43132-5.1.0 tmpname=/0 AnafaInit: Anafa2 fw 5.1.0 matches Anafa firmware 5.1.0 matches firmware file ucode-43132-5.1.0 Anafa microcode done - 21:57:43 done checking firmware/microcode - 21:57:43 card_startup.x : chassis-type=TS360, chassis-id=d9dfffffe2aaf, slot-id=1, card-x card_startup.x : successfully started-up card in I2C mode chassis-type is TS360, chassis-id is d9dfffffe2aaf, slot 1 is controllerIb12porx Load ts_kernel_services.o. Load ts_kernel_poll.o. Load ts_ib_device_n[KERNEL_IB][_tsIbTcarqDeviceInit][tcarq_main.c:126]Created m) o_vapi.o. Load ts_ib_tcarq.o. 
[KERNEL_IB][_tsIbTcarqDeviceInit][tcarq_main.c:126]Created mt43132_pci1 send qu) [KERNEL_IB][_tsIbTcarqDeviceInit][tcarq_main.c:126]Created mt43132_pci2 send qu) Load ts_ib_mad_tcarq.o. [KERNEL_IB][tsIbMadReceiveSetup][mad_tcarq.c:268]Created mt43132_pci0 QP 0 [KERNEL_IB][tsIbMadReceiveSetup][mad_tcarq.c:287]Created mt43132_pci0 QP 1 [KERNEL_IB][tsIbMadReceiveSetup][mad_tcarq.c:268]Created mt43132_pci1 QP 0 [KERNEL_IB][tsIbMadReceiveSetup][mad_tcarq.c:287]Created mt43132_pci1 QP 1 [KERNEL_IB][tsIbMadReceiveSetup][mad_tcarq.c:268]Created mt43132_pci2 QP 0 [KERNEL_IB][tsIbMadReceiveSetup][mad_tcarq.c:287]Created mt43132_pci2 QP 1 Load ts_ib_client_query.o. Load ts_ib_sa_client.o. Load ts_ipoib.o. Load ts_ib_useraccess.o. Creating character special files ts_ua[0-6]. Configure ts0 interface. Start serdes_cfg.x in background. Configuring switch SERDES: Anafa 1 Port 1: internal ... done Port 2: internal ... done Port 3: internal ... done Port 4: internal ... done Port 5: front ... done Port 6: front ... done Port 7: front ... done Port 8: front ... done Anafa 2 Port 1: internal ... done Port 2: internal ... done Port 3: internal ... done Port 4: backplane ... done Port 5: front ... done Port 6: front ... done Port 7: front ... done Port 8: front ... done Anafa 3 Port 1: internal ... done Port 2: internal ... done Port 3: internal ... done Port 4: backplane ... done Port 5: front ... done Port 6: front ... done Port 7: front ... done Port 8: front ... done Start ib_port_agent.x -1 in background. Start ts_sma.x -1 in background. [KERNEL_IB][tsIbNodeDescSet_R5c433891][device_mellanox.c:378]*node_desc has dift [KERNEL_IB][tsIbNodeDescSet_R5c433891][device_mellanox.c:378]*node_desc has dift [KERNEL_IB][tsIbNodeDescSet_R5c433891][device_mellanox.c:378]*node_desc has dift Start notifier.x in background. Start watchd_mgr.x -1 -controllerIb12port4x in background. Start ip_mgr.x in background. Start fib_mgr.x in background Start ib_mgr.x in background. 
srpm_mgr.x -chassis-id 0xd9dfffffe2aaf Start chassis_mgr.x in background Changing password for root Password changed. Start port_mgr.x in background. [INFO] : card 1 is inserted - type=controllerIb12port4x [INFO] : card 1 is up (in-service) - type=controllerIb12port4x Start snmp_agent.x in background Pause to let processes finish initializing. Starting CLI. startup-config file missing. Start with factory default configuration. Login: ================================================================================ Backplane Seeprom ================================================================================ base-mac-addr chassis-id -------------------------------------------------------------------------------- 0:d:9d:fe:a:af 0xd9dfffffe2aaf ================================================================================ Backplane Seeprom ================================================================================ product pca pca fru serial-number serial-number number number -------------------------------------------------------------------------------- USC041700011 CS041700002 95-00021-01-B3 AB291-62001 HP-IB# show card-inventory ================================================================================ Card Resource/Inventory Information ================================================================================ slot-id : 1 used-memory : 42040 (kbytes) free-memory : 85784 (kbytes) used-disk-space : 11576 (kbytes) free-disk-space : 90803 (kbytes) last-image-source : Release-1.1.3hp/build025 primary-image-source : Release-1.1.3hp/build025 image : Release-1.1.3hp/build025 cpu-descr : PPC 440GP Rev. C - Rev 4.129 (pvr 4012 0481) fpga-firmware-rev : 6 ib-firmware-rev : 5.1.0 From robert.j.woodruff at intel.com Wed Mar 23 16:36:37 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 23 Mar 2005 16:36:37 -0800 Subject: [openib-general] ipoib/mthca broken on ia64? 
In-Reply-To: <20050323234623.GD32615@esmail.cup.hp.com> Message-ID: Grant wrote> >Yup. The switch was hosed and cycling power got it back to life again: >grundler at ionize:~$ cat /sys/class/infiniband/mthca0/ports/*/state >4: ACTIVE >1: DOWN >grundler at iowa:~$ cat /sys/class/infiniband/mthca0/ports/*/state >4: ACTIVE >4: ACTIVE >Of course, ping works too. FYI - I have seen a similar weirdness on some of my systems where it seems that sometimes the port does not go active and if I unplug the cable and plug it back into the switch, it then goes active. I think on these nodes I have some very old PCI-X HCAs (A0 silicon) that I cannot even upgrade to the newest firmware. I recall Sean seeing a similar problem and again he may have old hardware and/or old firmware. I know that with the latest cards and firmware I have never seen this anomaly. I have also seen a switch get into a weird state from time to time, but not too often. woody From iod00d at hp.com Wed Mar 23 18:22:08 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 23 Mar 2005 18:22:08 -0800 Subject: [openib-general] ipoib/mthca broken on ia64? In-Reply-To: References: <20050323234623.GD32615@esmail.cup.hp.com> Message-ID: <20050324022208.GK32615@esmail.cup.hp.com> On Wed, Mar 23, 2005 at 04:36:37PM -0800, Bob Woodruff wrote: > I think on these nodes I have some very old PCI-X HCAs (A0 silicon) > that I cannot even upgrade to the newest firmware. Ok - good to know. AFAICT, I only have rev A1 silicon. > I have also seen a switch get into a weird state from time to time, > but not too often. I've just started looking for newer TS90 firmware version. Hopefully someone will have that in-house (HP). 
thanks, grant From shaharf at voltaire.com Thu Mar 24 08:07:21 2005 From: shaharf at voltaire.com (shaharf) Date: Thu, 24 Mar 2005 18:07:21 +0200 Subject: [openib-general] Multiple IPoIB devices over same port Message-ID: Hi, Is it possible somehow to create multiple IPoIB devices over a port with the same pkey but with different QPs? If not, how complex would it be to implement? Thanks, Shahar From roland at topspin.com Thu Mar 24 09:08:20 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 24 Mar 2005 09:08:20 -0800 Subject: [openib-general] Multiple IPoIB devices over same port References: Message-ID: <52vf7h9dkr.fsf@topspin.com> shaharf> Is it possible somehow to create multiple IPoIB devices shaharf> over a port with the same pkey but with different QPs? Not with the current code. shaharf> If not, how complex would it be to implement? I think you could do it quite easily. ipoib_add_port() encapsulates everything about creating a device, so you just have to call it more than once. The hardest part is probably figuring out the right interface for triggering this, naming devices, etc. Why do you want to do this? - R. From shaharf at voltaire.com Thu Mar 24 09:39:12 2005 From: shaharf at voltaire.com (shaharf) Date: Thu, 24 Mar 2005 19:39:12 +0200 Subject: [openib-general] Multiple IPoIB devices over same port Message-ID: > shaharf> If not, how complex would it be to implement? > > I think you could do it quite easily. ipoib_add_port() encapsulates > everything about creating a device, so you just have to call it more > than once. The hardest part is probably figuring out the right > interface for triggering this, naming devices, etc. > > Why do you want to do this? > > - R. To support multiple virtual machines (each with its own IPoIB device) on the same machine.
Shahar From xma at us.ibm.com Thu Mar 24 09:53:44 2005 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 24 Mar 2005 09:53:44 -0800 Subject: [openib-general] Multiple IPoIB devices over same port In-Reply-To: Message-ID: I thought about this so that at a higher level you can use policy routing or QoS or traffic shaping or maybe even CKRM to distribute load on multiple streams. Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From shaharf at voltaire.com Thu Mar 24 10:00:01 2005 From: shaharf at voltaire.com (shaharf) Date: Thu, 24 Mar 2005 20:00:01 +0200 Subject: [openib-general] Multiple IPoIB devices over same port Message-ID: I am not sure I follow. Do you mean that, for example, a separate QoS tag is used for each stream? If this is the case I am not sure how to handle QoS tags from the VM. > I thought about this so that at a higher level you can use policy routing or QoS or traffic shaping or maybe even CKRM to distribute load on multiple streams. > > Shirley Ma > IBM Linux Technology Center > 15300 SW Koll Parkway > Beaverton, OR 97006-6063 > Phone(Fax): (503) 578-7638 From tduffy at sun.com Thu Mar 24 11:12:36 2005 From: tduffy at sun.com (Tom Duffy) Date: Thu, 24 Mar 2005 11:12:36 -0800 Subject: [openib-general] [PATCH] Remove <2.6.11 ioctl stuff In-Reply-To: <52d5u0zwr5.fsf@topspin.com> References: <1110899837.4662.578.camel@localhost.localdomain> <52d5u0zwr5.fsf@topspin.com> Message-ID: <1111691556.3067.21.camel@duffman> On Tue, 2005-03-15 at 08:42 -0800, Roland Dreier wrote: > Hal> Hi Roland, Just ran across this reminder: > > Hal> Should user_mad.c be updated for the following: /* XXX remove > Hal> once 2.6.11 is released */ > > Yep, I'd apply that patch for sure. Make sure this doesn't get lost.
Signed-off-by: Tom Duffy Index: drivers/infiniband/core/user_mad.c =================================================================== --- drivers/infiniband/core/user_mad.c (revision 2038) +++ drivers/infiniband/core/user_mad.c (working copy) @@ -43,10 +43,6 @@ #include #include #include -/* XXX remove once 2.6.11 is released */ -#if !defined(HAVE_COMPAT_IOCTL) || !defined(HAVE_UNLOCKED_IOCTL) -#include -#endif #include #include @@ -462,14 +458,8 @@ out: return ret; } -/* XXX remove once 2.6.11 is released */ -#if !defined(HAVE_COMPAT_IOCTL) || !defined(HAVE_UNLOCKED_IOCTL) -static int ib_umad_ioctl(struct inode *inode, struct file *filp, - unsigned int cmd, unsigned long arg) -#else -static long ib_umad_ioctl(struct file *filp, - unsigned int cmd, unsigned long arg) -#endif +static long ib_umad_ioctl(struct file *filp, unsigned int cmd, + unsigned long arg) { switch (cmd) { case IB_USER_MAD_REGISTER_AGENT: @@ -525,13 +515,8 @@ static struct file_operations umad_fops .read = ib_umad_read, .write = ib_umad_write, .poll = ib_umad_poll, -/* XXX remove once 2.6.11 is released */ -#if !defined(HAVE_COMPAT_IOCTL) || !defined(HAVE_UNLOCKED_IOCTL) - .ioctl = ib_umad_ioctl, -#else .unlocked_ioctl = ib_umad_ioctl, .compat_ioctl = ib_umad_ioctl, -#endif .open = ib_umad_open, .release = ib_umad_close }; @@ -832,25 +817,8 @@ static int __init ib_umad_init(void) goto out_class; } -/* XXX remove once 2.6.11 is released */ -#if !defined(HAVE_COMPAT_IOCTL) || !defined(HAVE_UNLOCKED_IOCTL) - /* Our ioctls are 32/64 clean */ - ret = register_ioctl32_conversion(IB_USER_MAD_REGISTER_AGENT, NULL); - ret |= register_ioctl32_conversion(IB_USER_MAD_UNREGISTER_AGENT, NULL); - if (ret) { - printk(KERN_ERR "user_mad: couldn't register ioctl32 conversions\n"); - goto out_client; - } -#endif - return 0; -/* XXX remove once 2.6.11 is released */ -#if !defined(HAVE_COMPAT_IOCTL) || !defined(HAVE_UNLOCKED_IOCTL) -out_client: - ib_unregister_client(&umad_client); -#endif - out_class: 
class_unregister(&umad_class); @@ -863,11 +831,6 @@ out: static void __exit ib_umad_cleanup(void) { -/* XXX remove once 2.6.11 is released */ -#if !defined(HAVE_COMPAT_IOCTL) || !defined(HAVE_UNLOCKED_IOCTL) - unregister_ioctl32_conversion(IB_USER_MAD_REGISTER_AGENT); - unregister_ioctl32_conversion(IB_USER_MAD_UNREGISTER_AGENT); -#endif ib_unregister_client(&umad_client); class_unregister(&umad_class); unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS * 2); From libor at topspin.com Thu Mar 24 17:31:36 2005 From: libor at topspin.com (Libor Michalek) Date: Thu, 24 Mar 2005 17:31:36 -0800 Subject: [openib-general] NFS over IPoIB performance.. In-Reply-To: <20050323025753.GN9768@kalmia.hozed.org>; from hozer@hozed.org on Tue, Mar 22, 2005 at 08:57:53PM -0600 References: <20050323025753.GN9768@kalmia.hozed.org> Message-ID: <20050324173136.A17611@topspin.com> On Tue, Mar 22, 2005 at 08:57:53PM -0600, Troy Benjegerdes wrote: > Well, I'm quite impressed.. I've been running NFS over IPoIB to a server > with a 3ware SATA raid card in it, and nothing's crashed ;) > > My only real problem is there seems to be no good way to set readahead > for the NFS client. I'm averaging 75MB/sec or so, and I've seen peaks of > 100MB/sec, with clients from two different machines running the GAMESS > computational chemistry code. I'm curious, were you running NFS over UDP or TCP? If the former, could you try the latter?
-Libor From libor at topspin.com Fri Mar 25 08:23:49 2005 From: libor at topspin.com (Libor Michalek) Date: Fri, 25 Mar 2005 08:23:49 -0800 Subject: [openib-general] Re: Re: [Andrew Morton] inappropriate use of in_atomic() In-Reply-To: <20050320181242.GA18963@mellanox.co.il>; from mst@mellanox.co.il on Sun, Mar 20, 2005 at 08:12:42PM +0200 References: <52oedq946k.fsf@topspin.com> <20050311073108.GA20989@mellanox.co.il> <20050311154316.E31689@topspin.com> <20050320181242.GA18963@mellanox.co.il> Message-ID: <20050325082349.A22487@topspin.com> On Sun, Mar 20, 2005 at 08:12:42PM +0200, Michael S. Tsirkin wrote: > Quoting r. Libor Michalek : > > Subject: Re: Re: [Andrew Morton] inappropriate use of in_atomic() > > > > On Fri, Mar 11, 2005 at 09:31:08AM +0200, Michael S. Tsirkin wrote: > > > > > > Sdp also has a couple of uses. > > > Maybe we can use the atomic branch in all cases here, as well? > > > Libor? > > > > Yes, the case in sdp_iocb.c can probably always take the atomic > > path. The kmap/kunmap cases really only care whether we're in an > > interrupt, so switching to in_interrupt() should be sufficient. > > Recent comments by Andrew indicate that it is better to always > use kmap_atomic/kunmap_atomic if possible. This will also > let us get rid of the wrapper function, which is good. > > Why do you think we need to kmap? I didn't realize that the atomic version was preferred over the regular kmap. The only thing that needs to be done is to make sure that the local CPU interrupts are off before calling kmap_atomic; instead, we currently check to see if we're in an interrupt and call the appropriate function. I have no problem changing it to just atomic.
-Libor From roland at topspin.com Fri Mar 25 08:56:03 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 25 Mar 2005 08:56:03 -0800 Subject: [openib-general] Re: [PATCH] Remove <2.6.11 ioctl stuff In-Reply-To: <1111691556.3067.21.camel@duffman> (Tom Duffy's message of "Thu, 24 Mar 2005 11:12:36 -0800") References: <1110899837.4662.578.camel@localhost.localdomain> <52d5u0zwr5.fsf@topspin.com> <1111691556.3067.21.camel@duffman> Message-ID: <52eke38y1o.fsf@topspin.com> Thanks, applied. - R. From roland at topspin.com Fri Mar 25 09:06:39 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 25 Mar 2005 09:06:39 -0800 Subject: [openib-general] Re: [PATCH] mthca: fill in missing fields in send completion In-Reply-To: <20050323142503.GA13701@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 23 Mar 2005 16:25:03 +0200") References: <20050323142503.GA13701@mellanox.co.il> Message-ID: <5264zf8xk0.fsf@topspin.com> Thanks, applied. - R. From roland at topspin.com Fri Mar 25 09:43:04 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 25 Mar 2005 09:43:04 -0800 Subject: [openib-general] Re: [PATCH] mthca: fill in missing fields in send completion In-Reply-To: <5264zf8xk0.fsf@topspin.com> (Roland Dreier's message of "Fri, 25 Mar 2005 09:06:39 -0800") References: <20050323142503.GA13701@mellanox.co.il> <5264zf8xk0.fsf@topspin.com> Message-ID: <52u0mz7hav.fsf@topspin.com> ... and adapted to libmthca as well ... - R. From hozer at hozed.org Fri Mar 25 18:33:01 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Fri, 25 Mar 2005 20:33:01 -0600 Subject: [openib-general] NFS over IPoIB performance.. In-Reply-To: <20050324173136.A17611@topspin.com> References: <20050323025753.GN9768@kalmia.hozed.org> <20050324173136.A17611@topspin.com> Message-ID: <20050326023301.GB26127@kalmia.hozed.org> On Thu, Mar 24, 2005 at 05:31:36PM -0800, Libor Michalek wrote: > On Tue, Mar 22, 2005 at 08:57:53PM -0600, Troy Benjegerdes wrote: > > Well, I'm quite impressed.. 
I've been running NFS over IPoIB to a server > > with a 3ware SATA raid card in it, and nothing's crashed ;) > > > > My only real problem is there seems to be no good way to set readahead > > for the NFS client. I'm averaging 75MB/sec or so, and I've seen peaks of > > 100MB/sec, with clients from two different machines running the GAMESS > > computational chemistry code. > > I'm curious, were you running NFS over UDP or TCP? If the former > could you try the latter? > > -Libor With TCP: 10.0.0.10:/scratch/md1 on /nfs/da0/md1 type nfs (rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=10.0.0.10) opteron1:64bit:/nfs/da0/md1/troy$ dd if=junk of=/dev/null 20971520+0 records in 20971520+0 records out 10737418240 bytes transferred in 113.188032 seconds (94863547 bytes/sec) With UDP: opteron1:64bit:/nfs/da0$ mount 10.0.0.10:/scratch/md1 /nfs/da0/md1 -o rsize=32768,wsize=32768,nfsvers=3 nfs warning: mount version older than kernel opteron1:64bit:/nfs/da0$ cd md1/troy/ opteron1:64bit:/nfs/da0/md1/troy$ dd if=junk of=/dev/null 20971520+0 records in 20971520+0 records out 10737418240 bytes transferred in 96.758483 seconds (110971337 bytes/sec) And one more time with tcp, in case there are buffer-cache effects.
opteron1:64bit:/nfs/da0$ mount 10.0.0.10:/scratch/md1 /nfs/da0/md1 -o rsize=32768,wsize=32768,nfsvers=3,tcp nfs warning: mount version older than kernel opteron1:64bit:/nfs/da0$ cd md1/troy/ opteron1:64bit:/nfs/da0/md1/troy$ dd if=junk of=/dev/null 20971520+0 records in 20971520+0 records out 10737418240 bytes transferred in 129.839903 seconds (82697368 bytes/sec) One last time with UDP: opteron1:64bit:/nfs/da0$ mount 10.0.0.10:/scratch/md1 /nfs/da0/md1 -o rsize=32768,wsize=32768,nfsvers=3 nfs warning: mount version older than kernel opteron1:64bit:/nfs/da0$ cd troy/ opteron1:64bit:/nfs/da0/troy$ cd ../md1/troy/ opteron1:64bit:/nfs/da0/md1/troy$ dd if=junk of=/dev/null 20971520+0 records in 20971520+0 records out 10737418240 bytes transferred in 96.915829 seconds (110791172 bytes/sec) From eitan at mellanox.co.il Sat Mar 26 09:41:41 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 26 Mar 2005 19:41:41 +0200 Subject: [openib-general] Using OpenSM Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF008@mtlex01.yok.mtl.com> Hi Aniruddha > Any ideas what is wrong? Seems you are still missing some user access modules. But I have to let the experts answer this. 
What I have loaded is:
ib_kdapl            101408  0
dat_registry         20160  1 ib_kdapl
ib_sdp              128760  0
ib_useraccess_cm     22912  0
ib_cm                52312  3 ib_kdapl,ib_sdp,ib_useraccess_cm
ib_dapl_srv          47296  1 ib_kdapl
ib_ip2pr             30252  3 ib_sdp,ib_useraccess_cm,ib_dapl_srv
ib_ipoib             66696  3 ib_kdapl,ib_dapl_srv,ib_ip2pr
ib_sa_client         34056  3 ib_dapl_srv,ib_ip2pr,ib_ipoib
ib_client_query      17952  4 ib_dapl_srv,ib_ip2pr,ib_ipoib,ib_sa_client
ib_poll              20152  4 ib_sdp,ib_cm,ib_ip2pr,ib_client_query
ib_useraccess        15300  0
ib_tavor             38540  9 ib_useraccess_cm
mod_thh             335808  0
mod_vip             387976  5 ib_kdapl,ib_useraccess_cm,ib_dapl_srv,ib_tavor,mod_thh
mlxsys              130120  2 mod_thh,mod_vip
ib_mad               27020  4 ib_cm,ib_client_query,ib_useraccess,ib_tavor
ib_core             248852 11 ib_kdapl,ib_sdp,ib_useraccess_cm,ib_cm,ib_dapl_srv,ib_ip2pr,ib_ipoib,ib_sa_client,ib_useraccess,ib_tavor,ib_mad
ib_services          19780 13 ib_sdp,ib_useraccess_cm,ib_cm,ib_dapl_srv,ib_ip2pr,ib_ipoib,ib_sa_client,ib_client_query,ib_poll,ib_useraccess,ib_tavor,ib_mad,ib_core
edd                  13720  0
joydev               14528  0
usbserial            35952  0
parport_pc           41024  0
lp                   15364  0
parport              44232  2 parport_pc,lp
autofs               21120  2
thermal              16648  0
processor            21312  1 thermal
fan                   8196  0
button               10384  0
ipv6                275580 23
battery              12804  0
ac                    8964  0
mst_pciconf          84992  0
mst_pci              82048  2
sworks_agp           13472  0
agpgart              36140  1 sworks_agp
ohci_hcd             24324  0
evdev                13952  0
af_packet            26376  2
usbcore             116572  4 usbserial,ohci_hcd
tg3                  75396  0
binfmt_misc          14856  1
subfs                12160  2
dm_mod               57472  0
ext3                121384  2
jbd                  75172  1 ext3
aic7xxx             190516  3
sd_mod               25088  4
scsi_mod            118340  5 sg,st,sr_mod,aic7xxx,sd_mod
From mst at mellanox.co.il Sat Mar 26 10:44:13 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Sat, 26 Mar 2005 20:44:13 +0200 Subject: [openib-general] Re: Re: [Andrew Morton] inappropriate use of in_atomic() In-Reply-To: <20050325082349.A22487@topspin.com> References: <52oedq946k.fsf@topspin.com> <20050311073108.GA20989@mellanox.co.il> <20050311154316.E31689@topspin.com> <20050320181242.GA18963@mellanox.co.il> <20050325082349.A22487@topspin.com> Message-ID: <20050326184412.GB8758@mellanox.co.il> Quoting r. Libor Michalek : > Subject: Re: Re: [Andrew Morton] inappropriate use of in_atomic() > > On Sun, Mar 20, 2005 at 08:12:42PM +0200, Michael S. Tsirkin wrote: > > Quoting r. Libor Michalek : > > > Subject: Re: Re: [Andrew Morton] inappropriate use of in_atomic() > > > > > > On Fri, Mar 11, 2005 at 09:31:08AM +0200, Michael S. Tsirkin wrote: > > > > > > > > Sdp also has a couple of uses. > > > > Maybe we can use the atomic branch in all cases here, as well? > > > > Libor? > > > > > > Yes, the case in sdp_iocb.c can probably always take the atomic > > > path. The kmap/kunmap cases really only care whether we're in an > > > interrupt, so switching to in_interrupt() should be sufficient. > > > > Recent comments by Andrew indicate that it is better to always > > use kmap_atomic/kunmap_atomic if possible. This will also > > let us get rid of the wrapper function, which is good. > > > > Why do you think we need to kmap? > > I didn't realize that the atomic version was preferred over the > regular kmap. The only thing that needs to be done is to make sure > that the local CPU interrupts are off before calling kmap_atomic, > instead we currently check to see if we're in an interrupt and call > the appropriate function. I have no problem changing it to just > atomic. > > -Libor > And disable/enable local interrupts? -- MST - Michael S.
Tsirkin From tziporet at mellanox.co.il Sat Mar 26 23:28:36 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 27 Mar 2005 09:28:36 +0200 Subject: [openib-general] moving gen1 branch to an archive directory Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF1E3@mtlex01.yok.mtl.com> Hi Matt, From time to time we see people that try to work with gen1 although this tree is not really working or supported. Since our focus is gen2 now I suggest to move this directory to some archive directory. > Tziporet Koren > Software Director > Mellanox Technologies, Ltd > mailto:tziporet at mellanox.co.il > Tel +972-4-9097200, ext 380 From halr at voltaire.com Sun Mar 27 05:08:43 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Mar 2005 08:08:43 -0500 Subject: [openib-general] [PATCH] IPoIB: Set hardware header on packet receive Message-ID: <1111928923.4650.70.camel@localhost.localdomain> IPoIB: Set hardware header on packet receive Needed for PF_PACKET/SOCK_PACKET Signed-off-by: Hal Rosenstock Index: ipoib_ib.c =================================================================== --- ipoib_ib.c (revision 2032) +++ ipoib_ib.c (working copy) @@ -209,6 +209,7 @@ priv->stats.rx_bytes += skb->len; skb->dev = dev; + skb->mac.raw = skb->data; /* XXX get correct PACKET_ type here */ skb->pkt_type = PACKET_HOST; netif_rx_ni(skb); From mlleinin at hpcn.ca.sandia.gov Sun Mar 27 07:20:52 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Sun, 27 Mar 2005 07:20:52 -0800 Subject: [openib-general] moving gen1 branch to an archive directory In-Reply-To: <506C3D7B14CDD411A52C00025558DED6064BF1E3@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6064BF1E3@mtlex01.yok.mtl.com> Message-ID: <1111936852.19863.507.camel@localhost> On Sun, 2005-03-27 at 09:28 +0200, Tziporet Koren wrote: > Hi Matt, > > From time to time we see people that try to work with gen1 although
> this tree is not really working or supported. > Since our focus is gen2 now I suggest to move this directory to some > archive directory. > I agree. How about creating an archive directory in the top level directory and move gen1 into it? - Matt From mst at mellanox.co.il Sun Mar 27 07:31:13 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 27 Mar 2005 17:31:13 +0200 Subject: [openib-general] [PATCH] FMR support in mthca Message-ID: <20050327153112.GA26108@mellanox.co.il> OK, here's an updated version of the patch. This passed basic tests: allocate/free, map/remap/unmap. For Tavor, MTTs for FMR are separate from regular MTTs, and are reserved at driver initialization. This is done to limit the amount of virtual memory needed to map the MTTs. For Arbel, there's no such limitation, and all MTTs and MPTs may be used for FMR or for regular MR. It would be easy to remove the limitation for Tavor for 64-bit systems, where it's feasible to ioremap the whole MTT table. Let me know if this is of interest. Please comment. MST Add FMR support to mthca. Both Tavor and Arbel native are supported. For Tavor, FMR support is disabled if DDR is hidden. Signed-off-by: Michael S. 
Tsirkin Index: hw/mthca/mthca_dev.h =================================================================== --- hw/mthca/mthca_dev.h (revision 2050) +++ hw/mthca/mthca_dev.h (working copy) @@ -61,7 +61,8 @@ enum { MTHCA_FLAG_SRQ = 1 << 2, MTHCA_FLAG_MSI = 1 << 3, MTHCA_FLAG_MSI_X = 1 << 4, - MTHCA_FLAG_NO_LAM = 1 << 5 + MTHCA_FLAG_NO_LAM = 1 << 5, + MTHCA_FLAG_FMR = 1 << 6 }; enum { @@ -134,6 +135,7 @@ struct mthca_limits { int reserved_eqs; int num_mpts; int num_mtt_segs; + int fmr_reserved_mtts; int reserved_mtts; int reserved_mrws; int reserved_uars; @@ -170,13 +172,25 @@ struct mthca_pd_table { struct mthca_alloc alloc; }; +struct mthca_buddy { + unsigned long **bits; + int max_order; + spinlock_t lock; +}; + struct mthca_mr_table { struct mthca_alloc mpt_alloc; - int max_mtt_order; - unsigned long **mtt_buddy; + struct mthca_buddy mtt_buddy; + struct mthca_buddy *fmr_mtt_buddy; u64 mtt_base; + u64 mpt_base; struct mthca_icm_table *mtt_table; struct mthca_icm_table *mpt_table; + struct { + void __iomem *mpt_base; + void __iomem *mtt_base; + struct mthca_buddy mtt_buddy; + } tavor_fmr; }; struct mthca_eq_table { @@ -375,7 +389,20 @@ int mthca_mr_alloc_phys(struct mthca_dev u64 *buffer_list, int buffer_size_shift, int list_len, u64 iova, u64 total_size, u32 access, struct mthca_mr *mr); -void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_fmr_alloc(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_fmr *fmr); + +int mthca_tavor_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, + int list_len, u64 iova); +void mthca_tavor_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr); +int mthca_arbel_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, + int list_len, u64 iova); + +void mthca_arbel_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr); + +int mthca_free_fmr(struct mthca_dev *dev, struct mthca_fmr *fmr); int mthca_map_eq_icm(struct mthca_dev *dev, u64 icm_virt); 
void mthca_unmap_eq_icm(struct mthca_dev *dev); Index: hw/mthca/mthca_main.c =================================================================== --- hw/mthca/mthca_main.c (revision 2050) +++ hw/mthca/mthca_main.c (working copy) @@ -81,6 +81,7 @@ static struct mthca_profile default_prof .num_mtt = 1 << 20, .num_udav = 1 << 15, /* Tavor only */ .uarc_size = 1 << 18, /* Arbel only */ + .fmr_reserved_mtts = 1 << 18, /* Tavor only */ }; static int __devinit mthca_tune_pci(struct mthca_dev *mdev) Index: hw/mthca/mthca_memfree.h =================================================================== --- hw/mthca/mthca_memfree.h (revision 2050) +++ hw/mthca/mthca_memfree.h (working copy) @@ -90,6 +90,9 @@ int mthca_table_get_range(struct mthca_d void mthca_table_put_range(struct mthca_dev *dev, struct mthca_icm_table *table, int start, int end); +void *mthca_table_find(struct mthca_dev *dev, struct mthca_icm_table *table, + int obj); + static inline void mthca_icm_first(struct mthca_icm *icm, struct mthca_icm_iter *iter) { Index: hw/mthca/mthca_provider.c =================================================================== --- hw/mthca/mthca_provider.c (revision 2050) +++ hw/mthca/mthca_provider.c (working copy) @@ -574,6 +574,75 @@ static int mthca_dereg_mr(struct ib_mr * return 0; } +static struct ib_fmr *mthca_alloc_fmr(struct ib_pd *pd, int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct mthca_fmr *fmr; + int err; + fmr = kmalloc(sizeof *fmr, GFP_KERNEL); + if (!fmr) + return ERR_PTR(-ENOMEM); + + memcpy(&fmr->attr, fmr_attr, sizeof *fmr_attr); + err = mthca_fmr_alloc(to_mdev(pd->device), to_mpd(pd)->pd_num, + convert_access(mr_access_flags), fmr); + + if (err) { + kfree(fmr); + return ERR_PTR(err); + } + return &fmr->ibmr; +} + +static int mthca_dealloc_fmr(struct ib_fmr *fmr) +{ + struct mthca_fmr *mfmr = to_mfmr(fmr); + int err; + + err = mthca_free_fmr(to_mdev(fmr->device), mfmr); + if (err) + return err; + + kfree(mfmr); + return 0; +} + +static int 
mthca_unmap_fmr(struct list_head *fmr_list) +{ + struct ib_fmr *fmr; + int err; + u8 status; + struct mthca_dev* mdev = NULL; + + list_for_each_entry(fmr, fmr_list, list) { + mdev = to_mdev(fmr->device); + break; + } + + if (!mdev) + return 0; + + if (mdev->hca_type == ARBEL_NATIVE) { + list_for_each_entry(fmr, fmr_list, list) { + BUG_ON(fmr->device != &mdev->ib_dev); + mthca_arbel_fmr_unmap(mdev, to_mfmr(fmr)); + } + + wmb(); + } else + list_for_each_entry(fmr, fmr_list, list) { + BUG_ON(fmr->device != &mdev->ib_dev); + mthca_tavor_fmr_unmap(mdev, to_mfmr(fmr)); + } + + err = mthca_SYNC_TPT(mdev, &status); + if (err) + return err; + if (status) + return -EINVAL; + return 0; +} + static ssize_t show_rev(struct class_device *cdev, char *buf) { struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); @@ -637,6 +706,17 @@ int mthca_register_device(struct mthca_d dev->ib_dev.get_dma_mr = mthca_get_dma_mr; dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; dev->ib_dev.dereg_mr = mthca_dereg_mr; + + if (dev->mthca_flags & MTHCA_FLAG_FMR) { + dev->ib_dev.alloc_fmr = mthca_alloc_fmr; + dev->ib_dev.unmap_fmr = mthca_unmap_fmr; + dev->ib_dev.dealloc_fmr = mthca_dealloc_fmr; + if (dev->hca_type == ARBEL_NATIVE) + dev->ib_dev.map_phys_fmr = mthca_arbel_map_phys_fmr; + else + dev->ib_dev.map_phys_fmr = mthca_tavor_map_phys_fmr; + } + dev->ib_dev.attach_mcast = mthca_multicast_attach; dev->ib_dev.detach_mcast = mthca_multicast_detach; dev->ib_dev.process_mad = mthca_process_mad; Index: hw/mthca/mthca_provider.h =================================================================== --- hw/mthca/mthca_provider.h (revision 2050) +++ hw/mthca/mthca_provider.h (working copy) @@ -60,6 +60,24 @@ struct mthca_mr { u32 first_seg; }; +struct mthca_fmr { + struct ib_fmr ibmr; + struct ib_fmr_attr attr; + int order; + u32 first_seg; + int maps; + union { + struct { + struct mthca_mpt_entry __iomem *mpt; + u64 __iomem *mtts; + } tavor; + struct { + struct mthca_mpt_entry *mpt; 
+ __be64 *mtts; + } arbel; + } mem; +}; + struct mthca_pd { struct ib_pd ibpd; u32 pd_num; @@ -218,6 +236,11 @@ struct mthca_sqp { dma_addr_t header_dma; }; +static inline struct mthca_fmr *to_mfmr(struct ib_fmr *ibmr) +{ + return container_of(ibmr, struct mthca_fmr, ibmr); +} + static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) { return container_of(ibmr, struct mthca_mr, ibmr); Index: hw/mthca/mthca_profile.c =================================================================== --- hw/mthca/mthca_profile.c (revision 2050) +++ hw/mthca/mthca_profile.c (working copy) @@ -223,9 +223,10 @@ u64 mthca_make_profile(struct mthca_dev init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); break; case MTHCA_RES_MPT: - dev->limits.num_mpts = profile[i].num; - init_hca->mpt_base = profile[i].start; - init_hca->log_mpt_sz = profile[i].log_num; + dev->limits.num_mpts = profile[i].num; + dev->mr_table.mpt_base = profile[i].start; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; break; case MTHCA_RES_MTT: dev->limits.num_mtt_segs = profile[i].num; @@ -259,6 +260,16 @@ u64 mthca_make_profile(struct mthca_dev */ dev->limits.num_pds = MTHCA_NUM_PDS; + /* For Tavor, FMRs need to be ioremapped. For 32-bit systems it may be + * too expensive to map all MTT memory, so we reserve some MTTs for FMR + * access, taking them out of the MR pool. They don't take + * additional memory, but we assign them as part of the HCA profile + * anyway.
*/ + if (dev->hca_type == ARBEL_NATIVE) + dev->limits.fmr_reserved_mtts = 0; + else + dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts; + kfree(profile); return total_size; } Index: hw/mthca/mthca_cmd.c =================================================================== --- hw/mthca/mthca_cmd.c (revision 2050) +++ hw/mthca/mthca_cmd.c (working copy) @@ -1384,6 +1384,12 @@ int mthca_HW2SW_MPT(struct mthca_dev *de return err; } +int mthca_SYNC_TPT(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYNC_TPT, CMD_TIME_CLASS_B, status); +} + + int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, int num_mtt, u8 *status) { Index: hw/mthca/mthca_profile.h =================================================================== --- hw/mthca/mthca_profile.h (revision 2050) +++ hw/mthca/mthca_profile.h (working copy) @@ -48,6 +48,7 @@ struct mthca_profile { int num_udav; int num_uar; int uarc_size; + int fmr_reserved_mtts; }; u64 mthca_make_profile(struct mthca_dev *mdev, Index: hw/mthca/mthca_doorbell.h =================================================================== --- hw/mthca/mthca_doorbell.h (revision 2050) +++ hw/mthca/mthca_doorbell.h (working copy) @@ -51,6 +51,11 @@ #define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) #define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) +static inline void mthca_write64_raw(__be64 val, void __iomem *dest) +{ + __raw_writeq((__force u64) val, dest); +} + static inline void mthca_write64(u32 val[2], void __iomem *dest, spinlock_t *doorbell_lock) { @@ -74,6 +79,12 @@ static inline void mthca_write_db_rec(u3 #define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) #define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) +static inline void mthca_write64_raw(__be64 val, void __iomem *dest) +{ + __raw_writel(((__force u32 *) &val)[0], dest); + __raw_writel(((__force u32 *) &val)[1], dest + 4); +} + static inline void mthca_write64(u32 val[2], void __iomem *dest, spinlock_t *doorbell_lock) { Index: hw/mthca/mthca_cmd.h 
=================================================================== --- hw/mthca/mthca_cmd.h (revision 2050) +++ hw/mthca/mthca_cmd.h (working copy) @@ -276,6 +276,7 @@ int mthca_HW2SW_MPT(struct mthca_dev *de int mpt_index, u8 *status); int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, int num_mtt, u8 *status); +int mthca_SYNC_TPT(struct mthca_dev *dev, u8 *status); int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, int eq_num, u8 *status); int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, Index: hw/mthca/mthca_mr.c =================================================================== --- hw/mthca/mthca_mr.c (revision 2050) +++ hw/mthca/mthca_mr.c (working copy) @@ -72,60 +72,107 @@ struct mthca_mpt_entry { * through the bitmaps) */ -static u32 __mthca_alloc_mtt(struct mthca_dev *dev, int order) +static u32 mthca_buddy_alloc(struct mthca_buddy *buddy, int order) { int o; int m; u32 seg; - spin_lock(&dev->mr_table.mpt_alloc.lock); + spin_lock(&buddy->lock); - for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { - m = 1 << (dev->mr_table.max_mtt_order - o); - seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + for (o = order; o <= buddy->max_order; ++o) { + m = 1 << (buddy->max_order - o); + seg = find_first_bit(buddy->bits[o], m); if (seg < m) goto found; } - spin_unlock(&dev->mr_table.mpt_alloc.lock); + spin_unlock(&buddy->lock); return -1; found: - clear_bit(seg, dev->mr_table.mtt_buddy[o]); + clear_bit(seg, buddy->bits[o]); while (o > order) { --o; seg <<= 1; - set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + set_bit(seg ^ 1, buddy->bits[o]); } - spin_unlock(&dev->mr_table.mpt_alloc.lock); + spin_unlock(&buddy->lock); seg <<= order; return seg; } -static void __mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +static void mthca_buddy_free(struct mthca_buddy *buddy, u32 seg, int order) { seg >>= order; - spin_lock(&dev->mr_table.mpt_alloc.lock); + spin_lock(&buddy->lock); - while (test_bit(seg ^ 1, 
dev->mr_table.mtt_buddy[order])) { - clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + while (test_bit(seg ^ 1, buddy->bits[order])) { + clear_bit(seg ^ 1, buddy->bits[order]); seg >>= 1; ++order; } - set_bit(seg, dev->mr_table.mtt_buddy[order]); + set_bit(seg, buddy->bits[order]); - spin_unlock(&dev->mr_table.mpt_alloc.lock); + spin_unlock(&buddy->lock); } -static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +static int __devinit mthca_buddy_init(struct mthca_buddy *buddy, int max_order) { - u32 seg = __mthca_alloc_mtt(dev, order); + int i, s; + + buddy->max_order = max_order; + spin_lock_init(&buddy->lock); + + buddy->bits = kmalloc((buddy->max_order + 1) * sizeof (long *), + GFP_KERNEL); + if (!buddy->bits) + goto err_out; + + memset(buddy->bits, 0, (buddy->max_order + 1) * sizeof (long *)); + + for (i = 0; i <= buddy->max_order; ++i) { + s = BITS_TO_LONGS(1 << (buddy->max_order - i)); + buddy->bits[i] = kmalloc(s * sizeof (long), GFP_KERNEL); + if (!buddy->bits[i]) + goto err_out_free; + bitmap_zero(buddy->bits[i], + 1 << (buddy->max_order - i)); + } + + set_bit(0, buddy->bits[buddy->max_order]); + + return 0; + +err_out_free: + for (i = 0; i <= buddy->max_order; ++i) + kfree(buddy->bits[i]); + + kfree(buddy->bits); +err_out: + return -ENOMEM; +} + +static void __devexit mthca_buddy_cleanup(struct mthca_buddy *buddy) +{ + int i; + for (i = 0; i <= buddy->max_order; ++i) + kfree(buddy->bits[i]); + + kfree(buddy->bits); +} + + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order, + struct mthca_buddy *buddy) +{ + u32 seg = mthca_buddy_alloc(buddy, order); if (seg == -1) return -1; @@ -133,36 +180,57 @@ static u32 mthca_alloc_mtt(struct mthca_ if (dev->hca_type == ARBEL_NATIVE) if (mthca_table_get_range(dev, dev->mr_table.mtt_table, seg, seg + (1 << order) - 1)) { - __mthca_free_mtt(dev, seg, order); + mthca_buddy_free(buddy, seg, order); seg = -1; } return seg; } -static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +static 
void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order, + struct mthca_buddy* buddy) { - __mthca_free_mtt(dev, seg, order); + mthca_buddy_free(buddy, seg, order); if (dev->hca_type == ARBEL_NATIVE) mthca_table_put_range(dev, dev->mr_table.mtt_table, seg, seg + (1 << order) - 1); } +static inline u32 tavor_hw_index_to_key(u32 ind) +{ + return ind; +} + +static inline u32 tavor_key_to_hw_index(u32 key) +{ + return key; +} + +static inline u32 arbel_hw_index_to_key(u32 ind) +{ + return (ind >> 24) | (ind << 8); +} + +static inline u32 arbel_key_to_hw_index(u32 key) +{ + return (key << 24) | (key >> 8); +} + static inline u32 hw_index_to_key(struct mthca_dev *dev, u32 ind) { if (dev->hca_type == ARBEL_NATIVE) - return (ind >> 24) | (ind << 8); + return arbel_hw_index_to_key(ind); else - return ind; + return tavor_hw_index_to_key(ind); } static inline u32 key_to_hw_index(struct mthca_dev *dev, u32 key) { if (dev->hca_type == ARBEL_NATIVE) - return (key << 24) | (key >> 8); + return arbel_key_to_hw_index(key); else - return key; + return tavor_key_to_hw_index(key); } int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, @@ -268,7 +336,8 @@ int mthca_mr_alloc_phys(struct mthca_dev i <<= 1, ++mr->order) ; /* nothing */ - mr->first_seg = mthca_alloc_mtt(dev, mr->order); + mr->first_seg = mthca_alloc_mtt(dev, mr->order, + &dev->mr_table.mtt_buddy); if (mr->first_seg == -1) goto err_out_table; @@ -361,7 +430,7 @@ err_out_mailbox_free: kfree(mailbox); err_out_free_mtt: - mthca_free_mtt(dev, mr->first_seg, mr->order); + mthca_free_mtt(dev, mr->first_seg, mr->order, &dev->mr_table.mtt_buddy); err_out_table: if (dev->hca_type == ARBEL_NATIVE) @@ -372,6 +441,19 @@ err_out_mpt_free: return err; } +/* Free mr or fmr */ +static void mthca_free_region(struct mthca_dev *dev, u32 lkey, int order, + u32 first_seg, struct mthca_buddy *buddy) +{ + if (order >= 0) + mthca_free_mtt(dev, first_seg, order, buddy); + + if (dev->hca_type == ARBEL_NATIVE) + mthca_table_put(dev, 
dev->mr_table.mpt_table, + arbel_key_to_hw_index(lkey)); + mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, lkey)); +} + void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) { int err; @@ -389,85 +471,411 @@ void mthca_free_mr(struct mthca_dev *dev mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", status); - if (mr->order >= 0) - mthca_free_mtt(dev, mr->first_seg, mr->order); + mthca_free_region(dev, mr->ibmr.lkey, mr->order, mr->first_seg, + &dev->mr_table.mtt_buddy); +} + +int mthca_fmr_alloc(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_fmr *mr) +{ + struct mthca_mpt_entry *mpt_entry; + void *mailbox; + u64 mtt_seg; + u32 key, idx; + u8 status; + int i, err = -ENOMEM, list_len = mr->attr.max_pages; + + might_sleep(); + + if (mr->attr.page_size < 12 || mr->attr.page_size >= 32) + return -EINVAL; + + /* For Arbel, all MTTs must fit in the same page. */ + if (dev->hca_type == ARBEL_NATIVE && + mr->attr.max_pages * sizeof *mr->mem.arbel.mtts > PAGE_SIZE) + return -EINVAL; + + mr->maps = 0; + + key = mthca_alloc(&dev->mr_table.mpt_alloc); + if (key == -1) + return -ENOMEM; + + idx = key & (dev->limits.num_mpts - 1); + mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); + + if (dev->hca_type == ARBEL_NATIVE) { + err = mthca_table_get(dev, dev->mr_table.mpt_table, key); + if (err) + goto err_out_mpt_free; + + mr->mem.arbel.mpt = + mthca_table_find(dev, dev->mr_table.mpt_table, key); + + BUG_ON(!mr->mem.arbel.mpt); + } else + mr->mem.tavor.mpt = dev->mr_table.tavor_fmr.mpt_base + + sizeof *(mr->mem.tavor.mpt) * idx; + + for (i = MTHCA_MTT_SEG_SIZE / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + ; /* nothing */ + + mr->first_seg = mthca_alloc_mtt(dev, mr->order, + dev->mr_table.fmr_mtt_buddy); + if (mr->first_seg == -1) + goto err_out_table; + + mtt_seg = mr->first_seg * MTHCA_MTT_SEG_SIZE; + + if (dev->hca_type == ARBEL_NATIVE) { + mr->mem.arbel.mtts = mthca_table_find(dev, + dev->mr_table.mtt_table, + 
mr->first_seg); + BUG_ON(!mr->mem.arbel.mtts); + } else + mr->mem.tavor.mtts = dev->mr_table.tavor_fmr.mtt_base + mtt_seg; + + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(mr->attr.page_size - 12); + mpt_entry->key = cpu_to_be32(key); + mpt_entry->pd = cpu_to_be32(pd); + memset(&mpt_entry->start, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, start)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + mtt_seg); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + err = mthca_SW2HW_MPT(dev, mpt_entry, + key & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + kfree(mailbox); + return 0; + +err_out_mailbox_free: + kfree(mailbox); + +err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order, + dev->mr_table.fmr_mtt_buddy); +err_out_table: if (dev->hca_type == ARBEL_NATIVE) - mthca_table_put(dev, dev->mr_table.mpt_table, - key_to_hw_index(dev, mr->ibmr.lkey)); - mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, mr->ibmr.lkey)); + mthca_table_put(dev, dev->mr_table.mpt_table, key); + +err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +int mthca_free_fmr(struct mthca_dev *dev, struct mthca_fmr *fmr) +{ + if (fmr->maps) + return -EBUSY; + + mthca_free_region(dev, fmr->ibmr.lkey, fmr->order, fmr->first_seg, 
+ dev->mr_table.fmr_mtt_buddy); + return 0; +} + +#define MTHCA_MPT_STATUS_SW 0xF0 +#define MTHCA_MPT_STATUS_HW 0x00 + +static inline int mthca_check_fmr(struct mthca_fmr *fmr, u64 *page_list, + int list_len, u64 iova) +{ + int i, page_mask; + + if (list_len > fmr->attr.max_pages) + return -EINVAL; + + page_mask = (1 << fmr->attr.page_size) - 1; + + /* We are getting page lists, so va must be page aligned. */ + if (iova & page_mask) + return -EINVAL; + + /* Trust the user not to pass misaligned data in page_list */ + if (0) + for (i = 0; i < list_len; ++i) { + if (page_list[i] & ~page_mask) + return -EINVAL; + } + + if (fmr->maps >= fmr->attr.max_maps) + return -EINVAL; + + return 0; +} + + +int mthca_tavor_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, + int list_len, u64 iova) +{ + struct mthca_fmr *fmr = to_mfmr(ibfmr); + struct mthca_dev *dev = to_mdev(ibfmr->device); + struct mthca_mpt_entry mpt_entry; + u32 key; + int i, err; + + if ((err = mthca_check_fmr(fmr, page_list, list_len, iova))) + return err; + + fmr->maps++; + + key = tavor_key_to_hw_index(fmr->ibmr.lkey); + key += dev->limits.num_mpts; + fmr->ibmr.lkey = fmr->ibmr.rkey = tavor_hw_index_to_key(key); + + writeb(MTHCA_MPT_STATUS_SW, fmr->mem.tavor.mpt); + + for (i = 0; i < list_len; ++i) { + __be64 mtt_entry = cpu_to_be64(page_list[i] | + MTHCA_MTT_FLAG_PRESENT); + mthca_write64_raw(mtt_entry, fmr->mem.tavor.mtts + i); + } + + mpt_entry.lkey = cpu_to_be32(key); + mpt_entry.length = cpu_to_be64(((u64)list_len) * + (1 << fmr->attr.page_size)); + mpt_entry.start = cpu_to_be64(iova); + + writel(mpt_entry.lkey, &fmr->mem.tavor.mpt->key); + memcpy_toio(&fmr->mem.tavor.mpt->start, &mpt_entry.start, + offsetof(struct mthca_mpt_entry, window_count) - + offsetof(struct mthca_mpt_entry, start)); + + writeb(MTHCA_MPT_STATUS_HW, fmr->mem.tavor.mpt); + + return 0; +} + +int mthca_arbel_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, + int list_len, u64 iova) +{ + struct mthca_fmr *fmr = to_mfmr(ibfmr); + 
struct mthca_dev *dev = to_mdev(ibfmr->device); + struct mthca_mpt_entry *mpt_entry; + u8 *mpt_status; + u32 key; + int i, err; + + if ((err = mthca_check_fmr(fmr, page_list, list_len, iova))) + return err; + + fmr->maps++; + + key = arbel_key_to_hw_index(fmr->ibmr.lkey); + key += dev->limits.num_mpts; + fmr->ibmr.lkey = fmr->ibmr.rkey = arbel_hw_index_to_key(key); + + mpt_status = (u8 *)fmr->mem.arbel.mpt; + *mpt_status = MTHCA_MPT_STATUS_SW; + + wmb(); + + for (i = 0; i < list_len; ++i) { + fmr->mem.arbel.mtts[i] = cpu_to_be64(page_list[i] | + MTHCA_MTT_FLAG_PRESENT); + } + + mpt_entry = fmr->mem.arbel.mpt; + fmr->mem.arbel.mpt->lkey = mpt_entry->key = cpu_to_be32(key); + fmr->mem.arbel.mpt->length = cpu_to_be64(((u64)list_len) * + (1 << fmr->attr.page_size)); + fmr->mem.arbel.mpt->start = cpu_to_be64(iova); + + wmb(); + + *mpt_status = MTHCA_MPT_STATUS_HW; + + wmb(); + + return 0; +} + +void mthca_tavor_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr) +{ + u32 key; + + if (!fmr->maps) + return; + + key = tavor_key_to_hw_index(fmr->ibmr.lkey); + key &= dev->limits.num_mpts - 1; + fmr->ibmr.lkey = fmr->ibmr.rkey = tavor_hw_index_to_key(key); + + fmr->maps = 0; + + writeb(MTHCA_MPT_STATUS_SW, fmr->mem.tavor.mpt); +} + +void mthca_arbel_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr) +{ + u32 key; + u8 *mpt_status; + + if (!fmr->maps) + return; + + key = arbel_key_to_hw_index(fmr->ibmr.lkey); + key &= dev->limits.num_mpts - 1; + fmr->ibmr.lkey = fmr->ibmr.rkey = arbel_hw_index_to_key(key); + + fmr->maps = 0; + + mpt_status = (u8 *)fmr->mem.arbel.mpt; + *mpt_status = MTHCA_MPT_STATUS_SW; } int __devinit mthca_init_mr_table(struct mthca_dev *dev) { - int err; - int i, s; + int err, i; + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, dev->limits.num_mpts, ~0, dev->limits.reserved_mrws); if (err) - return err; + goto err_mpt_alloc; - err = -ENOMEM; + if (dev->hca_type != ARBEL_NATIVE && + (dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) +
dev->limits.fmr_reserved_mtts = 0; + else + dev->mthca_flags |= MTHCA_FLAG_FMR; - for (i = 1, dev->mr_table.max_mtt_order = 0; - i < dev->limits.num_mtt_segs; - i <<= 1, ++dev->mr_table.max_mtt_order) - ; /* nothing */ + i = fls(dev->limits.num_mtt_segs - 1); + err = mthca_buddy_init(&dev->mr_table.mtt_buddy, i); - dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * - sizeof (long *), - GFP_KERNEL); - if (!dev->mr_table.mtt_buddy) - goto err_out; + if (err) + goto err_mtt_buddy; - for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) - dev->mr_table.mtt_buddy[i] = NULL; + dev->mr_table.tavor_fmr.mpt_base = NULL; + dev->mr_table.tavor_fmr.mtt_base = NULL; - for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { - s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); - dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), - GFP_KERNEL); - if (!dev->mr_table.mtt_buddy[i]) - goto err_out_free; - bitmap_zero(dev->mr_table.mtt_buddy[i], - 1 << (dev->mr_table.max_mtt_order - i)); - } + if (dev->limits.fmr_reserved_mtts) { + i = fls(dev->limits.fmr_reserved_mtts - 1); - set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + if (i >= 31) { + mthca_warn(dev, "Unable to reserve 2^31 FMR MTTs.\n"); + err = -EINVAL; + goto err_fmr_mpt; + } - for (i = 0; i < dev->mr_table.max_mtt_order; ++i) - if (1 << i >= dev->limits.reserved_mtts) - break; + dev->mr_table.tavor_fmr.mpt_base = + ioremap(dev->mr_table.mpt_base, + (1 << i) * sizeof(struct mthca_mpt_entry)); + + if (!dev->mr_table.tavor_fmr.mpt_base) { + mthca_warn(dev, "MPT ioremap for FMR failed.\n"); + err = -ENOMEM; + goto err_fmr_mpt; + } - if (i == dev->mr_table.max_mtt_order) { - mthca_err(dev, "MTT table of order %d is " - "too small.\n", i); - goto err_out_free; - } + dev->mr_table.tavor_fmr.mtt_base = + ioremap(dev->mr_table.mtt_base, + (1 << i) * MTHCA_MTT_SEG_SIZE); + if (!dev->mr_table.tavor_fmr.mtt_base) { + mthca_warn(dev, "MTT ioremap for FMR failed.\n"); + err = -ENOMEM; + goto 
err_fmr_mtt; + } - (void) mthca_alloc_mtt(dev, i); + err = mthca_buddy_init(&dev->mr_table.tavor_fmr.mtt_buddy, i); + if (err) + goto err_fmr_mtt_buddy; + + /* Prevent regular MRs from using FMR keys */ + err = mthca_buddy_alloc(&dev->mr_table.mtt_buddy, i); + if (err) + goto err_reserve_fmr; + + dev->mr_table.fmr_mtt_buddy = + &dev->mr_table.tavor_fmr.mtt_buddy; + } else + dev->mr_table.fmr_mtt_buddy = &dev->mr_table.mtt_buddy; + + /* FMR table is always the first, take reserved MTTs out of there */ + if (dev->limits.reserved_mtts) { + int seg; + i = fls(dev->limits.reserved_mtts - 1); + seg = mthca_alloc_mtt(dev, i, dev->mr_table.fmr_mtt_buddy); + + if (seg == -1) { + mthca_warn(dev, "MTT table of order %d is too small.\n", + dev->mr_table.fmr_mtt_buddy->max_order); + err = -ENOMEM; + goto err_reserve_mtts; + } + } return 0; - err_out_free: - for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) - kfree(dev->mr_table.mtt_buddy[i]); +err_reserve_mtts: +err_reserve_fmr: + + if (dev->limits.fmr_reserved_mtts) + mthca_buddy_cleanup(&dev->mr_table.tavor_fmr.mtt_buddy); +err_fmr_mtt_buddy: + + if (dev->mr_table.tavor_fmr.mtt_base) + iounmap(dev->mr_table.tavor_fmr.mtt_base); +err_fmr_mtt: + + if (dev->mr_table.tavor_fmr.mpt_base) + iounmap(dev->mr_table.tavor_fmr.mpt_base); +err_fmr_mpt: + + mthca_buddy_cleanup(&dev->mr_table.mtt_buddy); +err_mtt_buddy: - err_out: mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +err_mpt_alloc: return err; } void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) { - int i; - /* XXX check if any MRs are still allocated? 
*/ - for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) - kfree(dev->mr_table.mtt_buddy[i]); - kfree(dev->mr_table.mtt_buddy); + if (dev->limits.fmr_reserved_mtts) + mthca_buddy_cleanup(&dev->mr_table.tavor_fmr.mtt_buddy); + + mthca_buddy_cleanup(&dev->mr_table.mtt_buddy); + if (dev->mr_table.tavor_fmr.mtt_base) + iounmap(dev->mr_table.tavor_fmr.mtt_base); + if (dev->mr_table.tavor_fmr.mpt_base) + iounmap(dev->mr_table.tavor_fmr.mpt_base); mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); } Index: hw/mthca/mthca_memfree.c =================================================================== --- hw/mthca/mthca_memfree.c (revision 2050) +++ hw/mthca/mthca_memfree.c (working copy) @@ -192,6 +192,48 @@ void mthca_table_put(struct mthca_dev *d up(&table->mutex); } +void *mthca_table_find(struct mthca_dev *dev, + struct mthca_icm_table *table, int obj) +{ + int idx, offset, i; + struct mthca_icm_chunk *chunk; + struct mthca_icm *icm; + struct page *page = NULL; + void *p = NULL; + + if (!table->lowmem) + return NULL; + + down(&table->mutex); + + idx = (obj & (table->num_obj - 1)) * table->obj_size; + icm = table->icm[idx / MTHCA_TABLE_CHUNK_SIZE]; + offset = idx % MTHCA_TABLE_CHUNK_SIZE; + + if (!icm) + goto out; + + /* Linear scan of ICM on each access. Eventually we may want to + * rearrange things to use some kind of tree. */ + + list_for_each_entry(chunk, &icm->chunk_list, list) { + for (i = 0; i < chunk->npages; ++i) { + if (chunk->mem[i].length >= offset) { + page = chunk->mem[i].page; + break; + } + offset -= chunk->mem[i].length; + } + } + + if (page) + p = lowmem_page_address(page) + offset; + +out: + up(&table->mutex); + return p; +} + int mthca_table_get_range(struct mthca_dev *dev, struct mthca_icm_table *table, int start, int end) { -- MST - Michael S. Tsirkin From mst at mellanox.co.il Sun Mar 27 07:41:34 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 27 Mar 2005 17:41:34 +0200 Subject: [openib-general] zero-sized memory regions Message-ID: <20050327154134.GB26108@mellanox.co.il> Hi! Currently an attempt to create a zero sized memory region in mthca succeeds, but apparently the key field in the region returned is uninitialized. Further, buffer_list passed in must have length of at least 1 which is somewhat counterintuitive. Should not the operation rather return -EINVAL, and avoid accessing the buffer list at all? Roland - want a patch like this? -- MST - Michael S. Tsirkin From mst at mellanox.co.il Sun Mar 27 07:58:47 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 27 Mar 2005 17:58:47 +0200 Subject: [openib-general] Re: Re: [Andrew Morton] inappropriate use of in_atomic() In-Reply-To: <20050325082349.A22487@topspin.com> References: <52oedq946k.fsf@topspin.com> <20050311073108.GA20989@mellanox.co.il> <20050311154316.E31689@topspin.com> <20050320181242.GA18963@mellanox.co.il> <20050325082349.A22487@topspin.com> Message-ID: <20050327155847.GC26108@mellanox.co.il> Quoting r. Libor Michalek : > Subject: Re: Re: [Andrew Morton] inappropriate use of in_atomic() > > On Sun, Mar 20, 2005 at 08:12:42PM +0200, Michael S. Tsirkin wrote: > > Quoting r. Libor Michalek : > > > Subject: Re: Re: [Andrew Morton] inappropriate use of in_atomic() > > > > > > On Fri, Mar 11, 2005 at 09:31:08AM +0200, Michael S. Tsirkin wrote: > > > > > > > > Sdp also has a couple of uses. > > > > Maybe we can use the atomic branch in all cases here, as well? > > > > Libor? > > > > > > Yes, the case in sdp_iocb.c can probably always take the atomic > > > path. The kmap/kunmap cases really only care whether we're in an > > > interrupt, so switching to in_interrupt() should be sufficient. > > > > Recent comments by Andrew indicate that it is better to always > > use kmap_atomic/kunmap_atomic if possible. This will also > > let us get rid of the wrapper function, which is good. 
> > > > Why do you think we need to kmap? > > I didn't realize that the atomic version was prefered over the > regular kmap. The only thing that needs to be done is to make sure > that the local CPU interrupts are off before calling kamp_atomic, > instead we currently check to see if we're in an interrupt and call > the appropriate function. I have no problem changing it to just > atomic. > > -Libor > My understanding is you must give kmap_atomic the proper parameter: KM_IRQ0/KM_SOFTIRQ0/KM_USR0, to avoid conflicts with other callers of kmap on the same CPU. Something like this then? Signed-off-by: Michael S. Tsirkin Index: ulp/sdp/sdp_iocb.h =================================================================== --- ulp/sdp/sdp_iocb.h (revision 2050) +++ ulp/sdp/sdp_iocb.h (working copy) @@ -133,10 +133,12 @@ */ static inline void *sdp_kmap(struct page *page) { - if (in_atomic() || irqs_disabled()) + if (in_irq()) return kmap_atomic(page, KM_IRQ0); + else if (in_softirq()) + return kmap_atomic(page, KM_SOFTIRQ0); else - return kmap(page); + return kmap_atomic(page, KM_USR0); } /* @@ -144,10 +146,12 @@ */ static inline void sdp_kunmap(struct page *page) { - if (in_atomic() || irqs_disabled()) + if (in_irq()) kunmap_atomic(page, KM_IRQ0); + else if (in_softirq()) + kunmap_atomic(page, KM_SOFTIRQ0); else - kunmap(page); + kunmap_atomic(page, KM_USR0); } #endif /* _SDP_IOCB_H */ -- MST - Michael S. Tsirkin From roland at topspin.com Mon Mar 28 07:08:54 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 28 Mar 2005 07:08:54 -0800 Subject: [openib-general] [PATCH] FMR support in mthca References: <20050327153112.GA26108@mellanox.co.il> Message-ID: <52y8c7zu2h.fsf@topspin.com> Excellent! I'll review this ASAP, and see if Libor and I can get SDP/AIO going with it. - R. 
From roland at topspin.com Mon Mar 28 08:08:54 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 28 Mar 2005 08:08:54 -0800 Subject: [openib-general] zero-sized memory regions References: <20050327154134.GB26108@mellanox.co.il> Message-ID: <52br93zrah.fsf@topspin.com> Michael> Hi! Currently an attempt to create a zero sized memory Michael> region in mthca succeeds, but apparently the key field in Michael> the region returned is uninitialized. Further, Michael> buffer_list passed in must have length of at least 1 Michael> which is somewhat counterintuitive. Michael> Should not the operation rather return -EINVAL, and avoid Michael> accessing the buffer list at all? Roland - want a patch Michael> like this? From looking at the IB spec, it seems that a length of 0 is valid. However I agree that the current behavior is not reasonable, and I don't particularly feel like implementing 0 length regions. So returning -EINVAL before we do anything is probably the best way to go. - R. From roland at topspin.com Mon Mar 28 08:08:53 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 28 Mar 2005 08:08:53 -0800 Subject: [openib-general] Re: [PATCH] IPoIB: Set hardware header on packet receive References: <1111928923.4650.70.camel@localhost.localdomain> Message-ID: <52hdivzrai.fsf@topspin.com> Hal> IPoIB: Set hardware header on packet receive Needed for Hal> PF_PACKET/SOCK_PACKET What is this actually used for? Wouldn't it make more sense to set mac.raw before we pull off the IPoIB header? - R.
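[Editorial sketch] The ordering question raised above (record the link-level header pointer before or after pulling the IPoIB encapsulation header) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct toy_skb, toy_pull() and toy_receive() are hypothetical stand-ins invented for illustration, not kernel APIs, and the 4-byte encapsulation length (2-byte protocol type plus 2 reserved bytes) is assumed rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed encapsulation length: 2-byte protocol type + 2 reserved bytes. */
#define TOY_ENCAP_LEN 4

/* Hypothetical stand-in for struct sk_buff: only what the example needs. */
struct toy_skb {
	uint8_t *data;	/* start of the not-yet-consumed bytes */
	size_t len;	/* bytes remaining from data */
	uint8_t *mac;	/* models skb->mac.raw */
};

/* Models skb_pull(): consume n bytes from the front of the buffer. */
static uint8_t *toy_pull(struct toy_skb *skb, size_t n)
{
	assert(skb != NULL);
	if (n > skb->len)
		return NULL;
	skb->data += n;
	skb->len -= n;
	return skb->data;
}

/*
 * Receive path in the order suggested in the thread: read the protocol
 * type, remember where the encapsulation header starts, then pull it.
 * Recording the pointer after the pull would leave it at the payload.
 */
static int toy_receive(struct toy_skb *skb, uint16_t *proto)
{
	if (skb->len < TOY_ENCAP_LEN)
		return -1;
	memcpy(proto, skb->data, sizeof *proto);	/* kept in wire byte order */
	skb->mac = skb->data;				/* before the pull */
	if (!toy_pull(skb, TOY_ENCAP_LEN))
		return -1;
	return 0;
}
```

With this ordering a packet-socket style consumer that looks at the recorded header pointer still sees the encapsulation header, including the protocol type, while data already points at the payload.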
From halr at voltaire.com Mon Mar 28 08:33:48 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Mar 2005 11:33:48 -0500 Subject: [openib-general] Re: [PATCH] IPoIB: Set hardware header on packet receive In-Reply-To: <52hdivzrai.fsf@topspin.com> References: <1111928923.4650.70.camel@localhost.localdomain> <52hdivzrai.fsf@topspin.com> Message-ID: <1112027628.4650.149.camel@localhost.localdomain> On Mon, 2005-03-28 at 11:08, Roland Dreier wrote: > What is this actually used for? Wouldn't it make more sense to set > mac.raw before we pull off the IPoIB header? I think it depends on how symmetric you want to make the receive side and what information you want to pass up. The IPoIB encapsulation actually is needed, otherwise the protocol type cannot be determined. The DGID is also probably needed as it addresses the following comment in ipoib_ib.c: /* XXX get correct PACKET_ type here */ skb->pkt_type = PACKET_HOST; Checking the incoming DGID is the only good way to determine whether the incoming packet is unicast, broadcast, or multicast. -- Hal From mst at mellanox.co.il Mon Mar 28 08:42:23 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Mar 2005 18:42:23 +0200 Subject: [openib-general] Re: zero-sized memory regions In-Reply-To: <52br93zrah.fsf@topspin.com> References: <20050327154134.GB26108@mellanox.co.il> <52br93zrah.fsf@topspin.com> Message-ID: <20050328164223.GP26108@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: zero-sized memory regions > > Michael> Hi! Currently an attempt to create a zero sized memory > Michael> region in mthca succeeds, but apparently the key field in > Michael> the region returned is uninitialized. Further, > Michael> buffer_list passed in must have length of at least 1 > Michael> which is somewhat counterintuitive. > > Michael> Should not the operation rather return -EINVAL, and avoid > Michael> accessing the buffer list at all? Roland - want a patch > Michael> like this? 
> > >From looking at the IB spec, it seems that a length of 0 is valid. > However I agree that the current behavior is not reasonable, and I > don't particularly feel like implementing 0 length regions. So > returning -EINVAL before we do anything is probably the best way to > go. > > - R. > Something like the following then (untested) Disable 0-sized regions. They are legal in IB spec, but currently unused, and I dont feel like implementing them now. Signed-off-by: Michael S. Tsirkin Index: mthca_provider.c =================================================================== --- mthca_provider.c (revision 2054) +++ mthca_provider.c (working copy) @@ -483,6 +483,9 @@ static struct ib_mr *mthca_reg_phys_mr(s int err; int i, j, n; + if (!num_phys_buf) + return ERR_PTR(-EINVAL); + /* First check that we have enough alignment */ if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) return ERR_PTR(-EINVAL); -- MST - Michael S. Tsirkin From halr at voltaire.com Mon Mar 28 08:45:11 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Mar 2005 11:45:11 -0500 Subject: [openib-general] Re: [PATCH] IPoIB: Set hardware header on packet receive In-Reply-To: <52u0mvyb3n.fsf@topspin.com> References: <1111928923.4650.70.camel@localhost.localdomain> <52hdivzrai.fsf@topspin.com> <1112027628.4650.149.camel@localhost.localdomain> <52u0mvyb3n.fsf@topspin.com> Message-ID: <1112028310.4650.153.camel@localhost.localdomain> On Mon, 2005-03-28 at 11:43, Roland Dreier wrote: > Hal> I think it depends on how symmetric you want to make the > Hal> receive side and what information you want to pass up. > > Hal> The IPoIB encapsulation actually is needed, otherwise the > Hal> protocol type cannot be determined. > > So we should set mac.raw before the skb_pull rather than where you put > the assignment? Yes. I think that is better and more correct. I am experimenting with this now. 
> Hal> The DGID is also probably needed as it addresses the > Hal> following comment in ipoib_ib.c: > > Hal> /* XXX get correct PACKET_type here */ > skb-> pkt_type = PACKET_HOST; > > Hal> Checking the incoming DGID is the only good way to determine > Hal> whether the incoming packet is unicast, broadcast, or > Hal> multicast. > > Yes. I haven't added any code there because just setting PACKET_HOST > all the time doesn't seem to affect anything. I was wondering about this myself as to whether this matters or not. Other drivers determine this from the destination MAC address. Does it affect the incoming delivery if more than one process is receiving an IP broadcast or multicast ? I didn't chase it all the way down in the Linux kernel. Do you know ? -- Hal From roland at topspin.com Mon Mar 28 08:43:56 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 28 Mar 2005 08:43:56 -0800 Subject: [openib-general] Re: [PATCH] IPoIB: Set hardware header on packet receive In-Reply-To: <1112027628.4650.149.camel@localhost.localdomain> (Hal Rosenstock's message of "28 Mar 2005 11:33:48 -0500") References: <1111928923.4650.70.camel@localhost.localdomain> <52hdivzrai.fsf@topspin.com> <1112027628.4650.149.camel@localhost.localdomain> Message-ID: <52u0mvyb3n.fsf@topspin.com> Hal> I think it depends on how symmetric you want to make the Hal> receive side and what information you want to pass up. Hal> The IPoIB encapsulation actually is needed, otherwise the Hal> protocol type cannot be determined. So we should set mac.raw before the skb_pull rather than where you put the assignment? Hal> The DGID is also probably needed as it addresses the Hal> following comment in ipoib_ib.c: Hal> /* XXX get correct PACKET_type here */ skb-> pkt_type = PACKET_HOST; Hal> Checking the incoming DGID is the only good way to determine Hal> whether the incoming packet is unicast, broadcast, or Hal> multicast. Yes. 
I haven't added any code there because just setting PACKET_HOST all the time doesn't seem to affect anything. - R. From halr at voltaire.com Mon Mar 28 09:34:08 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Mar 2005 12:34:08 -0500 Subject: [openib-general] [PATCH] IPoIB: Set hardware header on packet receive Message-ID: <1112031248.4650.164.camel@localhost.localdomain> IPoIB: Set hardware header on packet receive Needed for PF_PACKET/SOCK_PACKET Signed-off-by: Hal Rosenstock Index: ipoib_ib.c =================================================================== --- ipoib_ib.c (revision 2032) +++ ipoib_ib.c (working copy) @@ -202,6 +202,7 @@ wc->src_qp != priv->qp->qp_num) { skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb->mac.raw = skb->data; skb_pull(skb, IPOIB_ENCAP_LEN); dev->last_rx = jiffies; From iod00d at hp.com Mon Mar 28 09:56:21 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 28 Mar 2005 09:56:21 -0800 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <20050327153112.GA26108@mellanox.co.il> References: <20050327153112.GA26108@mellanox.co.il> Message-ID: <20050328175621.GB19170@esmail.cup.hp.com> On Sun, Mar 27, 2005 at 05:31:13PM +0200, Michael S. Tsirkin wrote: > For Tavor, MTTs for FMR are separate from regular MTTs, and are reserved > at driver initialization. This is done to limit the amount of > virtual memory needed to map the MTTs. > For Arbel, there's no such limitation, and all MTTs and MPTs may be used > for FMR or for regular MR. > It would be easy to remove the limitation for Tavor for 64-bit systems, where > it's feasible to ioremap the whole MTT table. Let me know if this is > of interest. I have the impression most of this forum is currently running "native" bits on 64-bit arches: sparc64, amd64, ia64, ppc64. I.e. this feature would be well tested if those 4 arches are enabled. I'll assert that's NOT how gen2 will get used once distro's pick it up. 
Historically, ia32 was 95% of the mainline distro's business. While I don't expect that to change significantly by next year, I do expect new *Linux* x86 servers to be running a 64-bit OS and support both 32-bit and 64-bit user space. It seems most 2U/2-socket boxes already support more than 4GB of RAM and it would make sense the default OS be a 64-bit one. However, that's just speculation. I have no visibility what the x86 side of HP (or other vendors) is doing next year. grant From roland at topspin.com Mon Mar 28 09:45:34 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 28 Mar 2005 09:45:34 -0800 Subject: [openib-general] Re: [PATCH] IPoIB: Set hardware header on packet receive In-Reply-To: <1112028310.4650.153.camel@localhost.localdomain> (Hal Rosenstock's message of "28 Mar 2005 11:45:11 -0500") References: <1111928923.4650.70.camel@localhost.localdomain> <52hdivzrai.fsf@topspin.com> <1112027628.4650.149.camel@localhost.localdomain> <52u0mvyb3n.fsf@topspin.com> <1112028310.4650.153.camel@localhost.localdomain> Message-ID: <52d5tjy88x.fsf@topspin.com> Hal> I was wondering about this myself as to whether this matters Hal> or not. Other drivers determine this from the destination Hal> MAC address. Does it affect the incoming delivery if more Hal> than one process is receiving an IP broadcast or multicast ? Hal> I didn't chase it all the way down in the Linux kernel. Do Hal> you know ? There's a lot of places where the field is checked. However all the ones I've seen are just ignoring packets that aren't PACKET_HOST or sometimes looking for PACKET_LOOPBACK packets. So PACKET_BROADCAST or PACKET_MULTICAST may be checked somewhere but I'm not aware of anywhere that they are. - R. 
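[Editorial sketch] Hal's suggestion earlier in this thread, deriving the packet type from the incoming DGID instead of always using PACKET_HOST, might look roughly like the following userspace model. The names toy_pkt_type and toy_classify_dgid() are hypothetical, and the caller is assumed to supply the interface's broadcast GID; the only fact relied on is that InfiniBand multicast GIDs begin with the byte 0xff. It is not the code that was committed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the PACKET_HOST/BROADCAST/MULTICAST distinction. */
enum toy_pkt_type {
	TOY_PACKET_HOST,
	TOY_PACKET_BROADCAST,
	TOY_PACKET_MULTICAST
};

/*
 * Classify a received packet by its 16-byte destination GID.
 * bcast_gid is the broadcast group GID of the receiving interface
 * (assumed to be known to the caller).
 */
static enum toy_pkt_type toy_classify_dgid(const uint8_t *dgid,
					   const uint8_t *bcast_gid)
{
	assert(dgid != NULL && bcast_gid != NULL);

	if (dgid[0] != 0xff)			/* not a multicast GID */
		return TOY_PACKET_HOST;
	if (memcmp(dgid, bcast_gid, 16) == 0)	/* the broadcast group */
		return TOY_PACKET_BROADCAST;
	return TOY_PACKET_MULTICAST;		/* any other multicast group */
}
```

As Roland notes above, the callers he has seen mostly test for PACKET_HOST, so whether the finer classification changes delivery is left open here.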
From mshefty at ichips.intel.com Mon Mar 28 10:57:58 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Mar 2005 10:57:58 -0800 Subject: [openib-general] [PATCH] [RMPP] receive RMPP support In-Reply-To: <20050318182430.0e6e8a16.mshefty@ichips.intel.com> References: <20050318182430.0e6e8a16.mshefty@ichips.intel.com> Message-ID: <424853B6.2090809@ichips.intel.com> Sean Hefty wrote: > This patch adds support to receive RMPP MADs. Notes: I've committed the RMPP support after updating the code based on the received comments. - Sean From mshefty at ichips.intel.com Mon Mar 28 11:31:40 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Mar 2005 11:31:40 -0800 Subject: [openib-general] [PATCH] [CM] fix freeing messages in CM error handling cases Message-ID: <20050328113140.6fa8f31f.mshefty@ichips.intel.com> This patch fixes cases in the CM where the wrong message is freed if an error occurs trying to transition a cm_id to a new state. For consistency, I had replicated this error throughout the CM, and not just when sending DREQ, DREP. It should be fixed in all places now. 
Signed-off-by: Sean Hefty Index: core/cm.c =================================================================== --- core/cm.c (revision 2055) +++ core/cm.c (working copy) @@ -1160,7 +1160,7 @@ int ib_send_cm_rep(struct ib_cm_id *cm_i if (ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); - cm_free_msg(cm_id_priv->msg); + cm_free_msg(msg); goto out; } @@ -1205,7 +1205,7 @@ static void cm_resend_rtu(struct cm_id_p ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, &msg->send_wr, &bad_send_wr); if (ret) - cm_free_msg(cm_id_priv->msg); + cm_free_msg(msg); } int ib_send_cm_rtu(struct ib_cm_id *cm_id, @@ -1239,7 +1239,7 @@ int ib_send_cm_rtu(struct ib_cm_id *cm_i if (ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); - cm_free_msg(cm_id_priv->msg); + cm_free_msg(msg); goto out; } @@ -1503,7 +1503,7 @@ int ib_send_cm_dreq(struct ib_cm_id *cm_ spin_unlock_irqrestore(&cm_id_priv->lock, flags); out: if (!msg_ret && ret) - cm_free_msg(cm_id_priv->msg); + cm_free_msg(msg); return ret; } EXPORT_SYMBOL(ib_send_cm_dreq); @@ -1537,7 +1537,7 @@ static void cm_resend_drep(struct cm_id_ ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, &msg->send_wr, &bad_send_wr); if (ret) - cm_free_msg(cm_id_priv->msg); + cm_free_msg(msg); } int ib_send_cm_drep(struct ib_cm_id *cm_id, @@ -1572,7 +1572,7 @@ int ib_send_cm_drep(struct ib_cm_id *cm_ spin_unlock_irqrestore(&cm_id_priv->lock, flags); out: if (!msg_ret && ret) - cm_free_msg(cm_id_priv->msg); + cm_free_msg(msg); return ret; } EXPORT_SYMBOL(ib_send_cm_drep); @@ -1755,7 +1755,7 @@ int ib_send_cm_rej(struct ib_cm_id *cm_i &msg->send_wr, &bad_send_wr); out: if (!msg_ret && ret) - cm_free_msg(cm_id_priv->msg); + cm_free_msg(msg); return ret; } EXPORT_SYMBOL(ib_send_cm_rej); @@ -1942,7 +1942,7 @@ int ib_send_cm_mra(struct ib_cm_id *cm_i if (ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); - cm_free_msg(cm_id_priv->msg); + cm_free_msg(msg); goto out; } @@ -2038,7 +2038,7 @@ int ib_send_cm_lap(struct ib_cm_id *cm_i if 
(ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); - cm_free_msg(cm_id_priv->msg); + cm_free_msg(msg); goto out; } @@ -2174,7 +2174,7 @@ int ib_send_cm_apr(struct ib_cm_id *cm_i if (ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); - cm_free_msg(cm_id_priv->msg); + cm_free_msg(msg); goto out; } cm_id->lap_state = IB_CM_LAP_IDLE; @@ -2322,7 +2322,7 @@ int ib_send_cm_sidr_req(struct ib_cm_id if (ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); - cm_free_msg(cm_id_priv->msg); + cm_free_msg(msg); goto out; } cm_id->state = IB_CM_SIDR_REQ_SENT; @@ -2456,7 +2456,7 @@ int ib_send_cm_sidr_rep(struct ib_cm_id if (ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); - cm_free_msg(cm_id_priv->msg); + cm_free_msg(msg); goto out; } cm_id->state = IB_CM_IDLE; From halr at voltaire.com Mon Mar 28 12:07:46 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Mar 2005 15:07:46 -0500 Subject: [openib-general] Re: [PATCH] IPoIB: Set hardware header on packet receive In-Reply-To: <52hdivy8i1.fsf@topspin.com> References: <1112030871.4650.162.camel@localhost.localdomain> <52hdivy8i1.fsf@topspin.com> Message-ID: <1112040466.4650.208.camel@localhost.localdomain> On Mon, 2005-03-28 at 12:40, Roland Dreier wrote: > Hal> Needed for PF_PACKET/SOCK_PACKET > > So what happens if this isn't set? If such a socket is created after the IPoIB interface is added without a patch along these lines and an IPoIB packet is received, the kernel hangs. > What applications use PF_PACKET/SOCK_PACKET? ISC DHCP for one. -- Hal From roland at topspin.com Mon Mar 28 13:49:25 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 28 Mar 2005 13:49:25 -0800 Subject: [openib-general] Re: [PATCH] IPoIB: Set hardware header on packet receive In-Reply-To: <1112030871.4650.162.camel@localhost.localdomain> (Hal Rosenstock's message of "28 Mar 2005 12:27:51 -0500") References: <1112030871.4650.162.camel@localhost.localdomain> Message-ID: <52aconwie2.fsf@topspin.com> Thanks, applied. 
- R. From gregkh at suse.de Mon Mar 28 14:58:18 2005 From: gregkh at suse.de (Greg KH) Date: Mon, 28 Mar 2005 14:58:18 -0800 Subject: [openib-general] Re: [PATCH] disable MSI for AMD-8131 In-Reply-To: <20050306202845.GE8486@mellanox.co.il> References: <20050306202845.GE8486@mellanox.co.il> Message-ID: <20050328225818.GA4919@kroah.com> On Sun, Mar 06, 2005 at 10:28:45PM +0200, Michael S. Tsirkin wrote: > Greg, Martin, > > The AMD-8131 I/O APIC (device id 1022:7450/7451) does not support message > signalled interrupts. Thus, if a device driver attempts to enable msi, > it will succeed, but interrupts are not actually delivered to the cpu. > The Nforce chipsets do not seem to have this limitation. > AMD confirmed that MSI mode is unsupported with this APIC. > > The following patch adds a flag to pci quirks to detect this and disable msi. > > Please let me know what you think. > > Signed-off-by: Michael S. Tsirkin Looks good, applied, thanks. greg k-h From mshefty at ichips.intel.com Mon Mar 28 15:14:39 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Mar 2005 15:14:39 -0800 Subject: [openib-general] [RMPP] RFC on retry sending RMPP MADs Message-ID: <42488FDF.2050608@ichips.intel.com> After studying how to send RMPP MADs, there's not an efficient way for the MAD layer to resend a segment in case of a lost packet or ACK. A simple solution would be to add a retry counter to ib_send_wr. This would instruct the MAD layer to retry a send a specific number of times until an ACK is received. (An alternate solution would be to retry a fixed number of times.) This opens the potential of using the retry counter not just with RMPP, but also for normal request-response MADs, allowing for automatic linear retries. We wouldn't have to implement this support, but the API would be there. For the case of supporting RMPP, does anyone see any other alternatives? Any other comments?
- Sean From roland at topspin.com Mon Mar 28 15:58:28 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 28 Mar 2005 15:58:28 -0800 Subject: [openib-general] What context can CM be called from? Message-ID: <52aconuxuj.fsf@topspin.com> Is it supposed to be OK to call CM functions such as ib_send_cm_dreq() from interrupt context? ib_send_cm_dreq() calls cm_alloc_msg(), and in the current CM code, cm_alloc_msg() does m = kmalloc(sizeof *m, GFP_KERNEL); which makes it unsafe to call from interrupt context (as well as triggering __might_sleep warnings if you do that). I saw this trying out Libor's current SDP code, and I'm wondering if the fix should be in SDP or the CM. - R. From sean.hefty at intel.com Mon Mar 28 16:14:15 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 28 Mar 2005 16:14:15 -0800 Subject: [openib-general] What context can CM be called from? In-Reply-To: <52aconuxuj.fsf@topspin.com> Message-ID: >Is it supposed to be OK to call CM functions such as ib_send_cm_dreq() >from interrupt context? ib_send_cm_dreq() calls cm_alloc_msg(), and >in the current CM code, cm_alloc_msg() does > > m = kmalloc(sizeof *m, GFP_KERNEL); In short, no. We might be able to change the CM to be called at interrupt, but doing so requires changes to the verbs implementation. The cm_alloc_msg() calls ib_create_ah(), which I believe calls kmalloc in a similar fashion. I should also note that cm_free_msg() calls ib_destroy_ah(). I looked at associating the address handle with the cm_id, rather than the message, but it made things more difficult when sending messages that were not associated with a cm_id. - Sean From roland at topspin.com Mon Mar 28 16:30:48 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 28 Mar 2005 16:30:48 -0800 Subject: [openib-general] What context can CM be called from?
In-Reply-To: (Sean Hefty's message of "Mon, 28 Mar 2005 16:14:15 -0800") References: Message-ID: <523bufuwcn.fsf@topspin.com> Sean> We might be able to change the CM to be called at interrupt, Sean> but doing so requires changes to the verbs implementation. Sean> The cm_alloc_msg() calls ib_create_ah(), which I believes Sean> calls kmalloc in a similar fashion. I should also note that Sean> cm_free_msg() calls ib_destroy_ah(). Yep good point. A particular implementation of ib_create_ah() may be safe from interrupt context but in general ib_create_ah() needs to be allowed to sleep. So I guess you've got your marching orders, Libor... - R. From libor at topspin.com Mon Mar 28 17:03:51 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 28 Mar 2005 17:03:51 -0800 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <20050327153112.GA26108@mellanox.co.il>; from mst@mellanox.co.il on Sun, Mar 27, 2005 at 05:31:13PM +0200 References: <20050327153112.GA26108@mellanox.co.il> Message-ID: <20050328170351.B30499@topspin.com> On Sun, Mar 27, 2005 at 05:31:13PM +0200, Michael S. Tsirkin wrote: > OK, here's an updated version of the patch. This passed basic > tests: allocate/free, map/remap/unmap. > > For Tavor, MTTs for FMR are separate from regular MTTs, and are reserved > at driver initialization. This is done to limit the amount of > virtual memory needed to map the MTTs. > For Arbel, there's no such limitation, and all MTTs and MPTs may be used > for FMR or for regular MR. > It would be easy to remove the limitation for Tavor for 64-bit systems, where > it's feasible to ioremap the whole MTT table. Let me know if this is > of interest. > > Please comment. I haven't looked closely at the code yet, but I did try it out with SDP/AIO on a pair of x86 systems with Tavors and a pair of x86_64 systems with Arbels. With a small change to core/fmr_pool.c and enabling pool creation in SDP it worked as expected. 
Here are throughput results: x86 x86_64 -------- -------- SDP sync 610 MB/s 710 MB/s SDP async (hit) 740 MB/s 910 MB/s SDP async (miss) 590 MB/s 910 MB/s For sync sockets I used 81600 byte buffers. For async socket I kept 20 96K buffers in flight. For the FMR pool cache hit async results I used only 20 different buffers. For the FMR pool cache miss async results I used 1000 different buffers, of which only 20 were in flight at a time. -Libor Here is the change I made to core/fmr_pool.c: Index: fmr_pool.c =================================================================== --- fmr_pool.c (revision 2055) +++ fmr_pool.c (working copy) @@ -105,7 +105,7 @@ { return jhash_2words((u32) first_page, (u32) (first_page >> 32), - 0); + 0) & (IB_FMR_HASH_SIZE - 1); } /* Caller must hold pool_lock */ From roland at topspin.com Mon Mar 28 17:24:45 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 28 Mar 2005 17:24:45 -0800 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <20050328170351.B30499@topspin.com> (Libor Michalek's message of "Mon, 28 Mar 2005 17:03:51 -0800") References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> Message-ID: <52y8c7tfaa.fsf@topspin.com> Libor> Here is the change I made to core/fmr_pool.c: BTW I already committed this fix. - R. From libor at topspin.com Mon Mar 28 17:55:23 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 28 Mar 2005 17:55:23 -0800 Subject: [openib-general] Re: Re: [Andrew Morton] inappropriate use of in_atomic() In-Reply-To: <20050327155847.GC26108@mellanox.co.il>; from mst@mellanox.co.il on Sun, Mar 27, 2005 at 05:58:47PM +0200 References: <52oedq946k.fsf@topspin.com> <20050311073108.GA20989@mellanox.co.il> <20050311154316.E31689@topspin.com> <20050320181242.GA18963@mellanox.co.il> <20050325082349.A22487@topspin.com> <20050327155847.GC26108@mellanox.co.il> Message-ID: <20050328175523.C30499@topspin.com> On Sun, Mar 27, 2005 at 05:58:47PM +0200, Michael S. 
Tsirkin wrote: > > My understanding is you must give kmap_atomic the proper parameter: > KM_IRQ0/KM_SOFTIRQ0/KM_USR0, to avoid conflicts with other callers of > kmap on the same CPU. > > Something like this then? If you're going to check for in_irq() why not just use kmap() when you are not in an interrupt? I think the benefit of using kmap_atomic all the time is that you don't need to check if you are in an interrupt, you just need to make sure the local interrupts are disabled in case you are not in an interrupt. Once they are disabled I think you can use KM_IRQ0 for all cases. With interrupts disabled you should never get a collision on the KM_IRQ0 page as long as the map/unmap occur before the interrupt is enabled, which it would in the SDP case. -Libor > Signed-off-by: Michael S. Tsirkin > > Index: ulp/sdp/sdp_iocb.h > =================================================================== > --- ulp/sdp/sdp_iocb.h (revision 2050) > +++ ulp/sdp/sdp_iocb.h (working copy) > @@ -133,10 +133,12 @@ > */ > static inline void *sdp_kmap(struct page *page) > { > - if (in_atomic() || irqs_disabled()) > + if (in_irq()) > return kmap_atomic(page, KM_IRQ0); > + else if (in_softirq()) > + return kmap_atomic(page, KM_SOFTIRQ0); > else > - return kmap(page); > + return kmap_atomic(page, KM_USR0); > } > > /* > @@ -144,10 +146,12 @@ > */ > static inline void sdp_kunmap(struct page *page) > { > - if (in_atomic() || irqs_disabled()) > + if (in_irq()) > kunmap_atomic(page, KM_IRQ0); > + else if (in_softirq()) > + kunmap_atomic(page, KM_SOFTIRQ0); > else > - kunmap(page); > + kunmap_atomic(page, KM_USR0); > } > > #endif /* _SDP_IOCB_H */ > -- > MST - Michael S. 
Tsirkin

From iod00d at hp.com Mon Mar 28 20:05:10 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 28 Mar 2005 20:05:10 -0800
Subject: [openib-general] Re: Re: [Andrew Morton] inappropriate use of in_atomic()
In-Reply-To: <20050328175523.C30499@topspin.com>
References: <52oedq946k.fsf@topspin.com> <20050311073108.GA20989@mellanox.co.il> <20050311154316.E31689@topspin.com> <20050320181242.GA18963@mellanox.co.il> <20050325082349.A22487@topspin.com> <20050327155847.GC26108@mellanox.co.il> <20050328175523.C30499@topspin.com>
Message-ID: <20050329040510.GB20296@esmail.cup.hp.com>

On Mon, Mar 28, 2005 at 05:55:23PM -0800, Libor Michalek wrote:
> On Sun, Mar 27, 2005 at 05:58:47PM +0200, Michael S. Tsirkin wrote:
> >
> > My understanding is you must give kmap_atomic the proper parameter:
> > KM_IRQ0/KM_SOFTIRQ0/KM_USR0, to avoid conflicts with other callers of
> > kmap on the same CPU.
> >
> > Something like this then?
>
> If you're going to check for in_irq() why not just use kmap() when
> you are not in an interrupt? I think the benefit of using kmap_atomic
> all the time is that you don't need to check if you are in an
> interrupt, you just need to make sure the local interrupts are
> disabled in case you are not in an interrupt.

Sorry - parsing that last sentence kept me busy for a bit. :^)

I just wanted to point out that checking if the CPU is in an interrupt context can be more expensive than blindly turning interrupts off. If the branch is mispredicted, one loses on most architectures. Typically, masking *all* interrupts for the "current" cpu is a very efficient operation and has a smaller i-cache footprint.
grant From roland at topspin.com Mon Mar 28 19:43:56 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 28 Mar 2005 19:43:56 -0800 Subject: [openib-general] [PATCH] RFC: allow send only registration from userspace Message-ID: <52ll87t8ub.fsf@topspin.com> It seems that ib_umad does not allow userspace to register an agent for solicited MADs only. The documentation says that if userspace passes a mgmt_class of 0 then the agent will not receive any unsolicited MADs, but in this case ib_umad still passes a ib_mad_reg_req struct to ib_register_mad_agent() so the registration just fails. I'm surprised that no one has complained about this yet. How did ibping work in userspace if the kernel had already registered for the ping class? Does this patch look right to everyone? - R. Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 2060) +++ infiniband/core/user_mad.c (working copy) @@ -389,15 +389,17 @@ static int ib_umad_reg_agent(struct ib_u goto out; found: - req.mgmt_class = ureq.mgmt_class; - req.mgmt_class_version = ureq.mgmt_class_version; - memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); - memcpy(req.oui, ureq.oui, sizeof req.oui); + if (ureq.mgmt_class) { + req.mgmt_class = ureq.mgmt_class; + req.mgmt_class_version = ureq.mgmt_class_version; + memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + memcpy(req.oui, ureq.oui, sizeof req.oui); + } agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, ureq.qpn ? IB_QPT_GSI : IB_QPT_SMI, - &req, 0, send_handler, recv_handler, - file); + ureq.mgmt_class ? 
&req : NULL,
+ 0, send_handler, recv_handler, file);
 if (IS_ERR(agent)) {
 ret = PTR_ERR(agent);
 goto out;

From iod00d at hp.com Mon Mar 28 21:41:58 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 28 Mar 2005 21:41:58 -0800
Subject: [openib-general] BUG ipoib oops in iptables/netfilter
Message-ID: <20050329054158.GA21304@esmail.cup.hp.com>

Hi all,

ia64 box data page faulted in the kernel when netperf "netserver" started on an rx2600 for the first time since booting. Using a 2.6.11 kernel with netfiltering enabled and openib.org SVN v2050 drivers/infiniband from gen2 branch. I don't think this is an arch specific bug.

ib0 is configured as 10.0.0.51. I tried to start netperf from a second node (10.0.0.113) with:

/usr/local/bin/netperf -c -l 60 -H 10.0.0.51 -t TCP_STREAM -- -m 8192 -s 262144 -S 262144

10.0.0.113 had just completed a few runs of netperf against a 3rd node (10.0.0.81) just fine - which does NOT have any iptables rules set up.

A set of iptables rules is loaded on 10.0.0.51 to firewall an external-facing NIC (gsyprf3 eth0). None of the rules reference ib0 since the firewall rc script doesn't know anything about ib0/1. IIRC, I previously avoided this by clearing out all the iptables rules.

All three systems are running identical kernel/modules and debian "testing" user space bits.

This is the second time I've seen this (over the past week). But I haven't tried to track it down...would like to finish collecting perf data first.

FWIW, default IPoIB perf is pathetic: ~1.5-1.6Gb/s with the above netperf command line. I was trying to (and will) collect netperf numbers again with msi_x=1.

Tombstone follows.
thanks, grant gsyprf3:~# modprobe ib_mthca msi_x=1 ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:81:00.0) GSI 60 (level, low) -> CPU 1 (0x0100) vector 69 ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 69 gsyprf3:~# modprobe ib_ipoib gsyprf3:~# gsyprf3:~# lsmod Module Size Used by ib_ipoib 90488 0 ib_sa 23980 1 ib_ipoib ib_mthca 211335 0 ib_mad 71400 2 ib_sa,ib_mthca ib_core 85288 4 ib_ipoib,ib_sa,ib_mthca,ib_mad ipt_state 5528 13 qla2300 127272 0 qla2xxx 250463 1 qla2300 scsi_transport_fc 45672 1 qla2xxx e1000 187588 0 tg3 197888 0 e100 79630 0 dm_mod 136584 0 gsyprf3:~# ifconfig ib0 10.0.0.51 netmask 255.255.255.0 broadcast 10.0.0.255 ... gsyprf3:~# ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:10.0.0.51 Bcast:10.0.0.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) gsyprf3:~# Unable to handle kernel NULL pointer dereference (address 0000000000000001) swapper[0]: Oops 8813272891392 [1] Modules linked in: ib_ipoib ib_sa ib_mthca ib_mad ib_core ipt_state qla2300 qla2xxx scsi_transport_fc e1000 tg3 e100 dm_mod Pid: 0, CPU 0, comm: swapper psr : 0000101008026038 ifs : 8000000000000004 ip : [] Not tainted ip is at __copy_user+0x890/0x940 unat: 0000000000000000 pfs : 0000000000000a18 rsc : 0000000000000003 rnat: 0000000000000000 bsps: 0000000000000000 pr : 8000000a955655a5 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f csd : 0000000000000000 ssd : 0000000000000000 b0 : a000000100676f70 b6 : a000000100003320 b7 : a0000001006770a0 f6 : 000000000000000000000 f7 : 1003e0000000000000080 f8 : 1003e0000000000002000 f9 : 1003e0000000000002000 f10 : 1000682fffffff7d00000 f11 : 1003e0000000000000000 r1 : 
a000000100cb1950 r2 : 0000000000000000 r3 : 0000000000000001 r8 : e0000000060ee006 r9 : 0000000000000000 r10 : e000000005c37ef8 r11 : 000000000000010c r12 : a00000010088fb50 r13 : a000000100888000 r14 : 0000000000002000 r15 : 000000000000006f r16 : e0000000060ee006 r17 : e0000000060ee079 r18 : e0000000060ee07a r19 : e0000000060ee008 r20 : 0000000000000000 r21 : e0000000060ee028 r22 : e000000005d03cc0 r23 : e0000000060ee020 r24 : 00000000000146cc r25 : e000000005d03c28 r26 : e0000000060ee018 r27 : 000000004248e501 r28 : 0000000000000000 r29 : 0000000000000000 r30 : e0000000060ee079 r31 : e000000005d03c50 Call Trace: [] show_stack+0x80/0xa0 sp=a00000010088f710 bsp=a000000100889778 [] show_regs+0x7e0/0x800 sp=a00000010088f8e0 bsp=a000000100889718 [] die+0x150/0x1c0 sp=a00000010088f8f0 bsp=a0000001008896d8 [] ia64_do_page_fault+0x370/0x980 sp=a00000010088f8f0 bsp=a000000100889670 [] ia64_leave_kernel+0x0/0x260 sp=a00000010088f980 bsp=a000000100889670 [] __copy_user+0x890/0x940 sp=a00000010088fb50 bsp=a000000100889650 [] ipt_ulog_packet+0x6d0/0x800 sp=a00000010088fb50 bsp=a0000001008895a8 [] ipt_ulog_target+0x40/0x60 sp=a00000010088fb50 bsp=a000000100889568 [] ipt_do_table+0x700/0x800 sp=a00000010088fb50 bsp=a0000001008894c0 [] ipt_hook+0x40/0x60 sp=a00000010088fb60 bsp=a000000100889488 [] nf_iterate+0x150/0x200 sp=a00000010088fb60 bsp=a000000100889430 [] nf_hook_slow+0xa0/0x200 sp=a00000010088fb60 bsp=a0000001008893b0 [] ip_local_deliver+0x4e0/0x560 sp=a00000010088fb70 bsp=a000000100889368 [] ip_rcv_finish+0x610/0x720 sp=a00000010088fb70 bsp=a000000100889328 [] nf_hook_slow+0x160/0x200 sp=a00000010088fb80 bsp=a0000001008892b0 [] ip_rcv+0xa90/0xc20 sp=a00000010088fb90 bsp=a000000100889258 [] netif_receive_skb+0x450/0x580 sp=a00000010088fba0 bsp=a000000100889208 [] process_backlog+0x150/0x320 sp=a00000010088fba0 bsp=a000000100889190 [] net_rx_action+0x170/0x320 sp=a00000010088fba0 bsp=a000000100889138 [] __do_softirq+0x200/0x240 sp=a00000010088fbb0 bsp=a000000100889098 
[] do_softirq+0x80/0xe0 sp=a00000010088fbb0 bsp=a000000100889038
[] irq_exit+0x80/0xa0 sp=a00000010088fbb0 bsp=a000000100889020
[] ia64_handle_irq+0x110/0x140 sp=a00000010088fbb0 bsp=a000000100888fe0
[] ia64_leave_kernel+0x0/0x260 sp=a00000010088fbb0 bsp=a000000100888fe0
[] ia64_pal_call_static+0xa0/0xc0 sp=a00000010088fd80 bsp=a000000100888f90
[] default_idle+0x100/0x1a0 sp=a00000010088fd80 bsp=a000000100888f40
[] cpu_idle+0x200/0x2c0 sp=a00000010088fe20 bsp=a000000100888ed0
[] rest_init+0x50/0x80 sp=a00000010088fe20 bsp=a000000100888eb8
[] start_kernel+0x440/0x4c0 sp=a00000010088fe20 bsp=a000000100888e58
[] _start+0x2e0/0x300 sp=a00000010088fe30 bsp=a000000100888de0
<0>Kernel panic - not syncing: Aiee, killing interrupt handler!

From mst at mellanox.co.il Mon Mar 28 22:07:09 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 29 Mar 2005 08:07:09 +0200
Subject: [openib-general] Re: Re: [Andrew Morton] inappropriate use of in_atomic()
In-Reply-To: <20050328175523.C30499@topspin.com>
References: <52oedq946k.fsf@topspin.com> <20050311073108.GA20989@mellanox.co.il> <20050311154316.E31689@topspin.com> <20050320181242.GA18963@mellanox.co.il> <20050325082349.A22487@topspin.com> <20050327155847.GC26108@mellanox.co.il> <20050328175523.C30499@topspin.com>
Message-ID: <20050329060709.GA14041@mellanox.co.il>

Quoting r. Libor Michalek :
> Subject: Re: Re: [Andrew Morton] inappropriate use of in_atomic()
>
> On Sun, Mar 27, 2005 at 05:58:47PM +0200, Michael S. Tsirkin wrote:
> >
> > My understanding is you must give kmap_atomic the proper parameter:
> > KM_IRQ0/KM_SOFTIRQ0/KM_USR0, to avoid conflicts with other callers of
> > kmap on the same CPU.
> >
> > Something like this then?
>
> If you're going to check for in_irq() why not just use kmap() when
> you are not in an interrupt?

I think the main benefit is that kmap_atomic is per-CPU, so we avoid flushing TLBs globally across all CPUs.
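To illustrate the slot argument made in this thread, here is a minimal userspace sketch in C. It is not kernel code. It assumes a simplified model in which each CPU has exactly one mapping slot per type, and every name prefixed toy_ or TOY_ is hypothetical. The real kmap_atomic() does not report a collision through a return value; the toy returns -1 only to make a collision visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for KM_USR0 / KM_SOFTIRQ0 / KM_IRQ0. */
enum toy_km_type { TOY_KM_USR0, TOY_KM_SOFTIRQ0, TOY_KM_IRQ0, TOY_KM_TYPE_NR };

#define TOY_NR_CPUS 2

/* One mapping slot per (CPU, type); NULL means the slot is free. */
static const void *toy_slots[TOY_NR_CPUS][TOY_KM_TYPE_NR];

/*
 * Claim the slot for (cpu, type) on behalf of `page`.
 * Returns 0 on success, or -1 if that slot is already in use,
 * i.e. two users on the same CPU picked the same type.
 */
int toy_kmap_atomic(int cpu, enum toy_km_type type, const void *page)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    assert(page != NULL);

    if (toy_slots[cpu][type] != NULL)
        return -1;              /* collision on this CPU's slot */
    toy_slots[cpu][type] = page;
    return 0;
}

/* Release the slot for (cpu, type). */
void toy_kunmap_atomic(int cpu, enum toy_km_type type)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    toy_slots[cpu][type] = NULL;
}
```

In this model, a TOY_KM_USR0 mapping on CPU 0 followed by a TOY_KM_IRQ0 mapping on the same CPU succeeds, because the two types use different slots. A second TOY_KM_USR0 mapping on CPU 0 before the first is released returns -1, while the same request on CPU 1 succeeds. That is the kind of same-CPU conflict the type parameter is meant to avoid, and it is why the thread argues that using KM_IRQ0 everywhere is only safe when local interrupts are disabled for the whole map/unmap window.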
> I think the benefit of using kmap_atomic
> all the time is that you don't need to check if you are in an
> interrupt, you just need to make sure the local interrupts are
> disabled in case you are not in an interrupt.
> Once they are disabled
> I think you can use KM_IRQ0 for all cases. With interrupts disabled
> you should never get a collision on the KM_IRQ0 page as long as the
> map/unmap occur before the interrupt is enabled, which it would in
> the SDP case.
>
> -Libor

You are saying it's cheaper to disable local interrupts than to check whether you are in an interrupt? Makes sense.

--
MST - Michael S. Tsirkin

From iod00d at hp.com Mon Mar 28 22:33:17 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 28 Mar 2005 22:33:17 -0800
Subject: [openib-general] [BUG] NULL pointer deref in ib_sa_mcmember_rec_callback()
Message-ID: <20050329063317.GA21470@esmail.cup.hp.com>

I'm on a roll tonight...hit ^C after trying to start netperf from ionize (10.0.0.113) to a 4th node that didn't have IPoIB module loaded or configured yet. I could no longer ping ionize (10.0.0.113) nor ping out from it via IB. I tried to unload ib_ipoib and got the NULL ptr deref segfault. Will reboot the box at this point.

Same config/kernel as before: 2.6.11 + SVN gen2 version 2050.

Tombstone follows.

thanks,
grant

...
ionize:~# for i in 8192 8192 8192; do /usr/local/bin/netperf -c -l 60 -H 10.0.0.30 -t TCP_STREAM -- -m $i -s 262144 -S 262144; done ionize:~# ionize:~# ionize:~# ifconfig ib0 down Unable to handle kernel NULL pointer dereference (address 0000000000000000) ib_mad1[1942]: Oops 8813272891392 [1] Modules linked in: ib_ipoib ib_sa ib_mthca ib_mad ib_core tg3 dm_mod e1000 e100 Pid: 1942, CPU 1, comm: ib_mad1 psr : 0000101008026018 ifs : 800000000000038b ip : [] Not tainted ip is at ib_sa_mcmember_rec_callback+0x90/0xe0 [ib_sa] unat: 0000000000000000 pfs : 000000000000048d rsc : 0000000000000003 rnat: 0000000000000000 bsps: 0000000000000000 pr : 000000000000a941 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a74433f csd : 0000000000000000 ssd : 0000000000000000 b0 : a0000002000a5a30 b6 : a000000100002d70 b7 : a0000002000a5440 f6 : 1003e8080808080808081 f7 : 1003e0000000000001400 f8 : 1003e0000000000001400 f9 : 1003e00000000000027d8 f10 : 1003e000000000ff00000 f11 : 1003e000000003b5f2d38 r1 : a0000002002a4000 r2 : a0000002000a7270 r3 : e000000101fcfd98 r8 : a0000002000a5440 r9 : 0000000000000006 r10 : 0000000000000003 r11 : 0000000000000001 r12 : e000000101fcfd20 r13 : e000000101fc8000 r14 : 0000000000000000 r15 : e00000003f582908 r16 : a0000002000a92d8 r17 : 0000000000000000 r18 : 0000000000000001 r19 : 0000000000000000 r20 : e00000003f577d40 r21 : 0000000000000000 r22 : e00000003f577d40 r23 : 0000000000000000 r24 : 0000000000000000 r25 : e0000001011c1368 r26 : e00000019dadcd18 r27 : 0000001008026018 r28 : e0000001011c1368 r29 : e000000100395430 r30 : 0000000000000000 r31 : a0000002000a9da0 Call Trace: [] show_stack+0x80/0xa0 sp=e000000101fcf8e0 bsp=e000000101fc9180 [] show_regs+0x7e0/0x800 sp=e000000101fcfab0 bsp=e000000101fc9120 [] die+0x150/0x1c0 sp=e000000101fcfac0 bsp=e000000101fc90e0 [] ia64_do_page_fault+0x370/0x980 sp=e000000101fcfac0 bsp=e000000101fc9078 [] ia64_leave_kernel+0x0/0x260 sp=e000000101fcfb50 bsp=e000000101fc9078 [] 
ib_sa_mcmember_rec_callback+0x90/0xe0 [ib_sa] sp=e000000101fcfd20 bsp=e000000101fc9020 [] send_handler+0x110/0x280 [ib_sa] sp=e000000101fcfd70 bsp=e000000101fc8fd0 [] ib_mad_complete_send_wr+0x270/0x300 [ib_mad] sp=e000000101fcfd70 bsp=e000000101fc8f90 [] ib_mad_send_done_handler+0x1e0/0x2e0 [ib_mad] sp=e000000101fcfd70 bsp=e000000101fc8f20 [] ib_mad_completion_handler+0x180/0x200 [ib_mad] sp=e000000101fcfd80 bsp=e000000101fc8ed0 [] worker_thread+0x3d0/0x520 sp=e000000101fcfdb0 bsp=e000000101fc8e48 [] kthread+0x160/0x180 sp=e000000101fcfe20 bsp=e000000101fc8e10 [] kernel_thread_helper+0xd0/0x100 sp=e000000101fcfe30 bsp=e000000101fc8de0 [] start_kernel_thread+0x20/0x40 sp=e000000101fcfe30 bsp=e000000101fc8de0 ionize:~# From tziporet at mellanox.co.il Mon Mar 28 22:52:17 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 29 Mar 2005 08:52:17 +0200 Subject: [openib-general] What context can CM be called from? Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF21D@mtlex01.yok.mtl.com> Hi, If I remember correctly the verbs of create & destroy AVs should be enabled from interrupt context too since they are not privileged verbs. In VAPI we implemented these verbs in this way and I think it is important to keep it this way. So my vote is to fix the core. Is there any limitation for the implementation in gen2 that prevent this? Tziporet -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Tuesday, March 29, 2005 2:31 AM To: Sean Hefty Cc: openib-general at openib.org Subject: Re: [openib-general] What context can CM be called from? Sean> We might be able to change the CM to be called at interrupt, Sean> but doing so requires changes to the verbs implementation. Sean> The cm_alloc_msg() calls ib_create_ah(), which I believes Sean> calls kmalloc in a similar fashion. I should also note that Sean> cm_free_msg() calls ib_destroy_ah(). Yep good point. 
A particular implementation of ib_create_ah() may be safe from interrupt context but in general ib_create_ah() needs to be allowed to sleep. So I guess you've got your marching orders, Libor...

 - R.

From abhijitngpune at indiatimes.com Tue Mar 29 05:08:58 2005
From: abhijitngpune at indiatimes.com (abhijitngpune)
Date: Tue, 29 Mar 2005 18:38:58 +0530
Subject: [openib-general] OpenSM
Message-ID: <200503291247.SAA25223@WS0005.indiatimes.com>

Hi all,

I am new to InfiniBand and related issues, and I have a few questions about OpenSM.

1. How does OpenSM support non-fat-tree topologies (graphs having cycles)? (Any research paper will do.)
2. Given a topology whose graph contains cycles, how can I demonstrate that the subnet manager works for it?
3. What is the OpenSM Tcl extension used for? Does anybody have example code for a particular (irregular / non-fat-tree) topology?

Abhijeet

From halr at voltaire.com Tue Mar 29 06:02:27 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Mar 2005 09:02:27 -0500
Subject: [openib-general] IA64 ucm.c compile warnings
Message-ID: <1112104947.4664.4.camel@localhost.localdomain>

Hi Libor,

A few compile warnings for ucm on IA64:

drivers/infiniband/core/ucm.c: In function `ib_ucm_event_req_get':
drivers/infiniband/core/ucm.c:188: warning: cast from pointer to integer of different size
drivers/infiniband/core/ucm.c: In function `ib_ucm_event_sidr_req_get':
drivers/infiniband/core/ucm.c:251: warning: cast from pointer to integer of different size
drivers/infiniband/core/ucm.c: In function `ib_ucm_event_handler':
drivers/infiniband/core/ucm.c:386: warning: cast from pointer to integer of different size
drivers/infiniband/core/ucm.c:389: warning: cast from pointer to integer of different size
drivers/infiniband/core/ucm.c:392: warning: cast from pointer to integer of different size
drivers/infiniband/core/ucm.c: In function `ib_ucm_write':
drivers/infiniband/core/ucm.c:1243: warning: int format, different type arg (arg 5)
drivers/infiniband/core/ucm.c:1243: warning: int format, different type arg (arg 5)

Thanks.

-- Hal

From roland at topspin.com Tue Mar 29 06:51:20 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 29 Mar 2005 06:51:20 -0800
Subject: [openib-general] BUG ipoib oops in iptables/netfilter
References: <20050329054158.GA21304@esmail.cup.hp.com>
Message-ID: <52fyyesdxz.fsf@topspin.com>

This is just a conjecture, but I see ipt_ulog_packet in your crash stack trace. Looking at the source to ipt_ulog_packet, I see that it can access skb->mac.raw. Prior to Hal's patch, which I just checked in today, skb->mac.raw was never initialized and hence could point off into space. So this could be causing the oops.

It would be worth updating your ipoib driver to the latest svn and seeing if you can still reproduce the crash.
BTW I'm not sure running traffic through netfilter is going to give you the best possible performance.

 - R.

From jlentini at netapp.com Tue Mar 29 07:10:24 2005
From: jlentini at netapp.com (James Lentini)
Date: Tue, 29 Mar 2005 10:10:24 -0500 (EST)
Subject: [openib-general] kDAPL location?
Message-ID:

Hi,

Thanks to help from several sources, primarily Itamar Rabenstein at Mellanox and Hal Rosenstock at Voltaire, I have a kDAPL provider for OpenIB. A uDAPL provider is being worked on and will be made available shortly.

Where in the subversion tree should I place the kDAPL provider?

It has been suggested to me that I do the following:

- I create a directory in gen2/branches called jlentini-dapl
- I place the kDAPL code in this directory
- I place the uDAPL code in this directory when it is ready

The intention would be to move both kDAPL and uDAPL to the trunk once they are stable.

If I follow the steps above, the code in the jlentini-dapl directory will not technically be a branch of the trunk. Is that acceptable?

Should I follow the above plan? Are there any other suggestions?

james

James Lentini              email: jlentini at netapp.com
Network Appliance          phone: 781-768-5359
375 Totten Pond Rd.        fax:   781-895-1195
Waltham, MA 02451-2010     main:  781-768-5300

From halr at voltaire.com Tue Mar 29 07:20:00 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Mar 2005 10:20:00 -0500
Subject: [openib-general] [PATCH] RFC: allow send only registration from userspace
In-Reply-To: <52ll87t8ub.fsf@topspin.com>
References: <52ll87t8ub.fsf@topspin.com>
Message-ID: <1112109600.4645.22.camel@localhost.localdomain>

On Mon, 2005-03-28 at 22:43, Roland Dreier wrote:
> It seems that ib_umad does not allow userspace to register an agent
> for solicited MADs only.
> The documentation says that if userspace
> passes a mgmt_class of 0 then the agent will not receive any
> unsolicited MADs, but in this case ib_umad still passes a
> ib_mad_reg_req struct to ib_register_mad_agent() so the registration
> just fails.
>
> I'm surprised that no one has complained about this yet. How did
> ibping work in userspace if the kernel had already registered for the
> ping class?

The management class for the ping registrations (for the server) is not 0, so you can't run both the kernel and usermode ibping servers simultaneously.

> Does this patch look right to everyone?

It looks right. Have you tried it out?

Also, there was a thread a while ago on adding async event support to user_mad.c. This is needed by OpenSM. Any comments on this?

-- Hal

From shaharf at voltaire.com Tue Mar 29 07:33:02 2005
From: shaharf at voltaire.com (shaharf)
Date: Tue, 29 Mar 2005 17:33:02 +0200
Subject: [openib-general] OpenSM
Message-ID:

Hi abhijitngpune,

OpenSM does not know or care about the topology of the network. Every connected graph is valid for it. BTW, a fat tree can have cycles too. If I don't err, the algorithm used by OpenSM is a variation of a well-known graph algorithm invented by Dijkstra, or is based on one of Dijkstra's algorithms (I hope I am writing his name correctly). You can find these algorithms in any graph theory textbook - look for "find all shortest paths" algorithms (for example: http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/dijkstra.html ).

Very briefly, the algorithm that OpenSM uses goes like this:

1. All switches learn about themselves (hop 0) and any directly connected hosts (hop 1). They keep this information in a forwarding table that contains (schematically) the following information (the actual details are a bit more complicated, to be able to support multipathing): LID (local identifier), out-port, hops.

2. Now you start the hop>1 learning phase, which uses several passes over the switches. On every single pass, you go over all switches (the order does not matter), and within each switch you examine every directly attached switch, called a "neighbor". For every such neighbor you compare your forwarding table to the neighbor's table. If the neighbor's hop count to a LID, plus 1 (for the extra hop between you and the neighbor switch), is less than your own hop count to that LID, you change your table entry to route that LID through the port that connects you to that neighbor.

3. You repeat the above process until no table is changed during a complete pass, or until a number of passes equal to the number of switches has been done.

The correctness of this algorithm is left to the reader ;-)

It seems that you are using the gen1 stack and OpenSM. Please be aware that the gen1 tree is not supported any more. Please use gen2. The OpenSM Tcl extension is not supported on gen2 and I don't know of any plans to support it.

Regarding the topology example - any connected graph will do. I guess that most connected graphs are very inefficient traffic-wise, but still all of them are valid.

Demonstrating that a topology is configured correctly is a bit of a problem. If you are willing to spend some effort, you can use the topology simulator released with Mellanox Gold - look for the IBADM package. This stuff is not very well documented but it should be usable. Mellanox released (or is about to release) a real subnet simulator that you can run OpenSM on top of. Using this simulator you can test any arbitrary topology. The problem is that you have to port this simulator to gen2. Any volunteers are welcome...

Shahar

________________________________

From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of abhijitngpune
Sent: Tuesday, March 29, 2005 3:09 PM
To: openib-general at openib.org
Subject: [openib-general] OpenSM

Hi all, I am a new to infiniband and related issues. I have some few doubts related to openSM. 1. how does openSM support the non fat tree (graph having cycles) topologies?
(any research paper will do) 2. Given a graph (it contains cycles) topology how can i demonstrate that subnet manager working for this topology? 3. What is openSM tcl extension is used for? does anybody have example code for perticular (irregular/ non fat tree) topology? Abhijeet ________________________________ Indiatimes Email now powered by APIC Advantage. Help! My Presence Help ________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland at topspin.com Tue Mar 29 07:43:33 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 29 Mar 2005 07:43:33 -0800 Subject: [openib-general] [PATCH] RFC: allow send only registration from userspace In-Reply-To: <1112109600.4645.22.camel@localhost.localdomain> (Hal Rosenstock's message of "29 Mar 2005 10:20:00 -0500") References: <52ll87t8ub.fsf@topspin.com> <1112109600.4645.22.camel@localhost.localdomain> Message-ID: <521x9ysbiy.fsf@topspin.com> Roland> I'm surprised that no one has complained about this yet. Roland> How did ibping work in userspace if the kernel had already Roland> registered for the ping class? Hal> The management class for the ping registrations (for the Hal> server) is not 0 so you can't run both kernel and usermode Hal> ibping server simultaneously. But how did you run the kernel server and the user client at the same time? It seems they would have to register an agent for the same class. Hal> It looks right. Have you tried it out ? Yes. Hal> Also, there was a thread a while ago on adding async event Hal> support to user_mad.c. This is needed by OpenSM. Any comments Hal> on this ? I would prefer to leave the passing of async events to uverbs. - R. From roland at topspin.com Tue Mar 29 08:08:20 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 29 Mar 2005 08:08:20 -0800 Subject: [openib-general] What context can CM be called from? 
References: <506C3D7B14CDD411A52C00025558DED6064BF21D@mtlex01.yok.mtl.com> Message-ID: <52u0muqvt7.fsf@topspin.com> Tziporet> Hi, If I remember correctly the verbs of create & Tziporet> destroy AVs should be enabled from interrupt context too Tziporet> since they are not privileged verbs. In VAPI we Tziporet> implemented these verbs in this way and I think it is Tziporet> important to keep it this way. Yes, you're correct that the AV verbs are not privileged according to the table in chapter 11 of the IB spec. I'm not sure that this requires that they must be available from interrupt context but it is reasonable for us to choose the policy that all non-privileged verbs may be called from interrupt context. Fixing mthca to allow the AH verbs to be callable from interrupt context is easy -- the trivial patch is included below. I'm not sure if this removes all obstructions to the CM being usable from interrupt context. - R. Index: infiniband/hw/mthca/mthca_av.c =================================================================== --- infiniband/hw/mthca/mthca_av.c (revision 2060) +++ infiniband/hw/mthca/mthca_av.c (working copy) @@ -63,7 +63,7 @@ int mthca_create_ah(struct mthca_dev *de ah->type = MTHCA_AH_PCI_POOL; if (dev->hca_type == ARBEL_NATIVE) { - ah->av = kmalloc(sizeof *ah->av, GFP_KERNEL); + ah->av = kmalloc(sizeof *ah->av, GFP_ATOMIC); if (!ah->av) return -ENOMEM; @@ -77,7 +77,7 @@ int mthca_create_ah(struct mthca_dev *de if (index == -1) goto on_hca_fail; - av = kmalloc(sizeof *av, GFP_KERNEL); + av = kmalloc(sizeof *av, GFP_ATOMIC); if (!av) goto on_hca_fail; @@ -89,7 +89,7 @@ int mthca_create_ah(struct mthca_dev *de on_hca_fail: if (ah->type == MTHCA_AH_PCI_POOL) { ah->av = pci_pool_alloc(dev->av_table.pool, - SLAB_KERNEL, &ah->avdma); + SLAB_ATOMIC, &ah->avdma); if (!ah->av) return -ENOMEM; From iod00d at hp.com Tue Mar 29 08:55:11 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 29 Mar 2005 08:55:11 -0800 Subject: [openib-general] BUG ipoib oops 
in iptables/netfilter In-Reply-To: <52fyyesdxz.fsf@topspin.com> References: <20050329054158.GA21304@esmail.cup.hp.com> <52fyyesdxz.fsf@topspin.com> Message-ID: <20050329165511.GA22850@esmail.cup.hp.com> On Tue, Mar 29, 2005 at 06:51:20AM -0800, Roland Dreier wrote: > This is just a conjecture, but I see ipt_ulog_packet in your crash > stack trace. Looking at the source to ipt_ulog_packet, I see that it > can access skb->mac.raw. Prior to Hal's patch, which I just checked > in today, skb->mac.raw was never initialized and hence could point off > into space. So this could be causing the oops. Yup - sounds reasonable. > It would be worth updating your ipoib driver to the latest svn and > seeing if you can still reproduce the crash. I'll try that. I hope I've got the recipe right to reproduce this. > BTW I'm not sure running traffic through netfilter is going to give > you the best possible performance. I agree. I've disabled it. It was enabled on this kernel only because of the external facing NIC. I will regen the kernel w/o netfilter, remove portmap and possibly inetd though I don't expect any issues with that. Security and perf testing aren't the best of friends. thanks, grant From roland at topspin.com Tue Mar 29 08:51:55 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 29 Mar 2005 08:51:55 -0800 Subject: [openib-general] [BUG] NULL pointer deref in ib_sa_mcmember_rec_callback() In-Reply-To: <20050329063317.GA21470@esmail.cup.hp.com> (Grant Grundler's message of "Mon, 28 Mar 2005 22:33:17 -0800") References: <20050329063317.GA21470@esmail.cup.hp.com> Message-ID: <52psxiqtsk.fsf@topspin.com> Grant> I'm on a roll tonight...hit ^C after trying to start Grant> netperf from ionize (10.0.0.113) to a 4th node that didn't Grant> have IPoIB module loaded or configured yet. I could no Grant> longer ping ionize (10.0.0.113) nor ping out from it via Grant> IB. I tried to unload ib_ipoib and got the NULL ptr deref Grant> segfault. Will reboot the box at this point. 
Strange, I don't have a theory for this one. Can you look at the assembly for ib_sa_mcmember_rec_callback and guess what the NULL deref is accessing? - R. From halr at voltaire.com Tue Mar 29 08:55:18 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Mar 2005 11:55:18 -0500 Subject: [openib-general] BUG ipoib oops in iptables/netfilter In-Reply-To: <20050329165511.GA22850@esmail.cup.hp.com> References: <20050329054158.GA21304@esmail.cup.hp.com> <52fyyesdxz.fsf@topspin.com> <20050329165511.GA22850@esmail.cup.hp.com> Message-ID: <1112115318.4646.50.camel@localhost.localdomain> Hi Grant, On Tue, 2005-03-29 at 11:55, Grant Grundler wrote: > On Tue, Mar 29, 2005 at 06:51:20AM -0800, Roland Dreier wrote: > > This is just a conjecture, but I see ipt_ulog_packet in your crash > > stack trace. Looking at the source to ipt_ulog_packet, I see that it > > can access skb->mac.raw. Prior to Hal's patch, which I just checked > > in today, skb->mac.raw was never initialized and hence could point off > > into space. So this could be causing the oops. > > Yup - sounds reasonable. > > > It would be worth updating your ipoib driver to the latest svn and > > seeing if you can still reproduce the crash. > > I'll try that. I hope I've got the recipe right to reproduce this. > > > BTW I'm not sure running traffic through netfilter is going to give > > you the best possible performance. > > I agree. I've disabled it. The latter (disabling netfilter/iptables) should be done after verifying that the crash goes away with the ipoib_ib.c change. Thanks. -- Hal From mshefty at ichips.intel.com Tue Mar 29 09:29:35 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 29 Mar 2005 09:29:35 -0800 Subject: [openib-general] What context can CM be called from? 
In-Reply-To: <52u0muqvt7.fsf@topspin.com> References: <506C3D7B14CDD411A52C00025558DED6064BF21D@mtlex01.yok.mtl.com> <52u0muqvt7.fsf@topspin.com> Message-ID: <4249907F.8010101@ichips.intel.com> Roland Dreier wrote: > Tziporet> Hi, If I remember correctly the verbs of create & > Tziporet> destroy AVs should be enabled from interrupt context too > Tziporet> since they are not privileged verbs. In VAPI we > Tziporet> implemented these verbs in this way and I think it is > Tziporet> important to keep it this way. > > Yes, you're correct that the AV verbs are not privileged according to > the table in chapter 11 of the IB spec. I'm not sure that this > requires that they must be available from interrupt context but it is > reasonable for us to choose the policy that all non-privileged verbs > may be called from interrupt context. > > Fixing mthca to allow the AH verbs to be callable from interrupt > context is easy -- the trivial patch is included below. > > I'm not sure if this removes all obstructions to the CM being usable > from interrupt context. With this patch, changing the kmalloc in cm_alloc_msg() to use GFP_ATOMIC rather than GFP_KERNEL should allow the CM to be usable from interrupt context. Of course, I haven't actually tested this... I have no objection to this change however. - Sean From halr at voltaire.com Tue Mar 29 10:03:55 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Mar 2005 13:03:55 -0500 Subject: [openib-general] [PATCH] RFC: allow send only registration from userspace In-Reply-To: <521x9ysbiy.fsf@topspin.com> References: <52ll87t8ub.fsf@topspin.com> <1112109600.4645.22.camel@localhost.localdomain> <521x9ysbiy.fsf@topspin.com> Message-ID: <1112119435.4645.21.camel@localhost.localdomain> On Tue, 2005-03-29 at 10:43, Roland Dreier wrote: > Roland> I'm surprised that no one has complained about this yet. > Roland> How did ibping work in userspace if the kernel had already > Roland> registered for the ping class? 
> > Hal> The management class for the ping registrations (for the > Hal> server) is not 0 so you can't run both kernel and usermode > Hal> ibping server simultaneously. > > But how did you run the kernel server and the user client at the same > time? It seems they would have to register an agent for the same class. The client registers with a cleared method mask so there is no conflict with the server. (This registration could be done a different way). > Hal> Also, there was a thread a while ago on adding async event > Hal> support to user_mad.c. This is needed by OpenSM. Any comments > Hal> on this ? > > I would prefer to leave the passing of async events to uverbs. It seems simplest for OpenSM to get the events through the user MAD fd it is already using. Is there a reason it shouldn't be done this way ? -- Hal From roland at topspin.com Tue Mar 29 10:15:46 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 29 Mar 2005 10:15:46 -0800 Subject: [openib-general] [PATCH] RFC: allow send only registration from userspace In-Reply-To: <1112119435.4645.21.camel@localhost.localdomain> (Hal Rosenstock's message of "29 Mar 2005 13:03:55 -0500") References: <52ll87t8ub.fsf@topspin.com> <1112109600.4645.22.camel@localhost.localdomain> <521x9ysbiy.fsf@topspin.com> <1112119435.4645.21.camel@localhost.localdomain> Message-ID: <527jjqqpwt.fsf@topspin.com> Hal> It seems simplest for OpenSM to get the events through the Hal> user MAD fd it is already using. Is there a reason it Hal> shouldn't be done this way ? I'd rather not have two mechanisms for doing the same thing. - R. 
From halr at voltaire.com Tue Mar 29 10:31:10 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Mar 2005 13:31:10 -0500 Subject: [openib-general] [PATCH] RFC: allow send only registration from userspace In-Reply-To: <527jjqqpwt.fsf@topspin.com> References: <52ll87t8ub.fsf@topspin.com> <1112109600.4645.22.camel@localhost.localdomain> <521x9ysbiy.fsf@topspin.com> <1112119435.4645.21.camel@localhost.localdomain> <527jjqqpwt.fsf@topspin.com> Message-ID: <1112121070.4645.2.camel@localhost.localdomain> On Tue, 2005-03-29 at 13:15, Roland Dreier wrote: > Hal> It seems simplest for OpenSM to get the events through the > Hal> user MAD fd it is already using. Is there a reason it > Hal> shouldn't be done this way ? > > I'd rather not have two mechanisms for doing the same thing. It seems wasteful to have another thread just for polling these events. -- Hal From libor at topspin.com Tue Mar 29 10:38:26 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 29 Mar 2005 10:38:26 -0800 Subject: [openib-general] Re: Re: [Andrew Morton] inappropriate use of in_atomic() In-Reply-To: <20050329040510.GB20296@esmail.cup.hp.com>; from iod00d@hp.com on Mon, Mar 28, 2005 at 08:05:10PM -0800 References: <52oedq946k.fsf@topspin.com> <20050311073108.GA20989@mellanox.co.il> <20050311154316.E31689@topspin.com> <20050320181242.GA18963@mellanox.co.il> <20050325082349.A22487@topspin.com> <20050327155847.GC26108@mellanox.co.il> <20050328175523.C30499@topspin.com> <20050329040510.GB20296@esmail.cup.hp.com> Message-ID: <20050329103826.A31683@topspin.com> On Mon, Mar 28, 2005 at 08:05:10PM -0800, Grant Grundler wrote: > On Mon, Mar 28, 2005 at 05:55:23PM -0800, Libor Michalek wrote: > > On Sun, Mar 27, 2005 at 05:58:47PM +0200, Michael S. Tsirkin wrote: > > > > > > My understanding is you must give kmap_atomic the proper parameter: > > > KM_IRQ0/KM_SOFTIRQ0/KM_USR0, to avoid conflicts with other callers of > > > kmap on the same CPU. > > > > > > Something like this then? 
> > >
> > If you're going to check for in_irq() why not just use kmap() when
> > you are not in an interrupt? I think the benefit of using kmap_atomic
> > all the time is that you don't need to check if you are in an
> > interrupt, you just need to make sure the local interrupts are
> > disabled in case you are not in an interrupt.
>
> Sorry - parsing that last sentence kept me busy for a bit. :^)
>
> I just wanted to point out that checking if the CPU is in an interrupt
> context can be more expensive than blindly turning interrupts off.
> If the branch is mispredicted, one loses on most architectures.
> Typically, masking *all* interrupts for the "current" cpu is a very
> efficient operation and has a smaller i-cache footprint.

  Sorry for the obfuscation. What I wanted to say is that if we're going
to use kmap_atomic all the time, we shouldn't bother checking whether we
are in an interrupt; instead, always disable interrupts before calling
kmap_atomic and always use the same km_type. There will not be a
collision on the km_type if interrupts are disabled, so there's no reason
to use different km_types in different states. At least that's my
understanding of how kmap_atomic works.

-Libor

From roland at topspin.com Tue Mar 29 10:40:45 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 29 Mar 2005 10:40:45 -0800
Subject: [openib-general] [PATCH] RFC: allow send only registration from userspace
In-Reply-To: <1112121070.4645.2.camel@localhost.localdomain> (Hal Rosenstock's message of "29 Mar 2005 13:31:10 -0500")
References: <52ll87t8ub.fsf@topspin.com> <1112109600.4645.22.camel@localhost.localdomain> <521x9ysbiy.fsf@topspin.com> <1112119435.4645.21.camel@localhost.localdomain> <527jjqqpwt.fsf@topspin.com> <1112121070.4645.2.camel@localhost.localdomain>
Message-ID: <523bueqor6.fsf@topspin.com>

    Hal> It seems wasteful to have another thread just for polling
    Hal> these events.

Don't do that then.
The issue of which file descriptor the events come from is orthogonal
to how userspace handles them. (Although threads are pretty cheap.)

Adding async events to the MAD file descriptor really messes up the
user/kernel interface. Right now, the umad devices are very easy to
understand: read() receives a MAD, write() sends a MAD, and ioctl()
sets up some control information related to sending and receiving
MADs. I don't see a clean way to add async events into that.

Also, the current umad devices are per-port, while async events are
really per-device. Adding some sort of filtering policy is just going
to generate yet more confusion.

If we want to move async events from the uverbs devices to dedicated
async event devices then I don't really have a problem with that.
That would save things like OpenSM from having to deal with the whole
verbs interface just to get events.

 - R.

From halr at voltaire.com Tue Mar 29 11:36:46 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Mar 2005 14:36:46 -0500
Subject: [openib-general] [PATCH] ping Add IB ping server agent
In-Reply-To: <52zmwxgffs.fsf@topspin.com>
References: <1111152373.4662.6585.camel@localhost.localdomain> <52zmwxgffs.fsf@topspin.com>
Message-ID: <1112124818.4592.2.camel@localhost.localdomain>

On Mon, 2005-03-21 at 10:57, Roland Dreier wrote:
> A few comments based on a quick read through of the code.
> > > Index: ping_priv.h > > =================================================================== > > -- ping_priv.h (revision 0) > > +++ ping_priv.h (revision 0) > > +#include > > + > > +#define SPFX "ib_ping: " > > + > > +struct ib_ping_send_wr { > > + struct list_head send_list; > > + struct ib_ah *ah; > > + struct ib_mad_private *mad; > > + DECLARE_PCI_UNMAP_ADDR(mapping) > > +}; > > + > > +struct ib_ping_port_private { > > + struct list_head port_list; > > + struct list_head send_posted_list; > > + spinlock_t send_list_lock; > > + int port_num; > > + struct ib_mad_agent *pingd_agent; /* OpenIB Ping class */ > > +}; > > Is it worth having a separate include file that is only included in > one place for this small amount of declarations? Is the same true for agent.c/agent_priv.h (and agent_priv.h should be eliminated) ? > > + /* PCI mapping */ > > + gather_list.addr = dma_map_single(mad_agent->device->dma_device, > > Is this comment useful? It's pretty obvious what's going on here, and > it's not necessarily PCI mapping (the HCA could be on some other type > of bus). Similar for agent.c and mad.c. > > + /* Unmap PCI */ > > + dma_unmap_single(mad_agent->device->dma_device, > > Inaccurate and not helpful comment again. Similar for agent.c and mad.c. -- Hal From tduffy at sun.com Tue Mar 29 12:06:29 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 29 Mar 2005 12:06:29 -0800 Subject: [openib-general] kDAPL location? In-Reply-To: References: Message-ID: <1112126789.19537.5.camel@duffman> On Tue, 2005-03-29 at 10:10 -0500, James Lentini wrote: > Should I follow the above plan? Are there any other suggestions? I think you should either branch trunk or roland-uverbs and put all the DAPL stuff in there so that it can easily be merged back to trunk when it is ready. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From libor at topspin.com Tue Mar 29 12:13:12 2005
From: libor at topspin.com (Libor Michalek)
Date: Tue, 29 Mar 2005 12:13:12 -0800
Subject: [openib-general] kDAPL location?
In-Reply-To: <1112126789.19537.5.camel@duffman>; from tduffy@sun.com on Tue, Mar 29, 2005 at 12:06:29PM -0800
References: <1112126789.19537.5.camel@duffman>
Message-ID: <20050329121312.B31683@topspin.com>

On Tue, Mar 29, 2005 at 12:06:29PM -0800, Tom Duffy wrote:
> On Tue, 2005-03-29 at 10:10 -0500, James Lentini wrote:
> > Should I follow the above plan? Are there any other suggestions?
>
> I think you should either branch trunk or roland-uverbs and put all the
> DAPL stuff in there so that it can easily be merged back to trunk when
> it is ready.

  Either branch as Tom suggests, or if you want to maintain the code
against the head of tree I would suggest placing it somewhere besides
the branch directory, such as the gen2/users directory.
(e.g. gen2/users/jlentini)

-Libor

From tduffy at sun.com Tue Mar 29 12:25:44 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 29 Mar 2005 12:25:44 -0800
Subject: [openib-general] Re: [openib-commits] r2063 - in gen2/trunk/src/linux-kernel/infiniband: core include
In-Reply-To: <20050329020650.D176F2283D4@openib.ca.sandia.gov>
References: <20050329020650.D176F2283D4@openib.ca.sandia.gov>
Message-ID: <1112127944.19537.15.camel@duffman>

On Mon, 2005-03-28 at 18:06 -0800, libor at openib.org wrote:
> Initial commit for kernel portion of the userspace CM interface.

I am getting a few compile warnings when compiling for x86_64:

/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/ucm.c: In function ‘ib_ucm_event_req_get’:
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/ucm.c:188: warning: cast from pointer to integer of different size
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/ucm.c: In function ‘ib_ucm_event_sidr_req_get’:
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/ucm.c:251: warning: cast from pointer to integer of different size
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/ucm.c: In function ‘ib_ucm_event_handler’:
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/ucm.c:386: warning: cast from pointer to integer of different size
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/ucm.c:389: warning: cast from pointer to integer of different size
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/ucm.c:392: warning: cast from pointer to integer of different size
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/ucm.c: In function ‘ib_ucm_write’:
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/ucm.c:1243: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/ucm.c:1243: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’

It seems that since context is a void*, if you are using this as an id,
and id is only 32 bits, you may lose information on 64-bit archs.
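
The two kinds of warning listed above (a pointer cast to a narrower integer, and a size_t printed with %d) can be reproduced and avoided in ordinary userspace C. The sketch below is only an illustration, assuming the pointer field carries nothing but a small non-negative id. The helper names are hypothetical, and intptr_t plays the role that a cast through long plays in Linux kernel code, where long is pointer-sized.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stash a small id in a pointer-sized context field (hypothetical helper). */
void *context_from_id(int id)
{
	assert(id >= 0);              /* assumption: ids are never negative */
	return (void *)(intptr_t)id;  /* widen to pointer size, then convert */
}

/* Recover the id: go through a pointer-sized integer before narrowing. */
int id_from_context(void *context)
{
	return (int)(intptr_t)context;
}

/* size_t takes the %zu conversion in C99; %d expects an int. */
int format_len(char *buf, size_t bufsize, size_t len)
{
	return snprintf(buf, bufsize, "len <%zu>", len);
}
```

Casting the pointer directly to int is what triggers "cast from pointer to integer of different size" on 64-bit targets. Going through a pointer-sized integer first makes the narrowing explicit, and it is lossless as long as the stored value fits in an int.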

This simple patch will fix the last warning:

Signed-off-by: Tom Duffy

Index: drivers/infiniband/core/ucm.c
===================================================================
--- drivers/infiniband/core/ucm.c	(revision 2068)
+++ drivers/infiniband/core/ucm.c	(working copy)
@@ -1240,7 +1240,7 @@ static ssize_t ib_ucm_write(struct file
 		return -EFAULT;
 	printk(KERN_ERR "UCM: Write. cmd <%d> in <%d> out <%d> len <%d>\n",
-	       hdr.cmd, hdr.in, hdr.out, len);
+	       hdr.cmd, hdr.in, hdr.out, (int)len);
 	if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucm_cmd_table))
 		return -EINVAL;

From libor at topspin.com Tue Mar 29 12:31:27 2005
From: libor at topspin.com (Libor Michalek)
Date: Tue, 29 Mar 2005 12:31:27 -0800
Subject: [openib-general] Re: [openib-commits] r2063 - in gen2/trunk/src/linux-kernel/infiniband: core include
In-Reply-To: <1112127944.19537.15.camel@duffman>; from tduffy@sun.com on Tue, Mar 29, 2005 at 12:25:44PM -0800
References: <20050329020650.D176F2283D4@openib.ca.sandia.gov> <1112127944.19537.15.camel@duffman>
Message-ID: <20050329123127.C31683@topspin.com>

On Tue, Mar 29, 2005 at 12:25:44PM -0800, Tom Duffy wrote:
> On Mon, 2005-03-28 at 18:06 -0800, libor at openib.org wrote:
> > Initial commit for kernel portion of the userspace CM interface.
>
> I am getting a few compile warnings when compiling x86_64:
>
> It seems that since context is a void*, if you are using this as an id,
> and id is only 32bits, you may lose information on a 64-bit archs.
>
> This simple patch will fix the last warning:

  Hal reported the same issue on ia64. The cast of the id to 32bits
won't lose info, since the assignment to context is from an int. The
original id comes from the idr table and according to idr.c:

 * @id returns a value in the range 0 ...
0x7fffffff

  The context should be cast to long before assigning the value to
an int or u32. Here is the full patch:

Index: infiniband/core/ucm.c
===================================================================
--- infiniband/core/ucm.c	(revision 2063)
+++ infiniband/core/ucm.c	(working copy)
@@ -185,7 +185,7 @@
 static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq,
 				 struct ib_cm_req_event_param *kreq)
 {
-	ureq->listen_id = (int)kreq->listen_id->context;
+	ureq->listen_id = (long)kreq->listen_id->context;
 	ureq->remote_ca_guid = kreq->remote_ca_guid;
 	ureq->remote_qkey = kreq->remote_qkey;
@@ -248,7 +248,7 @@
 static void ib_ucm_event_sidr_req_get(struct ib_ucm_sidr_req_event_resp *ureq,
 				      struct ib_cm_sidr_req_event_param *kreq)
 {
-	ureq->listen_id = (int)kreq->listen_id->context;
+	ureq->listen_id = (long)kreq->listen_id->context;
 	ureq->pkey = kreq->pkey;
 }
@@ -383,13 +383,13 @@
 	 */
 	switch (event->event) {
 	case IB_CM_REQ_RECEIVED:
-		id = (int)event->param.req_rcvd.listen_id->context;
+		id = (long)event->param.req_rcvd.listen_id->context;
 		break;
 	case IB_CM_SIDR_REQ_RECEIVED:
-		id = (int)event->param.sidr_req_rcvd.listen_id->context;
+		id = (long)event->param.sidr_req_rcvd.listen_id->context;
 		break;
 	default:
-		id = (int)cm_id->context;
+		id = (long)cm_id->context;
 		break;
 	}
@@ -1239,7 +1239,7 @@
 	if (copy_from_user(&hdr, buf, sizeof(hdr)))
 		return -EFAULT;
-	printk(KERN_ERR "UCM: Write. cmd <%d> in <%d> out <%d> len <%d>\n",
+	printk(KERN_ERR "UCM: Write.
cmd <%d> in <%d> out <%d> len <%Zu>\n", hdr.cmd, hdr.in, hdr.out, len); if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucm_cmd_table)) From halr at voltaire.com Tue Mar 29 12:39:44 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Mar 2005 15:39:44 -0500 Subject: [openib-general] Re: [openib-commits] r2063 - in gen2/trunk/src/linux-kernel/infiniband: core include In-Reply-To: <20050329123127.C31683@topspin.com> References: <20050329020650.D176F2283D4@openib.ca.sandia.gov> <1112127944.19537.15.camel@duffman> <20050329123127.C31683@topspin.com> Message-ID: <1112128781.4476.0.camel@localhost.localdomain> On Tue, 2005-03-29 at 15:31, Libor Michalek wrote: > Here is the full patch: That worked for me on IA64. Thanks. -- Hal From tduffy at sun.com Tue Mar 29 12:45:19 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 29 Mar 2005 12:45:19 -0800 Subject: [openib-general] Re: [openib-commits] r2063 - in gen2/trunk/src/linux-kernel/infiniband: core include In-Reply-To: <20050329123127.C31683@topspin.com> References: <20050329020650.D176F2283D4@openib.ca.sandia.gov> <1112127944.19537.15.camel@duffman> <20050329123127.C31683@topspin.com> Message-ID: <1112129119.19537.23.camel@duffman> On Tue, 2005-03-29 at 12:31 -0800, Libor Michalek wrote: > Hal reported the same issue on ia64. Oops, sorry, missed that thread. Been on vacation and haven't had a chance to completely catch up. > Here is the full patch: Works for me on x86_64. Thanks, -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From iod00d at hp.com Tue Mar 29 12:54:15 2005
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 29 Mar 2005 12:54:15 -0800
Subject: [openib-general] [BUG] NULL pointer deref in ib_sa_mcmember_rec_callback()
In-Reply-To: <20050329063317.GA21470@esmail.cup.hp.com>
References: <20050329063317.GA21470@esmail.cup.hp.com>
Message-ID: <20050329205415.GI22850@esmail.cup.hp.com>

On Mon, Mar 28, 2005 at 10:33:17PM -0800, Grant Grundler wrote:
> Same config/kernel as before: 2.6.11 + SVN gen2 version 2050.
...
> ip is at ib_sa_mcmember_rec_callback+0x90/0xe0 [ib_sa]
...
>  [] ib_sa_mcmember_rec_callback+0x90/0xe0 [ib_sa]
>                                 sp=e000000101fcfd20 bsp=e000000101fc9020
>  [] send_handler+0x110/0x280 [ib_sa]
>                                 sp=e000000101fcfd70 bsp=e000000101fc8fd0
>  [] ib_mad_complete_send_wr+0x270/0x300 [ib_mad]
>                                 sp=e000000101fcfd70 bsp=e000000101fc8f90
...

Roland,
I can't unravel ia64 asm as well as I'd like. Fortunately, this bit of
the code is pretty straightforward.

+0x60 to +0x8c is the "true" part of this statement:

	if (mad) {
		struct ib_sa_mcmember_rec rec;

		ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table),
			  mad->data, &rec);
		query->callback(status, &rec, query->context);
	} else
		query->callback(status, NULL, query->context);

And +0x90 is the "else" (false) part.

I *think*:
	r32 = *sa_query
	r33 = status
	r34 = *mad

Following asm is from sa_query.o - ie unlinked:

    101c:	04 07 fd 8c                         adds r35=-16,r32
    1020:	0a 40 00 44 09 39 [MMI]             cmp.eq p8,p9=0,r34;;
...
    105c:	40 00 00 43                   (p08) br.cond.dpnt.few 1090
    1060:	11 00 00 00 01 00 [MIB]             nop.m 0x0
			1062: PCREL21B	ib_unpack
    1066:	00 00 00 02 00 00                   nop.i 0x0
    106c:	08 00 00 50                         br.call.sptk.many b0=1060 ;;
...
    1080:	0a 70 00 46 18 10 [MMI]             ld8 r14=[r35];;
    1086:	90 02 0c 30 20 00                   ld8 r41=[r3]
    108c:	00 00 04 00                         nop.i 0x0
==> 1090:	0a 40 20 1c 18 14 [MMI]             ld8 r8=[r14],8;;  <==
    1096:	10 00 38 30 20 e0                   ld8 r1=[r14]
    109c:	80 08 00 07                         mov b7=r8
    10a0:	11 00 00 00 01 00 [MIB]             nop.m 0x0
    10a6:	00 00 00 02 00 00                   nop.i 0x0
(b) 10ac:	78 00 80 10                         br.call.sptk.many b0=b7;;
    10b0:	00 08 00 4c 00 21 [MII]             mov r1=r38
    10b6:	00 28 01 55 00 00                   mov.i ar.pfs=r37
    10bc:	40 0a 00 07                         mov b0=r36
    10c0:	11 60 40 19 00 21 [MIB]             adds r12=80,r12
    10c6:	00 00 00 02 00 80                   nop.i 0x0
    10cc:	08 00 84 00                         br.ret.sptk.many b0;;

My take is (b) is the indirect call with the differences in the parameters
factored out of the if/else statement. +0xcc is the return from
ib_sa_mcmember_rec_callback().

I'm pretty sure +0x90 is trying to reference query->context.
tombstone showed r14 was zero. So it would have blown up regardless
of which path we took. This is the first place query (r14) is dereferenced.

hth,
grant

From tduffy at sun.com Tue Mar 29 14:03:35 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 29 Mar 2005 14:03:35 -0800
Subject: [openib-general] Re: [openib-commits] r2074 - gen2/trunk/src/linux-kernel/infiniband/core
In-Reply-To: <20050329215718.1E3BF22834D@openib.ca.sandia.gov>
References: <20050329215718.1E3BF22834D@openib.ca.sandia.gov>
Message-ID: <1112133815.26801.0.camel@duffman>

On Tue, 2005-03-29 at 13:57 -0800, roland at openib.org wrote:
> Author: roland
> Date: 2005-03-29 13:57:16 -0800 (Tue, 29 Mar 2005)
> New Revision: 2074
>
> Modified:
> gen2/trunk/src/linux-kernel/infiniband/core/mad_rmpp.c
> Log:
> Add include of to fix sparc64 build.

Heh. Just about to submit that patch ;)

-tduffy
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From roland at topspin.com Tue Mar 29 13:53:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 29 Mar 2005 13:53:22 -0800
Subject: [openib-general] [PATCH] fix mad_rmpp build on sparc64
Message-ID: <52acomruel.fsf@topspin.com>

I just committed this trivial patch, which fixes the build on sparc64.

 - R.

Index: infiniband/core/mad_rmpp.c
===================================================================
--- infiniband/core/mad_rmpp.c	(revision 2073)
+++ infiniband/core/mad_rmpp.c	(working copy)
@@ -32,6 +32,8 @@
  * $Id: mad_rmpp.c 1921 2005-03-02 22:58:44Z sean.hefty $
  */
+#include
+
 #include "mad_rmpp.h"
 #include "mad_priv.h"

From mshefty at ichips.intel.com Tue Mar 29 14:08:26 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 29 Mar 2005 14:08:26 -0800
Subject: [openib-general] [PATCH] fix mad_rmpp build on sparc64
In-Reply-To: <52acomruel.fsf@topspin.com>
References: <52acomruel.fsf@topspin.com>
Message-ID: <4249D1DA.1070002@ichips.intel.com>

Roland Dreier wrote:
> I just committed this trivial patch, which fixes the build on sparc64.

Thanks.

From jlentini at netapp.com Tue Mar 29 14:37:34 2005
From: jlentini at netapp.com (James Lentini)
Date: Tue, 29 Mar 2005 17:37:34 -0500 (EST)
Subject: [openib-general] kDAPL location?
In-Reply-To: <20050329121312.B31683@topspin.com>
References: <1112126789.19537.5.camel@duffman> <20050329121312.B31683@topspin.com>
Message-ID:

Tom,

My original plan was to branch as you suggested, but I was warned that
maintaining synchronization with the base code line would be
difficult.

We don't anticipate needing to make substantial driver modifications
for DAPL. I'd be surprised if more than a handful of files (< 10)
changed. Currently only the Makefile and Kconfig need to be updated.
Given that, does Libor's suggestion to place the code in gen2/users make sense or would you still recommend a full branch? james On Tue, 29 Mar 2005, Libor Michalek wrote: > On Tue, Mar 29, 2005 at 12:06:29PM -0800, Tom Duffy wrote: >> On Tue, 2005-03-29 at 10:10 -0500, James Lentini wrote: >>> Should I follow the above plan? Are there any other suggestions? >> >> I think you should either branch trunk or roland-uverbs and put all the >> DAPL stuff in there so that it can easily be merged back to trunk when >> it is ready. > > Either branch as Tom suugests, or if you want to maintain the code > against the head of tree I would suggest placing it somewhere besides > the brach directory, such as the gen2/users directory. > (e.g. gen2/users/jlentini) > > -Libor > From tduffy at sun.com Tue Mar 29 15:48:55 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 29 Mar 2005 15:48:55 -0800 Subject: [openib-general] kDAPL location? In-Reply-To: References: <1112126789.19537.5.camel@duffman> <20050329121312.B31683@topspin.com> Message-ID: <1112140135.7609.5.camel@duffman> On Tue, 2005-03-29 at 17:37 -0500, James Lentini wrote: > Tom, > > My original plan was to branch as you suggested, but I was warned that > maintaining synchronization with the base code line would be > difficult. > > We don't anticipate needing to make substantial driver modifications > for DAPL. I'd be surprised if more than a handful of files (< 10) > changed. Currently only the Makefile and Kconfig need to be updated. > > Given that, does Libor's suggestion to place the code in gen2/users > make sense or would you still recommend a full branch? Really, it is up to you. It makes it easier for *me* to play with DAPL if I can checkout a whole tree into drivers/infiniband that will compile kDAPL properly than to pull this from here and that from there and then patch with this. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From libor at topspin.com Tue Mar 29 15:38:26 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 29 Mar 2005 15:38:26 -0800 Subject: [openib-general] What context can CM be called from? In-Reply-To: <4249907F.8010101@ichips.intel.com>; from mshefty@ichips.intel.com on Tue, Mar 29, 2005 at 09:29:35AM -0800 References: <506C3D7B14CDD411A52C00025558DED6064BF21D@mtlex01.yok.mtl.com> <52u0muqvt7.fsf@topspin.com> <4249907F.8010101@ichips.intel.com> Message-ID: <20050329153826.D31683@topspin.com> On Tue, Mar 29, 2005 at 09:29:35AM -0800, Sean Hefty wrote: > Roland Dreier wrote: > > Tziporet> Hi, If I remember correctly the verbs of create & > > Tziporet> destroy AVs should be enabled from interrupt context too > > Tziporet> since they are not privileged verbs. In VAPI we > > Tziporet> implemented these verbs in this way and I think it is > > Tziporet> important to keep it this way. > > > > Yes, you're correct that the AV verbs are not privileged according to > > the table in chapter 11 of the IB spec. I'm not sure that this > > requires that they must be available from interrupt context but it is > > reasonable for us to choose the policy that all non-privileged verbs > > may be called from interrupt context. > > > > Fixing mthca to allow the AH verbs to be callable from interrupt > > context is easy -- the trivial patch is included below. > > > > I'm not sure if this removes all obstructions to the CM being usable > > from interrupt context. > > With this patch, changing the kmalloc in cm_alloc_msg() to use > GFP_ATOMIC rather than GFP_KERNEL should allow the CM to be usable from > interrupt context. Of course, I haven't actually tested this... > > I have no objection to this change however. I could go either way on this issue myself. 
If the call can only be made from thread context I will use schedule_work() to execute the request to send the dreq. However, I would imagine that other CM users would want to send requests in interrupt context... -Libor From mshefty at ichips.intel.com Tue Mar 29 16:05:04 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 29 Mar 2005 16:05:04 -0800 Subject: [openib-general] What context can CM be called from? In-Reply-To: <20050329153826.D31683@topspin.com> References: <506C3D7B14CDD411A52C00025558DED6064BF21D@mtlex01.yok.mtl.com> <52u0muqvt7.fsf@topspin.com> <4249907F.8010101@ichips.intel.com> <20050329153826.D31683@topspin.com> Message-ID: <4249ED30.3060208@ichips.intel.com> Libor Michalek wrote: >>With this patch, changing the kmalloc in cm_alloc_msg() to use >>GFP_ATOMIC rather than GFP_KERNEL should allow the CM to be usable from >>interrupt context. Of course, I haven't actually tested this... >> >>I have no objection to this change however. > > I could go either way on this issue myself. If the call can only be > made from thread context I will use schedule_work() to execute the > request to send the dreq. However, I would imagine that other CM users > would want to send requests in interrupt context... My intention was that the CM should be able to match the calling conventions of underlying verbs/mad layer routines, except for the destroy_cm_id call that may block. It should be easy enough to at least test whether the code works at interrupt with these changes, and if not, then call schedule_work until we can identify why not and see if other changes can be made to support it. 

- Sean

From iod00d at hp.com Tue Mar 29 16:07:25 2005
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 29 Mar 2005 16:07:25 -0800
Subject: [openib-general] BUG ipoib oops in iptables/netfilter
In-Reply-To: <1112115318.4646.50.camel@localhost.localdomain>
References: <20050329054158.GA21304@esmail.cup.hp.com> <52fyyesdxz.fsf@topspin.com> <20050329165511.GA22850@esmail.cup.hp.com> <1112115318.4646.50.camel@localhost.localdomain>
Message-ID: <20050330000725.GR22850@esmail.cup.hp.com>

On Tue, Mar 29, 2005 at 11:55:18AM -0500, Hal Rosenstock wrote:
> > > BTW I'm not sure running traffic through netfilter is going to give
> > > you the best possible performance.
> >
> > I agree. I've disabled it.
>
> The latter (disabling netfilter/iptables) should be done after verifying
> that the crash goes away with the ipoib_ib.c change.

Sorry, I just assumed Roland had the right fix since it's not urgent
for me. I've clobbered the current binaries and will have to circle
around to this again in the future. :^(

I will re-enable netfilter on this machine. It's a matter of time.
This ia64 box is my backup in case the normal NAT/gateway falls over.

grant

From libor at topspin.com Tue Mar 29 16:05:19 2005
From: libor at topspin.com (Libor Michalek)
Date: Tue, 29 Mar 2005 16:05:19 -0800
Subject: [openib-general] Re: [Andrew Morton] inappropriate use of in_atomic()
In-Reply-To: <20050311154316.E31689@topspin.com>; from libor@topspin.com on Fri, Mar 11, 2005 at 03:43:16PM -0800
References: <52oedq946k.fsf@topspin.com> <20050311073108.GA20989@mellanox.co.il> <20050311154316.E31689@topspin.com>
Message-ID: <20050329160519.E31683@topspin.com>

On Fri, Mar 11, 2005 at 03:43:16PM -0800, Libor Michalek wrote:
> On Fri, Mar 11, 2005 at 09:31:08AM +0200, Michael S. Tsirkin wrote:
> >
> > Sdp also has a couple of uses.
> > Maybe we can use the atomic branch in all cases here, as well?
> > Libor?
>
> Yes, the case in sdp_iocb.c can probably always take the atomic
> path.
The kmap/kunmap cases really only care whether we're in an > interrupt, so switching to in_interrupt() should be sufficient. Patch to remove in_atomic in sdp_iocb.c for spawning a thread to execute iocb completion. Instead we always spawn the completion. -Libor Index: sdp_iocb.c =================================================================== --- sdp_iocb.c (revision 2071) +++ sdp_iocb.c (working copy) @@ -437,7 +437,7 @@ * register IOCBs physical memory */ iocb->mem = ib_fmr_pool_map_phys(conn->fmr_pool, - (u64 *)iocb->addr_array, + iocb->addr_array, iocb->page_count, &iocb->io_addr); if (IS_ERR(iocb->mem)) { @@ -539,11 +539,8 @@ { iocb->status = status; - if (in_atomic() || irqs_disabled()) { - INIT_WORK(&iocb->completion, do_iocb_complete, (void *)iocb); - schedule_work(&iocb->completion); - } else - do_iocb_complete(iocb); + INIT_WORK(&iocb->completion, do_iocb_complete, (void *)iocb); + schedule_work(&iocb->completion); return 0; } From libor at topspin.com Tue Mar 29 16:13:53 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 29 Mar 2005 16:13:53 -0800 Subject: [openib-general] [PATCH][SDP] iocb send path bug. Message-ID: <20050329161353.F31683@topspin.com> Simple patch for a data path bug when there is a size mismatch between the AIO read and write buffers of two connection peers. Without the patch the send path will process the remains of an active sdpc_iocb which has a RDMA write already in progress. When the RDMA write completes the sdpc_iocb will not be available for post processing. With the patch the send data path stalls until either the write RDMA completes or a new sink advertisement arives. -Libor Index: sdp_send.c =================================================================== --- sdp_send.c (revision 2071) +++ sdp_send.c (working copy) @@ -878,7 +878,8 @@ * hope that a new sink advertisment will arrive, because * sinks are more efficient. 
*/ - if (sdp_desc_q_size(&conn->w_snk) > 0) + if (sdp_desc_q_size(&conn->w_snk) || + iocb->flags & SDP_IOCB_F_RDMA_W) goto done; if (conn->src_zthresh > iocb->len || From libor at topspin.com Tue Mar 29 16:56:36 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 29 Mar 2005 16:56:36 -0800 Subject: [openib-general] [PATCH][SDP] iocb recv path bug. In-Reply-To: <20050329161353.F31683@topspin.com>; from libor@topspin.com on Tue, Mar 29, 2005 at 04:13:53PM -0800 References: <20050329161353.F31683@topspin.com> Message-ID: <20050329165636.G31683@topspin.com> Patch for a receive data path bug where the iocb length was going negative, and the iocb was not getting completed when it should have. Instead it was generating a connection abort later in the pipe. The patch makes sure that the data being copied from a buffer to an iocb is bounded by the size of the iocb as well as the buffer. -Libor Index: sdp_recv.c =================================================================== --- sdp_recv.c (revision 2071) +++ sdp_recv.c (working copy) @@ -670,7 +670,7 @@ copy = min((PAGE_SIZE - offset), (unsigned long)(buff->tail - buff->data)); - + copy = min(iocb->len, copy); #ifndef _SDP_DATA_PATH_NULL memcpy((addr + offset), buff->data, copy); #endif @@ -1459,7 +1459,7 @@ iocb->req = req; iocb->key = req->ki_key; iocb->addr = (unsigned long)msg->msg_iov->iov_base; - + req->ki_cancel = sdp_inet_read_cancel; result = sdp_iocb_lock(iocb); From iod00d at hp.com Tue Mar 29 17:08:14 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 29 Mar 2005 17:08:14 -0800 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <20050328170351.B30499@topspin.com> References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> Message-ID: <20050330010814.GB24794@esmail.cup.hp.com> On Mon, Mar 28, 2005 at 05:03:51PM -0800, Libor Michalek wrote: > I haven't looked closely at the code yet, but I did try it out > with SDP/AIO on a pair of x86 systems with 
Tavors and a pair of > x86_64 systems with Arbels. With a small change to core/fmr_pool.c > and enabling pool creation in SDP it worked as expected. Here are > throughput results: > > x86 x86_64 > -------- -------- > SDP sync 610 MB/s 710 MB/s > SDP async (hit) 740 MB/s 910 MB/s > SDP async (miss) 590 MB/s 910 MB/s Libor, How did you generate the above numbers? netpipe? I'd like to add the "HP ZX1" (and maybe parisc) column. BTW, I've got several systems loaded with: gsyprf3:~# lsmod Module Size Used by ib_sdp 225032 0 ib_cm 86032 1 ib_sdp ib_sa 23980 1 ib_sdp ib_mthca 211335 0 ib_mad 82808 3 ib_cm,ib_sa,ib_mthca ib_core 85288 5 ib_sdp,ib_cm,ib_sa,ib_mthca,ib_mad ... and zero clue how to get it to talk to another system. ie "ifconfig -a" isn't listing any new interfaces. :^( The most recent "How-To test SDP" for gen1 posted on Jun 17, 2004: http://openib.org/pipermail/openib-general/2004-June/002892.html It references libsdp.so which doesn't seem to exist in gen2. Has anyone written an update for gen2? thanks, grant From libor at topspin.com Tue Mar 29 18:12:28 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 29 Mar 2005 18:12:28 -0800 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <20050330010814.GB24794@esmail.cup.hp.com>; from iod00d@hp.com on Tue, Mar 29, 2005 at 05:08:14PM -0800 References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> <20050330010814.GB24794@esmail.cup.hp.com> Message-ID: <20050329181228.H31683@topspin.com> On Tue, Mar 29, 2005 at 05:08:14PM -0800, Grant Grundler wrote: > On Mon, Mar 28, 2005 at 05:03:51PM -0800, Libor Michalek wrote: > > I haven't looked closely at the code yet, but I did try it out > > with SDP/AIO on a pair of x86 systems with Tavors and a pair of > > x86_64 systems with Arbels. With a small change to core/fmr_pool.c > > and enabling pool creation in SDP it worked as expected. 
Here are
> > throughput results:
> >
> >                      x86        x86_64
> >                    --------    --------
> > SDP sync           610 MB/s    710 MB/s
> > SDP async (hit)    740 MB/s    910 MB/s
> > SDP async (miss)   590 MB/s    910 MB/s
>
> Libor,
> How did you generate the above numbers? netpipe?
>
> I'd like to add the "HP ZX1" (and maybe parisc) column.

I used ttcp which was recompiled to use the SDP protocol family, and
a modified ttcp for the async numbers. The modified ttcp replaced the
socket send/recv system calls with socket AIO io_submit/io_getevents
system calls.

The recompile modifications to use the SDP protocol family for regular
ttcp are pretty straightforward:

#include
#undef AF_INET
#define AF_INET AF_INET_SDP

I could check in the source for the modified ttcp's but I'm not sure
exactly where... (gen2/users/libor ???)

> BTW, I've got several systems loaded with:
>
> and zero clue how to get it to talk to another system.
> ie "ifconfig -a" isn't listing any new interfaces. :^(

That's pretty much all you need; once ipoib is configured, a socket
created with the SDP protocol family can connect/bind using the
addresses of the ipoib interfaces as you would for a TCP socket.

> The most recent "How-To test SDP" for gen1 posted on Jun 17, 2004:
> http://openib.org/pipermail/openib-general/2004-June/002892.html
>
> It references libsdp.so which doesn't seem to exist in gen2.
> Has anyone written an update for gen2?

MST checked libsdp into the gen2 tree: gen2/trunk/src/userspace/libsdp
which is great for using SDP with application binaries that you do not
want to modify. However, for the async numbers you need a program that's
using AIO for network sockets, of which I have a few...
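A minimal sketch of what the AF_INET override above amounts to at the socket() call, for anyone who would rather not redefine AF_INET globally. Assumptions, stated loudly: AF_INET_SDP is taken to be 27 here (the family number ib_sdp reports when it loads; check the headers of your own SDP stack), the helper names pick_family and open_stream_socket are made up for illustration, and the fallback to plain TCP is an addition of this sketch rather than something ttcp does.

```c
#include <sys/socket.h>

/* Assumption: SDP's address family number. Verify against the headers
 * that ship with your SDP stack before relying on this value. */
#ifndef AF_INET_SDP
#define AF_INET_SDP 27
#endif

/* Hypothetical helper: choose the family handed to socket(). Only
 * AF_INET is redirected, mirroring "#define AF_INET AF_INET_SDP". */
int pick_family(int family, int use_sdp)
{
    if (use_sdp && family == AF_INET)
        return AF_INET_SDP;
    return family;
}

/* Hypothetical helper: open a stream socket, preferring SDP. Falling
 * back to TCP when the SDP family is not registered is this sketch's
 * own addition. Returns a file descriptor, or -1 on failure. */
int open_stream_socket(int use_sdp)
{
    int fd = socket(pick_family(AF_INET, use_sdp), SOCK_STREAM, 0);

    if (fd < 0 && use_sdp)
        fd = socket(AF_INET, SOCK_STREAM, 0);
    return fd;
}
```

The rest of the program stays as it is: bind/connect still take ordinary sockaddr_in addresses, which in this setup are the addresses of the ipoib interfaces.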
-Libor From iod00d at hp.com Tue Mar 29 19:29:04 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 29 Mar 2005 19:29:04 -0800 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <20050329181228.H31683@topspin.com> References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> <20050330010814.GB24794@esmail.cup.hp.com> <20050329181228.H31683@topspin.com> Message-ID: <20050330032904.GA24936@esmail.cup.hp.com> On Tue, Mar 29, 2005 at 06:12:28PM -0800, Libor Michalek wrote: > The recompile modifications to use SDP protocol familt for regular > ttcp are pretty straight forward: > > #include > #undef AF_INET > #define AF_INET AF_INET_SDP Yes, that's pretty easy. > I could checkin the source for the modified ttcp's but I'm not sure > exactly where... (gen2/users/libor ???) I wouldn't bother for the "#define AF_INET AF_INET_SDP" above. If the AIO version is more complicated, maybe create an "examples" directory under gen2/src/userspace? [ list of modules deleted ] > That's pretty much all you need, once ipoib is configure a socket > created with the SDP protocol family can connect/bind using the > addresses of the ipoib interfaces as you would for a TCP socket. Ah! *click* ib_ipoib is missing! For whatever reason I didn't realize ib_sdp only adds the alternative net family. Yes, I did notice the "NET: Registered protocol family 27" when loading ib_sdp. I had associated ip_ipoib with providing TCP/IP but of course it does NOT. ipoib *uses* the existing stack for AF_INET. Doh! Feeling a little slow these days... > > The most recent "How-To test SDP" for gen1 posted on Jun 17, 2004: > > http://openib.org/pipermail/openib-general/2004-June/002892.html > > > > It references libsdp.so which doesn't seem to exist in gen2. > > Has anyone written an update for gen2? > > MST checked libsdp into the gen2 tree: gen2/trunk/src/userspace/libsdp sorry...my bad. usespace part of my local tree was quite stale. "svn up" fixed that. 
Thanks for patiently pointing out the obvious.

> Which is great for using SDP with application binaries that you do not
> want to modify. However, for the async numbers you need a program that's
> using AIO for network sockets, of which I have a few...

More fodder for "examples" someplace in gen2 tree?
Or perhaps links to the right projects (e.g. netperf.org) in a
userspace/examples/README file?
Or maybe more obvious here -> http://www.openib.org/doc.html ?

I just stumbled across links to http://www.web-polygraph.org/ again.
That's not something I use every day... Pointers to different tests
that are relevant to evaluating IB are helpful. If someone starts it,
I'll contribute to the list.

thanks,
grant

From roland at topspin.com Tue Mar 29 20:53:58 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 29 Mar 2005 20:53:58 -0800
Subject: [openib-general] [PATCH] FMR support in mthca
In-Reply-To: <20050327153112.GA26108@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 27 Mar 2005 17:31:13 +0200")
References: <20050327153112.GA26108@mellanox.co.il>
Message-ID: <52mzslraxl.fsf@topspin.com>

Thanks, I applied this in a few pieces with some minor cleanups.

As Libor reported, SDP is working and getting the expected good
AIO performance with this code.

Thanks,
Roland

From abhijitngpune at indiatimes.com Tue Mar 29 20:52:01 2005
From: abhijitngpune at indiatimes.com (abhijitngpune)
Date: Wed, 30 Mar 2005 10:22:01 +0530
Subject: [openib-general] OpenSM
Message-ID: <200503300430.KAA16369@WS0005.indiatimes.com>

hi,
Is there any open source subnet simulator? Where can I get it?

Abhijeet

"shaharf" wrote:

Hi abhijitngpune,

OpenSM does not care about the topology of the network. Every connected
graph is valid for it. BTW, a fat tree can have cycles too.
If I don't err, the algorithm used by OpenSM is a variation of a
well-known graph algorithm invented by Dijkstra, or based on one of
Dijkstra's (I hope I write his name correctly) algorithms. You can find
these algorithms in any graph theory textbook - look for "find all
shortest paths" algorithms (for example:
http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/dijkstra.html ).

Very briefly, the algorithm that OpenSM uses goes like this:

1. All switches learn about themselves (hop 0) and any directly
   connected hosts (hop 1). They keep this information in a forwarding
   table that contains (schematically) the following information (the
   actual details are a bit more complicated, to be able to support
   multipathing): LID (local identifier), out-port, hops.

2. Now you start the hop>1 learning phase, which uses several passes
   over the switches. On every single pass, you go over all switches
   (the order does not matter) and within each switch you examine every
   directly attached switch, called a "neighbor". For every such
   neighbor you compare your forwarding table to the neighbor's table.
   If the neighbor's hop count to a LID, plus 1 (for the extra hop
   between you and the neighbor switch), is less than your own hop count
   to that LID, you change your table entry to route that LID through
   the connecting port.

3. You repeat the above process until no table is changed during a
   complete pass, or until a set number of switch passes has been done.

The correctness of this algorithm is left to the reader ;-)

It seems that you are using the gen1 stack and OpenSM. Please be aware
that the gen1 tree is not supported any more. Please use gen2. The
OpenSM Tcl extension is not supported on gen2 and I don't know of any
plans to support it.

Regarding the topology example - any connected graph will do. I guess
that most connected graphs are very inefficient traffic-wise, but still
all of them are valid. Demonstrating that a topology is configured
correctly is a bit of a problem.
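The learning passes in steps 1-3 above can be sketched roughly as follows. This is a toy model under stated assumptions, not OpenSM's code: the fabric holds switches only, each switch index doubles as its own destination LID, hosts and multipathing are left out, and every name (struct fabric, fabric_route, and so on) is invented for the illustration. What it does show is the relaxation rule itself: a switch adopts a neighbor's route whenever the neighbor's hop count plus one beats its own.

```c
#include <string.h>

#define MAX_SW 8          /* capacity of this toy model */
#define UNREACHABLE 255   /* "no route known yet" */

/* Hypothetical model: switches only, switch index == destination LID. */
struct fabric {
    int n;                               /* number of switches in use */
    unsigned char link[MAX_SW][MAX_SW];  /* link[a][b] != 0: a and b are cabled */
    unsigned char hops[MAX_SW][MAX_SW];  /* hops[s][d]: best hop count from s to d */
    int via[MAX_SW][MAX_SW];             /* via[s][d]: neighbor s forwards to, -1 if none */
};

/* Step 1: every switch knows only itself, at hop 0. */
void fabric_init(struct fabric *f, int n)
{
    int s, d;

    memset(f, 0, sizeof(*f));
    f->n = n;
    for (s = 0; s < n; s++) {
        for (d = 0; d < n; d++) {
            f->hops[s][d] = (s == d) ? 0 : UNREACHABLE;
            f->via[s][d] = -1;
        }
    }
}

/* Cable two switches together (links are bidirectional). */
void fabric_connect(struct fabric *f, int a, int b)
{
    f->link[a][b] = 1;
    f->link[b][a] = 1;
}

/* Steps 2 and 3: sweep all switches, compare each table with every
 * neighbor's table, and stop after a pass that changes nothing or after
 * max_passes. Returns the number of passes made. */
int fabric_route(struct fabric *f, int max_passes)
{
    int pass = 0, changed = 1, s, nb, d;

    while (changed && pass < max_passes) {
        changed = 0;
        for (s = 0; s < f->n; s++) {
            for (nb = 0; nb < f->n; nb++) {
                if (!f->link[s][nb])
                    continue;
                for (d = 0; d < f->n; d++) {
                    if (f->hops[nb][d] == UNREACHABLE)
                        continue;
                    /* neighbor's hops + 1 for the link between us */
                    if (f->hops[nb][d] + 1 < f->hops[s][d]) {
                        f->hops[s][d] = (unsigned char)(f->hops[nb][d] + 1);
                        f->via[s][d] = nb;
                        changed = 1;
                    }
                }
            }
        }
        pass++;
    }
    return pass;
}
```

Because the rule only ever lowers a hop count, a cycle in the graph does no harm: a ring of four switches, for instance, settles with opposite corners two hops apart, which is one way to see why any connected graph is acceptable.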
If you are willing to spend some effort, you can use the topology
simulator released with Mellanox Gold - look for the IBADM package. This
stuff is not very well documented but it should be usable. Mellanox has
released (or is about to release) a real subnet simulator on top of which
you can run OpenSM. Using this simulator you can test any arbitrary
topology. The problem is that you have to port this simulator to gen2.
Any volunteers are welcome...

Shahar

From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of abhijitngpune
Sent: Tuesday, March 29, 05 3:09 PM
To: openib-general at openib.org
Subject: [openib-general] OpenSM

Hi all,

I am new to InfiniBand and related issues. I have a few doubts related
to OpenSM.

1. How does OpenSM support non-fat-tree topologies (graphs having
   cycles)? (Any research paper will do.)
2. Given a graph topology that contains cycles, how can I demonstrate
   that the subnet manager is working for this topology?
3. What is the OpenSM Tcl extension used for? Does anybody have example
   code for a particular (irregular / non-fat-tree) topology?

Abhijeet

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mst at mellanox.co.il Tue Mar 29 21:44:08 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 30 Mar 2005 07:44:08 +0200
Subject: [openib-general] [PATCH] FMR support in mthca
In-Reply-To: <52mzslraxl.fsf@topspin.com>
References: <20050327153112.GA26108@mellanox.co.il> <52mzslraxl.fsf@topspin.com>
Message-ID: <20050330054408.GA26239@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [openib-general] [PATCH] FMR support in mthca
>
> Thanks, I applied this in a few pieces with some minor cleanups.
>
> As Libor reported, SDP is working and getting the expected good
> AIO performance with this code.
> > Thanks, > Roland > What benchmark do you use to measure AIO performance? -- MST - Michael S. Tsirkin From iod00d at hp.com Tue Mar 29 22:10:31 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 29 Mar 2005 22:10:31 -0800 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <20050330054408.GA26239@mellanox.co.il> References: <20050327153112.GA26108@mellanox.co.il> <52mzslraxl.fsf@topspin.com> <20050330054408.GA26239@mellanox.co.il> Message-ID: <20050330061031.GA25567@esmail.cup.hp.com> On Wed, Mar 30, 2005 at 07:44:08AM +0200, Michael S. Tsirkin wrote: > > As Libor reported, SDP is working and getting the expected good > > AIO performance with this code. > What benchmark do you use to measure AIO performance? I just asked libor as well. It was a hacked version of ttcp: http://openib.org/pipermail/openib-general/2005-March/009791.html grant From eitan at mellanox.co.il Tue Mar 29 22:12:01 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 30 Mar 2005 08:12:01 +0200 Subject: [openib-general] OpenSM Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF029@mtlex01.yok.mtl.com> Abhijeet > Is there any opensource subnet simulator? Where can i get it? https://openib.org/svn/gen2/utils/src/linux-user/ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Wed Mar 30 00:48:04 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 30 Mar 2005 10:48:04 +0200 Subject: [openib-general] [PATCH] uverbs pingpong test Message-ID: <20050330084804.GB26964@mellanox.co.il> Hi, Roland! I've run into a problem with the pingpong test: sometimes I am getting completions with error for the first message it sends. My analysis is that it takes time for the QP to get to RTR, and if one side is already in RTS it can start sending and will get an error. 
Therefore we must re-synchronise both sides over a socket (by calling pp_client_exch_dest) after QP is in RTS on both sides. Reusing pp_client_exch_dest here is a bit ugly (since we already have all the data). Let me know what do you think. The problem goes away after applying the following patch. Signed-off-by: Michael S. Tsirkin Index: pingpong.c =================================================================== --- pingpong.c (revision 2064) +++ pingpong.c (working copy) @@ -427,7 +427,7 @@ int main(int argc, char *argv[]) struct ibv_device *ib_dev; struct pingpong_context *ctx; struct pingpong_dest my_dest; - struct pingpong_dest *rem_dest; + struct pingpong_dest *rem_dest, *tmp; struct timeval start, end; char *ib_devname = NULL; char *servername = NULL; @@ -565,6 +565,17 @@ int main(int argc, char *argv[]) if (pp_connect_ctx(ctx, ib_port, my_dest.psn, rem_dest)) return 1; + /* Resynch to make sure both sides are in RTR */ + if (servername) + tmp = pp_client_exch_dest(servername, port, &my_dest); + else + tmp = pp_server_exch_dest(port, &my_dest); + + if (!tmp) + return 1; + + free(tmp); + if (use_event) if (ibv_req_notify_cq(ctx->cq, 0)) { fprintf(stderr, "Couldn't request CQ notification\n"); -- MST - Michael S. Tsirkin From shaharf at voltaire.com Wed Mar 30 01:01:00 2005 From: shaharf at voltaire.com (shaharf) Date: Wed, 30 Mar 2005 11:01:00 +0200 Subject: [openib-general] OpenSM Message-ID: As I already mentioned, Mellanox released its Simulator. You can find it in https://openib.org/svn/gen2/utils/src/linux-user/IBMgtSim/ . The problem is that this Simulator is build to work with the Mellanox Gold environment. So either you get use Mellanox gold (that is based on gen1) or port the Simulator to gen2. I am not aware to any other opensource simulators. 
Shahar

________________________________

From: abhijitngpune [mailto:abhijitngpune at indiatimes.com]
Sent: Wednesday, March 30, 2005 6:52 AM
To: shaharf
Cc: openib-general at openib.org
Subject: Re: RE: [openib-general] OpenSM

hi,
Is there any open source subnet simulator? Where can I get it?

Abhijeet

"shaharf" wrote:

Hi abhijitngpune,

OpenSM does not care about the topology of the network. Every connected
graph is valid for it. BTW, a fat tree can have cycles too.

If I don't err, the algorithm used by OpenSM is a variation of a
well-known graph algorithm invented by Dijkstra, or based on one of
Dijkstra's (I hope I write his name correctly) algorithms. You can find
these algorithms in any graph theory textbook - look for "find all
shortest paths" algorithms (for example:
http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/dijkstra.html ).

Very briefly, the algorithm that OpenSM uses goes like this:

1. All switches learn about themselves (hop 0) and any directly
   connected hosts (hop 1). They keep this information in a forwarding
   table that contains (schematically) the following information (the
   actual details are a bit more complicated, to be able to support
   multipathing): LID (local identifier), out-port, hops.

2. Now you start the hop>1 learning phase, which uses several passes
   over the switches. On every single pass, you go over all switches
   (the order does not matter) and within each switch you examine every
   directly attached switch, called a "neighbor". For every such
   neighbor you compare your forwarding table to the neighbor's table.
   If the neighbor's hop count to a LID, plus 1 (for the extra hop
   between you and the neighbor switch), is less than your own hop count
   to that LID, you change your table entry to route that LID through
   the connecting port.

3. You repeat the above process until no table is changed during a
   complete pass, or until a set number of switch passes has been done.

The correctness of this algorithm is left to the reader ;-)

It seems that you are using the gen1 stack and OpenSM.
Please be aware that the gen1 tree is not supported any more. Please use
gen2. The OpenSM Tcl extension is not supported on gen2 and I don't know
of any plans to support it.

Regarding the topology example - any connected graph will do. I guess
that most connected graphs are very inefficient traffic-wise, but still
all of them are valid. Demonstrating that a topology is configured
correctly is a bit of a problem. If you are willing to spend some
effort, you can use the topology simulator released with Mellanox Gold -
look for the IBADM package. This stuff is not very well documented but
it should be usable. Mellanox has released (or is about to release) a
real subnet simulator on top of which you can run OpenSM. Using this
simulator you can test any arbitrary topology. The problem is that you
have to port this simulator to gen2. Any volunteers are welcome...

Shahar

________________________________

From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of abhijitngpune
Sent: Tuesday, March 29, 05 3:09 PM
To: openib-general at openib.org
Subject: [openib-general] OpenSM

Hi all,

I am new to InfiniBand and related issues. I have a few doubts related
to OpenSM.

1. How does OpenSM support non-fat-tree topologies (graphs having
   cycles)? (Any research paper will do.)
2. Given a graph topology that contains cycles, how can I demonstrate
   that the subnet manager is working for this topology?
3. What is the OpenSM Tcl extension used for? Does anybody have example
   code for a particular (irregular / non-fat-tree) topology?

Abhijeet

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From mst at mellanox.co.il Wed Mar 30 01:08:31 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 30 Mar 2005 11:08:31 +0200 Subject: [openib-general] Re: BUG ipoib oops in iptables/netfilter In-Reply-To: <20050329054158.GA21304@esmail.cup.hp.com> References: <20050329054158.GA21304@esmail.cup.hp.com> Message-ID: <20050330090831.GD26964@mellanox.co.il> Quoting r. Grant Grundler : > FWIW, default IPoIB perf is pathetic: ~1.5-1.6Gb/s > with the above netperf command line. I think these systems have IOMMU, do they not? If so, could the fact that IPoIB calls dma_map_single for each packet be the reason it is slow? Maybe dma_map_single is slow? -- MST - Michael S. Tsirkin From halr at voltaire.com Wed Mar 30 04:48:54 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Mar 2005 07:48:54 -0500 Subject: [openib-general] [PATCH] [RMPP] receive RMPP support In-Reply-To: <424853B6.2090809@ichips.intel.com> References: <20050318182430.0e6e8a16.mshefty@ichips.intel.com> <424853B6.2090809@ichips.intel.com> Message-ID: <1112186934.4495.12.camel@localhost.localdomain> On Mon, 2005-03-28 at 13:57, Sean Hefty wrote: > I've committed the RMPP support after updating the code based on the > received comments. This patch appears to break things that require local completion handling (OpenSM and diags). Not sure why yet. The quick workaround is to revert mad.c prior to the RMPP change (svn update mad.c -r 2055) and remove mad_rmpp.o from the core/Makefile temporarily. -- Hal From halr at voltaire.com Wed Mar 30 05:45:01 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Mar 2005 08:45:01 -0500 Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB Message-ID: <1112190300.4495.67.camel@localhost.localdomain> I just spent a little time porting ISC DHCP-3.0.2 to support DHCP over Infiniband per IETF Internet Draft draft-ietf-ipoib-dhcp-over-infiniband-09.txt. 
There are a couple of minor things to verify before making this available to dhcp-hackers at isc.org. I wanted to make this available to this community first. Any feedback is welcome. -- Hal diff -urN dhcp-3.0.2.orig/common/bpf.c dhcp-3.0.2/common/bpf.c --- dhcp-3.0.2.orig/common/bpf.c 2004-11-24 12:39:15.000000000 -0500 +++ dhcp-3.0.2/common/bpf.c 2005-03-25 15:38:57.000000000 -0500 @@ -194,6 +194,41 @@ BPF_STMT(BPF_RET+BPF_K, 0), }; +#if defined (HAVE_IPOIB_SUPPORT) +/* Packet filter program... + XXX Changes to the filter program may require changes to the constant + offsets used in if_register_send to patch the BPF program! XXX */ + +struct bpf_insn dhcp_bpf_ipoib_filter [] = { + /* Make sure this is an IP packet... */ + BPF_STMT (BPF_LD + BPF_H + BPF_ABS, 0), + BPF_JUMP (BPF_JMP + BPF_JEQ + BPF_K, ETHERTYPE_IP, 0, 8), + + /* Make sure it's a UDP packet... */ + BPF_STMT (BPF_LD + BPF_B + BPF_ABS, 13), + BPF_JUMP (BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_UDP, 0, 6), + + /* Make sure this isn't a fragment... */ + BPF_STMT(BPF_LD + BPF_H + BPF_ABS, 10), + BPF_JUMP(BPF_JMP + BPF_JSET + BPF_K, 0x1fff, 4, 0), + + /* Get the IP header length... */ + BPF_STMT (BPF_LDX + BPF_B + BPF_MSH, 4), + + /* Make sure it's to the right port... */ + BPF_STMT (BPF_LD + BPF_H + BPF_IND, 6), + BPF_JUMP (BPF_JMP + BPF_JEQ + BPF_K, 67, 0, 1), /* patch */ + + /* If we passed all the tests, ask for the whole packet. */ + BPF_STMT(BPF_RET+BPF_K, (u_int)-1), + /* Otherwise, drop it. 
*/ + BPF_STMT(BPF_RET+BPF_K, 0), +}; + +int dhcp_bpf_ipoib_filter_len = (sizeof dhcp_bpf_ipoib_filter / + sizeof (struct bpf_insn)); +#endif + #if defined (DEC_FDDI) struct bpf_insn *bpf_fddi_filter; #endif diff -urN dhcp-3.0.2.orig/common/conflex.c dhcp-3.0.2/common/conflex.c --- dhcp-3.0.2.orig/common/conflex.c 2004-11-24 12:39:15.000000000 -0500 +++ dhcp-3.0.2/common/conflex.c 2005-03-26 07:49:40.000000000 -0500 @@ -740,6 +740,8 @@ return IS; if (!strcasecmp (atom + 1, "gnore")) return IGNORE; + if (!strcasecmp (atom + 1, "poib")) + return IPOIB; break; case 'k': if (!strncasecmp (atom + 1, "nown", 4)) { diff -urN dhcp-3.0.2.orig/common/discover.c dhcp-3.0.2/common/discover.c --- dhcp-3.0.2.orig/common/discover.c 2004-06-10 13:59:16.000000000 -0400 +++ dhcp-3.0.2/common/discover.c 2005-03-25 16:04:41.000000000 -0500 @@ -484,6 +484,14 @@ memcpy (&tmp -> hw_address.hbuf [1], sa.sa_data, 16); break; +#ifndef HAVE_ARPHRD_IPOIB +# define ARPHRD_IPOIB HTYPE_IPOIB +#endif + case ARPHRD_IPOIB: + tmp -> hw_address.hlen = 1; + tmp -> hw_address.hbuf [0] = HTYPE_IPOIB; + break; + #ifdef HAVE_ARPHRD_METRICOM case ARPHRD_METRICOM: tmp -> hw_address.hlen = 7; diff -urN dhcp-3.0.2.orig/common/ipoib.c dhcp-3.0.2/common/ipoib.c --- dhcp-3.0.2.orig/common/ipoib.c 1969-12-31 19:00:00.000000000 -0500 +++ dhcp-3.0.2/common/ipoib.c 2005-03-29 23:45:52.000000000 -0500 @@ -0,0 +1,110 @@ +/* ipoib.c + + Packet assembly code, originally contributed by Archie Cobbs. */ + +/* + * Copyright (c) 2004 by Internet Systems Consortium, Inc. ("ISC") + * Copyright (c) 1996-2003 by Internet Software Consortium + * + * Permission to use, copy, modify, and distribute this software for any + * purpose with or without fee is hereby granted, provided that the above + * copyright notice and this permission notice appear in all copies. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS" AND ISC DISCLAIMS ALL WARRANTIES + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL ISC BE LIABLE FOR + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT + * OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. + * + * Internet Systems Consortium, Inc. + * 950 Charter Street + * Redwood City, CA 94063 + * + * http://www.isc.org/ + * + * This software has been written for Internet Systems Consortium + * by Ted Lemon in cooperation with Vixie Enterprises and Nominum, Inc. + * To learn more about Internet Systems Consortium, see + * ``http://www.isc.org/''. To learn more about Vixie Enterprises, + * see ``http://www.vix.com''. To learn more about Nominum, Inc., see + * ``http://www.nominum.com''. + */ + +#ifndef lint +static char copyright[] = +"$Id: ipoib.c,v 1.3.2.2 2004/06/10 17:59:18 dhankins Exp $ Copyright (c) 2004 Internet Systems Consortium. All rights reserved.\n"; +#endif /* not lint */ + +#include "dhcpd.h" + +#if defined (HAVE_IPOIB_SUPPORT) + +#define INFINIBAND_ALEN 20 + +struct ipoib_header { + u16 proto; + u16 reserved; +}; + +struct ipoib_pseudoheader { + u8 hwaddr[INFINIBAND_ALEN]; +}; + +static const u8 ipv4_bcast_addr[] = { + 0x00, 0xff, 0xff, 0xff, + 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff +}; + +#if defined (PACKET_ASSEMBLY) || defined (PACKET_DECODING) +#include "includes/netinet/if_ether.h" +#endif /* PACKET_ASSEMBLY || PACKET_DECODING */ + +#if defined (PACKET_ASSEMBLY) +/* Assemble an hardware header... 
*/ + +void assemble_ipoib_header (interface, buf, bufix, to, toip) + struct interface_info *interface; + unsigned char *buf; + unsigned *bufix; + struct hardware *to; + u_int32_t toip; +{ + struct ipoib_header hdr; + + /* Need pseudoheader if some sort of broadcast */ + /* Only supports limited broadcast right now */ + /* Does subnet broadcast also need to be supported ? */ + if (toip == 0xffffffff) { + /* Fill in IPoIB pseudoheader */ + /* Currently assumes scope of 2 (link local) */ + /* Might need to cycle through all scopes :-( */ + memcpy (&buf [*bufix], &ipv4_bcast_addr, sizeof(ipv4_bcast_addr)); + *bufix += sizeof(ipv4_bcast_addr); + } + + hdr.proto = htons (ETHERTYPE_IP); + hdr.reserved = 0; + + memcpy (&buf [*bufix], &hdr, sizeof(hdr)); + *bufix += sizeof(hdr); +} +#endif /* PACKET_ASSEMBLY */ + +#ifdef PACKET_DECODING +/* Decode a hardware header... */ + +ssize_t decode_ipoib_header (interface, buf, bufix, from) + struct interface_info *interface; + unsigned char *buf; + unsigned bufix; + struct hardware *from; +{ + from -> hbuf [0] = HTYPE_IPOIB; + memcpy (&from -> hbuf [1], &buf [bufix], sizeof(struct ipoib_header)); + return sizeof(struct ipoib_header); +} +#endif /* PACKET_DECODING */ +#endif /* HAVE_IPOIB_SUPPORT */ diff -urN dhcp-3.0.2.orig/common/lpf.c dhcp-3.0.2/common/lpf.c --- dhcp-3.0.2.orig/common/lpf.c 2004-11-24 12:39:15.000000000 -0500 +++ dhcp-3.0.2/common/lpf.c 2005-03-29 18:45:09.000000000 -0500 @@ -168,6 +168,12 @@ static void lpf_tr_filter_setup (struct interface_info *); #endif +#if defined (HAVE_IPOIB_SUPPORT) +extern struct sock_filter dhcp_bpf_ipoib_filter []; +extern int dhcp_bpf_ipoib_filter_len; +static void lpf_ipoib_filter_setup (struct interface_info *); +#endif + static void lpf_gen_filter_setup (struct interface_info *); void if_register_receive (info) @@ -181,6 +187,13 @@ lpf_tr_filter_setup (info); else #endif + +#if defined (HAVE_IPOIB_SUPPORT) + if (info -> hw_address.hbuf [0] == HTYPE_IPOIB) + lpf_ipoib_filter_setup
(info); + else +#endif + lpf_gen_filter_setup (info); if (!quiet_interface_discovery) @@ -222,7 +235,7 @@ p.len = dhcp_bpf_filter_len; p.filter = dhcp_bpf_filter; - /* Patch the server port into the LPF program... + /* Patch the server port into the LPF program... XXX changes to filter program may require changes to the insn number(s) used below! XXX */ dhcp_bpf_filter [8].k = ntohs ((short)local_port); @@ -254,7 +267,7 @@ p.len = dhcp_bpf_tr_filter_len; p.filter = dhcp_bpf_tr_filter; - /* Patch the server port into the LPF program... + /* Patch the server port into the LPF program... XXX changes to filter program may require changes XXX to the insn number(s) used below! XXX Token ring filter is null - when/if we have a filter @@ -277,6 +290,38 @@ } } #endif /* HAVE_TR_SUPPORT */ + +#ifdef HAVE_IPOIB_SUPPORT +static void lpf_ipoib_filter_setup (info) + struct interface_info *info; +{ + struct sock_fprog p; + + /* Set up the bpf filter program structure. This is defined in + bpf.c */ + p.len = dhcp_bpf_ipoib_filter_len; + p.filter = dhcp_bpf_ipoib_filter; + + /* Patch the server port into the LPF program... + XXX changes to filter program may require changes + to the insn number(s) used below! XXX */ + dhcp_bpf_ipoib_filter [8].k = ntohs ((short)local_port); + + if (setsockopt (info -> rfdesc, SOL_SOCKET, SO_ATTACH_FILTER, &p, + sizeof p) < 0) { + if (errno == ENOPROTOOPT || errno == EPROTONOSUPPORT || + errno == ESOCKTNOSUPPORT || errno == EPFNOSUPPORT || + errno == EAFNOSUPPORT) { + log_error ("socket: %m - make sure"); + log_error ("CONFIG_PACKET (Packet socket) %s", + "and CONFIG_FILTER"); + log_error ("(Socket Filtering) are enabled %s", + "in your kernel"); + } + log_fatal ("Can't install packet filter program: %m"); + } +} +#endif /* HAVE_IPOIB_SUPPORT */ #endif /* USE_LPF_RECEIVE */ #ifdef USE_LPF_SEND @@ -302,7 +347,7 @@ len, from, to, hto); /* Assemble the headers... 
*/ - assemble_hw_header (interface, (unsigned char *)hh, &hbufp, hto); + assemble_hw_header (interface, (unsigned char *)hh, &hbufp, hto, to -> sin_addr.s_addr); fudge = hbufp % 4; /* IP header must be word-aligned. */ memcpy (buf + fudge, (unsigned char *)hh, hbufp); ibufp = hbufp + fudge; @@ -312,7 +357,7 @@ memcpy (buf + ibufp, raw, len); /* For some reason, SOCK_PACKET sockets can't be connected, - so we have to do a sentdo every time. */ + so we have to do a sendto every time. */ memset (&sa, 0, sizeof sa); sa.sa_family = AF_PACKET; strncpy (sa.sa_data, @@ -383,7 +428,14 @@ int can_receive_unicast_unconfigured (ip) struct interface_info *ip; { +#if defined (HAVE_IPOIB_SUPPORT) + if (!strncmp (ip->name, "ib", 2)) + return 0; + else + return 1; +#else return 1; +#endif } int supports_multiple_interfaces (ip) diff -urN dhcp-3.0.2.orig/common/Makefile.dist dhcp-3.0.2/common/Makefile.dist --- dhcp-3.0.2.orig/common/Makefile.dist 2004-09-21 16:33:35.000000000 -0400 +++ dhcp-3.0.2/common/Makefile.dist 2005-03-30 08:00:08.000000000 -0500 @@ -24,11 +24,11 @@ SEDMANPAGES = dhcp-options.man5 dhcp-eval.man5 SRC = raw.c parse.c nit.c icmp.c dispatch.c conflex.c upf.c bpf.c socket.c \ lpf.c dlpi.c packet.c tr.c ethernet.c memory.c print.c options.c \ - inet.c tree.c tables.c alloc.c fddi.c ctrace.c dns.c resolv.c \ + inet.c tree.c tables.c alloc.c fddi.c ipoib.c ctrace.c dns.c resolv.c \ execute.c discover.c comapi.c OBJ = raw.o parse.o nit.o icmp.o dispatch.o conflex.o upf.o bpf.o socket.o \ lpf.o dlpi.o packet.o tr.o ethernet.o memory.o print.o options.o \ - inet.o tree.o tables.o alloc.o fddi.o ctrace.o dns.o resolv.o \ + inet.o tree.o tables.o alloc.o fddi.o ipoib.o ctrace.o dns.o resolv.o \ execute.o discover.o comapi.o MAN = dhcp-options.5 dhcp-eval.5 diff -urN dhcp-3.0.2.orig/common/packet.c dhcp-3.0.2/common/packet.c --- dhcp-3.0.2.orig/common/packet.c 2004-11-24 12:39:16.000000000 -0500 +++ dhcp-3.0.2/common/packet.c 2005-03-25 11:01:46.000000000 -0500 @@ -104,11 
+104,12 @@ } #ifdef PACKET_ASSEMBLY -void assemble_hw_header (interface, buf, bufix, to) +void assemble_hw_header (interface, buf, bufix, to, toip) struct interface_info *interface; unsigned char *buf; unsigned *bufix; struct hardware *to; + u_int32_t toip; { #if defined (HAVE_TR_SUPPORT) if (interface -> hw_address.hbuf [0] == HTYPE_IEEE802) @@ -120,6 +121,11 @@ assemble_fddi_header (interface, buf, bufix, to); else #endif +#if defined (HAVE_IPOIB_SUPPORT) + if (interface -> hw_address.hbuf [0] == HTYPE_IPOIB) + assemble_ipoib_header (interface, buf, bufix, to, toip); + else +#endif assemble_ethernet_header (interface, buf, bufix, to); } @@ -205,6 +211,11 @@ return decode_fddi_header (interface, buf, bufix, from); else #endif +#if defined (HAVE_IPOIB_SUPPORT) + if (interface -> hw_address.hbuf [0] == HTYPE_IPOIB) + return decode_ipoib_header (interface, buf, bufix, from); + else +#endif return decode_ethernet_header (interface, buf, bufix, from); } diff -urN dhcp-3.0.2.orig/common/parse.c dhcp-3.0.2/common/parse.c --- dhcp-3.0.2.orig/common/parse.c 2004-09-30 16:38:31.000000000 -0400 +++ dhcp-3.0.2/common/parse.c 2005-03-26 07:46:36.000000000 -0500 @@ -323,7 +323,7 @@ /* * hardware-parameter :== HARDWARE hardware-type colon-seperated-hex-list SEMI - * hardware-type :== ETHERNET | TOKEN_RING | FDDI + * hardware-type :== ETHERNET | TOKEN_RING | FDDI | IPOIB */ void parse_hardware_param (cfile, hardware) @@ -346,6 +346,9 @@ case FDDI: hardware -> hbuf [0] = HTYPE_FDDI; break; + case IPOIB: + hardware -> hbuf [0] = HTYPE_IPOIB; + break; default: if (!strncmp (val, "unknown-", 8)) { hardware -> hbuf [0] = atoi (&val [8]); diff -urN dhcp-3.0.2.orig/includes/dhcpd.h dhcp-3.0.2/includes/dhcpd.h --- dhcp-3.0.2.orig/includes/dhcpd.h 2004-11-24 12:39:16.000000000 -0500 +++ dhcp-3.0.2/includes/dhcpd.h 2005-03-28 08:25:47.000000000 -0500 @@ -306,9 +306,9 @@ # define EPHEMERAL_FLAGS (MS_NULL_TERMINATION | \ UNICAST_BROADCAST_HACK) - binding_state_t __attribute__ ((mode 
(__byte__))) binding_state; - binding_state_t __attribute__ ((mode (__byte__))) next_binding_state; - binding_state_t __attribute__ ((mode (__byte__))) desired_binding_state; + binding_state_t binding_state; + binding_state_t next_binding_state; + binding_state_t desired_binding_state; struct lease_state *state; @@ -1922,7 +1922,7 @@ u_int32_t checksum PROTO ((unsigned char *, unsigned, u_int32_t)); u_int32_t wrapsum PROTO ((u_int32_t)); void assemble_hw_header PROTO ((struct interface_info *, unsigned char *, - unsigned *, struct hardware *)); + unsigned *, struct hardware *, u_int32_t)); void assemble_udp_ip_header PROTO ((struct interface_info *, unsigned char *, unsigned *, u_int32_t, u_int32_t, u_int32_t, unsigned char *, unsigned)); diff -urN dhcp-3.0.2.orig/includes/dhcp.h dhcp-3.0.2/includes/dhcp.h --- dhcp-3.0.2.orig/includes/dhcp.h 2004-06-10 13:59:29.000000000 -0400 +++ dhcp-3.0.2/includes/dhcp.h 2005-03-24 13:50:14.000000000 -0500 @@ -75,6 +75,7 @@ #define HTYPE_ETHER 1 /* Ethernet 10Mbps */ #define HTYPE_IEEE802 6 /* IEEE 802.2 Token Ring... */ #define HTYPE_FDDI 8 /* FDDI... */ +#define HTYPE_IPOIB 32 /* IPoIB */ /* Magic cookie validating dhcp options field (and bootp vendor extensions field). 
*/ diff -urN dhcp-3.0.2.orig/includes/dhctoken.h dhcp-3.0.2/includes/dhctoken.h --- dhcp-3.0.2.orig/includes/dhctoken.h 2004-09-21 15:25:38.000000000 -0400 +++ dhcp-3.0.2/includes/dhctoken.h 2005-03-26 07:51:29.000000000 -0500 @@ -308,7 +308,8 @@ REFRESH = 612, DOMAIN_NAME = 613, DO_FORWARD_UPDATE = 614, - KNOWN_CLIENTS = 615 + KNOWN_CLIENTS = 615, + IPOIB = 616 }; #define is_identifier(x) ((x) >= FIRST_TOKEN && \ diff -urN dhcp-3.0.2.orig/Makefile.conf dhcp-3.0.2/Makefile.conf --- dhcp-3.0.2.orig/Makefile.conf 2004-11-24 12:39:13.000000000 -0500 +++ dhcp-3.0.2/Makefile.conf 2005-03-30 08:05:28.000000000 -0500 @@ -312,7 +312,7 @@ ## Linux 2.2 ##--linux-2.2-- #COPTS = -DLINUX_MAJOR=$(MAJORVERSION) -DLINUX_MINOR=$(MINORVERSION) \ -# $(BINDDEF) $(CC_OPTIONS) +# $(BINDDEF) $(CC_OPTIONS) -DHAVE_IPOIB_SUPPORT -DUSERLAND_FILTER #CF = cf/linux.h #ADMMANDIR = /usr/man/man8 #ADMMANEXT = .8 diff -urN dhcp-3.0.2.orig/readme-ipoib.txt dhcp-3.0.2/readme-ipoib.txt --- dhcp-3.0.2.orig/readme-ipoib.txt 1969-12-31 19:00:00.000000000 -0500 +++ dhcp-3.0.2/readme-ipoib.txt 2005-03-30 08:22:37.000000000 -0500 @@ -0,0 +1,53 @@ +3/30/05 +ISC DHCP 3.0.2 +Internet Systems Consortium DHCP Client V3.0.2 +Internet Systems Consortium DHCP Server V3.0.2 + +Makefile.conf + -DHAVE_IPOIB_SUPPORT -DUSERLAND_FILTER + To build on Opteron, also add -DPTRSIZE_64BIT + + +Notes about running + +Need to configure the Linux kernel to support +Socket Filtering and the Packet socket. 
+ CONFIG_FILTER=y + CONFIG_PACKET=y + + +DHCP client + +/sbin/modprobe ib_ipoib +/sbin/ifconfig ib0 up + without IPv4 address + +dhclient ib0 + +need to mkdir /var/state/dhcp so dhclient.leases can be saved + +setup /etc/dhclient.conf with client identifier (port GUID) +interface "ib0" { + send dhcp-client-identifier 00:08:f1:04:03:96:05:59; +} + + +DHCP server + +Load IPoIB and configure ib0 with IPv4 address +IP address configured on ib0 +/sbin/ifconfig ib0 + + +dhcpd ib0 + +setup /etc/dhcpd.conf +with at least IP address range +maybe also client identifier if want fixed IP address per client +ddns-update-style none; +subnet 192.168.0.0 netmask 255.255.255.0 { + range 192.168.0.10 192.168.0.20; +} + +touch /var/state/dhcp/dhcpd.leases + diff -urN dhcp-3.0.2.orig/server/dhcp.c dhcp-3.0.2/server/dhcp.c --- dhcp-3.0.2.orig/server/dhcp.c 2004-11-24 12:39:19.000000000 -0500 +++ dhcp-3.0.2/server/dhcp.c 2005-03-28 08:52:46.000000000 -0500 @@ -267,7 +267,7 @@ /* %Audit% This is log output. %2004.06.17,Safe% * If we truncate we hope the user can get a hint from the log. */ - snprintf (msgbuf, sizeof msgbuf, "DHCPDISCOVER from %s %s%s%svia %s", + snprintf (msgbuf, sizeof msgbuf, "DHCPDISCOVER from %s %s%s%s via %s", (packet -> raw -> htype ? print_hw_addr (packet -> raw -> htype, packet -> raw -> hlen, @@ -476,7 +476,7 @@ * If we truncate we hope the user can get a hint from the log. */ snprintf (msgbuf, sizeof msgbuf, - "DHCPREQUEST for %s%s from %s %s%s%svia %s", + "DHCPREQUEST for %s%s from %s %s%s%s via %s", piaddr (cip), smbuf, (packet -> raw -> htype ? print_hw_addr (packet -> raw -> htype, @@ -769,7 +769,7 @@ * If we truncate we hope the user can get a hint from the log. */ snprintf (msgbuf, sizeof msgbuf, - "DHCPRELEASE of %s from %s %s%s%svia %s (%sfound)", + "DHCPRELEASE of %s from %s %s%s%s via %s (%sfound)", cstr, (packet -> raw -> htype ? print_hw_addr (packet -> raw -> htype, @@ -859,7 +859,7 @@ * If we truncate we hope the user can get a hint from the log. 
*/ snprintf (msgbuf, sizeof msgbuf, - "DHCPDECLINE of %s from %s %s%s%svia %s", + "DHCPDECLINE of %s from %s %s%s%s via %s", piaddr (cip), (packet -> raw -> htype ? print_hw_addr (packet -> raw -> htype, @@ -2807,7 +2807,7 @@ s = (char *)0; /* Say what we're doing... */ - log_info ("%s on %s to %s %s%s%svia %s", + log_info ("%s on %s to %s %s%s%s via %s", (state -> offer ? (state -> offer == DHCPACK ? "DHCPACK" : "DHCPOFFER") : "BOOTREPLY"), From halr at voltaire.com Wed Mar 30 06:35:03 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Mar 2005 09:35:03 -0500 Subject: [openib-general] Re: [PATCH] IPoIB: Set hardware header on packet receive In-Reply-To: <52d5tjy88x.fsf@topspin.com> References: <1111928923.4650.70.camel@localhost.localdomain> <52hdivzrai.fsf@topspin.com> <1112027628.4650.149.camel@localhost.localdomain> <52u0mvyb3n.fsf@topspin.com> <1112028310.4650.153.camel@localhost.localdomain> <52d5tjy88x.fsf@topspin.com> Message-ID: <1112193303.4495.119.camel@localhost.localdomain> On Mon, 2005-03-28 at 12:45, Roland Dreier wrote: > Hal> I was wondering about this myself as to whether this matters > Hal> or not. Other drivers determine this from the destination > Hal> MAC address. Does it affect the incoming delivery if more > Hal> than one process is receiving an IP broadcast or multicast ? > Hal> I didn't chase it all the way down in the Linux kernel. Do > Hal> you know ? > > There's a lot of places where the field is checked. I see it set in a number of places but not checked much in the kernel. > However all the > ones I've seen are just ignoring packets that aren't PACKET_HOST or > sometimes looking for PACKET_LOOPBACK packets. So PACKET_BROADCAST or > PACKET_MULTICAST may be checked somewhere but I'm not aware of > anywhere that they are. Is this used by netfilter ? Does this make a difference to any of the routing protocols ? Has anyone run IP unicast routing using (one or more) IPoIB interfaces ? What about multicast ? Do we care about these ? 
-- Hal From halr at voltaire.com Wed Mar 30 06:40:36 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Mar 2005 09:40:36 -0500 Subject: [openib-general] [PATCH] RFC: allow send only registration from userspace In-Reply-To: <523bueqor6.fsf@topspin.com> References: <52ll87t8ub.fsf@topspin.com> <1112109600.4645.22.camel@localhost.localdomain> <521x9ysbiy.fsf@topspin.com> <1112119435.4645.21.camel@localhost.localdomain> <527jjqqpwt.fsf@topspin.com> <1112121070.4645.2.camel@localhost.localdomain> <523bueqor6.fsf@topspin.com> Message-ID: <1112193635.4495.121.camel@localhost.localdomain> On Tue, 2005-03-29 at 13:40, Roland Dreier wrote: > If we want to move async events from the uverbs devices to dedicated > async event devices then I don't really have a problem with that. > That would save things like OpenSM from having to deal with the whole > verbs interface just to get events. This sounds like a better approach to us. Thanks. -- Hal From halr at voltaire.com Wed Mar 30 07:09:24 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Mar 2005 10:09:24 -0500 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <20050328170351.B30499@topspin.com> References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> Message-ID: <1112195364.4495.150.camel@localhost.localdomain> On Mon, 2005-03-28 at 20:03, Libor Michalek wrote: > I haven't looked closely at the code yet, but I did try it out > with SDP/AIO on a pair of x86 systems with Tavors and a pair of > x86_64 systems with Arbels. With a small change to core/fmr_pool.c > and enabling pool creation in SDP it worked as expected. Here are > throughput results: > > x86 x86_64 > -------- -------- > SDP sync 610 MB/s 710 MB/s > SDP async (hit) 740 MB/s 910 MB/s > SDP async (miss) 590 MB/s 910 MB/s > > For sync sockets I used 81600 byte buffers. For async socket I kept > 20 96K buffers in flight. For the FMR pool cache hit async results I > used only 20 different buffers. 
For the FMR pool cache miss async > results I used 1000 different buffers, of which only 20 were in flight > at a time. Any idea why hit/miss make such a difference on x86 and not x86_64 ? Also, is all the code for this now checked available ? Thanks. -- Hal From mst at mellanox.co.il Wed Mar 30 07:15:29 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 30 Mar 2005 17:15:29 +0200 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <1112195364.4495.150.camel@localhost.localdomain> References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> <1112195364.4495.150.camel@localhost.localdomain> Message-ID: <20050330151528.GS15034@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: [openib-general] [PATCH] FMR support in mthca > > On Mon, 2005-03-28 at 20:03, Libor Michalek wrote: > > I haven't looked closely at the code yet, but I did try it out > > with SDP/AIO on a pair of x86 systems with Tavors and a pair of > > x86_64 systems with Arbels. With a small change to core/fmr_pool.c > > and enabling pool creation in SDP it worked as expected. Here are > > throughput results: > > > > x86 x86_64 > > -------- -------- > > SDP sync 610 MB/s 710 MB/s > > SDP async (hit) 740 MB/s 910 MB/s > > SDP async (miss) 590 MB/s 910 MB/s > > > > For sync sockets I used 81600 byte buffers. For async socket I kept > > 20 96K buffers in flight. For the FMR pool cache hit async results I > > used only 20 different buffers. For the FMR pool cache miss async > > results I used 1000 different buffers, of which only 20 were in flight > > at a time. > > Any idea why hit/miss make such a difference on x86 and not x86_64 ? > > Also, is all the code for this now checked available ? Yes. > Thanks. > > -- Hal > -- MST - Michael S. 
Tsirkin From halr at voltaire.com Wed Mar 30 08:49:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Mar 2005 11:49:45 -0500 Subject: [openib-general] [PATCH] mad: Fix local_completions Message-ID: <1112201385.4476.0.camel@localhost.localdomain> mad: Fix local_completions so receive buffer return does not crash now that initial RMPP has been introduced Signed-off-by: Hal Rosenstock Index: mad.c =================================================================== --- mad.c (revision 2056) +++ mad.c (working copy) @@ -2116,7 +2116,7 @@ local->mad_priv->header.recv_wc.wc = &wc; local->mad_priv->header.recv_wc.mad_len = sizeof(struct ib_mad); - INIT_LIST_HEAD(&local->mad_priv->header.recv_wc.recv_buf.list); + INIT_LIST_HEAD(&local->mad_priv->header.recv_wc.rmpp_list); local->mad_priv->header.recv_wc.recv_buf.grh = NULL; local->mad_priv->header.recv_wc.recv_buf.mad = &local->mad_priv->mad.mad; From mst at mellanox.co.il Wed Mar 30 09:01:05 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 30 Mar 2005 19:01:05 +0200 Subject: [openib-general] IBV_SEND_INLINE in uverbs Message-ID: <20050330170105.GA3869@mellanox.co.il> Hello, Roland! Is it true that IBV_SEND_INLINE is not currently supported by libmthca? I need that for my work on benchmarking latency. >grep -rIi IBV_SEND_INLINE . ./libibverbs/include/infiniband/verbs.h: IBV_SEND_INLINE = 1 << 3 Do you plan to implement it, and if not, would you like me to add this support? Thanks, -- MST - Michael S.
Tsirkin From libor at topspin.com Wed Mar 30 09:21:10 2005 From: libor at topspin.com (Libor Michalek) Date: Wed, 30 Mar 2005 09:21:10 -0800 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <1112195364.4495.150.camel@localhost.localdomain>; from halr@voltaire.com on Wed, Mar 30, 2005 at 10:09:24AM -0500 References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> <1112195364.4495.150.camel@localhost.localdomain> Message-ID: <20050330092110.A32764@topspin.com> On Wed, Mar 30, 2005 at 10:09:24AM -0500, Hal Rosenstock wrote: > On Mon, 2005-03-28 at 20:03, Libor Michalek wrote: > > I haven't looked closely at the code yet, but I did try it out > > with SDP/AIO on a pair of x86 systems with Tavors and a pair of > > x86_64 systems with Arbels. With a small change to core/fmr_pool.c > > and enabling pool creation in SDP it worked as expected. Here are > > throughput results: > > > > x86 x86_64 > > -------- -------- > > SDP sync 610 MB/s 710 MB/s > > SDP async (hit) 740 MB/s 910 MB/s > > SDP async (miss) 590 MB/s 910 MB/s > > > > For sync sockets I used 81600 byte buffers. For async socket I kept > > 20 96K buffers in flight. For the FMR pool cache hit async results I > > used only 20 different buffers. For the FMR pool cache miss async > > results I used 1000 different buffers, of which only 20 were in flight > > at a time. > > Any idea why hit/miss make such a difference on x86 and not x86_64 ? Not sure, but the difference is also Tavor vs. Arbel on those two pairs of systems. > Also, is all the code for this now checked available ? Almost, for SDP I need to check in this patch, now that the FMR code has been checked in. 
-Libor Index: sdp_conn.c =================================================================== --- sdp_conn.c (revision 2072) +++ sdp_conn.c (working copy) @@ -1774,14 +1774,12 @@ /* * create SDP memory pool */ -#ifdef _FMR_SUPPORT hca->fmr_pool = ib_create_fmr_pool(hca->pd, &fmr_param_s); if (IS_ERR(hca->fmr_pool)) { sdp_warn("Error <%ld> creating HCA <%s> fast memory pool", PTR_ERR(hca->fmr_pool), device->name); goto error; } -#endif /* * port allocation */ @@ -1828,10 +1826,9 @@ kfree(port); } -#ifdef _FMR_SUPPORT if (!IS_ERR(hca->fmr_pool)) (void)ib_destroy_fmr_pool(hca->fmr_pool); -#endif + if (!IS_ERR(hca->mem_h)) (void)ib_dereg_mr(hca->mem_h); @@ -1864,10 +1861,9 @@ kfree(port); } -#ifdef _FMR_SUPPORT if (!IS_ERR(hca->fmr_pool)) (void)ib_destroy_fmr_pool(hca->fmr_pool); -#endif + if (!IS_ERR(hca->mem_h)) (void)ib_dereg_mr(hca->mem_h); From iod00d at hp.com Wed Mar 30 10:46:53 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 30 Mar 2005 10:46:53 -0800 Subject: [openib-general] Re: BUG ipoib oops in iptables/netfilter In-Reply-To: <20050330090831.GD26964@mellanox.co.il> References: <20050329054158.GA21304@esmail.cup.hp.com> <20050330090831.GD26964@mellanox.co.il> Message-ID: <20050330184653.GA27519@esmail.cup.hp.com> On Wed, Mar 30, 2005 at 11:08:31AM +0200, Michael S. Tsirkin wrote: > Quoting r. Grant Grundler : > > FWIW, default IPoIB perf is pathetic: ~1.5-1.6Gb/s > > with the above netperf command line. > > I think these systems have IOMMU, do they not? They do. But HP ZX1 chipset allows 64-bit devices to bypass the IOMMU and directly address memory. tg3 is a 64-bit device and bypasses the IOMMU. ie tg3 is using physical addresses. > If so, could the fact that IPoIB calls dma_map_single for each > packet be the reason it is slow? Maybe dma_map_single is slow? pfmon output I've collected in the past doesn't support that theory. The ia64 "machvec" indirect function call certainly isn't helping though. ZX1 is about 3yr old chipset. 
I expect current x86/amd64 chipsets are not as old. I really need to collect PCI bus traces to verify if it's a chipset problem though. The logistics of doing that are non-trivial and I'll need help from a local HW group. *sigh* thanks, grant From roland at topspin.com Wed Mar 30 10:47:53 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 30 Mar 2005 10:47:53 -0800 Subject: [openib-general] Re: BUG ipoib oops in iptables/netfilter In-Reply-To: <20050330184653.GA27519@esmail.cup.hp.com> (Grant Grundler's message of "Wed, 30 Mar 2005 10:46:53 -0800") References: <20050329054158.GA21304@esmail.cup.hp.com> <20050330090831.GD26964@mellanox.co.il> <20050330184653.GA27519@esmail.cup.hp.com> Message-ID: <523budq8bq.fsf@topspin.com> There are a couple of places in mthca that use __attribute__((packed)) structs in the data path. x86/x86-64/ppc64 can generate good code for this but on ia64 it's a disaster. I'm going to look at fixing that, probably early next week. - R. From iod00d at hp.com Wed Mar 30 11:03:23 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 30 Mar 2005 11:03:23 -0800 Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB In-Reply-To: <1112190300.4495.67.camel@localhost.localdomain> References: <1112190300.4495.67.camel@localhost.localdomain> Message-ID: <20050330190323.GB27519@esmail.cup.hp.com> On Wed, Mar 30, 2005 at 08:45:01AM -0500, Hal Rosenstock wrote: > I wanted to make this available to this community first. > Any feedback is welcome. ... > +/* ipoib.c > + > + Packet assembly code, originally contributed by Archie Cobbs. */ This can't be right. > + > +/* > + * Copyright (c) 2004 by Internet Systems Consortium, Inc. ("ISC") > + * Copyright (c) 1996-2003 by Internet Software Consortium > + * > + * Permission to use, copy, modify, and distribute this software for any > + * purpose with or without fee is hereby granted, provided that the above > + * copyright notice and this permission notice appear in all copies.
> + * > + * THE SOFTWARE IS PROVIDED "AS IS" AND ISC DISCLAIMS ALL WARRANTIES > + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF > + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL ISC BE LIABLE FOR > + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES > + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN > + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT > + * OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. > + * > + * Internet Systems Consortium, Inc. > + * 950 Charter Street > + * Redwood City, CA 94063 > + * > + * http://www.isc.org/ > + * > + * This software has been written for Internet Systems Consortium > + * by Ted Lemon in cooperation with Vixie Enterprises and Nominum, Inc. > + * To learn more about Internet Systems Consortium, see > + * ``http://www.isc.org/''. To learn more about Vixie Enterprises, > + * see ``http://www.vix.com''. To learn more about Nominum, Inc., see > + * ``http://www.nominum.com''. > + */ This is a new file, right? The copyright owner and date and the "by Ted Lemon" stuff at the bottom don't seem right for this file. > +static char copyright[] = > +"$Id: ipoib.c,v 1.3.2.2 2004/06/10 17:59:18 dhankins Exp $ Copyright (c) 2004 Internet Systems Consortium. All rights reserved.\n"; Ditto. I don't know the rules for submitting code to the ISC. My guess is they have to be as anal about copyright assignment and license as the GNU foundation. You probably want to review this briefly with one or more of Voltaire/Openib/ISC legal.
grant From iod00d at hp.com Wed Mar 30 12:10:06 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 30 Mar 2005 12:10:06 -0800 Subject: [openib-general] Re: BUG ipoib oops in iptables/netfilter In-Reply-To: <523budq8bq.fsf@topspin.com> References: <20050329054158.GA21304@esmail.cup.hp.com> <20050330090831.GD26964@mellanox.co.il> <20050330184653.GA27519@esmail.cup.hp.com> <523budq8bq.fsf@topspin.com> Message-ID: <20050330201006.GH27519@esmail.cup.hp.com> On Wed, Mar 30, 2005 at 10:47:53AM -0800, Roland Dreier wrote: > There are a couple of places in mthca that use __attribute__((packed)) > structs in the data path. x86/x86-64/ppc64 can generate good code for > this but on ia64 it's a disaster. I'm going to look at fixing that, > probably early next week. Are you thinking of adding "get_unaligned()" or something else? I assume this is related to the performance issues and not the original bug. If so, then posting details that you know to ia64-linux mailing list will get the right people involved. I'm on that list and can followup if people have suggestions that need to be tested. thanks, grant From roland at topspin.com Wed Mar 30 12:11:53 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 30 Mar 2005 12:11:53 -0800 Subject: [openib-general] Re: BUG ipoib oops in iptables/netfilter In-Reply-To: <20050330201006.GH27519@esmail.cup.hp.com> (Grant Grundler's message of "Wed, 30 Mar 2005 12:10:06 -0800") References: <20050329054158.GA21304@esmail.cup.hp.com> <20050330090831.GD26964@mellanox.co.il> <20050330184653.GA27519@esmail.cup.hp.com> <523budq8bq.fsf@topspin.com> <20050330201006.GH27519@esmail.cup.hp.com> Message-ID: <52psxgq4fq.fsf@topspin.com> Grant> Are you thinking of adding "get_unaligned()" or something else? Just fixing the structs so they get the correct layout without any monkey business. Grant> I assume this is related to the performance issues and not Grant> the original bug. 
If so, then posting details that you know Grant> to ia64-linux mailing list will get the right people Grant> involved. I'm on that list and can followup if people have Grant> suggestions that need to be tested. I think I understand the problem, it's just a matter of fixing the code up to avoid the bogus stuff. Just compare the ia64 assembly for the functions c() and d() in the code: struct foo { int a; }; struct bar { int b; } __attribute__((packed)); int c(struct foo *x) { return x->a; } int d(struct bar *x) { return x->b; } - R. From tduffy at sun.com Wed Mar 30 12:30:43 2005 From: tduffy at sun.com (Tom Duffy) Date: Wed, 30 Mar 2005 12:30:43 -0800 Subject: [openib-general] [PATCH][SDP] fix build error Re: [openib-commits] r2088 - gen2/trunk/src/linux-kernel/infiniband/ulp/sdp In-Reply-To: <20050330173413.C1FFA2283D8@openib.ca.sandia.gov> References: <20050330173413.C1FFA2283D8@openib.ca.sandia.gov> Message-ID: <1112214643.21350.32.camel@duffman> On Wed, 2005-03-30 at 09:34 -0800, libor at openib.org wrote: > Modified: gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_recv.c > =================================================================== > --- gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_recv.c 2005-03-30 17:33:13 UTC (rev 2087) > +++ gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_recv.c 2005-03-30 17:34:12 UTC (rev 2088) > @@ -670,7 +670,7 @@ > > copy = min((PAGE_SIZE - offset), > (unsigned long)(buff->tail - buff->data)); > - > + copy = min(iocb->len, copy); > #ifndef _SDP_DATA_PATH_NULL > memcpy((addr + offset), buff->data, copy); > #endif Signed-off-by: Tom Duffy Index: drivers/infiniband/ulp/sdp/sdp_recv.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_recv.c (revision 2093) +++ drivers/infiniband/ulp/sdp/sdp_recv.c (working copy) @@ -670,7 +670,7 @@ static int sdp_read_buff_iocb(struct sdp copy = min((PAGE_SIZE - offset), (unsigned long)(buff->tail - buff->data)); - copy = 
min(iocb->len, copy); + copy = min((unsigned long)iocb->len, copy); #ifndef _SDP_DATA_PATH_NULL memcpy((addr + offset), buff->data, copy); #endif From iod00d at hp.com Wed Mar 30 13:24:43 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 30 Mar 2005 13:24:43 -0800 Subject: [openib-general] Re: BUG ipoib oops in iptables/netfilter In-Reply-To: <52psxgq4fq.fsf@topspin.com> References: <20050329054158.GA21304@esmail.cup.hp.com> <20050330090831.GD26964@mellanox.co.il> <20050330184653.GA27519@esmail.cup.hp.com> <523budq8bq.fsf@topspin.com> <20050330201006.GH27519@esmail.cup.hp.com> <52psxgq4fq.fsf@topspin.com> Message-ID: <20050330212443.GA28155@esmail.cup.hp.com> On Wed, Mar 30, 2005 at 12:11:53PM -0800, Roland Dreier wrote: > Grant> Are you thinking of adding "get_unaligned()" or something else? > > Just fixing the structs so they get the correct layout without any > monkey business. ah ok. > I think I understand the problem, it's just a matter of fixing the > code up to avoid the bogus stuff. Just compare the ia64 assembly for > the functions c() and d() in the code: > > struct foo { int a; }; > struct bar { int b; } __attribute__((packed)); > > int c(struct foo *x) { return x->a; } > int d(struct bar *x) { return x->b; } d() is a mess. Got it. 
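A minimal sketch of why d() comes out so much worse, assuming a GCC-compatible compiler. The two structs are the ones from the quoted example; the helper names are illustrative only and are not taken from mthca. Marking a struct packed lowers its assumed alignment to 1, so on a strict-alignment target such as ia64 the compiler can no longer use a single aligned load for the member:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same two structs as in the quoted example. */
struct foo { int a; };
struct bar { int b; } __attribute__((packed));

/* Alignment the compiler is allowed to assume for each type
 * (hypothetical helper names, for illustration only). */
size_t foo_alignment(void) { return _Alignof(struct foo); }
size_t bar_alignment(void) { return _Alignof(struct bar); }

/* Portable way to read an int that really may be misaligned:
 * copy the bytes out instead of dereferencing an int pointer. */
int load_int_unaligned(const void *p)
{
    int v;

    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

With GCC, bar_alignment() is 1 while foo_alignment() matches the alignment of int, which is the information the code generator uses when it decides between one aligned load and a byte-by-byte sequence. If a structure already has the right layout without the attribute, dropping packed keeps the same layout and restores the aligned access.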
In case anyone else is curious, I've parked the resulting "gcc -S" (and gcc -S -O4) output on http://iou.parisc-linux.org/~grundler/ia64-packed/ thanks, grant From libor at topspin.com Wed Mar 30 16:43:49 2005 From: libor at topspin.com (Libor Michalek) Date: Wed, 30 Mar 2005 16:43:49 -0800 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <20050330032904.GA24936@esmail.cup.hp.com>; from iod00d@hp.com on Tue, Mar 29, 2005 at 07:29:04PM -0800 References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> <20050330010814.GB24794@esmail.cup.hp.com> <20050329181228.H31683@topspin.com> <20050330032904.GA24936@esmail.cup.hp.com> Message-ID: <20050330164349.B32764@topspin.com> On Tue, Mar 29, 2005 at 07:29:04PM -0800, Grant Grundler wrote: > On Tue, Mar 29, 2005 at 06:12:28PM -0800, Libor Michalek wrote: > > > I could checkin the source for the modified ttcp's but I'm not sure > > exactly where... (gen2/users/libor ???) > > I wouldn't bother for the "#define AF_INET AF_INET_SDP" above. > If the AIO version is more complicated, maybe create an "examples" > directory under gen2/src/userspace?
OK, I cleaned up and checked in the AIO modified version of ttcp: gen2/trunk/src/userspace/examples/aio/ttcp.aio.c To build it you will need libaio installed, new RHEL/Fedora distros contain it by default, or you can download/build/install it yourself: http://fr2.rpmfind.net/linux/rpm2html/search.php?query=libaio Next build ttcp.aio.c: gcc -I/usr/src/linux/drivers/infiniband/ulp/sdp ttcp.aio.c -o ttcp.aio.x -laio The program has a decent help for available parameters, but here are some reasonable defaults: server: ./ttcp.aio.x -r -l 65536 -a 20 client: ./ttcp.aio.x -t -l 65536 -n 100000 -a 20 192.168.0.100 -Libor From jjengla at sandia.gov Wed Mar 30 17:04:39 2005 From: jjengla at sandia.gov (Josh England) Date: Wed, 30 Mar 2005 17:04:39 -0800 Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB In-Reply-To: <1112190300.4495.67.camel@localhost.localdomain> References: <1112190300.4495.67.camel@localhost.localdomain> Message-ID: <424B4CA7.1050606@sandia.gov> Are there any plans to modify the linux DHCP client so it would be possible to do kernel-level DHCP and NFSroot over IB? -JE Hal Rosenstock wrote: > I just spent a little time porting ISC DHCP-3.0.2 to support DHCP over > Infiniband per IETF Internet Draft > draft-ietf-ipoib-dhcp-over-infiniband-09.txt. There are a couple of > minor things to verify before making this available to > dhcp-hackers at isc.org. I wanted to make this available to this community > first. Any feedback is welcome. > > -- Hal > > diff -urN dhcp-3.0.2.orig/common/bpf.c dhcp-3.0.2/common/bpf.c > --- dhcp-3.0.2.orig/common/bpf.c 2004-11-24 12:39:15.000000000 -0500 > +++ dhcp-3.0.2/common/bpf.c 2005-03-25 15:38:57.000000000 -0500 > @@ -194,6 +194,41 @@ > BPF_STMT(BPF_RET+BPF_K, 0), > }; > > +#if defined (HAVE_IPOIB_SUPPORT) > +/* Packet filter program... > + XXX Changes to the filter program may require changes to the constant > + offsets used in if_register_send to patch the BPF program! 
XXX */ > + > +struct bpf_insn dhcp_bpf_ipoib_filter [] = { > + /* Make sure this is an IP packet... */ > + BPF_STMT (BPF_LD + BPF_H + BPF_ABS, 0), > + BPF_JUMP (BPF_JMP + BPF_JEQ + BPF_K, ETHERTYPE_IP, 0, 8), > + > + /* Make sure it's a UDP packet... */ > + BPF_STMT (BPF_LD + BPF_B + BPF_ABS, 13), > + BPF_JUMP (BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_UDP, 0, 6), > + > + /* Make sure this isn't a fragment... */ > + BPF_STMT(BPF_LD + BPF_H + BPF_ABS, 10), > + BPF_JUMP(BPF_JMP + BPF_JSET + BPF_K, 0x1fff, 4, 0), > + > + /* Get the IP header length... */ > + BPF_STMT (BPF_LDX + BPF_B + BPF_MSH, 4), > + > + /* Make sure it's to the right port... */ > + BPF_STMT (BPF_LD + BPF_H + BPF_IND, 6), > + BPF_JUMP (BPF_JMP + BPF_JEQ + BPF_K, 67, 0, 1), /* patch */ > + > + /* If we passed all the tests, ask for the whole packet. */ > + BPF_STMT(BPF_RET+BPF_K, (u_int)-1), > + /* Otherwise, drop it. */ > + BPF_STMT(BPF_RET+BPF_K, 0), > +}; > + > +int dhcp_bpf_ipoib_filter_len = (sizeof dhcp_bpf_ipoib_filter / > + sizeof (struct bpf_insn)); > +#endif > + > #if defined (DEC_FDDI) > struct bpf_insn *bpf_fddi_filter; > #endif > diff -urN dhcp-3.0.2.orig/common/conflex.c dhcp-3.0.2/common/conflex.c > --- dhcp-3.0.2.orig/common/conflex.c 2004-11-24 12:39:15.000000000 -0500 > +++ dhcp-3.0.2/common/conflex.c 2005-03-26 07:49:40.000000000 -0500 > @@ -740,6 +740,8 @@ > return IS; > if (!strcasecmp (atom + 1, "gnore")) > return IGNORE; > + if (!strcasecmp (atom + 1, "poib")) > + return IPOIB; > break; > case 'k': > if (!strncasecmp (atom + 1, "nown", 4)) { > diff -urN dhcp-3.0.2.orig/common/discover.c dhcp-3.0.2/common/discover.c > --- dhcp-3.0.2.orig/common/discover.c 2004-06-10 13:59:16.000000000 -0400 > +++ dhcp-3.0.2/common/discover.c 2005-03-25 16:04:41.000000000 -0500 > @@ -484,6 +484,14 @@ > memcpy (&tmp -> hw_address.hbuf [1], sa.sa_data, 16); > break; > > +#ifndef HAVE_ARPHRD_IPOIB > +# define ARPHRD_IPOIB HTYPE_IPOIB > +#endif > + case ARPHRD_IPOIB: > + tmp -> hw_address.hlen = 1; > + tmp 
-> hw_address.hbuf [0] = HTYPE_IPOIB; > + break; > + > #ifdef HAVE_ARPHRD_METRICOM > case ARPHRD_METRICOM: > tmp -> hw_address.hlen = 7; > diff -urN dhcp-3.0.2.orig/common/ipoib.c dhcp-3.0.2/common/ipoib.c > --- dhcp-3.0.2.orig/common/ipoib.c 1969-12-31 19:00:00.000000000 -0500 > +++ dhcp-3.0.2/common/ipoib.c 2005-03-29 23:45:52.000000000 -0500 > @@ -0,0 +1,110 @@ > +/* ipoib.c > + > + Packet assembly code, originally contributed by Archie Cobbs. */ > + > +/* > + * Copyright (c) 2004 by Internet Systems Consortium, Inc. ("ISC") > + * Copyright (c) 1996-2003 by Internet Software Consortium > + * > + * Permission to use, copy, modify, and distribute this software for any > + * purpose with or without fee is hereby granted, provided that the above > + * copyright notice and this permission notice appear in all copies. > + * > + * THE SOFTWARE IS PROVIDED "AS IS" AND ISC DISCLAIMS ALL WARRANTIES > + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF > + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL ISC BE LIABLE FOR > + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES > + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN > + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT > + * OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. > + * > + * Internet Systems Consortium, Inc. > + * 950 Charter Street > + * Redwood City, CA 94063 > + * > + * http://www.isc.org/ > + * > + * This software has been written for Internet Systems Consortium > + * by Ted Lemon in cooperation with Vixie Enterprises and Nominum, Inc. > + * To learn more about Internet Systems Consortium, see > + * ``http://www.isc.org/''. To learn more about Vixie Enterprises, > + * see ``http://www.vix.com''. To learn more about Nominum, Inc., see > + * ``http://www.nominum.com''. 
> + */ > + > +#ifndef lint > +static char copyright[] = > +"$Id: ipoib.c,v 1.3.2.2 2004/06/10 17:59:18 dhankins Exp $ Copyright (c) 2004 Internet Systems Consortium. All rights reserved.\n"; > +#endif /* not lint */ > + > +#include "dhcpd.h" > + > +#if defined (HAVE_IPOIB_SUPPORT) > + > +#define INFINIBAND_ALEN 20 > + > +struct ipoib_header { > + u16 proto; > + u16 reserved; > +}; > + > +struct ipoib_pseudoheader { > + u8 hwaddr[INFINIBAND_ALEN]; > +}; > + > +static const u8 ipv4_bcast_addr[] = { > + 0x00, 0xff, 0xff, 0xff, > + 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, > + 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff > +}; > + > +#if defined (PACKET_ASSEMBLY) || defined (PACKET_DECODING) > +#include "includes/netinet/if_ether.h" > +#endif /* PACKET_ASSEMBLY || PACKET_DECODING */ > + > +#if defined (PACKET_ASSEMBLY) > +/* Assemble a hardware header... */ > + > +void assemble_ipoib_header (interface, buf, bufix, to, toip) > + struct interface_info *interface; > + unsigned char *buf; > + unsigned *bufix; > + struct hardware *to; > + u_int32_t toip; > +{ > + struct ipoib_header hdr; > + > + /* Need pseudoheader if some sort of broadcast */ > + /* Only supports limited broadcast right now */ > + /* Does subnet broadcast also need to be supported ? */ > + if (toip == 0xffffffff) { > + /* Fill in IPoIB pseudoheader */ > + /* Currently assumes scope of 2 (link local) */ > + /* Might need to cycle through all scopes :-( */ > + memcpy (&buf [*bufix], &ipv4_bcast_addr, sizeof(ipv4_bcast_addr)); > + *bufix += sizeof(ipv4_bcast_addr); > + } > + > + hdr.proto = htons (ETHERTYPE_IP); > + hdr.reserved = 0; > + > + memcpy (&buf [*bufix], &hdr, sizeof(hdr)); > + *bufix += sizeof(hdr); > +} > +#endif /* PACKET_ASSEMBLY */ > + > +#ifdef PACKET_DECODING > +/* Decode a hardware header... 
*/ > + > +ssize_t decode_ipoib_header (interface, buf, bufix, from) > + struct interface_info *interface; > + unsigned char *buf; > + unsigned bufix; > + struct hardware *from; > +{ > + from -> hbuf [0] = HTYPE_IPOIB; > + memcpy (&from -> hbuf [1], &buf [bufix], sizeof(struct ipoib_header)); > + return sizeof(struct ipoib_header); > +} > +#endif /* PACKET_DECODING */ > +#endif /* HAVE_IPOIB_SUPPORT */ > diff -urN dhcp-3.0.2.orig/common/lpf.c dhcp-3.0.2/common/lpf.c > --- dhcp-3.0.2.orig/common/lpf.c 2004-11-24 12:39:15.000000000 -0500 > +++ dhcp-3.0.2/common/lpf.c 2005-03-29 18:45:09.000000000 -0500 > @@ -168,6 +168,12 @@ > static void lpf_tr_filter_setup (struct interface_info *); > #endif > > +#if defined (HAVE_IPOIB_SUPPORT) > +extern struct sock_filter dhcp_bpf_ipoib_filter []; > +extern int dhcp_bpf_ipoib_filter_len; > +static void lpf_ipoib_filter_setup (struct interface_info *); > +#endif > + > static void lpf_gen_filter_setup (struct interface_info *); > > void if_register_receive (info) > @@ -181,6 +187,13 @@ > lpf_tr_filter_setup (info); > else > #endif > + > +#if defined (HAVE_IPOIB_SUPPORT) > + if (info -> hw_address.hbuf [0] == HTYPE_IPOIB) > + lpf_ipoib_filter_setup (info); > + else > +#endif > + > lpf_gen_filter_setup (info); > > if (!quiet_interface_discovery) > @@ -222,7 +235,7 @@ > p.len = dhcp_bpf_filter_len; > p.filter = dhcp_bpf_filter; > > - /* Patch the server port into the LPF program... > + /* Patch the server port into the LPF program... > XXX changes to filter program may require changes > to the insn number(s) used below! XXX */ > dhcp_bpf_filter [8].k = ntohs ((short)local_port); > @@ -254,7 +267,7 @@ > p.len = dhcp_bpf_tr_filter_len; > p.filter = dhcp_bpf_tr_filter; > > - /* Patch the server port into the LPF program... > + /* Patch the server port into the LPF program... > XXX changes to filter program may require changes > XXX to the insn number(s) used below! 
> XXX Token ring filter is null - when/if we have a filter > @@ -277,6 +290,38 @@ > } > } > #endif /* HAVE_TR_SUPPORT */ > + > +#ifdef HAVE_IPOIB_SUPPORT > +static void lpf_ipoib_filter_setup (info) > + struct interface_info *info; > +{ > + struct sock_fprog p; > + > + /* Set up the bpf filter program structure. This is defined in > + bpf.c */ > + p.len = dhcp_bpf_ipoib_filter_len; > + p.filter = dhcp_bpf_ipoib_filter; > + > + /* Patch the server port into the LPF program... > + XXX changes to filter program may require changes > + to the insn number(s) used below! XXX */ > + dhcp_bpf_ipoib_filter [8].k = ntohs ((short)local_port); > + > + if (setsockopt (info -> rfdesc, SOL_SOCKET, SO_ATTACH_FILTER, &p, > + sizeof p) < 0) { > + if (errno == ENOPROTOOPT || errno == EPROTONOSUPPORT || > + errno == ESOCKTNOSUPPORT || errno == EPFNOSUPPORT || > + errno == EAFNOSUPPORT) { > + log_error ("socket: %m - make sure"); > + log_error ("CONFIG_PACKET (Packet socket) %s", > + "and CONFIG_FILTER"); > + log_error ("(Socket Filtering) are enabled %s", > + "in your kernel"); > + } > + log_fatal ("Can't install packet filter program: %m"); > + } > +} > +#endif /* HAVE_IPOIB_SUPPORT */ > #endif /* USE_LPF_RECEIVE */ > > #ifdef USE_LPF_SEND > @@ -302,7 +347,7 @@ > len, from, to, hto); > > /* Assemble the headers... */ > - assemble_hw_header (interface, (unsigned char *)hh, &hbufp, hto); > + assemble_hw_header (interface, (unsigned char *)hh, &hbufp, hto, to -> sin_addr.s_addr); > fudge = hbufp % 4; /* IP header must be word-aligned. */ > memcpy (buf + fudge, (unsigned char *)hh, hbufp); > ibufp = hbufp + fudge; > @@ -312,7 +357,7 @@ > memcpy (buf + ibufp, raw, len); > > /* For some reason, SOCK_PACKET sockets can't be connected, > - so we have to do a sentdo every time. */ > + so we have to do a sendto every time. 
*/ > memset (&sa, 0, sizeof sa); > sa.sa_family = AF_PACKET; > strncpy (sa.sa_data, > @@ -383,7 +428,14 @@ > int can_receive_unicast_unconfigured (ip) > struct interface_info *ip; > { > +#if defined (HAVE_IPOIB_SUPPORT) > + if (!strncmp (ip->name, "ib", 2)) > + return 0; > + else > + return 1; > +#else > return 1; > +#endif > } > > int supports_multiple_interfaces (ip) > diff -urN dhcp-3.0.2.orig/common/Makefile.dist dhcp-3.0.2/common/Makefile.dist > --- dhcp-3.0.2.orig/common/Makefile.dist 2004-09-21 16:33:35.000000000 -0400 > +++ dhcp-3.0.2/common/Makefile.dist 2005-03-30 08:00:08.000000000 -0500 > @@ -24,11 +24,11 @@ > SEDMANPAGES = dhcp-options.man5 dhcp-eval.man5 > SRC = raw.c parse.c nit.c icmp.c dispatch.c conflex.c upf.c bpf.c socket.c \ > lpf.c dlpi.c packet.c tr.c ethernet.c memory.c print.c options.c \ > - inet.c tree.c tables.c alloc.c fddi.c ctrace.c dns.c resolv.c \ > + inet.c tree.c tables.c alloc.c fddi.c ipoib.c ctrace.c dns.c resolv.c \ > execute.c discover.c comapi.c > OBJ = raw.o parse.o nit.o icmp.o dispatch.o conflex.o upf.o bpf.o socket.o \ > lpf.o dlpi.o packet.o tr.o ethernet.o memory.o print.o options.o \ > - inet.o tree.o tables.o alloc.o fddi.o ctrace.o dns.o resolv.o \ > + inet.o tree.o tables.o alloc.o fddi.o ipoib.o ctrace.o dns.o resolv.o \ > execute.o discover.o comapi.o > MAN = dhcp-options.5 dhcp-eval.5 > > diff -urN dhcp-3.0.2.orig/common/packet.c dhcp-3.0.2/common/packet.c > --- dhcp-3.0.2.orig/common/packet.c 2004-11-24 12:39:16.000000000 -0500 > +++ dhcp-3.0.2/common/packet.c 2005-03-25 11:01:46.000000000 -0500 > @@ -104,11 +104,12 @@ > } > > #ifdef PACKET_ASSEMBLY > -void assemble_hw_header (interface, buf, bufix, to) > +void assemble_hw_header (interface, buf, bufix, to, toip) > struct interface_info *interface; > unsigned char *buf; > unsigned *bufix; > struct hardware *to; > + u_int32_t toip; > { > #if defined (HAVE_TR_SUPPORT) > if (interface -> hw_address.hbuf [0] == HTYPE_IEEE802) > @@ -120,6 +121,11 @@ > 
assemble_fddi_header (interface, buf, bufix, to); > else > #endif > +#if defined (HAVE_IPOIB_SUPPORT) > + if (interface -> hw_address.hbuf [0] == HTYPE_IPOIB) > + assemble_ipoib_header (interface, buf, bufix, to, toip); > + else > +#endif > assemble_ethernet_header (interface, buf, bufix, to); > > } > @@ -205,6 +211,11 @@ > return decode_fddi_header (interface, buf, bufix, from); > else > #endif > +#if defined (HAVE_IPOIB_SUPPORT) > + if (interface -> hw_address.hbuf [0] == HTYPE_IPOIB) > + return decode_ipoib_header (interface, buf, bufix, from); > + else > +#endif > return decode_ethernet_header (interface, buf, bufix, from); > } > > diff -urN dhcp-3.0.2.orig/common/parse.c dhcp-3.0.2/common/parse.c > --- dhcp-3.0.2.orig/common/parse.c 2004-09-30 16:38:31.000000000 -0400 > +++ dhcp-3.0.2/common/parse.c 2005-03-26 07:46:36.000000000 -0500 > @@ -323,7 +323,7 @@ > > /* > * hardware-parameter :== HARDWARE hardware-type colon-seperated-hex-list SEMI > - * hardware-type :== ETHERNET | TOKEN_RING | FDDI > + * hardware-type :== ETHERNET | TOKEN_RING | FDDI | IPOIB > */ > > void parse_hardware_param (cfile, hardware) > @@ -346,6 +346,9 @@ > case FDDI: > hardware -> hbuf [0] = HTYPE_FDDI; > break; > + case IPOIB: > + hardware -> hbuf [0] = HTYPE_IPOIB; > + break; > default: > if (!strncmp (val, "unknown-", 8)) { > hardware -> hbuf [0] = atoi (&val [8]); > diff -urN dhcp-3.0.2.orig/includes/dhcpd.h dhcp-3.0.2/includes/dhcpd.h > --- dhcp-3.0.2.orig/includes/dhcpd.h 2004-11-24 12:39:16.000000000 -0500 > +++ dhcp-3.0.2/includes/dhcpd.h 2005-03-28 08:25:47.000000000 -0500 > @@ -306,9 +306,9 @@ > # define EPHEMERAL_FLAGS (MS_NULL_TERMINATION | \ > UNICAST_BROADCAST_HACK) > > - binding_state_t __attribute__ ((mode (__byte__))) binding_state; > - binding_state_t __attribute__ ((mode (__byte__))) next_binding_state; > - binding_state_t __attribute__ ((mode (__byte__))) desired_binding_state; > + binding_state_t binding_state; > + binding_state_t next_binding_state; > + 
binding_state_t desired_binding_state; > > struct lease_state *state; > > @@ -1922,7 +1922,7 @@ > u_int32_t checksum PROTO ((unsigned char *, unsigned, u_int32_t)); > u_int32_t wrapsum PROTO ((u_int32_t)); > void assemble_hw_header PROTO ((struct interface_info *, unsigned char *, > - unsigned *, struct hardware *)); > + unsigned *, struct hardware *, u_int32_t)); > void assemble_udp_ip_header PROTO ((struct interface_info *, unsigned char *, > unsigned *, u_int32_t, u_int32_t, > u_int32_t, unsigned char *, unsigned)); > diff -urN dhcp-3.0.2.orig/includes/dhcp.h dhcp-3.0.2/includes/dhcp.h > --- dhcp-3.0.2.orig/includes/dhcp.h 2004-06-10 13:59:29.000000000 -0400 > +++ dhcp-3.0.2/includes/dhcp.h 2005-03-24 13:50:14.000000000 -0500 > @@ -75,6 +75,7 @@ > #define HTYPE_ETHER 1 /* Ethernet 10Mbps */ > #define HTYPE_IEEE802 6 /* IEEE 802.2 Token Ring... */ > #define HTYPE_FDDI 8 /* FDDI... */ > +#define HTYPE_IPOIB 32 /* IPoIB */ > > /* Magic cookie validating dhcp options field (and bootp vendor > extensions field). 
*/ > diff -urN dhcp-3.0.2.orig/includes/dhctoken.h dhcp-3.0.2/includes/dhctoken.h > --- dhcp-3.0.2.orig/includes/dhctoken.h 2004-09-21 15:25:38.000000000 -0400 > +++ dhcp-3.0.2/includes/dhctoken.h 2005-03-26 07:51:29.000000000 -0500 > @@ -308,7 +308,8 @@ > REFRESH = 612, > DOMAIN_NAME = 613, > DO_FORWARD_UPDATE = 614, > - KNOWN_CLIENTS = 615 > + KNOWN_CLIENTS = 615, > + IPOIB = 616 > }; > > #define is_identifier(x) ((x) >= FIRST_TOKEN && \ > diff -urN dhcp-3.0.2.orig/Makefile.conf dhcp-3.0.2/Makefile.conf > --- dhcp-3.0.2.orig/Makefile.conf 2004-11-24 12:39:13.000000000 -0500 > +++ dhcp-3.0.2/Makefile.conf 2005-03-30 08:05:28.000000000 -0500 > @@ -312,7 +312,7 @@ > ## Linux 2.2 > ##--linux-2.2-- > #COPTS = -DLINUX_MAJOR=$(MAJORVERSION) -DLINUX_MINOR=$(MINORVERSION) \ > -# $(BINDDEF) $(CC_OPTIONS) > +# $(BINDDEF) $(CC_OPTIONS) -DHAVE_IPOIB_SUPPORT -DUSERLAND_FILTER > #CF = cf/linux.h > #ADMMANDIR = /usr/man/man8 > #ADMMANEXT = .8 > diff -urN dhcp-3.0.2.orig/readme-ipoib.txt dhcp-3.0.2/readme-ipoib.txt > --- dhcp-3.0.2.orig/readme-ipoib.txt 1969-12-31 19:00:00.000000000 -0500 > +++ dhcp-3.0.2/readme-ipoib.txt 2005-03-30 08:22:37.000000000 -0500 > @@ -0,0 +1,53 @@ > +3/30/05 > +ISC DHCP 3.0.2 > +Internet Systems Consortium DHCP Client V3.0.2 > +Internet Systems Consortium DHCP Server V3.0.2 > + > +Makefile.conf > + -DHAVE_IPOIB_SUPPORT -DUSERLAND_FILTER > + To build on Opteron, also add -DPTRSIZE_64BIT > + > + > +Notes about running > + > +Need to configure the Linux kernel to support > +Socket Filtering and the Packet socket. 
> + CONFIG_FILTER=y > + CONFIG_PACKET=y > + > + > +DHCP client > + > +/sbin/modprobe ib_ipoib > +/sbin/ifconfig ib0 up > + without IPv4 address > + > +dhclient ib0 > + > +need to mkdir /var/state/dhcp so dhclient.leases can be saved > + > +setup /etc/dhclient.conf with client identifier (port GUID) > +interface "ib0" { > + send dhcp-client-identifier 00:08:f1:04:03:96:05:59; > +} > + > + > +DHCP server > + > +Load IPoIB and configure ib0 with IPv4 address > +IP address configured on ib0 > +/sbin/ifconfig ib0 > + > + > +dhcpd ib0 > + > +setup /etc/dhcpd.conf > +with at least IP address range > +maybe also client identifier if want fixed IP address per client > +ddns-update-style none; > +subnet 192.168.0.0 netmask 255.255.255.0 { > + range 192.168.0.10 192.168.0.20; > +} > + > +touch /var/state/dhcp/dhcpd.leases > + > diff -urN dhcp-3.0.2.orig/server/dhcp.c dhcp-3.0.2/server/dhcp.c > --- dhcp-3.0.2.orig/server/dhcp.c 2004-11-24 12:39:19.000000000 -0500 > +++ dhcp-3.0.2/server/dhcp.c 2005-03-28 08:52:46.000000000 -0500 > @@ -267,7 +267,7 @@ > /* %Audit% This is log output. %2004.06.17,Safe% > * If we truncate we hope the user can get a hint from the log. > */ > - snprintf (msgbuf, sizeof msgbuf, "DHCPDISCOVER from %s %s%s%svia %s", > + snprintf (msgbuf, sizeof msgbuf, "DHCPDISCOVER from %s %s%s%s via %s", > (packet -> raw -> htype > ? print_hw_addr (packet -> raw -> htype, > packet -> raw -> hlen, > @@ -476,7 +476,7 @@ > * If we truncate we hope the user can get a hint from the log. > */ > snprintf (msgbuf, sizeof msgbuf, > - "DHCPREQUEST for %s%s from %s %s%s%svia %s", > + "DHCPREQUEST for %s%s from %s %s%s%s via %s", > piaddr (cip), smbuf, > (packet -> raw -> htype > ? print_hw_addr (packet -> raw -> htype, > @@ -769,7 +769,7 @@ > * If we truncate we hope the user can get a hint from the log. 
> */ > snprintf (msgbuf, sizeof msgbuf, > - "DHCPRELEASE of %s from %s %s%s%svia %s (%sfound)", > + "DHCPRELEASE of %s from %s %s%s%s via %s (%sfound)", > cstr, > (packet -> raw -> htype > ? print_hw_addr (packet -> raw -> htype, > @@ -859,7 +859,7 @@ > * If we truncate we hope the user can get a hint from the log. > */ > snprintf (msgbuf, sizeof msgbuf, > - "DHCPDECLINE of %s from %s %s%s%svia %s", > + "DHCPDECLINE of %s from %s %s%s%s via %s", > piaddr (cip), > (packet -> raw -> htype > ? print_hw_addr (packet -> raw -> htype, > @@ -2807,7 +2807,7 @@ > s = (char *)0; > > /* Say what we're doing... */ > - log_info ("%s on %s to %s %s%s%svia %s", > + log_info ("%s on %s to %s %s%s%s via %s", > (state -> offer > ? (state -> offer == DHCPACK ? "DHCPACK" : "DHCPOFFER") > : "BOOTREPLY"), > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From peter at pantasys.com Wed Mar 30 17:38:11 2005 From: peter at pantasys.com (Peter Buckingham) Date: Wed, 30 Mar 2005 17:38:11 -0800 Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB In-Reply-To: <424B4CA7.1050606@sandia.gov> References: <1112190300.4495.67.camel@localhost.localdomain> <424B4CA7.1050606@sandia.gov> Message-ID: <424B5483.3010500@pantasys.com> Josh England wrote: > Are there any plans to modify the linux DHCP client so it would be > possible to do kernel-level DHCP and NFSroot over IB? i'd strongly suggest not using the in-kernel dhcpd as it has bugs and isn't really well supported. probably the best thing would be to add the support to klibc and the initramfs tools. This actually gives you much more flexibility in the end too (we can manage some failover situations with this sort of set up). 
peter From peter at pantasys.com Wed Mar 30 17:40:33 2005 From: peter at pantasys.com (Peter Buckingham) Date: Wed, 30 Mar 2005 17:40:33 -0800 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <20050329181228.H31683@topspin.com> References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> <20050330010814.GB24794@esmail.cup.hp.com> <20050329181228.H31683@topspin.com> Message-ID: <424B5511.3020107@pantasys.com> Libor Michalek wrote: > The recompile modifications to use SDP protocol family for regular > ttcp are pretty straightforward: > > #include > #undef AF_INET > #define AF_INET AF_INET_SDP if i want to do in-kernel SDP can i just change my in-kernel sockets address family too, or is it more complicated? thanks, peter From libor at topspin.com Wed Mar 30 17:54:19 2005 From: libor at topspin.com (Libor Michalek) Date: Wed, 30 Mar 2005 17:54:19 -0800 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <424B5511.3020107@pantasys.com>; from peter@pantasys.com on Wed, Mar 30, 2005 at 05:40:33PM -0800 References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> <20050330010814.GB24794@esmail.cup.hp.com> <20050329181228.H31683@topspin.com> <424B5511.3020107@pantasys.com> Message-ID: <20050330175419.C32764@topspin.com> On Wed, Mar 30, 2005 at 05:40:33PM -0800, Peter Buckingham wrote: > Libor Michalek wrote: > > The recompile modifications to use SDP protocol family for regular > > ttcp are pretty straightforward: > > > > #include > > #undef AF_INET > > #define AF_INET AF_INET_SDP > > if i want to do in-kernel SDP can i just change my in-kernel sockets > address family too, or is it more complicated? Should work, as long as you access the socket using the generic kernel socket interface and don't rely on the internals of the 'struct sock' to exactly match those of a PF_INET socket. 
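For the user-space case quoted above, here is a minimal sketch of what the address-family substitution amounts to at the socket() call. It is not taken from ttcp. The helper name open_stream_socket is hypothetical, and the fallback value 27 for AF_INET_SDP is an assumption for illustration only: the thread never states the value, and a real build should take it from the SDP header. The sketch also falls back to plain TCP when the running kernel has no SDP support, which the #define approach does not do.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Assumption: placeholder value, used only if no SDP header defined it. */
#ifndef AF_INET_SDP
#define AF_INET_SDP 27
#endif

/*
 * Hypothetical helper: open a stream socket, trying the SDP address
 * family first when prefer_sdp is non-zero and falling back to AF_INET
 * if the kernel refuses it.  The family actually used is stored in
 * *family_used.  Returns the descriptor, or -1 on failure.
 */
int open_stream_socket(int prefer_sdp, int *family_used)
{
    int fd;

    if (prefer_sdp) {
        fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
        if (fd >= 0) {
            *family_used = AF_INET_SDP;
            return fd;
        }
        /* SDP not available here: fall through to TCP. */
    }

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd >= 0)
        *family_used = AF_INET;
    return fd;
}
```

Everything after socket() (bind, connect, read, write) goes through the generic socket calls, which is the same condition given above for the in-kernel case.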
-Libor From halr at voltaire.com Wed Mar 30 20:35:21 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Mar 2005 23:35:21 -0500 Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB In-Reply-To: <20050330190323.GB27519@esmail.cup.hp.com> References: <1112190300.4495.67.camel@localhost.localdomain> <20050330190323.GB27519@esmail.cup.hp.com> Message-ID: <1112235290.4476.84.camel@localhost.localdomain> On Wed, 2005-03-30 at 14:03, Grant Grundler wrote: > On Wed, Mar 30, 2005 at 08:45:01AM -0500, Hal Rosenstock wrote: > > I wanted to make this available to this community first. > > Any feedback is welcome. > ... > > +/* ipoib.c > > + > > + Packet assembly code, originally contributed by Archie Cobbs. */ > > This can't be right. > > > + > > +/* > > + * Copyright (c) 2004 by Internet Systems Consortium, Inc. ("ISC") > > + * Copyright (c) 1996-2003 by Internet Software Consortium > > + * > > + * Permission to use, copy, modify, and distribute this software for any > > + * purpose with or without fee is hereby granted, provided that the above > > + * copyright notice and this permission notice appear in all copies. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS" AND ISC DISCLAIMS ALL WARRANTIES > > + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF > > + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL ISC BE LIABLE FOR > > + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES > > + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN > > + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT > > + * OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. > > + * > > + * Internet Systems Consortium, Inc. > > + * 950 Charter Street > > + * Redwood City, CA 94063 > > + * > > + * http://www.isc.org/ > > + * > > + * This software has been written for Internet Systems Consortium > > + * by Ted Lemon in cooperation with Vixie Enterprises and Nominum, Inc. 
> > + * To learn more about Internet Systems Consortium, see > > + * ``http://www.isc.org/''. To learn more about Vixie Enterprises, > > + * see ``http://www.vix.com''. To learn more about Nominum, Inc., see > > + * ``http://www.nominum.com''. > > + */ > > This is a new file right? > Copyright owner and date and the "by Ted Lemon" stuff at the bottom > doesn't seem right for this file. > > > +static char copyright[] = > > +"$Id: ipoib.c,v 1.3.2.2 2004/06/10 17:59:18 dhankins Exp $ Copyright (c) 2004 Internet Systems Consortium. All rights reserved.\n"; > > Ditto. > > > I don't know the rules for submitting code to the ISC. > My guess is they have to be as anal about copyright assignment > and license as GNU foundation. You probably want to review > this briefly with one or more of Voltaire/Openib/ISC legal. Thanks. I had just originated this file from some other network file in the ISC release with those names in it and didn't change those things. -- Hal From sbahling at novell.com Thu Mar 31 00:49:21 2005 From: sbahling at novell.com (Scott Bahling) Date: Thu, 31 Mar 2005 10:49:21 +0200 Subject: [openib-general] Gen1 vs Gen2 Message-ID: <1112258962.16424.126.camel@K45.suse.de> Hi, What is the general consensus about the Gen1 kernel drivers compared to the Gen2 drivers? What is the maturity of the Gen2 code relative to Gen1, and are there any features in Gen1 missing from Gen2? I am mostly interested in the kernel module code, but if the Gen2 userspace code is much better, and is dependent on the Gen2 kernel code, I would be interested to hear about that also. Thanks, Scott 
From halr at voltaire.com Thu Mar 31 04:52:03 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Mar 2005 07:52:03 -0500 Subject: [openib-general] Gen1 vs Gen2 In-Reply-To: <1112258962.16424.126.camel@K45.suse.de> References: <1112258962.16424.126.camel@K45.suse.de> Message-ID: <1112273523.4476.424.camel@localhost.localdomain> Hi Scott, On Thu, 2005-03-31 at 03:49, Scott Bahling wrote: > What is the general consensus about the Gen1 kernel drivers compared to > the Gen2 drivers? The gen2 drivers are faster (more throughput, lower latency) and are rapidly catching up in functionality. The driver supports both Tavor and Arbel, and works in memfree mode. Sinai support will follow. > What is the maturity of the Gen2 code relative to Gen1, Depends on how you measure maturity. In time since code written, the code is less mature. However, it is being shaken out by a wider community and more actively supported than any of the gen1 code. If you monitor this list, you can see that various gen1 providers have encouraged their customers in this direction. > and are there any features in Gen1 missing from Gen2? In terms of the HCA driver, there are a few things still missing from gen2 (OpenIB). Some things which come to mind are SMR and SRQ support. What ULPs are you interested in ? IPoIB v4/v6 has been there for a while and adopted upstream. SDP appears to be next. SRP will follow thereafter. There is also an early kDAPL provider soon to be put into the OpenIB tree. > I am mostly interested in the kernel module code, but if the Gen2 userspace code is > much better, and is dependent on the Gen2 kernel code, I would be > interested to hear about that also. gen2 userspace code is also better per the same metrics and is dependent on the kernel code. It is stable based on the testing to date. 
This is clearly less mature right now (and also has a few minor things missing for which there are no known current consumers). It is coming back to the trunk shortly. User space CM support should also be there shortly. User space ULPs will follow. There is considerable MPI and uDAPL work ongoing based on the user verbs branch. All of the above is not to say that there won't be some bugs to shake out (your mileage may vary) but most problems found and reported to this list have been resolved extremely quickly. -- Hal From mst at mellanox.co.il Thu Mar 31 06:10:24 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 31 Mar 2005 16:10:24 +0200 Subject: [openib-general] [PATCH] sdp_kmap to kmap_atomic Message-ID: <20050331141024.GU15034@mellanox.co.il> Replace sdp_kmap by kmap_atomic. Use KM_IRQ0 slot, and disable local interrupts to avoid kmap slot collision. Incidentally, sdp_iocb.h is now free of stuff not related to iocb. Signed-off-by: Michael S. Tsirkin Index: drivers/infiniband/ulp/sdp/sdp_send.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_send.c (revision 2096) +++ drivers/infiniband/ulp/sdp/sdp_send.c (working copy) @@ -641,6 +641,7 @@ static int sdp_send_data_iocb_src(struct SDP_BUFF_F_CLR_UNSIG(buff); if (conn->send_mode == SDP_MODE_COMB) { + unsigned long flags; void *addr; int pos; int off; @@ -662,21 +663,22 @@ static int sdp_send_data_iocb_src(struct result = -EFAULT; goto error; } - /* - * map, copy, unmap. 
- */ - addr = sdp_kmap(iocb->page_array[pos]); + + local_irq_save(flags); + + addr = kmap_atomic(iocb->page_array[pos], KM_IRQ0); if (!addr) { result = -ENOMEM; + local_irq_restore(flags); goto error; } memcpy(buff->tail, (addr + off), len); - sdp_kunmap(iocb->page_array[pos]); - /* - * update pointers - */ + kunmap_atomic(iocb->page_array[pos], KM_IRQ0); + + local_irq_restore(flags); + buff->data_size = len; buff->tail += len; @@ -731,14 +733,16 @@ static int sdp_send_iocb_buff_write(stru counter = (iocb->post + iocb->page_offset) >> PAGE_SHIFT; offset = (iocb->post + iocb->page_offset) & (~PAGE_MASK); - while (buff->tail < buff->end && - iocb->len > 0) { - /* - * map correct page of iocb - */ - addr = sdp_kmap(iocb->page_array[counter]); - if (!addr) + + while (buff->tail < buff->end && iocb->len > 0) { + unsigned long flags; + local_irq_save(flags); + + addr = kmap_atomic(iocb->page_array[counter], KM_IRQ0); + if (!addr) { + local_irq_restore(flags); break; + } copy = min((PAGE_SIZE - offset), (unsigned long)(buff->end - buff->tail)); @@ -755,7 +759,8 @@ static int sdp_send_iocb_buff_write(stru offset += copy; offset &= (~PAGE_MASK); - sdp_kunmap(iocb->page_array[counter++]); + kunmap_atomic(iocb->page_array[counter++], KM_IRQ0); + local_irq_restore(flags); } return 0; Index: drivers/infiniband/ulp/sdp/sdp_recv.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_recv.c (revision 2096) +++ drivers/infiniband/ulp/sdp/sdp_recv.c (working copy) @@ -659,12 +659,11 @@ static int sdp_read_buff_iocb(struct sdp counter = (iocb->post + iocb->page_offset) >> PAGE_SHIFT; offset = (iocb->post + iocb->page_offset) & (~PAGE_MASK); - while (buff->data < buff->tail && - iocb->len > 0) { - /* - * map correct page of iocb - */ - addr = sdp_kmap(iocb->page_array[counter]); + while (buff->data < buff->tail && iocb->len > 0) { + unsigned long flags; + local_irq_save(flags); + + addr = kmap_atomic(iocb->page_array[counter], 
KM_IRQ0); if (!addr) break; @@ -682,7 +681,9 @@ static int sdp_read_buff_iocb(struct sdp offset += copy; offset &= (~PAGE_MASK); - sdp_kunmap(iocb->page_array[counter++]); + kunmap_atomic(iocb->page_array[counter++], KM_IRQ0); + + local_irq_restore(flags); } /* * restore tail from OOB offset. Index: drivers/infiniband/ulp/sdp/sdp_iocb.h =================================================================== --- drivers/infiniband/ulp/sdp/sdp_iocb.h (revision 2096) +++ drivers/infiniband/ulp/sdp/sdp_iocb.h (working copy) @@ -124,30 +124,4 @@ struct sdpc_iocb_q { int size; /* current number of IOCBs in table */ }; -/* - * Address translations - */ - -/* - * sdp_kmap - map a page into kernel space - */ -static inline void *sdp_kmap(struct page *page) -{ - if (in_atomic() || irqs_disabled()) - return kmap_atomic(page, KM_IRQ0); - else - return kmap(page); -} - -/* - * sdp_kunmap - unmap a page into kernel space - */ -static inline void sdp_kunmap(struct page *page) -{ - if (in_atomic() || irqs_disabled()) - kunmap_atomic(page, KM_IRQ0); - else - kunmap(page); -} - #endif /* _SDP_IOCB_H */ -- MST - Michael S. Tsirkin From halr at voltaire.com Thu Mar 31 09:28:59 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Mar 2005 12:28:59 -0500 Subject: [openib-general] What context can CM be called from? In-Reply-To: <4249ED30.3060208@ichips.intel.com> References: <506C3D7B14CDD411A52C00025558DED6064BF21D@mtlex01.yok.mtl.com> <52u0muqvt7.fsf@topspin.com> <4249907F.8010101@ichips.intel.com> <20050329153826.D31683@topspin.com> <4249ED30.3060208@ichips.intel.com> Message-ID: <1112290139.4490.20.camel@localhost.localdomain> On Tue, 2005-03-29 at 19:05, Sean Hefty wrote: > Libor Michalek wrote: > > >>With this patch, changing the kmalloc in cm_alloc_msg() to use > >>GFP_ATOMIC rather than GFP_KERNEL should allow the CM to be usable from > >>interrupt context. Of course, I haven't actually tested this... > >> > >>I have no objection to this change however. 
> > > > I could go either way on this issue myself. If the call can only be > > made from thread context I will use schedule_work() to execute the > > request to send the dreq. However, I would imagine that other CM users > > would want to send requests in interrupt context... > > My intention was that the CM should be able to match the calling > conventions of underlying verbs/mad layer routines, except for the > destroy_cm_id call that may block. It should be easy enough to at > least test whether the code works at interrupt with these changes, and > if not, then call schedule_work until we can identify why not and see > if other changes can be made to support it. Is this just the kmalloc in cm_alloc_msg or is there more to this ? One other comment/question about cm_alloc_msg: It seems possible that this is called prior to cm_id_priv->av.port being initialized. Should an error be returned for this case ? A follow on to this: Doing so appears to cause the connection to go into timewait. Is that correct for these cases ? Not sure what else can be done (perhaps error ?) -- Hal From mshefty at ichips.intel.com Thu Mar 31 09:41:59 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 31 Mar 2005 09:41:59 -0800 Subject: [openib-general] What context can CM be called from? In-Reply-To: <1112290139.4490.20.camel@localhost.localdomain> References: <506C3D7B14CDD411A52C00025558DED6064BF21D@mtlex01.yok.mtl.com> <52u0muqvt7.fsf@topspin.com> <4249907F.8010101@ichips.intel.com> <20050329153826.D31683@topspin.com> <4249ED30.3060208@ichips.intel.com> <1112290139.4490.20.camel@localhost.localdomain> Message-ID: <424C3667.7040909@ichips.intel.com> Hal Rosenstock wrote: > Is this just the kmalloc in cm_alloc_msg or is there more to this ? I _think_ that the kmalloc in cm_alloc_msg is all that needs to change. > One other comment/question about cm_alloc_msg: > > It seems possible that this is called prior to cm_id_priv->av.port being > initialized. 
Should an error be returned for this case ? Can you reference the place in the code where you think that this could happen? The port should be set before a REQ is sent or immediately after one is received. > A follow on to this: > Doing so appears to cause the connection to go into timewait. Is that > correct for these cases ? Not sure what else can be done (perhaps error > ?) I tried to follow the connection states defined by the spec. There were a couple cases where the spec didn't define a transition, so in those cases I simply made my best judgment on whether to transition to timewait or back to idle. I can't recall the specific transitions that were missing at the moment, but there were probably 3-4 of them. In some cases, an error left the cm_id in its current state, which lets the user retry the operation or abort (e.g. reject the connection). - Sean From ardavis at ichips.intel.com Thu Mar 31 09:45:06 2005 From: ardavis at ichips.intel.com (ardavis) Date: Thu, 31 Mar 2005 09:45:06 -0800 Subject: [openib-general] uverbs events Message-ID: <424C3722.9070402@ichips.intel.com> Has anyone successfully run uverbs examples with events using ibv_get_cq_event? It seems to block forever on my system with the pingpong test. Thanks, -arlin From halr at voltaire.com Thu Mar 31 09:49:38 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Mar 2005 12:49:38 -0500 Subject: [openib-general] What context can CM be called from? In-Reply-To: <424C3667.7040909@ichips.intel.com> References: <506C3D7B14CDD411A52C00025558DED6064BF21D@mtlex01.yok.mtl.com> <52u0muqvt7.fsf@topspin.com> <4249907F.8010101@ichips.intel.com> <20050329153826.D31683@topspin.com> <4249ED30.3060208@ichips.intel.com> <1112290139.4490.20.camel@localhost.localdomain> <424C3667.7040909@ichips.intel.com> Message-ID: <1112291273.4490.54.camel@localhost.localdomain> On Thu, 2005-03-31 at 12:41, Sean Hefty wrote: > > It seems possible that this is called prior to cm_id_priv->av.port being > > initialized. 
Should an error be returned for this case ? > > Can you reference the place in the code where you think that this could > happen? The port should be set before a REQ is sent or immediately > after one is received. The simplest case is the cm id is created and then ib_send_cm_dreq is called. There may be others. Is this worth protecting against ? -- Hal From roland at topspin.com Thu Mar 31 09:46:56 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 31 Mar 2005 09:46:56 -0800 Subject: [openib-general] Re: uverbs events In-Reply-To: <424C3722.9070402@ichips.intel.com> (ardavis@ichips.intel.com's message of "Thu, 31 Mar 2005 09:45:06 -0800") References: <424C3722.9070402@ichips.intel.com> Message-ID: <52oeczoghb.fsf@topspin.com> ardavis> Has anyone successfully run uverbs examples with events ardavis> using ibv_get_cq_event? It seems to block forever on my ardavis> system with the pingpong test. Yes, I have. I'll try again with the latest code to make sure I haven't broken anything recently. - R. From mshefty at ichips.intel.com Thu Mar 31 10:10:30 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 31 Mar 2005 10:10:30 -0800 Subject: [openib-general] What context can CM be called from? In-Reply-To: <1112291273.4490.54.camel@localhost.localdomain> References: <506C3D7B14CDD411A52C00025558DED6064BF21D@mtlex01.yok.mtl.com> <52u0muqvt7.fsf@topspin.com> <4249907F.8010101@ichips.intel.com> <20050329153826.D31683@topspin.com> <4249ED30.3060208@ichips.intel.com> <1112290139.4490.20.camel@localhost.localdomain> <424C3667.7040909@ichips.intel.com> <1112291273.4490.54.camel@localhost.localdomain> Message-ID: <424C3D16.9090307@ichips.intel.com> Hal Rosenstock wrote: >>>It seems possible that this is called prior to cm_id_priv->av.port being >>>initialized. Should an error be returned for this case ? >> >>Can you reference the place in the code where you think that this could >>happen? 
The port should be set before a REQ is sent or immediately >>after one is received. > > > The simplest case is the cm id is created and then ib_send_cm_dreq is > called. There may be others. Is this worth protecting against ? Hmm... I'm not sure if it's worth protecting against that in the kernel. But this occurs in most of the APIs. I allocated the message before checking the state to avoid doing the message allocation and formatting while holding the spinlock, and to avoid complicated error recovery if the allocation failed. For example, changing the state first, then performing the allocation outside of the spinlock can lead to situations where the state can change as a result of receiving an incoming message. So, if the allocation fails, it's difficult to determine what needs to be done. The drawback is that if the user calls the API at random, then you're correct, the av.port field may not be initialized and would crash the system. I need to think about whether a reasonable app could hit this condition though, and whether a simple not NULL check in cm_alloc_msg is sufficient protection. - Sean From libor at topspin.com Thu Mar 31 10:51:41 2005 From: libor at topspin.com (Libor Michalek) Date: Thu, 31 Mar 2005 10:51:41 -0800 Subject: [openib-general] What context can CM be called from? In-Reply-To: <424C3667.7040909@ichips.intel.com>; from mshefty@ichips.intel.com on Thu, Mar 31, 2005 at 09:41:59AM -0800 References: <506C3D7B14CDD411A52C00025558DED6064BF21D@mtlex01.yok.mtl.com> <52u0muqvt7.fsf@topspin.com> <4249907F.8010101@ichips.intel.com> <20050329153826.D31683@topspin.com> <4249ED30.3060208@ichips.intel.com> <1112290139.4490.20.camel@localhost.localdomain> <424C3667.7040909@ichips.intel.com> Message-ID: <20050331105141.A1541@topspin.com> On Thu, Mar 31, 2005 at 09:41:59AM -0800, Sean Hefty wrote: > Hal Rosenstock wrote: > > Is this just the kmalloc in cm_alloc_msg or is there more to this ? 
> > I _think_ that the kmalloc in cm_alloc_msg is all that needs to change. Yes, this CM change should be sufficient. I'm testing it now and it looks good. I'll run some more tests and then check in the change. -Libor From xma at us.ibm.com Thu Mar 31 11:42:42 2005 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 31 Mar 2005 11:42:42 -0800 Subject: [openib-general] Multiple IPoIB devices over same port In-Reply-To: Message-ID: > a separate Qos tag is used for each stream? Sorry for the late response. You could do that. I just list some possible uses of Multiple IPoIB devices over same port. Other reasons could be different MTUs or other interfaces characteristics.. Basically there is no reason to restrict this implementation. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 "shaharf" 03/24/2005 10:00 AM To Shirley Ma/Beaverton/IBM at IBMUS cc Subject RE: [openib-general] Multiple IPoIB devices over same port I am not sure I follow. Do you mean that for example, a separate Qos tag is used for each stream? If this is the case I am not sure how to handle Qos tags from the VM. I thought about this so that at a higher level you can use policy routing or QoS or traffic shapping or maybe even CKRM to distribute load on multiple streams. Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Thu Mar 31 12:02:43 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 31 Mar 2005 12:02:43 -0800 Subject: [openib-general] [RMPP] vendor-specific OUI field in RMPP MADs In-Reply-To: <42488FDF.2050608@ichips.intel.com> References: <42488FDF.2050608@ichips.intel.com> Message-ID: <424C5763.8040108@ichips.intel.com> For vendor-specific MADs 0x30-0x4F, does anyone know if the OUI is repeated in every RMPP segment? 
My assumption is that it does, but I can't locate a specific statement in the spec that confirms this. - Sean From halr at voltaire.com Thu Mar 31 12:26:08 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Mar 2005 15:26:08 -0500 Subject: [openib-general] [RMPP] vendor-specific OUI field in RMPP MADs In-Reply-To: <424C5763.8040108@ichips.intel.com> References: <42488FDF.2050608@ichips.intel.com> <424C5763.8040108@ichips.intel.com> Message-ID: <1112300768.4490.238.camel@localhost.localdomain> On Thu, 2005-03-31 at 15:02, Sean Hefty wrote: > For vendor-specific MADs 0x30-0x4F, does anyone know if the OUI is > repeated in every RMPP segment? My assumption is that it does, but I > can't locate a specific statement in the spec that confirms this. Your assumption is correct. As those MADs are required to conform to Figure 210 per o16-12.1.2 (IBA 1.2), the OUI must appear in each MAD regardless of whether RMPP is active or not. -- Hal From roland at topspin.com Thu Mar 31 12:53:48 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 31 Mar 2005 12:53:48 -0800 Subject: [openib-general] problem with SDP/AIO on mem-free HCA Message-ID: <52acojmt34.fsf@topspin.com> [err, resending with a correct openib to: line] I'm hitting a strange problem with SDP/AIO on a mem-free Arbel. My test is the following: I run Libor's ttcp.aio program with default parameters (which I think just leaves one AIO in flight at a time) as follows: ttcp.aio.x -r -s & ttcp.aio.x -t -s 127.0.0.1 This always fails with a remote access error exactly 256K into the test. 
I see the following in my log (with some extra tracing added to SDP to get info on the RDMAs being posted): WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5d> at = <1d94e000>/<1000> WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5e> at = <1d94f000>/<1000> WARN: <2> <050e:11b1> Posting SEND, wrid <5f> WARN: <1> <050e:11b1> Posting SEND, wrid <20> CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst = <0d000002:8001> CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst = <0d000002:8001> WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <60> at = <1d94e000>/<0> WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <61> at = <1d94e000>/<1000> WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <62> at = <1d94f000>/<1000> ib_mthca 0000:07:00.0: 86/66: error CQE -> QPN 000407, WQE @ = 00001803 [ 0] 00000407 [ 4] b3000000 [ 8] fd000003 [ c] 110000c0 [10] 13880000 [14] 00000010 [18] 00001803 [1c] ff100000 WARN: : Unhandled status <10> unknown event <-1> wrid <60> As you can see, the failed work request is an RDMA with length 0. The previous work request with wrid 5d with the same R_Key and remote address but a length of 0x1000 appears to complete successfully so the FMR seems to be OK. So I guess there are two questions: - why is SDP doing a zero-length RDMA read? - is it correct for this to fail with a remote access error? I have not had a chance to test zero-length RDMA without involving FMRs but I don't think the FMR code is to blame. Also BTW, the code in sdp_cq_event_locked() is somewhat bogus: it switches on comp->opcode even when comp->status is not success. However, if the comp->status is not success, then per the IB spec, mthca does not set the comp->opcode field. - R. 
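For illustration only, the point about sdp_cq_event_locked() can be modelled outside the kernel. The sketch below is hypothetical user-space C: the sketch_* types and classify_completion() are invented for this example and are not the SDP or verbs definitions. It shows just the ordering, i.e. look at the completion status first and consult the opcode only for successful completions.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for a work completion (not struct ib_wc). */
enum sketch_status { SKETCH_WC_SUCCESS, SKETCH_WC_ERROR };
enum sketch_opcode { SKETCH_OP_SEND, SKETCH_OP_RDMA_READ, SKETCH_OP_RECV };

enum sketch_action {
	ACT_SEND_DONE,
	ACT_READ_DONE,
	ACT_RECV_DONE,
	ACT_FAIL_BY_WRID,	/* error path: identify the request by wr_id */
	ACT_UNKNOWN_OP
};

struct sketch_wc {
	uint64_t wr_id;
	enum sketch_status status;
	enum sketch_opcode opcode;	/* only meaningful when status is success */
};

/* Decide how to handle a completion without trusting opcode on failure. */
static enum sketch_action classify_completion(const struct sketch_wc *wc)
{
	/* Check status first; opcode may be unset for failed completions. */
	if (wc->status != SKETCH_WC_SUCCESS)
		return ACT_FAIL_BY_WRID;

	switch (wc->opcode) {
	case SKETCH_OP_SEND:
		return ACT_SEND_DONE;
	case SKETCH_OP_RDMA_READ:
		return ACT_READ_DONE;
	case SKETCH_OP_RECV:
		return ACT_RECV_DONE;
	default:
		return ACT_UNKNOWN_OP;
	}
}
```

One option, assumed here rather than taken from the SDP code, is for the error path to use the wrid alone to work out which posted request failed.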
From halr at voltaire.com Thu Mar 31 13:25:28 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Mar 2005 16:25:28 -0500 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <20050330164349.B32764@topspin.com> References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> <20050330010814.GB24794@esmail.cup.hp.com> <20050329181228.H31683@topspin.com> <20050330032904.GA24936@esmail.cup.hp.com> <20050330164349.B32764@topspin.com> Message-ID: <1112304328.7331.42.camel@localhost.localdomain> On Wed, 2005-03-30 at 19:43, Libor Michalek wrote: > The program has a decent help for available parameters, but here are > some reasonable defaults: > > server: > > ./ttcp.aio.x -r -l 65536 -a 20 > > client: > > ./ttcp.aio.x -t -l 65536 -n 100000 -a 20 192.168.0.100 Are these the parameters used to achieve the throughput numbers you published ? > For async socket I kept 20 96K buffers in flight. For the FMR pool cache > hit async results I used only 20 different buffers. For the FMR pool cache > miss async results I used 1000 different buffers, of which only 20 were in > flight at a time. Sounds like you tweaked the numbers in sdp_dev.h. Anywhere else ? Can you provide the tuning numbers used and where they were found so these results can be reproduced ? Thanks. 
-- Hal From libor at topspin.com Thu Mar 31 13:41:16 2005 From: libor at topspin.com (Libor Michalek) Date: Thu, 31 Mar 2005 13:41:16 -0800 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <1112304328.7331.42.camel@localhost.localdomain>; from halr@voltaire.com on Thu, Mar 31, 2005 at 04:25:28PM -0500 References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> <20050330010814.GB24794@esmail.cup.hp.com> <20050329181228.H31683@topspin.com> <20050330032904.GA24936@esmail.cup.hp.com> <20050330164349.B32764@topspin.com> <1112304328.7331.42.camel@localhost.localdomain> Message-ID: <20050331134116.B1541@topspin.com> On Thu, Mar 31, 2005 at 04:25:28PM -0500, Hal Rosenstock wrote: > On Wed, 2005-03-30 at 19:43, Libor Michalek wrote: > > The program has a decent help for available parameters, but here are > > some reasonable defaults: > > > > server: > > > > ./ttcp.aio.x -r -l 65536 -a 20 > > > > client: > > > > ./ttcp.aio.x -t -l 65536 -n 100000 -a 20 192.168.0.100 > > Are these the parameters used to achieve the throughput numbers you > published ? > > Sounds like you tweaked the numbers in sdp_dev.h. Anywhere else ? > > Can you provide the tuning numbers used and where they were found so these > results can be reproduced ? No tweaking or changes to the SDP code itself. The parameters above should give similar results, but here are the exact parameters I used for the two async tests I mentioned in the original results I posted. > > For async socket I kept 20 96K buffers in flight. For the FMR pool cache > > hit async results I used only 20 different buffers. ./ttcp.aio.x -r -l 98304 -a 20 -f M ./ttcp.aio.x -t -l 98304 -n 200000 -a 20 -f M 192.168.0.100 > > For the FMR pool cache miss async results I used 1000 different > > buffers, of which only 20 were in flight at a time. 
./ttcp.aio.x -r -l 98304 -a 20 -x 1000 -f M ./ttcp.aio.x -t -l 98304 -n 200000 -a 20 -x 1000 -f M 192.168.0.100 -Libor From rminnich at lanl.gov Thu Mar 31 14:23:59 2005 From: rminnich at lanl.gov (Ronald G. Minnich) Date: Thu, 31 Mar 2005 15:23:59 -0700 (MST) Subject: [openib-general] Link encap:UNSPEC Message-ID: is there a number that this means, i.e. is ifconfig saying "I don't know this number so it is UNSPEC" or is it a number that means "NaN"? thanks ron From roland at topspin.com Thu Mar 31 14:23:17 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 31 Mar 2005 14:23:17 -0800 Subject: [openib-general] problem with SDP/AIO on mem-free HCA In-Reply-To: <52acojmt34.fsf@topspin.com> (Roland Dreier's message of "Thu, 31 Mar 2005 12:53:48 -0800") References: <52acojmt34.fsf@topspin.com> Message-ID: <5264z7mp4a.fsf@topspin.com> FWIW, this is with a 32-bit executable on a 64-bit kernel. Another weird thing is that SDP seems to be doing two RDMA READs of size 0x1000 at remote addresses 0x1d94e000 and 0x1d94f000. With Tavors on a 32-bit machine, the same command line parameters result in a single RDMA read of size 0x2000. - R. From halr at voltaire.com Thu Mar 31 15:06:30 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Mar 2005 18:06:30 -0500 Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB In-Reply-To: <424B4CA7.1050606@sandia.gov> References: <1112190300.4495.67.camel@localhost.localdomain> <424B4CA7.1050606@sandia.gov> Message-ID: <1112310139.4490.24.camel@localhost.localdomain> On Wed, 2005-03-30 at 20:04, Josh England wrote: > Are there any plans to modify the linux DHCP client so it would be > possible to do kernel-level DHCP and NFSroot over IB? I took a quick look at this and it looks pretty straightforward. Stay tuned... -- Hal From mst at mellanox.co.il Thu Mar 31 15:10:23 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Fri, 1 Apr 2005 02:10:23 +0300 Subject: [openib-general] Re: problem with SDP/AIO on mem-free HCA In-Reply-To: <52acojmt34.fsf@topspin.com> References: <52acojmt34.fsf@topspin.com> Message-ID: <20050331231023.GC6807@mellanox.co.il> Quoting r. Roland Dreier : > Subject: problem with SDP/AIO on mem-free HCA > > [err, resending with a correct openib to: line] > > I'm hitting a strange problem with SDP/AIO on a mem-free Arbel. My > test is the following: I run Libor's ttcp.aio program with default > parameters (which I think just leaves one AIO in flight at a time) as > follows: > > ttcp.aio.x -r -s & > ttcp.aio.x -t -s 127.0.0.1 > > This always fails with a remote access error exactly 256K into the > test. I see the following in my log (with some extra tracing added to > SDP to get info on the RDMAs being posted): > > WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5d> at = > <1d94e000>/<1000> > WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5e> at = > <1d94f000>/<1000> > WARN: <2> <050e:11b1> Posting SEND, wrid <5f> > WARN: <1> <050e:11b1> Posting SEND, wrid <20> > CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst = > <0d000002:8001> > CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst = > <0d000002:8001> > WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <60> at = > <1d94e000>/<0> > WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <61> at = > <1d94e000>/<1000> > WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <62> at = > <1d94f000>/<1000> > ib_mthca 0000:07:00.0: 86/66: error CQE -> QPN 000407, WQE @ = > 00001803 > [ 0] 00000407 > [ 4] b3000000 > [ 8] fd000003 > [ c] 110000c0 > [10] 13880000 > [14] 00000010 > [18] 00001803 > [1c] ff100000 > WARN: : Unhandled status <10> unknown event <-1> wrid <60> > > As you can see, the failed work request is an RDMA with length 0. 
The > previous work request with wrid 5d with the same R_Key and remote > address but a length of 0x1000 appears to complete successfully so the > FMR seems to be OK. > > So I guess there are two questions: > - why is SDP doing a zero-length RDMA read? > - is it correct for this to fail with a remote access error? > I have not had a chance to test zero-length RDMA without involving > FMRs but I don't think the FMR code is to blame. I dont think so. I found this: C9-88: For an HCA responder using Reliable Connection service, for each zero-length RDMA READ or WRITE request, the R_Key shall not be validated, even if the request includes Immediate data. Can it be you generate a non-zero RDMA in mthca. > Also BTW, the code in sdp_cq_event_locked() is somewhat bogus: it > switches on comp->opcode even when comp->status is not success. > However, if the comp->status is not success, then per the IB spec, > mthca does not set the comp->opcode field. > > - R. > -- MST - Michael S. Tsirkin From mshefty at ichips.intel.com Thu Mar 31 16:16:14 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 31 Mar 2005 16:16:14 -0800 Subject: [openib-general] [RMPP] RMPP formatting assumptions In-Reply-To: <42488FDF.2050608@ichips.intel.com> References: <42488FDF.2050608@ichips.intel.com> Message-ID: <424C92CE.7040709@ichips.intel.com> So far, here are my assumptions regarding the formatting of the RMPP MADs. The following fields in the RMPP header are set by the user: Version, Type = DATA, RTime, Flags = ACTIVE, and Status = 0 The RMPP code will set the SegNum and update the Flags, but uses the ACTIVE bit to determine if the user requires RMPP for a given transfer. I could easily have the RMPP code set some of these fields, but thought that the caller might be able to initialize them more efficiently. 
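As a rough illustration of the split described above (the caller fills in Version, Type, RTime, Flags and Status, while the RMPP code owns SegNum), here is a small user-space sketch. Everything named sketch_* or SKETCH_* is hypothetical and invented for this example; the constant values and the packing of RTime and Flags into one byte are assumptions about the RMPP header, not something stated in this thread.

```c
#include <stdint.h>

/* Illustrative constants; assumed values, not taken from the thread. */
#define SKETCH_RMPP_VERSION	1
#define SKETCH_RMPP_TYPE_DATA	1
#define SKETCH_RMPP_FLAG_ACTIVE	0x1

/* Hypothetical header layout covering only the fields named above. */
struct sketch_rmpp_hdr {
	uint8_t  version;
	uint8_t  type;
	uint8_t  rtime_flags;	/* assumed: upper 5 bits RTime, lower 3 bits Flags */
	uint8_t  status;
	uint32_t seg_num;	/* left for the RMPP code to fill in */
	uint32_t paylen;	/* payload length, set separately by the caller */
};

/* Caller-side initialisation: request RMPP by setting the ACTIVE flag. */
static void sketch_rmpp_init(struct sketch_rmpp_hdr *hdr, uint8_t rtime)
{
	hdr->version = SKETCH_RMPP_VERSION;
	hdr->type = SKETCH_RMPP_TYPE_DATA;
	hdr->rtime_flags = (uint8_t)(((rtime & 0x1f) << 3) |
				     SKETCH_RMPP_FLAG_ACTIVE);
	hdr->status = 0;
	hdr->seg_num = 0;
	hdr->paylen = 0;
}

/* What the RMPP code would test to decide whether RMPP is wanted. */
static int sketch_rmpp_requested(const struct sketch_rmpp_hdr *hdr)
{
	return (hdr->rtime_flags & SKETCH_RMPP_FLAG_ACTIVE) != 0;
}
```

A send path built along these lines would call sketch_rmpp_requested() on each outgoing MAD and only take over segmentation when it returns non-zero.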
The WR length of a transfer should equal the size of the MAD header, the RMPP header, class specific header for SA or vendor, plus a data buffer that is evenly divisible by the size of the class' Data field. This requirement is needed to prevent the RMPP code from allocating and copying data segments. The payload field in the RMPP header should be set to the size of the class specific header plus the number of valid bytes of user data in the data buffer. The RMPP code will adjust the payload value to account for multiple headers. Comments? - Sean From roland at topspin.com Thu Mar 31 19:36:12 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 31 Mar 2005 19:36:12 -0800 Subject: [openib-general] [PATCH][2/3] IPoIB: fix static rate calculation In-Reply-To: <20053311936.983q6QLaPvAkIcQj@topspin.com> Message-ID: <20053311936.qOWRURSZd0itPjAn@topspin.com> Correct and simplify calculation of static rate. We need to round up the quotient of (local_rate - path_rate) / path_rate. To round up we add (path_rate - 1) to the numerator, so the quotient simplifies to (local_rate - 1) / path_rate. No idea how I came up with the old formula. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-03-31 19:06:47.984714505 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-03-31 19:26:39.094134171 -0800 @@ -302,11 +302,10 @@ .sl = pathrec->sl, .port_num = priv->port }; + int path_rate = ib_sa_rate_enum_to_int(pathrec->rate); - if (ib_sa_rate_enum_to_int(pathrec->rate) > 0) - av.static_rate = (2 * priv->local_rate - - ib_sa_rate_enum_to_int(pathrec->rate) - 1) / - (priv->local_rate ? 
priv->local_rate : 1); + if (path_rate > 0 && priv->local_rate > path_rate) + av.static_rate = (priv->local_rate - 1) / path_rate; ipoib_dbg(priv, "static_rate %d for local port %dX, path %dX\n", av.static_rate, priv->local_rate, --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-03-31 19:07:01.877698296 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-03-31 19:26:03.861782487 -0800 @@ -258,13 +258,12 @@ .traffic_class = mcast->mcmember.traffic_class } }; + int path_rate = ib_sa_rate_enum_to_int(mcast->mcmember.rate); av.grh.dgid = mcast->mcmember.mgid; - if (ib_sa_rate_enum_to_int(mcast->mcmember.rate) > 0) - av.static_rate = (2 * priv->local_rate - - ib_sa_rate_enum_to_int(mcast->mcmember.rate) - 1) / - (priv->local_rate ? priv->local_rate : 1); + if (path_rate > 0 && priv->local_rate > path_rate) + av.static_rate = (priv->local_rate - 1) / path_rate; ipoib_dbg_mcast(priv, "static_rate %d for local port %dX, mcmember %dX\n", av.static_rate, priv->local_rate, From roland at topspin.com Thu Mar 31 19:36:12 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 31 Mar 2005 19:36:12 -0800 Subject: [openib-general] [PATCH][3/3] IPoIB: convert to debugfs In-Reply-To: <20053311936.qOWRURSZd0itPjAn@topspin.com> Message-ID: <20053311936.XaQmN4N9new7dTCP@topspin.com> Convert IPoIB to use debugfs instead of its own custom debugging filesystem. 
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2005-03-31 19:07:14.463965782 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2005-03-31 19:31:28.624283013 -0800 @@ -32,19 +32,16 @@ * $Id: ipoib_fs.c 1389 2004-12-27 22:56:47Z roland $ */ -#include +#include #include -#include "ipoib.h" +struct file_operations; -enum { - IPOIB_MAGIC = 0x49504942 /* "IPIB" */ -}; +#include + +#include "ipoib.h" -static DECLARE_MUTEX(ipoib_fs_mutex); static struct dentry *ipoib_root; -static struct super_block *ipoib_sb; -static LIST_HEAD(ipoib_device_list); static void *ipoib_mcg_seq_start(struct seq_file *file, loff_t *pos) { @@ -145,143 +142,34 @@ .release = seq_release }; -static struct inode *ipoib_get_inode(void) -{ - struct inode *inode = new_inode(ipoib_sb); - - if (inode) { - inode->i_mode = S_IFREG | S_IRUGO; - inode->i_uid = 0; - inode->i_gid = 0; - inode->i_blksize = PAGE_CACHE_SIZE; - inode->i_blocks = 0; - inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; - inode->i_fop = &ipoib_fops; - } - - return inode; -} - -static int __ipoib_create_debug_file(struct net_device *dev) +int ipoib_create_debug_file(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct dentry *dentry; - struct inode *inode; char name[IFNAMSIZ + sizeof "_mcg"]; snprintf(name, sizeof name, "%s_mcg", dev->name); - dentry = d_alloc_name(ipoib_root, name); - if (!dentry) - return -ENOMEM; - - inode = ipoib_get_inode(); - if (!inode) { - dput(dentry); - return -ENOMEM; - } - - inode->u.generic_ip = dev; - priv->mcg_dentry = dentry; - - d_add(dentry, inode); - - return 0; -} - -int ipoib_create_debug_file(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - - down(&ipoib_fs_mutex); - - list_add_tail(&priv->fs_list, &ipoib_device_list); - - if (!ipoib_sb) { - up(&ipoib_fs_mutex); - return 0; - } - - up(&ipoib_fs_mutex); + priv->mcg_dentry = debugfs_create_file(name, S_IFREG | S_IRUGO, 
+ ipoib_root, dev, &ipoib_fops); - return __ipoib_create_debug_file(dev); + return priv->mcg_dentry ? 0 : -ENOMEM; } void ipoib_delete_debug_file(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - down(&ipoib_fs_mutex); - list_del(&priv->fs_list); - if (!ipoib_sb) { - up(&ipoib_fs_mutex); - return; - } - up(&ipoib_fs_mutex); - - if (priv->mcg_dentry) { - d_drop(priv->mcg_dentry); - simple_unlink(ipoib_root->d_inode, priv->mcg_dentry); - } -} - -static int ipoib_fill_super(struct super_block *sb, void *data, int silent) -{ - static struct tree_descr ipoib_files[] = { - { "" } - }; - struct ipoib_dev_priv *priv; - int ret; - - ret = simple_fill_super(sb, IPOIB_MAGIC, ipoib_files); - if (ret) - return ret; - - ipoib_root = sb->s_root; - - down(&ipoib_fs_mutex); - - ipoib_sb = sb; - - list_for_each_entry(priv, &ipoib_device_list, fs_list) { - ret = __ipoib_create_debug_file(priv->dev); - if (ret) - break; - } - - up(&ipoib_fs_mutex); - - return ret; -} - -static struct super_block *ipoib_get_sb(struct file_system_type *fs_type, - int flags, const char *dev_name, void *data) -{ - return get_sb_single(fs_type, flags, data, ipoib_fill_super); + if (priv->mcg_dentry) + debugfs_remove(priv->mcg_dentry); } -static void ipoib_kill_sb(struct super_block *sb) -{ - down(&ipoib_fs_mutex); - ipoib_sb = NULL; - up(&ipoib_fs_mutex); - - kill_litter_super(sb); -} - -static struct file_system_type ipoib_fs_type = { - .owner = THIS_MODULE, - .name = "ipoib_debugfs", - .get_sb = ipoib_get_sb, - .kill_sb = ipoib_kill_sb, -}; - int ipoib_register_debugfs(void) { - return register_filesystem(&ipoib_fs_type); + ipoib_root = debugfs_create_dir("ipoib", NULL); + return ipoib_root ? 
0 : -ENOMEM; } void ipoib_unregister_debugfs(void) { - unregister_filesystem(&ipoib_fs_type); + debugfs_remove(ipoib_root); } --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-03-31 19:26:39.094134171 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-03-31 19:30:51.117424929 -0800 @@ -1082,19 +1082,19 @@ return 0; -err_fs: - ipoib_unregister_debugfs(); - err_wq: destroy_workqueue(ipoib_workqueue); +err_fs: + ipoib_unregister_debugfs(); + return ret; } static void __exit ipoib_cleanup_module(void) { - ipoib_unregister_debugfs(); ib_unregister_client(&ipoib_client); + ipoib_unregister_debugfs(); destroy_workqueue(ipoib_workqueue); } From roland at topspin.com Thu Mar 31 19:43:26 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 31 Mar 2005 19:43:26 -0800 Subject: [openib-general] Link encap:UNSPEC In-Reply-To: (Ronald G. Minnich's message of "Thu, 31 Mar 2005 15:23:59 -0700 (MST)") References: Message-ID: <52mzsjkvq9.fsf@topspin.com> Ronald> is there a number that this means, i.e. is ifconfig saying Ronald> "I don't know this number so it is UNSPEC" or is it a Ronald> number that means "NaN"? The former. IPoIB reports its (IANA-assigned) link type of 32. Depending on how new a version you have, "ip addr" will either show "link/[32]" or "link/infiniband". - R. From roland at topspin.com Thu Mar 31 19:36:11 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 31 Mar 2005 19:36:11 -0800 Subject: [openib-general] [PATCH][1/3] IPoIB: set skb->mac.raw on receive Message-ID: <20053311936.983q6QLaPvAkIcQj@topspin.com> From: Hal Rosenstock Set skb->mac.raw on receive. This fixes crashes when this is dereferenced, for example by netfilter or when PF_PACKET is used. 
Signed-off-by: Hal Rosenstock Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-03-31 19:07:06.912605203 -0800 +++ linux-export/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-03-31 19:23:30.599053347 -0800 @@ -201,7 +201,7 @@ if (wc->slid != priv->local_lid || wc->src_qp != priv->qp->qp_num) { skb->protocol = ((struct ipoib_header *) skb->data)->proto; - + skb->mac.raw = skb->data; skb_pull(skb, IPOIB_ENCAP_LEN); dev->last_rx = jiffies; From davem at davemloft.net Thu Mar 31 20:18:17 2005 From: davem at davemloft.net (David S. Miller) Date: Thu, 31 Mar 2005 20:18:17 -0800 Subject: [openib-general] Re: [PATCH][1/3] IPoIB: set skb->mac.raw on receive In-Reply-To: <20053311936.983q6QLaPvAkIcQj@topspin.com> References: <20053311936.983q6QLaPvAkIcQj@topspin.com> Message-ID: <20050331201817.64fe1b69.davem@davemloft.net> Roland, netdev at oss.sgi.com CC:'ing either Jeff Garzik and myself, please. From roland at topspin.com Thu Mar 31 19:59:43 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 31 Mar 2005 19:59:43 -0800 Subject: [openib-general] Re: problem with SDP/AIO on mem-free HCA In-Reply-To: <20050331231023.GC6807@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 1 Apr 2005 02:10:23 +0300") References: <52acojmt34.fsf@topspin.com> <20050331231023.GC6807@mellanox.co.il> Message-ID: <52is37kuz4.fsf@topspin.com> Michael> I dont think so. I found this: Michael> C9-88: For an HCA responder using Reliable Connection Michael> service, for each zero-length RDMA READ or WRITE request, Michael> the R_Key shall not be validated, even if the request Michael> includes Immediate data. Michael> Can it be you generate a non-zero RDMA in mthca. It's possible. I'll put some more work into testing this out soon. However it also seems possible that the current mem-free FW may have a bug in handling this. 
SDP is generating the 0-length RDMA by posting an RDMA READ with a
single scatter entry whose length is zero, which may behave
differently from posting an RDMA READ with no scatter entries. I need
to check this out, and also test on Tavor.

 - R.

From roland at topspin.com Thu Mar 31 20:24:10 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 31 Mar 2005 20:24:10 -0800
Subject: [openib-general] Re: [PATCH][1/3] IPoIB: set skb->mac.raw on receive
In-Reply-To: <20050331201817.64fe1b69.davem@davemloft.net> (David S. Miller's message of "Thu, 31 Mar 2005 20:18:17 -0800")
References: <20053311936.983q6QLaPvAkIcQj@topspin.com> <20050331201817.64fe1b69.davem@davemloft.net>
Message-ID: <52ekdvktud.fsf@topspin.com>

    David> Roland, netdev at oss.sgi.com CC:'ing either Jeff Garzik and
    David> myself, please.

Sorry, will do next time around, unless you'd like me to resend this
batch as well. All 3 patches are pretty trivial, though. The biggest
one is just deleting a lot of code by switching to debugfs.

 - R.

From ftillier at infiniconsys.com Thu Mar 31 21:59:02 2005
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Thu, 31 Mar 2005 21:59:02 -0800
Subject: [openib-general] Re: problem with SDP/AIO on mem-free HCA
In-Reply-To: <52is37kuz4.fsf@topspin.com>
Message-ID: <001301c5367f$e86a8a50$1802a8c0@infiniconsys.com>

> -----Original Message-----
> From: Roland Dreier [mailto:roland at topspin.com]
> Sent: Thursday, March 31, 2005 8:00 PM
>
> Michael> I don't think so. I found this:
>
> Michael> C9-88: For an HCA responder using Reliable Connection
> Michael> service, for each zero-length RDMA READ or WRITE request,
> Michael> the R_Key shall not be validated, even if the request
> Michael> includes Immediate data.
>
> Michael> Can it be you generate a non-zero RDMA in mthca?
>
> It's possible. I'll put some more work into testing this out soon.
> However it also seems possible that the current mem-free FW may have a
> bug in handling this.
>
> SDP is generating the 0-length RDMA by posting an RDMA READ with a
> single scatter entry whose length is zero, which may behave
> differently from posting an RDMA READ with no scatter entries. I need
> to check this out, and also test on Tavor.
>

If you are blessed with a Tavor PRM, see section 8.2.1.6 (in PRM 1.0.0).
It states that a length of zero in a data segment indicates a 2GB
transfer (the MSb is used as a flag to distinguish normal from inline
data segments). A zero-byte request must not reference any data
segments.

 - Fab
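
The encoding Fab describes can be sketched in a few lines of C. This is
only an illustration of the rule as stated in his message, not the mthca
driver code and not the PRM layout: the struct names, field order, and the
helpers seg_effective_len() and build_data_segs() are all hypothetical,
and the MSb is simply masked off because its polarity is not given here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and layout, for illustration only. */
#define SEG_FLAG_MSB 0x80000000u  /* MSb: normal vs. inline flag (polarity assumed unknown) */
#define SEG_LEN_MASK 0x7fffffffu  /* low 31 bits: byte count */

struct sge {                      /* scatter entry as a ULP might supply it */
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

struct data_seg {                 /* data segment as written into the work request */
	uint32_t byte_count;
	uint32_t lkey;
	uint64_t addr;
};

/* Bytes a normal data segment describes: a zero count encodes 2GB. */
static uint64_t seg_effective_len(uint32_t byte_count)
{
	uint32_t len = byte_count & SEG_LEN_MASK;

	return len ? (uint64_t)len : (uint64_t)1 << 31;
}

/*
 * Copy scatter entries into data segments, skipping zero-length entries
 * so that a zero-byte request ends up referencing no data segments.
 * Returns the number of segments written to out[].
 */
static size_t build_data_segs(struct data_seg *out,
			      const struct sge *in, size_t n)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		/* Lengths that need the MSb cannot be encoded in this sketch. */
		assert(in[i].length <= SEG_LEN_MASK);
		if (in[i].length == 0)
			continue;
		out[used].byte_count = in[i].length;
		out[used].lkey = in[i].lkey;
		out[used].addr = in[i].addr;
		used++;
	}
	return used;
}
```

Under these assumptions, a request built from a single zero-length scatter
entry yields zero data segments, whereas writing that entry through
unchanged would describe a 2GB transfer to the hardware.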