From jagana at us.ibm.com Mon Aug 1 01:19:18 2005
From: jagana at us.ibm.com (Venkata Jagana)
Date: Mon, 1 Aug 2005 01:19:18 -0700
Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22) summary:OpenRDMA community development discussion
In-Reply-To: <20050729155256.06b37931@localhost.localdomain>
Message-ID:

rdma-developers-admin at lists.sourceforge.net wrote on 07/29/2005 03:52:56 PM:

> On Fri, 29 Jul 2005 15:59:03 -0400 (EDT)
> James Lentini wrote:
>
> >
> >
> > On Fri, 29 Jul 2005, Woodruff, Robert J wrote:
> >
> > > Venkat wrote,
> > >> If anyone attended any one of the summits (netconf or kernel) and
> > >> would be great if they can shed some light on this discussion.
> > >
> > > Roland attended the kernel summit and he was
> > > also at the InfiniBand BOF where we discussed
> > > the possibility of modifying the
> > > IB verbs to also support iWarp as probably the right way to
> > > go. Not sure if this was discussed at the kernel summit or not,
> > > but perhaps Roland can provide some insight on that question.
> >
> > The subject came up at the kernel summit when Jamal Hadi Salim
> > reported on the presentations at Netconf, in particular Stephen
> > Hemminger's, see http://vger.kernel.org/netconf2005.html).
> >
> > It was noted that iWARP vendors are working with the OpenIB community
> > on a common interface. The consensus was that this is the right
> > direction.
> >
>
> The consensus was more of wait and see what comes up. My discussion
> at netconf was more of an informative session to get the participants to
> know that activity is going on and some code may be coming. It was all
> of 5 minutes at the end of a busy day, so it didn't count for much.
>

Thanks Stephen for clarifying this.

Yes, the basic code base in OpenRDMA is currently available for people to
start hacking.
My hope is that within the next few months we'll also have some rnic
driver available so that the basic set of enablement layers within both
user and kernel space can be experimented with.

Thanks
Venkat

From mst at mellanox.co.il Mon Aug 1 01:40:00 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 1 Aug 2005 11:40:00 +0300
Subject: [openib-general] sdp: cancel read with no iocb
Message-ID: <20050801084000.GS14384@mellanox.co.il>

Libor, I'm seeing these messages:

ib_sdp WARN: Cancel read with no IOCB. <2:0:00000005>

It seems that this warning is printed in a legal state where
a deferred iocb is canceled. Shouldn't this sdp_warn be replaced
with sdp_dbg_ctrl?

---

Remove sdp_warn for a legal condition.

Signed-off-by: Michael S. Tsirkin

Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_recv.c
===================================================================
--- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_recv.c
+++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_recv.c
@@ -942,8 +942,8 @@ static int sdp_inet_read_cancel(struct k
 	 * no IOCB found. The cancel is probably in a race with a completion.
 	 * Assume the IOCB will be completed, return appropriate value.
 	 */
-	sdp_warn("Cancel read with no IOCB. <%d:%d:%08lx>",
-		 req->ki_users, req->ki_key, req->ki_flags);
+	sdp_dbg_ctrl("Cancel read with no IOCB. <%d:%d:%08lx>",
+		     req->ki_users, req->ki_key, req->ki_flags);
 	result = -EAGAIN;

-- 
MST

From caitlinb at siliquent.com Mon Aug 1 03:49:17 2005
From: caitlinb at siliquent.com (Caitlin Bestler)
Date: Mon, 1 Aug 2005 03:49:17 -0700
Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22) summary:OpenRDMA community development discussion
Message-ID: <8508251A6FC08A489844A94261D3693A077FF8@fiona.siliquent.com>

> -----Original Message-----
> From: rdma-developers-admin at lists.sourceforge.net
> [mailto:rdma-developers-admin at lists.sourceforge.net] On
> Behalf Of Yaron Haviv
> Sent: Sunday, July 31, 2005 8:57 PM
> To: Christoph Hellwig; Tom Duffy
> Cc: Venkata Jagana; rdma-developers at lists.sourceforge.net;
> openib-general at openib.org
> Subject: RE: [openib-general] Re: [Rdma-developers] Meeting
> (07/22) summary:OpenRDMA community development discussion
>
> > -----Original Message-----
> > From: openib-general-bounces at openib.org [mailto:openib-general-
> > bounces at openib.org] On Behalf Of Christoph Hellwig
> > Sent: Friday, July 29, 2005 8:02 AM
> > To: Tom Duffy
> > Cc: Venkata Jagana; rdma-developers at lists.sourceforge.net; Christoph
> > Hellwig; openib-general at openib.org
> > Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22)
> > summary:OpenRDMA community development discussion
> >
> > On Thu, Jul 28, 2005 at 02:02:08PM -0700, Tom Duffy wrote:
> > > At OLS (and in previous forums), the kernel maintainers have made it
> > > *very* clear that there should only be one API.
> >
> > _and_ that this api is neither RNIC-PI or KDAPL. In fact for anything
> > that doesn't look very similar to the current IB midlayer you'd need
> > very convincing arguments.
> >
>
> I assume it is not as simplistic as that: the iWarp CM model is
> quite different than IB, and iWarp doesn't have SA/SM and a
> bunch of other IB specific things
>
> For example:
> The correct common abstraction is one where a user can issue
> a connection by using a logical end-point address (such as an
> IP), and doesn't have to deal with the IB or iWarp specific
> CM state machine or SA/SM.
>
> If you look at DAPL you can break it to simple Verbs (e.g.
> send, ..) where it's just a simple overlay on top of the verbs
> (and may be redundant)
> However there is a second part that implements a simple
> connection establishment model (much like BSD) that can be
> mapped to both IB (CM, SA, ..) or iWarp (TCP Syn/SynAck, ARP,
> etc'), this serves a couple of main purposes:
> a. make it simple for ULP developer and put the complex part
>    in a common place
> b. define a common model for different HW
>
> we can spend time and discuss theories and intentions, at the
> end of the day an iWarp RNIC cannot just reside under
> IB-Verbs without major changes to the overall infrastructure.
> Several guys spent some time looking it over and came with an
> abstraction that IS possible on top of IB & iWarp & foo, that
> is called DAPL (or IT as another similar alternative)
>
> It would probably be wise to try and merge that effort with
> IB-verbs etc' (e.g. make the verbs portion of the API closer),
> and on the same time preserve the effort that was done in kDAPL
> to overcome the differences (e.g. in the CM, addressing portions)
>

The working theory is that two additional connection management
'verbs' will be proposed, both optional.

The first is a straight-forward mapping of the traditional DAT
connection establishment: i.e., do whatever you have to do in
order to listen, request a connection, accept/reject connection requests.
This is an interface that can map to existing iWARP implementations
very easily while not immediately standardizing integration with the
host TCP stack. Connection Requests are reported in this model, but
via callbacks rather than EVDs, since it is desired to keep a verb
layer interface more spartan than kDAPL. However, the fact that
connection requests must be approved allows blocking of all requests
that conflict with host policy (as expressed in packet filtering rules).

It is also an interface that IB vendors *could* support, because
DAPL is implemented over IB. We could consider having a default
implementation of the methods from the DAPL code for that purpose.

After that we need a standardized way to implement "modify qp
to RTS" in an iWARP-centric fashion. This will require agreement
on how to transfer the TCP state from the host stack to the RNIC.
This is the more flexible model that matches RDMAC verbs and
supports all connection establishment models, including iSER.
It is inherently iWARP specific, however, since IB connections
do not start their life as TCP connections.

The benefit of this second additional API is that it preserves
*all* host stack safeguards on connection establishment because
the host stack actually establishes the TCP connection. It does
require agreement on the correct way to extract the TCP connection
from the host stack, however. This is something that existing
deployments already do, but they aren't doing it in mainline code.
Reaching consensus on a sustainable interface for doing this is
certainly worthy of careful consideration, which is why the first
additional DAT-style API will be proposed.

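A minimal sketch of the first, DAT-style interface described above,
assuming hypothetical names throughout: cm_listen, cm_deliver_request,
cm_req_handler and the structs are illustrative only and are not taken
from OpenIB, kDAPL or RNIC-PI. It shows a listener registering a
callback that approves or rejects each connection request, which is
the point where a host policy check could sit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types: not taken from OpenIB, kDAPL or RNIC-PI. */
enum cm_verdict { CM_REJECT = 0, CM_ACCEPT = 1 };

struct cm_conn_req {
	uint32_t src_addr;	/* requester address, host byte order */
	uint16_t dst_port;	/* port the request is aimed at */
};

/* Callback run once per incoming request (in place of a kDAPL EVD). */
typedef enum cm_verdict (*cm_req_handler)(const struct cm_conn_req *req,
					  void *context);

struct cm_listener {
	uint16_t port;
	cm_req_handler handler;
	void *context;
};

/* "Listen": record the port and the callback that approves requests. */
static void cm_listen(struct cm_listener *l, uint16_t port,
		      cm_req_handler handler, void *context)
{
	assert(l != NULL && handler != NULL);
	l->port = port;
	l->handler = handler;
	l->context = context;
}

/*
 * Stand-in for the provider handing a request to the listener.
 * Returns 1 if the connection would be established, 0 if rejected.
 */
static int cm_deliver_request(const struct cm_listener *l,
			      const struct cm_conn_req *req)
{
	if (req->dst_port != l->port)
		return 0;	/* nobody is listening on that port */
	return l->handler(req, l->context) == CM_ACCEPT;
}

/* Example policy: accept only requesters whose top address byte matches. */
static enum cm_verdict allow_one_net(const struct cm_conn_req *req,
				     void *context)
{
	uint32_t allowed_top_byte = *(const uint32_t *)context;

	return (req->src_addr >> 24) == allowed_top_byte ? CM_ACCEPT
							  : CM_REJECT;
}
```

Because every request passes through the handler before anything is
established, a rejecting handler plays the role of the "must be
approved" step; how a real provider would consult packet filtering
rules is left open here.
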
From Thomas.Talpey at netapp.com Mon Aug 1 04:00:48 2005
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Mon, 01 Aug 2005 07:00:48 -0400
Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22) summary:OpenRDMA community development discussion
In-Reply-To: <8508251A6FC08A489844A94261D3693A077FF8@fiona.siliquent.com>
References: <8508251A6FC08A489844A94261D3693A077FF8@fiona.siliquent.com>
Message-ID: <6.2.3.4.2.20050801065241.04c6d8c0@exnane01.nane.netapp.com>

At 06:49 AM 8/1/2005, Caitlin Bestler wrote:
>After that we need a standardized way to implement "modify qp
>to RTS" in an iWARP-centric fashion.

I will add that RPC/RDMA (NFS-RDMA) does not currently require this
functionality, it begins all connections in RDMA mode and the current
DAPL semantics are fine for now. This functionality would be of great
importance to iSER however, of course.

In the future, NFSv4/sessions has an exchange which can make
use of the iWARP step-up mode, but it is not required and we have
deferred implementing it in the session establishment so far.

Tom.

From caitlinb at siliquent.com Mon Aug 1 05:27:17 2005
From: caitlinb at siliquent.com (Caitlin Bestler)
Date: Mon, 1 Aug 2005 05:27:17 -0700
Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22) summary:OpenRDMA community development discussion
Message-ID: <8508251A6FC08A489844A94261D3693A077FFE@fiona.siliquent.com>

> -----Original Message-----
> From: rdma-developers-admin at lists.sourceforge.net
> [mailto:rdma-developers-admin at lists.sourceforge.net] On
> Behalf Of Talpey, Thomas
> Sent: Monday, August 01, 2005 4:01 AM
> To: openib-general at openib.org
> Cc: rdma-developers at lists.sourceforge.net
> Subject: RE: [openib-general] Re: [Rdma-developers] Meeting
> (07/22) summary:OpenRDMA community development discussion
>
> At 06:49 AM 8/1/2005, Caitlin Bestler wrote:
> >After that we need a standardized way to implement "modify qp
> >to RTS" in an iWARP-centric fashion.
>
> I will add that RPC/RDMA (NFS-RDMA) does not currently
> require this functionality, it begins all connections in RDMA
> mode and the current DAPL semantics are fine for now. This
> functionality would be of great importance to iSER however, of course.
>
> In the future, NFSv4/sessions has an exchange which can
> make use of the iWARP step-up mode, but it is not required
> and we have deferred implementing it in the session
> establishment so far.
>

Which points out the other rationale, which I forgot to mention:
many applications require only the DAPL-mode connection setup,
and enabling them immediately makes sense. It also avoids rushing
the TCP connection transfer discussion since most applications
will be operational.

From rolandd at cisco.com Mon Aug 1 06:12:24 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 01 Aug 2005 06:12:24 -0700
Subject: [openib-general] kernel VM monitor for memory registration caching
References: <20050729174225.GB2683@osc.edu> <20050731103157.GT13014@minantech.com>
Message-ID: <52ack1kdhj.fsf@cisco.com>

    Gleb> First of all, you have one user_delta per mm that user can
    Gleb> poll from userspace. Is it possible to make user_delta to be
    Gleb> part of dreg_region instead of dreg_context and module will
    Gleb> set it whenever registration becomes invalid. Field
    Gleb> 'invalid' will be added to buf_info structure and pointer to
    Gleb> it will be passed to kernel at registration time. This way
    Gleb> the userspace can look up cache and check if registration is
    Gleb> still valid. No need to rescan cache from userspace, we
    Gleb> already scanned it once from kernel after all. With your
    Gleb> current approach userspace will need to search for mr_handle
    Gleb> in the cache and invalidate the entry that holds it.

That idea sounds familiar ;)

    Gleb> You change vma_ops in vma to catch open/close events. What
    Gleb> about nopage() method in vma_ops? We have to forward it to
    Gleb> original vma_ops?

Can nopage ever affect registered memory?
The registration process makes sure the pages are resident and pinned,
so I don't see how a region's mapping could ever be changed by a call
to a vma's nopage method.

 - R.

From mst at mellanox.co.il Mon Aug 1 06:14:45 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 1 Aug 2005 16:14:45 +0300
Subject: [openib-general] [PATCH updated] sdp: cancel read with no iocb
In-Reply-To: <20050801084000.GS14384@mellanox.co.il>
References: <20050801084000.GS14384@mellanox.co.il>
Message-ID: <20050801131445.GW14384@mellanox.co.il>

Quoting r. Michael S. Tsirkin :
> Subject: sdp: cancel read with no iocb
>
> Libor, I'm seeing these messages:
>
> ib_sdp WARN: Cancel read with no IOCB. <2:0:00000005>
>
> It seems that this warning is printed in a legal state where
> a deferred iocb is canceled. Shouldn't this sdp_warn be replaced
> with sdp_dbg_ctrl?

Ugh, that patch broke compilation with debug on.
Here's a better one.

---

Remove sdp_warn for a legal condition.

Signed-off-by: Michael S. Tsirkin

Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_recv.c
===================================================================
--- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_recv.c
+++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_recv.c
@@ -942,8 +942,8 @@ static int sdp_inet_read_cancel(struct k
 	 * no IOCB found. The cancel is probably in a race with a completion.
 	 * Assume the IOCB will be completed, return appropriate value.
 	 */
-	sdp_warn("Cancel read with no IOCB. <%d:%d:%08lx>",
-		 req->ki_users, req->ki_key, req->ki_flags);
+	sdp_dbg_ctrl(NULL, "Cancel read with no IOCB. <%d:%d:%08lx>",
+		     req->ki_users, req->ki_key, req->ki_flags);
 	result = -EAGAIN;

-- 
MST

From glebn at voltaire.com Mon Aug 1 06:20:56 2005
From: glebn at voltaire.com (Gleb Natapov)
Date: Mon, 1 Aug 2005 16:20:56 +0300
Subject: [openib-general] kernel VM monitor for memory registration caching
In-Reply-To: <52ack1kdhj.fsf@cisco.com>
References: <20050729174225.GB2683@osc.edu> <20050731103157.GT13014@minantech.com> <52ack1kdhj.fsf@cisco.com>
Message-ID: <20050801132056.GB20243@minantech.com>

On Mon, Aug 01, 2005 at 06:12:24AM -0700, Roland Dreier wrote:
> Gleb> First of all, you have one user_delta per mm that user can
> Gleb> poll from userspace. Is it possible to make user_delta to be
> Gleb> part of dreg_region instead of dreg_context and module will
> Gleb> set it whenever registration becomes invalid. Field
> Gleb> 'invalid' will be added to buf_info structure and pointer to
> Gleb> it will be passed to kernel at registration time. This way
> Gleb> the userspace can look up cache and check if registration is
> Gleb> still valid. No need to rescan cache from userspace, we
> Gleb> already scanned it once from kernel after all. With your
> Gleb> current approach userspace will need to search for mr_handle
> Gleb> in the cache and invalidate the entry that holds it.
>
> That idea sounds familiar ;)
Why? Is that what you had in mind?

>
> Gleb> You change vma_ops in vma to catch open/close events. What
> Gleb> about nopage() method in vma_ops? We have to forward it to
> Gleb> original vma_ops?
>
> Can nopage ever affect registered memory? The registration process
> makes sure the pages are resident and pinned, so I don't see how a
> region's mapping could ever be changed by a call to a vma's nopage method.
>
This was right for vapi modules since after mlock() vma boundaries
were aligned with registration boundaries, but this is not the case for
OpenIB (correct me if I am wrong here). Part of the vma can be pinned and
thus be in memory but other parts may be swapped out.

--
			Gleb.

From mst at mellanox.co.il Mon Aug 1 06:26:16 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 1 Aug 2005 16:26:16 +0300
Subject: [openib-general] SDP_IOCB_SIZE_MAX
Message-ID: <20050801132616.GX14384@mellanox.co.il>

Libor, in sdp_iocb.h we have:

sdp_iocb.h:#define SDP_IOCB_SIZE_MAX (128*1024) /* matches AIO max kvec size. */

What does the comment mean?
I seem to have no trouble passing requests bigger than 128K to SDP:

./ttcp.aio.x -r -l 1000000 -a 20
./ttcp.aio.x -t -l 1000000 -n 1000000 -a 20 11.4.8.155

gets:
# dmesg
ib_sdp ERR: IOCB <0> cancel <0> flag <0140> size <1000000:0:1000000>
ib_sdp ERR: IOCB <0> cancel <0> flag <0140> size <1000000:0:1000000>
ib_sdp ERR: IOCB <0> cancel <0> flag <0140> size <1000000:0:1000000>

-- 
MST

From halr at voltaire.com Mon Aug 1 06:28:53 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Mon, 1 Aug 2005 16:28:53 +0300
Subject: [openib-general] kdapl build error on ia64
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com>

Hi John,

My normal email sending is not working right now so I am using an alternate method.
Hope the formatting comes through OK.

On Fri, 2005-07-29 at 17:46, John Partridge wrote:
> With this fix the ia64 modules all build to completion with just a
> couple of warnings :-
> CC [M] drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.o
> drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c: In function `dapl_path_comp_handler':
> drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c:346: warning: long long int format, u64 arg (arg 3)
> drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c: In function `dapl_rt_comp_handler':
> drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c:416: warning: long long int format, u64 arg (arg 3)

Can you try this patch and see if it removes the warnings ?

Thanks.

-- Hal

Index: dapl_openib_cm.c
===================================================================
--- dapl_openib_cm.c	(revision 2935)
+++ dapl_openib_cm.c	(working copy)
@@ -341,9 +341,15 @@ static void dapl_path_comp_handler(u64 r
 				      &cm_ctx->dapl_path, 1,
 				      &cm_ctx->dapl_comp);
 	if (status) {
+#if defined(__ia64__)
+		printk(KERN_ERR "dapl_path_comp_handler: "
+		       "ib_at_paths_by_route returned %d id %ld\n",
+		       status, cm_ctx->dapl_comp.req_id);
+#else
 		printk(KERN_ERR "dapl_path_comp_handler: "
 		       "ib_at_paths_by_route returned %d id %lld\n",
 		       status, cm_ctx->dapl_comp.req_id);
+#endif
 		event = DAT_CONNECTION_EVENT_BROKEN;
 		goto error;
 	}
@@ -412,8 +418,13 @@ static void dapl_rt_comp_handler(u64 req
 	status = ib_at_paths_by_route(&cm_ctx->dapl_rt, 0,
 				      &cm_ctx->dapl_path, 1, &cm_ctx->dapl_comp);
 	if (status) {
+#if defined(__ia64__)
+		printk(KERN_ERR "dapl_rt_comp_handler: ib_at_paths_by_route "
+		       "returned %d id %ld\n", status, cm_ctx->dapl_comp.req_id);
+#else
 		printk(KERN_ERR "dapl_rt_comp_handler: ib_at_paths_by_route "
 		       "returned %d id %lld\n", status, cm_ctx->dapl_comp.req_id);
+#endif
 		event = DAT_CONNECTION_EVENT_BROKEN;
 		goto error;
 	}

From shubbell at dbresearch.net Mon Aug 1 06:43:04 2005
From: shubbell at dbresearch.net (Sean Hubbell)
Date: Mon, 01 Aug 2005 09:43:04 -0400
Subject: [openib-general] IB Port State Question
Message-ID: <42EE26E8.1090702@dbresearch.net>

Hello,

I am new to using infiniband and have a (probably) easy question. I
have my head node and I checked the status of my infiniband port state
and the state is ACTIVE. I do this for the rest of my nodes (which is 3
other nodes) and each port state is INIT. How does one activate the port?

Thanks,

Sean

Running cAos 2.0 on multiple Xeon 64 Bit Machines.

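On the kdapl warnings discussed above ("long long int format, u64
arg"): one common way to print a 64-bit value with the same format
string on every architecture is to cast it to unsigned long long and
use %llu, which avoids a per-architecture #if. A minimal userspace
sketch follows, assuming uint64_t and snprintf stand in for the
kernel's u64 and printk; format_req_id is a hypothetical helper and is
not part of the kdapl code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: formats a status and a 64-bit request id.
 * The cast makes the argument match %llu whether the 64-bit type is
 * "unsigned long" or "unsigned long long" on the target.
 */
static int format_req_id(char *buf, size_t len, int status, uint64_t req_id)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "ib_at_paths_by_route returned %d id %llu",
			status, (unsigned long long)req_id);
}
```

The same cast applied at the two printk call sites would let a single
format string serve both branches of the #if in the patch above; that
is a suggestion only, not something tested against this tree.
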
From rolandd at cisco.com Mon Aug 1 07:43:33 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 01 Aug 2005 07:43:33 -0700
Subject: [openib-general] kernel VM monitor for memory registration caching
References: <20050729174225.GB2683@osc.edu> <20050731103157.GT13014@minantech.com> <52ack1kdhj.fsf@cisco.com> <20050801132056.GB20243@minantech.com>
Message-ID: <52pssxiup6.fsf@cisco.com>

    Gleb> Why? Is that what you had in mind?

Yes, if I'm understanding your suggestion correctly, it is exactly
what I suggested in http://article.gmane.org/gmane.linux.drivers.openib/13223

    Gleb> This was right for vapi modules since after mlock() vma
    Gleb> boundaries were aligned with registration boundaries, but
    Gleb> this is not the case for OpenIB (correct me if I am wrong
    Gleb> here). Part of the vma can be pinned and thus be in memory
    Gleb> but other parts may be swapped out.

But the pages that aren't pinned are not part of any registered
region, right? So there's no reason to invalidate any registration
because of nopage activity.

 - R.

From glebn at voltaire.com Mon Aug 1 07:50:27 2005
From: glebn at voltaire.com (Gleb Natapov)
Date: Mon, 1 Aug 2005 17:50:27 +0300
Subject: [openib-general] kernel VM monitor for memory registration caching
In-Reply-To: <52pssxiup6.fsf@cisco.com>
References: <20050729174225.GB2683@osc.edu> <20050731103157.GT13014@minantech.com> <52ack1kdhj.fsf@cisco.com> <20050801132056.GB20243@minantech.com> <52pssxiup6.fsf@cisco.com>
Message-ID: <20050801145027.GA31358@minantech.com>

On Mon, Aug 01, 2005 at 07:43:33AM -0700, Roland Dreier wrote:
> Gleb> Why? Is that what you had in mind?
>
> Yes, if I'm understanding your suggestion correctly, it is exactly
> what I suggested in http://article.gmane.org/gmane.linux.drivers.openib/13223
>
Yes, they are indeed the same :)

> Gleb> This was right for vapi modules since after mlock() vma
> Gleb> boundaries were aligned with registration boundaries, but
> Gleb> this is not the case for OpenIB (correct me if I am wrong
> Gleb> here). Part of the vma can be pinned and thus be in memory
> Gleb> but other parts may be swapped out.
>
> But the pages that aren't pinned are not part of any registered
> region, right? So there's no reason to invalidate any registration
> because of nopage activity.
Correct. We invalidate nothing in the nopage handler, but if a pagefault
happens in a part of the vma that is not registered we should call the
original nopage() handler for the nonregistered page.

--
			Gleb.

From mulix at mulix.org Mon Aug 1 07:56:22 2005
From: mulix at mulix.org (Muli Ben-Yehuda)
Date: Mon, 1 Aug 2005 17:56:22 +0300
Subject: [openib-general] kernel VM monitor for memory registration caching
In-Reply-To: <20050801145027.GA31358@minantech.com>
References: <20050729174225.GB2683@osc.edu> <20050731103157.GT13014@minantech.com> <52ack1kdhj.fsf@cisco.com> <20050801132056.GB20243@minantech.com> <52pssxiup6.fsf@cisco.com> <20050801145027.GA31358@minantech.com>
Message-ID: <20050801145622.GN28329@granada.merseine.nu>

On Mon, Aug 01, 2005 at 05:50:27PM +0300, Gleb Natapov wrote:

> Correct. We invalidate nothing in nopage handler, but if pagefault
> happens in part of vma that is not registered we should call original
> nopage() handler for the nonregistered page.

Why not split the VMA if we're registering only a part of it? That
way a VMA is either registered or not registered (up to a page size
granularity).

Cheers,
Muli
-- 
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/

From glebn at voltaire.com Mon Aug 1 08:04:27 2005
From: glebn at voltaire.com (Gleb Natapov)
Date: Mon, 1 Aug 2005 18:04:27 +0300
Subject: [openib-general] kernel VM monitor for memory registration caching
In-Reply-To: <20050801145622.GN28329@granada.merseine.nu>
References: <20050729174225.GB2683@osc.edu> <20050731103157.GT13014@minantech.com> <52ack1kdhj.fsf@cisco.com> <20050801132056.GB20243@minantech.com> <52pssxiup6.fsf@cisco.com> <20050801145027.GA31358@minantech.com> <20050801145622.GN28329@granada.merseine.nu>
Message-ID: <20050801150427.GD20243@minantech.com>

On Mon, Aug 01, 2005 at 05:56:22PM +0300, Muli Ben-Yehuda wrote:
> On Mon, Aug 01, 2005 at 05:50:27PM +0300, Gleb Natapov wrote:
>
> > Correct. We invalidate nothing in nopage handler, but if pagefault
> > happens in part of vma that is not registered we should call original
> > nopage() handler for the nonregistered page.
>
> Why not split the VMA if we're registering only a part of it? That
> way a VMA is either registered or not registered (up to a page size
> granularity).
>
I don't like the idea of splitting VMAs if we can manage without it. You'll
end up having too many of them.

--
			Gleb.

From rolandd at cisco.com Mon Aug 1 08:13:34 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 01 Aug 2005 08:13:34 -0700
Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22) summary:OpenRDMA community development discussion
References: <35EA21F54A45CB47B879F21A91F4862F6C59C9@taurus.voltaire.com>
Message-ID: <52k6j5itb5.fsf@cisco.com>

    Yaron> It would probably be wise to try and merge that effort with
    Yaron> IB-verbs etc' (e.g. make the verbs portion of the API
    Yaron> closer), and on the same time preserve the effort that was
    Yaron> done in kDAPL to overcome the differences (e.g. in the CM,
    Yaron> addressing portions)

This doesn't seem like the right approach to me but we'll be happy to
review your patches.

 - R.

From halr at voltaire.com Mon Aug 1 08:24:45 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Mon, 1 Aug 2005 18:24:45 +0300
Subject: [openib-general] IB Port State Question
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B78@taurus.voltaire.com>

Hi Sean,

An SM (either OpenSM or an embedded SM) needs to move the ports from INIT to ACTIVE.
If you have more than 2 endports (nodes), you will need an IB switch.

-- Hal

-----Original Message-----
From: openib-general-bounces at openib.org on behalf of Sean Hubbell
Sent: Mon 8/1/2005 9:43 AM
To: openib-general at openib.org
Subject: [openib-general] IB Port State Question

Hello,

I am new to using infiniband and have a (probably) easy question. I
have my head node and I checked the status of my infiniband port state
and the state is ACTIVE. I do this for the rest of my node (which is 3
other nodes) and each port state is INIT. How does one activate the port?

Thanks,

Sean

Running cAos 2.0 on multiple Xeon 64 Bit Machines.

From Ouissem.Benfredj at int-evry.fr Mon Aug 1 08:45:20 2005
From: Ouissem.Benfredj at int-evry.fr (Ouissem BEN FREDJ)
Date: Mon, 01 Aug 2005 17:45:20 +0200
Subject: [openib-general] link availability
Message-ID: <42EE4390.9020909@int-evry.fr>

Hello,

Is it possible to link directly two Infiniband cards type InfiniCom
InfiniServ 7000 (MT23108) , without using any interconnect switch ?
In that case, could you tell me how.

Thank you,
Ouissem Ben Fredj

From halr at voltaire.com Mon Aug 1 08:42:15 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Mon, 1 Aug 2005 18:42:15 +0300
Subject: [openib-general] link availability
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B7C@taurus.voltaire.com>

Hi,

Yes, you just need to cable them together and run an SM on one of the nodes.

-- Hal

-----Original Message-----
From: openib-general-bounces at openib.org on behalf of Ouissem BEN FREDJ
Sent: Mon 8/1/2005 11:45 AM
To: openib-general at openib.org
Subject: [openib-general] link availability

Hello,

Is it possible to link directly two Infiniband cards type InfiniCom
InfiniServ 7000 (MT23108) , without using any interconnect switch ?
In that case, could you tell me how.

Thank you,
Ouissem Ben Fredj

From yaronh at voltaire.com Mon Aug 1 08:43:20 2005
From: yaronh at voltaire.com (Yaron Haviv)
Date: Mon, 1 Aug 2005 18:43:20 +0300
Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22) summary:OpenRDMA community development discussion
Message-ID: <35EA21F54A45CB47B879F21A91F4862F6C5AC3@taurus.voltaire.com>

> -----Original Message-----
> From: Roland Dreier [mailto:rolandd at cisco.com]
> Sent: Monday, August 01, 2005 11:14 AM
> To: Yaron Haviv
> Cc: Christoph Hellwig; Tom Duffy; Venkata Jagana; rdma-
> developers at lists.sourceforge.net; openib-general at openib.org
> Subject: Re: [openib-general] Re: [Rdma-developers] Meeting (07/22)
> summary:OpenRDMA community development discussion
>
> Yaron> It would probably be wise to try and merge that effort with
> Yaron> IB-verbs etc' (e.g. make the verbs portion of the API
> Yaron> closer), and on the same time preserve the effort that was
> Yaron> done in kDAPL to overcome the differences (e.g. in the CM,
> Yaron> addressing portions)
>
> This doesn't seem like the right approach to me but we'll be happy to
> review your patches.

So how would you reconcile the differences between IB & iWarp, and
specifically on the connection establishment portion ?
In your approach would I need to access different CM APIs for IB & for
iWarp in my ULP ?

From my perspective the current kDAPL solves that problem (w/o any
additional patches), and we are trying to re-invent the wheel here.
If patches are really needed they can probably be applied to the kDAPL code
(i.e. remove redundant code/simplify kDAPL), however this is an
optimization that can always be done later.

Yaron

From eitan at mellanox.co.il Mon Aug 1 08:46:24 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 1 Aug 2005 18:46:24 +0300
Subject: [openib-general] link availability
Message-ID: <506C3D7B14CDD411A52C00025558DED607C3058C@mtlex01.yok.mtl.com>

If you would like to connect two InfiniBand HCA cards, all you need to do is:
1. attach an IB cable between them
2. have an IB stack running on each one of the machines (might be any stack you want)
3. have an SM run on one of the machines.

My proposal is that, depending on the kernel you have installed on the
machines, you install the OpenIB stack (if 2.6.11 or 2.6.12) or Mellanox IBGD
or others (for the rest), and then start OpenSM.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Ouissem BEN FREDJ [mailto:Ouissem.Benfredj at int-evry.fr]
> Sent: Monday, August 01, 2005 6:45 PM
> To: openib-general at openib.org
> Subject: [openib-general] link availability
>
> Hello,
>
> Is it possible to link directly two Infiniband cards type InfiniCom
> InfiniServ 7000 (MT23108) , without using any interconnect switch ?
> In that case, could you tell me how.
>
> Thank you,
> Ouissem Ben Fredj

From mulix at mulix.org Mon Aug 1 08:47:41 2005
From: mulix at mulix.org (Muli Ben-Yehuda)
Date: Mon, 1 Aug 2005 18:47:41 +0300
Subject: [openib-general] kernel VM monitor for memory registration caching
In-Reply-To: <20050801150427.GD20243@minantech.com>
References: <20050729174225.GB2683@osc.edu> <20050731103157.GT13014@minantech.com> <52ack1kdhj.fsf@cisco.com> <20050801132056.GB20243@minantech.com> <52pssxiup6.fsf@cisco.com> <20050801145027.GA31358@minantech.com> <20050801145622.GN28329@granada.merseine.nu> <20050801150427.GD20243@minantech.com>
Message-ID: <20050801154741.GA30781@granada.merseine.nu>

On Mon, Aug 01, 2005 at 06:04:27PM +0300, Gleb Natapov wrote:

> I don't like the idea of splitting VMAs if we can manage without it. You'll
> end up having too many of them.

... or you end up duplicating the VMA functionality on a sub-vma
level.

Consider: You have a single vma with some number of pages, where some
random pages are registered and some aren't. You need to be able to
check for a given page whether it's registered or not, so you build
some data structure based on the start and end address (or page, or
whatever), so that you can check if a given address is within a
registered page. But this is exactly the sort of thing that the VMA
level is built for: use the red black tree to find the vma for a given
address, then check if the vma is registered.

Linux can already cope with a large number of vma's. I think that a
solution that does not split vma's will end up with either artificial
limitations (can't have more than two different regions within a vma)
or reimplementing vma layer functionality.
Of course, I could be completely wrong :-) Cheers, Muli -- Muli Ben-Yehuda http://www.mulix.org | http://mulix.livejournal.com/ From rolandd at cisco.com Mon Aug 1 09:02:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 01 Aug 2005 09:02:39 -0700 Subject: [openib-general] kernel VM monitor for memory registration caching In-Reply-To: <20050801145027.GA31358@minantech.com> (Gleb Natapov's message of "Mon, 1 Aug 2005 17:50:27 +0300") References: <20050729174225.GB2683@osc.edu> <20050731103157.GT13014@minantech.com> <52ack1kdhj.fsf@cisco.com> <20050801132056.GB20243@minantech.com> <52pssxiup6.fsf@cisco.com> <20050801145027.GA31358@minantech.com> Message-ID: <52ek9dir1c.fsf@cisco.com> Gleb> Correct. We invalidate nothing in nopage handler, but if Gleb> pagefault happens in part of vma that is not registered we Gleb> should call original nopage() handler for the nonregistered Gleb> page. Oh, I see. I thought you were saying something else. Yes, this seems correct and necessary. - R. From hch at lst.de Mon Aug 1 09:14:04 2005 From: hch at lst.de (Christoph Hellwig) Date: Mon, 1 Aug 2005 18:14:04 +0200 Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22) summary:OpenRDMA community development discussion In-Reply-To: <6.2.3.4.2.20050801065241.04c6d8c0@exnane01.nane.netapp.com> References: <8508251A6FC08A489844A94261D3693A077FF8@fiona.siliquent.com> <6.2.3.4.2.20050801065241.04c6d8c0@exnane01.nane.netapp.com> Message-ID: <20050801161404.GA15582@lst.de> On Mon, Aug 01, 2005 at 07:00:48AM -0400, Talpey, Thomas wrote: > At 06:49 AM 8/1/2005, Caitlin Bestler wrote: > >After that we need a standardized way to implement "modify qp > >to RTS" in a iWARP-centric fashion. > > I will add that RPC/RDMA (NFS-RDMA) does not currently require this > functionality, it begins all connections in RDMA mode and the current > DAPL semantics are fine for now. This functionality would be of great > importance to iSER however, of course. 
> > In the future, NFSv4/sessions has an exchange which can make > use of the iWARP step-up mode, but it is not required and we have > deferred implementing it in the session establishment so far. I think taking over an existing TCP connection is a horrible idea and we should avoid it if possible. The state for a TCP connection is very complex and it's doubtful we can make it work 100%. From johnip at sgi.com Mon Aug 1 09:21:07 2005 From: johnip at sgi.com (John Partridge) Date: Mon, 01 Aug 2005 11:21:07 -0500 Subject: [openib-general] kdapl build error on ia64 In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> Message-ID: <42EE4BF3.8080000@sgi.com> Hal, No, the patch did not fix the warnings :- CC [M] drivers/infiniband/ulp/kdapl/ib/dapl_openib_qp.o CC [M] drivers/infiniband/ulp/kdapl/ib/dapl_openib_util.o CC [M] drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.o drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c: In function `dapl_path_comp_handler': drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c:346: warning: long long int format, u64 arg (arg 3) drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c: In function `dapl_rt_comp_handler': drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c:416: warning: long long int format, u64 arg (arg 3) Let me look at it and I will produce a patch that fixes the build error and the warnings OK ? Cheers John Hal Rosenstock wrote: > Hi John, > > My normal email sending is not working right now so I am using an alternate method. > Hope the formatting comes through OK.
> > On Fri, 2005-07-29 at 17:46, John Partridge wrote: > >>With this fix the ia64 modules all build to completion with just a >>couple of warnings :- > > >> CC [M] drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.o >>drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c: In function `dapl_path_comp_handler': >>drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c:346: warning: long long int format, u64 arg (arg 3) >>drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c: In function `dapl_rt_comp_handler': >>drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c:416: warning: long long int format, u64 arg (arg 3) > > > Can you try this patch and see if it removes the warnings ? > > Thanks. > > -- Hal > > Index: dapl_openib_cm.c > =================================================================== > --- dapl_openib_cm.c (revision 2935) > +++ dapl_openib_cm.c (working copy) > @@ -341,9 +341,15 @@ static void dapl_path_comp_handler(u64 r > &cm_ctx->dapl_path, 1, > &cm_ctx->dapl_comp); > if (status) { > +#if defined(__ia64__) > + printk(KERN_ERR "dapl_path_comp_handler: " > + "ib_at_paths_by_route returned %d id %ld\n", > + status, cm_ctx->dapl_comp.req_id); > +#else > printk(KERN_ERR "dapl_path_comp_handler: " > "ib_at_paths_by_route returned %d id %lld\n", > status, cm_ctx->dapl_comp.req_id); > +#endif > event = DAT_CONNECTION_EVENT_BROKEN; > goto error; > } > @@ -412,8 +418,13 @@ static void dapl_rt_comp_handler(u64 req > status = ib_at_paths_by_route(&cm_ctx->dapl_rt, 0, &cm_ctx->dapl_path, 1, > &cm_ctx->dapl_comp); > if (status) { > +#if defined(__ia64__) > + printk(KERN_ERR "dapl_rt_comp_handler: ib_at_paths_by_route " > + "returned %d id %ld\n", status, cm_ctx->dapl_comp.req_id); > +#else > printk(KERN_ERR "dapl_rt_comp_handler: ib_at_paths_by_route " > "returned %d id %lld\n", status, cm_ctx->dapl_comp.req_id); > +#endif > event = DAT_CONNECTION_EVENT_BROKEN; > goto error; > } -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From 
Thomas.Talpey at netapp.com Mon Aug 1 09:28:49 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 01 Aug 2005 12:28:49 -0400 Subject: [openib-general] kdapl build error on ia64 In-Reply-To: <42EE4BF3.8080000@sgi.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> <42EE4BF3.8080000@sgi.com> Message-ID: <6.2.3.4.2.20050801122645.06180480@exnane01.nane.netapp.com> It doesn't need two printk's, the one printk just needs to have %lu as its format (instead of %ld), right? Tom. At 12:21 PM 8/1/2005, John Partridge wrote: >Hal, > >No the patch did not fix the warnings :- > > CC [M] drivers/infiniband/ulp/kdapl/ib/dapl_openib_qp.o > CC [M] drivers/infiniband/ulp/kdapl/ib/dapl_openib_util.o > CC [M] drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.o > drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c: In function >`dapl_path_comp_handler': > drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c:346: warning: long >long int format, u64 arg (arg 3) > drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c: In function >`dapl_rt_comp_handler': > drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c:416: warning: long >long int format, u64 arg (arg 3) > >Let me look at it and I will produce a patch that fixes the build >error and the warnings OK ? > >Cheers >John > > >Hal Rosenstock wrote: >> Hi John, >> >> My normal email sending is not working right now so I am using an >alternate method. >> Hope the formatting comes through OK. 
>-- >John Partridge > >Silicon Graphics Inc
>Tel: 651-683-3428 >Vnet: 233-3428 >E-Mail: johnip at sgi.com >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From Thomas.Talpey at netapp.com Mon Aug 1 09:30:31 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 01 Aug 2005 12:30:31 -0400 Subject: [openib-general] kdapl build error on ia64 In-Reply-To: <6.2.3.4.2.20050801122645.06180480@exnane01.nane.netapp.com > References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> <42EE4BF3.8080000@sgi.com> <6.2.3.4.2.20050801122645.06180480@exnane01.nane.netapp.com> Message-ID: <6.2.3.4.2.20050801122956.04c11850@exnane01.nane.netapp.com> At 12:28 PM 8/1/2005, Talpey, Thomas wrote: >It doesn't need two printk's, the one printk just needs >to have %lu as its format (instead of %ld), right? I mean %llu. (long long) > >Tom. > > >At 12:21 PM 8/1/2005, John Partridge wrote: >>Hal, >> >>No the patch did not fix the warnings :- >> >> CC [M] drivers/infiniband/ulp/kdapl/ib/dapl_openib_qp.o >> CC [M] drivers/infiniband/ulp/kdapl/ib/dapl_openib_util.o >> CC [M] drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.o >> drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c: In function >>`dapl_path_comp_handler': >> drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c:346: warning: long >>long int format, u64 arg (arg 3) >> drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c: In function >>`dapl_rt_comp_handler': >> drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c:416: warning: long >>long int format, u64 arg (arg 3) >> >>Let me look at it and I will produce a patch that fixes the build >>error and the warnings OK ? >> >>Cheers >>John >> >> >>Hal Rosenstock wrote: >>> Hi John, >>> >>> My normal email sending is not working right now so I am using an >>alternate method. >>> Hope the formatting comes through OK. 
>>-- >>John Partridge >> >>Silicon Graphics Inc >>Tel: 651-683-3428 >>Vnet: 233-3428 >>E-Mail: johnip at sgi.com From mst at mellanox.co.il Mon Aug 1 11:43:27 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 1 Aug 2005 21:43:27 +0300 Subject: [openib-general] Re: kdapl build error on ia64 In-Reply-To: <42EE4BF3.8080000@sgi.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> <42EE4BF3.8080000@sgi.com> Message-ID: <20050801184326.GC23693@mellanox.co.il> Quoting r. John Partridge : > >+#if defined(__ia64__) > >+ printk(KERN_ERR "dapl_path_comp_handler: " > >+ "ib_at_paths_by_route returned %d id %ld\n", > >+ status, cm_ctx->dapl_comp.req_id); > >+#else > > printk(KERN_ERR "dapl_path_comp_handler: " > > "ib_at_paths_by_route returned %d id > > %lld\n", > > status, cm_ctx->dapl_comp.req_id); > >+#endif Why dont you just cast req_id to long long and be done with it?
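The two suggestions in this thread (use %llu, and cast the argument) combine into a single format string that needs no #ifdef. The following is a minimal userspace sketch, assuming the warning arises because u64 is a plain unsigned long on ia64 but unsigned long long on ia32; `format_req_id` is a hypothetical helper and `uint64_t` stands in for the kernel's u64. It is not the patch that was eventually committed.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format a 64-bit request id identically on every architecture.
 * format_req_id is a hypothetical helper for illustration only;
 * uint64_t stands in for the kernel's u64.
 */
static int format_req_id(char *buf, size_t len, int status, uint64_t req_id)
{
    /*
     * The explicit cast makes the argument type match %llu no matter
     * whether the 64-bit typedef resolves to "unsigned long" or to
     * "unsigned long long" on the target, so one call serves both.
     */
    return snprintf(buf, len, "returned %d id %llu",
                    status, (unsigned long long)req_id);
}
```

The same cast applies unchanged to a printk() call, since printk accepts the standard length modifiers for integer conversions.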
-- MST From mshefty at ichips.intel.com Mon Aug 1 09:45:46 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 01 Aug 2005 09:45:46 -0700 Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22) summary:OpenRDMA community development discussion In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F6C59C9@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F6C59C9@taurus.voltaire.com> Message-ID: <42EE51BA.7000404@ichips.intel.com> Yaron Haviv wrote: > we can spend time and discuss theories and intentions, at the end of the > day an iWarp RNIC cannot just reside under IB-Verbs without major > changes to the overall infrastructure. I don't disagree with having a common connection library that supports both IB and iWarp, or that you could derive a solution from kDAPL. But based on the proposed APIs that I've seen, I believe that an RNIC could reside under IB verbs with minimal changes, and would likely be the best engineered solution for including RNIC support in Linux. - Sean From johnip at sgi.com Mon Aug 1 09:58:35 2005 From: johnip at sgi.com (John Partridge) Date: Mon, 01 Aug 2005 11:58:35 -0500 Subject: [openib-general] kdapl build error on ia64 In-Reply-To: <6.2.3.4.2.20050801122645.06180480@exnane01.nane.netapp.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> <42EE4BF3.8080000@sgi.com> <6.2.3.4.2.20050801122645.06180480@exnane01.nane.netapp.com> Message-ID: <42EE54BB.5000006@sgi.com> Yes, that's right, but I need to test it on ia32 too to make sure it does not break anything there. John Talpey, Thomas wrote: > It doesn't need two printk's, the one printk just needs > to have %lu as its format (instead of %ld), right? > > Tom.
-- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From swise at ammasso.com Mon Aug 1 10:05:08 2005 From: swise at ammasso.com (Steve Wise) Date: Mon, 1 Aug 2005 12:05:08 -0500 Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22)summary:OpenRDMA community development
discussion In-Reply-To: <20050801161404.GA15582@lst.de> Message-ID: > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of > Christoph Hellwig > Sent: Monday, August 01, 2005 11:14 AM > To: Talpey, Thomas > Cc: rdma-developers at lists.sourceforge.net; openib-general at openib.org > Subject: Re: [openib-general] Re: [Rdma-developers] Meeting > (07/22)summary:OpenRDMA community development discussion > > On Mon, Aug 01, 2005 at 07:00:48AM -0400, Talpey, Thomas wrote: > > At 06:49 AM 8/1/2005, Caitlin Bestler wrote: > > >After that we need a standardized way to implement "modify qp > > >to RTS" in a iWARP-centric fashion. > > > > I will add that RPC/RDMA (NFS-RDMA) does not currently require this > > functionality, it begins all connections in RDMA mode and > the current > > DAPL semantics are fine for now. This functionality would > be of great > > importance to iSER however, of course. > > > > In the future, NFSv4/sessions has an exchange which can to make > > use of the iWARP step-up mode, but it is not required and we have > > deferred implementing it in the session establishment so far. > > I think taking over an existing TCP connection is a horrible idea > and we should avoid it if possible. The state for a TCP connection > is very complex and it's doubtfull we can make it work 100%. > Christoph, Can you provide more details on exactly why you think this is a horrible idea? I agree it will be complex, but it _could_ be scoped such that the complexity is reduced. For instance, the "offload" function could fail (with EBUSY or something) if there is _any_ data pending on the socket. Thus removing any requirement to pass down pending unacked outgoing data, or pending data that has been received but not yet "read" by the application. 
The idea here is that the applications at the top "know" they are going into RDMA mode and have effectively quiesced the connection before attempting to move the connection into RDMA mode. We could, in fact, _require_ that the connection be quiesced to keep things simpler. I'm quickly sinking into gory details, but I want to know if you have other reasons (other than the complexity) for why this is a bad idea. Stevo From ftillier at silverstorm.com Mon Aug 1 10:14:18 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Mon, 1 Aug 2005 10:14:18 -0700 Subject: [openib-general] Re: [Rdma-developers] Meeting(07/22) summary:OpenRDMA community development discussion In-Reply-To: <42EE51BA.7000404@ichips.intel.com> Message-ID: <000f01c596bc$74ad2710$9c5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Monday, August 01, 2005 9:46 AM > > Yaron Haviv wrote: > > we can spend time and discuss theories and intentions, at the end of the > > day an iWarp RNIC cannot just reside under IB-Verbs without major > > changes to the overall infrastructure. > > I don't disagree with having a common connection library that supports both > IB and iWarp, or that you could derive a solution from kDAPL. But based on > the proposed APIs that I've seen, I believe that an RNIC could reside under > IB verbs with minimal changes, and would likely be the best engineered > solution for including RNIC support in Linux. Just for clarity, when you say verbs you exclude connection establishment/management, right? I think keeping the two distinct is important in this discussion, as it seems there is some confusion - some people refer to verbs as verbs + CM, others as just verbs. Here's my take from the discussions so far: - RNICs can probably be made to work under the IB verbs (with changes of course). - RNICs can probably not be made to work under the IB CM (not that I've seen this suggested).
- Fab From mst at mellanox.co.il Mon Aug 1 12:29:48 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 1 Aug 2005 22:29:48 +0300 Subject: [openib-general] [PATCH] OpenIB scripts In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175B83@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B83@taurus.voltaire.com> Message-ID: <20050801192948.GF23693@mellanox.co.il> Quoting r. Hal Rosenstock : > I don't know if you saw this but can you respond to this on the list ? No, I didnt see this. > Thanks. > > -- Hal > > -----Forwarded Message----- > > From: Hal Rosenstock > To: openib-general at openib.org > Subject: [openib-general] [PATCH] OpenIB scripts > Date: 27 Jul 2005 11:31:23 -0400 > > Add ucm to udev rules > > Signed-off-by: Hal Rosenstock > Also, should these (https://openib.org/svn/trunk/contrib/mellanox/) be > moved to a gen2 location ? I'm fine with moving some of these scripts to trunk. These are install scripts, they dont belong in either src/linux-kernel or src/userspace.
What about at the top level, alongside src/? > Index: 90-ib.rules > =================================================================== > --- 90-ib.rules (revision 2917) > +++ 90-ib.rules (working copy) > @@ -1,3 +1,4 @@ > KERNEL="umad*", NAME="infiniband/%k" > KERNEL="issm*", NAME="infiniband/%k" > +KERNEL="ucm", NAME="infiniband/%k", MODE="0666" > KERNEL="uverbs*", NAME="infiniband/%k", MODE="0666" > > Okay. -- MST From mshefty at ichips.intel.com Mon Aug 1 10:26:14 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 01 Aug 2005 10:26:14 -0700 Subject: [openib-general] Re: [Rdma-developers] Meeting(07/22) summary:OpenRDMA community development discussion In-Reply-To: <000f01c596bc$74ad2710$9c5aa8c0@infiniconsys.com> References: <000f01c596bc$74ad2710$9c5aa8c0@infiniconsys.com> Message-ID: <42EE5B36.9050109@ichips.intel.com> Fab Tillier wrote: > Just for clarity, when you say verbs you exclude connection > establishment/management, right? I refer to verbs as only those calls provided by the device driver, so I'm excluding CM, SA query, MAD services, etc. For IB, it does include the creation of the special QPs. > Here's my take from the discussions so far: > - RNICs can probably be made to work under the IB verbs (with changes of > course). I would go further and state that the RNIC community seems reluctant to do so. - Sean From halr at voltaire.com Mon Aug 1 10:28:34 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 1 Aug 2005 20:28:34 +0300 Subject: [openib-general] [PATCH] OpenIB scripts Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B84@taurus.voltaire.com> > I'm fine with moving some of these scripts to trunk. Which ones wouldn't be moved ? > These are install scripts, they dont belong in either > src/linux-kernel or src/userspace. > What about at the top level, alongside src/? Sounds OK to me. Thanks. 
-- Hal From hch at lst.de Mon Aug 1 10:32:04 2005 From: hch at lst.de (Christoph Hellwig) Date: Mon, 1 Aug 2005 19:32:04 +0200 Subject: [openib-general] Re: [Rdma-developers] Meeting(07/22) summary:OpenRDMA community development discussion In-Reply-To: <42EE5B36.9050109@ichips.intel.com> References: <000f01c596bc$74ad2710$9c5aa8c0@infiniconsys.com> <42EE5B36.9050109@ichips.intel.com> Message-ID: <20050801173204.GA16966@lst.de> On Mon, Aug 01, 2005 at 10:26:14AM -0700, Sean Hefty wrote: > >Here's my take from the discussions so far: > >- RNICs can probably be made to work under the IB verbs (with changes of > >course). > > I would go further and state that the RNIC community seems reluctant to do > so. Then tell them to come up with an alternative. We're not going to - include RNIC/PI or something similar to it - pile abstraction layers on abstraction layers (the kDAPL approach) so if the RNIC community can find anything better they should propose it. OTOH they're not even able to get a NIC driver OpenSource so I'm not going to waste more time on them. From swise at ammasso.com Mon Aug 1 10:33:19 2005 From: swise at ammasso.com (Steve Wise) Date: Mon, 1 Aug 2005 12:33:19 -0500 Subject: [openib-general] Re: [Rdma-developers]Meeting(07/22) summary:OpenRDMA community development discussion In-Reply-To: <42EE5B36.9050109@ichips.intel.com> Message-ID: > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hefty > Sent: Monday, August 01, 2005 12:26 PM > To: Fab Tillier > Cc: Venkata Jagana; openib-general at openib.org; Tom Duffy; > rdma-developers at lists.sourceforge.net; Christoph Hellwig > Subject: Re: [openib-general] Re: > [Rdma-developers]Meeting(07/22) summary:OpenRDMA community > development discussion > > Fab Tillier wrote: > > Just for clarity, when you say verbs you exclude connection > > establishment/management, right?
> > I refer to verbs as only those calls provided by the device > driver, so I'm > excluding CM, SA query, MAD services, etc. For IB, it does > include the > creation of the special QPs. > > > Here's my take from the discussions so far: > > - RNICs can probably be made to work under the IB verbs > (with changes of > > course). > > I would go further and state that the RNIC community seems > reluctant to do so. > This isn't true. The RNIC community is actively trying to do this now. Stay tuned for patches... Steve. From jlentini at netapp.com Mon Aug 1 10:39:45 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 1 Aug 2005 13:39:45 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL dapltest In-Reply-To: <42E942A4.7070006@ichips.intel.com> References: <42E942A4.7070006@ichips.intel.com> Message-ID: Where was the build problem being seen? x86_64 or IA64? I want to record it in the checkin log message. james On Thu, 28 Jul 2005, Arlin Davis wrote: > James, > > Patch to fix build problem. > > -arlin > > > Signed-off by: Arlin Davis > > Index: dapltest/test/dapl_bpool.c > =================================================================== > --- dapltest/test/dapl_bpool.c (revision 2930) > +++ dapltest/test/dapl_bpool.c (working copy) > @@ -363,8 +363,8 @@ > "BPOOL alloc_size %x\n", > (int) bpool_ptr->alloc_size); > DT_Tdep_PT_Printf (phead, > - "BPOOL pz_handle %x\n", > - (int) bpool_ptr->pz_handle); > + "BPOOL pz_handle %p\n", > + bpool_ptr->pz_handle); > DT_Tdep_PT_Printf (phead, > "BPOOL num_segs %x\n", > (int) bpool_ptr->num_segs); > From mst at mellanox.co.il Mon Aug 1 12:43:44 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 1 Aug 2005 22:43:44 +0300 Subject: [openib-general] Re: [PATCH] OpenIB scripts In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175B84@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B84@taurus.voltaire.com> Message-ID: <20050801194344.GG23693@mellanox.co.il> Quoting r. 
Hal Rosenstock : > Subject: RE: [PATCH] OpenIB scripts > > > I'm fine with moving some of these scripts to trunk. > > Which ones wouldn't be moved ? I'll have to check. There's some duplication there: there are old scripts that I use, and new scripts that Vlad wrote. Vlad's scripts match the conventional configure, make, make install interface, so I think that's the way we want to go. They also should support out of kernel builds. > > These are install scripts, they dont belong in either > > src/linux-kernel or src/userspace. > > > What about at the top level, alongside src/? > > Sounds OK to me. Good, I guess we'll wait till Sunday and add them if it's fine with everyone. -- MST From swise at ammasso.com Mon Aug 1 10:40:46 2005 From: swise at ammasso.com (Steve Wise) Date: Mon, 1 Aug 2005 12:40:46 -0500 Subject: [openib-general] Re: [Rdma-developers]Meeting(07/22) summary:OpenRDMA community development discussion In-Reply-To: <20050801173204.GA16966@lst.de> Message-ID: > Then tell them to come up with an alternative. We're not going to > > - include RNIC/PI or something similar to it > - pile abstraction layers on abstraction layers (the kDAPL approach) > > so if the RNIC community can find anything better they should propose > it. OTOH they're not even able to get a NIC driver OpenSource so I'm > not going to waste more time on them. The current Ammasso RNIC driver is released under the GPL license now. At this point in time, it's not openIB. We intend to develop an open source driver that plugs into the openIB framework and does _not_ implement RNIC/PI. And we'll provide this as a series of patches during development to the openIB community for general review. Steve. From halr at voltaire.com Mon Aug 1 10:43:10 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 1 Aug 2005 20:43:10 +0300 Subject: [openib-general] Re: [PATCH] uDAPL dapltest Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B85@taurus.voltaire.com> I saw this on x86_64. Don't know about IA64.
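For context on why the dapltest patch matters on x86_64: there a pointer is 64 bits wide and an int is 32, so the old `(int)` cast both truncates the handle and draws a pointer-to-integer size warning, while %p takes the pointer as it is. A minimal sketch follows; `format_handle` is a hypothetical helper and not part of dapltest, whose `DT_Tdep_PT_Printf` is not reproduced here.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format an opaque handle for a log line.
 * %p expects a void pointer and prints it at full width, so nothing
 * is truncated on 64-bit targets. The exact text produced by %p is
 * implementation-defined.
 */
static int format_handle(char *buf, size_t len, void *handle)
{
    return snprintf(buf, len, "BPOOL pz_handle %p", handle);
}
```

Because the output of %p varies between C libraries, code that parses such a line back should use the matching scanf conversion rather than assume a fixed hexadecimal layout.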
-- Hal -----Original Message----- From: openib-general-bounces at openib.org on behalf of James Lentini Sent: Mon 8/1/2005 1:39 PM To: Arlin Davis Cc: openib Subject: [openib-general] Re: [PATCH] uDAPL dapltest Where was the build problem being seen? x86_64 or IA64? I want to record it in the checkin log message. james On Thu, 28 Jul 2005, Arlin Davis wrote: > James, > > Patch to fix build problem. > > -arlin > > > Signed-off by: Arlin Davis > > Index: dapltest/test/dapl_bpool.c > =================================================================== > --- dapltest/test/dapl_bpool.c (revision 2930) > +++ dapltest/test/dapl_bpool.c (working copy) > @@ -363,8 +363,8 @@ > "BPOOL alloc_size %x\n", > (int) bpool_ptr->alloc_size); > DT_Tdep_PT_Printf (phead, > - "BPOOL pz_handle %x\n", > - (int) bpool_ptr->pz_handle); > + "BPOOL pz_handle %p\n", > + bpool_ptr->pz_handle); > DT_Tdep_PT_Printf (phead, > "BPOOL num_segs %x\n", > (int) bpool_ptr->num_segs); > _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From pw at osc.edu Mon Aug 1 10:55:16 2005 From: pw at osc.edu (Pete Wyckoff) Date: Mon, 1 Aug 2005 13:55:16 -0400 Subject: [openib-general] kernel VM monitor for memory registration caching In-Reply-To: <20050731103157.GT13014@minantech.com> References: <20050729174225.GB2683@osc.edu> <20050731103157.GT13014@minantech.com> Message-ID: <20050801175516.GA15056@osc.edu> glebn at voltaire.com wrote on Sun, 31 Jul 2005 13:31 +0300: > First of all, you have one user_delta per mm that user can poll from > userspace. Is it possible to make user_delta to be part of dreg_region > instead of dreg_context and module will set it whenever > registration becomes invalid. Field 'invalid' will be added to buf_info > structure and pointer to it will be passed to kernel at registration > time. 
> This way the userpace can look up cache and check if registration is > still valid. No need to rescan cache from userspace, we already scanned > it once from kernel after all. With your current approach userspace will > need to search for mr_handle in the cache and invalidate the entry that > holds it. I still think that idea is possible. But perhaps someone who wants to integrate this with a cache can do so sometime. There is no real cache structure in this userspace test code. > You change vma_ops in vma to catch open/close events. What about > nopage() method in vma_ops? We have to forward it to original vma_ops? > > Something like included patch (not even compiled). Right, good catch. I did something a bit different to avoid kmalloc in the common case of empty origvma->vm_ops (i.e. normal memory). But for the sake of correctness the fix must be there. Note that these vm_ops structures don't really stack properly. If there happened to be some other kernel component that decided to take ownership of the vm_ops structure, dreg would fail to notice that it was the "owner" of the vma already, and it might free the ops incorrectly. Ideally there would be a chain of these things that would get called from higher up that would not interfere with each other. As it is, dreg assumes it is the ultimate owner and that other interfering vma ops have already done their business. This stacked usage would be the rare case. If you're communicating using buffers from, say, your video or sound card memory, or a hugetlb space, or SYSV shm, this will arise. But in the case of shm and hugetlb and DRM video, we're okay, as they set the ops in mmap() and don't change it, ever. IB memory registrations cannot happen until the memory is mapped, so dreg will always be last. Arbitrary uses might cause failure someday, however. 
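The copy-and-override handling of vm_ops described above can be modelled in plain userspace C. This is only a sketch under stated assumptions, not the dreg code: struct ops_model, wrap_ops(), unwrap_ops(), monitor_open(), monitor_close() and driver_nopage() are hypothetical names standing in for vm_operations_struct and the dreg handlers, and malloc/free stand in for kmalloc/kfree. It shows the two cases: a shared static table when the area had no ops of its own, and a private copy that keeps the owner's other methods (such as nopage) when it did.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the kernel's vm_operations_struct. */
struct ops_model {
    void (*open)(void *area);
    void (*close)(void *area);
    int  (*nopage)(void *area, unsigned long addr);
};

/* Stand-ins for the monitor's own open/close handlers. */
static void monitor_open(void *area)  { (void)area; }
static void monitor_close(void *area) { (void)area; }

/* Stand-in for a method owned by some other component (e.g. a driver). */
static int driver_nopage(void *area, unsigned long addr)
{
    (void)area;
    (void)addr;
    return 0;
}

/* Shared table, used when the area had no ops of its own (the common case). */
static struct ops_model monitor_ops = {
    .open  = monitor_open,
    .close = monitor_close,
};

/*
 * Returns the table to install. Allocates a private copy only when the
 * area already had ops, so the original methods keep working.
 * Returns NULL if the allocation fails.
 */
static struct ops_model *wrap_ops(const struct ops_model *orig)
{
    struct ops_model *copy;

    if (!orig)
        return &monitor_ops;

    copy = malloc(sizeof(*copy));
    if (!copy)
        return NULL;

    *copy = *orig;               /* keeps nopage and any other method */
    copy->open  = monitor_open;  /* override only what we intercept */
    copy->close = monitor_close;
    return copy;
}

/* Frees a private copy; the shared static table must never be freed. */
static void unwrap_ops(struct ops_model *installed)
{
    assert(installed != NULL);
    if (installed != &monitor_ops)
        free(installed);
}
```

Unlike the posted patch, the sketch checks the allocation result before copying into it. It does not address the stacking problem raised above: if another component replaced the table after wrap_ops(), the pointer comparison in unwrap_ops() would no longer identify the owner correctly.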
-- Pete diff -u -p -r1.4 dreg.c --- dreg.c 29 Jul 2005 17:17:52 -0000 1.4 +++ dreg.c 1 Aug 2005 17:41:00 -0000 @@ -96,6 +96,15 @@ static void dreg_vm_close(struct vm_area static void mem_deregister(struct dreg_context *dc, struct dreg_region *reg); /* + * Interface with VM. It calls us back when one of these events happen on + * a VMA we care about. + */ +static struct vm_operations_struct dreg_vm_ops = { + .open = dreg_vm_open, + .close = dreg_vm_close, +}; + +/* * Helpful functions. */ static struct dreg_context *find_context_by_mm(const struct mm_struct *mm) @@ -161,8 +170,11 @@ static void destroy_region(struct dreg_c struct vm_area_struct *vma = reg->vma; pr_debug("%s: reg %p vma %p addr %lx\n", __func__, reg, vma, reg->addr); - if (vma) + if (vma) { + if (vma->vm_ops != &dreg_vm_ops) + kfree(vma->vm_ops); vma->vm_ops = reg->orig_ops; + } if (reg->addr) mem_deregister(dc, reg); list_del(®->subordinate_list); @@ -172,15 +184,6 @@ static void destroy_region(struct dreg_c } /* - * Interface with VM. It calls us back when one of these events happen on - * a VMA we care about. - */ -static struct vm_operations_struct dreg_vm_ops = { - .open = dreg_vm_open, - .close = dreg_vm_close, -}; - -/* * Triggered from VM activity. Pass through to underlying vm_ops * if it exists after we finish. * @@ -305,6 +308,10 @@ static void dreg_vm_open(struct vm_area_ * forget about it and do not build a new region for it. */ if (list_empty(&temp_new_subordinate_list)) { + /* + * Do not free the newvma ops, it is just a copy, free handled when + * original vma->vm_ops is destroyed. 
+ */ newvma->vm_ops = orig_ops; } else { reg = kmem_cache_alloc(dreg_region_cache, GFP_KERNEL); @@ -318,6 +325,14 @@ static void dreg_vm_open(struct vm_area_ } reg->orig_ops = orig_ops; + if (orig_ops) { + newvma->vm_ops = kmalloc(sizeof(*newvma->vm_ops), GFP_KERNEL); + *newvma->vm_ops = *orig_ops; + newvma->vm_ops->open = dreg_vm_open; + newvma->vm_ops->close = dreg_vm_close; + } else { + newvma->vm_ops = &dreg_vm_ops; + } /* non subordinate */ reg->vma = newvma; list_add(®->subordinate_list, &temp_new_subordinate_list); @@ -526,7 +541,15 @@ static int dreg_write_monitor(struct dre /* non subordinate */ reg->vma = vma; INIT_LIST_HEAD(®->subordinate_list); - vma->vm_ops = &dreg_vm_ops; /* own this vma */ + if (reg->orig_ops) { + /* copy it in case it uses more than open/close */ + vma->vm_ops = kmalloc(sizeof(*vma->vm_ops), GFP_KERNEL); + *vma->vm_ops = *reg->orig_ops; + vma->vm_ops->open = dreg_vm_open; + vma->vm_ops->close = dreg_vm_close; + } else { + vma->vm_ops = &dreg_vm_ops; /* own this vma */ + } reg->orig_vm_start = vma->vm_start; reg->orig_vm_end = vma->vm_end; } @@ -564,6 +587,8 @@ static int dreg_write_unmonitor(struct d if (list_empty(®->subordinate_list)) { pr_debug("%s: mr %x empty subord, releasing vma\n", __func__, mrh); + if (reg->vma->vm_ops != &dreg_vm_ops) + kfree(reg->vma->vm_ops); reg->vma->vm_ops = reg->orig_ops; } else { struct dreg_region *treg; From jlentini at netapp.com Mon Aug 1 11:07:04 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 1 Aug 2005 14:07:04 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: inc/dec module ref count In-Reply-To: References: Message-ID: On Fri, 29 Jul 2005, Guy German wrote: > Hi James, > > That's an interesting point. > I don't know why the ib_mthca can be rmmod-ed at any > time, without dependencies (I guess it's a trusted env. > issue). Is it a hotplug issue? > I need to check what happens if you rmmod ib_mthca and > then calling ia_close. 
It supposed to decrement > the ref count even on failure from the hca. > > Any way I think that in a production level kernel > you are not supposed to do rmmod AT ALL. I wasn't aware of that convention. From jlentini at netapp.com Mon Aug 1 11:36:08 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 1 Aug 2005 14:36:08 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL makefile In-Reply-To: <42EAC8EA.80706@ichips.intel.com> References: <42EAC8EA.80706@ichips.intel.com> Message-ID: I committed the libverbs specific portion in revision 2939. I'll check the CQ_WAIT_OBJECT change in with the rest of your earlier patch (since it goes with that). james On Fri, 29 Jul 2005, Arlin Davis wrote: > James, > > no longer need to link with mthca.so > > -arlin > > Signed-off by: Arlin Davis > > > Index: dapl/udapl/Makefile > =================================================================== > --- dapl/udapl/Makefile (revision 2919) > +++ dapl/udapl/Makefile (working copy) > @@ -122,7 +122,8 @@ > # > ifeq ($(VERBS),openib) > PROVIDER = $(TOPDIR)/../openib > -CFLAGS += -DOPENIB -DCQ_WAIT_OBJECT > +CFLAGS += -DOPENIB > +#CFLAGS += -DCQ_WAIT_OBJECT > CFLAGS += -I/usr/local/include/infiniband > endif > > @@ -232,9 +233,8 @@ > endif > > ifeq ($(VERBS),openib) > -LDFLAGS += -libverbs /usr/local/lib/infiniband/mthca.so -libcm > +LDFLAGS += -libverbs -libcm -libat > LDFLAGS += -rpath /usr/local/lib -L /usr/local/lib > -LDFLAGS += -rpath /usr/local/lib/infiniband -L /usr/local/lib/infiniband > PROVIDER_SRCS = dapl_ib_util.c dapl_ib_cq.c dapl_ib_qp.c > PROVIDER_SRCS += dapl_ib_cm.c dapl_ib_mem.c > endif > From jay.rosser at hp.com Mon Aug 1 11:48:38 2005 From: jay.rosser at hp.com (Jay Rosser) Date: Mon, 01 Aug 2005 11:48:38 -0700 Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22) summary:OpenRDMA community development discussion In-Reply-To: <8508251A6FC08A489844A94261D3693A077FF8@fiona.siliquent.com> References: 
<8508251A6FC08A489844A94261D3693A077FF8@fiona.siliquent.com> Message-ID: <42EE6E86.5080309@hp.com> The IT-API working group spent some effort on analysis of the connection management issues required to support both IB and iWARP in a common API and documented such in chapter 5 and appendix B of the IT-API 2.0 specification: http://www.opengroup.org/icsc/uploads/40/7237/IT-API-V2-Final.pdf We hope it is useful for reference on this subject and as an example of a possible approach to resolving the issues. Jay > > > > >>-----Original Message----- >>From: rdma-developers-admin at lists.sourceforge.net >>[mailto:rdma-developers-admin at lists.sourceforge.net] On >>Behalf Of Yaron Haviv >>Sent: Sunday, July 31, 2005 8:57 PM >>To: Christoph Hellwig; Tom Duffy >>Cc: Venkata Jagana; rdma-developers at lists.sourceforge.net; >>openib-general at openib.org >>Subject: RE: [openib-general] Re: [Rdma-developers] Meeting >>(07/22) summary:OpenRDMA community development discussion >> >> >> >>>-----Original Message----- >>>From: openib-general-bounces at openib.org [mailto:openib-general- >>>bounces at openib.org] On Behalf Of Christoph Hellwig >>>Sent: Friday, July 29, 2005 8:02 AM >>>To: Tom Duffy >>>Cc: Venkata Jagana; rdma-developers at lists.sourceforge.net; >>> >>> >>Christoph >> >> >>>Hellwig; openib-general at openib.org >>>Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22) >>>summary:OpenRDMA community development discussion >>> >>>On Thu, Jul 28, 2005 at 02:02:08PM -0700, Tom Duffy wrote: >>> >>> >>>>At OLS (and in previous forums), the kernel maintainers >>>> >>>> >>have made it >> >> >>>>*very* clear that there should only be one API. >>>> >>>> >>>_and_ that this api is neither RNIC-PI or KDAPL. In fact >>> >>> >>for anything >> >> >>>that doesn't look very similar to the current IB midlayer >>> >>> >>you'd need >> >> >>>very convincing arguments. 
>>> >>> >>> >>I assume it is not as simplistic as that iWarp CM model is >>quite different than IB, and iWarp doesn't have SA/SM and a >>bunch of other IB specific things >> >>For example: >>The correct common abstraction is one where a user can issue >>a connection by using a logical end-point address (such as an >>IP), and doesn't have to deal with the IB or iWarp specific >>CM state machine or SA/SM. >> >>If you look at DAPL you can break it to simple Verbs (e.g. >>send, ..) where its just a simple overlay on to of the verbs >>(and may be >>redundant) >>However there is a second part that implements a simple >>connection establishment model (much like BSD) that can be >>mapped to both IB (CM, SA, ..) or iWarp (TCP Syn/SynAck, ARP, >>etc'), this serves couple of main >>purposes: >>a. make it simple for ULP developer and put the complex part >>in a common >>place >>b. define a common model for different HW >> >>we can spend time and discuss theories and intentions, at the >>end of the day an iWarp RNIC cannot just reside under >>IB-Verbs without major changes to the overall infrastructure. >>Several guys spent some time looking it over and came with an >>abstraction that IS possible on top of IB & iWarp & foo, that >>is called DAPL (or IT as another similar alternative) >> >>It would probably be wise to try and merge that effort with >>IB-verbs etc' >>(e.g. make the verbs portion of the API closer), and on the >>same time preserve the effort that was done in kDAPL to >>overcome the differences (e.g. in the CM, addressing portions) >> >> >> > >The working theory is that two additional connection management >'verbs' will be proposed, both optional. > >The first is a straight-forward mapping of the traditional DAT >connection establishment: i.e., do whatever you have to do in >order to listen, request a connection, accept/reject connection >requests. 
This is an interface that can map to existing iWARP >implementations very easily while not immediately standardizing >integration with the host TCP stack. > >Connection Requests are reported in this model, but via callbacks >rather than EVDs, since it is desired to keep a verb layer interface >more spartan than kDAPL. However, the fact that connection requests >must be approved allows blocking of all requests that conflict >with host policy (as expressed in packet filtering rules). > >It is also an interface that IB vendors *could* support, because >DAPL implements over IB. We could consider having a default >implementation of the methods from the DAPL code for that >purpose. > >After that we need a standardized way to implement "modify qp >to RTS" in a iWARP-centric fashion. This will require agreement >on how to transfer the TCP state from the host stack to the >RNIC. This is the more flexible model that matches RDMAC >verbs and supports all connection establishment models, >including iSER. It is inherently iWARP specific, however, >since IB connections do not start their life as TCP >connections. > >The benefit of this second additional API is that the >preserves *all* host stack safeguards on connection >establishment because the host stack actually establishes >the TCP connection. It does require agreement on the correct >way to extract the TCP connection from the host stack, >however. This is something that existing deployments >already do, but they aren't doing it in mainline code. >Reaching consensus on a sustainable interface for doing >this is certainly worthy of careful consideration, which >is why the first additional DAT-style API will be proposed. 
-- jay.rosser at hp.com 408-447-3175 From jlentini at netapp.com Mon Aug 1 11:51:24 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 1 Aug 2005 14:51:24 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: inc/dec module ref count In-Reply-To: <20050731142205.GJ20524@granada.merseine.nu> References: <20050731142205.GJ20524@granada.merseine.nu> Message-ID: On Sun, 31 Jul 2005, Muli Ben-Yehuda wrote: > On Thu, Jul 28, 2005 at 03:01:24PM -0400, James Lentini wrote: > >> We've been using the following convention to indicate that we don't >> care about the return value: >> >> (void) try_module_get(THIS_MODULE); > > This seems counter intuitive to me (the idea, not the syntax) - > try_module_get() can and will fail. If it doesn't matter that it > succeeds or fails, why call it at all? There are cases when a function can fail and there is no intelligent way to handle the error or report it to a user (e.g. printf(3), although I don't use the void cast convention when calling printf). james From jlentini at netapp.com Mon Aug 1 11:56:02 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 1 Aug 2005 14:56:02 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: inc/dec module ref count In-Reply-To: <20050731151330.GN20524@granada.merseine.nu> References: <20050731151330.GN20524@granada.merseine.nu> Message-ID: On Sun, 31 Jul 2005, Muli Ben-Yehuda wrote: > On Sun, Jul 31, 2005 at 06:08:11PM +0300, Guy German wrote: >> Hi Muli, >> >> Wouldn't it be solved by moving the try_module_get call to the >> beginning of the dapl_ia_open function ? > > No.
Even if it's theoretically the first line in the function, the > compiler can and will create a function prologue that will run before > you raise the reference count (same thing with decrementing the ref > count at module unload time and the function epilogue). You must do > module reference counting before executing even one instruction from > the module. > >> You are right - try_module_get() can fail when the module is not ready >> to be entered. should be something like: >> + if (!try_module_get(THIS_MODULE)) >> + return -EBUSY; > > Yes - but at the caller, not callee. Putting this in the caller (i.e. dat_ia_open and dat_ia_close) does sound like a better option. Guy, can you investigate why the ib_mthca module doesn't have a reference count and see if it relates to hotplug? I think kdapl_ib and ib_mthca should have the same policy regarding this issue. From mulix at mulix.org Mon Aug 1 11:58:18 2005 From: mulix at mulix.org (Muli Ben-Yehuda) Date: Mon, 1 Aug 2005 21:58:18 +0300 Subject: [openib-general][PATCH][kdapl]: inc/dec module ref count In-Reply-To: References: <20050731142205.GJ20524@granada.merseine.nu> Message-ID: <20050801185818.GH31528@granada.merseine.nu> On Mon, Aug 01, 2005 at 02:51:24PM -0400, James Lentini wrote: > There are cases when a function can fail and there is no intelligent > way to handle the error or report it to a user (e.g. printf(3), > although I don't use the void cast convention when calling printf). Of course, I'm talking specifically about try_module_get(). 
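The caller-side reference counting discussed in this thread can be modelled in userspace C. This is a sketch under stated assumptions, not kernel code: struct module_model, model_try_get(), model_put(), provider_open(), caller_open() and caller_close() are hypothetical names, with caller_open()/caller_close() standing in for the dat_ia_open/dat_ia_close level and model_try_get() imitating the way try_module_get() refuses a module that is being removed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of a module's reference count. */
struct module_model {
    int  refs;
    bool unloading;   /* set once removal has begun */
};

/* Models try_module_get(): fails once the module is going away. */
static bool model_try_get(struct module_model *m)
{
    if (m->unloading)
        return false;
    m->refs++;
    return true;
}

/* Models module_put(): drops one reference. */
static void model_put(struct module_model *m)
{
    assert(m->refs > 0);
    m->refs--;
}

/* Entry point that lives "inside" the provider module. */
static int provider_open(void)
{
    return 0;
}

/*
 * Caller-side wrapper: the reference is taken before any provider code
 * runs, and the failure is reported instead of being cast to void.
 */
static int caller_open(struct module_model *provider)
{
    int ret;

    if (!model_try_get(provider))
        return -EBUSY;

    ret = provider_open();
    if (ret)
        model_put(provider);   /* open failed, so drop the reference */
    return ret;
}

/* Caller-side close: the reference is dropped after provider code returns. */
static void caller_close(struct module_model *provider)
{
    /* the provider's close routine would run here */
    model_put(provider);
}
```

Because the get and put both sit in the caller, no instruction of the provider (prologue and epilogue included) executes without a reference held.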
Cheers, Muli -- Muli Ben-Yehuda http://www.mulix.org | http://mulix.livejournal.com/ From jlentini at netapp.com Mon Aug 1 12:25:17 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 1 Aug 2005 15:25:17 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: adding DAT_MEM_TYPE_IA support In-Reply-To: References: Message-ID: On Fri, 29 Jul 2005, Guy German wrote: >> + array = (u64 *)phys_addr.for_array; /* need to add for_u64_array to >> union */ >> What does this comment mean? > > I think the right way to do it is : > array = phys_addr.for_u64_array > (Givven the union consists of a new type u64* called "for_u64_array") I believe the original idea was to have IA memory use the DAT_REGION_DESCRIPTION's for_pointer value. From jlentini at netapp.com Mon Aug 1 12:41:29 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 1 Aug 2005 15:41:29 -0400 (EDT) Subject: [openib-general] Re: [PATCH][dapl] cleanup dapl_cookie In-Reply-To: <20050729184034.GF15261@aon.at> References: <20050627173951.GB19279@aon.at> <20050629215046.GB17011@aon.at> <20050729184034.GF15261@aon.at> Message-ID: On Fri, 29 Jul 2005, Bernhard Fischer wrote: > On Thu, Jun 30, 2005 at 10:04:36AM -0400, James Lentini wrote: >> >> >> On Wed, 29 Jun 2005, Bernhard Fischer wrote: >> >>> On Tue, Jun 28, 2005 at 03:39:59PM -0400, James Lentini wrote: >>>> >>>> Hi Bernhard, >>>> >>>> The changes look fine. Why the additional copyright? I need to be able >>>> to explain it to my legal department. >>> >>> My legaleeze states that whatever i do during work-time is contributed >>> to work and whatever is related to work done during leasure time has to >>> be attributed to /me _at_ _least_. As that snippet (which was a >>> test-balloon >>> for that category) clearly was done in my spare time, i'm forced to >>> attribute it accordingly :-/ >>> >>> Does that answer your question satisfactorily? >> >> Thanks Bernhard. That makes sense to me. My legal inquired about the >> "all rights reserved" qualifier. 
All the copyrights I found in the >> OpenIB tree (including NetApp's) use that language. I'll run this by >> them. >> > As rev. 2934 i do not see this patch applied. To recap, it removed some > unneeded local variables (which my compiler wasn't smart enough to > eleminate on it's own -- gcc-4.0 and gcc-HEAD) and simplified some > conditionals and branches. > > Back then, i only submitted the changes to dapl_cookie.c to see if such > kind of code simplifications would be accepted or not. > > James, can you please elaborate why the patch was rejected? It wasn't rejected. With the back and forth on the copyright notice, I thought I was waiting for you to reply. Sorry about that. I'll dust off the patch and review it again. From halr at voltaire.com Mon Aug 1 12:40:14 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 1 Aug 2005 22:40:14 +0300 Subject: [openib-general] [PATCH] uDAPL with uCM and uAT support Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B8C@taurus.voltaire.com> Hi, I don't think this has been applied as yet but the following steps in README are no longer needed and should be replaced by the below now that uAT has been moved to the trunk. -- Hal Instructions From README: Third drop of code, includes uCM and uAT support. build uAT library: cd gen2/trunk/src/userspace/libibat/ ./autogen.sh &&./configure && make && make install -----Original Message----- From: openib-general-bounces at openib.org on behalf of Arlin Davis Sent: Sun 7/24/2005 8:48 PM To: 'James Lentini' Cc: openib-general at openib.org Subject: [openib-general] [PATCH] uDAPL with uCM and uAT support James, Here is a third drop that adds IBAT to the uCM support. This also includes some fixes to common code evd_wait and evd_resize. Instructions From README: Third drop of code, includes uCM and uAT support. NOTE: uAT user library and kernel code in separate branch. 
build uAT library from following branch: cd gen2/branches/shaharf-ibat/src/userspace/libibat/ ./autogen.sh &&./configure && make && make install copy following uat source to latest trunk kernel src: gen2/branches/shaharf-ibat/src/linux-kernel/infiniband/core at.c at_priv.h att.c uat.c uat.h Makefile gen2/branches/shaharf-ibat/src/linux-kernel/infiniband/include ib_at.h ib_user_at.h From jlentini at netapp.com Mon Aug 1 12:43:48 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 1 Aug 2005 15:43:48 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL dapltest In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175B85@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B85@taurus.voltaire.com> Message-ID: Ok. I reference the x86_64 build problem in my log message and committed it in revision 2940. On Mon, 1 Aug 2005, Hal Rosenstock wrote: > I saw this on x86_64. Don't know about IA64. > > -- Hal > > -----Original Message----- > From: openib-general-bounces at openib.org on behalf of James Lentini > Sent: Mon 8/1/2005 1:39 PM > To: Arlin Davis > Cc: openib > Subject: [openib-general] Re: [PATCH] uDAPL dapltest > > > Where was the build problem being seen? x86_64 or IA64? I want to > record it in the checkin log message. > > james > > On Thu, 28 Jul 2005, Arlin Davis wrote: > >> James, >> >> Patch to fix build problem. 
>> >> -arlin >> >> >> Signed-off by: Arlin Davis >> >> Index: dapltest/test/dapl_bpool.c >> =================================================================== >> --- dapltest/test/dapl_bpool.c (revision 2930) >> +++ dapltest/test/dapl_bpool.c (working copy) >> @@ -363,8 +363,8 @@ >> "BPOOL alloc_size %x\n", >> (int) bpool_ptr->alloc_size); >> DT_Tdep_PT_Printf (phead, >> - "BPOOL pz_handle %x\n", >> - (int) bpool_ptr->pz_handle); >> + "BPOOL pz_handle %p\n", >> + bpool_ptr->pz_handle); >> DT_Tdep_PT_Printf (phead, >> "BPOOL num_segs %x\n", >> (int) bpool_ptr->num_segs); >> From jlentini at netapp.com Mon Aug 1 13:35:45 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 1 Aug 2005 16:35:45 -0400 (EDT) Subject: [openib-general] Re: [PATCH][dapl] cleanup dapl_cookie In-Reply-To: <20050627173951.GB19279@aon.at> References: <20050627173951.GB19279@aon.at> Message-ID: On Mon, 27 Jun 2005, Bernhard Fischer wrote: > Hi James, > > untested. > > - cleanup dapl_cookie.c: remove unneeded local variables and simplify > branches to be consistent with dapl_rmr_cookie_alloc(). > > Signed-off-by: Bernhard Fischer > > thank you, > Bernhard > Committed in revision 2941. From tduffy at sun.com Mon Aug 1 14:25:41 2005 From: tduffy at sun.com (Tom Duffy) Date: Mon, 01 Aug 2005 14:25:41 -0700 Subject: [openib-general] link availability In-Reply-To: <42EE4390.9020909@int-evry.fr> References: <42EE4390.9020909@int-evry.fr> Message-ID: <1122931541.15026.31.camel@duffman> On Mon, 2005-08-01 at 17:45 +0200, Ouissem BEN FREDJ wrote: > Hello, > > Is it possible to link directly two Infiniband cards type InfiniCom > InfiniServ 7000 (MT23108) , without using any interconnect switch ? > In that case, could you tell me how.
Please refer to the FAQ (specifically question 2): https://openib.org/tiki/tiki-index.php?page=OpenIBFAQ -tduffy From tduffy at sun.com Mon Aug 1 14:49:58 2005 From: tduffy at sun.com (Tom Duffy) Date: Mon, 01 Aug 2005 14:49:58 -0700 Subject: [openib-general] Re: [PATCH] SDP: use linux/list.h for binds In-Reply-To: <20050719162908.E2231@topspin.com> References: <1121366046.17302.9.camel@duffman> <20050719160345.B2231@topspin.com> <20050719230923.GD17097@mellanox.co.il> <20050719162908.E2231@topspin.com> Message-ID: <1122932998.15026.34.camel@duffman> On Tue, 2005-07-19 at 16:29 -0700, Libor Michalek wrote: > > > Thanks. I've applied the patch since it's better than what was there. > > > However, the longer term solution needs to use a full hash table, the > > > linear list is problematic when there are a lot of connections, and each > > > port_get() needs to traverse the entire list to check for collisions. > > > > Or a tree? > > A tree would be fine as well, I was just thinking anything that has > a src_port lookup time of O(log n) instead of O(n). Also the tcp code > uses its own hash table for its port lookup. (net/ipv4/tcp_ipv4.c) > That seems like it might be overkill... Is anyone looking at moving this over to use rbtree.h? If not, I will take a look. -tduffy From arlin.r.davis at intel.com Mon Aug 1 14:59:01 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 1 Aug 2005 14:59:01 -0700 Subject: [openib-general] [PATCH] uDAPL with uCM and uAT support - version 2 Message-ID: James, Here is version 2 with the changes you requested.
Also, README updated per Hal's comments. dapl/udapl/dapl_evd_wait.c dapl/udapl/Makefile dapl/common/dapl_evd_resize.c dapl/openib/TODO dapl/openib/dapl_ib_util.c dapl/openib/dapl_ib_cm.c dapl/openib/dapl_ib_util.h dapl/openib/README dapl/openib/dapl_ib_cq.c Signed-off by: Arlin Davis Index: dapl/udapl/dapl_evd_wait.c =================================================================== --- dapl/udapl/dapl_evd_wait.c (revision 2919) +++ dapl/udapl/dapl_evd_wait.c (working copy) @@ -74,9 +74,10 @@ DAPL_EVD *evd_ptr; DAT_RETURN dat_status; DAT_EVENT *local_event; - DAT_BOOLEAN notify_requested = DAT_FALSE; + DAT_BOOLEAN notify_needed = DAT_FALSE; DAT_BOOLEAN waitable; DAPL_EVD_STATE evd_state; + DAT_COUNT total_events,new_events; dapl_dbg_log (DAPL_DBG_TYPE_API, "dapl_evd_wait (%p, %d, %d, %p, %p)\n", @@ -124,9 +125,9 @@ } dapl_dbg_log (DAPL_DBG_TYPE_EVD, - "dapl_evd_wait: EVD %p, CQ %p\n", - evd_ptr, - (void *)evd_ptr->ib_cq_handle); + "dapl_evd_wait: EVD %p, CQ %p, Timeout %d, Threshold %d\n", + evd_ptr,(void *)evd_ptr->ib_cq_handle, time_out, threshold); + /* * Make sure there are no other waiters and the evd is active. 
@@ -144,11 +145,10 @@ evd_state = dapl_os_atomic_assign ( (DAPL_ATOMIC *)&evd_ptr->evd_state, (DAT_COUNT) DAPL_EVD_STATE_OPEN, (DAT_COUNT) DAPL_EVD_STATE_WAITED ); - dapl_os_unlock ( &evd_ptr->header.lock ); - if ( evd_state != DAPL_EVD_STATE_OPEN ) + dapl_os_unlock ( &evd_ptr->header.lock ); + if ( evd_state != DAPL_EVD_STATE_OPEN || !waitable) { - /* Bogus state, bail out */ dat_status = DAT_ERROR (DAT_INVALID_STATE,0); goto bail; } @@ -182,37 +182,54 @@ * return right away if the ib_cq_handle associate with these evd * equal to IB_INVALID_HANDLE */ - dapls_evd_copy_cq(evd_ptr); - - if (dapls_rbuf_count(&evd_ptr->pending_event_queue) >= threshold) - { - break; - } - - /* - * Do not enable the completion notification if this evd is not - * a DTO_EVD or RMR_BIND_EVD + /* Logic to prevent missing completion between copy_cq (poll) + * and completion_notify (re-arm) */ - if ( (!notify_requested) && - ((evd_ptr->evd_flags & DAT_EVD_DTO_FLAG) || - (evd_ptr->evd_flags & DAT_EVD_RMR_BIND_FLAG)) ) + notify_needed = DAT_TRUE; + new_events = 0; + while (DAT_TRUE) { - dat_status = dapls_ib_completion_notify ( - evd_ptr->header.owner_ia->hca_ptr->ib_hca_handle, - evd_ptr, - (evd_ptr->completion_type == DAPL_EVD_STATE_SOLICITED_WAIT) ? 
- IB_NOTIFY_ON_SOLIC_COMP : IB_NOTIFY_ON_NEXT_COMP ); - - DAPL_CNTR(DCNT_EVD_WAIT_CMP_NTFY); - /* FIXME report error */ - dapl_os_assert(dat_status == DAT_SUCCESS); + dapls_evd_copy_cq(evd_ptr); /* poll for new completions */ + total_events = dapls_rbuf_count (&evd_ptr->pending_event_queue); + new_events = total_events - new_events; + if (total_events >= threshold || + (!new_events && notify_needed == DAT_FALSE)) + { + break; + } + + /* + * Do not enable the completion notification if this evd is not + * a DTO_EVD or RMR_BIND_EVD + */ + if ( (evd_ptr->evd_flags & DAT_EVD_DTO_FLAG) || + (evd_ptr->evd_flags & DAT_EVD_RMR_BIND_FLAG) ) + { + dat_status = dapls_ib_completion_notify ( + evd_ptr->header.owner_ia->hca_ptr->ib_hca_handle, + evd_ptr, + (evd_ptr->completion_type == DAPL_EVD_STATE_SOLICITED_WAIT) ? + IB_NOTIFY_ON_SOLIC_COMP : IB_NOTIFY_ON_NEXT_COMP ); + + DAPL_CNTR(DCNT_EVD_WAIT_CMP_NTFY); + notify_needed = DAT_FALSE; + new_events = total_events; + + /* FIXME report error */ + dapl_os_assert(dat_status == DAT_SUCCESS); + } + else + { + break; + } - notify_requested = DAT_TRUE; + } /* while completions < threshold, and rearm needed */ - /* Try again. */ - continue; + if (total_events >= threshold) + { + break; } - + /* * Unused by poster; it has no way to tell how many @@ -232,8 +249,6 @@ #endif dat_status = dapl_os_wait_object_wait ( &evd_ptr->wait_object, time_out ); - - notify_requested = DAT_FALSE; /* We've used it up. */ /* See if we were awakened by evd_set_unwaitable */ if ( !evd_ptr->evd_waitable ) @@ -243,13 +258,22 @@ if (dat_status != DAT_SUCCESS) { - /* - * If the status is DAT_TIMEOUT, we'll break out of the - * loop, *not* dequeue an event (because dat_status - * != DAT_SUCCESS), set *nmore (as we should for timeout) - * and return DAT_TIMEOUT. 
- */ - break; + /* + * If the status is DAT_TIMEOUT, we'll break out of the + * loop, *not* dequeue an event (because dat_status + * != DAT_SUCCESS), set *nmore (as we should for timeout) + * and return DAT_TIMEOUT. + */ + +#if defined(DAPL_DBG) + dapls_evd_copy_cq(evd_ptr); /* poll */ + dapl_dbg_log (DAPL_DBG_TYPE_EVD, + "dapl_evd_wait: WAKEUP ERROR (0x%x): EVD %p, CQ %p, events? %d\n", + dat_status,evd_ptr,(void *)evd_ptr->ib_cq_handle, + dapls_rbuf_count(&evd_ptr->pending_event_queue) ); +#endif /* DAPL_DBG */ + + break; } } Index: dapl/udapl/Makefile =================================================================== --- dapl/udapl/Makefile (revision 2941) +++ dapl/udapl/Makefile (working copy) @@ -122,7 +122,8 @@ # ifeq ($(VERBS),openib) PROVIDER = $(TOPDIR)/../openib -CFLAGS += -DOPENIB -DCQ_WAIT_OBJECT +CFLAGS += -DOPENIB +#CFLAGS += -DCQ_WAIT_OBJECT uncomment when fixed CFLAGS += -I/usr/local/include/infiniband endif Index: dapl/common/dapl_evd_resize.c =================================================================== --- dapl/common/dapl_evd_resize.c (revision 2919) +++ dapl/common/dapl_evd_resize.c (working copy) @@ -67,71 +67,139 @@ IN DAT_EVD_HANDLE evd_handle, IN DAT_COUNT evd_qlen ) { - DAPL_IA *ia_ptr; - DAPL_EVD *evd_ptr; - DAT_COUNT pend_cnt; - DAT_RETURN dat_status; + DAPL_IA *ia_ptr; + DAPL_EVD *evd_ptr; + DAT_EVENT *event_ptr; + DAT_EVENT *events; + DAT_EVENT *orig_event; + DAPL_RING_BUFFER free_event_queue; + DAPL_RING_BUFFER pending_event_queue; + DAT_COUNT pend_cnt; + DAT_COUNT i; + DAT_RETURN dat_status; dapl_dbg_log (DAPL_DBG_TYPE_API, "dapl_evd_resize (%p, %d)\n", evd_handle, evd_qlen); if (DAPL_BAD_HANDLE (evd_handle, DAPL_MAGIC_EVD)) { - dat_status = DAT_ERROR (DAT_INVALID_HANDLE,0); - goto bail; + return DAT_ERROR (DAT_INVALID_PARAMETER,DAT_INVALID_ARG1); } evd_ptr = (DAPL_EVD *)evd_handle; ia_ptr = evd_ptr->header.owner_ia; - if ( evd_qlen == evd_ptr->qlen ) + if ((evd_qlen <= 0) || (evd_ptr->qlen > evd_qlen)) { - dat_status = 
DAT_SUCCESS; - goto bail; + dat_status = DAT_ERROR(DAT_INVALID_PARAMETER,DAT_INVALID_ARG2); + goto bail; } if ( evd_qlen > ia_ptr->hca_ptr->ia_attr.max_evd_qlen ) { - dat_status = DAT_ERROR (DAT_INVALID_PARAMETER,DAT_INVALID_ARG2); + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_TEVD); goto bail; } dapl_os_lock(&evd_ptr->header.lock); - /* Don't try to resize if we are actively waiting */ if (evd_ptr->evd_state == DAPL_EVD_STATE_WAITED) { - dapl_os_unlock(&evd_ptr->header.lock); - dat_status = DAT_ERROR (DAT_INVALID_STATE,0); - goto bail; + dat_status = DAT_ERROR(DAT_INVALID_STATE,0); + goto bail_unlock; } pend_cnt = dapls_rbuf_count(&evd_ptr->pending_event_queue); if (pend_cnt > evd_qlen) { - dapl_os_unlock(&evd_ptr->header.lock); - dat_status = DAT_ERROR (DAT_INVALID_STATE,0); - goto bail; + dat_status = DAT_ERROR(DAT_INVALID_STATE,0); + goto bail_unlock; } dat_status = dapls_ib_cq_resize(evd_ptr->header.owner_ia, - evd_ptr, - &evd_qlen); - if (dat_status != DAT_SUCCESS) + evd_ptr, + &evd_qlen); + if (DAT_SUCCESS != dat_status) { + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); + goto bail_unlock; + } + + /* Allocate EVENTs */ + events = (DAT_EVENT *) dapl_os_alloc (evd_qlen * sizeof (DAT_EVENT)); + if (!events) { - dapl_os_unlock(&evd_ptr->header.lock); - goto bail; + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); + goto bail_unlock; } + event_ptr = events; - dat_status = dapls_evd_event_realloc (evd_ptr, evd_qlen); - if (dat_status != DAT_SUCCESS) + /* allocate free event queue */ + dat_status = dapls_rbuf_alloc (&free_event_queue, evd_qlen); + if (DAT_SUCCESS != dat_status) { - dapl_os_unlock(&evd_ptr->header.lock); - goto bail; + dapl_os_free(event_ptr, evd_qlen * sizeof (DAT_EVENT)); + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); + goto bail_unlock; + } + + /* allocate pending event queue */ + dat_status = dapls_rbuf_alloc (&pending_event_queue, evd_qlen); + if 
(DAT_SUCCESS != dat_status) + { + dapl_os_free(event_ptr, evd_qlen * sizeof (DAT_EVENT)); + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); + goto bail_unlock; } + for (i = 0; i < pend_cnt; i++) + { + orig_event = dapls_rbuf_remove(&evd_ptr->pending_event_queue); + if (orig_event == NULL) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, " Inconsistent event queue\n"); + dapl_os_free(event_ptr, evd_qlen * sizeof (DAT_EVENT)); + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); + goto bail_unlock; + } + memcpy(event_ptr, orig_event, sizeof(DAT_EVENT)); + dat_status = dapls_rbuf_add(&pending_event_queue, event_ptr); + if (DAT_SUCCESS != dat_status) { + dapl_os_free(event_ptr, evd_qlen * sizeof (DAT_EVENT)); + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); + goto bail_unlock; + } + event_ptr++; + } + + for (i = pend_cnt; i < evd_qlen; i++) + { + dat_status = dapls_rbuf_add(&free_event_queue,(void *) event_ptr); + if (DAT_SUCCESS != dat_status) { + dapl_os_free(event_ptr, evd_qlen * sizeof (DAT_EVENT)); + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); + goto bail_unlock; + } + event_ptr++; + } + + dapls_rbuf_destroy (&evd_ptr->free_event_queue); + dapls_rbuf_destroy (&evd_ptr->pending_event_queue); + if (evd_ptr->events) + { + dapl_os_free (evd_ptr->events, evd_ptr->qlen * sizeof (DAT_EVENT)); + } + evd_ptr->free_event_queue = free_event_queue; + evd_ptr->pending_event_queue = pending_event_queue; + evd_ptr->events = events; + evd_ptr->qlen = evd_qlen; + +bail_unlock: + dapl_os_unlock(&evd_ptr->header.lock); - bail: + dapl_dbg_log (DAPL_DBG_TYPE_RTN, + "dapl_evd_resize returns 0x%x\n",dat_status); + +bail: + return dat_status; } Index: dapl/openib/TODO =================================================================== --- dapl/openib/TODO (revision 2919) +++ dapl/openib/TODO (working copy) @@ -1,7 +1,7 @@ IB Verbs: - CQ resize? 
-- query call to get current qp state +- query call to get current qp state, remote port number - ibv_get_cq_event() needs timed event call and wakeup - query call to get device attributes - memory window support @@ -9,8 +9,6 @@ DAPL: - reinit EP needs a QP timewait completion notification - add cq_object wakeup, time based cq_object wait when verbs support arrives -- update uDAPL code with real ATS support -- etc, etc. Other: - Shared memory in udapl and kernel module to support? Index: dapl/openib/dapl_ib_util.c =================================================================== --- dapl/openib/dapl_ib_util.c (revision 2919) +++ dapl/openib/dapl_ib_util.c (working copy) @@ -111,27 +111,40 @@ } -/* just get IP address for hostname */ -int dapli_get_addr( char *addr, int addr_len) +/* just get IP address, IPv4 only for now */ +int dapli_get_hca_addr( struct dapl_hca *hca_ptr ) { - struct sockaddr_in *ipv4_addr = (struct sockaddr_in*)addr; - struct hostent *h_ptr; - struct utsname ourname; - - if ( uname( &ourname ) < 0 ) - return 1; - - h_ptr = gethostbyname( ourname.nodename ); - if ( h_ptr == NULL ) + struct sockaddr_in *ipv4_addr; + struct ib_at_completion at_comp; + struct dapl_at_record at_rec; + int status; + DAT_RETURN dat_status; + + ipv4_addr = (struct sockaddr_in*)&hca_ptr->hca_address; + ipv4_addr->sin_family = AF_INET; + ipv4_addr->sin_addr.s_addr = 0; + + at_comp.fn = dapli_ip_comp_handler; + at_comp.context = &at_rec; + at_rec.addr = &hca_ptr->hca_address; + at_rec.wait_object = &hca_ptr->ib_trans.wait_object; + + /* call with async_comp until the sync version works */ + status = ib_at_ips_by_gid(&hca_ptr->ib_trans.gid, &ipv4_addr->sin_addr.s_addr, 1, + &at_comp, &at_rec.req_id); + + if (status < 0) return 1; - - if ( h_ptr->h_addrtype == AF_INET ) { - ipv4_addr = (struct sockaddr_in*) addr; - ipv4_addr->sin_family = AF_INET; - dapl_os_memcpy( &ipv4_addr->sin_addr, h_ptr->h_addr_list[0], 4 ); - } else + + if (status > 0) + 
dapli_ip_comp_handler(at_rec.req_id, (void*)ipv4_addr, status); + + /* wait for answer, 5 seconds max */ + dat_status = dapl_os_wait_object_wait (&hca_ptr->ib_trans.wait_object,5000000); + + if ((dat_status != DAT_SUCCESS ) || (!ipv4_addr->sin_addr.s_addr)) return 1; - + return 0; } @@ -152,14 +165,17 @@ */ int32_t dapls_ib_init (void) { - if (dapli_cm_thread_init()) - return -1; - else - return 0; + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " dapl_ib_init: \n" ); + if (dapli_cm_thread_init() || dapli_at_thread_init()) + return 1; + + return 0; } int32_t dapls_ib_release (void) { + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " dapl_ib_release: \n" ); + dapli_at_thread_destroy(); dapli_cm_thread_destroy(); return 0; } @@ -186,7 +202,6 @@ IN DAPL_HCA *hca_ptr) { struct dlist *dev_list; - DAT_RETURN dat_status = DAT_SUCCESS; dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " open_hca: %s - %p\n", hca_name, hca_ptr ); @@ -217,36 +232,46 @@ ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); return DAT_INTERNAL_ERROR; } - + /* set inline max with enviromment or default, get local lid and gid 0 */ hca_ptr->ib_trans.max_inline_send = dapl_os_get_env_val ( "DAPL_MAX_INLINE", INLINE_SEND_DEFAULT ); - if ( dapli_get_lid(hca_ptr, hca_ptr->port_num, - &hca_ptr->ib_trans.lid )) { + if (dapli_get_lid(hca_ptr, hca_ptr->port_num, + &hca_ptr->ib_trans.lid)) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, " open_hca: IB get LID failed for %s\n", ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); - return DAT_INTERNAL_ERROR; + goto bail; } - if ( dapli_get_gid(hca_ptr, hca_ptr->port_num, 0, - &hca_ptr->ib_trans.gid )) { + if (dapli_get_gid(hca_ptr, hca_ptr->port_num, 0, + &hca_ptr->ib_trans.gid)) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, " open_hca: IB get GID failed for %s\n", ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); - return DAT_INTERNAL_ERROR; + goto bail; } - /* get the IP address of the device */ - if ( dapli_get_addr((char*)&hca_ptr->hca_address, - sizeof(DAT_SOCK_ADDR6) )) { + if (dapli_get_hca_addr(hca_ptr)) { dapl_dbg_log 
(DAPL_DBG_TYPE_ERR, " open_hca: IB get ADDR failed for %s\n", ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); - return DAT_INTERNAL_ERROR; + goto bail; + } + + /* one thread for each device open */ + if (dapli_cq_thread_init(hca_ptr)) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: cq_thread_init failed for %s\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); + goto bail; } + /* initialize cq_lock and wait object */ + dapl_os_lock_init(&hca_ptr->ib_trans.cq_lock); + dapl_os_wait_object_init (&hca_ptr->ib_trans.wait_object); + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " open_hca: %s, port %d, %s %d.%d.%d.%d INLINE_MAX=%d\n", ibv_get_device_name(hca_ptr->ib_trans.ib_dev), hca_ptr->port_num, @@ -257,7 +282,19 @@ ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 24 & 0xff, hca_ptr->ib_trans.max_inline_send ); - return dat_status; + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " open_hca: LID 0x%x GID subnet %016llx id %016llx\n", + hca_ptr->ib_trans.lid, + (unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.subnet_prefix), + (unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.interface_id) ); + + return DAT_SUCCESS; + +bail: + ibv_close_device(hca_ptr->ib_hca_handle); + hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; + return DAT_INTERNAL_ERROR; + } @@ -282,10 +319,14 @@ dapl_dbg_log (DAPL_DBG_TYPE_UTIL," close_hca: %p\n",hca_ptr); if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) { + dapli_cq_thread_destroy(hca_ptr); if (ibv_close_device(hca_ptr->ib_hca_handle)) return(dapl_convert_errno(errno,"ib_close_device")); hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; } + + dapl_os_lock_destroy(&hca_ptr->ib_trans.cq_lock); + return (DAT_SUCCESS); } @@ -448,35 +489,4 @@ return DAT_SUCCESS; } -#ifdef PROVIDER_SPECIFIC_ATTR - -/* - * dapls_set_provider_specific_attr - * - * Input: - * attr_ptr Pointer provider attributes - * - * Output: - * none - * - * Returns: - * void - */ -DAT_NAMED_ATTR ib_attrs[] = { - { - "I_DAT_SEND_INLINE_THRESHOLD", - "128" - }, -}; - -#define 
SPEC_ATTR_SIZE( x ) (sizeof( x ) / sizeof( DAT_NAMED_ATTR)) - -void dapls_set_provider_specific_attr( - IN DAT_PROVIDER_ATTR *attr_ptr ) -{ - attr_ptr->num_provider_specific_attr = SPEC_ATTR_SIZE( ib_attrs ); - attr_ptr->provider_specific_attr = ib_attrs; -} - -#endif Index: dapl/openib/dapl_ib_cm.c =================================================================== --- dapl/openib/dapl_ib_cm.c (revision 2919) +++ dapl/openib/dapl_ib_cm.c (working copy) @@ -70,19 +70,8 @@ static inline uint64_t cpu_to_be64(uint64_t x) { return x; } #endif -#ifndef IB_AT - -#include -#include -#include -#include -#include -#include - -/* iclust-20 hard coded values, network order */ -#define REMOTE_GID "fe80:0000:0000:0000:0002:c902:0000:4071" -#define REMOTE_LID "0002" - +static int g_at_destroy; +static DAPL_OS_THREAD g_at_thread; static int g_cm_destroy; static DAPL_OS_THREAD g_cm_thread; static DAPL_OS_LOCK g_cm_lock; @@ -122,7 +111,7 @@ while (g_cm_destroy) { struct timespec sleep, remain; sleep.tv_sec = 0; - sleep.tv_nsec = 200000000; /* 200 ms */ + sleep.tv_nsec = 10000000; /* 10 ms */ dapl_dbg_log(DAPL_DBG_TYPE_CM, " cm_thread_destroy: waiting for cm_thread\n"); nanosleep (&sleep, &remain); @@ -130,112 +119,70 @@ dapl_dbg_log(DAPL_DBG_TYPE_CM," cm_thread_destroy(%d) exit\n",getpid()); } -static int ib_at_route_by_ip(uint32_t dst_ip, uint32_t src_ip, int tos, uint16_t flags, - struct ib_at_ib_route *ib_route, - struct ib_at_completion *async_comp) -{ - struct dapl_cm_id *conn = (struct dapl_cm_id *)async_comp->context; - - dapl_dbg_log ( - DAPL_DBG_TYPE_CM, - " CM at_route_by_ip: conn %p cm_id %d src %d.%d.%d.%d -> dst %d.%d.%d.%d (%d)\n", - conn,conn->cm_id, - src_ip >> 0 & 0xff, src_ip >> 8 & 0xff, - src_ip >> 16 & 0xff,src_ip >> 24 & 0xff, - dst_ip >> 0 & 0xff, dst_ip >> 8 & 0xff, - dst_ip >> 16 & 0xff,dst_ip >> 24 & 0xff, conn->service_id); - - /* use req_id for loopback indication */ - if (( src_ip == dst_ip ) || ( dst_ip == 0x0100007f )) - async_comp->req_id = 1; - else 
- async_comp->req_id = 0; - - return 1; -} - -static int ib_at_paths_by_route(struct ib_at_ib_route *ib_route, uint32_t mpath_type, - struct ib_sa_path_rec *pr, int npath, - struct ib_at_completion *async_comp) +int dapli_at_thread_init(void) { - struct dapl_cm_id *conn = (struct dapl_cm_id *)async_comp->context; - char *env, *token; - char dgid[40]; - uint16_t *p_gid = (uint16_t*)&ib_route->gid; + DAT_RETURN dat_status; - /* set local path record values and send to remote */ - (void)dapl_os_memzero(pr, sizeof(*pr)); + dapl_dbg_log(DAPL_DBG_TYPE_CM," at_thread_init(%d)\n", getpid()); - pr->slid = htons(conn->hca->ib_trans.lid); - pr->sgid.global.subnet_prefix = conn->hca->ib_trans.gid.global.subnet_prefix; - pr->sgid.global.interface_id = conn->hca->ib_trans.gid.global.interface_id; + /* create thread to process AT async requests */ + dat_status = dapl_os_thread_create(at_thread, NULL, &g_at_thread); + if (dat_status != DAT_SUCCESS) + { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " at_thread_init: failed to create thread\n"); + return 1; + } + return 0; +} - env = getenv("DAPL_REMOTE_LID"); - if ( env == NULL ) - env = REMOTE_LID; - ib_route->lid = strtol(env,NULL,0); +void dapli_at_thread_destroy(void) +{ + dapl_dbg_log(DAPL_DBG_TYPE_CM," at_thread_destroy(%d)\n", getpid()); - env = getenv("DAPL_REMOTE_GID"); - if ( env == NULL ) - env = REMOTE_GID; + /* destroy cr_thread and lock */ + g_at_destroy = 1; + pthread_kill( g_at_thread, SIGUSR1 ); + dapl_dbg_log(DAPL_DBG_TYPE_CM," at_thread_destroy(%d) SIGUSR1 sent\n",getpid()); + while (g_at_destroy) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 10000000; /* 10 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " at_thread_destroy: waiting for at_thread\n"); + nanosleep (&sleep, &remain); + } + dapl_dbg_log(DAPL_DBG_TYPE_CM," at_thread_destroy(%d) exit\n",getpid()); +} - dapl_dbg_log(DAPL_DBG_TYPE_CM, - " ib_at_paths_by_route: remote LID %x GID %s\n", - ib_route->lid,env); +void dapli_ip_comp_handler(uint64_t 
req_id, void *context, int rec_num) +{ + struct dapl_at_record *at_rec = context; - dapl_os_memcpy( dgid, env, 40 ); + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " ip_comp_handler: ctxt %p, req_id %lld rec_num %d\n", + context, req_id, rec_num); - /* get GID with token strings and delimiter */ - token = strtok(dgid,":"); - while (token) { - *p_gid = strtoul(token,NULL,16); - *p_gid = htons(*p_gid); /* convert each token to network order */ - token = strtok(NULL,":"); - p_gid++; - } - - /* set remote lid and gid, req_id is indication of loopback */ - if ( !async_comp->req_id ) { - pr->dlid = htons(ib_route->lid); - pr->dgid.global.subnet_prefix = ib_route->gid.global.subnet_prefix; - pr->dgid.global.interface_id = ib_route->gid.global.interface_id; - } else { - pr->dlid = pr->slid; - pr->dgid.global.subnet_prefix = pr->sgid.global.subnet_prefix; - pr->dgid.global.interface_id = pr->sgid.global.interface_id; - } - - pr->reversible = 0x1000000; - pr->pkey = 0xffff; - pr->mtu = IBV_MTU_1024; - pr->mtu_selector = 2; - pr->rate_selector = 2; - pr->rate = 3; - pr->packet_life_time_selector = 2; - pr->packet_life_time = 2; + if ((at_rec) && ( at_rec->req_id == req_id)) { + dapl_os_wait_object_wakeup(at_rec->wait_object); + return; + } - dapl_dbg_log(DAPL_DBG_TYPE_CM, - " ib_at_paths_by_route: SRC LID 0x%x GID subnet %016llx id %016llx\n", - pr->slid,(unsigned long long)(pr->sgid.global.subnet_prefix), - (unsigned long long)(pr->sgid.global.interface_id) ); - dapl_dbg_log(DAPL_DBG_TYPE_CM, - " ib_at_paths_by_route: DST LID 0x%x GID subnet %016llx id %016llx\n", - pr->dlid,(unsigned long long)(pr->dgid.global.subnet_prefix), - (unsigned long long)(pr->dgid.global.interface_id) ); - - dapli_path_comp_handler( async_comp->req_id, (void*)conn, 1); - - return 0; + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " ip_comp_handler: at_rec->req_id %lld != req_id %lld\n", + at_rec->req_id, req_id ); } -#endif /* ifndef IB_AT */ - static void dapli_path_comp_handler(uint64_t req_id, void *context, int 
rec_num) { struct dapl_cm_id *conn = context; int status; ib_cm_events_t event; + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " path_comp_handler: ctxt %p, req_id %lld rec_num %d\n", + context, req_id, rec_num); + if (rec_num <= 0) { dapl_dbg_log(DAPL_DBG_TYPE_CM, " path_comp_handler: resolution err %d retry %d\n", @@ -249,7 +196,7 @@ status = ib_at_paths_by_route(&conn->dapl_rt, 0, &conn->dapl_path, 1, - &conn->dapl_comp); + &conn->dapl_comp, &conn->dapl_comp.req_id); if (status) { dapl_dbg_log(DAPL_DBG_TYPE_CM, " path_by_route: err %d id %lld\n", @@ -287,6 +234,21 @@ int status; ib_cm_events_t event; + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " rt_comp_handler: conn %p, req_id %lld rec_num %d\n", + conn, req_id, rec_num); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " rt_comp_handler: SRC GID subnet %016llx id %016llx\n", + (unsigned long long)cpu_to_be64(conn->dapl_rt.sgid.global.subnet_prefix), + (unsigned long long)cpu_to_be64(conn->dapl_rt.sgid.global.interface_id) ); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " rt_comp_handler: DST GID subnet %016llx id %016llx\n", + (unsigned long long)cpu_to_be64(conn->dapl_rt.dgid.global.subnet_prefix), + (unsigned long long)cpu_to_be64(conn->dapl_rt.dgid.global.interface_id) ); + + if (rec_num <= 0) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapl_rt_comp_handler: rec %d retry %d\n", @@ -298,7 +260,8 @@ } status = ib_at_route_by_ip(((struct sockaddr_in *)&conn->r_addr)->sin_addr.s_addr, - 0, 0, 0, &conn->dapl_rt, &conn->dapl_comp); + 0, 0, 0, &conn->dapl_rt, + &conn->dapl_comp,&conn->dapl_comp.req_id); if (status < 0) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, "dapl_rt_comp_handler: " "ib_at_route_by_ip failed with status %d\n", @@ -306,9 +269,16 @@ event = IB_CME_DESTINATION_UNREACHABLE; goto bail; } - if (status == 1) dapli_rt_comp_handler(conn->dapl_comp.req_id, conn, 1); + + return; + } + + if (!conn->dapl_rt.dgid.global.subnet_prefix || req_id != conn->dapl_comp.req_id) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " dapl_rt_comp_handler: ERROR: unexpected callback 
req_id=%d(%d)\n", + req_id, conn->dapl_comp.req_id ); return; } @@ -316,7 +286,7 @@ conn->dapl_comp.context = conn; conn->retries = 0; status = ib_at_paths_by_route(&conn->dapl_rt, 0, &conn->dapl_path, 1, - &conn->dapl_comp); + &conn->dapl_comp, &conn->dapl_comp.req_id); if (status) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, "dapl_rt_comp_handler: ib_at_paths_by_route " @@ -346,8 +316,6 @@ ib_cm_destroy_id(conn->cm_id); if (conn->ep) conn->ep->cm_handle = IB_INVALID_HANDLE; - if (conn->sp) - conn->sp->cm_srvc_handle = IB_INVALID_HANDLE; /* take off the CM thread work queue and free */ dapl_os_lock( &g_cm_lock ); @@ -621,10 +589,8 @@ } /* something to catch the signal */ -static void cm_handler(int signum) +static void ib_sig_handler(int signum) { - dapl_dbg_log (DAPL_DBG_TYPE_CM," cm_thread(%d,0x%x): ENTER cm_handler %d\n", - getpid(),g_cm_thread,signum); return; } @@ -643,7 +609,7 @@ sigemptyset(&sigset); sigaddset(&sigset, SIGUSR1); pthread_sigmask(SIG_UNBLOCK, &sigset, NULL); - signal( SIGUSR1, cm_handler); + signal( SIGUSR1, ib_sig_handler); dapl_os_lock( &g_cm_lock ); while (!g_cm_destroy) { @@ -667,7 +633,7 @@ dapl_dbg_log(DAPL_DBG_TYPE_CM, " cm_thread: GET EVENT fd=%d n=%d\n", ib_cm_get_fd(),ret); - if (ib_cm_event_get(&event)) { + if (ib_cm_event_get_timed(0,&event)) { dapl_dbg_log(DAPL_DBG_TYPE_CM, " cm_thread: ERR %s eventi_get on %d\n", strerror(errno), ib_cm_get_fd() ); @@ -732,6 +698,33 @@ g_cm_destroy = 0; } +/* async AT processing thread */ +void at_thread(void *arg) +{ + sigset_t sigset; + + dapl_dbg_log (DAPL_DBG_TYPE_CM, + " at_thread(%d,0x%x): ENTER: at_fd %d\n", + getpid(), g_at_thread, ib_at_get_fd()); + + sigemptyset(&sigset); + sigaddset(&sigset, SIGUSR1); + pthread_sigmask(SIG_UNBLOCK, &sigset, NULL); + signal(SIGUSR1, ib_sig_handler); + + while (!g_at_destroy) { + /* poll forever until callback or signal */ + if (ib_at_callback_get_timed(-1) < 0) { + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " at_thread: SIG? 
ret=%s, destroy=%d\n", + strerror(errno), g_at_destroy ); + } + dapl_dbg_log(DAPL_DBG_TYPE_CM," at_thread: callback woke\n"); + } + dapl_dbg_log(DAPL_DBG_TYPE_CM," at_thread(%d) EXIT \n", getpid()); + g_at_destroy = 0; +} + /************************ DAPL provider entry points **********************/ /* @@ -826,33 +819,34 @@ conn->dapl_comp.context = conn; conn->retries = 0; dapl_os_memcpy(&conn->r_addr, r_addr, sizeof(DAT_SOCK_ADDR6)); + + /* put on CM thread work queue */ + dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&conn->entry); + dapl_os_lock( &g_cm_lock ); + dapl_llist_add_tail(&g_cm_list, + (DAPL_LLIST_ENTRY*)&conn->entry, conn); + dapl_os_unlock(&g_cm_lock); status = ib_at_route_by_ip( ((struct sockaddr_in *)&conn->r_addr)->sin_addr.s_addr, ((struct sockaddr_in *)&conn->hca->hca_address)->sin_addr.s_addr, - 0, 0, &conn->dapl_rt, &conn->dapl_comp); + 0, 0, &conn->dapl_rt, &conn->dapl_comp, &conn->dapl_comp.req_id); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, " connect: at_route ret=%d,%s req_id %d GID %016llx %016llx\n", + status, strerror(errno), conn->dapl_comp.req_id, + (unsigned long long)cpu_to_be64(conn->dapl_rt.dgid.global.subnet_prefix), + (unsigned long long)cpu_to_be64(conn->dapl_rt.dgid.global.interface_id) ); if (status < 0) { dat_status = dapl_convert_errno(errno,"ib_at_route_by_ip"); - goto destroy; + dapli_destroy_cm_id(conn); + return dat_status; } - if (status == 1) - dapli_rt_comp_handler(conn->dapl_comp.req_id, conn, 1); - - /* put on CM thread work queue */ - dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&conn->entry); - dapl_os_lock( &g_cm_lock ); - dapl_llist_add_tail(&g_cm_list, - (DAPL_LLIST_ENTRY*)&conn->entry, conn); - dapl_os_unlock(&g_cm_lock); + if (status > 0) + dapli_rt_comp_handler(conn->dapl_comp.req_id, conn, status); return DAT_SUCCESS; - -destroy: - dapli_destroy_cm_id(conn); - return dat_status; - } /* @@ -992,6 +986,13 @@ conn->hca = ia_ptr->hca_ptr; conn->service_id = ServiceID; + /* put on CM thread work queue */ + 
dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&conn->entry); + dapl_os_lock( &g_cm_lock ); + dapl_llist_add_tail(&g_cm_list, + (DAPL_LLIST_ENTRY*)&conn->entry, conn); + dapl_os_unlock(&g_cm_lock); + dapl_dbg_log(DAPL_DBG_TYPE_EP, " setup_listener(conn=%p cm_id=%d)\n", sp_ptr->cm_srvc_handle,conn->cm_id); @@ -1003,19 +1004,13 @@ dat_status = DAT_CONN_QUAL_IN_USE; else dat_status = DAT_INSUFFICIENT_RESOURCES; - /* success */ - } else { - /* put on CM thread work queue */ - dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&conn->entry); - dapl_os_lock( &g_cm_lock ); - dapl_llist_add_tail(&g_cm_list, - (DAPL_LLIST_ENTRY*)&conn->entry, conn); - dapl_os_unlock(&g_cm_lock); + + dapli_destroy_cm_id(conn); return dat_status; } - dapli_destroy_cm_id(conn); - return dat_status; + /* success */ + return DAT_SUCCESS; } @@ -1047,9 +1042,11 @@ " remove_listener(ia_ptr %p sp_ptr %p cm_ptr %p)\n", ia_ptr, sp_ptr, conn ); - if (sp_ptr->cm_srvc_handle != IB_INVALID_HANDLE) + if (conn != IB_INVALID_HANDLE) { + sp_ptr->cm_srvc_handle = NULL; dapli_destroy_cm_id(conn); - + } + return DAT_SUCCESS; } Index: dapl/openib/dapl_ib_util.h =================================================================== --- dapl/openib/dapl_ib_util.h (revision 2919) +++ dapl/openib/dapl_ib_util.h (working copy) @@ -53,6 +53,7 @@ #include #include #include +#include /* Typedefs to map common DAPL provider types to IB verbs */ typedef struct ibv_qp *ib_qp_handle_t; @@ -68,8 +69,8 @@ #define IB_RC_RETRY_COUNT 7 #define IB_RNR_RETRY_COUNT 7 -#define IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */ -#define IB_MAX_CM_RETRIES 4 +#define IB_CM_RESPONSE_TIMEOUT 18 /* 1 sec */ +#define IB_MAX_CM_RETRIES 7 #define IB_REQ_MRA_TIMEOUT 27 /* a little over 9 minutes */ #define IB_MAX_AT_RETRY 3 @@ -92,21 +93,12 @@ IB_CME_BROKEN } ib_cm_events_t; -#ifndef IB_AT -/* implement a quick hack to exchange GID/LID's until user IB_AT arrives */ -struct ib_at_ib_route { - union ibv_gid gid; - uint16_t lid; +struct dapl_at_record { + uint64_t req_id; + 
DAT_SOCK_ADDR6 *addr; + DAPL_OS_WAIT_OBJECT *wait_object; }; -struct ib_at_completion { - void (*fn)(uint64_t req_id, void *context, int rec_num); - void *context; - uint64_t req_id; -}; - -#endif - /* * dapl_llist_entry in dapl.h but dapl.h depends on provider * typedef's in this file first. move dapl_llist_entry out of dapl.h @@ -122,6 +114,7 @@ struct dapl_cm_id { struct ib_llist_entry entry; DAPL_OS_LOCK lock; + DAPL_OS_WAIT_OBJECT wait_object; int retries; int destroy; int in_callback; @@ -238,6 +231,10 @@ { struct ibv_device *ib_dev; ib_cq_handle_t ib_cq_empty; + DAPL_OS_LOCK cq_lock; + DAPL_OS_WAIT_OBJECT wait_object; + int cq_destroy; + DAPL_OS_THREAD cq_thread; int max_inline_send; uint16_t lid; union ibv_gid gid; @@ -257,11 +254,18 @@ void cm_thread (void *arg); int dapli_cm_thread_init(void); void dapli_cm_thread_destroy(void); +void at_thread (void *arg); +int dapli_at_thread_init(void); +void dapli_at_thread_destroy(void); +void cq_thread (void *arg); +int dapli_cq_thread_init(struct dapl_hca *hca_ptr); +void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr); -int dapli_get_lid(struct dapl_hca *hca_ptr, int port, uint16_t *lid ); +int dapli_get_lid(struct dapl_hca *hca_ptr, int port, uint16_t *lid); int dapli_get_gid(struct dapl_hca *hca_ptr, int port, int index, - union ibv_gid *gid ); -int dapli_get_addr(char *addr, int addr_len); + union ibv_gid *gid); +int dapli_get_hca_addr(struct dapl_hca *hca_ptr); +void dapli_ip_comp_handler(uint64_t req_id, void *context, int rec_num); DAT_RETURN dapls_modify_qp_state ( IN ib_qp_handle_t qp_handle, Index: dapl/openib/README =================================================================== --- dapl/openib/README (revision 2919) +++ dapl/openib/README (working copy) @@ -39,18 +39,16 @@ server: dtest -s client: dtest -h hostname + +Testing: dtest, dapltest - cl.sh regress.sh -setup/known issues: - - First drop with uCM (without IBAT), tested with simple dtest across 2 nodes. 
- hand rolled path records require remote LID and GID set via enviroment: +Setup: - export DAPL_REMOTE_GID "fe80:0000:0000:0000:0002:c902:0000:4071" - export DAPL_REMOTE_LID "0002" + Third drop of code, includes uCM and uAT support. + NOTE: requires both uCM and uAT libraries and device modules from trunk. - Also, hard coded (RTR) for use with port 1 only. - +Known issues: no memory windows support in ibverbs, dat_create_rmr fails. + some uCM scale up issues with an 8 thread dapltest in regress.sh + hard coded modify QP RTR to port 1, waiting for ib_cm_init_qp_attr call. - - Index: dapl/openib/dapl_ib_cq.c =================================================================== --- dapl/openib/dapl_ib_cq.c (revision 2919) +++ dapl/openib/dapl_ib_cq.c (working copy) @@ -50,9 +50,96 @@ #include "dapl_adapter_util.h" #include "dapl_lmr_util.h" #include "dapl_evd_util.h" +#include "dapl_ring_buffer_util.h" #include +#include +int dapli_cq_thread_init(struct dapl_hca *hca_ptr) +{ + DAT_RETURN dat_status; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%p)\n", hca_ptr); + + /* create thread to process inbound connect request */ + dat_status = dapl_os_thread_create( cq_thread, (void*)hca_ptr, &hca_ptr->ib_trans.cq_thread); + if (dat_status != DAT_SUCCESS) + { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " cq_thread_init: failed to create thread\n"); + return 1; + } + return 0; +} + +void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr) +{ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%p)\n", hca_ptr); + + /* destroy cr_thread and lock */ + hca_ptr->ib_trans.cq_destroy = 1; + pthread_kill(hca_ptr->ib_trans.cq_thread, SIGUSR1); + dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p) SIGUSR1 sent\n",hca_ptr); + while (hca_ptr->ib_trans.cq_destroy != 2) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 10000000; /* 10 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " cq_thread_destroy: waiting for cq_thread\n"); + nanosleep (&sleep, &remain); + } + 
dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d) exit\n",getpid()); + return; +} + +/* something to catch the signal */ +static void ib_cq_handler(int signum) +{ + return; +} + +void cq_thread( void *arg ) +{ + struct dapl_hca *hca_ptr = arg; + struct dapl_evd *evd_ptr; + struct ibv_cq *ibv_cq = NULL; + sigset_t sigset; + int status = 0; + + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL," cq_thread: ENTER hca %p\n",hca_ptr); + + sigemptyset(&sigset); + sigaddset(&sigset,SIGUSR1); + pthread_sigmask(SIG_UNBLOCK, &sigset, NULL); + signal(SIGUSR1, ib_cq_handler); + + /* wait on DTO event, or signal to abort */ + while (!hca_ptr->ib_trans.cq_destroy) { + + struct pollfd cq_poll = { + .fd = hca_ptr->ib_hca_handle->cq_fd[0], + .events = POLLIN, + .revents = 0 + }; + status = poll(&cq_poll, 1, -1); + if ((status == 1) && + (!ibv_get_cq_event(hca_ptr->ib_hca_handle, 0, &ibv_cq, (void*)&evd_ptr))) { + + if (DAPL_BAD_HANDLE(evd_ptr, DAPL_MAGIC_EVD)) + continue; + + /* process DTO event via callback */ + dapl_evd_dto_callback ( evd_ptr->header.owner_ia->hca_ptr->ib_hca_handle, + evd_ptr->ib_cq_handle, + (void*)evd_ptr ); + } else { + + } + } + hca_ptr->ib_trans.cq_destroy = 2; + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca %p \n", hca_ptr); + return; +} /* * Map all verbs DTO completion codes to the DAT equivelent. 
* @@ -410,9 +497,9 @@ IN DAPL_EVD *evd_ptr, IN ib_wait_obj_handle_t *p_cq_wait_obj_handle ) { - dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + dapl_dbg_log ( DAPL_DBG_TYPE_CM, " cq_object_create: (%p)=%p\n", - p_cq_wait_obj_handle, *p_cq_wait_obj_handle); + p_cq_wait_obj_handle, evd_ptr ); /* set cq_wait object to evd_ptr */ *p_cq_wait_obj_handle = evd_ptr; @@ -447,33 +534,86 @@ { DAPL_EVD *evd_ptr = p_cq_wait_obj_handle; ib_cq_handle_t cq = evd_ptr->ib_cq_handle; - struct ibv_cq *ibv_cq; - void *ibv_ctx; - int status = -ETIMEDOUT; + struct ibv_cq *ibv_cq = NULL; + void *ibv_ctx = NULL; + int status = 0; - dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + dapl_dbg_log ( DAPL_DBG_TYPE_CM, " cq_object_wait: dev %p evd %p cq %p, time %d\n", cq->context, evd_ptr, cq, timeout ); - /* Multiple EVD's sharing one event handle for now */ - if (cq) { - struct pollfd cq_poll = { - .fd = cq->context->cq_fd[0], - .events = POLLIN + /* Multiple EVD's sharing one event handle for now until uverbs supports more */ + + /* + * This makes it very inefficient and tricky to manage multiple CQ per device open + * For example: 4 threads waiting on separate CQ events will all be woke when + * a CQ event fires. So the poll wakes up and the first thread to get to the + * the get_cq_event wins and the other 3 will block. The dapl_evd_wait code + * above will immediately do a poll_cq after returning from CQ wait and if + * nothing on the queue will call this wait again and go back to sleep. So + * as long as they all wake up, a mutex is held around the get_cq_event + * so no blocking occurs and they all return then everything should work. + * Of course, the timeout needs adjusted on the threads that go back to sleep. 
+ */ + while (cq) { + struct pollfd cq_poll = { + .fd = cq->context->cq_fd[0], + .events = POLLIN, + .revents = 0 }; - int timeout_ms = -1; + int timeout_ms = -1; if (timeout != DAT_TIMEOUT_INFINITE) timeout_ms = timeout/1000; - + + /* check if another thread processed the event already, pending queue > 0 */ + dapl_os_lock( &evd_ptr->header.owner_ia->hca_ptr->ib_trans.cq_lock ); + if (dapls_rbuf_count(&evd_ptr->pending_event_queue)) { + dapl_os_unlock( &evd_ptr->header.owner_ia->hca_ptr->ib_trans.cq_lock ); + break; + } + dapl_os_unlock( &evd_ptr->header.owner_ia->hca_ptr->ib_trans.cq_lock ); + + dapl_dbg_log ( DAPL_DBG_TYPE_CM," cq_object_wait: polling\n"); status = poll(&cq_poll, 1, timeout_ms); - if (status == 1) - status = ibv_get_cq_event(cq->context, - 0, &ibv_cq, &ibv_ctx); - } - dapl_dbg_log (DAPL_DBG_TYPE_UTIL, - " cq_object_wait: RET cq %p ibv_cq %p ibv_ctx %p %x\n", - cq,ibv_cq,ibv_ctx,status); + dapl_dbg_log ( DAPL_DBG_TYPE_CM," cq_object_wait: poll returned status=%d\n",status); + + /* + * If poll with timeout wakes then hold mutex around a poll with no timeout + * so subsequent get_cq_events will be guaranteed not to block + * If the event does not belong to this EVD then put it on proper EVD pending + * queue under the mutex. 
+ */ + if (status == 1) { + dapl_os_lock( &evd_ptr->header.owner_ia->hca_ptr->ib_trans.cq_lock ); + status = poll(&cq_poll, 1, 0); + if (status == 1) { + status = ibv_get_cq_event(cq->context, + 0, &ibv_cq, &ibv_ctx); + + /* if event is not ours, put on proper evd pending queue */ + /* force another wakeup */ + if ((ibv_ctx != evd_ptr ) && + (!DAPL_BAD_HANDLE(ibv_ctx, DAPL_MAGIC_EVD))) { + dapl_dbg_log (DAPL_DBG_TYPE_CM, + " cq_object_wait: ibv_ctx %p != evd %p\n", + ibv_ctx, evd_ptr); + dapls_evd_copy_cq((struct evd_ptr*)ibv_ctx); + dapl_os_unlock( &evd_ptr->header.owner_ia->hca_ptr->ib_trans.cq_lock ); + continue; + } + } + dapl_os_unlock( &evd_ptr->header.owner_ia->hca_ptr->ib_trans.cq_lock ); + break; + + } else if (status == 0) { + status = ETIMEDOUT; + break; + } + } + dapl_dbg_log (DAPL_DBG_TYPE_CM, + " cq_object_wait: RET evd %p cq %p ibv_cq %p ibv_ctx %p %s\n", + evd_ptr, cq,ibv_cq,ibv_ctx,strerror(errno)); return(dapl_convert_errno(status,"cq_wait_object_wait")); From yaronh at voltaire.com Mon Aug 1 15:12:00 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Tue, 2 Aug 2005 01:12:00 +0300 Subject: [openib-general] Re: [Rdma-developers] Meeting(07/22) summary:OpenRDMA community development discussion Message-ID: <35EA21F54A45CB47B879F21A91F4862F6C5AFF@taurus.voltaire.com> > -----Original Message----- > From: Fab Tillier [mailto:ftillier at silverstorm.com] > Sent: Monday, August 01, 2005 1:14 PM > > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > > > Yaron Haviv wrote: > > > we can spend time and discuss theories and intentions, at the end of > the > > > day an iWarp RNIC cannot just reside under IB-Verbs without major > > > changes to the overall infrastructure. > > > > I don't disagree with having a common connection library that supports > both > > IB and iWarp, or that you could derive a solution from kDAPL. 
But based on the proposed APIs that I've seen, I believe that an RNIC could
> > reside under IB verbs with minimal changes, and would likely be the best
> > engineered solution for including RNIC support in Linux.
>
> Just for clarity, when you say verbs you exclude connection
> establishment/management, right?
>
> I think keeping the two distinct is important in this discussion, as it seems
> there is some confusion - some people refer to verbs as verbs + CM, others as
> just verbs.
>
> Here's my take from the discussions so far:
> - RNICs can probably be made to work under the IB verbs (with changes of
> course).
> - RNICs can probably not be made to work under the IB CM (not that I've seen
> this suggested).
>

Fab,

I made the same distinction between pure verbs and the broader API (+CM, SA, ..).

I agree that the pure send, receive, .. verbs are similar, with minor differences, and we may just want to adopt them with minor changes.

On the other hand, it would not be efficient to try to bend the iWarp CM model to the (complex) IB one. It would be better to use a simpler model, such as the one in DAPL, that fits both camps.

In IB we need to use a CM and a bunch of SA queries, whereas the ULP doesn't really need all that and can make do with a simple BSD-like connection request (which may map to a more complex IB or iWarp model underneath).

The DAPL/BSD-like connection mechanism offers enough ways to imply security, QoS, etc. (using a src/dst IP, network implied from IP, and kDAPL QoS or BSD TOS, ..), so a user doesn't need direct access to the SA for connections; at most we can add some flags to it.

Yaron

> - Fab

From ardavis at ichips.intel.com Mon Aug 1 15:12:09 2005
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Mon, 01 Aug 2005 15:12:09 -0700
Subject: [openib-general] Re: [PATCH] uDAPL dapltest
In-Reply-To:
References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B85@taurus.voltaire.com>
Message-ID: <42EE9E39.1080408@ichips.intel.com>

James Lentini wrote:
>
> Ok.
I reference the x86_64 build problem in my log message and > committed it in revision 2940. > > On Mon, 1 Aug 2005, Hal Rosenstock wrote: > >> I saw this on x86_64. Don't know about IA64. >> >> -- Hal >> >> -----Original Message----- >> From: openib-general-bounces at openib.org on behalf of James Lentini >> Sent: Mon 8/1/2005 1:39 PM >> To: Arlin Davis >> Cc: openib >> Subject: [openib-general] Re: [PATCH] uDAPL dapltest >> >> >> Where was the build problem being seen? x86_64 or IA64? I want to >> record it in the checkin log message. > both x86_64 and IA64 -arlin >> >> james >> >> On Thu, 28 Jul 2005, Arlin Davis wrote: >> >>> James, >>> >>> Patch to fix build problem. >>> >>> -arlin >>> >>> >>> Signed-off by: Arlin Davis >>> >>> Index: dapltest/test/dapl_bpool.c >>> =================================================================== >>> --- dapltest/test/dapl_bpool.c (revision 2930) >>> +++ dapltest/test/dapl_bpool.c (working copy) >>> @@ -363,8 +363,8 @@ >>> "BPOOL alloc_size %x\n", >>> (int) bpool_ptr->alloc_size); >>> DT_Tdep_PT_Printf (phead, >>> - "BPOOL pz_handle %x\n", >>> - (int) bpool_ptr->pz_handle); >>> + "BPOOL pz_handle %p\n", >>> + bpool_ptr->pz_handle); >>> DT_Tdep_PT_Printf (phead, >>> "BPOOL num_segs %x\n", >>> (int) bpool_ptr->num_segs); >>> >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> > From tduffy at sun.com Mon Aug 1 15:16:30 2005 From: tduffy at sun.com (Tom Duffy) Date: Mon, 01 Aug 2005 15:16:30 -0700 Subject: [openib-general] kdapl build error on ia64 In-Reply-To: <42EAA39B.2090501@sgi.com> References: <42EAA39B.2090501@sgi.com> Message-ID: <1122934590.15026.42.camel@duffman> On Fri, 2005-07-29 at 16:46 -0500, John Partridge wrote: > I searched the archives and don't see a solution, but I think I've tracked it down > to a bug in 
ulp/kdapl/ib/dapl_util.h > > root on mig133 > diff -ruN dapl_util.h dapl_util.h-johnip Please use diff -ruNp. See FAQ question 10. > =============== diff =============================== > > --- dapl_util.h 2005-07-29 16:36:17.514669886 -0500 > +++ dapl_util.h-johnip 2005-07-29 16:37:11.514578548 -0500 > @@ -71,7 +71,7 @@ > > #ifdef __ia64__ > > - current_value = ia64_cmpxchg("acq", v, match_value, new_value, 4); > + current_value = ia64_cmpxchg(acq, v, match_value, new_value, 4); > > #elif defined (__PPC__) Please kill dapl_os_atomic_assign(). Just use cmpxchg() directly instead. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From johnip at sgi.com Mon Aug 1 15:23:00 2005 From: johnip at sgi.com (John Partridge) Date: Mon, 01 Aug 2005 17:23:00 -0500 Subject: [openib-general] kdapl build error on ia64 In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> Message-ID: <42EEA0C4.5000707@sgi.com> Hal/James, Here is a patch that fixes the ia64 build problem and the warnings I had previously posted. 
The patch ONLY fixes the problems in the trunk tree ( apply from the trunk subdir) and NOT in users/jlentini, the following files also need to be fixed (should be easy enough for James to hack the patch file to do it) :- users/jlentini/linux-kernel/dat-provider/dapl_openib_cm.c users/jlentini/linux-kernel/dat-provider/dapl_util.h The patch is based on svn revision 2935 (copy also attached to email) =================== patch begin ============================== --- src/linux-kernel/infiniband/ulp/kdapl/ib/dapl_openib_cm.c 2005-07-26 00:00:10.000000000 -0500 +++ src/linux-kernel/infiniband/ulp/kdapl/ib/dapl_openib_cm.c 2005-08-01 11:56:25.240418715 -0500 @@ -342,7 +342,7 @@ &cm_ctx->dapl_comp); if (status) { printk(KERN_ERR "dapl_path_comp_handler: " - "ib_at_paths_by_route returned %d id %lld\n", + "ib_at_paths_by_route returned %d id %lu\n", status, cm_ctx->dapl_comp.req_id); event = DAT_CONNECTION_EVENT_BROKEN; goto error; @@ -413,7 +413,7 @@ &cm_ctx->dapl_comp); if (status) { printk(KERN_ERR "dapl_rt_comp_handler: ib_at_paths_by_route " - "returned %d id %lld\n", status, cm_ctx->dapl_comp.req_id); + "returned %d id %lu\n", status, cm_ctx->dapl_comp.req_id); event = DAT_CONNECTION_EVENT_BROKEN; goto error; } --- src/linux-kernel/infiniband/ulp/kdapl/ib/dapl_util.h 2005-07-26 00:00:10.000000000 -0500 +++ src/linux-kernel/infiniband/ulp/kdapl/ib/dapl_util.h 2005-08-01 11:12:16.418589653 -0500 @@ -71,7 +71,7 @@ #ifdef __ia64__ - current_value = ia64_cmpxchg("acq", v, match_value, new_value, 4); + current_value = ia64_cmpxchg(acq, v, match_value, new_value, 4); #elif defined (__PPC__) ============================== patch end ================================= I checked that the patch does not cause a build problem on ia32. I tested that the modules load OK on ia64 and ia32. Hope this helps. John Hal Rosenstock wrote: > Hi John, > > My normal email sending is not working right now so I am using an alternate method. > Hope the formatting comes through OK. 
> > On Fri, 2005-07-29 at 17:46, John Partridge wrote: > >>With this fix the ia64 modules all build to completion with just a >>couple of warnings :- > > >> CC [M] drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.o >>drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c: In function `dapl_path_comp_handler': >>drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c:346: warning: long long int format, u64 arg (arg 3) >>drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c: In function `dapl_rt_comp_handler': >>drivers/infiniband/ulp/kdapl/ib/dapl_openib_cm.c:416: warning: long long int format, u64 arg (arg 3) > > > Can you try this patch and see if it removes the warnings ? > > Thanks. > > -- Hal > > Index: dapl_openib_cm.c > =================================================================== > --- dapl_openib_cm.c (revision 2935) > +++ dapl_openib_cm.c (working copy) > @@ -341,9 +341,15 @@ static void dapl_path_comp_handler(u64 r > &cm_ctx->dapl_path, 1, > &cm_ctx->dapl_comp); > if (status) { > +#if defined(__ia64__) > + printk(KERN_ERR "dapl_path_comp_handler: " > + "ib_at_paths_by_route returned %d id %ld\n", > + status, cm_ctx->dapl_comp.req_id); > +#else > printk(KERN_ERR "dapl_path_comp_handler: " > "ib_at_paths_by_route returned %d id %lld\n", > status, cm_ctx->dapl_comp.req_id); > +#endif > event = DAT_CONNECTION_EVENT_BROKEN; > goto error; > } > @@ -412,8 +418,13 @@ static void dapl_rt_comp_handler(u64 req > status = ib_at_paths_by_route(&cm_ctx->dapl_rt, 0, &cm_ctx->dapl_path, 1, > &cm_ctx->dapl_comp); > if (status) { > +#if defined(__ia64__) > + printk(KERN_ERR "dapl_rt_comp_handler: ib_at_paths_by_route " > + "returned %d id %ld\n", status, cm_ctx->dapl_comp.req_id); > +#else > printk(KERN_ERR "dapl_rt_comp_handler: ib_at_paths_by_route " > "returned %d id %lld\n", status, cm_ctx->dapl_comp.req_id); > +#endif > event = DAT_CONNECTION_EVENT_BROKEN; > goto error; > } -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com 
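The %lld / %ld / %lu mismatch discussed in this thread arises because the 64-bit type is defined as long on some architectures and long long on others. One common way to avoid per-architecture #if branches, sketched here in userspace C, is to cast the value to unsigned long long and print it with %llu. The helper name format_req_id and the use of uint64_t as a stand-in for the kernel's u64 are illustrative assumptions, not code from the kDAPL sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the kDAPL sources): formats a status code and
 * a 64-bit request id the same way on every architecture. uint64_t stands in
 * for the kernel's u64.
 */
static int format_req_id(char *buf, size_t len, int status, uint64_t req_id)
{
    assert(buf != NULL && len > 0);

    /*
     * The explicit cast makes the argument type match %llu regardless of
     * whether the platform defines its 64-bit type as long or long long.
     */
    return snprintf(buf, len, "ib_at_paths_by_route returned %d id %llu",
                    status, (unsigned long long)req_id);
}
```

With a call such as format_req_id(buf, sizeof(buf), status, req_id), no #if defined(__ia64__) branch is needed; in kernel code the same cast would be applied to the printk argument.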
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dapl.patch
URL:

From tduffy at sun.com Mon Aug 1 16:22:42 2005
From: tduffy at sun.com (Tom Duffy)
Date: Mon, 01 Aug 2005 16:22:42 -0700
Subject: [openib-general] kdapl build error on ia64
In-Reply-To: <42EEA0C4.5000707@sgi.com>
References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> <42EEA0C4.5000707@sgi.com>
Message-ID: <1122938562.15026.54.camel@duffman>

On Mon, 2005-08-01 at 17:23 -0500, John Partridge wrote:
> Hal/James,
>
> Here is a patch that fixes the ia64 build problem and the warnings

Please sign your patches.

-tduffy

From rajib.majumder at csfb.com Mon Aug 1 20:04:52 2005
From: rajib.majumder at csfb.com (Majumder, Rajib)
Date: Tue, 2 Aug 2005 11:04:52 +0800
Subject: [openib-general] IB Query
Message-ID:

Hi,

I have a query regarding SDP/IB. I have two processes, one TCP client and one TCP server, both running on the same physical host. We have an IB HCA installed on this host.

Both processes communicate using SDP.

Will there be any performance gain?

Any clarification is appreciated.
Thanks
Rajib

From guyg at voltaire.com Mon Aug 1 23:52:31 2005
From: guyg at voltaire.com (Guy German)
Date: Tue, 2 Aug 2005 09:52:31 +0300
Subject: [openib-general][PATCH][kdapl]: inc/dec module ref count
Message-ID:

Hi,

James Lentini wrote:
> On Sun, 31 Jul 2005, Muli Ben-Yehuda wrote:
>
>> On Sun, Jul 31, 2005 at 06:08:11PM +0300, Guy German wrote:
>>> Hi Muli,
>>>
>>> Wouldn't it be solved by moving the try_module_get call to the
>>> beginning of the dapl_ia_open function ?
>>
>> No. Even if it's theoretically the first line in the function, the
>> compiler can and will create a function prologue that will run before
>> you raise the reference count (same thing with decrementing the ref
>> count at module unload time and the function epilogue). You must do
>> module reference counting before executing even one instruction from
>> the module.
>>
>>> You are right - try_module_get() can fail when the module is not
>>> ready to be entered. should be something like:
>>> + if (!try_module_get(THIS_MODULE))
>>> + return -EBUSY;
>>
>> Yes - but at the caller, not callee.
>
> Putting this in the caller (i.e. dat_ia_open and dat_ia_close) does
> sound like a better option.

Sounds good to me too.

>
> Guy, can you investigate why the ib_mthca module doesn't have a
> reference count and see if it relates to hotplug? I think kdapl_ib and
> ib_mthca should have the same policy regarding this issue.

As I understand it, consumers work over ib_core and not over ib_mthca directly. So, if ib_mthca goes down (for a hotplug reason), ib_core consumers can get notified of the event by an upcall.
If you take this model to dapl, I think it would influence the way dapl consumers need to do things (like registering an upcall and knowing what to do in case kdapl_ib is down). I also don't know how many consumers really need "dapl hotplug"...

Guy

From guyg at voltaire.com Tue Aug 2 00:13:07 2005
From: guyg at voltaire.com (Guy German)
Date: Tue, 2 Aug 2005 10:13:07 +0300
Subject: [openib-general][PATCH][kdapl]: adding DAT_MEM_TYPE_IA support
Message-ID:

Hi,

James Lentini wrote:
> On Fri, 29 Jul 2005, Guy German wrote:
>
>>> + array = (u64 *)phys_addr.for_array; /* need to add for_u64_array
>>> to union */ What does this comment mean?
>>
>> I think the right way to do it is :
>> array = phys_addr.for_u64_array
>> (Given the union consists of a new type u64* called "for_u64_array")
>
> I believe the original idea was to have IA memory use the
> DAT_REGION_DESCRIPTION's for_pointer value.

for_pointer is void*, and:
*((void *)array)!=*((u64 *)array) on 32-bit machines

Guy.

From glebn at voltaire.com Tue Aug 2 01:17:30 2005
From: glebn at voltaire.com (Gleb Natapov)
Date: Tue, 2 Aug 2005 11:17:30 +0300
Subject: [openib-general] kernel VM monitor for memory registration caching
In-Reply-To: <20050801154741.GA30781@granada.merseine.nu>
References: <20050729174225.GB2683@osc.edu> <20050731103157.GT13014@minantech.com> <52ack1kdhj.fsf@cisco.com> <20050801132056.GB20243@minantech.com> <52pssxiup6.fsf@cisco.com> <20050801145027.GA31358@minantech.com> <20050801145622.GN28329@granada.merseine.nu> <20050801150427.GD20243@minantech.com> <20050801154741.GA30781@granada.merseine.nu>
Message-ID: <20050802081730.GF20243@minantech.com>

On Mon, Aug 01, 2005 at 06:47:41PM +0300, Muli Ben-Yehuda wrote:
> On Mon, Aug 01, 2005 at 06:04:27PM +0300, Gleb Natapov wrote:
>
> > I don't like the idea of splitting VMAs if we can manage without it. You'll
> > end up having too many of them.
>
> ... or you end up duplicating the VMA functionality on a sub-vma
> level.
Consider:
>
> You have a single vma with some number of pages, where some random
> pages are registered and some aren't. You need to be able to check for
> a given page whether it's registered or not, so you build some data
> structure based on the start and end address (or page, or whatever),
> so that you can check if a given address is within a registered
> page. But this is exactly the sort of thing that the VMA level is
> built for: use the red black tree to find the vma for a given address,
> then check if the vma is registered.
>
> Linux can already cope with a large number of vma's, I think that a
> solution that does not split vma's will end up with either artificial
> limitations (can't have more than two different regions within a vma)
> or reimplementing vma layer functionality. Of course, I could be
> completely wrong :-)

Registrations are more complex than vmas. Registrations can overlap; vmas can't. Unfortunately we will need to split vmas for a different reason: VM_DONTCOPY. Other than that, Pete's solution already works with more than one registration in a vma :)

--
Gleb.

From gdror at mellanox.co.il Tue Aug 2 02:20:22 2005
From: gdror at mellanox.co.il (Dror Goldenberg)
Date: Tue, 2 Aug 2005 12:20:22 +0300
Subject: [openib-general] IB Query
Message-ID: <506C3D7B14CDD411A52C00025558DED6085A7D06@mtlex01.yok.mtl.com>

> -----Original Message-----
> From: Majumder, Rajib [mailto:rajib.majumder at csfb.com]
> Sent: Tuesday, August 02, 2005 6:05 AM
>
> I have a query regarding SDP/IB. I have 2 processes. 1 TCP
> client and 1 TCP server, both are running on the same
> physical host. We have IB HCA installed on this host.
>
> Both the process communicate using SDP.
>
> Will there be any performance gain?
>

These are the benefits of running the application locally over SDP:
* When using BCopy SDP, the HCA HW takes care of all transport offloading - segmentation & reassembly, timers, transport checks, etc.
* When using ZCopy SDP, the HCA HW can also take care of copying the data from one process's address space into the other process's address space.

When two processes communicate through TCP locally, all transport is taken care of by the TCP layer, which adds some overhead. Data is copied by the CPU. Looping back happens at the IP layer and is internal to the OS. Nothing goes down to the networking driver.

I'd also mention that SDP will transfer data through the PCI bus, whereas TCP will use the CPU bus to move the data.

I don't have any performance numbers. It would be interesting to measure bandwidth and CPU utilization at various message sizes.

-Dror

From hch at lst.de Tue Aug 2 05:57:27 2005
From: hch at lst.de ('Christoph Hellwig')
Date: Tue, 2 Aug 2005 14:57:27 +0200
Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22)summary:OpenRDMA community development discussion
In-Reply-To:
References: <20050801161404.GA15582@lst.de>
Message-ID: <20050802125727.GA2472@lst.de>

> Can you provide more details on exactly why you think this is a horrible
> idea? I agree it will be complex, but it _could_ be scoped such that the
> complexity is reduced. For instance, the "offload" function could fail
> (with EBUSY or something) if there is _any_ data pending on the socket.
> Thus removing any requirement to pass down pending unacked outgoing data, or
> pending data that has been received but not yet "read" by the application.
> The idea here is that the applications at the top "know" they are going into
> RDMA mode and have effectively quiesced the connection before attempting to
> move the connection into RDMA mode. We could, in fact, _require_ the
> connect be quiesced to keep things simpler. I'm quickly sinking into gory
> details, but I want to know if you have other reasons (other than the
> complexity) for why this is a bad idea.

I think your writeup here is more than explanation enough.
The offload can only work for a few special cases, and even for those it's rather complicated, especially if you take into account things like IPsec or complex tunneling, which are getting more and more common. What do you achieve by implementing the offload, except trying to make it look more integrated to the user than it actually is? Just offload RDMA protocols to the RDMA hardware and keep the IP stack out of that complexity.

From hch at lst.de Tue Aug 2 05:57:40 2005
From: hch at lst.de ('Christoph Hellwig')
Date: Tue, 2 Aug 2005 14:57:40 +0200
Subject: [openib-general] Re: [Rdma-developers]Meeting(07/22) summary:OpenRDMA community development discussion
In-Reply-To:
References: <20050801173204.GA16966@lst.de>
Message-ID: <20050802125740.GB2472@lst.de>

On Mon, Aug 01, 2005 at 12:40:46PM -0500, Steve Wise wrote:
> > Then tell them to come up with an alternative. We're not going to
> >
> > - include RNIC/PI or something similar to it
> > - pile abstraction layers on abstraction layers (the kDAPL approach)
> >
> > so if the RNIC community can find anything better they should propose
> > it. OTOH they're not even able to get a NIC driver OpenSource so I'm
> > not going to waste more time on them.
>
> The current Ammasso RNIC driver is released under the GPL license now. At
> this point in time, it's not openIB. We intend to develop an open source
> driver that plugs into the openIB framework and does _not_ implement
> RNIC/PI. And we'll provide this as a series of patches during development
> to the openIB community for general review.

Very nice. First time I hear people actually standing up and doing useful work.

From mst at mellanox.co.il Tue Aug 2 06:00:17 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 2 Aug 2005 16:00:17 +0300
Subject: [openib-general] sdp: cant unload ib_ipoib module
Message-ID: <20050802130017.GJ14384@mellanox.co.il>

Hi, Libor!
After running ttcp.aio.x multiple times, I am seeing these messages in dmesg:

unregister_netdevice: waiting for ib0 to become free. Usage count = 2

and ipoib can't be unloaded. Any idea why?

MST

--
MST

From halr at voltaire.com Tue Aug 2 06:08:41 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 2 Aug 2005 16:08:41 +0300
Subject: [openib-general] sdp: cant unload ib_ipoib module
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B95@taurus.voltaire.com>

This was reported a while back. After using SDP, I unloaded SDP and then IPoIB and got the following:

unregister_netdevice: waiting for ib0 to become free. Usage count = 1

The simplest way I found to recreate this is:
1. Bring up IPoIB and then SDP
2. Run tcp.aio.x -t (no server/receiver)
3. Wait for connection refused
4. Unload SDP and then IPoIB

-- Hal

-----Original Message-----
From: openib-general-bounces at openib.org on behalf of Michael S. Tsirkin
Sent: Tue 8/2/2005 9:00 AM
To: openib-general at openib.org; Libor Michalek
Subject: [openib-general] sdp: cant unload ib_ipoib module

Hi, Libor!
After running ttcp.aio.x multiple times, I am seeing these messages in dmesg:

unregister_netdevice: waiting for ib0 to become free. Usage count = 2

and ipoib can't be unloaded. Any idea why?
MST

--
MST

From tom at ammasso.com Tue Aug 2 06:26:28 2005
From: tom at ammasso.com (Tom Tucker)
Date: Tue, 02 Aug 2005 08:26:28 -0500
Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22)summary:OpenRDMA community development discussion
References: <20050801161404.GA15582@lst.de> <20050802125727.GA2472@lst.de>
Message-ID: <42EF7484.5040509@ammasso.com>

'Christoph Hellwig' wrote:

>>Can you provide more details on exactly why you think this is a horrible
>>idea? I agree it will be complex, but it _could_ be scoped such that the
>>complexity is reduced. For instance, the "offload" function could fail
>>(with EBUSY or something) if there is _any_ data pending on the socket.
>>Thus removing any requirement to pass down pending unacked outgoing data, or
>>pending data that has been received but not yet "read" by the application.
>>The idea here is that the applications at the top "know" they are going into
>>RDMA mode and have effectively quiesced the connection before attempting to
>>move the connection into RDMA mode. We could, in fact, _require_ the
>>connect be quiesced to keep things simpler. I'm quickly sinking into gory
>>details, but I want to know if you have other reasons (other than the
>>complexity) for why this is a bad idea.
>>
>>
>
>I think your writeup here is more than explanation enough. The offload
>can only work for few special cases, and even for those it's rather
>complicated, especially if you take things as ipsec or complex tunneling
>that get more and more common into account.
>

I think Steve's point was that it *can* be simplified as necessary to meet the demands/needs of the Linux community. It is certainly technically possible, but admittedly complicated, to offload an active socket.
>What do you achieve by
>implementing the offload except trying to make it look more integrated
>to the user than it actually is? Just offload RDMA protocols to the
>RDMA hardware and keep the IP stack out of that complexity.
>

You get the benefit of things like SYN flood DoS attack avoidance built into the host stack without replicating this functionality in the offloaded adapter. There are other benefits of integration, like failover, etc. IMHO, however, the bulk of the benefits are for ULP offload like RDMA, where the remote peer may not be capable of HW RDMA acceleration. This kind of thing could be determined in "streaming mode" using the host stack, and the connection then migrated to an adapter for HW acceleration only if the remote peer is capable.

From halr at voltaire.com Tue Aug 2 06:54:52 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Aug 2005 09:54:52 -0400
Subject: [openib-general] kdapl build error on ia64
In-Reply-To: <42EEA0C4.5000707@sgi.com>
References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> <42EEA0C4.5000707@sgi.com>
Message-ID: <1122990891.4422.77.camel@hal.voltaire.com>

On Mon, 2005-08-01 at 18:23, John Partridge wrote:
> Hal/James,
>
> Here is a patch that fixes the ia64 build problem and the warnings I had previously posted.
> The patch ONLY fixes the problems in the trunk tree ( apply from the trunk subdir) and NOT
> in users/jlentini,

I believe that users/jlentini is now obsolete and the trunk supersedes it, so this is fine.
> the following files also need to be fixed (should be easy enough for > James to hack the patch file to do it) :- > > users/jlentini/linux-kernel/dat-provider/dapl_openib_cm.c > users/jlentini/linux-kernel/dat-provider/dapl_util.h > > The patch is based on svn revision 2935 (copy also attached to email) > =================== patch begin ============================== > > --- src/linux-kernel/infiniband/ulp/kdapl/ib/dapl_openib_cm.c 2005-07-26 00:00:10.000000000 -0500 > +++ src/linux-kernel/infiniband/ulp/kdapl/ib/dapl_openib_cm.c 2005-08-01 11:56:25.240418715 -0500 > @@ -342,7 +342,7 @@ > &cm_ctx->dapl_comp); > if (status) { > printk(KERN_ERR "dapl_path_comp_handler: " > - "ib_at_paths_by_route returned %d id %lld\n", > + "ib_at_paths_by_route returned %d id %lu\n", > status, cm_ctx->dapl_comp.req_id); > event = DAT_CONNECTION_EVENT_BROKEN; > goto error; > @@ -413,7 +413,7 @@ > &cm_ctx->dapl_comp); > if (status) { > printk(KERN_ERR "dapl_rt_comp_handler: ib_at_paths_by_route " > - "returned %d id %lld\n", status, cm_ctx->dapl_comp.req_id); > + "returned %d id %lu\n", status, cm_ctx->dapl_comp.req_id); > event = DAT_CONNECTION_EVENT_BROKEN; > goto error; This change yields the same warnings on x86_64 that you previously indicated and this fixes on ia64 :-( -- Hal From jlentini at netapp.com Tue Aug 2 07:15:40 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 2 Aug 2005 10:15:40 -0400 (EDT) Subject: [openib-general] kdapl build error on ia64 In-Reply-To: <1122934590.15026.42.camel@duffman> References: <42EAA39B.2090501@sgi.com> <1122934590.15026.42.camel@duffman> Message-ID: On Mon, 1 Aug 2005, Tom Duffy wrote: > On Fri, 2005-07-29 at 16:46 -0500, John Partridge wrote: >> I searched the archives and don't see a solution, but I think I've tracked it down >> to a bug in ulp/kdapl/ib/dapl_util.h >> >> root on mig133 > diff -ruN dapl_util.h dapl_util.h-johnip > > Please use diff -ruNp. See FAQ question 10. 
> >> =============== diff =============================== >> >> --- dapl_util.h 2005-07-29 16:36:17.514669886 -0500 >> +++ dapl_util.h-johnip 2005-07-29 16:37:11.514578548 -0500 >> @@ -71,7 +71,7 @@ >> >> #ifdef __ia64__ >> >> - current_value = ia64_cmpxchg("acq", v, match_value, new_value, 4); >> + current_value = ia64_cmpxchg(acq, v, match_value, new_value, 4); >> >> #elif defined (__PPC__) > > Please kill dapl_os_atomic_assign(). Just use cmpxchg() directly > instead. That sounds like a good idea. Any issues with this John? From liran at mellanox.co.il Tue Aug 2 07:33:38 2005 From: liran at mellanox.co.il (Liran Sorani) Date: Tue, 2 Aug 2005 17:33:38 +0300 Subject: [openib-general] Send SA request over umad problem. Message-ID: <506C3D7B14CDD411A52C00025558DED60865E20F@mtlex01.yok.mtl.com> Hi , I'm working on the SM group at Mellanox. While testing SM-gen2 on a loopback , I've encountered a basic problem trying to send an SA query (single mad) over osm_vendor (gen2). Trying to send the request using osm_vendor_send , passed succesfully , BUT got from the receiver (umad_recv) an error (in an endless loop ): "No space left on device". The MAD request was simple GSI - SA request of ClassPortInfo , here are the details , I've truned on debug mode of vendor_lib and umad (marked in red the important lines in the log ): ... Aug 02 03:35:49 [401776C0] -> osm_vendor_send: [ warn: [19219] umad_set_addr_net: umad 0x80810d0 dlid 1 dqp 1 sl, qkey 0 warn: [19219] umad_dump: agent id 0 status 0 timeout 0 warn: [19219] umad_addr_dump: qpn 1 qkey 0x80010000 lid 0x1 sl 0 grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 Gid 0x00000000000000000000000000000000 Aug 02 03:35:49 [401776C0] -> osm_vendor_send: RMPP 0 length 256 warn: [19219] umad_send: portid 0 agentid 0 umad 0x80810d0 timeout 1000 Aug 02 03:35:49 [401776C0] -> osm_vendor_send: Completed Sending Request p_madw = 0x80807dc. 
Aug 02 03:35:49 [401776C0] -> osm_vendor_send: ] Aug 02 03:35:49 [401776C0] -> __osmv_send_sa_req: Waiting for async event. warn: [19219] umad_recv: read returned 356 > sizeof umad 56 + length 256 (No space left on device) Aug 02 03:35:49 [40D7EBB0] -> umad_receiver: recv error No space left on device warn: [19219] umad_recv: portid 0 umad 0x8080e28 timeout 4294967295 warn: [19219] umad_recv: read returned 356 > sizeof umad 56 + length 256 (No space left on device) Aug 02 03:35:49 [40D7EBB0] -> umad_receiver: recv error No space left on device warn: [19219] umad_recv: portid 0 umad 0x8080e28 timeout 4294967295 warn: [19219] umad_recv: read returned 356 > sizeof umad 56 + length 256 (No space left on device) Aug 02 03:35:49 [40D7EBB0] -> umad_receiver: recv error No space left on device warn: [19219] umad_recv: portid 0 umad 0x8080e28 timeout 4294967295 warn: [19219] umad_recv: read returned 356 > sizeof umad 56 + length 256 (No space left on device) Aug 02 03:35:49 [40D7EBB0] -> umad_receiver: recv error No space left on device ... Thanks , in advance for your help . > Liran Sorani > Mellanox Technologies LTD. > mailto:liran at mellanox.co.il > Phone: +972(4)9097200 Ext: 214 > Israel, Yokneam P.O.B 586 ZIP 20692 > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Aug 2 07:31:28 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Aug 2005 10:31:28 -0400 Subject: [openib-general] RE: [BUG]OpenSM double free or corruption In-Reply-To: References: Message-ID: <1122993087.4422.88.camel@hal.voltaire.com> Hi Woody, On Wed, 2005-07-27 at 11:48, Bob Woodruff wrote: > I don't have the log file, but it is easy to reproduce. I'm just getting back to this now. > Load the stack on an HCA that has old firmware I don't have an HCA with old firmware (what version ?) and I thought it was dangerous to downgrade firmware. > and the mthca > driver will not initialize and will report an error in dmesg. 
What driver error do you get ? > Then simply start opensm and the error occurs. -- Hal From guyg at voltaire.com Tue Aug 2 07:41:51 2005 From: guyg at voltaire.com (Guy German) Date: Tue, 2 Aug 2005 17:41:51 +0300 Subject: [openib-general] kdapl build error on ia64 Message-ID: Hi, openib-general-bounces at openib.org <> wrote: > On Mon, 1 Aug 2005, Tom Duffy wrote: > >> On Fri, 2005-07-29 at 16:46 -0500, John Partridge wrote: >>> I searched the archives and don't see a solution, but I think I've >>> tracked it down to a bug in ulp/kdapl/ib/dapl_util.h >>> >>> root on mig133 > diff -ruN dapl_util.h dapl_util.h-johnip >> >> Please use diff -ruNp. See FAQ question 10. >> >>> =============== diff =============================== >>> >>> --- dapl_util.h 2005-07-29 16:36:17.514669886 -0500 >>> +++ dapl_util.h-johnip 2005-07-29 16:37:11.514578548 -0500 @@ >>> -71,7 +71,7 @@ >>> >>> #ifdef __ia64__ >>> >>> - current_value = ia64_cmpxchg("acq", v, match_value, >>> new_value, 4); + current_value = ia64_cmpxchg(acq, v, >>> match_value, new_value, 4); >>> >>> #elif defined (__PPC__) >> >> Please kill dapl_os_atomic_assign(). Just use cmpxchg() directly >> instead. > > That sounds like a good idea. Any issues with this John? But what about other platforms (not ia64) - the code would have ifdef in all the places it is used ? > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Tue Aug 2 07:41:28 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Aug 2005 10:41:28 -0400 Subject: [openib-general] Re: Send SA request over umad problem. 
In-Reply-To: <506C3D7B14CDD411A52C00025558DED60865E20F@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED60865E20F@mtlex01.yok.mtl.com> Message-ID: <1122993688.4422.95.camel@hal.voltaire.com> On Tue, 2005-08-02 at 10:33, Liran Sorani wrote: > Hi , > I'm working on the SM group at Mellanox. > While testing SM-gen2 on a loopback , I've encountered a basic problem > trying to send an SA query (single mad) over osm_vendor (gen2). > Trying to send the request using osm_vendor_send , passed succesfully > , BUT got from the receiver (umad_recv) an error (in an endless loop > ): "No space left on device". The MAD request was simple GSI - SA > request of ClassPortInfo , here are the details , I've truned on debug > mode of vendor_lib and umad (marked in red the important lines in the > log ): > ... > Aug 02 03:35:49 [401776C0] -> osm_vendor_send: [ > warn: [19219] umad_set_addr_net: umad 0x80810d0 dlid 1 dqp 1 sl, qkey > 0 > warn: [19219] umad_dump: agent id 0 status 0 timeout 0 > warn: [19219] umad_addr_dump: qpn 1 qkey 0x80010000 lid 0x1 sl 0 > grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 > Gid 0x00000000000000000000000000000000 > Aug 02 03:35:49 [401776C0] -> osm_vendor_send: RMPP 0 length 256 > warn: [19219] umad_send: portid 0 agentid 0 umad 0x80810d0 timeout > 1000 > Aug 02 03:35:49 [401776C0] -> osm_vendor_send: Completed Sending > Request p_madw = 0x80807dc. > Aug 02 03:35:49 [401776C0] -> osm_vendor_send: ] > Aug 02 03:35:49 [401776C0] -> __osmv_send_sa_req: Waiting for async > event. 
> warn: [19219] umad_recv: read returned 356 > sizeof umad 56 + length > 256 (No space left on device) > Aug 02 03:35:49 [40D7EBB0] -> umad_receiver: recv error No space left > on device > warn: [19219] umad_recv: portid 0 umad 0x8080e28 timeout 4294967295 > warn: [19219] umad_recv: read returned 356 > sizeof umad 56 + length > 256 (No space left on device) > Aug 02 03:35:49 [40D7EBB0] -> umad_receiver: recv error No space left > on device > warn: [19219] umad_recv: portid 0 umad 0x8080e28 timeout 4294967295 > warn: [19219] umad_recv: read returned 356 > sizeof umad 56 + length > 256 (No space left on device) > Aug 02 03:35:49 [40D7EBB0] -> umad_receiver: recv error No space left > on device > warn: [19219] umad_recv: portid 0 umad 0x8080e28 timeout 4294967295 > warn: [19219] umad_recv: read returned 356 > sizeof umad 56 + length > 256 (No space left on device) > Aug 02 03:35:49 [40D7EBB0] -> umad_receiver: recv error No space left > on device > ... What SM is this talking to ? What does the SA response look like ? I have a theory as to what is going on. Just want to see if it is accurate before I spend more time on it. -- Hal From jlentini at netapp.com Tue Aug 2 08:12:52 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 2 Aug 2005 11:12:52 -0400 (EDT) Subject: [openib-general] kdapl build error on ia64 In-Reply-To: <1122990891.4422.77.camel@hal.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> <42EEA0C4.5000707@sgi.com> <1122990891.4422.77.camel@hal.voltaire.com> Message-ID: On Tue, 2 Aug 2005, Hal Rosenstock wrote: > On Mon, 2005-08-01 at 18:23, John Partridge wrote: >> Hal/James, >> >> Here is a patch that fixes the ia64 build problem and the warnings I had previously posted. >> The patch ONLY fixes the problems in the trunk tree ( apply from the trunk subdir) and NOT >> in users/jlentini, > > I believe that users/jlentini is now obsolete and the trunk supercedes > this so this is fine. 
> >> the following files also need to be fixed (should be easy enough for >> James to hack the patch file to do it) :- >> >> users/jlentini/linux-kernel/dat-provider/dapl_openib_cm.c >> users/jlentini/linux-kernel/dat-provider/dapl_util.h >> >> The patch is based on svn revision 2935 (copy also attached to email) >> =================== patch begin ============================== >> >> --- src/linux-kernel/infiniband/ulp/kdapl/ib/dapl_openib_cm.c 2005-07-26 00:00:10.000000000 -0500 >> +++ src/linux-kernel/infiniband/ulp/kdapl/ib/dapl_openib_cm.c 2005-08-01 11:56:25.240418715 -0500 >> @@ -342,7 +342,7 @@ >> &cm_ctx->dapl_comp); >> if (status) { >> printk(KERN_ERR "dapl_path_comp_handler: " >> - "ib_at_paths_by_route returned %d id %lld\n", >> + "ib_at_paths_by_route returned %d id %lu\n", >> status, cm_ctx->dapl_comp.req_id); >> event = DAT_CONNECTION_EVENT_BROKEN; >> goto error; >> @@ -413,7 +413,7 @@ >> &cm_ctx->dapl_comp); >> if (status) { >> printk(KERN_ERR "dapl_rt_comp_handler: ib_at_paths_by_route " >> - "returned %d id %lld\n", status, cm_ctx->dapl_comp.req_id); >> + "returned %d id %lu\n", status, cm_ctx->dapl_comp.req_id); >> event = DAT_CONNECTION_EVENT_BROKEN; >> goto error; > > This change yields the same warnings on x86_64 that you previously > indicated and this fixes on ia64 :-( Should we use %Lu for printing out the req_id (a u64)? 
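One portable way to sidestep the format question, sketched here in userspace C rather than as the kDAPL fix itself: cast the 64-bit value to unsigned long long and print it with %llu. The kernel's u64 is unsigned long on some 64-bit architectures and unsigned long long on others, so a bare %lu or %llu without a cast will warn somewhere. The names req_id_t and format_req_id are hypothetical stand-ins for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's u64 request id. */
typedef uint64_t req_id_t;

/*
 * Format a status/id pair in the style of the printk() calls in the
 * patch. The explicit cast makes the argument type match %llu no
 * matter how the platform defines its 64-bit integer type.
 */
static int format_req_id(char *buf, size_t len, int status, req_id_t id)
{
	return snprintf(buf, len, "returned %d id %llu",
			status, (unsigned long long)id);
}
```

Calling format_req_id(buf, sizeof buf, status, req_id) fills buf without a format warning on either a 32-bit or a 64-bit build; the same cast works inside a printk() argument list.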
From jlentini at netapp.com Tue Aug 2 08:15:35 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 2 Aug 2005 11:15:35 -0400 (EDT) Subject: [openib-general] kdapl build error on ia64 In-Reply-To: References: Message-ID: On Tue, 2 Aug 2005, Guy German wrote: > Hi, > > openib-general-bounces at openib.org <> wrote: >> On Mon, 1 Aug 2005, Tom Duffy wrote: >> >>> On Fri, 2005-07-29 at 16:46 -0500, John Partridge wrote: >>>> I searched the archives and don't see a solution, but I think I've >>>> tracked it down to a bug in ulp/kdapl/ib/dapl_util.h >>>> >>>> root on mig133 > diff -ruN dapl_util.h dapl_util.h-johnip >>> >>> Please use diff -ruNp. See FAQ question 10. >>> >>>> =============== diff =============================== >>>> >>>> --- dapl_util.h 2005-07-29 16:36:17.514669886 -0500 >>>> +++ dapl_util.h-johnip 2005-07-29 16:37:11.514578548 -0500 @@ >>>> -71,7 +71,7 @@ >>>> >>>> #ifdef __ia64__ >>>> >>>> - current_value = ia64_cmpxchg("acq", v, match_value, >>>> new_value, 4); + current_value = ia64_cmpxchg(acq, v, >>>> match_value, new_value, 4); >>>> >>>> #elif defined (__PPC__) >>> >>> Please kill dapl_os_atomic_assign(). Just use cmpxchg() directly >>> instead. >> >> That sounds like a good idea. Any issues with this John? > > But what about other platforms (not ia64) - the code would have > ifdef in all the places it is used ? I think cmpxchg() is defined for each platform Linux runs on. Is that not true? 
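For reference, the compare-and-swap contract that dapl_os_atomic_assign() wraps (store the new value only on a match, return the observed value in any case) can be sketched in portable userspace C with C11 atomics. This illustrates the semantics only; it is not the kernel's cmpxchg(), and cas_return_old and claim_next are hypothetical names.

```c
#include <stdatomic.h>

/*
 * Hypothetical userspace analogue of a cmpxchg()-style primitive:
 * store new_value in *v only if *v currently equals match_value,
 * and return the value that was observed in either case.
 */
static int cas_return_old(atomic_int *v, int match_value, int new_value)
{
	int observed = match_value;

	/*
	 * On failure, atomic_compare_exchange_strong() writes the
	 * current value of *v into 'observed'; on success 'observed'
	 * still holds match_value, which was the old value.
	 */
	atomic_compare_exchange_strong(v, &observed, new_value);
	return observed;
}

/*
 * Advance an index by one, retrying if another thread changed it
 * between the load and the compare-and-swap. Returns the new index.
 */
static int claim_next(atomic_int *index)
{
	int pos;

	do {
		pos = atomic_load(index);
	} while (cas_return_old(index, pos, pos + 1) != pos);

	return pos + 1;
}
```

The caller compares the returned value with the value it expected; equality means the swap happened, which is the same test a ring buffer would use to decide whether it owns the slot it just claimed.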
From guyg at voltaire.com Tue Aug 2 08:22:52 2005 From: guyg at voltaire.com (Guy German) Date: Tue, 2 Aug 2005 18:22:52 +0300 Subject: [openib-general] kdapl build error on ia64 Message-ID: Hi, James Lentini wrote: > On Tue, 2 Aug 2005, Guy German wrote: > >> Hi, >> >> openib-general-bounces at openib.org <> wrote: >>> On Mon, 1 Aug 2005, Tom Duffy wrote: >>> >>>> On Fri, 2005-07-29 at 16:46 -0500, John Partridge wrote: >>>>> I searched the archives and don't see a solution, but I think I've >>>>> tracked it down to a bug in ulp/kdapl/ib/dapl_util.h >>>>> >>>>> root on mig133 > diff -ruN dapl_util.h dapl_util.h-johnip >>>> >>>> Please use diff -ruNp. See FAQ question 10. >>>> >>>>> =============== diff =============================== >>>>> >>>>> --- dapl_util.h 2005-07-29 16:36:17.514669886 -0500 >>>>> +++ dapl_util.h-johnip 2005-07-29 16:37:11.514578548 -0500 @@ >>>>> -71,7 +71,7 @@ >>>>> >>>>> #ifdef __ia64__ >>>>> >>>>> - current_value = ia64_cmpxchg("acq", v, match_value, >>>>> new_value, 4); + current_value = ia64_cmpxchg(acq, v, >>>>> match_value, new_value, 4); >>>>> >>>>> #elif defined (__PPC__) >>>> >>>> Please kill dapl_os_atomic_assign(). Just use cmpxchg() directly >>>> instead. >>> >>> That sounds like a good idea. Any issues with this John? >> >> But what about other platforms (not ia64) - the code would have >> ifdef in all the places it is used ? > > I think cmpxchg() is defined for each platform Linux runs on. Is that > not true? I guess you are right. I found this (in include/asm-ia64/intrinsics.h): #define cmpxchg_acq(ptr,o,n) ia64_cmpxchg(acq, (ptr), (o), (n), sizeof(*(ptr))) #define cmpxchg_rel(ptr,o,n) ia64_cmpxchg(rel, (ptr), (o), (n),sizeof(*(ptr))) /* for compatibility with other platforms: */ #define cmpxchg(ptr,o,n) cmpxchg_acq(ptr,o,n) -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: vmalloc.txt URL: From guyg at voltaire.com Tue Aug 2 08:24:49 2005 From: guyg at voltaire.com (Guy German) Date: Tue, 2 Aug 2005 18:24:49 +0300 Subject: [openib-general][kdapl]: vmalloc instead of kmalloc Message-ID: Hi, There are some places where kmalloc might not be enough : in dapl_evd_event_alloc there is an allocation: event = kmalloc(evd->qlen * sizeof *event); whereas evd->qlen can be 128k (depends on max_cqe of the hca) and kmalloc would fail. The same goes to dapl_rbuf_alloc. Is it legit to replace those kmallocs with vmallocs ? Thanks, Guy From halr at voltaire.com Tue Aug 2 08:19:43 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Aug 2005 11:19:43 -0400 Subject: [openib-general] kdapl build error on ia64 In-Reply-To: References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> <42EEA0C4.5000707@sgi.com> <1122990891.4422.77.camel@hal.voltaire.com> Message-ID: <1122995982.4442.16.camel@hal.voltaire.com> On Tue, 2005-08-02 at 11:12, James Lentini wrote: > Should we use %Lu for printing out the req_id (a u64)? This works for x86 & x86-64. Not sure about ia64. -- Hal From mulix at mulix.org Tue Aug 2 08:40:19 2005 From: mulix at mulix.org (Muli Ben-Yehuda) Date: Tue, 2 Aug 2005 18:40:19 +0300 Subject: [openib-general][kdapl]: vmalloc instead of kmalloc In-Reply-To: References: Message-ID: <20050802154019.GA6794@granada.merseine.nu> On Tue, Aug 02, 2005 at 06:24:49PM +0300, Guy German wrote: > There are some places where kmalloc might not be enough : > in dapl_evd_event_alloc there is an allocation: > > event = kmalloc(evd->qlen * sizeof *event); > > whereas evd->qlen can be 128k (depends on max_cqe of the hca) > and kmalloc would fail. > > The same goes to dapl_rbuf_alloc. > > Is it legit to replace those kmallocs with vmallocs ? Why do we need such a large allocation? To answer your question, vmalloc has a performance overhead and can and will fail when vmalloc-space is exhausted (as can kmalloc, for different reasons). 
Can this allocation be cut down so that it becomes a non-issue? Cheers, Muli -- Muli Ben-Yehuda http://www.mulix.org | http://mulix.livejournal.com/ From johnip at sgi.com Tue Aug 2 08:55:45 2005 From: johnip at sgi.com (John Partridge) Date: Tue, 02 Aug 2005 10:55:45 -0500 Subject: [openib-general] kdapl build error on ia64 In-Reply-To: References: <42EAA39B.2090501@sgi.com> <1122934590.15026.42.camel@duffman> Message-ID: <42EF9781.5060706@sgi.com> James Lentini wrote: > > > That sounds like a good idea. Any issues with this John? Actually I like the idea of just using cmpxchg(); it simplifies the code and each platform should do the right thing. John -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From guyg at voltaire.com Tue Aug 2 09:02:15 2005 From: guyg at voltaire.com (Guy German) Date: Tue, 2 Aug 2005 19:02:15 +0300 Subject: [openib-general][kdapl]: vmalloc instead of kmalloc Message-ID: Hi Muli, Muli Ben-Yehuda wrote: > On Tue, Aug 02, 2005 at 06:24:49PM +0300, Guy German wrote: > >> There are some places where kmalloc might not be enough : >> in dapl_evd_event_alloc there is an allocation: >> >> event = kmalloc(evd->qlen * sizeof *event); >> >> whereas evd->qlen can be 128k (depends on max_cqe of the hca) and >> kmalloc would fail. >> >> The same goes to dapl_rbuf_alloc. >> >> Is it legit to replace those kmallocs with vmallocs ? > > Why do we need such a large allocation? > To answer your question, vmalloc has a performance overhead and can > and will fail when vmalloc-space is exhausted (as can kmalloc, for > different reasons). Can this allocation be cut down so that it > becomes a non-issue? evd_min_qlen defines the size of the event queue that the Consumer requested. sizeof *event = 184, so, assuming a 128 KB kmalloc limit (131072 / 184), that leaves room for only ~712 pending events, which is not much.
The iSER target is trying to support about 5000 (by their calculations), but other consumers might want to support even more and there is no reason for dapl to limit what IB can provide. Note that iSER dequeues the events itself (only the first event is accepted from a callback), hence the need for a normal size queue. Thanks, Guy. > > Cheers, > Muli From caitlin.bestler at gmail.com Tue Aug 2 09:07:42 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Tue, 2 Aug 2005 09:07:42 -0700 Subject: [openib-general] Re: [Rdma-developers] Meeting (07/22)summary:OpenRDMA community development discussion In-Reply-To: <42EF7484.5040509@ammasso.com> References: <20050801161404.GA15582@lst.de> <20050802125727.GA2472@lst.de> <42EF7484.5040509@ammasso.com> Message-ID: <469958e0050802090757502d3a@mail.gmail.com> Generally there are two cases to consider: when the TCP mode is not visible and when it is. When it is not visible it is certainly easy to manage the TCP connection with subset logic within the RDMA stack and never involve the host stack. This is certainly what the initial proposal will rely upon. In the long term it has the problems you cited. Having two stacks accept TCP connections means that *both* must be updated to stay current with the latest DoS attacks. While it is more work for the RDMA device, I think there is general agreement among the hardware vendors that this is something that the OS *should* retain control of. Deciding which connections may be accepted is inherently an OS function. Beyond that there is a distinct programming model, already accepted in IETF specifications, that requires the application to begin work in streaming (i.e., socket) mode, and then only convert to RDMA mode once the two peers have agreed upon that optimization. To support that model you will eventually have to allow the host stack to transfer a TCP connection to the RDMA stack *or* you will require the RDMA stack to provide full TCP/socket functionality.
So the real question is not whether to allow the RDMA stack to "take" a connection from the host stack, but whether to force the RDMA stack to yield control of the connection to the host during critical connection setup so that this step remains firmly under OS control and oversight. On 8/2/05, Tom Tucker wrote: > > > 'Christoph Hellwig' wrote: > > > Can you provide more details on exactly why you think this is a horrible > idea? I agree it will be complex, but it _could_ be scoped such that the > complexity is reduced. For instance, the "offload" function could fail > (with EBUSY or something) if there is _any_ data pending on the socket. > Thus removing any requirement to pass down pending unacked outgoing data, or > pending data that has been received but not yet "read" by the application. > The idea here is that the applications at the top "know" they are going into > RDMA mode and have effectively quiesced the connection before attempting to > move the connection into RDMA mode. We could, in fact, _require_ the > connection be quiesced to keep things simpler. I'm quickly sinking into gory > details, but I want to know if you have other reasons (other than the > complexity) for why this is a bad idea. > > I think your writeup here is more than explanation enough. The offload > can only work for a few special cases, and even for those it's rather > complicated, especially if you take things such as IPsec or complex tunneling > that get more and more common into account. > I think Steve's point was that it *can* be simplified as necessary to meet > the demands/needs of the Linux community. It is certainly technically > possible, but agreeably complicated to offload an active socket. > > > What do you achieve by > implementing the offload except trying to make it look more integrated > to the user than it actually is? Just offload RDMA protocols to the > RDMA hardware and keep the IP stack out of that complexity.
> You get the benefit of things like SYN flood DOS attack avoidance built > into the host stack without replicating this functionality in the offloaded > adapter. There are other benefits of integration like failover, etc... IMHO, > however, the bulk of the benefits are for ULP offload like RDMA where the > remote peer may not be capable of HW RDMA acceleration. This kind of thing > could be determined in "streaming mode" using the host stack and then > migrated to an adapter for HW acceleration only if the remote peer is > capable. From johnip at sgi.com Tue Aug 2 09:11:33 2005 From: johnip at sgi.com (John Partridge) Date: Tue, 02 Aug 2005 11:11:33 -0500 Subject: [openib-general] kdapl build error on ia64 In-Reply-To: References: <42EAA39B.2090501@sgi.com> <1122934590.15026.42.camel@duffman> Message-ID: <42EF9B35.6020601@sgi.com> James, Do you want me to do a patch killing dapl_os_atomic_assign()? As far as I can see it only gets used in two places, dapl_rbuf_add() and dapl_rbuf_remove(), right? I'll base the patch off of svn rev 2944, OK? John James Lentini wrote: > > On Mon, 1 Aug 2005, Tom Duffy wrote: > >> On Fri, 2005-07-29 at 16:46 -0500, John Partridge wrote: >> >>> I searched the archives and don't see a solution, but I think I've >>> tracked it down >>> to a bug in ulp/kdapl/ib/dapl_util.h >>> >>> root on mig133 > diff -ruN dapl_util.h dapl_util.h-johnip >> >> >> Please use diff -ruNp. See FAQ question 10.
>> >>> =============== diff =============================== >>> >>> --- dapl_util.h 2005-07-29 16:36:17.514669886 -0500 >>> +++ dapl_util.h-johnip 2005-07-29 16:37:11.514578548 -0500 >>> @@ -71,7 +71,7 @@ >>> >>> #ifdef __ia64__ >>> >>> - current_value = ia64_cmpxchg("acq", v, match_value, >>> new_value, 4); >>> + current_value = ia64_cmpxchg(acq, v, match_value, >>> new_value, 4); >>> >>> #elif defined (__PPC__) >> >> >> Please kill dapl_os_atomic_assign(). Just use cmpxchg() directly >> instead. > > > That sounds like a good idea. Any issues with this John? -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From jlentini at netapp.com Tue Aug 2 09:25:04 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 2 Aug 2005 12:25:04 -0400 (EDT) Subject: [openib-general] kdapl build error on ia64 In-Reply-To: <42EF9B35.6020601@sgi.com> References: <42EAA39B.2090501@sgi.com> <1122934590.15026.42.camel@duffman> <42EF9B35.6020601@sgi.com> Message-ID: On Tue, 2 Aug 2005, John Partridge wrote: > James, > > Do you want me to do a patch killing dapl_os_atomic_assign() as far as I can > see > it only get used in two places, dapl_rbuf_add() and dapl_rbuf_remove() right > ? > > I'll base the patch off of svn rev 2944 OK ? > > John > Hi John, I was just working on one. Here's what I have (note: pine will mess up the inline patch, please use the attachment if you want to test it out). I'm going to test it now. 
Index: dapl_provider.h =================================================================== --- dapl_provider.h (revision 2946) +++ dapl_provider.h (working copy) @@ -35,8 +35,6 @@ #include "dapl.h" -extern DAPL_DBG_MASK g_dapl_dbg_mask; - extern int dapl_provider_list_search(const char *name, struct dat_provider **provider); Index: dapl_ring_buffer_util.c =================================================================== --- dapl_ring_buffer_util.c (revision 2946) +++ dapl_ring_buffer_util.c (working copy) @@ -184,7 +184,7 @@ while (((atomic_read(&rbuf->head) + 1) & rbuf->lim) != (atomic_read(&rbuf->tail) & rbuf->lim)) { pos = atomic_read(&rbuf->head); - val = dapl_os_atomic_assign(&rbuf->head, pos, pos + 1); + val = cmpxchg(&rbuf->head.counter, pos, pos + 1); if (val == pos) { pos = (pos + 1) & rbuf->lim; /* verify in range */ rbuf->base[pos] = entry; @@ -218,7 +218,7 @@ while (atomic_read(&rbuf->head) != atomic_read(&rbuf->tail)) { pos = atomic_read(&rbuf->tail); - val = dapl_os_atomic_assign(&rbuf->tail, pos, pos + 1); + val = cmpxchg(&rbuf->tail.counter, pos, pos + 1); if (val == pos) { pos = (pos + 1) & rbuf->lim; /* verify in range */ return rbuf->base[pos]; Index: Makefile =================================================================== --- Makefile (revision 2946) +++ Makefile (working copy) @@ -7,4 +7,4 @@ dapl_cookie.o dapl_cr.o dapl_ep.o dapl_evd.o \ dapl_hca_util.o dapl_ia.o dapl_lmr.o dapl_provider.o \ dapl_pz.o dapl_ring_buffer_util.o dapl_rmr.o dapl_sp.o \ - dapl_srq.o dapl_util.o + dapl_srq.o Index: dapl_util.c =================================================================== --- dapl_util.c (revision 2946) +++ dapl_util.c (working copy) @@ -1,52 +0,0 @@ -/* - * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. - * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. 
- * - * This Software is licensed under one of the following licenses: - * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. - * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. - */ - -/* - * $Id$ - */ - -#include "dapl.h" -#include "dapl_provider.h" -#include "dapl_util.h" - -#ifdef CONFIG_KDAPL_INFINIBAND_DEBUG - -void dapl_dbg_log(enum dapl_dbg_type type, const char *fmt, ...) -{ - char buf[1024]; - va_list args; - - if (type & g_dapl_dbg_mask) { - va_start(args, fmt); - vsnprintf(buf, sizeof buf, fmt, args); - printk(KERN_ALERT "kDAPL: %s", buf); - va_end(args); - } -} - -#endif /* KDAPL_INFINIBAND_DEBUG */ Index: dapl_util.h =================================================================== --- dapl_util.h (revision 2946) +++ dapl_util.h (working copy) @@ -1,131 +0,0 @@ -/* - * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. - * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. 
- * - * This Software is licensed under one of the following licenses: - * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. - * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. - */ - -/* - * $Id$ - */ - -#ifndef DAPL_UTIL_H -#define DAPL_UTIL_H - -#include -#include -#include -#include -#include -#include -#include /* needed by hash functions */ - -#ifdef __ia64__ -#include -#include -#endif - -/* dapl_os_atomic_assign - * - * assign 'new_value' to '*v' if the current value - * matches the provided 'match_value'. - * - * Make no assignment if there is no match. - * - * Return the current value in any case. - * - * This matches the IBTA atomic operation compare & swap - * except that it is for local memory and a int may - * be only 32 bits, rather than 64. 
- */ - -static inline int dapl_os_atomic_assign(atomic_t * v, int match_value, - int new_value) -{ - int current_value; - - /* - * Use the Pentium compare and exchange instruction - */ - -#ifdef __ia64__ - - current_value = ia64_cmpxchg("acq", v, match_value, new_value, 4); - -#elif defined (__PPC__) - - current_value = - __cmpxchg_u32((volatile int *)v, (int)match_value, (int)new_value); - -#else - current_value = __cmpxchg((volatile void *)v, - (unsigned long)match_value, - (unsigned long)new_value, (int)4); -#endif - return current_value; -} - -/* - * *printf format helper. We use the C string constant concatenation - * ability to define 64 bit formats, which unfortunatly are non standard - * in the C compiler world. - */ -#ifdef __x86_64__ -#define F64x "%lx" -#else -#define F64x "%llx" -#endif - -/* - * Debug Functions - */ - -/* - * Use these bits to enable various tracing/debug options. Each bit - * represents debugging in a particular subsystem or area of the code. - */ -enum dapl_dbg_type { - DAPL_DBG_TYPE_ERR = 0x0001, - DAPL_DBG_TYPE_WARN = 0x0002, - DAPL_DBG_TYPE_EVD = 0x0004, - DAPL_DBG_TYPE_CM = 0x0008, - DAPL_DBG_TYPE_EP = 0x0010, - DAPL_DBG_TYPE_UTIL = 0x0020, - DAPL_DBG_TYPE_CALLBACK = 0x0040, - DAPL_DBG_TYPE_DTO_COMP_ERR = 0x0080, - DAPL_DBG_TYPE_API = 0x0100, - DAPL_DBG_TYPE_RTN = 0x0200, - DAPL_DBG_TYPE_EXCEPTION = 0x0400, - DAPL_DBG_TYPE_SRQ = 0x0800 -}; - -typedef int DAPL_DBG_MASK; - -#ifdef CONFIG_KDAPL_INFINIBAND_DEBUG -extern void dapl_dbg_log(enum dapl_dbg_type type, const char *fmt, ...); -#else /* !KDAPL_INFINIBAND_DEBUG */ -#define dapl_dbg_log(...) 
-#endif /* KDAPL_INFINIBAND_DEBUG */ - -#endif /* DAPL_UTIL_H */ Index: dapl.h =================================================================== --- dapl.h (revision 2946) +++ dapl.h (working copy) @@ -36,10 +36,12 @@ #define DAPL_H #include +#include +#include +#include #include -#include "dapl_util.h" #include "ib_verbs.h" #include "ib_cm.h" @@ -601,4 +603,37 @@ extern int dapl_srq_set_lw(struct dat_srq *srq, int low_watermark); -#endif +/********************************************************************* + * * + * Debug Functions * + * * + *********************************************************************/ + +/* + * Use these bits to enable various tracing/debug options. Each bit + * represents debugging in a particular subsystem or area of the code. + */ +enum dapl_dbg_type { + DAPL_DBG_TYPE_ERR = (1 << 0), + DAPL_DBG_TYPE_WARN = (1 << 1), + DAPL_DBG_TYPE_EVD = (1 << 2), + DAPL_DBG_TYPE_CM = (1 << 3), + DAPL_DBG_TYPE_EP = (1 << 4), + DAPL_DBG_TYPE_UTIL = (1 << 5), + DAPL_DBG_TYPE_CALLBACK = (1 << 6), + DAPL_DBG_TYPE_DTO_COMP_ERR = (1 << 7), + DAPL_DBG_TYPE_API = (1 << 8), + DAPL_DBG_TYPE_RTN = (1 << 9), + DAPL_DBG_TYPE_EXCEPTION = (1 << 10), + DAPL_DBG_TYPE_SRQ = (1 << 11) +}; + +typedef int DAPL_DBG_MASK; + +#ifdef CONFIG_KDAPL_INFINIBAND_DEBUG +extern void dapl_dbg_log(enum dapl_dbg_type type, const char *fmt, ...); +#else /* !KDAPL_INFINIBAND_DEBUG */ +#define dapl_dbg_log(...) 
+#endif /* KDAPL_INFINIBAND_DEBUG */ + +#endif /* DAPL_H */ Index: dapl_provider.c =================================================================== --- dapl_provider.c (revision 2946) +++ dapl_provider.c (working copy) @@ -36,7 +36,6 @@ #include "dapl.h" #include "dapl_hca_util.h" #include "dapl_provider.h" -#include "dapl_util.h" #include "dapl_openib_util.h" MODULE_LICENSE("Dual BSD/GPL"); @@ -56,7 +55,7 @@ *********************************************************************/ #ifdef CONFIG_KDAPL_INFINIBAND_DEBUG -DAPL_DBG_MASK g_dapl_dbg_mask = 0; +static DAPL_DBG_MASK g_dapl_dbg_mask = 0; module_param_named(dbg_mask, g_dapl_dbg_mask, int, 0644); MODULE_PARM_DESC(dbg_mask, "Bitmask to enable debug message types."); #endif /* CONFIG_KDAPL_INFINIBAND_DEBUG */ @@ -144,6 +143,24 @@ * * *********************************************************************/ +#ifdef CONFIG_KDAPL_INFINIBAND_DEBUG + +void dapl_dbg_log(enum dapl_dbg_type type, const char *fmt, ...) +{ + char buf[1024]; + va_list args; + + if (type & g_dapl_dbg_mask) { + va_start(args, fmt); + vsnprintf(buf, sizeof buf, fmt, args); + printk(KERN_ALERT "kDAPL: %s", buf); + va_end(args); + } +} + +#endif /* KDAPL_INFINIBAND_DEBUG */ + + static void dapl_provider_list_destroy(void) { struct list_head *cur_list, *next_list; Index: dapl_openib_cm.c =================================================================== --- dapl_openib_cm.c (revision 2946) +++ dapl_openib_cm.c (working copy) @@ -342,7 +342,7 @@ &cm_ctx->dapl_comp); if (status) { printk(KERN_ERR "dapl_path_comp_handler: " - "ib_at_paths_by_route returned %d id %lld\n", + "ib_at_paths_by_route returned %d id %Lu\n", status, cm_ctx->dapl_comp.req_id); event = DAT_CONNECTION_EVENT_BROKEN; goto error; @@ -413,7 +413,7 @@ &cm_ctx->dapl_comp); if (status) { printk(KERN_ERR "dapl_rt_comp_handler: ib_at_paths_by_route " - "returned %d id %lld\n", status, cm_ctx->dapl_comp.req_id); + "returned %d id %Lu\n", status, cm_ctx->dapl_comp.req_id); event = 
DAT_CONNECTION_EVENT_BROKEN; goto error; } -------------- next part -------------- Index: dapl_provider.h =================================================================== --- dapl_provider.h (revision 2946) +++ dapl_provider.h (working copy) @@ -35,8 +35,6 @@ #include "dapl.h" -extern DAPL_DBG_MASK g_dapl_dbg_mask; - extern int dapl_provider_list_search(const char *name, struct dat_provider **provider); Index: dapl_ring_buffer_util.c =================================================================== --- dapl_ring_buffer_util.c (revision 2946) +++ dapl_ring_buffer_util.c (working copy) @@ -184,7 +184,7 @@ while (((atomic_read(&rbuf->head) + 1) & rbuf->lim) != (atomic_read(&rbuf->tail) & rbuf->lim)) { pos = atomic_read(&rbuf->head); - val = dapl_os_atomic_assign(&rbuf->head, pos, pos + 1); + val = cmpxchg(&rbuf->head.counter, pos, pos + 1); if (val == pos) { pos = (pos + 1) & rbuf->lim; /* verify in range */ rbuf->base[pos] = entry; @@ -218,7 +218,7 @@ while (atomic_read(&rbuf->head) != atomic_read(&rbuf->tail)) { pos = atomic_read(&rbuf->tail); - val = dapl_os_atomic_assign(&rbuf->tail, pos, pos + 1); + val = cmpxchg(&rbuf->tail.counter, pos, pos + 1); if (val == pos) { pos = (pos + 1) & rbuf->lim; /* verify in range */ return rbuf->base[pos]; Index: Makefile =================================================================== --- Makefile (revision 2946) +++ Makefile (working copy) @@ -7,4 +7,4 @@ dapl_cookie.o dapl_cr.o dapl_ep.o dapl_evd.o \ dapl_hca_util.o dapl_ia.o dapl_lmr.o dapl_provider.o \ dapl_pz.o dapl_ring_buffer_util.o dapl_rmr.o dapl_sp.o \ - dapl_srq.o dapl_util.o + dapl_srq.o Index: dapl_util.c =================================================================== --- dapl_util.c (revision 2946) +++ dapl_util.c (working copy) @@ -1,52 +0,0 @@ -/* - * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. - * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. 
- * - * This Software is licensed under one of the following licenses: - * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. - * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. - */ - -/* - * $Id$ - */ - -#include "dapl.h" -#include "dapl_provider.h" -#include "dapl_util.h" - -#ifdef CONFIG_KDAPL_INFINIBAND_DEBUG - -void dapl_dbg_log(enum dapl_dbg_type type, const char *fmt, ...) -{ - char buf[1024]; - va_list args; - - if (type & g_dapl_dbg_mask) { - va_start(args, fmt); - vsnprintf(buf, sizeof buf, fmt, args); - printk(KERN_ALERT "kDAPL: %s", buf); - va_end(args); - } -} - -#endif /* KDAPL_INFINIBAND_DEBUG */ Index: dapl_util.h =================================================================== --- dapl_util.h (revision 2946) +++ dapl_util.h (working copy) @@ -1,131 +0,0 @@ -/* - * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. - * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. 
- * - * This Software is licensed under one of the following licenses: - * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. - * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. - */ - -/* - * $Id$ - */ - -#ifndef DAPL_UTIL_H -#define DAPL_UTIL_H - -#include -#include -#include -#include -#include -#include -#include /* needed by hash functions */ - -#ifdef __ia64__ -#include -#include -#endif - -/* dapl_os_atomic_assign - * - * assign 'new_value' to '*v' if the current value - * matches the provided 'match_value'. - * - * Make no assignment if there is no match. - * - * Return the current value in any case. - * - * This matches the IBTA atomic operation compare & swap - * except that it is for local memory and a int may - * be only 32 bits, rather than 64. 
- */ - -static inline int dapl_os_atomic_assign(atomic_t * v, int match_value, - int new_value) -{ - int current_value; - - /* - * Use the Pentium compare and exchange instruction - */ - -#ifdef __ia64__ - - current_value = ia64_cmpxchg("acq", v, match_value, new_value, 4); - -#elif defined (__PPC__) - - current_value = - __cmpxchg_u32((volatile int *)v, (int)match_value, (int)new_value); - -#else - current_value = __cmpxchg((volatile void *)v, - (unsigned long)match_value, - (unsigned long)new_value, (int)4); -#endif - return current_value; -} - -/* - * *printf format helper. We use the C string constant concatenation - * ability to define 64 bit formats, which unfortunatly are non standard - * in the C compiler world. - */ -#ifdef __x86_64__ -#define F64x "%lx" -#else -#define F64x "%llx" -#endif - -/* - * Debug Functions - */ - -/* - * Use these bits to enable various tracing/debug options. Each bit - * represents debugging in a particular subsystem or area of the code. - */ -enum dapl_dbg_type { - DAPL_DBG_TYPE_ERR = 0x0001, - DAPL_DBG_TYPE_WARN = 0x0002, - DAPL_DBG_TYPE_EVD = 0x0004, - DAPL_DBG_TYPE_CM = 0x0008, - DAPL_DBG_TYPE_EP = 0x0010, - DAPL_DBG_TYPE_UTIL = 0x0020, - DAPL_DBG_TYPE_CALLBACK = 0x0040, - DAPL_DBG_TYPE_DTO_COMP_ERR = 0x0080, - DAPL_DBG_TYPE_API = 0x0100, - DAPL_DBG_TYPE_RTN = 0x0200, - DAPL_DBG_TYPE_EXCEPTION = 0x0400, - DAPL_DBG_TYPE_SRQ = 0x0800 -}; - -typedef int DAPL_DBG_MASK; - -#ifdef CONFIG_KDAPL_INFINIBAND_DEBUG -extern void dapl_dbg_log(enum dapl_dbg_type type, const char *fmt, ...); -#else /* !KDAPL_INFINIBAND_DEBUG */ -#define dapl_dbg_log(...) 
-#endif /* KDAPL_INFINIBAND_DEBUG */ - -#endif /* DAPL_UTIL_H */ Index: dapl.h =================================================================== --- dapl.h (revision 2946) +++ dapl.h (working copy) @@ -36,10 +36,12 @@ #define DAPL_H #include +#include +#include +#include #include -#include "dapl_util.h" #include "ib_verbs.h" #include "ib_cm.h" @@ -601,4 +603,37 @@ extern int dapl_srq_set_lw(struct dat_srq *srq, int low_watermark); -#endif +/********************************************************************* + * * + * Debug Functions * + * * + *********************************************************************/ + +/* + * Use these bits to enable various tracing/debug options. Each bit + * represents debugging in a particular subsystem or area of the code. + */ +enum dapl_dbg_type { + DAPL_DBG_TYPE_ERR = (1 << 0), + DAPL_DBG_TYPE_WARN = (1 << 1), + DAPL_DBG_TYPE_EVD = (1 << 2), + DAPL_DBG_TYPE_CM = (1 << 3), + DAPL_DBG_TYPE_EP = (1 << 4), + DAPL_DBG_TYPE_UTIL = (1 << 5), + DAPL_DBG_TYPE_CALLBACK = (1 << 6), + DAPL_DBG_TYPE_DTO_COMP_ERR = (1 << 7), + DAPL_DBG_TYPE_API = (1 << 8), + DAPL_DBG_TYPE_RTN = (1 << 9), + DAPL_DBG_TYPE_EXCEPTION = (1 << 10), + DAPL_DBG_TYPE_SRQ = (1 << 11) +}; + +typedef int DAPL_DBG_MASK; + +#ifdef CONFIG_KDAPL_INFINIBAND_DEBUG +extern void dapl_dbg_log(enum dapl_dbg_type type, const char *fmt, ...); +#else /* !KDAPL_INFINIBAND_DEBUG */ +#define dapl_dbg_log(...) 
+#endif /* KDAPL_INFINIBAND_DEBUG */ + +#endif /* DAPL_H */ Index: dapl_provider.c =================================================================== --- dapl_provider.c (revision 2946) +++ dapl_provider.c (working copy) @@ -36,7 +36,6 @@ #include "dapl.h" #include "dapl_hca_util.h" #include "dapl_provider.h" -#include "dapl_util.h" #include "dapl_openib_util.h" MODULE_LICENSE("Dual BSD/GPL"); @@ -56,7 +55,7 @@ *********************************************************************/ #ifdef CONFIG_KDAPL_INFINIBAND_DEBUG -DAPL_DBG_MASK g_dapl_dbg_mask = 0; +static DAPL_DBG_MASK g_dapl_dbg_mask = 0; module_param_named(dbg_mask, g_dapl_dbg_mask, int, 0644); MODULE_PARM_DESC(dbg_mask, "Bitmask to enable debug message types."); #endif /* CONFIG_KDAPL_INFINIBAND_DEBUG */ @@ -144,6 +143,24 @@ * * *********************************************************************/ +#ifdef CONFIG_KDAPL_INFINIBAND_DEBUG + +void dapl_dbg_log(enum dapl_dbg_type type, const char *fmt, ...) +{ + char buf[1024]; + va_list args; + + if (type & g_dapl_dbg_mask) { + va_start(args, fmt); + vsnprintf(buf, sizeof buf, fmt, args); + printk(KERN_ALERT "kDAPL: %s", buf); + va_end(args); + } +} + +#endif /* KDAPL_INFINIBAND_DEBUG */ + + static void dapl_provider_list_destroy(void) { struct list_head *cur_list, *next_list; Index: dapl_openib_cm.c =================================================================== --- dapl_openib_cm.c (revision 2946) +++ dapl_openib_cm.c (working copy) @@ -342,7 +342,7 @@ &cm_ctx->dapl_comp); if (status) { printk(KERN_ERR "dapl_path_comp_handler: " - "ib_at_paths_by_route returned %d id %lld\n", + "ib_at_paths_by_route returned %d id %Lu\n", status, cm_ctx->dapl_comp.req_id); event = DAT_CONNECTION_EVENT_BROKEN; goto error; @@ -413,7 +413,7 @@ &cm_ctx->dapl_comp); if (status) { printk(KERN_ERR "dapl_rt_comp_handler: ib_at_paths_by_route " - "returned %d id %lld\n", status, cm_ctx->dapl_comp.req_id); + "returned %d id %Lu\n", status, cm_ctx->dapl_comp.req_id); event = 
DAT_CONNECTION_EVENT_BROKEN; goto error; } From jlentini at netapp.com Tue Aug 2 10:26:41 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 2 Aug 2005 13:26:41 -0400 (EDT) Subject: [openib-general] kdapl build error on ia64 In-Reply-To: References: <42EAA39B.2090501@sgi.com> <1122934590.15026.42.camel@duffman> <42EF9B35.6020601@sgi.com> Message-ID: On Tue, 2 Aug 2005, James Lentini wrote: > > On Tue, 2 Aug 2005, John Partridge wrote: > >> James, >> >> Do you want me to do a patch killing dapl_os_atomic_assign() as far as I >> can see >> it only get used in two places, dapl_rbuf_add() and dapl_rbuf_remove() >> right ? >> >> I'll base the patch off of svn rev 2944 OK ? >> >> John >> > > Hi John, > > I was just working on one. Here's what I have (note: pine will mess up the > inline patch, please use the attachment if you want to test it out). I'm > going to test it now. These updates worked for me and simplified the code. I'm going to check them in. If anyone sees problems, let me know. james From johnip at sgi.com Tue Aug 2 10:52:21 2005 From: johnip at sgi.com (John Partridge) Date: Tue, 02 Aug 2005 12:52:21 -0500 Subject: [openib-general] kdapl build error on ia64 In-Reply-To: References: <42EAA39B.2090501@sgi.com> <1122934590.15026.42.camel@duffman> <42EF9B35.6020601@sgi.com> Message-ID: <42EFB2D5.7050000@sgi.com> James it applied cleanly to my 2946 tree and the modules loaded OK. John James Lentini wrote: > > > On Tue, 2 Aug 2005, James Lentini wrote: > >> >> On Tue, 2 Aug 2005, John Partridge wrote: >> >>> James, >>> >>> Do you want me to do a patch killing dapl_os_atomic_assign() as far >>> as I can see >>> it only get used in two places, dapl_rbuf_add() and >>> dapl_rbuf_remove() right ? >>> >>> I'll base the patch off of svn rev 2944 OK ? >>> >>> John >>> >> >> Hi John, >> >> I was just working on one. Here's what I have (note: pine will mess up >> the inline patch, please use the attachment if you want to test it >> out). 
I'm going to test it now. > > > These updates worked for me and simplified the code. I'm going to check > them in. If anyone sees problems, let me know. > > james -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From jlentini at netapp.com Tue Aug 2 11:00:10 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 2 Aug 2005 14:00:10 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: inc/dec module ref count In-Reply-To: References: Message-ID: On Tue, 2 Aug 2005, Guy German wrote: >> Guy, can you investigate why the ib_mthca module doesn't have a >> reference count and see if it relates to hotplug? I think >> kdapl_ib and >> ib_mthca should have the same policy regarding this issue. > > As I understand, consumers are working over ib_core and not over > ib_mthca directly. So, if (from a hotplug reason) ib_mthca goes > down, ib_core consumers can get notified of the event, by an upcall. Correct. Is the fact that ib_mthca always has a reference count of 0 a conscious design decision to support hotplug? I think the answer is yes, I just want to make sure. > If you take this model to dapl, I think it would influence the way > dapl consumers need to do things (like registering an upcall and > know what to do in case kdapl_ib is down). > > I also don't know how many consumers really need "dapl hotplug"... Long term, kDAPL should support hotplug. As you say, we will need to modify the API to support this (as noted in the kDAPL TODO list in the Wiki). As the code stands now, we should protect users from accidentally removing kdapl_ib. Are you re-working your patch for this? I've thought about this some more. To be safe, I think the module reference counts should be adjusted in dat_registry_add_provider() and dat_registry_remove_provider(). 
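A rough illustration of the idea in the message above, not code from the thread or from the kDAPL tree: a userspace model in which registering a provider pins its owning module and unregistering releases it, so the module cannot be removed while a provider is still registered. Every name here (model_module, model_registry_add_provider(), the "ib0" provider name in the usage note) is hypothetical. In the kernel the get/put pair would presumably be try_module_get() and module_put(). The message does not say which module is pinned, so the sketch simply assumes it is the module that owns the provider.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PROVIDERS 8

/* Userspace stand-in for a kernel module and its reference count. */
struct model_module {
	const char *name;
	int refcnt;
};

/* One registry slot: a provider name plus the module that owns it. */
struct model_provider {
	const char *name;
	struct model_module *owner;
};

static struct model_provider registry[MAX_PROVIDERS];

/* Models try_module_get(): pin the owner so it cannot be unloaded. */
int model_module_get(struct model_module *m)
{
	m->refcnt++;
	return 1;
}

/* Models module_put(): drop one reference taken by model_module_get(). */
void model_module_put(struct model_module *m)
{
	assert(m->refcnt > 0);
	m->refcnt--;
}

/* Register a provider and take a reference on its owner; 0 on success. */
int model_registry_add_provider(const char *name, struct model_module *owner)
{
	size_t i;

	for (i = 0; i < MAX_PROVIDERS; i++)
		if (registry[i].name && strcmp(registry[i].name, name) == 0)
			return -1;	/* duplicate name, no reference taken */

	for (i = 0; i < MAX_PROVIDERS; i++) {
		if (!registry[i].name) {
			if (!model_module_get(owner))
				return -1;
			registry[i].name = name;
			registry[i].owner = owner;
			return 0;
		}
	}
	return -1;			/* registry full */
}

/* Unregister a provider and release the owner's reference; 0 on success. */
int model_registry_remove_provider(const char *name)
{
	size_t i;

	for (i = 0; i < MAX_PROVIDERS; i++) {
		if (registry[i].name && strcmp(registry[i].name, name) == 0) {
			model_module_put(registry[i].owner);
			registry[i].name = NULL;
			registry[i].owner = NULL;
			return 0;
		}
	}
	return -1;			/* not found, count untouched */
}

/* Models the rmmod check: unloading is refused while references remain. */
int model_module_can_unload(const struct model_module *m)
{
	return m->refcnt == 0;
}
```

Usage, under the same assumptions: after model_registry_add_provider("ib0", &m) succeeds, model_module_can_unload(&m) returns 0 until model_registry_remove_provider("ib0") is called. Failed calls (duplicate name, unknown name) leave the count unchanged, which keeps every get paired with exactly one put.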
From tduffy at sun.com Tue Aug 2 11:00:22 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 02 Aug 2005 11:00:22 -0700 Subject: [openib-general] kdapl build error on ia64 In-Reply-To: <1122990891.4422.77.camel@hal.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> <42EEA0C4.5000707@sgi.com> <1122990891.4422.77.camel@hal.voltaire.com> Message-ID: <1123005623.2946.1.camel@duffman> On Tue, 2005-08-02 at 09:54 -0400, Hal Rosenstock wrote: > I believe that users/jlentini is now obsolete and the trunk supersedes > this so this is fine. James, do you want to delete users/jlentini so as to not confuse anybody? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From jlentini at netapp.com Tue Aug 2 11:04:01 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 2 Aug 2005 14:04:01 -0400 (EDT) Subject: [openib-general] kdapl build error on ia64 In-Reply-To: <1123005623.2946.1.camel@duffman> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B74@taurus.voltaire.com> <42EEA0C4.5000707@sgi.com> <1122990891.4422.77.camel@hal.voltaire.com> <1123005623.2946.1.camel@duffman> Message-ID: On Tue, 2 Aug 2005, Tom Duffy wrote: > On Tue, 2005-08-02 at 09:54 -0400, Hal Rosenstock wrote: >> I believe that users/jlentini is now obsolete and the trunk supersedes >> this so this is fine. > > James, do you want to delete users/jlentini so as to not confuse > anybody? > > -tduffy I'll remove the linux-kernel portion now and the userspace portion after I have Arlin's patch merged in and uDAPL moved to the trunk. 
james From jlentini at netapp.com Tue Aug 2 11:31:20 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 2 Aug 2005 14:31:20 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: adding DAT_MEM_TYPE_IA support In-Reply-To: References: Message-ID: On Tue, 2 Aug 2005, Guy German wrote: > Hi, > > James Lentini wrote: >> On Fri, 29 Jul 2005, Guy German wrote: >> >>>> + array = (u64 *)phys_addr.for_array; /* need to add for_u64_array >>>> to union */ What does this comment mean? >>> >>> I think the right way to do it is : >>> array = phys_addr.for_u64_array >>> (Given the union consists of a new type u64* called "for_u64_array") >> >> I believe the original idea was to have IA memory use the >> DAT_REGION_DESCRIPTION's for_pointer value. > > for_pointer is void*, and: *((void *)array)!=*((u64 *)array) in 32 > bit machines I'd rather update the for_pointer type to be correct (and change the name if necessary) than add a new member, for_u64_array, to the union. From tduffy at sun.com Tue Aug 2 11:36:54 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 02 Aug 2005 11:36:54 -0700 Subject: [openib-general] sdp: cant unload ib_ipoib module In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175B95@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B95@taurus.voltaire.com> Message-ID: <1123007814.2946.12.camel@duffman> On Tue, 2005-08-02 at 16:08 +0300, Hal Rosenstock wrote: > This was reported back a while ago. The simplest scenario I have found to reproduce this is as follows: > > After using SDP, and unload SDP and then unload IPoIB and > got the following: > > unregister_netdevice: waiting for ib0 to become free. Usage count = 1 > > The simplest way I found to recreate this is: > 1. Bring up IPoIB and then SDP > 2. Run tcp.aio.x -t > (no server/receiver) > 3. Wait for connection refused > 4. 
Unload SDP and then IPoIB [root at flopteron2 ~]# modprobe ib_ipoib ip_tables: (C) 2000-2002 Netfilter core team [root at flopteron2 ~]# ifconfig ib0 192.168.0.26 up [root at flopteron2 ~]# ping 192.168.0.0 -b WARNING: pinging broadcast address PING 192.168.0.0 (192.168.0.0) 56(84) bytes of data. 64 bytes from 192.168.0.26: icmp_seq=0 ttl=64 time=0.057 ms 64 bytes from 192.168.0.233: icmp_seq=0 ttl=64 time=0.159 ms (DUP!) <-- snip --> [root at flopteron2 rc]# modprobe ib_sdp [root at flopteron2 ~]# ./ttcp -t -l 65536 -n 100000 -a 20 localhost -p 5002 ttcp-t: buflen = 65536 nbuf = 100000 align = 16384/0 port = 5002 localhost ttcp-t: socket ttcp-t: connect: Connection refused errno=111 [root at flopteron2 ~]# rmmod ib_sdp [root at flopteron2 ~]# rmmod ib_ipoib [root at flopteron2 ~]# I can even shoot stuff over the wire and not have unload issues. What is the problem? Perhaps you need my sdp_inet_port_put() patch? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Tue Aug 2 12:36:46 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Aug 2005 15:36:46 -0400 Subject: [openib-general] Re: Send SA request over umad problem. In-Reply-To: <506C3D7B14CDD411A52C00025558DED60865E20F@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED60865E20F@mtlex01.yok.mtl.com> Message-ID: <1123011406.4442.321.camel@hal.voltaire.com> On Tue, 2005-08-02 at 10:33, Liran Sorani wrote: > Hi , > I'm working on the SM group at Mellanox. > While testing SM-gen2 on a loopback , I've encountered a basic problem > trying to send an SA query (single mad) over osm_vendor (gen2). > Trying to send the request using osm_vendor_send , passed succesfully > , BUT got from the receiver (umad_recv) an error (in an endless loop > ): "No space left on device". 
> The MAD request was simple GSI - SA request of ClassPortInfo , here > are the details , I've truned on debug mode of vendor_lib and umad > (marked in red the important lines in the log ): So are you running an SA client (gen2) making a SA Get ClassPortInfo request of a gen2 OpenSM and getting this problem ? ... > Aug 02 03:35:49 [401776C0] -> osm_vendor_send: [ > warn: [19219] umad_set_addr_net: umad 0x80810d0 dlid 1 dqp 1 sl, qkey > 0 > warn: [19219] umad_dump: agent id 0 status 0 timeout 0 > warn: [19219] umad_addr_dump: qpn 1 qkey 0x80010000 lid 0x1 sl 0 > grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 > Gid 0x00000000000000000000000000000000 > Aug 02 03:35:49 [401776C0] -> osm_vendor_send: RMPP 0 length 256 > warn: [19219] umad_send: portid 0 agentid 0 umad 0x80810d0 timeout > 1000 > Aug 02 03:35:49 [401776C0] -> osm_vendor_send: Completed Sending > Request p_madw = 0x80807dc. > Aug 02 03:35:49 [401776C0] -> osm_vendor_send: ] > Aug 02 03:35:49 [401776C0] -> __osmv_send_sa_req: Waiting for async > event. 
> warn: [19219] umad_recv: read returned 356 > sizeof umad 56 + length > 256 (No space left on device) > Aug 02 03:35:49 [40D7EBB0] -> umad_receiver: recv error No space left > on device > warn: [19219] umad_recv: portid 0 umad 0x8080e28 timeout 4294967295 > warn: [19219] umad_recv: read returned 356 > sizeof umad 56 + length > 256 (No space left on device) > Aug 02 03:35:49 [40D7EBB0] -> umad_receiver: recv error No space left > on device > warn: [19219] umad_recv: portid 0 umad 0x8080e28 timeout 4294967295 > warn: [19219] umad_recv: read returned 356 > sizeof umad 56 + length > 256 (No space left on device) > Aug 02 03:35:49 [40D7EBB0] -> umad_receiver: recv error No space left > on device > warn: [19219] umad_recv: portid 0 umad 0x8080e28 timeout 4294967295 > warn: [19219] umad_recv: read returned 356 > sizeof umad 56 + length > 256 (No space left on device) > Aug 02 03:35:49 [40D7EBB0] -> umad_receiver: recv error No space left > on device For some reason, the response is larger than expected and umad_receiver does not handle this currently. I think I see how to fix this. Is there any easy way to recreate this ? I'm not sure why that (the larger response) is the case for a response to SA Get ClassPortInfo. -- Hal > ... > > Thanks , in advance for your help . > > > Liran Sorani > > Mellanox Technologies LTD. > > mailto:liran at mellanox.co.il > > Phone: +972(4)9097200 Ext: 214 > > Israel, Yokneam P.O.B 586 ZIP 20692 > > > > > > From halr at voltaire.com Tue Aug 2 12:52:48 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Aug 2005 15:52:48 -0400 Subject: [openib-general] Re: Send SA request over umad problem. 
In-Reply-To: <1123011406.4442.321.camel@hal.voltaire.com> References: <506C3D7B14CDD411A52C00025558DED60865E20F@mtlex01.yok.mtl.com> <1123011406.4442.321.camel@hal.voltaire.com> Message-ID: <1123012368.4442.353.camel@hal.voltaire.com> On Tue, 2005-08-02 at 15:36, Hal Rosenstock wrote: > For some reason, the response is larger than expected and umad_receiver > does not handle this currently. I think I see how to fix this. Is there > any easy way to recreate this ? Here's what I get when I do an SA Get ClassPortInfo (via osmtest): Aug 02 15:47:56 [B7F22500] -> osmtest_validate_sa_class_port_info: ----------------------------- SA Class Port Info: base_ver:1 class_ver:2 cap_mask:0x202 resp_time_val:0x64 ----------------------------- > I'm not sure why that (the larger response) is the case for a response > to SA Get ClassPortInfo. I don't see this. There must be something different you are doing. -- Hal From jlentini at netapp.com Tue Aug 2 13:27:49 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 2 Aug 2005 16:27:49 -0400 (EDT) Subject: [openib-general][kdapl]: vmalloc instead of kmalloc In-Reply-To: References: Message-ID: On Tue, 2 Aug 2005, Guy German wrote: > Hi Muli, > > Muli Ben-Yehuda wrote: >> On Tue, Aug 02, 2005 at 06:24:49PM +0300, Guy German wrote: >> >>> There are some places where kmalloc might not be enough : >>> in dapl_evd_event_alloc there is an allocation: >>> >>> event = kmalloc(evd->qlen * sizeof *event); >>> >>> whereas evd->qlen can be 128k (depends on max_cqe of the hca) and >>> kmalloc would fail. >>> >>> The same goes to dapl_rbuf_alloc. >>> >>> Is it legit to replace those kmallocs with vmallocs ? We should only add calls to vmalloc() as a last resort. As Muli points out, they are discouraged. >> Why do we need such a large allocation? kDAPL creates two large pools of memory. One is for events. When the kDAPL consumer creates an EVD, it specifies a queue size (the number of events the EVD can hold). 
The implementation pre-allocates a pool of events equal to the size of the queue. These events are used when an IB upcall is made (e.g. connection request, connection established, async error, etc.) or the kDAPL consumer posts a "software event" via dat_evd_post_se(). The other memory pool is for cookies. A kDAPL event contains certain fields that the IB work completion (ib_wc) does not provide (like the EVD, EP, etc.). For that reason, the kDAPL provider sticks the missing information in a dapl_cookie structure and sets it as the work request's context value. When the work completion comes back, the kDAPL provider pulls the cookie out and uses it to populate the missing event fields. These cookies are also pre-allocated in a pool equal to the EVD size. >> To answer your question, vmalloc has a performance overhead and can >> and will fail when vmalloc-space is exhausted (as can kmalloc, for >> different reasons). Can this allocation be cut down so that it >> becomes a non-issue? The size of the event pool seems much larger than necessary. I would expect most consumers only use a few events from this pool (with no errors or software events, a client will use 2 and a server will use 3). We may be able to eliminate the cookie pool entirely. There are only a few values we need from the cookie. I'll look into that. > evd_min_qlen defines the size of the event queue that the Consumer requested. > sizeof *event = 184 - that leaves ~712 pending events, which is not much. > ISER target is trying to support about 5000 (by their calculations), but other consumers > might want to support even more and there is no reason for dapl to limit what the ib can provide. > Note that iser dequeues the events itself (only the first event is accepted from a callback), hence the
From jlentini at netapp.com Tue Aug 2 14:46:01 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 2 Aug 2005 17:46:01 -0400 (EDT) Subject: [openib-general] Re: fixes to the udapl ucm/uat patch you sent In-Reply-To: References: Message-ID: Hi Arlin, Can I break this patch into 3 parts: the changes to dapl_evd_wait, the changes to dapl_evd_resize, and the ib changes? I think it will be easier to discuss each set of changes seperately (with so many seperate issues, I'm afraid I've missed your reply to some of these questions) dapl_evd_wait: I looked over the original implementation of dapl_evd_wait() with an eye towards the situation you described (the caller polling and finding fewer events than requested, the caller going to turn on notification, an event occuring, the caller turning on notification, the caller blocking unaware of the last event). I don't believe that this would happen in the original implementation. Here's why: after the caller turns on notification, the code loops, via the continue statement on line 213, back to the begining of the for loop on line 173 and repolls. Do you agree? dapl_evd_resize: I'm still unsure of why you removed the call to dapls_evd_event_realloc() and moved the work that was being performed in that routine up into dapl_evd_resize(). If we don't call dapls_evd_event_realloc() anymore, the code should be removed. ib changes: These look ok to me. I've checked them into revision 2955. On Tue, 2 Aug 2005, Or Gerlitz wrote: > Arlin, > > The patch you sent yesterday had some broken lines (97,949,1174,1332,1355,etc) > Here it is with the changes that made it patch fine over 2944 > > Or. 
> > Index: dapl/udapl/dapl_evd_wait.c > =================================================================== > --- dapl/udapl/dapl_evd_wait.c (revision 2919) > +++ dapl/udapl/dapl_evd_wait.c (working copy) > @@ -74,9 +74,10 @@ > DAPL_EVD *evd_ptr; > DAT_RETURN dat_status; > DAT_EVENT *local_event; > - DAT_BOOLEAN notify_requested = DAT_FALSE; > + DAT_BOOLEAN notify_needed = DAT_FALSE; > DAT_BOOLEAN waitable; > DAPL_EVD_STATE evd_state; > + DAT_COUNT total_events,new_events; > > dapl_dbg_log (DAPL_DBG_TYPE_API, > "dapl_evd_wait (%p, %d, %d, %p, %p)\n", > @@ -124,9 +125,9 @@ > } > > dapl_dbg_log (DAPL_DBG_TYPE_EVD, > - "dapl_evd_wait: EVD %p, CQ %p\n", > - evd_ptr, > - (void *)evd_ptr->ib_cq_handle); > + "dapl_evd_wait: EVD %p, CQ %p, Timeout %d, Threshold %d\n", > + evd_ptr,(void *)evd_ptr->ib_cq_handle, time_out, threshold); > + > > /* > * Make sure there are no other waiters and the evd is active. > @@ -144,11 +145,10 @@ > evd_state = dapl_os_atomic_assign ( (DAPL_ATOMIC *)&evd_ptr->evd_state, > (DAT_COUNT) DAPL_EVD_STATE_OPEN, > (DAT_COUNT) DAPL_EVD_STATE_WAITED ); > - dapl_os_unlock ( &evd_ptr->header.lock ); > > - if ( evd_state != DAPL_EVD_STATE_OPEN ) > + dapl_os_unlock ( &evd_ptr->header.lock ); > + if ( evd_state != DAPL_EVD_STATE_OPEN || !waitable) > { > - /* Bogus state, bail out */ > dat_status = DAT_ERROR (DAT_INVALID_STATE,0); > goto bail; > } > @@ -182,37 +182,54 @@ > * return right away if the ib_cq_handle associate with these evd > * equal to IB_INVALID_HANDLE > */ > - dapls_evd_copy_cq(evd_ptr); > - > - if (dapls_rbuf_count(&evd_ptr->pending_event_queue) >= threshold) > - { > - break; > - } > - > - /* > - * Do not enable the completion notification if this evd is not > - * a DTO_EVD or RMR_BIND_EVD > + /* Logic to prevent missing completion between copy_cq (poll) > + * and completion_notify (re-arm) > */ > - if ( (!notify_requested) && > - ((evd_ptr->evd_flags & DAT_EVD_DTO_FLAG) || > - (evd_ptr->evd_flags & DAT_EVD_RMR_BIND_FLAG)) ) > + 
notify_needed = DAT_TRUE; > + new_events = 0; > + while (DAT_TRUE) > { > - dat_status = dapls_ib_completion_notify ( > - evd_ptr->header.owner_ia->hca_ptr->ib_hca_handle, > - evd_ptr, > - (evd_ptr->completion_type == DAPL_EVD_STATE_SOLICITED_WAIT) ? > - IB_NOTIFY_ON_SOLIC_COMP : IB_NOTIFY_ON_NEXT_COMP ); > - > - DAPL_CNTR(DCNT_EVD_WAIT_CMP_NTFY); > - /* FIXME report error */ > - dapl_os_assert(dat_status == DAT_SUCCESS); > + dapls_evd_copy_cq(evd_ptr); /* poll for new completions */ > + total_events = dapls_rbuf_count (&evd_ptr->pending_event_queue); > + new_events = total_events - new_events; > + if (total_events >= threshold || > + (!new_events && notify_needed == DAT_FALSE)) > + { > + break; > + } > + > + /* > + * Do not enable the completion notification if this evd is not > + * a DTO_EVD or RMR_BIND_EVD > + */ > + if ( (evd_ptr->evd_flags & DAT_EVD_DTO_FLAG) || > + (evd_ptr->evd_flags & DAT_EVD_RMR_BIND_FLAG) ) > + { > + dat_status = dapls_ib_completion_notify ( > + evd_ptr->header.owner_ia->hca_ptr->ib_hca_handle, > + evd_ptr, > + (evd_ptr->completion_type == DAPL_EVD_STATE_SOLICITED_WAIT)? > + IB_NOTIFY_ON_SOLIC_COMP : IB_NOTIFY_ON_NEXT_COMP ); > + > + DAPL_CNTR(DCNT_EVD_WAIT_CMP_NTFY); > + notify_needed = DAT_FALSE; > + new_events = total_events; > + > + /* FIXME report error */ > + dapl_os_assert(dat_status == DAT_SUCCESS); > + } > + else > + { > + break; > + } > > - notify_requested = DAT_TRUE; > + } /* while completions < threshold, and rearm needed */ > > - /* Try again. */ > - continue; > + if (total_events >= threshold) > + { > + break; > } > - > + > > /* > * Unused by poster; it has no way to tell how many > @@ -232,8 +249,6 @@ > #endif > dat_status = dapl_os_wait_object_wait ( > &evd_ptr->wait_object, time_out ); > - > - notify_requested = DAT_FALSE; /* We've used it up. 
*/ > > /* See if we were awakened by evd_set_unwaitable */ > if ( !evd_ptr->evd_waitable ) > @@ -243,13 +258,22 @@ > > if (dat_status != DAT_SUCCESS) > { > - /* > - * If the status is DAT_TIMEOUT, we'll break out of the > - * loop, *not* dequeue an event (because dat_status > - * != DAT_SUCCESS), set *nmore (as we should for timeout) > - * and return DAT_TIMEOUT. > - */ > - break; > + /* > + * If the status is DAT_TIMEOUT, we'll break out of the > + * loop, *not* dequeue an event (because dat_status > + * != DAT_SUCCESS), set *nmore (as we should for timeout) > + * and return DAT_TIMEOUT. > + */ > + > +#if defined(DAPL_DBG) > + dapls_evd_copy_cq(evd_ptr); /* poll */ > + dapl_dbg_log (DAPL_DBG_TYPE_EVD, > + "dapl_evd_wait: WAKEUP ERROR (0x%x): EVD %p, CQ %p, events? %d\n", > + dat_status,evd_ptr,(void *)evd_ptr->ib_cq_handle, > + dapls_rbuf_count(&evd_ptr->pending_event_queue) ); > +#endif /* DAPL_DBG */ > + > + break; > } > } > > Index: dapl/udapl/Makefile > =================================================================== > --- dapl/udapl/Makefile (revision 2941) > +++ dapl/udapl/Makefile (working copy) > @@ -122,7 +122,8 @@ > # > ifeq ($(VERBS),openib) > PROVIDER = $(TOPDIR)/../openib > -CFLAGS += -DOPENIB -DCQ_WAIT_OBJECT > +CFLAGS += -DOPENIB > +#CFLAGS += -DCQ_WAIT_OBJECT uncomment when fixed > CFLAGS += -I/usr/local/include/infiniband > endif > > Index: dapl/common/dapl_evd_resize.c > =================================================================== > --- dapl/common/dapl_evd_resize.c (revision 2919) > +++ dapl/common/dapl_evd_resize.c (working copy) > @@ -67,71 +67,139 @@ > IN DAT_EVD_HANDLE evd_handle, > IN DAT_COUNT evd_qlen ) > { > - DAPL_IA *ia_ptr; > - DAPL_EVD *evd_ptr; > - DAT_COUNT pend_cnt; > - DAT_RETURN dat_status; > + DAPL_IA *ia_ptr; > + DAPL_EVD *evd_ptr; > + DAT_EVENT *event_ptr; > + DAT_EVENT *events; > + DAT_EVENT *orig_event; > + DAPL_RING_BUFFER free_event_queue; > + DAPL_RING_BUFFER pending_event_queue; > + DAT_COUNT pend_cnt; > + 
DAT_COUNT i; > + DAT_RETURN dat_status; > > dapl_dbg_log (DAPL_DBG_TYPE_API, "dapl_evd_resize (%p, %d)\n", > evd_handle, evd_qlen); > > if (DAPL_BAD_HANDLE (evd_handle, DAPL_MAGIC_EVD)) > { > - dat_status = DAT_ERROR (DAT_INVALID_HANDLE,0); > - goto bail; > + return DAT_ERROR (DAT_INVALID_PARAMETER,DAT_INVALID_ARG1); > } > > evd_ptr = (DAPL_EVD *)evd_handle; > ia_ptr = evd_ptr->header.owner_ia; > > - if ( evd_qlen == evd_ptr->qlen ) > + if ((evd_qlen <= 0) || (evd_ptr->qlen > evd_qlen)) > { > - dat_status = DAT_SUCCESS; > - goto bail; > + dat_status = DAT_ERROR(DAT_INVALID_PARAMETER,DAT_INVALID_ARG2); > + goto bail; > } > > if ( evd_qlen > ia_ptr->hca_ptr->ia_attr.max_evd_qlen ) > { > - dat_status = DAT_ERROR (DAT_INVALID_PARAMETER,DAT_INVALID_ARG2); > + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_TEVD); > goto bail; > } > > dapl_os_lock(&evd_ptr->header.lock); > > - /* Don't try to resize if we are actively waiting */ > if (evd_ptr->evd_state == DAPL_EVD_STATE_WAITED) > { > - dapl_os_unlock(&evd_ptr->header.lock); > - dat_status = DAT_ERROR (DAT_INVALID_STATE,0); > - goto bail; > + dat_status = DAT_ERROR(DAT_INVALID_STATE,0); > + goto bail_unlock; > } > > pend_cnt = dapls_rbuf_count(&evd_ptr->pending_event_queue); > if (pend_cnt > evd_qlen) { > - dapl_os_unlock(&evd_ptr->header.lock); > - dat_status = DAT_ERROR (DAT_INVALID_STATE,0); > - goto bail; > + dat_status = DAT_ERROR(DAT_INVALID_STATE,0); > + goto bail_unlock; > } > > dat_status = dapls_ib_cq_resize(evd_ptr->header.owner_ia, > - evd_ptr, > - &evd_qlen); > - if (dat_status != DAT_SUCCESS) > + evd_ptr, > + &evd_qlen); > + if (DAT_SUCCESS != dat_status) { > + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); > + goto bail_unlock; > + } > + > + /* Allocate EVENTs */ > + events = (DAT_EVENT *) dapl_os_alloc (evd_qlen * sizeof (DAT_EVENT)); > + if (!events) > { > - dapl_os_unlock(&evd_ptr->header.lock); > - goto bail; > + dat_status = 
DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); > + goto bail_unlock; > } > + event_ptr = events; > > - dat_status = dapls_evd_event_realloc (evd_ptr, evd_qlen); > - if (dat_status != DAT_SUCCESS) > + /* allocate free event queue */ > + dat_status = dapls_rbuf_alloc (&free_event_queue, evd_qlen); > + if (DAT_SUCCESS != dat_status) > { > - dapl_os_unlock(&evd_ptr->header.lock); > - goto bail; > + dapl_os_free(event_ptr, evd_qlen * sizeof (DAT_EVENT)); > + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); > + goto bail_unlock; > + } > + > + /* allocate pending event queue */ > + dat_status = dapls_rbuf_alloc (&pending_event_queue, evd_qlen); > + if (DAT_SUCCESS != dat_status) > + { > + dapl_os_free(event_ptr, evd_qlen * sizeof (DAT_EVENT)); > + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); > + goto bail_unlock; > } > > + for (i = 0; i < pend_cnt; i++) > + { > + orig_event = dapls_rbuf_remove(&evd_ptr->pending_event_queue); > + if (orig_event == NULL) { > + dapl_dbg_log (DAPL_DBG_TYPE_ERR, " Inconsistent event queue\n"); > + dapl_os_free(event_ptr, evd_qlen * sizeof (DAT_EVENT)); > + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); > + goto bail_unlock; > + } > + memcpy(event_ptr, orig_event, sizeof(DAT_EVENT)); > + dat_status = dapls_rbuf_add(&pending_event_queue, event_ptr); > + if (DAT_SUCCESS != dat_status) { > + dapl_os_free(event_ptr, evd_qlen * sizeof (DAT_EVENT)); > + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); > + goto bail_unlock; > + } > + event_ptr++; > + } > + > + for (i = pend_cnt; i < evd_qlen; i++) > + { > + dat_status = dapls_rbuf_add(&free_event_queue,(void *) event_ptr); > + if (DAT_SUCCESS != dat_status) { > + dapl_os_free(event_ptr, evd_qlen * sizeof (DAT_EVENT)); > + dat_status = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); > + goto bail_unlock; > + } > + event_ptr++; > + } > + > + dapls_rbuf_destroy 
(&evd_ptr->free_event_queue); > + dapls_rbuf_destroy (&evd_ptr->pending_event_queue); > + if (evd_ptr->events) > + { > + dapl_os_free (evd_ptr->events, evd_ptr->qlen * sizeof (DAT_EVENT)); > + } > + evd_ptr->free_event_queue = free_event_queue; > + evd_ptr->pending_event_queue = pending_event_queue; > + evd_ptr->events = events; > + evd_ptr->qlen = evd_qlen; > + > +bail_unlock: > + > dapl_os_unlock(&evd_ptr->header.lock); > > - bail: > + dapl_dbg_log (DAPL_DBG_TYPE_RTN, > + "dapl_evd_resize returns 0x%x\n",dat_status); > + > +bail: > + > return dat_status; > } > > Index: dapl/openib/TODO > =================================================================== > --- dapl/openib/TODO (revision 2919) > +++ dapl/openib/TODO (working copy) > @@ -1,7 +1,7 @@ > > IB Verbs: > - CQ resize? > -- query call to get current qp state > +- query call to get current qp state, remote port number > - ibv_get_cq_event() needs timed event call and wakeup > - query call to get device attributes > - memory window support > @@ -9,8 +9,6 @@ > DAPL: > - reinit EP needs a QP timewait completion notification > - add cq_object wakeup, time based cq_object wait when verbs support arrives > -- update uDAPL code with real ATS support > -- etc, etc. > > Other: > - Shared memory in udapl and kernel module to support? 
> Index: dapl/openib/dapl_ib_util.c > =================================================================== > --- dapl/openib/dapl_ib_util.c (revision 2919) > +++ dapl/openib/dapl_ib_util.c (working copy) > @@ -111,27 +111,40 @@ > } > > > -/* just get IP address for hostname */ > -int dapli_get_addr( char *addr, int addr_len) > +/* just get IP address, IPv4 only for now */ > +int dapli_get_hca_addr( struct dapl_hca *hca_ptr ) > { > - struct sockaddr_in *ipv4_addr = (struct sockaddr_in*)addr; > - struct hostent *h_ptr; > - struct utsname ourname; > - > - if ( uname( &ourname ) < 0 ) > - return 1; > - > - h_ptr = gethostbyname( ourname.nodename ); > - if ( h_ptr == NULL ) > + struct sockaddr_in *ipv4_addr; > + struct ib_at_completion at_comp; > + struct dapl_at_record at_rec; > + int status; > + DAT_RETURN dat_status; > + > + ipv4_addr = (struct sockaddr_in*)&hca_ptr->hca_address; > + ipv4_addr->sin_family = AF_INET; > + ipv4_addr->sin_addr.s_addr = 0; > + > + at_comp.fn = dapli_ip_comp_handler; > + at_comp.context = &at_rec; > + at_rec.addr = &hca_ptr->hca_address; > + at_rec.wait_object = &hca_ptr->ib_trans.wait_object; > + > + /* call with async_comp until the sync version works */ > + status = ib_at_ips_by_gid(&hca_ptr->ib_trans.gid, &ipv4_addr->sin_addr.s_addr, 1, > + &at_comp, &at_rec.req_id); > + > + if (status < 0) > return 1; > - > - if ( h_ptr->h_addrtype == AF_INET ) { > - ipv4_addr = (struct sockaddr_in*) addr; > - ipv4_addr->sin_family = AF_INET; > - dapl_os_memcpy( &ipv4_addr->sin_addr, h_ptr->h_addr_list[0], 4 ); > - } else > + > + if (status > 0) > + dapli_ip_comp_handler(at_rec.req_id, (void*)ipv4_addr, status); > + > + /* wait for answer, 5 seconds max */ > + dat_status = dapl_os_wait_object_wait (&hca_ptr->ib_trans.wait_object,5000000); > + > + if ((dat_status != DAT_SUCCESS ) || (!ipv4_addr->sin_addr.s_addr)) > return 1; > - > + > return 0; > } > > @@ -152,14 +165,17 @@ > */ > int32_t dapls_ib_init (void) > { > - if (dapli_cm_thread_init()) > - 
return -1; > - else > - return 0; > + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " dapl_ib_init: \n" ); > + if (dapli_cm_thread_init() || dapli_at_thread_init()) > + return 1; > + > + return 0; > } > > int32_t dapls_ib_release (void) > { > + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " dapl_ib_release: \n" ); > + dapli_at_thread_destroy(); > dapli_cm_thread_destroy(); > return 0; > } > @@ -186,7 +202,6 @@ > IN DAPL_HCA *hca_ptr) > { > struct dlist *dev_list; > - DAT_RETURN dat_status = DAT_SUCCESS; > > dapl_dbg_log (DAPL_DBG_TYPE_UTIL, > " open_hca: %s - %p\n", hca_name, hca_ptr ); > @@ -217,36 +232,46 @@ > ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); > return DAT_INTERNAL_ERROR; > } > - > + > /* set inline max with enviromment or default, get local lid and gid 0 */ > hca_ptr->ib_trans.max_inline_send = > dapl_os_get_env_val ( "DAPL_MAX_INLINE", INLINE_SEND_DEFAULT ); > > - if ( dapli_get_lid(hca_ptr, hca_ptr->port_num, > - &hca_ptr->ib_trans.lid )) { > + if (dapli_get_lid(hca_ptr, hca_ptr->port_num, > + &hca_ptr->ib_trans.lid)) { > dapl_dbg_log (DAPL_DBG_TYPE_ERR, > " open_hca: IB get LID failed for %s\n", > ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); > - return DAT_INTERNAL_ERROR; > + goto bail; > } > > - if ( dapli_get_gid(hca_ptr, hca_ptr->port_num, 0, > - &hca_ptr->ib_trans.gid )) { > + if (dapli_get_gid(hca_ptr, hca_ptr->port_num, 0, > + &hca_ptr->ib_trans.gid)) { > dapl_dbg_log (DAPL_DBG_TYPE_ERR, > " open_hca: IB get GID failed for %s\n", > ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); > - return DAT_INTERNAL_ERROR; > + goto bail; > } > - > /* get the IP address of the device */ > - if ( dapli_get_addr((char*)&hca_ptr->hca_address, > - sizeof(DAT_SOCK_ADDR6) )) { > + if (dapli_get_hca_addr(hca_ptr)) { > dapl_dbg_log (DAPL_DBG_TYPE_ERR, > " open_hca: IB get ADDR failed for %s\n", > ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); > - return DAT_INTERNAL_ERROR; > + goto bail; > + } > + > + /* one thread for each device open */ > + if (dapli_cq_thread_init(hca_ptr)) 
{ > + dapl_dbg_log (DAPL_DBG_TYPE_ERR, > + " open_hca: cq_thread_init failed for %s\n", > + ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); > + goto bail; > } > > + /* initialize cq_lock and wait object */ > + dapl_os_lock_init(&hca_ptr->ib_trans.cq_lock); > + dapl_os_wait_object_init (&hca_ptr->ib_trans.wait_object); > + > dapl_dbg_log (DAPL_DBG_TYPE_UTIL, > " open_hca: %s, port %d, %s %d.%d.%d.%d INLINE_MAX=%d\n", > ibv_get_device_name(hca_ptr->ib_trans.ib_dev), hca_ptr->port_num, > @@ -257,7 +282,19 @@ > ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 24 & 0xff, > hca_ptr->ib_trans.max_inline_send ); > > - return dat_status; > + dapl_dbg_log(DAPL_DBG_TYPE_CM, > + " open_hca: LID 0x%x GID subnet %016llx id %016llx\n", > + hca_ptr->ib_trans.lid, > + (unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.subnet_prefix), > + (unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.interface_id) ); > + > + return DAT_SUCCESS; > + > +bail: > + ibv_close_device(hca_ptr->ib_hca_handle); > + hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; > + return DAT_INTERNAL_ERROR; > + > } > > > @@ -282,10 +319,14 @@ > dapl_dbg_log (DAPL_DBG_TYPE_UTIL," close_hca: %p\n",hca_ptr); > > if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) { > + dapli_cq_thread_destroy(hca_ptr); > if (ibv_close_device(hca_ptr->ib_hca_handle)) > return(dapl_convert_errno(errno,"ib_close_device")); > hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; > } > + > + dapl_os_lock_destroy(&hca_ptr->ib_trans.cq_lock); > + > return (DAT_SUCCESS); > } > > @@ -448,35 +489,4 @@ > return DAT_SUCCESS; > } > > -#ifdef PROVIDER_SPECIFIC_ATTR > - > -/* > - * dapls_set_provider_specific_attr > - * > - * Input: > - * attr_ptr Pointer provider attributes > - * > - * Output: > - * none > - * > - * Returns: > - * void > - */ > -DAT_NAMED_ATTR ib_attrs[] = { > - { > - "I_DAT_SEND_INLINE_THRESHOLD", > - "128" > - }, > -}; > - > -#define SPEC_ATTR_SIZE( x ) (sizeof( x ) / sizeof( DAT_NAMED_ATTR)) > - > -void 
dapls_set_provider_specific_attr( > - IN DAT_PROVIDER_ATTR *attr_ptr ) > -{ > - attr_ptr->num_provider_specific_attr = SPEC_ATTR_SIZE( ib_attrs ); > - attr_ptr->provider_specific_attr = ib_attrs; > -} > - > -#endif > > Index: dapl/openib/dapl_ib_cm.c > =================================================================== > --- dapl/openib/dapl_ib_cm.c (revision 2919) > +++ dapl/openib/dapl_ib_cm.c (working copy) > @@ -70,19 +70,8 @@ > static inline uint64_t cpu_to_be64(uint64_t x) { return x; } > #endif > > -#ifndef IB_AT > - > -#include > -#include > -#include > -#include > -#include > -#include > - > -/* iclust-20 hard coded values, network order */ > -#define REMOTE_GID "fe80:0000:0000:0000:0002:c902:0000:4071" > -#define REMOTE_LID "0002" > - > +static int g_at_destroy; > +static DAPL_OS_THREAD g_at_thread; > static int g_cm_destroy; > static DAPL_OS_THREAD g_cm_thread; > static DAPL_OS_LOCK g_cm_lock; > @@ -122,7 +111,7 @@ > while (g_cm_destroy) { > struct timespec sleep, remain; > sleep.tv_sec = 0; > - sleep.tv_nsec = 200000000; /* 200 ms */ > + sleep.tv_nsec = 10000000; /* 10 ms */ > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " cm_thread_destroy: waiting for cm_thread\n"); > nanosleep (&sleep, &remain); > @@ -130,112 +119,70 @@ > dapl_dbg_log(DAPL_DBG_TYPE_CM," cm_thread_destroy(%d) exit\n",getpid()); > } > > -static int ib_at_route_by_ip(uint32_t dst_ip, uint32_t src_ip, int tos, uint16_t flags, > - struct ib_at_ib_route *ib_route, > - struct ib_at_completion *async_comp) > -{ > - struct dapl_cm_id *conn = (struct dapl_cm_id *)async_comp->context; > - > - dapl_dbg_log ( > - DAPL_DBG_TYPE_CM, > - " CM at_route_by_ip: conn %p cm_id %d src %d.%d.%d.%d -> dst %d.%d.%d.%d (%d)\n", > - conn,conn->cm_id, > - src_ip >> 0 & 0xff, src_ip >> 8 & 0xff, > - src_ip >> 16 & 0xff,src_ip >> 24 & 0xff, > - dst_ip >> 0 & 0xff, dst_ip >> 8 & 0xff, > - dst_ip >> 16 & 0xff,dst_ip >> 24 & 0xff, conn->service_id); > - > - /* use req_id for loopback indication */ > - if (( src_ip == dst_ip ) 
|| ( dst_ip == 0x0100007f )) > - async_comp->req_id = 1; > - else > - async_comp->req_id = 0; > - > - return 1; > -} > - > -static int ib_at_paths_by_route(struct ib_at_ib_route *ib_route, uint32_t mpath_type, > - struct ib_sa_path_rec *pr, int npath, > - struct ib_at_completion *async_comp) > +int dapli_at_thread_init(void) > { > - struct dapl_cm_id *conn = (struct dapl_cm_id *)async_comp->context; > - char *env, *token; > - char dgid[40]; > - uint16_t *p_gid = (uint16_t*)&ib_route->gid; > + DAT_RETURN dat_status; > > - /* set local path record values and send to remote */ > - (void)dapl_os_memzero(pr, sizeof(*pr)); > + dapl_dbg_log(DAPL_DBG_TYPE_CM," at_thread_init(%d)\n", getpid()); > > - pr->slid = htons(conn->hca->ib_trans.lid); > - pr->sgid.global.subnet_prefix = conn->hca->ib_trans.gid.global.subnet_prefix; > - pr->sgid.global.interface_id = conn->hca->ib_trans.gid.global.interface_id; > + /* create thread to process AT async requests */ > + dat_status = dapl_os_thread_create(at_thread, NULL, &g_at_thread); > + if (dat_status != DAT_SUCCESS) > + { > + dapl_dbg_log(DAPL_DBG_TYPE_ERR, > + " at_thread_init: failed to create thread\n"); > + return 1; > + } > + return 0; > +} > > - env = getenv("DAPL_REMOTE_LID"); > - if ( env == NULL ) > - env = REMOTE_LID; > - ib_route->lid = strtol(env,NULL,0); > +void dapli_at_thread_destroy(void) > +{ > + dapl_dbg_log(DAPL_DBG_TYPE_CM," at_thread_destroy(%d)\n", getpid()); > > - env = getenv("DAPL_REMOTE_GID"); > - if ( env == NULL ) > - env = REMOTE_GID; > + /* destroy cr_thread and lock */ > + g_at_destroy = 1; > + pthread_kill( g_at_thread, SIGUSR1 ); > + dapl_dbg_log(DAPL_DBG_TYPE_CM," at_thread_destroy(%d) SIGUSR1 sent\n",getpid()); > + while (g_at_destroy) { > + struct timespec sleep, remain; > + sleep.tv_sec = 0; > + sleep.tv_nsec = 10000000; /* 10 ms */ > + dapl_dbg_log(DAPL_DBG_TYPE_CM, > + " at_thread_destroy: waiting for at_thread\n"); > + nanosleep (&sleep, &remain); > + } > + dapl_dbg_log(DAPL_DBG_TYPE_CM," 
at_thread_destroy(%d) exit\n",getpid()); > +} > > - dapl_dbg_log(DAPL_DBG_TYPE_CM, > - " ib_at_paths_by_route: remote LID %x GID %s\n", > - ib_route->lid,env); > +void dapli_ip_comp_handler(uint64_t req_id, void *context, int rec_num) > +{ > + struct dapl_at_record *at_rec = context; > > - dapl_os_memcpy( dgid, env, 40 ); > + dapl_dbg_log(DAPL_DBG_TYPE_CM, > + " ip_comp_handler: ctxt %p, req_id %lld rec_num %d\n", > + context, req_id, rec_num); > > - /* get GID with token strings and delimiter */ > - token = strtok(dgid,":"); > - while (token) { > - *p_gid = strtoul(token,NULL,16); > - *p_gid = htons(*p_gid); /* convert each token to network order */ > - token = strtok(NULL,":"); > - p_gid++; > - } > - > - /* set remote lid and gid, req_id is indication of loopback */ > - if ( !async_comp->req_id ) { > - pr->dlid = htons(ib_route->lid); > - pr->dgid.global.subnet_prefix = ib_route->gid.global.subnet_prefix; > - pr->dgid.global.interface_id = ib_route->gid.global.interface_id; > - } else { > - pr->dlid = pr->slid; > - pr->dgid.global.subnet_prefix = pr->sgid.global.subnet_prefix; > - pr->dgid.global.interface_id = pr->sgid.global.interface_id; > - } > - > - pr->reversible = 0x1000000; > - pr->pkey = 0xffff; > - pr->mtu = IBV_MTU_1024; > - pr->mtu_selector = 2; > - pr->rate_selector = 2; > - pr->rate = 3; > - pr->packet_life_time_selector = 2; > - pr->packet_life_time = 2; > + if ((at_rec) && ( at_rec->req_id == req_id)) { > + dapl_os_wait_object_wakeup(at_rec->wait_object); > + return; > + } > > - dapl_dbg_log(DAPL_DBG_TYPE_CM, > - " ib_at_paths_by_route: SRC LID 0x%x GID subnet %016llx id %016llx\n", > - pr->slid,(unsigned long long)(pr->sgid.global.subnet_prefix), > - (unsigned long long)(pr->sgid.global.interface_id) ); > - dapl_dbg_log(DAPL_DBG_TYPE_CM, > - " ib_at_paths_by_route: DST LID 0x%x GID subnet %016llx id %016llx\n", > - pr->dlid,(unsigned long long)(pr->dgid.global.subnet_prefix), > - (unsigned long long)(pr->dgid.global.interface_id) ); > - > - 
dapli_path_comp_handler( async_comp->req_id, (void*)conn, 1); > - > - return 0; > + dapl_dbg_log(DAPL_DBG_TYPE_ERR, > + " ip_comp_handler: at_rec->req_id %lld != req_id %lld\n", > + at_rec->req_id, req_id ); > } > > -#endif /* ifndef IB_AT */ > - > static void dapli_path_comp_handler(uint64_t req_id, void *context, int rec_num) > { > struct dapl_cm_id *conn = context; > int status; > ib_cm_events_t event; > > + dapl_dbg_log(DAPL_DBG_TYPE_CM, > + " path_comp_handler: ctxt %p, req_id %lld rec_num %d\n", > + context, req_id, rec_num); > + > if (rec_num <= 0) { > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " path_comp_handler: resolution err %d retry %d\n", > @@ -249,7 +196,7 @@ > > status = ib_at_paths_by_route(&conn->dapl_rt, 0, > &conn->dapl_path, 1, > - &conn->dapl_comp); > + &conn->dapl_comp, &conn->dapl_comp.req_id); > if (status) { > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " path_by_route: err %d id %lld\n", > @@ -287,6 +234,21 @@ > int status; > ib_cm_events_t event; > > + dapl_dbg_log(DAPL_DBG_TYPE_CM, > + " rt_comp_handler: conn %p, req_id %lld rec_num %d\n", > + conn, req_id, rec_num); > + > + dapl_dbg_log(DAPL_DBG_TYPE_CM, > + " rt_comp_handler: SRC GID subnet %016llx id %016llx\n", > + (unsigned long long)cpu_to_be64(conn->dapl_rt.sgid.global.subnet_prefix), > + (unsigned long long)cpu_to_be64(conn->dapl_rt.sgid.global.interface_id) ); > + > + dapl_dbg_log(DAPL_DBG_TYPE_CM, > + " rt_comp_handler: DST GID subnet %016llx id %016llx\n", > + (unsigned long long)cpu_to_be64(conn->dapl_rt.dgid.global.subnet_prefix), > + (unsigned long long)cpu_to_be64(conn->dapl_rt.dgid.global.interface_id) ); > + > + > if (rec_num <= 0) { > dapl_dbg_log(DAPL_DBG_TYPE_ERR, > " dapl_rt_comp_handler: rec %d retry %d\n", > @@ -298,7 +260,8 @@ > } > > status = ib_at_route_by_ip(((struct sockaddr_in *)&conn->r_addr)->sin_addr.s_addr, > - 0, 0, 0, &conn->dapl_rt, &conn->dapl_comp); > + 0, 0, 0, &conn->dapl_rt, > + &conn->dapl_comp,&conn->dapl_comp.req_id); > if (status < 0) { > 
dapl_dbg_log(DAPL_DBG_TYPE_ERR, "dapl_rt_comp_handler: " > "ib_at_route_by_ip failed with status %d\n", > @@ -306,9 +269,16 @@ > event = IB_CME_DESTINATION_UNREACHABLE; > goto bail; > } > - > if (status == 1) > dapli_rt_comp_handler(conn->dapl_comp.req_id, conn, 1); > + > + return; > + } > + > + if (!conn->dapl_rt.dgid.global.subnet_prefix || req_id != conn->dapl_comp.req_id) { > + dapl_dbg_log(DAPL_DBG_TYPE_ERR, > + " dapl_rt_comp_handler: ERROR: unexpected callback req_id=%d(%d)\n", > + req_id, conn->dapl_comp.req_id ); > return; > } > > @@ -316,7 +286,7 @@ > conn->dapl_comp.context = conn; > conn->retries = 0; > status = ib_at_paths_by_route(&conn->dapl_rt, 0, &conn->dapl_path, 1, > - &conn->dapl_comp); > + &conn->dapl_comp, &conn->dapl_comp.req_id); > if (status) { > dapl_dbg_log(DAPL_DBG_TYPE_ERR, > "dapl_rt_comp_handler: ib_at_paths_by_route " > @@ -346,8 +316,6 @@ > ib_cm_destroy_id(conn->cm_id); > if (conn->ep) > conn->ep->cm_handle = IB_INVALID_HANDLE; > - if (conn->sp) > - conn->sp->cm_srvc_handle = IB_INVALID_HANDLE; > > /* take off the CM thread work queue and free */ > dapl_os_lock( &g_cm_lock ); > @@ -621,10 +589,8 @@ > } > > /* something to catch the signal */ > -static void cm_handler(int signum) > +static void ib_sig_handler(int signum) > { > - dapl_dbg_log (DAPL_DBG_TYPE_CM," cm_thread(%d,0x%x): ENTER cm_handler %d\n", > - getpid(),g_cm_thread,signum); > return; > } > > @@ -643,7 +609,7 @@ > sigemptyset(&sigset); > sigaddset(&sigset, SIGUSR1); > pthread_sigmask(SIG_UNBLOCK, &sigset, NULL); > - signal( SIGUSR1, cm_handler); > + signal( SIGUSR1, ib_sig_handler); > > dapl_os_lock( &g_cm_lock ); > while (!g_cm_destroy) { > @@ -667,7 +633,7 @@ > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " cm_thread: GET EVENT fd=%d n=%d\n", > ib_cm_get_fd(),ret); > - if (ib_cm_event_get(&event)) { > + if (ib_cm_event_get_timed(0,&event)) { > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " cm_thread: ERR %s eventi_get on %d\n", > strerror(errno), ib_cm_get_fd() ); > @@ -732,6 +698,33 @@ > 
g_cm_destroy = 0; > } > > +/* async AT processing thread */ > +void at_thread(void *arg) > +{ > + sigset_t sigset; > + > + dapl_dbg_log (DAPL_DBG_TYPE_CM, > + " at_thread(%d,0x%x): ENTER: at_fd %d\n", > + getpid(), g_at_thread, ib_at_get_fd()); > + > + sigemptyset(&sigset); > + sigaddset(&sigset, SIGUSR1); > + pthread_sigmask(SIG_UNBLOCK, &sigset, NULL); > + signal(SIGUSR1, ib_sig_handler); > + > + while (!g_at_destroy) { > + /* poll forever until callback or signal */ > + if (ib_at_callback_get_timed(-1) < 0) { > + dapl_dbg_log(DAPL_DBG_TYPE_CM, > + " at_thread: SIG? ret=%s, destroy=%d\n", > + strerror(errno), g_at_destroy ); > + } > + dapl_dbg_log(DAPL_DBG_TYPE_CM," at_thread: callback woke\n"); > + } > + dapl_dbg_log(DAPL_DBG_TYPE_CM," at_thread(%d) EXIT \n", getpid()); > + g_at_destroy = 0; > +} > + > /************************ DAPL provider entry points **********************/ > > /* > @@ -826,33 +819,34 @@ > conn->dapl_comp.context = conn; > conn->retries = 0; > dapl_os_memcpy(&conn->r_addr, r_addr, sizeof(DAT_SOCK_ADDR6)); > + > + /* put on CM thread work queue */ > + dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&conn->entry); > + dapl_os_lock( &g_cm_lock ); > + dapl_llist_add_tail(&g_cm_list, > + (DAPL_LLIST_ENTRY*)&conn->entry, conn); > + dapl_os_unlock(&g_cm_lock); > > status = ib_at_route_by_ip( > ((struct sockaddr_in *)&conn->r_addr)->sin_addr.s_addr, > ((struct sockaddr_in *)&conn->hca->hca_address)->sin_addr.s_addr, > - 0, 0, &conn->dapl_rt, &conn->dapl_comp); > + 0, 0, &conn->dapl_rt, &conn->dapl_comp, &conn->dapl_comp.req_id); > + > + dapl_dbg_log(DAPL_DBG_TYPE_CM, " connect: at_route ret=%d,%s req_id %d GID %016llx %016llx\n", > + status, strerror(errno), conn->dapl_comp.req_id, > + (unsigned long long)cpu_to_be64(conn->dapl_rt.dgid.global.subnet_prefix), > + (unsigned long long)cpu_to_be64(conn->dapl_rt.dgid.global.interface_id) ); > > if (status < 0) { > dat_status = dapl_convert_errno(errno,"ib_at_route_by_ip"); > - goto destroy; > + 
dapli_destroy_cm_id(conn); > + return dat_status; > } > - if (status == 1) > - dapli_rt_comp_handler(conn->dapl_comp.req_id, conn, 1); > > - > - /* put on CM thread work queue */ > - dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&conn->entry); > - dapl_os_lock( &g_cm_lock ); > - dapl_llist_add_tail(&g_cm_list, > - (DAPL_LLIST_ENTRY*)&conn->entry, conn); > - dapl_os_unlock(&g_cm_lock); > + if (status > 0) > + dapli_rt_comp_handler(conn->dapl_comp.req_id, conn, status); > > return DAT_SUCCESS; > - > -destroy: > - dapli_destroy_cm_id(conn); > - return dat_status; > - > } > > /* > @@ -992,6 +986,13 @@ > conn->hca = ia_ptr->hca_ptr; > conn->service_id = ServiceID; > > + /* put on CM thread work queue */ > + dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&conn->entry); > + dapl_os_lock( &g_cm_lock ); > + dapl_llist_add_tail(&g_cm_list, > + (DAPL_LLIST_ENTRY*)&conn->entry, conn); > + dapl_os_unlock(&g_cm_lock); > + > dapl_dbg_log(DAPL_DBG_TYPE_EP, > " setup_listener(conn=%p cm_id=%d)\n", > sp_ptr->cm_srvc_handle,conn->cm_id); > @@ -1003,19 +1004,13 @@ > dat_status = DAT_CONN_QUAL_IN_USE; > else > dat_status = DAT_INSUFFICIENT_RESOURCES; > - /* success */ > - } else { > - /* put on CM thread work queue */ > - dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&conn->entry); > - dapl_os_lock( &g_cm_lock ); > - dapl_llist_add_tail(&g_cm_list, > - (DAPL_LLIST_ENTRY*)&conn->entry, conn); > - dapl_os_unlock(&g_cm_lock); > + > + dapli_destroy_cm_id(conn); > return dat_status; > } > > - dapli_destroy_cm_id(conn); > - return dat_status; > + /* success */ > + return DAT_SUCCESS; > } > > > @@ -1047,9 +1042,11 @@ > " remove_listener(ia_ptr %p sp_ptr %p cm_ptr %p)\n", > ia_ptr, sp_ptr, conn ); > > - if (sp_ptr->cm_srvc_handle != IB_INVALID_HANDLE) > + if (conn != IB_INVALID_HANDLE) { > + sp_ptr->cm_srvc_handle = NULL; > dapli_destroy_cm_id(conn); > - > + } > + > return DAT_SUCCESS; > } > > Index: dapl/openib/dapl_ib_util.h > =================================================================== > --- 
dapl/openib/dapl_ib_util.h (revision 2919) > +++ dapl/openib/dapl_ib_util.h (working copy) > @@ -53,6 +53,7 @@ > #include > #include > #include > +#include > > /* Typedefs to map common DAPL provider types to IB verbs */ > typedef struct ibv_qp *ib_qp_handle_t; > @@ -68,8 +69,8 @@ > > #define IB_RC_RETRY_COUNT 7 > #define IB_RNR_RETRY_COUNT 7 > -#define IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */ > -#define IB_MAX_CM_RETRIES 4 > +#define IB_CM_RESPONSE_TIMEOUT 18 /* 1 sec */ > +#define IB_MAX_CM_RETRIES 7 > > #define IB_REQ_MRA_TIMEOUT 27 /* a little over 9 minutes */ > #define IB_MAX_AT_RETRY 3 > @@ -92,21 +93,12 @@ > IB_CME_BROKEN > } ib_cm_events_t; > > -#ifndef IB_AT > -/* implement a quick hack to exchange GID/LID's until user IB_AT arrives */ > -struct ib_at_ib_route { > - union ibv_gid gid; > - uint16_t lid; > +struct dapl_at_record { > + uint64_t req_id; > + DAT_SOCK_ADDR6 *addr; > + DAPL_OS_WAIT_OBJECT *wait_object; > }; > > -struct ib_at_completion { > - void (*fn)(uint64_t req_id, void *context, int rec_num); > - void *context; > - uint64_t req_id; > -}; > - > -#endif > - > /* > * dapl_llist_entry in dapl.h but dapl.h depends on provider > * typedef's in this file first. 
move dapl_llist_entry out of dapl.h > @@ -122,6 +114,7 @@ > struct dapl_cm_id { > struct ib_llist_entry entry; > DAPL_OS_LOCK lock; > + DAPL_OS_WAIT_OBJECT wait_object; > int retries; > int destroy; > int in_callback; > @@ -238,6 +231,10 @@ > { > struct ibv_device *ib_dev; > ib_cq_handle_t ib_cq_empty; > + DAPL_OS_LOCK cq_lock; > + DAPL_OS_WAIT_OBJECT wait_object; > + int cq_destroy; > + DAPL_OS_THREAD cq_thread; > int max_inline_send; > uint16_t lid; > union ibv_gid gid; > @@ -257,11 +254,18 @@ > void cm_thread (void *arg); > int dapli_cm_thread_init(void); > void dapli_cm_thread_destroy(void); > +void at_thread (void *arg); > +int dapli_at_thread_init(void); > +void dapli_at_thread_destroy(void); > +void cq_thread (void *arg); > +int dapli_cq_thread_init(struct dapl_hca *hca_ptr); > +void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr); > > -int dapli_get_lid(struct dapl_hca *hca_ptr, int port, uint16_t *lid ); > +int dapli_get_lid(struct dapl_hca *hca_ptr, int port, uint16_t *lid); > int dapli_get_gid(struct dapl_hca *hca_ptr, int port, int index, > - union ibv_gid *gid ); > -int dapli_get_addr(char *addr, int addr_len); > + union ibv_gid *gid); > +int dapli_get_hca_addr(struct dapl_hca *hca_ptr); > +void dapli_ip_comp_handler(uint64_t req_id, void *context, int rec_num); > > DAT_RETURN > dapls_modify_qp_state ( IN ib_qp_handle_t qp_handle, > Index: dapl/openib/README > =================================================================== > --- dapl/openib/README (revision 2919) > +++ dapl/openib/README (working copy) > @@ -39,18 +39,16 @@ > > server: dtest -s > client: dtest -h hostname > + > +Testing: dtest, dapltest - cl.sh regress.sh > > -setup/known issues: > - > - First drop with uCM (without IBAT), tested with simple dtest across 2 nodes. 
> - hand rolled path records require remote LID and GID set via enviroment: > +Setup: > > - export DAPL_REMOTE_GID "fe80:0000:0000:0000:0002:c902:0000:4071" > - export DAPL_REMOTE_LID "0002" > + Third drop of code, includes uCM and uAT support. > + NOTE: requires both uCM and uAT libraries and device modules from trunk. > > - Also, hard coded (RTR) for use with port 1 only. > - > +Known issues: > no memory windows support in ibverbs, dat_create_rmr fails. > + some uCM scale up issues with an 8 thread dapltest in regress.sh > + hard coded modify QP RTR to port 1, waiting for ib_cm_init_qp_attr call. > > - > - > Index: dapl/openib/dapl_ib_cq.c > =================================================================== > --- dapl/openib/dapl_ib_cq.c (revision 2919) > +++ dapl/openib/dapl_ib_cq.c (working copy) > @@ -50,9 +50,96 @@ > #include "dapl_adapter_util.h" > #include "dapl_lmr_util.h" > #include "dapl_evd_util.h" > +#include "dapl_ring_buffer_util.h" > #include > +#include > > +int dapli_cq_thread_init(struct dapl_hca *hca_ptr) > +{ > + DAT_RETURN dat_status; > + > + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%p)\n", hca_ptr); > + > + /* create thread to process inbound connect request */ > + dat_status = dapl_os_thread_create( cq_thread, (void*)hca_ptr,&hca_ptr->ib_trans.cq_thread); > + if (dat_status != DAT_SUCCESS) > + { > + dapl_dbg_log(DAPL_DBG_TYPE_ERR, > + " cq_thread_init: failed to create thread\n"); > + return 1; > + } > + return 0; > +} > + > +void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr) > +{ > + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%p)\n", hca_ptr); > + > + /* destroy cr_thread and lock */ > + hca_ptr->ib_trans.cq_destroy = 1; > + pthread_kill(hca_ptr->ib_trans.cq_thread, SIGUSR1); > + dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p) SIGUSR1 sent\n",hca_ptr); > + while (hca_ptr->ib_trans.cq_destroy != 2) { > + struct timespec sleep, remain; > + sleep.tv_sec = 0; > + sleep.tv_nsec = 10000000; /* 10 ms */ > + 
dapl_dbg_log(DAPL_DBG_TYPE_UTIL, > + " cq_thread_destroy: waiting for cq_thread\n"); > + nanosleep (&sleep, &remain); > + } > + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d) exit\n",getpid()); > + return; > +} > + > +/* something to catch the signal */ > +static void ib_cq_handler(int signum) > +{ > + return; > +} > + > +void cq_thread( void *arg ) > +{ > + struct dapl_hca *hca_ptr = arg; > + struct dapl_evd *evd_ptr; > + struct ibv_cq *ibv_cq = NULL; > + sigset_t sigset; > + int status = 0; > + > + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL," cq_thread: ENTER hca %p\n",hca_ptr); > + > + sigemptyset(&sigset); > + sigaddset(&sigset,SIGUSR1); > + pthread_sigmask(SIG_UNBLOCK, &sigset, NULL); > + signal(SIGUSR1, ib_cq_handler); > + > + /* wait on DTO event, or signal to abort */ > + while (!hca_ptr->ib_trans.cq_destroy) { > + > + struct pollfd cq_poll = { > + .fd = hca_ptr->ib_hca_handle->cq_fd[0], > + .events = POLLIN, > + .revents = 0 > + }; > > + status = poll(&cq_poll, 1, -1); > + if ((status == 1) && > + (!ibv_get_cq_event(hca_ptr->ib_hca_handle, 0, &ibv_cq, (void*)&evd_ptr))) { > + > + if (DAPL_BAD_HANDLE(evd_ptr, DAPL_MAGIC_EVD)) > + continue; > + > + /* process DTO event via callback */ > + dapl_evd_dto_callback ( evd_ptr->header.owner_ia->hca_ptr->ib_hca_handle, > + evd_ptr->ib_cq_handle, > + (void*)evd_ptr ); > + } else { > + > + } > + } > + hca_ptr->ib_trans.cq_destroy = 2; > + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca %p \n", hca_ptr); > + return; > +} > /* > * Map all verbs DTO completion codes to the DAT equivelent. 
> * > @@ -410,9 +497,9 @@ > IN DAPL_EVD *evd_ptr, > IN ib_wait_obj_handle_t *p_cq_wait_obj_handle ) > { > - dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, > + dapl_dbg_log ( DAPL_DBG_TYPE_CM, > " cq_object_create: (%p)=%p\n", > - p_cq_wait_obj_handle, *p_cq_wait_obj_handle); > + p_cq_wait_obj_handle, evd_ptr ); > > /* set cq_wait object to evd_ptr */ > *p_cq_wait_obj_handle = evd_ptr; > @@ -447,33 +534,86 @@ > { > DAPL_EVD *evd_ptr = p_cq_wait_obj_handle; > ib_cq_handle_t cq = evd_ptr->ib_cq_handle; > - struct ibv_cq *ibv_cq; > - void *ibv_ctx; > - int status = -ETIMEDOUT; > + struct ibv_cq *ibv_cq = NULL; > + void *ibv_ctx = NULL; > + int status = 0; > > - dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, > + dapl_dbg_log ( DAPL_DBG_TYPE_CM, > " cq_object_wait: dev %p evd %p cq %p, time %d\n", > cq->context, evd_ptr, cq, timeout ); > > - /* Multiple EVD's sharing one event handle for now */ > - if (cq) { > - struct pollfd cq_poll = { > - .fd = cq->context->cq_fd[0], > - .events = POLLIN > + /* Multiple EVD's sharing one event handle for now until uverbs supports more */ > + > + /* > + * This makes it very inefficient and tricky to manage multiple CQ per device open > + * For example: 4 threads waiting on separate CQ events will all be woke when > + * a CQ event fires. So the poll wakes up and the first thread to get to the > + * the get_cq_event wins and the other 3 will block. The dapl_evd_wait code > + * above will immediately do a poll_cq after returning from CQ wait and if > + * nothing on the queue will call this wait again and go back to sleep. So > + * as long as they all wake up, a mutex is held around the get_cq_event > + * so no blocking occurs and they all return then everything should work. > + * Of course, the timeout needs adjusted on the threads that go back to sleep. 
> + */ > + while (cq) { > + struct pollfd cq_poll = { > + .fd = cq->context->cq_fd[0], > + .events = POLLIN, > + .revents = 0 > }; > - int timeout_ms = -1; > + int timeout_ms = -1; > > if (timeout != DAT_TIMEOUT_INFINITE) > timeout_ms = timeout/1000; > - > + > + /* check if another thread processed the event already, pending queue > 0 */ > + dapl_os_lock( &evd_ptr->header.owner_ia->hca_ptr->ib_trans.cq_lock ); > + if (dapls_rbuf_count(&evd_ptr->pending_event_queue)) { > + dapl_os_unlock( &evd_ptr->header.owner_ia->hca_ptr->ib_trans.cq_lock ); > + break; > + } > + dapl_os_unlock( &evd_ptr->header.owner_ia->hca_ptr->ib_trans.cq_lock ); > + > + dapl_dbg_log ( DAPL_DBG_TYPE_CM," cq_object_wait: polling\n"); > status = poll(&cq_poll, 1, timeout_ms); > - if (status == 1) > - status = ibv_get_cq_event(cq->context, > - 0, &ibv_cq, &ibv_ctx); > - } > - dapl_dbg_log (DAPL_DBG_TYPE_UTIL, > - " cq_object_wait: RET cq %p ibv_cq %p ibv_ctx %p %x\n", > - cq,ibv_cq,ibv_ctx,status); > + dapl_dbg_log ( DAPL_DBG_TYPE_CM," cq_object_wait: poll returned status=%d\n",status); > + > + /* > + * If poll with timeout wakes then hold mutex around a poll with no timeout > + * so subsequent get_cq_events will be guaranteed not to block > + * If the event does not belong to this EVD then put it on proper EVD pending > + * queue under the mutex. 
> +         */
> +        if (status == 1) {
> +            dapl_os_lock( &evd_ptr->header.owner_ia->hca_ptr->ib_trans.cq_lock );
> +            status = poll(&cq_poll, 1, 0);
> +            if (status == 1) {
> +                status = ibv_get_cq_event(cq->context,
> +                                          0, &ibv_cq, &ibv_ctx);
> +
> +                /* if event is not ours, put on proper evd pending queue */
> +                /* force another wakeup */
> +                if ((ibv_ctx != evd_ptr ) &&
> +                    (!DAPL_BAD_HANDLE(ibv_ctx, DAPL_MAGIC_EVD))) {
> +                    dapl_dbg_log (DAPL_DBG_TYPE_CM,
> +                                  " cq_object_wait: ibv_ctx %p != evd %p\n",
> +                                  ibv_ctx, evd_ptr);
> +                    dapls_evd_copy_cq((struct evd_ptr*)ibv_ctx);
> +                    dapl_os_unlock(&evd_ptr->header.owner_ia->hca_ptr->ib_trans.cq_lock );
> +                    continue;
> +                }
> +            }
> +            dapl_os_unlock( &evd_ptr->header.owner_ia->hca_ptr->ib_trans.cq_lock );
> +            break;
> +
> +        } else if (status == 0) {
> +            status = ETIMEDOUT;
> +            break;
> +        }
> +    }
> +    dapl_dbg_log (DAPL_DBG_TYPE_CM,
> +                  " cq_object_wait: RET evd %p cq %p ibv_cq %p ibv_ctx %p %s\n",
> +                  evd_ptr, cq,ibv_cq,ibv_ctx,strerror(errno));
>
>      return(dapl_convert_errno(status,"cq_wait_object_wait"));
>

From tduffy at sun.com Tue Aug 2 15:00:47 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 02 Aug 2005 15:00:47 -0700
Subject: [openib-general] [PATCH 1/2] kDAPL: remove inline functions
Message-ID: <1123020047.5203.4.camel@duffman>

This patch removes all the inline wrapper functions; consumers now call
the provider's functions directly through the function table. It also
drops the _func suffix from the member names, since the type already
makes it obvious that they are functions.
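
For readers skimming the archive, the shape of the change can be sketched as follows. This is a minimal illustration and not code from the patch: every demo_* name and the single pz_query entry are hypothetical stand-ins for the much larger kDAPL structures. It contrasts a thin inline wrapper of the kind being removed with a call site that goes through the table directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-ins; the real kDAPL structs carry far more. */
struct demo_provider;

struct demo_common {
	struct demo_provider *provider;	/* each object points at its provider */
};

struct demo_pz {
	struct demo_common common;
	int queries;			/* counts calls, for illustration only */
};

/* Table entry named after the operation (no _func suffix). */
struct demo_provider {
	int (*pz_query)(struct demo_pz *, int *);
};

/* Provider-side implementation, wired into the table below. */
static int demo_pz_query(struct demo_pz *pz, int *out)
{
	pz->queries++;
	*out = pz->queries;
	return 0;
}

static struct demo_provider demo_table = {
	.pz_query = demo_pz_query,
};

/* Old style: a thin inline wrapper hides the table lookup. */
static inline int demo_wrapped_pz_query(struct demo_pz *pz, int *out)
{
	return pz->common.provider->pz_query(pz, out);
}

/* New style: the consumer writes the lookup at the call site. */
static int demo_consumer(struct demo_pz *pz)
{
	int count = 0;
	int ret;

	assert(pz != NULL && pz->common.provider != NULL);
	ret = pz->common.provider->pz_query(pz, &count);
	return ret ? ret : count;
}
```

Both paths end up in the same table entry; the wrapper only contributes a shorter name, which is the trade-off the patch makes in favour of less code in the header.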
Signed-off-by: Tom Duffy

Index: linux-2.6.13-rc5-openib/drivers/infiniband/ulp/kdapl/kdapl.h
===================================================================
--- linux-2.6.13-rc5-openib/drivers/infiniband/ulp/kdapl/kdapl.h (revision 2952)
+++ linux-2.6.13-rc5-openib/drivers/infiniband/ulp/kdapl/kdapl.h (working copy)
@@ -572,11 +572,9 @@ enum dat_upcall_policy {
 	 */
 };
 
-typedef void (*DAT_UPCALL_FUNC) (void *, const struct dat_event *, boolean_t);
-
 struct dat_upcall_object {
 	void *instance_data;
-	DAT_UPCALL_FUNC upcall_func;
+	void (*upcall)(void *, const struct dat_event *, boolean_t);
 };
 
 /* Define NULL upcall */
@@ -736,113 +734,104 @@ struct dat_provider {
 	const char *ia_name;
 	void *extension;
 
-	int (*ia_open_func)(const char *, int, struct dat_evd **,
-		struct dat_ia **);
-	int (*ia_query_func)(struct dat_ia *, struct dat_evd **,
-		struct dat_ia_attr *, struct dat_provider_attr *);
-	int (*ia_close_func)(struct dat_ia *, enum dat_close_flags);
-	int (*ia_memtype_hint_func)(struct dat_ia *, enum dat_mem_type, u64,
-		enum dat_mem_optimize_flags, u64 *, u64 *);
-	int (*cr_query_func)(struct dat_cr *, struct dat_cr_param *);
-	int (*cr_accept_func)(struct dat_cr *, struct dat_ep *, int,
-		const void*);
-	int (*cr_reject_func)(struct dat_cr *);
-	int (*cr_handoff_func)(struct dat_cr *, DAT_CONN_QUAL);
-	int (*evd_kcreate_func)(struct dat_ia *, int, enum dat_upcall_policy,
-		const struct dat_upcall_object *,
-		enum dat_evd_flags, struct dat_evd **);
-	int (*evd_query_func)(struct dat_evd *, struct dat_evd_param *);
-	int (*evd_modify_upcall_func)(struct dat_evd *, enum dat_upcall_policy,
-		const struct dat_upcall_object *);
-	int (*evd_resize_func)(struct dat_evd *, int);
-	int (*evd_post_se_func)(struct dat_evd *, const struct dat_event *);
-	int (*evd_dequeue_func)(struct dat_evd *, struct dat_event *);
-	int (*evd_free_func)(struct dat_evd *);
-	int (*ep_create_func)(struct dat_ia *, struct dat_pz *, struct dat_evd *,
-		struct dat_evd *, struct dat_evd *,
-		const struct dat_ep_attr *, struct dat_ep **);
-	int (*ep_query_func)(struct dat_ep *, struct dat_ep_param *);
-	int (*ep_modify_func)(struct dat_ep *, enum dat_ep_param_mask,
-		const struct dat_ep_param *);
-	int (*ep_connect_func)(struct dat_ep *, struct sockaddr *, DAT_CONN_QUAL,
-		unsigned long, int, const void *, enum dat_qos,
-		enum dat_connect_flags);
-
-	int (*ep_dup_connect_func)(struct dat_ep *, struct dat_ep *,
-		unsigned long, int, const void *,
-		enum dat_qos);
-	int (*ep_disconnect_func)(struct dat_ep *, enum dat_close_flags);
-	int (*ep_post_send_func)(struct dat_ep *, int, struct dat_lmr_triplet *,
-		DAT_DTO_COOKIE, enum dat_completion_flags);
-	int (*ep_post_recv_func)(struct dat_ep *, int, struct dat_lmr_triplet *,
-		DAT_DTO_COOKIE, enum dat_completion_flags);
-	int (*ep_post_rdma_read_func)(struct dat_ep *, int,
-		struct dat_lmr_triplet *, DAT_DTO_COOKIE,
-		const struct dat_rmr_triplet *,
-		enum dat_completion_flags);
-	int (*ep_post_rdma_write_func)(struct dat_ep *, int,
-		struct dat_lmr_triplet *, DAT_DTO_COOKIE,
-		const struct dat_rmr_triplet *,
-		enum dat_completion_flags);
-	int (*ep_get_status_func)(struct dat_ep *, enum dat_ep_state *,
-		boolean_t *, boolean_t *);
-
-	int (*ep_free_func)(struct dat_ep *);
-	int (*lmr_kcreate_func)(struct dat_ia *, enum dat_mem_type,
-		DAT_REGION_DESCRIPTION, u64, struct dat_pz *,
-		enum dat_mem_priv_flags,
-		enum dat_mem_optimize_flags, struct dat_lmr **,
-		DAT_LMR_CONTEXT *, DAT_RMR_CONTEXT *, u64 *,
-		u64 *);
-	int (*lmr_query_func)(struct dat_lmr *, struct dat_lmr_param *);
-	int (*lmr_free_func)(struct dat_lmr *);
-	int (*rmr_create_func)(struct dat_pz *, struct dat_rmr **);
-	int (*rmr_query_func)(struct dat_rmr *, struct dat_rmr_param *);
-	int (*rmr_bind_func)(struct dat_rmr *, struct dat_lmr *,
-		const struct dat_lmr_triplet *,
-		enum dat_mem_priv_flags, struct dat_ep *,
-		DAT_RMR_COOKIE, enum dat_completion_flags,
-		DAT_RMR_CONTEXT *);
-	int (*rmr_free_func)(struct dat_rmr *);
-
-	int
(*psp_create_func)(struct dat_ia *, DAT_CONN_QUAL, struct dat_evd *,
-		enum dat_psp_flags, struct dat_sp **);
-	int (*psp_query_func)(struct dat_sp *, struct dat_psp_param *);
-	int (*psp_free_func)(struct dat_sp *);
-	int (*rsp_create_func)(struct dat_ia *, DAT_CONN_QUAL, struct dat_ep *,
-		struct dat_evd *, struct dat_sp **);
-	int (*rsp_query_func)(struct dat_sp *, struct dat_rsp_param *);
-	int (*rsp_free_func)(struct dat_sp *);
-	int (*pz_create_func)(struct dat_ia *, struct dat_pz **);
-	int (*pz_query_func)(struct dat_pz *, struct dat_pz_param *);
-	int (*pz_free_func)(struct dat_pz *);
+	int (*ia_open)(const char *, int, struct dat_evd **, struct dat_ia **);
+	int (*ia_query)(struct dat_ia *, struct dat_evd **,
+		struct dat_ia_attr *, struct dat_provider_attr *);
+	int (*ia_close)(struct dat_ia *, enum dat_close_flags);
+	int (*ia_memtype_hint)(struct dat_ia *, enum dat_mem_type, u64,
+		enum dat_mem_optimize_flags, u64 *, u64 *);
+	int (*cr_query)(struct dat_cr *, struct dat_cr_param *);
+	int (*cr_accept)(struct dat_cr *, struct dat_ep *, int, const void*);
+	int (*cr_reject)(struct dat_cr *);
+	int (*cr_handoff)(struct dat_cr *, DAT_CONN_QUAL);
+	int (*evd_kcreate)(struct dat_ia *, int, enum dat_upcall_policy,
+		const struct dat_upcall_object *,
+		enum dat_evd_flags, struct dat_evd **);
+	int (*evd_query)(struct dat_evd *, struct dat_evd_param *);
+	int (*evd_modify_upcall)(struct dat_evd *, enum dat_upcall_policy,
+		const struct dat_upcall_object *);
+	int (*evd_resize)(struct dat_evd *, int);
+	int (*evd_post_se)(struct dat_evd *, const struct dat_event *);
+	int (*evd_dequeue)(struct dat_evd *, struct dat_event *);
+	int (*evd_free)(struct dat_evd *);
+	int (*ep_create)(struct dat_ia *, struct dat_pz *, struct dat_evd *,
+		struct dat_evd *, struct dat_evd *,
+		const struct dat_ep_attr *, struct dat_ep **);
+	int (*ep_query)(struct dat_ep *, struct dat_ep_param *);
+	int (*ep_modify)(struct dat_ep *, enum dat_ep_param_mask,
+		const struct dat_ep_param
*);
+	int (*ep_connect)(struct dat_ep *, struct sockaddr *, DAT_CONN_QUAL,
+		unsigned long, int, const void *, enum dat_qos,
+		enum dat_connect_flags);
+	int (*ep_dup_connect)(struct dat_ep *, struct dat_ep *, unsigned long,
+		int, const void *, enum dat_qos);
+	int (*ep_disconnect)(struct dat_ep *, enum dat_close_flags);
+	int (*ep_post_send)(struct dat_ep *, int, struct dat_lmr_triplet *,
+		DAT_DTO_COOKIE, enum dat_completion_flags);
+	int (*ep_post_recv)(struct dat_ep *, int, struct dat_lmr_triplet *,
+		DAT_DTO_COOKIE, enum dat_completion_flags);
+	int (*ep_post_rdma_read)(struct dat_ep *, int, struct dat_lmr_triplet *,
+		DAT_DTO_COOKIE, const struct dat_rmr_triplet *,
+		enum dat_completion_flags);
+	int (*ep_post_rdma_write)(struct dat_ep *, int,
+		struct dat_lmr_triplet *, DAT_DTO_COOKIE,
+		const struct dat_rmr_triplet *,
+		enum dat_completion_flags);
+	int (*ep_get_status)(struct dat_ep *, enum dat_ep_state *, boolean_t *,
+		boolean_t *);
+	int (*ep_free)(struct dat_ep *);
+	int (*lmr_kcreate)(struct dat_ia *, enum dat_mem_type,
+		DAT_REGION_DESCRIPTION, u64, struct dat_pz *,
+		enum dat_mem_priv_flags, enum dat_mem_optimize_flags,
+		struct dat_lmr **, DAT_LMR_CONTEXT *,
+		DAT_RMR_CONTEXT *, u64 *, u64 *);
+	int (*lmr_query)(struct dat_lmr *, struct dat_lmr_param *);
+	int (*lmr_free)(struct dat_lmr *);
+	int (*rmr_create)(struct dat_pz *, struct dat_rmr **);
+	int (*rmr_query)(struct dat_rmr *, struct dat_rmr_param *);
+	int (*rmr_bind)(struct dat_rmr *, struct dat_lmr *,
+		const struct dat_lmr_triplet *,
+		enum dat_mem_priv_flags, struct dat_ep *,
+		DAT_RMR_COOKIE, enum dat_completion_flags,
+		DAT_RMR_CONTEXT *);
+	int (*rmr_free)(struct dat_rmr *);
+	int (*psp_create)(struct dat_ia *, DAT_CONN_QUAL, struct dat_evd *,
+		enum dat_psp_flags, struct dat_sp **);
+	int (*psp_query)(struct dat_sp *, struct dat_psp_param *);
+	int (*psp_free)(struct dat_sp *);
+	int (*rsp_create)(struct dat_ia *, DAT_CONN_QUAL, struct dat_ep *,
+		struct dat_evd *, struct dat_sp
**);
+	int (*rsp_query)(struct dat_sp *, struct dat_rsp_param *);
+	int (*rsp_free)(struct dat_sp *);
+	int (*pz_create)(struct dat_ia *, struct dat_pz **);
+	int (*pz_query)(struct dat_pz *, struct dat_pz_param *);
+	int (*pz_free)(struct dat_pz *);
 
 	/* DAT 1.1 */
-	int (*psp_create_any_func)(struct dat_ia *, DAT_CONN_QUAL *,
-		struct dat_evd *, enum dat_psp_flags,
-		struct dat_sp **);
-	int (*ep_reset_func)(struct dat_ep *);
+	int (*psp_create_any)(struct dat_ia *, DAT_CONN_QUAL *,
+		struct dat_evd *, enum dat_psp_flags,
+		struct dat_sp **);
+	int (*ep_reset)(struct dat_ep *);
 
 	/* DAT 1.2 */
-	int (*lmr_sync_rdma_read_func)(struct dat_ia *,
-		const struct dat_lmr_triplet *, u64);
-	int (*lmr_sync_rdma_write_func)(struct dat_ia *,
-		const struct dat_lmr_triplet *, u64);
-	int (*ep_create_with_srq_func)(struct dat_ia *, struct dat_pz *,
-		struct dat_evd *, struct dat_evd *,
-		struct dat_evd *, struct dat_srq *,
-		const struct dat_ep_attr *,
-		struct dat_ep **);
-	int (*ep_recv_query_func)(struct dat_ep *, int *, int *);
-	int (*ep_set_watermark_func)(struct dat_ep *, int, int);
-	int (*srq_create_func)(struct dat_ia *, struct dat_pz *,
-		struct dat_srq_attr *, struct dat_srq **);
-	int (*srq_free_func)(struct dat_srq *);
-	int (*srq_post_recv_func)(struct dat_srq *, int,
-		struct dat_lmr_triplet *, DAT_DTO_COOKIE);
-	int (*srq_query_func)(struct dat_srq *, struct dat_srq_param *);
-	int (*srq_resize_func)(struct dat_srq *, int);
-	int (*srq_set_lw_func)(struct dat_srq *, int);
+	int (*lmr_sync_rdma_read)(struct dat_ia *,
+		const struct dat_lmr_triplet *, u64);
+	int (*lmr_sync_rdma_write)(struct dat_ia *,
+		const struct dat_lmr_triplet *, u64);
+	int (*ep_create_with_srq)(struct dat_ia *, struct dat_pz *,
+		struct dat_evd *, struct dat_evd *,
+		struct dat_evd *, struct dat_srq *,
+		const struct dat_ep_attr *, struct dat_ep **);
+	int (*ep_recv_query)(struct dat_ep *, int *, int *);
+	int (*ep_set_watermark)(struct dat_ep *, int, int);
+	int (*srq_create)(struct
dat_ia *, struct dat_pz *,
+		struct dat_srq_attr *, struct dat_srq **);
+	int (*srq_free)(struct dat_srq *);
+	int (*srq_post_recv)(struct dat_srq *, int, struct dat_lmr_triplet *,
+		DAT_DTO_COOKIE);
+	int (*srq_query)(struct dat_srq *, struct dat_srq_param *);
+	int (*srq_resize)(struct dat_srq *, int);
+	int (*srq_set_lw)(struct dat_srq *, int);
 };
 
 struct dat_common {
@@ -913,399 +902,12 @@ extern int dat_registry_remove_provider(
  * DAT registry functions for consumers
  */
 extern int dat_ia_open(const char *name, int async_event_qlen,
-		struct dat_evd **async_event_handle,
-		struct dat_ia **ia);
+		struct dat_evd **async_event_handle,
+		struct dat_ia **ia);
 extern int dat_ia_close(struct dat_ia *, enum dat_close_flags);
 extern int dat_registry_list_providers(int max_to_return, int *entries_returned,
 		struct dat_provider_info *dat_provider_list[]);
 
-/*
- * inline functions for consumers
- */
-#define DAT_CALL_PROVIDER_FUNC(func, handle, ...) \
-	handle->common.provider->func(handle, ##__VA_ARGS__)
-
-static inline int dat_ia_memtype_hint(struct dat_ia *ia,
-		enum dat_mem_type mem_type,
-		u64 length,
-		enum dat_mem_optimize_flags mem_optimize,
-		u64 *preferred_length,
-		u64 *preferred_alignment)
-{
-	return DAT_CALL_PROVIDER_FUNC(ia_memtype_hint_func, ia, mem_type,
-		length, mem_optimize, preferred_length,
-		preferred_alignment);
-}
-
-static inline int dat_ia_query(struct dat_ia *ia, struct dat_evd **async_evd,
-		struct dat_ia_attr *ia_attr,
-		struct dat_provider_attr *provider_attr)
-{
-	return DAT_CALL_PROVIDER_FUNC(
-		ia_query_func, ia, async_evd, ia_attr, provider_attr);
-}
-
-static inline int dat_cr_accept(struct dat_cr *cr, struct dat_ep *ep,
-		int private_data_size, const void *private_data)
-{
-	return DAT_CALL_PROVIDER_FUNC(
-		cr_accept_func, cr, ep, private_data_size, private_data);
-}
-
-static inline int dat_cr_handoff(struct dat_cr *cr, DAT_CONN_QUAL handoff)
-{
-	return DAT_CALL_PROVIDER_FUNC(cr_handoff_func, cr, handoff);
-}
-
-static inline int
dat_cr_query(struct dat_cr *cr, struct dat_cr_param *param)
-{
-	return DAT_CALL_PROVIDER_FUNC(cr_query_func, cr, param);
-}
-
-static inline int dat_cr_reject(struct dat_cr *cr)
-{
-	return DAT_CALL_PROVIDER_FUNC(cr_reject_func, cr);
-}
-
-static inline int dat_evd_dequeue(struct dat_evd *evd, struct dat_event *event)
-{
-	return DAT_CALL_PROVIDER_FUNC(evd_dequeue_func, evd, event);
-}
-
-static inline int dat_evd_free(struct dat_evd *evd)
-{
-	return DAT_CALL_PROVIDER_FUNC(evd_free_func, evd);
-}
-
-static inline int dat_evd_kcreate(struct dat_ia *ia, int qlen,
-		enum dat_upcall_policy policy,
-		const struct dat_upcall_object *upcall,
-		enum dat_evd_flags flags,
-		struct dat_evd ** evd)
-{
-	return DAT_CALL_PROVIDER_FUNC(evd_kcreate_func, ia, qlen, policy,
-		upcall, flags, evd);
-}
-
-static inline int dat_evd_modify_upcall(struct dat_evd *evd,
-		enum dat_upcall_policy policy,
-		const struct dat_upcall_object *upcall)
-{
-	return DAT_CALL_PROVIDER_FUNC(evd_modify_upcall_func, evd, policy,
-		upcall);
-}
-
-static inline int dat_evd_post_se(struct dat_evd *evd,
-		const struct dat_event *event)
-{
-	return DAT_CALL_PROVIDER_FUNC(evd_post_se_func, evd, event);
-}
-
-static inline int dat_evd_query(struct dat_evd *evd, struct dat_evd_param *param)
-{
-	return DAT_CALL_PROVIDER_FUNC(evd_query_func, evd, param);
-}
-
-static inline int dat_evd_resize(struct dat_evd *evd, int qlen)
-{
-	return DAT_CALL_PROVIDER_FUNC(evd_resize_func, evd, qlen);
-}
-
-static inline int dat_ep_connect(struct dat_ep *ep, struct sockaddr *ia_addr,
-		DAT_CONN_QUAL conn_qual, unsigned long timeout,
-		int private_data_size,
-		const void *private_data, enum dat_qos qos,
-		enum dat_connect_flags flags)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_connect_func, ep, ia_addr, conn_qual,
-		timeout, private_data_size, private_data,
-		qos, flags);
-}
-
-static inline int dat_ep_create(struct dat_ia *ia, struct dat_pz *pz,
-		struct dat_evd *in_evd, struct dat_evd *out_evd,
-		struct dat_evd *connect_evd,
-		const struct dat_ep_attr *attr,
-		struct dat_ep **ep)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_create_func, ia, pz, in_evd, out_evd,
-		connect_evd, attr, ep);
-}
-
-
-static inline int dat_ep_create_with_srq(struct dat_ia *ia, struct dat_pz *pz,
-		struct dat_evd *in_evd,
-		struct dat_evd *out_evd,
-		struct dat_evd *connect_evd,
-		struct dat_srq *srq,
-		const struct dat_ep_attr *attr,
-		struct dat_ep **ep)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_create_with_srq_func, ia, pz, in_evd,
-		out_evd, connect_evd, srq, attr, ep);
-}
-
-static inline int dat_ep_disconnect(struct dat_ep *ep,
-		enum dat_close_flags flags)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_disconnect_func, ep, flags);
-}
-
-static inline int dat_ep_dup_connect(struct dat_ep *ep, struct dat_ep *dup_ep,
-		unsigned long timeout,
-		int private_data_size,
-		const void *private_data, enum dat_qos qos)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_dup_connect_func, ep, dup_ep, timeout,
-		private_data_size, private_data, qos);
-}
-
-static inline int dat_ep_free(struct dat_ep *ep)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_free_func, ep);
-}
-
-static inline int dat_ep_get_status(struct dat_ep *ep, enum dat_ep_state *state,
-		boolean_t *recv_idle, boolean_t *req_idle)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_get_status_func, ep, state, recv_idle,
-		req_idle);
-}
-
-static inline int dat_ep_modify(struct dat_ep *ep, enum dat_ep_param_mask mask,
-		const struct dat_ep_param *param)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_modify_func, ep, mask, param);
-}
-
-static inline int dat_ep_post_rdma_read(struct dat_ep *ep, int size,
-		struct dat_lmr_triplet *local_iov,
-		DAT_DTO_COOKIE cookie,
-		const struct dat_rmr_triplet *remote_iov,
-		enum dat_completion_flags flags)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_post_rdma_read_func, ep, size,
-		local_iov, cookie, remote_iov, flags);
-}
-
-static inline int dat_ep_post_rdma_write(struct dat_ep *ep, int size,
-		struct dat_lmr_triplet *local_iov,
-		DAT_DTO_COOKIE cookie,
-		const struct dat_rmr_triplet
*remote_iov,
-		enum dat_completion_flags flags)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_post_rdma_write_func, ep, size,
-		local_iov, cookie, remote_iov, flags);
-}
-
-static inline int dat_ep_post_recv(struct dat_ep *ep, int size,
-		struct dat_lmr_triplet *local_iov,
-		DAT_DTO_COOKIE cookie,
-		enum dat_completion_flags flags)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_post_recv_func, ep, size, local_iov,
-		cookie, flags);
-}
-
-static inline int dat_ep_post_send(struct dat_ep *ep,
-		int size,
-		struct dat_lmr_triplet *local_iov,
-		DAT_DTO_COOKIE cookie,
-		enum dat_completion_flags flags)
-{
-	return DAT_CALL_PROVIDER_FUNC(
-		ep_post_send_func, ep, size, local_iov, cookie, flags);
-}
-
-static inline int dat_ep_query(struct dat_ep *ep, struct dat_ep_param *param)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_query_func, ep, param);
-}
-
-static inline int dat_ep_recv_query(struct dat_ep *ep, int *bufs_alloc,
-		int *bufs_avail)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_recv_query_func, ep, bufs_alloc,
-		bufs_avail);
-}
-
-static inline int dat_ep_reset(struct dat_ep *ep)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_reset_func, ep);
-}
-
-static inline int dat_ep_set_watermark(struct dat_ep *ep,
-		int soft_high_watermark,
-		int hard_high_watermark)
-{
-	return DAT_CALL_PROVIDER_FUNC(ep_set_watermark_func, ep,
-		soft_high_watermark, hard_high_watermark);
-}
-
-static inline int dat_lmr_query(struct dat_lmr *lmr, struct dat_lmr_param *param)
-{
-	return DAT_CALL_PROVIDER_FUNC(lmr_query_func, lmr, param);
-}
-
-static inline int dat_lmr_free(struct dat_lmr *lmr)
-{
-	return DAT_CALL_PROVIDER_FUNC(lmr_free_func, lmr);
-}
-
-static inline int dat_lmr_sync_rdma_read(struct dat_ia *ia,
-		const struct dat_lmr_triplet *iovs,
-		u64 num_iovs)
-{
-	return DAT_CALL_PROVIDER_FUNC(lmr_sync_rdma_read_func, ia, iovs,
-		num_iovs);
-}
-
-static inline int dat_lmr_sync_rdma_write(struct dat_ia *ia,
-		const struct dat_lmr_triplet *iovs,
-		u64 num_iovs)
-{
-	return DAT_CALL_PROVIDER_FUNC(lmr_sync_rdma_write_func,
ia, iovs,
-		num_iovs);
-}
-
-static inline int dat_rmr_create(struct dat_pz *pz, struct dat_rmr **rmr)
-{
-	return DAT_CALL_PROVIDER_FUNC(rmr_create_func, pz, rmr);
-}
-
-static inline int dat_rmr_query(struct dat_rmr *rmr, struct dat_rmr_param *param)
-{
-	return DAT_CALL_PROVIDER_FUNC(rmr_query_func, rmr, param);
-}
-
-static inline int dat_rmr_bind(struct dat_rmr *rmr, struct dat_lmr *lmr,
-		const struct dat_lmr_triplet *iov,
-		enum dat_mem_priv_flags mem_flags,
-		struct dat_ep *ep, DAT_RMR_COOKIE cookie,
-		enum dat_completion_flags comp_flags,
-		DAT_RMR_CONTEXT *context)
-{
-	return DAT_CALL_PROVIDER_FUNC(rmr_bind_func, rmr, lmr, iov, mem_flags,
-		ep, cookie, comp_flags, context);
-}
-
-static inline int dat_rmr_free(struct dat_rmr *rmr)
-{
-	return DAT_CALL_PROVIDER_FUNC(rmr_free_func, rmr);
-}
-
-static inline int dat_psp_create(struct dat_ia *ia, DAT_CONN_QUAL conn_qual,
-		struct dat_evd *evd, enum dat_psp_flags flags,
-		struct dat_sp **psp)
-{
-	return DAT_CALL_PROVIDER_FUNC(psp_create_func, ia, conn_qual, evd,
-		flags, psp);
-}
-
-static inline int dat_psp_create_any(struct dat_ia *ia, DAT_CONN_QUAL *conn_qual,
-		struct dat_evd *evd,
-		enum dat_psp_flags flags,
-		struct dat_sp **psp)
-{
-	return DAT_CALL_PROVIDER_FUNC(psp_create_any_func, ia, conn_qual, evd,
-		flags, psp);
-}
-
-static inline int dat_psp_query(struct dat_sp *psp, struct dat_psp_param *param)
-{
-	return DAT_CALL_PROVIDER_FUNC(psp_query_func, psp, param);
-}
-
-static inline int dat_psp_free(struct dat_sp *psp)
-{
-	return DAT_CALL_PROVIDER_FUNC(psp_free_func, psp);
-}
-
-static inline int dat_rsp_create(struct dat_ia *ia, DAT_CONN_QUAL conn_qual,
-		struct dat_ep *ep, struct dat_evd *evd,
-		struct dat_sp **rsp)
-{
-	return DAT_CALL_PROVIDER_FUNC(rsp_create_func, ia, conn_qual, ep, evd,
-		rsp);
-}
-
-static inline int dat_rsp_query(struct dat_sp *rsp, struct dat_rsp_param *param)
-{
-	return DAT_CALL_PROVIDER_FUNC(rsp_query_func, rsp, param);
-}
-
-static inline int dat_rsp_free(struct
dat_sp *rsp)
-{
-	return DAT_CALL_PROVIDER_FUNC(rsp_free_func, rsp);
-}
-
-static inline int dat_pz_create(struct dat_ia *ia, struct dat_pz **pz)
-{
-	return DAT_CALL_PROVIDER_FUNC(pz_create_func, ia, pz);
-}
-
-static inline int dat_pz_query(struct dat_pz *pz, struct dat_pz_param *param)
-{
-	return DAT_CALL_PROVIDER_FUNC(pz_query_func, pz, param);
-}
-
-static inline int dat_pz_free(struct dat_pz *pz)
-{
-	return DAT_CALL_PROVIDER_FUNC(pz_free_func, pz);
-}
-
-static inline int dat_srq_create(struct dat_ia *ia, struct dat_pz *pz,
-		struct dat_srq_attr *attr,
-		struct dat_srq **srq)
-{
-	return DAT_CALL_PROVIDER_FUNC(srq_create_func, ia, pz, attr, srq);
-}
-
-static inline int dat_srq_free(struct dat_srq *srq)
-{
-	return DAT_CALL_PROVIDER_FUNC(srq_free_func, srq);
-}
-
-static inline int dat_srq_post_recv(struct dat_srq *srq, int num_iovs,
-		struct dat_lmr_triplet *iovs,
-		DAT_DTO_COOKIE cookie)
-{
-	return DAT_CALL_PROVIDER_FUNC(srq_post_recv_func, srq, num_iovs, iovs,
-		cookie);
-}
-
-static inline int dat_srq_query(struct dat_srq *srq, struct dat_srq_param *param)
-{
-	return DAT_CALL_PROVIDER_FUNC(srq_query_func, srq, param);
-}
-
-static inline int dat_srq_resize(struct dat_srq *srq, int max_recv_dtos)
-{
-	return DAT_CALL_PROVIDER_FUNC(srq_resize_func, srq, max_recv_dtos);
-}
-
-static inline int dat_srq_set_lw(struct dat_srq *srq, int low_watermark)
-{
-	return DAT_CALL_PROVIDER_FUNC(srq_set_lw_func, srq, low_watermark);
-}
-
-static inline int dat_lmr_kcreate(struct dat_ia *ia, enum dat_mem_type type,
-		DAT_REGION_DESCRIPTION region, u64 len,
-		struct dat_pz *pz,
-		enum dat_mem_priv_flags privileges,
-		enum dat_mem_optimize_flags optimization,
-		struct dat_lmr **lmr,
-		DAT_LMR_CONTEXT *lmr_context,
-		DAT_RMR_CONTEXT *rmr_context,
-		u64 *registered_length,
-		u64 *registered_address)
-{
-	return DAT_CALL_PROVIDER_FUNC(lmr_kcreate_func, ia, type, region, len,
-		pz, privileges, optimization, lmr,
-		lmr_context, rmr_context,
-		registered_length,
registered_address);
-}
-
 #endif /* DAT_H */

Index: linux-2.6.13-rc5-openib/drivers/infiniband/ulp/kdapl/ib/dapl_evd.c
===================================================================
--- linux-2.6.13-rc5-openib/drivers/infiniband/ulp/kdapl/ib/dapl_evd.c (revision 2952)
+++ linux-2.6.13-rc5-openib/drivers/infiniband/ulp/kdapl/ib/dapl_evd.c (working copy)
@@ -47,7 +47,7 @@ static void dapl_evd_upcall_trigger(stru
 	struct dat_event event;
 
 	/* Only process events if there is an enabled callback function. */
-	if ((evd->upcall.upcall_func == (DAT_UPCALL_FUNC) NULL) ||
+	if ((evd->upcall.upcall == NULL) ||
 	    (evd->upcall_policy == DAT_UPCALL_DISABLE)) {
 		return;
 	}
@@ -57,8 +57,7 @@ static void dapl_evd_upcall_trigger(stru
 		if (0 != status)
 			return;
 
-		evd->upcall.upcall_func(evd->upcall.instance_data, &event,
-			FALSE);
+		evd->upcall.upcall(evd->upcall.instance_data, &event, FALSE);
 	}
 }
 
@@ -177,7 +176,7 @@ static struct dapl_evd *dapl_evd_alloc(s
 		evd->upcall = *upcall;
 	else {
 		evd->upcall.instance_data = NULL;
-		evd->upcall.upcall_func = NULL;
+		evd->upcall.upcall = NULL;
 	}
 
 bail:

Index: linux-2.6.13-rc5-openib/drivers/infiniband/ulp/kdapl/ib/dapl_provider.c
===================================================================
--- linux-2.6.13-rc5-openib/drivers/infiniband/ulp/kdapl/ib/dapl_provider.c (revision 2952)
+++ linux-2.6.13-rc5-openib/drivers/infiniband/ulp/kdapl/ib/dapl_provider.c (working copy)
@@ -66,75 +66,75 @@ static struct dat_provider g_dapl_provid
 	.ia_name = NULL,
 	.extension = NULL,
 
-	.ia_open_func = &dapl_ia_open,
-	.ia_query_func = &dapl_ia_query,
-	.ia_close_func = &dapl_ia_close,
-	.ia_memtype_hint_func = &dapl_ia_memtype_hint,
-
-	.cr_query_func = &dapl_cr_query,
-	.cr_accept_func = &dapl_cr_accept,
-	.cr_reject_func = &dapl_cr_reject,
-	.cr_handoff_func = &dapl_cr_handoff,
-
-	.evd_kcreate_func = &dapl_evd_kcreate,
-	.evd_query_func = &dapl_evd_kquery,
-	.evd_modify_upcall_func = &dapl_evd_modify_upcall,
-	.evd_resize_func = &dapl_evd_resize,
-	.evd_post_se_func = &dapl_evd_post_se,
-	.evd_dequeue_func = &dapl_evd_dequeue,
-	.evd_free_func = &dapl_evd_free,
-
-	.ep_create_func = &dapl_ep_create,
-	.ep_query_func = &dapl_ep_query,
-	.ep_modify_func = &dapl_ep_modify,
-	.ep_connect_func = &dapl_ep_connect,
-	.ep_dup_connect_func = &dapl_ep_dup_connect,
-	.ep_disconnect_func = &dapl_ep_disconnect,
-	.ep_post_send_func = &dapl_ep_post_send,
-	.ep_post_recv_func = &dapl_ep_post_recv,
-	.ep_post_rdma_read_func = &dapl_ep_post_rdma_read,
-	.ep_post_rdma_write_func = &dapl_ep_post_rdma_write,
-	.ep_get_status_func = &dapl_ep_get_status,
-	.ep_free_func = &dapl_ep_free,
-
-	.lmr_kcreate_func = &dapl_lmr_kcreate,
-	.lmr_query_func = &dapl_lmr_query,
-	.lmr_free_func = &dapl_lmr_free,
-
-	.rmr_create_func = &dapl_rmr_create,
-	.rmr_query_func = &dapl_rmr_query,
-	.rmr_bind_func = &dapl_rmr_bind,
-	.rmr_free_func = &dapl_rmr_free,
-
-	.psp_create_func = &dapl_psp_create,
-	.psp_query_func = &dapl_psp_query,
-	.psp_free_func = &dapl_psp_free,
-
-	.rsp_create_func = &dapl_rsp_create,
-	.rsp_query_func = &dapl_rsp_query,
-	.rsp_free_func = &dapl_rsp_free,
-
-	.pz_create_func = &dapl_pz_create,
-	.pz_query_func = &dapl_pz_query,
-	.pz_free_func = &dapl_pz_free,
+	.ia_open = &dapl_ia_open,
+	.ia_query = &dapl_ia_query,
+	.ia_close = &dapl_ia_close,
+	.ia_memtype_hint = &dapl_ia_memtype_hint,
+
+	.cr_query = &dapl_cr_query,
+	.cr_accept = &dapl_cr_accept,
+	.cr_reject = &dapl_cr_reject,
+	.cr_handoff = &dapl_cr_handoff,
+
+	.evd_kcreate = &dapl_evd_kcreate,
+	.evd_query = &dapl_evd_kquery,
+	.evd_modify_upcall = &dapl_evd_modify_upcall,
+	.evd_resize = &dapl_evd_resize,
+	.evd_post_se = &dapl_evd_post_se,
+	.evd_dequeue = &dapl_evd_dequeue,
+	.evd_free = &dapl_evd_free,
+
+	.ep_create = &dapl_ep_create,
+	.ep_query = &dapl_ep_query,
+	.ep_modify = &dapl_ep_modify,
+	.ep_connect = &dapl_ep_connect,
+	.ep_dup_connect = &dapl_ep_dup_connect,
+	.ep_disconnect = &dapl_ep_disconnect,
+	.ep_post_send = &dapl_ep_post_send,
+	.ep_post_recv = &dapl_ep_post_recv,
+	.ep_post_rdma_read = &dapl_ep_post_rdma_read,
+	.ep_post_rdma_write = &dapl_ep_post_rdma_write,
+	.ep_get_status = &dapl_ep_get_status,
+	.ep_free = &dapl_ep_free,
+
+	.lmr_kcreate = &dapl_lmr_kcreate,
+	.lmr_query = &dapl_lmr_query,
+	.lmr_free = &dapl_lmr_free,
+
+	.rmr_create = &dapl_rmr_create,
+	.rmr_query = &dapl_rmr_query,
+	.rmr_bind = &dapl_rmr_bind,
+	.rmr_free = &dapl_rmr_free,
+
+	.psp_create = &dapl_psp_create,
+	.psp_query = &dapl_psp_query,
+	.psp_free = &dapl_psp_free,
+
+	.rsp_create = &dapl_rsp_create,
+	.rsp_query = &dapl_rsp_query,
+	.rsp_free = &dapl_rsp_free,
+
+	.pz_create = &dapl_pz_create,
+	.pz_query = &dapl_pz_query,
+	.pz_free = &dapl_pz_free,
 
 	/* dat-1.1 */
-	.psp_create_any_func = &dapl_psp_create_any,
-	.ep_reset_func = &dapl_ep_reset,
+	.psp_create_any = &dapl_psp_create_any,
+	.ep_reset = &dapl_ep_reset,
 
 	/* dat-1.2 */
-	.lmr_sync_rdma_read_func = &dapl_lmr_sync_rdma_read,
-	.lmr_sync_rdma_write_func = &dapl_lmr_sync_rdma_write,
+	.lmr_sync_rdma_read = &dapl_lmr_sync_rdma_read,
+	.lmr_sync_rdma_write = &dapl_lmr_sync_rdma_write,
 
-	.ep_create_with_srq_func = &dapl_ep_create_with_srq,
-	.ep_recv_query_func = &dapl_ep_recv_query,
-	.ep_set_watermark_func = &dapl_ep_set_watermark,
-	.srq_create_func = &dapl_srq_create,
-	.srq_free_func = &dapl_srq_free,
-	.srq_post_recv_func = &dapl_srq_post_recv,
-	.srq_query_func = &dapl_srq_query,
-	.srq_resize_func = &dapl_srq_resize,
-	.srq_set_lw_func = &dapl_srq_set_lw
+	.ep_create_with_srq = &dapl_ep_create_with_srq,
+	.ep_recv_query = &dapl_ep_recv_query,
+	.ep_set_watermark = &dapl_ep_set_watermark,
+	.srq_create = &dapl_srq_create,
+	.srq_free = &dapl_srq_free,
+	.srq_post_recv = &dapl_srq_post_recv,
+	.srq_query = &dapl_srq_query,
+	.srq_resize = &dapl_srq_resize,
+	.srq_set_lw = &dapl_srq_set_lw
 };
 
 /*********************************************************************

Index: linux-2.6.13-rc5-openib/drivers/infiniband/ulp/kdapl/api.c
===================================================================
--- linux-2.6.13-rc5-openib/drivers/infiniband/ulp/kdapl/api.c (revision 2952)
+++ linux-2.6.13-rc5-openib/drivers/infiniband/ulp/kdapl/api.c (working copy)
@@ -48,8 +48,7 @@ MODULE_PARM_DESC(dbg_mask, "Bitmask to e
 struct dat_provider_list_entry {
 	struct list_head list;
 	struct dat_provider_info info;
-	int (*ia_open_func)(const char *, int, struct dat_evd **,
-		struct dat_ia **);
+	int (*ia_open)(const char *, int, struct dat_evd **, struct dat_ia **);
 	int ref_count;
 };
 
@@ -104,7 +103,7 @@ int dat_ia_open(const char *name, int as
 	spin_unlock_irqrestore(&dat_provider_list_lock, flags);
 
 	if (entry)
-		return entry->ia_open_func(name, async_evd_qlen, async_evd, ia);
+		return entry->ia_open(name, async_evd_qlen, async_evd, ia);
 	else {
 		dat_dbg_print(DAT_DBG_TYPE_CONSUMER_API | DAT_DBG_TYPE_ERROR,
 			"%s: IA [%s] not found in registry\n", __func__,
@@ -129,7 +128,7 @@ int dat_ia_close(struct dat_ia *ia, enum
 	provider = ia->common.provider;
 	name = provider->ia_name;
 
-	status = provider->ia_close_func(ia, close_flags);
+	status = provider->ia_close(ia, close_flags);
 	if (status) {
 		dat_dbg_print(DAT_DBG_TYPE_CONSUMER_API | DAT_DBG_TYPE_ERROR,
 			"%s: IA [%s] close failed\n", __func__, name);
@@ -166,7 +165,7 @@ int dat_registry_add_provider(const stru
 	entry->ref_count = 0;
 	entry->info = *info;
-	entry->ia_open_func = provider->ia_open_func;
+	entry->ia_open = provider->ia_open;
 
 	spin_lock_irqsave(&dat_provider_list_lock, flags);
 	if (dat_registry_search(info->ia_name)) {

From tduffy at sun.com Tue Aug 2 15:02:32 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 02 Aug 2005 15:02:32 -0700
Subject: [openib-general] [PATCH 2/2] kdapltest: use new function API
In-Reply-To: <1123020047.5203.4.camel@duffman>
References: <1123020047.5203.4.camel@duffman>
Message-ID: <1123020152.5203.6.camel@duffman>

This patch updates kdapltest to use the new function API: each call now
goes through the provider's function table instead of an inline wrapper.
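
As a rough picture of what the converted call sites look like, here is a small sketch under stated assumptions: the demo_* types, the integer flag argument and the rule that a connected endpoint cannot be freed are all invented for illustration and are not taken from kdapltest or kDAPL. The point it shows is that each handle type (endpoint, protection zone) carries its own provider pointer, so a teardown path picks the table from whichever object it is operating on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of the handle layout; only what the sketch needs. */
struct demo_provider;

struct demo_common {
	struct demo_provider *provider;
};

struct demo_ep {
	struct demo_common common;
	int connected;
	int freed;
};

struct demo_pz {
	struct demo_common common;
	int freed;
};

struct demo_provider {
	int (*ep_disconnect)(struct demo_ep *, int);
	int (*ep_free)(struct demo_ep *);
	int (*pz_free)(struct demo_pz *);
};

static int demo_ep_disconnect(struct demo_ep *ep, int flags)
{
	(void)flags;		/* unused in this sketch */
	ep->connected = 0;
	return 0;
}

static int demo_ep_free(struct demo_ep *ep)
{
	if (ep->connected)
		return -1;	/* invented rule: refuse to free while connected */
	ep->freed = 1;
	return 0;
}

static int demo_pz_free(struct demo_pz *pz)
{
	pz->freed = 1;
	return 0;
}

static struct demo_provider demo_table = {
	.ep_disconnect = demo_ep_disconnect,
	.ep_free = demo_ep_free,
	.pz_free = demo_pz_free,
};

/*
 * Teardown written in the post-patch style. Unlike the test code, which
 * logs and carries on, this simplified version returns the first error.
 */
static int demo_teardown(struct demo_ep *ep, struct demo_pz *pz)
{
	int ret;

	assert(ep != NULL && pz != NULL);

	ret = ep->common.provider->ep_disconnect(ep, 0);
	if (ret != 0)
		return ret;
	ret = ep->common.provider->ep_free(ep);
	if (ret != 0)
		return ret;
	return pz->common.provider->pz_free(pz);
}
```

The handle appears twice in every call, once to find the table and once as the first argument, so a mechanical conversion has to keep the two in sync.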
Signed-off-by: Tom Duffy

Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_performance_util.c
===================================================================
--- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_performance_util.c (revision 2952)
+++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_performance_util.c (working copy)
@@ -65,7 +65,7 @@ DT_Performance_Test_Create (
     test_ptr->ia = ia;
     test_ptr->cmd = &pt_ptr->Params.u.Performance_Cmd;
 
-    ret = dat_ia_query (test_ptr->ia,
+    ret = ia->common.provider->ia_query (test_ptr->ia,
                         NULL,
                         &test_ptr->ia_attr,
                         NULL);
@@ -101,7 +101,7 @@ DT_Performance_Test_Create (
     test_ptr->creq_evd_length = DT_PERF_DFLT_EVD_LENGTH;
 
     /* create a protection zone */
-    ret = dat_pz_create (test_ptr->ia, &test_ptr->pz);
+    ret = test_ptr->ia->common.provider->pz_create (test_ptr->ia, &test_ptr->pz);
     if ( 0 != ret)
     {
         DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_pz_create error: %s\n",
@@ -182,7 +182,7 @@ DT_Performance_Test_Create (
     test_ptr->ep_context.ep_attr.max_request_dtos = pipeline_len;
 
     /* Create EP */
-    ret = dat_ep_create (test_ptr->ia, /* IA */
+    ret = ia->common.provider->ep_create (test_ptr->ia, /* IA */
                          test_ptr->pz, /* PZ */
                          test_ptr->recv_evd_hdl, /* recv */
                          test_ptr->reqt_evd_hdl, /* request */
@@ -318,7 +318,7 @@ DT_Performance_Test_Destroy (
      */
     if (test_ptr->ep_context.ep)
     {
-        ret = dat_ep_disconnect (test_ptr->ep_context.ep,
+        ret = test_ptr->ep_context.ep->common.provider->ep_disconnect (test_ptr->ep_context.ep,
                                  DAT_CLOSE_ABRUPT_FLAG);
         if (ret != 0)
         {
@@ -340,7 +340,7 @@ DT_Performance_Test_Destroy (
     if ( NULL != ep)
     {
         /* Destroy the EP */
-        ret = dat_ep_free (ep);
+        ret = ep->common.provider->ep_free (ep);
         if (ret != 0)
         {
             DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_ep_free error: %s\n",
@@ -402,7 +402,7 @@ DT_Performance_Test_Destroy (
     /* clean up the PZ */
     if (test_ptr->pz)
     {
-        ret = dat_pz_free (test_ptr->pz);
+        ret = test_ptr->pz->common.provider->pz_free (test_ptr->pz);
         if (ret != 0)
         {
DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_pz_free error: %s\n", @@ -462,7 +462,7 @@ DT_performance_post_rdma_op ( pre_ctxt_num = DT_Mdep_GetContextSwitchNum (); pre_ts = DT_Mdep_GetTimeStamp (); - ret = dat_ep_post_rdma_write (ep_context->ep, + ret = ep_context->ep->common.provider->ep_post_rdma_write (ep_context->ep, op->num_segs, iov, cookie, @@ -479,7 +479,7 @@ DT_performance_post_rdma_op ( pre_ctxt_num = DT_Mdep_GetContextSwitchNum (); pre_ts = DT_Mdep_GetTimeStamp (); - ret = dat_ep_post_rdma_read (ep_context->ep, + ret = ep_context->ep->common.provider->ep_post_rdma_read (ep_context->ep, op->num_segs, iov, cookie, Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_performance_client.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_performance_client.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_performance_client.c (working copy) @@ -103,7 +103,7 @@ DT_Performance_Test_Client_Connect ( test_ptr->base_port, test_ptr->ep_context.port)); retry: - ret = dat_ep_connect (test_ptr->ep_context.ep, + ret = test_ptr->ep_context.ep->common.provider->ep_connect (test_ptr->ep_context.ep, test_ptr->remote_ia_addr, test_ptr->ep_context.port, DAT_TIMEOUT_MAX, @@ -297,7 +297,7 @@ DT_Performance_Test_Client_Phase2 ( { pre_ts = DT_Mdep_GetTimeStamp (); - ret = dat_ep_post_rdma_write (ep_context->ep, + ret = ep_context->ep->common.provider->ep_post_rdma_write (ep_context->ep, op->num_segs, iov, cookie, @@ -308,7 +308,7 @@ DT_Performance_Test_Client_Phase2 ( { pre_ts = DT_Mdep_GetTimeStamp (); - ret = dat_ep_post_rdma_read (ep_context->ep, + ret = ep_context->ep->common.provider->ep_post_rdma_read (ep_context->ep, op->num_segs, iov, cookie, Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_server.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_server.c (revision 2952) +++ 
gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_server.c (working copy) @@ -119,7 +119,7 @@ DT_cs_Server (Params_t * params_ptr) DT_Tdep_PT_Debug (1,(phead,"%s: IA %s opened\n", module, Server_Cmd->dapl_name)); /* Create a PZ */ - ret = dat_pz_create (ps_ptr->ia, &ps_ptr->pz); + ret = ps_ptr->ia->common.provider->pz_create (ps_ptr->ia, &ps_ptr->pz); if (ret != 0) { DT_Tdep_PT_Printf (phead, @@ -191,7 +191,7 @@ DT_cs_Server (Params_t * params_ptr) } /* Create the EP */ - ret = dat_ep_create (ps_ptr->ia, /* IA */ + ret = ps_ptr->ia->common.provider->ep_create (ps_ptr->ia, /* IA */ ps_ptr->pz, /* PZ */ ps_ptr->recv_evd_hdl, /* recv */ ps_ptr->reqt_evd_hdl, /* request */ @@ -211,7 +211,7 @@ DT_cs_Server (Params_t * params_ptr) DT_Tdep_PT_Debug (1,(phead,"%s: EP created\n", module)); /* Create PSP */ - ret = dat_psp_create (ps_ptr->ia, + ret = ps_ptr->ia->common.provider->psp_create (ps_ptr->ia, SERVER_PORT_NUMBER, ps_ptr->creq_evd_hdl, DAT_PSP_CONSUMER_FLAG, @@ -358,7 +358,7 @@ DT_cs_Server (Params_t * params_ptr) } DT_Tdep_PT_Debug (1,(phead,"%s: Accepting Connection Request\n", module)); - ret = dat_cr_accept (cr, ps_ptr->ep, 0, (void *)0); + ret = cr->common.provider->cr_accept (cr, ps_ptr->ep, 0, (void *)0); if (ret != 0) { DT_Tdep_PT_Printf (phead, @@ -620,7 +620,7 @@ DT_cs_Server (Params_t * params_ptr) /* we passed the pt_ptr to the thread and must now 'forget' it */ pt_ptr = NULL; - ret = dat_ep_disconnect (ps_ptr->ep, DAT_CLOSE_GRACEFUL_FLAG); + ret = ps_ptr->ep->common.provider->ep_disconnect (ps_ptr->ep, DAT_CLOSE_GRACEFUL_FLAG); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_ep_disconnect fails: %s\n", @@ -636,7 +636,7 @@ DT_cs_Server (Params_t * params_ptr) } /* reset the EP to get back into the game */ - dat_ep_reset (ps_ptr->ep); + ps_ptr->ep->common.provider->ep_reset (ps_ptr->ep); DT_Tdep_PT_Debug (1,(phead,"%s: Waiting for another client...\n", module)); } /* end loop accepting connections */ @@ -672,7 +672,7 @@ server_exit: */ if 
(ps_ptr->ep) { - ret = dat_ep_disconnect (ps_ptr->ep, + ret = ps_ptr->ep->common.provider->ep_disconnect (ps_ptr->ep, DAT_CLOSE_ABRUPT_FLAG); if (ret != 0) { @@ -707,7 +707,7 @@ server_exit: /* Free the PSP */ if (ps_ptr->psp) { - ret = dat_psp_free (ps_ptr->psp); + ret = ps_ptr->psp->common.provider->psp_free (ps_ptr->psp); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_psp_free error: %s\n", @@ -721,7 +721,7 @@ server_exit: /* Free the EP */ if (ps_ptr->ep) { - ret = dat_ep_free (ps_ptr->ep); + ret = ps_ptr->ep->common.provider->ep_free (ps_ptr->ep); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_ep_free error: %s\n", @@ -784,7 +784,7 @@ server_exit: /* Free the PZ */ if (ps_ptr->pz) { - ret = dat_pz_free (ps_ptr->pz); + ret = ps_ptr->pz->common.provider->pz_free (ps_ptr->pz); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_pz_free error: %s\n", Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_cnxn.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_cnxn.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_cnxn.c (working copy) @@ -40,7 +40,7 @@ get_ep_connection_state (DT_Tdep_Print_H char *req_status = "Idle"; - ret = dat_ep_get_status (ep, &ep_state, &in_dto_idle, + ret = ep->common.provider->ep_get_status (ep, &ep_state, &in_dto_idle, &out_dto_idle); if (ret != 0) { Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_util.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_util.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_util.c (working copy) @@ -114,7 +114,7 @@ int DT_ep_create (Params_t *params_ptr, return status; } - status = dat_ep_create (ia, pz, *recv_evd, + status = ia->common.provider->ep_create (ia, pz, *recv_evd, *send_evd, *conn_evd, NULL, ep); if (status != 0) { @@ -164,7 +164,7 @@ void DT_fft_init_client 
(Params_t *param DT_assert_dat (phead, rc == 0); /* create a PZ */ - rc = dat_pz_create (conn->ia, &conn->pz); + rc = conn->ia->common.provider->pz_create (conn->ia, &conn->pz); DT_assert_dat (phead, rc == 0); /* create an EP and its EVDs */ @@ -203,7 +203,7 @@ int DT_fft_destroy_conn_struct (Params_t { if (conn->connected) { - rc = dat_ep_disconnect (conn->ep, DAT_CLOSE_DEFAULT); + rc = conn->ep->common.provider->ep_disconnect (conn->ep, DAT_CLOSE_DEFAULT); DT_assert_clean (phead, rc == 0); if (!DT_disco_event_wait ( phead, conn->cr_evd, NULL )) @@ -212,7 +212,7 @@ int DT_fft_destroy_conn_struct (Params_t DT_Tdep_PT_Printf (phead, "DT_fft_destroy_conn_struct: bad disconnect event\n"); } } - rc = dat_ep_free (conn->ep); + rc = conn->ep->common.provider->ep_free (conn->ep); DT_assert_clean (phead, rc == 0); } if (conn->bpool) @@ -221,7 +221,7 @@ int DT_fft_destroy_conn_struct (Params_t } if (conn->psp) { - rc = dat_psp_free (conn->psp); + rc = conn->psp->common.provider->psp_free (conn->psp); DT_assert_clean (phead, rc == 0); } if (conn->cr_evd) @@ -250,7 +250,7 @@ int DT_fft_destroy_conn_struct (Params_t } if (conn->pz) { - rc = dat_pz_free (conn->pz); + rc = conn->pz->common.provider->pz_free (conn->pz); DT_assert_clean (phead, rc == 0); } if (conn->ia) @@ -277,7 +277,7 @@ void DT_fft_init_server (Params_t *param DT_assert_dat (phead, rc == 0); /* create a PZ */ - rc = dat_pz_create (conn->ia, &conn->pz); + rc = conn->ia->common.provider->pz_create (conn->ia, &conn->pz); DT_assert_dat (phead, rc == 0); /* create an EP and its EVDs */ @@ -292,7 +292,7 @@ void DT_fft_init_server (Params_t *param DT_assert_dat (phead, rc == 0); /* create a PSP */ - rc = dat_psp_create (conn->ia, SERVER_PORT_NUMBER, conn->cr_evd, + rc = conn->ia->common.provider->psp_create (conn->ia, SERVER_PORT_NUMBER, conn->cr_evd, DAT_PSP_CONSUMER_FLAG, &conn->psp); DT_assert_dat (phead, rc == 0); @@ -323,7 +323,7 @@ void DT_fft_listen (Params_t *params_ptr "DT_fft_listen")); /* accept the 
connection */ - rc =dat_cr_accept (conn->cr, conn->ep, 0, (void *)0); + rc =conn->cr->common.provider->cr_accept (conn->cr, conn->ep, 0, (void *)0); DT_assert_dat (phead, rc == 0); /* wait on a conn event via the conn EVD */ @@ -352,7 +352,8 @@ int DT_fft_connect (Params_t *params_ptr DT_Tdep_PT_Printf (phead, "Connection to server, attempt #%d\n", wait_count+1); /* attempt to connect, timeout = 10 secs */ - rc = dat_ep_connect (conn->ep, conn->remote_netaddr, + rc = conn->ep->common.provider->ep_connect(conn->ep, + conn->remote_netaddr, SERVER_PORT_NUMBER, 10*1000000, 0, (void *)0, DAT_QOS_BEST_EFFORT, DAT_CONNECT_DEFAULT_FLAG); DT_assert_dat (phead, rc == 0); Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_mem.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_mem.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_mem.c (working copy) @@ -87,7 +87,7 @@ int DT_mem_generic (Params_t *params_ptr if (flag != 4) { DT_Tdep_PT_Printf (phead, "Registering memory\n"); - rc = dat_lmr_kcreate (ia, + rc = ia->common.provider->lmr_kcreate (ia, DAT_MEM_TYPE_VIRTUAL, region, buffer_size, @@ -113,12 +113,12 @@ int DT_mem_generic (Params_t *params_ptr { if (lmr) { - rc = dat_lmr_free (lmr); + rc = lmr->common.provider->lmr_free (lmr); DT_assert_dat (phead, rc == 0); } lmr = NULL; - rc = dat_lmr_kcreate (conn.ia, + rc = conn.ia->common.provider->lmr_kcreate (conn.ia, DAT_MEM_TYPE_VIRTUAL, region, buffer_size, @@ -136,7 +136,7 @@ int DT_mem_generic (Params_t *params_ptr cleanup: if (lmr) { - rc = dat_lmr_free (lmr); + rc = lmr->common.provider->lmr_free (lmr); DT_assert_clean (phead, rc == 0); } if (alloc_ptr) Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_limit.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_limit.c (revision 2952) +++ 
gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_limit.c (working copy) @@ -246,7 +246,7 @@ limit_test ( DT_Tdep_Print_Head *phead, cmd->width)); for (w = 0; w < cmd->width; w++) { - ret = dat_pz_create (hdl_sets[w].ia, + ret = hdl_sets[w].ia->common.provider->pz_create (hdl_sets[w].ia, &hdl_sets[w].pz); if (ret != 0) { @@ -288,7 +288,7 @@ limit_test ( DT_Tdep_Print_Head *phead, retval = TRUE; break; } - ret = dat_pz_create (hdl_sets[w % cmd->width].ia, + ret = hdl_sets[w % cmd->width].ia->common.provider->pz_create (hdl_sets[w % cmd->width].ia, &hdlptr[w]); if (ret != 0) { @@ -306,7 +306,7 @@ limit_test ( DT_Tdep_Print_Head *phead, for (tmp = 0; tmp < w; tmp++) { DT_Mdep_Schedule(); - ret = dat_pz_free (hdlptr[tmp]); + ret = hdlptr[tmp]->common.provider->pz_free (hdlptr[tmp]); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_pz_free fails: %s\n", @@ -579,7 +579,7 @@ limit_test ( DT_Tdep_Print_Head *phead, cmd->width)); for (w = 0; w < cmd->width; w++) { - ret = dat_ep_create (hdl_sets[w].ia, + ret = hdl_sets[w].ia->common.provider->ep_create (hdl_sets[w].ia, hdl_sets[w].pz, hdl_sets[w].evd, /* recv */ hdl_sets[w].evd, /* request */ @@ -624,7 +624,7 @@ limit_test ( DT_Tdep_Print_Head *phead, retval = TRUE; break; } - ret = dat_ep_create (hdl_sets[w % cmd->width].ia, + ret = hdl_sets[w % cmd->width].ia->common.provider->ep_create (hdl_sets[w % cmd->width].ia, hdl_sets[w % cmd->width].pz, hdl_sets[w % cmd->width].evd, hdl_sets[w % cmd->width].evd, @@ -647,7 +647,7 @@ limit_test ( DT_Tdep_Print_Head *phead, for (tmp = 0; tmp < w; tmp++) { DT_Mdep_Schedule(); - ret = dat_ep_free (hdlptr[tmp]); + ret = hdlptr[tmp]->common.provider->ep_free (hdlptr[tmp]); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_ep_free fails: %s\n", @@ -724,7 +724,7 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * Each RSP needs a unique EP, so create one first */ - ret = dat_ep_create (hdl_sets[w % cmd->width].ia, + ret = hdl_sets[w % cmd->width].ia->common.provider->ep_create 
(hdl_sets[w % cmd->width].ia, hdl_sets[w % cmd->width].pz, hdl_sets[w % cmd->width].evd, hdl_sets[w % cmd->width].evd, @@ -739,7 +739,7 @@ limit_test ( DT_Tdep_Print_Head *phead, break; } - ret = dat_rsp_create (hdl_sets[w % cmd->width].ia, + ret = hdl_sets[w % cmd->width].ia->common.provider->rsp_create (hdl_sets[w % cmd->width].ia, CONN_QUAL0 + w, epptr[w], hdl_sets[w % cmd->width].evd, @@ -757,7 +757,7 @@ limit_test ( DT_Tdep_Print_Head *phead, DT_Tdep_PT_Printf (phead, "%s: dat_rsp_create #%d fails: %s\n", module, w+1, DT_RetToString (ret)); /* Cleanup the EP; no-one else will. */ - ret = dat_ep_free (epptr[w]); + ret = epptr[w]->common.provider->ep_free (epptr[w]); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_ep_free (internal cleanup @ #%d) fails: %s\n", @@ -775,7 +775,7 @@ limit_test ( DT_Tdep_Print_Head *phead, for (tmp = 0; tmp < w; tmp++) { DT_Mdep_Schedule(); - ret = dat_rsp_free (hdlptr[tmp]); + ret = hdlptr[tmp]->common.provider->rsp_free (hdlptr[tmp]); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_rsp_free fails: %s\n", @@ -783,7 +783,7 @@ limit_test ( DT_Tdep_Print_Head *phead, retval = FALSE; } /* Free EPs */ - ret = dat_ep_free (epptr[tmp]); + ret = epptr[tmp]->common.provider->ep_free (epptr[tmp]); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_ep_free fails: %s for RSPs\n", @@ -831,7 +831,7 @@ limit_test ( DT_Tdep_Print_Head *phead, retval = TRUE; break; } - ret = dat_psp_create (hdl_sets[w % cmd->width].ia, + ret = hdl_sets[w % cmd->width].ia->common.provider->psp_create (hdl_sets[w % cmd->width].ia, CONN_QUAL0 + w, hdl_sets[w % cmd->width].evd, DAT_PSP_CONSUMER_FLAG, @@ -853,7 +853,7 @@ limit_test ( DT_Tdep_Print_Head *phead, for (tmp = 0; tmp < w; tmp++) { DT_Mdep_Schedule(); - ret = dat_psp_free (hdlptr[tmp]); + ret = hdlptr[tmp]->common.provider->psp_free (hdlptr[tmp]); if (ret == -ENOSYS) { DT_Tdep_PT_Printf (phead, "%s: dat_psp_free unimplemented\n" @@ -898,7 +898,7 @@ limit_test ( DT_Tdep_Print_Head *phead, memset (&region, 0, 
sizeof (region)); region.for_va = hdl_sets[w].lmr_buffer; - ret = dat_lmr_kcreate (hdl_sets[w].ia, + ret = hdl_sets[w].ia->common.provider->lmr_kcreate (hdl_sets[w].ia, DAT_MEM_TYPE_VIRTUAL, region, DFLT_BUFFSZ, @@ -1045,6 +1045,7 @@ limit_test ( DT_Tdep_Print_Head *phead, { struct dat_lmr_triplet *iovp = &hdlptr[w * cmd->width + i]; DAT_DTO_COOKIE cookie; + struct dat_ep *ep; iovp->virtual_address = (u64) (uintptr_t) hdl_sets[i].lmr_buffer; @@ -1055,7 +1056,8 @@ limit_test ( DT_Tdep_Print_Head *phead, DT_Tdep_PT_Printf (phead, "%s: dat_ep_post_recv #%d\n", module, w * cmd->width + i + 1); - ret = dat_ep_post_recv (hdl_sets[i].ep, + ep = hdl_sets[i].ep; + ret = ep->common.provider->ep_post_recv(ep, 1, iovp, cookie, @@ -1092,7 +1094,7 @@ limit_test ( DT_Tdep_Print_Head *phead, * outstanding recv DTOs in error, and otherwise * be a no-op. */ - ret = dat_ep_reset (hdl_sets[i].ep); + ret = hdl_sets[i].ep->common.provider->ep_reset(hdl_sets[i].ep); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_ep_disconnect (abrupt) fails: %s\n", @@ -1188,7 +1190,7 @@ clean_up_now: { if (hdl_sets[w].lmr) { - ret = dat_lmr_free (hdl_sets[w].lmr); + ret = hdl_sets[w].lmr->common.provider->lmr_free (hdl_sets[w].lmr); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_lmr_free fails: %s\n", @@ -1223,7 +1225,7 @@ clean_up_now: { if (hdl_sets[w].ep) { - ret = dat_ep_free (hdl_sets[w].ep); + ret = hdl_sets[w].ep->common.provider->ep_free (hdl_sets[w].ep); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_ep_free fails: %s\n", @@ -1287,7 +1289,7 @@ clean_up_now: { if (hdl_sets[w].pz) { - ret = dat_pz_free (hdl_sets[w].pz); + ret = hdl_sets[w].pz->common.provider->pz_free (hdl_sets[w].pz); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_pz_free fails: %s\n", Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_queryinfo.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_queryinfo.c (revision 2952) 
+++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_queryinfo.c (working copy) @@ -109,8 +109,7 @@ int DT_queryinfo_basic (Params_t *params (object_to_query == QUERY_RSP) || (object_to_query == QUERY_PZ) ) { - rc = dat_pz_create (ia, - &pz); + rc = ia->common.provider->pz_create (ia, &pz); DT_assert_dat (phead, rc == 0); } @@ -139,7 +138,7 @@ int DT_queryinfo_basic (Params_t *params &conn_evd); DT_assert_dat (phead, rc == 0); - rc = dat_ep_create (ia, + rc = ia->common.provider->ep_create (ia, pz, recv_evd, send_evd, @@ -166,7 +165,7 @@ int DT_queryinfo_basic (Params_t *params { if (result_wanted == 0) { - rc = dat_ia_query (ia, + rc = ia->common.provider->ia_query (ia, &evd, &ia_attributes, &provider_attributes); @@ -236,7 +235,7 @@ int DT_queryinfo_basic (Params_t *params { if (result_wanted == 0) { - rc = dat_evd_query (evd, + rc = evd->common.provider->evd_query (evd, &evd_param); } #if 0 @@ -256,7 +255,7 @@ int DT_queryinfo_basic (Params_t *params /* Test dat_psp_query function */ else if (object_to_query == QUERY_PSP) { - rc = dat_psp_create (ia, + rc = ia->common.provider->psp_create (ia, SERVER_PORT_NUMBER, cr_evd, DAT_PSP_PROVIDER_FLAG, @@ -264,7 +263,7 @@ int DT_queryinfo_basic (Params_t *params DT_assert_dat (phead, rc == 0); if (result_wanted == 0) { - rc = dat_psp_query (psp, + rc = psp->common.provider->psp_query (psp, &psp_param); } #if 0 @@ -284,14 +283,13 @@ int DT_queryinfo_basic (Params_t *params /* Test dat_rsp_query function */ else if (object_to_query == QUERY_RSP) { - rc = dat_rsp_create (ia, + rc = ia->common.provider->rsp_create (ia, SERVER_PORT_NUMBER, ep, cr_evd, &rsp); DT_assert_dat (phead, rc == 0); - rc = dat_rsp_query (rsp, - &rsp_param); + rc = rsp->common.provider->rsp_query (rsp, &rsp_param); } /* Test dat_cr_query function */ @@ -304,15 +302,13 @@ int DT_queryinfo_basic (Params_t *params /* Test dat_ep_query function */ else if (object_to_query == QUERY_EP) { - rc = dat_ep_query (ep, - &ep_param); + rc = 
ep->common.provider->ep_query (ep, &ep_param); } /* Test dat_pz_query function */ else if (object_to_query == QUERY_PZ) { - rc = dat_pz_query (pz, - &pz_param); + rc = pz->common.provider->pz_query (pz, &pz_param); } /* Test dat_lmr_query function */ @@ -322,7 +318,7 @@ int DT_queryinfo_basic (Params_t *params DT_assert (phead, alloc_ptr); memset (&region, 0, sizeof (region)); region.for_va = alloc_ptr; - rc = dat_lmr_kcreate (ia, + rc = ia->common.provider->lmr_kcreate (ia, DAT_MEM_TYPE_VIRTUAL, region, buffer_size, @@ -335,17 +331,16 @@ int DT_queryinfo_basic (Params_t *params &reg_size, &reg_addr); DT_assert_dat (phead, rc == 0); - rc = dat_lmr_query (lmr, - &lmr_param); + rc = lmr->common.provider->lmr_query (lmr, &lmr_param); } /* Test dat_rmr_query function */ else if (object_to_query == QUERY_RMR) { - rc = dat_rmr_create (pz, + rc = pz->common.provider->rmr_create (pz, &rmr_handle); DT_assert_dat (phead, rc == 0); - rc = dat_rmr_query (rmr_handle, + rc = rmr_handle->common.provider->rmr_query (rmr_handle, &rmr_param); } @@ -354,13 +349,13 @@ int DT_queryinfo_basic (Params_t *params cleanup: if (rsp) { - rc = dat_rsp_free (rsp); + rc = rsp->common.provider->rsp_free (rsp); DT_assert_clean (phead, rc == 0); } if (ep) { - rc = dat_ep_free (ep); + rc = ep->common.provider->ep_free (ep); DT_assert_clean (phead, rc == 0); } @@ -384,13 +379,13 @@ cleanup: if (lmr) { - rc = dat_lmr_free (lmr); + rc = lmr->common.provider->lmr_free (lmr); DT_assert_clean (phead, rc == 0); } if (rmr_handle) { - rc = dat_rmr_free (rmr_handle); + rc = rmr_handle->common.provider->rmr_free (rmr_handle); DT_assert_clean (phead, rc == 0); } @@ -403,7 +398,7 @@ cleanup: #endif if (psp) { - rc = dat_psp_free (psp); + rc = psp->common.provider->psp_free (psp); DT_assert_clean (phead, rc == 0); } @@ -415,7 +410,7 @@ cleanup: if (pz) { - rc = dat_pz_free (pz); + rc = pz->common.provider->pz_free (pz); DT_assert_clean (phead, rc == 0); } Index: 
gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_transaction_test.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_transaction_test.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_transaction_test.c (working copy) @@ -245,7 +245,7 @@ DT_Transaction_Main (void *param) private_data_str = "DAPL and RDMA rule! Test 4321."; /* create a protection zone */ - ret = dat_pz_create (test_ptr->ia, &test_ptr->pz); + ret = test_ptr->ia->common.provider->pz_create (test_ptr->ia, &test_ptr->pz); if (ret != 0) { DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_pz_create error: %s\n", @@ -364,7 +364,7 @@ DT_Transaction_Main (void *param) } /* Create EP */ - ret = dat_ep_create (test_ptr->ia, /* IA */ + ret = test_ptr->ia->common.provider->ep_create (test_ptr->ia, /* IA */ test_ptr->pz, /* PZ */ test_ptr->recv_evd_hdl, /* recv */ test_ptr->reqt_evd_hdl, /* request */ @@ -458,7 +458,7 @@ DT_Transaction_Main (void *param) * await a connection for this EP */ - ret = dat_rsp_create (test_ptr->ia, + ret = test_ptr->ia->common.provider->rsp_create (test_ptr->ia, test_ptr->ep_context[i].ia_port, test_ptr->ep_context[i].ep, test_ptr->creq_evd_hdl, @@ -473,7 +473,7 @@ DT_Transaction_Main (void *param) } else { - ret = dat_psp_create (test_ptr->ia, + ret = test_ptr->ia->common.provider->psp_create (test_ptr->ia, test_ptr->ep_context[i].ia_port, test_ptr->creq_evd_hdl, DAT_PSP_CONSUMER_FLAG, @@ -548,7 +548,7 @@ DT_Transaction_Main (void *param) goto test_failure; } - ret = dat_cr_query (cr, &cr_param); + ret = cr->common.provider->cr_query (cr, &cr_param); if (ret != 0) { DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_cr_query #%d error:(%x) %s\n", @@ -572,7 +572,7 @@ DT_Transaction_Main (void *param) } /* what, me query? 
just try to accept the connection */ - ret = dat_cr_accept (cr, + ret = cr->common.provider->cr_accept (cr, NULL, /* NULL for RSP */ 0, (void *)0 /* no private data */ ); if (ret != 0) @@ -595,7 +595,7 @@ DT_Transaction_Main (void *param) goto test_failure; } /* throw away single-use PSP */ - ret = dat_rsp_free (test_ptr->ep_context[i].rsp); + ret = test_ptr->ep_context[i].rsp->common.provider->rsp_free (test_ptr->ep_context[i].rsp); if (ret != 0) { DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_rsp_free #%d error: %s\n", @@ -632,7 +632,7 @@ DT_Transaction_Main (void *param) goto test_failure; } - ret = dat_cr_query (cr, &cr_param); + ret = cr->common.provider->cr_query (cr, &cr_param); if (ret != 0) { DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_cr_query #%d error: %s\n", @@ -657,7 +657,7 @@ DT_Transaction_Main (void *param) /* what, me query? just try to accept the connection */ - ret = dat_cr_accept (cr, + ret = cr->common.provider->cr_accept (cr, test_ptr->ep_context[i].ep, 0, (void *)0 /* no private data */ ); if (ret != 0) @@ -665,7 +665,7 @@ DT_Transaction_Main (void *param) DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_cr_accept #%d error: %s\n", test_ptr->base_port, i, DT_RetToString (ret)); /* cr consumed on failure */ - (void) dat_psp_free (test_ptr->ep_context[i].psp); + (void) test_ptr->ep_context[i].psp->common.provider->psp_free (test_ptr->ep_context[i].psp); status = 1; goto test_failure; } @@ -677,13 +677,13 @@ DT_Transaction_Main (void *param) &event_num)) { /* error message printed by DT_cr_event_wait */ - (void) dat_psp_free (test_ptr->ep_context[i].psp); + (void) test_ptr->ep_context[i].psp->common.provider->psp_free (test_ptr->ep_context[i].psp); status = 1; goto test_failure; } /* throw away single-use PSP */ - ret = dat_psp_free (test_ptr->ep_context[i].psp); + ret = test_ptr->ep_context[i].psp->common.provider->psp_free (test_ptr->ep_context[i].psp); if (ret != 0) { DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_psp_free #%d error: 
%s\n", @@ -710,7 +710,7 @@ DT_Transaction_Main (void *param) test_ptr->ep_context[i].ia_port)); retry: - ret = dat_ep_connect (test_ptr->ep_context[i].ep, + ret = test_ptr->ep_context[i].ep->common.provider->ep_connect (test_ptr->ep_context[i].ep, test_ptr->remote_ia_addr, test_ptr->ep_context[i].ia_port, DAT_TIMEOUT_MAX, @@ -741,9 +741,10 @@ retry: */ { struct dat_event event; + struct dat_ep *ep = test_ptr->ep_context[i].ep; int drained = 0; - dat_ep_reset (test_ptr->ep_context[i].ep); + ep->common.provider->ep_reset(ep); do { ret = DT_Tdep_evd_dequeue ( test_ptr->recv_evd_hdl, @@ -1125,7 +1126,7 @@ test_failure: */ if (test_ptr->ep_context[i].ep) { - ret = dat_ep_disconnect (test_ptr->ep_context[i].ep, + ret = test_ptr->ep_context[i].ep->common.provider->ep_disconnect (test_ptr->ep_context[i].ep, DAT_CLOSE_ABRUPT_FLAG); if (ret != 0) { @@ -1196,7 +1197,7 @@ test_failure: &event); } while (ret == 0); /* Destroy the EP */ - ret = dat_ep_free (ep); + ret = ep->common.provider->ep_free (ep); if (ret != 0) { DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_ep_free #%d error: %s\n", @@ -1262,7 +1263,7 @@ test_failure: /* clean up the PZ */ if (test_ptr->pz) { - ret = dat_pz_free (test_ptr->pz); + ret = test_ptr->pz->common.provider->pz_free (test_ptr->pz); if (ret != 0) { DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_pz_free error: %s\n", Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_performance_server.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_performance_server.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_performance_server.c (working copy) @@ -174,7 +174,7 @@ DT_Performance_Test_Server_Connect ( */ status = TRUE; - ret = dat_psp_create (test_ptr->ia, + ret = test_ptr->ia->common.provider->psp_create (test_ptr->ia, test_ptr->ep_context.port, test_ptr->creq_evd_hdl, DAT_PSP_CONSUMER_FLAG, @@ -215,7 +215,7 @@ DT_Performance_Test_Server_Connect 
( } /* what, me query? just try to accept the connection */ - ret = dat_cr_accept (cr, + ret = cr->common.provider->cr_accept (cr, test_ptr->ep_context.ep, 0, (void *)0 /* no private data */ ); @@ -245,7 +245,7 @@ psp_free: if ( NULL != psp ) { /* throw away single-use PSP */ - ret = dat_psp_free (psp); + ret = psp->common.provider->psp_free (psp); if (ret != 0) { DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_psp_free error: %s\n", @@ -256,7 +256,7 @@ psp_free: if ( NULL != rsp ) { /* throw away single-use PSP */ - ret = dat_rsp_free (rsp); + ret = rsp->common.provider->rsp_free (rsp); if (ret != 0) { DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_rsp_free error: %s\n", Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_dataxfer_client.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_dataxfer_client.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_dataxfer_client.c (working copy) @@ -85,7 +85,7 @@ cleanup: { /* disconnect */ DT_Tdep_PT_Printf (phead, "Disconnect\n"); - rc = dat_ep_disconnect (conn.ep, DAT_CLOSE_ABRUPT_FLAG); + rc = conn.ep->common.provider->ep_disconnect (conn.ep, DAT_CLOSE_ABRUPT_FLAG); DT_assert_clean (phead, rc == 0); } rc = DT_fft_destroy_conn_struct (phead, &conn); Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_bpool.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_bpool.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_bpool.c (working copy) @@ -161,7 +161,7 @@ DT_BpoolAlloc ( region.for_pa = virt_to_phys(region.for_va); } - ret = dat_lmr_kcreate (ia, + ret = ia->common.provider->lmr_kcreate (ia, DT_mem_type, region, bp_len, @@ -283,7 +283,7 @@ err: { if (bpool_ptr->rmr_handle) { - ret = dat_rmr_free (bpool_ptr->rmr_handle); + ret = bpool_ptr->rmr_handle->common.provider->rmr_free (bpool_ptr->rmr_handle); 
if (ret != 0) { DT_Tdep_PT_Printf (phead, @@ -294,7 +294,7 @@ err: } if (bpool_ptr->lmr) { - ret = dat_lmr_free (bpool_ptr->lmr); + ret = bpool_ptr->lmr->common.provider->lmr_free (bpool_ptr->lmr); if (ret != 0) { DT_Tdep_PT_Printf (phead, @@ -348,7 +348,7 @@ DT_Bpool_Destroy (Per_Test_Data_t * pt_p * an RMR, doing an rmr_free will pull the plug * and cleanup properly. */ - ret = dat_rmr_free (bpool_ptr->rmr_handle); + ret = bpool_ptr->rmr_handle->common.provider->rmr_free (bpool_ptr->rmr_handle); if (ret != 0) { DT_Tdep_PT_Printf (phead, @@ -361,7 +361,7 @@ DT_Bpool_Destroy (Per_Test_Data_t * pt_p if (bpool_ptr->lmr) { - int ret = dat_lmr_free (bpool_ptr->lmr); + int ret = bpool_ptr->lmr->common.provider->lmr_free (bpool_ptr->lmr); if (ret != 0) { DT_Tdep_PT_Printf (phead, Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_test_util.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_test_util.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_test_util.c (working copy) @@ -45,7 +45,7 @@ DT_query ( Per_Test_Data_t *pt_ptr, phead = pt_ptr->Params.phead; /* Query the IA */ - ret = dat_ia_query (ia, + ret = ia->common.provider->ia_query (ia, &async_evd_hdl, &pt_ptr->ia_attr, &pt_ptr->provider_attr); @@ -58,7 +58,7 @@ DT_query ( Per_Test_Data_t *pt_ptr, } /* Query the EP */ - ret = dat_ep_query (ep, &ep_params); + ret = ep->common.provider->ep_query (ep, &ep_params); if (ret != 0) { DT_Tdep_PT_Printf (phead, "%s: dat_ep_query error: %s\n", @@ -183,7 +183,7 @@ DT_post_recv_buffer (DT_Tdep_Print_Head DT_Tdep_PT_Debug (3, (phead, "Post-Recv #%d [%p, %x]\n", index, buff, size)); /* Post the recv buffer */ - ret = dat_ep_post_recv (ep, + ret = ep->common.provider->ep_post_recv (ep, 1, iov, cookie, @@ -227,7 +227,7 @@ DT_post_send_buffer (DT_Tdep_Print_Head DT_Tdep_PT_Debug (3, (phead, "Post-Send #%d [%p, %x]\n", index, buff, size)); /* Post the recv buffer */ - ret = 
dat_ep_post_send (ep, + ret = ep->common.provider->ep_post_send (ep, 1, iov, cookie, @@ -696,7 +696,7 @@ DT_cr_check ( DT_Tdep_Print_Head *phead } else { - ret = dat_cr_reject (cr_stat_p->cr); + ret = cr_stat_p->cr->common.provider->cr_reject (cr_stat_p->cr); if (ret != 0) { DT_Tdep_PT_Printf (phead, "\tdat_cr_reject error: %s\n", Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_client.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_client.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_client.c (working copy) @@ -114,7 +114,7 @@ DT_cs_Client (Params_t * params_ptr, DT_Tdep_PT_Debug (1,(phead, "%s: IA %s opened\n", module, dapl_name)); /* Create a PZ */ - ret = dat_pz_create (ia, &pz); + ret = ia->common.provider->pz_create (ia, &pz); if (ret != 0) { DT_Tdep_PT_Printf (phead, @@ -175,7 +175,7 @@ DT_cs_Client (Params_t * params_ptr, } /* Create an EP */ - ret = dat_ep_create (ia, /* IA */ + ret = ia->common.provider->ep_create (ia, /* IA */ pz, /* PZ */ recv_evd_hdl, /* recv */ reqt_evd_hdl, /* request */ @@ -254,7 +254,7 @@ retry_repost: DT_Tdep_PT_Debug (1,(phead, "%s: Connect Endpoint\n", module)); try_connect =1; retry: - ret = dat_ep_connect (ep, + ret = ep->common.provider->ep_connect (ep, server_netaddr, SERVER_PORT_NUMBER, DAT_TIMEOUT_MAX, @@ -288,7 +288,7 @@ retry: * See if any buffers were flushed as a result of * the REJECT; clean them up and repost if so */ - dat_ep_reset (ep); + ep->common.provider->ep_reset (ep); do { @@ -528,7 +528,7 @@ client_exit: * graceful attempt might fail because we got here due to * some error above, so we may as well try harder. 
*/ - ret = dat_ep_disconnect (ep, DAT_CLOSE_ABRUPT_FLAG); + ret = ep->common.provider->ep_disconnect (ep, DAT_CLOSE_ABRUPT_FLAG); if (ret != 0) { DT_Tdep_PT_Printf (phead, @@ -561,7 +561,7 @@ client_exit: &event); } while (ret == 0); - ret = dat_ep_free (ep); + ret = ep->common.provider->ep_free (ep); if (ret != 0) { DT_Tdep_PT_Printf (phead, @@ -617,7 +617,7 @@ client_exit: /* Free the PZ */ if (pz) { - ret = dat_pz_free (pz); + ret = pz->common.provider->pz_free (pz); if (ret != 0) { DT_Tdep_PT_Printf (phead, Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_endpoint.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_endpoint.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_endpoint.c (working copy) @@ -63,14 +63,14 @@ int DT_endpoint_generic (Params_t *param rc = dat_ia_open (dev_name, DEFAULT_QUEUE_LEN, &evd, &ia); DT_assert_dat (phead, rc == 0); - rc = dat_pz_create (ia, &pz); + rc = ia->common.provider->pz_create (ia, &pz); DT_assert_dat (phead, rc == 0); if (destroy_pz_early) { if (pz) { - rc = dat_pz_free (pz); + rc = pz->common.provider->pz_free (pz); DT_assert_dat (phead, rc == 0); } } @@ -88,7 +88,7 @@ int DT_endpoint_generic (Params_t *param DAT_EVD_CONNECTION_FLAG, &conn_evd); DT_assert_dat (phead, rc == 0); - rc = dat_ep_create (ia, pz, recv_evd, send_evd, + rc = ia->common.provider->ep_create (ia, pz, recv_evd, send_evd, conn_evd, NULL, &ep); if (destroy_pz_early) { @@ -103,7 +103,7 @@ int DT_endpoint_generic (Params_t *param cleanup: if (ep) { - rc = dat_ep_free (ep); + rc = ep->common.provider->ep_free (ep); DT_assert_clean (phead, rc == 0); } @@ -127,7 +127,7 @@ cleanup: if (!destroy_pz_early && pz) { - rc = dat_pz_free (pz); + rc = pz->common.provider->pz_free (pz); DT_assert_clean (phead, rc == 0); } @@ -189,7 +189,7 @@ int DT_endpoint_case2 (Params_t *params_ rc = DT_ia_open (dev_name, &ia); DT_assert_dat (phead, rc == 0); 
- rc = dat_pz_create (ia, &pz); + rc = ia->common.provider->pz_create (ia, &pz); DT_assert_dat (phead, rc == 0); rc = DT_ep_create (params_ptr, ia, @@ -210,7 +210,7 @@ int DT_endpoint_case2 (Params_t *params_ 4096) == TRUE); if (ep) { - rc = dat_ep_free (ep); + rc = ep->common.provider->ep_free (ep); DT_assert_dat (phead, rc == 0); } @@ -232,7 +232,7 @@ cleanup: } if (pz) { - rc = dat_pz_free (pz); + rc = pz->common.provider->pz_free (pz); DT_assert_clean (phead, rc == 0); } if (ia) Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_transaction_util.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_transaction_util.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_transaction_util.c (working copy) @@ -60,7 +60,7 @@ DT_handle_post_recv_buf (DT_Tdep_Print_H | (((uintptr_t) DT_Bpool_GetBuffer (op->bp, 0)) & 0xffffffffUL)); /* Post the recv */ - ret = dat_ep_post_recv ( ep_context[i].ep, + ret = ep_context[i].ep->common.provider->ep_post_recv(ep_context[i].ep, op->num_segs, iov, cookie, @@ -123,7 +123,7 @@ DT_handle_send_op (DT_Tdep_Print_Head *p | (((uintptr_t) DT_Bpool_GetBuffer (op->bp, 0)) & 0xffffffffUL)); /* Post the send */ - ret = dat_ep_post_send ( ep_context[i].ep, + ret = ep_context[i].ep->common.provider->ep_post_send ( ep_context[i].ep, op->num_segs, iov, cookie, @@ -517,7 +517,7 @@ DT_handle_rdma_op (DT_Tdep_Print_Head *p if (opcode == RDMA_WRITE) { - ret = dat_ep_post_rdma_write (ep_context[i].ep, + ret = ep_context[i].ep->common.provider->ep_post_rdma_write (ep_context[i].ep, op->num_segs, iov, cookie, @@ -528,7 +528,7 @@ DT_handle_rdma_op (DT_Tdep_Print_Head *p else /* opcode == RDMA_READ */ { - ret = dat_ep_post_rdma_read ( ep_context[i].ep, + ret = ep_context[i].ep->common.provider->ep_post_rdma_read ( ep_context[i].ep, op->num_segs, iov, cookie, Index: gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_pz.c 
=================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_pz.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/test/dapl_fft_pz.c (working copy) @@ -55,13 +55,13 @@ int DT_pz_case0 ( Params_t *params_ptr, rc = DT_ia_open (dev_name, &ia); DT_assert_dat (phead, rc == 0); - rc = dat_pz_create (ia, &pz); + rc = ia->common.provider->pz_create (ia, &pz); DT_assert_dat (phead, rc == 0); cleanup: if (pz) { - rc = dat_pz_free (pz); + rc = pz->common.provider->pz_free (pz); DT_assert_dat (phead, rc == 0); } if (ia) @@ -101,7 +101,7 @@ int DT_pz_case1 (Params_t *params_ptr, F rc = DT_ia_open (dev_name, &ia); DT_assert_dat (phead, rc == 0); - rc = dat_pz_create (ia, &pz); + rc = ia->common.provider->pz_create (ia, &pz); DT_assert_dat (phead, rc == 0); rc = DT_ep_create (params_ptr, @@ -116,7 +116,7 @@ int DT_pz_case1 (Params_t *params_ptr, F if (pz) { - rc = dat_pz_free (pz); + rc = pz->common.provider->pz_free (pz); DT_assert_dat (phead, rc == -EINVAL); } @@ -124,7 +124,7 @@ cleanup: /* corrrect order */ if (ep) { - rc=dat_ep_free (ep); + rc=ep->common.provider->ep_free (ep); DT_assert_clean (phead, rc == 0); } if (conn_evd) @@ -144,7 +144,7 @@ cleanup: } if (pz) { - rc=dat_pz_free (pz); + rc=pz->common.provider->pz_free (pz); DT_assert_clean (phead, rc == 0); } @@ -183,7 +183,7 @@ int DT_pz_case2 (Params_t *params_ptr, F rc = DT_ia_open (dev_name, &ia); DT_assert_dat (phead, rc == 0); - rc = dat_pz_create (ia, &pz); + rc = ia->common.provider->pz_create (ia, &pz); DT_assert_dat (phead, rc == 0); /* allocate and register bpool */ @@ -194,7 +194,7 @@ int DT_pz_case2 (Params_t *params_ptr, F if (pz) { - rc = dat_pz_free (pz); + rc = pz->common.provider->pz_free (pz); DT_assert_dat (phead, rc == -EINVAL); } @@ -208,7 +208,7 @@ cleanup: } if (pz) { - rc=dat_pz_free (pz); + rc=pz->common.provider->pz_free (pz); DT_assert_clean (phead, rc == 0); } Index: 
gen2/utils/src/linux-kernel/kdapl/dapltest/kdapl/kdapl_tdep_evd.c =================================================================== --- gen2/utils/src/linux-kernel/kdapl/dapltest/kdapl/kdapl_tdep_evd.c (revision 2952) +++ gen2/utils/src/linux-kernel/kdapl/dapltest/kdapl/kdapl_tdep_evd.c (working copy) @@ -118,10 +118,10 @@ DT_Tdep_evd_create (struct dat_ia * ia, dat_status = -ENOMEM; goto error; } - upcall.upcall_func = DT_Tdep_Event_Callback; + upcall.upcall = DT_Tdep_Event_Callback; upcall.instance_data = evd_ptr; - dat_status = dat_evd_kcreate (ia, + dat_status = ia->common.provider->evd_kcreate (ia, evd_min_qlen, DAT_UPCALL_SINGLE_INSTANCE, &upcall, @@ -317,7 +317,7 @@ DT_Tdep_evd_free (struct dat_evd *evd) spin_unlock_irq (&DT_Evd_Lock); DT_Mdep_Free (next); - return dat_evd_free(evd); + return evd->common.provider->evd_free(evd); } static void DT_Tdep_Event_Callback ( From jlentini at netapp.com Tue Aug 2 15:06:30 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 2 Aug 2005 18:06:30 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL dapl_os_wait_object_wait() In-Reply-To: References: Message-ID: On Mon, 25 Jul 2005, Arlin Davis wrote: > James, > > Here is a patch to fix dapl_os_wait_object_wait() returning > EINVAL when passing nsec == 1000000000 to pthread_cond_timedwait(). > Hit a rare case where _microsecs was exactly 1000000. What was the timeout_val being passed to dapl_os_wait_object_wait()? Was it 1000000000 or 1000000? 
> > Thanks, > > -arlin > > Signed-off by: Arlin Davis > > Index: dapl/udapl/linux/dapl_osd.c > =================================================================== > --- dapl/udapl/linux/dapl_osd.c (revision 2899) > +++ dapl/udapl/linux/dapl_osd.c (working copy) > @@ -242,16 +242,9 @@ > > gettimeofday (&now, &tz); > microsecs = now.tv_usec + (timeout_val % 1000000); > - if (microsecs > 1000000) > - { > - now.tv_sec = now.tv_sec + timeout_val / 1000000 + 1; > - now.tv_usec = microsecs - 1000000; > - } > - else > - { > - now.tv_sec = now.tv_sec + timeout_val / 1000000; > - now.tv_usec = microsecs; > - } > + > + now.tv_sec = now.tv_sec + timeout_val/1000000 + (microsecs/1000000); > + now.tv_usec = microsecs % 1000000; > > /* Convert timeval to timespec */ > future.tv_sec = now.tv_sec; From nacc at us.ibm.com Tue Aug 2 15:14:55 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Tue, 2 Aug 2005 15:14:55 -0700 Subject: [openib-general] [PATCH] uDAPL dapl_os_wait_object_wait() In-Reply-To: References: Message-ID: <20050802221455.GM8353@us.ibm.com> On 25.07.2005 [15:20:29 -0700], Arlin Davis wrote: > James, > > Here is a patch to fix dapl_os_wait_object_wait() returning > EINVAL when passing nsec == 1000000000 to pthread_cond_timedwait(). > Hit a rare case where _microsecs was exactly 1000000. Hrm, rather than use all of these hard-coded numbers, might it not be better to have USEC_PER_SEC, etc. as #defines? Just makes the code a bit clearer without too much thought. Just my experience in dealing with time/timer code :) Thanks, Nish From sean.hefty at intel.com Tue Aug 2 15:13:10 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 2 Aug 2005 15:13:10 -0700 Subject: [openib-general] ucm.c review Message-ID: Libor, Arlin has been seeing some hangs running some multi-threaded DAPL tests with the CM. I looked over the ucm.c code and wanted to get feedback on some potential changes before working on them. (Some of them are fairly minor.) 

The main issue that I found is a possible deadlock when calling
ib_destroy_cm_id() from within a CM callback, which requires some rework.
See below for more details.

I will begin work on some of these, but wanted to verify the approach to
take regarding the destruction of the ucm_id.

- Sean

-------------------

ib_ucm_ctx_get()
- Add a check that ctx->file == file. This check is made in most places
  where ib_ucm_ctx_get() is called.

ib_ucm_ctx_put()
- Remove the automatic destruction when the last reference count goes
  to 0. Destruction and cleanup of struct ib_ucm_context would be done
  only in ib_ucm_destroy_id(). I believe that we only need to prevent
  ib_destroy_cm_id() from being called while another kernel CM call is
  in progress. We may be able to use the mutex in struct ib_ucm_context
  for this. (It doesn't appear to be used anywhere.) Reference counting
  may still work better though.

ib_ucm_event_rej_get()
ib_ucm_event_mra_get()
ib_ucm_event_lap_get()
ib_ucm_event_apr_get()
ib_ucm_event_sidr_req_get()
- Any objection to removing these functions?

ib_ucm_event_process()
- For IB_CM_REQ_RECEIVED, the primary path must exist, so we can remove
  the check. As a minor optimization, we could combine the three
  allocations associated with struct ib_ucm_event (structure itself,
  private data, and info).

ib_ucm_event_handler()
- We could store a reference to struct ib_ucm_context directly in the
  cm_id, rather than needing to perform a lookup. We also need to avoid
  call paths to ib_destroy_cm_id(). Note that calls from a separate
  thread into ib_destroy_cm_id() will block while we're in this callback,
  so the cm_id context is safe to access. I'm not sure how to handle
  errors when reporting events here. Currently, the code returns an error
  status back to the CM, which will result in the destruction of the
  cm_id. The cm_id will be destroyed a second time when the ucm_context
  is destroyed. It may be best to just drop the event.

ib_ucm_create_id()
- We need to serialize with file->mutex around the call to
  ib_ucm_ctx_alloc().

ib_ucm_destroy_id()
- Missing ctx->file check against file parameter. I believe that we want
  to move the majority of the ib_ucm_ctx_put() routine to here.

ib_ucm_close()
- Change call to ib_ucm_ctx_put() to ib_ucm_destroy_id().

From robert.j.woodruff at intel.com Tue Aug 2 15:17:25 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Tue, 2 Aug 2005 15:17:25 -0700
Subject: [openib-general] RE: [BUG]OpenSM double free or corruption
In-Reply-To: <1122993087.4422.88.camel@hal.voltaire.com>
Message-ID:

Hal wrote,
>Hi Woody,

>I don't have an HCA with old firmware (what version ?) and I thought it
>was dangerous to downgrade firmware.

I think it was 4.5.3.

>> and the mthca
>> driver will not initialize and will report an error in dmesg.

What driver error do you get ?

The driver reported something like
"the firmware 4.5.3 is old and that the current rev was
4.6.2"

I have never downgraded the rev. of firmware I have only upgraded, but
I do not see why it would not work to put an old version
into the card. Perhaps the Mellanox guys would know for
sure.

woody

From rolandd at cisco.com Tue Aug 2 15:29:37 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Tue, 02 Aug 2005 15:29:37 -0700
Subject: [openib-general] RE: [BUG]OpenSM double free or corruption
In-Reply-To: (Bob Woodruff's message of "Tue, 2 Aug 2005 15:17:25 -0700")
References:
Message-ID: <52d5owezvy.fsf@cisco.com>

    Bob> The driver reported something like "the firmware 4.5.3 is old
    Bob> and that the current rev was 4.6.2"

This message is just a warning, not an error. Does anyone have any
suggestions on how to rephrase it to make it clear that this is not a
fatal condition and that the driver is continuing to load? Right now
the kernel log will say:

    HCA FW version 4.5.3 is old (4.6.2 is current).
    If you have problems, try updating your HCA FW.

 - R.
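
For readers following the thread, here is a minimal userspace sketch of the
"warn but keep loading" behavior described above. It is not the mthca source:
the FW_VER() packing, the fw_old_warning() name, and the buffer-based
interface are assumptions made for illustration only. The message text is the
one quoted from the kernel log.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical packing: major above bit 32, minor in bits 16-31,
 * subminor in bits 0-15, so versions compare as plain integers. */
#define FW_VER(major, minor, sub) \
	(((uint64_t)(major) << 32) | ((uint64_t)(minor) << 16) | (uint64_t)(sub))

/*
 * Format the "old firmware" warning into buf.
 * Returns 1 if a warning was written, 0 if the firmware is current
 * (buf is then the empty string). Neither case is an error: a caller
 * modelling driver load would continue in both.
 */
static int fw_old_warning(uint64_t have, uint64_t latest, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	buf[0] = '\0';
	if (have >= latest)
		return 0;

	snprintf(buf, len,
		 "HCA FW version %d.%d.%d is old (%d.%d.%d is current).\n"
		 "If you have problems, try updating your HCA FW.\n",
		 (int)(have >> 32), (int)((have >> 16) & 0xffff),
		 (int)(have & 0xffff),
		 (int)(latest >> 32), (int)((latest >> 16) & 0xffff),
		 (int)(latest & 0xffff));
	return 1;
}
```

The point of the sketch is only that the version check selects a log message
and never produces a failure status, which is the property the rephrased
warning is meant to make obvious to the person reading dmesg.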
From mshefty at ichips.intel.com Tue Aug 2 15:32:52 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 02 Aug 2005 15:32:52 -0700
Subject: [openib-general] RE: [BUG]OpenSM double free or corruption
In-Reply-To: <52d5owezvy.fsf@cisco.com>
References: <52d5owezvy.fsf@cisco.com>
Message-ID: <42EFF494.5030102@ichips.intel.com>

Roland Dreier wrote:
>     HCA FW version 4.5.3 is old (4.6.2 is current).
>     If you have problems, try updating your HCA FW.

I thought that this was fairly clear, but maybe just adding "continuing with
load" somewhere in there might help.

- Sean

From robert.j.woodruff at intel.com Tue Aug 2 15:35:40 2005
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Tue, 2 Aug 2005 15:35:40 -0700
Subject: [openib-general] RE: [BUG]OpenSM double free or corruption
Message-ID: <1AC79F16F5C5284499BB9591B33D6F00051D033C@orsmsx408>

Roland wrote,
>This message is just a warning, not an error. Does anyone have any
>suggestions on how to rephrase it to make it clear that this is not a
>fatal condition and that the driver is continuing to load? Right now
>the kernel log will say:

>    HCA FW version 4.5.3 is old (4.6.2 is current).
>    If you have problems, try updating your HCA FW.

>- R.

I thought the message was fine. It told me what I needed to do.
I did have problems running opensm when the old firmware was in the card.
I was not sure if it was that the driver did not load all the way or
if it was something else, but it did not matter. The message said to
upgrade the firmware, I did, and the problem went away, so I was happy.

If you want to change the message to make it clearer that the
problem is not fatal, perhaps something like,

Warning: HCA FW version 4.5.3 is old (4.6.2 is current).
If you have problems, try updating your HCA FW.

my 2 cents,

woody

From halr at voltaire.com Tue Aug 2 15:35:10 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Aug 2005 18:35:10 -0400
Subject: [openib-general] RE: [BUG]OpenSM double free or corruption
In-Reply-To:
References:
Message-ID: <1123022109.4422.101.camel@hal.voltaire.com>

On Tue, 2005-08-02 at 18:17, Bob Woodruff wrote:
> Hal wrote,
> >Hi Woody,
>
> >I don't have an HCA with old firmware (what version ?) and I thought it
> >was dangerous to downgrade firmware.
>
> I think it was 4.5.3.

So was this a Lion Cub (PCI Express) ? I don't have one to play with.

> >> and the mthca
> >> driver will not initialize and will report an error in dmesg.
>
> What driver error do you get ?

I was simulating driver errors. I didn't install old firmware.

> The driver reported something like
> "the firmware 4.5.3 is old and that the current rev was
> 4.6.2"

I found this. This is a warning rather than an error. I'm not sure what
fails which causes the double delete.

> I have never downgraded the rev. of firmware I have only upgraded, but
> I do not see why it would not work to put an old version
> into the card. Perhaps the Mellanox guys would know for
> sure.

-- Hal

From robert.j.woodruff at intel.com Tue Aug 2 16:01:04 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Tue, 2 Aug 2005 16:01:04 -0700
Subject: [openib-general] RE: [BUG]OpenSM double free or corruption
In-Reply-To: <1123022109.4422.101.camel@hal.voltaire.com>
Message-ID:

Hal wrote,
>So was this a Lion Cub (PCI Express) ? I don't have one to play with.

It was a PCI-Express HCA, EM64T Xeon Dual CPU box, and
I was running on a RedHat 2.6.9ELsmp kernel with the backport patches.
Not sure of the HCA model type, but it was not one of the new memfree ones.

I think I saw a similar error on a PCI-X card with old firmware
(version 2.0.0) on an i386 box too, but I am not sure of that one.

woody

From halr at voltaire.com Tue Aug 2 15:57:33 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Aug 2005 18:57:33 -0400
Subject: [openib-general] Re: Send SA request over umad problem.
In-Reply-To: <1123012368.4442.353.camel@hal.voltaire.com>
References: <506C3D7B14CDD411A52C00025558DED60865E20F@mtlex01.yok.mtl.com> <1123011406.4442.321.camel@hal.voltaire.com> <1123012368.4442.353.camel@hal.voltaire.com>
Message-ID: <1123023275.4422.112.camel@hal.voltaire.com>

On Tue, 2005-08-02 at 15:52, Hal Rosenstock wrote:
> > I'm not sure why that (the larger response) is the case for a response
> > to SA Get ClassPortInfo.

The larger response to a SA GetTable request. If you update to the latest
svn for osm, this should be fixed. Let me know if it is otherwise.

-- Hal

From limichal at cisco.com Tue Aug 2 16:04:05 2005
From: limichal at cisco.com (Libor Michalek)
Date: Tue, 2 Aug 2005 16:04:05 -0700
Subject: [openib-general] Re: SDP_IOCB_SIZE_MAX
In-Reply-To: <20050801132616.GX14384@mellanox.co.il>; from mst@mellanox.co.il on Mon, Aug 01, 2005 at 04:26:16PM +0300
References: <20050801132616.GX14384@mellanox.co.il>
Message-ID: <20050802160405.A30208@topspin.com>

On Mon, Aug 01, 2005 at 04:26:16PM +0300, Michael S. Tsirkin wrote:
> Libor, in sdp_iocb.h we have:
>
> sdp_iocb.h:#define SDP_IOCB_SIZE_MAX (128*1024) /* matches AIO max kvec size. */
>
> What does the comment mean?

This is left over from the old 2.4 AIO code. The AIO layer used to
break larger AIOs into chunks, and 128K was the default size. It's used
to set the size of the FMRs, so I would expect IOs larger than this to
fail because the registration will fail. Since FMRs need to be a fixed
size, if we wish to continue using them, we'll need to break AIOs over
a certain size into multiple iocbs and advts...

> I seem to have no trouble passing requests bigger than 128K to SDP:

Really? I just tried it and it definitely didn't work...

-Libor From ardavis at ichips.intel.com Tue Aug 2 18:16:20 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 02 Aug 2005 18:16:20 -0700 Subject: [openib-general] Re: fixes to the udapl ucm/uat patch you sent In-Reply-To: References: Message-ID: <42F01AE4.806@ichips.intel.com> James, > > Hi Arlin, > > Can I break this patch into 3 parts: the changes to dapl_evd_wait, the > changes to dapl_evd_resize, and the ib changes? I think it will be > easier to discuss each set of changes seperately (with so many > seperate issues, I'm afraid I've missed your reply to some of these > questions) Yes please break them out. I need to go back and take a closer look at my error cases for the wait and resize. In the future I will keep any common code patches seperate from the IB code. > > dapl_evd_wait: > > I looked over the original implementation of dapl_evd_wait() with an > eye towards the situation you described (the caller polling and > finding fewer events than requested, the caller going to turn on > notification, an event occuring, the caller turning on notification, > the caller blocking unaware of the last event). I don't believe that > this would happen in the original implementation. Here's why: after > the caller turns on notification, the code loops, via the continue > statement on line 213, back to the begining of the for loop on line > 173 and repolls. Do you agree? Hmmm. Yes it appears to be correct now that I take another look at it. Ok, ignore the patch for now and I will take another look at the senario that was missing the completions. On a side note, does this call need to adjust the time if coming out of a wait but still not reaching threshold and going back into a subsequent wait? > > dapl_evd_resize: > > I'm still unsure of why you removed the call to > dapls_evd_event_realloc() and moved the work that was being performed > in that routine up into dapl_evd_resize(). 
If we don't call > dapls_evd_event_realloc() anymore, the code should be removed. > sorry. I had some issues with resize and just grabbed a gen1 IBAL version that rolled up everything into one call. Replaced the reallocates with the allocs. The event_realloc and the rbuf_reallocs are no longer used in this version. Let me go back to my error condition and see if I can isolate to one of the realloc calls instead of a wholesale change to resize. Cancel this patch for now. > ib changes: > > These look ok to me. I've checked them into revision 2955. thanks, -arlin > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From ardavis at ichips.intel.com Tue Aug 2 18:27:02 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 02 Aug 2005 18:27:02 -0700 Subject: [openib-general] Re: [PATCH] uDAPL dapl_os_wait_object_wait() In-Reply-To: References: Message-ID: <42F01D66.3030206@ichips.intel.com> James Lentini wrote: > > > On Mon, 25 Jul 2005, Arlin Davis wrote: > >> James, >> >> Here is a patch to fix dapl_os_wait_object_wait() returning >> EINVAL when passing nsec == 1000000000 to pthread_cond_timedwait(). >> Hit a rare case where _microsecs was exactly 1000000. > > > What was the timeout_val being passed to dapl_os_wait_object_wait()? > Was it 1000000000 or 1000000? It was the calculated time of microsecs that resulted in exactly 1000000 (1 sec) not timeout_val.. 
> >> >> Thanks, >> >> -arlin >> >> Signed-off by: Arlin Davis >> >> Index: dapl/udapl/linux/dapl_osd.c >> =================================================================== >> --- dapl/udapl/linux/dapl_osd.c (revision 2899) >> +++ dapl/udapl/linux/dapl_osd.c (working copy) >> @@ -242,16 +242,9 @@ >> >> gettimeofday (&now, &tz); >> microsecs = now.tv_usec + (timeout_val % 1000000); >> - if (microsecs > 1000000) >> - { >> - now.tv_sec = now.tv_sec + timeout_val / 1000000 + 1; >> - now.tv_usec = microsecs - 1000000; >> - } >> - else >> - { >> - now.tv_sec = now.tv_sec + timeout_val / 1000000; >> - now.tv_usec = microsecs; >> - } >> + >> + now.tv_sec = now.tv_sec + timeout_val/1000000 + >> (microsecs/1000000); >> + now.tv_usec = microsecs % 1000000; >> >> /* Convert timeval to timespec */ >> future.tv_sec = now.tv_sec; > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From dotanb at mellanox.co.il Tue Aug 2 22:12:46 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 3 Aug 2005 08:12:46 +0300 Subject: [openib-general] RE: where can i find functions that "convert" enumerated values t o string? Message-ID: <506C3D7B14CDD411A52C00025558DED60865E26F@mtlex01.yok.mtl.com> During debug of problems in tests/real application this can help allot. I think that it better to write "QP ts type is RC" instead of "QP ts type is 0"... > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, July 26, 2005 1:17 AM > To: Dotan Barak > Cc: openib-general at openib.org > Subject: Re: where can i find functions that "convert" > enumerated values > t o string? > > > Dotan> If I'll send you functions that does the work, will you add > Dotan> them to the driver? 
> > If we can do this cleanly without bloating the source or compiled code > too much, I think it's OK. What do you see applications using these > strings for? > > - R. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dotanb at mellanox.co.il Tue Aug 2 23:02:21 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 3 Aug 2005 09:02:21 +0300 Subject: [openib-general] Re: why does the value of the node_guid don' t have the machine en dian ess? Message-ID: <506C3D7B14CDD411A52C00025558DED60865E27A@mtlex01.yok.mtl.com> I expect to get this values with the endianess of the host that i'm working on, and if i will print the node_guid as a number it will be the same as the sys_fs value. I don't see any reason for the driver to return this value in the endianess of the network, i think that it is better that the driver will return the value of this attribute in the host order, instead of every application that query for this attribute will change the order of it. > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, July 26, 2005 9:07 PM > To: Sean Hefty > Cc: Dotan Barak; openib-general at openib.org > Subject: Re: [openib-general] Re: why does the value of the node_guid > don't have the machine en dian ess? > > > Sean> To clarify, node_guid in ib_device_attr is in network order, > Sean> correct? Do you know offhand which other fields of that > Sean> structure are also in network order? Is it just the two > Sean> guid fields? > > I'm not positive but I believe that as you say, node_guid and > sys_image_guid are the only two fields returned in network order. > > - R. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Tue Aug 2 23:50:42 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 3 Aug 2005 09:50:42 +0300 Subject: [openib-general] Re: SDP_IOCB_SIZE_MAX In-Reply-To: <20050802160405.A30208@topspin.com> References: <20050801132616.GX14384@mellanox.co.il> <20050802160405.A30208@topspin.com> Message-ID: <20050803065042.GD15300@mellanox.co.il> Quoting r. Libor Michalek : > Subject: Re: SDP_IOCB_SIZE_MAX > > On Mon, Aug 01, 2005 at 04:26:16PM +0300, Michael S. Tsirkin wrote: > > Libor, in sdp_iocb.h we have: > > > > sdp_iocb.h:#define SDP_IOCB_SIZE_MAX (128*1024) /* matches AIO max kvec size. */ > > > > What does the comment mean? > > This is left over from the the old 2.4 AIO code. The AIO layer use to > break larger AIOs into chunks, and 128K was the default size. It's used > to set the size of the FMRs, so I would expect IOs larger then this to > fail because the registration will fail. Since FMRs need to be a fixed > size, if we wish to continue using them, we'll need to break AIOs over > a certain size into multiple iocbs and advts... Yes. > > I seem to have no trouble passing requests bigger than 128K to SDP: > > Really? I just tried it and it definetly didn't work... > > -Libor > I mean they get passed to the SDP layer. Of course they cant get sent. -- MST From dotanb at mellanox.co.il Tue Aug 2 23:54:53 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 3 Aug 2005 09:54:53 +0300 Subject: [openib-general] ibv_dealloc_pd after create + destroy of a AV fails (the resource is busy) Message-ID: <506C3D7B14CDD411A52C00025558DED60865E2A0@mtlex01.yok.mtl.com> I'm using svn rev: 2946 on Mellanox HCA 23108. In user level: the following scenario fails: allocate a PD create AV destroy AV deallocate PD The problem is that the PD deallocation fails (resource is busy). If i remove the AV creation there isn't any problem. I attached a small test that reproduces this issue. 

Dotan Barak
Software Verification Engineer
Mellanox Technologies LTD
mailto:dotanb at mellanox.co.il
Tel: +972-4-9097200 Ext: 231
Fax: +972-4-9593245
P.O. Box 86 Yokneam 20692 ISRAEL.
Home: +972-4-8289408
Cell: 052-4222383
[ May the fork be with you ]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ah_problem.c
Type: application/octet-stream
Size: 4152 bytes
Desc: not available
URL:

From mst at mellanox.co.il Wed Aug 3 00:09:24 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 3 Aug 2005 10:09:24 +0300
Subject: [openib-general] Re: sdp: cant unload ib_ipoib module
In-Reply-To: <1123007814.2946.12.camel@duffman>
References: <1123007814.2946.12.camel@duffman>
Message-ID: <20050803070923.GG15300@mellanox.co.il>

Quoting r. Tom Duffy :
> Perhaps you need my sdp_inet_port_put() patch?

Could be. I'll give it a spin next week.
Thanks!

-- 
MST

From ianjiang91 at hotmail.com Wed Aug 3 02:14:38 2005
From: ianjiang91 at hotmail.com (Ian Jiang)
Date: Wed, 03 Aug 2005 17:14:38 +0800
Subject: [openib-general] [iSER]How to get the dat_headers_1_1.tgz
Message-ID:

It is well known that kDAPL 1.1 is needed to build iSER. I failed
to get

http://groups.yahoo.com/group/dat-discussions/files/dat_headers_1_1.tgz

because only members of the group can access this file, and I received no
response after applying to join. So I am wondering whether there are
other ways to get dat_headers_1_1.tgz.

By the way, I downloaded dapl_beta1.10.tar.gz from
http://sourceforge.net/projects/dapl/
but I don't think this is the dat_headers_1_1 needed by iSER, because some
errors occurred when I tried it.

Any suggestion is appreciated!

Ian Jiang
ianjiang91 at hotmail.com
----
Computer Architecture Laboratory
Institute of Computing Technology
Chinese Academy of Sciences
Beijing,P.R.China
Zip code: 100080
Tel: +86-10-62564394(office)

From itamar at mellanox.co.il Wed Aug 3 02:48:14 2005
From: itamar at mellanox.co.il (Itamar Rabenstein)
Date: Wed, 3 Aug 2005 12:48:14 +0300
Subject: [openib-general] [iSER]How to get the dat_headers_1_1.tgz
Message-ID: <91DB792C7985D411BEC300B40080D29CC35FDD@mtvex01.mtv.mtl.com>

Hi, did you try dapl_beta2.06.tgz ?
This is the last version with the dapl 1.1 headers;
from dapl_gamma* on, it is the dapl 1.2 headers.
http://sourceforge.net/project/showfiles.php?group_id=59288&package_id=132032&release_id=273441
Try the download link there: Download dapl_beta2.06.tgz

Note that in the openib trunk there is a working version of kdapl, but it is
dapl 1.2 with some Linux-style changes.

Itamar

> -----Original Message-----
> From: Ian Jiang [mailto:ianjiang91 at hotmail.com]
> Sent: Wednesday, August 03, 2005 12:15 PM
> To: openib-general at openib.org
> Subject: [openib-general] [iSER]How to get the dat_headers_1_1.tgz
>
>
> It is well known that kDAPL 1.1 is needed to build iSER. I failed
> to get
>
> http://groups.yahoo.com/group/dat-discussions/files/dat_headers_1_1.tgz
>
> because only members of the group can access this file, and I received no
> response after applying to join. So I am wondering whether there are
> other ways to get dat_headers_1_1.tgz.
>
> By the way, I downloaded dapl_beta1.10.tar.gz from
> http://sourceforge.net/projects/dapl/
> but I don't think this is the dat_headers_1_1 needed by iSER, because some
> errors occurred when I tried it.
>
> Any suggestion is appreciated!
>
> Ian Jiang
> ianjiang91 at hotmail.com
> ----
> Computer Architecture Laboratory
> Institute of Computing Technology
> Chinese Academy of Sciences
> Beijing,P.R.China
> Zip code: 100080
> Tel: +86-10-62564394(office)

From guyg at voltaire.com Wed Aug 3 03:28:37 2005
From: guyg at voltaire.com (Guy German)
Date: Wed, 3 Aug 2005 13:28:37 +0300
Subject: [openib-general][kdapl]: vmalloc instead of kmalloc
Message-ID:

Hi,

James Lentini wrote:
> On Tue, 2 Aug 2005, Guy German wrote:
>
>> Hi Muli,
>>
>> Muli Ben-Yehuda wrote:
>>> On Tue, Aug 02, 2005 at 06:24:49PM +0300, Guy German wrote:
>>>
>>>> There are some places where kmalloc might not be enough :
>>>> in dapl_evd_event_alloc there is an allocation:
>>>>
>>>> event = kmalloc(evd->qlen * sizeof *event);
>>>>
>>>> whereas evd->qlen can be 128k (depends on max_cqe of the hca) and
>>>> kmalloc would fail.
>>>>
>>>> The same goes to dapl_rbuf_alloc.
>>>>
>>>> Is it legit to replace those kmallocs with vmallocs ?
>
> We should only add calls to vmalloc() as a last resort. As Muli points
> out, they are discouraged.
>
>>> Why do we need such a large allocation?
>
> kDAPL creates two large pools of memory.
>
> One is for events. When the kDAPL consumer creates an EVD, it
> specifies a queue size (the number of events the EVD can hold). The
> implementation pre-allocates a pool of events equal to the size of the
> queue. These events are used when an IB upcall is made (e.g.
> connection request, connection established, async. error, etc.) or the
> kDAPL consumer posts a "software event" via dat_evd_post_se().
And, of course - completions of data events - which is why the queue need to be more substantial. > The other memory pool is for cookies. A kDAPL event contains certain > fields that the IB work completion (ib_wc) does not provide (like the > EVD, EP, etc.). For that reason, the kDAPL provider sticks > the missing > information in a dapl_cookie structure and sets it as the work > request's context value. When the work completion comes back, the > kDAPL provider pulls the cookie out and uses it to populate the > missing event fields. These cookies are also pre-allocated in a pool > equal to the EVD size. > >>> To answer your question, vmalloc has a performance overhead and can >>> and will fail when vmalloc-space is exhausted (as can kmalloc, for >>> different reasons). Can this allocation be cut down so that it >>> becomes a non-issue? > > The size of the event pool seems much larger than necessary. I would > expect most consumers only use a few events from this pool (with no > errors or software events, a client will use 2 and a server will use > 3). If you consider that the consumers are polling from the queue themselves (upcall policy is disabled, for performance) and the queue of events holds completions of data, then you have to support larger queue. Bare in mind that one target can have many initiators. Any way, ISER seems to be needing a solution for this, and I think it is possible to come up with a different solution than vmalloc (maybe a few kmallocs) I will think about it and send a patch when I have one. > We may be able to eliminate the cookie pool entirely. There > are only a > few values we need from the cookie. I'll look into that. That sounds good. BTW, I've calculated the sizeof evd rather then dat_event, which is actually 48. That still leaves ~2730 possible pending events, which is basically the same principle. Guy. >> evd_min_qlen defines the size of the event queue that the Consumer >> requested. 
sizeof *event = 184 - that leaves ~712 pending events, >> which is not much. ISER target is trying to support about 5000 (by >> their > calculations), but other consumers >> might want to support even more and there is no reason for > dapl to limit what the ib can provide. >> Note that iser dequeues the events itself (only the first > event is accepted from a callback), hence the >> need for a normal size queue. From eitan at mellanox.co.il Wed Aug 3 03:55:05 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 3 Aug 2005 13:55:05 +0300 Subject: [openib-general] OpenSM Work Message-ID: <506C3D7B14CDD411A52C00025558DED607C305BA@mtlex01.yok.mtl.com> Hi Hal, As Mellanox moves to work on the OpenIB Gen2 stack, we have assigned Yael to work on merging OpenSM 1.8.0 (which was released based on gen1) into the gen2 stack. She has started to work on the merge to ensure that fixes done by you and Shahar on the gen2 trunk will not be lost. The mode of work we suggest is that she will work offline. When the merge is completed, a side branch will be opened under: https://openib.org/svn/gen2/branches/osm_1_8_0 and will be made available for review and testing before merge into the main trunk. Once all this is done, Yael will work on multiple new features including faster route time, PKey manager, MKey manager, and QoS. She will do so on branches off the main trunk, one for each feature. In parallel, Liran, who owns the OpenSM verification, will enhance osmtest and other testing utilities to achieve better test coverage of SM handover, SL2VL, VLArb and PKey. Any new feature will get covered by new tests. I myself will work on making sure the IB management simulator is well integrated with the stack and that the available simulator based tests as well as new tests can be run daily. We do have some issues with respect to the current osm tree: 1. All header files were moved from their relative location under the opensm, complib, iba directories and placed under the include directory.
Although this seems reasonable for a "install" tree - it is not very common for development trees. Normally I would expect the Makefile.am of each sub directory of the osm project to define which header files are to be installed into the $prefix/include dir. We will revert that hierarchy change in our merged branch. 2. osmtest was just introduced back into the osm tree. I think osmtest should be placed under a "test" tree where all the tests of the ULPs core etc will be located. I would expect a location like: https://openib.org/svn/gen2/trunk/test/ userspace/management/osm 3. osmtest needs cleanup from VAPI stuff - we should let Liran who is the owner of this code development a clear AR to clean it up. 4. For some reason I saw that you have added Voltaire copyright to the osmtest code. I do not think it makes sense as no work was done on this code by a Voltaire developer. Or I might be wrong? Needless to say the 1.8.0 version of OpenSM brings with it a long set of bug fixes and enhancements. Eitan Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL -------------- next part -------------- An HTML attachment was scrubbed... URL: From mlleinin at hpcn.ca.sandia.gov Wed Aug 3 05:58:15 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Wed, 03 Aug 2005 05:58:15 -0700 Subject: [openib-general] Re: [Infiniband-sock_direct] SDP Query In-Reply-To: References: <1123049575.28769.252.camel@localhost> Message-ID: <1123073895.4985.256.camel@localhost> OpenIB does have an SDP module that would likely meet your expectations. I've cc'd the openib-general mail list and Libor (the SDP maintainer). - Matt On Wed, 2005-08-03 at 08:16 -0400, Michael Speth wrote: > Matt, > Does OpenIB have a module like the Offload Protocol Module (OPS)? > > Thanks > > On 8/3/05, Matt Leininger wrote: > > Rajib, > > > > All active IB development has shifted to OpenIB (www.openib.org). 
> > Try submitting your question to the OpenIB developers mail list > > (openib-general at openib.org). > > > > Thanks, > > > > - Matt > > > > > > On Tue, 2005-08-02 at 17:01 +0800, Majumder, Rajib wrote: > > > Hello, > > > > > > I had a query and wanted to clarify it from this list. > > > > > > My firm is planning to migrate to IB. From ULP standpoint of view, our plan is to use SDP to take advantage of IB fabric and also without making any code changes. > > > > > > We have some processes that communicate on the LOCAL host using TCP SOCK_STREAM. If we use SDP for these processes, do you expect a performance gain? > > > > > > Does SDP behave the same way (offloaded stack, RDMA, kernel bypass, zcopy etc) while the processes run on the SAME physical host? > > > Do you have any latency/throughput data available for this test scenario? > > > > > > Any opinion would be highly appreciated. > > > > > > Thanks for your time! > > > > > > Rajib Majumder > > > Credit Suisse First Boston > > > > > > > > > ============================================================================== > > > Please access the attached hyperlink for an important electronic communications disclaimer: > > > > > > http://www.csfb.com/legal_terms/disclaimer_external_email.shtml > > > > > > ============================================================================== > > > > > > > > > > > > ------------------------------------------------------- > > > SF.Net email is sponsored by: Discover Easy Linux Migration Strategies > > > from IBM. Find simple to follow Roadmaps, straightforward articles, > > > informative Webcasts and more! Get everything you need to get up to > > > speed, fast. 
http://ads.osdn.com/?ad_id=7477&alloc_id=16492&op=click > > > _______________________________________________ > > > Infiniband-sock_direct mailing list > > > Infiniband-sock_direct at lists.sourceforge.net > > > https://lists.sourceforge.net/lists/listinfo/infiniband-sock_direct > > > From steve_wooding at keysounds.co.uk Wed Aug 3 06:38:01 2005 From: steve_wooding at keysounds.co.uk (Steve Wooding) Date: Wed, 3 Aug 2005 15:38:01 +0200 Subject: [openib-general] SDP RDMA support for blocking socket operations Message-ID: <30280207$112307492242f0c36a8a7799.77445222@config2.schlund.de> Libor, I realise you are very busy, but I wonder if you could give a rough indication for when "SDP RDMA support for blocking socket operations" might be implemented in SDP? My motivation for this is that I need zero copy (in order to eliminate the extra kernel buffer to application buffer copy), but I have a requirement not to use libaio as it is only implemented on Linux (my code needs to be portable). I've come across a POSIX AIO implementation, but I suppose this would have to be used on the kernel side with SDP rather than just in my application. I'm not sure how closely SDP and libaio are tied together. Anyway, some advice would be much appreciated. Cheers, Steve. From mst at mellanox.co.il Wed Aug 3 06:55:46 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 3 Aug 2005 16:55:46 +0300 Subject: [openib-general] SDP RDMA support for synchronous socket operations Message-ID: <20050803135546.GQ15300@mellanox.co.il>

Libor, all,

Tomorrow, I plan to start working on sdp zcopy support for synchronous
send/recv socket operations. Both kernel-level and user-level initiators
should continue to be supported. I don't plan to work on sendfile
support yet.

I hope to finish the implementation and some basic testing in the coming
two weeks. The development will be done on a branch, to be opened
tomorrow. My plan is to merge updates from trunk to stay in sync as much
as possible.

What follows is a raw design draft. Comments are welcome.

MST

------------------------------------------------------------------------

What needs to be done to support zcopy for synchronous send/recv socket
operations. Draft rev 2

Currently only AIO is supported for Zcopy. We reuse the ZCopy
infrastructure for send/recv socket operations.

Main differences between AIO and send/recv operations:
- send/recv have more flags: MSG_DONTWAIT, MSG_OOB, MSG_WAITALL, MSG_PEEK.
- after a send/recv call returns, it is illegal for the HCA to touch the
  application's buffer.
- send/recv must support data sizes too big to be transferred in one
  infiniband operation (SOCK_STREAM applications don't seem to expect to
  get EMSGSIZE).
- send/recv have the ability to block until an operation completes.
  This has to be implemented by SDP.
- with send/recv, operation is revoked with a signal, unlike aio which is
  canceled explicitly by the application.
- typically, there is only one outstanding send/recv operation on a
  specific socket
- send/recv must be supported for kernel and user-space consumers.
  current aio code seems to only support user-level consumers.

Design draft covers: Send side, Receive side, Send/Send deadlock prevention:

-----------------------
Send side:

Operation:
- Attempt zcopy if the message is bigger than send bcopy threshold
- If the operation is too big to fit in a single FMR, split it to
  multiple buffers (iocb), queue them for processing.
  Q: limit the number of FMRs used by a single socket?
  Block till the last iocb completes.
- If no FMRs are available
  Force bcopy transfer
- On signal, locate and cancel all queued iocbs.
  This may need to block, in which case we block in uninterruptible
  state (with a timeout)
  If iocbs can't be canceled within a predefined time, treat this as a
  transport error, trigger an abortive close

Options/Socket flags:
- MSG_OOB out of band data
  Force bcopy transfer
  Q: What to do if src avail are outstanding?
  A:
- MSG_DONTWAIT/O_NONBLOCK non-blocking operation
  Force bcopy transfer

-----------------------
Receive side:

Operation:
- Attempt zcopy if the message is bigger than rcv bcopy threshold
- If the operation is too big to fit in a single FMR, split it to
  multiple buffers (iocb), queue them for processing.
  Block till the last iocb completes.
- With MSG_WAITALL:
  Don't post sink available.
- Without MSG_WAITALL:
  Post exactly one sink available at a time.
  Registration can still be pipelined with RDMA.
  Note that recv without MSG_WAITALL may return a shorter message than
  what was sent. This is OK:
  "For stream-based sockets, such as SOCK_STREAM, message boundaries
  shall be ignored. In this case, data shall be returned to the user as
  soon as it becomes available, and no data shall be discarded."
- If no FMRs are available
  Force bcopy transfer
- On signal, locate and cancel all queued iocbs
  This may need to block, in which case we block in uninterruptible
  state (with a timeout)
  If iocbs can't be canceled within a predefined time, treat this as a
  transport error, trigger an abortive close

Options/Socket flags:
- MSG_OOB out of band data
  Handle as it arrives
  Q: Should we force bcopy transfer (SendSM)?
  A:
- MSG_DONTWAIT/O_NONBLOCK non-blocking operation
  Force bcopy transfer
- MSG_PEEK peek data in buffer
  Force bcopy transfer

-----------------------
Send/Send deadlock prevention:

quoting SDP spec:
Receive side detects deadlock if:
. A SrcAvail is received; and
. No ULP receive buffer is posted; and
. The local Data Source has a SrcAvail outstanding.
There are several ways to resolve this deadlock.

Resolve in the following way:
. The Data Sink could send a SendSm message to force the use of the
  Bcopy data transfer mechanism.

--
MST

From jlentini at netapp.com Wed Aug 3 07:34:26 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 3 Aug 2005 10:34:26 -0400 (EDT) Subject: [openib-general][kdapl]: vmalloc instead of kmalloc In-Reply-To: References: Message-ID: On Wed, 3 Aug 2005, Guy German wrote: > James Lentini wrote: >> >> kDAPL creates two large pools of memory. >> >> One is for events. When the kDAPL consumer creates an EVD, it >> specifies a queue size (the number of events the EVD can hold). The >> implementation pre-allocates a pool of events equal to the >> size of the >> queue. These events are used when an IB upcall is made (e.g. >> connection request, connection established, async. error, >> etc.) or the >> kDAPL consumer posts a "software event" via dat_evd_post_se(). > > And, of course - completions of data events - which is why the queue > needs to be more substantial. In uDAPL yes, but not in kDAPL. None of the callers of dapl_evd_get_event(), where events are dequeued from the EVD's free_event_queue, use the events for DTO completions.
In uDAPL, dat_evd_wait can dequeue data events and store them in the pending event queue. As Itamar pointed out, kDAPL could use a single circular list instead of maintaining the EVD's free and pending event queues. >> The other memory pool is for cookies. A kDAPL event contains certain >> fields that the IB work completion (ib_wc) does not provide (like the >> EVD, EP, etc.). For that reason, the kDAPL provider sticks >> the missing >> information in a dapl_cookie structure and sets it as the work >> request's context value. When the work completion comes back, the >> kDAPL provider pulls the cookie out and uses it to populate the >> missing event fields. These cookies are also pre-allocated in a pool >> equal to the EVD size. >> >>>> To answer your question, vmalloc has a performance overhead and can >>>> and will fail when vmalloc-space is exhausted (as can kmalloc, for >>>> different reasons). Can this allocation be cut down so that it >>>> becomes a non-issue? >> >> The size of the event pool seems much larger than necessary. I would >> expect most consumers only use a few events from this pool (with no >> errors or software events, a client will use 2 and a server will use >> 3). > > If you consider that the consumers are polling from the queue > themselves (upcall policy is disabled, for performance) and the > queue of events holds completions of data, then you have to support > a larger queue. Bear in mind that one target can have many initiators. Even if the event queue never stores DTO events? In the worst case, I agree that kDAPL would need to allocate an amount of memory equal to n * sizeof(DAT_EVENT), where n is the EVD queue size. My observation is that an EVD almost always uses the event pool < 5 times (when there are no async errors and no software events). Furthermore, it usually only uses one event at a time (it posts a connection request event, the consumer reaps it, it posts a connection event, the consumer reaps it, ...).
Given that, allocating an event pool equal to the queue length seems like overkill to me. The EVD could allocate smaller blocks of events in some configurable size. Most of the time a single pool (say of 25 events) would be sufficient. In the rare case when this pool was exhausted, a second one could be allocated. If that one was used up, a third could be allocated... > Anyway, iSER seems to need a solution for this, and I think it is > possible to come up with a different solution than vmalloc (maybe a few > kmallocs). I will think about it and send a patch when I have one. Ok. That would be great. From jlentini at netapp.com Wed Aug 3 07:48:54 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 3 Aug 2005 10:48:54 -0400 (EDT) Subject: [openib-general] Re: fixes to the udapl ucm/uat patch you sent In-Reply-To: <42F01AE4.806@ichips.intel.com> References: <42F01AE4.806@ichips.intel.com> Message-ID: On Tue, 2 Aug 2005, Arlin Davis wrote: >> dapl_evd_wait: >> >> I looked over the original implementation of dapl_evd_wait() with an eye >> towards the situation you described (the caller polling and finding fewer >> events than requested, the caller going to turn on notification, an event >> occurring, the caller turning on notification, the caller blocking unaware >> of the last event). I don't believe that this would happen in the original >> implementation. Here's why: after the caller turns on notification, the >> code loops, via the continue statement on line 213, back to the beginning of >> the for loop on line 173 and repolls. Do you agree? > > Hmmm. Yes it appears to be correct now that I take another look at it. Ok, > ignore the patch for now and I will take another look > at the scenario that was missing the completions. On a side note, does this > call need to adjust the time if coming out of a wait but still not reaching > threshold and going back into a subsequent wait? Yes. That is broken. Good catch.
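
[Editor's sketch] The timeout problem acknowledged above (going back into a wait without subtracting the time already spent) can be illustrated roughly as follows. This is not the uDAPL code: the helper name remaining_timeout_us, the microsecond units and the WAIT_INFINITE sentinel are all assumptions made for this example only.

```c
#include <stdint.h>

/* Hypothetical "wait forever" sentinel; not taken from the DAT headers. */
#define WAIT_INFINITE UINT64_MAX

/*
 * Compute the timeout for the next wait, given the timeout the caller
 * originally requested and the total time already spent waiting.
 * Both arguments and the result are in microseconds.
 */
uint64_t remaining_timeout_us(uint64_t requested_us, uint64_t elapsed_us)
{
	if (requested_us == WAIT_INFINITE)
		return WAIT_INFINITE;	/* an infinite wait never shrinks */
	if (elapsed_us >= requested_us)
		return 0;		/* budget used up: caller should time out */
	return requested_us - elapsed_us;
}
```

A wait loop that wakes up below its threshold would, under these assumptions, call the helper with the time elapsed since the first wait began and re-enter the wait only if the result is non-zero.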
From jlentini at netapp.com Wed Aug 3 08:09:32 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 3 Aug 2005 11:09:32 -0400 (EDT) Subject: [openib-general] [iSER]How to get the dat_headers_1_1.tgz In-Reply-To: <91DB792C7985D411BEC300B40080D29CC35FDD@mtvex01.mtv.mtl.com> References: <91DB792C7985D411BEC300B40080D29CC35FDD@mtvex01.mtv.mtl.com> Message-ID: You should have been able to join the dat-discussions Yahoo group. If you'd like to join, send Arkady Kanevsky [arkady at netapp.com] an email. I agree with Itamar, the dapl_beta_2_06 is your best bet. On Wed, 3 Aug 2005, Itamar Rabenstein wrote: > Hi, > > did you try dapl_beta2.06.tgz ? > this is the last version with dapl1.1 headers > from dapl_gamma* it is dapl1.2 headers > > http://sourceforge.net/project/showfiles.php?group_id=59288&package_id=13203 > 2&release_id=273441 > try download : Download dapl_beta2.06.tgz > > Note that in openib trunk there is a working version kdapl but is dapl1.2 > with some linux style changes > > Itamar > >> -----Original Message----- >> From: Ian Jiang [mailto:ianjiang91 at hotmail.com] >> Sent: Wednesday, August 03, 2005 12:15 PM >> To: openib-general at openib.org >> Subject: [openib-general] [iSER]How to get the dat_headers_1_1.tgz >> >> >> It's known to all that the kDAPL 1.1 is needed to build the >> iSER. I failed >> to get >> >> http://groups.yahoo.com/group/dat-discussions/files/dat_header >> s_1_1.tgz >> because only the members of the group could access this file, and no >> response was received after my applying to join. So I am >> wondering is there >> other ways to get the dat_header_1_1.tgz. >> >> By the way, I downloaded the dapl_beta1.10.tar.gz from >> http://sourceforge.net/projects/dapl/ >> but I don't think this is the dat_headers_1_1 needed by iSER, >> because some >> errors occured when I had a try. >> >> Any suggestion is appreciated! 
>> >> >> >> Ian Jiang >> ianjiang91 at hotmail.com >> ---- >> Computer Architecture Laboratory >> Institute of Computing Technology >> Chinese Academy of Sciences >> Beijing,P.R.China >> Zip code: 100080 >> Tel: +86-10-62564394(office) From dotanb at mellanox.co.il Wed Aug 3 08:25:27 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 3 Aug 2005 18:25:27 +0300 Subject: [openib-general] create several RC QPs with the same init attributes structure cau ses the init attribute structure to be changed Message-ID: <506C3D7B14CDD411A52C00025558DED60865E3A0@mtlex01.yok.mtl.com>

I work with gen2 with svn rev 2946 on Mellanox HCA 23108.

When I try to create several RC QPs with the same init attributes
structure, I can see that this structure is being changed by the verb.

here is the test code:

{
	struct ibv_qp_init_attr attr = {
		.send_cq = cq,
		.recv_cq = cq,
		.cap = {
			.max_send_wr = 1,
			.max_recv_wr = 1,
			.max_send_sge = 1,
			.max_recv_sge = 1,
			.max_inline_data = 0,
		},
		.qp_type = 2,	/* 2 == IBV_QPT_RC */
	};

	for (i = 0; i < num_qp; ++i) {
		printf("s_wr %u, r_wr %u, s_sge %u, r_sge %u, max_inline %u\n",
		       attr.cap.max_send_wr, attr.cap.max_recv_wr,
		       attr.cap.max_send_sge, attr.cap.max_recv_sge,
		       attr.cap.max_inline_data);
		qp[i] = ibv_create_qp(pd, &attr);
		CHECK_STRUCT("ibv_create_qp", qp[i], printf("%d \n",i); getchar(); return -1);
	}
}

here is the test output:

s_wr 1, r_wr 1, s_sge 1, r_sge 1, max_inline 0
s_wr 1, r_wr 1, s_sge 1, r_sge 1, max_inline 28
s_wr 1, r_wr 1, s_sge 2, r_sge 1, max_inline 92
s_wr 1, r_wr 1, s_sge 6, r_sge 1, max_inline 220
s_wr 1, r_wr 1, s_sge 14, r_sge 1, max_inline 476
line 91, Error in ibv_create_qp, NULL pointer returned
4

When I try to execute the same code with UD/UC QPs everything is just
fine and the init attributes structure is not being changed by the verb.

Dotan Barak Software Verification Engineer Mellanox Technologies LTD mailto:dotanb at mellanox.co.il Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL.
Home: +972-4-8289408 Cell: 052-4222383 [ May the fork be with you ] -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Wed Aug 3 09:15:06 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 03 Aug 2005 09:15:06 -0700 Subject: [openib-general] Re: create several RC QPs with the same init attributes structure cau ses the init attribute structure to be changed In-Reply-To: <506C3D7B14CDD411A52C00025558DED60865E3A0@mtlex01.yok.mtl.com> (Dotan Barak's message of "Wed, 3 Aug 2005 18:25:27 +0300") References: <506C3D7B14CDD411A52C00025558DED60865E3A0@mtlex01.yok.mtl.com> Message-ID: <52iryndmk5.fsf@cisco.com> Dotan> I work with gen2 with svn rev 2946 on Mellanox HCA 23108. Dotan> When i try to create several RC QPs with the same init Dotan> attributes structure, Dotan> i can see that this structure is being changed by the verb. This is expected and correct according to our API: the actual values allocated for the QP are returned in the pointer passed in by the consumer. - R. From rolandd at cisco.com Wed Aug 3 09:25:23 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 03 Aug 2005 09:25:23 -0700 Subject: [openib-general] Re: ibv_dealloc_pd after create + destroy of a AV fails (the resource is busy) In-Reply-To: <506C3D7B14CDD411A52C00025558DED60865E2A0@mtlex01.yok.mtl.com> (Dotan Barak's message of "Wed, 3 Aug 2005 09:54:53 +0300") References: <506C3D7B14CDD411A52C00025558DED60865E2A0@mtlex01.yok.mtl.com> Message-ID: <52ek9bdm30.fsf@cisco.com> Dotan> I'm using svn rev: 2946 on Mellanox HCA 23108. In user Dotan> level: the following scenario fails: allocate a PD create Dotan> AV destroy AV deallocate PD Thanks. There was a bug in the reference counting for pages used to hold address vectors. It's fixed with the change below (already checked in to svn). - R. 
--- libmthca/src/ah.c (revision 2963) +++ libmthca/src/ah.c (working copy) @@ -71,6 +71,8 @@ static struct mthca_ah_page *__add_page( return NULL; } + page->mr->context = pd->ibv_pd.context; + page->use_cnt = 0; for (i = 0; i < per_page; ++i) page->free[i] = ~0; @@ -105,17 +107,18 @@ int mthca_alloc_av(struct mthca_pd *pd, if (page->use_cnt < ps / sizeof *ah->av) for (i = 0; i < pp; ++i) if (page->free[i]) - break; - - if (!page) - page = __add_page(pd, ps, pp); + goto found; + page = __add_page(pd, ps, pp); if (!page) { free(ah); pthread_mutex_unlock(&pd->ah_mutex); return -1; } + found: + ++page->use_cnt; + for (i = 0, j = -1; i < pp; ++i) if (page->free[i]) { j = ffs(page->free[i]); @@ -171,6 +174,7 @@ void mthca_free_av(struct mthca_ah *ah) page = ah->page; i = ((void *) ah->av - page->buf) / sizeof *ah->av; page->free[i / (8 * sizeof (int))] |= 1 << (i % (8 * sizeof (int))); + if (!--page->use_cnt) { if (page->prev) page->prev->next = page->next; From rolandd at cisco.com Wed Aug 3 09:28:04 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 03 Aug 2005 09:28:04 -0700 Subject: [openib-general] [PATCH][RFC] uverbs SRQ implementation Message-ID: <52acjzdlyj.fsf@cisco.com> Here is a completely untested implementation of the kernel side of userspace SRQ support. (Hey, it compiles!) I should have the userspace libibverbs and libmthca support soon, and once I've tested this, I'll commit it. Feedback in the meantime appreciated, though... - R. 
--- infiniband/core/uverbs_cmd.c (revision 2963) +++ infiniband/core/uverbs_cmd.c (working copy) @@ -724,6 +724,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv struct ib_uobject *uobj; struct ib_pd *pd; struct ib_cq *scq, *rcq; + struct ib_srq *srq; struct ib_qp *qp; struct ib_qp_init_attr attr; int ret; @@ -748,9 +749,15 @@ ssize_t ib_uverbs_create_qp(struct ib_uv scq = idr_find(&ib_uverbs_cq_idr, cmd.send_cq_handle); rcq = idr_find(&ib_uverbs_cq_idr, cmd.recv_cq_handle); + if (cmd.is_srq) + srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); + else + srq = NULL; + if (!pd || pd->uobject->context != file->ucontext || !scq || scq->uobject->context != file->ucontext || - !rcq || rcq->uobject->context != file->ucontext) { + !rcq || rcq->uobject->context != file->ucontext || + (cmd.is_srq && (!srq || srq->uobject->context != file->ucontext))) { ret = -EINVAL; goto err_up; } @@ -759,7 +766,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv attr.qp_context = file; attr.send_cq = scq; attr.recv_cq = rcq; - attr.srq = NULL; + attr.srq = srq; attr.sq_sig_type = cmd.sq_sig_all ? IB_SIGNAL_ALL_WR : IB_SIGNAL_REQ_WR; attr.qp_type = cmd.qp_type; @@ -1004,3 +1011,175 @@ ssize_t ib_uverbs_detach_mcast(struct ib return ret ? 
ret : in_len; } + +ssize_t ib_uverbs_create_srq(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_create_srq cmd; + struct ib_uverbs_create_srq_resp resp; + struct ib_udata udata; + struct ib_uobject *uobj; + struct ib_pd *pd; + struct ib_srq *srq; + struct ib_srq_init_attr attr; + int ret; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + INIT_UDATA(&udata, buf + sizeof cmd, + (unsigned long) cmd.response + sizeof resp, + in_len - sizeof cmd, out_len - sizeof resp); + + uobj = kmalloc(sizeof *uobj, GFP_KERNEL); + if (!uobj) + return -ENOMEM; + + down(&ib_uverbs_idr_mutex); + + pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); + + if (!pd || pd->uobject->context != file->ucontext) { + ret = -EINVAL; + goto err_up; + } + + attr.event_handler = ib_uverbs_srq_event_handler; + attr.srq_context = file; + + uobj->user_handle = cmd.user_handle; + uobj->context = file->ucontext; + + srq = pd->device->create_srq(pd, &attr, &udata); + if (IS_ERR(srq)) { + ret = PTR_ERR(srq); + goto err_up; + } + + srq->device = pd->device; + srq->pd = pd; + srq->uobject = uobj; + srq->event_handler = attr.event_handler; + srq->srq_context = attr.srq_context; + atomic_inc(&pd->usecnt); + atomic_set(&srq->usecnt, 0); + + memset(&resp, 0, sizeof resp); + +retry: + if (!idr_pre_get(&ib_uverbs_srq_idr, GFP_KERNEL)) { + ret = -ENOMEM; + goto err_destroy; + } + + ret = idr_get_new(&ib_uverbs_srq_idr, srq, &uobj->id); + + if (ret == -EAGAIN) + goto retry; + if (ret) + goto err_destroy; + + resp.srq_handle = uobj->id; + + spin_lock_irq(&file->ucontext->lock); + list_add_tail(&uobj->list, &file->ucontext->srq_list); + spin_unlock_irq(&file->ucontext->lock); + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) { + ret = -EFAULT; + goto err_list; + } + + up(&ib_uverbs_idr_mutex); + + return in_len; + +err_list: + 
spin_lock_irq(&file->ucontext->lock); + list_del(&uobj->list); + spin_unlock_irq(&file->ucontext->lock); + +err_destroy: + ib_destroy_srq(srq); + +err_up: + up(&ib_uverbs_idr_mutex); + + kfree(uobj); + return ret; +} + +ssize_t ib_uverbs_modify_srq(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_modify_srq cmd; + struct ib_srq *srq; + struct ib_srq_attr attr; + int ret; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + down(&ib_uverbs_idr_mutex); + + srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); + if (!srq || srq->uobject->context != file->ucontext) { + ret = -EINVAL; + goto out; + } + + attr.max_wr = cmd.max_wr; + attr.max_sge = cmd.max_sge; + attr.srq_limit = cmd.srq_limit; + + ret = ib_modify_srq(srq, &attr, cmd.attr_mask); + +out: + up(&ib_uverbs_idr_mutex); + + return ret ? ret : in_len; +} + +ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_destroy_srq cmd; + struct ib_srq *srq; + struct ib_uobject *uobj; + int ret = -EINVAL; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + down(&ib_uverbs_idr_mutex); + + srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); + if (!srq || srq->uobject->context != file->ucontext) + goto out; + + uobj = srq->uobject; + + ret = ib_destroy_srq(srq); + if (ret) + goto out; + + idr_remove(&ib_uverbs_srq_idr, cmd.srq_handle); + + spin_lock_irq(&file->ucontext->lock); + list_del(&uobj->list); + spin_unlock_irq(&file->ucontext->lock); + + kfree(uobj); + +out: + up(&ib_uverbs_idr_mutex); + + return ret ? 
ret : in_len; +} --- infiniband/core/uverbs.h (revision 2963) +++ infiniband/core/uverbs.h (working copy) @@ -99,10 +99,12 @@ extern struct idr ib_uverbs_mw_idr; extern struct idr ib_uverbs_ah_idr; extern struct idr ib_uverbs_cq_idr; extern struct idr ib_uverbs_qp_idr; +extern struct idr ib_uverbs_srq_idr; void ib_uverbs_comp_handler(struct ib_cq *cq, void *cq_context); void ib_uverbs_cq_event_handler(struct ib_event *event, void *context_ptr); void ib_uverbs_qp_event_handler(struct ib_event *event, void *context_ptr); +void ib_uverbs_srq_event_handler(struct ib_event *event, void *context_ptr); int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, void *addr, size_t size, int write); @@ -131,5 +133,8 @@ IB_UVERBS_DECLARE_CMD(modify_qp); IB_UVERBS_DECLARE_CMD(destroy_qp); IB_UVERBS_DECLARE_CMD(attach_mcast); IB_UVERBS_DECLARE_CMD(detach_mcast); +IB_UVERBS_DECLARE_CMD(create_srq); +IB_UVERBS_DECLARE_CMD(modify_srq); +IB_UVERBS_DECLARE_CMD(destroy_srq); #endif /* UVERBS_H */ --- infiniband/core/uverbs_main.c (revision 2963) +++ infiniband/core/uverbs_main.c (working copy) @@ -69,6 +69,7 @@ DEFINE_IDR(ib_uverbs_mw_idr); DEFINE_IDR(ib_uverbs_ah_idr); DEFINE_IDR(ib_uverbs_cq_idr); DEFINE_IDR(ib_uverbs_qp_idr); +DEFINE_IDR(ib_uverbs_srq_idr); static spinlock_t map_lock; static DECLARE_BITMAP(dev_map, IB_UVERBS_MAX_DEVICES); @@ -93,6 +94,9 @@ static ssize_t (*uverbs_cmd_table[])(str [IB_USER_VERBS_CMD_DESTROY_QP] = ib_uverbs_destroy_qp, [IB_USER_VERBS_CMD_ATTACH_MCAST] = ib_uverbs_attach_mcast, [IB_USER_VERBS_CMD_DETACH_MCAST] = ib_uverbs_detach_mcast, + [IB_USER_VERBS_CMD_CREATE_SRQ] = ib_uverbs_create_srq, + [IB_USER_VERBS_CMD_MODIFY_SRQ] = ib_uverbs_modify_srq, + [IB_USER_VERBS_CMD_DESTROY_SRQ] = ib_uverbs_destroy_srq, }; static struct vfsmount *uverbs_event_mnt; @@ -127,7 +131,14 @@ static int ib_dealloc_ucontext(struct ib kfree(uobj); } - /* XXX Free SRQs */ + list_for_each_entry_safe(uobj, tmp, &context->srq_list, list) { + struct ib_srq *srq = 
idr_find(&ib_uverbs_srq_idr, uobj->id); + idr_remove(&ib_uverbs_srq_idr, uobj->id); + ib_destroy_srq(srq); + list_del(&uobj->list); + kfree(uobj); + } + /* XXX Free MWs */ list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { @@ -345,6 +356,13 @@ void ib_uverbs_qp_event_handler(struct i event->event); } +void ib_uverbs_srq_event_handler(struct ib_event *event, void *context_ptr) +{ + ib_uverbs_async_handler(context_ptr, + event->element.srq->uobject->user_handle, + event->event); +} + static void ib_uverbs_event_handler(struct ib_event_handler *handler, struct ib_event *event) { --- infiniband/include/ib_user_verbs.h (revision 2963) +++ infiniband/include/ib_user_verbs.h (working copy) @@ -78,7 +78,12 @@ enum { IB_USER_VERBS_CMD_POST_SEND, IB_USER_VERBS_CMD_POST_RECV, IB_USER_VERBS_CMD_ATTACH_MCAST, - IB_USER_VERBS_CMD_DETACH_MCAST + IB_USER_VERBS_CMD_DETACH_MCAST, + IB_USER_VERBS_CMD_CREATE_SRQ, + IB_USER_VERBS_CMD_MODIFY_SRQ, + IB_USER_VERBS_CMD_QUERY_SRQ, + IB_USER_VERBS_CMD_DESTROY_SRQ, + IB_USER_VERBS_CMD_POST_SRQ_RECV }; /* @@ -386,4 +391,32 @@ struct ib_uverbs_detach_mcast { __u64 driver_data[0]; }; +struct ib_uverbs_create_srq { + __u64 response; + __u64 user_handle; + __u32 pd_handle; + __u32 max_wr; + __u32 max_sge; + __u32 srq_limit; + __u64 driver_data[0]; +}; + +struct ib_uverbs_create_srq_resp { + __u32 srq_handle; +}; + +struct ib_uverbs_modify_srq { + __u32 srq_handle; + __u32 attr_mask; + __u32 max_wr; + __u32 max_sge; + __u32 srq_limit; + __u32 reserved; + __u64 driver_data[0]; +}; + +struct ib_uverbs_destroy_srq { + __u32 srq_handle; +}; + #endif /* IB_USER_VERBS_H */ From rolandd at cisco.com Wed Aug 3 09:34:40 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 03 Aug 2005 09:34:40 -0700 Subject: [openib-general] [PATCH][RFC] mthca SRQ implementation In-Reply-To: <52acjzdlyj.fsf@cisco.com> (Roland Dreier's message of "Wed, 03 Aug 2005 09:28:04 -0700") References: <52acjzdlyj.fsf@cisco.com> Message-ID: <5264undlnj.fsf@cisco.com> 
Here is a very lightly tested implementation of SRQ support for mthca.
I have only tried some simple tests on PCI-X HCAs -- so the mem-free
code paths are completely untested. In addition this code should have
everything needed in the kernel to do userspace SRQs, but this is also
completely untested pending libibverbs/libmthca SRQ support.

 - R.

--- infiniband/hw/mthca/mthca_user.h (revision 2963)
+++ infiniband/hw/mthca/mthca_user.h (working copy)
@@ -69,6 +69,12 @@ struct mthca_create_cq_resp {
 	__u32 reserved;
 };
 
+struct mthca_create_srq {
+	__u32 lkey;
+	__u32 db_index;
+	__u64 db_page;
+};
+
 struct mthca_create_qp {
 	__u32 lkey;
 	__u32 reserved;
--- infiniband/hw/mthca/mthca_dev.h (revision 2963)
+++ infiniband/hw/mthca/mthca_dev.h (working copy)
@@ -217,6 +217,13 @@ struct mthca_cq_table {
 	struct mthca_icm_table *table;
 };
 
+struct mthca_srq_table {
+	struct mthca_alloc alloc;
+	spinlock_t lock;
+	struct mthca_array srq;
+	struct mthca_icm_table *table;
+};
+
 struct mthca_qp_table {
 	struct mthca_alloc alloc;
 	u32 rdb_base;
@@ -298,6 +305,7 @@ struct mthca_dev {
 	struct mthca_mr_table mr_table;
 	struct mthca_eq_table eq_table;
 	struct mthca_cq_table cq_table;
+	struct mthca_srq_table srq_table;
 	struct mthca_qp_table qp_table;
 	struct mthca_av_table av_table;
 	struct mthca_mcg_table mcg_table;
@@ -360,12 +368,18 @@ int mthca_array_set(struct mthca_array *
 void mthca_array_clear(struct mthca_array *array, int index);
 int mthca_array_init(struct mthca_array *array, int nent);
 void mthca_array_cleanup(struct mthca_array *array, int nent);
+int mthca_buf_alloc(struct mthca_dev *dev, int size, int max_direct,
+		    union mthca_buf *buf, int *is_direct, struct mthca_pd *pd,
+		    int hca_write, struct mthca_mr *mr);
+void mthca_buf_free(struct mthca_dev *dev, int size, union mthca_buf *buf,
+		    int is_direct, struct mthca_mr *mr);
 
 int mthca_init_uar_table(struct mthca_dev *dev);
 int mthca_init_pd_table(struct mthca_dev *dev);
 int mthca_init_mr_table(struct mthca_dev *dev);
 int
mthca_init_eq_table(struct mthca_dev *dev); int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_srq_table(struct mthca_dev *dev); int mthca_init_qp_table(struct mthca_dev *dev); int mthca_init_av_table(struct mthca_dev *dev); int mthca_init_mcg_table(struct mthca_dev *dev); @@ -375,6 +389,7 @@ void mthca_cleanup_pd_table(struct mthca void mthca_cleanup_mr_table(struct mthca_dev *dev); void mthca_cleanup_eq_table(struct mthca_dev *dev); void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_srq_table(struct mthca_dev *dev); void mthca_cleanup_qp_table(struct mthca_dev *dev); void mthca_cleanup_av_table(struct mthca_dev *dev); void mthca_cleanup_mcg_table(struct mthca_dev *dev); @@ -425,7 +440,19 @@ int mthca_init_cq(struct mthca_dev *dev, void mthca_free_cq(struct mthca_dev *dev, struct mthca_cq *cq); void mthca_cq_event(struct mthca_dev *dev, u32 cqn); -void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn, + struct mthca_srq *srq); + +int mthca_alloc_srq(struct mthca_dev *dev, struct mthca_pd *pd, + struct ib_srq_attr *attr, struct mthca_srq *srq); +void mthca_free_srq(struct mthca_dev *dev, struct mthca_srq *srq); +void mthca_srq_event(struct mthca_dev *dev, u32 srqn, + enum ib_event_type event_type); +void mthca_free_srq_wqe(struct mthca_srq *srq, u32 wqe_addr); +int mthca_tavor_post_srq_recv(struct ib_srq *srq, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_arbel_post_srq_recv(struct ib_srq *srq, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); void mthca_qp_event(struct mthca_dev *dev, u32 qpn, enum ib_event_type event_type); --- infiniband/hw/mthca/mthca_main.c (revision 2963) +++ infiniband/hw/mthca/mthca_main.c (working copy) @@ -252,6 +252,8 @@ static int __devinit mthca_init_tavor(st profile = default_profile; profile.num_uar = dev_lim.uar_size / PAGE_SIZE; profile.uarc_size = 0; + if (mdev->mthca_flags & MTHCA_FLAG_SRQ) + 
profile.num_srq = dev_lim.max_srqs; err = mthca_make_profile(mdev, &profile, &dev_lim, &init_hca); if (err < 0) @@ -432,6 +434,20 @@ static int __devinit mthca_init_icm(stru goto err_unmap_rdb; } + if (mdev->mthca_flags & MTHCA_FLAG_SRQ) { + mdev->srq_table.table = + mthca_alloc_icm_table(mdev, init_hca->srqc_base, + dev_lim->srq_entry_sz, + mdev->limits.num_srqs, + mdev->limits.reserved_srqs, 0); + if (!mdev->srq_table.table) { + mthca_err(mdev, "Failed to map SRQ context memory, " + "aborting.\n"); + err = -ENOMEM; + goto err_unmap_cq; + } + } + /* * It's not strictly required, but for simplicity just map the * whole multicast group table now. The table isn't very big @@ -447,11 +463,15 @@ static int __devinit mthca_init_icm(stru if (!mdev->mcg_table.table) { mthca_err(mdev, "Failed to map MCG context memory, aborting.\n"); err = -ENOMEM; - goto err_unmap_cq; + goto err_unmap_srq; } return 0; +err_unmap_srq: + if (mdev->mthca_flags & MTHCA_FLAG_SRQ) + mthca_free_icm_table(mdev, mdev->srq_table.table); + err_unmap_cq: mthca_free_icm_table(mdev, mdev->cq_table.table); @@ -531,6 +551,8 @@ static int __devinit mthca_init_arbel(st profile = default_profile; profile.num_uar = dev_lim.uar_size / PAGE_SIZE; profile.num_udav = 0; + if (mdev->mthca_flags & MTHCA_FLAG_SRQ) + profile.num_srq = dev_lim.max_srqs; icm_size = mthca_make_profile(mdev, &profile, &dev_lim, &init_hca); if ((int) icm_size < 0) { @@ -730,11 +752,18 @@ static int __devinit mthca_setup_hca(str goto err_cmd_poll; } + err = mthca_init_srq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "shared receive queue table, aborting.\n"); + goto err_cq_table_free; + } + err = mthca_init_qp_table(dev); if (err) { mthca_err(dev, "Failed to initialize " "queue pair table, aborting.\n"); - goto err_cq_table_free; + goto err_srq_table_free; } err = mthca_init_av_table(dev); @@ -759,6 +788,9 @@ err_av_table_free: err_qp_table_free: mthca_cleanup_qp_table(dev); +err_srq_table_free: + 
mthca_cleanup_srq_table(dev); + err_cq_table_free: mthca_cleanup_cq_table(dev); @@ -1045,6 +1077,7 @@ err_cleanup: mthca_cleanup_mcg_table(mdev); mthca_cleanup_av_table(mdev); mthca_cleanup_qp_table(mdev); + mthca_cleanup_srq_table(mdev); mthca_cleanup_cq_table(mdev); mthca_cmd_use_polling(mdev); mthca_cleanup_eq_table(mdev); @@ -1094,6 +1127,7 @@ static void __devexit mthca_remove_one(s mthca_cleanup_mcg_table(mdev); mthca_cleanup_av_table(mdev); mthca_cleanup_qp_table(mdev); + mthca_cleanup_srq_table(mdev); mthca_cleanup_cq_table(mdev); mthca_cmd_use_polling(mdev); mthca_cleanup_eq_table(mdev); --- infiniband/hw/mthca/mthca_provider.c (revision 2963) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -421,6 +421,70 @@ static int mthca_ah_destroy(struct ib_ah return 0; } +static struct ib_srq *mthca_create_srq(struct ib_pd *pd, + struct ib_srq_init_attr *init_attr, + struct ib_udata *udata) +{ + struct mthca_create_srq ucmd; + struct mthca_ucontext *context = NULL; + struct mthca_srq *srq; + int err; + + srq = kmalloc(sizeof *srq, GFP_KERNEL); + if (!srq) + return ERR_PTR(-ENOMEM); + + if (pd->uobject) { + context = to_mucontext(pd->uobject->context); + + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) + return ERR_PTR(-EFAULT); + + err = mthca_map_user_db(to_mdev(pd->device), &context->uar, + context->db_tab, ucmd.db_index, + ucmd.db_page); + + if (err) { + kfree(srq); + return ERR_PTR(err); + } + + srq->mr.ibmr.lkey = ucmd.lkey; + srq->db_index = ucmd.db_index; + } + + err = mthca_alloc_srq(to_mdev(pd->device), to_mpd(pd), + &init_attr->attr, srq); + + if (err && pd->uobject) + mthca_unmap_user_db(to_mdev(pd->device), &context->uar, + context->db_tab, ucmd.db_index); + + if (err) { + kfree(srq); + return ERR_PTR(err); + } + + return &srq->ibsrq; +} + +static int mthca_destroy_srq(struct ib_srq *srq) +{ + struct mthca_ucontext *context; + + if (srq->uobject) { + context = to_mucontext(srq->uobject->context); + + 
mthca_unmap_user_db(to_mdev(srq->device), &context->uar, + context->db_tab, to_msrq(srq)->db_index); + } + + mthca_free_srq(to_mdev(srq->device), to_msrq(srq)); + kfree(srq); + + return 0; +} + static struct ib_qp *mthca_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *init_attr, struct ib_udata *udata) @@ -999,6 +1063,17 @@ int mthca_register_device(struct mthca_d dev->ib_dev.dealloc_pd = mthca_dealloc_pd; dev->ib_dev.create_ah = mthca_ah_create; dev->ib_dev.destroy_ah = mthca_ah_destroy; + + if (dev->mthca_flags & MTHCA_FLAG_SRQ) { + dev->ib_dev.create_srq = mthca_create_srq; + dev->ib_dev.destroy_srq = mthca_destroy_srq; + + if (mthca_is_memfree(dev)) + dev->ib_dev.post_srq_recv = mthca_arbel_post_srq_recv; + else + dev->ib_dev.post_srq_recv = mthca_tavor_post_srq_recv; + } + dev->ib_dev.create_qp = mthca_create_qp; dev->ib_dev.modify_qp = mthca_modify_qp; dev->ib_dev.destroy_qp = mthca_destroy_qp; --- infiniband/hw/mthca/mthca_provider.h (revision 2963) +++ infiniband/hw/mthca/mthca_provider.h (working copy) @@ -51,6 +51,11 @@ struct mthca_buf_list { DECLARE_PCI_UNMAP_ADDR(mapping) }; +union mthca_buf { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; +}; + struct mthca_uar { unsigned long pfn; int index; @@ -187,14 +192,34 @@ struct mthca_cq { __be32 *arm_db; int arm_sn; - union { - struct mthca_buf_list direct; - struct mthca_buf_list *page_list; - } queue; + union mthca_buf queue; struct mthca_mr mr; wait_queue_head_t wait; }; +struct mthca_srq { + struct ib_srq ibsrq; + spinlock_t lock; + atomic_t refcount; + int srqn; + int max; + int max_gs; + int wqe_shift; + int first_free; + int last_free; + int db_index; + u16 counter; + __be32 *db; + void *last; + + int is_direct; + u64 *wrid; + union mthca_buf queue; + struct mthca_mr mr; + + wait_queue_head_t wait; +}; + struct mthca_wq { spinlock_t lock; int max; @@ -228,10 +253,7 @@ struct mthca_qp { int send_wqe_offset; u64 *wrid; - union { - struct mthca_buf_list direct; - struct 
mthca_buf_list *page_list; - } queue; + union mthca_buf queue; wait_queue_head_t wait; }; @@ -278,6 +300,11 @@ static inline struct mthca_cq *to_mcq(st return container_of(ibcq, struct mthca_cq, ibcq); } +static inline struct mthca_srq *to_msrq(struct ib_srq *ibsrq) +{ + return container_of(ibsrq, struct mthca_srq, ibsrq); +} + static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) { return container_of(ibqp, struct mthca_qp, ibqp); --- infiniband/hw/mthca/mthca_profile.c (revision 2963) +++ infiniband/hw/mthca/mthca_profile.c (working copy) @@ -102,6 +102,7 @@ u64 mthca_make_profile(struct mthca_dev profile[MTHCA_RES_UARC].size = request->uarc_size; profile[MTHCA_RES_QP].num = request->num_qp; + profile[MTHCA_RES_SRQ].num = request->num_srq; profile[MTHCA_RES_EQP].num = request->num_qp; profile[MTHCA_RES_RDB].num = request->num_qp * request->rdb_per_qp; profile[MTHCA_RES_CQ].num = request->num_cq; --- infiniband/hw/mthca/mthca_wqe.h (revision 0) +++ infiniband/hw/mthca/mthca_wqe.h (revision 0) @@ -0,0 +1,114 @@ +/* + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Id$
+ */
+
+#ifndef MTHCA_WQE_H
+#define MTHCA_WQE_H
+
+#include
+
+enum {
+	MTHCA_NEXT_DBD = 1 << 7,
+	MTHCA_NEXT_FENCE = 1 << 6,
+	MTHCA_NEXT_CQ_UPDATE = 1 << 3,
+	MTHCA_NEXT_EVENT_GEN = 1 << 2,
+	MTHCA_NEXT_SOLICIT = 1 << 1,
+
+	MTHCA_MLX_VL15 = 1 << 17,
+	MTHCA_MLX_SLR = 1 << 16
+};
+
+enum {
+	MTHCA_INVAL_LKEY = 0x100
+};
+
+struct mthca_next_seg {
+	__be32 nda_op; /* [31:6] next WQE [4:0] next opcode */
+	__be32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */
+	__be32 flags; /* [3] CQ [2] Event [1] Solicit */
+	__be32 imm; /* immediate data */
+};
+
+struct mthca_tavor_ud_seg {
+	u32 reserved1;
+	__be32 lkey;
+	__be64 av_addr;
+	u32 reserved2[4];
+	__be32 dqpn;
+	__be32 qkey;
+	u32 reserved3[2];
+};
+
+struct mthca_arbel_ud_seg {
+	__be32 av[8];
+	__be32 dqpn;
+	__be32 qkey;
+	u32 reserved[2];
+};
+
+struct mthca_bind_seg {
+	__be32 flags; /* [31] Atomic [30] rem write [29] rem read */
+	u32 reserved;
+	__be32 new_rkey;
+	__be32 lkey;
+	__be64 addr;
+	__be64 length;
+};
+
+struct mthca_raddr_seg {
+	__be64 raddr;
+	__be32 rkey;
+	u32 reserved;
+};
+
+struct mthca_atomic_seg {
+	__be64 swap_add;
+	__be64 compare;
+};
+
+struct mthca_data_seg {
+	__be32 byte_count;
+	__be32 lkey;
+	__be64 addr;
+};
+
+struct mthca_mlx_seg {
+	__be32 nda_op;
+	__be32 nds;
+	__be32 flags; /* [17] VL15 [16] SLR [14:12] static rate
+			 [11:8] SL [3] C [2] E */
+	__be16 rlid;
+	__be16 vcrc;
+};
+
+#endif /* MTHCA_WQE_H */

Property changes on: infiniband/hw/mthca/mthca_wqe.h
___________________________________________________________________
Name: svn:keywords
   + Id

--- infiniband/hw/mthca/mthca_cmd.c (revision 2963)
+++ infiniband/hw/mthca/mthca_cmd.c (working copy)
@@ -109,6 +109,7 @@ enum {
 	CMD_SW2HW_SRQ = 0x35,
 	CMD_HW2SW_SRQ = 0x36,
 	CMD_QUERY_SRQ = 0x37,
+	CMD_ARM_SRQ = 0x40,
 
 	/* QP/EE commands */
 	CMD_RST2INIT_QPEE = 0x19,
@@ -1032,6 +1033,8 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev
 	mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n",
 		  dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz);
+	mthca_dbg(dev, "Max SRQs: %d, reserved SRQs: %d, entry size: %d\n",
+		  dev_lim->max_srqs, dev_lim->reserved_srqs, dev_lim->srq_entry_sz);
 	mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n",
 		  dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz);
 	mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n",
@@ -1503,6 +1506,27 @@ int mthca_HW2SW_CQ(struct mthca_dev *dev
 			     CMD_TIME_CLASS_A, status);
 }
 
+int mthca_SW2HW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox,
+		    int srq_num, u8 *status)
+{
+	return mthca_cmd(dev, mailbox->dma, srq_num, 0, CMD_SW2HW_SRQ,
+			 CMD_TIME_CLASS_A, status);
+}
+
+int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox,
+		    int srq_num, u8 *status)
+{
+	return mthca_cmd_box(dev, 0, mailbox->dma, srq_num, 0,
+			     CMD_HW2SW_SRQ,
+			     CMD_TIME_CLASS_A, status);
+}
+
+int mthca_ARM_SRQ(struct mthca_dev *dev, int srq_num, int limit, u8 *status)
+{
+	return mthca_cmd(dev, limit, srq_num, 0, CMD_ARM_SRQ,
+			 CMD_TIME_CLASS_B, status);
+}
+
 int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num,
 		    int is_ee, struct mthca_mailbox *mailbox, u32 optmask,
 		    u8 *status)
--- infiniband/hw/mthca/mthca_cq.c (revision 2963)
+++ infiniband/hw/mthca/mthca_cq.c (working copy)
@@ -224,7 +224,8 @@ void mthca_cq_event(struct mthca_dev *de
 	cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
 }
 
-void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn)
+void
mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn,
+		    struct mthca_srq *srq)
 {
 	struct mthca_cq *cq;
 	struct mthca_cqe *cqe;
@@ -265,8 +266,11 @@ void mthca_cq_clean(struct mthca_dev *de
 	 */
 	while (prod_index > cq->cons_index) {
 		cqe = get_cqe(cq, (prod_index - 1) & cq->ibcq.cqe);
-		if (cqe->my_qpn == cpu_to_be32(qpn))
+		if (cqe->my_qpn == cpu_to_be32(qpn)) {
+			if (srq)
+				mthca_free_srq_wqe(srq, be32_to_cpu(cqe->wqe));
 			++nfreed;
+		}
 		else if (nfreed)
 			memcpy(get_cqe(cq, (prod_index - 1 + nfreed) &
 				       cq->ibcq.cqe),
@@ -367,6 +371,13 @@ static int handle_error_cqe(struct mthca
 		break;
 	}
 
+	/*
+	 * Mem-free HCAs always generate one CQE per WQE, even in the
+	 * error case, so we don't have to check the doorbell count, etc.
+	 */
+	if (mthca_is_memfree(dev))
+		return 0;
+
 	err = mthca_free_err_wqe(dev, qp, is_send, wqe_index, &dbd, &new_wqe);
 	if (err)
 		return err;
@@ -375,12 +386,8 @@ static int handle_error_cqe(struct mthca
 	 * If we're at the end of the WQE chain, or we've used up our
 	 * doorbell count, free the CQE. Otherwise just update it for
 	 * the next poll operation.
-	 *
-	 * This does not apply to mem-free HCAs: they don't use the
-	 * doorbell count field, and so we should always free the CQE.
 	 */
-	if (mthca_is_memfree(dev) ||
-	    !(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd))
+	if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd))
 		return 0;
 
 	cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd);
@@ -452,23 +459,27 @@ static inline int mthca_poll_one(struct
 			     >> wq->wqe_shift);
 		entry->wr_id = (*cur_qp)->wrid[wqe_index +
 						(*cur_qp)->rq.max];
+	} else if ((*cur_qp)->ibqp.srq) {
+		struct mthca_srq *srq = to_msrq((*cur_qp)->ibqp.srq);
+		u32 wqe = be32_to_cpu(cqe->wqe);
+		wq = NULL;
+		wqe_index = wqe >> srq->wqe_shift;
+		entry->wr_id = srq->wrid[wqe_index];
+		mthca_free_srq_wqe(srq, wqe);
 	} else {
 		wq = &(*cur_qp)->rq;
 		wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift;
 		entry->wr_id = (*cur_qp)->wrid[wqe_index];
 	}
 
-	if (wq->last_comp < wqe_index)
-		wq->tail += wqe_index - wq->last_comp;
-	else
-		wq->tail += wqe_index + wq->max - wq->last_comp;
+	if (wq) {
+		if (wq->last_comp < wqe_index)
+			wq->tail += wqe_index - wq->last_comp;
+		else
+			wq->tail += wqe_index + wq->max - wq->last_comp;
 
-	wq->last_comp = wqe_index;
-
-	if (0)
-		mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n",
-			  is_send ?
"Send" : "Receive", - (*cur_qp)->qpn, wqe_index, wq->max); + wq->last_comp = wqe_index; + } if (is_error) { err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, @@ -639,113 +650,8 @@ int mthca_arbel_arm_cq(struct ib_cq *ibc static void mthca_free_cq_buf(struct mthca_dev *dev, struct mthca_cq *cq) { - int i; - int size; - - if (cq->is_direct) - dma_free_coherent(&dev->pdev->dev, - (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE, - cq->queue.direct.buf, - pci_unmap_addr(&cq->queue.direct, - mapping)); - else { - size = (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE; - for (i = 0; i < (size + PAGE_SIZE - 1) / PAGE_SIZE; ++i) - if (cq->queue.page_list[i].buf) - dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, - cq->queue.page_list[i].buf, - pci_unmap_addr(&cq->queue.page_list[i], - mapping)); - - kfree(cq->queue.page_list); - } -} - -static int mthca_alloc_cq_buf(struct mthca_dev *dev, int size, - struct mthca_cq *cq) -{ - int err = -ENOMEM; - int npages, shift; - u64 *dma_list = NULL; - dma_addr_t t; - int i; - - if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { - cq->is_direct = 1; - npages = 1; - shift = get_order(size) + PAGE_SHIFT; - - cq->queue.direct.buf = dma_alloc_coherent(&dev->pdev->dev, - size, &t, GFP_KERNEL); - if (!cq->queue.direct.buf) - return -ENOMEM; - - pci_unmap_addr_set(&cq->queue.direct, mapping, t); - - memset(cq->queue.direct.buf, 0, size); - - while (t & ((1 << shift) - 1)) { - --shift; - npages *= 2; - } - - dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); - if (!dma_list) - goto err_free; - - for (i = 0; i < npages; ++i) - dma_list[i] = t + i * (1 << shift); - } else { - cq->is_direct = 0; - npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; - shift = PAGE_SHIFT; - - dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); - if (!dma_list) - return -ENOMEM; - - cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, - GFP_KERNEL); - if (!cq->queue.page_list) - goto err_out; - - for (i = 0; i < npages; ++i) - cq->queue.page_list[i].buf = 
NULL; - - for (i = 0; i < npages; ++i) { - cq->queue.page_list[i].buf = - dma_alloc_coherent(&dev->pdev->dev, PAGE_SIZE, - &t, GFP_KERNEL); - if (!cq->queue.page_list[i].buf) - goto err_free; - - dma_list[i] = t; - pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); - - memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); - } - } - - err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, - dma_list, shift, npages, - 0, size, - MTHCA_MPT_FLAG_LOCAL_WRITE | - MTHCA_MPT_FLAG_LOCAL_READ, - &cq->mr); - if (err) - goto err_free; - - kfree(dma_list); - - return 0; - -err_free: - mthca_free_cq_buf(dev, cq); - -err_out: - kfree(dma_list); - - return err; + mthca_buf_free(dev, (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE, + &cq->queue, cq->is_direct, &cq->mr); } int mthca_init_cq(struct mthca_dev *dev, int nent, @@ -797,7 +703,9 @@ int mthca_init_cq(struct mthca_dev *dev, cq_context = mailbox->buf; if (cq->is_kernel) { - err = mthca_alloc_cq_buf(dev, size, cq); + err = mthca_buf_alloc(dev, size, MTHCA_MAX_DIRECT_CQ_SIZE, + &cq->queue, &cq->is_direct, + &dev->driver_pd, 1, &cq->mr); if (err) goto err_out_mailbox; @@ -858,10 +766,8 @@ int mthca_init_cq(struct mthca_dev *dev, return 0; err_out_free_mr: - if (cq->is_kernel) { - mthca_free_mr(dev, &cq->mr); + if (cq->is_kernel) mthca_free_cq_buf(dev, cq); - } err_out_mailbox: mthca_free_mailbox(dev, mailbox); @@ -929,7 +835,6 @@ void mthca_free_cq(struct mthca_dev *dev wait_event(cq->wait, !atomic_read(&cq->refcount)); if (cq->is_kernel) { - mthca_free_mr(dev, &cq->mr); mthca_free_cq_buf(dev, cq); if (mthca_is_memfree(dev)) { mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index); --- infiniband/hw/mthca/mthca_profile.h (revision 2963) +++ infiniband/hw/mthca/mthca_profile.h (working copy) @@ -42,6 +42,7 @@ struct mthca_profile { int num_qp; int rdb_per_qp; + int num_srq; int num_cq; int num_mcg; int num_mpt; --- infiniband/hw/mthca/mthca_srq.c (revision 0) +++ infiniband/hw/mthca/mthca_srq.c (revision 0) @@ -0,0 +1,521 @@ +/* + 
* Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_memfree.h" +#include "mthca_wqe.h" + +enum { + MTHCA_MAX_DIRECT_SRQ_SIZE = 4 * PAGE_SIZE +}; + +struct mthca_tavor_srq_context { + __be64 wqe_base_ds; /* low 6 bits is descriptor size */ + __be32 state_pd; + __be32 lkey; + __be32 uar; + __be32 wqe_cnt; + u32 reserved[2]; +}; + +struct mthca_arbel_srq_context { + __be32 state_logsize_srqn; + __be32 lkey; + __be32 db_index; + __be32 logstride_usrpage; + __be64 wqe_base; + __be32 eq_pd; + __be16 limit_watermark; + __be16 wqe_cnt; + u16 reserved1; + __be16 wqe_counter; + u32 reserved2[3]; +}; + +static void *get_wqe(struct mthca_srq *srq, int n) +{ + if (srq->is_direct) + return srq->queue.direct.buf + (n << srq->wqe_shift); + else + return srq->queue.page_list[(n << srq->wqe_shift) >> PAGE_SHIFT].buf + + ((n << srq->wqe_shift) & (PAGE_SIZE - 1)); +} + +static void mthca_tavor_init_srq_context(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_srq *srq, + struct mthca_tavor_srq_context *context) +{ + memset(context, 0, sizeof *context); + + context->wqe_base_ds = cpu_to_be64(1 << (srq->wqe_shift - 4)); + context->state_pd = cpu_to_be32(pd->pd_num); + context->lkey = cpu_to_be32(srq->mr.ibmr.lkey); + + if (pd->ibpd.uobject) + context->uar = + cpu_to_be32(to_mucontext(pd->ibpd.uobject->context)->uar.index); + else + context->uar = cpu_to_be32(dev->driver_uar.index); +} + +static void mthca_arbel_init_srq_context(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_srq *srq, + struct mthca_arbel_srq_context *context) +{ + memset(context, 0, sizeof *context); + + context->state_logsize_srqn = cpu_to_be32(long_log2(srq->max) << 24 | + srq->srqn); + context->lkey = cpu_to_be32(srq->mr.ibmr.lkey); + context->db_index = cpu_to_be32(srq->db_index); + context->logstride_usrpage = cpu_to_be32((srq->wqe_shift - 4) << 29); + if (pd->ibpd.uobject) + context->logstride_usrpage |= + 
cpu_to_be32(to_mucontext(pd->ibpd.uobject->context)->uar.index); + else + context->logstride_usrpage |= cpu_to_be32(dev->driver_uar.index); + context->eq_pd = cpu_to_be32(MTHCA_EQ_ASYNC << 24 | pd->pd_num); +} + +static void mthca_free_srq_buf(struct mthca_dev *dev, struct mthca_srq *srq) +{ + mthca_buf_free(dev, srq->max << srq->wqe_shift, &srq->queue, + srq->is_direct, &srq->mr); + kfree(srq->wrid); +} + +int mthca_alloc_srq(struct mthca_dev *dev, struct mthca_pd *pd, + struct ib_srq_attr *attr, struct mthca_srq *srq) +{ + struct mthca_mailbox *mailbox; + struct mthca_data_seg *scatter; + void *wqe; + u8 status; + int ds; + int err; + int i; + + /* Sanity check SRQ size before proceeding */ + if (attr->max_wr > 16 << 20 || attr->max_sge > 64) + return -EINVAL; + + srq->max = attr->max_wr; + srq->max_gs = attr->max_sge; + srq->last = NULL; + srq->counter = 0; + + if (mthca_is_memfree(dev)) + srq->max = roundup_pow_of_two(srq->max); + + ds = min(64UL, + roundup_pow_of_two(sizeof (struct mthca_next_seg) + + srq->max_gs * sizeof (struct mthca_data_seg))); + srq->wqe_shift = long_log2(ds); + + srq->srqn = mthca_alloc(&dev->srq_table.alloc); + if (srq->srqn == -1) + return -ENOMEM; + + if (mthca_is_memfree(dev)) { + err = mthca_table_get(dev, dev->srq_table.table, srq->srqn); + if (err) + goto err_out; + + if (!pd->ibpd.uobject) { + srq->db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_SRQ, + srq->srqn, &srq->db); + if (srq->db_index < 0) { + err = -ENOMEM; + goto err_out_icm; + } + } + } + + mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); + if (IS_ERR(mailbox)) { + err = PTR_ERR(mailbox); + goto err_out_db; + } + + srq->wrid = kmalloc(srq->max * sizeof (u64), GFP_KERNEL); + if (!srq->wrid) { + err = -ENOMEM; + goto err_out_mailbox; + } + + err = mthca_buf_alloc(dev, srq->max << srq->wqe_shift, + MTHCA_MAX_DIRECT_SRQ_SIZE, + &srq->queue, &srq->is_direct, pd, 1, &srq->mr); + if (err) + goto err_out_wrid; + + spin_lock_init(&srq->lock); + atomic_set(&srq->refcount, 1); + 
init_waitqueue_head(&srq->wait); + + if (mthca_is_memfree(dev)) + mthca_arbel_init_srq_context(dev, pd, srq, mailbox->buf); + else + mthca_tavor_init_srq_context(dev, pd, srq, mailbox->buf); + + err = mthca_SW2HW_SRQ(dev, mailbox, srq->srqn, &status); + + if (err) { + mthca_warn(dev, "SW2HW_SRQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_SRQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->srq_table.lock); + if (mthca_array_set(&dev->srq_table.srq, + srq->srqn & (dev->limits.num_srqs - 1), + srq)) { + spin_unlock_irq(&dev->srq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->srq_table.lock); + + mthca_free_mailbox(dev, mailbox); + + /* + * Now initialize the SRQ buffer so that all of the WQEs are + * linked into the list of free WQEs. In addition, set the + * scatter list L_Keys to the sentry value of 0x100. + */ + + for (i = 0; i < srq->max; ++i) { + wqe = get_wqe(srq, i); + + *(int *) wqe = i < srq->max - 1 ? 
i + 1 : -1; + + for (scatter = wqe + sizeof (struct mthca_next_seg); + (void *) scatter < wqe + (1 << srq->wqe_shift); + ++scatter) + scatter->lkey = cpu_to_be32(MTHCA_INVAL_LKEY); + } + + srq->first_free = 0; + srq->last_free = srq->max - 1; + + return 0; + +err_out_free_mr: + if (!pd->ibpd.uobject) + mthca_free_srq_buf(dev, srq); + +err_out_wrid: + kfree(srq->wrid); + +err_out_mailbox: + mthca_free_mailbox(dev, mailbox); + +err_out_db: + if (!pd->ibpd.uobject && mthca_is_memfree(dev)) + mthca_free_db(dev, MTHCA_DB_TYPE_SRQ, srq->db_index); + +err_out_icm: + mthca_table_put(dev, dev->srq_table.table, srq->srqn); + +err_out: + mthca_free(&dev->srq_table.alloc, srq->srqn); + + return err; +} + +void mthca_free_srq(struct mthca_dev *dev, struct mthca_srq *srq) +{ + struct mthca_mailbox *mailbox; + int err; + u8 status; + + mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); + if (IS_ERR(mailbox)) { + mthca_warn(dev, "No memory for mailbox to free SRQ.\n"); + return; + } + + err = mthca_HW2SW_SRQ(dev, mailbox, srq->srqn, &status); + if (err) + mthca_warn(dev, "HW2SW_SRQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_SRQ returned status 0x%02x\n", status); + + spin_lock_irq(&dev->srq_table.lock); + mthca_array_clear(&dev->srq_table.srq, + srq->srqn & (dev->limits.num_srqs - 1)); + spin_unlock_irq(&dev->srq_table.lock); + + atomic_dec(&srq->refcount); + wait_event(srq->wait, !atomic_read(&srq->refcount)); + + if (!srq->ibsrq.uobject) { + mthca_free_srq_buf(dev, srq); + if (mthca_is_memfree(dev)) + mthca_free_db(dev, MTHCA_DB_TYPE_SRQ, srq->db_index); + } + + mthca_table_put(dev, dev->srq_table.table, srq->srqn); + mthca_free(&dev->srq_table.alloc, srq->srqn); + mthca_free_mailbox(dev, mailbox); +} + +void mthca_srq_event(struct mthca_dev *dev, u32 srqn, + enum ib_event_type event_type) +{ + struct mthca_srq *srq; + struct ib_event event; + + spin_lock(&dev->srq_table.lock); + srq = mthca_array_get(&dev->srq_table.srq, srqn & (dev->limits.num_srqs - 1)); + 
if (srq) + atomic_inc(&srq->refcount); + spin_unlock(&dev->srq_table.lock); + + if (!srq) { + mthca_warn(dev, "Async event for bogus SRQ %08x\n", srqn); + return; + } + + if (!srq->ibsrq.event_handler) + goto out; + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.srq = &srq->ibsrq; + srq->ibsrq.event_handler(&event, srq->ibsrq.srq_context); + +out: + if (atomic_dec_and_test(&srq->refcount)) + wake_up(&srq->wait); +} + +/* + * This function must be called with IRQs disabled. + */ +void mthca_free_srq_wqe(struct mthca_srq *srq, u32 wqe_addr) +{ + int ind; + + ind = wqe_addr >> srq->wqe_shift; + + spin_lock(&srq->lock); + + if (likely(srq->first_free >= 0)) + *(int *) get_wqe(srq, srq->last_free) = ind; + else + srq->first_free = ind; + + *(int *) get_wqe(srq, ind) = -1; + srq->last_free = ind; + + spin_unlock(&srq->lock); +} + +static inline int mthca_queue_srq_recv(struct mthca_srq *srq, + struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr, + int *err) +{ + struct mthca_dev *dev = to_mdev(srq->ibsrq.device); + int ind; + int next_ind; + int nreq; + int i; + void *wqe; + void *prev_wqe; + + prev_wqe = srq->last; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + ind = srq->first_free; + wqe = get_wqe(srq, ind); + + if (!wqe) { + mthca_err(dev, "SRQ %06x full\n", srq->srqn); + *err = -ENOMEM; + *bad_wr = wr; + return nreq; + } + + next_ind = *(int *) wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + /* flags field will always remain 0 */ + + wqe += sizeof (struct mthca_next_seg); + + if (unlikely(wr->num_sge > srq->max_gs)) { + *err = -EINVAL; + *bad_wr = wr; + return nreq; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct 
mthca_data_seg); + } + + if (i < srq->max_gs) { + ((struct mthca_data_seg *) wqe)->byte_count = 0; + ((struct mthca_data_seg *) wqe)->lkey = cpu_to_be32(MTHCA_INVAL_LKEY); + ((struct mthca_data_seg *) wqe)->addr = 0; + } + + if (likely(prev_wqe)) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << srq->wqe_shift) | 1); + wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + } + + srq->wrid[ind] = wr->wr_id; + srq->last = wqe; + srq->first_free = next_ind; + } + + return nreq; +} + +int mthca_tavor_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibsrq->device); + struct mthca_srq *srq = to_msrq(ibsrq); + unsigned long flags; + int err = 0; + int nreq; + int first_ind; + + spin_lock_irqsave(&srq->lock, flags); + + first_ind = srq->first_free; + + nreq = mthca_queue_srq_recv(srq, wr, bad_wr, &err); + + if (likely(nreq)) { + __be32 doorbell[2]; + + doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift); + doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq); + + /* + * Make sure that descriptors are written before + * doorbell is rung. + */ + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + spin_unlock_irqrestore(&srq->lock, flags); + return err; +} + +int mthca_arbel_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_srq *srq = to_msrq(ibsrq); + unsigned long flags; + int err = 0; + int nreq; + + spin_lock_irqsave(&srq->lock, flags); + + nreq = mthca_queue_srq_recv(srq, wr, bad_wr, &err); + + if (likely(nreq)) { + srq->counter += nreq; + + /* + * Make sure that descriptors are written before + * we write doorbell record. 
+ */ + wmb(); + *srq->db = cpu_to_be32(srq->counter); + } + + spin_unlock_irqrestore(&srq->lock, flags); + return err; +} + +int __devinit mthca_init_srq_table(struct mthca_dev *dev) +{ + int err; + + if (!(dev->mthca_flags & MTHCA_FLAG_SRQ)) + return 0; + + spin_lock_init(&dev->srq_table.lock); + + err = mthca_alloc_init(&dev->srq_table.alloc, + dev->limits.num_srqs, + (1 << 24) - 1, + dev->limits.reserved_srqs); + if (err) + return err; + + err = mthca_array_init(&dev->srq_table.srq, + dev->limits.num_srqs); + if (err) + mthca_alloc_cleanup(&dev->srq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_srq_table(struct mthca_dev *dev) +{ + if (!(dev->mthca_flags & MTHCA_FLAG_SRQ)) + return; + + mthca_array_cleanup(&dev->srq_table.srq, dev->limits.num_srqs); + mthca_alloc_cleanup(&dev->srq_table.alloc); +} Property changes on: infiniband/hw/mthca/mthca_srq.c ___________________________________________________________________ Name: svn:keywords + Id --- infiniband/hw/mthca/mthca_cmd.h (revision 2963) +++ infiniband/hw/mthca/mthca_cmd.h (working copy) @@ -299,6 +299,11 @@ int mthca_SW2HW_CQ(struct mthca_dev *dev int cq_num, u8 *status); int mthca_HW2SW_CQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int cq_num, u8 *status); +int mthca_SW2HW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, + int srq_num, u8 *status); +int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, + int srq_num, u8 *status); +int mthca_ARM_SRQ(struct mthca_dev *dev, int srq_num, int limit, u8 *status); int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, int is_ee, struct mthca_mailbox *mailbox, u32 optmask, u8 *status); --- infiniband/hw/mthca/mthca_allocator.c (revision 2963) +++ infiniband/hw/mthca/mthca_allocator.c (working copy) @@ -177,3 +177,119 @@ void mthca_array_cleanup(struct mthca_ar kfree(array->page_list); } + +/* + * Handling for queue buffers -- we allocate a bunch of memory and + * register it in a memory region at 
HCA virtual address 0. If the + * requested size is > max_direct, we split the allocation into + * multiple pages, so we don't require too much contiguous memory. + */ + +int mthca_buf_alloc(struct mthca_dev *dev, int size, int max_direct, + union mthca_buf *buf, int *is_direct, struct mthca_pd *pd, + int hca_write, struct mthca_mr *mr) +{ + int err = -ENOMEM; + int npages, shift; + u64 *dma_list = NULL; + dma_addr_t t; + int i; + + if (size <= max_direct) { + *is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + buf->direct.buf = dma_alloc_coherent(&dev->pdev->dev, + size, &t, GFP_KERNEL); + if (!buf->direct.buf) + return -ENOMEM; + + pci_unmap_addr_set(&buf->direct, mapping, t); + + memset(buf->direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + *is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + return -ENOMEM; + + buf->page_list = kmalloc(npages * sizeof *buf->page_list, + GFP_KERNEL); + if (!buf->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + buf->page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + buf->page_list[i].buf = + dma_alloc_coherent(&dev->pdev->dev, PAGE_SIZE, + &t, GFP_KERNEL); + if (!buf->page_list[i].buf) + goto err_free; + + dma_list[i] = t; + pci_unmap_addr_set(&buf->page_list[i], mapping, t); + + memset(buf->page_list[i].buf, 0, PAGE_SIZE); + } + } + + err = mthca_mr_alloc_phys(dev, pd->pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_READ | + (hca_write ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0), + mr); + if (err) + goto err_free; + + kfree(dma_list); + + return 0; + +err_free: + mthca_buf_free(dev, size, buf, *is_direct, NULL); + +err_out: + kfree(dma_list); + + return err; +} + +void mthca_buf_free(struct mthca_dev *dev, int size, union mthca_buf *buf, + int is_direct, struct mthca_mr *mr) +{ + int i; + + if (mr) + mthca_free_mr(dev, mr); + + if (is_direct) + dma_free_coherent(&dev->pdev->dev, size, buf->direct.buf, + pci_unmap_addr(&buf->direct, mapping)); + else { + for (i = 0; i < (size + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, + buf->page_list[i].buf, + pci_unmap_addr(&buf->page_list[i], + mapping)); + kfree(buf->page_list); + } +} --- infiniband/hw/mthca/mthca_qp.c (revision 2963) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -44,6 +44,7 @@ #include "mthca_dev.h" #include "mthca_cmd.h" #include "mthca_memfree.h" +#include "mthca_wqe.h" enum { MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, @@ -175,80 +176,6 @@ enum { MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 }; -enum { - MTHCA_NEXT_DBD = 1 << 7, - MTHCA_NEXT_FENCE = 1 << 6, - MTHCA_NEXT_CQ_UPDATE = 1 << 3, - MTHCA_NEXT_EVENT_GEN = 1 << 2, - MTHCA_NEXT_SOLICIT = 1 << 1, - - MTHCA_MLX_VL15 = 1 << 17, - MTHCA_MLX_SLR = 1 << 16 -}; - -enum { - MTHCA_INVAL_LKEY = 0x100 -}; - -struct mthca_next_seg { - __be32 nda_op; /* [31:6] next WQE [4:0] next opcode */ - __be32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ - __be32 flags; /* [3] CQ [2] Event [1] Solicit */ - __be32 imm; /* immediate data */ -}; - -struct mthca_tavor_ud_seg { - u32 reserved1; - __be32 lkey; - __be64 av_addr; - u32 reserved2[4]; - __be32 dqpn; - __be32 qkey; - u32 reserved3[2]; -}; - -struct mthca_arbel_ud_seg { - __be32 av[8]; - __be32 dqpn; - __be32 qkey; - u32 reserved[2]; -}; - -struct mthca_bind_seg { - __be32 flags; /* [31] Atomic [30] rem write [29] rem read */ - u32 reserved; - __be32 new_rkey; - __be32 lkey; - __be64 addr; - __be64 length; 
-}; - -struct mthca_raddr_seg { - __be64 raddr; - __be32 rkey; - u32 reserved; -}; - -struct mthca_atomic_seg { - __be64 swap_add; - __be64 compare; -}; - -struct mthca_data_seg { - __be32 byte_count; - __be32 lkey; - __be64 addr; -}; - -struct mthca_mlx_seg { - __be32 nda_op; - __be32 nds; - __be32 flags; /* [17] VL15 [16] SLR [14:12] static rate - [11:8] SL [3] C [2] E */ - __be16 rlid; - __be16 vcrc; -}; - static const u8 mthca_opcode[] = { [IB_WR_SEND] = MTHCA_OPCODE_SEND, [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, @@ -858,6 +785,9 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (ibqp->srq) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RIC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); @@ -880,6 +810,10 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); } + if (ibqp->srq) + qp_context->srqn = cpu_to_be32(1 << 24 | + to_msrq(ibqp->srq)->srqn); + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, qp->qpn, 0, mailbox, 0, &status); if (status) { @@ -927,10 +861,6 @@ static int mthca_alloc_wqe_buf(struct mt struct mthca_qp *qp) { int size; - int i; - int npages, shift; - dma_addr_t t; - u64 *dma_list = NULL; int err = -ENOMEM; size = sizeof (struct mthca_next_seg) + @@ -980,116 +910,24 @@ static int mthca_alloc_wqe_buf(struct mt if (!qp->wrid) goto err_out; - if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { - qp->is_direct = 1; - npages = 1; - shift = get_order(size) + PAGE_SHIFT; - - if (0) - mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", - size, shift); - - qp->queue.direct.buf = dma_alloc_coherent(&dev->pdev->dev, size, - &t, GFP_KERNEL); - if (!qp->queue.direct.buf) - goto err_out; - - pci_unmap_addr_set(&qp->queue.direct, mapping, t); - - memset(qp->queue.direct.buf, 0, size); - - while (t 
& ((1 << shift) - 1)) { - --shift; - npages *= 2; - } - - dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); - if (!dma_list) - goto err_out_free; - - for (i = 0; i < npages; ++i) - dma_list[i] = t + i * (1 << shift); - } else { - qp->is_direct = 0; - npages = size / PAGE_SIZE; - shift = PAGE_SHIFT; - - if (0) - mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); - - dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); - if (!dma_list) - goto err_out; - - qp->queue.page_list = kmalloc(npages * - sizeof *qp->queue.page_list, - GFP_KERNEL); - if (!qp->queue.page_list) - goto err_out; - - for (i = 0; i < npages; ++i) { - qp->queue.page_list[i].buf = - dma_alloc_coherent(&dev->pdev->dev, PAGE_SIZE, - &t, GFP_KERNEL); - if (!qp->queue.page_list[i].buf) - goto err_out_free; - - memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); - - pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); - dma_list[i] = t; - } - } - - err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, - npages, 0, size, - MTHCA_MPT_FLAG_LOCAL_READ, - &qp->mr); + err = mthca_buf_alloc(dev, size, MTHCA_MAX_DIRECT_QP_SIZE, + &qp->queue, &qp->is_direct, pd, 0, &qp->mr); if (err) - goto err_out_free; + goto err_out; - kfree(dma_list); return 0; - err_out_free: - if (qp->is_direct) { - dma_free_coherent(&dev->pdev->dev, size, qp->queue.direct.buf, - pci_unmap_addr(&qp->queue.direct, mapping)); - } else - for (i = 0; i < npages; ++i) { - if (qp->queue.page_list[i].buf) - dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, - qp->queue.page_list[i].buf, - pci_unmap_addr(&qp->queue.page_list[i], - mapping)); - - } - - err_out: +err_out: kfree(qp->wrid); - kfree(dma_list); return err; } static void mthca_free_wqe_buf(struct mthca_dev *dev, struct mthca_qp *qp) { - int i; - int size = PAGE_ALIGN(qp->send_wqe_offset + - (qp->sq.max << qp->sq.wqe_shift)); - - if (qp->is_direct) { - dma_free_coherent(&dev->pdev->dev, size, qp->queue.direct.buf, - pci_unmap_addr(&qp->queue.direct, mapping)); 
- } else { - for (i = 0; i < size / PAGE_SIZE; ++i) { - dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, - qp->queue.page_list[i].buf, - pci_unmap_addr(&qp->queue.page_list[i], - mapping)); - } - } - + mthca_buf_free(dev, PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)), + &qp->queue, qp->is_direct, &qp->mr); kfree(qp->wrid); } @@ -1430,11 +1268,12 @@ void mthca_free_qp(struct mthca_dev *dev * unref the mem-free tables and free the QPN in our table. */ if (!qp->ibqp.uobject) { - mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn, + qp->ibqp.srq ? to_msrq(qp->ibqp.srq) : NULL); if (qp->ibqp.send_cq != qp->ibqp.recv_cq) - mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn, + qp->ibqp.srq ? to_msrq(qp->ibqp.srq) : NULL); - mthca_free_mr(dev, &qp->mr); mthca_free_memfree(dev, qp); mthca_free_wqe_buf(dev, qp); } @@ -2179,15 +2018,21 @@ int mthca_free_err_wqe(struct mthca_dev { struct mthca_next_seg *next; + /* + * For SRQs, all WQEs generate a CQE, so we're always at the + * end of the doorbell chain. 
+ */ + if (qp->ibqp.srq) { + *new_wqe = 0; + return 0; + } + if (is_send) next = get_send_wqe(qp, index); else next = get_recv_wqe(qp, index); - if (mthca_is_memfree(dev)) - *dbd = 1; - else - *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); if (next->ee_nds & cpu_to_be32(0x3f)) *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | (next->ee_nds & cpu_to_be32(0x3f)); --- infiniband/hw/mthca/Makefile (revision 2963) +++ infiniband/hw/mthca/Makefile (working copy) @@ -9,4 +9,4 @@ obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mth ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ mthca_allocator.o mthca_eq.o mthca_pd.o mthca_cq.o \ mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ - mthca_provider.o mthca_memfree.o mthca_uar.o + mthca_provider.o mthca_memfree.o mthca_uar.o mthca_srq.o From jlentini at netapp.com Wed Aug 3 09:54:17 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 3 Aug 2005 12:54:17 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL dapl_os_wait_object_wait() In-Reply-To: <42F01D66.3030206@ichips.intel.com> References: <42F01D66.3030206@ichips.intel.com> Message-ID: On Tue, 2 Aug 2005, Arlin Davis wrote: > James Lentini wrote: > >> >> >> On Mon, 25 Jul 2005, Arlin Davis wrote: >> >>> James, >>> >>> Here is a patch to fix dapl_os_wait_object_wait() returning >>> EINVAL when passing nsec == 1000000000 to pthread_cond_timedwait(). >>> Hit a rare case where _microsecs was exactly 1000000. >> >> >> What was the timeout_val being passed to dapl_os_wait_object_wait()? Was it >> 1000000000 or 1000000? > > It was the calculated time of microsecs that resulted in exactly 1000000 (1 > sec) not timeout_val.. > I understand now. I think the code can be simplified further. Can you look over the patch below (and attached) and let me know if you see any problems? 
Index: dapl/udapl/linux/dapl_osd.c =================================================================== --- dapl/udapl/linux/dapl_osd.c (revision 2949) +++ dapl/udapl/linux/dapl_osd.c (working copy) @@ -241,17 +241,10 @@ unsigned int microsecs; gettimeofday (&now, &tz); - microsecs = now.tv_usec + (timeout_val % 1000000); - if (microsecs > 1000000) - { - now.tv_sec = now.tv_sec + timeout_val / 1000000 + 1; - now.tv_usec = microsecs - 1000000; - } - else - { - now.tv_sec = now.tv_sec + timeout_val / 1000000; - now.tv_usec = microsecs; - } +#define USEC_PER_SEC 1000000 + microsecs = now.tv_usec + timeout_val; + now.tv_sec = now.tv_sec + microsecs/USEC_PER_SEC; + now.tv_usec = microsecs % USEC_PER_SEC; /* Convert timeval to timespec */ future.tv_sec = now.tv_sec; -------------- next part -------------- Index: dapl/udapl/linux/dapl_osd.c =================================================================== --- dapl/udapl/linux/dapl_osd.c (revision 2949) +++ dapl/udapl/linux/dapl_osd.c (working copy) @@ -241,17 +241,10 @@ unsigned int microsecs; gettimeofday (&now, &tz); - microsecs = now.tv_usec + (timeout_val % 1000000); - if (microsecs > 1000000) - { - now.tv_sec = now.tv_sec + timeout_val / 1000000 + 1; - now.tv_usec = microsecs - 1000000; - } - else - { - now.tv_sec = now.tv_sec + timeout_val / 1000000; - now.tv_usec = microsecs; - } +#define USEC_PER_SEC 1000000 + microsecs = now.tv_usec + timeout_val; + now.tv_sec = now.tv_sec + microsecs/USEC_PER_SEC; + now.tv_usec = microsecs % USEC_PER_SEC; /* Convert timeval to timespec */ future.tv_sec = now.tv_sec; From ftillier at silverstorm.com Wed Aug 3 10:38:24 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 3 Aug 2005 10:38:24 -0700 Subject: [openib-general] Re: why does the value of the node_guid don't have the machine endianess? 
In-Reply-To: <506C3D7B14CDD411A52C00025558DED60865E27A@mtlex01.yok.mtl.com> Message-ID: <001101c59852$278c6620$9c5aa8c0@infiniconsys.com> > From: Dotan Barak [mailto:dotanb at mellanox.co.il] > Sent: Tuesday, August 02, 2005 11:02 PM > > I expect to get these values with the endianness of the host that I'm working on, > and if I print the node_guid as a number it will be the same as the > sys_fs value. > > I don't see any reason for the driver to return this value in the endianness of > the network. I think that it is better for the driver to return the value > of this attribute in host order, instead of every application that queries > for this attribute having to change its byte order. It's a matter of consistency. The stack doesn't perform byte swapping on MAD payloads either, and SA requests (for NodeInfo, for example) will return node GUIDs. It's simpler to set the expectation for the client that all GUIDs are always treated in network order; that way the client doesn't have to distinguish between getting a GUID from a SA response, or getting a GUID from the device directly. It also removes the need to perform byte swapping to put the GUID in network order when issuing SA requests that need that information (if there are any). Personally, I would prefer to see the GUIDs always reported in network order in all places. We don't want to add byte swapping policy to the MAD layer. 
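As a rough illustration of the convention argued for above (this is a sketch, not code from the OpenIB stack): if GUIDs are always carried in network byte order, a client only converts at the point where it wants to print or compare the value as a number. The helper names guid_to_host and guid_format are hypothetical, and the sketch assumes the GUID is available as eight raw bytes.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: interpret 8 bytes stored in network (big-endian)
 * order as a host-order integer. Building the value byte by byte gives
 * the same result regardless of the host's endianness.
 */
static uint64_t guid_to_host(const uint8_t raw[8])
{
	uint64_t value = 0;
	size_t i;

	assert(raw != NULL);
	for (i = 0; i < 8; ++i)
		value = (value << 8) | raw[i];
	return value;
}

/*
 * Hypothetical helper: format a host-order GUID as 16 hex digits.
 * Returns the number of characters written, excluding the NUL.
 */
static int guid_format(uint64_t guid, char *buf, size_t len)
{
	return snprintf(buf, len, "%016" PRIx64, guid);
}
```

With this split, the value stays in network order everywhere it is stored or sent, and only display code calls the conversion.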
- Fab From ardavis at ichips.intel.com Wed Aug 3 10:50:57 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 03 Aug 2005 10:50:57 -0700 Subject: [openib-general] Re: [PATCH] uDAPL dapl_os_wait_object_wait() In-Reply-To: References: <42F01D66.3030206@ichips.intel.com> Message-ID: <42F10401.7010603@ichips.intel.com> James Lentini wrote: > > > On Tue, 2 Aug 2005, Arlin Davis wrote: > >> James Lentini wrote: >> >>> >>> >>> On Mon, 25 Jul 2005, Arlin Davis wrote: >>> >>>> James, >>>> >>>> Here is a patch to fix dapl_os_wait_object_wait() returning >>>> EINVAL when passing nsec == 1000000000 to pthread_cond_timedwait(). >>>> Hit a rare case where _microsecs was exactly 1000000. >>> >>> >>> >>> What was the timeout_val being passed to dapl_os_wait_object_wait()? >>> Was it 1000000000 or 1000000? >> >> >> It was the calculated time of microsecs that resulted in exactly >> 1000000 (1 sec) not timeout_val.. >> > > I understand now. I think the code can be simplified further. Can you > look over the patch below (and attached) and let me know if you see > any problems? No problems. Looks good. 
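For reference, the arithmetic in the simplified patch can be sketched as a standalone function. This is a minimal sketch rather than the uDAPL code itself: the name deadline_from_now is hypothetical, and the 64-bit sum is an assumption added here so that a large timeout cannot wrap (the patch keeps unsigned int). The old code compared with > 1000000, so a sum of exactly 1000000 microseconds was not carried into tv_sec and produced nsec == 1000000000; dividing and taking the remainder avoids that case.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

#define USEC_PER_SEC  1000000ULL
#define NSEC_PER_USEC 1000L

/*
 * Hypothetical helper: absolute deadline = now + timeout_usec,
 * normalized so that 0 <= tv_nsec < 1000000000, which is what
 * pthread_cond_timedwait() requires of its abstime argument.
 */
static struct timespec deadline_from_now(struct timeval now,
					 uint64_t timeout_usec)
{
	struct timespec future;
	uint64_t usecs;

	assert(now.tv_usec >= 0 && (uint64_t) now.tv_usec < USEC_PER_SEC);

	/* Sum in 64 bits (assumption: guards against wrap on big timeouts). */
	usecs = (uint64_t) now.tv_usec + timeout_usec;

	/* Whole seconds carry into tv_sec; the remainder is always < 1 s. */
	future.tv_sec  = now.tv_sec + (time_t) (usecs / USEC_PER_SEC);
	future.tv_nsec = (long) (usecs % USEC_PER_SEC) * NSEC_PER_USEC;

	return future;
}
```

The case from the original report is now.tv_usec == 999999 with one more microsecond of timeout: the sum is exactly one second, so it lands in tv_sec and tv_nsec is 0.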
> > Index: dapl/udapl/linux/dapl_osd.c > =================================================================== > --- dapl/udapl/linux/dapl_osd.c (revision 2949) > +++ dapl/udapl/linux/dapl_osd.c (working copy) > @@ -241,17 +241,10 @@ > unsigned int microsecs; > > gettimeofday (&now, &tz); > - microsecs = now.tv_usec + (timeout_val % 1000000); > - if (microsecs > 1000000) > - { > - now.tv_sec = now.tv_sec + timeout_val / 1000000 + 1; > - now.tv_usec = microsecs - 1000000; > - } > - else > - { > - now.tv_sec = now.tv_sec + timeout_val / 1000000; > - now.tv_usec = microsecs; > - } > +#define USEC_PER_SEC 1000000 > + microsecs = now.tv_usec + timeout_val; > + now.tv_sec = now.tv_sec + microsecs/USEC_PER_SEC; > + now.tv_usec = microsecs % USEC_PER_SEC; > > /* Convert timeval to timespec */ > future.tv_sec = now.tv_sec; > >------------------------------------------------------------------------ > >Index: dapl/udapl/linux/dapl_osd.c >=================================================================== >--- dapl/udapl/linux/dapl_osd.c (revision 2949) >+++ dapl/udapl/linux/dapl_osd.c (working copy) >@@ -241,17 +241,10 @@ > unsigned int microsecs; > > gettimeofday (&now, &tz); >- microsecs = now.tv_usec + (timeout_val % 1000000); >- if (microsecs > 1000000) >- { >- now.tv_sec = now.tv_sec + timeout_val / 1000000 + 1; >- now.tv_usec = microsecs - 1000000; >- } >- else >- { >- now.tv_sec = now.tv_sec + timeout_val / 1000000; >- now.tv_usec = microsecs; >- } >+#define USEC_PER_SEC 1000000 >+ microsecs = now.tv_usec + timeout_val; >+ now.tv_sec = now.tv_sec + microsecs/USEC_PER_SEC; >+ now.tv_usec = microsecs % USEC_PER_SEC; > > /* Convert timeval to timespec */ > future.tv_sec = now.tv_sec; > > From iod00d at hp.com Wed Aug 3 10:56:17 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 3 Aug 2005 10:56:17 -0700 Subject: [openib-general] [PATCH][RFC] uverbs SRQ implementation In-Reply-To: <52acjzdlyj.fsf@cisco.com> References: <52acjzdlyj.fsf@cisco.com> Message-ID: 
<20050803175617.GB16417@esmail.cup.hp.com> On Wed, Aug 03, 2005 at 09:28:04AM -0700, Roland Dreier wrote: > Feedback in the meantime appreciated, though... ... > + if (cmd.is_srq) > + srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); > + else > + srq = NULL; my preference is to write this as: srq = cmd.is_srq ? idr_find(&ib_uverbs_srq_idr, cmd.srq_handle) : NULL; > if (!pd || pd->uobject->context != file->ucontext || > !scq || scq->uobject->context != file->ucontext || > - !rcq || rcq->uobject->context != file->ucontext) { > + !rcq || rcq->uobject->context != file->ucontext || > + (cmd.is_srq && (!srq || srq->uobject->context != file->ucontext))) { I think it's redudant to test cmd.is_srq. srq is NULL if cmd.is_srq is not set. ie !srq should short circuit the rest of the test. if idr_find() fails, I would expect it to return NULL. ... > +ssize_t ib_uverbs_create_srq(struct ib_uverbs_file *file, > + const char __user *buf, int in_len, > + int out_len) > +{ ... > +retry: > + if (!idr_pre_get(&ib_uverbs_srq_idr, GFP_KERNEL)) { > + ret = -ENOMEM; > + goto err_destroy; > + } > + > + ret = idr_get_new(&ib_uverbs_srq_idr, srq, &uobj->id); > + > + if (ret == -EAGAIN) > + goto retry; Do I need to worry about infinite (or very long) retry loops here? If not, maybe add a one-liner comment explaining what limits the retry. I'm not clueful enough to know if the rest is correct or not. hth, grant From ftillier at silverstorm.com Wed Aug 3 11:06:21 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 3 Aug 2005 11:06:21 -0700 Subject: [openib-general] Re: create several RC QPs with the same init attributes structure cau ses the init attribute structure to be changed In-Reply-To: <52iryndmk5.fsf@cisco.com> Message-ID: <001201c59856$133cd390$9c5aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, August 03, 2005 9:15 AM > > Dotan> I work with gen2 with svn rev 2946 on Mellanox HCA 23108. 
> Dotan> When i try to create several RC QPs with the same init > Dotan> attributes structure, > > Dotan> i can see that this structure is being changed by the verb. > > This is expected and correct according to our API: the actual values > allocated for the QP are returned in the pointer passed in by the consumer. Why doesn't this happen with UD/UC QPs? Is inline data not supported on those? Also, why does the size keep growing? It seems that if you request 1 SGE, you get 28 bytes of max_inline. If you then request 28 bytes of max_inline, you get 2 SGEs back, and 96 bytes of max_inline. It seems to me something's off with the calculations, like using < rather than <= or something like that. If 1 SGE gives you 28 bytes, requesting 28 bytes should give you 1 SGE. I would have expected output like this: s_wr 1, r_wr 1, s_sge 1, r_sge 1, max_inline 0 s_wr 1, r_wr 1, s_sge 1, r_sge 1, max_inline 28 s_wr 1, r_wr 1, s_sge 1, r_sge 1, max_inline 28 s_wr 1, r_wr 1, s_sge 1, r_sge 1, max_inline 28 s_wr 1, r_wr 1, s_sge 1, r_sge 1, max_inline 28 s_wr 1, r_wr 1, s_sge 1, r_sge 1, max_inline 28 - Fab From ftillier at silverstorm.com Wed Aug 3 11:12:39 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 3 Aug 2005 11:12:39 -0700 Subject: [openib-general] [PATCH][RFC] uverbs SRQ implementation In-Reply-To: <20050803175617.GB16417@esmail.cup.hp.com> Message-ID: <001301c59856$f14de2f0$9c5aa8c0@infiniconsys.com> > From: Grant Grundler [mailto:iod00d at hp.com] > Sent: Wednesday, August 03, 2005 10:56 AM > > On Wed, Aug 03, 2005 at 09:28:04AM -0700, Roland Dreier wrote: > > Feedback in the meantime appreciated, though... > ... > > if (!pd || pd->uobject->context != file->ucontext || > > !scq || scq->uobject->context != file->ucontext || > > - !rcq || rcq->uobject->context != file->ucontext) { > > + !rcq || rcq->uobject->context != file->ucontext || > > + (cmd.is_srq && (!srq || srq->uobject->context != file->ucontext))) { > > I think it's redudant to test cmd.is_srq. 
> srq is NULL if cmd.is_srq is not set. > i.e., !srq should short circuit the rest of the test. > > if idr_find() fails, I would expect it to return NULL. If idr_find returns NULL when cmd.is_srq is non-zero, then the user passed an invalid parameter. Likewise, if the SRQ is not null, but its context doesn't match, that's also an invalid parameter. If cmd.is_srq is zero, then a NULL SRQ is perfectly fine, and there's no need to fail the call. That is, the check for (!srq || srq->uobject->context != file->ucontext) must only be performed if cmd.is_srq is non-zero. - Fab From steve_wooding at keysounds.co.uk Wed Aug 3 11:12:14 2005 From: steve_wooding at keysounds.co.uk (Steve Wooding) Date: Wed, 03 Aug 2005 19:12:14 +0100 Subject: [openib-general] Re: SDP RDMA support for synchronous socket operations In-Reply-To: <20050803135546.GQ15300@mellanox.co.il> References: <20050803135546.GQ15300@mellanox.co.il> Message-ID: <42F108FE.7020309@keysounds.co.uk> That's great news, Michael. Looks like my question was well timed. I look forward to trying it out when it's ready. Regards, Steve. Michael S. Tsirkin wrote: >Libor, all, >Tomorrow, I plan to start working on sdp zcopy support for synchronous >send/recv socket operations. Both kernel-level and user-level initiators >should continue to be supported. I don't plan to work on sendfile >support, yet. > >I hope to finish the implementation and some basic testing in the coming two >weeks. The development will be done on a branch, to be opened tomorrow. >My plan is to merge updates from trunk to stay in sync as much as possible. > >What follows is a raw design draft. Comments are welcome. 
> >MST > > > From halr at voltaire.com Wed Aug 3 11:20:54 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Aug 2005 14:20:54 -0400 Subject: [openib-general] configuration management for OpenIB Message-ID: <1123092422.4422.1586.camel@hal.voltaire.com> Hi, A number of people have commented on subversion's limited ability to handle incremental merging and perhaps it's getting to be time to switch to another configuration management tool. git and mercurial have been suggested as possibilities for this. Are there others ? Any recommendations pro or con any of these ? Are there any issues with converting from svn to whatever tool is chosen ? Thanks. -- Hal From halr at voltaire.com Wed Aug 3 11:24:07 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Aug 2005 14:24:07 -0400 Subject: [openib-general] Re: OpenSM Work In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C305BA@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607C305BA@mtlex01.yok.mtl.com> Message-ID: <1123093253.4422.1630.camel@hal.voltaire.com> Hi Eitan, On Wed, 2005-08-03 at 06:55, Eitan Zahavi wrote: > Hi Hal, > > As Mellanox moves to work on OpenIB Gen2 stack, we have assigned Yael > to work on merging OpenSM 1.8.0 (which was released based on gen1) into > the gen2 stack. I too am working on small pieces of this. > She has started to work on the merge to ensure that fixes done by you > and Shahar on the gen2 trunk will not be lost. > > The mode of work we suggest is that she will work offline. Not sure what you mean by offline here. > When the merge is completed a side branch will be opened under: > https://openib.org/svn/gen2/branches/osm_1_8_0 and will be made available > for review and testing before merge into the main trunk. I would prefer small patches rather than a large merge if that were possible. > Once all this is done, Yael will work on multiple new features > including faster route time, PKey manager, MKey manager, and QoS. 
She > will do so on branches off the main trunk - for each feature. OK. Sounds good. > In parallel, Liran who owns the OpenSM verification will enhance > osmtest and other testing utilities to achieve better test coverage of > SM handover, SL2VL, VLArb and PKey. Any new feature will get covered > by new tests. OK. > I myself will work on making sure the IB management simulator is well > integrated with the stack What do you mean by stack here ? > and the available simulator based tests as well as new tests can be > run daily. > We do have some issues with respect to the current osm tree: > > 1. All header files were moved from their relative location under > the opensm, complib, iba directories and placed under the include > directory. Although this seems reasonable for a "install" tree - it is > not very common for development trees. Normally I would expect the > Makefile.am of each sub directory of the osm project to define which > header files are to be installed into the $prefix/include dir. We will > revert that hierarchy change in our merged branch. Are you expecting the reverted hierarchy to make it back to the trunk ? > 2. osmtest was just introduced back into the osm tree. I think > osmtest should be placed under a "test" tree where all the tests of > the ULPs core etc will be located. I would expect a location like: > > https://openib.org/svn/gen2/trunk/test/userspace/management/osm It can be moved when it is agreed on its location. There are other places being proposed for this (talk to Amit). > 3. osmtest needs cleanup from VAPI stuff - we should let Liran > who is the owner of this code development a clear AR to clean it up. Yes. Can he send patches for this ? > 4. For some reason I saw that you have added Voltaire copyright > to the osmtest code. I do not think it makes sense as no work was done > on this code by a Voltaire developer. Or I might be wrong? There was some minor work done in terms of OpenIB. I removed this. 
> Needless to say the 1.8.0 version of OpenSM brings with it a long set > of bug fixes and enhancements. Once the OpenSM work is completed, will OpenSM development by Mellanox be done incrementally and in the open rather than as drops ? Will patches be supplied ? -- Hal From mshefty at ichips.intel.com Wed Aug 3 11:29:54 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 03 Aug 2005 11:29:54 -0700 Subject: [openib-general] configuration management for OpenIB In-Reply-To: <1123092422.4422.1586.camel@hal.voltaire.com> References: <1123092422.4422.1586.camel@hal.voltaire.com> Message-ID: <42F10D22.4090602@ichips.intel.com> Hal Rosenstock wrote: > A number of people have commented on subversion's limited ability to > handle incremental merging and perhaps it's getting to be time to switch > to another configuration management tool. git and mercurial have been > suggested as possibilities for this. Are there others ? Any > recommendations pro or con any of these ? Are there any issues with > converting from svn to whatever tool is chosen ? Thanks. A motivation behind any switch is to support release branches. - Sean From Thomas.Duffy.99 at alumni.brown.edu Wed Aug 3 11:41:21 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Wed, 03 Aug 2005 11:41:21 -0700 Subject: [openib-general] [iSER]How to get the dat_headers_1_1.tgz In-Reply-To: References: Message-ID: <1123094481.13498.9.camel@duffman> On Wed, 2005-08-03 at 17:14 +0800, Ian Jiang wrote: > It's known to all that the kDAPL 1.1 is needed to build the iSER. I failed > to get > http://groups.yahoo.com/group/dat-discussions/files/dat_headers_1_1.tgz > because only the members of the group could access this file This is dumb. Can Arkady just open the files up to anyone? If not, groups.yahoo.com should not be used for an open source project. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From Thomas.Duffy.99 at alumni.brown.edu Wed Aug 3 11:43:26 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Wed, 03 Aug 2005 11:43:26 -0700 Subject: [openib-general] RE: where can i find functions that "convert" enumerated values to string? In-Reply-To: <506C3D7B14CDD411A52C00025558DED60865E26F@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED60865E26F@mtlex01.yok.mtl.com> Message-ID: <1123094606.13498.11.camel@duffman> On Wed, 2005-08-03 at 08:12 +0300, Dotan Barak wrote: > During debug of problems in tests/real application this can help > a lot. > I think that it is better to write "QP ts type is RC" instead of "QP ts > type is 0"... Can't you the human look in the header and do the translation? Seems better than bloating the kernel with tons of strings and case statements. -tduffy From Thomas.Duffy.99 at alumni.brown.edu Wed Aug 3 11:48:14 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Wed, 03 Aug 2005 11:48:14 -0700 Subject: [openib-general] RE: [BUG]OpenSM double free or corruption In-Reply-To: <52d5owezvy.fsf@cisco.com> References: <52d5owezvy.fsf@cisco.com> Message-ID: <1123094894.13498.14.camel@duffman> On Tue, 2005-08-02 at 15:29 -0700, Roland Dreier wrote: > This message is just a warning, not an error. Does anyone have any > suggestions on how to rephrase it to make it clear that this is not a > fatal condition and that the driver is continuing to load? **NOT** FATAL: HCA FW version 4.5.3 is old (4.6.2 is current). Life continues. If you have problems, try updating your HCA FW. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From rolandd at cisco.com Wed Aug 3 11:53:07 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 03 Aug 2005 11:53:07 -0700 Subject: [openib-general] configuration management for OpenIB In-Reply-To: <1123092422.4422.1586.camel@hal.voltaire.com> (Hal Rosenstock's message of "03 Aug 2005 14:20:54 -0400") References: <1123092422.4422.1586.camel@hal.voltaire.com> Message-ID: <52slxqdf8s.fsf@cisco.com> Hal> A number of people have commented on subversion's limited Hal> ability to handle incremental merging and perhaps it's Hal> getting to be time to switch to another configuration Hal> management tool. git and mercurial have been suggested as Hal> possibilities for this. Are there others ? Any Hal> recommendations pro or con any of these ? Are there any Hal> issues with converting from svn to whatever tool is chosen ? I think for what we're doing, mercurial and git are the only two systems we should consider. Mercurial (hg) and git are quite similar in most ways and are IMHO better than anything else out there for our usage, namely a fairly big open source project with a lot of distributed development. I think the biggest advantage that subversion has is that its model is very simple and easy to learn. It will definitely take time for everyone to get used to a distributed source code control system. However, since git and hg track merge history much better, managing multiple lines of development becomes possible (anyone who has tried to develop on a svn branch knows just how useless "svn merge" is). As a bonus, git/hg are shockingly fast and allow completely disconnected operation (ie you can work on your laptop on a plane and do commits and everything). There shouldn't be any issue in importing a pretty complete svn history into either system. 
We might want to just stick to getting the history of the trunk into the new system, since the way svn handles branches is different from git/hg, but I don't see this as a big loss (since we have no merge history now anyway, and we don't really have any active branches we want to keep alive).

One other question to think about is whether we want to stick with the whole project in one tree paradigm. As part of a transition, it would make sense to me to move to a tree for the kernel stuff, a tree for libibverbs, a tree for libmthca, etc. The main objection to this is probably that it makes it a little harder to tag a monolithic release. But given all the pain that X.org is going through to split up their tree, I would argue that making monolithic releases harder is a feature.

As far as the git vs. mercurial question goes, I don't have a strong opinion either way. Here are a few points to consider:

Mercurial pros:
 - Underlying data structures are probably slightly more efficient
 - Tracks history across renames

Mercurial cons:
 - Requires a gateway or some other conversion to merge with upstream kernel

Git pros:
 - Makes merging with Linus very easy
 - Probably bigger user and developer communities
 - Automagically deduces renames and copies

Git cons:
 - Requires periodic repacking of repository for efficiency

At the kernel summit, Linus said that he'll make a final decision on whether to stick with git or switch to hg in a couple of months. However, I don't think there's much chance of Linus switching (he is just too happy with his git design). In any case, even if we decide to move towards git/hg, it probably makes sense to wait a couple of months before picking between git and hg and see if the decision is clearer then.

 - R.
From rolandd at cisco.com Wed Aug 3 11:53:52 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 03 Aug 2005 11:53:52 -0700 Subject: [openib-general] RE: [BUG]OpenSM double free or corruption In-Reply-To: <1123094894.13498.14.camel@duffman> (Tom Duffy's message of "Wed, 03 Aug 2005 11:48:14 -0700") References: <52d5owezvy.fsf@cisco.com> <1123094894.13498.14.camel@duffman> Message-ID: <52oe8edf7j.fsf@cisco.com> Tom> **NOT** FATAL: HCA FW version 4.5.3 is old (4.6.2 is Tom> current). Life continues. If you have problems, try Tom> updating your HCA FW. :) - R. From Thomas.Duffy.99 at alumni.brown.edu Wed Aug 3 11:57:53 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Wed, 03 Aug 2005 11:57:53 -0700 Subject: [openib-general] configuration management for OpenIB In-Reply-To: <52slxqdf8s.fsf@cisco.com> References: <1123092422.4422.1586.camel@hal.voltaire.com> <52slxqdf8s.fsf@cisco.com> Message-ID: <1123095473.13498.21.camel@duffman> On Wed, 2005-08-03 at 11:53 -0700, Roland Dreier wrote: > Git cons: > - Requires periodic repacking of repository for efficiency Doesn't git have a very inefficient way creating blame files or creating a line-by-line versioned file? Or does repacking fix this? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From Thomas.Duffy.99 at alumni.brown.edu Wed Aug 3 12:02:55 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Wed, 03 Aug 2005 12:02:55 -0700 Subject: [openib-general] Perftest problems In-Reply-To: <42E65199.8050600@cmu.edu> References: <42E65199.8050600@cmu.edu> Message-ID: <1123095775.13498.24.camel@duffman> On Tue, 2005-07-26 at 11:07 -0400, Spencer Whitman wrote: > I'm trying to run either of the perf tests and i'm seg faulting on the > client side =( . For rdma_lat the client segfaults but the server seems > to fail silently. 
For rdma_bw the client segfaults and the server dies > with the msg: > server read: Address family not supported by protocol > 0/45: Couldn't read remote address > > Any ideas or help? Are all the needed ib_* modules loaded? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From Thomas.Duffy.99 at alumni.brown.edu Wed Aug 3 12:07:52 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Wed, 03 Aug 2005 12:07:52 -0700 Subject: [openib-general] [PATCH 1/2] kDAPL: remove inline functions In-Reply-To: <1123020047.5203.4.camel@duffman> References: <1123020047.5203.4.camel@duffman> Message-ID: <1123096072.13498.25.camel@duffman> On Tue, 2005-08-02 at 15:00 -0700, Tom Duffy wrote: > This patch removes all the inline functions, instead just call the > functions from the function table. It also removes the _func from the > names as this is obvious by its type. James, Any chance of merging these? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From joecat at cmu.edu Wed Aug 3 12:09:37 2005 From: joecat at cmu.edu (Spencer Whitman) Date: Wed, 03 Aug 2005 15:09:37 -0400 Subject: [openib-general] Perftest problems In-Reply-To: <1123095775.13498.24.camel@duffman> References: <42E65199.8050600@cmu.edu> <1123095775.13498.24.camel@duffman> Message-ID: <42F11671.3040801@cmu.edu> Tom Duffy wrote: >On Tue, 2005-07-26 at 11:07 -0400, Spencer Whitman wrote: > > >>I'm trying to run either of the perf tests and i'm seg faulting on the >>client side =( . For rdma_lat the client segfaults but the server seems >>to fail silently. 
For rdma_bw the client segfaults and the server dies >>with the msg: >>server read: Address family not supported by protocol >>0/45: Couldn't read remote address >> >>Any ideas or help? >> >> > >Are all the needed ib_* modules loaded? > >-tduffy > > Yes, there was some problem with the compilation. After remaking they worked fine... From Arkady.Kanevsky at netapp.com Wed Aug 3 12:18:42 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 3 Aug 2005 15:18:42 -0400 Subject: [openib-general] [iSER]How to get the dat_headers_1_1.tgz Message-ID: The files are available to all. Only posting to the reflector is restricted to members. If you still have problems they can be made available at http://www.datcollaborative.org/. The 1.2 headers are already available there. Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Tom Duffy [mailto:Thomas.Duffy.99 at alumni.brown.edu] > Sent: Wednesday, August 03, 2005 2:41 PM > To: Ian Jiang > Cc: openib-general at openib.org; > dat-discussions at yahoogroups.com; Kanevsky, Arkady > Subject: Re: [openib-general] [iSER]How to get the dat_headers_1_1.tgz > > > On Wed, 2005-08-03 at 17:14 +0800, Ian Jiang wrote: > > It's known to all that the kDAPL 1.1 is needed to build the iSER. I > > failed > > to get > > > http://groups.yahoo.com/group/dat-discussions/files/dat_headers_1_1.tgz > because only the members of the group could access this file This is dumb. Can Arkady just open the files up to anyone? If not, groups.yahoo.com should not be used for an open source project. 
-tduffy From rolandd at cisco.com Wed Aug 3 12:25:52 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 03 Aug 2005 12:25:52 -0700 Subject: [openib-general] configuration management for OpenIB In-Reply-To: <1123095473.13498.21.camel@duffman> (Tom Duffy's message of "Wed, 03 Aug 2005 11:57:53 -0700") References: <1123092422.4422.1586.camel@hal.voltaire.com> <52slxqdf8s.fsf@cisco.com> <1123095473.13498.21.camel@duffman> Message-ID: <52fytqddq7.fsf@cisco.com> Tom> Doesn't git have a very inefficient way creating blame files Tom> or creating a line-by-line versioned file? Good point. Yes, this is one other advantage that the hg data structure design has -- it makes "hg annotate" possible and rather fast. For example, on a kernel tree with ~33000 changesets, I get: $ time hg annotate include/linux/kernel.h > /dev/null real 0m1.584s user 0m1.180s sys 0m0.218s I picked semi-randomly as a file likely to have a lot of changes -- it is touched by 77 changesets in this repo's history. On the other hand, Linus would argue that the git design of tracking history over the whole tree rather than by individual file has its advantages for other operations (in addition to being conceptually very clean). Tom> Or does repacking fix this? No, repacking doesn't change the data structures in any fundamental way. It just puts multiple blobs in a single file. - R. From rolandd at cisco.com Wed Aug 3 12:36:35 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 03 Aug 2005 12:36:35 -0700 Subject: [openib-general] Re: create several RC QPs with the same init attributes structure cau ses the init attribute structure to be changed In-Reply-To: <001201c59856$133cd390$9c5aa8c0@infiniconsys.com> (Fab Tillier's message of "Wed, 3 Aug 2005 11:06:21 -0700") References: <001201c59856$133cd390$9c5aa8c0@infiniconsys.com> Message-ID: <527jf2dd8c.fsf@cisco.com> Fab> Why doesn't this happen with UD/UC QPs? Is inline data not Fab> supported on those? 
I actually would expect it to happen for UC QPs, since the code path is exactly the same. For UD QPs, the WQE format is slightly different, which explains the difference. Fab> Also, why does the size keep growing? It seems that if you Fab> request 1 SGE, you get 28 bytes of max_inline. If you then Fab> request 28 bytes of max_inline, you get 2 SGEs back, and 96 Fab> bytes of max_inline. You're right that there's something fishy. It turns out our calculation of the possible WQE size adds together several worst cases that can't all happen at once. I'll look at how to clean this up. - R. From Thomas.Duffy.99 at alumni.brown.edu Wed Aug 3 12:37:19 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Wed, 03 Aug 2005 12:37:19 -0700 Subject: [openib-general] [iSER]How to get the dat_headers_1_1.tgz In-Reply-To: References: Message-ID: <1123097840.13498.28.camel@duffman> On Wed, 2005-08-03 at 15:18 -0400, Kanevsky, Arkady wrote: > The files are available to all. http://groups.yahoo.com/group/dat-discussions/files/dat_headers_1_1.tgz > To access Yahoo! Groups... > > you need a Yahoo! ID. > Don't have a Yahoo! ID? > Signing up is easy. ....that is *not* available to all... -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From rolandd at cisco.com Wed Aug 3 12:38:27 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 03 Aug 2005 12:38:27 -0700 Subject: [openib-general] [PATCH][RFC] uverbs SRQ implementation In-Reply-To: <20050803175617.GB16417@esmail.cup.hp.com> (Grant Grundler's message of "Wed, 3 Aug 2005 10:56:17 -0700") References: <52acjzdlyj.fsf@cisco.com> <20050803175617.GB16417@esmail.cup.hp.com> Message-ID: <523bpqdd58.fsf@cisco.com> Grant> my preference is to write this as: Grant> srq = cmd.is_srq ? idr_find(&ib_uverbs_srq_idr, cmd.srq_handle) : NULL; OK, good suggestion. Done in my tree. 
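A minimal user-space sketch of the lookup suggested above, for illustration only. Everything in it is a hypothetical stand-in: `find_srq` takes the place of `idr_find(&ib_uverbs_srq_idr, ...)`, and `struct create_qp_cmd` and `struct srq` are not the real uverbs types. It assumes the rule that a NULL result counts as an error only when the caller actually asked for an SRQ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real uverbs types. */
struct srq { int handle; };
struct create_qp_cmd { int is_srq; int srq_handle; };

/* Stand-in for idr_find(): a tiny fixed table keyed by handle. */
static struct srq srq_table[] = { { 1 }, { 2 } };

static struct srq *find_srq(int handle)
{
    size_t i;

    for (i = 0; i < sizeof(srq_table) / sizeof(srq_table[0]); i++)
        if (srq_table[i].handle == handle)
            return &srq_table[i];
    return NULL;
}

/*
 * Returns 0 on success, -1 if an SRQ was requested but not found.
 * On success *out is NULL when no SRQ was requested.
 */
static int lookup_srq(const struct create_qp_cmd *cmd, struct srq **out)
{
    /* Only consult the table when the caller says it passed an SRQ. */
    struct srq *srq = cmd->is_srq ? find_srq(cmd->srq_handle) : NULL;

    /* A NULL srq alone is not an error: no SRQ may have been requested. */
    if (cmd->is_srq && !srq)
        return -1;

    *out = srq;
    return 0;
}
```

With this shape, a command with `is_srq` clear succeeds with a NULL SRQ even if `srq_handle` holds garbage, while a command with `is_srq` set and a stale handle fails.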
Grant> I think it's redundant to test cmd.is_srq. srq is NULL if Grant> cmd.is_srq is not set. ie !srq should short circuit the Grant> rest of the test. As Fab points out, the logic is a little more complicated since the user may not have passed us an SRQ. We don't want to fail if the user didn't give us an SRQ. Grant> Do I need to worry about infinite (or very long) retry Grant> loops here? If not, maybe add a one-liner comment Grant> explaining what limits the retry. This is standard use of the idr stuff. I don't think the code needs to change, and the logic is common enough in the kernel that we shouldn't need a comment in this one spot. - R. From arlin.r.davis at intel.com Wed Aug 3 13:34:46 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 3 Aug 2005 13:34:46 -0700 Subject: [openib-general] [PATCH] uDAPL openib uAT retry fixes Message-ID: James, Please review the following uDAPL patch. Fixes my broken uAT retry code. Thanks, -arlin Signed-off by: Arlin Davis Index: dapl/openib/dapl_ib_util.c =================================================================== --- dapl/openib/dapl_ib_util.c (revision 2970) +++ dapl/openib/dapl_ib_util.c (working copy) @@ -128,21 +128,33 @@ int dapli_get_hca_addr( struct dapl_hca at_comp.context = &at_rec; at_rec.addr = &hca_ptr->hca_address; at_rec.wait_object = &hca_ptr->ib_trans.wait_object; + at_rec.hca_ptr = hca_ptr; + at_rec.retries = 0; /* call with async_comp until the sync version works */ status = ib_at_ips_by_gid(&hca_ptr->ib_trans.gid, &ipv4_addr->sin_addr.s_addr, 1, &at_comp, &at_rec.req_id); - if (status < 0) + if (status < 0) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " get_hca_addr: ERR ips_by_gid %d %s \n", + status, strerror(errno)); return 1; + } - if (status > 0) - dapli_ip_comp_handler(at_rec.req_id, (void*)ipv4_addr, status); - - /* wait for answer, 5 seconds max */ - dat_status = dapl_os_wait_object_wait (&hca_ptr->ib_trans.wait_object,5000000); - - if ((dat_status != DAT_SUCCESS ) || 
(!ipv4_addr->sin_addr.s_addr)) + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " get_hca_addr: ips_by_gid ret %d at_rec %p -> id %lld\n", + status, &at_rec, at_rec.req_id ); + + if (status > 0) { + dapli_ip_comp_handler(at_rec.req_id, (void*)&at_rec, status); + } else { + dat_status = dapl_os_wait_object_wait(&hca_ptr->ib_trans.wait_object,500000); + if (dat_status != DAT_SUCCESS) + ib_at_cancel(at_rec.req_id); + } + + if (!ipv4_addr->sin_addr.s_addr) return 1; return 0; @@ -252,6 +264,13 @@ DAT_RETURN dapls_ib_open_hca ( ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); goto bail; } + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " open_hca: LID 0x%x GID subnet %016llx id %016llx\n", + hca_ptr->ib_trans.lid, + (unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.subnet_prefix), + (unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.interface_id) ); + /* get the IP address of the device */ if (dapli_get_hca_addr(hca_ptr)) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, @@ -282,11 +301,6 @@ DAT_RETURN dapls_ib_open_hca ( ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 24 & 0xff, hca_ptr->ib_trans.max_inline_send ); - dapl_dbg_log(DAPL_DBG_TYPE_CM, - " open_hca: LID 0x%x GID subnet %016llx id %016llx\n", - hca_ptr->ib_trans.lid, - (unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.subnet_prefix), - (unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.interface_id) ); return DAT_SUCCESS; Index: dapl/openib/dapl_ib_cm.c =================================================================== --- dapl/openib/dapl_ib_cm.c (revision 2970) +++ dapl/openib/dapl_ib_cm.c (working copy) @@ -158,19 +158,49 @@ void dapli_at_thread_destroy(void) void dapli_ip_comp_handler(uint64_t req_id, void *context, int rec_num) { struct dapl_at_record *at_rec = context; + struct sockaddr_in *ipv4_addr = (struct sockaddr_in*)at_rec->addr; + int status; dapl_dbg_log(DAPL_DBG_TYPE_CM, - " ip_comp_handler: ctxt %p, req_id %lld rec_num %d\n", - context, req_id, rec_num); + " 
ip_comp_handler: at_rec %p ->id %lld id %lld rec_num %d %x\n", + context, at_rec->req_id, req_id, rec_num, + ipv4_addr->sin_addr.s_addr); + + if (rec_num <= 0) { + struct ib_at_completion at_comp; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " ip_comp_handler: resolution err %d retry %d\n", + rec_num, at_rec->retries + 1); + + if (++at_rec->retries > IB_MAX_AT_RETRY) + goto bail; + + at_comp.fn = dapli_ip_comp_handler; + at_comp.context = at_rec; + ipv4_addr->sin_addr.s_addr = 0; + + status = ib_at_ips_by_gid(&at_rec->hca_ptr->ib_trans.gid, + &ipv4_addr->sin_addr.s_addr, 1, + &at_comp, &at_rec->req_id); + if (status < 0) + goto bail; + + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " ip_comp_handler: NEW ips_by_gid ret %d at_rec %p -> id %lld\n", + status, at_rec, at_rec->req_id ); + } - if ((at_rec) && ( at_rec->req_id == req_id)) { + if (ipv4_addr->sin_addr.s_addr) dapl_os_wait_object_wakeup(at_rec->wait_object); - return; - } - - dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " ip_comp_handler: at_rec->req_id %lld != req_id %lld\n", - at_rec->req_id, req_id ); + + return; +bail: + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " ip_comp_handler: ERR: at_rec %p, req_id %lld rec_num %d\n", + at_rec, req_id, rec_num); + + dapl_os_wait_object_wakeup(at_rec->wait_object); } static void dapli_path_comp_handler(uint64_t req_id, void *context, int rec_num) @@ -622,20 +652,21 @@ void cm_thread(void *arg) dapl_os_unlock(&g_cm_lock); ret = poll(&ufds, 1, -1); - if ((ret <= 0) || (g_cm_destroy)) { + if (ret <= 0) { dapl_dbg_log(DAPL_DBG_TYPE_CM, " cm_thread(%d): ERR %s poll\n", getpid(),strerror(errno)); dapl_os_lock(&g_cm_lock); - break; + continue; } dapl_dbg_log(DAPL_DBG_TYPE_CM, " cm_thread: GET EVENT fd=%d n=%d\n", ib_cm_get_fd(),ret); + if (ib_cm_event_get_timed(0,&event)) { dapl_dbg_log(DAPL_DBG_TYPE_CM, - " cm_thread: ERR %s eventi_get on %d\n", + " cm_thread: ERR %s event_get on %d\n", strerror(errno), ib_cm_get_fd() ); dapl_os_lock(&g_cm_lock); continue; Index: dapl/openib/dapl_ib_util.h 
=================================================================== --- dapl/openib/dapl_ib_util.h (revision 2970) +++ dapl/openib/dapl_ib_util.h (working copy) @@ -97,6 +97,8 @@ struct dapl_at_record { uint64_t req_id; DAT_SOCK_ADDR6 *addr; DAPL_OS_WAIT_OBJECT *wait_object; + struct dapl_hca *hca_ptr; + int retries; }; /* From Arkady.Kanevsky at netapp.com Wed Aug 3 13:35:45 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 3 Aug 2005 16:35:45 -0400 Subject: [openib-general] [iSER]How to get the dat_headers_1_1.tgz Message-ID: Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Tom Duffy [mailto:Thomas.Duffy.99 at alumni.brown.edu] > Sent: Wednesday, August 03, 2005 3:37 PM > To: Kanevsky, Arkady > Cc: Ian Jiang; dat-discussions at yahoogroups.com; > openib-general at openib.org > Subject: RE: [openib-general] [iSER]How to get the dat_headers_1_1.tgz > > > On Wed, 2005-08-03 at 15:18 -0400, Kanevsky, Arkady wrote: > > The files are available to all. > > http://groups.yahoo.com/group/dat-discussions/files/dat_header s_1_1.tgz > To access Yahoo! Groups... > > you need a Yahoo! ID. > Don't have a Yahoo! ID? > Signing up is easy. ....that is *not* available to all... -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: dat_headers_1_1.tgz Type: application/x-compressed Size: 22320 bytes Desc: dat_headers_1_1.tgz URL: From arlin.r.davis at intel.com Wed Aug 3 13:42:00 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 3 Aug 2005 13:42:00 -0700 Subject: [openib-general] [PATCH] uDAPL common code to build with counters Message-ID: James, I tried to build uDAPL with counters to debug my wait/wakeup problem but ran into some build problems. Can you review the following patch to enable counters? Not sure what happened to dapl_counters.h? 
Thanks, -arlin Signed-off by: Arlin Davis Index: dapl/include/dapl_debug.h =================================================================== --- dapl/include/dapl_debug.h (revision 2967) +++ dapl/include/dapl_debug.h (working copy) @@ -64,7 +64,9 @@ typedef enum DAPL_DBG_TYPE_API = 0x0100, DAPL_DBG_TYPE_RTN = 0x0200, DAPL_DBG_TYPE_EXCEPTION = 0x0400, - DAPL_DBG_TYPE_SRQ = 0x0800 + DAPL_DBG_TYPE_SRQ = 0x0800, + DAPL_DBG_TYPE_CNTR = 0x1000 + } DAPL_DBG_TYPE; typedef enum @@ -110,12 +112,21 @@ extern void dapl_internal_dbg_log ( DAPL #define DCNT_EVD_DEQUEUE_NOT_FOUND 18 #define DCNT_TIMER_SET 19 #define DCNT_TIMER_CANCEL 20 -#define DCNT_LAST_COUNTER 22 /* Always the last counter */ +#define DCNT_LAST_COUNTER 21 /* Always the last counter */ +#define DCNT_ALL_COUNTERS DCNT_LAST_COUNTER #if defined(DAPL_COUNTERS) -#include "dapl_counters.h" -#define DAPL_CNTR(cntr) dapl_os_atomic_inc (&dapl_dbg_counters[cntr]); +extern void dapl_dump_cntr( int cntr ); +extern int dapl_dbg_counters[]; + +#define DAPL_CNTR(cntr) dapl_os_atomic_inc (&dapl_dbg_counters[cntr]); +#define DAPL_DUMP_CNTR(cntr) dapl_dump_cntr( cntr ); +#define DAPL_COUNTERS_INIT() +#define DAPL_COUNTERS_NEW(__tag, __id) +#define DAPL_COUNTERS_RESET(__id, __incr) +#define DAPL_COUNTERS_INCR(__id, __incr) + #else #define DAPL_CNTR(cntr) Index: dapl/common/dapl_debug.c =================================================================== --- dapl/common/dapl_debug.c (revision 2967) +++ dapl/common/dapl_debug.c (working copy) @@ -58,7 +58,7 @@ void dapl_internal_dbg_log ( DAPL_DBG_TY } #if defined(DAPL_COUNTERS) -long dapl_dbg_counters[DAPL_CNTR_MAX]; +int dapl_dbg_counters[DCNT_LAST_COUNTER+1] = { 0 }; /* * The order of this list must match exactly with the #defines @@ -89,6 +89,22 @@ char *dapl_dbg_counter_names[] = { 0 }; +void dapl_dump_cntr( int cntr ) +{ + int i; + + for ( i=0;i References: Message-ID: <1123102046.13498.43.camel@duffman> > application/x-compressed attachment (dat_headers_1_1.tgz), > 
"dat_headers_1_1.tgz" I think you missed my point. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From yhlu.kernel at gmail.com Wed Aug 3 17:58:11 2005 From: yhlu.kernel at gmail.com (yhlu) Date: Wed, 3 Aug 2005 17:58:11 -0700 Subject: [openib-general] Re: [PATCH 1/2] [IB/cm]: Correct CM port redirect reject codes In-Reply-To: <20057281331.7vqhiAJ1Yc0um2je@cisco.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <20057281331.7vqhiAJ1Yc0um2je@cisco.com> Message-ID: <86802c44050803175873fb0569@mail.gmail.com> Roland, In LinuxBIOS, If I enable the prefmem64 to use real 64 range. the IB driver in Kernel can not be loaded. YH PCI: 00:18.0 1c1 <- [0x0000001000 - 0x0000003fff] io PCI: 00:18.0 1b9 <- [0xfce0000000 - 0xfcf07fffff] prefmem PCI: 00:18.0 1b1 <- [0x00fc000000 - 0x00fd2fffff] mem PCI: 01:0f.0 24 <- [0xfce0000000 - 0xfcf07fffff] bus 4 prefmem PCI: 01:0f.0 20 <- [0x00fd100000 - 0x00fd1fffff] bus 4 mem PCI: 04:00.0 10 <- [0x00fd100000 - 0x00fd1fffff] mem64 PCI: 04:00.0 18 <- [0xfcf0000000 - 0xfcf07fffff] prefmem64 PCI: 04:00.0 20 <- [0xfce0000000 - 0xfcefffffff] prefmem64 ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) ib_mthca: Initializing Mellanox Technologies MT25208 InfiniHost III Ex (Tavor c) ib_mthca 0000:04:00.0: Failed to initialize queue pair table, aborting. 
ib_mthca: probe of 0000:04:00.0 failed with error -16 From iod00d at hp.com Wed Aug 3 18:39:10 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 3 Aug 2005 18:39:10 -0700 Subject: [openib-general] Re: [PATCH 1/2] [IB/cm]: Correct CM port redirect reject codes In-Reply-To: <86802c44050803175873fb0569@mail.gmail.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <20057281331.7vqhiAJ1Yc0um2je@cisco.com> <86802c44050803175873fb0569@mail.gmail.com> Message-ID: <20050804013910.GG16417@esmail.cup.hp.com> On Wed, Aug 03, 2005 at 05:58:11PM -0700, yhlu wrote: > Roland, > > In LinuxBIOS, If I enable the prefmem64 to use real 64 range. the IB > driver in Kernel can not be loaded. Can you provide a few more details about the configuration? o kernel version o architecture (i386 or x86-64) o post the full console output from power up? Recent email on linux-pci raised awareness that 32-bit kernel can not support 64-bit PCI MMIO addresses. struct resource (defined in include/linux/ioport.h) defines the start/end field as "unsigned long". That's only 32-bit on i386 kernels. > PCI: 04:00.0 18 <- [0xfcf0000000 - 0xfcf07fffff] prefmem64 > PCI: 04:00.0 20 <- [0xfce0000000 - 0xfcefffffff] prefmem64 I have to wonder if those BARs are truly prefetchable. Does Mellanox assume CPU is the only one to write the 3rd BAR (RAM) and the CPU implements a write-through cache (vs write back)? I'm just guessing because I don't understand exactly how the 256MB of onboard RAM is accessed. hth, grant > > ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) > ib_mthca: Initializing Mellanox Technologies MT25208 InfiniHost III Ex (Tavor c) > ib_mthca 0000:04:00.0: Failed to initialize queue pair table, aborting. 
> ib_mthca: probe of 0000:04:00.0 failed with error -16 From iod00d at hp.com Wed Aug 3 18:55:03 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 3 Aug 2005 18:55:03 -0700 Subject: [openib-general] dev_queue_xmit failed to requeue packet Message-ID: <20050804015503.GH16417@esmail.cup.hp.com> Hi, I'm running netperf 2.4.0-rc1 TCP_RR test between two rx2600 machines via IPoIB. Both machines are running 2.6.12 kernels with openib SVN r2971. I'm getting a few (3 in ~15 minutes) messages on the "netserver" (machine running netserver, NOT an older HP x86 server) console: ib0: dev_queue_xmit failed to requeue packet Should I be worried? I don't recall seeing that before. I've been out for 5 weeks. The last time I ran TCP_RR was with SVN r2577. thanks, grant From rolandd at cisco.com Wed Aug 3 21:44:32 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 03 Aug 2005 21:44:32 -0700 Subject: [openib-general] mthca and LinuxBIOS (was: [PATCH 1/2] [IB/cm]: Correct CM port redirect reject codes) In-Reply-To: <86802c44050803175873fb0569@mail.gmail.com> (yhlu's message of "Wed, 3 Aug 2005 17:58:11 -0700") References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <20057281331.7vqhiAJ1Yc0um2je@cisco.com> <86802c44050803175873fb0569@mail.gmail.com> Message-ID: <52u0i6b9an.fsf_-_@cisco.com> yhlu> In LinuxBIOS, If I enable the prefmem64 to use real 64 yhlu> range. the IB driver in Kernel can not be loaded. What does it mean to "enable the prefmem64 to use real 64 range"? Does the driver work if you don't do this? yhlu> ib_mthca 0000:04:00.0: Failed to initialize queue pair table, aborting. 
Can you add printk()s to mthca_qp.c::mthca_init_qp_table() to find out how far the function gets before it fails? It would also be useful for you to build with CONFIG_INFINIBAND_MTHCA_DEBUG=y and send the kernel output you get with that. - Roland From rolandd at cisco.com Wed Aug 3 21:56:37 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 03 Aug 2005 21:56:37 -0700 Subject: [openib-general] dev_queue_xmit failed to requeue packet In-Reply-To: <20050804015503.GH16417@esmail.cup.hp.com> (Grant Grundler's message of "Wed, 3 Aug 2005 18:55:03 -0700") References: <20050804015503.GH16417@esmail.cup.hp.com> Message-ID: <52pssub8qi.fsf@cisco.com> Grant> ib0: dev_queue_xmit failed to requeue packet Grant> Should I be worried? I don't recall seeing that before. I don't think it's that worrisome -- at worst you're dropping one packet per message. But why is dev_queue_xmit() returning a bad status? I'll have to look at what can cause that function to fail. - R. From eitan at mellanox.co.il Wed Aug 3 23:09:21 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 4 Aug 2005 09:09:21 +0300 Subject: [openib-general] RE: OpenSM Work Message-ID: <506C3D7B14CDD411A52C00025558DED607C305C4@mtlex01.yok.mtl.com> Hi Hal, Please see my responses below. > > > > As Mellanox moves to work on OpenIB Gen2 stack, we have assigned Yael > > to work on merging OpenSM 1.8.0 (which released based on gen1) into > > the gen2 stack. > > I too am working on small pieces of this. [EZ] I think it does not make much sense to work in parallel as the changes spans many files. What are the changes you work on? > > > She has started to work on the merge to ensure that fixes done by you > > and Shahar on the gen2 trunk will not be lost. > > > > The mode of work we suggest is that she will work offline. > > Not sure by what you mean by offline here. [EZ] Offline means she will do the entire merge and then commit. 
I propose she will commit the changes into a branch and then you can review it and do the merge to the main trunk yourself. > > > When the merge will be completed a side branch will be opened under: > > https://openib.org/svn/gen2/branches/osm_1_8_0 and will made available > > for review and testing before merge into the main trunk. > > I would prefer small patches rather than a large merge if that were > possible. [EZ] I am afraid this is impossible due to the large difference in code. The changes affect many files and will require heavy testing to make sure they did not break anything. The testing will be done at the end of the merge work. We could however, let her commit small changes - one at a time - but that branch will be useless. > > > Once all this is done, Yael will work on multiple new features > > including faster route time, PKey manager, MKey manager, and QoS. She > > will do so on branches off the main trunk - for each feature. > > OK. Sounds good. > > > In parallel, Liran who owns the OpenSM verification will enhance > > osmtest and other testing utilities to achieve better test coverage of > > SM handover, SL2VL, VLArb and PKey. Any new feature will get covered > > by new tests. > > OK. > > > I myself will work on making sure the IB management simulator is well > > integrated with the stack > > What do you mean by stack here ? [EZ] I mean OpenIB > > > and the available simulator based tests as well as new tests can be > > run daily. > > > We do have some issues with respect to the current osm tree: > > > > 1. All header files were moved from their relative location under > > the opensm, complib, iba directories and placed under the include > > directory. Although this seems reasonable for a "install" tree - it is > > not very common for development trees. Normally I would expect the > > Makefile.am of each sub directory of the osm project to define which > > header files are to be installed into the $prefix/include dir. 
We will > > revert that hierarchy change in our merged branch. > > Are you expecting the reverted hierarchy to make it back to the trunk ? [EZ] Yes. Explained why before. If you need examples for how this is done in many other GNU projects we can provide it. > > > 2. osmtest was just introduced back into the osm tree. I think > > osmtest should be placed under a "test" tree where all the tests of > > the ULPs core etc will be located. I would expect a location like: > > > > https://openib.org/svn/gen2/trunk/test/userspace/management/osm > > It can be moved when it is agreed on its location. There are other > places being proposed for this (talk to Amit). [EZ] OK > > > 3. osmtest needs cleanup from VAPI stuff - we should let Liran > > who is the owner of this code development a clear AR to clean it up. > > Yes. Can he send patches for this ? [EZ] Yes he will > > > 4. For some reason I saw that you have added Voltaire copyright > > to the osmtest code. I do not think it makes sense as no work was done > > on this code by a Voltaire developer. Or I might be wrong? > > There was some minor work done in terms of OpenIB. I removed this. > > > Needless to say the 1.8.0 version of OpenSM brings with it a long set > > of bug fixes and enhancements. > > Once the OpenSM work is completed, will OpenSM development by Mellanox > be done incrementally and in the open rather than drops ? Will patches > be suppplied ? [EZ] Yes. After the merge is done all enhancements described above will be done on OpenIB. Assuming we resume our maintainer position and are able to commit directly. > > -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Wed Aug 3 23:42:23 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Thu, 4 Aug 2005 09:42:23 +0300
Subject: [openib-general] Re: [PATCH 1/2] [IB/cm]: Correct CM port redirect reject codes
In-Reply-To: <86802c44050803175873fb0569@mail.gmail.com>
References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <20057281331.7vqhiAJ1Yc0um2je@cisco.com> <86802c44050803175873fb0569@mail.gmail.com>
Message-ID: <20050804064223.GT15300@mellanox.co.il>

Quoting r. yhlu :
> Subject: Re: [PATCH 1/2] [IB/cm]: Correct CM port redirect reject codes
>
> Roland,
>
> In LinuxBIOS, If I enable the prefmem64 to use real 64 range. the IB
> driver in Kernel can not be loaded.

Are you using the latest firmware on the HCA card?

--
MST

From kashyapv at us.ibm.com Wed Aug 3 23:50:59 2005
From: kashyapv at us.ibm.com (Vivek Kashyap)
Date: Wed, 3 Aug 2005 23:50:59 -0700 (PDT)
Subject: [openib-general] IPoIB -- connected mode update
Message-ID:

Attached is an updated draft (will be posting to internet drafts after the current IETF ends) for ipoib-connected mode based on the discussions on ipoib wg, openib (IB on Linux), and other communications.

Two threads that saw good discussion are given below. I believe the attached updated draft captures all the discussions. Please comment.

http://openib.org/pipermail/openib-general/2005-May/006751.html
http://www1.ietf.org/mail-archive/web/ipoverib/current/msg01212.html

thanks,
Vivek

--------------------------------

IP over InfiniBand: Connected Mode

Abstract

This document specifies a method for transmitting IPv4/IPv6 packets and address resolution over the connected modes of InfiniBand.

Table of Contents

   1.0 Introduction
   2.0 IPoIB-connected mode
       2.1 Multicasting
       2.2 Outline of Address Resolution
       2.3 Outline of Connection Setup
   3.0 Address Resolution
       3.1 Link-layer Address
       3.2 IB Connection Setup
       3.3 Service-ID
   4.0 Frame Format
   5.0 Maximum Transmission Unit
       5.1 Per-Connection MTU
   6.0 IPoIB-CM Considerations
       6.1 A Cautionary Note on IPoIB-RC
   7.0 Security Considerations
   8.0 IANA Considerations
   9.0 References

1.0 Introduction

The InfiniBand specification [IB_ARCH] can be found at www.infinibandta.org. The document [IPoIB_ARCH] provides a short overview of InfiniBand architecture along with consideration for specifying IP over InfiniBand networks.

The InfiniBand architecture (IBA) defines multiple modes of transports. Of these the unreliable datagram (UD) transport method best matches the needs of IP. IP over InfiniBand (IPoIB) over UD is described in [IPoIB_UD].

This document describes IP transmission over the connected modes of IBA.

IBA defines two connected modes:

   1. Reliable Connected (RC)
   2. Unreliable Connected (UC)

As is evident from the nomenclature, the two modes differ mainly in providing reliability of data delivery across the connection. This document applies equally to both the connected modes. IPoIB over these two modes is referred to as IPoIB-CM (connected mode) in this document. For clarity IPoIB over the unreliable datagram mode, as described in [IPoIB_UD] is referred to as IPoIB-UD.

IBA requires that all Host Channel Adapters (HCAs) support the reliable and unreliable connected modes [IB_ARCH]. It is optional for Target Channel Adapters (TCAs) to support the connected modes.

The connected modes offer link MTUs of up to 2^31 octets in length. Thus the use of connected modes can offer significant benefits by supporting reasonably large MTUs. The datagram modes of InfiniBand Architecture (IBA) are limited to 4096 octets.
Reliability is also enhanced if the underlying feature of "automatic path migration" supported by the connected modes is utilized.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

2.0 IPoIB-connected mode

Every IPoIB implementation MUST support IPoIB-UD. The IPoIB-CM support is OPTIONAL.

This document extensively refers to [IPoIB_UD] and extends IPoIB description given in [IPoIB_UD] to IPoIB-CM. Therefore, only additional requirements or enhancements needed to enable IPoIB-CM are described.

The IP encapsulation, default MTU, link layer address format and the IPv6 stateless autoconfiguration mechanism apply to IPoIB-CM exactly as described in [IPoIB_UD].

2.1 Multicasting

The connected modes of IBA define a non-broadcast, multiple access network. The connected modes of IBA do not support multicasting though every node can communicate with every other node if desired. This requires that multicasting be emulated in some form by the network. However, in the case of an InfiniBand network, instead of an emulation, an unreliable datagram (UD) queue pair (QP) can be used to support multicasting while the connected mode QP is used for unicast traffic.

Since every IPoIB implementation is required to support the UD mode, every implementation supporting IPoIB-CM will be able to utilize the coexisting IPoIB-UD QP for all broadcast/multicast communications.

Multicast mapping, transmission and reception of multicast packets and multicast routing MUST use the IPoIB-UD QP associated with the IPoIB-CM interface.

2.2 Outline of Address Resolution

Every IPoIB-CM interface MUST have two QPs associated with it:

   1) A connected mode QP
   2) An unreliable datagram mode QP

[IPoIB_UD] proposes that the address resolution query is multicast over an IB multicast address that is joined by every member of the IPoIB subnet.
This IB multicast address is referred to as the "broadcast-GID" [IPoIB_UD]. The "broadcast-GID" is "FullMember" joined by every IPoIB-UD implementation on the associated QP [IPoIB_UD].

A broadcast-GID is formed with the knowledge of the scope bits, the IP version, and the partition key (P_Key) associated with the subnet. Thus these three parameters must be known to the node before an IPoIB interface can be brought up. The exact format and rules to setup the broadcast-GID are defined in [IPoIB_UD].

The response to the query is received on the IPoIB-UD QP [IPoIB_UD].

2.3 Outline of Connection Setup

Once the link address of the remote node is known an IB connection must be setup between the nodes before any IP communication may occur.

To make a connection, the sender must know the service-ID to use in the request to make a connection [IB_ARCH]. It must also supply the "connection mode" queue pair to the remote node. The peer replies with its queue pair. Each IB connection is peer to peer and uses one connected mode QP at each end.

Though the address resolution occurs at an individual IP address level the connection between the nodes is at the IB layer. Therefore every individual address resolution does not imply a new connection between the peers.

3.0 Address Resolution

Address resolution queries are sent out on the "broadcast-GID" over the IPoIB-UD QP associated with the IPoIB-CM interface. A unicast reply is received on the UD QP.

3.1 Link-layer Address

IPoIB encapsulation [IPoIB_UD] describes the link-layer address as follows:

   <1 octet reserved>:QP:GID

This document extends the link-layer address as follows:

   <Flags>:QPN:GID

Flags: This is a single octet field. The bits indicate the connected modes supported by the interface. Bit 0 specifies the support for the "reliable connected" (RC) mode. Bit 1 indicates the support for the "unreliable connected" (UC) mode. All other bits in the octet are reserved and MUST be set to 0 on transmits and ignored on receives.
The format of the flags is:

   +--+--+--+--+--+--+--+--+
   |RC|UC| 0| 0| 0| 0| 0| 0|
   +--+--+--+--+--+--+--+--+

Both the RC and UC bits MAY be set at the same time if the interface supports both modes. Since the IPoIB-UD mode is always supported there are no flags to indicate IPoIB-UD support.

If IPoIB-CM is not supported, i.e. if the implementation only supports IPoIB-UD, then the implementation MUST ignore the Flags octet on reception. It MUST set the octet to all zeroes on transmission as specified in [IPoIB_UD].

QPN: The queue-pair number (QPN) on which the unicast address resolution reply will be received. This allows the IPoIB-UD address resolution code and method to be used for IPoIB-CM address resolution. The QPN also serves another purpose. It is used to form the Service-ID that is used to setup the IB connection.

On receiving the multicast/broadcast address resolution request the receiver replies with its own link-address, including the associated UD QPN and the appropriate flags. The receiver's reply is unicast back to the sender after the receiver has, as in the case of IPoIB-UD, resolved the GID to the LID and determined other required parameters [IPoIB_UD].

Once the address resolution is completed the underlying IB connection on the supported connection modes can be set up. An implementation is NOT REQUIRED to setup a connection merely because the peer indicates the capability. The decision to make such a connection is left to the implementation.

3.2 IB Connection Setup

The IB reliable/unreliable mode connection may be setup by any of the peers though it is more likely that the one that initiated the address resolution phase, probably as a result of the need to send IP data, will initiate the connection setup. IBA allows passive-active and active-active connection setup [IB_ARCH].

To setup a connection IB Management Datagrams (MADs) are directed to the peer's communication manager (CM).
The connection request always contains a Service-ID for the peer to associate the request with the appropriate entity. If the request is accepted the peer returns the relevant connected mode QPN in the response MAD. The format of the CM connection messages and the IB connection setup process is described in [IB_ARCH]. The CM messages include, among other parameters, the Service-ID, Local QPN, and the payload size to use over the connection.

Note: The IB connection is setup using the Service-ID as defined above. The node MUST keep a record of IB connections it is participating in. The node MAY attempt another connection to the remote peer using the same Service-ID as used for an existing IB connection. Similarly, the receiver of such a connection MAY drop the request with a suitable error indication in the CM response. The decision to accept or initiate multiple connections from or to an IPoIB interface is left to the implementation.

3.3 Service-ID

The InfiniBand specification defines a block of service IDs for IETF use. The InfiniBand specification has left the definition and management of this block to the IETF [IB_ARCH]. The 64-bit block is:

   +--------+--------+--------+--------+-------+--------+--------+------+
   |00000001|<-------------------IETF use------------------------------>|
   +--------+--------+--------+--------+-------+--------+--------+------+

The Service-IDs used by IPoIB will be in the format:

   +--------+--------+--------+--------+-------+-------+--------+-------+
   |00000001|  Type  |        Reserved         |          QPN           |
   +--------+--------+--------+--------+-------+-------+--------+-------+

The Reserved fields MUST be transmitted as zeroes. They are ignored on reception.

The QPN MUST be the UD QP exchanged during address resolution.

The Type MUST be set to 0.

4.0 Frame Format

All IP and ARP datagrams transported over InfiniBand are prefixed by a 4-octet encapsulation header as described in [IPoIB_UD].

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |                               |
   |             Type              |           Reserved            |
   |                               |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The type field SHALL indicate the encapsulated protocol as per the following table.

   +----------+-------------+
   |   Type   |  Protocol   |
   |------------------------|
   |  0x800   |    IPv4     |
   |------------------------|
   |  0x806   |    ARP      |
   |------------------------|
   |  0x8035  |    RARP     |
   |------------------------|
   |  0x86DD  |    IPv6     |
   +------------------------+

These values are taken from the "ETHER TYPE" numbers assigned by Internet Assigned Numbers Authority (IANA). Other network protocols, identified by different values of "ETHER TYPE", may use the encapsulation format defined herein but such use is outside of the scope of this document.

5.0 Maximum Transmission Unit

The IB connection setup might be used for both IPv4 and IPv6 or it could be used for only one of them while a different connection is used for the other. The link MTU MUST be able to support the minimum MTU required by the protocols.

The default MTU of the IPoIB-CM interface is 2044 octets i.e. 2048 octet IPoIB-link MTU minus the 4 octet encapsulation header.

However, connected modes of InfiniBand allow message sizes up to 2^31 octets. Therefore, IPoIB-CM can use a much larger MTU for unicast communication between any two endpoints.

The maximum and/or optimal payload that can be received or sent over an InfiniBand connection is dependent on the implementation, HCA and the resources configured. An implementation MAY utilise the following mechanism to exchange the optimal message size across the IB connection.

5.1 Per-Connection MTU

Every IB connection setup message includes a "private data" field [IB_ARCH].
The private data field MUST carry the following information:

    0              15
   +----------------+
   |  Receive MTU   |
   +----------------+

The connection setup message (CM REQ) MUST insert the requested MTU in the "Receive MTU" field. This indicates the maximum packet size the requester can accept. The requester MUST be able to accept smaller MTU sizes as well.

It is up to the implementation to utilize this mechanism for setting the per IB connection MTU. The IPoIB interface must account for the 4-octet encapsulation header and so the IPoIB MTU over the connection will be smaller by that amount.

This mechanism allows for different MTU values per peer, however to enable implementations to work with a single "connection" MTU, a configuration parameter called "IPoIB-CM MTU multiplier" is introduced. The default value of "IPoIB-CM MTU multiplier" is 1. The "Receive MTU" MUST NOT be set less than "IPoIB-CM MTU multiplier" times 2048.

6.0 IPoIB-CM Considerations

Every IPoIB interface supports IPoIB-UD. It may additionally support one or both of the IPoIB-CM modes. Therefore, there can be multiple methods of communicating between any two peers.

This implies that an interface MAY transmit/receive a packet over any of the RC, UC or UD modes depending on the modes supported between it and the peer. It further follows that every IPoIB implementation compliant with this document MUST accept all unicast transmissions over any of the IPoIB modes it supports. Multicast and broadcast packets by their nature will always be transmitted and received over the IPoIB-UD QP.

6.1 A Cautionary Note on IPoIB-RC

The RC mode of InfiniBand guarantees in-order delivery of packets. Every message transmitted over the RC connection is broken into physical MTU sized packets by the RC connection. If any packet is lost, it is retransmitted until the complete message is exchanged.
Therefore there is a possibility of a reliable transport layer, such as TCP, retransmitting due to a shorter timeout while the RC layer is still in the process of transferring the complete message. A retransmission at the upper layer will add to the already existing congestion. Therefore, the RC timers as well as the maximum message size supported at the IPoIB-RC connection must be set judiciously.

7.0 Security Considerations

A node may be returned a false set of flags by an impostor. This may cause unnecessary attempts and some delay/disruption in IPoIB communication. The same is the case if wrong/spurious QPN values are provided during address resolution broadcast/multicast.

8.0 IANA Considerations

This document requires that the reserved bits and octets be set to zero on sends and ignored on receives. Proposed uses of the reserved bits MUST be published as RFCs.

9.0 References

Normative

   [IB_ARCH]    InfiniBand Architecture Specification, version 1.1
                www.infinibandta.org

   [IPoIB_ARCH] draft-ietf-ipoib-architecture-04.txt, V. Kashyap

   [IPoIB_UD]   draft-ietf-ipoib-ip-over-infiniband-9.txt,
                H.K. Jerry Chu, V. Kashyap

Author's Address

   Vivek Kashyap
   15350, SW Koll Parkway
   Beaverton, OR 97006
   Phone: +1 503 578 3422
   Email: vivk at us.ibm.com

From ianjiang91 at hotmail.com Thu Aug 4 00:10:27 2005
From: ianjiang91 at hotmail.com (Ian Jiang)
Date: Thu, 04 Aug 2005 15:10:27 +0800
Subject: [openib-general] [iSER]How to use the iSER with the UNH iSCSI
Message-ID:

Hi, everybody!
Thanks for all the replies to my "How to get the dat_headers_1_1.tgz"!
I downloaded the dapl_beta2.06.tgz as Itamar told me.
And I made some modification to the iSER to use it on the x86_64 platform.

I got through the compiling finally, but here is another question:
How to use the iSER with the UNH iSCSI? I have the UNH iSCSI running on my system at present. Need I modify it and reinstall?

And I'm not sure if the dapl_beta2.06 has to be installed to run the iSER.
In fact, I did not compile or install the dapl before installing the iSER.

Any suggestion is appreciated!

Ian Jiang
ianjiang91 at hotmail.com
----
Computer Architecture Laboratory
Institute of Computing Technology
Chinese Academy of Sciences
Beijing,P.R.China
Zip code: 100080
Tel: +86-10-62564394(office)

From mst at mellanox.co.il Thu Aug 4 00:38:48 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 4 Aug 2005 10:38:48 +0300
Subject: [openib-general] Re: configuration management for OpenIB
In-Reply-To: <52slxqdf8s.fsf@cisco.com>
References: <1123092422.4422.1586.camel@hal.voltaire.com> <52slxqdf8s.fsf@cisco.com>
Message-ID: <20050804073847.GU15300@mellanox.co.il>

Quoting r. Roland Dreier :
> However, since git and hg track merge history much better, managing
> multiple lines of development becomes possible (anyone who has tried
> to develop on a svn branch knows just how useless "svn merge" is).

I didn't try git. mercurial merges seem to work better than svn's. But quilt basically works for me, as well.

> As a bonus, git/hg are shockingly fast and allow completely disconnected
> operation (ie you can work on your laptop on a plane and do commits
> and everything).

This is also the main disadvantage, for me: everyone has to arrange for his own hosting, backups ... or set up ssh accounts on openib.org servers.

BTW, there's also svk (which uses svn as an underlying transport/storage). Did anyone try it?

MST

--
MST

From mst at mellanox.co.il Thu Aug 4 00:51:37 2005
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Thu, 4 Aug 2005 10:51:37 +0300
Subject: [openib-general] Re: [BUG]OpenSM double free or corruption
In-Reply-To: <1122993087.4422.88.camel@hal.voltaire.com>
References: <1122993087.4422.88.camel@hal.voltaire.com>
Message-ID: <20050804075137.GV15300@mellanox.co.il>

Quoting r. Hal Rosenstock :
> Subject: RE: [BUG]OpenSM double free or corruption
>
> Hi Woody,
>
> On Wed, 2005-07-27 at 11:48, Bob Woodruff wrote:
> > I don't have the log file, but it is easy to reproduce.
>
> I'm just getting back to this now.
>
> > Load the stack on an HCA that has old firmware
>
> I don't have an HCA with old firmware (what version ?) and I thought it
> was dangerous to downgrade firmware.

It's not *that* dangerous, but you have to know what you are doing.

You can't downgrade firmware in a failsafe manner. But by failsafe we mean "safe even if it fails". There's no reason for downgrade to fail more frequently than an upgrade.

You should be fine as long as you burn with flint -nofs (nonfailsafe) and don't reboot or kill or reset the chip or the burner otherwise while in the process of burning. And there's actually a pretty good chance you'll be able to resume even if you do, so long as you don't try to load the driver.
-- MST From ogerlitz at voltaire.com Thu Aug 4 01:18:27 2005 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 4 Aug 2005 11:18:27 +0300 Subject: [openib-general] [iSER]How to use the iSER with the UNH iSCSI Message-ID: Ian, Few things: - drops (eg dapl_beta2.06) of the SF dapl do not realy work "as is" this is as of the lack of code that interacts with real IB AT (Address Translation) and CM (Communication Manager) - before gen2, each dapl provider (MLX, TS, ICON, Voltaire etc) used the SF dapl as base line and implemented the code in dapl (dapl_xx_cm.c) to call their AT and CM plus other enhancments - eg iSER was tested with the Voltaire kDAPL which was a derivative of the SF dapl - with openib gen2, the kDAPL implementation at https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/ulp/kdapl/ has the code to work with the gen2 AT and CM - we (Voltaire) are now in the process to send few patches to the gen2 kdapl that allow for iSER to work over the gen2 kdapl - you can see this over the list - to use the UNH iSCSI target with iSER, i guess you need to approach UNH Or. ----Original Message---- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Ian Jiang Sent: Thursday, August 04, 2005 10:10 AM To: openib-general at openib.org Subject: [openib-general] [iSER]How to use the iSER with the UNH iSCSI > Hi, everybody! > Thanks for all the replis to my "How to get the dat_headers_1_1.tgz"! > I downloaded the dapl_beta2.06.tgz as Itamar told me. > And I made some modification to the iSER to use it on the x86_64 > platform. > > I got through the compiling finally, but here is another question: > How to use the iSER with the UNH iSCSI? I have the UNH iSCSI running > on my system at present. Need I modify it and reinstall? > > And I'm not sure if the dapl_beta2.06 has to be installed to run the > iSER. > In fact, I did not compile or install the dapl before installing the > iSER. 
> > Any suggestion is appriciated! > > Ian Jiang > ianjiang91 at hotmail.com > ---- > Computer Architecture Laboratory > Institute of Computing Technology > Chinese Academy of Sciences > Beijing,P.R.China > Zip code: 100080 > Tel: +86-10-62564394(office) > > _________________________________________________________________ > 免费下载 MSN Explorer: http://explorer.msn.com/lccn > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From guyg at voltaire.com Thu Aug 4 01:25:31 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 4 Aug 2005 11:25:31 +0300 Subject: [openib-general][kdapl]: vmalloc instead of kmalloc Message-ID: James, I see what you mean. The allocation of the event vector is derived from evd->qlen. In DTO ev'd, however, qlen is also the parameter passed to ib_create_cq. Since we don't want to limit DAPL consumers to an unnecessary small completion queue size, maybe we could differentiate between DTO supporting evd's and CONN evd's, when allocating the events vector. if evd supports CONN only, leave it : event = kmalloc(evd->qlen * sizeof *event) (Relying on the consumer he knows what he is doing) if evd is DTO only : don't allocate an event buffer, at all if evd supports both : event = kmalloc(DEFAULT_4_CONN * sizeof *event) if DEFAULT_4_CONN=256, that's a 3 pages allocation. How does that sound to you ? Thanks, Guy James Lentini wrote: > On Wed, 3 Aug 2005, Guy German wrote: > >> James Lentini wrote: >>> >>> kDAPL creates two large pools of memory. >>> >>> One is for events. When the kDAPL consumer creates an EVD, it >>> specifies a queue size (the number of events the EVD can hold). The >>> implementation pre-allocates a pool of events equal to the >>> size of the >>> queue. These events are used when an IB upcall is made (e.g. 
>>> connection request, connection established, aysnc. error, >>> etc.) or the >>> kDAPL consumer posts a "software event" via dat_evd_post_se(). >> >> And, of course - completions of data events - which is why the queue >> need to be more substantial. > > In uDAPL yes, but not in kDAPL. None of the callers of > dapl_evd_get_event(), where events are dequeued from the EVD's > free_event_queue, use the events for DTO completions. > > In uDAPL, dat_evd_wait can dequeue data events and store them in the > pending event queue. > > As Itamar pointed out, kDAPL could use a single circular list instead > of maintaining the EVD's free and pending event queues. > >>> The other memory pool is for cookies. A kDAPL event contains certain >>> fields that the IB work completion (ib_wc) does not provide (like >>> the EVD, EP, etc.). For that reason, the kDAPL provider sticks >>> the missing >>> information in a dapl_cookie structure and sets it as the work >>> request's context value. When the work completion comes back, the >>> kDAPL provider pulls the cookie out and uses it to populate the >>> missing event fields. These cookies are also pre-allocated in a >>> pool equal to the EVD size. >>> >>>>> To answer your question, vmalloc has a performance overhead and >>>>> can and will fail when vmalloc-space is exhausted (as can >>>>> kmalloc, for different reasons). Can this allocation be cut down >>>>> so that it becomes a non-issue? >>> >>> The size of the event pool seems much larger than necessary. I would >>> expect most consumers only use a few events from this pool (with no >>> errors or software events, a client will use 2 and a server will use >>> 3). >> >> If you consider that the consumers are polling from the queue >> themselves (upcall policy is disabled, for performance) and the >> queue of events holds completions of data, then you have to support >> larger queue. Bare in mind that one target can have many initiators. 
> > Even if the event queue never stores DTO events? In the worst case, I > agree that kDAPL would need to allocate an amount of memory > equal to n > * sizeof(DAT_EVENT), where n is the EVD queue size. > > My observation is that an EVD almost always uses the event pool < 5 > times (when there are no async errors and no software > events). Further > more, it usually only uses one event at a time (it posts a connection > request event, the consumer reaps it, it posts a connection > event, the > consumer reaps it, ...). Given that, allocating a event pool equal to > the queue length seems like overkill to me. The EVD could allocate > smaller blocks of events in some configurable size. Most of > the time a > single poll (say of 25 events) would be sufficient. In the rare case > when this pool was exhausted, a second one could be > allocated. If that > one was used up, a third could be allocated... > >> Any way, ISER seems to be needing a solution for this, and I think >> it is possible to come up with a different solution than vmalloc >> (maybe a few kmallocs) I will think about it and send a patch when I >> have one. > > Ok. That would be great. From guyg at voltaire.com Thu Aug 4 04:06:29 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 4 Aug 2005 14:06:29 +0300 Subject: [openib-general][PATCH][kdapl]: DAT_MEM_TYPE_IA support and dat_ia_open fix Message-ID: <20050804110629.GA7084@voltaire.com> 1. Adding DAT_MEM_TYPE_IA support 2. 
When calling dat_ia_open - check port status and fail if port is !active

I am resending this patch, also as an attachment, in case there are still problems in the mailer (I'm using mutt now)

Signed-off-by: Guy German

Index: infiniband/ulp/kdapl/ib/dapl_openib_util.c
===================================================================
--- infiniband/ulp/kdapl/ib/dapl_openib_util.c	(revision 2980)
+++ infiniband/ulp/kdapl/ib/dapl_openib_util.c	(working copy)
@@ -133,8 +133,43 @@ int dapl_ib_mr_register(struct dapl_ia *
 	return -ENOSYS;
 }
 
+int dapl_ib_mr_register_ia(struct dapl_ia *ia, struct dapl_lmr *lmr,
+			   union dat_region_description phys_addr, u64 length,
+			   enum dat_mem_priv_flags privileges)
+{
+	int status;
+	int acl;
+	struct ib_mr *mr;
+	struct ib_phys_buf buf_list;
+	u64 iova = 0;
+	buf_list.addr = phys_addr.for_pa;
+	buf_list.size = length;
+
+	iova = buf_list.addr;
+	acl = dapl_ib_convert_mem_privileges(privileges);
+	acl |= IB_ACCESS_MW_BIND;
+	mr = ib_reg_phys_mr(((struct dapl_pz *)lmr->param.pz)->pd,
+			    &buf_list, 1, acl, &iova);
+	if (IS_ERR(mr)) {
+		status = PTR_ERR(mr);
+		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
+			     "%s: ib_reg_phys_mr error code return = %d\n",
+			     __func__, status);
+		return status;
+	}
+	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, "%s: got handle mr=%p\n",
+		     __func__, mr);
+
+	lmr->param.lmr_context = mr->lkey;
+	lmr->param.rmr_context = mr->rkey;
+	lmr->param.registered_size = length;
+	lmr->param.registered_address = phys_addr.for_pa;
+	lmr->mr = mr;
+	return 0;
+}
+
 int dapl_ib_mr_register_physical(struct dapl_ia *ia, struct dapl_lmr *lmr,
-				 void *phys_addr, u64 length,
+				 union dat_region_description phys_addr, u64 length,
 				 enum dat_mem_priv_flags privileges)
 {
 	int status;
@@ -145,7 +180,7 @@ int dapl_ib_mr_register_physical(struct
 	u64 iova = 0;
 	u64 *array;
 
-	array = (u64 *) phys_addr;
+	array = (u64 *)phys_addr.for_array; /* need to add for_u64_array to union */
 	buf_list = kmalloc(length * sizeof *buf_list, GFP_ATOMIC);
 	if (!buf_list)
 		return -ENOMEM;
@@ -164,8 +199,8 @@ int dapl_ib_mr_register_physical(struct
 	if (IS_ERR(mr)) {
 		status = PTR_ERR(mr);
 		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
-			     " ib_reg_phys_mr error code return = %d\n",
-			     status);
+			     "%s: ib_reg_phys_mr error code return = %d\n",
+			     __func__, status);
 		return status;
 	}
 #if 0
@@ -173,8 +208,8 @@ int dapl_ib_mr_register_physical(struct
 	status = ib_query_mr(mr, &attr);
 	if (status < 0) {
 		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
-			     " ib_query_mr error code return from ib_query_mr = %d\n",
-			     status);
+			     "%s: ib_query_mr error code return from ib_query_mr = %d\n",
+			     __func__, status);
 		ib_dereg_mr(mr);
 		return status;
 	}
@@ -182,10 +217,12 @@ int dapl_ib_mr_register_physical(struct
 
 	lmr->param.lmr_context = mr->lkey;
 	lmr->param.rmr_context = mr->rkey;
+	lmr->param.registered_size = length * PAGE_SIZE;
+	lmr->param.registered_address = array[0];
 	lmr->mr = mr;
 
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
-		     "dapl_ib_mr_register_physical(%p %d) got lkey 0x%x \n",
+		     "%s: (%p %d) got lkey 0x%x \n", __func__,
 		     buf_list, length, lmr->param.lmr_context);
 	return 0;
 }
Index: infiniband/ulp/kdapl/ib/dapl_openib_util.h
===================================================================
--- infiniband/ulp/kdapl/ib/dapl_openib_util.h	(revision 2980)
+++ infiniband/ulp/kdapl/ib/dapl_openib_util.h	(working copy)
@@ -79,8 +79,13 @@ int dapl_ib_mr_register(struct dapl_ia *
 			void *virt_addr, u64 length,
 			enum dat_mem_priv_flags privileges);
 
+int dapl_ib_mr_register_ia(struct dapl_ia *ia, struct dapl_lmr *lmr,
+			   union dat_region_description phys_addr, u64 length,
+			   enum dat_mem_priv_flags privileges);
+
 int dapl_ib_mr_register_physical(struct dapl_ia *ia, struct dapl_lmr *lmr,
-				 void *phys_addr, u64 length,
+				 union dat_region_description phys_addr,
+				 u64 length,
 				 enum dat_mem_priv_flags privileges);
 
 int dapl_ib_mr_deregister(struct dapl_lmr *lmr);
Index: infiniband/ulp/kdapl/ib/dapl_ia.c
===================================================================
--- infiniband/ulp/kdapl/ib/dapl_ia.c	(revision 2980)
+++
infiniband/ulp/kdapl/ib/dapl_ia.c (working copy) @@ -576,6 +576,7 @@ int dapl_ia_open(const char *name, int a struct dapl_hca *hca = NULL; struct dapl_ia *ia = NULL; struct dapl_evd *evd; + struct ib_port_attr port_attr; dapl_dbg_log(DAPL_DBG_TYPE_API, "dapl_ia_open(%s, %d, %p, %p)\n", @@ -583,7 +584,7 @@ int dapl_ia_open(const char *name, int a status = dapl_provider_list_search(name, &provider); if (0 != status) { - status = -EINVAL; + status = -ENODEV; goto bail; } @@ -591,6 +592,17 @@ int dapl_ia_open(const char *name, int a *ia_ptr = NULL; hca = (struct dapl_hca *)provider->extension; + status = ib_query_port(hca->device, hca->port_num, &port_attr); + if (status) + goto bail; + if (port_attr.state != IB_PORT_ACTIVE) { + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + "%s: Port %d is not in ACTIVE state\n", + __FUNCTION__, hca->port_num); + + status = -EBUSY; + goto bail; + } /* Allocate and initialize ia structure */ ia = dapl_ia_alloc(provider, hca); @@ -630,13 +642,13 @@ int dapl_ia_open(const char *name, int a goto bail; atomic_inc(&evd->evd_ref_count); - /* Register the handlers associated with the async EVD. */ - status = dapl_ia_setup_callbacks(ia, evd); - /* Assign the EVD so it gets cleaned up */ + /* Register the handlers associated with the async EVD. 
*/ + status = dapl_ia_setup_callbacks(ia, evd); + /* Assign the EVD so it gets cleaned up */ ia->cleanup_async_error_evd = TRUE; ia->async_error_evd = evd; - if (status != 0) - goto bail; + if (status != 0) + goto bail; } status = 0; @@ -741,7 +753,8 @@ int dapl_ia_query(struct dat_ia *ia_ptr, provider_attr->provider_version_major = DAPL_PROVIDER_MAJOR; provider_attr->provider_version_minor = DAPL_PROVIDER_MINOR; provider_attr->lmr_mem_types_supported = - DAT_MEM_TYPE_VIRTUAL | DAT_MEM_TYPE_LMR; + DAT_MEM_TYPE_PHYSICAL | DAT_MEM_TYPE_PLATFORM | + DAT_MEM_TYPE_IA; provider_attr->iov_ownership_on_return = DAT_IOV_CONSUMER; provider_attr->dat_qos_supported = DAT_QOS_BEST_EFFORT; provider_attr->completion_flags_supported = Index: infiniband/ulp/kdapl/ib/dapl_lmr.c =================================================================== --- infiniband/ulp/kdapl/ib/dapl_lmr.c (revision 2980) +++ infiniband/ulp/kdapl/ib/dapl_lmr.c (working copy) @@ -125,20 +125,21 @@ error1: } static inline int dapl_lmr_create_physical(struct dapl_ia *ia, - DAT_REGION_DESCRIPTION phys_addr, - u64 page_count, struct dapl_pz *pz, + union dat_region_description phys_addr, + u64 page_count, + enum dat_mem_type mem_type, + struct dapl_pz *pz, enum dat_mem_priv_flags privileges, struct dat_lmr **lmr, - DAT_LMR_CONTEXT *lmr_context, - DAT_RMR_CONTEXT *rmr_context, + u32 *lmr_context, + u32 *rmr_context, u64 *registered_length, u64 *registered_address) { struct dapl_lmr *new_lmr; - u64 *array = phys_addr.for_array; int status; - new_lmr = dapl_lmr_alloc(ia, DAT_MEM_TYPE_PHYSICAL, phys_addr, + new_lmr = dapl_lmr_alloc(ia, mem_type, phys_addr, page_count, (struct dat_pz *) pz, privileges); if (NULL == new_lmr) { @@ -146,8 +147,14 @@ static inline int dapl_lmr_create_physic goto error1; } - status = dapl_ib_mr_register_physical(ia, new_lmr, phys_addr.for_array, - page_count, privileges); + if (DAT_MEM_TYPE_IA == mem_type) { + status = dapl_ib_mr_register_ia(ia, new_lmr, phys_addr, + page_count, 
privileges); + } + else { + status = dapl_ib_mr_register_physical(ia, new_lmr, phys_addr, + page_count, privileges); + } if (0 != status) goto error2; @@ -157,13 +164,13 @@ static inline int dapl_lmr_create_physic if (lmr) *lmr = (struct dat_lmr *)new_lmr; if (lmr_context) - *lmr_context = (DAT_LMR_CONTEXT)new_lmr->param.lmr_context; + *lmr_context = (u32)new_lmr->param.lmr_context; if (rmr_context) - *rmr_context = (DAT_LMR_CONTEXT)new_lmr->param.rmr_context; + *rmr_context = (u32)new_lmr->param.rmr_context; if (registered_address) - *registered_address = array[0]; + *registered_address = new_lmr->param.registered_address; if (registered_length) - *registered_length = page_count * PAGE_SIZE; + *registered_length = new_lmr->param.registered_size; return 0; @@ -233,7 +240,7 @@ int dapl_lmr_kcreate(struct dat_ia *ia, struct dapl_ia *dapl_ia; struct dapl_pz *dapl_pz; int status; - + dapl_dbg_log(DAPL_DBG_TYPE_API, "dapl_lmr_kcreate(ia:%p, mem_type:%x, ...)\n", ia, mem_type); @@ -250,10 +257,12 @@ int dapl_lmr_kcreate(struct dat_ia *ia, rmr_context, registered_length, registered_address); break; + case DAT_MEM_TYPE_PLATFORM: // used as proprietary Tavor-FMR case DAT_MEM_TYPE_PHYSICAL: + case DAT_MEM_TYPE_IA: status = dapl_lmr_create_physical(dapl_ia, region_description, - length, dapl_pz, privileges, - lmr, lmr_context, + length, mem_type, dapl_pz, + privileges, lmr, lmr_context, rmr_context, registered_length, registered_address); @@ -275,8 +284,6 @@ int dapl_lmr_kcreate(struct dat_ia *ia, registered_address); break; } - case DAT_MEM_TYPE_PLATFORM: - case DAT_MEM_TYPE_IA: case DAT_MEM_TYPE_BYPASS: status = -ENOSYS; break; @@ -300,7 +307,8 @@ int dapl_lmr_free(struct dat_lmr *lmr) switch (dapl_lmr->param.mem_type) { case DAT_MEM_TYPE_PHYSICAL: - case DAT_MEM_TYPE_VIRTUAL: + case DAT_MEM_TYPE_PLATFORM: + case DAT_MEM_TYPE_IA: case DAT_MEM_TYPE_LMR: { struct dapl_pz *pz; @@ -316,8 +324,7 @@ int dapl_lmr_free(struct dat_lmr *lmr) } break; } - case DAT_MEM_TYPE_PLATFORM: - 
case DAT_MEM_TYPE_IA: + case DAT_MEM_TYPE_VIRTUAL: case DAT_MEM_TYPE_BYPASS: status = -ENOSYS; break; From mst at mellanox.co.il Thu Aug 4 05:06:01 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 4 Aug 2005 15:06:01 +0300 Subject: [openib-general] [PATCH] flush_scheduled_work on SDP module unload Message-ID: <20050804120601.GG15300@mellanox.co.il> Need to flush scheduled work on SDP module unload: make sure that a deferred iocb isnt outstanding. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_inet.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_inet.c +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_inet.c @@ -35,6 +35,7 @@ #include #include +#include #include "sdp_main.h" /* @@ -1426,6 +1427,12 @@ static void __exit sdp_exit(void) * unregister */ sock_unregister(sdp_proto.family); + + /* + Make sure there are no deferred iocbs + */ + flush_scheduled_work(); + /* * connection table */ -- MST From Arkady.Kanevsky at netapp.com Thu Aug 4 06:06:43 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 4 Aug 2005 09:06:43 -0400 Subject: [openib-general] [iSER]How to get the dat_headers_1_1.tgz Message-ID: How about http://www.datcollaborative.org/dat_headers_1_1.tgz . Tom, does this serves the purpose? Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Tom Duffy [mailto:Thomas.Duffy.99 at alumni.brown.edu] > Sent: Wednesday, August 03, 2005 4:47 PM > To: Kanevsky, Arkady > Cc: Ian Jiang; dat-discussions at yahoogroups.com; > openib-general at openib.org > Subject: RE: [openib-general] [iSER]How to get the dat_headers_1_1.tgz > > > > > application/x-compressed attachment (dat_headers_1_1.tgz), > > "dat_headers_1_1.tgz" > > I think you missed my point. 
> > -tduffy > From mlleinin at hpcn.ca.sandia.gov Thu Aug 4 06:25:40 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 04 Aug 2005 06:25:40 -0700 Subject: [openib-general] RE: OpenSM Work In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C305C4@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607C305C4@mtlex01.yok.mtl.com> Message-ID: <1123161940.11152.69.camel@localhost> On Thu, 2005-08-04 at 09:09 +0300, Eitan Zahavi wrote: > > > > > > The mode of work we suggest is that she will work offline. > > > > Not sure by what you mean by offline here. > [EZ] Offline means she will do the entire merge and then commit. I > propose she will commit the changes into a branch and then you can > review it and do the merge to the main trunk yourself. Eitan, I'm glad to see the continued interest in OpenSM. Thanks for the help. Please submit OpenSM changes as patches to the mail list so that Hal and others in the community can review them. No sense in doing a bunch of work until we know what things you are trying to add. Start with header files and work out from there. My understanding of your "offline" is that it is closed development followed by open release. That's not how OpenIB works. Please submit the patches to the list so everyone can follow and commit on the OpenSM code changes. Thanks, - Matt From halr at voltaire.com Thu Aug 4 06:26:29 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Aug 2005 09:26:29 -0400 Subject: [openib-general] RE: OpenSM Work In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C305C4@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607C305C4@mtlex01.yok.mtl.com> Message-ID: <1123161989.4422.2739.camel@hal.voltaire.com> Hi Eitan, On Thu, 2005-08-04 at 02:09, Eitan Zahavi wrote: > Hi Hal, > > Please see my responses below. 
> > > > > > > As Mellanox moves to work on OpenIB Gen2 stack, we have assigned > Yael > > > to work on merging OpenSM 1.8.0 (which released based on gen1) > into > > > the gen2 stack. > > > > I too am working on small pieces of this. > [EZ] I think it does not make much sense to work in parallel as the > changes spans many files. What are the changes you work on? > > > She has started to work on the merge Is there an estimated time frame for this work ? I will respond to the other questions and issues later. -- Hal From eitan at mellanox.co.il Thu Aug 4 06:36:21 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 4 Aug 2005 16:36:21 +0300 Subject: [openib-general] RE: OpenSM Work Message-ID: <506C3D7B14CDD411A52C00025558DED607C305D5@mtlex01.yok.mtl.com> I think she covered 60% of the work in 3 working days. I estimate that testing will be finished by next Wed. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, August 04, 2005 4:26 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: RE: OpenSM Work > > Hi Eitan, > > On Thu, 2005-08-04 at 02:09, Eitan Zahavi wrote: > > Hi Hal, > > > > Please see my responses below. > > > > > > > > > > As Mellanox moves to work on OpenIB Gen2 stack, we have assigned > > Yael > > > > to work on merging OpenSM 1.8.0 (which released based on gen1) > > into > > > > the gen2 stack. > > > > > > I too am working on small pieces of this. > > [EZ] I think it does not make much sense to work in parallel as the > > changes spans many files. What are the changes you work on? > > > > > She has started to work on the merge > > Is there an estimated time frame for this work ? > > I will respond to the other questions and issues later. > > -- Hal -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From swise at ammasso.com Thu Aug 4 06:43:39 2005
From: swise at ammasso.com (Steve Wise)
Date: Thu, 4 Aug 2005 08:43:39 -0500
Subject: [openib-general] 32b openib applications on 64b kernel
Message-ID:

Does the openib code support a 32b app using user-mode IB verbs on a 64b distro/kernel? I.e., the app was compiled on a 32b distro/kernel, then run on the 64b distro/kernel.

Steve.

From swise at ammasso.com Thu Aug 4 06:52:53 2005
From: swise at ammasso.com (Steve Wise)
Date: Thu, 4 Aug 2005 08:52:53 -0500
Subject: [openib-general] OpenMPI question
Message-ID:

Will the various MPI implementations be ported to the openIB user verbs API? Or uDAPL? From reading the FAQ it appears that MVAPICH2 will remain on udapl. I'm curious about Open-MPI and LA-MPI?

Steve.

From eitan at mellanox.co.il Thu Aug 4 07:02:37 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 4 Aug 2005 17:02:37 +0300
Subject: [openib-general] RE: OpenSM Work
Message-ID: <506C3D7B14CDD411A52C00025558DED607C305D7@mtlex01.yok.mtl.com>

Hi Matt,

The offline work is only expected for the task of merging the current 1.8.0 release into OpenIB. It should be completed by the end of next week. It would involve the following parts:

1. Merging changes done by Hal/Shahar on the 1.6.1 and 1.7.0 into the 1.8.0 code.
2. Moving headers back into place and adding Makefile.am "install" directives to move the headers during install to the prefix/include dir
3. Adding osm_vendor_sa_api.h and an accompanying implementation for that API.
4. Adding Simulator vendor

After that point we wish to continue OpenSM development on the OpenIB repository. To be able to do that we need to open a branch for each new feature we develop, qualify it, and then we can provide diffs to the main thread so everybody can review it before the merge.

Actually, with the amount of work we have already put into OpenSM, once we move into the gen2 environment we should become the active maintainers of this code.
The development work plan for OpenSM was posted more then once to the OpenIB mailing list even before we started merging with gen2 (attached to this mail). We will appreciate feedback for our plans and will adjust accordingly. So far I did not get any response for my previous mails describing our plans. I hope this will change. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Matt Leininger [mailto:mlleinin at hpcn.ca.sandia.gov] > Sent: Thursday, August 04, 2005 4:26 PM > To: Eitan Zahavi > Cc: openib-general at openib.org; 'Hal Rosenstock' > Subject: Re: [openib-general] RE: OpenSM Work > > On Thu, 2005-08-04 at 09:09 +0300, Eitan Zahavi wrote: > > > > > > > > > The mode of work we suggest is that she will work offline. > > > > > > Not sure by what you mean by offline here. > > [EZ] Offline means she will do the entire merge and then commit. I > > propose she will commit the changes into a branch and then you can > > review it and do the merge to the main trunk yourself. > > Eitan, > > I'm glad to see the continued interest in OpenSM. Thanks for the > help. > > Please submit OpenSM changes as patches to the mail list so that Hal > and others in the community can review them. No sense in doing a bunch > of work until we know what things you are trying to add. Start with > header files and work out from there. > > My understanding of your "offline" is that it is closed development > followed by open release. That's not how OpenIB works. Please submit > the patches to the list so everyone can follow and commit on the OpenSM > code changes. > > Thanks, > > - Matt -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded message was scrubbed... 
From: Eitan Zahavi Subject: [openib-general] OpenSM work Date: Fri, 8 Apr 2005 19:40:09 +0300 Size: 28607 URL: -------------- next part -------------- An embedded message was scrubbed... From: Eitan Zahavi Subject: [openib-general] OpenSM Work Q2 2005 Date: Wed, 4 May 2005 20:59:23 +0300 Size: 6707 URL: From halr at voltaire.com Thu Aug 4 06:59:49 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Aug 2005 09:59:49 -0400 Subject: [openib-general] OpenMPI question In-Reply-To: References: Message-ID: <1123163988.4422.2768.camel@hal.voltaire.com> On Thu, 2005-08-04 at 09:52, Steve Wise wrote: > Will the various MPI implementations be ported to the openIB user verbs API? > Or uDAPL? From reading the FAQ it appears that MVAPICH2 will remain on > udapl. OSU MVAPICH2 is being ported now to OpenIB uverbs. It's Intel's MPI which is over uDAPL. > I'm curious about Open-MPI and LA-MPI? OpenMPI is working with OpenIB uverbs now (in point to point mode) but has not yet been released. Don't know about LA-MPI. -- Hal > > Steve. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From panda at cse.ohio-state.edu Thu Aug 4 07:08:56 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Thu, 4 Aug 2005 10:08:56 -0400 (EDT) Subject: [openib-general] OpenMPI question In-Reply-To: from "Steve Wise" at Aug 04, 2005 08:52:53 AM Message-ID: <200508041408.j74E8uUP010003@xi.cse.ohio-state.edu> Hi Steve, > Will the various MPI implementations be ported to the openIB user verbs API? > Or uDAPL? From reading the FAQ it appears that MVAPICH2 will remain on > udapl. I'm curious about Open-MPI and LA-MPI? The current MVAPICH2 design supports uDAPL as well as VAPI. MVAPICH currently supports VAPI. However, a uDAPL interface is being worked out for this as well and will be released soon. 
Regarding OpenIB/Gen2, we have the following four designs being worked out; they will be released to the community in a staged manner.

- MVAPICH
  - support for native OpenIB Gen2 verbs API
  - support for uDAPL
- MVAPICH2 (a new design is being worked out which will provide the same performance and features as that of MVAPICH + support for one-sided communication)
  - support for native OpenIB Gen2 verbs API
  - support for uDAPL

During the upcoming OpenIB workshop, I will talk about these designs in more detail. Hope this helps.

Thanks,

DK

>
> Steve.
>

From ogerlitz at voltaire.com Thu Aug 4 07:13:46 2005
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 4 Aug 2005 17:13:46 +0300
Subject: [openib-general] [PATCH 1/2] kDAPL: remove inline functions
Message-ID:

Tom,

Wouldn't it be simpler to come up with a patch that applies only to the "dat" code, i.e. (gen2/trunk/src/linux-kernel/infiniband/ulp/kdapl/kdapl.h and/or api.c), and which does not require --any-- code change on the consumer side?

That is, with the patch you sent, a call to dat_pz_create in consumer code was changed from

- ret = dat_pz_create (ia, &pz);

to

+ ret = ia->common.provider->pz_create (ia, &pz);

so can't it be done with dat_pz_create being a function (or define) calling ia->common.provider->pz_create(ia, &pz);

similar to what is done by ib_core / libibverbs calling ib_mthca / libmthca?

This makes the consumer code simpler, and no changes are needed in current consumers.

Or.
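
[Editor's illustration, not part of the original thread] The wrapper approach suggested in the message above can be sketched in plain user-space C roughly as follows. This is only a minimal sketch under stated assumptions: the struct layouts and every demo_* name are hypothetical stand-ins, not the real kDAPL definitions. Only the consumer call dat_pz_create(ia, &pz) and the dispatch expression ia->common.provider->pz_create(ia, &pz) are taken from the mail.

```c
#include <assert.h>
#include <stddef.h>

struct dat_ia;                          /* forward declaration */
struct dat_pz { int id; };              /* hypothetical placeholder layout */

/* Hypothetical provider function table with a single entry. */
struct dat_provider {
	int (*pz_create)(struct dat_ia *ia, struct dat_pz **pz);
};

/* Hypothetical layout, chosen so that ia->common.provider resolves. */
struct dat_common {
	struct dat_provider *provider;
};

struct dat_ia {
	struct dat_common common;
};

/*
 * Thin wrapper kept on the DAT side: consumers keep calling
 * dat_pz_create() and never touch the function table directly.
 * A #define would work the same way.
 */
static inline int dat_pz_create(struct dat_ia *ia, struct dat_pz **pz)
{
	return ia->common.provider->pz_create(ia, pz);
}

/* Demo provider implementation (assumption, for illustration only). */
static struct dat_pz demo_pz = { 42 };

static int demo_pz_create(struct dat_ia *ia, struct dat_pz **pz)
{
	(void)ia;                       /* unused in this sketch */
	*pz = &demo_pz;
	return 0;
}

static struct dat_provider demo_provider = { demo_pz_create };

/* Consumer code stays unchanged: it only sees dat_pz_create(). */
int demo_consumer(void)
{
	struct dat_ia ia = { { &demo_provider } };
	struct dat_pz *pz = NULL;
	int ret = dat_pz_create(&ia, &pz);

	/* On success the provider must have filled in the handle. */
	assert(ret != 0 || pz == &demo_pz);
	return ret;
}
```

The trade-off being discussed is where the indirection lives: with a wrapper like this, existing consumers compile unchanged, whereas calling through the table directly removes one layer at the cost of touching every call site.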
From halr at voltaire.com Thu Aug 4 07:10:14 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Aug 2005 10:10:14 -0400 Subject: [openib-general] RE: OpenSM Work In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C305D7@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607C305D7@mtlex01.yok.mtl.com> Message-ID: <1123164613.4422.2779.camel@hal.voltaire.com> On Thu, 2005-08-04 at 10:02, Eitan Zahavi wrote: > 3. Adding osm_vendor_sa_api.h and accompany implementation for that > API. This is already there as is the vendor support for this (osm_vendor_ibumad_sa.c) and this is in the osmvendor library. Other comments on this later. -- Hal From jlentini at netapp.com Thu Aug 4 07:36:44 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 4 Aug 2005 10:36:44 -0400 (EDT) Subject: [openib-general] [PATCH 1/2] kDAPL: remove inline functions In-Reply-To: <1123096072.13498.25.camel@duffman> References: <1123020047.5203.4.camel@duffman> <1123096072.13498.25.camel@duffman> Message-ID: On Wed, 3 Aug 2005, Tom Duffy wrote: > On Tue, 2005-08-02 at 15:00 -0700, Tom Duffy wrote: >> This patch removes all the inline functions, instead just call the >> functions from the function table. It also removes the _func from the >> names as this is obvious by its type. > > James, > > Any chance of merging these? I was removing dapl_hca_util.[ch]. I'll go through them now. james From halr at voltaire.com Thu Aug 4 07:45:53 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Aug 2005 10:45:53 -0400 Subject: [openib-general] sdp: cant unload ib_ipoib module In-Reply-To: <1123007814.2946.12.camel@duffman> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B95@taurus.voltaire.com> <1123007814.2946.12.camel@duffman> Message-ID: <1123166751.4422.2823.camel@hal.voltaire.com> Hi Tom, On Tue, 2005-08-02 at 14:36, Tom Duffy wrote: > On Tue, 2005-08-02 at 16:08 +0300, Hal Rosenstock wrote: > > This was reported back a while ago. 
The simplest scenario I have found to reproduce this is as follows: > > > > After using SDP, and unload SDP and then unload IPoIB and > > got the following: > > > > unregister_netdevice: waiting for ib0 to become free. Usage count = 1 > > > > The simplest way I found to recreate this is: > > 1. Bring up IPoIB and then SDP > > 2. Run tcp.aio.x -t > > (no server/receiver) > > 3. Wait for connection refused > > 4. Unload SDP and then IPoIB > > [root at flopteron2 ~]# modprobe ib_ipoib > ip_tables: (C) 2000-2002 Netfilter core team > [root at flopteron2 ~]# ifconfig ib0 192.168.0.26 up > [root at flopteron2 ~]# ping 192.168.0.0 -b > WARNING: pinging broadcast address > PING 192.168.0.0 (192.168.0.0) 56(84) bytes of data. > 64 bytes from 192.168.0.26: icmp_seq=0 ttl=64 time=0.057 ms > 64 bytes from 192.168.0.233: icmp_seq=0 ttl=64 time=0.159 ms (DUP!) > <-- snip --> > [root at flopteron2 rc]# modprobe ib_sdp > [root at flopteron2 ~]# ./ttcp -t -l 65536 -n 100000 -a 20 localhost -p > 5002 > ttcp-t: buflen = 65536 nbuf = 100000 align = 16384/0 port = 5002 > localhost > ttcp-t: socket > ttcp-t: connect: Connection refused > errno=111 > [root at flopteron2 ~]# rmmod ib_sdp > [root at flopteron2 ~]# rmmod ib_ipoib > [root at flopteron2 ~]# > > I can even shoot stuff over the wire and not have unload issues. > > What is the problem? Not sure why it doesn't occur for you. > Perhaps you need my sdp_inet_port_put() patch? Yes, this was from before that patch but I thought that patch was to resolve an oops with port reuse. I don't see the relation between the two. Is this not the case ? 
-- Hal From Thomas.Duffy.99 at alumni.brown.edu Thu Aug 4 08:03:13 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Thu, 04 Aug 2005 08:03:13 -0700 Subject: [openib-general] [iSER]How to get the dat_headers_1_1.tgz In-Reply-To: References: Message-ID: <1123167793.22293.4.camel@duffman> On Thu, 2005-08-04 at 09:06 -0400, Kanevsky, Arkady wrote: > How about http://www.datcollaborative.org/dat_headers_1_1.tgz . > Tom, > does this serve the purpose? Great! Thank you very much. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From guyg at voltaire.com Thu Aug 4 08:03:00 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 4 Aug 2005 18:03:00 +0300 Subject: [openib-general][PATCH][kdapltest]: DAT_MEM_TYPE_IA support and ref count Message-ID: <20050804150300.GA7359@voltaire.com> James, Here are the kdapltest patches, resent. Changes include: 1. allow DAT_MEM_TYPE_IA support on the server side 2. move the kdapltest module ref count from the 2.4 API to the 2.6 API Signed-off-by: Guy German Index: dapltest/test/dapl_server.c =================================================================== --- dapltest/test/dapl_server.c (revision 2981) +++ dapltest/test/dapl_server.c (working copy) @@ -232,7 +232,12 @@ DT_cs_Server (Params_t * params_ptr) * Create two buffers, large enough to hold ClientInfo and the largest * command we'll use.
*/ - ps_ptr->bpool = DT_BpoolAlloc (NULL, + if (!(pt_ptr = DT_Alloc_Per_Test_Data (phead))) + goto server_exit; + DT_MemListInit (pt_ptr); + memcpy ((void *)(uintptr_t) &pt_ptr->Params, + (const void *) params_ptr, sizeof (Params_t)); + ps_ptr->bpool = DT_BpoolAlloc (pt_ptr, phead, ps_ptr->ia, ps_ptr->pz, Index: dapltest/cmd/dapl_transaction_cmd.c =================================================================== --- dapltest/cmd/dapl_transaction_cmd.c (revision 2981) +++ dapltest/cmd/dapl_transaction_cmd.c (working copy) @@ -243,8 +243,7 @@ DT_Transaction_Cmd_Usage (void) DT_Mdep_printf ("USAGE: (EC == QOS_ECONOMY)\n"); DT_Mdep_printf ("USAGE: (PM == QOS_PREMIUM)\n"); DT_Mdep_printf ("USAGE: [-M ]\n"); - DT_Mdep_printf ("USAGE: (VIR == DAT_MEM_TYPE_VIRTUAL - Default)\n"); - DT_Mdep_printf ("USAGE: (PHY == DAT_MEM_TYPE_PHYSICAL)\n"); + DT_Mdep_printf ("USAGE: (PHY == DAT_MEM_TYPE_PHYSICAL - Default)\n"); DT_Mdep_printf ("USAGE: (IA == DAT_MEM_TYPE_IA)\n"); DT_Mdep_printf ("USAGE: (FMR == DAT_MEM_TYPE_PLATFORM)\n"); DT_Mdep_printf ("USAGE: (BP == DAT_MEM_TYPE_BYPASS)\n"); Index: dapltest/kdapl/kdapl_module.c =================================================================== --- dapltest/kdapl/kdapl_module.c (revision 2981) +++ dapltest/kdapl/kdapl_module.c (working copy) @@ -53,17 +53,13 @@ int g_status; static int kdapltest_open(struct inode *inode, struct file *file) { -#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,0) - MOD_INC_USE_COUNT; -#endif + try_module_get(THIS_MODULE); return 0; } static int kdapltest_release(struct inode *inode, struct file *file) { -#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,0) - MOD_DEC_USE_COUNT; -#endif + module_put(THIS_MODULE); return 0; }
From tziporet at mellanox.co.il Thu Aug 4 08:08:31 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 4 Aug 2005 18:08:31 +0300 Subject: [openib-general] [PATCH] mthca: update FW versions
Message-ID: <506C3D7B14CDD411A52C00025558DED6085BD020@mtlex01.yok.mtl.com> This patch (attached) update FW versions to check according to latest Mellanox FW release in July. See http://www.mellanox.com/products/firmware.html Thanks -- Tziporet Koren Mellanox -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mthca.patch Type: application/octet-stream Size: 1006 bytes Desc: not available URL: From Thomas.Duffy.99 at alumni.brown.edu Thu Aug 4 08:14:00 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Thu, 04 Aug 2005 08:14:00 -0700 Subject: [openib-general] sdp: cant unload ib_ipoib module In-Reply-To: <1123166751.4422.2823.camel@hal.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B95@taurus.voltaire.com> <1123007814.2946.12.camel@duffman> <1123166751.4422.2823.camel@hal.voltaire.com> Message-ID: <1123168440.22293.10.camel@duffman> On Thu, 2005-08-04 at 10:45 -0400, Hal Rosenstock wrote: > > Perhaps you need my sdp_inet_port_put() patch? > > Yes, this was from before that patch but I thought that patch was to > resolve an oops with port reuse. I don't see the relation between the > two. Is this not the case ? It was just a wild ass guess, since my SDP tree is a bit different from trunk. But the port_put() code is called on shutdown of the socket. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From Thomas.Duffy.99 at alumni.brown.edu Thu Aug 4 08:23:06 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Thu, 04 Aug 2005 08:23:06 -0700 Subject: [openib-general] [PATCH 1/2] kDAPL: remove inline functions In-Reply-To: References: Message-ID: <1123168986.22293.21.camel@duffman> On Thu, 2005-08-04 at 17:13 +0300, Or Gerlitz wrote: > Tom, > > Wouldn't it be simple to come up with a patch that applies only to the > "dat" code, i.e. > (gen2/trunk/src/linux-kernel/infiniband/ulp/kdapl/kdapl.h and/or api.c), > which does > not require --any-- code change on the consumer side? > > That is, with the patch you sent, a call to dat_pz_create in consumer > code was changed > > from > - ret = dat_pz_create (ia, &pz); > to > + ret = ia->common.provider->pz_create (ia, &pz); > > so can't it be done with dat_pz_create being a function (or define) > calling > > ia->common.provider->pz_create(ia, &pz); > > similar to what is done by ib_core / libibverbs calling ib_mthca / > libmthca? > > This makes the consumer code simpler, and no changes are needed in current > consumers. This is almost exactly what is there now. Not really a change. I don't like the multiple indirections to get to the function table, but this can be simplified later. Using the function table to call the function is a well established practice within the Linux kernel -- look at struct file_operations for a good example usage. -tduffy -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From rolandd at cisco.com Thu Aug 4 08:52:01 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 04 Aug 2005 08:52:01 -0700 Subject: [openib-general] 32b openib applications on 64b kernel In-Reply-To: (Steve Wise's message of "Thu, 4 Aug 2005 08:43:39 -0500") References: Message-ID: <524qa5bsym.fsf@cisco.com> Steve> Does the openib code support a 32b app using user-mode IB Steve> verbs on a 64b distro/kernel? IE: The app was compiled on Steve> a 32b distro/kernel, then run on the 64b distro/kernel. Yes. - R. From rolandd at cisco.com Thu Aug 4 09:04:28 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 04 Aug 2005 09:04:28 -0700 Subject: [openib-general] [git pull] REALLY final InfiniBand updates for 2.6.13 Message-ID: <52vf2ladtf.fsf@cisco.com> There are two small last-minute fixes for InfiniBand I would like to get into 2.6.13: one to avoid pain in releasing 2.6.13 with an incorrect enum definition that we have to rename later, and one to fix RARP on IP-over-InfiniBand. If there's still time before 2.6.13, please pull from rsync://rsync.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus The patches update the following files: include/ib_cm.h | 3 ++- ulp/ipoib/ipoib_main.c | 5 +++-- 2 files changed, 5 insertions(+), 3 deletions(-) through the following changes: commit 0dca0f7bf82face7b700890318d5550fd542cabf Author: Hal Rosenstock Date: Thu Jul 28 13:17:26 2005 -0700 [PATCH] [IPoIB] Handle sending of unicast RARP responses RARP replies are another valid case where IPoIB may need to send a unicast packet with no neighbour structure. 
Signed-off-by: Hal Rosenstock Signed-off-by: Roland Dreier commit 4e38d36d88ead4e56f3155573976da84d5df18b3 Author: Roland Dreier Date: Thu Jul 28 13:16:30 2005 -0700 [PATCH] [IB/cm]: Correct CM port redirect reject codes Reject code 24 is port and CM redirection, not just port redirection. Port redirection alone is code 25. Therefore we should rename code 24 to IB_CM_REJ_PORT_CM_REDIRECT and use IB_CM_REJ_PORT_REDIRECT for code 25. Signed-off-by: Roland Dreier Thanks, Roland From rolandd at cisco.com Thu Aug 4 09:11:15 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 04 Aug 2005 09:11:15 -0700 Subject: [openib-general] Re: [PATCH] mthca: update FW versions In-Reply-To: <506C3D7B14CDD411A52C00025558DED6085BD020@mtlex01.yok.mtl.com> (Tziporet Koren's message of "Thu, 4 Aug 2005 18:08:31 +0300") References: <506C3D7B14CDD411A52C00025558DED6085BD020@mtlex01.yok.mtl.com> Message-ID: <52r7d9adi4.fsf@cisco.com> Tziporet> This patch (attached) update FW versions to check Tziporet> according to latest Mellanox FW release in July. See Tziporet> http://www.mellanox.com/products/firmware.html Thanks, I'll apply this after I get my SRQ work checked in, and I'll make sure this goes upstream as soon as 2.6.14 opens. - R. From yhlu.kernel at gmail.com Thu Aug 4 09:33:38 2005 From: yhlu.kernel at gmail.com (yhlu) Date: Thu, 4 Aug 2005 09:33:38 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS (was: [PATCH 1/2] [IB/cm]: Correct CM port redirect reject codes) In-Reply-To: <52u0i6b9an.fsf_-_@cisco.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <20057281331.7vqhiAJ1Yc0um2je@cisco.com> <86802c44050803175873fb0569@mail.gmail.com> <52u0i6b9an.fsf_-_@cisco.com> Message-ID: <86802c44050804093374aca360@mail.gmail.com> YES. I will send you the output message later about "CONFIG_INFINIBAND_MTHCA_DEBUG=y" YH On 8/3/05, Roland Dreier wrote: > yhlu> In LinuxBIOS, If I enable the prefmem64 to use real 64 > yhlu> range. the IB driver in Kernel can not be loaded. 
> > What does it mean to "enable the prefmem64 to use real 64 range"? > > Does the driver work if you don't do this? > > yhlu> ib_mthca 0000:04:00.0: Failed to initialize queue pair table, aborting. > > Can you add printk()s to mthca_qp.c::mthca_init_qp_table() to find out > how far the function gets before it fails? > > It would also be useful for you to build with CONFIG_INFINIBAND_MTHCA_DEBUG=y > and send the kernel output you get with that. > > - Roland > From rolandd at cisco.com Thu Aug 4 09:36:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 04 Aug 2005 09:36:42 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <86802c44050804093374aca360@mail.gmail.com> (yhlu's message of "Thu, 4 Aug 2005 09:33:38 -0700") References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <20057281331.7vqhiAJ1Yc0um2je@cisco.com> <86802c44050803175873fb0569@mail.gmail.com> <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> Message-ID: <52mznxacbp.fsf@cisco.com> >>>>> "yhlu" == yhlu writes: yhlu> YES. I will send you the output message later about yhlu> "CONFIG_INFINIBAND_MTHCA_DEBUG=y" Thanks. In the meantime, can you explain what it means to "enable the prefmem64 to use real 64 range"? What is the difference between this and the configuration that works? - R. From yhlu.kernel at gmail.com Thu Aug 4 10:23:15 2005 From: yhlu.kernel at gmail.com (yhlu) Date: Thu, 4 Aug 2005 10:23:15 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <52mznxacbp.fsf@cisco.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <20057281331.7vqhiAJ1Yc0um2je@cisco.com> <86802c44050803175873fb0569@mail.gmail.com> <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> <52mznxacbp.fsf@cisco.com> Message-ID: <86802c4405080410236ba59619@mail.gmail.com> The mellanox can use prefmem64, but the BIOS could only allocate the some range below 4G, So 32 bit OS still can use the IB cards. 
but for 64bit OS, We could allocate range above 4G (0xfc00000000), So the mmio below 4G can be smaller. ( for example from 512M to 128M, the user can get back some RAM back if Opteron don't have hardware memhole support). YH On 8/4/05, Roland Dreier wrote: > >>>>> "yhlu" == yhlu writes: > > yhlu> YES. I will send you the output message later about > yhlu> "CONFIG_INFINIBAND_MTHCA_DEBUG=y" > > Thanks. In the meantime, can you explain what it means to "enable the > prefmem64 to use real 64 range"? What is the difference between this > and the configuration that works? > > - R. > From rolandd at cisco.com Thu Aug 4 10:32:14 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 04 Aug 2005 10:32:14 -0700 Subject: [openib-general] [RFC] Move InfiniBand .h files Message-ID: <52iryla9r5.fsf@cisco.com> I would like to get people's reactions to moving the InfiniBand .h files from their current location in drivers/infiniband/include/ to include/linux/rdma/. If we agree that this is a good idea then I'll push this change as soon as 2.6.14 starts. The advantages of doing this are: - The headers become more easily accessible to other parts of the tree that might want to use IB support. For example, an NFS/RDMA client probably wants to live under fs/ - It makes it easier to build IB modules outside the tree, since include/linux gets put in /lib/modules//build. I realize that we don't really care about out-of-tree modules, but it is convenient to be able to develop and distribute new drivers that build against someone's existing kernels. - We can kill off the ugly EXTRA_CFLAGS += -Idrivers/infiniband/include lines in our Makefiles. The disadvantages are: - It's churn with little technical merit. - It makes it a little harder to pull the OpenIB svn tree into a kernel tree, since one would have to link both drivers/infiniband and include/linux/rdma instead of just drivers/infiniband. This problem goes away if/when OpenIB shifts over to a new source code control system. 
Thanks, Roland From ogerlitz at voltaire.com Thu Aug 4 10:30:26 2005 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 4 Aug 2005 20:30:26 +0300 Subject: [openib-general] [PATCH 1/2] kDAPL: remove inline functions Message-ID: >This is almost exactly what is there now. Not really a change. I don't like > the multiple indirections to get to the function table, but this can be simplified later. Indeed, it can be simplified a little, but why not have this simplification at the kdapl registry level ("dat"), as it is in ib_core calling ib_verbs and libibverbs calling libmthca, while the verbs consumer does not have to go into those indirections at all but rather just calls a well understood/defined API? Are you suggesting applying such a change to the verbs as well? Or. From limichal at cisco.com Thu Aug 4 10:42:50 2005 From: limichal at cisco.com (Libor Michalek) Date: Thu, 4 Aug 2005 10:42:50 -0700 Subject: [openib-general] Re: sdp: cant unload ib_ipoib module In-Reply-To: <20050803070923.GG15300@mellanox.co.il>; from mst@mellanox.co.il on Wed, Aug 03, 2005 at 10:09:24AM +0300 References: <1123007814.2946.12.camel@duffman> <20050803070923.GG15300@mellanox.co.il> Message-ID: <20050804104250.A30741@topspin.com> On Wed, Aug 03, 2005 at 10:09:24AM +0300, Michael S. Tsirkin wrote: > Quoting r. Tom Duffy : > > Perhaps you need my sdp_inet_port_put() patch? > > Could be. > I'll give it a spin next week. Thanks! Michael, I remember this problem from the last time Hal mentioned it; I had just forgotten about it. I'm almost certain this is being caused by a reference counter incremented in the sdp address resolution code that is never decremented. In sdp_link.c:do_link_path_lookup() we get the route table entry using ip_route_output_key(), which I believe takes out a reference, and that reference should be returned using ip_rt_put(), which we never do.
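The imbalance described above can be sketched in plain userspace C. This is only an illustration of the get/put pairing, assuming the lookup does take a reference as suspected; struct route, route_lookup(), route_put() and the two path_lookup_*() helpers are hypothetical stand-ins and are not the SDP or kernel routing code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted route entry;
 * this is NOT the kernel's struct rtable. */
struct route {
    int refcnt;
};

/* Plays the role of ip_route_output_key() in this sketch: on success
 * the caller holds one reference to the returned entry. */
static int route_lookup(struct route *entry, struct route **out)
{
    if (entry == NULL)
        return -1;
    entry->refcnt++;
    *out = entry;
    return 0;
}

/* Plays the role of ip_rt_put() in this sketch: hand the reference back. */
static void route_put(struct route *rt)
{
    assert(rt->refcnt > 0);
    rt->refcnt--;
}

/* Leaky variant: looks the route up, uses it, and returns without
 * dropping the reference, so the count never goes back down. */
static int path_lookup_leaky(struct route *entry)
{
    struct route *rt;

    if (route_lookup(entry, &rt))
        return -1;
    (void)rt;   /* ... resolve the path using rt ... */
    return 0;   /* reference is never returned */
}

/* Balanced variant: every successful lookup is paired with a put. */
static int path_lookup_balanced(struct route *entry)
{
    struct route *rt;

    if (route_lookup(entry, &rt))
        return -1;
    (void)rt;   /* ... resolve the path using rt ... */
    route_put(rt);
    return 0;
}
```

Starting from a zeroed struct route, one call to path_lookup_leaky() leaves refcnt at 1, while path_lookup_balanced() leaves it at 0. A leftover count of that kind is the sort of thing that would keep a device from being released on unload, if the diagnosis above is right.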
-Libor From Thomas.Duffy.99 at alumni.brown.edu Thu Aug 4 10:48:59 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Thu, 04 Aug 2005 10:48:59 -0700 Subject: [openib-general] [RFC] Move InfiniBand .h files In-Reply-To: <52iryla9r5.fsf@cisco.com> References: <52iryla9r5.fsf@cisco.com> Message-ID: <1123177739.29639.1.camel@duffman> On Thu, 2005-08-04 at 10:32 -0700, Roland Dreier wrote: > I would like to get people's reactions to moving the InfiniBand .h > files from their current location in drivers/infiniband/include/ to > include/linux/rdma/. If we agree that this is a good idea then I'll > push this change as soon as 2.6.14 starts. I think it is a great idea. > The advantages of doing this are: > > - The headers become more easily accessible to other parts of the > tree that might want to use IB support. For example, an NFS/RDMA > client probably wants to live under fs/ > - It makes it easier to build IB modules outside the tree, since > include/linux gets put in /lib/modules//build. I realize > that we don't really care about out-of-tree modules, but it is > convenient to be able to develop and distribute new drivers that > build against someone's existing kernels. > - We can kill off the ugly > > EXTRA_CFLAGS += -Idrivers/infiniband/include > > lines in our Makefiles. One more advantage: - It shows our willingness to push past just infiniband. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From arjan at infradead.org Thu Aug 4 10:53:58 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Thu, 04 Aug 2005 19:53:58 +0200 Subject: [openib-general] Re: [RFC] Move InfiniBand .h files In-Reply-To: <52iryla9r5.fsf@cisco.com> References: <52iryla9r5.fsf@cisco.com> Message-ID: <1123178038.3318.40.camel@laptopd505.fenrus.org> On Thu, 2005-08-04 at 10:32 -0700, Roland Dreier wrote: > I would like to get people's reactions to moving the InfiniBand .h > files from their current location in drivers/infiniband/include/ to > include/linux/rdma/. If we agree that this is a good idea then I'll > push this change as soon as 2.6.14 starts. please only put userspace clean headers here; the rest is more or less private headers for your subsystem. At minimum the headers should be split in separate files for shared-userspace and kernel (eg no overlap at all), but I'd vote for keeping the headers in your own dir. From yhlu.kernel at gmail.com Thu Aug 4 11:01:55 2005 From: yhlu.kernel at gmail.com (yhlu) Date: Thu, 4 Aug 2005 11:01:55 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <86802c4405080410236ba59619@mail.gmail.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <20057281331.7vqhiAJ1Yc0um2je@cisco.com> <86802c44050803175873fb0569@mail.gmail.com> <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> Message-ID: <86802c4405080411013b60382c@mail.gmail.com> I enabled CONFIG_INFINIBAND_MTHCA_DEBUG=y but didn't get any more debug info. Does that depend on another setting? YH On 8/4/05, yhlu wrote: > The mellanox can use prefmem64, but the BIOS could only allocate the > some range below 4G, So 32 bit OS still can use the IB cards. > but for 64bit OS, We could allocate range above 4G (0xfc00000000), So > the mmio below 4G can be smaller.
( for example from 512M to 128M, the > user can get back some RAM back if Opteron don't have hardware memhole > support). > > YH > > > > On 8/4/05, Roland Dreier wrote: > > >>>>> "yhlu" == yhlu writes: > > > > yhlu> YES. I will send you the output message later about > > yhlu> "CONFIG_INFINIBAND_MTHCA_DEBUG=y" > > > > Thanks. In the meantime, can you explain what it means to "enable the > > prefmem64 to use real 64 range"? What is the difference between this > > and the configuration that works? > > > > - R. > > > From iod00d at hp.com Thu Aug 4 11:14:30 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 4 Aug 2005 11:14:30 -0700 Subject: [openib-general] [PATCH] mthca: update FW versions In-Reply-To: <506C3D7B14CDD411A52C00025558DED6085BD020@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6085BD020@mtlex01.yok.mtl.com> Message-ID: <20050804181430.GC20422@esmail.cup.hp.com> On Thu, Aug 04, 2005 at 06:08:31PM +0300, Tziporet Koren wrote: > This patch (attached) update FW versions to check according to latest > Mellanox FW release in July. > See http://www.mellanox.com/products/firmware.html Does this mean that topspin/cisco, voltaire, and infinicon (sorry, I'm forgetting someone and I forgot infinicon's new name) are all shipping v3.3.3 for their respective customers? Or will respective vendors bless their customers using mellanox firmware with openib.org code? thanks, grant From swise at ammasso.com Thu Aug 4 11:11:10 2005 From: swise at ammasso.com (Steve Wise) Date: Thu, 4 Aug 2005 13:11:10 -0500 Subject: [openib-general] [RFC] Move InfiniBand .h files In-Reply-To: <52iryla9r5.fsf@cisco.com> Message-ID: Seems reasonable to me... Steve. 
> -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Thursday, August 04, 2005 12:32 PM > To: openib-general at openib.org; linux-kernel at vger.kernel.org > Subject: [openib-general] [RFC] Move InfiniBand .h files > > I would like to get people's reactions to moving the InfiniBand .h > files from their current location in drivers/infiniband/include/ to > include/linux/rdma/. If we agree that this is a good idea then I'll > push this change as soon as 2.6.14 starts. > > The advantages of doing this are: > > - The headers become more easily accessible to other parts of the > tree that might want to use IB support. For example, an NFS/RDMA > client probably wants to live under fs/ > - It makes it easier to build IB modules outside the tree, since > include/linux gets put in /lib/modules//build. I realize > that we don't really care about out-of-tree modules, but it is > convenient to be able to develop and distribute new drivers that > build against someone's existing kernels. > - We can kill off the ugly > > EXTRA_CFLAGS += -Idrivers/infiniband/include > > lines in our Makefiles. > > The disadvantages are: > > - It's churn with little technical merit. > - It makes it a little harder to pull the OpenIB svn tree into a > kernel tree, since one would have to link both drivers/infiniband > and include/linux/rdma instead of just drivers/infiniband. This > problem goes away if/when OpenIB shifts over to a new source code > control system. 
> > Thanks, > Roland > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From iod00d at hp.com Thu Aug 4 11:20:16 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 4 Aug 2005 11:20:16 -0700 Subject: [openib-general] [RFC] Move InfiniBand .h files In-Reply-To: <52iryla9r5.fsf@cisco.com> References: <52iryla9r5.fsf@cisco.com> Message-ID: <20050804182016.GE20422@esmail.cup.hp.com> On Thu, Aug 04, 2005 at 10:32:14AM -0700, Roland Dreier wrote: ... I agree with the rename/relocation of the header files for the reasons you mentioned. > - It makes it a little harder to pull the OpenIB svn tree into a > kernel tree, since one would have to link both drivers/infiniband > and include/linux/rdma instead of just drivers/infiniband. This > problem goes away if/when OpenIB shifts over to a new source code > control system. Any thoughts on renaming drivers/infiniband to drivers/rdma at the same time? If you are going to churn...don't be shy about it :^) grant From yhlu.kernel at gmail.com Thu Aug 4 11:22:05 2005 From: yhlu.kernel at gmail.com (yhlu) Date: Thu, 4 Aug 2005 11:22:05 -0700 Subject: [openib-general] Re: [PATCH 1/2] [IB/cm]: Correct CM port redirect reject codes In-Reply-To: <20050804064223.GT15300@mellanox.co.il> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <20057281331.7vqhiAJ1Yc0um2je@cisco.com> <86802c44050803175873fb0569@mail.gmail.com> <20050804064223.GT15300@mellanox.co.il> Message-ID: <86802c4405080411227bce41f7@mail.gmail.com> Yes. On 8/3/05, Michael S. Tsirkin wrote: > Quoting r. yhlu : > > Subject: Re: [PATCH 1/2] [IB/cm]: Correct CM port redirect reject codes > > > > Roland, > > > > In LinuxBIOS, If I enable the prefmem64 to use real 64 range. the IB > > driver in Kernel can not be loaded. > > Are you using the latest firmware on the HCA card? 
> > -- > MST > From Thomas.Duffy.99 at alumni.brown.edu Thu Aug 4 11:22:49 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Thu, 04 Aug 2005 11:22:49 -0700 Subject: [openib-general] [PATCH 1/2] kDAPL: remove inline functions In-Reply-To: References: Message-ID: <1123179769.786.0.camel@duffman> On Thu, 2005-08-04 at 20:30 +0300, Or Gerlitz wrote: > >This is almost exactly what is there now. Not really a change. I don't like > > the multiple indirections to get to the function table, but this can be simplified later. > > Indeed, it can be simplified a little, but why not have this simplification at the kdapl > registry level ("dat"), as it is in ib_core calling ib_verbs and libibverbs calling libmthca, > while the verbs consumer does not have to go into those indirections at all but rather > just calls a well understood/defined API? Are you suggesting applying such a change > to the verbs as well? I don't understand what you mean. What change would happen in verbs? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From iod00d at hp.com Thu Aug 4 11:26:52 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 4 Aug 2005 11:26:52 -0700 Subject: [openib-general] Re: [RFC] Move InfiniBand .h files In-Reply-To: <1123178038.3318.40.camel@laptopd505.fenrus.org> References: <52iryla9r5.fsf@cisco.com> <1123178038.3318.40.camel@laptopd505.fenrus.org> Message-ID: <20050804182652.GF20422@esmail.cup.hp.com> On Thu, Aug 04, 2005 at 07:53:58PM +0200, Arjan van de Ven wrote: > On Thu, 2005-08-04 at 10:32 -0700, Roland Dreier wrote: > > I would like to get people's reactions to moving the InfiniBand .h > > files from their current location in drivers/infiniband/include/ to > > include/linux/rdma/. If we agree that this is a good idea then I'll > > push this change as soon as 2.6.14 starts.
> > please only put userspace clean headers here; the rest is more or less > private headers for your subsystem. Sorry...this smells like a rathole...but does this mean Linus agrees the kernel subsystems should export headers suitable for both user space and kernel driver modules? Historically, I thought glibc and other user space libs were expected to maintain their own set of header files. Maybe I'm just confused... thanks, grant From rolandd at cisco.com Thu Aug 4 11:31:24 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 04 Aug 2005 11:31:24 -0700 Subject: [openib-general] Re: [RFC] Move InfiniBand .h files In-Reply-To: <1123178038.3318.40.camel@laptopd505.fenrus.org> (Arjan van de Ven's message of "Thu, 04 Aug 2005 19:53:58 +0200") References: <52iryla9r5.fsf@cisco.com> <1123178038.3318.40.camel@laptopd505.fenrus.org> Message-ID: <52acjxa70j.fsf@cisco.com> Arjan> At minimum the headers should be split in separate files Arjan> for shared-userspace and kernel (eg no overlap at all), but Arjan> I'd vote for keeping the headers in your own dir. This is already done -- the userspace ABI is defined in ib_user_mad.h, ib_user_verbs.h, etc. The problem with keeping subsystem headers under drivers/infiniband is that it's ugly for, say, fs/nfs/Makefile to have to add -Idrivers/infiniband/include to its CFLAGS just because it's implementing NFS/RDMA. Also, drivers/infiniband/include doesn't get put into the /lib/modules//build directory, so it's a pain for people developing new drivers (this is a real complaint that came to me from a vendor developing a driver for a new piece of IB hardware).
Thanks, Roland From rolandd at cisco.com Thu Aug 4 11:32:50 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 04 Aug 2005 11:32:50 -0700 Subject: [openib-general] [RFC] Move InfiniBand .h files In-Reply-To: <20050804182016.GE20422@esmail.cup.hp.com> (Grant Grundler's message of "Thu, 4 Aug 2005 11:20:16 -0700") References: <52iryla9r5.fsf@cisco.com> <20050804182016.GE20422@esmail.cup.hp.com> Message-ID: <5264ula6y5.fsf@cisco.com> Grant> Any thoughts on renaming drivers/infiniband to drivers/rdma Grant> at the same time? Grant> If you are going to churn...don't be shy about it :^) Well, I'd rather avoid churn for purely political reasons. The main point of my proposal is to move the includes from drivers/ to include/, but while we're at it we might as well pick a more neutral directory name. Moving drivers/infiniband to drivers/rdma has no technical merit right now, so I'd rather wait and see how it makes sense to organize the code we end up with. - R. From rolandd at cisco.com Thu Aug 4 11:35:44 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 04 Aug 2005 11:35:44 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <86802c4405080411013b60382c@mail.gmail.com> (yhlu's message of "Thu, 4 Aug 2005 11:01:55 -0700") References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <20057281331.7vqhiAJ1Yc0um2je@cisco.com> <86802c44050803175873fb0569@mail.gmail.com> <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> Message-ID: <521x59a6tb.fsf@cisco.com> yhlu> I enabled CONFIG_INFINIBAND_MTHCA_DEBUG=y but didn't get any yhlu> more debug info. Does that depend on another setting? It shouldn't depend on anything. mthca_dbg() gets turned into dev_dbg(), which just does dev_printk(KERN_DEBUG,...). Perhaps you have to change your console level to see KERN_DEBUG messages?
Since you're getting to the call to mthca_init_qp_table(), there are mthca_dbg() calls that you should definitely be getting output from. - R. From arjan at infradead.org Thu Aug 4 11:38:37 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Thu, 04 Aug 2005 20:38:37 +0200 Subject: [openib-general] Re: [RFC] Move InfiniBand .h files In-Reply-To: <52acjxa70j.fsf@cisco.com> References: <52iryla9r5.fsf@cisco.com> <1123178038.3318.40.camel@laptopd505.fenrus.org> <52acjxa70j.fsf@cisco.com> Message-ID: <1123180717.3318.43.camel@laptopd505.fenrus.org> > Also, drivers/infiniband/include doesn't get put into the > /lib/modules//build directory, that is a symlink not a directory, and a symlink to the full source... From rolandd at cisco.com Thu Aug 4 11:57:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 04 Aug 2005 11:57:55 -0700 Subject: [openib-general] Re: [RFC] Move InfiniBand .h files In-Reply-To: <1123180717.3318.43.camel@laptopd505.fenrus.org> (Arjan van de Ven's message of "Thu, 04 Aug 2005 20:38:37 +0200") References: <52iryla9r5.fsf@cisco.com> <1123178038.3318.40.camel@laptopd505.fenrus.org> <52acjxa70j.fsf@cisco.com> <1123180717.3318.43.camel@laptopd505.fenrus.org> Message-ID: <52wtn18r7w.fsf@cisco.com> Roland> Also, drivers/infiniband/include doesn't get put into the Roland> /lib/modules//build directory, Arjan> that is a symlink not a directory, and a symlink to the Arjan> full source... Sorry, I was too terse about the problem. You're right, but typical distros don't ship full kernel source in their "support kernel builds" package. And if I use an external build directory (ie "O=") then the symlink just points to my external build directory, which doesn't include the source to drivers/, just links to include/ - R. 
From sam at ravnborg.org Thu Aug 4 12:22:29 2005 From: sam at ravnborg.org (Sam Ravnborg) Date: Thu, 4 Aug 2005 21:22:29 +0200 Subject: [openib-general] Re: [RFC] Move InfiniBand .h files In-Reply-To: <52wtn18r7w.fsf@cisco.com> References: <52iryla9r5.fsf@cisco.com> <1123178038.3318.40.camel@laptopd505.fenrus.org> <52acjxa70j.fsf@cisco.com> <1123180717.3318.43.camel@laptopd505.fenrus.org> <52wtn18r7w.fsf@cisco.com> Message-ID: <20050804192229.GA26714@mars.ravnborg.org> On Thu, Aug 04, 2005 at 11:57:55AM -0700, Roland Dreier wrote: > Roland> Also, drivers/infiniband/include doesn't get put into the > Roland> /lib/modules//build directory, > > Arjan> that is a symlink not a directory, and a symlink to the > Arjan> full source... > > Sorry, I was too terse about the problem. You're right, but typical > distros don't ship full kernel source in their "support kernel builds" > package. And if I use an external build directory (ie "O=") then > the symlink just points to my external build directory, which doesn't > include the source to drivers/, just links to include/ If the external module uses a Kbuild file as explained in Documentation/kbuild/makefiles.txt and then uses both O= and M= when compiling the module there is no issue. With respect to moving the .h files - please do so. drivers/infiniband should only include headers used in that same directory. Not header files potentially used by fs/. 
Sam From yhlu.kernel at gmail.com Thu Aug 4 12:30:17 2005 From: yhlu.kernel at gmail.com (yhlu) Date: Thu, 4 Aug 2005 12:30:17 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <521x59a6tb.fsf@cisco.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <20057281331.7vqhiAJ1Yc0um2je@cisco.com> <86802c44050803175873fb0569@mail.gmail.com> <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> Message-ID: <86802c440508041230143354c2@mail.gmail.com> ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) ib_mthca: Initializing Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (0000:04:00.0) ib_mthca 0000:04:00.0: FW version 000400060002, max commands 64 ib_mthca 0000:04:00.0: FW size 6143 KB (start fcefa00000, end fcefffffff) ib_mthca 0000:04:00.0: HCA memory size 262143 KB (start fce0000000, end fcefffffff) ib_mthca 0000:04:00.0: Max QPs: 16777216, reserved QPs: 1024, entry size: 256 ib_mthca 0000:04:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 64 ib_mthca 0000:04:00.0: Max EQs: 64, reserved EQs: 1, entry size: 64 ib_mthca 0000:04:00.0: reserved MPTs: 16, reserved MTTs: 16 ib_mthca 0000:04:00.0: Max PDs: 16777216, reserved PDs: 0, reserved UARs: 1 ib_mthca 0000:04:00.0: Max QP/MCG: 16777216, reserved MGMs: 0 ib_mthca 0000:04:00.0: Flags: 00370347 ib_mthca 0000:04:00.0: profile[ 0]--10/20 @ 0x fce0000000 (size 0x 4000000) ib_mthca 0000:04:00.0: profile[ 1]-- 0/16 @ 0x fce4000000 (size 0x 1000000) ib_mthca 0000:04:00.0: profile[ 2]-- 7/18 @ 0x fce5000000 (size 0x 800000) ib_mthca 0000:04:00.0: profile[ 3]-- 9/17 @ 0x fce5800000 (size 0x 800000) ib_mthca 0000:04:00.0: profile[ 4]-- 3/16 @ 0x fce6000000 (size 0x 400000) ib_mthca 0000:04:00.0: profile[ 5]-- 4/16 @ 0x fce6400000 (size 0x 200000) ib_mthca 0000:04:00.0: profile[ 6]--12/15 @ 0x fce6600000 
(size 0x 100000) ib_mthca 0000:04:00.0: profile[ 7]-- 8/13 @ 0x fce6700000 (size 0x 80000) ib_mthca 0000:04:00.0: profile[ 8]--11/11 @ 0x fce6780000 (size 0x 10000) ib_mthca 0000:04:00.0: profile[ 9]-- 6/ 5 @ 0x fce6790000 (size 0x 800) ib_mthca 0000:04:00.0: HCA memory: allocated 106050 KB/256000 KB (149950 KB free) ib_mthca 0000:04:00.0: Allocated EQ 1 with 65536 entries ib_mthca 0000:04:00.0: Allocated EQ 2 with 128 entries ib_mthca 0000:04:00.0: Allocated EQ 3 with 128 entries ib_mthca 0000:04:00.0: Setting mask 00000000000f43fe for eqn 2 ib_mthca 0000:04:00.0: Setting mask 0000000000000400 for eqn 3 ib_mthca 0000:04:00.0: NOP command IRQ test passed <------------------------------------------------------------------------------------------------------stuck 30s ib_mthca 0000:04:00.0: Failed to initialize queue pair table, aborting. ib_mthca 0000:04:00.0: Clearing mask 00000000000f43fe for eqn 2 ib_mthca 0000:04:00.0: Clearing mask 0000000000000400 for eqn 3 ib_mthca: probe of 0000:04:00.0 failed with error -16 On 8/4/05, Roland Dreier wrote: > yhlu> i enable CCONFIG_INFINIBAND_MTHCA_DEBUG=y I didn't get any > yhlu> more debug info, is that depend other setting? > > It shouldn't depend on anything. mthca_dbg() gets turned into > dev_dbg(), which just does dev_printk(KERN_DEBUG,...). Perhaps you > have to change your console level to see KERN_DEBUG messages? > > Since you're getting to the call to mthca_init_qp_table(), there are > mthca_dbg() calls that you should definitely be getting output from. > > - R. 
From arjan at infradead.org Thu Aug 4 12:51:10 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Thu, 04 Aug 2005 21:51:10 +0200 Subject: [openib-general] Re: [RFC] Move InfiniBand .h files In-Reply-To: <20050804182652.GF20422@esmail.cup.hp.com> References: <52iryla9r5.fsf@cisco.com> <1123178038.3318.40.camel@laptopd505.fenrus.org> <20050804182652.GF20422@esmail.cup.hp.com> Message-ID: <1123185070.3318.49.camel@laptopd505.fenrus.org> On Thu, 2005-08-04 at 11:26 -0700, Grant Grundler wrote: > On Thu, Aug 04, 2005 at 07:53:58PM +0200, Arjan van de Ven wrote: > > On Thu, 2005-08-04 at 10:32 -0700, Roland Dreier wrote: > > > I would like to get people's reactions to moving the InfiniBand .h > > > files from their current location in drivers/infiniband/include/ to > > > include/linux/rdma/. If we agree that this is a good idea then I'll > > > push this change as soon as 2.6.14 starts. > > > > please only put userspace clean headers here; the rest is more or less > > private headers for your subsystem. > > Sorry...this smells like a rathole...but does this mean > linus agrees the kernel subsystems should export headers suitable for > both user space and kernel driver modules? > > Historical, I thought glibc and other user space libs were expected to > maintain their own set of header files. Maybe I'm just confused... there is a definite requirement for the kernel to expose SOME things to userspace. Well for SOMETHING to expose them. Right now most distros ship a hacked up version of the kernel headers (eg removed of all the kernel specific stuff and all the gpl inline code etc). 
A good part of making such an external project possible is to make a clean separation between userspace shared stuff and pure kernel internals. From arjan at infradead.org Thu Aug 4 12:51:34 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Thu, 04 Aug 2005 21:51:34 +0200 Subject: [openib-general] Re: [RFC] Move InfiniBand .h files In-Reply-To: <52wtn18r7w.fsf@cisco.com> References: <52iryla9r5.fsf@cisco.com> <1123178038.3318.40.camel@laptopd505.fenrus.org> <52acjxa70j.fsf@cisco.com> <1123180717.3318.43.camel@laptopd505.fenrus.org> <52wtn18r7w.fsf@cisco.com> Message-ID: <1123185094.3318.51.camel@laptopd505.fenrus.org> On Thu, 2005-08-04 at 11:57 -0700, Roland Dreier wrote: > Roland> Also, drivers/infiniband/include doesn't get put into the > Roland> /lib/modules//build directory, > > Arjan> that is a symlink not a directory, and a symlink to the > Arjan> full source... > > Sorry, I was too terse about the problem. You're right, but typical > distros don't ship full kernel source in their "support kernel builds" > package. so what makes you think they will ship include/infiniband ? From cfriesen at nortel.com Thu Aug 4 12:54:56 2005 From: cfriesen at nortel.com (Christopher Friesen) Date: Thu, 04 Aug 2005 13:54:56 -0600 Subject: [openib-general] Re: [RFC] Move InfiniBand .h files In-Reply-To: <20050804192229.GA26714@mars.ravnborg.org> References: <52iryla9r5.fsf@cisco.com> <1123178038.3318.40.camel@laptopd505.fenrus.org> <52acjxa70j.fsf@cisco.com> <1123180717.3318.43.camel@laptopd505.fenrus.org> <52wtn18r7w.fsf@cisco.com> <20050804192229.GA26714@mars.ravnborg.org> Message-ID: <42F27290.2070002@nortel.com> Sam Ravnborg wrote: > On Thu, Aug 04, 2005 at 11:57:55AM -0700, Roland Dreier wrote: >>Sorry, I was too terse about the problem. You're right, but typical >>distros don't ship full kernel source in their "support kernel builds" >>package. 
And if I use an external build directory (ie "O=") then >>the symlink just points to my external build directory, which doesn't >>include the source to drivers/, just links to include/ > > > If the external module uses a Kbuild file as explained in > Documentation/kbuild/makefiles.txt and then uses both O= and M= > when compiling the module there is no issue. > > With respect to moving the .h files - please do so. > drivers/infiniband should only include header used in that same > directory. Not header files potentially uased by fs/. I think Roland was talking about the case where the running kernel was built with "O=", in which case the /lib/modules.../build symlink points to the build directory rather than the original source tree. Does Kbuild handle this case properly? Chris From sam at ravnborg.org Thu Aug 4 13:02:45 2005 From: sam at ravnborg.org (Sam Ravnborg) Date: Thu, 4 Aug 2005 22:02:45 +0200 Subject: [openib-general] Re: [RFC] Move InfiniBand .h files In-Reply-To: <42F27290.2070002@nortel.com> References: <52iryla9r5.fsf@cisco.com> <1123178038.3318.40.camel@laptopd505.fenrus.org> <52acjxa70j.fsf@cisco.com> <1123180717.3318.43.camel@laptopd505.fenrus.org> <52wtn18r7w.fsf@cisco.com> <20050804192229.GA26714@mars.ravnborg.org> <42F27290.2070002@nortel.com> Message-ID: <20050804200245.GA4622@mars.ravnborg.org> On Thu, Aug 04, 2005 at 01:54:56PM -0600, Christopher Friesen wrote: > Sam Ravnborg wrote: > >On Thu, Aug 04, 2005 at 11:57:55AM -0700, Roland Dreier wrote: > > >>Sorry, I was too terse about the problem. You're right, but typical > >>distros don't ship full kernel source in their "support kernel builds" > >>package. 
And if I use an external build directory (ie "O=") then > >>the symlink just points to my external build directory, which doesn't > >>include the source to drivers/, just links to include/ > > > > > >If the external module uses a Kbuild file as explained in > >Documentation/kbuild/makefiles.txt and then uses both O= and M= > >when compiling the module there is no issue. > > > >With respect to moving the .h files - please do so. > >drivers/infiniband should only include header used in that same > >directory. Not header files potentially uased by fs/. > > I think Roland was talking about the case where the running kernel was > built with "O=", in which case the /lib/modules.../build symlink points > to the build directory rather than the original source tree. > > Does Kbuild handle this case properly? Yes it does. /lib/modules/.../ contains two symlinks these days: build -> always point to the directory containing the output of the build source -> always point to the kernel source In the 'make' case where the kernel is built without using O= they point to the same directory. In the 'make O=' case they point to different directories. SUSE does ship with a make O= build kernel these days. Fedora IIRC has done an ugly hack and just copied over a number of files so a compile works in most cases - but then also use both symlink. It has never been easier to build a module if the target is only the running kernel. Only when you adds backwards compatibility it gets messy :-( Sam From jlentini at netapp.com Thu Aug 4 13:14:21 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 4 Aug 2005 16:14:21 -0400 (EDT) Subject: [openib-general] [PATCH 1/2] kDAPL: remove inline functions In-Reply-To: <1123179769.786.0.camel@duffman> References: <1123179769.786.0.camel@duffman> Message-ID: On Thu, 4 Aug 2005, Tom Duffy wrote: > On Thu, 2005-08-04 at 20:30 +0300, Or Gerlitz wrote: >>> This is almost exactly what is there now. Not really a change. 
>>> I don't like the multiple indirections to get to the function >>> table, but this can be simplified later. >> >> Indeed, it can be simplified a little, but why not having this >> simplification in the kdapl registry level ("dat") as it is in >> ib_core calling ib_verbs and libibverbs calling libmthca, while the >> verbs consumer does not have to go into those indirections at all >> but rather just call a well understood/defined api? are you >> suggesting to apply such a change also to the verbs? > > I don't understand what you mean? What change would happen in verbs? I think Or is asking if similar changes should be made to the verbs (i.e. should ib_alloc_pd() be removed and replaced in each verb's user by device->alloc_pd(..), etc.). Is that right Or? In kDAPL, the inline dat_* functions just perform a function call. In the case of the verbs, there are some operations performed by the verbs core in addition to calling the function. For example, ib_alloc_pd() initializes the ib_pd fields in addition to calling the device specific alloc_pd function. Due to the additional functionality, I don't think this convention would extend to the IB verbs. In terms of kDAPL, I share Or's concern that this will make the API harder for consumers to use. Picture someone who wants to use the kDAPL API for the first time. Today that person would scan the kDAPL header and see all of the functions available and get a pretty clear picture of what each function's parameters are for just by looking at the parameter names. We'd lose that "documentation" with this change. With respect to the struct file_operations analogy, is a struct file_operations embedded in any other structure besides struct file? In kDAPL, the function table is embedded in several different objects. 
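As a rough illustration of the distinction drawn in the message above (a thin wrapper that only forwards to a provider's function table, versus a core wrapper that also does common work of its own), here is a minimal userspace sketch. Every name in it (toy_device, toy_pd, toy_alloc_pd_thin, toy_alloc_pd_core, example_alloc_pd) is hypothetical; these are not the kDAPL or IB verbs definitions, and the fields initialized by the core wrapper are chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct toy_device;

/* Hypothetical protection-domain object; NOT the real struct ib_pd. */
struct toy_pd {
    struct toy_device *device;  /* owning device, set by the core wrapper */
    int usecnt;                 /* use count, set by the core wrapper */
};

/* Hypothetical per-provider function table. */
struct toy_device {
    struct toy_pd *(*alloc_pd)(struct toy_device *dev);
};

/* Thin wrapper: nothing but the indirect call, as the thread describes
 * for the inline dat_* functions. */
static inline struct toy_pd *toy_alloc_pd_thin(struct toy_device *dev)
{
    return dev->alloc_pd(dev);
}

/* Core wrapper: the indirect call plus common initialization that every
 * provider would otherwise have to repeat. */
static struct toy_pd *toy_alloc_pd_core(struct toy_device *dev)
{
    struct toy_pd *pd;

    assert(dev != NULL && dev->alloc_pd != NULL);
    pd = dev->alloc_pd(dev);
    if (pd != NULL) {
        pd->device = dev;
        pd->usecnt = 0;
    }
    return pd;
}

/* Example provider that deliberately leaves the common fields unset. */
static struct toy_pd example_pd_storage;

static struct toy_pd *example_alloc_pd(struct toy_device *dev)
{
    (void)dev;
    example_pd_storage.device = NULL;
    example_pd_storage.usecnt = -1;
    return &example_pd_storage;
}
```

With a device whose alloc_pd points at example_alloc_pd, the thin wrapper hands back the object exactly as the provider left it, while the core wrapper returns it with the common fields filled in. Replacing the second kind of wrapper with a direct device->alloc_pd() call would push that initialization into every caller, which is the reason given above for not extending the convention to the verbs.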
From jlentini at netapp.com Thu Aug 4 13:30:30 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 4 Aug 2005 16:30:30 -0400 (EDT) Subject: [openib-general][kdapl]: vmalloc instead of kmalloc In-Reply-To: References: Message-ID: On Thu, 4 Aug 2005, Guy German wrote: > James, > > I see what you mean. > The allocation of the event vector is derived from evd->qlen. > In DTO ev'd, however, qlen is also the parameter passed to > ib_create_cq. > Since we don't want to limit DAPL consumers to an > unnecessary small completion queue size, maybe we > could differentiate between DTO supporting evd's and > CONN evd's, when allocating the events vector. > > if evd supports CONN only, leave it : > event = kmalloc(evd->qlen * sizeof *event) > (Relying on the consumer he knows what he is doing) > if evd is DTO only : > don't allocate an event buffer, at all > if evd supports both : > event = kmalloc(DEFAULT_4_CONN * sizeof *event) And dynamically add additional events up to qlen as needed? > > if DEFAULT_4_CONN=256, that's a 3 pages allocation. > > How does that sound to you ? I'd prefer that the EVDs were uniform. I would worry about bugs otherwise. The eventual solution has to support qlen generated events (connection request, connection, disconnect, software) if those event types are supported by the EVD (even if the EVD is being used for both generated events and DTOs). From arlin.r.davis at intel.com Thu Aug 4 14:05:27 2005 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Thu, 4 Aug 2005 14:05:27 -0700 Subject: [openib-general] RE: some questions/comments on gen2 udapl code Message-ID: <59278FC0C48A994BABABD069571E45680BA7BA39@orsmsx401.amr.corp.intel.com> Or, We need to have these discussions on openib-general. 
>-----Original Message----- >From: Or Gerlitz [mailto:ogerlitz at voltaire.com] >Sent: Thursday, August 04, 2005 6:55 AM >To: Davis, Arlin R >Cc: James Lentini >Subject: RE: some questions/comments on gen2 udapl code > >> Arlin writes: >> I take this back. With the current verbs event mechanisms, a >> dedicated thread for uAT, uCM, and uCQ is the only way to process >> multi-threaded application requests. Let's say an application opens >> the device and then kicks off separate threads to connect and send to >> many different endpoints. Each thread will be waiting on the same >> event mechanisms. I see no good way to multiplex properly outside of >> a dedicated processing thread. > >> If the verbs AT, CM, and CQ event mechanisms change then we can >> re-visit. > >I am not sure there's is no good way - but i guess "not sure" is not >enough, i need to send a >patch that allows for it (ie work under multi-threaded app (eg >udapltest) with each thread having >diff conn/dto/cr evd)... Patch would be great. I moved to a dedicated CQ thread approach after struggling with the many multi-threaded issues with direct waits. See comments in dapls_ib_wait_object_wait() in dapl_ib_cq.c. > >The thing is that the current implementation is not optimal in >delivering CQ interrupts latency, >since there's one "extra" context switch. Also the udapl provider I agree that this is not optimal and we want the CQ_WAIT_OBJECT mapped directly to a CQ fd to avoid the extra context switch. But with one CQ FD per device how do you optimally multiplex the user's CQ context across multiple waiters without a dedicated processing thread? It seems just as inefficient to have 4 waiters on different CQ's all waking up, processing an event that may or may not be theirs. How do you deliver the event that was not yours? What if the event was directed to a thread that already woke and went back to sleep? Do you wake them all up again? This is why I went back to the dedicated CQ thread. 
We need some help from verbs to make this uDAPL CQ mapping work: Either support multiple CQ FD's, one per CQ. or modify call ibv_get_cq_event() and allow the user to supply the specific ibv_cq event to get, not just the first one on the event queue. >library opens 3 dedicated >threads (uat/ucm/ucq) and for getting async event we will probably add >uasync thread - which I can work on rolling up the uat/ucm/uasync processing into one thread. -arlin >would be better if we can avoid doing (opening so many threads). > >Or. From pauln at psc.edu Thu Aug 4 14:14:01 2005 From: pauln at psc.edu (PAulN) Date: Thu, 04 Aug 2005 17:14:01 -0400 Subject: [openib-general] ib_memory_register() equivalent in gen2 Message-ID: <42F28519.6070004@psc.edu> Hi, I'm looking for the memory registration function which takes a vaddr and a length, not an array of physical buffers. In gen1 this call is ib_memory_register(). The closest thing I can find is the reg_user_mr stuff which is used by uverbs. Could someone please tell me if reg_user_mr is the right place to start? Thanks, Paul From limichal at cisco.com Thu Aug 4 14:26:10 2005 From: limichal at cisco.com (Libor Michalek) Date: Thu, 4 Aug 2005 14:26:10 -0700 Subject: [openib-general] SDP and uCM. Message-ID: <20050804142610.C31096@topspin.com> At the end of this week I will be leaving Cisco and will no longer have the time, or access to the equipment, needed to maintain the SDP or uCM code. It would seem ideal for Sean Hefty to take over the uCM code if he has the time and desire. I also think that Tom Duffy would be the right person to maintain the SDP code, unless he doesn't think he has the time. In which case Michael Tsirkin knows the code very well and would hopefully consider taking over the code. Thank you all. 
-Libor From jlentini at netapp.com Thu Aug 4 14:38:35 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 4 Aug 2005 17:38:35 -0400 (EDT) Subject: [openib-general] [RFC] Move InfiniBand .h files In-Reply-To: <52iryla9r5.fsf@cisco.com> References: <52iryla9r5.fsf@cisco.com> Message-ID: On Thu, 4 Aug 2005, Roland Dreier wrote: > I would like to get people's reactions to moving the InfiniBand .h > files from their current location in drivers/infiniband/include/ to > include/linux/rdma/. If we agree that this is a good idea then I'll > push this change as soon as 2.6.14 starts. I think it is a good idea. > The advantages of doing this are: > > - The headers become more easily accessible to other parts of the > tree that might want to use IB support. For example, an NFS/RDMA > client probably wants to live under fs/ net/sunrpc/ has also been proposed. From Thomas.Duffy.99 at alumni.brown.edu Thu Aug 4 14:46:47 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Thu, 04 Aug 2005 14:46:47 -0700 Subject: [openib-general] Re: SDP and uCM. In-Reply-To: <20050804142610.C31096@topspin.com> References: <20050804142610.C31096@topspin.com> Message-ID: <1123192008.786.4.camel@duffman> On Thu, 2005-08-04 at 14:26 -0700, Libor Michalek wrote: > At the end of this week I will be leaving Cisco and will no longer > have the time, or access to the equipment, needed to maintain the SDP > or uCM code. It would seem ideal for Sean Hefty to take over the uCM > code if he has the time and desire. I also think that Tom Duffy would > be the right person to maintain the SDP code, unless he doesn't think > he has the time. In which case Michael Tsirkin knows the code very well > and would hopefully consider taking over the code. Libor, do you plan on checking anything else into SDP before you give over maintainership? -tduffy 
From twoodall at lanl.gov Thu Aug 4 14:49:33 2005 From: twoodall at lanl.gov (Timothy S. Woodall) Date: Thu, 4 Aug 2005 15:49:33 -0600 (MDT) Subject: [openib-general] OpenMPI question In-Reply-To: References: Message-ID: <51903.128.165.0.81.1123192173.squirrel@webmail.lanl.gov> Hello Steve, > Will the various MPI implementations be ported to the openIB user verbs > API? > Or uDAPL? From reading the FAQ it appears that MVAPICH2 will remain on > udapl. I'm curious about Open-MPI and LA-MPI? > There is a port of OpenMPI in the works to the gen2 openIB verbs. We're interested in supporting uDAPL as well, but there is no active development on this that I'm aware of. We'd be glad to provide assistance if someone was interested in taking this on... We plan on phasing LA-MPI out as OpenMPI becomes available. So, we don't have any plans to do this port. Regards, Tim From sean.hefty at intel.com Thu Aug 4 14:52:43 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 4 Aug 2005 14:52:43 -0700 Subject: [openib-general] SDP and uCM. In-Reply-To: <20050804142610.C31096@topspin.com> Message-ID: >or uCM code. It would seem ideal for Sean Hefty to take over the uCM I can maintain the uCM code along with the kCM. Good luck with your new endeavor! - Sean From limichal at cisco.com Thu Aug 4 16:43:44 2005 From: limichal at cisco.com (Libor Michalek) Date: Thu, 4 Aug 2005 16:43:44 -0700 Subject: [openib-general] Re: SDP and uCM. 
In-Reply-To: <1123192008.786.4.camel@duffman>; from Thomas.Duffy.99@alumni.brown.edu on Thu, Aug 04, 2005 at 02:46:47PM -0700 References: <20050804142610.C31096@topspin.com> <1123192008.786.4.camel@duffman> Message-ID: <20050804164344.B30741@topspin.com> On Thu, Aug 04, 2005 at 02:46:47PM -0700, Tom Duffy wrote: > On Thu, 2005-08-04 at 14:26 -0700, Libor Michalek wrote: > > At the end of this week I will be leaving Cisco and will no longer > > have the time, or access to the equipment, needed to maintain the SDP > > or uCM code. It would seem ideal for Sean Hefty to take over the uCM > > code if he has the time and desire. I also think that Tom Duffy would > > be the right person to maintain the SDP code, unless he doesn't think > > he has the time. In which case Michael Tsirkin knows the code very well > > and would hopefully consider taking over the code. > > Libor, do you plan on checking anything else into SDP before you give > over maintainership? No. I don't have any private patches to commit, and it does not look like I'll have time to look at and test all the patches that have been posted. Sorry. -Libor From limichal at cisco.com Thu Aug 4 16:46:34 2005 From: limichal at cisco.com (Libor Michalek) Date: Thu, 4 Aug 2005 16:46:34 -0700 Subject: [openib-general] Re: [PATCH updated] sdp: cancel read with no iocb In-Reply-To: <20050801131445.GW14384@mellanox.co.il>; from mst@mellanox.co.il on Mon, Aug 01, 2005 at 04:14:45PM +0300 References: <20050801084000.GS14384@mellanox.co.il> <20050801131445.GW14384@mellanox.co.il> Message-ID: <20050804164634.C30741@topspin.com> On Mon, Aug 01, 2005 at 04:14:45PM +0300, Michael S. Tsirkin wrote: > Quoting r. Michael S. Tsirkin : > > Subject: sdp: cancel read with no iocb > > > > Libor, I'm seeing these messages: > > > > ib_sdp WARN: Cancel read with no IOCB. <2:0:00000005> > > > > It seems that this warning is printed in a legal state where > > a deferred iocb is canceled. 
Shouldn't this sdp_warn be replaced > > with sdp_dbg_ctrl? You're right, this should have been a sdp_dbg_ctrl all along, since it's even possible, although much less likely, when the completion is performed in irq context. The write cancel in sdp_send.c should be changed as well. -Libor From limichal at cisco.com Thu Aug 4 16:49:24 2005 From: limichal at cisco.com (Libor Michalek) Date: Thu, 4 Aug 2005 16:49:24 -0700 Subject: [openib-general] Re: [PATCH] remove in_atomic In-Reply-To: <20050801063529.GO14384@mellanox.co.il>; from mst@mellanox.co.il on Mon, Aug 01, 2005 at 09:35:29AM +0300 References: <20050801063529.GO14384@mellanox.co.il> Message-ID: <20050804164924.D30741@topspin.com> On Mon, Aug 01, 2005 at 09:35:29AM +0300, Michael S. Tsirkin wrote: > in_atomic isn't a reliable way to check that we are in an atomic context. > Just schedule work, always, since most cq polling is currently done under > a spinlock, anyway. I agree this patch is correct, but did you see any performance change? With it applied I'm seeing a lot more variance in performance from run to run. I believe the differences have to do with timing and how frequently we are using source rdma advertisement vs. the sink. I believe the slow start algorithm for src_avails needs improvement. -Libor From tomduffy at speakeasy.net Thu Aug 4 16:52:08 2005 From: tomduffy at speakeasy.net (Tom Duffy) Date: Thu, 04 Aug 2005 23:52:08 +0000 Subject: [openib-general] Re: SDP and uCM. Message-ID: > From: Libor Michalek [mailto:limichal at cisco.com] > No. I don't have any private patches to commit, and it does not look > like I'll have time to look at and test all the patches that have been > posted. Sorry. Do you have any test scripts or procedures that you use to verify correctness or look for regressions when somebody sends you an SDP patch? 
Thanks, -tduffy From yaronh at voltaire.com Thu Aug 4 19:18:31 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Fri, 5 Aug 2005 05:18:31 +0300 Subject: [openib-general] [iSER]How to use the iSER with the UNH iSCSI Message-ID: <35EA21F54A45CB47B879F21A91F4862F6C5F70@taurus.voltaire.com> Ian, Currently the UNH iSCSI doesn’t support the "Datamover API" which is a new API defined in IETF and enable iSCSI to run over offload technologies such as iSER In addition the iSER code that is in OpenIB covers the Initiator side The Target code is (and being) integrated into few commercial products, or can be provided under some licensing There are few that intend to enable the datamover API in the UNH iSCSI and integrate it with iSER, they would be happy to see more helping hands, if you are interested I can hook you up with them Yaron > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Ian Jiang > Sent: Thursday, August 04, 2005 3:10 AM > To: openib-general at openib.org > Subject: [openib-general] [iSER]How to use the iSER with the UNH iSCSI > > Hi, everybody! > Thanks for all the replis to my "How to get the dat_headers_1_1.tgz"! > I downloaded the dapl_beta2.06.tgz as Itamar told me. > And I made some modification to the iSER to use it on the x86_64 platform. > > I got through the compiling finally, but here is another question: > How to use the iSER with the UNH iSCSI? I have the UNH iSCSI running on my > system at present. Need I modify it and reinstall? > > And I'm not sure if the dapl_beta2.06 has to be installed to run the iSER. > In fact, I did not compile or install the dapl before installing the iSER. > > Any suggestion is appriciated! 
> > Ian Jiang > ianjiang91 at hotmail.com > ---- > Computer Architecture Laboratory > Institute of Computing Technology > Chinese Academy of Sciences > Beijing,P.R.China > Zip code: 100080 > Tel: +86-10-62564394(office) From rolandd at cisco.com Thu Aug 4 20:47:12 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 04 Aug 2005 20:47:12 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <86802c440508041230143354c2@mail.gmail.com> (yhlu's message of "Thu, 4 Aug 2005 12:30:17 -0700") References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <20057281331.7vqhiAJ1Yc0um2je@cisco.com> <86802c44050803175873fb0569@mail.gmail.com> <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> Message-ID: <52slxp6o5b.fsf@cisco.com> Hmm, that output all looks fine. Can you run with the patch below to see exactly where the QP table initialization fails? (I haven't actually compiled this patch so you may have to fix a typo or two) I'm guessing that the CONF_SPECIAL_QP command is failing, but let's make sure. 
Thanks, Roland diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -2214,13 +2214,16 @@ int __devinit mthca_init_qp_table(struct (1 << 24) - 1, dev->qp_table.sqp_start + MTHCA_MAX_PORTS * 2); - if (err) + if (err) { + mthca_err(dev, "mthca_init_qp_table: mthca_alloc_init failed (%d)\n", err); return err; + } err = mthca_array_init(&dev->qp_table.qp, dev->limits.num_qps); if (err) { mthca_alloc_cleanup(&dev->qp_table.alloc); + mthca_err(dev, "mthca_init_qp_table: mthca_array_init failed (%d)\n", err); return err; } @@ -2228,8 +2231,10 @@ int __devinit mthca_init_qp_table(struct err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, dev->qp_table.sqp_start + i * 2, &status); - if (err) + if (err) { + mthca_err(dev, "mthca_init_qp_table: mthca_CONF_SPECIAL_QP failed for %d/%d (%d)\n", i, dev->qp_table.sqp_start + i * 2, err); goto err_out; + } if (status) { mthca_warn(dev, "CONF_SPECIAL_QP returned " "status %02x, aborting.\n", From rolandd at cisco.com Thu Aug 4 20:56:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 04 Aug 2005 20:56:39 -0700 Subject: [openib-general] ib_memory_register() equivalent in gen2 In-Reply-To: <42F28519.6070004@psc.edu> (pauln@psc.edu's message of "Thu, 04 Aug 2005 17:14:01 -0400") References: <42F28519.6070004@psc.edu> Message-ID: <52oe8d6npk.fsf@cisco.com> pauln> Hi, I'm looking for the memory registration function which pauln> takes a vaddr and and length, not an array of physical pauln> buffers. In gen1 this call is ib_memory_register(). The pauln> closest thing I can find is the reg_user_mr stuff which is pauln> used by uverbs. Could someone please tell me if pauln> req_user_mr is the right place to start? No, reg_user_mr is used to register userspace memory. There is no memory registration function for kernel consumers that takes a virtual address and a length. 
This is intentional, because there is not a general way to translate a kernel virtual address to a bus address that can be passed to a device. In fact, not every kernel virtual address may be used for DMA, and not every piece of DMA-able memory has a kernel virtual address. Every kernel consumer needs to handle the mapping of its memory to bus addresses (note -- _not_ physical addresses). - R. From ogerlitz at voltaire.com Thu Aug 4 23:30:16 2005 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Fri, 5 Aug 2005 09:30:16 +0300 Subject: [openib-general] RE: some questions/comments on gen2 udapl code Message-ID: Arlin > I can work on rolling up the uat/ucm/uasync processing into one thread. Sounds good, so each udapl process would consume fewer resources from the system. >We need some help from verbs to make this uDAPL CQ mapping work: > Either > support multiple CQ FD's, one per CQ. > or > modify call ibv_get_cq_event() and allow the user to supply the specific ibv_cq event to get, not > just the first one on the event queue. I think the first question we want to ask is: do we need multi-CQ/threaded support in the gen2 udapl? The udapl consumers I am aware of, mvapich and another commercial IB MPI, open one DTO EVD. How about Intel MPI? So before driving changes to libibverbs, let's see if there's a need. When an app that needs it comes along, we can then either suggest a change in libibverbs or implement two flavors of the udapl library, libdapl.so and libdapl.mt.so, where libdapl.so will assume no multi-CQ/threads and libdapl.mt.so will not assume this. And there's udapltest, which does use multiple threads/CQs, but should we spend work on only a synthetic test? I don't know. What do you say? Or.
From hch at infradead.org Fri Aug 5 02:22:28 2005 From: hch at infradead.org (Christoph Hellwig) Date: Fri, 5 Aug 2005 10:22:28 +0100 Subject: [openib-general] Re: [RFC] Move InfiniBand .h files In-Reply-To: <52iryla9r5.fsf@cisco.com> References: <52iryla9r5.fsf@cisco.com> Message-ID: <20050805092228.GA7237@infradead.org> On Thu, Aug 04, 2005 at 10:32:14AM -0700, Roland Dreier wrote: > I would like to get people's reactions to moving the InfiniBand .h > files from their current location in drivers/infiniband/include/ to > include/linux/rdma/. If we agree that this is a good idea then I'll > push this change as soon as 2.6.14 starts. include/rdma, please. No need for the linux/ component. > > The advantages of doing this are: > From jlentini at netapp.com Fri Aug 5 08:47:59 2005 From: jlentini at netapp.com (James Lentini) Date: Fri, 5 Aug 2005 11:47:59 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL openib uAT retry fixes In-Reply-To: References: Message-ID: On Wed, 3 Aug 2005, Arlin Davis wrote: > James, > > Please review the following uDAPL patch. Fixes my broken uAT retry code. > > Thanks, > > -arlin Looks good, Arlin. Committed in revision 2985. From halr at voltaire.com Fri Aug 5 08:42:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Aug 2005 11:42:17 -0400 Subject: [openib-general] [RFC] IPoIB sockaddr_ll changes Message-ID: <1123256537.4451.7.camel@hal.voltaire.com> Hi, I would like to get comments on this prior to sending this over to kernel land. There is a similar change to both /usr/include/linux/if_packet.h and /usr/include/netpacket/packet.h, as in include/linux/if_packet.h below, to increase sll_addr from 8 to 20 bytes. Thanks.
-- Hal IPoIB sockaddr_ll changes due to the fact that the IPoIB link layer address is 20 bytes rather than 8 bytes Signed-off-by: Hal Rosenstock --- include/linux/if_packet.h.orig 2005-06-29 19:00:53.000000000 -0400 +++ include/linux/if_packet.h 2005-08-05 10:04:06.000000000 -0400 @@ -8,6 +8,7 @@ struct sockaddr_pkt unsigned short spkt_protocol; }; +#define SOCKADDR_LL_COMPAT 12 struct sockaddr_ll { unsigned short sll_family; @@ -16,7 +17,7 @@ struct sockaddr_ll unsigned short sll_hatype; unsigned char sll_pkttype; unsigned char sll_halen; - unsigned char sll_addr[8]; + unsigned char sll_addr[20]; }; /* Packet types */ --- net/packet/af_packet.c.orig 2005-06-29 19:00:53.000000000 -0400 +++ net/packet/af_packet.c 2005-08-05 11:23:24.000000000 -0400 @@ -140,7 +140,7 @@ dev->hard_header == NULL (ll header is a mac.raw -> data data -> data - We should set nh.raw on output to correct posistion, + We should set nh.raw on output to correct position, packet classifier depends on it. */ @@ -315,7 +315,7 @@ static int packet_sendmsg_spkt(struct ki struct net_device *dev; unsigned short proto=0; int err; - + /* * Get and verify the address. 
*/ @@ -708,8 +708,11 @@ static int packet_sendmsg(struct kiocb * addr = NULL; } else { err = -EINVAL; - if (msg->msg_namelen < sizeof(struct sockaddr_ll)) - goto out; + if (msg->msg_namelen < sizeof(struct sockaddr_ll)) { + /* Support for older sockaddr_ll structs */ + if ((msg->msg_namelen != sizeof(struct sockaddr_ll) - SOCKADDR_LL_COMPAT) || (saddr->sll_hatype == ARPHRD_INFINIBAND)) + goto out; + } ifindex = saddr->sll_ifindex; proto = saddr->sll_protocol; addr = saddr->sll_addr; @@ -937,7 +940,9 @@ static int packet_bind(struct socket *so */ if (addr_len < sizeof(struct sockaddr_ll)) - return -EINVAL; + if ((addr_len != sizeof(struct sockaddr_ll) - SOCKADDR_LL_COMPAT) || + (sll->sll_hatype == ARPHRD_INFINIBAND)) + return -EINVAL; if (sll->sll_family != AF_PACKET) return -EINVAL; From halr at voltaire.com Fri Aug 5 08:46:12 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Aug 2005 11:46:12 -0400 Subject: [openib-general] [PATCH] arping for IPoIB Message-ID: <1123256772.4451.12.camel@hal.voltaire.com> Fix broadcast address for IPoIB interfaces Note this patch is (currently) dependent on the previous sockaddr_ll change. Should I make it otherwise ? Note also that this reincludes Tom patch as well. Signed-off-by: Hal Rosenstock --- arping.c.orig 2001-10-05 18:42:47.000000000 -0400 +++ arping.c 2005-08-05 08:45:02.000000000 -0400 @@ -56,9 +56,17 @@ struct timeval start, last; int sent, brd_sent; int received, brd_recv, req_recv; +static const uint8_t ipv4_bcast_addr[] = { + 0x00, 0xff, 0xff, 0xff, + 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff +}; + #define MS_TDIFF(tv1,tv2) ( ((tv1).tv_sec-(tv2).tv_sec)*1000 + \ ((tv1).tv_usec-(tv2).tv_usec)/1000 ) +#define min(x,y) ((x)<(y) ? 
(x) : (y)) + void usage(void) { fprintf(stderr, @@ -476,7 +484,10 @@ main(int argc, char **argv) } he = me; - memset(he.sll_addr, -1, he.sll_halen); + if (me.sll_hatype == ARPHRD_INFINIBAND) + memcpy(&he.sll_addr, &ipv4_bcast_addr, sizeof(ipv4_bcast_addr)); + else + memset(he.sll_addr, -1, min(he.sll_halen, sizeof he.sll_addr)); if (!quiet) { printf("ARPING %s ", inet_ntoa(dst)); From limichal at cisco.com Fri Aug 5 08:54:25 2005 From: limichal at cisco.com (Libor Michalek) Date: Fri, 5 Aug 2005 08:54:25 -0700 Subject: [openib-general] Re: SDP and uCM. In-Reply-To: ; from tomduffy@speakeasy.net on Thu, Aug 04, 2005 at 11:52:08PM +0000 References: Message-ID: <20050805085425.F30741@topspin.com> On Thu, Aug 04, 2005 at 11:52:08PM +0000, Tom Duffy wrote: > > From: Libor Michalek [mailto:limichal at cisco.com] > > No. I don't have any private patches to commit, and it does not look > > like I'll have time to look at and test all the patches that have been > > posted. Sorry. > > Do you have any test scripts or procedures that you use to verify > correctness or look for regressions when somebody sends you an SDP > patch? It depends on which area of the code gets touched. I usually run some subset of ttcp.aio, below and above the zcopy threshold, (32K and 4K) plain ttcp with 32K transfers. netperf including the throughput test, (TCP_STREAM) the connection test, (TCP_CC) the latency test, (TCP_RR) and the connection test with data exchange. (TCP_CRR) Finally, what I have not run in a while, is Polygraph with 800 concurrent connections. 
-Libor From rolandd at cisco.com Fri Aug 5 09:06:25 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 05 Aug 2005 09:06:25 -0700 Subject: [openib-general] [RFC] IPoIB sockaddr_ll changes In-Reply-To: <1123256537.4451.7.camel@hal.voltaire.com> (Hal Rosenstock's message of "05 Aug 2005 11:42:17 -0400") References: <1123256537.4451.7.camel@hal.voltaire.com> Message-ID: <52fyto74hq.fsf@cisco.com> Hal> Hi, I would like to get comments on this prior to sending Hal> this over to kernel land. When you send it, make sure to include netdev at vger.kernel.org so that all the networking people see it. SOCKADDR_LL_COMPAT is pretty ugly but I'm not sure I see a better solution right now. Hal> IPoIB sockaddr_ll changes due to the fact that the IPoIB link Hal> layer address is 20 bytes rather than 8 bytes You'll want to expand the explanation here so that it's clear what is being done and why. > + if ((msg->msg_namelen != sizeof(struct sockaddr_ll) - SOCKADDR_LL_COMPAT) || (saddr->sll_hatype == ARPHRD_INFINIBAND)) I know I say that it's OK to fudge on line lengths a little over 80 characters, but this line is way too long. > + if ((addr_len != sizeof(struct sockaddr_ll) - SOCKADDR_LL_COMPAT) || > + (sll->sll_hatype == ARPHRD_INFINIBAND)) Fix the indentation here so that (sll-> lines up with (addr_len != - R. From rolandd at cisco.com Fri Aug 5 09:08:41 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 05 Aug 2005 09:08:41 -0700 Subject: [openib-general] Re: [RFC] Move InfiniBand .h files In-Reply-To: <20050805092228.GA7237@infradead.org> (Christoph Hellwig's message of "Fri, 5 Aug 2005 10:22:28 +0100") References: <52iryla9r5.fsf@cisco.com> <20050805092228.GA7237@infradead.org> Message-ID: <52br4c74dy.fsf@cisco.com> Christoph> include/rdma, please. No need for the linux/ component. OK, fair enough. Any objection to this? - R.
From jlentini at netapp.com Fri Aug 5 09:17:15 2005 From: jlentini at netapp.com (James Lentini) Date: Fri, 5 Aug 2005 12:17:15 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL common code to build with counters In-Reply-To: References: Message-ID: On Wed, 3 Aug 2005, Arlin Davis wrote: > James, > > I tried to build uDAPL with counters to debug my wait/wakeup problem but ran into some build > problems. Can you review the following patch to enable counters? > > Not sure what happened to dapl_counters.h? I'm not either. I'll look into it. A couple of questions: > > Thanks, > > -arlin > > > Signed-off by: Arlin Davis > > > Index: dapl/include/dapl_debug.h > =================================================================== > --- dapl/include/dapl_debug.h (revision 2967) > +++ dapl/include/dapl_debug.h (working copy) > @@ -64,7 +64,9 @@ typedef enum > DAPL_DBG_TYPE_API = 0x0100, > DAPL_DBG_TYPE_RTN = 0x0200, > DAPL_DBG_TYPE_EXCEPTION = 0x0400, > - DAPL_DBG_TYPE_SRQ = 0x0800 > + DAPL_DBG_TYPE_SRQ = 0x0800, > + DAPL_DBG_TYPE_CNTR = 0x1000 > + > } DAPL_DBG_TYPE; > > typedef enum > @@ -110,12 +112,21 @@ extern void dapl_internal_dbg_log ( DAPL > #define DCNT_EVD_DEQUEUE_NOT_FOUND 18 > #define DCNT_TIMER_SET 19 > #define DCNT_TIMER_CANCEL 20 > -#define DCNT_LAST_COUNTER 22 /* Always the last counter */ > +#define DCNT_LAST_COUNTER 21 /* Always the last counter */ What do you think of changing the name of DCNT_LAST_COUNTER to DCNT_NUM_COUNTERS? 
> +#define DCNT_ALL_COUNTERS DCNT_LAST_COUNTER > > #if defined(DAPL_COUNTERS) > -#include "dapl_counters.h" > > -#define DAPL_CNTR(cntr) dapl_os_atomic_inc (&dapl_dbg_counters[cntr]); > +extern void dapl_dump_cntr( int cntr ); > +extern int dapl_dbg_counters[]; > + > +#define DAPL_CNTR(cntr) dapl_os_atomic_inc (&dapl_dbg_counters[cntr]); > +#define DAPL_DUMP_CNTR(cntr) dapl_dump_cntr( cntr ); > +#define DAPL_COUNTERS_INIT() > +#define DAPL_COUNTERS_NEW(__tag, __id) > +#define DAPL_COUNTERS_RESET(__id, __incr) > +#define DAPL_COUNTERS_INCR(__id, __incr) > + > #else > > #define DAPL_CNTR(cntr) > Index: dapl/common/dapl_debug.c > =================================================================== > --- dapl/common/dapl_debug.c (revision 2967) > +++ dapl/common/dapl_debug.c (working copy) > @@ -58,7 +58,7 @@ void dapl_internal_dbg_log ( DAPL_DBG_TY > } > > #if defined(DAPL_COUNTERS) > -long dapl_dbg_counters[DAPL_CNTR_MAX]; > +int dapl_dbg_counters[DCNT_LAST_COUNTER+1] = { 0 }; How about making the array size equal to DCNT_LAST_COUNTER (aka DCNT_NUM_COUNTERS) and ... > > /* > * The order of this list must match exactly with the #defines > @@ -89,6 +89,22 @@ char *dapl_dbg_counter_names[] = { > 0 ... getting rid of this extra placeholder. 
> }; > > +void dapl_dump_cntr( int cntr ) > +{ > + int i; > + > + for ( i=0;i + { > + if (( cntr == i ) || ( cntr == DCNT_ALL_COUNTERS )) > + { > + dapl_dbg_log ( DAPL_DBG_TYPE_CNTR, > + "DAPL Counter: %s = %lu \n", > + dapl_dbg_counter_names[i], > + dapl_dbg_counters[i] ); > + } > + } > +} > + > #endif /* DAPL_COUNTERS */ > #endif From halr at voltaire.com Fri Aug 5 09:17:03 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Aug 2005 12:17:03 -0400 Subject: [openib-general] [RFC] IPoIB sockaddr_ll changes In-Reply-To: <52fyto74hq.fsf@cisco.com> References: <1123256537.4451.7.camel@hal.voltaire.com> <52fyto74hq.fsf@cisco.com> Message-ID: <1123258622.4451.83.camel@hal.voltaire.com> On Fri, 2005-08-05 at 12:06, Roland Dreier wrote: > Hal> Hi, I would like to get comments on this prior to sending > Hal> this over to kernel land. > > When you send it, make sure to include netdev at vger.kernel.org so that > all the networking people see it. Yes, that's where I was going to send it. Does it need to go to lkml as well or would that be handled by netdev ? > SOCKADDR_LL_COMPAT is pretty ugly but I'm not sure I see a better > solution right now. I agree but I couldn't see a better way (either). > Hal> IPoIB sockaddr_ll changes due to the fact that the IPoIB link > Hal> layer address is 20 bytes rather than 8 bytes > > You'll want to expand the explanation here so that it's clear what is > being done and why. OK. How about: The current link level address accomodates MAC addresses which are 8 bytes. IPoIB link level addresses are composed of a GID (Global Identifier) which is 16 bytes and a QPN (Queue Pair Number) which is 3 bytes with 1 byte Reserved. So in order to support IPoIB interfaces, the link layer address needs to be increased from 8 to 20 bytes. 
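As a rough illustration of the 20-byte layout described above, here is a minimal userspace sketch that splits a raw IPoIB link-layer address into its parts. The type and function names (`ipoib_hw_addr`, `ipoib_hw_addr_parse`, `fits_in_sll_addr`) are hypothetical and exist only for this example. The field order (reserved byte, then QPN, then GID) is an assumption that is consistent with the broadcast bytes in the arping patch in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IPOIB_HW_ADDR_LEN 20   /* 1 reserved + 3 QPN + 16 GID bytes */
#define OLD_SLL_ADDR_LEN   8   /* sll_addr size before the proposed change */

/* Hypothetical illustration type; not a kernel or libc structure. */
struct ipoib_hw_addr {
	uint8_t  reserved;
	uint32_t qpn;        /* only the low 24 bits are meaningful */
	uint8_t  gid[16];
};

/* IPv4 broadcast address bytes as given in the arping patch. */
static const uint8_t ipoib_ipv4_bcast[IPOIB_HW_ADDR_LEN] = {
	0x00, 0xff, 0xff, 0xff,
	0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff
};

/*
 * Split a raw 20-byte address into reserved byte, QPN and GID.
 * Assumed order: byte 0 reserved, bytes 1-3 QPN (big endian),
 * bytes 4-19 GID.
 */
static void ipoib_hw_addr_parse(const uint8_t *raw, struct ipoib_hw_addr *out)
{
	assert(raw != NULL && out != NULL);
	out->reserved = raw[0];
	out->qpn = ((uint32_t)raw[1] << 16) | ((uint32_t)raw[2] << 8) | raw[3];
	memcpy(out->gid, raw + 4, sizeof(out->gid));
}

/* Returns 1 if an address of len bytes fits in a buffer of the given capacity. */
static int fits_in_sll_addr(size_t len, size_t capacity)
{
	return len <= capacity;
}
```

Parsing `ipoib_ipv4_bcast` this way gives a QPN of 0xffffff and a GID starting with ff 12, and `fits_in_sll_addr(IPOIB_HW_ADDR_LEN, OLD_SLL_ADDR_LEN)` is false, which is the whole problem in one line.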
> > + if ((msg->msg_namelen != sizeof(struct sockaddr_ll) - SOCKADDR_LL_COMPAT) || (saddr->sll_hatype == ARPHRD_INFINIBAND)) > > I know I say that it's OK to fudge on line lengths a little over 80 > characters, but this line is way too long. > > > + if ((addr_len != sizeof(struct sockaddr_ll) - SOCKADDR_LL_COMPAT) || > > + (sll->sll_hatype == ARPHRD_INFINIBAND)) > > Fix the indendation here so that (sll-> lines up with (addr_len != I'll send out v2 of this patch shortly. --- Hal From rolandd at cisco.com Fri Aug 5 09:26:10 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 05 Aug 2005 09:26:10 -0700 Subject: [openib-general] [RFC] IPoIB sockaddr_ll changes In-Reply-To: <1123258622.4451.83.camel@hal.voltaire.com> (Hal Rosenstock's message of "05 Aug 2005 12:17:03 -0400") References: <1123256537.4451.7.camel@hal.voltaire.com> <52fyto74hq.fsf@cisco.com> <1123258622.4451.83.camel@hal.voltaire.com> Message-ID: <527jf073kt.fsf@cisco.com> Hal> The current link level address accomodates MAC addresses Hal> which are 8 bytes. IPoIB link level addresses are composed of Hal> a GID (Global Identifier) which is 16 bytes and a QPN (Queue Hal> Pair Number) which is 3 bytes with 1 byte Reserved. So in Hal> order to support IPoIB interfaces, the link layer address Hal> needs to be increased from 8 to 20 bytes. I don't think we need more detail on the format of the IPoIB address. It's OK just to say that it's 20 bytes long. What needs more explanation is why struct sockaddr_ll has to grow. What breaks with the current definition? - R. 
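One way to look at what breaks with the current definition is the size arithmetic. The sketch below is a userspace illustration, not the kernel code: `sockaddr_ll_old`, `sockaddr_ll_new` and `ll_addrlen_ok` are hypothetical names, the two structs copy the field list of struct sockaddr_ll with an 8-byte and a 20-byte sll_addr, and the ARPHRD_* values are redefined locally so the example stands alone. The check mirrors the logic in the posted patch: a caller built against the old, shorter struct is still accepted unless it claims an InfiniBand hardware type.

```c
#include <assert.h>
#include <stddef.h>

#define ARPHRD_ETHER        1   /* values as in linux/if_arp.h */
#define ARPHRD_INFINIBAND  32
#define SOCKADDR_LL_COMPAT 12   /* from the patch: 20 - 8 */

/* Layout before the change (8-byte sll_addr). */
struct sockaddr_ll_old {
	unsigned short sll_family;
	unsigned short sll_protocol;
	int            sll_ifindex;
	unsigned short sll_hatype;
	unsigned char  sll_pkttype;
	unsigned char  sll_halen;
	unsigned char  sll_addr[8];
};

/* Layout after the change (20-byte sll_addr). */
struct sockaddr_ll_new {
	unsigned short sll_family;
	unsigned short sll_protocol;
	int            sll_ifindex;
	unsigned short sll_hatype;
	unsigned char  sll_pkttype;
	unsigned char  sll_halen;
	unsigned char  sll_addr[20];
};

/* The compat constant is exactly the growth of the structure. */
_Static_assert(sizeof(struct sockaddr_ll_new) - sizeof(struct sockaddr_ll_old)
	       == SOCKADDR_LL_COMPAT, "compat constant must equal size growth");

/*
 * Returns 1 if a caller-supplied address length is acceptable:
 * either the full new size (or larger), or exactly the old size
 * for any hardware type other than InfiniBand.
 */
static int ll_addrlen_ok(size_t addr_len, unsigned short hatype)
{
	if (addr_len >= sizeof(struct sockaddr_ll_new))
		return 1;
	if (addr_len != sizeof(struct sockaddr_ll_new) - SOCKADDR_LL_COMPAT)
		return 0;
	return hatype != ARPHRD_INFINIBAND;
}
```

With these definitions an old-size Ethernet sockaddr passes, an old-size InfiniBand sockaddr is rejected because its 20-byte address cannot have been supplied in 8 bytes, and anything of the new size passes.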
From halr at voltaire.com Fri Aug 5 09:23:55 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Aug 2005 12:23:55 -0400 Subject: [openib-general] RE: OpenSM Work In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C305D7@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607C305D7@mtlex01.yok.mtl.com> Message-ID: <1123259035.4451.92.camel@hal.voltaire.com> Hi Eitan, While I don't understand why the update for 1.8.0 can't be done by patches which is the usual way (I think it could be broken at least into complib, then vendor lib, then SM, and finally SA changes), I will work on merging Yael's branch for this. Note that there may be some back and forth on this similar to comments on patches. In the future, I hope that work can be done in smaller incremental pieces and with patches. As to the directory structure, there are projects which follow the structure which is being used in the OpenIB tree. The makefiles already do install the headers. That being said the directory structure is not cast in stone but there is a lot of churn here to change it. Are there any other clear benefits ? Does it somehow make your internal development easier ? If that is it, I don't see why a correspondence script wouldn't work. Typically things like this are community decided. I would think the simulator work is separable and would prefer to hold off on this until the OpenSM merge is done and working. That alone seems like a lot to swallow at once. Finally, as to feedback on the proposals for OpenSM work, as I recall, there were responses from both Tom and myself both being supportive of new functionality and some design review issues (particularly relating to routing algorithms proposed). I would expect this work to generate more feedback as there is code to go along with it or possibly even with an update on the design approach. 
-- Hal From halr at voltaire.com Fri Aug 5 09:42:25 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Aug 2005 12:42:25 -0400 Subject: [openib-general] [PATCHv2] sockaddr_ll changes for IPoIB interfaces Message-ID: <1123260142.4451.129.camel@hal.voltaire.com> Hi again, This is v2 of this. There is a similar change to both: /usr/include/linux/if_packet.h /usr/include/netpacket/packet.h as in: include/linux/if_packet.h below to increase sll_addr from 8 to 20 bytes. Thanks. -- Hal sockaddr_ll changes to accomodate IPoIB interfaces. This is due to the fact that the IPoIB link layer address is 20 bytes rather than 8 bytes. With the current 8 byte address, it is not possible to send ARPs and RARPs from userspace as the broadcast and unicast IPoIB addresses cannot be supplied properly. Signed-off-by: Hal Rosenstock --- include/linux/if_packet.h.orig 2005-06-29 19:00:53.000000000 -0400 +++ include/linux/if_packet.h 2005-08-05 10:04:06.000000000 -0400 @@ -8,6 +8,7 @@ struct sockaddr_pkt unsigned short spkt_protocol; }; +#define SOCKADDR_LL_COMPAT 12 struct sockaddr_ll { unsigned short sll_family; @@ -16,7 +17,7 @@ struct sockaddr_ll unsigned short sll_hatype; unsigned char sll_pkttype; unsigned char sll_halen; - unsigned char sll_addr[8]; + unsigned char sll_addr[20]; }; /* Packet types */ --- af_packet.c.orig 2005-06-29 19:00:53.000000000 -0400 +++ af_packet.c 2005-08-05 12:40:52.000000000 -0400 @@ -140,7 +140,7 @@ dev->hard_header == NULL (ll header is a mac.raw -> data data -> data - We should set nh.raw on output to correct posistion, + We should set nh.raw on output to correct position, packet classifier depends on it. */ @@ -315,7 +315,7 @@ static int packet_sendmsg_spkt(struct ki struct net_device *dev; unsigned short proto=0; int err; - + /* * Get and verify the address. 
*/ @@ -708,8 +708,12 @@ static int packet_sendmsg(struct kiocb * addr = NULL; } else { err = -EINVAL; - if (msg->msg_namelen < sizeof(struct sockaddr_ll)) - goto out; + if (msg->msg_namelen < sizeof(struct sockaddr_ll)) { + /* Support for older sockaddr_ll structs */ + if ((msg->msg_namelen != sizeof(struct sockaddr_ll) - SOCKADDR_LL_COMPAT) || + (saddr->sll_hatype == ARPHRD_INFINIBAND)) + goto out; + } ifindex = saddr->sll_ifindex; proto = saddr->sll_protocol; addr = saddr->sll_addr; @@ -937,7 +941,11 @@ static int packet_bind(struct socket *so */ if (addr_len < sizeof(struct sockaddr_ll)) - return -EINVAL; + /* Support for older sockaddr_ll structs */ + if ((addr_len != sizeof(struct sockaddr_ll) - + SOCKADDR_LL_COMPAT) || + (sll->sll_hatype == ARPHRD_INFINIBAND)) + return -EINVAL; if (sll->sll_family != AF_PACKET) return -EINVAL; From tomduffy at gmail.com Fri Aug 5 10:06:56 2005 From: tomduffy at gmail.com (Tom Duffy) Date: Fri, 5 Aug 2005 10:06:56 -0700 Subject: [openib-general] [RFC] IPoIB sockaddr_ll changes In-Reply-To: <1123256537.4451.7.camel@hal.voltaire.com> References: <1123256537.4451.7.camel@hal.voltaire.com> Message-ID: <9d3b7de70508051006104ee679@mail.gmail.com> On 05 Aug 2005 11:42:17 -0400, Hal Rosenstock wrote: > I would like to get comments on this prior to sending this over to > kernel land. > --- net/packet/af_packet.c.orig 2005-06-29 19:00:53.000000000 -0400 > +++ net/packet/af_packet.c 2005-08-05 11:23:24.000000000 -0400 > @@ -140,7 +140,7 @@ dev->hard_header == NULL (ll header is a > mac.raw -> data > data -> data > > - We should set nh.raw on output to correct posistion, > + We should set nh.raw on output to correct position, > packet classifier depends on it. > */ > > @@ -315,7 +315,7 @@ static int packet_sendmsg_spkt(struct ki > struct net_device *dev; > unsigned short proto=0; > int err; > - > + > /* > * Get and verify the address. > */ Both of these are whitespace changes that shouldn't be sent with this patch. 
-tduffy From iod00d at hp.com Fri Aug 5 10:16:04 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 5 Aug 2005 10:16:04 -0700 Subject: [openib-general] ib_memory_register() equivalent in gen2 In-Reply-To: <52oe8d6npk.fsf@cisco.com> References: <42F28519.6070004@psc.edu> <52oe8d6npk.fsf@cisco.com> Message-ID: <20050805171604.GB25121@esmail.cup.hp.com> On Thu, Aug 04, 2005 at 08:56:39PM -0700, Roland Dreier wrote: > This is intentional, because > there is not a general way to translate a kernel virtual address to a > bus address that can be passed to a device. Paul, Well, IB can't offer a general method. But the kernel does have one and it's documented in Documentation/DMA-API.txt. See SDP and IPoIB for use of dma_map_single() and related calls. grant From halr at voltaire.com Fri Aug 5 10:12:26 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Aug 2005 13:12:26 -0400 Subject: [openib-general] [RFC] IPoIB sockaddr_ll changes In-Reply-To: <9d3b7de70508051006104ee679@mail.gmail.com> References: <1123256537.4451.7.camel@hal.voltaire.com> <9d3b7de70508051006104ee679@mail.gmail.com> Message-ID: <1123261867.4451.170.camel@hal.voltaire.com> On Fri, 2005-08-05 at 13:06, Tom Duffy wrote: > On 05 Aug 2005 11:42:17 -0400, Hal Rosenstock wrote: > > I would like to get comments on this prior to sending this over to > > kernel land. > > > --- net/packet/af_packet.c.orig 2005-06-29 19:00:53.000000000 -0400 > > +++ net/packet/af_packet.c 2005-08-05 11:23:24.000000000 -0400 > > @@ -140,7 +140,7 @@ dev->hard_header == NULL (ll header is a > > mac.raw -> data > > data -> data > > > > - We should set nh.raw on output to correct posistion, > > + We should set nh.raw on output to correct position, > > packet classifier depends on it. > > */ > > > > @@ -315,7 +315,7 @@ static int packet_sendmsg_spkt(struct ki > > struct net_device *dev; > > unsigned short proto=0; > > int err; > > - > > + > > /* > > * Get and verify the address. 
> > */ > > Both of these are whitespace changes that shouldn't be sent with this patch. The latter is, but it is removing the whitespace. Should that not be done ? The former fixes a typo in a comment. -- Hal From tomduffy at gmail.com Fri Aug 5 10:21:24 2005 From: tomduffy at gmail.com (Tom Duffy) Date: Fri, 5 Aug 2005 10:21:24 -0700 Subject: [openib-general] [RFC] IPoIB sockaddr_ll changes In-Reply-To: <1123261867.4451.170.camel@hal.voltaire.com> References: <1123256537.4451.7.camel@hal.voltaire.com> <9d3b7de70508051006104ee679@mail.gmail.com> <1123261867.4451.170.camel@hal.voltaire.com> Message-ID: <9d3b7de7050805102112daa72c@mail.gmail.com> > The latter is, but it is removing the whitespace. Should that not be done ? Maybe in a different patch. > The former fixes a typo in a comment. I just don't want people to confuse the issues. One idea per patch, that sort of thing. -tduffy From jlentini at netapp.com Fri Aug 5 10:29:35 2005 From: jlentini at netapp.com (James Lentini) Date: Fri, 5 Aug 2005 13:29:35 -0400 (EDT) Subject: [openib-general] [ANNOUNCE][uDAPL] uDAPL available in the trunk Message-ID: An implementation of the Userspace Direct Access Programming Library (uDAPL) for OpenIB is now available in the trunk at https://openib.org/svn/gen2/trunk/src/userspace/dapl/ Arlin Davis contributed the code needed to support OpenIB in uDAPL. I'd like to thank Arlin for his hard work. I'd also like to thank Hal Rosenstock for his uIBAT library, which uDAPL uses for address resolution. Thanks, james P.S. Once the dust settles, I will remove the development copy we were using in the "jlentini" branch.
From halr at voltaire.com Fri Aug 5 10:30:34 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Aug 2005 13:30:34 -0400 Subject: [openib-general] dtest makefile patch Message-ID: <1123263034.4451.186.camel@hal.voltaire.com> Fix dtest makefile on trunk Signed-off-by: Hal Rosenstock Index: makefile =================================================================== -- makefile (revision 2986) +++ makefile (working copy) @@ -1,16 +1,16 @@ CC = gcc CFLAGS = -O2 -g -DAT_INC = ../dat/include +DAT_INC = ../../dat/include DAT_LIB = /usr/lib64 all: dtest clean: - rm -f *.o;touch *.c;rm -f dtest + rm -f *.o;touch *.c;rm -f dtest dtest: ./dtest.c - $(CC) $(CFLAGS) ./dtest.c -o dtest \ - -DDAPL_PROVIDER='"IB1"' \ - -I $(DAT_INC) -L $(DAT_LIB) -ldat + $(CC) $(CFLAGS) ./dtest.c -o dtest \ + -DDAPL_PROVIDER='"IB1"' \ + -I $(DAT_INC) -L $(DAT_LIB) -ldat From jlentini at netapp.com Fri Aug 5 10:52:34 2005 From: jlentini at netapp.com (James Lentini) Date: Fri, 5 Aug 2005 13:52:34 -0400 (EDT) Subject: [openib-general] Re: dtest makefile patch In-Reply-To: <1123263034.4451.186.camel@hal.voltaire.com> References: <1123263034.4451.186.camel@hal.voltaire.com> Message-ID: On Fri, 5 Aug 2005, Hal Rosenstock wrote: > Fix dtest makefile on trunk Committed in revision 2989. From yhlu.kernel at gmail.com Fri Aug 5 11:03:36 2005 From: yhlu.kernel at gmail.com (yhlu) Date: Fri, 5 Aug 2005 11:03:36 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <52slxp6o5b.fsf@cisco.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <86802c44050803175873fb0569@mail.gmail.com> <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> Message-ID: <86802c440508051103500f6942@mail.gmail.com> You are right. 
CONG_SPECIAL_QP ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) ib_mthca: Initializing Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (0000:04:00.0) ib_mthca 0000:04:00.0: FW version 000400060002, max commands 64 ib_mthca 0000:04:00.0: FW size 6143 KB (start fcefa00000, end fcefffffff) ib_mthca 0000:04:00.0: HCA memory size 262143 KB (start fce0000000, end fcefffffff) ib_mthca 0000:04:00.0: Max QPs: 16777216, reserved QPs: 1024, entry size: 256 ib_mthca 0000:04:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 64 ib_mthca 0000:04:00.0: Max EQs: 64, reserved EQs: 1, entry size: 64 ib_mthca 0000:04:00.0: reserved MPTs: 16, reserved MTTs: 16 ib_mthca 0000:04:00.0: Max PDs: 16777216, reserved PDs: 0, reserved UARs: 1 ib_mthca 0000:04:00.0: Max QP/MCG: 16777216, reserved MGMs: 0 ib_mthca 0000:04:00.0: Flags: 00370347 ib_mthca 0000:04:00.0: profile[ 0]--10/20 @ 0x fce0000000 (size 0x 4000000) ib_mthca 0000:04:00.0: profile[ 1]-- 0/16 @ 0x fce4000000 (size 0x 1000000) ib_mthca 0000:04:00.0: profile[ 2]-- 7/18 @ 0x fce5000000 (size 0x 800000) ib_mthca 0000:04:00.0: profile[ 3]-- 9/17 @ 0x fce5800000 (size 0x 800000) ib_mthca 0000:04:00.0: profile[ 4]-- 3/16 @ 0x fce6000000 (size 0x 400000) ib_mthca 0000:04:00.0: profile[ 5]-- 4/16 @ 0x fce6400000 (size 0x 200000) ib_mthca 0000:04:00.0: profile[ 6]--12/15 @ 0x fce6600000 (size 0x 100000) ib_mthca 0000:04:00.0: profile[ 7]-- 8/13 @ 0x fce6700000 (size 0x 80000) ib_mthca 0000:04:00.0: profile[ 8]--11/11 @ 0x fce6780000 (size 0x 10000) ib_mthca 0000:04:00.0: profile[ 9]-- 6/ 5 @ 0x fce6790000 (size 0x 800) ib_mthca 0000:04:00.0: HCA memory: allocated 106050 KB/256000 KB (149950 KB free) ib_mthca 0000:04:00.0: Allocated EQ 1 with 65536 entries ib_mthca 0000:04:00.0: Allocated EQ 2 with 128 entries ib_mthca 0000:04:00.0: Allocated EQ 3 with 128 entries ib_mthca 0000:04:00.0: Setting mask 00000000000f43fe for eqn 2 ib_mthca 0000:04:00.0: Setting mask 0000000000000400 for eqn 3 
ib_mthca 0000:04:00.0: NOP command IRQ test passed ib_mthca 0000:04:00.0: mthca_init_qp_table: mthca_CONF_SPECIAL_QP failed for 0/1024 (-16) ib_mthca 0000:04:00.0: Failed to initialize queue pair table, aborting. ib_mthca 0000:04:00.0: Clearing mask 00000000000f43fe for eqn 2 ib_mthca 0000:04:00.0: Clearing mask 0000000000000400 for eqn 3 ib_mthca: probe of 0000:04:00.0 failed with error -16 From yhlu.kernel at gmail.com Fri Aug 5 11:07:27 2005 From: yhlu.kernel at gmail.com (yhlu) Date: Fri, 5 Aug 2005 11:07:27 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <86802c440508051103500f6942@mail.gmail.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> Message-ID: <86802c4405080511079d01532@mail.gmail.com> P.S. Some kernel PCI code patch broke something last night. It masks out bits [32-39]: ib_mthca 0000:04:00.0: profile[ 0]--10/20 @ 0x e0000000 (size 0x 4000000) ib_mthca 0000:04:00.0: profile[ 1]-- 0/16 @ 0x e4000000 (size 0x 1000000) ib_mthca 0000:04:00.0: profile[ 2]-- 7/18 @ 0x e5000000 (size 0x 800000) ib_mthca 0000:04:00.0: profile[ 3]-- 9/17 @ 0x e5800000 (size 0x 800000) ib_mthca 0000:04:00.0: profile[ 4]-- 3/16 @ 0x e6000000 (size 0x 400000) ib_mthca 0000:04:00.0: profile[ 5]-- 4/16 @ 0x e6400000 (size 0x 200000) ib_mthca 0000:04:00.0: profile[ 6]--12/15 @ 0x e6600000 (size 0x 100000) ib_mthca 0000:04:00.0: profile[ 7]-- 8/13 @ 0x e6700000 (size 0x 80000) ib_mthca 0000:04:00.0: profile[ 8]--11/11 @ 0x e6780000 (size 0x 10000) ib_mthca 0000:04:00.0: profile[ 9]-- 6/ 5 @ 0x e6790000 (size 0x 800) YH On 8/5/05, yhlu wrote: > You are right.
CONG_SPECIAL_QP > > ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) > ib_mthca: Initializing Mellanox Technologies MT25208 InfiniHost III Ex > (Tavor compatibility mode) (0000:04:00.0) > ib_mthca 0000:04:00.0: FW version 000400060002, max commands 64 > ib_mthca 0000:04:00.0: FW size 6143 KB (start fcefa00000, end fcefffffff) > ib_mthca 0000:04:00.0: HCA memory size 262143 KB (start fce0000000, > end fcefffffff) > ib_mthca 0000:04:00.0: Max QPs: 16777216, reserved QPs: 1024, entry size: 256 > ib_mthca 0000:04:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 64 > ib_mthca 0000:04:00.0: Max EQs: 64, reserved EQs: 1, entry size: 64 > ib_mthca 0000:04:00.0: reserved MPTs: 16, reserved MTTs: 16 > ib_mthca 0000:04:00.0: Max PDs: 16777216, reserved PDs: 0, reserved UARs: 1 > ib_mthca 0000:04:00.0: Max QP/MCG: 16777216, reserved MGMs: 0 > ib_mthca 0000:04:00.0: Flags: 00370347 > ib_mthca 0000:04:00.0: profile[ 0]--10/20 @ 0x fce0000000 (size 0x 4000000) > ib_mthca 0000:04:00.0: profile[ 1]-- 0/16 @ 0x fce4000000 (size 0x 1000000) > ib_mthca 0000:04:00.0: profile[ 2]-- 7/18 @ 0x fce5000000 (size 0x 800000) > ib_mthca 0000:04:00.0: profile[ 3]-- 9/17 @ 0x fce5800000 (size 0x 800000) > ib_mthca 0000:04:00.0: profile[ 4]-- 3/16 @ 0x fce6000000 (size 0x 400000) > ib_mthca 0000:04:00.0: profile[ 5]-- 4/16 @ 0x fce6400000 (size 0x 200000) > ib_mthca 0000:04:00.0: profile[ 6]--12/15 @ 0x fce6600000 (size 0x 100000) > ib_mthca 0000:04:00.0: profile[ 7]-- 8/13 @ 0x fce6700000 (size 0x 80000) > ib_mthca 0000:04:00.0: profile[ 8]--11/11 @ 0x fce6780000 (size 0x 10000) > ib_mthca 0000:04:00.0: profile[ 9]-- 6/ 5 @ 0x fce6790000 (size 0x 800) > ib_mthca 0000:04:00.0: HCA memory: allocated 106050 KB/256000 KB > (149950 KB free) > ib_mthca 0000:04:00.0: Allocated EQ 1 with 65536 entries > ib_mthca 0000:04:00.0: Allocated EQ 2 with 128 entries > ib_mthca 0000:04:00.0: Allocated EQ 3 with 128 entries > ib_mthca 0000:04:00.0: Setting mask 00000000000f43fe for eqn 2 > 
ib_mthca 0000:04:00.0: Setting mask 0000000000000400 for eqn 3 > ib_mthca 0000:04:00.0: NOP command IRQ test passed > ib_mthca 0000:04:00.0: mthca_init_qp_table: mthca_CONF_SPECIAL_QP > failed for 0/1024 (-16) > ib_mthca 0000:04:00.0: Failed to initialize queue pair table, aborting. > ib_mthca 0000:04:00.0: Clearing mask 00000000000f43fe for eqn 2 > ib_mthca 0000:04:00.0: Clearing mask 0000000000000400 for eqn 3 > ib_mthca: probe of 0000:04:00.0 failed with error -16 > From rolandd at cisco.com Fri Aug 5 11:11:09 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 05 Aug 2005 11:11:09 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <86802c440508051103500f6942@mail.gmail.com> (yhlu's message of "Fri, 5 Aug 2005 11:03:36 -0700") References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <86802c44050803175873fb0569@mail.gmail.com> <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> Message-ID: <52u0i45k5e.fsf@cisco.com> > ib_mthca 0000:04:00.0: FW version 000400060002, max commands 64 This is FW 4.6.2 -- 4.7.0 has been released, so it might be worth trying that. > ib_mthca 0000:04:00.0: NOP command IRQ test passed > ib_mthca 0000:04:00.0: mthca_init_qp_table: mthca_CONF_SPECIAL_QP failed for 0/1024 (-16) Hmm, looks like CONF_SPECIAL_QP is timing out. MST (or any Mellanox people), any idea why this might happening? The NOP command is working fine with interrupts, but CONF_SPECIAL_QP is timing out. The difference from the working setup is that the HCA's local memory is mapped above 4 GB. - R. 
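For readers following the log, here is a minimal C sketch of how the 12-hex-digit "FW version" value maps to the dotted form quoted above (000400060002 read as 4.6.2), and of how the -16 in the failure lines can be recognised. The three 16-bit fields are inferred from that single example, and the helper names (fw_version_decode, fw_version_format, is_busy_error) are hypothetical, not taken from the mthca source. On Linux, errno 16 is EBUSY, which is consistent with reading the failure as a command that timed out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type: the three parts of a firmware version. */
struct fw_version {
	unsigned major;
	unsigned minor;
	unsigned subminor;
};

/*
 * Split a packed firmware version into its fields.
 * Assumed layout (inferred from 000400060002 -> 4.6.2):
 * bits 47..32 major, bits 31..16 minor, bits 15..0 sub-minor.
 */
static struct fw_version fw_version_decode(uint64_t packed)
{
	struct fw_version v;

	v.major    = (unsigned)((packed >> 32) & 0xffff);
	v.minor    = (unsigned)((packed >> 16) & 0xffff);
	v.subminor = (unsigned)(packed & 0xffff);
	return v;
}

/* Write "major.minor.subminor" into buf; returns what snprintf returns. */
static int fw_version_format(uint64_t packed, char *buf, size_t len)
{
	struct fw_version v = fw_version_decode(packed);

	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%u.%u.%u", v.major, v.minor, v.subminor);
}

/* True if a kernel-style negative return code means "device or resource busy". */
static int is_busy_error(int ret)
{
	return ret == -EBUSY;
}
```

Passing 0x000400060002 to fw_version_format() fills the buffer with the same dotted string used in the reply above, and is_busy_error(-16) is true on Linux.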
From rolandd at cisco.com Fri Aug 5 11:13:14 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 05 Aug 2005 11:13:14 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <86802c4405080511079d01532@mail.gmail.com> (yhlu's message of "Fri, 5 Aug 2005 11:07:27 -0700") References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> <86802c4405080511079d01532@mail.gmail.com> Message-ID: <52psss5k1x.fsf@cisco.com> yhlu> ps. some kernel pci code patch broke sth yesterday night. yhlu> it mask out bit [32-39] Is it possible that all your problems are coming from the PCI setup code incorrectly assigning BARs? - R. From yhlu.kernel at gmail.com Fri Aug 5 11:26:38 2005 From: yhlu.kernel at gmail.com (yhlu) Date: Fri, 5 Aug 2005 11:26:38 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <52psss5k1x.fsf@cisco.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> <86802c4405080511079d01532@mail.gmail.com> <52psss5k1x.fsf@cisco.com> Message-ID: <86802c44050805112661d889aa@mail.gmail.com> before I do the cg-update this morning, it didn't mask out the upper 8 bit. YH On 8/5/05, Roland Dreier wrote: > yhlu> ps. some kernel pci code patch broke sth yesterday night. > yhlu> it mask out bit [32-39] > > Is it possible that all your problems are coming from the PCI setup > code incorrectly assigning BARs? > > - R. 
>
From halr at voltaire.com Fri Aug 5 11:25:57 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Aug 2005 14:25:57 -0400
Subject: [openib-general] [PATCH] sockaddr_ll change for IPoIB interfaces
Message-ID: <1123266356.4451.243.camel@hal.voltaire.com>

Hi,

The patch below accommodates the IPoIB link layer address in the
sockaddr_ll struct so that user space can send and receive IPoIB link
layer packets. Unfortunately, IPoIB has 20-byte LL addresses rather than
the 8-byte MAC addresses (or shorter) used by other LLs.

The same change, increasing sll_addr from 8 to 20 bytes, is needed in both:
/usr/include/linux/if_packet.h
/usr/include/netpacket/packet.h
as in include/linux/if_packet.h below.

Thanks.

-- Hal

sockaddr_ll changes to accommodate IPoIB interfaces. This is due to the
fact that the IPoIB link layer address is 20 bytes rather than 8 bytes.
With the current 8-byte address, it is not possible to send ARPs and
RARPs from userspace, as the broadcast and unicast IPoIB addresses cannot
be supplied properly.

There is backward compatibility support for those applications built
with the existing structure (prior to this patch).
Signed-off-by: Hal Rosenstock --- include/linux/if_packet.h.orig 2005-06-29 19:00:53.000000000 -0400 +++ include/linux/if_packet.h 2005-08-05 10:04:06.000000000 -0400 @@ -8,6 +8,7 @@ struct sockaddr_pkt unsigned short spkt_protocol; }; +#define SOCKADDR_LL_COMPAT 12 struct sockaddr_ll { unsigned short sll_family; @@ -16,7 +17,7 @@ struct sockaddr_ll unsigned short sll_hatype; unsigned char sll_pkttype; unsigned char sll_halen; - unsigned char sll_addr[8]; + unsigned char sll_addr[20]; }; /* Packet types */ --- af_packet.c.orig 2005-06-29 19:00:53.000000000 -0400 +++ af_packet.c 2005-08-05 13:28:49.000000000 -0400 @@ -708,8 +708,12 @@ static int packet_sendmsg(struct kiocb * addr = NULL; } else { err = -EINVAL; - if (msg->msg_namelen < sizeof(struct sockaddr_ll)) - goto out; + if (msg->msg_namelen < sizeof(struct sockaddr_ll)) { + /* Support for older sockaddr_ll structs */ + if ((msg->msg_namelen != sizeof(struct sockaddr_ll) - SOCKADDR_LL_COMPAT) || + (saddr->sll_hatype == ARPHRD_INFINIBAND)) + goto out; + } ifindex = saddr->sll_ifindex; proto = saddr->sll_protocol; addr = saddr->sll_addr; @@ -937,7 +941,11 @@ static int packet_bind(struct socket *so */ if (addr_len < sizeof(struct sockaddr_ll)) - return -EINVAL; + /* Support for older sockaddr_ll structs */ + if ((addr_len != sizeof(struct sockaddr_ll) - + SOCKADDR_LL_COMPAT) || + (sll->sll_hatype == ARPHRD_INFINIBAND)) + return -EINVAL; if (sll->sll_family != AF_PACKET) return -EINVAL; From tom at ammasso.com Fri Aug 5 11:45:48 2005 From: tom at ammasso.com (Tom Tucker) Date: Fri, 5 Aug 2005 14:45:48 -0400 Subject: [openib-general] iWARP Branch Message-ID: <8E9D028761D8264D910612167E8457E8FA3686@mail2.ammasso.com> Matt: Could you please create a branch so that we can drop our driver? 
Thanks, Tom Tucker From tomduffy at gmail.com Fri Aug 5 11:47:09 2005 From: tomduffy at gmail.com (Tom Duffy) Date: Fri, 5 Aug 2005 11:47:09 -0700 Subject: [openib-general] iWARP Branch In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3686@mail2.ammasso.com> References: <8E9D028761D8264D910612167E8457E8FA3686@mail2.ammasso.com> Message-ID: <9d3b7de705080511476b4e7ab2@mail.gmail.com> On 8/5/05, Tom Tucker wrote: > Could you please create a branch so that we can drop our driver? How about a patch? -tduffy From jim.ryan at intel.com Fri Aug 5 11:49:28 2005 From: jim.ryan at intel.com (Ryan, Jim) Date: Fri, 5 Aug 2005 11:49:28 -0700 Subject: [openib-general] iWARP Branch Message-ID: Tom, this comes from a discussion just concluded. TomT is talking about a significant amount of code that seemed best suited to a branch. That's what Matt suggested and TomT was complying with the request you saw Jim -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Tom Duffy Sent: Friday, August 05, 2005 11:47 AM To: Tom Tucker Cc: openib-general at openib.org Subject: Re: [openib-general] iWARP Branch On 8/5/05, Tom Tucker wrote: > Could you please create a branch so that we can drop our driver? How about a patch? -tduffy _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mlleini at ca.sandia.gov Fri Aug 5 12:08:36 2005 From: mlleini at ca.sandia.gov (Matt L. 
Leininger) Date: Fri, 05 Aug 2005 12:08:36 -0700 Subject: [openib-general] iWARP Branch In-Reply-To: <9d3b7de705080511476b4e7ab2@mail.gmail.com> References: <8E9D028761D8264D910612167E8457E8FA3686@mail2.ammasso.com> <9d3b7de705080511476b4e7ab2@mail.gmail.com> Message-ID: <1123268916.16140.91.camel@localhost> On Fri, 2005-08-05 at 11:47 -0700, Tom Duffy wrote: > On 8/5/05, Tom Tucker wrote: > > Could you please create a branch so that we can drop our driver? > > How about a patch? > This is the GPL/BSD licensed driver TomT brought up in an earlier email. It seemed reasonable to have the initial driver code drop in the svn repository so that future work is done in the open. TomT, and others, can post patches to the list that start to fix up the driver for incorporation in the OpenIB code base and clean it up for submission to kernel.org. - Matt From mamidala at cse.ohio-state.edu Fri Aug 5 12:22:02 2005 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Fri, 5 Aug 2005 15:22:02 -0400 (EDT) Subject: [openib-general] (no subject) Message-ID: Hi, We have been testing gen2 installation using the mem-free IBA cards from Mellanox. We were able to successfully run InfiniBand tests over this installation. When we replaced the IBA mem-free with the mem cards (no software changes), the ports are in the disabled state and ibstatus shows the following: Infiniband device 'mthca0' port 1 status: default gid: fe80:0000:0000:0000:0002:c901:0a31:a431 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 10 Gb/sec (4X) Infiniband device 'mthca0' port 2 status: default gid: fe80:0000:0000:0000:0002:c901:0a31:a432 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 10 Gb/sec (4X) We were wondering what was going wrong? 
Thanks, Amith From yhlu.kernel at gmail.com Fri Aug 5 12:25:50 2005 From: yhlu.kernel at gmail.com (yhlu) Date: Fri, 5 Aug 2005 12:25:50 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <86802c44050805112661d889aa@mail.gmail.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> <86802c4405080511079d01532@mail.gmail.com> <52psss5k1x.fsf@cisco.com> <86802c44050805112661d889aa@mail.gmail.com> Message-ID: <86802c4405080512254b9cd496@mail.gmail.com> pci_restore_bars cause that. it didn't restore that according to if resource is 64 bit or not. So it overwirte upper 32 bit with 0. YH file:1b34fc56067ed8ae0ba9b32f46679e13068bb86c -> file:65ea7d25f6911d7396e19afbf4bb2738906376f7 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -222,6 +222,37 @@ pci_find_parent_resource(const struct pc } /** + * pci_restore_bars - restore a devices BAR values (e.g. after wake-up) + * @dev: PCI device to have its BARs restored + * + * Restore the BAR values for a given device, so as to make it + * accessible by its driver. + */ +void +pci_restore_bars(struct pci_dev *dev) +{ + int i, numres; + + switch (dev->hdr_type) { + case PCI_HEADER_TYPE_NORMAL: + numres = 6; + break; + case PCI_HEADER_TYPE_BRIDGE: + numres = 2; + break; + case PCI_HEADER_TYPE_CARDBUS: + numres = 1; + break; + default: + /* Should never get here, but just in case... */ + return; + } + + for (i = 0; i < numres; i ++) + pci_update_resource(dev, &dev->resource[i], i); +} + +/** On 8/5/05, yhlu wrote: > before I do the cg-update this morning, it didn't mask out the upper 8 bit. > > YH > > On 8/5/05, Roland Dreier wrote: > > yhlu> ps. some kernel pci code patch broke sth yesterday night. 
> > yhlu> it mask out bit [32-39] > > > > Is it possible that all your problems are coming from the PCI setup > > code incorrectly assigning BARs? > > > > - R. > > > From yhlu.kernel at gmail.com Fri Aug 5 12:45:25 2005 From: yhlu.kernel at gmail.com (yhlu) Date: Fri, 5 Aug 2005 12:45:25 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <86802c4405080512254b9cd496@mail.gmail.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> <86802c4405080511079d01532@mail.gmail.com> <52psss5k1x.fsf@cisco.com> <86802c44050805112661d889aa@mail.gmail.com> <86802c4405080512254b9cd496@mail.gmail.com> Message-ID: <86802c4405080512451cdcae48@mail.gmail.com> in drivers/pci/setup-res.c: pci_update_resource() why???? new = 0; /* currently everyone zeros the high address */ if ((new & (PCI_BASE_ADDRESS_SPACE|PCI_BASE_ADDRESS_MEM_TYPE_MASK)) == (PCI_BASE_ADDRESS_SPACE_MEMORY|PCI_BASE_ADDRESS_MEM_TYPE_64)) { new = 0; /* currently everyone zeros the high address */ pci_write_config_dword(dev, reg + 4, new); pci_read_config_dword(dev, reg + 4, &check); if (check != new) { printk(KERN_ERR "PCI: Error updating region " "%s/%d (high %08x != %08x)\n", pci_name(dev), resno, new, check); } } On 8/5/05, yhlu wrote: > pci_restore_bars cause that. > it didn't restore that according to if resource is 64 bit or not. So > it overwirte upper 32 bit with 0. > > YH > > file:1b34fc56067ed8ae0ba9b32f46679e13068bb86c -> > file:65ea7d25f6911d7396e19afbf4bb2738906376f7 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -222,6 +222,37 @@ pci_find_parent_resource(const struct pc > } > /** > + * pci_restore_bars - restore a devices BAR values (e.g. 
after wake-up) > + * @dev: PCI device to have its BARs restored > + * > + * Restore the BAR values for a given device, so as to make it > + * accessible by its driver. > + */ > +void > +pci_restore_bars(struct pci_dev *dev) > +{ > + int i, numres; > + > + switch (dev->hdr_type) { > + case PCI_HEADER_TYPE_NORMAL: > + numres = 6; > + break; > + case PCI_HEADER_TYPE_BRIDGE: > + numres = 2; > + break; > + case PCI_HEADER_TYPE_CARDBUS: > + numres = 1; > + break; > + default: > + /* Should never get here, but just in case... */ > + return; > + } > + > + for (i = 0; i < numres; i ++) > + pci_update_resource(dev, &dev->resource[i], i); > +} > + > +/** > > On 8/5/05, yhlu wrote: > > before I do the cg-update this morning, it didn't mask out the upper 8 bit. > > > > YH > > > > On 8/5/05, Roland Dreier wrote: > > > yhlu> ps. some kernel pci code patch broke sth yesterday night. > > > yhlu> it mask out bit [32-39] > > > > > > Is it possible that all your problems are coming from the PCI setup > > > code incorrectly assigning BARs? > > > > > > - R. 
> > >
> >
>
From yhlu.kernel at gmail.com Fri Aug 5 13:28:04 2005
From: yhlu.kernel at gmail.com (yhlu)
Date: Fri, 5 Aug 2005 13:28:04 -0700
Subject: [openib-general] Re: mthca and LinuxBIOS
In-Reply-To: <86802c4405080512451cdcae48@mail.gmail.com>
References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> <86802c4405080511079d01532@mail.gmail.com> <52psss5k1x.fsf@cisco.com> <86802c44050805112661d889aa@mail.gmail.com> <86802c4405080512254b9cd496@mail.gmail.com> <86802c4405080512451cdcae48@mail.gmail.com>
Message-ID: <86802c44050805132853070f1@mail.gmail.com>

please check the patch for fix overwrite upper 32bit

YH

--- drivers/pci/setup-res.c.orig	2005-08-05 10:08:45.000000000 -0700
+++ drivers/pci/setup-res.c	2005-08-05 13:25:06.000000000 -0700
@@ -33,6 +33,18 @@
 	u32 new, check, mask;
 	int reg;
 
+	if (resno < 6) {
+		reg = PCI_BASE_ADDRESS_0 + 4 * resno;
+		if((resno & 1)==1) {
+			/* check if previous reg is 64 mem */
+			pci_read_config_dword(dev, reg-4, &check );
+			if ((check & (PCI_BASE_ADDRESS_SPACE|PCI_BASE_ADDRESS_MEM_TYPE_MASK)) ==
+			    (PCI_BASE_ADDRESS_SPACE_MEMORY|PCI_BASE_ADDRESS_MEM_TYPE_64))
+				return;
+		}
+
+	}
+
 	pcibios_resource_to_bus(dev, &region, res);
 
 	pr_debug(" got res [%lx:%lx] bus [%lx:%lx] flags %lx for "
@@ -67,7 +79,7 @@
 
 	if ((new & (PCI_BASE_ADDRESS_SPACE|PCI_BASE_ADDRESS_MEM_TYPE_MASK)) ==
 	    (PCI_BASE_ADDRESS_SPACE_MEMORY|PCI_BASE_ADDRESS_MEM_TYPE_64)) {
-		new = 0; /* currently everyone zeros the high address */
+		new = region.start >> 32 ;
 		pci_write_config_dword(dev, reg + 4, new);
 		pci_read_config_dword(dev, reg + 4, &check);
 		if (check != new) {

From tomduffy at gmail.com Fri Aug 5 13:28:25 2005
From: tomduffy at gmail.com (Tom Duffy)
Date: Fri, 5 Aug 2005 13:28:25 -0700
Subject: [openib-general] iWARP Branch
In-Reply-To: 
References: 
Message-ID: <9d3b7de7050805132816296a55@mail.gmail.com>

On 8/5/05, Ryan, Jim wrote:
> Tom,
this comes from a discussion just concluded. TomT is talking about > a significant amount of code that seemed best suited to a branch. That's > what Matt suggested and TomT was complying with the request you saw Fair enough. Sorry for being out of the loop. -tduffy From torvalds at osdl.org Fri Aug 5 13:38:37 2005 From: torvalds at osdl.org (Linus Torvalds) Date: Fri, 5 Aug 2005 13:38:37 -0700 (PDT) Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <86802c44050805132853070f1@mail.gmail.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> <86802c4405080511079d01532@mail.gmail.com> <52psss5k1x.fsf@cisco.com> <86802c44050805112661d889aa@mail.gmail.com> <86802c4405080512254b9cd496@mail.gmail.com> <86802c4405080512451cdcae48@mail.gmail.com> <86802c44050805132853070f1@mail.gmail.com> Message-ID: Hmm.. This looks half-way sane, but too ugly for words. I'd much rather see that when we detect a 64-bit resource, we always mark the next resource as being reserved some way, and then we just make pci_update_resource() ignore such reserved resources. The if((resno & 1)==1) { /* check if previous reg is 64 mem */ .. stuff is really too ugly. Greg? Ivan? 
Linus

On Fri, 5 Aug 2005, yhlu wrote:
>
> please check the patch for fix overwrite upper 32bit
>
> YH
>
> --- drivers/pci/setup-res.c.orig	2005-08-05 10:08:45.000000000 -0700
> +++ drivers/pci/setup-res.c	2005-08-05 13:25:06.000000000 -0700
> @@ -33,6 +33,18 @@
> 	u32 new, check, mask;
> 	int reg;
>
> +	if (resno < 6) {
> +		reg = PCI_BASE_ADDRESS_0 + 4 * resno;
> +		if((resno & 1)==1) {
> +			/* check if previous reg is 64 mem */
> +			pci_read_config_dword(dev, reg-4, &check );
> +			if ((check &
> (PCI_BASE_ADDRESS_SPACE|PCI_BASE_ADDRESS_MEM_TYPE_MASK)) ==
> +
> (PCI_BASE_ADDRESS_SPACE_MEMORY|PCI_BASE_ADDRESS_MEM_TYPE_64))
> +				return;
> +		}
> +
> +	}
> +
> 	pcibios_resource_to_bus(dev, &region, res);
>
> 	pr_debug(" got res [%lx:%lx] bus [%lx:%lx] flags %lx for "
> @@ -67,7 +79,7 @@
>
> 	if ((new & (PCI_BASE_ADDRESS_SPACE|PCI_BASE_ADDRESS_MEM_TYPE_MASK)) ==
> 	    (PCI_BASE_ADDRESS_SPACE_MEMORY|PCI_BASE_ADDRESS_MEM_TYPE_64)) {
> -		new = 0; /* currently everyone zeros the high address */
> +		new = region.start >> 32 ;
> 		pci_write_config_dword(dev, reg + 4, new);
> 		pci_read_config_dword(dev, reg + 4, &check);
> 		if (check != new) {
>

From tom at ammasso.com Fri Aug 5 13:46:29 2005
From: tom at ammasso.com (Tom Tucker)
Date: Fri, 5 Aug 2005 16:46:29 -0400
Subject: [openib-general] iWARP Branch Proposal
Message-ID: <8E9D028761D8264D910612167E8457E8FA3695@mail2.ammasso.com>

My plan at this point is to create a branch under gen2/branches/iwarp
and then work from there. Does this fit with everyone's expectations?

The driver will be added in this branch under infiniband/hw/amso1100. I
think it will take weeks before we're ready to create a patch for the
main trunk.

Any insights on the criteria for migrating patches into the main trunk
are greatly appreciated. For example, there are many changes that could
be made early that will not affect the functionality or stability of the
existing code. These could precede the merging in of the driver itself,
updates to the Kconfig and Makefile, etc...
Thanks, Tom Tucker From rolandd at cisco.com Fri Aug 5 13:52:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 05 Aug 2005 13:52:54 -0700 Subject: [openib-general] iWARP Branch In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3686@mail2.ammasso.com> (Tom Tucker's message of "Fri, 5 Aug 2005 14:45:48 -0400") References: <8E9D028761D8264D910612167E8457E8FA3686@mail2.ammasso.com> Message-ID: <52ll3g5cnt.fsf@cisco.com> Tom> Matt: Could you please create a branch so that we can drop Tom> our driver? Nothing special is required to create a branch. Assuming you have commit access, you can just do something like svn copy https://openib.org/svn/gen2/trunk \ https://openib.org/svn/gen2/branches/iwarp has a good introduction to using svn. - R. From halr at voltaire.com Fri Aug 5 13:48:25 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Aug 2005 16:48:25 -0400 Subject: [openib-general] (no subject) In-Reply-To: References: Message-ID: <1123274818.4451.253.camel@hal.voltaire.com> Hi Amith, On Fri, 2005-08-05 at 15:22, amith rajith mamidala wrote: > We have been testing gen2 installation using the mem-free IBA cards from > Mellanox. We were able to successfully run InfiniBand tests over this > installation. When we replaced the IBA mem-free with the mem cards > (no software changes), the ports are in the disabled state and ibstatus shows the following: > > Infiniband device 'mthca0' port 1 status: > default gid: fe80:0000:0000:0000:0002:c901:0a31:a431 > base lid: 0x0 > sm lid: 0x0 > state: 1: DOWN > phys state: 3: Disabled > rate: 10 Gb/sec (4X) > > Infiniband device 'mthca0' port 2 status: > default gid: fe80:0000:0000:0000:0002:c901:0a31:a432 > base lid: 0x0 > sm lid: 0x0 > state: 1: DOWN > phys state: 3: Disabled > rate: 10 Gb/sec (4X) > > > We were wondering what was going wrong? Someone else from OSU reported this issue with the Physical state indicating disabled rather than polling. I'm not sure if/how this was resolved. 
What is the firmware version of these cards ? Can you run /usr/local/ib/bin/ibstat ? You will need umad for this. -- Hal From halr at voltaire.com Fri Aug 5 13:57:26 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Aug 2005 16:57:26 -0400 Subject: [openib-general] (no subject) In-Reply-To: <1123274818.4451.253.camel@hal.voltaire.com> References: <1123274818.4451.253.camel@hal.voltaire.com> Message-ID: <1123275372.4451.258.camel@hal.voltaire.com> On Fri, 2005-08-05 at 16:48, Hal Rosenstock wrote: > Someone else from OSU reported this issue with the Physical state > indicating disabled rather than polling. I'm not sure if/how this was > resolved. > > What is the firmware version of these cards ? > > Can you run /usr/local/ib/bin/ibstat ? You will need umad for this. One more thing: You might want to enabled CONFIG_INFINIBAND_MTHCA_DEBUG and see what log messages occur during initialization. Something might be going wrong there on your machines/in your configuration. -- Hal From sam at ravnborg.org Fri Aug 5 14:06:54 2005 From: sam at ravnborg.org (Sam Ravnborg) Date: Fri, 5 Aug 2005 23:06:54 +0200 Subject: [openib-general] Re: [RFC] Move InfiniBand .h files In-Reply-To: <52br4c74dy.fsf@cisco.com> References: <52iryla9r5.fsf@cisco.com> <20050805092228.GA7237@infradead.org> <52br4c74dy.fsf@cisco.com> Message-ID: <20050805210654.GB17639@mars.ravnborg.org> On Fri, Aug 05, 2005 at 09:08:41AM -0700, Roland Dreier wrote: > Christoph> include/rmda, please. not need for the linux/ component. > > OK, fair enough. Any objection to this? No, makes sense. 
Sam From tomduffy at gmail.com Fri Aug 5 14:26:48 2005 From: tomduffy at gmail.com (Tom Duffy) Date: Fri, 5 Aug 2005 14:26:48 -0700 Subject: [openib-general] [PATCH] SDP: use linux/list.h for advt table In-Reply-To: <1122310714.27947.8.camel@duffman> References: <1122310714.27947.8.camel@duffman> Message-ID: <9d3b7de7050805142648ebe47f@mail.gmail.com> On 7/25/05, Tom Duffy wrote: > This patch changes sdp_advt.[ch] to use linux lists. I didn't change > the API, but it may be a good idea now that the lists are done a bit > differently. I committed this patch as version 2990. From tomduffy at gmail.com Fri Aug 5 14:28:56 2005 From: tomduffy at gmail.com (Tom Duffy) Date: Fri, 5 Aug 2005 14:28:56 -0700 Subject: [openib-general] [PATCH] SDP: priority is now unsigned int In-Reply-To: <1122326745.420.7.camel@duffman> References: <1122326745.420.7.camel@duffman> Message-ID: <9d3b7de7050805142865424152@mail.gmail.com> On 7/25/05, Tom Duffy wrote: > sk_alloc() takes an "unsigned int __nocast priority" instead of an int. > This was changed in 2.6.13-rc3. I committed this change as revision 2991. -tduffy From gregkh at suse.de Fri Aug 5 15:00:15 2005 From: gregkh at suse.de (Greg KH) Date: Fri, 5 Aug 2005 15:00:15 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: References: <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> <86802c4405080511079d01532@mail.gmail.com> <52psss5k1x.fsf@cisco.com> <86802c44050805112661d889aa@mail.gmail.com> <86802c4405080512254b9cd496@mail.gmail.com> <86802c4405080512451cdcae48@mail.gmail.com> <86802c44050805132853070f1@mail.gmail.com> Message-ID: <20050805220015.GA3524@suse.de> On Fri, Aug 05, 2005 at 01:38:37PM -0700, Linus Torvalds wrote: > > Hmm.. This looks half-way sane, but too ugly for words. 
>
> I'd much rather see that when we detect a 64-bit resource, we always mark
> the next resource as being reserved some way, and then we just make
> pci_update_resource() ignore such reserved resources.
>
> The
>
> if((resno & 1)==1) {
> /* check if previous reg is 64 mem */
> ..
>
> stuff is really too ugly.

Yeah, that's not nice.

But what's the real problem we are trying to fix here? I seem to have
missed that in the email thread somehow.

> Greg? Ivan?

Ivan's the pci resource guru, any thoughts as to how to do this in a
nicer way?

thanks,

greg k-h

From yhlu.kernel at gmail.com Fri Aug 5 15:25:02 2005
From: yhlu.kernel at gmail.com (yhlu)
Date: Fri, 5 Aug 2005 15:25:02 -0700
Subject: [openib-general] Re: mthca and LinuxBIOS
In-Reply-To: <20050805220015.GA3524@suse.de>
References: <86802c440508041230143354c2@mail.gmail.com> <86802c440508051103500f6942@mail.gmail.com> <86802c4405080511079d01532@mail.gmail.com> <52psss5k1x.fsf@cisco.com> <86802c44050805112661d889aa@mail.gmail.com> <86802c4405080512254b9cd496@mail.gmail.com> <86802c4405080512451cdcae48@mail.gmail.com> <86802c44050805132853070f1@mail.gmail.com> <20050805220015.GA3524@suse.de>
Message-ID: <86802c4405080515257ddaa8d2@mail.gmail.com>

In LinuxBIOS, we can allocate a 64-bit region (0xfc0000000....), i.e. a
range above 4G, to the Mellanox InfiniBand card. The MMIO range below 4G
then only needs to be 128M. Otherwise it needs 512M. If 4 IB cards are
used, the MMIO will be 2G. On the new Opteron E stepping we could use
hardware memhole support. But if the CPU is older than Opteron E, we can
only use SW mem mapping (which loses some performance) or lose 2G of RAM.
In that case we need 64-bit pref mem: we only lose 128M even when four IB
cards are installed.

Yesterday someone added pci_restore_bars(), which calls
pci_update_resource(), and that overwrites the upper 32 bits of BAR2 and
BAR4 of the IB card.

So the patch makes pci_update_resource():
1. not touch BAR3 and BAR5 if BAR2 and BAR4 are 64-bit MEM IO
2. not assume the upper 32 bits of BAR2 and BAR4 are 0 if BAR2 and BAR4
are 64-bit MEM IO

YH

On 8/5/05, Greg KH wrote:
> On Fri, Aug 05, 2005 at 01:38:37PM -0700, Linus Torvalds wrote:
> >
> > Hmm.. This looks half-way sane, but too ugly for words.
> >
> > I'd much rather see that when we detect a 64-bit resource, we always mark
> > the next resource as being reserved some way, and then we just make
> > pci_update_resource() ignore such reserved resources.
> >
> > The
> >
> > if((resno & 1)==1) {
> > /* check if previous reg is 64 mem */
> > ..
> >
> > stuff is really too ugly.
>
> Yeah, that's not nice.
>
> But what's the real problem we are trying to fix here? I seem to have
> missed that in the email thread somehow.
>
> > Greg? Ivan?
>
> Ivan's the pci resource guru, any thoughts as to how to do this in a
> nicer way?
>
> thanks,
>
> greg k-h
>

From mlleini at ca.sandia.gov Fri Aug 5 15:55:14 2005
From: mlleini at ca.sandia.gov (Matt L. Leininger)
Date: Fri, 05 Aug 2005 15:55:14 -0700
Subject: [openib-general] iWARP Branch Proposal
In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3695@mail2.ammasso.com>
References: <8E9D028761D8264D910612167E8457E8FA3695@mail2.ammasso.com>
Message-ID: <1123282514.16140.105.camel@localhost>

On Fri, 2005-08-05 at 16:46 -0400, Tom Tucker wrote:
> My plan at this point is to create a branch under gen2/branches/iwarp
> and then work from there. Does this fit with everyone's expectations?
>

Sounds reasonable.
- Matt From gregkh at suse.de Fri Aug 5 16:03:00 2005 From: gregkh at suse.de (Greg KH) Date: Fri, 5 Aug 2005 16:03:00 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <86802c4405080515257ddaa8d2@mail.gmail.com> References: <86802c440508051103500f6942@mail.gmail.com> <86802c4405080511079d01532@mail.gmail.com> <52psss5k1x.fsf@cisco.com> <86802c44050805112661d889aa@mail.gmail.com> <86802c4405080512254b9cd496@mail.gmail.com> <86802c4405080512451cdcae48@mail.gmail.com> <86802c44050805132853070f1@mail.gmail.com> <20050805220015.GA3524@suse.de> <86802c4405080515257ddaa8d2@mail.gmail.com> Message-ID: <20050805230300.GA4363@suse.de> On Fri, Aug 05, 2005 at 03:25:02PM -0700, yhlu wrote: > In LinuxBIOS, We can allocate 64 bit region ( 0xfc0000000....) to the > mellanox Infiniband card. Some range above 4G. So the mmio below 4G > is some smaller only 128M, Otherwise need 512M. If 4 IB cards are > used, the mmio will be 2G. For new opteron E stepping, We could use > hareware memhole support. But if the CPU is before opteron E, We only > can use SW mem mapping ( will lose some performance) or lose (2G RAM). > at such case We need 64bit pref mem. We only lose 128M even four IB > card are installed. > > yesterday, someone add pci_restore_bars...., that will call > pci_update_resource, and it will overwirte upper 32 bit of BAR2 and > BAR4 of IB card. Hm, perhaps that change should not do this? Dominik, care to weigh in here? That was your patch... > So the patch make pci_restore_resource > 1. don't touch BAR3, and BAR5, if BAR2, and BAR4 are 64 bit MEM IO > 2. 
not assume BAR2 and BAR4 upper 32bit is 0 if if BAR2, and BAR4 are > 64 bit MEM IO thanks, greg k-h From torvalds at osdl.org Fri Aug 5 16:06:06 2005 From: torvalds at osdl.org (Linus Torvalds) Date: Fri, 5 Aug 2005 16:06:06 -0700 (PDT) Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <20050805220015.GA3524@suse.de> References: <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> <86802c4405080511079d01532@mail.gmail.com> <52psss5k1x.fsf@cisco.com> <86802c44050805112661d889aa@mail.gmail.com> <86802c4405080512254b9cd496@mail.gmail.com> <86802c4405080512451cdcae48@mail.gmail.com> <86802c44050805132853070f1@mail.gmail.com> <20050805220015.GA3524@suse.de> Message-ID: On Fri, 5 Aug 2005, Greg KH wrote: > On Fri, Aug 05, 2005 at 01:38:37PM -0700, Linus Torvalds wrote: > > But what's the real problem we are trying to fix here? We're screwing up the top 32 bits of the BAR when you resume it. Look at the patch, you'll see the fix (the other part of the patch looks fine, but then in order to not overwrite the upper bits with zero again when doing the _next_ - nonexistent - BAR update, we need to have something that avoids writing the next BAR). Remember: a 64-bit BAR puts the upper 32 bits in what would otherwise be the low 32 bits of the next BAR. Which is why we need to mark the next BAR resource as _not_ being valid some way - so that we don't try to (incorrectly) "restore" it and overwrite the high bits of the previous BAR. Of course, this only hits the (very few) people who not only have 64-bit PCI devices, but literally have them mapped in the 4GB+ region. Quite uncommon. 
Linus From iod00d at hp.com Fri Aug 5 16:59:37 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 5 Aug 2005 16:59:37 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: References: <86802c440508051103500f6942@mail.gmail.com> <86802c4405080511079d01532@mail.gmail.com> <52psss5k1x.fsf@cisco.com> <86802c44050805112661d889aa@mail.gmail.com> <86802c4405080512254b9cd496@mail.gmail.com> <86802c4405080512451cdcae48@mail.gmail.com> <86802c44050805132853070f1@mail.gmail.com> <20050805220015.GA3524@suse.de> Message-ID: <20050805235937.GK25121@esmail.cup.hp.com> On Fri, Aug 05, 2005 at 04:06:06PM -0700, Linus Torvalds wrote: > > > On Fri, 5 Aug 2005, Greg KH wrote: > > On Fri, Aug 05, 2005 at 01:38:37PM -0700, Linus Torvalds wrote: > > > > But what's the real problem we are trying to fix here? > > We're screwing up the top 32 bits of the BAR when you resume it. Look at > the patch, you'll see the fix (the other part of the patch looks fine, but > then in order to not overwrite the upper bits with zero again when doing > the _next_ - nonexistent - BAR update, we need to have something that > avoids writing the next BAR). ISTR making comments before about the offending patch on linux-pci mailing list. Is this the same patch that assumes pci_dev->resource[i] == BAR[i] ? That's not true for 64-bit bars. > Remember: a 64-bit BAR puts the upper 32 bits in what would otherwise be > the low 32 bits of the next BAR. Which is why we need to mark the next BAR > resource as _not_ being valid some way - so that we don't try to > (incorrectly) "restore" it and overwrite the high bits of the previous > BAR. > Of course, this only hits the (very few) people who not only have 64-bit > PCI devices, but literally have them mapped in the 4GB+ region. *lots* of PCI device now have 64-bit BAR. The first I'm aware of was LSI 53c896 card (Ultra 2 SCSI). > Quite uncommon. Assigning 4GB+ regions is uncommon because too often either the HW, the OS, or the driver would break. 
firmware keeps having to worry about legacy OSs. grant From yhlu.kernel at gmail.com Fri Aug 5 17:57:55 2005 From: yhlu.kernel at gmail.com (yhlu) Date: Fri, 5 Aug 2005 17:57:55 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <86802c440508051103500f6942@mail.gmail.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> Message-ID: <86802c44050805175757f6ff6a@mail.gmail.com> Roland, what is the -16 mean? is it /* Attempt to modify a QP/EE which is not in the presumed state: */ MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10, YH On 8/5/05, yhlu wrote: > You are right. CONG_SPECIAL_QP > > ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) > ib_mthca: Initializing Mellanox Technologies MT25208 InfiniHost III Ex > (Tavor compatibility mode) (0000:04:00.0) > ib_mthca 0000:04:00.0: FW version 000400060002, max commands 64 > ib_mthca 0000:04:00.0: FW size 6143 KB (start fcefa00000, end fcefffffff) > ib_mthca 0000:04:00.0: HCA memory size 262143 KB (start fce0000000, > end fcefffffff) > ib_mthca 0000:04:00.0: Max QPs: 16777216, reserved QPs: 1024, entry size: 256 > ib_mthca 0000:04:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 64 > ib_mthca 0000:04:00.0: Max EQs: 64, reserved EQs: 1, entry size: 64 > ib_mthca 0000:04:00.0: reserved MPTs: 16, reserved MTTs: 16 > ib_mthca 0000:04:00.0: Max PDs: 16777216, reserved PDs: 0, reserved UARs: 1 > ib_mthca 0000:04:00.0: Max QP/MCG: 16777216, reserved MGMs: 0 > ib_mthca 0000:04:00.0: Flags: 00370347 > ib_mthca 0000:04:00.0: profile[ 0]--10/20 @ 0x fce0000000 (size 0x 4000000) > ib_mthca 0000:04:00.0: profile[ 1]-- 0/16 @ 0x fce4000000 (size 0x 1000000) > ib_mthca 0000:04:00.0: 
profile[ 2]-- 7/18 @ 0x fce5000000 (size 0x 800000) > ib_mthca 0000:04:00.0: profile[ 3]-- 9/17 @ 0x fce5800000 (size 0x 800000) > ib_mthca 0000:04:00.0: profile[ 4]-- 3/16 @ 0x fce6000000 (size 0x 400000) > ib_mthca 0000:04:00.0: profile[ 5]-- 4/16 @ 0x fce6400000 (size 0x 200000) > ib_mthca 0000:04:00.0: profile[ 6]--12/15 @ 0x fce6600000 (size 0x 100000) > ib_mthca 0000:04:00.0: profile[ 7]-- 8/13 @ 0x fce6700000 (size 0x 80000) > ib_mthca 0000:04:00.0: profile[ 8]--11/11 @ 0x fce6780000 (size 0x 10000) > ib_mthca 0000:04:00.0: profile[ 9]-- 6/ 5 @ 0x fce6790000 (size 0x 800) > ib_mthca 0000:04:00.0: HCA memory: allocated 106050 KB/256000 KB > (149950 KB free) > ib_mthca 0000:04:00.0: Allocated EQ 1 with 65536 entries > ib_mthca 0000:04:00.0: Allocated EQ 2 with 128 entries > ib_mthca 0000:04:00.0: Allocated EQ 3 with 128 entries > ib_mthca 0000:04:00.0: Setting mask 00000000000f43fe for eqn 2 > ib_mthca 0000:04:00.0: Setting mask 0000000000000400 for eqn 3 > ib_mthca 0000:04:00.0: NOP command IRQ test passed > ib_mthca 0000:04:00.0: mthca_init_qp_table: mthca_CONF_SPECIAL_QP > failed for 0/1024 (-16) > ib_mthca 0000:04:00.0: Failed to initialize queue pair table, aborting. 
> ib_mthca 0000:04:00.0: Clearing mask 00000000000f43fe for eqn 2 > ib_mthca 0000:04:00.0: Clearing mask 0000000000000400 for eqn 3 > ib_mthca: probe of 0000:04:00.0 failed with error -16 > From rolandd at cisco.com Fri Aug 5 18:30:07 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 05 Aug 2005 18:30:07 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <86802c44050805175757f6ff6a@mail.gmail.com> (yhlu's message of "Fri, 5 Aug 2005 17:57:55 -0700") References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> <86802c44050805175757f6ff6a@mail.gmail.com> Message-ID: <52hde36ee8.fsf@cisco.com> yhlu> Roland, what is the -16 mean? yhlu> is it /* Attempt to modify a QP/EE which is not in the yhlu> presumed state: */ MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10, No, -16 is just -EBUSY. You could put a printk in event_timeout() in mthca_cmd.c to make sure, but I'm pretty sure that's where it's coming from. In other words we issue the CONF_SPECIAL_QP firmware command and don't ever get a response back from the HCA. - R. 
From yhlu.kernel at gmail.com Fri Aug 5 19:47:41 2005 From: yhlu.kernel at gmail.com (yhlu) Date: Fri, 5 Aug 2005 19:47:41 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <52hde36ee8.fsf@cisco.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> <86802c44050805175757f6ff6a@mail.gmail.com> <52hde36ee8.fsf@cisco.com> Message-ID: <86802c44050805194779379932@mail.gmail.com> I remember last year when I used IBGOLD 0.5 with PCI-X IB card, it seems that it could support 64 bit pref mem. I will try IBGOLD 1.7 ..... YH On 8/5/05, Roland Dreier wrote: > yhlu> Roland, what is the -16 mean? > > yhlu> is it /* Attempt to modify a QP/EE which is not in the > yhlu> presumed state: */ MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10, > > No, -16 is just -EBUSY. You could put a printk in event_timeout() in > mthca_cmd.c to make sure, but I'm pretty sure that's where it's coming > from. In other words we issue the CONF_SPECIAL_QP firmware command > and don't ever get a response back from the HCA. > > - R. > From tomduffy at gmail.com Fri Aug 5 20:07:02 2005 From: tomduffy at gmail.com (Tom Duffy) Date: Fri, 5 Aug 2005 20:07:02 -0700 Subject: [openib-general] [PATCH updated] sdp: cancel read with no iocb In-Reply-To: <20050801131445.GW14384@mellanox.co.il> References: <20050801084000.GS14384@mellanox.co.il> <20050801131445.GW14384@mellanox.co.il> Message-ID: <9d3b7de7050805200749704f7c@mail.gmail.com> On 8/1/05, Michael S. Tsirkin wrote: > Quoting r. Michael S. Tsirkin : > > Subject: sdp: cancel read with no iocb > > > > Libor, I'm seeing these messages: > > > > ib_sdp WARN: Cancel read with no IOCB. <2:0:00000005> > > > > It seems that this warning is printed in a legal state where > > a deferred iocb is canceled. 
Shouldnt this sdp_warn be replaced > > with sdp_dbg_ctrl? > > Ugh, that patch broke compilation with debug on. > Here's a better one. Thanks, applied. -tduffy From tomduffy at gmail.com Fri Aug 5 20:16:08 2005 From: tomduffy at gmail.com (Tom Duffy) Date: Fri, 5 Aug 2005 20:16:08 -0700 Subject: [openib-general] [PATCH] flush_scheduled_work on SDP module unload In-Reply-To: <20050804120601.GG15300@mellanox.co.il> References: <20050804120601.GG15300@mellanox.co.il> Message-ID: <9d3b7de7050805201675011514@mail.gmail.com> On 8/4/05, Michael S. Tsirkin wrote: > Need to flush scheduled work on SDP module unload: make sure > that a deferred iocb isnt outstanding. Is this still needed? Or was it just from the incorrect printout? -tduffy From iod00d at hp.com Fri Aug 5 21:33:54 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 5 Aug 2005 21:33:54 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <20050805235937.GK25121@esmail.cup.hp.com> References: <86802c4405080511079d01532@mail.gmail.com> <52psss5k1x.fsf@cisco.com> <86802c44050805112661d889aa@mail.gmail.com> <86802c4405080512254b9cd496@mail.gmail.com> <86802c4405080512451cdcae48@mail.gmail.com> <86802c44050805132853070f1@mail.gmail.com> <20050805220015.GA3524@suse.de> <20050805235937.GK25121@esmail.cup.hp.com> Message-ID: <20050806043354.GA27352@esmail.cup.hp.com> On Fri, Aug 05, 2005 at 04:59:37PM -0700, Grant Grundler wrote: > ISTR making comments before about the offending patch on linux-pci mailing > list. Is this the same patch that assumes pci_dev->resource[i] == BAR[i] ? I meant the patch assume 1:1 for pci_dev->resource[i] and BAR[i]. not that the two are equivalent. 
grant From iod00d at hp.com Fri Aug 5 22:29:21 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 5 Aug 2005 22:29:21 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <20050805230300.GA4363@suse.de> References: <86802c4405080511079d01532@mail.gmail.com> <52psss5k1x.fsf@cisco.com> <86802c44050805112661d889aa@mail.gmail.com> <86802c4405080512254b9cd496@mail.gmail.com> <86802c4405080512451cdcae48@mail.gmail.com> <86802c44050805132853070f1@mail.gmail.com> <20050805220015.GA3524@suse.de> <86802c4405080515257ddaa8d2@mail.gmail.com> <20050805230300.GA4363@suse.de> Message-ID: <20050806052921.GC27352@esmail.cup.hp.com> On Fri, Aug 05, 2005 at 04:03:00PM -0700, Greg KH wrote: ... > > yesterday, someone add pci_restore_bars...., that will call > > pci_update_resource, and it will overwirte upper 32 bit of BAR2 and > > BAR4 of IB card. > > Hm, perhaps that change should not do this? > > Dominik, care to weigh in here? That was your patch... Was the origin of this patch the following thread started by John Linville: http://lkml.org/lkml/2005/6/23/257 I pointed out it would have issues with 64-bit BARs. And I suggested some solutions to JohnL's patch here: http://lkml.org/lkml/2005/7/2/14 In any case same issues apply to pci_update_resource(). 
grant

References: <86802c4405080511079d01532@mail.gmail.com> <86802c44050805112661d889aa@mail.gmail.com> <86802c4405080512254b9cd496@mail.gmail.com> <86802c4405080512451cdcae48@mail.gmail.com> <86802c44050805132853070f1@mail.gmail.com> <20050805220015.GA3524@suse.de> <20050805235937.GK25121@esmail.cup.hp.com> <20050806043354.GA27352@esmail.cup.hp.com>
Message-ID: <86802c4405080616526642fe39@mail.gmail.com>

In the LinuxBIOS internal structure for a resource, we have an index member. So the resources are counted from 0 to 7 and so on, but the index member points to the real BAR position. I would like to see the kernel have a similar definition.
In LinuxBIOS:

typedef uint64_t resource_t;
struct resource {
	resource_t base;	/* Base address of the resource */
	resource_t size;	/* Size of the resource */
	resource_t limit;	/* Largest valid value base + size -1 */
	unsigned long flags;	/* Descriptions of the kind of resource */
	unsigned long index;	/* Bus specific per device resource id */
	unsigned char align;	/* Required alignment (log 2) of the resource */
	unsigned char gran;	/* Granularity (log 2) of the resource */
	/* Alignment must be >= the granularity of the resource */
};

YH

On 8/5/05, Grant Grundler wrote:
> On Fri, Aug 05, 2005 at 04:59:37PM -0700, Grant Grundler wrote:
> > ISTR making comments before about the offending patch on linux-pci mailing
> > list. Is this the same patch that assumes pci_dev->resource[i] == BAR[i] ?
>
> I meant the patch assume 1:1 for pci_dev->resource[i] and BAR[i].
> not that the two are equivalent.
>
> grant
>

From ianjiang91 at hotmail.com Sat Aug 6 17:35:44 2005
From: ianjiang91 at hotmail.com (Ian Jiang)
Date: Sun, 07 Aug 2005 08:35:44 +0800
Subject: [openib-general] [iSER]use iSER on x86_64
Message-ID:

[I sent this mail the day before yesterday, but it did not appear on the list. So I am trying again.]

I modified several files under iSER/datamover/ to use it on my x86_64 machine. I also edited the iSER/make.conf file where necessary. I can now compile iSER without any errors, but I am not sure whether it works with the iSCSI initiator or kDAPL, because I have no idea how to test it yet.

The iSER code was obtained from https://openib.org/svn/gen2/ulps/iser/
The related DAT headers can be either http://sourceforge.net/projects/dapl/dapl_beta2.06/dat/ or http://www.datcollaborative.org/dat_headers_1.2.tgz/

Any suggestion is appreciated!
Ian Jiang ianjiang91 at hotmail.com ---- Computer Architecture Laboratory Institute of Computing Technology Chinese Academy of Sciences Beijing,P.R.China Zip code: 100080 Tel: +86-10-62564394(office) ====================== iser-datamover.patches ====================== --- iser_conn.c.old 2005-08-04 01:40:23.000000000 +0800 +++ iser_conn.c 2005-08-05 06:36:56.000000000 +0800 @@ -602,8 +602,11 @@ /* Find the connection */ p_iser_conn = hash_find_iser_conn(iscsi_conn_h); if (p_iser_conn == NULL) { - IERROR("Connection not found, conn_h: %d\n", - (unsigned) iscsi_conn_h); + /* by IanJiang, ianjiang at ict.ac.cn */ + // IERROR("Connection not found, conn_h: %d\n", + IERROR("Connection not found, conn_h: %ld\n", + // (unsigned) iscsi_conn_h); + (unsigned long) iscsi_conn_h); return ISER_INVALID_CONN; } --- iser_dto.c.old 2005-08-04 05:59:17.000000000 +0800 +++ iser_dto.c 2005-08-05 06:37:28.000000000 +0800 @@ -356,7 +356,9 @@ /* Get the VA of the headers registered memory */ p_recv_buf = - (unsigned char *) (unsigned int) p_dto->regd[0]->virt_buf.p_buf; +/* by IanJiang, ianjiang at ict.ac.cn */ +// (unsigned char *) (unsigned int) p_dto->regd[0]->virt_buf.p_buf; + (unsigned char *) (unsigned long) p_dto->regd[0]->virt_buf.p_buf; /* Account for the offset within it */ p_recv_buf += p_dto->offset[0]; /* Skip the iSER header to get the iSCSI PDU BHS */ --- iser_global.c.old 2005-08-04 01:40:04.000000000 +0800 +++ iser_global.c 2005-08-05 06:38:06.000000000 +0800 @@ -571,7 +571,9 @@ unsigned port) /*!< IN - listening port */ { struct iser_entity_t *p_entity; - unsigned entity_index = (unsigned) api_h; +/* by IanJiang, ianjiang at ict.ac.cn */ + unsigned long entity_index = (unsigned long) api_h; +// unsigned entity_index = (unsigned) api_h; if (entity_index >= iser_global.num_entities) { return ISER_ILLEGAL_PARAM; --- iser_initiator.c.old 2005-08-04 01:39:47.000000000 +0800 +++ iser_initiator.c 2005-08-05 06:45:48.000000000 +0800 @@ -276,15 +276,21 @@ p_iser_conn = 
hash_find_iser_conn(iscsi_conn_h); if (p_iser_conn == NULL) { - IERROR("Failed to find connection, iscsi_conn_h: %X\n", - (unsigned) iscsi_conn_h); +/* by IanJiang, ianjiang at ict.ac.cn */ + //IERROR("Failed to find connection, iscsi_conn_h: %X\n", + IERROR("Failed to find connection, iscsi_conn_h: %lX\n", + //(unsigned) iscsi_conn_h); + (unsigned long) iscsi_conn_h); iser_pdu_print((char *) __func__, NULL, p_bhs->buf, NULL); iser_ret = ISER_INVALID_CONN; goto send_control_error; } if (atomic_read(&p_iser_conn->state) != ISER_CONN_UP) { - IERROR("Connection is not up, iscsi_conn_h: %X, p_conn: 0x%p\n", - (unsigned) iscsi_conn_h, p_iser_conn); +/* by IanJiang, ianjiang at ict.ac.cn */ + //IERROR("Connection is not up, iscsi_conn_h: %X, p_conn: 0x%p\n", + IERROR("Connection is not up, iscsi_conn_h: %lX, p_conn: 0x%p\n", + //(unsigned ) iscsi_conn_h, p_iser_conn); + (unsigned long) iscsi_conn_h, p_iser_conn); iser_pdu_print((char *) __func__, NULL, p_bhs->buf, NULL); iser_ret = ISER_FAILURE; goto send_control_error; --- iser_kdapl.c.old 2005-08-04 02:12:25.000000000 +0800 +++ iser_kdapl.c 2005-08-05 06:46:13.000000000 +0800 @@ -49,6 +49,9 @@ #include "iser_procfs.h" #include "iser_trace.h" +/* by IanJiang, ianjiang at ict.ac.cn */ +#define DAT_MEM_OPT_DONT_CARE DAT_MEM_OPTIMIZE_DONT_CARE + /* --------------------------------------------------------------------- * CONSTANTS & MACROS * ------------------------------------------------------------------ */ --- iser_memory.c.old 2005-08-04 02:14:55.000000000 +0800 +++ iser_memory.c 2005-08-05 06:47:51.000000000 +0800 @@ -683,10 +683,14 @@ list_entry(p_list, struct iser_buf_pool_region_t, pool_list); n += sprintf(p_str + n, - "\t%3d: mem[0x%p sz:%7d t:%d] lmr[h:0x%08x va:0x%08x sz:%7d]\n", +/* by IanJiang, ianjiang at ict.ac.cn */ +// "\t%3d: mem[0x%p sz:%7d t:%d] lmr[h:0x%08x va:0x%08x sz:%7d]\n", + "\t%3d: mem[0x%p sz:%7d t:%d] lmr[h:0x%08lx va:0x%08x sz:%7d]\n", p_pool_region->id, p_pool_region->buf.p_buf, 
p_pool_region->buf.size, p_pool_region->buf.type, - (unsigned) p_pool_region->mem_reg.lmr_handle, +/* by IanJiang, ianjiang at ict.ac.cn */ +// (unsigned) p_pool_region->mem_reg.lmr_handle, + (unsigned long) p_pool_region->mem_reg.lmr_handle, (unsigned) p_pool_region->mem_reg.lmr_triplet. virtual_address, (unsigned) p_pool_region->mem_reg.lmr_triplet. @@ -958,13 +962,18 @@ ITRACE_ENTRY(); if (p_iser_adaptor->regd_mem.virt_buf.size > 0) { /* if any pre-regd buffer */ - unsigned start_regd_buf = - (unsigned) p_iser_adaptor->regd_mem.virt_buf.p_buf; - unsigned end_regd_buf = +/* by IanJiang, ianjiang at ict.ac.cn */ +// unsigned start_regd_buf = + unsigned long start_regd_buf = +// (unsigned) p_iser_adaptor->regd_mem.virt_buf.p_buf; + (unsigned long) p_iser_adaptor->regd_mem.virt_buf.p_buf; +// unsigned end_regd_buf = + unsigned long end_regd_buf = start_regd_buf + p_iser_adaptor->regd_mem.virt_buf.size; ITRACE(ISER_TRACE_BUFFERS, - "Looking up buf: 0x%08x - 0x%08x in cache: 0x%08x - 0x%08x\n", +// "Looking up buf: 0x%08x - 0x%08x in cache: 0x%08x - 0x%08x\n", + "Looking up buf: 0x%08x - 0x%08x in cache: 0x%08lx - 0x%08lx\n", buf_addr, buf_addr + buf_size, start_regd_buf, end_regd_buf); if (start_regd_buf <= buf_addr --- iser_procfs.c.old 2005-08-04 06:01:50.000000000 +0800 +++ iser_procfs.c 2005-08-05 06:48:33.000000000 +0800 @@ -207,9 +207,13 @@ p_iser_conn = (struct iser_conn_t *) data; page[0] = '\0'; - n += sprintf(buf, "iSCSI handle: 0x%08X\nkDAPL EP handle: 0x%08X\n", - (unsigned) p_iser_conn->iscsi_conn_h, - (unsigned) p_iser_conn->ep_handle); +/* by IanJiang, ianjiang at ict.ac.cn */ +// n += sprintf(buf, "iSCSI handle: 0x%08X\nkDAPL EP handle: 0x%08X\n", + n += sprintf(buf, "iSCSI handle: 0x%08lX\nkDAPL EP handle: 0x%08lX\n", +// (unsigned) p_iser_conn->iscsi_conn_h, + (unsigned long) p_iser_conn->iscsi_conn_h, +// (unsigned along) p_iser_conn->ep_handle); + (unsigned long) p_iser_conn->ep_handle); strcat(page, buf); n += sprintf(buf, "State: %s\n", 
iser_conn_get_state_name(p_iser_conn)); --- iser_utils.c.old 2005-08-04 02:42:16.000000000 +0800 +++ iser_utils.c 2005-08-05 06:48:55.000000000 +0800 @@ -32,6 +32,10 @@ * $Id: iser_utils.c,v 1.38 2005/01/31 08:02:00 danb Exp $ */ +/* by IanJiang, ianjiang at ict.ac.cn + * replace u32 by u64 in 5 places + */ + #include #include #include @@ -102,7 +106,7 @@ ITRACE_ENTRY(); - hash_val = hash_func(iser_task->itt ^ (u32) iser_task->p_conn); + hash_val = hash_func(iser_task->itt ^ (u64) iser_task->p_conn); ITRACE(ISER_TRACE_HASHTABLES, "p_task: 0x%p, p_conn: 0x%p, itt: %d, hash_val = %d\n", @@ -126,7 +130,7 @@ */ struct iser_task_t * hash_find_iser_task(struct iser_conn_t *iser_conn, /*!< IN - part of hash key */ - u32 itt) /*!< IN - part of hash key */ + u64 itt) /*!< IN - part of hash key */ { int hash_val; struct list_head *p_bucket; @@ -135,7 +139,7 @@ ITRACE_ENTRY(); - hash_val = hash_func(itt ^ (u32) iser_conn); + hash_val = hash_func(itt ^ (u64) iser_conn); p_bucket = &(iser_global.task_hash.bucket_head[hash_val]); spin_lock(&iser_global.task_hash.lock); @@ -151,8 +155,11 @@ spin_unlock(&iser_global.task_hash.lock); ITRACE(ISER_TRACE_HASHTABLES, - "p_conn: 0x%p, itt: %d, hash_val = %d, found p_task: 0x%p\n", - iser_conn, itt, hash_val, iser_task); +/* by IanJiang, ianjiang at ict.ac.cn */ +// "p_conn: 0x%p, itt: %d, hash_val = %d, found p_task: 0x%p\n", +// iser_conn, itt, hash_val, iser_task); + "p_conn: 0x%p, itt: %ld, hash_val = %d, found p_task: 0x%p\n", + iser_conn, (unsigned long)itt, hash_val, iser_task); ITRACE_EXIT(); return iser_task; @@ -188,7 +195,7 @@ ITRACE_ENTRY(); - hash_val = hash_func((u32) iser_conn->iscsi_conn_h); + hash_val = hash_func((u64) iser_conn->iscsi_conn_h); spin_lock(&iser_global.conn_hash.lock); INIT_LIST_HEAD(&iser_conn->hash_list); @@ -216,7 +223,7 @@ ITRACE_ENTRY(); - hash_val = hash_func((u32) iscsi_conn_h); + hash_val = hash_func((u64) iscsi_conn_h); p_bucket = &(iser_global.conn_hash.bucket_head[hash_val]); 
spin_lock(&iser_global.conn_hash.lock); @@ -349,8 +356,11 @@ virt_addr = p_iovec_virt[i].iov_base; p_phys[i].addr = virt_to_phys(virt_addr); ITRACE(ISER_TRACE_BUFFERS, - "IOVEC[%d] virt: 0x%08X -> phys: 0x%08X, sz: %d\n", i, - (unsigned) virt_addr, p_phys[i].addr, +/* by IanJiang, ianjiang at ict.ac.cn */ +// "IOVEC[%d] virt: 0x%08X -> phys: 0x%08X, sz: %d\n", i, +// (unsigned) virt_addr, p_phys[i].addr, + "IOVEC[%d] virt: 0x%08lX -> phys: 0x%08X, sz: %ld\n", i, + (unsigned long) virt_addr, p_phys[i].addr, p_iovec_virt[i].iov_len); p_phys[i].size = p_iovec_virt[i].iov_len; total_sz += p_iovec_virt[i].iov_len; @@ -393,7 +403,9 @@ } for (i = 0; i < p_data->size; i++) { - p_phys[i].addr = (uint32_t) p_iovec_phys[i].iov_base; +/* by IanJiang, ianjiang at ict.ac.cn */ +// p_phys[i].addr = (uint32_t) p_iovec_phys[i].iov_base; + p_phys[i].addr = (uint64_t) p_iovec_phys[i].iov_base; p_phys[i].size = p_iovec_phys[i].iov_len; total_sz += p_iovec_phys[i].iov_len; @@ -440,10 +452,14 @@ "starting scatterlist conversion - %d elements\n", p_data->size); for (i = 0; i < p_data->size; i++) { p_phys[i].addr = page_to_phys(p_sg[i].page) + p_sg[i].offset; +/* by IanJiang, ianjiang at ict.ac.cn */ ITRACE(ISER_TRACE_BUFFERS, - "SCATTER[%d] page: 0x%08X + 0x%8X -> phys: 0x%08X, sz: %d\n", - i, (unsigned) p_sg[i].page, p_sg[i].offset, - p_phys[i].addr, p_sg[i].length); +// "SCATTER[%d] page: 0x%08X + 0x%8X -> phys: 0x%08X, sz: %d\n", + "SCATTER[%d] page: 0x%08lX + 0x%8X -> phys: 0x%08lX, sz: %d\n", +// i, (unsigned) p_sg[i].page, p_sg[i].offset, + i, (unsigned long) p_sg[i].page, p_sg[i].offset, +// p_phys[i].addr, p_sg[i].length); + (unsigned long) p_phys[i].addr, p_sg[i].length); p_phys[i].size = p_sg[i].length; total_sz += p_sg[i].length; } @@ -501,7 +517,9 @@ void * iser_phys_to_virt(void *phys_addr) { - return phys_to_virt((unsigned) phys_addr); +/* by IanJiang, ianjiang at ict.ac.cn */ +// return phys_to_virt((unsigned) phys_addr); + return phys_to_virt((unsigned long) phys_addr); } 
/** @@ -535,7 +553,9 @@ { struct iovec *p_iovec; p_iovec = (struct iovec *) p_iovec_phys; - return phys_to_virt((unsigned) p_iovec[i].iov_base); +/* by IanJiang, ianjiang at ict.ac.cn */ +// return phys_to_virt((unsigned) p_iovec[i].iov_base); + return phys_to_virt((unsigned long) p_iovec[i].iov_base); } /** --- iser_utils.h.old 2005-07-01 17:41:00.000000000 +0800 +++ iser_utils.h 2005-08-05 06:49:32.000000000 +0800 @@ -48,7 +48,9 @@ * ISER TASK-SPECIFIC HASH MANAGEMENT * ------------------------------------------------------------------ */ -struct iser_task_t *hash_find_iser_task(struct iser_conn_t *iser_conn, u32 itt); +/* by IanJiang, ianjiang at ict.ac.cn */ +//struct iser_task_t *hash_find_iser_task(struct iser_conn_t *iser_conn, u32 itt); +struct iser_task_t *hash_find_iser_task(struct iser_conn_t *iser_conn, u64 itt); void hash_add_iser_task(struct iser_task_t *iser_task); 
From eitan at mellanox.co.il Sat Aug 6 23:15:42 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 7 Aug 2005 09:15:42 +0300
Subject: [openib-general] RE: OpenSM Work
Message-ID: <506C3D7B14CDD411A52C00025558DED607C305E5@mtlex01.yok.mtl.com>

Hi Hal,

Regarding the directory structure of OpenSM:

From the developer standpoint it makes life easier to have the H files located in the same directory as the C files. I still use "grep" a lot.

I could not find even one GNU-style project that uses a separate include directory for H files during the development phase (all of them eventually install H files into $prefix/include). Please see the short list of projects below as a reference; I picked them randomly from the FSF GNU projects page.

Tcl
Expat
Enscript
Gimp

So I guess this methodology of keeping all H files in a separate directory is more of a "kernel" convention?

What are the implications of moving each H file into its source directory?

This issue is more a matter of style and adherence to the standard coding practices for user-level code than something that prevents us from progressing. However, I wonder what the strong reasons are for keeping it the way it is.

EZ

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O.
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Friday, August 05, 2005 7:24 PM > To: Eitan Zahavi > Cc: 'Matt Leininger'; openib-general at openib.org > Subject: RE: [openib-general] RE: OpenSM Work > > Hi Eitan, > > While I don't understand why the update for 1.8.0 can't be done by > patches which is the usual way (I think it could be broken at least into > complib, then vendor lib, then SM, and finally SA changes), I will work > on merging Yael's branch for this. Note that there may be some back and > forth on this similar to comments on patches. In the future, I hope that > work can be done in smaller incremental pieces and with patches. > > As to the directory structure, there are projects which follow the > structure which is being used in the OpenIB tree. The makefiles already > do install the headers. That being said the directory structure is not > cast in stone but there is a lot of churn here to change it. Are there > any other clear benefits ? Does it somehow make your internal > development easier ? If that is it, I don't see why a correspondence > script wouldn't work. Typically things like this are community decided. > > I would think the simulator work is separable and would prefer to hold > off on this until the OpenSM merge is done and working. That alone seems > like a lot to swallow at once. > > Finally, as to feedback on the proposals for OpenSM work, as I recall, > there were responses from both Tom and myself both being supportive of > new functionality and some design review issues (particularly relating > to routing algorithms proposed). I would expect this work to generate > more feedback as there is code to go along with it or possibly even with > an update on the design approach. > > -- Hal -------------- next part -------------- An HTML attachment was scrubbed... 
URL: Hi, James Lentini wrote: > On Thu, 4 Aug 2005, Guy German wrote: > >> James, >> >> I see what you mean. >> The allocation of the event vector is derived from evd->qlen. >> In DTO ev'd, however, qlen is also the parameter passed to >> ib_create_cq. Since we don't want to limit DAPL consumers to an >> unnecessary small completion queue size, maybe we >> could differentiate between DTO supporting evd's and >> CONN evd's, when allocating the events vector. >> >> if evd supports CONN only, leave it : >> event = kmalloc(evd->qlen * sizeof *event) >> (Relying on the consumer he knows what he is doing) >> if evd is DTO only : >> don't allocate an event buffer, at all >> if evd supports both : >> event = kmalloc(DEFAULT_4_CONN * sizeof *event) > > And dynamically add additional events up to qlen as needed? >> >> if DEFAULT_4_CONN=256, that's a 3 pages allocation.
>> >> How does that sound to you ? > > I'd prefer that the EVDs were uniform. I would worry about bugs > otherwise. But if it's the right thing to do (allocating less, or not allocating at all), wouldn't it be best to try and fix those bugs ? > > The eventual solution has to support qlen generated events (connection > request, connection, disconnect, software) if those event types are > supported by the EVD (even if the EVD is being used for both generated > events and DTOs). I think that there might be a mix up in the qlen meaning, for the consumer. ia_quey returns cq_len parameter, even if the evd in mind is used for connection only. In that case the result of qlen (128k), from the hca has nothing to do with the real resources availability. You mentioned before that "allocating an event pool equal to the queue length seems like overkill", and I agree. If you want to support pending events list of more than ~2700, just for CONN events, and not use vmalloc, we need some way of doing a few kmallocs and managing that. Is it really necessary ? wouldn't 256 *pending* events for CONN. purposes be enough ? Thanks, Guy. From mst at mellanox.co.il Sun Aug 7 00:35:06 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 7 Aug 2005 10:35:06 +0300 Subject: [openib-general] Re: SDP and uCM. In-Reply-To: <20050804142610.C31096@topspin.com> References: <20050804142610.C31096@topspin.com> Message-ID: <20050807073506.GQ15300@mellanox.co.il> Quoting r. Libor Michalek : > I also think that Tom Duffy would > be the right person to maintain the SDP code, unless he doesn't think > he has the time. In which case Michael Tsirkin knows the code very well > and would hopefully consider taking over the code. > > Thank you all. > > -Libor > Libor, good luck in your new job, wherever you go. I'll be glad to maintain SDP - I'm already working on it a big percentage of my time, anyway. 
We also have a verification team being setup here at mellanox, in particular for SDP which would help maintain/improve SDP stability. -- MST References: <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> <52u0i45k5e.fsf@cisco.com> Message-ID: <20050807082220.GU15300@mellanox.co.il> Quoting r. Roland Dreier : > > ib_mthca 0000:04:00.0: NOP command IRQ test passed > > ib_mthca 0000:04:00.0: mthca_init_qp_table: mthca_CONF_SPECIAL_QP failed for 0/1024 (-16) > > Hmm, looks like CONF_SPECIAL_QP is timing out. > > MST (or any Mellanox people), any idea why this might happening?
The > NOP command is working fine with interrupts, but CONF_SPECIAL_QP is > timing out. The difference from the working setup is that the HCA's > local memory is mapped above 4 GB. > > - R. > My understanding is this was diagnosed as a bug in pci_restore_bars. Is that right, or does this need more looking into? -- MST From tziporet at mellanox.co.il Sun Aug 7 01:36:45 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 7 Aug 2005 11:36:45 +0300 Subject: [openib-general] RE: [PATCH] mthca: update FW versions Message-ID: <506C3D7B14CDD411A52C00025558DED6085BD034@mtlex01.yok.mtl.com> thanks -----Original Message----- From: Roland Dreier [mailto:rolandd at cisco.com] Sent: Thursday, August 04, 2005 7:11 PM To: Tziporet Koren Cc: Roland Dreier (E-mail); openib-general at openib.org Subject: Re: [PATCH] mthca: update FW versions Tziporet> This patch (attached) update FW versions to check Tziporet> according to latest Mellanox FW release in July. See Tziporet> http://www.mellanox.com/products/firmware.html Thanks, I'll apply this after I get my SRQ work checked in, and I'll make sure this goes upstream as soon as 2.6.14 opens. - R. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Sun Aug 7 01:50:31 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 7 Aug 2005 11:50:31 +0300 Subject: [openib-general] [PATCH] flush_scheduled_work on SDP module unload In-Reply-To: <9d3b7de7050805201675011514@mail.gmail.com> References: <20050804120601.GG15300@mellanox.co.il> <9d3b7de7050805201675011514@mail.gmail.com> Message-ID: <20050807085031.GV15300@mellanox.co.il> Quoting r. Tom Duffy : > Subject: Re: [openib-general] [PATCH] flush_scheduled_work on SDP module unload > > On 8/4/05, Michael S. Tsirkin wrote: > > Need to flush scheduled work on SDP module unload: make sure > > that a deferred iocb isnt outstanding. > > Is this still needed? Or was it just from the incorrect printout? 
> > -tduffy > I think this is needed. Not sure what printout do you refer to? -- MST ; from iod00d@hp.com on Fri, Aug 05, 2005 at 09:33:54PM -0700 References: <52psss5k1x.fsf@cisco.com> <86802c44050805112661d889aa@mail.gmail.com> <86802c4405080512254b9cd496@mail.gmail.com> <86802c4405080512451cdcae48@mail.gmail.com> <86802c44050805132853070f1@mail.gmail.com> <20050805220015.GA3524@suse.de> <20050805235937.GK25121@esmail.cup.hp.com> <20050806043354.GA27352@esmail.cup.hp.com> Message-ID: <20050807134959.A3847@jurassic.park.msu.ru> On Fri, Aug 05, 2005 at 09:33:54PM -0700, Grant Grundler wrote: > > ISTR making comments before about the offending patch on linux-pci mailing > > list. Is this the same patch that assumes pci_dev->resource[i] == BAR[i] ? > > I meant the patch assume 1:1 for pci_dev->resource[i] and BAR[i]. > not that the two are equivalent.
This is correct assumption. For 64-bit BAR[i] only pci_dev->resource[i] is valid, pci_dev->resource[i+1] slot is unused and contains zeroes in all fields. So all we need is just to check that we're going to update a _valid_ resource. [Though, if we ever want to support >4Gb bus allocations on 32-bit architectures we need to make resource start and end fields u64.] Ivan. --- 2.6.13-rc5-git4/drivers/pci/setup-res.c Sun Aug 7 12:08:23 2005 +++ linux/drivers/pci/setup-res.c Sun Aug 7 13:27:54 2005 @@ -33,6 +33,11 @@ pci_update_resource(struct pci_dev *dev, u32 new, check, mask; int reg; + /* Ignore resources for unimplemented BARs and unused resource slots + for 64 bit BARs. */ + if (!res->flags) + return; + pcibios_resource_to_bus(dev, &region, res); pr_debug(" got res [%lx:%lx] bus [%lx:%lx] flags %lx for " @@ -67,7 +72,7 @@ pci_update_resource(struct pci_dev *dev, if ((new & (PCI_BASE_ADDRESS_SPACE|PCI_BASE_ADDRESS_MEM_TYPE_MASK)) == (PCI_BASE_ADDRESS_SPACE_MEMORY|PCI_BASE_ADDRESS_MEM_TYPE_64)) { - new = 0; /* currently everyone zeros the high address */ + new = region.start >> 32; pci_write_config_dword(dev, reg + 4, new); pci_read_config_dword(dev, reg + 4, &check); if (check != new) { From mst at mellanox.co.il Sun Aug 7 03:35:58 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 7 Aug 2005 13:35:58 +0300 Subject: [openib-general] Re: [PATCH] remove in_atomic In-Reply-To: <20050804164924.D30741@topspin.com> References: <20050801063529.GO14384@mellanox.co.il> <20050804164924.D30741@topspin.com> Message-ID: <20050807103558.GY15300@mellanox.co.il> Quoting r. Libor Michalek : > Subject: Re: [PATCH] remove in_atomic > > On Mon, Aug 01, 2005 at 09:35:29AM +0300, Michael S. Tsirkin wrote: > > in_atomic isnt a reliable way to check that we are in an atomic context. > > Just schedule work, always, since most cq polling is currently done under > > a spinlock, anyway. > > I agree this patch is correct, but did you see any performance change?
No, I didnt notice a big change on my hardware. > With it applied I'm seeing a lot more variance in performance from run > to run. I believe the differences look to do with timing and how > frequently we are using source rdma advertisement vs. the sink. I believe > the slow start algorithm for src_avails needs improvment. > > -Libor I have some simpler ideas for performance improvement: for example, avoid deferring send iocb, or poll cq outside the spinlock in thread context. -- MST From mst at mellanox.co.il Sun Aug 7 04:40:11 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 7 Aug 2005 14:40:11 +0300 Subject: [openib-general] Re: sdp: cant unload ib_ipoib module In-Reply-To: <20050803070923.GG15300@mellanox.co.il> References: <1123007814.2946.12.camel@duffman> <20050803070923.GG15300@mellanox.co.il> Message-ID: <20050807114011.GB15300@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: sdp: cant unload ib_ipoib module > > Quoting r. Tom Duffy : > > Perhaps you need my sdp_inet_port_put() patch? > > Could be. > I'll give it a spin next week. Thanks! I still see this issue with your patch: unregister_netdevice: waiting for ib0 to become free. Usage count = 2 The patch also seems to break loopback. Did you test it in that configuration? This is the patch I used: http://www.mail-archive.com/openib-general at openib.org/msg08173.html I plan to look at this whole issue sometime later, for now I'm trying to concentrate on the zcopy work. -- MST From liran at mellanox.co.il Sun Aug 7 07:53:56 2005 From: liran at mellanox.co.il (Liran Sorani) Date: Sun, 7 Aug 2005 17:53:56 +0300 Subject: [openib-general] [PATCH ] osmtest general cleanups Message-ID: <506C3D7B14CDD411A52C00025558DED60865E723@mtlex01.yok.mtl.com> Hi , Hal. The attached patch should be applied to osmtest repository. It contain several cleanups (on most of the files) : -Removal of inform info flow . 
-Unique error messages -Makefile.am update for compilation (required osm_helper object) -Remove vendor dependencies. The inform info flow should be carefully ported since it requires direct access to ib_umad (possibly ib_verbs too) . I'll send another patch for it next week . Thanks , Liran . -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: osmtest.patch Type: application/octet-stream Size: 40267 bytes Desc: not available URL: Hi , Hal. I've a few minor fixes (several white space glitches) , pls take the attached file , instead of the previous , thanks Liran.
-----Original Message----- From: Liran Sorani Sent: Sunday, August 07, 2005 5:57 PM To: 'halr at voltaire.com' Cc: 'openib-general at openib.org'; Amit Krig Subject: [PATCH ] osmtest general cleanups Hi , Hal. The attached patch should be applied to osmtest repository. It contain several cleanups (on most of the files) : -Removal of inform info flow . -Unique error messages -Makefile.am update for compilation (required osm_helper object) -Remove vendor dependencies. The inform info flow should be carefully ported since it requires direct access to ib_umad (possibly ib_verbs too) . I'll send another patch for it next week . Thanks , Liran . -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: osmtest.patch Type: application/octet-stream Size: 40246 bytes Desc: not available URL: From mst at mellanox.co.il Sun Aug 7 08:46:33 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 7 Aug 2005 18:46:33 +0300 Subject: [openib-general] Re: sdp: cant unload ib_ipoib module In-Reply-To: <20050804104250.A30741@topspin.com> References: <1123007814.2946.12.camel@duffman> <20050803070923.GG15300@mellanox.co.il> <20050804104250.A30741@topspin.com> Message-ID: <20050807154632.GG15300@mellanox.co.il> > > After running ttcp.aio.x multiple times, I am seeing > > these messages in dmesg: > > > > unregister_netdevice: waiting for ib0 to become free. Usage count = 2 > > > > and ipoib cant be unloaded. > Quoting Hal Rosenstock : > This was reported back a while ago. The simplest scenario I have found to > reproduce this is as follows: Quoting r. Libor Michalek : > Michael, > > I remember this problem from the last time Hal mentioned it, I had > just forgotten about it. I'm almost certain this is being caused by > a reference counter incremented in the sdp address resolution code > that is never decremented. 
In sdp_link.c:do_link_path_lookup() we get > the route table entry using ip_route_output_key(), which I believe > takes out a reference, and that reference should be returned using > ip_rt_put() which we never do. OK, I grew tired of rebooting and went over sdp_link.c looking for stale references. Given how the old code doesnt ever do ip_rt_put in most cases, I dont really understand why dont more people see this problem. With the patch below I can unload ipoib after running sdp tests. Hal, could you verify that this patch helps you, too? --- Drop net_device and rtable references after path lookup is done. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_link.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_link.c +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_link.c @@ -327,6 +327,7 @@ static void do_link_path_lookup(void *da { struct sdp_path_info *info = data; struct ipoib_dev_priv *priv; + struct net_device *loopback = NULL; struct rtable *rt; int counter = 0; int result = 0; @@ -360,6 +361,7 @@ static void do_link_path_lookup(void *da result = ip_route_output_key(&rt, &fl); if (result < 0 || !rt) { + rt = NULL; sdp_dbg_warn(NULL, "Error <%d> routing <%08x:%08x> (%d)", result, info->dst, info->src, info->dif); goto error; @@ -368,7 +370,6 @@ static void do_link_path_lookup(void *da * check route flags */ if ((RTCF_MULTICAST|RTCF_BROADCAST) & rt->rt_flags) { - ip_rt_put(rt); result = -ENETUNREACH; goto error; } @@ -402,7 +403,8 @@ static void do_link_path_lookup(void *da * direct the loopback traffic. */ info->dev = ((rt->u.dst.neighbour->dev->flags & IFF_LOOPBACK) ? 
- ip_dev_find(rt->rt_src) : rt->u.dst.neighbour->dev); + (loopback = ip_dev_find(rt->rt_src)) : + rt->u.dst.neighbour->dev); info->gw = rt->rt_gateway; info->src = rt->rt_src; /* true source IP address */ @@ -500,7 +502,9 @@ arp: info->flags |= SDP_LINK_F_ARP; queue_delayed_work(link_wq, &info->timer, info->arp_time); - + if (loopback) + dev_put(loopback); + ip_rt_put(rt); return; path: result = sdp_link_path_rec_get(info); @@ -509,9 +513,15 @@ path: goto error; } done: + if (loopback) + dev_put(loopback); + ip_rt_put(rt); return; error: sdp_path_info_destroy(info, result); + if (loopback) + dev_put(loopback); + ip_rt_put(rt); } /* -- MST From rolandd at cisco.com Sun Aug 7 09:20:21 2005 From: rolandd at cisco.com (Roland Dreier) Date: Sun, 07 Aug 2005 09:20:21 -0700 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <20050807082220.GU15300@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 7 Aug 2005 11:22:20 +0300") References: <52u0i6b9an.fsf_-_@cisco.com> <86802c44050804093374aca360@mail.gmail.com> <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> <52u0i45k5e.fsf@cisco.com> <20050807082220.GU15300@mellanox.co.il> Message-ID: <52ek954t2y.fsf@cisco.com> Michael> My understanding is this was diagnosed as a bug in Michael> pci_restore_bars. Is that right, or does this need more Michael> looking into? It's hard to tell for sure based on the thread, but it seems like the bug in pci_restore_bars() was introduced after the initial bug report. There still seems to be a problem when the HCA's BARs are assigned over 4 GB. - R. 
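For readers following the BAR thread above: a minimal sketch, not kernel code, of how a bus address is split across the two config-space dwords of a 64-bit memory BAR. This is the step Ivan's setup-res.c patch changes from writing zero to writing region.start >> 32, and it is why an HCA mapped above 4 GB is affected. The flag values are the ones defined by the PCI spec (Linux names them PCI_BASE_ADDRESS_*); the struct bar_dwords and the helper split_bar64() are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Flag bits in the low dword of a BAR, as defined by the PCI spec
 * (Linux exposes the same values as PCI_BASE_ADDRESS_* in pci_regs.h). */
#define BAR_SPACE_IO       0x01u  /* bit 0: 1 = I/O space, 0 = memory */
#define BAR_MEM_TYPE_MASK  0x06u  /* bits 2:1: memory decoder type */
#define BAR_MEM_TYPE_64    0x04u  /* 64-bit decoder, occupies two dwords */
#define BAR_MEM_FLAG_MASK  0x0fu  /* low 4 bits are flags, not address */

/* Hypothetical container for the two config-space dwords of one BAR. */
struct bar_dwords {
    uint32_t lo;        /* value for the BAR's own config offset */
    uint32_t hi;        /* value for offset + 4 (64-bit BARs only) */
    int      is_64bit;  /* nonzero if the high dword is meaningful */
};

/* Hypothetical helper: split a 64-bit bus address into BAR dwords. */
static struct bar_dwords split_bar64(uint64_t bus_addr, uint32_t flags)
{
    struct bar_dwords d;

    /* Low dword: address bits 31:4 combined with the flag bits. */
    d.lo = ((uint32_t)bus_addr & ~BAR_MEM_FLAG_MASK) |
           (flags & BAR_MEM_FLAG_MASK);

    /* Only a memory BAR with the 64-bit type uses a second dword. */
    d.is_64bit = !(flags & BAR_SPACE_IO) &&
                 (flags & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64;

    /* High dword: address bits 63:32. Writing 0 here is only
     * correct when the BAR sits below 4 GB. */
    d.hi = d.is_64bit ? (uint32_t)(bus_addr >> 32) : 0u;

    return d;
}
```

With a prefetchable 64-bit BAR (flags 0x0c) at bus address 0x1fc000000, this gives lo = 0xfc00000c and hi = 0x1. If the high dword were written as 0 instead, the device would decode at 0xfc000000, which may be one way to get the kind of command timeout reported in the thread; that link is an assumption here, not something the thread confirms.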
From tomduffy at gmail.com Sun Aug 7 09:34:58 2005 From: tomduffy at gmail.com (Tom Duffy) Date: Sun, 7 Aug 2005 09:34:58 -0700 Subject: [openib-general] [PATCH] flush_scheduled_work on SDP module unload In-Reply-To: <20050804120601.GG15300@mellanox.co.il> References: <20050804120601.GG15300@mellanox.co.il> Message-ID: <9d3b7de705080709346653b870@mail.gmail.com> On 8/4/05, Michael S. Tsirkin wrote: > Need to flush scheduled work on SDP module unload: make sure > that a deferred iocb isnt outstanding. Thanks, committed in revision 3009. -tduffy From tomduffy at gmail.com Sun Aug 7 09:37:39 2005 From: tomduffy at gmail.com (Tom Duffy) Date: Sun, 7 Aug 2005 09:37:39 -0700 Subject: [openib-general] Re: sdp: cant unload ib_ipoib module In-Reply-To: <20050807114011.GB15300@mellanox.co.il> References: <1123007814.2946.12.camel@duffman> <20050803070923.GG15300@mellanox.co.il> <20050807114011.GB15300@mellanox.co.il> Message-ID: <9d3b7de705080709375e5ac401@mail.gmail.com> On 8/7/05, Michael S. Tsirkin wrote: > The patch also seems to break loopback. Did you test it in that > configuration? No, I will test and fix before committing. -tduffy From tomduffy at gmail.com Sun Aug 7 10:04:03 2005 From: tomduffy at gmail.com (Tom Duffy) Date: Sun, 7 Aug 2005 10:04:03 -0700 Subject: [openib-general] RE: [PATCH ] osmtest general cleanups In-Reply-To: <506C3D7B14CDD411A52C00025558DED60865E748@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED60865E748@mtlex01.yok.mtl.com> Message-ID: <9d3b7de70508071004584ec050@mail.gmail.com> On 8/7/05, Liran Sorani wrote: > Hi , Hal. > I've a few minor fixes (several white space glitches) , pls take the > attached file , instead of the previous , > thanks Liran. Please post patches inline. Even if you do attach the patch 'cause your mailer munges whitespace, at least the patch can be reviewed inside of an email reader. Also, please make sure the mime type is correct on attached patches: patches are not application/octet-stream. 
Thanks, -tduffy
Hello All, I have a question regarding the exact meaning of the mtu selector in the MCMemberRecord: Does the multicast group carry a range of mtus according to the creation request mtu_selector and mtu, or does it carry an exact mtu selected according to the request mtu_selector and mtu? When a multicast group is created, if the value of the mtu selector in the request MCMemberRecord is not 2 (meaning: greater than or lower than), does this value of the mtu_selector have a meaning outside the scope of the creation request? For example, if a request to create was received with mtu selector = 0, and the mtu = 256. and the multicast group was created with the specific mtu value 2048 (selected by the SA according to p913 l14-16): 1. Does a join request with mtu=1024, and selector=2 succeed? 2. Does a query request with mtu=256, and selector=0 (both compmask are set) return the above mcgroup? My understanding is: mcgroup does not carry the mtu_selector value, meaning it will always carry an exact mtu value. Thus the answers to the previous questions are: 1. No.
Since the mcgroup was created with mtu=2048. 2. No. There is no meaning for this query (the mtu must be provided along with mtu selector). Thank you in advance, Yael -------------- next part -------------- An HTML attachment was scrubbed... URL: From eitan at mellanox.co.il Mon Aug 8 01:00:34 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 08 Aug 2005 11:00:34 +0300 Subject: [openib-general] [PATCH] osm: Remove mad_addr file which is a result of a grep Message-ID: <42F71122.8040908@mellanox.co.il> Hi Hal, Please remove the file osm/opensm/mad_addr which is a stale file with some grep results. Signed-off-by: Eitan Zahavi Index: mad_addr =================================================================== --- mad_addr (revision 3009) +++ mad_addr (working copy) @@ -1,127 +0,0 @@ -osm_mad_pool.c:110: *pp_pool_item = &p_madw->pool_item; -osm_mad_pool.c:208: p_mad = osm_vendor_get( h_bind, total_size, &p_madw->vend_wrap ); -osm_mad_pool.c:230: "size = %u.\n", p_madw, p_madw->p_mad, total_size ); -osm_mad_pool.c:276: "size = %u\n", p_madw, p_madw->p_mad, total_size ); -osm_mad_pool.c:321: p_madw, p_madw->p_mad ); -osm_mad_pool.c:326: if( p_madw->p_mad ) -osm_mad_pool.c:327: osm_vendor_put( p_madw->h_bind, &p_madw->vend_wrap ); -osm_req.c:217: p_madw->mad_addr.dest_lid = IB_LID_PERMISSIVE; -osm_req.c:218: p_madw->mad_addr.addr_type.smi.source_lid = IB_LID_PERMISSIVE; -osm_req.c:219: p_madw->resp_expected = TRUE; -osm_req.c:220: p_madw->fail_msg = err_msg; -osm_req.c:229: p_madw->context = *p_context; -osm_req.c:307: p_madw->mad_addr.dest_lid = IB_LID_PERMISSIVE; -osm_req.c:308: p_madw->mad_addr.addr_type.smi.source_lid = IB_LID_PERMISSIVE; -osm_req.c:309: p_madw->resp_expected = TRUE; -osm_req.c:310: p_madw->fail_msg = err_msg; -osm_req.c:319: p_madw->context = *p_context; -osm_resp.c:230: p_madw->mad_addr.dest_lid = -osm_resp.c:232: p_madw->mad_addr.addr_type.smi.source_lid = -osm_resp.c:235: p_madw->resp_expected = FALSE; -osm_resp.c:236: p_madw->fail_msg = 
CL_DISP_MSGID_NONE; -osm_sa_class_port_info.c:174: p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool, p_madw->h_bind, -osm_sa_class_port_info.c:175: mad_size, &p_madw->mad_addr ); -osm_sa_class_port_info.c:243: status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); -osm_sa_informinfo.c:333: p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool, p_madw->h_bind, -osm_sa_informinfo.c:334: mad_size, &p_madw->mad_addr ); -osm_sa_informinfo.c:359: status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); -osm_sa_informinfo.c:414: inform_info_rec.report_addr = p_madw->mad_addr; -osm_sa_informinfo.c:417: inform_info_rec.h_bind = p_madw->h_bind; -osm_sa_informinfo.c:425: osm_get_gid_by_mad_addr( p_rcv->p_log, p_rcv->p_subn, &p_madw->mad_addr ); -osm_sa_lft_record.c:467: p_madw->h_bind, -osm_sa_lft_record.c:469: &p_madw->mad_addr ); -osm_sa_lft_record.c:535: status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); -osm_sa_link_record.c:679: p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool, p_madw->h_bind, -osm_sa_link_record.c:680: mad_size, &p_madw->mad_addr ); -osm_sa_link_record.c:752: status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); -osm_sa_mad_ctrl.c:409: CL_ASSERT( p_madw->resp_expected == TRUE); -osm_sa_mcmember_record.c:445: p_madw->h_bind, -osm_sa_mcmember_record.c:453: p_resp_sa_mad = (ib_sa_mad_t*)p_resp_madw->p_mad; -osm_sa_mcmember_record.c:454: p_sa_mad = (ib_sa_mad_t*)p_madw->p_mad; -osm_sa_mcmember_record.c:480: p_resp_madw->h_bind, -osm_sa_mcmember_record.c:1692: p_madw->h_bind, -osm_sa_mcmember_record.c:1778: status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); -osm_sa_node_record.c:571: p_madw->h_bind, -osm_sa_node_record.c:573: &p_madw->mad_addr ); -osm_sa_node_record.c:639: status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); -osm_sa_path_record.c:1253: p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool, p_madw->h_bind, -osm_sa_path_record.c:1254: mad_size, 
&p_madw->mad_addr ); -osm_sa_path_record.c:1318: status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); -osm_sa_pkey_record.c:471: p_madw->h_bind, -osm_sa_pkey_record.c:473: &p_madw->mad_addr ); -osm_sa_pkey_record.c:537: status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); -osm_sa_portinfo_record.c:733: p_madw->h_bind, -osm_sa_portinfo_record.c:735: &p_madw->mad_addr ); -osm_sa_portinfo_record.c:814: status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); -osm_sa_response.c:164: p_madw->h_bind, MAD_BLOCK_SIZE, &p_madw->mad_addr ); -osm_sa_service_record.c:421: p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool, p_madw->h_bind, -osm_sa_service_record.c:422: mad_size, &p_madw->mad_addr ); -osm_sa_service_record.c:513: status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); -osm_sa_slvl_record.c:499: p_madw->h_bind, -osm_sa_slvl_record.c:501: &p_madw->mad_addr ); -osm_sa_slvl_record.c:569: status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); -osm_sa_sminfo_record.c:213: p_madw->h_bind, -osm_sa_sminfo_record.c:215: &p_madw->mad_addr ); -osm_sa_sminfo_record.c:270: status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); -osm_sa_vlarb_record.c:523: p_madw->h_bind, -osm_sa_vlarb_record.c:525: &p_madw->mad_addr ); -osm_sa_vlarb_record.c:593: status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); -osm_sm_mad_ctrl.c:230: CL_ASSERT( p_madw->resp_expected == FALSE ); -osm_sm_mad_ctrl.c:235: if( p_madw->resp_expected == TRUE ) -osm_sm_mad_ctrl.c:760: CL_ASSERT( p_madw->resp_expected == FALSE ); -osm_sm_mad_ctrl.c:765: if( p_madw->resp_expected == TRUE ) -osm_sm_mad_ctrl.c:858: ib_get_err_str( p_madw->status ) ); -osm_sm_mad_ctrl.c:861: CL_ASSERT( p_madw->resp_expected == TRUE ); -osm_sm_mad_ctrl.c:890: if ( p_madw->mad_addr.dest_lid != 0xFFFF ) -osm_sm_mad_ctrl.c:895: &(p_madw->mad_addr)); -osm_sm_mad_ctrl.c:908: p_ctrl->p_log, p_ctrl->p_subn, p_physp, p_madw->h_bind ); 
-osm_trap_rcv.c:366: if (p_madw->p_mad->mgmt_class == IB_MCLASS_SUBN_LID || -osm_trap_rcv.c:367: p_madw->p_mad->mgmt_class == IB_MCLASS_SUBN_DIR ) -osm_trap_rcv.c:404: if (p_madw->mad_addr.addr_type.smi.source_lid == 0) -osm_vendor_al.c:273: p_madw->status = __osm_al_convert_wcs( p_elem->status ); -osm_vendor_ibumad.c:604: p_madw->p_mad = NULL; -osm_vendor_mlx.c:482: is_rmpp = (p_madw->mad_size > MAD_BLOCK_SIZE || osmv_mad_is_rmpp(p_mad)); -osm_vendor_mlx.c:560: p_madw->status = ret; -osm_vendor_mlx_anafa.c:479: is_rmpp = (p_madw->mad_size > MAD_BLOCK_SIZE -osm_vendor_mlx_anafa.c:546: p_madw->status = ret; -osm_vendor_mlx_dispatcher.c:409: p_madw->status = status; -osm_vendor_mlx_sa.c:104: request structure) is attached as the p_madw->context.ni_context.node_guid -osm_vendor_mlx_sa.c:140: p_sa_mad = ( ib_sa_mad_t * ) p_madw->p_mad; -osm_vendor_mlx_sa.c:158: if (! p_madw->mad_size) -osm_vendor_mlx_sa.c:188: ( ( p_madw->mad_size - IB_SA_MAD_HDR_SIZE ) / -osm_vendor_mlx_sa.c:192: query_res.result_cnt, p_madw->mad_size - IB_SA_MAD_HDR_SIZE, -osm_vendor_mlx_sa.c:194: ( p_madw->mad_size - IB_SA_MAD_HDR_SIZE ) % -osm_vendor_mlx_sa.c:235: (osmv_query_req_t *)(long*)(long)(p_madw->context.ni_context.node_guid); -osm_vendor_mlx_sa.c:540: p_madw->mad_addr.dest_lid = cl_hton16(p_bind->sm_lid); -osm_vendor_mlx_sa.c:541: p_madw->mad_addr.addr_type.smi.source_lid = -osm_vendor_mlx_sa.c:543: p_madw->mad_addr.addr_type.gsi.remote_qp = CL_HTON32(1); -osm_vendor_mlx_sa.c:544: p_madw->resp_expected = TRUE; -osm_vendor_mlx_sa.c:545: p_madw->fail_msg = CL_DISP_MSGID_NONE; -osm_vendor_mlx_sa.c:553: p_madw->context.ni_context.node_guid -osm_vendor_mlx_sa.c:557: p_madw->context.ni_context.node_guid = -osm_vendor_mlx_sa.c:567: p_madw->resp_expected ); -osm_vendor_mlx_sender.c:110: CL_ASSERT( p_madw->mad_size <= MAD_BLOCK_SIZE ); -osm_vendor_mlx_sender.c:113: cl_memcpy(p_mad, osm_madw_get_mad_ptr(p_madw), p_madw->mad_size); -osm_vendor_mlx_txn.c:176: (void*)p_madw->p_mad, 
-osm_vendor_mlx_txn.c:177: p_madw->mad_size, -osm_vendor_mlx_txn.c:681: p_madw->status = IB_TIMEOUT; -osm_vendor_mtl.c:101: CL_ASSERT(p_madw->p_mad); -osm_vendor_mtl.c:105: ib_mad_is_response(p_madw->p_mad) | -osm_vendor_mtl.c:106: (p_madw->p_mad->method == IB_MAD_METHOD_TRAP_REPRESS); -osm_vendor_mtl.c:970: p_madw->p_mad = NULL; -osm_vendor_mtl_transaction_mgr.c:371: if (osm_madw_req_p->p_madw->p_mad) -osm_vendor_mtl_transaction_mgr.c:376: osm_madw_req_p->p_madw->p_mad->trans_id ); -osm_vendor_mtl_transaction_mgr.c:422: const ib_mad_t *mad_p = p_madw->p_mad; -osm_vendor_mtl_transaction_mgr.c:443: p_madw, waking_time, p_madw->p_mad->trans_id ); -osm_vendor_ts.c:33: CL_ASSERT(p_madw->p_mad); -osm_vendor_ts.c:37: ib_mad_is_response(p_madw->p_mad) | -osm_vendor_ts.c:38: (p_madw->p_mad->method == IB_MAD_METHOD_TRAP_REPRESS); -osm_vendor_ts.c:327: p_mad_buf = (void *)p_madw->p_mad; -osm_vendor_ts.c:336: CL_ASSERT(p_madw->h_bind); -osm_vendor_ts.c:337: p_mad_buf = osm_vendor_get(p_madw->h_bind, mad_size, &p_madw->vend_wrap ); -osm_vendor_ts.c:364: p_madw->p_mad = p_mad_buf; -osm_vendor_ts.c:367: p_madw->h_bind = p_new_vw->h_bind; -osm_vendor_ts.c:750: p_madw->p_mad = NULL; -osm_vl15intf.c:168: if( p_madw->resp_expected == TRUE ) -osm_vl15intf.c:191: p_madw, p_madw->resp_expected ); -osm_vl15intf.c:399: if( p_madw->resp_expected == TRUE ) From eitan at mellanox.co.il Mon Aug 8 02:05:13 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 8 Aug 2005 12:05:13 +0300 Subject: [openib-general] osm: management headers installed into /usr/local/include Message-ID: <506C3D7B14CDD411A52C00025558DED607C30600@mtlex01.yok.mtl.com> Hi Hal, According to the README from the management directory the executables are installed by default to: /usr/local/ib/bin and the libs into /usr/local/ib/lib. The header files are not described but are installed into /usr/local/include/infiniband Was this done in purpose? 
I would expect the headers to follow the same prefix as the
executables and libs.

I can work on a patch to move them to /usr/local/ib/include

EZ
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From eitan at mellanox.co.il Mon Aug 8 04:52:13 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 8 Aug 2005 14:52:13 +0300
Subject: [openib-general] osm: management headers installed into /usr/local/include
Message-ID: <506C3D7B14CDD411A52C00025558DED607C30604@mtlex01.yok.mtl.com>

Hi Hal,

I looked further into the userspace/management autoconf/automake
system and found out that the tweak of the prefix directory (from
/usr/local to /usr/local/ib) was done as a hack, rather than the "auto
tools" way:

Autoconf provides the macro AC_PREFIX_DEFAULT to change the default
prefix.

Using this macro, the bindir, libdir, includedir, datadir, etc. will
be set correctly.

To use it in our case every configure.in should have this directive:

AC_PREFIX_DEFAULT([/usr/local/ib])

The current method of overriding the bindir/libdir in the Makefile.am
works but breaks other standard autoconf features like the --prefix
option for directing the installation to a non-default directory.

Actually some other prefix-dependent variables should be re-assigned
and are missing.

I think that we should fix this and do it the "Auto tools" way.

Please approve and I will provide a patch for using the macro.

Actually the entire management tree should have been built as a
standard "auto tools" project utilizing AC_CONFIG_SUBDIRS. This is
a somewhat larger change so I do not think it is worth it this time. It
will require ALL the sub directories to be "autogen" then ALL of them
configured then ALL make ...

This normally requires a special compilation mode ("pre-install")
where the libs and includes are taken from relative paths rather than
from the final "install" path.
The advantage is that the entire project can be "configured" and if
something is broken in the required C lib or stdlib you do not wait for
the specific sub dir to fail but get the notice in advance. There might
be other benefits like adhering to the standard build procedure.

Please confirm that the direction for this patch (using
AC_PREFIX_DEFAULT) is acceptable.

EZ

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

-----Original Message-----
From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
Sent: Monday, August 08, 2005 12:05 PM
To: openib-general at openib.org
Subject: [openib-general] osm: management headers installed into /usr/local/include

Hi Hal,

According to the README from the management directory the executables
are installed by default to:

/usr/local/ib/bin and the libs into /usr/local/ib/lib.

The header files are not described but are installed into
/usr/local/include/infiniband

Was this done on purpose?

I would expect the headers to follow the same prefix as the
executables and libs.

I can work on a patch to move them to /usr/local/ib/include

EZ
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From halr at voltaire.com Mon Aug 8 04:47:28 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Aug 2005 07:47:28 -0400
Subject: [openib-general] Re: osm: management headers installed into /usr/local/include
In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C30600@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED607C30600@mtlex01.yok.mtl.com>
Message-ID: <1123501512.4422.3079.camel@hal.voltaire.com>

On Mon, 2005-08-08 at 05:05, Eitan Zahavi wrote:
> Hi Hal,
>
> According to the README from the management directory the executables
> are installed by default to:
>
> /usr/local/ib/bin and the libs into /usr/local/ib/lib.
> > The header files are not described but are installed into > /usr/local/include/infiniband > > Was this done in purpose? Yes, the documentation is out of date. There was a thread to move them to /usr/local/include/infiniband to make this consistent with the other includes. I will update the README. -- Hal > I would expect the headers to follow the same prefix of the > executables and libs. > > I can work on a patch to move them to /usr/local/ib/include > > EZ > From mst at mellanox.co.il Mon Aug 8 05:21:52 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 8 Aug 2005 15:21:52 +0300 Subject: [openib-general] Re: [PATCH 1/2] [IB/cm]: Correct CM port redirect reject codes In-Reply-To: <86802c4405080411227bce41f7@mail.gmail.com> References: <20057281331.dR47KhjBsU48JfGE@cisco.com> <20057281331.7vqhiAJ1Yc0um2je@cisco.com> <86802c44050803175873fb0569@mail.gmail.com> <20050804064223.GT15300@mellanox.co.il> <86802c4405080411227bce41f7@mail.gmail.com> Message-ID: <20050808122152.GU15300@mellanox.co.il> Quoting r. yhlu : > On 8/3/05, Michael S. Tsirkin wrote: > > Quoting yhlu : > > > Subject: Re: [PATCH 1/2] [IB/cm]: Correct CM port redirect reject codes > > > > > > Roland, > > > > > > In LinuxBIOS, If I enable the prefmem64 to use real 64 range. the IB > > > driver in Kernel can not be loaded. Could you please test with latest firmware 4.7.0? Thanks, -- MST From halr at voltaire.com Mon Aug 8 05:31:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 08:31:45 -0400 Subject: [openib-general] RE: [PATCH ] osmtest general cleanups In-Reply-To: <506C3D7B14CDD411A52C00025558DED60865E748@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED60865E748@mtlex01.yok.mtl.com> Message-ID: <1123504305.4422.3242.camel@hal.voltaire.com> Hi Liran, On Sun, 2005-08-07 at 11:28, Liran Sorani wrote: > Hi , Hal. > I've a few minor fixes (several white space glitches) , pls take the > attached file , instead of the previous , Thanks. 
Applied with some minor modifications and caveats below. In the future can your patches be submitted as text rather than attachments ? That is the norm for doing this. Some comments below. > thanks Liran. > > -----Original Message----- > From: Liran Sorani > Sent: Sunday, August 07, 2005 5:57 PM > To: 'halr at voltaire.com' > Cc: 'openib-general at openib.org'; Amit Krig > Subject: [PATCH ] osmtest general cleanups > > > Hi , Hal. > The attached patch should be applied to osmtest repository. > It contain several cleanups (on most of the files) : It is easier if there is a patch per idea rather than an amalgam. > -Removal of inform info flow . > -Unique error messages There are some non real error message numbers (neither hex nor decimal) in osmt_multicast.c. Also, osmtest.c and osmt_service.c still have duplicates. > -Makefile.am update for compilation (required osm_helper object) osm_helper is part of the libopensm since r2973 so this part of the patch was not applied. Please update your osm directory. > -Remove vendor dependencies. Remove some OSM_VENDOR_INTF_MTL vendor dependencies (There are still some in osmt_slvl_vl_arb.c). > The inform info flow should be carefully ported since it requires > direct access to ib_umad (possibly ib_verbs too) . umad would be a temporary measure. Why would ib_uverbs direct access be needed ? > I'll send another patch for it next week . Thanks. -- Hal From halr at voltaire.com Mon Aug 8 05:40:12 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 08:40:12 -0400 Subject: [openib-general] Re: [PATCH] osm: Remove mad_addr file which is a result of a grep In-Reply-To: <42F71122.8040908@mellanox.co.il> References: <42F71122.8040908@mellanox.co.il> Message-ID: <1123504812.4422.3280.camel@hal.voltaire.com> On Mon, 2005-08-08 at 04:00, Eitan Zahavi wrote: > Hi Hal, > > Please remove the file osm/opensm/mad_addr which is a stale file with > some grep results. > > Signed-off-by: Eitan Zahavi Thanks. Applied. 
In the future, please include your email address as part of the signed
off lines.

-- Hal

From mst at mellanox.co.il Mon Aug 8 06:05:41 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 8 Aug 2005 16:05:41 +0300
Subject: [openib-general] Re: [BUG]OpenSM double free or corruption
Message-ID: <20050808130541.GB15300@mellanox.co.il>

Hal,
I'm trying to build libraries in non-standard directory.
My problem is that infiniband/mad.h assumes that it's installed in the same
directory as infiniband/common.h.

Pls consider the following patch.

---

Don't assume that all headers are in the same directory.

Signed-off-by: Michael S. Tsirkin

Index: management/libibmad/include/infiniband/mad.h
===================================================================
--- management/libibmad/include/infiniband/mad.h (revision 2963)
+++ management/libibmad/include/infiniband/mad.h (working copy)
@@ -36,7 +36,7 @@
 #include
 #include
-#include "common.h"
+#include <infiniband/common.h>
 #ifdef __cplusplus
 # define BEGIN_C_DECLS extern "C" {

--
MST

From eitan at mellanox.co.il Mon Aug 8 06:07:03 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 8 Aug 2005 16:07:03 +0300
Subject: [openib-general] RE: [PATCH] osm: Remove mad_addr file which is a result of a grep
Message-ID: <506C3D7B14CDD411A52C00025558DED607C30605@mtlex01.yok.mtl.com>

Hi Hal,

I followed the Wiki directions.
I updated the Wiki to reflect that requirement...

> In the future, please include your email address as part of the signed
> off lines.
>
> -- Hal
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mst at mellanox.co.il Mon Aug 8 06:13:08 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 8 Aug 2005 16:13:08 +0300
Subject: [openib-general] [PATCH] libibmad: use infiniband/ for includes
Message-ID: <20050808131308.GC15300@mellanox.co.il>

Repost: subject line was wrong

---

Hal,
I'm trying to build libraries in non-standard directory.
My problem is that infiniband/mad.h assumes that it's installed in the same
directory as infiniband/common.h.

Another reason is to make headers into examples of correct usage.

Pls consider the following patches.

---

libibmad: don't assume that all headers are in the same directory.

Signed-off-by: Michael S. Tsirkin

Index: management/libibmad/include/infiniband/mad.h
===================================================================
--- management/libibmad/include/infiniband/mad.h (revision 2963)
+++ management/libibmad/include/infiniband/mad.h (working copy)
@@ -36,7 +36,7 @@
 #include
 #include
-#include "common.h"
+#include <infiniband/common.h>
 #ifdef __cplusplus
 # define BEGIN_C_DECLS extern "C" {

--
MST
_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

--
MST

From mst at mellanox.co.il Mon Aug 8 06:14:40 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 8 Aug 2005 16:14:40 +0300
Subject: [openib-general] [PATCH] libibumad: use infiniband/ for includes
In-Reply-To: <20050808131308.GC15300@mellanox.co.il>
References: <20050808131308.GC15300@mellanox.co.il>
Message-ID: <20050808131439.GD15300@mellanox.co.il>

> Hal, I'm trying to build libraries in non-standard directory.

Same for libibumad.

---

libibumad: don't assume that all headers are in the same directory.

Signed-off-by: Michael S.
Tsirkin

Index: management/libibumad/include/infiniband/umad.h
===================================================================
--- management/libibumad/include/infiniband/umad.h (revision 2963)
+++ management/libibumad/include/infiniband/umad.h (working copy)
@@ -35,7 +35,7 @@
 #define _UMAD_H
 #include
-#include "common.h"
+#include <infiniband/common.h>
 #ifdef __cplusplus
 # define BEGIN_C_DECLS extern "C" {

--
MST

From eitan at mellanox.co.il Mon Aug 8 06:16:28 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 8 Aug 2005 16:16:28 +0300
Subject: [openib-general] OpenSM: No support for debug build
Message-ID: <506C3D7B14CDD411A52C00025558DED607C30606@mtlex01.yok.mtl.com>

Hi Hal,

More issues with the build environment:

OpenSM used to support two modes of compilation: debug and release
(called "free").

The main difference is the tradeoff between performance and debuggability.

debug uses:
-D_DEBUG_ = control some extra operations and data stored in complib
useful for debugging. One example is CL_ASSERT, which is disabled if
_DEBUG_ is not set
-g = include debug information in executables and libs
-O0 = prevent optimization

release mode uses:
-O6 = max optimization for speed up

I propose to add --enable-debug to the configure.in and use it to
control the compilation flags.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From halr at voltaire.com Mon Aug 8 06:16:31 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Aug 2005 09:16:31 -0400
Subject: [openib-general] RE: [PATCH] osm: Remove mad_addr file which is a result of a grep
In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C30605@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED607C30605@mtlex01.yok.mtl.com>
Message-ID: <1123506990.4422.3466.camel@hal.voltaire.com>

On Mon, 2005-08-08 at 09:07, Eitan Zahavi wrote:
> Hi Hal,
>
> I followed the Wiki directions.
> I updated the Wiki to reflect that requirement...

Where in the wiki ? Was this in the cheat sheet or somewhere else ?

-- Hal

> > In the future, please include your email address as part of the signed
> > off lines.
> >
> > -- Hal
>

From mst at mellanox.co.il Mon Aug 8 06:24:37 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 8 Aug 2005 16:24:37 +0300
Subject: [openib-general] Re: OpenSM: No support for debug build
In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C30606@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED607C30606@mtlex01.yok.mtl.com>
Message-ID: <20050808132437.GF15300@mellanox.co.il>

Quoting r. Eitan Zahavi :
> -g = include debug information in executables and libs

IMO, there's no actual reason not to include this flag in release build.
You can always strip at the install stage if you want to.

> -O0 = prevent optimization
>
> release mode uses:
>
> -O6 = max optimization for speed up

That would explain why it takes forever to build.
-O6 looks like overkill. AFAIK, going above -O2 doesn't necessarily
give you a speedup.

libibverbs uses -O2, and that's used to run MPI.
--
MST

From halr at voltaire.com Mon Aug 8 06:25:47 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Aug 2005 09:25:47 -0400
Subject: [openib-general] osm: management headers installed into /usr/local/include
In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C30604@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED607C30604@mtlex01.yok.mtl.com>
Message-ID: <1123507546.4422.3513.camel@hal.voltaire.com>

On Mon, 2005-08-08 at 07:52, Eitan Zahavi wrote:
> Hi Hal,
>
> I looked further into the userspace/management autoconf/automake
> system and found out that the tweak of the prefix directory (from
> /usr/local to /usr/local/ib) was done as a hack, rather than the "auto
> tools" way:
>
> Autoconf provides the macro AC_PREFIX_DEFAULT to change the default
> prefix.
>
> Using this macro, the bindir, libdir, includedir, datadir, etc. will
> be set correctly.
>
> To use it in our case every configure.in should have this directive:
>
> AC_PREFIX_DEFAULT([/usr/local/ib])
>
> The current method of overriding the bindir/libdir in the Makefile.am
> works but breaks other standard autoconf features like the --prefix
> option for directing the installation to a non-default directory.
>
> Actually some other prefix-dependent variables should be re-assigned
> and are missing.
>
> I think that we should fix this and do it the "Auto tools" way.
>
> Please approve and I will provide a patch for using the macro.

Yes, that is a better way and I would accept such a patch.

> Actually the entire management tree should have been built as a
> standard "auto tools" project utilizing AC_CONFIG_SUBDIRS. This is
> a somewhat larger change so I do not think it is worth it this time. It
> will require ALL the sub directories to be "autogen" then ALL of them
> configured then ALL make ...
>
> This normally requires a special compilation mode ("pre-install")
> where the libs and includes are taken from relative paths rather than
> from the final "install" path.
The advantage is that the entire
> project can be "configured" and if something is broken in the required
> C lib or stdlib you do not wait for the specific sub dir to fail but
> get the notice in advance. There might be other benefits like adhering
> to the standard build procedure.

IMO, the sooner the better if this is going to change.

-- Hal

> Please confirm that the direction for this patch (using
> AC_PREFIX_DEFAULT) is acceptable.
>
> EZ
>
> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
> -----Original Message-----
> From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
> Sent: Monday, August 08, 2005 12:05 PM
> To: openib-general at openib.org
> Subject: [openib-general] osm: management headers installed into
> /usr/local/include
>
> Hi Hal,
>
> According to the README from the management directory the executables
> are installed by default to:
>
> /usr/local/ib/bin and the libs into /usr/local/ib/lib.
>
> The header files are not described but are installed into
> /usr/local/include/infiniband
>
> Was this done on purpose?
>
> I would expect the headers to follow the same prefix as the
> executables and libs.
>
> I can work on a patch to move them to /usr/local/ib/include
>
> EZ
>

From eitan at mellanox.co.il Mon Aug 8 06:34:02 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 8 Aug 2005 16:34:02 +0300
Subject: [openib-general] RE: [PATCH] osm: Remove mad_addr file which is a result of a grep
Message-ID: <506C3D7B14CDD411A52C00025558DED607C30607@mtlex01.yok.mtl.com>

> -- Hal
> Where in the wiki ? Was this in the cheat sheet or somewhere else ?

https://openib.org/tiki/tiki-index.php?page=OpenIBFAQ

I just added the "your-email" field in:
"All submitted patches must contain the appropriate "Signed-off-by" line
(Signed-off-by: First and last name your-email)."

>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From eitan at mellanox.co.il Mon Aug 8 06:36:18 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 8 Aug 2005 16:36:18 +0300
Subject: [openib-general] RE: OpenSM: No support for debug build
Message-ID: <506C3D7B14CDD411A52C00025558DED607C30608@mtlex01.yok.mtl.com>

I think I did see some benefits from using -O6 but if you think it is
not required we can try -O2 and see.
Anyway, does using -O0 for debug mode make sense to you?

>
> libibverbs uses -O2, and thats used to run MPI.
>
> --
> MST
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mst at mellanox.co.il Mon Aug 8 06:42:33 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 8 Aug 2005 16:42:33 +0300
Subject: [openib-general] Re: OpenSM: No support for debug build
In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C30608@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED607C30608@mtlex01.yok.mtl.com>
Message-ID: <20050808134233.GG15300@mellanox.co.il>

Quoting r. Eitan Zahavi :
> >
> > libibverbs uses -O2, and thats used to run MPI.
>
> I think I did see some benefits from using -O6 but if you think it is
> not required we can try -O2 and see.
> Anyway, does using -O0 for debug mode make sense to you?
>

You can always pass these things to configure in the CFLAGS variable.
I don't know whether this works for opensm - but that's how I get
debug builds of ibverbs.

--
MST

From eitan at mellanox.co.il Mon Aug 8 06:43:25 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 8 Aug 2005 16:43:25 +0300
Subject: [openib-general] osm: management headers installed into /usr/local/include
Message-ID: <506C3D7B14CDD411A52C00025558DED607C30609@mtlex01.yok.mtl.com>

Hal wrote:
> >
> > I think that we should fix this and do it the "Auto tools" way.
> >
> > Please approve and I will provide a patch for using the macro.
>
> Yes, that is a better way and I would accept such a patch.
[EZ] But now that we know the include directory does not reside under
the prefix but is forced under /usr/local, it does not make much sense
to do it.
Why do we keep the bin, lib and include directories under different
directory hierarchies?
It contradicts the way most other user-level things look in Linux.
Normally people place them all under the prefix dir and let the user
play with the prefix.

> > >
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From eitan at mellanox.co.il Mon Aug 8 06:45:10 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 8 Aug 2005 16:45:10 +0300
Subject: [openib-general] RE: OpenSM: No support for debug build
Message-ID: <506C3D7B14CDD411A52C00025558DED607C3060A@mtlex01.yok.mtl.com>

As pointed out earlier, the change involves more than just -O or -g, so
it makes sense to use a flag.
Also, --enable-debug is a common solution for how to do that...

> You can always pass these things to configure in the CFLAGS variable.
> I dont know whether this works for opensm - but thats how I get
> debug builds of ibverbs.
>
> --
> MST
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rolandd at cisco.com Mon Aug 8 06:51:39 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 08 Aug 2005 06:51:39 -0700
Subject: [openib-general] iWARP Branch Proposal
References: <8E9D028761D8264D910612167E8457E8FA3695@mail2.ammasso.com>
Message-ID: <52ek9435as.fsf@cisco.com>

Tom, thanks for posting your device driver so quickly. I realize that
this is work-in-progress, and with that in mind I skimmed through the
code and came up with some suggestions for your to-do list:

 - Kconfig: replace 'mthca' with 'ams1100' in help text :)

 - Replace all // comments with /* */.

 - Why are members of cc_pci_regs_t and cc_adapter_pci_regs_t
   volatile? Volatile declarations are almost inevitably buggy.
   It's better to use ordered accessors (readl(), writel(), etc) or
   insert explicit memory barriers.
 - Get rid of most typedefs -- for example
     typedef enum { ... } foo_t;
   should become
     enum foo { ... };

 - Can cc_byteorder.h be eliminated? Most of the wrappers are
   definitely superfluous. Can the WR byte order ever change? ie
   are the cpu_to_wrXX() functions actually a useful abstraction?

 - Most of cc_common.h can be removed -- eg CC_SLEEP, CC_PRINT are
   useless

 - In ccil_api.h, rather than "#ifdef CC_LITTLE_ENDIAN" just use
   __constant_cpu_to_b32 or whatever. Most of the "PACKED"
   attributes are unnecessary (since the structures are already
   aligned). They'll lead to horrible code with gcc on ia64 etc.

 - In general, all the #ifdef X86_64 in the code looks like
   portability bugs. The best solution is just to make the code
   32/64 clean, but at least we need to replace the test with
   #ifdef CONFIG_COMPAT

 - ccilnet_dbg.c is an awful lot of code just for debugging.

 - Get rid of compatibility with Linux 2.4 -- everywhere there's a
   test of LINUX_VERSION_CODE, just take the 2.6 code.

 - cc_mq_common.c: BUMP is pretty inefficient, does a divide every
   time

 - cc_qp_common.c: cc_memcpy8 corrupts FPU state, is it really
   needed? it's never called. Why is it declared in
   cc_mq_common.c? memcpy4 similarly corrupts state. If it's fixed
   to save CR0 and do clts, is it really faster than a normal memcpy
   (considering it also disables IRQs)? This is all utterly
   non-portable anyway -- there needs to be a standard fallback for
   PPC64, IA64 etc.

 - Why is cc_queue.h needed? What is missing?

 - cc_types.h: get rid of NULL, TRUE, FALSE defines, cc_bool_t, etc.
   PTR_TO_CTX, CC_PTR_TO_64, etc seem busted on 64-bit archs.
   forget the macros, just cast to/from (unsigned long).

 - devccil_adapter.c: adapter_list seems to be a latent bug -- why
   does the driver need to keep a global list of adapters?

 - devccil.c: What is 'Static' -- why not 'static'?? Probably all
   this code gets replaced by existing uverbs anyway.

 - devccil_lock.h: Get rid of lock wrappers, they're just obfuscation.
   And it seems CCTHREADSAFE won't ever be defined?? The driver
   should always be threadsafe.

 - devccil_mem.c: ccil_big_malloc is horrible -- need a way to use
   non-contig memory.

 - devccil_srq.c: CCERR_NOT_IMPLEMENTED stubs not needed -- the
   existing midlayer will return -ENOSYS for unimplemented functions
   (ie NULL pointers in ib_device method table).

 - devccil_var.h: Defining your own version of PAGE_SHIFT etc. is
   just obfuscation; probably has all rounding functions you need.
   Custom ASSERT() can probably just be replaced with standard
   BUG_ON().

From liran at mellanox.co.il Mon Aug 8 06:58:09 2005
From: liran at mellanox.co.il (Liran Sorani)
Date: Mon, 8 Aug 2005 16:58:09 +0300
Subject: [openib-general] RE: [PATCH ] osmtest general cleanups
Message-ID: <506C3D7B14CDD411A52C00025558DED60865E882@mtlex01.yok.mtl.com>

Hi,
Regarding the inform_info test flow, the reason I (possibly) need
ib_verbs is to subscribe to a notice report through a QP other than QP1
(permitted by the IB Spec, see p754, 13.5.1), generate a trap, then
validate the received report through that QP.
To enable Set/Receive of notice MADs through a QP other than QP1, I
think I need ib_verbs.

Regarding the rest, it will be fixed in a separate patch.

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Monday, August 08, 2005 3:32 PM
To: Liran Sorani
Cc: openib-general at openib.org; Amit Krig
Subject: RE: [PATCH ] osmtest general cleanups

Hi Liran,

On Sun, 2005-08-07 at 11:28, Liran Sorani wrote:
> Hi , Hal.
> I've a few minor fixes (several white space glitches) , pls take the
> attached file , instead of the previous , Thanks.

Applied with some minor modifications and caveats below.

In the future can your patches be submitted as text rather than
attachments ? That is the norm for doing this.

Some comments below.

> thanks Liran.
> > -----Original Message----- > From: Liran Sorani > Sent: Sunday, August 07, 2005 5:57 PM > To: 'halr at voltaire.com' > Cc: 'openib-general at openib.org'; Amit Krig > Subject: [PATCH ] osmtest general cleanups > > > Hi , Hal. > The attached patch should be applied to osmtest repository. > It contain several cleanups (on most of the files) : It is easier if there is a patch per idea rather than an amalgam. > -Removal of inform info flow . > -Unique error messages There are some non real error message numbers (neither hex nor decimal) in osmt_multicast.c. Also, osmtest.c and osmt_service.c still have duplicates. > -Makefile.am update for compilation (required osm_helper object) osm_helper is part of the libopensm since r2973 so this part of the patch was not applied. Please update your osm directory. > -Remove vendor dependencies. Remove some OSM_VENDOR_INTF_MTL vendor dependencies (There are still some in osmt_slvl_vl_arb.c). > The inform info flow should be carefully ported since it requires > direct access to ib_umad (possibly ib_verbs too) . umad would be a temporary measure. Why would ib_uverbs direct access be needed ? > I'll send another patch for it next week . Thanks. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Mon Aug 8 06:58:03 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 09:58:03 -0400 Subject: [openib-general] Re: [PATCH] libibmad: use infiniband/ for includes In-Reply-To: <20050808131308.GC15300@mellanox.co.il> References: <20050808131308.GC15300@mellanox.co.il> Message-ID: <1123507900.4422.3539.camel@hal.voltaire.com> On Mon, 2005-08-08 at 09:13, Michael S. Tsirkin wrote: > libibmad: dont assume that all headers are in the same directory. Thanks. Applied. 
-- Hal From halr at voltaire.com Mon Aug 8 07:01:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 10:01:17 -0400 Subject: [openib-general] Re: [PATCH] libibumad: use infiniband/ for includes In-Reply-To: <20050808131439.GD15300@mellanox.co.il> References: <20050808131308.GC15300@mellanox.co.il> <20050808131439.GD15300@mellanox.co.il> Message-ID: <1123508315.4422.3576.camel@hal.voltaire.com> On Mon, 2005-08-08 at 09:14, Michael S. Tsirkin wrote: > libibumad: dont assume that all headers are in the same directory. > > Signed-off-by: Michael S. Tsirkin Thanks. Applied. -- Hal From halr at voltaire.com Mon Aug 8 07:04:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 10:04:31 -0400 Subject: [openib-general] Re: OpenSM: No support for debug build In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C30606@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607C30606@mtlex01.yok.mtl.com> Message-ID: <1123508371.4422.3584.camel@hal.voltaire.com> On Mon, 2005-08-08 at 09:16, Eitan Zahavi wrote: > I propose to add -enable-debug to the configure.in and use it to > control the compilation flags. I would accept such a patch. -- Hal From halr at voltaire.com Mon Aug 8 07:07:44 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 10:07:44 -0400 Subject: [openib-general] Installation Cheat Sheet Message-ID: <1123508692.4422.3611.camel@hal.voltaire.com> Hi Michael, On the installation cheat sheet, https://openib.org/tiki/tiki-index.php?page=Installation+Cheat+Sheet shouldn't userspace CM be added ? If so, should it go under Building userspace libraries or a separate section ? Thanks. 
-- Hal

From jlentini at netapp.com Mon Aug 8 07:22:35 2005
From: jlentini at netapp.com (James Lentini)
Date: Mon, 8 Aug 2005 10:22:35 -0400 (EDT)
Subject: [openib-general] Installation Cheat Sheet
In-Reply-To: <1123508692.4422.3611.camel@hal.voltaire.com>
References: <1123508692.4422.3611.camel@hal.voltaire.com>
Message-ID:

I added a userspace CM section last week. Someone on the list (MST?) suggested that I place it in a separate section called Building the CM.

james

On Mon, 8 Aug 2005, Hal Rosenstock wrote:

> Hi Michael,
>
> On the installation cheat sheet,
> https://openib.org/tiki/tiki-index.php?page=Installation+Cheat+Sheet
>
> shouldn't userspace CM be added ? If so, should it go under Building
> userspace libraries or a separate section ?
>
> Thanks.
>
> -- Hal
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From mst at mellanox.co.il Mon Aug 8 07:26:41 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 8 Aug 2005 17:26:41 +0300
Subject: [openib-general] [PATCH] libibmad: configure option to skip library test
In-Reply-To: <20050808131439.GD15300@mellanox.co.il>
References: <20050808131308.GC15300@mellanox.co.il> <20050808131439.GD15300@mellanox.co.il>
Message-ID: <20050808142641.GH15300@mellanox.co.il>

Hal, I'm trying to split the build process to configure/make/install steps. One of the problems is that I am forced to make one library before configuring another one that depends on it.

As a solution, I'd like to add a configure option to skip the library test.

I then can
configure --disable-libcheck
in all directories, and then
make

---

Add option to skip

Signed-off-by: Michael S.
Tsirkin Index: libibmad/configure.in =================================================================== --- libibmad/configure.in (revision 2963) +++ libibmad/configure.in (working copy) @@ -6,27 +6,42 @@ AC_CONFIG_SRCDIR([src/sa.c]) AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(libibmad, 0.9.0) + + +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presense of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + + AM_PROG_LIBTOOL dnl Checks for programs AC_PROG_CC dnl Checks for libraries +if test "$disable_libcheck" != "yes" +then LDFLAGS="$LDFLAGS -L/usr/local/ib/lib" AC_CHECK_LIB(ibcommon, sys_read_string, [], AC_MSG_ERROR([sys_read_string() not found. libibmad requires libibcommon.])) AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. libibmad requires libibumad.])) +fi dnl Checks for header files. AC_HEADER_STDC AC_CHECK_HEADERS([netinet/in.h stdlib.h string.h sys/time.h unistd.h]) +if test "$disable_libcheck" != "yes" +then AC_CHECK_HEADER(infiniband/common.h, [], AC_MSG_ERROR([ not found. libibmad requires libibcommon.]) ) AC_CHECK_HEADER(infiniband/umad.h, [], AC_MSG_ERROR([ not found. libibmad requires libibumad.]) ) +fi dnl Checks for library functions AC_CHECK_FUNCS([memset strrchr strtol]) -- MST From halr at voltaire.com Mon Aug 8 07:30:09 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 10:30:09 -0400 Subject: [openib-general] Installation Cheat Sheet In-Reply-To: References: <1123508692.4422.3611.camel@hal.voltaire.com> Message-ID: <1123511409.4422.3797.camel@hal.voltaire.com> On Mon, 2005-08-08 at 10:22, James Lentini wrote: > I added a section userspace CM section last week. Someone on the list > (MST?) suggested that I place it in a separate section called Building > the CM. Thanks. I see what you are talking about. It needs some minor tweaking. 
-- Hal From halr at voltaire.com Mon Aug 8 07:38:15 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 10:38:15 -0400 Subject: [openib-general] osm: management headers installed into /usr/ local/include In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C30609@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607C30609@mtlex01.yok.mtl.com> Message-ID: <1123511845.4422.3839.camel@hal.voltaire.com> On Mon, 2005-08-08 at 09:43, Eitan Zahavi wrote: > Hal wrote: > > > > > > I think that we should fix this and do it the "Auto tools" way. > > > > > > Please approve and I will provide a patch for using the macro. > > > > Yes, that is a better way and I would accept such a patch. > [EZ] But now when we know the include is not residing under prefix but > forced under /usr/local it does not make much sense to do it. > > Why do we keep the bin lib and include under different directory > hierarchy? > It contradicts with the way most other user level things looks like in > Linux. > Normally people place them all under prefix dir and let the user play > with the prefix. This was the convention used back when this started (I'm not sure how it evolved) but maybe it doesn't make sense anymore and all should just be under /usr/local/[include lib bin]. The only reason in terms of the headers was that they were not commonly used and so numerous so they perhaps should be separated out. -- Hal From jlentini at netapp.com Mon Aug 8 08:24:19 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 8 Aug 2005 11:24:19 -0400 (EDT) Subject: [openib-general][kdapl]: vmalloc instead of kmalloc In-Reply-To: References: Message-ID: On Sun, 7 Aug 2005, Guy German wrote: > Hi, > > James Lentini wrote: >> On Thu, 4 Aug 2005, Guy German wrote: >> >>> James, >>> >>> I see what you mean. >>> The allocation of the event vector is derived from evd->qlen. >>> In DTO ev'd, however, qlen is also the parameter passed to >>> ib_create_cq. 
Since we don't want to limit DAPL consumers to an >>> unnecessary small completion queue size, maybe we >>> could differentiate between DTO supporting evd's and >>> CONN evd's, when allocating the events vector. >>> >>> if evd supports CONN only, leave it : >>> event = kmalloc(evd->qlen * sizeof *event) >>> (Relying on the consumer he knows what he is doing) >>> if evd is DTO only : >>> don't allocate an event buffer, at all >>> if evd supports both : >>> event = kmalloc(DEFAULT_4_CONN * sizeof *event) >> >> And dynamically add additional events up to qlen as needed? >>> >>> if DEFAULT_4_CONN=256, that's a 3 pages allocation. >>> >>> How does that sound to you ? >> >> I'd prefer that the EVDs were uniform. I would worry about bugs >> otherwise. > > But if it's the right thing to do (allocating less, or not > allocating at all), wouldn't it be best to try and fix those bugs ? I think that the dynamic allocation strategy could be used for all types of EVDs. EVDs that only handled DTO events (events generated by an IB CQ) would have an initial allocation of 0 DAT events. These EVDs would never grow the event pool. EVDs that handled "software" events might grow this event pool. For these EVDs we could allocate some number of initial events. The advantage of making the EVDs uniform is that the de-allocation code would be the same. >> The eventual solution has to support qlen generated events (connection >> request, connection, disconnect, software) if those event types are >> supported by the EVD (even if the EVD is being used for both generated >> events and DTOs). > > I think that there might be a mix up in the qlen meaning, for the consumer. > ia_quey returns cq_len parameter, even if the evd in mind is used > for connection only. In that case the result of qlen (128k), from the hca > has nothing to do with the real resources availability. I assume you are referring to the dat_ia_attr's max_evd_qlen value. 
You are right that it is not entirely accurate to report the CQ max qlen as this value. As you point out, there are really two limits: the maximum number of DTO and RMR events and the maximum number of "software" events. The former are limited by the capabilities of the underlying CQ while the latter are constrained by the system's resources.

> You mentioned before that "allocating an event pool equal to
> the queue length seems like overkill", and I agree.
>
> If you want to support pending events list of more than ~2700, just
> for CONN events, and not use vmalloc, we need some way of doing a
> few kmallocs and managing that. Is it really necessary ? wouldn't
> 256 *pending* events for CONN. purposes be enough ?

I don't know what the consumer requirements are for this value. Remember that there are more "software" events than just pending connection events. All of these event types correspond to what I have been terming "software" events:

DAT_CONNECTION_REQUEST_EVENT = 0x02001,
DAT_CONNECTION_EVENT_ESTABLISHED = 0x04001,
DAT_CONNECTION_EVENT_PEER_REJECTED = 0x04002,
DAT_CONNECTION_EVENT_NON_PEER_REJECTED = 0x04003,
DAT_CONNECTION_EVENT_ACCEPT_COMPLETION_ERROR = 0x04004,
DAT_CONNECTION_EVENT_DISCONNECTED = 0x04005,
DAT_CONNECTION_EVENT_BROKEN = 0x04006,
DAT_CONNECTION_EVENT_TIMED_OUT = 0x04007,
DAT_CONNECTION_EVENT_UNREACHABLE = 0x04008,
DAT_ASYNC_ERROR_EVD_OVERFLOW = 0x08001,
DAT_ASYNC_ERROR_IA_CATASTROPHIC = 0x08002,
DAT_ASYNC_ERROR_EP_BROKEN = 0x08003,
DAT_ASYNC_ERROR_TIMED_OUT = 0x08004,
DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR = 0x08005,
DAT_SOFTWARE_EVENT = 0x10001

If we place a limit (be it 256 or some other value) on the number of events of the types above, there will be a consumer that will want more. My goal is that the queue length of an EVD supporting all event types (DTO and software) only be limited by the capabilities of the underlying CQ.
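
[Editorial sketch] The dynamic allocation strategy discussed in this thread (start with few or no software events, grow in small pieces up to qlen, tear every EVD down the same way) might look roughly like the following. This is a user-space illustration under stated assumptions, not kDAPL code: every name in it (sketch_evd, sketch_event, CHUNK_EVENTS, evd_grow and friends) is hypothetical, malloc() stands in for a small kmalloc(), and the locking a kernel version would need is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_EVENTS 64 /* hypothetical growth step, in events */

struct sketch_event {
	int number;                     /* placeholder payload, not a DAT event */
	struct sketch_event *next_free; /* link on the EVD free list */
};

struct sketch_chunk {
	struct sketch_chunk *next;
	struct sketch_event events[CHUNK_EVENTS];
};

struct sketch_evd {
	size_t qlen;      /* upper bound on software events */
	size_t allocated; /* events allocated so far */
	struct sketch_chunk *chunks;
	struct sketch_event *free_list;
};

/* Add one chunk, never exceeding qlen. Returns 0 on success, -1 otherwise. */
static int evd_grow(struct sketch_evd *evd)
{
	size_t room = evd->qlen - evd->allocated;
	size_t n = room < CHUNK_EVENTS ? room : CHUNK_EVENTS;
	struct sketch_chunk *c;
	size_t i;

	if (n == 0)
		return -1; /* already at qlen */
	c = malloc(sizeof *c); /* stands in for a small kmalloc() */
	if (!c)
		return -1;
	c->next = evd->chunks;
	evd->chunks = c;
	for (i = 0; i < n; i++) {
		c->events[i].next_free = evd->free_list;
		evd->free_list = &c->events[i];
	}
	evd->allocated += n;
	return 0;
}

/* Same teardown for every EVD, whether or not it ever grew. */
static void evd_destroy(struct sketch_evd *evd)
{
	while (evd->chunks) {
		struct sketch_chunk *c = evd->chunks;

		evd->chunks = c->next;
		free(c);
	}
	evd->free_list = NULL;
	evd->allocated = 0;
}

/* initial == 0 models a DTO-only EVD: nothing is allocated up front. */
static int evd_init(struct sketch_evd *evd, size_t qlen, size_t initial)
{
	evd->qlen = qlen;
	evd->allocated = 0;
	evd->chunks = NULL;
	evd->free_list = NULL;
	while (evd->allocated < initial && evd->allocated < qlen) {
		if (evd_grow(evd)) {
			evd_destroy(evd);
			return -1;
		}
	}
	return 0;
}

/* Take a free event, growing the pool on demand. NULL means exhausted. */
static struct sketch_event *evd_get_event(struct sketch_evd *evd)
{
	struct sketch_event *ev;

	if (!evd->free_list && evd_grow(evd))
		return NULL;
	ev = evd->free_list;
	evd->free_list = ev->next_free;
	return ev;
}

/* Return an event to the free list. */
static void evd_put_event(struct sketch_evd *evd, struct sketch_event *ev)
{
	assert(ev != NULL);
	ev->next_free = evd->free_list;
	evd->free_list = ev;
}
```

With this shape no single allocation is larger than one chunk, so the qlen-sized kmalloc (and the vmalloc fallback) is avoided, and an EVD that only ever sees DTO events never allocates a software event at all. Whether 64 events per chunk is a sensible step is a guess; a real implementation would also have to take the EVD lock around the free list.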
From halr at voltaire.com Mon Aug 8 08:29:07 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 11:29:07 -0400 Subject: [openib-general] Re: [PATCH] libibmad: configure option to skip library test In-Reply-To: <20050808142641.GH15300@mellanox.co.il> References: <20050808131308.GC15300@mellanox.co.il> <20050808131439.GD15300@mellanox.co.il> <20050808142641.GH15300@mellanox.co.il> Message-ID: <1123514946.4422.4097.camel@hal.voltaire.com> On Mon, 2005-08-08 at 10:26, Michael S. Tsirkin wrote: > Hal, I'm trying to split the build process to configure/make/install > steps. > One of the problems is, that I am forced to do make one > library before configuring another one that depends on it. > > As a solution, I'd like to add a configure option to skip > the library test. > > I then can > configure --disable-libcheck > in all directories, and then > make Seems reasonable. Will you be doing the same to umad and common too ? > Add option to skip > > Signed-off-by: Michael S. Tsirkin Thanks. Applied. -- Hal From rolandd at cisco.com Mon Aug 8 08:36:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 08 Aug 2005 08:36:39 -0700 Subject: [openib-general] (no subject) References: Message-ID: <523bpk30fs.fsf@cisco.com> amith> Hi, We have been testing gen2 installation using the amith> mem-free IBA cards from Mellanox. We were able to amith> successfully run InfiniBand tests over this amith> installation. When we replaced the IBA mem-free with the amith> mem cards (no software changes), the ports are in the amith> disabled state and ibstatus shows the following: As Hal mentioned, it would be useful if you can build drivers with CONFIG_INFINIBAND_MTHCA_DEBUG=y and send the kernel output you get when you load ib_mthca. Also, do these HCAs work with any other drivers (eg IBGD)? 
Thanks,
  Roland

From pfister at us.ibm.com Mon Aug 8 08:40:41 2005
From: pfister at us.ibm.com (Greg Pfister)
Date: Mon, 8 Aug 2005 11:40:41 -0400
Subject: [openib-general] Re: [mgtwg] [OpenSM] multicast group - mtu_selector question
In-Reply-To: <506C3D7B14CDD411A52C00025558DED60865E7B6@mtlex01.yok.mtl.com>
Message-ID:

Hi, Yael.

My understanding is that your understanding is almost correct, with two adjustments -- one technical, one formal, from the point of view of the IBA spec.

MtuSelector does have meaning in queries, not just in the scope of creation. E.g., if you query using 1 and 1024 (less than 1024), and the only group existing that otherwise matches has an MTU of 2048, you will not find that group.

Formal: The statement "mcgroup does not carry the mtu_selector value" is meaningless from the spec's position. It is silent (purposely) on any implementation, hence is silent about what may be in any data structure created to represent a multicast group. Naturally, an OpenIB implementation can't take that position.

Personal opinion, not spec: The query exception above doesn't change the fact that it looks less than useful to maintain the selector in the mcgroup data structure.

Any comments by others?

(I'm not on the OpenIB reflector, so this may not be reflected there. If someone who is wants to forward this there, please feel free.)

Greg Pfister
IBM Distinguished Engineer, Member IBM Academy of Technology
IBM Systems & Technology Group, Server Technology & Architecture
(512) 838-8338 | IBM tieline 678-8338 | FAX (512) 838-3418

Sic Crustulum Frangitur

Yael Kalka wrote on 08/08/2005 02:39:36 AM:

> Hello All,
>
> I have a question regarding the exact meaning of the mtu selector in
> the MCMemberRecord:
> Does the multicast group carry a range of mtus according to the
> creation request mtu_selector and
> mtu, or does it carry an exact mtu selected according to the request
> mtu_selector and mtu?
>
> When a multicast group is created, if the value of the mtu selector
> in the request MCMemberRecord
> is not 2 (meaning: greater than or lower than), does this value of
> the mtu_selector have a
> meaning outside the scope of the creation request?
> For example, if a request to create was received with mtu selector =
> 0, and the mtu = 256.
> and the multicast group was created with the specific mtu value 2048
> (selected by the SA according to p913 l14-16):
> 1. Does a join request with mtu=1024, and selector=2 succeed?
> 2. Does a query request with mtu=256, and selector=0 (both compmask
> are set) return the above mcgroup?
>
> My understanding is:
> mcgroup does not carry the mtu_selector value, meaning it will
> always carry an exact mtu value.
> Thus the answers to the previous questions are:
> 1. No. Since the mcgroup was created with mtu=2048.
> 2. No. There is no meaning for this query (the mtu must be provided
> along with mtu selector).
>
> Thank you in advance,
> Yael
>

From mst at mellanox.co.il Mon Aug 8 08:11:42 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 8 Aug 2005 18:11:42 +0300
Subject: [openib-general] [PATCH] ipoib: device removal races
Message-ID: <20050808151141.GJ15300@mellanox.co.il>

Currently we may have work scheduled in the default kernel workqueue when the device is going down. The device could get freed before this workqueue gets serviced. I am actually seeing this causing system hangs.

The following patch fixes this by using ipoib_workqueue, which gets flushed when the device is going down.

As a side note, schedule_work in ipoib_event also looks suspicious. Can't we have it outstanding when the device is going down?
Roland, what do you say we switch that to ipoib_workqueue as well, and add a flush after ib_unregister_event_handler?

---

Convert schedule_work to queue_work: solves system hang.

Signed-off-by: Michael S. Tsirkin

Index: ipoib/ipoib_main.c
===================================================================
--- ipoib/ipoib_main.c (revision 2937)
+++ ipoib/ipoib_main.c (working copy)
@@ -672,7 +672,7 @@ static void ipoib_set_mcast_list(struct
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	schedule_work(&priv->restart_task);
+	queue_work(ipoib_workqueue, &priv->restart_task);
 }
 
 static void ipoib_neigh_destructor(struct neighbour *n)

-- 
MST

From eitan at mellanox.co.il Mon Aug 8 09:00:55 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 8 Aug 2005 19:00:55 +0300
Subject: [openib-general] osm: management headers installed into /usr/ local/include
Message-ID: <506C3D7B14CDD411A52C00025558DED607C3060B@mtlex01.yok.mtl.com>

Hi Hal,

>
> This was the convention used back when this started (I'm not sure how it
> evolved) but maybe it doesn't make sense anymore and all should just be
> under /usr/local/[include lib bin]. The only reason in
> terms of the headers was that they were not commonly used and so
> numerous so they perhaps should be separated out.
>

If that could be done it will simplify user level coding by making it more "standard". If one references the opensm vendor in /lib, one can simply use /include .

What is the process of making it happen?

EZ

From danb at voltaire.com Mon Aug 8 10:09:24 2005
From: danb at voltaire.com (Dan Bar Dov)
Date: Mon, 8 Aug 2005 20:09:24 +0300
Subject: [openib-general] [iSER]use iSER on x86_64
Message-ID:

Ian, hi

Please send patches for the modifications you made for x86_64. In any case, we'll be upgrading the ISER implementation to be openIB/kDAPL based (DAT1.2, but with the openIB DAT headers). It will support x86_64.
To test you'll have to have InfiniBand hardware (HCA+switch) plus an ISER target.

Dan

> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Ian Jiang
> Sent: Sunday, August 07, 2005 3:36 AM
> To: openib-general at openib.org
> Subject: [openib-general] [iSER]use iSER on x86_64
>
> [I sent this mail the day before yesterday, but it did not
> appear on the list. So I am trying again.]
>
> I modified several files under iSER/datamover/ to use it on
> my x86_64 machine.
> I also edited the iSER/make.conf file where necessary.
> I can now compile the iSER without any errors, but I am
> not sure if it can work with the iSCSI initiator or the
> kDAPL, because I have no way to test it yet.
>
> The iSER is obtained from
> https://openib.org/svn/gen2/ulps/iser/
> The dat headers related could be either
> http://sourceforge.net/projects/dapl/dapl_beta2.06/dat/
> or
> http://www.datcollaborative.org/dat_headers_1.2.tgz/
>
> Any suggestion is appreciated!
> > > Ian Jiang > ianjiang91 at hotmail.com > ---- > Computer Architecture Laboratory > Institute of Computing Technology > Chinese Academy of Sciences > Beijing,P.R.China > Zip code: 100080 > Tel: +86-10-62564394(office) > > > ====================== > iser-datamover.patches > ====================== > > --- iser_conn.c.old 2005-08-04 01:40:23.000000000 +0800 > +++ iser_conn.c 2005-08-05 06:36:56.000000000 +0800 > @@ -602,8 +602,11 @@ > /* Find the connection */ > p_iser_conn = hash_find_iser_conn(iscsi_conn_h); > if (p_iser_conn == NULL) { > - IERROR("Connection not found, conn_h: %d\n", > - (unsigned) iscsi_conn_h); > + /* by IanJiang, ianjiang at ict.ac.cn */ > + // IERROR("Connection not found, conn_h: %d\n", > + IERROR("Connection not found, conn_h: %ld\n", > + // (unsigned) iscsi_conn_h); > + (unsigned long) iscsi_conn_h); > return ISER_INVALID_CONN; > } > > --- iser_dto.c.old 2005-08-04 05:59:17.000000000 +0800 > +++ iser_dto.c 2005-08-05 06:37:28.000000000 +0800 > @@ -356,7 +356,9 @@ > > /* Get the VA of the headers registered memory */ > p_recv_buf = > - (unsigned char *) (unsigned int) > p_dto->regd[0]->virt_buf.p_buf; > +/* by IanJiang, ianjiang at ict.ac.cn */ > +// (unsigned char *) (unsigned int) > p_dto->regd[0]->virt_buf.p_buf; > + (unsigned char *) (unsigned long) > p_dto->regd[0]->virt_buf.p_buf; > /* Account for the offset within it */ > p_recv_buf += p_dto->offset[0]; > /* Skip the iSER header to get the iSCSI PDU BHS */ > --- iser_global.c.old 2005-08-04 01:40:04.000000000 +0800 > +++ iser_global.c 2005-08-05 06:38:06.000000000 +0800 > @@ -571,7 +571,9 @@ > unsigned port) /*!< IN - listening port */ > { > struct iser_entity_t *p_entity; > - unsigned entity_index = (unsigned) api_h; > +/* by IanJiang, ianjiang at ict.ac.cn */ > + unsigned long entity_index = (unsigned long) api_h; > +// unsigned entity_index = (unsigned) api_h; > > if (entity_index >= iser_global.num_entities) { > return ISER_ILLEGAL_PARAM; > --- iser_initiator.c.old 2005-08-04 
01:39:47.000000000 +0800 > +++ iser_initiator.c 2005-08-05 06:45:48.000000000 +0800 > @@ -276,15 +276,21 @@ > p_iser_conn = hash_find_iser_conn(iscsi_conn_h); > > if (p_iser_conn == NULL) { > - IERROR("Failed to find connection, > iscsi_conn_h: %X\n", > - (unsigned) iscsi_conn_h); > +/* by IanJiang, ianjiang at ict.ac.cn */ > + //IERROR("Failed to find connection, > iscsi_conn_h: %X\n", > + IERROR("Failed to find connection, > iscsi_conn_h: %lX\n", > + //(unsigned) iscsi_conn_h); > + (unsigned long) iscsi_conn_h); > iser_pdu_print((char *) __func__, NULL, > p_bhs->buf, NULL); > iser_ret = ISER_INVALID_CONN; > goto send_control_error; > } > if (atomic_read(&p_iser_conn->state) != ISER_CONN_UP) { > - IERROR("Connection is not up, iscsi_conn_h: > %X, p_conn: > 0x%p\n", > - (unsigned) iscsi_conn_h, p_iser_conn); > +/* by IanJiang, ianjiang at ict.ac.cn */ > + //IERROR("Connection is not up, iscsi_conn_h: > %X, p_conn: > 0x%p\n", > + IERROR("Connection is not up, iscsi_conn_h: > %lX, p_conn: > 0x%p\n", > + //(unsigned ) iscsi_conn_h, p_iser_conn); > + (unsigned long) iscsi_conn_h, p_iser_conn); > iser_pdu_print((char *) __func__, NULL, > p_bhs->buf, NULL); > iser_ret = ISER_FAILURE; > goto send_control_error; > --- iser_kdapl.c.old 2005-08-04 02:12:25.000000000 +0800 > +++ iser_kdapl.c 2005-08-05 06:46:13.000000000 +0800 > @@ -49,6 +49,9 @@ > #include "iser_procfs.h" > #include "iser_trace.h" > > +/* by IanJiang, ianjiang at ict.ac.cn */ > +#define DAT_MEM_OPT_DONT_CARE DAT_MEM_OPTIMIZE_DONT_CARE > + > /* > --------------------------------------------------------------------- > * CONSTANTS & MACROS > * > ------------------------------------------------------------------ */ > --- iser_memory.c.old 2005-08-04 02:14:55.000000000 +0800 > +++ iser_memory.c 2005-08-05 06:47:51.000000000 +0800 > @@ -683,10 +683,14 @@ > list_entry(p_list, struct iser_buf_pool_region_t, > pool_list); > n += sprintf(p_str + n, > - "\t%3d: mem[0x%p sz:%7d t:%d] > lmr[h:0x%08x > va:0x%08x 
sz:%7d]\n", > +/* by IanJiang, ianjiang at ict.ac.cn */ > +// "\t%3d: mem[0x%p sz:%7d t:%d] > lmr[h:0x%08x > va:0x%08x sz:%7d]\n", > + "\t%3d: mem[0x%p sz:%7d t:%d] > lmr[h:0x%08lx > va:0x%08x sz:%7d]\n", > p_pool_region->id, > p_pool_region->buf.p_buf, > p_pool_region->buf.size, > p_pool_region->buf.type, > - (unsigned) > p_pool_region->mem_reg.lmr_handle, > +/* by IanJiang, ianjiang at ict.ac.cn */ > +// (unsigned) > p_pool_region->mem_reg.lmr_handle, > + (unsigned long) > p_pool_region->mem_reg.lmr_handle, > (unsigned) > p_pool_region->mem_reg.lmr_triplet. > virtual_address, > (unsigned) > p_pool_region->mem_reg.lmr_triplet. > @@ -958,13 +962,18 @@ > ITRACE_ENTRY(); > > if (p_iser_adaptor->regd_mem.virt_buf.size > 0) { > /* if any > pre-regd > buffer */ > - unsigned start_regd_buf = > - (unsigned) > p_iser_adaptor->regd_mem.virt_buf.p_buf; > - unsigned end_regd_buf = > +/* by IanJiang, ianjiang at ict.ac.cn */ > +// unsigned start_regd_buf = > + unsigned long start_regd_buf = > +// (unsigned) > p_iser_adaptor->regd_mem.virt_buf.p_buf; > + (unsigned long) > p_iser_adaptor->regd_mem.virt_buf.p_buf; > +// unsigned end_regd_buf = > + unsigned long end_regd_buf = > start_regd_buf + > p_iser_adaptor->regd_mem.virt_buf.size; > > ITRACE(ISER_TRACE_BUFFERS, > - "Looking up buf: 0x%08x - 0x%08x in > cache: 0x%08x - > 0x%08x\n", > +// "Looking up buf: 0x%08x - 0x%08x in > cache: 0x%08x - > 0x%08x\n", > + "Looking up buf: 0x%08x - 0x%08x in cache: > + 0x%08lx - > 0x%08lx\n", > buf_addr, buf_addr + buf_size, start_regd_buf, > end_regd_buf); > if (start_regd_buf <= buf_addr > --- iser_procfs.c.old 2005-08-04 06:01:50.000000000 +0800 > +++ iser_procfs.c 2005-08-05 06:48:33.000000000 +0800 > @@ -207,9 +207,13 @@ > p_iser_conn = (struct iser_conn_t *) data; > > page[0] = '\0'; > - n += sprintf(buf, "iSCSI handle: 0x%08X\nkDAPL EP handle: > 0x%08X\n", > - (unsigned) p_iser_conn->iscsi_conn_h, > - (unsigned) p_iser_conn->ep_handle); > +/* by IanJiang, ianjiang at ict.ac.cn */ > 
+// n += sprintf(buf, "iSCSI handle: 0x%08X\nkDAPL EP handle: > 0x%08X\n", > + n += sprintf(buf, "iSCSI handle: 0x%08lX\nkDAPL EP handle: > 0x%08lX\n", > +// (unsigned) p_iser_conn->iscsi_conn_h, > + (unsigned long) p_iser_conn->iscsi_conn_h, > +// (unsigned along) p_iser_conn->ep_handle); > + (unsigned long) p_iser_conn->ep_handle); > strcat(page, buf); > > n += sprintf(buf, "State: %s\n", > iser_conn_get_state_name(p_iser_conn)); > --- iser_utils.c.old 2005-08-04 02:42:16.000000000 +0800 > +++ iser_utils.c 2005-08-05 06:48:55.000000000 +0800 > @@ -32,6 +32,10 @@ > * $Id: iser_utils.c,v 1.38 2005/01/31 08:02:00 danb Exp $ */ > > +/* by IanJiang, ianjiang at ict.ac.cn > + * replace u32 by u64 in 5 places > + */ > + > #include > #include > #include > @@ -102,7 +106,7 @@ > > ITRACE_ENTRY(); > > - hash_val = hash_func(iser_task->itt ^ (u32) > iser_task->p_conn); > + hash_val = hash_func(iser_task->itt ^ (u64) > iser_task->p_conn); > > ITRACE(ISER_TRACE_HASHTABLES, > "p_task: 0x%p, p_conn: 0x%p, itt: %d, hash_val > = %d\n", @@ -126,7 +130,7 @@ */ struct iser_task_t * > hash_find_iser_task(struct iser_conn_t *iser_conn, /*!< > IN - part of > hash > key */ > - u32 itt) /*!< IN - part of hash key */ > + u64 itt) /*!< IN - part of hash key */ > { > int hash_val; > struct list_head *p_bucket; > @@ -135,7 +139,7 @@ > > ITRACE_ENTRY(); > > - hash_val = hash_func(itt ^ (u32) iser_conn); > + hash_val = hash_func(itt ^ (u64) iser_conn); > p_bucket = &(iser_global.task_hash.bucket_head[hash_val]); > > spin_lock(&iser_global.task_hash.lock); > @@ -151,8 +155,11 @@ > spin_unlock(&iser_global.task_hash.lock); > > ITRACE(ISER_TRACE_HASHTABLES, > - "p_conn: 0x%p, itt: %d, hash_val = %d, found p_task: > 0x%p\n", > - iser_conn, itt, hash_val, iser_task); > +/* by IanJiang, ianjiang at ict.ac.cn */ > +// "p_conn: 0x%p, itt: %d, hash_val = %d, found p_task: > 0x%p\n", > +// iser_conn, itt, hash_val, iser_task); > + "p_conn: 0x%p, itt: %ld, hash_val = %d, found p_task: > 0x%p\n", > + 
iser_conn, (unsigned long)itt, hash_val, iser_task); > > ITRACE_EXIT(); > return iser_task; > @@ -188,7 +195,7 @@ > > ITRACE_ENTRY(); > > - hash_val = hash_func((u32) iser_conn->iscsi_conn_h); > + hash_val = hash_func((u64) iser_conn->iscsi_conn_h); > > spin_lock(&iser_global.conn_hash.lock); > INIT_LIST_HEAD(&iser_conn->hash_list); > @@ -216,7 +223,7 @@ > > ITRACE_ENTRY(); > > - hash_val = hash_func((u32) iscsi_conn_h); > + hash_val = hash_func((u64) iscsi_conn_h); > p_bucket = &(iser_global.conn_hash.bucket_head[hash_val]); > > spin_lock(&iser_global.conn_hash.lock); > @@ -349,8 +356,11 @@ > virt_addr = p_iovec_virt[i].iov_base; > p_phys[i].addr = virt_to_phys(virt_addr); > ITRACE(ISER_TRACE_BUFFERS, > - "IOVEC[%d] virt: 0x%08X -> phys: > 0x%08X, sz: %d\n", > i, > - (unsigned) virt_addr, p_phys[i].addr, > +/* by IanJiang, ianjiang at ict.ac.cn */ > +// "IOVEC[%d] virt: 0x%08X -> phys: > 0x%08X, sz: %d\n", > i, > +// (unsigned) virt_addr, p_phys[i].addr, > + "IOVEC[%d] virt: 0x%08lX -> phys: 0x%08X, sz: > %ld\n", i, > + (unsigned long) virt_addr, p_phys[i].addr, > p_iovec_virt[i].iov_len); > p_phys[i].size = p_iovec_virt[i].iov_len; > total_sz += p_iovec_virt[i].iov_len; @@ -393,7 > +403,9 @@ > } > > for (i = 0; i < p_data->size; i++) { > - p_phys[i].addr = (uint32_t) p_iovec_phys[i].iov_base; > +/* by IanJiang, ianjiang at ict.ac.cn */ > +// p_phys[i].addr = (uint32_t) p_iovec_phys[i].iov_base; > + p_phys[i].addr = (uint64_t) p_iovec_phys[i].iov_base; > p_phys[i].size = p_iovec_phys[i].iov_len; > total_sz += p_iovec_phys[i].iov_len; > > @@ -440,10 +452,14 @@ > "starting scatterlist conversion - %d > elements\n", p_data->size); > for (i = 0; i < p_data->size; i++) { > p_phys[i].addr = page_to_phys(p_sg[i].page) + > p_sg[i].offset; > +/* by IanJiang, ianjiang at ict.ac.cn */ > ITRACE(ISER_TRACE_BUFFERS, > - "SCATTER[%d] page: 0x%08X + 0x%8X -> > phys: 0x%08X, > sz: %d\n", > - i, (unsigned) p_sg[i].page, p_sg[i].offset, > - p_phys[i].addr, p_sg[i].length); > +// 
"SCATTER[%d] page: 0x%08X + 0x%8X -> > phys: 0x%08X, > sz: %d\n", > + "SCATTER[%d] page: 0x%08lX + 0x%8X -> phys: > + 0x%08lX, > sz: %d\n", > +// i, (unsigned) p_sg[i].page, p_sg[i].offset, > + i, (unsigned long) p_sg[i].page, > p_sg[i].offset, > +// p_phys[i].addr, p_sg[i].length); > + (unsigned long) p_phys[i].addr, > p_sg[i].length); > p_phys[i].size = p_sg[i].length; > total_sz += p_sg[i].length; > } > @@ -501,7 +517,9 @@ > void * > iser_phys_to_virt(void *phys_addr) > { > - return phys_to_virt((unsigned) phys_addr); > +/* by IanJiang, ianjiang at ict.ac.cn */ > +// return phys_to_virt((unsigned) phys_addr); > + return phys_to_virt((unsigned long) phys_addr); > } > > /** > @@ -535,7 +553,9 @@ > { > struct iovec *p_iovec; > p_iovec = (struct iovec *) p_iovec_phys; > - return phys_to_virt((unsigned) p_iovec[i].iov_base); > +/* by IanJiang, ianjiang at ict.ac.cn */ > +// return phys_to_virt((unsigned) p_iovec[i].iov_base); > + return phys_to_virt((unsigned long) p_iovec[i].iov_base); > } > > /** > --- iser_utils.h.old 2005-07-01 17:41:00.000000000 +0800 > +++ iser_utils.h 2005-08-05 06:49:32.000000000 +0800 > @@ -48,7 +48,9 @@ > * ISER TASK-SPECIFIC HASH MANAGEMENT > * > ------------------------------------------------------------------ */ > > -struct iser_task_t *hash_find_iser_task(struct iser_conn_t > *iser_conn, u32 itt); > +/* by IanJiang, ianjiang at ict.ac.cn */ > +//struct iser_task_t *hash_find_iser_task(struct iser_conn_t > +*iser_conn, > u32 itt); > +struct iser_task_t *hash_find_iser_task(struct iser_conn_t > *iser_conn, > +u64 > itt); > > void > hash_add_iser_task(struct iser_task_t *iser_task); > > _________________________________________________________________ > 与联机的朋友进行交流,请使用 MSN Messenger: http://messenger.msn.com/cn > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > 
http://openib.org/mailman/listinfo/openib-general > From rep.nop at aon.at Mon Aug 8 10:29:29 2005 From: rep.nop at aon.at (Bernhard Fischer) Date: Mon, 8 Aug 2005 19:29:29 +0200 Subject: [openib-general] Re: [PATCH] libibmad: configure option to skip library test In-Reply-To: <1123514946.4422.4097.camel@hal.voltaire.com> References: <20050808131308.GC15300@mellanox.co.il> <20050808131439.GD15300@mellanox.co.il> <20050808142641.GH15300@mellanox.co.il> <1123514946.4422.4097.camel@hal.voltaire.com> Message-ID: <20050808172929.GK15261@aon.at> On Mon, Aug 08, 2005 at 11:29:07AM -0400, Hal Rosenstock wrote: >On Mon, 2005-08-08 at 10:26, Michael S. Tsirkin wrote: >> Hal, I'm trying to split the build process to configure/make/install >> steps. >Seems reasonable. Will you be doing the same to umad and common too ? > >> Add option to skip >> >> Signed-off-by: Michael S. Tsirkin > >Thanks. Applied. Hal, please correct this: presense -> presence TIA, Bernhard From rolandd at cisco.com Mon Aug 8 10:40:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 08 Aug 2005 10:40:05 -0700 Subject: [openib-general] LinuxWorld San Francisco meetup Message-ID: <52wtmwz5sa.fsf@cisco.com> I'll be at the LinuxWorld Expo in San Francisco this Wednesday. Is anyone interested in meeting up for lunch (or some other time)? - R. From halr at voltaire.com Mon Aug 8 10:48:34 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 13:48:34 -0400 Subject: [openib-general] Re: [PATCH] libibmad: configure option to skip library test In-Reply-To: <20050808172929.GK15261@aon.at> References: <20050808131308.GC15300@mellanox.co.il> <20050808131439.GD15300@mellanox.co.il> <20050808142641.GH15300@mellanox.co.il> <1123514946.4422.4097.camel@hal.voltaire.com> <20050808172929.GK15261@aon.at> Message-ID: <1123523314.4422.4738.camel@hal.voltaire.com> On Mon, 2005-08-08 at 13:29, Bernhard Fischer wrote: > >Thanks. Applied. > Hal, please correct this: > presense -> presence Done. 
From tom at ammasso.com Mon Aug 8 07:14:08 2005 From: tom at ammasso.com (Tom Tucker) Date: Mon, 8 Aug 2005 10:14:08 -0400 Subject: [openib-general] iWARP Branch Proposal Message-ID: <8E9D028761D8264D910612167E8457E8FA36B6@mail2.ammasso.com> Roland: Awesome list! Thanks for the quick and thorough review. I'll add it to a TODO file in the branch and we'll keep track of our progress against these items. Answers to the questions follow... TomT > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Monday, August 08, 2005 8:52 AM > To: Tom Tucker > Cc: openib-general at openib.org > Subject: Re: [openib-general] iWARP Branch Proposal > > Tom, thanks for posting your device driver so quickly. I realize that > this is work-in-progress, and with that in mind I skimmed through the > code and came up with some suggestions for your to-do list: > > - Kconfig: replace 'mthca' with 'ams1100' in help text :) > > - Replace all // comments with /* */. > > - Why are members of cc_pci_regs_t and cc_adapter_pci_regs_t volatile? > Volatile declarations are almost inevitably buggy. It's better to > use ordered accessors (readl(), writel(), etc) or insert explicit > memory barriers. [Tom] will do. > > - Get rid of most typedefs -- for example > > typedef enum { > ... > } foo_t; > > should become > > enum foo { > ... > }; > > - Can cc_byteorder.h be eliminated? Most of the wrappers are > definitely superfluous. Can the WR byte order ever change? ie are > the cpu_to_wrXX() functions actually a useful abstraction? [Tom] no, but the code was written when this was still in flux... > > - Most of cc_common.h can be removed -- eg CC_SLEEP, CC_PRINT are useless > > - In ccil_api.h, rather than "#ifdef CC_LITTLE_ENDIAN" just use > __constant_cpu_to_b32 or whatever. > > Most of the "PACKED" attributes are unnecessary (since the > structures are already aligned). They'll lead to horrible code with > gcc on ia64 etc. 
> > - In general, all the #ifdef X86_64 in the code looks like portability > bugs. > The best solution is just to make the code 32/64 clean, but at least > we need to replace the test with #ifdef CONFIG_COMPAT > > - ccilnet_dbg.c is an awful lot of code just for debugging. > > - Get rid of compatibility with Linux 2.4 -- everywhere there's a test > of LINUX_VERSION_CODE, just take the 2.6 code. > > - cc_mq_common.c: BUMP is pretty inefficient, does a divide every time > > - cc_qp_common.c: cc_memcpy8 corrupts FPU state, is it really needed? > it's never called. Why is it declared in cc_mq_common.c? > memcpy4 similarly corrupts state. If it's fixed to save CR0 and do > clts, is it really faster than a normal memcpy (considering it also > disables IRQs)? > [Tom] This file was shared between the user and kernel portions of our standalone code. It will get incorporated into the devccil_qp.c file. > This is all utterly non-portably anyway -- there needs to be a > standard fallback for PPC64, IA64 etc. > > - Why is cc_queue.h needed? What is missing? [Tom] nada, but the code was originally intended to be OS agnostic. > > - cc_types.h: get rid of NULL, TRUE, FALSE defines, cc_bool_t, etc. > PTR_TO_CTX, CC_PTR_TO_64, etc seem busted on 64-bit archs. > forget the macros, just cast to/from (unsigned long). > > - devccil_adapter.c: adapter_list seems to be a latent bug -- why does > the driver need to keep a global list of adapters? > > - devccil.c: What is 'Static' -- why not 'static'?? > Probably all this code gets replaced by existing uverbs anyway. > > - devccil_lock.h: Get rid of lock wrappers, they're just > obfuscation. And it seems CCTHREADSAFE won't ever be defined?? > The driver should always be threadsafe. > > - devccil_mem.c: ccil_big_malloc is horrible -- need a way to use > non-contig memory. 
> > - devccil_srq.c: CCERR_NOT_IMPLEMENTED stubs not needed -- the > existing midlayer will return -ENOSYS for unimplemented functions > (ie NULL pointers in ib_device method table). > > - devccil_var.h: Defining your own version of PAGE_SHIFT etc. is just > obfuscation; probably has all rounding functions > you need. Custom ASSERT() can probably just be replaced with > standard BUG_ON(). From swise at ammasso.com Mon Aug 8 10:31:43 2005 From: swise at ammasso.com (Steve Wise) Date: Mon, 8 Aug 2005 12:31:43 -0500 Subject: [openib-general] [PATCH] iwarp branch - Kconfig + README fixes In-Reply-To: <52ek9435as.fsf@cisco.com> Message-ID: Tom, Here is a patch that fixes Kconfig and the README typos. I didn't test it. Signed-off-by: Steve Wise Index: Kconfig =================================================================== --- Kconfig (revision 3015) +++ Kconfig (working copy) @@ -3,13 +3,13 @@ depends on PCI && INFINIBAND ---help--- This is a low-level driver for the Ammasso 1100 host - channel adapters (HCAs). + channel adapter (HCA). config INFINIBAND_AMSO1100_DEBUG bool "Verbose debugging output" depends on INFINIBAND_AMSO1100 default n ---help--- - This option causes the mthca driver produce a bunch of debug - messages. Select this is you are developing the driver or - trying to diagnose a problem. + This option causes the amso1100 driver to produce a bunch of + debug messages. Select this if you are developing the driver + or trying to diagnose a problem. Index: README =================================================================== --- README (revision 3015) +++ README (working copy) @@ -1,13 +1,13 @@ -This is starting point for the OpenIB iWARP driver for the AMSO1100 HCA -from Ammasso. The adapter is a 1Gb RDMA capable PCI-X NIC. +This is a starting point for the OpenIB iWARP driver for the AMSO1100 +HCA from Ammasso. The adapter is a 1Gb RDMA capable PCI-X NIC. 
The driver in its original form contains a netdev driver for native stack support, a set of kDAT services, and a set of Ammasso verbs that are based on the RDMAC verbs. That said, it should be clear that this driver was originally developed for a different environment (standalone) with a different set of goals and objectives than those for which it is -not being tasked. It was also developed to reuse as much code as possible +now being tasked. It was also developed to reuse as much code as possible between the user mode and kernel mode pieces of the product. This is why the naming conventions for the files (and some of the functions) is so whacky. xyz_common.c for example, and fname_kern() instead of From jlentini at netapp.com Mon Aug 8 13:14:12 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 8 Aug 2005 16:14:12 -0400 (EDT) Subject: [openib-general] [PATCH][IBAT] minor update to clarify comment Message-ID: Minor update to clarify comment. Signed-off-by: James Lentini Index: include/ib_at.h =================================================================== --- include/ib_at.h (revision 3015) +++ include/ib_at.h (working copy) @@ -82,7 +82,7 @@ * @context: user defined context pointer * @req_id: asynchronous request ID - optional, out * - * The following asynchronous resolution function behavior is as follows: + * The asynchronous resolution function behavior is as follows: * If the resolve operation can be fulfilled immediately, then the output * structures are set and the number of filled structures is returned. 
* From jlentini at netapp.com Mon Aug 8 13:20:47 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 8 Aug 2005 16:20:47 -0400 (EDT) Subject: [openib-general] [PATCH][IBAT] another small comment update Message-ID: Small update to another comment Signed-off-by: James Lentini Index: include/ib_at.h =================================================================== --- include/ib_at.h (revision 3015) +++ include/ib_at.h (working copy) @@ -116,7 +116,7 @@ }; /** - * ib_at_route_by_ip - asynchronously resolve ip route to ib route + * ib_at_route_by_ip - asynchronously resolve ip address to ib route * @dst_ip: destination ip * @src_ip: source ip - optional * @tos: ip type of service From halr at voltaire.com Mon Aug 8 13:17:54 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 16:17:54 -0400 Subject: [openib-general] Re: [PATCH][IBAT] minor update to clarify comment In-Reply-To: References: Message-ID: 
<1123532273.4399.5.camel@hal.voltaire.com> On Mon, 2005-08-08 at 16:14, James Lentini wrote: > Minor update to clarify comment. Thanks. Applied. -- Hal From halr at voltaire.com Mon Aug 8 13:24:01 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 16:24:01 -0400 Subject: [openib-general] Re: [PATCH][IBAT] another small comment update In-Reply-To: References: Message-ID: <1123532639.4399.9.camel@hal.voltaire.com> On Mon, 2005-08-08 at 16:20, James Lentini wrote: > Small update to another comment Thanks. Applied. -- Hal From halr at voltaire.com Mon Aug 8 15:05:08 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 18:05:08 -0400 Subject: [openib-general] 2.6.11 backport update Message-ID: <1123538708.4402.30.camel@hal.voltaire.com> Hi Michael, One more change is now needed to mthca to support 2.6.11 backport for the below to the mthca patch. Are you still maintaing this ? -- Hal Index: mthca_provider.c =================================================================== -- mthca_provider.c (revision 2907) +++ mthca_provider.c (revision 2908) @@ -350,9 +350,9 @@ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); - if (remap_pfn_range(vma, vma->vm_start, - to_mucontext(context)->uar.pfn, - PAGE_SIZE, vma->vm_page_prot)) + if (io_remap_pfn_range(vma, vma->vm_start, + to_mucontext(context)->uar.pfn, + PAGE_SIZE, vma->vm_page_prot)) return -EAGAIN; return 0; From halr at voltaire.com Mon Aug 8 15:11:55 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 18:11:55 -0400 Subject: [openib-general] Re: sdp: cant unload ib_ipoib module In-Reply-To: <20050807154632.GG15300@mellanox.co.il> References: <1123007814.2946.12.camel@duffman> <20050803070923.GG15300@mellanox.co.il> <20050804104250.A30741@topspin.com> <20050807154632.GG15300@mellanox.co.il> Message-ID: <1123539115.4402.39.camel@hal.voltaire.com> On Sun, 2005-08-07 at 11:46, Michael S. 
Tsirkin wrote: > OK, I grew tired of rebooting and went over sdp_link.c looking for stale > references. Thanks for doing this. > Given how the old code doesn't ever do ip_rt_put in most cases, I don't really > understand why more people don't see this problem. I don't understand that either (but that's not a requirement :-) > With the patch below I can unload ipoib after running sdp tests. > > Hal, could you verify that this patch helps you, too? Yes it does. -- Hal From rolandd at cisco.com Mon Aug 8 15:26:43 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 08 Aug 2005 15:26:43 -0700 Subject: [openib-general] [PATCH] ipoib: device removal races In-Reply-To: <20050808151141.GJ15300@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 8 Aug 2005 18:11:42 +0300") References: <20050808151141.GJ15300@mellanox.co.il> Message-ID: <52wtmwxdy4.fsf@cisco.com> Michael> As a side note, schedule_work in ipoib_event also looks Michael> suspicious. Can't we have it outstanding when the device is Michael> going down? Roland, what do you say we switch that to Michael> ipoib_workqueue as well, and add a flush after Michael> ib_unregister_event_handler? Thanks, I'll take a look at all the workqueue stuff in IPoIB. - R. From sean.hefty at intel.com Mon Aug 8 15:32:27 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 8 Aug 2005 15:32:27 -0700 Subject: [openib-general] likely user error with uat Message-ID: Hal/anyone, I'm trying to run the uats and uatt examples. 
Uatt runs fine, but with uats I'm getting the following errors: uats: main: uat test start uats: main: ib_at_route_by_ip: ret 1 errno 0 for request 1 id 0 0 uats: att_rt_comp_fn: id 0 context 0x804ade0 completed with rec_num 1 ===> rt 0x804ade0 sgid 0xfe800000000000000002c90107fc5e11 dgid 0xfe800000000000000002c90107fc5e11 uats: att_rt_comp_fn: ib_at_ips_by_gid: ret -1 errno 19 id 0 0 uats: main: ib_at_route_by_ip: ret 1 errno 19 for request 2 id 0 0 uats: att_rt_comp_fn: id 0 context 0x804ae10 completed with rec_num 1 ===> rt 0x804ae10 sgid 0xfe800000000000000002c90107fc5e11 dgid 0xfe800000000000000002c90107fc5e11 uats: att_rt_comp_fn: ib_at_ips_by_gid: ret -1 errno 19 id 0 0 The errors just repeat beyond this point. The kernel shows: Kernel: ib_at: resolve_ip: No device for IB comm I'm guessing that I'm missing a configuration somewhere, but not sure where. (I'm running 2.6.12.1 with svn 2989.) Ipoib seems to work fine. Have you seen anything like this before? - Sean From halr at voltaire.com Mon Aug 8 15:52:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 18:52:45 -0400 Subject: [openib-general] Re: likely user error with uat In-Reply-To: References: Message-ID: <1123541564.4870.39.camel@hal.voltaire.com> Hi Sean, On Mon, 2005-08-08 at 18:32, Sean Hefty wrote: > Hal/anyone, > > I'm trying to run the uats and uatt examples. 
Uatt runs fine, but with uats I'm > getting the following errors: > > uats: main: uat test start > uats: main: ib_at_route_by_ip: ret 1 errno 0 for request 1 id 0 0 > uats: att_rt_comp_fn: id 0 context 0x804ade0 completed with rec_num 1 > ===> rt 0x804ade0 sgid 0xfe800000000000000002c90107fc5e11 dgid > 0xfe800000000000000002c90107fc5e11 > uats: att_rt_comp_fn: ib_at_ips_by_gid: ret -1 errno 19 id 0 0 > uats: main: ib_at_route_by_ip: ret 1 errno 19 for request 2 id 0 0 > uats: att_rt_comp_fn: id 0 context 0x804ae10 completed with rec_num 1 > ===> rt 0x804ae10 sgid 0xfe800000000000000002c90107fc5e11 dgid > 0xfe800000000000000002c90107fc5e11 > uats: att_rt_comp_fn: ib_at_ips_by_gid: ret -1 errno 19 id 0 0 > > The errors just repeat beyond this point. That's not surprising as there is a loop in this which repeats the request MAX_REQ (32) times. > The kernel shows: > > Kernel: ib_at: resolve_ip: No device for IB comm > > I'm guessing that I'm missing a configuration somewhere, but not sure where. > (I'm running 2.6.12.1 with svn 2989.) Ipoib seems to work fine. Have you seen > anything like this before? Not sure what's going on yet. I'm now seeing something like this too. It's the address resolution within AT which is indicating this -- Hal From halr at voltaire.com Mon Aug 8 16:30:15 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 19:30:15 -0400 Subject: [openib-general] [PATCH] Fix amso1100/Kconfig Message-ID: <1123543814.4870.70.camel@hal.voltaire.com> Fix amso1100/Kconfig Signed-off-by: Hal Rosenstock Index: Kconfig =================================================================== --- Kconfig (revision 3028) +++ Kconfig (working copy) @@ -2,7 +2,7 @@ tristate "Ammasso 1100 HCA support" depends on PCI && INFINIBAND ---help--- - This is a low-level driver for the Ammasso 1100 host + This is a low-level driver for the Ammasso 1100 host channel adapter (HCA). 
config INFINIBAND_AMSO1100_DEBUG From rolandd at cisco.com Mon Aug 8 16:46:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 08 Aug 2005 16:46:42 -0700 Subject: [openib-general] [PATCH] Fix amso1100/Kconfig In-Reply-To: <1123543814.4870.70.camel@hal.voltaire.com> (Hal Rosenstock's message of "08 Aug 2005 19:30:15 -0400") References: <1123543814.4870.70.camel@hal.voltaire.com> Message-ID: <52fytkxa8t.fsf@cisco.com> > - This is a low-level driver for the Ammasso 1100 host > + This is a low-level driver for the Ammasso 1100 host Why do we want to delete a space here? In any case I think it's better to let Tom et al work on the real issues with the driver for now, rather than distracting them with typo patches. - R. From halr at voltaire.com Mon Aug 8 16:45:55 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 19:45:55 -0400 Subject: [openib-general] [PATCH] Fix amso1100/Kconfig In-Reply-To: <52fytkxa8t.fsf@cisco.com> References: <1123543814.4870.70.camel@hal.voltaire.com> <52fytkxa8t.fsf@cisco.com> Message-ID: <1123544755.4870.88.camel@hal.voltaire.com> On Mon, 2005-08-08 at 19:46, Roland Dreier wrote: > > - This is a low-level driver for the Ammasso 1100 host > > + This is a low-level driver for the Ammasso 1100 host > > Why do we want to delete a space here? It broke the build. I got the following error: drivers/infiniband/hw/amso1100/Kconfig:6: syntax error, unexpected T_WORD drivers/infiniband/hw/amso1100/Kconfig:8: invalid menu option > In any case I think it's better to let Tom et al work on the real > issues with the driver for now, rather than distracting them with typo > patches. That was the only reason why. 
-- Hal From rolandd at cisco.com Mon Aug 8 16:59:43 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 08 Aug 2005 16:59:43 -0700 Subject: [openib-general] [PATCH] Fix amso1100/Kconfig In-Reply-To: <1123544755.4870.88.camel@hal.voltaire.com> (Hal Rosenstock's message of "08 Aug 2005 19:45:55 -0400") References: <1123543814.4870.70.camel@hal.voltaire.com> <52fytkxa8t.fsf@cisco.com> <1123544755.4870.88.camel@hal.voltaire.com> Message-ID: <52br48x9n4.fsf@cisco.com> Hal> It broke the build. I got the following error: Hal> drivers/infiniband/hw/amso1100/Kconfig:6: syntax error, Hal> unexpected T_WORD drivers/infiniband/hw/amso1100/Kconfig:8: Hal> invalid menu option Huh, you're right. Sorry for complaining to you. Tom: please make sure patches build before applying them ;) (not that I've always followed this rule) - R. From halr at voltaire.com Mon Aug 8 17:04:52 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 20:04:52 -0400 Subject: [openib-general] Re: likely user error with uat In-Reply-To: <1123541564.4870.39.camel@hal.voltaire.com> References: <1123541564.4870.39.camel@hal.voltaire.com> Message-ID: <1123545889.4522.11.camel@hal.voltaire.com> Hi again Sean, On Mon, 2005-08-08 at 18:52, Hal Rosenstock wrote: > Not sure what's going on yet. I'm now seeing something like this too. > It's the address resolution within AT which is indicating this It seems to work fine for me immediately after modprob'ing and bringing up IPoIB and then UAT (which brings up AT). It seems to fail after removing UAT and IPoIB and then bringing them back. AT resolve_ip fails in walking the netdev list by index (dev_get_by_index). It never sees the IPoIB interfaces in this case for some reason I don't yet understand. Is the failing scenario the same ? Does it work for you after a boot and initial bringup ? 
-- Hal From mshefty at ichips.intel.com Mon Aug 8 17:27:59 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Aug 2005 17:27:59 -0700 Subject: [openib-general] Re: likely user error with uat In-Reply-To: <1123545889.4522.11.camel@hal.voltaire.com> References: <1123541564.4870.39.camel@hal.voltaire.com> <1123545889.4522.11.camel@hal.voltaire.com> Message-ID: <42F7F88F.402@ichips.intel.com> Hal Rosenstock wrote: > It seems to fail after removing UAT and IPoIB and then bringing them > back. AT resolve_ip fails in walking the netdev list by index > (dev_get_by_index). It never sees the IPoIB interfaces in this case for > some reason I don't yet understand. I did bring at and ipoib up/down/up before running the tests. > Is the failing scenario the same ? Does it work for you after a boot and > initial bringup ? Yes - the tests worked after a fresh reboot. - Sean From halr at voltaire.com Mon Aug 8 17:32:02 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 20:32:02 -0400 Subject: [openib-general] osm: management headers installed into /usr/ local/include In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C3060B@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607C3060B@mtlex01.yok.mtl.com> Message-ID: <1123547521.4421.58.camel@hal.voltaire.com> Hi Eitan, On Mon, 2005-08-08 at 12:00, Eitan Zahavi wrote: > Hi Hal, > > > > > This was the convention used back when this started (I'm not sure > how it > > evolved) but maybe it doesn't make sense anymore and all should just > be > > under /usr/local/[include lib bin]. The only reason in > > terms of the headers was that they were not commonly used and so > > numerous so they perhaps should be separated out. > > > > If that could be done it will simplify user level coding by making it > more "standard". > If one reference opensm vendor in /lib one can simply user > /include . > > What is the process of making it happen? 
Does anyone object to moving the diagnostic and OpenSM related libraries and binaries directly under /usr/local/[lib bin] rather than /usr/local/ib/[lib bin] ? I am presuming that the includes would be left in /usr/local/include/infiniband/[complib iba opensm vendor]. -- Hal From halr at voltaire.com Mon Aug 8 17:40:37 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 20:40:37 -0400 Subject: [openib-general] Re: likely user error with uat In-Reply-To: <42F7F88F.402@ichips.intel.com> References: <1123541564.4870.39.camel@hal.voltaire.com> <1123545889.4522.11.camel@hal.voltaire.com> <42F7F88F.402@ichips.intel.com> Message-ID: <1123547632.4421.65.camel@hal.voltaire.com> On Mon, 2005-08-08 at 20:27, Sean Hefty wrote: > Hal Rosenstock wrote: > > It seems to fail after removing UAT and IPoIB and then bringing them > > back. AT resolve_ip fails in walking the netdev list by index > > (dev_get_by_index). It never sees the IPoIB interfaces in this case for > > some reason I don't yet understand. > > I did bring at and ipoib up/down/up before running the tests. Just down and back up and didn't remove the IPoIB module ? I haven't tried that but it sounds like that is sufficient to create the problem with walking the netdev list. > > Is the failing scenario the same ? Does it work for you after a boot and > > initial bringup ? > > Yes - the tests worked after a fresh reboot. Thanks. 
-- Hal From halr at voltaire.com Mon Aug 8 17:52:49 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 20:52:49 -0400 Subject: [openib-general] RE: [PATCH ] osmtest general cleanups In-Reply-To: <506C3D7B14CDD411A52C00025558DED60865E882@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED60865E882@mtlex01.yok.mtl.com> Message-ID: <1123548705.4421.103.camel@hal.voltaire.com> Hi Liran, On Mon, 2005-08-08 at 09:58, Liran Sorani wrote: > Hi , > Regarding the inform_info test flow , the reason I need (possibly) > ib_verbs is to subscribe a notice report through a QP other then QP1 > (permitted by IB Spec see p754 , 13.5.1) , What version of the spec are you referring to ? Can you make the reference more specific ? Is it 13.5.1.2? > generate a trap , then validate the received report through that QP. > To enable Set/Recieve notice mads through QP other then QP1 , I think > , I need the ib_verbs While the subscription (InformInfo) could be through a user SA client, the Trap/Notice could be direct to some QP other than QP1 and hence would require uverbs for at least this part. Do these tests cause SM generated events (ports in and out of service, multicast groups, etc.) ? -- Hal > Regarding the rest , will be fixed in a seperate patch . > > > --Original Message-- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, August 08, 2005 3:32 PM > To: Liran Sorani > Cc: openib-general at openib.org; Amit Krig > Subject: RE: [PATCH ] osmtest general cleanups > > > Hi Liran, > > On Sun, 2005-08-07 at 11:28, Liran Sorani wrote: > > Hi , Hal. > > I've a few minor fixes (several white space glitches) , pls take the > > attached file , instead of the previous , > > Thanks. Applied with some minor modifications and caveats below. > > In the future can your patches be submitted as text rather than > attachments ? That is the norm for doing this. > > Some comments below. > > > thanks Liran. 
> > > > --Original Message-- > > From: Liran Sorani > > Sent: Sunday, August 07, 2005 5:57 PM > > To: 'halr at voltaire.com' > > Cc: 'openib-general at openib.org'; Amit Krig > > Subject: [PATCH ] osmtest general cleanups > > > > > > Hi , Hal. > > The attached patch should be applied to osmtest repository. > > It contain several cleanups (on most of the files) : > > It is easier if there is a patch per idea rather than an amalgam. > > > -Removal of inform info flow . > > -Unique error messages > > There are some non real error message numbers (neither hex nor > decimal) > in osmt_multicast.c. Also, osmtest.c and osmt_service.c still have > duplicates. > > > -Makefile.am update for compilation (required osm_helper object) > > osm_helper is part of the libopensm since r2973 so this part of the > patch was not applied. Please update your osm directory. > > > -Remove vendor dependencies. > > Remove some OSM_VENDOR_INTF_MTL vendor dependencies > (There are still some in osmt_slvl_vl_arb.c). > > > The inform info flow should be carefully ported since it requires > > direct access to ib_umad (possibly ib_verbs too) . > > umad would be a temporary measure. > > Why would ib_uverbs direct access be needed ? > > > I'll send another patch for it next week . > > Thanks. > > -- Hal > From halr at voltaire.com Mon Aug 8 17:56:03 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Aug 2005 20:56:03 -0400 Subject: [openib-general] RE: OpenSM Work In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C305E5@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607C305E5@mtlex01.yok.mtl.com> Message-ID: <1123548757.4421.106.camel@hal.voltaire.com> Hi Eitan, On Sun, 2005-08-07 at 02:15, Eitan Zahavi wrote: > Hi Hal, > > Regarding the directory structure of OpenSM: > From the developer standpoint it makes life easier to have the H files > located on the same directory the C file is located. I still use > "grep" a lot. 
Couldn't the grep be done from one level up and get pretty much the same thing ? > I could not find even one GNU stile project that uses a separate > include directory for H files during the development phase (all of > them eventually install H files into $prefix/include). > Please see the short list of projects below as a reference but > actually I picked them randomly from the FSF GNU projects page. > > Tcl > Expat > Enscript > Gimp gcc and gdb have include directories in their source distribution. Others: GGI > So I guess this methodology of keeping all H files in a separate > directory is more a "kernel" convention? I don't think so. > What are the implications of moving the H files each into its sources > dir? What are the implications of not doing this ? > This issue is more a style and adherence to the standard coding > practices for user level code then something that prevents us from > progressing. I do think it is a style issue and don't think there are any real technical arguments either way. > However I wonder what are the strong reasons for keeping it the way it > is? Is there a strong reason for changing it ? -- Hal From tomduffy at gmail.com Mon Aug 8 18:31:09 2005 From: tomduffy at gmail.com (Tom Duffy) Date: Mon, 8 Aug 2005 18:31:09 -0700 Subject: [openib-general] Re: sdp: cant unload ib_ipoib module In-Reply-To: <20050807154632.GG15300@mellanox.co.il> References: <1123007814.2946.12.camel@duffman> <20050803070923.GG15300@mellanox.co.il> <20050804104250.A30741@topspin.com> <20050807154632.GG15300@mellanox.co.il> Message-ID: <9d3b7de705080818316fe781a1@mail.gmail.com> On 8/7/05, Michael S. Tsirkin wrote: > Drop net_device and rtable references after path lookup is done. Thanks, applied in revision 3029. 
-tduffy From tomduffy at speakesy.net Mon Aug 8 18:56:59 2005 From: tomduffy at speakesy.net (Tom Duffy) Date: Mon, 8 Aug 2005 18:56:59 -0700 Subject: [openib-general] Re: GONE from Sun as of Aug 5, 2005 In-Reply-To: <20050808071647.GL15300@mellanox.co.il> References: <20050808071647.GL15300@mellanox.co.il> Message-ID: <00158FC9-8274-4725-A1B9-14D52F8C54C8@speakesy.net> On Aug 8, 2005, at 12:16 AM, Michael S. Tsirkin wrote: > Tom, I'd like to start doing checkins directly to SDP trunk. > As I said, I'm already working on it a big percentage of my time, > as part of my day job, and there's a setup here in Mellanox Israel > to stress-test SDP. Sounds fine. I wouldn't object. > This could work similar to the way core is co-maintained by > Sean, Hal and Roland: all patches sent to openib-general: > trivial patches checked in directly, bigger things wait a bit > for review. > > Let me know whether this works for you. This would work fine. Go for it. -tduffy From liran at mellanox.co.il Mon Aug 8 23:09:19 2005 From: liran at mellanox.co.il (Liran Sorani) Date: Tue, 9 Aug 2005 09:09:19 +0300 Subject: [openib-general] RE: [PATCH ] osmtest general cleanups Message-ID: <506C3D7B14CDD411A52C00025558DED60865E8E7@mtlex01.yok.mtl.com> Hi, Hal. The spec I'm referring to is IB Spec Release 1.2 (Volume 1), p. 754, section 13.5.1; see the third paragraph, which contains the following quote: "Note that it is not required by IBA that GS managers use QP1 as the source QP used to send management packets to GS agents. GS managers may send packets from any QP other than QP0." The InformInfo test flow sends a Trap (using QP0); this part can be implemented with umad. In response the SM should send a Report (to a previously subscribed node). 
-----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Tuesday, August 09, 2005 3:53 AM To: Liran Sorani Cc: openib-general at openib.org; Amit Krig Subject: RE: [PATCH ] osmtest general cleanups Hi Liran, On Mon, 2005-08-08 at 09:58, Liran Sorani wrote: > Hi , > Regarding the inform_info test flow , the reason I need (possibly) > ib_verbs is to subscribe a notice report through a QP other then QP1 > (permitted by IB Spec see p754 , 13.5.1) , What version of the spec are you referring to ? Can you make the reference more specific ? Is it 13.5.1.2? > generate a trap , then validate the received report through that QP. > To enable Set/Recieve notice mads through QP other then QP1 , I think > , I need the ib_verbs While the subscription (InformInfo) could be through a user SA client, the Trap/Notice could be direct to some QP other than QP1 and hence would require uverbs for at least this part. Do these tests cause SM generated events (ports in and out of service, multicast groups, etc.) ? -- Hal > Regarding the rest , will be fixed in a seperate patch . > > > --Original Message-- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, August 08, 2005 3:32 PM > To: Liran Sorani > Cc: openib-general at openib.org; Amit Krig > Subject: RE: [PATCH ] osmtest general cleanups > > > Hi Liran, > > On Sun, 2005-08-07 at 11:28, Liran Sorani wrote: > > Hi , Hal. > > I've a few minor fixes (several white space glitches) , pls take the > > attached file , instead of the previous , > > Thanks. Applied with some minor modifications and caveats below. > > In the future can your patches be submitted as text rather than > attachments ? That is the norm for doing this. > > Some comments below. > > > thanks Liran. 
> > > > --Original Message-- > > From: Liran Sorani > > Sent: Sunday, August 07, 2005 5:57 PM > > To: 'halr at voltaire.com' > > Cc: 'openib-general at openib.org'; Amit Krig > > Subject: [PATCH ] osmtest general cleanups > > > > > > Hi , Hal. > > The attached patch should be applied to osmtest repository. > > It contain several cleanups (on most of the files) : > > It is easier if there is a patch per idea rather than an amalgam. > > > -Removal of inform info flow . > > -Unique error messages > > There are some non real error message numbers (neither hex nor > decimal) > in osmt_multicast.c. Also, osmtest.c and osmt_service.c still have > duplicates. > > > -Makefile.am update for compilation (required osm_helper object) > > osm_helper is part of the libopensm since r2973 so this part of the > patch was not applied. Please update your osm directory. > > > -Remove vendor dependencies. > > Remove some OSM_VENDOR_INTF_MTL vendor dependencies > (There are still some in osmt_slvl_vl_arb.c). > > > The inform info flow should be carefully ported since it requires > > direct access to ib_umad (possibly ib_verbs too) . > > umad would be a temporary measure. > > Why would ib_uverbs direct access be needed ? > > > I'll send another patch for it next week . > > Thanks. > > -- Hal > From mst at mellanox.co.il Tue Aug 9 00:14:24 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 9 Aug 2005 10:14:24 +0300 Subject: [openib-general] Re: [PATCH] libibmad: configure option to skip library test In-Reply-To: <1123514946.4422.4097.camel@hal.voltaire.com> References: <20050808131308.GC15300@mellanox.co.il> <20050808131439.GD15300@mellanox.co.il> <20050808142641.GH15300@mellanox.co.il> <1123514946.4422.4097.camel@hal.voltaire.com> Message-ID: <20050809071424.GM15300@mellanox.co.il> Quoting r. 
Hal Rosenstock :
> Subject: Re: [PATCH] libibmad: configure option to skip library test
>
> On Mon, 2005-08-08 at 10:26, Michael S. Tsirkin wrote:
> > Hal, I'm trying to split the build process to configure/make/install
> > steps.
>
> Seems reasonable. Will you be doing the same to umad and common too ?

Yes. Shortly.

-- MST

From danb at voltaire.com Tue Aug 9 02:33:54 2005
From: danb at voltaire.com (Dan Bar Dov)
Date: Tue, 9 Aug 2005 12:33:54 +0300
Subject: [openib-general] [iSER]use iSER on x86_64
Message-ID:

>
> hi Dan,
> Here it is!
> We have got two machines with one IB HCA each. They are
> connected directly without a switch.
> But we don't have an iSER target.

Currently an iSER target is available only as part of commercial
products. There is no open-source iSER target, although I know that
some people are working on it.

> By the way, I have a question about your updating the iSER:
> Need the kDAPL be modified after updating the iSER?

Yes, the openIB kDAPL version has changed the API so that it is no
longer compatible with the DAT collaborative API of either DAT1.2 or
DAT1.1.

> And could you give some explanation for the difference
> between the DAT_1.2_headers and the openIB DAT headers ?

All the typedefs were removed, all the function names were changed to
lower case etc., I don't know all the details.

>
Dan

From mst at mellanox.co.il Tue Aug 9 04:20:56 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 9 Aug 2005 14:20:56 +0300
Subject: [openib-general] Re: 2.6.11 backport update
In-Reply-To: <1123538708.4402.30.camel@hal.voltaire.com>
References: <1123538708.4402.30.camel@hal.voltaire.com>
Message-ID: <20050809112056.GA32419@mellanox.co.il>

Quoting r. Hal Rosenstock :
> Subject: 2.6.11 backport update
>
> Hi Michael,
>
> One more change is now needed to mthca to support 2.6.11 backport for
> the below to the mthca patch.

Fixed. I didn't see this because I actually have a vendor kernel which
apparently backported io_remap_pfn_range.
> Are you still maintaining this ?

Sure. Note that I didn't look at backporting SRP yet - if someone has
the inclination, feel free to go ahead.

-- MST

From mst at mellanox.co.il Tue Aug 9 05:12:55 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 9 Aug 2005 15:12:55 +0300
Subject: [openib-general] [PATCH] libibumad: configure option to skip library test
Message-ID: <20050809121255.GE32419@mellanox.co.il>

Add option to skip infiniband library checks in libibumad.

Signed-off-by: Michael S. Tsirkin

Index: libibumad/configure.in
===================================================================
--- libibumad/configure.in (revision 2963)
+++ libibumad/configure.in (working copy)
@@ -7,6 +7,12 @@
 AC_CONFIG_AUX_DIR(config)
 AM_CONFIG_HEADER(config.h)
 AM_INIT_AUTOMAKE(libibumad, 0.9.0)
+AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presense of ib libraries],
+[ if test x$enableval = xno ; then
+ disable_libcheck=yes
+ fi
+])
+
 dnl Checks for programs
 AC_PROG_CXX
 AC_PROG_CC
@@ -16,18 +22,24 @@
 AC_PROG_LN_S
 AC_PROG_MAKE_SET
 AM_PROG_LIBTOOL
+if test "$disable_libcheck" != "yes"
+then
 dnl Checks for libraries
 LDFLAGS="$LDFLAGS -L/usr/local/ib/lib"
 AC_CHECK_LIB(ibcommon, sys_read_string, [],
   AC_MSG_ERROR([sys_read_string() not found. libibumad requires libibcommon.]))
+fi
 dnl Checks for header files.
 AC_HEADER_DIRENT
 AC_HEADER_STDC
 AC_CHECK_HEADERS([fcntl.h netinet/in.h stdlib.h string.h sys/ioctl.h unistd.h])
+if test "$disable_libcheck" != "yes"
+then
 AC_CHECK_HEADER(infiniband/common.h, [],
   AC_MSG_ERROR([<infiniband/common.h> not found. libibumad requires libibcommon.])
 )
+fi
 dnl Checks for library functions
 AC_PROG_GCC_TRADITIONAL

-- MST

From mst at mellanox.co.il Tue Aug 9 05:34:03 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 9 Aug 2005 15:34:03 +0300
Subject: [openib-general] [PATCH applied] sdp: kill all trailing whitespace
Message-ID: <20050809123403.GF32419@mellanox.co.il>

The following is applied in rev 3033.
--- Kill all trailing whitespace in SDP. There's no reason to keep it around. Signed-off-by: Michael S. Tsirkin Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_write.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_write.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_write.c (working copy) @@ -116,12 +116,12 @@ int sdp_event_write(struct sdp_sock *con iocb = (struct sdpc_iocb *)sdp_desc_q_look_type_head(&conn->send_queue, SDP_DESC_TYPE_IOCB); if (!iocb) { - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "No IOCB on write complete <%llu:%d:%d>", (unsigned long long)comp->wr_id, sdp_desc_q_size(&conn->w_snk), sdp_desc_q_size(&conn->send_queue)); - + result = -EPROTO; goto error; } Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_link.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_link.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_link.c (working copy) @@ -82,12 +82,12 @@ static void sdp_link_path_complete(u64 i /* * call completion function */ - func(id, + func(id, status, info->dst, info->src, info->port, - info->ca, + info->ca, &info->path, arg); @@ -136,7 +136,7 @@ static void sdp_path_wait_destroy(struct static void sdp_path_wait_complete(struct sdp_path_wait *wait, struct sdp_path_info *info, int status) { - sdp_link_path_complete(wait->id, + sdp_link_path_complete(wait->id, status, info, wait->completion, @@ -267,7 +267,7 @@ static void sdp_link_path_rec_done(int s IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH), - info->sa_time, + info->sa_time, GFP_KERNEL, sdp_link_path_rec_done, info, @@ -560,8 +560,8 @@ int sdp_link_path_lookup(u32 dst_addr, sdp_dbg_warn(NULL, "Failed to create path object"); return -ENOMEM; } - - info->src = src_addr; /* source is used in lookup and + + info->src = src_addr; /* source is used in lookup and populated by routing lookup */ } /* @@ 
-609,13 +609,13 @@ static void sdp_link_sweep(void *data) struct sdp_path_info *info; struct sdp_path_info *sweep; - sweep = info_list; + sweep = info_list; while (sweep) { info = sweep; sweep = sweep->next; if (jiffies > (info->use + SDP_LINK_INFO_TIMEOUT)) { - sdp_dbg_ctrl(NULL, + sdp_dbg_ctrl(NULL, "info delete <%d.%d.%d.%d> <%lu:%lu>", (info->dst & 0x000000ff), (info->dst & 0x0000ff00) >> 8, @@ -762,7 +762,7 @@ int sdp_link_addr_init(void) result = -ENOMEM; goto error_wq; } - + INIT_WORK(&link_timer, sdp_link_sweep, NULL); queue_delayed_work(link_wq, &link_timer, SDP_LINK_SWEEP_INTERVAL); /* @@ -771,7 +771,7 @@ int sdp_link_addr_init(void) * completed. */ dev_add_pack(&sdp_arp_type); - + return 0; error_wq: kmem_cache_destroy(wait_cache); Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_rcvd.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_rcvd.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_rcvd.c (working copy) @@ -63,7 +63,7 @@ static int sdp_rcvd_disconnect(struct sd */ result = ib_send_cm_dreq(conn->cm_id, NULL, 0); /* - * if the remote DREQ was already received, but unprocessed, + * if the remote DREQ was already received, but unprocessed, * do not treat it as an error */ if (result) { @@ -135,7 +135,7 @@ static int sdp_rcvd_send_sm(struct sdp_s * using buffered mode * 2) Conn is in source cancel, and this message acks the cancel. * Release all active IOCBs in the source queue. - * 3) Conn is in source cancel, but this message doesn't ack the + * 3) Conn is in source cancel, but this message doesn't ack the * cancel. * * Do nothing, can't send since the IOCB is being cancelled, but @@ -355,7 +355,7 @@ static int sdp_rcvd_mode_change(struct s /* if */ /* * drop all srcAvail message, they will be reissued, with - * combined mode constraints. No snkAvails outstanding on + * combined mode constraints. No snkAvails outstanding on * this half of the connection. 
How do I know which srcAvail * RDMA's completed? */ @@ -422,7 +422,7 @@ static int sdp_rcvd_src_cancel(struct sd } else { result = sdp_send_ctrl_rdma_rd(conn, advt->post); if (result < 0) { - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "Error <%d> read completion", result); goto done; @@ -668,7 +668,7 @@ static int sdp_rcvd_snk_avail(struct sdp goto consume; } /* - * If there are outstanding SrcAvail messages, they are now + * If there are outstanding SrcAvail messages, they are now * invalid and the queue needs to be fixed up. */ if (conn->src_sent > 0) { @@ -690,7 +690,7 @@ static int sdp_rcvd_snk_avail(struct sdp sdp_iocb_complete(iocb, 0); } /* - * If Source Cancel was in process, it should now + * If Source Cancel was in process, it should now * be cleared. */ if (conn->flags & SDP_CONN_F_SRC_CANCEL_L) { @@ -713,7 +713,7 @@ static int sdp_rcvd_snk_avail(struct sdp advt->rkey = snkah->r_key; conn->snk_recv++; - + conn->s_cur_adv = 1; conn->s_par_adv = 0; @@ -740,7 +740,7 @@ consume: result); } else result = 0; - + /* * PostRecv will take care of consuming this advertisment, based * on result. @@ -768,7 +768,7 @@ static int sdp_rcvd_src_avail(struct sdp if (conn->snk_sent > 0) { /* - * crossed SrcAvail and SnkAvail, the source message is + * crossed SrcAvail and SnkAvail, the source message is * discarded. */ sdp_dbg_data(conn, "avail cross<%d> dropping src. mode <%d>", @@ -811,12 +811,12 @@ static int sdp_rcvd_src_avail(struct sdp goto done; } /* - * consume the advertisment, if it's allowed, first check the recv + * consume the advertisment, if it's allowed, first check the recv * path mode to determine if all is cool for the advertisment. */ switch (conn->recv_mode) { case SDP_MODE_BUFF: - sdp_dbg_warn(conn, "SrcAvail in bad mode. <%d>", + sdp_dbg_warn(conn, "SrcAvail in bad mode. 
<%d>", conn->recv_mode); result = -EPROTO; goto advt_error; @@ -826,7 +826,7 @@ static int sdp_rcvd_src_avail(struct sdp if (conn->src_recv > 0 || size <= 0 || !(srcah->size > size)) { - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "SrcAvail mode <%d> mismatch. <%d:%d:%d>", conn->recv_mode, conn->src_recv, size, srcah->size); @@ -955,7 +955,7 @@ static int sdp_rcvd_data(struct sdp_sock * me can dispose of the buffer. */ conn->byte_strm += ret_val; - + return ret_val; } @@ -1075,7 +1075,7 @@ int sdp_event_recv(struct sdp_sock *conn } dma_unmap_single(conn->ca->dma_device, - buff->sge.addr, + buff->sge.addr, buff->tail - buff->data, PCI_DMA_FROMDEVICE); @@ -1090,7 +1090,7 @@ int sdp_event_recv(struct sdp_sock *conn sdp_msg_net_to_cpu_bsdh(buff->bsdh_hdr); if (comp->byte_len != buff->bsdh_hdr->size) { - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "receive event, message size mismatch <%d:%d>", comp->byte_len, buff->bsdh_hdr->size); @@ -1101,7 +1101,7 @@ int sdp_event_recv(struct sdp_sock *conn buff->tail = buff->data + buff->bsdh_hdr->size; buff->data = buff->data + sizeof(struct msg_hdr_bsdh); /* - * Do not update the advertised sequence number, until the + * Do not update the advertised sequence number, until the * SrcAvailCancel message has been processed. */ conn->recv_seq = buff->bsdh_hdr->seq_num; @@ -1122,7 +1122,7 @@ int sdp_event_recv(struct sdp_sock *conn buff->bsdh_hdr->flags, buff->bsdh_hdr->mid, buff->bsdh_hdr->size, - buff->bsdh_hdr->seq_num, + buff->bsdh_hdr->seq_num, buff->bsdh_hdr->seq_ack); /* * fast path data messages @@ -1163,7 +1163,7 @@ int sdp_event_recv(struct sdp_sock *conn } } else if (result < 0) { - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "receive event, dispatch error. 
<%d>", result); @@ -1178,7 +1178,7 @@ int sdp_event_recv(struct sdp_sock *conn sk->sk_data_ready(sk, conn->byte_strm); } /* - * It's possible that a new recv buffer advertisment opened up the + * It's possible that a new recv buffer advertisment opened up the * recv window and we can flush buffered send data */ result = sdp_send_flush(conn); Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_inet.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_inet.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_inet.c (working copy) @@ -184,7 +184,7 @@ static int sdp_inet_disconnect(struct sd result = sdp_send_ctrl_disconnect(conn); if (result < 0) { - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "Error <%d> send disconnect request", result); goto error; @@ -303,7 +303,7 @@ static int sdp_inet_release(struct socke if (result < 0) goto done; /* - * Skip lingering/canceling if + * Skip lingering/canceling if * non-blocking and not exiting. */ if (!(flags & MSG_DONTWAIT) || @@ -316,7 +316,7 @@ static int sdp_inet_release(struct socke && !(PF_EXITING & current->flags)) { DECLARE_WAITQUEUE(wait, current); timeout = sk->sk_lingertime; - + add_wait_queue(sk->sk_sleep, &wait); set_current_state(TASK_INTERRUPTIBLE); @@ -789,7 +789,7 @@ static int sdp_inet_accept(struct socket listen_done: sdp_conn_unlock(listen_conn); - sdp_dbg_ctrl(listen_conn, + sdp_dbg_ctrl(listen_conn, "ACCEPT: complete <%d> <%08x:%04x><%08x:%04x>", (accept_conn ? accept_conn->hashent : SDP_DEV_SK_INVALID), (accept_sk ? 
accept_conn->src_addr : 0), @@ -814,7 +814,7 @@ static int sdp_inet_getname(struct socke conn = sdp_sk(sk); sdp_dbg_ctrl(conn, "GETNAME: src <%08x:%04x> dst <%08x:%04x>", - conn->src_addr, conn->src_port, + conn->src_addr, conn->src_port, conn->dst_addr, conn->dst_port); addr->sin_family = proto_family; @@ -889,8 +889,8 @@ static unsigned int sdp_inet_poll(struct mask |= POLLIN | POLLRDNORM; /* - * send EOF _or_ send data space. - * (Some poll() Linux documentation says that POLLHUP is + * send EOF _or_ send data space. + * (Some poll() Linux documentation says that POLLHUP is * incompatible with the POLLOUT/POLLWR flags) */ if (SEND_SHUTDOWN & conn->shutdown) @@ -898,7 +898,7 @@ static unsigned int sdp_inet_poll(struct else { /* * avoid race by setting flags, and only clearing - * them if the test is passed. Setting after the + * them if the test is passed. Setting after the * test, we can end up with them set and a passing * test. */ @@ -917,10 +917,10 @@ static unsigned int sdp_inet_poll(struct mask |= POLLPRI; } - sdp_dbg_data(conn, "POLL: mask <%08x> flags <%08lx> <%d:%d:%d>", + sdp_dbg_data(conn, "POLL: mask <%08x> flags <%08lx> <%d:%d:%d>", mask, sock->flags, conn->send_buf, conn->send_qud, sdp_inet_writable(conn)); - + return mask; } @@ -1051,7 +1051,7 @@ static int sdp_inet_ioctl(struct socket /* * sdp_inet_setopt - set a socket option */ -static int sdp_inet_setopt(struct socket *sock, int level, int optname, +static int sdp_inet_setopt(struct socket *sock, int level, int optname, char __user *optval, int optlen) { struct sock *sk; @@ -1062,7 +1062,7 @@ static int sdp_inet_setopt(struct socket sk = sock->sk; conn = sdp_sk(sk); - sdp_dbg_ctrl(conn, "SETSOCKOPT: level <%d> option <%d>", + sdp_dbg_ctrl(conn, "SETSOCKOPT: level <%d> option <%d>", level, optname); if (SOL_TCP != level && SOL_SDP != level) @@ -1272,7 +1272,7 @@ static int sdp_inet_create(struct socket sdp_dbg_ctrl(NULL, "SOCKET: type <%d> proto <%d> state <%u:%08lx>", sock->type, protocol, 
sock->state, sock->flags); - + if (SOCK_STREAM != sock->type || (IPPROTO_IP != protocol && IPPROTO_TCP != protocol)) { sdp_dbg_warn(NULL, "SOCKET: unsupported type/proto. <%d:%d>", @@ -1366,9 +1366,9 @@ static int __init sdp_init(void) /* * buffer memory */ - result = sdp_buff_pool_init(buff_min, - buff_max, - alloc_inc, + result = sdp_buff_pool_init(buff_min, + buff_max, + alloc_inc, free_mark); if (result < 0) { sdp_warn("Error <%d> initializing buffer pool.", result); @@ -1378,7 +1378,7 @@ static int __init sdp_init(void) * connection table */ result = sdp_conn_table_init(proto_family, - conn_size, + conn_size, recv_post_max, recv_buff_max, send_post_max, Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_proto.h =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_proto.h (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_proto.h (working copy) @@ -93,7 +93,7 @@ struct sdpc_buff *sdp_buff_q_fetch(struc void *arg), void *usr_arg); -int sdp_buff_pool_init(int buff_min, +int sdp_buff_pool_init(int buff_min, int buff_max, int alloc_inc, int free_mark); @@ -263,7 +263,7 @@ struct sdp_sock *sdp_conn_table_lookup(s struct sdp_sock *sdp_conn_alloc(unsigned int priority); int sdp_conn_alloc_ib(struct sdp_sock *conn, - struct ib_device *device, + struct ib_device *device, u8 hw_port, u16 pkey); @@ -325,7 +325,7 @@ int sdp_send_ctrl_abort(struct sdp_sock int sdp_send_ctrl_send_sm(struct sdp_sock *conn); int sdp_send_ctrl_snk_avail(struct sdp_sock *conn, - u32 size, + u32 size, u32 rkey, u64 addr); @@ -369,14 +369,14 @@ int sdp_event_write(struct sdp_sock *con * DATA transport */ int sdp_inet_send(struct kiocb *iocb, - struct socket *sock, + struct socket *sock, struct msghdr *msg, size_t size); int sdp_inet_recv(struct kiocb *iocb, struct socket *sock, - struct msghdr *msg, - size_t size, + struct msghdr *msg, + size_t size, int flags); void sdp_iocb_q_cancel_all_read(struct sdp_sock *conn, 
ssize_t error); @@ -413,7 +413,7 @@ void sdp_link_addr_cleanup(void); /* * Event handling function, demultiplexed base on Message ID */ -typedef int (*sdp_event_cb_func)(struct sdp_sock *conn, +typedef int (*sdp_event_cb_func)(struct sdp_sock *conn, struct sdpc_buff *buff); /* @@ -562,7 +562,7 @@ static inline void sdp_conn_stat_dump(st #ifdef _SDP_CONN_STATS_REC int counter; - sdp_dbg_init("STAT: src <%u> snk <%u>", + sdp_dbg_init("STAT: src <%u> snk <%u>", conn->src_serv, conn->snk_serv); for (counter = 0; counter < 0x20; counter++) Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_send.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_send.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_send.c (working copy) @@ -82,7 +82,7 @@ static int sdp_send_buff_post(struct sdp * the flag. This allows for at least one pending urgent message * to send early notification. */ - if ((conn->flags & SDP_CONN_F_OOB_SEND) && + if ((conn->flags & SDP_CONN_F_OOB_SEND) && conn->oob_offset <= 0xFFFF) { SDP_BSDH_SET_OOB_PEND(buff->bsdh_hdr); SDP_BUFF_F_SET_SE(buff); @@ -138,7 +138,7 @@ static int sdp_send_buff_post(struct sdp result = ib_post_send(conn->qp, &send_param, &bad_wr); if (result) { - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "Error <%d> posting send. 
<%d:%d> <%d:%d:%d>", result, conn->s_wq_cur, conn->s_wq_size, sdp_buff_q_size(&conn->send_post), @@ -625,7 +625,7 @@ static int sdp_send_data_iocb_src(struct if (len > iocb->len) { sdp_dbg_warn(conn, "Data <%d:%d:%d> from IOCB <%d:%d>", len, pos, off, - iocb->page_count, + iocb->page_count, iocb->page_offset); result = -EFAULT; @@ -633,7 +633,7 @@ static int sdp_send_data_iocb_src(struct } local_irq_save(flags); - + addr = kmap_atomic(iocb->page_array[pos], KM_IRQ0); if (!addr) { result = -ENOMEM; @@ -711,7 +711,7 @@ static int sdp_send_iocb_buff_write(stru local_irq_restore(flags); break; } - + copy = min((PAGE_SIZE - offset), (unsigned long)(buff->end - buff->tail)); copy = min((unsigned long)iocb->len, copy); @@ -838,7 +838,7 @@ static int sdp_send_data_iocb(struct sdp } /* * If there are active sink IOCBs we want to stall, in the - * hope that a new sink advertisment will arrive, because + * hope that a new sink advertisment will arrive, because * sinks are more efficient. */ if (sdp_desc_q_size(&conn->w_snk) || @@ -925,7 +925,7 @@ static int sdp_send_data_queue_flush(str * (positive: no space; negative: error) */ while ((element = sdp_desc_q_look_head(&conn->send_queue))) { - + result = sdp_send_data_queue_test(conn, element); if (result) break; @@ -964,7 +964,7 @@ static int sdp_send_data_queue(struct sd result = sdp_send_ctrl_mode_ch(conn, SDP_MSG_MCH_PIPE_RECV); if (result < 0) { - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "Error <%d> posting mode change", result); goto done; @@ -1096,14 +1096,14 @@ static int sdp_send_ctrl_buff_flush(stru { struct sdpc_desc *element; int result = 0; - + /* * As long as there are buffers, try to post until a non-zero * result is generated. 
(positive: no space; negative: error) */ while ((element = sdp_desc_q_look_head(&conn->send_ctrl))) { - result = sdp_send_ctrl_buff_test(conn, + result = sdp_send_ctrl_buff_test(conn, (struct sdpc_buff *)element); if (result) break; @@ -1259,7 +1259,7 @@ int sdp_send_ctrl_disconnect(struct sdp_ */ if ((conn->flags & SDP_CONN_F_DIS_HOLD) || sdp_desc_q_size(&conn->send_queue) || - conn->src_sent) + conn->src_sent) conn->flags |= SDP_CONN_F_DIS_PEND; else result = do_send_ctrl_disconnect(conn); @@ -1685,7 +1685,7 @@ static int sdp_inet_write_cancel(struct sdp_dbg_ctrl(NULL, "Cancel Write IOCB user <%d> key <%d> flag <%08lx>", req->ki_users, req->ki_key, req->ki_flags); - + if (!si || !si->sock || !si->sock->sk) { sdp_warn("Cancel empty write IOCB users <%d> flags <%d:%08lx>", req->ki_users, req->ki_key, req->ki_flags); @@ -1714,7 +1714,7 @@ static int sdp_inet_write_cancel(struct * If active, then place it into the correct active queue */ sdp_desc_q_remove((struct sdpc_desc *)iocb); - + if (iocb->flags & SDP_IOCB_F_ACTIVE) { if (iocb->flags & SDP_IOCB_F_RDMA_W) sdp_desc_q_put_tail(&conn->w_snk, @@ -1936,7 +1936,7 @@ int sdp_inet_send(struct kiocb *req, str sdp_dbg_data(conn, "write IOCB <%d> addr <%p> user <%d> flag <%08lx>", req->ki_key, msg->msg_iov->iov_base, req->ki_users, req->ki_flags); - + sdp_conn_lock(conn); /* * ESTABLISED and CLOSE can send, while CONNECT and ACCEPTED can @@ -1983,7 +1983,7 @@ int sdp_inet_send(struct kiocb *req, str copy = min(copy, sdp_inet_write_space(conn, oob)); #ifndef _SDP_DATA_PATH_NULL - result = memcpy_fromiovec(buff->tail, + result = memcpy_fromiovec(buff->tail, msg->msg_iov, copy); if (result < 0) { @@ -2095,20 +2095,20 @@ skip: /* entry point for IOCB based tran iocb->req = req; iocb->key = req->ki_key; iocb->addr = (unsigned long)msg->msg_iov->iov_base - copied; - + req->ki_cancel = sdp_inet_write_cancel; result = sdp_iocb_lock(iocb); if (result < 0) { - sdp_dbg_warn(conn, "Error <%d> locking IOCB <%Zu:%d>", + 
sdp_dbg_warn(conn, "Error <%d> locking IOCB <%Zu:%d>", result, size, copied); - + sdp_iocb_destroy(iocb); break; } SDP_CONN_STAT_WQ_INC(conn, iocb->size); - + conn->send_pipe += iocb->len; result = sdp_send_data_queue(conn, (struct sdpc_desc *)iocb); Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_conn.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_conn.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_conn.c (working copy) @@ -106,7 +106,7 @@ void sdp_conn_abort(struct sdp_sock *con int error = -ECONNRESET; sdp_dbg_ctrl(conn, "Abort send. src <%08x:%04x> dst <%08x:%04x>", - conn->src_addr, conn->src_port, + conn->src_addr, conn->src_port, conn->dst_addr, conn->dst_port); switch (conn->state) { @@ -121,7 +121,7 @@ void sdp_conn_abort(struct sdp_sock *con case SDP_CONN_ST_DIS_SEND_2: case SDP_CONN_ST_DIS_SEND_1: /* - * don't touch control queue, diconnect message may + * don't touch control queue, diconnect message may * still be queued. */ sdp_desc_q_clear(&conn->send_queue); @@ -423,13 +423,13 @@ int sdp_inet_port_get(struct sdp_sock *c INADDR_ANY == look->src_addr || conn->src_addr == look->src_addr) { - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "port rejected. <%04x><%d:%d><%d:%d><%04x><%u:%u>", port, sk->sk_bound_dev_if, srch->sk_bound_dev_if, sk->sk_reuse, - srch->sk_reuse, + srch->sk_reuse, look->state, conn->src_addr, look->src_addr); @@ -619,7 +619,7 @@ done: return conn; } -/* +/* * Functions to cancel IOCB requests in a conenctions queues. 
*/ static int sdp_desc_q_cancel_lookup_func(struct sdpc_desc *element, void *arg) @@ -826,7 +826,7 @@ void sdp_conn_relock(struct sdp_sock *co if (1 == result_r) { result = sdp_cq_event_locked(&entry, conn); if (result < 0) - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "Error <%d> from event handler.", result); @@ -837,7 +837,7 @@ void sdp_conn_relock(struct sdp_sock *co if (1 == result_s) { result = sdp_cq_event_locked(&entry, conn); if (result < 0) - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "Error <%d> from event handler.", result); rearm = 1; @@ -850,17 +850,17 @@ void sdp_conn_relock(struct sdp_sock *co result = ib_req_notify_cq(conn->recv_cq, IB_CQ_NEXT_COMP); if (result) - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "Error <%d> rearming recv CQ", result); result = ib_req_notify_cq(conn->send_cq, IB_CQ_NEXT_COMP); if (result) - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "Error <%d> rearming send CQ", result); - + rearm = 0; } else break; /* exit CQ handler routine */ @@ -891,7 +891,7 @@ int sdp_conn_cq_drain(struct ib_cq *cq, result = ib_poll_cq(cq, 1, &entry); if (1 == result) { /* - * dispatch completion, and mark that the CQ needs + * dispatch completion, and mark that the CQ needs * to be armed. 
*/ result = sdp_cq_event_locked(&entry, conn); @@ -909,7 +909,7 @@ int sdp_conn_cq_drain(struct ib_cq *cq, if (rearm > 0) { result = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); if (result) - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "Error <%d> rearming CQ", result); rearm = 0; @@ -988,13 +988,13 @@ int sdp_conn_alloc_ib(struct sdp_sock *c result = -ENOMEM; goto error_attr; } - + init_attr = kmalloc(sizeof(*init_attr), GFP_KERNEL); if (!init_attr) { result = -ENOMEM; goto error_param; } - + memset(qp_attr, 0, sizeof(*qp_attr)); memset(init_attr, 0, sizeof(*init_attr)); /* @@ -1308,12 +1308,12 @@ error: "dst address:port src address:port ID comm_id pid " \ " dst guid src guid dlid slid dqpn " \ "sqpn data sent buff'd data rcvd_buff'd " \ - " data written data read src_serv snk_serv\n" + " data written data read src_serv snk_serv\n" #define SDP_PROC_CONN_MAIN_SEP \ "---------------- ---------------- ---- -------- ---- " \ "---------------- ---------------- ---- ---- ------ " \ "------ ---------------- ---------------- " \ - "---------------- ---------------- -------- --------\n" + "---------------- ---------------- -------- --------\n" #define SDP_PROC_CONN_MAIN_FORM \ "%02x.%02x.%02x.%02x:%04x %02x.%02x.%02x.%02x:%04x " \ "%04x %08x %04x %08x%08x %08x%08x %04x %04x " \ @@ -1322,7 +1322,7 @@ error: /* * sdp_proc_dump_conn_main - dump the connection table to /proc */ -int sdp_proc_dump_conn_main(char *buffer, int max_size, off_t start_index, +int sdp_proc_dump_conn_main(char *buffer, int max_size, off_t start_index, long *end_index) { struct sdp_sock *conn; @@ -1352,7 +1352,7 @@ int sdp_proc_dump_conn_main(char *buffer /* * loop across connections. 
*/ - for (counter = start_index; + for (counter = start_index; counter < dev_root_s.sk_size && !(SDP_CONN_PROC_MAIN_SIZE > (max_size - offset)); counter++) { @@ -1374,7 +1374,7 @@ int sdp_proc_dump_conn_main(char *buffer ((conn->src_addr >> 8) & 0xff), ((conn->src_addr >> 16) & 0xff), ((conn->src_addr >> 24) & 0xff), - conn->src_port, + conn->src_port, conn->hashent, conn->cm_id ? conn->cm_id->local_id : 0, conn->pid, @@ -1771,7 +1771,7 @@ static void sdp_device_init_one(struct i /* * port allocation */ - for (port_count = 0; + for (port_count = 0; port_count < device->phys_port_cnt; port_count++) { port = kmalloc(sizeof *port, GFP_KERNEL); @@ -1788,8 +1788,8 @@ static void sdp_device_init_one(struct i port->index = port_count + 1; list_add(&port->list, &hca->port_list); - result = ib_query_gid(hca->ca, - port->index, + result = ib_query_gid(hca->ca, + port->index, 0, /* index */ &port->gid); if (result) { @@ -1836,7 +1836,7 @@ static void sdp_device_remove_one(struct sdp_warn("Device <%s> has no HCA info.", device->name); return; } - + list_for_each_entry_safe(port, tmp, &hca->port_list, list) { list_del(&port->list); kfree(port); @@ -1890,7 +1890,7 @@ int sdp_conn_table_init(int proto_family dev_root_s.recv_buff_max = recv_buff_max; dev_root_s.send_post_max = send_post_max; dev_root_s.send_buff_max = send_buff_max; - + dev_root_s.send_usig_max = send_usig_max; /* * Get HCA/port list Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_actv.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_actv.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_actv.c (working copy) @@ -252,7 +252,7 @@ static int sdp_cm_hello_ack_check(struct hello_ack->bsdh.seq_ack); sdp_dbg_ctrl(NULL, "Hello Ack HAH <%02x:%02x:%08x>", hello_ack->hah.max_adv, - hello_ack->hah.version, + hello_ack->hah.version, hello_ack->hah.l_rcv_size); return 0; /* success */ @@ -354,7 +354,7 @@ static void sdp_cm_path_complete(u64 
id, */ if (id != conn->plid) { sdp_dbg_warn(conn, "Path record ID mismatch <%016llx:%016llx>", - (unsigned long long)id, + (unsigned long long)id, (unsigned long long)conn->plid); goto done; } @@ -530,7 +530,7 @@ int sdp_cm_connect(struct sdp_sock *conn */ sdp_conn_hold(conn); /* address resolution reference */ sdp_conn_unlock(conn); - + result = sdp_link_path_lookup(htonl(conn->dst_addr), htonl(conn->src_addr), sk_sdp(conn)->sk_bound_dev_if, Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_advt.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_advt.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_advt.c (working copy) @@ -102,7 +102,7 @@ struct sdpc_advt *sdp_advt_q_look(struct { if (list_empty(&table->head)) return NULL; - + return list_entry(table->head.next, struct sdpc_advt, list); } Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_recv.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_recv.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_recv.c (working copy) @@ -278,7 +278,7 @@ static int sdp_post_rdma_iocb_src(struct * if there is no more iocb space queue the it for completion */ if (!iocb->len) - sdp_desc_q_put_tail(&conn->r_src, + sdp_desc_q_put_tail(&conn->r_src, (struct sdpc_desc *) sdp_iocb_q_get_head(&conn->r_pend)); @@ -487,7 +487,7 @@ int sdp_recv_flush(struct sdp_sock *conn sdp_buff_q_size(&conn->recv_pool))); counter -= conn->l_recv_bf; - counter = min(counter, + counter = min(counter, ((s32)conn->recv_cq_size - (s32)conn->l_recv_bf)); while (counter-- > 0) { @@ -648,7 +648,7 @@ static int sdp_recv_buff_iocb_active(str result = sdp_read_buff_iocb(iocb, buff); if (result < 0) { sdp_dbg_warn(conn, "Error <%d> data copy <%d:%u> to IOCB", - result, iocb->len, + result, iocb->len, (unsigned)(buff->tail - buff->data)); sdp_iocb_q_put_head(&conn->r_snk, iocb); @@ -692,7 +692,7 @@ 
static int sdp_recv_buff_iocb_pending(st result = sdp_read_buff_iocb(iocb, buff); if (result < 0) { sdp_dbg_warn(conn, "Error <%d> data copy <%d:%u> to IOCB", - result, iocb->len, + result, iocb->len, (unsigned)(buff->tail - buff->data)); return result; } @@ -790,7 +790,7 @@ int sdp_recv_buff(struct sdp_sock *conn, break; if (result < 0) { - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "Error <%d> processing IOCB. <%d:%d:%d>", result, conn->snk_sent, sdp_iocb_q_size(&conn->r_pend), @@ -841,7 +841,7 @@ static int sdp_inet_read_cancel(struct k sdp_dbg_ctrl(NULL, "Cancel Read IOCB. user <%d> key <%d> flag <%08lx>", req->ki_users, req->ki_key, req->ki_flags); - + if (!si || !si->sock || !si->sock->sk) { sdp_warn("Cancel empty read IOCB. users <%d> flags <%d:%08lx>", req->ki_users, req->ki_key, req->ki_flags); @@ -887,7 +887,7 @@ static int sdp_inet_read_cancel(struct k result = 0; } - + goto unlock; } @@ -926,10 +926,10 @@ static int sdp_inet_read_cancel(struct k * source probably will get cancel requests as well. 
*/ if (!(conn->flags & SDP_CONN_F_SNK_CANCEL)) { - + result = sdp_send_ctrl_snk_cancel(conn); SDP_EXPECT(result >= 0); - + conn->flags |= SDP_CONN_F_SNK_CANCEL; } @@ -946,7 +946,7 @@ static int sdp_inet_read_cancel(struct k req->ki_users, req->ki_key, req->ki_flags); result = -EAGAIN; - + unlock: sdp_conn_unlock(conn); done: @@ -1030,7 +1030,7 @@ static int sdp_inet_recv_urg(struct sock if (!(flags & MSG_PEEK)) { conn->rcv_urg_cnt -= 1; conn->byte_strm -= 1; - + SDP_CONN_STAT_RECV_INC(conn, 1); /* * we've potentially emptied a buffer, if @@ -1057,7 +1057,7 @@ done: /* * sdp_inet_recv - recv data from the network to user space */ -int sdp_inet_recv(struct kiocb *req, struct socket *sock, struct msghdr *msg, +int sdp_inet_recv(struct kiocb *req, struct socket *sock, struct msghdr *msg, size_t size, int flags) { struct sock *sk; @@ -1084,7 +1084,7 @@ int sdp_inet_recv(struct kiocb *req, st sdp_dbg_data(conn, "state <%08x> size <%Zu> pending <%d> falgs <%08x>", conn->state, size, conn->byte_strm, flags); sdp_dbg_data(conn, "read IOCB <%d> addr <%p> users <%d> flags <%08lx>", - req->ki_key, msg->msg_iov->iov_base, + req->ki_key, msg->msg_iov->iov_base, req->ki_users, req->ki_flags); /* @@ -1239,7 +1239,7 @@ int sdp_inet_recv(struct kiocb *req, st } } /* - * urgent data needs to break up the data stream, regardless + * urgent data needs to break up the data stream, regardless * of low water mark, or whether there is room in the buffer. */ if (oob > 0) { @@ -1282,7 +1282,7 @@ int sdp_inet_recv(struct kiocb *req, st result = (copied > 0) ? 
0 : sock_error(sk); break; } - + if (RCV_SHUTDOWN & conn->shutdown) { result = 0; break; @@ -1332,7 +1332,7 @@ int sdp_inet_recv(struct kiocb *req, st */ iocb = sdp_iocb_create(); if (!iocb) { - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "Error allocating IOCB <%Zu:%d>", size, copied); result = -ENOMEM; @@ -1351,10 +1351,10 @@ int sdp_inet_recv(struct kiocb *req, st result = sdp_iocb_lock(iocb); if (result < 0) { - sdp_dbg_warn(conn, - "Error <%d> IOCB lock <%Zu:%d>", + sdp_dbg_warn(conn, + "Error <%d> IOCB lock <%Zu:%d>", result, size, copied); - + sdp_iocb_destroy(iocb); break; } @@ -1362,11 +1362,11 @@ int sdp_inet_recv(struct kiocb *req, st SDP_CONN_STAT_RQ_INC(conn, iocb->size); sdp_iocb_q_put_tail(&conn->r_pend, iocb); - + ack = 1; copied = 0; /* copied amount was saved in IOCB. */ result = -EIOCBQUEUED; - + break; } } Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_conn.h =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_conn.h (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_conn.h (working copy) @@ -488,7 +488,7 @@ static inline void sdp_conn_put_light(st void sdp_conn_put(struct sdp_sock *conn); -static inline void *hashent_arg(s32 hashent) +static inline void *hashent_arg(s32 hashent) { return (void *)(unsigned long)hashent; } Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_proc.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_proc.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_proc.c (working copy) @@ -54,7 +54,7 @@ static int sdp_proc_read_parse(char *pag #if 0 if (!*start && offset) { - return 0; /* I'm not sure why this always gets + return 0; /* I'm not sure why this always gets called twice... 
*/ } #endif Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_pass.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_pass.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_pass.c (working copy) @@ -46,7 +46,7 @@ int sdp_cm_pass_establish(struct sdp_soc int result; sdp_dbg_ctrl(conn, "Passive Establish src <%08x:%04x> dst <%08x:%04x>", - conn->src_addr, conn->src_port, + conn->src_addr, conn->src_port, conn->dst_addr, conn->dst_port); /* * free hello ack message @@ -162,7 +162,7 @@ static int sdp_cm_accept(struct sdp_sock */ sdp_buff_q_put_tail(&conn->send_post, buff); /* - * modify QP. INIT->RTR + * modify QP. INIT->RTR */ qp_attr = kmalloc(sizeof(*qp_attr), GFP_KERNEL); if (!qp_attr) { @@ -174,7 +174,7 @@ static int sdp_cm_accept(struct sdp_sock memset(qp_attr, 0, sizeof(*qp_attr)); qp_attr->qp_state = IB_QPS_RTR; - + result = ib_cm_init_qp_attr(conn->cm_id, qp_attr, &qp_mask); if (result) { sdp_dbg_warn(conn, "Error <%d> QP attributes for RTR", @@ -188,7 +188,7 @@ static int sdp_cm_accept(struct sdp_sock result = ib_modify_qp(conn->qp, qp_attr, qp_mask); kfree(qp_attr); - + if (result) { sdp_dbg_warn(conn, "Error <%d> modifying QP to RTR.", result); goto error; @@ -343,7 +343,7 @@ static int sdp_cm_hello_check(struct sdp msg_hello->hh.port, msg_hello->hh.src.ipv4.addr, msg_hello->hh.dst.ipv4.addr); - + return 0; /* success */ } @@ -357,10 +357,10 @@ int sdp_cm_req_handler(struct ib_cm_id * u16 port; u32 addr; - sdp_dbg_ctrl(NULL, + sdp_dbg_ctrl(NULL, "CM REQ. comm <%08x> SID <%016llx> ca <%s> port <%d>", cm_id->local_id, (unsigned long long)cm_id->service_id, - event->param.req_rcvd.device->name, + event->param.req_rcvd.device->name, event->param.req_rcvd.port); /* * check Hello Header, to determine if we want the connection. 
@@ -378,7 +378,7 @@ int sdp_cm_req_handler(struct ib_cm_id * * first find a listening connection, and check backlog */ result = -ECONNREFUSED; - + listen_conn = sdp_inet_listen_lookup(addr, port); if (!listen_conn) { /* @@ -395,7 +395,7 @@ int sdp_cm_req_handler(struct ib_cm_id * goto done; if (listen_conn->backlog_cnt > listen_conn->backlog_max) { - sdp_dbg_ctrl(listen_conn, + sdp_dbg_ctrl(listen_conn, "Listen backlog <%d> too big to accept new conn", listen_conn->backlog_cnt); goto done; @@ -437,7 +437,7 @@ int sdp_cm_req_handler(struct ib_cm_id * conn->send_size = min((u16)sdp_buff_pool_buff_size(), (u16)conn->send_size) - SDP_MSG_HDR_SIZE; - memcpy(&conn->d_gid, + memcpy(&conn->d_gid, &event->param.req_rcvd.remote_ca_guid, sizeof(conn->d_gid)); /* @@ -448,7 +448,7 @@ int sdp_cm_req_handler(struct ib_cm_id * /* * associate connection with a hca/port, and allocate IB. */ - result = sdp_conn_alloc_ib(conn, + result = sdp_conn_alloc_ib(conn, event->param.req_rcvd.device, event->param.req_rcvd.port, event->param.req_rcvd.primary_path->pkey); @@ -502,7 +502,7 @@ done: sdp_conn_put(listen_conn); /* ListenLookup reference. 
*/ empty: (void)ib_send_cm_rej(cm_id, - IB_CM_REJ_CONSUMER_DEFINED, + IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); return result; } Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_proc.h =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_proc.h (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_proc.h (working copy) @@ -64,7 +64,7 @@ struct sdpc_proc_ent { struct proc_dir_entry *entry; int (*read)(char *buffer, int max_size, - off_t start, + off_t start, long *end); }; Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_sent.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_sent.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_sent.c (working copy) @@ -162,7 +162,7 @@ int sdp_event_send(struct sdp_sock *conn /* * error */ - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "Send wrid mismatch. <%llu:%llu:%d>", (unsigned long long)comp->wr_id, (unsigned long long)buff->wrid, @@ -256,7 +256,7 @@ int sdp_event_send(struct sdp_sock *conn sdp_buff_pool_chain_put(head, free_count); if (free_count <= 0 || conn->send_usig < 0) { - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "Send processing mismatch. 
<%llu:%llu:%d:%d>", (unsigned long long)comp->wr_id, (unsigned long long)current_wrid, Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_iocb.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_iocb.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_iocb.c (working copy) @@ -58,17 +58,17 @@ static void do_iocb_unlock(struct sdpc_i iocb->addr, iocb->size); while (vma) { - sdp_dbg_data(NULL, + sdp_dbg_data(NULL, "unmark <%lx> <%p> <%08lx:%08lx> <%08lx> <%ld>", iocb->addr, vma, vma->vm_start, vma->vm_end, vma->vm_flags, (long)vma->vm_private_data); - + spin_lock(&iocb->mm->page_table_lock); /* * if there are no more references to the vma */ vma->vm_private_data--; - + if (!vma->vm_private_data) { /* * modify VM flags. @@ -78,7 +78,7 @@ static void do_iocb_unlock(struct sdpc_i * adjust locked page count */ vma->vm_mm->locked_vm -= ((vma->vm_end - - vma->vm_start) >> + vma->vm_start) >> PAGE_SHIFT); } @@ -107,7 +107,7 @@ void sdp_iocb_unlock(struct sdpc_iocb *i * spin lock since this could be from interrupt context. 
*/ down_write(&iocb->mm->mmap_sem); - + do_iocb_unlock(iocb); up_write(&iocb->mm->mmap_sem); @@ -152,7 +152,7 @@ static int sdp_iocb_page_save(struct sdp if (!iocb->addr_array) goto err_addr; - iocb->page_array = kmalloc((sizeof(struct page *) * iocb->page_count), + iocb->page_array = kmalloc((sizeof(struct page *) * iocb->page_count), GFP_KERNEL); if (!iocb->page_array) goto err_page; @@ -182,13 +182,13 @@ static int sdp_iocb_page_save(struct sdp pud = pud_offset(pgd, addr); if (!pud || pud_none(*pud)) break; - + pmd = pmd_offset(pud, addr); if (!pmd || pmd_none(*pmd)) break; ptep = pte_offset_map(pmd, addr); - if (!ptep) + if (!ptep) break; pte = *ptep; @@ -200,7 +200,7 @@ static int sdp_iocb_page_save(struct sdp pfn = pte_pfn(pte); if (!pfn_valid(pfn)) break; - + page = pfn_to_page(pfn); iocb->page_array[counter] = page; @@ -208,7 +208,7 @@ static int sdp_iocb_page_save(struct sdp } spin_unlock(&iocb->mm->page_table_lock); - + if (size > 0) { result = -EFAULT; goto err_find; @@ -216,7 +216,7 @@ static int sdp_iocb_page_save(struct sdp return 0; err_find: - + kfree(iocb->page_array); iocb->page_array = NULL; err_page: @@ -239,7 +239,7 @@ int sdp_iocb_lock(struct sdpc_iocb *iocb int result = -ENOMEM; unsigned long addr; size_t size; - + /* * mark IOCB as locked. We do not take a reference on the mm, AIO * handles this for us. @@ -251,7 +251,7 @@ int sdp_iocb_lock(struct sdpc_iocb *iocb */ real_cap = cap_t(current->cap_effective); cap_raise(current->cap_effective, CAP_IPC_LOCK); - + size = PAGE_ALIGN(iocb->size + (iocb->addr & ~PAGE_MASK)); addr = iocb->addr & PAGE_MASK; @@ -271,13 +271,13 @@ int sdp_iocb_lock(struct sdpc_iocb *iocb */ if (result) { sdp_dbg_err("VMA lock <%lx:%Zu> error <%d> <%d:%lu:%lu>", - iocb->addr, iocb->size, result, + iocb->addr, iocb->size, result, iocb->page_count, iocb->mm->locked_vm, limit); goto err_lock; } /* * look up the head of the vma queue, loop through the vmas, marking - * them do not copy, reference counting, and saving them. 
+ * them do not copy, reference counting, and saving them. */ vma = find_vma(iocb->mm, addr); if (!vma) @@ -296,13 +296,13 @@ int sdp_iocb_lock(struct sdpc_iocb *iocb if (PAGE_SIZE < (unsigned long)vma->vm_private_data) sdp_dbg_err("VMA: private daya in use! <%08lx>", (unsigned long)vma->vm_private_data); - + vma->vm_flags |= VM_DONTCOPY; vma->vm_private_data++; spin_unlock(&iocb->mm->page_table_lock); - sdp_dbg_data(NULL, + sdp_dbg_data(NULL, "mark <%lx> <0x%p> <%08lx:%08lx> <%08lx> <%ld>", iocb->addr, vma, vma->vm_start, vma->vm_end, vma->vm_flags, (long)vma->vm_private_data); @@ -315,7 +315,7 @@ int sdp_iocb_lock(struct sdpc_iocb *iocb result = sdp_iocb_page_save(iocb); if (result) { - sdp_dbg_err("Error <%d> saving pages for IOCB <%lx:%Zu>", + sdp_dbg_err("Error <%d> saving pages for IOCB <%lx:%Zu>", result, iocb->addr, iocb->size); goto err_save; } @@ -362,9 +362,9 @@ static int sdp_mem_lock_init(void) struct kallsym_iter *iter; loff_t pos = 0; int ret = -EINVAL; - + sdp_dbg_init("Memory Locking initialization."); - + kallsyms = filp_open("/proc/kallsyms", O_RDONLY, 0); if (!kallsyms) { sdp_warn("Failed to open /proc/kallsyms"); @@ -444,7 +444,7 @@ int sdp_iocb_register(struct sdpc_iocb * iocb->page_offset); goto error; } - + iocb->l_key = iocb->mem->fmr->lkey; iocb->r_key = iocb->mem->fmr->rkey; /* @@ -501,10 +501,10 @@ static void do_iocb_complete(void *arg) value = (iocb->post > 0) ? iocb->post : iocb->status; sdp_dbg_data(NULL, "IOCB complete. <%d:%d:%08lx> value <%ld>", - iocb->req->ki_users, iocb->req->ki_key, + iocb->req->ki_users, iocb->req->ki_key, iocb->req->ki_flags, value); /* - * valid result can be 0 or 1 for complete so + * valid result can be 0 or 1 for complete so * we ignore the value. 
*/ (void)aio_complete(iocb->req, value, 0); @@ -520,7 +520,7 @@ static void do_iocb_complete(void *arg) void sdp_iocb_complete(struct sdpc_iocb *iocb, ssize_t status) { iocb->status = status; - + if (in_atomic() || irqs_disabled()) { INIT_WORK(&iocb->completion, do_iocb_complete, (void *)iocb); schedule_work(&iocb->completion); @@ -684,7 +684,7 @@ static struct sdpc_iocb *sdp_iocb_q_get( /* * sdp_iocb_q_put - put the IOCB object at the tables tail */ -static void sdp_iocb_q_put(struct sdpc_iocb_q *table, +static void sdp_iocb_q_put(struct sdpc_iocb_q *table, struct sdpc_iocb *iocb, int head) { Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_event.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_event.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_event.c (working copy) @@ -48,7 +48,7 @@ int sdp_cq_event_locked(struct ib_wc *co if (SDP_ST_MASK_CLOSED & conn->state) { /* - * Ignore events in closed state, connection is being + * Ignore events in closed state, connection is being * terminated, connection cleanup will take care of freeing * posted buffers. */ @@ -268,7 +268,7 @@ static int sdp_cm_established(struct ib_ */ result = ib_send_cm_dreq(conn->cm_id, NULL, 0); if (result) { - sdp_dbg_warn(conn, "Error <%d> sending CM DREQ", + sdp_dbg_warn(conn, "Error <%d> sending CM DREQ", result); goto error; } @@ -357,7 +357,7 @@ static int sdp_cm_timewait(struct ib_cm_ */ case SDP_CONN_ST_ESTABLISHED: /* - * Change state, so we only need to wait for the abort + * Change state, so we only need to wait for the abort * callback, and idle. Call the abort callback. 
*/ SDP_CONN_ST_SET(conn, SDP_CONN_ST_TIME_WAIT_2); @@ -394,7 +394,7 @@ int sdp_cm_event_handler(struct ib_cm_id sdp_conn_lock(conn); else if (cm_id->state != IB_CM_REQ_RCVD) { - sdp_dbg_warn(NULL, + sdp_dbg_warn(NULL, "No conn <%d> CM state <%d> event <%d>", hashent, cm_id->state, event->event); return -EINVAL; @@ -430,7 +430,7 @@ int sdp_cm_event_handler(struct ib_cm_id */ if (conn) { if (result < 0 && event->event != IB_CM_TIMEWAIT_EXIT) { - sdp_dbg_warn(conn, + sdp_dbg_warn(conn, "CM state <%d> event <%d> error <%d>", cm_id->state, event->event, result); /* Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_buff.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_buff.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_buff.c (working copy) @@ -125,7 +125,7 @@ static inline struct sdpc_buff *sdp_buff /* * do_buff_q_remove - remove a specific buffer from a specific pool */ -static inline void do_buff_q_remove(struct sdpc_buff_q *pool, +static inline void do_buff_q_remove(struct sdpc_buff_q *pool, struct sdpc_buff *buff) { struct sdpc_buff *prev; @@ -382,7 +382,7 @@ static int sdp_buff_pool_alloc(struct sd kmem_cache_free(m_pool->buff_cache, buff); break; } - + buff->end = buff->head + PAGE_SIZE; buff->data = buff->head; buff->tail = buff->head; @@ -513,7 +513,7 @@ void sdp_buff_pool_destroy(void) * Sanity check that the current number of buffers was released. */ if (main_pool->buff_cur) - sdp_warn("Leaking buffers during cleanup. <%d>", + sdp_warn("Leaking buffers during cleanup. 
<%d>", main_pool->buff_cur); /* * free pool cache @@ -719,7 +719,7 @@ int sdp_proc_dump_buff_pool(char *buffer spin_lock_irqsave(&main_pool->lock, flags); if (!start_index) { - offset += sprintf((buffer + offset), + offset += sprintf((buffer + offset), " buffer size: %8d\n", main_pool->buff_size); offset += sprintf((buffer + offset), Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_queue.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_queue.c (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_queue.c (working copy) @@ -144,7 +144,7 @@ void sdp_desc_q_remove(struct sdpc_desc /* * sdp_desc_q_lookup - search and return an element from the table */ -struct sdpc_desc *sdp_desc_q_lookup(struct sdpc_desc_q *table, +struct sdpc_desc *sdp_desc_q_lookup(struct sdpc_desc_q *table, int (*lookup)(struct sdpc_desc *element, void *arg), void *arg) @@ -214,7 +214,7 @@ int sdp_desc_q_type_head(struct sdpc_des /* * sdp_desc_q_look_type_head - look at a specific object */ -struct sdpc_desc *sdp_desc_q_look_type_head(struct sdpc_desc_q *table, +struct sdpc_desc *sdp_desc_q_look_type_head(struct sdpc_desc_q *table, enum sdp_desc_type type) { if (!table->head) Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_buff.h =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_buff.h (revision 3032) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_buff.h (working copy) @@ -69,7 +69,7 @@ struct sdpc_buff { u32 data_size; /* size of just data in the buffer */ u64 wrid; /* IB work request ID */ /* - * IB specific data (The main buffer pool sets the lkey when + * IB specific data (The main buffer pool sets the lkey when * it is created) */ struct ib_sge sge; -- MST From mst at mellanox.co.il Tue Aug 9 05:46:31 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 9 Aug 2005 15:46:31 +0300 Subject: [openib-general] Re: sdp: cant unload ib_ipoib module In-Reply-To: <1123539115.4402.39.camel@hal.voltaire.com> References: <1123007814.2946.12.camel@duffman> <20050803070923.GG15300@mellanox.co.il> <20050804104250.A30741@topspin.com> <20050807154632.GG15300@mellanox.co.il> <1123539115.4402.39.camel@hal.voltaire.com> Message-ID: <20050809124631.GG32419@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: sdp: cant unload ib_ipoib module > > With the patch below I can unload ipoib after running sdp tests. > > > > Hal, could you verify that this patch helps you, too? > > Yes it does. I see Tom checked it in. ip_rt_put now looks right, but it looks like device_put is still done too early. It wont hurt you unless there's a hotplug event or ipoib is unloaded before sdp, and its not a new bug. I'll add this to my TODO list. -- MST From mst at mellanox.co.il Tue Aug 9 06:22:54 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 9 Aug 2005 16:22:54 +0300 Subject: [openib-general] [PATCH applied] sdp: replace mlock with get_user_pages Message-ID: <20050809132254.GH32419@mellanox.co.il> OK, this was posted a couple of times already with no objections. Checked in. --- The following patch replaces the mlock hack with call to get_user_pages. Since the application could have forked while an iocb is outstanding, when an iocb is done I do get_user_pages for a second time and copy data if the physical address has changed. Thus, changing ulimit is no longer required to get aio working, processes are also allowed to fork and to call mlock/munlock on the buffer. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_iocb.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_iocb.c 2005-08-09 15:42:04.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_iocb.c 2005-08-09 16:06:13.000000000 +0300 @@ -33,377 +33,182 @@ * $Id: sdp_iocb.c 3033 2005-08-09 12:45:08Z mst $ */ +#include #include "sdp_main.h" static kmem_cache_t *sdp_iocb_cache = NULL; -/* - * memory locking functions - */ -#include - -typedef int (*do_mlock_ptr_t)(unsigned long, size_t, int); -static do_mlock_ptr_t mlock_ptr = NULL; - -/* - * do_iocb_unlock - unlock the memory for an IOCB - */ -static void do_iocb_unlock(struct sdpc_iocb *iocb) -{ - struct vm_area_struct *vma; - - vma = find_vma(iocb->mm, (iocb->addr & PAGE_MASK)); - if (!vma) - sdp_warn("No VMA for IOCB <%lx:%Zu> unlock", - iocb->addr, iocb->size); - - while (vma) { - sdp_dbg_data(NULL, - "unmark <%lx> <%p> <%08lx:%08lx> <%08lx> <%ld>", - iocb->addr, vma, vma->vm_start, vma->vm_end, - vma->vm_flags, (long)vma->vm_private_data); - - spin_lock(&iocb->mm->page_table_lock); - /* - * if there are no more references to the vma - */ - vma->vm_private_data--; - - if (!vma->vm_private_data) { - /* - * modify VM flags. 
- */ - vma->vm_flags &= ~(VM_DONTCOPY|VM_LOCKED); - /* - * adjust locked page count - */ - vma->vm_mm->locked_vm -= ((vma->vm_end - - vma->vm_start) >> - PAGE_SHIFT); - } - - spin_unlock(&iocb->mm->page_table_lock); - /* - * continue if the buffer continues onto the next vma - */ - if ((iocb->addr + iocb->size) > vma->vm_end) - vma = vma->vm_next; - else - vma = NULL; - } +static void sdp_copy_one_page(struct page *from, struct page* to, + unsigned long iocb_addr, size_t iocb_size, + unsigned long uaddr) +{ + size_t size_left = iocb_addr + iocb_size - uaddr; + size_t size = min(size_left,PAGE_SIZE); + unsigned long offset = uaddr % PAGE_SIZE; + unsigned long flags; + + void* fptr; + void* tptr; + + local_irq_save(flags); + fptr = kmap_atomic(from, KM_IRQ0); + tptr = kmap_atomic(to, KM_IRQ1); + + memcpy(tptr + offset, fptr + offset, size); + + kunmap_atomic(tptr, KM_IRQ1); + kunmap_atomic(fptr, KM_IRQ0); + local_irq_restore(flags); + set_page_dirty_lock(to); } /* * sdp_iocb_unlock - unlock the memory for an IOCB + * Copy if pages moved since. + * TODO: is this needed? */ void sdp_iocb_unlock(struct sdpc_iocb *iocb) { - /* - * check if IOCB is locked. - */ + int result; + struct page ** pages = NULL; + unsigned long uaddr; + int i; + if (!(iocb->flags & SDP_IOCB_F_LOCKED)) return; - /* - * spin lock since this could be from interrupt context. - */ - down_write(&iocb->mm->mmap_sem); - - do_iocb_unlock(iocb); - - up_write(&iocb->mm->mmap_sem); + /* For read, unlock and we are done */ + if (!(iocb->flags & SDP_IOCB_F_RECV)) { + for (i = 0;i < iocb->page_count; ++i) + put_page(iocb->page_array[i]); + goto done; + } + + /* For write, we must check the virtual pages did not get remapped */ + + /* As an optimisation (to avoid scanning the vma tree each time), + * try to get all pages in one go. */ + /* TODO: use cache for allocations? Allocate by chunks? 
*/ + + pages = kmalloc((sizeof(struct page *) * iocb->page_count), GFP_KERNEL); + down_read(&iocb->mm->mmap_sem); + if (pages) { + result = get_user_pages(iocb->tsk, iocb->mm, iocb->addr, + iocb->page_count, 1, 0, pages, NULL); + if (result != iocb->page_count) { + kfree(pages); + pages = NULL; + } + } + for (i = 0, uaddr = iocb->addr; i < iocb->page_count; + ++i, uaddr = (uaddr & PAGE_MASK) + PAGE_SIZE) + { + struct page* page; + set_page_dirty_lock(iocb->page_array[i]); + + if (pages) + page = pages[i]; + else { + result = get_user_pages(iocb->tsk, iocb->mm, + uaddr & PAGE_MASK, + 1 , 1, 0, &page, NULL); + if (result != 1) { + page = NULL; + } + } + if (page && iocb->page_array[i] != page) + sdp_copy_one_page(iocb->page_array[i], page, + iocb->addr, iocb->size, uaddr); + if (page) + put_page(page); + put_page(iocb->page_array[i]); + } + up_read(&iocb->mm->mmap_sem); + if (pages) + kfree(pages); + +done: kfree(iocb->page_array); - kfree(iocb->addr_array); + kfree(iocb->addr_array); iocb->page_array = NULL; - iocb->addr_array = NULL; + iocb->addr_array = NULL; iocb->mm = NULL; - /* - * mark IOCB unlocked. - */ + iocb->tsk = NULL; iocb->flags &= ~SDP_IOCB_F_LOCKED; } /* - * sdp_iocb_page_save - save page information for an IOCB + * sdp_iocb_lock - lock the memory for an IOCB + * We do not take a reference on the mm, AIO handles this for us. 
*/ -static int sdp_iocb_page_save(struct sdpc_iocb *iocb) +int sdp_iocb_lock(struct sdpc_iocb *iocb) { - unsigned int counter; + int result = -ENOMEM; unsigned long addr; size_t size; - int result = -ENOMEM; - struct page *page; - unsigned long pfn; - pgd_t *pgd; - pud_t *pud; - pmd_t *pmd; - pte_t *ptep; - pte_t pte; + int i; - if (iocb->page_count <= 0 || iocb->size <= 0 || !iocb->addr) - return -EINVAL; - /* + /* + * iocb->addr - buffer start address + * iocb->size - buffer length + * addr - page aligned + * size - page multiple + */ + addr = iocb->addr & PAGE_MASK; + size = PAGE_ALIGN(iocb->size + (iocb->addr & ~PAGE_MASK)); + + iocb->page_offset = iocb->addr - addr; + + iocb->page_count = size >> PAGE_SHIFT; + /* * create array to hold page value which are later needed to register * the buffer with the HCA - */ + */ + + /* TODO: use cache for allocations? Allocate by chunks? */ iocb->addr_array = kmalloc((sizeof(u64) * iocb->page_count), GFP_KERNEL); if (!iocb->addr_array) goto err_addr; - + iocb->page_array = kmalloc((sizeof(struct page *) * iocb->page_count), GFP_KERNEL); if (!iocb->page_array) goto err_page; - /* - * iocb->addr - buffer start address - * iocb->size - buffer length - * addr - page aligned - * size - page multiple - */ - addr = iocb->addr & PAGE_MASK; - size = PAGE_ALIGN(iocb->size + (iocb->addr & ~PAGE_MASK)); - - iocb->page_offset = iocb->addr - addr; - /* - * Find pages used within the buffer which will then be registered - * for RDMA - */ - spin_lock(&iocb->mm->page_table_lock); - - for (counter = 0; - size > 0; - counter++, addr += PAGE_SIZE, size -= PAGE_SIZE) { - pgd = pgd_offset_gate(iocb->mm, addr); - if (!pgd || pgd_none(*pgd)) - break; - - pud = pud_offset(pgd, addr); - if (!pud || pud_none(*pud)) - break; - - pmd = pmd_offset(pud, addr); - if (!pmd || pmd_none(*pmd)) - break; - - ptep = pte_offset_map(pmd, addr); - if (!ptep) - break; - - pte = *ptep; - pte_unmap(ptep); - - if (!pte_present(pte)) - break; - - pfn = pte_pfn(pte); - 
if (!pfn_valid(pfn)) - break; - - page = pfn_to_page(pfn); - - iocb->page_array[counter] = page; - iocb->addr_array[counter] = page_to_phys(page); - } - - spin_unlock(&iocb->mm->page_table_lock); - - if (size > 0) { - result = -EFAULT; - goto err_find; + + down_read(&current->mm->mmap_sem); + + result = get_user_pages(current, current->mm, + iocb->addr, iocb->page_count, + !!(iocb->flags & SDP_IOCB_F_RECV), 0, + iocb->page_array, NULL); + + up_read(&current->mm->mmap_sem); + + if (result != iocb->page_count) { + sdp_dbg_err("unable to lock <%lx:%Zu> error <%d> <%d>", + iocb->addr, iocb->size, result, iocb->page_count); + goto err_get; } - - return 0; -err_find: - + + iocb->flags |= SDP_IOCB_F_LOCKED; + iocb->mm = current->mm; + iocb->tsk = current; + + + for (i = 0; i< iocb->page_count; ++i) { + iocb->addr_array[i] = page_to_phys(iocb->page_array[i]); + } + + return 0; + +err_get: kfree(iocb->page_array); - iocb->page_array = NULL; err_page: - kfree(iocb->addr_array); - iocb->addr_array = NULL; err_addr: - - return result; -} - -/* - * sdp_iocb_lock - lock the memory for an IOCB - */ -int sdp_iocb_lock(struct sdpc_iocb *iocb) -{ - struct vm_area_struct *vma; - kernel_cap_t real_cap; - unsigned long limit; - int result = -ENOMEM; - unsigned long addr; - size_t size; - - /* - * mark IOCB as locked. We do not take a reference on the mm, AIO - * handles this for us. - */ - iocb->flags |= SDP_IOCB_F_LOCKED; - iocb->mm = current->mm; - /* - * save and raise capabilities - */ - real_cap = cap_t(current->cap_effective); - cap_raise(current->cap_effective, CAP_IPC_LOCK); - - size = PAGE_ALIGN(iocb->size + (iocb->addr & ~PAGE_MASK)); - addr = iocb->addr & PAGE_MASK; - - iocb->page_count = size >> PAGE_SHIFT; - - limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur; - limit >>= PAGE_SHIFT; - /* - * lock the mm, if within the limit lock the address range.
- */ - down_write(&iocb->mm->mmap_sem); - - if (!((iocb->page_count + current->mm->locked_vm) > limit)) - result = (*mlock_ptr)(addr, size, 1); - /* - * process result - */ - if (result) { - sdp_dbg_err("VMA lock <%lx:%Zu> error <%d> <%d:%lu:%lu>", - iocb->addr, iocb->size, result, - iocb->page_count, iocb->mm->locked_vm, limit); - goto err_lock; - } - /* - * look up the head of the vma queue, loop through the vmas, marking - * them do not copy, reference counting, and saving them. - */ - vma = find_vma(iocb->mm, addr); - if (!vma) - /* - * sanity check. - */ - sdp_warn("No VMA for IOCB! <%lx:%Zu> lock", - iocb->addr, iocb->size); - - while (vma) { - spin_lock(&iocb->mm->page_table_lock); - - if (!(VM_LOCKED & vma->vm_flags)) - sdp_warn("Unlocked vma! <%08lx>", vma->vm_flags); - - if (PAGE_SIZE < (unsigned long)vma->vm_private_data) - sdp_dbg_err("VMA: private daya in use! <%08lx>", - (unsigned long)vma->vm_private_data); - - vma->vm_flags |= VM_DONTCOPY; - vma->vm_private_data++; - - spin_unlock(&iocb->mm->page_table_lock); - - sdp_dbg_data(NULL, - "mark <%lx> <0x%p> <%08lx:%08lx> <%08lx> <%ld>", - iocb->addr, vma, vma->vm_start, vma->vm_end, - vma->vm_flags, (long)vma->vm_private_data); - - if ((addr + size) > vma->vm_end) - vma = vma->vm_next; - else - vma = NULL; - } - - result = sdp_iocb_page_save(iocb); - if (result) { - sdp_dbg_err("Error <%d> saving pages for IOCB <%lx:%Zu>", - result, iocb->addr, iocb->size); - goto err_save; - } - - up_write(&iocb->mm->mmap_sem); - cap_t(current->cap_effective) = real_cap; - - return 0; -err_save: - - do_iocb_unlock(iocb); -err_lock: - /* - * unlock the mm and restore capabilities. 
- */ - up_write(&iocb->mm->mmap_sem); - cap_t(current->cap_effective) = real_cap; - - iocb->flags &= ~SDP_IOCB_F_LOCKED; - iocb->mm = NULL; - - return result; -} - -/* - * IOCB memory locking init functions - */ -struct kallsym_iter { - loff_t pos; - struct module *owner; - unsigned long value; - unsigned int nameoff; /* If iterating in core kernel symbols */ - char type; - char name[128]; -}; - -/* - * sdp_mem_lock_init - initialize the userspace memory locking - */ -static int sdp_mem_lock_init(void) -{ - struct file *kallsyms; - struct seq_file *seq; - struct kallsym_iter *iter; - loff_t pos = 0; - int ret = -EINVAL; - - sdp_dbg_init("Memory Locking initialization."); - - kallsyms = filp_open("/proc/kallsyms", O_RDONLY, 0); - if (!kallsyms) { - sdp_warn("Failed to open /proc/kallsyms"); - goto done; - } - - seq = (struct seq_file *)kallsyms->private_data; - if (!seq) { - sdp_warn("Failed to fetch sequential file."); - goto err_close; - } - - for (iter = seq->op->start(seq, &pos); - iter != NULL; - iter = seq->op->next(seq, iter, &pos)) - if (!strcmp(iter->name, "do_mlock")) - mlock_ptr = (do_mlock_ptr_t)iter->value; - - if (!mlock_ptr) - sdp_warn("Failed to find lock pointer."); - else - ret = 0; - -err_close: - filp_close(kallsyms, NULL); -done: - return ret; -} - -/* - * sdp_mem_lock_cleanup - cleanup the memory locking tables - */ -static void sdp_mem_lock_cleanup(void) -{ - sdp_dbg_init("Memory Locking cleanup."); - /* - * null out entries. - */ - mlock_ptr = NULL; + return result; } /* @@ -802,28 +607,12 @@ void sdp_iocb_q_clear(struct sdpc_iocb_q } /* - * primary initialization/cleanup functions - */ - -/* * sdp_main_iocb_init - initialize the advertisment caches */ int sdp_main_iocb_init(void) { - int result; - sdp_dbg_init("IOCB cache initialization."); - /* - * initialize locking code. 
- */ - result = sdp_mem_lock_init(); - if (result < 0) { - sdp_warn("Error <%d> initializing memory locking.", result); - return result; - } - /* - * initialize the caches only once. - */ + if (sdp_iocb_cache) { sdp_warn("IOCB caches already initialized."); return -EINVAL; @@ -833,15 +622,10 @@ int sdp_main_iocb_init(void) sizeof(struct sdpc_iocb), 0, SLAB_HWCACHE_ALIGN, NULL, NULL); - if (!sdp_iocb_cache) { - result = -ENOMEM; - goto error_iocb_c; - } + if (!sdp_iocb_cache) + return -ENOMEM; return 0; -error_iocb_c: - sdp_mem_lock_cleanup(); - return result; } /* @@ -858,8 +642,4 @@ void sdp_main_iocb_cleanup(void) * null out entries. */ sdp_iocb_cache = NULL; - /* - * cleanup memory locking - */ - sdp_mem_lock_cleanup(); } Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_iocb.h =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_iocb.h 2005-08-09 15:07:29.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_iocb.h 2005-08-09 16:06:13.000000000 +0300 @@ -53,7 +53,8 @@ #define SDP_IOCB_F_RDMA_R 0x00000010 /* IOCB is in RDMA read processing */ #define SDP_IOCB_F_RDMA_W 0x00000020 /* IOCB is in RDMA write processing */ #define SDP_IOCB_F_LOCKED 0x00000040 /* IOCB is locked in memory */ -#define SDP_IOCB_F_REG 0x00000080 /* IOCB is memory is registered */ +#define SDP_IOCB_F_REG 0x00000080 /* IOCB memory is registered */ +#define SDP_IOCB_F_RECV 0x00000100 /* IOCB is for a receive request */ #define SDP_IOCB_F_ALL 0xFFFFFFFF /* IOCB all mask */ /* * zcopy constants. @@ -100,9 +101,10 @@ struct sdpc_iocb { /* * page list. 
data for locking/registering userspace */ - struct mm_struct *mm; /* user mm struct */ - unsigned long addr; /* user space address */ - size_t size; /* total size of the user buffer */ + struct mm_struct *mm; /* user mm struct */ + struct task_struct *tsk; + unsigned long addr; /* user space address */ + size_t size; /* total size of the user buffer */ struct page **page_array; /* list of page structure pointers. */ u64 *addr_array; /* list of physical page addresses. */ Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_recv.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_recv.c 2005-08-09 15:42:04.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_recv.c 2005-08-09 16:06:13.000000000 +0300 @@ -1347,6 +1347,8 @@ int sdp_inet_recv(struct kiocb *req, st iocb->addr = ((unsigned long)msg->msg_iov->iov_base - copied); + iocb->flags |= SDP_IOCB_F_RECV; + req->ki_cancel = sdp_inet_read_cancel; result = sdp_iocb_lock(iocb); -- MST From halr at voltaire.com Tue Aug 9 06:31:37 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Aug 2005 09:31:37 -0400 Subject: [openib-general] Re: sdp: cant unload ib_ipoib module In-Reply-To: <20050809124631.GG32419@mellanox.co.il> References: <1123007814.2946.12.camel@duffman> <20050803070923.GG15300@mellanox.co.il> <20050804104250.A30741@topspin.com> <20050807154632.GG15300@mellanox.co.il> <1123539115.4402.39.camel@hal.voltaire.com> <20050809124631.GG32419@mellanox.co.il> Message-ID: <1123593909.4403.16.camel@hal.voltaire.com> Hi Michael, On Tue, 2005-08-09 at 08:46, Michael S. Tsirkin wrote: > ip_rt_put now looks right, but it looks like device_put is still done too early. Any idea where it should be done ? > It wont hurt you unless there's a hotplug event or ipoib is unloaded > before sdp, and its not a new bug. Right, that was the case before this patch too. It's a large step in the right direction. > I'll add this to my TODO list. 
Thanks. Any idea when you might get back to it ? -- Hal From halr at voltaire.com Tue Aug 9 06:39:24 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Aug 2005 09:39:24 -0400 Subject: [openib-general] Re: [PATCH] libibumad: configure option to skip library test In-Reply-To: <20050809121255.GE32419@mellanox.co.il> References: <20050809121255.GE32419@mellanox.co.il> Message-ID: <1123594764.4403.21.camel@hal.voltaire.com> On Tue, 2005-08-09 at 08:12, Michael S. Tsirkin wrote: > Add option to skip infiniband library checks in libibumad. Thanks. Applied. -- Hal From mst at mellanox.co.il Tue Aug 9 06:48:25 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 9 Aug 2005 16:48:25 +0300 Subject: [openib-general] Re: sdp: cant unload ib_ipoib module In-Reply-To: <1123593909.4403.16.camel@hal.voltaire.com> References: <1123007814.2946.12.camel@duffman> <20050803070923.GG15300@mellanox.co.il> <20050804104250.A30741@topspin.com> <20050807154632.GG15300@mellanox.co.il> <1123539115.4402.39.camel@hal.voltaire.com> <20050809124631.GG32419@mellanox.co.il> <1123593909.4403.16.camel@hal.voltaire.com> Message-ID: <20050809134825.GI32419@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: sdp: cant unload ib_ipoib module > > Hi Michael, > > On Tue, 2005-08-09 at 08:46, Michael S. Tsirkin wrote: > > ip_rt_put now looks right, but it looks like device_put is still done too early. > > Any idea where it should be done ? After SDP stops using the device. > > I'll add this to my TODO list. > > Thanks. Any idea when you might get back to it ? After the synchronous zcopy stuff. 
-- MST From halr at voltaire.com Tue Aug 9 07:17:35 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Aug 2005 10:17:35 -0400 Subject: [openib-general] Re: likely user error with uat In-Reply-To: <42F7F88F.402@ichips.intel.com> References: <1123541564.4870.39.camel@hal.voltaire.com> <1123545889.4522.11.camel@hal.voltaire.com> <42F7F88F.402@ichips.intel.com> Message-ID: <1123597055.4403.46.camel@hal.voltaire.com> On Mon, 2005-08-08 at 20:27, Sean Hefty wrote: > I did bring at and ipoib up/down/up before running the tests. Was AT invoked or was that brought down (directly or indirectly around) the IPoIB up/down/up ? I'm looking for a way to reproduce your scenario. -- Hal From amitk at mellanox.co.il Tue Aug 9 08:08:48 2005 From: amitk at mellanox.co.il (Amit Krig) Date: Tue, 9 Aug 2005 18:08:48 +0300 Subject: [openib-general] InfiniBand Test Project === IBTP === Message-ID: <506C3D7B14CDD411A52C00025558DED607F0DDA1@mtlex01.yok.mtl.com> Hi All, I would like to propose a new InfiniBand Test Project === IBTP === The proposed new dir can be under Gen2/trunk => https://openib.org/svn/gen2/trunk/ibtp/ The project may contain the following tree * tools (In this folder we will have all the scripts and automation utilities) * infiniband * core * ulps * ipoib * sdp * srp * kdapl * iser * * user * verbs * cm * management * osm * utils * ulps * udapl * mpi * Each sub dir may have the following directories * doa (Basic and simple dead or alive tests, up to X sec) * functional (Full flows & long tests) * bad machine (Destructive tests that cannot run in parallel with other tests) * Scratch (New tests that are not integrated yet) * Doc We should decide on the relevant document/format attached to each test such as * README file * Test runner * Config file (will contain several cmd options to run the test and maybe will be used by the external auto run script) * > Amit Krig > Mellanox Technologies > SW/HW Design validation manager > mailto:amitk at mellanox.co.il > Work:
+972-4-9097200 Ext. 315 > Fax: +972-4-9593245 > Cell: +972-544-799099 > > > From mst at mellanox.co.il Tue Aug 9 08:12:17 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 9 Aug 2005 18:12:17 +0300 Subject: [openib-general] [PATCH] libosmvendor: configure option to skip library check Message-ID: <20050809151217.GL32419@mellanox.co.il> Add option to skip infiniband library checks in libosmvendor. Signed-off-by: Michael S. Tsirkin Index: management/osm/libvendor/configure.in =================================================================== --- management/osm/libvendor/configure.in (revision 2963) +++ management/osm/libvendor/configure.in (working copy) @@ -7,6 +7,13 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(libosmvendor, 0.9.0) +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of +ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + dnl Checks for programs AC_PROG_CC AC_PROG_GCC_TRADITIONAL @@ -16,17 +23,23 @@ AC_PROG_LN_S AC_PROG_MAKE_SET AC_PROG_LIBTOOL +if test "$disable_libcheck" != "yes" +then dnl Checks for libraries LDFLAGS="$LDFLAGS -L/usr/local/ib/lib" AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. libosmvendor requires libibumad.])) +fi dnl Checks for header files. AC_HEADER_DIRENT AC_HEADER_STDC +if test "$disable_libcheck" != "yes" +then AC_CHECK_HEADER(infiniband/umad.h, [], AC_MSG_ERROR([ not found. libosmvendor requires libibumad.]) ) +fi AC_CHECK_HEADERS([fcntl.h stddef.h stdint.h sys/ioctl.h]) dnl Checks for library functions -- MST From eitan at mellanox.co.il Tue Aug 9 08:26:09 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 09 Aug 2005 18:26:09 +0300 Subject: [openib-general] [PATCH] osm: add a main auto tools project for osm Message-ID: <42F8CB11.5070606@mellanox.co.il> Hi Hal This patch includes: 1.
Added a top-level autotools project for OpenSM. You now need to run autogen.sh && configure && make && make install just once for osm (previously four times: complib, libvendor, opensm, osmtest). 2. Clean up the direct override of libdir and bindir by using AC_PREFIX_DEFAULT. 3. Move osm includes into the prefix (/usr/local/ib). 4. Support a debug build for OpenSM using --enable-debug. This is important to allow for asserts at runtime and various other debug features. Since the generated complib cannot be used with the release version, we also use a special header file that stores the type of build for applications that wish to link with it. 5. Clean up stale use of AC_CHECK_LIB with no parameters. I tested the patch on: 2.6.12.3-smp SuSE Linux 9.3 (i586) Signed-off-by: Eitan Zahavi eitan at mellanox.co.il Index: include/configure.in =================================================================== --- include/configure.in (revision 3036) +++ include/configure.in (working copy) @@ -13,7 +13,7 @@ dnl Checks for programs AC_PROG_CC dnl Checks for libraries -AC_CHECK_LIB +dnl AC_CHECK_LIB - need to provide symbol and library... what do we depend on? dnl Checks for header files.
AC_HEADER_STDC Index: libvendor/configure.in =================================================================== --- libvendor/configure.in (revision 3036) +++ libvendor/configure.in (working copy) @@ -7,6 +7,9 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(libosmvendor, 0.9.0) +dnl We use a non standard default prefix +AC_PREFIX_DEFAULT([/usr/local/ib]) + dnl Checks for programs AC_PROG_CC AC_PROG_GCC_TRADITIONAL @@ -47,5 +50,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl support debug mode +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile libosmvendor.spec]) AC_OUTPUT Index: libvendor/Makefile.am =================================================================== --- libvendor/Makefile.am (revision 3036) +++ libvendor/Makefile.am (working copy) @@ -1,15 +1,19 @@ -libdir = ${exec_prefix}/ib/lib - SUBDIRS = . +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif + INCLUDES = -I$(srcdir)/../include \ -I$(srcdir)/../../libibcommon/include/infiniband \ -I$(srcdir)/../../libibumad/include/infiniband lib_LTLIBRARIES = libosmvendor.la -libosmvendor_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT +libosmvendor_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) if HAVE_LD_VERSION_SCRIPT libosmvendor_version_script = -Wl,--version-script=$(srcdir)/libosmvendor.map Index: libvendor/autogen.sh =================================================================== --- libvendor/autogen.sh (revision 3036) +++ libvendor/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! 
/bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: complib/autogen.sh =================================================================== --- complib/autogen.sh (revision 3036) +++ complib/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! /bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: complib/configure.in =================================================================== --- complib/configure.in (revision 3036) +++ complib/configure.in (working copy) @@ -7,6 +7,9 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(libosmcomp, 0.9.0) +dnl We use a non standard default prefix +AC_PREFIX_DEFAULT([/usr/local/ib]) + dnl Checks for programs AC_PROG_CC AC_PROG_GCC_TRADITIONAL @@ -31,6 +34,7 @@ AC_C_INLINE AC_TYPE_SIZE_T AC_HEADER_TIME +dnl We use --version-script with ld if possible AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then ac_cv_version_script=yes @@ -40,5 +44,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl Support debug mode build - if enable-debug provided the DEBUG variable is set +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile libosmcomp.spec]) AC_OUTPUT Index: complib/Makefile.am =================================================================== --- complib/Makefile.am (revision 3036) +++ complib/Makefile.am (working copy) @@ -1,5 +1,5 @@ -libdir = ${exec_prefix}/ib/lib +# libdir = ${exec_prefix}/ib/lib SUBDIRS = 
. @@ -7,7 +7,13 @@ INCLUDES = -I$(srcdir)/../include lib_LTLIBRARIES = libosmcomp.la -libosmcomp_la_CFLAGS = -Wall +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif + +libosmcomp_la_CFLAGS = -Wall $(DBGFLAGS) if HAVE_LD_VERSION_SCRIPT libosmcomp_version_script = -Wl,--version-script=$(srcdir)/libosmcomp.map Index: AUTHORS =================================================================== Index: configure.in =================================================================== --- configure.in (revision 0) +++ configure.in (revision 0) @@ -0,0 +1,42 @@ +dnl Process this file with autoconf to produce a configure script. + +AC_INIT(autogen.sh) + +dnl use local config dir for extras +AC_CONFIG_AUX_DIR(config) + +dnl We use a non standard default prefix +AC_PREFIX_DEFAULT([/usr/local/ib]) + +dnl Defines the Language +AC_LANG_C + +dnl Auto make +AM_INIT_AUTOMAKE(osm,1.0) + +dnl Provides control over re-making of all auto files +dnl We also use it to define swig dependencies so end +dnl users do not see them. +AM_MAINTAINER_MODE + +dnl Required for cases make defines a MAKE=make ??? 
Why +AC_PROG_MAKE_SET + +dnl Define an input config option to control debug compile +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debugging], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + +dnl Configure the following subdirs +AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest) + +dnl Create the following Makefiles +AC_OUTPUT(Makefile) + + + Index: ChangeLog =================================================================== Index: README =================================================================== Index: osmtest/configure.in =================================================================== --- osmtest/configure.in (revision 3036) +++ osmtest/configure.in (working copy) @@ -9,6 +9,9 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(osmtest, 0.9.0) +dnl We use a non standard default prefix +AC_PREFIX_DEFAULT([/usr/local/ib]) + dnl Checks for programs AC_PROG_CXX AC_PROG_CC @@ -52,5 +55,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl support debug mode +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile]) AC_OUTPUT Index: osmtest/Makefile.am =================================================================== --- osmtest/Makefile.am (revision 3036) +++ osmtest/Makefile.am (working copy) @@ -1,6 +1,9 @@ -bindir = ${exec_prefix}/ib/bin -libdir = ${exec_prefix}/ib/lib +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif INCLUDES = -I$(srcdir)/include \ -I$(srcdir)/../include \ @@ -11,12 +14,9 @@ bin_PROGRAMS = osmtest osmtest_SOURCES = main.c osmtest.c 
osmt_service.c osmt_slvl_vl_arb.c \ osmt_multicast.c osmt_inform.c -osmtest_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT -osmtest_LDADD = $(libdir)/libibumad.la \ - $(libdir)/libibcommon.la \ - $(libdir)/libopensm.la \ - $(libdir)/libosmcomp.la \ - $(libdir)/libosmvendor.la +osmtest_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) +osmtest_LDADD = -L../complib -L../libvendor -L../opensm -L$(libdir) \ + -libumad -libcommon -lopensm -losmcomp -losmvendor osmtest_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread -L../opensm Index: osmtest/autogen.sh =================================================================== --- osmtest/autogen.sh (revision 3036) +++ osmtest/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! /bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: opensm/configure.in =================================================================== --- opensm/configure.in (revision 3036) +++ opensm/configure.in (working copy) @@ -9,6 +9,9 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(opensm, 0.9.0) +dnl We use a non standard default prefix +AC_PREFIX_DEFAULT([/usr/local/ib]) + dnl Checks for programs AC_PROG_CXX AC_PROG_CC @@ -52,5 +55,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl support debug mode +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile]) AC_OUTPUT Index: opensm/autogen.sh =================================================================== --- opensm/autogen.sh (revision 3036) +++ opensm/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! 
/bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: opensm/Makefile.am =================================================================== --- opensm/Makefile.am (revision 3036) +++ opensm/Makefile.am (working copy) @@ -1,14 +1,17 @@ -bindir = ${exec_prefix}/ib/bin -libdir = ${exec_prefix}/ib/lib - INCLUDES = -I$(srcdir)/../include \ -I$(srcdir)/../../libibcommon/include/infiniband \ -I$(srcdir)/../../libibumad/include/infiniband lib_LTLIBRARIES = libopensm.la -libopensm_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif + +libopensm_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) if HAVE_LD_VERSION_SCRIPT libopensm_version_script = -Wl,--version-script=$(srcdir)/libopensm.map @@ -60,12 +63,13 @@ opensm_SOURCES = main.c osm_drop_mgr.c o osm_ucast_mgr.c osm_ucast_updn.c \ osm_vl15intf.c osm_vl_arb_rcv.c\ osm_vl_arb_rcv_ctrl.c -opensm_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT -opensm_LDADD = $(libdir)/libibumad.la \ - $(libdir)/libibcommon.la \ - $(srcdir)/libopensm.la \ - $(libdir)/libosmcomp.la \ - $(libdir)/libosmvendor.la +opensm_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) + +# we need to be able to load libraries from local build subtree before make install +# we always give precedence to local tree libs and then use the pre-installed ones. 
+opensm_LDADD = -L../complib -L../libvendor -L$(libdir) \ + -libumad -lopensm -losmcomp -losmvendor + opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread opensmincludedir = $(includedir)/infiniband/opensm @@ -79,4 +83,3 @@ EXTRA_DIST = $(srcdir)/../include/opensm $(srcdir)/../include/opensm/osm_log.h \ $(srcdir)/../include/opensm/osm_madw.h \ $(srcdir)/../include/opensm/osm_mad_pool.h - Index: INSTALL =================================================================== --- INSTALL (revision 0) +++ INSTALL (revision 0) @@ -0,0 +1,231 @@ +Installation Instructions +************************* + +Copyright (C) 1994, 1995, 1996, 1999, 2000, 2001, 2002, 2004 Free +Software Foundation, Inc. + +This file is free documentation; the Free Software Foundation gives +unlimited permission to copy, distribute and modify it. + +Basic Installation +================== + +These are generic installation instructions. + + The `configure' shell script attempts to guess correct values for +various system-dependent variables used during compilation. It uses +those values to create a `Makefile' in each directory of the package. +It may also create one or more `.h' files containing system-dependent +definitions. Finally, it creates a shell script `config.status' that +you can run in the future to recreate the current configuration, and a +file `config.log' containing compiler output (useful mainly for +debugging `configure'). + + It can also use an optional file (typically called `config.cache' +and enabled with `--cache-file=config.cache' or simply `-C') that saves +the results of its tests to speed up reconfiguring. (Caching is +disabled by default to prevent problems with accidental use of stale +cache files.) + + If you need to do unusual things to compile the package, please try +to figure out how `configure' could check whether to do them, and mail +diffs or instructions to the address given in the `README' so they can +be considered for the next release. 
If you are using the cache, and at +some point `config.cache' contains results you don't want to keep, you +may remove or edit it. + + The file `configure.ac' (or `configure.in') is used to create +`configure' by a program called `autoconf'. You only need +`configure.ac' if you want to change it or regenerate `configure' using +a newer version of `autoconf'. + +The simplest way to compile this package is: + + 1. `cd' to the directory containing the package's source code and type + `./configure' to configure the package for your system. If you're + using `csh' on an old version of System V, you might need to type + `sh ./configure' instead to prevent `csh' from trying to execute + `configure' itself. + + Running `configure' takes awhile. While running, it prints some + messages telling which features it is checking for. + + 2. Type `make' to compile the package. + + 3. Optionally, type `make check' to run any self-tests that come with + the package. + + 4. Type `make install' to install the programs and any data files and + documentation. + + 5. You can remove the program binaries and object files from the + source code directory by typing `make clean'. To also remove the + files that `configure' created (so you can compile the package for + a different kind of computer), type `make distclean'. There is + also a `make maintainer-clean' target, but that is intended mainly + for the package's developers. If you use it, you may have to get + all sorts of other programs in order to regenerate files that came + with the distribution. + +Compilers and Options +===================== + +Some systems require unusual options for compilation or linking that the +`configure' script does not know about. Run `./configure --help' for +details on some of the pertinent environment variables. + + You can give `configure' initial values for configuration parameters +by setting variables in the command line or in the environment. 
Here +is an example: + + ./configure CC=c89 CFLAGS=-O2 LIBS=-lposix + + *Note Defining Variables::, for more details. + +Compiling For Multiple Architectures +==================================== + +You can compile the package for more than one kind of computer at the +same time, by placing the object files for each architecture in their +own directory. To do this, you must use a version of `make' that +supports the `VPATH' variable, such as GNU `make'. `cd' to the +directory where you want the object files and executables to go and run +the `configure' script. `configure' automatically checks for the +source code in the directory that `configure' is in and in `..'. + + If you have to use a `make' that does not support the `VPATH' +variable, you have to compile the package for one architecture at a +time in the source code directory. After you have installed the +package for one architecture, use `make distclean' before reconfiguring +for another architecture. + +Installation Names +================== + +By default, `make install' will install the package's files in +`/usr/local/bin', `/usr/local/man', etc. You can specify an +installation prefix other than `/usr/local' by giving `configure' the +option `--prefix=PREFIX'. + + You can specify separate installation prefixes for +architecture-specific files and architecture-independent files. If you +give `configure' the option `--exec-prefix=PREFIX', the package will +use PREFIX as the prefix for installing programs and libraries. +Documentation and other data files will still use the regular prefix. + + In addition, if you use an unusual directory layout you can give +options like `--bindir=DIR' to specify different values for particular +kinds of files. Run `configure --help' for a list of the directories +you can set and what kinds of files go in them. 
+ + If the package supports it, you can cause programs to be installed +with an extra prefix or suffix on their names by giving `configure' the +option `--program-prefix=PREFIX' or `--program-suffix=SUFFIX'. + +Optional Features +================= + +Some packages pay attention to `--enable-FEATURE' options to +`configure', where FEATURE indicates an optional part of the package. +They may also pay attention to `--with-PACKAGE' options, where PACKAGE +is something like `gnu-as' or `x' (for the X Window System). The +`README' should mention any `--enable-' and `--with-' options that the +package recognizes. + + For packages that use the X Window System, `configure' can usually +find the X include and library files automatically, but if it doesn't, +you can use the `configure' options `--x-includes=DIR' and +`--x-libraries=DIR' to specify their locations. + +Specifying the System Type +========================== + +There may be some features `configure' cannot figure out automatically, +but needs to determine by the type of machine the package will run on. +Usually, assuming the package is built to be run on the _same_ +architectures, `configure' can figure that out, but if it prints a +message saying it cannot guess the machine type, give it the +`--build=TYPE' option. TYPE can either be a short name for the system +type, such as `sun4', or a canonical name which has the form: + + CPU-COMPANY-SYSTEM + +where SYSTEM can have one of these forms: + + OS KERNEL-OS + + See the file `config.sub' for the possible values of each field. If +`config.sub' isn't included in this package, then this package doesn't +need to know the machine type. + + If you are _building_ compiler tools for cross-compiling, you should +use the `--target=TYPE' option to select the type of system they will +produce code for. 
+ + If you want to _use_ a cross compiler, that generates code for a +platform different from the build platform, you should specify the +"host" platform (i.e., that on which the generated programs will +eventually be run) with `--host=TYPE'. + +Sharing Defaults +================ + +If you want to set default values for `configure' scripts to share, you +can create a site shell script called `config.site' that gives default +values for variables like `CC', `cache_file', and `prefix'. +`configure' looks for `PREFIX/share/config.site' if it exists, then +`PREFIX/etc/config.site' if it exists. Or, you can set the +`CONFIG_SITE' environment variable to the location of the site script. +A warning: not all `configure' scripts look for a site script. + +Defining Variables +================== + +Variables not defined in a site shell script can be set in the +environment passed to `configure'. However, some packages may run +configure again during the build, and the customized values of these +variables may be lost. In order to avoid this problem, you should set +them in the `configure' command line, using `VAR=value'. For example: + + ./configure CC=/usr/local2/bin/gcc + +will cause the specified gcc to be used as the C compiler (unless it is +overridden in the site shell script). + +`configure' Invocation +====================== + +`configure' recognizes the following options to control how it operates. + +`--help' +`-h' + Print a summary of the options to `configure', and exit. + +`--version' +`-V' + Print the version of Autoconf used to generate the `configure' + script, and exit. + +`--cache-file=FILE' + Enable the cache: use and save the results of the tests in FILE, + traditionally `config.cache'. FILE defaults to `/dev/null' to + disable caching. + +`--config-cache' +`-C' + Alias for `--cache-file=config.cache'. + +`--quiet' +`--silent' +`-q' + Do not print messages saying which checks are being made. 
To + suppress all normal output, redirect it to `/dev/null' (any error + messages will still be shown). + +`--srcdir=DIR' + Look for the package's source code in directory DIR. Usually + `configure' can determine that directory automatically. + +`configure' also accepts some other, not widely useful, options. Run +`configure --help' for more details. + Index: COPYING =================================================================== --- COPYING (revision 0) +++ COPYING (revision 0) @@ -0,0 +1,32 @@ + Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available from the file + COPYING in the main directory of this source tree, or the + OpenIB.org BSD license below: + + Redistribution and use in source and binary forms, with or + without modification, are permitted provided that the following + conditions are met: + + - Redistributions of source code must retain the above + copyright notice, this list of conditions and the following + disclaimer. + + - Redistributions in binary form must reproduce the above + copyright notice, this list of conditions and the following + disclaimer in the documentation and/or other materials + provided with the distribution. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. 
+ Index: Makefile.am =================================================================== --- Makefile.am (revision 0) +++ Makefile.am (revision 0) @@ -0,0 +1,16 @@ + +# note that order matters: make the lib first then use it +SUBDIRS = complib libvendor opensm osmtest + +# this will control the update of the files in order +MAINTAINERCLEANFILES = Makefile.in aclocal.m4 configure config-h.in + +ACLOCAL = aclocal -I $(ac_aux_dir) + +# we should provide a hint for other apps about the build mode of this project +install-exec-hook: +if DEBUG + echo "define osm_build_type \"debug\"" > $(includedir)/infiniband/opensm/osm_build_id.h +else + echo "define osm_build_type \"free\"" > $(includedir)/infiniband/opensm/osm_build_id.h +endif Index: autogen.sh =================================================================== --- autogen.sh (revision 0) +++ autogen.sh (revision 0) @@ -0,0 +1,74 @@ +#!/bin/bash + +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + +# make sure autoconf is up-to-date +ac_ver=`autoconf --version | head -1 | awk '{print $NF}'` +ac_maj=`echo $ac_ver|sed 's/\..*//'` +ac_min=`echo $ac_ver|sed 's/.*\.//'` +if [[ $ac_maj < 2 ]]; then + echo Min autoconf version is 2.59 + exit +fi +if [[ $ac_maj = 2 && $ac_min < 59 ]]; then + echo Min autoconf version is 2.59 + exit +fi + +# make sure automake is up-to-date +am_ver=`automake --version | head -1 | awk '{print $NF}'` +am_maj=`echo $am_ver|sed 's/\..*//'` +am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` +am_sub=`echo $am_ver|sed 's/.*\.//'` +if [[ $am_maj < 1 ]]; then + echo Min automake version is 1.9.3 + exit +fi +if [[ $am_maj = 1 && $am_min < 9 ]]; then + echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.3" + exit +fi +if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 3 ]]; then + echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.3" + exit +fi + +# make sure libtool is up-to-date +lt_ver=`libtool --version | head -1 | 
awk '{print $4}'` +lt_maj=`echo $lt_ver|sed 's/\..*//'` +lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` +lt_sub=`echo $lt_ver|sed 's/.*\.//'` +if [[ $lt_maj < 1 ]]; then + echo Min libtool version is 1.4.2 + exit +fi +if [[ $lt_maj = 1 && $lt_min < 4 ]]; then + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" + exit +fi +if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" + exit +fi + +# cleanup +find . \( -name Makefile.in -o -name aclocal.m4 -o -name autom4te.cache -o -name configure -o -name aclocal.m4 \) -exec \rm -rf {} \; -prune + +# handle our own autoconf: +aclocal -I config 2>&1 | grep -v "arning: underquoted definition of" +automake --add-missing --gnu +autoconf + +# visit all sub directories with autogen.sh +anyErr=0 +for a in `ls */autogen.sh`; do + echo Visiting $a + $a 2>& 1 | sed 's/^/| /' + if test $? != 0; then + echo $a failed + anyErr=1 + fi +done + +exit $anyErr Property changes on: autogen.sh ___________________________________________________________________ Name: svn:executable + * Index: NEWS =================================================================== Index: Makefile =================================================================== --- Makefile (revision 3036) +++ Makefile (working copy) @@ -1,44 +0,0 @@ -LIBS:= complib libvendor -BIN:= opensm -UTIL:= include - -SUBDIRS=$(BIN) $(UTIL) - -all: BUILD_TARG=all -all: libs_install subdirs - @echo Make all done - -install: BUILD_TARG=install -install: subdirs - @echo Install done - -clean: SUBDIRS= $(LIBS) $(BIN) -clean: BUILD_TARG=clean -clean: subdirs - @echo Clean done - -rmdep: - find $(SUBDIRS) -name ".depend" | xargs rm -f - -depend: SUBDIRS= $(LIBS) $(BIN) $(UTIL) -depend: BUILD_TARG=depend -depend: rmdep subdirs - @echo Depend done - -.PHONY : subdirs -subdirs: - @for i in $(SUBDIRS); do\ - if [ -e $$i/Makefile ]; then\ - if !(cd $$i; make $(BUILD_TARG)); then exit
1; fi\ - fi\ - done\ - -.PHONY : libs_install -libs_install: - @for i in $(LIBS); do\ - if [ -e $$i/Makefile ]; then\ - if !(cd $$i; make install); then exit 1; fi\ - fi\ - done\ - -export BUILD_TARG From mst at mellanox.co.il Tue Aug 9 08:44:32 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 9 Aug 2005 18:44:32 +0300 Subject: [openib-general] [PATCHv2] sdp: remove in_atomic usage Message-ID: <20050809154432.GM32419@mellanox.co.il> Libor commented that performance suffers for him from this patch. It's somewhat surprising since most operations are now done from under conn->lock.slock. This is hopefully addressed now that send iocbs can be unlocked in interrupt context. Please try/comment. Tom, please don't commit this yet :) --- SDP used in_atomic to detect atomic context, which, according to Andrew's comment, doesn't actually catch all atomic conditions. And there doesn't seem to exist a reliable way to detect an atomic environment. So let's use iocb flags instead, and always schedule work for a receive iocb. Signed-off-by: Michael S.
Tsirkin Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_iocb.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_iocb.c 2005-08-09 17:20:38.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_iocb.c 2005-08-09 17:21:15.000000000 +0300 @@ -326,11 +326,13 @@ void sdp_iocb_complete(struct sdpc_iocb { iocb->status = status; - if (in_atomic() || irqs_disabled()) { - INIT_WORK(&iocb->completion, do_iocb_complete, (void *)iocb); - schedule_work(&iocb->completion); - } else + if (!(iocb->flags & SDP_IOCB_F_LOCKED) || + !(iocb->flags & SDP_IOCB_F_RECV)) do_iocb_complete(iocb); + else { + INIT_WORK(&iocb->completion, do_iocb_complete, iocb); + schedule_work(&iocb->completion); + } } /* -- MST From halr at voltaire.com Tue Aug 9 09:15:24 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Aug 2005 12:15:24 -0400 Subject: [openib-general] InfiniBand Test Project === IBTP === In-Reply-To: <506C3D7B14CDD411A52C00025558DED607F0DDA1@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607F0DDA1@mtlex01.yok.mtl.com> Message-ID: <1123604123.4403.62.camel@hal.voltaire.com> Hi Amit, On Tue, 2005-08-09 at 11:08, Amit Krig wrote: > Hi All, > > I would like to propos a new InfiniBand Test Project === IBTP === Glad to see this. Something like this should make it easier for everyone to regress OpenIB. > The proposed new dir can be under Gen2/trunk => > > https://openib.org/svn/gen2/trunk/ibtp/ > > The project may contain the following tree > > * tools > > (In this folder we will have all the scripts and automation > utilities) > > * infiniband Is this kernel part ? If so why not name it accordingly ? > * core > * ulps > * ipoib > * sdp > * srp > * kdapl > * iser The indentation of this looks wrong to me. core and ulps at the same level in the tree; everything else below ulps. Should it be ulps or ulp as in the actual kernel tree ? 
> * user This will make the third naming of this: user, userspace, and linux-user. We should converge on one. > * verbs > * cm > * management > * osm > * utils > * ulps Same name as on kernel side. > * udapl > * mpi > -- Hal From sean.hefty at intel.com Tue Aug 9 09:29:28 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 9 Aug 2005 09:29:28 -0700 Subject: [openib-general] Re: likely user error with uat In-Reply-To: <1123597055.4403.46.camel@hal.voltaire.com> Message-ID: >> I did bring at and ipoib up/down/up before running the tests. > >Was AT invoked or was that brought down (directly or indirectly around) >the IPoIB up/down/up ? I meant load/unload/reload. >I'm looking for a way to reproduce your scenario. I didn't have a simple test that I was using. I was trying to run some DAPL code against some ucm changes that I have, but was having trouble with AT errors. I unloaded the drivers, pulled the latest svn version, rebuilt, reloaded the drivers, reinstalled userspace libraries, fiddled around with trying to run the uat tests, until I finally got to this error. After reading through the documentation, mail archives, and AT code, I finally gave up and figured that I'd see if anyone had run into a similar issue. I will see if I can reproduce the problem now that AT is working. - Sean From sean.hefty at intel.com Tue Aug 9 09:35:29 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 9 Aug 2005 09:35:29 -0700 Subject: [openib-general] InfiniBand Test Project === IBTP === In-Reply-To: <1123604123.4403.62.camel@hal.voltaire.com> Message-ID: >> The proposed new dir can be under Gen2/trunk => >> >> https://openib.org/svn/gen2/trunk/ibtp/ My preference would be to place it under a test or existing util directory, rather than trunk directly. >> * core >> * ulps >> * ipoib >> * sdp >> * srp >> * kdapl >> * iser > >The indentation of this looks wrong to me. core and ulps at the same >level in the tree; everything else below ulps.
Should it be ulps or ulp >as in the actual kernel tree ? I agree that matching the kernel or userspace tree structure would be better. - Sean From halr at voltaire.com Tue Aug 9 09:30:44 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Aug 2005 12:30:44 -0400 Subject: [openib-general] Re: [PATCH] libosmvendor: configure option to skip library check In-Reply-To: <20050809151217.GL32419@mellanox.co.il> References: <20050809151217.GL32419@mellanox.co.il> Message-ID: <1123605044.4403.90.camel@hal.voltaire.com> On Tue, 2005-08-09 at 11:12, Michael S. Tsirkin wrote: > Add option to skip infiniband library checks in libibumad. Thanks. Applied. -- Hal From halr at voltaire.com Tue Aug 9 09:41:35 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Aug 2005 12:41:35 -0400 Subject: [openib-general] Re: likely user error with uat In-Reply-To: References: Message-ID: <1123605185.4403.95.camel@hal.voltaire.com> On Tue, 2005-08-09 at 12:29, Sean Hefty wrote: > >> I did bring at and ipoib up/down/up before running the tests. > > > >Was AT invoked or was that brought down (directly or indirectly around) > >the IPoIB up/down/up ? > > I meant load/unload/reload. OK. That's the same scenario. > >I'm looking for a way to reproduce your scenario. > > I didn't have a simple test that I was using. I was trying to run some DAPL > code against some ucm changes that I have, but was having trouble with AT > errors. I unloaded the drivers, pulled the latest svn version, rebuilt, > reloaded the drivers, reinstalled userspace libraries, fiddled around with > trying to run the uat tests, until I finally got to this error. After reading > through the documentation, mail archives, and AT code, I finally gave up and > figured that I'd see if anyone had run into a similar issue. > > I will see if I can reproduce the problem now that AT is working. OK. Thanks. 
-- Hal From halr at voltaire.com Tue Aug 9 09:44:49 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Aug 2005 12:44:49 -0400 Subject: [openib-general] [PATCH applied] sdp: replace mlock with get_user_pages In-Reply-To: <20050809132254.GH32419@mellanox.co.il> References: <20050809132254.GH32419@mellanox.co.il> Message-ID: <1123605694.4403.108.camel@hal.voltaire.com> On Tue, 2005-08-09 at 09:22, Michael S. Tsirkin wrote: > +static void sdp_copy_one_page(struct page *from, struct page* to, > + unsigned long iocb_addr, size_t iocb_size, > + unsigned long uaddr) > +{ > + size_t size_left = iocb_addr + iocb_size - uaddr; > + size_t size = min(size_left,PAGE_SIZE); The last line results in the following warning on x86: drivers/infiniband/ulp/sdp/sdp_iocb.c: In function `sdp_copy_one_page': drivers/infiniband/ulp/sdp/sdp_iocb.c:46: warning: comparison of distinct pointer types lacks a cast -- Hal From halr at voltaire.com Tue Aug 9 09:48:09 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Aug 2005 12:48:09 -0400 Subject: [openib-general] Re: [PATCH] osm: add a main auto tools project for osm In-Reply-To: <42F8CB11.5070606@mellanox.co.il> References: <42F8CB11.5070606@mellanox.co.il> Message-ID: <1123606012.4403.113.camel@hal.voltaire.com> On Tue, 2005-08-09 at 11:26, Eitan Zahavi wrote: > Hi Hal > > This patch includes: > 1. Added a top level autotools project for OpenSM. > So now you need autogen.sh && configure && make && make install just > once for osm > (needed 4: complib, libvendor, opensm, osmtest). > 2. Cleanup the direct override of libdir, bindir using AC_PREFIX_DEFAULT > 3. Move osm includes into prefix (/usr/local/ib) but all the includes go into /usr/local/include/infiniband. There are 4 subdirectories there for OpenSM related includes: complib, iba, opensm, and vendor. > 4. Support debug build for OpenSM using --enable-debug > This is important to allow for asserts during runtime and various > other additional debug features. 
> Since the generated complib can not be used with the release version > we also use a special header file that stores the type of build > for applications that wish to link with it. > 5. Cleanup stale use of AC_CHECK_LIB with no parameters -- Hal From amitk at mellanox.co.il Tue Aug 9 10:19:37 2005 From: amitk at mellanox.co.il (Amit Krig) Date: Tue, 9 Aug 2005 20:19:37 +0300 Subject: [openib-general] InfiniBand Test Project === IBTP === Message-ID: <506C3D7B14CDD411A52C00025558DED607F0DDAD@mtlex01.yok.mtl.com> I think that a trunk directory will be more meaningful. As for the naming convention, I also agree that we should match the kernel or userspace tree structure. -----Original Message----- From: Sean Hefty [mailto:sean.hefty at intel.com] Sent: Tuesday, August 09, 2005 7:35 PM To: 'Hal Rosenstock'; Amit Krig Cc: openib-general at openib.org Subject: RE: [openib-general] InfiniBand Test Project === IBTP === >> The proposed new dir can be under Gen2/trunk => >> >> https://openib.org/svn/gen2/trunk/ibtp/ My preference would be to place it under a test or existing util directory, rather than trunk directly. >> * core >> * ulps >> * ipoib >> * sdp >> * srp >> * kdapl >> * iser > >The indentation of this looks wrong to me. core and ulps at the same >level in the tree; everything else below ulps. Should it be ulps or ulp >as in the actual kernel tree ? I agree that matching the kernel or userspace tree structure would be better. - Sean
From damato at psc.edu Tue Aug 9 10:45:19 2005 From: damato at psc.edu (Joe Damato) Date: Tue, 09 Aug 2005 13:45:19 -0400 Subject: [openib-general] ib_get_dma_mr and dma_map_single Message-ID: <42F8EBAF.6000804@psc.edu> Hello - I am attempting to port code that calls the gen1 function ib_memory_register -- I saw the other posts about this issue -- I have looked into pci_map_single to map a buffer -- but I am failing to see how ib_get_dma_mr fits in with this.... The sdp example (in gen2/trunk/src/linux-kernel/infiniband/ulp/sdp) makes a call to ib_get_dma_mr (sdp_conn.c:1812) and then calls to dma_map_single (sdp_recv.c:85 and sdp_send.c:129) although I am failing to see the connection. Also, this might not be a question I should ask here, but why not? What is the difference between pci_map_single and dma_map_single ? Are they interchangeable? Thanks. From rolandd at cisco.com Tue Aug 9 10:52:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 09 Aug 2005 10:52:55 -0700 Subject: [openib-general] ib_get_dma_mr and dma_map_single In-Reply-To: <42F8EBAF.6000804@psc.edu> (Joe Damato's message of "Tue, 09 Aug 2005 13:45:19 -0400") References: <42F8EBAF.6000804@psc.edu> Message-ID: <521x53uhe0.fsf@cisco.com> Joe> Hello - I am attempting to port code that calls the gen1 Joe> function ib_memory_register -- I saw the other posts about Joe> this issue -- I have looked into pci_map_single to map a Joe> buffer -- but I am failing to see how ib_get_dma_mr fits in Joe> with this.... ib_get_dma_mr() returns a memory region that can use DMA addresses returned from the DMA mapping API. In other words, when posting work requests, you take the bus address you get from the DMA mapping API and the L_Key you get from ib_get_dma_mr() and put them in the gather/scatter list of your work request. Joe> Also, this might not be a question I should ask here, Joe> but why not? What is the difference between pci_map_single Joe> and dma_map_single ? Are they interchangeable?
pci_map_single() works for PCI devices and takes a struct pci_dev pointer. dma_map_single works for generic devices and takes a struct device pointer. Since the IB midlayer gives you a struct device pointer (device->dma_device), you should use the dma_map_XXX functions. In general, the dma_map_XXX functions are somewhat more general and are designed to support situations like special rules for on-chip peripherals in embedded devices and other strange platforms. For standard x86-like systems, it's a distinction without a difference. - R. From eitan at mellanox.co.il Tue Aug 9 11:12:05 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 9 Aug 2005 21:12:05 +0300 Subject: [openib-general] RE: [PATCH] osm: add a main auto tools project for osm Message-ID: <506C3D7B14CDD411A52C00025558DED607C3062C@mtlex01.yok.mtl.com> > > 3. Move osm includes into prefix (/usr/local/ib) > > but all the includes go into /usr/local/include/infiniband. There are 4 > subdirectories there for OpenSM related includes: complib, iba, opensm, > and vendor. [EZ] So the path /usr/local/include/infiniband is not relative to the user defined prefix? This is a little hard coded. Can we say maybe that if one wants to place the include under /usr/local/include then the bin should be placed under /usr/local/bin ? Why do we need to invent a new concept where the include and bin use a different prefix? [EZ] Will it be OK if the include will be placed under prefix/include and will be linked to the proper subdirectory of /usr/local/include/infiniband ?
From hugh at veritas.com Tue Aug 9 11:13:33 2005 From: hugh at veritas.com (Hugh Dickins) Date: Tue, 9 Aug 2005 19:13:33 +0100 (BST) Subject: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support In-Reply-To: <20050726133553.GA22276@mellanox.co.il> References: <20050719165542.GB16028@mellanox.co.il> <20050725171928.GC12206@mellanox.co.il> <20050726133553.GA22276@mellanox.co.il> Message-ID: Sorry for my delay in replying... On Tue, 26 Jul 2005, Michael S. Tsirkin wrote: > Quoting Hugh Dickins : > > Subject: Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support > > > > On Mon, 25 Jul 2005, Michael S. Tsirkin wrote: > > > > > > This patch adds PROT_DONTCOPY to mmap and mprotect, to set VM_DONTCOPY on vma. > > > This is needed for infiniband userspace i/o, where we need to protect against > > > - the child process accessing the parent hardware page > > > - the parent registered address (on which the driver did get_user_pages) > > > getting remapped to another page by COW > > > One can imagine other uses, e.g. combined with mlock for real-time or security. > > > > I don't much like it, but it does solve a real problem in an efficient way. > > > > Partly I don't like it because of "PROT_DONTCOPY" itself: I'm queasy > > about protection flags which are not protection flags, though I find > > you're not the first to go down that road. > > Yes. Compare with PROT_GROWSDOWN and such. Though if you look deeper into that, you find that PROT_GROWSDOWN and PROT_GROWSUP are all about determining the start or end of the range when it's the stack: nothing to do with the protection flags set. Which inclines me the more against using mprotect to set VM_DONTCOPY. > > Is the patch tested? I've not tried, but suspect the newflags shift > > and mask won't work for it. > > I tested this patch. I didn't test all thinkable configurations of > flags though - what do you mean by "newflags shift and mask"? My error.
See further down where the code is shown. > > And I don't look forward to your adding > > VM_MAYDONTCOPY - ugh! > > We already have VM_DONTCOPY. Why would we need VM_MAYDONTCOPY and what > would it do? > > > > @@ -246,7 +246,7 @@ sys_mprotect(unsigned long start, size_t > > > goto out; > > > } > > > > > > - newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC)); > > > + newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC | VM_DONTCOPY)); > > > > > > if ((newflags & ~(newflags >> 4)) & 0xf) { > > > error = -EACCES; That newflags shift and mask is checking VM_READ,VM_WRITE,VM_EXEC against VM_MAYREAD,VM_MAYWRITE,VM_MAYEXEC (the same bits shifted up/down 4). It's checking, for example, whether the caller actually has permission to mprotect the mapping to make writes to the file. But I was reading it wrongly, sorry: I thought you were going to need a VM_MAYDONTCOPY bit in order to give permission for you to mprotect the mapping to VM_DONTCOPY. No, it's only checking the bottom four bits (including VM_SHARED against VM_MAYSHARE, but that's never changed). > > I rather think it would all be more cleanly handled by dropping the mmap > > and mprotect changes, > > Well, mmap would be much better off if VM_DONTCOPY is set atomically, since > a process may fork after mmap is called but before madvise. But it doesn't matter if the process does fork after mmap before madvise. It only starts to matter when you do get_user_pages (for writing): that will break COW on the private pages made readonly by a preceding fork, your problem is when a fork occurs after that to make them readonly. > > adding an madvise instead. > > I'm not opposed to this, on principle. But see below. > > > Though you may object > > that madvise is for optional behaviours, and this should be mandatory. > > What about a new system call? > Or a flag for mprotect that effectively turns it into a new system call? > Something like PROT_EXTENDED? 
PROT_DONTCOPY seems quite enough to signal the extension, if we were to go the mprotect route. > > The other reason I dislike the patch is that the problem it fixes is > > an old one, and I'd much rather have get_user_pages fix it for itself, > > Please note that the problem this attempts to solve is not limited > to pages locked by get_user_pages: in an infiniband userspace initiator, > a hardware page is mapped into process memory and must not be inherited > by a child processes, otherwise hardware protection breaks. Interesting. But (correct me if I'm wrong, I know nothing about InfiniBand userspace initiators) that would be done by a driver, which can set VM_DONTCOPY on the vma, without us having to extend the mprotect or madvise API > > than ask the developer to do some additional magic to get around it. > > > > But I've failed to work out a simple efficient alternative, which won't > > burden the vast majority of get_user_pages usages which never hit the > > issue. > > They dont hit it if they keep the mm semaphore, or if they only lock > pages for read. I think the usual case is simply that userspace does not touch those pages while they are pinned by get_user_pages, and/or it does not fork. But we have occasionally got bitten by the issue. > > So your way is probably appropriate, but I'd prefer madvise. > > The difficulty with changing get_user_pages, as I see it, is that > you wont be able to get away with a single DONTCOPY bit - you'll need > a full reference count for each page, no less. Quite possibly: I only thought it through far enough to conclude that your proposal has the great merit of simplicity in comparison, despite its dubious interface. > > (Sorry, I won't be able to discuss further for a couple of days.) Please correct that to weeks ;) > Well, madvise currently cant break/merge VMAs, which is required > for VM_DONTCOPY. And it seems like making madvise do this opens > a whole cans of worms. 
madvise has been splitting vmas forever, and was enhanced to remerge them in 2.6.13-rc. > Hugh, so the patch is likely to be bigger in the madvise approach. > Considering this, and the fact that a full solution has to add > a flag to mmap, anyway, do you still think madvise is really the best way > to do it? Has to add a flag to mmap? I didn't buy your "atomic" argument above, did I miss something? I still prefer madvise to mprotect for this, but admit neither is entirely clean, would rather let someone else decide between them. Even more I'd prefer one of these two solutions below, which sidestep that uncleanliness - but both of these would be in mmap only, no clean way to change afterwards (except by munmap or mmap MAP_FIXED): 1. Use the standard mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0) which gives you a memory object shared with children, so write-protection and COW won't come into it. or if there's good reason why that's no good, 2. Define a MAP_DONTCOPY to mmap: we have a fine tradition of MAP_flags to achieve this or that effect, adding one more would be cleaner than now corrupting mprotect or madvise. > Regarding solving the problem automagically by get_user_pages: > > What about a new VM_COPYONFORK flag, to trigger the old unix > behaviour of copying the vma on fork and a flag for get_user_pages > that sets it? Only users that don't keep the mm semaphore around > the get_user_pages/put_page operation would use this flag, others > would be unaffected. The flag will stay on until the VMA is destroyed. (I don't understand why you propose a new flag for the usual behaviour, but that's just a matter of which way round it's defined, not important.) Splitting a vma from within get_user_pages is not straightforward, we need down_write(&mm->mmap_sem) for a start; I think we'd all prefer to avoid that if we can - as I said, your proposal is rather simpler.
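For reference, option 1 above can be sketched from userspace roughly as follows. This is only an illustration, not code from any patch in this thread: the helper name shared_anon_survives_fork() and the value 42 are made up, and it assumes a Linux system where MAP_ANONYMOUS is available. It shows that a MAP_SHARED|MAP_ANONYMOUS mapping stays shared with the child after fork(), so a write on either side lands in the same page and COW never applies.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (name invented for this sketch).
 * Returns 1 if a write made by the child after fork() is visible to the
 * parent through a shared anonymous mapping, 0 if it is not, and -1 if
 * any system call fails.
 */
static int shared_anon_survives_fork(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	int *buf;
	pid_t pid;
	int status, seen;

	/* The mmap call quoted in option 1. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	buf[0] = 0;
	pid = fork();
	if (pid < 0) {
		munmap(buf, len);
		return -1;
	}
	if (pid == 0) {
		/* Child: write into the mapping, then exit immediately. */
		buf[0] = 42;
		_exit(0);
	}

	/* Parent: wait for the child, then look at the same page. */
	if (waitpid(pid, &status, 0) < 0) {
		munmap(buf, len);
		return -1;
	}
	seen = (buf[0] == 42);
	munmap(buf, len);
	return seen;
}
```

With a MAP_PRIVATE mapping the same helper would return 0, since the child's write would go to its own copy of the page. As noted above, this choice can only be made at mmap time.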
Coincidentally, Linus has drawn my attention in the last week to some uses of get_user_pages which are behaving in a way which I believe is currently mishandled, and may need splitting the vma. But I don't think you should wait around for however we decide to fix that issue. Hugh From halr at voltaire.com Tue Aug 9 12:00:22 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Aug 2005 15:00:22 -0400 Subject: [openib-general] RE: [PATCH] osm: add a main auto tools project for osm In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C3062C@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607C3062C@mtlex01.yok.mtl.com> Message-ID: <1123614021.4403.301.camel@hal.voltaire.com> On Tue, 2005-08-09 at 14:12, Eitan Zahavi wrote: > > > 3. Move osm includes into prefix (/usr/local/ib) > > > > but all the includes go into /usr/local/include/infiniband. There > are 4 > > subdirectories there for OpenSM related includes: complib, iba, > opensm, > > and vendor. > > [EZ] So the path /usr/local/include/infiniband is not relative to the > user defined prefix? The prefix was for the libraries and binaries but not the includes. > This is a little hard coded. Can we say maybe that if one wants to > place the include under /usr/local/include then the bin should be > placed under /usr/local/bin ? I thought the proposal was to ditch the prefix totally and just use /usr/local/bin for binaries, usr/local/lib for libraries, and /usr/local/include/infiniband for the includes (I am assuming that the subdirectories will be maintained under that for opensm, vendor, iba, and complib). > Why do we need to invent a new concept where the include and bin use a > different prefix? > > [EZ] Will it be OK if the include will be placed under prefix/include > and will be linked to the proper subdirectory of > /usr/local/include/infiniband ? I thought the whole prefix thing was being removed so this shouldn't be necessary. 
-- Hal From eitan at mellanox.co.il Tue Aug 9 12:33:51 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 9 Aug 2005 22:33:51 +0300 Subject: [openib-general] RE: [PATCH] osm: add a main auto tools project for osm Message-ID: <506C3D7B14CDD411A52C00025558DED607C30633@mtlex01.yok.mtl.com> Hi Hal The bottom line as I understand from your mail is /usr/local/bin /usr/local/lib /usr/local/include/infiniband I can do that. Patch by next Sun (I'm OOO till then). The support for configure --prefix= is rudimentary. I do not think it is wise to disable it. But the default is now well behaved and will be fixed to /usr/local rather than /usr/local/ib. [EZ] Please confirm From halr at voltaire.com Tue Aug 9 13:18:36 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Aug 2005 16:18:36 -0400 Subject: [openib-general] RE: [PATCH] osm: add a main auto tools project for osm In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C30633@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607C30633@mtlex01.yok.mtl.com> Message-ID: <1123618715.4403.394.camel@hal.voltaire.com> Hi Eitan, On Tue, 2005-08-09 at 15:33, Eitan Zahavi wrote: > Hi Hal > > The bottom line as I understand from your mail is > /usr/local/bin > /usr/local/lib > /usr/local/include/infiniband > > I can do that. Yes, that was the idea; same as other IB userspace projects. > Patch by next Sun (I'm OOO till then). Should I wait for that or use this patch slightly modified ? > The support for configure --prefix= is rudimentary. I do not > think it is wise to disable it. But the default is now well behaved > and will be fixed to /usr/local rather than /usr/local/ib. > > [EZ] Please confirm Yes (no one was saying to disable prefix; just eliminate the /usr/local/ib settings in the management build now).
-- Hal From sean.hefty at intel.com Tue Aug 9 16:23:38 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 9 Aug 2005 16:23:38 -0700 Subject: [openib-general] [PATCH] [ucm] fix for potential deadlock Message-ID: The following patch fixes a potential deadlock condition in the kernel ucm code resulting from trying to destroy a cm_id while in the context of a CM callback thread. The synchronization around the ucm context structure was simplified as a result, and some simple code cleanup is included. (I tried keeping the code cleanup separate, but it was turning out to be more work.) Arlin, can you please test with this and see if your problems (well the ones related to IB anyway ;) go away. Signed-off-by: Sean Hefty Index: core/ucm.c =================================================================== --- core/ucm.c (revision 3044) +++ core/ucm.c (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -73,14 +74,18 @@ static struct semaphore ctx_id_mutex; static struct idr ctx_id_table; static int ctx_id_rover = 0; -static struct ib_ucm_context *ib_ucm_ctx_get(int id) +static struct ib_ucm_context *ib_ucm_ctx_get(struct ib_ucm_file *file, int id) { struct ib_ucm_context *ctx; down(&ctx_id_mutex); ctx = idr_find(&ctx_id_table, id); - if (ctx) - ctx->ref++; + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + atomic_inc(&ctx->ref); up(&ctx_id_mutex); return ctx; @@ -88,22 +93,37 @@ static struct ib_ucm_context *ib_ucm_ctx static void ib_ucm_ctx_put(struct ib_ucm_context *ctx) { + if (atomic_dec_and_test(&ctx->ref)) + wake_up(&ctx->wait); +} + +static ssize_t ib_ucm_destroy_ctx(struct ib_ucm_file *file, int id) +{ + struct ib_ucm_context *ctx; struct ib_ucm_event *uevent; down(&ctx_id_mutex); - - ctx->ref--; - if (ctx->ref) { - up(&ctx_id_mutex); - return; - } + ctx = idr_find(&ctx_id_table, id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); else idr_remove(&ctx_id_table, ctx->id); - up(&ctx_id_mutex); - down(&ctx->file->mutex); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + atomic_dec(&ctx->ref); + wait_event(ctx->wait, !atomic_read(&ctx->ref)); + /* No new events will be generated after destroying the cm_id. */ + if (!IS_ERR(ctx->cm_id)) + ib_destroy_cm_id(ctx->cm_id); + + /* Cleanup events not yet reported to the user. 
*/ + down(&file->mutex); list_del(&ctx->file_list); while (!list_empty(&ctx->events)) { @@ -118,13 +138,10 @@ static void ib_ucm_ctx_put(struct ib_ucm kfree(uevent); } + up(&file->mutex); - up(&ctx->file->mutex); - - ucm_dbg("Destroyed CM ID <%d>\n", ctx->id); - - ib_destroy_cm_id(ctx->cm_id); kfree(ctx); + return 0; } static struct ib_ucm_context *ib_ucm_ctx_alloc(struct ib_ucm_file *file) @@ -136,11 +153,11 @@ static struct ib_ucm_context *ib_ucm_ctx if (!ctx) return NULL; - ctx->ref = 1; /* user reference */ + atomic_set(&ctx->ref, 1); + init_waitqueue_head(&ctx->wait); ctx->file = file; INIT_LIST_HEAD(&ctx->events); - init_MUTEX(&ctx->mutex); list_add_tail(&ctx->file_list, &file->ctxs); @@ -178,8 +195,8 @@ static void ib_ucm_event_path_get(struct if (!kpath || !upath) return; - memcpy(upath->dgid, kpath->dgid.raw, sizeof(union ib_gid)); - memcpy(upath->sgid, kpath->sgid.raw, sizeof(union ib_gid)); + memcpy(upath->dgid, kpath->dgid.raw, sizeof *upath->dgid); + memcpy(upath->sgid, kpath->sgid.raw, sizeof *upath->sgid); upath->dlid = kpath->dlid; upath->slid = kpath->slid; @@ -202,10 +219,11 @@ static void ib_ucm_event_path_get(struct kpath->packet_life_time_selector; } -static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq, +static void ib_ucm_event_req_get(struct ib_ucm_context *ctx, + struct ib_ucm_req_event_resp *ureq, struct ib_cm_req_event_param *kreq) { - ureq->listen_id = (long)kreq->listen_id->context; + ureq->listen_id = ctx->id; ureq->remote_ca_guid = kreq->remote_ca_guid; ureq->remote_qkey = kreq->remote_qkey; @@ -241,34 +259,11 @@ static void ib_ucm_event_rep_get(struct urep->srq = krep->srq; } -static void ib_ucm_event_rej_get(struct ib_ucm_rej_event_resp *urej, - struct ib_cm_rej_event_param *krej) -{ - urej->reason = krej->reason; -} - -static void ib_ucm_event_mra_get(struct ib_ucm_mra_event_resp *umra, - struct ib_cm_mra_event_param *kmra) -{ - umra->timeout = kmra->service_timeout; -} - -static void ib_ucm_event_lap_get(struct 
ib_ucm_lap_event_resp *ulap, - struct ib_cm_lap_event_param *klap) -{ - ib_ucm_event_path_get(&ulap->path, klap->alternate_path); -} - -static void ib_ucm_event_apr_get(struct ib_ucm_apr_event_resp *uapr, - struct ib_cm_apr_event_param *kapr) -{ - uapr->status = kapr->ap_status; -} - -static void ib_ucm_event_sidr_req_get(struct ib_ucm_sidr_req_event_resp *ureq, +static void ib_ucm_event_sidr_req_get(struct ib_ucm_context *ctx, + struct ib_ucm_sidr_req_event_resp *ureq, struct ib_cm_sidr_req_event_param *kreq) { - ureq->listen_id = (long)kreq->listen_id->context; + ureq->listen_id = ctx->id; ureq->pkey = kreq->pkey; } @@ -280,19 +275,18 @@ static void ib_ucm_event_sidr_rep_get(st urep->qpn = krep->qpn; }; -static int ib_ucm_event_process(struct ib_cm_event *evt, +static int ib_ucm_event_process(struct ib_ucm_context *ctx, + struct ib_cm_event *evt, struct ib_ucm_event *uvt) { void *info = NULL; - int result; switch (evt->event) { case IB_CM_REQ_RECEIVED: - ib_ucm_event_req_get(&uvt->resp.u.req_resp, + ib_ucm_event_req_get(ctx, &uvt->resp.u.req_resp, &evt->param.req_rcvd); uvt->data_len = IB_CM_REQ_PRIVATE_DATA_SIZE; - uvt->resp.present |= (evt->param.req_rcvd.primary_path ? - IB_UCM_PRES_PRIMARY : 0); + uvt->resp.present = IB_UCM_PRES_PRIMARY; uvt->resp.present |= (evt->param.req_rcvd.alternate_path ? 
IB_UCM_PRES_ALTERNATE : 0); break; @@ -300,57 +294,46 @@ static int ib_ucm_event_process(struct i ib_ucm_event_rep_get(&uvt->resp.u.rep_resp, &evt->param.rep_rcvd); uvt->data_len = IB_CM_REP_PRIVATE_DATA_SIZE; - break; case IB_CM_RTU_RECEIVED: uvt->data_len = IB_CM_RTU_PRIVATE_DATA_SIZE; uvt->resp.u.send_status = evt->param.send_status; - break; case IB_CM_DREQ_RECEIVED: uvt->data_len = IB_CM_DREQ_PRIVATE_DATA_SIZE; uvt->resp.u.send_status = evt->param.send_status; - break; case IB_CM_DREP_RECEIVED: uvt->data_len = IB_CM_DREP_PRIVATE_DATA_SIZE; uvt->resp.u.send_status = evt->param.send_status; - break; case IB_CM_MRA_RECEIVED: - ib_ucm_event_mra_get(&uvt->resp.u.mra_resp, - &evt->param.mra_rcvd); + uvt->resp.u.mra_resp.timeout = + evt->param.mra_rcvd.service_timeout; uvt->data_len = IB_CM_MRA_PRIVATE_DATA_SIZE; - break; case IB_CM_REJ_RECEIVED: - ib_ucm_event_rej_get(&uvt->resp.u.rej_resp, - &evt->param.rej_rcvd); + uvt->resp.u.rej_resp.reason = evt->param.rej_rcvd.reason; uvt->data_len = IB_CM_REJ_PRIVATE_DATA_SIZE; uvt->info_len = evt->param.rej_rcvd.ari_length; info = evt->param.rej_rcvd.ari; - break; case IB_CM_LAP_RECEIVED: - ib_ucm_event_lap_get(&uvt->resp.u.lap_resp, - &evt->param.lap_rcvd); + ib_ucm_event_path_get(&uvt->resp.u.lap_resp.path, + evt->param.lap_rcvd.alternate_path); uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE; - uvt->resp.present |= (evt->param.lap_rcvd.alternate_path ? 
- IB_UCM_PRES_ALTERNATE : 0); + uvt->resp.present = IB_UCM_PRES_ALTERNATE; break; case IB_CM_APR_RECEIVED: - ib_ucm_event_apr_get(&uvt->resp.u.apr_resp, - &evt->param.apr_rcvd); + uvt->resp.u.apr_resp.status = evt->param.apr_rcvd.ap_status; uvt->data_len = IB_CM_APR_PRIVATE_DATA_SIZE; uvt->info_len = evt->param.apr_rcvd.info_len; info = evt->param.apr_rcvd.apr_info; - break; case IB_CM_SIDR_REQ_RECEIVED: - ib_ucm_event_sidr_req_get(&uvt->resp.u.sidr_req_resp, + ib_ucm_event_sidr_req_get(ctx, &uvt->resp.u.sidr_req_resp, &evt->param.sidr_req_rcvd); uvt->data_len = IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE; - break; case IB_CM_SIDR_REP_RECEIVED: ib_ucm_event_sidr_rep_get(&uvt->resp.u.sidr_rep_resp, @@ -358,43 +341,35 @@ static int ib_ucm_event_process(struct i uvt->data_len = IB_CM_SIDR_REP_PRIVATE_DATA_SIZE; uvt->info_len = evt->param.sidr_rep_rcvd.info_len; info = evt->param.sidr_rep_rcvd.info; - break; default: uvt->resp.u.send_status = evt->param.send_status; - break; } - if (uvt->data_len && evt->private_data) { - + if (uvt->data_len) { uvt->data = kmalloc(uvt->data_len, GFP_KERNEL); - if (!uvt->data) { - result = -ENOMEM; - goto error; - } + if (!uvt->data) + goto err1; memcpy(uvt->data, evt->private_data, uvt->data_len); uvt->resp.present |= IB_UCM_PRES_DATA; } - if (uvt->info_len && info) { - + if (uvt->info_len) { uvt->info = kmalloc(uvt->info_len, GFP_KERNEL); - if (!uvt->info) { - result = -ENOMEM; - goto error; - } + if (!uvt->info) + goto err2; memcpy(uvt->info, info, uvt->info_len); uvt->resp.present |= IB_UCM_PRES_INFO; } - return 0; -error: - kfree(uvt->info); + +err2: kfree(uvt->data); - return result; +err1: + return -ENOMEM; } static int ib_ucm_event_handler(struct ib_cm_id *cm_id, @@ -404,63 +379,42 @@ static int ib_ucm_event_handler(struct i struct ib_ucm_context *ctx; int result = 0; int id; - /* - * lookup correct context based on event type. 
- */ - switch (event->event) { - case IB_CM_REQ_RECEIVED: - id = (long)event->param.req_rcvd.listen_id->context; - break; - case IB_CM_SIDR_REQ_RECEIVED: - id = (long)event->param.sidr_req_rcvd.listen_id->context; - break; - default: - id = (long)cm_id->context; - break; - } - - ucm_dbg("Event. CM ID <%d> event <%d>\n", id, event->event); - ctx = ib_ucm_ctx_get(id); - if (!ctx) - return -ENOENT; + ctx = cm_id->context; if (event->event == IB_CM_REQ_RECEIVED || event->event == IB_CM_SIDR_REQ_RECEIVED) id = IB_UCM_CM_ID_INVALID; + else + id = ctx->id; uevent = kmalloc(sizeof(*uevent), GFP_KERNEL); - if (!uevent) { - result = -ENOMEM; - goto done; - } + if (!uevent) + goto err1; memset(uevent, 0, sizeof(*uevent)); - uevent->resp.id = id; uevent->resp.event = event->event; - result = ib_ucm_event_process(event, uevent); + result = ib_ucm_event_process(ctx, event, uevent); if (result) - goto done; + goto err2; uevent->ctx = ctx; - uevent->cm_id = ((event->event == IB_CM_REQ_RECEIVED || - event->event == IB_CM_SIDR_REQ_RECEIVED ) ? - cm_id : NULL); + uevent->cm_id = (id == IB_UCM_CM_ID_INVALID) ? 
cm_id : NULL; down(&ctx->file->mutex); - list_add_tail(&uevent->file_list, &ctx->file->events); list_add_tail(&uevent->ctx_list, &ctx->events); - wake_up_interruptible(&ctx->file->poll_wait); - up(&ctx->file->mutex); -done: - ctx->error = result; - ib_ucm_ctx_put(ctx); /* func reference */ - return result; + return 0; + +err2: + kfree(uevent); +err1: + /* Destroy new cm_id's */ + return (id == IB_UCM_CM_ID_INVALID); } static ssize_t ib_ucm_event(struct ib_ucm_file *file, @@ -518,9 +472,8 @@ static ssize_t ib_ucm_event(struct ib_uc goto done; } - ctx->cm_id = uevent->cm_id; - ctx->cm_id->cm_handler = ib_ucm_event_handler; - ctx->cm_id->context = (void *)(unsigned long)ctx->id; + ctx->cm_id = uevent->cm_id; + ctx->cm_id->context = ctx; uevent->resp.id = ctx->id; @@ -586,30 +539,29 @@ static ssize_t ib_ucm_create_id(struct i if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; + down(&file->mutex); ctx = ib_ucm_ctx_alloc(file); + up(&file->mutex); if (!ctx) return -ENOMEM; - ctx->cm_id = ib_create_cm_id(ib_ucm_event_handler, - (void *)(unsigned long)ctx->id); - if (!ctx->cm_id) { - result = -ENOMEM; - goto err_cm; + ctx->cm_id = ib_create_cm_id(ib_ucm_event_handler, ctx); + if (IS_ERR(ctx->cm_id)) { + result = PTR_ERR(ctx->cm_id); + goto err; } resp.id = ctx->id; if (copy_to_user((void __user *)(unsigned long)cmd.response, &resp, sizeof(resp))) { result = -EFAULT; - goto err_ret; + goto err; } return 0; -err_ret: - ib_destroy_cm_id(ctx->cm_id); -err_cm: - ib_ucm_ctx_put(ctx); /* user reference */ +err: + ib_ucm_destroy_ctx(file, ctx->id); return result; } @@ -618,19 +570,11 @@ static ssize_t ib_ucm_destroy_id(struct int in_len, int out_len) { struct ib_ucm_destroy_id cmd; - struct ib_ucm_context *ctx; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; - ctx = ib_ucm_ctx_get(cmd.id); - if (!ctx) - return -ENOENT; - - ib_ucm_ctx_put(ctx); /* user reference */ - ib_ucm_ctx_put(ctx); /* func reference */ - - return 0; + return ib_ucm_destroy_ctx(file, 
cmd.id); } static ssize_t ib_ucm_attr_id(struct ib_ucm_file *file, @@ -648,15 +592,9 @@ static ssize_t ib_ucm_attr_id(struct ib_ if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; - ctx = ib_ucm_ctx_get(cmd.id); - if (!ctx) - return -ENOENT; - - down(&ctx->file->mutex); - if (ctx->file != file) { - result = -EINVAL; - goto done; - } + ctx = ib_ucm_ctx_get(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); resp.service_id = ctx->cm_id->service_id; resp.service_mask = ctx->cm_id->service_mask; @@ -667,9 +605,7 @@ static ssize_t ib_ucm_attr_id(struct ib_ &resp, sizeof(resp))) result = -EFAULT; -done: - up(&ctx->file->mutex); - ib_ucm_ctx_put(ctx); /* func reference */ + ib_ucm_ctx_put(ctx); return result; } @@ -684,19 +620,12 @@ static ssize_t ib_ucm_listen(struct ib_u if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; - ctx = ib_ucm_ctx_get(cmd.id); - if (!ctx) - return -ENOENT; - - down(&ctx->file->mutex); - if (ctx->file != file) - result = -EINVAL; - else - result = ib_cm_listen(ctx->cm_id, cmd.service_id, - cmd.service_mask); + ctx = ib_ucm_ctx_get(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); - up(&ctx->file->mutex); - ib_ucm_ctx_put(ctx); /* func reference */ + result = ib_cm_listen(ctx->cm_id, cmd.service_id, cmd.service_mask); + ib_ucm_ctx_put(ctx); return result; } @@ -711,18 +640,12 @@ static ssize_t ib_ucm_establish(struct i if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; - ctx = ib_ucm_ctx_get(cmd.id); - if (!ctx) - return -ENOENT; - - down(&ctx->file->mutex); - if (ctx->file != file) - result = -EINVAL; - else - result = ib_cm_establish(ctx->cm_id); + ctx = ib_ucm_ctx_get(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); - up(&ctx->file->mutex); - ib_ucm_ctx_put(ctx); /* func reference */ + result = ib_cm_establish(ctx->cm_id); + ib_ucm_ctx_put(ctx); return result; } @@ -769,8 +692,8 @@ static int ib_ucm_path_get(struct ib_sa_ return -EFAULT; } - memcpy(sa_path->dgid.raw, ucm_path.dgid, 
sizeof(union ib_gid)); - memcpy(sa_path->sgid.raw, ucm_path.sgid, sizeof(union ib_gid)); + memcpy(sa_path->dgid.raw, ucm_path.dgid, sizeof sa_path->dgid); + memcpy(sa_path->sgid.raw, ucm_path.sgid, sizeof sa_path->sgid); sa_path->dlid = ucm_path.dlid; sa_path->slid = ucm_path.slid; @@ -840,25 +763,17 @@ static ssize_t ib_ucm_send_req(struct ib param.max_cm_retries = cmd.max_cm_retries; param.srq = cmd.srq; - ctx = ib_ucm_ctx_get(cmd.id); - if (!ctx) { - result = -ENOENT; - goto done; - } - - down(&ctx->file->mutex); - if (ctx->file != file) - result = -EINVAL; - else + ctx = ib_ucm_ctx_get(file, cmd.id); + if (!IS_ERR(ctx)) { result = ib_send_cm_req(ctx->cm_id, &param); + ib_ucm_ctx_put(ctx); + } else + result = PTR_ERR(ctx); - up(&ctx->file->mutex); - ib_ucm_ctx_put(ctx); /* func reference */ done: kfree(param.private_data); kfree(param.primary_path); kfree(param.alternate_path); - return result; } @@ -891,23 +806,14 @@ static ssize_t ib_ucm_send_rep(struct ib param.rnr_retry_count = cmd.rnr_retry_count; param.srq = cmd.srq; - ctx = ib_ucm_ctx_get(cmd.id); - if (!ctx) { - result = -ENOENT; - goto done; - } - - down(&ctx->file->mutex); - if (ctx->file != file) - result = -EINVAL; - else + ctx = ib_ucm_ctx_get(file, cmd.id); + if (!IS_ERR(ctx)) { result = ib_send_cm_rep(ctx->cm_id, &param); + ib_ucm_ctx_put(ctx); + } else + result = PTR_ERR(ctx); - up(&ctx->file->mutex); - ib_ucm_ctx_put(ctx); /* func reference */ -done: kfree(param.private_data); - return result; } @@ -929,23 +835,14 @@ static ssize_t ib_ucm_send_private_data( if (result) return result; - ctx = ib_ucm_ctx_get(cmd.id); - if (!ctx) { - result = -ENOENT; - goto done; - } - - down(&ctx->file->mutex); - if (ctx->file != file) - result = -EINVAL; - else + ctx = ib_ucm_ctx_get(file, cmd.id); + if (!IS_ERR(ctx)) { result = func(ctx->cm_id, private_data, cmd.len); + ib_ucm_ctx_put(ctx); + } else + result = PTR_ERR(ctx); - up(&ctx->file->mutex); - ib_ucm_ctx_put(ctx); /* func reference */ -done: kfree(private_data); 
- return result; } @@ -996,26 +893,17 @@ static ssize_t ib_ucm_send_info(struct i if (result) goto done; - ctx = ib_ucm_ctx_get(cmd.id); - if (!ctx) { - result = -ENOENT; - goto done; - } - - down(&ctx->file->mutex); - if (ctx->file != file) - result = -EINVAL; - else - result = func(ctx->cm_id, cmd.status, - info, cmd.info_len, + ctx = ib_ucm_ctx_get(file, cmd.id); + if (!IS_ERR(ctx)) { + result = func(ctx->cm_id, cmd.status, info, cmd.info_len, data, cmd.data_len); + ib_ucm_ctx_put(ctx); + } else + result = PTR_ERR(ctx); - up(&ctx->file->mutex); - ib_ucm_ctx_put(ctx); /* func reference */ done: kfree(data); kfree(info); - return result; } @@ -1049,24 +937,14 @@ static ssize_t ib_ucm_send_mra(struct ib if (result) return result; - ctx = ib_ucm_ctx_get(cmd.id); - if (!ctx) { - result = -ENOENT; - goto done; - } + ctx = ib_ucm_ctx_get(file, cmd.id); + if (!IS_ERR(ctx)) { + result = ib_send_cm_mra(ctx->cm_id, cmd.timeout, data, cmd.len); + ib_ucm_ctx_put(ctx); + } else + result = PTR_ERR(ctx); - down(&ctx->file->mutex); - if (ctx->file != file) - result = -EINVAL; - else - result = ib_send_cm_mra(ctx->cm_id, cmd.timeout, - data, cmd.len); - - up(&ctx->file->mutex); - ib_ucm_ctx_put(ctx); /* func reference */ -done: kfree(data); - return result; } @@ -1091,24 +969,16 @@ static ssize_t ib_ucm_send_lap(struct ib if (result) goto done; - ctx = ib_ucm_ctx_get(cmd.id); - if (!ctx) { - result = -ENOENT; - goto done; - } - - down(&ctx->file->mutex); - if (ctx->file != file) - result = -EINVAL; - else + ctx = ib_ucm_ctx_get(file, cmd.id); + if (!IS_ERR(ctx)) { result = ib_send_cm_lap(ctx->cm_id, path, data, cmd.len); + ib_ucm_ctx_put(ctx); + } else + result = PTR_ERR(ctx); - up(&ctx->file->mutex); - ib_ucm_ctx_put(ctx); /* func reference */ done: kfree(data); kfree(path); - return result; } @@ -1141,24 +1011,16 @@ static ssize_t ib_ucm_send_sidr_req(stru param.max_cm_retries = cmd.max_cm_retries; param.pkey = cmd.pkey; - ctx = ib_ucm_ctx_get(cmd.id); - if (!ctx) { - result = 
-ENOENT; - goto done; - } - - down(&ctx->file->mutex); - if (ctx->file != file) - result = -EINVAL; - else + ctx = ib_ucm_ctx_get(file, cmd.id); + if (!IS_ERR(ctx)) { result = ib_send_cm_sidr_req(ctx->cm_id, &param); + ib_ucm_ctx_put(ctx); + } else + result = PTR_ERR(ctx); - up(&ctx->file->mutex); - ib_ucm_ctx_put(ctx); /* func reference */ done: kfree(param.private_data); kfree(param.path); - return result; } @@ -1185,30 +1047,22 @@ static ssize_t ib_ucm_send_sidr_rep(stru if (result) goto done; - param.qp_num = cmd.qpn; - param.qkey = cmd.qkey; - param.status = cmd.status; - param.info_length = cmd.info_len; - param.private_data_len = cmd.data_len; - - ctx = ib_ucm_ctx_get(cmd.id); - if (!ctx) { - result = -ENOENT; - goto done; - } + param.qp_num = cmd.qpn; + param.qkey = cmd.qkey; + param.status = cmd.status; + param.info_length = cmd.info_len; + param.private_data_len = cmd.data_len; - down(&ctx->file->mutex); - if (ctx->file != file) - result = -EINVAL; - else + ctx = ib_ucm_ctx_get(file, cmd.id); + if (!IS_ERR(ctx)) { result = ib_send_cm_sidr_rep(ctx->cm_id, &param); + ib_ucm_ctx_put(ctx); + } else + result = PTR_ERR(ctx); - up(&ctx->file->mutex); - ib_ucm_ctx_put(ctx); /* func reference */ done: kfree(param.private_data); kfree(param.info); - return result; } @@ -1306,22 +1160,17 @@ static int ib_ucm_close(struct inode *in struct ib_ucm_context *ctx; down(&file->mutex); - while (!list_empty(&file->ctxs)) { ctx = list_entry(file->ctxs.next, struct ib_ucm_context, file_list); - up(&ctx->file->mutex); - ib_ucm_ctx_put(ctx); /* user reference */ + up(&file->mutex); + ib_ucm_destroy_ctx(file, ctx->id); down(&file->mutex); } - up(&file->mutex); - kfree(file); - - ucm_dbg("Deleted struct\n"); return 0; } Index: core/ucm.h =================================================================== --- core/ucm.h (revision 3044) +++ core/ucm.h (working copy) @@ -48,9 +48,7 @@ struct ib_ucm_file { struct semaphore mutex; struct file *filp; - /* - * list of pending events - */ + 
struct list_head ctxs; /* list of active connections */ struct list_head events; /* list of pending events */ wait_queue_head_t poll_wait; @@ -58,12 +56,11 @@ struct ib_ucm_file { struct ib_ucm_context { int id; - int ref; - int error; + wait_queue_head_t wait; + atomic_t ref; struct ib_ucm_file *file; struct ib_cm_id *cm_id; - struct semaphore mutex; struct list_head events; /* list of pending events. */ struct list_head file_list; /* member in file ctx list */ From rolandd at cisco.com Tue Aug 9 20:55:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 9 Aug 2005 20:55:39 -0700 Subject: [openib-general] [PATCH 1/4] SRQs for uverbs In-Reply-To: <2005892055.9C6SZhq4oqh0awp2@cisco.com> Message-ID: <2005892055.Vp0wtgyJ0EKtyBn4@cisco.com> Add SRQ support to userspace verbs module. This adds several commands and associated structures, but it's OK to do this without bumping the ABI version because the commands are added at the end of the list so they don't change the existing numbering. There are two cases to worry about: 1. New kernel, old userspace. This is OK because old userspace simply won't try to use the new SRQ commands. None of the old commands are changed. 2. Old kernel, new userspace. This works perfectly as long as userspace doesn't try to use SRQ commands. If userspace tries to use SRQ commands, it will get EINVAL, which is perfectly reasonable: the kernel doesn't support SRQs, so we couldn't do any better. --- infiniband/core/uverbs_cmd.c (revision 3018) +++ infiniband/core/uverbs_cmd.c (working copy) @@ -724,6 +724,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv struct ib_uobject *uobj; struct ib_pd *pd; struct ib_cq *scq, *rcq; + struct ib_srq *srq; struct ib_qp *qp; struct ib_qp_init_attr attr; int ret; @@ -747,10 +748,12 @@ ssize_t ib_uverbs_create_qp(struct ib_uv pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); scq = idr_find(&ib_uverbs_cq_idr, cmd.send_cq_handle); rcq = idr_find(&ib_uverbs_cq_idr, cmd.recv_cq_handle); + srq = cmd.is_srq ? 
idr_find(&ib_uverbs_srq_idr, cmd.srq_handle) : NULL; if (!pd || pd->uobject->context != file->ucontext || !scq || scq->uobject->context != file->ucontext || - !rcq || rcq->uobject->context != file->ucontext) { + !rcq || rcq->uobject->context != file->ucontext || + (cmd.is_srq && (!srq || srq->uobject->context != file->ucontext))) { ret = -EINVAL; goto err_up; } @@ -759,7 +762,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv attr.qp_context = file; attr.send_cq = scq; attr.recv_cq = rcq; - attr.srq = NULL; + attr.srq = srq; attr.sq_sig_type = cmd.sq_sig_all ? IB_SIGNAL_ALL_WR : IB_SIGNAL_REQ_WR; attr.qp_type = cmd.qp_type; @@ -1004,3 +1007,178 @@ ssize_t ib_uverbs_detach_mcast(struct ib return ret ? ret : in_len; } + +ssize_t ib_uverbs_create_srq(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_create_srq cmd; + struct ib_uverbs_create_srq_resp resp; + struct ib_udata udata; + struct ib_uobject *uobj; + struct ib_pd *pd; + struct ib_srq *srq; + struct ib_srq_init_attr attr; + int ret; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + INIT_UDATA(&udata, buf + sizeof cmd, + (unsigned long) cmd.response + sizeof resp, + in_len - sizeof cmd, out_len - sizeof resp); + + uobj = kmalloc(sizeof *uobj, GFP_KERNEL); + if (!uobj) + return -ENOMEM; + + down(&ib_uverbs_idr_mutex); + + pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); + + if (!pd || pd->uobject->context != file->ucontext) { + ret = -EINVAL; + goto err_up; + } + + attr.event_handler = ib_uverbs_srq_event_handler; + attr.srq_context = file; + attr.attr.max_wr = cmd.max_wr; + attr.attr.max_sge = cmd.max_sge; + attr.attr.srq_limit = cmd.srq_limit; + + uobj->user_handle = cmd.user_handle; + uobj->context = file->ucontext; + + srq = pd->device->create_srq(pd, &attr, &udata); + if (IS_ERR(srq)) { + ret = PTR_ERR(srq); + goto err_up; + } + + srq->device = pd->device; + srq->pd = pd; + srq->uobject = 
uobj; + srq->event_handler = attr.event_handler; + srq->srq_context = attr.srq_context; + atomic_inc(&pd->usecnt); + atomic_set(&srq->usecnt, 0); + + memset(&resp, 0, sizeof resp); + +retry: + if (!idr_pre_get(&ib_uverbs_srq_idr, GFP_KERNEL)) { + ret = -ENOMEM; + goto err_destroy; + } + + ret = idr_get_new(&ib_uverbs_srq_idr, srq, &uobj->id); + + if (ret == -EAGAIN) + goto retry; + if (ret) + goto err_destroy; + + resp.srq_handle = uobj->id; + + spin_lock_irq(&file->ucontext->lock); + list_add_tail(&uobj->list, &file->ucontext->srq_list); + spin_unlock_irq(&file->ucontext->lock); + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) { + ret = -EFAULT; + goto err_list; + } + + up(&ib_uverbs_idr_mutex); + + return in_len; + +err_list: + spin_lock_irq(&file->ucontext->lock); + list_del(&uobj->list); + spin_unlock_irq(&file->ucontext->lock); + +err_destroy: + ib_destroy_srq(srq); + +err_up: + up(&ib_uverbs_idr_mutex); + + kfree(uobj); + return ret; +} + +ssize_t ib_uverbs_modify_srq(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_modify_srq cmd; + struct ib_srq *srq; + struct ib_srq_attr attr; + int ret; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + down(&ib_uverbs_idr_mutex); + + srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); + if (!srq || srq->uobject->context != file->ucontext) { + ret = -EINVAL; + goto out; + } + + attr.max_wr = cmd.max_wr; + attr.max_sge = cmd.max_sge; + attr.srq_limit = cmd.srq_limit; + + ret = ib_modify_srq(srq, &attr, cmd.attr_mask); + +out: + up(&ib_uverbs_idr_mutex); + + return ret ? 
ret : in_len; +} + +ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_destroy_srq cmd; + struct ib_srq *srq; + struct ib_uobject *uobj; + int ret = -EINVAL; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + down(&ib_uverbs_idr_mutex); + + srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); + if (!srq || srq->uobject->context != file->ucontext) + goto out; + + uobj = srq->uobject; + + ret = ib_destroy_srq(srq); + if (ret) + goto out; + + idr_remove(&ib_uverbs_srq_idr, cmd.srq_handle); + + spin_lock_irq(&file->ucontext->lock); + list_del(&uobj->list); + spin_unlock_irq(&file->ucontext->lock); + + kfree(uobj); + +out: + up(&ib_uverbs_idr_mutex); + + return ret ? ret : in_len; +} --- infiniband/core/uverbs.h (revision 3018) +++ infiniband/core/uverbs.h (working copy) @@ -99,10 +99,12 @@ extern struct idr ib_uverbs_mw_idr; extern struct idr ib_uverbs_ah_idr; extern struct idr ib_uverbs_cq_idr; extern struct idr ib_uverbs_qp_idr; +extern struct idr ib_uverbs_srq_idr; void ib_uverbs_comp_handler(struct ib_cq *cq, void *cq_context); void ib_uverbs_cq_event_handler(struct ib_event *event, void *context_ptr); void ib_uverbs_qp_event_handler(struct ib_event *event, void *context_ptr); +void ib_uverbs_srq_event_handler(struct ib_event *event, void *context_ptr); int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, void *addr, size_t size, int write); @@ -131,5 +133,8 @@ IB_UVERBS_DECLARE_CMD(modify_qp); IB_UVERBS_DECLARE_CMD(destroy_qp); IB_UVERBS_DECLARE_CMD(attach_mcast); IB_UVERBS_DECLARE_CMD(detach_mcast); +IB_UVERBS_DECLARE_CMD(create_srq); +IB_UVERBS_DECLARE_CMD(modify_srq); +IB_UVERBS_DECLARE_CMD(destroy_srq); #endif /* UVERBS_H */ --- infiniband/core/uverbs_main.c (revision 3018) +++ infiniband/core/uverbs_main.c (working copy) @@ -69,6 +69,7 @@ DEFINE_IDR(ib_uverbs_mw_idr); DEFINE_IDR(ib_uverbs_ah_idr); DEFINE_IDR(ib_uverbs_cq_idr); 
DEFINE_IDR(ib_uverbs_qp_idr); +DEFINE_IDR(ib_uverbs_srq_idr); static spinlock_t map_lock; static DECLARE_BITMAP(dev_map, IB_UVERBS_MAX_DEVICES); @@ -93,6 +94,9 @@ static ssize_t (*uverbs_cmd_table[])(str [IB_USER_VERBS_CMD_DESTROY_QP] = ib_uverbs_destroy_qp, [IB_USER_VERBS_CMD_ATTACH_MCAST] = ib_uverbs_attach_mcast, [IB_USER_VERBS_CMD_DETACH_MCAST] = ib_uverbs_detach_mcast, + [IB_USER_VERBS_CMD_CREATE_SRQ] = ib_uverbs_create_srq, + [IB_USER_VERBS_CMD_MODIFY_SRQ] = ib_uverbs_modify_srq, + [IB_USER_VERBS_CMD_DESTROY_SRQ] = ib_uverbs_destroy_srq, }; static struct vfsmount *uverbs_event_mnt; @@ -127,7 +131,14 @@ static int ib_dealloc_ucontext(struct ib kfree(uobj); } - /* XXX Free SRQs */ + list_for_each_entry_safe(uobj, tmp, &context->srq_list, list) { + struct ib_srq *srq = idr_find(&ib_uverbs_srq_idr, uobj->id); + idr_remove(&ib_uverbs_srq_idr, uobj->id); + ib_destroy_srq(srq); + list_del(&uobj->list); + kfree(uobj); + } + /* XXX Free MWs */ list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { @@ -345,6 +356,13 @@ void ib_uverbs_qp_event_handler(struct i event->event); } +void ib_uverbs_srq_event_handler(struct ib_event *event, void *context_ptr) +{ + ib_uverbs_async_handler(context_ptr, + event->element.srq->uobject->user_handle, + event->event); +} + static void ib_uverbs_event_handler(struct ib_event_handler *handler, struct ib_event *event) { --- infiniband/include/ib_user_verbs.h (revision 3018) +++ infiniband/include/ib_user_verbs.h (working copy) @@ -78,7 +78,12 @@ enum { IB_USER_VERBS_CMD_POST_SEND, IB_USER_VERBS_CMD_POST_RECV, IB_USER_VERBS_CMD_ATTACH_MCAST, - IB_USER_VERBS_CMD_DETACH_MCAST + IB_USER_VERBS_CMD_DETACH_MCAST, + IB_USER_VERBS_CMD_CREATE_SRQ, + IB_USER_VERBS_CMD_MODIFY_SRQ, + IB_USER_VERBS_CMD_QUERY_SRQ, + IB_USER_VERBS_CMD_DESTROY_SRQ, + IB_USER_VERBS_CMD_POST_SRQ_RECV }; /* @@ -386,4 +391,32 @@ struct ib_uverbs_detach_mcast { __u64 driver_data[0]; }; +struct ib_uverbs_create_srq { + __u64 response; + __u64 user_handle; + __u32 
pd_handle; + __u32 max_wr; + __u32 max_sge; + __u32 srq_limit; + __u64 driver_data[0]; +}; + +struct ib_uverbs_create_srq_resp { + __u32 srq_handle; +}; + +struct ib_uverbs_modify_srq { + __u32 srq_handle; + __u32 attr_mask; + __u32 max_wr; + __u32 max_sge; + __u32 srq_limit; + __u32 reserved; + __u64 driver_data[0]; +}; + +struct ib_uverbs_destroy_srq { + __u32 srq_handle; +}; + #endif /* IB_USER_VERBS_H */ From rolandd at cisco.com Tue Aug 9 20:55:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 9 Aug 2005 20:55:39 -0700 Subject: [openib-general] [PATCH 0/4] SRQ implementation Message-ID: <2005892055.9C6SZhq4oqh0awp2@cisco.com> I just checked in the following series of patches, which implement SRQs (shared receive queues). Both kernel and userspace are working, although I've done much more testing of SRQs from userspace. Still pending is the mthca implementation of the modify SRQ verb and the "SRQ Limit Reached" asynchronous event. Also, the ibv_srq_pingpong example is a little hokey, although it's fine for testing. Please test and look for any regressions that may have snuck in. - R. 
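[Archive note] The ABI argument in the [PATCH 1/4] description (new commands are appended to the end of the enum, so existing command numbers do not move, and an old kernel answers an unknown command with EINVAL) can be sketched roughly as below. This is only an illustration under stated assumptions: the CMD_* names, the handlers, their return values, and the two dispatch helpers are hypothetical stand-ins, not the actual uverbs code, and the NULL-slot check is an assumption of the sketch rather than something the patch shows.

```c
#include <assert.h>	/* only needed by callers that check the results */
#include <errno.h>
#include <stddef.h>

/* Hypothetical command numbers; only the "append at the end" idea is from the patch. */
enum {
	CMD_ATTACH_MCAST,
	CMD_DETACH_MCAST,	/* last command known to the "old" kernel */
	CMD_CREATE_SRQ,		/* appended, so the numbers above are unchanged */
	CMD_MODIFY_SRQ,
	CMD_QUERY_SRQ,		/* reserved number, no handler in this sketch */
	CMD_DESTROY_SRQ,
	CMD_POST_SRQ_RECV	/* reserved number, no handler in this sketch */
};

typedef int (*cmd_handler)(void);

/* Dummy handlers; the distinct return values just identify which one ran. */
static int do_attach(void)      { return 1; }
static int do_detach(void)      { return 2; }
static int do_create_srq(void)  { return 3; }
static int do_modify_srq(void)  { return 4; }
static int do_destroy_srq(void) { return 5; }

/* Table as an "old" kernel would have it: no SRQ entries at all. */
static const cmd_handler old_table[] = {
	[CMD_ATTACH_MCAST] = do_attach,
	[CMD_DETACH_MCAST] = do_detach,
};

/* Table after the patch: old slots untouched, SRQ slots added at the end. */
static const cmd_handler new_table[] = {
	[CMD_ATTACH_MCAST] = do_attach,
	[CMD_DETACH_MCAST] = do_detach,
	[CMD_CREATE_SRQ]   = do_create_srq,
	[CMD_MODIFY_SRQ]   = do_modify_srq,
	[CMD_DESTROY_SRQ]  = do_destroy_srq,
};

/* Reject anything past the end of the table or with an empty slot. */
static int dispatch(const cmd_handler *table, size_t nr, unsigned int cmd)
{
	if (cmd >= nr || !table[cmd])
		return -EINVAL;
	return table[cmd]();
}

int old_kernel_dispatch(unsigned int cmd)
{
	return dispatch(old_table, sizeof old_table / sizeof old_table[0], cmd);
}

int new_kernel_dispatch(unsigned int cmd)
{
	return dispatch(new_table, sizeof new_table / sizeof new_table[0], cmd);
}
```

With these tables, a pre-existing command such as CMD_DETACH_MCAST reaches the same handler through either helper (case 1, new kernel with old userspace), while CMD_CREATE_SRQ sent to old_kernel_dispatch() comes back as -EINVAL (case 2, old kernel with new userspace).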
From rolandd at cisco.com Tue Aug 9 20:55:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 9 Aug 2005 20:55:39 -0700 Subject: [openib-general] [PATCH 4/4] SRQs for libmthca In-Reply-To: <2005892055.pyiDGsM5njBrWc61@cisco.com> Message-ID: <2005892055.fMUyiKiHifMHKA7C@cisco.com> --- libmthca/Makefile.am (revision 3009) +++ libmthca/Makefile.am (working copy) @@ -4,7 +4,7 @@ mthcalibdir = $(libdir)/infiniband mthcalib_LTLIBRARIES = src/mthca.la -src_mthca_la_CFLAGS = -Wall -D_GNU_SOURCE +src_mthca_la_CFLAGS = -g -Wall -D_GNU_SOURCE if HAVE_LD_VERSION_SCRIPT mthca_version_script = -Wl,--version-script=$(srcdir)/src/mthca.map @@ -12,7 +12,8 @@ else mthca_version_script = endif -src_mthca_la_SOURCES = src/ah.c src/cq.c src/memfree.c src/mthca.c src/qp.c src/verbs.c +src_mthca_la_SOURCES = src/ah.c src/cq.c src/memfree.c src/mthca.c src/qp.c \ + src/srq.c src/verbs.c src_mthca_la_LDFLAGS = -avoid-version -module \ $(mthca_version_script) --- libmthca/src/qp.c (revision 3009) +++ libmthca/src/qp.c (working copy) @@ -43,81 +43,7 @@ #include "mthca.h" #include "doorbell.h" - -enum { - MTHCA_SEND_DOORBELL = 0x10, - MTHCA_RECV_DOORBELL = 0x18 -}; - -enum { - MTHCA_NEXT_DBD = 1 << 7, - MTHCA_NEXT_FENCE = 1 << 6, - MTHCA_NEXT_CQ_UPDATE = 1 << 3, - MTHCA_NEXT_EVENT_GEN = 1 << 2, - MTHCA_NEXT_SOLICIT = 1 << 1, -}; - -enum { - MTHCA_INVAL_LKEY = 0x100 -}; - -enum { - MTHCA_INLINE_SEG = 1 << 31 -}; - -struct mthca_next_seg { - uint32_t nda_op; /* [31:6] next WQE [4:0] next opcode */ - uint32_t ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ - uint32_t flags; /* [3] CQ [2] Event [1] Solicit */ - uint32_t imm; /* immediate data */ -}; - -struct mthca_tavor_ud_seg { - uint32_t reserved1; - uint32_t lkey; - uint64_t av_addr; - uint32_t reserved2[4]; - uint32_t dqpn; - uint32_t qkey; - uint32_t reserved3[2]; -}; - -struct mthca_arbel_ud_seg { - uint32_t av[8]; - uint32_t dqpn; - uint32_t qkey; - uint32_t reserved[2]; -}; - -struct mthca_bind_seg { - uint32_t 
flags; /* [31] Atomic [30] rem write [29] rem read */ - uint32_t reserved; - uint32_t new_rkey; - uint32_t lkey; - uint64_t addr; - uint64_t length; -}; - -struct mthca_raddr_seg { - uint64_t raddr; - uint32_t rkey; - uint32_t reserved; -}; - -struct mthca_atomic_seg { - uint64_t swap_add; - uint64_t compare; -}; - -struct mthca_data_seg { - uint32_t byte_count; - uint32_t lkey; - uint64_t addr; -}; - -struct mthca_inline_seg { - uint32_t byte_count; -}; +#include "wqe.h" static const uint8_t mthca_opcode[] = { [IBV_WR_SEND] = MTHCA_OPCODE_SEND, @@ -925,15 +851,21 @@ int mthca_free_err_wqe(struct mthca_qp * { struct mthca_next_seg *next; + /* + * For SRQs, all WQEs generate a CQE, so we're always at the + * end of the doorbell chain. + */ + if (qp->ibv_qp.srq) { + *new_wqe = 0; + return 0; + } + if (is_send) next = get_send_wqe(qp, index); else next = get_recv_wqe(qp, index); - if (mthca_is_memfree(qp->ibv_qp.context)) - *dbd = 1; - else - *dbd = !!(next->ee_nds & htonl(MTHCA_NEXT_DBD)); + *dbd = !!(next->ee_nds & htonl(MTHCA_NEXT_DBD)); if (next->ee_nds & htonl(0x3f)) *new_wqe = (next->nda_op & htonl(~0x3f)) | (next->ee_nds & htonl(0x3f)); --- libmthca/src/verbs.c (revision 3009) +++ libmthca/src/verbs.c (working copy) @@ -265,17 +265,127 @@ int mthca_destroy_cq(struct ibv_cq *cq) return 0; } -static int align_qp_size(struct ibv_context *context, int size) +static int align_queue_size(struct ibv_context *context, int size, int spare) { int ret; + /* + * If someone asks for a 0-sized queue, presumably they're not + * going to use it. So don't mess with their size. 
+ */ + if (!size) + return 0; + if (mthca_is_memfree(context)) { - for (ret = 1; ret < size; ret <<= 1) + for (ret = 1; ret < size + spare; ret <<= 1) ; /* nothing */ return ret; } else - return size; + return size + spare; +} + +struct ibv_srq *mthca_create_srq(struct ibv_pd *pd, + struct ibv_srq_init_attr *attr) +{ + struct mthca_create_srq cmd; + struct mthca_create_srq_resp resp; + struct mthca_srq *srq; + int ret; + + /* Sanity check SRQ size before proceeding */ + if (attr->attr.max_wr > 16 << 20 || attr->attr.max_sge > 64) + return NULL; + + srq = malloc(sizeof *srq); + if (!srq) + return NULL; + + if (pthread_spin_init(&srq->lock, PTHREAD_PROCESS_PRIVATE)) + goto err; + + srq->max = align_queue_size(pd->context, attr->attr.max_wr, 1); + srq->max_gs = attr->attr.max_sge; + srq->last = NULL; + srq->counter = 0; + + if (mthca_alloc_srq_buf(pd, &attr->attr, srq)) + goto err; + + srq->mr = __mthca_reg_mr(pd, srq->buf, srq->buf_size, 0, 0); + if (!srq->mr) + goto err_free; + + srq->mr->context = pd->context; + + if (mthca_is_memfree(pd->context)) { + srq->db_index = mthca_alloc_db(to_mctx(pd->context)->db_tab, + MTHCA_DB_TYPE_SRQ, &srq->db); + if (srq->db_index < 0) + goto err_unreg; + + cmd.db_page = db_align(srq->db); + cmd.db_index = srq->db_index; + } + + cmd.lkey = srq->mr->lkey; + + ret = ibv_cmd_create_srq(pd, &srq->ibv_srq, attr, + &cmd.ibv_cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); + if (ret) + goto err_db; + + srq->srqn = resp.srqn; + + if (mthca_is_memfree(pd->context)) + mthca_set_db_qn(srq->db, MTHCA_DB_TYPE_SRQ, srq->srqn); + + return &srq->ibv_srq; + +err_db: + if (mthca_is_memfree(pd->context)) + mthca_free_db(to_mctx(pd->context)->db_tab, MTHCA_DB_TYPE_SRQ, + srq->db_index); + +err_unreg: + mthca_dereg_mr(srq->mr); + +err_free: + free(srq->wrid); + free(srq->buf); + +err: + free(srq); + + return NULL; +} + +int mthca_modify_srq(struct ibv_srq *srq, + struct ibv_srq_attr *attr, + enum ibv_srq_attr_mask mask) +{ + return -1; +} + +int 
mthca_destroy_srq(struct ibv_srq *srq) +{ + int ret; + + ret = ibv_cmd_destroy_srq(srq); + if (ret) + return ret; + + if (mthca_is_memfree(srq->context)) + mthca_free_db(to_mctx(srq->context)->db_tab, MTHCA_DB_TYPE_SRQ, + to_msrq(srq)->db_index); + + mthca_dereg_mr(to_msrq(srq)->mr); + + free(to_msrq(srq)->buf); + free(to_msrq(srq)->wrid); + + return 0; } struct ibv_qp *mthca_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) @@ -298,15 +408,15 @@ struct ibv_qp *mthca_create_qp(struct ib qp->qpt = attr->qp_type; - qp->sq.max = align_qp_size(pd->context, attr->cap.max_send_wr); + qp->sq.max = align_queue_size(pd->context, attr->cap.max_send_wr, 0); qp->sq.next_ind = 0; qp->sq.last_comp = qp->sq.max - 1; qp->sq.head = 0; qp->sq.tail = 0; qp->sq.last = NULL; - qp->rq.max = align_qp_size(pd->context, attr->cap.max_recv_wr); + qp->rq.max = align_queue_size(pd->context, attr->cap.max_recv_wr, 0); qp->rq.next_ind = 0; qp->rq.last_comp = qp->rq.max - 1; qp->rq.head = 0; --- libmthca/src/mthca.h (revision 3009) +++ libmthca/src/mthca.h (working copy) @@ -142,6 +142,27 @@ struct mthca_cq { int arm_sn; }; +struct mthca_srq { + struct ibv_srq ibv_srq; + void *buf; + void *last; + pthread_spinlock_t lock; + struct ibv_mr *mr; + uint64_t *wrid; + uint32_t srqn; + int max; + int max_gs; + int wqe_shift; + int first_free; + int last_free; + int buf_size; + + /* Next fields are mem-free only */ + int db_index; + uint32_t *db; + uint16_t counter; +}; + struct mthca_wq { pthread_spinlock_t lock; int max; @@ -233,6 +254,11 @@ static inline struct mthca_cq *to_mcq(st return to_mxxx(cq, cq); } +static inline struct mthca_srq *to_msrq(struct ibv_srq *ibsrq) +{ + return to_mxxx(srq, srq); +} + static inline struct mthca_qp *to_mqp(struct ibv_qp *ibqp) { return to_mxxx(qp, qp); @@ -279,6 +305,22 @@ extern int mthca_arbel_arm_cq(struct ibv extern void mthca_arbel_cq_event(struct ibv_cq *cq); extern void mthca_init_cq_buf(struct mthca_cq *cq, int nent); +extern struct ibv_srq 
*mthca_create_srq(struct ibv_pd *pd, + struct ibv_srq_init_attr *attr); +extern int mthca_modify_srq(struct ibv_srq *srq, + struct ibv_srq_attr *attr, + enum ibv_srq_attr_mask mask); +extern int mthca_destroy_srq(struct ibv_srq *srq); +extern int mthca_alloc_srq_buf(struct ibv_pd *pd, struct ibv_srq_attr *attr, + struct mthca_srq *srq); +extern void mthca_free_srq_wqe(struct mthca_srq *srq, uint32_t wqe_addr); +extern int mthca_tavor_post_srq_recv(struct ibv_srq *ibsrq, + struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr); +extern int mthca_arbel_post_srq_recv(struct ibv_srq *ibsrq, + struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr); + extern struct ibv_qp *mthca_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); extern int mthca_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask); --- libmthca/src/wqe.h (revision 0) +++ libmthca/src/wqe.h (revision 0) @@ -0,0 +1,114 @@ +/* + * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef WQE_H +#define WQE_H + +enum { + MTHCA_SEND_DOORBELL = 0x10, + MTHCA_RECV_DOORBELL = 0x18 +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, +}; + +enum { + MTHCA_INLINE_SEG = 1 << 31 +}; + +enum { + MTHCA_INVAL_LKEY = 0x100 +}; + +struct mthca_next_seg { + uint32_t nda_op; /* [31:6] next WQE [4:0] next opcode */ + uint32_t ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + uint32_t flags; /* [3] CQ [2] Event [1] Solicit */ + uint32_t imm; /* immediate data */ +}; + +struct mthca_tavor_ud_seg { + uint32_t reserved1; + uint32_t lkey; + uint64_t av_addr; + uint32_t reserved2[4]; + uint32_t dqpn; + uint32_t qkey; + uint32_t reserved3[2]; +}; + +struct mthca_arbel_ud_seg { + uint32_t av[8]; + uint32_t dqpn; + uint32_t qkey; + uint32_t reserved[2]; +}; + +struct mthca_bind_seg { + uint32_t flags; /* [31] Atomic [30] rem write [29] rem read */ + uint32_t reserved; + uint32_t new_rkey; + uint32_t lkey; + uint64_t addr; + uint64_t length; +}; + +struct mthca_raddr_seg { + uint64_t raddr; + uint32_t rkey; + uint32_t reserved; +}; + +struct mthca_atomic_seg { + uint64_t swap_add; + uint64_t compare; +}; + +struct mthca_data_seg { + uint32_t byte_count; + uint32_t lkey; + uint64_t addr; +}; + +struct mthca_inline_seg { + uint32_t byte_count; +}; + +#endif /* WQE_H */ Property changes on: libmthca/src/wqe.h 
___________________________________________________________________ Name: svn:keywords + Id --- libmthca/src/cq.c (revision 3009) +++ libmthca/src/cq.c (working copy) @@ -234,6 +234,13 @@ static int handle_error_cqe(struct mthca break; } + /* + * Mem-free HCAs always generate one CQE per WQE, even in the + * error case, so we don't have to check the doorbell count, etc. + */ + if (mthca_is_memfree(cq->ibv_cq.context)) + return 0; + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); if (err) return err; @@ -242,12 +249,8 @@ static int handle_error_cqe(struct mthca * If we're at the end of the WQE chain, or we've used up our * doorbell count, free the CQE. Otherwise just update it for * the next poll operation. - * - * This does not apply to mem-free HCAs: they don't use the - * doorbell count field, and so we should always free the CQE. */ - if (mthca_is_memfree(cq->ibv_cq.context) || - !(new_wqe & htonl(0x3f)) || (!cqe->db_cnt && dbd)) + if (!(new_wqe & htonl(0x3f)) || (!cqe->db_cnt && dbd)) return 0; cqe->db_cnt = htons(ntohs(cqe->db_cnt) - dbd); @@ -274,7 +277,9 @@ static inline int mthca_poll_one(struct { struct mthca_wq *wq; struct mthca_cqe *cqe; + struct mthca_srq *srq; uint32_t qpn; + uint32_t wqe; int wqe_index; int is_error; int is_send; @@ -319,18 +324,27 @@ static inline int mthca_poll_one(struct wq = &(*cur_qp)->sq; wqe_index = ((ntohl(cqe->wqe) - (*cur_qp)->send_wqe_offset) >> wq->wqe_shift); wc->wr_id = (*cur_qp)->wrid[wqe_index + (*cur_qp)->rq.max]; + } else if ((*cur_qp)->ibv_qp.srq) { + srq = to_msrq((*cur_qp)->ibv_qp.srq); + wqe = htonl(cqe->wqe); + wq = NULL; + wqe_index = wqe >> srq->wqe_shift; + wc->wr_id = srq->wrid[wqe_index]; + mthca_free_srq_wqe(srq, wqe); } else { wq = &(*cur_qp)->rq; wqe_index = ntohl(cqe->wqe) >> wq->wqe_shift; wc->wr_id = (*cur_qp)->wrid[wqe_index]; } - if (wq->last_comp < wqe_index) - wq->tail += wqe_index - wq->last_comp; - else - wq->tail += wqe_index + wq->max - wq->last_comp; + if (wq) { + if 
(wq->last_comp < wqe_index) + wq->tail += wqe_index - wq->last_comp; + else + wq->tail += wqe_index + wq->max - wq->last_comp; - wq->last_comp = wqe_index; + wq->last_comp = wqe_index; + } if (is_error) { err = handle_error_cqe(cq, *cur_qp, wqe_index, is_send, --- libmthca/src/srq.c (revision 0) +++ libmthca/src/srq.c (revision 0) @@ -0,0 +1,286 @@ +/* + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +#include "mthca.h" +#include "doorbell.h" +#include "wqe.h" + +static void *get_wqe(struct mthca_srq *srq, int n) +{ + return srq->buf + (n << srq->wqe_shift); +} + +void mthca_free_srq_wqe(struct mthca_srq *srq, uint32_t wqe_addr) +{ + int ind; + + ind = wqe_addr >> srq->wqe_shift; + + pthread_spin_lock(&srq->lock); + + if (srq->first_free >= 0) + *(int *) get_wqe(srq, srq->last_free) = ind; + else + srq->first_free = ind; + + *(int *) get_wqe(srq, ind) = -1; + srq->last_free = ind; + + pthread_spin_unlock(&srq->lock); +} + +int mthca_tavor_post_srq_recv(struct ibv_srq *ibsrq, + struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct mthca_srq *srq = to_msrq(ibsrq); + int err = 0; + int first_ind; + int ind; + int next_ind; + int nreq; + int i; + void *wqe; + void *prev_wqe; + + pthread_spin_lock(&srq->lock); + + first_ind = srq->first_free; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + ind = srq->first_free; + + if (ind < 0) { + err = -1; + *bad_wr = wr; + break; + } + + wqe = get_wqe(srq, ind); + next_ind = *(int *) wqe; + prev_wqe = srq->last; + srq->last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + /* flags field will always remain 0 */ + + wqe += sizeof (struct mthca_next_seg); + + if (wr->num_sge > srq->max_gs) { + err = -1; + *bad_wr = wr; + srq->last = prev_wqe; + break; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + htonl(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + htonl(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + htonll(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + } + + if (i < srq->max_gs) { + ((struct mthca_data_seg *) wqe)->byte_count = 0; + ((struct mthca_data_seg *) wqe)->lkey = htonl(MTHCA_INVAL_LKEY); + ((struct mthca_data_seg *) wqe)->addr 
= 0; + } + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + htonl((ind << srq->wqe_shift) | 1); + mb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + htonl(MTHCA_NEXT_DBD); + } + + srq->wrid[ind] = wr->wr_id; + srq->first_free = next_ind; + } + + if (nreq) { + uint32_t doorbell[2]; + + doorbell[0] = htonl(first_ind << srq->wqe_shift); + doorbell[1] = htonl((srq->srqn << 8) | nreq); + + /* + * Make sure that descriptors are written before + * doorbell is rung. + */ + mb(); + + mthca_write64(doorbell, to_mctx(ibsrq->context), MTHCA_RECV_DOORBELL); + } + + pthread_spin_unlock(&srq->lock); + return err; +} + +int mthca_arbel_post_srq_recv(struct ibv_srq *ibsrq, + struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + struct mthca_srq *srq = to_msrq(ibsrq); + int err = 0; + int ind; + int next_ind; + int nreq; + int i; + void *wqe; + + pthread_spin_lock(&srq->lock); + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + ind = srq->first_free; + + if (ind < 0) { + err = -1; + *bad_wr = wr; + break; + } + + wqe = get_wqe(srq, ind); + next_ind = *(int *) wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = + htonl((next_ind << srq->wqe_shift) | 1); + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + /* flags field will always remain 0 */ + + wqe += sizeof (struct mthca_next_seg); + + if (wr->num_sge > srq->max_gs) { + err = -1; + *bad_wr = wr; + break; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + htonl(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + htonl(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + htonll(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + } + + if (i < srq->max_gs) { + ((struct mthca_data_seg *) wqe)->byte_count = 0; + ((struct mthca_data_seg *) wqe)->lkey = htonl(MTHCA_INVAL_LKEY); + ((struct mthca_data_seg *) wqe)->addr = 0; + } + + srq->wrid[ind] = wr->wr_id; + srq->first_free = next_ind; + } + + if (nreq) { + 
srq->counter += nreq; + + /* + * Make sure that descriptors are written before + * we write doorbell record. + */ + mb(); + *srq->db = htonl(srq->counter); + } + + pthread_spin_unlock(&srq->lock); + return err; +} + +int mthca_alloc_srq_buf(struct ibv_pd *pd, struct ibv_srq_attr *attr, + struct mthca_srq *srq) +{ + struct mthca_data_seg *scatter; + void *wqe; + int size; + int i; + + srq->wrid = malloc(srq->max * sizeof (uint64_t)); + if (!srq->wrid) + return -1; + + size = sizeof (struct mthca_next_seg) + + srq->max_gs * sizeof (struct mthca_data_seg); + + for (srq->wqe_shift = 6; 1 << srq->wqe_shift < size; ++srq->wqe_shift) + ; /* nothing */ + + srq->buf_size = srq->max << srq->wqe_shift; + + if (posix_memalign(&srq->buf, to_mdev(pd->context->device)->page_size, + align(srq->buf_size, to_mdev(pd->context->device)->page_size))) { + free(srq->wrid); + return -1; + } + + memset(srq->buf, 0, srq->buf_size); + + /* + * Now initialize the SRQ buffer so that all of the WQEs are + * linked into the list of free WQEs. In addition, set the + * scatter list L_Keys to the sentry value of 0x100. + */ + + for (i = 0; i < srq->max; ++i) { + wqe = get_wqe(srq, i); + + *(int *) wqe = i < srq->max - 1 ? 
i + 1 : -1; + + for (scatter = wqe + sizeof (struct mthca_next_seg); + (void *) scatter < wqe + (1 << srq->wqe_shift); + ++scatter) + scatter->lkey = htonl(MTHCA_INVAL_LKEY); + } + + srq->first_free = 0; + srq->last_free = srq->max - 1; + + return 0; +} Property changes on: libmthca/src/srq.c ___________________________________________________________________ Name: svn:keywords + Id --- libmthca/src/mthca-abi.h (revision 3009) +++ libmthca/src/mthca-abi.h (working copy) @@ -65,6 +65,19 @@ struct mthca_create_cq_resp { __u32 reserved; }; +struct mthca_create_srq { + struct ibv_create_srq ibv_cmd; + __u32 lkey; + __u32 db_index; + __u64 db_page; +}; + +struct mthca_create_srq_resp { + struct ibv_create_srq_resp ibv_resp; + __u32 srqn; + __u32 reserved; +}; + struct mthca_create_qp { struct ibv_create_qp ibv_cmd; __u32 lkey; --- libmthca/src/mthca.c (revision 3009) +++ libmthca/src/mthca.c (working copy) @@ -108,6 +108,8 @@ static struct ibv_context_ops mthca_ctx_ .create_cq = mthca_create_cq, .poll_cq = mthca_poll_cq, .destroy_cq = mthca_destroy_cq, + .create_srq = mthca_create_srq, + .destroy_srq = mthca_destroy_srq, .create_qp = mthca_create_qp, .modify_qp = mthca_modify_qp, .destroy_qp = mthca_destroy_qp, @@ -176,11 +178,13 @@ static struct ibv_context *mthca_alloc_c context->ibv_ctx.ops.cq_event = mthca_arbel_cq_event; context->ibv_ctx.ops.post_send = mthca_arbel_post_send; context->ibv_ctx.ops.post_recv = mthca_arbel_post_recv; + context->ibv_ctx.ops.post_srq_recv = mthca_arbel_post_srq_recv; } else { context->ibv_ctx.ops.req_notify_cq = mthca_tavor_arm_cq; context->ibv_ctx.ops.cq_event = NULL; context->ibv_ctx.ops.post_send = mthca_tavor_post_send; context->ibv_ctx.ops.post_recv = mthca_tavor_post_recv; + context->ibv_ctx.ops.post_srq_recv = mthca_tavor_post_srq_recv; } return &context->ibv_ctx; From rolandd at cisco.com Tue Aug 9 20:55:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 9 Aug 2005 20:55:39 -0700 Subject: [openib-general] [PATCH 2/4] 
SRQs for libibverbs In-Reply-To: <2005892055.Vp0wtgyJ0EKtyBn4@cisco.com> Message-ID: <2005892055.7ma6sTzi0gSqfPDU@cisco.com> --- libibverbs/include/infiniband/driver.h (revision 3041) +++ libibverbs/include/infiniband/driver.h (working copy) @@ -92,6 +92,12 @@ extern int ibv_cmd_create_cq(struct ibv_ struct ibv_create_cq_resp *resp, size_t resp_size); extern int ibv_cmd_destroy_cq(struct ibv_cq *cq); +extern int ibv_cmd_create_srq(struct ibv_pd *pd, + struct ibv_srq *srq, struct ibv_srq_init_attr *attr, + struct ibv_create_srq *cmd, size_t cmd_size, + struct ibv_create_srq_resp *resp, size_t resp_size); +extern int ibv_cmd_destroy_srq(struct ibv_srq *srq); + extern int ibv_cmd_create_qp(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_qp_init_attr *attr, struct ibv_create_qp *cmd, size_t cmd_size); --- libibverbs/include/infiniband/verbs.h (revision 3041) +++ libibverbs/include/infiniband/verbs.h (working copy) @@ -185,16 +185,20 @@ enum ibv_event_type { IBV_EVENT_PORT_ERR, IBV_EVENT_LID_CHANGE, IBV_EVENT_PKEY_CHANGE, - IBV_EVENT_SM_CHANGE + IBV_EVENT_SM_CHANGE, + IBV_EVENT_SRQ_ERR, + IBV_EVENT_SRQ_LIMIT_REACHED, + IBV_EVENT_QP_LAST_WQE_REACHED }; struct ibv_async_event { union { - struct ibv_cq *cq; - struct ibv_qp *qp; - int port_num; - } element; - enum ibv_event_type event_type; + struct ibv_cq *cq; + struct ibv_qp *qp; + struct ibv_srq *srq; + int port_num; + } element; + enum ibv_event_type event_type; }; enum ibv_wc_status { @@ -297,6 +301,22 @@ struct ibv_ah_attr { uint8_t port_num; }; +enum ibv_srq_attr_mask { + IBV_SRQ_MAX_WR = 1 << 0, + IBV_SRQ_LIMIT = 1 << 1, +}; + +struct ibv_srq_attr { + uint32_t max_wr; + uint32_t max_sge; + uint32_t srq_limit; +}; + +struct ibv_srq_init_attr { + void *srq_context; + struct ibv_srq_attr attr; +}; + enum ibv_qp_type { IBV_QPT_RC = 2, IBV_QPT_UC, @@ -446,12 +466,20 @@ struct ibv_recv_wr { int num_sge; }; +struct ibv_srq { + struct ibv_context *context; + void *srq_context; + struct ibv_pd *pd; + uint32_t handle; +}; + 
struct ibv_qp { struct ibv_context *context; void *qp_context; struct ibv_pd *pd; struct ibv_cq *send_cq; struct ibv_cq *recv_cq; + struct ibv_srq *srq; uint32_t handle; uint32_t qp_num; enum ibv_qp_state state; @@ -504,6 +532,15 @@ struct ibv_context_ops { int (*req_notify_cq)(struct ibv_cq *cq, int solicited); void (*cq_event)(struct ibv_cq *cq); int (*destroy_cq)(struct ibv_cq *cq); + struct ibv_srq * (*create_srq)(struct ibv_pd *pd, + struct ibv_srq_init_attr *srq_init_attr); + int (*modify_srq)(struct ibv_srq *srq, + struct ibv_srq_attr *srq_attr, + enum ibv_srq_attr_mask srq_attr_mask); + int (*destroy_srq)(struct ibv_srq *srq); + int (*post_srq_recv)(struct ibv_srq *srq, + struct ibv_recv_wr *recv_wr, + struct ibv_recv_wr **bad_recv_wr); struct ibv_qp * (*create_qp)(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); int (*modify_qp)(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask); @@ -650,6 +687,56 @@ static inline int ibv_req_notify_cq(stru } /** + * ibv_create_srq - Creates a SRQ associated with the specified protection + * domain. + * @pd: The protection domain associated with the SRQ. + * @srq_init_attr: A list of initial attributes required to create the SRQ. + * + * srq_attr->max_wr and srq_attr->max_sge are read to determine the + * requested size of the SRQ, and set to the actual values allocated + * on return. If ibv_create_srq() succeeds, then max_wr and max_sge + * will always be at least as large as the requested values. + */ +struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, + struct ibv_srq_init_attr *srq_init_attr); + +/** + * ibv_modify_srq - Modifies the attributes for the specified SRQ. + * @srq: The SRQ to modify. + * @srq_attr: On input, specifies the SRQ attributes to modify. On output, + * the current values of selected SRQ attributes are returned. + * @srq_attr_mask: A bit-mask used to specify which attributes of the SRQ + * are being modified. 
+ * + * The mask may contain IBV_SRQ_MAX_WR to resize the SRQ and/or + * IBV_SRQ_LIMIT to set the SRQ's limit and request notification when + * the number of receives queued drops below the limit. + */ +int ibv_modify_srq(struct ibv_srq *srq, + struct ibv_srq_attr *srq_attr, + enum ibv_srq_attr_mask srq_attr_mask); + +/** + * ibv_destroy_srq - Destroys the specified SRQ. + * @srq: The SRQ to destroy. + */ +int ibv_destroy_srq(struct ibv_srq *srq); + +/** + * ibv_post_srq_recv - Posts a list of work requests to the specified SRQ. + * @srq: The SRQ to post the work request on. + * @recv_wr: A list of work requests to post on the receive queue. + * @bad_recv_wr: On an immediate failure, this parameter will reference + * the work request that failed to be posted on the SRQ. + */ +static inline int ibv_post_srq_recv(struct ibv_srq *srq, + struct ibv_recv_wr *recv_wr, + struct ibv_recv_wr **bad_recv_wr) +{ + return srq->context->ops.post_srq_recv(srq, recv_wr, bad_recv_wr); +} + +/** * ibv_create_qp - Create a queue pair. 
*/ extern struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, --- libibverbs/include/infiniband/kern-abi.h (revision 3041) +++ libibverbs/include/infiniband/kern-abi.h (working copy) @@ -83,7 +83,12 @@ enum { IB_USER_VERBS_CMD_POST_SEND, IB_USER_VERBS_CMD_POST_RECV, IB_USER_VERBS_CMD_ATTACH_MCAST, - IB_USER_VERBS_CMD_DETACH_MCAST + IB_USER_VERBS_CMD_DETACH_MCAST, + IB_USER_VERBS_CMD_CREATE_SRQ, + IB_USER_VERBS_CMD_MODIFY_SRQ, + IB_USER_VERBS_CMD_QUERY_SRQ, + IB_USER_VERBS_CMD_DESTROY_SRQ, + IB_USER_VERBS_CMD_POST_SRQ_RECV }; /* @@ -425,4 +430,41 @@ struct ibv_detach_mcast { __u64 driver_data[0]; }; +struct ibv_create_srq { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u64 user_handle; + __u32 pd_handle; + __u32 max_wr; + __u32 max_sge; + __u32 srq_limit; + __u64 driver_data[0]; +}; + +struct ibv_create_srq_resp { + __u32 srq_handle; +}; + +struct ibv_modify_srq { + __u32 command; + __u16 in_words; + __u16 out_words; + __u32 srq_handle; + __u32 attr_mask; + __u32 max_wr; + __u32 max_sge; + __u32 srq_limit; + __u32 reserved; + __u64 driver_data[0]; +}; + +struct ibv_destroy_srq { + __u32 command; + __u16 in_words; + __u16 out_words; + __u32 srq_handle; +}; + #endif /* KERN_ABI_H */ --- libibverbs/src/libibverbs.map (revision 3041) +++ libibverbs/src/libibverbs.map (working copy) @@ -17,6 +17,9 @@ IBVERBS_1.0 { ibv_create_cq; ibv_destroy_cq; ibv_get_cq_event; + ibv_create_srq; + ibv_modify_srq; + ibv_destroy_srq; ibv_create_qp; ibv_modify_qp; ibv_destroy_qp; @@ -35,6 +38,9 @@ IBVERBS_1.0 { ibv_cmd_dereg_mr; ibv_cmd_create_cq; ibv_cmd_destroy_cq; + ibv_cmd_create_srq; + ibv_cmd_modify_srq; + ibv_cmd_destroy_srq; ibv_cmd_create_qp; ibv_cmd_modify_qp; ibv_cmd_destroy_qp; --- libibverbs/src/verbs.c (revision 3041) +++ libibverbs/src/verbs.c (working copy) @@ -140,6 +140,32 @@ int ibv_get_cq_event(struct ibv_context return 0; } +struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, + struct ibv_srq_init_attr *srq_init_attr) +{ + struct ibv_srq *srq = 
pd->context->ops.create_srq(pd, srq_init_attr); + + if (srq) { + srq->context = pd->context; + srq->srq_context = srq_init_attr->srq_context; + srq->pd = pd; + } + + return srq; +} + +int ibv_modify_srq(struct ibv_srq *srq, + struct ibv_srq_attr *srq_attr, + enum ibv_srq_attr_mask srq_attr_mask) +{ + return srq->context->ops.modify_srq(srq, srq_attr, srq_attr_mask); +} + +int ibv_destroy_srq(struct ibv_srq *srq) +{ + return srq->context->ops.destroy_srq(srq); +} + struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr) { @@ -151,10 +177,12 @@ struct ibv_qp *ibv_create_qp(struct ibv_ qp->pd = pd; qp->send_cq = qp_init_attr->send_cq; qp->recv_cq = qp_init_attr->recv_cq; + qp->srq = qp_init_attr->srq; } return qp; } + int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask) { --- libibverbs/src/cmd.c (revision 3041) +++ libibverbs/src/cmd.c (working copy) @@ -288,6 +288,39 @@ int ibv_cmd_destroy_cq(struct ibv_cq *cq return 0; } +int ibv_cmd_create_srq(struct ibv_pd *pd, + struct ibv_srq *srq, struct ibv_srq_init_attr *attr, + struct ibv_create_srq *cmd, size_t cmd_size, + struct ibv_create_srq_resp *resp, size_t resp_size) +{ + IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_SRQ, resp, resp_size); + cmd->user_handle = (uintptr_t) srq; + cmd->pd_handle = pd->handle; + cmd->max_wr = attr->attr.max_wr; + cmd->max_sge = attr->attr.max_sge; + cmd->srq_limit = attr->attr.srq_limit; + + if (write(pd->context->cmd_fd, cmd, cmd_size) != cmd_size) + return errno; + + srq->handle = resp->srq_handle; + + return 0; +} + +int ibv_cmd_destroy_srq(struct ibv_srq *srq) +{ + struct ibv_destroy_srq cmd; + + IBV_INIT_CMD(&cmd, sizeof cmd, DESTROY_SRQ); + cmd.srq_handle = srq->handle; + + if (write(srq->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) + return errno; + + return 0; +} + int ibv_cmd_create_qp(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_qp_init_attr *attr, struct ibv_create_qp *cmd, size_t cmd_size) @@ 
-299,6 +332,7 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, cmd->pd_handle = pd->handle; cmd->send_cq_handle = attr->send_cq->handle; cmd->recv_cq_handle = attr->recv_cq->handle; + cmd->srq_handle = attr->srq ? attr->srq->handle : 0; cmd->max_send_wr = attr->cap.max_send_wr; cmd->max_recv_wr = attr->cap.max_recv_wr; cmd->max_send_sge = attr->cap.max_send_sge; @@ -306,7 +340,7 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, cmd->max_inline_data = attr->cap.max_inline_data; cmd->sq_sig_all = attr->sq_sig_all; cmd->qp_type = attr->qp_type; - cmd->is_srq = 0; + cmd->is_srq = !!attr->srq; if (write(pd->context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; --- libibverbs/Makefile.am (revision 3041) +++ libibverbs/Makefile.am (working copy) @@ -21,7 +21,7 @@ src_libibverbs_la_DEPENDENCIES = $(srcdi bin_PROGRAMS = examples/ibv_devices examples/ibv_devinfo \ examples/ibv_asyncwatch examples/ibv_rc_pingpong examples/ibv_uc_pingpong \ - examples/ibv_ud_pingpong + examples/ibv_ud_pingpong examples/ibv_srq_pingpong examples_ibv_devices_SOURCES = examples/device_list.c examples_ibv_devices_LDADD = $(top_builddir)/src/libibverbs.la examples_ibv_devinfo_SOURCES = examples/devinfo.c @@ -32,6 +32,8 @@ examples_ibv_uc_pingpong_SOURCES = examp examples_ibv_uc_pingpong_LDADD = $(top_builddir)/src/libibverbs.la examples_ibv_ud_pingpong_SOURCES = examples/ud_pingpong.c examples_ibv_ud_pingpong_LDADD = $(top_builddir)/src/libibverbs.la +examples_ibv_srq_pingpong_SOURCES = examples/srq_pingpong.c +examples_ibv_srq_pingpong_LDADD = $(top_builddir)/src/libibverbs.la examples_ibv_asyncwatch_SOURCES = examples/asyncwatch.c examples_ibv_asyncwatch_LDADD = $(top_builddir)/src/libibverbs.la --- libibverbs/examples/srq_pingpong.c (revision 0) +++ libibverbs/examples/srq_pingpong.c (revision 0) @@ -0,0 +1,772 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +enum { + PINGPONG_RECV_WRID = 1, + PINGPONG_SEND_WRID = 2, + + MAX_QP = 256, +}; + +static int page_size; + +struct pingpong_context { + struct ibv_context *context; + struct ibv_pd *pd; + struct ibv_mr *mr; + struct ibv_cq *cq; + struct ibv_srq *srq; + struct ibv_qp *qp[MAX_QP]; + void *buf; + int size; + int num_qp; + int rx_depth; +}; + +struct pingpong_dest { + int lid; + int qpn; + int psn; +}; + +static uint16_t pp_get_local_lid(struct pingpong_context *ctx, int port) +{ + struct ibv_port_attr attr; + + if (ibv_query_port(ctx->context, port, &attr)) + return 0; + + return attr.lid; +} + +static int pp_connect_ctx(struct pingpong_context *ctx, int port, + const struct pingpong_dest *my_dest, + const struct pingpong_dest *dest) +{ + int i; + + for (i = 0; i < ctx->num_qp; ++i) { + struct ibv_qp_attr attr = { + .qp_state = IBV_QPS_RTR, + .path_mtu = IBV_MTU_1024, + .dest_qp_num = dest[i].qpn, + .rq_psn = dest[i].psn, + .max_dest_rd_atomic = 1, + .min_rnr_timer = 12, + .ah_attr = { + .is_global = 0, + .dlid = dest[i].lid, + .sl = 0, + .src_path_bits = 0, + .port_num = port + } + }; + if (ibv_modify_qp(ctx->qp[i], &attr, + IBV_QP_STATE | + IBV_QP_AV | + IBV_QP_PATH_MTU | + IBV_QP_DEST_QPN | + IBV_QP_RQ_PSN | + IBV_QP_MAX_DEST_RD_ATOMIC | + IBV_QP_MIN_RNR_TIMER)) { + fprintf(stderr, "Failed to modify QP[%d] to RTR\n", i); + return 1; + } + + attr.qp_state = IBV_QPS_RTS; + attr.timeout = 14; + attr.retry_cnt = 7; + attr.rnr_retry = 7; + attr.sq_psn = my_dest[i].psn; + attr.max_rd_atomic = 1; + if (ibv_modify_qp(ctx->qp[i], &attr, + IBV_QP_STATE | + IBV_QP_TIMEOUT | + IBV_QP_RETRY_CNT | + IBV_QP_RNR_RETRY | + IBV_QP_SQ_PSN | + IBV_QP_MAX_QP_RD_ATOMIC)) { + fprintf(stderr, "Failed to modify QP[%d] to RTS\n", i); + return 1; + } + } + + return 0; 
+} + +static struct pingpong_dest *pp_client_exch_dest(const char *servername, int port, + const struct pingpong_dest *my_dest) +{ + struct addrinfo *res, *t; + struct addrinfo hints = { + .ai_family = AF_UNSPEC, + .ai_socktype = SOCK_STREAM + }; + char *service; + char msg[ sizeof "0000:000000:000000"]; + int n; + int r; + int i; + int sockfd = -1; + struct pingpong_dest *rem_dest = NULL; + + asprintf(&service, "%d", port); + n = getaddrinfo(servername, service, &hints, &res); + + if (n < 0) { + fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), servername, port); + return NULL; + } + + for (t = res; t; t = t->ai_next) { + sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); + if (sockfd >= 0) { + if (!connect(sockfd, t->ai_addr, t->ai_addrlen)) + break; + close(sockfd); + sockfd = -1; + } + } + + freeaddrinfo(res); + + if (sockfd < 0) { + fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port); + return NULL; + } + + for (i = 0; i < MAX_QP; ++i) { + sprintf(msg, "%04x:%06x:%06x", my_dest[i].lid, my_dest[i].qpn, my_dest[i].psn); + if (write(sockfd, msg, sizeof msg) != sizeof msg) { + fprintf(stderr, "Couldn't send local address\n"); + goto out; + } + } + + rem_dest = malloc(MAX_QP * sizeof *rem_dest); + if (!rem_dest) + goto out; + + for (i = 0; i < MAX_QP; ++i) { + n = 0; + while (n < sizeof msg) { + r = read(sockfd, msg + n, sizeof msg - n); + if (r < 0) { + perror("client read"); + fprintf(stderr, "%d/%d: Couldn't read remote address [%d]\n", + n, (int) sizeof msg, i); + goto out; + } + n += r; + } + + sscanf(msg, "%x:%x:%x", + &rem_dest[i].lid, &rem_dest[i].qpn, &rem_dest[i].psn); + } + + write(sockfd, "done", sizeof "done"); + +out: + close(sockfd); + return rem_dest; +} + +static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, + int ib_port, int port, + const struct pingpong_dest *my_dest) +{ + struct addrinfo *res, *t; + struct addrinfo hints = { + .ai_flags = AI_PASSIVE, + .ai_family = AF_UNSPEC, + .ai_socktype = 
SOCK_STREAM + }; + char *service; + char msg[ sizeof "0000:000000:000000"]; + int n; + int r; + int i; + int sockfd = -1, connfd; + struct pingpong_dest *rem_dest = NULL; + + asprintf(&service, "%d", port); + n = getaddrinfo(NULL, service, &hints, &res); + + if (n < 0) { + fprintf(stderr, "%s for port %d\n", gai_strerror(n), port); + return NULL; + } + + for (t = res; t; t = t->ai_next) { + sockfd = socket(t->ai_family, t->ai_socktype, t->ai_protocol); + if (sockfd >= 0) { + n = 1; + + setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &n, sizeof n); + + if (!bind(sockfd, t->ai_addr, t->ai_addrlen)) + break; + close(sockfd); + sockfd = -1; + } + } + + freeaddrinfo(res); + + if (sockfd < 0) { + fprintf(stderr, "Couldn't listen to port %d\n", port); + return NULL; + } + + listen(sockfd, 1); + connfd = accept(sockfd, NULL, 0); + close(sockfd); + if (connfd < 0) { + fprintf(stderr, "accept() failed\n"); + return NULL; + } + + rem_dest = malloc(MAX_QP *sizeof *rem_dest); + if (!rem_dest) + goto out; + + for (i = 0; i < MAX_QP; ++i) { + n = 0; + while (n < sizeof msg) { + r = read(connfd, msg + n, sizeof msg - n); + if (r < 0) { + perror("server read"); + fprintf(stderr, "%d/%d: Couldn't read remote address [%d]\n", + n, (int) sizeof msg, i); + goto out; + } + n += r; + } + + sscanf(msg, "%x:%x:%x", + &rem_dest[i].lid, &rem_dest[i].qpn, &rem_dest[i].psn); + } + + if (pp_connect_ctx(ctx, ib_port, my_dest, rem_dest)) { + fprintf(stderr, "Couldn't connect to remote QP\n"); + free(rem_dest); + rem_dest = NULL; + goto out; + } + + for (i = 0; i < MAX_QP; ++i) { + sprintf(msg, "%04x:%06x:%06x", my_dest[i].lid, my_dest[i].qpn, my_dest[i].psn); + if (write(connfd, msg, sizeof msg) != sizeof msg) { + fprintf(stderr, "Couldn't send local address\n"); + free(rem_dest); + rem_dest = NULL; + goto out; + } + } + + read(connfd, msg, sizeof msg); + +out: + close(connfd); + return rem_dest; +} + +static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, + int num_qp, int 
rx_depth, int port) +{ + struct pingpong_context *ctx; + int i; + + ctx = malloc(sizeof *ctx); + if (!ctx) + return NULL; + + ctx->size = size; + ctx->num_qp = num_qp; + ctx->rx_depth = rx_depth; + + ctx->buf = memalign(page_size, size); + if (!ctx->buf) { + fprintf(stderr, "Couldn't allocate work buf.\n"); + return NULL; + } + + memset(ctx->buf, 0, size); + + ctx->context = ibv_open_device(ib_dev); + if (!ctx->context) { + fprintf(stderr, "Couldn't get context for %s\n", + ibv_get_device_name(ib_dev)); + return NULL; + } + + ctx->pd = ibv_alloc_pd(ctx->context); + if (!ctx->pd) { + fprintf(stderr, "Couldn't allocate PD\n"); + return NULL; + } + + ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size, IBV_ACCESS_LOCAL_WRITE); + if (!ctx->mr) { + fprintf(stderr, "Couldn't allocate MR\n"); + return NULL; + } + + ctx->cq = ibv_create_cq(ctx->context, rx_depth + 1, NULL); + if (!ctx->cq) { + fprintf(stderr, "Couldn't create CQ\n"); + return NULL; + } + + { + struct ibv_srq_init_attr attr = { + .attr = { + .max_wr = rx_depth, + .max_sge = 1 + } + }; + + ctx->srq = ibv_create_srq(ctx->pd, &attr); + if (!ctx->srq) { + fprintf(stderr, "Couldn't create SRQ\n"); + return NULL; + } + } + + for (i = 0; i < num_qp; ++i) { + struct ibv_qp_init_attr attr = { + .send_cq = ctx->cq, + .recv_cq = ctx->cq, + .srq = ctx->srq, + .cap = { + .max_send_wr = 4, + .max_send_sge = 1, + }, + .qp_type = IBV_QPT_RC + }; + + ctx->qp[i] = ibv_create_qp(ctx->pd, &attr); + if (!ctx->qp[i]) { + fprintf(stderr, "Couldn't create QP[%d]\n", i); + return NULL; + } + } + + for (i = 0; i < num_qp; ++i) { + struct ibv_qp_attr attr; + + attr.qp_state = IBV_QPS_INIT; + attr.pkey_index = 0; + attr.port_num = port; + attr.qp_access_flags = 0; + + if (ibv_modify_qp(ctx->qp[i], &attr, + IBV_QP_STATE | + IBV_QP_PKEY_INDEX | + IBV_QP_PORT | + IBV_QP_ACCESS_FLAGS)) { + fprintf(stderr, "Failed to modify QP[%d] to INIT\n", i); + return NULL; + } + } + + return ctx; +} + +static int pp_post_recv(struct pingpong_context *ctx, 
int n) +{ + struct ibv_sge list = { + .addr = (uintptr_t) ctx->buf, + .length = ctx->size, + .lkey = ctx->mr->lkey + }; + struct ibv_recv_wr wr = { + .wr_id = PINGPONG_RECV_WRID, + .sg_list = &list, + .num_sge = 1, + }; + struct ibv_recv_wr *bad_wr; + int i; + + for (i = 0; i < n; ++i) + if (ibv_post_srq_recv(ctx->srq, &wr, &bad_wr)) + break; + + return i; +} + +static int pp_post_send(struct pingpong_context *ctx, int qp_index) +{ + struct ibv_sge list = { + .addr = (uintptr_t) ctx->buf, + .length = ctx->size, + .lkey = ctx->mr->lkey + }; + struct ibv_send_wr wr = { + .wr_id = PINGPONG_SEND_WRID, + .sg_list = &list, + .num_sge = 1, + .opcode = IBV_WR_SEND, + .send_flags = IBV_SEND_SIGNALED, + }; + struct ibv_send_wr *bad_wr; + + return ibv_post_send(ctx->qp[qp_index], &wr, &bad_wr); +} + +static int find_qp(int qpn, struct pingpong_context *ctx, int num_qp) +{ + int i; + + for (i = 0; i < num_qp; ++i) + if (ctx->qp[i]->qp_num == qpn) + return i; + + return -1; +} + +static void usage(const char *argv0) +{ + printf("Usage:\n"); + printf(" %s start a server and wait for connection\n", argv0); + printf(" %s connect to server at \n", argv0); + printf("\n"); + printf("Options:\n"); + printf(" -p, --port= listen on/connect to port (default 18515)\n"); + printf(" -d, --ib-dev= use IB device (default first device found)\n"); + printf(" -i, --ib-port= use port of IB device (default 1)\n"); + printf(" -s, --size= size of message to exchange (default 4096)\n"); + printf(" -q, --num-qp= number of QPs to use (default 16)\n"); + printf(" -r, --rx-depth= number of receives to post at a time (default 500)\n"); + printf(" -n, --iters= number of exchanges per QP(default 1000)\n"); + printf(" -e, --events sleep on CQ events (default poll)\n"); +} + +int main(int argc, char *argv[]) +{ + struct dlist *dev_list; + struct ibv_device *ib_dev; + struct pingpong_context *ctx; + struct pingpong_dest my_dest[MAX_QP]; + struct pingpong_dest *rem_dest; + struct timeval start, end; + char 
*ib_devname = NULL; + char *servername = NULL; + int port = 18515; + int ib_port = 1; + int size = 4096; + int num_qp = 16; + int rx_depth = 500; + int iters = 1000; + int use_event = 0; + int routs; + int rcnt, scnt; + int i; + + srand48(getpid() * time(NULL)); + + while (1) { + int c; + + static struct option long_options[] = { + { .name = "port", .has_arg = 1, .val = 'p' }, + { .name = "ib-dev", .has_arg = 1, .val = 'd' }, + { .name = "ib-port", .has_arg = 1, .val = 'i' }, + { .name = "size", .has_arg = 1, .val = 's' }, + { .name = "num-qp", .has_arg = 1, .val = 'q' }, + { .name = "rx-depth", .has_arg = 1, .val = 'r' }, + { .name = "iters", .has_arg = 1, .val = 'n' }, + { .name = "events", .has_arg = 0, .val = 'e' }, + { 0 } + }; + + c = getopt_long(argc, argv, "p:d:i:s:q:r:n:e", long_options, NULL); + if (c == -1) + break; + + switch (c) { + case 'p': + port = strtol(optarg, NULL, 0); + if (port < 0 || port > 65535) { + usage(argv[0]); + return 1; + } + break; + + case 'd': + ib_devname = strdupa(optarg); + break; + + case 'i': + ib_port = strtol(optarg, NULL, 0); + if (ib_port < 0) { + usage(argv[0]); + return 1; + } + break; + + case 's': + size = strtol(optarg, NULL, 0); + break; + + case 'q': + num_qp = strtol(optarg, NULL, 0); + break; + + case 'r': + rx_depth = strtol(optarg, NULL, 0); + break; + + case 'n': + iters = strtol(optarg, NULL, 0); + break; + + case 'e': + ++use_event; + break; + + default: + usage(argv[0]); + return 1; + } + } + + if (optind == argc - 1) + servername = strdupa(argv[optind]); + else if (optind < argc) { + usage(argv[0]); + return 1; + } + + page_size = sysconf(_SC_PAGESIZE); + + dev_list = ibv_get_devices(); + + dlist_start(dev_list); + if (!ib_devname) { + ib_dev = dlist_next(dev_list); + if (!ib_dev) { + fprintf(stderr, "No IB devices found\n"); + return 1; + } + } else { + dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) + break; + if (!ib_dev) { + 
fprintf(stderr, "IB device %s not found\n", ib_devname); + return 1; + } + } + + ctx = pp_init_ctx(ib_dev, size, num_qp, rx_depth, ib_port); + if (!ctx) + return 1; + + routs = pp_post_recv(ctx, ctx->rx_depth); + if (routs < ctx->rx_depth) { + fprintf(stderr, "Couldn't post receive (%d)\n", routs); + return 1; + } + + for (i = 0; i < num_qp; ++i) { + my_dest[i].qpn = ctx->qp[i]->qp_num; + my_dest[i].psn = lrand48() & 0xffffff; + my_dest[i].lid = pp_get_local_lid(ctx, ib_port); + if (!my_dest[i].lid) { + fprintf(stderr, "Couldn't get local LID\n"); + return 1; + } + + printf(" local address: LID 0x%04x, QPN 0x%06x, PSN 0x%06x\n", + my_dest[i].lid, my_dest[i].qpn, my_dest[i].psn); + } + + if (servername) + rem_dest = pp_client_exch_dest(servername, port, my_dest); + else + rem_dest = pp_server_exch_dest(ctx, ib_port, port, my_dest); + + if (!rem_dest) + return 1; + + for (i = 0; i < num_qp; ++i) + printf(" remote address: LID 0x%04x, QPN 0x%06x, PSN 0x%06x\n", + rem_dest[i].lid, rem_dest[i].qpn, rem_dest[i].psn); + + if (servername) + if (pp_connect_ctx(ctx, ib_port, my_dest, rem_dest)) + return 1; + + if (use_event) + if (ibv_req_notify_cq(ctx->cq, 0)) { + fprintf(stderr, "Couldn't request CQ notification\n"); + return 1; + } + + if (servername) + for (i = 0; i < num_qp; ++i) + if (pp_post_send(ctx, i)) { + fprintf(stderr, "Couldn't post send\n"); + return 1; + } + + if (gettimeofday(&start, NULL)) { + perror("gettimeofday"); + return 1; + } + + rcnt = scnt = 0; + while (rcnt < iters || scnt < iters) { + if (use_event) { + struct ibv_cq *ev_cq; + void *ev_ctx; + + if (ibv_get_cq_event(ctx->context, 0, &ev_cq, &ev_ctx)) { + fprintf(stderr, "Failed to get cq_event\n"); + return 1; + } + + if (ev_cq != ctx->cq) { + fprintf(stderr, "CQ event for unknown CQ %p\n", ev_cq); + return 1; + } + + if (ibv_req_notify_cq(ctx->cq, 0)) { + fprintf(stderr, "Couldn't request CQ notification\n"); + return 1; + } + } + + { + struct ibv_wc wc[2]; + int ne, j; + + do { + ne = 
ibv_poll_cq(ctx->cq, 2, wc); + } while (!use_event && ne < 1); + + if (ne < 0) { + fprintf(stderr, "poll CQ failed %d\n", ne); + return 1; + } + + for (i = 0; i < ne; ++i) { + if (wc[i].status != IBV_WC_SUCCESS) { + fprintf(stderr, "Failed status %d for wr_id %d\n", + wc[i].status, (int) wc[i].wr_id); + return 1; + } + + switch ((int) wc[i].wr_id) { + case PINGPONG_SEND_WRID: + ++scnt; + break; + + case PINGPONG_RECV_WRID: + if (--routs <= 1) { + routs += pp_post_recv(ctx, ctx->rx_depth - routs); + if (routs < ctx->rx_depth) { + fprintf(stderr, + "Couldn't post receive (%d)\n", + routs); + return 1; + } + } + + if (scnt < iters) { + j = find_qp(wc[i].qp_num, ctx, num_qp); + if (j < 0) { + fprintf(stderr, "Couldn't find QPN %06x\n", + wc[i].qp_num); + return 1; + } + + if (pp_post_send(ctx, j)) { + fprintf(stderr, "Couldn't post send\n"); + return 1; + } + } + + ++rcnt; + break; + + default: + fprintf(stderr, "Completion for unknown wr_id %d\n", + (int) wc[i].wr_id); + return 1; + } + } + } + } + + if (gettimeofday(&end, NULL)) { + perror("gettimeofday"); + return 1; + } + + { + float usec = (end.tv_sec - start.tv_sec) * 1000000 + + (end.tv_usec - start.tv_usec); + long long bytes = (long long) size * iters * 2; + + printf("%lld bytes in %.2f seconds = %.2f Mbit/sec\n", + bytes, usec / 1000000., bytes * 8. 
/ usec); + printf("%d iters in %.2f seconds = %.2f usec/iter\n", + iters, usec / 1000000., usec / iters); + } + + return 0; +} Property changes on: libibverbs/examples/srq_pingpong.c ___________________________________________________________________ Name: svn:keywords + Id From rolandd at cisco.com Tue Aug 9 20:55:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 9 Aug 2005 20:55:39 -0700 Subject: [openib-general] [PATCH 3/4] SRQs for kernel mthca In-Reply-To: <2005892055.7ma6sTzi0gSqfPDU@cisco.com> Message-ID: <2005892055.pyiDGsM5njBrWc61@cisco.com> --- infiniband/hw/mthca/mthca_user.h (revision 3041) +++ infiniband/hw/mthca/mthca_user.h (working copy) @@ -69,6 +69,17 @@ struct mthca_create_cq_resp { __u32 reserved; }; +struct mthca_create_srq { + __u32 lkey; + __u32 db_index; + __u64 db_page; +}; + +struct mthca_create_srq_resp { + __u32 srqn; + __u32 reserved; +}; + struct mthca_create_qp { __u32 lkey; __u32 reserved; --- infiniband/hw/mthca/mthca_dev.h (revision 3041) +++ infiniband/hw/mthca/mthca_dev.h (working copy) @@ -217,6 +217,13 @@ struct mthca_cq_table { struct mthca_icm_table *table; }; +struct mthca_srq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array srq; + struct mthca_icm_table *table; +}; + struct mthca_qp_table { struct mthca_alloc alloc; u32 rdb_base; @@ -298,6 +305,7 @@ struct mthca_dev { struct mthca_mr_table mr_table; struct mthca_eq_table eq_table; struct mthca_cq_table cq_table; + struct mthca_srq_table srq_table; struct mthca_qp_table qp_table; struct mthca_av_table av_table; struct mthca_mcg_table mcg_table; @@ -360,12 +368,18 @@ int mthca_array_set(struct mthca_array * void mthca_array_clear(struct mthca_array *array, int index); int mthca_array_init(struct mthca_array *array, int nent); void mthca_array_cleanup(struct mthca_array *array, int nent); +int mthca_buf_alloc(struct mthca_dev *dev, int size, int max_direct, + union mthca_buf *buf, int *is_direct, struct mthca_pd *pd, + int hca_write, 
struct mthca_mr *mr); +void mthca_buf_free(struct mthca_dev *dev, int size, union mthca_buf *buf, + int is_direct, struct mthca_mr *mr); int mthca_init_uar_table(struct mthca_dev *dev); int mthca_init_pd_table(struct mthca_dev *dev); int mthca_init_mr_table(struct mthca_dev *dev); int mthca_init_eq_table(struct mthca_dev *dev); int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_srq_table(struct mthca_dev *dev); int mthca_init_qp_table(struct mthca_dev *dev); int mthca_init_av_table(struct mthca_dev *dev); int mthca_init_mcg_table(struct mthca_dev *dev); @@ -375,6 +389,7 @@ void mthca_cleanup_pd_table(struct mthca void mthca_cleanup_mr_table(struct mthca_dev *dev); void mthca_cleanup_eq_table(struct mthca_dev *dev); void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_srq_table(struct mthca_dev *dev); void mthca_cleanup_qp_table(struct mthca_dev *dev); void mthca_cleanup_av_table(struct mthca_dev *dev); void mthca_cleanup_mcg_table(struct mthca_dev *dev); @@ -425,7 +440,19 @@ int mthca_init_cq(struct mthca_dev *dev, void mthca_free_cq(struct mthca_dev *dev, struct mthca_cq *cq); void mthca_cq_event(struct mthca_dev *dev, u32 cqn); -void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn, + struct mthca_srq *srq); + +int mthca_alloc_srq(struct mthca_dev *dev, struct mthca_pd *pd, + struct ib_srq_attr *attr, struct mthca_srq *srq); +void mthca_free_srq(struct mthca_dev *dev, struct mthca_srq *srq); +void mthca_srq_event(struct mthca_dev *dev, u32 srqn, + enum ib_event_type event_type); +void mthca_free_srq_wqe(struct mthca_srq *srq, u32 wqe_addr); +int mthca_tavor_post_srq_recv(struct ib_srq *srq, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_arbel_post_srq_recv(struct ib_srq *srq, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); void mthca_qp_event(struct mthca_dev *dev, u32 qpn, enum ib_event_type event_type); --- 
infiniband/hw/mthca/mthca_main.c (revision 3041) +++ infiniband/hw/mthca/mthca_main.c (working copy) @@ -252,6 +252,8 @@ static int __devinit mthca_init_tavor(st profile = default_profile; profile.num_uar = dev_lim.uar_size / PAGE_SIZE; profile.uarc_size = 0; + if (mdev->mthca_flags & MTHCA_FLAG_SRQ) + profile.num_srq = dev_lim.max_srqs; err = mthca_make_profile(mdev, &profile, &dev_lim, &init_hca); if (err < 0) @@ -423,15 +425,29 @@ static int __devinit mthca_init_icm(stru } mdev->cq_table.table = mthca_alloc_icm_table(mdev, init_hca->cqc_base, - dev_lim->cqc_entry_sz, - mdev->limits.num_cqs, - mdev->limits.reserved_cqs, 0); + dev_lim->cqc_entry_sz, + mdev->limits.num_cqs, + mdev->limits.reserved_cqs, 0); if (!mdev->cq_table.table) { mthca_err(mdev, "Failed to map CQ context memory, aborting.\n"); err = -ENOMEM; goto err_unmap_rdb; } + if (mdev->mthca_flags & MTHCA_FLAG_SRQ) { + mdev->srq_table.table = + mthca_alloc_icm_table(mdev, init_hca->srqc_base, + dev_lim->srq_entry_sz, + mdev->limits.num_srqs, + mdev->limits.reserved_srqs, 0); + if (!mdev->srq_table.table) { + mthca_err(mdev, "Failed to map SRQ context memory, " + "aborting.\n"); + err = -ENOMEM; + goto err_unmap_cq; + } + } + /* * It's not strictly required, but for simplicity just map the * whole multicast group table now. 
The table isn't very big @@ -447,11 +463,15 @@ static int __devinit mthca_init_icm(stru if (!mdev->mcg_table.table) { mthca_err(mdev, "Failed to map MCG context memory, aborting.\n"); err = -ENOMEM; - goto err_unmap_cq; + goto err_unmap_srq; } return 0; +err_unmap_srq: + if (mdev->mthca_flags & MTHCA_FLAG_SRQ) + mthca_free_icm_table(mdev, mdev->srq_table.table); + err_unmap_cq: mthca_free_icm_table(mdev, mdev->cq_table.table); @@ -531,6 +551,8 @@ static int __devinit mthca_init_arbel(st profile = default_profile; profile.num_uar = dev_lim.uar_size / PAGE_SIZE; profile.num_udav = 0; + if (mdev->mthca_flags & MTHCA_FLAG_SRQ) + profile.num_srq = dev_lim.max_srqs; icm_size = mthca_make_profile(mdev, &profile, &dev_lim, &init_hca); if ((int) icm_size < 0) { @@ -557,6 +579,8 @@ static int __devinit mthca_init_arbel(st return 0; err_free_icm: + if (mdev->mthca_flags & MTHCA_FLAG_SRQ) + mthca_free_icm_table(mdev, mdev->srq_table.table); mthca_free_icm_table(mdev, mdev->cq_table.table); mthca_free_icm_table(mdev, mdev->qp_table.rdb_table); mthca_free_icm_table(mdev, mdev->qp_table.eqp_table); @@ -586,6 +610,8 @@ static void mthca_close_hca(struct mthca mthca_CLOSE_HCA(mdev, 0, &status); if (mthca_is_memfree(mdev)) { + if (mdev->mthca_flags & MTHCA_FLAG_SRQ) + mthca_free_icm_table(mdev, mdev->srq_table.table); mthca_free_icm_table(mdev, mdev->cq_table.table); mthca_free_icm_table(mdev, mdev->qp_table.rdb_table); mthca_free_icm_table(mdev, mdev->qp_table.eqp_table); @@ -730,11 +756,18 @@ static int __devinit mthca_setup_hca(str goto err_cmd_poll; } + err = mthca_init_srq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "shared receive queue table, aborting.\n"); + goto err_cq_table_free; + } + err = mthca_init_qp_table(dev); if (err) { mthca_err(dev, "Failed to initialize " "queue pair table, aborting.\n"); - goto err_cq_table_free; + goto err_srq_table_free; } err = mthca_init_av_table(dev); @@ -759,6 +792,9 @@ err_av_table_free: err_qp_table_free: 
mthca_cleanup_qp_table(dev); +err_srq_table_free: + mthca_cleanup_srq_table(dev); + err_cq_table_free: mthca_cleanup_cq_table(dev); @@ -1045,6 +1081,7 @@ err_cleanup: mthca_cleanup_mcg_table(mdev); mthca_cleanup_av_table(mdev); mthca_cleanup_qp_table(mdev); + mthca_cleanup_srq_table(mdev); mthca_cleanup_cq_table(mdev); mthca_cmd_use_polling(mdev); mthca_cleanup_eq_table(mdev); @@ -1094,6 +1131,7 @@ static void __devexit mthca_remove_one(s mthca_cleanup_mcg_table(mdev); mthca_cleanup_av_table(mdev); mthca_cleanup_qp_table(mdev); + mthca_cleanup_srq_table(mdev); mthca_cleanup_cq_table(mdev); mthca_cmd_use_polling(mdev); mthca_cleanup_eq_table(mdev); --- infiniband/hw/mthca/mthca_provider.c (revision 3041) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -421,6 +421,80 @@ static int mthca_ah_destroy(struct ib_ah return 0; } +static struct ib_srq *mthca_create_srq(struct ib_pd *pd, + struct ib_srq_init_attr *init_attr, + struct ib_udata *udata) +{ + struct mthca_create_srq ucmd; + struct mthca_ucontext *context = NULL; + struct mthca_srq *srq; + int err; + + srq = kmalloc(sizeof *srq, GFP_KERNEL); + if (!srq) + return ERR_PTR(-ENOMEM); + + if (pd->uobject) { + context = to_mucontext(pd->uobject->context); + + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { + err = -EFAULT; + goto err_free; + } + + err = mthca_map_user_db(to_mdev(pd->device), &context->uar, + context->db_tab, ucmd.db_index, + ucmd.db_page); + + if (err) + goto err_free; + + srq->mr.ibmr.lkey = ucmd.lkey; + srq->db_index = ucmd.db_index; + } + + err = mthca_alloc_srq(to_mdev(pd->device), to_mpd(pd), + &init_attr->attr, srq); + + if (err && pd->uobject) + mthca_unmap_user_db(to_mdev(pd->device), &context->uar, + context->db_tab, ucmd.db_index); + + if (err) + goto err_free; + + if (context && ib_copy_to_udata(udata, &srq->srqn, sizeof (__u32))) { + mthca_free_srq(to_mdev(pd->device), srq); + err = -EFAULT; + goto err_free; + } + + return &srq->ibsrq; + +err_free: + kfree(srq); + + return ERR_PTR(err); 
+} + +static int mthca_destroy_srq(struct ib_srq *srq) +{ + struct mthca_ucontext *context; + + if (srq->uobject) { + context = to_mucontext(srq->uobject->context); + + mthca_unmap_user_db(to_mdev(srq->device), &context->uar, + context->db_tab, to_msrq(srq)->db_index); + } + + mthca_free_srq(to_mdev(srq->device), to_msrq(srq)); + kfree(srq); + + return 0; +} + static struct ib_qp *mthca_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *init_attr, struct ib_udata *udata) @@ -999,6 +1070,17 @@ int mthca_register_device(struct mthca_d dev->ib_dev.dealloc_pd = mthca_dealloc_pd; dev->ib_dev.create_ah = mthca_ah_create; dev->ib_dev.destroy_ah = mthca_ah_destroy; + + if (dev->mthca_flags & MTHCA_FLAG_SRQ) { + dev->ib_dev.create_srq = mthca_create_srq; + dev->ib_dev.destroy_srq = mthca_destroy_srq; + + if (mthca_is_memfree(dev)) + dev->ib_dev.post_srq_recv = mthca_arbel_post_srq_recv; + else + dev->ib_dev.post_srq_recv = mthca_tavor_post_srq_recv; + } + dev->ib_dev.create_qp = mthca_create_qp; dev->ib_dev.modify_qp = mthca_modify_qp; dev->ib_dev.destroy_qp = mthca_destroy_qp; --- infiniband/hw/mthca/mthca_provider.h (revision 3041) +++ infiniband/hw/mthca/mthca_provider.h (working copy) @@ -51,6 +51,11 @@ struct mthca_buf_list { DECLARE_PCI_UNMAP_ADDR(mapping) }; +union mthca_buf { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; +}; + struct mthca_uar { unsigned long pfn; int index; @@ -187,14 +192,34 @@ struct mthca_cq { __be32 *arm_db; int arm_sn; - union { - struct mthca_buf_list direct; - struct mthca_buf_list *page_list; - } queue; + union mthca_buf queue; struct mthca_mr mr; wait_queue_head_t wait; }; +struct mthca_srq { + struct ib_srq ibsrq; + spinlock_t lock; + atomic_t refcount; + int srqn; + int max; + int max_gs; + int wqe_shift; + int first_free; + int last_free; + u16 counter; /* Arbel only */ + int db_index; /* Arbel only */ + __be32 *db; /* Arbel only */ + void *last; + + int is_direct; + u64 *wrid; + union mthca_buf queue; + struct 
mthca_mr mr; + + wait_queue_head_t wait; +}; + struct mthca_wq { spinlock_t lock; int max; @@ -228,10 +253,7 @@ struct mthca_qp { int send_wqe_offset; u64 *wrid; - union { - struct mthca_buf_list direct; - struct mthca_buf_list *page_list; - } queue; + union mthca_buf queue; wait_queue_head_t wait; }; @@ -278,6 +300,11 @@ static inline struct mthca_cq *to_mcq(st return container_of(ibcq, struct mthca_cq, ibcq); } +static inline struct mthca_srq *to_msrq(struct ib_srq *ibsrq) +{ + return container_of(ibsrq, struct mthca_srq, ibsrq); +} + static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) { return container_of(ibqp, struct mthca_qp, ibqp); --- infiniband/hw/mthca/mthca_profile.c (revision 3041) +++ infiniband/hw/mthca/mthca_profile.c (working copy) @@ -102,6 +102,7 @@ u64 mthca_make_profile(struct mthca_dev profile[MTHCA_RES_UARC].size = request->uarc_size; profile[MTHCA_RES_QP].num = request->num_qp; + profile[MTHCA_RES_SRQ].num = request->num_srq; profile[MTHCA_RES_EQP].num = request->num_qp; profile[MTHCA_RES_RDB].num = request->num_qp * request->rdb_per_qp; profile[MTHCA_RES_CQ].num = request->num_cq; --- infiniband/hw/mthca/mthca_wqe.h (revision 0) +++ infiniband/hw/mthca/mthca_wqe.h (revision 0) @@ -0,0 +1,114 @@ +/* + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef MTHCA_WQE_H +#define MTHCA_WQE_H + +#include + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +enum { + MTHCA_INVAL_LKEY = 0x100 +}; + +struct mthca_next_seg { + __be32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + __be32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + __be32 flags; /* [3] CQ [2] Event [1] Solicit */ + __be32 imm; /* immediate data */ +}; + +struct mthca_tavor_ud_seg { + u32 reserved1; + __be32 lkey; + __be64 av_addr; + u32 reserved2[4]; + __be32 dqpn; + __be32 qkey; + u32 reserved3[2]; +}; + +struct mthca_arbel_ud_seg { + __be32 av[8]; + __be32 dqpn; + __be32 qkey; + u32 reserved[2]; +}; + +struct mthca_bind_seg { + __be32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + __be32 new_rkey; + __be32 lkey; + __be64 addr; + __be64 length; +}; + +struct mthca_raddr_seg { + __be64 raddr; + __be32 rkey; + u32 reserved; +}; + +struct mthca_atomic_seg { + __be64 swap_add; + __be64 compare; +}; + +struct mthca_data_seg { + __be32 byte_count; + __be32 lkey; + __be64 addr; +}; + +struct mthca_mlx_seg { + 
__be32 nda_op; + __be32 nds; + __be32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + __be16 rlid; + __be16 vcrc; +}; + +#endif /* MTHCA_WQE_H */ Property changes on: infiniband/hw/mthca/mthca_wqe.h ___________________________________________________________________ Name: svn:keywords + Id --- infiniband/hw/mthca/mthca_cmd.c (revision 3041) +++ infiniband/hw/mthca/mthca_cmd.c (working copy) @@ -109,6 +109,7 @@ enum { CMD_SW2HW_SRQ = 0x35, CMD_HW2SW_SRQ = 0x36, CMD_QUERY_SRQ = 0x37, + CMD_ARM_SRQ = 0x40, /* QP/EE commands */ CMD_RST2INIT_QPEE = 0x19, @@ -1032,6 +1033,8 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max SRQs: %d, reserved SRQs: %d, entry size: %d\n", + dev_lim->max_srqs, dev_lim->reserved_srqs, dev_lim->srq_entry_sz); mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", @@ -1503,6 +1506,27 @@ int mthca_HW2SW_CQ(struct mthca_dev *dev CMD_TIME_CLASS_A, status); } +int mthca_SW2HW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, + int srq_num, u8 *status) +{ + return mthca_cmd(dev, mailbox->dma, srq_num, 0, CMD_SW2HW_SRQ, + CMD_TIME_CLASS_A, status); +} + +int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, + int srq_num, u8 *status) +{ + return mthca_cmd_box(dev, 0, mailbox->dma, srq_num, 0, + CMD_HW2SW_SRQ, + CMD_TIME_CLASS_A, status); +} + +int mthca_ARM_SRQ(struct mthca_dev *dev, int srq_num, int limit, u8 *status) +{ + return mthca_cmd(dev, limit, srq_num, 0, CMD_ARM_SRQ, + CMD_TIME_CLASS_B, status); +} + int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, int is_ee, struct mthca_mailbox *mailbox, u32 optmask, u8 *status) --- infiniband/hw/mthca/mthca_cq.c (revision 3041) +++ 
infiniband/hw/mthca/mthca_cq.c (working copy) @@ -224,7 +224,8 @@ void mthca_cq_event(struct mthca_dev *de cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); } -void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn, + struct mthca_srq *srq) { struct mthca_cq *cq; struct mthca_cqe *cqe; @@ -265,8 +266,11 @@ void mthca_cq_clean(struct mthca_dev *de */ while (prod_index > cq->cons_index) { cqe = get_cqe(cq, (prod_index - 1) & cq->ibcq.cqe); - if (cqe->my_qpn == cpu_to_be32(qpn)) + if (cqe->my_qpn == cpu_to_be32(qpn)) { + if (srq) + mthca_free_srq_wqe(srq, be32_to_cpu(cqe->wqe)); ++nfreed; + } else if (nfreed) memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & cq->ibcq.cqe), @@ -367,6 +371,13 @@ static int handle_error_cqe(struct mthca break; } + /* + * Mem-free HCAs always generate one CQE per WQE, even in the + * error case, so we don't have to check the doorbell count, etc. + */ + if (mthca_is_memfree(dev)) + return 0; + err = mthca_free_err_wqe(dev, qp, is_send, wqe_index, &dbd, &new_wqe); if (err) return err; @@ -375,12 +386,8 @@ static int handle_error_cqe(struct mthca * If we're at the end of the WQE chain, or we've used up our * doorbell count, free the CQE. Otherwise just update it for * the next poll operation. - * - * This does not apply to mem-free HCAs: they don't use the - * doorbell count field, and so we should always free the CQE. 
*/ - if (mthca_is_memfree(dev) || - !(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) return 0; cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); @@ -452,23 +459,27 @@ static inline int mthca_poll_one(struct >> wq->wqe_shift); entry->wr_id = (*cur_qp)->wrid[wqe_index + (*cur_qp)->rq.max]; + } else if ((*cur_qp)->ibqp.srq) { + struct mthca_srq *srq = to_msrq((*cur_qp)->ibqp.srq); + u32 wqe = be32_to_cpu(cqe->wqe); + wq = NULL; + wqe_index = wqe >> srq->wqe_shift; + entry->wr_id = srq->wrid[wqe_index]; + mthca_free_srq_wqe(srq, wqe); } else { wq = &(*cur_qp)->rq; wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; entry->wr_id = (*cur_qp)->wrid[wqe_index]; } - if (wq->last_comp < wqe_index) - wq->tail += wqe_index - wq->last_comp; - else - wq->tail += wqe_index + wq->max - wq->last_comp; + if (wq) { + if (wq->last_comp < wqe_index) + wq->tail += wqe_index - wq->last_comp; + else + wq->tail += wqe_index + wq->max - wq->last_comp; - wq->last_comp = wqe_index; - - if (0) - mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", - is_send ? 
"Send" : "Receive", - (*cur_qp)->qpn, wqe_index, wq->max); + wq->last_comp = wqe_index; + } if (is_error) { err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, @@ -639,113 +650,8 @@ int mthca_arbel_arm_cq(struct ib_cq *ibc static void mthca_free_cq_buf(struct mthca_dev *dev, struct mthca_cq *cq) { - int i; - int size; - - if (cq->is_direct) - dma_free_coherent(&dev->pdev->dev, - (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE, - cq->queue.direct.buf, - pci_unmap_addr(&cq->queue.direct, - mapping)); - else { - size = (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE; - for (i = 0; i < (size + PAGE_SIZE - 1) / PAGE_SIZE; ++i) - if (cq->queue.page_list[i].buf) - dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, - cq->queue.page_list[i].buf, - pci_unmap_addr(&cq->queue.page_list[i], - mapping)); - - kfree(cq->queue.page_list); - } -} - -static int mthca_alloc_cq_buf(struct mthca_dev *dev, int size, - struct mthca_cq *cq) -{ - int err = -ENOMEM; - int npages, shift; - u64 *dma_list = NULL; - dma_addr_t t; - int i; - - if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { - cq->is_direct = 1; - npages = 1; - shift = get_order(size) + PAGE_SHIFT; - - cq->queue.direct.buf = dma_alloc_coherent(&dev->pdev->dev, - size, &t, GFP_KERNEL); - if (!cq->queue.direct.buf) - return -ENOMEM; - - pci_unmap_addr_set(&cq->queue.direct, mapping, t); - - memset(cq->queue.direct.buf, 0, size); - - while (t & ((1 << shift) - 1)) { - --shift; - npages *= 2; - } - - dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); - if (!dma_list) - goto err_free; - - for (i = 0; i < npages; ++i) - dma_list[i] = t + i * (1 << shift); - } else { - cq->is_direct = 0; - npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; - shift = PAGE_SHIFT; - - dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); - if (!dma_list) - return -ENOMEM; - - cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, - GFP_KERNEL); - if (!cq->queue.page_list) - goto err_out; - - for (i = 0; i < npages; ++i) - cq->queue.page_list[i].buf = 
NULL; - - for (i = 0; i < npages; ++i) { - cq->queue.page_list[i].buf = - dma_alloc_coherent(&dev->pdev->dev, PAGE_SIZE, - &t, GFP_KERNEL); - if (!cq->queue.page_list[i].buf) - goto err_free; - - dma_list[i] = t; - pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); - - memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); - } - } - - err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, - dma_list, shift, npages, - 0, size, - MTHCA_MPT_FLAG_LOCAL_WRITE | - MTHCA_MPT_FLAG_LOCAL_READ, - &cq->mr); - if (err) - goto err_free; - - kfree(dma_list); - - return 0; - -err_free: - mthca_free_cq_buf(dev, cq); - -err_out: - kfree(dma_list); - - return err; + mthca_buf_free(dev, (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE, + &cq->queue, cq->is_direct, &cq->mr); } int mthca_init_cq(struct mthca_dev *dev, int nent, @@ -797,7 +703,9 @@ int mthca_init_cq(struct mthca_dev *dev, cq_context = mailbox->buf; if (cq->is_kernel) { - err = mthca_alloc_cq_buf(dev, size, cq); + err = mthca_buf_alloc(dev, size, MTHCA_MAX_DIRECT_CQ_SIZE, + &cq->queue, &cq->is_direct, + &dev->driver_pd, 1, &cq->mr); if (err) goto err_out_mailbox; @@ -858,10 +766,8 @@ int mthca_init_cq(struct mthca_dev *dev, return 0; err_out_free_mr: - if (cq->is_kernel) { - mthca_free_mr(dev, &cq->mr); + if (cq->is_kernel) mthca_free_cq_buf(dev, cq); - } err_out_mailbox: mthca_free_mailbox(dev, mailbox); @@ -929,7 +835,6 @@ void mthca_free_cq(struct mthca_dev *dev wait_event(cq->wait, !atomic_read(&cq->refcount)); if (cq->is_kernel) { - mthca_free_mr(dev, &cq->mr); mthca_free_cq_buf(dev, cq); if (mthca_is_memfree(dev)) { mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index); --- infiniband/hw/mthca/mthca_profile.h (revision 3041) +++ infiniband/hw/mthca/mthca_profile.h (working copy) @@ -42,6 +42,7 @@ struct mthca_profile { int num_qp; int rdb_per_qp; + int num_srq; int num_cq; int num_mcg; int num_mpt; --- infiniband/hw/mthca/mthca_srq.c (revision 0) +++ infiniband/hw/mthca/mthca_srq.c (revision 0) @@ -0,0 +1,578 @@ +/* + 
* Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_memfree.h" +#include "mthca_wqe.h" + +enum { + MTHCA_MAX_DIRECT_SRQ_SIZE = 4 * PAGE_SIZE +}; + +struct mthca_tavor_srq_context { + __be64 wqe_base_ds; /* low 6 bits is descriptor size */ + __be32 state_pd; + __be32 lkey; + __be32 uar; + __be32 wqe_cnt; + u32 reserved[2]; +}; + +struct mthca_arbel_srq_context { + __be32 state_logsize_srqn; + __be32 lkey; + __be32 db_index; + __be32 logstride_usrpage; + __be64 wqe_base; + __be32 eq_pd; + __be16 limit_watermark; + __be16 wqe_cnt; + u16 reserved1; + __be16 wqe_counter; + u32 reserved2[3]; +}; + +static void *get_wqe(struct mthca_srq *srq, int n) +{ + if (srq->is_direct) + return srq->queue.direct.buf + (n << srq->wqe_shift); + else + return srq->queue.page_list[(n << srq->wqe_shift) >> PAGE_SHIFT].buf + + ((n << srq->wqe_shift) & (PAGE_SIZE - 1)); +} + +static void mthca_tavor_init_srq_context(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_srq *srq, + struct mthca_tavor_srq_context *context) +{ + memset(context, 0, sizeof *context); + + context->wqe_base_ds = cpu_to_be64(1 << (srq->wqe_shift - 4)); + context->state_pd = cpu_to_be32(pd->pd_num); + context->lkey = cpu_to_be32(srq->mr.ibmr.lkey); + + if (pd->ibpd.uobject) + context->uar = + cpu_to_be32(to_mucontext(pd->ibpd.uobject->context)->uar.index); + else + context->uar = cpu_to_be32(dev->driver_uar.index); +} + +static void mthca_arbel_init_srq_context(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_srq *srq, + struct mthca_arbel_srq_context *context) +{ + int logsize; + + memset(context, 0, sizeof *context); + + logsize = long_log2(srq->max) + srq->wqe_shift; + context->state_logsize_srqn = cpu_to_be32(logsize << 24 | srq->srqn); + context->lkey = cpu_to_be32(srq->mr.ibmr.lkey); + context->db_index = cpu_to_be32(srq->db_index); + context->logstride_usrpage = cpu_to_be32((srq->wqe_shift - 4) << 29); + if (pd->ibpd.uobject) + context->logstride_usrpage |= 
+ cpu_to_be32(to_mucontext(pd->ibpd.uobject->context)->uar.index); + else + context->logstride_usrpage |= cpu_to_be32(dev->driver_uar.index); + context->eq_pd = cpu_to_be32(MTHCA_EQ_ASYNC << 24 | pd->pd_num); +} + +static void mthca_free_srq_buf(struct mthca_dev *dev, struct mthca_srq *srq) +{ + mthca_buf_free(dev, srq->max << srq->wqe_shift, &srq->queue, + srq->is_direct, &srq->mr); + kfree(srq->wrid); +} + +static int mthca_alloc_srq_buf(struct mthca_dev *dev, struct mthca_pd *pd, + struct mthca_srq *srq) +{ + struct mthca_data_seg *scatter; + void *wqe; + int err; + int i; + + if (pd->ibpd.uobject) + return 0; + + srq->wrid = kmalloc(srq->max * sizeof (u64), GFP_KERNEL); + if (!srq->wrid) + return -ENOMEM; + + err = mthca_buf_alloc(dev, srq->max << srq->wqe_shift, + MTHCA_MAX_DIRECT_SRQ_SIZE, + &srq->queue, &srq->is_direct, pd, 1, &srq->mr); + if (err) { + kfree(srq->wrid); + return err; + } + + /* + * Now initialize the SRQ buffer so that all of the WQEs are + * linked into the list of free WQEs. In addition, set the + * scatter list L_Keys to the sentry value of 0x100. + */ + for (i = 0; i < srq->max; ++i) { + wqe = get_wqe(srq, i); + + *(int *) wqe = i < srq->max - 1 ? 
i + 1 : -1; + + for (scatter = wqe + sizeof (struct mthca_next_seg); + (void *) scatter < wqe + (1 << srq->wqe_shift); + ++scatter) + scatter->lkey = cpu_to_be32(MTHCA_INVAL_LKEY); + } + + return 0; +} + +int mthca_alloc_srq(struct mthca_dev *dev, struct mthca_pd *pd, + struct ib_srq_attr *attr, struct mthca_srq *srq) +{ + struct mthca_mailbox *mailbox; + u8 status; + int ds; + int err; + + /* Sanity check SRQ size before proceeding */ + if (attr->max_wr > 16 << 20 || attr->max_sge > 64) + return -EINVAL; + + srq->max = attr->max_wr; + srq->max_gs = attr->max_sge; + srq->last = NULL; + srq->counter = 0; + + if (mthca_is_memfree(dev)) + srq->max = roundup_pow_of_two(srq->max + 1); + + ds = max(64UL, + roundup_pow_of_two(sizeof (struct mthca_next_seg) + + srq->max_gs * sizeof (struct mthca_data_seg))); + srq->wqe_shift = long_log2(ds); + + srq->srqn = mthca_alloc(&dev->srq_table.alloc); + if (srq->srqn == -1) + return -ENOMEM; + + if (mthca_is_memfree(dev)) { + err = mthca_table_get(dev, dev->srq_table.table, srq->srqn); + if (err) + goto err_out; + + if (!pd->ibpd.uobject) { + srq->db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_SRQ, + srq->srqn, &srq->db); + if (srq->db_index < 0) { + err = -ENOMEM; + goto err_out_icm; + } + } + } + + mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); + if (IS_ERR(mailbox)) { + err = PTR_ERR(mailbox); + goto err_out_db; + } + + err = mthca_alloc_srq_buf(dev, pd, srq); + if (err) + goto err_out_mailbox; + + spin_lock_init(&srq->lock); + atomic_set(&srq->refcount, 1); + init_waitqueue_head(&srq->wait); + + if (mthca_is_memfree(dev)) + mthca_arbel_init_srq_context(dev, pd, srq, mailbox->buf); + else + mthca_tavor_init_srq_context(dev, pd, srq, mailbox->buf); + + err = mthca_SW2HW_SRQ(dev, mailbox, srq->srqn, &status); + + if (err) { + mthca_warn(dev, "SW2HW_SRQ failed (%d)\n", err); + goto err_out_free_buf; + } + if (status) { + mthca_warn(dev, "SW2HW_SRQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_buf; + } + 
+ spin_lock_irq(&dev->srq_table.lock); + if (mthca_array_set(&dev->srq_table.srq, + srq->srqn & (dev->limits.num_srqs - 1), + srq)) { + spin_unlock_irq(&dev->srq_table.lock); + goto err_out_free_srq; + } + spin_unlock_irq(&dev->srq_table.lock); + + mthca_free_mailbox(dev, mailbox); + + srq->first_free = 0; + srq->last_free = srq->max - 1; + + return 0; + +err_out_free_srq: + err = mthca_HW2SW_SRQ(dev, mailbox, srq->srqn, &status); + if (err) + mthca_warn(dev, "HW2SW_SRQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_SRQ returned status 0x%02x\n", status); + +err_out_free_buf: + if (!pd->ibpd.uobject) + mthca_free_srq_buf(dev, srq); + +err_out_mailbox: + mthca_free_mailbox(dev, mailbox); + +err_out_db: + if (!pd->ibpd.uobject && mthca_is_memfree(dev)) + mthca_free_db(dev, MTHCA_DB_TYPE_SRQ, srq->db_index); + +err_out_icm: + mthca_table_put(dev, dev->srq_table.table, srq->srqn); + +err_out: + mthca_free(&dev->srq_table.alloc, srq->srqn); + + return err; +} + +void mthca_free_srq(struct mthca_dev *dev, struct mthca_srq *srq) +{ + struct mthca_mailbox *mailbox; + int err; + u8 status; + + mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); + if (IS_ERR(mailbox)) { + mthca_warn(dev, "No memory for mailbox to free SRQ.\n"); + return; + } + + err = mthca_HW2SW_SRQ(dev, mailbox, srq->srqn, &status); + if (err) + mthca_warn(dev, "HW2SW_SRQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_SRQ returned status 0x%02x\n", status); + + spin_lock_irq(&dev->srq_table.lock); + mthca_array_clear(&dev->srq_table.srq, + srq->srqn & (dev->limits.num_srqs - 1)); + spin_unlock_irq(&dev->srq_table.lock); + + atomic_dec(&srq->refcount); + wait_event(srq->wait, !atomic_read(&srq->refcount)); + + if (!srq->ibsrq.uobject) { + mthca_free_srq_buf(dev, srq); + if (mthca_is_memfree(dev)) + mthca_free_db(dev, MTHCA_DB_TYPE_SRQ, srq->db_index); + } + + mthca_table_put(dev, dev->srq_table.table, srq->srqn); + mthca_free(&dev->srq_table.alloc, srq->srqn); + 
mthca_free_mailbox(dev, mailbox); +} + +void mthca_srq_event(struct mthca_dev *dev, u32 srqn, + enum ib_event_type event_type) +{ + struct mthca_srq *srq; + struct ib_event event; + + spin_lock(&dev->srq_table.lock); + srq = mthca_array_get(&dev->srq_table.srq, srqn & (dev->limits.num_srqs - 1)); + if (srq) + atomic_inc(&srq->refcount); + spin_unlock(&dev->srq_table.lock); + + if (!srq) { + mthca_warn(dev, "Async event for bogus SRQ %08x\n", srqn); + return; + } + + if (!srq->ibsrq.event_handler) + goto out; + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.srq = &srq->ibsrq; + srq->ibsrq.event_handler(&event, srq->ibsrq.srq_context); + +out: + if (atomic_dec_and_test(&srq->refcount)) + wake_up(&srq->wait); +} + +/* + * This function must be called with IRQs disabled. + */ +void mthca_free_srq_wqe(struct mthca_srq *srq, u32 wqe_addr) +{ + int ind; + + ind = wqe_addr >> srq->wqe_shift; + + spin_lock(&srq->lock); + + if (likely(srq->first_free >= 0)) + *(int *) get_wqe(srq, srq->last_free) = ind; + else + srq->first_free = ind; + + *(int *) get_wqe(srq, ind) = -1; + srq->last_free = ind; + + spin_unlock(&srq->lock); +} + +int mthca_tavor_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibsrq->device); + struct mthca_srq *srq = to_msrq(ibsrq); + unsigned long flags; + int err = 0; + int first_ind; + int ind; + int next_ind; + int nreq; + int i; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&srq->lock, flags); + + first_ind = srq->first_free; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + ind = srq->first_free; + + if (ind < 0) { + mthca_err(dev, "SRQ %06x full\n", srq->srqn); + err = -ENOMEM; + *bad_wr = wr; + break; + } + + wqe = get_wqe(srq, ind); + next_ind = *(int *) wqe; + prev_wqe = srq->last; + srq->last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + /* flags field will always remain 0 
*/ + + wqe += sizeof (struct mthca_next_seg); + + if (unlikely(wr->num_sge > srq->max_gs)) { + err = -EINVAL; + *bad_wr = wr; + srq->last = prev_wqe; + break; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + } + + if (i < srq->max_gs) { + ((struct mthca_data_seg *) wqe)->byte_count = 0; + ((struct mthca_data_seg *) wqe)->lkey = cpu_to_be32(MTHCA_INVAL_LKEY); + ((struct mthca_data_seg *) wqe)->addr = 0; + } + + if (likely(prev_wqe)) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << srq->wqe_shift) | 1); + wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + } + + srq->wrid[ind] = wr->wr_id; + srq->first_free = next_ind; + } + + if (likely(nreq)) { + __be32 doorbell[2]; + + doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift); + doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq); + + /* + * Make sure that descriptors are written before + * doorbell is rung. 
+ */ + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + spin_unlock_irqrestore(&srq->lock, flags); + return err; +} + +int mthca_arbel_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibsrq->device); + struct mthca_srq *srq = to_msrq(ibsrq); + unsigned long flags; + int err = 0; + int ind; + int next_ind; + int nreq; + int i; + void *wqe; + + spin_lock_irqsave(&srq->lock, flags); + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + ind = srq->first_free; + + if (ind < 0) { + mthca_err(dev, "SRQ %06x full\n", srq->srqn); + err = -ENOMEM; + *bad_wr = wr; + break; + } + + wqe = get_wqe(srq, ind); + next_ind = *(int *) wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = + cpu_to_be32((next_ind << srq->wqe_shift) | 1); + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + /* flags field will always remain 0 */ + + wqe += sizeof (struct mthca_next_seg); + + if (unlikely(wr->num_sge > srq->max_gs)) { + err = -EINVAL; + *bad_wr = wr; + break; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + } + + if (i < srq->max_gs) { + ((struct mthca_data_seg *) wqe)->byte_count = 0; + ((struct mthca_data_seg *) wqe)->lkey = cpu_to_be32(MTHCA_INVAL_LKEY); + ((struct mthca_data_seg *) wqe)->addr = 0; + } + + srq->wrid[ind] = wr->wr_id; + srq->first_free = next_ind; + } + + if (likely(nreq)) { + srq->counter += nreq; + + /* + * Make sure that descriptors are written before + * we write doorbell record. 
+ */ + wmb(); + *srq->db = cpu_to_be32(srq->counter); + } + + spin_unlock_irqrestore(&srq->lock, flags); + return err; +} + +int __devinit mthca_init_srq_table(struct mthca_dev *dev) +{ + int err; + + if (!(dev->mthca_flags & MTHCA_FLAG_SRQ)) + return 0; + + spin_lock_init(&dev->srq_table.lock); + + err = mthca_alloc_init(&dev->srq_table.alloc, + dev->limits.num_srqs, + dev->limits.num_srqs - 1, + dev->limits.reserved_srqs); + if (err) + return err; + + err = mthca_array_init(&dev->srq_table.srq, + dev->limits.num_srqs); + if (err) + mthca_alloc_cleanup(&dev->srq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_srq_table(struct mthca_dev *dev) +{ + if (!(dev->mthca_flags & MTHCA_FLAG_SRQ)) + return; + + mthca_array_cleanup(&dev->srq_table.srq, dev->limits.num_srqs); + mthca_alloc_cleanup(&dev->srq_table.alloc); +} Property changes on: infiniband/hw/mthca/mthca_srq.c ___________________________________________________________________ Name: svn:keywords + Id --- infiniband/hw/mthca/mthca_cmd.h (revision 3041) +++ infiniband/hw/mthca/mthca_cmd.h (working copy) @@ -299,6 +299,11 @@ int mthca_SW2HW_CQ(struct mthca_dev *dev int cq_num, u8 *status); int mthca_HW2SW_CQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int cq_num, u8 *status); +int mthca_SW2HW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, + int srq_num, u8 *status); +int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, + int srq_num, u8 *status); +int mthca_ARM_SRQ(struct mthca_dev *dev, int srq_num, int limit, u8 *status); int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, int is_ee, struct mthca_mailbox *mailbox, u32 optmask, u8 *status); --- infiniband/hw/mthca/mthca_allocator.c (revision 3041) +++ infiniband/hw/mthca/mthca_allocator.c (working copy) @@ -177,3 +177,119 @@ void mthca_array_cleanup(struct mthca_ar kfree(array->page_list); } + +/* + * Handling for queue buffers -- we allocate a bunch of memory and + * register it in a memory 
region at HCA virtual address 0. If the + * requested size is > max_direct, we split the allocation into + * multiple pages, so we don't require too much contiguous memory. + */ + +int mthca_buf_alloc(struct mthca_dev *dev, int size, int max_direct, + union mthca_buf *buf, int *is_direct, struct mthca_pd *pd, + int hca_write, struct mthca_mr *mr) +{ + int err = -ENOMEM; + int npages, shift; + u64 *dma_list = NULL; + dma_addr_t t; + int i; + + if (size <= max_direct) { + *is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + buf->direct.buf = dma_alloc_coherent(&dev->pdev->dev, + size, &t, GFP_KERNEL); + if (!buf->direct.buf) + return -ENOMEM; + + pci_unmap_addr_set(&buf->direct, mapping, t); + + memset(buf->direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + *is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + return -ENOMEM; + + buf->page_list = kmalloc(npages * sizeof *buf->page_list, + GFP_KERNEL); + if (!buf->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + buf->page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + buf->page_list[i].buf = + dma_alloc_coherent(&dev->pdev->dev, PAGE_SIZE, + &t, GFP_KERNEL); + if (!buf->page_list[i].buf) + goto err_free; + + dma_list[i] = t; + pci_unmap_addr_set(&buf->page_list[i], mapping, t); + + memset(buf->page_list[i].buf, 0, PAGE_SIZE); + } + } + + err = mthca_mr_alloc_phys(dev, pd->pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_READ | + (hca_write ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0), + mr); + if (err) + goto err_free; + + kfree(dma_list); + + return 0; + +err_free: + mthca_buf_free(dev, size, buf, *is_direct, NULL); + +err_out: + kfree(dma_list); + + return err; +} + +void mthca_buf_free(struct mthca_dev *dev, int size, union mthca_buf *buf, + int is_direct, struct mthca_mr *mr) +{ + int i; + + if (mr) + mthca_free_mr(dev, mr); + + if (is_direct) + dma_free_coherent(&dev->pdev->dev, size, buf->direct.buf, + pci_unmap_addr(&buf->direct, mapping)); + else { + for (i = 0; i < (size + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, + buf->page_list[i].buf, + pci_unmap_addr(&buf->page_list[i], + mapping)); + kfree(buf->page_list); + } +} --- infiniband/hw/mthca/mthca_qp.c (revision 3041) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -44,6 +44,7 @@ #include "mthca_dev.h" #include "mthca_cmd.h" #include "mthca_memfree.h" +#include "mthca_wqe.h" enum { MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, @@ -175,80 +176,6 @@ enum { MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 }; -enum { - MTHCA_NEXT_DBD = 1 << 7, - MTHCA_NEXT_FENCE = 1 << 6, - MTHCA_NEXT_CQ_UPDATE = 1 << 3, - MTHCA_NEXT_EVENT_GEN = 1 << 2, - MTHCA_NEXT_SOLICIT = 1 << 1, - - MTHCA_MLX_VL15 = 1 << 17, - MTHCA_MLX_SLR = 1 << 16 -}; - -enum { - MTHCA_INVAL_LKEY = 0x100 -}; - -struct mthca_next_seg { - __be32 nda_op; /* [31:6] next WQE [4:0] next opcode */ - __be32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ - __be32 flags; /* [3] CQ [2] Event [1] Solicit */ - __be32 imm; /* immediate data */ -}; - -struct mthca_tavor_ud_seg { - u32 reserved1; - __be32 lkey; - __be64 av_addr; - u32 reserved2[4]; - __be32 dqpn; - __be32 qkey; - u32 reserved3[2]; -}; - -struct mthca_arbel_ud_seg { - __be32 av[8]; - __be32 dqpn; - __be32 qkey; - u32 reserved[2]; -}; - -struct mthca_bind_seg { - __be32 flags; /* [31] Atomic [30] rem write [29] rem read */ - u32 reserved; - __be32 new_rkey; - __be32 lkey; - __be64 addr; - __be64 length; 
-}; - -struct mthca_raddr_seg { - __be64 raddr; - __be32 rkey; - u32 reserved; -}; - -struct mthca_atomic_seg { - __be64 swap_add; - __be64 compare; -}; - -struct mthca_data_seg { - __be32 byte_count; - __be32 lkey; - __be64 addr; -}; - -struct mthca_mlx_seg { - __be32 nda_op; - __be32 nds; - __be32 flags; /* [17] VL15 [16] SLR [14:12] static rate - [11:8] SL [3] C [2] E */ - __be16 rlid; - __be16 vcrc; -}; - static const u8 mthca_opcode[] = { [IB_WR_SEND] = MTHCA_OPCODE_SEND, [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, @@ -686,10 +613,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp_context->mtu_msgmax = (attr->path_mtu << 5) | 31; if (mthca_is_memfree(dev)) { - qp_context->rq_size_stride = - ((ffs(qp->rq.max) - 1) << 3) | (qp->rq.wqe_shift - 4); - qp_context->sq_size_stride = - ((ffs(qp->sq.max) - 1) << 3) | (qp->sq.wqe_shift - 4); + if (qp->rq.max) + qp_context->rq_size_stride = long_log2(qp->rq.max) << 3; + qp_context->rq_size_stride |= qp->rq.wqe_shift - 4; + + if (qp->sq.max) + qp_context->sq_size_stride = long_log2(qp->sq.max) << 3; + qp_context->sq_size_stride |= qp->sq.wqe_shift - 4; } /* leave arbel_sched_queue as 0 */ @@ -858,6 +788,9 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (ibqp->srq) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RIC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); @@ -880,6 +813,10 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); } + if (ibqp->srq) + qp_context->srqn = cpu_to_be32(1 << 24 | + to_msrq(ibqp->srq)->srqn); + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, qp->qpn, 0, mailbox, 0, &status); if (status) { @@ -927,10 +864,6 @@ static int mthca_alloc_wqe_buf(struct mt struct mthca_qp *qp) { int size; - int i; - int npages, shift; - dma_addr_t t; - u64 
*dma_list = NULL; int err = -ENOMEM; size = sizeof (struct mthca_next_seg) + @@ -980,116 +913,24 @@ static int mthca_alloc_wqe_buf(struct mt if (!qp->wrid) goto err_out; - if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { - qp->is_direct = 1; - npages = 1; - shift = get_order(size) + PAGE_SHIFT; - - if (0) - mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", - size, shift); - - qp->queue.direct.buf = dma_alloc_coherent(&dev->pdev->dev, size, - &t, GFP_KERNEL); - if (!qp->queue.direct.buf) - goto err_out; - - pci_unmap_addr_set(&qp->queue.direct, mapping, t); - - memset(qp->queue.direct.buf, 0, size); - - while (t & ((1 << shift) - 1)) { - --shift; - npages *= 2; - } - - dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); - if (!dma_list) - goto err_out_free; - - for (i = 0; i < npages; ++i) - dma_list[i] = t + i * (1 << shift); - } else { - qp->is_direct = 0; - npages = size / PAGE_SIZE; - shift = PAGE_SHIFT; - - if (0) - mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); - - dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); - if (!dma_list) - goto err_out; - - qp->queue.page_list = kmalloc(npages * - sizeof *qp->queue.page_list, - GFP_KERNEL); - if (!qp->queue.page_list) - goto err_out; - - for (i = 0; i < npages; ++i) { - qp->queue.page_list[i].buf = - dma_alloc_coherent(&dev->pdev->dev, PAGE_SIZE, - &t, GFP_KERNEL); - if (!qp->queue.page_list[i].buf) - goto err_out_free; - - memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); - - pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); - dma_list[i] = t; - } - } - - err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, - npages, 0, size, - MTHCA_MPT_FLAG_LOCAL_READ, - &qp->mr); + err = mthca_buf_alloc(dev, size, MTHCA_MAX_DIRECT_QP_SIZE, + &qp->queue, &qp->is_direct, pd, 0, &qp->mr); if (err) - goto err_out_free; + goto err_out; - kfree(dma_list); return 0; - err_out_free: - if (qp->is_direct) { - dma_free_coherent(&dev->pdev->dev, size, qp->queue.direct.buf, - 
pci_unmap_addr(&qp->queue.direct, mapping)); - } else - for (i = 0; i < npages; ++i) { - if (qp->queue.page_list[i].buf) - dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, - qp->queue.page_list[i].buf, - pci_unmap_addr(&qp->queue.page_list[i], - mapping)); - - } - - err_out: +err_out: kfree(qp->wrid); - kfree(dma_list); return err; } static void mthca_free_wqe_buf(struct mthca_dev *dev, struct mthca_qp *qp) { - int i; - int size = PAGE_ALIGN(qp->send_wqe_offset + - (qp->sq.max << qp->sq.wqe_shift)); - - if (qp->is_direct) { - dma_free_coherent(&dev->pdev->dev, size, qp->queue.direct.buf, - pci_unmap_addr(&qp->queue.direct, mapping)); - } else { - for (i = 0; i < size / PAGE_SIZE; ++i) { - dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, - qp->queue.page_list[i].buf, - pci_unmap_addr(&qp->queue.page_list[i], - mapping)); - } - } - + mthca_buf_free(dev, PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)), + &qp->queue, qp->is_direct, &qp->mr); kfree(qp->wrid); } @@ -1430,11 +1271,12 @@ void mthca_free_qp(struct mthca_dev *dev * unref the mem-free tables and free the QPN in our table. */ if (!qp->ibqp.uobject) { - mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn, + qp->ibqp.srq ? to_msrq(qp->ibqp.srq) : NULL); if (qp->ibqp.send_cq != qp->ibqp.recv_cq) - mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn, + qp->ibqp.srq ? to_msrq(qp->ibqp.srq) : NULL); - mthca_free_mr(dev, &qp->mr); mthca_free_memfree(dev, qp); mthca_free_wqe_buf(dev, qp); } @@ -2179,15 +2021,21 @@ int mthca_free_err_wqe(struct mthca_dev { struct mthca_next_seg *next; + /* + * For SRQs, all WQEs generate a CQE, so we're always at the + * end of the doorbell chain. 
+ */ + if (qp->ibqp.srq) { + *new_wqe = 0; + return 0; + } + if (is_send) next = get_send_wqe(qp, index); else next = get_recv_wqe(qp, index); - if (mthca_is_memfree(dev)) - *dbd = 1; - else - *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); if (next->ee_nds & cpu_to_be32(0x3f)) *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | (next->ee_nds & cpu_to_be32(0x3f)); --- infiniband/hw/mthca/Makefile (revision 3041) +++ infiniband/hw/mthca/Makefile (working copy) @@ -9,4 +9,4 @@ obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mth ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ mthca_allocator.o mthca_eq.o mthca_pd.o mthca_cq.o \ mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ - mthca_provider.o mthca_memfree.o mthca_uar.o + mthca_provider.o mthca_memfree.o mthca_uar.o mthca_srq.o From tom at ammasso.com Tue Aug 9 20:58:58 2005 From: tom at ammasso.com (Tom Tucker) Date: Tue, 09 Aug 2005 22:58:58 -0500 Subject: [openib-general] [PATCH] Fix amso1100/Kconfig References: <1123543814.4870.70.camel@hal.voltaire.com> <52fytkxa8t.fsf@cisco.com> <1123544755.4870.88.camel@hal.voltaire.com> <52br48x9n4.fsf@cisco.com> Message-ID: <42F97B82.7090201@ammasso.com> oof -- sorry guys. i thought we did that. Can you give me some quick guidance on the accepted test-build procedure. 64, 32, both. BTW, I've been "on the road" all day today, so sorry for the long delay... Roland Dreier wrote: > Hal> It broke the build. I got the following error: > Hal> drivers/infiniband/hw/amso1100/Kconfig:6: syntax error, > Hal> unexpected T_WORD drivers/infiniband/hw/amso1100/Kconfig:8: > Hal> invalid menu option > >Huh, you're right. Sorry for complaining to you. > >Tom: please make sure patches build before applying them ;) >(not that I've always followed this rule) > > - R. 
> > From rolandd at cisco.com Tue Aug 9 21:17:02 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 09 Aug 2005 21:17:02 -0700 Subject: [openib-general] [PATCH] Fix amso1100/Kconfig In-Reply-To: <42F97B82.7090201@ammasso.com> (Tom Tucker's message of "Tue, 09 Aug 2005 22:58:58 -0500") References: <1123543814.4870.70.camel@hal.voltaire.com> <52fytkxa8t.fsf@cisco.com> <1123544755.4870.88.camel@hal.voltaire.com> <52br48x9n4.fsf@cisco.com> <42F97B82.7090201@ammasso.com> Message-ID: <5264ueqvcx.fsf@cisco.com> Tom> oof -- sorry guys. i thought we did that. Can you give me Tom> some quick guidance on the accepted test-build procedure. 64, Tom> 32, both. The more targets and configs you check out, the better. But just make sure at least one config builds :) - R. From wxg at ict.ac.cn Tue Aug 9 23:58:27 2005 From: wxg at ict.ac.cn (wangxigui) Date: Wed, 10 Aug 2005 14:58:27 +0800 (CST) Subject: [openib-general] Re: openib-general Digest, Vol 14, Issue 57 In-Reply-To: <20050809152037.AA11A2283E0@openib.ca.sandia.gov> References: <20050809152037.AA11A2283E0@openib.ca.sandia.gov> Message-ID: <1326.159.226.40.170.1123657107.squirrel@webmail.ict.ac.cn> That's good news! I installed IBGD-1.7.0 and set the SDP environment variables as the 'IBGD-Stack_User_Manual' describes for preloading libsdp and configuring the policy: LD_PRELOAD - This environment variable should point to the libsdp.so library. The variable should be set by the system administrator to ${prefix}/lib/libsdp.so1. LIBSDP_CONFIG_FILE - This environment variable must point to the libsdp.conf file. By default, it points to: ${prefix}/etc/libsdp.conf. In order to avoid setting LD_PRELOAD, it is possible to add lbsdp.so into /etc/ld.so.preload. This causes the library to be preloaded into any executable. I created the file /etc/ld.so.preload and added the line lbsdp.so to it, but the machine hung when I rebooted it. 
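A sanity check on an entry before it goes into /etc/ld.so.preload could look roughly like the sketch below. It rests on stated assumptions: `preload_entry_ok` is a hypothetical helper, not part of libsdp or IBGD, and requiring an absolute path is a conservative choice of this sketch (the dynamic loader also accepts bare library names). It is not claimed to explain the hang described above.

```c
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption of this sketch, not an IBGD API).
 * Returns 1 if `path` is an absolute path naming a readable regular
 * file, and 0 otherwise.
 */
int preload_entry_ok(const char *path)
{
	struct stat st;

	/* Reject missing or relative entries such as a bare "lbsdp.so". */
	if (path == NULL || path[0] != '/')
		return 0;

	/* The entry must exist and be a regular file, not a directory. */
	if (stat(path, &st) != 0 || !S_ISREG(st.st_mode))
		return 0;

	/* The file must be readable by the caller. */
	return access(path, R_OK) == 0;
}
```

Calling `preload_entry_ok()` on each candidate line before writing the file would reject both a bare name and a path that does not exist; keeping a root shell open while testing is still prudent, since a bad preload entry affects every executable.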
> Date: Tue, 9 Aug 2005 18:08:48 +0300 > From: Amit Krig > Subject: [openib-general] InfiniBand Test Project === IBTP === > To: openib-general at openib.org > Message-ID: > <506C3D7B14CDD411A52C00025558DED607F0DDA1 at mtlex01.yok.mtl.com> > Content-Type: text/plain; charset="us-ascii" > > > Hi All, > > I would like to propose a new InfiniBand Test Project === IBTP === > The proposed new dir can be under Gen2/trunk => > > https://openib.org/svn/gen2/trunk/ibtp/ > > > The project may contain the following tree > > * tools > (In this folder we will have all the scripts and automation > utilities) > > * infiniband > * core > * ulps > * ipoib > * sdp > * srp > * kdapl > * iser > * > > * user > * verbs > * cm > * management > * osm > * utils > * ulps > * udapl > * mpi > * > > Each sub dir may have the following directories > > * doa (Basic and simple dead or alive tests, up to X sec) > * functional (Full flows & long tests) > * bad machine (Destructive tests that cannot run in parallel with other > tests) > * Scratch (New tests that are not integrated yet) > * Doc > > We should decide on the relevant document/format attached to each test > such > as > > * README file > * Test runner > * Config file (will contain several cmd options to run the test and > maybe will be used by the external auto run script) > * > >> Amit Krig >> Mellanox Technologies >> SW/HW Design validation manager >> mailto:amitk at mellanox.co.il >> Work: +972-4-9097200 Ext. 315 >> Fax: +972-4-9593245 >> Cell: +972-544-799099 >> >> >> > -------------- next part -------------- > An HTML attachment was scrubbed... 
> URL: > http://openib.org/pipermail/openib-general/attachments/20050809/6a257705/attachment.html > > ------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > > End of openib-general Digest, Vol 14, Issue 57 > ********************************************** > -- Sincerely yours Wang Xigui 王锡贵 2005-07-06 -------------------- Computer Architecture Laboratory Institute of Computing Technology Chinese Academy of Sciences Beijing,P.R.China Zip code: 100080 Tel: +86-10-62565533-9314(office) Email: wxg at ict.ac.cn From mst at mellanox.co.il Wed Aug 10 00:31:00 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 10 Aug 2005 10:31:00 +0300 Subject: [openib-general] [PATCH applied] sdp: replace mlock with get_user_pages In-Reply-To: <1123605694.4403.108.camel@hal.voltaire.com> References: <20050809132254.GH32419@mellanox.co.il> <1123605694.4403.108.camel@hal.voltaire.com> Message-ID: <20050810073100.GA2405@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: [openib-general] [PATCH applied] sdp: replace mlock with get_user_pages > > On Tue, 2005-08-09 at 09:22, Michael S. Tsirkin wrote: > > +static void sdp_copy_one_page(struct page *from, struct page* to, > > + unsigned long iocb_addr, size_t iocb_size, > > + unsigned long uaddr) > > +{ > > + size_t size_left = iocb_addr + iocb_size - uaddr; > > + size_t size = min(size_left,PAGE_SIZE); > > The last line results in the following warning on x86: > drivers/infiniband/ulp/sdp/sdp_iocb.c: In function `sdp_copy_one_page': > drivers/infiniband/ulp/sdp/sdp_iocb.c:46: warning: comparison of distinct pointer types lacks a cast > > -- Hal > Here's a fix. Applied. Thanks! --- x86: warning: comparison of distinct pointer types lacks a cast Signed-off-by: Michael S. 
Tsirkin Index: ulp/sdp/sdp_iocb.c =================================================================== --- ulp/sdp/sdp_iocb.c (revision 3036) +++ ulp/sdp/sdp_iocb.c (working copy) @@ -43,7 +43,7 @@ static void sdp_copy_one_page(struct pag unsigned long uaddr) { size_t size_left = iocb_addr + iocb_size - uaddr; - size_t size = min(size_left,PAGE_SIZE); + size_t size = min(size_left, (size_t)PAGE_SIZE); unsigned long offset = uaddr % PAGE_SIZE; unsigned long flags; -- MST From hch at lst.de Wed Aug 10 00:32:25 2005 From: hch at lst.de (Christoph Hellwig) Date: Wed, 10 Aug 2005 09:32:25 +0200 Subject: [openib-general] [PATCH] Fix amso1100/Kconfig In-Reply-To: <42F97B82.7090201@ammasso.com> References: <1123543814.4870.70.camel@hal.voltaire.com> <52fytkxa8t.fsf@cisco.com> <1123544755.4870.88.camel@hal.voltaire.com> <52br48x9n4.fsf@cisco.com> <42F97B82.7090201@ammasso.com> Message-ID: <20050810073225.GA4927@lst.de> On Tue, Aug 09, 2005 at 10:58:58PM -0500, Tom Tucker wrote: > oof -- sorry guys. i thought we did that. > > Can you give me some quick guidance on the accepted test-build > procedure. 64, 32, both. minimum test targets are normally: i386 - currently the most used platform, little endian and 32bit; ppc64 - big endian and 64bit, used quite a lot as well. if you're really eager, throw in sparc64 as well for some odd alignment restrictions (more a runtime than compile-time thing) From mst at mellanox.co.il Wed Aug 10 01:30:07 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 10 Aug 2005 11:30:07 +0300 Subject: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support In-Reply-To: References: <20050719165542.GB16028@mellanox.co.il> <20050725171928.GC12206@mellanox.co.il> <20050726133553.GA22276@mellanox.co.il> Message-ID: <20050810083007.GD2405@mellanox.co.il> Quoting r. 
Hugh Dickins : > > > The other reason I dislike the patch is that the problem it fixes is > > > an old one, and I'd much rather have get_user_pages fix it for itself, > > > > Please note that the problem this attempts to solve is not limited > > to pages locked by get_user_pages: in an infiniband userspace initiator, > > a hardware page is mapped into process memory and must not be inherited > > by a child processes, otherwise hardware protection breaks. > > Interesting. > > But (correct me if I'm wrong, I know nothing about InfiniBand userspace > initiators) that would be done by a driver, which can set VM_DONTCOPY > on the vma, without us having to extend the mprotect or madvise API Roland, Hugh here proposes setting VM_DONTCOPY on user-mapped PIO memory from driver on mmap, to protect against child process corrupting parent's user access region. IIRC, we used to set this bit, but it was removed later - could you please clarify why? Do you think its a good idea to restore this behaviour? -- MST From glebn at voltaire.com Wed Aug 10 01:39:43 2005 From: glebn at voltaire.com (Gleb Natapov) Date: Wed, 10 Aug 2005 11:39:43 +0300 Subject: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support In-Reply-To: References: <20050719165542.GB16028@mellanox.co.il> <20050725171928.GC12206@mellanox.co.il> <20050726133553.GA22276@mellanox.co.il> Message-ID: <20050810083943.GM16361@minantech.com> On Tue, Aug 09, 2005 at 07:13:33PM +0100, Hugh Dickins wrote: > Even more I'd prefer one of these two solutions below, which sidestep > that uncleanliness - but both of these would be in mmap only, no clean > way to change afterwards (except by munmap or mmap MAP_FIXED): > > 1. Use the standard mmap(NULL, len, PROT_READ|PROT_WRITE, > MAP_SHARED|MAP_ANONYMOUS, -1, 0) which gives you a memory object > shared with children, so write-protection and COW won't come into it. > > or if there's good reason why that's no good, > > 2. 
Define a MAP_DONTCOPY to mmap: we have a fine tradition of MAP_flags > to achieve this or that effect, adding one more would be cleaner than > now corrupting mprotect or madvise. > They are both relying on the way user allocates memory for RDMA. The idea behind Michael's propose it to let library (MPI for instance) to tell to the kernel that the pages are used for RDMA and it is not safe to copy them now. The pages may be anywhere in the process address space bss, text, stack whatever. -- Gleb. From hugh at veritas.com Wed Aug 10 06:22:40 2005 From: hugh at veritas.com (Hugh Dickins) Date: Wed, 10 Aug 2005 14:22:40 +0100 (BST) Subject: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support In-Reply-To: <20050810083943.GM16361@minantech.com> References: <20050719165542.GB16028@mellanox.co.il> <20050725171928.GC12206@mellanox.co.il> <20050726133553.GA22276@mellanox.co.il> <20050810083943.GM16361@minantech.com> Message-ID: On Wed, 10 Aug 2005, Gleb Natapov wrote: > On Tue, Aug 09, 2005 at 07:13:33PM +0100, Hugh Dickins wrote: > > Even more I'd prefer one of these two solutions below, which sidestep > > that uncleanliness - but both of these would be in mmap only, no clean > > way to change afterwards (except by munmap or mmap MAP_FIXED): > > > > 1. Use the standard mmap(NULL, len, PROT_READ|PROT_WRITE, > > MAP_SHARED|MAP_ANONYMOUS, -1, 0) which gives you a memory object > > shared with children, so write-protection and COW won't come into it. > > > > or if there's good reason why that's no good, > > > > 2. Define a MAP_DONTCOPY to mmap: we have a fine tradition of MAP_flags > > to achieve this or that effect, adding one more would be cleaner than > > now corrupting mprotect or madvise. > > They are both relying on the way user allocates memory for RDMA. The idea > behind Michael's propose it to let library (MPI for instance) to tell to the > kernel that the pages are used for RDMA and it is not safe to copy them now. 
> The pages may be anywhere in the process address space bss, text, stack > whatever. That's a nice aim, but I don't think it can quite be done in the face of the fork issue - one way or another, we have to change the behaviour of a forked RDMA area slightly, which might interfere with common assumptions. Your stack example is a good one: if we end up setting VM_DONTCOPY on the user stack, then I don't think fork's child will get very far without hitting a SIGSEGV. Hugh From glebn at voltaire.com Wed Aug 10 06:26:12 2005 From: glebn at voltaire.com (Gleb Natapov) Date: Wed, 10 Aug 2005 16:26:12 +0300 Subject: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support In-Reply-To: References: <20050719165542.GB16028@mellanox.co.il> <20050725171928.GC12206@mellanox.co.il> <20050726133553.GA22276@mellanox.co.il> <20050810083943.GM16361@minantech.com> Message-ID: <20050810132611.GP16361@minantech.com> On Wed, Aug 10, 2005 at 02:22:40PM +0100, Hugh Dickins wrote: > On Wed, 10 Aug 2005, Gleb Natapov wrote: > > On Tue, Aug 09, 2005 at 07:13:33PM +0100, Hugh Dickins wrote: > > > Even more I'd prefer one of these two solutions below, which sidestep > > > that uncleanliness - but both of these would be in mmap only, no clean > > > way to change afterwards (except by munmap or mmap MAP_FIXED): > > > > > > 1. Use the standard mmap(NULL, len, PROT_READ|PROT_WRITE, > > > MAP_SHARED|MAP_ANONYMOUS, -1, 0) which gives you a memory object > > > shared with children, so write-protection and COW won't come into it. > > > > > > or if there's good reason why that's no good, > > > > > > 2. Define a MAP_DONTCOPY to mmap: we have a fine tradition of MAP_flags > > > to achieve this or that effect, adding one more would be cleaner than > > > now corrupting mprotect or madvise. > > > > They are both relying on the way user allocates memory for RDMA. 
The idea > > behind Michael's propose it to let library (MPI for instance) to tell to the > > kernel that the pages are used for RDMA and it is not safe to copy them now. > > The pages may be anywhere in the process address space bss, text, stack > > whatever. > > That's a nice aim, but I don't think it can quite be done in the face of > the fork issue - one way or another, we have to change the behaviour of a > forked RDMA area slightly, which might interfere with common assumptions. > > Your stack example is a good one: if we end up setting VM_DONTCOPY on > the user stack, then I don't think fork's child will get very far without > hitting a SIGSEGV. I know, but I prefer child SIGSEGV than silent data corruption. In most cases child will exec immediately after fork so no problem in this case. -- Gleb. From amitk at mellanox.co.il Wed Aug 10 06:51:58 2005 From: amitk at mellanox.co.il (Amit Krig) Date: Wed, 10 Aug 2005 16:51:58 +0300 Subject: [openib-general] InfiniBand Test Project === IBTP === Message-ID: <506C3D7B14CDD411A52C00025558DED607F0DDC7@MTLEX01> Hi Matt, I have updated the tree structure according to the replies I got (see below), One change from the src tree is the ulp dir under the userspace which I think is better. In order to start sharing the tests we would need to open this project in the svn tree. >From Mellanox side I hope it will be simple to give Dotan Barak and My self (Amit Krig) permission to commit tests to this project. 
============
Hi All,

I would like to propose a new InfiniBand Test Project === IBTP ===
The proposed new dir can be under Gen2/trunk =>

https://openib.org/svn/gen2/trunk/ibtp/

The project may contain the following tree

  * tools (In this folder we will have all the scripts and automation
    utilities)
  * linux-kernel
      * infiniband
          * core
          * ulps
              * ipoib
              * sdp
              * srp
              * kdapl
              * iser
  * userspace
      * useraccess
      * cm
      * management
          * osm
      * utils
      * ulps
          * udapl
          * mpi

Each sub dir may have the following directories

  * doa (Basic and simple dead or alive tests, up to X sec)
  * functional (Full flows & long tests)
  * bad machine (Destructive tests can not run in parallel to other tests)
  * Scratch (New tests that were not integrated yet)
  * Doc

We should decide on the relevant document/format attached to each test
such as

  * README file
  * Test runner
  * Config file (will contain several cmd options to run the test and
    maybe will be used by the external auto run script)

Amit Krig
Mellanox Technologies
SW/HW Design validation manager
mailto:amitk at mellanox.co.il
Work: +972-4-9097200 Ext. 315
Fax: +972-4-9593245
Cell: +972-544-799099

From halr at voltaire.com Wed Aug 10 06:56:23 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Aug 2005 09:56:23 -0400
Subject: [openib-general] InfiniBand Test Project === IBTP ===
In-Reply-To: <506C3D7B14CDD411A52C00025558DED607F0DDC7@MTLEX01>
References: <506C3D7B14CDD411A52C00025558DED607F0DDC7@MTLEX01>
Message-ID: <1123682183.4403.1634.camel@hal.voltaire.com>

Hi Amit,

On Wed, 2005-08-10 at 09:51, Amit Krig wrote:
> Hi Matt,
>
> I have updated the tree structure according to the replies I got (see
> below). One change from the src tree is the ulp dir under the
> userspace, which I think is better.

Yes, this matches better. The one unresolved issue is whether this is in
the trunk or under utils or test.

> In order to start sharing the tests we would need to open this project
> in the svn tree.
> From the Mellanox side I hope it will be simple to give Dotan Barak and
> myself (Amit Krig) permission to commit tests to this project.

One either has permission or not. Once you have permission, you can
create the structure needed.

-- Hal

> ============
> Hi All,
>
> I would like to propose a new InfiniBand Test Project === IBTP === The
> proposed new dir can be under Gen2/trunk =>
>
> https://openib.org/svn/gen2/trunk/ibtp/
>
> The project may contain the following tree
>
>
>   * tools (In this folder we will have all the scripts and automation
>     utilities)
>   * linux-kernel
>       * infiniband
>           * core
>           * ulps
>               * ipoib
>               * sdp
>               * srp
>               * kdapl
>               * iser
>   * userspace
>       * useraccess
>       * cm
>       * management
>           * osm
>       * utils
>       * ulps
>           * udapl
>           * mpi
>
>
>
> Each sub dir may have the following directories
>
>
>   * doa (Basic and simple dead or alive tests,
>     up to X sec)
>   * functional (Full flows & long tests)
>   * bad machine (Destructive tests can not run
>     in parallel to other tests)
>   * Scratch (New tests that were not integrated
>     yet)
>   * Doc
>
>
> We should decide on the relevant document/format attached to each test
> such as
>
>   * README file
>   * Test runner
>   * Config file (will contain several cmd options to run the
>     test and maybe will be used by the external auto run script)
>
>
>
>
> Amit Krig
> Mellanox Technologies
> SW/HW Design validation manager
> mailto:amitk at mellanox.co.il
> Work: +972-4-9097200 Ext. 315
> Fax: +972-4-9593245
> Cell: +972-544-799099
>
>
>
>
>
> ______________________________________________________________________
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From jlentini at netapp.com Wed Aug 10 07:09:11 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 10 Aug 2005 10:09:11 -0400 (EDT)
Subject: [openib-general] InfiniBand Test Project === IBTP ===
In-Reply-To: <506C3D7B14CDD411A52C00025558DED607F0DDA1@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED607F0DDA1@mtlex01.yok.mtl.com>
Message-ID:

On Tue, 9 Aug 2005, Amit Krig wrote:

>
> Hi All,
>
> I would like to propose a new InfiniBand Test Project === IBTP ===
> The proposed new dir can be under Gen2/trunk =>
>
> https://openib.org/svn/gen2/trunk/ibtp/
>

Did you consider using the existing
https://openib.org/svn/gen2/utils/src/ directory?

This part of the repository was recommended as the best place for
kdapltest, a test tool for kdapl.

From amitk at mellanox.co.il Wed Aug 10 07:27:21 2005
From: amitk at mellanox.co.il (Amit Krig)
Date: Wed, 10 Aug 2005 17:27:21 +0300
Subject: [openib-general] InfiniBand Test Project === IBTP ===
Message-ID: <506C3D7B14CDD411A52C00025558DED607F0DDC8@MTLEX01>

-----Original Message-----
From: James Lentini [mailto:jlentini at netapp.com]
Sent: Wednesday, August 10, 2005 5:09 PM
To: Amit Krig
Cc: openib-general at openib.org
Subject: Re: [openib-general] InfiniBand Test Project === IBTP ===

>On Tue, 9 Aug 2005, Amit Krig wrote:

>
> Hi All,
>
> I would like to propose a new InfiniBand Test Project === IBTP === The
> proposed new dir can be under Gen2/trunk =>
>
> https://openib.org/svn/gen2/trunk/ibtp/
>

> Did you consider using the existing
> https://openib.org/svn/gen2/utils/src/ directory?

This is an option as well,
The tree is not the same, for example kdapl should be under infiniband ->
ulp.

> This part of the repository was recommended as the best place for
kdapltest, a test tool for kdapl.

I think that the ibtp major goal is quality and not tools

From jlentini at netapp.com Wed Aug 10 07:56:39 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 10 Aug 2005 10:56:39 -0400 (EDT)
Subject: [openib-general] InfiniBand Test Project === IBTP ===
In-Reply-To: <506C3D7B14CDD411A52C00025558DED607F0DDC8@MTLEX01>
References: <506C3D7B14CDD411A52C00025558DED607F0DDC8@MTLEX01>
Message-ID:

On Wed, 10 Aug 2005, Amit Krig wrote:

>> On Tue, 9 Aug 2005, Amit Krig wrote:
>
>>
>> Hi All,
>>
>> I would like to propose a new InfiniBand Test Project === IBTP === The
>> proposed new dir can be under Gen2/trunk =>
>>
>> https://openib.org/svn/gen2/trunk/ibtp/
>>
>
>> Did you consider using the existing
>> https://openib.org/svn/gen2/utils/src/ directory?
>
> This is an option as well,
> The tree is not the same, for example kdapl should be under infiniband ->
> ulp.

We can re-arrange the directory structure.

>> This part of the repository was recommended as the best place for
> kdapltest, a test tool for kdapl.
>
> I think that the ibtp major goal is quality and not tools

Ok. I just have two concerns.

First, I would prefer a directory name that was less obfuscated. It is
not immediately obvious what a directory called "ibtp" contains. If the
directory was called "test", then it would be clear.

Second, I'm not clear on why this code should go under the trunk but the
tools in https://openib.org/svn/gen2/utils should not. I had been
thinking that all of the code under the trunk was intended for end
users. I'm not sure if that is the distinction though. Perhaps the ulps,
users, and utils, directories in https://openib.org/svn/gen2/ are out of
place.

From tziporet at mellanox.co.il Wed Aug 10 08:13:19 2005
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 10 Aug 2005 18:13:19 +0300
Subject: [openib-general] Is IB_EVENT_SM_CHANGE is equivalent in semantic to the Client Reregister event in the IB SPEC
Message-ID: <506C3D7B14CDD411A52C00025558DED6085BD09E@MTLEX01>

Hi,

I am working to enable the Client Reregister event as defined in IB SPEC
on chapter 14.4.11.
While doing it I saw that in the core the event IB_EVENT_SM_CHANGE is
defined and also ULPs are using it but this event is never generated.

My understanding is that Client Reregister event should generate this
event.
Is this correct?

Thanks,

Tziporet Koren
Software Director
Mellanox Technologies, Ltd
mailto:tziporet at mellanox.co.il
Tel +972-4-9097200, ext 380

From hugh at veritas.com Wed Aug 10 08:27:31 2005
From: hugh at veritas.com (Hugh Dickins)
Date: Wed, 10 Aug 2005 16:27:31 +0100 (BST)
Subject: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
In-Reply-To: <20050810132611.GP16361@minantech.com>
References: <20050719165542.GB16028@mellanox.co.il>
    <20050725171928.GC12206@mellanox.co.il>
    <20050726133553.GA22276@mellanox.co.il>
    <20050810083943.GM16361@minantech.com>
    <20050810132611.GP16361@minantech.com>
Message-ID:

On Wed, 10 Aug 2005, Gleb Natapov wrote:
> On Wed, Aug 10, 2005 at 02:22:40PM +0100, Hugh Dickins wrote:
> >
> > Your stack example is a good one: if we end up setting VM_DONTCOPY on
> > the user stack, then I don't think fork's child will get very far without
> > hitting a SIGSEGV.
>
> I know, but I prefer child SIGSEGV than silent data corruption.

Most people will share your preference, but neither is satisfactory.

> In most cases child will exec immediately after fork so no problem
> in this case.

In most(?) cases it won't even be able to exec before the SIGSEGV.
Hugh From mst at mellanox.co.il Wed Aug 10 08:30:49 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 10 Aug 2005 18:30:49 +0300 Subject: [openib-general] [PATCH] diags: configure option to skip library check Message-ID: <20050810153049.GO9141@mellanox.co.il> Add option to skip infiniband library checks in diags Signed-off-by: Michael S. Tsirkin Index: management/diags/ibtracert/configure.in =================================================================== --- management/diags/ibtracert/configure.in (revision 2963) +++ management/diags/ibtracert/configure.in (working copy) @@ -8,10 +8,18 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(ibtracert, 0.9.0) +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presense of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + dnl Checks for programs AC_PROG_CC AC_PROG_LIBTOOL +if test "$disable_libcheck" != "yes" +then dnl Checks for libraries LDFLAGS="$LDFLAGS -L/usr/local/ib/lib" AC_CHECK_LIB(ibcommon, sys_read_string, [], @@ -20,10 +28,13 @@ AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. ibroute requires libibumad.])) AC_CHECK_LIB(ibmad, mad_dump_int, [], AC_MSG_ERROR([mad_dump_int() not found. ibroute requires libibmad.])) +fi dnl Checks for header files. AC_HEADER_STDC AC_CHECK_HEADERS([netinet/in.h stdlib.h unistd.h]) +if test "$disable_libcheck" != "yes" +then AC_CHECK_HEADER(infiniband/common.h, [], AC_MSG_ERROR([ not found. ibroute requires libibcommon.]) ) @@ -33,6 +44,7 @@ AC_CHECK_HEADER(infiniband/umad.h, [], AC_CHECK_HEADER(infiniband/mad.h, [], AC_MSG_ERROR([ not found. 
ibroute requires libibmad.]) ) +fi dnl Checks for library functions AC_FUNC_ERROR_AT_LINE Index: management/diags/ibnetdiscover/configure.in =================================================================== --- management/diags/ibnetdiscover/configure.in (revision 2963) +++ management/diags/ibnetdiscover/configure.in (working copy) @@ -8,10 +8,18 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(ibnetdiscover, 0.9.0) +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presense of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + dnl Checks for programs AC_PROG_CC AC_PROG_LIBTOOL +if test "$disable_libcheck" != "yes" +then dnl Checks for libraries LDFLAGS="$LDFLAGS -L/usr/local/ib/lib" AC_CHECK_LIB(ibcommon, sys_read_string, [], @@ -20,10 +28,13 @@ AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. ibnetdiscover requires libibumad.])) AC_CHECK_LIB(ibmad, mad_dump_int, [], AC_MSG_ERROR([mad_dump_int() not found. ibnetdiscover requires libibmad.])) +fi dnl Checks for header files. AC_HEADER_STDC AC_CHECK_HEADERS([stdlib.h string.h unistd.h]) +if test "$disable_libcheck" != "yes" +then AC_CHECK_HEADER(infiniband/common.h, [], AC_MSG_ERROR([ not found. ibnetdiscover requires libibcommon.]) ) @@ -33,6 +44,7 @@ AC_CHECK_HEADER(infiniband/umad.h, [], AC_CHECK_HEADER(infiniband/mad.h, [], AC_MSG_ERROR([ not found. 
ibnetdiscover requires libibmad.]) ) +fi dnl Checks for library functions AC_FUNC_ERROR_AT_LINE Index: management/diags/perfquery/configure.in =================================================================== --- management/diags/perfquery/configure.in (revision 2963) +++ management/diags/perfquery/configure.in (working copy) @@ -8,10 +8,18 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(perfquery, 0.9.0) +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presense of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + dnl Checks for programs AC_PROG_CC AC_PROG_LIBTOOL +if test "$disable_libcheck" != "yes" +then dnl Checks for libraries LDFLAGS="$LDFLAGS -L/usr/local/ib/lib" AC_CHECK_LIB(ibcommon, sys_read_string, [], @@ -20,10 +28,13 @@ AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. perfquery requires libibumad.])) AC_CHECK_LIB(ibmad, mad_dump_int, [], AC_MSG_ERROR([mad_dump_int() not found. perfquery requires libibmad.])) +fi dnl Checks for header files. AC_HEADER_STDC AC_CHECK_HEADERS([stdlib.h string.h unistd.h]) +if test "$disable_libcheck" != "yes" +then AC_CHECK_HEADER(infiniband/common.h, [], AC_MSG_ERROR([ not found. perfquery requires libibcommon.]) ) @@ -33,6 +44,7 @@ AC_CHECK_HEADER(infiniband/umad.h, [], AC_CHECK_HEADER(infiniband/mad.h, [], AC_MSG_ERROR([ not found. 
perfquery requires libibmad.]) ) +fi dnl Checks for library functions AC_FUNC_ERROR_AT_LINE Index: management/diags/smpquery/configure.in =================================================================== --- management/diags/smpquery/configure.in (revision 2963) +++ management/diags/smpquery/configure.in (working copy) @@ -8,10 +8,18 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(smpquery, 0.9.0) +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presense of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + dnl Checks for programs AC_PROG_CC AC_PROG_LIBTOOL +if test "$disable_libcheck" != "yes" +then dnl Checks for libraries LDFLAGS="$LDFLAGS -L/usr/local/ib/lib" AC_CHECK_LIB(ibcommon, sys_read_string, [], @@ -20,10 +28,13 @@ AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. smpquery requires libibumad.])) AC_CHECK_LIB(ibmad, mad_dump_int, [], AC_MSG_ERROR([mad_dump_int() not found. smpquery requires libibmad.])) +fi dnl Checks for header files. AC_HEADER_STDC AC_CHECK_HEADERS([stdlib.h string.h unistd.h]) +if test "$disable_libcheck" != "yes" +then AC_CHECK_HEADER(infiniband/common.h, [], AC_MSG_ERROR([ not found. smpquery requires libibcommon.]) ) @@ -33,6 +44,7 @@ AC_CHECK_HEADER(infiniband/umad.h, [], AC_CHECK_HEADER(infiniband/mad.h, [], AC_MSG_ERROR([ not found. 
smpquery requires libibmad.]) ) +fi dnl Checks for library functions AC_FUNC_ERROR_AT_LINE Index: management/diags/ibaddr/configure.in =================================================================== --- management/diags/ibaddr/configure.in (revision 2963) +++ management/diags/ibaddr/configure.in (working copy) @@ -8,10 +8,18 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(ibaddr, 0.9.0) +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presense of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + dnl Checks for programs AC_PROG_CC AC_PROG_LIBTOOL +if test "$disable_libcheck" != "yes" +then dnl Checks for libraries LDFLAGS="$LDFLAGS -L/usr/local/ib/lib" AC_CHECK_LIB(ibcommon, sys_read_string, [], @@ -20,10 +28,13 @@ AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. ibaddr requires libibumad.])) AC_CHECK_LIB(ibmad, mad_dump_int, [], AC_MSG_ERROR([mad_dump_int() not found. ibaddr requires libibmad.])) +fi dnl Checks for header files. AC_HEADER_STDC AC_CHECK_HEADERS([stdlib.h string.h unistd.h]) +if test "$disable_libcheck" != "yes" +then AC_CHECK_HEADER(infiniband/common.h, [], AC_MSG_ERROR([ not found. ibaddr requires libibcommon.]) ) @@ -33,6 +44,7 @@ AC_CHECK_HEADER(infiniband/umad.h, [], AC_CHECK_HEADER(infiniband/mad.h, [], AC_MSG_ERROR([ not found. 
ibaddr requires libibmad.]) ) +fi dnl Checks for library functions AC_FUNC_ERROR_AT_LINE Index: management/diags/smpdump/configure.in =================================================================== --- management/diags/smpdump/configure.in (revision 2963) +++ management/diags/smpdump/configure.in (working copy) @@ -8,26 +8,38 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(smpdump, 0.9.0) +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presense of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + dnl Checks for programs AC_PROG_CC AC_PROG_LIBTOOL +if test "$disable_libcheck" != "yes" +then dnl Checks for libraries LDFLAGS="$LDFLAGS -L/usr/local/ib/lib" AC_CHECK_LIB(ibcommon, sys_read_string, [], AC_MSG_ERROR([sys_read_string() not found. smpdump requires libibcommon.])) AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. smpdump requires libibumad.])) +fi dnl Checks for header files. AC_HEADER_STDC AC_CHECK_HEADERS([fcntl.h inttypes.h netinet/in.h stdlib.h string.h sys/ioctl.h syslog.h unistd.h]) +if test "$disable_libcheck" != "yes" +then AC_CHECK_HEADER(infiniband/common.h, [], AC_MSG_ERROR([ not found. smpdump requires libibcommon.]) ) AC_CHECK_HEADER(infiniband/umad.h, [], AC_MSG_ERROR([ not found. 
smpdump requires libibumad.]) ) +fi dnl Checks for library functions AC_CHECK_FUNCS([memset strchr strtoul]) Index: management/diags/ibsysstat/configure.in =================================================================== --- management/diags/ibsysstat/configure.in (revision 2963) +++ management/diags/ibsysstat/configure.in (working copy) @@ -8,10 +8,18 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(ibsysstat, 0.9.0) +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presense of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + dnl Checks for programs AC_PROG_CC AC_PROG_LIBTOOL +if test "$disable_libcheck" != "yes" +then dnl Checks for libraries LDFLAGS="$LDFLAGS -L/usr/local/ib/lib" AC_CHECK_LIB(ibcommon, sys_read_string, [], @@ -20,10 +28,13 @@ AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. ibsysstat requires libibumad.])) AC_CHECK_LIB(ibmad, mad_dump_int, [], AC_MSG_ERROR([mad_dump_int() not found. ibsysstat requires libibmad.])) +fi dnl Checks for header files. AC_HEADER_STDC AC_CHECK_HEADERS([stdlib.h string.h unistd.h]) +if test "$disable_libcheck" != "yes" +then AC_CHECK_HEADER(infiniband/common.h, [], AC_MSG_ERROR([ not found. ibsysstat requires libibcommon.]) ) @@ -33,6 +44,7 @@ AC_CHECK_HEADER(infiniband/umad.h, [], AC_CHECK_HEADER(infiniband/mad.h, [], AC_MSG_ERROR([ not found. 
ibsysstat requires libibmad.]) ) +fi dnl Checks for library functions AC_FUNC_ERROR_AT_LINE Index: management/diags/ibstat/configure.in =================================================================== --- management/diags/ibstat/configure.in (revision 2963) +++ management/diags/ibstat/configure.in (working copy) @@ -8,10 +8,18 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(ibstat, 0.9.0) +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presense of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + dnl Checks for programs AC_PROG_CC AC_PROG_LIBTOOL +if test "$disable_libcheck" != "yes" +then dnl Checks for libraries LDFLAGS="$LDFLAGS -L/usr/local/ib/lib" AC_CHECK_LIB(ibcommon, sys_read_string, [], @@ -19,10 +27,13 @@ AC_CHECK_LIB(ibcommon, sys_read_string, .])) AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. ibstat requires libibumad.])) +fi dnl Checks for header files. AC_HEADER_STDC AC_CHECK_HEADERS([fcntl.h inttypes.h netinet/in.h stdlib.h string.h sys/ioctl.h syslog.h unistd.h]) +if test "$disable_libcheck" != "yes" +then AC_CHECK_HEADER(infiniband/common.h, [], AC_MSG_ERROR([ not found. ibstat requires libibco mmon.]) @@ -31,6 +42,7 @@ AC_CHECK_HEADER(infiniband/umad.h, [], AC_MSG_ERROR([ not found. 
ibstat requires libibumad .]) ) +fi dnl Checks for library functions AC_CHECK_FUNCS([strtol]) Index: management/diags/ibping/configure.in =================================================================== --- management/diags/ibping/configure.in (revision 2963) +++ management/diags/ibping/configure.in (working copy) @@ -8,10 +8,18 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(ibping, 0.9.0) +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presense of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + dnl Checks for programs AC_PROG_CC AC_PROG_LIBTOOL +if test "$disable_libcheck" != "yes" +then dnl Checks for libraries LDFLAGS="$LDFLAGS -L/usr/local/ib/lib" AC_CHECK_LIB(ibcommon, sys_read_string, [], @@ -20,10 +28,13 @@ AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. ibping requires libibumad.])) AC_CHECK_LIB(ibmad, mad_dump_int, [], AC_MSG_ERROR([mad_dump_int() not found. ibping requires libibmad.])) +fi dnl Checks for header files. AC_HEADER_STDC AC_CHECK_HEADERS([stdlib.h string.h unistd.h]) +if test "$disable_libcheck" != "yes" +then AC_CHECK_HEADER(infiniband/common.h, [], AC_MSG_ERROR([ not found. ibping requires libibcommon.]) ) @@ -33,6 +44,7 @@ AC_CHECK_HEADER(infiniband/umad.h, [], AC_CHECK_HEADER(infiniband/mad.h, [], AC_MSG_ERROR([ not found. 
ibping requires libibmad.]) ) +fi dnl Checks for library functions AC_FUNC_ERROR_AT_LINE Index: management/diags/ibroute/configure.in =================================================================== --- management/diags/ibroute/configure.in (revision 2963) +++ management/diags/ibroute/configure.in (working copy) @@ -8,10 +8,18 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(ibroute, 0.9.0) +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presense of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + dnl Checks for programs AC_PROG_CC AC_PROG_LIBTOOL +if test "$disable_libcheck" != "yes" +then dnl Checks for libraries LDFLAGS="$LDFLAGS -L/usr/local/ib/lib" AC_CHECK_LIB(ibcommon, sys_read_string, [], @@ -20,10 +28,13 @@ AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. ibroute requires libibumad.])) AC_CHECK_LIB(ibmad, mad_dump_int, [], AC_MSG_ERROR([mad_dump_int() not found. ibroute requires libibmad.])) +fi dnl Checks for header files. AC_HEADER_STDC AC_CHECK_HEADERS([netinet/in.h stdlib.h string.h unistd.h]) +if test "$disable_libcheck" != "yes" +then AC_CHECK_HEADER(infiniband/common.h, [], AC_MSG_ERROR([ not found. ibroute requires libibcommon.]) ) @@ -33,6 +44,7 @@ AC_CHECK_HEADER(infiniband/umad.h, [], AC_CHECK_HEADER(infiniband/mad.h, [], AC_MSG_ERROR([ not found. 
ibroute requires libibmad.]) ) +fi dnl Checks for library functions AC_FUNC_ERROR_AT_LINE Index: management/diags/sminfo/configure.in =================================================================== --- management/diags/sminfo/configure.in (revision 2963) +++ management/diags/sminfo/configure.in (working copy) @@ -8,10 +8,18 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(sminfo, 0.9.0) +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presense of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + dnl Checks for programs AC_PROG_CC AC_PROG_LIBTOOL +if test "$disable_libcheck" != "yes" +then dnl Checks for libraries LDFLAGS="$LDFLAGS -L/usr/local/ib/lib" AC_CHECK_LIB(ibcommon, sys_read_string, [], @@ -20,10 +28,13 @@ AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. sminfo requires libibumad.])) AC_CHECK_LIB(ibmad, mad_dump_int, [], AC_MSG_ERROR([mad_dump_int() not found. sminfo requires libibmad.])) +fi dnl Checks for header files. AC_HEADER_STDC AC_CHECK_HEADERS([stdlib.h unistd.h]) +if test "$disable_libcheck" != "yes" +then AC_CHECK_HEADER(infiniband/common.h, [], AC_MSG_ERROR([ not found. sminfo requires libibcommon.]) ) @@ -33,6 +44,7 @@ AC_CHECK_HEADER(infiniband/umad.h, [], AC_CHECK_HEADER(infiniband/mad.h, [], AC_MSG_ERROR([ not found. sminfo requires libibmad.]) ) +fi dnl Checks for library functions AC_FUNC_ERROR_AT_LINE -- MST From mlleini at ca.sandia.gov Wed Aug 10 08:55:48 2005 From: mlleini at ca.sandia.gov (Matt L. 
Leininger)
Date: Wed, 10 Aug 2005 08:55:48 -0700
Subject: [openib-general] InfiniBand Test Project === IBTP ===
In-Reply-To: <506C3D7B14CDD411A52C00025558DED607F0DDC8@MTLEX01>
References: <506C3D7B14CDD411A52C00025558DED607F0DDC8@MTLEX01>
Message-ID: <1123689348.23250.97.camel@localhost>

On Wed, 2005-08-10 at 17:27 +0300, Amit Krig wrote:
> -----Original Message-----
> From: James Lentini [mailto:jlentini at netapp.com]
> Sent: Wednesday, August 10, 2005 5:09 PM
> To: Amit Krig
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] InfiniBand Test Project === IBTP ===
>
> > On Tue, 9 Aug 2005, Amit Krig wrote:
> >
> > > Hi All,
> > >
> > > I would like to propose a new InfiniBand Test Project === IBTP ===
> > > The proposed new dir can be under Gen2/trunk =>
> > >
> > > https://openib.org/svn/gen2/trunk/ibtp/
> >
> > Did you consider using the existing
> > https://openib.org/svn/gen2/utils/src/ directory?
>
> This is an option as well,
> The tree is not the same, for example kdapl should be under infiniband
> -> ulp.
>
> > This part of the repository was recommended as the best place for
> > kdapltest, a test tool for kdapl.
>
> I think that the ibtp major goal is quality and not tools

Without tools you will never get good quality code. One of our highest
priorities is good tools for testing and diagnostics.

 - Matt

From mst at mellanox.co.il Wed Aug 10 09:12:27 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 10 Aug 2005 19:12:27 +0300
Subject: [openib-general] [PATCH] sdp: cleanup brackets
Message-ID: <20050810161227.GQ9141@mellanox.co.il>

I plan to commit the following tomorrow after a bit of testing.

---

Kill brackets around function parameters: the comma binds weaker than
any other operator. Plus some whitespace fixes.

Signed-off-by: Michael S.
Tsirkin Index: ulp/sdp/sdp_link.c =================================================================== --- ulp/sdp/sdp_link.c (revision 3036) +++ ulp/sdp/sdp_link.c (working copy) @@ -436,16 +436,13 @@ static void do_link_path_lookup(void *da info->path.pkey = cpu_to_be16(priv->pkey); info->path.numb_path = 1; - memcpy(&info->path.sgid, - (info->dev->dev_addr + 4), - sizeof(union ib_gid)); + memcpy(&info->path.sgid, info->dev->dev_addr + 4, sizeof(union ib_gid)); /* * If the routing device is loopback save the device address of * the IB device which was found. */ if (rt->u.dst.neighbour->dev->flags & IFF_LOOPBACK) { - memcpy(&info->path.dgid, - (info->dev->dev_addr + 4), + memcpy(&info->path.dgid, info->dev->dev_addr + 4, sizeof(union ib_gid)); goto path; @@ -453,8 +450,7 @@ static void do_link_path_lookup(void *da if ((NUD_CONNECTED|NUD_DELAY|NUD_PROBE) & rt->u.dst.neighbour->nud_state) { - memcpy(&info->path.dgid, - (rt->u.dst.neighbour->ha + 4), + memcpy(&info->path.dgid, rt->u.dst.neighbour->ha + 4, sizeof(union ib_gid)); goto path; @@ -617,7 +613,7 @@ static void sdp_link_sweep(void *data) if (jiffies > (info->use + SDP_LINK_INFO_TIMEOUT)) { sdp_dbg_ctrl(NULL, "info delete <%d.%d.%d.%d> <%lu:%lu>", - (info->dst & 0x000000ff), + info->dst & 0x000000ff, (info->dst & 0x0000ff00) >> 8, (info->dst & 0x00ff0000) >> 16, (info->dst & 0xff000000) >> 24, @@ -648,7 +644,7 @@ static void sdp_link_arp_work(void *data arp = (struct sdp_link_arp *)skb->nh.raw; sdp_dbg_data(NULL, "Recv IB ARP ip <%d.%d.%d.%d> gid <" GID_FMT ">", - (arp->src_ip & 0x000000ff), + arp->src_ip & 0x000000ff, (arp->src_ip & 0x0000ff00) >> 8, (arp->src_ip & 0x00ff0000) >> 16, (arp->src_ip & 0xff000000) >> 24, Index: ulp/sdp/sdp_rcvd.c =================================================================== --- ulp/sdp/sdp_rcvd.c (revision 3036) +++ ulp/sdp/sdp_rcvd.c (working copy) @@ -851,7 +851,7 @@ static int sdp_rcvd_src_avail(struct sdp sdp_dbg_warn(conn, "SrcAvail mode <%d> mismatch. 
<%d:%d>", conn->recv_mode, - (conn->src_recv + conn->snk_recv), size); + conn->src_recv + conn->snk_recv, size); result = -EPROTO; goto advt_error; Index: ulp/sdp/sdp_inet.c =================================================================== --- ulp/sdp/sdp_inet.c (revision 3036) +++ ulp/sdp/sdp_inet.c (working copy) @@ -550,7 +550,7 @@ static int sdp_inet_connect(struct socke /* * wait for connection to complete. */ - timeout = sock_sndtimeo(sk, (O_NONBLOCK & flags)); + timeout = sock_sndtimeo(sk, O_NONBLOCK & flags); if (timeout > 0) { DECLARE_WAITQUEUE(wait, current); @@ -698,7 +698,7 @@ static int sdp_inet_accept(struct socket goto listen_done; } - timeout = sock_rcvtimeo(listen_sk, (O_NONBLOCK & flags)); + timeout = sock_rcvtimeo(listen_sk, O_NONBLOCK & flags); /* * if there is no socket on the queue, wait for one. It' done in a * loop in case there is a problem with the first socket we hit. @@ -791,11 +791,11 @@ listen_done: sdp_dbg_ctrl(listen_conn, "ACCEPT: complete <%d> <%08x:%04x><%08x:%04x>", - (accept_conn ? accept_conn->hashent : SDP_DEV_SK_INVALID), - (accept_sk ? accept_conn->src_addr : 0), - (accept_sk ? accept_conn->src_port : 0), - (accept_sk ? accept_conn->dst_addr : 0), - (accept_sk ? accept_conn->dst_port : 0)); + accept_conn ? accept_conn->hashent : SDP_DEV_SK_INVALID, + accept_sk ? accept_conn->src_addr : 0, + accept_sk ? accept_conn->src_port : 0, + accept_sk ? accept_conn->dst_addr : 0, + accept_sk ? 
accept_conn->dst_port : 0); return result; } Index: ulp/sdp/sdp_send.c =================================================================== --- ulp/sdp/sdp_send.c (revision 3036) +++ ulp/sdp/sdp_send.c (working copy) @@ -641,7 +641,7 @@ static int sdp_send_data_iocb_src(struct goto error; } - memcpy(buff->tail, (addr + off), len); + memcpy(buff->tail, addr + off, len); kunmap_atomic(iocb->page_array[pos], KM_IRQ0); @@ -712,11 +712,11 @@ static int sdp_send_iocb_buff_write(stru break; } - copy = min((PAGE_SIZE - offset), + copy = min(PAGE_SIZE - offset, (unsigned long)(buff->end - buff->tail)); copy = min((unsigned long)iocb->len, copy); #ifndef _SDP_DATA_PATH_NULL - memcpy(buff->tail, (addr + offset), copy); + memcpy(buff->tail, addr + offset, copy); #endif buff->data_size += copy; buff->tail += copy; @@ -2021,8 +2021,8 @@ skip: /* entry point for IOCB based tran * onetime setup of timeout, but only if it's needed. */ if (timeout < 0) - timeout = sock_sndtimeo(sk, (MSG_DONTWAIT & - msg->msg_flags)); + timeout = sock_sndtimeo(sk, + MSG_DONTWAIT & msg->msg_flags); if (sk->sk_err) { result = (copied > 0) ? 0 : sock_error(sk); Index: ulp/sdp/sdp_actv.c =================================================================== --- ulp/sdp/sdp_actv.c (revision 3036) +++ ulp/sdp/sdp_actv.c (working copy) @@ -238,8 +238,8 @@ static int sdp_cm_hello_ack_check(struct if ((0xF0 & hello_ack->hah.version) != (0xF0 & SDP_MSG_VERSION)) { sdp_dbg_warn(NULL, "hello ack, version mismatch. 
<%d:%d>", - ((0xF0 & hello_ack->hah.version) >> 4), - ((0xF0 & SDP_MSG_VERSION) >> 4)); + (0xF0 & hello_ack->hah.version) >> 4, + (0xF0 & SDP_MSG_VERSION) >> 4); return -EINVAL; } Index: ulp/sdp/sdp_conn.c =================================================================== --- ulp/sdp/sdp_conn.c (revision 3036) +++ ulp/sdp/sdp_conn.c (working copy) @@ -1337,8 +1337,8 @@ int sdp_proc_dump_conn_main(char *buffer * header should only be printed once */ if (!start_index) { - offset += sprintf((buffer + offset), SDP_PROC_CONN_MAIN_HEAD); - offset += sprintf((buffer + offset), SDP_PROC_CONN_MAIN_SEP); + offset += sprintf(buffer + offset, SDP_PROC_CONN_MAIN_HEAD); + offset += sprintf(buffer + offset, SDP_PROC_CONN_MAIN_SEP); } /* * lock table @@ -1364,16 +1364,16 @@ int sdp_proc_dump_conn_main(char *buffer d_guid = cpu_to_be64(conn->d_gid.global.interface_id); s_guid = cpu_to_be64(conn->s_gid.global.interface_id); - offset += sprintf((buffer + offset), SDP_PROC_CONN_MAIN_FORM, - (conn->dst_addr & 0xff), - ((conn->dst_addr >> 8) & 0xff), - ((conn->dst_addr >> 16) & 0xff), - ((conn->dst_addr >> 24) & 0xff), + offset += sprintf(buffer + offset, SDP_PROC_CONN_MAIN_FORM, + conn->dst_addr & 0xff, + (conn->dst_addr >> 8) & 0xff, + (conn->dst_addr >> 16) & 0xff, + (conn->dst_addr >> 24) & 0xff, conn->dst_port, - (conn->src_addr & 0xff), - ((conn->src_addr >> 8) & 0xff), - ((conn->src_addr >> 16) & 0xff), - ((conn->src_addr >> 24) & 0xff), + conn->src_addr & 0xff, + (conn->src_addr >> 8) & 0xff, + (conn->src_addr >> 16) & 0xff, + (conn->src_addr >> 24) & 0xff, conn->src_port, conn->hashent, conn->cm_id ? 
conn->cm_id->local_id : 0, @@ -1440,8 +1440,8 @@ int sdp_proc_dump_conn_data(char *buffer * header should only be printed once */ if (!start_index) { - offset += sprintf((buffer + offset), SDP_PROC_CONN_DATA_HEAD); - offset += sprintf((buffer + offset), SDP_PROC_CONN_DATA_SEP); + offset += sprintf(buffer + offset, SDP_PROC_CONN_DATA_HEAD); + offset += sprintf(buffer + offset, SDP_PROC_CONN_DATA_SEP); } /* * lock table @@ -1464,7 +1464,7 @@ int sdp_proc_dump_conn_data(char *buffer conn = dev_root_s.sk_array[counter]; sk = sk_sdp(conn); - offset += sprintf((buffer + offset), SDP_PROC_CONN_DATA_FORM, + offset += sprintf(buffer + offset, SDP_PROC_CONN_DATA_FORM, conn->hashent, conn->state, conn->recv_mode, @@ -1533,8 +1533,8 @@ int sdp_proc_dump_conn_rdma(char *buffer * header should only be printed once */ if (!start_index) { - offset += sprintf((buffer + offset), SDP_PROC_CONN_RDMA_HEAD); - offset += sprintf((buffer + offset), SDP_PROC_CONN_RDMA_SEP); + offset += sprintf(buffer + offset, SDP_PROC_CONN_RDMA_HEAD); + offset += sprintf(buffer + offset, SDP_PROC_CONN_RDMA_SEP); } /* * lock table @@ -1556,7 +1556,7 @@ int sdp_proc_dump_conn_rdma(char *buffer conn = dev_root_s.sk_array[counter]; - offset += sprintf((buffer + offset),SDP_PROC_CONN_RDMA_FORM, + offset += sprintf(buffer + offset, SDP_PROC_CONN_RDMA_FORM, conn->hashent, conn->src_recv, conn->snk_recv, @@ -1610,8 +1610,8 @@ int sdp_proc_dump_conn_sopt(char *buffer * header should only be printed once */ if (!start_index) { - offset += sprintf((buffer + offset), SDP_PROC_CONN_SOPT_HEAD); - offset += sprintf((buffer + offset), SDP_PROC_CONN_SOPT_SEP); + offset += sprintf(buffer + offset, SDP_PROC_CONN_SOPT_HEAD); + offset += sprintf(buffer + offset, SDP_PROC_CONN_SOPT_SEP); } /* * lock table @@ -1633,16 +1633,16 @@ int sdp_proc_dump_conn_sopt(char *buffer conn = dev_root_s.sk_array[counter]; - offset += sprintf((buffer + offset), SDP_PROC_CONN_SOPT_FORM, - (conn->dst_addr & 0xff), - ((conn->dst_addr >> 8) & 
0xff), - ((conn->dst_addr >> 16) & 0xff), - ((conn->dst_addr >> 24) & 0xff), + offset += sprintf(buffer + offset, SDP_PROC_CONN_SOPT_FORM, + conn->dst_addr & 0xff, + (conn->dst_addr >> 8) & 0xff, + (conn->dst_addr >> 16) & 0xff, + (conn->dst_addr >> 24) & 0xff, conn->dst_port, - (conn->src_addr & 0xff), - ((conn->src_addr >> 8) & 0xff), - ((conn->src_addr >> 16) & 0xff), - ((conn->src_addr >> 24) & 0xff), + conn->src_addr & 0xff, + (conn->src_addr >> 8) & 0xff, + (conn->src_addr >> 16) & 0xff, + (conn->src_addr >> 24) & 0xff, conn->src_port, conn->src_zthresh, conn->snk_zthresh, @@ -1668,29 +1668,29 @@ int sdp_proc_dump_device(char *buffer, i * header should only be printed once */ if (!start_index) { - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, "connection table maximum: <%d>\n", dev_root_s.sk_size); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, "connection table entries: <%d>\n", dev_root_s.sk_entry); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, "connection table rover: <%d>\n", dev_root_s.sk_rover); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, "max send posted: <%d>\n", dev_root_s.send_post_max); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, "max send buffered: <%d>\n", dev_root_s.send_buff_max); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, "max send unsignalled: <%d>\n", dev_root_s.send_usig_max); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, "max receive posted: <%d>\n", dev_root_s.recv_post_max); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, "max receive buffered: <%d>\n", dev_root_s.recv_buff_max); } Index: ulp/sdp/sdp_recv.c =================================================================== --- ulp/sdp/sdp_recv.c (revision 3036) +++ ulp/sdp/sdp_recv.c (working copy) @@ -481,7 +481,7 @@ int sdp_recv_flush(struct 
sdp_sock *conn * the buffered receive/receive posted queue, and the maximum * number which are allowed to be posted at a given time. */ - counter = min((s32)((conn->rwin_max - conn->byte_strm)/ + counter = min((s32)((conn->rwin_max - conn->byte_strm) / conn->recv_size), (s32) (conn->recv_max - sdp_buff_q_size(&conn->recv_pool))); @@ -594,11 +594,11 @@ static int sdp_read_buff_iocb(struct sdp if (!addr) break; - copy = min((PAGE_SIZE - offset), + copy = min(PAGE_SIZE - offset, (unsigned long)(buff->tail - buff->data)); copy = min((unsigned long)iocb->len, copy); #ifndef _SDP_DATA_PATH_NULL - memcpy((addr + offset), buff->data, copy); + memcpy(addr + offset, buff->data, copy); #endif buff->data += copy; @@ -1120,8 +1120,8 @@ int sdp_inet_recv(struct kiocb *req, st /* * get socket values we'll need. */ - timeout = sock_rcvtimeo(sk, (flags & MSG_DONTWAIT)); - low_water = sock_rcvlowat(sk, (flags & MSG_WAITALL), size); + timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); + low_water = sock_rcvlowat(sk, flags & MSG_WAITALL, size); /* * process data first, and then check condition, then wait */ Index: ulp/sdp/sdp_pass.c =================================================================== --- ulp/sdp/sdp_pass.c (revision 3036) +++ ulp/sdp/sdp_pass.c (working copy) @@ -312,8 +312,8 @@ static int sdp_cm_hello_check(struct sdp if ((0xF0 & msg_hello->hh.version) != (0xF0 & SDP_MSG_VERSION)) { sdp_dbg_warn(NULL, "hello msg, version mismatch. <%d:%d>", - ((0xF0 & msg_hello->hh.version) >> 4), - ((0xF0 & SDP_MSG_VERSION) >> 4)); + (0xF0 & msg_hello->hh.version) >> 4, + (0xF0 & SDP_MSG_VERSION) >> 4); return -EINVAL; } #ifdef _SDP_MS_APRIL_ERROR_COMPAT Index: ulp/sdp/sdp_iocb.c =================================================================== --- ulp/sdp/sdp_iocb.c (revision 3049) +++ ulp/sdp/sdp_iocb.c (working copy) @@ -90,7 +90,7 @@ void sdp_iocb_unlock(struct sdpc_iocb *i * try to get all pages in one go. */ /* TODO: use cache for allocations? Allocate by chunks? 
*/ - pages = kmalloc((sizeof(struct page *) * iocb->page_count), GFP_KERNEL); + pages = kmalloc(sizeof(struct page *) * iocb->page_count, GFP_KERNEL); down_read(&iocb->mm->mmap_sem); if (pages) { result = get_user_pages(iocb->tsk, iocb->mm, iocb->addr, @@ -167,12 +167,11 @@ int sdp_iocb_lock(struct sdpc_iocb *iocb */ /* TODO: use cache for allocations? Allocate by chunks? */ - iocb->addr_array = kmalloc((sizeof(u64) * iocb->page_count), - GFP_KERNEL); + iocb->addr_array = kmalloc(sizeof(u64) * iocb->page_count, GFP_KERNEL); if (!iocb->addr_array) goto err_addr; - iocb->page_array = kmalloc((sizeof(struct page *) * iocb->page_count), + iocb->page_array = kmalloc(sizeof(struct page *) * iocb->page_count, GFP_KERNEL); if (!iocb->page_array) goto err_page; Index: ulp/sdp/sdp_buff.c =================================================================== --- ulp/sdp/sdp_buff.c (revision 3036) +++ ulp/sdp/sdp_buff.c (working copy) @@ -342,8 +342,8 @@ static inline void sdp_buff_pool_release * Always leave at least minimum buffers, otherwise remove * either half of the pool, which is more then the mark */ - count = min((m_pool->buff_cur - m_pool->buff_min), - (m_pool->free_mark/2)); + count = min(m_pool->buff_cur - m_pool->buff_min, + m_pool->free_mark/2); sdp_buff_pool_release(m_pool, count); } @@ -359,7 +359,7 @@ static int sdp_buff_pool_alloc(struct sd /* * Calculate the total number of buffers. 
*/ - total = max(m_pool->buff_min, (m_pool->buff_cur + m_pool->alloc_inc)); + total = max(m_pool->buff_min, m_pool->buff_cur + m_pool->alloc_inc); total = min(total, m_pool->buff_max); while (total > m_pool->buff_cur) { @@ -719,28 +719,28 @@ int sdp_proc_dump_buff_pool(char *buffer spin_lock_irqsave(&main_pool->lock, flags); if (!start_index) { - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, " buffer size: %8d\n", main_pool->buff_size); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, " buffers maximum: %8d\n", main_pool->buff_max); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, " buffers minimum: %8d\n", main_pool->buff_min); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, " buffers increment: %8d\n", main_pool->alloc_inc); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, " buffers decrement: %8d\n", main_pool->free_mark); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, " buffers allocated: %8d\n", main_pool->buff_cur); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, " buffers available: %8d\n", main_pool->pool.size); - offset += sprintf((buffer + offset), + offset += sprintf(buffer + offset, " buffers outstanding: %8d\n", main_pool->buff_cur - main_pool->pool.size); } -- MST From halr at voltaire.com Wed Aug 10 09:26:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Aug 2005 12:26:31 -0400 Subject: [openib-general] Is IB_EVENT_SM_CHANGE is equivalent in semantic to the Client Rer egister event in the IB SPEC In-Reply-To: <506C3D7B14CDD411A52C00025558DED6085BD09E@MTLEX01> References: <506C3D7B14CDD411A52C00025558DED6085BD09E@MTLEX01> Message-ID: <1123691190.4403.2129.camel@hal.voltaire.com> Hi Tziporet, On Wed, 2005-08-10 at 11:13, Tziporet Koren wrote: > Hi, > I am working to enable the Client Reregister event as defined in IB > SPEC on chapter 14.4.11. 
> While doing it I saw that in the core the event IB_EVENT_SM_CHANGE is
> defined and also ULPs are using it but this event is never generated.
>
> My understanding is that Client Reregister event should generate this
> event.
> Is this correct?

I think these are 2 different things. There is the
PortInfo::ClientReregister bit (assuming IsClientReregistrationSupported
is indicated in PortInfo::CapabilityMask) which is used by the SM to tell
the endport to reregister. SM Change event would be when the SM LID in
PortInfo is changed (by SM failover/takeover).

However, in OpenIB, I believe that any receipt of a Set(PortInfo) directed
at this endport/node is not distinguished and there is one event indicated
for this currently (IB_EVENT_LID_CHANGE). Perhaps that will change to be
finer grained (saving and comparing some state) going forward.

-- Hal

> Thanks,
> Tziporet Koren
> Software Director
> Mellanox Technologies, Ltd
> mailto:tziporet at mellanox.co.il
> Tel +972-4-9097200, ext 380

From amitk at mellanox.co.il Wed Aug 10 10:44:23 2005
From: amitk at mellanox.co.il (Amit Krig)
Date: Wed, 10 Aug 2005 20:44:23 +0300
Subject: [openib-general] InfiniBand Test Project === IBTP ===
Message-ID: <506C3D7B14CDD411A52C00025558DED607F0DDCD@mtlex01.yok.mtl.com>

> Regards,
>
> Amit Krig
> Mellanox Technologies LTD
> amitk at mellanox.co.il
> Tel: +972-4-9593244 ext: 315,
> Fax: +972-4-9593245
> P.O.B 586, Yokneam, 20692

-----Original Message-----
From: James Lentini [mailto:jlentini at netapp.com]
Sent: Wednesday, August 10, 2005 5:57 PM
To: Amit Krig
Cc: openib-general at openib.org
Subject: RE: [openib-general] InfiniBand Test Project === IBTP ===

On
Wed, 10 Aug 2005, Amit Krig wrote:

>> On Tue, 9 Aug 2005, Amit Krig wrote:
>
>>
>> Hi All,
>>
>> I would like to propose a new InfiniBand Test Project === IBTP === The
>> proposed new dir can be under Gen2/trunk =>
>>
>> https://openib.org/svn/gen2/trunk/ibtp/
>>
>
>> Did you consider using the existing
>> https://openib.org/svn/gen2/utils/src/ directory?
>
> This is an option as well,
> The tree is not the same, for example kdapl should be under infiniband ->
> ulp.

We can re-arrange the directory structure.

>> This part of the repository was recommended as the best place for
> kdapltest, a test tool for kdapl.
>
> I think that the ibtp major goal is quality and not tools

> Ok. I just have two concerns.
> First, I would prefer a directory name that was less obfuscated. It is
> not immediately obvious what a directory called "ibtp" contains. If
> the directory was called "test", then it would be clear.

The idea of ibtp is from ltp (linux test project)
http://ltp.sourceforge.net/
We can go with test dir as well.

> Second, I'm not clear on why this code should go under the trunk but
> the tools in https://openib.org/svn/gen2/utils should not. I had been
> thinking that all of the code under the trunk was intended for end
> users. I'm not sure if that is the distinction though. Perhaps the
> ulps, users, and utils, directories in https://openib.org/svn/gen2/
> are out of place.

Maybe it should be also under the trunk.

From amitk at mellanox.co.il Wed Aug 10 10:47:44 2005
From: amitk at mellanox.co.il (Amit Krig)
Date: Wed, 10 Aug 2005 20:47:44 +0300
Subject: [openib-general] InfiniBand Test Project === IBTP ===
Message-ID: <506C3D7B14CDD411A52C00025558DED607F0DDCE@mtlex01.yok.mtl.com>

> Regards,
>
> Amit Krig
> Mellanox Technologies LTD
> amitk at mellanox.co.il
> Tel: +972-4-9593244 ext: 315,
> Fax: +972-4-9593245
> P.O.B 586, Yokneam, 20692

-----Original Message-----
From: Matt L. Leininger [mailto:mlleini at ca.sandia.gov]
Sent: Wednesday, August 10, 2005 6:56 PM
To: Amit Krig
Cc: 'James Lentini'; openib-general at openib.org
Subject: RE: [openib-general] InfiniBand Test Project === IBTP ===

On Wed, 2005-08-10 at 17:27 +0300, Amit Krig wrote:
> -----Original Message-----
> From: James Lentini [mailto:jlentini at netapp.com]
> Sent: Wednesday, August 10, 2005 5:09 PM
> To: Amit Krig
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] InfiniBand Test Project === IBTP ===
>
> > On Tue, 9 Aug 2005, Amit Krig wrote:
> >
> > > Hi All,
> > >
> > > I would like to propose a new InfiniBand Test Project === IBTP ===
> > > The proposed new dir can be under Gen2/trunk =>
> > >
> > > https://openib.org/svn/gen2/trunk/ibtp/
> >
> > Did you consider using the existing
> > https://openib.org/svn/gen2/utils/src/ directory?
>
> This is an option as well,
> The tree is not the same, for example kdapl should be under infiniband
> -> ulp.
>
> > This part of the repository was recommended as the best place for
> > kdapltest, a test tool for kdapl.
>
> I think that the ibtp major goal is quality and not tools
>
> Without tools you will never get good quality code. One of our
> highest priorities is good tools for testing and diagnostics.

The tools for running the tests should be a part of the test project, in
my suggestion you can find a tool dir.

 - Matt

From arlin.r.davis at intel.com Wed Aug 10 11:31:16 2005
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Wed, 10 Aug 2005 11:31:16 -0700
Subject: [openib-general] RE: [PATCH] [ucm] fix for potential deadlock
In-Reply-To:
Message-ID:

>-----Original Message-----
>From: Hefty, Sean
>Sent: Tuesday, August 09, 2005 4:24 PM
>To: openib-general at openib.org; Davis, Arlin R
>Subject: [PATCH] [ucm] fix for potential deadlock
>
>The following patch fixes a potential deadlock condition in the kernel
>ucm code resulting from trying to destroy a cm_id while in the context
>of a CM callback thread. The synchronization around the ucm context
>structure was simplified as a result, and some simple code cleanup is
>included. (I tried keeping the code cleanup separate, but it was
>turning out to be more work.)
>
>Arlin, can you please test with this and see if your problems (well the
>ones related to IB anyway ;) go away.

Yes, the issues with dapltest regress.sh (client 6) are resolved with this
change.

Thanks!

-arlin

>
>Signed-off-by: Sean Hefty
>

From sean.hefty at intel.com Wed Aug 10 11:33:32 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 10 Aug 2005 11:33:32 -0700
Subject: [openib-general] userspace event reporting take 2
Message-ID:

Here's another stab at trying to improve userspace event reporting. I
think that the first discussion on this ended with a solution that wasn't
any better than what is there now.

The goal is to provide userspace clients receiving an event with a context
that is valid and does not require searches.
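
A rough C sketch of that goal, under stated assumptions: the names
struct uobj, struct uevent, get_event, put_event and begin_destroy are
hypothetical and come from neither libibcm nor the kernel ucm code. Each
event points straight at its owning object, so the user context comes back
without a table search, and the object counts events reported to userspace
against events userspace has finished with. It is single-threaded for
brevity; a real implementation would need locking and a way to block in
destroy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace object; these names are illustrative only. */
struct uobj {
	void *context;       /* user-supplied pointer, handed back with events */
	unsigned reported;   /* events handed to userspace */
	unsigned completed;  /* events userspace has finished with */
	int destroying;      /* set once destroy has been requested */
};

struct uevent {
	struct uobj *obj;    /* direct back-pointer: no table search needed */
	int type;
};

/* Hand an event to the caller, account for it, return the user context. */
static void *get_event(struct uevent *ev)
{
	ev->obj->reported++;
	return ev->obj->context;
}

/* Caller is done with the event; returns 1 if a pending destroy may proceed. */
static int put_event(struct uevent *ev)
{
	struct uobj *obj = ev->obj;

	obj->completed++;
	return obj->destroying && obj->reported == obj->completed;
}

/* Request destruction; returns 1 if no reported events are outstanding. */
static int begin_destroy(struct uobj *obj)
{
	obj->destroying = 1;
	return obj->reported == obj->completed;
}
```

With one event retrieved and not yet put, begin_destroy() reports that the
object cannot be freed; the matching put_event() is what releases it, so
the context stays valid for as long as the caller holds the event.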
Here's another attempt at a fix:

Destroy userspace_object:
    Destroy the corresponding kernel object
    Clean up all outstanding associated events
    Return reported events that have been retrieved by userspace
    Wait until reported events == completed events

Get event:
    Retrieve an event
    Increment reported events

Put event:
    Increment completed events
    Signal destroy if reported events == completed events and destroying

I think that this guarantees that the user context will be valid until put
event is called, and should avoid searches in either the userspace client
or userspace IB module.

- Sean

From halr at voltaire.com Wed Aug 10 14:48:31 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Aug 2005 17:48:31 -0400
Subject: [openib-general] Re: [PATCH] diags: configure option to skip library check
In-Reply-To: <20050810153049.GO9141@mellanox.co.il>
References: <20050810153049.GO9141@mellanox.co.il>
Message-ID: <1123710510.4403.3060.camel@hal.voltaire.com>

On Wed, 2005-08-10 at 11:30, Michael S. Tsirkin wrote:
> Add option to skip infiniband library checks in diags

Thanks. Applied.

-- Hal

From mshefty at ichips.intel.com Wed Aug 10 15:40:25 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 10 Aug 2005 15:40:25 -0700
Subject: [openib-general] userspace event reporting take 2
In-Reply-To:
References:
Message-ID: <42FA8259.4020904@ichips.intel.com>

Sean Hefty wrote:
> The goal is to provide userspace clients receiving an event with a context that
> is valid and does not require searches.

The user CM does not provide any user specified context in any of its
calls. Adding it will result in API/ABI changes. Does anyone have any
objection to this?

- Sean

From sean.hefty at intel.com Wed Aug 10 17:14:06 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 10 Aug 2005 17:14:06 -0700
Subject: [openib-general] [RFC] [uCM] proposed API changes
Message-ID:

I'd like to propose the following changes to the user CM API.
These would allow returning user specified context when reporting events to the user. I also added a call to retrieve the necessary QP attributes from the kernel CM that I would like to include as a part of the API/ABI changes. Comments? - Sean Index: include/infiniband/cm.h =================================================================== --- include/infiniband/cm.h (revision 2989) +++ include/infiniband/cm.h (working copy) @@ -77,8 +77,15 @@ enum ib_cm_data_size { IB_CM_SIDR_REP_INFO_LENGTH = 72 }; +struct ib_cm_id { + void *context; + uint64_t service_id; + uint64_t service_mask; + uint32_t handle; +}; + struct ib_cm_req_event_param { - uint32_t listen_id; + struct ib_cm_id *listen_id; struct ib_sa_path_rec *primary_path; struct ib_sa_path_rec *alternate_path; @@ -187,7 +194,7 @@ struct ib_cm_apr_event_param { }; struct ib_cm_sidr_req_event_param { - uint32_t listen_id; + struct ib_cm_id *listen_id; struct ib_device *device; uint8_t port; uint16_t pkey; @@ -212,7 +219,7 @@ struct ib_cm_sidr_rep_event_param { }; struct ib_cm_event { - uint32_t cm_id; + struct ib_cm_id *cm_id; enum ib_cm_event_type event; union { struct ib_cm_req_event_param req_rcvd; @@ -287,30 +294,13 @@ int ib_cm_get_fd(void); * Communication identifiers are used to track connection states, service * ID resolution requests, and listen requests. */ -int ib_cm_create_id(uint32_t *cm_id); +struct ib_cm_id *ib_cm_create_id(void *context); /** * ib_cm_destroy_id - Destroy a connection identifier. * @cm_id: Connection identifier to destroy. */ -int ib_cm_destroy_id(uint32_t cm_id); - -struct ib_cm_attr_param { - uint64_t service_id; - uint64_t service_mask; - uint32_t local_id; - uint32_t remote_id; -}; - -/** - * ib_cm_attr_id - Get connection identifier attributes. - * @cm_id: Connection identifier to retrieve attributes. - * @param: Destination of retreived parameters. - * - * Not all parameters are valid during all connection states. 
- */ -int ib_cm_attr_id(uint32_t cm_id, - struct ib_cm_attr_param *param); +int ib_cm_destroy_id(struct ib_cm_id *cm_id); /** * ib_cm_listen - Initiates listening on the specified service ID for @@ -323,7 +313,7 @@ int ib_cm_attr_id(uint32_t cm_id, * range of service IDs. If set to 0, the service ID is matched * exactly. */ -int ib_cm_listen(uint32_t cm_id, +int ib_cm_listen(struct ib_cm_id *cm_id, uint64_t service_id, uint64_t service_mask); @@ -355,7 +345,7 @@ struct ib_cm_req_param { * @param: Connection request information needed to establish the * connection. */ -int ib_cm_send_req(uint32_t cm_id, +int ib_cm_send_req(struct ib_cm_id *cm_id, struct ib_cm_req_param *param); struct ib_cm_rep_param { @@ -380,7 +370,7 @@ struct ib_cm_rep_param { * @param: Connection reply information needed to establish the * connection. */ -int ib_cm_send_rep(uint32_t cm_id, +int ib_cm_send_rep(struct ib_cm_id *cm_id, struct ib_cm_rep_param *param); /** @@ -391,7 +381,7 @@ int ib_cm_send_rep(uint32_t cm_id, * ready to use message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_rtu(uint32_t cm_id, +int ib_cm_send_rtu(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len); @@ -404,7 +394,7 @@ int ib_cm_send_rtu(uint32_t cm_id, * disconnection request message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_dreq(uint32_t cm_id, +int ib_cm_send_dreq(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len); @@ -416,7 +406,7 @@ int ib_cm_send_dreq(uint32_t cm_id, * disconnection reply message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_drep(uint32_t cm_id, +int ib_cm_send_drep(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len); @@ -427,7 +417,7 @@ int ib_cm_send_drep(uint32_t cm_id, * This routine should be invoked by users who receive messages on a * connected QP before an RTU has been received. 
*/ -int ib_cm_establish(uint32_t cm_id); +int ib_cm_establish(struct ib_cm_id *cm_id); /** * ib_cm_send_rej - Sends a connection rejection message to the @@ -441,7 +431,7 @@ int ib_cm_establish(uint32_t cm_id); * rejection message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_rej(uint32_t cm_id, +int ib_cm_send_rej(struct ib_cm_id *cm_id, enum ib_cm_rej_reason reason, void *ari, uint8_t ari_length, @@ -458,7 +448,7 @@ int ib_cm_send_rej(uint32_t cm_id, * message receipt acknowledgement. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_mra(uint32_t cm_id, +int ib_cm_send_mra(struct ib_cm_id *cm_id, uint8_t service_timeout, void *private_data, uint8_t private_data_len); @@ -473,12 +463,32 @@ int ib_cm_send_mra(uint32_t cm_id, * load alternate path message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_lap(uint32_t cm_id, +int ib_cm_send_lap(struct ib_cm_id *cm_id, struct ib_sa_path_rec *alternate_path, void *private_data, uint8_t private_data_len); /** + * ib_cm_init_qp_attr - Initializes the QP attributes for use in transitioning + * to a specified QP state. + * @cm_id: Communication identifier associated with the QP attributes to + * initialize. + * @qp_attr: On input, specifies the desired QP state. On output, the + * mandatory and desired optional attributes will be set in order to + * modify the QP to the specified state. + * @qp_attr_mask: The QP attribute mask that may be used to transition the + * QP to the specified state. + * + * Users must set the @qp_attr->qp_state to the desired QP state. This call + * will set all required attributes for the given transition, along with + * known optional attributes. Users may override the attributes returned from + * this call before calling ib_modify_qp. 
+ */ +int ib_cm_init_qp_attr(struct ib_cm_id *cm_id, + struct ibv_qp_attr *qp_attr, + int *qp_attr_mask); + +/** * ib_cm_send_apr - Sends an alternate path response message in response to * a load alternate path request. * @cm_id: Connection identifier associated with the alternate path response. @@ -490,7 +500,7 @@ int ib_cm_send_lap(uint32_t cm_id, * alternate path response message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_apr(uint32_t cm_id, +int ib_cm_send_apr(struct ib_cm_id *cm_id, enum ib_cm_apr_status status, void *info, uint8_t info_length, @@ -514,7 +524,7 @@ struct ib_cm_sidr_req_param { * service ID resolution request. * @param: Service ID resolution request information. */ -int ib_cm_send_sidr_req(uint32_t cm_id, +int ib_cm_send_sidr_req(struct ib_cm_id *cm_id, struct ib_cm_sidr_req_param *param); struct ib_cm_sidr_rep_param { @@ -534,7 +544,7 @@ struct ib_cm_sidr_rep_param { * resolution request. * @param: Service ID resolution reply information. */ -int ib_cm_send_sidr_rep(uint32_t cm_id, +int ib_cm_send_sidr_rep(struct ib_cm_id *cm_id, struct ib_cm_sidr_rep_param *param); #endif /* CM_H */ From ardavis at ichips.intel.com Wed Aug 10 20:08:42 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 10 Aug 2005 20:08:42 -0700 Subject: [openib-general] [RFC] [uCM] proposed API changes In-Reply-To: References: Message-ID: <42FAC13A.6020801@ichips.intel.com> Sean Hefty wrote: >I'd like to propose the following changes to the user CM API. These >would allow returning user specified context when reporting events to >the user. I also added a call to retrieve the necessary QP attributes >from the kernel CM that I would like to include as a part of the API/ABI >changes. Comments? > >- Sean > > Looks good so far. The context will save uDAPL a list search and the uCM QP attributes is a much needed feature that I have been anxiously waiting for. 
Thanks, -arlin > >Index: include/infiniband/cm.h >=================================================================== >--- include/infiniband/cm.h (revision 2989) >+++ include/infiniband/cm.h (working copy) >@@ -77,8 +77,15 @@ enum ib_cm_data_size { > IB_CM_SIDR_REP_INFO_LENGTH = 72 > }; > >+struct ib_cm_id { >+ void *context; >+ uint64_t service_id; >+ uint64_t service_mask; >+ uint32_t handle; >+}; >+ > struct ib_cm_req_event_param { >- uint32_t listen_id; >+ struct ib_cm_id *listen_id; > > struct ib_sa_path_rec *primary_path; > struct ib_sa_path_rec *alternate_path; >@@ -187,7 +194,7 @@ struct ib_cm_apr_event_param { > }; > > struct ib_cm_sidr_req_event_param { >- uint32_t listen_id; >+ struct ib_cm_id *listen_id; > struct ib_device *device; > uint8_t port; > uint16_t pkey; >@@ -212,7 +219,7 @@ struct ib_cm_sidr_rep_event_param { > }; > > struct ib_cm_event { >- uint32_t cm_id; >+ struct ib_cm_id *cm_id; > enum ib_cm_event_type event; > union { > struct ib_cm_req_event_param req_rcvd; >@@ -287,30 +294,13 @@ int ib_cm_get_fd(void); > * Communication identifiers are used to track connection states, service > * ID resolution requests, and listen requests. > */ >-int ib_cm_create_id(uint32_t *cm_id); >+struct ib_cm_id *ib_cm_create_id(void *context); > > /** > * ib_cm_destroy_id - Destroy a connection identifier. > * @cm_id: Connection identifier to destroy. > */ >-int ib_cm_destroy_id(uint32_t cm_id); >- >-struct ib_cm_attr_param { >- uint64_t service_id; >- uint64_t service_mask; >- uint32_t local_id; >- uint32_t remote_id; >-}; >- >-/** >- * ib_cm_attr_id - Get connection identifier attributes. >- * @cm_id: Connection identifier to retrieve attributes. >- * @param: Destination of retreived parameters. >- * >- * Not all parameters are valid during all connection states. 
>- */ >-int ib_cm_attr_id(uint32_t cm_id, >- struct ib_cm_attr_param *param); >+int ib_cm_destroy_id(struct ib_cm_id *cm_id); > > /** > * ib_cm_listen - Initiates listening on the specified service ID for >@@ -323,7 +313,7 @@ int ib_cm_attr_id(uint32_t cm_id, > * range of service IDs. If set to 0, the service ID is matched > * exactly. > */ >-int ib_cm_listen(uint32_t cm_id, >+int ib_cm_listen(struct ib_cm_id *cm_id, > uint64_t service_id, > uint64_t service_mask); > >@@ -355,7 +345,7 @@ struct ib_cm_req_param { > * @param: Connection request information needed to establish the > * connection. > */ >-int ib_cm_send_req(uint32_t cm_id, >+int ib_cm_send_req(struct ib_cm_id *cm_id, > struct ib_cm_req_param *param); > > struct ib_cm_rep_param { >@@ -380,7 +370,7 @@ struct ib_cm_rep_param { > * @param: Connection reply information needed to establish the > * connection. > */ >-int ib_cm_send_rep(uint32_t cm_id, >+int ib_cm_send_rep(struct ib_cm_id *cm_id, > struct ib_cm_rep_param *param); > > /** >@@ -391,7 +381,7 @@ int ib_cm_send_rep(uint32_t cm_id, > * ready to use message. > * @private_data_len: Size of the private data buffer, in bytes. > */ >-int ib_cm_send_rtu(uint32_t cm_id, >+int ib_cm_send_rtu(struct ib_cm_id *cm_id, > void *private_data, > uint8_t private_data_len); > >@@ -404,7 +394,7 @@ int ib_cm_send_rtu(uint32_t cm_id, > * disconnection request message. > * @private_data_len: Size of the private data buffer, in bytes. > */ >-int ib_cm_send_dreq(uint32_t cm_id, >+int ib_cm_send_dreq(struct ib_cm_id *cm_id, > void *private_data, > uint8_t private_data_len); > >@@ -416,7 +406,7 @@ int ib_cm_send_dreq(uint32_t cm_id, > * disconnection reply message. > * @private_data_len: Size of the private data buffer, in bytes. 
> */ >-int ib_cm_send_drep(uint32_t cm_id, >+int ib_cm_send_drep(struct ib_cm_id *cm_id, > void *private_data, > uint8_t private_data_len); > >@@ -427,7 +417,7 @@ int ib_cm_send_drep(uint32_t cm_id, > * This routine should be invoked by users who receive messages on a > * connected QP before an RTU has been received. > */ >-int ib_cm_establish(uint32_t cm_id); >+int ib_cm_establish(struct ib_cm_id *cm_id); > > /** > * ib_cm_send_rej - Sends a connection rejection message to the >@@ -441,7 +431,7 @@ int ib_cm_establish(uint32_t cm_id); > * rejection message. > * @private_data_len: Size of the private data buffer, in bytes. > */ >-int ib_cm_send_rej(uint32_t cm_id, >+int ib_cm_send_rej(struct ib_cm_id *cm_id, > enum ib_cm_rej_reason reason, > void *ari, > uint8_t ari_length, >@@ -458,7 +448,7 @@ int ib_cm_send_rej(uint32_t cm_id, > * message receipt acknowledgement. > * @private_data_len: Size of the private data buffer, in bytes. > */ >-int ib_cm_send_mra(uint32_t cm_id, >+int ib_cm_send_mra(struct ib_cm_id *cm_id, > uint8_t service_timeout, > void *private_data, > uint8_t private_data_len); >@@ -473,12 +463,32 @@ int ib_cm_send_mra(uint32_t cm_id, > * load alternate path message. > * @private_data_len: Size of the private data buffer, in bytes. > */ >-int ib_cm_send_lap(uint32_t cm_id, >+int ib_cm_send_lap(struct ib_cm_id *cm_id, > struct ib_sa_path_rec *alternate_path, > void *private_data, > uint8_t private_data_len); > > /** >+ * ib_cm_init_qp_attr - Initializes the QP attributes for use in transitioning >+ * to a specified QP state. >+ * @cm_id: Communication identifier associated with the QP attributes to >+ * initialize. >+ * @qp_attr: On input, specifies the desired QP state. On output, the >+ * mandatory and desired optional attributes will be set in order to >+ * modify the QP to the specified state. >+ * @qp_attr_mask: The QP attribute mask that may be used to transition the >+ * QP to the specified state. 
>+ * >+ * Users must set the @qp_attr->qp_state to the desired QP state. This call >+ * will set all required attributes for the given transition, along with >+ * known optional attributes. Users may override the attributes returned from >+ * this call before calling ib_modify_qp. >+ */ >+int ib_cm_init_qp_attr(struct ib_cm_id *cm_id, >+ struct ibv_qp_attr *qp_attr, >+ int *qp_attr_mask); >+ >+/** > * ib_cm_send_apr - Sends an alternate path response message in response to > * a load alternate path request. > * @cm_id: Connection identifier associated with the alternate path response. >@@ -490,7 +500,7 @@ int ib_cm_send_lap(uint32_t cm_id, > * alternate path response message. > * @private_data_len: Size of the private data buffer, in bytes. > */ >-int ib_cm_send_apr(uint32_t cm_id, >+int ib_cm_send_apr(struct ib_cm_id *cm_id, > enum ib_cm_apr_status status, > void *info, > uint8_t info_length, >@@ -514,7 +524,7 @@ struct ib_cm_sidr_req_param { > * service ID resolution request. > * @param: Service ID resolution request information. > */ >-int ib_cm_send_sidr_req(uint32_t cm_id, >+int ib_cm_send_sidr_req(struct ib_cm_id *cm_id, > struct ib_cm_sidr_req_param *param); > > struct ib_cm_sidr_rep_param { >@@ -534,7 +544,7 @@ struct ib_cm_sidr_rep_param { > * resolution request. > * @param: Service ID resolution reply information. 
> */ >-int ib_cm_send_sidr_rep(uint32_t cm_id, >+int ib_cm_send_sidr_rep(struct ib_cm_id *cm_id, > struct ib_cm_sidr_rep_param *param); > > #endif /* CM_H */ > > > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From glebn at voltaire.com Thu Aug 11 01:02:05 2005 From: glebn at voltaire.com (Gleb Natapov) Date: Thu, 11 Aug 2005 11:02:05 +0300 Subject: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support In-Reply-To: References: <20050719165542.GB16028@mellanox.co.il> <20050725171928.GC12206@mellanox.co.il> <20050726133553.GA22276@mellanox.co.il> <20050810083943.GM16361@minantech.com> <20050810132611.GP16361@minantech.com> Message-ID: <20050811080205.GR16361@minantech.com> On Wed, Aug 10, 2005 at 04:27:31PM +0100, Hugh Dickins wrote: > On Wed, 10 Aug 2005, Gleb Natapov wrote: > > On Wed, Aug 10, 2005 at 02:22:40PM +0100, Hugh Dickins wrote: > > > > > > Your stack example is a good one: if we end up setting VM_DONTCOPY on > > > the user stack, then I don't think fork's child will get very far without > > > hitting a SIGSEGV. > > > > I know, but I prefer child SIGSEGV than silent data corruption. > > Most people will share your preference, but neither is satisfactory. > What about the idea that was floating around about new VM flag that will instruct kernel to copy pages belonging to the vma on fork instead of mark them as cow? > > In most cases child will exec immediately after fork so no problem > > in this case. > > In most(?) cases it won't even be able to exec before the SIGSEGV. > If the top of the stack belongs to not copied page then yes. -- Gleb. 
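To make the trade-off in the exchange above concrete, here is a minimal user-space sketch of what "not copied on fork" means for the child. It does not use the PROT_DONTCOPY patch under discussion. As an assumption, it relies on madvise(MADV_DONTFORK), which later mainline kernels (2.6.16 and newer) provide and which marks the range VM_DONTCOPY. The function name child_fate_after_dontfork is made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one page, exclude it from fork with
 * MADV_DONTFORK, fork, and let the child touch the page.
 *
 * Returns the signal that killed the child, 0 if the child exited
 * normally, or -1 if the setup itself failed.
 */
int child_fate_after_dontfork(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    char *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    memset(buf, 0xab, (size_t)page);   /* make sure the page is populated */

    /* Assumption: the kernel supports MADV_DONTFORK (2.6.16 or newer). */
    if (madvise(buf, (size_t)page, MADV_DONTFORK) != 0) {
        munmap(buf, (size_t)page);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        /* Child: the mapping was not inherited, so this read faults. */
        char c = *(volatile char *)buf;
        (void)c;
        _exit(0);
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    munmap(buf, (size_t)page);          /* parent still owns the page */
    if (waited < 0)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On a kernel with MADV_DONTFORK the function is expected to return SIGSEGV: the child dies as soon as it touches the excluded range, which is the failure mode Hugh describes for a stack marked VM_DONTCOPY. A child that calls exec without touching the range is unaffected, which is the case Gleb has in mind.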
From dotanb at mellanox.co.il Thu Aug 11 05:08:06 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Thu, 11 Aug 2005 15:08:06 +0300 Subject: [openib-general] the tests user/libibverbs/examples/*_pingpong.c doesn't support t he long format of the parameter rx-depth Message-ID: <506C3D7B14CDD411A52C00025558DED60882C45A@mtlex01.yok.mtl.com> here is the code that handles the long parameters format in the test: static struct option long_options[] = { { .name = "port", .has_arg = 1, .val = 'p' }, { .name = "ib-dev", .has_arg = 1, .val = 'd' }, { .name = "ib-port", .has_arg = 1, .val = 'i' }, { .name = "size", .has_arg = 1, .val = 's' }, { .name = "iters", .has_arg = 1, .val = 'n' }, { .name = "events", .has_arg = 0, .val = 'e' }, { 0 } }; the following line should be added to all of the tests (ud_pingpong.c, uc_pingpong.c, rc_pingpong.c) { .name = "rx-depth", .has_arg = 1, .val = 'r' }, Dotan Barak Software Verification Engineer Mellanox Technologies LTD mailto:dotanb at mellanox.co.il Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-4-8289408 Cell: 052-4222383 [ May the fork be with you ] -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Thu Aug 11 05:16:15 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Aug 2005 08:16:15 -0400 Subject: [openib-general] [RFC] [uCM] proposed API changes In-Reply-To: References: Message-ID: <1123762574.4403.4794.camel@hal.voltaire.com> On Wed, 2005-08-10 at 20:14, Sean Hefty wrote: > I'd like to propose the following changes to the user CM API. These > would allow returning user specified context when reporting events to > the user. I also added a call to retrieve the necessary QP attributes > from the kernel CM that I would like to include as a part of the API/ABI > changes. Comments? This looks good to me. Just one minor comment/question: How would one go about obtaining the local and remote IDs for a connection ? 
That appears to be removed. Is that not needed ? -- Hal From panda at cse.ohio-state.edu Thu Aug 11 06:06:16 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Thu, 11 Aug 2005 09:06:16 -0400 (EDT) Subject: [openib-general] Continue to experience problems in installing Gen2 on IA-32 systems Message-ID: <200508111306.j7BD6GRi018085@xi.cse.ohio-state.edu> Hi Hal and Roland, We continue to experience problems in installing Gen2 on IA-32 systems. After the installation, the IB cards show the state as `disabled'. These systems have PCI-X IB cards. Some time back, my students had posted this issue to the group. You people had given some tips. However, none of them worked. Recently, we have spent a lot of time on this issue. We are able to successfully install Gen2 on EM64T and Opteron systems and carry out experiments. There is no problem. The problem is coming only for IA-32 systems. Even on EM64T systems, this problem comes when operating it in IA-32 mode. I am wondering whether any of you have personally installed Gen2 successfully on IA-32 systems with PCI-X IB cards and checked that it works. I am also cc'ing this note to the openib group. If anybody else has installed this successfully, we will appreciate the help. Thanks, DK From halr at voltaire.com Thu Aug 11 06:07:29 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Aug 2005 09:07:29 -0400 Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32 systems In-Reply-To: <200508111306.j7BD6GRi018085@xi.cse.ohio-state.edu> References: <200508111306.j7BD6GRi018085@xi.cse.ohio-state.edu> Message-ID: <1123765649.4403.4937.camel@hal.voltaire.com> Hi DK, On Thu, 2005-08-11 at 09:06, Dhabaleswar Panda wrote: > Hi Hal and Roland, > > We continue to experience problems in installing Gen2 on IA-32 > systems. Sorry to hear this. > After the installation, the IB cards show the state as > `disabled'. These systems have PCI-X IB cards. 
> > Some time back, my students had posted this issue to the group. You > people had given some tips. However, none of them worked. The postings were more in the realm of helping with debug rather than a solution. The last postings requested that mthca be built with CONFIG_INFINIBAND_MTHCA_DEBUG=y and that you send the kernel output you get when you load ib_mthca. That might help to figure out the problem or what the next step is. > Recently, we > have spent a lot of time on this issue. > > We are able to successfully install Gen2 on EM64T and Opteron systems > and carry out experiments. There is no problem. The problem is coming > only for IA-32 systems. Even on EM64T systems, this problem comes when > operating it in IA-32 mode. > > I am wondering whether any of you have personally installed Gen2 > successfully on IA-32 systems with PCI-X IB cards and checked that it > works. One of my machines is IA-32 with PCI-X IB so I do this all the time without an issue. -- Hal > I am also cc'ing this note to the openib group. If anybody else has > installed this successfully, we will appreciate the help. From jlentini at netapp.com Thu Aug 11 06:20:46 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 11 Aug 2005 09:20:46 -0400 (EDT) Subject: [openib-general] Continue to experience problems in installing Gen2 on IA-32 systems In-Reply-To: <200508111306.j7BD6GRi018085@xi.cse.ohio-state.edu> References: <200508111306.j7BD6GRi018085@xi.cse.ohio-state.edu> Message-ID: On Thu, 11 Aug 2005, Dhabaleswar Panda wrote: > I am wondering whether any of you have personally installed Gen2 > successfully on IA-32 systems with PCI-X IB cards and checked that it > works. I've been successfully using an IA-32 system (Dell 1550), Linux 2.6.12 from kernel.org, the current OpenIB subversion tree, a Mellanox Tavor based card (MT23108), and 3.3.2 firmware (which I think is a revision behind the current firmware). 
james From guyg at voltaire.com Thu Aug 11 06:28:08 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 11 Aug 2005 16:28:08 +0300 (IDT) Subject: [openib-general][PATCH][kdapl]: FMR implementation for kdapl Message-ID: James, This is a Tavor FMR implementation for kDAPL that works over gen2's fmr_pool.c implementation. A few notes: 1. I've added mod params to kdapl_ib to control the fmr size and pool length. There are still reasonable defaults, but I think we should allow consumers to control this. 2. The fmr pool allocation is done in pz_create (if active_fmr=1). The problem is that at the time of creating the pool we don't know what the consumer's privileges request is going to be (it is passed afterwards, in dapl_lmr_kcreate). I think this can be solved by creating 3 pools and taking from the right pool at lmr_kcreate time. Do you have any "cheaper" solution to this ? 3. Can we get rid of the DAT_MEM_TYPE_LMR code ? I don't understand what it's for. Signed-off-by: Guy German Index: dapl_openib_util.c =================================================================== --- dapl_openib_util.c (revision 3056) +++ dapl_openib_util.c (working copy) @@ -196,6 +196,53 @@ int dapl_ib_mr_register_physical(struct return 0; } +int dapl_ib_mr_register_fmr(struct dapl_ia *ia, struct dapl_lmr *lmr, + void *phys_addr, u64 length, + enum dat_mem_priv_flags privileges) +{ + /* FIXME: this phase-1 implementation of fmr doesn't take "privileges" + into account. This is a security breach. 
*/ + u64 io_addr; + u64 *page_list; + int page_count; + struct ib_pool_fmr *mem; + int status; + + page_list = (u64 *)phys_addr; + page_count = (int)length; + io_addr = page_list[0]; + + mem = ib_fmr_pool_map_phys (((struct dapl_pz *)lmr->param.pz)->fmr_pool, + page_list, + page_count, + &io_addr); + if (IS_ERR(mem)) { + status = (int)PTR_ERR(mem); + if (status != -EAGAIN) + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + "fmr_pool_map_phys ret=%d <%d pages>\n", + status, page_count); + + lmr->param.registered_address = 0; + lmr->fmr = 0; + return status; + } + + lmr->param.lmr_context = mem->fmr->lkey; + lmr->param.rmr_context = mem->fmr->rkey; + lmr->param.registered_size = length * PAGE_SIZE; + lmr->param.registered_address = io_addr; + lmr->fmr = mem; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + "%s: (addr=%p size=0x%x) lkey=0x%x rkey=0x%x\n", __func__, + lmr->param.registered_address, + lmr->param.registered_size, + lmr->param.lmr_context, + lmr->param.rmr_context); + return 0; +} + int dapl_ib_mr_register_shared(struct dapl_ia *ia, struct dapl_lmr *lmr, enum dat_mem_priv_flags privileges) { @@ -222,7 +269,10 @@ int dapl_ib_mr_deregister(struct dapl_lm { int status; - status = ib_dereg_mr(lmr->mr); + if (DAT_MEM_TYPE_PLATFORM == lmr->param.mem_type && lmr->fmr) + status = ib_fmr_pool_unmap(lmr->fmr); + else + status = ib_dereg_mr(lmr->mr); if (status < 0) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, " ib_dereg_mr error code return = %d\n", Index: dapl_openib_util.h =================================================================== --- dapl_openib_util.h (revision 3056) +++ dapl_openib_util.h (working copy) @@ -87,6 +87,10 @@ int dapl_ib_mr_register_physical(struct void *phys_addr, u64 length, enum dat_mem_priv_flags privileges); +int dapl_ib_mr_register_fmr(struct dapl_ia *ia, struct dapl_lmr *lmr, + void *phys_addr, u64 length, + enum dat_mem_priv_flags privileges); + int dapl_ib_mr_deregister(struct dapl_lmr *lmr); int dapl_ib_mr_register_shared(struct dapl_ia *ia, struct dapl_lmr *lmr, 
Index: dapl_lmr.c =================================================================== --- dapl_lmr.c (revision 3056) +++ dapl_lmr.c (working copy) @@ -137,7 +137,7 @@ static inline int dapl_lmr_create_physic u64 *registered_address) { struct dapl_lmr *new_lmr; - int status; + int status = 0; new_lmr = dapl_lmr_alloc(ia, mem_type, phys_addr, page_count, (struct dat_pz *) pz, privileges); @@ -151,13 +151,22 @@ static inline int dapl_lmr_create_physic status = dapl_ib_mr_register_ia(ia, new_lmr, phys_addr, page_count, privileges); } - else { + else if (DAT_MEM_TYPE_PHYSICAL == mem_type) { status = dapl_ib_mr_register_physical(ia, new_lmr, phys_addr.for_array, page_count, privileges); } + else if (DAT_MEM_TYPE_PLATFORM == mem_type) { + status = dapl_ib_mr_register_fmr(ia, new_lmr, + phys_addr.for_array, + page_count, privileges); + } + else { + status = -EINVAL; + goto error1; + } - if (0 != status) + if (status) goto error2; atomic_inc(&pz->pz_ref_count); @@ -243,7 +252,7 @@ int dapl_lmr_kcreate(struct dat_ia *ia, int status; dapl_dbg_log(DAPL_DBG_TYPE_API, - "dapl_lmr_kcreate(ia:%p, mem_type:%x, ...)\n", + "dapl_lmr_kcreate(ia:%p, mem_type:%x)\n", ia, mem_type); dapl_ia = (struct dapl_ia *)ia; @@ -258,6 +267,11 @@ int dapl_lmr_kcreate(struct dat_ia *ia, rmr_context, registered_length, registered_address); break; + case DAT_MEM_TYPE_PLATFORM: /* used as a proprietary Tavor-FMR */ + if (!active_fmr) { + status = -EINVAL; + break; + } case DAT_MEM_TYPE_PHYSICAL: case DAT_MEM_TYPE_IA: status = dapl_lmr_create_physical(dapl_ia, region_description, @@ -307,6 +321,7 @@ int dapl_lmr_free(struct dat_lmr *lmr) switch (dapl_lmr->param.mem_type) { case DAT_MEM_TYPE_PHYSICAL: + case DAT_MEM_TYPE_PLATFORM: case DAT_MEM_TYPE_IA: case DAT_MEM_TYPE_LMR: { Index: dapl_pz.c =================================================================== --- dapl_pz.c (revision 3056) +++ dapl_pz.c (working copy) @@ -89,7 +89,17 @@ int dapl_pz_create(struct dat_ia *ia, st status); goto error2; } - + 
+ if (active_fmr) { + struct ib_fmr_pool_param params; + set_fmr_params (&params); + dapl_pz->fmr_pool = ib_create_fmr_pool(dapl_pz->pd, &params); + if (IS_ERR(dapl_pz->fmr_pool)) + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + "could not create FMR pool <%ld>", + PTR_ERR(dapl_pz->fmr_pool)); + } + *pz = (struct dat_pz *)dapl_pz; return 0; @@ -104,7 +114,7 @@ error1: int dapl_pz_free(struct dat_pz *pz) { struct dapl_pz *dapl_pz; - int status; + int status=0; dapl_dbg_log(DAPL_DBG_TYPE_API, "dapl_pz_free(%p)\n", pz); @@ -114,8 +124,10 @@ int dapl_pz_free(struct dat_pz *pz) status = -EINVAL; goto error; } - - status = ib_dealloc_pd(dapl_pz->pd); + if (active_fmr) + (void)ib_destroy_fmr_pool(dapl_pz->fmr_pool); + else + status = ib_dealloc_pd(dapl_pz->pd); if (status) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, "ib_dealloc_pd failed: %X\n", status); Index: dapl_ia.c =================================================================== --- dapl_ia.c (revision 3056) +++ dapl_ia.c (working copy) @@ -745,7 +745,8 @@ int dapl_ia_query(struct dat_ia *ia_ptr, provider_attr->provider_version_major = DAPL_PROVIDER_MAJOR; provider_attr->provider_version_minor = DAPL_PROVIDER_MINOR; provider_attr->lmr_mem_types_supported = - DAT_MEM_TYPE_PHYSICAL | DAT_MEM_TYPE_IA; + DAT_MEM_TYPE_PHYSICAL | DAT_MEM_TYPE_IA | + DAT_MEM_TYPE_PLATFORM; provider_attr->iov_ownership_on_return = DAT_IOV_CONSUMER; provider_attr->dat_qos_supported = DAT_QOS_BEST_EFFORT; provider_attr->completion_flags_supported = Index: dapl_provider.c =================================================================== --- dapl_provider.c (revision 3056) +++ dapl_provider.c (working copy) @@ -48,8 +48,19 @@ MODULE_AUTHOR("James Lentini"); #ifdef CONFIG_KDAPL_INFINIBAND_DEBUG static DAPL_DBG_MASK g_dapl_dbg_mask = 0; +unsigned int active_fmr = 1; +static unsigned int pool_size = 2048; +static unsigned int max_pages_per_fmr = 64; + module_param_named(dbg_mask, g_dapl_dbg_mask, int, 0644); -MODULE_PARM_DESC(dbg_mask, "Bitmask to enable debug message 
types."); +module_param_named(active_fmr, active_fmr, int, 0644); +module_param_named(pool_size, pool_size, int, 0644); +module_param_named(max_pages_per_fmr, max_pages_per_fmr, int, 0644); +MODULE_PARM_DESC(dbg_mask, "Bitmask to enable debug message types. "); +MODULE_PARM_DESC(active_fmr, "if active_fmr==1, creates fmr pool in pz_create "); +MODULE_PARM_DESC(pool_size, "num of fmr handles in pool "); +MODULE_PARM_DESC(max_pages_per_fmr, "max size (in pages) of an fmr handle "); + #endif /* CONFIG_KDAPL_INFINIBAND_DEBUG */ static LIST_HEAD(g_dapl_provider_list); @@ -152,6 +163,18 @@ void dapl_dbg_log(enum dapl_dbg_type typ #endif /* KDAPL_INFINIBAND_DEBUG */ +void set_fmr_params (struct ib_fmr_pool_param *fmr_param_s) +{ + fmr_param_s->max_pages_per_fmr = max_pages_per_fmr; + fmr_param_s->pool_size = pool_size; + fmr_param_s->dirty_watermark = 32; + fmr_param_s->cache = 1; + fmr_param_s->flush_function = NULL; + fmr_param_s->access = (IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE | + IB_ACCESS_REMOTE_READ); +} + static struct dapl_provider *dapl_provider_alloc(const char *name, struct ib_device *device, u8 port) From guyg at voltaire.com Thu Aug 11 06:34:47 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 11 Aug 2005 16:34:47 +0300 (IDT) Subject: [openib-general][PATCH][kdapl]: dapl.h for FMR and evd trigger changes Message-ID: This changes are neede for the FMR and evd patch Signed-off-by: Guy German Index: dapl.h =================================================================== --- dapl.h (revision 3056) +++ dapl.h (working copy) @@ -40,9 +40,9 @@ #include #include - -#include "ib_verbs.h" -#include "ib_cm.h" +#include +#include +#include /********************************************************************* * * @@ -179,6 +179,7 @@ struct dapl_evd { struct dapl_ring_buffer pending_event_queue; enum dat_upcall_policy upcall_policy; struct dat_upcall_object upcall; + int is_triggered; }; struct dapl_ep { @@ -235,6 +236,7 @@ struct dapl_pz { struct 
list_head list; struct ib_pd *pd; atomic_t pz_ref_count; + struct ib_fmr_pool *fmr_pool; }; struct dapl_lmr { @@ -243,6 +245,7 @@ struct dapl_lmr { struct list_head list; struct dat_lmr_param param; struct ib_mr *mr; + struct ib_pool_fmr *fmr; atomic_t lmr_ref_count; }; @@ -634,4 +637,6 @@ extern void dapl_dbg_log(enum dapl_dbg_t #define dapl_dbg_log(...) #endif /* KDAPL_INFINIBAND_DEBUG */ +extern void set_fmr_params (struct ib_fmr_pool_param *fmr_param_s); +extern unsigned int active_fmr; #endif /* DAPL_H */ From dotanb at mellanox.co.il Thu Aug 11 06:54:01 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Thu, 11 Aug 2005 16:54:01 +0300 Subject: [openib-general] what do you think about the following two user level small tools? Message-ID: <506C3D7B14CDD411A52C00025558DED60882C49D@mtlex01.yok.mtl.com> Hi.
I ported the following two tools from gen1; they can be found in https://openib.org/svn/trunk/contrib/mellanox/tools/. Here is a short description of the tools: vstat: prints all the capabilities that query hca returns (this tool can be used in scripts). check_catastrophic: checks that the device is not in a fatal state (this tool will be useful once the fatal flow is implemented). If you think these tools are useful, I will move them to the trunk. Dotan Barak Software Verification Engineer Mellanox Technologies LTD mailto:dotanb at mellanox.co.il Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-4-8289408 Cell: 052-4222383 [ May the fork be with you ] From guyg at voltaire.com Thu Aug 11 06:57:46 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 11 Aug 2005 16:57:46 +0300 (IDT) Subject: [openib-general][PATCH][kdapl]: Message-ID: James, This patch allows the dapl consumer to control the evd upcall policy. Some consumers (e.g. ISER) receive one upcall, disable the upcall policy, and retrieve the rest of the events from a kernel_thread, via dat_evd_dequeue. Working this way improves performance by saving the context switching that is involved in many upcalls. If the consumer does not behave that way and leaves the upcall policy enabled at all times (e.g. kdapltest), the behavior stays the same and the consumer gets an upcall for each event. Signed-off-by: Guy German Index: dapl_evd.c =================================================================== --- dapl_evd.c (revision 3056) +++ dapl_evd.c (working copy) @@ -38,28 +38,39 @@ #include "dapl_ring_buffer_util.h" /* - * DAPL Internal routine to trigger the specified CNO. - * Called by the callback of some EVD associated with the CNO.
+ * DAPL Internal routine to trigger the callback of the EVD */ static void dapl_evd_upcall_trigger(struct dapl_evd *evd) { int status = 0; struct dat_event event; + unsigned long flags; - /* Only process events if there is an enabled callback function. */ - if ((evd->upcall.upcall_func == (DAT_UPCALL_FUNC) NULL) || - (evd->upcall_policy == DAT_UPCALL_DISABLE)) { + + /* The function is not re-entrant (change when implementing DAT_UPCALL_MANY)*/ + if (evd->is_triggered) return; - } - for (;;) { + spin_lock_irqsave (&evd->common.lock, flags); + if (evd->is_triggered) { + spin_unlock_irqrestore (&evd->common.lock, flags); + return; + } + evd->is_triggered = 1; + spin_unlock_irqrestore (&evd->common.lock, flags); + /* Only process events if there is an enabled callback function */ + while ((evd->upcall.upcall_func != (DAT_UPCALL_FUNC)NULL) && + (evd->upcall_policy != DAT_UPCALL_TEARDOWN) && + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { status = dapl_evd_dequeue((struct dat_evd *)evd, &event); - if (0 != status) - return; - + if (status) + break; evd->upcall.upcall_func(evd->upcall.instance_data, &event, FALSE); } + evd->is_triggered = 0; + + return; } static void dapl_evd_eh_print_wc(struct ib_wc *wc) @@ -820,24 +831,19 @@ static void dapl_evd_dto_callback(struct * This function does not dequeue from the CQ; only the consumer * can do that. Instead, it wakes up waiters if any exist. * It rearms the completion only if completions should always occur - * (specifically if a CNO is associated with the EVD and the - * EVD is enabled). */ - - if (state == DAPL_EVD_STATE_OPEN && - evd->upcall_policy != DAT_UPCALL_DISABLE) { - /* - * Re-enable callback, *then* trigger. - * This guarantees we won't miss any events. 
- */ - status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); - if (0 != status) - (void)dapl_evd_post_async_error_event( - evd->common.owner_ia->async_error_evd, - DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, - evd->common.owner_ia); - + + if (state == DAPL_EVD_STATE_OPEN) { dapl_evd_upcall_trigger(evd); + if ((evd->upcall_policy != DAT_UPCALL_TEARDOWN) && + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); + if (0 != status) + (void)dapl_evd_post_async_error_event( + evd->common.owner_ia->async_error_evd, + DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, + evd->common.owner_ia); + } } dapl_dbg_log(DAPL_DBG_TYPE_RTN, "%s() returns\n", __func__); } @@ -890,7 +896,7 @@ int dapl_evd_internal_create(struct dapl /* reset the qlen in the attributes, it may have changed */ evd->qlen = evd->cq->cqe; - + evd->is_triggered = 0; status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); if (status != 0) @@ -1035,15 +1041,41 @@ int dapl_evd_modify_upcall(struct dat_ev const struct dat_upcall_object *upcall) { struct dapl_evd *evd; - - dapl_dbg_log(DAPL_DBG_TYPE_API, "dapl_modify_upcall(%p)\n", evd_handle); + int status = 0; + int pending_events; + unsigned long flags; evd = (struct dapl_evd *)evd_handle; + dapl_dbg_log (DAPL_DBG_TYPE_API, "%s: (evd=%p) set to %d\n", + __func__, evd_handle, upcall_policy); + spin_lock_irqsave(&evd->common.lock, flags); + if ((upcall_policy != DAT_UPCALL_TEARDOWN) && + (upcall_policy != DAT_UPCALL_DISABLE)) { + pending_events = dapl_rbuf_count(&evd->pending_event_queue); + if (pending_events) { + dapl_dbg_log (DAPL_DBG_TYPE_WARN, + "%s: (evd %p) there are still %d pending " + "events in the queue - policy stays disabled\n", + __func__, evd_handle, pending_events); + status = -EBUSY; + goto bail; + } + if (evd->evd_flags & DAT_EVD_DTO_FLAG) { + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); + if (status) { + printk(KERN_ERR "%s: dapls_ib_completion_notify" + " failed (status=0x%x) \n",__func__, + status); + 
goto bail; + } + } + } evd->upcall_policy = upcall_policy; evd->upcall = *upcall; - - return 0; +bail: + spin_unlock_irqrestore(&evd->common.lock, flags); + return status; } int dapl_evd_post_se(struct dat_evd *evd_handle, const struct dat_event *event) @@ -1076,7 +1108,7 @@ int dapl_evd_post_se(struct dat_evd *evd event->event_data. software_event_data.pointer); - bail: +bail: return status; } @@ -1124,7 +1156,7 @@ int dapl_evd_dequeue(struct dat_evd *evd } spin_unlock_irqrestore(&evd->common.lock, evd->common.flags); - bail: +bail: dapl_dbg_log(DAPL_DBG_TYPE_RTN, "dapl_evd_dequeue () returns 0x%x\n", status); From hugh at veritas.com Thu Aug 11 07:04:29 2005 From: hugh at veritas.com (Hugh Dickins) Date: Thu, 11 Aug 2005 15:04:29 +0100 (BST) Subject: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support In-Reply-To: <20050811080205.GR16361@minantech.com> References: <20050719165542.GB16028@mellanox.co.il> <20050725171928.GC12206@mellanox.co.il> <20050726133553.GA22276@mellanox.co.il> <20050810083943.GM16361@minantech.com> <20050810132611.GP16361@minantech.com> <20050811080205.GR16361@minantech.com> Message-ID: On Thu, 11 Aug 2005, Gleb Natapov wrote: > What about the idea that was floating around about new VM flag that will > instruct kernel to copy pages belonging to the vma on fork instead of mark > them as cow? It's a pretty good idea, and thanks for reminding us of it. It suffers from the general difficulty with fixes within get_user_pages, that we need down_write(&mm->mmap_sem) to split_vma, and even just to update vm_flags, whereas get_user_pages is entered with down_read. Really, we'd prefer not to mess with the vma itself in get_user_pages. Could mark the ptes instead, perhaps, but that gets very architecture- dependent. A separate array? not so nice if the vma is very large and the get_user_pages area very small. 
I had toyed with leaving the ptes in the parent as writable, made readonly just in the child; but though that violation could be excused while get_user_pages is active, it would have to be corrected at the end, and that gets complicated again. Hugh From guyg at voltaire.com Thu Aug 11 07:04:26 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 11 Aug 2005 17:04:26 +0300 (IDT) Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation Message-ID: Sorry for resending it - the former mail did not have a subject This patch allows the dapl consumer to control the evd upcall policy. Some consumers (e.g. ISER) receives one upcall, disable the upcall policy, and retrieve the rest of the events from a kernel_thread, via dat_evd_dequeue. This fashion of work improves performance by saving the context switching that is involved in many upcalls. If the consumer does not behave that way and leaves the upcall policy enabled at all times (e.g. kdapltest), the behavior will stay the same and the consumer will get an upcall for each event. Signed-off-by: Guy German Index: dapl_evd.c =================================================================== --- dapl_evd.c (revision 3056) +++ dapl_evd.c (working copy) @@ -38,28 +38,39 @@ #include "dapl_ring_buffer_util.h" /* - * DAPL Internal routine to trigger the specified CNO. - * Called by the callback of some EVD associated with the CNO. + * DAPL Internal routine to trigger the callback of the EVD */ static void dapl_evd_upcall_trigger(struct dapl_evd *evd) { int status = 0; struct dat_event event; + unsigned long flags; - /* Only process events if there is an enabled callback function. 
*/ - if ((evd->upcall.upcall_func == (DAT_UPCALL_FUNC) NULL) || - (evd->upcall_policy == DAT_UPCALL_DISABLE)) { + + /* The function is not re-entrant (change when implementing DAT_UPCALL_MANY)*/ + if (evd->is_triggered) return; - } - for (;;) { + spin_lock_irqsave (&evd->common.lock, flags); + if (evd->is_triggered) { + spin_unlock_irqrestore (&evd->common.lock, flags); + return; + } + evd->is_triggered = 1; + spin_unlock_irqrestore (&evd->common.lock, flags); + /* Only process events if there is an enabled callback function */ + while ((evd->upcall.upcall_func != (DAT_UPCALL_FUNC)NULL) && + (evd->upcall_policy != DAT_UPCALL_TEARDOWN) && + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { status = dapl_evd_dequeue((struct dat_evd *)evd, &event); - if (0 != status) - return; - + if (status) + break; evd->upcall.upcall_func(evd->upcall.instance_data, &event, FALSE); } + evd->is_triggered = 0; + + return; } static void dapl_evd_eh_print_wc(struct ib_wc *wc) @@ -820,24 +831,19 @@ static void dapl_evd_dto_callback(struct * This function does not dequeue from the CQ; only the consumer * can do that. Instead, it wakes up waiters if any exist. * It rearms the completion only if completions should always occur - * (specifically if a CNO is associated with the EVD and the - * EVD is enabled). */ - - if (state == DAPL_EVD_STATE_OPEN && - evd->upcall_policy != DAT_UPCALL_DISABLE) { - /* - * Re-enable callback, *then* trigger. - * This guarantees we won't miss any events. 
- */ - status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); - if (0 != status) - (void)dapl_evd_post_async_error_event( - evd->common.owner_ia->async_error_evd, - DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, - evd->common.owner_ia); - + + if (state == DAPL_EVD_STATE_OPEN) { dapl_evd_upcall_trigger(evd); + if ((evd->upcall_policy != DAT_UPCALL_TEARDOWN) && + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); + if (0 != status) + (void)dapl_evd_post_async_error_event( + evd->common.owner_ia->async_error_evd, + DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, + evd->common.owner_ia); + } } dapl_dbg_log(DAPL_DBG_TYPE_RTN, "%s() returns\n", __func__); } @@ -890,7 +896,7 @@ int dapl_evd_internal_create(struct dapl /* reset the qlen in the attributes, it may have changed */ evd->qlen = evd->cq->cqe; - + evd->is_triggered = 0; status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); if (status != 0) @@ -1035,15 +1041,41 @@ int dapl_evd_modify_upcall(struct dat_ev const struct dat_upcall_object *upcall) { struct dapl_evd *evd; - - dapl_dbg_log(DAPL_DBG_TYPE_API, "dapl_modify_upcall(%p)\n", evd_handle); + int status = 0; + int pending_events; + unsigned long flags; evd = (struct dapl_evd *)evd_handle; + dapl_dbg_log (DAPL_DBG_TYPE_API, "%s: (evd=%p) set to %d\n", + __func__, evd_handle, upcall_policy); + spin_lock_irqsave(&evd->common.lock, flags); + if ((upcall_policy != DAT_UPCALL_TEARDOWN) && + (upcall_policy != DAT_UPCALL_DISABLE)) { + pending_events = dapl_rbuf_count(&evd->pending_event_queue); + if (pending_events) { + dapl_dbg_log (DAPL_DBG_TYPE_WARN, + "%s: (evd %p) there are still %d pending " + "events in the queue - policy stays disabled\n", + __func__, evd_handle, pending_events); + status = -EBUSY; + goto bail; + } + if (evd->evd_flags & DAT_EVD_DTO_FLAG) { + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); + if (status) { + printk(KERN_ERR "%s: dapls_ib_completion_notify" + " failed (status=0x%x) \n",__func__, + status); + 
goto bail; + } + } + } evd->upcall_policy = upcall_policy; evd->upcall = *upcall; - - return 0; +bail: + spin_unlock_irqrestore(&evd->common.lock, flags); + return status; } int dapl_evd_post_se(struct dat_evd *evd_handle, const struct dat_event *event) @@ -1076,7 +1108,7 @@ int dapl_evd_post_se(struct dat_evd *evd event->event_data. software_event_data.pointer); - bail: +bail: return status; } @@ -1124,7 +1156,7 @@ int dapl_evd_dequeue(struct dat_evd *evd } spin_unlock_irqrestore(&evd->common.lock, evd->common.flags); - bail: +bail: dapl_dbg_log(DAPL_DBG_TYPE_RTN, "dapl_evd_dequeue () returns 0x%x\n", status);
From glebn at voltaire.com Thu Aug 11 07:07:30 2005 From: glebn at voltaire.com (Gleb Natapov) Date: Thu, 11 Aug 2005 17:07:30 +0300 Subject: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support In-Reply-To: References: <20050725171928.GC12206@mellanox.co.il> <20050726133553.GA22276@mellanox.co.il> <20050810083943.GM16361@minantech.com> <20050810132611.GP16361@minantech.com> <20050811080205.GR16361@minantech.com> Message-ID: <20050811140729.GU16361@minantech.com> On Thu, Aug 11, 2005 at 03:04:29PM +0100, Hugh Dickins wrote: > On Thu, 11 Aug 2005, Gleb Natapov wrote: > > What about the idea that was floating around about new VM flag that will > > instruct kernel to copy pages belonging to the vma on fork instead of mark > > them as cow? > > It's a pretty good idea, and thanks for reminding us of it. > > It suffers from the general difficulty with fixes within get_user_pages, > that we need down_write(&mm->mmap_sem) to split_vma, and even just to > update vm_flags, whereas get_user_pages is entered with down_read. > Why do it form get_user_pages? Lets use madvise/mprotect interface. Program can mrpotect(VM_COPYONFORK) address range before registering it. -- Gleb. From mst at mellanox.co.il Thu Aug 11 07:11:43 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 11 Aug 2005 17:11:43 +0300 Subject: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support In-Reply-To: References: <20050725171928.GC12206@mellanox.co.il> <20050726133553.GA22276@mellanox.co.il> <20050810083943.GM16361@minantech.com> <20050810132611.GP16361@minantech.com> <20050811080205.GR16361@minantech.com> Message-ID: <20050811141143.GB19686@mellanox.co.il> Quoting r. Hugh Dickins : > Subject: Re: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support > > On Thu, 11 Aug 2005, Gleb Natapov wrote: > > What about the idea that was floating around about new VM flag that will > > instruct kernel to copy pages belonging to the vma on fork instead of mark > > them as cow? > > It's a pretty good idea, and thanks for reminding us of it. > > It suffers from the general difficulty with fixes within get_user_pages, > that we need down_write(&mm->mmap_sem) to split_vma, and even just to > update vm_flags, whereas get_user_pages is entered with down_read. No, the idea is to let the application (or a library that it loads) change this flag by means of some system call. Something like MADV_COPYONFORK, in addition to MADV_DONTCOPY.
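To illustrate the madvise-style interface suggested above, here is a minimal user-space sketch. It makes several assumptions: it runs on Linux with headers that define MADV_DONTFORK, the flag later mainline kernels offer for the "do not copy this range on fork" behaviour (that flag is not part of this thread). MADV_COPYONFORK is only a proposal here and is not used. The helper name mark_region_dontfork is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel not to copy [addr, addr + len) into
 * children created by fork(). The range is widened to whole pages because
 * madvise() requires a page-aligned start address.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int mark_region_dontfork(void *addr, size_t len)
{
#ifdef MADV_DONTFORK
	long page = sysconf(_SC_PAGESIZE);
	uintptr_t mask, start, end;

	if (page <= 0 || len == 0) {
		errno = EINVAL;
		return -1;
	}
	/* The rounding below relies on the page size being a power of two. */
	assert((page & (page - 1)) == 0);

	mask = (uintptr_t)page - 1;
	start = (uintptr_t)addr & ~mask;               /* round start down */
	end = ((uintptr_t)addr + len + mask) & ~mask;  /* round end up */

	return madvise((void *)start, end - start, MADV_DONTFORK);
#else
	/* Assumption: without the flag there is nothing sensible to do. */
	(void)addr;
	(void)len;
	errno = ENOSYS;
	return -1;
#endif
}
```

Under these assumptions, an application or a library it loads would call the helper on a buffer before registering it, so that a later fork() does not mark the parent's registered pages copy-on-write.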
-- MST From hugh at veritas.com Thu Aug 11 07:17:01 2005 From: hugh at veritas.com (Hugh Dickins) Date: Thu, 11 Aug 2005 15:17:01 +0100 (BST) Subject: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support In-Reply-To: <20050811140729.GU16361@minantech.com> References: <20050725171928.GC12206@mellanox.co.il> <20050726133553.GA22276@mellanox.co.il> <20050810083943.GM16361@minantech.com> <20050810132611.GP16361@minantech.com> <20050811080205.GR16361@minantech.com> <20050811140729.GU16361@minantech.com> Message-ID: On Thu, 11 Aug 2005, Gleb Natapov wrote: > On Thu, Aug 11, 2005 at 03:04:29PM +0100, Hugh Dickins wrote: > > On Thu, 11 Aug 2005, Gleb Natapov wrote: > > > What about the idea that was floating around about new VM flag that will > > > instruct kernel to copy pages belonging to the vma on fork instead of mark > > > them as cow? > > > > It's a pretty good idea, and thanks for reminding us of it. > > > > It suffers from the general difficulty with fixes within get_user_pages, > > that we need down_write(&mm->mmap_sem) to split_vma, and even just to > > update vm_flags, whereas get_user_pages is entered with down_read. > > > Why do it form get_user_pages? Lets use madvise/mprotect interface. > Program can mrpotect(VM_COPYONFORK) address range before registering it. Perhaps. But then it's more complicated than the VM_DONTCOPY we came from. It's a good solution to the semantic divergence introduced by VM_DONTCOPY, but most people seemed unworried by that aspect. My trouble is that I'm waiting for a magic right solution to appear, and none has struck me that way so far.
Hugh From pw at osc.edu Thu Aug 11 07:19:58 2005 From: pw at osc.edu (Pete Wyckoff) Date: Thu, 11 Aug 2005 10:19:58 -0400 Subject: [openib-general] RE: [PATCH] osm: add a main auto tools project for osm In-Reply-To: <1123614021.4403.301.camel@hal.voltaire.com> References: <506C3D7B14CDD411A52C00025558DED607C3062C@mtlex01.yok.mtl.com> <1123614021.4403.301.camel@hal.voltaire.com> Message-ID: <20050811141958.GB4353@osc.edu> halr at voltaire.com wrote on Tue, 09 Aug 2005 15:00 -0400: > I thought the proposal was to ditch the prefix totally and just use > /usr/local/bin for binaries, usr/local/lib for libraries, and > /usr/local/include/infiniband for the includes (I am assuming that the > subdirectories will be maintained under that for opensm, vendor, iba, > and complib). Sorry I'm stepping into this a bit late. I just slogged my way through an openib build and install using the helpful Cheat Sheet in the wiki. I don't want everything in /usr/local/{bin,lib,include}. I'd prefer to have it segregated into /usr/local/openib/{bin,lib,include}. That makes it a bit easier to manage multiple versions if perhaps I'd like to leave a basic /usr/local/openib-working directory that users see plus also have an /usr/local/openib-testing for experimentation. We use a package called "modules" to help manage user paths for various software components. As long as you respect the $prefix setting in Makefiles, this is mostly doable. One improvement: instead of generating libdir yourself in Makefile.am, libdir = ${exec_prefix}/ib/lib you can directly use ${libdir} as set by configure, thus permitting the installer to decide the paths for himself. -- Pete P.S. Every little one-file source project does not need its own configure, automake and full directory tree in my opinion, but that's perhaps an argument for a different thread. 
From jlentini at netapp.com Thu Aug 11 07:33:44 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 11 Aug 2005 10:33:44 -0400 (EDT) Subject: [openib-general][PATCH][kdapltest]: DAT_MEM_TYPE_IA support and ref count Message-ID: > James, > Here are the kdapltest patches, resent. > changes conclude: > 1. allow DAT_MEM_TYPE_IA support on server side > 2. kdapltest module ref count fix from 2.4 API to 2.6 API > > Signed-off-by: Guy German Committed in revision 3058. From halr at voltaire.com Thu Aug 11 07:28:30 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Aug 2005 10:28:30 -0400 Subject: [openib-general] RE: [PATCH] osm: add a main auto tools project for osm In-Reply-To: <20050811141958.GB4353@osc.edu> References: <506C3D7B14CDD411A52C00025558DED607C3062C@mtlex01.yok.mtl.com> <1123614021.4403.301.camel@hal.voltaire.com> <20050811141958.GB4353@osc.edu> Message-ID: <1123770509.4403.5143.camel@hal.voltaire.com> Hi Pete, On Thu, 2005-08-11 at 10:19, Pete Wyckoff wrote: > halr at voltaire.com wrote on Tue, 09 Aug 2005 15:00 -0400: > > I thought the proposal was to ditch the prefix totally and just use > > /usr/local/bin for binaries, usr/local/lib for libraries, and > > /usr/local/include/infiniband for the includes (I am assuming that the > > subdirectories will be maintained under that for opensm, vendor, iba, > > and complib). > > Sorry I'm stepping into this a bit late. I just slogged my way > through an openib build and install using the helpful Cheat Sheet > in the wiki. > > I don't want everything in /usr/local/{bin,lib,include}. I'd prefer > to have it segregated into /usr/local/openib/{bin,lib,include}. > That makes it a bit easier to manage multiple versions if perhaps > I'd like to leave a basic /usr/local/openib-working directory that > users see plus also have an /usr/local/openib-testing for > experimentation. We use a package called "modules" to help manage > user paths for various software components. 
> > As long as you respect the $prefix setting in Makefiles, this is > mostly doable. One improvement: instead of generating libdir > yourself in Makefile.am, > > libdir = ${exec_prefix}/ib/lib > > you can directly use ${libdir} as set by configure, thus permitting > the installer to decide the paths for himself. Is /usr/local/[bin, lib, include] acceptable as a default ? I believe the new version (Eitan is working on a patch) will do that and allow it to be configured otherwise properly (unlike the current build). Is that acceptable ? > -- Pete > > > P.S. Every little one-file source project does not need its own > configure, automake and full directory tree in my opinion, but > that's perhaps an argument for a different thread. Yes, at least all the diags could be rolled into one. Is that sufficient ? Would you have any time to work on a patch for this aspect ? If not, I'll add it to my TODO list. Thanks. -- Hal From hch at lst.de Thu Aug 11 08:03:26 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 11 Aug 2005 17:03:26 +0200 Subject: [openib-general] [PATCH] amso1100: use standard byteorder macros Message-ID: <20050811150326.GA7803@lst.de> Signed-off-by: Christoph Hellwig Index: cc_byteorder.h =================================================================== --- cc_byteorder.h (revision 3058) +++ cc_byteorder.h (working copy) @@ -1,113 +1,30 @@ #ifndef _CC_BYTEORDER_H_ #define _CC_BYTEORDER_H_ +#include #include "cc_types.h" -static inline const u64 cc_arch_swap64(u64 x) -{ - union { - struct { u32 a,b; } s; - u64 u; - } v; - - v.u = x; - - asm("bswap %0\n\t" - "bswap %1\n\t" - "xchgl %0,%1\n" - : "=r" (v.s.a), "=r" (v.s.b) - : "0" (v.s.a), "1" (v.s.b)); - - return v.u; -} - -static inline const u32 cc_arch_swap32(u32 x) -{ - asm("bswap %0" : "=r" (x) : "0" (x)); - return x; -} - -#define cc_swap16(x) \ -({ \ - u16 __x = (x); \ - ((u16)( \ - (((u16)(__x) & (u16)0x00ffU) << 8) | \ - (((u16)(__x) & (u16)0xff00U) >> 8) )); \ -}) - -#define cc_swap32(x) \ 
-({ \ - u32 __x = (x); \ - ((u32)( \ - (((u32)(__x) & (u32)0x000000ffUL) << 24) | \ - (((u32)(__x) & (u32)0x0000ff00UL) << 8) | \ - (((u32)(__x) & (u32)0x00ff0000UL) >> 8) | \ - (((u32)(__x) & (u32)0xff000000UL) >> 24) )); \ -}) - -#define cc_swap64(x) \ -({ \ - u64 __x = (x); \ - ((u64)( \ - (u64)(((u64)(__x) & (u64)0x00000000000000ffULL) << 56) | \ - (u64)(((u64)(__x) & (u64)0x000000000000ff00ULL) << 40) | \ - (u64)(((u64)(__x) & (u64)0x0000000000ff0000ULL) << 24) | \ - (u64)(((u64)(__x) & (u64)0x00000000ff000000ULL) << 8) | \ - (u64)(((u64)(__x) & (u64)0x000000ff00000000ULL) >> 8) | \ - (u64)(((u64)(__x) & (u64)0x0000ff0000000000ULL) >> 24) | \ - (u64)(((u64)(__x) & (u64)0x00ff000000000000ULL) >> 40) | \ - (u64)(((u64)(__x) & (u64)0xff00000000000000ULL) >> 56) )); \ -}) - -/* This section defines what it means to swap a word into the byte - order of the current CPU. For example, x86-32 and x86-64 are - little-endian platforms, so swapping a big-endian number to the - cpu means the bytes need to be rearranged. However, swapping a - little-endian number to the cpu means that nothing should be done. -*/ - -#define X86_32 -#if defined(X86_32) || defined (X86_64) - -#define cc_be64_to_cpu(x) (__builtin_constant_p((u64)(x)) ? cc_swap64(x) : cc_arch_swap64(x)) -#define cc_be32_to_cpu(x) (__builtin_constant_p((u32)(x)) ? cc_swap32(x) : cc_arch_swap32(x)) -#define cc_be16_to_cpu(x) cc_swap16(x) -#define cc_cpu_to_be64(x) cc_be64_to_cpu(x) -#define cc_cpu_to_be32(x) cc_be32_to_cpu(x) -#define cc_cpu_to_be16(x) cc_be16_to_cpu(x) - -#define cc_le64_to_cpu(x) ((u64)(x)) -#define cc_le32_to_cpu(x) ((u32)(x)) -#define cc_le16_to_cpu(x) ((u16)(x)) -#define cc_cpu_to_le64(x) ((u64)(x)) -#define cc_cpu_to_le32(x) ((u32)(x)) -#define cc_cpu_to_le16(x) ((u16)(x)) - -#else -#error Byte swapping functions not defined for this platform -#endif - /* Here we define the adapter-to-cpu and cpu-to-adapter byte order functions based on whether the adapter is big-endian or little-endian. 
*/ #if defined(WR_BYTE_ORDER_BIG_ENDIAN) -#define wr64_to_cpu cc_be64_to_cpu -#define wr32_to_cpu cc_be32_to_cpu -#define wr16_to_cpu cc_be16_to_cpu -#define cpu_to_wr64 cc_cpu_to_be64 -#define cpu_to_wr32 cc_cpu_to_be32 -#define cpu_to_wr16 cc_cpu_to_be16 +#define wr64_to_cpu be64_to_cpu +#define wr32_to_cpu be32_to_cpu +#define wr16_to_cpu be16_to_cpu +#define cpu_to_wr64 cpu_to_be64 +#define cpu_to_wr32 cpu_to_be32 +#define cpu_to_wr16 cpu_to_be16 #elif defined (WR_BYTE_ORDER_LITTLE_ENDIAN) -#define wr64_to_cpu cc_le64_to_cpu -#define wr32_to_cpu cc_le32_to_cpu -#define wr16_to_cpu cc_le16_to_cpu -#define cpu_to_wr64 cc_cpu_to_le64 -#define cpu_to_wr32 cc_cpu_to_le32 -#define cpu_to_wr16 cc_cpu_to_le16 +#define wr64_to_cpu le64_to_cpu +#define wr32_to_cpu le32_to_cpu +#define wr16_to_cpu le16_to_cpu +#define cpu_to_wr64 cpu_to_le64 +#define cpu_to_wr32 cpu_to_le32 +#define cpu_to_wr16 cpu_to_le16 #else #error Work request (WR) byte order is not defined. From hch at lst.de Thu Aug 11 08:04:15 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 11 Aug 2005 17:04:15 +0200 Subject: [openib-general] [PATCH] amso1100: kill memcpy4 and memcpy8 Message-ID: <20050811150415.GB7803@lst.de> memcpy8 isn't used at all, and for memcpy4 we should just rely on the compiler to do the right optimizations for us. Signed-off-by: Christoph Hellwig Index: cc_qp_common.c =================================================================== --- cc_qp_common.c (revision 3058) +++ cc_qp_common.c (working copy) @@ -135,139 +135,6 @@ /* - * Function: cc_memcpy8 - * - * Description: - * Just like memcpy, but does 16 and 8 bytes at a time. 
- * - * IN: - * dest - ptr destination - * src - ptr source - * len - The len, in bytes - * - * OUT: none - * - * Return: none - */ -void -cc_memcpy8( u64 *dest, u64 *src, s32 len) -{ -#ifdef CCDEBUG - assert((len & 0x03) == 0); - assert(((s32)dest & 0x03) == 0); - assert(((s32)src & 0x03) == 0); -#endif - -#if (defined(X86_32) || defined(X86_64)) - -#define MINSIZE 16 - /* unaligned data copy, 16 bytes at a time */ - while(len >= MINSIZE) { - /* printf("%p --> %p 16B unaligned copy,len=%d \n", src, dest,len); */ - asm volatile("movdqu 0(%1), %%xmm0\n" \ - "movdqu %%xmm0, 0(%0)\n" \ - :: "r"(dest), "r"(src) : "memory"); - src += 2; - dest += 2; - len -= 16; - } - - /* At this point, we'll have fewer than 16 bytes left. - * But, we only allow 8 byte copies. So, we do 8 byte copies now. - * If our len happens to be 4 or 12, we will copy 8 or 16 bytes, - * respectively. This is not a problem, since - * all msg_sizes in all WR queues are padded up to 8 bytes - * (see fw/clustercore/cc_qp.c, the function ccwr_qp_create()). - */ - while(len >= 0) { - /* printf("%p --> %p 8B copy,len=%d \n", src, dest,len); */ - asm volatile("movq 0(%1), %%xmm0\n" \ - "movq %%xmm0, 0(%0)\n" \ - :: "r"(dest), "r"(src) : "memory"); - src += 1; - dest += 1; - len -= 8; - } - -#elif - #error "You need to define your platform, or add optimized" - #error "cc_memcpy8 support for your platform." - -#endif /*(defined(X86_64) || defined(X86_32)) */ - -} - -/* - * Function: memcpy4 - * - * Description: - * Just like memcpy, but assumes all args are 4 byte aligned already. - * - * IN: - * dest - ptr destination - * src - ptr source - * len - The len, in bytes - * - * OUT: none - * - * Return: none - */ -static __inline__ void -memcpy4(u64 *dest, u64 *src, u32 len) -{ -#ifdef __KERNEL__ - unsigned long flags; -#endif /* #ifdef __KERNEL__ */ - - u64 xmm_regs[16]; /* Reserve space for 8, though only use 1 now. 
*/ - -#ifdef CCDEBUG - ASSERT((len & 0x03) == 0); - ASSERT(((long)dest & 0x03) == 0); - ASSERT(((long)src & 0x03) == 0); -#endif - - /* We must save and restor xmm0. - * Failure to do so messes up the application code. - */ - asm volatile("movdqu %%xmm0, 0(%0)\n" :: "r"(xmm_regs) : "memory"); - -#ifdef __KERNEL__ - /* Further, in the kernel version, we must disable local interupts. - * This is because ISRs do not save & restore xmm0. So, if - * we are interrupted between the first movdqu and the second, - * then xmm0 may be modified, and we will write garbage to the adapter. - */ - local_irq_save(flags); -#endif /* #ifdef __KERNEL__ */ - -#define MINSIZE 16 - /* unaligned data copy */ - while(len >= MINSIZE) { - asm volatile("movdqu 0(%1), %%xmm0\n" \ - "movdqu %%xmm0, 0(%0)\n" \ - :: "r"(dest), "r"(src) : "memory"); - src += 2; - dest += 2; - len -= 16; - } - -#ifdef __KERNEL__ - /* Restore interrupts and registers */ - local_irq_restore(flags); - asm volatile("movdqu 0(%0), %%xmm0\n" :: "r"(xmm_regs) : "memory"); -#endif /* #ifdef __KERNEL__ */ - - while (len >= 4) { - *((u32 *)dest) = *((u32 *)src); - dest = (u64*)((unsigned long)dest + 4); - src = (u64*)((unsigned long)src + 4); - len -= 4; - } -} - - -/* * Function: qp_wr_post * * Description: @@ -308,7 +175,7 @@ /* * Copy the wr down to the adapter */ - memcpy4((void *)msg, (void *)wr, size); + memcpy(msg, wr, size); cc_mq_produce(q); return CC_OK; Index: cc_mq_common.c =================================================================== --- cc_mq_common.c (revision 3058) +++ cc_mq_common.c (working copy) @@ -17,8 +17,6 @@ #include "cc_mq_common.h" #include "cc_common.h" -extern void cc_memcpy8(u64 *, u64 *, s32); - #define BUMP(q,p) (p) = ((p)+1) % (q)->q_size #define BUMP_SHARED(q,p) (p) = cpu_to_wr16((wr16_to_cpu(p)+1) % (q)->q_size) Index: TODO =================================================================== --- TODO (revision 3058) +++ TODO (working copy) @@ -43,15 +43,6 @@ [-] cc_mq_common.c: BUMP is 
pretty inefficient, does a divide every time -[-] cc_qp_common.c: cc_memcpy8 corrupts FPU state, is it really needed? - it's never called. Why is it declared in cc_mq_common.c? - memcpy4 similarly corrupts state. If it's fixed to save CR0 and do - clts, is it really faster than a normal memcpy (considering it also - disables IRQs)? - - This is all utterly non-portably anyway -- there needs to be a - standard fallback for PPC64, IA64 etc. - [-] Why is cc_queue.h needed? What is <linux/list.h> missing? [-] cc_types.h: get rid of NULL, TRUE, FALSE defines, cc_bool_t, etc. From guyg at voltaire.com Thu Aug 11 08:29:35 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 11 Aug 2005 18:29:35 +0300 (IDT) Subject: [openib-general][PATCH][kdapltest]: "free" missing Message-ID: Signed-off-by: Guy German Index: dapl_server.c =================================================================== --- dapl_server.c (revision 3058) +++ dapl_server.c (working copy) @@ -248,6 +248,9 @@ DT_cs_Server (Params_t * params_ptr) 256, /* FIXME query for this value */ FALSE, FALSE); + if (pt_ptr) + DT_Free_Per_Test_Data (pt_ptr); + if (ps_ptr->bpool == 0) { DT_Tdep_PT_Printf (phead, From pw at osc.edu Thu Aug 11 08:31:32 2005 From: pw at osc.edu (Pete Wyckoff) Date: Thu, 11 Aug 2005 11:31:32 -0400 Subject: [openib-general] RE: [PATCH] osm: add a main auto tools project for osm In-Reply-To: <1123770509.4403.5143.camel@hal.voltaire.com> References: <506C3D7B14CDD411A52C00025558DED607C3062C@mtlex01.yok.mtl.com> <1123614021.4403.301.camel@hal.voltaire.com> <20050811141958.GB4353@osc.edu> <1123770509.4403.5143.camel@hal.voltaire.com> Message-ID: <20050811153132.GA5349@osc.edu> halr at voltaire.com wrote on Thu, 11 Aug 2005 10:28 -0400: > On Thu, 2005-08-11 at 10:19, Pete Wyckoff wrote: > > halr at voltaire.com wrote on Tue, 09 Aug 2005 15:00 -0400: > > > I thought the proposal was to ditch the prefix totally and just use > > > /usr/local/bin for binaries, usr/local/lib for libraries, and > > > 
/usr/local/include/infiniband for the includes (I am assuming that the > > > subdirectories will be maintained under that for opensm, vendor, iba, > > > and complib). [..] > Is /usr/local/[bin, lib, include] acceptable as a default ? > > I believe the new version (Eitan is working on a patch) will do that and > allow it to be configured otherwise properly (unlike the current build). > Is that acceptable ? Absolutely. I was afraid by your comment that you wanted to ignore $prefix. My bad for missing the original proposal perhaps. > > P.S. Every little one-file source project does not need its own > > configure, automake and full directory tree in my opinion, but > > that's perhaps an argument for a different thread. > > Yes, at least all the diags could be rolled into one. Is that sufficient > ? > > Would you have any time to work on a patch for this aspect ? If not, > I'll add it to my TODO list. It's more of a call for the component authors to decide what things are unlikely ever to be distributed separately. The diags are a good candidate for single-configure, and would get rid of 11 of the 27 configure.in files in trunk/src/userspace. OSM is likely another. The rest require more thought. I'm not going to attempt any changes myself. -- Pete From jlentini at netapp.com Thu Aug 11 08:39:59 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 11 Aug 2005 11:39:59 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: FMR implementation for kdapl In-Reply-To: References: Message-ID: Hi Guy, I haven't used FMRs before, so if I make some incorrect assumptions about how they work please point them out. Some high level questions: Why did you use the FMR pool API (ib_fmr_pool.h) instead of the verbs (ib_alloc_fmr, ib_map_phys_fmr, ...)? Why did you make FMR support configurable via a module parameter? My concern is that this forces users to know if their kdapl software needs FMR support. 
More questions/comments: On Thu, 11 Aug 2005, Guy German wrote: > James, > > This is a Tavor FMR implementation for kDPAL, that works over > gen2's fmr_pool.c implementation. > > A few notes: > 1. I've added mod params to kdapl_ib to control the fmr size and pool > length. There are still reasonable defaults, but I think we should allow > consumers to control this. > 2. the fmr pool allocation is done in the pz_create (if active_fmr =1). > The problem is that at the time of creating the pool we don't > know what the consumer's privileges request is going to be (passed > afterwords in dapl_lmr_kcreate). I think this can be solved by creating > 3 pools and taking from the right pool at lmr_kcreate time. > Do you have any "cheaper" solution to this ? My naive suggestion is to use ib_alloc_fmr(), etc. > 3. Can we get rid of DAT_MEM_TYPE_LMR code ? I don't understand whats it > for. It is supposed to implement that memory type. If it is broken, we should fix it. > Signed-off-by: Guy German > > Index: dapl_openib_util.c > =================================================================== > --- dapl_openib_util.c (revision 3056) > +++ dapl_openib_util.c (working copy) > @@ -196,6 +196,53 @@ int dapl_ib_mr_register_physical(struct > return 0; > } > > +int dapl_ib_mr_register_fmr(struct dapl_ia *ia, struct dapl_lmr *lmr, > + void *phys_addr, u64 length, > + enum dat_mem_priv_flags privileges) Why not change this function signature to match what you want to receive (ie. change length to page_count)? > +{ > + /* FIXME: this phase-1 implementation of fmr doesn't take "privileges" > + into account. This is a security breech. 
*/ > + u64 io_addr; > + u64 *page_list; > + int page_count; > + struct ib_pool_fmr *mem; > + int status; > + > + page_list = (u64 *)phys_addr; > + page_count = (int)length; > + io_addr = page_list[0]; > + > + mem = ib_fmr_pool_map_phys (((struct dapl_pz *)lmr->param.pz)->fmr_pool, > + page_list, > + page_count, > + &io_addr); > + if (IS_ERR(mem)) { > + status = (int)PTR_ERR(mem); > + if (status != -EAGAIN) > + dapl_dbg_log(DAPL_DBG_TYPE_ERR, > + "fmr_pool_map_phys ret=%d <%d pages>\n", > + status, page_count); > + > + lmr->param.registered_address = 0; > + lmr->fmr = 0; > + return status; > + } > + > + lmr->param.lmr_context = mem->fmr->lkey; > + lmr->param.rmr_context = mem->fmr->rkey; > + lmr->param.registered_size = length * PAGE_SIZE; > + lmr->param.registered_address = io_addr; > + lmr->fmr = mem; > + > + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, > + "%s: (addr=%p size=0x%x) lkey=0x%x rkey=0x%x\n", __func__, > + lmr->param.registered_address, > + lmr->param.registered_size, > + lmr->param.lmr_context, > + lmr->param.rmr_context); > + return 0; > +} > + > int dapl_ib_mr_register_shared(struct dapl_ia *ia, struct dapl_lmr *lmr, > enum dat_mem_priv_flags privileges) > { > @@ -222,7 +269,10 @@ int dapl_ib_mr_deregister(struct dapl_lm > { > int status; > > - status = ib_dereg_mr(lmr->mr); > + if (DAT_MEM_TYPE_PLATFORM == lmr->param.mem_type && lmr->fmr) > + status = ib_fmr_pool_unmap(lmr->fmr); > + else > + status = ib_dereg_mr(lmr->mr); > if (status < 0) { > dapl_dbg_log(DAPL_DBG_TYPE_ERR, > " ib_dereg_mr error code return = %d\n", > Index: dapl_openib_util.h > =================================================================== > --- dapl_openib_util.h (revision 3056) > +++ dapl_openib_util.h (working copy) > @@ -87,6 +87,10 @@ int dapl_ib_mr_register_physical(struct > void *phys_addr, u64 length, > enum dat_mem_priv_flags privileges); > > +int dapl_ib_mr_register_fmr(struct dapl_ia *ia, struct dapl_lmr *lmr, > + void *phys_addr, u64 length, > + enum dat_mem_priv_flags 
privileges); > + > int dapl_ib_mr_deregister(struct dapl_lmr *lmr); > > int dapl_ib_mr_register_shared(struct dapl_ia *ia, struct dapl_lmr *lmr, > Index: dapl_lmr.c > =================================================================== > --- dapl_lmr.c (revision 3056) > +++ dapl_lmr.c (working copy) > @@ -137,7 +137,7 @@ static inline int dapl_lmr_create_physic > u64 *registered_address) > { > struct dapl_lmr *new_lmr; > - int status; > + int status = 0; > > new_lmr = dapl_lmr_alloc(ia, mem_type, phys_addr, > page_count, (struct dat_pz *) pz, privileges); > @@ -151,13 +151,22 @@ static inline int dapl_lmr_create_physic > status = dapl_ib_mr_register_ia(ia, new_lmr, phys_addr, > page_count, privileges); > } > - else { > + else if (DAT_MEM_TYPE_PHYSICAL == mem_type) { > status = dapl_ib_mr_register_physical(ia, new_lmr, > phys_addr.for_array, > page_count, privileges); > } > + else if (DAT_MEM_TYPE_PLATFORM == mem_type) { > + status = dapl_ib_mr_register_fmr(ia, new_lmr, > + phys_addr.for_array, > + page_count, privileges); > + } > + else { > + status = -EINVAL; > + goto error1; > + } > > - if (0 != status) > + if (status) > goto error2; > > atomic_inc(&pz->pz_ref_count); > @@ -243,7 +252,7 @@ int dapl_lmr_kcreate(struct dat_ia *ia, > int status; > > dapl_dbg_log(DAPL_DBG_TYPE_API, > - "dapl_lmr_kcreate(ia:%p, mem_type:%x, ...)\n", > + "dapl_lmr_kcreate(ia:%p, mem_type:%x)\n", I like the ... in the printout :) > ia, mem_type); > > dapl_ia = (struct dapl_ia *)ia; > @@ -258,6 +267,11 @@ int dapl_lmr_kcreate(struct dat_ia *ia, > rmr_context, registered_length, > registered_address); > break; > + case DAT_MEM_TYPE_PLATFORM: /* used as a proprietary Tavor-FMR */ My understanding is that FMRs are not specific to Tavor. I thought Arbel and Sinai also had FMR support. Is that correct? If so, we should change this comment. 
> + if (!active_fmr) { > + status = -EINVAL; > + break; > + } > case DAT_MEM_TYPE_PHYSICAL: > case DAT_MEM_TYPE_IA: > status = dapl_lmr_create_physical(dapl_ia, region_description, > @@ -307,6 +321,7 @@ int dapl_lmr_free(struct dat_lmr *lmr) > > switch (dapl_lmr->param.mem_type) { > case DAT_MEM_TYPE_PHYSICAL: > + case DAT_MEM_TYPE_PLATFORM: > case DAT_MEM_TYPE_IA: > case DAT_MEM_TYPE_LMR: > { > Index: dapl_pz.c > =================================================================== > --- dapl_pz.c (revision 3056) > +++ dapl_pz.c (working copy) > @@ -89,7 +89,17 @@ int dapl_pz_create(struct dat_ia *ia, st > status); > goto error2; > } > - > + > + if (active_fmr) { > + struct ib_fmr_pool_param params; > + set_fmr_params (&params); > + dapl_pz->fmr_pool = ib_create_fmr_pool(dapl_pz->pd, &params); > + if (IS_ERR(dapl_pz->fmr_pool)) > + dapl_dbg_log(DAPL_DBG_TYPE_WARN, > + "could not create FMR pool <%ld>", > + PTR_ERR(dapl_pz->fmr_pool)); > + } > + > *pz = (struct dat_pz *)dapl_pz; > return 0; > > @@ -104,7 +114,7 @@ error1: > int dapl_pz_free(struct dat_pz *pz) > { > struct dapl_pz *dapl_pz; > - int status; > + int status=0; > > dapl_dbg_log(DAPL_DBG_TYPE_API, "dapl_pz_free(%p)\n", pz); > > @@ -114,8 +124,10 @@ int dapl_pz_free(struct dat_pz *pz) > status = -EINVAL; > goto error; > } > - > - status = ib_dealloc_pd(dapl_pz->pd); > + if (active_fmr) > + (void)ib_destroy_fmr_pool(dapl_pz->fmr_pool); > + else > + status = ib_dealloc_pd(dapl_pz->pd); If active_fmr is true, we still allocate a PD in dapl_pz_create.
Should the above be + if (active_fmr) + (void)ib_destroy_fmr_pool(dapl_pz->fmr_pool); + status = ib_dealloc_pd(dapl_pz->pd); > if (status) { > dapl_dbg_log(DAPL_DBG_TYPE_ERR, "ib_dealloc_pd failed: %X\n", > status); > Index: dapl_ia.c > =================================================================== > --- dapl_ia.c (revision 3056) > +++ dapl_ia.c (working copy) > @@ -745,7 +745,8 @@ int dapl_ia_query(struct dat_ia *ia_ptr, > provider_attr->provider_version_major = DAPL_PROVIDER_MAJOR; > provider_attr->provider_version_minor = DAPL_PROVIDER_MINOR; > provider_attr->lmr_mem_types_supported = > - DAT_MEM_TYPE_PHYSICAL | DAT_MEM_TYPE_IA; > + DAT_MEM_TYPE_PHYSICAL | DAT_MEM_TYPE_IA | > + DAT_MEM_TYPE_PLATFORM; Please align the DAT_MEM_TYPE_PLATFORM with the above line. > provider_attr->iov_ownership_on_return = DAT_IOV_CONSUMER; > provider_attr->dat_qos_supported = DAT_QOS_BEST_EFFORT; > provider_attr->completion_flags_supported = > Index: dapl_provider.c > =================================================================== > --- dapl_provider.c (revision 3056) > +++ dapl_provider.c (working copy) > @@ -48,8 +48,19 @@ MODULE_AUTHOR("James Lentini"); > > #ifdef CONFIG_KDAPL_INFINIBAND_DEBUG > static DAPL_DBG_MASK g_dapl_dbg_mask = 0; > +unsigned int active_fmr = 1; > +static unsigned int pool_size = 2048; > +static unsigned int max_pages_per_fmr = 64; These FMR module parameters should not be inside the CONFIG_KDAPL_INFINIBAND_DEBUG guard. They should be enabled regardless of the debug configuration. > + > module_param_named(dbg_mask, g_dapl_dbg_mask, int, 0644); > -MODULE_PARM_DESC(dbg_mask, "Bitmask to enable debug message types."); > +module_param_named(active_fmr, active_fmr, int, 0644); > +module_param_named(pool_size, pool_size, int, 0644); > +module_param_named(max_pages_per_fmr, max_pages_per_fmr, int, 0644); Can you use the g_dapl_ prefix for active_fmr, pool_size, and max_pages_per_fmr? > +MODULE_PARM_DESC(dbg_mask, "Bitmask to enable debug message types. 
"); > +MODULE_PARM_DESC(active_fmr, "if active_fmr==1, creates fmr pool in pz_create "); > +MODULE_PARM_DESC(pool_size, "num of fmr handles in pool "); > +MODULE_PARM_DESC(max_pages_per_fmr, "max size (in pages) of an fmr handle "); > + > #endif /* CONFIG_KDAPL_INFINIBAND_DEBUG */ > > static LIST_HEAD(g_dapl_provider_list); > @@ -152,6 +163,18 @@ void dapl_dbg_log(enum dapl_dbg_type typ > > #endif /* KDAPL_INFINIBAND_DEBUG */ > > +void set_fmr_params (struct ib_fmr_pool_param *fmr_param_s) > +{ > + fmr_param_s->max_pages_per_fmr = max_pages_per_fmr; > + fmr_param_s->pool_size = pool_size; > + fmr_param_s->dirty_watermark = 32; > + fmr_param_s->cache = 1; > + fmr_param_s->flush_function = NULL; > + fmr_param_s->access = (IB_ACCESS_LOCAL_WRITE | > + IB_ACCESS_REMOTE_WRITE | > + IB_ACCESS_REMOTE_READ); > +} > + Lets find a better name for this function and possibly a different location. How about dapl_fmr_pool_param_init? How about shortening the parameter name from fmr_param_s to either fmr_params or fmr_param. Either of those would be more standard. > static struct dapl_provider *dapl_provider_alloc(const char *name, > struct ib_device *device, > u8 port) From jlentini at netapp.com Thu Aug 11 08:48:20 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 11 Aug 2005 11:48:20 -0400 (EDT) Subject: [openib-general][PATCH][kdapltest]: "free" missing In-Reply-To: References: Message-ID: Committed in revision 3061. From sean.hefty at intel.com Thu Aug 11 09:32:16 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 11 Aug 2005 09:32:16 -0700 Subject: [openib-general] [RFC] [uCM] proposed API changes In-Reply-To: <1123762574.4403.4794.camel@hal.voltaire.com> Message-ID: >How would one go about obtaining the local and remote IDs for a >connection ? That appears to be removed. Is that not needed ? I don't think that those are needed by a client. If we want to add them, I would put them into the cm_id structure, rather than adding a call to retrieve them. 
This is similar to the operation of the userspace verbs. - Sean From halr at voltaire.com Thu Aug 11 09:32:59 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Aug 2005 12:32:59 -0400 Subject: [openib-general] [RFC] [uCM] proposed API changes In-Reply-To: References: Message-ID: <1123777979.4403.5403.camel@hal.voltaire.com> On Thu, 2005-08-11 at 12:32, Sean Hefty wrote: > >How would one go about obtaining the local and remote IDs for a > >connection ? That appears to be removed. Is that not needed ? > > I don't think that those are needed by a client. but perhaps useful for debug ? > If we want to add them, I > would put them into the cm_id structure, rather than adding a call to retrieve > them. This is similar to the operation of the userspace verbs. From sean.hefty at intel.com Thu Aug 11 09:47:41 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 11 Aug 2005 09:47:41 -0700 Subject: [openib-general] [RFC] [uCM] proposed API changes In-Reply-To: <1123777979.4403.5403.camel@hal.voltaire.com> Message-ID: >> >How would one go about obtaining the local and remote IDs for a >> >connection ? That appears to be removed. Is that not needed ? >> >> I don't think that those are needed by a client. > >but perhaps useful for debug ? They could be useful for debugging. I'll either add them to the cm_id structure or keep the existing call to retrieve them. I was leaning to the former, since is seems to better match how uverbs operates. - Sean From jlentini at netapp.com Thu Aug 11 09:59:58 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 11 Aug 2005 12:59:58 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: Hi Guy, Comments/questions below: On Thu, 11 Aug 2005, Guy German wrote: > Sorry for resending it - the former mail did not have a subject > > This patch allows the dapl consumer to control the evd upcall policy. > Some consumers (e.g. 
ISER) receives one upcall, disable > the upcall policy, and retrieve the rest of the events from a > kernel_thread, via dat_evd_dequeue. > This fashion of work improves performance by saving the context > switching that is involved in many upcalls. > If the consumer does not behave that way and leaves the upcall policy > enabled at all times (e.g. kdapltest), the behavior will stay the same and > the consumer will get an upcall for each event. > > Signed-off-by: Guy German > > Index: dapl_evd.c > =================================================================== > --- dapl_evd.c (revision 3056) > +++ dapl_evd.c (working copy) > @@ -38,28 +38,39 @@ > #include "dapl_ring_buffer_util.h" > > /* > - * DAPL Internal routine to trigger the specified CNO. > - * Called by the callback of some EVD associated with the CNO. Thanks for catch these CNO references. I thought I had removed them all. > + * DAPL Internal routine to trigger the callback of the EVD > */ > static void dapl_evd_upcall_trigger(struct dapl_evd *evd) > { > int status = 0; > struct dat_event event; > + unsigned long flags; For flags, we use flags member in the dapl_common structure. Take a look at the call to spin_lock_irqsave() in dapl_evd_get_event() for an example. We use the flags in the structure because the EVD code takes the spin lock in one function and releases it in another. > > - /* Only process events if there is an enabled callback function. */ > - if ((evd->upcall.upcall_func == (DAT_UPCALL_FUNC) NULL) || > - (evd->upcall_policy == DAT_UPCALL_DISABLE)) { > + > + /* The function is not re-entrant (change when implementing DAT_UPCALL_MANY)*/ Why is this function not re-entrant? For reference, here is how I would define re-entrant: http://en.wikipedia.org/wiki/Reentrant http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?re-entrant > + if (evd->is_triggered) > return; > - } Why check the value here? Is it only for the efficiency of not taking the spin lock when is_triggered is 1? 
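On the is_triggered question above, here is a minimal userspace sketch, assuming a pthread mutex stands in for the EVD spin lock. The names trigger_guard, trigger_guard_enter and trigger_guard_leave are hypothetical and are not part of kDAPL. In this sketch the locked test-and-set alone decides which caller runs the upcall loop, which suggests that an unlocked pre-check like the one in the patch would only save taking the lock; that reading is an assumption, not something the patch author has confirmed.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the EVD fields involved; not kDAPL code. */
struct trigger_guard {
    pthread_mutex_t lock;   /* plays the role of evd->common.lock */
    bool is_triggered;      /* true while one caller runs the upcall loop */
};

/*
 * Try to become the single caller that runs the upcall loop.
 * Returns true if the caller may proceed, false if another caller
 * already holds the guard.
 */
static bool trigger_guard_enter(struct trigger_guard *g)
{
    bool entered = false;

    pthread_mutex_lock(&g->lock);
    if (!g->is_triggered) {
        g->is_triggered = true;
        entered = true;
    }
    pthread_mutex_unlock(&g->lock);
    return entered;
}

/* Release the guard once the upcall loop has finished. */
static void trigger_guard_leave(struct trigger_guard *g)
{
    pthread_mutex_lock(&g->lock);
    g->is_triggered = false;
    pthread_mutex_unlock(&g->lock);
}
```

Unlike the patch, this sketch also clears the flag under the lock; whether that matters for the EVD code depends on details not shown in this thread.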
> > - for (;;) { > + spin_lock_irqsave (&evd->common.lock, flags); > + if (evd->is_triggered) { > + spin_unlock_irqrestore (&evd->common.lock, flags); > + return; > + } > + evd->is_triggered = 1; > + spin_unlock_irqrestore (&evd->common.lock, flags); > + /* Only process events if there is an enabled callback function */ > + while ((evd->upcall.upcall_func != (DAT_UPCALL_FUNC)NULL) && > + (evd->upcall_policy != DAT_UPCALL_TEARDOWN) && > + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { > status = dapl_evd_dequeue((struct dat_evd *)evd, &event); > - if (0 != status) > - return; > - > + if (status) > + break; > evd->upcall.upcall_func(evd->upcall.instance_data, &event, > FALSE); > } > + evd->is_triggered = 0; > + > + return; > } > > static void dapl_evd_eh_print_wc(struct ib_wc *wc) > @@ -820,24 +831,19 @@ static void dapl_evd_dto_callback(struct > * This function does not dequeue from the CQ; only the consumer > * can do that. Instead, it wakes up waiters if any exist. > * It rearms the completion only if completions should always occur > - * (specifically if a CNO is associated with the EVD and the > - * EVD is enabled). > */ > - > - if (state == DAPL_EVD_STATE_OPEN && > - evd->upcall_policy != DAT_UPCALL_DISABLE) { > - /* > - * Re-enable callback, *then* trigger. > - * This guarantees we won't miss any events. 
> - */ > - status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > - if (0 != status) > - (void)dapl_evd_post_async_error_event( > - evd->common.owner_ia->async_error_evd, > - DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, > - evd->common.owner_ia); > - > + > + if (state == DAPL_EVD_STATE_OPEN) { > dapl_evd_upcall_trigger(evd); > + if ((evd->upcall_policy != DAT_UPCALL_TEARDOWN) && > + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > + if (0 != status) > + (void)dapl_evd_post_async_error_event( > + evd->common.owner_ia->async_error_evd, > + DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, > + evd->common.owner_ia); > + } You changed the order in which the CQ upcall is enabled and the kDAPL upcall is made. It used to be: enable CQ upcall call kDAPL upcall you are proposing call kDAPL upcall enable CQ upcall I think your proposed order contains a race condition. Specifically if a work completion occurs after dapl_evd_upcall_trigger() returns but before the CQ upcall is re-enabled with ib_req_notify_cq(), no upcall will occur for the completion. Do you agree? > } > dapl_dbg_log(DAPL_DBG_TYPE_RTN, "%s() returns\n", __func__); > } > @@ -890,7 +896,7 @@ int dapl_evd_internal_create(struct dapl > > /* reset the qlen in the attributes, it may have changed */ > evd->qlen = evd->cq->cqe; > - > + evd->is_triggered = 0; This should be done in dapl_evd_alloc. > status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > > if (status != 0) > @@ -1035,15 +1041,41 @@ int dapl_evd_modify_upcall(struct dat_ev > const struct dat_upcall_object *upcall) > { > struct dapl_evd *evd; > - > - dapl_dbg_log(DAPL_DBG_TYPE_API, "dapl_modify_upcall(%p)\n", evd_handle); > + int status = 0; > + int pending_events; > + unsigned long flags; See my comment above about he flags. 
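The ordering concern raised earlier in this review (re-arming the CQ before or after the kDAPL upcall) can be modelled with a small single-threaded sketch. Everything here is hypothetical: model_cq and its helpers are not IB verbs calls, and the only assumption carried over from the verbs is that a "next completion" arm is one-shot and does not fire for entries already queued.

```c
#include <stdbool.h>

/* Hypothetical single-threaded model of a completion queue (CQ). */
struct model_cq {
    int  pending;   /* completions waiting to be polled */
    bool armed;     /* true if the next completion raises a notification */
    int  notified;  /* notifications raised so far */
};

/* Hardware side: queue a completion; notify only if armed (one-shot). */
static void model_cq_complete(struct model_cq *cq)
{
    cq->pending++;
    if (cq->armed) {
        cq->armed = false;
        cq->notified++;
    }
}

/* Consumer side: request a notification for the next completion. */
static void model_cq_arm(struct model_cq *cq)
{
    cq->armed = true;
}

/* Consumer side: poll everything currently queued. */
static int model_cq_drain(struct model_cq *cq)
{
    int n = cq->pending;

    cq->pending = 0;
    return n;
}

/*
 * Replay one late completion against both orderings and return how many
 * completions are left queued with no notification raised for them.
 */
static int stranded_after(bool arm_first)
{
    struct model_cq cq = { .pending = 1, .armed = false, .notified = 0 };

    if (arm_first) {
        model_cq_arm(&cq);
        model_cq_drain(&cq);
        model_cq_complete(&cq);   /* armed, so a notification is raised */
    } else {
        model_cq_drain(&cq);
        model_cq_complete(&cq);   /* not armed yet, so nothing is raised */
        model_cq_arm(&cq);        /* only fires for a later completion */
    }
    return (cq.pending > 0 && cq.notified == 0) ? cq.pending : 0;
}
```

In this model stranded_after(true) leaves nothing unannounced, while stranded_after(false) leaves one completion sitting in the queue until some later completion arrives.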
> > evd = (struct dapl_evd *)evd_handle; > + dapl_dbg_log (DAPL_DBG_TYPE_API, "%s: (evd=%p) set to %d\n", > + __func__, evd_handle, upcall_policy); The idea was to make the DAPL_DBG_TYPE_API prints look like a debugger stack trace. The following would be keeping with the other print statements: + dapl_dbg_log (DAPL_DBG_TYPE_API, "%s(%p, %d, %p)\n", + __func__, evd_handle, upcall_policy, upcall); > > + spin_lock_irqsave(&evd->common.lock, flags); > + if ((upcall_policy != DAT_UPCALL_TEARDOWN) && > + (upcall_policy != DAT_UPCALL_DISABLE)) { Why not let the consumer setup the upcall when it disabled? That seems like the only safe time to modify it. > + pending_events = dapl_rbuf_count(&evd->pending_event_queue); I don't understand this restriction either. Please explain. > + if (pending_events) { > + dapl_dbg_log (DAPL_DBG_TYPE_WARN, > + "%s: (evd %p) there are still %d pending " > + "events in the queue - policy stays disabled\n", > + __func__, evd_handle, pending_events); > + status = -EBUSY; > + goto bail; > + } > + if (evd->evd_flags & DAT_EVD_DTO_FLAG) { > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); Why do we need to re-enable the CQ upcall? > + if (status) { > + printk(KERN_ERR "%s: dapls_ib_completion_notify" > + " failed (status=0x%x) \n",__func__, > + status); Let's use dapl_dbg_log instead of printk. We can also update the text of the message to "%s: ib_req_notify_cq failed: %X\n" > + goto bail; > + } > + } > + } > evd->upcall_policy = upcall_policy; > evd->upcall = *upcall; > - > - return 0; > +bail: > + spin_unlock_irqrestore(&evd->common.lock, flags); > + return status; > } > > int dapl_evd_post_se(struct dat_evd *evd_handle, const struct dat_event *event) > @@ -1076,7 +1108,7 @@ int dapl_evd_post_se(struct dat_evd *evd > event->event_data. 
> software_event_data.pointer); > > - bail: > +bail: > return status; > } > > @@ -1124,7 +1156,7 @@ int dapl_evd_dequeue(struct dat_evd *evd > } > > spin_unlock_irqrestore(&evd->common.lock, evd->common.flags); > - bail: > +bail: > dapl_dbg_log(DAPL_DBG_TYPE_RTN, > "dapl_evd_dequeue () returns 0x%x\n", status); > From rolandd at cisco.com Thu Aug 11 10:26:33 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 11 Aug 2005 10:26:33 -0700 Subject: [openib-general] Re: the tests user/libibverbs/examples/*_pingpong.c doesn't support t he long format of the parameter rx-depth In-Reply-To: <506C3D7B14CDD411A52C00025558DED60882C45A@mtlex01.yok.mtl.com> (Dotan Barak's message of "Thu, 11 Aug 2005 15:08:06 +0300") References: <506C3D7B14CDD411A52C00025558DED60882C45A@mtlex01.yok.mtl.com> Message-ID: <52acjopepi.fsf@cisco.com> Dotan> the following line should be added to all of the tests Dotan> (ud_pingpong.c, uc_pingpong.c, rc_pingpong.c) { .name = Dotan> "rx-depth", .has_arg = 1, .val = 'r' }, Are you looking at the latest svn? In my tree it seems that I checked in that fix in rev 3041. - R. From rolandd at cisco.com Thu Aug 11 10:27:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 11 Aug 2005 10:27:39 -0700 Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32 systems In-Reply-To: <200508111306.j7BD6GRi018085@xi.cse.ohio-state.edu> (Dhabaleswar Panda's message of "Thu, 11 Aug 2005 09:06:16 -0400 (EDT)") References: <200508111306.j7BD6GRi018085@xi.cse.ohio-state.edu> Message-ID: <5264ucpeno.fsf@cisco.com> Dhabaleswar> I am wondering whether any of you have personally Dhabaleswar> installed Gen2 successfully on IA-32 systems with Dhabaleswar> PCI-X IB cards and checked that it works. Yes, I use this configuration for testing quite a bit. - R. 
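For the rx-depth long option discussed in the *_pingpong.c message above, a minimal sketch of how such an entry is consumed by getopt_long. The function parse_rx_depth and its default_depth parameter are hypothetical and are not taken from the pingpong examples; only the option table entry matches the line quoted in that message.

```c
#include <getopt.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return the receive depth given on the command
 * line as -r N or --rx-depth N, default_depth if neither is present,
 * or -1 if an unknown option is seen.
 */
static int parse_rx_depth(int argc, char *argv[], int default_depth)
{
    static const struct option long_options[] = {
        { .name = "rx-depth", .has_arg = 1, .val = 'r' },
        { 0 }
    };
    int depth = default_depth;
    int c;

    optind = 1;  /* restart scanning so the helper can be called again */
    while ((c = getopt_long(argc, argv, "r:", long_options, NULL)) != -1) {
        if (c != 'r')
            return -1;
        depth = (int)strtol(optarg, NULL, 0);
    }
    return depth;
}
```

Without the long_options entry only the short form -r would be accepted, which is the behaviour the original report describes.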
From halr at voltaire.com Thu Aug 11 11:48:37 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Aug 2005 14:48:37 -0400 Subject: [openib-general] [PATCH] sockaddr_ll change for IPoIB interface Message-ID: <1123786117.4403.5835.camel@hal.voltaire.com> Hi, This is a repost of a patch which was posted last week which appears to have been lost in the shuffle. The patch below is to accommodate IPoIB link layer address in the sockaddr_ll struct so that user space can send and receive IPoIB link layer packets. Unfortunately, IPoIB has 20-byte LL addresses rather than the 8 byte MAC addresses (or under) used by other LLs. There is a similar change to both: /usr/include/linux/if_packet.h /usr/include/netpacket/packet.h as in: include/linux/if_packet.h below to increase sll_addr from 8 to 20 bytes. Thanks. -- Hal sockaddr_ll changes to accommodate IPoIB interfaces. This is due to the fact that the IPoIB link layer address is 20 bytes rather than 8 bytes. With the current 8 byte address, it is not possible to send ARPs and RARPs from userspace as the broadcast and unicast IPoIB addresses cannot be supplied properly. There is backward compatibility support for those applications built with the existing structure (prior to this patch).
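A minimal sketch of the backward compatibility rule described above, assuming the length check in the patch below is the whole of it. The struct and function names prefixed with model_ are hypothetical stand-ins for the kernel definitions; MODEL_ARPHRD_INFINIBAND copies the value 32 that linux/if_arp.h gives ARPHRD_INFINIBAND, and MODEL_SOCKADDR_LL_COMPAT mirrors the 12 in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ARPHRD_INFINIBAND  32  /* same value as ARPHRD_INFINIBAND */
#define MODEL_SOCKADDR_LL_COMPAT 12  /* 20-byte address minus 8-byte address */

/* Layout before the patch: 8-byte link layer address. */
struct model_sockaddr_ll_old {
    unsigned short sll_family;
    unsigned short sll_protocol;
    int            sll_ifindex;
    unsigned short sll_hatype;
    unsigned char  sll_pkttype;
    unsigned char  sll_halen;
    unsigned char  sll_addr[8];
};

/* Layout after the patch: 20-byte link layer address for IPoIB. */
struct model_sockaddr_ll {
    unsigned short sll_family;
    unsigned short sll_protocol;
    int            sll_ifindex;
    unsigned short sll_hatype;
    unsigned char  sll_pkttype;
    unsigned char  sll_halen;
    unsigned char  sll_addr[20];
};

/*
 * Mirror of the length test the patch adds: a full-size address is
 * always accepted; the old, shorter size is accepted only when the
 * hardware type is not InfiniBand; anything else is rejected.
 */
static bool model_namelen_ok(size_t namelen, unsigned short hatype)
{
    if (namelen >= sizeof(struct model_sockaddr_ll))
        return true;
    if (namelen != sizeof(struct model_sockaddr_ll) - MODEL_SOCKADDR_LL_COMPAT)
        return false;
    return hatype != MODEL_ARPHRD_INFINIBAND;
}
```

Under this reading, a binary built against the old header keeps working on Ethernet, while an InfiniBand sender has to be rebuilt so that it can pass the full 20-byte address.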
Signed-off-by: Hal Rosenstock --- include/linux/if_packet.h.orig 2005-06-29 19:00:53.000000000 -0400 +++ include/linux/if_packet.h 2005-08-05 10:04:06.000000000 -0400 @@ -8,6 +8,7 @@ struct sockaddr_pkt unsigned short spkt_protocol; }; +#define SOCKADDR_LL_COMPAT 12 struct sockaddr_ll { unsigned short sll_family; @@ -16,7 +17,7 @@ struct sockaddr_ll unsigned short sll_hatype; unsigned char sll_pkttype; unsigned char sll_halen; - unsigned char sll_addr[8]; + unsigned char sll_addr[20]; }; /* Packet types */ --- af_packet.c.orig 2005-06-29 19:00:53.000000000 -0400 +++ af_packet.c 2005-08-05 13:28:49.000000000 -0400 @@ -708,8 +708,12 @@ static int packet_sendmsg(struct kiocb * addr = NULL; } else { err = -EINVAL; - if (msg->msg_namelen < sizeof(struct sockaddr_ll)) - goto out; + if (msg->msg_namelen < sizeof(struct sockaddr_ll)) { + /* Support for older sockaddr_ll structs */ + if ((msg->msg_namelen != sizeof(struct sockaddr_ll) - SOCKADDR_LL_COMPAT) || + (saddr->sll_hatype == ARPHRD_INFINIBAND)) + goto out; + } ifindex = saddr->sll_ifindex; proto = saddr->sll_protocol; addr = saddr->sll_addr; @@ -937,7 +941,11 @@ static int packet_bind(struct socket *so */ if (addr_len < sizeof(struct sockaddr_ll)) - return -EINVAL; + /* Support for older sockaddr_ll structs */ + if ((addr_len != sizeof(struct sockaddr_ll) - + SOCKADDR_LL_COMPAT) || + (sll->sll_hatype == ARPHRD_INFINIBAND)) + return -EINVAL; if (sll->sll_family != AF_PACKET) return -EINVAL; From sean.hefty at intel.com Thu Aug 11 12:10:21 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 11 Aug 2005 12:10:21 -0700 Subject: [openib-general] [RFC] [uCM] proposed API changes In-Reply-To: <42FAC13A.6020801@ichips.intel.com> Message-ID: >>I'd like to propose the following changes to the user CM API. These >>would allow returning user specified context when reporting events to >>the user. 
I also added a call to retrieve the necessary QP attributes
>>from the kernel CM that I would like to include as a part of the API/ABI
>>changes. Comments?
>>
>>- Sean
>
>Looks good so far. The context will save uDAPL a list search and the uCM
>QP attributes is a much
>needed feature that I have been anxiously waiting for.

There is an issue with adding context. When a connection REQ is received, a new kernel cm_id is created. This cm_id doesn't have any context associated with it. For kernel clients, this isn't a big deal, since all events associated with a single cm_id are serialized. A kernel app can set the context as part of their REQ handling.

Userspace clients will run into the same situation, where no context is defined. But events for the same cm_id are not serialized for userspace clients. An app can receive a REJ event for a newly created cm_id that does not have a context. (They can even process the REJ event before the REQ event is seen.) Searching in this case is unavoidable. I'm not even sure of the right way to handle this situation.

In a more generic sense, userspace clients need to be able to handle out of order events if they use multiple threads for event handling. For example, MRA to a REQ, REP received, and REJ received events could all occur at the same time. (In this case, a userspace context would be valid.)

- Sean

From rolandd at cisco.com Thu Aug 11 12:18:59 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 11 Aug 2005 12:18:59 -0700
Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32 systems
In-Reply-To: <200508111306.j7BD6GRi018085@xi.cse.ohio-state.edu> (Dhabaleswar Panda's message of "Thu, 11 Aug 2005 09:06:16 -0400 (EDT)")
References: <200508111306.j7BD6GRi018085@xi.cse.ohio-state.edu>
Message-ID: <52oe84nuxo.fsf@cisco.com>

    Dhabaleswar> Opteron systems and carry out experiments. There is
    Dhabaleswar> no problem. The problem is coming only for IA-32
    Dhabaleswar> systems.
Even on EM64T systems, this problem comes
    Dhabaleswar> when operating it in IA-32 mode.

Out of curiosity, do PCIe cards work with 32-bit kernels?

As Hal said, please post the kernel log you get when loading drivers built with CONFIG_INFINIBAND_MTHCA_DEBUG=y.

Thanks,
Roland

From mshefty at ichips.intel.com Thu Aug 11 12:20:19 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 11 Aug 2005 12:20:19 -0700
Subject: [openib-general] Re: uCM question
In-Reply-To: <59278FC0C48A994BABABD069571E45680BBA5015@orsmsx401.amr.corp.intel.com>
References: <59278FC0C48A994BABABD069571E45680BBA5015@orsmsx401.amr.corp.intel.com>
Message-ID: <42FBA4F3.7050604@ichips.intel.com>

Davis, Arlin R wrote:
> Question:
>
> When a new request arrives and CM provides a new cm_id on behalf of the
> user I assume the user is expected to update the context in the ib_cm_id
> structure before putting the request event back? You may want to add
> some words to cm.h if that is the case just to be clear.

Yes - that would be the case. However, see my other email... events can arrive out of order if using multiple threads to get them. I believe that the only case where this matters with respect to setting a context is receiving a REJ after a REQ.

- Sean

From ftillier at silverstorm.com Thu Aug 11 12:31:32 2005
From: ftillier at silverstorm.com (Fab Tillier)
Date: Thu, 11 Aug 2005 12:31:32 -0700
Subject: [openib-general] [RFC] [uCM] proposed API changes
In-Reply-To:
Message-ID: <002b01c59eab$4852cf70$9c5aa8c0@infiniconsys.com>

> From: Sean Hefty [mailto:sean.hefty at intel.com]
> Sent: Thursday, August 11, 2005 12:10 PM
>
> There is an issue with adding context. When a connection REQ is received, a
> new kernel cm_id is created. This cm_id doesn't have any context associated
> with it. For kernel clients, this isn't a big deal, since all events
> associated with a single cm_id are serialized. A kernel app can set the
> context as part of their REQ handling.
Serialize events for user-mode cm_ids, and allow the user client to set the context from their REQ handler. The latter is probably pretty easy to do, but in and of itself doesn't solve the problem with the out-of-order events and races between setting the context and receiving an event.

> Userspace clients will run into the same situation, where no context is
> defined. But events for the same cm_id are not serialized for userspace
> clients. An app can receive a REJ event for a newly created cm_id that does
> not have a context. (They can even process the REJ event before the REQ
> event is seen.) Searching in this case is unavoidable. I'm not even sure of
> the right way to handle this situation.

A search on a REJ isn't a big deal - it should be a rare case as it will only occur if the remote side times out or aborts. A client could ignore the REJ because sending the REP will fail if a REJ was received.

> In a more generic sense, userspace clients need to be able to handle out of
> order events if they use multiple threads for event handling. For example,
> MRA to a REQ, REP received, and REJ received events could all occur at the
> same time. (In this case, a userspace context would be valid.)

If you allow the user to target a get_event call to a specific cm_id this problem goes away. If the user issues multiple requests against the same cm_id, they need to be ready to deal with out-of-order event reporting. This also solves the context issue, since the REJ won't be reported until the user requests an event from that specific cm_id.
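A rough sketch of what a targeted get_event could look like, purely as an illustration. None of these names come from the uCM code: cm_event_queue, queue_event and get_event_for are hypothetical, the queue is a fixed-size array, and there is no locking, so this only shows the filtering idea, not a real implementation.

```c
#include <stddef.h>

enum cm_ev_type { CM_EV_REQ, CM_EV_REP, CM_EV_REJ };

struct cm_event {
	int             cm_id;	/* connection the event belongs to */
	enum cm_ev_type type;
	int             taken;	/* nonzero once handed to the user */
};

#define MAX_EVENTS 16

struct cm_event_queue {
	struct cm_event ev[MAX_EVENTS];
	size_t          count;
};

/* Append an event in arrival order; returns 0 on success, -1 if full. */
static int queue_event(struct cm_event_queue *q, int cm_id,
		       enum cm_ev_type type)
{
	if (q->count == MAX_EVENTS)
		return -1;
	q->ev[q->count].cm_id = cm_id;
	q->ev[q->count].type = type;
	q->ev[q->count].taken = 0;
	q->count++;
	return 0;
}

/*
 * Hand out the oldest pending event for one specific cm_id, or NULL if
 * nothing is pending for it. Events for other cm_ids stay queued.
 */
static struct cm_event *get_event_for(struct cm_event_queue *q, int cm_id)
{
	size_t i;

	for (i = 0; i < q->count; i++) {
		if (!q->ev[i].taken && q->ev[i].cm_id == cm_id) {
			q->ev[i].taken = 1;
			return &q->ev[i];
		}
	}
	return NULL;
}
```

With this shape, a REJ queued for a cm_id the user has not asked about yet simply waits in the queue, which is the property that removes the context race.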
- Fab

From panda at cse.ohio-state.edu Thu Aug 11 12:32:27 2005
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Thu, 11 Aug 2005 15:32:27 -0400 (EDT)
Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32
In-Reply-To: <52oe84nuxo.fsf@cisco.com> from "Roland Dreier" at Aug 11, 2005 12:18:59 PM
Message-ID: <200508111932.j7BJWROj026274@xi.cse.ohio-state.edu>

Hal, Roland and James,

Many thanks for your prompt replies!!

We tried with the debug option. Thanks for this suggestion. It looks like one of the parameters (1X/4X) for the card is not being set properly on the IA-32 system, which leads to the `disable' state for the card. By manually changing this parameter to 4X, one of the nodes is able to detect the card. We are trying this on other nodes. We are not sure whether this is caused by the driver or by the firmware in the card. We are looking into this further.

One of my students will soon post all the details.

Thanks again for all your help!!

DK

>     Dhabaleswar> Opteron systems and carry out experiments. There is
>     Dhabaleswar> no problem. The problem is coming only for IA-32
>     Dhabaleswar> systems. Even on EM64T systems, this problem comes
>     Dhabaleswar> when operating it in IA-32 mode.
>
> Out of curiosity, do PCIe cards work with 32-bit kernels?
>
> As Hal said, please post the kernel log you get when loading drivers
> built with CONFIG_INFINIBAND_MTHCA_DEBUG=y.
>
> Thanks,
> Roland
>

From damato at psc.edu Thu Aug 11 12:44:43 2005
From: damato at psc.edu (Joe Damato)
Date: Thu, 11 Aug 2005 15:44:43 -0400
Subject: [openib-general] osm build errors gen1
Message-ID: <42FBAAAB.3010709@psc.edu>

Hello -

Sorry for the long email (lots of build errors).

I am attempting to build osm as part of gen1 to allow me to test some openib gen1 code and compare to an openib gen2 port that I have been working on.
When attempting to build osm in gen1, some build errors occur. It seems that the files osm_db.h and osm_db_pack.h are included but do not exist in the source tree...

I am using openib-3022 (but from a quick glance at the newest revision, these files appear to be missing)...

[root at frodo osm]# pwd
/cluster/src/OPENIB_SRC/openib-3022/gen1/trunk/src/userspace/osm
[root at frodo osm]# grep -R -H -n "osm_db.h" .
./opensm/osm_lid_mgr.h:97:#include "osm_db.h"
./opensm/osm_opensm.h:98:#include "osm_db.h"
./opensm/osm_sm.h:125:#include "osm_db.h"
[root at frodo osm]# grep -R -H -n "osm_db_pack.h" .
./opensm/osm_lid_mgr.c:137:#include "osm_db_pack.h"

A snippet of the build errors is below. Maybe I'm just supposed to use the OpenSM from gen2 with the gen1 stack?

Thanks,
Joe Damato

--------------------------------------------------------------------------------
cc -Wall -Wimplicit -Wreturn-type -Wformat -pipe -fno-strength-reduce -DDEBUG -D_DEBUG -D_DEBUG_ -DDBG -g -O0 -I/cluster/src/OPENIB_SRC/gen1_wtf/gen1/trunk/src/userspace/osm -I/cluster/src/OPENIB_SRC/gen1_wtf/gen1/trunk/src/userspace/osm/include -I.
-I/include -I -I/ib/ts_api_ng/useraccess/include -DOSM_VENDOR_INTF_TS -DVENDOR_RMPP_SUPPORT=1 -Werror -DLINUX -DLINUX -fpic -c -o main.o main.c In file included from osm_state_mgr.h:95, from osm_node_info_rcv.h:100, from osm_node_info_rcv_ctrl.h:97, from osm_sm.h:104, from osm_opensm.h:96, from main.c:93: osm_lid_mgr.h:97:20: osm_db.h: No such file or directory In file included from osm_state_mgr.h:95, from osm_node_info_rcv.h:100, from osm_node_info_rcv_ctrl.h:97, from osm_sm.h:104, from osm_opensm.h:96, from main.c:93: osm_lid_mgr.h:134: error: syntax error before "osm_db_t" osm_lid_mgr.h:134: warning: no semicolon at end of struct or union osm_lid_mgr.h:139: error: syntax error before '*' token osm_lid_mgr.h:139: warning: type defaults to `int' in declaration of `p_g2l' osm_lid_mgr.h:139: warning: data definition has no type or storage class osm_lid_mgr.h:142: error: syntax error before '}' token osm_lid_mgr.h:142: warning: type defaults to `int' in declaration of `osm_lid_mgr_t' osm_lid_mgr.h:142: warning: data definition has no type or storage class osm_lid_mgr.h:188: error: syntax error before '*' token osm_lid_mgr.h:220: error: syntax error before '*' token osm_lid_mgr.h:253: error: syntax error before '*' token osm_lid_mgr.h:299: error: syntax error before '*' token osm_lid_mgr.h:329: error: syntax error before '*' token In file included from osm_node_info_rcv.h:100, from osm_node_info_rcv_ctrl.h:97, from osm_sm.h:104, from osm_opensm.h:96, from main.c:93: osm_state_mgr.h:138: error: syntax error before "osm_lid_mgr_t" osm_state_mgr.h:138: warning: no semicolon at end of struct or union osm_state_mgr.h:156: error: syntax error before '}' token osm_state_mgr.h:156: warning: type defaults to `int' in declaration of `osm_state_mgr_t' osm_state_mgr.h:156: warning: data definition has no type or storage class osm_state_mgr.h:282: error: syntax error before '*' token osm_state_mgr.h:333: error: syntax error before '*' token osm_state_mgr.h:365: error: syntax 
error before '*' token osm_state_mgr.h:398: error: syntax error before '*' token osm_state_mgr.h:480: error: syntax error before '*' token In file included from osm_node_info_rcv_ctrl.h:97, from osm_sm.h:104, from osm_opensm.h:96, from main.c:93: osm_node_info_rcv.h:137: error: syntax error before "osm_state_mgr_t" osm_node_info_rcv.h:137: warning: no semicolon at end of struct or union osm_node_info_rcv.h:140: error: syntax error before '}' token osm_node_info_rcv.h:140: warning: type defaults to `int' in declaration of `osm_ni_rcv_t' osm_node_info_rcv.h:140: warning: data definition has no type or storage class osm_node_info_rcv.h:172: error: syntax error before '*' token osm_node_info_rcv.h:204: error: syntax error before '*' token osm_node_info_rcv.h:236: error: syntax error before '*' token osm_node_info_rcv.h:284: warning: type defaults to `int' in declaration of `osm_ni_rcv_t' osm_node_info_rcv.h:284: error: syntax error before '*' token osm_node_info_rcv.h:313: warning: type defaults to `int' in declaration of `osm_ni_rcv_t' osm_node_info_rcv.h:313: error: syntax error before '*' token In file included from osm_sm.h:104, from osm_opensm.h:96, from main.c:93: osm_node_info_rcv_ctrl.h:131: error: syntax error before "osm_ni_rcv_t" osm_node_info_rcv_ctrl.h:131: warning: no semicolon at end of struct or union osm_node_info_rcv_ctrl.h:136: error: syntax error before '}' token osm_node_info_rcv_ctrl.h:136: warning: type defaults to `int' in declaration of `osm_ni_rcv_ctrl_t' osm_node_info_rcv_ctrl.h:136: warning: data definition has no type or storage class osm_node_info_rcv_ctrl.h:166: error: syntax error before '*' token osm_node_info_rcv_ctrl.h:199: error: syntax error before '*' token osm_node_info_rcv_ctrl.h:231: error: syntax error before '*' token osm_node_info_rcv_ctrl.h:271: warning: type defaults to `int' in declaration of `osm_ni_rcv_ctrl_t' osm_node_info_rcv_ctrl.h:271: error: syntax error before '*' token In file included from 
osm_port_info_rcv_ctrl.h:97, from osm_sm.h:105, from osm_opensm.h:96, from main.c:93: osm_port_info_rcv.h:136: error: syntax error before "osm_state_mgr_t" osm_port_info_rcv.h:136: warning: no semicolon at end of struct or union osm_port_info_rcv.h:139: error: syntax error before '}' token osm_port_info_rcv.h:139: warning: type defaults to `int' in declaration of `osm_pi_rcv_t' osm_port_info_rcv.h:139: warning: data definition has no type or storage class osm_port_info_rcv.h:171: error: syntax error before '*' token osm_port_info_rcv.h:202: error: syntax error before '*' token osm_port_info_rcv.h:234: error: syntax error before '*' token osm_port_info_rcv.h:282: warning: type defaults to `int' in declaration of `osm_pi_rcv_t' osm_port_info_rcv.h:282: error: syntax error before '*' token In file included from osm_sm.h:105, from osm_opensm.h:96, from main.c:93: osm_port_info_rcv_ctrl.h:130: error: syntax error before "osm_pi_rcv_t" osm_port_info_rcv_ctrl.h:130: warning: no semicolon at end of struct or union osm_port_info_rcv_ctrl.h:135: error: syntax error before '}' token osm_port_info_rcv_ctrl.h:135: warning: type defaults to `int' in declaration of `osm_pi_rcv_ctrl_t' osm_port_info_rcv_ctrl.h:135: warning: data definition has no type or storage class osm_port_info_rcv_ctrl.h:165: error: syntax error before '*' token osm_port_info_rcv_ctrl.h:198: error: syntax error before '*' token osm_port_info_rcv_ctrl.h:230: error: syntax error before '*' token osm_port_info_rcv_ctrl.h:270: warning: type defaults to `int' in declaration of `osm_pi_rcv_ctrl_t' osm_port_info_rcv_ctrl.h:270: error: syntax error before '*' token In file included from osm_sw_info_rcv_ctrl.h:96, from osm_sm.h:106, from osm_opensm.h:96, from main.c:93: osm_sw_info_rcv.h:135: error: syntax error before "osm_state_mgr_t" osm_sw_info_rcv.h:135: warning: no semicolon at end of struct or union osm_sw_info_rcv.h:138: error: syntax error before '}' token osm_sw_info_rcv.h:138: warning: type defaults to 
`int' in declaration of `osm_si_rcv_t' osm_sw_info_rcv.h:138: warning: data definition has no type or storage class osm_sw_info_rcv.h:170: error: syntax error before '*' token osm_sw_info_rcv.h:202: error: syntax error before '*' token osm_sw_info_rcv.h:234: error: syntax error before '*' token osm_sw_info_rcv.h:282: warning: type defaults to `int' in declaration of `osm_si_rcv_t' osm_sw_info_rcv.h:282: error: syntax error before '*' token osm_sw_info_rcv.h:311: warning: type defaults to `int' in declaration of `osm_si_rcv_t' osm_sw_info_rcv.h:311: error: syntax error before '*' token In file included from osm_sm.h:106, from osm_opensm.h:96, from main.c:93: osm_sw_info_rcv_ctrl.h:131: error: syntax error before "osm_si_rcv_t" osm_sw_info_rcv_ctrl.h:131: warning: no semicolon at end of struct or union osm_sw_info_rcv_ctrl.h:136: error: syntax error before '}' token osm_sw_info_rcv_ctrl.h:136: warning: type defaults to `int' in declaration of `osm_si_rcv_ctrl_t' osm_sw_info_rcv_ctrl.h:136: warning: data definition has no type or storage class osm_sw_info_rcv_ctrl.h:166: error: syntax error before '*' token osm_sw_info_rcv_ctrl.h:199: error: syntax error before '*' token osm_sw_info_rcv_ctrl.h:231: error: syntax error before '*' token osm_sw_info_rcv_ctrl.h:271: warning: type defaults to `int' in declaration of `osm_si_rcv_ctrl_t' osm_sw_info_rcv_ctrl.h:271: error: syntax error before '*' token In file included from osm_sm.h:109, from osm_opensm.h:96, from main.c:93: osm_state_mgr_ctrl.h:129: error: syntax error before "osm_state_mgr_t" osm_state_mgr_ctrl.h:129: warning: no semicolon at end of struct or union osm_state_mgr_ctrl.h:134: error: syntax error before '}' token osm_state_mgr_ctrl.h:134: warning: type defaults to `int' in declaration of `osm_state_mgr_ctrl_t' osm_state_mgr_ctrl.h:134: warning: data definition has no type or storage class osm_state_mgr_ctrl.h:164: error: syntax error before '*' token osm_state_mgr_ctrl.h:198: error: syntax error before '*' 
token osm_state_mgr_ctrl.h:231: error: syntax error before '*' token In file included from osm_sm.h:116, from osm_opensm.h:96, from main.c:93: osm_sweep_fail_ctrl.h:133: error: syntax error before "osm_state_mgr_t" osm_sweep_fail_ctrl.h:133: warning: no semicolon at end of struct or union osm_sweep_fail_ctrl.h:138: error: syntax error before '}' token osm_sweep_fail_ctrl.h:138: warning: type defaults to `int' in declaration of `osm_sweep_fail_ctrl_t' osm_sweep_fail_ctrl.h:138: warning: data definition has no type or storage class osm_sweep_fail_ctrl.h:169: error: syntax error before '*' token osm_sweep_fail_ctrl.h:202: error: syntax error before '*' token osm_sweep_fail_ctrl.h:235: error: syntax error before '*' token In file included from osm_sminfo_rcv_ctrl.h:95, from osm_sm.h:117, from osm_opensm.h:96, from main.c:93: osm_sminfo_rcv.h:137: error: syntax error before "osm_state_mgr_t" osm_sminfo_rcv.h:137: warning: no semicolon at end of struct or union osm_sminfo_rcv.h:141: error: syntax error before '}' token osm_sminfo_rcv.h:141: warning: type defaults to `int' in declaration of `osm_sminfo_rcv_t' osm_sminfo_rcv.h:141: warning: data definition has no type or storage class osm_sminfo_rcv.h:179: error: syntax error before '*' token osm_sminfo_rcv.h:210: error: syntax error before '*' token osm_sminfo_rcv.h:242: error: syntax error before '*' token osm_sminfo_rcv.h:298: warning: type defaults to `int' in declaration of `osm_sminfo_rcv_t' osm_sminfo_rcv.h:298: error: syntax error before '*' token In file included from osm_sm.h:117, from osm_opensm.h:96, from main.c:93: osm_sminfo_rcv_ctrl.h:129: error: syntax error before "osm_sminfo_rcv_t" osm_sminfo_rcv_ctrl.h:129: warning: no semicolon at end of struct or union osm_sminfo_rcv_ctrl.h:134: error: syntax error before '}' token osm_sminfo_rcv_ctrl.h:134: warning: type defaults to `int' in declaration of `osm_sminfo_rcv_ctrl_t' osm_sminfo_rcv_ctrl.h:134: warning: data definition has no type or storage class 
osm_sminfo_rcv_ctrl.h:164: error: syntax error before '*' token osm_sminfo_rcv_ctrl.h:196: error: syntax error before '*' token osm_sminfo_rcv_ctrl.h:229: error: syntax error before '*' token In file included from osm_trap_rcv_ctrl.h:95, from osm_sm.h:118, from osm_opensm.h:96, from main.c:93: osm_trap_rcv.h:137: error: syntax error before "osm_state_mgr_t" osm_trap_rcv.h:137: warning: no semicolon at end of struct or union osm_trap_rcv.h:140: error: syntax error before '}' token osm_trap_rcv.h:140: warning: type defaults to `int' in declaration of `osm_trap_rcv_t' osm_trap_rcv.h:140: warning: data definition has no type or storage class osm_trap_rcv.h:180: error: syntax error before '*' token osm_trap_rcv.h:211: error: syntax error before '*' token osm_trap_rcv.h:243: error: syntax error before '*' token osm_trap_rcv.h:295: error: syntax error before '*' token In file included from osm_sm.h:118, from osm_opensm.h:96, from main.c:93: osm_trap_rcv_ctrl.h:129: error: syntax error before "osm_trap_rcv_t" osm_trap_rcv_ctrl.h:129: warning: no semicolon at end of struct or union osm_trap_rcv_ctrl.h:134: error: syntax error before '}' token osm_trap_rcv_ctrl.h:134: warning: type defaults to `int' in declaration of `osm_trap_rcv_ctrl_t' osm_trap_rcv_ctrl.h:134: warning: data definition has no type or storage class osm_trap_rcv_ctrl.h:164: error: syntax error before '*' token osm_trap_rcv_ctrl.h:196: error: syntax error before '*' token osm_trap_rcv_ctrl.h:229: error: syntax error before '*' token In file included from osm_sm.h:119, from osm_opensm.h:96, from main.c:93: osm_sm_state_mgr.h:141: error: syntax error before "osm_state_mgr_t" osm_sm_state_mgr.h:141: warning: no semicolon at end of struct or union osm_sm_state_mgr.h:146: error: syntax error before '}' token osm_sm_state_mgr.h:146: warning: type defaults to `int' in declaration of `osm_sm_state_mgr_t' osm_sm_state_mgr.h:146: warning: data definition has no type or storage class osm_sm_state_mgr.h:191: error: 
syntax error before '*' token osm_sm_state_mgr.h:223: error: syntax error before '*' token osm_sm_state_mgr.h:256: error: syntax error before '*' token osm_sm_state_mgr.h:302: error: syntax error before '*' token osm_sm_state_mgr.h:334: error: syntax error before '*' token osm_sm_state_mgr.h:361: error: syntax error before '*' token In file included from osm_opensm.h:96, from main.c:93: osm_sm.h:166: error: syntax error before "osm_db_t" osm_sm.h:166: warning: no semicolon at end of struct or union osm_sm.h:177: error: syntax error before "ni_rcv" osm_sm.h:177: warning: type defaults to `int' in declaration of `ni_rcv' osm_sm.h:177: warning: data definition has no type or storage class osm_sm.h:178: error: syntax error before "ni_rcv_ctrl" osm_sm.h:178: warning: type defaults to `int' in declaration of `ni_rcv_ctrl' osm_sm.h:178: warning: data definition has no type or storage class osm_sm.h:179: error: syntax error before "pi_rcv" osm_sm.h:179: warning: type defaults to `int' in declaration of `pi_rcv' osm_sm.h:179: warning: data definition has no type or storage class osm_sm.h:180: error: syntax error before "pi_rcv_ctrl" osm_sm.h:180: warning: type defaults to `int' in declaration of `pi_rcv_ctrl' osm_sm.h:180: warning: data definition has no type or storage class osm_sm.h:184: error: syntax error before "si_rcv" osm_sm.h:184: warning: type defaults to `int' in declaration of `si_rcv' osm_sm.h:184: warning: data definition has no type or storage class osm_sm.h:185: error: syntax error before "si_rcv_ctrl" osm_sm.h:185: warning: type defaults to `int' in declaration of `si_rcv_ctrl' osm_sm.h:185: warning: data definition has no type or storage class osm_sm.h:186: error: syntax error before "state_mgr_ctrl" osm_sm.h:186: warning: type defaults to `int' in declaration of `state_mgr_ctrl' osm_sm.h:186: warning: data definition has no type or storage class osm_sm.h:187: error: syntax error before "lid_mgr" osm_sm.h:187: warning: type defaults to `int' in declaration 
of `lid_mgr' osm_sm.h:187: warning: data definition has no type or storage class osm_sm.h:190: error: syntax error before "state_mgr" osm_sm.h:190: warning: type defaults to `int' in declaration of `state_mgr' osm_sm.h:190: warning: data definition has no type or storage class osm_sm.h:196: error: syntax error before "sweep_fail_ctrl" osm_sm.h:196: warning: type defaults to `int' in declaration of `sweep_fail_ctrl' osm_sm.h:196: warning: data definition has no type or storage class osm_sm.h:197: error: syntax error before "sm_info_rcv" osm_sm.h:197: warning: type defaults to `int' in declaration of `sm_info_rcv' osm_sm.h:197: warning: data definition has no type or storage class osm_sm.h:198: error: syntax error before "sm_info_rcv_ctrl" osm_sm.h:198: warning: type defaults to `int' in declaration of `sm_info_rcv_ctrl' osm_sm.h:198: warning: data definition has no type or storage class osm_sm.h:199: error: syntax error before "trap_rcv" osm_sm.h:199: warning: type defaults to `int' in declaration of `trap_rcv' osm_sm.h:199: warning: data definition has no type or storage class osm_sm.h:200: error: syntax error before "trap_rcv_ctrl" osm_sm.h:200: warning: type defaults to `int' in declaration of `trap_rcv_ctrl' osm_sm.h:200: warning: data definition has no type or storage class osm_sm.h:201: error: syntax error before "sm_state_mgr" osm_sm.h:201: warning: type defaults to `int' in declaration of `sm_state_mgr' osm_sm.h:201: warning: data definition has no type or storage class osm_sm.h:211: error: syntax error before '}' token osm_sm.h:211: warning: type defaults to `int' in declaration of `osm_sm_t' osm_sm.h:211: warning: data definition has no type or storage class osm_sm.h:283: error: syntax error before '*' token osm_sm.h:314: error: syntax error before '*' token osm_sm.h:344: error: syntax error before '*' token osm_sm.h:404: error: syntax error before '*' token osm_sm.h:430: error: syntax error before '*' token osm_sm.h:464: error: syntax error before '*' 
token osm_sm.h:504: error: syntax error before '*' token osm_sm.h:537: error: syntax error before '*' token osm_sm.h: In function `osm_sm_wait_for_subnet_up': osm_sm.h:541: error: `p_sm' undeclared (first use in this function) osm_sm.h:541: error: (Each undeclared identifier is reported only once osm_sm.h:541: error: for each function it appears in.) osm_sm.h:542: error: `wait_us' undeclared (first use in this function) osm_sm.h:542: error: `interruptible' undeclared (first use in this function) In file included from osm_sa_mcmember_record_ctrl.h:97, from osm_sa.h:108, from osm_opensm.h:97, from main.c:93: osm_sa_mcmember_record.h: At top level: osm_sa_mcmember_record.h:138: error: syntax error before "osm_sm_t" osm_sa_mcmember_record.h:138: warning: no semicolon at end of struct or union osm_sa_mcmember_record.h:145: error: syntax error before '}' token osm_sa_mcmember_record.h:145: warning: type defaults to `int' in declaration of `osm_mcmr_recv_t' osm_sa_mcmember_record.h:145: warning: data definition has no type or storage class osm_sa_mcmember_record.h:176: error: syntax error before '*' token osm_sa_mcmember_record.h:207: error: syntax error before '*' token osm_sa_mcmember_record.h:239: error: syntax error before '*' token osm_sa_mcmember_record.h:287: error: syntax error before '*' token osm_sa_mcmember_record.h:322: error: syntax error before '*' token In file included from osm_sa.h:108, from osm_opensm.h:97, from main.c:93: osm_sa_mcmember_record_ctrl.h:131: error: syntax error before "osm_mcmr_recv_t" osm_sa_mcmember_record_ctrl.h:131: warning: no semicolon at end of struct or union osm_sa_mcmember_record_ctrl.h:136: error: syntax error before '}' token osm_sa_mcmember_record_ctrl.h:136: warning: type defaults to `int' in declaration of `osm_mcmr_rcv_ctrl_t' osm_sa_mcmember_record_ctrl.h:136: warning: data definition has no type or storage class osm_sa_mcmember_record_ctrl.h:166: error: syntax error before '*' token osm_sa_mcmember_record_ctrl.h:199: 
error: syntax error before '*' token osm_sa_mcmember_record_ctrl.h:231: error: syntax error before '*' token osm_sa_mcmember_record_ctrl.h:271: warning: type defaults to `int' in declaration of `osm_mcmr_rcv_ctrl_t' osm_sa_mcmember_record_ctrl.h:271: error: syntax error before '*' token In file included from osm_opensm.h:97, from main.c:93: osm_sa.h:191: error: syntax error before "osm_mcmr_recv_t" osm_sa.h:191: warning: no semicolon at end of struct or union osm_sa.h:192: warning: type defaults to `int' in declaration of `mcmr_rcv_ctlr' osm_sa.h:192: warning: data definition has no type or storage class osm_sa.h:216: error: syntax error before '}' token osm_sa.h:216: warning: type defaults to `int' in declaration of `osm_sa_t' osm_sa.h:216: warning: data definition has no type or storage class osm_sa.h:282: error: syntax error before '*' token osm_sa.h:312: error: syntax error before '*' token osm_sa.h:341: error: syntax error before '*' token osm_sa.h:397: warning: type defaults to `int' in declaration of `osm_sa_t' osm_sa.h:397: error: syntax error before '*' token osm_sa.h:426: error: syntax error before '*' token osm_sa.h:460: error: syntax error before '*' token osm_sa.h:494: error: syntax error before '*' token In file included from main.c:93: osm_opensm.h:137: error: syntax error before "osm_sm_t" osm_opensm.h:137: warning: no semicolon at end of struct or union osm_opensm.h:138: warning: type defaults to `int' in declaration of `sa' osm_opensm.h:138: warning: data definition has no type or storage class osm_opensm.h:139: error: syntax error before "db" osm_opensm.h:139: warning: type defaults to `int' in declaration of `db' osm_opensm.h:139: warning: data definition has no type or storage class osm_opensm.h:143: warning: built-in function 'log' declared as non-function osm_opensm.h:148: error: syntax error before '}' token osm_opensm.h:148: warning: type defaults to `int' in declaration of `osm_opensm_t' osm_opensm.h:148: warning: data definition has no 
type or storage class osm_opensm.h:199: error: syntax error before '*' token osm_opensm.h:229: error: syntax error before '*' token osm_opensm.h:259: error: syntax error before '*' token osm_opensm.h:290: error: syntax error before '*' token osm_opensm.h: In function `osm_opensm_sweep': osm_opensm.h:292: error: `p_osm' undeclared (first use in this function) osm_opensm.h: At top level: osm_opensm.h:321: error: syntax error before '*' token osm_opensm.h: In function `osm_opensm_set_log_flags': osm_opensm.h:324: error: `p_osm' undeclared (first use in this function) osm_opensm.h:324: error: `log_flags' undeclared (first use in this function) osm_opensm.h: At top level: osm_opensm.h:353: error: syntax error before '*' token osm_opensm.h:383: error: syntax error before '*' token osm_opensm.h: In function `osm_opensm_wait_for_subnet_up': osm_opensm.h:387: error: `p_osm' undeclared (first use in this function) osm_opensm.h:387: error: `wait_us' undeclared (first use in this function) osm_opensm.h:387: error: `interruptible' undeclared (first use in this function) main.c: At top level: main.c:108: error: syntax error before "osm" main.c:108: warning: type defaults to `int' in declaration of `osm' main.c:108: warning: data definition has no type or storage class main.c:267: error: syntax error before '*' token main.c: In function `get_port_guid': main.c:282: error: `p_osm' undeclared (first use in this function) main.c: At top level: main.c:345: error: syntax error before "osm_opensm_t" main.c: In function `parse_ignore_guids_file': main.c:354: error: `p_osm' undeclared (first use in this function) main.c:356: error: `guids_file_name' undeclared (first use in this function) main.c: In function `main': main.c:725: error: request for member `p_updn_ucast_routing' in something not a structure or union main.c:725: error: request for member `subn' in something not a structure or union main.c:758: error: request for member `mad_pool' in something not a structure or union 
main.c:761: error: request for member `mad_pool' in something not a structure or union
make[1]: *** [main.o] Error 1
make[1]: Leaving directory `/cluster/src/OPENIB_SRC/gen1_wtf/gen1/trunk/src/userspace/osm/opensm'
make: *** [all_targets] Error 2

From davem at davemloft.net Thu Aug 11 12:49:16 2005
From: davem at davemloft.net (David S. Miller)
Date: Thu, 11 Aug 2005 12:49:16 -0700 (PDT)
Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface
In-Reply-To: <1123786117.4403.5835.camel@hal.voltaire.com>
References: <1123786117.4403.5835.camel@hal.voltaire.com>
Message-ID: <20050811.124916.77057824.davem@davemloft.net>

From: Hal Rosenstock
Date: 11 Aug 2005 14:48:37 -0400

> The patch below is to accommodate the IPoIB link layer address in the
> sockaddr_ll struct so that user space can send and receive IPoIB link
> layer packets. Unfortunately, IPoIB has 20 byte LL addresses rather
> than the 8 byte MAC addresses (or under) used by other LLs.

Two problems. 1) it's a really ugly IPoIB specific hack to extend the sockaddr_ll structure, it won't work for anything else without adding more special tests to that af_packet.c code and 2) you improperly rooted your patch, so we get stuff like this:

> --- af_packet.c.orig 2005-06-29 19:00:53.000000000 -0400
> +++ af_packet.c 2005-08-05 13:28:49.000000000 -0400

Please find another way to extend the structure.

That's why I didn't respond, this patch was too ugly for words so it went to the bottom of my prioritized list of things to do.

From sean.hefty at intel.com Thu Aug 11 12:50:52 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 11 Aug 2005 12:50:52 -0700
Subject: [openib-general] [RFC] [uCM] proposed API changes
In-Reply-To: <002b01c59eab$4852cf70$9c5aa8c0@infiniconsys.com>
Message-ID:

>Serialize events for user-mode cm_ids, and allow the user client to set the
>context from their REQ handler.
>The latter is probably pretty easy to do, but
>in and of itself doesn't solve the problem with the out-of-order events and
>races between setting the context and receiving an event.

Callbacks aren't used in usermode to report events. A user calls get_event,
then put_event when they're done. Get and put can come in separate threads.
I'm not sure about blocking a user's thread until put is called.

Setting the context is actually a little more complex than the client setting
a field in a data structure. To avoid searching, the userspace cm_id needs to
be created, then stored with the kernel cm_id, so that it can be returned with
future events.

>A search on a REJ isn't a big deal - it should be a rare case as it will only
>occur if the remote side times out or aborts. A client could ignore the REJ
>because sending the REP will fail if a REJ was received.

I agree that a search here isn't a big deal. But if the REJ reaches userspace
first, then it won't find a cm_id. If it creates one (in order to report the
REJ to the user), then all REQs now need to search. And I'm hesitant to ignore
the REJ, but that may be the best option if a cm_id isn't found.

>If you allow the user to target a get_event call to a specific cm_id this
>problem goes away. If the user issues multiple requests against the same
>cm_id,
>they need to be ready to deal with out-of-order event reporting. This also
>solves the context issue, since the REJ won't be reported until the user
>requests an event from that specific cm_id.

This is an option, but a significant change from the existing implementation.
- Sean

From halr at voltaire.com Thu Aug 11 12:51:30 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Aug 2005 15:51:30 -0400
Subject: [openib-general] osm build errors gen1
In-Reply-To: <42FBAAAB.3010709@psc.edu>
References: <42FBAAAB.3010709@psc.edu>
Message-ID: <1123789890.4403.5975.camel@hal.voltaire.com>

Hi Joe,

On Thu, 2005-08-11 at 15:44, Joe Damato wrote:
> Hello -
>
> Sorry for the long email (lots of build errors).
>
> I am attempting to build osm as part of gen1 to allow me to test
> some openib gen1 code and compare to an openib gen2 port that I have
> been working on.

I'm trying to understand what you are doing:

Are you trying to take OpenIB OpenSM and build it with gen1 rather than
gen2 (OpenIB) ? Or are you trying to build a recent gen1 based OpenSM
(like Mellanox Gold 1.8.0) and use this on gen2 (OpenIB) node ?

It looks like the OpenSM you are building includes files (osm_db.h,
osm_db_pack.h) which are not part of the OpenIB OpenSM so it is likely
the latter.

Note that a merge of OpenSM 1.8.0 functionality is underway and will be
coming to OpenIB.

-- Hal

From damato at psc.edu Thu Aug 11 13:07:16 2005
From: damato at psc.edu (Joe Damato)
Date: Thu, 11 Aug 2005 16:07:16 -0400
Subject: [openib-general] osm build errors gen1
In-Reply-To: <1123789890.4403.5975.camel@hal.voltaire.com>
References: <42FBAAAB.3010709@psc.edu> <1123789890.4403.5975.camel@hal.voltaire.com>
Message-ID: <42FBAFF4.6040005@psc.edu>

Hal Rosenstock wrote:

>Hi Joe,
>
>On Thu, 2005-08-11 at 15:44, Joe Damato wrote:
>
>
>>Hello -
>>
>> Sorry for the long email (lots of build errors).
>>
>> I am attempting to build osm as part of gen1 to allow me to test
>>some openib gen1 code and compare to an openib gen2 port that I have
>>been working on.
>>
>>
>
>I'm trying to understand what you are doing:
>
>Are you trying to take OpenIB OpenSM and build it with gen1 rather than
>gen2 (OpenIB) ?
>Or are you trying to build a recent gen1 based OpenSM
>(like Mellanox Gold 1.8.0) and use this on gen2 (OpenIB) node ?
>
>
>
I'm trying to build a gen1 based OpenSM to use on an OpenIB gen1 node --
if the opensm from gen2 will work with gen1 nodes, then I can just use
the gen2 opensm....

Thanks.

From damato at psc.edu Thu Aug 11 13:13:30 2005
From: damato at psc.edu (Joe Damato)
Date: Thu, 11 Aug 2005 16:13:30 -0400
Subject: [openib-general] osm build errors gen1
Message-ID: <42FBB16A.8030307@psc.edu>

Joe Damato wrote:

> Hal Rosenstock wrote:
>
>> Hi Joe,
>>
>> On Thu, 2005-08-11 at 15:44, Joe Damato wrote:
>>
>>
>>> Hello -
>>> Sorry for the long email (lots of build errors).
>>>
>>> I am attempting to build osm as part of gen1 to allow me to test
>>> some openib gen1 code and compare to an openib gen2 port that I have
>>> been working on.
>>>
>>
>>
>>
>> I'm trying to understand what you are doing:
>>
>> Are you trying to take OpenIB OpenSM and build it with gen1 rather than
>> gen2 (OpenIB) ? Or are you trying to build a recent gen1 based OpenSM
>> (like Mellanox Gold 1.8.0) and use this on gen2 (OpenIB) node ?
>>
>>
>>
> I'm trying to build a gen1 based OpenSM to use on an OpenIB gen1 node
> -- if the opensm from gen2 will work with gen1 nodes, then I can just
> use the gen2 opensm....

After re-reading what I said maybe I was not completely clear.

I have source code for a project that uses the openib gen1 libraries.
I have ported this source code to use openib gen2.

I would like to test the -original- source code written for gen1, to
do this I need to build opensm from gen1 (I have already installed a
kernel on a few nodes that uses IB gen1) -- the first email was my
attempt at building opensm from gen1 on a gen1 node.

Hopefully this makes it more clear as to what I'm trying to do --
sorry for the confusion.
Thanks,
Joe Damato

From halr at voltaire.com Thu Aug 11 13:08:53 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Aug 2005 16:08:53 -0400
Subject: [openib-general] osm build errors gen1
In-Reply-To: <42FBAFF4.6040005@psc.edu>
References: <42FBAAAB.3010709@psc.edu> <1123789890.4403.5975.camel@hal.voltaire.com> <42FBAFF4.6040005@psc.edu>
Message-ID: <1123790932.4403.6014.camel@hal.voltaire.com>

Hi Joe,

On Thu, 2005-08-11 at 16:07, Joe Damato wrote:
> I'm trying to build a gen1 based OpenSM to use on an OpenIB gen1 node --
> if the opensm from gen2 will work with gen1 nodes, then I can just use
> the gen2 opensm....

I have those files as part of OpenSM 1.8.0:
./opensm/osm_db_files.c
./opensm/osm_db.h
./opensm/osm_db_pack.c
./opensm/osm_db_pack.h

As to whether gen2's OpenSM would work for gen1, I don't know not having
tried it. It depends on what your subnet relies on. What are you doing,
etc. ?

BTW, gen1 related questions should go to your vendor for support. It is
not OpenIB.

-- Hal

From halr at voltaire.com Thu Aug 11 13:15:10 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Aug 2005 16:15:10 -0400
Subject: [openib-general] osm build errors gen1
In-Reply-To: <42FBB16A.8030307@psc.edu>
References: <42FBB16A.8030307@psc.edu>
Message-ID: <1123791310.4403.6036.camel@hal.voltaire.com>

On Thu, 2005-08-11 at 16:13, Joe Damato wrote:
> After re-reading what I said maybe I was not completely clear.
>
> I have source code for a project that uses the openib gen1 libraries.

Which libraries ?

> I have ported this source code to use openib gen2.
>
> I would like to test the -original- source code written for gen1, to
> do this I need to build opensm from gen1 (I have already installed a
> kernel on a few nodes that uses IB gen1) -- the first email was my
> attempt at building opensm from gen1 on a gen1 node.

So you want to run gen1 OpenSM over the gen2 (OpenIB) vendor layer ? I
don't know if that has been tried.
Anyhow, your build issue (missing OpenSM files) doesn't appear to me to be
something not correct about your gen1 OpenSM.

> Hopefully this makes it more clear as to what I'm trying to do --
> sorry for the confusion.

I think I still may be confused.

-- Hal

From halr at voltaire.com Thu Aug 11 14:38:57 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Aug 2005 17:38:57 -0400
Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface
In-Reply-To: <20050811.124916.77057824.davem@davemloft.net>
References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net>
Message-ID: <1123796337.4403.6371.camel@hal.voltaire.com>

On Thu, 2005-08-11 at 15:49, David S. Miller wrote:
> From: Hal Rosenstock
> Date: 11 Aug 2005 14:48:37 -0400
>
> > The patch below is to accommodate IPoIB link layer address in the
> > sockaddr_ll struct so that user space can send and receive IPoIB link
> > layer packets. Unfortunately, IPoIB has 20 bytes LL addresses rather
> > than the 8 byte MAC addresses (or under) used by other LLs.
>
> Two problems. 1) it's a really ugly IPoIB specific hack to extend the
> sockaddr_ll structure, it won't work for anything else without adding
> more special tests to that af_packet.c code

> Please find another way to extend the structure.

Can anyone think of another approach to do this and keep backward
compatibility ?

-- Hal

From yuw at cse.ohio-state.edu Thu Aug 11 15:07:18 2005
From: yuw at cse.ohio-state.edu (Weikuan Yu)
Date: Thu, 11 Aug 2005 18:07:18 -0400
Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32
In-Reply-To: <200508111932.j7BJWROj026274@xi.cse.ohio-state.edu>
References: <200508111932.j7BJWROj026274@xi.cse.ohio-state.edu>
Message-ID: <482FEE13-0AB4-11DA-80CD-000D932C3754@cse.ohio-state.edu>

Hi,

Thanks for your suggestions and help.

At the end of this email, I have included the output from our system
when enabling CONFIG_INFINIBAND_MTHCA_DEBUG=y.
Note that there are four additional lines of warning messages during the
initialization of the device. These are generated from the init_port()
function, due to the incorrect return status of a command to the firmware,
INIT_IB.

We suspected that some of the INIT_IB flags or other parameters could have
gone wrong, or that there were mismatches between our firmware and the gen2
code. So I went ahead and hacked on some of the INIT_IB parameters. In the
end, it turned out that this patch could solve the problem on our system.

[yuw at p3 hw]$ svn diff mthca/
Index: mthca/mthca_qp.c
===================================================================
--- mthca/mthca_qp.c (revision 2986)
+++ mthca/mthca_qp.c (working copy)
@@ -575,7 +575,7 @@

 memset(&param, 0, sizeof param);

- param.enable_1x = 1;
+ param.enable_1x = 0;
 param.enable_4x = 1;
 param.vl_cap = dev->limits.vl_cap;
 param.mtu_cap = dev->limits.mtu_cap;

So this suggests that the current code is trying to enable the device to
do both 1x and 4x communication, which is not compatible with the firmware
parameters we chose.

Anyhow, this solves our problem. We are now running the gen2 code fine as
tested with provided test programs, e.g., ibv_rc_pingpong. We will be happy
to provide additional information if needed. BTW, we are using firmware
3.3.2 for tavor cards.

As always, your suggestions and help are greatly appreciated.
--Weikuan

+++++++++ dmesg output ++++++++++++++
ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
ib_mthca: Initializing (0000:02:00.0)
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 26 (level, low) -> IRQ 185
ib_mthca 0000:02:00.0: Found bridge: (0000:01:02.0)
ib_mthca 0000:02:00.0: FW version 000300030002, max commands 64
ib_mthca 0000:02:00.0: FW size 6143 KB (start bfa00000, end bfffffff)
ib_mthca 0000:02:00.0: HCA memory size 131071 KB (start b8000000, end bfffffff)
ib_mthca 0000:02:00.0: Max QPs: 16777216, reserved QPs: 1024, entry size: 256
ib_mthca 0000:02:00.0: Max SRQs: 1024, reserved SRQs: 16, entry size: 32
ib_mthca 0000:02:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 64
ib_mthca 0000:02:00.0: Max EQs: 64, reserved EQs: 1, entry size: 64
ib_mthca 0000:02:00.0: reserved MPTs: 16, reserved MTTs: 16
ib_mthca 0000:02:00.0: Max PDs: 16777216, reserved PDs: 0, reserved UARs: 1
ib_mthca 0000:02:00.0: Max QP/MCG: 16777216, reserved MGMs: 0
ib_mthca 0000:02:00.0: Flags: 00370347
ib_mthca 0000:02:00.0: profile[ 0]--10/20 @ 0x b8000000 (size 0x 4000000)
ib_mthca 0000:02:00.0: profile[ 1]-- 0/16 @ 0x bc000000 (size 0x 1000000)
ib_mthca 0000:02:00.0: profile[ 2]-- 7/18 @ 0x bd000000 (size 0x 800000)
ib_mthca 0000:02:00.0: profile[ 3]-- 9/17 @ 0x bd800000 (size 0x 800000)
ib_mthca 0000:02:00.0: profile[ 4]-- 3/16 @ 0x be000000 (size 0x 400000)
ib_mthca 0000:02:00.0: profile[ 5]-- 4/16 @ 0x be400000 (size 0x 200000)
ib_mthca 0000:02:00.0: profile[ 6]--12/15 @ 0x be600000 (size 0x 100000)
ib_mthca 0000:02:00.0: profile[ 7]-- 8/13 @ 0x be700000 (size 0x 80000)
ib_mthca 0000:02:00.0: profile[ 8]--11/11 @ 0x be780000 (size 0x 10000)
ib_mthca 0000:02:00.0: profile[ 9]-- 2/10 @ 0x be790000 (size 0x 8000)
ib_mthca 0000:02:00.0: profile[10]-- 6/ 5 @ 0x be798000 (size 0x 800)
ib_mthca 0000:02:00.0: HCA memory: allocated 106082 KB/124928 KB (18846 KB free)
ib_mthca 0000:02:00.0: Allocated EQ 1 with 65536 entries
ib_mthca 0000:02:00.0: Allocated EQ 2 with 128
entries
ib_mthca 0000:02:00.0: Allocated EQ 3 with 128 entries
ib_mthca 0000:02:00.0: Setting mask 00000000000f43fe for eqn 2
ib_mthca 0000:02:00.0: Setting mask 0000000000000400 for eqn 3
ib_mthca 0000:02:00.0: NOP command IRQ test passed
ib_mthca 0000:02:00.0: Command 09 completed with status 03
ib_mthca 0000:02:00.0: INIT_IB returned status 03.
ib_mthca 0000:02:00.0: Command 09 completed with status 03
ib_mthca 0000:02:00.0: INIT_IB returned status 03.

On Aug 11, 2005, at 3:32 PM, Dhabaleswar Panda wrote:

> Hal, Roland and James,
>
> Many thanks for your prompt replies!!
>
> We tried with the debug option. Thanks for this suggestion.
>
> It looks like one of the parameters (the 1X/4X parameter) for the card is
> not being set properly on the IA-32 system which is leading to the
> `disable' state for the card. By manually changing this parameter to
> 4X, one of the nodes is able to detect the card. We are trying this on
> other nodes. Not sure whether this is coming out because of the driver
> or the firmware in the card. We are looking into this further. One of
> my students will soon post all the details.
>
> Thanks again for all your help!!
>
> DK
>
>> Dhabaleswar> Opteron systems and carry out experiments. There is
>> Dhabaleswar> no problem. The problem is coming only for IA-32
>> Dhabaleswar> systems. Even on EM64T systems, this problem comes
>> Dhabaleswar> when operating it in IA-32 mode.
>>
>> Out of curiosity, do PCIe cards work with 32-bit kernels?
>>
>> As Hal said, please post the kernel log you get when loading drivers
>> built with CONFIG_INFINIBAND_MTHCA_DEBUG=y.
>>
>> Thanks,
>> Roland
>>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From halr at voltaire.com Thu Aug 11 15:18:26 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Aug 2005 18:18:26 -0400
Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32
In-Reply-To: <482FEE13-0AB4-11DA-80CD-000D932C3754@cse.ohio-state.edu>
References: <200508111932.j7BJWROj026274@xi.cse.ohio-state.edu> <482FEE13-0AB4-11DA-80CD-000D932C3754@cse.ohio-state.edu>
Message-ID: <1123798705.4403.6456.camel@hal.voltaire.com>

On Thu, 2005-08-11 at 18:07, Weikuan Yu wrote:
> - param.enable_1x = 1;
> + param.enable_1x = 0;
> param.enable_4x = 1;

That likely locks the port at 4x rather than have it autonegotiate based
on what is at the other end of the link. This seems like a workaround to
me. The question is what causes this to fail (perhaps in your
configuration) with 3.3.2. I presume this occurs whether the port is
plugged into anything at the other end or not.

-- Hal

From halr at voltaire.com Thu Aug 11 15:21:52 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Aug 2005 18:21:52 -0400
Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32
In-Reply-To: <482FEE13-0AB4-11DA-80CD-000D932C3754@cse.ohio-state.edu>
References: <200508111932.j7BJWROj026274@xi.cse.ohio-state.edu> <482FEE13-0AB4-11DA-80CD-000D932C3754@cse.ohio-state.edu>
Message-ID: <1123798912.4403.6462.camel@hal.voltaire.com>

On Thu, 2005-08-11 at 18:07, Weikuan Yu wrote:
> BTW, we are using firmware 3.3.2 for tavor cards.

One more thing: I've been using 3.3.2 for a long time with Tavor on x86
and haven't seen this.

CA 'mthca0'
CA type: MT23108
Number of ports: 2
Firmware version: 3.3.2
Hardware version: a1

Wonder what's different.
-- Hal

From rolandd at cisco.com Thu Aug 11 15:30:46 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 11 Aug 2005 15:30:46 -0700
Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32
In-Reply-To: <482FEE13-0AB4-11DA-80CD-000D932C3754@cse.ohio-state.edu> (Weikuan Yu's message of "Thu, 11 Aug 2005 18:07:18 -0400")
References: <200508111932.j7BJWROj026274@xi.cse.ohio-state.edu> <482FEE13-0AB4-11DA-80CD-000D932C3754@cse.ohio-state.edu>
Message-ID: <52k6isnm21.fsf@cisco.com>

    Weikuan> At the end of this email, I have included the output from
    Weikuan> our system when enabling
    Weikuan> CONFIG_INFINIBAND_MTHCA_DEBUG=y. Note that there are
    Weikuan> additional four lines of warning message during the
    Weikuan> initiation of the device. These are generated from
    Weikuan> init_port() function, due to the incorrect return status
    Weikuan> of a command to the firmware, INIT_IB.

Did these warning messages about INIT_IB not show up in the kernel
before you enabled CONFIG_INFINIBAND_MTHCA_DEBUG? They are printed
using mthca_warn(), which should be printed no matter what.

In any case I guess you built your firmware image without support for
1X. Is this right?

Do you have any theory as to why the drivers worked in 64-bit mode and
failed in 32-bit mode? I don't see any reason why the parameters
passed to INIT_IB would be any different.

Anyway, can you apply the debugging patch below and send the output
you get during device initialization (with
CONFIG_INFINIBAND_MTHCA_DEBUG enabled, of course)? I'm guessing
you'll see something like:

    ib_mthca 0000:02:00.0: Max port width: 2

If my guess is correct, then we can use that value to get the correct
width to pass back to INIT_IB.
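
The idea of feeding the reported width back into INIT_IB can be sketched roughly as follows. This is only an illustration, not the mthca driver code: it assumes the reported value is a bitmask in which bit 0 means 1X and bit 1 means 4X (so a value of 2 would mean 4X only), and the names sketch_init_ib_widths and sketch_widths_from_cap are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two width flags passed to INIT_IB. */
struct sketch_init_ib_widths {
	int enable_1x;
	int enable_4x;
};

/*
 * Assumption (not taken from the driver): max_port_width is a bitmask
 * where bit 0 = 1X and bit 1 = 4X. Only the widths the device reports
 * as supported are enabled, instead of hard-coding both to 1.
 */
static struct sketch_init_ib_widths sketch_widths_from_cap(uint8_t max_port_width)
{
	struct sketch_init_ib_widths w;

	w.enable_1x = (max_port_width & 0x1) != 0;
	w.enable_4x = (max_port_width & 0x2) != 0;
	return w;
}
```

Under that assumed encoding, a capability of 2 yields enable_1x = 0 and enable_4x = 1, which is the same setting as the manual change quoted above, while a capability of 3 leaves both widths enabled so the link can still autonegotiate.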
Thanks,
  Roland

Index: infiniband/hw/mthca/mthca_cmd.c
===================================================================
--- infiniband/hw/mthca/mthca_cmd.c (revision 3056)
+++ infiniband/hw/mthca/mthca_cmd.c (working copy)
@@ -1031,6 +1031,7 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev
 MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET);
 dev_lim->uar_scratch_entry_sz = size;

+ mthca_dbg(dev, "Max port width: %x\n", dev_lim->max_port_width);
 mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n",
 dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz);
 mthca_dbg(dev, "Max SRQs: %d, reserved SRQs: %d, entry size: %d\n",

From guyg at voltaire.com Thu Aug 11 15:56:44 2005
From: guyg at voltaire.com (Guy German)
Date: Fri, 12 Aug 2005 01:56:44 +0300
Subject: [openib-general][PATCH][kdapl]: FMR implementation for kdapl
Message-ID:

-----Original Message-----
From: Guy German
Sent: Thu 8/11/2005 7:20 PM
To: 'James Lentini'
Subject: RE: [openib-general][PATCH][kdapl]: FMR implementation for kdapl

Hi

> Why did you use the FMR pool API (ib_fmr_pool.h) instead of the verbs
> (ib_alloc_fmr, ib_map_phys_fmr, ...)?

I've implemented FMR over VAPI, but my gen2 implementation understanding
of FMR is still limited. I took the SDP as an example of use. There might
be a better way of doing it. (This way, however, seems to be working, even
though not fully tested yet)

>
> Why did you make FMR support configurable via a module parameter? My
> concern is that this forces users to know if their kdapl software
> needs FMR support.

Well, as I mentioned - there are defaults. So the consumer can be unaware
of FMR implementation, but I think we need to give this flexibility.
However, if mod_params are a major problem it's not a *must*

>
> More questions/comments:
>
> On Thu, 11 Aug 2005, Guy German wrote:
>
>> James,
>>
>> This is a Tavor FMR implementation for kDAPL, that works over
>> gen2's fmr_pool.c implementation.
>>
>> A few notes:
>> 1.
I've added mod params to kdapl_ib to control the fmr size and pool >> length. There are still reasonable defaults, but I think we should >> allow consumers to control this. >> 2. the fmr pool allocation is done in the pz_create (if active_fmr >> =1). The problem is that at the time of creating the pool we don't >> know what the consumer's privileges request is going to be (passed >> afterwords in dapl_lmr_kcreate). I think this can be solved by >> creating 3 pools and taking from the right pool at lmr_kcreate time. >> Do you have any "cheaper" solution to this ? > > My naive suggestion is to use ib_alloc_fmr(), etc. I will have to learn this API in order to understand how they solve the problem... Isn't it necessary to create a pool beforehand, in this way ? > >> 3. Can we get rid of DAT_MEM_TYPE_LMR code ? I don't understand >> whats it for. > > It is supposed to implement that memory type. If it is broken, we > should fix it. I just don't understand what this memory type do ... > >> Signed-off-by: Guy German >> >> Index: dapl_openib_util.c >> =================================================================== >> --- dapl_openib_util.c (revision 3056) >> +++ dapl_openib_util.c (working copy) >> @@ -196,6 +196,53 @@ int dapl_ib_mr_register_physical(struct >> return 0; } >> >> +int dapl_ib_mr_register_fmr(struct dapl_ia *ia, struct dapl_lmr >> *lmr, + void *phys_addr, u64 length, >> + enum dat_mem_priv_flags privileges) > > Why not change this function signature to match what you want to > receive (ie. change length to page_count)? OK. This is also true to dapl_ib_mr_register_physical > >> +{ >> + /* FIXME: this phase-1 implementation of fmr doesn't take >> "privileges" + into account. This is a security breech. 
*/ >> + u64 io_addr; >> + u64 *page_list; >> + int page_count; >> + struct ib_pool_fmr *mem; >> + int status; >> + >> + page_list = (u64 *)phys_addr; >> + page_count = (int)length; >> + io_addr = page_list[0]; >> + >> + mem = ib_fmr_pool_map_phys (((struct dapl_pz >> *)lmr->param.pz)->fmr_pool, + page_list, + page_count, >> + &io_addr); >> + if (IS_ERR(mem)) { >> + status = (int)PTR_ERR(mem); >> + if (status != -EAGAIN) >> + dapl_dbg_log(DAPL_DBG_TYPE_ERR, >> + "fmr_pool_map_phys ret=%d > <%d pages>\n", >> + status, page_count); >> + >> + lmr->param.registered_address = 0; >> + lmr->fmr = 0; >> + return status; >> + } >> + >> + lmr->param.lmr_context = mem->fmr->lkey; >> + lmr->param.rmr_context = mem->fmr->rkey; >> + lmr->param.registered_size = length * PAGE_SIZE; >> + lmr->param.registered_address = io_addr; >> + lmr->fmr = mem; >> + >> + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, >> + "%s: (addr=%p size=0x%x) lkey=0x%x rkey=0x%x\n", __func__, >> + lmr->param.registered_address, >> + lmr->param.registered_size, >> + lmr->param.lmr_context, >> + lmr->param.rmr_context); >> + return 0; >> +} >> + >> int dapl_ib_mr_register_shared(struct dapl_ia *ia, struct dapl_lmr >> *lmr, enum dat_mem_priv_flags privileges) >> { >> @@ -222,7 +269,10 @@ int dapl_ib_mr_deregister(struct dapl_lm { >> int status; >> >> - status = ib_dereg_mr(lmr->mr); >> + if (DAT_MEM_TYPE_PLATFORM == lmr->param.mem_type && lmr->fmr) >> + status = ib_fmr_pool_unmap(lmr->fmr); >> + else >> + status = ib_dereg_mr(lmr->mr); >> if (status < 0) { >> dapl_dbg_log(DAPL_DBG_TYPE_ERR, >> " ib_dereg_mr error code return = %d\n", >> Index: dapl_openib_util.h >> =================================================================== >> --- dapl_openib_util.h (revision 3056) >> +++ dapl_openib_util.h (working copy) >> @@ -87,6 +87,10 @@ int dapl_ib_mr_register_physical(struct >> void *phys_addr, u64 length, >> enum dat_mem_priv_flags privileges); >> >> +int dapl_ib_mr_register_fmr(struct dapl_ia *ia, struct dapl_lmr >> *lmr, 
+ void *phys_addr, u64 length, >> + enum dat_mem_priv_flags privileges); >> + >> int dapl_ib_mr_deregister(struct dapl_lmr *lmr); >> >> int dapl_ib_mr_register_shared(struct dapl_ia *ia, struct dapl_lmr >> *lmr, Index: dapl_lmr.c >> =================================================================== >> --- dapl_lmr.c (revision 3056) +++ dapl_lmr.c (working copy) >> @@ -137,7 +137,7 @@ static inline int dapl_lmr_create_physic >> u64 *registered_address) >> { >> struct dapl_lmr *new_lmr; >> - int status; >> + int status = 0; >> >> new_lmr = dapl_lmr_alloc(ia, mem_type, phys_addr, >> page_count, (struct dat_pz *) > pz, privileges); >> @@ -151,13 +151,22 @@ static inline int dapl_lmr_create_physic >> status = dapl_ib_mr_register_ia(ia, new_lmr, phys_addr, >> page_count, privileges); >> } >> - else { >> + else if (DAT_MEM_TYPE_PHYSICAL == mem_type) { >> status = dapl_ib_mr_register_physical(ia, new_lmr, >> > phys_addr.for_array, >> > page_count, privileges); >> } >> + else if (DAT_MEM_TYPE_PLATFORM == mem_type) { >> + status = dapl_ib_mr_register_fmr(ia, new_lmr, >> + phys_addr.for_array, >> + page_count, > privileges); >> + } >> + else { >> + status = -EINVAL; >> + goto error1; >> + } >> >> - if (0 != status) >> + if (status) >> goto error2; >> >> atomic_inc(&pz->pz_ref_count); >> @@ -243,7 +252,7 @@ int dapl_lmr_kcreate(struct dat_ia *ia, int >> status; >> >> dapl_dbg_log(DAPL_DBG_TYPE_API, >> - "dapl_lmr_kcreate(ia:%p, mem_type:%x, ...)\n", >> + "dapl_lmr_kcreate(ia:%p, mem_type:%x)\n", > > I like the ... in the printout :) Sorry :) > >> ia, mem_type); >> >> dapl_ia = (struct dapl_ia *)ia; >> @@ -258,6 +267,11 @@ int dapl_lmr_kcreate(struct dat_ia *ia, >> rmr_context, > registered_length, >> registered_address); >> break; >> + case DAT_MEM_TYPE_PLATFORM: /* used as a proprietary Tavor-FMR */ > > My understanding is that FMRs are not specific to Tavor. I thought > Arbel and Sinai also had FMR support. Is that correct? If so, we > should change this comment. 
Arbel FMR is supposed to support IBTA 1.2 spec FMR, which is totally
different and it is done by post_send (The Arbel fw doesn't support it yet
though, I think). The current FMR is a proprietary mellanox implementation,
I think it is Tavor specific, but I might be wrong. I don't know about
Sinai's FMR...

>
>> + if (!active_fmr) {
>> + status = -EINVAL;
>> + break;
>> + }
>> case DAT_MEM_TYPE_PHYSICAL:
>> case DAT_MEM_TYPE_IA:
>> status = dapl_lmr_create_physical(dapl_ia, region_description,
>> @@ -307,6 +321,7 @@ int dapl_lmr_free(struct dat_lmr *lmr)
>>
>> switch (dapl_lmr->param.mem_type) {
>> case DAT_MEM_TYPE_PHYSICAL:
>> + case DAT_MEM_TYPE_PLATFORM:
>> case DAT_MEM_TYPE_IA:
>> case DAT_MEM_TYPE_LMR:
>> {
>> Index: dapl_pz.c
>> ===================================================================
>> --- dapl_pz.c (revision 3056)
>> +++ dapl_pz.c (working copy)
>> @@ -89,7 +89,17 @@ int dapl_pz_create(struct dat_ia *ia, st
>> status); goto error2;
>> }
>> -
>> +
>> + if (active_fmr) {
>> + struct ib_fmr_pool_param params;
>> + set_fmr_params (&params);
>> + dapl_pz->fmr_pool =
> ib_create_fmr_pool(dapl_pz->pd, &params);
>> + if (IS_ERR(dapl_pz->fmr_pool))
>> + dapl_dbg_log(DAPL_DBG_TYPE_WARN,
>> + "could not create FMR pool <%ld>",
>> + PTR_ERR(dapl_pz->fmr_pool));
>> + }
>> +
>> *pz = (struct dat_pz *)dapl_pz;
>> return 0;
>>
>> @@ -104,7 +114,7 @@ error1:
>> int dapl_pz_free(struct dat_pz *pz)
>> {
>> struct dapl_pz *dapl_pz;
>> - int status;
>> + int status=0;
>>
>> dapl_dbg_log(DAPL_DBG_TYPE_API, "dapl_pz_free(%p)\n", pz);
>>
>> @@ -114,8 +124,10 @@ int dapl_pz_free(struct dat_pz *pz) status =
>> -EINVAL; goto error;
>> }
>> -
>> - status = ib_dealloc_pd(dapl_pz->pd);
>> + if (active_fmr)
>> + (void)ib_destroy_fmr_pool(dapl_pz->fmr_pool);
>> + else
>> + status = ib_dealloc_pd(dapl_pz->pd);
>
>
> If active_fmr is true, we still allocate a PD in dapl_pz_create.
> Should the above be

I think you are right.
That's the way I did it at the beginning, but I think I got an oops there and changed it from some weird reason - I will have to check it again and understand what the problem was, and what it had to do with the pd... > > + if (active_fmr) > + (void)ib_destroy_fmr_pool(dapl_pz->fmr_pool); > + status = ib_dealloc_pd(dapl_pz->pd); > > >> if (status) { >> dapl_dbg_log(DAPL_DBG_TYPE_ERR, "ib_dealloc_pd failed: %X\n", >> status); Index: dapl_ia.c >> =================================================================== >> --- dapl_ia.c (revision 3056) +++ dapl_ia.c (working copy) >> @@ -745,7 +745,8 @@ int dapl_ia_query(struct dat_ia *ia_ptr, >> provider_attr->provider_version_major = DAPL_PROVIDER_MAJOR; >> provider_attr->provider_version_minor = DAPL_PROVIDER_MINOR; >> provider_attr->lmr_mem_types_supported = >> - DAT_MEM_TYPE_PHYSICAL | DAT_MEM_TYPE_IA; >> + DAT_MEM_TYPE_PHYSICAL | DAT_MEM_TYPE_IA | >> + DAT_MEM_TYPE_PLATFORM; > > Please align the DAT_MEM_TYPE_PLATFORM with the above line. OK > >> provider_attr->iov_ownership_on_return = DAT_IOV_CONSUMER; >> provider_attr->dat_qos_supported = DAT_QOS_BEST_EFFORT; >> provider_attr->completion_flags_supported = >> Index: dapl_provider.c >> =================================================================== >> --- dapl_provider.c (revision 3056) >> +++ dapl_provider.c (working copy) >> @@ -48,8 +48,19 @@ MODULE_AUTHOR("James Lentini"); >> >> #ifdef CONFIG_KDAPL_INFINIBAND_DEBUG >> static DAPL_DBG_MASK g_dapl_dbg_mask = 0; >> +unsigned int active_fmr = 1; >> +static unsigned int pool_size = 2048; >> +static unsigned int max_pages_per_fmr = 64; > > These FMR module parameters should not be inside the > CONFIG_KDAPL_INFINIBAND_DEBUG guard. They should be enabled regardless > of the debug configuration. correct. 
> >> + >> module_param_named(dbg_mask, g_dapl_dbg_mask, int, 0644); >> -MODULE_PARM_DESC(dbg_mask, "Bitmask to enable debug message >> types."); +module_param_named(active_fmr, active_fmr, int, 0644); >> +module_param_named(pool_size, pool_size, int, 0644); >> +module_param_named(max_pages_per_fmr, max_pages_per_fmr, int, 0644); > > Can you use the g_dapl_ prefix for active_fmr, pool_size, and > max_pages_per_fmr? Yes. > >> +MODULE_PARM_DESC(dbg_mask, "Bitmask to enable debug message types. >> "); +MODULE_PARM_DESC(active_fmr, "if active_fmr==1, >> creates > fmr pool in pz_create "); >> +MODULE_PARM_DESC(pool_size, "num of fmr handles in pool >> "); +MODULE_PARM_DESC(max_pages_per_fmr, "max size (in >> pages) > of an fmr handle "); >> + >> #endif /* CONFIG_KDAPL_INFINIBAND_DEBUG */ >> >> static LIST_HEAD(g_dapl_provider_list); >> @@ -152,6 +163,18 @@ void dapl_dbg_log(enum dapl_dbg_type typ >> >> #endif /* KDAPL_INFINIBAND_DEBUG */ >> >> +void set_fmr_params (struct ib_fmr_pool_param *fmr_param_s) +{ >> + fmr_param_s->max_pages_per_fmr = max_pages_per_fmr; >> + fmr_param_s->pool_size = pool_size; >> + fmr_param_s->dirty_watermark = 32; >> + fmr_param_s->cache = 1; >> + fmr_param_s->flush_function = NULL; >> + fmr_param_s->access = (IB_ACCESS_LOCAL_WRITE | >> + IB_ACCESS_REMOTE_WRITE | >> + IB_ACCESS_REMOTE_READ); >> +} >> + > > Lets find a better name for this function and possibly a different > location. How about dapl_fmr_pool_param_init? The location is there to use the mod params vars as static and not extern them > > How about shortening the parameter name from fmr_param_s to either > fmr_params or fmr_param. Either of those would be more standard. OK > >> static struct dapl_provider *dapl_provider_alloc(const char *name, >> struct > ib_device *device, >> u8 port) Thanks, Guy. 
From tom at ammasso.com Thu Aug 11 16:09:54 2005 From: tom at ammasso.com (Tom Tucker) Date: Thu, 11 Aug 2005 19:09:54 -0400 Subject: [openib-general] RE: [PATCH] amso1100: use standard byteorder macros Message-ID: <8E9D028761D8264D910612167E8457E8FA37F0@mail2.ammasso.com> Christoph: This is great stuff, but would be even better if we just globally replaced thinks like "cpu_to_wr64" with "cpu_to_be64" and removed the cc_byteorder.h file altogether? What do you think? > -----Original Message----- > From: Christoph Hellwig [mailto:hch at lst.de] > Sent: Thursday, August 11, 2005 10:03 AM > To: Tom Tucker > Cc: openib-general at openib.org > Subject: [PATCH] amso1100: use standard byteorder macros > > Signed-off-by: Christoph Hellwig > > Index: cc_byteorder.h > =================================================================== > --- cc_byteorder.h (revision 3058) > +++ cc_byteorder.h (working copy) > @@ -1,113 +1,30 @@ > #ifndef _CC_BYTEORDER_H_ > #define _CC_BYTEORDER_H_ > > +#include > #include "cc_types.h" > > -static inline const u64 cc_arch_swap64(u64 x) -{ > - union { > - struct { u32 a,b; } s; > - u64 u; > - } v; > - > - v.u = x; > - > - asm("bswap %0\n\t" > - "bswap %1\n\t" > - "xchgl %0,%1\n" > - : "=r" (v.s.a), "=r" (v.s.b) > - : "0" (v.s.a), "1" (v.s.b)); > - > - return v.u; > -} > - > -static inline const u32 cc_arch_swap32(u32 x) -{ > - asm("bswap %0" : "=r" (x) : "0" (x)); > - return x; > -} > - > -#define cc_swap16(x) \ > -({ \ > - u16 __x = (x); \ > - ((u16)( \ > - (((u16)(__x) & (u16)0x00ffU) << 8) | \ > - (((u16)(__x) & (u16)0xff00U) >> 8) )); \ > -}) > - > -#define cc_swap32(x) \ > -({ \ > - u32 __x = (x); \ > - ((u32)( \ > - (((u32)(__x) & (u32)0x000000ffUL) << 24) | \ > - (((u32)(__x) & (u32)0x0000ff00UL) << 8) | \ > - (((u32)(__x) & (u32)0x00ff0000UL) >> 8) | \ > - (((u32)(__x) & (u32)0xff000000UL) >> 24) )); \ > -}) > - > -#define cc_swap64(x) \ > -({ \ > - u64 __x = (x); \ > - ((u64)( \ > - (u64)(((u64)(__x) & (u64)0x00000000000000ffULL) > << 
56) | \ > - (u64)(((u64)(__x) & (u64)0x000000000000ff00ULL) > << 40) | \ > - (u64)(((u64)(__x) & (u64)0x0000000000ff0000ULL) > << 24) | \ > - (u64)(((u64)(__x) & (u64)0x00000000ff000000ULL) > << 8) | \ > - (u64)(((u64)(__x) & (u64)0x000000ff00000000ULL) > >> 8) | \ > - (u64)(((u64)(__x) & (u64)0x0000ff0000000000ULL) > >> 24) | \ > - (u64)(((u64)(__x) & (u64)0x00ff000000000000ULL) > >> 40) | \ > - (u64)(((u64)(__x) & (u64)0xff00000000000000ULL) > >> 56) )); \ > -}) > - > -/* This section defines what it means to swap a word into the byte > - order of the current CPU. For example, x86-32 and x86-64 are > - little-endian platforms, so swapping a big-endian number to the > - cpu means the bytes need to be rearranged. However, swapping a > - little-endian number to the cpu means that nothing should be done. > -*/ > - > -#define X86_32 > -#if defined(X86_32) || defined (X86_64) > - > -#define cc_be64_to_cpu(x) (__builtin_constant_p((u64)(x)) ? > cc_swap64(x) : cc_arch_swap64(x)) -#define cc_be32_to_cpu(x) > (__builtin_constant_p((u32)(x)) ? cc_swap32(x) : > cc_arch_swap32(x)) -#define cc_be16_to_cpu(x) cc_swap16(x) > -#define cc_cpu_to_be64(x) cc_be64_to_cpu(x) -#define > cc_cpu_to_be32(x) cc_be32_to_cpu(x) -#define > cc_cpu_to_be16(x) cc_be16_to_cpu(x) > - > -#define cc_le64_to_cpu(x) ((u64)(x)) > -#define cc_le32_to_cpu(x) ((u32)(x)) > -#define cc_le16_to_cpu(x) ((u16)(x)) > -#define cc_cpu_to_le64(x) ((u64)(x)) > -#define cc_cpu_to_le32(x) ((u32)(x)) > -#define cc_cpu_to_le16(x) ((u16)(x)) > - > -#else > -#error Byte swapping functions not defined for this platform -#endif > - > /* Here we define the adapter-to-cpu and cpu-to-adapter byte > order functions > based on whether the adapter is big-endian or little-endian. 
> */ > > #if defined(WR_BYTE_ORDER_BIG_ENDIAN) > > -#define wr64_to_cpu cc_be64_to_cpu > -#define wr32_to_cpu cc_be32_to_cpu > -#define wr16_to_cpu cc_be16_to_cpu > -#define cpu_to_wr64 cc_cpu_to_be64 > -#define cpu_to_wr32 cc_cpu_to_be32 > -#define cpu_to_wr16 cc_cpu_to_be16 > +#define wr64_to_cpu be64_to_cpu > +#define wr32_to_cpu be32_to_cpu > +#define wr16_to_cpu be16_to_cpu > +#define cpu_to_wr64 cpu_to_be64 > +#define cpu_to_wr32 cpu_to_be32 > +#define cpu_to_wr16 cpu_to_be16 > > #elif defined (WR_BYTE_ORDER_LITTLE_ENDIAN) > > -#define wr64_to_cpu cc_le64_to_cpu > -#define wr32_to_cpu cc_le32_to_cpu > -#define wr16_to_cpu cc_le16_to_cpu > -#define cpu_to_wr64 cc_cpu_to_le64 > -#define cpu_to_wr32 cc_cpu_to_le32 > -#define cpu_to_wr16 cc_cpu_to_le16 > +#define wr64_to_cpu le64_to_cpu > +#define wr32_to_cpu le32_to_cpu > +#define wr16_to_cpu le16_to_cpu > +#define cpu_to_wr64 cpu_to_le64 > +#define cpu_to_wr32 cpu_to_le32 > +#define cpu_to_wr16 cpu_to_le16 > > #else > #error Work request (WR) byte order is not defined. > From rolandd at cisco.com Thu Aug 11 16:13:24 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 11 Aug 2005 16:13:24 -0700 Subject: [openib-general] RE: [PATCH] amso1100: use standard byteorder macros In-Reply-To: <8E9D028761D8264D910612167E8457E8FA37F0@mail2.ammasso.com> (Tom Tucker's message of "Thu, 11 Aug 2005 19:09:54 -0400") References: <8E9D028761D8264D910612167E8457E8FA37F0@mail2.ammasso.com> Message-ID: <52fytgnk2z.fsf@cisco.com> Tom> Christoph: This is great stuff, but would be even better if Tom> we just globally replaced thinks like "cpu_to_wr64" with Tom> "cpu_to_be64" and removed the cc_byteorder.h file altogether? Yes, if your byte order is definitely frozen to big-endian, then you should definitely killall the cpu_to_wrXX wrappers. - R. 
From guyg at voltaire.com Thu Aug 11 16:29:57 2005 From: guyg at voltaire.com (Guy German) Date: Fri, 12 Aug 2005 02:29:57 +0300 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation Message-ID: Hi, Sorry for the bad indentation. Please see my answers starting with gg: -----Original Message----- From: James Lentini [mailto:jlentini at netapp.com] Sent: Thu 8/11/2005 7:59 PM To: Guy German Cc: openib-general at openib.org Subject: Re: [openib-general][PATCH][kdapl]: evd upcall policy implementation Hi Guy, Comments/questions below: On Thu, 11 Aug 2005, Guy German wrote: > Sorry for resending it - the former mail did not have a subject > > This patch allows the dapl consumer to control the evd upcall policy. > Some consumers (e.g. ISER) receives one upcall, disable > the upcall policy, and retrieve the rest of the events from a > kernel_thread, via dat_evd_dequeue. > This fashion of work improves performance by saving the context > switching that is involved in many upcalls. > If the consumer does not behave that way and leaves the upcall policy > enabled at all times (e.g. kdapltest), the behavior will stay the same and > the consumer will get an upcall for each event. > > Signed-off-by: Guy German > > Index: dapl_evd.c > =================================================================== > --- dapl_evd.c (revision 3056) > +++ dapl_evd.c (working copy) > @@ -38,28 +38,39 @@ > #include "dapl_ring_buffer_util.h" > > /* > - * DAPL Internal routine to trigger the specified CNO. > - * Called by the callback of some EVD associated with the CNO. Thanks for catching these CNO references. I thought I had removed them all. > + * DAPL Internal routine to trigger the callback of the EVD > */ > static void dapl_evd_upcall_trigger(struct dapl_evd *evd) > { > int status = 0; > struct dat_event event; > + unsigned long flags; For flags, we use the flags member in the dapl_common structure. 
Take a look at the call to spin_lock_irqsave() in dapl_evd_get_event() for an example. We use the flags in the structure because the EVD code takes the spin lock in one function and releases it in another. gg: OK > > - /* Only process events if there is an enabled callback function. */ > - if ((evd->upcall.upcall_func == (DAT_UPCALL_FUNC) NULL) || > - (evd->upcall_policy == DAT_UPCALL_DISABLE)) { > + > + /* The function is not re-entrant (change when implementing DAT_UPCALL_MANY)*/ Why is this function not re-entrant? For reference, here is how I would define re-entrant: http://en.wikipedia.org/wiki/Reentrant http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?re-entrant gg: The function cannot be entered twice at the same time, because while the upcall is in the hands of the consumer he can disable the upcall policy, and if the function is entered twice there is a chance the consumer will get another upcall after disabling the upcall policy. > + if (evd->is_triggered) > return; > - } Why check the value here? Is it only for the efficiency of not taking the spin lock when is_triggered is 1? gg: No. You can't take the spin_lock here because this can cause a deadlock on uniprocessor machines in the case where the function calls itself from dat_evd_dequeue. 
> > - for (;;) { > + spin_lock_irqsave (&evd->common.lock, flags); > + if (evd->is_triggered) { > + spin_unlock_irqrestore (&evd->common.lock, flags); > + return; > + } > + evd->is_triggered = 1; > + spin_unlock_irqrestore (&evd->common.lock, flags); > + /* Only process events if there is an enabled callback function */ > + while ((evd->upcall.upcall_func != (DAT_UPCALL_FUNC)NULL) && > + (evd->upcall_policy != DAT_UPCALL_TEARDOWN) && > + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { > status = dapl_evd_dequeue((struct dat_evd *)evd, &event); > - if (0 != status) > - return; > - > + if (status) > + break; > evd->upcall.upcall_func(evd->upcall.instance_data, &event, > FALSE); > } > + evd->is_triggered = 0; > + > + return; > } > > static void dapl_evd_eh_print_wc(struct ib_wc *wc) > @@ -820,24 +831,19 @@ static void dapl_evd_dto_callback(struct > * This function does not dequeue from the CQ; only the consumer > * can do that. Instead, it wakes up waiters if any exist. > * It rearms the completion only if completions should always occur > - * (specifically if a CNO is associated with the EVD and the > - * EVD is enabled). > */ > - > - if (state == DAPL_EVD_STATE_OPEN && > - evd->upcall_policy != DAT_UPCALL_DISABLE) { > - /* > - * Re-enable callback, *then* trigger. > - * This guarantees we won't miss any events. 
> - */ > - status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > - if (0 != status) > - (void)dapl_evd_post_async_error_event( > - evd->common.owner_ia->async_error_evd, > - DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, > - evd->common.owner_ia); > - > + > + if (state == DAPL_EVD_STATE_OPEN) { > dapl_evd_upcall_trigger(evd); > + if ((evd->upcall_policy != DAT_UPCALL_TEARDOWN) && > + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > + if (0 != status) > + (void)dapl_evd_post_async_error_event( > + evd->common.owner_ia->async_error_evd, > + DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, > + evd->common.owner_ia); > + } You changed the order in which the CQ upcall is enabled and the kDAPL upcall is made. It used to be: (1) enable CQ upcall, (2) call kDAPL upcall. You are proposing: (1) call kDAPL upcall, (2) enable CQ upcall. I think your proposed order contains a race condition. Specifically, if a work completion occurs after dapl_evd_upcall_trigger() returns but before the CQ upcall is re-enabled with ib_req_notify_cq(), no upcall will occur for the completion. Do you agree? gg: You need to enable the CQ upcall only if the consumer did not change his upcall policy while in upcall context. In the first case you will create a situation where the CQ upcall is enabled while the consumer doesn't want any upcalls. In most real-world applications dapl_evd_upcall_trigger() will return with the upcall policy disabled and there will be no need to re-arm the CQ upcall, i.e. the consumer will dequeue the rest of the events himself. I see the race you are talking about. It is relevant to kdapltest. Maybe we can check whether there are pending events after enabling the CQ upcall, and if there are, call dapl_evd_upcall_trigger() again. What do you think? 
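The re-check suggested here can be sketched as follows. This is a self-contained simulation under stated assumptions, not kDAPL code: struct sim_cq, sim_poll(), sim_req_notify() and drain_and_rearm() are hypothetical stand-ins, and the late_arrivals parameter only models completions that land between the last poll and the re-arm.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a completion queue with a one-shot notify flag. */
struct sim_cq {
	int pending;	/* completions waiting to be polled */
	bool armed;	/* true once notification has been requested */
};

/* Take one completion off the queue; returns 1 if one was taken, 0 if empty. */
static int sim_poll(struct sim_cq *cq)
{
	if (cq->pending > 0) {
		cq->pending--;
		return 1;
	}
	return 0;
}

/* Request notification for the next completion (one-shot re-arm). */
static void sim_req_notify(struct sim_cq *cq)
{
	cq->armed = true;
}

/*
 * Drain the queue, re-arm, then drain once more. The second pass picks up
 * completions that arrived after the first drain but before the re-arm,
 * which would otherwise never generate a notification.
 * Returns the number of completions handled.
 */
static int drain_and_rearm(struct sim_cq *cq, int late_arrivals)
{
	int handled = 0;

	while (sim_poll(cq))
		handled++;

	cq->pending += late_arrivals;	/* simulated race window */
	sim_req_notify(cq);

	while (sim_poll(cq))
		handled++;

	return handled;
}
```

Without the second loop, the late arrivals would stay queued until some later completion happened to raise a new notification.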
> } > dapl_dbg_log(DAPL_DBG_TYPE_RTN, "%s() returns\n", __func__); > } > @@ -890,7 +896,7 @@ int dapl_evd_internal_create(struct dapl > > /* reset the qlen in the attributes, it may have changed */ > evd->qlen = evd->cq->cqe; > - > + evd->is_triggered = 0; This should be done in dapl_evd_alloc. > status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > > if (status != 0) > @@ -1035,15 +1041,41 @@ int dapl_evd_modify_upcall(struct dat_ev > const struct dat_upcall_object *upcall) > { > struct dapl_evd *evd; > - > - dapl_dbg_log(DAPL_DBG_TYPE_API, "dapl_modify_upcall(%p)\n", evd_handle); > + int status = 0; > + int pending_events; > + unsigned long flags; See my comment above about the flags. > > evd = (struct dapl_evd *)evd_handle; > + dapl_dbg_log (DAPL_DBG_TYPE_API, "%s: (evd=%p) set to %d\n", > + __func__, evd_handle, upcall_policy); The idea was to make the DAPL_DBG_TYPE_API prints look like a debugger stack trace. The following would be in keeping with the other print statements: gg: I thought it would make it a bit more user friendly :) sometimes the consumers use those debug prints and they don't want to dwell in the kdapl code too much in order to understand what they are reading ... + dapl_dbg_log (DAPL_DBG_TYPE_API, "%s(%p, %d, %p)\n", + __func__, evd_handle, upcall_policy, upcall); > > + spin_lock_irqsave(&evd->common.lock, flags); > + if ((upcall_policy != DAT_UPCALL_TEARDOWN) && > + (upcall_policy != DAT_UPCALL_DISABLE)) { Why not let the consumer set up the upcall when it is disabled? That seems like the only safe time to modify it. gg: The consumer needs to, and can, change the policy to disable and enable. The only time he is not allowed to change the policy to enable (in this implementation) is when there are still pending events in the queue. This is to solve a race where the consumer dequeued all the events and changed the policy to enable, but one or more other events arrived just before the call to dat_evd_modify_upcall. 
In this case the dat_evd_modify_upcall call to enable would fail and the consumer would keep dequeuing the events, without losing his context. > + pending_events = dapl_rbuf_count(&evd->pending_event_queue); I don't understand this restriction either. Please explain. gg: explained above > + if (pending_events) { > + dapl_dbg_log (DAPL_DBG_TYPE_WARN, > + "%s: (evd %p) there are still %d pending " > + "events in the queue - policy stays disabled\n", > + __func__, evd_handle, pending_events); > + status = -EBUSY; > + goto bail; > + } > + if (evd->evd_flags & DAT_EVD_DTO_FLAG) { > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); Why do we need to re-enable the CQ upcall? gg: If the consumer returned from the evd_upcall with upcall policy "disabled" the CQ upcall is not enabled. So this is the only place it is done. > + if (status) { > + printk(KERN_ERR "%s: dapls_ib_completion_notify" > + " failed (status=0x%x) \n",__func__, > + status); Let's use dapl_dbg_log instead of printk. We can also update the text of the message to "%s: ib_req_notify_cq failed: %X\n" gg: OK > + goto bail; > + } > + } > + } > evd->upcall_policy = upcall_policy; > evd->upcall = *upcall; > - > - return 0; > +bail: > + spin_unlock_irqrestore(&evd->common.lock, flags); > + return status; > } > > int dapl_evd_post_se(struct dat_evd *evd_handle, const struct dat_event *event) > @@ -1076,7 +1108,7 @@ int dapl_evd_post_se(struct dat_evd *evd > event->event_data. 
> software_event_data.pointer); > > - bail: > +bail: > return status; > } > > @@ -1124,7 +1156,7 @@ int dapl_evd_dequeue(struct dat_evd *evd > } > > spin_unlock_irqrestore(&evd->common.lock, evd->common.flags); > - bail: > +bail: > dapl_dbg_log(DAPL_DBG_TYPE_RTN, > "dapl_evd_dequeue () returns 0x%x\n", status); > From tomduffy at speakeasy.net Thu Aug 11 20:52:19 2005 From: tomduffy at speakeasy.net (Tom Duffy) Date: Thu, 11 Aug 2005 20:52:19 -0700 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <1123796337.4403.6371.camel@hal.voltaire.com> References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> Message-ID: <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> On Aug 11, 2005, at 2:38 PM, Hal Rosenstock wrote: > Can anyone think of another approach to do this and keep backward > compatibility ? Do we need backward compatibility? How about the stuff that includes if_packet.h gets rebuilt? You are adding to the end of the struct, after all. -tduffy From iod00d at hp.com Thu Aug 11 22:50:44 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 11 Aug 2005 22:50:44 -0700 Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32 In-Reply-To: <52k6isnm21.fsf@cisco.com> References: <200508111932.j7BJWROj026274@xi.cse.ohio-state.edu> <482FEE13-0AB4-11DA-80CD-000D932C3754@cse.ohio-state.edu> <52k6isnm21.fsf@cisco.com> Message-ID: <20050812055044.GA8582@esmail.cup.hp.com> On Thu, Aug 11, 2005 at 03:30:46PM -0700, Roland Dreier wrote: > Do you have any theory as to why the drivers worked in 64-bit mode and > failed in 32-bit mode? I don't see any reason why the parameters > passed to INIT_IB would be any different. grundler at gsyprf3:/usr/src/openib_gen2/src/linux-kernel/infiniband/hw/mthca$ fgrep writeq * ... 
mthca_doorbell.h: __raw_writeq((__force u64) val, dest); mthca_doorbell.h: __raw_writeq(*(u64 *) val, dest); The only theory I can think of is that 64-bit MMIO writes on a 32-bit OS will come out as two separate writes. But since others are using this without a problem, this isn't likely a generic issue. Maybe there is some timing issue here, i.e. a slower/faster CPU or chipset is exposing a problem. grant From hch at lst.de Fri Aug 12 00:12:23 2005 From: hch at lst.de (Christoph Hellwig) Date: Fri, 12 Aug 2005 09:12:23 +0200 Subject: [openib-general] RE: [PATCH] amso1100: use standard byteorder macros In-Reply-To: <52fytgnk2z.fsf@cisco.com> References: <8E9D028761D8264D910612167E8457E8FA37F0@mail2.ammasso.com> <52fytgnk2z.fsf@cisco.com> Message-ID: <20050812071223.GA24159@lst.de> On Thu, Aug 11, 2005 at 04:13:24PM -0700, Roland Dreier wrote: > Tom> Christoph: This is great stuff, but would be even better if > Tom> we just globally replaced thinks like "cpu_to_wr64" with > Tom> "cpu_to_be64" and removed the cc_byteorder.h file altogether? > > Yes, if your byte order is definitely frozen to big-endian, then you > should definitely killall the cpu_to_wrXX wrappers. Agreed. This was just the most trivial, obviously correct patch to get things to mostly compile on ppc. I'll do a new round of patches when I find time. 
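For readers following the byte-order thread, here is a minimal userspace sketch of what a fixed big-endian conversion does, independent of host endianness. The helper names put_be64() and get_be64() are hypothetical and are not part of the amso1100 driver; kernel code would use the standard cpu_to_be64()/be64_to_cpu() macros discussed above instead.

```c
#include <stdint.h>

/*
 * Hypothetical helper: store v at p in big-endian (most significant byte
 * first) order. Shifting instead of casting makes the result the same on
 * little-endian and big-endian hosts.
 */
static void put_be64(uint8_t *p, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Hypothetical helper: read a big-endian 64-bit value back from p. */
static uint64_t get_be64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];

	return v;
}
```

On any host, put_be64(buf, 0x0102030405060708ULL) leaves the bytes 01 02 03 04 05 06 07 08 in buf, and get_be64(buf) returns the original value. That host independence is what makes a per-adapter wrapper layer unnecessary once the wire format is fixed.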
From Thomas.Talpey at netapp.com Fri Aug 12 04:52:27 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 12 Aug 2005 07:52:27 -0400 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> Message-ID: <6.2.3.4.2.20050812074727.05bb9790@exnane01.nane.netapp.com> At 11:52 PM 8/11/2005, Tom Duffy wrote: > >On Aug 11, 2005, at 2:38 PM, Hal Rosenstock wrote: >> Can anyone think of another approach to do this and keep backward >> compatibility ? > >Do we need backward compatibility? How about the stuff that includes >if_packet.h gets rebuilt? You are adding to the end of the struct, >after all. The size of the struct is less of an issue than the test for ARPHRD_INFINIBAND. David said as much: -- it won't work for anything else without adding -- more special tests to that af_packet.c code I have to say, SOCKADDR_COMPAT_LL is pretty stinky too. Hal, why *are* you testing for ARPHRD_INFINIBAND anyway? What different action happens in the transport-independent code in this special case? Tom. 
From halr at voltaire.com Fri Aug 12 06:01:25 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Aug 2005 09:01:25 -0400 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <6.2.3.4.2.20050812074727.05bb9790@exnane01.nane.netapp.com> References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <6.2.3.4.2.20050812074727.05bb9790@exnane01.nane.netapp.com> Message-ID: <1123851684.4403.7496.camel@hal.voltaire.com> On Fri, 2005-08-12 at 07:52, Talpey, Thomas wrote: > At 11:52 PM 8/11/2005, Tom Duffy wrote: > > > >On Aug 11, 2005, at 2:38 PM, Hal Rosenstock wrote: > >> Can anyone think of another approach to do this and keep backward > >> compatibility ? > > > >Do we need backward compatibility? How about the stuff that includes > >if_packet.h gets rebuilt? You are adding to the end of the struct, > >after all. > > The size of the struct is less of an issue than the test for > ARPHRD_INFINIBAND. David said as much: > > -- it won't work for anything else without adding > -- more special tests to that af_packet.c code > > I have to say, SOCKADDR_COMPAT_LL is pretty stinky too. SOCKADDR_COMPAT_LL is there to support backward compatibility (for binaries built with the old header, which has only an 8-byte sll_addr in struct sockaddr_ll). > Hal, why *are* you testing for ARPHRD_INFINIBAND anyway? > What different action happens in the transport-independent > code in this special case? It is done to preserve length checks that were already there (on struct msghdr in packet_sendmsg and addr_len in packet_bind). I didn't want to weaken that. In packet_sendmsg, the check is that the message is not shorter than the link level header. For bind, the check is that the address length bound to the socket is accurate for the link layer being used. Those parameters come from sendto and bind calls in the application. 
-- Hal From halr at voltaire.com Fri Aug 12 06:14:40 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Aug 2005 09:14:40 -0400 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> Message-ID: <1123852410.4403.7522.camel@hal.voltaire.com> On Thu, 2005-08-11 at 23:52, Tom Duffy wrote: > On Aug 11, 2005, at 2:38 PM, Hal Rosenstock wrote: > > Can anyone think of another approach to do this and keep backward > > compatibility ? > > Do we need backward compatibility? I'm not sure but I think this was a recommendation from Roland. > How about the stuff that includes > if_packet.h gets rebuilt? You are adding to the end of the struct, > after all. If one can assume the applications are rebuilt, I think that makes life easier as the IPoIB specific checks can be removed. I think that was the primary technical objection (aside from the mechanical one). -- Hal From halr at voltaire.com Fri Aug 12 06:17:53 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Aug 2005 09:17:53 -0400 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> Message-ID: <1123852479.4403.7525.camel@hal.voltaire.com> On Thu, 2005-08-11 at 23:52, Tom Duffy wrote: > On Aug 11, 2005, at 2:38 PM, Hal Rosenstock wrote: > > Can anyone think of another approach to do this and keep backward > > compatibility ? > > Do we need backward compatibility? I'm not sure but I think this was a recommendation from Roland. 
It certainly would be better if we could. I don't know whether this is a requirement of the solution or not or just desirable. > How about the stuff that includes > if_packet.h gets rebuilt? You are adding to the end of the struct, > after all. If one can assume the applications are rebuilt, I think that makes life easier as the IPoIB specific checks can be removed. I think that was the primary technical objection (aside from the mechanical one). -- Hal From halr at voltaire.com Fri Aug 12 06:32:56 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Aug 2005 09:32:56 -0400 Subject: [openib-general] what do you think about the following two user level small tools? In-Reply-To: <506C3D7B14CDD411A52C00025558DED60882C49D@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED60882C49D@mtlex01.yok.mtl.com> Message-ID: <1123853575.4403.7546.camel@hal.voltaire.com> Hi Dotan, On Thu, 2005-08-11 at 09:54, Dotan Barak wrote: > Hi. > I ported the following two tools from gen1 and they can be found in > https://openib.org/svn/trunk/contrib/mellanox/tools/. > > Here is a small description of those tools: > > vstat: print all the capabilities that query hca > return (this tool can be used in scripts). > check_catastrophic: check that the device is not in fatal state > (this tool will be useful when the fatal flow will be implemented) > > If you think that those tools are useful i will move them to the > trunk. vstat is the fourth version of some sort of status. There are ibstat and ibstatus working at either the driver or umad layer. There is also ibv_devinfo which displays a subset of this info from the verbs layer. So this adds some things which may be useful. Perhaps a rename to ibv_stat to be more in sync. As far as check_catastrophic goes, this basically does a query_device and so is less than what is done in the ibv_devinfo example so this one may be less useful. 
Also, rather than or in addition to device name, should this take a GID or GUID as an alternative to device name ? Also, they should be changed to be built with autotools as is the OpenIB way. -- Hal From Thomas.Talpey at netapp.com Fri Aug 12 06:52:19 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 12 Aug 2005 09:52:19 -0400 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <1123851684.4403.7496.camel@hal.voltaire.com> References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <6.2.3.4.2.20050812074727.05bb9790@exnane01.nane.netapp.com> <1123851684.4403.7496.camel@hal.voltaire.com> Message-ID: <6.2.3.4.2.20050812092230.05caaa00@exnane01.nane.netapp.com> At 09:01 AM 8/12/2005, Hal Rosenstock wrote: >It is done to preserve length checks that were already there (on struct >msghdr in packet_sendmsg and addr_len in packet_bind). I didn't want to >weaken that. Are you sure things break if you simply build a message in user space that's got the larger address (without changing the sockaddr_ll at all)? It looks to me as if msg->msg_namelen/msg_name can be any appropriate size which is at least as large as the sockaddr_ll. Tom. 
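One way to picture the length rule Tom describes (a name buffer at least as large as sockaddr_ll, but possibly larger) is the sketch below. It assumes the classic layout with an 8-byte sll_addr and a 20-byte IPoIB hardware address; struct ll_classic, IPOIB_ALEN and ll_name_len() are hypothetical names used only for illustration, not kernel or libc definitions, and this is not the check that af_packet.c performs in the patch under discussion.

```c
#include <stddef.h>

#define IPOIB_ALEN 20	/* assumed IPoIB hardware address length, in bytes */

/* Local mirror of the classic sockaddr_ll layout (8-byte sll_addr). */
struct ll_classic {
	unsigned short sll_family;
	unsigned short sll_protocol;
	int            sll_ifindex;
	unsigned short sll_hatype;
	unsigned char  sll_pkttype;
	unsigned char  sll_halen;
	unsigned char  sll_addr[8];
};

/*
 * Number of bytes a caller would pass as the name length to carry a
 * halen-byte link-level address: the fixed header plus the address,
 * but never less than the classic structure size, so that short
 * addresses (e.g. 6-byte Ethernet) keep the old length.
 */
static size_t ll_name_len(size_t halen)
{
	size_t need = offsetof(struct ll_classic, sll_addr) + halen;

	if (need < sizeof(struct ll_classic))
		return sizeof(struct ll_classic);
	return need;
}
```

With this rule a 6-byte Ethernet address still yields sizeof(struct ll_classic), so existing binaries are unaffected, while a 20-byte address yields a longer name without changing the structure definition itself.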
From halr at voltaire.com Fri Aug 12 07:12:41 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Aug 2005 10:12:41 -0400 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <6.2.3.4.2.20050812092230.05caaa00@exnane01.nane.netapp.com> References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <6.2.3.4.2.20050812074727.05bb9790@exnane01.nane.netapp.com> <1123851684.4403.7496.camel@hal.voltaire.com> <6.2.3.4.2.20050812092230.05caaa00@exnane01.nane.netapp.com> Message-ID: <1123855960.4403.7565.camel@hal.voltaire.com> On Fri, 2005-08-12 at 09:52, Talpey, Thomas wrote: > At 09:01 AM 8/12/2005, Hal Rosenstock wrote: > >It is done to preserve length checks that were already there (on struct > >msghdr in packet_sendmsg and addr_len in packet_bind). I didn't want to > >weaken that. > > Are you sure things break if you simply build a message in user space > that's got the larger address (without changing the sockaddr_ll at all)? > It looks to me as if msg->msg_namelen/msg_name can be any appropriate > size which is at least as large as the sockaddr_ll. I think that's where I started on this. I didn't change sockaddr_ll and it didn't work for IPoIB link level messages but at this point, I'm no longer 100% sure so I will check again (it may have been due to some other problem). If sockaddr_ll struct is left alone, I think it may be a problem on the receive side where the size of that struct is used. 
-- Hal From Thomas.Talpey at netapp.com Fri Aug 12 07:32:39 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 12 Aug 2005 10:32:39 -0400 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <1123855960.4403.7565.camel@hal.voltaire.com> References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <6.2.3.4.2.20050812074727.05bb9790@exnane01.nane.netapp.com> <1123851684.4403.7496.camel@hal.voltaire.com> <6.2.3.4.2.20050812092230.05caaa00@exnane01.nane.netapp.com> <1123855960.4403.7565.camel@hal.voltaire.com> Message-ID: <6.2.3.4.2.20050812102712.065ce190@exnane01.nane.netapp.com> At 10:12 AM 8/12/2005, Hal Rosenstock wrote: >If sockaddr_ll struct is left alone, I think it may be a problem on the >receive side where the size of that struct is used. Maybe. The receive side builds the incoming sockaddr_ll in the skb->cb. But that's 48 bytes and it goes off to your device's hard_header_parse to do so... You sure you have hard_header_len and all the appropriate vectors set up properly? (netdevice.h) Tom. 
From halr at voltaire.com Fri Aug 12 07:25:58 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Aug 2005 10:25:58 -0400 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <1123855960.4403.7565.camel@hal.voltaire.com> References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <6.2.3.4.2.20050812074727.05bb9790@exnane01.nane.netapp.com> <1123851684.4403.7496.camel@hal.voltaire.com> <6.2.3.4.2.20050812092230.05caaa00@exnane01.nane.netapp.com> <1123855960.4403.7565.camel@hal.voltaire.com> Message-ID: <1123856758.4403.7569.camel@hal.voltaire.com> On Fri, 2005-08-12 at 10:12, Hal Rosenstock wrote: > If sockaddr_ll struct is left alone, I think it may be a problem on the > receive side where the size of that struct is used. One further thought on this possible approach: Even if this does work, it seems to me that this pushes all the hokiness back on the applications which use sockaddr_ll. For example, with the struct change and a trivial change to arping.c, this now works for IPoIB. If the structure were not to be changed, there would be a lot of IPoIB specific changes to each application. -- Hal From pw at osc.edu Fri Aug 12 07:42:22 2005 From: pw at osc.edu (Pete Wyckoff) Date: Fri, 12 Aug 2005 10:42:22 -0400 Subject: [openib-general] avoid segv in libibverbs/examples Message-ID: <20050812144222.GA8988@osc.edu> Against latest svn, in gen2/trunk/userspace/libibverbs/examples. These tools segv when the sysfs lookup doesn't find an IB card. 
-- Pete Index: asyncwatch.c =================================================================== --- asyncwatch.c (revision 3072) +++ asyncwatch.c (working copy) @@ -56,6 +56,8 @@ struct ibv_async_event event; dev_list = ibv_get_devices(); + if (!dev_list) + return 1; dlist_start(dev_list); ib_dev = dlist_next(dev_list); Index: rc_pingpong.c =================================================================== --- rc_pingpong.c (revision 3072) +++ rc_pingpong.c (working copy) @@ -524,6 +524,8 @@ page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_devices(); + if (!dev_list) + return 1; dlist_start(dev_list); if (!ib_devname) { Index: srq_pingpong.c =================================================================== --- srq_pingpong.c (revision 3072) +++ srq_pingpong.c (working copy) @@ -593,6 +593,8 @@ page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_devices(); + if (!dev_list) + return 1; dlist_start(dev_list); if (!ib_devname) { Index: uc_pingpong.c =================================================================== --- uc_pingpong.c (revision 3072) +++ uc_pingpong.c (working copy) @@ -516,6 +516,8 @@ page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_devices(); + if (!dev_list) + return 1; dlist_start(dev_list); if (!ib_devname) { Index: ud_pingpong.c =================================================================== --- ud_pingpong.c (revision 3072) +++ ud_pingpong.c (working copy) @@ -520,6 +520,8 @@ page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_devices(); + if (!dev_list) + return 1; dlist_start(dev_list); if (!ib_devname) { Index: device_list.c =================================================================== --- device_list.c (revision 3072) +++ device_list.c (working copy) @@ -55,6 +55,8 @@ struct ibv_device *ib_dev; dev_list = ibv_get_devices(); + if (!dev_list) + return 1; printf(" %-16s\t node GUID\n", "device"); printf(" %-16s\t----------------\n", "------"); Index: devinfo.c 
=================================================================== --- devinfo.c (revision 3072) +++ devinfo.c (working copy) @@ -58,6 +58,8 @@ int i; dev_list = ibv_get_devices(); + if (!dev_list) + return 1; dlist_start(dev_list); ib_dev = dlist_next(dev_list); From rolandd at cisco.com Fri Aug 12 07:51:35 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 12 Aug 2005 07:51:35 -0700 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <1123852410.4403.7522.camel@hal.voltaire.com> (Hal Rosenstock's message of "12 Aug 2005 09:14:40 -0400") References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <1123852410.4403.7522.camel@hal.voltaire.com> Message-ID: <52u0hvmcnc.fsf@cisco.com> Tom> Do we need backward compatibility? Hal> I'm not sure but I think this was a recommendation from Hal> Roland. Yes, of course we need backward compatibility. We can't put a change into the kernel that breaks userspace binaries. For example, the old Fedora binary of arping has to continue to work, even if you use a new kernel. - R. 
From halr at voltaire.com Fri Aug 12 07:45:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Aug 2005 10:45:24 -0400
Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface
In-Reply-To: <6.2.3.4.2.20050812102712.065ce190@exnane01.nane.netapp.com>
References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <6.2.3.4.2.20050812074727.05bb9790@exnane01.nane.netapp.com> <1123851684.4403.7496.camel@hal.voltaire.com> <6.2.3.4.2.20050812092230.05caaa00@exnane01.nane.netapp.com> <1123855960.4403.7565.camel@hal.voltaire.com> <6.2.3.4.2.20050812102712.065ce190@exnane01.nane.netapp.com>
Message-ID: <1123857924.4403.7574.camel@hal.voltaire.com>

On Fri, 2005-08-12 at 10:32, Talpey, Thomas wrote:
> At 10:12 AM 8/12/2005, Hal Rosenstock wrote:
> >If sockaddr_ll struct is left alone, I think it may be a problem on the
> >receive side where the size of that struct is used.
>
> Maybe. The receive side builds the incoming sockaddr_ll in the skb->cb.
> But that's 48 bytes and it goes off to your device's hard_header_parse
> to do so...
>
> You sure you have hard_header_len and all the appropriate vectors
> set up properly? (netdevice.h)

Yes, this is done by ipoib_setup in ipoib_main.c which is called when an
IPoIB port is added as follows:

dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name, ipoib_setup);

-- Hal

From tom at ammasso.com Fri Aug 12 08:04:28 2005
From: tom at ammasso.com (Tom Tucker)
Date: Fri, 12 Aug 2005 11:04:28 -0400
Subject: [openib-general] RE: [PATCH] amso1100: use standard byteorder macros
Message-ID: <8E9D028761D8264D910612167E8457E8FA380F@mail2.ammasso.com>

I'll send one out here in a few minutes with the bexx related changes.
I'll accept the memcpy4 patch.
Thanks

> -----Original Message-----
> From: Christoph Hellwig [mailto:hch at lst.de]
> Sent: Friday, August 12, 2005 2:12 AM
> To: Roland Dreier
> Cc: Tom Tucker; Christoph Hellwig; openib-general at openib.org
> Subject: Re: [openib-general] RE: [PATCH] amso1100: use standard byteorder macros
>
> On Thu, Aug 11, 2005 at 04:13:24PM -0700, Roland Dreier wrote:
> > Tom> Christoph: This is great stuff, but would be even better if
> > Tom> we just globally replaced things like "cpu_to_wr64" with
> > Tom> "cpu_to_be64" and removed the cc_byteorder.h file altogether?
> >
> > Yes, if your byte order is definitely frozen to big-endian, then you
> > should definitely killall the cpu_to_wrXX wrappers.
>
> Agreed. This was just the most trivial obvious correct patch
> to get things to mostly compile on ppc.
>
> I'll do a new round of patches when I find time.
>

From tom at ammasso.com Fri Aug 12 08:14:16 2005
From: tom at ammasso.com (Tom Tucker)
Date: Fri, 12 Aug 2005 11:14:16 -0400
Subject: [openib-general] Use Linux generics in iWARP driver patch
Message-ID: <8E9D028761D8264D910612167E8457E8FA3811@mail2.ammasso.com>

Here's a patch for the iWARP branch that changes the driver to use all
Linux byte swapping generics. Thanks to Christoph and Roland for their
help here.
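
The conversion pattern in this patch is perhaps easiest to see on the shared
queue index (the BUMP_SHARED, cc_mq_empty and cc_mq_full changes in cc_mq.h).
What follows is only a minimal userspace sketch, assuming glibc's <endian.h>,
where htobe16()/be16toh() stand in for the kernel's cpu_to_be16()/be16_to_cpu().
The struct demo_mq and the demo_* helpers are hypothetical and simplified; they
are not the driver's cc_mq_t or its real functions.

```c
#define _DEFAULT_SOURCE   /* asks glibc to expose htobe16()/be16toh() */
#include <assert.h>
#include <endian.h>
#include <stdint.h>

/* Hypothetical, simplified queue for illustration; not the driver's cc_mq_t. */
struct demo_mq {
    uint16_t q_size;     /* number of slots, host byte order */
    uint16_t priv;       /* host-private index, host byte order */
    uint16_t shared_be;  /* index shared with the adapter, stored big-endian */
};

/* Advance the shared index by one slot, keeping it big-endian in memory. */
static void demo_bump_shared(struct demo_mq *q)
{
    assert(q->q_size > 0);
    q->shared_be = htobe16((uint16_t)((be16toh(q->shared_be) + 1) % q->q_size));
}

/* Empty when the private index equals the shared index (compared in host order). */
static int demo_mq_empty(const struct demo_mq *q)
{
    return q->priv == be16toh(q->shared_be);
}

/* Full when the private index sits one slot behind the shared index, modulo q_size. */
static int demo_mq_full(const struct demo_mq *q)
{
    return q->priv == (be16toh(q->shared_be) + q->q_size - 1) % q->q_size;
}
```

With q_size = 4 and both indices at 0, demo_mq_empty() is true. After one
demo_bump_shared() the two bytes of shared_be in memory are 0x00 0x01 on any
host, and be16toh() recovers the value 1 regardless of the host's endianness.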
Signed-off-by: Tom Tucker Index: devccil_adapter.c =================================================================== --- devccil_adapter.c (revision 3072) +++ devccil_adapter.c (working copy) @@ -219,7 +219,7 @@ /* * update the resource indicator and id */ - er.resource_indicator = wr32_to_cpu(wr->ae.ae_generic.resource_type); + er.resource_indicator = be32_to_cpu(wr->ae.ae_generic.resource_type); er.resource_user_context = CC_CTX_TO_PTR(wr->ae.ae_generic.user_context, void *); /* @@ -228,7 +228,7 @@ switch (er.resource_indicator) { case CC_RES_IND_QP: { cc_kern_qp_t *qp = (cc_kern_qp_t *)er.resource_user_context; - qp->qp_state = wr32_to_cpu(wr->ae.ae_generic.qp_state); + qp->qp_state = be32_to_cpu(wr->ae.ae_generic.qp_state); er.resource_id.qp_id = qp->adapter_handle; er.resource_user_context = qp->user_context; break; @@ -273,7 +273,7 @@ er.event_data.active_connect_results.rport = wr->ae.ae_active_connect_results.rport; er.event_data.active_connect_results.private_data_length = - wr32_to_cpu(wr->ae.ae_active_connect_results.private_data_length); + be32_to_cpu(wr->ae.ae_active_connect_results.private_data_length); memcpy(er.event_data.active_connect_results.private_data, wr->ae.ae_active_connect_results.private_data, er.event_data.active_connect_results.private_data_length); @@ -290,7 +290,7 @@ er.event_data.connection_request.rport = wr->ae.ae_connection_request.rport; er.event_data.connection_request.private_data_length = - wr32_to_cpu(wr->ae.ae_connection_request.private_data_length); + be32_to_cpu(wr->ae.ae_connection_request.private_data_length); memcpy(er.event_data.connection_request.private_data, wr->ae.ae_connection_request.private_data, er.event_data.connection_request.private_data_length); @@ -580,8 +580,8 @@ if (pci == NULL) return CCERR_INSUFFICIENT_RESOURCES; - q1_pages_len = cpu_to_wr32(pci->q1_q_size) * cpu_to_wr32(pci->q1_msg_size); - q2_pages_len = cpu_to_wr32(pci->q2_q_size) * cpu_to_wr32(pci->q2_msg_size); + q1_pages_len = 
cpu_to_be32(pci->q1_q_size) * cpu_to_be32(pci->q1_msg_size); + q2_pages_len = cpu_to_be32(pci->q2_q_size) * cpu_to_be32(pci->q2_msg_size); q1_pages = ccil_malloc(q1_pages_len, CCIL_WAIT); if (!q1_pages) { @@ -597,22 +597,22 @@ memset(q2_pages, 0xb0, q2_pages_len); #endif - cc_mq_init(&cca->req_vq, 0, cpu_to_wr32(pci->q0_q_size), - cpu_to_wr32(pci->q0_msg_size), - aoff_to_virt(cca, cpu_to_wr32(pci->q0_pool_start)), - aoff_to_virt(cca, cpu_to_wr32(pci->q0_shared)), + cc_mq_init(&cca->req_vq, 0, cpu_to_be32(pci->q0_q_size), + cpu_to_be32(pci->q0_msg_size), + aoff_to_virt(cca, cpu_to_be32(pci->q0_pool_start)), + aoff_to_virt(cca, cpu_to_be32(pci->q0_shared)), CC_MQ_ADAPTER_TARGET); - cc_mq_init(&cca->rep_vq, 1, cpu_to_wr32(pci->q1_q_size), - cpu_to_wr32(pci->q1_msg_size), + cc_mq_init(&cca->rep_vq, 1, cpu_to_be32(pci->q1_q_size), + cpu_to_be32(pci->q1_msg_size), (void *)q1_pages, - aoff_to_virt(cca, cpu_to_wr32(pci->q1_shared)), + aoff_to_virt(cca, cpu_to_be32(pci->q1_shared)), CC_MQ_HOST_TARGET); - cc_mq_init(&cca->aeq, 2, cpu_to_wr32(pci->q2_q_size), - cpu_to_wr32(pci->q2_msg_size), + cc_mq_init(&cca->aeq, 2, cpu_to_be32(pci->q2_q_size), + cpu_to_be32(pci->q2_msg_size), (void *)q2_pages, - aoff_to_virt(cca, cpu_to_wr32(pci->q2_shared)), + aoff_to_virt(cca, cpu_to_be32(pci->q2_shared)), CC_MQ_HOST_TARGET); /* @@ -641,17 +641,17 @@ */ cc_wr_set_id(&wr, CCWR_INIT); wr.hdr.context = 0; - wr.hint_count = cpu_to_wr64(__pa(&cca->hint_count)); + wr.hint_count = cpu_to_be64(__pa(&cca->hint_count)); wr.q0_host_shared = - cpu_to_wr64(__pa(cca->req_vq.shared)); + cpu_to_be64(__pa(cca->req_vq.shared)); wr.q1_host_shared = - cpu_to_wr64(__pa(cca->rep_vq.shared)); + cpu_to_be64(__pa(cca->rep_vq.shared)); wr.q1_host_msg_pool = - cpu_to_wr64(__pa(cca->rep_vq.msg_pool)); + cpu_to_be64(__pa(cca->rep_vq.msg_pool)); wr.q2_host_shared = - cpu_to_wr64(__pa(cca->aeq.shared)); + cpu_to_be64(__pa(cca->aeq.shared)); wr.q2_host_msg_pool = - cpu_to_wr64(__pa(cca->aeq.msg_pool)); + 
cpu_to_be64(__pa(cca->aeq.msg_pool)); /* * Send WR to adapter */ @@ -783,7 +783,7 @@ cc_kern_eh_t *eh; - while (cca->hints_read != wr16_to_cpu(cca->hint_count)) { + while (cca->hints_read != be16_to_cpu(cca->hint_count)) { mq_index = *(unsigned int *)(cca->bars[0].virt+PCI_BAR0_HOST_HINT); if (mq_index & 0x80000000) { DEVCCIL_LOG(KERN_INFO "no hint present although one is expected\n"); @@ -879,10 +879,10 @@ * Ensure that the Interface Version Numbers match * between the kernel driver and the firmware. */ - if (wr32_to_cpu(adapter_regs->ivn) != CC_IVN) { + if (be32_to_cpu(adapter_regs->ivn) != CC_IVN) { devccil_err("devccil_probe: fw IVN mismatch (fw 0x%x devccil 0x%x).\n" "Adapter not claimed.\n", - wr32_to_cpu(adapter_regs->ivn), CC_IVN); + be32_to_cpu(adapter_regs->ivn), CC_IVN); iounmap((void *)adapter_regs); free_adapter(cca); return -1; @@ -892,7 +892,7 @@ * Obtain the actual size of the verbs request MQ to map that * adapter memory. */ - size = CC_ADAPTER_PCI_REGS_OFFSET + wr32_to_cpu(adapter_regs->pci_window_size); + size = CC_ADAPTER_PCI_REGS_OFFSET + be32_to_cpu(adapter_regs->pci_window_size); iounmap((void *)adapter_regs); } cca->bars[i].size = size; Index: cc_mq.h =================================================================== --- cc_mq.h (revision 3072) +++ cc_mq.h (working copy) @@ -1,5 +1,6 @@ #ifndef _CC_MQ_H_ #define _CC_MQ_H_ +#include #include "cc_types.h" #include "cc_adapter.h" #include "cc_wr.h" @@ -46,18 +47,18 @@ #endif /* X86_64 */ #define BUMP(q,p) (p) = ((p)+1) % (q)->q_size -#define BUMP_SHARED(q,p) (p) = cpu_to_wr16((wr16_to_cpu(p)+1) % (q)->q_size) +#define BUMP_SHARED(q,p) (p) = cpu_to_be16((be16_to_cpu(p)+1) % (q)->q_size) static __inline__ cc_bool_t cc_mq_empty(cc_mq_t *q) { - return q->priv == wr16_to_cpu(*q->shared); + return q->priv == be16_to_cpu(*q->shared); } static __inline__ cc_bool_t cc_mq_full(cc_mq_t *q) { - return q->priv == (wr16_to_cpu(*q->shared) + q->q_size-1) % q->q_size; + return q->priv == 
(be16_to_cpu(*q->shared) + q->q_size-1) % q->q_size; } extern void * cc_mq_alloc(cc_mq_t *q); Index: devccil_cq.c =================================================================== --- devccil_cq.c (revision 3072) +++ devccil_cq.c (working copy) @@ -396,10 +396,10 @@ cc_wr_set_id(&wr, CCWR_CQ_CREATE); wr.hdr.context = CC_PTR_TO_CTX(vq_req); wr.rnic_handle = devp->adapter_handle; - wr.msg_size = cpu_to_wr32(cq->mq_handle.msg_size); - wr.depth = cpu_to_wr32(cq->mq_handle.q_size); - wr.shared_ht = cpu_to_wr64(__pa(shared_kva)); - wr.msg_pool = cpu_to_wr64(__pa(cq->mq.u.h.msg_pool_kva)); + wr.msg_size = cpu_to_be32(cq->mq_handle.msg_size); + wr.depth = cpu_to_be32(cq->mq_handle.q_size); + wr.shared_ht = cpu_to_be64(__pa(shared_kva)); + wr.msg_pool = cpu_to_be64(__pa(cq->mq.u.h.msg_pool_kva)); wr.user_context = CC_PTR_TO_CTX(user_context); /* @@ -436,9 +436,9 @@ goto bail5; } - cq->mq.mq_idx = wr32_to_cpu(reply->mq_index); + cq->mq.mq_idx = be32_to_cpu(reply->mq_index); cq->adapter_handle = reply->cq_handle; - cq->mq.u.h.peer_aoff = wr32_to_cpu(reply->adapter_shared); + cq->mq.u.h.peer_aoff = be32_to_cpu(reply->adapter_shared); /* * Free Msg @@ -550,8 +550,8 @@ wr.hdr.context = CC_PTR_TO_CTX(vq_req); wr.rnic_handle = devp->adapter_handle; wr.cq_handle = cq->adapter_handle; - wr.new_depth = cpu_to_wr32(*p_cq_depth+1); - wr.new_msg_pool = cpu_to_wr64(__pa(new_msg_pool_kva)); + wr.new_depth = cpu_to_be32(*p_cq_depth+1); + wr.new_msg_pool = cpu_to_be64(__pa(new_msg_pool_kva)); /* * reference the request struct. dereferenced in the int handler. 
Index: devccil_ep.c =================================================================== --- devccil_ep.c (revision 3072) +++ devccil_ep.c (working copy) @@ -145,7 +145,7 @@ wr.rnic_handle = devp->adapter_handle; wr.local_addr = addr; /* already in Net Byte Order */ wr.local_port = *p_port; /* already in Net Byte Order */ - wr.backlog = cpu_to_wr32(backlog); + wr.backlog = cpu_to_be32(backlog); wr.user_context = CC_PTR_TO_CTX(ep->user_handle); /* @@ -433,7 +433,7 @@ wr->ep_handle = CC_PTR_TO_CTX(cr_handle); wr->qp_handle = qp->adapter_handle; if (p_private_data) { - wr->private_data_length = cpu_to_wr32(private_data_length); + wr->private_data_length = cpu_to_be32(private_data_length); memcpy(&wr->private_data[0], p_private_data, private_data_length); } else { wr->private_data_length = 0; Index: cc_qp_common.c =================================================================== --- cc_qp_common.c (revision 3072) +++ cc_qp_common.c (working copy) @@ -106,9 +106,9 @@ */ if (src->length) { tot += src->length; - dst->stag = cpu_to_wr32(src->stag); - dst->to = cpu_to_wr64(src->to); - dst->length = cpu_to_wr32(src->length); + dst->stag = cpu_to_be32(src->stag); + dst->to = cpu_to_be64(src->to); + dst->length = cpu_to_be32(src->length); dst++; acount++; } @@ -189,7 +189,7 @@ len -= 8; } -#elif +#else #error "You need to define your platform, or add optimized" #error "cc_memcpy8 support for your platform." 
@@ -296,7 +296,7 @@ } #ifdef CCMSGMAGIC - ((ccwr_hdr_t *)wr)->magic = cpu_to_wr32(CCWR_MAGIC); + ((ccwr_hdr_t *)wr)->magic = cpu_to_be32(CCWR_MAGIC); #endif /* @@ -384,13 +384,13 @@ case CC_WR_TYPE_SEND_INV: msg_size = sizeof(ccwr_send_inv_req_t); wr.sqwr.send.remote_stag = - cpu_to_wr32(wr_list->wr_u.send.remote_stag); + cpu_to_be32(wr_list->wr_u.send.remote_stag); goto send; case CC_WR_TYPE_SEND_SE_INV: msg_size = sizeof(ccwr_send_se_inv_req_t); wr.sqwr.send.remote_stag = - cpu_to_wr32(wr_list->wr_u.send.remote_stag); + cpu_to_be32(wr_list->wr_u.send.remote_stag); goto send; case CC_WR_TYPE_SEND: @@ -416,7 +416,7 @@ wr_list->wr_u.send.local_sgl.sge_count, &tot_len, &actual_sge_count); - wr.sqwr.send.sge_len = cpu_to_wr32(tot_len); + wr.sqwr.send.sge_len = cpu_to_be32(tot_len); cc_wr_set_sge_count(&wr, actual_sge_count); break; @@ -432,9 +432,9 @@ if (wr_list->wr_u.rdma_write.read_fence) { flags |= SQ_READ_FENCE; } - wr.sqwr.rdma_write.remote_stag = cpu_to_wr32( + wr.sqwr.rdma_write.remote_stag = cpu_to_be32( wr_list->wr_u.rdma_write.remote_stag); - wr.sqwr.rdma_write.remote_to = cpu_to_wr64( + wr.sqwr.rdma_write.remote_to = cpu_to_be64( wr_list->wr_u.rdma_write.remote_to); status = move_sgl((cc_data_addr_t*) &(wr.sqwr.rdma_write.data), @@ -442,7 +442,7 @@ wr_list->wr_u.rdma_write.local_sgl.sge_count, &tot_len, &actual_sge_count); - wr.sqwr.rdma_write.sge_len = cpu_to_wr32(tot_len); + wr.sqwr.rdma_write.sge_len = cpu_to_be32(tot_len); cc_wr_set_sge_count(&wr, actual_sge_count); break; @@ -453,15 +453,15 @@ /* * Move the local and remote stag/to/len into the WR. 
*/ - wr.sqwr.rdma_read.local_stag = cpu_to_wr32( + wr.sqwr.rdma_read.local_stag = cpu_to_be32( wr_list->wr_u.rdma_read.local_stag); - wr.sqwr.rdma_read.local_to = cpu_to_wr64( + wr.sqwr.rdma_read.local_to = cpu_to_be64( wr_list->wr_u.rdma_read.local_to); - wr.sqwr.rdma_read.remote_stag = cpu_to_wr32( + wr.sqwr.rdma_read.remote_stag = cpu_to_be32( wr_list->wr_u.rdma_read.remote_stag); - wr.sqwr.rdma_read.remote_to = cpu_to_wr64( + wr.sqwr.rdma_read.remote_to = cpu_to_be64( wr_list->wr_u.rdma_read.remote_to); - wr.sqwr.rdma_read.length = cpu_to_wr32( + wr.sqwr.rdma_read.length = cpu_to_be32( wr_list->wr_u.rdma_read.length); break; @@ -474,17 +474,17 @@ mwflags |= MEM_VA_BASED; } mwflags |= wr_list->wr_u.mw_bind.acf; - wr.sqwr.mw_bind.flags = cpu_to_wr32(mwflags); + wr.sqwr.mw_bind.flags = cpu_to_be32(mwflags); wr.sqwr.mw_bind.stag_key = wr_list->wr_u.mw_bind.stag_key; wr.sqwr.mw_bind.mw_stag_index = - cpu_to_wr32(wr_list->wr_u.mw_bind.mw_stag_index); + cpu_to_be32(wr_list->wr_u.mw_bind.mw_stag_index); wr.sqwr.mw_bind.mr_stag_index = - cpu_to_wr32(wr_list->wr_u.mw_bind.mr_stag_index); + cpu_to_be32(wr_list->wr_u.mw_bind.mr_stag_index); wr.sqwr.mw_bind.length = - cpu_to_wr32(wr_list->wr_u.mw_bind.length); + cpu_to_be32(wr_list->wr_u.mw_bind.length); wr.sqwr.mw_bind.va = - cpu_to_wr64((u64)(unsigned long) + cpu_to_be64((u64)(unsigned long) wr_list->wr_u.mw_bind.va); break; } @@ -501,7 +501,7 @@ wr.sqwr.stag_inv.stag_key = wr_list->wr_u.inv_stag.stag_key; wr.sqwr.stag_inv.stag_index = - cpu_to_wr32(wr_list->wr_u.inv_stag.stag_index); + cpu_to_be32(wr_list->wr_u.inv_stag.stag_index); break; case CC_WR_TYPE_NOP: Index: cc_wr.h =================================================================== --- cc_wr.h (revision 3072) +++ cc_wr.h (working copy) @@ -23,17 +23,6 @@ * in common/include/clustercore/cc_ivn.h. */ -/* - * Work Request Byte Order - define only one of these. 
- */ -#define WR_BYTE_ORDER_BIG_ENDIAN -/*#define WR_BYTE_ORDER_LITTLE_ENDIAN */ - -/* - * Now include host or adapter specific macros - */ -#include "cc_byteorder.h" - #ifdef CCDEBUG #define CCWR_MAGIC 0xb07700b0 #endif Index: devccil_qp.c =================================================================== --- devccil_qp.c (revision 3072) +++ devccil_qp.c (working copy) @@ -348,8 +348,8 @@ wr.rnic_handle = devp->adapter_handle; wr.sq_cq_handle = qp->sq_cq->adapter_handle; wr.rq_cq_handle = qp->rq_cq->adapter_handle; - wr.sq_depth = cpu_to_wr32(p_attrs->sq_depth+1); - wr.rq_depth = cpu_to_wr32(p_attrs->rq_depth+1); + wr.sq_depth = cpu_to_be32(p_attrs->sq_depth+1); + wr.rq_depth = cpu_to_be32(p_attrs->rq_depth+1); if (qp->srq) { wr.srq_handle = qp->srq->adapter_handle; } else { @@ -361,14 +361,14 @@ flags |= p_attrs->zero_stag_enabled ? QP_ZERO_STAG : 0; flags |= p_attrs->rdma_read_response_enabled ? QP_RDMA_READ_RESPONSE : 0; - wr.flags = cpu_to_wr32(flags); - wr.send_sgl_depth = cpu_to_wr32(p_attrs->send_sgl_depth); - wr.recv_sgl_depth = cpu_to_wr32(p_attrs->recv_sgl_depth); - wr.rdma_write_sgl_depth = cpu_to_wr32(p_attrs->rdma_write_sgl_depth); - wr.shared_sq_ht = cpu_to_wr64(__pa(qp->sq_mq.shared_kva)); - wr.shared_rq_ht = cpu_to_wr64(__pa(qp->rq_mq.shared_kva)); - wr.ord = cpu_to_wr32(p_attrs->ord); - wr.ird = cpu_to_wr32(p_attrs->ird); + wr.flags = cpu_to_be32(flags); + wr.send_sgl_depth = cpu_to_be32(p_attrs->send_sgl_depth); + wr.recv_sgl_depth = cpu_to_be32(p_attrs->recv_sgl_depth); + wr.rdma_write_sgl_depth = cpu_to_be32(p_attrs->rdma_write_sgl_depth); + wr.shared_sq_ht = cpu_to_be64(__pa(qp->sq_mq.shared_kva)); + wr.shared_rq_ht = cpu_to_be64(__pa(qp->rq_mq.shared_kva)); + wr.ord = cpu_to_be32(p_attrs->ord); + wr.ird = cpu_to_be32(p_attrs->ird); wr.pd_id = p_attrs->pdid; /* * the kernel caller will pass in a null context, in which case @@ -414,10 +414,10 @@ goto bail5; } - qp->sq_mq.mq_idx = wr32_to_cpu(reply->sq_mq_index); - qp->sq_mq.u.a.msg_pool_aoff = 
wr32_to_cpu(reply->sq_mq_start); - qp->rq_mq.mq_idx = wr32_to_cpu(reply->rq_mq_index); - qp->rq_mq.u.a.msg_pool_aoff = wr32_to_cpu(reply->rq_mq_start); + qp->sq_mq.mq_idx = be32_to_cpu(reply->sq_mq_index); + qp->sq_mq.u.a.msg_pool_aoff = be32_to_cpu(reply->sq_mq_start); + qp->rq_mq.mq_idx = be32_to_cpu(reply->rq_mq_index); + qp->rq_mq.u.a.msg_pool_aoff = be32_to_cpu(reply->rq_mq_start); qp->adapter_handle = reply->qp_handle; /* @@ -426,23 +426,23 @@ * NOTE: The sq/rq depth fields in the wr are the mq q_size values, * which are the sq/rq depth + 1. */ - p_attrs->sq_depth = wr32_to_cpu(reply->sq_depth)-1; - p_attrs->rq_depth = wr32_to_cpu(reply->rq_depth)-1; - p_attrs->send_sgl_depth = wr32_to_cpu(reply->send_sgl_depth); - p_attrs->recv_sgl_depth = wr32_to_cpu(reply->recv_sgl_depth); + p_attrs->sq_depth = be32_to_cpu(reply->sq_depth)-1; + p_attrs->rq_depth = be32_to_cpu(reply->rq_depth)-1; + p_attrs->send_sgl_depth = be32_to_cpu(reply->send_sgl_depth); + p_attrs->recv_sgl_depth = be32_to_cpu(reply->recv_sgl_depth); p_attrs->rdma_write_sgl_depth = - wr32_to_cpu(reply->rdma_write_sgl_depth); - p_attrs->ird = wr32_to_cpu(reply->ird); - p_attrs->ord = wr32_to_cpu(reply->ord); + be32_to_cpu(reply->rdma_write_sgl_depth); + p_attrs->ird = be32_to_cpu(reply->ird); + p_attrs->ord = be32_to_cpu(reply->ord); /* * save off sq/rq size and depth for use after common qp * create code completes */ - qp->sq_mq_handle.msg_size = wr32_to_cpu(reply->sq_msg_size); - qp->sq_mq_handle.q_size = wr32_to_cpu(reply->sq_depth); - qp->rq_mq_handle.msg_size = wr32_to_cpu(reply->rq_msg_size); - qp->rq_mq_handle.q_size = wr32_to_cpu(reply->rq_depth); + qp->sq_mq_handle.msg_size = be32_to_cpu(reply->sq_msg_size); + qp->sq_mq_handle.q_size = be32_to_cpu(reply->sq_depth); + qp->rq_mq_handle.msg_size = be32_to_cpu(reply->rq_msg_size); + qp->rq_mq_handle.q_size = be32_to_cpu(reply->rq_depth); /* * Free Msg @@ -700,14 +700,14 @@ wr.hdr.context = CC_PTR_TO_CTX(vq_req); wr.rnic_handle = 
devp->adapter_handle; wr.qp_handle = (u32)qp->adapter_handle; - wr.next_qp_state = cpu_to_wr32(p_attrs->next_qp_state); - wr.ord = cpu_to_wr32(p_attrs->ord); - wr.ird = cpu_to_wr32(p_attrs->ird); - wr.sq_depth = cpu_to_wr32(sq_depth); - wr.rq_depth = cpu_to_wr32(rq_depth); - /*wr.llp_ep_handle = cpu_to_wr32((u32)p_attrs->llp_ep); */ - wr.stream_msg_length = cpu_to_wr32(p_attrs->stream_message_length); - wr.stream_msg = cpu_to_wr64((unsigned long)p_attrs->stream_message_buffer); + wr.next_qp_state = cpu_to_be32(p_attrs->next_qp_state); + wr.ord = cpu_to_be32(p_attrs->ord); + wr.ird = cpu_to_be32(p_attrs->ird); + wr.sq_depth = cpu_to_be32(sq_depth); + wr.rq_depth = cpu_to_be32(rq_depth); + /*wr.llp_ep_handle = cpu_to_be32((u32)p_attrs->llp_ep); */ + wr.stream_msg_length = cpu_to_be32(p_attrs->stream_message_length); + wr.stream_msg = cpu_to_be64((unsigned long)p_attrs->stream_message_buffer); /* * reference the request struct. dereferenced in the int handler. @@ -747,14 +747,14 @@ * Update actuals for user. */ if (!devp->rnic_attrs.ord_static) { - p_attrs->ord = wr32_to_cpu(reply->ord); + p_attrs->ord = be32_to_cpu(reply->ord); } if (!devp->rnic_attrs.ird_static) { - p_attrs->ird = wr32_to_cpu(reply->ird); + p_attrs->ird = be32_to_cpu(reply->ird); } if (!devp->rnic_attrs.qp_depth_static) { - p_attrs->sq_depth = wr32_to_cpu(reply->sq_depth)-1; - p_attrs->rq_depth = wr32_to_cpu(reply->rq_depth)-1; + p_attrs->sq_depth = be32_to_cpu(reply->sq_depth)-1; + p_attrs->rq_depth = be32_to_cpu(reply->rq_depth)-1; } bail2: @@ -833,17 +833,17 @@ p_attrs->sq_cq = qp->sq_cq->user_handle; p_attrs->rq_cq = qp->rq_cq->user_handle; p_attrs->srq = qp->srq ? 
CC_PTR_TO_64(qp->srq->user_handle) : 0; - p_attrs->sq_depth = wr32_to_cpu(reply->sq_depth)-1; - p_attrs->rq_depth = wr32_to_cpu(reply->rq_depth)-1; - p_attrs->send_sgl_depth = wr32_to_cpu(reply->send_sgl_depth); - p_attrs->recv_sgl_depth = wr32_to_cpu(reply->recv_sgl_depth); - p_attrs->rdma_write_sgl_depth = wr32_to_cpu(reply->rdma_write_sgl_depth); - p_attrs->ord = wr32_to_cpu(reply->ord); - p_attrs->ird = wr32_to_cpu(reply->ird); + p_attrs->sq_depth = be32_to_cpu(reply->sq_depth)-1; + p_attrs->rq_depth = be32_to_cpu(reply->rq_depth)-1; + p_attrs->send_sgl_depth = be32_to_cpu(reply->send_sgl_depth); + p_attrs->recv_sgl_depth = be32_to_cpu(reply->recv_sgl_depth); + p_attrs->rdma_write_sgl_depth = be32_to_cpu(reply->rdma_write_sgl_depth); + p_attrs->ord = be32_to_cpu(reply->ord); + p_attrs->ird = be32_to_cpu(reply->ird); p_attrs->pd_id = qp->pd->pd_id; - p_attrs->qp_id = wr32_to_cpu(reply->qp_id); + p_attrs->qp_id = be32_to_cpu(reply->qp_id); p_attrs->llp_ep = qp->ep ? (u64)(qp->ep->adapter_handle) : 0; - flags = wr16_to_cpu(reply->flags); + flags = be16_to_cpu(reply->flags); p_attrs->rdma_read_enabled = (flags&QP_RDMA_READ); p_attrs->rdma_write_enabled = (flags&QP_RDMA_WRITE); p_attrs->rdma_read_response_enabled = (flags&QP_RDMA_READ_RESPONSE); @@ -854,16 +854,16 @@ p_attrs->local_port = reply->local_port; p_attrs->remote_addr = reply->remote_addr; p_attrs->remote_port = reply->remote_port; - p_attrs->user_context = CC_CTX_TO_PTR(wr64_to_cpu(reply->user_context), void *); - p_attrs->qp_state = wr16_to_cpu(reply->qp_state); + p_attrs->user_context = CC_CTX_TO_PTR(be64_to_cpu(reply->user_context), void *); + p_attrs->qp_state = be16_to_cpu(reply->qp_state); /* * If the caller wants the terminate message, then copy it out... 
*/ if (p_attrs->terminate_message && p_attrs->terminate_message_length) { - if (wr32_to_cpu(reply->terminate_msg_length) > 0) { + if (be32_to_cpu(reply->terminate_msg_length) > 0) { p_attrs->terminate_message_length = - ccmin(wr32_to_cpu(reply->terminate_msg_length), + ccmin(be32_to_cpu(reply->terminate_msg_length), p_attrs->terminate_message_length); memcpy(p_attrs->terminate_message, &reply->data, p_attrs->terminate_message_length); @@ -931,7 +931,7 @@ * the WR. */ if (p_private_data) { - wr->private_data_length = cpu_to_wr32(private_data_length); + wr->private_data_length = cpu_to_be32(private_data_length); memcpy(&wr->private_data[0], p_private_data, private_data_length); } else { wr->private_data_length = 0; Index: cc_cq_common.c =================================================================== --- cc_cq_common.c (revision 3072) +++ cc_cq_common.c (working copy) @@ -90,16 +90,16 @@ p_wc->wr_type = cc_wr_get_id(ce); p_wc->wr_id = ce->hdr.context; p_wc->status = cc_wr_get_result(ce); - p_wc->bytes_rcvd = wr32_to_cpu(ce->bytes_rcvd); + p_wc->bytes_rcvd = be32_to_cpu(ce->bytes_rcvd); p_wc->stag_invalidated = (ce->stag != 0); - *(u32*)&p_wc->stag = wr32_to_cpu(ce->stag); + *(u32*)&p_wc->stag = be32_to_cpu(ce->stag); p_wc->qp_id = ce->handle; /* * update the qp state */ ASSERT(VALID_MAGIC(qp->magic, QP_MAGIC)); - qp->qp_state = wr32_to_cpu(ce->qp_state); + qp->qp_state = be32_to_cpu(ce->qp_state); /* * Consume WQEs on the SQ or RQ now. 
The completion event @@ -109,7 +109,7 @@ cc_mq_lconsume(&qp->rq_mq_handle, 1); } else { cc_mq_lconsume(&qp->sq_mq_handle, - wr32_to_cpu(cc_wr_get_wqe_count(ce))+1); + be32_to_cpu(cc_wr_get_wqe_count(ce))+1); } /* @@ -209,7 +209,7 @@ u16 priv = q->priv; ccwr_ce_t *msg; - while (priv != cpu_to_wr16(*q->shared)) { + while (priv != cpu_to_be16(*q->shared)) { msg = (ccwr_ce_t *)(q->msg_pool + priv * q->msg_size); if (msg->qp_user_context == (u64)(unsigned long)qp) { msg->qp_user_context = (u64)0; @@ -266,7 +266,7 @@ bo_cq_dump() { cc_user_cq_t *cq = bo_last_cq; - u16 shared = wr16_to_cpu(*cq->mq.shared); + u16 shared = be16_to_cpu(*cq->mq.shared); cc_mq_shared_t peer = *cq->mq.peer; ccwr_ce_t *ce; int i, priv, count, atend; @@ -284,7 +284,7 @@ peer.notification_type == CC_CQ_NOTIFICATION_TYPE_NEXT_SE? "CC_CQ_NOTIFICATION_TYPE_NEXT_SE": peer.notification_type == CC_CQ_NOTIFICATION_TYPE_NONE? "CC_CQ_NOTIFICATION_TYPE_NONE": "", peer.notification_type, - wr16_to_cpu(peer.shared)); + be16_to_cpu(peer.shared)); priv = cq->mq.priv; ce = (ccwr_ce_t *)(cq->mq.msg_pool + priv * cq->mq.msg_size); @@ -295,14 +295,14 @@ printf("\t%c%3d: id %s (%d) result %d mag %08x rcvd %d state %s (%d) wqe %d\n", atend? 
' ':'N', priv, - bo_wr_name(wr16_to_cpu(ce->hdr.id)), + bo_wr_name(be16_to_cpu(ce->hdr.id)), cc_wr_get_id(ce), cc_wr_get_result(ce), - wr32_to_cpu(ce->hdr.magic), - wr32_to_cpu(ce->bytes_rcvd), - bo_qp_state_name(wr32_to_cpu(ce->qp_state)), - wr32_to_cpu(ce->qp_state), - wr32_to_cpu(cc_wr_get_wqe_count(ce))); + be32_to_cpu(ce->hdr.magic), + be32_to_cpu(ce->bytes_rcvd), + bo_qp_state_name(be32_to_cpu(ce->qp_state)), + be32_to_cpu(ce->qp_state), + be32_to_cpu(cc_wr_get_wqe_count(ce))); priv = (priv + 1) % cq->mq.q_size; ce = (ccwr_ce_t *)(cq->mq.msg_pool + priv * cq->mq.msg_size); } Index: devccil_rnic.c =================================================================== --- devccil_rnic.c (revision 3072) +++ devccil_rnic.c (working copy) @@ -387,8 +387,8 @@ if (capable(CAP_SYS_ADMIN)) { wr.rnic_open.req.flags |= RNIC_PRIV_MODE; } - wr.rnic_open.req.flags = cpu_to_wr16(wr.rnic_open.req.flags); - wr.rnic_open.req.port_num = cpu_to_wr16(myport); + wr.rnic_open.req.flags = cpu_to_be16(wr.rnic_open.req.flags); + wr.rnic_open.req.port_num = cpu_to_be16(myport); wr.rnic_open.req.user_context = CC_PTR_TO_CTX(rnic); /* @@ -1057,60 +1057,60 @@ /* * marshall the query attrs into the request buffer */ - p_attrs->vendor_id.vendor_id = wr32_to_cpu(reply->vendor_id); - p_attrs->vendor_id.part_number =wr32_to_cpu(reply->part_number); + p_attrs->vendor_id.vendor_id = be32_to_cpu(reply->vendor_id); + p_attrs->vendor_id.part_number =be32_to_cpu(reply->part_number); p_attrs->vendor_id.hardware_version = - wr32_to_cpu(reply->hw_version); + be32_to_cpu(reply->hw_version); p_attrs->vendor_id.fw_ver_major = - wr32_to_cpu(reply->fw_ver_major); + be32_to_cpu(reply->fw_ver_major); p_attrs->vendor_id.fw_ver_minor = - wr32_to_cpu(reply->fw_ver_minor); + be32_to_cpu(reply->fw_ver_minor); p_attrs->vendor_id.fw_ver_patch = - wr32_to_cpu(reply->fw_ver_patch); - p_attrs->max_qps = wr32_to_cpu(reply->max_qps); - p_attrs->max_srq_depth = wr32_to_cpu(reply->max_srq_depth); + 
be32_to_cpu(reply->fw_ver_patch); + p_attrs->max_qps = be32_to_cpu(reply->max_qps); + p_attrs->max_srq_depth = be32_to_cpu(reply->max_srq_depth); memcpy(p_attrs->vendor_id.fw_ver_build_str, reply->fw_ver_build_str, CC_BUILD_STR_LEN); p_attrs->vendor_id.fw_ver_build_str[CC_BUILD_STR_LEN-1] = 0; p_attrs->max_send_sgl_depth = - wr32_to_cpu(reply->max_send_sgl_depth); + be32_to_cpu(reply->max_send_sgl_depth); p_attrs->max_rdma_sgl_depth = - wr32_to_cpu(reply->max_rdma_sgl_depth); - p_attrs->max_cqs = wr32_to_cpu(reply->max_cqs); - p_attrs->max_cq_depth = wr32_to_cpu(reply->max_cq_depth); - p_attrs->max_qp_depth = wr32_to_cpu(reply->max_qp_depth); + be32_to_cpu(reply->max_rdma_sgl_depth); + p_attrs->max_cqs = be32_to_cpu(reply->max_cqs); + p_attrs->max_cq_depth = be32_to_cpu(reply->max_cq_depth); + p_attrs->max_qp_depth = be32_to_cpu(reply->max_qp_depth); p_attrs->max_cq_ehs = CC_MAX_EHS; - p_attrs->max_mrs = wr32_to_cpu(reply->max_mrs); - p_attrs->max_pbl_depth = wr32_to_cpu(reply->max_pbl_depth); - p_attrs->max_pds = wr32_to_cpu(reply->max_pds); - p_attrs->max_ird = wr32_to_cpu(reply->max_global_ird); - p_attrs->max_ord = wr32_to_cpu(reply->max_global_ord); - p_attrs->max_qp_ird = wr32_to_cpu(reply->max_qp_ird); - p_attrs->max_qp_ord = wr32_to_cpu(reply->max_qp_ord); - p_attrs->ird_static = wr32_to_cpu(reply->flags)&RNIC_IRD_STATIC; - p_attrs->ord_static = wr32_to_cpu(reply->flags)&RNIC_ORD_STATIC; + p_attrs->max_mrs = be32_to_cpu(reply->max_mrs); + p_attrs->max_pbl_depth = be32_to_cpu(reply->max_pbl_depth); + p_attrs->max_pds = be32_to_cpu(reply->max_pds); + p_attrs->max_ird = be32_to_cpu(reply->max_global_ird); + p_attrs->max_ord = be32_to_cpu(reply->max_global_ord); + p_attrs->max_qp_ird = be32_to_cpu(reply->max_qp_ird); + p_attrs->max_qp_ord = be32_to_cpu(reply->max_qp_ord); + p_attrs->ird_static = be32_to_cpu(reply->flags)&RNIC_IRD_STATIC; + p_attrs->ord_static = be32_to_cpu(reply->flags)&RNIC_ORD_STATIC; p_attrs->qp_depth_static = - 
wr32_to_cpu(reply->flags)&RNIC_QP_STATIC; + be32_to_cpu(reply->flags)&RNIC_QP_STATIC; p_attrs->srq_supported = - wr32_to_cpu(reply->flags)&RNIC_SRQ_SUPPORTED; + be32_to_cpu(reply->flags)&RNIC_SRQ_SUPPORTED; p_attrs->cq_overflow_detected = - wr32_to_cpu(reply->flags)&RNIC_CQ_OVF_DETECTED; - p_attrs->max_mws = wr32_to_cpu(reply->max_mws); - p_attrs->max_srqs = wr32_to_cpu(reply->max_srqs); - if (wr32_to_cpu(reply->flags)&RNIC_PBL_BLOCK_MODE) { + be32_to_cpu(reply->flags)&RNIC_CQ_OVF_DETECTED; + p_attrs->max_mws = be32_to_cpu(reply->max_mws); + p_attrs->max_srqs = be32_to_cpu(reply->max_srqs); + if (be32_to_cpu(reply->flags)&RNIC_PBL_BLOCK_MODE) { p_attrs->pbl_mode = CC_PBL_BLOCK_MODE; } else { p_attrs->pbl_mode = CC_PBL_PAGE_MODE; } - if (wr32_to_cpu(reply->flags)&RNIC_SRQ_MODEL_ARRIVAL) { + if (be32_to_cpu(reply->flags)&RNIC_SRQ_MODEL_ARRIVAL) { p_attrs->srq_model = CC_SRQ_MODEL_ARRIVAL_ORDER; } else { p_attrs->srq_model = CC_SRQ_MODEL_SEQUENTIAL_ORDER; } - p_attrs->pbe_range.range_low =wr32_to_cpu(reply->pbe_range_low); + p_attrs->pbe_range.range_low =be32_to_cpu(reply->pbe_range_low); p_attrs->pbe_range.range_high = - wr32_to_cpu(reply->pbe_range_high); - p_attrs->page_size =wr32_to_cpu(reply->page_size); + be32_to_cpu(reply->pbe_range_high); + p_attrs->page_size =be32_to_cpu(reply->page_size); p_attrs->user_context = CC_64_TO_PTR(reply->user_context); bail2: @@ -1185,7 +1185,7 @@ cc_wr_set_id(wr, CCWR_RNIC_SETCONFIG); wr->hdr.context = CC_PTR_TO_CTX(vq_req); wr->rnic_handle = devp->adapter_handle; - wr->option = cpu_to_wr32(cmd); + wr->option = cpu_to_be32(cmd); /* * Move the cmd-specific data into the wr. 
@@ -1306,13 +1306,13 @@ cc_wr_set_id(&wr, CCWR_RNIC_GETCONFIG); wr.hdr.context = CC_PTR_TO_CTX(vq_req); wr.rnic_handle = devp->adapter_handle; - wr.option = cpu_to_wr32(cmd); + wr.option = cpu_to_be32(cmd); if (buf) { - wr.reply_buf = cpu_to_wr64((u64)__pa(wr_buf)); + wr.reply_buf = cpu_to_be64((u64)__pa(wr_buf)); } else { wr.reply_buf = (u64)NULL; } - wr.reply_buf_len = cpu_to_wr32(size); + wr.reply_buf_len = cpu_to_be32(size); /* * reference the request struct. dereferenced in the int handler. @@ -1351,7 +1351,7 @@ /* * update the caller's notion of the count length */ - *buf_len = wr32_to_cpu(reply->count_len); + *buf_len = be32_to_cpu(reply->count_len); /* * copy the data to the caller's buffer. @@ -1573,8 +1573,8 @@ cc_wr_set_id(&wr, CCWR_CONSOLE); wr.console.req.hdr.context = CC_PTR_TO_CTX(vq_req); wr.console.req.reply_buf = - cpu_to_wr64((u64)__pa(reply_buf)); - wr.console.req.reply_buf_len = cpu_to_wr32(req->reply_buf_len); + cpu_to_be64((u64)__pa(reply_buf)); + wr.console.req.reply_buf_len = cpu_to_be32(req->reply_buf_len); memcpy(wr.console.req.command, req->cmd, req->cmd_len); /* @@ -1624,7 +1624,7 @@ if (cc != 0) { req->status = CCERR_INVALID_MODIFIER; } - req->truncated = (wr32_to_cpu(reply->flags) & CONS_REPLY_TRUNCATED); + req->truncated = (be32_to_cpu(reply->flags) & CONS_REPLY_TRUNCATED); vq_repbuf_free(devp->cca, reply); vq_req_free(devp->cca, vq_req); @@ -1671,8 +1671,8 @@ * Map the log pages to the user process. 
*/ pci = (cc_adapter_pci_regs_t*)devp->cca->kva; - log_size = wr32_to_cpu(pci->log_size); - log_pa = (unsigned long)aoff_to_phys(devp->cca, wr32_to_cpu(pci->log_start)) & CC_PAGEMASK; + log_size = be32_to_cpu(pci->log_size); + log_pa = (unsigned long)aoff_to_phys(devp->cca, be32_to_cpu(pci->log_start)) & CC_PAGEMASK; down_write(&current->mm->mmap_sem); ret = do_mmap(filp, 0, log_size, PROT_READ|PROT_WRITE, MAP_SHARED, log_pa); up_write(&current->mm->mmap_sem); @@ -1682,7 +1682,7 @@ return; } req->log_start = (void *)ret; - req->log_size = wr32_to_cpu(pci->log_size); + req->log_size = be32_to_cpu(pci->log_size); req->status = CC_OK; return; } @@ -1789,17 +1789,17 @@ /* * Map adapter flash buffer to user process. */ - adapter_flash_buf_len = wr32_to_cpu(reply->adapter_flash_len); + adapter_flash_buf_len = be32_to_cpu(reply->adapter_flash_len); adapter_flash_buf_pa = (unsigned long)aoff_to_phys(devp->cca, - wr32_to_cpu(reply->adapter_flash_buf_offset) & CC_PAGEMASK); + be32_to_cpu(reply->adapter_flash_buf_offset) & CC_PAGEMASK); down_write(&current->mm->mmap_sem); ret = do_mmap(filp, 0, adapter_flash_buf_len, PROT_READ|PROT_WRITE, MAP_SHARED, adapter_flash_buf_pa); up_write(&current->mm->mmap_sem); DEVCCIL_LOG(KERN_INFO "flash init: flash buf offset 0x%x pa 0x%x len %d uva 0x%x\n", - wr32_to_cpu(reply->adapter_flash_buf_offset), + be32_to_cpu(reply->adapter_flash_buf_offset), (int)adapter_flash_buf_pa, - wr32_to_cpu(reply->adapter_flash_len), + be32_to_cpu(reply->adapter_flash_len), (int)ret); if (IS_ERR((void *)ret)) { req->status = CCERR_INSUFFICIENT_RESOURCES; @@ -1876,8 +1876,8 @@ cc_wr_set_id(&wr, CCWR_FLASH); wr.flash.req.hdr.context = CC_PTR_TO_CTX(vq_req); wr.flash.req.rnic_handle = devp->adapter_handle; - wr.flash.req.len = cpu_to_wr32(req->len); - DEVCCIL_LOG(KERN_INFO "flash: len %d\n", cpu_to_wr32(req->len)); + wr.flash.req.len = cpu_to_be32(req->len); + DEVCCIL_LOG(KERN_INFO "flash: len %d\n", cpu_to_be32(req->len)); /* * reference the request struct.
dereferenced in the int handler. @@ -1935,7 +1935,7 @@ * Return flash status to app. The flash_status field contains * flash-part-specific status information (see cc_flash_status_t). */ - req->flash_status = cpu_to_wr32(reply->status); + req->flash_status = cpu_to_be32(reply->status); req->status = CC_OK; @@ -2063,8 +2063,8 @@ cc_wr_set_id(&wr, CCWR_BUF_ALLOC); wr.buf_alloc.req.hdr.context = CC_PTR_TO_CTX(vq_req); wr.buf_alloc.req.rnic_handle = devp->adapter_handle; - wr.buf_alloc.req.size = cpu_to_wr32(req_len); - DEVCCIL_LOG(KERN_INFO "rnic_buf_alloc: len %d\n", cpu_to_wr32(req_len)); + wr.buf_alloc.req.size = cpu_to_be32(req_len); + DEVCCIL_LOG(KERN_INFO "rnic_buf_alloc: len %d\n", cpu_to_be32(req_len)); /* * reference the request struct. dereferenced in the int handler. @@ -2106,10 +2106,10 @@ return status; } - *act_len = wr32_to_cpu(reply->size); + *act_len = be32_to_cpu(reply->size); *addr_adapter = reply->offset; *addr_phys = aoff_to_phys(devp->cca, - wr32_to_cpu(reply->offset) & CC_PAGEMASK); + be32_to_cpu(reply->offset) & CC_PAGEMASK); /* Free the request and reply */ vq_repbuf_free(devp->cca, reply); @@ -2141,7 +2141,7 @@ cc_wr_set_id(&wr, CCWR_BUF_FREE); wr.buf_free.req.hdr.context = CC_PTR_TO_CTX(vq_req); wr.buf_free.req.rnic_handle = devp->adapter_handle; - wr.buf_free.req.size = cpu_to_wr32(len); + wr.buf_free.req.size = cpu_to_be32(len); wr.buf_free.req.offset = addr_adapter; /* @@ -2215,8 +2215,8 @@ cc_wr_set_id(&wr, CCWR_FLASH_WRITE); wr.flash_write.req.hdr.context = CC_PTR_TO_CTX(vq_req); wr.flash_write.req.rnic_handle = devp->adapter_handle; - wr.flash_write.req.type = cpu_to_wr32(type); - wr.flash_write.req.size = cpu_to_wr32(len); + wr.flash_write.req.type = cpu_to_be32(type); + wr.flash_write.req.size = cpu_to_be32(len); wr.flash_write.req.offset = addr_adapter; /* @@ -2258,7 +2258,7 @@ vq_req_free(devp->cca, vq_req); return status; } - *flash_status = cpu_to_wr32(reply->status); + *flash_status = cpu_to_be32(reply->status); /* Free the 
request and reply */ vq_repbuf_free(devp->cca, reply); Index: cc_mq_common.c =================================================================== --- cc_mq_common.c (revision 3072) +++ cc_mq_common.c (working copy) @@ -20,7 +20,7 @@ extern void cc_memcpy8(u64 *, u64 *, s32); #define BUMP(q,p) (p) = ((p)+1) % (q)->q_size -#define BUMP_SHARED(q,p) (p) = cpu_to_wr16((wr16_to_cpu(p)+1) % (q)->q_size) +#define BUMP_SHARED(q,p) (p) = cpu_to_be16((be16_to_cpu(p)+1) % (q)->q_size) #ifdef CC_STALL_DEBUG /* For debug only. */ @@ -42,15 +42,15 @@ ccwr_hdr_t *m = (ccwr_hdr_t*)(q->msg_pool + q->priv * q->msg_size); #if 0 unsigned int bar=0; - while (m->magic != wr32_to_cpu(~CCWR_MAGIC)) { + while (m->magic != be32_to_cpu(~CCWR_MAGIC)) { bar++; if (bar >= 10000) ASSERT(0); } if (bar) CC_WARN_LOG(CCIL_LOG_MQ|CCIL_LOG_WARNING,"cc_mq_alloc spun %d times\n", bar); #endif #ifdef CCMSGMAGIC - ASSERT(m->magic == wr32_to_cpu(~CCWR_MAGIC)); - m->magic = cpu_to_wr32(CCWR_MAGIC); + ASSERT(m->magic == be32_to_cpu(~CCWR_MAGIC)); + m->magic = cpu_to_be32(CCWR_MAGIC); #endif CC_LOG(CCIL_LOG_MQ|CCIL_LOG_DEBUG,"cc_mq_alloc %p\n", m); return m; @@ -78,7 +78,7 @@ BUMP(q, q->priv); q->hint_count++; /* Update peer's offset. 
*/ - q->peer->shared = cpu_to_wr16(q->priv); + q->peer->shared = cpu_to_be16(q->priv); } } @@ -97,14 +97,14 @@ (q->msg_pool + q->priv * q->msg_size); #if 0 unsigned int bar=0; - while (m->magic != wr32_to_cpu(CCWR_MAGIC)) { + while (m->magic != be32_to_cpu(CCWR_MAGIC)) { bar++; if (bar >= 10000) ASSERT(0); } if (bar) CC_WARN_LOG(CCIL_LOG_MQ|CCIL_LOG_WARNING,"cc_mq_consume spun %d times\n",bar); #endif #ifdef CCMSGMAGIC - ASSERT(m->magic == wr32_to_cpu(CCWR_MAGIC)); + ASSERT(m->magic == be32_to_cpu(CCWR_MAGIC)); #endif CC_LOG(CCIL_LOG_MQ|CCIL_LOG_DEBUG,"cc_mq_consume %p\n", m); return m; @@ -133,12 +133,12 @@ { ccwr_hdr_t *m = (ccwr_hdr_t*) (q->msg_pool + q->priv * q->msg_size); - m->magic = cpu_to_wr32(~CCWR_MAGIC); + m->magic = cpu_to_be32(~CCWR_MAGIC); } #endif BUMP(q, q->priv); /* Update peer's offset. */ - q->peer->shared = cpu_to_wr16(q->priv); + q->peer->shared = cpu_to_be16(q->priv); } } @@ -164,9 +164,9 @@ ASSERT(q); if (q->type == CC_MQ_HOST_TARGET) { - count = wr16_to_cpu(*q->shared) - q->priv; + count = be16_to_cpu(*q->shared) - q->priv; } else { - count = q->priv - wr16_to_cpu(*q->shared); + count = q->priv - be16_to_cpu(*q->shared); } if (count < 0) { @@ -176,3 +176,4 @@ return (u32)count; } #endif /* #ifndef _CC_MQ_COMMON_C_ */ + Index: devccil_ae.c =================================================================== --- devccil_ae.c (revision 3072) +++ devccil_ae.c (working copy) @@ -127,7 +127,7 @@ /* * Save current state. */ - req->qp_state = wr32_to_cpu(wr->ae.ae_generic.qp_state); + req->qp_state = be32_to_cpu(wr->ae.ae_generic.qp_state); /* * We update the resource indicator (type) and the resource @@ -136,7 +136,7 @@ * QP ID... 
*/ req->er.resource_indicator = - wr32_to_cpu(wr->ae.ae_generic.resource_type); + be32_to_cpu(wr->ae.ae_generic.resource_type); req->er.resource_user_context = CC_CTX_TO_PTR(wr->ae.ae_generic.user_context, void *); /* @@ -156,7 +156,7 @@ req->er.event_data.active_connect_results.rport = wr->ae.ae_active_connect_results.rport; req->er.event_data.active_connect_results.private_data_length = - wr32_to_cpu(wr->ae.ae_active_connect_results.private_data_length); + be32_to_cpu(wr->ae.ae_active_connect_results.private_data_length); memcpy(req->er.event_data.active_connect_results.private_data, wr->ae.ae_active_connect_results.private_data, req->er.event_data.active_connect_results.private_data_length); @@ -173,7 +173,7 @@ req->er.event_data.connection_request.rport = wr->ae.ae_connection_request.rport; req->er.event_data.connection_request.private_data_length = - wr32_to_cpu(wr->ae.ae_connection_request.private_data_length); + be32_to_cpu(wr->ae.ae_connection_request.private_data_length); memcpy(req->er.event_data.connection_request.private_data, wr->ae.ae_connection_request.private_data, req->er.event_data.connection_request.private_data_length); Index: devccil_vq.c =================================================================== --- devccil_vq.c (revision 3072) +++ devccil_vq.c (working copy) @@ -230,7 +230,7 @@ * copy wr into adapter msg */ #ifdef CCMSGMAGIC - ((ccwr_hdr_t*)wr)->magic = cpu_to_wr32(CCWR_MAGIC); + ((ccwr_hdr_t*)wr)->magic = cpu_to_be32(CCWR_MAGIC); #endif memcpy(msg, wr, cca->req_vq.msg_size); Index: devccil_mm.c =================================================================== --- devccil_mm.c (revision 3072) +++ devccil_mm.c (working copy) @@ -519,7 +519,7 @@ wr->flags = 0; while (pbl_depth) { count = ccmin(pbe_count, pbl_depth); - wr->addrs_length = cpu_to_wr32(count); + wr->addrs_length = cpu_to_be32(count); /* * If this is the last message, then reference the @@ -532,7 +532,7 @@ * int handler. 
*/ vq_req_get(devp->cca, vq_req); - wr->flags = cpu_to_wr32(MEM_PBL_COMPLETE); + wr->flags = cpu_to_be32(MEM_PBL_COMPLETE); /* * This is the last PBL message. @@ -554,10 +554,10 @@ */ for (i=0; i < count; i++) { if (pbl_virt) { - wr->paddrs[i] = cpu_to_wr64(user_virt_to_phys(va)); + wr->paddrs[i] = cpu_to_be64(user_virt_to_phys(va)); va += PAGE_SIZE; } else { - wr->paddrs[i] = cpu_to_wr64((u64)(unsigned long)((void **)va)[i]); + wr->paddrs[i] = cpu_to_be64((u64)(unsigned long)((void **)va)[i]); } } @@ -779,16 +779,16 @@ if (pbl_depth <= pbe_count) { flags |= MEM_PBL_COMPLETE; } - wr->flags = cpu_to_wr16(flags); + wr->flags = cpu_to_be16(flags); wr->stag_key = stag_key; - wr->va = cpu_to_wr64((u64)(unsigned long)va); + wr->va = cpu_to_be64((u64)(unsigned long)va); wr->pd_id = pdid; - wr->pbe_size = cpu_to_wr32(CC_PAGESIZE); - wr->length = cpu_to_wr32(length); - wr->pbl_depth = cpu_to_wr32(pbl_depth); - wr->fbo = cpu_to_wr32(fbo); + wr->pbe_size = cpu_to_be32(CC_PAGESIZE); + wr->length = cpu_to_be32(length); + wr->pbl_depth = cpu_to_be32(pbl_depth); + wr->fbo = cpu_to_be32(fbo); count = ccmin(pbl_depth, pbe_count); - wr->addrs_length = cpu_to_wr32(count); + wr->addrs_length = cpu_to_be32(count); /* * Fill out the PBL for this message @@ -797,10 +797,10 @@ pbe_count, count); tmp_va = (unsigned long)va; for (i=0; i < count; i++) { - wr->paddrs[i] = cpu_to_wr64(user_virt_to_phys(tmp_va)); + wr->paddrs[i] = cpu_to_be64(user_virt_to_phys(tmp_va)); DEVCCIL_LOG(KERN_INFO " paddr[%d] = 0x%Lx %Lu\n", i, - wr64_to_cpu(wr->paddrs[i]), - wr64_to_cpu(wr->paddrs[i])); + be64_to_cpu(wr->paddrs[i]), + be64_to_cpu(wr->paddrs[i])); tmp_va += PAGE_SIZE; } @@ -848,7 +848,7 @@ if ( (status = cc_wr_get_result(reply)) != CC_OK) { goto bail6; } - mr->stag_index = wr32_to_cpu(reply->stag_index); + mr->stag_index = be32_to_cpu(reply->stag_index); *p_stag_index = mr->stag_index; vq_repbuf_free(devp->cca, reply); vq_req->reply_msg = (u64)NULL; @@ -866,7 +866,7 @@ pbl_depth -= count; if 
(pbl_depth) { status = send_pbl_messages(devp, - cpu_to_wr32(mr->stag_index), + cpu_to_be32(mr->stag_index), tmp_va, pbl_depth, vq_req, PBL_VIRT); if (status != CC_OK) { @@ -1033,22 +1033,22 @@ if (pbl_depth <= pbe_count) { flags |= MEM_PBL_COMPLETE; } - wr->flags = cpu_to_wr16(flags); + wr->flags = cpu_to_be16(flags); wr->stag_key = stag_key; - wr->va = cpu_to_wr64((u64)(unsigned long)va); + wr->va = cpu_to_be64((u64)(unsigned long)va); wr->pd_id = pdid; - wr->pbe_size = cpu_to_wr32(pb_sz); - wr->length = cpu_to_wr32(length); - wr->pbl_depth = cpu_to_wr32(pbl_depth); - wr->fbo = cpu_to_wr32(fbo); + wr->pbe_size = cpu_to_be32(pb_sz); + wr->length = cpu_to_be32(length); + wr->pbl_depth = cpu_to_be32(pbl_depth); + wr->fbo = cpu_to_be32(fbo); count = ccmin(pbl_depth, pbe_count); - wr->addrs_length = cpu_to_wr32(count); + wr->addrs_length = cpu_to_be32(count); /* * fill out the PBL for this message */ for (i = 0; i < count; i++) { - wr->paddrs[i] = cpu_to_wr64((u64)(unsigned long)addr_list[i]); + wr->paddrs[i] = cpu_to_be64((u64)(unsigned long)addr_list[i]); } /* @@ -1084,8 +1084,8 @@ if ( (status = cc_wr_get_result(reply)) != CC_OK) { goto bail4; } - *p_pb_entries = wr32_to_cpu(reply->pbl_depth); - mr->stag_index = wr32_to_cpu(reply->stag_index); + *p_pb_entries = be32_to_cpu(reply->pbl_depth); + mr->stag_index = be32_to_cpu(reply->stag_index); *p_stag_index = mr->stag_index; vq_repbuf_free(devp->cca, reply); @@ -1100,7 +1100,7 @@ vq_req->reply_msg = CC_PTR_TO_64(NULL); atomic_set(&vq_req->reply_ready, 0); status = send_pbl_messages(devp, - cpu_to_wr32(mr->stag_index), + cpu_to_be32(mr->stag_index), CC_PTR_TO_64(&addr_list[i]), pbl_depth, vq_req, PBL_PHYS); if (status != CC_OK) { @@ -1158,7 +1158,7 @@ cc_wr_set_id(&wr, CCWR_MR_QUERY); wr.hdr.context = CC_PTR_TO_CTX(vq_req); wr.rnic_handle = devp->adapter_handle; - wr.stag_index = cpu_to_wr32(stag_index); + wr.stag_index = cpu_to_be32(stag_index); /* * reference the request struct. dereferenced in the int handler. 
@@ -1199,8 +1199,8 @@ */ p_attrs->stag_key = reply->stag_key; p_attrs->pdid = reply->pd_id; - p_attrs->pbl_depth = wr32_to_cpu(reply->pbl_depth); - flags = wr32_to_cpu(reply->flags); + p_attrs->pbl_depth = be32_to_cpu(reply->pbl_depth); + flags = be32_to_cpu(reply->flags); p_attrs->remote = (flags & MEM_REMOTE) ? 1 : 0; p_attrs->acf = flags & (MEM_LOCAL_READ|MEM_LOCAL_WRITE| MEM_REMOTE_READ|MEM_REMOTE_WRITE|MEM_WINDOW_BIND); @@ -1238,7 +1238,7 @@ cc_wr_set_id(&wr, CCWR_STAG_DEALLOC); wr.hdr.context = CC_PTR_TO_CTX(vq_req); wr.rnic_handle = devp->adapter_handle; - wr.stag_index = cpu_to_wr32(stag_index); + wr.stag_index = cpu_to_be32(stag_index); /* * reference the request struct. dereferenced in the int handler. @@ -1460,7 +1460,7 @@ goto bail3; } - mw->stag_index = wr32_to_cpu(reply->stag_index); + mw->stag_index = be32_to_cpu(reply->stag_index); *p_stag_index = mw->stag_index; vq_req_free(devp->cca, vq_req); vq_repbuf_free(devp->cca, reply); @@ -1508,7 +1508,7 @@ cc_wr_set_id(&wr, CCWR_MW_QUERY); wr.hdr.context = CC_PTR_TO_CTX(vq_req); wr.rnic_handle = devp->adapter_handle; - wr.stag_index = cpu_to_wr32(stag_index); + wr.stag_index = cpu_to_be32(stag_index); /* * reference the request struct. dereferenced in the int handler. @@ -1549,7 +1549,7 @@ */ p_attrs->stag_key = reply->stag_key; p_attrs->pdid = reply->pd_id; - flags = wr32_to_cpu(reply->flags); + flags = be32_to_cpu(reply->flags); p_attrs->acf = flags & (MEM_REMOTE_READ|MEM_REMOTE_WRITE); p_attrs->stag_state = (flags & MEM_STAG_VALID) ? 
CC_STAG_VALID : CC_STAG_INVALID; bail2: Index: devccil_mq.c =================================================================== --- devccil_mq.c (revision 3072) +++ devccil_mq.c (working copy) @@ -217,7 +217,7 @@ for (i = 0; i < q_size; ++i) { ccwr_hdr_t *h = (ccwr_hdr_t *)(*p_msg_pool_kva + i * msg_size); - h->magic = cpu_to_wr32(~CCWR_MAGIC); + h->magic = cpu_to_be32(~CCWR_MAGIC); } } #endif @@ -313,7 +313,7 @@ for (i = 0; i < new_q_size; ++i) { ccwr_hdr_t *h = (ccwr_hdr_t *)(*p_new_msg_pool_kva + i * new_msg_size); - h->magic = cpu_to_wr32(~CCWR_MAGIC); + h->magic = cpu_to_be32(~CCWR_MAGIC); } } #endif Index: ccilnet.c =================================================================== --- ccilnet.c (revision 3072) +++ ccilnet.c (working copy) @@ -2327,7 +2327,7 @@ * index instead of a real pointer. * See BugId: 1319 */ - desc = wr32_to_cpu(adapter_regs->fw_hrxd_cur); + desc = be32_to_cpu(adapter_regs->fw_hrxd_cur); idx = (desc - 0xffffc000) / sizeof(rxp_hrxd_t); adapter->adapter_recv_last_idx = idx; dprintf(CCILNET_DBGMSK_PROBE, Index: ccilnet.h =================================================================== --- ccilnet.h (revision 3072) +++ ccilnet.h (working copy) @@ -35,7 +35,6 @@ #include "cc_ivn.h" #include "cc_adapter.h" #include "cc_wr.h" /* define byte ordering */ -#include "cc_byteorder.h" /* byte ordering macros */ #include "cc_rxpiface.h" /* defs for RXP_RXD_DONE, RXP_STATUS_OK, etc.*/ #include "cc_txpiface.h" #include "cc_hostintr.h" /* defs for transmit & receive interrupt bits */ @@ -85,12 +84,12 @@ * ATOHL - adapter to host long (32 bits) * ATOHS - adapter to host short (16 bits) */ -#define HTOAS(_x16) cpu_to_wr16(_x16) -#define HTOAL(_x32) cpu_to_wr32(_x32) -#define HTOALL(_x64) cpu_to_wr64(_x64) -#define ATOHS(_x16) wr16_to_cpu(_x16) -#define ATOHL(_x32) wr32_to_cpu(_x32) -#define ATOHLL(_x64) wr64_to_cpu(_x64) +#define HTOAS(_x16) cpu_to_be16(_x16) +#define HTOAL(_x32) cpu_to_be32(_x32) +#define HTOALL(_x64) cpu_to_be64(_x64) +#define 
ATOHS(_x16) be16_to_cpu(_x16) +#define ATOHL(_x32) be32_to_cpu(_x32) +#define ATOHLL(_x64) be64_to_cpu(_x64) #define CCILNET_ADAPTER_MAGIC 0x20044002 From tomduffy at speakeasy.net Fri Aug 12 08:15:53 2005 From: tomduffy at speakeasy.net (Tom Duffy) Date: Fri, 12 Aug 2005 08:15:53 -0700 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <52u0hvmcnc.fsf@cisco.com> References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <1123852410.4403.7522.camel@hal.voltaire.com> <52u0hvmcnc.fsf@cisco.com> Message-ID: On Aug 12, 2005, at 7:51 AM, Roland Dreier wrote: > Tom> Do we need backward compatibility? > > Hal> I'm not sure but I think this was a recommendation from > Hal> Roland. > > Yes, of course we need backward compatibility. We can't put a change > into the kernel that breaks userspace binaries. > > For example, the old Fedora binary of arping has to continue to work, > even if you use a new kernel. But, Fedora will rebuild their binary once this change is in. If the Linux developers cared about this sort of thing, they would version all their kernel structs and put padding at the end to ensure new fields could be added. They have opted for the cleaner (technical) solution of having all the apps recompile. Sure there will be a little bit of growing pain, but in the end, it won't have all kinds of backwards compatibility cruft lying around. -tduffy From tom at ammasso.com Fri Aug 12 08:23:55 2005 From: tom at ammasso.com (Tom Tucker) Date: Fri, 12 Aug 2005 11:23:55 -0400 Subject: [openib-general] iWARP patch to remove x86 special memcpy optimizations Message-ID: <8E9D028761D8264D910612167E8457E8FA3814@mail2.ammasso.com> Thanks to Christoph... This patch removes memcpy4 and memcpy8, which were optimized to use SSE instructions when writing data over the PCI bus.
We may need to do something later to optimize performance depending on how good the Linux memcpy optimizations are. Index: cc_qp_common.c =================================================================== --- cc_qp_common.c (revision 3073) +++ cc_qp_common.c (working copy) @@ -135,139 +135,6 @@ /* - * Function: cc_memcpy8 - * - * Description: - * Just like memcpy, but does 16 and 8 bytes at a time. - * - * IN: - * dest - ptr destination - * src - ptr source - * len - The len, in bytes - * - * OUT: none - * - * Return: none - */ -void -cc_memcpy8( u64 *dest, u64 *src, s32 len) -{ -#ifdef CCDEBUG - assert((len & 0x03) == 0); - assert(((s32)dest & 0x03) == 0); - assert(((s32)src & 0x03) == 0); -#endif - -#if (defined(X86_32) || defined(X86_64)) - -#define MINSIZE 16 - /* unaligned data copy, 16 bytes at a time */ - while(len >= MINSIZE) { - /* printf("%p --> %p 16B unaligned copy,len=%d \n", src, dest,len); */ - asm volatile("movdqu 0(%1), %%xmm0\n" \ - "movdqu %%xmm0, 0(%0)\n" \ - :: "r"(dest), "r"(src) : "memory"); - src += 2; - dest += 2; - len -= 16; - } - - /* At this point, we'll have fewer than 16 bytes left. - * But, we only allow 8 byte copies. So, we do 8 byte copies now. - * If our len happens to be 4 or 12, we will copy 8 or 16 bytes, - * respectively. This is not a problem, since - * all msg_sizes in all WR queues are padded up to 8 bytes - * (see fw/clustercore/cc_qp.c, the function ccwr_qp_create()). - */ - while(len >= 0) { - /* printf("%p --> %p 8B copy,len=%d \n", src, dest,len); */ - asm volatile("movq 0(%1), %%xmm0\n" \ - "movq %%xmm0, 0(%0)\n" \ - :: "r"(dest), "r"(src) : "memory"); - src += 1; - dest += 1; - len -= 8; - } - -#else - #error "You need to define your platform, or add optimized" - #error "cc_memcpy8 support for your platform." - -#endif /*(defined(X86_64) || defined(X86_32)) */ - -} - -/* - * Function: memcpy4 - * - * Description: - * Just like memcpy, but assumes all args are 4 byte aligned already.
- * - * IN: - * dest - ptr destination - * src - ptr source - * len - The len, in bytes - * - * OUT: none - * - * Return: none - */ -static __inline__ void -memcpy4(u64 *dest, u64 *src, u32 len) -{ -#ifdef __KERNEL__ - unsigned long flags; -#endif /* #ifdef __KERNEL__ */ - - u64 xmm_regs[16]; /* Reserve space for 8, though only use 1 now. */ - -#ifdef CCDEBUG - ASSERT((len & 0x03) == 0); - ASSERT(((long)dest & 0x03) == 0); - ASSERT(((long)src & 0x03) == 0); -#endif - - /* We must save and restor xmm0. - * Failure to do so messes up the application code. - */ - asm volatile("movdqu %%xmm0, 0(%0)\n" :: "r"(xmm_regs) : "memory"); - -#ifdef __KERNEL__ - /* Further, in the kernel version, we must disable local interupts. - * This is because ISRs do not save & restore xmm0. So, if - * we are interrupted between the first movdqu and the second, - * then xmm0 may be modified, and we will write garbage to the adapter. - */ - local_irq_save(flags); -#endif /* #ifdef __KERNEL__ */ - -#define MINSIZE 16 - /* unaligned data copy */ - while(len >= MINSIZE) { - asm volatile("movdqu 0(%1), %%xmm0\n" \ - "movdqu %%xmm0, 0(%0)\n" \ - :: "r"(dest), "r"(src) : "memory"); - src += 2; - dest += 2; - len -= 16; - } - -#ifdef __KERNEL__ - /* Restore interrupts and registers */ - local_irq_restore(flags); - asm volatile("movdqu 0(%0), %%xmm0\n" :: "r"(xmm_regs) : "memory"); -#endif /* #ifdef __KERNEL__ */ - - while (len >= 4) { - *((u32 *)dest) = *((u32 *)src); - dest = (u64*)((unsigned long)dest + 4); - src = (u64*)((unsigned long)src + 4); - len -= 4; - } -} - - -/* * Function: qp_wr_post * * Description: @@ -308,7 +175,7 @@ /* * Copy the wr down to the adapter */ - memcpy4((void *)msg, (void *)wr, size); + memcpy((void *)msg, (void *)wr, size); cc_mq_produce(q); return CC_OK; Index: cc_mq_common.c =================================================================== --- cc_mq_common.c (revision 3073) +++ cc_mq_common.c (working copy) @@ -17,8 +17,6 @@ #include "cc_mq_common.h" #include 
"cc_common.h" -extern void cc_memcpy8(u64 *, u64 *, s32); - #define BUMP(q,p) (p) = ((p)+1) % (q)->q_size #define BUMP_SHARED(q,p) (p) = cpu_to_be16((be16_to_cpu(p)+1) % (q)->q_size) From tom at ammasso.com Fri Aug 12 08:42:17 2005 From: tom at ammasso.com (Tom Tucker) Date: Fri, 12 Aug 2005 11:42:17 -0400 Subject: [openib-general] IB Driver Initialization of FMR methods Message-ID: <8E9D028761D8264D910612167E8457E8FA3819@mail2.ammasso.com> I've noticed that in the IB driver, the FMR methods are not initialized if the XX_FLAG_FRM bit is not set in the device structure. My assumption at this point is that these methods are not present if they are not supported by the device. What's confusing is that the verbs do not check if the function ptr is null when involking the underlying method. I would have expected, that a method would be initialized that returned ENOSYS in this case. Any explanation as to the intended design point for FMR initialization would be greatly appreciated. Thanks, Tom T. From rolandd at cisco.com Fri Aug 12 08:54:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 12 Aug 2005 08:54:05 -0700 Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32 References: <200508111932.j7BJWROj026274@xi.cse.ohio-state.edu> <482FEE13-0AB4-11DA-80CD-000D932C3754@cse.ohio-state.edu> <52k6isnm21.fsf@cisco.com> <20050812055044.GA8582@esmail.cup.hp.com> Message-ID: <52oe83m9r6.fsf@cisco.com> Roland> Do you have any theory as to why the drivers worked in Roland> 64-bit mode and failed in 32-bit mode? I don't see any Roland> reason why the parameters passed to INIT_IB would be any Roland> different. Grant> grundler at gsyprf3:/usr/src/openib_gen2/src/linux-kernel/infiniband/hw/mthca$ fgrep writeq * Grant> ... 
Grant> mthca_doorbell.h: __raw_writeq((__force u64) val, dest); Grant> mthca_doorbell.h: __raw_writeq(*(u64 *) val, dest); It's a good theory, except for two problems: - that code is only used for data-path doorbell ringing, so it won't be used until long after the INIT_IB command. - It's inside an "#if BITS_PER_LONG" block, so it doesn't ever get used on 32-bit platforms anyway. Grant> The only theory I can think of is 64-bit MMIO writes on a Grant> 32-bit OS will come out as two seperate writes. But since Grant> others are using this without a problem, this isn't likely Grant> a generic issue. Maybe there is some timing issue here...ie Grant> slower/faster CPU or chipset is exposing a problem. Well, earlier in the thread it was said that the same system worked fine with a 64-bit kernel and failed with a 32-bit kernel. So it's probably not a chipset issue. And the INIT_IB command isn't using MMIO writes to transfer the command block to the HCA -- the HCA DMAs it out of memory itself. - R. From rolandd at cisco.com Fri Aug 12 09:07:32 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 12 Aug 2005 09:07:32 -0700 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: (Tom Duffy's message of "Fri, 12 Aug 2005 08:15:53 -0700") References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <1123852410.4403.7522.camel@hal.voltaire.com> <52u0hvmcnc.fsf@cisco.com> Message-ID: <52fytfm94r.fsf@cisco.com> Tom> But, Fedora will rebuild their binary once this change is in. Tom> If the Linux developers cared about this sort of thing, it Tom> would version all its kernel structs and put padding at the Tom> end to ensure new fields could be added. It has opted for Tom> the cleaner (technical) solution of having all the apps Tom> recompile. 
Sure there will be a little bit of growing pain, Tom> but in the end, it won't have all kinds of backwards Tom> compatibility cruft lying around. No, this is absolutely not true. The kernel-user ABI is very stable, and with very few exceptions, you should be able to take binaries that worked on kernel 1.0 and run them on a modern kernel. For example, The in-kernel ABI and API can and do change all the time, but that's a different story. - R. From rolandd at cisco.com Fri Aug 12 09:11:09 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 12 Aug 2005 09:11:09 -0700 Subject: [openib-general] IB Driver Initialization of FMR methods In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3819@mail2.ammasso.com> (Tom Tucker's message of "Fri, 12 Aug 2005 11:42:17 -0400") References: <8E9D028761D8264D910612167E8457E8FA3819@mail2.ammasso.com> Message-ID: <52br43m8yq.fsf@cisco.com> >>>>> "Tom" == Tom Tucker writes: Tom> I've noticed that in the IB driver, the FMR methods are not Tom> initialized if the XX_FLAG_FRM bit is not set in the device Tom> structure. My assumption at this point is that these methods Tom> are not present if they are not supported by the device. Tom> What's confusing is that the verbs do not check if the Tom> function ptr is null when involking the underlying method. I Tom> would have expected, that a method would be initialized that Tom> returned ENOSYS in this case. Any explanation as to the Tom> intended design point for FMR initialization would be greatly Tom> appreciated. I think you must have looked in the wrong place. In drivers/infiniband/core/verbs.c, ib_alloc_fmr() starts with: struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, int mr_access_flags, struct ib_fmr_attr *fmr_attr) { struct ib_fmr *fmr; if (!pd->device->alloc_fmr) return ERR_PTR(-ENOSYS); so if alloc_fmr is not set, the caller will get -ENOSYS. We do assume that if the device implements alloc_fmr(), it will implement the other FMR methods. 
This isn't enforced by any code, since it doesn't seem worth checking for something that is an obvious bug, and that will cause an immediate and easy-to-diagnose oops. - R. From tomduffy at speakeasy.net Fri Aug 12 09:21:29 2005 From: tomduffy at speakeasy.net (Tom Duffy) Date: Fri, 12 Aug 2005 09:21:29 -0700 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <52fytfm94r.fsf@cisco.com> References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <1123852410.4403.7522.camel@hal.voltaire.com> <52u0hvmcnc.fsf@cisco.com> <52fytfm94r.fsf@cisco.com> Message-ID: On Aug 12, 2005, at 9:07 AM, Roland Dreier wrote: > Tom> But, Fedora will rebuild their binary once this change is in. > Tom> If the Linux developers cared about this sort of thing, it > Tom> would version all its kernel structs and put padding at the > Tom> end to ensure new fields could be added. It has opted for > Tom> the cleaner (technical) solution of having all the apps > Tom> recompile. Sure there will be a little bit of growing pain, > Tom> but in the end, it won't have all kinds of backwards > Tom> compatibility cruft lying around. > > No, this is absolutely not true. The kernel-user ABI is very stable, > and with very few exceptions, you should be able to take binaries that > worked on kernel 1.0 and run them on a modern kernel. For example, > > > The in-kernel ABI and API can and do change all the time, but that's a > different story. I don't want to get into a big debate about this. If a good solution can be had that will both maintain compatibility and allow for IB, I would welcome that. On the other hand, most of the interesting apps have broken on Linux in the past few years. 
Some examples: - Loki games - Word Perfect 8 - Crossover office/plugin - java I know that lots of that has to do with gcc, threading, or glibc instability, but clearly most interesting binaries that were around in the 1.0 days will not run on today's stuff. Can we do an audit of what stuff will break with this change? If it is a handful of applications that we all have the source to, maybe it won't be that big of a deal. Maybe the better approach is to simply submit the struct change. And let the maintainers object if they want ABI stability. If they do, ask them for an elegant solution ;) -tduffy From hch at lst.de Fri Aug 12 09:24:18 2005 From: hch at lst.de (Christoph Hellwig) Date: Fri, 12 Aug 2005 18:24:18 +0200 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <1123852410.4403.7522.camel@hal.voltaire.com> <52u0hvmcnc.fsf@cisco.com> <52fytfm94r.fsf@cisco.com> Message-ID: <20050812162418.GA647@lst.de> On Fri, Aug 12, 2005 at 09:21:29AM -0700, Tom Duffy wrote: > I don't want to get into a big debate about this. If a good solution > can be had that will both maintain compatibility and allow for IB, I > would welcome that. On the other hand, most of the interesting apps > have broken on Linux in the past few years.
Some examples: > > - Loki games > - Word Perfect 8 > - Crossover office/plugin > - java s/interesting apps/crappy shit that relies on undefined behaviour/ I can still run most of some really old a.out slackware, including the original doom port on my only x86 box (binfmt_aout complains a lot about some missing alignment and stuff, though ;)) From tom at ammasso.com Fri Aug 12 09:35:16 2005 From: tom at ammasso.com (Tom Tucker) Date: Fri, 12 Aug 2005 12:35:16 -0400 Subject: [openib-general] IB Driver Initialization of FMR methods Message-ID: <8E9D028761D8264D910612167E8457E8FA3826@mail2.ammasso.com> Got it. Thanks. > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Friday, August 12, 2005 11:11 AM > To: Tom Tucker > Cc: openib-general at openib.org > Subject: Re: [openib-general] IB Driver Initialization of FMR methods > > >>>>> "Tom" == Tom Tucker writes: > > Tom> I've noticed that in the IB driver, the FMR methods are not > Tom> initialized if the XX_FLAG_FRM bit is not set in the device > Tom> structure. My assumption at this point is that these methods > Tom> are not present if they are not supported by the device. > > Tom> What's confusing is that the verbs do not check if the > Tom> function ptr is null when involking the underlying method. I > Tom> would have expected, that a method would be initialized that > Tom> returned ENOSYS in this case. Any explanation as to the > Tom> intended design point for FMR initialization would be greatly > Tom> appreciated. > > I think you must have looked in the wrong place. In > drivers/infiniband/core/verbs.c, ib_alloc_fmr() starts with: > > struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, > int mr_access_flags, > struct ib_fmr_attr *fmr_attr) > { > struct ib_fmr *fmr; > > if (!pd->device->alloc_fmr) > return ERR_PTR(-ENOSYS); > > so if alloc_fmr is not set, the caller will get -ENOSYS. > > We do assume that if the device implements alloc_fmr(), it will > implement the other FMR methods. 
This isn't enforced by any code, > since it doesn't seem worth checking for something that is an obvious > bug, and that will cause an immediate and easy-to-diagnose oops. > > - R. > From Thomas.Talpey at netapp.com Fri Aug 12 09:40:54 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 12 Aug 2005 12:40:54 -0400 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <1123852410.4403.7522.camel@hal.voltaire.com> <52u0hvmcnc.fsf@cisco.com> <52fytfm94r.fsf@cisco.com> Message-ID: <6.2.3.4.2.20050812123543.066d81d0@exnane01.nane.netapp.com> At 12:21 PM 8/12/2005, Tom Duffy wrote: >Can we do an audit of what stuff will break with this change? If it >is a handful of applications that we all have the source to, maybe it >won't be that big of a deal. Right now it looks to me like the app is arping, and it can be fixed by increasing the size of the storage it allocates in its data segment, without changing the sockaddr_ll. Maybe others, haven't bothered to look. struct sockaddr_ll me -> union { struct sockaddr_ll xx; unsigned char yy[32]; } me; Note: Hal's change requires arping to be recompiled too! Can't stick 20 bytes into 8 there, either. Tom. 
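A minimal sketch of the storage trick described above, assuming a Linux userspace build where struct sockaddr_ll comes from <netpacket/packet.h>. The names ll_storage, ll_set_hwaddr and IPOIB_HW_ADDR_LEN are hypothetical and exist only for illustration; this is not arping's actual code, and whether the kernel accepts the longer address depends on the change under discussion in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netpacket/packet.h>   /* struct sockaddr_ll (Linux) */

#define IPOIB_HW_ADDR_LEN 20    /* hypothetical name: IPoIB link-layer address length */

/*
 * Over-sized storage: the sockaddr_ll view plus enough trailing bytes that a
 * 20-byte hardware address starting at sll_addr still fits in the object.
 * struct sockaddr_ll itself is left unchanged (sll_addr stays 8 bytes).
 */
union ll_storage {
	struct sockaddr_ll sll;
	unsigned char raw[32];
};

/*
 * Copy a hardware address into the storage, starting at sll_addr.
 * Returns 0 on success, -1 if the address would not fit.
 */
static int ll_set_hwaddr(union ll_storage *st,
			 const unsigned char *addr, size_t len)
{
	size_t off = offsetof(struct sockaddr_ll, sll_addr);

	if (len > sizeof(*st) - off || len > 255)
		return -1;
	memcpy(st->raw + off, addr, len);
	st->sll.sll_halen = (unsigned char)len;
	return 0;
}
```

A caller would pass (struct sockaddr *)&st together with sizeof(st) to bind() or sendto(), so the length handed to the kernel covers the whole address rather than sizeof(struct sockaddr_ll).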
From halr at voltaire.com Fri Aug 12 09:57:06 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Aug 2005 12:57:06 -0400 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <6.2.3.4.2.20050812123543.066d81d0@exnane01.nane.netapp.com> References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <1123852410.4403.7522.camel@hal.voltaire.com> <52u0hvmcnc.fsf@cisco.com> <52fytfm94r.fsf@cisco.com> <6.2.3.4.2.20050812123543.066d81d0@exnane01.nane.netapp.com> Message-ID: <1123865826.4403.7686.camel@hal.voltaire.com> On Fri, 2005-08-12 at 12:40, Talpey, Thomas wrote: > Note: Hal's change requires arping to be recompiled too! > Can't stick 20 bytes into 8 there, either. Not quite. Old arping binaries will work for non IPoIB links just fine. Using old arping on IPoIB will get the error on the sendto as the hardware type is not available at bind time. -- Hal From Thomas.Talpey at netapp.com Fri Aug 12 10:43:44 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 12 Aug 2005 13:43:44 -0400 Subject: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface In-Reply-To: <1123865826.4403.7686.camel@hal.voltaire.com> References: <1123786117.4403.5835.camel@hal.voltaire.com> <20050811.124916.77057824.davem@davemloft.net> <1123796337.4403.6371.camel@hal.voltaire.com> <25BC832E-3EC8-407E-A490-60236CFFF99B@speakeasy.net> <1123852410.4403.7522.camel@hal.voltaire.com> <52u0hvmcnc.fsf@cisco.com> <52fytfm94r.fsf@cisco.com> <6.2.3.4.2.20050812123543.066d81d0@exnane01.nane.netapp.com> <1123865826.4403.7686.camel@hal.voltaire.com> Message-ID: <6.2.3.4.2.20050812134236.040cc9a0@exnane01.nane.netapp.com> At 12:57 PM 8/12/2005, Hal Rosenstock wrote: >Using old arping on IPoIB will get the error on the sendto as the >hardware type is not available at bind time. 
Okay, that's a feature then, instead of "Bus Error - core dumped" when 20 bytes land on top of 8, they'll get a send failure. :-) Tom. From jlentini at netapp.com Fri Aug 12 11:26:18 2005 From: jlentini at netapp.com (James Lentini) Date: Fri, 12 Aug 2005 14:26:18 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: On Fri, 12 Aug 2005, Guy German wrote: > > - /* Only process events if there is an enabled callback function. */ > > - if ((evd->upcall.upcall_func == (DAT_UPCALL_FUNC) NULL) || > > - (evd->upcall_policy == DAT_UPCALL_DISABLE)) { > > + > > + /* The function is not re-entrant (change when implementing DAT_UPCALL_MANY)*/ > > Why is this function not re-entrant? For reference, here is how I > would define re-entrant: > > http://en.wikipedia.org/wiki/Reentrant > http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?re-entrant > > gg: > The function can not be entered twice at the same time, because > when the upcall is at the hands of the consumer he can disable > the upcall policy and if it is entered twice, there is a chance > the consumer will get another upcall after disabling > the upcall policy. I don't see the flow of control that would result in the scenario you describe. This piece of code spin_lock_irqsave (&evd->common.lock, flags); if (evd->is_triggered) { spin_unlock_irqrestore (&evd->common.lock, flags); return; } evd->is_triggered = 1; spin_unlock_irqrestore (&evd->common.lock, flags); ensures that only one thread can be making upcalls at a time. > > > > + if (evd->is_triggered) > > return; > > - } > > Why check the value here? Is it only for the efficiency of not taking > the spin lock when is_triggered is 1? > > gg: > No. you can't take the spin_lock here because this can cause > a dead lockin the case the function calls itself from > dat_evd_dequeue, on a uni-proccessor machines. Can you elaborate on this? 
Do you mean that the thread that performs the upcall in dapl_evd_upcall_trigger() can be used by the consumer to call dapl_evd_dequeue()? If so, I don't see the flow of control that begins in dapl_evd_dequeue() and reaches dapl_evd_upcall_trigger(). > > @@ -820,24 +831,19 @@ static void dapl_evd_dto_callback(struct > > * This function does not dequeue from the CQ; only the consumer > > * can do that. Instead, it wakes up waiters if any exist. > > * It rearms the completion only if completions should always occur > > - * (specifically if a CNO is associated with the EVD and the > > - * EVD is enabled). > > */ > > - > > - if (state == DAPL_EVD_STATE_OPEN && > > - evd->upcall_policy != DAT_UPCALL_DISABLE) { > > - /* > > - * Re-enable callback, *then* trigger. > > - * This guarantees we won't miss any events. > > - */ > > - status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > > - if (0 != status) > > - (void)dapl_evd_post_async_error_event( > > - evd->common.owner_ia->async_error_evd, > > - DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, > > - evd->common.owner_ia); > > - > > + > > + if (state == DAPL_EVD_STATE_OPEN) { > > dapl_evd_upcall_trigger(evd); > > + if ((evd->upcall_policy != DAT_UPCALL_TEARDOWN) && > > + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { > > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > > + if (0 != status) > > + (void)dapl_evd_post_async_error_event( > > + evd->common.owner_ia->async_error_evd, > > + DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, > > + evd->common.owner_ia); > > + } > > > You changed the order in which the CQ upcall is enabled and the kDAPL > upcall is made. It used to be: > > enable CQ upcall > call kDAPL upcall > > you are proposing > > call kDAPL upcall > enable CQ upcall > > I think your proposed order contains a race condition. Specifically if > a work completion occurs after dapl_evd_upcall_trigger() > returns but before the CQ upcall is re-enabled with > ib_req_notify_cq(), no upcall will occur for the completion. 
> > Do you agree? > > gg: > You need to enable the CQ upcall only if the consumer did > not change his upcall policy, while in upcall context. In > the first case you will create a situation where the cq is > enabled, while the consumers doesn't want any upcalls. Correct, but this can be hidden so that the consumer does not receive the upcall. > In most real world application dapl_evd_upcall_trigger() > will return with upcall policy disabled and there will be > no need to alarm the cq upcall - i.e the consumer would > dequeue the rest of the events himself. > > I see the race you talk about. It is relevent to kdapltest. > Maybe we can check if there are pending events after > enabling CQ upcall, and if there are - call > dapl_evd_upcall_trigger() again. What do you think ? The problem we are dealing with is that DAPL upcalls behave differently from IB upcalls. DAPL upcalls are enabled until they are disabled while IB upcalls are "one shots". The approach taken in the current implementation is to always enable the IB upcalls and determine in the DAPL provider if the consumer's upcall should be invoked. You are proposing a shift away from that approach. If we do that, we need to preserve the original semantics. I'll ask the DAT Collaborative from some clarification on the meaning of the different upcall policy flags. One final item. To be consistent with your design, CQ upcalls should be selectively enabled in dapl_evd_internal_create(). > > evd = (struct dapl_evd *)evd_handle; > > + dapl_dbg_log (DAPL_DBG_TYPE_API, "%s: (evd=%p) set to %d\n", > > + __func__, evd_handle, upcall_policy); > > The idea was to make the DAPL_DBG_TYPE_API prints look like a > debugger stack trace. The following would be keeping with the other > print statements: > > gg: > I thought it would make it a bit more user friendly :) sometimes > the consumers use those debug prints and they don't want to dwell > in the kdapl code too much in order to understand what they > are reading ... 
You need a fair amount of familiarity with the code to know what a message that says "dapl_evd_modify_upcall: (evd=dbbe4e58) set to 2" means. If you'd like to add the parameter names (e.g. evd=%p, upcall_policy=...) that is fine. I think the function call format is better, because a user familiar with the function signatures will know what each of the fields means. > > > > + spin_lock_irqsave(&evd->common.lock, flags); > > + if ((upcall_policy != DAT_UPCALL_TEARDOWN) && > > + (upcall_policy != DAT_UPCALL_DISABLE)) { > > Why not let the consumer setup the upcall when it disabled? That seems > like the only safe time to modify it. > > gg: > The consumer needs and can change the poilcy to disable and enable. > The only time he is not allowed to change the policy to enable (in > this implementation) is when there are still pending events in the > queue. You mean when there are *no* pending events in the queue. > This is to solve a race where the consumer dequeued all the events > and changed the policy to enable, but there were other event/s that > came just before calling dat_evd_modufy_upcall. In this case > dat_evd_modufy_upcall to enable would fail and the consumer would > keep dequeue-ing the events, without loosing his context. You've only decrease the window in which that scenario could occur, not eliminated it. If a DTO completion occured after you count the number of pending events but before you enable the CQ callback, a completion will be missed. Also, the pending_event_queue is only used for kDAPL generated software events. This queue can be empty when there are events on the CQ, so your would need to be expanded your check to cover that. > > + if (evd->evd_flags & DAT_EVD_DTO_FLAG) { > > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > > Why do we need to re-enable the CQ upcall? > > gg: > If the consumer returned from the evd_upcall with upcall policy > "disabled" the CQ upcall is not enabled. So this is the only > place it is done. 
Ok, that fits with your new approach to the problem. From guyg at voltaire.com Fri Aug 12 15:05:52 2005 From: guyg at voltaire.com (Guy German) Date: Sat, 13 Aug 2005 01:05:52 +0300 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation Message-ID: Hi James, Im writing you back, from my web mail, so sorry again for the format (I will use "gg:" prefix again for my answer :) On Fri, 12 Aug 2005, Guy German wrote: > > - /* Only process events if there is an enabled callback function. */ > > - if ((evd->upcall.upcall_func == (DAT_UPCALL_FUNC) NULL) || > > - (evd->upcall_policy == DAT_UPCALL_DISABLE)) { > > + > > + /* The function is not re-entrant (change when implementing DAT_UPCALL_MANY)*/ > > Why is this function not re-entrant? For reference, here is how I > would define re-entrant: > > http://en.wikipedia.org/wiki/Reentrant > http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?re-entrant > > gg: > The function can not be entered twice at the same time, because > when the upcall is at the hands of the consumer he can disable > the upcall policy and if it is entered twice, there is a chance > the consumer will get another upcall after disabling > the upcall policy. I don't see the flow of control that would result in the scenario you describe. This piece of code spin_lock_irqsave (&evd->common.lock, flags); if (evd->is_triggered) { spin_unlock_irqrestore (&evd->common.lock, flags); return; } evd->is_triggered = 1; spin_unlock_irqrestore (&evd->common.lock, flags); ensures that only one thread can be making upcalls at a time. gg: The change I did in the function assures that the function would not be entered twice at the same time, from 2 (or more) contexts (by adding the spin_lock). 
The remark mentions that this change, however, does not support DAT_UPCALL_MANY, which, by my understanding, require upcalls to be called simultaneously, hence the function to be re-entrant (and not protected by a spin lock that prevents entering the function) > > > > + if (evd->is_triggered) > > return; > > - } > > Why check the value here? Is it only for the efficiency of not taking > the spin lock when is_triggered is 1? > > gg: > No. you can't take the spin_lock here because this can cause > a dead lockin the case the function calls itself from > dat_evd_dequeue, on a uni-proccessor machines. Can you elaborate on this? Do you mean that the thread that performs the upcall in dapl_evd_upcall_trigger() can be used by the consumer to call dapl_evd_dequeue()? If so, I don't see the flow of control that begins in dapl_evd_dequeue() and reaches dapl_evd_upcall_trigger(). gg: The protection there is from the case, where dapl_evd_upcall_trigger calls dapl_evd_dequeue that calls dapl_evd_upcall_trigger again (recursively). This happened when there was a bad DTO completion and CONN_EVENT_BROKEN was synthesized. Now, I saw that this part was removed from dapl_evd_wc_to_event. I still think that we should leave this protection, for those kinds of cases, that might be implemented in the future. That leaves me with a question: what did happened to DAT_CONNECTION_EVENT_BROKEN? > > @@ -820,24 +831,19 @@ static void dapl_evd_dto_callback(struct > > * This function does not dequeue from the CQ; only the consumer > > * can do that. Instead, it wakes up waiters if any exist. > > * It rearms the completion only if completions should always occur > > - * (specifically if a CNO is associated with the EVD and the > > - * EVD is enabled). > > */ > > - > > - if (state == DAPL_EVD_STATE_OPEN && > > - evd->upcall_policy != DAT_UPCALL_DISABLE) { > > - /* > > - * Re-enable callback, *then* trigger. > > - * This guarantees we won't miss any events. 
> > - */ > > - status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > > - if (0 != status) > > - (void)dapl_evd_post_async_error_event( > > - evd->common.owner_ia->async_error_evd, > > - DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, > > - evd->common.owner_ia); > > - > > + > > + if (state == DAPL_EVD_STATE_OPEN) { > > dapl_evd_upcall_trigger(evd); > > + if ((evd->upcall_policy != DAT_UPCALL_TEARDOWN) && > > + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { > > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > > + if (0 != status) > > + (void)dapl_evd_post_async_error_event( > > + evd->common.owner_ia->async_error_evd, > > + DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, > > + evd->common.owner_ia); > > + } > > > You changed the order in which the CQ upcall is enabled and the kDAPL > upcall is made. It used to be: > > enable CQ upcall > call kDAPL upcall > > you are proposing > > call kDAPL upcall > enable CQ upcall > > I think your proposed order contains a race condition. Specifically if > a work completion occurs after dapl_evd_upcall_trigger() > returns but before the CQ upcall is re-enabled with > ib_req_notify_cq(), no upcall will occur for the completion. > > Do you agree? > > gg: > You need to enable the CQ upcall only if the consumer did > not change his upcall policy, while in upcall context. In > the first case you will create a situation where the cq is > enabled, while the consumers doesn't want any upcalls. Correct, but this can be hidden so that the consumer does not receive the upcall. gg: So what does dapl would do with the DTO event once it got it ? > In most real world application dapl_evd_upcall_trigger() > will return with upcall policy disabled and there will be > no need to alarm the cq upcall - i.e the consumer would > dequeue the rest of the events himself. > > I see the race you talk about. It is relevent to kdapltest. 
> Maybe we can check if there are pending events after > enabling CQ upcall, and if there are - call > dapl_evd_upcall_trigger() again. What do you think ? The problem we are dealing with is that DAPL upcalls behave differently from IB upcalls. DAPL upcalls are enabled until they are disabled while IB upcalls are "one shots". The approach taken in the current implementation is to always enable the IB upcalls and determine in the DAPL provider if the consumer's upcall should be invoked. gg: This is not good enough for some consumers (e.g. ISER), and it is not implementing the DAPL spec, which deals with upcall policies. You are proposing a shift away from that approach. If we do that, we need to preserve the original semantics. gg: I think that the implementation proposed preserves the current semantics. If the consumer doesn't change the upcall policy - it stays enabled and everything stays the same (kdapltest is still working). I think you had a good point about the race over there, and that can be fixed. > I'll ask the DAT Collaborative from some clarification on > the meaning of the different upcall policy flags. > One final item. To be consistent with your design, CQ upcalls should > be selectively enabled in dapl_evd_internal_create(). gg: Why ? the initial state is that the upcalls are enabled, like it is today. Only if the consumer chooses to disable the upcall, he calls dat_evd_modify_upcall. > > evd = (struct dapl_evd *)evd_handle; > > + dapl_dbg_log (DAPL_DBG_TYPE_API, "%s: (evd=%p) set to %d\n", > > + __func__, evd_handle, upcall_policy); > > The idea was to make the DAPL_DBG_TYPE_API prints look like a > debugger stack trace. The following would be keeping with the other > print statements: > > gg: > I thought it would make it a bit more user friendly :) sometimes > the consumers use those debug prints and they don't want to dwell > in the kdapl code too much in order to understand what they > are reading ... 
You need a fair amount of familiarity with the code to know what a message that says "dapl_evd_modify_upcall: (evd=dbbe4e58) set to 2" means. If you'd like to add the parameter names (e.g. evd=%p, upcall_policy=...) that is fine. gg: Sure. That’s good enough. > I think the function call format is better, > because a user familiar with the function > signatures will know what each of the fields means. gg: I think that if someone is writing a code over dapl, for the first time, he has enough on his head, beside reviewing dapl’s implementation code. It is easier to debug your own code with helpful debug prints, from the lower layer, like: "upcall_policy=1" / "upcall_policy=o" / "connecting to 192.168.10.10" DTO completion info ... etc.. > > > > + spin_lock_irqsave(&evd->common.lock, flags); > > + if ((upcall_policy != DAT_UPCALL_TEARDOWN) && > > + (upcall_policy != DAT_UPCALL_DISABLE)) { > > Why not let the consumer setup the upcall when it disabled? That seems > like the only safe time to modify it. > > gg: > The consumer needs and can change the poilcy to disable and enable. > The only time he is not allowed to change the policy to enable (in > this implementation) is when there are still pending events in the > queue. You mean when there are *no* pending events in the queue. gg: No. I mean when there *are* pending events. When the consumer wishes to enable the upcall policy, he believes that he dequeued all the events. If this is not the case - dat_evd_modify_upcall alarms him and he knows he should continue to dequeue. > This is to solve a race where the consumer dequeued all the events > and changed the policy to enable, but there were other event/s that > came just before calling dat_evd_modufy_upcall. In this case > dat_evd_modufy_upcall to enable would fail and the consumer would > keep dequeue-ing the events, without loosing his context. You've only decrease the window in which that scenario could occur, not eliminated it. 
If a DTO completion occured after you count the number of pending events but before you enable the CQ callback, a completion will be missed. gg: I don't think so. That is what the spin_lock_irqsave is for. Also, the pending_event_queue is only used for kDAPL generated software events. This queue can be empty when there are events on the CQ, so your would need to be expanded your check to cover that. gg: I agree. > > + if (evd->evd_flags & DAT_EVD_DTO_FLAG) { > > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > > Why do we need to re-enable the CQ upcall? > > gg: > If the consumer returned from the evd_upcall with upcall policy > "disabled" the CQ upcall is not enabled. So this is the only > place it is done. Ok, that fits with your new approach to the problem. From guyg at voltaire.com Fri Aug 12 15:19:34 2005 From: guyg at voltaire.com (Guy German) Date: Sat, 13 Aug 2005 01:19:34 +0300 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation Message-ID: Hi, Just a small add to my reply about: > The approach taken in the current implementation is to always enable > the IB upcalls and determine in the DAPL provider if the consumer's > upcall should be invoked. My earlier reply was referring to part that the DAPL provider would determine if the consumer's upcall should be invoked, instead of the consumer deciding it. As to always enabling the IB upcalls - that can be done. But if we do it we should add the DTO events to the pending events list, which brings us to the size of this list, that is in the current implementation problematic. Guy From hozer at hozed.org Fri Aug 12 19:42:58 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Fri, 12 Aug 2005 21:42:58 -0500 Subject: [openib-general] 'Couldn't post send' error? Message-ID: <20050813024258.GE16924@kalmia.hozed.org> What's this mean? 

da4:~/NetPIPE_3.6.2# ibv_rc_pingpong 10.1.5.218
  local address:  LID 0x0002, QPN 0x0d0404, PSN 0x599dea
  remote address: LID 0x0001, QPN 0x090404, PSN 0x93b0c8
Couldn't post send

I'm trying to get NetPIPE to work again, and it gets a similar error.

It looks like 'uc' and 'ud' versions work just fine.

--
--------------------------------------------------------------------------
Troy Benjegerdes 'da hozer' hozer at hozed.org

Someone asked me why I work on this free (http://www.fsf.org/philosophy/)
software stuff and not get a real job. Charles Shultz had the best answer:

"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's why
I draw cartoons. It's my life." -- Charles Shultz

From yuw at cse.ohio-state.edu Fri Aug 12 20:47:00 2005
From: yuw at cse.ohio-state.edu (Weikuan Yu)
Date: Fri, 12 Aug 2005 23:47:00 -0400
Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32
In-Reply-To: <52k6isnm21.fsf@cisco.com>
References: <200508111932.j7BJWROj026274@xi.cse.ohio-state.edu> <482FEE13-0AB4-11DA-80CD-000D932C3754@cse.ohio-state.edu> <52k6isnm21.fsf@cisco.com>
Message-ID:

Hi,

Thanks for your suggestions. Your guess is right with the debugging
statement. Output is at the end.
Weikuan -------------------------------------- ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) ib_mthca: Initializing (0000:03:00.0) ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 24 (level, low) -> IRQ 177 ib_mthca 0000:03:00.0: Found bridge: (0000:02:03.0) ib_mthca 0000:03:00.0: FW version 000300030002, max commands 64 ib_mthca 0000:03:00.0: FW size 6143 KB (start e7a00000, end e7ffffff) ib_mthca 0000:03:00.0: HCA memory size 131071 KB (start e0000000, end e7ffffff) ib_mthca 0000:03:00.0: Max port width: 2 ib_mthca 0000:03:00.0: Max QPs: 16777216, reserved QPs: 1024, entry size: 256 ib_mthca 0000:03:00.0: Max SRQs: 1024, reserved SRQs: 16, entry size: 32 ib_mthca 0000:03:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 64 ib_mthca 0000:03:00.0: Max EQs: 64, reserved EQs: 1, entry size: 64 ib_mthca 0000:03:00.0: reserved MPTs: 16, reserved MTTs: 16 ib_mthca 0000:03:00.0: Max PDs: 16777216, reserved PDs: 0, reserved UARs: 1 ib_mthca 0000:03:00.0: Max QP/MCG: 16777216, reserved MGMs: 0 ib_mthca 0000:03:00.0: Flags: 00370347 ib_mthca 0000:03:00.0: profile[ 0]--10/20 @ 0x e0000000 (size 0x 4000000) ib_mthca 0000:03:00.0: profile[ 1]-- 0/16 @ 0x e4000000 (size 0x 1000000) ib_mthca 0000:03:00.0: profile[ 2]-- 7/18 @ 0x e5000000 (size 0x 800000) ib_mthca 0000:03:00.0: profile[ 3]-- 9/17 @ 0x e5800000 (size 0x 800000) ib_mthca 0000:03:00.0: profile[ 4]-- 3/16 @ 0x e6000000 (size 0x 400000) ib_mthca 0000:03:00.0: profile[ 5]-- 4/16 @ 0x e6400000 (size 0x 200000) ib_mthca 0000:03:00.0: profile[ 6]--12/15 @ 0x e6600000 (size 0x 100000) ib_mthca 0000:03:00.0: profile[ 7]-- 8/13 @ 0x e6700000 (size 0x 80000) ib_mthca 0000:03:00.0: profile[ 8]--11/11 @ 0x e6780000 (size 0x 10000) ib_mthca 0000:03:00.0: profile[ 9]-- 2/10 @ 0x e6790000 (size 0x 8000) ib_mthca 0000:03:00.0: profile[10]-- 6/ 5 @ 0x e6798000 (size 0x 800) ib_mthca 0000:03:00.0: HCA memory: allocated 106082 KB/124928 KB (18846 KB free) b_mthca 0000:03:00.0: Allocated EQ 1 with 65536 entries 
ib_mthca 0000:03:00.0: Allocated EQ 2 with 128 entries ib_mthca 0000:03:00.0: Allocated EQ 3 with 128 entries ib_mthca 0000:03:00.0: Setting mask 00000000000f43fe for eqn 2 ib_mthca 0000:03:00.0: Setting mask 0000000000000400 for eqn 3 ib_mthca 0000:03:00.0: NOP command IRQ test passed ib_mthca 0000:03:00.0: Command 09 completed with status 03 ib_mthca 0000:03:00.0: INIT_IB returned status 03. ib_mthca 0000:03:00.0: Command 09 completed with status 03 ib_mthca 0000:03:00.0: INIT_IB returned status 03. On Aug 11, 2005, at 6:30 PM, Roland Dreier wrote: > Weikuan> At the end of this email, I have included the output from > Weikuan> our system when enabling > Weikuan> CONFIG_INFINIBAND_MTHCA_DEBUG=y. Note that there are > Weikuan> additional four lines of warning message during the > Weikuan> initiation of the device. These are generated from > Weikuan> init_port() function, due to the incorrect return status > Weikuan> of a command to the firmware, INIT_IB. > > Did these warning messages about INIT_IB not show up in the kernel > before you enabled CONFIG_INFINIBAND_MTHCA_DEBUG? They are printed > using mthca_warn(), which should be printed no matter what. > > In any case I guess you built your firmware image without support for > 1X. Is this right? > > Do you have any theory as to why the drivers worked in 64-bit mode and > failed in 32-bit mode? I don't see any reason why the parameters > passed to INIT_IB would be any different. > > Anyway, can you apply the debugging patch below and send the output > you get during device initialization (with > CONFIG_INFINIBAND_MTHCA_DEBUG > enabled, of course)? I'm guessing you'll see something like: > > ib_mthca 0000:02:00.0: Max port width: 2 > > If my guess is correct, then we can use that value to get the correct > width to pass back to INIT_IB. 
> > Thanks, > Roland > > Index: infiniband/hw/mthca/mthca_cmd.c > =================================================================== > --- infiniband/hw/mthca/mthca_cmd.c (revision 3056) > +++ infiniband/hw/mthca/mthca_cmd.c (working copy) > @@ -1031,6 +1031,7 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev > MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); > dev_lim->uar_scratch_entry_sz = size; > > + mthca_dbg(dev, "Max port width: %x\n", dev_lim->max_port_width); > mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", > dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); > mthca_dbg(dev, "Max SRQs: %d, reserved SRQs: %d, entry size: %d\n", > From halr at voltaire.com Sat Aug 13 05:30:22 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Aug 2005 08:30:22 -0400 Subject: [openib-general] 'Couldn't post send' error? In-Reply-To: <20050813024258.GE16924@kalmia.hozed.org> References: <20050813024258.GE16924@kalmia.hozed.org> Message-ID: <1123936214.4403.8849.camel@hal.voltaire.com> Hi Troy, On Fri, 2005-08-12 at 22:42, Troy Benjegerdes wrote: > What's this mean? > > da4:~/NetPIPE_3.6.2# ibv_rc_pingpong 10.1.5.218 > local address: LID 0x0002, QPN 0x0d0404, PSN 0x599dea > remote address: LID 0x0001, QPN 0x090404, PSN 0x93b0c8 > Couldn't post send It means ibv_post_send returned an error. It could mean WQ overflow (appears to be detected at 2 levels (user space and mthca)), too many gathers, or opcode invalid. Are there any mthca errors indicated in the log ? -- Hal > I'm trying to get NetPIPE to work again, and it gets a similiar error. > > It looks like 'uc' and 'ud' versions work just fine. From halr at voltaire.com Sat Aug 13 05:34:46 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Aug 2005 08:34:46 -0400 Subject: [openib-general] what do you think about the following two user level small tools? 
In-Reply-To: <1123853575.4403.7546.camel@hal.voltaire.com>
References: <506C3D7B14CDD411A52C00025558DED60882C49D@mtlex01.yok.mtl.com> <1123853575.4403.7546.camel@hal.voltaire.com>
Message-ID: <1123936485.4403.8856.camel@hal.voltaire.com>

Hi again Dotan,

On Fri, 2005-08-12 at 09:32, Hal Rosenstock wrote:
> vstat is the fourth version of some sort of status. There are ibstat and
> ibstatus working at either the driver or umad layer. There is also
> ibv_devinfo which displays a subset of this info from the verbs layer.
> So this adds some things which may be useful. Perhaps a rename to
> ibv_stat to be more in sync.

If this is to be separate, the options should be consistent with the
other ibv_xxx examples as follows:

printf(" -d, --ib-dev= use IB device (default first device found)\n");
printf(" -i, --ib-port= use port of IB device (default 1)\n");

Also, since this is similar to ibv_devinfo, perhaps it should be merged
into it as a command-line option for more verbose output. That might be
the best approach, and if so, patches to Roland would be in order.

-- Hal

From hozer at hozed.org Sat Aug 13 10:05:22 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Sat, 13 Aug 2005 12:05:22 -0500
Subject: [openib-general] 'Couldn't post send' error?
In-Reply-To: <20050813024258.GE16924@kalmia.hozed.org>
References: <20050813024258.GE16924@kalmia.hozed.org>
Message-ID: <20050813170522.GH16924@kalmia.hozed.org>

On Fri, Aug 12, 2005 at 09:42:58PM -0500, Troy Benjegerdes wrote:
> What's this mean?
>
> da4:~/NetPIPE_3.6.2# ibv_rc_pingpong 10.1.5.218
> local address: LID 0x0002, QPN 0x0d0404, PSN 0x599dea
> remote address: LID 0x0001, QPN 0x090404, PSN 0x93b0c8
> Couldn't post send
>
> I'm trying to get NetPIPE to work again, and it gets a similar error.
>
> It looks like 'uc' and 'ud' versions work just fine.

Interesting.. it looks like it's size dependent..

da5:~# ibv_rc_pingpong -s 65536
  local address:  LID 0x0001, QPN 0x250404, PSN 0xabcbf7
  remote address: LID 0x0002, QPN 0x2a0404, PSN 0xa2f603
131072000 bytes in 0.15 seconds = 7225.03 Mbit/sec
1000 iters in 0.15 seconds = 145.13 usec/iter

da5:~# ibv_rc_pingpong -s 512
  local address:  LID 0x0001, QPN 0x260404, PSN 0xa7be00
  remote address: LID 0x0002, QPN 0x2b0404, PSN 0x534952
Couldn't post send

I'm running the kernel modules from 2.6.13-rc6, and userspace libs
checked out from subversion sometime last week. Do I need to have
matching kernel and userspace versions?

I can't find anything in the kernel log messages.

This is also with a new single-port memfree card in a dual opteron iwill
DK8NES motherboard, with the cards connected back-to-back with no switch.

From rolandd at cisco.com Sat Aug 13 11:13:01 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Sat, 13 Aug 2005 11:13:01 -0700
Subject: [openib-general] 'Couldn't post send' error?
References: <20050813024258.GE16924@kalmia.hozed.org>
Message-ID: <52r7cxk8nm.fsf@cisco.com>

    Troy> What's this mean? da4:~/NetPIPE_3.6.2# ibv_rc_pingpong
    Troy> 10.1.5.218 local address: LID 0x0002, QPN 0x0d0404, PSN
    Troy> 0x599dea remote address: LID 0x0001, QPN 0x090404, PSN
    Troy> 0x93b0c8 Couldn't post send

ibv_post_send() is failing. Most likely the send queue is overflowing.

Did ibv_rc_pingpong work in the past? If so what changed in your setup?

If you change the line

	.max_send_wr = 1,

in rc_pingpong.c to

	.max_send_wr = 4,

does it work? How about

	.max_send_wr = 2,

Thanks,
  Roland

From rolandd at cisco.com Sat Aug 13 12:26:45 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Sat, 13 Aug 2005 12:26:45 -0700
Subject: [openib-general] 'Couldn't post send' error?
In-Reply-To: <1123936214.4403.8849.camel@hal.voltaire.com> (Hal Rosenstock's message of "13 Aug 2005 08:30:22 -0400")
References: <20050813024258.GE16924@kalmia.hozed.org>
 <1123936214.4403.8849.camel@hal.voltaire.com>
Message-ID: <52mznlk58q.fsf@cisco.com>

 Hal> It means ibv_post_send returned an error. It could mean WQ
 Hal> overflow (appears to be detected at 2 levels (user space and
 Hal> mthca)), too many gathers, or opcode invalid.

Send posting doesn't go into the kernel. The WQ overflow is only
detected in the post_send routine in qp.c from libmthca.

 - R.

From dotanb at mellanox.co.il Sat Aug 13 22:18:41 2005
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Sun, 14 Aug 2005 08:18:41 +0300
Subject: [openib-general] RE: the tests user/libibverbs/examples/*_pingpong.c doesn't support the long format of the parameter rx-depth
Message-ID: <506C3D7B14CDD411A52C00025558DED60882C519@mtlex01.yok.mtl.com>

> Are you looking at the latest svn? In my tree it seems that I checked
> in that fix in rev 3041.
>
> - R.
>

You are right, sorry.

From eitan at mellanox.co.il Sat Aug 13 22:40:24 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 14 Aug 2005 08:40:24 +0300
Subject: [openib-general] osm build errors gen1
Message-ID: <506C3D7B14CDD411A52C00025558DED607C30638@mtlex01.yok.mtl.com>

Hi Joe,

OpenSM from gen1 is now being ported to gen2. It will take some time
(probably another week). Then it will be placed on a side branch until
Hal reviews the changes (that will take a while, as it has already taken
Yael more than a week of merging).

Meanwhile you can get OpenSM on gen1 from Mellanox IBGD.

EZ

From eitan at mellanox.co.il Sat Aug 13 22:47:03 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 14 Aug 2005 08:47:03 +0300
Subject: [openib-general] RE: [PATCH] osm: add a main auto tools project for osm
Message-ID: <506C3D7B14CDD411A52C00025558DED607C30639@mtlex01.yok.mtl.com>

There is no intention to have only one giant auto tools project for all
user level code... Only an optional way to install OpenSM by using a
single project.

Also, --prefix is supported at all levels. But normally when you say
--prefix=dir you expect executables to be placed under dir/bin ...
The patch I will provide will support that.

We are still looking for any objection to changing the default prefix to
/usr/local rather than /usr/local/ib

Any objections?

EZ

From eitan at mellanox.co.il Sun Aug 14 03:50:05 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 14 Aug 2005 13:50:05 +0300
Subject: [openib-general] [PATCH] osm: add a main auto tools project for osm
Message-ID: <42FF21DD.2060306@mellanox.co.il>

Hi Hal,

This fixes the patch I sent last week by moving the bin and lib into
/usr/local and keeping the includes in place:
/usr/local/include/infiniband

This patch includes:
1. Added a top level autotools project for OpenSM.
   So now you need autogen.sh && configure && make && make install just
   once for osm (previously needed 4: complib, libvendor, opensm, osmtest).
2. Cleanup the direct override of libdir, bindir. Support --prefix
3. Move osm lib, bin into default prefix (/usr/local)
4. Support debug build for OpenSM using --enable-debug
   This is important to allow for asserts during runtime and various
   other additional debug features.
   Since the generated complib can not be used with the release version
   we also use a special header file that stores the type of build
   for applications that wish to link with it.
5.
Cleanup stale use of AC_CHECK_LIB with no parameters I tested the patch on : 2.6.12.3-smp SuSE Linux 9.3 (i586) Signed-off-by: Eitan Zahavi eitan at mellanox.co.il Index: include/configure.in =================================================================== --- include/configure.in (revision 3036) +++ include/configure.in (working copy) @@ -13,7 +13,7 @@ dnl Checks for programs AC_PROG_CC dnl Checks for libraries -AC_CHECK_LIB +dnl AC_CHECK_LIB - need to provide symbol and library... what do we depend on? dnl Checks for header files. AC_HEADER_STDC Index: libvendor/configure.in =================================================================== --- libvendor/configure.in (revision 3036) +++ libvendor/configure.in (working copy) @@ -47,5 +47,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl support debug mode +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile libosmvendor.spec]) AC_OUTPUT Index: libvendor/Makefile.am =================================================================== --- libvendor/Makefile.am (revision 3036) +++ libvendor/Makefile.am (working copy) @@ -1,15 +1,19 @@ -libdir = ${exec_prefix}/ib/lib - SUBDIRS = . 
+if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif + INCLUDES = -I$(srcdir)/../include \ -I$(srcdir)/../../libibcommon/include/infiniband \ -I$(srcdir)/../../libibumad/include/infiniband lib_LTLIBRARIES = libosmvendor.la -libosmvendor_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT +libosmvendor_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) if HAVE_LD_VERSION_SCRIPT libosmvendor_version_script = -Wl,--version-script=$(srcdir)/libosmvendor.map Index: libvendor/autogen.sh =================================================================== --- libvendor/autogen.sh (revision 3036) +++ libvendor/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! /bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: complib/autogen.sh =================================================================== --- complib/autogen.sh (revision 3036) +++ complib/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! 
/bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: complib/configure.in =================================================================== --- complib/configure.in (revision 3036) +++ complib/configure.in (working copy) @@ -31,6 +31,7 @@ AC_C_INLINE AC_TYPE_SIZE_T AC_HEADER_TIME +dnl We use --version-script with ld if possible AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then ac_cv_version_script=yes @@ -40,5 +41,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl Support debug mode build - if enable-debug provided the DEBUG variable is set +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile libosmcomp.spec]) AC_OUTPUT Index: complib/Makefile.am =================================================================== --- complib/Makefile.am (revision 3036) +++ complib/Makefile.am (working copy) @@ -1,5 +1,5 @@ -libdir = ${exec_prefix}/ib/lib +# libdir = ${exec_prefix}/ib/lib SUBDIRS = . 
@@ -7,7 +7,13 @@ INCLUDES = -I$(srcdir)/../include lib_LTLIBRARIES = libosmcomp.la -libosmcomp_la_CFLAGS = -Wall +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif + +libosmcomp_la_CFLAGS = -Wall $(DBGFLAGS) if HAVE_LD_VERSION_SCRIPT libosmcomp_version_script = -Wl,--version-script=$(srcdir)/libosmcomp.map Index: AUTHORS =================================================================== Index: configure.in =================================================================== --- configure.in (revision 0) +++ configure.in (revision 0) @@ -0,0 +1,39 @@ +dnl Process this file with autoconf to produce a configure script. + +AC_INIT(autogen.sh) + +dnl use local config dir for extras +AC_CONFIG_AUX_DIR(config) + +dnl Defines the Language +AC_LANG_C + +dnl Auto make +AM_INIT_AUTOMAKE(osm,1.0) + +dnl Provides control over re-making of all auto files +dnl We also use it to define swig dependencies so end +dnl users do not see them. +AM_MAINTAINER_MODE + +dnl Required for cases make defines a MAKE=make ??? 
Why +AC_PROG_MAKE_SET + +dnl Define an input config option to control debug compile +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debugging], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + +dnl Configure the following subdirs +AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest) + +dnl Create the following Makefiles +AC_OUTPUT(Makefile) + + + Index: ChangeLog =================================================================== Index: README =================================================================== Index: osmtest/configure.in =================================================================== --- osmtest/configure.in (revision 3036) +++ osmtest/configure.in (working copy) @@ -52,5 +52,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl support debug mode +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile]) AC_OUTPUT Index: osmtest/Makefile.am =================================================================== --- osmtest/Makefile.am (revision 3036) +++ osmtest/Makefile.am (working copy) @@ -1,6 +1,9 @@ -bindir = ${exec_prefix}/ib/bin -libdir = ${exec_prefix}/ib/lib +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif INCLUDES = -I$(srcdir)/include \ -I$(srcdir)/../include \ @@ -11,12 +14,9 @@ bin_PROGRAMS = osmtest osmtest_SOURCES = main.c osmtest.c osmt_service.c osmt_slvl_vl_arb.c \ osmt_multicast.c osmt_inform.c -osmtest_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT -osmtest_LDADD = $(libdir)/libibumad.la \ - $(libdir)/libibcommon.la \ - $(libdir)/libopensm.la \ 
- $(libdir)/libosmcomp.la \ - $(libdir)/libosmvendor.la +osmtest_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) +osmtest_LDADD = -L../complib -L../libvendor -L../opensm -L$(libdir) \ + -libumad -libcommon -lopensm -losmcomp -losmvendor osmtest_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread -L../opensm Index: osmtest/autogen.sh =================================================================== --- osmtest/autogen.sh (revision 3036) +++ osmtest/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! /bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: opensm/configure.in =================================================================== --- opensm/configure.in (revision 3036) +++ opensm/configure.in (working copy) @@ -52,5 +52,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl support debug mode +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile]) AC_OUTPUT Index: opensm/autogen.sh =================================================================== --- opensm/autogen.sh (revision 3036) +++ opensm/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! 
/bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: opensm/Makefile.am =================================================================== --- opensm/Makefile.am (revision 3036) +++ opensm/Makefile.am (working copy) @@ -1,14 +1,17 @@ -bindir = ${exec_prefix}/ib/bin -libdir = ${exec_prefix}/ib/lib - INCLUDES = -I$(srcdir)/../include \ -I$(srcdir)/../../libibcommon/include/infiniband \ -I$(srcdir)/../../libibumad/include/infiniband lib_LTLIBRARIES = libopensm.la -libopensm_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif + +libopensm_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) if HAVE_LD_VERSION_SCRIPT libopensm_version_script = -Wl,--version-script=$(srcdir)/libopensm.map @@ -60,12 +63,13 @@ opensm_SOURCES = main.c osm_drop_mgr.c o osm_ucast_mgr.c osm_ucast_updn.c \ osm_vl15intf.c osm_vl_arb_rcv.c\ osm_vl_arb_rcv_ctrl.c -opensm_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT -opensm_LDADD = $(libdir)/libibumad.la \ - $(libdir)/libibcommon.la \ - $(srcdir)/libopensm.la \ - $(libdir)/libosmcomp.la \ - $(libdir)/libosmvendor.la +opensm_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) + +# we need to be able to load libraries from local build subtree before make install +# we always give precedence to local tree libs and then use the pre-installed ones. 
+opensm_LDADD = -L../complib -L../libvendor -L$(libdir) \ + -libumad -lopensm -losmcomp -losmvendor + opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread opensmincludedir = $(includedir)/infiniband/opensm @@ -79,4 +83,3 @@ EXTRA_DIST = $(srcdir)/../include/opensm $(srcdir)/../include/opensm/osm_log.h \ $(srcdir)/../include/opensm/osm_madw.h \ $(srcdir)/../include/opensm/osm_mad_pool.h - Index: INSTALL =================================================================== --- INSTALL (revision 0) +++ INSTALL (revision 0) @@ -0,0 +1,231 @@ +Installation Instructions +************************* + +Copyright (C) 1994, 1995, 1996, 1999, 2000, 2001, 2002, 2004 Free +Software Foundation, Inc. + +This file is free documentation; the Free Software Foundation gives +unlimited permission to copy, distribute and modify it. + +Basic Installation +================== + +These are generic installation instructions. + + The `configure' shell script attempts to guess correct values for +various system-dependent variables used during compilation. It uses +those values to create a `Makefile' in each directory of the package. +It may also create one or more `.h' files containing system-dependent +definitions. Finally, it creates a shell script `config.status' that +you can run in the future to recreate the current configuration, and a +file `config.log' containing compiler output (useful mainly for +debugging `configure'). + + It can also use an optional file (typically called `config.cache' +and enabled with `--cache-file=config.cache' or simply `-C') that saves +the results of its tests to speed up reconfiguring. (Caching is +disabled by default to prevent problems with accidental use of stale +cache files.) + + If you need to do unusual things to compile the package, please try +to figure out how `configure' could check whether to do them, and mail +diffs or instructions to the address given in the `README' so they can +be considered for the next release. 
If you are using the cache, and at +some point `config.cache' contains results you don't want to keep, you +may remove or edit it. + + The file `configure.ac' (or `configure.in') is used to create +`configure' by a program called `autoconf'. You only need +`configure.ac' if you want to change it or regenerate `configure' using +a newer version of `autoconf'. + +The simplest way to compile this package is: + + 1. `cd' to the directory containing the package's source code and type + `./configure' to configure the package for your system. If you're + using `csh' on an old version of System V, you might need to type + `sh ./configure' instead to prevent `csh' from trying to execute + `configure' itself. + + Running `configure' takes awhile. While running, it prints some + messages telling which features it is checking for. + + 2. Type `make' to compile the package. + + 3. Optionally, type `make check' to run any self-tests that come with + the package. + + 4. Type `make install' to install the programs and any data files and + documentation. + + 5. You can remove the program binaries and object files from the + source code directory by typing `make clean'. To also remove the + files that `configure' created (so you can compile the package for + a different kind of computer), type `make distclean'. There is + also a `make maintainer-clean' target, but that is intended mainly + for the package's developers. If you use it, you may have to get + all sorts of other programs in order to regenerate files that came + with the distribution. + +Compilers and Options +===================== + +Some systems require unusual options for compilation or linking that the +`configure' script does not know about. Run `./configure --help' for +details on some of the pertinent environment variables. + + You can give `configure' initial values for configuration parameters +by setting variables in the command line or in the environment. 
Here +is an example: + + ./configure CC=c89 CFLAGS=-O2 LIBS=-lposix + + *Note Defining Variables::, for more details. + +Compiling For Multiple Architectures +==================================== + +You can compile the package for more than one kind of computer at the +same time, by placing the object files for each architecture in their +own directory. To do this, you must use a version of `make' that +supports the `VPATH' variable, such as GNU `make'. `cd' to the +directory where you want the object files and executables to go and run +the `configure' script. `configure' automatically checks for the +source code in the directory that `configure' is in and in `..'. + + If you have to use a `make' that does not support the `VPATH' +variable, you have to compile the package for one architecture at a +time in the source code directory. After you have installed the +package for one architecture, use `make distclean' before reconfiguring +for another architecture. + +Installation Names +================== + +By default, `make install' will install the package's files in +`/usr/local/bin', `/usr/local/man', etc. You can specify an +installation prefix other than `/usr/local' by giving `configure' the +option `--prefix=PREFIX'. + + You can specify separate installation prefixes for +architecture-specific files and architecture-independent files. If you +give `configure' the option `--exec-prefix=PREFIX', the package will +use PREFIX as the prefix for installing programs and libraries. +Documentation and other data files will still use the regular prefix. + + In addition, if you use an unusual directory layout you can give +options like `--bindir=DIR' to specify different values for particular +kinds of files. Run `configure --help' for a list of the directories +you can set and what kinds of files go in them. 
+ + If the package supports it, you can cause programs to be installed +with an extra prefix or suffix on their names by giving `configure' the +option `--program-prefix=PREFIX' or `--program-suffix=SUFFIX'. + +Optional Features +================= + +Some packages pay attention to `--enable-FEATURE' options to +`configure', where FEATURE indicates an optional part of the package. +They may also pay attention to `--with-PACKAGE' options, where PACKAGE +is something like `gnu-as' or `x' (for the X Window System). The +`README' should mention any `--enable-' and `--with-' options that the +package recognizes. + + For packages that use the X Window System, `configure' can usually +find the X include and library files automatically, but if it doesn't, +you can use the `configure' options `--x-includes=DIR' and +`--x-libraries=DIR' to specify their locations. + +Specifying the System Type +========================== + +There may be some features `configure' cannot figure out automatically, +but needs to determine by the type of machine the package will run on. +Usually, assuming the package is built to be run on the _same_ +architectures, `configure' can figure that out, but if it prints a +message saying it cannot guess the machine type, give it the +`--build=TYPE' option. TYPE can either be a short name for the system +type, such as `sun4', or a canonical name which has the form: + + CPU-COMPANY-SYSTEM + +where SYSTEM can have one of these forms: + + OS KERNEL-OS + + See the file `config.sub' for the possible values of each field. If +`config.sub' isn't included in this package, then this package doesn't +need to know the machine type. + + If you are _building_ compiler tools for cross-compiling, you should +use the `--target=TYPE' option to select the type of system they will +produce code for. 
+ + If you want to _use_ a cross compiler, that generates code for a +platform different from the build platform, you should specify the +"host" platform (i.e., that on which the generated programs will +eventually be run) with `--host=TYPE'. + +Sharing Defaults +================ + +If you want to set default values for `configure' scripts to share, you +can create a site shell script called `config.site' that gives default +values for variables like `CC', `cache_file', and `prefix'. +`configure' looks for `PREFIX/share/config.site' if it exists, then +`PREFIX/etc/config.site' if it exists. Or, you can set the +`CONFIG_SITE' environment variable to the location of the site script. +A warning: not all `configure' scripts look for a site script. + +Defining Variables +================== + +Variables not defined in a site shell script can be set in the +environment passed to `configure'. However, some packages may run +configure again during the build, and the customized values of these +variables may be lost. In order to avoid this problem, you should set +them in the `configure' command line, using `VAR=value'. For example: + + ./configure CC=/usr/local2/bin/gcc + +will cause the specified gcc to be used as the C compiler (unless it is +overridden in the site shell script). + +`configure' Invocation +====================== + +`configure' recognizes the following options to control how it operates. + +`--help' +`-h' + Print a summary of the options to `configure', and exit. + +`--version' +`-V' + Print the version of Autoconf used to generate the `configure' + script, and exit. + +`--cache-file=FILE' + Enable the cache: use and save the results of the tests in FILE, + traditionally `config.cache'. FILE defaults to `/dev/null' to + disable caching. + +`--config-cache' +`-C' + Alias for `--cache-file=config.cache'. + +`--quiet' +`--silent' +`-q' + Do not print messages saying which checks are being made. 
To + suppress all normal output, redirect it to `/dev/null' (any error + messages will still be shown). + +`--srcdir=DIR' + Look for the package's source code in directory DIR. Usually + `configure' can determine that directory automatically. + +`configure' also accepts some other, not widely useful, options. Run +`configure --help' for more details. + Index: COPYING =================================================================== --- COPYING (revision 0) +++ COPYING (revision 0) @@ -0,0 +1,32 @@ + Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available from the file + COPYING in the main directory of this source tree, or the + OpenIB.org BSD license below: + + Redistribution and use in source and binary forms, with or + without modification, are permitted provided that the following + conditions are met: + + - Redistributions of source code must retain the above + copyright notice, this list of conditions and the following + disclaimer. + + - Redistributions in binary form must reproduce the above + copyright notice, this list of conditions and the following + disclaimer in the documentation and/or other materials + provided with the distribution. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. 
+ Index: Makefile.am =================================================================== --- Makefile.am (revision 0) +++ Makefile.am (revision 0) @@ -0,0 +1,16 @@ + +# note that order matters: make the lib first then use it +SUBDIRS = complib libvendor opensm osmtest + +# this will control the update of the files in order +MAINTAINERCLEANFILES = Makefile.in aclocal.m4 configure config-h.in + +ACLOCAL = aclocal -I $(ac_aux_dir) + +# we should provide a hint for other apps about the build mode of this project +install-exec-hook: +if DEBUG + echo "define osm_build_type \"debug\"" > $(includedir)/infiniband/opensm/osm_build_id.h +else + echo "define osm_build_type \"free\"" > $(includedir)/infiniband/opensm/osm_build_id.h +endif Index: autogen.sh =================================================================== --- autogen.sh (revision 0) +++ autogen.sh (revision 0) @@ -0,0 +1,74 @@ +#!/bin/bash + +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + +# make sure autoconf is up-to-date +ac_ver=`autoconf --version | head -1 | awk '{print $NF}'` +ac_maj=`echo $ac_ver|sed 's/\..*//'` +ac_min=`echo $ac_ver|sed 's/.*\.//'` +if [[ $ac_maj < 2 ]]; then + echo Min autoconf version is 2.59 + exit +fi +if [[ $ac_maj = 2 && $ac_min < 59 ]]; then + echo Min autoconf version is 2.59 + exit +fi + +# make sure automake is up-to-date +am_ver=`automake --version | head -1 | awk '{print $NF}'` +am_maj=`echo $am_ver|sed 's/\..*//'` +am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` +am_sub=`echo $am_ver|sed 's/.*\.//'` +if [[ $am_maj < 1 ]]; then + echo Min automake version is 1.9.3 + exit +fi +if [[ $am_maj = 1 && $am_min < 9 ]]; then + echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.3" + exit +fi +if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 3 ]]; then + echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.3" + exit +fi + +# make sure libtool is up-to-date +lt_ver=`libtool --version | head -1 | 
awk '{print $4}'` +lt_maj=`echo $lt_ver|sed 's/\..*//'` +lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` +lt_sub=`echo $lt_ver|sed 's/.*\.//'` +if [[ $lt_maj < 1 ]]; then + echo Min libtool version is 1.4.2 + exit +fi +if [[ $lt_maj = 1 && $lt_min < 4 ]]; then + echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" + exit +fi +if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then + echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" + exit +fi + +# cleanup +find . \( -name Makefile.in -o -name aclocal.m4 -o -name autom4te.cache -o -name configure -o -name aclocal.m4 \) -exec \rm -rf {} \; -prune + +# handle our own autoconf: +aclocal -I config 2>&1 | grep -v "arning: underquoted definition of" +automake --add-missing --gnu +autoconf + +# visit all sub directories with autogen.sh +anyErr=0 +for a in `ls */autogen.sh`; do + echo Visiting $a + $a 2>& 1 | sed 's/^/| /' + if test $? != 0; then + echo $a failed + anyErr=1 + fi +done + +exit $anyErr Property changes on: autogen.sh ___________________________________________________________________ Name: svn:executable + * Index: NEWS =================================================================== Index: Makefile =================================================================== --- Makefile (revision 3036) +++ Makefile (working copy) @@ -1,44 +0,0 @@ -LIBS:= complib libvendor -BIN:= opensm -UTIL:= include - -SUBDIRS=$(BIN) $(UTIL) - -all: BUILD_TARG=all -all: libs_install subdirs - @echo Make all done - -install: BUILD_TARG=install -install: subdirs - @echo Install done - -clean: SUBDIRS= $(LIBS) $(BIN) -clean: BUILD_TARG=clean -clean: subdirs - @echo Clean done - -rmdep: - find $(SUBDIRS) -name ".depend" | xargs rm -f - -depend: SUBDIRS= $(LIBS) $(BIN) $(UTIL) -depend: BUILD_TARG=depend -depend: rmdep subdirs - @echo Depend done - -.PHONY : subdirs -subdirs: - @for i in $(SUBDIRS); do\ - if [ -e $$i/Makefile ]; then\ - if !(cd $$i; make $(BUILD_TARG)); then exit 
1; fi\ - fi\ - done\ - -.PHONY : libs_install -libs_install: - @for i in $(LIBS); do\ - if [ -e $$i/Makefile ]; then\ - if !(cd $$i; make install); then exit 1; fi\ - fi\ - done\ - -export BUILD_TARG From mst at mellanox.co.il Sun Aug 14 04:13:57 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 14 Aug 2005 14:13:57 +0300 Subject: [openib-general] Re: [PATCH] osm: add a main auto tools project for osm In-Reply-To: <42FF21DD.2060306@mellanox.co.il> References: <42FF21DD.2060306@mellanox.co.il> Message-ID: <20050814111357.GC23848@mellanox.co.il> Quoting r. Eitan Zahavi : > Subject: [PATCH] osm: add a main auto tools project for osm > > Hi Hal > > Fixing the patch I have sent last week by moving the bin and lib into > the /usr/local and keep the > includes in place: /usr/local/include/infiniband > > This patch includes: > 1. Added a top level autotools project for OpenSM. > So now you need autogen.sh && configure && make && make install just > once for osm > (previously needed 4: complib, libvendor, opensm, osmtest). > 2. Cleanup the direct override of libdir, bindir. Support --prefix > 3. Move osm lib, bin into default prefix (/usr/local) > 4. Support debug build for OpenSM using --enable-debug > This is important to allow for asserts during runtime and various > other additional debug features. > Since the generated compilb can not be used with the release version > we also use a special header file that stores the type of build > for applications that wish to link with it. > 5. Cleanup stale use of AC_CHECK_LIB with no parameters Eitan, for debug, I suggest removing -O0. This is the default, lets the user set the optimization level with environment variables. 
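
A minimal sketch of that precedence, assuming gcc (where giving no -O option at all is equivalent to -O0). The helper name pick_cflags and its two arguments are hypothetical and not part of the OpenSM tree; in a real configure.in the user's CFLAGS would have to be captured before AC_PROG_CC runs, since that macro fills in "-g -O2" when CFLAGS is unset:

```shell
#!/bin/sh
# Sketch only: choose compiler flags so that an explicit CFLAGS always wins.
# pick_cflags is a hypothetical helper, not code from the OpenSM tree.
#   $1 = "true" for a debug build, anything else for a release build
#   $2 = CFLAGS exactly as the user supplied them (may be empty)
pick_cflags() {
    debug=$1
    user_cflags=$2
    if [ -n "$user_cflags" ]; then
        # The user asked for specific flags: do not second-guess them.
        printf '%s\n' "$user_cflags"
    elif [ "$debug" = "true" ]; then
        # No -O level here: gcc then compiles as if -O0 had been given.
        printf '%s\n' "-g -D_DEBUG_"
    else
        # Release default, matching the DBGFLAGS used in the patch.
        printf '%s\n' "-g -O2"
    fi
}
```

With something like this, a debug build stays unoptimized by default, while CFLAGS=-O1 on the command line would still be honoured.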
-- 
MST

From halr at voltaire.com Sun Aug 14 05:23:23 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 Aug 2005 08:23:23 -0400
Subject: [openib-general] Re: [PATCH] osm: add a main auto tools project for osm
In-Reply-To: <42FF21DD.2060306@mellanox.co.il>
References: <42FF21DD.2060306@mellanox.co.il>
Message-ID: <1124022202.4403.8956.camel@hal.voltaire.com>

Hi Eitan,

On Sun, 2005-08-14 at 06:50, Eitan Zahavi wrote:
> Hi Hal
>
> Fixing the patch I have sent last week by moving the bin and lib into
> the /usr/local and keep the
> includes in place: /usr/local/include/infiniband
>
> This patch includes:
> 1. Added a top level autotools project for OpenSM.
> So now you need autogen.sh && configure && make && make install just
> once for osm
> (previously needed 4: complib, libvendor, opensm, osmtest).
> 2. Cleanup the direct override of libdir, bindir. Support --prefix
> 3. Move osm lib, bin into default prefix (/usr/local)
> 4. Support debug build for OpenSM using --enable-debug
> This is important to allow for asserts during runtime and various
> other additional debug features.
> Since the generated complib can not be used with the release version
> we also use a special header file that stores the type of build
> for applications that wish to link with it.
> 5. Cleanup stale use of AC_CHECK_LIB with no parameters
>
> I tested the patch on :
> 2.6.12.3-smp SuSE Linux 9.3 (i586)

A couple of things:

The patch appears to be whitespace wrapped.
patching file include/configure.in
patch: **** malformed patch at line 11: depend on?

Also for debug, should -O0 be removed as Michael suggests ?

Also, why are files without changes included in the patch ?

Thanks.
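
Wrapping of this kind can be caught before a patch is mailed, or after saving it from the list, with a dry run. The following is a minimal sketch, assuming GNU patch and svn-style diffs that apply with -p0 from the project directory; check_patch is a hypothetical helper name, and the patch file should be given as an absolute path because the helper changes directory first:

```shell
#!/bin/sh
# Sketch only: test whether a patch file still applies cleanly.
# check_patch is a hypothetical helper, not part of any OpenIB tool.
#   $1 = directory the patch is meant to be applied in
#   $2 = absolute path to the patch file
# Returns 0 if the patch would apply, non-zero otherwise.
check_patch() {
    # --dry-run parses and matches the hunks without modifying any file,
    # so a line broken in two by a mailer shows up as a malformed patch.
    (cd "$1" && patch -p0 --dry-run < "$2") >/dev/null 2>&1
}
```

A call such as `check_patch /path/to/osm /tmp/osm.diff || echo "patch is damaged"` would then flag the problem before anyone tries to apply it (both paths here are placeholders).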
-- Hal From eitan at mellanox.co.il Sun Aug 14 06:11:07 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 14 Aug 2005 16:11:07 +0300 Subject: [openib-general] RE: [PATCH] osm: add a main auto tools project for osm Message-ID: <506C3D7B14CDD411A52C00025558DED607C30651@mtlex01.yok.mtl.com> Hi Hal, > A couple of things: > > The patch appears to be whitespace wrapped. > patching file include/configure.in > patch: **** malformed patch at line 11: depend on? [EZ] I used Thunderbird which to my best knowledge does not corrupt the white space. > > Also for debug, should -O0 be removed as Michael suggests ? [EZ] I will provide a mode where if the user provides CFLAGS on the command line that include another optimization mode, they will be used. The problem with MST's proposal is that if one does not provide CFLAGS the default would have been -O2. So I'm looking for a way to use a default of -O0 on debug builds. Should I provide the patch in pieces or wait until all these features are in? > > Also, why are files without changes included in the patch ? [EZ] I will double check. > > Thanks > > -- Hal From halr at voltaire.com Sun Aug 14 06:19:09 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Aug 2005 09:19:09 -0400 Subject: [openib-general] RE: [PATCH] osm: add a main auto tools project for osm In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C30651@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607C30651@mtlex01.yok.mtl.com> Message-ID: <1124025548.4403.9081.camel@hal.voltaire.com> On Sun, 2005-08-14 at 09:11, Eitan Zahavi wrote: > Hi Hal, > > > A couple of things: > > > > The patch appears to be whitespace wrapped. > > patching file include/configure.in > > patch: **** malformed patch at line 11: depend on? > [EZ] I used Thunderbird which to my best knowledge does not corrupt > the white space. Just look at the line in question in the email (and the patch).
It was made into 2 lines likely by your mailer. > > Also for debug, should -O0 be removed as Michael suggests ? > [EZ] I will provide a mode where if the user provides CFLAGS on the > command line that include another optimization mode, they will be used. > The problem with MST's proposal is that if one does not provide CFLAGS > the default would have been -O2. > > So I'm looking for a way to use a default of -O0 on debug builds. > > Should I provide the patch in pieces or wait until all these features > are in? It's up to you. IMO it can wait. > > Also, why are files without changes included in the patch ? > [EZ] I will double check. -- Hal From eitan at mellanox.co.il Sun Aug 14 06:54:31 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 14 Aug 2005 16:54:31 +0300 Subject: [openib-general] [PATCH] osm: bad extern C in cl_pthreadpool.h Message-ID: <42FF4D17.9090300@mellanox.co.il> Hi Just found an extra extern C in the file. Needless to say it breaks any c++ include of this file. Signed-off-by: Eitan Zahavi eitan at mellanox.co.il Index: include/complib/cl_threadpool.h =================================================================== --- include/complib/cl_threadpool.h (revision 3036) +++ include/complib/cl_threadpool.h (working copy) @@ -146,13 +146,6 @@ typedef struct _cl_thread_pool * Thread Pool *********/ - -#ifdef __cplusplus -extern "C" -{ -#endif - - /****f* Component Library: Thread Pool/cl_thread_pool_construct * NAME * cl_thread_pool_construct From mst at mellanox.co.il Sun Aug 14 07:24:41 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Sun, 14 Aug 2005 17:24:41 +0300 Subject: [openib-general] [PATCH] dont drop device reference before use (was Re: sdp: cant unload ib_ipoib module) In-Reply-To: <1123593909.4403.16.camel@hal.voltaire.com> References: <1123007814.2946.12.camel@duffman> <20050803070923.GG15300@mellanox.co.il> <20050804104250.A30741@topspin.com> <20050807154632.GG15300@mellanox.co.il> <1123539115.4402.39.camel@hal.voltaire.com> <20050809124631.GG32419@mellanox.co.il> <1123593909.4403.16.camel@hal.voltaire.com> Message-ID: <20050814142441.GO23848@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: sdp: cant unload ib_ipoib module > > Hi Michael, > > On Tue, 2005-08-09 at 08:46, Michael S. Tsirkin wrote: > > ip_rt_put now looks right, but it looks like device_put is still done too early. > > Any idea where it should be done ? The following should finally fix the dev_put issue. I think an extra dev_hold on a path lookup is not a big deal, and helps make the code much simpler. Works fine, for me. Hal, can you give it a whirl? Signed-off-by: Michael S. Tsirkin Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_link.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_link.c +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_link.c @@ -327,7 +327,7 @@ static void do_link_path_lookup(void *da { struct sdp_path_info *info = data; struct ipoib_dev_priv *priv; - struct net_device *loopback = NULL; + struct net_device *dev = NULL; struct rtable *rt; int counter = 0; int result = 0; @@ -387,7 +387,7 @@ static void do_link_path_lookup(void *da * check for IB device or loopback, the later requires extra * handling. 
*/ - if (ARPHRD_INFINIBAND != rt->u.dst.neighbour->dev->type && + if (rt->u.dst.neighbour->dev->type != ARPHRD_INFINIBAND && !(rt->u.dst.neighbour->dev->flags & IFF_LOOPBACK)) { result = -ENETUNREACH; goto error; @@ -402,23 +402,28 @@ static void do_link_path_lookup(void *da * In case of loopback find a valid IB device on which to * direct the loopback traffic. */ - info->dev = ((rt->u.dst.neighbour->dev->flags & IFF_LOOPBACK) ? - (loopback = ip_dev_find(rt->rt_src)) : - rt->u.dst.neighbour->dev); + if (rt->u.dst.neighbour->dev->flags & IFF_LOOPBACK) + dev = ip_dev_find(rt->rt_src); + else { + dev = rt->u.dst.neighbour->dev; + dev_hold(dev); + } info->gw = rt->rt_gateway; info->src = rt->rt_src; /* true source IP address */ - if (info->dev->flags & IFF_LOOPBACK) - while ((info->dev = dev_get_by_index(++counter))) { - - dev_put(info->dev); - if (ARPHRD_INFINIBAND == info->dev->type && - (info->dev->flags & IFF_UP)) + if (dev->flags & IFF_LOOPBACK) { + dev_put(dev); + while ((dev = dev_get_by_index(++counter))) { + if (dev->type == ARPHRD_INFINIBAND && + (dev->flags & IFF_UP)) break; + else + dev_put(dev); } + } - if (!info->dev) { + if (!dev) { sdp_dbg_warn(NULL, "No device for IB comm <%s:%08x:%08x>", rt->u.dst.neighbour->dev->name, rt->u.dst.neighbour->dev->flags, @@ -429,20 +434,20 @@ static void do_link_path_lookup(void *da /* * lookup local info. */ - priv = info->dev->priv; + priv = dev->priv; info->ca = priv->ca; info->port = priv->port; info->path.pkey = cpu_to_be16(priv->pkey); info->path.numb_path = 1; - memcpy(&info->path.sgid, info->dev->dev_addr + 4, sizeof(union ib_gid)); + memcpy(&info->path.sgid, dev->dev_addr + 4, sizeof(union ib_gid)); /* * If the routing device is loopback save the device address of * the IB device which was found. 
*/ if (rt->u.dst.neighbour->dev->flags & IFF_LOOPBACK) { - memcpy(&info->path.dgid, info->dev->dev_addr + 4, + memcpy(&info->path.dgid, dev->dev_addr + 4, sizeof(union ib_gid)); goto path; @@ -466,10 +471,10 @@ arp: arp_send(ARPOP_REQUEST, ETH_P_ARP, info->gw, - info->dev, + dev, info->src, NULL, - info->dev->dev_addr, + dev->dev_addr, NULL); /* * start arp timer if it's not already going. @@ -498,8 +503,8 @@ arp: info->flags |= SDP_LINK_F_ARP; queue_delayed_work(link_wq, &info->timer, info->arp_time); - if (loopback) - dev_put(loopback); + if (dev) + dev_put(dev); ip_rt_put(rt); return; path: @@ -509,14 +514,14 @@ path: goto error; } done: - if (loopback) - dev_put(loopback); + if (dev) + dev_put(dev); ip_rt_put(rt); return; error: sdp_path_info_destroy(info, result); - if (loopback) - dev_put(loopback); + if (dev) + dev_put(dev); ip_rt_put(rt); } @@ -690,7 +695,7 @@ static int sdp_link_arp_recv(struct sk_b arp_hdr = (struct arphdr *)skb->nh.raw; - if (ARPHRD_INFINIBAND != dev->type || + if (dev->type != ARPHRD_INFINIBAND || (arp_hdr->ar_op != __constant_htons(ARPOP_REPLY) && arp_hdr->ar_op != __constant_htons(ARPOP_REQUEST))) goto done; Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_link.h =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_link.h +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_link.h @@ -59,8 +59,6 @@ struct sdp_path_info { struct work_struct timer; /* arp request timers. */ - struct net_device *dev; /* ipoib device */ - struct sdp_path_info *next; /* next element in path list. 
*/ struct sdp_path_info **pext; /* previous next element in path list */ -- MST From halr at voltaire.com Sun Aug 14 07:42:00 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Aug 2005 10:42:00 -0400 Subject: [openib-general] Re: [PATCH] osm: bad extern C in cl_pthreadpool.h In-Reply-To: <42FF4D17.9090300@mellanox.co.il> References: <42FF4D17.9090300@mellanox.co.il> Message-ID: <1124030518.4403.9260.camel@hal.voltaire.com> On Sun, 2005-08-14 at 09:54, Eitan Zahavi wrote: > Just found an extra extern C in the file. Needless to say it breaks any > c++ include of this file. Thanks. Applied. > Signed-off-by: Eitan Zahavi eitan at mellanox.co.il The email address should have braces around it in the signed off line: Signed-off-by: Eitan Zahavi From eitan at mellanox.co.il Sun Aug 14 07:58:09 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 14 Aug 2005 17:58:09 +0300 Subject: [openib-general] [PATCH] osm: add a main auto tools project for osm Message-ID: <42FF5C01.8020800@mellanox.co.il> Hi Hal Fixing the patch I have sent earlier today. Now I am sure the text included is not wrapped. Removed the patch with empty file. Also I have fixed another issue with the iba/ib_types.h installation. This patch includes: 1. Added a top level autotools project for OpenSM. So now you need autogen.sh && configure && make && make install just once for osm (previously needed 4: complib, libvendor, opensm, osmtest). 2. Cleanup the direct override of libdir, bindir. Support --prefix 3. Move osm lib, bin into default prefix (/usr/local) 4. Support debug build for OpenSM using --enable-debug This is important to allow for asserts during runtime and various other additional debug features. Since the generated compilb can not be used with the release version we also use a special header file that stores the type of build for applications that wish to link with it. 5. Cleanup stale use of AC_CHECK_LIB with no parameters 6. 
Resolved another bug: iba/ib_types.h not installed correctly I tested the patch on : 2.6.12.3-smp SuSE Linux 9.3 (i586) Signed-off-by: Eitan Zahavi eitan at mellanox.co.il Index: include/configure.in =================================================================== --- include/configure.in (revision 3036) +++ include/configure.in (working copy) @@ -13,7 +13,7 @@ dnl Checks for programs AC_PROG_CC dnl Checks for libraries -AC_CHECK_LIB +dnl AC_CHECK_LIB - need to provide symbol and library... what do we depend on? dnl Checks for header files. AC_HEADER_STDC Index: include/Makefile.am =================================================================== --- include/Makefile.am (revision 3036) +++ include/Makefile.am (working copy) @@ -1,17 +1,9 @@ - -libincdir = ${exec_prefix}/ib/lib +# HACK: as we do not use the standard "prefix/include" subdir +includedir = ${prefix}/include/infiniband SUBDIRS = . -INCLUDES = - -lib_LTLIBRARIES = - -lib_version_script = - -libincincludedir = $(includedir)/infiniband/iba - -libincinclude_HEADERS = $(srcdir)/iba/ib_types.h +nobase_include_HEADERS = iba/ib_types.h EXTRA_DIST = $(srcdir)/iba/ib_types.h Index: include/autogen.sh =================================================================== --- include/autogen.sh (revision 3036) +++ include/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! 
/bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: libvendor/configure.in =================================================================== --- libvendor/configure.in (revision 3036) +++ libvendor/configure.in (working copy) @@ -47,5 +47,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl support debug mode +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile libosmvendor.spec]) AC_OUTPUT Index: libvendor/Makefile.am =================================================================== --- libvendor/Makefile.am (revision 3036) +++ libvendor/Makefile.am (working copy) @@ -1,15 +1,19 @@ -libdir = ${exec_prefix}/ib/lib - SUBDIRS = . +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif + INCLUDES = -I$(srcdir)/../include \ -I$(srcdir)/../../libibcommon/include/infiniband \ -I$(srcdir)/../../libibumad/include/infiniband lib_LTLIBRARIES = libosmvendor.la -libosmvendor_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT +libosmvendor_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) if HAVE_LD_VERSION_SCRIPT libosmvendor_version_script = -Wl,--version-script=$(srcdir)/libosmvendor.map Index: libvendor/autogen.sh =================================================================== --- libvendor/autogen.sh (revision 3036) +++ libvendor/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! 
/bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: complib/autogen.sh =================================================================== --- complib/autogen.sh (revision 3036) +++ complib/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! /bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: complib/configure.in =================================================================== --- complib/configure.in (revision 3036) +++ complib/configure.in (working copy) @@ -31,6 +31,7 @@ AC_C_INLINE AC_TYPE_SIZE_T AC_HEADER_TIME +dnl We use --version-script with ld if possible AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then ac_cv_version_script=yes @@ -40,5 +41,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl Support debug mode build - if enable-debug provided the DEBUG variable is set +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile libosmcomp.spec]) AC_OUTPUT Index: complib/Makefile.am =================================================================== --- complib/Makefile.am (revision 3036) +++ complib/Makefile.am (working copy) @@ -1,13 +1,15 @@ -libdir = ${exec_prefix}/ib/lib - -SUBDIRS = . 
- INCLUDES = -I$(srcdir)/../include lib_LTLIBRARIES = libosmcomp.la -libosmcomp_la_CFLAGS = -Wall +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif + +libosmcomp_la_CFLAGS = -Wall $(DBGFLAGS) if HAVE_LD_VERSION_SCRIPT libosmcomp_version_script = -Wl,--version-script=$(srcdir)/libosmcomp.map Index: AUTHORS =================================================================== --- AUTHORS (revision 0) +++ AUTHORS (revision 0) @@ -0,0 +1,7 @@ + +By the chronological order of involvement: +Steve King, Intel +Anil Keshavamurthy, Intel +Eitan Zahavi, Mellanox Technologies, eitan at mellanox.co.il +Yael Kalka, Mellanox Technologies, yael at mellanox.co.il +Hal Rosenstock, Voltaire, halr at voltaire.com Index: configure.in =================================================================== --- configure.in (revision 0) +++ configure.in (revision 0) @@ -0,0 +1,39 @@ +dnl Process this file with autoconf to produce a configure script. + +AC_INIT(autogen.sh) + +dnl use local config dir for extras +AC_CONFIG_AUX_DIR(config) + +dnl Defines the Language +AC_LANG_C + +dnl Auto make +AM_INIT_AUTOMAKE(osm,1.0) + +dnl Provides control over re-making of all auto files +dnl We also use it to define swig dependencies so end +dnl users do not see them. +AM_MAINTAINER_MODE + +dnl Required for cases make defines a MAKE=make ??? 
Why +AC_PROG_MAKE_SET + +dnl Define an input config option to control debug compile +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debugging], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + +dnl Configure the following subdirs +AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest include) + +dnl Create the following Makefiles +AC_OUTPUT(Makefile) + + + Index: ChangeLog =================================================================== --- ChangeLog (revision 0) +++ ChangeLog (revision 0) @@ -0,0 +1,6 @@ +2005-08-14 Eitan Zahavi + + * Provided a top level auto tools project so there is no need to + cd into each of the sub directories and do: + ./autogen.sh && configure && make && make install + Index: README =================================================================== --- README (revision 0) +++ README (revision 0) @@ -0,0 +1,20 @@ +OpenSM README: +-------------- + +OpenSM provides an implementation for an InfiniBand Subnet Manager and +Administrator. Such a software entity is required to run for in order +to initialize the InfiniBand hardware (at least one per each +InfiniBand subnet). + +The full list of OpenSM features is described in the user manual +provided in the doc sub directory. 
+ +The installation of OpenSM includes: + +bin/ + opensm - the SM/SA executable + osmtest - a test program for the SM/SA +lib/ + libosmcomp.{a,so} - component library with generic services and containers + libopensm.{a,so} - opensm services for logs and mad buffer pool + libosmvendor.{a,so} - interface to the user mad service of the driver Index: osmtest/configure.in =================================================================== --- osmtest/configure.in (revision 3036) +++ osmtest/configure.in (working copy) @@ -52,5 +52,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl support debug mode +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile]) AC_OUTPUT Index: osmtest/Makefile.am =================================================================== --- osmtest/Makefile.am (revision 3036) +++ osmtest/Makefile.am (working copy) @@ -1,6 +1,9 @@ -bindir = ${exec_prefix}/ib/bin -libdir = ${exec_prefix}/ib/lib +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif INCLUDES = -I$(srcdir)/include \ -I$(srcdir)/../include \ @@ -11,12 +14,9 @@ bin_PROGRAMS = osmtest osmtest_SOURCES = main.c osmtest.c osmt_service.c osmt_slvl_vl_arb.c \ osmt_multicast.c osmt_inform.c -osmtest_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT -osmtest_LDADD = $(libdir)/libibumad.la \ - $(libdir)/libibcommon.la \ - $(libdir)/libopensm.la \ - $(libdir)/libosmcomp.la \ - $(libdir)/libosmvendor.la +osmtest_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) +osmtest_LDADD = -L../complib -L../libvendor -L../opensm -L$(libdir) \ + -libumad -libcommon -lopensm -losmcomp -losmvendor osmtest_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread 
-L../opensm Index: osmtest/autogen.sh =================================================================== --- osmtest/autogen.sh (revision 3036) +++ osmtest/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! /bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: opensm/configure.in =================================================================== --- opensm/configure.in (revision 3036) +++ opensm/configure.in (working copy) @@ -52,5 +52,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl support debug mode +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile]) AC_OUTPUT Index: opensm/autogen.sh =================================================================== --- opensm/autogen.sh (revision 3036) +++ opensm/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! 
/bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: opensm/Makefile.am =================================================================== --- opensm/Makefile.am (revision 3036) +++ opensm/Makefile.am (working copy) @@ -1,14 +1,17 @@ -bindir = ${exec_prefix}/ib/bin -libdir = ${exec_prefix}/ib/lib - INCLUDES = -I$(srcdir)/../include \ -I$(srcdir)/../../libibcommon/include/infiniband \ -I$(srcdir)/../../libibumad/include/infiniband lib_LTLIBRARIES = libopensm.la -libopensm_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif + +libopensm_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) if HAVE_LD_VERSION_SCRIPT libopensm_version_script = -Wl,--version-script=$(srcdir)/libopensm.map @@ -60,12 +63,13 @@ opensm_SOURCES = main.c osm_drop_mgr.c o osm_ucast_mgr.c osm_ucast_updn.c \ osm_vl15intf.c osm_vl_arb_rcv.c\ osm_vl_arb_rcv_ctrl.c -opensm_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT -opensm_LDADD = $(libdir)/libibumad.la \ - $(libdir)/libibcommon.la \ - $(srcdir)/libopensm.la \ - $(libdir)/libosmcomp.la \ - $(libdir)/libosmvendor.la +opensm_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) + +# we need to be able to load libraries from local build subtree before make install +# we always give precedence to local tree libs and then use the pre-installed ones. 
+opensm_LDADD = -L../complib -L../libvendor -L$(libdir) \ + -libumad -lopensm -losmcomp -losmvendor + opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread opensmincludedir = $(includedir)/infiniband/opensm @@ -79,4 +83,3 @@ EXTRA_DIST = $(srcdir)/../include/opensm $(srcdir)/../include/opensm/osm_log.h \ $(srcdir)/../include/opensm/osm_madw.h \ $(srcdir)/../include/opensm/osm_mad_pool.h - Index: INSTALL =================================================================== --- INSTALL (revision 0) +++ INSTALL (revision 0) @@ -0,0 +1,231 @@ +Installation Instructions +************************* + +Copyright (C) 1994, 1995, 1996, 1999, 2000, 2001, 2002, 2004 Free +Software Foundation, Inc. + +This file is free documentation; the Free Software Foundation gives +unlimited permission to copy, distribute and modify it. + +Basic Installation +================== + +These are generic installation instructions. + + The `configure' shell script attempts to guess correct values for +various system-dependent variables used during compilation. It uses +those values to create a `Makefile' in each directory of the package. +It may also create one or more `.h' files containing system-dependent +definitions. Finally, it creates a shell script `config.status' that +you can run in the future to recreate the current configuration, and a +file `config.log' containing compiler output (useful mainly for +debugging `configure'). + + It can also use an optional file (typically called `config.cache' +and enabled with `--cache-file=config.cache' or simply `-C') that saves +the results of its tests to speed up reconfiguring. (Caching is +disabled by default to prevent problems with accidental use of stale +cache files.) + + If you need to do unusual things to compile the package, please try +to figure out how `configure' could check whether to do them, and mail +diffs or instructions to the address given in the `README' so they can +be considered for the next release. 
If you are using the cache, and at +some point `config.cache' contains results you don't want to keep, you +may remove or edit it. + + The file `configure.ac' (or `configure.in') is used to create +`configure' by a program called `autoconf'. You only need +`configure.ac' if you want to change it or regenerate `configure' using +a newer version of `autoconf'. + +The simplest way to compile this package is: + + 1. `cd' to the directory containing the package's source code and type + `./configure' to configure the package for your system. If you're + using `csh' on an old version of System V, you might need to type + `sh ./configure' instead to prevent `csh' from trying to execute + `configure' itself. + + Running `configure' takes awhile. While running, it prints some + messages telling which features it is checking for. + + 2. Type `make' to compile the package. + + 3. Optionally, type `make check' to run any self-tests that come with + the package. + + 4. Type `make install' to install the programs and any data files and + documentation. + + 5. You can remove the program binaries and object files from the + source code directory by typing `make clean'. To also remove the + files that `configure' created (so you can compile the package for + a different kind of computer), type `make distclean'. There is + also a `make maintainer-clean' target, but that is intended mainly + for the package's developers. If you use it, you may have to get + all sorts of other programs in order to regenerate files that came + with the distribution. + +Compilers and Options +===================== + +Some systems require unusual options for compilation or linking that the +`configure' script does not know about. Run `./configure --help' for +details on some of the pertinent environment variables. + + You can give `configure' initial values for configuration parameters +by setting variables in the command line or in the environment. 
Here +is an example: + + ./configure CC=c89 CFLAGS=-O2 LIBS=-lposix + + *Note Defining Variables::, for more details. + +Compiling For Multiple Architectures +==================================== + +You can compile the package for more than one kind of computer at the +same time, by placing the object files for each architecture in their +own directory. To do this, you must use a version of `make' that +supports the `VPATH' variable, such as GNU `make'. `cd' to the +directory where you want the object files and executables to go and run +the `configure' script. `configure' automatically checks for the +source code in the directory that `configure' is in and in `..'. + + If you have to use a `make' that does not support the `VPATH' +variable, you have to compile the package for one architecture at a +time in the source code directory. After you have installed the +package for one architecture, use `make distclean' before reconfiguring +for another architecture. + +Installation Names +================== + +By default, `make install' will install the package's files in +`/usr/local/bin', `/usr/local/man', etc. You can specify an +installation prefix other than `/usr/local' by giving `configure' the +option `--prefix=PREFIX'. + + You can specify separate installation prefixes for +architecture-specific files and architecture-independent files. If you +give `configure' the option `--exec-prefix=PREFIX', the package will +use PREFIX as the prefix for installing programs and libraries. +Documentation and other data files will still use the regular prefix. + + In addition, if you use an unusual directory layout you can give +options like `--bindir=DIR' to specify different values for particular +kinds of files. Run `configure --help' for a list of the directories +you can set and what kinds of files go in them. 
+ + If the package supports it, you can cause programs to be installed +with an extra prefix or suffix on their names by giving `configure' the +option `--program-prefix=PREFIX' or `--program-suffix=SUFFIX'. + +Optional Features +================= + +Some packages pay attention to `--enable-FEATURE' options to +`configure', where FEATURE indicates an optional part of the package. +They may also pay attention to `--with-PACKAGE' options, where PACKAGE +is something like `gnu-as' or `x' (for the X Window System). The +`README' should mention any `--enable-' and `--with-' options that the +package recognizes. + + For packages that use the X Window System, `configure' can usually +find the X include and library files automatically, but if it doesn't, +you can use the `configure' options `--x-includes=DIR' and +`--x-libraries=DIR' to specify their locations. + +Specifying the System Type +========================== + +There may be some features `configure' cannot figure out automatically, +but needs to determine by the type of machine the package will run on. +Usually, assuming the package is built to be run on the _same_ +architectures, `configure' can figure that out, but if it prints a +message saying it cannot guess the machine type, give it the +`--build=TYPE' option. TYPE can either be a short name for the system +type, such as `sun4', or a canonical name which has the form: + + CPU-COMPANY-SYSTEM + +where SYSTEM can have one of these forms: + + OS KERNEL-OS + + See the file `config.sub' for the possible values of each field. If +`config.sub' isn't included in this package, then this package doesn't +need to know the machine type. + + If you are _building_ compiler tools for cross-compiling, you should +use the `--target=TYPE' option to select the type of system they will +produce code for. 
+ + If you want to _use_ a cross compiler, that generates code for a +platform different from the build platform, you should specify the +"host" platform (i.e., that on which the generated programs will +eventually be run) with `--host=TYPE'. + +Sharing Defaults +================ + +If you want to set default values for `configure' scripts to share, you +can create a site shell script called `config.site' that gives default +values for variables like `CC', `cache_file', and `prefix'. +`configure' looks for `PREFIX/share/config.site' if it exists, then +`PREFIX/etc/config.site' if it exists. Or, you can set the +`CONFIG_SITE' environment variable to the location of the site script. +A warning: not all `configure' scripts look for a site script. + +Defining Variables +================== + +Variables not defined in a site shell script can be set in the +environment passed to `configure'. However, some packages may run +configure again during the build, and the customized values of these +variables may be lost. In order to avoid this problem, you should set +them in the `configure' command line, using `VAR=value'. For example: + + ./configure CC=/usr/local2/bin/gcc + +will cause the specified gcc to be used as the C compiler (unless it is +overridden in the site shell script). + +`configure' Invocation +====================== + +`configure' recognizes the following options to control how it operates. + +`--help' +`-h' + Print a summary of the options to `configure', and exit. + +`--version' +`-V' + Print the version of Autoconf used to generate the `configure' + script, and exit. + +`--cache-file=FILE' + Enable the cache: use and save the results of the tests in FILE, + traditionally `config.cache'. FILE defaults to `/dev/null' to + disable caching. + +`--config-cache' +`-C' + Alias for `--cache-file=config.cache'. + +`--quiet' +`--silent' +`-q' + Do not print messages saying which checks are being made. 
To + suppress all normal output, redirect it to `/dev/null' (any error + messages will still be shown). + +`--srcdir=DIR' + Look for the package's source code in directory DIR. Usually + `configure' can determine that directory automatically. + +`configure' also accepts some other, not widely useful, options. Run +`configure --help' for more details. + Index: COPYING =================================================================== --- COPYING (revision 0) +++ COPYING (revision 0) @@ -0,0 +1,32 @@ + Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available from the file + COPYING in the main directory of this source tree, or the + OpenIB.org BSD license below: + + Redistribution and use in source and binary forms, with or + without modification, are permitted provided that the following + conditions are met: + + - Redistributions of source code must retain the above + copyright notice, this list of conditions and the following + disclaimer. + + - Redistributions in binary form must reproduce the above + copyright notice, this list of conditions and the following + disclaimer in the documentation and/or other materials + provided with the distribution. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. 
+ Index: Makefile.am =================================================================== --- Makefile.am (revision 0) +++ Makefile.am (revision 0) @@ -0,0 +1,16 @@ + +# note that order matters: make the lib first then use it +SUBDIRS = complib libvendor opensm osmtest include + +# this will control the update of the files in order +MAINTAINERCLEANFILES = Makefile.in aclocal.m4 configure config-h.in + +ACLOCAL = aclocal -I $(ac_aux_dir) + +# we should provide a hint for other apps about the build mode of this project +install-exec-hook: +if DEBUG + echo "define osm_build_type \"debug\"" > $(includedir)/infiniband/opensm/osm_build_id.h +else + echo "define osm_build_type \"free\"" > $(includedir)/infiniband/opensm/osm_build_id.h +endif Index: autogen.sh =================================================================== --- autogen.sh (revision 0) +++ autogen.sh (revision 0) @@ -0,0 +1,74 @@ +#!/bin/bash + +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + +# make sure autoconf is up-to-date +ac_ver=`autoconf --version | head -1 | awk '{print $NF}'` +ac_maj=`echo $ac_ver|sed 's/\..*//'` +ac_min=`echo $ac_ver|sed 's/.*\.//'` +if [[ $ac_maj < 2 ]]; then + echo Min autoconf version is 2.59 + exit +fi +if [[ $ac_maj = 2 && $ac_min < 59 ]]; then + echo Min autoconf version is 2.59 + exit +fi + +# make sure automake is up-to-date +am_ver=`automake --version | head -1 | awk '{print $NF}'` +am_maj=`echo $am_ver|sed 's/\..*//'` +am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` +am_sub=`echo $am_ver|sed 's/.*\.//'` +if [[ $am_maj < 1 ]]; then + echo Min automake version is 1.9.3 + exit +fi +if [[ $am_maj = 1 && $am_min < 9 ]]; then + echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.3" + exit +fi +if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 3 ]]; then + echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.3" + exit +fi + +# make sure libtool is up-to-date +lt_ver=`libtool --version | head 
-1 | awk '{print $4}'` +lt_maj=`echo $lt_ver|sed 's/\..*//'` +lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` +lt_sub=`echo $lt_ver|sed 's/.*\.//'` +if [[ $lt_maj < 1 ]]; then + echo Min libtool version is 1.4.2 + exit +fi +if [[ $lt_maj = 1 && $lt_min < 4 ]]; then + echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" + exit +fi +if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then + echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" + exit +fi + +# cleanup +find . \( -name Makefile.in -o -name aclocal.m4 -o -name autom4te.cache -o -name configure -o -name aclocal.m4 \) -exec \rm -rf {} \; -prune + +# handle our own autoconf: +aclocal -I config 2>&1 | grep -v "arning: underquoted definition of" +automake --add-missing --gnu +autoconf + +# visit all sub directories with autogen.sh +anyErr=0 +for a in `ls */autogen.sh`; do + echo Visiting $a + $a 2>& 1 | sed 's/^/| /' + if test $? != 0; then + echo $a failed + anyErr=1 + fi +done + +exit $anyErr Index: NEWS =================================================================== --- NEWS (revision 0) +++ NEWS (revision 0) @@ -0,0 +1,2 @@ + +This file will hold news about the OpenSM project. 
Index: Makefile =================================================================== --- Makefile (revision 3036) +++ Makefile (working copy) @@ -1,44 +0,0 @@ -LIBS:= complib libvendor -BIN:= opensm -UTIL:= include - -SUBDIRS=$(BIN) $(UTIL) - -all: BUILD_TARG=all -all: libs_install subdirs - @echo Make all done - -install: BUILD_TARG=install -install: subdirs - @echo Install done - -clean: SUBDIRS= $(LIBS) $(BIN) -clean: BUILD_TARG=clean -clean: subdirs - @echo Clean done - -rmdep: - find $(SUBDIRS) -name ".depend" | xargs rm -f - -depend: SUBDIRS= $(LIBS) $(BIN) $(UTIL) -depend: BUILD_TARG=depend -depend: rmdep subdirs - @echo Depend done - -.PHONY : subdirs -subdirs: - @for i in $(SUBDIRS); do\ - if [ -e $i/Makefile ]; then\ - if !(cd $i; make $(BUILD_TARG)); then exit 1; fi\ - fi\ - done\ - -.PHONY : libs_install -libs_install: - @for i in $(LIBS); do\ - if [ -e $i/Makefile ]; then\ - if !(cd $i; make install); then exit 1; fi\ - fi\ - done\ - -export BUILD_TARG From danb at voltaire.com Sun Aug 14 08:04:24 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Sun, 14 Aug 2005 18:04:24 +0300 Subject: [openib-general] QP opened in user space in kernel? Message-ID: Is it possible to open a QP in user space, and pass it to the kernel for later use from kernel? Any examples? Dan From guyg at voltaire.com Sun Aug 14 08:06:39 2005 From: guyg at voltaire.com (Guy German) Date: Sun, 14 Aug 2005 18:06:39 +0300 Subject: [openib-general]: Question about FMR Message-ID: Hi, I'm using fmr_pool.c implementation, in kDAPL, and I've encountered a problem. When I run this simple flow (implemented also in a stand-alone-test-module) : fmr_pool = ib_create_fmr_pool(pd, &params); mem = ib_fmr_pool_map_phys (fmr_pool, page_list, page_count, &io_addr); status = ib_fmr_pool_unmap(mem); (void)ib_destroy_fmr_pool(fmr_pool); I get: pd->usecnt=1, hence ib_dealloc_pd(pd) fails with -EBUSY rc. Why is the pd use count 1, and not 0 ? Am I doing something wrong ?
I am attaching a small test module that demonstrates this problem. Thanks, Guy -------------- next part -------------- A non-text attachment was scrubbed... Name: Makefile Type: application/octet-stream Size: 694 bytes Desc: Makefile URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: fmr_test.c Type: application/octet-stream Size: 2309 bytes Desc: fmr_test.c URL: From halr at voltaire.com Sun Aug 14 08:51:05 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Aug 2005 11:51:05 -0400 Subject: [openib-general] Re: [PATCH] osm: add a main auto tools project for osm In-Reply-To: <42FF5C01.8020800@mellanox.co.il> References: <42FF5C01.8020800@mellanox.co.il> Message-ID: <1124034664.4403.9410.camel@hal.voltaire.com> Hi Eitan, On Sun, 2005-08-14 at 10:58, Eitan Zahavi wrote: > Fixing the patch I have sent earlier today. Now I am sure the text > included is not wrapped. Something is still wrong with this patch. I have the same issue with this patch. Did you apply this patch from your email received from the list to a clean directory to be sure ? -- Hal From eitan at mellanox.co.il Sun Aug 14 09:03:43 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 14 Aug 2005 19:03:43 +0300 Subject: [openib-general] RE: [PATCH] osm: add a main auto tools project for osm Message-ID: <506C3D7B14CDD411A52C00025558DED607C30655@mtlex01.yok.mtl.com> > Something is still wrong with this patch. I have the same issue with > this patch. Did you apply this patch from your email received from the > list to a clean directory to be sure ? [EZ] I will try that right away > > -- Hal
From eitan at mellanox.co.il Sun Aug 14 09:15:40 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 14 Aug 2005 19:15:40 +0300 Subject: [openib-general] RE: [PATCH] osm: add a main auto tools project for osm Message-ID: <506C3D7B14CDD411A52C00025558DED607C30656@mtlex01.yok.mtl.com> Sorry. Apparently, even though I use Thunderbird, on windows, the inserted file lacks CRs. Also I needed to use patch -p 0 to make it work. So I will resend after I resolve the CR issues. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Sunday, August 14, 2005 6:51 PM > To: Eitan Zahavi > Cc: OPENIB GENERAL > Subject: Re: [PATCH] osm: add a main auto tools project for osm > > Hi Eitan, > > On Sun, 2005-08-14 at 10:58, Eitan Zahavi wrote: > > Fixing the patch I have sent earlier today. Now I am sure the text > > included is not wrapped. > > Something is still wrong with this patch. I have the same issue with > this patch. Did you apply this patch from your email received from the > list to a clean directory to be sure ? > > -- Hal From mst at mellanox.co.il Sun Aug 14 09:27:19 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 14 Aug 2005 19:27:19 +0300 Subject: [openib-general] [ANNOUNCE] mstflint : link libstdc++ statically by default Message-ID: <20050814162719.GT23848@mellanox.co.il> Hi! I've changed mstflint to link libstdc++ statically by default. This is done to make the resulting binary easier to relocate across machines/distributions. I still link in shared version of libc. You can also build fully static version by make static and fully shared version by make shared Thanks, MST -- MST From mst at mellanox.co.il Sun Aug 14 09:40:56 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Sun, 14 Aug 2005 19:40:56 +0300 Subject: [openib-general] [ANNOUNCE] 2.6.9 backport patches Message-ID: <20050814164056.GU23848@mellanox.co.il> Hi! Backport patches to trunk that enable support for RHEL4.0 (2.6.9) and SuSE9.3 (2.6.11), can now be found under https://openib.org/svn/gen2/branches/backport/2.6.9 and https://openib.org/svn/gen2/branches/backport/2.6.11 These patches do not touch the kernel source outside the infiniband directory, and you dont need to reboot after you apply them. Please note that these backports are different from Woody's patches: These are patches for trunk to support kernels 2.6.9/2.6.11 as opposed to patches for kernel 2.6.9 to support code from 2.6.11 (or trunk). The advantage of this approach is that you can apply it to latest code from trunk. -- MST From guyg at voltaire.com Sun Aug 14 10:12:45 2005 From: guyg at voltaire.com (Guy German) Date: Sun, 14 Aug 2005 20:12:45 +0300 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation Message-ID: James Lentini wrote: >> You changed the order in which the CQ upcall is enabled and the kDAPL >> upcall is made. It used to be: >> >> enable CQ upcall >> call kDAPL upcall >> >> you are proposing >> >> call kDAPL upcall >> enable CQ upcall >> >> I think your proposed order contains a race condition. Specifically >> if a work completion occurs after dapl_evd_upcall_trigger() >> returns but before the CQ upcall is re-enabled with >> ib_req_notify_cq(), no upcall will occur for the completion. >> >> Do you agree? Or, has turned my attention to the fact that also in the first case You have the alleged race: if a work completion occurs just before you enable the CQ upcall... Are you suggesting that enabling the CQ upcall will not trigger the CQ upcall, if completions happened before enabling? I don't think this is the case, but I'm not 100% sure... 
As I mentioned before, and regardless to this issue, I still believe that the right order should be: >> call kDAPL upcall >> (conditionally) enable CQ upcall We can't have interrupts if the consumer disabled the upcall policy... Guy. From mst at mellanox.co.il Sun Aug 14 10:44:19 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 14 Aug 2005 20:44:19 +0300 Subject: [openib-general] Re: QP opened in user space in kernel? In-Reply-To: References: Message-ID: <20050814174419.GA25798@mellanox.co.il> Quoting r. Dan Bar Dov : > Subject: QP opened in user space in kernel? > > Is it possible to open a QP in user space, and pass it to the kernel for later use from kernel? > > Any examples? > > Dan > I dont think you can do this currently. -- MST From mst at mellanox.co.il Mon Aug 15 00:22:11 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 15 Aug 2005 10:22:11 +0300 Subject: [openib-general] [PATCH] ipoib: device removal races In-Reply-To: <52wtmwxdy4.fsf@cisco.com> References: <20050808151141.GJ15300@mellanox.co.il> <52wtmwxdy4.fsf@cisco.com> Message-ID: <20050815072211.GW23848@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] [PATCH] ipoib: device removal races > > Michael> As a side note, schedule_work in ipoib_event also looks > Michael> suspicios. Cant we have it oustanding when the device is > Michael> going down? Roland, what do you say we switch that to > Michael> ipoib_workqueue as well, and add a flush after > Michael> ib_unregister_event_handler? > > Thanks, I'll take a look at all the workqueue stuff in IPoIB. > > - R. > Here's fix for this theoretical race (I didnt see it triggered in real life). This needs to be applied in addition to my previous patch, which fixes a crash I actually see in the lab. Roland, I think at least the previous one-line patch should go in to 2.6.13. Do you have it? --- It seems we can have a work oustanding when the device is going down. 
Solve this by creating a work queue for events. We cant reuse the ipoib_workqueue since that sometimes needs to be flushed when we get an event. Its also probably a good idea to flush in a single threaded workqueue, to prevent several flushes from running in parallel on multiple CPUs. Signed-off-by: Michael S. Tsirkin Index: linux-kernel/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- linux-kernel/infiniband/ulp/ipoib/ipoib_verbs.c (revision 3083) +++ linux-kernel/infiniband/ulp/ipoib/ipoib_verbs.c (working copy) @@ -256,6 +256,6 @@ void ipoib_event(struct ib_event_handler record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE) { ipoib_dbg(priv, "Port active event\n"); - schedule_work(&priv->flush_task); + queue_work(ipoib_event_workqueue, &priv->flush_task); } } Index: linux-kernel/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-kernel/infiniband/ulp/ipoib/ipoib_main.c (revision 3083) +++ linux-kernel/infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -65,6 +65,7 @@ static const u8 ipv4_bcast_addr[] = { }; struct workqueue_struct *ipoib_workqueue; +struct workqueue_struct *ipoib_event_workqueue; static void ipoib_add_one(struct ib_device *device); static void ipoib_remove_one(struct ib_device *device); @@ -993,6 +994,7 @@ debug_failed: register_failed: ib_unregister_event_handler(&priv->event_handler); + flush_workqueue(ipoib_event_workqueue); event_failed: ipoib_dev_cleanup(priv->dev); @@ -1045,6 +1047,7 @@ static void ipoib_remove_one(struct ib_d list_for_each_entry_safe(priv, tmp, dev_list, list) { ib_unregister_event_handler(&priv->event_handler); + flush_workqueue(ipoib_event_workqueue); unregister_netdev(priv->dev); ipoib_dev_cleanup(priv->dev); @@ -1061,8 +1064,8 @@ static int __init ipoib_init_module(void return ret; /* - * We create our own workqueue mainly because we want to be - * able to flush it when devices 
are being removed. We can't + * We create our own workqueues mainly because we want to be + * able to flush them when devices are being removed. We can't * use schedule_work()/flush_scheduled_work() because both * unregister_netdev() and linkwatch_event take the rtnl lock, * so flush_scheduled_work() can deadlock during device @@ -1074,12 +1077,21 @@ static int __init ipoib_init_module(void goto err_fs; } + ipoib_event_workqueue = create_singlethread_workqueue("ipoib_flush"); + if (!ipoib_workqueue) { + ret = -ENOMEM; + goto err_wq; + } + ret = ib_register_client(&ipoib_client); if (ret) - goto err_wq; + goto err_fwq; return 0; +err_fwq: + destroy_workqueue(ipoib_event_workqueue); + err_wq: destroy_workqueue(ipoib_workqueue); @@ -1092,8 +1104,9 @@ err_fs: static void __exit ipoib_cleanup_module(void) { ib_unregister_client(&ipoib_client); - ipoib_unregister_debugfs(); + destroy_workqueue(ipoib_event_workqueue); destroy_workqueue(ipoib_workqueue); + ipoib_unregister_debugfs(); } module_init(ipoib_init_module); Index: linux-kernel/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-kernel/infiniband/ulp/ipoib/ipoib.h (revision 3083) +++ linux-kernel/infiniband/ulp/ipoib/ipoib.h (working copy) @@ -217,6 +217,7 @@ static inline struct ipoib_neigh **to_ip } extern struct workqueue_struct *ipoib_workqueue; +extern struct workqueue_struct *ipoib_event_workqueue; /* functions */ -- MST From mst at mellanox.co.il Mon Aug 15 02:05:43 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 15 Aug 2005 12:05:43 +0300 Subject: [openib-general] [PATCH fixed] ipoib: device removal races In-Reply-To: <20050815072211.GW23848@mellanox.co.il> References: <20050808151141.GJ15300@mellanox.co.il> <52wtmwxdy4.fsf@cisco.com> <20050815072211.GW23848@mellanox.co.il> Message-ID: <20050815090543.GA1752@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: [PATCH] ipoib: device removal races > > Quoting r. 
Roland Dreier : > > Subject: Re: [openib-general] [PATCH] ipoib: device removal races > > > > Michael> As a side note, schedule_work in ipoib_event also looks > > Michael> suspicios. Cant we have it oustanding when the device is > > Michael> going down? Roland, what do you say we switch that to > > Michael> ipoib_workqueue as well, and add a flush after > > Michael> ib_unregister_event_handler? > > > > Thanks, I'll take a look at all the workqueue stuff in IPoIB. > > > > - R. > > > > Here's fix for this theoretical race (I didnt see it triggered in real life). > This needs to be applied in addition to my previous patch, which > fixes a crash I actually see in the lab. > > Roland, I think at least the previous one-line patch should go in to > 2.6.13. Do you have it? > And here's a patch that actually works. Sorry. Roland, pls note this patch is *in addition* to the ipoib_set_mcast_list patch. --- It seems we can have a work oustanding when the device is going down. Solve this by creating a work queue for events. We cant reuse the ipoib_workqueue since that sometimes needs to be flushed when we get an event. Its also probably a good idea to flush in a single threaded workqueue, to prevent several flushes from running in parallel on multiple CPUs. Signed-off-by: Michael S. 
Tsirkin Index: linux-kernel/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/ipoib/ipoib_verbs.c (revision 3084) +++ linux-kernel/drivers/infiniband/ulp/ipoib/ipoib_verbs.c (working copy) @@ -256,6 +256,6 @@ void ipoib_event(struct ib_event_handler record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE) { ipoib_dbg(priv, "Port active event\n"); - schedule_work(&priv->flush_task); + queue_work(ipoib_event_workqueue, &priv->flush_task); } } Index: linux-kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c (revision 3084) +++ linux-kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -65,6 +65,7 @@ static const u8 ipv4_bcast_addr[] = { }; struct workqueue_struct *ipoib_workqueue; +struct workqueue_struct *ipoib_event_workqueue; static void ipoib_add_one(struct ib_device *device); static void ipoib_remove_one(struct ib_device *device); @@ -993,6 +994,7 @@ debug_failed: register_failed: ib_unregister_event_handler(&priv->event_handler); + flush_workqueue(ipoib_event_workqueue); event_failed: ipoib_dev_cleanup(priv->dev); @@ -1045,6 +1047,7 @@ static void ipoib_remove_one(struct ib_d list_for_each_entry_safe(priv, tmp, dev_list, list) { ib_unregister_event_handler(&priv->event_handler); + flush_workqueue(ipoib_event_workqueue); unregister_netdev(priv->dev); ipoib_dev_cleanup(priv->dev); @@ -1061,8 +1064,8 @@ static int __init ipoib_init_module(void return ret; /* - * We create our own workqueue mainly because we want to be - * able to flush it when devices are being removed. We can't + * We create our own workqueues mainly because we want to be + * able to flush them when devices are being removed. 
We can't * use schedule_work()/flush_scheduled_work() because both * unregister_netdev() and linkwatch_event take the rtnl lock, * so flush_scheduled_work() can deadlock during device @@ -1074,12 +1077,21 @@ static int __init ipoib_init_module(void goto err_fs; } + ipoib_event_workqueue = create_singlethread_workqueue("ipoibevent"); + if (!ipoib_event_workqueue) { + ret = -ENOMEM; + goto err_wq; + } + ret = ib_register_client(&ipoib_client); if (ret) - goto err_wq; + goto err_fwq; return 0; +err_fwq: + destroy_workqueue(ipoib_event_workqueue); + err_wq: destroy_workqueue(ipoib_workqueue); @@ -1092,8 +1104,9 @@ err_fs: static void __exit ipoib_cleanup_module(void) { ib_unregister_client(&ipoib_client); - ipoib_unregister_debugfs(); + destroy_workqueue(ipoib_event_workqueue); destroy_workqueue(ipoib_workqueue); + ipoib_unregister_debugfs(); } module_init(ipoib_init_module); Index: linux-kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-kernel/drivers/infiniband/ulp/ipoib/ipoib.h (revision 3084) +++ linux-kernel/drivers/infiniband/ulp/ipoib/ipoib.h (working copy) @@ -217,6 +217,7 @@ static inline struct ipoib_neigh **to_ip } extern struct workqueue_struct *ipoib_workqueue; +extern struct workqueue_struct *ipoib_event_workqueue; /* functions */ -- MST From halr at voltaire.com Mon Aug 15 04:03:05 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2005 07:03:05 -0400 Subject: [openib-general] [Fwd: [openib-commits] r3075 - trunk/contrib/mellanox/tools] Message-ID: <1124103783.4403.12743.camel@hal.voltaire.com> Hi Dotan, Have you decided to keep this as a separate tool ? Also, will you be changing the name or keeping it the same ? 
-- Hal -----Forwarded Message----- From: dotanb at openib.org To: openib-commits at openib.org Subject: [openib-commits] r3075 - trunk/contrib/mellanox/tools Date: 14 Aug 2005 02:08:32 -0700 Author: dotanb Date: 2005-08-14 02:08:31 -0700 (Sun, 14 Aug 2005) New Revision: 3075 Modified: trunk/contrib/mellanox/tools/vstat.c Log: added some changed according to hal's suggestion From dotanb at mellanox.co.il Mon Aug 15 04:25:58 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 15 Aug 2005 14:25:58 +0300 Subject: [openib-general] RE: [Fwd: [openib-commits] r3075 - trunk/contrib/mellanox/tools] Message-ID: <506C3D7B14CDD411A52C00025558DED60882C812@mtlex01.yok.mtl.com> > Hi Dotan, > > Have you decided to keep this as a separate tool ? Also, will you be > changing the name or keeping it the same ? > > -- Hal > Hi. I think that a tool in user space that prints all the hca attributes can be very useful, but as you mentioned before there are already several tools that print a subset of the hca attributes. I did some changes in my code, so the output will look like roland's ibv_devinfo and when i'll finish all the changes, my vstat code will replace ibv_devinfo (only the code will be changed, the name of the utility will still be ibv_devinfo). Dotan From guyg at voltaire.com Mon Aug 15 05:22:06 2005 From: guyg at voltaire.com (Guy German) Date: Mon, 15 Aug 2005 15:22:06 +0300 (IDT) Subject: [openib-general][PATCH][mthca]: arbel/tavor calls Message-ID: I think there is a call to Arbel's function instead the general func, in mthca_free_region.
Signed-off-by: Guy German Index: mthca_mr.c =================================================================== --- mthca_mr.c (revision 3085) +++ mthca_mr.c (working copy) @@ -459,7 +459,7 @@ int mthca_mr_alloc_phys(struct mthca_dev static void mthca_free_region(struct mthca_dev *dev, u32 lkey) { mthca_table_put(dev, dev->mr_table.mpt_table, - arbel_key_to_hw_index(lkey)); + key_to_hw_index(dev, lkey)); mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, lkey)); } From halr at voltaire.com Mon Aug 15 06:15:18 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2005 09:15:18 -0400 Subject: [openib-general] Re: [PATCH] dont drop device reference before use (was Re: sdp: cant unload ib_ipoib module) In-Reply-To: <20050814142441.GO23848@mellanox.co.il> References: <1123007814.2946.12.camel@duffman> <20050803070923.GG15300@mellanox.co.il> <20050804104250.A30741@topspin.com> <20050807154632.GG15300@mellanox.co.il> <1123539115.4402.39.camel@hal.voltaire.com> <20050809124631.GG32419@mellanox.co.il> <1123593909.4403.16.camel@hal.voltaire.com> <20050814142441.GO23848@mellanox.co.il> Message-ID: <1124111717.4403.12755.camel@hal.voltaire.com> On Sun, 2005-08-14 at 10:24, Michael S. Tsirkin wrote: > The following should finally fix the dev_put issue. > I think an extra dev_hold on a path lookup is not a big deal, and > helps make the code much simpler. Works fine, for me. > Hal, can you give it a whirl? This works for me too. -- Hal From eitan at mellanox.co.il Mon Aug 15 07:23:57 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 15 Aug 2005 17:23:57 +0300 Subject: [openib-general] [PATCH] osm: add a main auto tools project for osm Message-ID: <86mznjcm82.fsf@mtl066.yok.mtl.com> Hi Hal I'm now using xemacs as my mail client. This is pure unix and tested to work. This patch includes: 1. Added a top level autotools project for OpenSM. 
So now you need autogen.sh && configure && make && make install just once for osm (previously needed 4: complib, libvendor, opensm, osmtest). 2. Cleanup the direct override of libdir, bindir. Support --prefix 3. Move osm lib, bin into default prefix (/usr/local) 4. Support debug build for OpenSM using --enable-debug This is important to allow for asserts during runtime and various other additional debug features. Since the generated compilb can not be used with the release version we also use a special header file that stores the type of build for applications that wish to link with it. 5. Cleanup stale use of AC_CHECK_LIB with no parameters 6. Resolved another bug: iba/ib_types.h not installed correctly I tested the patch on : 2.6.12.3-smp SuSE Linux 9.3 (i586) Signed-off-by: Eitan Zahavi eitan at mellanox.co.il Index: include/configure.in =================================================================== --- include/configure.in (revision 3083) +++ include/configure.in (working copy) @@ -13,7 +13,7 @@ dnl Checks for programs AC_PROG_CC dnl Checks for libraries -AC_CHECK_LIB +dnl AC_CHECK_LIB - need to provide symbol and library... what do we depend on? dnl Checks for header files. AC_HEADER_STDC Index: include/Makefile.am =================================================================== --- include/Makefile.am (revision 3083) +++ include/Makefile.am (working copy) @@ -1,17 +1,9 @@ - -libincdir = ${exec_prefix}/ib/lib +# HACK: as we do not use the standard "prefix/include" subdir +includedir = ${prefix}/include/infiniband SUBDIRS = . -INCLUDES = - -lib_LTLIBRARIES = - -lib_version_script = - -libincincludedir = $(includedir)/infiniband/iba - -libincinclude_HEADERS = $(srcdir)/iba/ib_types.h +nobase_include_HEADERS = iba/ib_types.h EXTRA_DIST = $(srcdir)/iba/ib_types.h Index: include/autogen.sh =================================================================== --- include/autogen.sh (revision 3083) +++ include/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! 
/bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: libvendor/configure.in =================================================================== --- libvendor/configure.in (revision 3083) +++ libvendor/configure.in (working copy) @@ -60,5 +60,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl support debug mode +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile libosmvendor.spec]) AC_OUTPUT Index: libvendor/Makefile.am =================================================================== --- libvendor/Makefile.am (revision 3083) +++ libvendor/Makefile.am (working copy) @@ -1,15 +1,19 @@ -libdir = ${exec_prefix}/ib/lib - SUBDIRS = . +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif + INCLUDES = -I$(srcdir)/../include \ -I$(srcdir)/../../libibcommon/include/infiniband \ -I$(srcdir)/../../libibumad/include/infiniband lib_LTLIBRARIES = libosmvendor.la -libosmvendor_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT +libosmvendor_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) if HAVE_LD_VERSION_SCRIPT libosmvendor_version_script = -Wl,--version-script=$(srcdir)/libosmvendor.map Index: libvendor/autogen.sh =================================================================== --- libvendor/autogen.sh (revision 3083) +++ libvendor/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! 
/bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: AUTHORS =================================================================== --- AUTHORS (revision 0) +++ AUTHORS (revision 0) @@ -0,0 +1,7 @@ + +By the chronological order of involvement: +Steve King, Intel +Anil Keshavamurthy, Intel +Eitan Zahavi, Mellanox Technologies, eitan at mellanox.co.il +Yael Kalka, Mellanox Technologies, yael at mellanox.co.il +Hal Rosenstock, Voltaire, halr at voltaire.com Index: complib/autogen.sh =================================================================== --- complib/autogen.sh (revision 3083) +++ complib/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! /bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: complib/configure.in =================================================================== --- complib/configure.in (revision 3083) +++ complib/configure.in (working copy) @@ -31,6 +31,7 @@ AC_C_INLINE AC_TYPE_SIZE_T AC_HEADER_TIME +dnl We use --version-script with ld if possible AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then ac_cv_version_script=yes @@ -40,5 +41,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl Support debug mode build - if enable-debug provided the DEBUG variable is set +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile libosmcomp.spec]) AC_OUTPUT Index: complib/Makefile.am =================================================================== --- 
complib/Makefile.am (revision 3083) +++ complib/Makefile.am (working copy) @@ -1,13 +1,15 @@ -libdir = ${exec_prefix}/ib/lib - -SUBDIRS = . - INCLUDES = -I$(srcdir)/../include lib_LTLIBRARIES = libosmcomp.la -libosmcomp_la_CFLAGS = -Wall +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif + +libosmcomp_la_CFLAGS = -Wall $(DBGFLAGS) if HAVE_LD_VERSION_SCRIPT libosmcomp_version_script = -Wl,--version-script=$(srcdir)/libosmcomp.map Index: configure.in =================================================================== --- configure.in (revision 0) +++ configure.in (revision 0) @@ -0,0 +1,39 @@ +dnl Process this file with autoconf to produce a configure script. + +AC_INIT(autogen.sh) + +dnl use local config dir for extras +AC_CONFIG_AUX_DIR(config) + +dnl Defines the Language +AC_LANG_C + +dnl Auto make +AM_INIT_AUTOMAKE(osm,1.0) + +dnl Provides control over re-making of all auto files +dnl We also use it to define swig dependencies so end +dnl users do not see them. +AM_MAINTAINER_MODE + +dnl Required for cases make defines a MAKE=make ??? 
Why +AC_PROG_MAKE_SET + +dnl Define an input config option to control debug compile +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debugging], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + +dnl Configure the following subdirs +AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest include) + +dnl Create the following Makefiles +AC_OUTPUT(Makefile) + + + Index: ChangeLog =================================================================== --- ChangeLog (revision 0) +++ ChangeLog (revision 0) @@ -0,0 +1,6 @@ +2005-08-14 Eitan Zahavi + + * Provided a top level auto tools project so there is no need to + cd into each of the sub directories and do: + ./autogen.sh && configure && make && make install + Index: README =================================================================== --- README (revision 0) +++ README (revision 0) @@ -0,0 +1,20 @@ +OpenSM README: +-------------- + +OpenSM provides an implementation for an InfiniBand Subnet Manager and +Administrator. Such a software entity is required to run for in order +to initialize the InfiniBand hardware (at least one per each +InfiniBand subnet). + +The full list of OpenSM features is described in the user manual +provided in the doc sub directory. 
+ +The installation of OpenSM includes: + +bin/ + opensm - the SM/SA executable + osmtest - a test program for the SM/SA +lib/ + libosmcomp.{a,so} - component library with generic services and containers + libopensm.{a,so} - opensm services for logs and mad buffer pool + libosmvendor.{a,so} - interface to the user mad service of the driver Index: osmtest/configure.in =================================================================== --- osmtest/configure.in (revision 3083) +++ osmtest/configure.in (working copy) @@ -52,5 +52,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl support debug mode +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile]) AC_OUTPUT Index: osmtest/Makefile.am =================================================================== --- osmtest/Makefile.am (revision 3083) +++ osmtest/Makefile.am (working copy) @@ -1,6 +1,9 @@ -bindir = ${exec_prefix}/ib/bin -libdir = ${exec_prefix}/ib/lib +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif INCLUDES = -I$(srcdir)/include \ -I$(srcdir)/../include \ @@ -11,12 +14,9 @@ bin_PROGRAMS = osmtest osmtest_SOURCES = main.c osmtest.c osmt_service.c osmt_slvl_vl_arb.c \ osmt_multicast.c osmt_inform.c -osmtest_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT -osmtest_LDADD = $(libdir)/libibumad.la \ - $(libdir)/libibcommon.la \ - $(libdir)/libopensm.la \ - $(libdir)/libosmcomp.la \ - $(libdir)/libosmvendor.la +osmtest_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) +osmtest_LDADD = -L../complib -L../libvendor -L../opensm -L$(libdir) \ + -libumad -libcommon -lopensm -losmcomp -losmvendor osmtest_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread 
-L../opensm Index: osmtest/autogen.sh =================================================================== --- osmtest/autogen.sh (revision 3083) +++ osmtest/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! /bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: opensm/configure.in =================================================================== --- opensm/configure.in (revision 3083) +++ opensm/configure.in (working copy) @@ -52,5 +52,15 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl support debug mode +AC_ARG_ENABLE(debug, +[ --enable-debug Turn on debug mode], +[case "${enableval}" in + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],[debug=false]) +AM_CONDITIONAL(DEBUG, test x$debug = xtrue) + AC_CONFIG_FILES([Makefile]) AC_OUTPUT Index: opensm/autogen.sh =================================================================== --- opensm/autogen.sh (revision 3083) +++ opensm/autogen.sh (working copy) @@ -1,5 +1,8 @@ #! 
/bin/sh +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + set -x aclocal -I config libtoolize --force --copy Index: opensm/Makefile.am =================================================================== --- opensm/Makefile.am (revision 3083) +++ opensm/Makefile.am (working copy) @@ -1,14 +1,17 @@ -bindir = ${exec_prefix}/ib/bin -libdir = ${exec_prefix}/ib/lib - INCLUDES = -I$(srcdir)/../include \ -I$(srcdir)/../../libibcommon/include/infiniband \ -I$(srcdir)/../../libibumad/include/infiniband lib_LTLIBRARIES = libopensm.la -libopensm_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT +if DEBUG +DBGFLAGS = -g -O0 -D_DEBUG_ +else +DBGFLAGS = -g -O2 +endif + +libopensm_la_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) if HAVE_LD_VERSION_SCRIPT libopensm_version_script = -Wl,--version-script=$(srcdir)/libopensm.map @@ -60,12 +63,13 @@ opensm_SOURCES = main.c osm_drop_mgr.c o osm_ucast_mgr.c osm_ucast_updn.c \ osm_vl15intf.c osm_vl_arb_rcv.c\ osm_vl_arb_rcv_ctrl.c -opensm_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT -opensm_LDADD = $(libdir)/libibumad.la \ - $(libdir)/libibcommon.la \ - $(srcdir)/libopensm.la \ - $(libdir)/libosmcomp.la \ - $(libdir)/libosmvendor.la +opensm_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) + +# we need to be able to load libraries from local build subtree before make install +# we always give precedence to local tree libs and then use the pre-installed ones. 
+opensm_LDADD = -L../complib -L../libvendor -L$(libdir) \ + -libumad -lopensm -losmcomp -losmvendor + opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread opensmincludedir = $(includedir)/infiniband/opensm @@ -79,4 +83,3 @@ EXTRA_DIST = $(srcdir)/../include/opensm $(srcdir)/../include/opensm/osm_log.h \ $(srcdir)/../include/opensm/osm_madw.h \ $(srcdir)/../include/opensm/osm_mad_pool.h - Index: INSTALL =================================================================== --- INSTALL (revision 0) +++ INSTALL (revision 0) @@ -0,0 +1,231 @@ +Installation Instructions +************************* + +Copyright (C) 1994, 1995, 1996, 1999, 2000, 2001, 2002, 2004 Free +Software Foundation, Inc. + +This file is free documentation; the Free Software Foundation gives +unlimited permission to copy, distribute and modify it. + +Basic Installation +================== + +These are generic installation instructions. + + The `configure' shell script attempts to guess correct values for +various system-dependent variables used during compilation. It uses +those values to create a `Makefile' in each directory of the package. +It may also create one or more `.h' files containing system-dependent +definitions. Finally, it creates a shell script `config.status' that +you can run in the future to recreate the current configuration, and a +file `config.log' containing compiler output (useful mainly for +debugging `configure'). + + It can also use an optional file (typically called `config.cache' +and enabled with `--cache-file=config.cache' or simply `-C') that saves +the results of its tests to speed up reconfiguring. (Caching is +disabled by default to prevent problems with accidental use of stale +cache files.) + + If you need to do unusual things to compile the package, please try +to figure out how `configure' could check whether to do them, and mail +diffs or instructions to the address given in the `README' so they can +be considered for the next release. 
If you are using the cache, and at +some point `config.cache' contains results you don't want to keep, you +may remove or edit it. + + The file `configure.ac' (or `configure.in') is used to create +`configure' by a program called `autoconf'. You only need +`configure.ac' if you want to change it or regenerate `configure' using +a newer version of `autoconf'. + +The simplest way to compile this package is: + + 1. `cd' to the directory containing the package's source code and type + `./configure' to configure the package for your system. If you're + using `csh' on an old version of System V, you might need to type + `sh ./configure' instead to prevent `csh' from trying to execute + `configure' itself. + + Running `configure' takes awhile. While running, it prints some + messages telling which features it is checking for. + + 2. Type `make' to compile the package. + + 3. Optionally, type `make check' to run any self-tests that come with + the package. + + 4. Type `make install' to install the programs and any data files and + documentation. + + 5. You can remove the program binaries and object files from the + source code directory by typing `make clean'. To also remove the + files that `configure' created (so you can compile the package for + a different kind of computer), type `make distclean'. There is + also a `make maintainer-clean' target, but that is intended mainly + for the package's developers. If you use it, you may have to get + all sorts of other programs in order to regenerate files that came + with the distribution. + +Compilers and Options +===================== + +Some systems require unusual options for compilation or linking that the +`configure' script does not know about. Run `./configure --help' for +details on some of the pertinent environment variables. + + You can give `configure' initial values for configuration parameters +by setting variables in the command line or in the environment. 
Here +is an example: + + ./configure CC=c89 CFLAGS=-O2 LIBS=-lposix + + *Note Defining Variables::, for more details. + +Compiling For Multiple Architectures +==================================== + +You can compile the package for more than one kind of computer at the +same time, by placing the object files for each architecture in their +own directory. To do this, you must use a version of `make' that +supports the `VPATH' variable, such as GNU `make'. `cd' to the +directory where you want the object files and executables to go and run +the `configure' script. `configure' automatically checks for the +source code in the directory that `configure' is in and in `..'. + + If you have to use a `make' that does not support the `VPATH' +variable, you have to compile the package for one architecture at a +time in the source code directory. After you have installed the +package for one architecture, use `make distclean' before reconfiguring +for another architecture. + +Installation Names +================== + +By default, `make install' will install the package's files in +`/usr/local/bin', `/usr/local/man', etc. You can specify an +installation prefix other than `/usr/local' by giving `configure' the +option `--prefix=PREFIX'. + + You can specify separate installation prefixes for +architecture-specific files and architecture-independent files. If you +give `configure' the option `--exec-prefix=PREFIX', the package will +use PREFIX as the prefix for installing programs and libraries. +Documentation and other data files will still use the regular prefix. + + In addition, if you use an unusual directory layout you can give +options like `--bindir=DIR' to specify different values for particular +kinds of files. Run `configure --help' for a list of the directories +you can set and what kinds of files go in them. 
+ + If the package supports it, you can cause programs to be installed +with an extra prefix or suffix on their names by giving `configure' the +option `--program-prefix=PREFIX' or `--program-suffix=SUFFIX'. + +Optional Features +================= + +Some packages pay attention to `--enable-FEATURE' options to +`configure', where FEATURE indicates an optional part of the package. +They may also pay attention to `--with-PACKAGE' options, where PACKAGE +is something like `gnu-as' or `x' (for the X Window System). The +`README' should mention any `--enable-' and `--with-' options that the +package recognizes. + + For packages that use the X Window System, `configure' can usually +find the X include and library files automatically, but if it doesn't, +you can use the `configure' options `--x-includes=DIR' and +`--x-libraries=DIR' to specify their locations. + +Specifying the System Type +========================== + +There may be some features `configure' cannot figure out automatically, +but needs to determine by the type of machine the package will run on. +Usually, assuming the package is built to be run on the _same_ +architectures, `configure' can figure that out, but if it prints a +message saying it cannot guess the machine type, give it the +`--build=TYPE' option. TYPE can either be a short name for the system +type, such as `sun4', or a canonical name which has the form: + + CPU-COMPANY-SYSTEM + +where SYSTEM can have one of these forms: + + OS KERNEL-OS + + See the file `config.sub' for the possible values of each field. If +`config.sub' isn't included in this package, then this package doesn't +need to know the machine type. + + If you are _building_ compiler tools for cross-compiling, you should +use the `--target=TYPE' option to select the type of system they will +produce code for. 
+ + If you want to _use_ a cross compiler, that generates code for a +platform different from the build platform, you should specify the +"host" platform (i.e., that on which the generated programs will +eventually be run) with `--host=TYPE'. + +Sharing Defaults +================ + +If you want to set default values for `configure' scripts to share, you +can create a site shell script called `config.site' that gives default +values for variables like `CC', `cache_file', and `prefix'. +`configure' looks for `PREFIX/share/config.site' if it exists, then +`PREFIX/etc/config.site' if it exists. Or, you can set the +`CONFIG_SITE' environment variable to the location of the site script. +A warning: not all `configure' scripts look for a site script. + +Defining Variables +================== + +Variables not defined in a site shell script can be set in the +environment passed to `configure'. However, some packages may run +configure again during the build, and the customized values of these +variables may be lost. In order to avoid this problem, you should set +them in the `configure' command line, using `VAR=value'. For example: + + ./configure CC=/usr/local2/bin/gcc + +will cause the specified gcc to be used as the C compiler (unless it is +overridden in the site shell script). + +`configure' Invocation +====================== + +`configure' recognizes the following options to control how it operates. + +`--help' +`-h' + Print a summary of the options to `configure', and exit. + +`--version' +`-V' + Print the version of Autoconf used to generate the `configure' + script, and exit. + +`--cache-file=FILE' + Enable the cache: use and save the results of the tests in FILE, + traditionally `config.cache'. FILE defaults to `/dev/null' to + disable caching. + +`--config-cache' +`-C' + Alias for `--cache-file=config.cache'. + +`--quiet' +`--silent' +`-q' + Do not print messages saying which checks are being made. 
To + suppress all normal output, redirect it to `/dev/null' (any error + messages will still be shown). + +`--srcdir=DIR' + Look for the package's source code in directory DIR. Usually + `configure' can determine that directory automatically. + +`configure' also accepts some other, not widely useful, options. Run +`configure --help' for more details. + Index: COPYING =================================================================== --- COPYING (revision 0) +++ COPYING (revision 0) @@ -0,0 +1,32 @@ + Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available from the file + COPYING in the main directory of this source tree, or the + OpenIB.org BSD license below: + + Redistribution and use in source and binary forms, with or + without modification, are permitted provided that the following + conditions are met: + + - Redistributions of source code must retain the above + copyright notice, this list of conditions and the following + disclaimer. + + - Redistributions in binary form must reproduce the above + copyright notice, this list of conditions and the following + disclaimer in the documentation and/or other materials + provided with the distribution. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. 
+ Index: Makefile.am =================================================================== --- Makefile.am (revision 0) +++ Makefile.am (revision 0) @@ -0,0 +1,16 @@ + +# note that order matters: make the lib first then use it +SUBDIRS = complib libvendor opensm osmtest include + +# this will control the update of the files in order +MAINTAINERCLEANFILES = Makefile.in aclocal.m4 configure config-h.in + +ACLOCAL = aclocal -I $(ac_aux_dir) + +# we should provide a hint for other apps about the build mode of this project +install-exec-hook: +if DEBUG + echo "define osm_build_type \"debug\"" > $(includedir)/infiniband/opensm/osm_build_id.h +else + echo "define osm_build_type \"free\"" > $(includedir)/infiniband/opensm/osm_build_id.h +endif Index: autogen.sh =================================================================== --- autogen.sh (revision 0) +++ autogen.sh (revision 0) @@ -0,0 +1,74 @@ +#!/bin/bash + +# We change dir since the later utilities assume to work in the project dir +cd ${0%*/*} + +# make sure autoconf is up-to-date +ac_ver=`autoconf --version | head -1 | awk '{print $NF}'` +ac_maj=`echo $ac_ver|sed 's/\..*//'` +ac_min=`echo $ac_ver|sed 's/.*\.//'` +if [[ $ac_maj < 2 ]]; then + echo Min autoconf version is 2.59 + exit +fi +if [[ $ac_maj = 2 && $ac_min < 59 ]]; then + echo Min autoconf version is 2.59 + exit +fi + +# make sure automake is up-to-date +am_ver=`automake --version | head -1 | awk '{print $NF}'` +am_maj=`echo $am_ver|sed 's/\..*//'` +am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` +am_sub=`echo $am_ver|sed 's/.*\.//'` +if [[ $am_maj < 1 ]]; then + echo Min automake version is 1.9.3 + exit +fi +if [[ $am_maj = 1 && $am_min < 9 ]]; then + echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.3" + exit +fi +if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 3 ]]; then + echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.3" + exit +fi + +# make sure libtool is up-to-date +lt_ver=`libtool --version | head 
-1 | awk '{print $4}'` +lt_maj=`echo $lt_ver|sed 's/\..*//'` +lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` +lt_sub=`echo $lt_ver|sed 's/.*\.//'` +if [[ $lt_maj < 1 ]]; then + echo Min libtool version is 1.4.2 + exit +fi +if [[ $lt_maj = 1 && $lt_min < 4 ]]; then + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" + exit +fi +if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" + exit +fi + +# cleanup +find . \( -name Makefile.in -o -name aclocal.m4 -o -name autom4te.cache -o -name configure \) -exec \rm -rf {} \; -prune + +# handle our own autoconf: +aclocal -I config 2>&1 | grep -v "arning: underquoted definition of" +automake --add-missing --gnu +autoconf + +# visit all sub directories with autogen.sh +anyErr=0 +for a in `ls */autogen.sh`; do + echo Visiting $a + $a 2>& 1 | sed 's/^/| /' + if test $? != 0; then + echo $a failed + anyErr=1 + fi +done + +exit $anyErr Property changes on: autogen.sh ___________________________________________________________________ Name: svn:executable + * Index: NEWS =================================================================== --- NEWS (revision 0) +++ NEWS (revision 0) @@ -0,0 +1,2 @@ + +This file will hold news about the OpenSM project.
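The version checks in the top-level autogen.sh above use `<` inside `[[ ]]`, which in bash compares strings rather than numbers, so a minor version of 10 sorts before 9 and would be reported as too old. A minimal sketch of a field-by-field numeric comparison, assuming a POSIX shell; the helper name `version_ge` is hypothetical and is not part of the posted patch:

```shell
# version_ge HAVE WANT: succeed when dotted version HAVE >= WANT,
# comparing each field as a number rather than as a string.
# Hypothetical helper for illustration; it uses plain (global) variables
# because POSIX sh has no "local".
version_ge() {
    have=$1
    want=$2
    while [ -n "$want" ]; do
        h=${have%%.*}          # first field of each version
        w=${want%%.*}
        h=${h%%[!0-9]*}        # drop any non-digit suffix, e.g. "3a" -> "3"
        w=${w%%[!0-9]*}
        [ "${h:-0}" -gt "${w:-0}" ] && return 0
        [ "${h:-0}" -lt "${w:-0}" ] && return 1
        # Advance to the next field; an exhausted version becomes empty.
        case $have in *.*) have=${have#*.} ;; *) have= ;; esac
        case $want in *.*) want=${want#*.} ;; *) want= ;; esac
    done
    return 0
}

# Example: check a version string against the 1.9.3 automake minimum.
if version_ge 1.10.0 1.9.3; then
    echo "automake is new enough"
fi
```

A missing trailing field is treated as 0, so `1.4` is considered older than `1.4.2`. A caller would pass the version string it extracted (for example `$am_ver`) as the first argument and exit with a non-zero status when the check fails.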
Index: Makefile =================================================================== --- Makefile (revision 3083) +++ Makefile (working copy) @@ -1,44 +0,0 @@ -LIBS:= complib libvendor -BIN:= opensm -UTIL:= include - -SUBDIRS=$(BIN) $(UTIL) - -all: BUILD_TARG=all -all: libs_install subdirs - @echo Make all done - -install: BUILD_TARG=install -install: subdirs - @echo Install done - -clean: SUBDIRS= $(LIBS) $(BIN) -clean: BUILD_TARG=clean -clean: subdirs - @echo Clean done - -rmdep: - find $(SUBDIRS) -name ".depend" | xargs rm -f - -depend: SUBDIRS= $(LIBS) $(BIN) $(UTIL) -depend: BUILD_TARG=depend -depend: rmdep subdirs - @echo Depend done - -.PHONY : subdirs -subdirs: - @for i in $(SUBDIRS); do\ - if [ -e $$i/Makefile ]; then\ - if !(cd $$i; make $(BUILD_TARG)); then exit 1; fi\ - fi\ - done\ - -.PHONY : libs_install -libs_install: - @for i in $(LIBS); do\ - if [ -e $$i/Makefile ]; then\ - if !(cd $$i; make install); then exit 1; fi\ - fi\ - done\ - -export BUILD_TARG From rolandd at cisco.com Mon Aug 15 07:35:29 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 15 Aug 2005 07:35:29 -0700 Subject: [openib-general]: Question about FMR In-Reply-To: (Guy German's message of "Sun, 14 Aug 2005 18:06:39 +0300") References: Message-ID: <52r7cvi7ym.fsf@cisco.com> Thanks, I checked in a fix for this. - R. From rolandd at cisco.com Mon Aug 15 07:37:24 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 15 Aug 2005 07:37:24 -0700 Subject: [openib-general][PATCH][mthca]: arbel/tavor calls In-Reply-To: (Guy German's message of "Mon, 15 Aug 2005 15:22:06 +0300 (IDT)") References: Message-ID: <52mznji7vf.fsf@cisco.com> Thanks, applied. - R. From rolandd at cisco.com Mon Aug 15 07:45:06 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 15 Aug 2005 07:45:06 -0700 Subject: [openib-general] Re: [PATCH fixed] ipoib: device removal races In-Reply-To: <20050815090543.GA1752@mellanox.co.il> (Michael S. 
Tsirkin's message of "Mon, 15 Aug 2005 12:05:43 +0300") References: <20050808151141.GJ15300@mellanox.co.il> <52wtmwxdy4.fsf@cisco.com> <20050815072211.GW23848@mellanox.co.il> <20050815090543.GA1752@mellanox.co.il> Message-ID: <52iry7i7il.fsf@cisco.com> Michael> Here's fix for this theoretical race (I didnt see it Michael> triggered in real life). This needs to be applied in Michael> addition to my previous patch, which fixes a crash I Michael> actually see in the lab. Thanks. For this patch, why do we need to use a different workqueue rather than sharing the existing IPoIB workqueue? Michael> Roland, I think at least the previous one-line patch Michael> should go in to 2.6.13. Do you have it? Yes, I have it in my queue. I think it's too late to add this to 2.6.13, since it's not really critical, but we can propose it for 2.6.13.1. - R. From jlentini at netapp.com Mon Aug 15 07:48:38 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 15 Aug 2005 10:48:38 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: Hi Guy, Answers below: On Sat, 13 Aug 2005, Guy German wrote: > Hi James, > > Im writing you back, from my web mail, so sorry again > for the format (I will use "gg:" prefix again for my > answer :) > > > On Fri, 12 Aug 2005, Guy German wrote: > > > - /* Only process events if there is an enabled callback function. */ > > > - if ((evd->upcall.upcall_func == (DAT_UPCALL_FUNC) NULL) || > > > - (evd->upcall_policy == DAT_UPCALL_DISABLE)) { > > > + > > > + /* The function is not re-entrant (change when implementing DAT_UPCALL_MANY)*/ > > > > Why is this function not re-entrant? 
For reference, here is how I > > would define re-entrant: > > > > http://en.wikipedia.org/wiki/Reentrant > > http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?re-entrant > > > > gg: > > The function cannot be entered twice at the same time, because > > when the upcall is at the hands of the consumer he can disable > > the upcall policy and if it is entered twice, there is a chance > > the consumer will get another upcall after disabling > > the upcall policy. > > I don't see the flow of control that would result in the scenario you > describe. This piece of code > > spin_lock_irqsave (&evd->common.lock, flags); > if (evd->is_triggered) { > spin_unlock_irqrestore (&evd->common.lock, flags); > return; > } > evd->is_triggered = 1; > spin_unlock_irqrestore (&evd->common.lock, flags); > > ensures that only one thread can be making upcalls at a time. > > gg: > The change I did in the function assures that the function would not > be entered twice at the same time, from 2 (or more) contexts > (by adding the spin_lock). > The remark mentions that this change, however, does not support > DAT_UPCALL_MANY, which, by my understanding, requires upcalls > to be called simultaneously, hence the function to be re-entrant > (and not protected by a spin lock that prevents entering the function) How about changing the comment to say that DAT_UPCALL_MANY is not supported and leave it at that? > > > + if (evd->is_triggered) > > > return; > > > - } > > > > Why check the value here? Is it only for the efficiency of not taking > > the spin lock when is_triggered is 1? > > > > gg: > > No. You can't take the spin_lock here because this can cause > > a deadlock in the case the function calls itself from > > dat_evd_dequeue, on a uniprocessor machine. > > Can you elaborate on this? Do you mean that the thread that performs > the upcall in dapl_evd_upcall_trigger() can be used by the consumer to > call dapl_evd_dequeue()?
If so, I don't see the flow of control that > begins in dapl_evd_dequeue() and reaches dapl_evd_upcall_trigger(). > > gg: > The protection there is from the case, where dapl_evd_upcall_trigger > calls dapl_evd_dequeue that calls dapl_evd_upcall_trigger again > (recursively). This happened when there was a bad DTO completion and > CONN_EVENT_BROKEN was synthesized. Now, I saw that this part was > removed from dapl_evd_wc_to_event. > I still think that we should leave this protection, for those kinds of cases, > that might be implemented in the future. > > That leaves me with a question: what did happened to > DAT_CONNECTION_EVENT_BROKEN? I'm not sure. > > > @@ -820,24 +831,19 @@ static void dapl_evd_dto_callback(struct > > > * This function does not dequeue from the CQ; only the consumer > > > * can do that. Instead, it wakes up waiters if any exist. > > > * It rearms the completion only if completions should always occur > > > - * (specifically if a CNO is associated with the EVD and the > > > - * EVD is enabled). > > > */ > > > - > > > - if (state == DAPL_EVD_STATE_OPEN && > > > - evd->upcall_policy != DAT_UPCALL_DISABLE) { > > > - /* > > > - * Re-enable callback, *then* trigger. > > > - * This guarantees we won't miss any events. 
> > > - */ > > > - status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > > > - if (0 != status) > > > - (void)dapl_evd_post_async_error_event( > > > - evd->common.owner_ia->async_error_evd, > > > - DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, > > > - evd->common.owner_ia); > > > - > > > + > > > + if (state == DAPL_EVD_STATE_OPEN) { > > > dapl_evd_upcall_trigger(evd); > > > + if ((evd->upcall_policy != DAT_UPCALL_TEARDOWN) && > > > + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { > > > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > > > + if (0 != status) > > > + (void)dapl_evd_post_async_error_event( > > > + evd->common.owner_ia->async_error_evd, > > > + DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, > > > + evd->common.owner_ia); > > > + } > > > > > > You changed the order in which the CQ upcall is enabled and the kDAPL > > upcall is made. It used to be: > > > > enable CQ upcall > > call kDAPL upcall > > > > you are proposing > > > > call kDAPL upcall > > enable CQ upcall > > > > I think your proposed order contains a race condition. Specifically if > > a work completion occurs after dapl_evd_upcall_trigger() > > returns but before the CQ upcall is re-enabled with > > ib_req_notify_cq(), no upcall will occur for the completion. > > > > Do you agree? > > > > gg: > > You need to enable the CQ upcall only if the consumer did > > not change his upcall policy, while in upcall context. In > > the first case you will create a situation where the cq is > > enabled, while the consumers doesn't want any upcalls. > > Correct, but this can be hidden so that the consumer does not receive > the upcall. > > gg: > So what does dapl would do with the DTO event once it got it ? kDAPL doesn't dequeue the CQ events, so this shouldn't be an issue. > > In most real world application dapl_evd_upcall_trigger() > > will return with upcall policy disabled and there will be > > no need to alarm the cq upcall - i.e the consumer would > > dequeue the rest of the events himself. 
> > > > I see the race you talk about. It is relevent to kdapltest. > > Maybe we can check if there are pending events after > > enabling CQ upcall, and if there are - call > > dapl_evd_upcall_trigger() again. What do you think ? > > The problem we are dealing with is that DAPL upcalls behave > differently from IB upcalls. DAPL upcalls are enabled until they are > disabled while IB upcalls are "one shots". > > The approach taken in the current implementation is to always enable > the IB upcalls and determine in the DAPL provider if the consumer's > upcall should be invoked. > > gg: > This is not good enough for some consumers (e.g. ISER), and it is > not implementing the DAPL spec, which deals with upcall policies. I disagree that this is in violation of the DAPL spec. The spec. describes DAPL upcall policies and leaves the interaction with lower layers (IB verbs) unspecified. In any event, we should fix any of the performance problems you've found. > You are proposing a shift away from that approach. If we do that, we > need to preserve the original semantics. > > gg: > I think that the implementation proposed preserves the current > semantics. If the consumer doesn't change the upcall policy - it > stays enabled and everything stays the same (kdapltest is still > working). I think you had a good point about the race over there, > and that can be fixed. > > > I'll ask the DAT Collaborative from some clarification on > > the meaning of the different upcall policy flags. > > > One final item. To be consistent with your design, CQ upcalls should > > be selectively enabled in dapl_evd_internal_create(). > > gg: > Why ? the initial state is that the upcalls are enabled, like it > is today. Only if the consumer chooses to disable the upcall, he > calls dat_evd_modify_upcall. Currently the IB upcall is initially enabled, but there are checks on the upcall path to determine if the EVD upcall is enabled. 
See dapl_evd.c line 827: if (state == DAPL_EVD_STATE_OPEN && evd->upcall_policy != DAT_UPCALL_DISABLE) { which you've replaced with if (state == DAPL_EVD_STATE_OPEN) { > > > evd = (struct dapl_evd *)evd_handle; > > > + dapl_dbg_log (DAPL_DBG_TYPE_API, "%s: (evd=%p) set to %d\n", > > > + __func__, evd_handle, upcall_policy); > > > > The idea was to make the DAPL_DBG_TYPE_API prints look like a > > debugger stack trace. The following would be keeping with the other > > print statements: > > > > gg: > > I thought it would make it a bit more user friendly :) sometimes > > the consumers use those debug prints and they don't want to dwell > > in the kdapl code too much in order to understand what they > > are reading ... > > You need a fair amount of familiarity with the code to know what a > message that says "dapl_evd_modify_upcall: (evd=dbbe4e58) set to 2" > means. > > If you'd like to add the parameter names (e.g. evd=%p, > upcall_policy=...) that is fine. > > gg: > Sure. That's good enough. > > > I think the function call format is better, > > because a user familiar with the function > > signatures will know what each of the fields means. > > gg: > I think that if someone is writing a code over dapl, for the > first time, he has enough on his head, beside reviewing dapl's > implementation code. > It is easier to debug your own code with helpful debug prints, from > the lower layer, like: "upcall_policy=1" / "upcall_policy=o" / > "connecting to 192.168.10.10" DTO completion info ... etc.. > > > > > > > + spin_lock_irqsave(&evd->common.lock, flags); > > > + if ((upcall_policy != DAT_UPCALL_TEARDOWN) && > > > + (upcall_policy != DAT_UPCALL_DISABLE)) { > > > > Why not let the consumer setup the upcall when it disabled? That seems > > like the only safe time to modify it. > > > > gg: > > The consumer needs and can change the policy to disable and enable.
> > The only time he is not allowed to change the policy to enable (in > > this implementation) is when there are still pending events in the > > queue. > > You mean when there are *no* pending events in the queue. > > gg: > No. I mean when there *are* pending events. When the consumer wishes > to enable the upcall policy, he believes that he dequeued all the > events. If this is not the case - dat_evd_modify_upcall alarms him > and he knows he should continue to dequeue. Agreed. I didn't read your original sentence carefully enough. > > This is to solve a race where the consumer dequeued all the events > > and changed the policy to enable, but there were other event/s that > > came just before calling dat_evd_modufy_upcall. In this case > > dat_evd_modufy_upcall to enable would fail and the consumer would > > keep dequeue-ing the events, without loosing his context. > > You've only decrease the window in which that scenario could occur, > not eliminated it. If a DTO completion occured after you count the > number of pending events but before you enable the CQ callback, a > completion will be missed. > > gg: > I don't think so. That is what the spin_lock_irqsave is for. What if there are multiple processors in the system? > Also, the pending_event_queue is only used for kDAPL generated > software events. This queue can be empty when there are events on the > CQ, so your would need to be expanded your check to cover that. > > gg: > I agree. > > > > + if (evd->evd_flags & DAT_EVD_DTO_FLAG) { > > > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > > > > Why do we need to re-enable the CQ upcall? > > > > gg: > > If the consumer returned from the evd_upcall with upcall policy > > "disabled" the CQ upcall is not enabled. So this is the only > > place it is done. > > Ok, that fits with your new approach to the problem. > > From mst at mellanox.co.il Mon Aug 15 07:53:18 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 15 Aug 2005 17:53:18 +0300 Subject: [openib-general] Re: [PATCH fixed] ipoib: device removal races In-Reply-To: <52iry7i7il.fsf@cisco.com> References: <20050808151141.GJ15300@mellanox.co.il> <52wtmwxdy4.fsf@cisco.com> <20050815072211.GW23848@mellanox.co.il> <20050815090543.GA1752@mellanox.co.il> <52iry7i7il.fsf@cisco.com> Message-ID: <20050815145318.GD1856@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH fixed] ipoib: device removal races > > Michael> Here's fix for this theoretical race (I didnt see it > Michael> triggered in real life). This needs to be applied in > Michael> addition to my previous patch, which fixes a crash I > Michael> actually see in the lab. > > Thanks. For this patch, why do we need to use a different workqueue > rather than sharing the existing IPoIB workqueue? Because we down/up the device upon an event. We need to flush ipoib_workqueue when we down the device, and this cant be done from inside the same work queue. > Michael> Roland, I think at least the previous one-line patch > Michael> should go in to 2.6.13. Do you have it? > > Yes, I have it in my queue. I think it's too late to add this to > 2.6.13, since it's not really critical, but we can propose it for > 2.6.13.1. You mean people dont unload the module all that much? -- MST From jlentini at netapp.com Mon Aug 15 07:54:31 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 15 Aug 2005 10:54:31 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: On Sat, 13 Aug 2005, Guy German wrote: > Hi, > > Just a small add to my reply about: > > > The approach taken in the current implementation is to always enable > > the IB upcalls and determine in the DAPL provider if the consumer's > > upcall should be invoked. > > My earlier reply was referring to part that the DAPL provider would > determine if the consumer's upcall should be invoked, instead of the > consumer deciding it. 
> As to always enabling the IB upcalls - that can be done. But if we do > it we should add the DTO events to the pending events list, which > brings us to the size of this list, that is in the current implementation > problematic. I like your approach of only enabling the CQ upcall when the consumer enables DAPL upcalls. I think we should continue in that direction. From jlentini at netapp.com Mon Aug 15 08:02:55 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 15 Aug 2005 11:02:55 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: On Sun, 14 Aug 2005, Guy German wrote: > James Lentini wrote: > >> You changed the order in which the CQ upcall is enabled and the kDAPL > >> upcall is made. It used to be: > >> > >> enable CQ upcall > >> call kDAPL upcall > >> > >> you are proposing > >> > >> call kDAPL upcall > >> enable CQ upcall > >> > >> I think your proposed order contains a race condition. Specifically > >> if a work completion occurs after dapl_evd_upcall_trigger() > >> returns but before the CQ upcall is re-enabled with > >> ib_req_notify_cq(), no upcall will occur for the completion. > >> > >> Do you agree? > > Or, has turned my attention to the fact that also in the first case > You have the alleged race: if a work completion occurs just > before you enable the CQ upcall... Correct, kDAPL will not receive a CQ upcall for such a completion, but it does notify the consumer that one or more completions have occurred. It is the consumer's responsibility to empty the EVD (CQ). In the scenario you describe, the consumer will not receive an upcall until the next work completion event which could take an arbitrary amount of time. > Are you suggesting that enabling the CQ upcall will not trigger the > CQ upcall, if completions happened before enabling? > I don't think this is the case, but I'm not 100% sure... That is my assumption of how it works. 
This is how other verbs APIs have worked in the past. > As I mentioned before, and regardless to this issue, I still believe > that the right order should be: > >> call kDAPL upcall > >> (conditionally) enable CQ upcall > We can't have interrupts if the consumer disabled the upcall policy... I agree that we should not request interrupts if the consumer disabled the upcall policy. From rolandd at cisco.com Mon Aug 15 08:16:44 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 15 Aug 2005 08:16:44 -0700 Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32 References: <200508111932.j7BJWROj026274@xi.cse.ohio-state.edu> <482FEE13-0AB4-11DA-80CD-000D932C3754@cse.ohio-state.edu> <52k6isnm21.fsf@cisco.com> Message-ID: <52d5ofi61v.fsf@cisco.com> Thanks for the debugging info. Can you apply the patch below and confirm that it works with your PCI-X adapters? If this works for you then I will check it into svn and merge it for kernel 2.6.14. Thanks, Roland Index: infiniband/hw/mthca/mthca_dev.h =================================================================== --- infiniband/hw/mthca/mthca_dev.h (revision 3056) +++ infiniband/hw/mthca/mthca_dev.h (working copy) @@ -148,6 +148,7 @@ struct mthca_limits { int reserved_mcgs; int num_pds; int reserved_pds; + u8 port_width_cap; }; struct mthca_alloc { Index: infiniband/hw/mthca/mthca_main.c =================================================================== --- infiniband/hw/mthca/mthca_main.c (revision 3056) +++ infiniband/hw/mthca/mthca_main.c (working copy) @@ -171,6 +171,7 @@ static int __devinit mthca_dev_lim(struc mdev->limits.reserved_mrws = dev_lim->reserved_mrws; mdev->limits.reserved_uars = dev_lim->reserved_uars; mdev->limits.reserved_pds = dev_lim->reserved_pds; + mdev->limits.port_width_cap = dev_lim->max_port_width; /* IB_DEVICE_RESIZE_MAX_WR not supported by driver. May be doable since hardware supports it for SRQ. 
Index: infiniband/hw/mthca/mthca_cmd.c =================================================================== --- infiniband/hw/mthca/mthca_cmd.c (revision 3056) +++ infiniband/hw/mthca/mthca_cmd.c (working copy) @@ -1285,10 +1285,8 @@ int mthca_INIT_IB(struct mthca_dev *dev, #define INIT_IB_FLAG_SIG (1 << 18) #define INIT_IB_FLAG_NG (1 << 17) #define INIT_IB_FLAG_G0 (1 << 16) -#define INIT_IB_FLAG_1X (1 << 8) -#define INIT_IB_FLAG_4X (1 << 9) -#define INIT_IB_FLAG_12X (1 << 11) #define INIT_IB_VL_SHIFT 4 +#define INIT_IB_PORT_WIDTH_SHIFT 8 #define INIT_IB_MTU_SHIFT 12 #define INIT_IB_MAX_GID_OFFSET 0x06 #define INIT_IB_MAX_PKEY_OFFSET 0x0a @@ -1304,12 +1302,11 @@ int mthca_INIT_IB(struct mthca_dev *dev, memset(inbox, 0, INIT_IB_IN_SIZE); flags = 0; - flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; - flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; flags |= param->set_si_guid ? INIT_IB_FLAG_SIG : 0; flags |= param->vl_cap << INIT_IB_VL_SHIFT; + flags |= param->port_width << INIT_IB_PORT_WIDTH_SHIFT; flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); Index: infiniband/hw/mthca/mthca_cmd.h =================================================================== --- infiniband/hw/mthca/mthca_cmd.h (revision 3056) +++ infiniband/hw/mthca/mthca_cmd.h (working copy) @@ -220,8 +220,7 @@ struct mthca_init_hca_param { }; struct mthca_init_ib_param { - int enable_1x; - int enable_4x; + int port_width; int vl_cap; int mtu_cap; u16 gid_cap; Index: infiniband/hw/mthca/mthca_qp.c =================================================================== --- infiniband/hw/mthca/mthca_qp.c (revision 3056) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -502,12 +502,11 @@ static void init_port(struct mthca_dev * memset(&param, 0, sizeof param); - param.enable_1x = 1; - param.enable_4x = 1; - param.vl_cap = dev->limits.vl_cap; - param.mtu_cap = 
dev->limits.mtu_cap; - param.gid_cap = dev->limits.gid_table_len; - param.pkey_cap = dev->limits.pkey_table_len; + param.port_width = dev->limits.port_width_cap; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; err = mthca_INIT_IB(dev, &param, port, &status); if (err) From guyg at voltaire.com Mon Aug 15 08:42:09 2005 From: guyg at voltaire.com (Guy German) Date: Mon, 15 Aug 2005 18:42:09 +0300 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation Message-ID: Hi, >> Are you suggesting that enabling the CQ upcall will not trigger the >> CQ upcall, if completions happened before enabling? >> I don't think this is the case, but I'm not 100% sure... > > That is my assumption of how it works. This is how other > verbs APIs have worked in the past. Please see InfiniHost MT23108 Programmer's Reference Manual p 102 section 9.4.1 If completions are posted to the CQ (after the reporting of a completion event) but still not consumed by the software, events will be generated immediately after request for notification is executed. Subscribe for event is implemented by writing the CQ doorbell with the request notification command to the appropriate UAR page, and passing as a parameter to the command the consumer index to be polled. (found by Or G.) >> As I mentioned before, and regardless to this issue, I still believe >> that the right order should be: >>>> call kDAPL upcall >>>> (conditionally) enable CQ upcall >> We can't have interrupts if the consumer disabled the upcall >> policy... > > I agree that we should not request interrupts if the consumer disabled > the upcall policy.
Guy From guyg at voltaire.com Mon Aug 15 08:42:35 2005 From: guyg at voltaire.com (Guy German) Date: Mon, 15 Aug 2005 18:42:35 +0300 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation Message-ID: Hi, James Lentini wrote: > How about changing the comment to say that DAT_UPCALL_MANY is not > supported and leave it at that? OK. It was a very long debate over a remark... :) >> That leaves me with a question: what did happened to >> DAT_CONNECTION_EVENT_BROKEN? > I'm not sure. I think we need to add it back, at some point, even though the consumers can handle it themselves, for now... >>> You changed the order in which the CQ upcall is enabled and the >>> kDAPL upcall is made. > kDAPL doesn't dequeue the CQ events, so this shouldn't be an issue. I think we agreed on that issue, on another thread, that I shall leave it: > call kDAPL upcall > enable CQ upcall right ? > I disagree that this is in violation of the DAPL spec. The spec. > describes DAPL upcall policies and leaves the interaction with lower > layers (IB verbs) unspecified. > In any event, we should fix any of the performance problems you've > found. Agreed. > Currently the IB upcall is initially enabled, but there are checks on > the upcall path to determine if the EVD upcall is enabled. See > dapl_evd.c line 827: > > if (state == DAPL_EVD_STATE_OPEN && > evd->upcall_policy != DAT_UPCALL_DISABLE) { > > which you've replaced with > > if (state == DAPL_EVD_STATE_OPEN) { Yes, because I believe the correct place to check it is inside dapl_evd_upcall_trigger, after the lock and before the actual upcall. >> You've only decrease the window in which that scenario could occur, >> not eliminated it. If a DTO completion occurred after you count the >> number of pending events but before you enable the CQ callback, a >> completion will be missed. >> >> gg: >> I don't think so. That is what the spin_lock_irqsave is for. > > What if there are multiple processors in the system? 
AFAIK, spin_lock_irqsave does the trick. am I wrong ? >> Also, the pending_event_queue is only used for kDAPL generated >> software events. This queue can be empty when there are events on the >> CQ, so your would need to be expanded your check to cover that. Actually, even though, I agreed before, I tend to disagree now. The consumer will still get the DTO events as soon as the CQ upcall is triggered (enabled), so only problem is with the pending events list. Thanks, Guy From guyg at voltaire.com Mon Aug 15 09:00:11 2005 From: guyg at voltaire.com (Guy German) Date: Mon, 15 Aug 2005 19:00:11 +0300 Subject: [openib-general]: Question about FMR Message-ID: Thanks. Problem solved. Roland Dreier wrote: > Thanks, I checked in a fix for this. > > - R. From halr at voltaire.com Mon Aug 15 08:54:42 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2005 11:54:42 -0400 Subject: [openib-general] Re: [PATCH] osm: add a main auto tools project for osm In-Reply-To: <86mznjcm82.fsf@mtl066.yok.mtl.com> References: <86mznjcm82.fsf@mtl066.yok.mtl.com> Message-ID: <1124121281.4403.12806.camel@hal.voltaire.com> Hi Eitan, On Mon, 2005-08-15 at 10:23, Eitan Zahavi wrote: > I'm now using xemacs as my mail client. This is pure unix and > tested to work. That's better. > This patch includes: > 1. Added a top level autotools project for OpenSM. > So now you need autogen.sh && configure && make && make install just > once for osm > (previously needed 4: complib, libvendor, opensm, osmtest). > 2. Cleanup the direct override of libdir, bindir. Support --prefix > 3. Move osm lib, bin into default prefix (/usr/local) > 4. Support debug build for OpenSM using --enable-debug > This is important to allow for asserts during runtime and various > other additional debug features. > Since the generated compilb can not be used with the release version > we also use a special header file that stores the type of build > for applications that wish to link with it. > 5. 
Cleanup stale use of AC_CHECK_LIB with no parameters > 6. Resolved another bug: iba/ib_types.h not installed correctly Thanks. Applied with the following minor modifications: Min autoconf version is 2.57 rather than 2.59 Min automake version is 1.6.3 rather than 1.9.3 Also, the builds now show: -g -O2 -g -O2 (when not configured with --enable-debug) and -g -O0 -D_DEBUG_ -g -O2 (when configured with --enable-debug) Not sure which -O option gcc takes in that case. Can you remove this duplication ? > Signed-off-by: Eitan Zahavi eitan at mellanox.co.il The email address should have braces around it in the signed off line: Signed-off-by: Eitan Zahavi -- Hal From woodennickel at gmail.com Mon Aug 15 09:37:50 2005 From: woodennickel at gmail.com (Bill Jordan) Date: Mon, 15 Aug 2005 12:37:50 -0400 Subject: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support In-Reply-To: <20050811080205.GR16361@minantech.com> References: <20050719165542.GB16028@mellanox.co.il> <20050725171928.GC12206@mellanox.co.il> <20050726133553.GA22276@mellanox.co.il> <20050810083943.GM16361@minantech.com> <20050810132611.GP16361@minantech.com> <20050811080205.GR16361@minantech.com> Message-ID: <5ebee0d10508150937da6c1ed@mail.gmail.com> On 8/11/05, Gleb Natapov wrote: > What about the idea that was floating around about new VM flag that will > instruct kernel to copy pages belonging to the vma on fork instead of mark > them as cow? > I think the big problem with this idea is the huge memory regions that InfiniBand applications are dealing with. If the application forks (or uses system()), you are going to copy a huge chunk of data (most likely swapping since the application memory footprint is probably already tuned to consume the available physical memory). And the copy is really for nothing since in most (or at least many) cases the child is just going to exec anyway. 
-- Bill Jordan From rolandd at cisco.com Mon Aug 15 11:46:27 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 15 Aug 2005 11:46:27 -0700 Subject: [openib-general] avoid segv in libibverbs/examples In-Reply-To: <20050812144222.GA8988@osc.edu> (Pete Wyckoff's message of "Fri, 12 Aug 2005 10:42:22 -0400") References: <20050812144222.GA8988@osc.edu> Message-ID: <52u0hrghrw.fsf@cisco.com> Thanks. I think we should probably print a diagnostic if no devices are found. Can you resend the patch with that fixed, and also include a "Signed-off-by:" line? Thanks, Roland From eitan at mellanox.co.il Mon Aug 15 11:54:10 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 15 Aug 2005 21:54:10 +0300 Subject: [openib-general] RE: [PATCH] osm: add a main auto tools project for osm Message-ID: <506C3D7B14CDD411A52C00025558DED607C30663@mtlex01.yok.mtl.com> Hi Hal, Thanks applying the patch. I will chase down the hard coded -g -O2 . I'll update my patch mail template... I was using autoconf 2.57 for a while. It has issues with tracking usage of the same source code for two executables each with its own set of -D flags. (like if you want to compile a test program with a "-D_ALOW_MAIN_ ") . Anyway this is not a must for OpenSM. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, August 15, 2005 6:55 PM > To: Eitan Zahavi > Cc: OPENIB GENERAL > Subject: Re: [PATCH] osm: add a main auto tools project for osm > > Hi Eitan, > > On Mon, 2005-08-15 at 10:23, Eitan Zahavi wrote: > > I'm now using xemacs as my mail client. This is pure unix and > > tested to work. > > That's better. > > > This patch includes: > > 1. Added a top level autotools project for OpenSM. 
> > So now you need autogen.sh && configure && make && make install just > > once for osm > > (previously needed 4: complib, libvendor, opensm, osmtest). > > 2. Cleanup the direct override of libdir, bindir. Support --prefix > > 3. Move osm lib, bin into default prefix (/usr/local) > > 4. Support debug build for OpenSM using --enable-debug > > This is important to allow for asserts during runtime and various > > other additional debug features. > > Since the generated compilb can not be used with the release version > > we also use a special header file that stores the type of build > > for applications that wish to link with it. > > 5. Cleanup stale use of AC_CHECK_LIB with no parameters > > 6. Resolved another bug: iba/ib_types.h not installed correctly > > Thanks. Applied with the following minor modifications: > > Min autoconf version is 2.57 rather than 2.59 > Min automake version is 1.6.3 rather than 1.9.3 > > Also, the builds now show: > -g -O2 -g -O2 (when not configured with --enable-debug) > and > -g -O0 -D_DEBUG_ -g -O2 (when configured with --enable-debug) > > Not sure which -O option gcc takes in that case. Can you remove this > duplication ? > > > Signed-off-by: Eitan Zahavi eitan at mellanox.co.il > > The email address should have braces around it in the signed off line: > Signed-off-by: Eitan Zahavi > > -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlentini at netapp.com Mon Aug 15 12:05:49 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 15 Aug 2005 15:05:49 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: We should verify that all implementations of the verbs API will behave this way. I'll start a new thread on the list to make sure that this is the correct definition. 
james On Mon, 15 Aug 2005, Guy German wrote: guyg> Hi, guyg> guyg> >> Are you suggesting that enabling the CQ upcall will not trigger the guyg> >> CQ upcall, if completions happened before enabling? guyg> >> I don't think this is the case, but I'm not 100% sure... guyg> > guyg> > That is my assumption of how it works. This is how other guyg> > verbs APIs have worked in the past. guyg> guyg> Please see InfiniHost MT23108 Programmer's Reference Manual guyg> p 102 section 9.4.1 guyg> guyg> If completions are posted to the CQ (after the reporting of a completion event) but still not consumed by the software, guyg> events will be generated immediately after request for notification is executed. Subscribe for event is implemented by guyg> writing the CQ doorbell with the request notification command to the appropriate UAR page, and passing as a parameter guyg> to the command the consumer index to be polled. guyg> guyg> (found by Or G.) guyg> guyg> >> As I mentioned before, and regardless to this issue, I still believe guyg> >> that the right order should be: guyg> >>>> call kDAPL upcall guyg> >>>> (conditionally) enable CQ upcall guyg> >> We can't have interrupts if the consumer disabled the upcall guyg> >> policy... guyg> > guyg> > I agree that we should not request interrupts if the consumer disabled guyg> > the upcall policy. guyg> guyg> Guy guyg> From halr at voltaire.com Mon Aug 15 12:04:37 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2005 15:04:37 -0400 Subject: [openib-general] [ANNOUNCE] management: binaries and libraries now moved Message-ID: <1124132676.4403.12882.camel@hal.voltaire.com> Hi, Thanks to Eitan for updating the OpenSM build, the management libraries are binaries are now (r3081) under /usr/local/[lib bin] rather than /usr/local/ib/[lib bin] as before. You will need to go through the build process to make this happen (autogen.sh, configure, make, and make install). 
The directions have been updated for the new locations and reduced number of steps. When I get a chance I will combine the diags into one. -- Hal From jlentini at netapp.com Mon Aug 15 12:28:48 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 15 Aug 2005 15:28:48 -0400 (EDT) Subject: [openib-general] ib_req_notify_cq() kernel verb Message-ID: Does the ib_req_notify_cq() kernel verb conform the definition of the Request Completion Notification in section 11.4.2.2 of the IBTA spec? Specifically, is the handler only guaranteed to be called after the next completion entry is added? I understand that Mellanox hardware can generate notification if there are already completions on the CQ, but I want to avoid using any hardware specific features. james From kingman at storagegear.com Mon Aug 15 12:29:53 2005 From: kingman at storagegear.com (John Kingman) Date: Mon, 15 Aug 2005 14:29:53 -0500 Subject: [openib-general] [PATCH] uat: make uat.c compile on 2.6.13-rc3 Message-ID: <4300ED31.6010503@storagegear.com> What is the status of this patch? (Changes class_simple_* to class_* ) I tried to compile on a 2.6.13-rc5 system (OpenSuSE 10.0 Beta 1) and got the error that this patch should have fixed. (The ucm patch is also missing.) John From rolandd at cisco.com Mon Aug 15 12:35:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 15 Aug 2005 12:35:59 -0700 Subject: [openib-general] ib_req_notify_cq() kernel verb In-Reply-To: (James Lentini's message of "Mon, 15 Aug 2005 15:28:48 -0400 (EDT)") References: Message-ID: <52oe7zgfhc.fsf@cisco.com> James> Does the ib_req_notify_cq() kernel verb conform the James> definition of the Request Completion Notification in James> section 11.4.2.2 of the IBTA spec? James> Specifically, is the handler only guaranteed to be called James> after the next completion entry is added? Yes, I think these should be semantics we assume. 
James> I understand that Mellanox hardware can generate James> notification if there are already completions on the CQ, James> but I want to avoid using any hardware specific features. I agree - we should not assume the stronger Mellanox semantics. - R. From halr at voltaire.com Mon Aug 15 13:11:41 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Aug 2005 16:11:41 -0400 Subject: [openib-general] [PATCH] uat: make uat.c compile on 2.6.13-rc3 In-Reply-To: <4300ED31.6010503@storagegear.com> References: <4300ED31.6010503@storagegear.com> Message-ID: <1124136701.4481.1.camel@rich0-t42.us.voltaire.com> On Mon, 2005-08-15 at 15:29, John Kingman wrote: > What is the status of this patch? (Changes class_simple_* to class_* ) > > I tried to compile on a 2.6.13-rc5 system (OpenSuSE 10.0 Beta 1) and got the > error that this patch should have fixed. > > (The ucm patch is also missing.) This is not yet applied as OpenIB has not yet moved up to 2.6.13 as this is not out of kernel.org yet. -- Hal From jlentini at netapp.com Mon Aug 15 13:31:19 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 15 Aug 2005 16:31:19 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: > >>> You changed the order in which the CQ upcall is enabled and the > >>> kDAPL upcall is made. > > kDAPL doesn't dequeue the CQ events, so this shouldn't be an issue. > > I think we agreed on that issue, on another thread, that I shall leave it: > > call kDAPL upcall > > enable CQ upcall > right ? Given that the ib_req_notify_cq() verb conforms to the IBTA spec's Request Completion Notification verb, we will have to find another solution. > > Currently the IB upcall is initially enabled, but there are checks on > > the upcall path to determine if the EVD upcall is enabled. 
See > > dapl_evd.c line 827: > > > > if (state == DAPL_EVD_STATE_OPEN && > > evd->upcall_policy != DAT_UPCALL_DISABLE) { > > > > which you've replaced with > > > > if (state == DAPL_EVD_STATE_OPEN) { > > Yes, because I believe the correct place to check it is inside > dapl_evd_upcall_trigger, after the lock and before the actual upcall. I agree that checking it in dapl_evd_upcall_trigger() is better. I still don't understand why you wouldn't setup the completion notification based on the kDAPL consumer's upcall_policy in dapl_evd_internal_create(). > >> You've only decrease the window in which that scenario could > >> occur, not eliminated it. If a DTO completion occurred after you > >> count the number of pending events but before you enable the CQ > >> callback, a completion will be missed. > >> > >> gg: > >> I don't think so. That is what the spin_lock_irqsave is for. > > > > What if there are multiple processors in the system? > > AFAIK, spin_lock_irqsave does the trick. am I wrong ? spin_lock_irqsave disables local interrupts. In any event...[see question below] > >> Also, the pending_event_queue is only used for kDAPL generated > >> software events. This queue can be empty when there are events on the > >> CQ, so your would need to be expanded your check to cover that. > > Actually, even though, I agreed before, I tend to disagree now. > The consumer will still get the DTO events as soon as the CQ > upcall is triggered (enabled), so only problem is with the pending > events list. Why is it an error for the consumer to modify the upcall policy when there are pending events? dat_evd_modify_upcall should behave just like the IBTA spec's Request Completion Notification verb in this respect. If there were events on the EVD before the upcall is enabled, no upcall needs to be generated. A correct consumer can easily work around this by enabling the upcall and polling the EVD one final time to ensure it is empty. 
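The workaround described at the end of the message above (enable the upcall, then poll one final time) can be sketched as a small user-space model. This is only an illustration under stated assumptions: struct mock_cq and every mock_* function are hypothetical names invented here, not part of kDAPL or the kernel verbs API. The model assumes the one-shot semantics agreed on in this thread, i.e. arming the notification does not fire for completions that are already queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a completion queue with one-shot notification. */
struct mock_cq {
    int  pending;  /* completions queued but not yet polled */
    bool armed;    /* true after a notification request, until it fires */
    int  upcalls;  /* how many times the notification handler has fired */
};

/* Arm the one-shot notification. Assumed semantics: completions that are
 * already queued do NOT trigger the handler. */
static void mock_req_notify(struct mock_cq *cq)
{
    cq->armed = true;
}

/* A new completion arrives. The handler fires only if the queue is armed,
 * and firing disarms it again (one shot). */
static void mock_add_completion(struct mock_cq *cq)
{
    cq->pending++;
    if (cq->armed) {
        cq->armed = false;
        cq->upcalls++;
    }
}

/* Dequeue everything currently queued; returns the number dequeued. */
static int mock_drain(struct mock_cq *cq)
{
    int n = 0;

    while (cq->pending > 0) {
        cq->pending--;
        n++;
    }
    return n;
}

/* Consumer-side pattern: drain, arm, then poll one final time.
 * The optional 'race' callback stands in for a completion that arrives in
 * the window between the first drain and the arming call. */
static int drain_arm_repoll(struct mock_cq *cq, void (*race)(struct mock_cq *))
{
    int n = mock_drain(cq);

    if (race != NULL)
        race(cq);         /* completion slips in before arming */
    mock_req_notify(cq);
    n += mock_drain(cq);  /* final poll picks up the racing completion */
    return n;
}
```

In this model, a completion that lands between the first drain and the arming call produces no upcall, but the final poll still dequeues it, and the queue is left armed for the next completion. Without the final poll, that completion would sit on the queue until some later completion triggered the handler.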
From kingman at storagegear.com Mon Aug 15 13:36:59 2005 From: kingman at storagegear.com (John Kingman) Date: Mon, 15 Aug 2005 20:36:59 +0000 (UTC) Subject: [openib-general] Re: [PATCH] uat: make uat.c compile on 2.6.13-rc3 References: <4300ED31.6010503@storagegear.com> <1124136701.4481.1.camel@rich0-t42.us.voltaire.com> Message-ID: Hal Rosenstock voltaire.com> writes: > > On Mon, 2005-08-15 at 15:29, John Kingman wrote: > > What is the status of this patch? (Changes class_simple_* to class_* ) > > > > I tried to compile on a 2.6.13-rc5 system (OpenSuSE 10.0 Beta 1) and got the > > error that this patch should have fixed. > > > > (The ucm patch is also missing.) > > This is not yet applied as OpenIB has not yet moved up to 2.6.13 as this > is not out of kernel.org yet. Roger. I deduced that after I tracked down the ucm patch and found the comment wrt 2.6.13. I'll apply them locally. Thanks, John From eitan at mellanox.co.il Mon Aug 15 13:50:16 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 15 Aug 2005 23:50:16 +0300 Subject: [openib-general] [PATCH] osm: Remove extra duplicated optimization flag Message-ID: <86psse2ad3.fsf@mtl066.yok.mtl.com> Hi Hal Traced down the double -g -O2 to a bug in the autoconf core. According to http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering CFLAGS should never be modified by the configure. The intent is that this variable should allow the user to override any decision made by the automatic configure flows. The bug is that a test made by the core is actually modifying the variable. 
I tested the patch on : 2.6.12.3-smp SuSE Linux 9.3 (i586) Signed-off-by: Eitan Zahavi Index: osm/include/configure.in =================================================================== --- osm/include/configure.in (revision 3098) +++ osm/include/configure.in (working copy) @@ -34,5 +34,11 @@ AC_CACHE_CHECK(whether ld accepts --vers AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +# we have to revive the env CFALGS as some how they are being overwritten... +# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering +# for why they should NEVER be modified by the configure to allow for user +# overrides. +CFLAGS=$ac_env_CFLAGS_value + AC_CONFIG_FILES([Makefile]) AC_OUTPUT Index: osm/libvendor/configure.in =================================================================== --- osm/libvendor/configure.in (revision 3098) +++ osm/libvendor/configure.in (working copy) @@ -69,5 +69,11 @@ AC_ARG_ENABLE(debug, esac],[debug=false]) AM_CONDITIONAL(DEBUG, test x$debug = xtrue) +# we have to revive the env CFALGS as some how they are being overwritten... +# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering +# for why they should NEVER be modified by the configure to allow for user +# overrides. +CFLAGS=$ac_env_CFLAGS_value + AC_CONFIG_FILES([Makefile libosmvendor.spec]) AC_OUTPUT Index: osm/complib/configure.in =================================================================== --- osm/complib/configure.in (revision 3098) +++ osm/complib/configure.in (working copy) @@ -51,5 +51,12 @@ AC_ARG_ENABLE(debug, esac],[debug=false]) AM_CONDITIONAL(DEBUG, test x$debug = xtrue) +# we have to revive the env CFALGS as some how they are being overwritten... +# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering +# for why they should NEVER be modified by the configure to allow for user +# overrides. 
+CFLAGS=$ac_env_CFLAGS_value + + AC_CONFIG_FILES([Makefile libosmcomp.spec]) AC_OUTPUT Index: osm/osmtest/configure.in =================================================================== --- osm/osmtest/configure.in (revision 3098) +++ osm/osmtest/configure.in (working copy) @@ -61,5 +61,11 @@ AC_ARG_ENABLE(debug, esac],[debug=false]) AM_CONDITIONAL(DEBUG, test x$debug = xtrue) +# we have to revive the env CFALGS as some how they are being overwritten... +# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering +# for why they should NEVER be modified by the configure to allow for user +# overrides. +CFLAGS=$ac_env_CFLAGS_value + AC_CONFIG_FILES([Makefile]) AC_OUTPUT Index: osm/opensm/configure.in =================================================================== --- osm/opensm/configure.in (revision 3098) +++ osm/opensm/configure.in (working copy) @@ -61,5 +61,11 @@ AC_ARG_ENABLE(debug, esac],[debug=false]) AM_CONDITIONAL(DEBUG, test x$debug = xtrue) +# we have to revive the env CFALGS as some how they are being overwritten... +# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering +# for why they should NEVER be modified by the configure to allow for user +# overrides. +CFLAGS=$ac_env_CFLAGS_value + AC_CONFIG_FILES([Makefile]) AC_OUTPUT From halr at voltaire.com Mon Aug 15 15:31:43 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Aug 2005 01:31:43 +0300 Subject: [openib-general] RE: [PATCH] osm: Remove extra duplicated optimization flag Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BA6@taurus.voltaire.com> Thanks. Applied. -- Hal From mst at mellanox.co.il Mon Aug 15 23:45:51 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 16 Aug 2005 09:45:51 +0300 Subject: [openib-general] Re: [PATCH] dont drop device reference before use (was Re: sdp: cant unload ib_ipoib module) In-Reply-To: <1124111717.4403.12755.camel@hal.voltaire.com> References: <1123007814.2946.12.camel@duffman> <20050803070923.GG15300@mellanox.co.il> <20050804104250.A30741@topspin.com> <20050807154632.GG15300@mellanox.co.il> <1123539115.4402.39.camel@hal.voltaire.com> <20050809124631.GG32419@mellanox.co.il> <1123593909.4403.16.camel@hal.voltaire.com> <20050814142441.GO23848@mellanox.co.il> <1124111717.4403.12755.camel@hal.voltaire.com> Message-ID: <20050816064551.GI1856@mellanox.co.il> Quoting r. Hal Rosenstock : > > The following should finally fix the dev_put issue. > > This works for me too. Applied. -- MST From yael at mellanox.co.il Tue Aug 16 00:20:00 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Tue, 16 Aug 2005 10:20:00 +0300 Subject: [openib-general] bugzilla - adding opensm as a component Message-ID: <506C3D7B14CDD411A52C00025558DED60882C974@mtlex01.yok.mtl.com> Hello all, Who is responsible for opening new components in the bugzilla? I wanted to open a bug for the OpenSM, but it doesn't appear as a component under bugzilla. Thanks, Yael From eitan at mellanox.co.il Tue Aug 16 00:40:24 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 16 Aug 2005 10:40:24 +0300 Subject: [openib-general] [PATCH] osm: Fix a rate code computation for DDR ports Message-ID: <86oe7y1g9j.fsf@mtl066.yok.mtl.com> Hi Hal DDR Ports are not supported by the current implementation of the ib_type.h: ib_port_info_compute_rate This patch fixes that issue (had to change order of functions too). 
I tested the patch on : 2.6.12.3-smp SuSE Linux 9.3 (i586) Signed-off-by: Eitan Zahavi Index: include/iba/ib_types.h =================================================================== --- include/iba/ib_types.h (revision 3103) +++ include/iba/ib_types.h (working copy) @@ -4164,6 +4164,37 @@ ib_port_info_get_link_down_def_state( * * SEE ALSO *********/ + +/****f* IBA Base: Types/ib_port_info_get_link_speed_active +* NAME +* ib_port_info_get_link_speed_active +* +* DESCRIPTION +* Returns the Link Speed Active value assigned to this port. +* +* SYNOPSIS +*/ +static inline uint8_t +ib_port_info_get_link_speed_active( + IN const ib_port_info_t* const p_pi ) +{ + return( (uint8_t)((p_pi->link_speed & + IB_PORT_LINK_SPEED_ACTIVE_MASK) >> + IB_PORT_LINK_SPEED_SHIFT) ); +} +/* +* PARAMETERS +* p_pi +* [in] Pointer to a PortInfo attribute. +* +* RETURN VALUES +* Returns the link speed active value assigned to this port. +* +* NOTES +* +* SEE ALSO +*********/ + /****f* IBA Base: Types/ib_port_info_set_link_down_def_state * NAME * ib_port_info_set_link_down_def_state @@ -4196,13 +4227,24 @@ ib_port_info_set_link_down_def_state( * SEE ALSO *********/ -#define IB_LINK_WIDTH_ACTIVE_1X 1 -#define IB_LINK_WIDTH_ACTIVE_4X 2 -#define IB_LINK_WIDTH_ACTIVE_12X 8 - -#define IB_PATH_RECORD_RATE_2_5_GBS 2 -#define IB_PATH_RECORD_RATE_10_GBS 3 -#define IB_PATH_RECORD_RATE_30_GBS 4 +#define IB_LINK_WIDTH_ACTIVE_1X 1 +#define IB_LINK_WIDTH_ACTIVE_4X 2 +#define IB_LINK_WIDTH_ACTIVE_12X 8 +#define IB_LINK_SPEED_ACTIVE_2_5 1 +#define IB_LINK_SPEED_ACTIVE_5 2 +#define IB_LINK_SPEED_ACTIVE_10 4 + +/* following v1 ver1.2 p901 */ +#define IB_MAX_RATE 10 +#define IB_PATH_RECORD_RATE_2_5_GBS 2 +#define IB_PATH_RECORD_RATE_10_GBS 3 +#define IB_PATH_RECORD_RATE_30_GBS 4 +#define IB_PATH_RECORD_RATE_5_GBS 5 +#define IB_PATH_RECORD_RATE_20_GBS 6 +#define IB_PATH_RECORD_RATE_40_GBS 7 +#define IB_PATH_RECORD_RATE_60_GBS 8 +#define IB_PATH_RECORD_RATE_80_GBS 9 +#define IB_PATH_RECORD_RATE_120_GBS 10 
/****f* IBA Base: Types/ib_port_info_compute_rate * NAME @@ -4218,22 +4260,60 @@ static inline uint8_t ib_port_info_compute_rate( IN const ib_port_info_t* const p_pi ) { - switch(p_pi->link_width_active) - { - case IB_LINK_WIDTH_ACTIVE_1X: - return IB_PATH_RECORD_RATE_2_5_GBS; - - case IB_LINK_WIDTH_ACTIVE_4X: - return IB_PATH_RECORD_RATE_10_GBS; - - case IB_LINK_WIDTH_ACTIVE_12X: - return IB_PATH_RECORD_RATE_30_GBS; - - default: - return IB_PATH_RECORD_RATE_2_5_GBS; - } + switch (ib_port_info_get_link_speed_active(p_pi)) + { + case IB_LINK_SPEED_ACTIVE_2_5: + switch (p_pi->link_width_active) + { + case IB_LINK_WIDTH_ACTIVE_1X: + return IB_PATH_RECORD_RATE_2_5_GBS; + + case IB_LINK_WIDTH_ACTIVE_4X: + return IB_PATH_RECORD_RATE_10_GBS; + + case IB_LINK_WIDTH_ACTIVE_12X: + return IB_PATH_RECORD_RATE_30_GBS; + + default: + return IB_PATH_RECORD_RATE_2_5_GBS; + } + break; + case IB_LINK_SPEED_ACTIVE_5: + switch (p_pi->link_width_active) + { + case IB_LINK_WIDTH_ACTIVE_1X: + return IB_PATH_RECORD_RATE_5_GBS; + + case IB_LINK_WIDTH_ACTIVE_4X: + return IB_PATH_RECORD_RATE_20_GBS; + + case IB_LINK_WIDTH_ACTIVE_12X: + return IB_PATH_RECORD_RATE_60_GBS; + + default: + return IB_PATH_RECORD_RATE_5_GBS; + } + break; + case IB_LINK_SPEED_ACTIVE_10: + switch (p_pi->link_width_active) + { + case IB_LINK_WIDTH_ACTIVE_1X: + return IB_PATH_RECORD_RATE_10_GBS; + + case IB_LINK_WIDTH_ACTIVE_4X: + return IB_PATH_RECORD_RATE_40_GBS; + + case IB_LINK_WIDTH_ACTIVE_12X: + return IB_PATH_RECORD_RATE_120_GBS; + + default: + return IB_PATH_RECORD_RATE_10_GBS; + } + break; + default: + return IB_PATH_RECORD_RATE_2_5_GBS; + } } - /* * PARAMETERS * p_pi @@ -4658,36 +4738,6 @@ ib_port_info_set_lmc( * * NOTES * -* SEE ALSO -*********/ - -/****f* IBA Base: Types/ib_port_info_get_link_speed_active -* NAME -* ib_port_info_get_link_speed_active -* -* DESCRIPTION -* Returns the Link Speed Active value assigned to this port. 
-* -* SYNOPSIS -*/ -static inline uint8_t -ib_port_info_get_link_speed_active( - IN const ib_port_info_t* const p_pi ) -{ - return( (uint8_t)((p_pi->link_speed & - IB_PORT_LINK_SPEED_ACTIVE_MASK) >> - IB_PORT_LINK_SPEED_SHIFT) ); -} -/* -* PARAMETERS -* p_pi -* [in] Pointer to a PortInfo attribute. -* -* RETURN VALUES -* Returns the link speed active value assigned to this port. -* -* NOTES -* * SEE ALSO *********/ From glebn at voltaire.com Tue Aug 16 00:52:17 2005 From: glebn at voltaire.com (Gleb Natapov) Date: Tue, 16 Aug 2005 10:52:17 +0300 Subject: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support In-Reply-To: <5ebee0d10508150937da6c1ed@mail.gmail.com> References: <20050725171928.GC12206@mellanox.co.il> <20050726133553.GA22276@mellanox.co.il> <20050810083943.GM16361@minantech.com> <20050810132611.GP16361@minantech.com> <20050811080205.GR16361@minantech.com> <5ebee0d10508150937da6c1ed@mail.gmail.com> Message-ID: <20050816075217.GA6232@minantech.com> On Mon, Aug 15, 2005 at 12:37:50PM -0400, Bill Jordan wrote: > On 8/11/05, Gleb Natapov wrote: > > What about the idea that was floating around about new VM flag that will > > instruct kernel to copy pages belonging to the vma on fork instead of mark > > them as cow? > > > > I think the big problem with this idea is the huge memory regions that > InfiniBand applications are dealing with. If the application forks (or > uses system()), you are going to copy a huge chunk of data (most > likely swapping since the application memory footprint is probably > already tuned to consume the available physical memory). And the copy > is really for nothing since in most (or at least many) cases the child > is just going to exec anyway. If the child is going to exec it may call vfork or clone with CLONE_VM flag. glibc system(3) does clone (CLONE_PARENT_SETTID | SIGCHLD) why not CLONE_VM too? This single change will allow to use system() from MPI programs thus eliminating many users problem. 
If the child isn't going to exec it should face the music. -- Gleb. From mst at mellanox.co.il Tue Aug 16 02:19:51 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 16 Aug 2005 12:19:51 +0300 Subject: [openib-general] Re: [PATCH] SDP: fix oops with port reuse In-Reply-To: <1122315500.27947.25.camel@duffman> References: <1122315500.27947.25.camel@duffman> Message-ID: <20050816091951.GP1856@mellanox.co.il> This is what I committed. I'm not against making sdp_inet_port_put void; looking at libsdp, this shouldn't introduce any problems, but let's make it a separate patch. Maybe it's a good idea to init bind_next at socket creation; at port_put we could call list_del_init, and/or use list_empty to figure out whether the socket is on the bind list. Quoting r. Tom Duffy : > Subject: [PATCH] SDP: fix oops with port reuse > > This patch fixes an oops that I introduced in my conversion to use linux > lists for binds (committed in revision 2874). If two sockets tried to > use the same port, after failing to get the port (again), it would > attempt a put and the second attempt would oops the machine. Signed-off-by: Tom Duffy Signed-off-by: Michael S. 
Tsirkin Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_conn.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_conn.c (revision 3103) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_conn.c (working copy) @@ -493,16 +493,17 @@ done: int sdp_inet_port_put(struct sdp_sock *conn) { unsigned long flags; - - if (list_empty(&dev_root_s.bind_list)) - return -EADDRNOTAVAIL; + int result = -EADDRNOTAVAIL; spin_lock_irqsave(&dev_root_s.bind_lock, flags); - list_del(&conn->bind_next); - conn->src_port = 0; + if (conn->src_port) { + list_del(&conn->bind_next); + conn->src_port = 0; + result = 0; + } spin_unlock_irqrestore(&dev_root_s.bind_lock, flags); - return 0; + return result; } /* -- MST From guyg at voltaire.com Tue Aug 16 03:21:20 2005 From: guyg at voltaire.com (Guy German) Date: Tue, 16 Aug 2005 13:21:20 +0300 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation Message-ID: James Lentini wrote: >>>>> You changed the order in which the CQ upcall is enabled and the >>>>> kDAPL upcall is made. >>> kDAPL doesn't dequeue the CQ events, so this shouldn't be an issue. >> >> I think we agreed on that issue, on another thread, that I > shall leave it: >>> call kDAPL upcall >>> enable CQ upcall >> right ? > > Given that the ib_req_notify_cq() verb conforms to the IBTA spec's > Request Completion Notification verb, we will have to find another > solution. I don't think its a good idea to enable the CQ upcall before the kdapl upcall. You will have a bigger problem than a race in *kdapltest* (over hardware that doesn't yet exist :) - like a consumer getting an unexpected upcall or just CQ interrupts, while in disable mode. 
Any way, one suggestion, I can think of, to solve it, is this: if (state == DAPL_EVD_STATE_OPEN) { dapl_evd_upcall_trigger(evd); if ((evd->upcall_policy != DAT_UPCALL_TEARDOWN) && (evd->upcall_policy != DAT_UPCALL_DISABLE)) { status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); if (status) (void)dapl_evd_post_async_error_event( evd->common.owner_ia->async_error_evd, DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, evd->common.owner_ia); else dapl_evd_upcall_trigger(evd); } } Note that dapl_evd_upcall_trigger makes sure that only if there *are* completions in the cq (and policy is enabled) the consumer would get another upcall. Is this acceptable ? (It is possible to check the cq, before calling dapl_evd_upcall_trigger, for performance reasons, but applications that are interested in performance, wouldn't go down that flow, anyway...) >>> Currently the IB upcall is initially enabled, but there are checks >>> on the upcall path to determine if the EVD upcall is enabled. See >>> dapl_evd.c line 827: >>> >>> if (state == DAPL_EVD_STATE_OPEN && >>> evd->upcall_policy != DAT_UPCALL_DISABLE) { >>> >>> which you've replaced with >>> >>> if (state == DAPL_EVD_STATE_OPEN) { >> >> Yes, because I believe the correct place to check it is inside >> dapl_evd_upcall_trigger, after the lock and before the actual upcall. > > I agree that checking it in dapl_evd_upcall_trigger() is better. > > I still don't understand why you wouldn't setup the completion > notification based on the kDAPL consumer's upcall_policy in > dapl_evd_internal_create(). > Ok. I will add that check. >>>> You've only decrease the window in which that scenario could >>>> occur, not eliminated it. If a DTO completion occurred after you >>>> count the number of pending events but before you enable the CQ >>>> callback, a completion will be missed. >>>> >>>> gg: >>>> I don't think so. That is what the spin_lock_irqsave is for. >>> >>> What if there are multiple processors in the system? 
>> >> AFAIK, spin_lock_irqsave does the trick. am I wrong ? > > spin_lock_irqsave disables local interrupts. The irq part disables local interrupts and the spin_lock part protects resources from multiple processors, no? > In any event...[see question below] > >>>> Also, the pending_event_queue is only used for kDAPL generated >>>> software events. This queue can be empty when there are events on >>>> the CQ, so your would need to be expanded your check to cover that. >> >> Actually, even though, I agreed before, I tend to disagree now. >> The consumer will still get the DTO events as soon as the CQ >> upcall is triggered (enabled), so only problem is with the pending >> events list. > > Why is it an error for the consumer to modify the upcall policy when > there are pending events? > > dat_evd_modify_upcall should behave just like the IBTA spec's Request > Completion Notification verb in this respect. If there were events on > the EVD before the upcall is enabled, no upcall needs to be generated. > A correct consumer can easily work around this by enabling the upcall > and polling the EVD one final time to ensure it is empty. There can be more than one event, and the consumer would need to dequeue many times. While the consumer is doing this extra dequeuing, he might also get an upcall, because his policy is now enabled. I can't think of a design that can handle such a case, and if there is one, it is demanding and complicated from the consumer's side. After enabling the upcall policy, the consumer, in my opinion, should only expect upcalls and not do any dequeues. My suggestion is an optimization (less context switching) and a fix for the race. I don't think it's a good idea to make it more complicated than this simple solution. Indeed, it is not spec compliant, even though I sent a request to the DAT Collaborative in March to make it part of the spec: http://groups.yahoo.com/group/dat-discussions/message/3312 but did not get any reply from them. 
BTW, the race we are talking about actually happens, and I can say from a practical point of view, that this implementation (along with the one existing in ISER side) solves this problem. Guy From halr at voltaire.com Tue Aug 16 03:58:28 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 16 Aug 2005 13:58:28 +0300 Subject: [openib-general] bugzilla - adding opensm as a component Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BAA@taurus.voltaire.com> Hi Yael, We have been using the list for bug reporting so please send this to the list. We will be using Bugzilla so this category should be added as well. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Yael Kalka Sent: Tue 8/16/2005 3:20 AM To: openib-general at openib.org Subject: [openib-general] bugzilla - adding opensm as a component Hello all, Who is responsible for opening new components in the bugzilla? I wanted to open a bug for the OpenSM, but it doesn't appear as a component under bugzilla. Thanks, Yael From halr at voltaire.com Tue Aug 16 04:17:02 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Aug 2005 07:17:02 -0400 Subject: [openib-general] Re: [PATCH] osm: Fix a rate code computation for DDR ports In-Reply-To: <86oe7y1g9j.fsf@mtl066.yok.mtl.com> References: <86oe7y1g9j.fsf@mtl066.yok.mtl.com> Message-ID: <1124191021.22358.2.camel@localhost.localdomain> On Tue, 2005-08-16 at 03:40, Eitan Zahavi wrote: > DDR Ports are not supported by the current implementation of > the ib_type.h: ib_port_info_compute_rate Thanks. Applied. 
-- Hal From thomas.duffy.99 at alumni.brown.edu Tue Aug 16 07:47:21 2005 From: thomas.duffy.99 at alumni.brown.edu (Tom Duffy) Date: Tue, 16 Aug 2005 07:47:21 -0700 Subject: [openib-general] Re: [PATCH] SDP: fix oops with port reuse In-Reply-To: <20050816091951.GP1856@mellanox.co.il> References: <1122315500.27947.25.camel@duffman> <20050816091951.GP1856@mellanox.co.il> Message-ID: <228BCEE7-11BA-40D2-B563-5F708414708B@alumni.brown.edu> On Aug 16, 2005, at 2:19 AM, Michael S. Tsirkin wrote: > This is what I committed. I'm not against making sdp_inet_port_put > void, > looking at libsdp this shouldnt introduce any problems, but lets > make it > a separate patch. I agree that changing it to void should be done separately. But, I thought that we didn't want to remove the port unless it was really on the list and that just checking if the src_port was set wouldn't guarantee this. > Maybe its a good idea to init bind_next at socket creation, at > port_put > we could call list_del_init, and/or use list_empty to figure out > whether the socket is on the bind list. Do you mean we would setup the bind_next field to point to itself? This is normally reserved for list heads. -tduffy From caitlin.bestler at gmail.com Tue Aug 16 07:49:19 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Tue, 16 Aug 2005 07:49:19 -0700 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: <469958e00508160749efae3c7@mail.gmail.com> The DAT spec is fairly clear on this. The Consumer is responsible for either draining the EVD during the upcall, or for arranging for it to be drained at a later time through some out-of-scope mechanism. The latter can include setting a flag that tells a background thread to continue reaping from the EVD when it is next scheduled. 
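As a rough illustration of the second option (the upcall only sets a flag, and a background thread reaps later), here is a minimal user-space sketch using C11 atomics. Everything in it is an assumption for illustration: `struct toy_consumer`, `toy_upcall`, and `toy_worker_pass` are hypothetical names, the `queued` counter stands in for events on the EVD, and a single worker thread is assumed. Kernel code would use the kernel's own atomic and scheduling primitives instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical consumer state; not part of the DAT or kDAPL API. */
struct toy_consumer {
    atomic_bool needs_reap;  /* set by the upcall, cleared by the worker */
    atomic_int  queued;      /* stand-in for events sitting on the EVD */
};

/* Upcall: do the minimum and defer the real work to the worker. */
static void toy_upcall(struct toy_consumer *c)
{
    assert(c != NULL);
    atomic_store(&c->needs_reap, true);
}

/*
 * One pass of the background worker (single worker assumed).
 * The flag is cleared before draining, so an upcall that arrives
 * during the drain sets it again and triggers another pass.
 * Returns the number of events reaped in this pass.
 */
static int toy_worker_pass(struct toy_consumer *c)
{
    int reaped = 0;

    assert(c != NULL);
    if (!atomic_exchange(&c->needs_reap, false))
        return 0;  /* nothing was signalled */

    while (atomic_load(&c->queued) > 0) {
        atomic_fetch_sub(&c->queued, 1);
        reaped++;
    }
    return reaped;
}
```

Clearing the flag before the drain, rather than after, is the design choice that matters here: it trades an occasional empty pass for never losing a signal.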
On 8/15/05, James Lentini wrote: > > > > >>> You changed the order in which the CQ upcall is enabled and the > > >>> kDAPL upcall is made. > > > kDAPL doesn't dequeue the CQ events, so this shouldn't be an issue. > > > > I think we agreed on that issue, on another thread, that I shall leave it: > > > call kDAPL upcall > > > enable CQ upcall > > right ? > > Given that the ib_req_notify_cq() verb conforms to the IBTA spec's > Request Completion Notification verb, we will have to find another > solution. > > > > Currently the IB upcall is initially enabled, but there are checks on > > > the upcall path to determine if the EVD upcall is enabled. See > > > dapl_evd.c line 827: > > > > > > if (state == DAPL_EVD_STATE_OPEN && > > > evd->upcall_policy != DAT_UPCALL_DISABLE) { > > > > > > which you've replaced with > > > > > > if (state == DAPL_EVD_STATE_OPEN) { > > > > Yes, because I believe the correct place to check it is inside > > dapl_evd_upcall_trigger, after the lock and before the actual upcall. > > I agree that checking it in dapl_evd_upcall_trigger() is better. > > I still don't understand why you wouldn't setup the completion > notification based on the kDAPL consumer's upcall_policy in > dapl_evd_internal_create(). > > > >> You've only decrease the window in which that scenario could > > >> occur, not eliminated it. If a DTO completion occurred after you > > >> count the number of pending events but before you enable the CQ > > >> callback, a completion will be missed. > > >> > > >> gg: > > >> I don't think so. That is what the spin_lock_irqsave is for. > > > > > > What if there are multiple processors in the system? > > > > AFAIK, spin_lock_irqsave does the trick. am I wrong ? > > spin_lock_irqsave disables local interrupts. > > In any event...[see question below] > > > >> Also, the pending_event_queue is only used for kDAPL generated > > >> software events. 
This queue can be empty when there are events on the > > >> CQ, so your would need to be expanded your check to cover that. > > > > Actually, even though, I agreed before, I tend to disagree now. > > The consumer will still get the DTO events as soon as the CQ > > upcall is triggered (enabled), so only problem is with the pending > > events list. > > Why is it an error for the consumer to modify the upcall policy when > there are pending events? > > dat_evd_modify_upcall should behave just like the IBTA spec's Request > Completion Notification verb in this respect. If there were events on > the EVD before the upcall is enabled, no upcall needs to be generated. > A correct consumer can easily work around this by enabling the upcall > and polling the EVD one final time to ensure it is empty. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From jlentini at netapp.com Tue Aug 16 07:53:50 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 16 Aug 2005 10:53:50 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: On Tue, 16 Aug 2005, Guy German wrote: > James Lentini wrote: > >>>>> You changed the order in which the CQ upcall is enabled and the > >>>>> kDAPL upcall is made. > >>> kDAPL doesn't dequeue the CQ events, so this shouldn't be an issue. > >> > >> I think we agreed on that issue, on another thread, that I > > shall leave it: > >>> call kDAPL upcall > >>> enable CQ upcall > >> right ? > > > > Given that the ib_req_notify_cq() verb conforms to the IBTA spec's > > Request Completion Notification verb, we will have to find another > > solution. > > I don't think its a good idea to enable the CQ upcall before the kdapl > upcall. 
You will have a bigger problem than a race in *kdapltest* > (over hardware that doesn't yet exist :) - like a consumer getting an > unexpected upcall or just CQ interrupts, while in disable mode. > > Any way, one suggestion, I can think of, to solve it, is this: > > if (state == DAPL_EVD_STATE_OPEN) { > dapl_evd_upcall_trigger(evd); > if ((evd->upcall_policy != DAT_UPCALL_TEARDOWN) && > (evd->upcall_policy != DAT_UPCALL_DISABLE)) { > status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > if (status) > (void)dapl_evd_post_async_error_event( > evd->common.owner_ia->async_error_evd, > DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, > evd->common.owner_ia); > else > dapl_evd_upcall_trigger(evd); > } > } > > Note that dapl_evd_upcall_trigger makes sure that only if there *are* > completions in the cq (and policy is enabled) the consumer would > get another upcall. > > Is this acceptable ? Yes. > (It is possible to check the cq, before calling dapl_evd_upcall_trigger, we could use ib_peek_cq > for performance reasons, but applications that are interested > in performance, wouldn't go down that flow, anyway...) You're assuming that applications interested in performance would not use upcalls? > >>>> You've only decrease the window in which that scenario could > >>>> occur, not eliminated it. If a DTO completion occurred after you > >>>> count the number of pending events but before you enable the CQ > >>>> callback, a completion will be missed. > >>>> > >>>> gg: > >>>> I don't think so. That is what the spin_lock_irqsave is for. > >>> > >>> What if there are multiple processors in the system? > >> > >> AFAIK, spin_lock_irqsave does the trick. am I wrong ? > > > > spin_lock_irqsave disables local interrupts. > > The irq part disables local interrupts and the spin_lock part > protects resources from multiple processors, no ? Yes. I just wasn't sure if the other processors were using the spin lock. 
> >>>> Also, the pending_event_queue is only used for kDAPL generated > >>>> software events. This queue can be empty when there are events on > >>>> the CQ, so your would need to be expanded your check to cover that. > >> > >> Actually, even though, I agreed before, I tend to disagree now. > >> The consumer will still get the DTO events as soon as the CQ > >> upcall is triggered (enabled), so only problem is with the pending > >> events list. > > > > Why is it an error for the consumer to modify the upcall policy when > > there are pending events? > > > > dat_evd_modify_upcall should behave just like the IBTA spec's Request > > Completion Notification verb in this respect. If there were events on > > the EVD before the upcall is enabled, no upcall needs to be generated. > > A correct consumer can easily work around this by enabling the upcall > > and polling the EVD one final time to ensure it is empty. > > There can be more than one event, and the consumer would need to > dequeue many times. While the consumer would do his extra dequeue-ing > he might also get an upcall, because his policy is now enabled. > I can't think of a design that can handle such a case, and if there is one it > is demanding and complicated, from the consumers side. Isn't it the same position all event code written to the OpenIB API is in? I agree with you that this programming model is difficult to use, but I don't think it is impossible. > After enabling the upcall policy, the consumer, in my opinion, > should only expect upcalls and not do any dequeu's. We can change the behavior as you suggest, but we shouldn't use features specific to current Mellanox hardware. Using proprietary features will make the code hard to maintain. How about using the same technique you've proposed for the DTO callback (ie enabling the CQ upcall and then calling dapl_evd_upcall_trigger)? > My suggestion is an optimization (lowering the context switching) > and race solving. 
I don't think it's a good idea to make it more > complicated, then this simple solution. Indeed it is not spec > compliant, even though I did send a request to dat-collabrotive to > make it part of the spec, on March: > http://groups.yahoo.com/group/dat-discussions/message/3312 buy did > not get any reply from them. > > BTW, the race we are talking about actually happens, and I can say > from a practical point of view, that this implementation (along with > the one existing in ISER side) solves this problem. From guyg at voltaire.com Tue Aug 16 08:18:48 2005 From: guyg at voltaire.com (Guy German) Date: Tue, 16 Aug 2005 18:18:48 +0300 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation Message-ID: Hi, >> (It is possible to check the cq, before calling >> dapl_evd_upcall_trigger, > > we could use ib_peek_cq > >> for performance reasons, but applications that are interested >> in performance, wouldn't go down that flow, anyway...) > > You're assuming that applications interested in performance would not > use upcalls? No. I'm assuming that applications interested in performance would disable the upcall policy (as soon as they receive the first upcall) and dequeue the rest of the events from a thread. Therefore, when returning from dapl_evd_upcall_trigger - the upcall policy would be disabled and the "if" statement that follows it, would be false. >>>>>> Also, the pending_event_queue is only used for kDAPL generated >>>>>> software events. This queue can be empty when there are events on >>>>>> the CQ, so your would need to be expanded your check to cover >>>>>> that. >>>> >>>> Actually, even though, I agreed before, I tend to disagree now. >>>> The consumer will still get the DTO events as soon as the CQ >>>> upcall is triggered (enabled), so only problem is with the pending >>>> events list. >>> >>> Why is it an error for the consumer to modify the upcall policy >>> when there are pending events? 
>>> >>> dat_evd_modify_upcall should behave just like the IBTA spec's >>> Request Completion Notification verb in this respect. If there were >>> events on the EVD before the upcall is enabled, no upcall needs to >>> be generated. A correct consumer can easily work around this by >>> enabling the upcall and polling the EVD one final time to ensure it >>> is empty. >> >> There can be more than one event, and the consumer would need to >> dequeue many times. While the consumer would do his extra dequeue-ing >> he might also get an upcall, because his policy is now enabled. >> I can't think of a design that can handle such a case, and if there >> is one it is demanding and complicated, from the consumers side. > > Isn't it the same position all event code written to the > OpenIB API is > in? I don't quite know what you are referring to, but if you are referring to the case of a CQ in IB, it's totally different: you only enable the CQ once, so you will only get one upcall, and the rest of the events you will need to dequeue. > > I agree with you that this programming model is difficult to use, > but I don't think it is impossible. I think it is a bad idea to dequeue events and at the same time receive upcalls from the same queue. It is racy and performs badly. I don't see *any* reason to do it. Guy From eitan at mellanox.co.il Tue Aug 16 08:19:23 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 16 Aug 2005 18:19:23 +0300 Subject: [openib-general] [PATCH] osm: avoid override of user given includedir for ib_types.h Message-ID: <86mznh29l0.fsf@mtl066.yok.mtl.com> Hi Hal The Makefile.am that installs include/iba/ib_types.h overrode the includedir variable, which clobbered any user-supplied --includedir passed to configure. The fix uses pkginclude to add "infiniband" as the package subdirectory. 
I tested the patch on : 2.6.12.3-smp SuSE Linux 9.3 (i586) Signed-off-by: Eitan Zahavi Index: include/configure.in =================================================================== --- include/configure.in (revision 3107) +++ include/configure.in (working copy) @@ -6,7 +6,7 @@ AC_CONFIG_SRCDIR() AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE() -AM_INIT_AUTOMAKE(libinc, 0.9.0) +AM_INIT_AUTOMAKE(infiniband, 0.9.0) AM_PROG_LIBTOOL dnl Checks for programs Index: include/Makefile.am =================================================================== --- include/Makefile.am (revision 3107) +++ include/Makefile.am (working copy) @@ -1,9 +1,6 @@ -# HACK: as we do not use the standard "prefix/include" subdir -includedir = ${prefix}/include/infiniband - SUBDIRS = . -nobase_include_HEADERS = iba/ib_types.h +nobase_pkginclude_HEADERS = iba/ib_types.h EXTRA_DIST = $(srcdir)/iba/ib_types.h From guyg at voltaire.com Tue Aug 16 08:24:41 2005 From: guyg at voltaire.com (Guy German) Date: Tue, 16 Aug 2005 18:24:41 +0300 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation Message-ID: Hi Caitlin, Do you think that it is a good idea for the consumer to continue reaping events from the EVD, while his upcall policy is enabled ? Thanks, Guy Caitlin Bestler wrote: > The DAT spec is fairly clear on this. The Consumer is responsible for > either draining the EVD during the upcall, or for arranging > for it to be > drained at a later time through some out-of-scope mechanism. > > The latter can include setting a flag that tells a background thread > to continue reaping from the EVD when it is next scheduled. > > On 8/15/05, James Lentini wrote: >> >> >>>>>> You changed the order in which the CQ upcall is enabled and the >>>>>> kDAPL upcall is made. >>>> kDAPL doesn't dequeue the CQ events, so this shouldn't be an issue. 
>>> >>> I think we agreed on that issue, on another thread, that > I shall leave it: >>>> call kDAPL upcall >>>> enable CQ upcall >>> right ? >> >> Given that the ib_req_notify_cq() verb conforms to the IBTA spec's >> Request Completion Notification verb, we will have to find another >> solution. >> >>>> Currently the IB upcall is initially enabled, but there are checks >>>> on the upcall path to determine if the EVD upcall is enabled. See >>>> dapl_evd.c line 827: >>>> >>>> if (state == DAPL_EVD_STATE_OPEN && >>>> evd->upcall_policy != DAT_UPCALL_DISABLE) { >>>> >>>> which you've replaced with >>>> >>>> if (state == DAPL_EVD_STATE_OPEN) { >>> >>> Yes, because I believe the correct place to check it is inside >>> dapl_evd_upcall_trigger, after the lock and before the actual >>> upcall. >> >> I agree that checking it in dapl_evd_upcall_trigger() is better. >> >> I still don't understand why you wouldn't setup the completion >> notification based on the kDAPL consumer's upcall_policy in >> dapl_evd_internal_create(). >> >>>>> You've only decrease the window in which that scenario could >>>>> occur, not eliminated it. If a DTO completion occurred after you >>>>> count the number of pending events but before you enable the CQ >>>>> callback, a completion will be missed. >>>>> >>>>> gg: >>>>> I don't think so. That is what the spin_lock_irqsave is for. >>>> >>>> What if there are multiple processors in the system? >>> >>> AFAIK, spin_lock_irqsave does the trick. am I wrong ? >> >> spin_lock_irqsave disables local interrupts. >> >> In any event...[see question below] >> >>>>> Also, the pending_event_queue is only used for kDAPL generated >>>>> software events. This queue can be empty when there are events on >>>>> the CQ, so your would need to be expanded your check to cover >>>>> that. >>> >>> Actually, even though, I agreed before, I tend to disagree now. 
>>> The consumer will still get the DTO events as soon as the CQ >>> upcall is triggered (enabled), so only problem is with the pending >>> events list. >> >> Why is it an error for the consumer to modify the upcall policy when >> there are pending events? >> >> dat_evd_modify_upcall should behave just like the IBTA spec's Request >> Completion Notification verb in this respect. If there were events on >> the EVD before the upcall is enabled, no upcall needs to be >> generated. A correct consumer can easily work around this by >> enabling the upcall and polling the EVD one final time to ensure it >> is empty. _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Tue Aug 16 08:30:23 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 16 Aug 2005 18:30:23 +0300 Subject: [openib-general] Re: [PATCH] SDP: fix oops with port reuse In-Reply-To: <228BCEE7-11BA-40D2-B563-5F708414708B@alumni.brown.edu> References: <1122315500.27947.25.camel@duffman> <20050816091951.GP1856@mellanox.co.il> <228BCEE7-11BA-40D2-B563-5F708414708B@alumni.brown.edu> Message-ID: <20050816153023.GW1856@mellanox.co.il> Quoting r. Tom Duffy : > Subject: Re: [PATCH] SDP: fix oops with port reuse > > > On Aug 16, 2005, at 2:19 AM, Michael S. Tsirkin wrote: > > > This is what I committed. I'm not against making sdp_inet_port_put void, > looking at libsdp this shouldnt introduce any problems, but lets make it > a separate patch. > > I agree that changing it to void should be done separately. But, I thought > that we didn't want to remove the port unless it was really on the list Right. > and that just checking if the src_port was set wouldn't guarantee this. Why not? You cant really bind to port 0, can you? 
> Maybe its a good idea to init bind_next at socket creation, at port_put > we could call list_del_init, and/or use list_empty to figure out > whether the socket is on the bind list. > > > Do you mean we would setup the bind_next field to point to itself? Yes. > This is normally reserved for list heads. What's list_del_init for then? -- MST From mst at mellanox.co.il Tue Aug 16 08:44:08 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 16 Aug 2005 18:44:08 +0300 Subject: [openib-general] sdp_msgs.h Message-ID: <20050816154408.GZ1856@mellanox.co.il> Tom, do you think we should change struct names in sdp_msgs.h to start with sdp_? -- MST From jlentini at netapp.com Tue Aug 16 09:05:01 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 16 Aug 2005 12:05:01 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: On Tue, 16 Aug 2005, Guy German wrote: > >>>>>> Also, the pending_event_queue is only used for kDAPL generated > >>>>>> software events. This queue can be empty when there are events on > >>>>>> the CQ, so your would need to be expanded your check to cover > >>>>>> that. > >>>> > >>>> Actually, even though, I agreed before, I tend to disagree now. > >>>> The consumer will still get the DTO events as soon as the CQ > >>>> upcall is triggered (enabled), so only problem is with the pending > >>>> events list. > >>> > >>> Why is it an error for the consumer to modify the upcall policy > >>> when there are pending events? > >>> > >>> dat_evd_modify_upcall should behave just like the IBTA spec's > >>> Request Completion Notification verb in this respect. If there were > >>> events on the EVD before the upcall is enabled, no upcall needs to > >>> be generated. A correct consumer can easily work around this by > >>> enabling the upcall and polling the EVD one final time to ensure it > >>> is empty. 
> >> > >> There can be more than one event, and the consumer would need to > >> dequeue many times. While the consumer would do his extra dequeue-ing > >> he might also get an upcall, because his policy is now enabled. > >> I can't think of a design that can handle such a case, and if there > >> is one it is demanding and complicated, from the consumers side. > > > > Isn't it the same position all event code written to the > > OpenIB API is > > in? > > I don't quite know what you are reffering to, but if you are reffering > to the case of cq in IB - It's totally different: you only enable the cq > once, so you will only get one upcall, and the rest of the events > you will need to dequeue. The consumer should only receive one upcall at a time if the upcall policy is DAT_UPCALL_SINGLE_INSTANCE. If the dequeues are performed in an upcall, the logic needed in an OpenIB consumer and kDAPL consumer is essentially the same. The difference is that the OpenIB consumer needs to re-enable the CQ upcall and poll to make sure no events were missed. > > I agree with you that this programming model is difficult to use, > > but I don't think it is impossible. > > I think it is a bad idea to dequeue events and at the same time > receive upcalls from the same queue. It is racy, and has bad performance. > I don't see *any* reason to do it. The current kDAPL implementation does create a situation in which an upcall and poll occur simultaneously if the upcall is disabled, the consumer enables the upcall, and then the consumer does a poll. In this scenario an upcall can occur while the consumer is polling. I was pointing out that this same race exists in the OpenIB verbs API (and the IBTA verbs). Again, I agree that we can eliminate the additional poll after enabling the upcall in kDAPL. We just need to do it in a way that is not hardware specific. I believe we can use the same technique we did in the DTO upcall. 
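The enable-then-poll-once-more pattern discussed in this thread can be sketched in isolation. What follows is a minimal single-threaded user-space model, not kDAPL or OpenIB code: `struct model_evd`, `model_post`, `model_poll`, `model_arm` and `drain_and_rearm` are all hypothetical names, and the one-shot `armed` flag only stands in for the behaviour of a Request Completion Notification style verb.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_EVD_DEPTH 16

/* Hypothetical model of an event queue with a one-shot notification. */
struct model_evd {
    int    events[MODEL_EVD_DEPTH]; /* pending events, used as a ring */
    size_t head;                    /* next slot to dequeue from */
    size_t tail;                    /* next slot to enqueue into */
    bool   armed;                   /* true: next post delivers one upcall */
    int    upcalls;                 /* count of upcalls delivered so far */
};

/* Producer side: queue an event; fire at most one upcall per arming. */
static bool model_post(struct model_evd *evd, int ev)
{
    if (evd->tail - evd->head == MODEL_EVD_DEPTH)
        return false;               /* queue full */
    evd->events[evd->tail++ % MODEL_EVD_DEPTH] = ev;
    if (evd->armed) {
        evd->armed = false;         /* one-shot: disarm before the upcall */
        evd->upcalls++;             /* a real provider would call the handler */
    }
    return true;
}

/* Consumer side: dequeue one event if any is pending. */
static bool model_poll(struct model_evd *evd, int *ev)
{
    if (evd->head == evd->tail)
        return false;
    *ev = evd->events[evd->head++ % MODEL_EVD_DEPTH];
    return true;
}

static void model_arm(struct model_evd *evd)
{
    evd->armed = true;
}

/*
 * Drain the queue, arm the notification, then poll one final time.
 * Events queued before arming produce no upcall, so the final poll is
 * what guarantees they are not left behind. Returns the events consumed.
 */
static int drain_and_rearm(struct model_evd *evd)
{
    int ev;
    int consumed = 0;

    for (;;) {
        while (model_poll(evd, &ev))
            consumed++;
        model_arm(evd);
        if (!model_poll(evd, &ev))
            break;                  /* nothing slipped in before arming */
        consumed++;                 /* something did: drain and re-arm again */
    }
    return consumed;
}
```

The model leaves out locking and concurrency entirely, so it shows only the ordering argument (arm first, then poll once more), not how a real consumer would serialize the poll against a concurrent upcall.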
james From jlentini at netapp.com Tue Aug 16 09:07:03 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 16 Aug 2005 12:07:03 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: Guy, Could you send your updated version of this patch? With an updated copy and the small mods we've been discussing, I think this is ready. james From thomas.duffy.99 at alumni.brown.edu Tue Aug 16 09:38:10 2005 From: thomas.duffy.99 at alumni.brown.edu (Tom Duffy) Date: Tue, 16 Aug 2005 09:38:10 -0700 Subject: [openib-general] Re: sdp_msgs.h In-Reply-To: <20050816154408.GZ1856@mellanox.co.il> References: <20050816154408.GZ1856@mellanox.co.il> Message-ID: On Aug 16, 2005, at 8:44 AM, Michael S. Tsirkin wrote: > Tom, do you think we should change struct names in > sdp_msgs.h to start with sdp_? Are you worried about name collisions? I don't really care one way or the other. Go ahead if you want. -tduffy From mst at mellanox.co.il Tue Aug 16 09:45:39 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 16 Aug 2005 19:45:39 +0300 Subject: [openib-general] Re: sdp_msgs.h In-Reply-To: References: <20050816154408.GZ1856@mellanox.co.il> Message-ID: <20050816164539.GA25256@mellanox.co.il> Quoting r. Tom Duffy : > Subject: Re: sdp_msgs.h > > > On Aug 16, 2005, at 8:44 AM, Michael S. Tsirkin wrote: > > >Tom, do you think we should change struct names in > >sdp_msgs.h to start with sdp_? > > Are you worried about name collisions? Yes. > I don't really care one way > or the other. Go ahead if you want. > > -tduffy I will at some point. 
-- MST From thomas.duffy.99 at alumni.brown.edu Tue Aug 16 09:48:01 2005 From: thomas.duffy.99 at alumni.brown.edu (Tom Duffy) Date: Tue, 16 Aug 2005 09:48:01 -0700 Subject: [openib-general] Re: [PATCH] SDP: fix oops with port reuse In-Reply-To: <20050816153023.GW1856@mellanox.co.il> References: <1122315500.27947.25.camel@duffman> <20050816091951.GP1856@mellanox.co.il> <228BCEE7-11BA-40D2-B563-5F708414708B@alumni.brown.edu> <20050816153023.GW1856@mellanox.co.il> Message-ID: <61E1737E-B626-4C88-90DF-3364135ED496@alumni.brown.edu> On Aug 16, 2005, at 8:30 AM, Michael S. Tsirkin wrote: > Why not? You cant really bind to port 0, can you? I don't think so. It was Libor who objected to this, so he must have had a reason. > >> Maybe its a good idea to init bind_next at socket creation, at >> port_put >> we could call list_del_init, and/or use list_empty to figure out >> whether the socket is on the bind list. >> >> >> Do you mean we would setup the bind_next field to point to itself? >> > > Yes. Other option is to initialize and check for LIST_POISON[12]. > >> This is normally reserved for list heads. >> > > What's list_del_init for then? Resetting a list? -tduffy From mst at mellanox.co.il Tue Aug 16 10:12:01 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 16 Aug 2005 20:12:01 +0300 Subject: [openib-general] Re: [PATCH] SDP: fix oops with port reuse In-Reply-To: <61E1737E-B626-4C88-90DF-3364135ED496@alumni.brown.edu> References: <1122315500.27947.25.camel@duffman> <20050816091951.GP1856@mellanox.co.il> <228BCEE7-11BA-40D2-B563-5F708414708B@alumni.brown.edu> <20050816153023.GW1856@mellanox.co.il> <61E1737E-B626-4C88-90DF-3364135ED496@alumni.brown.edu> Message-ID: <20050816171201.GA25373@mellanox.co.il> Quoting r. Tom Duffy : > Subject: Re: [PATCH] SDP: fix oops with port reuse > > > On Aug 16, 2005, at 8:30 AM, Michael S. Tsirkin wrote: > >Why not? You cant really bind to port 0, can you? > > I don't think so. 
It was Libor who objected to this, so he must have > had a reason. Its a mystery then. IMO, even if we agree on using list_del_init (as discussed below) we should add a BUG_ON(!conn->src_port) as a sanity check. > > > >> Maybe its a good idea to init bind_next at socket creation, at port_put > >> we could call list_del_init, and/or use list_empty to figure out > >> whether the socket is on the bind list. > >> > >> > >> Do you mean we would setup the bind_next field to point to itself? > >> > > > >Yes. > > Other option is to initialize and check for LIST_POISON[12]. Ugh. > > > >>This is normally reserved for list heads. > >> > > > >What's list_del_init for then? > > Resetting a list? Doesnt INIT_LIST_HEAD do this? list_del_init seems to be an exact match for our case: its a function that is safe to call twice on the same entry, and it lets you check that entry is on some list by !list_empty. Example: kernel/posix-timers.c -- MST From thomas.duffy.99 at alumni.brown.edu Tue Aug 16 10:32:46 2005 From: thomas.duffy.99 at alumni.brown.edu (Tom Duffy) Date: Tue, 16 Aug 2005 10:32:46 -0700 Subject: [openib-general] Re: [PATCH] SDP: fix oops with port reuse In-Reply-To: <20050816171201.GA25373@mellanox.co.il> References: <1122315500.27947.25.camel@duffman> <20050816091951.GP1856@mellanox.co.il> <228BCEE7-11BA-40D2-B563-5F708414708B@alumni.brown.edu> <20050816153023.GW1856@mellanox.co.il> <61E1737E-B626-4C88-90DF-3364135ED496@alumni.brown.edu> <20050816171201.GA25373@mellanox.co.il> Message-ID: On Aug 16, 2005, at 10:12 AM, Michael S. Tsirkin wrote: > Its a mystery then. IMO, even if we agree on using list_del_init > (as discussed below) we should add a BUG_ON(!conn->src_port) as > a sanity check. Yeah, let's do that. > Doesnt INIT_LIST_HEAD do this? > > list_del_init seems to be an exact match for our case: its > a function that is safe to call twice on the same entry, > and it lets you check that entry is on some list by !list_empty. 
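The list_del_init / list_empty idea above can be illustrated outside the kernel. This is a sketch under stated assumptions: `struct list_node` and the `node_*` helpers are a user-space stand-in that mimics the self-pointing convention of the kernel's `list_head`, and `struct model_conn` with `model_port_put` is a hypothetical simplification, not the actual SDP connection structure or `sdp_inet_port_put`.

```c
#include <stdbool.h>

/* User-space stand-in for the kernel's struct list_head. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* Point the node at itself, as INIT_LIST_HEAD does for a list head. */
static void node_init(struct list_node *n)
{
    n->next = n;
    n->prev = n;
}

/* A self-pointing node is on no list; same test as list_empty(). */
static bool node_unlinked(const struct list_node *n)
{
    return n->next == n;
}

/* Insert n right after head. */
static void node_add(struct list_node *n, struct list_node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink n and re-initialize it, so a second call is harmless. */
static void node_del_init(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);
}

/* Hypothetical connection with a bind-list linkage and a bound port. */
struct model_conn {
    struct list_node bind_next;
    unsigned short   src_port;
};

/* Release the port only if the connection really is on the bind list. */
static bool model_port_put(struct model_conn *conn)
{
    if (node_unlinked(&conn->bind_next))
        return false;               /* never bound, or already released */
    node_del_init(&conn->bind_next);
    conn->src_port = 0;
    return true;
}
```

Because the entry is re-initialized on removal, membership is decided by the linkage itself and does not depend on whether `src_port` happens to be non-zero; the model does not include the BUG_ON sanity check suggested above.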
> > Example: kernel/posix-timers.c Sounds good. Are you going to code up a patch? -tduffy From mjleven at sandia.gov Tue Aug 16 11:17:14 2005 From: mjleven at sandia.gov (Michael Levenhagen) Date: Tue, 16 Aug 2005 12:17:14 -0600 Subject: [openib-general] acronyms Message-ID: <43022DAA.7030406@sandia.gov> I'm trying to get up to speed on OpenIB. Is there any documentation that explains all the acronyms used in OpenIB? thanks Mike From jlentini at netapp.com Tue Aug 16 11:25:08 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 16 Aug 2005 14:25:08 -0400 (EDT) Subject: [openib-general] acronyms In-Reply-To: <43022DAA.7030406@sandia.gov> References: <43022DAA.7030406@sandia.gov> Message-ID: On Tue, 16 Aug 2005, Michael Levenhagen wrote: > I'm trying to get up to speed on OpenIB. Is there any documentation that > explains all the acronyms used in OpenIB? > > thanks > Mike There is a lot of information in the OpenIB Wiki: https://openib.org/tiki/tiki-index.php From mshefty at ichips.intel.com Tue Aug 16 11:29:11 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 16 Aug 2005 11:29:11 -0700 Subject: [openib-general] acronyms In-Reply-To: <43022DAA.7030406@sandia.gov> References: <43022DAA.7030406@sandia.gov> Message-ID: <43023077.30909@ichips.intel.com> Michael Levenhagen wrote: > I'm trying to get up to speed on OpenIB. Is there any documentation that > explains all the acronyms used in OpenIB? You may want to see Chapter 2 of the Infiniband Spec. - Sean From mjleven at sandia.gov Tue Aug 16 11:32:38 2005 From: mjleven at sandia.gov (Michael Levenhagen) Date: Tue, 16 Aug 2005 12:32:38 -0600 Subject: [openib-general] acronyms In-Reply-To: <43023077.30909@ichips.intel.com> References: <43022DAA.7030406@sandia.gov> <43023077.30909@ichips.intel.com> Message-ID: <43023146.1060504@sandia.gov> That's what I needed. thanks Mike Sean Hefty wrote: > Michael Levenhagen wrote: > >> I'm trying to get up to speed on OpenIB. 
Is there any documentation >> that explains all the acronyms used in OpenIB? > > > You may want to see Chapter 2 of the Infiniband Spec. > > - Sean > From robert.j.woodruff at intel.com Tue Aug 16 14:46:15 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 16 Aug 2005 14:46:15 -0700 Subject: [openib-general] RE: [ANNOUNCE] 2.6.9 backport patches In-Reply-To: <20050814164056.GU23848@mellanox.co.il> Message-ID: Michael wrote, >Hi! >Backport patches to trunk that enable support for RHEL4.0 (2.6.9) and >SuSE9.3 (2.6.11), can now be found under >https://openib.org/svn/gen2/branches/backport/2.6.9 >and >https://openib.org/svn/gen2/branches/backport/2.6.11 >These patches do not touch the kernel source outside the infiniband directory, >and you dont need to reboot after you apply them. Cool!! I tried these out on x86_64 and Itanium (so far) and from the initial tests I have done so far, I have seen no problems. It is great that you provided some patches that do not require any kernel mods, which allows one to build .ko's that I can load on the stock redhat EL4.0 kernel. I did notice a couple of things. 1.) There is no backport patch for SRP, which I had worked around by exporting scsi_scan_target in the kernel. Not sure how best to fix it with a patch that does not require any kernel changes. Perhaps Roland could recommend something. 2.) You might want to include a patch for the kernel/drivers Kconfig and Makefile, see below. This is the only thing I had to touch outside the infiniband directory to build the code. 
woody diff -Naurp linux-2.6.9/drivers/Kconfig linux-2.6.9.ib/drivers/Kconfig --- linux-2.6.9/drivers/Kconfig 2004-10-18 14:55:24.000000000 -0700 +++ linux-2.6.9.ib/drivers/Kconfig 2005-08-16 13:46:58.000000000 -0700 @@ -52,6 +52,8 @@ source "drivers/video/Kconfig" source "sound/Kconfig" +source "drivers/infiniband/Kconfig" + source "drivers/usb/Kconfig" endmenu diff -Naurp linux-2.6.9/drivers/Makefile linux-2.6.9.ib/drivers/Makefile --- linux-2.6.9/drivers/Makefile 2004-10-18 14:55:43.000000000 -0700 +++ linux-2.6.9.ib/drivers/Makefile 2005-08-16 13:47:33.000000000 -0700 @@ -58,4 +58,5 @@ obj-$(CONFIG_MCA) += mca/ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ +obj-$(CONFIG_INFINIBAND) += infiniband/ obj-y += firmware/ From sean.hefty at intel.com Tue Aug 16 18:14:02 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 16 Aug 2005 18:14:02 -0700 Subject: [openib-general] AT interface Message-ID: Sorry if I'm repeating some old AT discussions, but I'm trying to decipher the AT API. As an example, I call: int ib_at_route_by_ip(uint32_t dst_ip, uint32_t src_ip, int tos, uint16_t flags, struct ib_at_ib_route *ib_route, struct ib_at_completion *async_comp, uint64_t *req_id); struct ib_at_completion contains a pointer to a callback function, as show: struct ib_at_completion { void (*fn)(uint64_t req_id, void *context, int rec_num); void *context; uint64_t req_id; }; I then call: int ib_at_callback_get(void); Does this result in the callback function specified by the ib_at_completion structure being invoked with any returned routes copied to the ib_route parameter that was passed into ib_at_route_by_ip? Is there always only one route? Will AT invoke the user's callback automatically, or only as a result of calling get (and done in the user's thread)? Assuming that my understanding is correct, can we eliminate the callback and change the API to something that's easier to use and understand? 
It's a chore just trying to decipher the struct ib_at_completion comments. I'm thinking of something more like this: int ib_at_route_by_ip(uint32_t dst_ip, uint32_t src_ip, int tos, uint16_t flags, uint64_t *req_id); int ib_at_response_get(struct ib_at_response *); int ib_at_response_put(struct ib_at_response *); (Or even replacing uint32_t IP addresses with struct addrinfo or struct sockaddr, but that's a separate issue.) Would you accept such changes? - Sean From yaronh at voltaire.com Tue Aug 16 22:32:37 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Wed, 17 Aug 2005 08:32:37 +0300 Subject: [openib-general] FW: [Ips] iSER over IB - Consensus call Message-ID: <35EA21F54A45CB47B879F21A91F4862F713A3E@taurus.voltaire.com> FYI, For the ones that don't track the IETF iSCSI WG In the last IETF meeting in Paris iSER (iSCSI RDMA) over InfiniBand was discussed again, and as you can see below IETF gave its green light to do the few semantic changes in the iSER RFC and generalize it to IB Can also note that iSER over IB/iWarp RFC is in the Last Call status It is interesting to see the convergence with OpenIB adding iWarp drivers, and IETF adding IB to the iSER RFC, resulting in a common set of Drivers, ULPs, and remote boot support. Yaron -----Original Message----- From: ips-bounces at ietf.org [mailto:ips-bounces at ietf.org] On Behalf Of Black_David at emc.com Sent: Tuesday, August 09, 2005 11:02 PM To: ips at ietf.org Subject: [Ips] iSER over IB - Consensus call The IPS WG Paris meeting discussed: iSER over InfiniBand (draft-hufferd-iser-ib-00.txt) Proposal for text edits to iSER to permit use on other transports, including InfiniBand. Also will help enable iSER to be defined over SCTP. This draft is (or at least is intended to be) entirely editorial - it does not (or at least is not intended to) make any technical changes to the iSER draft that has passed WG Last Call. 
The draft Paris minutes record the following: Sense of room: Want to proceed towards applying these changes (after careful review and WG rough consensus) to the approved iSER draft so that there is one draft that is broadly applicable rather than the current iSER draft plus a draft that modifies that draft to broaden it. Anyone who objects to this sense of the room in Paris should post to the list with reasons for the objection, otherwise the sense of the room to proceed in this direction will become the rough consensus of the IPS WG. If the WG does proceed in this direction, the next step will be a WG Last Call on draft-hufferd-iser-ib-00.txt, with all changes/comments/etc. to be posted to the list, even editorial ones. After conclusion of that WG Last Call, the resulting edits can be applied to produce a new version of the iSER draft. We'll try to get this done by the end of August, but it may take a bit longer. Thanks, --David ---------------------------------------------------- David L. Black, Senior Technologist EMC Corporation, 176 South St., Hopkinton, MA 01748 +1 (508) 293-7953 FAX: +1 (508) 293-7786 black_david at emc.com Mobile: +1 (978) 394-7754 ---------------------------------------------------- _______________________________________________ Ips mailing list Ips at ietf.org https://www1.ietf.org/mailman/listinfo/ips From mst at mellanox.co.il Tue Aug 16 23:49:56 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 17 Aug 2005 09:49:56 +0300 Subject: [openib-general] Re: [PATCH] SDP: fix oops with port reuse In-Reply-To: References: <1122315500.27947.25.camel@duffman> <20050816091951.GP1856@mellanox.co.il> <228BCEE7-11BA-40D2-B563-5F708414708B@alumni.brown.edu> <20050816153023.GW1856@mellanox.co.il> <61E1737E-B626-4C88-90DF-3364135ED496@alumni.brown.edu> <20050816171201.GA25373@mellanox.co.il> Message-ID: <20050817064956.GB1856@mellanox.co.il> Quoting r. Tom Duffy : > Sounds good. Are you going to code up a patch? 
I put this on my todo list. -- MST From yael at mellanox.co.il Wed Aug 17 04:15:49 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Wed, 17 Aug 2005 14:15:49 +0300 Subject: [openib-general] osm: Missing osm_vendor_unbind function in osm_vendor_ibumad Message-ID: <506C3D7B14CDD411A52C00025558DED60882CB9C@mtlex01.yok.mtl.com> Hello Hal, I am currently working on merging gen2 with our OpenSM 1.8.0 version. I noticed that in the osm_vendor_ibumad files there is no implementation for the osm_vendor_unbind function (which is currently used by the opensm). Please add interface (if unnecessary, then an empty one) to this function. Thanks, Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Tuesday, August 16, 2005 2:17 PM To: Eitan Zahavi Cc: OPENIB GENERAL Subject: [openib-general] Re: [PATCH] osm: Fix a rate code computation for DDR ports On Tue, 2005-08-16 at 03:40, Eitan Zahavi wrote: > DDR Ports are not supported by the current implementation of > the ib_type.h: ib_port_info_compute_rate Thanks. Applied. -- Hal _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From yuw at cse.ohio-state.edu Wed Aug 17 05:24:03 2005 From: yuw at cse.ohio-state.edu (Weikuan Yu) Date: Wed, 17 Aug 2005 08:24:03 -0400 Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32 In-Reply-To: <52d5ofi61v.fsf@cisco.com> References: <200508111932.j7BJWROj026274@xi.cse.ohio-state.edu> <482FEE13-0AB4-11DA-80CD-000D932C3754@cse.ohio-state.edu> <52k6isnm21.fsf@cisco.com> <52d5ofi61v.fsf@cisco.com> Message-ID: Hi, Thanks for the patch. This works for our adaptors too. 
Out of my curiosity, it seems like this patch takes firmware parameters and feed it back as it is when INIT_IB, instead of assuming port width being (1x | 4x). So wonder now the logic, in this regard, is in place for future 8x (?) or 12x adaptors with the other two bits, right? Thanks, --Weikuan On Aug 15, 2005, at 11:16 AM, Roland Dreier wrote: > Thanks for the debugging info. Can you apply the patch below and > confirm that it works with your PCI-X adapters? If this works for you > then I will check it into svn and merge it for kernel 2.6.14. > > Thanks, > Roland > > Index: infiniband/hw/mthca/mthca_dev.h > =================================================================== > --- infiniband/hw/mthca/mthca_dev.h (revision 3056) > +++ infiniband/hw/mthca/mthca_dev.h (working copy) > @@ -148,6 +148,7 @@ struct mthca_limits { > int reserved_mcgs; > int num_pds; > int reserved_pds; > + u8 port_width_cap; > }; > > struct mthca_alloc { > Index: infiniband/hw/mthca/mthca_main.c > =================================================================== > --- infiniband/hw/mthca/mthca_main.c (revision 3056) > +++ infiniband/hw/mthca/mthca_main.c (working copy) > @@ -171,6 +171,7 @@ static int __devinit mthca_dev_lim(struc > mdev->limits.reserved_mrws = dev_lim->reserved_mrws; > mdev->limits.reserved_uars = dev_lim->reserved_uars; > mdev->limits.reserved_pds = dev_lim->reserved_pds; > + mdev->limits.port_width_cap = dev_lim->max_port_width; > > /* IB_DEVICE_RESIZE_MAX_WR not supported by driver. > May be doable since hardware supports it for SRQ. 
> Index: infiniband/hw/mthca/mthca_cmd.c > =================================================================== > --- infiniband/hw/mthca/mthca_cmd.c (revision 3056) > +++ infiniband/hw/mthca/mthca_cmd.c (working copy) > @@ -1285,10 +1285,8 @@ int mthca_INIT_IB(struct mthca_dev *dev, > #define INIT_IB_FLAG_SIG (1 << 18) > #define INIT_IB_FLAG_NG (1 << 17) > #define INIT_IB_FLAG_G0 (1 << 16) > -#define INIT_IB_FLAG_1X (1 << 8) > -#define INIT_IB_FLAG_4X (1 << 9) > -#define INIT_IB_FLAG_12X (1 << 11) > #define INIT_IB_VL_SHIFT 4 > +#define INIT_IB_PORT_WIDTH_SHIFT 8 > #define INIT_IB_MTU_SHIFT 12 > #define INIT_IB_MAX_GID_OFFSET 0x06 > #define INIT_IB_MAX_PKEY_OFFSET 0x0a > @@ -1304,12 +1302,11 @@ int mthca_INIT_IB(struct mthca_dev *dev, > memset(inbox, 0, INIT_IB_IN_SIZE); > > flags = 0; > - flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; > - flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; > flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; > flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; > flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0; > flags |= param->vl_cap << INIT_IB_VL_SHIFT; > + flags |= param->port_width << INIT_IB_PORT_WIDTH_SHIFT; > flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; > MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); > > Index: infiniband/hw/mthca/mthca_cmd.h > =================================================================== > --- infiniband/hw/mthca/mthca_cmd.h (revision 3056) > +++ infiniband/hw/mthca/mthca_cmd.h (working copy) > @@ -220,8 +220,7 @@ struct mthca_init_hca_param { > }; > > struct mthca_init_ib_param { > - int enable_1x; > - int enable_4x; > + int port_width; > int vl_cap; > int mtu_cap; > u16 gid_cap; > Index: infiniband/hw/mthca/mthca_qp.c > =================================================================== > --- infiniband/hw/mthca/mthca_qp.c (revision 3056) > +++ infiniband/hw/mthca/mthca_qp.c (working copy) > @@ -502,12 +502,11 @@ static void init_port(struct mthca_dev * > > memset(&param, 0, sizeof param); > > - param.enable_1x = 1; > - param.enable_4x = 1; > - param.vl_cap = dev->limits.vl_cap; > - param.mtu_cap = dev->limits.mtu_cap; > - param.gid_cap = dev->limits.gid_table_len; > - param.pkey_cap = dev->limits.pkey_table_len; > + param.port_width = dev->limits.port_width_cap; > + param.vl_cap = dev->limits.vl_cap; > + param.mtu_cap = dev->limits.mtu_cap; > + param.gid_cap = dev->limits.gid_table_len; > + param.pkey_cap = dev->limits.pkey_table_len; > > err = mthca_INIT_IB(dev, &param, port, &status); > if (err) > From halr at voltaire.com Wed Aug 17 05:57:12 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 17 Aug 2005 15:57:12 +0300 Subject: [openib-general] RE: [PATCH] osm: avoid override of user given includedir for ib_types.h Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BB7@taurus.voltaire.com> Thanks. Applied.
-- Hal From halr at voltaire.com Wed Aug 17 06:31:27 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 17 Aug 2005 16:31:27 +0300 Subject: [openib-general] RE: Missing osm_vendor_unbind function in osm_vendor_ibumad Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BBC@taurus.voltaire.com> Hi Yael, I presume the need for osm_vendor_unbind was added at 1.8.0 in OpenSM (I see it added into osm_sa/sm_mad_ctrl.c). I added a null function as a temporary hack for this. Will this suffice for the time being ? I will fill it in (it definitely will not be a null function) but I'm on vacation right now. How soon do you need this ? I may supply a patch for you to try for tomorrow if this can't wait. -- Hal ________________________________ From: Yael Kalka [mailto:yael at mellanox.co.il] Sent: Wed 8/17/2005 7:15 AM To: Hal Rosenstock Cc: OPENIB GENERAL Subject: osm: Missing osm_vendor_unbind function in osm_vendor_ibumad Hello Hal, I am currently working on merging gen2 with our OpenSM 1.8.0 version. I noticed that in the osm_vendor_ibumad files there is no implementation for the osm_vendor_unbind function (which is currently used by the opensm). Please add interface (if unnecessary, then an empty one) to this function. Thanks, Yael From tom at ammasso.com Wed Aug 17 07:02:08 2005 From: tom at ammasso.com (Tom Tucker) Date: Wed, 17 Aug 2005 10:02:08 -0400 Subject: [openib-general] [PATCH][iWARP] iWARP Provider Functions Added Message-ID: <1124287328.3519.18.camel@trinity.ammasso.com> This patch replaces the previous netdev driver portion with a much cleaner implementation and adds provider wrapper functions. At this point the driver loads and registers itself with the core. From this point forward it should start getting more interesting. You can either check this out anew from the tree or apply the enclosed patch to your existing iWARP branch.
Signed-off-by: Tom Tucker Index: amso1100/c2.c =================================================================== --- amso1100/c2.c (revision 0) +++ amso1100/c2.c (revision 0) @@ -0,0 +1,1183 @@ +/* + * c2.c: A Linux PCI-X Gigabit Ethernet driver for AMSO1100 (Cepheus2) RNIC + * Copyright(c) 2005 Ammasso, Inc. + * + * History: + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include "c2.h" + +MODULE_AUTHOR("Ranjith Balachandran "); +MODULE_DESCRIPTION("Ammasso AMSO1100 Gigabit Ethernet driver"); +MODULE_LICENSE("GPL"); +MODULE_VERSION(DRV_VERSION); + +static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK + | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN; + +static int debug = -1; /* defaults above */ +module_param(debug, int, 0); +MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)"); + +static int c2_up(struct net_device *netdev); +static int c2_down(struct net_device *netdev); +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev); +static void c2_tx_interrupt(struct net_device *netdev); +static void c2_rx_interrupt(struct net_device *netdev); +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs); +static void c2_tx_timeout(struct net_device *netdev); +static int c2_change_mtu(struct net_device *netdev, int new_mtu); +static void c2_reset(struct c2_port *c2_port); +static struct net_device_stats* c2_get_stats(struct net_device *netdev); + +static struct pci_device_id c2_pci_table[] = { + { 0x18b8, 0xb001, PCI_ANY_ID, PCI_ANY_ID }, + { 0 } +}; + +MODULE_DEVICE_TABLE(pci, c2_pci_table); + +#if 0 +static struct ethtool_ops c2_ethtool_ops = { + .get_drvinfo = c2_get_drvinfo, + .get_regs_len = c2_get_regs_len, + .get_link = ethtool_op_get_link, + .get_settings = c2_get_settings, + .set_settings = c2_set_settings, + .get_rx_csum = c2_get_rx_csum, + 
.set_rx_csum = c2_set_rx_csum, + .get_tx_csum = ethtool_op_get_tx_csum, + .set_tx_csum = ethtool_op_set_tx_csum, + .get_sg = ethtool_op_get_sg, + .set_sg = ethtool_op_set_sg, + .get_tso = ethtool_op_get_tso, + .set_tso = ethtool_op_set_tso, + .get_regs = c2_get_regs, +}; +#endif + +#if 0 +static void c2_link_up(struct c2_port *c2_port) +{ + netif_carrier_on(c2_port->netdev); + + if (c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(c2_port->netdev); + + printk(KERN_INFO PFX "%s: Link is up\n", c2_port->netdev->name); +} + +static void c2_link_down(struct c2_port *c2_port) +{ + netif_carrier_off(c2_port->netdev); + netif_stop_queue(c2_port->netdev); + + printk(KERN_INFO PFX "%s: Link is down\n", c2_port->netdev->name); +} +#endif + +static void c2_set_rxbufsize(struct c2_port *c2_port) +{ + struct net_device *netdev = c2_port->netdev; + + assert(netdev != NULL); + + if (netdev->mtu > RX_BUF_SIZE) + c2_port->rx_buf_size = netdev->mtu + ETH_HLEN + sizeof(struct c2_rxp_hdr) + NET_IP_ALIGN; + else + c2_port->rx_buf_size = sizeof(struct c2_rxp_hdr) + RX_BUF_SIZE; +} + +/* + * Allocate TX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. 
+ */ +static int c2_tx_ring_alloc(struct c2_ring *tx_ring, void *vaddr, dma_addr_t base, + void __iomem *mmio_txp_ring) +{ + struct c2_tx_desc *tx_desc; + struct c2_txp_desc *txp_desc; + struct c2_element *elem; + int i; + + tx_ring->start = kmalloc(sizeof(*elem)*tx_ring->count, GFP_KERNEL); + if (!tx_ring->start) + return -ENOMEM; + + for (i = 0, elem = tx_ring->start, tx_desc = vaddr, txp_desc = mmio_txp_ring; + i < tx_ring->count; i++, elem++, tx_desc++, txp_desc++) + { + tx_desc->len = 0; + tx_desc->status = 0; + + /* Set TXP_HTXD_UNINIT */ + c2_write64((void *)txp_desc + C2_TXP_ADDR, cpu_to_be64(0x1122334455667788ULL)); + c2_write16((void *)txp_desc + C2_TXP_LEN, cpu_to_be16(0)); + c2_write16((void *)txp_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_UNINIT)); + + elem->skb = NULL; + elem->ht_desc = tx_desc; + elem->hw_desc = txp_desc; + + if (i == tx_ring->count - 1) { + elem->next = tx_ring->start; + tx_desc->next_offset = base; + } else { + elem->next = elem + 1; + tx_desc->next_offset = base + (i + 1) * sizeof(*tx_desc); + } + } + + tx_ring->to_use = tx_ring->to_clean = tx_ring->start; + + return 0; +} + +/* + * Allocate RX ring elements and chain them together. + * One-to-one association of adapter descriptors with ring elements. 
+ */ +static int c2_rx_ring_alloc(struct c2_ring *rx_ring, void *vaddr, dma_addr_t base, + void __iomem *mmio_rxp_ring) +{ + struct c2_rx_desc *rx_desc; + struct c2_rxp_desc *rxp_desc; + struct c2_element *elem; + int i; + + rx_ring->start = kmalloc(sizeof(*elem) * rx_ring->count, GFP_KERNEL); + if (!rx_ring->start) + return -ENOMEM; + + for (i = 0, elem = rx_ring->start, rx_desc = vaddr, rxp_desc = mmio_rxp_ring; + i < rx_ring->count; i++, elem++, rx_desc++, rxp_desc++) + { + rx_desc->len = 0; + rx_desc->status = 0; + + /* Set RXP_HRXD_UNINIT */ + c2_write16((void *)rxp_desc + C2_RXP_STATUS, cpu_to_be16(RXP_HRXD_OK)); + c2_write16((void *)rxp_desc + C2_RXP_COUNT, cpu_to_be16(0)); + c2_write16((void *)rxp_desc + C2_RXP_LEN, cpu_to_be16(0)); + c2_write64((void *)rxp_desc + C2_RXP_ADDR, cpu_to_be64(0x99aabbccddeeffULL)); + c2_write16((void *)rxp_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_UNINIT)); + + elem->skb = NULL; + elem->ht_desc = rx_desc; + elem->hw_desc = rxp_desc; + + if (i == rx_ring->count - 1) { + elem->next = rx_ring->start; + rx_desc->next_offset = base; + } else { + elem->next = elem + 1; + rx_desc->next_offset = base + (i + 1) * sizeof(*rx_desc); + } + } + + rx_ring->to_use = rx_ring->to_clean = rx_ring->start; + + return 0; +} + +/* Setup buffer for receiving */ +static inline int c2_rx_alloc(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; + struct c2_rxp_hdr *rxp_hdr; + + skb = dev_alloc_skb(c2_port->rx_buf_size); + if (unlikely(!skb)) { + printk(KERN_ERR PFX "%s: out of memory for receive\n", + c2_port->netdev->name); + return -ENOMEM; + } + + /* Zero out the rxp hdr in the sk_buff */ + memset(skb->data, 0, sizeof(*rxp_hdr)); + + skb->dev = c2_port->netdev; + + maplen = c2_port->rx_buf_size; + mapaddr = pci_map_single(c2dev->pcidev, skb->data, maplen, PCI_DMA_FROMDEVICE); + + /* Set the sk_buff 
RXP_header to RXP_HRXD_READY */ + rxp_hdr = (struct c2_rxp_hdr *) skb->data; + rxp_hdr->flags = RXP_HRXD_READY; + + //c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); + c2_write16(elem->hw_desc + C2_RXP_STATUS, cpu_to_be16(0)); + c2_write16(elem->hw_desc + C2_RXP_LEN, cpu_to_be16((u16)maplen - sizeof(*rxp_hdr))); + c2_write64(elem->hw_desc + C2_RXP_ADDR, cpu_to_be64(mapaddr)); + c2_write16(elem->hw_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_READY)); + + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + rx_desc->len = maplen; + + return 0; +} + +/* + * Allocate buffers for the Rx ring + * For receive: rx_ring.to_clean is next received frame + */ +static int c2_rx_fill(struct c2_port *c2_port) +{ + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + int ret = 0; + + elem = rx_ring->start; + do { + if (c2_rx_alloc(c2_port, elem)) { + ret = 1; + break; + } + } while ((elem = elem->next) != rx_ring->start); + + rx_ring->to_clean = rx_ring->start; + return ret; +} + +/* Free all buffers in RX ring, assumes receiver stopped */ +static void c2_rx_clean(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + + elem = rx_ring->start; + do { + rx_desc = elem->ht_desc; + rx_desc->len = 0; + + c2_write16(elem->hw_desc + C2_RXP_STATUS, cpu_to_be16(0)); + c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); + c2_write16(elem->hw_desc + C2_RXP_LEN, cpu_to_be16(0)); + c2_write64(elem->hw_desc + C2_RXP_ADDR, cpu_to_be64(0x99aabbccddeeffULL)); + c2_write16(elem->hw_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_UNINIT)); + + if (elem->skb) { + pci_unmap_single(c2dev->pcidev, elem->mapaddr, elem->maplen, + PCI_DMA_FROMDEVICE); + dev_kfree_skb(elem->skb); + elem->skb = NULL; + } + } while ((elem = elem->next) != rx_ring->start); +} + +static inline int c2_tx_free(struct c2_dev *c2dev, struct c2_element *elem) +{ + 
struct c2_tx_desc *tx_desc = elem->ht_desc; + + tx_desc->len = 0; + + pci_unmap_single(c2dev->pcidev, elem->mapaddr, elem->maplen, PCI_DMA_TODEVICE); + + if (elem->skb) { + dev_kfree_skb_any(elem->skb); + elem->skb = NULL; + } + + return 0; +} + +/* Free all buffers in TX ring, assumes transmitter stopped */ +static void c2_tx_clean(struct c2_port *c2_port) +{ + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + int retry; + unsigned long flags; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + elem = tx_ring->start; + + do { + retry = 0; + do { + txp_htxd.flags = c2_read16(elem->hw_desc + C2_TXP_FLAGS); + + if (txp_htxd.flags == TXP_HTXD_READY) { + retry = 1; + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(0)); + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(0)); + c2_write16(elem->hw_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_DONE)); + c2_port->netstats.tx_dropped++; + break; + } else { + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(0)); + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(0x1122334455667788ULL)); + c2_write16(elem->hw_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_UNINIT)); + } + + c2_tx_free(c2_port->c2dev, elem); + + } while ((elem = elem->next) != tx_ring->start); + } while (retry); + + c2_port->tx_avail = c2_port->tx_ring.count - 1; + c2_port->c2dev->cur_tx = tx_ring->to_use - tx_ring->start; + + if (c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(c2_port->netdev); + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); +} + +/* + * Process transmit descriptors marked 'DONE' by the firmware, + * freeing up their unneeded sk_buffs. 
+ */ +static void c2_tx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + struct c2_txp_desc txp_htxd; + + spin_lock(&c2_port->tx_lock); + + for(elem = tx_ring->to_clean; elem != tx_ring->to_use; elem = elem->next) + { + txp_htxd.flags = be16_to_cpu(c2_read16(elem->hw_desc + C2_TXP_FLAGS)); + + if (txp_htxd.flags != TXP_HTXD_DONE) + break; + + if (netif_msg_tx_done(c2_port)) { + /* PCI reads are expensive in fast path */ + //txp_htxd.addr = be64_to_cpu(c2_read64(elem->hw_desc + C2_TXP_ADDR)); + txp_htxd.len = be16_to_cpu(c2_read16(elem->hw_desc + C2_TXP_LEN)); + printk(KERN_INFO PFX + "%s: tx done slot %3Zu status 0x%x len %5u bytes\n", + netdev->name, elem - tx_ring->start, + txp_htxd.flags, txp_htxd.len); + } + + c2_tx_free(c2dev, elem); + ++(c2_port->tx_avail); + } + + tx_ring->to_clean = elem; + + if (netif_queue_stopped(netdev) && c2_port->tx_avail > MAX_SKB_FRAGS + 1) + netif_wake_queue(netdev); + + spin_unlock(&c2_port->tx_lock); +} + +static void c2_rx_error(struct c2_port *c2_port, struct c2_element *elem) +{ + struct c2_rx_desc *rx_desc = elem->ht_desc; + struct c2_rxp_hdr *rxp_hdr = (struct c2_rxp_hdr *)elem->skb->data; + + if (rxp_hdr->status != RXP_HRXD_OK || + rxp_hdr->len > (rx_desc->len - sizeof(*rxp_hdr))) { + printk(KERN_ERR PFX "BAD RXP_HRXD\n"); + printk(KERN_ERR PFX " rx_desc : %p\n", rx_desc); + printk(KERN_ERR PFX " index : %Zu\n", elem - c2_port->rx_ring.start); + printk(KERN_ERR PFX " len : %u\n", rx_desc->len); + printk(KERN_ERR PFX " rxp_hdr : %p [PA %p]\n", rxp_hdr, + (void *)__pa((unsigned long)rxp_hdr)); + printk(KERN_ERR PFX " flags : 0x%x\n", rxp_hdr->flags); + printk(KERN_ERR PFX " status: 0x%x\n", rxp_hdr->status); + printk(KERN_ERR PFX " len : %u\n", rxp_hdr->len); + printk(KERN_ERR PFX " rsvd : 0x%x\n", rxp_hdr->rsvd); + } + + /* Setup the skb for reuse since we're dropping 
this pkt */ + elem->skb->tail = elem->skb->data = elem->skb->head; + + /* Zero out the rxp hdr in the sk_buff */ + memset(elem->skb->data, 0, sizeof(*rxp_hdr)); + + /* Write the descriptor to the adapter's rx ring */ + c2_write16(elem->hw_desc + C2_RXP_STATUS, cpu_to_be16(0)); + c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); + c2_write16(elem->hw_desc + C2_RXP_LEN, cpu_to_be16((u16)elem->maplen - sizeof(*rxp_hdr))); + c2_write64(elem->hw_desc + C2_RXP_ADDR, cpu_to_be64(elem->mapaddr)); + c2_write16(elem->hw_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_READY)); + + printk(KERN_INFO PFX "packet dropped\n"); + c2_port->netstats.rx_dropped++; +} + +static void c2_rx_interrupt(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *rx_ring = &c2_port->rx_ring; + struct c2_element *elem; + struct c2_rx_desc *rx_desc; + struct c2_rxp_hdr *rxp_hdr; + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen, buflen; + unsigned long flags; + + spin_lock_irqsave(&c2dev->lock, flags); + + /* Begin where we left off */ + rx_ring->to_clean = rx_ring->start + c2dev->cur_rx; + + for(elem = rx_ring->to_clean; elem->next != rx_ring->to_clean; elem = elem->next) + { + rx_desc = elem->ht_desc; + mapaddr = elem->mapaddr; + maplen = elem->maplen; + skb = elem->skb; + rxp_hdr = (struct c2_rxp_hdr *)skb->data; + + if (rxp_hdr->flags != RXP_HRXD_DONE) + break; + + if (netif_msg_rx_status(c2_port)) + printk(KERN_INFO PFX "%s: rx done slot %3Zu status 0x%x len %5u bytes\n", + netdev->name, elem - rx_ring->start, + rxp_hdr->flags, rxp_hdr->len); + + buflen = rxp_hdr->len; + + /* Sanity check the RXP header */ + if (rxp_hdr->status != RXP_HRXD_OK || + buflen > (rx_desc->len - sizeof(*rxp_hdr))) { + c2_rx_error(c2_port, elem); + continue; + } + + /* Allocate and map a new skb for replenishing the host RX desc */ + if (c2_rx_alloc(c2_port, elem)) { + c2_rx_error(c2_port, elem); + continue; + } + + /* Unmap 
the old skb */ + pci_unmap_single(c2dev->pcidev, mapaddr, maplen, PCI_DMA_FROMDEVICE); + + /* + * Skip past the leading 8 bytes comprising of the + * "struct c2_rxp_hdr", prepended by the adapter + * to the usual Ethernet header ("struct ethhdr"), + * to the start of the raw Ethernet packet. + * + * Fix up the various fields in the sk_buff before + * passing it up to netif_rx(). The transfer size + * (in bytes) specified by the adapter len field of + * the "struct rxp_hdr_t" does NOT include the + * "sizeof(struct c2_rxp_hdr)". + */ + skb->data += sizeof(*rxp_hdr); + skb->tail = skb->data + buflen; + skb->len = buflen; + skb->dev = netdev; + skb->protocol = eth_type_trans(skb, netdev); + + netif_rx(skb); + + netdev->last_rx = jiffies; + c2_port->netstats.rx_packets++; + c2_port->netstats.rx_bytes += buflen; + } + + /* Save where we left off */ + rx_ring->to_clean = elem; + c2dev->cur_rx = elem - rx_ring->start; + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + spin_unlock_irqrestore(&c2dev->lock, flags); +} + +/* + * Handle netisr0 TX & RX interrupts. + */ +static irqreturn_t c2_interrupt(int irq, void *dev_id, struct pt_regs *regs) +{ + unsigned int netisr0; + struct c2_dev *c2dev = dev_id; + + assert(c2dev != NULL); + + netisr0 = c2_read32(c2dev->regs + C2_NISR0); + + if (netisr0 & ~(C2_PCI_HRX_INT | C2_PCI_HRX_INT)) { + printk(KERN_ERR PFX "Unknown IRQ!\n"); + return IRQ_NONE; + } + + /* Process RX 'DONE' descriptors */ + if (netisr0 & C2_PCI_HRX_INT) { + c2_rx_interrupt(c2dev->netdev); + + /* + * Also process TX 'DONE' descriptors here + * since the fw provides the status of RX for + * both TX & RX interrupts. 
+ */ + c2_tx_interrupt(c2dev->netdev); + + c2_write32(c2dev->regs + C2_NISR0, C2_PCI_HRX_INT); + } + + /* Process TX 'DONE' descriptors */ + if (netisr0 & C2_PCI_HTX_INT) { + c2_tx_interrupt(c2dev->netdev); + + c2_write32(c2dev->regs + C2_NISR0, C2_PCI_HTX_INT); + } + + return IRQ_HANDLED; +} + +static int c2_up(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_element *elem; + struct c2_rxp_hdr *rxp_hdr; + size_t rx_size, tx_size; + int ret, i; + unsigned int netimr0; + + assert(c2dev != NULL); + + if (netif_msg_ifup(c2_port)) + printk(KERN_INFO PFX "%s: enabling interface\n", netdev->name); + + /* Set the Rx buffer size based on MTU */ + c2_set_rxbufsize(c2_port); + + /* Allocate DMA'able memory for Tx/Rx host descriptor rings */ + rx_size = c2_port->rx_ring.count * sizeof(struct c2_rx_desc); + tx_size = c2_port->tx_ring.count * sizeof(struct c2_tx_desc); + + c2_port->mem_size = tx_size + rx_size; + c2_port->mem = pci_alloc_consistent(c2dev->pcidev, c2_port->mem_size, + &c2_port->dma); + if (c2_port->mem == NULL) { + printk(KERN_ERR PFX "Unable to allocate memory for host descriptor rings\n"); + return -ENOMEM; + } + + memset(c2_port->mem, 0, c2_port->mem_size); + + /* Create the Rx host descriptor ring */ + if ((ret = c2_rx_ring_alloc(&c2_port->rx_ring, c2_port->mem, c2_port->dma, + c2dev->mmio_rxp_ring))) { + printk(KERN_ERR PFX "Unable to create RX ring\n"); + goto bail0; + } + + /* Allocate Rx buffers for the host descriptor ring */ + if (c2_rx_fill(c2_port)) { + printk(KERN_ERR PFX "Unable to fill RX ring\n"); + goto bail1; + } + + /* Create the Tx host descriptor ring */ + if ((ret = c2_tx_ring_alloc(&c2_port->tx_ring, c2_port->mem + rx_size, + c2_port->dma + rx_size, c2dev->mmio_txp_ring))) { + printk(KERN_ERR PFX "Unable to create TX ring\n"); + goto bail1; + } + + /* Set the TX pointer to where we left off */ + c2_port->tx_avail = c2_port->tx_ring.count - 1; + 
c2_port->tx_ring.to_use = c2_port->tx_ring.to_clean = c2_port->tx_ring.start + c2dev->cur_tx; + + /* missing: Initialize MAC */ + + BUG_ON(c2_port->tx_ring.to_use != c2_port->tx_ring.to_clean); + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* Reset the READY bit in the sk_buff RXP headers & adapter HRXDQ */ + for(i = 0, elem = c2_port->rx_ring.start; i < c2_port->rx_ring.count; + i++, elem++) + { + rxp_hdr = (struct c2_rxp_hdr *)elem->skb->data; + rxp_hdr->flags = 0; + c2_write16(elem->hw_desc + C2_RXP_FLAGS, cpu_to_be16(RXP_HRXD_READY)); + } + + /* Enable network packets */ + netif_start_queue(netdev); + + /* Enable IRQ */ + c2_write32(c2dev->regs + C2_IDIS, 0); + netimr0 = c2_read32(c2dev->regs + C2_NIMR0); + netimr0 &= ~(C2_PCI_HTX_INT | C2_PCI_HRX_INT); + c2_write32(c2dev->regs + C2_NIMR0, netimr0); + + return 0; + + bail1: + c2_rx_clean(c2_port); + kfree(c2_port->rx_ring.start); + + bail0: + pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, c2_port->dma); + + return ret; +} + +static int c2_down(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + + if (netif_msg_ifdown(c2_port)) + printk(KERN_INFO PFX "%s: disabling interface\n", netdev->name); + + /* Wait for all the queued packets to get sent */ + c2_tx_interrupt(netdev); + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Disable IRQs by clearing the interrupt mask */ + c2_write32(c2dev->regs + C2_IDIS, 1); + c2_write32(c2dev->regs + C2_NIMR0, 0); + + /* missing: Stop transmitter */ + + /* missing: Stop receiver */ + + /* Reset the adapter, ensures the driver is in sync with the RXP */ + c2_reset(c2_port); + + /* missing: Turn off LEDs here */ + + /* Free all buffers in the host descriptor rings */ + c2_tx_clean(c2_port); + c2_rx_clean(c2_port); + + /* Free the host descriptor rings */ + kfree(c2_port->rx_ring.start); + kfree(c2_port->tx_ring.start); + 
pci_free_consistent(c2dev->pcidev, c2_port->mem_size, c2_port->mem, c2_port->dma); + + return 0; +} + +static void c2_reset(struct c2_port *c2_port) +{ + struct c2_dev *c2dev = c2_port->c2dev; + unsigned int cur_rx = c2dev->cur_rx; + + /* Tell the hardware to quiesce */ + C2_SET_CUR_RX(c2dev, cur_rx|C2_PCI_HRX_QUI); + + /* + * The hardware will reset the C2_PCI_HRX_QUI bit once + * the RXP is quiesced. Wait 2 seconds for this. + */ + ssleep(2); + + cur_rx = C2_GET_CUR_RX(c2dev); + + if (cur_rx & C2_PCI_HRX_QUI) + printk(KERN_ERR PFX "c2_reset: failed to quiesce the hardware!\n"); + + cur_rx &= ~C2_PCI_HRX_QUI; + + c2dev->cur_rx = cur_rx; + + dprintk("Current RX: %u\n", c2dev->cur_rx); +} + +static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + struct c2_dev *c2dev = c2_port->c2dev; + struct c2_ring *tx_ring = &c2_port->tx_ring; + struct c2_element *elem; + dma_addr_t mapaddr; + u32 maplen; + unsigned long flags; + unsigned int i; + + spin_lock_irqsave(&c2_port->tx_lock, flags); + + if (unlikely(c2_port->tx_avail < (skb_shinfo(skb)->nr_frags + 1))) { + netif_stop_queue(netdev); + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + printk(KERN_WARNING PFX "%s: Tx ring full when queue awake!\n", + netdev->name); + return NETDEV_TX_BUSY; + } + + maplen = skb_headlen(skb); + mapaddr = pci_map_single(c2dev->pcidev, skb->data, maplen, PCI_DMA_TODEVICE); + + elem = tx_ring->to_use; + elem->skb = skb; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(mapaddr)); + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(maplen)); + c2_write16(elem->hw_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_READY)); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + + /* Loop thru additional data fragments and queue them */ + if (skb_shinfo(skb)->nr_frags) { + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) + { + 
skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + maplen = frag->size; + mapaddr = pci_map_page(c2dev->pcidev, frag->page, frag->page_offset, + maplen, PCI_DMA_TODEVICE); + + elem = elem->next; + elem->skb = NULL; + elem->mapaddr = mapaddr; + elem->maplen = maplen; + + /* Tell HW to xmit */ + c2_write64(elem->hw_desc + C2_TXP_ADDR, cpu_to_be64(mapaddr)); + c2_write16(elem->hw_desc + C2_TXP_LEN, cpu_to_be16(maplen)); + c2_write16(elem->hw_desc + C2_TXP_FLAGS, cpu_to_be16(TXP_HTXD_READY)); + + c2_port->netstats.tx_packets++; + c2_port->netstats.tx_bytes += maplen; + } + } + + tx_ring->to_use = elem->next; + c2_port->tx_avail -= (skb_shinfo(skb)->nr_frags + 1); + + if (netif_msg_tx_queued(c2_port)) + printk(KERN_DEBUG PFX "%s: tx queued, slot %3Zu, len %5u bytes, avail = %u\n", + netdev->name, elem - tx_ring->start, maplen, c2_port->tx_avail); + + if (c2_port->tx_avail <= MAX_SKB_FRAGS + 1) { + netif_stop_queue(netdev); + if (netif_msg_tx_queued(c2_port)) + printk(KERN_INFO PFX "%s: transmit queue full\n", netdev->name); + } + + spin_unlock_irqrestore(&c2_port->tx_lock, flags); + + netdev->trans_start = jiffies; + + return NETDEV_TX_OK; +} + +static struct net_device_stats *c2_get_stats(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + return &c2_port->netstats; +} + +static int c2_set_mac_address(struct net_device *netdev, void *p) +{ + return -1; +} + +static void c2_tx_timeout(struct net_device *netdev) +{ + struct c2_port *c2_port = netdev_priv(netdev); + + if (netif_msg_timer(c2_port)) + printk(KERN_DEBUG PFX "%s: tx timeout\n", netdev->name); + + c2_tx_clean(c2_port); +} + +static int c2_change_mtu(struct net_device *netdev, int new_mtu) +{ + int ret = 0; + + if (new_mtu < ETH_ZLEN || new_mtu > ETH_JUMBO_MTU) + return -EINVAL; + + netdev->mtu = new_mtu; + + if (netif_running(netdev)) { + c2_down(netdev); + + c2_up(netdev); + } + + return ret; +} + +/* Initialize network device */ +static struct net_device *c2_devinit(struct c2_dev 
*c2dev, void __iomem *mmio_addr) +{ + struct c2_port *c2_port = NULL; + struct net_device *netdev = alloc_etherdev(sizeof(*c2_port)); + + if (!netdev) { + printk(KERN_ERR PFX "c2_port etherdev alloc failed"); + return NULL; + } + + SET_MODULE_OWNER(netdev); + SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev); + + netdev->open = c2_up; + netdev->stop = c2_down; + netdev->hard_start_xmit = c2_xmit_frame; + netdev->get_stats = c2_get_stats; +#if 0 + SET_ETHTOOL_OPS(netdev, &c2_ethtool_ops); +#endif + netdev->tx_timeout = c2_tx_timeout; + netdev->set_mac_address = c2_set_mac_address; + netdev->change_mtu = c2_change_mtu; + netdev->watchdog_timeo = C2_TX_TIMEOUT; + netdev->irq = c2dev->pcidev->irq; + + c2_port = netdev_priv(netdev); + c2_port->netdev = netdev; + c2_port->c2dev = c2dev; + c2_port->msg_enable = netif_msg_init(debug, default_msg); + c2_port->tx_ring.count = C2_NUM_TX_DESC; + c2_port->rx_ring.count = C2_NUM_RX_DESC; + + spin_lock_init(&c2_port->tx_lock); + + /* Copy our 48-bit ethernet hardware address */ + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_ENADDR, 6); + + /* Validate the MAC address */ + if(!is_valid_ether_addr(netdev->dev_addr)) { + printk(KERN_ERR PFX "Invalid MAC Address\n"); + c2_print_macaddr(netdev); + free_netdev(netdev); + return NULL; + } + + c2dev->netdev = netdev; + + return netdev; +} + +static int __devinit c2_probe(struct pci_dev *pcidev, const struct pci_device_id *ent) +{ + int ret = 0, i; + unsigned long reg0_start, reg0_flags, reg0_len; + unsigned long reg2_start, reg2_flags, reg2_len; + unsigned long reg4_start, reg4_flags, reg4_len; + struct net_device *netdev = NULL; + struct c2_dev *c2dev = NULL; + void __iomem *mmio_regs = NULL; + + assert(pcidev != NULL); + assert(ent != NULL); + + printk(KERN_INFO PFX "AMSO1100 Gigabit Ethernet driver v%s loaded\n", + DRV_VERSION); + + /* Enable PCI device */ + ret = pci_enable_device(pcidev); + if (ret) { + printk(KERN_ERR PFX "%s: Unable to enable PCI device\n", pci_name(pcidev)); + 
goto bail0; + } + + reg0_start = pci_resource_start(pcidev, BAR_0); + reg0_len = pci_resource_len(pcidev, BAR_0); + reg0_flags = pci_resource_flags(pcidev, BAR_0); + + reg2_start = pci_resource_start(pcidev, BAR_2); + reg2_len = pci_resource_len(pcidev, BAR_2); + reg2_flags = pci_resource_flags(pcidev, BAR_2); + + reg4_start = pci_resource_start(pcidev, BAR_4); + reg4_len = pci_resource_len(pcidev, BAR_4); + reg4_flags = pci_resource_flags(pcidev, BAR_4); + + printk(KERN_INFO PFX "BAR0 size = 0x%lX bytes\n", reg0_len); + printk(KERN_INFO PFX "BAR2 size = 0x%lX bytes\n", reg2_len); + printk(KERN_INFO PFX "BAR4 size = 0x%lX bytes\n", reg4_len); + + /* Make sure PCI base addr are MMIO */ + if (!(reg0_flags & IORESOURCE_MEM) || + !(reg2_flags & IORESOURCE_MEM) || + !(reg4_flags & IORESOURCE_MEM)) { + printk (KERN_ERR PFX "PCI regions not an MMIO resource\n"); + ret = -ENODEV; + goto bail1; + } + + /* Check for weird/broken PCI region reporting */ + if ((reg0_len < C2_REG0_SIZE) || + (reg2_len < C2_REG2_SIZE) || + (reg4_len < C2_REG4_SIZE)) { + printk (KERN_ERR PFX "Invalid PCI region sizes\n"); + ret = -ENODEV; + goto bail1; + } + + /* Reserve PCI I/O and memory resources */ + ret = pci_request_regions(pcidev, DRV_NAME); + if (ret) { + printk(KERN_ERR PFX "%s: Unable to request regions\n", pci_name(pcidev)); + goto bail1; + } + + if ((sizeof(dma_addr_t) > 4)) { + ret = pci_set_dma_mask(pcidev, DMA_64BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "64b DMA configuration failed\n"); + goto bail2; + } + } else { + ret = pci_set_dma_mask(pcidev, DMA_32BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "32b DMA configuration failed\n"); + goto bail2; + } + } + + /* Enables bus-mastering on the device */ + pci_set_master(pcidev); + + /* Remap the adapter PCI registers in BAR4 */ + mmio_regs = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, + sizeof(struct c2_adapter_pci_regs)); + if (mmio_regs == 0UL) { + printk(KERN_ERR PFX "Unable to remap adapter PCI registers in 
BAR4\n"); + ret = -EIO; + goto bail2; + } + + /* Validate PCI regs magic */ + for (i = 0; i < sizeof(c2_magic); i++) + { + if (c2_magic[i] != c2_read8(mmio_regs + C2_REGS_MAGIC + i)) { + printk(KERN_ERR PFX + "Invalid PCI regs magic [%d/%Zd: got 0x%x, exp 0x%x]\n", + i + 1, sizeof(c2_magic), + c2_read8(mmio_regs + C2_REGS_MAGIC + i), c2_magic[i]); + printk(KERN_ERR PFX "Adapter not claimed\n"); + iounmap(mmio_regs); + ret = -EIO; + goto bail2; + } + } + + /* Validate the adapter version */ + if (be32_to_cpu(c2_read32(mmio_regs + C2_REGS_VERS)) != C2_VERSION) { + printk(KERN_ERR PFX "Version mismatch [fw=%u, c2=%u], Adapter not claimed\n", + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_VERS)), C2_VERSION); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Validate the adapter IVN */ + if (be32_to_cpu(c2_read32(mmio_regs + C2_REGS_IVN)) != C2_IVN) { + printk(KERN_ERR PFX "IVN mismatch [fw=0x%x, c2=0x%x], Adapter not claimed\n", + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_IVN)), C2_IVN); + ret = -EINVAL; + iounmap(mmio_regs); + goto bail2; + } + + /* Allocate hardware structure */ + c2dev = kmalloc(sizeof(*c2dev), GFP_KERNEL); + if (!c2dev) { + printk(KERN_ERR PFX "%s: Unable to alloc hardware struct\n", + pci_name(pcidev)); + ret = -ENOMEM; + iounmap(mmio_regs); + goto bail2; + } + + memset(c2dev, 0, sizeof(*c2dev)); + spin_lock_init(&c2dev->lock); + c2dev->pcidev = pcidev; + c2dev->cur_tx = 0; + + /* Get the last RX index */ + c2dev->cur_rx = (be32_to_cpu(c2_read32(mmio_regs + C2_REGS_HRX_CUR)) - 0xffffc000) / sizeof(struct c2_rxp_desc); + + /* Request an interrupt line for the driver */ + ret = request_irq(pcidev->irq, c2_interrupt, SA_SHIRQ, DRV_NAME, c2dev); + if (ret) { + printk(KERN_ERR PFX "%s: requested IRQ %u is busy\n", + pci_name(pcidev), pcidev->irq); + iounmap(mmio_regs); + goto bail3; + } + + /* Set driver specific data */ + pci_set_drvdata(pcidev, c2dev); + + /* Initialize network device */ + if ((netdev = c2_devinit(c2dev, mmio_regs)) == 
NULL) { + iounmap(mmio_regs); + goto bail4; + } + + /* Unmap the adapter PCI registers in BAR4 */ + iounmap(mmio_regs); + + /* Register network device */ + ret = register_netdev(netdev); + if (ret) { + printk(KERN_ERR PFX "Unable to register netdev, ret = %d\n", ret); + goto bail5; + } + + /* Disable network packets */ + netif_stop_queue(netdev); + + /* Remap the adapter HRXDQ PA space to kernel VA space */ + c2dev->mmio_rxp_ring = ioremap_nocache(reg4_start + C2_RXP_HRXDQ_OFFSET, + C2_RXP_HRXDQ_SIZE); + if (c2dev->mmio_rxp_ring == 0UL) { + printk(KERN_ERR PFX "Unable to remap MMIO HRXDQ region\n"); + ret = -EIO; + goto bail6; + } + + /* Remap the adapter HTXDQ PA space to kernel VA space */ + c2dev->mmio_txp_ring = ioremap_nocache(reg4_start + C2_TXP_HTXDQ_OFFSET, + C2_TXP_HTXDQ_SIZE); + if (c2dev->mmio_txp_ring == 0UL) { + printk(KERN_ERR PFX "Unable to remap MMIO HTXDQ region\n"); + ret = -EIO; + goto bail7; + } + + /* Save off the current RX index in the last 4 bytes of the TXP Ring */ + C2_SET_CUR_RX(c2dev, c2dev->cur_rx); + + /* Remap the PCI registers in adapter BAR0 to kernel VA space */ + c2dev->regs = ioremap_nocache(reg0_start, sizeof(struct c2_pci_regs)); + if (c2dev->regs == 0UL) { + printk(KERN_ERR PFX "Unable to remap BAR0\n"); + ret = -EIO; + goto bail8; + } + + /* Print out the MAC address */ + c2_print_macaddr(netdev); + + return 0; + + bail8: + iounmap(c2dev->mmio_txp_ring); + + bail7: + iounmap(c2dev->mmio_rxp_ring); + + bail6: + unregister_netdev(netdev); + + bail5: + free_netdev(netdev); + + bail4: + free_irq(pcidev->irq, c2dev); + + bail3: + kfree(c2dev); + + bail2: + pci_release_regions(pcidev); + + bail1: + pci_disable_device(pcidev); + + bail0: + return ret; +} + +static void __devexit c2_remove(struct pci_dev *pcidev) +{ + struct c2_dev *c2dev = pci_get_drvdata(pcidev); + struct net_device *netdev = c2dev->netdev; + + assert(netdev != NULL); + + /* Remove network device from the kernel */ + unregister_netdev(netdev); + + /* Free network 
device */ + free_netdev(netdev); + + /* Free the interrupt line */ + free_irq(pcidev->irq, c2dev); + + /* missing: Turn LEDs off here */ + + /* Unmap adapter PA space */ + iounmap(c2dev->regs); + iounmap(c2dev->mmio_txp_ring); + iounmap(c2dev->mmio_rxp_ring); + + /* Free the hardware structure */ + kfree(c2dev); + + /* Release reserved PCI I/O and memory resources */ + pci_release_regions(pcidev); + + /* Disable PCI device */ + pci_disable_device(pcidev); + + /* Clear driver specific data */ + pci_set_drvdata(pcidev, NULL); +} + +static struct pci_driver c2_pci_driver = { + .name = DRV_NAME, + .id_table = c2_pci_table, + .probe = c2_probe, + .remove = __devexit_p(c2_remove), +}; + +static int __init c2_init_module(void) +{ + return pci_module_init(&c2_pci_driver); +} + +static void __exit c2_exit_module(void) +{ + pci_unregister_driver(&c2_pci_driver); +} + +module_init(c2_init_module); +module_exit(c2_exit_module); Index: amso1100/c2.h =================================================================== --- amso1100/c2.h (revision 0) +++ amso1100/c2.h (revision 0) @@ -0,0 +1,326 @@ +/* + * c2.h: A Linux PCI-X Gigabit Ethernet driver for AMSO1100 (Cepheus2) RNIC + * + * Copyright(c) 2005 Ammasso, Inc. + */ + +#define DRV_NAME "c2" +#define DRV_VERSION "1.1" +#define PFX DRV_NAME ": " + +#undef C2_DEBUG + +#ifdef C2_DEBUG +#define assert(expr) \ + if(!(expr)) { \ + printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n",\ + #expr, __FILE__, __FUNCTION__, __LINE__); \ + } +#define dprintk(fmt, args...) do {printk(KERN_INFO PFX fmt, ##args);} while (0) +#else +#define assert(expr) do {} while (0) +#define dprintk(fmt, args...) 
do {} while (0) +#endif /* C2_DEBUG */ + +#define BAR_0 0 +#define BAR_2 2 +#define BAR_4 4 + +#define RX_BUF_SIZE (1536 + 8) +#define ETH_JUMBO_MTU 9000 +#define C2_MAGIC "CEPHEUS" +#define C2_VERSION 4 +#define C2_IVN (18 & 0x7fffffff) + +#define C2_REG0_SIZE (16 * 1024) +#define C2_REG2_SIZE (2 * 1024 * 1024) +#define C2_REG4_SIZE (256 * 1024 * 1024) +#define C2_NUM_TX_DESC 341 +#define C2_NUM_RX_DESC 256 +#define C2_PCI_REGS_OFFSET (0x10000) +#define C2_RXP_HRXDQ_OFFSET (((C2_REG4_SIZE)/2)) +#define C2_RXP_HRXDQ_SIZE (4096) +#define C2_TXP_HTXDQ_OFFSET (((C2_REG4_SIZE)/2) + C2_RXP_HRXDQ_SIZE) +#define C2_TXP_HTXDQ_SIZE (4096) +#define C2_TX_TIMEOUT (6*HZ) + +/* CEPHEUS */ +static const u8 c2_magic[] = { + 0x43, 0x45, 0x50, 0x48, 0x45, 0x55, 0x53 + }; + +enum adapter_pci_regs { + C2_REGS_MAGIC = 0x0000, + C2_REGS_VERS = 0x0008, + C2_REGS_IVN = 0x000C, + C2_REGS_ENADDR = 0x004C, + C2_REGS_HRX_CUR = 0x006C, +}; + +struct c2_adapter_pci_regs { + char reg_magic[8]; + u32 version; + u32 ivn; + u32 pci_window_size; + u32 q0_q_size; + u32 q0_msg_size; + u32 q0_pool_start; + u32 q0_shared; + u32 q1_q_size; + u32 q1_msg_size; + u32 q1_pool_start; + u32 q1_shared; + u32 q2_q_size; + u32 q2_msg_size; + u32 q2_pool_start; + u32 q2_shared; + u32 log_start; + u32 log_size; + u8 host_enaddr[8]; + u8 rdma_enaddr[8]; + u32 crash_entry; + u32 crash_ready[2]; + u32 fw_txd_cur; + u32 fw_hrxd_cur; + u32 fw_rxd_cur; +}; + +enum pci_regs { + C2_HISR = 0x0000, + C2_DISR = 0x0004, + C2_HIMR = 0x0008, + C2_DIMR = 0x000C, + C2_NISR0 = 0x0010, + C2_NISR1 = 0x0014, + C2_NIMR0 = 0x0018, + C2_NIMR1 = 0x001C, + C2_IDIS = 0x0020, +}; + +enum { + C2_PCI_HRX_INT = 1<<8, + C2_PCI_HTX_INT = 1<<17, + C2_PCI_HRX_QUI = 1<<31, +}; + +/* + * Cepheus registers in BAR0. 
+ */ +struct c2_pci_regs { + u32 hostisr; + u32 dmaisr; + u32 hostimr; + u32 dmaimr; + u32 netisr0; + u32 netisr1; + u32 netimr0; + u32 netimr1; + u32 int_disable; +}; + +/* TXP flags */ +enum c2_txp_flags { + TXP_HTXD_DONE = 0, + TXP_HTXD_READY = 1<<0, + TXP_HTXD_UNINIT = 1<<1, +}; + +/* RXP flags */ +enum c2_rxp_flags { + RXP_HRXD_UNINIT = 0, + RXP_HRXD_READY = 1<<0, + RXP_HRXD_DONE = 1<<1, +}; + +/* RXP status */ +enum c2_rxp_status { + RXP_HRXD_ZERO = 0, + RXP_HRXD_OK = 1<<0, + RXP_HRXD_BUF_OV = 1<<1, +}; + +/* TXP descriptor fields */ +enum txp_desc { + C2_TXP_FLAGS = 0x0000, + C2_TXP_LEN = 0x0002, + C2_TXP_ADDR = 0x0004, +}; + +/* RXP descriptor fields */ +enum rxp_desc { + C2_RXP_FLAGS = 0x0000, + C2_RXP_STATUS = 0x0002, + C2_RXP_COUNT = 0x0004, + C2_RXP_LEN = 0x0006, + C2_RXP_ADDR = 0x0008, +}; + +struct c2_txp_desc { + u16 flags; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_desc { + u16 flags; + u16 status; + u16 count; + u16 len; + u64 addr; +} __attribute__ ((packed)); + +struct c2_rxp_hdr { + u16 flags; + u16 status; + u16 len; + u16 rsvd; +} __attribute__ ((packed)); + +struct c2_tx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_rx_desc { + u32 len; + u32 status; + dma_addr_t next_offset; +}; + +struct c2_element { + struct c2_element *next; + void *ht_desc; /* host descriptor */ + void *hw_desc; /* hardware descriptor */ + struct sk_buff *skb; + dma_addr_t mapaddr; + u32 maplen; +}; + +struct c2_ring { + struct c2_element *to_clean; + struct c2_element *to_use; + struct c2_element *start; + unsigned long count; +}; + +struct ib_device; +struct c2_dev { + struct ib_device ibdev; + void __iomem *regs; + void __iomem *mmio_txp_ring; /* remapped adapter memory for hw rings */ + void __iomem *mmio_rxp_ring; + spinlock_t lock; + struct pci_dev *pcidev; + struct net_device *netdev; + unsigned int cur_tx; + unsigned int cur_rx; +}; + +struct c2_port { + u32 msg_enable; + struct c2_dev *c2dev; + struct 
net_device *netdev; + + spinlock_t tx_lock; + u32 tx_avail; + struct c2_ring tx_ring; + struct c2_ring rx_ring; + + void *mem; /* PCI memory for host rings */ + dma_addr_t dma; + unsigned long mem_size; + + u32 rx_buf_size; + + struct net_device_stats netstats; +}; + +#ifndef readq +static inline u64 readq(const void __iomem *addr) +{ + u64 ret = readl(addr + 4); + ret <<= 32; + ret |= readl(addr); + + return ret; +} +#endif + +#ifndef writeq +static inline void writeq(u64 val, void __iomem *addr) +{ + writel((u32) (val), addr); + writel((u32) (val >> 32), (addr + 4)); +} +#endif + +/* Read from memory-mapped device */ +static inline u64 c2_read64(const void __iomem *addr) +{ + return readq(addr); +} + +static inline u32 c2_read32(const void __iomem *addr) +{ + return readl(addr); +} + +static inline u16 c2_read16(const void __iomem *addr) +{ + return readw(addr); +} + +static inline u8 c2_read8(const void __iomem *addr) +{ + return readb(addr); +} + +/* Write to memory-mapped device */ +static inline void c2_write64(void __iomem *addr, u64 val) +{ + writeq(val, addr); +} + +static inline void c2_write32(void __iomem *addr, u32 val) +{ + writel(val, addr); +} + +static inline void c2_write16(void __iomem *addr, u16 val) +{ + writew(val, addr); +} + +static inline void c2_write8(void __iomem *addr, u8 val) +{ + writeb(val, addr); +} + +#define C2_SET_CUR_RX(c2dev, cur_rx) \ + c2_write32(c2dev->mmio_txp_ring + 4092, cpu_to_be32(cur_rx)) + +#define C2_GET_CUR_RX(c2dev) \ + be32_to_cpu(c2_read32(c2dev->mmio_txp_ring + 4092)) + +#if 0 +static void c2_print_ethhdr(struct ethhdr *ehdr) +{ + printk(KERN_DEBUG PFX "ehdr[h_src(%02x:%02x:%02x:%02x:%02x:%02x) -> " + "h_dst(%02x:%02x:%02x:%02x:%02x:%02x) h_proto(%04x)]\n", + ehdr->h_source[0], ehdr->h_source[1], ehdr->h_source[2], + ehdr->h_source[3], ehdr->h_source[4], ehdr->h_source[5], + ehdr->h_dest[0], ehdr->h_dest[1], ehdr->h_dest[2], + ehdr->h_dest[3], ehdr->h_dest[4], ehdr->h_dest[5], + htons(ehdr->h_proto)); +} +#endif 
+ +static void c2_print_macaddr(struct net_device *netdev) +{ + printk(KERN_INFO PFX "%s: MAC %02X:%02X:%02X:%02X:%02X:%02X, " + "IRQ %u\n", netdev->name, + netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], + netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5], + netdev->irq); +} Index: amso1100/Makefile =================================================================== --- amso1100/Makefile (revision 3074) +++ amso1100/Makefile (working copy) @@ -4,27 +4,30 @@ EXTRA_CFLAGS += -DDEBUG endif -obj-$(CONFIG_INFINIBAND_AMSO1100) += ib_amso1100.o +obj-$(CONFIG_INFINIBAND_AMSO1100) += ib_c2.o -ib_amso1100-y := cc_cq_common.o \ - ccilnet.o \ - ccilnet_dbg.o \ - cc_mq_common.o \ - cc_qp_common.o \ - devccil_adapter.o \ - devccil_ae.o \ - devccil.o \ - devccil_cq.o \ - devccil_eh.o \ - devccil_ep.o \ - devccil_logging.o \ - devccil_mem.o \ - devccil_mm.o \ - devccil_mq.o \ - devccil_pd.o \ - devccil_qp.o \ - devccil_rnic.o \ - devccil_srq.o \ - devccil_vq.o \ - devccil_wrappers.o \ - devnet.o +ib_c2-y := c2_provider.o \ + c2.o + +# cc_cq_common.o \ +# ccilnet.o \ +# ccilnet_dbg.o \ +# cc_mq_common.o \ +# cc_qp_common.o \ +# devccil_adapter.o \ +# devccil_ae.o \ +# devccil.o \ +# devccil_cq.o \ +# devccil_eh.o \ +# devccil_ep.o \ +# devccil_logging.o \ +# devccil_mem.o \ +# devccil_mm.o \ +# devccil_mq.o \ +# devccil_pd.o \ +# devccil_qp.o \ +# devccil_rnic.o \ +# devccil_srq.o \ +# devccil_vq.o \ +# devccil_wrappers.o \ +# devnet.o Index: amso1100/c2_provider.c =================================================================== --- amso1100/c2_provider.c (revision 0) +++ amso1100/c2_provider.c (revision 0) @@ -0,0 +1,330 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include + +#include "c2.h" +#include "c2_provider.h" + +static int c2_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + int err = -ENOMEM; + return err; +} + +static int c2_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + return -ENOSYS; +} + +static int c2_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return -ENOSYS; +} + +static int c2_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + return -ENOSYS; +} + +static int c2_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + return -ENOSYS; +} + +static struct ib_ucontext *c2_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + return 0; +} + +static int c2_dealloc_ucontext(struct ib_ucontext *context) +{ + return 0; +} + +static int c2_mmap_uar(struct ib_ucontext *context, + struct vm_area_struct *vma) +{ + return 0; +} + +static struct ib_pd *c2_alloc_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + return 0; +} + +static int c2_dealloc_pd(struct ib_pd *pd) +{ + return 0; +} + +static struct ib_ah *c2_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + return 0; +} + +static int c2_ah_destroy(struct ib_ah *ah) +{ + return 0; +} + +static struct ib_qp *c2_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata) +{ + return 0; +} + +static int c2_destroy_qp(struct ib_qp *qp) +{ + return 0; +} + +static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + return ERR_PTR(0); +} + +static int c2_destroy_cq(struct ib_cq *cq) +{ + return 0; +} + +static struct ib_mr *c2_get_dma_mr(struct 
ib_pd *pd, int acc) +{ + return 0; +} + +static struct ib_mr *c2_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + return 0; +} + +static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, + int acc, struct ib_udata *udata) +{ + return ERR_PTR(0); +} + +static int c2_dereg_mr(struct ib_mr *mr) +{ + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); + return sprintf(buf, "%x\n", 1); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); + return sprintf(buf, "%x.%x.%x\n", 1,2,3); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + return sprintf(buf, "AMSO1100\n"); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); + return sprintf(buf, "%.*s\n", 32, "AMSO1100 Board ID"); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *c2_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + +int c2_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + return -ENOSYS; +} + +int c2_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + return 0; +} +int c2_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return -ENOSYS; +} + +int c2_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return -ENOSYS; +} + +int c2_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 
port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + return ENOSYS; +} + +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +{ + return 0; +} + +int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + return 0; +} + +int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + return 0; +} + + +int c2_register_device(struct c2_dev *dev) +{ + int ret; + int i; + + strlcpy(dev->ibdev.name, "amso%d", IB_DEVICE_NAME_MAX); + dev->ibdev.owner = THIS_MODULE; + + dev->ibdev.node_type = IB_NODE_CA; + dev->ibdev.phys_port_cnt = 1; + dev->ibdev.dma_device = &dev->pcidev->dev; + dev->ibdev.class_dev.dev = &dev->pcidev->dev; + dev->ibdev.query_device = c2_query_device; + dev->ibdev.query_port = c2_query_port; + dev->ibdev.modify_port = c2_modify_port; + dev->ibdev.query_pkey = c2_query_pkey; + dev->ibdev.query_gid = c2_query_gid; + dev->ibdev.alloc_ucontext = c2_alloc_ucontext; + dev->ibdev.dealloc_ucontext = c2_dealloc_ucontext; + dev->ibdev.mmap = c2_mmap_uar; + dev->ibdev.alloc_pd = c2_alloc_pd; + dev->ibdev.dealloc_pd = c2_dealloc_pd; + dev->ibdev.create_ah = c2_ah_create; + dev->ibdev.destroy_ah = c2_ah_destroy; + dev->ibdev.create_qp = c2_create_qp; + dev->ibdev.modify_qp = c2_modify_qp; + dev->ibdev.destroy_qp = c2_destroy_qp; + dev->ibdev.create_cq = c2_create_cq; + dev->ibdev.destroy_cq = c2_destroy_cq; + dev->ibdev.poll_cq = c2_poll_cq; + dev->ibdev.get_dma_mr = c2_get_dma_mr; + dev->ibdev.reg_phys_mr = c2_reg_phys_mr; + dev->ibdev.reg_user_mr = c2_reg_user_mr; + dev->ibdev.dereg_mr = c2_dereg_mr; + + dev->ibdev.alloc_fmr = 0; + dev->ibdev.unmap_fmr = 0; + dev->ibdev.dealloc_fmr = 0; + dev->ibdev.map_phys_fmr = 0; + + dev->ibdev.attach_mcast = c2_multicast_attach; + dev->ibdev.detach_mcast = c2_multicast_detach; + dev->ibdev.process_mad = c2_process_mad; + + dev->ibdev.req_notify_cq = c2_arm_cq; + dev->ibdev.post_send = 
c2_post_send; + dev->ibdev.post_recv = c2_post_receive; + + ret = ib_register_device(&dev->ibdev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(c2_class_attributes); ++i) { + ret = class_device_create_file(&dev->ibdev.class_dev, + c2_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ibdev); + return ret; + } + } + return 0; +} + +void c2_unregister_device(struct c2_dev *dev) +{ + ib_unregister_device(&dev->ibdev); +} Index: amso1100/c2_provider.h =================================================================== --- amso1100/c2_provider.h (revision 0) +++ amso1100/c2_provider.h (revision 0) @@ -0,0 +1,134 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef C2_PROVIDER_H +#define C2_PROVIDER_H + +#include +#include + +#define C2_MPT_FLAG_ATOMIC (1 << 14) +#define C2_MPT_FLAG_REMOTE_WRITE (1 << 13) +#define C2_MPT_FLAG_REMOTE_READ (1 << 12) +#define C2_MPT_FLAG_LOCAL_WRITE (1 << 11) +#define C2_MPT_FLAG_LOCAL_READ (1 << 10) + +struct c2_buf_list { + void *buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct c2_uar { + unsigned long pfn; + int index; +}; + +struct c2_user_db_table; + +struct c2_ucontext { + struct ib_ucontext ibucontext; + struct c2_uar uar; + struct c2_user_db_table *db_tab; +}; + +struct c2_mtt; + +struct c2_mr { + struct ib_mr ibmr; +}; + +struct c2_pd { + struct ib_pd ibpd; +}; + +struct c2_av; + +enum c2_ah_type { + C2_AH_ON_HCA, + C2_AH_PCI_POOL, + C2_AH_KMALLOC +}; + +struct c2_ah { + struct ib_ah ibah; +}; + +struct c2_cq { + struct ib_cq ibcq; +}; + +struct c2_wq { + spinlock_t lock; +}; + +struct c2_qp { + struct ib_qp ibqp; +}; + +static inline struct c2_dev *to_dev(struct ib_device* ibdev) +{ + return container_of(ibdev, struct c2_dev, ibdev); +} + +#if 0 +static inline struct c2_ucontext *to_mucontext(struct ib_ucontext *ibucontext) +{ + return container_of(ibucontext, struct c2_ucontext, ibucontext); +} + +static inline struct c2_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct c2_mr, ibmr); +} + +static inline struct c2_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct c2_pd, ibpd); +} + +static inline struct c2_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct c2_ah, ibah); +} + +static inline struct c2_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct c2_cq, ibcq); +} + +static inline struct c2_qp *to_mqp(struct ib_qp 
*ibqp) +{ + return container_of(ibqp, struct c2_qp, ibqp); +} +#endif +#endif /* C2_PROVIDER_H */ Index: amso1100/TODO =================================================================== --- amso1100/TODO (revision 3074) +++ amso1100/TODO (working copy) @@ -1,3 +1,5 @@ +An X in the boxes below indicates the work is completed. A '-' indicates +it remains to be completed. [X] Kconfig: replace 'mthca' with 'ams1100' in help text :) @@ -3,5 +5,5 @@ [X] Replace all // comments with /* */. -[-] Why are members of cc_pci_regs_t and cc_adapter_pci_regs_t volatile? +[X] Why are members of cc_pci_regs_t and cc_adapter_pci_regs_t volatile? Volatile declarations are almost inevitably buggy. It's better to use ordered accessors (readl(), writel(), etc) or insert explicit @@ -20,7 +22,7 @@ ... }; -[-] Can cc_byteorder.h be eliminated? Most of the wrappers are +[X] Can cc_byteorder.h be eliminated? Most of the wrappers are definitely superfluous. Can the WR byte order ever change? ie are the cpu_to_wrXX() functions actually a useful abstraction? @@ -43,7 +45,7 @@ [-] cc_mq_common.c: BUMP is pretty inefficient, does a divide every time -[-] cc_qp_common.c: cc_memcpy8 corrupts FPU state, is it really needed? +[X] cc_qp_common.c: cc_memcpy8 corrupts FPU state, is it really needed? it's never called. Why is it declared in cc_mq_common.c? memcpy4 similarly corrupts state. If it's fixed to save CR0 and do clts, is it really faster than a normal memcpy (considering it also @@ -85,3 +87,6 @@ [-] Remove kDAT entry points [-] Remove superflouos common files/code + +[-] Boot firmware from flash instead of loading over PCI + From tom at ammasso.com Wed Aug 17 07:44:53 2005 From: tom at ammasso.com (Tom Tucker) Date: Wed, 17 Aug 2005 10:44:53 -0400 Subject: [openib-general] iWARP Driver Licensing Message-ID: <1124289893.3519.27.camel@trinity.ammasso.com> The licensing for the iWARP driver is Dual BSD/GPL. James L. 
pointed out that the MODULE_LICENSE string in the c2.c file incorrectly stated GPL only. I have updated this in the c2.c file. TomT From rolandd at cisco.com Wed Aug 17 07:35:04 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 17 Aug 2005 07:35:04 -0700 Subject: [openib-general] Re: Continue to experience problems in installing Gen2 on IA-32 In-Reply-To: (Weikuan Yu's message of "Wed, 17 Aug 2005 08:24:03 -0400") References: <200508111932.j7BJWROj026274@xi.cse.ohio-state.edu> <482FEE13-0AB4-11DA-80CD-000D932C3754@cse.ohio-state.edu> <52k6isnm21.fsf@cisco.com> <52d5ofi61v.fsf@cisco.com> Message-ID: <52slx8fx7r.fsf@cisco.com> Weikuan> Thanks for the patch. This works for our adaptors too. Great, I will check it in to svn and send it upstream to the kernel. Weikuan> Out of my curiosity, it seems like this patch takes Weikuan> firmware parameters and feed it back as it is when Weikuan> INIT_IB, instead of assuming port width being (1x | Weikuan> 4x). So wonder now the logic, in this regard, is in place Weikuan> for future 8x (?) or 12x adaptors with the other two Weikuan> bits, right? Yes, assuming the software interface remains the same, 8X and 12X adapters should "just work." - R. From caitlin.bestler at gmail.com Wed Aug 17 07:42:30 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Wed, 17 Aug 2005 07:42:30 -0700 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: <469958e0050817074255ae0626@mail.gmail.com> Some clarifications are needed here. First the Consumer is responsible for draining the EVD after re-enabling it, or at least for remembering that there may be undrained notified events. That is "you-have-been-notified" is a sticky boolean attribute that the Consumer is supposed to set to TRUE when the upcall is made and only clear when the EVD has been drained *after* re-enabling. Second, is that the EVD is first and foremost an event *serializer*. 
It is presumed to have a finite number of resources for making upcalls (at most one for the typical case where SINGLE is enabled). The next upcall per resource CANNOT occur until after the current upcall has completed. Whether this should be solved in the DAT Provider is a question of what the verb-layer provider is allowed to do. If the verb layer provider can in fact generate multiple concurrent upcalls for the same CQ then the EVD itself must guard against re-entrancy. A more likely implementation is that upcalls triggered by post_se, CM events and CQs could theoretically occur at the same instant -- but that none of these paths can be re-entrant by themselves. Once the potential re-entrancy from the verb layer is known, then an optimal strategy can be selected. For example, if the only potential re-entrancy comes when the upcall interrupts a post_se call then some simple critical regions can avoid all problems without general purpose spinlocks or semaphores. On 8/16/05, James Lentini wrote: > > > On Tue, 16 Aug 2005, Guy German wrote: > > > >>>>>> Also, the pending_event_queue is only used for kDAPL generated > > >>>>>> software events. This queue can be empty when there are events on > > >>>>>> the CQ, so you would need to expand your check to cover > > >>>>>> that. > > >>>> > > >>>> Actually, even though I agreed before, I tend to disagree now. > > >>>> The consumer will still get the DTO events as soon as the CQ > > >>>> upcall is triggered (enabled), so the only problem is with the pending > > >>>> events list. > > >>> > > >>> Why is it an error for the consumer to modify the upcall policy > > >>> when there are pending events? > > >>> > > >>> dat_evd_modify_upcall should behave just like the IBTA spec's > > >>> Request Completion Notification verb in this respect. If there were > > >>> events on the EVD before the upcall is enabled, no upcall needs to > > >>> be generated. 
A correct consumer can easily work around this by > > >>> enabling the upcall and polling the EVD one final time to ensure it > > >>> is empty. > > >> > > >> There can be more than one event, and the consumer would need to > > >> dequeue many times. While the consumer would do his extra dequeue-ing > > >> he might also get an upcall, because his policy is now enabled. > > >> I can't think of a design that can handle such a case, and if there > > >> is one it is demanding and complicated, from the consumer's side. > > > > > > Isn't it the same position all event code written to the > > > OpenIB API is > > > in? > > > > I don't quite know what you are referring to, but if you are referring > > to the case of cq in IB - It's totally different: you only enable the cq > > once, so you will only get one upcall, and the rest of the events > > you will need to dequeue. > > The consumer should only receive one upcall at a time if the upcall > policy is DAT_UPCALL_SINGLE_INSTANCE. If the dequeues are performed in > an upcall, the logic needed in an OpenIB consumer and kDAPL consumer > is essentially the same. > > The difference is that the OpenIB consumer needs to re-enable the CQ > upcall and poll to make sure no events were missed. > > > > I agree with you that this programming model is difficult to use, > > > but I don't think it is impossible. > > > > I think it is a bad idea to dequeue events and at the same time > > receive upcalls from the same queue. It is racy, and has bad performance. > > I don't see *any* reason to do it. > > The current kDAPL implementation does create a situation in which an > upcall and poll occur simultaneously if the upcall is disabled, the > consumer enables the upcall, and then the consumer does a poll. In > this scenario an upcall can occur while the consumer is polling. I was > pointing out that this same race exists in the OpenIB verbs API (and > the IBTA verbs). 
> > Again, I agree that we can eliminate the additional poll after > enabling the upcall in kDAPL. We just need to do it in a way that is > not hardware specific. I believe we can use the same technique we did > in the DTO upcall. > > james > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From guyg at voltaire.com Wed Aug 17 09:02:13 2005 From: guyg at voltaire.com (Guy German) Date: Wed, 17 Aug 2005 19:02:13 +0300 (IDT) Subject: [openib-general][PATCH][kdapl]: FMR and EVD patch Message-ID: [kdapl]: this is the same FMR and EVD patch, sent before with modifications, after discussions with James. Signed-off-by: Guy German Index: ib/dapl_openib_util.c =================================================================== --- ib/dapl_openib_util.c (revision 3113) +++ ib/dapl_openib_util.c (working copy) @@ -138,7 +138,7 @@ int dapl_ib_mr_register_ia(struct dapl_i } int dapl_ib_mr_register_physical(struct dapl_ia *ia, struct dapl_lmr *lmr, - void *phys_addr, u64 length, + void *phys_addr, u32 page_count, enum dat_mem_priv_flags privileges) { int status; @@ -150,11 +150,11 @@ int dapl_ib_mr_register_physical(struct u64 *array; array = (u64 *) phys_addr; - buf_list = kmalloc(length * sizeof *buf_list, GFP_ATOMIC); + buf_list = kmalloc(page_count * sizeof *buf_list, GFP_ATOMIC); if (!buf_list) return -ENOMEM; - for (i = 0; i < length; i++) { + for (i = 0; i < page_count; i++) { buf_list[i].addr = array[i]; buf_list[i].size = PAGE_SIZE; } @@ -163,7 +163,7 @@ int dapl_ib_mr_register_physical(struct acl = dapl_ib_convert_mem_privileges(privileges); acl |= IB_ACCESS_MW_BIND; mr = ib_reg_phys_mr(((struct dapl_pz *)lmr->param.pz)->pd, - buf_list, length, acl, &iova); + buf_list, page_count, acl, &iova); kfree(buf_list); if (IS_ERR(mr)) { status = PTR_ERR(mr); @@ -186,13 +186,58 @@ int 
dapl_ib_mr_register_physical(struct lmr->param.lmr_context = mr->lkey; lmr->param.rmr_context = mr->rkey; - lmr->param.registered_size = length * PAGE_SIZE; + lmr->param.registered_size = page_count * PAGE_SIZE; lmr->param.registered_address = array[0]; lmr->mr = mr; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, "%s: (%p %d) got lkey 0x%x \n", __func__, - buf_list, length, lmr->param.lmr_context); + buf_list, page_count, lmr->param.lmr_context); + return 0; +} + +int dapl_ib_mr_register_fmr(struct dapl_ia *ia, struct dapl_lmr *lmr, + void *phys_addr, u32 page_count, + enum dat_mem_priv_flags privileges) +{ + /* FIXME: this phase-1 implementation of fmr doesn't take "privileges" + into account. This is a security breach. */ + u64 io_addr; + u64 *page_list; + struct ib_pool_fmr *mem; + int status; + + page_list = (u64 *)phys_addr; + io_addr = page_list[0]; + + mem = ib_fmr_pool_map_phys (((struct dapl_pz *)lmr->param.pz)->fmr_pool, + page_list, + page_count, + &io_addr); + if (IS_ERR(mem)) { + status = (int)PTR_ERR(mem); + if (status != -EAGAIN) + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + "fmr_pool_map_phys ret=%d <%d pages>\n", + status, page_count); + + lmr->param.registered_address = 0; + lmr->fmr = 0; + return status; + } + + lmr->param.lmr_context = mem->fmr->lkey; + lmr->param.rmr_context = mem->fmr->rkey; + lmr->param.registered_size = page_count * PAGE_SIZE; + lmr->param.registered_address = io_addr; + lmr->fmr = mem; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + "%s: (addr=%p size=0x%x) lkey=0x%x rkey=0x%x\n", __func__, + lmr->param.registered_address, + lmr->param.registered_size, + lmr->param.lmr_context, + lmr->param.rmr_context); return 0; } @@ -222,7 +267,10 @@ int dapl_ib_mr_deregister(struct dapl_lm { int status; - status = ib_dereg_mr(lmr->mr); + if (DAT_MEM_TYPE_PLATFORM == lmr->param.mem_type && lmr->fmr) + status = ib_fmr_pool_unmap(lmr->fmr); + else + status = ib_dereg_mr(lmr->mr); if (status < 0) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, " ib_dereg_mr error code return = %d\n", 
Index: ib/dapl_evd.c =================================================================== --- ib/dapl_evd.c (revision 3113) +++ ib/dapl_evd.c (working copy) @@ -42,19 +42,30 @@ static void dapl_evd_upcall_trigger(stru int status = 0; struct dat_event event; - /* Only process events if there is an enabled callback function. */ - if ((evd->upcall.upcall_func == (DAT_UPCALL_FUNC) NULL) || - (evd->upcall_policy == DAT_UPCALL_DISABLE)) + /* DAT_UPCALL_MANY is not supported */ + if (evd->is_triggered) return; - for (;;) { + spin_lock_irqsave (&evd->common.lock, evd->common.flags); + if (evd->is_triggered) { + spin_unlock_irqrestore (&evd->common.lock, evd->common.flags); + return; + } + evd->is_triggered = 1; + spin_unlock_irqrestore (&evd->common.lock, evd->common.flags); + /* Only process events if there is an enabled callback function */ + while ((evd->upcall.upcall_func != (DAT_UPCALL_FUNC)NULL) && + (evd->upcall_policy != DAT_UPCALL_TEARDOWN) && + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { status = dapl_evd_dequeue((struct dat_evd *)evd, &event); - if (0 != status) - return; - + if (status) + break; evd->upcall.upcall_func(evd->upcall.instance_data, &event, FALSE); } + evd->is_triggered = 0; + + return; } static void dapl_evd_eh_print_wc(struct ib_wc *wc) @@ -163,6 +174,7 @@ static struct dapl_evd *dapl_evd_alloc(s evd->cq = NULL; atomic_set(&evd->evd_ref_count, 0); evd->catastrophic_overflow = FALSE; + evd->is_triggered = 0; evd->qlen = qlen; evd->upcall_policy = upcall_policy; if ( NULL != upcall ) @@ -798,25 +810,28 @@ static void dapl_evd_dto_callback(struct overflow); /* - * This function does not dequeue from the CQ; only the consumer - * can do that. It rearms the completion only if completions should - * always occur. 
+ * This function does not dequeue from the CQ; + * It rearms the completion only if the consumer did not + * disable the upcall policy (in order to dequeu the rest + * of the events himself) */ - if (!overflow && evd->upcall_policy != DAT_UPCALL_DISABLE) { - /* - * Re-enable callback, *then* trigger. - * This guarantees we won't miss any events. - */ - status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); - if (0 != status) - (void)dapl_evd_post_async_error_event( - evd->common.owner_ia->async_error_evd, - DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, - evd->common.owner_ia); - + if (!overflow) { dapl_evd_upcall_trigger(evd); + if ((evd->upcall_policy != DAT_UPCALL_TEARDOWN) && + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); + if (status) + (void)dapl_evd_post_async_error_event( + evd->common.owner_ia->async_error_evd, + DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, + evd->common.owner_ia); + else + dapl_evd_upcall_trigger(evd); + } } + else + dapl_dbg_log(DAPL_DBG_TYPE_ERR, "%s: evd %p overflowed\n",evd); dapl_dbg_log(DAPL_DBG_TYPE_RTN, "%s() returns\n", __func__); } @@ -868,10 +883,11 @@ int dapl_evd_internal_create(struct dapl /* reset the qlen in the attributes, it may have changed */ evd->qlen = evd->cq->cqe; + if ((evd->upcall_policy != DAT_UPCALL_TEARDOWN) && + (evd->upcall_policy != DAT_UPCALL_DISABLE)) + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); - status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); - - if (status != 0) + if (status) goto bail; } @@ -879,14 +895,14 @@ int dapl_evd_internal_create(struct dapl * the EVD */ status = dapl_evd_event_alloc(evd); - if (status != 0) + if (status) goto bail; dapl_ia_link_evd(ia, evd); *evd_ptr = evd; bail: - if (status != 0) + if (status) if (evd) dapl_evd_dealloc(evd); @@ -1012,15 +1028,40 @@ int dapl_evd_modify_upcall(struct dat_ev const struct dat_upcall_object *upcall) { struct dapl_evd *evd; - - dapl_dbg_log(DAPL_DBG_TYPE_API, "dapl_modify_upcall(%p)\n", 
evd_handle); + int status = 0; + int pending_events; evd = (struct dapl_evd *)evd_handle; + dapl_dbg_log (DAPL_DBG_TYPE_API, "%s: (evd=%p, upcall_policy=%d)\n", + __func__, evd_handle, upcall_policy); + spin_lock_irqsave(&evd->common.lock, evd->common.flags); + if ((upcall_policy != DAT_UPCALL_TEARDOWN) && + (upcall_policy != DAT_UPCALL_DISABLE)) { + pending_events = dapl_rbuf_count(&evd->pending_event_queue); + if (pending_events) { + dapl_dbg_log (DAPL_DBG_TYPE_WARN, + "%s: (evd %p) there are still %d pending " + "events in the queue - policy stays disabled\n", + __func__, evd_handle, pending_events); + status = -EBUSY; + goto bail; + } + if (evd->evd_flags & DAT_EVD_DTO_FLAG) { + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); + if (status) { + printk(KERN_ERR "%s: dapls_ib_completion_notify" + " failed (status=0x%x) \n",__func__, + status); + goto bail; + } + } + } evd->upcall_policy = upcall_policy; evd->upcall = *upcall; - - return 0; +bail: + spin_unlock_irqrestore(&evd->common.lock, evd->common.flags); + return status; } int dapl_evd_post_se(struct dat_evd *evd_handle, const struct dat_event *event) @@ -1090,7 +1131,6 @@ int dapl_evd_dequeue(struct dat_evd *evd bail: spin_unlock_irqrestore(&evd->common.lock, evd->common.flags); - dapl_dbg_log(DAPL_DBG_TYPE_RTN, "dapl_evd_dequeue () returns 0x%x\n", status); Index: ib/dapl_openib_util.h =================================================================== --- ib/dapl_openib_util.h (revision 3113) +++ ib/dapl_openib_util.h (working copy) @@ -84,9 +84,13 @@ int dapl_ib_mr_register_ia(struct dapl_i enum dat_mem_priv_flags privileges); int dapl_ib_mr_register_physical(struct dapl_ia *ia, struct dapl_lmr *lmr, - void *phys_addr, u64 length, + void *phys_addr, u32 page_count, enum dat_mem_priv_flags privileges); +int dapl_ib_mr_register_fmr(struct dapl_ia *ia, struct dapl_lmr *lmr, + void *phys_addr, u32 page_count, + enum dat_mem_priv_flags privileges); + int dapl_ib_mr_deregister(struct dapl_lmr *lmr); int 
dapl_ib_mr_register_shared(struct dapl_ia *ia, struct dapl_lmr *lmr,
Index: ib/dapl.h
===================================================================
--- ib/dapl.h (revision 3113)
+++ ib/dapl.h (working copy)
@@ -40,9 +40,9 @@
 #include
 
 #include
-
-#include "ib_verbs.h"
-#include "ib_cm.h"
+#include
+#include
+#include
 
 /*********************************************************************
  *                                                                   *
@@ -173,6 +173,7 @@ struct dapl_evd {
 	struct dapl_ring_buffer pending_event_queue;
 	enum dat_upcall_policy upcall_policy;
 	struct dat_upcall_object upcall;
+	int is_triggered;
 };
 
 struct dapl_ep {
@@ -229,6 +230,7 @@ struct dapl_pz {
 	struct list_head list;
 	struct ib_pd *pd;
 	atomic_t pz_ref_count;
+	struct ib_fmr_pool *fmr_pool;
 };
 
 struct dapl_lmr {
@@ -237,6 +239,7 @@ struct dapl_lmr {
 	struct list_head list;
 	struct dat_lmr_param param;
 	struct ib_mr *mr;
+	struct ib_pool_fmr *fmr;
 	atomic_t lmr_ref_count;
 };
 
@@ -628,4 +631,6 @@ extern void dapl_dbg_log(enum dapl_dbg_t
 #define dapl_dbg_log(...)
 #endif /* KDAPL_INFINIBAND_DEBUG */
 
+extern void set_fmr_params (struct ib_fmr_pool_param *fmr_param_s);
+extern unsigned int g_dapl_active_fmr;
 #endif /* DAPL_H */
Index: ib/dapl_pz.c
===================================================================
--- ib/dapl_pz.c (revision 3113)
+++ ib/dapl_pz.c (working copy)
@@ -29,7 +29,7 @@
 /*
  * $Id$
  */
-
+#include
 #include "dapl.h"
 #include "dapl_ia.h"
 #include "dapl_openib_util.h"
@@ -89,7 +89,17 @@ int dapl_pz_create(struct dat_ia *ia, st
 			status);
 		goto error2;
 	}
-
+
+	if (g_dapl_active_fmr) {
+		struct ib_fmr_pool_param params;
+		set_fmr_params (&params);
+		dapl_pz->fmr_pool = ib_create_fmr_pool(dapl_pz->pd, &params);
+		if (IS_ERR(dapl_pz->fmr_pool))
+			dapl_dbg_log(DAPL_DBG_TYPE_WARN,
+				     "could not create FMR pool <%ld>",
+				     PTR_ERR(dapl_pz->fmr_pool));
+	}
+
 	*pz = (struct dat_pz *)dapl_pz;
 	return 0;
 
@@ -104,7 +114,7 @@ error1:
 int dapl_pz_free(struct dat_pz *pz)
 {
 	struct dapl_pz *dapl_pz;
-	int status;
+	int status=0;
 
 	dapl_dbg_log(DAPL_DBG_TYPE_API,
"dapl_pz_free(%p)\n", pz); @@ -114,8 +124,15 @@ int dapl_pz_free(struct dat_pz *pz) status = -EINVAL; goto error; } - - status = ib_dealloc_pd(dapl_pz->pd); + + if (g_dapl_active_fmr && dapl_pz->fmr_pool) { + (void)ib_destroy_fmr_pool(dapl_pz->fmr_pool); + dapl_pz->fmr_pool = NULL; + } + + if (dapl_pz->pd) + status = ib_dealloc_pd(dapl_pz->pd); + if (status) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, "ib_dealloc_pd failed: %X\n", status); Index: ib/dapl_ia.c =================================================================== --- ib/dapl_ia.c (revision 3113) +++ ib/dapl_ia.c (working copy) @@ -115,7 +115,6 @@ static int dapl_ia_abrupt_close(struct d * when we run out of entries, or when we get back to the head * if we end up skipping an entry. */ - list_for_each_entry(rmr, &ia->rmr_list, list) { status = dapl_rmr_free((struct dat_rmr *)rmr); if (status != 0) @@ -196,7 +195,6 @@ static int dapl_ia_abrupt_close(struct d "ia_close(ABRUPT): psp_free(%p) returns %x\n", sp, status); } - list_for_each_entry(pz, &ia->pz_list, list) { status = dapl_pz_free((struct dat_pz *)pz); if (status != 0) @@ -266,7 +264,6 @@ static int dapl_ia_graceful_close(struct int status = 0; int cur_status; struct dapl_evd *evd; - if (!list_empty(&ia->rmr_list) || !list_empty(&ia->rsp_list) || !list_empty(&ia->ep_list) || @@ -745,7 +742,8 @@ int dapl_ia_query(struct dat_ia *ia_ptr, provider_attr->provider_version_major = DAPL_PROVIDER_MAJOR; provider_attr->provider_version_minor = DAPL_PROVIDER_MINOR; provider_attr->lmr_mem_types_supported = - DAT_MEM_TYPE_PHYSICAL | DAT_MEM_TYPE_IA; + DAT_MEM_TYPE_PHYSICAL | DAT_MEM_TYPE_IA | + DAT_MEM_TYPE_PLATFORM; provider_attr->iov_ownership_on_return = DAT_IOV_CONSUMER; provider_attr->dat_qos_supported = DAT_QOS_BEST_EFFORT; provider_attr->completion_flags_supported = Index: ib/dapl_provider.c =================================================================== --- ib/dapl_provider.c (revision 3113) +++ ib/dapl_provider.c (working copy) @@ -49,8 +49,17 @@ 
MODULE_AUTHOR("James Lentini"); #ifdef CONFIG_KDAPL_INFINIBAND_DEBUG static DAPL_DBG_MASK g_dapl_dbg_mask = 0; module_param_named(dbg_mask, g_dapl_dbg_mask, int, 0644); -MODULE_PARM_DESC(dbg_mask, "Bitmask to enable debug message types."); +MODULE_PARM_DESC(dbg_mask, "Bitmask to enable debug message types. "); #endif /* CONFIG_KDAPL_INFINIBAND_DEBUG */ +unsigned int g_dapl_active_fmr = 1; +static unsigned int g_dapl_pool_size = 2048; +static unsigned int g_dapl_max_pages_per_fmr = 64; +module_param_named(active_fmr, g_dapl_active_fmr, int, 0644); +module_param_named(pool_size, g_dapl_pool_size, int, 0644); +module_param_named(max_pages_per_fmr, g_dapl_max_pages_per_fmr, int, 0644); +MODULE_PARM_DESC(active_fmr, "if active_fmr==1, creates fmr pool in pz_create "); +MODULE_PARM_DESC(pool_size, "num of fmr handles in pool "); +MODULE_PARM_DESC(max_pages_per_fmr, "max size (in pages) of an fmr handle "); static LIST_HEAD(g_dapl_provider_list); @@ -152,6 +161,18 @@ void dapl_dbg_log(enum dapl_dbg_type typ #endif /* KDAPL_INFINIBAND_DEBUG */ +void set_fmr_params (struct ib_fmr_pool_param *fmr_params) +{ + fmr_params->max_pages_per_fmr = g_dapl_max_pages_per_fmr; + fmr_params->pool_size = g_dapl_pool_size; + fmr_params->dirty_watermark = 32; + fmr_params->cache = 1; + fmr_params->flush_function = NULL; + fmr_params->access = (IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE | + IB_ACCESS_REMOTE_READ); +} + static struct dapl_provider *dapl_provider_alloc(const char *name, struct ib_device *device, u8 port) Index: ib/dapl_lmr.c =================================================================== --- ib/dapl_lmr.c (revision 3113) +++ ib/dapl_lmr.c (working copy) @@ -126,7 +126,7 @@ error1: static inline int dapl_lmr_create_physical(struct dapl_ia *ia, union dat_region_description phys_addr, - u64 page_count, + u64 length, enum dat_mem_type mem_type, struct dapl_pz *pz, enum dat_mem_priv_flags privileges, @@ -137,8 +137,10 @@ static inline int dapl_lmr_create_physic u64 
*registered_address) { struct dapl_lmr *new_lmr; - int status; + int status = 0; + u32 page_count; + page_count = (u32)length; new_lmr = dapl_lmr_alloc(ia, mem_type, phys_addr, page_count, (struct dat_pz *) pz, privileges); @@ -149,15 +151,24 @@ static inline int dapl_lmr_create_physic if (DAT_MEM_TYPE_IA == mem_type) { status = dapl_ib_mr_register_ia(ia, new_lmr, phys_addr, - page_count, privileges); + length, privileges); } - else { + else if (DAT_MEM_TYPE_PHYSICAL == mem_type) { status = dapl_ib_mr_register_physical(ia, new_lmr, phys_addr.for_array, page_count, privileges); } + else if (DAT_MEM_TYPE_PLATFORM == mem_type) { + status = dapl_ib_mr_register_fmr(ia, new_lmr, + phys_addr.for_array, + page_count, privileges); + } + else { + status = -EINVAL; + goto error1; + } - if (0 != status) + if (status) goto error2; atomic_inc(&pz->pz_ref_count); @@ -258,6 +269,11 @@ int dapl_lmr_kcreate(struct dat_ia *ia, rmr_context, registered_length, registered_address); break; + case DAT_MEM_TYPE_PLATFORM: /* used as a proprietary Mellanox-FMR */ + if (!g_dapl_active_fmr) { + status = -EINVAL; + break; + } case DAT_MEM_TYPE_PHYSICAL: case DAT_MEM_TYPE_IA: status = dapl_lmr_create_physical(dapl_ia, region_description, @@ -307,6 +323,7 @@ int dapl_lmr_free(struct dat_lmr *lmr) switch (dapl_lmr->param.mem_type) { case DAT_MEM_TYPE_PHYSICAL: + case DAT_MEM_TYPE_PLATFORM: case DAT_MEM_TYPE_IA: case DAT_MEM_TYPE_LMR: { From eitan at mellanox.co.il Wed Aug 17 12:18:25 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 17 Aug 2005 22:18:25 +0300 Subject: [openib-general] [PATCH] osm: export more complib functions and opensm headers Message-ID: <86ll301if2.fsf@mtl066.yok.mtl.com> Hi Hal There were few missing complib functions in the libosmcomp.map. 
Also the following two header files were missing from the opensm
distribution list: osm_msgdef.h and osm_helper.h

I tested the patch on: 2.6.12.3-smp SuSE Linux 9.3 (i586)

Signed-off-by: Eitan Zahavi

Index: complib/libosmcomp.map
===================================================================
--- complib/libosmcomp.map (revision 3122)
+++ complib/libosmcomp.map (working copy)
@@ -207,5 +207,14 @@ OSMCOMP_1.0 {
 		ib_error_str;
 		ib_async_event_str;
 		ib_wc_status_str;
+		cl_atomic_dec;
+		cl_free;
+		cl_malloc;
+		cl_perf_construct;
+		cl_perf_destroy;
+		cl_perf_display;
+		cl_perf_init;
+		cl_perf_reset;
+		cl_zalloc;
 	local: *;
 };
Index: opensm/Makefile.am
===================================================================
--- opensm/Makefile.am (revision 3122)
+++ opensm/Makefile.am (working copy)
@@ -75,11 +75,15 @@ opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdi
 opensmincludedir = $(includedir)/infiniband/opensm
 
 opensminclude_HEADERS = $(srcdir)/../include/opensm/osm_base.h \
+	$(srcdir)/../include/opensm/osm_msgdef.h \
+	$(srcdir)/../include/opensm/osm_helper.h \
 	$(srcdir)/../include/opensm/osm_log.h \
 	$(srcdir)/../include/opensm/osm_madw.h \
 	$(srcdir)/../include/opensm/osm_mad_pool.h
 
 EXTRA_DIST = $(srcdir)/../include/opensm/osm_base.h \
-	$(srcdir)/../include/opensm/osm_log.h \
-	$(srcdir)/../include/opensm/osm_madw.h \
-	$(srcdir)/../include/opensm/osm_mad_pool.h
+	$(srcdir)/../include/opensm/osm_helper.h \
+	$(srcdir)/../include/opensm/osm_msgdef.h \
+	$(srcdir)/../include/opensm/osm_log.h \
+	$(srcdir)/../include/opensm/osm_madw.h \
+	$(srcdir)/../include/opensm/osm_mad_pool.h

From mst at mellanox.co.il Wed Aug 17 13:51:07 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 17 Aug 2005 23:51:07 +0300
Subject: [openib-general] Re: [ANNOUNCE] 2.6.9 backport patches
In-Reply-To:
References: <20050814164056.GU23848@mellanox.co.il>
Message-ID: <20050817205107.GA3130@mellanox.co.il>

Quoting r.
Bob Woodruff :
> Subject: RE: [ANNOUNCE] 2.6.9 backport patches
>
> Michael wrote,
> >Hi!
> >Backport patches to trunk that enable support for RHEL4.0 (2.6.9) and
> >SuSE9.3 (2.6.11), can now be found under
> >https://openib.org/svn/gen2/branches/backport/2.6.9
> >and
> >https://openib.org/svn/gen2/branches/backport/2.6.11
>
> >These patches do not touch the kernel source outside the infiniband
> >directory, and you don't need to reboot after you apply them.
>
> Cool!! I tried these out on x86_64 and Itanium (so far) and from the initial
> tests I have done so far, I have seen no problems. It is great that you
> provided some patches that do not require any kernel mods, which allows
> one to build .ko's that I can load on the stock redhat EL4.0 kernel.
>
> I did notice a couple of things.
> 1.) There is no backport patch for SRP, which I had worked around by
> exporting scsi_scan_target in the kernel. Not sure how best to fix it
> with a patch that does not require any kernel changes. Perhaps Roland
> could recommend something.

On Intel we can use the hack of reading /proc/kallsyms and casting
the address to a function pointer. I gather this doesn't work for some other
arches.

> 2.) You might want to include a patch for the kernel/drivers Kconfig and
> Makefile, see below. This is the only thing I had to touch outside the
> infiniband directory to build the code.

Actually, I don't do this; I use the simple script below.
This means I have to tweak config options manually instead of using a menu,
but oh well.

#!/usr/bin/make -f

openib_mak_all: openib modules_post

include ./Makefile

.PHONY: modules_post openib

modules_post:
	@echo ' Building modules, stage 2.';
	$(Q)$(MAKE) -rR -f $(srctree)/scripts/Makefile.modpost

openib:
	@echo ' Building modules, stage 1.';
	$(Q)$(MAKE) -f ./Makefile drivers/infiniband/\
		CONFIG_INFINIBAND=m\
		CONFIG_INFINIBAND_MTHCA=m\
		CONFIG_INFINIBAND_IPOIB=m\
		CONFIG_INFINIBAND_SDP=m\
		CONFIG_INFINIBAND_SRP=m\
		CONFIG_INFINIBAND_SRP_DEBUG=m

--
MST

From boas1 at llnl.gov Tue Aug 16 22:35:23 2005
From: boas1 at llnl.gov (Bill Boas)
Date: Tue, 16 Aug 2005 22:35:23 -0700
Subject: [openib-general] Datacenter Fabric Workshop Registrations as of Tues Aug 16 9.00AM PDT
Message-ID: <6.2.1.2.2.20050816221322.02be9108@mail-lc.llnl.gov>

To OpenIB members, contributors and participants,

Here's the registration list we have: with 6 days to go, it's 108
participants. We have facilities for 250. We soon have to confirm numbers
for breakfast, lunch and the dinner reception - yes, we are serving dinner
after the Exec Panel!

That's good, but it could be better. Some of our key contributors are not
yet registered, but we hope they are planning to attend.

WHAT CAN YOU DO TO HELP?

Each and every one of us, please look carefully at who is registered. When
you find someone not on the list who you think should be, email them and ask
them to confirm they are attending and to register through www.openib.org.

For those attending, please make sure you thank Allyson Klein and Nicole
Johnston from Intel, who have supported our workshop so strongly and
contributed IDF facilities and resources, and lots of their time; Emily
Backus of Owenmedia for all the publicity, email blasts AND THE SHIRTS you
will receive when you register; and Marci Chase of GPJ, the IDF event
organizers, for the logistics of the facilities, badge pick-up and the FOOD
and refreshments. You will see them all at the Workshop; send emails to
their addresses below.

Again, we ask that you check out the attached registration list and contact
those you know who should be registered but are not so far.

Thank you.

Bill.

>Date: Tue, 16 Aug 2005 09:32:13 -0700
>From: Emily Backus
>Subject: Datacenter Fabric Workshop Ticket Orders.xls
>To: "'Klein, Allyson'"
>Cc: "'Johnston, Nicole'" , Marci.Chase at gpjco.com
>X-Mailer: Microsoft Outlook, Build 10.0.2627
>X-Scanned-By: MIMEDefang 2.39
>
>All -
>
>Updated registration grid is attached. We picked up 17 registrations
>yesterday for a grand total of 105. As usual, new people are in turquoise.
>
>Thanks,
>Emily
>
>

Bill Boas                  bboas at llnl.gov
ICCD LLNL, B-453, R-2018   Wk: 925-422-4110
7000 East Ave, L-555       Cell: 925-337-2224
Livermore, CA 94551        Pgr: 877-203-2248
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Datacenter Fabric Workshop Ticket Orders28.xls
Type: application/vnd.ms-excel
Size: 112640 bytes
Desc: not available
URL:

From jlentini at netapp.com Wed Aug 17 14:21:18 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 17 Aug 2005 17:21:18 -0400 (EDT)
Subject: [openib-general][PATCH][kdapl]: FMR and EVD patch
In-Reply-To:
References:
Message-ID:

Hi Guy,

I've committed all but one of the EVD changes in revision 3126. I still
need a little time to review all of your changes to
dapl_evd_modify_upcall and the FMR implementation.

james

On Wed, 17 Aug 2005, Guy German wrote:

> [kdapl]: this is the same FMR and EVD patch, sent before
> with modifications, after discussions with James.
> > Signed-off-by: Guy German > > Index: ib/dapl_openib_util.c > =================================================================== > --- ib/dapl_openib_util.c (revision 3113) > +++ ib/dapl_openib_util.c (working copy) > @@ -138,7 +138,7 @@ int dapl_ib_mr_register_ia(struct dapl_i > } > > int dapl_ib_mr_register_physical(struct dapl_ia *ia, struct dapl_lmr *lmr, > - void *phys_addr, u64 length, > + void *phys_addr, u32 page_count, > enum dat_mem_priv_flags privileges) > { > int status; > @@ -150,11 +150,11 @@ int dapl_ib_mr_register_physical(struct > u64 *array; > > array = (u64 *) phys_addr; > - buf_list = kmalloc(length * sizeof *buf_list, GFP_ATOMIC); > + buf_list = kmalloc(page_count * sizeof *buf_list, GFP_ATOMIC); > if (!buf_list) > return -ENOMEM; > > - for (i = 0; i < length; i++) { > + for (i = 0; i < page_count; i++) { > buf_list[i].addr = array[i]; > buf_list[i].size = PAGE_SIZE; > } > @@ -163,7 +163,7 @@ int dapl_ib_mr_register_physical(struct > acl = dapl_ib_convert_mem_privileges(privileges); > acl |= IB_ACCESS_MW_BIND; > mr = ib_reg_phys_mr(((struct dapl_pz *)lmr->param.pz)->pd, > - buf_list, length, acl, &iova); > + buf_list, page_count, acl, &iova); > kfree(buf_list); > if (IS_ERR(mr)) { > status = PTR_ERR(mr); > @@ -186,13 +186,58 @@ int dapl_ib_mr_register_physical(struct > > lmr->param.lmr_context = mr->lkey; > lmr->param.rmr_context = mr->rkey; > - lmr->param.registered_size = length * PAGE_SIZE; > + lmr->param.registered_size = page_count * PAGE_SIZE; > lmr->param.registered_address = array[0]; > lmr->mr = mr; > > dapl_dbg_log(DAPL_DBG_TYPE_UTIL, > "%s: (%p %d) got lkey 0x%x \n", __func__, > - buf_list, length, lmr->param.lmr_context); > + buf_list, page_count, lmr->param.lmr_context); > + return 0; > +} > + > +int dapl_ib_mr_register_fmr(struct dapl_ia *ia, struct dapl_lmr *lmr, > + void *phys_addr, u32 page_count, > + enum dat_mem_priv_flags privileges) > +{ > + /* FIXME: this phase-1 implementation of fmr doesn't take "privileges" > + into 
account. This is a security breech. */ > + u64 io_addr; > + u64 *page_list; > + struct ib_pool_fmr *mem; > + int status; > + > + page_list = (u64 *)phys_addr; > + io_addr = page_list[0]; > + > + mem = ib_fmr_pool_map_phys (((struct dapl_pz *)lmr->param.pz)->fmr_pool, > + page_list, > + page_count, > + &io_addr); > + if (IS_ERR(mem)) { > + status = (int)PTR_ERR(mem); > + if (status != -EAGAIN) > + dapl_dbg_log(DAPL_DBG_TYPE_ERR, > + "fmr_pool_map_phys ret=%d <%d pages>\n", > + status, page_count); > + > + lmr->param.registered_address = 0; > + lmr->fmr = 0; > + return status; > + } > + > + lmr->param.lmr_context = mem->fmr->lkey; > + lmr->param.rmr_context = mem->fmr->rkey; > + lmr->param.registered_size = page_count * PAGE_SIZE; > + lmr->param.registered_address = io_addr; > + lmr->fmr = mem; > + > + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, > + "%s: (addr=%p size=0x%x) lkey=0x%x rkey=0x%x\n", __func__, > + lmr->param.registered_address, > + lmr->param.registered_size, > + lmr->param.lmr_context, > + lmr->param.rmr_context); > return 0; > } > > @@ -222,7 +267,10 @@ int dapl_ib_mr_deregister(struct dapl_lm > { > int status; > > - status = ib_dereg_mr(lmr->mr); > + if (DAT_MEM_TYPE_PLATFORM == lmr->param.mem_type && lmr->fmr) > + status = ib_fmr_pool_unmap(lmr->fmr); > + else > + status = ib_dereg_mr(lmr->mr); > if (status < 0) { > dapl_dbg_log(DAPL_DBG_TYPE_ERR, > " ib_dereg_mr error code return = %d\n", > Index: ib/dapl_evd.c > =================================================================== > --- ib/dapl_evd.c (revision 3113) > +++ ib/dapl_evd.c (working copy) > @@ -42,19 +42,30 @@ static void dapl_evd_upcall_trigger(stru > int status = 0; > struct dat_event event; > > - /* Only process events if there is an enabled callback function. 
*/ > - if ((evd->upcall.upcall_func == (DAT_UPCALL_FUNC) NULL) || > - (evd->upcall_policy == DAT_UPCALL_DISABLE)) > + /* DAT_UPCALL_MANY is not supported */ > + if (evd->is_triggered) > return; > > - for (;;) { > + spin_lock_irqsave (&evd->common.lock, evd->common.flags); > + if (evd->is_triggered) { > + spin_unlock_irqrestore (&evd->common.lock, evd->common.flags); > + return; > + } > + evd->is_triggered = 1; > + spin_unlock_irqrestore (&evd->common.lock, evd->common.flags); > + /* Only process events if there is an enabled callback function */ > + while ((evd->upcall.upcall_func != (DAT_UPCALL_FUNC)NULL) && > + (evd->upcall_policy != DAT_UPCALL_TEARDOWN) && > + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { > status = dapl_evd_dequeue((struct dat_evd *)evd, &event); > - if (0 != status) > - return; > - > + if (status) > + break; > evd->upcall.upcall_func(evd->upcall.instance_data, &event, > FALSE); > } > + evd->is_triggered = 0; > + > + return; > } > > static void dapl_evd_eh_print_wc(struct ib_wc *wc) > @@ -163,6 +174,7 @@ static struct dapl_evd *dapl_evd_alloc(s > evd->cq = NULL; > atomic_set(&evd->evd_ref_count, 0); > evd->catastrophic_overflow = FALSE; > + evd->is_triggered = 0; > evd->qlen = qlen; > evd->upcall_policy = upcall_policy; > if ( NULL != upcall ) > @@ -798,25 +810,28 @@ static void dapl_evd_dto_callback(struct > overflow); > > /* > - * This function does not dequeue from the CQ; only the consumer > - * can do that. It rearms the completion only if completions should > - * always occur. > + * This function does not dequeue from the CQ; > + * It rearms the completion only if the consumer did not > + * disable the upcall policy (in order to dequeu the rest > + * of the events himself) > */ > > - if (!overflow && evd->upcall_policy != DAT_UPCALL_DISABLE) { > - /* > - * Re-enable callback, *then* trigger. > - * This guarantees we won't miss any events. 
> - */ > - status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > - if (0 != status) > - (void)dapl_evd_post_async_error_event( > - evd->common.owner_ia->async_error_evd, > - DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, > - evd->common.owner_ia); > - > + if (!overflow) { > dapl_evd_upcall_trigger(evd); > + if ((evd->upcall_policy != DAT_UPCALL_TEARDOWN) && > + (evd->upcall_policy != DAT_UPCALL_DISABLE)) { > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > + if (status) > + (void)dapl_evd_post_async_error_event( > + evd->common.owner_ia->async_error_evd, > + DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR, > + evd->common.owner_ia); > + else > + dapl_evd_upcall_trigger(evd); > + } > } > + else > + dapl_dbg_log(DAPL_DBG_TYPE_ERR, "%s: evd %p overflowed\n",evd); > dapl_dbg_log(DAPL_DBG_TYPE_RTN, "%s() returns\n", __func__); > } > > @@ -868,10 +883,11 @@ int dapl_evd_internal_create(struct dapl > > /* reset the qlen in the attributes, it may have changed */ > evd->qlen = evd->cq->cqe; > + if ((evd->upcall_policy != DAT_UPCALL_TEARDOWN) && > + (evd->upcall_policy != DAT_UPCALL_DISABLE)) > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > > - status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > - > - if (status != 0) > + if (status) > goto bail; > } > > @@ -879,14 +895,14 @@ int dapl_evd_internal_create(struct dapl > * the EVD > */ > status = dapl_evd_event_alloc(evd); > - if (status != 0) > + if (status) > goto bail; > > dapl_ia_link_evd(ia, evd); > *evd_ptr = evd; > > bail: > - if (status != 0) > + if (status) > if (evd) > dapl_evd_dealloc(evd); > > @@ -1012,15 +1028,40 @@ int dapl_evd_modify_upcall(struct dat_ev > const struct dat_upcall_object *upcall) > { > struct dapl_evd *evd; > - > - dapl_dbg_log(DAPL_DBG_TYPE_API, "dapl_modify_upcall(%p)\n", evd_handle); > + int status = 0; > + int pending_events; > > evd = (struct dapl_evd *)evd_handle; > + dapl_dbg_log (DAPL_DBG_TYPE_API, "%s: (evd=%p, upcall_policy=%d)\n", > + __func__, evd_handle, upcall_policy); > > + 
spin_lock_irqsave(&evd->common.lock, evd->common.flags); > + if ((upcall_policy != DAT_UPCALL_TEARDOWN) && > + (upcall_policy != DAT_UPCALL_DISABLE)) { > + pending_events = dapl_rbuf_count(&evd->pending_event_queue); > + if (pending_events) { > + dapl_dbg_log (DAPL_DBG_TYPE_WARN, > + "%s: (evd %p) there are still %d pending " > + "events in the queue - policy stays disabled\n", > + __func__, evd_handle, pending_events); > + status = -EBUSY; > + goto bail; > + } > + if (evd->evd_flags & DAT_EVD_DTO_FLAG) { > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > + if (status) { > + printk(KERN_ERR "%s: dapls_ib_completion_notify" > + " failed (status=0x%x) \n",__func__, > + status); > + goto bail; > + } > + } > + } > evd->upcall_policy = upcall_policy; > evd->upcall = *upcall; > - > - return 0; > +bail: > + spin_unlock_irqrestore(&evd->common.lock, evd->common.flags); > + return status; > } > > int dapl_evd_post_se(struct dat_evd *evd_handle, const struct dat_event *event) > @@ -1090,7 +1131,6 @@ int dapl_evd_dequeue(struct dat_evd *evd > > bail: > spin_unlock_irqrestore(&evd->common.lock, evd->common.flags); > - > dapl_dbg_log(DAPL_DBG_TYPE_RTN, > "dapl_evd_dequeue () returns 0x%x\n", status); > > Index: ib/dapl_openib_util.h > =================================================================== > --- ib/dapl_openib_util.h (revision 3113) > +++ ib/dapl_openib_util.h (working copy) > @@ -84,9 +84,13 @@ int dapl_ib_mr_register_ia(struct dapl_i > enum dat_mem_priv_flags privileges); > > int dapl_ib_mr_register_physical(struct dapl_ia *ia, struct dapl_lmr *lmr, > - void *phys_addr, u64 length, > + void *phys_addr, u32 page_count, > enum dat_mem_priv_flags privileges); > > +int dapl_ib_mr_register_fmr(struct dapl_ia *ia, struct dapl_lmr *lmr, > + void *phys_addr, u32 page_count, > + enum dat_mem_priv_flags privileges); > + > int dapl_ib_mr_deregister(struct dapl_lmr *lmr); > > int dapl_ib_mr_register_shared(struct dapl_ia *ia, struct dapl_lmr *lmr, > Index: ib/dapl.h > 
===================================================================
> --- ib/dapl.h (revision 3113)
> +++ ib/dapl.h (working copy)
> @@ -40,9 +40,9 @@
>  #include
>
>  #include
> -
> -#include "ib_verbs.h"
> -#include "ib_cm.h"
> +#include
> +#include
> +#include
>
>  /*********************************************************************
>   *                                                                   *
> @@ -173,6 +173,7 @@ struct dapl_evd {
>  	struct dapl_ring_buffer pending_event_queue;
>  	enum dat_upcall_policy upcall_policy;
>  	struct dat_upcall_object upcall;
> +	int is_triggered;
>  };
>
>  struct dapl_ep {
> @@ -229,6 +230,7 @@ struct dapl_pz {
>  	struct list_head list;
>  	struct ib_pd *pd;
>  	atomic_t pz_ref_count;
> +	struct ib_fmr_pool *fmr_pool;
>  };
>
>  struct dapl_lmr {
> @@ -237,6 +239,7 @@ struct dapl_lmr {
>  	struct list_head list;
>  	struct dat_lmr_param param;
>  	struct ib_mr *mr;
> +	struct ib_pool_fmr *fmr;
>  	atomic_t lmr_ref_count;
>  };
>
> @@ -628,4 +631,6 @@ extern void dapl_dbg_log(enum dapl_dbg_t
>  #define dapl_dbg_log(...)
>  #endif /* KDAPL_INFINIBAND_DEBUG */
>
> +extern void set_fmr_params (struct ib_fmr_pool_param *fmr_param_s);
> +extern unsigned int g_dapl_active_fmr;
>  #endif /* DAPL_H */
> Index: ib/dapl_pz.c
> ===================================================================
> --- ib/dapl_pz.c (revision 3113)
> +++ ib/dapl_pz.c (working copy)
> @@ -29,7 +29,7 @@
>  /*
>   * $Id$
>   */
> -
> +#include
>  #include "dapl.h"
>  #include "dapl_ia.h"
>  #include "dapl_openib_util.h"
> @@ -89,7 +89,17 @@ int dapl_pz_create(struct dat_ia *ia, st
>  			status);
>  		goto error2;
>  	}
> -
> +
> +	if (g_dapl_active_fmr) {
> +		struct ib_fmr_pool_param params;
> +		set_fmr_params (&params);
> +		dapl_pz->fmr_pool = ib_create_fmr_pool(dapl_pz->pd, &params);
> +		if (IS_ERR(dapl_pz->fmr_pool))
> +			dapl_dbg_log(DAPL_DBG_TYPE_WARN,
> +				     "could not create FMR pool <%ld>",
> +				     PTR_ERR(dapl_pz->fmr_pool));
> +	}
> +
>  	*pz = (struct dat_pz *)dapl_pz;
>  	return 0;
>
> @@ -104,7 +114,7 @@ error1:
>  int dapl_pz_free(struct dat_pz *pz)
>  {
>  	struct
dapl_pz *dapl_pz; > - int status; > + int status=0; > > dapl_dbg_log(DAPL_DBG_TYPE_API, "dapl_pz_free(%p)\n", pz); > > @@ -114,8 +124,15 @@ int dapl_pz_free(struct dat_pz *pz) > status = -EINVAL; > goto error; > } > - > - status = ib_dealloc_pd(dapl_pz->pd); > + > + if (g_dapl_active_fmr && dapl_pz->fmr_pool) { > + (void)ib_destroy_fmr_pool(dapl_pz->fmr_pool); > + dapl_pz->fmr_pool = NULL; > + } > + > + if (dapl_pz->pd) > + status = ib_dealloc_pd(dapl_pz->pd); > + > if (status) { > dapl_dbg_log(DAPL_DBG_TYPE_ERR, "ib_dealloc_pd failed: %X\n", > status); > Index: ib/dapl_ia.c > =================================================================== > --- ib/dapl_ia.c (revision 3113) > +++ ib/dapl_ia.c (working copy) > @@ -115,7 +115,6 @@ static int dapl_ia_abrupt_close(struct d > * when we run out of entries, or when we get back to the head > * if we end up skipping an entry. > */ > - > list_for_each_entry(rmr, &ia->rmr_list, list) { > status = dapl_rmr_free((struct dat_rmr *)rmr); > if (status != 0) > @@ -196,7 +195,6 @@ static int dapl_ia_abrupt_close(struct d > "ia_close(ABRUPT): psp_free(%p) returns %x\n", > sp, status); > } > - > list_for_each_entry(pz, &ia->pz_list, list) { > status = dapl_pz_free((struct dat_pz *)pz); > if (status != 0) > @@ -266,7 +264,6 @@ static int dapl_ia_graceful_close(struct > int status = 0; > int cur_status; > struct dapl_evd *evd; > - > if (!list_empty(&ia->rmr_list) || > !list_empty(&ia->rsp_list) || > !list_empty(&ia->ep_list) || > @@ -745,7 +742,8 @@ int dapl_ia_query(struct dat_ia *ia_ptr, > provider_attr->provider_version_major = DAPL_PROVIDER_MAJOR; > provider_attr->provider_version_minor = DAPL_PROVIDER_MINOR; > provider_attr->lmr_mem_types_supported = > - DAT_MEM_TYPE_PHYSICAL | DAT_MEM_TYPE_IA; > + DAT_MEM_TYPE_PHYSICAL | DAT_MEM_TYPE_IA | > + DAT_MEM_TYPE_PLATFORM; > provider_attr->iov_ownership_on_return = DAT_IOV_CONSUMER; > provider_attr->dat_qos_supported = DAT_QOS_BEST_EFFORT; > provider_attr->completion_flags_supported = 
> Index: ib/dapl_provider.c > =================================================================== > --- ib/dapl_provider.c (revision 3113) > +++ ib/dapl_provider.c (working copy) > @@ -49,8 +49,17 @@ MODULE_AUTHOR("James Lentini"); > #ifdef CONFIG_KDAPL_INFINIBAND_DEBUG > static DAPL_DBG_MASK g_dapl_dbg_mask = 0; > module_param_named(dbg_mask, g_dapl_dbg_mask, int, 0644); > -MODULE_PARM_DESC(dbg_mask, "Bitmask to enable debug message types."); > +MODULE_PARM_DESC(dbg_mask, "Bitmask to enable debug message types. "); > #endif /* CONFIG_KDAPL_INFINIBAND_DEBUG */ > +unsigned int g_dapl_active_fmr = 1; > +static unsigned int g_dapl_pool_size = 2048; > +static unsigned int g_dapl_max_pages_per_fmr = 64; > +module_param_named(active_fmr, g_dapl_active_fmr, int, 0644); > +module_param_named(pool_size, g_dapl_pool_size, int, 0644); > +module_param_named(max_pages_per_fmr, g_dapl_max_pages_per_fmr, int, 0644); > +MODULE_PARM_DESC(active_fmr, "if active_fmr==1, creates fmr pool in pz_create "); > +MODULE_PARM_DESC(pool_size, "num of fmr handles in pool "); > +MODULE_PARM_DESC(max_pages_per_fmr, "max size (in pages) of an fmr handle "); > > static LIST_HEAD(g_dapl_provider_list); > > @@ -152,6 +161,18 @@ void dapl_dbg_log(enum dapl_dbg_type typ > > #endif /* KDAPL_INFINIBAND_DEBUG */ > > +void set_fmr_params (struct ib_fmr_pool_param *fmr_params) > +{ > + fmr_params->max_pages_per_fmr = g_dapl_max_pages_per_fmr; > + fmr_params->pool_size = g_dapl_pool_size; > + fmr_params->dirty_watermark = 32; > + fmr_params->cache = 1; > + fmr_params->flush_function = NULL; > + fmr_params->access = (IB_ACCESS_LOCAL_WRITE | > + IB_ACCESS_REMOTE_WRITE | > + IB_ACCESS_REMOTE_READ); > +} > + > static struct dapl_provider *dapl_provider_alloc(const char *name, > struct ib_device *device, > u8 port) > Index: ib/dapl_lmr.c > =================================================================== > --- ib/dapl_lmr.c (revision 3113) > +++ ib/dapl_lmr.c (working copy) > @@ -126,7 +126,7 @@ error1: > > 
static inline int dapl_lmr_create_physical(struct dapl_ia *ia, > union dat_region_description phys_addr, > - u64 page_count, > + u64 length, > enum dat_mem_type mem_type, > struct dapl_pz *pz, > enum dat_mem_priv_flags privileges, > @@ -137,8 +137,10 @@ static inline int dapl_lmr_create_physic > u64 *registered_address) > { > struct dapl_lmr *new_lmr; > - int status; > + int status = 0; > + u32 page_count; > > + page_count = (u32)length; > new_lmr = dapl_lmr_alloc(ia, mem_type, phys_addr, > page_count, (struct dat_pz *) pz, privileges); > > @@ -149,15 +151,24 @@ static inline int dapl_lmr_create_physic > > if (DAT_MEM_TYPE_IA == mem_type) { > status = dapl_ib_mr_register_ia(ia, new_lmr, phys_addr, > - page_count, privileges); > + length, privileges); > } > - else { > + else if (DAT_MEM_TYPE_PHYSICAL == mem_type) { > status = dapl_ib_mr_register_physical(ia, new_lmr, > phys_addr.for_array, > page_count, privileges); > } > + else if (DAT_MEM_TYPE_PLATFORM == mem_type) { > + status = dapl_ib_mr_register_fmr(ia, new_lmr, > + phys_addr.for_array, > + page_count, privileges); > + } > + else { > + status = -EINVAL; > + goto error1; > + } > > - if (0 != status) > + if (status) > goto error2; > > atomic_inc(&pz->pz_ref_count); > @@ -258,6 +269,11 @@ int dapl_lmr_kcreate(struct dat_ia *ia, > rmr_context, registered_length, > registered_address); > break; > + case DAT_MEM_TYPE_PLATFORM: /* used as a proprietary Mellanox-FMR */ > + if (!g_dapl_active_fmr) { > + status = -EINVAL; > + break; > + } > case DAT_MEM_TYPE_PHYSICAL: > case DAT_MEM_TYPE_IA: > status = dapl_lmr_create_physical(dapl_ia, region_description, > @@ -307,6 +323,7 @@ int dapl_lmr_free(struct dat_lmr *lmr) > > switch (dapl_lmr->param.mem_type) { > case DAT_MEM_TYPE_PHYSICAL: > + case DAT_MEM_TYPE_PLATFORM: > case DAT_MEM_TYPE_IA: > case DAT_MEM_TYPE_LMR: > { > From ardavis at ichips.intel.com Wed Aug 17 16:44:34 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 17 Aug 2005 16:44:34 -0700 Subject: 
[openib-general] uverbs comp events Message-ID: <4303CBE2.2010009@ichips.intel.com> Hi Roland, I just noticed that when I call ibv_close_device() it does not wake up the poll on cq_fd[0] in a separate thread. Can you take a look? While we are on the subject, any chance of getting something like ibv_generate_cq_event() added to generate wakeup events from user space? Signaling, as a means to wake up my CQ (and CM or AT for that matter) processing thread, presents some problems. Unless, of course, someone can show me a way to signal the FD directly from userspace by another means. Thanks, -arlin From rolandd at cisco.com Wed Aug 17 16:51:36 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 17 Aug 2005 16:51:36 -0700 Subject: [openib-general] Re: uverbs comp events In-Reply-To: <4303CBE2.2010009@ichips.intel.com> (Arlin Davis's message of "Wed, 17 Aug 2005 16:44:34 -0700") References: <4303CBE2.2010009@ichips.intel.com> Message-ID: <52mzngdsvr.fsf@cisco.com> Arlin> Hi Roland, I just noticed that when I call Arlin> ibv_close_device() it does not wake up the poll on cq_fd[0] in Arlin> a separate thread. Can you take a look? OK, I'll see if there's something wrong. Arlin> While we are on the subject, any chance of getting Arlin> something like ibv_generate_cq_event() added to generate wakeup Arlin> events from user space? Signaling, as a means to wake up Arlin> my CQ (and CM or AT for that matter) processing thread, Arlin> presents some problems. Unless, of course, someone can show Arlin> me a way to signal the FD directly from userspace by Arlin> another means. This doesn't seem useful to me. Why not just use pipe() and write() to generate your own FD events? - R. 
From ardavis at ichips.intel.com Wed Aug 17 18:02:36 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 17 Aug 2005 18:02:36 -0700 Subject: [openib-general] Re: uverbs comp events In-Reply-To: <52mzngdsvr.fsf@cisco.com> References: <4303CBE2.2010009@ichips.intel.com> <52mzngdsvr.fsf@cisco.com> Message-ID: <4303DE2C.5070809@ichips.intel.com> Roland Dreier wrote: > Arlin> Hi Roland, I just noticed that when I call > Arlin> ibv_close_device() it does not wake up the poll on cq_fd[0] in > Arlin> a separate thread. Can you take a look? > >OK, I'll see if there's something wrong. > > Arlin> While we are on the subject, any chance of getting > Arlin> something like ibv_generate_cq_event() added to generate wakeup > Arlin> events from user space? Signaling, as a means to wake up > Arlin> my CQ (and CM or AT for that matter) processing thread, > Arlin> presents some problems. Unless, of course, someone can show > Arlin> me a way to signal the FD directly from userspace by > Arlin> another means. > >This doesn't seem useful to me. Why not just use pipe() and write() >to generate your own FD events? > > > Sorry, I should have explained what I am trying to do. I have a separate CQ event processing thread that is created during the uDAPL device open to handle all event processing. The issue is that the CQ processing thread is blocking on the verbs cq_fd, waiting to process events for multiple connections/CQs. When the application wishes to close the device, it needs to wake up this thread, clean up, and exit. The options to wake up this blocking cq_fd and thread are: 1. Signal the thread with pthread_kill or pthread_cancel. 2. Poll cq_fd with a timeout, wake up periodically and check for termination. 3. Call ibv_close_device() to force an interrupt on the polling cq_fd (the problem I reported above). 4. Add a new generate-event call to verbs. 
(IB gen1 direct CQ objects supported this model) In my opinion, option 4 is the best option unless someone can show me a better way to signal the cq_fd from userspace. Did I miss some options? Thanks for looking into #3. -arlin > - R. > > > From halr at voltaire.com Wed Aug 17 18:54:38 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Aug 2005 04:54:38 +0300 Subject: [openib-general] RE: [PATCH] osm: export more complib functions and opensm headers Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BC2@taurus.voltaire.com> Thanks. Applied. In general, the rule is one thought per patch. So this would normally be 2 patches. -- Hal From rolandd at cisco.com Wed Aug 17 20:35:11 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 17 Aug 2005 20:35:11 -0700 Subject: [openib-general] Re: uverbs comp events In-Reply-To: <4303DE2C.5070809@ichips.intel.com> (Arlin Davis's message of "Wed, 17 Aug 2005 18:02:36 -0700") References: <4303CBE2.2010009@ichips.intel.com> <52mzngdsvr.fsf@cisco.com> <4303DE2C.5070809@ichips.intel.com> Message-ID: <52iry3ex3k.fsf@cisco.com> Arlin> the options to wake up this blocking cq_fd and thread are: Arlin> 1. signal the thread with pthread_kill , pthread_cancel Arlin> 2. poll cq_fd with timeout, wakeup periodically and check Arlin> for termination. 3. ibv_close_device () to force interrupt Arlin> on the polling cq_fd (problem I reported above) 4. add new Arlin> generate event call from verbs. (IB gen1 direct CQ objects Arlin> supported this model) Arlin> In my opinion, option 4 is the best option unless someone Arlin> can show me a better way to signal the cq_fd from Arlin> userspace. Did I miss some options? I think you can just create a pair of FDs with pipe() and then sleep on both the cq_fd and your pipe fd using poll(). Then when you want to wake up the thread just write something into the other pipe fd. - R. 
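A rough sketch of the pipe()/poll() approach Roland describes above, for readers who want to see it spelled out. This is an illustration only, not code from libibverbs or uDAPL: struct waker, waker_init(), waker_signal() and wait_cq_or_wakeup() are hypothetical names invented here, and cq_fd simply stands for whatever completion-event file descriptor the event thread already sleeps on. Only the POSIX calls pipe(), poll(), read() and write() are assumed to exist.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: a pipe used only to wake the event thread. */
struct waker {
    int rd; /* read end, polled by the event thread */
    int wr; /* write end, used by whoever wants to wake it */
};

/* Create the wake-up pipe. Returns 0 on success, -1 on error. */
int waker_init(struct waker *w)
{
    int fds[2];

    if (pipe(fds) < 0)
        return -1;
    w->rd = fds[0];
    w->wr = fds[1];
    return 0;
}

/* Wake the event thread by writing one byte into the pipe. */
int waker_signal(const struct waker *w)
{
    char c = 0;
    ssize_t n;

    do {
        n = write(w->wr, &c, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}

/*
 * Sleep until either cq_fd or the wake-up pipe becomes readable.
 * Returns 0 if a wake-up was requested (checked first, so shutdown
 * wins over pending completions), 1 if cq_fd is readable, -1 on error.
 */
int wait_cq_or_wakeup(int cq_fd, const struct waker *w)
{
    struct pollfd pfd[2];
    int ret;

    pfd[0].fd = cq_fd;
    pfd[0].events = POLLIN;
    pfd[0].revents = 0;
    pfd[1].fd = w->rd;
    pfd[1].events = POLLIN;
    pfd[1].revents = 0;

    do {
        ret = poll(pfd, 2, -1); /* block with no timeout */
    } while (ret < 0 && errno == EINTR);
    if (ret < 0)
        return -1;

    if (pfd[1].revents & POLLIN) {
        char c;

        /* Drain the wake-up byte so the next poll() blocks again. */
        if (read(w->rd, &c, 1) < 0)
            return -1;
        return 0;
    }
    if (pfd[0].revents & POLLIN)
        return 1;
    return -1; /* POLLERR/POLLHUP or similar on one of the fds */
}
```

In this sketch the event thread would loop on wait_cq_or_wakeup(), process completions when it returns 1, and clean up and exit when it returns 0; the closing thread calls waker_signal() and then joins the event thread. No signals and no polling timeout are needed, which is the point of the suggestion.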
From yael at mellanox.co.il Wed Aug 17 22:22:43 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Thu, 18 Aug 2005 08:22:43 +0300 Subject: [openib-general] RE: Missing osm_vendor_unbind function in osm_vendor_ibumad Message-ID: <506C3D7B14CDD411A52C00025558DED60882CC64@mtlex01.yok.mtl.com> Hello Hal, A null function will suffice for the time being. I will use the null function for now; I do not need the full implementation right away. Enjoy your vacation! Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Wednesday, August 17, 2005 4:31 PM To: Yael Kalka Cc: OPENIB GENERAL Subject: RE: Missing osm_vendor_unbind function in osm_vendor_ibumad Hi Yael, I presume the need for osm_vendor_unbind was added at 1.8.0 in OpenSM (I see it added into osm_sa/sm_mad_ctrl.c). I added a null function as a temporary hack for this. Will this suffice for the time being? I will fill it in (it definitely will not be a null function) but I'm on vacation right now. How soon do you need this? I may supply a patch for you to try tomorrow if this can't wait. -- Hal ________________________________ From: Yael Kalka [mailto:yael at mellanox.co.il] Sent: Wed 8/17/2005 7:15 AM To: Hal Rosenstock Cc: OPENIB GENERAL Subject: osm: Missing osm_vendor_unbind function in osm_vendor_ibumad Hello Hal, I am currently working on merging gen2 with our OpenSM 1.8.0 version. I noticed that in the osm_vendor_ibumad files there is no implementation for the osm_vendor_unbind function (which is currently used by opensm). Please add an interface (if unnecessary, then an empty one) for this function. Thanks, Yael -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From eitan at mellanox.co.il Wed Aug 17 23:37:05 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 18 Aug 2005 09:37:05 +0300 Subject: [openib-general] [PATCH] osm: export more opensm headers Message-ID: <86k6ij21ke.fsf@mtl066.yok.mtl.com> Hi Hal Somehow the last patch got accepted only partially. So here is the extra diff. Thanks Eitan I tested the patch on : 2.6.12.3-smp SuSE Linux 9.3 (i586) Signed-off-by: Eitan Zahavi Index: Makefile.am =================================================================== --- Makefile.am (revision 3128) +++ Makefile.am (working copy) @@ -75,6 +75,8 @@ opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdi opensmincludedir = $(includedir)/infiniband/opensm opensminclude_HEADERS = $(srcdir)/../include/opensm/osm_base.h \ + $(srcdir)/../include/opensm/osm_msgdef.h \ + $(srcdir)/../include/opensm/osm_helper.h \ $(srcdir)/../include/opensm/osm_log.h \ $(srcdir)/../include/opensm/osm_madw.h \ $(srcdir)/../include/opensm/osm_mad_pool.h \ 
From danb at voltaire.com Thu Aug 18 05:14:05 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Thu, 18 Aug 2005 15:14:05 +0300 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISER initiator Message-ID: I just checked in a first version of iSCSI Extensions for RDMA Protocol (ISER) initiator under infiniband/ulp/iser. This implements the ISER datamover, a transport layer alternative to TCP/IP usable by iSCSI. This ISER transport has been tested with the open-iscsi opensource project, and against the Voltaire Fibre-Channel Router (FCR) and Voltaire's Native-IB storage kit. All the iSCSI features including device management are available seamlessly with the iSCSI/ISER initiator. ISER simply puts iSCSI on steroids. The ISER implementation makes use of the openIB/kDAPL. Please note that several kDAPL patches that were submitted to the list are necessary for this implementation to work. Dan Bar Dov Infiniband Storage Solutions www.voltaire.com The Grid Interconnect Company From hch at lst.de Thu Aug 18 05:36:05 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 18 Aug 2005 14:36:05 +0200 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISER initiator In-Reply-To: References: Message-ID: <20050818123605.GB22381@lst.de> On Thu, Aug 18, 2005 at 03:14:05PM +0300, Dan Bar Dov wrote: > I just checked in a first version of iSCSI Extensions for RDMA > Protocol (ISER) initiator under infiniband/ulp/iser. This > implements the ISER datamover, a transport layer alternative to > TCP/IP usable by iSCSI. This ISER transport has been tested with > the open-iscsi opensource project, and against the Voltaire > Fibre-Channel Router (FCR) and Voltaire's Native-IB storage kit. 
> > All the iSCSI features including device management are available > seamlessly with the iSCSI/ISER initiator. ISER simply puts iSCSI > on steroids. > > The ISER implementation makes use of the openIB/kDAPL. Please note > that several kDAPL patches that were submitted to the list are > necessary for this implementation to work. The code is complete crap, please remove it again. From hch at lst.de Thu Aug 18 05:47:57 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 18 Aug 2005 14:47:57 +0200 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISER initiator In-Reply-To: <20050818123605.GB22381@lst.de> References: <20050818123605.GB22381@lst.de> Message-ID: <20050818124757.GA22637@lst.de> On Thu, Aug 18, 2005 at 02:36:05PM +0200, Christoph Hellwig wrote: > > I just checked in a first version of iSCSI Extensions for RDMA > > Protocol (ISER) initiator under infiniband/ulp/iser. This > > implements the ISER datamover, a transport layer alternative to > > TCP/IP usable by iSCSI. This ISER transport has been tested with > > the open-iscsi opensource project, and against the Voltaire > > Fibre-Channel Router (FCR) and Voltaire's Native-IB storage kit. While we're at it, can the admins please remove check-in permissions for Dan? Someone who writes such code with zero understanding of kernel internals and checks in without asking can cause far too much damage. From halr at voltaire.com Thu Aug 18 05:48:07 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Aug 2005 15:48:07 +0300 Subject: [openib-general] RE: [PATCH] osm: export more opensm headers Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BC4@taurus.voltaire.com> Hi Eitan, >Somehow the last patch got accepted only partially. Are you sure? Try svn revert Makefile.am. 
-- Hal From hch at lst.de Thu Aug 18 06:07:20 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 18 Aug 2005 15:07:20 +0200 Subject: [openib-general] [PATCH] srp tidyups Message-ID: <20050818130720.GA22995@lst.de> - you must not call the scsi_done callback in the EH routines. Will there be real error handling for srp one day? - given that you have on scsi host per srp target there's no need for the idr lookup, you can just use the host private data - no need to fill out unique_id, that's a leftover for ISA HBAs - and some small cleanups I couldn't resist Index: ulp/srp/ib_srp.c =================================================================== --- ulp/srp/ib_srp.c (revision 3129) +++ ulp/srp/ib_srp.c (working copy) @@ -34,14 +34,11 @@ #include #include - #include #include #include -#include #include #include - #include #include @@ -69,11 +66,6 @@ static const u8 topspin_oui[3] = { 0x00, 0x05, 0xad }; -static atomic_t srp_uid; - -static rwlock_t idr_lock; -static DEFINE_IDR(target_idr); - static void srp_add_one(struct ib_device *device)et static void srp_remove_one(struct ib_device *device); @@ -101,27 +93,29 @@ iu = kmalloc(sizeof *iu, gfp_mask); if (!iu) - return NULL; + goto out; iu->buf = kmalloc(size, gfp_mask); - if (!iu->buf) { - kfree(iu); - return NULL; - } + if (!iu->buf) + goto out_free_iu; memset(iu->buf, 0, size); iu->dma = dma_map_single(host->dev->dma_device, iu->buf, size, direction); - if (dma_mapping_error(iu->dma)) { - kfree(iu->buf); - kfree(iu); - return NULL; - } + if (dma_mapping_error(iu->dma)) + goto out_free_buf; iu->size = size; iu->direction = direction; return iu; + + out_free_buf: + kfree(iu->buf); + out_free_iu: + kfree(iu); + out: + return NULL; } static void srp_free_iu(struct srp_host *host, struct srp_iu *iu) @@ -142,7 +136,7 @@ u8 fmt; if (!scmnd->request_buffer || scmnd->sc_data_direction == DMA_NONE) - return sizeof (struct srp_cmd); + return sizeof(struct srp_cmd); if (scmnd->sc_data_direction != DMA_FROM_DEVICE && 
scmnd->sc_data_direction != DMA_TO_DEVICE) { @@ -456,31 +450,11 @@ static int srp_queuecommand(struct scsi_cmnd *scmnd, void (*done)(struct scsi_cmnd *)) { - struct srp_target_port *target; - struct srp_iu *iu; + struct srp_target_port *target = host_to_target(scmnd->device->host); + struct srp_iu *iu = target->tx_ring[target->tx_head & SRP_SQ_SIZE]; struct srp_cmd *cmd; - unsigned long flags; int len; - read_lock_irqsave(&idr_lock, flags); - target = idr_find(&target_idr, scmnd->device->id); - read_unlock_irqrestore(&idr_lock, flags); - - if (!target) { - printk(KERN_ERR PFX "queuecommand for unknown device id %d\n", - scmnd->device->id); - scmnd->result = DID_ERROR << 16; - done(scmnd); - return 0; - } - - if (0) { - printk(KERN_ERR PFX "command for %u: ", scmnd->device->id); - scsi_print_command(scmnd); - } - - iu = target->tx_ring[target->tx_head & SRP_SQ_SIZE]; - dma_sync_single_for_cpu(target->srp_host->dev->dma_device, iu->dma, SRP_MAX_IU_LEN, DMA_TO_DEVICE); @@ -499,7 +473,7 @@ len = srp_map_data(scmnd, target, iu); if (len < 0) { printk(KERN_ERR PFX "Failed to map data\n"); - goto err; + goto err_free_iu; } if (srp_post_recv(target, GFP_ATOMIC)) { @@ -519,8 +493,7 @@ err_unmap: srp_unmap_data(scmnd, target, cmd); - -err: +err_free_iu: return SCSI_MLQUEUE_HOST_BUSY; } @@ -528,9 +501,6 @@ { printk(KERN_ERR PFX "srp_abort called\n"); - scmnd->result = DID_ABORT << 16; - scmnd->scsi_done(scmnd); - return SUCCESS; } @@ -538,9 +508,6 @@ { printk(KERN_ERR PFX "srp_reset called\n"); - scmnd->result = DID_ABORT << 16; - scmnd->scsi_done(scmnd); - return SUCCESS; } @@ -892,10 +859,7 @@ static void srp_release_target(struct srp_target_port *target) { int i; - unsigned long flags; - /* XXX should send SRP_I_LOGOUT request */ - init_completion(&target->done); ib_send_cm_dreq(target->cm_id, NULL, 0); wait_for_completion(&target->done); @@ -907,10 +871,6 @@ srp_free_iu(target->srp_host, target->rx_ring[i]); for (i = 0; i < SRP_SQ_SIZE + 1; ++i) 
srp_free_iu(target->srp_host, target->tx_ring[i]); - - write_lock_irqsave(&idr_lock, flags); - idr_remove(&target_idr, target->scsi_id); - write_unlock_irqrestore(&idr_lock, flags); } static struct scsi_host_template srp_template = { @@ -933,21 +893,8 @@ unsigned long flags; int ret; - do { - if (!idr_pre_get(&target_idr, GFP_KERNEL)) - return -ENOMEM; - - write_lock_irqsave(&idr_lock, flags); - ret = idr_get_new(&target_idr, target, &target->scsi_id); - write_unlock_irqrestore(&idr_lock, flags); - } while (ret == -EAGAIN); - - if (ret) - goto fail; - sprintf(target->target_name, "SRP.T10:%016llX", (unsigned long long) be64_to_cpu(target->id_ext)); - target->scsi_host->unique_id = atomic_inc_return(&srp_uid); if (scsi_add_host(target->scsi_host, host->dev->dma_device)) goto fail; @@ -963,11 +910,7 @@ return 0; fail: - write_lock_irqsave(&idr_lock, flags); - idr_remove(&target_idr, target->scsi_id); - write_unlock_irqrestore(&idr_lock, flags); - - return ret; + return -ENODEV; } static void srp_release_class_dev(struct class_device *class_dev) @@ -1366,9 +1309,6 @@ { int ret; - atomic_set(&srp_uid, 0); - rwlock_init(&idr_lock); - ret = class_register(&srp_class); if (ret) { printk(KERN_ERR PFX "couldn't register class infiniband_srp\n"); From dotanb at mellanox.co.il Thu Aug 18 06:31:12 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Thu, 18 Aug 2005 16:31:12 +0300 Subject: [openib-general] did anyone execute the ucm_simple (user level CM example)? i fail ed to execute this example Message-ID: <506C3D7B14CDD411A52C00025558DED60882CDC1@mtlex01.yok.mtl.com> I'm working on svn version 3107 on Mellanox HCA (23108). I executed this example with the following parameters: % ./ucm_simple 0 % ./ucm_simple 1 i got the following output for the later execution (the client) Error <-1:22> sending REQ <6> the kernel module that handles the ucm was loaded. 
Dotan Barak Software Verification Engineer Mellanox Technologies LTD Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ] -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Thu Aug 18 07:00:49 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Aug 2005 17:00:49 +0300 Subject: [openib-general] did anyone execute the ucm_simple (user level CMexample)? i fail ed to execute this example Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BC7@taurus.voltaire.com> Hi Dotan, There are some hardcoded parameters (GIDs and LIDs) in the example that you need to tailor to your configuration. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Dotan Barak Sent: Thu 8/18/2005 9:31 AM To: openib-general at openib.org Subject: [openib-general] did anyone execute the ucm_simple (user level CMexample)? i fail ed to execute this example I'm working on svn version 3107 on Mellanox HCA (23108). I executed this example with the following parameters: % ./ucm_simple 0 % ./ucm_simple 1 i got the following output for the later execution (the client) Error <-1:22> sending REQ <6> the kernel module that handles the ucm was loaded. Dotan Barak Software Verification Engineer Mellanox Technologies LTD Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ] From mjleven at sandia.gov Thu Aug 18 07:12:09 2005 From: mjleven at sandia.gov (Michael Levenhagen) Date: Thu, 18 Aug 2005 08:12:09 -0600 Subject: [openib-general] SMP Directed Routes Message-ID: <43049739.5050803@sandia.gov> The Infiniband Architecture Spec Vol 1 Release 1.2 talks about Directed Routes in section 3.9.4.2. I'd like to learn more about how Directed Routes work but this section seems to be all there is in the spec. 
Is there a good reference on Directed Routes or do I just need to start reading source code. Mike From dotanb at mellanox.co.il Thu Aug 18 07:18:35 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Thu, 18 Aug 2005 17:18:35 +0300 Subject: [openib-general] SMP Directed Routes Message-ID: <506C3D7B14CDD411A52C00025558DED60882CDE5@mtlex01.yok.mtl.com> > The Infiniband Architecture Spec Vol 1 Release 1.2 talks > about Directed > Routes in section 3.9.4.2. > I'd like to learn more about how Directed Routes work but > this section > seems to be all there is in the spec. Is there a good reference on > Directed Routes or do I just need to start reading source code. > > Mike > > _______________________________________________ Did you look at section 14.2.1.2 at the IB spec you specified? Dotan -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Thu Aug 18 07:17:11 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Aug 2005 17:17:11 +0300 Subject: [openib-general] SMP Directed Routes Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BCA@taurus.voltaire.com> Hi Mike, There's much more in chapter 14 specifically 14.2.2 SMPs and Directed Route Algorithm. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Michael Levenhagen Sent: Thu 8/18/2005 10:12 AM To: openib-general at openib.org Subject: [openib-general] SMP Directed Routes The Infiniband Architecture Spec Vol 1 Release 1.2 talks about Directed Routes in section 3.9.4.2. I'd like to learn more about how Directed Routes work but this section seems to be all there is in the spec. Is there a good reference on Directed Routes or do I just need to start reading source code. 
Mike _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mjleven at sandia.gov Thu Aug 18 07:34:20 2005 From: mjleven at sandia.gov (Michael Levenhagen) Date: Thu, 18 Aug 2005 08:34:20 -0600 Subject: [openib-general] SMP Directed Routes In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175BCA@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175BCA@taurus.voltaire.com> Message-ID: <43049C6C.5080803@sandia.gov> I skimmed chapter 14. I'll take a closer look. thanks Mike Hal Rosenstock wrote: >Hi Mike, > >There's much more in chapter 14 specifically 14.2.2 SMPs and Directed Route Algorithm. > >-- Hal > >________________________________ > >From: openib-general-bounces at openib.org on behalf of Michael Levenhagen >Sent: Thu 8/18/2005 10:12 AM >To: openib-general at openib.org >Subject: [openib-general] SMP Directed Routes > > > >The Infiniband Architecture Spec Vol 1 Release 1.2 talks about Directed >Routes in section 3.9.4.2. >I'd like to learn more about how Directed Routes work but this section >seems to be all there is in the spec. Is there a good reference on >Directed Routes or do I just need to start reading source code. > >Mike > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > From eitan at mellanox.co.il Thu Aug 18 07:57:06 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 18 Aug 2005 17:57:06 +0300 Subject: [openib-general] SMP Directed Routes Message-ID: <506C3D7B14CDD411A52C00025558DED607C306B6@mtlex01.yok.mtl.com> Hi Michael, I second Hal: the spec is the best source. Read thoroughly the sections: 14.2.2 and then 13.5(.3!) 
Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Michael Levenhagen [mailto:mjleven at sandia.gov] > Sent: Thursday, August 18, 2005 5:34 PM > To: openib-general at openib.org > Subject: Re: [openib-general] SMP Directed Routes > > I skimmed chapter 14. I'll take a closer look. > > thanks > Mike > > Hal Rosenstock wrote: > > >Hi Mike, > > > >There's much more in chapter 14 specifically 14.2.2 SMPs and Directed Route > Algorithm. > > > >-- Hal > > > >________________________________ > > > >From: openib-general-bounces at openib.org on behalf of Michael Levenhagen > >Sent: Thu 8/18/2005 10:12 AM > >To: openib-general at openib.org > >Subject: [openib-general] SMP Directed Routes > > > > > > > >The Infiniband Architecture Spec Vol 1 Release 1.2 talks about Directed > >Routes in section 3.9.4.2. > >I'd like to learn more about how Directed Routes work but this section > >seems to be all there is in the spec. Is there a good reference on > >Directed Routes or do I just need to start reading source code. > > > >Mike > > > >_______________________________________________ > >openib-general mailing list > >openib-general at openib.org > >http://openib.org/mailman/listinfo/openib-general > > > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dmitry_yus at yahoo.com Thu Aug 18 08:10:47 2005 From: dmitry_yus at yahoo.com (Dmitry Yusupov) Date: Thu, 18 Aug 2005 08:10:47 -0700 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISER initiator In-Reply-To: References: Message-ID: <1124377847.10403.7.camel@mylaptop> Dan, I'm in the process of reviewing your patches. So far, some of the code could be accepted, but some still needs additional work. The most serious concern is that your patches force Open iSCSI to open/close the IB connection in-kernel, while the TCP connection is terminated in user-space. This is not really what we wanted. The solution could be to provide a set of kernel patches that do this job in user-space for the IB connection as well. Dima On Thu, 2005-08-18 at 15:14 +0300, Dan Bar Dov wrote: > I just checked in a first version of iSCSI Extensions for RDMA > Protocol (ISER) initiator under infiniband/ulp/iser. This > implements the ISER datamover, a transport layer alternative to > TCP/IP usable by iSCSI. This ISER transport has been tested with > the open-iscsi opensource project, and against the Voltaire > Fibre-Channel Router (FCR) and Voltaire's Native-IB storage kit. > > All the iSCSI features including device management are available > seamlessly with the iSCSI/ISER initiator. ISER simply puts iSCSI > on steroids. > > The ISER implementation makes use of the openIB/kDAPL. Please note > that several kDAPL patches that were submitted to the list are > necessary for this implementation to work. 
> > > Dan Bar Dov > Infiniband Storage Solutions > www.voltaire.com > The Grid Interconnect Company > From eitan at mellanox.co.il Thu Aug 18 08:30:04 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 18 Aug 2005 18:30:04 +0300 Subject: [openib-general] [PATCH] osm: export more complib functions a nd opensm headers Message-ID: <506C3D7B14CDD411A52C00025558DED607C306B9@mtlex01.yok.mtl.com> Hi Hal, I just did a fresh co of the userspace/management dir I verified that the patch below is not fully applied. Only the opensm/Makefile.am was changed partially with the two added header files to the DIST list - but not to the include list. Please double check. I can provide a new patch if required. Thanks. EZ ---------------- BACKUP: ------------------------ I still see the following diff (note this is not a patch): swlab30:/home/eitan/SW/SVN/gen2_tmp>diff management/osm/complib/libosmcomp.map ../gen2_mgmt/osm/complib/ 218a219,227 > cl_atomic_dec; > cl_free; > cl_malloc; > cl_perf_construct; > cl_perf_destroy; > cl_perf_display; > cl_perf_init; > cl_perf_reset; > cl_zalloc; swlab30:/home/eitan/SW/SVN/gen2_tmp>diff management/osm/opensm/Makefile.am ../gen2_mgmt/osm/opensm/ 77a78,79 > $(srcdir)/../include/opensm/osm_msgdef.h \ > $(srcdir)/../include/opensm/osm_helper.h \ 89a92 > Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Eitan Zahavi [mailto:eitan at mellanox.co.il] > Sent: Wednesday, August 17, 2005 10:18 PM > To: OPENIB GENERAL > Subject: [openib-general] [PATCH] osm: export more complib functions and opensm > headers > > Hi Hal > > There were few missing complib functions in the libosmcomp.map. 
> Also the following two header functions were missing from the opensm > distribution list: osm_msgdef.h and osm_helper.h > > I tested the patch on : > 2.6.12.3-smp SuSE Linux 9.3 (i586) > > Signed-off-by: Eitan Zahavi > > Index: complib/libosmcomp.map > =================================================================== > --- complib/libosmcomp.map (revision 3122) > +++ complib/libosmcomp.map (working copy) > @@ -207,5 +207,14 @@ OSMCOMP_1.0 { > ib_error_str; > ib_async_event_str; > ib_wc_status_str; > + cl_atomic_dec; > + cl_free; > + cl_malloc; > + cl_perf_construct; > + cl_perf_destroy; > + cl_perf_display; > + cl_perf_init; > + cl_perf_reset; > + cl_zalloc; > local: *; > }; > Index: opensm/Makefile.am > =================================================================== > --- opensm/Makefile.am (revision 3122) > +++ opensm/Makefile.am (working copy) > @@ -75,11 +75,15 @@ opensm_LDFLAGS = -Wl,--rpath -Wl,$(libdi > opensmincludedir = $(includedir)/infiniband/opensm > > opensminclude_HEADERS = $(srcdir)/../include/opensm/osm_base.h \ > + $(srcdir)/../include/opensm/osm_msgdef.h \ > + $(srcdir)/../include/opensm/osm_helper.h \ > $(srcdir)/../include/opensm/osm_log.h \ > $(srcdir)/../include/opensm/osm_madw.h \ > $(srcdir)/../include/opensm/osm_mad_pool.h > > EXTRA_DIST = $(srcdir)/../include/opensm/osm_base.h \ > - $(srcdir)/../include/opensm/osm_log.h \ > - $(srcdir)/../include/opensm/osm_madw.h \ > - $(srcdir)/../include/opensm/osm_mad_pool.h > + $(srcdir)/../include/opensm/osm_helper.h \ > + $(srcdir)/../include/opensm/osm_msgdef.h \ > + $(srcdir)/../include/opensm/osm_log.h \ > + $(srcdir)/../include/opensm/osm_madw.h \ > + $(srcdir)/../include/opensm/osm_mad_pool.h > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML 
attachment was scrubbed... URL:

From guyg at voltaire.com Thu Aug 18 08:39:12 2005
From: guyg at voltaire.com (Guy German)
Date: Thu, 18 Aug 2005 18:39:12 +0300
Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation
Message-ID:

Hi Caitlin,

Caitlin Bestler wrote:
> Some clarifications are needed here.
>
> First the Consumer is responsible for draining the
> EVD after re-enabling it, or at least for remembering
> that there may be undrained notified events.

Can you please explain what you mean by "re-enabling" the EVD?
Do you mean calling dat_evd_modify_upcall and changing the
upcall policy from disable back to enable?

>
> That is "you-have-been-notified" is a sticky boolean
> attribute that the Consumer is supposed to set to TRUE
> when the upcall is made and only clear when the EVD
> has been drained *after* re-enabling.
>
> Second, is that the EVD is first and foremost an event
> *serializer*. It is presumed to have a finite number of
> resources for making upcalls (at most one for the typical
> case where SINGLE is enabled). The next upcall per
> resource CANNOT occur until after the current upcall
> has completed.
>
> Whether this should be solved in the DAT Provider is
> a question of what the verb-layer provider is allowed
> to do. If the verb layer provider can in fact generate
> multiple concurrent upcalls for the same CQ then the
> EVD itself must guard against re-entrancy.
>
> A more likely implementation is that upcalls triggered
> by post_se, CM events and CQs could theoretically
> occur at the same instance -- but that none of these
> paths can be re-entrant by themselves.
>
> Once the potential re-entrancy from the verb layer
> is known, then an optimal strategy can be selected.
> For example, if the only potential re-entrancy comes
> when the upcall interrupts a post_se call then some
> simple critical regions can avoid all problems without
> general purpose spinlocks or semaphores.
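
The "re-enable, then drain" rule quoted above can be sketched with a toy event queue. This is only an illustration under stated assumptions: `toy_evd`, `toy_evd_post`, `toy_evd_dequeue` and `toy_evd_rearm_and_drain` are hypothetical names, not kDAPL or OpenIB functions, and the model is single-threaded, so it shows the ordering of the steps rather than any locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an EVD: a small ring of event codes plus
 * the two flags discussed in the thread. None of these names come
 * from kDAPL. */
#define TOY_EVD_SLOTS 8

struct toy_evd {
	int events[TOY_EVD_SLOTS];
	size_t head;          /* next slot to dequeue */
	size_t tail;          /* next slot to fill */
	bool upcall_enabled;  /* models the upcall policy */
	bool notified;        /* sticky "you have been notified" flag */
};

/* Queue an event; an upcall is modelled by setting the sticky flag. */
static bool toy_evd_post(struct toy_evd *evd, int event)
{
	if (evd->tail - evd->head == TOY_EVD_SLOTS)
		return false; /* queue full */
	evd->events[evd->tail++ % TOY_EVD_SLOTS] = event;
	if (evd->upcall_enabled)
		evd->notified = true;
	return true;
}

/* Remove the oldest event, if there is one. */
static bool toy_evd_dequeue(struct toy_evd *evd, int *event)
{
	if (evd->head == evd->tail)
		return false;
	*event = evd->events[evd->head++ % TOY_EVD_SLOTS];
	return true;
}

/* Re-enable first, then drain, and clear the sticky flag only once
 * the queue is empty. Returns the number of events drained. */
static int toy_evd_rearm_and_drain(struct toy_evd *evd)
{
	int event;
	int drained = 0;

	evd->upcall_enabled = true;
	while (toy_evd_dequeue(evd, &event))
		drained++;
	evd->notified = false;
	return drained;
}
```

Events posted while the upcall was disabled raise no notification in this model, which is why the drain after re-enabling is what picks them up.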
>
> On 8/16/05, James Lentini wrote:
>>
>>
>> On Tue, 16 Aug 2005, Guy German wrote:
>>
>>>>>>>>> Also, the pending_event_queue is only used for kDAPL generated
>>>>>>>>> software events. This queue can be empty when there are
>>>>>>>>> events on the CQ, so you would need to expand your
>>>>>>>>> check to cover that.
>>>>>>>
>>>>>>> Actually, even though I agreed before, I tend to disagree now.
>>>>>>> The consumer will still get the DTO events as soon as the CQ
>>>>>>> upcall is triggered (enabled), so the only problem is with the
>>>>>>> pending events list.
>>>>>>
>>>>>> Why is it an error for the consumer to modify the upcall policy
>>>>>> when there are pending events?
>>>>>>
>>>>>> dat_evd_modify_upcall should behave just like the IBTA spec's
>>>>>> Request Completion Notification verb in this respect. If there
>>>>>> were events on the EVD before the upcall is enabled, no upcall
>>>>>> needs to be generated. A correct consumer can easily work around
>>>>>> this by enabling the upcall and polling the EVD one final time
>>>>>> to ensure it is empty.
>>>>>
>>>>> There can be more than one event, and the consumer would need to
>>>>> dequeue many times. While the consumer is doing his extra
>>>>> dequeueing he might also get an upcall, because his policy is
>>>>> now enabled. I can't think of a design that can handle such a
>>>>> case, and if there is one it is demanding and complicated from
>>>>> the consumer's side.
>>>>
>>>> Isn't it the same position all event code written to the OpenIB
>>>> API is in?
>>>
>>> I don't quite know what you are referring to, but if you are
>>> referring to the case of a CQ in IB, it's totally different: you
>>> only enable the CQ once, so you will only get one upcall, and the
>>> rest of the events you will need to dequeue.
>>
>> The consumer should only receive one upcall at a time if the upcall
>> policy is DAT_UPCALL_SINGLE_INSTANCE.
If the dequeues are performed >> in an upcall, the logic needed in an OpenIB consumer and kDAPL >> consumer is essentially the same. >> >> The difference is that the OpenIB consumer needs to re-enable the CQ >> upcall and poll to make sure no events were missed. >> >>>> I agree with you that this programming model is difficult to use, >>>> but I don't think it is impossible. >>> >>> I think it is a bad idea to dequeue events and at the same time >>> receive upcalls from the same queue. It is racy, and has bad >>> performance. I don't see *any* reason to do it. >> >> The current kDAPL implementation does create a situation in which an >> upcall and poll occur simultaneously if the upcall is disabled, the >> consumer enables the upcall, and then the consumer does a poll. In >> this scenario an upcall can occur while the consumer is polling. I >> was pointing out that this same race exists in the OpenIB verbs API >> (and the IBTA verbs). >> >> Again, I agree that we can eliminate the additional poll after >> enabling the upcall in kDAPL. We just need to do it in a way that is >> not hardware specific. I believe we can use the same technique we >> did in the DTO upcall. >> >> james >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From pw at osc.edu Thu Aug 18 08:49:28 2005 From: pw at osc.edu (Pete Wyckoff) Date: Thu, 18 Aug 2005 11:49:28 -0400 Subject: [openib-general] avoid segv in libibverbs/examples In-Reply-To: <52u0hrghrw.fsf@cisco.com> References: <20050812144222.GA8988@osc.edu> <52u0hrghrw.fsf@cisco.com> Message-ID: <20050818154928.GB31078@osc.edu> rolandd at cisco.com wrote on Mon, 15 Aug 2005 11:46 -0700: > Thanks. I think we should probably print a diagnostic if no devices > are found. 
Can you resend the patch with that fixed, and also include > a "Signed-off-by:" line? Sure. I just copied a complaint from a few lines lower in each file. There are probably nicer ways to do it. Now a failure says: titan$ ibv_devices libibverbs: Fatal: couldn't open sysfs class 'infiniband_verbs'. No IB devices found -- Pete Avoid segv when no IB devices are found. Signed-off-by: Pete Wyckoff Index: libibverbs/examples/asyncwatch.c =================================================================== --- libibverbs/examples/asyncwatch.c (revision 3132) +++ libibverbs/examples/asyncwatch.c (working copy) @@ -56,6 +56,10 @@ struct ibv_async_event event; dev_list = ibv_get_devices(); + if (!dev_list) { + fprintf(stderr, "No IB devices found\n"); + return 1; + } dlist_start(dev_list); ib_dev = dlist_next(dev_list); Index: libibverbs/examples/rc_pingpong.c =================================================================== --- libibverbs/examples/rc_pingpong.c (revision 3132) +++ libibverbs/examples/rc_pingpong.c (working copy) @@ -524,6 +524,10 @@ page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_devices(); + if (!dev_list) { + fprintf(stderr, "No IB devices found\n"); + return 1; + } dlist_start(dev_list); if (!ib_devname) { Index: libibverbs/examples/srq_pingpong.c =================================================================== --- libibverbs/examples/srq_pingpong.c (revision 3132) +++ libibverbs/examples/srq_pingpong.c (working copy) @@ -593,6 +593,10 @@ page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_devices(); + if (!dev_list) { + fprintf(stderr, "No IB devices found\n"); + return 1; + } dlist_start(dev_list); if (!ib_devname) { Index: libibverbs/examples/uc_pingpong.c =================================================================== --- libibverbs/examples/uc_pingpong.c (revision 3132) +++ libibverbs/examples/uc_pingpong.c (working copy) @@ -516,6 +516,10 @@ page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_devices(); + if (!dev_list) { + 
fprintf(stderr, "No IB devices found\n"); + return 1; + } dlist_start(dev_list); if (!ib_devname) { Index: libibverbs/examples/ud_pingpong.c =================================================================== --- libibverbs/examples/ud_pingpong.c (revision 3132) +++ libibverbs/examples/ud_pingpong.c (working copy) @@ -520,6 +520,10 @@ page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_devices(); + if (!dev_list) { + fprintf(stderr, "No IB devices found\n"); + return 1; + } dlist_start(dev_list); if (!ib_devname) { Index: libibverbs/examples/device_list.c =================================================================== --- libibverbs/examples/device_list.c (revision 3132) +++ libibverbs/examples/device_list.c (working copy) @@ -55,6 +55,10 @@ struct ibv_device *ib_dev; dev_list = ibv_get_devices(); + if (!dev_list) { + fprintf(stderr, "No IB devices found\n"); + return 1; + } printf(" %-16s\t node GUID\n", "device"); printf(" %-16s\t----------------\n", "------"); Index: libibverbs/examples/devinfo.c =================================================================== --- libibverbs/examples/devinfo.c (revision 3132) +++ libibverbs/examples/devinfo.c (working copy) @@ -58,6 +58,10 @@ int i; dev_list = ibv_get_devices(); + if (!dev_list) { + fprintf(stderr, "No IB devices found\n"); + return 1; + } dlist_start(dev_list); ib_dev = dlist_next(dev_list); From jlentini at netapp.com Thu Aug 18 09:04:23 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 18 Aug 2005 12:04:23 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: FMR and EVD patch In-Reply-To: References: Message-ID: On Wed, 17 Aug 2005, James Lentini wrote: > > Hi Guy, > > I've committed all but one of the EVD changes in revision 3126. > > I still need a little time to review all of your changes to > dapl_evd_modify_upcall and the FMR implementation. I've committed your Mellanox-style FMR support in revision 3133. I'm viewing this as a prototype implementation of DAT_MEM_TYPE_PLATFORM. 
Please expect changes in this area in the future.

james

From guyg at voltaire.com Thu Aug 18 09:35:32 2005
From: guyg at voltaire.com (Guy German)
Date: Thu, 18 Aug 2005 19:35:32 +0300
Subject: [openib-general][PATCH][kdapl]: FMR and EVD patch
Message-ID:

James Lentini wrote:
> On Wed, 17 Aug 2005, James Lentini wrote:
>
>>
>> Hi Guy,
>>
>> I've committed all but one of the EVD changes in revision 3126.
>>
>> I still need a little time to review all of your changes to
>> dapl_evd_modify_upcall and the FMR implementation.
>
> I've committed your Mellanox-style FMR support in revision 3133. I'm
> viewing this as a prototype implementation of DAT_MEM_TYPE_PLATFORM.
> Please expect changes in this area in the future.

No problems.

Thanks,
Guy

From mshefty at ichips.intel.com Thu Aug 18 09:44:04 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 18 Aug 2005 09:44:04 -0700
Subject: [openib-general] did anyone execute the ucm_simple (user level CM example)? i failed to execute this example
In-Reply-To: <506C3D7B14CDD411A52C00025558DED60882CDC1@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED60882CDC1@mtlex01.yok.mtl.com>
Message-ID: <4304BAD4.5060904@ichips.intel.com>

Dotan Barak wrote:
> I'm working on svn version 3107 on Mellanox HCA (23108).
>
> I executed this example with the following parameters:
> % ./ucm_simple 0
> % ./ucm_simple 1
>
> i got the following output for the latter execution (the client)
> Error <-1:22> sending REQ <6>

The hard-coded GID parameters in simple.c are likely the problem.

I am in the process of replacing ucm_simple with a test program that will make use of address translation services, establish a connection, and perform data transfers. (It's essentially a port of the kernel cmpost test program to userspace.)
- Sean From halr at voltaire.com Thu Aug 18 09:58:25 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 18 Aug 2005 19:58:25 +0300 Subject: [openib-general] [PATCH] osm: export more complib functions and opensm headers Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BCD@taurus.voltaire.com> Eitan Zahavi wrote: > I just did a fresh co of the userspace/management dir > I verified that the patch below is not fully applied. > Only the opensm/Makefile.am was changed partially with the two added header files to the DIST list - but not to the include list. > Please double check. It looks to me like those lines are there in revision 3128 (and beyond) of https://openib.org/svn/gen2/trunk/src/userspace/management/osm/opensm/Makefile.am opensminclude_HEADERS = $(srcdir)/../include/opensm/osm_base.h \ $(srcdir)/../include/opensm/osm_log.h \ $(srcdir)/../include/opensm/osm_madw.h \ $(srcdir)/../include/opensm/osm_mad_pool.h \ $(srcdir)/../include/opensm/osm_msgdef.h \ $(srcdir)/../include/opensm/osm_helper.h EXTRA_DIST = $(srcdir)/../include/opensm/osm_base.h \ $(srcdir)/../include/opensm/osm_log.h \ $(srcdir)/../include/opensm/osm_madw.h \ $(srcdir)/../include/opensm/osm_mad_pool.h \ $(srcdir)/../include/opensm/osm_msgdef.h \ $(srcdir)/../include/opensm/osm_helper.h -- Hal From rolandd at cisco.com Thu Aug 18 10:27:49 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 18 Aug 2005 10:27:49 -0700 Subject: [openib-general] Re: [PATCH] srp tidyups In-Reply-To: <20050818130720.GA22995@lst.de> (Christoph Hellwig's message of "Thu, 18 Aug 2005 15:07:20 +0200") References: <20050818130720.GA22995@lst.de> Message-ID: <5264u3b1ey.fsf@cisco.com> Thanks, I applied most of this except for some minor personal style things. For example I left out changes like the below :) > - return sizeof (struct srp_cmd); > + return sizeof(struct srp_cmd); By the way, what's your feeling about upstream inclusion for the driver in its current state? 
> Will there be real error handling for srp one day? Yes, it's on my TODO list. I need to figure out exactly what makes sense. We definitely need to gracefully handle network failures and handle it when the target logs us out. For the EH routines, I'm still pondering... I'm not sure there's anything sensible we can do in the abort routine, since we don't know if the command has made it out of our local send queue or not, and we can't stop it even if it hasn't been sent. The reset routine should just tear down our connection and reconnect. - R. From iod00d at hp.com Thu Aug 18 10:43:05 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 18 Aug 2005 10:43:05 -0700 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISER initiator In-Reply-To: <20050818123605.GB22381@lst.de> References: <20050818123605.GB22381@lst.de> Message-ID: <20050818174305.GC15077@esmail.cup.hp.com> On Thu, Aug 18, 2005 at 02:36:05PM +0200, Christoph Hellwig wrote: > > All the iSCSI features including device management are available > > seamlessly with the iSCSI/ISER initiator. ISER simply puts iSCSI > > on steroids. > > > > The ISER implementation makes use of the openIB/kDAPL. Please note > > that several kDAPL patches that were submitted to the list are > > necessary for this implementation to work. > > The code is complete crap, please remove it again. Christoph, While I agree with you, that's not a very constructive approach. Can you pick 5 things that are brain damaged and point them out? Here are a few easy ones in iser.h: * vi: set noautoindent tabstop=4 shiftwidth=4 : This doesn't follow kernel coding style. Tabstops are *8* spaces. #ifndef CONFIG_INFINIBAND #include #else #include ... WTF? this module won't get built w/o CONFIG_INFINIBAND defined. Delete all references to the !CONFIG_INFINIBAND cases. Delete the pile of typedef's that are commented out #ifndef MIN #define MIN(a,b) ((a) < (b) ? (a) : (b)) #endif delete MIN and MAX. Use min ser_conn_state] /*! 
-------------------------------------------------------------------- [enum iser_conn_state] Description: iSER connection state -------------------------------------------------------------------- */ enum iser_conn_state { The comment is utterly useless. Delete it or add some content. Ditto for several of the additional enum declaratations further down. /*! -------------------------------------------------------------------- [struct iser_phys_mem_t] Description: Physical address based memory descriptor -------------------------------------------------------------------- */ struct iser_phys_mem_t { uint64_t *addrs; int length; int offset; int data_size; Is this a clone of "struct scatterlist"? See include/asm-*/scatterlist.h. // uint64_t addr; // uint64_t size; }; /* iser_phys_mem_t */ Delete the C++ style comments. struct iser_op_params_t { uint32_t InitiatorRecvDataSegmentLength; /* 512..2*24-1], default: 8K, func: MIN */ uint32_t TargetRecvDataSegmentLength; /* 512..2*24-1], default: 8K, func: MIN */ uint32_t FirstBurstLength; /* [512..2**24-1], default: 64K, func: MIN */ uint32_t MaxBurstLength; /* [512..2**24-1], default: 256K, func: MIN */ ... 1) Codingstyle: use tabs, not 4 spaces. 2) Codingstyle: WankyCapsNames are strongly discouraged. (See Chap 4 of linux/Documentation/Codingstyle) enum iser_op_param_default { defaultInitiatorRecvDataSegmentLength = 8*1024, defaultTargetRecvDataSegmentLength = 8*1024, defaultFirstBurstLength = 64*1024, ... Ditto. And why aren't these constants? Any advantage to the enum type? I'm wondering if any of these is replacable with ISER_LOGIN_PHASE_PDU_DATA_LEN or related constants defined earlier in the file. Ok...made it half-way through (line 286 of 649) the first file and that's only 10% of the .h files in ulp/iser/datamover. There are 11K lines of .c and 2.5K lines of .h. I hope that's enough to give Dan Bar Dov an idea of what the problem is. 
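
One of the review points above, dropping the hand-rolled MIN macro, can be illustrated with a small userspace sketch. It only shows the general hazard of a macro that evaluates its arguments more than once. `min_int` is a hypothetical helper written for this example and is not the kernel's own definition, and the two counting functions exist purely to make the difference observable.

```c
#include <assert.h>

/* The macro quoted in the review: whichever argument is selected
 * gets evaluated a second time. */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Illustrative single-evaluation alternative for int only
 * (hypothetical helper, not the kernel's min()). */
static inline int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Counts how often the argument expression runs when the macro is
 * used: once for the comparison and once more for the result. */
static int macro_evaluations(void)
{
	int calls = 0;
	int r = MIN(++calls, 10);

	(void)r;
	return calls;
}

/* Same measurement with the function: the argument is evaluated
 * exactly once, before the call. */
static int function_evaluations(void)
{
	int calls = 0;
	int r = min_int(++calls, 10);

	(void)r;
	return calls;
}
```

With a side-effecting argument the macro version runs the expression twice and the function version once, which is the usual argument for a single shared, type-checked helper over per-driver copies of MIN and MAX.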
hth, grant From hch at lst.de Thu Aug 18 10:42:22 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 18 Aug 2005 19:42:22 +0200 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISER initiator In-Reply-To: <20050818174305.GC15077@esmail.cup.hp.com> References: <20050818123605.GB22381@lst.de> <20050818174305.GC15077@esmail.cup.hp.com> Message-ID: <20050818174222.GA27674@lst.de> On Thu, Aug 18, 2005 at 10:43:05AM -0700, Grant Grundler wrote: > On Thu, Aug 18, 2005 at 02:36:05PM +0200, Christoph Hellwig wrote: > > > All the iSCSI features including device management are available > > > seamlessly with the iSCSI/ISER initiator. ISER simply puts iSCSI > > > on steroids. > > > > > > The ISER implementation makes use of the openIB/kDAPL. Please note > > > that several kDAPL patches that were submitted to the list are > > > necessary for this implementation to work. > > > > The code is complete crap, please remove it again. > > Christoph, > While I agree with you, that's not a very constructive approach. > Can you pick 5 things that are brain damaged and point them out? The same as last time, the code didn't change at all. It's still totally ignorant about such essential things as dma mapping, has creative new abuse for struct iovec, it's still based on iovecs, actually missing the iscsi initiator integration, duplicating iscsi defines and scsi debugging code all over, adding tons of layers of useless abstraction. Not to mention that the code looks like a cat vomiting over the keyboard. In short the code should be thrown a way, and someone with a clue (aka not a Voltaire person) needs to start over again. 
From hch at lst.de Thu Aug 18 10:43:17 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 18 Aug 2005 19:43:17 +0200 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISER initiator In-Reply-To: <20050818174222.GA27674@lst.de> References: <20050818123605.GB22381@lst.de> <20050818174305.GC15077@esmail.cup.hp.com> <20050818174222.GA27674@lst.de> Message-ID: <20050818174317.GA27739@lst.de> On Thu, Aug 18, 2005 at 07:42:22PM +0200, Christoph Hellwig wrote: > On Thu, Aug 18, 2005 at 10:43:05AM -0700, Grant Grundler wrote: > > On Thu, Aug 18, 2005 at 02:36:05PM +0200, Christoph Hellwig wrote: > > > > All the iSCSI features including device management are available > > > > seamlessly with the iSCSI/ISER initiator. ISER simply puts iSCSI > > > > on steroids. > > > > > > > > The ISER implementation makes use of the openIB/kDAPL. Please note > > > > that several kDAPL patches that were submitted to the list are > > > > necessary for this implementation to work. > > > > > > The code is complete crap, please remove it again. > > > > Christoph, > > While I agree with you, that's not a very constructive approach. > > Can you pick 5 things that are brain damaged and point them out? > > The same as last time, the code didn't change at all. It's still > totally ignorant about such essential things as dma mapping, has > creative new abuse for struct iovec, it's still based on iovecs, "... still based on kdapl" of course From hch at lst.de Thu Aug 18 10:46:17 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 18 Aug 2005 19:46:17 +0200 Subject: [openib-general] Re: [PATCH] srp tidyups In-Reply-To: <5264u3b1ey.fsf@cisco.com> References: <20050818130720.GA22995@lst.de> <5264u3b1ey.fsf@cisco.com> Message-ID: <20050818174617.GA27817@lst.de> On Thu, Aug 18, 2005 at 10:27:49AM -0700, Roland Dreier wrote: > By the way, what's your feeling about upstream inclusion for the > driver in its current state? Except for the lack of error handling it looks nice. 
From iod00d at hp.com Thu Aug 18 11:00:48 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 18 Aug 2005 11:00:48 -0700 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISER initiator In-Reply-To: <20050818174305.GC15077@esmail.cup.hp.com> References: <20050818123605.GB22381@lst.de> <20050818174305.GC15077@esmail.cup.hp.com> Message-ID: <20050818180048.GD15077@esmail.cup.hp.com> On Thu, Aug 18, 2005 at 10:43:05AM -0700, Grant Grundler wrote: > #ifndef MIN > #define MIN(a,b) ((a) < (b) ? (a) : (b)) > #endif > delete MIN and MAX. Use min ser_conn_state] Sorry - This got mangled during a cut/paste. Should read: Use min and max - defined in include/linux/kernel.h grant From iod00d at hp.com Thu Aug 18 11:18:23 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 18 Aug 2005 11:18:23 -0700 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISER initiator In-Reply-To: <20050818174317.GA27739@lst.de> References: <20050818123605.GB22381@lst.de> <20050818174305.GC15077@esmail.cup.hp.com> <20050818174222.GA27674@lst.de> <20050818174317.GA27739@lst.de> Message-ID: <20050818181823.GE15077@esmail.cup.hp.com> On Thu, Aug 18, 2005 at 07:43:17PM +0200, Christoph Hellwig wrote: ... > > The same as last time, the code didn't change at all. It's still > > totally ignorant about such essential things as dma mapping, has > > creative new abuse for struct iovec, it's still based on iovecs, > > "... still based on kdapl" of course Yeah, I was wondering about that. When I was off on vacation in July (and OLS), kDAPL was committed to the svn repository. Has anyone reviewed that? I was under the impression kDAPL would never make it into the openib.org source tree. Or has something changed? 
grant

From hch at lst.de Thu Aug 18 11:18:01 2005
From: hch at lst.de (Christoph Hellwig)
Date: Thu, 18 Aug 2005 20:18:01 +0200
Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISER initiator
In-Reply-To: <20050818181823.GE15077@esmail.cup.hp.com>
References: <20050818123605.GB22381@lst.de> <20050818174305.GC15077@esmail.cup.hp.com> <20050818174222.GA27674@lst.de> <20050818174317.GA27739@lst.de> <20050818181823.GE15077@esmail.cup.hp.com>
Message-ID: <20050818181801.GA28672@lst.de>

On Thu, Aug 18, 2005 at 11:18:23AM -0700, Grant Grundler wrote:
> On Thu, Aug 18, 2005 at 07:43:17PM +0200, Christoph Hellwig wrote:
> ...
> > > The same as last time, the code didn't change at all. It's still
> > > totally ignorant about such essential things as dma mapping, has
> > > creative new abuse for struct iovec, it's still based on iovecs,
> >
> > "... still based on kdapl" of course
>
> Yeah, I was wondering about that. When I was off on vacation
> in July (and OLS), kDAPL was committed to the svn repository.
> Has anyone reviewed that?

We agreed that James can commit it, but that doesn't mean it'll go anywhere. kDAPL is a dead end except for maybe out of tree vendor code.

From ardavis at ichips.intel.com Thu Aug 18 12:05:50 2005
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Thu, 18 Aug 2005 12:05:50 -0700
Subject: [openib-general] uVerbs: ibv_query_port
Message-ID: <4304DC0E.4050605@ichips.intel.com>

Hi Roland,

I just added ibv_query_port() code to uDAPL and noticed the max_msg_sz is not being set properly. It looks like mthca_query_port does not set any value for it, so ib_uverbs_query_port() returns a value from the uninitialized ib_port_attr structure. Can you take a look?
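
The kind of bug reported here, a caller copying out a structure that the lower layer filled only partially, can be sketched in plain C. Everything in this example is hypothetical: `toy_port_attr`, `toy_driver_query_port` and `toy_query_port` are stand-ins that only echo the names in the report, and zeroing the structure before the query is one common defence, not necessarily the fix that was checked in.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical port attributes; the field names merely echo the
 * report and are not the real ib_port_attr layout. */
struct toy_port_attr {
	int state;
	unsigned int max_msg_sz;
};

/* Hypothetical low-level driver hook that fills in only some of
 * the fields, as the report suspects the real one did. */
static int toy_driver_query_port(struct toy_port_attr *attr)
{
	attr->state = 4;
	/* max_msg_sz is deliberately left untouched here */
	return 0;
}

/* Mid-layer wrapper: clear the structure first so any field the
 * driver skips reads back as 0 instead of stack garbage. */
static int toy_query_port(struct toy_port_attr *out)
{
	struct toy_port_attr attr;
	int ret;

	memset(&attr, 0, sizeof attr);
	ret = toy_driver_query_port(&attr);
	if (ret)
		return ret;
	*out = attr;
	return 0;
}
```

Without the memset, the value copied into `out->max_msg_sz` would be indeterminate, which matches the symptom of a field that is "not being set properly".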
Thanks,

-arlin

From gshipman at lanl.gov Thu Aug 18 12:05:49 2005
From: gshipman at lanl.gov (Galen Shipman)
Date: Thu, 18 Aug 2005 13:05:49 -0600
Subject: [openib-general] RDMA Read performance
Message-ID:

Hello,

I am using RDMA READ for an internal application and I am seeing problems with performance on OpenIB. I am not seeing the same performance problems on similar hardware using Mellanox VAPI. Perhaps I am doing something silly. Attached is a patch to rdma_bw.c, with which I see only ~192MB/sec where it should be closer to 950MB/sec. Any help or ideas are appreciated.

Thanks,

Galen

-------------- next part --------------
A non-text attachment was scrubbed...
Name: rdma_bw_diff
Type: application/octet-stream
Size: 2407 bytes
Desc: not available
URL:

From iod00d at hp.com Thu Aug 18 12:10:18 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 18 Aug 2005 12:10:18 -0700
Subject: [openib-general] [noreply@googlegroups.com: Posting error: open-iscsi]
Message-ID: <20050818191018.GH15077@esmail.cup.hp.com>

Dan,
please don't cross post to closed lists. It means people on open-iscsi don't see my response and they can't participate in the conversation.

More open source community email etiquette at:
http://www.parisc-linux.org/mailing-lists/index.html
See "Rules" section.

grant

----- Forwarded message from noreply at googlegroups.com -----

From: noreply at googlegroups.com
To: iod00d at hp.com
Subject: Posting error: open-iscsi
X-PMX-Version: 5.0.3.165339, Antispam-Engine: 2.1.0.0, Antispam-Data: 2005.8.18.18
X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on debian.cup.hp.com
X-Spam-Level:
X-Spam-Status: No, score=-0.5 required=5.0 tests=BAYES_00,NO_REAL_NAME, RCVD_BY_IP autolearn=no version=3.0.4

You do not have permission to post to group open-iscsi. You may need to join the group before being allowed to post, or this group may not be open to posting.
Visit http://groups.google.com/group/open-iscsi/about to join or learn more about who is allowed to post to the group. Help on using Google Groups is also available at: http://groups.google.com/support From: Grant Grundler To: Christoph Hellwig Cc: Dan Bar Dov , open-iscsi at googlegroups.com, openib-general at openib.org Subject: Re: [openib-general] [ANNOUNCE] Initial trunk checkin of ISER initiator On Thu, Aug 18, 2005 at 02:36:05PM +0200, Christoph Hellwig wrote: > > All the iSCSI features including device management are available > > seamlessly with the iSCSI/ISER initiator. ISER simply puts iSCSI > > on steroids. > > > > The ISER implementation makes use of the openIB/kDAPL. Please note > > that several kDAPL patches that were submitted to the list are > > necessary for this implementation to work. > > The code is complete crap, please remove it again. Christoph, While I agree with you, that's not a very constructive approach. Can you pick 5 things that are brain damaged and point them out? Here are a few easy ones in iser.h: * vi: set noautoindent tabstop=4 shiftwidth=4 : This doesn't follow kernel coding style. Tabstops are *8* spaces. #ifndef CONFIG_INFINIBAND #include #else #include ... WTF? this module won't get built w/o CONFIG_INFINIBAND defined. Delete all references to the !CONFIG_INFINIBAND cases. Delete the pile of typedef's that are commented out #ifndef MIN #define MIN(a,b) ((a) < (b) ? (a) : (b)) #endif delete MIN and MAX. Use min ser_conn_state] /*! -------------------------------------------------------------------- [enum iser_conn_state] Description: iSER connection state -------------------------------------------------------------------- */ enum iser_conn_state { The comment is utterly useless. Delete it or add some content. Ditto for several of the additional enum declaratations further down. /*! 
-------------------------------------------------------------------- [struct iser_phys_mem_t] Description: Physical address based memory descriptor -------------------------------------------------------------------- */ struct iser_phys_mem_t { uint64_t *addrs; int length; int offset; int data_size; Is this a clone of "struct scatterlist"? See include/asm-*/scatterlist.h. // uint64_t addr; // uint64_t size; }; /* iser_phys_mem_t */ Delete the C++ style comments. struct iser_op_params_t { uint32_t InitiatorRecvDataSegmentLength; /* 512..2*24-1], default: 8K, func: MIN */ uint32_t TargetRecvDataSegmentLength; /* 512..2*24-1], default: 8K, func: MIN */ uint32_t FirstBurstLength; /* [512..2**24-1], default: 64K, func: MIN */ uint32_t MaxBurstLength; /* [512..2**24-1], default: 256K, func: MIN */ ... 1) Codingstyle: use tabs, not 4 spaces. 2) Codingstyle: WankyCapsNames are strongly discouraged. (See Chap 4 of linux/Documentation/Codingstyle) enum iser_op_param_default { defaultInitiatorRecvDataSegmentLength = 8*1024, defaultTargetRecvDataSegmentLength = 8*1024, defaultFirstBurstLength = 64*1024, ... Ditto. And why aren't these constants? Any advantage to the enum type? I'm wondering if any of these is replacable with ISER_LOGIN_PHASE_PDU_DATA_LEN or related constants defined earlier in the file. Ok...made it half-way through (line 286 of 649) the first file and that's only 10% of the .h files in ulp/iser/datamover. There are 11K lines of .c and 2.5K lines of .h. I hope that's enough to give Dan Bar Dov an idea of what the problem is. 
hth,
grant

----- End forwarded message -----

From rolandd at cisco.com Thu Aug 18 12:14:25 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 18 Aug 2005 12:14:25 -0700
Subject: [openib-general] Re: uVerbs: ibv_query_port
In-Reply-To: <4304DC0E.4050605@ichips.intel.com> (Arlin Davis's message of "Thu, 18 Aug 2005 12:05:50 -0700")
References: <4304DC0E.4050605@ichips.intel.com>
Message-ID: <52ek8r9hwu.fsf@cisco.com>

Thanks, I checked in a fix for this.

 - R.

From jlentini at netapp.com Thu Aug 18 12:28:35 2005
From: jlentini at netapp.com (James Lentini)
Date: Thu, 18 Aug 2005 15:28:35 -0400 (EDT)
Subject: [openib-general][PATCH][kdapl]: FMR and EVD patch
In-Reply-To:
References:
Message-ID:

Hi Guy,

The one piece of this patch that remains unaccepted is:

Index: ib/dapl_evd.c
===================================================================
--- ib/dapl_evd.c (revision 3136)
+++ ib/dapl_evd.c (working copy)
@@ -1028,6 +1028,7 @@
 {
 	struct dapl_evd *evd;
 	int status = 0;
+	int pending_events;
 
 	evd = (struct dapl_evd *)evd_handle;
 	dapl_dbg_log (DAPL_DBG_TYPE_API, "%s: (evd=%p, upcall_policy=%d)\n",
@@ -1035,14 +1036,25 @@
 
 	spin_lock_irqsave(&evd->common.lock, evd->common.flags);
 	if ((upcall_policy != DAT_UPCALL_TEARDOWN) &&
-	    (upcall_policy != DAT_UPCALL_DISABLE) &&
-	    (evd->evd_flags & DAT_EVD_DTO_FLAG)) {
-		status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP);
-		if (status) {
-			printk(KERN_ERR "%s: dapls_ib_completion_notify failed "
-			       "(status=0x%x)\n",__func__, status);
+	    (upcall_policy != DAT_UPCALL_DISABLE)) {
+		pending_events = dapl_rbuf_count(&evd->pending_event_queue);
+		if (pending_events) {
+			dapl_dbg_log(DAPL_DBG_TYPE_WARN,
+				"%s: (evd %p) there are still %d pending "
+				"events in the queue - policy stays disabled\n",
+				__func__, evd_handle, pending_events);
+			status = -EBUSY;
 			goto bail;
 		}
+		if (evd->evd_flags & DAT_EVD_DTO_FLAG) {
+			status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP);
+			if (status) {
+				printk(KERN_ERR "%s: dapls_ib_completion_notify"
+					" failed (status=0x%x) \n",__func__,
+					status);
+				goto bail;
+			}
+		}
 	}
 	evd->upcall_policy = upcall_policy;
 	evd->upcall = *upcall;

The IB analog to this function, ib_req_notify_cq(), does not require that the CQ be empty. The kDAPL specification does not define an empty EVD as a requirement for modifying the upcall and previous implementations of the API have not made this requirement.

From yaronh at voltaire.com Thu Aug 18 12:28:38 2005
From: yaronh at voltaire.com (Yaron Haviv)
Date: Thu, 18 Aug 2005 22:28:38 +0300
Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator
Message-ID: <35EA21F54A45CB47B879F21A91F4862F713C92@taurus.voltaire.com>

> -----Original Message-----
> From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Christoph Hellwig
> Sent: Thursday, August 18, 2005 8:36 AM
> To: Dan Bar Dov
> Cc: open-iscsi at googlegroups.com; openib-general at openib.org
> Subject: Re: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator
>
> On Thu, Aug 18, 2005 at 03:14:05PM +0300, Dan Bar Dov wrote:
> > I just checked in a first version of iSCSI Extensions for RDMA
> > Protocol (ISER) initiator under infiniband/ulp/iser. This
> > implements the ISER datamover, a transport layer alternative to
> > TCP/IP usable by iSCSI. This ISER transport has been tested with
> > the open-iscsi opensource project, and against the Voltaire
> > Fibre-Channel Router (FCR) and Voltaire's Native-IB storage kit.
> >
> > All the iSCSI features including device management are available
> > seamlessly with the iSCSI/ISER initiator. ISER simply puts iSCSI
> > on steroids.
> >
> > The ISER implementation makes use of the openIB/kDAPL. Please note
> > that several kDAPL patches that were submitted to the list are
> > necessary for this implementation to work.
>
> The code is complete crap, please remove it again.
Christoph,

iSER is part of OpenIB just like any other ULP, and there needs to be a productive process for adding it to the stack.

Your feedback is valuable, but we need more details on what concerns you. The iSER team is committed to addressing any feedback presented on this list; after all, it's just an initial posting.

Yaron

>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From yaronh at voltaire.com Thu Aug 18 12:42:22 2005
From: yaronh at voltaire.com (Yaron Haviv)
Date: Thu, 18 Aug 2005 22:42:22 +0300
Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator
Message-ID: <35EA21F54A45CB47B879F21A91F4862F713C99@taurus.voltaire.com>

> -----Original Message-----
> From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Grant Grundler
> Sent: Thursday, August 18, 2005 2:18 PM
> To: Christoph Hellwig
> Cc: open-iscsi at googlegroups.com; openib-general at openib.org
> Subject: Re: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator
>
> On Thu, Aug 18, 2005 at 07:43:17PM +0200, Christoph Hellwig wrote:
> ...
> > > The same as last time, the code didn't change at all. It's still
> > > totally ignorant about such essential things as dma mapping, has
> > > creative new abuse for struct iovec, it's still based on iovecs,
> >
> > "... still based on kdapl" of course
>
> Yeah, I was wondering about that. When I was off on vacation
> in July (and OLS), kDAPL was committed to the svn repository.
> Has anyone reviewed that?
>
> I was under the impression kDAPL would never make it into
> the openib.org source tree. Or has something changed?
>

Grant,

Currently kDAPL is the ONLY layer that can be abstracted over both IB & iWarp, due to the different CM model of the two interconnects. iSER and NFS/RDMA are common to both IB & iWarp and are implemented to run on both.

Until OpenIB will define another layer that can be used for both, there is no other viable alternative for iSER to be implemented on top. In future if a new common API/Layer will be provided iSER can change to support it.

Also appreciate your productive feedback on the code, the team will address it

Yaron

> grant

From guyg at voltaire.com Thu Aug 18 13:46:31 2005
From: guyg at voltaire.com (Guy German)
Date: Thu, 18 Aug 2005 23:46:31 +0300
Subject: [openib-general][PATCH][kdapl]: FMR and EVD patch
Message-ID:

Hi James,

I will try to explain the reason behind this patch.

In IB, a “normal” working flow for a consumer is:
- Receive a CQ notification callback
- Wake up the polling thread
- Poll for completions (empty the queue)
- Request completion notification

There is no problem here.

In kDAPL, however, the consumer will keep getting upcalls until he sets the upcall policy to disable. So a working flow will be:
- Receive an EVD upcall
- Disable the EVD upcall policy
- Wake up the polling thread
- Dequeue all of the EVD’s events
- Enable the EVD upcall policy

There is a race here: a completion can arrive after the last dequeue and before the enable. The provider won’t call the consumer (the policy is disabled) and the consumer will not dequeue any more because he “knows” the queue is empty.

I think it is a very bad idea to solve this race by adding another evd_dequeue after you enable the upcall policy. If you do that, you will have a polling thread (because while you dequeue one completion, many more can follow) and at the same time you will receive upcalls from the DAPL provider. Besides the fact that this is an expensive and unnecessary context switch, you have an upcall and a thread racing. You will have a situation where the upcall has an event at hand and the thread has an event, both not handled yet; you will have to queue them again internally, or something similar, to keep the order. And I think that is only a partial list of the problems in this case.

So my suggestion is simple: it solves the race, it saves the unnecessary context switch, and it spares the consumer side the complexity. The solution is to notify the consumer, when he tries to enable the upcall policy, that the queue is actually not empty, and force him to continue polling (in the same thread context he is in now). dat_evd_modify_upcall is guarded by a spin_lock_irqsave when it checks the queue, so the race would not occur.

BTW, I’m not sure if it is still the case, but I think that one of the ULPs in OpenIB did not use a kernel thread for dequeueing. This is a very bad design, as the upcall can be polling for *long* periods of time in a tasklet/interrupt context.

That’s it. Sorry for the long mail – I hope it was not too blurry.

Guy.
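The consumer-side loop that Guy's proposal implies can be sketched in plain C as below. This is only a user-space illustration under stated assumptions, not kDAPL code: `struct sim_evd`, `sim_evd_dequeue()`, `sim_evd_enable_upcall()` and `sim_consumer_poll()` are hypothetical names, the queue is reduced to a counter, and the `late_events` field simply injects a completion just before the emptiness check to imitate the race window. It also ignores James's earlier point that `pending_event_queue` holds only software events and that the CQ can be non-empty at the same time.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an EVD: the queue is just a counter. */
struct sim_evd {
    int  pending;         /* events waiting to be dequeued */
    bool upcall_enabled;  /* false while the consumer thread is polling */
    int  late_events;     /* completions that land just before an enable attempt */
};

/* Dequeue one event; returns false when the queue is empty. */
static bool sim_evd_dequeue(struct sim_evd *evd)
{
    if (evd->pending == 0)
        return false;
    evd->pending--;
    return true;
}

/*
 * Models the check proposed in the patch: refuse to re-enable the upcall
 * while events are still queued. In the patch the check runs under
 * evd->common.lock; here there is no concurrency, so a late arrival is
 * simulated by moving one event into the queue before the check.
 */
static int sim_evd_enable_upcall(struct sim_evd *evd)
{
    if (evd->late_events > 0) {
        evd->late_events--;
        evd->pending++;
    }
    if (evd->pending > 0)
        return -EBUSY;      /* policy stays disabled, caller keeps polling */
    evd->upcall_enabled = true;
    return 0;
}

/* Consumer polling thread: drain, try to enable, repeat on -EBUSY. */
static int sim_consumer_poll(struct sim_evd *evd)
{
    int handled = 0;

    assert(!evd->upcall_enabled);  /* the upcall disabled the policy first */
    do {
        while (sim_evd_dequeue(evd))
            handled++;
    } while (sim_evd_enable_upcall(evd) == -EBUSY);

    return handled;
}
```

With three queued events and two late arrivals, `sim_consumer_poll()` goes around the loop three times and only returns once the enable succeeds on an empty queue, so no event is left behind and no extra dequeue is needed after the enable.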
From caitlin.bestler at gmail.com Thu Aug 18 14:46:16 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Thu, 18 Aug 2005 14:46:16 -0700 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: <469958e0050818144657ceb90f@mail.gmail.com> Yes. dat_evd_modify_upcall has been called, but the current upcall instance has not yet returned. During this period the consumer should check to see if the EVD is drained. If so, the consumer is no longer notified (re this EVD). On 8/18/05, Guy German wrote: > Hi Caitlin, > Caitlin Bestler wrote: > > Some clarifications are needed here. > > > > First the Consumer is responsible for draining the > > EVD after re-enabling it, or at least for remembering > > that there may be undrained notified events. > > Can you please explain what you mean by "re-enabling" > the EVD ? Do you mean calling dat_evd_modify_upcall > and changing the upcall policy from disable, back to > enable ? > > > > > That is "you-have-been-notified" is a sticky boolean > > attribute that the Consumer is supposed to set to TRUE > > when the upcall is made and only clear when the EVD > > has been drained *after* re-enabling. > > > > Second, is that the EVD is first and foremost an event > > *serializer*. It is presumed to have a finite number of > > resources for making upcalls (at most one for the typical > > case where SINGLE is enabled). The next upcall per > > resource CANNOT occur until after the current upcall > > has completed. > > > > Whether this should be solved in the DAT Provider is > > a question of what the verb-layer provider is allowed > > to do. If the verb layer provider can in fact generate > > multiple concurrent upcalls for the same CQ then the > > EVD itself must guard against re-entrancy. 
> > > > A more likely implementation is that upcalls triggered > > by post_se, CM events and CQs could theoretically > > occur at the same instance -- but that none of these > > paths can be re-entrant by themselves. > > > > Once the potential re-entrancy from the verb layer > > is known, then an optimal strategy can be selected. > > For exaple, if the only potential re-entrancy comes > > when the upcall interrupts a post_se call then some > > simple critical regions can avoid all problems without > > general purpose spinlocks or semaphores. > > > > On 8/16/05, James Lentini wrote: > >> > >> > >> On Tue, 16 Aug 2005, Guy German wrote: > >> > >>>>>>>>> Also, the pending_event_queue is only used for kDAPL generated > >>>>>>>>> software events. This queue can be empty when there are > >>>>>>>>> events on the CQ, so your would need to be expanded your > >>>>>>>>> check to cover that. > >>>>>>> > >>>>>>> Actually, even though, I agreed before, I tend to disagree now. > >>>>>>> The consumer will still get the DTO events as soon as the CQ > >>>>>>> upcall is triggered (enabled), so only problem is with the > >>>>>>> pending events list. > >>>>>> > >>>>>> Why is it an error for the consumer to modify the upcall policy > >>>>>> when there are pending events? > >>>>>> > >>>>>> dat_evd_modify_upcall should behave just like the IBTA spec's > >>>>>> Request Completion Notification verb in this respect. If there > >>>>>> were events on the EVD before the upcall is enabled, no upcall > >>>>>> needs to be generated. A correct consumer can easily work around > >>>>>> this by enabling the upcall and polling the EVD one final time > >>>>>> to ensure it is empty. > >>>>> > >>>>> There can be more than one event, and the consumer would need to > >>>>> dequeue many times. While the consumer would do his extra > >>>>> dequeue-ing he might also get an upcall, because his policy is > >>>>> now enabled. 
I can't think of a design that can handle such a > >>>>> case, and if there is one it is demanding and complicated, from > >>>>> the consumers side. > >>>> > >>>> Isn't it the same position all event code written to the OpenIB > >>>> API is in? > >>> > >>> I don't quite know what you are reffering to, but if you are > >>> reffering to the case of cq in IB - It's totally different: you > >>> only enable the cq once, so you will only get one upcall, and the > >>> rest of the events you will need to dequeue. > >> > >> The consumer should only receive one upcall at a time if the upcall > >> policy is DAT_UPCALL_SINGLE_INSTANCE. If the dequeues are performed > >> in an upcall, the logic needed in an OpenIB consumer and kDAPL > >> consumer is essentially the same. > >> > >> The difference is that the OpenIB consumer needs to re-enable the CQ > >> upcall and poll to make sure no events were missed. > >> > >>>> I agree with you that this programming model is difficult to use, > >>>> but I don't think it is impossible. > >>> > >>> I think it is a bad idea to dequeue events and at the same time > >>> receive upcalls from the same queue. It is racy, and has bad > >>> performance. I don't see *any* reason to do it. > >> > >> The current kDAPL implementation does create a situation in which an > >> upcall and poll occur simultaneously if the upcall is disabled, the > >> consumer enables the upcall, and then the consumer does a poll. In > >> this scenario an upcall can occur while the consumer is polling. I > >> was pointing out that this same race exists in the OpenIB verbs API > >> (and the IBTA verbs). > >> > >> Again, I agree that we can eliminate the additional poll after > >> enabling the upcall in kDAPL. We just need to do it in a way that is > >> not hardware specific. I believe we can use the same technique we > >> did in the DTO upcall. 
> >> > >> james > >> _______________________________________________ > >> openib-general mailing list > >> openib-general at openib.org > >> http://openib.org/mailman/listinfo/openib-general > >> > >> To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From iod00d at hp.com Thu Aug 18 15:26:06 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 18 Aug 2005 15:26:06 -0700 Subject: [openib-general] RDMA Read performance In-Reply-To: References: Message-ID: <20050818222606.GM15077@esmail.cup.hp.com> On Thu, Aug 18, 2005 at 01:05:49PM -0600, Galen Shipman wrote: > Hello, > > I am using RDMA READ for an internal application and I am seeing > problems with performance on OpenIB. I am not seeing the same > performance problems on like hardware using Mellanox VAPI. Perhaps I am > doing something silly. Attached is a patch to rdma_bw.c where I am > seeing only ~192MB/sec which should be closer to 950MB/sec. Galen, Normally patches should be attached as "plain text" and not as: [-- Attachment #2: rdma_bw_diff --] [-- Type: application/octet-stream, Encoding: 7bit, Size: 2.4K --] [-- application/octet-stream is unsupported (use 'v' to view this part) --] The patch also has "white space damage" - ie white space changes that don't belong in the patch. Just a distraction. > Any help or ideas are appreciated. I can try it out on my boxes. ISTR I was getting ~750 MB/s on the regular rdma_bw.c code. I agree 192 MB/s seems really low for any PCI-X box. I have to wonder if your version (using reads) is not getting the parallelism that writes could acheive. 192MB/s sounds more like ping-pong with large pages. Could you run the ibv_pingpong test and report what that gets? 
(It's under .../src/userspace/libibverbs/examples/) grant From mshefty at ichips.intel.com Thu Aug 18 15:28:58 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 18 Aug 2005 15:28:58 -0700 Subject: [openib-general] RDMA Read performance In-Reply-To: <20050818222606.GM15077@esmail.cup.hp.com> References: <20050818222606.GM15077@esmail.cup.hp.com> Message-ID: <43050BAA.2040706@ichips.intel.com> Grant Grundler wrote: > I have to wonder if your version (using reads) is not getting the > parallelism that writes could acheive. 192MB/s sounds more like > ping-pong with large pages. > Could you run the ibv_pingpong test and report what that gets? What are max_rd_atomic and max_dest_rd_atomic QP attributes set to? - Sean From guyg at voltaire.com Thu Aug 18 15:31:40 2005 From: guyg at voltaire.com (Guy German) Date: Fri, 19 Aug 2005 01:31:40 +0300 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation Message-ID: > Yes. > dat_evd_modify_upcall has been called, but the current > upcall instance has not yet returned. During this period > the consumer should check to see if the EVD is drained. > If so, the consumer is no longer notified (re this EVD). I don't follow you - if the consumer is still in the upcall context, why should he be changing the upcall policy at all ? (assuming It's a single instance) Any way, I don't think it is recommended to "drain the evd", in the upcall's tasklet/interrupt context. There can be thousends of events to dequeue, and while you drain them, there can be more comming. You want to get out of that context as fast as possible. Guy On 8/18/05, Guy German wrote: > Hi Caitlin, > Caitlin Bestler wrote: > > Some clarifications are needed here. > > > > First the Consumer is responsible for draining the > > EVD after re-enabling it, or at least for remembering > > that there may be undrained notified events. > > Can you please explain what you mean by "re-enabling" > the EVD ? 
Do you mean calling dat_evd_modify_upcall > and changing the upcall policy from disable, back to > enable ? > > > > > That is "you-have-been-notified" is a sticky boolean > > attribute that the Consumer is supposed to set to TRUE > > when the upcall is made and only clear when the EVD > > has been drained *after* re-enabling. > > > > Second, is that the EVD is first and foremost an event > > *serializer*. It is presumed to have a finite number of > > resources for making upcalls (at most one for the typical > > case where SINGLE is enabled). The next upcall per > > resource CANNOT occur until after the current upcall > > has completed. > > > > Whether this should be solved in the DAT Provider is > > a question of what the verb-layer provider is allowed > > to do. If the verb layer provider can in fact generate > > multiple concurrent upcalls for the same CQ then the > > EVD itself must guard against re-entrancy. > > > > A more likely implementation is that upcalls triggered > > by post_se, CM events and CQs could theoretically > > occur at the same instance -- but that none of these > > paths can be re-entrant by themselves. > > > > Once the potential re-entrancy from the verb layer > > is known, then an optimal strategy can be selected. > > For exaple, if the only potential re-entrancy comes > > when the upcall interrupts a post_se call then some > > simple critical regions can avoid all problems without > > general purpose spinlocks or semaphores. > > > > On 8/16/05, James Lentini wrote: > >> > >> > >> On Tue, 16 Aug 2005, Guy German wrote: > >> > >>>>>>>>> Also, the pending_event_queue is only used for kDAPL generated > >>>>>>>>> software events. This queue can be empty when there are > >>>>>>>>> events on the CQ, so your would need to be expanded your > >>>>>>>>> check to cover that. > >>>>>>> > >>>>>>> Actually, even though, I agreed before, I tend to disagree now. 
> >>>>>>> The consumer will still get the DTO events as soon as the CQ > >>>>>>> upcall is triggered (enabled), so only problem is with the > >>>>>>> pending events list. > >>>>>> > >>>>>> Why is it an error for the consumer to modify the upcall policy > >>>>>> when there are pending events? > >>>>>> > >>>>>> dat_evd_modify_upcall should behave just like the IBTA spec's > >>>>>> Request Completion Notification verb in this respect. If there > >>>>>> were events on the EVD before the upcall is enabled, no upcall > >>>>>> needs to be generated. A correct consumer can easily work around > >>>>>> this by enabling the upcall and polling the EVD one final time > >>>>>> to ensure it is empty. > >>>>> > >>>>> There can be more than one event, and the consumer would need to > >>>>> dequeue many times. While the consumer would do his extra > >>>>> dequeue-ing he might also get an upcall, because his policy is > >>>>> now enabled. I can't think of a design that can handle such a > >>>>> case, and if there is one it is demanding and complicated, from > >>>>> the consumers side. > >>>> > >>>> Isn't it the same position all event code written to the OpenIB > >>>> API is in? > >>> > >>> I don't quite know what you are reffering to, but if you are > >>> reffering to the case of cq in IB - It's totally different: you > >>> only enable the cq once, so you will only get one upcall, and the > >>> rest of the events you will need to dequeue. > >> > >> The consumer should only receive one upcall at a time if the upcall > >> policy is DAT_UPCALL_SINGLE_INSTANCE. If the dequeues are performed > >> in an upcall, the logic needed in an OpenIB consumer and kDAPL > >> consumer is essentially the same. > >> > >> The difference is that the OpenIB consumer needs to re-enable the CQ > >> upcall and poll to make sure no events were missed. > >> > >>>> I agree with you that this programming model is difficult to use, > >>>> but I don't think it is impossible. 
> >>> > >>> I think it is a bad idea to dequeue events and at the same time > >>> receive upcalls from the same queue. It is racy, and has bad > >>> performance. I don't see *any* reason to do it. > >> > >> The current kDAPL implementation does create a situation in which an > >> upcall and poll occur simultaneously if the upcall is disabled, the > >> consumer enables the upcall, and then the consumer does a poll. In > >> this scenario an upcall can occur while the consumer is polling. I > >> was pointing out that this same race exists in the OpenIB verbs API > >> (and the IBTA verbs). > >> > >> Again, I agree that we can eliminate the additional poll after > >> enabling the upcall in kDAPL. We just need to do it in a way that is > >> not hardware specific. I believe we can use the same technique we > >> did in the DTO upcall. > >> > >> james > >> _______________________________________________ > >> openib-general mailing list > >> openib-general at openib.org > >> http://openib.org/mailman/listinfo/openib-general > >> > >> To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From iod00d at hp.com Thu Aug 18 16:41:20 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 18 Aug 2005 16:41:20 -0700 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F713C99@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F713C99@taurus.voltaire.com> Message-ID: <20050818234120.GN15077@esmail.cup.hp.com> On Thu, Aug 18, 2005 at 10:42:22PM +0300, Yaron Haviv wrote: > > Yeah, I was wondering about that. When I was off on vacation > > in July (and OLS), kDAPL was committed to the svn repository. > > Has anyone reviewed that? > > > > I was under the impression kDAPL would never make it into > > the openib.org source tree. Or has something changed? 
> > Grant, > > Currently kDAPL is the ONLY layer that can be abstracted over both IB & > iWarp, due to the different CM model of the two interconnects > iSER and NFS/RDMA are common to both IB & iWarp and are implemented to > run on both. I think I understand why kDAPL exists. But doesn't there need to be some iwarp code under src/linux-kernel/infiniband/ulp/kdapl ? I only see "ib". Let me rephrase my question since your answer doesn't really address what I was asking: Any reason why kDAPL might not get pushed upstream? Some of the original objections were it's an unnecessary layer that interfers with performance and bloats the linux kernel source tree. Arguments about portability between user/kernel and portability across OSs won't (or didn't) fly in the linux kernel community. Has that perception about kDAPL change since March? If kDAPL for any reason doesn't get pushed upstream to kernel.org, we effectively don't have iSER or NFS/RDMA in linux. Since I think without them, linux won't be competitive in the commercial market place. > Until OpenIB will define another layer that can be used for both, there > is no other viable alternative for iSER to be implemented on top > In future if a new common API/Layer will be provided iSER can change to > support it I've understood that the openib.org Verbs API can be changed to make it "transport neutral" - ie support RNICs. RNIC vendors don't seem to be interested in submitting patches for that. Did someone think they can drop kDAPL into openib.org SVN and roland would automatically push that into kernel.org? I'm not convinced of that and worry that iSER and NFS/RDMA won't make it into kernel.org as things stand now. > Also appreciate your productive feedback on the code, s/productive/constructive/ :^) "productive" would have been to provide a patch to implement what I was whining about. 
:^) (I'm just teasing in case that's not obvious) > the team will address it I'd suggest addressing Christoph's issues first since they are "deeper". This includes his comments today and his original comments here: http://openib.org/pipermail/openib-general/2005-March/thread.html Look for "putting in dead wood for DAPL" on that page. I've mostly only commented on superficial things (mostly Codingstyle). {/rant on} If people used RFC compliant email clients, the emails would be archived all under one thread in that web page....but it's not. :^( It's multiple threads. {/rant off} thanks, grant From hch at lst.de Thu Aug 18 16:44:54 2005 From: hch at lst.de (Christoph Hellwig) Date: Fri, 19 Aug 2005 01:44:54 +0200 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <20050818234120.GN15077@esmail.cup.hp.com> References: <35EA21F54A45CB47B879F21A91F4862F713C99@taurus.voltaire.com> <20050818234120.GN15077@esmail.cup.hp.com> Message-ID: <20050818234454.GA632@lst.de> > If kDAPL for any reason doesn't get pushed upstream to kernel.org, > we effectively don't have iSER or NFS/RDMA in linux. > Since I think without them, linux won't be competitive in the > commercial market place. iser doesn't matter at all in the marketplace. nfs/rdma matters and even if netapp/citi keeps beeing ignorant I will port it over to the infiniband/rdma layer myself. I'll hopefully have some iwarp cards soon. From iod00d at hp.com Thu Aug 18 18:03:54 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 18 Aug 2005 18:03:54 -0700 Subject: [openib-general] RDMA Read performance In-Reply-To: <43050BAA.2040706@ichips.intel.com> References: <20050818222606.GM15077@esmail.cup.hp.com> <43050BAA.2040706@ichips.intel.com> Message-ID: <20050819010354.GQ15077@esmail.cup.hp.com> On Thu, Aug 18, 2005 at 03:28:58PM -0700, Sean Hefty wrote: > Grant Grundler wrote: > >I have to wonder if your version (using reads) is not getting the > >parallelism that writes could acheive. 
192MB/s sounds more like > >ping-pong with large pages. > >Could you run the ibv_pingpong test and report what that gets? > > What are max_rd_atomic and max_dest_rd_atomic QP attributes set to? looks like both are hardcoded to 1 in different parts of pp_connect_ctx(). grant From yaronh at voltaire.com Thu Aug 18 18:13:12 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Fri, 19 Aug 2005 04:13:12 +0300 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator Message-ID: <35EA21F54A45CB47B879F21A91F4862F713CA5@taurus.voltaire.com> > -----Original Message----- > From: Grant Grundler [mailto:iod00d at hp.com] > Sent: Thursday, August 18, 2005 7:41 PM > To: Yaron Haviv > Cc: Grant Grundler; Christoph Hellwig; open-iscsi at googlegroups.com; > openib-general at openib.org > Subject: Re: [openib-general] [ANNOUNCE] Initial trunk checkin of > ISERinitiator > > > > Until OpenIB will define another layer that can be used for both, there > > is no other viable alternative for iSER to be implemented on top > > In future if a new common API/Layer will be provided iSER can change to > > support it > > I've understood that the openib.org Verbs API can be changed to make > it "transport neutral" - ie support RNICs. RNIC vendors don't seem > to be interested in submitting patches for that. Did someone think > they can drop kDAPL into openib.org SVN and roland would automatically > push that into kernel.org? > > I'm not convinced of that and worry that iSER and NFS/RDMA won't > make it into kernel.org as things stand now. 
> Grant, The Verb portion deals with the data path operations (after the connection was established), the connection establishment process is very different IB CM is implemented on top of the verbs, an iWarp specific CM would also need to be developed in parallel (interacts with the TCP stack ..), and common ULPs need a single mechanism to use both (in the DAPL case a BSD like API using IP addresses) Again I'm not saying kDAPL is the ultimate solution or that it will last in its current form, its just the only thing we can use today, if someone would come with a better implementation we can just change iSER In one of the previous threads I suggested building a hybrid layer that uses the current verb APIs for verb type operations, and the DAPL code for the connection establishment, resulting in a simpler/shorter code, this would present a middle ground addressing the concerns on both sides Yaron From yaronh at voltaire.com Thu Aug 18 19:47:33 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Fri, 19 Aug 2005 05:47:33 +0300 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator Message-ID: <35EA21F54A45CB47B879F21A91F4862F713CA6@taurus.voltaire.com> -----Original Message----- > From: Christoph Hellwig [mailto:hch at lst.de] > Sent: Thursday, August 18, 2005 7:45 PM > To: Grant Grundler > Cc: Yaron Haviv; open-iscsi at googlegroups.com; openib-general at openib.org > Subject: Re: [openib-general] [ANNOUNCE] Initial trunk checkin of > ISERinitiator > > > If kDAPL for any reason doesn't get pushed upstream to kernel.org, > > we effectively don't have iSER or NFS/RDMA in linux. > > Since I think without them, linux won't be competitive in the > > commercial market place. > > iser doesn't matter at all in the marketplace. nfs/rdma matters and > even if netapp/citi keeps beeing ignorant I will port it over to the > infiniband/rdma layer myself. I'll hopefully have some iwarp cards > soon. 
Christoph, Can you help me understand how would you address the CM issue, would you add IB/iWarp specific code into all the ULPs (NFS, SDP, MPI, Lustre, iSER, ..) ? Regarding iSER, You are entitled to your opinion Many others won't agree with you and think that in the long run iSER will be the only viable block storage alternative in OpenIB, mainly since it fits the IB/iWarp generalization and it is much more complete than alternatives, and with the recent IETF moves people can't claim its non-standard anymore. Not every one wants to keep on doing target discovery with Python scripts, and some prefer just using existing code and management from iSCSI rather than inventing new mechanisms just for IB Yaron From rolandd at cisco.com Thu Aug 18 21:18:34 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 18 Aug 2005 21:18:34 -0700 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <20050818234120.GN15077@esmail.cup.hp.com> (Grant Grundler's message of "Thu, 18 Aug 2005 16:41:20 -0700") References: <35EA21F54A45CB47B879F21A91F4862F713C99@taurus.voltaire.com> <20050818234120.GN15077@esmail.cup.hp.com> Message-ID: <52ll2y8spx.fsf@cisco.com> Grant> Let me rephrase my question since your answer doesn't Grant> really address what I was asking: Any reason why kDAPL Grant> might not get pushed upstream? Yes. In fact, I am quite sure that kDAPL will not go upstream. The recent discussions trying to make sense of the kDAPL event handling mess just convince me even more the kDAPL is quite broken. I do agree that we need some abstraction of connection to cover both IB and iWARP, but I think it would be much more productive to try and build a sane connection API from scratch. Grant> If kDAPL for any reason doesn't get pushed upstream to Grant> kernel.org, we effectively don't have iSER or NFS/RDMA in Grant> linux. Since I think without them, linux won't be Grant> competitive in the commercial market place. This I disagree with. 
Linux is such a huge part of the RDMA world that I would tend to believe the converse: without robust Linux implementations, iSER and NFS/RDMA won't be competitive. - R. From rolandd at cisco.com Thu Aug 18 21:24:24 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 18 Aug 2005 21:24:24 -0700 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F713CA6@taurus.voltaire.com> (Yaron Haviv's message of "Fri, 19 Aug 2005 05:47:33 +0300") References: <35EA21F54A45CB47B879F21A91F4862F713CA6@taurus.voltaire.com> Message-ID: <52hddm8sg7.fsf@cisco.com> Yaron> Not every one wants to keep on doing target discovery with Yaron> Python scripts, Come on, this is just a stupid statement. The whole point of putting device management in userspace is so that everybody has the flexibility to use whatever discovery mechanism they want. I agree that the SRP and iSER protocols are basically equivalent at a technical level: they both transport SCSI over RDMA. If you want to compare existing implementations, I'd much rather use my SRP driver's 1600 lines of code over your 14000+ lines of x86-only iSER on top of 10000+ lines of kDAPL (not even counting the iSCSI core). - R.
From yaronh at voltaire.com Thu Aug 18 22:22:23 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Fri, 19 Aug 2005 08:22:23 +0300 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator Message-ID: <35EA21F54A45CB47B879F21A91F4862F713CA8@taurus.voltaire.com> > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Friday, August 19, 2005 12:24 AM > To: Yaron Haviv > Cc: Christoph Hellwig; Grant Grundler; open-iscsi at googlegroups.com; > openib-general at openib.org > Subject: Re: [openib-general] [ANNOUNCE] Initial trunk checkin of > ISERinitiator > > Yaron> Not every one wants to keep on doing target discovery with > Yaron> Python scripts, > > Come on, this is just a stupid statement. The whole point of putting > device management in userspace is so that everybody has the > flexibility to use whatever discovery mechanism they want. You know there is a small problem in storage, people don't want to just use what "they want", but rather use standard management, discovery, Security, HA, etc' which are quite essential for commercial customers > I agree that the SRP and iSER protocols are basically equivalent at a > technical level: they both transport SCSI over RDMA. If you want to > compare existing implementations, I'd much rather use my SRP driver's > 1600 lines of code over your 14000+ lines of x86-only iSER on top of > 10000+ lines of kDAPL (not even counting the iSCSI core).
Not sure how you do your LOC counting or what's included in it In any case a protocol that is generalized to multiple transports, has built in discovery, error-recovery, global routing/naming, authentication, built-in multi-pathing, multi-connection per session, optimizations for small messages, comprehensive management and configuration with industry standard APIs, etc' Probably need to have more LOC than one that just tunnels SCSI command from one predefined point to another (by the way is DM, CFM and/or Python included in the 1400 :)) The important things is how many LOC are on the command path and how optimized it the protocol, this code runs SCSI at 850-900MB/s and on the same time provides the most comprehensive set of features, and is managed out of the box with industry standard tools A variation of that code runs today on PPC, so I assume it's not an issue to make sure it runs over PPC In any case let aside the religious discussion iSER needs to get into OpenIB and customers will then decide what ever they want, to get it in we need: 1. iSER developers to comply to Linux requirements and address any constructive feedback 2. have an API that can be used by ULP developers that want to be transport independent (till then kDAPL would need to be used) Yaron > > - R. From guyg at voltaire.com Fri Aug 19 01:12:56 2005 From: guyg at voltaire.com (Guy German) Date: Fri, 19 Aug 2005 11:12:56 +0300 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator Message-ID: Roland, Roland> Yes. In fact, I am quite sure that kDAPL will not go upstream. Roland> The recent discussions trying to make sense of the kDAPL event Roland> handling mess just convince me even more the kDAPL is quite broken. There was indeed a discussion about the event handling, and it rose a question, which I would like to hear your opinion on : Isn’t there a problem with polling for CQ, from the CQ’s upcall policy context (i.e. potentially tasklet/interrupt context). 
An ISER target can get thousands of completions, and draining them all in this context (not to mention handling them) would be problematic, I believe. I looked a little bit at the SRP code and it seems like you are polling the CQ in srp_completion. Do you think that is the right way to do it? Please let me add that the reason I'm asking this has nothing to do with the politics around kDAPL (or ISER), and is independent of kDAPL's prospects of getting into the kernel - this is a sincere technical question. Thanks, Guy p.s. ISER supports PPC, our targets run on PPC. From gshipman at lanl.gov Fri Aug 19 06:48:34 2005 From: gshipman at lanl.gov (Galen Shipman) Date: Fri, 19 Aug 2005 07:48:34 -0600 Subject: [openib-general] RDMA Read performance In-Reply-To: <20050818222606.GM15077@esmail.cup.hp.com> References: <20050818222606.GM15077@esmail.cup.hp.com> Message-ID: <34441d4cc93fdf624a1dd42263b89429@lanl.gov> Grant, > Could you run the ibv_pingpong test and report what that gets? Using: ibv_rc_pingpong --size=1048576 I am seeing 942 MBytes per sec. As I said previously, we have an internal application that sees ~950 MBytes per sec using RDMA Write. It looks like ibv_rc_pingpong is using send/receive and not RDMA. Perhaps someone (Roland) has an RDMA Read test they can point me to? Thanks, Galen From hbchen at lanl.gov Fri Aug 19 07:06:40 2005 From: hbchen at lanl.gov (Hsing-bung(HB) Chen) Date: Fri, 19 Aug 2005 08:06:40 -0600 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <20050818234454.GA632@lst.de> References: <35EA21F54A45CB47B879F21A91F4862F713C99@taurus.voltaire.com> <20050818234120.GN15077@esmail.cup.hp.com> <20050818234454.GA632@lst.de> Message-ID: <4305E770.7090208@lanl.gov> Christoph Hellwig wrote: >>If kDAPL for any reason doesn't get pushed upstream to kernel.org, >>we effectively don't have iSER or NFS/RDMA in linux. >>Since I think without them, linux won't be competitive in the >>commercial market place.
>> >> > >iser doesn't matter at all in the marketplace. > Really? HB Chen LANL > nfs/rdma matters and >even if netapp/citi keeps beeing ignorant I will port it over to the >infiniband/rdma layer myself. I'll hopefully have some iwarp cards >soon. >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hch at lst.de Fri Aug 19 07:20:18 2005 From: hch at lst.de (Christoph Hellwig) Date: Fri, 19 Aug 2005 16:20:18 +0200 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <20050818234120.GN15077@esmail.cup.hp.com> References: <35EA21F54A45CB47B879F21A91F4862F713C99@taurus.voltaire.com> <20050818234120.GN15077@esmail.cup.hp.com> Message-ID: <20050819142018.GE12485@lst.de> On Thu, Aug 18, 2005 at 04:41:20PM -0700, Grant Grundler wrote: > I've understood that the openib.org Verbs API can be changed to make > it "transport neutral" - ie support RNICs. RNIC vendors don't seem > to be interested in submitting patches for that. Ammasso is working on that. All the other RNIC vendors seems to be involved in the openrdma.org spec masturbation fest. Looks like we have a winner for iWarp HCAs on Linux.. From hch at lst.de Fri Aug 19 07:21:30 2005 From: hch at lst.de (Christoph Hellwig) Date: Fri, 19 Aug 2005 16:21:30 +0200 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <52hddm8sg7.fsf@cisco.com> References: <35EA21F54A45CB47B879F21A91F4862F713CA6@taurus.voltaire.com> <52hddm8sg7.fsf@cisco.com> Message-ID: <20050819142130.GF12485@lst.de> On Thu, Aug 18, 2005 at 09:24:24PM -0700, Roland Dreier wrote: > Yaron> Not every one wants to keep on doing target discovery with > Yaron> Python scripts, > > Come on, this is just a stupid statement. 
The whole point of putting > device management in userspace is so that everybody has the > flexibility to use whatever discovery mechanism they want. And just FYI. If you ever want an iSER implementation merged it will have to work the same way. Look at how the open-iscsi TCP initiator does it. From hch at lst.de Fri Aug 19 07:26:02 2005 From: hch at lst.de (Christoph Hellwig) Date: Fri, 19 Aug 2005 16:26:02 +0200 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F713CA8@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F713CA8@taurus.voltaire.com> Message-ID: <20050819142602.GA13478@lst.de> On Fri, Aug 19, 2005 at 08:22:23AM +0300, Yaron Haviv wrote: > Not sure how you do your LOC counting or what's included in it > In any case a protocol that is generalized to multiple transports, has > built in discovery, error-recovery, global routing/naming, > authentication, built-in multi-pathing, multi-connection per session, > optimizations for small messages, comprehensive management and > configuration with industry standard APIs, etc' .... is a total mess and nothing you'd want to run in a production environment. Point taken ;-) > The important things is how many LOC are on the command path and how > optimized it the protocol, this code runs SCSI at 850-900MB/s and on the > same time provides the most comprehensive set of features, and is > managed out of the box with industry standard tools Which is slower than plain iSCSI over 10GigE. > A variation of that code runs today on PPC, so I assume it's not an > issue to make sure it runs over PPC Maybe on toy PPC processors like the 4xx. The code won't run on any non-toy platform with proper IOMMUs without a major rework. That being said, any driver that makes platform assumptions at all has no business in the openib.org or mainline kernel tree.
From yaronh at voltaire.com Fri Aug 19 08:04:12 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Fri, 19 Aug 2005 18:04:12 +0300 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator Message-ID: <35EA21F54A45CB47B879F21A91F4862F713CC9@taurus.voltaire.com> > -----Original Message----- > From: Christoph Hellwig [mailto:hch at lst.de] > Sent: Friday, August 19, 2005 10:22 AM > To: Roland Dreier > Cc: Yaron Haviv; Christoph Hellwig; Grant Grundler; open- > iscsi at googlegroups.com; openib-general at openib.org > Subject: Re: [openib-general] [ANNOUNCE] Initial trunk checkin of > ISERinitiator > > On Thu, Aug 18, 2005 at 09:24:24PM -0700, Roland Dreier wrote: > > Yaron> Not every one wants to keep on doing target discovery with > > Yaron> Python scripts, > > > > Come on, this is just a stupid statement. The whole point of putting > > device management in userspace is so that everybody has the > > flexibility to use whatever discovery mechanism they want. > > And just FYI. If you ever want an iSER implementation merged it will > have to work the same way. Look at how the open-iscsi TCP initator does > it. 
Good point, the high-level functionality in iSER is all done in Open-iSCSI and its userspace extensions iSER just deals with the data transfer and is layered under Open-iSCSI by the way can you point me to the iSCSI HBA that delivers better performance, latency, and memory consumption and what about the price of that HBA and the attached 10GbE switch Yaron From krause at cup.hp.com Fri Aug 19 08:16:44 2005 From: krause at cup.hp.com (Michael Krause) Date: Fri, 19 Aug 2005 08:16:44 -0700 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F713CC9@taurus.voltaire.com > References: <35EA21F54A45CB47B879F21A91F4862F713CC9@taurus.voltaire.com> Message-ID: <6.2.0.14.2.20050819081511.022749e0@esmail.cup.hp.com> At 08:04 AM 8/19/2005, Yaron Haviv wrote: > > -----Original Message----- > > From: Christoph Hellwig [mailto:hch at lst.de] > > Sent: Friday, August 19, 2005 10:22 AM > > To: Roland Dreier > > Cc: Yaron Haviv; Christoph Hellwig; Grant Grundler; open- > > iscsi at googlegroups.com; openib-general at openib.org > > Subject: Re: [openib-general] [ANNOUNCE] Initial trunk checkin of > > ISERinitiator > > > > On Thu, Aug 18, 2005 at 09:24:24PM -0700, Roland Dreier wrote: > > > Yaron> Not every one wants to keep on doing target discovery >with > > > Yaron> Python scripts, > > > > > > Come on, this is just a stupid statement. The whole point of >putting > > > device management in userspace is so that everybody has the > > > flexibility to use whatever discovery mechanism they want. > > > > And just FYI. If you ever want an iSER implementation merged it will > > have to work the same way. Look at how the open-iscsi TCP initator >does > > it. 
> >Good point, the high-level functionality in iSER >is all done in Open-iSCSI and its userspace extensions >iSER just deals with the data transfer and is layered under Open-iSCSI > >by the way can you point me to the iSCSI HBA that delivers better >performance, latency, and memory consumption >and what about the price of that HBA and the attached 10GbE switch Is any of this really relevant? The focus here is open source and creating an RDMA infrastructure for ULPs to use. The market will decide whether a given technology survives or not. It isn't up to the open source community. Please take personal opinions on whether a technology will succeed elsewhere. Mike From gshipman at lanl.gov Fri Aug 19 07:22:09 2005 From: gshipman at lanl.gov (Galen Shipman) Date: Fri, 19 Aug 2005 08:22:09 -0600 Subject: [openib-general] RDMA Read performance - svn diff In-Reply-To: <20050818222606.GM15077@esmail.cup.hp.com> References: <20050818222606.GM15077@esmail.cup.hp.com> Message-ID: Hello, I am seeing BW of ~192 MB/sec using RDMA Read as opposed to ~950 MB/sec using RDMA Write. Any ideas?
> Normally patches should be attached as "plain text" Here is the svn diff to gen2/src/userspace/perftest/rdma_bw.c to make it use RDMA Read instead or RDMA Write: Index: rdma_bw.c =================================================================== --- rdma_bw.c (revision 3134) +++ rdma_bw.c (working copy) @@ -299,7 +299,7 @@ } ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size * 2, - IBV_ACCESS_REMOTE_WRITE); + IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ); if (!ctx->mr) { fprintf(stderr, "Couldn't allocate MR\n"); return NULL; @@ -340,7 +340,7 @@ attr.qp_state = IBV_QPS_INIT; attr.pkey_index = 0; attr.port_num = port; - attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE; + attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | @@ -664,7 +664,7 @@ ctx->wr.wr_id = PINGPONG_RDMA_WRID; ctx->wr.sg_list = &ctx->list; ctx->wr.num_sge = 1; - ctx->wr.opcode = IBV_WR_RDMA_WRITE; + ctx->wr.opcode = IBV_WR_RDMA_READ; ctx->wr.send_flags = IBV_SEND_SIGNALED; ctx->wr.next = NULL; From mshefty at ichips.intel.com Fri Aug 19 09:30:09 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Aug 2005 09:30:09 -0700 Subject: [openib-general] RDMA Read performance In-Reply-To: <34441d4cc93fdf624a1dd42263b89429@lanl.gov> References: <20050818222606.GM15077@esmail.cup.hp.com> <34441d4cc93fdf624a1dd42263b89429@lanl.gov> Message-ID: <43060911.3030904@ichips.intel.com> Galen Shipman wrote: > Grant, > >> Could you run the ibv_pingpong test and report what that gets? > > > Using: ibv_rc_pingpong --size=1048576 I am seeing 942 Mbytes per sec. > > As I said previously, we have an internal application that sees ~950 > MBytes per sec using RDMA Write. It looks like ibv_rc_pingpong is using > send receive and not RDMA. Perhaps someone (Roland) has an RDMA Read > test they can point me to? I'm not aware of an RDMA read test. 
Try increasing max_rd_atomic and max_dest_rd_atomic to the maximum values supported by the HCA. - Sean From ardavis at ichips.intel.com Fri Aug 19 10:59:14 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 19 Aug 2005 10:59:14 -0700 Subject: [openib-general] Re: uverbs comp events In-Reply-To: <52iry3ex3k.fsf@cisco.com> References: <4303CBE2.2010009@ichips.intel.com> <52mzngdsvr.fsf@cisco.com> <4303DE2C.5070809@ichips.intel.com> <52iry3ex3k.fsf@cisco.com> Message-ID: <43061DF2.8020504@ichips.intel.com> Roland Dreier wrote: > Arlin> the options to wake up this blocking cq_fd and thread are: > Arlin> 1. signal the thread with pthread_kill , pthread_cancel > Arlin> 2. poll cq_fd with timeout, wakeup periodically and check > Arlin> for termination. 3. ibv_close_device () to force interrupt > Arlin> on the polling cq_fd (problem I reported above) 4. add new > Arlin> generate event call from verbs. (IB gen1 direct CQ objects > Arlin> supported this model) > > Arlin> In my opinion, option 4 is the best option unless someone > Arlin> can show me a better way to signal the cq_fd from > Arlin> userspace. Did I miss some options? > >I think you can just create a pair of FDs with pipe() and then sleep >on both the cq_fd and your pipe fd using poll(). Then when you want >to wake up the thread just write something into the other pipe fd. > > Yes, this is certainly another option; albeit one that requires more system resources. Why not take full advantage of the FD resource we already have? It's your call, but uDAPL and other multi-thread applications could make good use of a wakeup feature with these event interfaces. An event model that allows users to create events and get events but requires them to use side band mechanisms to trigger the event seems incomplete to me. -arlin > - R. 
> > > From mshefty at ichips.intel.com Fri Aug 19 11:07:47 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Aug 2005 11:07:47 -0700 Subject: [openib-general] Re: uverbs comp events In-Reply-To: <43061DF2.8020504@ichips.intel.com> References: <4303CBE2.2010009@ichips.intel.com> <52mzngdsvr.fsf@cisco.com> <4303DE2C.5070809@ichips.intel.com> <52iry3ex3k.fsf@cisco.com> <43061DF2.8020504@ichips.intel.com> Message-ID: <43061FF3.2030304@ichips.intel.com> Arlin Davis wrote: > Yes, this is certainly another option; albeit one that requires more > system resources. Why not take full advantage of the FD resource we > already have? It's your call, but uDAPL and other multi-thread > applications could make good use of a wakeup feature with these event > interfaces. An event model that allows users to create events and get > events but requires them to use side band mechanisms to trigger the > event seems incomplete to me. I'm leaning more towards Arlin's thinking on this, but whatever is decided, I think that uCM and uAT should match. - Sean From rolandd at cisco.com Fri Aug 19 11:11:12 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 19 Aug 2005 11:11:12 -0700 Subject: [openib-general] Re: uverbs comp events In-Reply-To: <43061DF2.8020504@ichips.intel.com> (Arlin Davis's message of "Fri, 19 Aug 2005 10:59:14 -0700") References: <4303CBE2.2010009@ichips.intel.com> <52mzngdsvr.fsf@cisco.com> <4303DE2C.5070809@ichips.intel.com> <52iry3ex3k.fsf@cisco.com> <43061DF2.8020504@ichips.intel.com> Message-ID: <52mznd7q67.fsf@cisco.com> Arlin> Yes, this is certainly another option; albeit one that Arlin> requires more system resources. Why not take full advantage Arlin> of the FD resource we already have? It's your call, but Arlin> uDAPL and other multi-thread applications could make good Arlin> use of a wakeup feature with these event interfaces. 
An Arlin> event model that allows users to create events and get Arlin> events but requires them to use side band mechanisms to Arlin> trigger the event seems incomplete to me. I disagree. Right now the CQ FD is a pretty clean concept: you read CQ events out of it. If you want to trigger a CQ event, then you could post a work request to a QP that generates a completion event. Adding a new system call for queuing synthetic events seems like growing an ugly wart to me. If we look at the analogous design of a multi-threaded network server, where a thread might block waiting for input on a socket, we see that there's no system call to inject synthetic data into a network socket. I'd rather fix the uDAPL design instead of adding ugliness to the kernel to work around it. - R. From rolandd at cisco.com Fri Aug 19 12:09:31 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 19 Aug 2005 12:09:31 -0700 Subject: [openib-general] Re: uverbs comp events In-Reply-To: <43061FF3.2030304@ichips.intel.com> (Sean Hefty's message of "Fri, 19 Aug 2005 11:07:47 -0700") References: <4303CBE2.2010009@ichips.intel.com> <52mzngdsvr.fsf@cisco.com> <4303DE2C.5070809@ichips.intel.com> <52iry3ex3k.fsf@cisco.com> <43061DF2.8020504@ichips.intel.com> <43061FF3.2030304@ichips.intel.com> Message-ID: <52slx568wk.fsf@cisco.com> Sean> I'm leaning more towards Arlin's thinking on this, but Sean> whatever is decided, I think that uCM and uAT should match. Doesn't this seem like a good argument against adding support for synthetic events? We already have a perfectly good mechanism for generating fd events (namely pipe() + write()), so why do we want to add three more mechanisms for generating synthetic events? - R. 
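The pipe() + poll() wakeup suggested in this thread can be sketched roughly as follows. This is a minimal illustration, not code from libibverbs or uDAPL: the names wait_for_cq_or_wakeup(), wake_waiter() and the WAIT_* values are hypothetical, and cq_fd is assumed to be whatever file descriptor delivers CQ events (in a standalone test it can be simulated with an ordinary pipe, so no HCA is needed).

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
enum wait_result { WAIT_CQ_EVENT, WAIT_WAKEUP, WAIT_ERROR };

/*
 * Block until either the CQ event fd or the read end of the wakeup
 * pipe becomes readable. cq_fd is an assumption: any readable fd works.
 */
static enum wait_result wait_for_cq_or_wakeup(int cq_fd, int wakeup_rd_fd)
{
    struct pollfd fds[2] = {
        { .fd = cq_fd,        .events = POLLIN },
        { .fd = wakeup_rd_fd, .events = POLLIN },
    };

    assert(cq_fd >= 0 && wakeup_rd_fd >= 0);

    if (poll(fds, 2, -1) < 0)
        return WAIT_ERROR;

    /* Check the wakeup pipe first so a shutdown request wins a tie. */
    if (fds[1].revents & POLLIN) {
        char c;
        if (read(wakeup_rd_fd, &c, 1) != 1)
            return WAIT_ERROR;
        return WAIT_WAKEUP;
    }
    if (fds[0].revents & POLLIN)
        return WAIT_CQ_EVENT;   /* caller now reads the CQ event itself */
    return WAIT_ERROR;
}

/* Called from another thread to release the blocked waiter. */
static int wake_waiter(int wakeup_wr_fd)
{
    char c = 'x';
    return write(wakeup_wr_fd, &c, 1) == 1 ? 0 : -1;
}
```

The cost Arlin mentions is visible here: one extra pipe (two descriptors) per waiting thread. The benefit is that no new kernel interface is needed, since the wakeup is an ordinary write() on a descriptor the application owns.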
From Thomas.Talpey at netapp.com Fri Aug 19 12:38:36 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 19 Aug 2005 15:38:36 -0400 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <20050818234120.GN15077@esmail.cup.hp.com> References: <35EA21F54A45CB47B879F21A91F4862F713C99@taurus.voltaire.com> <20050818234120.GN15077@esmail.cup.hp.com> Message-ID: <6.2.3.4.2.20050819110714.08f0bcd0@exnane01.nane.netapp.com> At 07:41 PM 8/18/2005, Grant Grundler wrote: >If kDAPL for any reason doesn't get pushed upstream to kernel.org, >we effectively don't have iSER or NFS/RDMA in linux. >Since I think without them, linux won't be competitive in the >commercial market place. Put another way, OpenIB want storage to use it, and vice versa. I can speak for NFS/RDMA. If NFS/RDMA doesn't have kDAPL, then it gets thrown backwards due to having to reimplement. That's recoverable (sigh) but there are still missing pieces. By far the largest is the connection and addressing models. There is, as yet, no unified means for an upper layer to connect over any other transport in the OpenIB framework. In fact, there isn't even a way to use IP addressing on the OpenIB framework now, which is an even more fundamental issue. So, yes, without kDAPL at the moment we don't have iSER or NFS/RDMA. We can recode the message handling pieces to OpenIB verbs. For NFS/RDMA, that's not even a ton of work. Then we'll be forced to reimplement or reuse pretty much all of the connect and listen code, and the IP address translation, atop OpenIB. How quickly can OpenIB move to a transport model that supports these missing pieces? I can give a different answer with that information. Tom. 
From mshefty at ichips.intel.com Fri Aug 19 12:39:50 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Aug 2005 12:39:50 -0700 Subject: [openib-general] Re: uverbs comp events In-Reply-To: <52slx568wk.fsf@cisco.com> References: <4303CBE2.2010009@ichips.intel.com> <52mzngdsvr.fsf@cisco.com> <4303DE2C.5070809@ichips.intel.com> <52iry3ex3k.fsf@cisco.com> <43061DF2.8020504@ichips.intel.com> <43061FF3.2030304@ichips.intel.com> <52slx568wk.fsf@cisco.com> Message-ID: <43063586.1030704@ichips.intel.com> Roland Dreier wrote: > Sean> I'm leaning more towards Arlin's thinking on this, but > Sean> whatever is decided, I think that uCM and uAT should match. > > Doesn't this seem like a good argument against adding support for > synthetic events? We already have a perfectly good mechanism for > generating fd events (namely pipe() + write()), so why do we want to > add three more mechanisms for generating synthetic events? I think that the issue that Arlin is hitting is that once he calls blah_get_event() he doesn't have an easy way to release the thread. This may turn out to be a uCM / uAT issue, and not a verbs issue. Verbs has calls to open and close the device that in turn open and close the fd. It sounds like when he closes the device, his thread doesn't return, but he'll need to confirm this. For uCM and uAT, the fd's are created during initialization, and I don't see where they're ever explicitly closed. 
- Sean From mshefty at ichips.intel.com Fri Aug 19 12:43:58 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Aug 2005 12:43:58 -0700 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <6.2.3.4.2.20050819110714.08f0bcd0@exnane01.nane.netapp.com> References: <35EA21F54A45CB47B879F21A91F4862F713C99@taurus.voltaire.com> <20050818234120.GN15077@esmail.cup.hp.com> <6.2.3.4.2.20050819110714.08f0bcd0@exnane01.nane.netapp.com> Message-ID: <4306367E.6000809@ichips.intel.com> Talpey, Thomas wrote: > By far the largest is the connection and addressing models. > There is, as yet, no unified means for an upper layer to connect > over any other transport in the OpenIB framework. In fact, there > isn't even a way to use IP addressing on the OpenIB framework > now, which is an even more fundamental issue. The address translation (AT) service in the trunk permits using IP addresses to obtain connection information. - Sean From jbanks at lnxi.com Fri Aug 19 13:35:00 2005 From: jbanks at lnxi.com (Justin Banks) Date: Fri, 19 Aug 2005 14:35:00 -0600 Subject: [openib-general] [PATCH] allow buffers > 1GB in rdma_bw Message-ID: <1124483700.3372.13.camel@nuthead> - Allow use of buffers > 1GB in rdma_bw - Minor cosmetic fixes I couldn't resist, mostly line widths Sorry if my email client wraps long lines - I've been shoehorned into using evolution, and still haven't figured out how to make it do what I want it to do. 
Index: rdma_bw.c =================================================================== --- rdma_bw.c (revision 3137) +++ rdma_bw.c (working copy) @@ -108,7 +108,8 @@ n = getaddrinfo(servername, service, &hints, &res); if (n < 0) { - fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), servername, port); + fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), + servername, port); return n; } @@ -125,7 +126,8 @@ freeaddrinfo(res); if (sockfd < 0) { - fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port); + fprintf(stderr, "Couldn't connect to %s:%d\n", + servername, port); return sockfd; } return sockfd; @@ -195,7 +197,8 @@ if (sockfd >= 0) { n = 1; - setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &n, sizeof n); + setsockopt(sockfd, SOL_SOCKET, + SO_REUSEADDR, &n, sizeof n); if (!bind(sockfd, t->ai_addr, t->ai_addrlen)) break; @@ -224,7 +227,8 @@ return connfd; } -static struct pingpong_dest *pp_server_exch_dest(int connfd, const struct pingpong_dest *my_dest) +static struct pingpong_dest* +pp_server_exch_dest(int connfd, const struct pingpong_dest *my_dest) { char msg[sizeof "0000:000000:000000:00000000:0000000000000000"]; struct pingpong_dest *rem_dest = NULL; @@ -234,7 +238,8 @@ n = read(connfd, msg, sizeof msg); if (n != sizeof msg) { perror("server read"); - fprintf(stderr, "%d/%d: Couldn't read remote address\n", n, (int) sizeof msg); + fprintf(stderr, "%d/%d: Couldn't read remote address\n", + n, (int) sizeof msg); goto out; } @@ -265,7 +270,8 @@ return rem_dest; } -static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, +static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, + unsigned long size, int tx_depth, int port) { struct pingpong_context *ctx; @@ -407,20 +413,27 @@ static void usage(const char *argv0) { printf("Usage:\n"); - printf(" %s start a server and wait for connection\n", argv0); + printf(" %s start a server and wait for connection\n", + argv0); printf(" %s connect to server at \n", argv0); 
printf("\n"); printf("Options:\n"); - printf(" -p, --port= listen on/connect to port (default 18515)\n"); - printf(" -d, --ib-dev= use IB device (default first device found)\n"); - printf(" -i, --ib-port= use port of IB device (default 1)\n"); - printf(" -s, --size= size of message to exchange (default 4096)\n"); + printf(" -p, --port= listen on/connect to port " + " (default 18515)\n"); + printf(" -d, --ib-dev= use IB device (default " + "first device found)\n"); + printf(" -i, --ib-port= use port of IB device " + "(default 1)\n"); + printf(" -s, --size= size of message to exchange " + "(default 4096)\n"); printf(" -t, --tx-depth= size of tx queue (default 100)\n"); - printf(" -n, --iters= number of exchanges (at least 2, default 1000)\n"); - printf(" -b, --bidirectional measure bidirectional bandwidth (default unidirectional)\n"); + printf(" -n, --iters= number of exchanges " + "(at least 2, default 1000)\n"); + printf(" -b, --bidirectional measure bidirectional bandwidth " + "(default unidirectional)\n"); } -static void print_report(unsigned int iters, int size, int duplex, +static void print_report(unsigned int iters, unsigned long size, int duplex, cycles_t *tposted, cycles_t *tcompleted) { double cycles_to_units; @@ -453,7 +466,8 @@ opt_posted, opt_completed, tsize * cycles_to_units / opt_delta / 1024); printf("Bandwidth average: %g MB/sec\n", - tsize * iters * cycles_to_units / (tcompleted[iters - 1] - tposted[0]) / 1024); + tsize * iters * cycles_to_units / + (tcompleted[iters-1] - tposted[0]) / 1024); printf("Service Demand peak (#%d to #%d): %ld cycles/KB\n", opt_posted, opt_completed, opt_delta/tsize); @@ -473,7 +487,7 @@ char *servername = NULL; int port = 18515; int ib_port = 1; - int size = 4096; + unsigned long size = 4096; int tx_depth = 100; int iters = 1000; int scnt, ccnt; @@ -525,7 +539,7 @@ break; case 's': - size = strtol(optarg, NULL, 0); + size = strtoul(optarg, NULL, 0); if (size < 1) { usage(argv[0]); return 1; } break; -justinb From 
ardavis at ichips.intel.com Fri Aug 19 13:48:56 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 19 Aug 2005 13:48:56 -0700 Subject: [openib-general] uAT issues after SM node bounced Message-ID: <430645B8.3090502@ichips.intel.com> Hal, Sean and I are seeing some issues with uAT when our dedicated SM node bounces. Sean saw a kernel oops (I will let him send output) and I see the following console message with my failing ib_at_ips_by_gid requests : ib_at: ib_dev_ats_op: dev (ffffffff880775c0) ib0 already has pending op 21 Here are my 5 retries from the ib_at_ips_by_gid() call: open_hca: GID subnet fe80000000000000 id 0002c9020000409d get_hca_addr: ips_by_gid ret 0 at_rec 0x7fffffff9630 -> id 31788 ip_comp_handler: at_rec 0x7fffffff9630 ->id 31788 id 31788 rec_num -22 30803000 ip_comp_handler: resolution err -22 retry 1 ip_comp_handler: NEW ips_by_gid ret 0 at_rec 0x7fffffff9630 -> id 31789 at_thread: callback woke ip_comp_handler: at_rec 0x7fffffff9630 ->id 31789 id 31789 rec_num -22 0 ip_comp_handler: resolution err -22 retry 2 ip_comp_handler: NEW ips_by_gid ret 0 at_rec 0x7fffffff9630 -> id 31790 at_thread: callback woke ip_comp_handler: at_rec 0x7fffffff9630 ->id 31790 id 31790 rec_num -22 0 ip_comp_handler: resolution err -22 retry 3 ip_comp_handler: NEW ips_by_gid ret 0 at_rec 0x7fffffff9630 -> id 31791 at_thread: callback woke ip_comp_handler: at_rec 0x7fffffff9630 ->id 31791 id 31791 rec_num -22 0 ip_comp_handler: resolution err -22 retry 4 ip_comp_handler: ERR: at_rec 0x7fffffff9630, req_id 31791 rec_num -22 at_thread: callback woke open_hca: IB get ADDR failed for mthca0 Sometimes if I bounce the IPoIB device (ifconfig down/up) it starts working, other times it does not. 
-arlin From swise at ammasso.com Fri Aug 19 13:49:08 2005 From: swise at ammasso.com (Steve Wise) Date: Fri, 19 Aug 2005 15:49:08 -0500 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <4306367E.6000809@ichips.intel.com> Message-ID: Tom Tucker and I have talked about this, and agree that we could create a set of agnostic services like ib_connect_qp(), ib_create_psp(), ib_accept_cr(), ib_reject_cr(), and the associated connection events similar to kdapl. These services would then use the AT service and core IB cm services for IB devices -and- plug into our up-and-coming iwarp-specific connection services for IWARP devices. Thus a common connection manager. However, we need to get the iwarp-specific connection services code up, reviewed by all, and running before continuing on to the higher-level services. Once we post our iwarp-specific services to the iwarp branch, y'all can chew on that and think about these higher-level services. Thoughts? > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hefty > Sent: Friday, August 19, 2005 2:44 PM > To: Talpey, Thomas > Cc: open-iscsi at googlegroups.com; Christoph Hellwig; > openib-general at openib.org > Subject: Re: [openib-general] [ANNOUNCE] Initial trunk > checkin of ISERinitiator > > Talpey, Thomas wrote: > > By far the largest is the connection and addressing models. > > There is, as yet, no unified means for an upper layer to connect > > over any other transport in the OpenIB framework. In fact, there > > isn't even a way to use IP addressing on the OpenIB framework > > now, which is an even more fundamental issue. > > The address translation (AT) service in the trunk permits using IP > addresses to obtain connection information. 
> > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rolandd at cisco.com Fri Aug 19 13:56:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 19 Aug 2005 13:56:42 -0700 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: (Steve Wise's message of "Fri, 19 Aug 2005 15:49:08 -0500") References: Message-ID: <52fyt563xx.fsf@cisco.com> Steve> Tom Tucker and I have talked about this, and agree that we Steve> could to create a set of agnostic services like Steve> ib_connect_qp(), ib_create_psp(), ib_accept_cr(), Steve> ib_reject_cr(), and the associated connection events Steve> similar to kdapl. These service would then use the AT Steve> service and core IB cm services for IB devices -and- plug Steve> into our up and coming iwarp-specific connection services Steve> for IWARP devices. Thus a common connection manager. I think this is exactly the right plan. Maybe the names could be a little better -- no need to use odd terminology like "psp." But other than that sort of minor quibble, this is the same as my vague ideas. Also on the IB side the AT code probably needs to be reviewed and improved. The API should be simpler, and I don't like the way AT sticks its tentacles into the IPoIB driver and network stack. Christoph has suggested doing address translation in userspace and working exclusively with native addresses in the kernel. In general this seems like a good plan to me, although it's not clear to me how to handle cases like the NFS/RDMA server, which wants an IP address to match against its exports file. - R. 
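The "common connection manager" idea discussed in this thread can be illustrated with a minimal sketch. This is not code from the OpenIB tree: every name here (conn_ops, conn_id, conn_init, conn_connect and the two stub back ends) is hypothetical, and the real IB CM/AT and iWARP connection services are replaced by stubs that only record which transport path was taken. It assumes nothing more than a per-transport table of function pointers that a thin, transport-agnostic layer dispatches to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical transport selector; not taken from any OpenIB header. */
enum conn_transport { CONN_TRANSPORT_IB, CONN_TRANSPORT_IWARP };

struct conn_id;

/* Per-transport operations that the common layer dispatches to. */
struct conn_ops {
	int (*connect)(struct conn_id *id, const char *dst_ip);
	int (*accept)(struct conn_id *id);
	int (*reject)(struct conn_id *id);
};

/* Transport-agnostic connection identifier handed to upper layers. */
struct conn_id {
	const struct conn_ops *ops;
	void *context;            /* caller-supplied cookie */
	char last_action[32];     /* stub bookkeeping, for illustration only */
};

/* Stub helper: remember which back end handled the last request. */
static int record(struct conn_id *id, const char *what)
{
	assert(id != NULL);
	snprintf(id->last_action, sizeof id->last_action, "%s", what);
	return 0;
}

/* Stub IB back end; a real one would use AT plus the core IB CM. */
static int ib_connect(struct conn_id *id, const char *dst_ip)
{
	(void) dst_ip;
	return record(id, "ib:connect");
}
static int ib_accept(struct conn_id *id) { return record(id, "ib:accept"); }
static int ib_reject(struct conn_id *id) { return record(id, "ib:reject"); }

/* Stub iWARP back end; a real one would call iWARP-specific services. */
static int iwarp_connect(struct conn_id *id, const char *dst_ip)
{
	(void) dst_ip;
	return record(id, "iwarp:connect");
}
static int iwarp_accept(struct conn_id *id) { return record(id, "iwarp:accept"); }
static int iwarp_reject(struct conn_id *id) { return record(id, "iwarp:reject"); }

static const struct conn_ops ib_ops = { ib_connect, ib_accept, ib_reject };
static const struct conn_ops iwarp_ops = { iwarp_connect, iwarp_accept, iwarp_reject };

/* Bind an identifier to a transport; returns 0 on success, -1 on error. */
int conn_init(struct conn_id *id, enum conn_transport transport, void *context)
{
	if (!id)
		return -1;

	switch (transport) {
	case CONN_TRANSPORT_IB:
		id->ops = &ib_ops;
		break;
	case CONN_TRANSPORT_IWARP:
		id->ops = &iwarp_ops;
		break;
	default:
		return -1;
	}
	id->context = context;
	id->last_action[0] = '\0';
	return 0;
}

/* Upper layers call these without knowing which transport is underneath. */
int conn_connect(struct conn_id *id, const char *dst_ip)
{
	if (!id || !id->ops || !dst_ip)
		return -1;
	return id->ops->connect(id, dst_ip);
}

int conn_accept(struct conn_id *id)
{
	if (!id || !id->ops)
		return -1;
	return id->ops->accept(id);
}

int conn_reject(struct conn_id *id)
{
	if (!id || !id->ops)
		return -1;
	return id->ops->reject(id);
}
```

In this sketch an upper-layer protocol would call conn_init() once per connection and then use conn_connect(), conn_accept() and conn_reject() identically for either transport. Whether address translation belongs inside the IB back end or in userspace, as suggested above, is deliberately left out.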
From mshefty at ichips.intel.com Fri Aug 19 14:12:11 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Aug 2005 14:12:11 -0700 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <52fyt563xx.fsf@cisco.com> References: <52fyt563xx.fsf@cisco.com> Message-ID: <43064B2B.2000403@ichips.intel.com> Roland Dreier wrote: > I think this is exactly the right plan. Maybe the names could be a > little better -- no need to use odd terminology like "psp." But other > than that sort of minor quibble, this is the same as my vague ideas. It is my hope to help with the implementation of some sort of CM abstraction. > Also on the IB side the AT code probably needs to be reviewed and > improved. The API should be simpler, and I don't like the way AT > sticks its tentacles into the IPoIB driver and network stack. I agree with the comments on the API, but haven't looked at the implementation in enough detail to say anything about its operation. On a side note, AT appears to be the only way to connect from userspace without hard-coding values or interfacing directly with the MAD layer itself. Do we want to expose SA query to userspace or rely on AT? I'm leaning towards the latter at the moment. - Sean From sean.hefty at intel.com Fri Aug 19 14:36:44 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 19 Aug 2005 14:36:44 -0700 Subject: [openib-general] uAT issues after SM node bounced In-Reply-To: <430645B8.3090502@ichips.intel.com> Message-ID: >Sean and I are seeing some issues with uAT when our dedicated SM node >bounces. Sean saw a kernel oops (I will let him send output) and I see >the following console message with my failing ib_at_ips_by_gid requests : Here's the bug check that I saw. I haven't spent any time debugging this yet. 
- Sean Aug 19 11:49:49 mshefty-linux1 kernel: ib_at: ib_dev_ats_op: dev (f8e68f40) ib0 already has pending op 2 Aug 19 11:50:00 mshefty-linux1 kernel: ib0: no IPv6 routers present Aug 19 11:53:49 mshefty-linux1 kernel: Debug: sleeping function called from invalid context at mm/rmap.c:86 Aug 19 11:53:49 mshefty-linux1 kernel: in_atomic():0, irqs_disabled():1 Aug 19 11:53:49 mshefty-linux1 kernel: [dump_stack+21/32] dump_stack+0x15/0x20 Aug 19 11:53:49 mshefty-linux1 kernel: [] dump_stack+0x15/0x20 Aug 19 11:53:49 mshefty-linux1 kernel: [__might_sleep+150/176] __might_sleep+0x96/0xb0 Aug 19 11:53:49 mshefty-linux1 kernel: [] __might_sleep+0x96/0xb0 Aug 19 11:53:49 mshefty-linux1 kernel: [anon_vma_prepare+29/224] anon_vma_prepare+0x1d/0xe0 Aug 19 11:53:49 mshefty-linux1 kernel: [] anon_vma_prepare+0x1d/0xe0 Aug 19 11:53:49 mshefty-linux1 kernel: [expand_stack+16/128] expand_stack+0x10/0x80 Aug 19 11:53:49 mshefty-linux1 kernel: [] expand_stack+0x10/0x80 Aug 19 11:53:49 mshefty-linux1 kernel: [do_page_fault+373/1648] do_page_fault+0x175/0x670 Aug 19 11:53:49 mshefty-linux1 kernel: [] do_page_fault+0x175/0x670 Aug 19 11:53:49 mshefty-linux1 kernel: [error_code+79/96] error_code+0x4f/0x60 Aug 19 11:53:49 mshefty-linux1 kernel: [] error_code+0x4f/0x60 Aug 19 11:53:49 mshefty-linux1 kernel: [pg0+949458380/1069249536] ib_get_client_data+0x1c/0x60 [ib_core] Aug 19 11:53:49 mshefty-linux1 kernel: [] ib_get_client_data+0x1c/0x60 [ib_core] Aug 19 11:53:49 mshefty-linux1 kernel: [pg0+949630151/1069249536] ib_sa_path_rec_get+0x27/0x170 [ib_sa] Aug 19 11:53:49 mshefty-linux1 kernel: [] ib_sa_path_rec_get+0x27/0x170 [ib_sa] Aug 19 11:53:49 mshefty-linux1 kernel: [pg0+950134476/1069249536] resolve_path+0x4c/0x100 [ib_at] Aug 19 11:53:49 mshefty-linux1 kernel: [] resolve_path+0x4c/0x100 [ib_at] Aug 19 11:53:49 mshefty-linux1 kernel: [pg0+950135658/1069249536] ib_at_paths_by_route+0xba/0xf0 [ib_at] Aug 19 11:53:49 mshefty-linux1 kernel: [] ib_at_paths_by_route+0xba/0xf0 [ib_at] Aug 19 
11:53:49 mshefty-linux1 kernel: [pg0+949733048/1069249536] ib_uat_paths_by_route+0xf8/0x1e0 [ib_uat] Aug 19 11:53:49 mshefty-linux1 kernel: [] ib_uat_paths_by_route+0xf8/0x1e0 [ib_uat] Aug 19 11:53:49 mshefty-linux1 kernel: [pg0+949734940/1069249536] ib_uat_write+0x9c/0xb0 [ib_uat] Aug 19 11:53:49 mshefty-linux1 kernel: [] ib_uat_write+0x9c/0xb0 [ib_uat] Aug 19 11:53:49 mshefty-linux1 kernel: [vfs_write+176/272] vfs_write+0xb0/0x110 Aug 19 11:53:49 mshefty-linux1 kernel: [] vfs_write+0xb0/0x110 Aug 19 11:53:49 mshefty-linux1 kernel: [sys_write+59/112] sys_write+0x3b/0x70 Aug 19 11:53:49 mshefty-linux1 kernel: [] sys_write+0x3b/0x70 Aug 19 11:53:49 mshefty-linux1 kernel: [sysenter_past_esp+84/121] sysenter_past_esp+0x54/0x79 Aug 19 11:53:49 mshefty-linux1 kernel: [] sysenter_past_esp+0x54/0x79 Aug 19 11:53:49 mshefty-linux1 kernel: eip: f8dc05cc Aug 19 11:53:49 mshefty-linux1 kernel: ------------[ cut here ]------------ Aug 19 11:53:49 mshefty-linux1 kernel: kernel BUG at include/asm/spinlock.h:149! 
Aug 19 11:53:49 mshefty-linux1 kernel: invalid operand: 0000 [#1] Aug 19 11:53:49 mshefty-linux1 kernel: SMP Aug 19 11:53:49 mshefty-linux1 kernel: Modules linked in: ib_madeye ib_ipoib ib_uat ib_at ib_ucm ib_cm ib_sa ib_uverbs ib_mthca ib_mad ib_core edd joydev st sr_mod ide_cd cdrom nvram usbserial parport_pc lp parport thermal processor fan button battery ac ipv6 af_packet e1000 i2c_i801 i2c_core hw_random uhci_hcd usbcore evdev reiserfs aic7xxx scsi_transport_spi sd_mod scsi_mod Aug 19 11:53:49 mshefty-linux1 kernel: CPU: 0 Aug 19 11:53:49 mshefty-linux1 kernel: EIP: 0060:[_spin_lock_irqsave+68/80] Not tainted VLI Aug 19 11:53:49 mshefty-linux1 kernel: EIP: 0060:[] Not tainted VLI Aug 19 11:53:49 mshefty-linux1 kernel: EFLAGS: 00010046 (2.6.12.1) Aug 19 11:53:49 mshefty-linux1 kernel: EIP is at _spin_lock_irqsave+0x44/0x50 Aug 19 11:53:49 mshefty-linux1 kernel: eax: c031f296 ebx: 00000286 ecx: c035a344 edx: f8dc05cc Aug 19 11:53:49 mshefty-linux1 kernel: esi: 5a5a5abe edi: 5a5a5abe ebp: f2f55e10 esp: f2f55e08 Aug 19 11:53:49 mshefty-linux1 kernel: ds: 007b es: 007b ss: 0068 Aug 19 11:53:49 mshefty-linux1 kernel: Process lt-ucmpost (pid: 8350, threadinfo=f2f54000 task=f7762a60) Aug 19 11:53:49 mshefty-linux1 kernel: Stack: 5a5a5a5a f8decf6c f2f55e28 f8dc05cc 00000000 0c30005a 000000d0 f2f55eb0 Aug 19 11:53:49 mshefty-linux1 kernel: f2f55e4c f8dea4c7 c0148b61 00000000 0c300000 f2f55e70 f20cdbfc f2f55e70 Aug 19 11:53:49 mshefty-linux1 kernel: f2f55eb0 f2f55ebc f8e656cc 00000000 0c300000 00000064 000000d0 f8e652c0 Aug 19 11:53:49 mshefty-linux1 kernel: Call Trace: Aug 19 11:53:49 mshefty-linux1 kernel: [show_stack+155/176] show_stack+0x9b/0xb0 Aug 19 11:53:49 mshefty-linux1 kernel: [] show_stack+0x9b/0xb0 Aug 19 11:53:49 mshefty-linux1 kernel: [show_registers+287/400] show_registers+0x11f/0x190 Aug 19 11:53:49 mshefty-linux1 kernel: [] show_registers+0x11f/0x190 Aug 19 11:53:49 mshefty-linux1 kernel: [die+227/352] die+0xe3/0x160 Aug 19 11:53:49 mshefty-linux1 
kernel: [] die+0xe3/0x160 Aug 19 11:53:49 mshefty-linux1 kernel: [do_invalid_op+149/160] do_invalid_op+0x95/0xa0 Aug 19 11:53:49 mshefty-linux1 kernel: [] do_invalid_op+0x95/0xa0 Aug 19 11:53:49 mshefty-linux1 kernel: [error_code+79/96] error_code+0x4f/0x60 Aug 19 11:53:49 mshefty-linux1 kernel: [] error_code+0x4f/0x60 Aug 19 11:53:49 mshefty-linux1 kernel: [pg0+949458380/1069249536] ib_get_client_data+0x1c/0x60 [ib_core] Aug 19 11:53:49 mshefty-linux1 kernel: [] ib_get_client_data+0x1c/0x60 [ib_core] Aug 19 11:53:49 mshefty-linux1 kernel: [pg0+949630151/1069249536] ib_sa_path_rec_get+0x27/0x170 [ib_sa] Aug 19 11:53:49 mshefty-linux1 kernel: [] ib_sa_path_rec_get+0x27/0x170 [ib_sa] Aug 19 11:53:49 mshefty-linux1 kernel: [pg0+950134476/1069249536] resolve_path+0x4c/0x100 [ib_at] Aug 19 11:53:49 mshefty-linux1 kernel: [] resolve_path+0x4c/0x100 [ib_at] Aug 19 11:53:49 mshefty-linux1 kernel: [pg0+950135658/1069249536] ib_at_paths_by_route+0xba/0xf0 [ib_at] Aug 19 11:53:49 mshefty-linux1 kernel: [] ib_at_paths_by_route+0xba/0xf0 [ib_at] Aug 19 11:53:49 mshefty-linux1 kernel: [pg0+949733048/1069249536] ib_uat_paths_by_route+0xf8/0x1e0 [ib_uat] Aug 19 11:53:49 mshefty-linux1 kernel: [] ib_uat_paths_by_route+0xf8/0x1e0 [ib_uat] Aug 19 11:53:49 mshefty-linux1 kernel: [pg0+949734940/1069249536] ib_uat_write+0x9c/0xb0 [ib_uat] Aug 19 11:53:49 mshefty-linux1 kernel: [] ib_uat_write+0x9c/0xb0 [ib_uat] Aug 19 11:53:49 mshefty-linux1 kernel: [vfs_write+176/272] vfs_write+0xb0/0x110 Aug 19 11:53:49 mshefty-linux1 kernel: [] vfs_write+0xb0/0x110 Aug 19 11:53:49 mshefty-linux1 kernel: [sys_write+59/112] sys_write+0x3b/0x70 Aug 19 11:53:49 mshefty-linux1 kernel: [] sys_write+0x3b/0x70 Aug 19 11:53:49 mshefty-linux1 kernel: [sysenter_past_esp+84/121] sysenter_past_esp+0x54/0x79 Aug 19 11:53:49 mshefty-linux1 kernel: [] sysenter_past_esp+0x54/0x79 Aug 19 11:53:49 mshefty-linux1 kernel: Code: c3 00 02 00 00 74 01 fb f3 90 80 3e 00 7e f9 fa eb e8 8d 65 f8 89 d8 5b 5e 5d c3 8b 4d 04 51 
68 96 f2 31 c0 e8 3e 87 e1 ff 58 5a <0f> 0b 95 00 8d ea 31 c0 eb c5 89 f6 55 89 e5 53 89 c3 fa 81 78 Aug 19 11:54:04 mshefty-linux1 kernel: <3>kfree_debugcheck: bad ptr f8a560f4h. Aug 19 11:54:04 mshefty-linux1 kernel: ------------[ cut here ]------------ Aug 19 11:54:04 mshefty-linux1 kernel: kernel BUG at mm/slab.c:1892! Aug 19 11:54:04 mshefty-linux1 kernel: invalid operand: 0000 [#2] Aug 19 11:54:04 mshefty-linux1 kernel: SMP Aug 19 11:54:04 mshefty-linux1 kernel: Modules linked in: ib_madeye ib_ipoib ib_uat ib_at ib_ucm ib_cm ib_sa ib_uverbs ib_mthca ib_mad ib_core edd joydev st sr_mod ide_cd cdrom nvram usbserial parport_pc lp parport thermal processor fan button battery ac ipv6 af_packet e1000 i2c_i801 i2c_core hw_random uhci_hcd usbcore evdev reiserfs aic7xxx scsi_transport_spi sd_mod scsi_mod Aug 19 11:54:04 mshefty-linux1 kernel: CPU: 1 Aug 19 11:54:04 mshefty-linux1 kernel: EIP: 0060:[kfree_debugcheck+107/128] Not tainted VLI Aug 19 11:54:04 mshefty-linux1 kernel: EIP: 0060:[] Not tainted VLI Aug 19 11:54:04 mshefty-linux1 kernel: EFLAGS: 00010006 (2.6.12.1) Aug 19 11:54:04 mshefty-linux1 kernel: EIP is at kfree_debugcheck+0x6b/0x80 Aug 19 11:54:04 mshefty-linux1 kernel: eax: 00000028 ebx: c1714ac0 ecx: c035a30c edx: 00000000 Aug 19 11:54:04 mshefty-linux1 kernel: esi: f8a560f4 edi: f20cdc50 ebp: f2557ef4 esp: f2557ee4 Aug 19 11:54:04 mshefty-linux1 kernel: ds: 007b es: 007b ss: 0068 Aug 19 11:54:04 mshefty-linux1 kernel: Process ib_at_wq/1 (pid: 8195, threadinfo=f2556000 task=f6a22a60) Aug 19 11:54:04 mshefty-linux1 kernel: Stack: c032162c f8a560f4 f8a560f4 43062abd f2557f0c c0149af7 00000282 f20c9f18 Aug 19 11:54:04 mshefty-linux1 kernel: 43062abd f20cdc50 f2557f2c f8e03372 00000002 00000000 00000000 f20cdc2c Aug 19 11:54:04 mshefty-linux1 kernel: f253f23c f20cdc50 f2557f3c f8e033f2 f20c9f18 ffffff92 f2557f4c f8e64af5 Aug 19 11:54:04 mshefty-linux1 kernel: Call Trace: Aug 19 11:54:04 mshefty-linux1 kernel: [show_stack+155/176] show_stack+0x9b/0xb0 Aug 
19 11:54:04 mshefty-linux1 kernel: [] show_stack+0x9b/0xb0 Aug 19 11:54:04 mshefty-linux1 kernel: [show_registers+287/400] show_registers+0x11f/0x190 Aug 19 11:54:04 mshefty-linux1 kernel: [] show_registers+0x11f/0x190 Aug 19 11:54:04 mshefty-linux1 kernel: [die+227/352] die+0xe3/0x160 Aug 19 11:54:04 mshefty-linux1 kernel: [] die+0xe3/0x160 Aug 19 11:54:04 mshefty-linux1 kernel: [do_invalid_op+149/160] do_invalid_op+0x95/0xa0 Aug 19 11:54:04 mshefty-linux1 kernel: [] do_invalid_op+0x95/0xa0 Aug 19 11:54:04 mshefty-linux1 kernel: [error_code+79/96] error_code+0x4f/0x60 Aug 19 11:54:04 mshefty-linux1 kernel: [] error_code+0x4f/0x60 Aug 19 11:54:04 mshefty-linux1 kernel: [kfree+23/144] kfree+0x17/0x90 Aug 19 11:54:04 mshefty-linux1 kernel: [] kfree+0x17/0x90 Aug 19 11:54:04 mshefty-linux1 kernel: [pg0+949732210/1069249536] ib_uat_callback+0x42/0x70 [ib_uat] Aug 19 11:54:04 mshefty-linux1 kernel: [] ib_uat_callback+0x42/0x70 [ib_uat] Aug 19 11:54:04 mshefty-linux1 kernel: [pg0+949732338/1069249536] ib_uat_path_callback+0x12/0x20 [ib_uat] Aug 19 11:54:04 mshefty-linux1 kernel: [] ib_uat_path_callback+0x12/0x20 [ib_uat] Aug 19 11:54:04 mshefty-linux1 kernel: [pg0+950131445/1069249536] req_comp_work+0x15/0x30 [ib_at] Aug 19 11:54:04 mshefty-linux1 kernel: [] req_comp_work+0x15/0x30 [ib_at] Aug 19 11:54:04 mshefty-linux1 kernel: [worker_thread+387/528] worker_thread+0x183/0x210 Aug 19 11:54:04 mshefty-linux1 kernel: [] worker_thread+0x183/0x210 Aug 19 11:54:04 mshefty-linux1 kernel: [kthread+139/192] kthread+0x8b/0xc0 Aug 19 11:54:04 mshefty-linux1 kernel: [] kthread+0x8b/0xc0 Aug 19 11:54:04 mshefty-linux1 kernel: [kernel_thread_helper+5/12] kernel_thread_helper+0x5/0xc Aug 19 11:54:04 mshefty-linux1 kernel: [] kernel_thread_helper+0x5/0xc Aug 19 11:54:04 mshefty-linux1 kernel: Code: 07 7d 0b 32 c0 58 5a c1 e3 05 a1 10 67 42 c0 01 c3 8b 03 a9 80 00 00 00 75 d1 8d b6 00 00 00 00 56 68 2c 16 32 c0 e8 95 6e fd ff <0f> 0b 64 07 7d 0b 32 c0 59 5b 8d 65 f8 5b 5e 5d c3 8d 74 26 
00 From sean.hefty at intel.com Fri Aug 19 15:30:40 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 19 Aug 2005 15:30:40 -0700 Subject: [openib-general] [PATCH] [uCM] user specified context in CM events + new test program Message-ID: The following patch: * Adds user specified context to all uCM events. Users will not retrieve any events associated with the context after destroying the corresponding cm_id. * Provides the ib_cm_init_qp_attr() call to userspace clients of the CM. This call may be used to set QP attributes properly before modifying the QP. * Fixes some error handling synchronization and cleanup issues. * Performs some minor code cleanup. * Replaces the ucm_simple test program with a userspace version of cmpost. The userspace version of cmpost uses the uAT interface to retrieve path records based on a remote host name, establishes a connection over a QP, and performs some simple message passing between the nodes. This patch bumps the ABI, and will require synchronization with uDAPL before committing. Signed-off-by: Sean Hefty Index: userspace/libibcm/include/infiniband/cm_abi.h =================================================================== --- userspace/libibcm/include/infiniband/cm_abi.h (revision 3124) +++ userspace/libibcm/include/infiniband/cm_abi.h (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -41,7 +42,7 @@ * drivers/infiniband/include/ib_user_cm.h */ -#define IB_USER_CM_ABI_VERSION 1 +#define IB_USER_CM_ABI_VERSION 2 enum { IB_USER_CM_CMD_CREATE_ID, @@ -64,6 +65,7 @@ enum { IB_USER_CM_CMD_SEND_SIDR_REP, IB_USER_CM_CMD_EVENT, + IB_USER_CM_CMD_INIT_QP_ATTR, }; /* * command ABI structures. 
@@ -75,6 +77,7 @@ struct cm_abi_cmd_hdr { }; struct cm_abi_create_id { + __u64 uid; __u64 response; }; @@ -83,9 +86,14 @@ struct cm_abi_create_id_resp { }; struct cm_abi_destroy_id { + __u64 response; __u32 id; }; +struct cm_abi_destroy_id_resp { + __u32 events_reported; +}; + struct cm_abi_attr_id { __u64 response; __u32 id; @@ -98,6 +106,64 @@ struct cm_abi_attr_id_resp { __u32 remote_id; }; +struct cm_abi_init_qp_attr { + __u64 response; + __u32 id; + __u32 qp_state; +}; + +struct cm_abi_ah_attr { + __u8 grh_dgid[16]; + __u32 grh_flow_label; + __u16 dlid; + __u16 reserved; + __u8 grh_sgid_index; + __u8 grh_hop_limit; + __u8 grh_traffic_class; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; +}; + +struct cm_abi_init_qp_attr_resp { + __u32 qp_attr_mask; + __u32 qp_state; + __u32 cur_qp_state; + __u32 path_mtu; + __u32 path_mig_state; + __u32 qkey; + __u32 rq_psn; + __u32 sq_psn; + __u32 dest_qp_num; + __u32 qp_access_flags; + + struct cm_abi_ah_attr ah_attr; + struct cm_abi_ah_attr alt_ah_attr; + + /* ibv_qp_cap */ + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; + + __u16 pkey_index; + __u16 alt_pkey_index; + __u8 en_sqd_async_notify; + __u8 sq_draining; + __u8 max_rd_atomic; + __u8 max_dest_rd_atomic; + __u8 min_rnr_timer; + __u8 port_num; + __u8 timeout; + __u8 retry_cnt; + __u8 rnr_retry; + __u8 alt_port_num; + __u8 alt_timeout; +}; + struct cm_abi_listen { __u64 service_id; __u64 service_mask; @@ -161,6 +227,7 @@ struct cm_abi_req { }; struct cm_abi_rep { + __u64 uid; __u64 data; __u32 id; __u32 qpn; @@ -236,7 +303,6 @@ struct cm_abi_event_get { }; struct cm_abi_req_event_resp { - __u32 listen_id; /* device */ /* port */ struct cm_abi_path_rec primary_path; @@ -291,7 +357,6 @@ struct cm_abi_apr_event_resp { }; struct cm_abi_sidr_req_event_resp { - __u32 listen_id; /* device */ /* port */ __u16 pkey; @@ -311,6 +376,7 @@ struct cm_abi_sidr_rep_event_resp { 
#define CM_ABI_PRES_ALTERNATE 0x08 struct cm_abi_event_resp { + __u64 uid; __u32 id; __u32 event; __u32 present; Index: userspace/libibcm/include/infiniband/cm.h =================================================================== --- userspace/libibcm/include/infiniband/cm.h (revision 3124) +++ userspace/libibcm/include/infiniband/cm.h (working copy) @@ -77,8 +77,13 @@ enum ib_cm_data_size { IB_CM_SIDR_REP_INFO_LENGTH = 72 }; +struct ib_cm_id { + void *context; + uint32_t handle; +}; + struct ib_cm_req_event_param { - uint32_t listen_id; + struct ib_cm_id *listen_id; struct ib_sa_path_rec *primary_path; struct ib_sa_path_rec *alternate_path; @@ -187,7 +192,7 @@ struct ib_cm_apr_event_param { }; struct ib_cm_sidr_req_event_param { - uint32_t listen_id; + struct ib_cm_id *listen_id; struct ib_device *device; uint8_t port; uint16_t pkey; @@ -212,7 +217,7 @@ struct ib_cm_sidr_rep_event_param { }; struct ib_cm_event { - uint32_t cm_id; + struct ib_cm_id *cm_id; enum ib_cm_event_type event; union { struct ib_cm_req_event_param req_rcvd; @@ -287,13 +292,13 @@ int ib_cm_get_fd(void); * Communication identifiers are used to track connection states, service * ID resolution requests, and listen requests. */ -int ib_cm_create_id(uint32_t *cm_id); +int ib_cm_create_id(struct ib_cm_id **cm_id, void *context); /** * ib_cm_destroy_id - Destroy a connection identifier. * @cm_id: Connection identifier to destroy. */ -int ib_cm_destroy_id(uint32_t cm_id); +int ib_cm_destroy_id(struct ib_cm_id *cm_id); struct ib_cm_attr_param { uint64_t service_id; @@ -309,7 +314,7 @@ struct ib_cm_attr_param { * * Not all parameters are valid during all connection states. */ -int ib_cm_attr_id(uint32_t cm_id, +int ib_cm_attr_id(struct ib_cm_id *cm_id, struct ib_cm_attr_param *param); /** @@ -323,7 +328,7 @@ int ib_cm_attr_id(uint32_t cm_id, * range of service IDs. If set to 0, the service ID is matched * exactly. 
*/ -int ib_cm_listen(uint32_t cm_id, +int ib_cm_listen(struct ib_cm_id *cm_id, uint64_t service_id, uint64_t service_mask); @@ -355,7 +360,7 @@ struct ib_cm_req_param { * @param: Connection request information needed to establish the * connection. */ -int ib_cm_send_req(uint32_t cm_id, +int ib_cm_send_req(struct ib_cm_id *cm_id, struct ib_cm_req_param *param); struct ib_cm_rep_param { @@ -380,7 +385,7 @@ struct ib_cm_rep_param { * @param: Connection reply information needed to establish the * connection. */ -int ib_cm_send_rep(uint32_t cm_id, +int ib_cm_send_rep(struct ib_cm_id *cm_id, struct ib_cm_rep_param *param); /** @@ -391,7 +396,7 @@ int ib_cm_send_rep(uint32_t cm_id, * ready to use message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_rtu(uint32_t cm_id, +int ib_cm_send_rtu(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len); @@ -404,7 +409,7 @@ int ib_cm_send_rtu(uint32_t cm_id, * disconnection request message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_dreq(uint32_t cm_id, +int ib_cm_send_dreq(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len); @@ -416,7 +421,7 @@ int ib_cm_send_dreq(uint32_t cm_id, * disconnection reply message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_drep(uint32_t cm_id, +int ib_cm_send_drep(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len); @@ -427,7 +432,7 @@ int ib_cm_send_drep(uint32_t cm_id, * This routine should be invoked by users who receive messages on a * connected QP before an RTU has been received. */ -int ib_cm_establish(uint32_t cm_id); +int ib_cm_establish(struct ib_cm_id *cm_id); /** * ib_cm_send_rej - Sends a connection rejection message to the @@ -441,7 +446,7 @@ int ib_cm_establish(uint32_t cm_id); * rejection message. * @private_data_len: Size of the private data buffer, in bytes. 
*/ -int ib_cm_send_rej(uint32_t cm_id, +int ib_cm_send_rej(struct ib_cm_id *cm_id, enum ib_cm_rej_reason reason, void *ari, uint8_t ari_length, @@ -458,7 +463,7 @@ int ib_cm_send_rej(uint32_t cm_id, * message receipt acknowledgement. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_mra(uint32_t cm_id, +int ib_cm_send_mra(struct ib_cm_id *cm_id, uint8_t service_timeout, void *private_data, uint8_t private_data_len); @@ -473,12 +478,32 @@ int ib_cm_send_mra(uint32_t cm_id, * load alternate path message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_lap(uint32_t cm_id, +int ib_cm_send_lap(struct ib_cm_id *cm_id, struct ib_sa_path_rec *alternate_path, void *private_data, uint8_t private_data_len); /** + * ib_cm_init_qp_attr - Initializes the QP attributes for use in transitioning + * to a specified QP state. + * @cm_id: Communication identifier associated with the QP attributes to + * initialize. + * @qp_attr: On input, specifies the desired QP state. On output, the + * mandatory and desired optional attributes will be set in order to + * modify the QP to the specified state. + * @qp_attr_mask: The QP attribute mask that may be used to transition the + * QP to the specified state. + * + * Users must set the @qp_attr->qp_state to the desired QP state. This call + * will set all required attributes for the given transition, along with + * known optional attributes. Users may override the attributes returned from + * this call before calling ib_modify_qp. + */ +int ib_cm_init_qp_attr(struct ib_cm_id *cm_id, + struct ibv_qp_attr *qp_attr, + int *qp_attr_mask); + +/** * ib_cm_send_apr - Sends an alternate path response message in response to * a load alternate path request. * @cm_id: Connection identifier associated with the alternate path response. @@ -490,7 +515,7 @@ int ib_cm_send_lap(uint32_t cm_id, * alternate path response message. * @private_data_len: Size of the private data buffer, in bytes. 
*/ -int ib_cm_send_apr(uint32_t cm_id, +int ib_cm_send_apr(struct ib_cm_id *cm_id, enum ib_cm_apr_status status, void *info, uint8_t info_length, @@ -514,7 +539,7 @@ struct ib_cm_sidr_req_param { * service ID resolution request. * @param: Service ID resolution request information. */ -int ib_cm_send_sidr_req(uint32_t cm_id, +int ib_cm_send_sidr_req(struct ib_cm_id *cm_id, struct ib_cm_sidr_req_param *param); struct ib_cm_sidr_rep_param { @@ -534,7 +559,7 @@ struct ib_cm_sidr_rep_param { * resolution request. * @param: Service ID resolution reply information. */ -int ib_cm_send_sidr_rep(uint32_t cm_id, +int ib_cm_send_sidr_rep(struct ib_cm_id *cm_id, struct ib_cm_sidr_rep_param *param); #endif /* CM_H */ Index: userspace/libibcm/AUTHORS =================================================================== --- userspace/libibcm/AUTHORS (revision 3124) +++ userspace/libibcm/AUTHORS (working copy) @@ -1 +1,2 @@ +Sean Hefty Libor Michalek Index: userspace/libibcm/src/cm.c =================================================================== --- userspace/libibcm/src/cm.c (revision 3124) +++ userspace/libibcm/src/cm.c (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -45,6 +46,7 @@ #include #include #include +#include #include #include @@ -69,7 +71,7 @@ do { resp = alloca(sizeof(*resp)); \ if (!resp) \ return -ENOMEM; \ - cmd->response = (unsigned long)resp;\ + cmd->response = (uintptr_t)resp;\ } while (0) #define CM_CREATE_MSG_CMD(msg, cmd, type, size) \ @@ -88,8 +90,18 @@ do { memset(cmd, 0, sizeof(*cmd)); \ } while (0) +struct cm_id_private { + struct ib_cm_id id; + int events_completed; + pthread_cond_t cond; + pthread_mutex_t mut; +}; + static int fd; +#define container_of(ptr, type, field) \ + ((type *) ((void *)ptr - offsetof(type, field))) + static void __attribute__((constructor)) ib_cm_init(void) { fd = open(IB_UCM_DEV_PATH, O_RDWR); @@ -127,46 +139,89 @@ static void cm_param_path_get(struct cm_ abi->preference = sa->preference; } -int ib_cm_create_id(uint32_t *cm_id) +static void ib_cm_free_id(struct cm_id_private *cm_id_priv) +{ + pthread_cond_destroy(&cm_id_priv->cond); + pthread_mutex_destroy(&cm_id_priv->mut); + free(cm_id_priv); +} + +static struct cm_id_private *ib_cm_alloc_id(void *context) +{ + struct cm_id_private *cm_id_priv; + + cm_id_priv = malloc(sizeof *cm_id_priv); + if (!cm_id_priv) + return NULL; + + memset(cm_id_priv, 0, sizeof *cm_id_priv); + cm_id_priv->id.context = context; + pthread_mutex_init(&cm_id_priv->mut, NULL); + if (pthread_cond_init(&cm_id_priv->cond, NULL)) + goto err; + + return cm_id_priv; + +err: ib_cm_free_id(cm_id_priv); + return NULL; +} + +int ib_cm_create_id(struct ib_cm_id **cm_id, void *context) { struct cm_abi_create_id_resp *resp; struct cm_abi_create_id *cmd; + struct cm_id_private *cm_id_priv; void *msg; int result; int size; - if (!cm_id) - return -EINVAL; + cm_id_priv = ib_cm_alloc_id(context); + if (!cm_id_priv) + return -ENOMEM; - CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_CREATE_ID, size); + CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_CREATE_ID, size); + cmd->uid = (uintptr_t) 
cm_id_priv; result = write(fd, msg, size); if (result != size) - return (result > 0) ? -ENODATA : result; + goto err; - *cm_id = resp->id; + cm_id_priv->id.handle = resp->id; + *cm_id = &cm_id_priv->id; return 0; + +err: ib_cm_free_id(cm_id_priv); + return result; } -int ib_cm_destroy_id(uint32_t cm_id) +int ib_cm_destroy_id(struct ib_cm_id *cm_id) { + struct cm_abi_destroy_id_resp *resp; struct cm_abi_destroy_id *cmd; + struct cm_id_private *cm_id_priv; void *msg; int result; int size; - CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_DESTROY_ID, size); - - cmd->id = cm_id; + CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_DESTROY_ID, size); + cmd->id = cm_id->handle; result = write(fd, msg, size); if (result != size) return (result > 0) ? -ENODATA : result; + cm_id_priv = container_of(cm_id, struct cm_id_private, id); + + pthread_mutex_lock(&cm_id_priv->mut); + while (cm_id_priv->events_completed < resp->events_reported) + pthread_cond_wait(&cm_id_priv->cond, &cm_id_priv->mut); + pthread_mutex_unlock(&cm_id_priv->mut); + + ib_cm_free_id(cm_id_priv); return 0; } -int ib_cm_attr_id(uint32_t cm_id, struct ib_cm_attr_param *param) +int ib_cm_attr_id(struct ib_cm_id *cm_id, struct ib_cm_attr_param *param) { struct cm_abi_attr_id_resp *resp; struct cm_abi_attr_id *cmd; @@ -177,9 +232,8 @@ int ib_cm_attr_id(uint32_t cm_id, struct if (!param) return -EINVAL; - CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_ATTR_ID, size); - - cmd->id = cm_id; + CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_ATTR_ID, size); + cmd->id = cm_id->handle; result = write(fd, msg, size); if (result != size) @@ -189,11 +243,91 @@ int ib_cm_attr_id(uint32_t cm_id, struct param->service_mask = resp->service_mask; param->local_id = resp->local_id; param->remote_id = resp->remote_id; + return 0; +} + +static void ib_cm_copy_ah_attr(struct ibv_ah_attr *dest_attr, + struct cm_abi_ah_attr *src_attr) +{ + memcpy(dest_attr->grh.dgid.raw, src_attr->grh_dgid, + sizeof dest_attr->grh.dgid); + 
dest_attr->grh.flow_label = src_attr->grh_flow_label; + dest_attr->grh.sgid_index = src_attr->grh_sgid_index; + dest_attr->grh.hop_limit = src_attr->grh_hop_limit; + dest_attr->grh.traffic_class = src_attr->grh_traffic_class; + + dest_attr->dlid = src_attr->dlid; + dest_attr->sl = src_attr->sl; + dest_attr->src_path_bits = src_attr->src_path_bits; + dest_attr->static_rate = src_attr->static_rate; + dest_attr->is_global = src_attr->is_global; + dest_attr->port_num = src_attr->port_num; +} + +static void ib_cm_copy_qp_attr(struct ibv_qp_attr *dest_attr, + struct cm_abi_init_qp_attr_resp *src_attr) +{ + dest_attr->cur_qp_state = src_attr->cur_qp_state; + dest_attr->path_mtu = src_attr->path_mtu; + dest_attr->path_mig_state = src_attr->path_mig_state; + dest_attr->qkey = src_attr->qkey; + dest_attr->rq_psn = src_attr->rq_psn; + dest_attr->sq_psn = src_attr->sq_psn; + dest_attr->dest_qp_num = src_attr->dest_qp_num; + dest_attr->qp_access_flags = src_attr->qp_access_flags; + + dest_attr->cap.max_send_wr = src_attr->max_send_wr; + dest_attr->cap.max_recv_wr = src_attr->max_recv_wr; + dest_attr->cap.max_send_sge = src_attr->max_send_sge; + dest_attr->cap.max_recv_sge = src_attr->max_recv_sge; + dest_attr->cap.max_inline_data = src_attr->max_inline_data; + + ib_cm_copy_ah_attr(&dest_attr->ah_attr, &src_attr->ah_attr); + ib_cm_copy_ah_attr(&dest_attr->alt_ah_attr, &src_attr->alt_ah_attr); + + dest_attr->pkey_index = src_attr->pkey_index; + dest_attr->alt_pkey_index = src_attr->alt_pkey_index; + dest_attr->en_sqd_async_notify = src_attr->en_sqd_async_notify; + dest_attr->sq_draining = src_attr->sq_draining; + dest_attr->max_rd_atomic = src_attr->max_rd_atomic; + dest_attr->max_dest_rd_atomic = src_attr->max_dest_rd_atomic; + dest_attr->min_rnr_timer = src_attr->min_rnr_timer; + dest_attr->port_num = src_attr->port_num; + dest_attr->timeout = src_attr->timeout; + dest_attr->retry_cnt = src_attr->retry_cnt; + dest_attr->rnr_retry = src_attr->rnr_retry; + dest_attr->alt_port_num 
= src_attr->alt_port_num; + dest_attr->alt_timeout = src_attr->alt_timeout; +} + +int ib_cm_init_qp_attr(struct ib_cm_id *cm_id, + struct ibv_qp_attr *qp_attr, + int *qp_attr_mask) +{ + struct cm_abi_init_qp_attr_resp *resp; + struct cm_abi_init_qp_attr *cmd; + void *msg; + int result; + int size; + + if (!qp_attr || !qp_attr_mask) + return -EINVAL; + + CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_INIT_QP_ATTR, size); + cmd->id = cm_id->handle; + cmd->qp_state = qp_attr->qp_state; + + result = write(fd, msg, size); + if (result != size) + return (result > 0) ? -ENODATA : result; + + *qp_attr_mask = resp->qp_attr_mask; + ib_cm_copy_qp_attr(qp_attr, resp); return 0; } -int ib_cm_listen(uint32_t cm_id, +int ib_cm_listen(struct ib_cm_id *cm_id, uint64_t service_id, uint64_t service_mask) { @@ -203,8 +337,7 @@ int ib_cm_listen(uint32_t cm_id, int size; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_LISTEN, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; cmd->service_id = service_id; cmd->service_mask = service_mask; @@ -215,7 +348,7 @@ int ib_cm_listen(uint32_t cm_id, return 0; } -int ib_cm_send_req(uint32_t cm_id, struct ib_cm_req_param *param) +int ib_cm_send_req(struct ib_cm_id *cm_id, struct ib_cm_req_param *param) { struct cm_abi_path_rec *p_path; struct cm_abi_path_rec *a_path; @@ -228,13 +361,11 @@ int ib_cm_send_req(uint32_t cm_id, struc return -EINVAL; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_REQ, size); - - cmd->id = cm_id; - cmd->qpn = param->qp_num; - cmd->qp_type = param->qp_type; - cmd->psn = param->starting_psn; - cmd->sid = param->service_id; - + cmd->id = cm_id->handle; + cmd->qpn = param->qp_num; + cmd->qp_type = param->qp_type; + cmd->psn = param->starting_psn; + cmd->sid = param->service_id; cmd->peer_to_peer = param->peer_to_peer; cmd->responder_resources = param->responder_resources; cmd->initiator_depth = param->initiator_depth; @@ -247,28 +378,25 @@ int ib_cm_send_req(uint32_t cm_id, struc cmd->srq = param->srq; if 
(param->primary_path) { - p_path = alloca(sizeof(*p_path)); if (!p_path) return -ENOMEM; cm_param_path_get(p_path, param->primary_path); - cmd->primary_path = (unsigned long)p_path; + cmd->primary_path = (uintptr_t) p_path; } if (param->alternate_path) { - a_path = alloca(sizeof(*a_path)); if (!a_path) return -ENOMEM; cm_param_path_get(a_path, param->alternate_path); - cmd->alternate_path = (unsigned long)a_path; + cmd->alternate_path = (uintptr_t) a_path; } if (param->private_data && param->private_data_len) { - - cmd->data = (unsigned long)param->private_data; + cmd->data = (uintptr_t) param->private_data; cmd->len = param->private_data_len; } @@ -279,7 +407,7 @@ int ib_cm_send_req(uint32_t cm_id, struc return 0; } -int ib_cm_send_rep(uint32_t cm_id, struct ib_cm_rep_param *param) +int ib_cm_send_rep(struct ib_cm_id *cm_id, struct ib_cm_rep_param *param) { struct cm_abi_rep *cmd; void *msg; @@ -290,11 +418,10 @@ int ib_cm_send_rep(uint32_t cm_id, struc return -EINVAL; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_REP, size); - - cmd->id = cm_id; - cmd->qpn = param->qp_num; - cmd->psn = param->starting_psn; - + cmd->uid = (uintptr_t) container_of(cm_id, struct cm_id_private, id); + cmd->id = cm_id->handle; + cmd->qpn = param->qp_num; + cmd->psn = param->starting_psn; cmd->responder_resources = param->responder_resources; cmd->initiator_depth = param->initiator_depth; cmd->target_ack_delay = param->target_ack_delay; @@ -304,8 +431,7 @@ int ib_cm_send_rep(uint32_t cm_id, struc cmd->srq = param->srq; if (param->private_data && param->private_data_len) { - - cmd->data = (unsigned long)param->private_data; + cmd->data = (uintptr_t) param->private_data; cmd->len = param->private_data_len; } @@ -316,7 +442,7 @@ int ib_cm_send_rep(uint32_t cm_id, struc return 0; } -static inline int cm_send_private_data(uint32_t cm_id, +static inline int cm_send_private_data(struct ib_cm_id *cm_id, uint32_t type, void *private_data, uint8_t private_data_len) @@ -327,12 +453,10 @@ static 
inline int cm_send_private_data(u int size; CM_CREATE_MSG_CMD(msg, cmd, type, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; if (private_data && private_data_len) { - - cmd->data = (unsigned long)private_data; + cmd->data = (uintptr_t) private_data; cmd->len = private_data_len; } @@ -343,7 +467,7 @@ static inline int cm_send_private_data(u return 0; } -int ib_cm_send_rtu(uint32_t cm_id, +int ib_cm_send_rtu(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len) { @@ -351,7 +475,7 @@ int ib_cm_send_rtu(uint32_t cm_id, private_data, private_data_len); } -int ib_cm_send_dreq(uint32_t cm_id, +int ib_cm_send_dreq(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len) { @@ -359,7 +483,7 @@ int ib_cm_send_dreq(uint32_t cm_id, private_data, private_data_len); } -int ib_cm_send_drep(uint32_t cm_id, +int ib_cm_send_drep(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len) { @@ -367,16 +491,15 @@ int ib_cm_send_drep(uint32_t cm_id, private_data, private_data_len); } -int ib_cm_establish(uint32_t cm_id) +int ib_cm_establish(struct ib_cm_id *cm_id) { struct cm_abi_establish *cmd; void *msg; int result; int size; - CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_ESTABLISH, size); - - cmd->id = cm_id; + CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_ESTABLISH, size); + cmd->id = cm_id->handle; result = write(fd, msg, size); if (result != size) @@ -385,7 +508,7 @@ int ib_cm_establish(uint32_t cm_id) return 0; } -static inline int cm_send_status(uint32_t cm_id, +static inline int cm_send_status(struct ib_cm_id *cm_id, uint32_t type, int status, void *info, @@ -399,19 +522,16 @@ static inline int cm_send_status(uint32_ int size; CM_CREATE_MSG_CMD(msg, cmd, type, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; cmd->status = status; if (private_data && private_data_len) { - - cmd->data = (unsigned long)private_data; + cmd->data = (uintptr_t) private_data; cmd->data_len = private_data_len; } if (info && info_length) { - - cmd->info = 
(unsigned long)info; + cmd->info = (uintptr_t) info; cmd->info_len = info_length; } @@ -422,7 +542,7 @@ static inline int cm_send_status(uint32_ return 0; } -int ib_cm_send_rej(uint32_t cm_id, +int ib_cm_send_rej(struct ib_cm_id *cm_id, enum ib_cm_rej_reason reason, void *ari, uint8_t ari_length, @@ -434,7 +554,7 @@ int ib_cm_send_rej(uint32_t cm_id, private_data, private_data_len); } -int ib_cm_send_apr(uint32_t cm_id, +int ib_cm_send_apr(struct ib_cm_id *cm_id, enum ib_cm_apr_status status, void *info, uint8_t info_length, @@ -446,7 +566,7 @@ int ib_cm_send_apr(uint32_t cm_id, private_data, private_data_len); } -int ib_cm_send_mra(uint32_t cm_id, +int ib_cm_send_mra(struct ib_cm_id *cm_id, uint8_t service_timeout, void *private_data, uint8_t private_data_len) @@ -457,13 +577,11 @@ int ib_cm_send_mra(uint32_t cm_id, int size; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_MRA, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; cmd->timeout = service_timeout; if (private_data && private_data_len) { - - cmd->data = (unsigned long)private_data; + cmd->data = (uintptr_t) private_data; cmd->len = private_data_len; } @@ -474,7 +592,7 @@ int ib_cm_send_mra(uint32_t cm_id, return 0; } -int ib_cm_send_lap(uint32_t cm_id, +int ib_cm_send_lap(struct ib_cm_id *cm_id, struct ib_sa_path_rec *alternate_path, void *private_data, uint8_t private_data_len) @@ -486,22 +604,19 @@ int ib_cm_send_lap(uint32_t cm_id, int size; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_LAP, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; if (alternate_path) { - abi_path = alloca(sizeof(*abi_path)); if (!abi_path) return -ENOMEM; cm_param_path_get(abi_path, alternate_path); - cmd->path = (unsigned long)abi_path; + cmd->path = (uintptr_t) abi_path; } if (private_data && private_data_len) { - - cmd->data = (unsigned long)private_data; + cmd->data = (uintptr_t) private_data; cmd->len = private_data_len; } @@ -512,7 +627,8 @@ int ib_cm_send_lap(uint32_t cm_id, return 0; } -int 
ib_cm_send_sidr_req(uint32_t cm_id, struct ib_cm_sidr_req_param *param) +int ib_cm_send_sidr_req(struct ib_cm_id *cm_id, + struct ib_cm_sidr_req_param *param) { struct cm_abi_path_rec *abi_path; struct cm_abi_sidr_req *cmd; @@ -524,26 +640,23 @@ int ib_cm_send_sidr_req(uint32_t cm_id, return -EINVAL; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_SIDR_REQ, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; cmd->sid = param->service_id; cmd->timeout = param->timeout_ms; cmd->pkey = param->pkey; cmd->max_cm_retries = param->max_cm_retries; if (param->path) { - abi_path = alloca(sizeof(*abi_path)); if (!abi_path) return -ENOMEM; cm_param_path_get(abi_path, param->path); - cmd->path = (unsigned long)abi_path; + cmd->path = (uintptr_t) abi_path; } if (param->private_data && param->private_data_len) { - - cmd->data = (unsigned long)param->private_data; + cmd->data = (uintptr_t) param->private_data; cmd->len = param->private_data_len; } @@ -554,7 +667,8 @@ int ib_cm_send_sidr_req(uint32_t cm_id, return 0; } -int ib_cm_send_sidr_rep(uint32_t cm_id, struct ib_cm_sidr_rep_param *param) +int ib_cm_send_sidr_rep(struct ib_cm_id *cm_id, + struct ib_cm_sidr_rep_param *param) { struct cm_abi_sidr_rep *cmd; void *msg; @@ -565,21 +679,18 @@ int ib_cm_send_sidr_rep(uint32_t cm_id, return -EINVAL; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_SIDR_REP, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; cmd->qpn = param->qp_num; cmd->qkey = param->qkey; cmd->status = param->status; if (param->private_data && param->private_data_len) { - - cmd->data = (unsigned long)param->private_data; + cmd->data = (uintptr_t) param->private_data; cmd->data_len = param->private_data_len; } if (param->info && param->info_length) { - - cmd->info = (unsigned long)param->info; + cmd->info = (uintptr_t) param->info; cmd->info_len = param->info_length; } @@ -599,8 +710,8 @@ static void cm_event_path_get(struct ib_ if (!kpath || !upath) return; - memcpy(upath->dgid.raw, kpath->dgid, sizeof(union 
ibv_gid)); - memcpy(upath->sgid.raw, kpath->sgid, sizeof(union ibv_gid)); + memcpy(upath->dgid.raw, kpath->dgid, sizeof upath->dgid); + memcpy(upath->sgid.raw, kpath->sgid, sizeof upath->sgid); upath->dlid = kpath->dlid; upath->slid = kpath->slid; @@ -626,8 +737,6 @@ static void cm_event_path_get(struct ib_ static void cm_event_req_get(struct ib_cm_req_event_param *ureq, struct cm_abi_req_event_resp *kreq) { - ureq->listen_id = kreq->listen_id; - ureq->remote_ca_guid = kreq->remote_ca_guid; ureq->remote_qkey = kreq->remote_qkey; ureq->remote_qpn = kreq->remote_qpn; @@ -661,36 +770,6 @@ static void cm_event_rep_get(struct ib_c urep->rnr_retry_count = krep->rnr_retry_count; urep->srq = krep->srq; } -static void cm_event_rej_get(struct ib_cm_rej_event_param *urej, - struct cm_abi_rej_event_resp *krej) -{ - urej->reason = krej->reason; -} - -static void cm_event_mra_get(struct ib_cm_mra_event_param *umra, - struct cm_abi_mra_event_resp *kmra) -{ - umra->service_timeout = kmra->timeout; -} - -static void cm_event_lap_get(struct ib_cm_lap_event_param *ulap, - struct cm_abi_lap_event_resp *klap) -{ - cm_event_path_get(ulap->alternate_path, &klap->path); -} - -static void cm_event_apr_get(struct ib_cm_apr_event_param *uapr, - struct cm_abi_apr_event_resp *kapr) -{ - uapr->ap_status = kapr->status; -} - -static void cm_event_sidr_req_get(struct ib_cm_sidr_req_event_param *ureq, - struct cm_abi_sidr_req_event_resp *kreq) -{ - ureq->listen_id = kreq->listen_id; - ureq->pkey = kreq->pkey; -} static void cm_event_sidr_rep_get(struct ib_cm_sidr_rep_event_param *urep, struct cm_abi_sidr_rep_event_resp *krep) @@ -702,6 +781,7 @@ static void cm_event_sidr_rep_get(struct int ib_cm_event_get(struct ib_cm_event **event) { + struct cm_id_private *cm_id_priv; struct cm_abi_cmd_hdr *hdr; struct cm_abi_event_get *cmd; struct cm_abi_event_resp *resp; @@ -733,7 +813,7 @@ int ib_cm_event_get(struct ib_cm_event * if (!resp) return -ENOMEM; - cmd->response = (unsigned long)resp; + 
cmd->response = (uintptr_t) resp; cmd->data_len = (uint8_t)(~0U); cmd->info_len = (uint8_t)(~0U); @@ -749,8 +829,8 @@ int ib_cm_event_get(struct ib_cm_event * goto done; } - cmd->data = (unsigned long)data; - cmd->info = (unsigned long)info; + cmd->data = (uintptr_t) data; + cmd->info = (uintptr_t) info; result = write(fd, msg, size); if (result != size) { @@ -765,14 +845,11 @@ int ib_cm_event_get(struct ib_cm_event * result = -ENOMEM; goto done; } - memset(evt, 0, sizeof(*evt)); - - evt->cm_id = resp->id; + evt->cm_id = (void *) (uintptr_t) resp->uid; evt->event = resp->event; if (resp->present & CM_ABI_PRES_PRIMARY) { - path_a = malloc(sizeof(*path_a)); if (!path_a) { result = -ENOMEM; @@ -781,81 +858,78 @@ int ib_cm_event_get(struct ib_cm_event * } if (resp->present & CM_ABI_PRES_ALTERNATE) { - path_b = malloc(sizeof(*path_b)); if (!path_b) { result = -ENOMEM; goto done; } } - - if (resp->present & CM_ABI_PRES_DATA) { - - evt->private_data = data; - data = NULL; - } switch (evt->event) { case IB_CM_REQ_RECEIVED: - + evt->param.req_rcvd.listen_id = evt->cm_id; + cm_id_priv = ib_cm_alloc_id(evt->cm_id->context); + if (!cm_id_priv) { + result = -ENOMEM; + goto done; + } + cm_id_priv->id.handle = resp->id; + evt->cm_id = &cm_id_priv->id; evt->param.req_rcvd.primary_path = path_a; evt->param.req_rcvd.alternate_path = path_b; path_a = NULL; path_b = NULL; - cm_event_req_get(&evt->param.req_rcvd, &resp->u.req_resp); break; case IB_CM_REP_RECEIVED: - cm_event_rep_get(&evt->param.rep_rcvd, &resp->u.rep_resp); break; case IB_CM_MRA_RECEIVED: - - cm_event_mra_get(&evt->param.mra_rcvd, &resp->u.mra_resp); + evt->param.mra_rcvd.service_timeout = resp->u.mra_resp.timeout; break; case IB_CM_REJ_RECEIVED: - - cm_event_rej_get(&evt->param.rej_rcvd, &resp->u.rej_resp); - + evt->param.rej_rcvd.reason = resp->u.rej_resp.reason; evt->param.rej_rcvd.ari = info; info = NULL; - break; case IB_CM_LAP_RECEIVED: - evt->param.lap_rcvd.alternate_path = path_b; path_b = NULL; - - 
cm_event_lap_get(&evt->param.lap_rcvd, &resp->u.lap_resp); + cm_event_path_get(evt->param.lap_rcvd.alternate_path, + &resp->u.lap_resp.path); break; case IB_CM_APR_RECEIVED: - - cm_event_apr_get(&evt->param.apr_rcvd, &resp->u.apr_resp); - + evt->param.apr_rcvd.ap_status = resp->u.apr_resp.status; evt->param.apr_rcvd.apr_info = info; info = NULL; - break; case IB_CM_SIDR_REQ_RECEIVED: - - cm_event_sidr_req_get(&evt->param.sidr_req_rcvd, - &resp->u.sidr_req_resp); + evt->param.sidr_req_rcvd.listen_id = evt->cm_id; + cm_id_priv = ib_cm_alloc_id(evt->cm_id->context); + if (!cm_id_priv) { + result = -ENOMEM; + goto done; + } + cm_id_priv->id.handle = resp->id; + evt->cm_id = &cm_id_priv->id; + evt->param.sidr_req_rcvd.pkey = resp->u.sidr_req_resp.pkey; break; case IB_CM_SIDR_REP_RECEIVED: - cm_event_sidr_rep_get(&evt->param.sidr_rep_rcvd, &resp->u.sidr_rep_resp); - evt->param.sidr_rep_rcvd.info = info; info = NULL; - break; default: - evt->param.send_status = resp->u.send_status; break; } + if (resp->present & CM_ABI_PRES_DATA) { + evt->private_data = data; + data = NULL; + } + *event = evt; evt = NULL; result = 0; @@ -876,44 +950,51 @@ done: int ib_cm_event_put(struct ib_cm_event *event) { + struct cm_id_private *cm_id_priv; + if (!event) return -EINVAL; if (event->private_data) free(event->private_data); + cm_id_priv = container_of(event->cm_id, struct cm_id_private, id); + switch (event->event) { case IB_CM_REQ_RECEIVED: - - if (event->param.req_rcvd.primary_path) - free(event->param.req_rcvd.primary_path); - + cm_id_priv = container_of(event->param.req_rcvd.listen_id, + struct cm_id_private, id); + free(event->param.req_rcvd.primary_path); if (event->param.req_rcvd.alternate_path) free(event->param.req_rcvd.alternate_path); break; case IB_CM_REJ_RECEIVED: - if (event->param.rej_rcvd.ari) free(event->param.rej_rcvd.ari); break; case IB_CM_LAP_RECEIVED: - - if (event->param.lap_rcvd.alternate_path) - free(event->param.lap_rcvd.alternate_path); + 
free(event->param.lap_rcvd.alternate_path); break; case IB_CM_APR_RECEIVED: - if (event->param.apr_rcvd.apr_info) free(event->param.apr_rcvd.apr_info); break; + case IB_CM_SIDR_REQ_RECEIVED: + cm_id_priv = container_of(event->param.sidr_req_rcvd.listen_id, + struct cm_id_private, id); + break; case IB_CM_SIDR_REP_RECEIVED: - if (event->param.sidr_rep_rcvd.info) free(event->param.sidr_rep_rcvd.info); default: break; } + pthread_mutex_lock(&cm_id_priv->mut); + cm_id_priv->events_completed++; + pthread_cond_signal(&cm_id_priv->cond); + pthread_mutex_unlock(&cm_id_priv->mut); + free(event); return 0; } Index: userspace/libibcm/Makefile.am =================================================================== --- userspace/libibcm/Makefile.am (revision 3124) +++ userspace/libibcm/Makefile.am (working copy) @@ -18,9 +18,11 @@ endif src_libibcm_la_SOURCES = src/cm.c src_libibcm_la_LDFLAGS = -avoid-version -module $(ucm_version_script) -bin_PROGRAMS = examples/ucm_simple -examples_ucm_simple_SOURCES = examples/simple.c -examples_ucm_simple_LDADD = $(top_builddir)/src/libibcm.la +bin_PROGRAMS = examples/ucmpost +examples_ucmpost_SOURCES = examples/cmpost.c +examples_ucmpost_LDADD = $(top_builddir)/src/libibcm.la \ + $(libdir)/libibverbs.la \ + $(libdir)/libibat.la libibcmincludedir = $(includedir)/infiniband Index: userspace/libibcm/examples/cmpost.c =================================================================== --- userspace/libibcm/examples/cmpost.c (revision 0) +++ userspace/libibcm/examples/cmpost.c (revision 0) @@ -0,0 +1,718 @@ +/* + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#if __BYTE_ORDER == __BIG_ENDIAN +static inline uint64_t cpu_to_be64(uint64_t x) { return x; } +static inline uint32_t cpu_to_be32(uint32_t x) { return x; } +#else +static inline uint64_t cpu_to_be64(uint64_t x) { return bswap_64(x); } +static inline uint32_t cpu_to_be32(uint32_t x) { return bswap_32(x); } +#endif + +/* + * To execute: + * Server: ucmpost + * Client: ucmpost server + */ + +struct cmtest { + struct ibv_device *device; + struct ibv_context *verbs; + struct ibv_pd *pd; + + /* cm info */ + struct ib_sa_path_rec path_rec; + + struct cmtest_node *nodes; + int conn_index; + int connects_left; + int disconnects_left; + + /* memory region info */ + struct ibv_mr *mr; + void *mem; +}; + +static struct cmtest test; +static int message_count = 10; +static int message_size = 100; +static int connections = 1; +static int is_server = 1; + +struct cmtest_node { + int id; + struct ibv_cq *cq; + struct ibv_qp *qp; + struct ib_cm_id *cm_id; + int connected; +}; + +static int post_recvs(struct cmtest_node *node) +{ + struct ibv_recv_wr recv_wr, *recv_failure; + struct ibv_sge sge; + int i, ret = 0; + + if (!message_count) + return 0; + + recv_wr.next = NULL; + recv_wr.sg_list = &sge; + recv_wr.num_sge = 1; + recv_wr.wr_id = (uintptr_t) node; + + sge.length = message_size; + sge.lkey = test.mr->lkey; + sge.addr = (uintptr_t) test.mem; + + for (i = 0; i < message_count && !ret; i++ ) { + ret = ibv_post_recv(node->qp, &recv_wr, &recv_failure); + if (ret) { + printf("failed to post receives: %d\n", ret); + break; + } + } + return ret; +} + +static int modify_to_rtr(struct cmtest_node *node) +{ + struct ibv_qp_attr qp_attr; + int qp_attr_mask, ret; + + qp_attr.qp_state = IBV_QPS_INIT; + ret = ib_cm_init_qp_attr(node->cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + printf("failed to init QP attr for INIT: %d\n", ret); + return ret; + } + ret = 
ibv_modify_qp(node->qp, &qp_attr, qp_attr_mask); + if (ret) { + printf("failed to modify QP to INIT: %d\n", ret); + return ret; + } + qp_attr.qp_state = IBV_QPS_RTR; + ret = ib_cm_init_qp_attr(node->cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + printf("failed to init QP attr for RTR: %d\n", ret); + return ret; + } + qp_attr.rq_psn = node->qp->qp_num; + ret = ibv_modify_qp(node->qp, &qp_attr, qp_attr_mask); + if (ret) { + printf("failed to modify QP to RTR: %d\n", ret); + return ret; + } + return 0; +} + +static int modify_to_rts(struct cmtest_node *node) +{ + struct ibv_qp_attr qp_attr; + int qp_attr_mask, ret; + + qp_attr.qp_state = IBV_QPS_RTS; + ret = ib_cm_init_qp_attr(node->cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + printf("failed to init QP attr for RTS: %d\n", ret); + return ret; + } + ret = ibv_modify_qp(node->qp, &qp_attr, qp_attr_mask); + if (ret) { + printf("failed to modify QP to RTS: %d\n", ret); + return ret; + } + return 0; +} + +static void req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct cmtest_node *node; + struct ib_cm_req_event_param *req; + struct ib_cm_rep_param rep; + int ret; + + if (test.conn_index == connections) + goto error1; + node = &test.nodes[test.conn_index++]; + + node->cm_id = cm_id; + cm_id->context = node; + + ret = modify_to_rtr(node); + if (ret) + goto error2; + + ret = post_recvs(node); + if (ret) + goto error2; + + req = &event->param.req_rcvd; + memset(&rep, 0, sizeof rep); + rep.qp_num = node->qp->qp_num; + rep.srq = (node->qp->srq != NULL); + rep.starting_psn = node->qp->qp_num; + rep.responder_resources = req->responder_resources; + rep.initiator_depth = req->initiator_depth; + rep.target_ack_delay = 20; + rep.flow_control = req->flow_control; + rep.rnr_retry_count = req->rnr_retry_count; + + ret = ib_cm_send_rep(cm_id, &rep); + if (ret) { + printf("failed to send CM REP: %d\n", ret); + goto error2; + } + return; +error2: + test.disconnects_left--; + test.connects_left--; +error1: + 
printf("failing connection request\n"); + ib_cm_send_rej(cm_id, IB_CM_REJ_UNSUPPORTED, NULL, 0, NULL, 0); +} + +static void rep_handler(struct cmtest_node *node, struct ib_cm_event *event) +{ + int ret; + + ret = modify_to_rtr(node); + if (ret) + goto error; + + ret = modify_to_rts(node); + if (ret) + goto error; + + ret = post_recvs(node); + if (ret) + goto error; + + ret = ib_cm_send_rtu(node->cm_id, NULL, 0); + if (ret) { + printf("failed to send CM RTU: %d\n", ret); + goto error; + } + node->connected = 1; + test.connects_left--; + return; +error: + printf("failing connection reply\n"); + ib_cm_send_rej(node->cm_id, IB_CM_REJ_UNSUPPORTED, NULL, 0, NULL, 0); + test.disconnects_left--; + test.connects_left--; +} + +static void rtu_handler(struct cmtest_node *node) +{ + int ret; + + ret = modify_to_rts(node); + if (ret) + goto error; + + node->connected = 1; + test.connects_left--; + return; +error: + printf("aborting connection - disconnecting\n"); + ib_cm_send_dreq(node->cm_id, NULL, 0); + test.disconnects_left--; + test.connects_left--; +} + +static void cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct cmtest_node *node = cm_id->context; + + switch (event->event) { + case IB_CM_REQ_RECEIVED: + req_handler(cm_id, event); + break; + case IB_CM_REP_RECEIVED: + rep_handler(node, event); + break; + case IB_CM_RTU_RECEIVED: + rtu_handler(node); + break; + case IB_CM_DREQ_RECEIVED: + node->connected = 0; + ib_cm_send_drep(node->cm_id, NULL, 0); + test.disconnects_left--; + break; + case IB_CM_DREP_RECEIVED: + test.disconnects_left--; + break; + case IB_CM_REJ_RECEIVED: + printf("Received REJ\n"); + /* fall through */ + case IB_CM_REQ_ERROR: + case IB_CM_REP_ERROR: + printf("Error sending REQ or REP\n"); + test.disconnects_left--; + test.connects_left--; + break; + case IB_CM_DREQ_ERROR: + test.disconnects_left--; + printf("Error sending DREQ\n"); + break; + default: + break; + } +} + +static int init_node(struct cmtest_node *node, struct 
ibv_qp_init_attr *qp_attr) +{ + int cqe, ret; + + if (!is_server) { + ret = ib_cm_create_id(&node->cm_id, node); + if (ret) { + printf("failed to create cm_id: %d\n", ret); + return ret; + } + } + + cqe = message_count ? message_count * 2 : 2; + node->cq = ibv_create_cq(test.verbs, cqe, node); + if (!node->cq) { + printf("unable to create CQ\n"); + goto error1; + } + + qp_attr->send_cq = node->cq; + qp_attr->recv_cq = node->cq; + node->qp = ibv_create_qp(test.pd, qp_attr); + if (!node->qp) { + printf("unable to create QP\n"); + goto error2; + } + return 0; +error2: + ibv_destroy_cq(node->cq); +error1: + if (!is_server) + ib_cm_destroy_id(node->cm_id); + return -1; +} + +static void destroy_node(struct cmtest_node *node) +{ + ibv_destroy_qp(node->qp); + ibv_destroy_cq(node->cq); + if (node->cm_id) + ib_cm_destroy_id(node->cm_id); +} + +static int create_nodes(void) +{ + struct ibv_qp_init_attr qp_attr; + int ret, i; + + test.nodes = malloc(sizeof *test.nodes * connections); + if (!test.nodes) { + printf("unable to allocate memory for test nodes\n"); + return -1; + } + memset(test.nodes, 0, sizeof *test.nodes * connections); + + memset(&qp_attr, 0, sizeof qp_attr); + qp_attr.cap.max_send_wr = message_count ? message_count : 1; + qp_attr.cap.max_recv_wr = message_count ? 
message_count : 1; + qp_attr.cap.max_send_sge = 1; + qp_attr.cap.max_recv_sge = 1; + qp_attr.qp_type = IBV_QPT_RC; + + for (i = 0; i < connections; i++) { + test.nodes[i].id = i; + ret = init_node(&test.nodes[i], &qp_attr); + if (ret) + goto error; + } + return 0; +error: + while (--i >= 0) + destroy_node(&test.nodes[i]); + free(test.nodes); + return ret; +} + +static void destroy_nodes(void) +{ + int i; + + for (i = 0; i < connections; i++) + destroy_node(&test.nodes[i]); + free(test.nodes); +} + +static int create_messages(void) +{ + if (!message_size) + message_count = 0; + + if (!message_count) + return 0; + + test.mem = malloc(message_size); + if (!test.mem) { + printf("failed message allocation\n"); + return -1; + } + test.mr = ibv_reg_mr(test.pd, test.mem, message_size, + IBV_ACCESS_LOCAL_WRITE); + if (!test.mr) { + printf("failed to reg MR\n"); + goto err; + } + return 0; +err: + free(test.mem); + return -1; +} + +static void destroy_messages(void) +{ + if (!message_count) + return; + + ibv_dereg_mr(test.mr); + free(test.mem); +} + +static int init(void) +{ + struct dlist *dev_list; + int ret; + + test.connects_left = connections; + test.disconnects_left = connections; + + dev_list = ibv_get_devices(); + dlist_start(dev_list); + test.device = dlist_next(dev_list); + if (!test.device) + return -1; + + test.verbs = ibv_open_device(test.device); + if (!test.verbs) + return -1; + + test.pd = ibv_alloc_pd(test.verbs); + if (!test.pd) { + printf("failed to alloc PD\n"); + return -1; + } + ret = create_messages(); + if (ret) { + printf("unable to create test messages\n"); + goto error1; + } + ret = create_nodes(); + if (ret) { + printf("unable to create test nodes\n"); + goto error2; + } + return 0; +error2: + destroy_messages(); +error1: + ibv_dealloc_pd(test.pd); + return -1; +} + +static void cleanup(void) +{ + destroy_nodes(); + destroy_messages(); + ibv_dealloc_pd(test.pd); +} + +static int send_msgs(void) +{ + struct ibv_send_wr send_wr, *bad_send_wr; + 
struct ibv_sge sge; + int i, m, ret; + + send_wr.next = NULL; + send_wr.sg_list = &sge; + send_wr.num_sge = 1; + send_wr.opcode = IBV_WR_SEND; + send_wr.send_flags = IBV_SEND_SIGNALED; + send_wr.wr_id = 0; + + sge.addr = (uintptr_t) test.mem; + sge.length = message_size; + sge.lkey = test.mr->lkey; + + for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + for (m = 0; m < message_count; m++) { + ret = ibv_post_send(test.nodes[i].qp, &send_wr, + &bad_send_wr); + if (ret) + return ret; + } + } + return 0; +} + +static int poll_cqs(void) +{ + struct ibv_wc wc[8]; + int done, i, ret; + + for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + for (done = 0; done < message_count; done += ret) { + ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + if (ret < 0) { + printf("failed polling CQ: %d\n", ret); + return ret; + } + } + } + return 0; +} + +static void connect_events(void) +{ + struct ib_cm_event *event; + int err = 0; + + while (test.connects_left && !err) { + err = ib_cm_event_get(&event); + if (!err) { + cm_handler(event->cm_id, event); + ib_cm_event_put(event); + } + } +} + +static void disconnect_events(void) +{ + struct ib_cm_event *event; + int err = 0; + + while (test.disconnects_left && !err) { + err = ib_cm_event_get(&event); + if (!err) { + cm_handler(event->cm_id, event); + ib_cm_event_put(event); + } + } +} + +static void run_server(void) +{ + struct ib_cm_id *listen_id; + int i, ret; + + printf("starting server\n"); + if (ib_cm_create_id(&listen_id, &test)) { + printf("listen request failed\n"); + return; + } + ret = ib_cm_listen(listen_id, cpu_to_be64(0x1000), 0); + if (ret) { + printf("failure trying to listen: %d\n", ret); + goto out; + } + + connect_events(); + + if (message_count) { + printf("initiating data transfers\n"); + if (send_msgs()) + goto out; + printf("receiving data transfers\n"); + if (poll_cqs()) + goto out; + printf("data transfers complete\n"); + } + + printf("disconnecting\n"); 
+ for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + test.nodes[i].connected = 0; + ib_cm_send_dreq(test.nodes[i].cm_id, NULL, 0); + } + disconnect_events(); + printf("disconnected\n"); +out: + ib_cm_destroy_id(listen_id); +} + +static void at_callback(uint64_t req_id, void *context, int rec_num) +{ +} + +static int query_for_path(char *dest) +{ + struct ib_at_ib_route route; + struct ib_at_completion comp; + struct addrinfo *res; + int ret; + + ret = getaddrinfo(dest, NULL, NULL, &res); + if (ret) { + printf("getaddrinfo failed - invalid hostname or IP address\n"); + return ret; + } + + if (res->ai_family != PF_INET) { + ret = -1; + goto out; + } + + comp.fn = at_callback; + ret = ib_at_route_by_ip(((struct sockaddr_in *)res->ai_addr)->sin_addr.s_addr, + 0, 0, 0, &route, &comp, NULL); + if (ret < 0) { + printf("ib_at_route_by_ip failed: %d\n", ret); + goto out; + } + + if (!ret) { + ret = ib_at_callback_get(); + if (ret) { + printf("ib_at_callback_get failed: %d\n", ret); + goto out; + } + } + + ret = ib_at_paths_by_route(&route, 0, &test.path_rec, 1, &comp, NULL); + if (ret < 0) { + printf("ib_at_paths_by_route failed: %d\n", ret); + goto out; + } + + if (!ret) { + ret = ib_at_callback_get(); + if (ret) + printf("ib_at_callback_get failed: %d\n", ret); + } else + ret = 0; + +out: + freeaddrinfo(res); + return ret; +} + +static void run_client(char *dest) +{ + struct ib_cm_req_param req; + int i, ret; + + printf("starting client\n"); + ret = query_for_path(dest); + if (ret) { + printf("failed path record query: %d\n", ret); + return; + } + + memset(&req, 0, sizeof req); + req.primary_path = &test.path_rec; + req.service_id = cpu_to_be64(0x1000); + req.responder_resources = 1; + req.initiator_depth = 1; + req.remote_cm_response_timeout = 20; + req.local_cm_response_timeout = 20; + req.retry_count = 5; + req.max_cm_retries = 5; + + printf("connecting\n"); + for (i = 0; i < connections; i++) { + req.qp_num = test.nodes[i].qp->qp_num; 
+ req.qp_type = IBV_QPT_RC; + req.srq = (test.nodes[i].qp->srq != NULL); + req.starting_psn = test.nodes[i].qp->qp_num; + ret = ib_cm_send_req(test.nodes[i].cm_id, &req); + if (ret) { + printf("failure sending REQ: %d\n", ret); + return; + } + } + + connect_events(); + + if (message_count) { + printf("receiving data transfers\n"); + if (poll_cqs()) + goto out; + printf("initiating data transfers\n"); + if (send_msgs()) + goto out; + printf("data transfers complete\n"); + } +out: + disconnect_events(); +} + +int main(int argc, char **argv) +{ + if (argc != 1 && argc != 2) { + printf("usage: %s [server_addr]\n", argv[0]); + exit(1); + } + + is_server = (argc == 1); + if (init()) + exit(1); + + if (is_server) + run_server(); + else + run_client(argv[1]); + + printf("test complete\n"); + cleanup(); + return 0; +} Index: userspace/libibcm/examples/simple.c =================================================================== --- userspace/libibcm/examples/simple.c (revision 3124) +++ userspace/libibcm/examples/simple.c (working copy) @@ -58,7 +58,7 @@ static inline uint64_t cpu_to_be64(uint6 #define TEST_SID 0x0000000ff0000000ULL -static int cm_connect(uint32_t cm_id) +static int cm_connect(struct ib_cm_id *cm_id) { struct ib_cm_req_param param; struct ib_sa_path_rec sa; @@ -108,8 +108,8 @@ static int cm_connect(uint32_t cm_id) src->global.subnet_prefix = cpu_to_be64(0xfe80000000000000ULL); dst->global.subnet_prefix = cpu_to_be64(0xfe80000000000000ULL); - src->global.interface_id = cpu_to_be64(0x0002c90200002179ULL); - dst->global.interface_id = cpu_to_be64(0x0005ad000001296cULL); + src->global.interface_id = cpu_to_be64(0x0002c90107fc5e11ULL); + dst->global.interface_id = cpu_to_be64(0x0002c90107fc5eb1ULL); return ib_cm_send_req(cm_id, &param); } @@ -118,7 +118,7 @@ int main(int argc, char **argv) { struct ib_cm_event *event; struct ib_cm_rep_param rep; - int cm_id; + struct ib_cm_id *cm_id; int result; int param_c = 0; @@ -137,8 +137,8 @@ int main(int argc, char **argv) 
exit(1); } - result = ib_cm_create_id(&cm_id); - if (result < 0) { + result = ib_cm_create_id(&cm_id, NULL); + if (result) { printf("Error creating CM ID <%d:%d>\n", result, errno); goto done; } @@ -146,16 +146,16 @@ int main(int argc, char **argv) if (mode) { result = cm_connect(cm_id); if (result) { - printf("Error <%d:%d> sending REQ <%d>\n", - result, errno, cm_id); + printf("Error <%d:%d> sending REQ\n", + result, errno); goto done; } } else { result = ib_cm_listen(cm_id, TEST_SID, 0); if (result) { - printf("Error <%d:%d> listening <%d>\n", - result, errno, cm_id); + printf("Error <%d:%d> listening\n", + result, errno); goto done; } } @@ -169,7 +169,7 @@ int main(int argc, char **argv) goto done; } - printf("CM ID <%d> Event <%d>\n", event->cm_id, event->event); + printf("CM ID <%p> Event <%d>\n", event->cm_id, event->event); switch (event->event) { case IB_CM_REQ_RECEIVED: @@ -264,4 +264,3 @@ int main(int argc, char **argv) done: return 0; } - Index: linux-kernel/infiniband/include/ib_user_cm.h =================================================================== --- linux-kernel/infiniband/include/ib_user_cm.h (revision 3109) +++ linux-kernel/infiniband/include/ib_user_cm.h (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -37,7 +38,7 @@ #include -#define IB_USER_CM_ABI_VERSION 1 +#define IB_USER_CM_ABI_VERSION 2 enum { IB_USER_CM_CMD_CREATE_ID, @@ -60,6 +61,7 @@ enum { IB_USER_CM_CMD_SEND_SIDR_REP, IB_USER_CM_CMD_EVENT, + IB_USER_CM_CMD_INIT_QP_ATTR, }; /* * command ABI structures. 
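A note on the `__u64 uid` and `__u64 response` fields in the command structures that follow: userspace stores a pointer in them (cm.c in this patch does `cmd->response = (uintptr_t)resp`), and ucm.c turns the value back into a pointer with `(void __user *)(unsigned long)cmd.response`. The fields are presumably fixed at 64 bits so that the structure layout does not depend on the pointer size of the calling process. The following is only a minimal sketch of that round trip in plain C; `demo_create_id`, `demo_ptr_to_u64` and `demo_u64_to_ptr` are hypothetical names and are not part of libibcm or the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in with the same field order as cm_abi_create_id. */
struct demo_create_id {
	uint64_t uid;		/* user context pointer, widened to 64 bits */
	uint64_t response;	/* address of the response buffer */
};

/* Widen a pointer to a fixed 64-bit field (assumed helper, not libibcm). */
static uint64_t demo_ptr_to_u64(const void *ptr)
{
	return (uint64_t) (uintptr_t) ptr;
}

/* Recover the pointer from the 64-bit field (assumed helper, not libibcm). */
static void *demo_u64_to_ptr(uint64_t value)
{
	/* A value that does not fit in a pointer cannot have come from one. */
	assert(value == (uint64_t) (uintptr_t) value);
	return (void *) (uintptr_t) value;
}
```

Going through `uintptr_t` rather than `unsigned long` keeps the conversion well defined on any platform that provides the type, which is likely why the patch changes the cast in cm.c.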
@@ -71,6 +73,7 @@ struct ib_ucm_cmd_hdr { }; struct ib_ucm_create_id { + __u64 uid; __u64 response; }; @@ -79,9 +82,14 @@ struct ib_ucm_create_id_resp { }; struct ib_ucm_destroy_id { + __u64 response; __u32 id; }; +struct ib_ucm_destroy_id_resp { + __u32 events_reported; +}; + struct ib_ucm_attr_id { __u64 response; __u32 id; @@ -94,6 +102,64 @@ struct ib_ucm_attr_id_resp { __be32 remote_id; }; +struct ib_ucm_init_qp_attr { + __u64 response; + __u32 id; + __u32 qp_state; +}; + +struct ib_ucm_ah_attr { + __u8 grh_dgid[16]; + __u32 grh_flow_label; + __u16 dlid; + __u16 reserved; + __u8 grh_sgid_index; + __u8 grh_hop_limit; + __u8 grh_traffic_class; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; +}; + +struct ib_ucm_init_qp_attr_resp { + __u32 qp_attr_mask; + __u32 qp_state; + __u32 cur_qp_state; + __u32 path_mtu; + __u32 path_mig_state; + __u32 qkey; + __u32 rq_psn; + __u32 sq_psn; + __u32 dest_qp_num; + __u32 qp_access_flags; + + struct ib_ucm_ah_attr ah_attr; + struct ib_ucm_ah_attr alt_ah_attr; + + /* ib_qp_cap */ + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; + + __u16 pkey_index; + __u16 alt_pkey_index; + __u8 en_sqd_async_notify; + __u8 sq_draining; + __u8 max_rd_atomic; + __u8 max_dest_rd_atomic; + __u8 min_rnr_timer; + __u8 port_num; + __u8 timeout; + __u8 retry_cnt; + __u8 rnr_retry; + __u8 alt_port_num; + __u8 alt_timeout; +}; + struct ib_ucm_listen { __be64 service_id; __be64 service_mask; @@ -157,6 +223,7 @@ struct ib_ucm_req { }; struct ib_ucm_rep { + __u64 uid; __u64 data; __u32 id; __u32 qpn; @@ -232,7 +299,6 @@ struct ib_ucm_event_get { }; struct ib_ucm_req_event_resp { - __u32 listen_id; /* device */ /* port */ struct ib_ucm_path_rec primary_path; @@ -287,7 +353,6 @@ struct ib_ucm_apr_event_resp { }; struct ib_ucm_sidr_req_event_resp { - __u32 listen_id; /* device */ /* port */ __u16 pkey; @@ -307,6 +372,7 @@ struct ib_ucm_sidr_rep_event_resp { 
#define IB_UCM_PRES_ALTERNATE 0x08 struct ib_ucm_event_resp { + __u64 uid; __u32 id; __u32 event; __u32 present; Index: linux-kernel/infiniband/core/ucm.c =================================================================== --- linux-kernel/infiniband/core/ucm.c (revision 3109) +++ linux-kernel/infiniband/core/ucm.c (working copy) @@ -72,7 +72,6 @@ enum { static struct semaphore ctx_id_mutex; static struct idr ctx_id_table; -static int ctx_id_rover = 0; static struct ib_ucm_context *ib_ucm_ctx_get(struct ib_ucm_file *file, int id) { @@ -97,33 +96,16 @@ static void ib_ucm_ctx_put(struct ib_ucm wake_up(&ctx->wait); } -static ssize_t ib_ucm_destroy_ctx(struct ib_ucm_file *file, int id) +static inline int ib_ucm_new_cm_id(int event) { - struct ib_ucm_context *ctx; - struct ib_ucm_event *uevent; - - down(&ctx_id_mutex); - ctx = idr_find(&ctx_id_table, id); - if (!ctx) - ctx = ERR_PTR(-ENOENT); - else if (ctx->file != file) - ctx = ERR_PTR(-EINVAL); - else - idr_remove(&ctx_id_table, ctx->id); - up(&ctx_id_mutex); - - if (IS_ERR(ctx)) - return PTR_ERR(ctx); - - atomic_dec(&ctx->ref); - wait_event(ctx->wait, !atomic_read(&ctx->ref)); + return event == IB_CM_REQ_RECEIVED || event == IB_CM_SIDR_REQ_RECEIVED; +} - /* No new events will be generated after destroying the cm_id. */ - if (!IS_ERR(ctx->cm_id)) - ib_destroy_cm_id(ctx->cm_id); +static void ib_ucm_cleanup_events(struct ib_ucm_context *ctx) +{ + struct ib_ucm_event *uevent; - /* Cleanup events not yet reported to the user. */ - down(&file->mutex); + down(&ctx->file->mutex); list_del(&ctx->file_list); while (!list_empty(&ctx->events)) { @@ -133,15 +115,12 @@ static ssize_t ib_ucm_destroy_ctx(struct list_del(&uevent->ctx_list); /* clear incoming connections. 
*/ - if (uevent->cm_id) + if (ib_ucm_new_cm_id(uevent->resp.event)) ib_destroy_cm_id(uevent->cm_id); kfree(uevent); } - up(&file->mutex); - - kfree(ctx); - return 0; + up(&ctx->file->mutex); } static struct ib_ucm_context *ib_ucm_ctx_alloc(struct ib_ucm_file *file) @@ -153,36 +132,31 @@ static struct ib_ucm_context *ib_ucm_ctx if (!ctx) return NULL; + memset(ctx, 0, sizeof *ctx); atomic_set(&ctx->ref, 1); init_waitqueue_head(&ctx->wait); ctx->file = file; - INIT_LIST_HEAD(&ctx->events); - list_add_tail(&ctx->file_list, &file->ctxs); - - ctx_id_rover = (ctx_id_rover + 1) & INT_MAX; -retry: - result = idr_pre_get(&ctx_id_table, GFP_KERNEL); - if (!result) - goto error; + do { + result = idr_pre_get(&ctx_id_table, GFP_KERNEL); + if (!result) + goto error; + + down(&ctx_id_mutex); + result = idr_get_new(&ctx_id_table, ctx, &ctx->id); + up(&ctx_id_mutex); + } while (result == -EAGAIN); - down(&ctx_id_mutex); - result = idr_get_new_above(&ctx_id_table, ctx, ctx_id_rover, &ctx->id); - up(&ctx_id_mutex); - - if (result == -EAGAIN) - goto retry; if (result) goto error; + list_add_tail(&ctx->file_list, &file->ctxs); ucm_dbg("Allocated CM ID <%d>\n", ctx->id); - return ctx; + error: - list_del(&ctx->file_list); kfree(ctx); - return NULL; } /* @@ -219,12 +193,9 @@ static void ib_ucm_event_path_get(struct kpath->packet_life_time_selector; } -static void ib_ucm_event_req_get(struct ib_ucm_context *ctx, - struct ib_ucm_req_event_resp *ureq, +static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq, struct ib_cm_req_event_param *kreq) { - ureq->listen_id = ctx->id; - ureq->remote_ca_guid = kreq->remote_ca_guid; ureq->remote_qkey = kreq->remote_qkey; ureq->remote_qpn = kreq->remote_qpn; @@ -259,14 +230,6 @@ static void ib_ucm_event_rep_get(struct urep->srq = krep->srq; } -static void ib_ucm_event_sidr_req_get(struct ib_ucm_context *ctx, - struct ib_ucm_sidr_req_event_resp *ureq, - struct ib_cm_sidr_req_event_param *kreq) -{ - ureq->listen_id = ctx->id; - ureq->pkey = 
kreq->pkey; -} - static void ib_ucm_event_sidr_rep_get(struct ib_ucm_sidr_rep_event_resp *urep, struct ib_cm_sidr_rep_event_param *krep) { @@ -275,15 +238,14 @@ static void ib_ucm_event_sidr_rep_get(st urep->qpn = krep->qpn; }; -static int ib_ucm_event_process(struct ib_ucm_context *ctx, - struct ib_cm_event *evt, +static int ib_ucm_event_process(struct ib_cm_event *evt, struct ib_ucm_event *uvt) { void *info = NULL; switch (evt->event) { case IB_CM_REQ_RECEIVED: - ib_ucm_event_req_get(ctx, &uvt->resp.u.req_resp, + ib_ucm_event_req_get(&uvt->resp.u.req_resp, &evt->param.req_rcvd); uvt->data_len = IB_CM_REQ_PRIVATE_DATA_SIZE; uvt->resp.present = IB_UCM_PRES_PRIMARY; @@ -331,8 +293,8 @@ static int ib_ucm_event_process(struct i info = evt->param.apr_rcvd.apr_info; break; case IB_CM_SIDR_REQ_RECEIVED: - ib_ucm_event_sidr_req_get(ctx, &uvt->resp.u.sidr_req_resp, - &evt->param.sidr_req_rcvd); + uvt->resp.u.sidr_req_resp.pkey = + evt->param.sidr_req_rcvd.pkey; uvt->data_len = IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE; break; case IB_CM_SIDR_REP_RECEIVED: @@ -378,31 +340,24 @@ static int ib_ucm_event_handler(struct i struct ib_ucm_event *uevent; struct ib_ucm_context *ctx; int result = 0; - int id; ctx = cm_id->context; - if (event->event == IB_CM_REQ_RECEIVED || - event->event == IB_CM_SIDR_REQ_RECEIVED) - id = IB_UCM_CM_ID_INVALID; - else - id = ctx->id; - uevent = kmalloc(sizeof(*uevent), GFP_KERNEL); if (!uevent) goto err1; memset(uevent, 0, sizeof(*uevent)); - uevent->resp.id = id; + uevent->ctx = ctx; + uevent->cm_id = cm_id; + uevent->resp.uid = ctx->uid; + uevent->resp.id = ctx->id; uevent->resp.event = event->event; - result = ib_ucm_event_process(ctx, event, uevent); + result = ib_ucm_event_process(event, uevent); if (result) goto err2; - uevent->ctx = ctx; - uevent->cm_id = (id == IB_UCM_CM_ID_INVALID) ? 
cm_id : NULL; - down(&ctx->file->mutex); list_add_tail(&uevent->file_list, &ctx->file->events); list_add_tail(&uevent->ctx_list, &ctx->events); @@ -414,7 +369,7 @@ err2: kfree(uevent); err1: /* Destroy new cm_id's */ - return (id == IB_UCM_CM_ID_INVALID); + return ib_ucm_new_cm_id(event->event); } static ssize_t ib_ucm_event(struct ib_ucm_file *file, @@ -423,7 +378,7 @@ static ssize_t ib_ucm_event(struct ib_uc { struct ib_ucm_context *ctx; struct ib_ucm_event_get cmd; - struct ib_ucm_event *uevent = NULL; + struct ib_ucm_event *uevent; int result = 0; DEFINE_WAIT(wait); @@ -436,7 +391,6 @@ static ssize_t ib_ucm_event(struct ib_uc * wait */ down(&file->mutex); - while (list_empty(&file->events)) { if (file->filp->f_flags & O_NONBLOCK) { @@ -463,21 +417,18 @@ static ssize_t ib_ucm_event(struct ib_uc uevent = list_entry(file->events.next, struct ib_ucm_event, file_list); - if (!uevent->cm_id) - goto user; + if (ib_ucm_new_cm_id(uevent->resp.event)) { + ctx = ib_ucm_ctx_alloc(file); + if (!ctx) { + result = -ENOMEM; + goto done; + } - ctx = ib_ucm_ctx_alloc(file); - if (!ctx) { - result = -ENOMEM; - goto done; + ctx->cm_id = uevent->cm_id; + ctx->cm_id->context = ctx; + uevent->resp.id = ctx->id; } - ctx->cm_id = uevent->cm_id; - ctx->cm_id->context = ctx; - - uevent->resp.id = ctx->id; - -user: if (copy_to_user((void __user *)(unsigned long)cmd.response, &uevent->resp, sizeof(uevent->resp))) { result = -EFAULT; @@ -485,12 +436,10 @@ user: } if (uevent->data) { - if (cmd.data_len < uevent->data_len) { result = -ENOMEM; goto done; } - if (copy_to_user((void __user *)(unsigned long)cmd.data, uevent->data, uevent->data_len)) { result = -EFAULT; @@ -499,12 +448,10 @@ user: } if (uevent->info) { - if (cmd.info_len < uevent->info_len) { result = -ENOMEM; goto done; } - if (copy_to_user((void __user *)(unsigned long)cmd.info, uevent->info, uevent->info_len)) { result = -EFAULT; @@ -514,6 +461,7 @@ user: list_del(&uevent->file_list); list_del(&uevent->ctx_list); + 
uevent->ctx->events_reported++; kfree(uevent->data); kfree(uevent->info); @@ -545,6 +493,7 @@ static ssize_t ib_ucm_create_id(struct i if (!ctx) return -ENOMEM; + ctx->uid = cmd.uid; ctx->cm_id = ib_create_cm_id(ib_ucm_event_handler, ctx); if (IS_ERR(ctx->cm_id)) { result = PTR_ERR(ctx->cm_id); @@ -561,7 +510,14 @@ static ssize_t ib_ucm_create_id(struct i return 0; err: - ib_ucm_destroy_ctx(file, ctx->id); + down(&ctx_id_mutex); + idr_remove(&ctx_id_table, ctx->id); + up(&ctx_id_mutex); + + if (!IS_ERR(ctx->cm_id)) + ib_destroy_cm_id(ctx->cm_id); + + kfree(ctx); return result; } @@ -570,11 +526,44 @@ static ssize_t ib_ucm_destroy_id(struct int in_len, int out_len) { struct ib_ucm_destroy_id cmd; + struct ib_ucm_destroy_id_resp resp; + struct ib_ucm_context *ctx; + int result = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; - return ib_ucm_destroy_ctx(file, cmd.id); + down(&ctx_id_mutex); + ctx = idr_find(&ctx_id_table, cmd.id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + idr_remove(&ctx_id_table, ctx->id); + up(&ctx_id_mutex); + + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + atomic_dec(&ctx->ref); + wait_event(ctx->wait, !atomic_read(&ctx->ref)); + + /* No new events will be generated after destroying the cm_id. */ + ib_destroy_cm_id(ctx->cm_id); + /* Cleanup events not yet reported to the user. 
*/ + ib_ucm_cleanup_events(ctx); + + resp.events_reported = ctx->events_reported; + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + result = -EFAULT; + + kfree(ctx); + return result; } static ssize_t ib_ucm_attr_id(struct ib_ucm_file *file, @@ -609,6 +598,98 @@ static ssize_t ib_ucm_attr_id(struct ib_ return result; } +static void ib_ucm_copy_ah_attr(struct ib_ucm_ah_attr *dest_attr, + struct ib_ah_attr *src_attr) +{ + memcpy(dest_attr->grh_dgid, src_attr->grh.dgid.raw, + sizeof src_attr->grh.dgid); + dest_attr->grh_flow_label = src_attr->grh.flow_label; + dest_attr->grh_sgid_index = src_attr->grh.sgid_index; + dest_attr->grh_hop_limit = src_attr->grh.hop_limit; + dest_attr->grh_traffic_class = src_attr->grh.traffic_class; + + dest_attr->dlid = src_attr->dlid; + dest_attr->sl = src_attr->sl; + dest_attr->src_path_bits = src_attr->src_path_bits; + dest_attr->static_rate = src_attr->static_rate; + dest_attr->is_global = (src_attr->ah_flags & IB_AH_GRH); + dest_attr->port_num = src_attr->port_num; +} + +static void ib_ucm_copy_qp_attr(struct ib_ucm_init_qp_attr_resp *dest_attr, + struct ib_qp_attr *src_attr) +{ + dest_attr->cur_qp_state = src_attr->cur_qp_state; + dest_attr->path_mtu = src_attr->path_mtu; + dest_attr->path_mig_state = src_attr->path_mig_state; + dest_attr->qkey = src_attr->qkey; + dest_attr->rq_psn = src_attr->rq_psn; + dest_attr->sq_psn = src_attr->sq_psn; + dest_attr->dest_qp_num = src_attr->dest_qp_num; + dest_attr->qp_access_flags = src_attr->qp_access_flags; + + dest_attr->max_send_wr = src_attr->cap.max_send_wr; + dest_attr->max_recv_wr = src_attr->cap.max_recv_wr; + dest_attr->max_send_sge = src_attr->cap.max_send_sge; + dest_attr->max_recv_sge = src_attr->cap.max_recv_sge; + dest_attr->max_inline_data = src_attr->cap.max_inline_data; + + ib_ucm_copy_ah_attr(&dest_attr->ah_attr, &src_attr->ah_attr); + ib_ucm_copy_ah_attr(&dest_attr->alt_ah_attr, &src_attr->alt_ah_attr); + + dest_attr->pkey_index = 
src_attr->pkey_index; + dest_attr->alt_pkey_index = src_attr->alt_pkey_index; + dest_attr->en_sqd_async_notify = src_attr->en_sqd_async_notify; + dest_attr->sq_draining = src_attr->sq_draining; + dest_attr->max_rd_atomic = src_attr->max_rd_atomic; + dest_attr->max_dest_rd_atomic = src_attr->max_dest_rd_atomic; + dest_attr->min_rnr_timer = src_attr->min_rnr_timer; + dest_attr->port_num = src_attr->port_num; + dest_attr->timeout = src_attr->timeout; + dest_attr->retry_cnt = src_attr->retry_cnt; + dest_attr->rnr_retry = src_attr->rnr_retry; + dest_attr->alt_port_num = src_attr->alt_port_num; + dest_attr->alt_timeout = src_attr->alt_timeout; +} + +static ssize_t ib_ucm_init_qp_attr(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_init_qp_attr_resp resp; + struct ib_ucm_init_qp_attr cmd; + struct ib_ucm_context *ctx; + struct ib_qp_attr qp_attr; + int result = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + resp.qp_attr_mask = 0; + memset(&qp_attr, 0, sizeof qp_attr); + qp_attr.qp_state = cmd.qp_state; + result = ib_cm_init_qp_attr(ctx->cm_id, &qp_attr, &resp.qp_attr_mask); + if (result) + goto out; + + ib_ucm_copy_qp_attr(&resp, &qp_attr); + + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + result = -EFAULT; + +out: + ib_ucm_ctx_put(ctx); + return result; +} + static ssize_t ib_ucm_listen(struct ib_ucm_file *file, const char __user *inbuf, int in_len, int out_len) @@ -808,6 +889,7 @@ static ssize_t ib_ucm_send_rep(struct ib ctx = ib_ucm_ctx_get(file, cmd.id); if (!IS_ERR(ctx)) { + ctx->uid = cmd.uid; result = ib_send_cm_rep(ctx->cm_id, &param); ib_ucm_ctx_put(ctx); } else @@ -1086,6 +1168,7 @@ static ssize_t (*ucm_cmd_table[])(struct [IB_USER_CM_CMD_SEND_SIDR_REQ] = ib_ucm_send_sidr_req, [IB_USER_CM_CMD_SEND_SIDR_REP] = 
ib_ucm_send_sidr_rep, [IB_USER_CM_CMD_EVENT] = ib_ucm_event, + [IB_USER_CM_CMD_INIT_QP_ATTR] = ib_ucm_init_qp_attr, }; static ssize_t ib_ucm_write(struct file *filp, const char __user *buf, @@ -1161,12 +1244,18 @@ static int ib_ucm_close(struct inode *in down(&file->mutex); while (!list_empty(&file->ctxs)) { - ctx = list_entry(file->ctxs.next, struct ib_ucm_context, file_list); - up(&file->mutex); - ib_ucm_destroy_ctx(file, ctx->id); + + down(&ctx_id_mutex); + idr_remove(&ctx_id_table, ctx->id); + up(&ctx_id_mutex); + + ib_destroy_cm_id(ctx->cm_id); + ib_ucm_cleanup_events(ctx); + kfree(ctx); + down(&file->mutex); } up(&file->mutex); Index: linux-kernel/infiniband/core/ucm.h =================================================================== --- linux-kernel/infiniband/core/ucm.h (revision 3109) +++ linux-kernel/infiniband/core/ucm.h (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -43,8 +44,6 @@ #include #include -#define IB_UCM_CM_ID_INVALID 0xffffffff - struct ib_ucm_file { struct semaphore mutex; struct file *filp; @@ -58,9 +57,11 @@ struct ib_ucm_context { int id; wait_queue_head_t wait; atomic_t ref; + int events_reported; struct ib_ucm_file *file; struct ib_cm_id *cm_id; + __u64 uid; struct list_head events; /* list of pending events. */ struct list_head file_list; /* member in file ctx list */ @@ -71,16 +72,12 @@ struct ib_ucm_event { struct list_head file_list; /* member in file event list */ struct list_head ctx_list; /* member in ctx event list */ + struct ib_cm_id *cm_id; struct ib_ucm_event_resp resp; void *data; void *info; int data_len; int info_len; - /* - * new connection identifiers needs to be saved until - * userspace can get a handle on them. 
-	 */
-	struct ib_cm_id *cm_id;
 };
 
 #endif /* UCM_H */

From sean.hefty at intel.com Fri Aug 19 17:20:53 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 19 Aug 2005 17:20:53 -0700
Subject: [openib-general] [PATCH] [uDAPL] update to new uCM API
Message-ID:

This patch updates uDAPL to the new uCM API. It only fixes the build
issues at this point and does not try to optimize for the use of the
new API. That will come in a later patch.

James, I can commit this when committing the uCM changes if that's okay.

- Sean

Index: dapl/dapl/openib/dapl_ib_cm.c
===================================================================
--- dapl/dapl/openib/dapl_ib_cm.c (revision 3137)
+++ dapl/dapl/openib/dapl_ib_cm.c (working copy)
@@ -643,7 +643,8 @@ void cm_thread(void *arg)
     dapl_os_lock( &g_cm_lock );
     while (!g_cm_destroy) {
-        int cm_id,ret;
+        struct ib_cm_id *cm_id;
+        int ret;
         /* select for CM event, all events process via cm_fd */
         ufds.fd = ib_cm_get_fd();
@@ -819,7 +820,7 @@ dapls_ib_connect (
     conn->ep = ep_ptr;
     conn->hca = ep_ptr->header.owner_ia->hca_ptr;
-    status = ib_cm_create_id(&conn->cm_id);
+    status = ib_cm_create_id(&conn->cm_id, conn);
     if (status < 0) {
         dat_status = dapl_convert_errno(errno,"create_cm_id");
         dapl_os_free(conn, sizeof(*conn));
@@ -1001,7 +1002,7 @@ dapls_ib_setup_conn_listener (
         return DAT_INTERNAL_ERROR;
     }
-    status = ib_cm_create_id(&conn->cm_id);
+    status = ib_cm_create_id(&conn->cm_id, conn);
     if (status < 0) {
         dat_status = dapl_convert_errno(errno,"create_cm_id");
         dapl_os_free(conn, sizeof(*conn));
Index: dapl/dapl/openib/dapl_ib_util.h
===================================================================
--- dapl/dapl/openib/dapl_ib_util.h (revision 3137)
+++ dapl/dapl/openib/dapl_ib_util.h (working copy)
@@ -120,7 +120,7 @@ struct dapl_cm_id {
     int retries;
     int destroy;
     int in_callback;
-    uint32_t cm_id;
+    struct ib_cm_id *cm_id;
     DAT_SOCK_ADDR6 r_addr;
     DAT_CONN_QUAL service_id;
     struct dapl_hca *hca;

From sean.hefty at intel.com Fri Aug 19 17:27:27 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 19 Aug 2005 17:27:27 -0700
Subject: [openib-general] [PATCH] [uCM] user specified context in CM events + new test program
Message-ID:

Resending - sorry about any duplicated messages.

The following patch:

* Adds user specified context to all uCM events. Users will not retrieve
  any events associated with the context after destroying the
  corresponding cm_id.
* Provides the ib_cm_init_qp_attr() call to userspace clients of the CM.
  This call may be used to set QP attributes properly before modifying
  the QP.
* Fixes some error handling synchronization and cleanup issues.
* Performs some minor code cleanup.
* Replaces the ucm_simple test program with a userspace version of cmpost.
  The userspace version of cmpost uses the uAT interface to retrieve path
  records based on a remote host name, establishes a connection over a QP,
  and performs some simple message passing between the nodes.

This patch bumps the ABI, and will require synchronization with uDAPL
before committing.

Signed-off-by: Sean Hefty

Index: userspace/libibcm/include/infiniband/cm_abi.h
===================================================================
--- userspace/libibcm/include/infiniband/cm_abi.h (revision 3124)
+++ userspace/libibcm/include/infiniband/cm_abi.h (working copy)
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2005 Topspin Communications. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses. You may choose to be licensed under the terms of the GNU
@@ -41,7 +42,7 @@
  * drivers/infiniband/include/ib_user_cm.h
  */
-#define IB_USER_CM_ABI_VERSION 1
+#define IB_USER_CM_ABI_VERSION 2
 enum {
 	IB_USER_CM_CMD_CREATE_ID,
@@ -64,6 +65,7 @@ enum {
 	IB_USER_CM_CMD_SEND_SIDR_REP,
 	IB_USER_CM_CMD_EVENT,
+	IB_USER_CM_CMD_INIT_QP_ATTR,
 };
 /*
  * command ABI structures.
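To illustrate the user-context change described in the message above: the patched cm.h replaces the bare `uint32_t` identifier with a `struct ib_cm_id` carrying a `context` pointer and a `handle`, and cm.c embeds that structure in a private one that it reaches through `container_of`. The following is a minimal sketch of that pattern under stated assumptions, not the libibcm implementation; `demo_id_private`, `demo_container_of`, `demo_create_id`, `demo_event_completed` and `demo_destroy_id` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Public identifier, same two fields as struct ib_cm_id in the patched cm.h. */
struct ib_cm_id {
	void     *context;	/* caller-supplied pointer */
	uint32_t  handle;	/* kernel-side identifier */
};

/* Hypothetical private wrapper; only loosely modelled on cm_id_private. */
struct demo_id_private {
	int             events_completed;
	struct ib_cm_id id;
};

/* Portable variant: char * arithmetic instead of void * arithmetic. */
#define demo_container_of(ptr, type, field) \
	((type *) ((char *) (ptr) - offsetof(type, field)))

/* Allocate the private structure and hand out only the public part. */
static struct ib_cm_id *demo_create_id(void *context, uint32_t handle)
{
	struct demo_id_private *priv = calloc(1, sizeof *priv);

	if (!priv)
		return NULL;
	priv->id.context = context;
	priv->id.handle = handle;
	return &priv->id;
}

/* Count one processed event; returns the running total for this id. */
static int demo_event_completed(struct ib_cm_id *id)
{
	struct demo_id_private *priv;

	assert(id != NULL);
	priv = demo_container_of(id, struct demo_id_private, id);
	return ++priv->events_completed;
}

/* Free the whole private structure given only the public pointer. */
static void demo_destroy_id(struct ib_cm_id *id)
{
	if (id)
		free(demo_container_of(id, struct demo_id_private, id));
}
```

The `container_of` macro in the patch subtracts from a `void *`, which relies on a GCC extension; the sketch casts to `char *` so that it also builds with compilers that reject arithmetic on `void *`.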
@@ -75,6 +77,7 @@ struct cm_abi_cmd_hdr { }; struct cm_abi_create_id { + __u64 uid; __u64 response; }; @@ -83,9 +86,14 @@ struct cm_abi_create_id_resp { }; struct cm_abi_destroy_id { + __u64 response; __u32 id; }; +struct cm_abi_destroy_id_resp { + __u32 events_reported; +}; + struct cm_abi_attr_id { __u64 response; __u32 id; @@ -98,6 +106,64 @@ struct cm_abi_attr_id_resp { __u32 remote_id; }; +struct cm_abi_init_qp_attr { + __u64 response; + __u32 id; + __u32 qp_state; +}; + +struct cm_abi_ah_attr { + __u8 grh_dgid[16]; + __u32 grh_flow_label; + __u16 dlid; + __u16 reserved; + __u8 grh_sgid_index; + __u8 grh_hop_limit; + __u8 grh_traffic_class; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; +}; + +struct cm_abi_init_qp_attr_resp { + __u32 qp_attr_mask; + __u32 qp_state; + __u32 cur_qp_state; + __u32 path_mtu; + __u32 path_mig_state; + __u32 qkey; + __u32 rq_psn; + __u32 sq_psn; + __u32 dest_qp_num; + __u32 qp_access_flags; + + struct cm_abi_ah_attr ah_attr; + struct cm_abi_ah_attr alt_ah_attr; + + /* ibv_qp_cap */ + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; + + __u16 pkey_index; + __u16 alt_pkey_index; + __u8 en_sqd_async_notify; + __u8 sq_draining; + __u8 max_rd_atomic; + __u8 max_dest_rd_atomic; + __u8 min_rnr_timer; + __u8 port_num; + __u8 timeout; + __u8 retry_cnt; + __u8 rnr_retry; + __u8 alt_port_num; + __u8 alt_timeout; +}; + struct cm_abi_listen { __u64 service_id; __u64 service_mask; @@ -161,6 +227,7 @@ struct cm_abi_req { }; struct cm_abi_rep { + __u64 uid; __u64 data; __u32 id; __u32 qpn; @@ -236,7 +303,6 @@ struct cm_abi_event_get { }; struct cm_abi_req_event_resp { - __u32 listen_id; /* device */ /* port */ struct cm_abi_path_rec primary_path; @@ -291,7 +357,6 @@ struct cm_abi_apr_event_resp { }; struct cm_abi_sidr_req_event_resp { - __u32 listen_id; /* device */ /* port */ __u16 pkey; @@ -311,6 +376,7 @@ struct cm_abi_sidr_rep_event_resp { 
#define CM_ABI_PRES_ALTERNATE 0x08 struct cm_abi_event_resp { + __u64 uid; __u32 id; __u32 event; __u32 present; Index: userspace/libibcm/include/infiniband/cm.h =================================================================== --- userspace/libibcm/include/infiniband/cm.h (revision 3124) +++ userspace/libibcm/include/infiniband/cm.h (working copy) @@ -77,8 +77,13 @@ enum ib_cm_data_size { IB_CM_SIDR_REP_INFO_LENGTH = 72 }; +struct ib_cm_id { + void *context; + uint32_t handle; +}; + struct ib_cm_req_event_param { - uint32_t listen_id; + struct ib_cm_id *listen_id; struct ib_sa_path_rec *primary_path; struct ib_sa_path_rec *alternate_path; @@ -187,7 +192,7 @@ struct ib_cm_apr_event_param { }; struct ib_cm_sidr_req_event_param { - uint32_t listen_id; + struct ib_cm_id *listen_id; struct ib_device *device; uint8_t port; uint16_t pkey; @@ -212,7 +217,7 @@ struct ib_cm_sidr_rep_event_param { }; struct ib_cm_event { - uint32_t cm_id; + struct ib_cm_id *cm_id; enum ib_cm_event_type event; union { struct ib_cm_req_event_param req_rcvd; @@ -287,13 +292,13 @@ int ib_cm_get_fd(void); * Communication identifiers are used to track connection states, service * ID resolution requests, and listen requests. */ -int ib_cm_create_id(uint32_t *cm_id); +int ib_cm_create_id(struct ib_cm_id **cm_id, void *context); /** * ib_cm_destroy_id - Destroy a connection identifier. * @cm_id: Connection identifier to destroy. */ -int ib_cm_destroy_id(uint32_t cm_id); +int ib_cm_destroy_id(struct ib_cm_id *cm_id); struct ib_cm_attr_param { uint64_t service_id; @@ -309,7 +314,7 @@ struct ib_cm_attr_param { * * Not all parameters are valid during all connection states. */ -int ib_cm_attr_id(uint32_t cm_id, +int ib_cm_attr_id(struct ib_cm_id *cm_id, struct ib_cm_attr_param *param); /** @@ -323,7 +328,7 @@ int ib_cm_attr_id(uint32_t cm_id, * range of service IDs. If set to 0, the service ID is matched * exactly. 
*/ -int ib_cm_listen(uint32_t cm_id, +int ib_cm_listen(struct ib_cm_id *cm_id, uint64_t service_id, uint64_t service_mask); @@ -355,7 +360,7 @@ struct ib_cm_req_param { * @param: Connection request information needed to establish the * connection. */ -int ib_cm_send_req(uint32_t cm_id, +int ib_cm_send_req(struct ib_cm_id *cm_id, struct ib_cm_req_param *param); struct ib_cm_rep_param { @@ -380,7 +385,7 @@ struct ib_cm_rep_param { * @param: Connection reply information needed to establish the * connection. */ -int ib_cm_send_rep(uint32_t cm_id, +int ib_cm_send_rep(struct ib_cm_id *cm_id, struct ib_cm_rep_param *param); /** @@ -391,7 +396,7 @@ int ib_cm_send_rep(uint32_t cm_id, * ready to use message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_rtu(uint32_t cm_id, +int ib_cm_send_rtu(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len); @@ -404,7 +409,7 @@ int ib_cm_send_rtu(uint32_t cm_id, * disconnection request message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_dreq(uint32_t cm_id, +int ib_cm_send_dreq(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len); @@ -416,7 +421,7 @@ int ib_cm_send_dreq(uint32_t cm_id, * disconnection reply message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_drep(uint32_t cm_id, +int ib_cm_send_drep(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len); @@ -427,7 +432,7 @@ int ib_cm_send_drep(uint32_t cm_id, * This routine should be invoked by users who receive messages on a * connected QP before an RTU has been received. */ -int ib_cm_establish(uint32_t cm_id); +int ib_cm_establish(struct ib_cm_id *cm_id); /** * ib_cm_send_rej - Sends a connection rejection message to the @@ -441,7 +446,7 @@ int ib_cm_establish(uint32_t cm_id); * rejection message. * @private_data_len: Size of the private data buffer, in bytes. 
*/ -int ib_cm_send_rej(uint32_t cm_id, +int ib_cm_send_rej(struct ib_cm_id *cm_id, enum ib_cm_rej_reason reason, void *ari, uint8_t ari_length, @@ -458,7 +463,7 @@ int ib_cm_send_rej(uint32_t cm_id, * message receipt acknowledgement. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_mra(uint32_t cm_id, +int ib_cm_send_mra(struct ib_cm_id *cm_id, uint8_t service_timeout, void *private_data, uint8_t private_data_len); @@ -473,12 +478,32 @@ int ib_cm_send_mra(uint32_t cm_id, * load alternate path message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_lap(uint32_t cm_id, +int ib_cm_send_lap(struct ib_cm_id *cm_id, struct ib_sa_path_rec *alternate_path, void *private_data, uint8_t private_data_len); /** + * ib_cm_init_qp_attr - Initializes the QP attributes for use in transitioning + * to a specified QP state. + * @cm_id: Communication identifier associated with the QP attributes to + * initialize. + * @qp_attr: On input, specifies the desired QP state. On output, the + * mandatory and desired optional attributes will be set in order to + * modify the QP to the specified state. + * @qp_attr_mask: The QP attribute mask that may be used to transition the + * QP to the specified state. + * + * Users must set the @qp_attr->qp_state to the desired QP state. This call + * will set all required attributes for the given transition, along with + * known optional attributes. Users may override the attributes returned from + * this call before calling ib_modify_qp. + */ +int ib_cm_init_qp_attr(struct ib_cm_id *cm_id, + struct ibv_qp_attr *qp_attr, + int *qp_attr_mask); + +/** * ib_cm_send_apr - Sends an alternate path response message in response to * a load alternate path request. * @cm_id: Connection identifier associated with the alternate path response. @@ -490,7 +515,7 @@ int ib_cm_send_lap(uint32_t cm_id, * alternate path response message. * @private_data_len: Size of the private data buffer, in bytes. 
  */
-int ib_cm_send_apr(uint32_t cm_id,
+int ib_cm_send_apr(struct ib_cm_id *cm_id,
                    enum ib_cm_apr_status status,
                    void *info,
                    uint8_t info_length,
@@ -514,7 +539,7 @@ struct ib_cm_sidr_req_param {
  *   service ID resolution request.
  * @param: Service ID resolution request information.
  */
-int ib_cm_send_sidr_req(uint32_t cm_id,
+int ib_cm_send_sidr_req(struct ib_cm_id *cm_id,
                         struct ib_cm_sidr_req_param *param);
 struct ib_cm_sidr_rep_param {
@@ -534,7 +559,7 @@ struct ib_cm_sidr_rep_param {
  *   resolution request.
  * @param: Service ID resolution reply information.
  */
-int ib_cm_send_sidr_rep(uint32_t cm_id,
+int ib_cm_send_sidr_rep(struct ib_cm_id *cm_id,
                         struct ib_cm_sidr_rep_param *param);
 #endif /* CM_H */
Index: userspace/libibcm/AUTHORS
===================================================================
--- userspace/libibcm/AUTHORS (revision 3124)
+++ userspace/libibcm/AUTHORS (working copy)
@@ -1 +1,2 @@
+Sean Hefty
 Libor Michalek
Index: userspace/libibcm/src/cm.c
===================================================================
--- userspace/libibcm/src/cm.c (revision 3124)
+++ userspace/libibcm/src/cm.c (working copy)
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2005 Topspin Communications. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.
You may choose to be licensed under the terms of the GNU
@@ -45,6 +46,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -69,7 +71,7 @@ do {
         resp = alloca(sizeof(*resp)); \
         if (!resp) \
                 return -ENOMEM; \
-        cmd->response = (unsigned long)resp;\
+        cmd->response = (uintptr_t)resp;\
 } while (0)
 #define CM_CREATE_MSG_CMD(msg, cmd, type, size) \
@@ -88,8 +90,18 @@ do {
         memset(cmd, 0, sizeof(*cmd)); \
 } while (0)
+struct cm_id_private {
+        struct ib_cm_id id;
+        int events_completed;
+        pthread_cond_t cond;
+        pthread_mutex_t mut;
+};
+
 static int fd;
+#define container_of(ptr, type, field) \
+        ((type *) ((void *)ptr - offsetof(type, field)))
+
 static void __attribute__((constructor)) ib_cm_init(void)
 {
         fd = open(IB_UCM_DEV_PATH, O_RDWR);
@@ -127,46 +139,89 @@ static void cm_param_path_get(struct cm_
         abi->preference = sa->preference;
 }
-int ib_cm_create_id(uint32_t *cm_id)
+static void ib_cm_free_id(struct cm_id_private *cm_id_priv)
+{
+        pthread_cond_destroy(&cm_id_priv->cond);
+        pthread_mutex_destroy(&cm_id_priv->mut);
+        free(cm_id_priv);
+}
+
+static struct cm_id_private *ib_cm_alloc_id(void *context)
+{
+        struct cm_id_private *cm_id_priv;
+
+        cm_id_priv = malloc(sizeof *cm_id_priv);
+        if (!cm_id_priv)
+                return NULL;
+
+        memset(cm_id_priv, 0, sizeof *cm_id_priv);
+        cm_id_priv->id.context = context;
+        pthread_mutex_init(&cm_id_priv->mut, NULL);
+        if (pthread_cond_init(&cm_id_priv->cond, NULL))
+                goto err;
+
+        return cm_id_priv;
+
+err:    ib_cm_free_id(cm_id_priv);
+        return NULL;
+}
+
+int ib_cm_create_id(struct ib_cm_id **cm_id, void *context)
 {
         struct cm_abi_create_id_resp *resp;
         struct cm_abi_create_id *cmd;
+        struct cm_id_private *cm_id_priv;
         void *msg;
         int result;
         int size;
-        if (!cm_id)
-                return -EINVAL;
+        cm_id_priv = ib_cm_alloc_id(context);
+        if (!cm_id_priv)
+                return -ENOMEM;
-        CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_CREATE_ID, size);
+        CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_CREATE_ID, size);
+        cmd->uid = (uintptr_t)
cm_id_priv;
         result = write(fd, msg, size);
         if (result != size)
-                return (result > 0) ? -ENODATA : result;
+                goto err;
-        *cm_id = resp->id;
+        cm_id_priv->id.handle = resp->id;
+        *cm_id = &cm_id_priv->id;
         return 0;
+
+err:    ib_cm_free_id(cm_id_priv);
+        return result;
 }
-int ib_cm_destroy_id(uint32_t cm_id)
+int ib_cm_destroy_id(struct ib_cm_id *cm_id)
 {
+        struct cm_abi_destroy_id_resp *resp;
         struct cm_abi_destroy_id *cmd;
+        struct cm_id_private *cm_id_priv;
         void *msg;
         int result;
         int size;
-        CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_DESTROY_ID, size);
-
-        cmd->id = cm_id;
+        CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_DESTROY_ID, size);
+        cmd->id = cm_id->handle;
         result = write(fd, msg, size);
         if (result != size)
                 return (result > 0) ? -ENODATA : result;
+        cm_id_priv = container_of(cm_id, struct cm_id_private, id);
+
+        pthread_mutex_lock(&cm_id_priv->mut);
+        while (cm_id_priv->events_completed < resp->events_reported)
+                pthread_cond_wait(&cm_id_priv->cond, &cm_id_priv->mut);
+        pthread_mutex_unlock(&cm_id_priv->mut);
+
+        ib_cm_free_id(cm_id_priv);
         return 0;
 }
-int ib_cm_attr_id(uint32_t cm_id, struct ib_cm_attr_param *param)
+int ib_cm_attr_id(struct ib_cm_id *cm_id, struct ib_cm_attr_param *param)
 {
         struct cm_abi_attr_id_resp *resp;
         struct cm_abi_attr_id *cmd;
@@ -177,9 +232,8 @@ int ib_cm_attr_id(uint32_t cm_id, struct
         if (!param)
                 return -EINVAL;
-        CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_ATTR_ID, size);
-
-        cmd->id = cm_id;
+        CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_ATTR_ID, size);
+        cmd->id = cm_id->handle;
         result = write(fd, msg, size);
         if (result != size)
@@ -189,11 +243,91 @@ int ib_cm_attr_id(uint32_t cm_id, struct
         param->service_mask = resp->service_mask;
         param->local_id = resp->local_id;
         param->remote_id = resp->remote_id;
+        return 0;
+}
+
+static void ib_cm_copy_ah_attr(struct ibv_ah_attr *dest_attr,
+                               struct cm_abi_ah_attr *src_attr)
+{
+        memcpy(dest_attr->grh.dgid.raw, src_attr->grh_dgid,
+               sizeof dest_attr->grh.dgid);
+
dest_attr->grh.flow_label = src_attr->grh_flow_label;
+        dest_attr->grh.sgid_index = src_attr->grh_sgid_index;
+        dest_attr->grh.hop_limit = src_attr->grh_hop_limit;
+        dest_attr->grh.traffic_class = src_attr->grh_traffic_class;
+
+        dest_attr->dlid = src_attr->dlid;
+        dest_attr->sl = src_attr->sl;
+        dest_attr->src_path_bits = src_attr->src_path_bits;
+        dest_attr->static_rate = src_attr->static_rate;
+        dest_attr->is_global = src_attr->is_global;
+        dest_attr->port_num = src_attr->port_num;
+}
+
+static void ib_cm_copy_qp_attr(struct ibv_qp_attr *dest_attr,
+                               struct cm_abi_init_qp_attr_resp *src_attr)
+{
+        dest_attr->cur_qp_state = src_attr->cur_qp_state;
+        dest_attr->path_mtu = src_attr->path_mtu;
+        dest_attr->path_mig_state = src_attr->path_mig_state;
+        dest_attr->qkey = src_attr->qkey;
+        dest_attr->rq_psn = src_attr->rq_psn;
+        dest_attr->sq_psn = src_attr->sq_psn;
+        dest_attr->dest_qp_num = src_attr->dest_qp_num;
+        dest_attr->qp_access_flags = src_attr->qp_access_flags;
+
+        dest_attr->cap.max_send_wr = src_attr->max_send_wr;
+        dest_attr->cap.max_recv_wr = src_attr->max_recv_wr;
+        dest_attr->cap.max_send_sge = src_attr->max_send_sge;
+        dest_attr->cap.max_recv_sge = src_attr->max_recv_sge;
+        dest_attr->cap.max_inline_data = src_attr->max_inline_data;
+
+        ib_cm_copy_ah_attr(&dest_attr->ah_attr, &src_attr->ah_attr);
+        ib_cm_copy_ah_attr(&dest_attr->alt_ah_attr, &src_attr->alt_ah_attr);
+
+        dest_attr->pkey_index = src_attr->pkey_index;
+        dest_attr->alt_pkey_index = src_attr->alt_pkey_index;
+        dest_attr->en_sqd_async_notify = src_attr->en_sqd_async_notify;
+        dest_attr->sq_draining = src_attr->sq_draining;
+        dest_attr->max_rd_atomic = src_attr->max_rd_atomic;
+        dest_attr->max_dest_rd_atomic = src_attr->max_dest_rd_atomic;
+        dest_attr->min_rnr_timer = src_attr->min_rnr_timer;
+        dest_attr->port_num = src_attr->port_num;
+        dest_attr->timeout = src_attr->timeout;
+        dest_attr->retry_cnt = src_attr->retry_cnt;
+        dest_attr->rnr_retry = src_attr->rnr_retry;
+        dest_attr->alt_port_num
= src_attr->alt_port_num;
+        dest_attr->alt_timeout = src_attr->alt_timeout;
+}
+
+int ib_cm_init_qp_attr(struct ib_cm_id *cm_id,
+                       struct ibv_qp_attr *qp_attr,
+                       int *qp_attr_mask)
+{
+        struct cm_abi_init_qp_attr_resp *resp;
+        struct cm_abi_init_qp_attr *cmd;
+        void *msg;
+        int result;
+        int size;
+
+        if (!qp_attr || !qp_attr_mask)
+                return -EINVAL;
+
+        CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_INIT_QP_ATTR, size);
+        cmd->id = cm_id->handle;
+        cmd->qp_state = qp_attr->qp_state;
+
+        result = write(fd, msg, size);
+        if (result != size)
+                return (result > 0) ? -ENODATA : result;
+
+        *qp_attr_mask = resp->qp_attr_mask;
+        ib_cm_copy_qp_attr(qp_attr, resp);
         return 0;
 }
-int ib_cm_listen(uint32_t cm_id,
+int ib_cm_listen(struct ib_cm_id *cm_id,
                  uint64_t service_id,
                  uint64_t service_mask)
 {
@@ -203,8 +337,7 @@ int ib_cm_listen(uint32_t cm_id,
         int size;
         CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_LISTEN, size);
-
-        cmd->id = cm_id;
+        cmd->id = cm_id->handle;
         cmd->service_id = service_id;
         cmd->service_mask = service_mask;
@@ -215,7 +348,7 @@ int ib_cm_listen(uint32_t cm_id,
         return 0;
 }
-int ib_cm_send_req(uint32_t cm_id, struct ib_cm_req_param *param)
+int ib_cm_send_req(struct ib_cm_id *cm_id, struct ib_cm_req_param *param)
 {
         struct cm_abi_path_rec *p_path;
         struct cm_abi_path_rec *a_path;
@@ -228,13 +361,11 @@ int ib_cm_send_req(uint32_t cm_id, struc
                 return -EINVAL;
         CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_REQ, size);
-
-        cmd->id = cm_id;
-        cmd->qpn = param->qp_num;
-        cmd->qp_type = param->qp_type;
-        cmd->psn = param->starting_psn;
-        cmd->sid = param->service_id;
-
+        cmd->id = cm_id->handle;
+        cmd->qpn = param->qp_num;
+        cmd->qp_type = param->qp_type;
+        cmd->psn = param->starting_psn;
+        cmd->sid = param->service_id;
         cmd->peer_to_peer = param->peer_to_peer;
         cmd->responder_resources = param->responder_resources;
         cmd->initiator_depth = param->initiator_depth;
@@ -247,28 +378,25 @@ int ib_cm_send_req(uint32_t cm_id, struc
         cmd->srq = param->srq;
         if
(param->primary_path) {
-
                 p_path = alloca(sizeof(*p_path));
                 if (!p_path)
                         return -ENOMEM;
                 cm_param_path_get(p_path, param->primary_path);
-                cmd->primary_path = (unsigned long)p_path;
+                cmd->primary_path = (uintptr_t) p_path;
         }
         if (param->alternate_path) {
-
                 a_path = alloca(sizeof(*a_path));
                 if (!a_path)
                         return -ENOMEM;
                 cm_param_path_get(a_path, param->alternate_path);
-                cmd->alternate_path = (unsigned long)a_path;
+                cmd->alternate_path = (uintptr_t) a_path;
         }
         if (param->private_data && param->private_data_len) {
-
-                cmd->data = (unsigned long)param->private_data;
+                cmd->data = (uintptr_t) param->private_data;
                 cmd->len = param->private_data_len;
         }
@@ -279,7 +407,7 @@ int ib_cm_send_req(uint32_t cm_id, struc
         return 0;
 }
-int ib_cm_send_rep(uint32_t cm_id, struct ib_cm_rep_param *param)
+int ib_cm_send_rep(struct ib_cm_id *cm_id, struct ib_cm_rep_param *param)
 {
         struct cm_abi_rep *cmd;
         void *msg;
@@ -290,11 +418,10 @@ int ib_cm_send_rep(uint32_t cm_id, struc
                 return -EINVAL;
         CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_REP, size);
-
-        cmd->id = cm_id;
-        cmd->qpn = param->qp_num;
-        cmd->psn = param->starting_psn;
-
+        cmd->uid = (uintptr_t) container_of(cm_id, struct cm_id_private, id);
+        cmd->id = cm_id->handle;
+        cmd->qpn = param->qp_num;
+        cmd->psn = param->starting_psn;
         cmd->responder_resources = param->responder_resources;
         cmd->initiator_depth = param->initiator_depth;
         cmd->target_ack_delay = param->target_ack_delay;
@@ -304,8 +431,7 @@ int ib_cm_send_rep(uint32_t cm_id, struc
         cmd->srq = param->srq;
         if (param->private_data && param->private_data_len) {
-
-                cmd->data = (unsigned long)param->private_data;
+                cmd->data = (uintptr_t) param->private_data;
                 cmd->len = param->private_data_len;
         }
@@ -316,7 +442,7 @@ int ib_cm_send_rep(uint32_t cm_id, struc
         return 0;
 }
-static inline int cm_send_private_data(uint32_t cm_id,
+static inline int cm_send_private_data(struct ib_cm_id *cm_id,
                                        uint32_t type,
                                        void *private_data,
                                        uint8_t private_data_len)
@@ -327,12 +453,10 @@ static
inline int cm_send_private_data(u
         int size;
         CM_CREATE_MSG_CMD(msg, cmd, type, size);
-
-        cmd->id = cm_id;
+        cmd->id = cm_id->handle;
         if (private_data && private_data_len) {
-
-                cmd->data = (unsigned long)private_data;
+                cmd->data = (uintptr_t) private_data;
                 cmd->len = private_data_len;
         }
@@ -343,7 +467,7 @@ static inline int cm_send_private_data(u
         return 0;
 }
-int ib_cm_send_rtu(uint32_t cm_id,
+int ib_cm_send_rtu(struct ib_cm_id *cm_id,
                    void *private_data,
                    uint8_t private_data_len)
 {
@@ -351,7 +475,7 @@ int ib_cm_send_rtu(uint32_t cm_id,
                                     private_data, private_data_len);
 }
-int ib_cm_send_dreq(uint32_t cm_id,
+int ib_cm_send_dreq(struct ib_cm_id *cm_id,
                     void *private_data,
                     uint8_t private_data_len)
 {
@@ -359,7 +483,7 @@ int ib_cm_send_dreq(uint32_t cm_id,
                                     private_data, private_data_len);
 }
-int ib_cm_send_drep(uint32_t cm_id,
+int ib_cm_send_drep(struct ib_cm_id *cm_id,
                     void *private_data,
                     uint8_t private_data_len)
 {
@@ -367,16 +491,15 @@ int ib_cm_send_drep(uint32_t cm_id,
                                     private_data, private_data_len);
 }
-int ib_cm_establish(uint32_t cm_id)
+int ib_cm_establish(struct ib_cm_id *cm_id)
 {
         struct cm_abi_establish *cmd;
         void *msg;
         int result;
         int size;
-        CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_ESTABLISH, size);
-
-        cmd->id = cm_id;
+        CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_ESTABLISH, size);
+        cmd->id = cm_id->handle;
         result = write(fd, msg, size);
         if (result != size)
@@ -385,7 +508,7 @@ int ib_cm_establish(uint32_t cm_id)
         return 0;
 }
-static inline int cm_send_status(uint32_t cm_id,
+static inline int cm_send_status(struct ib_cm_id *cm_id,
                                  uint32_t type,
                                  int status,
                                  void *info,
@@ -399,19 +522,16 @@ static inline int cm_send_status(uint32_
         int size;
         CM_CREATE_MSG_CMD(msg, cmd, type, size);
-
-        cmd->id = cm_id;
+        cmd->id = cm_id->handle;
         cmd->status = status;
         if (private_data && private_data_len) {
-
-                cmd->data = (unsigned long)private_data;
+                cmd->data = (uintptr_t) private_data;
                 cmd->data_len = private_data_len;
         }
         if (info && info_length) {
-
-                cmd->info =
(unsigned long)info;
+                cmd->info = (uintptr_t) info;
                 cmd->info_len = info_length;
         }
@@ -422,7 +542,7 @@ static inline int cm_send_status(uint32_
         return 0;
 }
-int ib_cm_send_rej(uint32_t cm_id,
+int ib_cm_send_rej(struct ib_cm_id *cm_id,
                    enum ib_cm_rej_reason reason,
                    void *ari,
                    uint8_t ari_length,
@@ -434,7 +554,7 @@ int ib_cm_send_rej(uint32_t cm_id,
                               private_data, private_data_len);
 }
-int ib_cm_send_apr(uint32_t cm_id,
+int ib_cm_send_apr(struct ib_cm_id *cm_id,
                    enum ib_cm_apr_status status,
                    void *info,
                    uint8_t info_length,
@@ -446,7 +566,7 @@ int ib_cm_send_apr(uint32_t cm_id,
                               private_data, private_data_len);
 }
-int ib_cm_send_mra(uint32_t cm_id,
+int ib_cm_send_mra(struct ib_cm_id *cm_id,
                    uint8_t service_timeout,
                    void *private_data,
                    uint8_t private_data_len)
@@ -457,13 +577,11 @@ int ib_cm_send_mra(uint32_t cm_id,
         int size;
         CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_MRA, size);
-
-        cmd->id = cm_id;
+        cmd->id = cm_id->handle;
         cmd->timeout = service_timeout;
         if (private_data && private_data_len) {
-
-                cmd->data = (unsigned long)private_data;
+                cmd->data = (uintptr_t) private_data;
                 cmd->len = private_data_len;
         }
@@ -474,7 +592,7 @@ int ib_cm_send_mra(uint32_t cm_id,
         return 0;
 }
-int ib_cm_send_lap(uint32_t cm_id,
+int ib_cm_send_lap(struct ib_cm_id *cm_id,
                    struct ib_sa_path_rec *alternate_path,
                    void *private_data,
                    uint8_t private_data_len)
@@ -486,22 +604,19 @@ int ib_cm_send_lap(uint32_t cm_id,
         int size;
         CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_LAP, size);
-
-        cmd->id = cm_id;
+        cmd->id = cm_id->handle;
         if (alternate_path) {
-
                 abi_path = alloca(sizeof(*abi_path));
                 if (!abi_path)
                         return -ENOMEM;
                 cm_param_path_get(abi_path, alternate_path);
-                cmd->path = (unsigned long)abi_path;
+                cmd->path = (uintptr_t) abi_path;
         }
         if (private_data && private_data_len) {
-
-                cmd->data = (unsigned long)private_data;
+                cmd->data = (uintptr_t) private_data;
                 cmd->len = private_data_len;
         }
@@ -512,7 +627,8 @@ int ib_cm_send_lap(uint32_t cm_id,
         return 0;
 }
-int
ib_cm_send_sidr_req(uint32_t cm_id, struct ib_cm_sidr_req_param *param)
+int ib_cm_send_sidr_req(struct ib_cm_id *cm_id,
+                        struct ib_cm_sidr_req_param *param)
 {
         struct cm_abi_path_rec *abi_path;
         struct cm_abi_sidr_req *cmd;
@@ -524,26 +640,23 @@ int ib_cm_send_sidr_req(uint32_t cm_id,
                 return -EINVAL;
         CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_SIDR_REQ, size);
-
-        cmd->id = cm_id;
+        cmd->id = cm_id->handle;
         cmd->sid = param->service_id;
         cmd->timeout = param->timeout_ms;
         cmd->pkey = param->pkey;
         cmd->max_cm_retries = param->max_cm_retries;
         if (param->path) {
-
                 abi_path = alloca(sizeof(*abi_path));
                 if (!abi_path)
                         return -ENOMEM;
                 cm_param_path_get(abi_path, param->path);
-                cmd->path = (unsigned long)abi_path;
+                cmd->path = (uintptr_t) abi_path;
         }
         if (param->private_data && param->private_data_len) {
-
-                cmd->data = (unsigned long)param->private_data;
+                cmd->data = (uintptr_t) param->private_data;
                 cmd->len = param->private_data_len;
         }
@@ -554,7 +667,8 @@ int ib_cm_send_sidr_req(uint32_t cm_id,
         return 0;
 }
-int ib_cm_send_sidr_rep(uint32_t cm_id, struct ib_cm_sidr_rep_param *param)
+int ib_cm_send_sidr_rep(struct ib_cm_id *cm_id,
+                        struct ib_cm_sidr_rep_param *param)
 {
         struct cm_abi_sidr_rep *cmd;
         void *msg;
@@ -565,21 +679,18 @@ int ib_cm_send_sidr_rep(uint32_t cm_id,
                 return -EINVAL;
         CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_SIDR_REP, size);
-
-        cmd->id = cm_id;
+        cmd->id = cm_id->handle;
         cmd->qpn = param->qp_num;
         cmd->qkey = param->qkey;
         cmd->status = param->status;
         if (param->private_data && param->private_data_len) {
-
-                cmd->data = (unsigned long)param->private_data;
+                cmd->data = (uintptr_t) param->private_data;
                 cmd->data_len = param->private_data_len;
         }
         if (param->info && param->info_length) {
-
-                cmd->info = (unsigned long)param->info;
+                cmd->info = (uintptr_t) param->info;
                 cmd->info_len = param->info_length;
         }
@@ -599,8 +710,8 @@ static void cm_event_path_get(struct ib_
         if (!kpath || !upath)
                 return;
-        memcpy(upath->dgid.raw, kpath->dgid, sizeof(union
ibv_gid));
-        memcpy(upath->sgid.raw, kpath->sgid, sizeof(union ibv_gid));
+        memcpy(upath->dgid.raw, kpath->dgid, sizeof upath->dgid);
+        memcpy(upath->sgid.raw, kpath->sgid, sizeof upath->sgid);
         upath->dlid = kpath->dlid;
         upath->slid = kpath->slid;
@@ -626,8 +737,6 @@ static void cm_event_path_get(struct ib_
 static void cm_event_req_get(struct ib_cm_req_event_param *ureq,
                              struct cm_abi_req_event_resp *kreq)
 {
-        ureq->listen_id = kreq->listen_id;
-
         ureq->remote_ca_guid = kreq->remote_ca_guid;
         ureq->remote_qkey = kreq->remote_qkey;
         ureq->remote_qpn = kreq->remote_qpn;
@@ -661,36 +770,6 @@ static void cm_event_rep_get(struct ib_c
         urep->rnr_retry_count = krep->rnr_retry_count;
         urep->srq = krep->srq;
 }
-static void cm_event_rej_get(struct ib_cm_rej_event_param *urej,
-                             struct cm_abi_rej_event_resp *krej)
-{
-        urej->reason = krej->reason;
-}
-
-static void cm_event_mra_get(struct ib_cm_mra_event_param *umra,
-                             struct cm_abi_mra_event_resp *kmra)
-{
-        umra->service_timeout = kmra->timeout;
-}
-
-static void cm_event_lap_get(struct ib_cm_lap_event_param *ulap,
-                             struct cm_abi_lap_event_resp *klap)
-{
-        cm_event_path_get(ulap->alternate_path, &klap->path);
-}
-
-static void cm_event_apr_get(struct ib_cm_apr_event_param *uapr,
-                             struct cm_abi_apr_event_resp *kapr)
-{
-        uapr->ap_status = kapr->status;
-}
-
-static void cm_event_sidr_req_get(struct ib_cm_sidr_req_event_param *ureq,
-                                  struct cm_abi_sidr_req_event_resp *kreq)
-{
-        ureq->listen_id = kreq->listen_id;
-        ureq->pkey = kreq->pkey;
-}
 static void cm_event_sidr_rep_get(struct ib_cm_sidr_rep_event_param *urep,
                                   struct cm_abi_sidr_rep_event_resp *krep)
@@ -702,6 +781,7 @@ static void cm_event_sidr_rep_get(struct
 int ib_cm_event_get(struct ib_cm_event **event)
 {
+        struct cm_id_private *cm_id_priv;
         struct cm_abi_cmd_hdr *hdr;
         struct cm_abi_event_get *cmd;
         struct cm_abi_event_resp *resp;
@@ -733,7 +813,7 @@ int ib_cm_event_get(struct ib_cm_event *
         if (!resp)
                 return -ENOMEM;
-        cmd->response = (unsigned long)resp;
+
cmd->response = (uintptr_t) resp;
         cmd->data_len = (uint8_t)(~0U);
         cmd->info_len = (uint8_t)(~0U);
@@ -749,8 +829,8 @@ int ib_cm_event_get(struct ib_cm_event *
                 goto done;
         }
-        cmd->data = (unsigned long)data;
-        cmd->info = (unsigned long)info;
+        cmd->data = (uintptr_t) data;
+        cmd->info = (uintptr_t) info;
         result = write(fd, msg, size);
         if (result != size) {
@@ -765,14 +845,11 @@ int ib_cm_event_get(struct ib_cm_event *
                 result = -ENOMEM;
                 goto done;
         }
-
         memset(evt, 0, sizeof(*evt));
-
-        evt->cm_id = resp->id;
+        evt->cm_id = (void *) (uintptr_t) resp->uid;
         evt->event = resp->event;
         if (resp->present & CM_ABI_PRES_PRIMARY) {
-
                 path_a = malloc(sizeof(*path_a));
                 if (!path_a) {
                         result = -ENOMEM;
@@ -781,81 +858,78 @@ int ib_cm_event_get(struct ib_cm_event *
         }
         if (resp->present & CM_ABI_PRES_ALTERNATE) {
-
                 path_b = malloc(sizeof(*path_b));
                 if (!path_b) {
                         result = -ENOMEM;
                         goto done;
                 }
         }
-
-        if (resp->present & CM_ABI_PRES_DATA) {
-
-                evt->private_data = data;
-                data = NULL;
-        }
         switch (evt->event) {
         case IB_CM_REQ_RECEIVED:
-
+                evt->param.req_rcvd.listen_id = evt->cm_id;
+                cm_id_priv = ib_cm_alloc_id(evt->cm_id->context);
+                if (!cm_id_priv) {
+                        result = -ENOMEM;
+                        goto done;
+                }
+                cm_id_priv->id.handle = resp->id;
+                evt->cm_id = &cm_id_priv->id;
                 evt->param.req_rcvd.primary_path = path_a;
                 evt->param.req_rcvd.alternate_path = path_b;
                 path_a = NULL;
                 path_b = NULL;
-
                 cm_event_req_get(&evt->param.req_rcvd, &resp->u.req_resp);
                 break;
         case IB_CM_REP_RECEIVED:
-
                 cm_event_rep_get(&evt->param.rep_rcvd, &resp->u.rep_resp);
                 break;
         case IB_CM_MRA_RECEIVED:
-
-                cm_event_mra_get(&evt->param.mra_rcvd, &resp->u.mra_resp);
+                evt->param.mra_rcvd.service_timeout = resp->u.mra_resp.timeout;
                 break;
         case IB_CM_REJ_RECEIVED:
-
-                cm_event_rej_get(&evt->param.rej_rcvd, &resp->u.rej_resp);
-
+                evt->param.rej_rcvd.reason = resp->u.rej_resp.reason;
                 evt->param.rej_rcvd.ari = info;
                 info = NULL;
-
                 break;
         case IB_CM_LAP_RECEIVED:
-
                 evt->param.lap_rcvd.alternate_path = path_b;
                 path_b = NULL;
-
-
cm_event_lap_get(&evt->param.lap_rcvd, &resp->u.lap_resp);
+                cm_event_path_get(evt->param.lap_rcvd.alternate_path,
+                                  &resp->u.lap_resp.path);
                 break;
         case IB_CM_APR_RECEIVED:
-
-                cm_event_apr_get(&evt->param.apr_rcvd, &resp->u.apr_resp);
-
+                evt->param.apr_rcvd.ap_status = resp->u.apr_resp.status;
                 evt->param.apr_rcvd.apr_info = info;
                 info = NULL;
-
                 break;
         case IB_CM_SIDR_REQ_RECEIVED:
-
-                cm_event_sidr_req_get(&evt->param.sidr_req_rcvd,
-                                      &resp->u.sidr_req_resp);
+                evt->param.sidr_req_rcvd.listen_id = evt->cm_id;
+                cm_id_priv = ib_cm_alloc_id(evt->cm_id->context);
+                if (!cm_id_priv) {
+                        result = -ENOMEM;
+                        goto done;
+                }
+                cm_id_priv->id.handle = resp->id;
+                evt->cm_id = &cm_id_priv->id;
+                evt->param.sidr_req_rcvd.pkey = resp->u.sidr_req_resp.pkey;
                 break;
         case IB_CM_SIDR_REP_RECEIVED:
-
                 cm_event_sidr_rep_get(&evt->param.sidr_rep_rcvd,
                                       &resp->u.sidr_rep_resp);
-
                 evt->param.sidr_rep_rcvd.info = info;
                 info = NULL;
-
                 break;
         default:
-
                 evt->param.send_status = resp->u.send_status;
                 break;
         }
+        if (resp->present & CM_ABI_PRES_DATA) {
+                evt->private_data = data;
+                data = NULL;
+        }
+
         *event = evt;
         evt = NULL;
         result = 0;
@@ -876,44 +950,51 @@ done:
 int ib_cm_event_put(struct ib_cm_event *event)
 {
+        struct cm_id_private *cm_id_priv;
+
         if (!event)
                 return -EINVAL;
         if (event->private_data)
                 free(event->private_data);
+        cm_id_priv = container_of(event->cm_id, struct cm_id_private, id);
+
         switch (event->event) {
         case IB_CM_REQ_RECEIVED:
-
-                if (event->param.req_rcvd.primary_path)
-                        free(event->param.req_rcvd.primary_path);
-
+                cm_id_priv = container_of(event->param.req_rcvd.listen_id,
+                                          struct cm_id_private, id);
+                free(event->param.req_rcvd.primary_path);
                 if (event->param.req_rcvd.alternate_path)
                         free(event->param.req_rcvd.alternate_path);
                 break;
         case IB_CM_REJ_RECEIVED:
-
                 if (event->param.rej_rcvd.ari)
                         free(event->param.rej_rcvd.ari);
                 break;
         case IB_CM_LAP_RECEIVED:
-
-                if (event->param.lap_rcvd.alternate_path)
-                        free(event->param.lap_rcvd.alternate_path);
+
free(event->param.lap_rcvd.alternate_path);
                 break;
         case IB_CM_APR_RECEIVED:
-
                 if (event->param.apr_rcvd.apr_info)
                         free(event->param.apr_rcvd.apr_info);
                 break;
+        case IB_CM_SIDR_REQ_RECEIVED:
+                cm_id_priv = container_of(event->param.sidr_req_rcvd.listen_id,
+                                          struct cm_id_private, id);
+                break;
         case IB_CM_SIDR_REP_RECEIVED:
-
                 if (event->param.sidr_rep_rcvd.info)
                         free(event->param.sidr_rep_rcvd.info);
         default:
                 break;
         }
+        pthread_mutex_lock(&cm_id_priv->mut);
+        cm_id_priv->events_completed++;
+        pthread_cond_signal(&cm_id_priv->cond);
+        pthread_mutex_unlock(&cm_id_priv->mut);
+
         free(event);
         return 0;
 }
Index: userspace/libibcm/Makefile.am
===================================================================
--- userspace/libibcm/Makefile.am (revision 3124)
+++ userspace/libibcm/Makefile.am (working copy)
@@ -18,9 +18,11 @@ endif
 src_libibcm_la_SOURCES = src/cm.c
 src_libibcm_la_LDFLAGS = -avoid-version -module $(ucm_version_script)
-bin_PROGRAMS = examples/ucm_simple
-examples_ucm_simple_SOURCES = examples/simple.c
-examples_ucm_simple_LDADD = $(top_builddir)/src/libibcm.la
+bin_PROGRAMS = examples/ucmpost
+examples_ucmpost_SOURCES = examples/cmpost.c
+examples_ucmpost_LDADD = $(top_builddir)/src/libibcm.la \
+                         $(libdir)/libibverbs.la \
+                         $(libdir)/libibat.la
 libibcmincludedir = $(includedir)/infiniband
Index: userspace/libibcm/examples/cmpost.c
===================================================================
--- userspace/libibcm/examples/cmpost.c (revision 0)
+++ userspace/libibcm/examples/cmpost.c (revision 0)
@@ -0,0 +1,718 @@
+/*
+ * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.
You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Id$
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+
+#if __BYTE_ORDER == __BIG_ENDIAN
+static inline uint64_t cpu_to_be64(uint64_t x) { return x; }
+static inline uint32_t cpu_to_be32(uint32_t x) { return x; }
+#else
+static inline uint64_t cpu_to_be64(uint64_t x) { return bswap_64(x); }
+static inline uint32_t cpu_to_be32(uint32_t x) { return bswap_32(x); }
+#endif
+
+/*
+ * To execute:
+ * Server: ucmpost
+ * Client: ucmpost server
+ */
+
+struct cmtest {
+        struct ibv_device *device;
+        struct ibv_context *verbs;
+        struct ibv_pd *pd;
+
+        /* cm info */
+        struct ib_sa_path_rec path_rec;
+
+        struct cmtest_node *nodes;
+        int conn_index;
+        int connects_left;
+        int disconnects_left;
+
+        /* memory region info */
+        struct ibv_mr *mr;
+        void *mem;
+};
+
+static struct cmtest test;
+static int message_count = 10;
+static int message_size = 100;
+static int connections = 1;
+static int is_server = 1;
+
+struct cmtest_node {
+        int id;
+        struct ibv_cq *cq;
+        struct ibv_qp *qp;
+        struct ib_cm_id *cm_id;
+        int connected;
+};
+
+static int post_recvs(struct cmtest_node *node)
+{
+        struct ibv_recv_wr recv_wr, *recv_failure;
+        struct ibv_sge sge;
+        int i, ret = 0;
+
+        if (!message_count)
+                return 0;
+
+        recv_wr.next = NULL;
+        recv_wr.sg_list = &sge;
+        recv_wr.num_sge = 1;
+        recv_wr.wr_id = (uintptr_t) node;
+
+        sge.length = message_size;
+        sge.lkey = test.mr->lkey;
+        sge.addr = (uintptr_t) test.mem;
+
+        for (i = 0; i < message_count && !ret; i++ ) {
+                ret = ibv_post_recv(node->qp, &recv_wr, &recv_failure);
+                if (ret) {
+                        printf("failed to post receives: %d\n", ret);
+                        break;
+                }
+        }
+        return ret;
+}
+
+static int modify_to_rtr(struct cmtest_node *node)
+{
+        struct ibv_qp_attr qp_attr;
+        int qp_attr_mask, ret;
+
+        qp_attr.qp_state = IBV_QPS_INIT;
+        ret = ib_cm_init_qp_attr(node->cm_id, &qp_attr, &qp_attr_mask);
+        if (ret) {
+                printf("failed to init QP attr for INIT: %d\n", ret);
+                return ret;
+        }
+        ret =
ibv_modify_qp(node->qp, &qp_attr, qp_attr_mask);
+        if (ret) {
+                printf("failed to modify QP to INIT: %d\n", ret);
+                return ret;
+        }
+        qp_attr.qp_state = IBV_QPS_RTR;
+        ret = ib_cm_init_qp_attr(node->cm_id, &qp_attr, &qp_attr_mask);
+        if (ret) {
+                printf("failed to init QP attr for RTR: %d\n", ret);
+                return ret;
+        }
+        qp_attr.rq_psn = node->qp->qp_num;
+        ret = ibv_modify_qp(node->qp, &qp_attr, qp_attr_mask);
+        if (ret) {
+                printf("failed to modify QP to RTR: %d\n", ret);
+                return ret;
+        }
+        return 0;
+}
+
+static int modify_to_rts(struct cmtest_node *node)
+{
+        struct ibv_qp_attr qp_attr;
+        int qp_attr_mask, ret;
+
+        qp_attr.qp_state = IBV_QPS_RTS;
+        ret = ib_cm_init_qp_attr(node->cm_id, &qp_attr, &qp_attr_mask);
+        if (ret) {
+                printf("failed to init QP attr for RTS: %d\n", ret);
+                return ret;
+        }
+        ret = ibv_modify_qp(node->qp, &qp_attr, qp_attr_mask);
+        if (ret) {
+                printf("failed to modify QP to RTS: %d\n", ret);
+                return ret;
+        }
+        return 0;
+}
+
+static void req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
+{
+        struct cmtest_node *node;
+        struct ib_cm_req_event_param *req;
+        struct ib_cm_rep_param rep;
+        int ret;
+
+        if (test.conn_index == connections)
+                goto error1;
+        node = &test.nodes[test.conn_index++];
+
+        node->cm_id = cm_id;
+        cm_id->context = node;
+
+        ret = modify_to_rtr(node);
+        if (ret)
+                goto error2;
+
+        ret = post_recvs(node);
+        if (ret)
+                goto error2;
+
+        req = &event->param.req_rcvd;
+        memset(&rep, 0, sizeof rep);
+        rep.qp_num = node->qp->qp_num;
+        rep.srq = (node->qp->srq != NULL);
+        rep.starting_psn = node->qp->qp_num;
+        rep.responder_resources = req->responder_resources;
+        rep.initiator_depth = req->initiator_depth;
+        rep.target_ack_delay = 20;
+        rep.flow_control = req->flow_control;
+        rep.rnr_retry_count = req->rnr_retry_count;
+
+        ret = ib_cm_send_rep(cm_id, &rep);
+        if (ret) {
+                printf("failed to send CM REP: %d\n", ret);
+                goto error2;
+        }
+        return;
+error2:
+        test.disconnects_left--;
+        test.connects_left--;
+error1:
+
printf("failing connection request\n");
+        ib_cm_send_rej(cm_id, IB_CM_REJ_UNSUPPORTED, NULL, 0, NULL, 0);
+}
+
+static void rep_handler(struct cmtest_node *node, struct ib_cm_event *event)
+{
+        int ret;
+
+        ret = modify_to_rtr(node);
+        if (ret)
+                goto error;
+
+        ret = modify_to_rts(node);
+        if (ret)
+                goto error;
+
+        ret = post_recvs(node);
+        if (ret)
+                goto error;
+
+        ret = ib_cm_send_rtu(node->cm_id, NULL, 0);
+        if (ret) {
+                printf("failed to send CM RTU: %d\n", ret);
+                goto error;
+        }
+        node->connected = 1;
+        test.connects_left--;
+        return;
+error:
+        printf("failing connection reply\n");
+        ib_cm_send_rej(node->cm_id, IB_CM_REJ_UNSUPPORTED, NULL, 0, NULL, 0);
+        test.disconnects_left--;
+        test.connects_left--;
+}
+
+static void rtu_handler(struct cmtest_node *node)
+{
+        int ret;
+
+        ret = modify_to_rts(node);
+        if (ret)
+                goto error;
+
+        node->connected = 1;
+        test.connects_left--;
+        return;
+error:
+        printf("aborting connection - disconnecting\n");
+        ib_cm_send_dreq(node->cm_id, NULL, 0);
+        test.disconnects_left--;
+        test.connects_left--;
+}
+
+static void cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
+{
+        struct cmtest_node *node = cm_id->context;
+
+        switch (event->event) {
+        case IB_CM_REQ_RECEIVED:
+                req_handler(cm_id, event);
+                break;
+        case IB_CM_REP_RECEIVED:
+                rep_handler(node, event);
+                break;
+        case IB_CM_RTU_RECEIVED:
+                rtu_handler(node);
+                break;
+        case IB_CM_DREQ_RECEIVED:
+                node->connected = 0;
+                ib_cm_send_drep(node->cm_id, NULL, 0);
+                test.disconnects_left--;
+                break;
+        case IB_CM_DREP_RECEIVED:
+                test.disconnects_left--;
+                break;
+        case IB_CM_REJ_RECEIVED:
+                printf("Received REJ\n");
+                /* fall through */
+        case IB_CM_REQ_ERROR:
+        case IB_CM_REP_ERROR:
+                printf("Error sending REQ or REP\n");
+                test.disconnects_left--;
+                test.connects_left--;
+                break;
+        case IB_CM_DREQ_ERROR:
+                test.disconnects_left--;
+                printf("Error sending DREQ\n");
+                break;
+        default:
+                break;
+        }
+}
+
+static int init_node(struct cmtest_node *node, struct
ibv_qp_init_attr *qp_attr) +{ + int cqe, ret; + + if (!is_server) { + ret = ib_cm_create_id(&node->cm_id, node); + if (ret) { + printf("failed to create cm_id: %d\n", ret); + return ret; + } + } + + cqe = message_count ? message_count * 2 : 2; + node->cq = ibv_create_cq(test.verbs, cqe, node); + if (!node->cq) { + printf("unable to create CQ\n"); + goto error1; + } + + qp_attr->send_cq = node->cq; + qp_attr->recv_cq = node->cq; + node->qp = ibv_create_qp(test.pd, qp_attr); + if (!node->qp) { + printf("unable to create QP\n"); + goto error2; + } + return 0; +error2: + ibv_destroy_cq(node->cq); +error1: + if (!is_server) + ib_cm_destroy_id(node->cm_id); + return -1; +} + +static void destroy_node(struct cmtest_node *node) +{ + ibv_destroy_qp(node->qp); + ibv_destroy_cq(node->cq); + if (node->cm_id) + ib_cm_destroy_id(node->cm_id); +} + +static int create_nodes(void) +{ + struct ibv_qp_init_attr qp_attr; + int ret, i; + + test.nodes = malloc(sizeof *test.nodes * connections); + if (!test.nodes) { + printf("unable to allocate memory for test nodes\n"); + return -1; + } + memset(test.nodes, 0, sizeof *test.nodes * connections); + + memset(&qp_attr, 0, sizeof qp_attr); + qp_attr.cap.max_send_wr = message_count ? message_count : 1; + qp_attr.cap.max_recv_wr = message_count ? 
message_count : 1; + qp_attr.cap.max_send_sge = 1; + qp_attr.cap.max_recv_sge = 1; + qp_attr.qp_type = IBV_QPT_RC; + + for (i = 0; i < connections; i++) { + test.nodes[i].id = i; + ret = init_node(&test.nodes[i], &qp_attr); + if (ret) + goto error; + } + return 0; +error: + while (--i >= 0) + destroy_node(&test.nodes[i]); + free(test.nodes); + return ret; +} + +static void destroy_nodes(void) +{ + int i; + + for (i = 0; i < connections; i++) + destroy_node(&test.nodes[i]); + free(test.nodes); +} + +static int create_messages(void) +{ + if (!message_size) + message_count = 0; + + if (!message_count) + return 0; + + test.mem = malloc(message_size); + if (!test.mem) { + printf("failed message allocation\n"); + return -1; + } + test.mr = ibv_reg_mr(test.pd, test.mem, message_size, + IBV_ACCESS_LOCAL_WRITE); + if (!test.mr) { + printf("failed to reg MR\n"); + goto err; + } + return 0; +err: + free(test.mem); + return -1; +} + +static void destroy_messages(void) +{ + if (!message_count) + return; + + ibv_dereg_mr(test.mr); + free(test.mem); +} + +static int init(void) +{ + struct dlist *dev_list; + int ret; + + test.connects_left = connections; + test.disconnects_left = connections; + + dev_list = ibv_get_devices(); + dlist_start(dev_list); + test.device = dlist_next(dev_list); + if (!test.device) + return -1; + + test.verbs = ibv_open_device(test.device); + if (!test.verbs) + return -1; + + test.pd = ibv_alloc_pd(test.verbs); + if (!test.pd) { + printf("failed to alloc PD\n"); + return -1; + } + ret = create_messages(); + if (ret) { + printf("unable to create test messages\n"); + goto error1; + } + ret = create_nodes(); + if (ret) { + printf("unable to create test nodes\n"); + goto error2; + } + return 0; +error2: + destroy_messages(); +error1: + ibv_dealloc_pd(test.pd); + return -1; +} + +static void cleanup(void) +{ + destroy_nodes(); + destroy_messages(); + ibv_dealloc_pd(test.pd); +} + +static int send_msgs(void) +{ + struct ibv_send_wr send_wr, *bad_send_wr; + 
struct ibv_sge sge; + int i, m, ret; + + send_wr.next = NULL; + send_wr.sg_list = &sge; + send_wr.num_sge = 1; + send_wr.opcode = IBV_WR_SEND; + send_wr.send_flags = IBV_SEND_SIGNALED; + send_wr.wr_id = 0; + + sge.addr = (uintptr_t) test.mem; + sge.length = message_size; + sge.lkey = test.mr->lkey; + + for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + for (m = 0; m < message_count; m++) { + ret = ibv_post_send(test.nodes[i].qp, &send_wr, + &bad_send_wr); + if (ret) + return ret; + } + } + return 0; +} + +static int poll_cqs(void) +{ + struct ibv_wc wc[8]; + int done, i, ret; + + for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + for (done = 0; done < message_count; done += ret) { + ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + if (ret < 0) { + printf("failed polling CQ: %d\n", ret); + return ret; + } + } + } + return 0; +} + +static void connect_events(void) +{ + struct ib_cm_event *event; + int err = 0; + + while (test.connects_left && !err) { + err = ib_cm_event_get(&event); + if (!err) { + cm_handler(event->cm_id, event); + ib_cm_event_put(event); + } + } +} + +static void disconnect_events(void) +{ + struct ib_cm_event *event; + int err = 0; + + while (test.disconnects_left && !err) { + err = ib_cm_event_get(&event); + if (!err) { + cm_handler(event->cm_id, event); + ib_cm_event_put(event); + } + } +} + +static void run_server(void) +{ + struct ib_cm_id *listen_id; + int i, ret; + + printf("starting server\n"); + if (ib_cm_create_id(&listen_id, &test)) { + printf("listen request failed\n"); + return; + } + ret = ib_cm_listen(listen_id, cpu_to_be64(0x1000), 0); + if (ret) { + printf("failure trying to listen: %d\n", ret); + goto out; + } + + connect_events(); + + if (message_count) { + printf("initiating data transfers\n"); + if (send_msgs()) + goto out; + printf("receiving data transfers\n"); + if (poll_cqs()) + goto out; + printf("data transfers complete\n"); + } + + printf("disconnecting\n"); 
+ for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + test.nodes[i].connected = 0; + ib_cm_send_dreq(test.nodes[i].cm_id, NULL, 0); + } + disconnect_events(); + printf("disconnected\n"); +out: + ib_cm_destroy_id(listen_id); +} + +static void at_callback(uint64_t req_id, void *context, int rec_num) +{ +} + +static int query_for_path(char *dest) +{ + struct ib_at_ib_route route; + struct ib_at_completion comp; + struct addrinfo *res; + int ret; + + ret = getaddrinfo(dest, NULL, NULL, &res); + if (ret) { + printf("getaddrinfo failed - invalid hostname or IP address\n"); + return ret; + } + + if (res->ai_family != PF_INET) { + ret = -1; + goto out; + } + + comp.fn = at_callback; + ret = ib_at_route_by_ip(((struct sockaddr_in *)res->ai_addr)->sin_addr.s_addr, + 0, 0, 0, &route, &comp, NULL); + if (ret < 0) { + printf("ib_at_route_by_ip failed: %d\n", ret); + goto out; + } + + if (!ret) { + ret = ib_at_callback_get(); + if (ret) { + printf("ib_at_callback_get failed: %d\n", ret); + goto out; + } + } + + ret = ib_at_paths_by_route(&route, 0, &test.path_rec, 1, &comp, NULL); + if (ret < 0) { + printf("ib_at_paths_by_route failed: %d\n", ret); + goto out; + } + + if (!ret) { + ret = ib_at_callback_get(); + if (ret) + printf("ib_at_callback_get failed: %d\n", ret); + } else + ret = 0; + +out: + freeaddrinfo(res); + return ret; +} + +static void run_client(char *dest) +{ + struct ib_cm_req_param req; + int i, ret; + + printf("starting client\n"); + ret = query_for_path(dest); + if (ret) { + printf("failed path record query: %d\n", ret); + return; + } + + memset(&req, 0, sizeof req); + req.primary_path = &test.path_rec; + req.service_id = cpu_to_be64(0x1000); + req.responder_resources = 1; + req.initiator_depth = 1; + req.remote_cm_response_timeout = 20; + req.local_cm_response_timeout = 20; + req.retry_count = 5; + req.max_cm_retries = 5; + + printf("connecting\n"); + for (i = 0; i < connections; i++) { + req.qp_num = test.nodes[i].qp->qp_num; 
+ req.qp_type = IBV_QPT_RC; + req.srq = (test.nodes[i].qp->srq != NULL); + req.starting_psn = test.nodes[i].qp->qp_num; + ret = ib_cm_send_req(test.nodes[i].cm_id, &req); + if (ret) { + printf("failure sending REQ: %d\n", ret); + return; + } + } + + connect_events(); + + if (message_count) { + printf("receiving data transfers\n"); + if (poll_cqs()) + goto out; + printf("initiating data transfers\n"); + if (send_msgs()) + goto out; + printf("data transfers complete\n"); + } +out: + disconnect_events(); +} + +int main(int argc, char **argv) +{ + if (argc != 1 && argc != 2) { + printf("usage: %s [server_addr]\n", argv[0]); + exit(1); + } + + is_server = (argc == 1); + if (init()) + exit(1); + + if (is_server) + run_server(); + else + run_client(argv[1]); + + printf("test complete\n"); + cleanup(); + return 0; +} Index: userspace/libibcm/examples/simple.c =================================================================== --- userspace/libibcm/examples/simple.c (revision 3124) +++ userspace/libibcm/examples/simple.c (working copy) @@ -58,7 +58,7 @@ static inline uint64_t cpu_to_be64(uint6 #define TEST_SID 0x0000000ff0000000ULL -static int cm_connect(uint32_t cm_id) +static int cm_connect(struct ib_cm_id *cm_id) { struct ib_cm_req_param param; struct ib_sa_path_rec sa; @@ -108,8 +108,8 @@ static int cm_connect(uint32_t cm_id) src->global.subnet_prefix = cpu_to_be64(0xfe80000000000000ULL); dst->global.subnet_prefix = cpu_to_be64(0xfe80000000000000ULL); - src->global.interface_id = cpu_to_be64(0x0002c90200002179ULL); - dst->global.interface_id = cpu_to_be64(0x0005ad000001296cULL); + src->global.interface_id = cpu_to_be64(0x0002c90107fc5e11ULL); + dst->global.interface_id = cpu_to_be64(0x0002c90107fc5eb1ULL); return ib_cm_send_req(cm_id, &param); } @@ -118,7 +118,7 @@ int main(int argc, char **argv) { struct ib_cm_event *event; struct ib_cm_rep_param rep; - int cm_id; + struct ib_cm_id *cm_id; int result; int param_c = 0; @@ -137,8 +137,8 @@ int main(int argc, char **argv)
exit(1); } - result = ib_cm_create_id(&cm_id); - if (result < 0) { + result = ib_cm_create_id(&cm_id, NULL); + if (result) { printf("Error creating CM ID <%d:%d>\n", result, errno); goto done; } @@ -146,16 +146,16 @@ int main(int argc, char **argv) if (mode) { result = cm_connect(cm_id); if (result) { - printf("Error <%d:%d> sending REQ <%d>\n", - result, errno, cm_id); + printf("Error <%d:%d> sending REQ\n", + result, errno); goto done; } } else { result = ib_cm_listen(cm_id, TEST_SID, 0); if (result) { - printf("Error <%d:%d> listening <%d>\n", - result, errno, cm_id); + printf("Error <%d:%d> listening\n", + result, errno); goto done; } } @@ -169,7 +169,7 @@ int main(int argc, char **argv) goto done; } - printf("CM ID <%d> Event <%d>\n", event->cm_id, event->event); + printf("CM ID <%p> Event <%d>\n", event->cm_id, event->event); switch (event->event) { case IB_CM_REQ_RECEIVED: @@ -264,4 +264,3 @@ int main(int argc, char **argv) done: return 0; } - Index: linux-kernel/infiniband/include/ib_user_cm.h =================================================================== --- linux-kernel/infiniband/include/ib_user_cm.h (revision 3109) +++ linux-kernel/infiniband/include/ib_user_cm.h (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -37,7 +38,7 @@ #include -#define IB_USER_CM_ABI_VERSION 1 +#define IB_USER_CM_ABI_VERSION 2 enum { IB_USER_CM_CMD_CREATE_ID, @@ -60,6 +61,7 @@ enum { IB_USER_CM_CMD_SEND_SIDR_REP, IB_USER_CM_CMD_EVENT, + IB_USER_CM_CMD_INIT_QP_ATTR, }; /* * command ABI structures. 
@@ -71,6 +73,7 @@ struct ib_ucm_cmd_hdr { }; struct ib_ucm_create_id { + __u64 uid; __u64 response; }; @@ -79,9 +82,14 @@ struct ib_ucm_create_id_resp { }; struct ib_ucm_destroy_id { + __u64 response; __u32 id; }; +struct ib_ucm_destroy_id_resp { + __u32 events_reported; +}; + struct ib_ucm_attr_id { __u64 response; __u32 id; @@ -94,6 +102,64 @@ struct ib_ucm_attr_id_resp { __be32 remote_id; }; +struct ib_ucm_init_qp_attr { + __u64 response; + __u32 id; + __u32 qp_state; +}; + +struct ib_ucm_ah_attr { + __u8 grh_dgid[16]; + __u32 grh_flow_label; + __u16 dlid; + __u16 reserved; + __u8 grh_sgid_index; + __u8 grh_hop_limit; + __u8 grh_traffic_class; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; +}; + +struct ib_ucm_init_qp_attr_resp { + __u32 qp_attr_mask; + __u32 qp_state; + __u32 cur_qp_state; + __u32 path_mtu; + __u32 path_mig_state; + __u32 qkey; + __u32 rq_psn; + __u32 sq_psn; + __u32 dest_qp_num; + __u32 qp_access_flags; + + struct ib_ucm_ah_attr ah_attr; + struct ib_ucm_ah_attr alt_ah_attr; + + /* ib_qp_cap */ + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; + + __u16 pkey_index; + __u16 alt_pkey_index; + __u8 en_sqd_async_notify; + __u8 sq_draining; + __u8 max_rd_atomic; + __u8 max_dest_rd_atomic; + __u8 min_rnr_timer; + __u8 port_num; + __u8 timeout; + __u8 retry_cnt; + __u8 rnr_retry; + __u8 alt_port_num; + __u8 alt_timeout; +}; + struct ib_ucm_listen { __be64 service_id; __be64 service_mask; @@ -157,6 +223,7 @@ struct ib_ucm_req { }; struct ib_ucm_rep { + __u64 uid; __u64 data; __u32 id; __u32 qpn; @@ -232,7 +299,6 @@ struct ib_ucm_event_get { }; struct ib_ucm_req_event_resp { - __u32 listen_id; /* device */ /* port */ struct ib_ucm_path_rec primary_path; @@ -287,7 +353,6 @@ struct ib_ucm_apr_event_resp { }; struct ib_ucm_sidr_req_event_resp { - __u32 listen_id; /* device */ /* port */ __u16 pkey; @@ -307,6 +372,7 @@ struct ib_ucm_sidr_rep_event_resp { 
#define IB_UCM_PRES_ALTERNATE 0x08 struct ib_ucm_event_resp { + __u64 uid; __u32 id; __u32 event; __u32 present; Index: linux-kernel/infiniband/core/ucm.c =================================================================== --- linux-kernel/infiniband/core/ucm.c (revision 3109) +++ linux-kernel/infiniband/core/ucm.c (working copy) @@ -72,7 +72,6 @@ enum { static struct semaphore ctx_id_mutex; static struct idr ctx_id_table; -static int ctx_id_rover = 0; static struct ib_ucm_context *ib_ucm_ctx_get(struct ib_ucm_file *file, int id) { @@ -97,33 +96,16 @@ static void ib_ucm_ctx_put(struct ib_ucm wake_up(&ctx->wait); } -static ssize_t ib_ucm_destroy_ctx(struct ib_ucm_file *file, int id) +static inline int ib_ucm_new_cm_id(int event) { - struct ib_ucm_context *ctx; - struct ib_ucm_event *uevent; - - down(&ctx_id_mutex); - ctx = idr_find(&ctx_id_table, id); - if (!ctx) - ctx = ERR_PTR(-ENOENT); - else if (ctx->file != file) - ctx = ERR_PTR(-EINVAL); - else - idr_remove(&ctx_id_table, ctx->id); - up(&ctx_id_mutex); - - if (IS_ERR(ctx)) - return PTR_ERR(ctx); - - atomic_dec(&ctx->ref); - wait_event(ctx->wait, !atomic_read(&ctx->ref)); + return event == IB_CM_REQ_RECEIVED || event == IB_CM_SIDR_REQ_RECEIVED; +} - /* No new events will be generated after destroying the cm_id. */ - if (!IS_ERR(ctx->cm_id)) - ib_destroy_cm_id(ctx->cm_id); +static void ib_ucm_cleanup_events(struct ib_ucm_context *ctx) +{ + struct ib_ucm_event *uevent; - /* Cleanup events not yet reported to the user. */ - down(&file->mutex); + down(&ctx->file->mutex); list_del(&ctx->file_list); while (!list_empty(&ctx->events)) { @@ -133,15 +115,12 @@ static ssize_t ib_ucm_destroy_ctx(struct list_del(&uevent->ctx_list); /* clear incoming connections. 
*/ - if (uevent->cm_id) + if (ib_ucm_new_cm_id(uevent->resp.event)) ib_destroy_cm_id(uevent->cm_id); kfree(uevent); } - up(&file->mutex); - - kfree(ctx); - return 0; + up(&ctx->file->mutex); } static struct ib_ucm_context *ib_ucm_ctx_alloc(struct ib_ucm_file *file) @@ -153,36 +132,31 @@ static struct ib_ucm_context *ib_ucm_ctx if (!ctx) return NULL; + memset(ctx, 0, sizeof *ctx); atomic_set(&ctx->ref, 1); init_waitqueue_head(&ctx->wait); ctx->file = file; - INIT_LIST_HEAD(&ctx->events); - list_add_tail(&ctx->file_list, &file->ctxs); - - ctx_id_rover = (ctx_id_rover + 1) & INT_MAX; -retry: - result = idr_pre_get(&ctx_id_table, GFP_KERNEL); - if (!result) - goto error; + do { + result = idr_pre_get(&ctx_id_table, GFP_KERNEL); + if (!result) + goto error; + + down(&ctx_id_mutex); + result = idr_get_new(&ctx_id_table, ctx, &ctx->id); + up(&ctx_id_mutex); + } while (result == -EAGAIN); - down(&ctx_id_mutex); - result = idr_get_new_above(&ctx_id_table, ctx, ctx_id_rover, &ctx->id); - up(&ctx_id_mutex); - - if (result == -EAGAIN) - goto retry; if (result) goto error; + list_add_tail(&ctx->file_list, &file->ctxs); ucm_dbg("Allocated CM ID <%d>\n", ctx->id); - return ctx; + error: - list_del(&ctx->file_list); kfree(ctx); - return NULL; } /* @@ -219,12 +193,9 @@ static void ib_ucm_event_path_get(struct kpath->packet_life_time_selector; } -static void ib_ucm_event_req_get(struct ib_ucm_context *ctx, - struct ib_ucm_req_event_resp *ureq, +static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq, struct ib_cm_req_event_param *kreq) { - ureq->listen_id = ctx->id; - ureq->remote_ca_guid = kreq->remote_ca_guid; ureq->remote_qkey = kreq->remote_qkey; ureq->remote_qpn = kreq->remote_qpn; @@ -259,14 +230,6 @@ static void ib_ucm_event_rep_get(struct urep->srq = krep->srq; } -static void ib_ucm_event_sidr_req_get(struct ib_ucm_context *ctx, - struct ib_ucm_sidr_req_event_resp *ureq, - struct ib_cm_sidr_req_event_param *kreq) -{ - ureq->listen_id = ctx->id; - ureq->pkey = 
kreq->pkey; -} - static void ib_ucm_event_sidr_rep_get(struct ib_ucm_sidr_rep_event_resp *urep, struct ib_cm_sidr_rep_event_param *krep) { @@ -275,15 +238,14 @@ static void ib_ucm_event_sidr_rep_get(st urep->qpn = krep->qpn; }; -static int ib_ucm_event_process(struct ib_ucm_context *ctx, - struct ib_cm_event *evt, +static int ib_ucm_event_process(struct ib_cm_event *evt, struct ib_ucm_event *uvt) { void *info = NULL; switch (evt->event) { case IB_CM_REQ_RECEIVED: - ib_ucm_event_req_get(ctx, &uvt->resp.u.req_resp, + ib_ucm_event_req_get(&uvt->resp.u.req_resp, &evt->param.req_rcvd); uvt->data_len = IB_CM_REQ_PRIVATE_DATA_SIZE; uvt->resp.present = IB_UCM_PRES_PRIMARY; @@ -331,8 +293,8 @@ static int ib_ucm_event_process(struct i info = evt->param.apr_rcvd.apr_info; break; case IB_CM_SIDR_REQ_RECEIVED: - ib_ucm_event_sidr_req_get(ctx, &uvt->resp.u.sidr_req_resp, - &evt->param.sidr_req_rcvd); + uvt->resp.u.sidr_req_resp.pkey = + evt->param.sidr_req_rcvd.pkey; uvt->data_len = IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE; break; case IB_CM_SIDR_REP_RECEIVED: @@ -378,31 +340,24 @@ static int ib_ucm_event_handler(struct i struct ib_ucm_event *uevent; struct ib_ucm_context *ctx; int result = 0; - int id; ctx = cm_id->context; - if (event->event == IB_CM_REQ_RECEIVED || - event->event == IB_CM_SIDR_REQ_RECEIVED) - id = IB_UCM_CM_ID_INVALID; - else - id = ctx->id; - uevent = kmalloc(sizeof(*uevent), GFP_KERNEL); if (!uevent) goto err1; memset(uevent, 0, sizeof(*uevent)); - uevent->resp.id = id; + uevent->ctx = ctx; + uevent->cm_id = cm_id; + uevent->resp.uid = ctx->uid; + uevent->resp.id = ctx->id; uevent->resp.event = event->event; - result = ib_ucm_event_process(ctx, event, uevent); + result = ib_ucm_event_process(event, uevent); if (result) goto err2; - uevent->ctx = ctx; - uevent->cm_id = (id == IB_UCM_CM_ID_INVALID) ? 
cm_id : NULL; - down(&ctx->file->mutex); list_add_tail(&uevent->file_list, &ctx->file->events); list_add_tail(&uevent->ctx_list, &ctx->events); @@ -414,7 +369,7 @@ err2: kfree(uevent); err1: /* Destroy new cm_id's */ - return (id == IB_UCM_CM_ID_INVALID); + return ib_ucm_new_cm_id(event->event); } static ssize_t ib_ucm_event(struct ib_ucm_file *file, @@ -423,7 +378,7 @@ static ssize_t ib_ucm_event(struct ib_uc { struct ib_ucm_context *ctx; struct ib_ucm_event_get cmd; - struct ib_ucm_event *uevent = NULL; + struct ib_ucm_event *uevent; int result = 0; DEFINE_WAIT(wait); @@ -436,7 +391,6 @@ static ssize_t ib_ucm_event(struct ib_uc * wait */ down(&file->mutex); - while (list_empty(&file->events)) { if (file->filp->f_flags & O_NONBLOCK) { @@ -463,21 +417,18 @@ static ssize_t ib_ucm_event(struct ib_uc uevent = list_entry(file->events.next, struct ib_ucm_event, file_list); - if (!uevent->cm_id) - goto user; + if (ib_ucm_new_cm_id(uevent->resp.event)) { + ctx = ib_ucm_ctx_alloc(file); + if (!ctx) { + result = -ENOMEM; + goto done; + } - ctx = ib_ucm_ctx_alloc(file); - if (!ctx) { - result = -ENOMEM; - goto done; + ctx->cm_id = uevent->cm_id; + ctx->cm_id->context = ctx; + uevent->resp.id = ctx->id; } - ctx->cm_id = uevent->cm_id; - ctx->cm_id->context = ctx; - - uevent->resp.id = ctx->id; - -user: if (copy_to_user((void __user *)(unsigned long)cmd.response, &uevent->resp, sizeof(uevent->resp))) { result = -EFAULT; @@ -485,12 +436,10 @@ user: } if (uevent->data) { - if (cmd.data_len < uevent->data_len) { result = -ENOMEM; goto done; } - if (copy_to_user((void __user *)(unsigned long)cmd.data, uevent->data, uevent->data_len)) { result = -EFAULT; @@ -499,12 +448,10 @@ user: } if (uevent->info) { - if (cmd.info_len < uevent->info_len) { result = -ENOMEM; goto done; } - if (copy_to_user((void __user *)(unsigned long)cmd.info, uevent->info, uevent->info_len)) { result = -EFAULT; @@ -514,6 +461,7 @@ user: list_del(&uevent->file_list); list_del(&uevent->ctx_list); + 
uevent->ctx->events_reported++; kfree(uevent->data); kfree(uevent->info); @@ -545,6 +493,7 @@ static ssize_t ib_ucm_create_id(struct i if (!ctx) return -ENOMEM; + ctx->uid = cmd.uid; ctx->cm_id = ib_create_cm_id(ib_ucm_event_handler, ctx); if (IS_ERR(ctx->cm_id)) { result = PTR_ERR(ctx->cm_id); @@ -561,7 +510,14 @@ static ssize_t ib_ucm_create_id(struct i return 0; err: - ib_ucm_destroy_ctx(file, ctx->id); + down(&ctx_id_mutex); + idr_remove(&ctx_id_table, ctx->id); + up(&ctx_id_mutex); + + if (!IS_ERR(ctx->cm_id)) + ib_destroy_cm_id(ctx->cm_id); + + kfree(ctx); return result; } @@ -570,11 +526,44 @@ static ssize_t ib_ucm_destroy_id(struct int in_len, int out_len) { struct ib_ucm_destroy_id cmd; + struct ib_ucm_destroy_id_resp resp; + struct ib_ucm_context *ctx; + int result = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; - return ib_ucm_destroy_ctx(file, cmd.id); + down(&ctx_id_mutex); + ctx = idr_find(&ctx_id_table, cmd.id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + idr_remove(&ctx_id_table, ctx->id); + up(&ctx_id_mutex); + + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + atomic_dec(&ctx->ref); + wait_event(ctx->wait, !atomic_read(&ctx->ref)); + + /* No new events will be generated after destroying the cm_id. */ + ib_destroy_cm_id(ctx->cm_id); + /* Cleanup events not yet reported to the user. 
*/ + ib_ucm_cleanup_events(ctx); + + resp.events_reported = ctx->events_reported; + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + result = -EFAULT; + + kfree(ctx); + return result; } static ssize_t ib_ucm_attr_id(struct ib_ucm_file *file, @@ -609,6 +598,98 @@ static ssize_t ib_ucm_attr_id(struct ib_ return result; } +static void ib_ucm_copy_ah_attr(struct ib_ucm_ah_attr *dest_attr, + struct ib_ah_attr *src_attr) +{ + memcpy(dest_attr->grh_dgid, src_attr->grh.dgid.raw, + sizeof src_attr->grh.dgid); + dest_attr->grh_flow_label = src_attr->grh.flow_label; + dest_attr->grh_sgid_index = src_attr->grh.sgid_index; + dest_attr->grh_hop_limit = src_attr->grh.hop_limit; + dest_attr->grh_traffic_class = src_attr->grh.traffic_class; + + dest_attr->dlid = src_attr->dlid; + dest_attr->sl = src_attr->sl; + dest_attr->src_path_bits = src_attr->src_path_bits; + dest_attr->static_rate = src_attr->static_rate; + dest_attr->is_global = (src_attr->ah_flags & IB_AH_GRH); + dest_attr->port_num = src_attr->port_num; +} + +static void ib_ucm_copy_qp_attr(struct ib_ucm_init_qp_attr_resp *dest_attr, + struct ib_qp_attr *src_attr) +{ + dest_attr->cur_qp_state = src_attr->cur_qp_state; + dest_attr->path_mtu = src_attr->path_mtu; + dest_attr->path_mig_state = src_attr->path_mig_state; + dest_attr->qkey = src_attr->qkey; + dest_attr->rq_psn = src_attr->rq_psn; + dest_attr->sq_psn = src_attr->sq_psn; + dest_attr->dest_qp_num = src_attr->dest_qp_num; + dest_attr->qp_access_flags = src_attr->qp_access_flags; + + dest_attr->max_send_wr = src_attr->cap.max_send_wr; + dest_attr->max_recv_wr = src_attr->cap.max_recv_wr; + dest_attr->max_send_sge = src_attr->cap.max_send_sge; + dest_attr->max_recv_sge = src_attr->cap.max_recv_sge; + dest_attr->max_inline_data = src_attr->cap.max_inline_data; + + ib_ucm_copy_ah_attr(&dest_attr->ah_attr, &src_attr->ah_attr); + ib_ucm_copy_ah_attr(&dest_attr->alt_ah_attr, &src_attr->alt_ah_attr); + + dest_attr->pkey_index = 
src_attr->pkey_index; + dest_attr->alt_pkey_index = src_attr->alt_pkey_index; + dest_attr->en_sqd_async_notify = src_attr->en_sqd_async_notify; + dest_attr->sq_draining = src_attr->sq_draining; + dest_attr->max_rd_atomic = src_attr->max_rd_atomic; + dest_attr->max_dest_rd_atomic = src_attr->max_dest_rd_atomic; + dest_attr->min_rnr_timer = src_attr->min_rnr_timer; + dest_attr->port_num = src_attr->port_num; + dest_attr->timeout = src_attr->timeout; + dest_attr->retry_cnt = src_attr->retry_cnt; + dest_attr->rnr_retry = src_attr->rnr_retry; + dest_attr->alt_port_num = src_attr->alt_port_num; + dest_attr->alt_timeout = src_attr->alt_timeout; +} + +static ssize_t ib_ucm_init_qp_attr(struct ib_ucm_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct ib_ucm_init_qp_attr_resp resp; + struct ib_ucm_init_qp_attr cmd; + struct ib_ucm_context *ctx; + struct ib_qp_attr qp_attr; + int result = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ib_ucm_ctx_get(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + resp.qp_attr_mask = 0; + memset(&qp_attr, 0, sizeof qp_attr); + qp_attr.qp_state = cmd.qp_state; + result = ib_cm_init_qp_attr(ctx->cm_id, &qp_attr, &resp.qp_attr_mask); + if (result) + goto out; + + ib_ucm_copy_qp_attr(&resp, &qp_attr); + + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + result = -EFAULT; + +out: + ib_ucm_ctx_put(ctx); + return result; +} + static ssize_t ib_ucm_listen(struct ib_ucm_file *file, const char __user *inbuf, int in_len, int out_len) @@ -808,6 +889,7 @@ static ssize_t ib_ucm_send_rep(struct ib ctx = ib_ucm_ctx_get(file, cmd.id); if (!IS_ERR(ctx)) { + ctx->uid = cmd.uid; result = ib_send_cm_rep(ctx->cm_id, &param); ib_ucm_ctx_put(ctx); } else @@ -1086,6 +1168,7 @@ static ssize_t (*ucm_cmd_table[])(struct [IB_USER_CM_CMD_SEND_SIDR_REQ] = ib_ucm_send_sidr_req, [IB_USER_CM_CMD_SEND_SIDR_REP] =
ib_ucm_send_sidr_rep, [IB_USER_CM_CMD_EVENT] = ib_ucm_event, + [IB_USER_CM_CMD_INIT_QP_ATTR] = ib_ucm_init_qp_attr, }; static ssize_t ib_ucm_write(struct file *filp, const char __user *buf, @@ -1161,12 +1244,18 @@ static int ib_ucm_close(struct inode *in down(&file->mutex); while (!list_empty(&file->ctxs)) { - ctx = list_entry(file->ctxs.next, struct ib_ucm_context, file_list); - up(&file->mutex); - ib_ucm_destroy_ctx(file, ctx->id); + + down(&ctx_id_mutex); + idr_remove(&ctx_id_table, ctx->id); + up(&ctx_id_mutex); + + ib_destroy_cm_id(ctx->cm_id); + ib_ucm_cleanup_events(ctx); + kfree(ctx); + down(&file->mutex); } up(&file->mutex); Index: linux-kernel/infiniband/core/ucm.h =================================================================== --- linux-kernel/infiniband/core/ucm.h (revision 3109) +++ linux-kernel/infiniband/core/ucm.h (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -43,8 +44,6 @@ #include #include -#define IB_UCM_CM_ID_INVALID 0xffffffff - struct ib_ucm_file { struct semaphore mutex; struct file *filp; @@ -58,9 +57,11 @@ struct ib_ucm_context { int id; wait_queue_head_t wait; atomic_t ref; + int events_reported; struct ib_ucm_file *file; struct ib_cm_id *cm_id; + __u64 uid; struct list_head events; /* list of pending events. */ struct list_head file_list; /* member in file ctx list */ @@ -71,16 +72,12 @@ struct ib_ucm_event { struct list_head file_list; /* member in file event list */ struct list_head ctx_list; /* member in ctx event list */ + struct ib_cm_id *cm_id; struct ib_ucm_event_resp resp; void *data; void *info; int data_len; int info_len; - /* - * new connection identifiers needs to be saved until - * userspace can get a handle on them. 
-	 */
-	struct ib_cm_id *cm_id;
 };

 #endif /* UCM_H */

From yaronh at voltaire.com Fri Aug 19 19:42:24 2005
From: yaronh at voltaire.com (Yaron Haviv)
Date: Sat, 20 Aug 2005 05:42:24 +0300
Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator
Message-ID: <35EA21F54A45CB47B879F21A91F4862F713CED@taurus.voltaire.com>

>
> Also on the IB side the AT code probably needs to be reviewed and
> improved. The API should be simpler, and I don't like the way AT
> sticks its tentacles into the IPoIB driver and network stack.
>

The AT implementation was based on the code from SDP.
I assume that changes similar to the ones you propose would need to
apply to SDP, or SDP would need to use the same lib as the other ULPs.

Yaron

From halr at voltaire.com Sat Aug 20 07:19:12 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Sat, 20 Aug 2005 17:19:12 +0300
Subject: [openib-general] RE: [openib-commits] r3137 -gen2/trunk/src/linux-kernel/infiniband/ulp/ipoib
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BDA@taurus.voltaire.com>

Hi Roland,

What happens when the port is not a full member of the partition (and
only a partial member)? Is it just that the SA should reject those
requests, or does some other failure occur?

-- Hal

________________________________

From: openib-commits-bounces at openib.org on behalf of roland at openib.org
Sent: Fri 8/19/2005 3:15 PM
To: openib-commits at openib.org
Subject: [openib-commits] r3137 -gen2/trunk/src/linux-kernel/infiniband/ulp/ipoib

Author: roland
Date: 2005-08-19 12:15:50 -0700 (Fri, 19 Aug 2005)
New Revision: 3137

Modified:
   gen2/trunk/src/linux-kernel/infiniband/ulp/ipoib/ipoib_main.c
Log:
[PATCH] IPoIB: Set full membership bit in P_Keys

Always make sure that the full membership bit is set in the P_Keys
that IPoIB uses. This makes sure that all hosts join the correct
multicast groups so that hosts that are partial partition members can
talk to the rest of the network.
Signed-off-by: Roland Dreier Modified: gen2/trunk/src/linux-kernel/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- gen2/trunk/src/linux-kernel/infiniband/ulp/ipoib/ipoib_main.c 2005-08-18 19:36:50 UTC (rev 3136) +++ gen2/trunk/src/linux-kernel/infiniband/ulp/ipoib/ipoib_main.c 2005-08-19 19:15:50 UTC (rev 3137) @@ -699,7 +699,7 @@ } spin_unlock_irqrestore(&priv->lock, flags); - + if (ah) ipoib_put_ah(ah); } @@ -883,6 +883,12 @@ if (pkey < 0 || pkey > 0xffff) return -EINVAL; + /* + * Set the full membership bit, so that we join the right + * broadcast group, etc. + */ + pkey |= 0x8000; + ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), pkey); @@ -935,6 +941,12 @@ goto alloc_mem_failed; } + /* + * Set the full membership bit, so that we join the right + * broadcast group, etc. + */ + priv->pkey |= 0x8000; + priv->dev->broadcast[8] = priv->pkey >> 8; priv->dev->broadcast[9] = priv->pkey & 0xff; _______________________________________________ openib-commits mailing list openib-commits at openib.org http://openib.org/mailman/listinfo/openib-commits From eitan at mellanox.co.il Sat Aug 20 10:34:02 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 20 Aug 2005 20:34:02 +0300 Subject: [openib-general] OpenSM Coding style Message-ID: <506C3D7B14CDD411A52C00025558DED607C306C6@mtlex01.yok.mtl.com> Hi, Unlike the kernel code that adhere to the LINUX kernel coding style, OpenSM uses coding style that was developed by Intel during the IBAL code development. Although being different then the Kernel style OpenSM style is extensive and strictly followed during the several years of its development. It is needless to say keeping a standard coding style is important to any project (especially the side of OpenSM). The full list of rules is presented below. I think we should keep it as OpenSM is a user level project that is mature. 
The point is that we MUST make sure any new code added is following the current coding style. Even if we later decide to use the kernel style (which I doubt the need for) we have to keep the current rules. Some automaton for checking for these rules and fixing violations is provided by the script osm_check_n_fix. The following emacs automation can also be used for indentation rule and some simple fixes Apparently the new osm_vendor_umad.* code breaks some of the rules. I hope we can agree to fix these violations. Thanks Eitan Zahavi OpenSM Coding Style Rules ---------------------------------------- OpenSM incorporates a coding style that includes the following rules: 1. Use { } on a separate line then the if and else 2. Use function declaration/definition style that includes: a. A standard header that is readable by robodoc b. Each parameter is defined on its own line and indented by 2 spaces. c. Use IN or OUT or IN OUT prefix for each parameter 3. Each struct also have a typedef with the format : typedef struct _x {...} x_t; 4. Logs: a. Most functions that have access to the log object use : OSM_LOG_ENTER( p_log, ); and OSM_LOG_EXIT( p_log ); b. The following log verbosity levels are defined: OSM_LOG_ERROR - report fabric response errors or any abnormal algorithmic condition (see c). OSM_LOG_INFO - Information regarding major results of fabric initialization and SA operations OSM_LOG_VERBOSE - Detailed information of initialization activities and SA operations OSM_LOG_DEBUG - Even more information - intended for debugging of OpenSM OSM_LOG_FUNCS - ENTER and EXIT function only OSM_LOG_FRAMES - MAD contents OSM_LOG_ROUTING - Dump out routing results. OSM_LOG_SYS - Information that should go into /var/log/messages c. ERROR messages should be numbered by the following rule: An example error statement should look like: osm_log( p_log, OSM_LOG_ERROR, ": ERR 1B02: " "Error Init of qlock pool (%d).\n", status ); Module codes are unique to each C file. d. 
      Any message should include the name of the function it is part of. For example:
      osm_log( p_log, OSM_LOG_DEBUG,
               "__add_new_mgrp_port: "
               "create new port with proxy_join FALSE\n");
5. Each module should have a module header (see osm_base.h for example) describing the module intent, thread safety opacity, etc.
6. Usage of TABs is not allowed. Indentation is 3 spaces instead. This provides "editor invariable" behavior.
7. No spaces at end of line.

EMACS Rules Enforcement Automation
--------------------------------------------------------
;; prevent emacs from using tabs during indentation
(setq indent-tabs-mode nil)
(setq-default indent-tabs-mode nil)
(custom-set-variables
 '(tab-width 3)
 '(tcl-continued-indent-level 3))

;; making C style like OpenSM :
(defun my-c-mode-common-hook ()
  ;; my customizations for all of c-mode, c++-mode, objc-mode, java-mode
  (c-set-offset 'substatement-open 0)
  (c-set-offset 'arglist-intro '+)
  ;(c-set-offset 'arglist-cont 0)
  ;; other customizations can go here
  )
(add-hook 'c-mode-common-hook 'my-c-mode-common-hook)

;; utilities for C file indentation etc

;; strip all trailing spaces and tabs (rule 7)
(defun kill-oel-spaces ()
  (interactive)
  (goto-char 1)
  (replace-regexp "[ \t]+$" "")
  )

;; convert every tab in the buffer to spaces (rule 6)
(defun untabify-buffer ()
  (interactive)
  (end-of-buffer)
  (untabify 1 (point) )
  )

(defun indent-buffer ()
  (interactive)
  (end-of-buffer)
  (indent-region 1 (point) nil)
  )

;; move an opening brace that trails an if onto its own line (rule 1)
(defun fix-brace-on-if-statements ()
  (interactive)
  (goto-char 1)
  (replace-regexp "^\\([ \t]*if.*\\)[{]" "\\1\n{")
  )

;; put "} else {" on three separate lines (rule 1)
(defun fix-brace-on-else-statements ()
  (interactive)
  (goto-char 1)
  (replace-regexp "^[ \t]*[}][ \t]+else[ \t]+[{]" "}\nelse\n{")
  )

;; run all fixes; braces are split first so that indent-buffer
;; can then indent the newly created lines
(defun beauty-c-buffer ()
  (interactive)
  (fix-brace-on-if-statements)
  (fix-brace-on-else-statements)
  (indent-buffer)
  (kill-oel-spaces)
  (untabify-buffer))

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

-------------- next part --------------
An HTML attachment was scrubbed...
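As an illustration only, here is a minimal sketch of a small C function laid out according to rules 1, 2b, 2c, 3 and 6 of the coding style message above; the function itself checks a single line of text against rules 6 and 7. Nothing in it is taken from the OpenSM tree: style_check_line, style_report_t and the STYLE_* flags are hypothetical names, IN and OUT are defined locally as empty macros so that the sketch is self-contained, and the robodoc header of rule 2a is left out because its layout is not shown in the message.

```c
#include <stddef.h>
#include <string.h>

/* Empty direction markers, defined here only to keep the sketch
   self-contained (assumption: a real tree provides its own). */
#define IN
#define OUT

/* Hypothetical result flags, not part of OpenSM. */
#define STYLE_OK             0x0
#define STYLE_HAS_TAB        0x1
#define STYLE_TRAILING_SPACE 0x2

/* Rule 3: the struct also gets a typedef of the form _x / x_t. */
typedef struct _style_report
{
   unsigned int flags;         /* OR of the STYLE_* flags */
   size_t       first_tab_col; /* column of the first TAB, 0 if none */
} style_report_t;

/* Checks one line (without its newline) against rule 6 (no TABs)
   and rule 7 (no whitespace at end of line). */
unsigned int
style_check_line(
  IN const char* const p_line,
  OUT style_report_t* const p_report )
{
   size_t len = strlen( p_line );
   const char *p_tab = strchr( p_line, '\t' );

   p_report->flags = STYLE_OK;
   p_report->first_tab_col = 0;

   /* Rule 1: braces sit on their own lines. */
   if( p_tab != NULL )
   {
      p_report->flags |= STYLE_HAS_TAB;
      p_report->first_tab_col = (size_t)( p_tab - p_line );
   }

   if( len > 0 && ( p_line[len - 1] == ' ' || p_line[len - 1] == '\t' ) )
   {
      p_report->flags |= STYLE_TRAILING_SPACE;
   }

   return( p_report->flags );
}
```

A caller would declare a style_report_t, pass each line of a source file to style_check_line, and report any line for which the return value is not STYLE_OK. This covers only the two purely textual rules; brace placement and parameter layout need a real parser or the Emacs helpers quoted above.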
From eitan at mellanox.co.il Sat Aug 20 10:48:48 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: 20 Aug 2005 20:48:48 +0300
Subject: [openib-general] [PATCH] osm: osm_vendor_umad to provide port state
Message-ID: <86iry01ou7.fsf@mtl066.yok.mtl.com>

Hi Hal

The following patch provides the port state in the result of osm_vendor_get_all_port_attr. The port state is obtained (like the LID) from the query of the HCA ports and delivered in the resulting port attributes. This enables clients of osm_vendor_api.h to know the state of the port as well as the already provided LID and GUID.

BTW: inspecting the umad vendor implementation I have found many usages of Array Bound Variables (arrays whose size is a run-time value, such as int lids[*p_num_ports]). I wonder if we need to clean them up.

Thanks

Eitan

I tested the patch on: 2.6.12.3-smp SuSE Linux 9.3 (i586)

Signed-off-by: Eitan Zahavi

Index: osm/libvendor/osm_vendor_ibumad.c
===================================================================
--- osm/libvendor/osm_vendor_ibumad.c (revision 3128)
+++ osm/libvendor/osm_vendor_ibumad.c (working copy)
@@ -493,7 +493,9 @@ osm_vendor_get_all_port_attr(
         ib_net64_t *p_guid = portguids, *e = portguids + *p_num_ports;
         umad_ca_t ca;
         int lids[*p_num_ports];
+        int linkstates[*p_num_ports];
         int *p_lid = lids;
+        int *p_linkstates = linkstates;
         umad_port_t def_port = {""};
         int r, i, j;
@@ -527,7 +529,9 @@ osm_vendor_get_all_port_attr(
                 for (j = 0; j <= ca.numports; j++) {
                         if (ca.ports[j]) {
                                 *p_lid = ca.ports[j]->base_lid;
-                                p_lid++;
+                                *p_linkstates = ca.ports[j]->state;
+                                p_lid++;
+                                p_linkstates++;
                         }
                 }
         }
@@ -544,6 +548,7 @@ osm_vendor_get_all_port_attr(
         portguids[0] = def_port.port_guid;
         lids[0] = def_port.base_lid;
+        linkstates[0] = def_port.state;
         osm_log( p_vend->p_log, OSM_LOG_ERROR,
                  "osm_vendor_get_all_port_attr: "
@@ -560,6 +565,7 @@ osm_vendor_get_all_port_attr(
                 p_attr_array[i].port_guid = portguids[i];
                 p_attr_array[i].lid = lids[i];
                 p_attr_array[i].sm_lid = p_vend->umad_port.sm_lid;
+                p_attr_array[i].link_state = linkstates[i];
         }
         r = 0;
 } else

From
sean.hefty at intel.com Sat Aug 20 11:22:14 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 20 Aug 2005 11:22:14 -0700 Subject: [openib-general] [PATCH] [uCM] user specified context to uCM Message-ID: Try #3 - I'm not seeing this message make it through... The following patch: * Adds user specified context to all uCM events. Users will not retrieve any events associated with the context after destroying the corresponding cm_id. * Provides the ib_cm_init_qp_attr() call to userspace clients of the CM. This call may be used to set QP attributes properly before modifying the QP. * Fixes some error handling synchronization and cleanup issues. * Performs some minor code cleanup. * Replaces the ucm_simple test program with a userspace version of cmpost. The userspace version of cmpost uses the uAT interface to retrieve path records based on a remote host name, establishes a connection over a QP, and performs some simple message passing between the nodes. This patch bumps the ABI, and will require synchronization with uDAPL before committing. Signed-off-by: Sean Hefty Index: userspace/libibcm/include/infiniband/cm_abi.h =================================================================== --- userspace/libibcm/include/infiniband/cm_abi.h (revision 3124) +++ userspace/libibcm/include/infiniband/cm_abi.h (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -41,7 +42,7 @@ * drivers/infiniband/include/ib_user_cm.h */ -#define IB_USER_CM_ABI_VERSION 1 +#define IB_USER_CM_ABI_VERSION 2 enum { IB_USER_CM_CMD_CREATE_ID, @@ -64,6 +65,7 @@ enum { IB_USER_CM_CMD_SEND_SIDR_REP, IB_USER_CM_CMD_EVENT, + IB_USER_CM_CMD_INIT_QP_ATTR, }; /* * command ABI structures. 
@@ -75,6 +77,7 @@ struct cm_abi_cmd_hdr { }; struct cm_abi_create_id { + __u64 uid; __u64 response; }; @@ -83,9 +86,14 @@ struct cm_abi_create_id_resp { }; struct cm_abi_destroy_id { + __u64 response; __u32 id; }; +struct cm_abi_destroy_id_resp { + __u32 events_reported; +}; + struct cm_abi_attr_id { __u64 response; __u32 id; @@ -98,6 +106,64 @@ struct cm_abi_attr_id_resp { __u32 remote_id; }; +struct cm_abi_init_qp_attr { + __u64 response; + __u32 id; + __u32 qp_state; +}; + +struct cm_abi_ah_attr { + __u8 grh_dgid[16]; + __u32 grh_flow_label; + __u16 dlid; + __u16 reserved; + __u8 grh_sgid_index; + __u8 grh_hop_limit; + __u8 grh_traffic_class; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; +}; + +struct cm_abi_init_qp_attr_resp { + __u32 qp_attr_mask; + __u32 qp_state; + __u32 cur_qp_state; + __u32 path_mtu; + __u32 path_mig_state; + __u32 qkey; + __u32 rq_psn; + __u32 sq_psn; + __u32 dest_qp_num; + __u32 qp_access_flags; + + struct cm_abi_ah_attr ah_attr; + struct cm_abi_ah_attr alt_ah_attr; + + /* ibv_qp_cap */ + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; + + __u16 pkey_index; + __u16 alt_pkey_index; + __u8 en_sqd_async_notify; + __u8 sq_draining; + __u8 max_rd_atomic; + __u8 max_dest_rd_atomic; + __u8 min_rnr_timer; + __u8 port_num; + __u8 timeout; + __u8 retry_cnt; + __u8 rnr_retry; + __u8 alt_port_num; + __u8 alt_timeout; +}; + struct cm_abi_listen { __u64 service_id; __u64 service_mask; @@ -161,6 +227,7 @@ struct cm_abi_req { }; struct cm_abi_rep { + __u64 uid; __u64 data; __u32 id; __u32 qpn; @@ -236,7 +303,6 @@ struct cm_abi_event_get { }; struct cm_abi_req_event_resp { - __u32 listen_id; /* device */ /* port */ struct cm_abi_path_rec primary_path; @@ -291,7 +357,6 @@ struct cm_abi_apr_event_resp { }; struct cm_abi_sidr_req_event_resp { - __u32 listen_id; /* device */ /* port */ __u16 pkey; @@ -311,6 +376,7 @@ struct cm_abi_sidr_rep_event_resp { 
#define CM_ABI_PRES_ALTERNATE 0x08 struct cm_abi_event_resp { + __u64 uid; __u32 id; __u32 event; __u32 present; Index: userspace/libibcm/include/infiniband/cm.h =================================================================== --- userspace/libibcm/include/infiniband/cm.h (revision 3124) +++ userspace/libibcm/include/infiniband/cm.h (working copy) @@ -77,8 +77,13 @@ enum ib_cm_data_size { IB_CM_SIDR_REP_INFO_LENGTH = 72 }; +struct ib_cm_id { + void *context; + uint32_t handle; +}; + struct ib_cm_req_event_param { - uint32_t listen_id; + struct ib_cm_id *listen_id; struct ib_sa_path_rec *primary_path; struct ib_sa_path_rec *alternate_path; @@ -187,7 +192,7 @@ struct ib_cm_apr_event_param { }; struct ib_cm_sidr_req_event_param { - uint32_t listen_id; + struct ib_cm_id *listen_id; struct ib_device *device; uint8_t port; uint16_t pkey; @@ -212,7 +217,7 @@ struct ib_cm_sidr_rep_event_param { }; struct ib_cm_event { - uint32_t cm_id; + struct ib_cm_id *cm_id; enum ib_cm_event_type event; union { struct ib_cm_req_event_param req_rcvd; @@ -287,13 +292,13 @@ int ib_cm_get_fd(void); * Communication identifiers are used to track connection states, service * ID resolution requests, and listen requests. */ -int ib_cm_create_id(uint32_t *cm_id); +int ib_cm_create_id(struct ib_cm_id **cm_id, void *context); /** * ib_cm_destroy_id - Destroy a connection identifier. * @cm_id: Connection identifier to destroy. */ -int ib_cm_destroy_id(uint32_t cm_id); +int ib_cm_destroy_id(struct ib_cm_id *cm_id); struct ib_cm_attr_param { uint64_t service_id; @@ -309,7 +314,7 @@ struct ib_cm_attr_param { * * Not all parameters are valid during all connection states. */ -int ib_cm_attr_id(uint32_t cm_id, +int ib_cm_attr_id(struct ib_cm_id *cm_id, struct ib_cm_attr_param *param); /** @@ -323,7 +328,7 @@ int ib_cm_attr_id(uint32_t cm_id, * range of service IDs. If set to 0, the service ID is matched * exactly. 
*/ -int ib_cm_listen(uint32_t cm_id, +int ib_cm_listen(struct ib_cm_id *cm_id, uint64_t service_id, uint64_t service_mask); @@ -355,7 +360,7 @@ struct ib_cm_req_param { * @param: Connection request information needed to establish the * connection. */ -int ib_cm_send_req(uint32_t cm_id, +int ib_cm_send_req(struct ib_cm_id *cm_id, struct ib_cm_req_param *param); struct ib_cm_rep_param { @@ -380,7 +385,7 @@ struct ib_cm_rep_param { * @param: Connection reply information needed to establish the * connection. */ -int ib_cm_send_rep(uint32_t cm_id, +int ib_cm_send_rep(struct ib_cm_id *cm_id, struct ib_cm_rep_param *param); /** @@ -391,7 +396,7 @@ int ib_cm_send_rep(uint32_t cm_id, * ready to use message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_rtu(uint32_t cm_id, +int ib_cm_send_rtu(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len); @@ -404,7 +409,7 @@ int ib_cm_send_rtu(uint32_t cm_id, * disconnection request message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_dreq(uint32_t cm_id, +int ib_cm_send_dreq(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len); @@ -416,7 +421,7 @@ int ib_cm_send_dreq(uint32_t cm_id, * disconnection reply message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_drep(uint32_t cm_id, +int ib_cm_send_drep(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len); @@ -427,7 +432,7 @@ int ib_cm_send_drep(uint32_t cm_id, * This routine should be invoked by users who receive messages on a * connected QP before an RTU has been received. */ -int ib_cm_establish(uint32_t cm_id); +int ib_cm_establish(struct ib_cm_id *cm_id); /** * ib_cm_send_rej - Sends a connection rejection message to the @@ -441,7 +446,7 @@ int ib_cm_establish(uint32_t cm_id); * rejection message. * @private_data_len: Size of the private data buffer, in bytes. 
*/ -int ib_cm_send_rej(uint32_t cm_id, +int ib_cm_send_rej(struct ib_cm_id *cm_id, enum ib_cm_rej_reason reason, void *ari, uint8_t ari_length, @@ -458,7 +463,7 @@ int ib_cm_send_rej(uint32_t cm_id, * message receipt acknowledgement. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_mra(uint32_t cm_id, +int ib_cm_send_mra(struct ib_cm_id *cm_id, uint8_t service_timeout, void *private_data, uint8_t private_data_len); @@ -473,12 +478,32 @@ int ib_cm_send_mra(uint32_t cm_id, * load alternate path message. * @private_data_len: Size of the private data buffer, in bytes. */ -int ib_cm_send_lap(uint32_t cm_id, +int ib_cm_send_lap(struct ib_cm_id *cm_id, struct ib_sa_path_rec *alternate_path, void *private_data, uint8_t private_data_len); /** + * ib_cm_init_qp_attr - Initializes the QP attributes for use in transitioning + * to a specified QP state. + * @cm_id: Communication identifier associated with the QP attributes to + * initialize. + * @qp_attr: On input, specifies the desired QP state. On output, the + * mandatory and desired optional attributes will be set in order to + * modify the QP to the specified state. + * @qp_attr_mask: The QP attribute mask that may be used to transition the + * QP to the specified state. + * + * Users must set the @qp_attr->qp_state to the desired QP state. This call + * will set all required attributes for the given transition, along with + * known optional attributes. Users may override the attributes returned from + * this call before calling ib_modify_qp. + */ +int ib_cm_init_qp_attr(struct ib_cm_id *cm_id, + struct ibv_qp_attr *qp_attr, + int *qp_attr_mask); + +/** * ib_cm_send_apr - Sends an alternate path response message in response to * a load alternate path request. * @cm_id: Connection identifier associated with the alternate path response. @@ -490,7 +515,7 @@ int ib_cm_send_lap(uint32_t cm_id, * alternate path response message. * @private_data_len: Size of the private data buffer, in bytes. 
*/ -int ib_cm_send_apr(uint32_t cm_id, +int ib_cm_send_apr(struct ib_cm_id *cm_id, enum ib_cm_apr_status status, void *info, uint8_t info_length, @@ -514,7 +539,7 @@ struct ib_cm_sidr_req_param { * service ID resolution request. * @param: Service ID resolution request information. */ -int ib_cm_send_sidr_req(uint32_t cm_id, +int ib_cm_send_sidr_req(struct ib_cm_id *cm_id, struct ib_cm_sidr_req_param *param); struct ib_cm_sidr_rep_param { @@ -534,7 +559,7 @@ struct ib_cm_sidr_rep_param { * resolution request. * @param: Service ID resolution reply information. */ -int ib_cm_send_sidr_rep(uint32_t cm_id, +int ib_cm_send_sidr_rep(struct ib_cm_id *cm_id, struct ib_cm_sidr_rep_param *param); #endif /* CM_H */ Index: userspace/libibcm/AUTHORS =================================================================== --- userspace/libibcm/AUTHORS (revision 3124) +++ userspace/libibcm/AUTHORS (working copy) @@ -1 +1,2 @@ +Sean Hefty Libor Michalek Index: userspace/libibcm/src/cm.c =================================================================== --- userspace/libibcm/src/cm.c (revision 3124) +++ userspace/libibcm/src/cm.c (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -45,6 +46,7 @@ #include #include #include +#include #include #include @@ -69,7 +71,7 @@ do { resp = alloca(sizeof(*resp)); \ if (!resp) \ return -ENOMEM; \ - cmd->response = (unsigned long)resp;\ + cmd->response = (uintptr_t)resp;\ } while (0) #define CM_CREATE_MSG_CMD(msg, cmd, type, size) \ @@ -88,8 +90,18 @@ do { memset(cmd, 0, sizeof(*cmd)); \ } while (0) +struct cm_id_private { + struct ib_cm_id id; + int events_completed; + pthread_cond_t cond; + pthread_mutex_t mut; +}; + static int fd; +#define container_of(ptr, type, field) \ + ((type *) ((void *)ptr - offsetof(type, field))) + static void __attribute__((constructor)) ib_cm_init(void) { fd = open(IB_UCM_DEV_PATH, O_RDWR); @@ -127,46 +139,89 @@ static void cm_param_path_get(struct cm_ abi->preference = sa->preference; } -int ib_cm_create_id(uint32_t *cm_id) +static void ib_cm_free_id(struct cm_id_private *cm_id_priv) +{ + pthread_cond_destroy(&cm_id_priv->cond); + pthread_mutex_destroy(&cm_id_priv->mut); + free(cm_id_priv); +} + +static struct cm_id_private *ib_cm_alloc_id(void *context) +{ + struct cm_id_private *cm_id_priv; + + cm_id_priv = malloc(sizeof *cm_id_priv); + if (!cm_id_priv) + return NULL; + + memset(cm_id_priv, 0, sizeof *cm_id_priv); + cm_id_priv->id.context = context; + pthread_mutex_init(&cm_id_priv->mut, NULL); + if (pthread_cond_init(&cm_id_priv->cond, NULL)) + goto err; + + return cm_id_priv; + +err: ib_cm_free_id(cm_id_priv); + return NULL; +} + +int ib_cm_create_id(struct ib_cm_id **cm_id, void *context) { struct cm_abi_create_id_resp *resp; struct cm_abi_create_id *cmd; + struct cm_id_private *cm_id_priv; void *msg; int result; int size; - if (!cm_id) - return -EINVAL; + cm_id_priv = ib_cm_alloc_id(context); + if (!cm_id_priv) + return -ENOMEM; - CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_CREATE_ID, size); + CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_CREATE_ID, size); + cmd->uid = (uintptr_t) 
cm_id_priv; result = write(fd, msg, size); if (result != size) - return (result > 0) ? -ENODATA : result; + goto err; - *cm_id = resp->id; + cm_id_priv->id.handle = resp->id; + *cm_id = &cm_id_priv->id; return 0; + +err: ib_cm_free_id(cm_id_priv); + return result; } -int ib_cm_destroy_id(uint32_t cm_id) +int ib_cm_destroy_id(struct ib_cm_id *cm_id) { + struct cm_abi_destroy_id_resp *resp; struct cm_abi_destroy_id *cmd; + struct cm_id_private *cm_id_priv; void *msg; int result; int size; - CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_DESTROY_ID, size); - - cmd->id = cm_id; + CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_DESTROY_ID, size); + cmd->id = cm_id->handle; result = write(fd, msg, size); if (result != size) return (result > 0) ? -ENODATA : result; + cm_id_priv = container_of(cm_id, struct cm_id_private, id); + + pthread_mutex_lock(&cm_id_priv->mut); + while (cm_id_priv->events_completed < resp->events_reported) + pthread_cond_wait(&cm_id_priv->cond, &cm_id_priv->mut); + pthread_mutex_unlock(&cm_id_priv->mut); + + ib_cm_free_id(cm_id_priv); return 0; } -int ib_cm_attr_id(uint32_t cm_id, struct ib_cm_attr_param *param) +int ib_cm_attr_id(struct ib_cm_id *cm_id, struct ib_cm_attr_param *param) { struct cm_abi_attr_id_resp *resp; struct cm_abi_attr_id *cmd; @@ -177,9 +232,8 @@ int ib_cm_attr_id(uint32_t cm_id, struct if (!param) return -EINVAL; - CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_ATTR_ID, size); - - cmd->id = cm_id; + CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_ATTR_ID, size); + cmd->id = cm_id->handle; result = write(fd, msg, size); if (result != size) @@ -189,11 +243,91 @@ int ib_cm_attr_id(uint32_t cm_id, struct param->service_mask = resp->service_mask; param->local_id = resp->local_id; param->remote_id = resp->remote_id; + return 0; +} + +static void ib_cm_copy_ah_attr(struct ibv_ah_attr *dest_attr, + struct cm_abi_ah_attr *src_attr) +{ + memcpy(dest_attr->grh.dgid.raw, src_attr->grh_dgid, + sizeof dest_attr->grh.dgid); + 
dest_attr->grh.flow_label = src_attr->grh_flow_label; + dest_attr->grh.sgid_index = src_attr->grh_sgid_index; + dest_attr->grh.hop_limit = src_attr->grh_hop_limit; + dest_attr->grh.traffic_class = src_attr->grh_traffic_class; + + dest_attr->dlid = src_attr->dlid; + dest_attr->sl = src_attr->sl; + dest_attr->src_path_bits = src_attr->src_path_bits; + dest_attr->static_rate = src_attr->static_rate; + dest_attr->is_global = src_attr->is_global; + dest_attr->port_num = src_attr->port_num; +} + +static void ib_cm_copy_qp_attr(struct ibv_qp_attr *dest_attr, + struct cm_abi_init_qp_attr_resp *src_attr) +{ + dest_attr->cur_qp_state = src_attr->cur_qp_state; + dest_attr->path_mtu = src_attr->path_mtu; + dest_attr->path_mig_state = src_attr->path_mig_state; + dest_attr->qkey = src_attr->qkey; + dest_attr->rq_psn = src_attr->rq_psn; + dest_attr->sq_psn = src_attr->sq_psn; + dest_attr->dest_qp_num = src_attr->dest_qp_num; + dest_attr->qp_access_flags = src_attr->qp_access_flags; + + dest_attr->cap.max_send_wr = src_attr->max_send_wr; + dest_attr->cap.max_recv_wr = src_attr->max_recv_wr; + dest_attr->cap.max_send_sge = src_attr->max_send_sge; + dest_attr->cap.max_recv_sge = src_attr->max_recv_sge; + dest_attr->cap.max_inline_data = src_attr->max_inline_data; + + ib_cm_copy_ah_attr(&dest_attr->ah_attr, &src_attr->ah_attr); + ib_cm_copy_ah_attr(&dest_attr->alt_ah_attr, &src_attr->alt_ah_attr); + + dest_attr->pkey_index = src_attr->pkey_index; + dest_attr->alt_pkey_index = src_attr->alt_pkey_index; + dest_attr->en_sqd_async_notify = src_attr->en_sqd_async_notify; + dest_attr->sq_draining = src_attr->sq_draining; + dest_attr->max_rd_atomic = src_attr->max_rd_atomic; + dest_attr->max_dest_rd_atomic = src_attr->max_dest_rd_atomic; + dest_attr->min_rnr_timer = src_attr->min_rnr_timer; + dest_attr->port_num = src_attr->port_num; + dest_attr->timeout = src_attr->timeout; + dest_attr->retry_cnt = src_attr->retry_cnt; + dest_attr->rnr_retry = src_attr->rnr_retry; + dest_attr->alt_port_num 
= src_attr->alt_port_num; + dest_attr->alt_timeout = src_attr->alt_timeout; +} + +int ib_cm_init_qp_attr(struct ib_cm_id *cm_id, + struct ibv_qp_attr *qp_attr, + int *qp_attr_mask) +{ + struct cm_abi_init_qp_attr_resp *resp; + struct cm_abi_init_qp_attr *cmd; + void *msg; + int result; + int size; + + if (!qp_attr || !qp_attr_mask) + return -EINVAL; + + CM_CREATE_MSG_CMD_RESP(msg, cmd, resp, IB_USER_CM_CMD_INIT_QP_ATTR, size); + cmd->id = cm_id->handle; + cmd->qp_state = qp_attr->qp_state; + + result = write(fd, msg, size); + if (result != size) + return (result > 0) ? -ENODATA : result; + + *qp_attr_mask = resp->qp_attr_mask; + ib_cm_copy_qp_attr(qp_attr, resp); return 0; } -int ib_cm_listen(uint32_t cm_id, +int ib_cm_listen(struct ib_cm_id *cm_id, uint64_t service_id, uint64_t service_mask) { @@ -203,8 +337,7 @@ int ib_cm_listen(uint32_t cm_id, int size; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_LISTEN, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; cmd->service_id = service_id; cmd->service_mask = service_mask; @@ -215,7 +348,7 @@ int ib_cm_listen(uint32_t cm_id, return 0; } -int ib_cm_send_req(uint32_t cm_id, struct ib_cm_req_param *param) +int ib_cm_send_req(struct ib_cm_id *cm_id, struct ib_cm_req_param *param) { struct cm_abi_path_rec *p_path; struct cm_abi_path_rec *a_path; @@ -228,13 +361,11 @@ int ib_cm_send_req(uint32_t cm_id, struc return -EINVAL; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_REQ, size); - - cmd->id = cm_id; - cmd->qpn = param->qp_num; - cmd->qp_type = param->qp_type; - cmd->psn = param->starting_psn; - cmd->sid = param->service_id; - + cmd->id = cm_id->handle; + cmd->qpn = param->qp_num; + cmd->qp_type = param->qp_type; + cmd->psn = param->starting_psn; + cmd->sid = param->service_id; cmd->peer_to_peer = param->peer_to_peer; cmd->responder_resources = param->responder_resources; cmd->initiator_depth = param->initiator_depth; @@ -247,28 +378,25 @@ int ib_cm_send_req(uint32_t cm_id, struc cmd->srq = param->srq; if 
(param->primary_path) { - p_path = alloca(sizeof(*p_path)); if (!p_path) return -ENOMEM; cm_param_path_get(p_path, param->primary_path); - cmd->primary_path = (unsigned long)p_path; + cmd->primary_path = (uintptr_t) p_path; } if (param->alternate_path) { - a_path = alloca(sizeof(*a_path)); if (!a_path) return -ENOMEM; cm_param_path_get(a_path, param->alternate_path); - cmd->alternate_path = (unsigned long)a_path; + cmd->alternate_path = (uintptr_t) a_path; } if (param->private_data && param->private_data_len) { - - cmd->data = (unsigned long)param->private_data; + cmd->data = (uintptr_t) param->private_data; cmd->len = param->private_data_len; } @@ -279,7 +407,7 @@ int ib_cm_send_req(uint32_t cm_id, struc return 0; } -int ib_cm_send_rep(uint32_t cm_id, struct ib_cm_rep_param *param) +int ib_cm_send_rep(struct ib_cm_id *cm_id, struct ib_cm_rep_param *param) { struct cm_abi_rep *cmd; void *msg; @@ -290,11 +418,10 @@ int ib_cm_send_rep(uint32_t cm_id, struc return -EINVAL; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_REP, size); - - cmd->id = cm_id; - cmd->qpn = param->qp_num; - cmd->psn = param->starting_psn; - + cmd->uid = (uintptr_t) container_of(cm_id, struct cm_id_private, id); + cmd->id = cm_id->handle; + cmd->qpn = param->qp_num; + cmd->psn = param->starting_psn; cmd->responder_resources = param->responder_resources; cmd->initiator_depth = param->initiator_depth; cmd->target_ack_delay = param->target_ack_delay; @@ -304,8 +431,7 @@ int ib_cm_send_rep(uint32_t cm_id, struc cmd->srq = param->srq; if (param->private_data && param->private_data_len) { - - cmd->data = (unsigned long)param->private_data; + cmd->data = (uintptr_t) param->private_data; cmd->len = param->private_data_len; } @@ -316,7 +442,7 @@ int ib_cm_send_rep(uint32_t cm_id, struc return 0; } -static inline int cm_send_private_data(uint32_t cm_id, +static inline int cm_send_private_data(struct ib_cm_id *cm_id, uint32_t type, void *private_data, uint8_t private_data_len) @@ -327,12 +453,10 @@ static 
inline int cm_send_private_data(u int size; CM_CREATE_MSG_CMD(msg, cmd, type, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; if (private_data && private_data_len) { - - cmd->data = (unsigned long)private_data; + cmd->data = (uintptr_t) private_data; cmd->len = private_data_len; } @@ -343,7 +467,7 @@ static inline int cm_send_private_data(u return 0; } -int ib_cm_send_rtu(uint32_t cm_id, +int ib_cm_send_rtu(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len) { @@ -351,7 +475,7 @@ int ib_cm_send_rtu(uint32_t cm_id, private_data, private_data_len); } -int ib_cm_send_dreq(uint32_t cm_id, +int ib_cm_send_dreq(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len) { @@ -359,7 +483,7 @@ int ib_cm_send_dreq(uint32_t cm_id, private_data, private_data_len); } -int ib_cm_send_drep(uint32_t cm_id, +int ib_cm_send_drep(struct ib_cm_id *cm_id, void *private_data, uint8_t private_data_len) { @@ -367,16 +491,15 @@ int ib_cm_send_drep(uint32_t cm_id, private_data, private_data_len); } -int ib_cm_establish(uint32_t cm_id) +int ib_cm_establish(struct ib_cm_id *cm_id) { struct cm_abi_establish *cmd; void *msg; int result; int size; - CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_ESTABLISH, size); - - cmd->id = cm_id; + CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_ESTABLISH, size); + cmd->id = cm_id->handle; result = write(fd, msg, size); if (result != size) @@ -385,7 +508,7 @@ int ib_cm_establish(uint32_t cm_id) return 0; } -static inline int cm_send_status(uint32_t cm_id, +static inline int cm_send_status(struct ib_cm_id *cm_id, uint32_t type, int status, void *info, @@ -399,19 +522,16 @@ static inline int cm_send_status(uint32_ int size; CM_CREATE_MSG_CMD(msg, cmd, type, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; cmd->status = status; if (private_data && private_data_len) { - - cmd->data = (unsigned long)private_data; + cmd->data = (uintptr_t) private_data; cmd->data_len = private_data_len; } if (info && info_length) { - - cmd->info = 
(unsigned long)info; + cmd->info = (uintptr_t) info; cmd->info_len = info_length; } @@ -422,7 +542,7 @@ static inline int cm_send_status(uint32_ return 0; } -int ib_cm_send_rej(uint32_t cm_id, +int ib_cm_send_rej(struct ib_cm_id *cm_id, enum ib_cm_rej_reason reason, void *ari, uint8_t ari_length, @@ -434,7 +554,7 @@ int ib_cm_send_rej(uint32_t cm_id, private_data, private_data_len); } -int ib_cm_send_apr(uint32_t cm_id, +int ib_cm_send_apr(struct ib_cm_id *cm_id, enum ib_cm_apr_status status, void *info, uint8_t info_length, @@ -446,7 +566,7 @@ int ib_cm_send_apr(uint32_t cm_id, private_data, private_data_len); } -int ib_cm_send_mra(uint32_t cm_id, +int ib_cm_send_mra(struct ib_cm_id *cm_id, uint8_t service_timeout, void *private_data, uint8_t private_data_len) @@ -457,13 +577,11 @@ int ib_cm_send_mra(uint32_t cm_id, int size; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_MRA, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; cmd->timeout = service_timeout; if (private_data && private_data_len) { - - cmd->data = (unsigned long)private_data; + cmd->data = (uintptr_t) private_data; cmd->len = private_data_len; } @@ -474,7 +592,7 @@ int ib_cm_send_mra(uint32_t cm_id, return 0; } -int ib_cm_send_lap(uint32_t cm_id, +int ib_cm_send_lap(struct ib_cm_id *cm_id, struct ib_sa_path_rec *alternate_path, void *private_data, uint8_t private_data_len) @@ -486,22 +604,19 @@ int ib_cm_send_lap(uint32_t cm_id, int size; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_LAP, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; if (alternate_path) { - abi_path = alloca(sizeof(*abi_path)); if (!abi_path) return -ENOMEM; cm_param_path_get(abi_path, alternate_path); - cmd->path = (unsigned long)abi_path; + cmd->path = (uintptr_t) abi_path; } if (private_data && private_data_len) { - - cmd->data = (unsigned long)private_data; + cmd->data = (uintptr_t) private_data; cmd->len = private_data_len; } @@ -512,7 +627,8 @@ int ib_cm_send_lap(uint32_t cm_id, return 0; } -int 
ib_cm_send_sidr_req(uint32_t cm_id, struct ib_cm_sidr_req_param *param) +int ib_cm_send_sidr_req(struct ib_cm_id *cm_id, + struct ib_cm_sidr_req_param *param) { struct cm_abi_path_rec *abi_path; struct cm_abi_sidr_req *cmd; @@ -524,26 +640,23 @@ int ib_cm_send_sidr_req(uint32_t cm_id, return -EINVAL; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_SIDR_REQ, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; cmd->sid = param->service_id; cmd->timeout = param->timeout_ms; cmd->pkey = param->pkey; cmd->max_cm_retries = param->max_cm_retries; if (param->path) { - abi_path = alloca(sizeof(*abi_path)); if (!abi_path) return -ENOMEM; cm_param_path_get(abi_path, param->path); - cmd->path = (unsigned long)abi_path; + cmd->path = (uintptr_t) abi_path; } if (param->private_data && param->private_data_len) { - - cmd->data = (unsigned long)param->private_data; + cmd->data = (uintptr_t) param->private_data; cmd->len = param->private_data_len; } @@ -554,7 +667,8 @@ int ib_cm_send_sidr_req(uint32_t cm_id, return 0; } -int ib_cm_send_sidr_rep(uint32_t cm_id, struct ib_cm_sidr_rep_param *param) +int ib_cm_send_sidr_rep(struct ib_cm_id *cm_id, + struct ib_cm_sidr_rep_param *param) { struct cm_abi_sidr_rep *cmd; void *msg; @@ -565,21 +679,18 @@ int ib_cm_send_sidr_rep(uint32_t cm_id, return -EINVAL; CM_CREATE_MSG_CMD(msg, cmd, IB_USER_CM_CMD_SEND_SIDR_REP, size); - - cmd->id = cm_id; + cmd->id = cm_id->handle; cmd->qpn = param->qp_num; cmd->qkey = param->qkey; cmd->status = param->status; if (param->private_data && param->private_data_len) { - - cmd->data = (unsigned long)param->private_data; + cmd->data = (uintptr_t) param->private_data; cmd->data_len = param->private_data_len; } if (param->info && param->info_length) { - - cmd->info = (unsigned long)param->info; + cmd->info = (uintptr_t) param->info; cmd->info_len = param->info_length; } @@ -599,8 +710,8 @@ static void cm_event_path_get(struct ib_ if (!kpath || !upath) return; - memcpy(upath->dgid.raw, kpath->dgid, sizeof(union 
ibv_gid)); - memcpy(upath->sgid.raw, kpath->sgid, sizeof(union ibv_gid)); + memcpy(upath->dgid.raw, kpath->dgid, sizeof upath->dgid); + memcpy(upath->sgid.raw, kpath->sgid, sizeof upath->sgid); upath->dlid = kpath->dlid; upath->slid = kpath->slid; @@ -626,8 +737,6 @@ static void cm_event_path_get(struct ib_ static void cm_event_req_get(struct ib_cm_req_event_param *ureq, struct cm_abi_req_event_resp *kreq) { - ureq->listen_id = kreq->listen_id; - ureq->remote_ca_guid = kreq->remote_ca_guid; ureq->remote_qkey = kreq->remote_qkey; ureq->remote_qpn = kreq->remote_qpn; @@ -661,36 +770,6 @@ static void cm_event_rep_get(struct ib_c urep->rnr_retry_count = krep->rnr_retry_count; urep->srq = krep->srq; } -static void cm_event_rej_get(struct ib_cm_rej_event_param *urej, - struct cm_abi_rej_event_resp *krej) -{ - urej->reason = krej->reason; -} - -static void cm_event_mra_get(struct ib_cm_mra_event_param *umra, - struct cm_abi_mra_event_resp *kmra) -{ - umra->service_timeout = kmra->timeout; -} - -static void cm_event_lap_get(struct ib_cm_lap_event_param *ulap, - struct cm_abi_lap_event_resp *klap) -{ - cm_event_path_get(ulap->alternate_path, &klap->path); -} - -static void cm_event_apr_get(struct ib_cm_apr_event_param *uapr, - struct cm_abi_apr_event_resp *kapr) -{ - uapr->ap_status = kapr->status; -} - -static void cm_event_sidr_req_get(struct ib_cm_sidr_req_event_param *ureq, - struct cm_abi_sidr_req_event_resp *kreq) -{ - ureq->listen_id = kreq->listen_id; - ureq->pkey = kreq->pkey; -} static void cm_event_sidr_rep_get(struct ib_cm_sidr_rep_event_param *urep, struct cm_abi_sidr_rep_event_resp *krep) @@ -702,6 +781,7 @@ static void cm_event_sidr_rep_get(struct int ib_cm_event_get(struct ib_cm_event **event) { + struct cm_id_private *cm_id_priv; struct cm_abi_cmd_hdr *hdr; struct cm_abi_event_get *cmd; struct cm_abi_event_resp *resp; @@ -733,7 +813,7 @@ int ib_cm_event_get(struct ib_cm_event * if (!resp) return -ENOMEM; - cmd->response = (unsigned long)resp; + 
cmd->response = (uintptr_t) resp; cmd->data_len = (uint8_t)(~0U); cmd->info_len = (uint8_t)(~0U); @@ -749,8 +829,8 @@ int ib_cm_event_get(struct ib_cm_event * goto done; } - cmd->data = (unsigned long)data; - cmd->info = (unsigned long)info; + cmd->data = (uintptr_t) data; + cmd->info = (uintptr_t) info; result = write(fd, msg, size); if (result != size) { @@ -765,14 +845,11 @@ int ib_cm_event_get(struct ib_cm_event * result = -ENOMEM; goto done; } - memset(evt, 0, sizeof(*evt)); - - evt->cm_id = resp->id; + evt->cm_id = (void *) (uintptr_t) resp->uid; evt->event = resp->event; if (resp->present & CM_ABI_PRES_PRIMARY) { - path_a = malloc(sizeof(*path_a)); if (!path_a) { result = -ENOMEM; @@ -781,81 +858,78 @@ int ib_cm_event_get(struct ib_cm_event * } if (resp->present & CM_ABI_PRES_ALTERNATE) { - path_b = malloc(sizeof(*path_b)); if (!path_b) { result = -ENOMEM; goto done; } } - - if (resp->present & CM_ABI_PRES_DATA) { - - evt->private_data = data; - data = NULL; - } switch (evt->event) { case IB_CM_REQ_RECEIVED: - + evt->param.req_rcvd.listen_id = evt->cm_id; + cm_id_priv = ib_cm_alloc_id(evt->cm_id->context); + if (!cm_id_priv) { + result = -ENOMEM; + goto done; + } + cm_id_priv->id.handle = resp->id; + evt->cm_id = &cm_id_priv->id; evt->param.req_rcvd.primary_path = path_a; evt->param.req_rcvd.alternate_path = path_b; path_a = NULL; path_b = NULL; - cm_event_req_get(&evt->param.req_rcvd, &resp->u.req_resp); break; case IB_CM_REP_RECEIVED: - cm_event_rep_get(&evt->param.rep_rcvd, &resp->u.rep_resp); break; case IB_CM_MRA_RECEIVED: - - cm_event_mra_get(&evt->param.mra_rcvd, &resp->u.mra_resp); + evt->param.mra_rcvd.service_timeout = resp->u.mra_resp.timeout; break; case IB_CM_REJ_RECEIVED: - - cm_event_rej_get(&evt->param.rej_rcvd, &resp->u.rej_resp); - + evt->param.rej_rcvd.reason = resp->u.rej_resp.reason; evt->param.rej_rcvd.ari = info; info = NULL; - break; case IB_CM_LAP_RECEIVED: - evt->param.lap_rcvd.alternate_path = path_b; path_b = NULL; - - 
cm_event_lap_get(&evt->param.lap_rcvd, &resp->u.lap_resp); + cm_event_path_get(evt->param.lap_rcvd.alternate_path, + &resp->u.lap_resp.path); break; case IB_CM_APR_RECEIVED: - - cm_event_apr_get(&evt->param.apr_rcvd, &resp->u.apr_resp); - + evt->param.apr_rcvd.ap_status = resp->u.apr_resp.status; evt->param.apr_rcvd.apr_info = info; info = NULL; - break; case IB_CM_SIDR_REQ_RECEIVED: - - cm_event_sidr_req_get(&evt->param.sidr_req_rcvd, - &resp->u.sidr_req_resp); + evt->param.sidr_req_rcvd.listen_id = evt->cm_id; + cm_id_priv = ib_cm_alloc_id(evt->cm_id->context); + if (!cm_id_priv) { + result = -ENOMEM; + goto done; + } + cm_id_priv->id.handle = resp->id; + evt->cm_id = &cm_id_priv->id; + evt->param.sidr_req_rcvd.pkey = resp->u.sidr_req_resp.pkey; break; case IB_CM_SIDR_REP_RECEIVED: - cm_event_sidr_rep_get(&evt->param.sidr_rep_rcvd, &resp->u.sidr_rep_resp); - evt->param.sidr_rep_rcvd.info = info; info = NULL; - break; default: - evt->param.send_status = resp->u.send_status; break; } + if (resp->present & CM_ABI_PRES_DATA) { + evt->private_data = data; + data = NULL; + } + *event = evt; evt = NULL; result = 0; @@ -876,44 +950,51 @@ done: int ib_cm_event_put(struct ib_cm_event *event) { + struct cm_id_private *cm_id_priv; + if (!event) return -EINVAL; if (event->private_data) free(event->private_data); + cm_id_priv = container_of(event->cm_id, struct cm_id_private, id); + switch (event->event) { case IB_CM_REQ_RECEIVED: - - if (event->param.req_rcvd.primary_path) - free(event->param.req_rcvd.primary_path); - + cm_id_priv = container_of(event->param.req_rcvd.listen_id, + struct cm_id_private, id); + free(event->param.req_rcvd.primary_path); if (event->param.req_rcvd.alternate_path) free(event->param.req_rcvd.alternate_path); break; case IB_CM_REJ_RECEIVED: - if (event->param.rej_rcvd.ari) free(event->param.rej_rcvd.ari); break; case IB_CM_LAP_RECEIVED: - - if (event->param.lap_rcvd.alternate_path) - free(event->param.lap_rcvd.alternate_path); + 
free(event->param.lap_rcvd.alternate_path); break; case IB_CM_APR_RECEIVED: - if (event->param.apr_rcvd.apr_info) free(event->param.apr_rcvd.apr_info); break; + case IB_CM_SIDR_REQ_RECEIVED: + cm_id_priv = container_of(event->param.sidr_req_rcvd.listen_id, + struct cm_id_private, id); + break; case IB_CM_SIDR_REP_RECEIVED: - if (event->param.sidr_rep_rcvd.info) free(event->param.sidr_rep_rcvd.info); default: break; } + pthread_mutex_lock(&cm_id_priv->mut); + cm_id_priv->events_completed++; + pthread_cond_signal(&cm_id_priv->cond); + pthread_mutex_unlock(&cm_id_priv->mut); + free(event); return 0; } Index: userspace/libibcm/Makefile.am =================================================================== --- userspace/libibcm/Makefile.am (revision 3124) +++ userspace/libibcm/Makefile.am (working copy) @@ -18,9 +18,11 @@ endif src_libibcm_la_SOURCES = src/cm.c src_libibcm_la_LDFLAGS = -avoid-version -module $(ucm_version_script) -bin_PROGRAMS = examples/ucm_simple -examples_ucm_simple_SOURCES = examples/simple.c -examples_ucm_simple_LDADD = $(top_builddir)/src/libibcm.la +bin_PROGRAMS = examples/ucmpost +examples_ucmpost_SOURCES = examples/cmpost.c +examples_ucmpost_LDADD = $(top_builddir)/src/libibcm.la \ + $(libdir)/libibverbs.la \ + $(libdir)/libibat.la libibcmincludedir = $(includedir)/infiniband Index: userspace/libibcm/examples/cmpost.c =================================================================== --- userspace/libibcm/examples/cmpost.c (revision 0) +++ userspace/libibcm/examples/cmpost.c (revision 0) @@ -0,0 +1,718 @@ +/* + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#if __BYTE_ORDER == __BIG_ENDIAN +static inline uint64_t cpu_to_be64(uint64_t x) { return x; } +static inline uint32_t cpu_to_be32(uint32_t x) { return x; } +#else +static inline uint64_t cpu_to_be64(uint64_t x) { return bswap_64(x); } +static inline uint32_t cpu_to_be32(uint32_t x) { return bswap_32(x); } +#endif + +/* + * To execute: + * Server: ucmpost + * Client: ucmpost server + */ + +struct cmtest { + struct ibv_device *device; + struct ibv_context *verbs; + struct ibv_pd *pd; + + /* cm info */ + struct ib_sa_path_rec path_rec; + + struct cmtest_node *nodes; + int conn_index; + int connects_left; + int disconnects_left; + + /* memory region info */ + struct ibv_mr *mr; + void *mem; +}; + +static struct cmtest test; +static int message_count = 10; +static int message_size = 100; +static int connections = 1; +static int is_server = 1; + +struct cmtest_node { + int id; + struct ibv_cq *cq; + struct ibv_qp *qp; + struct ib_cm_id *cm_id; + int connected; +}; + +static int post_recvs(struct cmtest_node *node) +{ + struct ibv_recv_wr recv_wr, *recv_failure; + struct ibv_sge sge; + int i, ret = 0; + + if (!message_count) + return 0; + + recv_wr.next = NULL; + recv_wr.sg_list = &sge; + recv_wr.num_sge = 1; + recv_wr.wr_id = (uintptr_t) node; + + sge.length = message_size; + sge.lkey = test.mr->lkey; + sge.addr = (uintptr_t) test.mem; + + for (i = 0; i < message_count && !ret; i++ ) { + ret = ibv_post_recv(node->qp, &recv_wr, &recv_failure); + if (ret) { + printf("failed to post receives: %d\n", ret); + break; + } + } + return ret; +} + +static int modify_to_rtr(struct cmtest_node *node) +{ + struct ibv_qp_attr qp_attr; + int qp_attr_mask, ret; + + qp_attr.qp_state = IBV_QPS_INIT; + ret = ib_cm_init_qp_attr(node->cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + printf("failed to init QP attr for INIT: %d\n", ret); + return ret; + } + ret = 
ibv_modify_qp(node->qp, &qp_attr, qp_attr_mask); + if (ret) { + printf("failed to modify QP to INIT: %d\n", ret); + return ret; + } + qp_attr.qp_state = IBV_QPS_RTR; + ret = ib_cm_init_qp_attr(node->cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + printf("failed to init QP attr for RTR: %d\n", ret); + return ret; + } + qp_attr.rq_psn = node->qp->qp_num; + ret = ibv_modify_qp(node->qp, &qp_attr, qp_attr_mask); + if (ret) { + printf("failed to modify QP to RTR: %d\n", ret); + return ret; + } + return 0; +} + +static int modify_to_rts(struct cmtest_node *node) +{ + struct ibv_qp_attr qp_attr; + int qp_attr_mask, ret; + + qp_attr.qp_state = IBV_QPS_RTS; + ret = ib_cm_init_qp_attr(node->cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + printf("failed to init QP attr for RTS: %d\n", ret); + return ret; + } + ret = ibv_modify_qp(node->qp, &qp_attr, qp_attr_mask); + if (ret) { + printf("failed to modify QP to RTS: %d\n", ret); + return ret; + } + return 0; +} + +static void req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct cmtest_node *node; + struct ib_cm_req_event_param *req; + struct ib_cm_rep_param rep; + int ret; + + if (test.conn_index == connections) + goto error1; + node = &test.nodes[test.conn_index++]; + + node->cm_id = cm_id; + cm_id->context = node; + + ret = modify_to_rtr(node); + if (ret) + goto error2; + + ret = post_recvs(node); + if (ret) + goto error2; + + req = &event->param.req_rcvd; + memset(&rep, 0, sizeof rep); + rep.qp_num = node->qp->qp_num; + rep.srq = (node->qp->srq != NULL); + rep.starting_psn = node->qp->qp_num; + rep.responder_resources = req->responder_resources; + rep.initiator_depth = req->initiator_depth; + rep.target_ack_delay = 20; + rep.flow_control = req->flow_control; + rep.rnr_retry_count = req->rnr_retry_count; + + ret = ib_cm_send_rep(cm_id, &rep); + if (ret) { + printf("failed to send CM REP: %d\n", ret); + goto error2; + } + return; +error2: + test.disconnects_left--; + test.connects_left--; +error1: + 
printf("failing connection request\n"); + ib_cm_send_rej(cm_id, IB_CM_REJ_UNSUPPORTED, NULL, 0, NULL, 0); +} + +static void rep_handler(struct cmtest_node *node, struct ib_cm_event *event) +{ + int ret; + + ret = modify_to_rtr(node); + if (ret) + goto error; + + ret = modify_to_rts(node); + if (ret) + goto error; + + ret = post_recvs(node); + if (ret) + goto error; + + ret = ib_cm_send_rtu(node->cm_id, NULL, 0); + if (ret) { + printf("failed to send CM RTU: %d\n", ret); + goto error; + } + node->connected = 1; + test.connects_left--; + return; +error: + printf("failing connection reply\n"); + ib_cm_send_rej(node->cm_id, IB_CM_REJ_UNSUPPORTED, NULL, 0, NULL, 0); + test.disconnects_left--; + test.connects_left--; +} + +static void rtu_handler(struct cmtest_node *node) +{ + int ret; + + ret = modify_to_rts(node); + if (ret) + goto error; + + node->connected = 1; + test.connects_left--; + return; +error: + printf("aborting connection - disconnecting\n"); + ib_cm_send_dreq(node->cm_id, NULL, 0); + test.disconnects_left--; + test.connects_left--; +} + +static void cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct cmtest_node *node = cm_id->context; + + switch (event->event) { + case IB_CM_REQ_RECEIVED: + req_handler(cm_id, event); + break; + case IB_CM_REP_RECEIVED: + rep_handler(node, event); + break; + case IB_CM_RTU_RECEIVED: + rtu_handler(node); + break; + case IB_CM_DREQ_RECEIVED: + node->connected = 0; + ib_cm_send_drep(node->cm_id, NULL, 0); + test.disconnects_left--; + break; + case IB_CM_DREP_RECEIVED: + test.disconnects_left--; + break; + case IB_CM_REJ_RECEIVED: + printf("Received REJ\n"); + /* fall through */ + case IB_CM_REQ_ERROR: + case IB_CM_REP_ERROR: + printf("Error sending REQ or REP\n"); + test.disconnects_left--; + test.connects_left--; + break; + case IB_CM_DREQ_ERROR: + test.disconnects_left--; + printf("Error sending DREQ\n"); + break; + default: + break; + } +} + +static int init_node(struct cmtest_node *node, struct 
ibv_qp_init_attr *qp_attr) +{ + int cqe, ret; + + if (!is_server) { + ret = ib_cm_create_id(&node->cm_id, node); + if (ret) { + printf("failed to create cm_id: %d\n", ret); + return ret; + } + } + + cqe = message_count ? message_count * 2 : 2; + node->cq = ibv_create_cq(test.verbs, cqe, node); + if (!node->cq) { + printf("unable to create CQ\n"); + goto error1; + } + + qp_attr->send_cq = node->cq; + qp_attr->recv_cq = node->cq; + node->qp = ibv_create_qp(test.pd, qp_attr); + if (!node->qp) { + printf("unable to create QP\n"); + goto error2; + } + return 0; +error2: + ibv_destroy_cq(node->cq); +error1: + if (!is_server) + ib_cm_destroy_id(node->cm_id); + return -1; +} + +static void destroy_node(struct cmtest_node *node) +{ + ibv_destroy_qp(node->qp); + ibv_destroy_cq(node->cq); + if (node->cm_id) + ib_cm_destroy_id(node->cm_id); +} + +static int create_nodes(void) +{ + struct ibv_qp_init_attr qp_attr; + int ret, i; + + test.nodes = malloc(sizeof *test.nodes * connections); + if (!test.nodes) { + printf("unable to allocate memory for test nodes\n"); + return -1; + } + memset(test.nodes, 0, sizeof *test.nodes * connections); + + memset(&qp_attr, 0, sizeof qp_attr); + qp_attr.cap.max_send_wr = message_count ? message_count : 1; + qp_attr.cap.max_recv_wr = message_count ? 
message_count : 1; + qp_attr.cap.max_send_sge = 1; + qp_attr.cap.max_recv_sge = 1; + qp_attr.qp_type = IBV_QPT_RC; + + for (i = 0; i < connections; i++) { + test.nodes[i].id = i; + ret = init_node(&test.nodes[i], &qp_attr); + if (ret) + goto error; + } + return 0; +error: + while (--i >= 0) + destroy_node(&test.nodes[i]); + free(test.nodes); + return ret; +} + +static void destroy_nodes(void) +{ + int i; + + for (i = 0; i < connections; i++) + destroy_node(&test.nodes[i]); + free(test.nodes); +} + +static int create_messages(void) +{ + if (!message_size) + message_count = 0; + + if (!message_count) + return 0; + + test.mem = malloc(message_size); + if (!test.mem) { + printf("failed message allocation\n"); + return -1; + } + test.mr = ibv_reg_mr(test.pd, test.mem, message_size, + IBV_ACCESS_LOCAL_WRITE); + if (!test.mr) { + printf("failed to reg MR\n"); + goto err; + } + return 0; +err: + free(test.mem); + return -1; +} + +static void destroy_messages(void) +{ + if (!message_count) + return; + + ibv_dereg_mr(test.mr); + free(test.mem); +} + +static int init(void) +{ + struct dlist *dev_list; + int ret; + + test.connects_left = connections; + test.disconnects_left = connections; + + dev_list = ibv_get_devices(); + dlist_start(dev_list); + test.device = dlist_next(dev_list); + if (!test.device) + return -1; + + test.verbs = ibv_open_device(test.device); + if (!test.verbs) + return -1; + + test.pd = ibv_alloc_pd(test.verbs); + if (!test.pd) { + printf("failed to alloc PD\n"); + return -1; + } + ret = create_messages(); + if (ret) { + printf("unable to create test messages\n"); + goto error1; + } + ret = create_nodes(); + if (ret) { + printf("unable to create test nodes\n"); + goto error2; + } + return 0; +error2: + destroy_messages(); +error1: + ibv_dealloc_pd(test.pd); + return -1; +} + +static void cleanup(void) +{ + destroy_nodes(); + destroy_messages(); + ibv_dealloc_pd(test.pd); +} + +static int send_msgs(void) +{ + struct ibv_send_wr send_wr, *bad_send_wr; + 
struct ibv_sge sge; + int i, m, ret; + + send_wr.next = NULL; + send_wr.sg_list = &sge; + send_wr.num_sge = 1; + send_wr.opcode = IBV_WR_SEND; + send_wr.send_flags = IBV_SEND_SIGNALED; + send_wr.wr_id = 0; + + sge.addr = (uintptr_t) test.mem; + sge.length = message_size; + sge.lkey = test.mr->lkey; + + for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + for (m = 0; m < message_count; m++) { + ret = ibv_post_send(test.nodes[i].qp, &send_wr, + &bad_send_wr); + if (ret) + return ret; + } + } + return 0; +} + +static int poll_cqs(void) +{ + struct ibv_wc wc[8]; + int done, i, ret; + + for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + for (done = 0; done < message_count; done += ret) { + ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + if (ret < 0) { + printf("failed polling CQ: %d\n", ret); + return ret; + } + } + } + return 0; +} + +static void connect_events(void) +{ + struct ib_cm_event *event; + int err = 0; + + while (test.connects_left && !err) { + err = ib_cm_event_get(&event); + if (!err) { + cm_handler(event->cm_id, event); + ib_cm_event_put(event); + } + } +} + +static void disconnect_events(void) +{ + struct ib_cm_event *event; + int err = 0; + + while (test.disconnects_left && !err) { + err = ib_cm_event_get(&event); + if (!err) { + cm_handler(event->cm_id, event); + ib_cm_event_put(event); + } + } +} + +static void run_server(void) +{ + struct ib_cm_id *listen_id; + int i, ret; + + printf("starting server\n"); + if (ib_cm_create_id(&listen_id, &test)) { + printf("listen request failed\n"); + return; + } + ret = ib_cm_listen(listen_id, cpu_to_be64(0x1000), 0); + if (ret) { + printf("failure trying to listen: %d\n", ret); + goto out; + } + + connect_events(); + + if (message_count) { + printf("initiating data transfers\n"); + if (send_msgs()) + goto out; + printf("receiving data transfers\n"); + if (poll_cqs()) + goto out; + printf("data transfers complete\n"); + } + + printf("disconnecting\n"); 
+ for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + test.nodes[i].connected = 0; + ib_cm_send_dreq(test.nodes[i].cm_id, NULL, 0); + } + disconnect_events(); + printf("disconnected\n"); +out: + ib_cm_destroy_id(listen_id); +} + +static void at_callback(uint64_t req_id, void *context, int rec_num) +{ +} + +static int query_for_path(char *dest) +{ + struct ib_at_ib_route route; + struct ib_at_completion comp; + struct addrinfo *res; + int ret; + + ret = getaddrinfo(dest, NULL, NULL, &res); + if (ret) { + printf("getaddrinfo failed - invalid hostname or IP address\n"); + return ret; + } + + if (res->ai_family != PF_INET) { + ret = -1; + goto out; + } + + comp.fn = at_callback; + ret = ib_at_route_by_ip(((struct sockaddr_in *)res->ai_addr)->sin_addr.s_addr, + 0, 0, 0, &route, &comp, NULL); + if (ret < 0) { + printf("ib_at_route_by_ip failed: %d\n", ret); + goto out; + } + + if (!ret) { + ret = ib_at_callback_get(); + if (ret) { + printf("ib_at_callback_get failed: %d\n", ret); + goto out; + } + } + + ret = ib_at_paths_by_route(&route, 0, &test.path_rec, 1, &comp, NULL); + if (ret < 0) { + printf("ib_at_paths_by_route failed: %d\n", ret); + goto out; + } + + if (!ret) { + ret = ib_at_callback_get(); + if (ret) + printf("ib_at_callback_get failed: %d\n", ret); + } else + ret = 0; + +out: + freeaddrinfo(res); + return ret; +} + +static void run_client(char *dest) +{ + struct ib_cm_req_param req; + int i, ret; + + printf("starting client\n"); + ret = query_for_path(dest); + if (ret) { + printf("failed path record query: %d\n", ret); + return; + } + + memset(&req, 0, sizeof req); + req.primary_path = &test.path_rec; + req.service_id = cpu_to_be64(0x1000); + req.responder_resources = 1; + req.initiator_depth = 1; + req.remote_cm_response_timeout = 20; + req.local_cm_response_timeout = 20; + req.retry_count = 5; + req.max_cm_retries = 5; + + printf("connecting\n"); + for (i = 0; i < connections; i++) { + req.qp_num = test.nodes[i].qp->qp_num; 
+ req.qp_type = IBV_QPT_RC; + req.srq = (test.nodes[i].qp->srq != NULL); + req.starting_psn = test.nodes[i].qp->qp_num; + ret = ib_cm_send_req(test.nodes[i].cm_id, &req); + if (ret) { + printf("failure sending REQ: %d\n", ret); + return; + } + } + + connect_events(); + + if (message_count) { + printf("receiving data transfers\n"); + if (poll_cqs()) + goto out; + printf("initiating data transfers\n"); + if (send_msgs()) + goto out; + printf("data transfers complete\n"); + } +out: + disconnect_events(); +} + +int main(int argc, char **argv) +{ + if (argc != 1 && argc != 2) { + printf("usage: %s [server_addr]\n", argv[0]); + exit(1); + } + + is_server = (argc == 1); + if (init()) + exit(1); + + if (is_server) + run_server(); + else + run_client(argv[1]); + + printf("test complete\n"); + cleanup(); + return 0; +} Index: userspace/libibcm/examples/simple.c =================================================================== --- userspace/libibcm/examples/simple.c (revision 3124) +++ userspace/libibcm/examples/simple.c (working copy) @@ -58,7 +58,7 @@ static inline uint64_t cpu_to_be64(uint6 #define TEST_SID 0x0000000ff0000000ULL -static int cm_connect(uint32_t cm_id) +static int cm_connect(struct ib_cm_id *cm_id) { struct ib_cm_req_param param; struct ib_sa_path_rec sa; @@ -108,8 +108,8 @@ static int cm_connect(uint32_t cm_id) src->global.subnet_prefix = cpu_to_be64(0xfe80000000000000ULL); dst->global.subnet_prefix = cpu_to_be64(0xfe80000000000000ULL); - src->global.interface_id = cpu_to_be64(0x0002c90200002179ULL); - dst->global.interface_id = cpu_to_be64(0x0005ad000001296cULL); + src->global.interface_id = cpu_to_be64(0x0002c90107fc5e11ULL); + dst->global.interface_id = cpu_to_be64(0x0002c90107fc5eb1ULL); return ib_cm_send_req(cm_id, &param); } @@ -118,7 +118,7 @@ int main(int argc, char **argv) { struct ib_cm_event *event; struct ib_cm_rep_param rep; - int cm_id; + struct ib_cm_id *cm_id; int result; int param_c = 0; @@ -137,8 +137,8 @@ int main(int argc, char **argv)
exit(1); } - result = ib_cm_create_id(&cm_id); - if (result < 0) { + result = ib_cm_create_id(&cm_id, NULL); + if (result) { printf("Error creating CM ID <%d:%d>\n", result, errno); goto done; } @@ -146,16 +146,16 @@ int main(int argc, char **argv) if (mode) { result = cm_connect(cm_id); if (result) { - printf("Error <%d:%d> sending REQ <%d>\n", - result, errno, cm_id); + printf("Error <%d:%d> sending REQ\n", + result, errno); goto done; } } else { result = ib_cm_listen(cm_id, TEST_SID, 0); if (result) { - printf("Error <%d:%d> listening <%d>\n", - result, errno, cm_id); + printf("Error <%d:%d> listening\n", + result, errno); goto done; } } @@ -169,7 +169,7 @@ int main(int argc, char **argv) goto done; } - printf("CM ID <%d> Event <%d>\n", event->cm_id, event->event); + printf("CM ID <%p> Event <%d>\n", event->cm_id, event->event); switch (event->event) { case IB_CM_REQ_RECEIVED: @@ -264,4 +264,3 @@ int main(int argc, char **argv) done: return 0; } - Index: linux-kernel/infiniband/include/ib_user_cm.h =================================================================== --- linux-kernel/infiniband/include/ib_user_cm.h (revision 3109) +++ linux-kernel/infiniband/include/ib_user_cm.h (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -37,7 +38,7 @@ #include -#define IB_USER_CM_ABI_VERSION 1 +#define IB_USER_CM_ABI_VERSION 2 enum { IB_USER_CM_CMD_CREATE_ID, @@ -60,6 +61,7 @@ enum { IB_USER_CM_CMD_SEND_SIDR_REP, IB_USER_CM_CMD_EVENT, + IB_USER_CM_CMD_INIT_QP_ATTR, }; /* * command ABI structures. 
@@ -71,6 +73,7 @@ struct ib_ucm_cmd_hdr { }; struct ib_ucm_create_id { + __u64 uid; __u64 response; }; @@ -79,9 +82,14 @@ struct ib_ucm_create_id_resp { }; struct ib_ucm_destroy_id { + __u64 response; __u32 id; }; +struct ib_ucm_destroy_id_resp { + __u32 events_reported; +}; + struct ib_ucm_attr_id { __u64 response; __u32 id; @@ -94,6 +102,64 @@ struct ib_ucm_attr_id_resp { __be32 remote_id; }; +struct ib_ucm_init_qp_attr { + __u64 response; + __u32 id; + __u32 qp_state; +}; + +struct ib_ucm_ah_attr { + __u8 grh_dgid[16]; + __u32 grh_flow_label; + __u16 dlid; + __u16 reserved; + __u8 grh_sgid_index; + __u8 grh_hop_limit; + __u8 grh_traffic_class; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; +}; + +struct ib_ucm_init_qp_attr_resp { + __u32 qp_attr_mask; + __u32 qp_state; + __u32 cur_qp_state; + __u32 path_mtu; + __u32 path_mig_state; + __u32 qkey; + __u32 rq_psn; + __u32 sq_psn; + __u32 dest_qp_num; + __u32 qp_access_flags; + + struct ib_ucm_ah_attr ah_attr; + struct ib_ucm_ah_attr alt_ah_attr; + + /* ib_qp_cap */ + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; + + __u16 pkey_index; + __u16 alt_pkey_index; + __u8 en_sqd_async_notify; + __u8 sq_draining; + __u8 max_rd_atomic; + __u8 max_dest_rd_atomic; + __u8 min_rnr_timer; + __u8 port_num; + __u8 timeout; + __u8 retry_cnt; + __u8 rnr_retry; + __u8 alt_port_num; + __u8 alt_timeout; +}; + struct ib_ucm_listen { __be64 service_id; __be64 service_mask; @@ -157,6 +223,7 @@ struct ib_ucm_req { }; struct ib_ucm_rep { + __u64 uid; __u64 data; __u32 id; __u32 qpn; @@ -232,7 +299,6 @@ struct ib_ucm_event_get { }; struct ib_ucm_req_event_resp { - __u32 listen_id; /* device */ /* port */ struct ib_ucm_path_rec primary_path; @@ -287,7 +353,6 @@ struct ib_ucm_apr_event_resp { }; struct ib_ucm_sidr_req_event_resp { - __u32 listen_id; /* device */ /* port */ __u16 pkey; @@ -307,6 +372,7 @@ struct ib_ucm_sidr_rep_event_resp { 
#define IB_UCM_PRES_ALTERNATE 0x08 struct ib_ucm_event_resp { + __u64 uid; __u32 id; __u32 event; __u32 present; Index: linux-kernel/infiniband/core/ucm.c =================================================================== --- linux-kernel/infiniband/core/ucm.c (revision 3109) +++ linux-kernel/infiniband/core/ucm.c (working copy) @@ -72,7 +72,6 @@ enum { static struct semaphore ctx_id_mutex; static struct idr ctx_id_table; -static int ctx_id_rover = 0; static struct ib_ucm_context *ib_ucm_ctx_get(struct ib_ucm_file *file, int id) { @@ -97,33 +96,16 @@ static void ib_ucm_ctx_put(struct ib_ucm wake_up(&ctx->wait); } -static ssize_t ib_ucm_destroy_ctx(struct ib_ucm_file *file, int id) +static inline int ib_ucm_new_cm_id(int event) { - struct ib_ucm_context *ctx; - struct ib_ucm_event *uevent; - - down(&ctx_id_mutex); - ctx = idr_find(&ctx_id_table, id); - if (!ctx) - ctx = ERR_PTR(-ENOENT); - else if (ctx->file != file) - ctx = ERR_PTR(-EINVAL); - else - idr_remove(&ctx_id_table, ctx->id); - up(&ctx_id_mutex); - - if (IS_ERR(ctx)) - return PTR_ERR(ctx); - - atomic_dec(&ctx->ref); - wait_event(ctx->wait, !atomic_read(&ctx->ref)); + return event == IB_CM_REQ_RECEIVED || event == IB_CM_SIDR_REQ_RECEIVED; +} - /* No new events will be generated after destroying the cm_id. */ - if (!IS_ERR(ctx->cm_id)) - ib_destroy_cm_id(ctx->cm_id); +static void ib_ucm_cleanup_events(struct ib_ucm_context *ctx) +{ + struct ib_ucm_event *uevent; - /* Cleanup events not yet reported to the user. */ - down(&file->mutex); + down(&ctx->file->mutex); list_del(&ctx->file_list); while (!list_empty(&ctx->events)) { @@ -133,15 +115,12 @@ static ssize_t ib_ucm_destroy_ctx(struct list_del(&uevent->ctx_list); /* clear incoming connections. 
*/ - if (uevent->cm_id) + if (ib_ucm_new_cm_id(uevent->resp.event)) ib_destroy_cm_id(uevent->cm_id); kfree(uevent); } - up(&file->mutex); - - kfree(ctx); - return 0; + up(&ctx->file->mutex); } static struct ib_ucm_context *ib_ucm_ctx_alloc(struct ib_ucm_file *file) @@ -153,36 +132,31 @@ static struct ib_ucm_context *ib_ucm_ctx if (!ctx) return NULL; + memset(ctx, 0, sizeof *ctx); atomic_set(&ctx->ref, 1); init_waitqueue_head(&ctx->wait); ctx->file = file; - INIT_LIST_HEAD(&ctx->events); - list_add_tail(&ctx->file_list, &file->ctxs); - - ctx_id_rover = (ctx_id_rover + 1) & INT_MAX; -retry: - result = idr_pre_get(&ctx_id_table, GFP_KERNEL); - if (!result) - goto error; + do { + result = idr_pre_get(&ctx_id_table, GFP_KERNEL); + if (!result) + goto error; + + down(&ctx_id_mutex); + result = idr_get_new(&ctx_id_table, ctx, &ctx->id); + up(&ctx_id_mutex); + } while (result == -EAGAIN); - down(&ctx_id_mutex); - result = idr_get_new_above(&ctx_id_table, ctx, ctx_id_rover, &ctx->id); - up(&ctx_id_mutex); - - if (result == -EAGAIN) - goto retry; if (result) goto error; + list_add_tail(&ctx->file_list, &file->ctxs); ucm_dbg("Allocated CM ID <%d>\n", ctx->id); - return ctx; + error: - list_del(&ctx->file_list); kfree(ctx); - return NULL; } /* @@ -219,12 +193,9 @@ static void ib_ucm_event_path_get(struct kpath->packet_life_time_selector; } -static void ib_ucm_event_req_get(struct ib_ucm_context *ctx, - struct ib_ucm_req_event_resp *ureq, +static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq, struct ib_cm_req_event_param *kreq) { - ureq->listen_id = ctx->id; - ureq->remote_ca_guid = kreq->remote_ca_guid; ureq->remote_qkey = kreq->remote_qkey; ureq->remote_qpn = kreq->remote_qpn; @@ -259,14 +230,6 @@ static void ib_ucm_event_rep_get(struct urep->srq = krep->srq; } -static void ib_ucm_event_sidr_req_get(struct ib_ucm_context *ctx, - struct ib_ucm_sidr_req_event_resp *ureq, - struct ib_cm_sidr_req_event_param *kreq) -{ - ureq->listen_id = ctx->id; - ureq->pkey = 
kreq->pkey; -} - static void ib_ucm_event_sidr_rep_get(struct ib_ucm_sidr_rep_event_resp *urep, struct ib_cm_sidr_rep_event_param *krep) { @@ -275,15 +238,14 @@ static void ib_ucm_event_sidr_rep_get(st urep->qpn = krep->qpn; }; -static int ib_ucm_event_process(struct ib_ucm_context *ctx, - struct ib_cm_event *evt, +static int ib_ucm_event_process(struct ib_cm_event *evt, struct ib_ucm_event *uvt) { void *info = NULL; switch (evt->event) { case IB_CM_REQ_RECEIVED: - ib_ucm_event_req_get(ctx, &uvt->resp.u.req_resp, + ib_ucm_event_req_get(&uvt->resp.u.req_resp, &evt->param.req_rcvd); uvt->data_len = IB_CM_REQ_PRIVATE_DATA_SIZE; uvt->resp.present = IB_UCM_PRES_PRIMARY; @@ -331,8 +293,8 @@ static int ib_ucm_event_process(struct i info = evt->param.apr_rcvd.apr_info; break; case IB_CM_SIDR_REQ_RECEIVED: - ib_ucm_event_sidr_req_get(ctx, &uvt->resp.u.sidr_req_resp, - &evt->param.sidr_req_rcvd); + uvt->resp.u.sidr_req_resp.pkey = + evt->param.sidr_req_rcvd.pkey; uvt->data_len = IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE; break; case IB_CM_SIDR_REP_RECEIVED: @@ -378,31 +340,24 @@ static int ib_ucm_event_handler(struct i struct ib_ucm_event *uevent; struct ib_ucm_context *ctx; int result = 0; - int id; ctx = cm_id->context; - if (event->event == IB_CM_REQ_RECEIVED || - event->event == IB_CM_SIDR_REQ_RECEIVED) - id = IB_UCM_CM_ID_INVALID; - else - id = ctx->id; - uevent = kmalloc(sizeof(*uevent), GFP_KERNEL); if (!uevent) goto err1; memset(uevent, 0, sizeof(*uevent)); - uevent->resp.id = id; + uevent->ctx = ctx; + uevent->cm_id = cm_id; + uevent->resp.uid = ctx->uid; + uevent->resp.id = ctx->id; uevent->resp.event = event->event; - result = ib_ucm_event_process(ctx, event, uevent); + result = ib_ucm_event_process(event, uevent); if (result) goto err2; - uevent->ctx = ctx; - uevent->cm_id = (id == IB_UCM_CM_ID_INVALID) ? 
cm_id : NULL; - down(&ctx->file->mutex); list_add_tail(&uevent->file_list, &ctx->file->events); list_add_tail(&uevent->ctx_list, &ctx->events); @@ -414,7 +369,7 @@ err2: kfree(uevent); err1: /* Destroy new cm_id's */ - return (id == IB_UCM_CM_ID_INVALID); + return ib_ucm_new_cm_id(event->event); } static ssize_t ib_ucm_event(struct ib_ucm_file *file, @@ -423,7 +378,7 @@ static ssize_t ib_ucm_event(struct ib_uc { struct ib_ucm_context *ctx; struct ib_ucm_event_get cmd; - struct ib_ucm_event *uevent = NULL; + struct ib_ucm_event *uevent; int result = 0; DEFINE_WAIT(wait); @@ -436,7 +391,6 @@ static ssize_t ib_ucm_event(struct ib_uc * wait */ down(&file->mutex); - while (list_empty(&file->events)) { if (file->filp->f_flags & O_NONBLOCK) { @@ -463,21 +417,18 @@ static ssize_t ib_ucm_event(struct ib_uc uevent = list_entry(file->events.next, struct ib_ucm_event, file_list); - if (!uevent->cm_id) - goto user; + if (ib_ucm_new_cm_id(uevent->resp.event)) { + ctx = ib_ucm_ctx_alloc(file); + if (!ctx) { + result = -ENOMEM; + goto done; + } - ctx = ib_ucm_ctx_alloc(file); - if (!ctx) { - result = -ENOMEM; - goto done; + ctx->cm_id = uevent->cm_id; + ctx->cm_id->context = ctx; + uevent->resp.id = ctx->id; } - ctx->cm_id = uevent->cm_id; - ctx->cm_id->context = ctx; - - uevent->resp.id = ctx->id; - -user: if (copy_to_user((void __user *)(unsigned long)cmd.response, &uevent->resp, sizeof(uevent->resp))) { result = -EFAULT; @@ -485,12 +436,10 @@ user: } if (uevent->data) { - if (cmd.data_len < uevent->data_len) { result = -ENOMEM; goto done; } - if (copy_to_user((void __user *)(unsigned long)cmd.data, uevent->data, uevent->data_len)) { result = -EFAULT; @@ -499,12 +448,10 @@ user: } if (uevent->info) { - if (cmd.info_len < uevent->info_len) { result = -ENOMEM; goto done; } - if (copy_to_user((void __user *)(unsigned long)cmd.info, uevent->info, uevent->info_len)) { result = -EFAULT; @@ -514,6 +461,7 @@ user: list_del(&uevent->file_list); list_del(&uevent->ctx_list); + 
uevent->ctx->events_reported++; kfree(uevent->data); kfree(uevent->info); @@ -545,6 +493,7 @@ static ssize_t ib_ucm_create_id(struct i if (!ctx) return -ENOMEM; + ctx->uid = cmd.uid; ctx->cm_id = ib_create_cm_id(ib_ucm_event_handler, ctx); if (IS_ERR(ctx->cm_id)) { result = PTR_ERR(ctx->cm_id); @@ -561,7 +510,14 @@ static ssize_t ib_ucm_create_id(struct i return 0; err: - ib_ucm_destroy_ctx(file, ctx->id); + down(&ctx_id_mutex); + idr_remove(&ctx_id_table, ctx->id); + up(&ctx_id_mutex); + + if (!IS_ERR(ctx->cm_id)) + ib_destroy_cm_id(ctx->cm_id); + + kfree(ctx); return result; } @@ -570,11 +526,44 @@ static ssize_t ib_ucm_destroy_id(struct int in_len, int out_len) { struct ib_ucm_destroy_id cmd; + struct ib_ucm_destroy_id_resp resp; + struct ib_ucm_context *ctx; + int result = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; - return ib_ucm_destroy_ctx(file, cmd.id); + down(&ctx_id_mutex); + ctx = idr_find(&ctx_id_table, cmd.id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + idr_remove(&ctx_id_table, ctx->id); + up(&ctx_id_mutex); + + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + atomic_dec(&ctx->ref); + wait_event(ctx->wait, !atomic_read(&ctx->ref)); + + /* No new events will be generated after destroying the cm_id. */ + ib_destroy_cm_id(ctx->cm_id); + /* Cleanup events not yet reported to the user. 
*/ + ib_ucm_cleanup_events(ctx); + + resp.events_reported = ctx->events_reported; + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + result = -EFAULT; + + kfree(ctx); + return result; } static ssize_t ib_ucm_attr_id(struct ib_ucm_file *file, @@ -609,6 +598,98 @@ static ssize_t ib_ucm_attr_id(struct ib_ return result; } +static void ib_ucm_copy_ah_attr(struct ib_ucm_ah_attr *dest_attr, + struct ib_ah_attr *src_attr) +{ + memcpy(dest_attr->grh_dgid, src_attr->grh.dgid.raw, + sizeof src_attr->grh.dgid); + dest_attr->grh_flow_label = src_attr->grh.flow_label; + dest_attr->grh_sgid_index = src_attr->grh.sgid_index; + dest_attr->grh_hop_limit = src_attr->grh.hop_limit; + dest_attr->grh_traffic_class = src_attr->grh.traffic_class; + + dest_attr->dlid = src_attr->dlid; + dest_attr->sl = src_attr->sl; + dest_attr->src_path_bits = src_attr->src_path_bits; + dest_attr->static_rate = src_attr->static_rate; + dest_attr->is_global = (src_attr->ah_flags & IB_AH_GRH); + dest_attr->port_num = src_attr->port_num; +} + +static void ib_ucm_copy_qp_attr(struct ib_ucm_init_qp_attr_resp *dest_attr, + struct ib_qp_attr *src_attr) +{ + dest_attr->cur_qp_state = src_attr->cur_qp_state; + dest_attr->path_mtu = src_attr->path_mtu; + dest_attr->path_mig_state = src_attr->path_mig_state; + dest_attr->qkey = src_attr->qkey; + dest_attr->rq_psn = src_attr->rq_psn; + dest_attr->sq_psn = src_attr->sq_psn; + dest_attr->dest_qp_num = src_attr->dest_qp_num; + dest_attr->qp_access_flags = src_attr->qp_access_flags; + + dest_attr->max_send_wr = src_attr->cap.max_send_wr; + dest_attr->max_recv_wr = src_attr->cap.max_recv_wr; + dest_attr->max_send_sge = src_attr->cap.max_send_sge; + dest_attr->max_recv_sge = src_attr->cap.max_recv_sge; + dest_attr->max_inline_data = src_attr->cap.max_inline_data; + + ib_ucm_copy_ah_attr(&dest_attr->ah_attr, &src_attr->ah_attr); + ib_ucm_copy_ah_attr(&dest_attr->alt_ah_attr, &src_attr->alt_ah_attr); + + dest_attr->pkey_index = 
src_attr->pkey_index;
+	dest_attr->alt_pkey_index = src_attr->alt_pkey_index;
+	dest_attr->en_sqd_async_notify = src_attr->en_sqd_async_notify;
+	dest_attr->sq_draining = src_attr->sq_draining;
+	dest_attr->max_rd_atomic = src_attr->max_rd_atomic;
+	dest_attr->max_dest_rd_atomic = src_attr->max_dest_rd_atomic;
+	dest_attr->min_rnr_timer = src_attr->min_rnr_timer;
+	dest_attr->port_num = src_attr->port_num;
+	dest_attr->timeout = src_attr->timeout;
+	dest_attr->retry_cnt = src_attr->retry_cnt;
+	dest_attr->rnr_retry = src_attr->rnr_retry;
+	dest_attr->alt_port_num = src_attr->alt_port_num;
+	dest_attr->alt_timeout = src_attr->alt_timeout;
+}
+
+static ssize_t ib_ucm_init_qp_attr(struct ib_ucm_file *file,
+				   const char __user *inbuf,
+				   int in_len, int out_len)
+{
+	struct ib_ucm_init_qp_attr_resp resp;
+	struct ib_ucm_init_qp_attr cmd;
+	struct ib_ucm_context *ctx;
+	struct ib_qp_attr qp_attr;
+	int result = 0;
+
+	if (out_len < sizeof(resp))
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+		return -EFAULT;
+
+	ctx = ib_ucm_ctx_get(file, cmd.id);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	resp.qp_attr_mask = 0;
+	memset(&qp_attr, 0, sizeof qp_attr);
+	qp_attr.qp_state = cmd.qp_state;
+	result = ib_cm_init_qp_attr(ctx->cm_id, &qp_attr, &resp.qp_attr_mask);
+	if (result)
+		goto out;
+
+	ib_ucm_copy_qp_attr(&resp, &qp_attr);
+
+	if (copy_to_user((void __user *)(unsigned long)cmd.response,
+			 &resp, sizeof(resp)))
+		result = -EFAULT;
+
+out:
+	ib_ucm_ctx_put(ctx);
+	return result;
+}
+
 static ssize_t ib_ucm_listen(struct ib_ucm_file *file,
 			     const char __user *inbuf,
 			     int in_len, int out_len)
@@ -808,6 +889,7 @@ static ssize_t ib_ucm_send_rep(struct ib
 	ctx = ib_ucm_ctx_get(file, cmd.id);
 	if (!IS_ERR(ctx)) {
+		ctx->uid = cmd.uid;
 		result = ib_send_cm_rep(ctx->cm_id, &param);
 		ib_ucm_ctx_put(ctx);
 	} else
@@ -1086,6 +1168,7 @@ static ssize_t (*ucm_cmd_table[])(struct
 	[IB_USER_CM_CMD_SEND_SIDR_REQ] = ib_ucm_send_sidr_req,
 	[IB_USER_CM_CMD_SEND_SIDR_REP] =
ib_ucm_send_sidr_rep, [IB_USER_CM_CMD_EVENT] = ib_ucm_event, + [IB_USER_CM_CMD_INIT_QP_ATTR] = ib_ucm_init_qp_attr, }; static ssize_t ib_ucm_write(struct file *filp, const char __user *buf, @@ -1161,12 +1244,18 @@ static int ib_ucm_close(struct inode *in down(&file->mutex); while (!list_empty(&file->ctxs)) { - ctx = list_entry(file->ctxs.next, struct ib_ucm_context, file_list); - up(&file->mutex); - ib_ucm_destroy_ctx(file, ctx->id); + + down(&ctx_id_mutex); + idr_remove(&ctx_id_table, ctx->id); + up(&ctx_id_mutex); + + ib_destroy_cm_id(ctx->cm_id); + ib_ucm_cleanup_events(ctx); + kfree(ctx); + down(&file->mutex); } up(&file->mutex); Index: linux-kernel/infiniband/core/ucm.h =================================================================== --- linux-kernel/infiniband/core/ucm.h (revision 3109) +++ linux-kernel/infiniband/core/ucm.h (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -43,8 +44,6 @@ #include #include -#define IB_UCM_CM_ID_INVALID 0xffffffff - struct ib_ucm_file { struct semaphore mutex; struct file *filp; @@ -58,9 +57,11 @@ struct ib_ucm_context { int id; wait_queue_head_t wait; atomic_t ref; + int events_reported; struct ib_ucm_file *file; struct ib_cm_id *cm_id; + __u64 uid; struct list_head events; /* list of pending events. */ struct list_head file_list; /* member in file ctx list */ @@ -71,16 +72,12 @@ struct ib_ucm_event { struct list_head file_list; /* member in file event list */ struct list_head ctx_list; /* member in ctx event list */ + struct ib_cm_id *cm_id; struct ib_ucm_event_resp resp; void *data; void *info; int data_len; int info_len; - /* - * new connection identifiers needs to be saved until - * userspace can get a handle on them. 
- */ - struct ib_cm_id *cm_id; }; #endif /* UCM_H */ From sean.hefty at intel.com Sat Aug 20 11:27:36 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 20 Aug 2005 11:27:36 -0700 Subject: [openib-general] [PATCH] [uDAPL] update to new uCM API In-Reply-To: Message-ID: >This patch updates uDAPL to the new uCM API. It only fixes the build >issues at this point and does not try to optimize for the use of the >new API. That will come in a later patch. FYI: I've sent a patch for the updated uCM three times, but I don't see where it's shown up on the openib mailing lists. Obviously, the uDAPL patch depends on that one. - Sean From hch at lst.de Sat Aug 20 11:42:54 2005 From: hch at lst.de (Christoph Hellwig) Date: Sat, 20 Aug 2005 20:42:54 +0200 Subject: [openib-general] OpenSM Coding style In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C306C6@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED607C306C6@mtlex01.yok.mtl.com> Message-ID: <20050820184254.GA2265@lst.de> Just allow new modules to be written in a readable style so people can understand at least parts of the mess that opensm is. From roland.list at gmail.com Sat Aug 20 13:09:47 2005 From: roland.list at gmail.com (Roland Dreier) Date: Sat, 20 Aug 2005 13:09:47 -0700 Subject: [openib-general] Re: [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: References: <35EA21F54A45CB47B879F21A91F4862F713CED@taurus.voltaire.com> Message-ID: [Resending because gmail replies to sender only by default] > The AT implementation was based on the code from SDP > I assume that similar changes as the ones you propose would need to > apply to SDP, or SDP would need to use the same lib as the other ULPs Yes, the current SDP code is very ugly. Obviously SDP will have to use the common connection code once it exists. - R. 
From roland.list at gmail.com Sat Aug 20 13:10:28 2005 From: roland.list at gmail.com (Roland Dreier) Date: Sat, 20 Aug 2005 13:10:28 -0700 Subject: [openib-general] RE: [openib-commits] r3137 -gen2/trunk/src/linux-kernel/infiniband/ulp/ipoib In-Reply-To: References: <5CE025EE7D88BA4599A2C8FEFCF226F5175BDA@taurus.voltaire.com> Message-ID: [Resending because I forgot to reply to all] > What happens when the port is not a full member of the partition (and only > a partial member) ? Is it just that the SA should reject those requests or > does some other failure occur ? I'm not sure I understand the question. This patch fixes the situation where one host has, say, 0xffff in its P_Key table and another host has 0x7fff. They should be able to talk to each other, but with the old code the second host would join the wrong broadcast group and not be able to exchange ARPs with the first host. - R. From abhinav.vishnu at gmail.com Sat Aug 20 15:14:38 2005 From: abhinav.vishnu at gmail.com (abhinav vishnu) Date: Sat, 20 Aug 2005 17:14:38 -0500 Subject: [openib-general] Query regarding posting unsignaled send descriptors In-Reply-To: <1124483700.3372.13.camel@nuthead> References: <1124483700.3372.13.camel@nuthead> Message-ID: <87aa148d0508201514135132ab@mail.gmail.com> Hi All, For one of my application, i was trying to use unsignaled send descriptors. Currently, the send flags are defined as follows in verbs.h include file: enum ibv_send_flags { IBV_SEND_FENCE = 1 << 0, IBV_SEND_SIGNALED = 1 << 1, IBV_SEND_SOLICITED = 1 << 2, IBV_SEND_INLINE = 1 << 3 }; There is another mechanism, which potentially allows to define whether send WQEs be signaled/unsignled. Following is the snippet of the data structure from the same include file. 
struct ibv_qp_init_attr {
	void			*qp_context;
	struct ibv_cq		*send_cq;
	struct ibv_cq		*recv_cq;
	struct ibv_srq		*srq;
	struct ibv_qp_cap	cap;
	enum ibv_qp_type	qp_type;
	int			sq_sig_all;
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
};

Using any of above alone or combinations with different values of
send_flags and sq_sig_all did not help in posting unsignaled descriptors.

Any help in this direction will be greatly appreciated.

Thanks,
-- Abhinav
Graduate Research Associate
Department of Computer Science and Engineering
The Ohio State University.

From roland.list at gmail.com Sat Aug 20 16:37:38 2005
From: roland.list at gmail.com (Roland Dreier)
Date: Sat, 20 Aug 2005 16:37:38 -0700
Subject: [openib-general] Query regarding posting unsignaled send descriptors
In-Reply-To: <87aa148d0508201514135132ab@mail.gmail.com>
References: <1124483700.3372.13.camel@nuthead> <87aa148d0508201514135132ab@mail.gmail.com>
Message-ID:

> For one of my application, i was trying to use unsignaled send descriptors.
> Currently, the send flags are defined as follows in verbs.h include file:

To post an unsignaled send, you must create the QP with sq_sig_all = 0.
Then make sure that IBV_SEND_SIGNALED is _not_ set in the work request
you post.

I'm pretty sure this works because people have reported problems with
buggy applications that never posted any signaled sends.

 - R.
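The reply above can be illustrated with a minimal sketch of the flag selection, assuming a QP that was created with sq_sig_all = 0. The flag values are copied from the enum ibv_send_flags quoted in this thread; the helper and every SKETCH_/sketch_ name are hypothetical and are not part of libibverbs. The helper clears the signaled bit on most work requests and sets it on every Nth one, on the assumption that an application still wants an occasional completion (the reply mentions problems when no signaled send is ever posted).

```c
/* Flag values copied from the enum ibv_send_flags quoted in this thread. */
enum sketch_send_flags {
	SKETCH_SEND_FENCE     = 1 << 0,
	SKETCH_SEND_SIGNALED  = 1 << 1,
	SKETCH_SEND_SOLICITED = 1 << 2,
	SKETCH_SEND_INLINE    = 1 << 3
};

/*
 * Hypothetical helper: choose the send flags for the n-th work request
 * (0-based) on a QP created with sq_sig_all = 0.
 *
 * The signaled bit is always cleared from base_flags first, then set
 * again only on every signal_every-th request. signal_every == 0 means
 * "never request a completion".
 */
static int sketch_pick_send_flags(unsigned n, unsigned signal_every,
				  int base_flags)
{
	int flags = base_flags & ~SKETCH_SEND_SIGNALED;

	if (signal_every && (n + 1) % signal_every == 0)
		flags |= SKETCH_SEND_SIGNALED;

	return flags;
}
```

The returned value would be stored in the send_flags field of the work request before it is posted.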
From boas1 at llnl.gov Sat Aug 20 19:00:29 2005
From: boas1 at llnl.gov (Bill Boas)
Date: Sat, 20 Aug 2005 19:00:29 -0700
Subject: [openib-general] Draft Aug 22 Workshop Program at a Glance and Session Speakers and Abstracts for your review.
Message-ID: <6.2.1.2.2.20050820173924.02e5cca0@mail-lc.llnl.gov>

All,

Here's a final draft of the program for the Workshop for your review and for your reference. The OpenIB Workshop does not contain the recent updates and should be ignored from here onwards.

These attachments reflect, I think, what the workshop schedule and content will be. Please review them and email if you see something untoward, spelled wrong, inaccurate or confusing.

When you pick up your badge you will have the opportunity to pick up one of each of these to keep with you all day.

Please keep to the times specified. One or more track leaders will introduce each session and cut off the speakers at the ending time so we stay on schedule. 8.00AM to 7.45PM is a long enough day without us getting behind the schedule.

The other documents you will be asked to pick up when you get your badge are 1) a survey so you can give OpenIB, IBTA and Intel feedback about any aspect of the workshop or other aspects of your interest in our communities working together, and 2) an invitation to those companies attending the workshop who also exhibit at SC|05 in Seattle Nov 12-15 to join the InfiniBand infrastructure in SCinet there.

If you have not registered yet, it's open to all and you can register on Monday, on-site. It's still just $250, and when you are there you can get a one day pass for IDF for another $250.

Thank you.

Bill.
Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 -------------- next part -------------- A non-text attachment was scrubbed... Name: Workshop at a Glance v1.2_08_20_05.doc Type: application/msword Size: 210944 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Program Content 08_20_05v1.1.doc Type: application/msword Size: 56832 bytes Desc: not available URL: From dotanb at mellanox.co.il Sat Aug 20 22:41:47 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 21 Aug 2005 08:41:47 +0300 Subject: [openib-general] Re: uVerbs: ibv_query_port Message-ID: <506C3D7B14CDD411A52C00025558DED60882CE60@mtlex01.yok.mtl.com> > > > Thanks, I checked in a fix for this. > > - R. > _______________________________________________ Hi. There are some more attributes that the verb ibv_query_port doesn't fill: max_mtu active_mtu max_vl_num bad_pkey_cntr subnet_timeout init_type_reply Can you please check this issue too? Thanks Dotan -------------- next part -------------- An HTML attachment was scrubbed... URL: From eitan at mellanox.co.il Sat Aug 20 23:56:46 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 21 Aug 2005 09:56:46 +0300 Subject: [openib-general] [PATCH] osm: osm_vendor_umad registers to all SubnMgt methods Message-ID: <86hddj22xd.fsf@mtl066.yok.mtl.com> Hi Hal In case of registering to a non SubnAdm class the umad vendor layer osm_vendor_bind() is registering to ALL the methods. This prevents from multiple clients of SubnMgt (for example) to use the code. OpenSM osm_sm_mad_ctrl.c actually sets the correct methods bits (except for registering as report processor - which it is not). So the patch below prevents the "blind" registration to all methods in case of a !SubnAdm osm_vendor_bind(). 
Thanks Eitan I tested the patch on : 2.6.12.3-smp SuSE Linux 9.3 (i586) Signed-off-by: Eitan Zahavi Index: osm/libvendor/osm_vendor_ibumad.c =================================================================== --- osm/libvendor/osm_vendor_ibumad.c (revision 3128) +++ osm/libvendor/osm_vendor_ibumad.c (working copy) @@ -708,25 +714,21 @@ osm_vendor_bind( p_bind->p_mad_pool = p_mad_pool; p_bind->port_guid = port_guid; - if (p_user_bind->mad_class != IB_MCLASS_SUBN_ADM) - memset(method_mask, 0xff, sizeof method_mask); /* accept all methods */ - else { - memset(method_mask, 0, sizeof method_mask); - if (p_user_bind->is_responder) { - set_bit(IB_MAD_METHOD_GET, &method_mask); - set_bit(IB_MAD_METHOD_SET, &method_mask); - set_bit(IB_MAD_METHOD_GETTABLE, &method_mask); - set_bit(IB_MAD_METHOD_DELETE, &method_mask); - /* Add in IB_MAD_METHOD_GETTRACETABLE */ - /* and IB_MAD_METHOD_GETMULTI when */ - /* supported by OpenSM */ - } - if (p_user_bind->is_report_processor) - set_bit(IB_MAD_METHOD_REPORT, &method_mask); - if (p_user_bind->is_trap_processor) { - set_bit(IB_MAD_METHOD_TRAP, &method_mask); - set_bit(IB_MAD_METHOD_TRAP_REPRESS, &method_mask); - } + memset(method_mask, 0, sizeof method_mask); + if (p_user_bind->is_responder) { + set_bit(IB_MAD_METHOD_GET, &method_mask); + set_bit(IB_MAD_METHOD_SET, &method_mask); + set_bit(IB_MAD_METHOD_GETTABLE, &method_mask); + set_bit(IB_MAD_METHOD_DELETE, &method_mask); + /* Add in IB_MAD_METHOD_GETTRACETABLE */ + /* and IB_MAD_METHOD_GETMULTI when */ + /* supported by OpenSM */ + } + if (p_user_bind->is_report_processor) + set_bit(IB_MAD_METHOD_REPORT, &method_mask); + if (p_user_bind->is_trap_processor) { + set_bit(IB_MAD_METHOD_TRAP, &method_mask); + set_bit(IB_MAD_METHOD_TRAP_REPRESS, &method_mask); } #ifndef VENDOR_RMPP_SUPPORT Index: osm/opensm/osm_sm_mad_ctrl.c =================================================================== --- osm/opensm/osm_sm_mad_ctrl.c (revision 3128) +++ osm/opensm/osm_sm_mad_ctrl.c (working copy) 
@@ -1015,7 +1015,7 @@ osm_sm_mad_ctrl_bind( } bind_info.class_version = 1; - bind_info.is_report_processor = TRUE; + bind_info.is_report_processor = FALSE; bind_info.is_responder = TRUE; bind_info.is_trap_processor = TRUE; bind_info.mad_class = IB_MCLASS_SUBN_DIR; From dotanb at mellanox.co.il Sun Aug 21 00:15:36 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 21 Aug 2005 10:15:36 +0300 Subject: [openib-general] executing the ibv_srq_pingpong with many QPs sometimes causes to a test failure Message-ID: <506C3D7B14CDD411A52C00025558DED60882CEA5@mtlex01.yok.mtl.com> Hi. I'm using gen2 driver svn revision 3137 on 2 Mellanox HCAs(23108) connected b2b (without any switch). When I'm executing the ibv_srq_pingpong with many QPs(255) each executable is a different HCA (on different node) there is a completion with error in the server side and a seg fault in the client side. I think that there is a race in the test because when I'm using 248 QPs sometimes there is a failure and sometimes the test passes. 
Here is the command line + output of each side: *************** Daemon side ******************* % ./ibv_srq_pingpong --port=19872 --ib-dev=mthca0 --ib-port=1 --num-qp=255 remote address: LID 0x0002, QPN 0xf604fe, PSN 0x992558 remote address: LID 0x0002, QPN 0xf604ff, PSN 0x428b53 remote address: LID 0x0002, QPN 0xf60500, PSN 0x70675a remote address: LID 0x0002, QPN 0xf60501, PSN 0x999770 remote address: LID 0x0002, QPN 0xf60502, PSN 0x3d496f remote address: LID 0x0002, QPN 0xf60503, PSN 0xf7ff70 remote address: LID 0x0002, QPN 0xf60504, PSN 0x76ab8f [ 0] 00b50408 [ 4] 00000000 [ 8] 00000000 [ c] 00000000 [10] 15810000 [14] 00000000 [18] 000000c2 [1c] ff000000 [ 0] 00b50409 [ 4] 00000000 [ 8] 00000000 [ c] 00000000 [10] 15810000 [14] 00000000 [18] 000000c2 [1c] ff000000 Failed status 12 for wr_id 2 *************** Client side ******************* % ./ibv_srq_pingpong --port=19872 --ib-dev=mthca0 --ib-port=1 10.4.8.31 --num-qp=255 remote address: LID 0x0001, QPN 0xb50500, PSN 0x8913db remote address: LID 0x0001, QPN 0xb50501, PSN 0x02f05d remote address: LID 0x0001, QPN 0xb50502, PSN 0x791458 remote address: LID 0x0001, QPN 0xb50503, PSN 0x86baa5 remote address: LID 0x0001, QPN 0xb50504, PSN 0x49ff20 Segmentation fault The seg fault is somewhere is the polling for completion section (i didn't find the specific location). Thanx Dotan Barak Software Verification Engineer Mellanox Technologies LTD Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ] -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From eitan at mellanox.co.il Sun Aug 21 00:32:37 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 21 Aug 2005 10:32:37 +0300 Subject: [openib-general] [PATCH] osm: osm_vendor_umad osm_vendor_get_all_port_attr bug Message-ID: <86fyt3219m.fsf@mtl066.yok.mtl.com> Hi Hal osm_vendor_get_all_port_attr returns incorrect LID and state for device ports. This bug was caused by the fact that if a device port was skipped due to that fact it does not exist (HCA port 0). The lid and state pointers used as indexes into their corresponding return value arrays were not advancing to the next port index. So the return for a single HCA was mixing LID and state for the first port and displayed non initialized memory for the second port. The following simple patch fixes this bug. It assumes the previous patch named:"osm_vendor_umad to provide port state" was previously applied. Thanks Eitan I tested the patch on : 2.6.12.3-smp SuSE Linux 9.3 (i586) Signed-off-by: Eitan Zahavi Index: osm/libvendor/osm_vendor_ibumad.c =================================================================== --- osm/libvendor/osm_vendor_ibumad.c (revision 3128) +++ osm/libvendor/osm_vendor_ibumad.c (working copy) @@ -527,8 +529,10 @@ osm_vendor_get_all_port_attr( for (j = 0; j <= ca.numports; j++) { if (ca.ports[j]) { *p_lid = ca.ports[j]->base_lid; - p_lid++; - p_linkstates++; } + p_lid++; + p_linkstates++; } } } From mst at mellanox.co.il Sun Aug 21 01:36:31 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 21 Aug 2005 11:36:31 +0300 Subject: [openib-general] Re: [PATCH] allow buffers > 1GB in rdma_bw In-Reply-To: <1124483700.3372.13.camel@nuthead> References: <1124483700.3372.13.camel@nuthead> Message-ID: <20050821083631.GR1856@mellanox.co.il> Quoting Justin Banks : > Subject: [PATCH] allow buffers > 1GB in rdma_bw > > - Allow use of buffers > 1GB in rdma_bw > - Minor cosmetic fixes I couldn't resist, mostly line widths > > Sorry if my email client wraps long lines - I've been shoehorned into > using evolution, and still haven't figured out how to make it do what I > want it to do. Justin, the patch couldnt be applied, and the S.O.B. line is missing. I think I've fixed the issue, though. The following is already applied. The proper thing is, I think, to create 2 separate memory regions for local/remote access, this way we can get all the way to 4G-1 size. MST Index: rdma_bw.c =================================================================== --- rdma_bw.c (revision 3139) +++ rdma_bw.c (working copy) @@ -41,6 +41,7 @@ #include #include #include +#include #include #include #include @@ -68,7 +69,7 @@ struct pingpong_context { struct ibv_cq *cq; struct ibv_qp *qp; void *buf; - int size; + unsigned size; int tx_depth; struct ibv_sge list; struct ibv_send_wr wr; @@ -265,7 +266,8 @@ out: return rem_dest; } -static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, +static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, + unsigned size, int tx_depth, int port) { struct pingpong_context *ctx; @@ -420,7 +422,7 @@ static void usage(const char *argv0) printf(" -b, --bidirectional measure bidirectional bandwidth (default unidirectional)\n"); } -static void print_report(unsigned int iters, int size, int duplex, +static void print_report(unsigned int iters, unsigned size, int duplex, cycles_t *tposted, cycles_t *tcompleted) { double cycles_to_units; @@ -473,7 +475,7 @@ int main(int argc, char *argv[]) char *servername = 
NULL; int port = 18515; int ib_port = 1; - int size = 4096; + long long size = 4096; int tx_depth = 100; int iters = 1000; int scnt, ccnt; @@ -525,8 +527,11 @@ int main(int argc, char *argv[]) break; case 's': - size = strtol(optarg, NULL, 0); - if (size < 1) { usage(argv[0]); return 1; } + size = strtoll(optarg, NULL, 0); + if (size < 1 || size > UINT_MAX / 2) { + usage(argv[0]); + return 1; + } break; case 't': From mst at mellanox.co.il Sun Aug 21 01:48:55 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 21 Aug 2005 11:48:55 +0300 Subject: [openib-general] SDP address translation Message-ID: <20050821084855.GS1856@mellanox.co.il> Quoting r. Roland Dreier : > Yes, the current SDP code is very ugly. Obviously SDP will have to use the > common connection code once it exists. > > - R. Do you think its a good idea to bounce that up to userspace? How would we handle sockets created from kernel? I was thinking about adding something that gets us the device/src/destination hardware address under net/. Does this look like a good plan? -- MST From eitan at mellanox.co.il Sun Aug 21 05:48:16 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 21 Aug 2005 15:48:16 +0300 Subject: [openib-general] RMPP Message Format Errors Message-ID: <506C3D7B14CDD411A52C00025558DED607C306CE@mtlex01.yok.mtl.com> Hi Sean, Hal, We have started testing RMPP packets with osmtest and opensm (gen2 version). We did not go very far. The first NodeRecord GetTable of all the nodes in a "loopback" case, has some issues. The explanation is below: 1. NodeRecord MAD size is 112bytes (note the required padding of 4 bytes at the end of the NodeRec data). 2. OpenSM log file shows the query should return 2 records one for each end-port. 
This really happens:

Aug 21 14:59:49 998104 [40D9DBB0] -> __osm_nr_rcv_create_nr: Looking for NodeRecord with LID: 0x0 GUID:0x0000000000000000
Aug 21 14:59:49 998224 [40D9DBB0] -> __osm_nr_rcv_new_nr: New NodeRecord: node 0x0002c902000017a0 port 0x0002c902000017a1, lid 0x1.
Aug 21 14:59:49 998327 [40D9DBB0] -> __osm_nr_rcv_new_nr: New NodeRecord: node 0x0002c902000017a0 port 0x0002c902000017a2, lid 0x2.
Aug 21 14:59:49 998395 [40D9DBB0] -> osm_nr_rcv_process: Returning 2 records.

3. On the wire we see the following (see attached gif for more details):
   a. Two data segments were sent and two ACKs were returned. This is OK.
   b. The first segment reports PayLen = 440 bytes. According to the spec the first segment might provide paylen != 0, and when it does, it should be equal to (class header * Num-Segments) + data length. In our case we have data length = 2*112 and SA extra header = 20 bytes * 2 segments. This leads to paylen = 264 and not 440! The spec defines that in p775-l37. So this is a violation of the spec.
   c. The last segment (segment 2) provides a paylen field of 100. The expected value for the last segment length should have been: SA extra header + leftover data size from previous segments. Since the first segment has 200 bytes for data, the leftover should have been 112*2 - 200 = 24. With the SA extra header that is 44 bytes. So this is another violation of the spec.
   d. The analyzer is confused by the above and reports the result as having 3 NodeRecords.

4. Following that, when we trace the log file of osmtest we find more issues, probably caused by changes to the vendor layer or the RMPP assembly:
   It is expected that after assembly the size of the RMPP MAD reported to the osm vendor layer will be the RMPP header + SA extra header + data size. In our case that is 32 + 20 + 2*112 = 276.
   The log file shows:

Aug 21 14:59:49 [40D87BB0] -> __osmv_sa_mad_rcv_cb: Count = 1 = 200 / 112 (88)
Aug 21 14:59:49 [4017F6C0] -> osmtest_write_all_node_recs: Received 1 records

   So this is another problem - probably with the way RMPP results are assembled or passed back to the vendor.

Please let me know if you will have time to dig into these problems or if I should try and resolve them myself and provide patches.

Thanks

Eitan

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Gen2 NodeRec GetTable RMPP Format Error.GIF
Type: image/gif
Size: 49481 bytes
Desc: not available
URL:

From panda at cse.ohio-state.edu Sun Aug 21 12:18:19 2005
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Sun, 21 Aug 2005 15:18:19 -0400 (EDT)
Subject: [openib-general] Announcing the release of MVAPICH-Gen2 1.0 (for OpenIB/Gen2 stack)
Message-ID: <200508211918.j7LJIJor021773@xi.cse.ohio-state.edu>

The MVAPICH team is pleased to announce the release of MVAPICH-Gen2 1.0 (for OpenIB/Gen2 stack) for multiple platforms (EM64T, IA-32 and Opteron) and network interfaces (PCI-X and PCI-Express, including the new mem-free cards and DDR technology).

With IBA 4X DDR technology, this release delivers the following MPI-level performance:

- 2.84 microsec one-way latency
- up to 1474 MB/sec (unidirectional bandwidth)
- up to 2645 MB/sec (bidirectional bandwidth)

Detailed performance numbers for other platforms are available on the project's web page.
MVAPICH-Gen2 1.0 is available under BSD license. A copy of this package will also be available from the OpenIB SVN soon. As of this announcement, we are experiencing some technical difficulties in uploading the code to the OpenIB SVN. We hope this will be resolved soon.

This new release has a subset of the features of the popular MVAPICH 0.9.5 (over VAPI) package. Successive releases will incorporate additional features. Current features include:

- optimized and tuned RDMA-based schemes for short and long messages
- optimized intra-node shared memory support (both for bus-based and NUMA-based systems)
- shared library support
- optimized and tuned for the above platforms and different network interfaces (PCI-X and PCI-Express) and DDR technology
- incorporates a set of tunable parameters
- support for multiple compilers
- single code base for all of the above platforms

This release also includes an enhanced and detailed `User Guide' to assist users in installing this package on different platforms with different options.

You are welcome to download the MVAPICH-Gen2 1.0 package and access relevant information from the following URL:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/

Our upcoming releases include:

- A highly optimized version of MVAPICH2 to work on OpenIB Gen2 and VAPI
- MVAPICH with uDAPL support to run on different networks with uDAPL interface
- Solaris support for both MVAPICH and MVAPICH2

All feedback, including bug reports and hints for performance tuning, is welcome. Please send an e-mail to mvapich-help at cse.ohio-state.edu.

Thanks,

MVAPICH Team at OSU/NBCL

----------
PS: If you would like to be removed from this mailing list, please send an e-mail to mvapich_request at cse.ohio-state.edu.
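For reference, the payload-length arithmetic in Eitan Zahavi's "RMPP Message Format Errors" report earlier in this archive can be written out as a short C sketch. It only encodes the rule as stated in that message (first segment: data length plus one SA header per segment; last segment: leftover data plus one SA header), with the 200 data bytes per segment, 20-byte SA header and 112-byte NodeRecord taken from the same message. All function and constant names are hypothetical and do not come from the OpenIB sources, and data_len is assumed to be nonzero.

```c
/* Sizes as stated in the report; every name here is illustrative only. */
enum {
	SKETCH_SEG_DATA = 200,	/* data bytes carried per segment */
	SKETCH_SA_HDR   = 20,	/* "SA extra header", counted per segment */
	SKETCH_NODE_REC = 112	/* NodeRecord including 4 bytes of padding */
};

/* Number of segments needed to carry data_len bytes (data_len > 0). */
static unsigned sketch_num_segments(unsigned data_len)
{
	return (data_len + SKETCH_SEG_DATA - 1) / SKETCH_SEG_DATA;
}

/* Expected PayloadLength in the first segment, per the report's rule. */
static unsigned sketch_first_paylen(unsigned data_len)
{
	return data_len + SKETCH_SA_HDR * sketch_num_segments(data_len);
}

/* Expected PayloadLength in the last segment, per the report's rule. */
static unsigned sketch_last_paylen(unsigned data_len)
{
	unsigned before_last =
		(sketch_num_segments(data_len) - 1) * SKETCH_SEG_DATA;

	return (data_len - before_last) + SKETCH_SA_HDR;
}

/* Whole NodeRecords contained in data_len bytes of assembled data. */
static unsigned sketch_num_records(unsigned data_len)
{
	return data_len / SKETCH_NODE_REC;
}
```

For the two-record case in the report (data_len = 2 * 112 = 224) these functions give 2 segments, 264 for the first segment and 44 for the last, which are the values the report expects rather than the 440 and 100 seen on the wire. Feeding 200 bytes to the record count gives 1, which matches the "200 / 112" line in the osmtest log.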
From sean.hefty at intel.com Sun Aug 21 15:00:58 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 21 Aug 2005 15:00:58 -0700 Subject: [openib-general] RE: RMPP Message Format Errors In-Reply-To: <506C3D7B14CDD411A52C00025558DED607C306CE@mtlex01.yok.mtl.com> Message-ID: Please let me know if you will have time to dig into these problems or if I should try and resolve them myself and provide patches. I will not be able to look at this until early next week (with IDF running this week), but I will try to do so. Note that the current implementation of the RMPP code ignores the payload length on the receive side, and instead relies on the last bit to determine the end of a transfer. So, it wouldn't surprise me if the receive side accepted an invalid RMPP MAD. - Sean -------------- next part -------------- An HTML attachment was scrubbed... URL:
From eitan at mellanox.co.il Sun Aug 21 22:54:33 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 22 Aug 2005 08:54:33 +0300 Subject: [openib-general] RE: RMPP Message Format Errors Message-ID: <506C3D7B14CDD411A52C00025558DED607C306DA@mtlex01.yok.mtl.com> Hi Sean, You wrote: "Note that the current implementation of the RMPP code ignores the payload length on the receive side, and instead relies on the last bit to determine the end of a transfer." But the receive side needs to calculate back the correct size of the assembled MAD. If it is done in kernel or user it does not matter. To my best knowledge the only way to calculate how many records are enclosed in an RMPP message is to use the paylen and offset. How can it be done without looking at paylen ? EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O.
Box 586 Yokneam 20692 ISRAEL -----Original Message----- From: Sean Hefty [mailto:sean.hefty at intel.com] Sent: Monday, August 22, 2005 1:01 AM To: 'Eitan Zahavi'; Hal Rosenstock Cc: OPENIB GENERAL; Liran Sorani; Amit Krig; Aviram Gutman Subject: RE: RMPP Message Format Errors Please let me know if you will have time to dig into these problems or if I should try and resolve them myself and provide patches. I will not be able to look at this until early next week (with IDF running this week), but I will try to do so. So, it wouldn't surprise me if the receive side accepted an invalid RMPP MAD. - Sean -------------- next part -------------- An HTML attachment was scrubbed... URL: From danb at voltaire.com Mon Aug 22 03:22:55 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Mon, 22 Aug 2005 13:22:55 +0300 Subject: [openib-general] ISER cleanup Message-ID: We have begun a cleanup of ISER based on the inputs we received. Mostly cosmetic cleanups were already committed. These include:
C++ style comments changed to C
Removed DAT 1.2 comments
Removed DAT 1.2 API support file
Removed all platform dependencies
Removed vi comments
Removed CONFIG_INFINIBAND references
Reorganized module
Rewritten Makefile to new style
Added Kconfig file
Using kernel min/max
There are many other things to be done, including both coding style and substance; we'll proceed to address all the technical issues that were commented on. Dan From yael at mellanox.co.il Mon Aug 22 05:20:44 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Mon, 22 Aug 2005 15:20:44 +0300 Subject: [openib-general] OpenSM: new branch Message-ID: <506C3D7B14CDD411A52C00025558DED60882D197@mtlex01.yok.mtl.com> Hello all, I have created a new branch for the osm under: https://openib.org/svn/gen2/branches/osm-1.8.0-merge/ This branch includes the merging between the gen2 OpenSM code and the gen1 1.8.0 Mellanox version of OpenSM. You are all welcome to try the new OpenSM version, and report to me any problems you encounter.
Thanks, Yael Kalka -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Mon Aug 22 05:33:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 22 Aug 2005 15:33:45 +0300 Subject: [openib-general] OpenSM: new branch Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BDF@taurus.voltaire.com> Hi Yael, I will start the merge of this back to the trunk later in the week when I return from the OpenIB workshop and IDF. I also have some other patches and issues to investigate queued up. -- Hal From halr at voltaire.com Mon Aug 22 05:46:30 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 22 Aug 2005 15:46:30 +0300 Subject: [openib-general] RE: RMPP Message Format Errors Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE1@taurus.voltaire.com> Hi Eitan, All Sean is saying is that the RMPP code itself only uses the last bit. The number of records is an SA thing and not RMPP thing. This is transparent to RMPP itself. The need to determine the number of records is a consumer issue (SA or SA client). To do this, AttributeOffset and (at least the last) PayloadLength field is needed (as one can't rely on the first PayloadLength being non zero). -- Hal ________________________________ From: Eitan Zahavi [mailto:eitan at mellanox.co.il] Sent: Mon 8/22/2005 1:54 AM To: 'Sean Hefty'; Eitan Zahavi; Hal Rosenstock Cc: OPENIB GENERAL; Liran Sorani; Amit Krig; Aviram Gutman Subject: RE: RMPP Message Format Errors Hi Sean, You wrote: "Note that the current implementation of the RMPP code ignores the payload length on the receive side, and instead relies on the last bit to determine the end of a transfer." But the receive side needs to calculate back the correct size of the assembled MAD. If it is done in kernel or user it does not matter. To my best knowledge the only way to calculate how many records are enclosed in an RMPP message is to use the paylen and offset. How can it be done without looking at paylen ? 
EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL -----Original Message----- From: Sean Hefty [mailto:sean.hefty at intel.com] Sent: Monday, August 22, 2005 1:01 AM To: 'Eitan Zahavi'; Hal Rosenstock Cc: OPENIB GENERAL; Liran Sorani; Amit Krig; Aviram Gutman Subject: RE: RMPP Message Format Errors Please let me know if you will have time to dig into these problems or if I should try and resolve them myself and provide patches. I will not be able to look at this until early next week (with IDF running this week), but I will try to do so. So, it wouldn't surprise me if the receive side accepted an invalid RMPP MAD. - Sean From halr at voltaire.com Mon Aug 22 05:53:13 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 22 Aug 2005 15:53:13 +0300 Subject: [openib-general] RE: [PATCH] osm: osm_vendor_umad to provide port state Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE3@taurus.voltaire.com> Hi Eitan, I will deal with the osm_vendor_umad patches later in the week upon return from the OpenIB workshop. I believe there are 3 patches currently outstanding. -- Hal From eitan at mellanox.co.il Mon Aug 22 06:18:55 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 22 Aug 2005 16:18:55 +0300 Subject: [openib-general] Re: [PATCH] osm: osm_vendor_umad to provide port state In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE3@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE3@taurus.voltaire.com> Message-ID: <4309D0BF.6060007@mellanox.co.il> Hal Rosenstock wrote: > I will deal with the osm_vendor_umad patches later in the week upon > return from the OpenIB workshop. Thanks I will be waiting patiently. I wish I could come to the workshop too. I hope you will be able to collect valuable inputs to the OpenSM TODO. > I believe there are 3 patches currently > outstanding. 
The patches not yet added are: osm: osm_vendor_umad to provide port state osm: osm_vendor_umad registers to all SubnMgt methods osm: osm_vendor_umad osm_vendor_get_all_port_attr bug Also Yael has posted the branch with the merge of the 1.8.0 to the current trunk. It might be worthwhile to try merging it in before any changes in the main trunk. I know the diff file to review will be huge, and cross files this is why we could not do it in patches. The changes are considerably big as the main is 1.6.1 + some 1.7.0 and the branch is 1.8.0. Yael spent two weeks on this merge. From krause at cup.hp.com Mon Aug 22 06:39:13 2005 From: krause at cup.hp.com (Michael Krause) Date: Mon, 22 Aug 2005 06:39:13 -0700 Subject: [openib-general] Re: uverbs comp events In-Reply-To: <52mznd7q67.fsf@cisco.com> References: <4303CBE2.2010009@ichips.intel.com> <52mzngdsvr.fsf@cisco.com> <4303DE2C.5070809@ichips.intel.com> <52iry3ex3k.fsf@cisco.com> <43061DF2.8020504@ichips.intel.com> <52mznd7q67.fsf@cisco.com> Message-ID: <6.2.0.14.2.20050822063707.02770928@esmail.cup.hp.com> At 11:11 AM 8/19/2005, Roland Dreier wrote: > Arlin> Yes, this is certainly another option; albeit one that > Arlin> requires more system resources. Why not take full advantage > Arlin> of the FD resource we already have? It's your call, but > Arlin> uDAPL and other multi-thread applications could make good > Arlin> use of a wakeup feature with these event interfaces. An > Arlin> event model that allows users to create events and get > Arlin> events but requires them to use side band mechanisms to > Arlin> trigger the event seems incomplete to me. > >I disagree. Right now the CQ FD is a pretty clean concept: you read >CQ events out of it. If you want to trigger a CQ event, then you >could post a work request to a QP that generates a completion event. >Adding a new system call for queuing synthetic events seems like >growing an ugly wart to me. 
> >If we look at the analogous design of a multi-threaded network server, >where a thread might block waiting for input on a socket, we see that >there's no system call to inject synthetic data into a network socket. > >I'd rather fix the uDAPL design instead of adding ugliness to the >kernel to work around it. Please take a look at the Sockets API Extensions standard that was published quite awhile back to insure that the infrastructure can support this API as well. The API was developed by a set of Sockets developers and addresses a number of concerns for async communications, event management, explicit memory management, etc. It is also well suited to have SDP transparently implemented underneath it. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Mon Aug 22 06:50:55 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 22 Aug 2005 16:50:55 +0300 Subject: [openib-general] RE: [PATCH] osm: osm_vendor_umad to provide port state Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE6@taurus.voltaire.com> Hi Eitan, >I hope you will be able to collect valuable inputs to the OpenSM TODO. I hope so too (but there already is more than enough to do here). > Also Yael has posted the branch with the merge of the 1.8.0 to the > current trunk. Yes, I saw this. > It might be worthwhile to try merging it in before any > changes in the main trunk. The only changes would be your 3 outstanding patches. > I know the diff file to review will be huge, > and cross files this is why we could not do it in patches. The changes > are considerably big as the main is 1.6.1 + some 1.7.0 and the branch is > 1.8.0. I think this can be done in steps as I had described earlier. That is the process I will use to merge this to the trunk. > Yael spent two weeks on this merge. I think it was longer. Thanks to Yael for getting this done. I'm sure OpenSM is now much better for it. 
It is a high priority item for me to get this back to the trunk but just as it took time to do the merge, I expect it to take some time to put it back on the trunk. You will see postings on questions and progress on this on the list. -- Hal ________________________________ From: Eitan Zahavi [mailto:eitan at mellanox.co.il] Sent: Mon 8/22/2005 9:18 AM To: Hal Rosenstock Cc: OPENIB GENERAL Subject: Re: [PATCH] osm: osm_vendor_umad to provide port state Hal Rosenstock wrote: > I will deal with the osm_vendor_umad patches later in the week upon > return from the OpenIB workshop. Thanks I will be waiting patiently. I wish I could come to the workshop too. I hope you will be able to collect valuable inputs to the OpenSM TODO. > I believe there are 3 patches currently > outstanding. The patches not yet added are: osm: osm_vendor_umad to provide port state osm: osm_vendor_umad registers to all SubnMgt methods osm: osm_vendor_umad osm_vendor_get_all_port_attr bug Also Yael has posted the branch with the merge of the 1.8.0 to the current trunk. It might be worthwhile to try merging it in before any changes in the main trunk. I know the diff file to review will be huge, and cross files this is why we could not do it in patches. The changes are considerably big as the main is 1.6.1 + some 1.7.0 and the branch is 1.8.0. Yael spent two weeks on this merge. From eitan at mellanox.co.il Mon Aug 22 06:52:02 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 22 Aug 2005 16:52:02 +0300 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE1@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE1@taurus.voltaire.com> Message-ID: <4309D882.1000806@mellanox.co.il> Hal Rosenstock wrote: > > The number of records is an SA thing and not RMPP thing. This is > transparent to RMPP itself. The transparency to the RMPP is an RMPP implementation choice. 
Having incorrect paylen in the first segment is a compliancy violation. It should be either 0 or correct value. > > The need to determine the number of records is a consumer issue (SA or > SA client). To do this, AttributeOffset and (at least the last) > PayloadLength field is needed (as one can't rely on the first > PayloadLength being non zero). True. But how would the SA or SA Client that gets an assembled MAD be able to tell the number of records? Also, does the current implementation let the client do the assembly? If so how would it handle abort transactions? If the re-assembly is done by the MAD service then the client only gets offset in the MAD header and probably mad size which is MAD Header + RMPP header + SA extra header + data. Anyway, the last segment paylen was incorrect too. > > -- Hal > > ________________________________ > > From: Eitan Zahavi [mailto:eitan at mellanox.co.il] > Sent: Mon 8/22/2005 1:54 AM > To: 'Sean Hefty'; Eitan Zahavi; Hal Rosenstock > Cc: OPENIB GENERAL; Liran Sorani; Amit Krig; Aviram Gutman > Subject: RE: RMPP Message Format Errors > > > Hi Sean, > > You wrote: > "Note that the current implementation of the RMPP code ignores the > payload length on the receive side, and instead relies on the last bit > to determine the end of a transfer." > > But the receive side needs to calculate back the correct size of the > assembled MAD. > If it is done in kernel or user it does not matter. To my best knowledge > the only way to calculate how many records are enclosed in an RMPP > message is to use the paylen and offset. > How can it be done without looking at paylen ? > > EZ > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. 
Box 586 Yokneam 20692 ISRAEL > > -----Original Message----- > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Monday, August 22, 2005 1:01 AM > To: 'Eitan Zahavi'; Hal Rosenstock > Cc: OPENIB GENERAL; Liran Sorani; Amit Krig; Aviram Gutman > Subject: RE: RMPP Message Format Errors > > Please let me know if you will have time to dig into these problems or > if I should try and resolve them myself and provide patches. > I will not be able to look at this until early next week (with IDF > running this week), but I will try to do so. So, it wouldn't surprise > me if the receive side accepted an invalid RMPP MAD. > - Sean From halr at voltaire.com Mon Aug 22 07:04:37 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 22 Aug 2005 17:04:37 +0300 Subject: [openib-general] RE: RMPP Message Format Errors Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE7@taurus.voltaire.com> Hi again Eitan, > The transparency to the RMPP is an RMPP implementation choice. > Having incorrect paylen in the first segment is a compliancy violation. > It should be either 0 or correct value. Yes, is that what is going on ? I haven't had a chance to look at the GIF you sent and analyze it. > But how would the SA or SA Client that gets an assembled MAD be > able to tell the number of records? It gets a "real" received length provided it supplies a buffer large enough. > Also, does the current implementation let the client do the assembly? No. > If so how would it handle abort transactions? See previous answer. > If the re-assembly is done by the MAD service then the client only gets > offset in the MAD header and probably mad size which is MAD Header + > RMPP header + SA extra header + data. > Anyway, the last segment paylen was incorrect too. OK. That's another thing I'll look at. -- Hal From mst at mellanox.co.il Mon Aug 22 07:17:01 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 22 Aug 2005 17:17:01 +0300 Subject: [openib-general] [PATCH] sdp: split sdp_inet_send to subroutines Message-ID: <20050822141701.GZ1856@mellanox.co.il> The following is not yet applied. Opinions on this? --- Split sdp_inet_send into smaller subroutines. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_send.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_send.c +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_send.c @@ -1907,6 +1907,136 @@ done: return result; } +static inline int sdp_send_while_space(struct sock *sk, struct sdp_sock *conn, + struct msghdr *msg, int oob, + size_t size, size_t *copied) +{ + struct sdpc_buff *buff; + int result = 0; + int copy; + /* + * send while there is room... (thresholds should be + * observed...) use a different threshold for urgent + * data to allow some space for sending. + */ + while (sdp_inet_write_space(conn, oob) > 0) { + buff = sdp_send_data_buff_get(conn); + if (!buff) { + result = -ENOMEM; + goto done; + } + + copy = min((size_t)(buff->end - buff->tail), size - *copied); + copy = min(copy, sdp_inet_write_space(conn, oob)); + +#ifndef _SDP_DATA_PATH_NULL + result = memcpy_fromiovec(buff->tail, msg->msg_iov, copy); + if (result < 0) { + sdp_buff_pool_put(buff); + goto done; + } +#endif + buff->tail += copy; + *copied += copy; + + SDP_CONN_STAT_SEND_INC(conn, copy); + + result = sdp_send_data_buff_put(conn, buff, copy, + (*copied == size ? oob : 0)); + if (result < 0) + goto done; + + if (*copied == size) + goto done; + } + /* + * set no space bits since this code path is taken + * when there is no write space.
+ */ + set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); + set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); + +done: + return result; +} + +/* Returns new timeout value */ +static inline long sdp_wait_till_space(struct sock *sk, struct sdp_sock *conn, + int oob, long timeout) +{ + DECLARE_WAITQUEUE(wait, current); + + add_wait_queue(sk->sk_sleep, &wait); + set_current_state(TASK_INTERRUPTIBLE); + /* + * ASYNC_NOSPACE is only set if we're not sleeping, + * while NOSPACE is set whenever there is no space, + * and is only cleared once space opens up, in + * DevConnAck() + */ + clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); + + sdp_conn_unlock(conn); + if (sdp_inet_write_space(conn, oob) <= 0) + timeout = schedule_timeout(timeout); + sdp_conn_lock(conn); + + remove_wait_queue(sk->sk_sleep, &wait); + set_current_state(TASK_RUNNING); + return timeout; +} + +static inline int sdp_queue_iocb(struct kiocb *req, struct sdp_sock *conn, + struct msghdr *msg, size_t size, + size_t *copied) +{ + struct sdpc_iocb *iocb; + int result; + /* + * create IOCB with remaining space + */ + iocb = sdp_iocb_create(); + if (!iocb) { + sdp_dbg_warn(conn, "Failed to allocate IOCB <%Zu:%ld>", + size, (long)*copied); + return -ENOMEM; + } + + iocb->len = size - *copied; + iocb->post = *copied; + iocb->size = size; + iocb->req = req; + iocb->key = req->ki_key; + iocb->addr = (unsigned long)msg->msg_iov->iov_base - *copied; + + req->ki_cancel = sdp_inet_write_cancel; + + result = sdp_iocb_lock(iocb); + if (result < 0) { + sdp_dbg_warn(conn, "Error <%d> locking IOCB <%Zu:%ld>", + result, size, (long)copied); + + sdp_iocb_destroy(iocb); + return result; + } + + SDP_CONN_STAT_WQ_INC(conn, iocb->size); + + conn->send_pipe += iocb->len; + + result = sdp_send_data_queue(conn, (struct sdpc_desc *)iocb); + if (result < 0) { + sdp_dbg_warn(conn, "Error <%d> queueing write IOCB", + result); + + sdp_iocb_destroy(iocb); + return result; + } + + *copied = 0; /* copied amount was saved in IOCB. 
*/ + return -EIOCBQUEUED; +} + /* * sdp_inet_send - send data from user space to the network */ @@ -1915,12 +2045,9 @@ int sdp_inet_send(struct kiocb *req, str { struct sock *sk; struct sdp_sock *conn; - struct sdpc_buff *buff; - struct sdpc_iocb *iocb; int result = 0; - int copied = 0; - int copy; - int oob; + size_t copied = 0; + int oob, zcopy; long timeout = -1; /* @@ -1954,75 +2081,35 @@ int sdp_inet_send(struct kiocb *req, str * they are smaller then the zopy threshold, but only if there is * no buffer write space. */ - if (!(conn->src_zthresh > size) && !is_sync_kiocb(req)) - goto skip; + zcopy = (size >= conn->src_zthresh && !is_sync_kiocb(req)); + /* * clear ASYN space bit, it'll be reset if there is no space. */ - clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); + if (!zcopy) + clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); /* * process data first if window is open, next check conditions, then * wait if there is more work to be done. The absolute window size is * used to 'block' the caller if the connection is still connecting. */ while (!result && copied < size) { - /* - * send while there is room... (thresholds should be - * observed...) use a different threshold for urgent - * data to allow some space for sending. - */ - while (sdp_inet_write_space(conn, oob) > 0) { - buff = sdp_send_data_buff_get(conn); - if (!buff) { - result = -ENOMEM; - goto done; - } - - copy = min((size_t)(buff->end - buff->tail), - (size_t)(size - copied)); - copy = min(copy, sdp_inet_write_space(conn, oob)); - -#ifndef _SDP_DATA_PATH_NULL - result = memcpy_fromiovec(buff->tail, - msg->msg_iov, - copy); - if (result < 0) { - sdp_buff_pool_put(buff); - goto done; - } -#endif - buff->tail += copy; - copied += copy; - - SDP_CONN_STAT_SEND_INC(conn, copy); - - result = sdp_send_data_buff_put(conn, buff, copy, - ((copied == - size) ? 
oob : 0)); - if (result < 0) - goto done; - - if (copied == size) - goto done; + if (!zcopy) { + result = sdp_send_while_space(sk, conn, msg, oob, size, + &copied); + if (result < 0 || copied == size) + break; } - /* - * set no space bits since this code path is taken - * when there is no write space. - */ - set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); - set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); - /* - * check status. - */ -skip: /* entry point for IOCB based transfers. Before processing IOCB, - check that the connection is OK, otherwise return error - synchronously. */ + + /* entry point for IOCB based transfers. Before processing IOCB, + check that the connection is OK, otherwise return error + synchronously. */ /* * onetime setup of timeout, but only if it's needed. */ if (timeout < 0) timeout = sock_sndtimeo(sk, - MSG_DONTWAIT & msg->msg_flags); + msg->msg_flags & MSG_DONTWAIT); if (sk->sk_err) { result = (copied > 0) ? 0 : sock_error(sk); @@ -2051,77 +2138,14 @@ skip: /* entry point for IOCB based tran break; } /* - * Either wait or create and queue an IOCB for defered + * Either wait or create and queue an IOCB for deferred * completion. Wait on sync IO call create IOCB for async * call. 
*/ - if (is_sync_kiocb(req)) { - DECLARE_WAITQUEUE(wait, current); - - add_wait_queue(sk->sk_sleep, &wait); - set_current_state(TASK_INTERRUPTIBLE); - /* - * ASYNC_NOSPACE is only set if we're not sleeping, - * while NOSPACE is set whenever there is no space, - * and is only cleared once space opens up, in - * DevConnAck() - */ - clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); - - sdp_conn_unlock(conn); - if (sdp_inet_write_space(conn, oob) <= 0) - timeout = schedule_timeout(timeout); - sdp_conn_lock(conn); - - remove_wait_queue(sk->sk_sleep, &wait); - set_current_state(TASK_RUNNING); - - continue; - } - /* - * create IOCB with remaining space - */ - iocb = sdp_iocb_create(); - if (!iocb) { - sdp_dbg_warn(conn, "Failed to allocate IOCB <%Zu:%d>", - size, copied); - result = -ENOMEM; - break; - } - - iocb->len = size - copied; - iocb->post = copied; - iocb->size = size; - iocb->req = req; - iocb->key = req->ki_key; - iocb->addr = (unsigned long)msg->msg_iov->iov_base - copied; - - req->ki_cancel = sdp_inet_write_cancel; - - result = sdp_iocb_lock(iocb); - if (result < 0) { - sdp_dbg_warn(conn, "Error <%d> locking IOCB <%Zu:%d>", - result, size, copied); - - sdp_iocb_destroy(iocb); - break; - } - - SDP_CONN_STAT_WQ_INC(conn, iocb->size); - - conn->send_pipe += iocb->len; - - result = sdp_send_data_queue(conn, (struct sdpc_desc *)iocb); - if (result < 0) { - sdp_dbg_warn(conn, "Error <%d> queueing write IOCB", - result); - - sdp_iocb_destroy(iocb); - break; - } - - copied = 0; /* copied amount was saved in IOCB. 
*/ - result = -EIOCBQUEUED; + if (is_sync_kiocb(req)) + timeout = sdp_wait_till_space(sk, conn, oob, timeout); + else + result = sdp_queue_iocb(req, conn, msg, size, &copied); } done: -- MST From eitan at mellanox.co.il Mon Aug 22 07:34:38 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 22 Aug 2005 17:34:38 +0300 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE7@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE7@taurus.voltaire.com> Message-ID: <4309E27E.7050002@mellanox.co.il> Hal Rosenstock wrote: > Hi again Eitan, > > >>The transparency to the RMPP is an RMPP implementation choice. >>Having incorrect paylen in the first segment is a compliancy violation. >>It should be either 0 or correct value. > > > Yes, is that what is going on ? I haven't had a chance to look at the GIF you sent > and analyze it. Yes that is exactly what I have provided in the first mail: > > >>But how would the SA or SA Client that gets an assembled MAD be >>able to tell the number of records? > > > It gets a "real" received length provided it supplies a buffer large enough. So I guess the "real receive length" is truncated to the last data record even if the packet sent was 256 bytes? > >>Also, does the current implementation let the client do the assembly? > > No. Good. I hoped this is the case. > >>Anyway, the last segment paylen was incorrect too. > > OK. That's another thing I'll look at. The first mail I sent had all the analysis in it with exact peylen values for first and second segments. 
> > -- Hal From steve_wooding at keysounds.co.uk Mon Aug 22 07:50:02 2005 From: steve_wooding at keysounds.co.uk (=?iso-8859-1?Q?Steve_Wooding?=) Date: Mon, 22 Aug 2005 16:50:02 +0200 Subject: [openib-general] uverbs: When message is larger than receive buffer Message-ID: <30280207$11247217344309e4462ed125.48694516@config11.schlund.de> Hi, I was wondering what happens if the message being received (for a Send operation) is larger than the length of the recieve buffer that was specified in the receive WR? Where is this error caught? Is is the same for an RDMA Write? Thanks, Steve. From mst at mellanox.co.il Mon Aug 22 07:58:34 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 22 Aug 2005 17:58:34 +0300 Subject: [openib-general] Re: uverbs: When message is larger than receive buffer In-Reply-To: <30280207$11247217344309e4462ed125.48694516@config11.schlund.de> References: <30280207$11247217344309e4462ed125.48694516@config11.schlund.de> Message-ID: <20050822145834.GA1856@mellanox.co.il> Hi Steve, Quoting r. Steve Wooding : > I was wondering what happens if the message being received (for a Send > operation) is larger than the length of the recieve buffer that was > specified in the receive WR? You'll get a completion with error when you poll the cq. > Where is this error caught? This is detected by HCA hardware. > Is is the same > for an RDMA Write? RDMA Write does not use a receive WR. -- MST From mst at mellanox.co.il Mon Aug 22 08:08:41 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 22 Aug 2005 18:08:41 +0300 Subject: [openib-general] [PATCH] osm: trivial: fix vendor includes Message-ID: <20050822150841.GA27630@mellanox.co.il> One cant currently build opensm if one does not have management libraries checked out in the same directory. The reason is that management libs are installed in /usr/local/include/infiniband now. The following patch applies to Yael's branch as well. 
--- Fix opensm to include files from include/infiniband directory by their proper names with infiniband/ prefix. Signed-off-by: Michael S. Tsirkin Index: include/vendor/osm_vendor_ibumad.h =================================================================== --- include/vendor/osm_vendor_ibumad.h (revision 3102) +++ include/vendor/osm_vendor_ibumad.h (working copy) @@ -44,8 +44,8 @@ #include #include -#include -#include +#include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { -- MST From mst at mellanox.co.il Mon Aug 22 08:47:33 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 22 Aug 2005 18:47:33 +0300 Subject: [openib-general] [PATCH] sdp: fix oops in sdp_link.c Message-ID: <20050822154733.GC27630@mellanox.co.il> Hi! I was getting the following oopsen in sdp_link.c I plan to commit the fix (below) tomorrow, after some more testing. Comments? Unable to handle kernel NULL pointer dereference at 0000000000000090 RIP: {arp_send+4} PGD 17c8f4067 PUD 17cfa3067 PMD 0 Oops: 0000 [1] SMP CPU 0 Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad ib_core Pid: 2715, comm: sdp_wq/0 Not tainted 2.6.12.2 RIP: 0010:[] {arp_send+4} RSP: 0018:ffff8101775e9d48 EFLAGS: 00010296 RAX: 00000000000000d0 RBX: ffff8101787dee80 RCX: 0000000000000000 RDX: 000000009b08040b RSI: 0000000000000806 RDI: 0000000000000001 RBP: ffff8101787dee00 R08: 000000009c08040b R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000292 R15: ffffffff8805bd77 FS: 0000000000000000(0000) GS:ffffffff80579f00(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000090 CR3: 000000017d45b000 CR4: 00000000000006e0 Process sdp_wq/0 (pid: 2715, threadinfo ffff8101775e8000, task ffff8101760482f0) Stack: 0000000200000018 0000000000000000 ffff8101760482f0 ffffffff8805c102 00000000000000d0 0000000000000000 00000234b5c208f9 ffffffff803bfeb2 ffff8101775e9e58 
ffffffff803bff0b Call Trace:{:ib_sdp:do_link_path_lookup+907} {thread_return+0} {thread_return+89} {worker_thread+476} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+204} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} Code: 80 b9 90 00 00 00 00 4c 8b 54 24 20 48 8b 44 24 28 3e 78 1f RIP {arp_send+4} RSP CR2: 0000000000000090 <6>ib_sdp CRTL: <13> <2100> RELEASE: linger <0:0> data <0:0> nfs warning: mount version older than kernel --- Signed-off-by: Michael S. Tsirkin If info->ca is present, go to path query, dont arp: we dont have the net device, anyway. Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_link.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_link.c 2005-08-22 18:23:27.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_link.c 2005-08-22 18:23:24.000000000 +0300 @@ -354,10 +354,10 @@ static void do_link_path_lookup(void *da if (info->query) goto done; /* - * route information present, but no path query, goto re-arp. + * route information present, but no path query. 
*/ if (info->ca) - goto arp; + goto path; result = ip_route_output_key(&rt, &fl); if (result < 0 || !rt) { -- MST From rolandd at cisco.com Mon Aug 22 08:50:24 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 22 Aug 2005 08:50:24 -0700 Subject: [openib-general] Re: executing the ibv_srq_pingpong with many QPs sometimes causes to a test failure In-Reply-To: <506C3D7B14CDD411A52C00025558DED60882CEA5@mtlex01.yok.mtl.com> (Dotan Barak's message of "Sun, 21 Aug 2005 10:15:36 +0300") References: <506C3D7B14CDD411A52C00025558DED60882CEA5@mtlex01.yok.mtl.com> Message-ID: <523bp2nf7j.fsf@cisco.com> Dotan> When I'm executing the ibv_srq_pingpong with many QPs(255) Dotan> each executable is a different HCA (on different node) Dotan> there is a completion with error in the server side and a Dotan> seg fault in the client side. Thanks, I'll try to track this down. Can you get a backtrace of the seg fault to see where it is? - R. From sean.hefty at intel.com Mon Aug 22 09:49:36 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 22 Aug 2005 09:49:36 -0700 Subject: [openib-general] RE: RMPP Message Format Errors In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE1@taurus.voltaire.com> Message-ID: >The number of records is an SA thing and not RMPP thing. This is transparent to >RMPP itself. > >The need to determine the number of records is a consumer issue (SA or SA >client). To do this, AttributeOffset and (at least the last) PayloadLength >field is needed (as one can't rely on the first PayloadLength being non zero). The RMPP code returns the size of the receive as sizeof MAD header + sizeof RMPP header + optional sizeof other header (e.g. SA header) + actual payload. This size can be used to allocate a data buffer large enough to hold the reassembled MAD. You should be able to use this to determine the number of records in the payload. 
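A rough sketch of the arithmetic described above, in C. The helper name and its parameters are hypothetical, not part of any OpenIB interface. It assumes the caller already knows the combined header length, and that the SA AttributeOffset field counts 8-byte words per record (my understanding of the IB spec; treat it as an assumption and check it for your class).

```c
#include <stddef.h>

/*
 * Hypothetical helper, not an OpenIB API.
 *   total_len   - size reported for the reassembled MAD
 *   hdr_len     - MAD + RMPP + class (e.g. SA) header bytes, supplied by caller
 *   attr_offset - SA AttributeOffset, assumed to be in units of 8 bytes
 *   leftover    - optional; receives payload bytes that do not fill a record
 * Returns the number of complete records in the payload.
 */
static size_t sa_record_count(size_t total_len, size_t hdr_len,
                              unsigned attr_offset, size_t *leftover)
{
    size_t rec_size = (size_t)attr_offset * 8;
    size_t payload;

    if (rec_size == 0 || total_len < hdr_len) {
        if (leftover)
            *leftover = 0;
        return 0;
    }

    payload = total_len - hdr_len;
    if (leftover)
        *leftover = payload % rec_size;
    return payload / rec_size;
}
```

A non-zero leftover is a hint that the reported size or the header length used does not match what the sender put on the wire.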
- Sean From thomas.duffy.99 at alumni.brown.edu Mon Aug 22 10:44:19 2005 From: thomas.duffy.99 at alumni.brown.edu (Tom Duffy) Date: Mon, 22 Aug 2005 10:44:19 -0700 Subject: [openib-general] Re: [PATCH] sdp: fix oops in sdp_link.c In-Reply-To: <20050822154733.GC27630@mellanox.co.il> References: <20050822154733.GC27630@mellanox.co.il> Message-ID: On Aug 22, 2005, at 8:47 AM, Michael S. Tsirkin wrote: > Hi! > I was getting the following oopsen in sdp_link.c > I plan to commit the fix (below) tomorrow, after some more testing. > Comments? > Yeah, looks good. This is why people hate goto's. -tduffy From rolandd at cisco.com Mon Aug 22 11:20:48 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 22 Aug 2005 11:20:48 -0700 Subject: [openib-general] Re: uverbs comp events In-Reply-To: <43063586.1030704@ichips.intel.com> (Sean Hefty's message of "Fri, 19 Aug 2005 12:39:50 -0700") References: <4303CBE2.2010009@ichips.intel.com> <52mzngdsvr.fsf@cisco.com> <4303DE2C.5070809@ichips.intel.com> <52iry3ex3k.fsf@cisco.com> <43061DF2.8020504@ichips.intel.com> <43061FF3.2030304@ichips.intel.com> <52slx568wk.fsf@cisco.com> <43063586.1030704@ichips.intel.com> Message-ID: <52ll2t6dfj.fsf@cisco.com> Sean> I think that the issue that Arlin is hitting is that once he Sean> calls blah_get_event() he doesn't have an easy way to Sean> release the thread. This may turn out to be a uCM / uAT Sean> issue, and not a verbs issue. I thought about this some more. I think the real issue is probably that the app is sleeping blah_get_event() on a blocking fd. If the app doesn't want to block in that function forever, then it should use poll() with a wake-up pipe() fd as well as the event fd. (I'm also not sure what's so bad about pthread_kill() as a wake-up) - R. 
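The poll()-plus-wake-up-pipe idea above can be sketched roughly as follows. This is a minimal illustration only: the function names are hypothetical, event_fd stands for whatever descriptor the blocking get_event call would read from, and the pipe is assumed to have been created by the application with pipe().

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: block until either the event fd is readable or
 * another thread has written to the wake-up pipe.
 * Returns 1 if an event is ready, 0 if a wake-up was requested, -1 on error.
 */
static int wait_event_or_wakeup(int event_fd, int wake_rd_fd)
{
    struct pollfd fds[2];

    fds[0].fd = event_fd;
    fds[0].events = POLLIN;
    fds[1].fd = wake_rd_fd;
    fds[1].events = POLLIN;

    for (;;) {
        fds[0].revents = 0;
        fds[1].revents = 0;

        if (poll(fds, 2, -1) < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            return -1;
        }
        if (fds[1].revents & (POLLIN | POLLHUP))
            return 0;           /* asked to stop waiting */
        if (fds[0].revents & POLLIN)
            return 1;           /* safe to call the get_event routine now */
        if (fds[0].revents & (POLLERR | POLLHUP | POLLNVAL))
            return -1;
    }
}

/* Called from another thread to release the waiter. */
static int request_wakeup(int wake_wr_fd)
{
    char c = 0;

    return write(wake_wr_fd, &c, 1) == 1 ? 0 : -1;
}
```

The waiting thread only calls the blocking get_event routine after this returns 1, so it never sleeps inside that call; a thread that wants it to exit calls request_wakeup() on the write end of the pipe.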
From eitan at mellanox.co.il Mon Aug 22 14:09:01 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 23 Aug 2005 00:09:01 +0300 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: References: Message-ID: <430A3EED.3080206@mellanox.co.il> Sean Hefty wrote: > > The RMPP code returns the size of the receive as sizeof MAD header + sizeof RMPP > header + optional sizeof other header (e.g. SA header) + actual payload. This > size can be used to allocate a data buffer large enough to hold the reassembled > MAD. You should be able to use this to determine the number of records in the > payload. Good. But how is that size delivered? I mean through umad to the client. From my first email on this thread you can see there is at least one bug in the chain of events: a. First segment paylen should be either 0 or correct value - it is neither. Should be 264 but is 440 b. Last segment paylen MUST be updated to reflect the size of the data in the MAD (including class header) - should be 24 but is 100. c. In the receiver the re-assembled data size is not correct. OpenSM reports it got a 200 bytes MAD back. Probably a bug in the vendor layer or umad. Here is the full data again. 1. NodeRecord MAD size is 112bytes (note the required padding of 4 bytes at the end of the NodeRec data). 2. OpenSM log file shows the query should return 2 records one for each end-port. This really happens: Aug 21 14:59:49 998104 [40D9DBB0] -> __osm_nr_rcv_create_nr: Looking for NodeRecord with LID: 0x0 GUID:0x0000000000000000 Aug 21 14:59:49 998224 [40D9DBB0] -> __osm_nr_rcv_new_nr: New NodeRecord: node 0x0002c902000017a0 port 0x0002c902000017a1, lid 0x1. Aug 21 14:59:49 998327 [40D9DBB0] -> __osm_nr_rcv_new_nr: New NodeRecord: node 0x0002c902000017a0 port 0x0002c902000017a2, lid 0x2. Aug 21 14:59:49 998395 [40D9DBB0] -> osm_nr_rcv_process: Returning 2 records. 3. On the wire we see the following (see attached gif for more details): a. 
Two data segments were sent and two ACKs were returned. This is OK. b. The first segment reports PayLen = 440bytes. According to the spec the first segment might provide paylen != 0 and when it is done it should be equal to the (class header * Num-Segments) + data length. In our case we have data length = 2*112, and SA extra header = 20byte * 2seg. This leads to peylen=264 and not 440!!! The spec defines that in p775-l37. So this is a violation of the spec. c. The last segment (segment 2) provides the paylen field of 100. The expected value for the last segment length should have been: SA extra header + leftover data size from prev segments. Since the first segment has 200bytes for data the left over should have been 112*2 - 200 = 24. With the SA extra header 44bytes. So this is another violation of the spec. d. The analyzer is confused by the above and reports the result as having 3 NodeRecords. e. <> 4. Following that when we trace the log file of osmtest we find more issues. Probably caused by changes to the vendor layer or the rmpp assembly: It is expected that after assembly the size of the RMPP mad reported to the osm vendor layer will be the rmpp header + SA extra header + data-size. In our case that is 32 + 20 + 2*112 = 276. The log file shows: Aug 21 14:59:49 [40D87BB0] -> __osmv_sa_mad_rcv_cb: Count = 1 = 200 / 112 (88) Aug 21 14:59:49 [4017F6C0] -> osmtest_write_all_node_recs: Received 1 records From steve_wooding at keysounds.co.uk Mon Aug 22 14:27:25 2005 From: steve_wooding at keysounds.co.uk (Steve Wooding) Date: Mon, 22 Aug 2005 22:27:25 +0100 Subject: [openib-general] Re: uverbs: When message is larger than receive buffer In-Reply-To: <20050822145834.GA1856@mellanox.co.il> References: <30280207$11247217344309e4462ed125.48694516@config11.schlund.de> <20050822145834.GA1856@mellanox.co.il> Message-ID: <430A433D.5060708@keysounds.co.uk> Thanks Micheal. 
Can I confirm that no memory beyond that specified in the WR for the Send op is actually written to? The reason I'm interested in this is that I'm thinking about registering a large Memory Region to receive lots of smaller data messages. This is to avoid the overhead of registering and unregistering MRs. For RDMA Write, would windowing be the only option to stop a sender writing beyond what the receiver was expecting? Regards, Steve. Michael S. Tsirkin wrote: >Hi Steve, > >Quoting r. Steve Wooding : > > >>I was wondering what happens if the message being received (for a Send >>operation) is larger than the length of the receive buffer that was >>specified in the receive WR? >> >> >You'll get a completion with error when you poll the cq. > > > >>Where is this error caught? >> >> >This is detected by HCA hardware. > > > >>Is it the same >>for an RDMA Write? >> >> >RDMA Write does not use a receive WR. > > > From arlin.r.davis at intel.com Mon Aug 22 16:07:57 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 22 Aug 2005 16:07:57 -0700 Subject: [openib-general] Re: uverbs comp events Message-ID: Roland Dreier wrote: Sean> I think that the issue that Arlin is hitting is that once he Sean> calls blah_get_event() he doesn't have an easy way to Sean> release the thread. This may turn out to be a uCM / uAT Sean> issue, and not a verbs issue. I thought about this some more. I think the real issue is probably that the app is sleeping in blah_get_event() on a blocking fd. If the app doesn't want to block in that function forever, then it should use poll() with a wake-up pipe() fd as well as the event fd. (I'm also not sure what's so bad about pthread_kill() as a wake-up) - R. The application (udapl completion thread) is polling on the FD so it is very easy to add a pipe() fd to wake it up.
I was just expecting the close_fd in ibv_close_device() to wake up the poll (or blocking get_event) so I could exit the thread without a pipe or a kill. pthread_kill requires me to add a sig_handler to catch the signal and since uDAPL is a library, there is a chance of uDAPL replacing a signal handler already set up by the application for the process group. -arlin From caitlin.bestler at gmail.com Mon Aug 22 17:03:28 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Mon, 22 Aug 2005 17:03:28 -0700 Subject: [openib-general][PATCH][kdapl]: evd upcall policy implementation In-Reply-To: References: Message-ID: <469958e0050822170350cdc4f5@mail.gmail.com> If your EVD typically has hundreds of entries, let alone thousands, then you should not be notification driven. You should be time-driven and simply use evd_dequeue(). The whole point of having control of notifications is to ensure that upcalls are used to wake up the consumer. If there is no need to be woken, because the EVD is never drained, then why use notifications at all? But in any event, if the application enables for a SINGLE upcall then it cannot be "overwhelmed" by upcalls. It can process as many events from the EVD as it wants. When it is done it has two choices: a) If it does not want to continue working on this EVD it can disable the EVD and simply cause dequeuing to resume at a later time based on a clock (independent of the number of events in the EVD). b) Allow the Provider to decide whether it should continue by enabling the EVD and exiting. If there is no reason not to reschedule the Provider will upcall again, or it might wait. On 8/18/05, Guy German wrote: > > Yes. > > dat_evd_modify_upcall has been called, but the current > > upcall instance has not yet returned. During this period > > the consumer should check to see if the EVD is drained. > > If so, the consumer is no longer notified (re this EVD).
> > I don't follow you - if the consumer is still in the upcall > context, why should he be changing the upcall policy at all ? > (assuming it's a single instance) > > Anyway, I don't think it is recommended to "drain the > evd", in the upcall's tasklet/interrupt context. > There can be thousands of events to dequeue, and > while you drain them, there can be more coming. > You want to get out of that context as fast as possible. > > Guy > > On 8/18/05, Guy German wrote: > > Hi Caitlin, > > Caitlin Bestler wrote: > > > Some clarifications are needed here. > > > > > > First the Consumer is responsible for draining the > > > EVD after re-enabling it, or at least for remembering > > > that there may be undrained notified events. > > > > Can you please explain what you mean by "re-enabling" > > the EVD ? Do you mean calling dat_evd_modify_upcall > > and changing the upcall policy from disable, back to > > enable ? > > > > > > > > That is "you-have-been-notified" is a sticky boolean > > > attribute that the Consumer is supposed to set to TRUE > > > when the upcall is made and only clear when the EVD > > > has been drained *after* re-enabling. > > > > > > Second, is that the EVD is first and foremost an event > > > *serializer*. It is presumed to have a finite number of > > > resources for making upcalls (at most one for the typical > > > case where SINGLE is enabled). The next upcall per > > > resource CANNOT occur until after the current upcall > > > has completed. > > > > > > Whether this should be solved in the DAT Provider is > > > a question of what the verb-layer provider is allowed > > > to do. If the verb layer provider can in fact generate > > > multiple concurrent upcalls for the same CQ then the > > > EVD itself must guard against re-entrancy.
> > > > > > A more likely implementation is that upcalls triggered > > > by post_se, CM events and CQs could theoretically > > > occur at the same instant -- but that none of these > > > paths can be re-entrant by themselves. > > > > > > Once the potential re-entrancy from the verb layer > > > is known, then an optimal strategy can be selected. > > > For example, if the only potential re-entrancy comes > > > when the upcall interrupts a post_se call then some > > > simple critical regions can avoid all problems without > > > general purpose spinlocks or semaphores. > > > > > > On 8/16/05, James Lentini wrote: > > >> > > >> > > >> On Tue, 16 Aug 2005, Guy German wrote: > > >> > > >>>>>>>>> Also, the pending_event_queue is only used for kDAPL generated > > >>>>>>>>> software events. This queue can be empty when there are > > >>>>>>>>> events on the CQ, so you would need to expand your > > >>>>>>>>> check to cover that. > > >>>>>>> > > >>>>>>> Actually, even though I agreed before, I tend to disagree now. > > >>>>>>> The consumer will still get the DTO events as soon as the CQ > > >>>>>>> upcall is triggered (enabled), so the only problem is with the > > >>>>>>> pending events list. > > >>>>>> > > >>>>>> Why is it an error for the consumer to modify the upcall policy > > >>>>>> when there are pending events? > > >>>>>> > > >>>>>> dat_evd_modify_upcall should behave just like the IBTA spec's > > >>>>>> Request Completion Notification verb in this respect. If there > > >>>>>> were events on the EVD before the upcall is enabled, no upcall > > >>>>>> needs to be generated. A correct consumer can easily work around > > >>>>>> this by enabling the upcall and polling the EVD one final time > > >>>>>> to ensure it is empty. > > >>>>> > > >>>>> There can be more than one event, and the consumer would need to > > >>>>> dequeue many times. While the consumer would do his extra > > >>>>> dequeue-ing he might also get an upcall, because his policy is > > >>>>> now enabled.
I can't think of a design that can handle such a > > >>>>> case, and if there is one it is demanding and complicated, from > > >>>>> the consumer's side. > > >>>> > > >>>> Isn't it the same position all event code written to the OpenIB > > >>>> API is in? > > >>> > > >>> I don't quite know what you are referring to, but if you are > > >>> referring to the case of cq in IB - It's totally different: you > > >>> only enable the cq once, so you will only get one upcall, and the > > >>> rest of the events you will need to dequeue. > > >> > > >> The consumer should only receive one upcall at a time if the upcall > > >> policy is DAT_UPCALL_SINGLE_INSTANCE. If the dequeues are performed > > >> in an upcall, the logic needed in an OpenIB consumer and kDAPL > > >> consumer is essentially the same. > > >> > > >> The difference is that the OpenIB consumer needs to re-enable the CQ > > >> upcall and poll to make sure no events were missed. > > >> > > >>>> I agree with you that this programming model is difficult to use, > > >>>> but I don't think it is impossible. > > >>> > > >>> I think it is a bad idea to dequeue events and at the same time > > >>> receive upcalls from the same queue. It is racy, and has bad > > >>> performance. I don't see *any* reason to do it. > > >> > > >> The current kDAPL implementation does create a situation in which an > > >> upcall and poll occur simultaneously if the upcall is disabled, the > > >> consumer enables the upcall, and then the consumer does a poll. In > > >> this scenario an upcall can occur while the consumer is polling. I > > >> was pointing out that this same race exists in the OpenIB verbs API > > >> (and the IBTA verbs). > > >> > > >> Again, I agree that we can eliminate the additional poll after > > >> enabling the upcall in kDAPL. We just need to do it in a way that is > > >> not hardware specific. I believe we can use the same technique we > > >> did in the DTO upcall.
> > >> > > >> james > > > > From thomas.duffy.99 at alumni.brown.edu Mon Aug 22 17:40:59 2005 From: thomas.duffy.99 at alumni.brown.edu (Tom Duffy) Date: Mon, 22 Aug 2005 17:40:59 -0700 Subject: [openib-general] Re: [PATCH] sdp: split sdp_inet_send to subroutines In-Reply-To: <20050822141701.GZ1856@mellanox.co.il> References: <20050822141701.GZ1856@mellanox.co.il> Message-ID: On Aug 22, 2005, at 7:17 AM, Michael S. Tsirkin wrote: > The following is not yet applied. Opinions on this? Is there a reason to do this? Other than just to have smaller functions. Not that it is a bad thing, just wanted to know if you plan on reusing the smaller subroutines elsewhere. -tduffy From iod00d at hp.com Mon Aug 22 22:18:38 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 22 Aug 2005 22:18:38 -0700 Subject: [openib-general] ISER cleanup In-Reply-To: References: Message-ID: <20050823051838.GA31663@esmail.cup.hp.com> On Mon, Aug 22, 2005 at 01:22:55PM +0300, Dan Bar Dov wrote: > We have begun a cleanup of ISER based on the inputs we received. > Mostly cosmetic cleanups were already committed. yup - good progress and some more cosmetic stuff noted below. Then need to start looking at addressing Christoph's (hch) comments. > These include: > C++ style comments changed to C > Removed DAT 1.2 comments > Removed DAT 1.2 API support file > Removed all platform dependencies Still need to remove kernel_dep.h and probably most of the files in iser/include/. Those also all have a trailing "/* DAT 1.2 */" that might mislead in the future. Maybe a comment in the header about "Based on DAT 1.2" release. iser_api.h Should iSCSI be providing the jump table definitions?
struct iser_api_t struct iser_api_cb_t iser_ext_api.h typedef void * iser_conn_request_t; Delete stuff like this - it just obscures what is going on. I'm not sure what this file is doing. I was expecting iSCSI framework to define the data structures it needs to talk to a service provider. iser_pdu.h sorry - Didn't have time to understand what this is about. iser_types.h delete typedef void * iser_api_handle_t. replace usage of iser_api_handle_t with "void *". Ditto for all "void *" typedefs in that file. Kernel already defines a scatter-gather list type. kernel_dep.h Delete this file. This content belongs in a separate patch that people can grab and apply when they want to build iSER on an older kernel. See src/linux/kernel/patches > Removed vi comments yup - mostly. Some are still present in iser/include/*.h. > Removed CONFIG_INFINIBAND references > Reorganized module > Rewritten Makefile to new style > Added Kconfig file > Using kernel min/max all very good. > There are many other things to be done, including both coding > style and substance, we'll proceed addressing all the technical > issues that were commented on. great! thanks, grant From dotanb at mellanox.co.il Mon Aug 22 22:19:38 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 23 Aug 2005 08:19:38 +0300 Subject: [openib-general] RE: executing the ibv_srq_pingpong with many QPs sometimes causes to a test failure Message-ID: <506C3D7B14CDD411A52C00025558DED6089DBA41@mtlex01.yok.mtl.com> > > Thanks, I'll try to track this down. Can you get a backtrace of the > seg fault to see where it is? > > - R. > Hi.
Here is a backtrace of the seg fault: Loaded symbols for //usr/local/lib/infiniband/mthca.so #0 mthca_tavor_post_srq_recv (ibsrq=0x8051eb8, wr=0xbfffe460, bad_wr=0xbfffe45c) at srq.c:99 99 next_ind = *(int *) wqe; (gdb) bt #0 mthca_tavor_post_srq_recv (ibsrq=0x8051eb8, wr=0xbfffe460, bad_wr=0xbfffe45c) at srq.c:99 #1 0x08049100 in pp_post_recv (ctx=0x804f6e8, n=499) at verbs.h:736 #2 0x0804a34e in main (argc=6, argv=0xbffff404) at srq_pingpong.c:720 Dotan From dotanb at mellanox.co.il Mon Aug 22 23:33:37 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 23 Aug 2005 09:33:37 +0300 Subject: [openib-general] when executing sminfo with a port in down state, there is a return value 0 Message-ID: <506C3D7B14CDD411A52C00025558DED6089DBA60@mtlex01.yok.mtl.com> Hi. I'm working with gen2 svn rev. 3155 with 2 Mellanox HCAs (23108) (1 on each host; they are connected b2b: port 1 to port 1). I executed opensm on host 1, port 1. When I executed sminfo on host 2 port 1 everything was as expected (return value = 0). I killed opensm. When I executed sminfo on host 2 port 1 everything was as expected (return value = 255). When I executed sminfo on host 2 port 2 I got 0 (I expected to get return value = 255). Port 2 in host 2 was down, so I don't know why I got the return value 0. Here is the output: host2:~ # /usr/local/bin/sminfo -C mthca0 -P 2 sminfo: sm lid 0x0 sm guid 0x8200000000, activity count 0 priority 0 state SMINFO_NOTACT 0 host2:~ # echo $? 0 host2:~ # /usr/local/bin/sminfo -C mthca0 -P 2 sminfo: sm lid 0x0 sm guid 0x0, activity count 0 priority 0 state SMINFO_STANDBY 2 host2:~ # echo $?
0 host2:~ # vstat hca_id: mthca0 phys_port_cnt: 2 port: 1 state: PORT_ACTIVE (4) max_mtu: 0 (0) active_mtu: 0 (0) sm_lid: 1 port_lid: 2 port_lmc: 0x00 port: 2 state: PORT_DOWN (1) max_mtu: 0 (0) active_mtu: 0 (0) sm_lid: 0 port_lid: 0 port_lmc: 0x00 can you please help me with this issue? Dotan Barak Software Verification Engineer Mellanox Technologies LTD Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ] -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Tue Aug 23 00:14:37 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 23 Aug 2005 10:14:37 +0300 Subject: [openib-general] Re: [PATCH] sdp: split sdp_inet_send to subroutines In-Reply-To: References: <20050822141701.GZ1856@mellanox.co.il> Message-ID: <20050823071437.GE27630@mellanox.co.il> Quoting r. Tom Duffy : > Subject: Re: [PATCH] sdp: split sdp_inet_send to subroutines > > > On Aug 22, 2005, at 7:17 AM, Michael S. Tsirkin wrote: > > >The following is not yet applied. Opinions on othis? > > Is there a reason to do this? Other than just to have smaller > functions. Not that it is a bad thing, just wanted to know if you > plan on reusing the smaller subroutines elsewhere. > > -tduffy > Currently I'm adding stuff to sdp_inet_send and thats the reason I'm trying to make it manageable. I dont at the moment see how to reuse the smaller functions, but I am keeping that in mind as I go along. -- MST From mst at mellanox.co.il Tue Aug 23 00:23:17 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 23 Aug 2005 10:23:17 +0300 Subject: [openib-general] Re: uverbs: When message is larger than receive buffer In-Reply-To: <430A433D.5060708@keysounds.co.uk> References: <30280207$11247217344309e4462ed125.48694516@config11.schlund.de> <20050822145834.GA1856@mellanox.co.il> <430A433D.5060708@keysounds.co.uk> Message-ID: <20050823072317.GF27630@mellanox.co.il> Quoting r. Steve Wooding : > Subject: Re: uverbs: When message is larger than receive buffer > > Thanks Micheal. > > Can I confirm that no memory beyond that specified in the WR for the > Send op is not actually written to. Yes, IB spec guarantees that. > The reason I'm interested in this is that I'm thinking about registering a > large Memory Region to receive lots of smaller data messages. This is to avoid > the overhead of registering and unregistering MRs. For RDMA Write, would > windowing be the only option to stop a sender writing beyond what the receiver > was expecting? > > Regards, > > Steve. Not sure what you mean. No memory beyond that specified in the WR for the Send op is written to. So you can either: use memory windows (not yet supported in mthca AFAIK), set the region to match the buffer that you want to write to exactly, or trust the sender not to post send wr outside this buffer. Our SDP implementation takes the last approach. -- MST From yael at mellanox.co.il Tue Aug 23 00:25:11 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Tue, 23 Aug 2005 10:25:11 +0300 Subject: [openib-general] OpenSM: new branch Message-ID: <506C3D7B14CDD411A52C00025558DED6089DBA81@mtlex01.yok.mtl.com> Hello Hal, Since the merge or the OpenSM to the main trunk is going to be long, we tried to come up with a way to make it clearer and easier to review. Currently we have the osm-1.8.0-merge branch, that includes all the merges. We gave special attention to make sure fixes done on the main trunk will not be lost in the process. 
During the main trunk update, this branch can be used to validate the merge. Regarding the merge: As we tried to do the merge patch by patch with no success, we would like to propose an alternative stategy for making sure all changes are reviewed. We think the merge should be split to reasonable sized chunks, and applied on each section separately even at the cost of breaking the code at several stages. We believe that the effort of merging all the changes without breaking the compilation is too big, since many changes involve multiple files. We think the following partition could be useful: 1. New and deleted files. 2. Build environment changes 3. Complib Include and Code. 4. Vendor Modifications Include and Code. 5. OpenSM headers and ib_types.h. 6. OpenSM core - SM part. 7. OpenSM core SA part (actually can be done one file at a time) The attached file provides the list of the files for each step, such that obtaining the patch for each step is made simpler. We, of course, will be happy to help in creating the patches, and assist in any way possible to make this merge a smooth as possible. Thanks, Yael -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: osm.1.8.0.strategy Type: application/octet-stream Size: 4685 bytes Desc: not available URL: From guyg at voltaire.com Tue Aug 23 01:58:54 2005 From: guyg at voltaire.com (Guy) Date: Tue, 23 Aug 2005 11:58:54 +0300 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: References: Message-ID: <1124787534.27216.44.camel@r2d2> guy> Isn’t there a problem with polling for CQ, from the CQ’s upcall guy> context (i.e. potentially tasklet/interrupt context). I would like to address this question again to the list, with hope to get a couple of things clarified. How much is linux tolerant to code making *long* use of the interrupt context. 
Are there any limitations (set by the kernel hackers) regarding this issue ? (I know that in rtos this is totally not acceptable, for understandable reasons) How long does count as *long* ? Is it measured in jiffies/usecs/locs other ? Our iser implementation is currently context switching the completion (interrupt context) to a kernel thread for the "handling" and the rest of the polling. But, if linux tolerants code in isr contexts we might consider changing that (at least for the initiator code). Any help in clearing that up, would be much appreciated. Thanks, Guy. p.s. in the request_irq call (mthca_eq.c) - why not set the SA_SAMPLE_RANDOM to contribute to the linux entropy pool ? From mst at mellanox.co.il Tue Aug 23 04:20:37 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 23 Aug 2005 14:20:37 +0300 Subject: [openib-general] mpi drop in openib tree Message-ID: <20050823112037.GI27630@mellanox.co.il> Guys, someone seems to have dropped the complete copy of mvapich mpi under the openib trunk. In the future, could whoever plans to contribute large amounts of code please discuss the code and its placement on the list first? Does this make sense to have it under openib? It appears that Dr. Panda distributes the MPI for gen2 - why would openib want to fork it? We are also talking about almost 100megabytes of code, including what seems java binaries under ./mvapich-gen2/mpe/slog2sdk - I am almost sure we do not want binaries under subversion trunk here. I suggest removing mpi ASAP, or moving it to contrib or some other place where it wont cause each svn checkout to take twice the time. 
-- MST From halr at voltaire.com Tue Aug 23 05:36:14 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 23 Aug 2005 15:36:14 +0300 Subject: [openib-general] OpenSM: new branch Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BED@taurus.voltaire.com> Hi Yael, >Since the merge or the OpenSM to the main trunk is going to be long, That's not exactly what I meant but ... > we tried to come up with a way to make it clearer and easier to review. > Currently we have the osm-1.8.0-merge branch, that includes all the merges. We gave special attention > to make sure fixes done on the main trunk will not be lost in the process. > During the main trunk update, this branch can be used to validate the merge. I will saving the trunk as a new branch before I start this. > Regarding the merge: > As we tried to do the merge patch by patch with no success, > we would like to propose an alternative stategy for making sure all changes are reviewed. > We think the merge should be split to reasonable sized chunks, and applied on each section separately > even at the cost of breaking the code at several stages. > We believe that the effort of merging all the changes without breaking the compilation is too big, since > many changes involve multiple files. > We think the following partition could be useful: > 1. New and deleted files. > 2. Build environment changes > 3. Complib Include and Code. > 4. Vendor Modifications Include and Code. > 5. OpenSM headers and ib_types.h. > 6. OpenSM core - SM part. > 7. OpenSM core SA part (actually can be done one file at a time) Yes, that's along the lines I was thinking and tried to describe in an earlier email. I still think it could be done with patches (with it being more cumbersome to do it that way). > The attached file provides the list of the files for each step, such that obtaining the patch for each step is > made simpler. Thanks. This will definitely help. 
> We, of course, will be happy to help in creating the patches, and assist in any way possible to make this > merge a smooth as possible. I will likely have questions as I go through this. -- Hal From mst at mellanox.co.il Tue Aug 23 08:10:35 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 23 Aug 2005 18:10:35 +0300 Subject: [openib-general] Re: mthca and LinuxBIOS In-Reply-To: <52ek954t2y.fsf@cisco.com> References: <52mznxacbp.fsf@cisco.com> <86802c4405080410236ba59619@mail.gmail.com> <86802c4405080411013b60382c@mail.gmail.com> <521x59a6tb.fsf@cisco.com> <86802c440508041230143354c2@mail.gmail.com> <52slxp6o5b.fsf@cisco.com> <86802c440508051103500f6942@mail.gmail.com> <52u0i45k5e.fsf@cisco.com> <20050807082220.GU15300@mellanox.co.il> <52ek954t2y.fsf@cisco.com> Message-ID: <20050823151035.GR27630@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: mthca and LinuxBIOS > > Michael> My understanding is this was diagnosed as a bug in > Michael> pci_restore_bars. Is that right, or does this need more > Michael> looking into? > > It's hard to tell for sure based on the thread, but it seems like the > bug in pci_restore_bars() was introduced after the initial bug > report. There still seems to be a problem when the HCA's BARs are > assigned over 4 GB. Yhlu, did you have a chance to test with latest firmware? Does the problem persist? -- MST From danb at voltaire.com Tue Aug 23 09:11:18 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Tue, 23 Aug 2005 19:11:18 +0300 Subject: [openib-general] ISER cleanup Message-ID: > -----Original Message----- > On Mon, Aug 22, 2005 at 01:22:55PM +0300, Dan Bar Dov wrote: > > We have begun a cleanup of ISER based on the inputs we received. > > Mostly cosmetic cleanups were already commited. > > yup - good progress and some more cosmetic stuff noted below. > Then need to start looking at addressing Christoph's (hch) comments. 
Some of the comments were taken care of in today's commits iovecs removed procfs removed Function entry/exit traces removed Unnecessary files removed: kernel_dep.h iser_bhs.h iser_trace.c > > Still need to remove kernel_dep.h and probably most of the > files in iser/include/. > > Those also all have a trailing "/* DAT 1.2 */" > that might mislead in the future. > Maybe a comment in the header about "Based on DAT 1.2" release. All DAT 1.2 comments removed. Actually the current code is not DAT 1.2 compatible, but the openIB flavor compatible. Since work started on a CM abstraction, I expect ISER to get off of kdapl and onto ib-verbs + CM abstraction. > > > iser_api.h > Should iSCSI be providiing the jump table definitions? > struct iser_api_t > struct iser_api_cb_t > > iser_ext_api.h > typedef void * iser_conn_request_t; > Delete stuff like this - it just obscures what is going on. OK > > I'm not sure what this file is doing. > I was expecting iSCSI framework to define the data structures > it needs to talk to a service provider. This is an "extended API". The ISER spec defines an ISER API, but it does not consider implementation. We chose to implement the extra api out of the iser_api structute and in the iser_ext_api struct. iSCSI is still not part of the kernel so we had first modified and added the datamover framework to linux-iscsi and now to open-iscsi. Once open-iscsi is in the kernel we'll use it as the framework. > > iser_pdu.h > sorry - Didn't have time to understand what this is about. Most definitions are duplicates from the iscsi and will disappear. The struct iser_send_pdu defines the ISER extensions to the iscsi pdu. > > iser_types.h > delete typdef void * iser_api_handle_t. > replace usage of iser_api_handle_t with "void *". > Ditto for all "void *" typedefs in that file. OK > > Kernel already defines scatter-gather lists type. The iser_data_buf struct can point to a scatterlist array but can also be used to point at a single buffer. 
It does not replicate scatterlist but allows us to deal with two types of registrations - single buffer and scatter lists. > > kernel_dep.h > Delete this file. > This content belongs in a separate patch that people can grab > and apply when they want to build iSER on an older kernel. > See src/linux/kernel/patches Gone. > > > > Removed vi comments > > yup - mostly. Some are still present in iser/include/*.h. Gone as well. > > > Removed CONFIG_INFINIBAND references > > Reorganized module > > Rewritten Makefile to new style > > Added Kconfig file > > Using kernel min/max > > all very good. > > > > There are many other things to be done, including both coding style > > and substance, we'll proceed addressing all the technical > issues that > > were commented on. > > great! We are going to simplify the local memory registrations by registering all memory like in the SRP driver. We do not understand some of the substance issues - for example, dma related comments - are taken care of by iscsi, not the transport. The io_mmu comment, we completely do not understand - there was some platform specific code, but it's all gone now. BTW the code is now down to 7K LOC + 2K LOC of heavily commented header files. > > thanks, > grant

From mst at mellanox.co.il Tue Aug 23 09:18:15 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 23 Aug 2005 19:18:15 +0300
Subject: [openib-general] [PATCH applied] rdma_bw: fix divide by zero
Message-ID: <20050823161815.GT27630@mellanox.co.il>

The following is already applied.

---

Fix compiler warning on i386 systems
Fix divide by zero in service demand calculation for small sizes

Signed-off-by: Michael S. Tsirkin

Index: rdma_bw.c
===================================================================
--- rdma_bw.c (revision 3168)
+++ rdma_bw.c (working copy)
@@ -449,18 +449,21 @@ static void print_report(unsigned int it
 	cycles_to_units = get_cpu_mhz() * 1000000;
 	tsize = duplex ? 2 : 1;
-	tsize = tsize * size / 1024;
+	tsize = tsize * size;
 	printf("Bandwidth peak (#%d to #%d): %g MB/sec\n",
 	       opt_posted, opt_completed,
-	       tsize * cycles_to_units / opt_delta / 1024);
+	       tsize * cycles_to_units / opt_delta / 0x100000);
 	printf("Bandwidth average: %g MB/sec\n",
-	       tsize * iters * cycles_to_units / (tcompleted[iters - 1] - tposted[0]) / 1024);
+	       tsize * iters * cycles_to_units /
+	       (tcompleted[iters - 1] - tposted[0]) / 0x100000);
 	printf("Service Demand peak (#%d to #%d): %ld cycles/KB\n",
-	       opt_posted, opt_completed, opt_delta/tsize);
+	       opt_posted, opt_completed,
+	       (unsigned long)opt_delta * 1024 / tsize);
 	printf("Service Demand Avg : %ld cycles/KB\n",
-	       (tcompleted[iters - 1] - tposted[0])/(tsize * iters));
+	       (unsigned long)(tcompleted[iters - 1] - tposted[0]) *
+	       1024 / (tsize * iters));
 }

-- 
MST

From hch at lst.de Tue Aug 23 09:24:06 2005
From: hch at lst.de (Christoph Hellwig)
Date: Tue, 23 Aug 2005 18:24:06 +0200
Subject: [openib-general] ISER cleanup
In-Reply-To:
References:
Message-ID: <20050823162406.GA28946@lst.de>

On Tue, Aug 23, 2005 at 07:11:18PM +0300, Dan Bar Dov wrote: > > iser_api.h > > Should iSCSI be providing the jump table definitions? > > struct iser_api_t > > struct iser_api_cb_t > > > > iser_ext_api.h > > typedef void * iser_conn_request_t; > > Delete stuff like this - it just obscures what is going on. > OK > > > > > I'm not sure what this file is doing. > > I was expecting iSCSI framework to define the data structures > > it needs to talk to a service provider. > This is an "extended API". The ISER spec defines an ISER API, but it does not consider implementation. > We chose to implement the extra api out of the iser_api structure and in the iser_ext_api struct. > iSCSI is still not part of the kernel so we had first modified and added the datamover framework to > linux-iscsi and now to open-iscsi. Once open-iscsi is in the kernel we'll use it as the framework.
Note that we care very little about specifications for in-kernel APIs. so the distinction between a standard API and extension make little sense, in the end we'll probably only have more or less non-standard APIs. > > iser_types.h > > delete typdef void * iser_api_handle_t. > > replace usage of iser_api_handle_t with "void *". > > Ditto for all "void *" typedefs in that file. > OK > > > > > Kernel already defines scatter-gather lists type. > The iser_data_buf struct can point to a scatterlist array but can also be used to point at a single buffer. > It does not replicate scatterlist but allows us to deal with two types of registrations - single buffer and scatter lists. We're planning to get rid of non-S/G list data transfer in the scsi subsystem for 2.6.14+. See the tree at http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/jejb/scsi-block-2.6.git;a=summary in there everything but the sg,st and osst drivers doesn't submit non-S/G requests anymore, and these last drivers is beeing worked on already. > > > There are many other things to be done, including both coding style > > > and substance, we'll proceed addressing all the technical > > issues that > > > were commented on. > > > > great! > We are going to simplify the local memory registrations by registering all memory like in the SRP driver. > We do not understand some of the substance issues - for example, dma related comments - are taken care of by iscsi, > not the transport. The io_mmu comment, we completely do not understand - there was some platform specific code, but its all gone now. Basically all use of virt_to_{phys,bus} / {phys,bus}_to_virt is wrong. You must use proper dma mapping routines. See Documentation/DMA-API.txt in the kernel tree for details. 
From panda at cse.ohio-state.edu Tue Aug 23 10:07:35 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Tue, 23 Aug 2005 13:07:35 -0400 (EDT) Subject: [openib-general] mpi drop in openib tree In-Reply-To: <20050823112037.GI27630@mellanox.co.il> from "Michael S. Tsirkin" at Aug 23, 2005 02:20:37 PM Message-ID: <200508231707.j7NH7ZvY009485@xi.cse.ohio-state.edu> Hi Michael, > Guys, someone seems to have dropped the complete copy > of mvapich mpi under the openib trunk. > In the future, could whoever plans to contribute large amounts > of code please discuss the code and its placement on the list first? The code has been uploaded after extensive discussions with Matt, Hal, Roland, and Woody. In fact, we followed the steps given by Hal and Roland after personal discussions with them at the OpenIB meeting yesterday. Not sure why the data size is so big. We will take a look at it. Thanks, DK > Does this make sense to have it under openib? > It appears that Dr. Panda distributes the MPI for gen2 - why > would openib want to fork it? > > We are also talking about almost 100megabytes of code, > including what seems java binaries under ./mvapich-gen2/mpe/slog2sdk - > I am almost sure we do not want binaries under subversion trunk here. > > I suggest removing mpi ASAP, or moving it to contrib or some other > place where it wont cause each svn checkout to take twice the time. > > -- > MST > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From danb at voltaire.com Tue Aug 23 10:26:19 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Tue, 23 Aug 2005 20:26:19 +0300 Subject: [openib-general] ISER cleanup Message-ID: > -----Original Message----- > From: Christoph Hellwig [mailto:hch at lst.de] > > Note that we care very little about specifications for in-kernel APIs. 
> so the distinction between a standard API and extension make > little sense, in the end we'll probably only have more or > less non-standard APIs. OK, the extended API deals with the connection establishement. If we'll be using the "socket" semantics the extended API will be the socket. If its something else, then whatever it is. We still don't have any sensible solution to that since it is not possible to start the QP in user space and then use it in kernel. In any case, we can combine the two APIs. > > > > iser_types.h > > > delete typdef void * iser_api_handle_t. > > > replace usage of iser_api_handle_t with "void *". > > > Ditto for all "void *" typedefs in that file. > > OK > > > > > > > > Kernel already defines scatter-gather lists type. > > The iser_data_buf struct can point to a scatterlist array > but can also be used to point at a single buffer. > > It does not replicate scatterlist but allows us to deal > with two types of registrations - single buffer and scatter lists. > > We're planning to get rid of non-S/G list data transfer in > the scsi subsystem for 2.6.14+. See the tree at This is fine for the data, however, we also need to register PDU memory which is not in sg lists. This is most visible with unsolicited data, where part or all of the data is sent along with the PDU in a single send-control. We considered packing the PDU in a single element sg list, but it seems ridiculous. > > > http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/jejb/s > csi-block-2.6.git;a=summary > in there everything but the sg,st and osst drivers doesn't submit non-S/G requests anymore, and these last drivers is beeing worked on already. >> We are going to simplify the local memory registrations by registering all memory like in the SRP driver. >> We do not understand some of the substance issues - for example, dma >> related comments - are taken care of by iscsi, not the transport. 
The io_mmu comment, we completely do not understand - there was some platform specific code, but its all gone now. > Basically all use of virt_to_{phys,bus} / {phys,bus}_to_virt is wrong. > You must use proper dma mapping routines. See Documentation/DMA-API.txt in the kernel tree for details. NP. From mst at mellanox.co.il Tue Aug 23 10:58:38 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 23 Aug 2005 20:58:38 +0300 Subject: [openib-general] Re: mpi drop in openib tree In-Reply-To: <200508231707.j7NH7ZvY009485@xi.cse.ohio-state.edu> References: <20050823112037.GI27630@mellanox.co.il> <200508231707.j7NH7ZvY009485@xi.cse.ohio-state.edu> Message-ID: <20050823175838.GA6848@mellanox.co.il> Quoting Dhabaleswar Panda : > The code has been uploaded after extensive discussions with Matt, Hal, > Roland, and Woody. In fact, we followed the steps given by Hal and > Roland after personal discussions with them at the OpenIB meeting > yesterday. Well, an announcement wouldnt hurt next time, doing an update and getting 100MB was a shock :) Do you plan to use openib svn to develop mvapich from now on? Some questions on the code: - Could you please address the question of what looks like binary java libraries under ./mpe/slog2sdk? I think we need to have the source, not binaries, under subversion. - ./mpid/nt_server/winmpd does not look like it belongs in linux tree - Lots of directories have configure files in them. Again, these are generated so I think we shouldnt keep them under version control. 
Thanks, -- MST From iod00d at hp.com Tue Aug 23 11:17:54 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 23 Aug 2005 11:17:54 -0700 Subject: [openib-general] ISER cleanup In-Reply-To: References: Message-ID: <20050823181754.GC1218@esmail.cup.hp.com> On Tue, Aug 23, 2005 at 07:11:18PM +0300, Dan Bar Dov wrote: > Some of the comments were taken care of in today's commits > iovecs removed > procfs removed > Function entry/exit traces removed > Unnecessary files removed: > kernel_dep.h > iser_bhs.h > iser_trace.c ok - thanks > All DAT 1.2 comments removed. > Actually the current code is not DAT 1.2 compatible, > but the openIB flavor compatible. Certianly. It would be good if the code contained a reference to it's origin. It's obvious to those who read this email thread or folks looking at svn.openib.org. But it won't be once the code gets into > Since work started on a CM abstraction, I expect ISER to get > off of kdapl and onto ib-verbs + CM abstraction. Understood. I lobbied heavily yesterday at the conference to remove elements from kapld with similar functionality to openib verbs and create a "rdma_cm" module with the leftovers. > > I was expecting iSCSI framework to define the data structures > > it needs to talk to a service provider. > This is an "extended API". The ISER spec defines an ISER API, but it does > not consider implementation. We chose to implement the extra api out of > the iser_api structure and in the iser_ext_api struct. ok > iSCSI is still not part of the kernel so we had first modified and > added the datamover framework to linux-iscsi and now to open-iscsi. > Once open-iscsi is in the kernel we'll use it as the framework. ok - I'm not following that developement and trust some of the other folks who are. ... > We do not understand some of the substance issues - for example, > dma related comments - are taken care of by iscsi, not the transport. 
> The io_mmu comment, we completely do not understand - there was some
> platform specific code, but it's all gone now.

What Christoph said. ULPs are responsible for mapping the buffer virtual address to a "DMA address". The right kernel interface depends on the kernel version:

2.2	virt_to_phys, phys_to_virt
	No IOMMU support.

2.4	pci_map_single, pci_map_sg, pci_unmap_*, et al.
	See Documentation/DMA-mapping.txt.
	Only defined to support "PCI-like" busses/devices.

2.6	dma_map_single, dma_map_sg, et al.
	See Documentation/DMA-API.txt.
	Supports IOMMU, multiple bus types, and non-coherent IO
	(i.e. CPU caches not coherent w/DMA).

amd64, IA64, PPC, parisc, alpha, and sparc all have HW with IOMMU. You need to use 2.6 DMA support in the iSER driver. See the following for examples of correct usage:

.../openib_gen2/src/linux-kernel/infiniband/ulp$ fgrep -l -e dma_map -e dma_unmap */*c
ipoib/ipoib_ib.c
sdp/sdp_buff.c
sdp/sdp_rcvd.c
sdp/sdp_recv.c
sdp/sdp_send.c
sdp/sdp_sent.c
srp/ib_srp.c

> BTW the code is now down to 7K LOC + 2K LOC of heavily commented header files.

getting there...good work!

hth,
grant

From iod00d at hp.com Tue Aug 23 11:30:08 2005
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 23 Aug 2005 11:30:08 -0700
Subject: [openib-general] ISER cleanup
In-Reply-To: <20050823181754.GC1218@esmail.cup.hp.com>
References: <20050823181754.GC1218@esmail.cup.hp.com>
Message-ID: <20050823183008.GD1218@esmail.cup.hp.com>

On Tue, Aug 23, 2005 at 11:17:54AM -0700, Grant Grundler wrote:
> Certainly. It would be good if the code contained a reference
> to its origin. It's obvious to those who read this email thread
> or folks looking at svn.openib.org. But it won't be once the
> code gets into

... into kernel.org. sorry...got distracted.
grant From iod00d at hp.com Tue Aug 23 11:53:09 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 23 Aug 2005 11:53:09 -0700 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <1124787534.27216.44.camel@r2d2> References: <1124787534.27216.44.camel@r2d2> Message-ID: <20050823185309.GE1218@esmail.cup.hp.com> On Tue, Aug 23, 2005 at 11:58:54AM +0300, Guy wrote: > How much is linux tolerant to code making *long* use of the interrupt > context. Are there any limitations (set by the kernel hackers) > regarding this issue ? (I know that in rtos this is totally not > acceptable, for understandable reasons) > > How long does count as *long* ? Long enough to service the HW and defer the rest of the work to another context where interrupts are enabled. I'm not aware of any other "hard" rule on this. > Is it measured in jiffies/usecs/locs other ? It should be measured in usecs - preferably sub-jiffies. I think key is the duration must be deterministic. e.g. IDE PIO modes block interrupts for long, deterministic periods of time. That's another reason PIO modes suck - but some HW only works that way. > Our iser implementation is currently context switching the completion > (interrupt context) to a kernel thread for the "handling" and the rest > of the polling. But, if linux tolerants code in isr contexts we might > consider changing that (at least for the initiator code). Can you quantify the problem? Can you quantify how much longer you expect the driver to sit on the ISR? The danger in doing too much work in the interrupt context is "live lock". > in the request_irq call (mthca_eq.c) - why not set the SA_SAMPLE_RANDOM > to contribute to the linux entropy pool ? Who is actually using that flag in the interrupt support routines? i386, x86-64, ia64, ppc, parisc do not seem to. The comments in arch/sparc64/kernel/irq.c: didn't sound too encouraging. 
thanks, grant From panda at cse.ohio-state.edu Tue Aug 23 12:43:50 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Tue, 23 Aug 2005 15:43:50 -0400 (EDT) Subject: [openib-general] Re: mpi drop in openib tree In-Reply-To: <20050823175838.GA6848@mellanox.co.il> from "Michael S. Tsirkin" at Aug 23, 2005 08:58:38 PM Message-ID: <200508231943.j7NJhok1011231@xi.cse.ohio-state.edu> Michael, > Quoting Dhabaleswar Panda : > > The code has been uploaded after extensive discussions with Matt, Hal, > > Roland, and Woody. In fact, we followed the steps given by Hal and > > Roland after personal discussions with them at the OpenIB meeting > > yesterday. > > Well, an announcement wouldnt hurt next time, doing an update > and getting 100MB was a shock :) In our earlier e-mail on Sunday to the openib mailing list, we announced that we are in the process of uploading the new mvapich-gen2 version to the OpenIB SVN and encountering some technical difficulties. Yesterday, we discussed the SVN upload problem with Hal and Roland and uploaded the files. We were going to make an announcement today. > Do you plan to use openib svn to develop mvapich from now on? > > Some questions on the code: > - Could you please address the question of what looks like binary java libraries > under ./mpe/slog2sdk? I think we need to have the source, not binaries, > under subversion. > > - ./mpid/nt_server/winmpd does not look like it belongs in linux tree > > - Lots of directories have configure files in them. Again, these > are generated so I think we shouldnt keep them under version control. These files belong to Argonne's standard MPICH distribution on which MVAPICH is based. They have NOT been introduced by us. Since the last two years, MVAPICH is being distributed as an integrated package with MPICH so that the end users can download and install it easily and run it. 
Thanks, DK > Thanks, > > -- > MST From rolandd at cisco.com Tue Aug 23 14:45:40 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 23 Aug 2005 14:45:40 -0700 Subject: [openib-general] Re: executing the ibv_srq_pingpong with many QPs sometimes causes to a test failure In-Reply-To: <506C3D7B14CDD411A52C00025558DED6089DBA41@mtlex01.yok.mtl.com> (Dotan Barak's message of "Tue, 23 Aug 2005 08:19:38 +0300") References: <506C3D7B14CDD411A52C00025558DED6089DBA41@mtlex01.yok.mtl.com> Message-ID: <52br3o49a3.fsf@cisco.com> Thanks, I think I tracked down the seg fault and pushed a fix out to the subversion repository. I still see some problems with the SRQ pingpong test with many QPs. However I think the problems are in the pingpong test and not the verbs libraries -- the pingpong tests all make bogus assumptions about the ordering of send vs receive completions, so I'm sure there are race conditions. - R. From rolandd at cisco.com Tue Aug 23 14:53:51 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 23 Aug 2005 14:53:51 -0700 Subject: [openib-general] mpi drop in openib tree In-Reply-To: <200508231707.j7NH7ZvY009485@xi.cse.ohio-state.edu> (Dhabaleswar Panda's message of "Tue, 23 Aug 2005 13:07:35 -0400 (EDT)") References: <200508231707.j7NH7ZvY009485@xi.cse.ohio-state.edu> Message-ID: <527jec48wg.fsf@cisco.com> Dhabaleswar> The code has been uploaded after extensive Dhabaleswar> discussions with Matt, Hal, Roland, and Woody. In Dhabaleswar> fact, we followed the steps given by Hal and Roland Dhabaleswar> after personal discussions with them at the OpenIB Dhabaleswar> meeting yesterday.
Without taking a position one way or another on whether MVAPICH should be in the subversion repository, I would just like to note that the discussions I participated on were very narrowly focused on helping with the simple mechanics of how to add a new directory to the subversion repository. As for whether MPI should be in the OpenIB subversion tree or not, my personal opinion is that having MPI there is only appropriate if the svn tree is being used as the primary development source tree. I don't think it's appropriate to use the OpenIB svn server as a release distribution mechanism. - R. From halr at voltaire.com Tue Aug 23 15:54:53 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 24 Aug 2005 01:54:53 +0300 Subject: [openib-general] RE: RMPP Message Format Errors Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BF2@taurus.voltaire.com> Hi Eitan, >We have started testing RMPP packets with osmtest and opensm (gen2 version). >We did not go very far. The first NodeRecord GetTable of all the nodes in a "loopback" case, has some issues. Is this loopback between the 2 HCA ports ?(Just so I can recreate this when I get back). > The explanation is below: > 1. NodeRecord MAD size is 112bytes (note the required padding of 4 bytes at the end of the NodeRec data). > 2. OpenSM log file shows the query should return 2 records one for each end-port. This really happens: Aug 21 14:59:49 998104 [40D9DBB0] -> __osm_nr_rcv_create_nr: Looking for NodeRecord with LID: 0x0 GUID:0x0000000000000000 Aug 21 14:59:49 998224 [40D9DBB0] -> __osm_nr_rcv_new_nr: New NodeRecord: node 0x0002c902000017a0 port 0x0002c902000017a1, lid 0x1. Aug 21 14:59:49 998327 [40D9DBB0] -> __osm_nr_rcv_new_nr: New NodeRecord: node 0x0002c902000017a0 port 0x0002c902000017a2, lid 0x2. Aug 21 14:59:49 998395 [40D9DBB0] -> osm_nr_rcv_process: Returning 2 records. > 3. On the wire we see the following (see attached gif for more details): Could you send the raw hex as well ? a. 
Two data segments were sent and two ACKs were returned. This is OK.

b. The first segment reports PayLen = 440 bytes. According to the spec the first segment might provide paylen != 0 and when it is done it should be equal to the (class header * Num-Segments) + data length. In our case we have data length = 2*112, and SA extra header = 20 bytes * 2 segments. This leads to paylen=264 and not 440!!! The spec defines that in p775-l37. So this is a violation of the spec.

Agreed. It should either be 0 or the real length.

c. The last segment (segment 2) provides the paylen field of 100. The expected value for the last segment length should have been: SA extra header + leftover data size from prev segments. Since the first segment has 200 bytes for data the leftover should have been 112*2 - 200 = 24. With the SA extra header that is 44 bytes. So this is another violation of the spec.

Yes, but perhaps related to the first issue.

d. The analyzer is confused by the above and reports the result as having 3 NodeRecords.

e. <>

4. Following that when we trace the log file of osmtest we find more issues. Probably caused by changes to the vendor layer or the rmpp assembly: It is expected that after assembly the size of the RMPP mad reported to the osm vendor layer will be the rmpp header + SA extra header + data-size. In our case that is 32 + 20 + 2*112 = 276. The log file shows:

Aug 21 14:59:49 [40D87BB0] -> __osmv_sa_mad_rcv_cb: Count = 1 = 200 / 112 (88)
Aug 21 14:59:49 [4017F6C0] -> osmtest_write_all_node_recs: Received 1 records

So this is another problem - probably with the way RMPP results are assembled or passed back to the vendor.

This may be a result of the violations on the sending side.

> Please let me know if you will have time to dig into these problems or if I should try and resolve them myself and provide patches.

I will look at these shortly after I get back.
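
For reference, the PayLen arithmetic in items b and c can be checked with a few lines of C. This is only a sketch using the figures quoted in this thread (20-byte SA header per segment, 112-byte NodeRecord, 200 bytes of record data per segment); the constant and function names are made up for illustration and are not taken from the OpenSM or RMPP code.

```c
#include <assert.h>

/* Figures quoted in the thread; the names are illustrative only. */
enum {
	SA_HDR_LEN   = 20,	/* SA "extra header" carried in each segment */
	NODE_REC_LEN = 112,	/* one padded NodeRecord */
	SEG_DATA_LEN = 200	/* record data that fits in one segment */
};

/* Segments needed for data_len bytes of record data (at least one). */
static unsigned int num_segments(unsigned int data_len)
{
	unsigned int nseg = (data_len + SEG_DATA_LEN - 1) / SEG_DATA_LEN;

	return nseg ? nseg : 1;
}

/* Non-zero PayLen expected in the first segment (item b):
 * per-segment class header times number of segments, plus data length. */
static unsigned int first_seg_paylen(unsigned int data_len)
{
	return SA_HDR_LEN * num_segments(data_len) + data_len;
}

/* PayLen expected in the last segment (item c):
 * class header plus whatever data earlier segments did not carry. */
static unsigned int last_seg_paylen(unsigned int data_len)
{
	unsigned int nseg = num_segments(data_len);
	unsigned int carried = (nseg - 1) * SEG_DATA_LEN;

	assert(carried <= data_len);
	return SA_HDR_LEN + (data_len - carried);
}
```

With data_len = 2 * NODE_REC_LEN this gives 264 for the first segment and 44 for the last, matching the expected values above rather than the 440 and 100 seen on the wire.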
-- Hal From halr at voltaire.com Tue Aug 23 15:56:01 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 24 Aug 2005 01:56:01 +0300 Subject: [openib-general] RE: RMPP Message Format Errors Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175BF3@taurus.voltaire.com> Hi Eitan, You wrote: "Note that the current implementation of the RMPP code ignores the payload length on the receive side, and instead relies on the last bit to determine the end of a transfer." But the receive side needs to calculate back the correct size of the assembled MAD. If it is done in kernel or user it does not matter. To my best knowledge the only way to calculate how many records are enclosed in an RMPP message is to use the paylen and offset. How can it be done without looking at paylen ? All Sean is saying is that the receive RMPP ignores a non zero PayLen in a first segment and uses the last bit (and obviously the PayLen in the last segment) to determine the received length (of the reassembled MAD). -- Hal From rolandd at cisco.com Tue Aug 23 22:07:07 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 23 Aug 2005 22:07:07 -0700 Subject: [openib-general] RDMA connection and address translation API Message-ID: <52y86r2a9w.fsf@cisco.com> At the OpenIB workshop on Monday, we had some discussion about a high-level transport-neutral API for connection handling. After giving the topic some more thought, I've come to the conclusion that neither the kDAPL API nor the new API that was presented are usable. In this email, I'll try to detail my reasoning and sketch what I believe is the correct API. 
The new API that we looked at was essentially the following (I'm recreating this from memory, so I apologize if I misrepresent it):

    listen(local_ip_address, service_id, listen_callback)
    connect(local_qp, remote_ip_address, qos, service_id, private_data, connect_callback)

We already discussed the problem with having the listen callback pass the consumer a remote source address -- doing this requires the connection handling module to do an ATS reverse lookup in the IB case, which the consumer might not want. I think there's agreement that the correct thing here is for the listen callback to pass a transport address to the consumer and provide a function that the consumer can call to perform an ATS reverse lookup if desired. This isn't a major problem and can be dealt with.

However, there's another problem with trying to lump address translation and connection into a single "connect" call, and this problem looks fundamental and fatal to me. The connect call takes a QP pointer, but to create a QP the consumer needs to know which local device to use. However, the consumer doesn't know which device to use until the destination address has been resolved to a route, including a local interface.

As far as I can tell, kDAPL punts on this and simply requires the consumer to handle the route lookup itself before calling dat_ep_connect(). It seems that current kDAPL consumers similarly punt on this issue: the iSER initiator and the NFS-RDMA client both just use a single device which is statically discovered at init time.

It seems that the kDAPL connection model has a serious flaw, in that it pushes the complexity of route lookup into the consumer. Further, we have strong evidence that this routing code is hard to write and that consumers will just ignore this complexity and hard-code solutions that don't work under all configurations.
With this in mind, I believe that the connection API needs to be something more like the following:

rdma_resolve_address():
    inputs: dest IP address, qos, npaths, done callback, opaque context
    done callback params: status, local RDMA device, RDMA transport address, context

This function starts the process of resolving an IP address to an RDMA device and address. When the resolution is complete, the callback is called with a status. If the status is "success" then the callback also gets the device pointer and transport address (as well as the original context that the consumer passed in).

The "RDMA transport address" type is a union containing transport-dependent data. In the IB case, it's all of the SGID, DGID, SLID, DLID, SL etc. that we know and love. In the iWARP case, it's the source IP, destination IP and QOS.

npaths can be either 1 or 2 in the IB case; if it's 2, then the resolver will try to find a primary and alternate path for APM. In the iWARP case, I guess npaths will always be 1, and I guess anyone who wants to use iWARP over multihomed SCTP will probably have to use some lower-level API.

By the way, we may also have to have the option of passing in a local netdev so that we can handle link-local IPv6 addresses. There may be other cases I haven't thought of yet. I just hope we can avoid going all the way to the horror of the getaddrinfo() API.

I also hope we can agree to use IPoIB ARP to resolve the address in the IB case; having a flag or some other hack in the API to expose the option of ATS seems unacceptably ugly.

rdma_connect():
    inputs: local QP, RDMA transport address, destination service, private data, timeout, event callback, opaque context

This function takes the resolved address and actually connects. I'm not sure how we want to abstract the IB service vs. iWARP TCP port number difference. I guess it's OK to have iWARP consumers stick their (16-bit) port number in a 64-bit parameter, even if it's not the prettiest API.
To head off the knee-jerk objection: this API does NOT require any transport-specific code in consumers (unless a particular consumer WANTS to look inside the RDMA transport address). Code to connect would be as simple as: rdma_resolve_address(...); /* wait for resolution */ ib_create_qp(...) /* use device pointer we got from rdma_resolve_address() */ rdma_connect(...); /* pass transport address we got from rdma_resolve_address() */ /* wait for connection to finish... */ The listen side is even simpler: rdma_listen(): inputs: local service, event callback, consumer context Wait for connection requests and pass events to the consumer's callback. I'm not sure if/how we want to support binding to a particular IP address. The current IB CM in Linux doesn't support binding a listen to a single device or port, and even if it did it's not clear how to handle binding to one IP address when a port has more than one IP. I guess the event callback would receive a device pointer and the same RDMA transport address union I talked about above when discussing address resolution. It would be possible to have another function like rdma_getpeername() that takes the transport address and returns a source IP address. In the IB case this would do an ATS reverse lookup. However, I hate this idea. iSER already uses the CM private data to pass the source IP in the IB case, and I would much rather fix NFS/RDMA to do the same thing (so we can just kill ATS as an address resolution method). - R. 
From eitan at mellanox.co.il Tue Aug 23 23:21:42 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 24 Aug 2005 09:21:42 +0300 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175BF3@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175BF3@taurus.voltaire.com> Message-ID: <430C11F6.8000805@mellanox.co.il> Hal Rosenstock wrote: > Hi Eitan, > > You wrote: > "Note that the current implementation of the RMPP code ignores the payload length on the receive side, and instead relies on the last bit to determine the end of a transfer." > > But the receive side needs to calculate back the correct size of the assembled MAD. > If it is done in kernel or user it does not matter. To my best knowledge the only way to calculate how many records are enclosed in an RMPP message is to use the paylen and offset. > How can it be done without looking at paylen ? > > All Sean is saying is that the receive RMPP ignores a non zero PayLen in a first segment and uses the last bit (and obviously the PayLen in the last segment) to determine the received length (of the reassembled MAD). > OK, thanks for the clarification. We could use a paylen = 0 at first (but that is not last) segment From mst at mellanox.co.il Tue Aug 23 23:48:48 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 24 Aug 2005 09:48:48 +0300 Subject: [openib-general] Re: RDMA connection and address translation API In-Reply-To: <52y86r2a9w.fsf@cisco.com> References: <52y86r2a9w.fsf@cisco.com> Message-ID: <20050824064848.GU27630@mellanox.co.il> Hello, Roland! Quoting r. Roland Dreier : > With this in mind, I believe that the connection API needs to be > something more like the following: I'll have to think a bit about it some more, generally what you say makes a lot of sense to me. 
One comment - since completion is asynchronous, it seems clear that we'll need an additional API for cancelling operations, and timing things out (I guess time outs may be handled by the event callback?). Another thing to think about is closing connections versus closing an underlying qp. -- MST From sean.hefty at intel.com Wed Aug 24 00:41:24 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 24 Aug 2005 00:41:24 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <52y86r2a9w.fsf@cisco.com> Message-ID: >However, there's another problem with trying to lump address >translation and connection into a single "connect" call, and this >problem looks fundamental and fatal to me. The connect call takes a >QP pointer, but to create a QP the consumer needs to know which local >device to use. However, the consumer doesn't know which device to use >until the destination address has been resolved to a route, including >a local interface. I agree that this is a fairly serious issue with the proposed API. I guess that I'd like to clarify what the operation of a connect call would do. Would it be responsible for modifying the QP? If so, could such a call also allocate the QP? Note that I'm not advocating either of these, just trying to determine what the behavior of the API would be. > Wait for connection requests and pass events to the consumer's > callback. I'm not sure if/home we want to support binding to > a particular IP address. The current IB CM in Linux doesn't > support binding a listen to a single device or port, and even > if it did it's not clear how to handle binding to one IP > address when a port has more than one IP. I don't think that it would be overly difficult to bind IB CM listen requests to a specific port or LID, or based on matching specific private data. 
- Sean From danb at voltaire.com Wed Aug 24 01:06:29 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Wed, 24 Aug 2005 11:06:29 +0300 Subject: [openib-general] RE: [iSER]question about iSER code Message-ID: > -----Original Message----- > From: xg wang [mailto:xgw200172 at hotmail.com] > Sent: Wednesday, August 24, 2005 6:17 AM > To: openib-general at openib.org > Cc: Dan Bar Dov; ianjiang91 at hotmail.com > Subject: [iSER]question about iSER code > > Hi, all! > > I read the draft-ieft-ips-iser-04 and the iSER source > code. In the draft, it wrote "During connection setup, the > iSCSI Layer at the initiator is responsible for establishing > a connection with the target" and "If the outcome of the > iSCSI negotiation is to enable iSER-assisted mode, then on > the initiator side, prior to sending the Login Request with > the T (Transit) bit set to 1 and the NSG (Next Stage) field > set to FullFeaturePhase, the iSCSI Layer MUST request the > iSER Layer to allocate the connection resources necessary to > support iWARP by invoking the Allocate_Connection_Resources > Operational Primitive" > > So I think that the iSCSI connection is different from the > iSER conection since it 'allocate the connection resources'. > Does it mean iSER layer will estabilsh another > iSER-connection. I am not sure. If it is, then the process is > like this: > > 1 iSCSI layer establish connection > 2 negotiated RDMAExtensions Yes > 3 iSRR layer establish connecton---band iSER connection to > iSCSI connection > 4 Notice_Key_Values > 5 Allocate_Connection_Resources > 6 send login request user function iser_initiator_send_control > > Since in the iser code, login request is in function > iser_initiator_send_control, and the function used 'struct > iser_conn_t p_iser_conn'. I conclude iSER establish an new > connecton because the iSCSI connection could not use > iser_initiator_send_control if the RDMAExtentsions negotiated > as 'No'. > Am I right? The ISER spec is iWARP oriented. 
Please read also the Hufferd spec regarding ISER over IB. There is work in progress regarding ISER spec modification so it becomes transport neutral. The ISER implementation at openIB starts the connection already in ISER mode, so the flow you list above is not followed. > > By the way, there is 'iser_post_send_control', whereas no > 'iser_post_receive_control'. How to send the unsolicited data > from the initiator to target? These are two questions. Unsolicited data is sent using send-control. There is no such thing as receive control - there's control-notify. Dan > > Thanks!! > > Computer Architecture Laboratory > Institute of Computing Technology > Chinese Academy of Sciences > Beijing,P.R.China > Zip code: 100080 > Tel: +86-10-62565533-9314(office) > > xigui wang > xgw200172 at hotmail.com > 2005-08-24 > > _________________________________________________________________ > Free download of MSN Explorer: http://explorer.msn.com/lccn > > From sean.hefty at intel.com Wed Aug 24 01:07:57 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 24 Aug 2005 01:07:57 -0700 Subject: [openib-general] RE: RMPP Message Format Errors In-Reply-To: <430C11F6.8000805@mellanox.co.il> Message-ID: >> But the receive side needs to calculate back the correct size of the >assembled MAD. >> If it is done in kernel or user it does not matter. To my best knowledge the >only way to calculate how many records are enclosed in an RMPP message is to >use the paylen and offset. >> How can it be done without looking at paylen ? >> >> All Sean is saying is that the receive RMPP ignores a non zero PayLen in a >first segment and uses the last bit (and obviously the PayLen in the last >segment) to determine the received length (of the reassembled MAD). >> >OK, thanks for the clarification. We could use a paylen = 0 at first >(but that is not last) segment Looking through the code, it appears that the proper size of the MAD is being reported in the kernel and exported up to userspace.
If I guessed the structure of the opensm code correctly, the length is returned by umad_recv() in umad_receiver() in osm_vendor_ibumad.c The length is discarded after umad_receiver() returns. I guess that one possible solution is for opensm to save the length value into the payload_length field in the RMPP header before returning from umad_receiver(). - Sean From guyg at voltaire.com Wed Aug 24 01:58:18 2005 From: guyg at voltaire.com (Guy German) Date: Wed, 24 Aug 2005 11:58:18 +0300 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <20050823185309.GE1218@esmail.cup.hp.com> References: <1124787534.27216.44.camel@r2d2> <20050823185309.GE1218@esmail.cup.hp.com> Message-ID: <1124873898.3933.23.camel@r2d2> Hi Grant, On Tue, 2005-08-23 at 11:53 -0700, Grant Grundler wrote: > On Tue, Aug 23, 2005 at 11:58:54AM +0300, Guy wrote: > > How much is linux tolerant to code making *long* use of the interrupt > > context. Are there any limitations (set by the kernel hackers) > > regarding this issue ? (I know that in rtos this is totally not > > acceptable, for understandable reasons) > > > > How long does count as *long* ? > > Long enough to service the HW and defer the rest of the work > to another context where interrupts are enabled. > I'm not aware of any other "hard" rule on this. > > > Is it measured in jiffies/usecs/locs other ? > > It should be measured in usecs - preferably sub-jiffies. > I think key is the duration must be deterministic. > > e.g. IDE PIO modes block interrupts for long, deterministic periods > of time. That's another reason PIO modes suck - but some HW > only works that way. What about the SRP implementation, which polls the cq in the cq's upcall context (interrupt in mthca). I guess that it improves performance, and if it is good enough for the SRP (and linux) I think it will be good for the iSER *initiator* as well. 
The thing is - even if I find it reasonable to do it this way (after calculating the estimated amount of time spent in the ISR), I am not sure it is a good scalable solution, because there can be several initiators and hcas per machine. The fact that the primitiveness of linux is not a major issue, makes it harder to decide on the right way to go here... > > Our iser implementation is currently context switching the completion > > (interrupt context) to a kernel thread for the "handling" and the rest > > of the polling. But, if linux tolerants code in isr contexts we might > > consider changing that (at least for the initiator code). > > Can you quantify the problem? > Can you quantify how much longer you expect the driver to > sit on the ISR? > > The danger in doing too much work in the interrupt context is > "live lock". > > > in the request_irq call (mthca_eq.c) - why not set the SA_SAMPLE_RANDOM > > to contribute to the linux entropy pool ? > > Who is actually using that flag in the interrupt support routines? > i386, x86-64, ia64, ppc, parisc do not seem to. > The comments in arch/sparc64/kernel/irq.c: didn't sound too encouraging. You are probably right. I did not look at the current linux implementation of it, just the use of the flag in other drivers... Thanks, Guy From guyg at voltaire.com Wed Aug 24 02:03:56 2005 From: guyg at voltaire.com (Guy German) Date: Wed, 24 Aug 2005 12:03:56 +0300 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <20050823185309.GE1218@esmail.cup.hp.com> References: <1124787534.27216.44.camel@r2d2> <20050823185309.GE1218@esmail.cup.hp.com> Message-ID: <1124874236.3933.27.camel@r2d2> Sorry.. primitiveness = preemptive-ness damn speller :) From mst at mellanox.co.il Wed Aug 24 02:24:59 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 24 Aug 2005 12:24:59 +0300 Subject: [openib-general] ipoib oops Message-ID: <20050824092459.GA20750@mellanox.co.il> Hi, Roland! 
I have seen the following oops recently, typically after restarting opensm on the same machine. This is on ipoib rev 3113. Please note I'm running with my two event patches. The oops seems to be around offset db7 below:

drivers/infiniband/ulp/ipoib/ipoib_multicast.c:223
     da4:  49 8b 7d 08            mov    0x8(%r13),%rdi
     da8:  48 81 c7 b4 00 00 00   add    $0xb4,%rdi
     daf:  f3 a6                  repz cmpsb %es:(%rdi),%ds:(%rsi)
     db1:  75 17                  jne    dca
drivers/infiniband/ulp/ipoib/ipoib_multicast.c:225
     db3:  49 8b 45 70            mov    0x70(%r13),%rax
include/linux/byteorder/swab.h:147
     db7:  8b 40 20               mov    0x20(%rax),%eax
include/asm/byteorder.h:17
     dba:  0f c8                  bswap  %eax
include/linux/byteorder/swab.h:147
     dbc:  41 89 85 f0 02 00 00   mov    %eax,0x2f0(%r13)
drivers/infiniband/ulp/ipoib/ipoib_multicast.c:226
     dc3:  41 89 85 84 03 00 00   mov    %eax,0x384(%r13)
include/asm/bitops.h:236
     dca:  8b 85 b8 00 00 00      mov    0xb8(%rbp),%eax

Line 255 is here:

	struct ib_ah_attr av = {
		.dlid = be16_to_cpu(mcast->mcmember.mlid),
		.port_num = priv->port,
		.sl = mcast->mcmember.sl,
255:		.ah_flags = IB_AH_GRH,
		.grh = {
			.flow_label = be32_to_cpu(mcast->mcmember.flow_label),
			.hop_limit = mcast->mcmember.hop_limit,
			.sgid_index = 0,
			.traffic_class = mcast->mcmember.traffic_class
		}
	};

so apparently the problem is in accessing mcast->mcmember.mlid. And I wonder: what prevents the mcast object from being destroyed while a completion is outstanding?
MST Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP: {:ib_ipoib:ipoib_mcast_join_finish+119} PGD 17e386067 PUD 17f228067 PMD 0 Oops: 0000 [1] SMP CPU 0 Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad ib_core Pid: 17774, comm: ib_mad1 Not tainted 2.6.12.2 RIP: 0010:[] {:ib_ipoib:ipoib_mcast_join_finish+119} RSP: 0018:ffff810169763c58 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8101688a3000 RCX: 0000000000000000 RDX: ffff810178f18a80 RSI: ffff810178f18a90 RDI: ffff8101688a30c4 RBP: ffff810178f18a80 R08: 0000000000000000 R09: ffff810169763d38 R10: ffff810169763df8 R11: 0000000000000001 R12: 0000000000000000 R13: ffff8101688a3380 R14: ffff8101688a3000 R15: ffff810178be1898 FS: 0000000000000000(0000) GS:ffffffff80579f00(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000020 CR3: 000000017e3d0000 CR4: 00000000000006e0 Process ib_mad1 (pid: 17774, threadinfo ffff810169762000, task ffff81016d34d110) Stack: 0000000000000001 0000000000000006 ffffc20000022000 0000000000000296 0000000000000296 ffffffff802800a0 ffffffff803c0c9d ffff81017874d540 ffff810178f18d80 ffff810167de7210 Call Trace:{dma_pool_free+272} {_spin_unlock_irqrestore+5} {:ib_ipoib:ipoib_mcast_join_complete+43} {:ib_core:ib_unpack+198} {:ib_sa:ib_sa_mcmember_rec_callback+64} {:ib_sa:recv_handler+117} {:ib_mad:ib_mad_completion_handler+941} {:ib_mad:ib_mad_completion_handler+0} {worker_thread+476} {default_wake_function+0} {__wake_up_common+64} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+204} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} Code: 8b 40 20 0f c8 41 89 85 f0 02 00 00 41 89 85 84 03 00 00 8b RIP {:ib_ipoib:ipoib_mcast_join_finish+119} RSP CR2: 0000000000000020 -- MST From steve_wooding at keysounds.co.uk Wed Aug 24 03:28:01 2005 From: steve_wooding at keysounds.co.uk (=?iso-8859-1?Q?Steve_Wooding?=) Date: Wed, 24 Aug 2005 
12:28:01 +0200 Subject: [openib-general] How to change NodeDescription Message-ID: <30280207$1124878682430c495a0b75a3.09921159@config5.schlund.de> Hi, Is it possible with the gen2 driver to set the NodeDescription for each host. I have a machine which is still running gen1 drivers and the NodeDescription is set to the hostname. I guess this is done when the driver is loaded at boot time (but couldn't find out exactly how this is done). The NodeDescription on gen2 is "MT25208 InfiniHostEx Mellanox Technologies", which looks like it was formed by looking at lspci or something. I would like to change it to hostname plus some other custom attribute (e.g. type of node). Thanks, Steve. From alexn at voltaire.com Wed Aug 24 04:09:55 2005 From: alexn at voltaire.com (Alex Nezhinsky) Date: Wed, 24 Aug 2005 14:09:55 +0300 Subject: [openib-general] RE: [iSER]question about iSER code Message-ID: xg wang wrote: >> I read the draft-ieft-ips-iser-04 and the iSER source code. In >> the draft, it wrote ... >> So I think that the iSCSI connection is different from the iSER >> conection since it 'allocate the connection resources'. >> Does it mean iSER layer will estabilsh another iSER-connection. I am >> not sure. ... Dan Bar Dov wrote: > The ISER spec is iWARP oriented. Please read also the > Huffered spec regarding ISER over IB. There is work in > progress regarding ISER spec modification so it becomes > transport neutral. The ISER implementation at openIB starts > the connection Already in ISER mode, so the flow you list > above is not followed. There is also a new draft candidate by Mike Ko, incorporating the changes proposed in the Hufferd's draft. It can be found as draft-ietf-ips-iser-05-candidate.pdf or draft-ietf-ips-iser-05-candidate.txt at http://www.haifa.il.ibm.com/satran/ips/ This text refers to the transport layer as an RDMA Capable Protocol, mentioning the differences between iWARP and IB when necessary. iSER layer does not need to establish a new connection. 
ISER serves as a transport layer for iSCSI, so its connection is *the* connection, at least for the entire full-featured phase. The only issue is related to the login phase but this is being discussed here and in IETF-IPS. Alexander Nezhinsky Software Engineer Infiniband Storage Solutions 9 Ha-Menofim, Herzelya 46725 Israel tel: +972- 9- 9717637 fax: +972- 9- 9717660 mobile: +972-50-7504376 From guyg at voltaire.com Wed Aug 24 04:41:58 2005 From: guyg at voltaire.com (Guy German) Date: Wed, 24 Aug 2005 14:41:58 +0300 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: References: Message-ID: <1124883718.3933.95.camel@r2d2> Hi, - Here is a header file for cm abstraction API proposition. - This is just a preliminary suggestion, for review. - All comments are welcome. - Please read the notes in the header remarks - I am attaching the file and will send it later in a different message, to the list. - I think that the ib_ prefix should be changed to rdma_, but that should be done for the rest of the verbs as well, if we are claiming that the ib verbs abstract iwarp. - I think that the main difference between the 2 propositions is the question of whether or not to expose the consumer to the address resolution. I believe this suggestion (of covering it in the cma) is simpler, because it saves unnecessary upcall handling for the consumer. In any case - I don't believe this is clear cut, and would like to hear other opinions from people on the list. - Also please see my embedded answer to this mail Thanks, Guy. > We already discussed the problem with having the listen callback pass > the consumer a remote source address -- doing this requires the > connection handling module to do an ATS reverse lookup in the IB case, > which the consumer might not want. 
I think there's agreement that the > correct thing here is for the listen callback to pass a transport > address to the consumer and provide a function that the consumer can > call to perform an ATS reverse lookup if desired. This isn't a major > problem and can be dealt with. I agree. This is corrected in the current suggestion > However, there's another problem with trying to lump address > translation and connection into a single "connect" call, and this > problem looks fundamental and fatal to me. The connect call takes a > QP pointer, but to create a QP the consumer needs to know which local > device to use. However, the consumer doesn't know which device to use > until the destination address has been resolved to a route, including > a local interface. The proposition, also presented (I believe) in the OpenIB workshop, includes a function called ib_cma_get_device, that retrieves the device (for qp creation purposes) according to the destination address and the local routing table. This is done synchronously, and it is implemented today in the at module. If using link-local IPv6 addresses, I think that this function isn't even necessary (If I understand it correctly - you need to know which device to get out from). > As far as I can tell, kDAPL punts on this and simply requires the > consumer to handle the route lookup itself before calling > dat_ep_connect(). It seems that current kDAPL consumers similarly > punt on this issue: the iSER initiator and the NFS-RDMA client both > just use a single device which is statically discovered at init time. > > It seems that the kDAPL connection model has a serious flaw, in that > it pushes the complexity of route lookup into the consumer. Further, > we have strong evidence that this routing code is hard to write and > that consumers will just ignore this complexity and hard-code > solutions that don't work under all configurations. 
> With this in mind, I believe that the connection API needs to be > something more like the following: > > rdma_resolve_address(): > inputs: dest IP address, qos, npaths, > done callback, opaque context > done callback params: status, local RDMA device, > RDMA transport address, context > > This function starts the process of resolving an IP address to > an RDMA device and address. When the resolution is complete, > the callback is called with a status. If the status is > "success" then the callback also gets the device pointer and > transport address (as well as the original context that the > consumer passed in). In the address resolution you have 2 upcalls (from ip to gid and from gid to path). So, if you are already covering one upcall in the cma, why not cover both ? > The "RDMA transport address" type is a union containing > transport-dependent data. In the IB case, it's all of the > SGID, DGID, SLID, DLID, SL etc. that we know and love. In the > iWARP case, it's the source IP, destination IP and QOS. > > npaths can be either 1 or 2 in the IB case; if it's 2, then > the resolver will try to find a primary and alternate path for > APM. In the iWARP case, I guess npaths will always be 1, and > I guess anyone who wants to use iWARP over multihomed SCTP > will probably have to use some lower-level API. > > By the way, we may also have to have the option of passing in > a local netdev so that we can handle link-local IPv6 > addresses. There may be other cases I haven't thought of yet. > I just hope we can avoid going all the way to the horror of > the getaddrinfo() API. > > I also hope we can agree to use IPoIB ARP to resolve the > address in the IB case; having a flag or some other hack in > the API to expose the option of ATS seems unacceptably ugly. > > rdma_connect(): > inputs: local QP, RDMA transport address, destination service, > private data, timeout, event callback, opaque context > > This function takes the resolved address and actually connects. 
> > I'm not sure how we want to abstract the IB service vs. iWARP > TCP port number difference. I guess it's OK to have iWARP > consumers stick their (16-bit) port number in a 64-bit > parameter, even if it's not the prettiest API. > > To head off the knee-jerk objection: this API does NOT require any > transport-specific code in consumers (unless a particular consumer > WANTS to look inside the RDMA transport address). Code to connect > would be as simple as: > > rdma_resolve_address(...); > /* wait for resolution */ > ib_create_qp(...) /* use device pointer we got from rdma_resolve_address() */ > rdma_connect(...); /* pass transport address we got from rdma_resolve_address() */ > /* wait for connection to finish... */ Wouldn't it be simpler (for the consumer) to do: resolve_device_by_destip(); /* don't wait */ ib_create_qp(...) /* use device pointer we got */ rdma_connect(dest_ip); /* cma resolution implementation for ib*/ /* wait for at + connection to finish... */ I think this flow is also more "iwarp friendly" - saves them the asynchronic rdma_resolve_address wait. > > The listen side is even simpler: > > rdma_listen(): > inputs: local service, event callback, consumer context > > Wait for connection requests and pass events to the consumer's > callback. I'm not sure if/home we want to support binding to > a particular IP address. The current IB CM in Linux doesn't > support binding a listen to a single device or port, and even > if it did it's not clear how to handle binding to one IP > address when a port has more than one IP. > I guess the event callback would receive a device pointer and > the same RDMA transport address union I talked about above > when discussing address resolution. > > It would be possible to have another function like > rdma_getpeername() that takes the transport address and > returns a source IP address. In the IB case this would do an > ATS reverse lookup. However, I hate this idea. 
iSER already > uses the CM private data to pass the source IP in the IB case, > and I would much rather fix NFS/RDMA to do the same thing (so > we can just kill ATS as an address resolution method). > - R. -------------- next part -------------- A non-text attachment was scrubbed... Name: ib_cma.h Type: text/x-chdr Size: 7360 bytes Desc: not available URL: From halr at voltaire.com Wed Aug 24 05:10:00 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 24 Aug 2005 15:10:00 +0300 Subject: [openib-general] How to change NodeDescription Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C0D@taurus.voltaire.com> Hi Steve, Steve wrote: > Is it possible with the gen2 driver to set the NodeDescription for each host ? I believe the NodeDescription is an NVRAM field in the HCA firmware so this would need to be supported by the flash tool. I don't know if mstflint supports this. Michael would certainly know. -- Hal From mst at mellanox.co.il Wed Aug 24 05:23:08 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 24 Aug 2005 15:23:08 +0300 Subject: [openib-general] Re: How to change NodeDescription In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175C0D@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175C0D@taurus.voltaire.com> Message-ID: <20050824122308.GG20750@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: How to change NodeDescription > > Hi Steve, > > Steve wrote: > > Is it possible with the gen2 driver to set the NodeDescription for each host ? > > I believe the NodeDescription is an NVRAM field in the HCA firmware so this would need to be supported by the flash tool. I don't know if mstflint supports this. Michael would certainly know. > > -- Hal No, I dont believe it does. 
-- MST From guyg at voltaire.com Wed Aug 24 05:21:07 2005 From: guyg at voltaire.com (Guy German) Date: Wed, 24 Aug 2005 15:21:07 +0300 Subject: [openib-general] Connection Manager Abstraction proposition - header file Message-ID: <20050824122107.GA2323@voltaire.com>

/*
 * Copyright (c) 2005 Voltaire Inc.  All rights reserved.
 *
 * This Software is licensed under one of the following licenses:
 *
 * 1) under the terms of the "Common Public License 1.0" a copy of which is
 *    available from the Open Source Initiative, see
 *    http://www.opensource.org/licenses/cpl.php.
 *
 * 2) under the terms of the "The BSD License" a copy of which is
 *    available from the Open Source Initiative, see
 *    http://www.opensource.org/licenses/bsd-license.php.
 *
 * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
 *    copy of which is available from the Open Source Initiative, see
 *    http://www.opensource.org/licenses/gpl-license.php.
 *
 * Licensee has the right to choose one of the above licenses.
 *
 * Redistributions of source code must retain the above copyright
 * notice and one of the license notices.
 *
 * Redistributions in binary form must reproduce both the above copyright
 * notice, one of the license notices in the documentation
 * and/or other materials provided with the distribution.
 *
 */

/*
 * This header file is a preliminary proposition for a connection manager
 * abstraction layer (cma) for IB and iwarp
 * - there is an assumption that iwarp uses the same openib qp terminology in
 *   the rest of the verbs, and the only place that needs abstraction is the cm.
 * - This proposition assumes that the address translation is done in the cma
 *   layer.
 * - The cma also modifies the qp states to init/rtr/rts and error as needed.
 * - for calling accept/reject or disconnect on the passive side you need to
 *   use the cma handle accepted in ib_cma_listen cb.
 * - cma_id is created when calling connect or listen and destroyed when
 *   accepting disconnected/rejected/unreachable events on either active
 *   side (connect cb) or passive side (accept cb)
 */

#ifndef IB_CMA_H
#define IB_CMA_H

#include

enum ib_cma_event {
	IB_CMA_EVENT_ESTABLISHED,
	IB_CMA_EVENT_REJECTED,
	IB_CMA_EVENT_DISCONNECTED,
	IB_CMA_EVENT_UNREACHABLE
};

enum ib_qos {
	IB_QOS_BEST_EFFORT	= 0,
	IB_QOS_HIGH_THROUGHPUT	= (1 << 0),
	IB_QOS_LOW_LATENCY	= (1 << 1),
	IB_QOS_ECONOMY		= (1 << 2),
	IB_QOS_PREMIUM		= (1 << 3)
};

enum ib_connect_flags {
	IB_CONNECT_DEFAULT_FLAG		= 0x00,
	IB_CONNECT_MULTIPATH_FLAG	= 0x01
};

/*
 * for ib_cma_get_src_ip - ib_cma_id will have to include
 * the path data received in the request handler
 */
union ib_cma_id {
	struct ib_cm_id *cm_id;
	u32 iwarp_id;
};

typedef void (*ib_cma_rarp_handler)(struct sockaddr *src_ip, void *context);
typedef void (*ib_cma_ac_handler)(enum ib_cma_event event, void *context);
typedef void (*ib_cma_event_handler)(enum ib_cma_event event, void *context,
				     void *private_data);
typedef void (*ib_cma_listen_handler)(union ib_cma_id *cma_id,
				      void *private_data, void *context);

struct ib_cma_conn {
	struct ib_qp		*qp;
	struct ib_qp_attr	*qp_attr;
	struct sockaddr		*dst_ip;
	__be64			service_id;
	void			*context;
	ib_cma_event_handler	cma_event_handler;
	const void		*private_data;
	u8			private_data_len;
	u32			timeout;
	enum ib_qos		qos;
	enum ib_connect_flags	connect_flags;
};

/**
 * ib_cma_get_device - Returns the device to be used according to
 *   the destination ip address (this can be determined according
 *   to the local routing table). Call this function before
 *   creating the qp. If using link-local IPv6 addresses
 * @remote_address: The destination address for connection
 * @device: The device to use (returned by the function)
 */
int ib_cma_get_device(struct sockaddr *remote_address,
		      struct ib_device **device);

/**
 * ib_cma_connect - this is the connect request function, called by
 *   the active side. The consumer registers an upcall that will be
 *   initiated by the cma with an appropriate connection event
 *   notification (established/rejected/disconnected etc)
 * @cma_conn: This structure contains the following connection parameters:
 *   @qp: qp for establishing the connection
 *   @qp_attr: only relevant attributes are used
 *   @dst_ip: destination ip address
 *   @service_id: destination service id (port)
 *   @context: context to be returned in the callback
 *   @cma_event_handler: the upcall function for the active side
 *   @private_data: private data to be received at the listener upcall
 *   @private_data_len: private data length (max 255)
 *   @timeout:
 *   @qos: Quality of service for the rc
 *   @connect_flags: default or multipath connection
 * @cma_id: This returned handle is a union (different in ib and iwarp)
 *   in ib - it is the cm_id.
 */
int ib_cma_connect(struct ib_cma_conn *cma_conn, union ib_cma_id *cma_id);

/**
 * ib_cma_disconnect - this function disconnects the rc. It can be
 *   called, by either the passive or active side
 * @qp: the connected qp to disconnect
 * @cma_id: On the active side- this handle is the one returned
 *   when ib_cma_connect was called.
 *   On the passive side- this handle was accepted in cma_listen callback
 */
int ib_cma_disconnect(struct ib_qp *qp, union ib_cma_id *cma_id);

/**
 * ib_cma_sid_listen - this function is called by the passive side. It is
 *   listening on the specified port (ib service id) for incoming
 *   connection requests
 * @device: ? need to resolve this issue
 * @service_id: service id (port) to listen on
 * @context: user context to be returned in the callback
 * @cm_listen_handler: the listen callback
 * @cma_id: cma handle for the passive side
 */
int ib_cma_sid_listen(struct ib_device *device, __be64 service_id,
		      void *context, ib_cma_listen_handler cm_listen_handler,
		      union ib_cma_id *cma_id);

/**
 * ib_cma_sid_destroy - this function is called on the passive side, to
 *   stop listening on a certain service id
 * @cma_id: the same cma handle received when ib_cma_sid_listen was called
 */
int ib_cma_sid_destroy(union ib_cma_id *cma_id);

/**
 * ib_cma_accept - call on the passive side to accept a connection request
 * @cma_id: this handle was accepted in cma_listen callback
 * @qp: the connection's qp
 * @private_data: private data to send back to the initiator
 * @private_data_len: private data length
 * @context: user context to be returned in the callback
 * @cm_accept_handler: the cma accept callback - triggered when RTU ack
 *   received
 */
int ib_cma_accept(union ib_cma_id *cma_id, struct ib_qp *qp,
		  const void *private_data, u8 private_data_len,
		  void *context, ib_cma_ac_handler cm_accept_handler);

/**
 * ib_cma_reject - call on the passive side to reject a connection request.
 *   This call destroys the cma_id, hence when the active side accepts
 *   the reject the cma_id is already destroyed.
 * @cma_id: this handle was accepted in cma_listen callback
 * @private_data: private data to send back to the initiator
 * @private_data_len: private data length
 */
int ib_cma_reject(union ib_cma_id *cma_id, const void *private_data,
		  u8 private_data_len);

/**
 * ib_cma_get_src_ip - this function performs "rarp", asynchronously
 *   from cma_id to src ip
 * @cma_id: the cma_id will have to include the path data received
 *   in the request handler
 * @src_ip: source ip of the initiator
 */
int ib_cma_get_src_ip(union ib_cma_id *cma_id,
		      ib_cma_rarp_handler rarp_handler, void *context);

#endif /* IB_CMA_H */

From christian.guggenberger at rzg.mpg.de Wed Aug 24 05:32:06 2005 From: christian.guggenberger at rzg.mpg.de (Christian Guggenberger) Date: Wed, 24 Aug 2005 14:32:06 +0200 Subject: [openib-general] problems with loading ib_mthca Message-ID: <1124886726.24411.20.camel@bonnie.rzg.mpg.de> Hi, I am in the middle of setting up a small, 2 node environment for testing openib. I've got two machines (Sun V20z, Dual Opteron 248), with a 23108 (128MB) Tavor card each. Both systems were previously running IBGD (without major problems). While I got everything working on one machine (based on 2.6.12.5 and a svn checkout Aug, 23rd), the very same kernel fails to load ib_mthca on the other machine. Logs: ib_mad: Unknown parameter `ib_outs_mad_recv_entries' ib_mthca: Unknown symbol ib_unregister_mad_agent ib_mthca: Unknown symbol ib_post_send_mad ib_mthca: Unknown symbol ib_register_mad_agent The following odd difference in both machines' 'lspci -vv' output looks suspicious: Subsystem: Mellanox Technologies MT23108 InfiniHost HCA Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV+ VGASnoop- ParErr+ Stepping- SERR+ FastB2B- Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- References: <1124886726.24411.20.camel@bonnie.rzg.mpg.de> Message-ID: <20050824124615.GI20750@mellanox.co.il> Quoting r.
Christian Guggenberger : > Logs: > > ib_mad: Unknown parameter `ib_outs_mad_recv_entries' > ib_mthca: Unknown symbol ib_unregister_mad_agent > ib_mthca: Unknown symbol ib_post_send_mad > ib_mthca: Unknown symbol ib_register_mad_agent module dependencies should pick up all necessary modules Try rerunning make modules_install -- MST From christian.guggenberger at rzg.mpg.de Wed Aug 24 05:58:31 2005 From: christian.guggenberger at rzg.mpg.de (Christian Guggenberger) Date: Wed, 24 Aug 2005 14:58:31 +0200 Subject: [openib-general] Re: problems with loading ib_mthca In-Reply-To: <20050824124615.GI20750@mellanox.co.il> References: <1124886726.24411.20.camel@bonnie.rzg.mpg.de> <20050824124615.GI20750@mellanox.co.il> Message-ID: <1124888311.24411.31.camel@bonnie.rzg.mpg.de> On Wed, 2005-08-24 at 15:46 +0300, Michael S. Tsirkin wrote: > Quoting r. Christian Guggenberger : > > Logs: > > > > ib_mad: Unknown parameter `ib_outs_mad_recv_entries' > > ib_mthca: Unknown symbol ib_unregister_mad_agent > > ib_mthca: Unknown symbol ib_post_send_mad > > ib_mthca: Unknown symbol ib_register_mad_agent > > module dependencies should pick up all necessary modules > Try rerunning make modules_install > The faulty machine had a leftover config file from IBGD (modprobe-openib.conf). I forgot to delete the reference in /etc/modprobe.conf . Now the hca driver loads. thanks for your hint, though. cheers. - Christian From shubbell at dbresearch.net Wed Aug 24 06:24:28 2005 From: shubbell at dbresearch.net (Sean Hubbell) Date: Wed, 24 Aug 2005 09:24:28 -0400 Subject: [openib-general] Question on the best approach to debug an infiniband connection problem Message-ID: <430C750C.9050301@dbresearch.net> Hello, I was wondering if there is a "best practices" method to debug a possible infiniband connection. I am currently trying to send a message over infiniband ib0 interface and I continue to get transmit errors. 
Minus going through and seeing if the port state is active, I am at a loss to find out what the problem is. I did notice a lot of errors in the /var/log/osm.log which I have listed below for today: Aug 24 08:19:10 [42FFF960] -> osm_report_notice: Reporting Generic Notice type:3 num:67 from LID:0x0001 GID:0xfe80000000000000,0x0005ad000003d269 Aug 24 08:19:10 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:19:10 [42FFF960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7. Aug 24 08:19:10 [42FFF960] -> osm_vendor_send: RMPP 0 length 256 Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:19:14 [42FFF960] -> osm_report_notice: Reporting Generic Notice type:3 num:67 from LID:0x0001 GID:0xfe80000000000000,0x0005ad000003d269 Aug 24 08:19:14 [42FFF960] -> osm_report_notice: Reporting Generic Notice type:3 num:67 from LID:0x0001 GID:0xfe80000000000000,0x0005ad000003d269 Aug 24 08:19:16 [447FF960] -> umad_receiver: recv error Interrupted system call Aug 24 08:22:05 [AB441140] -> OpenSM Rev:openib-1.0.0 Aug 24 08:22:05 [AB441140] -> osm_opensm_init: Forcing single threaded dispatcher. Aug 24 08:22:05 [AB441140] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 Aug 24 08:22:05 [AB441140] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 Aug 24 08:22:05 [AB441140] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x5ad000003d269) as the default port. Aug 24 08:22:05 [AB441140] -> osm_vendor_bind: Binding to port 0x5ad000003d269. 
Aug 24 08:22:05 [AB441140] -> osm_vendor_bind: Binding to port 0x5ad000003d269. Aug 24 08:22:05 [42FFF960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0002 TID:0x0000000000000000 Aug 24 08:22:05 [42FFF960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0002 GID:0xfe80000000000000,0x0002c9010bec5320 Aug 24 08:22:06 [42FFF960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000000 Aug 24 08:22:06 [42FFF960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0005ad000003d269 Aug 24 08:22:12 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:22:12 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:22:12 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Thanks for any and all guidance in advance, Sean From eitan at mellanox.co.il Wed Aug 24 07:15:56 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 24 Aug 2005 17:15:56 +0300 Subject: [openib-general] User level packages make, install and release Message-ID: <430C811C.3040405@mellanox.co.il> Hi I had the pleasure to study the autotools environment for the last 2 weeks or so in the process of making OpenSM make and install behave. What I learned is summarized below. I propose we take these as guidelines and work towards a standard way the entire user level code make and install operates. Any comments are welcome. Eitan A proposal for user level tree autotools install and distribution: ------------------------------------------------------------------ 1. It is possible to provide a hierarchical autotools environment such that running the autogen.sh, configure, make and make install at the top most directory will actually run the process on the entire tree below. (autogen.sh should be made hierarchical) 2. 
It is possible to support selection of which sub-packages are installed by using the --enable- or --disable- flags.

3. When building a full hierarchical tree of autotools packages it is very useful to use the configure --cache-file option to speed up the process and avoid double checking of the dependencies at each sub project.

4. Distribution of autotools projects can be made simple by following the procedure:
   a. At distribution time check out the source tree
   b. Run autogen.sh
   c. Run configure
   d. Run make dist - create a tar file named -.tgz
   e. During installation untar the file
   f. ./configure, make, make install

   This procedure removes the dependency on autotools during installation. One can script steps a to d above and get a simple "distribution" system for the user level tree sources.

   It might be useful to provide a "release number" to the entire release. Autotools track that revision string in the VERSION variable. Defining VERSION is optional; it is done by passing it to the AC_INIT macro. A smart distribution script can obtain the as an input and apply it to the AC_INIT automatically.

5. Shared library versioning:

5.1 Library Versioning as used by LD:
Library versioning is different from release versioning, since the library primary and secondary versions are interpreted by the dynamic linker and thus provide an API version definition rather than release version info. API versions should be manually tracked and provided to libtool through the -version-info ::age option. Where:
   is the major version of the API. Changing every time the API changes.
   is the serial number (also called minor) which is advanced any time a change is made to the implementation of the API.
   is the number of major versions that the current version is backward compatible with.
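
To make the effect of the three numbers concrete, here is a minimal sketch of how libtool turns them into file names on Linux. The libtool manual calls the fields current:revision:age; on Linux the soname carries (current - age) and the real file name appends age and revision. The struct and function names, and the library name "foo" used in the note below, are made up for illustration and are not part of the proposal.

```c
#include <stdio.h>

/* The -version-info triple, using the field names from the libtool manual. */
struct lt_version {
	unsigned current;	/* API (interface) number */
	unsigned revision;	/* implementation serial number */
	unsigned age;		/* how many older interfaces are still supported */
};

/*
 * Hypothetical helper: format the names libtool produces on Linux.
 *   soname: lib<name>.so.<current - age>
 *   file:   lib<name>.so.<current - age>.<age>.<revision>
 * Returns 0 on success, -1 if age > current (an invalid triple) or if
 * either output buffer is too small.
 */
int lt_linux_names(const char *name, struct lt_version v,
		   char *soname, size_t soname_len,
		   char *file, size_t file_len)
{
	unsigned major;
	int n;

	if (v.age > v.current)
		return -1;
	major = v.current - v.age;

	n = snprintf(soname, soname_len, "lib%s.so.%u", name, major);
	if (n < 0 || (size_t)n >= soname_len)
		return -1;

	n = snprintf(file, file_len, "lib%s.so.%u.%u.%u",
		     name, major, v.age, v.revision);
	if (n < 0 || (size_t)n >= file_len)
		return -1;

	return 0;
}
```

With -version-info 3:2:1 and a library called "foo", this yields the soname libfoo.so.2 and the file libfoo.so.2.1.2, which is why an application linked against interface 2 keeps working after interface 3 is added.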
This methodology is described by the libtool manuals on the web:
http://www.gnu.org/software/libtool/manual.html#Libtool-versioning

My proposal is to use a special file named lib.ver to track this info. The configure.in should parse this file and define the variable to be used during the libtool invocation. When running an automatic distribution flow as described in the previous section one can automate updating the lib.ver based on the change set number.

5.2 Controlling the exported API for various major versions:
LD actually provides even finer control of the exported symbols for each major version of the lib, such that at link time a library can expose a different set of APIs based on the required API version. To use this feature we need to provide a file defining the API to the linker using the -version-script. I propose to name the file lib.map. Please see the following link for more info:
http://www.gnu.org/software/binutils/manual/ld-2.9.1/html_node/ld_25.html

5.3 Tracking Release number of shared libs once the numbering is used by the LD API version marking.
The libtool -release option should have been the best choice. This option renames the shared library file according to the following format: lib-.so... while it also provides a link with no revision: lib.so -> lib-.so. This looks very good but libtool is currently broken, since it modifies the soname to - (the name of the shared object by which LD searches for matching libs during runtime). The implication of this change is that any application linked to the - lib will not be able to use any later release of the library, since its soname will be different. Several mails and even a patch to libtool were posted regarding this limitation, but the common libtool still does not fix this issue.

My proposal is to avoid the usage of the -release flag and instead create the following symbolic link which should act as a "property mark": lib-.so.->lib.so...
Traversing the link one can answer the question "which library version was released by in release ?", and the reverse one.

From eitan at mellanox.co.il Wed Aug 24 07:20:07 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 24 Aug 2005 17:20:07 +0300
Subject: [openib-general] How to change NodeDescription
In-Reply-To: <30280207$1124878682430c495a0b75a3.09921159@config5.schlund.de>
References: <30280207$1124878682430c495a0b75a3.09921159@config5.schlund.de>
Message-ID: <430C8217.1000206@mellanox.co.il>

Steve Wooding wrote:
> Hi,
>
> Is it possible with the gen2 driver to set the NodeDescription for each
> host. I have a machine which is still running gen1 drivers and the
> NodeDescription is set to the hostname. I guess this is done when the
> driver is loaded at boot time (but couldn't find out exactly how this
> is done).

The gen1 implementation sniffed the incoming SubnetManagement.Get(NodeDesc) in the sma.c and appended to the result a string that looked like: HCA-. I think it was a very useful feature. I wonder, if we provided a patch, would it be accepted?

>
> The NodeDescription on gen2 is "MT25208 InfiniHostEx Mellanox
> Technologies", which looks like it was formed by looking at lspci or
> something. I would like to change it to hostname plus some other custom
> attribute (e.g. type of node).
>
> Thanks,
>
> Steve.
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From tom at ammasso.com Wed Aug 24 07:32:21 2005
From: tom at ammasso.com (Tom Tucker)
Date: Wed, 24 Aug 2005 10:32:21 -0400
Subject: [openib-general] RDMA connection and address translation API
Message-ID: <8E9D028761D8264D910612167E8457E8FA3AF1@mail2.ammasso.com>

Roland:

Steve and I came to the same conclusion on the airplane ride back to Austin.
Whereas plain old TCP/IP selects a device at the bottom of the stack, RDMA transports must select the device at the top because pre-connect resources must be allocated and these resources are associated with a particular device.

I think you've absolutely nailed the active side (by the way, I think the ib_at_route_by_ip service already performs the necessary routing function).

The listen side, however, I think needs a little tweaking. It would be beneficial if the client can specify either an IP address and port to listen on (effectively selecting a particular device), or a wild card (all RDMA devices). An NFS server is an example of the latter. This is trivial to do by providing an address to the listen call where a '0' represents a wild card.

> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier
> Sent: Wednesday, August 24, 2005 12:07 AM
> To: openib-general at openib.org
> Subject: [openib-general] RDMA connection and address translation API
>
> At the OpenIB workshop on Monday, we had some discussion
> about a high-level transport-neutral API for connection
> handling. After giving the topic some more thought, I've
> come to the conclusion that neither the kDAPL API nor the new
> API that was presented are usable.
> In this email, I'll try to detail my reasoning and sketch
> what I believe is the correct API.
>
> The new API that we looked at was essentially the following
> (I'm recreating this from memory, so I apologize if I
> misrepresent it):
>
> listen(local_ip_address, service_id, listen_callback)
> connect(local_qp, remote_ip_address, qos, service_id,
> private_data, connect_callback)
>
> We already discussed the problem with having the listen
> callback pass the consumer a remote source address -- doing
> this requires the connection handling module to do an ATS
> reverse lookup in the IB case, which the consumer might not
> want.
I think there's agreement that the correct thing here > is for the listen callback to pass a transport address to the > consumer and provide a function that the consumer can call to > perform an ATS reverse lookup if desired. This isn't a major > problem and can be dealt with. > > However, there's another problem with trying to lump address > translation and connection into a single "connect" call, and > this problem looks fundamental and fatal to me. The connect > call takes a QP pointer, but to create a QP the consumer > needs to know which local device to use. However, the > consumer doesn't know which device to use until the > destination address has been resolved to a route, including a > local interface. > > As far as I can tell, kDAPL punts on this and simply requires > the consumer to handle the route lookup itself before calling > dat_ep_connect(). It seems that current kDAPL consumers > similarly punt on this issue: the iSER initiator and the > NFS-RDMA client both just use a single device which is > statically discovered at init time. > > It seems that the kDAPL connection model has a serious flaw, > in that it pushes the complexity of route lookup into the > consumer. Further, we have strong evidence that this routing > code is hard to write and that consumers will just ignore > this complexity and hard-code solutions that don't work under > all configurations. > > With this in mind, I believe that the connection API needs to > be something more like the following: > > rdma_resolve_address(): > inputs: dest IP address, qos, npaths, > done callback, opaque context > done callback params: status, local RDMA device, > RDMA transport address, context > > This function starts the process of resolving an IP address to > an RDMA device and address. When the resolution is complete, > the callback is called with a status. 
If the status is > "success" then the callback also gets the device pointer and > transport address (as well as the original context that the > consumer passed in). > > The "RDMA transport address" type is a union containing > transport-dependent data. In the IB case, it's all of the > SGID, DGID, SLID, DLID, SL etc. that we know and love. In the > iWARP case, it's the source IP, destination IP and QOS. > > npaths can be either 1 or 2 in the IB case; if it's 2, then > the resolver will try to find a primary and alternate path for > APM. In the iWARP case, I guess npaths will always be 1, and > I guess anyone who wants to use iWARP over multihomed SCTP > will probably have to use some lower-level API. > > By the way, we may also have to have the option of passing in > a local netdev so that we can handle link-local IPv6 > addresses. There may be other cases I haven't thought of yet. > I just hope we can avoid going all the way to the horror of > the getaddrinfo() API. > > I also hope we can agree to use IPoIB ARP to resolve the > address in the IB case; having a flag or some other hack in > the API to expose the option of ATS seems unacceptably ugly. > > rdma_connect(): > inputs: local QP, RDMA transport address, destination service, > private data, timeout, event callback, opaque context > > This function takes the resolved address and actually > connects. > > I'm not sure how we want to abstract the IB service vs. iWARP > TCP port number difference. I guess it's OK to have iWARP > consumers stick their (16-bit) port number in a 64-bit > parameter, even if it's not the prettiest API. > > To head off the knee-jerk objection: this API does NOT > require any transport-specific code in consumers (unless a > particular consumer WANTS to look inside the RDMA transport > address). Code to connect would be as simple as: > > rdma_resolve_address(...); > /* wait for resolution */ > ib_create_qp(...) 
/* use device pointer we got from > rdma_resolve_address() */ > rdma_connect(...); /* pass transport address we got from > rdma_resolve_address() */ > /* wait for connection to finish... */ > > The listen side is even simpler: > > rdma_listen(): > inputs: local service, event callback, consumer context > > Wait for connection requests and pass events to the consumer's > callback. I'm not sure if/home we want to support binding to > a particular IP address. The current IB CM in Linux doesn't > support binding a listen to a single device or port, and even > if it did it's not clear how to handle binding to one IP > address when a port has more than one IP. > > I guess the event callback would receive a device pointer and > the same RDMA transport address union I talked about above > when discussing address resolution. > > It would be possible to have another function like > rdma_getpeername() that takes the transport address and > returns a source IP address. In the IB case this would do an > ATS reverse lookup. However, I hate this idea. iSER already > uses the CM private data to pass the source IP in the IB case, > and I would much rather fix NFS/RDMA to do the same thing (so > we can just kill ATS as an address resolution method). > > - R. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Wed Aug 24 07:42:15 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 24 Aug 2005 17:42:15 +0300 Subject: [openib-general] Re: How to change NodeDescription In-Reply-To: <430C8217.1000206@mellanox.co.il> References: <30280207$1124878682430c495a0b75a3.09921159@config5.schlund.de> <430C8217.1000206@mellanox.co.il> Message-ID: <20050824144215.GL20750@mellanox.co.il> Quoting r. 
Eitan Zahavi : > Subject: Re: How to change NodeDescription > > Steve Wooding wrote: > >Hi, > > > >Is it possible with the gen2 driver to set the NodeDescription for each > >host. I have a machine which is still running gen1 drivers and the > >NodeDescription is set to the hostname. I guess this is done when the > >driver is loaded at boot time (but couldn't find out exactly how this > >is done). > > The gen1 implementation sniffed the ingoing > SubnetManagement.Get(NodeDesc) in the sma.c and replaced appended to the > results a string the looked like: HCA-. > I think it was very useful feature. Sounds good. Making node description visible, and editable, through /sys/class/infiniband/mthcaX/node_desc is probably the way to do it. The user would then be able to write whatever he wants there. -- MST From mst at mellanox.co.il Wed Aug 24 07:52:48 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 24 Aug 2005 17:52:48 +0300 Subject: [openib-general] Re: RDMA connection and address translation API In-Reply-To: <52y86r2a9w.fsf@cisco.com> References: <52y86r2a9w.fsf@cisco.com> Message-ID: <20050824145248.GM20750@mellanox.co.il> Hi, Roland! The more I think about your proposal, the more it makes sense to me. One comment: Quoting r. Roland Dreier : > The listen side is even simpler: > > rdma_listen(): > inputs: local service, event callback, consumer context > > Wait for connection requests and pass events to the consumer's > callback. I'm not sure if/home we want to support binding to > a particular IP address. The current IB CM in Linux doesn't > support binding a listen to a single device or port, and even > if it did it's not clear how to handle binding to one IP > address when a port has more than one IP. Interesting. The current CM in linux is global, but it seems there's no problem at least for the simple case with one IP per port. Generally, cant we solve it by filtering? 
Get the remote IP address from CM private data, and find through which gateway it's accessed? Note that clients like NFS/RDMA seem to want to re-resolve the IP by ARP anyway, for verification purposes. -- MST From swise at ammasso.com Wed Aug 24 07:59:12 2005 From: swise at ammasso.com (Steve Wise) Date: Wed, 24 Aug 2005 09:59:12 -0500 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <52y86r2a9w.fsf@cisco.com> Message-ID: Roland, this looks good! A few comments below... > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Wednesday, August 24, 2005 12:07 AM > To: openib-general at openib.org > Subject: [openib-general] RDMA connection and address translation API > > At the OpenIB workshop on Monday, we had some discussion about a > high-level transport-neutral API for connection handling. After > giving the topic some more thought, I've come to the conclusion that > neither the kDAPL API nor the new API that was presented are usable. > In this email, I'll try to detail my reasoning and sketch what I > believe is the correct API. > > The new API that we looked at was essentially the following (I'm > recreating this from memory, so I apologize if I misrepresent it): > > listen(local_ip_address, service_id, listen_callback) > connect(local_qp, remote_ip_address, qos, service_id, > private_data, connect_callback) > > We already discussed the problem with having the listen callback pass > the consumer a remote source address -- doing this requires the > connection handling module to do an ATS reverse lookup in the IB case, > which the consumer might not want. I think there's agreement that the > correct thing here is for the listen callback to pass a transport > address to the consumer and provide a function that the consumer can > call to perform an ATS reverse lookup if desired. This isn't a major > problem and can be dealt with. 
> > However, there's another problem with trying to lump address > translation and connection into a single "connect" call, and this > problem looks fundamental and fatal to me. The connect call takes a > QP pointer, but to create a QP the consumer needs to know which local > device to use. However, the consumer doesn't know which device to use > until the destination address has been resolved to a route, including > a local interface. > > As far as I can tell, kDAPL punts on this and simply requires the > consumer to handle the route lookup itself before calling > dat_ep_connect(). It seems that current kDAPL consumers similarly > punt on this issue: the iSER initiator and the NFS-RDMA client both > just use a single device which is statically discovered at init time. > Yes, DAPL punts on this. > It seems that the kDAPL connection model has a serious flaw, in that > it pushes the complexity of route lookup into the consumer. Further, > we have strong evidence that this routing code is hard to write and > that consumers will just ignore this complexity and hard-code > solutions that don't work under all configurations. > I agree! > With this in mind, I believe that the connection API needs to be > something more like the following: > > rdma_resolve_address(): > inputs: dest IP address, qos, npaths, > done callback, opaque context > done callback params: status, local RDMA device, > RDMA transport address, context > > This function starts the process of resolving an IP address to > an RDMA device and address. When the resolution is complete, > the callback is called with a status. If the status is > "success" then the callback also gets the device pointer and > transport address (as well as the original context that the > consumer passed in). > > The "RDMA transport address" type is a union containing > transport-dependent data. In the IB case, it's all of the > SGID, DGID, SLID, DLID, SL etc. that we know and love. 
In the > iWARP case, it's the source IP, destination IP and QOS. > > npaths can be either 1 or 2 in the IB case; if it's 2, then > the resolver will try to find a primary and alternate path for > APM. In the iWARP case, I guess npaths will always be 1, and > I guess anyone who wants to use iWARP over multihomed SCTP > will probably have to use some lower-level API. > > By the way, we may also have to have the option of passing in > a local netdev so that we can handle link-local IPv6 > addresses. There may be other cases I haven't thought of yet. > I just hope we can avoid going all the way to the horror of > the getaddrinfo() API. > > I also hope we can agree to use IPoIB ARP to resolve the > address in the IB case; having a flag or some other hack in > the API to expose the option of ATS seems unacceptably ugly. > > rdma_connect(): > inputs: local QP, RDMA transport address, destination service, > private data, timeout, event callback, opaque context > > This function takes the resolved address and actually > connects. > > I'm not sure how we want to abstract the IB service vs. iWARP > TCP port number difference. I guess it's OK to have iWARP > consumers stick their (16-bit) port number in a 64-bit > parameter, even if it's not the prettiest API. > > To head off the knee-jerk objection: this API does NOT require any > transport-specific code in consumers (unless a particular consumer > WANTS to look inside the RDMA transport address). Code to connect > would be as simple as: > > rdma_resolve_address(...); > /* wait for resolution */ > ib_create_qp(...) /* use device pointer we got from > rdma_resolve_address() */ > rdma_connect(...); /* pass transport address we got from > rdma_resolve_address() */ > /* wait for connection to finish... */ > > The listen side is even simpler: > > rdma_listen(): > inputs: local service, event callback, consumer context > > Wait for connection requests and pass events to the consumer's > callback. 
I'm not sure if/how we want to support binding to
> a particular IP address. The current IB CM in Linux doesn't
> support binding a listen to a single device or port, and even
> if it did it's not clear how to handle binding to one IP
> address when a port has more than one IP.
>
> I guess the event callback would receive a device pointer and
> the same RDMA transport address union I talked about above
> when discussing address resolution.
>
> It would be possible to have another function like
> rdma_getpeername() that takes the transport address and
> returns a source IP address. In the IB case this would do an
> ATS reverse lookup. However, I hate this idea. iSER already
> uses the CM private data to pass the source IP in the IB case,
> and I would much rather fix NFS/RDMA to do the same thing (so
> we can just kill ATS as an address resolution method).
>

I think we should allow a ULP to listen on a specific IP address and device; this is often done in servers to limit the scope of the service. I was thinking such a ULP would simply walk the list of ib devices and issue an rdma_listen() on each device. But the ULP should also be able to pass down the local ipaddr upon which the device should listen. Consider an NFS/RDMA server ULP that has exports that are limited to a single subnet or interface, etc...

From guyg at voltaire.com Wed Aug 24 07:49:51 2005
From: guyg at voltaire.com (Guy German)
Date: Wed, 24 Aug 2005 14:49:51 +0000
Subject: [openib-general] RDMA connection and address translation API
Message-ID: <1124894991.5991.13.camel@r2d2>

Hi Tom,

> I think you've absolutely nailed the active side (by the way, I think
> the ib_at_route_by_ip service already performs the necessary routing
> function)

You are right - it does (in resolve_ip). That's why I believe there is no need for the consumer to use an async callback for the at part (especially when the transport is iwarp). Did you have a look at our suggestion (use ib_cma_get_device) ?
http://openib.org/pipermail/openib-general/2005-August/010151.html http://openib.org/pipermail/openib-general/2005-August/010154.html Thanks, Guy. From tom at ammasso.com Wed Aug 24 08:07:20 2005 From: tom at ammasso.com (Tom Tucker) Date: Wed, 24 Aug 2005 11:07:20 -0400 Subject: [openib-general] Connection Manager Abstraction proposition -header file Message-ID: <8E9D028761D8264D910612167E8457E8FA3AFE@mail2.ammasso.com> Guy: I think we're on the right track. Regarding the 'get-device' function, the ib_at_route_by_ip service already returns a data structure that includes the device pointer. This data structure also includes the local and next hop remote IP addresses which is a good thing. First, it avoids requiring the connect method to look this data up again. For IB, these IP addresses are first converted to GID/LID with AT and then to a path record. For iWARP, these addresses are fine as is, and avoid having the connect method do a second lookup to find the next hop given the remote ip address. In other words, pass the ib_at_ib_route structure to the connect method along with the remote IP address to allow this info to be reused. With regard to ib_cma_id, what is the iwarp_id? Is this the remote port? With regard to listen, it should not require a device pointer because the app may in fact be listening on multiple devices. Also, the service_id does not provide enough information to determine which devices to listen on. It should be local ip (could be 0 for wild-card), and local port (service id). The listen callback function needs to know the device on which the connection request was received. This could be included in the ib_cma_id structure. 
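The listen semantics suggested above (a local IP that may be 0 for wild-card, plus a local port) imply a small device-selection step before any per-device listen is issued. What follows is only an illustration under stated assumptions: demo_rdma_if, demo_ip(), DEMO_IP_ANY and demo_select_listen_ifs() are hypothetical names, not part of the proposed ib_cma header or of any OpenIB API, and addresses are plain IPv4 values kept in host byte order for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one RDMA-capable interface (not an OpenIB type). */
struct demo_rdma_if {
	const char *dev_name;	/* illustrative device name */
	uint32_t local_ip;	/* IPv4 address, host byte order */
};

#define DEMO_IP_ANY 0u		/* the '0' wild-card from the discussion */

/* Pack four octets into a host-order IPv4 value. */
static uint32_t demo_ip(unsigned a, unsigned b, unsigned c, unsigned d)
{
	return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
	       ((uint32_t)c << 8) | (uint32_t)d;
}

/*
 * Collect the interfaces that a listen on listen_ip should cover.
 * A wild-card selects every interface; a specific address selects
 * every interface configured with that address.
 * Returns the number of entries stored in out (at most max_out).
 */
static size_t demo_select_listen_ifs(const struct demo_rdma_if *ifs,
				     size_t n_ifs, uint32_t listen_ip,
				     const struct demo_rdma_if **out,
				     size_t max_out)
{
	size_t i, n = 0;

	assert(ifs != NULL && out != NULL);
	for (i = 0; i < n_ifs && n < max_out; i++) {
		if (listen_ip == DEMO_IP_ANY || ifs[i].local_ip == listen_ip)
			out[n++] = &ifs[i];
	}
	return n;
}
```

With a wild-card listen every interface is selected; with a specific address only the interfaces carrying it are selected. Either way the result can contain more than one device, which is why the listen callback would need to report the device on which a request arrived.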
> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Guy German
> Sent: Wednesday, August 24, 2005 7:21 AM
> To: openib-general at openib.org
> Subject: [openib-general] Connection Manager Abstraction
> proposition -header file
>
> /*
>  * Copyright (c) 2005 Voltaire Inc.  All rights reserved.
>  *
>  * This Software is licensed under one of the following licenses:
>  *
>  * 1) under the terms of the "Common Public License 1.0" a copy of which is
>  *    available from the Open Source Initiative, see
>  *    http://www.opensource.org/licenses/cpl.php.
>  *
>  * 2) under the terms of the "The BSD License" a copy of which is
>  *    available from the Open Source Initiative, see
>  *    http://www.opensource.org/licenses/bsd-license.php.
>  *
>  * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
>  *    copy of which is available from the Open Source Initiative, see
>  *    http://www.opensource.org/licenses/gpl-license.php.
>  *
>  * Licensee has the right to choose one of the above licenses.
>  *
>  * Redistributions of source code must retain the above copyright
>  * notice and one of the license notices.
>  *
>  * Redistributions in binary form must reproduce both the above copyright
>  * notice, one of the license notices in the documentation
>  * and/or other materials provided with the distribution.
>  *
>  */
>
> /*
>  * This header file is a preliminary proposition for a connection manager
>  * abstraction layer (cma) for IB and iwarp
>  * - there is an assumption that iwarp uses the same openib qp terminology in
>  *   the rest of the verbs, and the only place that needs abstraction is the cm.
>  * - This proposition assumes that the address translation is done in the cma
>  *   layer.
>  * - The cma also modifies the qp states to init/rtr/rts and error as needed.
>  * - for calling accept/reject or disconnect on the passive side you need to
>  *   use the cma handle received in the ib_cma_listen cb.
>  * - cma_id is created when calling connect or listen and destroyed when
>  *   receiving disconnected/rejected/unreachable events on either the active
>  *   side (connect cb) or the passive side (accept cb)
>  */
>
> #ifndef IB_CMA_H
> #define IB_CMA_H
>
> #include
>
> enum ib_cma_event {
> 	IB_CMA_EVENT_ESTABLISHED,
> 	IB_CMA_EVENT_REJECTED,
> 	IB_CMA_EVENT_DISCONNECTED,
> 	IB_CMA_EVENT_UNREACHABLE
> };
>
> enum ib_qos {
> 	IB_QOS_BEST_EFFORT = 0,
> 	IB_QOS_HIGH_THROUGHPUT = (1 << 0),
> 	IB_QOS_LOW_LATENCY = (1 << 1),
> 	IB_QOS_ECONOMY = (1 << 2),
> 	IB_QOS_PREMIUM = (1 << 3)
> };
>
> enum ib_connect_flags {
> 	IB_CONNECT_DEFAULT_FLAG = 0x00,
> 	IB_CONNECT_MULTIPATH_FLAG = 0x01
> };
>
> /*
>  * for ib_cma_get_src_ip - ib_cma_id will have to include
>  * the path data received in the request handler
>  */
> union ib_cma_id{
> 	struct ib_cm_id *cm_id;
> 	u32 iwarp_id;
> };
>
> typedef void (*ib_cma_rarp_handler)(struct sockaddr *src_ip,
> 				    void *context);
> typedef void (*ib_cma_ac_handler)(enum ib_cma_event event,
> 				  void *context);
> typedef void (*ib_cma_event_handler)(enum ib_cma_event event,
> 				     void *context,
> 				     void *private_data);
> typedef void (*ib_cma_listen_handler)(union ib_cma_id *cma_id,
> 				      void *private_data, void *context);
>
> struct ib_cma_conn {
> 	struct ib_qp *qp;
> 	struct ib_qp_attr *qp_attr;
> 	struct sockaddr *dst_ip;
> 	__be64 service_id;
> 	void *context;
> 	ib_cma_event_handler cma_event_handler;
> 	const void *private_data;
> 	u8 private_data_len;
> 	u32 timeout;
> 	enum ib_qos qos;
> 	enum ib_connect_flags connect_flags;
> };
>
>
> /**
>  * ib_cma_get_device - Returns the device to be used according to
>  * the destination ip address (this can be determined according
>  * to the local routing table). Call this function before
>  * creating the qp. If using link-local IPv6 addresses
>  * @remote_address: The destination address for connection
>  * @device: The device to use (returned by the function)
>  */
> int ib_cma_get_device(struct sockaddr *remote_address,
> 		      struct ib_device **device);
>
>
> /**
>  * ib_cma_connect - this is the connect request function, called by
>  * the active side. The consumer registers an upcall that will be
>  * initiated by the cma with an appropriate connection event
>  * notification (established/rejected/disconnected etc)
>  * @cma_conn: This structure contains the following connection parameters:
>  *   @qp: qp for establishing the connection
>  *   @qp_attr: only relevant attributes are used
>  *   @dst_ip: destination ip address
>  *   @service_id: destination service id (port)
>  *   @context: context to be returned in the callback
>  *   @cma_event_handler: the upcall function for the active side
>  *   @private_data: private data to be received at the listener upcall
>  *   @private_data_len: private data length (max 255)
>  *   @timeout:
>  *   @qos: Quality of service for the rc
>  *   @connect_flags: default or multipath connection
>  * @cma_id: This returned handle is a union (different in ib and iwarp)
>  *   in ib - it is the cm_id.
>  */
> int ib_cma_connect(struct ib_cma_conn *cma_conn,
> 		   union ib_cma_id *cma_id);
>
>
> /**
>  * ib_cma_disconnect - this function disconnects the rc. It can be
>  * called by either the passive or active side
>  * @qp: the connected qp to disconnect
>  * @cma_id: On the active side - this handle is the one returned
>  *   when ib_cma_connect was called.
>  *   On the passive side - this handle was received in the cma_listen callback
>  */
> int ib_cma_disconnect(struct ib_qp *qp, union ib_cma_id *cma_id);
>
>
> /**
>  * ib_cma_sid_listen - this function is called by the passive side. It
>  * listens on the specified port (ib service id) for incoming
>  * connection requests
>  * @device: ? need to resolve this issue
>  * @service_id: service id (port) to listen on
>  * @context: user context to be returned in the callback
>  * @cm_listen_handler: the listen callback
>  * @cma_id: cma handle for the passive side
>  */
> int ib_cma_sid_listen(struct ib_device *device, __be64 service_id,
> 		      void *context, ib_cma_listen_handler cm_listen_handler,
> 		      union ib_cma_id *cma_id);
>
>
> /**
>  * ib_cma_sid_destroy - this function is called on the passive side to
>  * stop listening on a certain service id
>  * @cma_id: the same cma handle received when ib_cma_sid_listen was called
>  */
> int ib_cma_sid_destroy(union ib_cma_id *cma_id);
>
>
> /**
>  * ib_cma_accept - call on the passive side to accept a connection request
>  * @cma_id: this handle was received in the cma_listen callback
>  * @qp: the connection's qp
>  * @private_data: private data to send back to the initiator
>  * @private_data_len: private data length
>  * @context: user context to be returned in the callback
>  * @cm_accept_handler: the cma accept callback - triggered when RTU ack
>  *   received
>  */
> int ib_cma_accept(union ib_cma_id *cma_id, struct ib_qp *qp,
> 		  const void *private_data, u8 private_data_len,
> 		  void *context, ib_cma_ac_handler cm_accept_handler);
>
> /**
>  * ib_cma_reject - call on the passive side to reject a connection request.
>  * This call destroys the cma_id, hence when the active side receives
>  * the reject the cma_id is already destroyed.
>  * @cma_id: this handle was received in the cma_listen callback
>  * @private_data: private data to send back to the initiator
>  * @private_data_len: private data length
>  */
> int ib_cma_reject(union ib_cma_id *cma_id, const void *private_data,
> 		  u8 private_data_len);
>
>
> /**
>  * ib_cma_get_src_ip - this function performs "rarp", asynchronously
>  * from cma_id to src ip
>  * @cma_id: the cma_id will have to include the path data received
>  *   in the request handler
>  * @src_ip: source ip of the initiator
>  */
> int ib_cma_get_src_ip(union ib_cma_id *cma_id,
> 		      ib_cma_rarp_handler rarp_handler,
> 		      void *context);
>
> #endif /* IB_CMA_H */
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>
From guyg at voltaire.com Wed Aug 24 08:04:38 2005 From: guyg at voltaire.com (Guy German) Date: Wed, 24 Aug 2005 18:04:38 +0300 Subject: [openib-general] Connection Manager Abstraction proposition -header file In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3AFE@mail2.ammasso.com> References: <8E9D028761D8264D910612167E8457E8FA3AFE@mail2.ammasso.com> Message-ID: <1124895878.5991.25.camel@r2d2> Hi Tom, On Wed, 2005-08-24 at 11:07 -0400, Tom Tucker wrote: > Guy: > > > I think we're on the right track. Regarding the 'get-device' function, > the ib_at_route_by_ip service already returns a data structure that > includes the device pointer. This data structure also includes the local > > and next hop remote IP addresses which is a good thing. First, it avoids > > requiring the connect method to look this data up again. For IB, these > IP addresses are first converted to GID/LID with AT and then to a path > record. For iWARP, these addresses are fine as is, and avoid having the > connect method do a second lookup to find the next hop given the remote > ip address.
In other words, pass the ib_at_ib_route structure to the > connect method along with the remote IP address to allow this info to > be reused. You are right about the functionality of ib_at_route_by_ip, but I don't think you want to implement at.c for iwarp. My thoughts were that one abstraction layer is enough. In ib it can be mapped to at module calls, and in iwarp to a different implementation. > With regard to ib_cma_id, what is the iwarp_id? Is this the remote port? You know iwarp needs better then me, it's the information you need to keep in order to open and close the connection. > > With regard to listen, it should not require a device pointer because > the > app may in fact be listening on multiple devices. Also, the service_id > does not provide enough information to determine which devices to listen > on. It should be local ip (could be 0 for wild-card), and local port > (service id). I agree that this part needs to be more thought of. If you look at the documentation - I left a "?" near ib_device. > > The listen callback function needs to know the device on which the > connection request was received. This could be included in the > ib_cma_id structure. Yes. This is part of the same issue as the previous one. Thanks, Guy.
From rolandd at cisco.com Wed Aug 24 08:22:14 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 08:22:14 -0700 Subject: [openib-general] Re: How to change NodeDescription In-Reply-To: <20050824144215.GL20750@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 24 Aug 2005 17:42:15 +0300") References: <30280207$1124878682430c495a0b75a3.09921159@config5.schlund.de> <430C8217.1000206@mellanox.co.il> <20050824144215.GL20750@mellanox.co.il> Message-ID: <52u0hf1hsp.fsf@cisco.com> Michael> Sounds good. Making node description visible, and Michael> editable, through /sys/class/infiniband/mthcaX/node_desc Michael> is probably the way to do it. The user would then be Michael> able to write whatever he wants there. Yes, I've had exactly this on my to-do list for a while, but haven't had a chance to implement it yet. - R.
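If the node description were exposed as a writable sysfs attribute as suggested, setting it from user space would amount to an ordinary file write. A minimal sketch, assuming such a file exists at a path of the form quoted above (it was not yet implemented at the time of this message). demo_write_node_desc() and DEMO_NODE_DESC_MAX are hypothetical names, and the 64-byte limit reflects the size of the InfiniBand NodeDescription string rather than anything stated in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed limit: the IB NodeDescription string is 64 bytes long. */
#define DEMO_NODE_DESC_MAX 64

/*
 * Write desc to the attribute file at path.
 * Returns 0 on success, -1 if the string is too long or the write fails.
 */
static int demo_write_node_desc(const char *path, const char *desc)
{
	FILE *f;
	int ok;

	assert(path != NULL && desc != NULL);
	if (strlen(desc) > DEMO_NODE_DESC_MAX)
		return -1;

	f = fopen(path, "w");
	if (f == NULL)
		return -1;

	ok = (fputs(desc, f) >= 0);
	/* Buffered data is flushed on close, so a close error is a write error. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would pass the attribute path for its own device, for example the node_desc file under the device's /sys/class/infiniband directory, and would typically need root privileges to write it.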
From jlentini at netapp.com Wed Aug 24 08:48:30 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 24 Aug 2005 11:48:30 -0400 (EDT) Subject: [openib-general] Re: [PATCH] [uDAPL] update to new uCM API In-Reply-To: References: Message-ID: On Fri, 19 Aug 2005, Sean Hefty wrote: > This patch updates uDAPL to the new uCM API. It only fixes the build > issues at this point and does not try to optimize for the use of the > new API. That will come in a later patch. > > James, I can commit this when committing the uCM changes if that's okay. Looks good. From rolandd at cisco.com Wed Aug 24 09:03:48 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 09:03:48 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3AF1@mail2.ammasso.com> (Tom Tucker's message of "Wed, 24 Aug 2005 10:32:21 -0400") References: <8E9D028761D8264D910612167E8457E8FA3AF1@mail2.ammasso.com> Message-ID: <52pss31fvf.fsf@cisco.com> Tom> The listen side, however, I think needs a little tweaking. It Tom> would be beneficial if the client can specify either an IP Tom> address and port to listen on (effectively selecting a Tom> particular device), or a wild card (all RDMA devices). An NFS Tom> server is an example of the latter. This is trivial to do by Tom> providing an address to the listen call where a '0' Tom> represents a wild card. I agree that it's useful to be able to pass a sockaddr to bind a listen to (just like the bind() call in userspace). However, the problem is that in the IB world, an incoming connection request does not come with a destination IP address in any standard way. So I don't know the right way to implement bind() in the IB case. By the way, an IP address/port does not necessarily select a single RDMA device. It's a perfectly valid configuration to have 10 network interfaces all with the same local IP address. - R.
From rolandd at cisco.com Wed Aug 24 09:10:04 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 09:10:04 -0700 Subject: [openib-general] Re: ipoib oops In-Reply-To: <20050824092459.GA20750@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 24 Aug 2005 12:24:59 +0300") References: <20050824092459.GA20750@mellanox.co.il> Message-ID: <52ll2r1fkz.fsf@cisco.com> Michael> so apparently the problem is in accessing Michael> mcast->mcmember.mlid. Michael> And I wonder: what prevents the mcast object from being Michael> destroyed while a completion is outstanding? Hmm, I'm not sure that the problem is use-after-free. The oops is a null pointer dereference, which would seem to indicate that ipoib_mcast_join_finish() is being called with mcast == NULL. I don't see immediately what is causing that, though. - R. From tom at ammasso.com Wed Aug 24 09:13:09 2005 From: tom at ammasso.com (Tom Tucker) Date: Wed, 24 Aug 2005 12:13:09 -0400 Subject: [openib-general] RDMA connection and address translation API Message-ID: <8E9D028761D8264D910612167E8457E8FA3B0A@mail2.ammasso.com> > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, August 24, 2005 11:04 AM > To: Tom Tucker > Cc: Roland Dreier; openib-general at openib.org > Subject: Re: [openib-general] RDMA connection and address > translation API > > Tom> The listen side, however, I think needs a little tweaking. It > Tom> would be beneficial if the client can specify either an IP > Tom> address and port to listen on (effectively selecting a > Tom> particular device), or a wild card (all RDMA devices). An NFS > Tom> server is an example of the latter. This is trivial to do by > Tom> providing an address to the listen call where a '0' > Tom> represents a wild card. > > I agree that it's useful to be able to pass a sockaddr to > bind a listen to (just like the bind() call in userspace).
> However, the problem is that in the IB world, an incoming > connection request does not come with a destination IP > address in any standard way. So I don't know the right way > to implement bind() in the IB case. I think I understand, but the purpose of specifying the IP address in the listen is not to filter incoming connect requests, but rather to determine which devices I listen on. I think this works for the IB case as well. So the utility of the IP address specified in the listen is only to determine which devices the sid is created on. Does this make sense or am I missing something? > > By the way, an IP address/port does not necessarily select a > single RDMA device. It's a perfectly valid configuration to > have 10 network interfaces all with the same local IP address. > > - R. > Yes, and in this case, all devices with the same IP address would end up listening in the same way that specifying a wildcard (0) would result in multiple devices listening. From rolandd at cisco.com Wed Aug 24 09:26:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 09:26:42 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B0A@mail2.ammasso.com> (Tom Tucker's message of "Wed, 24 Aug 2005 12:13:09 -0400") References: <8E9D028761D8264D910612167E8457E8FA3B0A@mail2.ammasso.com> Message-ID: <524q9f1et9.fsf@cisco.com> Tom> I think I understand, but the purpose of specifying the IP Tom> address in the listen is not to filter incoming connect Tom> requests, but rather to determine which devices I listen Tom> on. I think this works for the IB case as well. So the Tom> utility of the IP address specified in the listen is only to Tom> determine which devices the sid is created on. Does this make Tom> sense or am I missing something? Well, that's not what I would expect. 
Suppose I have a device configured with local addresses 192.168.11.12 and 192.168.98.99 and I start listening for some service at the address 192.168.11.12. I don't think I should see a connection request if a remote system tries to connect to 192.168.98.99 (even though it's the same network interface as 192.168.11.12). - R. From swise at ammasso.com Wed Aug 24 09:36:48 2005 From: swise at ammasso.com (Steve Wise) Date: Wed, 24 Aug 2005 11:36:48 -0500 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <524q9f1et9.fsf@cisco.com> Message-ID: > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Wednesday, August 24, 2005 11:27 AM > To: Tom Tucker > Cc: openib-general at openib.org > Subject: Re: [openib-general] RDMA connection and address > translation API > > Tom> I think I understand, but the purpose of specifying the IP > Tom> address in the listen is not to filter incoming connect > Tom> requests, but rather to determine which devices I listen > Tom> on. I think this works for the IB case as well. So the > Tom> utility of the IP address specified in the listen is only to > Tom> determine which devices the sid is created on. Does this make > Tom> sense or am I missing something? > > Well, that's not what I would expect. Suppose I have a device > configured with local addresses 192.168.11.12 and 192.168.98.99 and I > start listening for some service at the address 192.168.11.12. I > don't think I should see a connection request if a remote system tries > to connect to 192.168.98.99 (even though it's the same network > interface as 192.168.11.12). > I agree, Roland. ULPs that listen on a specific address expect only connection requests that were sent to that IP address. I think we want to provide this functionality.
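The behavior described here (a listen bound to one address sees only requests sent to that address, while a wild-card listen sees all of them) can be sketched as a matching check. A minimal sketch, assuming the destination IPv4 address of the request is already known to the matching code; how that address would be obtained in the IB case is exactly the open question in this thread. demo_listener, demo_addr() and demo_listener_matches() are hypothetical names, not part of any proposed API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical listener record: bound address plus port (host byte order). */
struct demo_listener {
	struct in_addr bind_addr;	/* INADDR_ANY acts as the wild-card */
	uint16_t port;
};

/* Parse a dotted-quad string; aborts on malformed input. */
static struct in_addr demo_addr(const char *dotted)
{
	struct in_addr a;
	int rc = inet_pton(AF_INET, dotted, &a);

	assert(rc == 1);
	(void)rc;
	return a;
}

/*
 * Return true if a request sent to dst:dst_port should be delivered
 * to listener l. The port must always match; the address must match
 * unless the listener was bound to the wild-card.
 */
static bool demo_listener_matches(const struct demo_listener *l,
				  struct in_addr dst, uint16_t dst_port)
{
	assert(l != NULL);
	if (l->port != dst_port)
		return false;
	if (l->bind_addr.s_addr == htonl(INADDR_ANY))
		return true;
	return l->bind_addr.s_addr == dst.s_addr;
}
```

With the addresses from the example above, a listener bound to 192.168.11.12 accepts a request addressed to 192.168.11.12 and ignores one addressed to 192.168.98.99, even though both addresses sit on the same interface.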
From ftillier at silverstorm.com Wed Aug 24 10:03:53 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 24 Aug 2005 10:03:53 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <524q9f1et9.fsf@cisco.com> Message-ID: <000101c5a8cd$d53cc9a0$6312000a@infiniconsys.com> > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, August 24, 2005 9:27 AM > > Tom> I think I understand, but the purpose of specifying the IP > Tom> address in the listen is not to filter incoming connect > Tom> requests, but rather to determine which devices I listen > Tom> on. I think this works for the IB case as well. So the > Tom> utility of the IP address specified in the listen is only to > Tom> determine which devices the sid is created on. Does this make > Tom> sense or am I missing something? > > Well, that's not what I would expect. Suppose I have a device > configured with local addresses 192.168.11.12 and 192.168.98.99 and I > start listening for some service at the address 192.168.11.12. I > don't think I should see a connection request if a remote system tries > to connect to 192.168.98.99 (even though it's the same network > interface as 192.168.11.12). I think the IB CM needs to be able to do two things. It needs to allow a listen to be bound to a specific port - using the port GUID or the LID or something along those lines. The Windows CM currently takes a port GUID as input to allow binding requests to a local IB port. Incoming MADs are matched based on which port they came in on. This does introduce the limitation that sending CM MADs to a port other than the one you wish to connect to won't have the desired result if the ULP performs port filtering. I don't think this is a big deal. Knowledge of actual IP addresses would be up to the consumer. However, the IB CM can facilitate checks by allowing the user to specify an offset and length in the private data to match against for incoming requests.
ULPs that would want to distinguish between IP addresses on a given port would put the IP in their private data, and instruct the CM to compare a specific value at a specific offset and length for every incoming REQ. The Windows CM does this - a listen takes as input a private data compare buffer, buffer length, and offset within the REQ private data to perform the comparison. Without the CM performing the private data comparison for the client, there is no way for the CM to route to the proper person based on something like IP. Using a generic private data compare mechanism enables the users to do whatever they feel like, without putting knowledge of IP addresses and whatnot into the IB CM or dictating how clients must use their private data. A lookup of a listen for an incoming request changes from just being based on SID to taking as additional parameters the port GUID on which the REQ was received and the REQ's private data in case a private data compare needs to be performed. - Fab From jlentini at netapp.com Wed Aug 24 10:03:58 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 24 Aug 2005 13:03:58 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <1124883718.3933.95.camel@r2d2> References: <1124883718.3933.95.camel@r2d2> Message-ID: > > However, there's another problem with trying to lump address > > translation and connection into a single "connect" call, and this > > problem looks fundamental and fatal to me. The connect call takes a > > QP pointer, but to create a QP the consumer needs to know which local > > device to use. However, the consumer doesn't know which device to use > > until the destination address has been resolved to a route, including > > a local interface. 
> > The proposition, also presented (I beleive) in the OpenIB workshop, > include a function called ib_cma_get_device, that retrieves the device > (for qp creation purposes) according to the destination address and the > local routing table. That function was included in the presentation. Given that the discussion focused on the proper location of address translation, it is understandable that its presence was overlooked. From danb at voltaire.com Wed Aug 24 10:08:05 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Wed, 24 Aug 2005 20:08:05 +0300 Subject: [openib-general] ISER cleanup Message-ID: In today's ISER commits: Removed files: iser_ext_api.h iser_global.[ch] Removed all typedefs except function pointers typedefs All files are now using tabs for indentation and lines are 80 long max. Down to 6K LOC + 2K .h LOC Dan > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Dan Bar Dov > Sent: Tuesday, August 23, 2005 7:11 PM > To: Grant Grundler > Cc: openib-general at openib.org > Subject: RE: [openib-general] ISER cleanup > > > > -----Original Message----- > > On Mon, Aug 22, 2005 at 01:22:55PM +0300, Dan Bar Dov wrote: > > > We have begun a cleanup of ISER based on the inputs we received. > > > Mostly cosmetic cleanups were already commited. > > > > yup - good progress and some more cosmetic stuff noted below. > > Then need to start looking at addressing Christoph's (hch) comments. > > Some of the comments were taken care of in today's commits > iovecs removed procfs removed Function entry/exit traces > removed Unnecessary files removed: > kernel_dep.h > iser_bhs.h > iser_trace.c > > > > > Still need to remove kernel_dep.h and probably most of the files in > > iser/include/. > > > > Those also all have a trailing "/* DAT 1.2 */" > > that might mislead in the future. > > Maybe a comment in the header about "Based on DAT 1.2" release. > > All DAT 1.2 comments removed. 
Actually the current code is > not DAT 1.2 compatible, but the openIB flavor compatible. > Since work started on a CM abstraction, I expect ISER to get > off of kdapl and onto ib-verbs + CM abstraction. > > > > > > iser_api.h > > Should iSCSI be providiing the jump table definitions? > > struct iser_api_t > > struct iser_api_cb_t > > > > iser_ext_api.h > > typedef void * iser_conn_request_t; > > Delete stuff like this - it just obscures what is going on. > OK > > > > > I'm not sure what this file is doing. > > I was expecting iSCSI framework to define the data structures > > it needs to talk to a service provider. > This is an "extended API". The ISER spec defines an ISER API, > but it does not consider implementation. > We chose to implement the extra api out of the iser_api > structute and in the iser_ext_api struct. > iSCSI is still not part of the kernel so we had first > modified and added the datamover framework to linux-iscsi and > now to open-iscsi. Once open-iscsi is in the kernel we'll use > it as the framework. > > > > > iser_pdu.h > > sorry - Didn't have time to understand what this is about. > Most definitions are duplicates from the iscsi and will disappear. > The struct iser_send_pdu defines the ISER extensions to the iscsi pdu. > > > > > iser_types.h > > delete typdef void * iser_api_handle_t. > > replace usage of iser_api_handle_t with "void *". > > Ditto for all "void *" typedefs in that file. > OK > > > > > Kernel already defines scatter-gather lists type. > The iser_data_buf struct can point to a scatterlist array but > can also be used to point at a single buffer. > It does not replicate scatterlist but allows us to deal with > two types of registrations - single buffer and scatter lists. > > > > > kernel_dep.h > > Delete this file. > > This content belongs in a seperate patch that people can grab > > and apply when they want to build iSER on an older kernel. > > See src/linux/kernel/patches > Gone. 
> > > > > > > > Removed vi comments > > > > yup - mostly. Some are still present in iser/include/*.h. > Gone as well. > > > > > > Removed CONFIG_INFINIBAND refrences > > > Reorganized module > > > Rewritten Makefile to new style > > > Added Kconfig file > > > Using kernel min/max > > > > all very good. > > > > > > > There are many other things to be done, including both > coding style > > > and substance, we'll proceed addressing all the technical > > issues that > > > were commented on. > > > > great! > We are going to simplify the local memory registrations by > registering all memory like in the SRP driver. > We do not understand some of the substance issues - for > example, dma related comments - are taken care of by iscsi, > not the transport. The io_mmu comment, we completely do not > understand - there was some platform specific code, but its > all gone now. > > BTW the code is now down to 7K LOC + 2K LOC of heavily > commented header files. > > > > > thanks, > > grant > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rolandd at cisco.com Wed Aug 24 10:15:41 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 10:15:41 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <000101c5a8cd$d53cc9a0$6312000a@infiniconsys.com> (Fab Tillier's message of "Wed, 24 Aug 2005 10:03:53 -0700") References: <000101c5a8cd$d53cc9a0$6312000a@infiniconsys.com> Message-ID: <52pss3z26a.fsf@cisco.com> Fab> I think the IB CM needs to be able to do two things. It Fab> needs to allow a listen to be bound to a specific port - Fab> using the port GUID or the LID or something along those Fab> lines. Yes, this is probably a good idea. Fab> Knowledge of actual IP addresses would be up to the consumer. 
Fab> However, the IB CM can facilitate checks by allowing the user
Fab> to specify an offset and length in the private data to match
Fab> to for incoming requests.

This seems too complex and at the same time too limited to me. For
one thing -- although I think ATS should die -- this doesn't support
ATS reverse lookups. For another, it doesn't handle something like
the SDP Hello header, where the IP version is at a certain offset, and
then the IP address is interpreted according to the IP version.

What makes it really ugly is that it's perfectly reasonable for one
consumer to listen to a service at 192.168.11.12 and another consumer
to listen to the same service at 192.168.98.99. How do we handle this
in the IB case??

 - R.

From shubbell at dbresearch.net Wed Aug 24 09:27:38 2005
From: shubbell at dbresearch.net (Sean Hubbell)
Date: Wed, 24 Aug 2005 12:27:38 -0400
Subject: [openib-general] ISER cleanup
In-Reply-To:
References:
Message-ID: <430C9FFA.30007@dbresearch.net>

Dan Bar Dov wrote:

>In today's ISER commits:
>Removed files: iser_ext_api.h iser_global.[ch]
>Removed all typedefs except function pointers typedefs
>All files are now using tabs for indentation and lines are 80 long max.
>Down to 6K LOC + 2K .h LOC
>
>Dan
>
>
Just a thought, but you can use the GNU indent application to do this
very easily (not sure if you did this, I just thought it might help if
you have not).

Here is a sample command:

indent -kr --use-tabs -i2 -l80 -nhnl sourceFilename

The -kr option uses the K&R style used in "The C Programming Language".
The --use-tabs option is self-explanatory.
The -i2 option is the indentation level in spaces.
The -nhnl option is used to ignore newlines.

Using indent will help adhere to a standard that you define...
Sean From caitlinb at broadcom.com Wed Aug 24 10:39:17 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 24 Aug 2005 10:39:17 -0700 Subject: [openib-general] RE: Connection and Address Translation Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F503@NT-SJCA-0751.brcm.ad.broadcom.com> I have several comments on this topic. First, I strongly endorse the policy decision made long ago in the DAT Collaborative that a network address is a flat numeric identifier with IPv6 semantics. I still believe that all interfaces designed for application developers should follow that form. Now admittedly the interface used between the DAT Provider (and other middleware and maybe a handful of highly sophisticated kernel applications)is a different question. Such an interface could distinquish between an Address with IPv6 semantics and a lower layer "RDMA Address", but such a distinction would be of minimal benefit to the DAT Provider. Therefore I think it needs to be justified somewhere. The only benefit I see in the context of kDAPL is that it moves some logic from the device-dependent verbs to core code. One thing that we have to be careful about in defining this additional API is that the nature of iWARP paths is not artificially frozen. In particular Roland's proposal comes close to assuming that iWARP is a single-path-only subset of InfiniBand. That is incorrect. It is more correct to say that path selection, including path migration, takes place at L3 or L4 and is invisible to L5. This includes both multi-homed IP transports (such as SCTP) and the IP layer itself (where an IP address can be migrated to a new Ethernet port). I will be presenting a paper at the RAIT conference dealing with a multi-homed option for MPA/TCP. So there are several path failover options for iWARP. The difference from InfiniBand is not that it has only a single path, but that path selection and failover is transparent to the RDMA layer. 
It is also transparent to the kDAPL consumer, so exposing path failover to the DAT Provider in a way that interferes with below-RDMA path failure in iWARP would be a mistake. In particular, if we purse the "RDMA Address" proposal it should be clear that the resolved "source address" can still be a "Don't Care". When iWARP Connection Management is implemented over TCP the effect of specifying a local address is to do a bind on the socket before calling connect. For the vast majority of topologies the destination address alone is sufficient to ensure routing through the correct RDMA device (and it can avoid any pre-mature selection of a specific Ethernet port). The problem with listen is more complex. The fully correct interface from the application perspective would allow the application to listen on a *set* of local addresses. But an efficient transport neutral definition of a "set of local addresses" is not an easy thing to come up with. The decision within the DAT Collaborative was to punt on this issue. The Consumer was required to issue a single listen for the Service ID/Port for an RDMA device, and then to figure out what to do with the Connection Request based on both the local address requested and potentiall the remote address. In the IP world it is very common for content servers (HTTP, FTP, ...) to present different content based upon which of the server's IP addresses was requested. It is also very easy to listen on a *single* IA Address in a transport neutral fashion. So barring any great inspirations about how to represent a set of addresses, I would suggest that we stick with the single Address or all addresses supported by the device approach. It certainly does not make sense for IB to emulate IP subnet masking. That means there are three services needed. These are in fact identified by DAT, but they are not specified. 
The Consumer was directed to use the existing OS specific solutions to perform these functions: 1) Select IP Interface given desired destination address and Class of Service. 2) Select IP interface given desired local address. 3) Select DAT Device matching IP Interface. DAT, being OS neutral, did not specify these functions. Working within Linux they can be specified. The intent is for the first two to match the same APIs/ procedures used for sockets, firewalls and the first hop routing. Ultimately this means that on the listen side the verbs consumer must be able to: a) listen on either a specific address/port or on "all addresses/specific port" for the device. b) Receive the actual local address used in the Connection Request. c) Be able to query the remote address given a Connection Request, but this does not need to be delivered by default. I will also point out that there is nothing in the DAT interface that requires an extra wrie step for InfiniBand. The definition of IA Address is specifically designed to support division by subnetting, and this is even assumed when multiple DAT Providers are supported that use different transports. The System/Network administrators are *already* required to ensure that the 128-bit IPv6-like address space is divided unambiguously between the different RDMA devices. Further they are required to guarantee that the IA Address mapping does not contradict the mapping of IP addresses. That means that the System/Network administrators can easily define one ore more IPv6 network IDs that translate directly to assigned GIDs. If there is only a single InfiniBand subnet it can even be the link-local prefix. With such a solution there is no run-time penalty for supplying both local and remote "IA Addresses" in a connection request. As long as the "unspecified" local address remains a valid option the proposed API split is defnitely implementable under iWARP. 
But it is obviously simpler to just keep the same semantics as the exported kDAPL interface. Is there a definite benefit to making the change, if so what and for which middleware/application? From jlentini at netapp.com Wed Aug 24 10:42:36 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 24 Aug 2005 13:42:36 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <52y86r2a9w.fsf@cisco.com> References: <52y86r2a9w.fsf@cisco.com> Message-ID: On Tue, 23 Aug 2005, Roland Dreier wrote: > It would be possible to have another function like > rdma_getpeername() that takes the transport address and > returns a source IP address. In the IB case this would do an > ATS reverse lookup. However, I hate this idea. iSER already > uses the CM private data to pass the source IP in the IB case, I know this is how IB SDP works, but I don't think iSER works this way. The code in the tree calls dat_ep_connect() with a NULL private data pointer. There is an iSER HELLO message described in iser_header.h contains IP addresses, but I'm not certain that this is part of the current protocol (ISER_HELLO_LEN and ISER_HELLO_REPLY_LEN are unused). From ftillier at silverstorm.com Wed Aug 24 10:47:32 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 24 Aug 2005 10:47:32 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <52pss3z26a.fsf@cisco.com> Message-ID: <000201c5a8d3$f30b3150$6312000a@infiniconsys.com> > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, August 24, 2005 10:16 AM > > Fab> Knowledge of actual IP addresses would be up to the consumer. > Fab> However, the IB CM can facilitate checks by allowing the user > Fab> to specify an offset and length in the private data to match > Fab> to for incoming requests. > > This seems too complex and at the same time too limited to me. 
For > one thing -- although I think ATS should die -- this doesn't support > ATS reverse lookups. I think if all ULPs provide their source and destination IP in the private data, you can eliminate the reverse lookup altogether. A simple forward lookup is all that's needed to validate that the source GID in the REQ matches the reported source IP in the private data. The forward lookup could be done via ATS or via ARP, but the CM doesn't need to care which method is used. > For another, it doesn't handle something like > the SDP Hello header, where the IP version is at a certain offset, and > then the IP address is interpreted according to the IP address. Why can't the IPV field be ignored? If a listen wants only IPV4 addresses, it would specify a 16-byte compare buffer with the first 12 bytes zero, the next 4 filled with the IPV4 address, and would set the offset to that of the hello message's destination address (32). > What makes it really ugly is that it's perfectly reasonable for one > consumer to listen to a service at 192.168.11.12 and another consumer > to listen to the same service at 192.168.98.99. How do we handle this > in the IB case?? As long as the service IP address (the local address on the listening side) is always advertised in the same place in the private data, this isn't a problem. The compare lengths and offsets would be identical for both services, but the compare buffer contents would differ. Did I miss what you were getting at? 
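To make the compare-buffer idea above concrete, here is a minimal sketch of how a listen lookup keyed on SID, port GUID, and a private data compare might look. This is only an illustration: `struct listen_entry`, `listen_matches` and `ipv4_compare_buf` are hypothetical names, not part of any existing CM, and treating a port GUID of 0 as "any port" is an assumption made for the example. The 16-byte buffer (12 zero bytes followed by the IPv4 address) and the offset of 32 come from the SDP hello discussion above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical listen entry; field names are illustrative only. */
struct listen_entry {
	uint64_t service_id;	/* SID the consumer listens on */
	uint64_t port_guid;	/* 0 means "any port" (assumption) */
	const uint8_t *cmp_buf;	/* NULL means no private data compare */
	size_t cmp_len;		/* number of bytes to compare */
	size_t cmp_offset;	/* offset within the REQ private data */
};

/* Return 1 if an incoming REQ matches this listen, 0 otherwise. */
static int listen_matches(const struct listen_entry *l, uint64_t sid,
			  uint64_t port_guid, const uint8_t *priv,
			  size_t priv_len)
{
	if (l->service_id != sid)
		return 0;
	if (l->port_guid && l->port_guid != port_guid)
		return 0;
	if (!l->cmp_buf)
		return 1;
	/* Reject REQs whose private data is too short to compare. */
	if (l->cmp_offset + l->cmp_len > priv_len)
		return 0;
	return memcmp(priv + l->cmp_offset, l->cmp_buf, l->cmp_len) == 0;
}

/* Build the 16-byte compare buffer: 12 zero bytes, then the IPv4 address. */
static void ipv4_compare_buf(uint8_t out[16], const uint8_t ipv4[4])
{
	memset(out, 0, 12);
	memcpy(out + 12, ipv4, 4);
}
```

Two consumers listening on the same SID for 192.168.11.12 and 192.168.98.99 would use the same offset and length and differ only in the contents of `cmp_buf`, so the CM itself never needs to know that the bytes are an IP address.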
- Fab From yaronh at voltaire.com Wed Aug 24 11:01:06 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Wed, 24 Aug 2005 21:01:06 +0300 Subject: [openib-general] RDMA connection and address translation API Message-ID: <35EA21F54A45CB47B879F21A91F4862F7141EB@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of James Lentini > Sent: Wednesday, August 24, 2005 1:43 PM > To: Roland Dreier > Cc: openib-general at openib.org > Subject: Re: [openib-general] RDMA connection and address translation API > > > > On Tue, 23 Aug 2005, Roland Dreier wrote: > > > It would be possible to have another function like > > rdma_getpeername() that takes the transport address and > > returns a source IP address. In the IB case this would do an > > ATS reverse lookup. However, I hate this idea. iSER already > > uses the CM private data to pass the source IP in the IB case, > > I know this is how IB SDP works, but I don't think iSER works this > way. > > The code in the tree calls dat_ep_connect() with a NULL private data > pointer. > > There is an iSER HELLO message described in iser_header.h contains IP > addresses, but I'm not certain that this is part of the current > protocol (ISER_HELLO_LEN and ISER_HELLO_REPLY_LEN are unused). James, iSER doesn't mandate the source IP in general since its doing a much stronger authentication during Login However we believe using a similar header to SDP can help the Passive side a. know which destination IP was targeted (in a multi homed environment) b. for some implementations that want to validate the source for some reason that's why the draft suggested adding the source/dst IP in the private data just like SDP does, I believe it can be a good idea to use the same approach for NFS/RDMA and eliminate the need for reverse ATS lookup (the may have some conflicts when multiple IPs exists per node). 
We may just use the SDP hello header as is with unused fields zeroed This will allow all ULPs to use the same mechanism Yaron > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From rolandd at cisco.com Wed Aug 24 11:02:52 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 11:02:52 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <000201c5a8d3$f30b3150$6312000a@infiniconsys.com> (Fab Tillier's message of "Wed, 24 Aug 2005 10:47:32 -0700") References: <000201c5a8d3$f30b3150$6312000a@infiniconsys.com> Message-ID: <52hddfyzzn.fsf@cisco.com> Fab> Why can't the IPV field be ignored? If a listen wants only Fab> IPV4 addresses, it would specify a 16-byte compare buffer Fab> with the first 12 bytes zero, the next 4 filled with the IPV4 Fab> address, and would set the offset to that of the hello Fab> message's destination address (32). Yes, you're right for SDP. I guess if we're comfortable mandating that all protocols put their source and destination IPs in the private data for the IB case, then this works. Of course it's somewhat awkward to pass this information into the transport-neutral CM API but I think this can be worked around. Roland> What makes it really ugly is that it's perfectly Roland> reasonable for one consumer to listen to a service at Roland> 192.168.11.12 and another consumer to listen to the same Roland> service at 192.168.98.99. How do we handle this in the IB Roland> case?? Fab> As long as the service IP address (the local address on the Fab> listening side) is always advertised in the same place in the Fab> private data, this isn't a problem. The compare lengths and Fab> offsets would be identical for both services, but the compare Fab> buffer contents would differ. 
Did I miss what you were Fab> getting at? No, I think I confused myself. As long as the CM can get at the IP information, it can figure out which consumer is which. - R. From tom at ammasso.com Wed Aug 24 11:11:34 2005 From: tom at ammasso.com (Tom Tucker) Date: Wed, 24 Aug 2005 14:11:34 -0400 Subject: [openib-general] RDMA connection and address translation API Message-ID: <8E9D028761D8264D910612167E8457E8FA3B2A@mail2.ammasso.com> > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, August 24, 2005 11:27 AM > To: Tom Tucker > Cc: Roland Dreier; openib-general at openib.org > Subject: Re: [openib-general] RDMA connection and address > translation API > > Tom> I think I understand, but the purpose of specifying the IP > Tom> address in the listen is not to filter incoming connect > Tom> requests, but rather to determine which devices I listen > Tom> on. I think this works for the IB case as well. So the > Tom> utility of the IP address specified in the listen is only to > Tom> determine which devices the sid is created on. Does this make > Tom> sense or am I missing something? > > Well, that's not what I would expect. Suppose I have a > device configured with local addresses 192.168.11.12 and > 192.168.98.99 and I start listening for some service at the > address 192.168.11.12. I don't think I should see a > connection request if a remote system tries to connect to > 192.168.98.99 (even though it's the same network interface as > 192.168.11.12). > > - R. > Good point, although for iWARP it will work that way that you expect. For IB, admitedly it's more complex and would require ATS. There seems to be significant reluctance around ATS and I don't understand the issues. Can you provide a quick synopsis? 
From steve_wooding at keysounds.co.uk Wed Aug 24 11:11:55 2005 From: steve_wooding at keysounds.co.uk (Steve Wooding) Date: Wed, 24 Aug 2005 19:11:55 +0100 Subject: [openib-general] Re: How to change NodeDescription In-Reply-To: <52u0hf1hsp.fsf@cisco.com> References: <30280207$1124878682430c495a0b75a3.09921159@config5.schlund.de> <430C8217.1000206@mellanox.co.il> <20050824144215.GL20750@mellanox.co.il> <52u0hf1hsp.fsf@cisco.com> Message-ID: <430CB86B.6010002@keysounds.co.uk> Thanks everyone. That was kind of the answer I was hoping for. I look forward to its implemention. It will be very useful. Cheers, Steve. Roland Dreier wrote: > Michael> Sounds good. Making node description visible, and > Michael> editable, through /sys/class/infiniband/mthcaX/node_desc > Michael> is probably the way to do it. The user would then be > Michael> able to write whatever he wants there. > >Yes, I've had exactly this on my to-do list for a while, but haven't >had a chance to implement it yet. > > - R. > > > From caitlin.bestler at gmail.com Wed Aug 24 11:14:08 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Wed, 24 Aug 2005 11:14:08 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <000201c5a8d3$f30b3150$6312000a@infiniconsys.com> References: <52pss3z26a.fsf@cisco.com> <000201c5a8d3$f30b3150$6312000a@infiniconsys.com> Message-ID: <469958e005082411147f1dfd03@mail.gmail.com> On 8/24/05, Fab Tillier wrote: > > From: Roland Dreier [mailto:rolandd at cisco.com] > > Sent: Wednesday, August 24, 2005 10:16 AM > > > > Fab> Knowledge of actual IP addresses would be up to the consumer. > > Fab> However, the IB CM can facilitate checks by allowing the user > > Fab> to specify an offset and length in the private data to match > > Fab> to for incoming requests. > > > > This seems too complex and at the same time too limited to me. For > > one thing -- although I think ATS should die -- this doesn't support > > ATS reverse lookups. 
> > I think if all ULPs provide their source and destination IP in the private data, > you can eliminate the reverse lookup altogether. A simple forward lookup is all > that's needed to validate that the source GID in the REQ matches the reported > source IP in the private data. The forward lookup could be done via ATS or via > ARP, but the CM doesn't need to care which method is used. > That is not an option. The applications are expecting source/destination network addresses that come from a network layer, not from the peer application. IP has no problem meeting this requirement. This is an IB problem that needs to be solved within the scope of IB without changing any ULPs. > > For another, it doesn't handle something like > > the SDP Hello header, where the IP version is at a certain offset, and > > then the IP address is interpreted according to the IP address. > > Why can't the IPV field be ignored? If a listen wants only IPV4 addresses, it > would specify a 16-byte compare buffer with the first 12 bytes zero, the next 4 > filled with the IPV4 address, and would set the offset to that of the hello > message's destination address (32). > > > What makes it really ugly is that it's perfectly reasonable for one > > consumer to listen to a service at 192.168.11.12 and another consumer > > to listen to the same service at 192.168.98.99. How do we handle this > > in the IB case?? > > As long as the service IP address (the local address on the listening side) is > always advertised in the same place in the private data, this isn't a problem. > The compare lengths and offsets would be identical for both services, but the > compare buffer contents would differ. Did I miss what you were getting at? > The concensus when this issue was debated in the DAT Collaborative was that there was no transport neutral way to specify a set of addresses to listen on other than "all addresses supported by this device". 
As noted in another posting, it is easy to support "all for device" and "this address only" with transport neutral interfaces. Anything else is problematic. From rolandd at cisco.com Wed Aug 24 11:17:16 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 11:17:16 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B2A@mail2.ammasso.com> (Tom Tucker's message of "Wed, 24 Aug 2005 14:11:34 -0400") References: <8E9D028761D8264D910612167E8457E8FA3B2A@mail2.ammasso.com> Message-ID: <524q9fyzbn.fsf@cisco.com> Tom> Good point, although for iWARP it will work that way that you Tom> expect. For IB, admitedly it's more complex and would Tom> require ATS. There seems to be significant reluctance around Tom> ATS and I don't understand the issues. Can you provide a Tom> quick synopsis? My resistance is that ATS is just complexity without any benefit. It doesn't provide additional security. It doesn't solve the multi-homing problem we're talking about now. Once you've thrown away information by turning your IP address into an IB GID, there's no magic way ATS can recreate that information and be psychic about which of the multi-homed IPs you actually meant. So why not just put the IP addressing information into the CM private data, the way that the SDP protocol already does? - R. From sean.hefty at intel.com Wed Aug 24 11:17:36 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 24 Aug 2005 11:17:36 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <52hddfyzzn.fsf@cisco.com> Message-ID: > Fab> Why can't the IPV field be ignored? If a listen wants only > Fab> IPV4 addresses, it would specify a 16-byte compare buffer > Fab> with the first 12 bytes zero, the next 4 filled with the IPV4 > Fab> address, and would set the offset to that of the hello > Fab> message's destination address (32). > >Yes, you're right for SDP. 
I guess if we're comfortable mandating >that all protocols put their source and destination IPs in the private >data for the IB case, then this works. Of course it's somewhat >awkward to pass this information into the transport-neutral CM API but >I think this can be worked around. For IB, using private data to listen on a specific IP address seems the easiest thing to do. (Maybe we could do it by mapping different IP addresses to different service IDs, requiring registration and lookup?) If the CM abstraction layer expected those values to be returned in the REP message, it could validate that the remote side it using the same protocol to ensure some degree of backwards compatibility. I don't know if it makes more sense to push private data checks into the actual CM or keep them in a CM abstraction layer. My guess is that the former may be the easier implementation. - Sean From yaronh at voltaire.com Wed Aug 24 11:23:15 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Wed, 24 Aug 2005 21:23:15 +0300 Subject: [openib-general] RDMA connection and address translation API Message-ID: <35EA21F54A45CB47B879F21A91F4862F7141EF@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Caitlin Bestler > Sent: Wednesday, August 24, 2005 2:14 PM > To: Fab Tillier > Cc: openib-general at openib.org > Subject: Re: [openib-general] RDMA connection and address translation API > > > The applications are expecting source/destination network addresses > that come from a network layer, not from the peer application. IP has > no problem meeting this requirement. This is an IB problem that needs > to be solved within the scope of IB without changing any ULPs. 
> To my understanding IB private data fields are IB CM specific So embedding src/dst IP in it doesn't change the ULP and could be considered as part of the IB CM You can look at the private data in that case as a replacement to the TCP CM (Syn/SynAck exchange), and Syn packet includes IPs & Ports Yaron From sean.hefty at intel.com Wed Aug 24 11:23:54 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 24 Aug 2005 11:23:54 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <469958e005082411147f1dfd03@mail.gmail.com> Message-ID: >> I think if all ULPs provide their source and destination IP in the private >data, >> you can eliminate the reverse lookup altogether. A simple forward lookup is >all >> that's needed to validate that the source GID in the REQ matches the reported >> source IP in the private data. The forward lookup could be done via ATS or >via >> ARP, but the CM doesn't need to care which method is used. >> > >That is not an option. > >The applications are expecting source/destination network addresses >that come from a network layer, not from the peer application. IP has >no problem meeting this requirement. This is an IB problem that needs >to be solved within the scope of IB without changing any ULPs. IB can solve the option by exposing fewer bytes of private data. ULPs do not need to know that part of the IB private data is actually used by the CM abstraction layer. ULPs that make use of this new interface change anyway. 
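As a rough sketch of what "exposing fewer bytes of private data" could mean in practice, the abstraction layer might reserve a small header at the front of the REQ private data and hand the ULP only the remainder. Everything here is an assumption made for illustration: the header layout (a version byte, three reserved bytes, then 16-byte source and destination addresses) is made up and is not the SDP hello format, the names are hypothetical, and the REQ private data area is assumed to be 92 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REQ_PRIV_LEN	92	/* assumed size of the IB CM REQ private data */
#define CMA_HDR_LEN	36	/* hypothetical: 1 version + 3 reserved + 16 src + 16 dst */
#define CMA_SRC_OFF	4
#define CMA_DST_OFF	20
#define ULP_PRIV_MAX	(REQ_PRIV_LEN - CMA_HDR_LEN)

/*
 * Pack the reserved header plus the ULP's own private data.
 * Returns 0 on success, -1 if the ULP data does not fit.
 */
static int cma_pack_req(uint8_t out[REQ_PRIV_LEN], const uint8_t src[16],
			const uint8_t dst[16], const void *ulp, size_t ulp_len)
{
	if (ulp_len > ULP_PRIV_MAX)
		return -1;
	memset(out, 0, REQ_PRIV_LEN);
	out[0] = 1;	/* hypothetical header version */
	memcpy(out + CMA_SRC_OFF, src, 16);
	memcpy(out + CMA_DST_OFF, dst, 16);
	if (ulp_len)
		memcpy(out + CMA_HDR_LEN, ulp, ulp_len);
	return 0;
}

/* The ULP only ever sees the bytes after the reserved header. */
static const uint8_t *cma_ulp_data(const uint8_t *req_priv)
{
	return req_priv + CMA_HDR_LEN;
}
```

With a layout like this the ULP's view of private data simply shrinks (to 56 bytes under these assumptions), and the passive side can read the addresses without the ULP's protocol changing.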
- Sean

From caitlin.bestler at gmail.com Wed Aug 24 11:28:35 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Wed, 24 Aug 2005 11:28:35 -0700
Subject: [openib-general] Connection Manager Abstraction proposition - header file
In-Reply-To: <20050824122107.GA2323@voltaire.com>
References: <20050824122107.GA2323@voltaire.com>
Message-ID: <469958e0050824112851e8109d@mail.gmail.com>

On 8/24/05, Guy German wrote:
>
> enum ib_cma_event {
>         IB_CMA_EVENT_ESTABLISHED,
>         IB_CMA_EVENT_REJECTED,
>         IB_CMA_EVENT_DISCONNECTED,
>         IB_CMA_EVENT_UNREACHABLE
> };
>

The events need to distinguish between Rejected and Peer Rejected.
For example, a TCP connection could be rejected by the peer stack
for lack of capacity to relay the Connection Request to the peer
for approval. That is neither a peer rejection nor an "unreachable"
event.

> enum ib_qos {
>         IB_QOS_BEST_EFFORT = 0,
>         IB_QOS_HIGH_THROUGHPUT = (1 << 0),
>         IB_QOS_LOW_LATENCY = (1 << 1),
>         IB_QOS_ECONOMY = (1 << 2),
>         IB_QOS_PREMIUM = (1 << 3)
> };
>
> enum ib_connect_flags {
>         IB_CONNECT_DEFAULT_FLAG = 0x00,
>         IB_CONNECT_MULTIPATH_FLAG = 0x01
> };
>
> /*
>  * for ib_cma_get_src_ip - ib_cma_id will have to include
>  * the path data received in the request handler
>  */
> union ib_cma_id{
>         struct ib_cm_id *cm_id;
>         u32 iwarp_id;
> };
>
> typedef void (*ib_cma_rarp_handler)(struct sockaddr *src_ip, void *context);
> typedef void (*ib_cma_ac_handler)(enum ib_cma_event event, void *context);
> typedef void (*ib_cma_event_handler)(enum ib_cma_event event, void *context,
>                                      void *private_data);
> typedef void (*ib_cma_listen_handler)(union ib_cma_id *cma_id,
>                                       void *private_data, void *context);
>
> struct ib_cma_conn {
>         struct ib_qp *qp;
>         struct ib_qp_attr *qp_attr;
>         struct sockaddr *dst_ip;
>         __be64 service_id;
>         void *context;
>         ib_cma_event_handler cma_event_handler;
>         const void *private_data;
>         u8 private_data_len;
>         u32 timeout;
>         enum ib_qos qos;
>         enum ib_connect_flags connect_flags;
> };
>
>
> /**
>  * ib_cma_get_device - Returns the device to be used according to
>  * the destination ip address (this can be determined according
>  * to the local routing table). Call this function before
>  * creating the qp. If using link-local IPv6 addresses
>  * @remote_address: The destination address for connection
>  * @device: The device to use (returned by the function)
>  */
> int ib_cma_get_device(struct sockaddr *remote_address,
>                       struct ib_device **device);
>
>

This would need to be based on the remote_address *and* CoS.
For example there could be two devices that reach the same
destination network, but with different speeds.

> /**
>  * ib_cma_connect - this is the connect request function, called by
>  * the active side. The consumer registers an upcall that will be
>  * initiated by the cma with an appropriate connection event
>  * notification (established/rejected/disconnected etc)
>  * @cma_conn: This structure contains the following connection parameters:
>  *      @qp: qp for establishing the connection
>  *      @qp_attr: only relevant attributes are used
>  *      @dst_ip: destination ip address
>  *      @service_id: destination service id (port)
>  *      @context: context to be returned in the callback
>  *      @cma_event_handler: the upcall function for the active side
>  *      @private_data: private data to be received at the listener upcall
>  *      @private_data_len: private data length (max 255)
>  *      @timeout:
>  *      @qos: Quality of service for the rc
>  *      @connect_flags: default or multipath connection
>  * @cma_id: This returned handle is a union (different in ib and iwarp)
>  * in ib - it is the cm_id.
>  */

As noted above, the QoS is also needed to even select the device.

> int ib_cma_connect(struct ib_cma_conn *cma_conn,
>                    union ib_cma_id *cma_id);
>
>
> /**
>  * ib_cma_disconnect - this function disconnects the rc. It can be
>  * called, by either the passive or active side
>  * @qp: the connected qp to disconnect
>  * @cma_id: On the active side- this handle is the one returned
>  * when ib_cma_connect was called.
>  * On the passive side- this handle was accepted in cma_listen callback
>  */
> int ib_cma_disconnect(struct ib_qp *qp, union ib_cma_id *cma_id);
>
>
> /**
>  * ib_cma_sid_listen - this function is called by the passive side. It is
>  * listening on the specified port (ib service id) for incoming
>  * connection requests
>  * @device: ? need to resolve this issue
>  * @service_id: service id (port) to listen on
>  * @context: user context to be returned in the callback
>  * @cm_listen_handler: the listen callback
>  * @cma_id: cma handle for the passive side
>  */
> int ib_cma_sid_listen(struct ib_device *device, __be64 service_id,
>                       void *context, ib_cma_listen_handler cm_listen_handler,
>                       union ib_cma_id *cma_id);
>
>

There could be an option to listen on a specific address as well.

When the current kDAPL interface is split up several state issues
need to be explicitly dealt with since they are no longer inherited.

For iWARP these include how TCP layer errors are handled during
connection setup (non-peer reject being the most accurate description
that attempts to be transport neutral).

It also needs to be clear that while Connection Requests do not have
to be dealt with synchronously or even in-order, they are perishable
goods that must be dealt with promptly. The exact deadlines are
transport dependent, but should not be of concern to a typical
application that responds to a connection request in anything that
could in good faith be considered a prompt response. On the IB side
that means some retries may be implemented transparently, and that
a consumer that wanted explicit control over retries would need to
use the current IB specific API.
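
The comment above that device selection needs the remote address *and* the QoS could be sketched roughly as follows. This is only an illustration under stated assumptions: `struct cma_dev_entry`, `cma_pick_device()` and the IPv4-in-host-byte-order table are hypothetical and not part of the quoted header; only the `ib_qos` bit values are taken from the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* QoS bits copied from the quoted proposal. */
enum ib_qos {
    IB_QOS_BEST_EFFORT     = 0,
    IB_QOS_HIGH_THROUGHPUT = (1 << 0),
    IB_QOS_LOW_LATENCY     = (1 << 1),
    IB_QOS_ECONOMY         = (1 << 2),
    IB_QOS_PREMIUM         = (1 << 3)
};

/* Hypothetical table entry: one candidate device and what it can reach. */
struct cma_dev_entry {
    uint32_t net;          /* IPv4 network, host byte order */
    uint32_t mask;         /* IPv4 netmask, host byte order */
    unsigned int qos_caps; /* bitmask of enum ib_qos values the device offers */
    int dev_index;         /* stand-in for a struct ib_device pointer */
};

/*
 * Return the index of the first device that both reaches dst and offers
 * every requested QoS bit, or -1 if none does. A request of
 * IB_QOS_BEST_EFFORT (0) is satisfied by any device that reaches dst.
 */
int cma_pick_device(const struct cma_dev_entry *tbl, size_t n,
                    uint32_t dst, unsigned int qos)
{
    assert(tbl != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if ((dst & tbl[i].mask) != tbl[i].net)
            continue;               /* device does not reach dst */
        if ((tbl[i].qos_caps & qos) != qos)
            continue;               /* device lacks a requested QoS bit */
        return tbl[i].dev_index;
    }
    return -1;
}
```

With two devices on the same destination network, an address-only lookup would always return the first one; adding the QoS argument is what lets the caller reach the second.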
From caitlin.bestler at gmail.com Wed Aug 24 11:32:11 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Wed, 24 Aug 2005 11:32:11 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <524q9fyzbn.fsf@cisco.com> References: <8E9D028761D8264D910612167E8457E8FA3B2A@mail2.ammasso.com> <524q9fyzbn.fsf@cisco.com> Message-ID: <469958e005082411327f61bd26@mail.gmail.com> NFS over RDMA does not do that. Shouldn't that be the end of discussion on abusing CM private data unless you are talking *solely* about IB private data. And if that is the discussion, should not such a strategy be proposed to IETF and/or IBTA for an NFSoRDMA for IB official mapping? The other end of the NFSoRDMA connection is not necessarily running OpenIB or even Linux and is not party to any of these discussions. > > My resistance is that ATS is just complexity without any benefit. It > doesn't provide additional security. It doesn't solve the > multi-homing problem we're talking about now. Once you've thrown away > information by turning your IP address into an IB GID, there's no > magic way ATS can recreate that information and be psychic about which > of the multi-homed IPs you actually meant. So why not just put the IP > addressing information into the CM private data, the way that the SDP > protocol already does? > > - R. 
> _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From ftillier at silverstorm.com Wed Aug 24 11:59:57 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 24 Aug 2005 11:59:57 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <469958e005082411147f1dfd03@mail.gmail.com> Message-ID: <000301c5a8de$14fc1cc0$6312000a@infiniconsys.com> > From: Caitlin Bestler [mailto:caitlin.bestler at gmail.com] > Sent: Wednesday, August 24, 2005 11:14 AM > > On 8/24/05, Fab Tillier wrote: > > > From: Roland Dreier [mailto:rolandd at cisco.com] > > > Sent: Wednesday, August 24, 2005 10:16 AM > > > > > > Fab> Knowledge of actual IP addresses would be up to the consumer. > > > Fab> However, the IB CM can facilitate checks by allowing the user > > > Fab> to specify an offset and length in the private data to match > > > Fab> to for incoming requests. > > > > > > This seems too complex and at the same time too limited to me. For > > > one thing -- although I think ATS should die -- this doesn't support > > > ATS reverse lookups. > > > > I think if all ULPs provide their source and destination IP in the private > > data, you can eliminate the reverse lookup altogether. A simple forward > > lookup is all that's needed to validate that the source GID in the REQ > > matches the reported source IP in the private data. The forward lookup > > could be done via ATS or via ARP, but the CM doesn't need to care which > > method is used. > > That is not an option. > > The applications are expecting source/destination network addresses > that come from a network layer, not from the peer application. IP has > no problem meeting this requirement. This is an IB problem that needs > to be solved within the scope of IB without changing any ULPs. 
If the app wants to use source/destination network addresses, there isn't a problem. The problem is the app wants to use IP addresses, which are *not* network addresses in IB. So the app needs to decide between one of two things - be aware of IB network addresses, or provide meaning to IP addresses over IB. The latter can't be done reliably under the covers - ATS reverse lookups won't tell you the IP the source actually used, and there's no way to do so without either using private data in the CM REQ or requiring a 1:1 mapping of IB:IP addresses. The 1:1 IB:IP mapping is not feasible, so the only way to know what IP address the application used is to embed that into the private data. I would expect protocols that try to use IP as their addressing would accommodate this in their IB usage, just like SDP accommodates it in the hello message. - Fab From ftillier at silverstorm.com Wed Aug 24 11:59:57 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 24 Aug 2005 11:59:57 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <52hddfyzzn.fsf@cisco.com> Message-ID: <000401c5a8de$2c32cce0$6312000a@infiniconsys.com> > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, August 24, 2005 11:03 AM > > Fab> Why can't the IPV field be ignored? If a listen wants only > Fab> IPV4 addresses, it would specify a 16-byte compare buffer > Fab> with the first 12 bytes zero, the next 4 filled with the IPV4 > Fab> address, and would set the offset to that of the hello > Fab> message's destination address (32). > > Yes, you're right for SDP. I guess if we're comfortable mandating > that all protocols put their source and destination IPs in the private > data for the IB case, then this works. Of course it's somewhat > awkward to pass this information into the transport-neutral CM API but > I think this can be worked around. I don't know if we need to mandate IP usage - it's up to the application. 
Any application that wants to have similar semantics to the way socket listens work (especially when bound to one of multiple IP addresses on a port) the application would have to define its private data to accommodate this. At the IB level, the contents of the private data are still opaque, even to the CM. The CM would only expose the ability to have it perform an initial triage of requests by doing binary comparisons over regions of private data. It doesn't know (or need to know) what the data represents - it only cares about finding a match (or not). The CM doesn't define any sort of policy here, and I don't think it should. It's just bytes to the CM, and it's doing a blind comparison without interpreting the contents. - Fab From ftillier at silverstorm.com Wed Aug 24 12:04:17 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 24 Aug 2005 12:04:17 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: Message-ID: <000601c5a8de$a4d0a370$6312000a@infiniconsys.com> > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Wednesday, August 24, 2005 11:18 AM > > For IB, using private data to listen on a specific IP address seems the > easiest thing to do. (Maybe we could do it by mapping different IP > addresses to different service IDs, requiring registration and lookup?) The problem with the SID method is that the SID namespace is smaller than the IPV6 address name space. There's no way to get every possible IPV6 address represented by a 64-bit SID. This further ignores the rules for SIDs in the IB specification. I think private data is the only way to do this properly. > If the CM abstraction layer expected those values to be returned in the > REP message, it could validate that the remote side it using the same > protocol to ensure some degree of backwards compatibility. > > I don't know if it makes more sense to push private data checks into the > actual CM or keep them in a CM abstraction layer. 
My guess is that the
> former may be the easier implementation.

I think putting the checks in the CM makes the most sense, though it should be
done in a generic fashion. A CM abstraction layer could then simply apply a
policy for private data usage - where in the private data it stores the IP
address information. Layering it this way allows the private data compare to be
used for things other than IP addresses. Add functionality without imposing
policy.

- Fab

From tom at ammasso.com Wed Aug 24 12:09:18 2005
From: tom at ammasso.com (Tom Tucker)
Date: Wed, 24 Aug 2005 15:09:18 -0400
Subject: [openib-general] RDMA connection and address translation API
Message-ID: <8E9D028761D8264D910612167E8457E8FA3B36@mail2.ammasso.com>

> -----Original Message-----
> From: Roland Dreier [mailto:rolandd at cisco.com]
> Sent: Wednesday, August 24, 2005 1:17 PM
> To: Tom Tucker
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] RDMA connection and address
> translation API
>
> Tom> Good point, although for iWARP it will work the way that you
> Tom> expect. For IB, admittedly it's more complex and would
> Tom> require ATS. There seems to be significant reluctance around
> Tom> ATS and I don't understand the issues. Can you provide a
> Tom> quick synopsis?
>
> My resistance is that ATS is just complexity without any benefit.

IMHO the benefit is that you have a transport independent addressing
mechanism -- albeit with some limitations as you've mentioned. In this
case, the vast majority of clients enjoy the benefit without suffering
the limitations.

> ... It
> doesn't provide additional security. It doesn't solve the
> multi-homing problem we're talking about now.

Whenever a single GID maps to multiple IP addresses, I agree, it is a
limitation. However, I don't believe that this is strictly necessary.

> ...
Once you've thrown away > information by turning your IP address into an IB GID, there's no > magic way ATS can recreate that information and be psychic about which > of the multi-homed IPs you actually meant. I agree, so don't do that. If you want it to work properly, then you need to map GIDS to IP addresses. > ... So why not just put the IP > addressing information into the CM private data, the way that the SDP > protocol already does? > > - R. > Because it would be better to configure your network "properly". Putting IP addresses in private data is fundamentally insecure since any user mode client can spoof the IP address. From sean.hefty at intel.com Wed Aug 24 12:11:38 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 24 Aug 2005 12:11:38 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B36@mail2.ammasso.com> Message-ID: >Because it would be better to configure your network "properly". Putting >IP addresses in private data is fundamentally insecure since any user >mode client can spoof the IP address. A simple forward lookup could detect this. - Sean From yaronh at voltaire.com Wed Aug 24 12:12:57 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Wed, 24 Aug 2005 22:12:57 +0300 Subject: [openib-general] RDMA connection and address translation API Message-ID: <35EA21F54A45CB47B879F21A91F4862F7141F3@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Fab Tillier > Sent: Wednesday, August 24, 2005 3:00 PM > To: 'Roland Dreier' > Cc: openib-general at openib.org > Subject: RE: [openib-general] RDMA connection and address translation API > > > From: Roland Dreier [mailto:rolandd at cisco.com] > > Sent: Wednesday, August 24, 2005 11:03 AM > > > > Fab> Why can't the IPV field be ignored? 
If a listen wants only
> > Fab> IPV4 addresses, it would specify a 16-byte compare buffer
> > Fab> with the first 12 bytes zero, the next 4 filled with the IPV4
> > Fab> address, and would set the offset to that of the hello
> > Fab> message's destination address (32).
> >
> > Yes, you're right for SDP. I guess if we're comfortable mandating
> > that all protocols put their source and destination IPs in the private
> > data for the IB case, then this works. Of course it's somewhat
> > awkward to pass this information into the transport-neutral CM API but
> > I think this can be worked around.
>
> I don't know if we need to mandate IP usage - it's up to the application.
> Any application that wants to have similar semantics to the way socket
> listens work (especially when bound to one of multiple IP addresses on a
> port) the application would have to define its private data to
> accommodate this.
>

The context of this discussion is around a common API for iWarp/IB ULPs.
In that case they all use IP addresses (since it's the common addressing).
If someone would use the IB specific API under this abstraction level, he
can provide whatever data he wants to the CM.

Anyway, providing src/dst IPs in the CM private data is simple, and we
can come up with an IBTA extension blessing that data structure as a
general way to map IP oriented protocols over IB (a 1-2 page draft at
the most).
This way it can also address Caitlin's concerns regarding NFS & IETF
(since now it's a transport specific issue).

Yaron

From jlentini at netapp.com Wed Aug 24 12:19:25 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 24 Aug 2005 15:19:25 -0400 (EDT)
Subject: [openib-general] RDMA connection and address translation API
In-Reply-To:
References:
Message-ID:

On Wed, 24 Aug 2005, Sean Hefty wrote:

> I guess that I'd like to clarify what the operation of a connect
> call would do. Would it be responsible for modifying the QP? If
> so, could such a call also allocate the QP?
Note that I'm not > advocating either of these, just trying to determine what the > behavior of the API would be. If the connect call succeeds in establishing a connection, the ULP's QP should be ready for posting work requests. This simplifies the ULP considerably. The API should not create the QP. That would create race conditions for certain protocols. For example, consider a protocol in which the first message was a send from the server to the client. To properly implement such a protocol, the client must post a receive work request before initiating a connection. From mst at mellanox.co.il Wed Aug 24 12:22:56 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 24 Aug 2005 22:22:56 +0300 Subject: [openib-general] Re: RDMA connection and address translation API In-Reply-To: <52hddfyzzn.fsf@cisco.com> References: <000201c5a8d3$f30b3150$6312000a@infiniconsys.com> <52hddfyzzn.fsf@cisco.com> Message-ID: <20050824192256.GI23518@mellanox.co.il> Quoting r. Roland Dreier : > Yes, you're right for SDP. I guess if we're comfortable mandating > that all protocols put their source and destination IPs in the private > data for the IB case, then this works. Of course it's somewhat > awkward to pass this information into the transport-neutral CM API but > I think this can be worked around. Makes total sense to me. Where's the difficulty? CM doesnt need to expose the private data to consumers, does it? > No, I think I confused myself. As long as the CM can get at the IP > information, it can figure out which consumer is which. Right. -- MST From sean.hefty at intel.com Wed Aug 24 12:57:19 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 24 Aug 2005 12:57:19 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: Message-ID: >If the connect call succeeds in establishing a connection, the ULP's >QP should be ready for posting work requests. This simplifies the ULP >considerably. > >The API should not create the QP. 
That would create race conditions >for certain protocols. For example, consider a protocol in which the >first message was a send from the server to the client. To properly >implement such a protocol, the client must post a receive work request >before initiating a connection. Thanks for the clarification. This is similar to what I was thinking as well. I guess we should note that in order to post receives to the QP, it at least needs to be in the INIT state. Would this be done by the CM abstraction or the user? For IB, the following fields need to be set when transitioning to INIT: enable RDMA, PKey index, and physical port. Is the idea that the user calls connect() and then receives a single callback indicating that the connection has been established? If so, then the user may need to modify the QP to the INIT state, which would require some knowledge already of the path. We would also need to be clear on whether the QP is expected to be in the INIT state before connect is called, or if it could be in any arbitrary state. The other alternative is to provide multiple callbacks during connection establishment. - Sean From rolandd at cisco.com Wed Aug 24 13:17:12 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 13:17:12 -0700 Subject: [openib-general] Re: RDMA connection and address translation API In-Reply-To: <20050824192256.GI23518@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 24 Aug 2005 22:22:56 +0300") References: <000201c5a8d3$f30b3150$6312000a@infiniconsys.com> <52hddfyzzn.fsf@cisco.com> <20050824192256.GI23518@mellanox.co.il> Message-ID: <52acj7xf7b.fsf@cisco.com> Michael> Makes total sense to me. Where's the difficulty? CM Michael> doesnt need to expose the private data to consumers, does Michael> it? I think it does, because protocols want to use the private data to exchange other data during connection establishment. - R. 
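
For reference, the offset-and-compare listen filter discussed earlier in this thread (a 16-byte compare buffer, first 12 bytes zero, last 4 holding the IPv4 address, matched at offset 32 of the private data) might look something like the sketch below. The names `struct pdata_filter`, `pdata_filter_ipv4()` and `pdata_filter_match()` are made up for illustration; no such interface exists in the CM today. The CM side stays a blind byte comparison, as proposed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical listen-side filter: compare len bytes at offset. */
struct pdata_filter {
    size_t offset;     /* where in the private data to start comparing */
    size_t len;        /* number of bytes to compare */
    uint8_t data[16];  /* expected bytes */
};

/* Build the IPv4 filter described in the thread: 12 zero bytes + 4 address bytes. */
void pdata_filter_ipv4(struct pdata_filter *f, size_t offset,
                       const uint8_t ipv4[4])
{
    f->offset = offset;
    f->len = sizeof(f->data);
    memset(f->data, 0, 12);
    memcpy(f->data + 12, ipv4, 4);
}

/*
 * Return 1 if the request's private data matches the filter, 0 otherwise.
 * The contents are never interpreted; requests too short to contain the
 * compared region simply do not match.
 */
int pdata_filter_match(const struct pdata_filter *f,
                       const uint8_t *pdata, size_t pdata_len)
{
    assert(f != NULL && pdata != NULL);
    assert(f->len <= sizeof(f->data));

    if (f->offset > pdata_len || f->len > pdata_len - f->offset)
        return 0;
    return memcmp(pdata + f->offset, f->data, f->len) == 0;
}
```

Because the filter is just offset, length and bytes, the same mechanism would work for matching something other than an IP address, which is the "functionality without policy" point made above.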
From rolandd at cisco.com Wed Aug 24 13:22:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 13:22:05 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: (Sean Hefty's message of "Wed, 24 Aug 2005 12:57:19 -0700") References: Message-ID: <5264tvxez6.fsf@cisco.com> Sean> Is the idea that the user calls connect() and then receives Sean> a single callback indicating that the connection has been Sean> established? If so, then the user may need to modify the QP Sean> to the INIT state, which would require some knowledge Sean> already of the path. We would also need to be clear on Sean> whether the QP is expected to be in the INIT state before Sean> connect is called, or if it could be in any arbitrary state. Sean> The other alternative is to provide multiple callbacks Sean> during connection establishment. To me it makes sense for the generic CM API to be defined so that an IB QP must be in the INIT state before being passed to connect(). - R. From jlentini at netapp.com Wed Aug 24 13:23:03 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 24 Aug 2005 16:23:03 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <469958e005082411147f1dfd03@mail.gmail.com> References: <52pss3z26a.fsf@cisco.com> <000201c5a8d3$f30b3150$6312000a@infiniconsys.com> <469958e005082411147f1dfd03@mail.gmail.com> Message-ID: On Wed, 24 Aug 2005, Caitlin Bestler wrote: > On 8/24/05, Fab Tillier wrote: > > > > I think if all ULPs provide their source and destination IP in the > > private data, you can eliminate the reverse lookup altogether. A > > simple forward lookup is all that's needed to validate that the > > source GID in the REQ matches the reported source IP in the > > private data. The forward lookup could be done via ATS or via > > ARP, but the CM doesn't need to care which method is used. > > That is not an option. 
>
> The applications are expecting source/destination network addresses
> that come from a network layer, not from the peer application. IP
> has no problem meeting this requirement. This is an IB problem that
> needs to be solved within the scope of IB without changing any ULPs.

I agree with Caitlin. The eventual solution cannot force protocol
modifications in ULPs.

From rolandd at cisco.com Wed Aug 24 13:27:16 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 24 Aug 2005 13:27:16 -0700
Subject: [openib-general] RDMA connection and address translation API
In-Reply-To: (James Lentini's message of "Wed, 24 Aug 2005 16:23:03 -0400 (EDT)")
References: <52pss3z26a.fsf@cisco.com> <000201c5a8d3$f30b3150$6312000a@infiniconsys.com> <469958e005082411147f1dfd03@mail.gmail.com>
Message-ID: <521x4jxeqj.fsf@cisco.com>

    James> I agree with Caitlin. The eventual solution cannot force
    James> protocol modifications in ULPs.

Does this mean we're stuck with the current use of ATS in NFS-RDMA?
Surely there's still time to fix the protocol.

 - R.

From tom at ammasso.com Wed Aug 24 13:32:42 2005
From: tom at ammasso.com (Tom Tucker)
Date: Wed, 24 Aug 2005 16:32:42 -0400
Subject: [openib-general] RDMA connection and address translation API
Message-ID: <8E9D028761D8264D910612167E8457E8FA3B42@mail2.ammasso.com>

Isn't this inevitable regardless of whether or not we have a transport
independent connection API? I thought ATS was required by NFS for
authentication/authorization. Sorry in advance if I'm confused ---
again.

> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier
> Sent: Wednesday, August 24, 2005 3:27 PM
> To: James Lentini
> Cc: Caitlin Bestler; openib-general at openib.org
> Subject: Re: [openib-general] RDMA connection and address
> translation API
>
> James> I agree with Caitlin. The eventual solution cannot force
> James> protocol modifications in ULPs.
>
> Does this mean we're stuck with the current use of ATS in NFS-RDMA?
> Surely there's still time to fix the protocol.
>
> - R.
>

From rolandd at cisco.com Wed Aug 24 13:44:21 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 24 Aug 2005 13:44:21 -0700
Subject: [openib-general] RDMA connection and address translation API
In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B42@mail2.ammasso.com> (Tom Tucker's message of "Wed, 24 Aug 2005 16:32:42 -0400")
References: <8E9D028761D8264D910612167E8457E8FA3B42@mail2.ammasso.com>
Message-ID: <52r7cjvzdm.fsf@cisco.com>

    Tom> Isn't this inevitable regardless of whether or not we have a
    Tom> transport independent connection API. I thought ATS was
    Tom> required by NFS for authentication/authorization. Sorry in
    Tom> advance if I'm confused --- again.

Current NFS-RDMA code uses and relies on ATS. However I hope that we
can fix the NFS-RDMA draft to get rid of this.

 - R.

From tom at ammasso.com Wed Aug 24 13:47:44 2005
From: tom at ammasso.com (Tom Tucker)
Date: Wed, 24 Aug 2005 16:47:44 -0400
Subject: [openib-general] RDMA connection and address translation API
Message-ID: <8E9D028761D8264D910612167E8457E8FA3B44@mail2.ammasso.com>

So the listening server takes the IP address from the private data, uses
AT to get the GID and then compares it to the GID in the connect
request?

It feels to me like this private data thing is a case of the cure is
worse than the disease.
As I understand it, we're trying to avoid the following:

server:
    dev = ib_get_device(10.10.1.1 /* src ip */, 0 /* dest ip */);
    /* GID has IP addresses 10.10.1.1, 10.10.1.2 */
    ib_listen(dev, 10.10.1.1 /* listen bind address */, 143 /* port */, 10 /* backlog */);

client:
    dev = ib_get_device(0 /* src wildcard */, 10.10.1.2 /* dest ip */);
    ib_connect(dev, 0 /* src */, 10.10.1.2 /* dest */, 143 /* port */, ...);

The issue is that this connection will be established when the server
may only want to accept requests that are targeted to the 10.10.1.1
address. I don't get why this is such a big deal. You can preclude this
behavior by simply keeping a one to one mapping between the IPv4
addresses and the GIDs using the existing protocols and without
mandating a private data format across *all* ulps and transports.

If I'm being painfully stupid...please feel free to tell me.

> -----Original Message-----
> From: Sean Hefty [mailto:sean.hefty at intel.com]
> Sent: Wednesday, August 24, 2005 2:12 PM
> To: Tom Tucker; Roland Dreier
> Cc: openib-general at openib.org
> Subject: RE: [openib-general] RDMA connection and address
> translation API
>
> >Because it would be better to configure your network "properly".
> >Putting IP addresses in private data is fundamentally insecure since
> >any user mode client can spoof the IP address.
>
> A simple forward lookup could detect this.
>
> - Sean
>
>

From jlentini at netapp.com Wed Aug 24 13:57:41 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 24 Aug 2005 16:57:41 -0400 (EDT)
Subject: [openib-general] RDMA connection and address translation API
In-Reply-To: <000401c5a8de$2c32cce0$6312000a@infiniconsys.com>
References: <000401c5a8de$2c32cce0$6312000a@infiniconsys.com>
Message-ID:

On Wed, 24 Aug 2005, Fab Tillier wrote:

> > From: Roland Dreier [mailto:rolandd at cisco.com]
> > Sent: Wednesday, August 24, 2005 11:03 AM
> >
> > Fab> Why can't the IPV field be ignored?
If a listen wants only
> > Fab> IPV4 addresses, it would specify a 16-byte compare buffer
> > Fab> with the first 12 bytes zero, the next 4 filled with the IPV4
> > Fab> address, and would set the offset to that of the hello
> > Fab> message's destination address (32).
> >
> > Yes, you're right for SDP. I guess if we're comfortable mandating
> > that all protocols put their source and destination IPs in the private
> > data for the IB case, then this works. Of course it's somewhat
> > awkward to pass this information into the transport-neutral CM API but
> > I think this can be worked around.
>
> I don't know if we need to mandate IP usage - it's up to the
> application. Any application that wants to have similar semantics
> to the way socket listens work (especially when bound to one of
> multiple IP addresses on a port) the application would have to
> define its private data to accommodate this.
>
> At the IB level, the contents of the private data are still opaque,
> even to the CM. The CM would only expose the ability to have it
> perform an initial triage of requests by doing binary comparisons
> over regions of private data. It doesn't know (or need to know)
> what the data represents - it only cares about finding a match (or
> not). The CM doesn't define any sort of policy here, and I don't
> think it should. It's just bytes to the CM, and it's doing a blind
> comparison without interpreting the contents.

You need to consider what makes sense for *both* ib and iwarp. Keep in
mind that the correct API will allow a consumer to use ib and iwarp
devices transparently. In other words there will be one code path that
supports both.

If we were to adopt your proposal, the consumer would need to perform
unnecessary operations on iWARP. A transport neutral client would be
forced to put IP information into its CM private data on iWARP.
Likewise, a transport neutral server would be forced to pass a
private data offset and binary blob to the listen API call on iWARP.
Neither of these make sense. These API problems are secondary to the burden you would be placing on the protocols. As has been mentioned in a previous email, extending the current protocols to use this convention will require further standardization and in some cases may not be compatible with their current architecture. From hch at lst.de Wed Aug 24 13:57:46 2005 From: hch at lst.de (Christoph Hellwig) Date: Wed, 24 Aug 2005 22:57:46 +0200 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <524q9f1et9.fsf@cisco.com> References: <8E9D028761D8264D910612167E8457E8FA3B0A@mail2.ammasso.com> <524q9f1et9.fsf@cisco.com> Message-ID: <20050824205746.GA24447@lst.de> On Wed, Aug 24, 2005 at 09:26:42AM -0700, Roland Dreier wrote: > Tom> I think I understand, but the purpose of specifying the IP > Tom> address in the listen is not to filter incoming connect > Tom> requests, but rather to determine which devices I listen > Tom> on. I think this works for the IB case as well. So the > Tom> utility of the IP address specified in the listen is only to > Tom> determine which devices the sid is created on. Does this make > Tom> sense or am I missing something? > > Well, that's not what I would expect. Suppose I have a device > configured with local addresses 192.168.11.12 and 192.168.98.99 and I You never configure a device with local addresses. IP addresses are always a per-host attribute in Linux. 
From hch at lst.de Wed Aug 24 14:00:12 2005
From: hch at lst.de (Christoph Hellwig)
Date: Wed, 24 Aug 2005 23:00:12 +0200
Subject: [openib-general] RDMA connection and address translation API
In-Reply-To: <469958e005082411147f1dfd03@mail.gmail.com>
References: <52pss3z26a.fsf@cisco.com> <000201c5a8d3$f30b3150$6312000a@infiniconsys.com> <469958e005082411147f1dfd03@mail.gmail.com>
Message-ID: <20050824210012.GB24447@lst.de>

On Wed, Aug 24, 2005 at 11:14:08AM -0700, Caitlin Bestler wrote:
> The consensus when this issue was debated in the DAT Collaborative was
> that there was no transport neutral way to specify a set of addresses to listen
> on other than "all addresses supported by this device".

That doesn't make any sense at all for iWarp as that uses IP addressing
which in Linux is host-, not device-based.

From rolandd at cisco.com Wed Aug 24 14:03:26 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 24 Aug 2005 14:03:26 -0700
Subject: [openib-general] RDMA connection and address translation API
In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B44@mail2.ammasso.com> (Tom Tucker's message of "Wed, 24 Aug 2005 16:47:44 -0400")
References: <8E9D028761D8264D910612167E8457E8FA3B44@mail2.ammasso.com>
Message-ID: <52irxvvyht.fsf@cisco.com>

    Tom> The issue is that this connection will be established when
    Tom> the server may only want to accept requests that are
    Tom> targeted to the 10.10.1.1 address. I don't get why this is
    Tom> such a big deal. You can preclude this behavior by simply
    Tom> keeping a one to one mapping between the IPv4 addresses and
    Tom> the GIDs using the existing protocols and without mandating a
    Tom> private data format across *all* ulps and transports.

Well, a few problems with what you say:

 - ATS does not help at all with the case of a multi-homed interface.
   Unless the remote system puts the IP it's trying to connect to
   somewhere in the connection request, there is no way to be psychic
   and recover this information.
- Mandating ATS use is dictating protocol design just as much as requiring the CM private data to carry source and destination IP addresses. - It's not just preventing connections to the wrong local address. NFS-RDMA wants the remote source address (ie getpeername()) so that it can look it up in the exports list. - Saying that a given GID may only have a single IP address is definitely a case of the cure being worse than the disease. I don't think we can forbid perfectly valid multi-homed configurations just because it's inconvenient for us to support them. By the way, as far as I can tell, there is NO formal documentation of the NFS-RDMA wire protocol. The current draft (draft-ietf-nfsv4-rpcrdma-01.txt) simply says: This protocol is designed to function with equivalent semantics over all appropriate RDMA transports. In its abstract form, this protocol does not implement RDMA directly. [...] It therefore becomes a useful, implementable standard when mapped onto a specific RDMA transport, such as iWARP [RDDP] or Infiniband [IB]. [...] In setting up a new RDMA connection, the first action by an RPC client will be to obtain a transport address for the server. The mechanism used to obtain this address, and to open an RDMA connection is dependent on the type of RDMA transport, and outside the scope of this protocol. So it seems perfectly reasonable and acceptable for the mapping of NFS-RDMA onto IB to specify that the source and destination IP addresses for an IB connection are placed in the CM private data. This seems much easier than trying to turn ATS into an IETF standard. - R. 
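A minimal sketch of what "source and destination IP addresses in the CM private data" could look like in practice. Everything named here is an assumption made for illustration: the struct layout, the names ip_priv_hdr, ip_priv_pack and ip_priv_peer4, and the choice to store an IPv4 address in the last 4 bytes of a 16-byte field. It is not the SDP hello format and not something any IBTA or IETF document defines.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header at the start of the CM REQ private data.
 * All members are bytes, so the struct has no padding (36 bytes). */
struct ip_priv_hdr {
	uint8_t ip_version;   /* 4 or 6 */
	uint8_t reserved[3];
	uint8_t src_ip[16];   /* address the active side connects from */
	uint8_t dst_ip[16];   /* address the active side connects to */
};

/* Store a 4-byte IPv4 address (network order) in a 16-byte field:
 * 12 zero bytes followed by the address (assumed encoding). */
static void ip4_to_field(const uint8_t ip4[4], uint8_t field[16])
{
	memset(field, 0, 12);
	memcpy(field + 12, ip4, 4);
}

/* Active side: write the header into buf.
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t ip_priv_pack(uint8_t *buf, size_t len,
		    const uint8_t src4[4], const uint8_t dst4[4])
{
	struct ip_priv_hdr hdr;

	assert(buf != NULL);
	if (len < sizeof(hdr))
		return 0;
	memset(&hdr, 0, sizeof(hdr));
	hdr.ip_version = 4;
	ip4_to_field(src4, hdr.src_ip);
	ip4_to_field(dst4, hdr.dst_ip);
	memcpy(buf, &hdr, sizeof(hdr));
	return sizeof(hdr);
}

/* Passive side: recover the peer's claimed IPv4 source address, the
 * value a getpeername()-style call would hand to the ULP.
 * Returns 0 on success, -1 if the data is too short or not IPv4. */
int ip_priv_peer4(const uint8_t *buf, size_t len, uint8_t peer4[4])
{
	struct ip_priv_hdr hdr;

	assert(buf != NULL);
	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.ip_version != 4)
		return -1;
	memcpy(peer4, hdr.src_ip + 12, 4);
	return 0;
}
```

With a layout like this, an NFS-RDMA server could look up the recovered source address in its exports list. The address is still only what the peer claims, so the receiver would have to validate it separately.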
From rolandd at cisco.com Wed Aug 24 14:15:09 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 14:15:09 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <20050824205746.GA24447@lst.de> (Christoph Hellwig's message of "Wed, 24 Aug 2005 22:57:46 +0200") References: <8E9D028761D8264D910612167E8457E8FA3B0A@mail2.ammasso.com> <524q9f1et9.fsf@cisco.com> <20050824205746.GA24447@lst.de> Message-ID: <52ek8jvxya.fsf@cisco.com> Roland> Well, that's not what I would expect. Suppose I have a Roland> device configured with local addresses 192.168.11.12 and Roland> 192.168.98.99 and I Christoph> You never configure a device with local addresses. IP Christoph> addresses are always a per-host attribute in Linux. I don't think this is really true. In some ways Linux behaves as if IP addresses are per-host (eg ARP responses can go out any interface) but really IP addresses are attached to an interface. Every struct net_device has a struct in_device, and every struct in_device has a list of struct in_ifaddrs for the device's IP addresses. - R. 
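The per-interface attachment described above can also be observed from user space. The sketch below is not the kernel's net_device/in_device walk; it uses the standard getifaddrs() call, which returns every address tagged with the name of the interface it is configured on. The function name count_ipv4_addrs is made up for this example.

```c
#include <assert.h>
#include <ifaddrs.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Count the IPv4 addresses attached to one interface, or to all
 * interfaces when ifname is NULL. Returns -1 if getifaddrs() fails. */
int count_ipv4_addrs(const char *ifname)
{
	struct ifaddrs *list, *ifa;
	int count = 0;

	if (getifaddrs(&list) != 0)
		return -1;

	for (ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
		if (ifa->ifa_addr == NULL)
			continue;	/* entry carries no address */
		if (ifa->ifa_addr->sa_family != AF_INET)
			continue;	/* not an IPv4 address */
		if (ifname != NULL && strcmp(ifa->ifa_name, ifname) != 0)
			continue;	/* belongs to another interface */
		count++;
	}

	freeifaddrs(list);
	return count;
}
```

Calling count_ipv4_addrs("ib0") on a host with two addresses configured on ib0 would be expected to return 2; the interface name is only an example.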
From caitlinb at broadcom.com Wed Aug 24 14:16:59 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 24 Aug 2005 14:16:59 -0700 Subject: [openib-general] Re: RDMA connection and address translation API Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F508@NT-SJCA-0751.brcm.ad.broadcom.com> Replying to the digest format, so lots all at once ---------------------------------------------------------------------- Message: 1 Date: Wed, 24 Aug 2005 11:59:57 -0700 From: "Fab Tillier" Subject: RE: [openib-general] RDMA connection and address translation API To: "'Caitlin Bestler'" Cc: openib-general at openib.org Message-ID: <000301c5a8de$14fc1cc0$6312000a at infiniconsys.com> Content-Type: text/plain; charset="us-ascii" > From: Caitlin Bestler [mailto:caitlin.bestler at gmail.com] > Sent: Wednesday, August 24, 2005 11:14 AM > > On 8/24/05, Fab Tillier wrote: > > > From: Roland Dreier [mailto:rolandd at cisco.com] > > > Sent: Wednesday, August 24, 2005 10:16 AM > > > > > > Fab> Knowledge of actual IP addresses would be up to the consumer. > > > Fab> However, the IB CM can facilitate checks by allowing the user > > > Fab> to specify an offset and length in the private data to match > > > Fab> to for incoming requests. > > > > > > This seems too complex and at the same time too limited to me. > > > For one thing -- although I think ATS should die -- this doesn't > > > support ATS reverse lookups. > > > > I think if all ULPs provide their source and destination IP in the > > private data, you can eliminate the reverse lookup altogether. A > > simple forward lookup is all that's needed to validate that the > > source GID in the REQ matches the reported source IP in the private > > data. The forward lookup could be done via ATS or via ARP, but the > > CM doesn't need to care which method is used. > > That is not an option. > > The applications are expecting source/destination network addresses > that come from a network layer, not from the peer application. 
IP has > no problem meeting this requirement. This is an IB problem that needs > to be solved within the scope of IB without changing any ULPs. If the app wants to use source/destination network addresses, there isn't a problem. The problem is the app wants to use IP addresses, which are *not* network addresses in IB. So the app needs to choose one of two things - be aware of IB network addresses, or provide meaning to IP addresses over IB. The latter can't be done reliably under the covers - ATS reverse lookups won't tell you the IP the source actually used, and there's no way to do so without either using private data in the CM REQ or requiring a 1:1 mapping of IB:IP addresses. The 1:1 IB:IP mapping is not feasible, so the only way to know what IP address the application used is to embed that into the private data. I would expect protocols that try to use IP as their addressing would accommodate this in their IB usage, just like SDP accommodates it in the hello message. - Fab The question under debate is precisely how to define an API for transport neutral middleware and kernel applications. No one has proposed elimination or deprecation of the IB CM optimized IB-specific API. Any definition of a transport neutral "network address" is going to conform to the semantics of an IPv6 address. IP networks will never evolve past the legacy of the current definition of an IP address. In fact it could be argued that IPoIB shows that IB networks won't either. There is no need for a 1:1 IB:IP mapping. All that is required is:
a) The network address be mappable to a host name.
b) That the host name be mappable back to the network address (even if the latter is a list).
c) A server can determine its local address and advertise it by application specific means.
d) A server can determine the network address of a remote peer requesting service, so that it
   1) can use it to identify the remote host (see a).
   2) will be able to use it to send packets back to the peer in a reliable connection.
Can you show me anything in the IB spec that prevents direct translation of specific IPV6 subnets to IB networks? Subdivision of the 128-bit network address space is a requirement for supporting multiple transports on the same host anyway. There is no reason the same technique cannot be used within a single IB device to distinguish between addresses that don't need translation (network address is GID) and those that do (when IPV4 addresses and/or multiple IPV6 addresses are desired). ------------------------------ Message: 2 Date: Wed, 24 Aug 2005 11:59:57 -0700 From: "Fab Tillier" Subject: RE: [openib-general] RDMA connection and address translation API To: "'Roland Dreier'" Cc: openib-general at openib.org Message-ID: <000401c5a8de$2c32cce0$6312000a at infiniconsys.com> Content-Type: text/plain; charset="us-ascii" > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, August 24, 2005 11:03 AM > > Fab> Why can't the IPV field be ignored? If a listen wants only > Fab> IPV4 addresses, it would specify a 16-byte compare buffer > Fab> with the first 12 bytes zero, the next 4 filled with the IPV4 > Fab> address, and would set the offset to that of the hello > Fab> message's destination address (32). > > Yes, you're right for SDP. I guess if we're comfortable mandating > that all protocols put their source and destination IPs in the private > data for the IB case, then this works. Of course it's somewhat > awkward to pass this information into the transport-neutral CM API but > I think this can be worked around. I don't know if we need to mandate IP usage - it's up to the application. Any application that wants semantics similar to the way socket listens work (especially when bound to one of multiple IP addresses on a port) would have to define its private data to accommodate this.
At the IB level, the contents of the private data are still opaque, even to the CM. The CM would only expose the ability to have it perform an initial triage of requests by doing binary comparisons over regions of private data. It doesn't know (or need to know) what the data represents - it only cares about finding a match (or not). The CM doesn't define any sort of policy here, and I don't think it should. It's just bytes to the CM, and it's doing a blind comparison without interpreting the contents. - Fab If the application wants IB specific semantics it can use an IB specific API. That is probably needed for options like CM redirection anyway. ------------------------------ Message: 4 Date: Wed, 24 Aug 2005 15:09:18 -0400 From: "Tom Tucker" Subject: RE: [openib-general] RDMA connection and address translation API To: "Roland Dreier" Cc: openib-general at openib.org Message-ID: <8E9D028761D8264D910612167E8457E8FA3B36 at mail2.ammasso.com> Content-Type: text/plain; charset="US-ASCII" > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, August 24, 2005 1:17 PM > To: Tom Tucker > Cc: openib-general at openib.org > Subject: Re: [openib-general] RDMA connection and address translation > API > > Tom> Good point, although for iWARP it will work the way that you > Tom> expect. For IB, admittedly it's more complex and would > Tom> require ATS. There seems to be significant reluctance around > Tom> ATS and I don't understand the issues. Can you provide a > Tom> quick synopsis? > > My resistance is that ATS is just complexity without any benefit. IMHO the benefit is that you have a transport independent addressing mechanism -- albeit with some limitations as you've mentioned. In this case, the vast majority of clients enjoy the benefit without suffering the limitations. Exactly. We are trying to define an *additional* API that has transport independent addressing semantics.
There is already an API for transports that have IB semantics. We don't need a second one. > ... It > doesn't provide additional security. It doesn't solve the > multi-homing problem we're talking about now. Whenever a single GID maps to multiple IP addresses, I agree, it is a limitation. However, I don't believe that this is strictly necessary. > ... Once you've thrown away > information by turning your IP address into an IB GID, there's no > magic way ATS can recreate that information and be psychic about which > of the multi-homed IPs you actually meant. I agree, so don't do that. If you want it to work properly, then you need to map GIDS to IP addresses. But you don't need to know which of the multi-homed IPs you actually meant. You just need one that translates back to the same remote entity. The exact same problem already exists in IP networks because of PNAT. > ... So why not just put the IP > addressing information into the CM private data, the way that the SDP > protocol already does? > > - R. > Because it would be better to configure your network "properly". Putting IP addresses in private data is fundamentally insecure since any user mode client can spoof the IP address. The existing contract with the ULP (especially NFS) is that the network layer identifies the remote peer and that said identification is coming from the network layer not from the remote application layer. It's up to IB to decide how to meet those semantics. iWARP has no problem with them. ------------------------------ Message: 5 Date: Wed, 24 Aug 2005 12:11:38 -0700 From: "Sean Hefty" Subject: RE: [openib-general] RDMA connection and address translation API To: "'Tom Tucker'" , "Roland Dreier" Cc: openib-general at openib.org Message-ID: Content-Type: text/plain; charset="us-ascii" >Because it would be better to configure your network "properly". >Putting IP addresses in private data is fundamentally insecure since >any user mode client can spoof the IP address. 
A simple forward lookup could detect this. - Sean A simple forward lookup by whom? Again, the point is that identification of the remote peer provided to the Consumer is supposed to be already validated. ------------------------------ Message: 6 Date: Wed, 24 Aug 2005 22:12:57 +0300 From: "Yaron Haviv" Subject: RE: [openib-general] RDMA connection and address translation API To: "Fab Tillier" , "Roland Dreier" Cc: openib-general at openib.org Message-ID: <35EA21F54A45CB47B879F21A91F4862F7141F3 at taurus.voltaire.com> Content-Type: text/plain; charset="us-ascii" > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Fab Tillier > Sent: Wednesday, August 24, 2005 3:00 PM > To: 'Roland Dreier' > Cc: openib-general at openib.org > Subject: RE: [openib-general] RDMA connection and address translation API > > > From: Roland Dreier [mailto:rolandd at cisco.com] > > Sent: Wednesday, August 24, 2005 11:03 AM > > > > Fab> Why can't the IPV field be ignored? If a listen wants only > > Fab> IPV4 addresses, it would specify a 16-byte compare buffer > > Fab> with the first 12 bytes zero, the next 4 filled with the IPV4 > > Fab> address, and would set the offset to that of the hello > > Fab> message's destination address (32). > > > > Yes, you're right for SDP. I guess if we're comfortable mandating > > that all protocols put their source and destination IPs in the private > > data for the IB case, then this works. Of course it's somewhat > > awkward to pass this information into the transport-neutral CM API but > > I think this can be worked around. > > I don't know if we need to mandate IP usage - it's up to the application. > Any > application that wants to have similar semantics to the way socket listens > work > (especially when bound to one of multiple IP addresses on a port) the > application would have to define its private data to accommodate this. 
> The context of this discussion is a common API for iWarp/IB ULPs. In that case they all use IP addresses (since it's the common addressing). If someone uses the IB specific API under this abstraction level, he can provide whatever data he wants to the CM. Anyway, providing src/dst IPs in the CM Private data is simple, and we can come up with an IBTA extension blessing that data structure as a general way to map IP oriented protocols over IB (a 1-2 page draft at the most). This way it can also address Caitlin's concerns regarding NFS & IETF (since now it's a transport specific issue). Yaron Correct, an IBTA and/or IETF sanctioned standard use of CM Private Data that supplied IP Addresses in the Private Data (especially when the data came from the stack rather than the Consumer) would be perfectly fine. It would be interoperable (not OpenIB dependent), and it would provide the required API semantics (the network address supplied would be coming from the network layer). No changes to the ULP would be required. From jlentini at netapp.com Wed Aug 24 14:22:06 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 24 Aug 2005 17:22:06 -0400 (EDT) Subject: [openib-general] RE: Connection and Address Translation In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1F503@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1F503@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: On Wed, 24 Aug 2005, Caitlin Bestler wrote: > As long as the "unspecified" local address remains a valid option > the proposed API split is definitely implementable under iWARP. But > it is obviously simpler to just keep the same semantics as the > exported kDAPL interface. Is there a definite benefit to making the > change, if so what and for which middleware/application? What is "the change" you are referring to?
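The listen-side triage discussed in this digest, a blind byte comparison over a region of the private data with the CM treating the contents as opaque, can be sketched roughly as follows. The struct pdata_filter and both function names are hypothetical and are not part of any existing CM interface. The 16-byte pattern (12 zero bytes followed by the IPv4 address) follows the description quoted from Fab above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical listen filter: compare `len` bytes of the incoming
 * private data, starting at `offset`, against `pattern`. */
struct pdata_filter {
	size_t offset;
	size_t len;
	uint8_t pattern[16];
};

/* Build a filter that accepts one IPv4 destination address:
 * first 12 bytes zero, last 4 bytes the address (network order). */
void pdata_filter_ipv4(struct pdata_filter *f, size_t offset,
		       const uint8_t ip4[4])
{
	f->offset = offset;
	f->len = sizeof(f->pattern);
	memset(f->pattern, 0, 12);
	memcpy(f->pattern + 12, ip4, 4);
}

/* Returns 1 if the private data matches the filter, 0 otherwise.
 * The data is never interpreted; it is only compared byte for byte. */
int pdata_filter_match(const struct pdata_filter *f,
		       const uint8_t *pdata, size_t pdata_len)
{
	assert(f->len <= sizeof(f->pattern));
	/* Reject requests too short to contain the compared region. */
	if (f->offset > pdata_len || f->len > pdata_len - f->offset)
		return 0;
	return memcmp(pdata + f->offset, f->pattern, f->len) == 0;
}
```

A listener bound to 10.10.1.1 would build its filter with the offset of the destination-address field in its ULP's private data; the thread quotes 32 for the SDP hello message, which is taken on trust here.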
From caitlinb at broadcom.com Wed Aug 24 14:22:31 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 24 Aug 2005 14:22:31 -0700 Subject: [openib-general] RDMA connection and address translation API Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F509@NT-SJCA-0751.brcm.ad.broadcom.com> Not if the host connects two disjoint networks and does not route between them. Such a host should/may be configured to reject any packet that arrives with a destination address that does not match the expected destination address for the port it arrives upon. One of the things that iWARP vendors strive for is to ensure that all such existing filtering/safety rules on accepting connections are left 100% intact. -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Christoph Hellwig Sent: Wednesday, August 24, 2005 2:00 PM To: Caitlin Bestler Cc: openib-general at openib.org Subject: Re: [openib-general] RDMA connection and address translation API On Wed, Aug 24, 2005 at 11:14:08AM -0700, Caitlin Bestler wrote: > The consensus when this issue was debated in the DAT Collaborative was > that there was no transport neutral way to specify a set of addresses > to listen on other than "all addresses supported by this device". That doesn't make any sense at all for iWarp as that uses IP addressing which in Linux is host-, not device-based.
From rolandd at cisco.com Wed Aug 24 14:23:43 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 14:23:43 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: (James Lentini's message of "Wed, 24 Aug 2005 16:57:41 -0400 (EDT)") References: <000401c5a8de$2c32cce0$6312000a@infiniconsys.com> Message-ID: <52acj7vxk0.fsf@cisco.com> James> You need to consider what makes sense for *both* ib and James> iwarp. Keep in mind that the correct API will allow a James> consumer to use ib and iwarp devices transparently. In James> other words there will be one code path that supports both. James> If we were to adopt your proposal, the consumer would need James> to perform unnecessary operations on iWARP. No, I think we just need to realize that a perfectly transport neutral protocol implementation is not achievable. It's unfortunate that kDAPL fooled people by hiding the details of the wire protocol under a supposedly "neutral API," but the fact is that mapping an abstract RDMA transport to a real implementation will always involve arbitrary transport-dependent choices. To use an analogy, the IP layer is mostly insulated from the details of the L2 transport it's using by the net_device abstraction. However, there are a few things that require code like:

    int arp_mc_map(u32 addr, u8 *haddr, struct net_device *dev, int dir)
    {
            switch (dev->type) {
            case ARPHRD_ETHER:
            case ARPHRD_FDDI:
            case ARPHRD_IEEE802:
                    ip_eth_mc_map(addr, haddr);
                    return 0;
            case ARPHRD_IEEE802_TR:
                    ip_tr_mc_map(addr, haddr);
                    return 0;
            case ARPHRD_INFINIBAND:
                    ip_ib_mc_map(addr, haddr);
                    return 0;
            default:
                    if (dir) {
                            memcpy(haddr, dev->broadcast, dev->addr_len);
                            return 0;
                    }
            }
            return -EINVAL;
    }

 - R.
From tom at ammasso.com Wed Aug 24 14:25:18 2005 From: tom at ammasso.com (Tom Tucker) Date: Wed, 24 Aug 2005 17:25:18 -0400 Subject: [openib-general] RDMA connection and address translation API Message-ID: <8E9D028761D8264D910612167E8457E8FA3B48@mail2.ammasso.com> > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, August 24, 2005 4:03 PM > To: Tom Tucker > Cc: Sean Hefty; Roland Dreier; openib-general at openib.org > Subject: Re: [openib-general] RDMA connection and address > translation API > > Tom> The issue is that this connection will be established when > Tom> the server may only want to accept requests that are > Tom> targeted to the 10.10.1.1 address. I don't get why this is > Tom> such a big deal. You can preclude this behavior by simply > Tom> keeping a one to one mapping between the IPv4 addresses and > Tom> the GIDs using the existing protocols and without mandating a > Tom> private data format across *all* ulps and transports. > > Well, a few problems with what you say: > > - ATS does not help at all with the case of a multi-homed interface. > Unless the remote system puts the IP it's trying to connect to > somewhere in the connection request, there is no way to be psychic > and recover this information. I thought a single HCA could have multiple GIDs. All I'm advocating is that a "correct" multi-homed configuration has a one-to-one mapping between its IP addresses and its GIDs. > > - Mandating ATS use is dictating protocol design just as much as > requiring the CM private data to carry source and destination IP > addresses. I think ATS dictates the kinds of authentication that can be done by the server over an IB transport, but not the protocol design. Certainly the private data can have additional authentication data (which I think is what you're advocating). > > - It's not just preventing connections to the wrong local address.
> NFS-RDMA wants the remote source address (ie getpeername()) so that > it can look it up in the exports list. Agreed. But you could also get rid of ATS by allowing GIDs to be specified in the exports file and then treating them like IPv6 addresses for the purpose of subnet comparisons. > > - Saying that a given GID may only have a single IP address is > definitely a case of the cure being worse than the disease. I > don't think we can forbid perfectly valid multi-homed > configurations just because it's inconvenient for us to > support them. I think our different perspectives come from what we consider to be "perfectly valid multi-homed configurations". One approach advocates overloading private data, the other advocates overloading address assignments. My approach suffers from the fact that multiple IP addresses for the same GID are just aliases that are interchangeable and at the remote end indistinguishable. The private data approach suffers from the need to mandate private data formats across all ulps and transports. I prefer the former limitation/cost. > > By the way, as far as I can tell, there is NO formal > documentation of the NFS-RDMA wire protocol. The current > draft (draft-ietf-nfsv4-rpcrdma-01.txt) simply says: > > This protocol is designed to function with equivalent semantics > over all appropriate RDMA transports. In its abstract form, this > protocol does not implement RDMA directly. [...] It therefore > becomes a useful, implementable standard when mapped onto a > specific RDMA transport, such as iWARP [RDDP] or Infiniband [IB]. > > [...] > > In setting up a new RDMA connection, the first action by an RPC > client will be to obtain a transport address for the server. The > mechanism used to obtain this address, and to open an RDMA > connection is dependent on the type of RDMA transport, > and outside > the scope of this protocol. 
> > So it seems perfectly reasonable and acceptable for the > mapping of NFS-RDMA onto IB to specify that the source and > destination IP addresses for an IB connection are placed in > the CM private data. > This seems much easier than trying to turn ATS into an IETF standard. > > - R. > I think there is a way to get rid of ATS as I described above without overloading the private data. Phew -- I'm exhausted. I'm going to go write code ;-) From caitlinb at broadcom.com Wed Aug 24 14:28:47 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 24 Aug 2005 14:28:47 -0700 Subject: [openib-general] RDMA connection and address translation API Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F50A@NT-SJCA-0751.brcm.ad.broadcom.com> -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier Sent: Wednesday, August 24, 2005 2:03 PM To: Tom Tucker Cc: openib-general at openib.org Subject: Re: [openib-general] RDMA connection and address translation API By the way, as far as I can tell, there is NO formal documentation of the NFS-RDMA wire protocol. The current draft (draft-ietf-nfsv4-rpcrdma-01.txt) simply says: This protocol is designed to function with equivalent semantics over all appropriate RDMA transports. In its abstract form, this protocol does not implement RDMA directly. [...] It therefore becomes a useful, implementable standard when mapped onto a specific RDMA transport, such as iWARP [RDDP] or Infiniband [IB]. [...] In setting up a new RDMA connection, the first action by an RPC client will be to obtain a transport address for the server. The mechanism used to obtain this address, and to open an RDMA connection is dependent on the type of RDMA transport, and outside the scope of this protocol. 
So it seems perfectly reasonable and acceptable for the mapping of NFS-RDMA onto IB to specify that the source and destination IP addresses for an IB connection are placed in the CM private data. This seems much easier than trying to turn ATS into an IETF standard. - R. NFS over RDMA was intended to be implemented using DAPL in a transport neutral way. Now, having the transport layer *add* data before the private data is legitimate for any specific transport. It would just have to be defined independently of openib and linux. Basically, any solution that allows NFS over RDMA to be coded with the *same* set of kDAPL calls to listen/connect/accept/reject would be compliant with the intent -- as long as the mapping to wire protocols was straightforward and allowed non-kDAPL implementations. For example, mapping the DAPL private data to the IETF MPA Request/Reply frame Private Data certainly qualifies as "straightforward". From rolandd at cisco.com Wed Aug 24 14:30:52 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 14:30:52 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <52acj7vxk0.fsf@cisco.com> (Roland Dreier's message of "Wed, 24 Aug 2005 14:23:43 -0700") References: <000401c5a8de$2c32cce0$6312000a@infiniconsys.com> <52acj7vxk0.fsf@cisco.com> Message-ID: <5264tvvx83.fsf@cisco.com> Roland> No, I think we just need to realize that a perfectly Roland> transport neutral protocol implementation is not Roland> achievable. It's unfortunate that kDAPL fooled people by Roland> hiding the details of the wire protocol under a supposedly Roland> "neutral API," but the fact is that mapping an abstract Roland> RDMA transport to a real implementation will always Roland> involve arbitrary transport-dependent choices.
Further: if we would be willing to say that transport-neutral protocols must use a "kDAPL wire protocol," then there's no problem in defining that wire protocol to put the source and destination IP address somewhere in the CM private data. The current "kDAPL wire protocol" happens to use ATS to try and achieve this (although it doesn't handle the multi-homed case), but that is no more and no less of an arbitrary protocol design choice. So in a nutshell, my objection to using ATS is that it is an arbitrary design choice that doesn't work as well as other equally valid choices. - R. From caitlinb at broadcom.com Wed Aug 24 14:40:06 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 24 Aug 2005 14:40:06 -0700 Subject: [openib-general] RDMA connection and address translation API Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F50C@NT-SJCA-0751.brcm.ad.broadcom.com> I think it would be more accurate to state that DAPL requires the 128-bit "IA Address space" to be administratively subdivided so that each "subnet" unambiguously translates to a specific IA reached network and that translation of the "IA Address" into and from that network's wire protocol is not visible to the DAT Consumer. ATS is indeed *one* solution for doing so. Adding RARP to IPoIB would make for another solution. Direct translation is also a valid solution for IPv6 compatible network IDs. So with this wealth of options available, do you agree that there is no reason to elevate any of these issues to being visible to a transport neutral application? -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier Sent: Wednesday, August 24, 2005 2:31 PM To: James Lentini Cc: openib-general at openib.org Subject: Re: [openib-general] RDMA connection and address translation API Roland> No, I think we just need to realize that a perfectly Roland> transport neutral protocol implementation is not Roland> achievable.
It's unfortunate that kDAPL fooled people by Roland> hiding the details of the wire protocol under a supposedly Roland> "neutral API," but the fact is that mapping an abstract Roland> RDMA transport to a real implementation will always Roland> involve arbitrary transport-dependent choices. Further: if we would be willing to say that transport-neutral protocols must use a "kDAPL wire protocol," then there's no problem in defining that wire protocol to put the source and destination IP address somewhere in the CM private data. The current "kDAPL wire protocol" happens to use ATS to try and achieve this (although it doesn't handle the multi-homed case), but that is no more and no less of an arbitrary protocol design choice. So in a nutshell, my objection to using ATS is that it is an arbitrary design choice that doesn't work as well as other equally valid choices. - R. From rolandd at cisco.com Wed Aug 24 14:43:07 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 14:43:07 -0700 Subject: [openib-general] [PATCH] ipoib: device removal races In-Reply-To: <20050808151141.GJ15300@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 8 Aug 2005 18:11:42 +0300") References: <20050808151141.GJ15300@mellanox.co.il> Message-ID: <52y86rui38.fsf@cisco.com> Thanks, I finally applied this and put it in my git queue for 2.6.14. I'm still thinking about the bigger patch that adds a second work queue. Having one extra work queue because of the rtnl_lock issues is ugly enough, and I'd really like to find a way to avoid two queues. - R.
From rolandd at cisco.com Wed Aug 24 14:44:56 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 14:44:56 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1F50C@NT-SJCA-0751.brcm.ad.broadcom.com> (Caitlin Bestler's message of "Wed, 24 Aug 2005 14:40:06 -0700") References: <54AD0F12E08D1541B826BE97C98F99F1F50C@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <52u0hfui07.fsf@cisco.com> Caitlin> So with this wealth of options available, do you agree Caitlin> that there is no reason to elevate any of these issues to Caitlin> being visisble to a transport neutral application? No -- the fact that there are a wealth of options actually means that picking one is an arbitrary choice we impose on transport neutral implementations and is de facto mandating a wire protocol. - R. From jlentini at netapp.com Wed Aug 24 14:50:31 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 24 Aug 2005 17:50:31 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F7141EB@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F7141EB@taurus.voltaire.com> Message-ID: On Wed, 24 Aug 2005, Yaron Haviv wrote: > > On Tue, 23 Aug 2005, Roland Dreier wrote: > > > > > It would be possible to have another function like > > > rdma_getpeername() that takes the transport address and returns > > > a source IP address. In the IB case this would do an ATS > > > reverse lookup. However, I hate this idea. iSER already uses > > > the CM private data to pass the source IP in the IB case, > > > > I know this is how IB SDP works, but I don't think iSER works this > > way. > > > > The code in the tree calls dat_ep_connect() with a NULL private > > data pointer. 
> > > > There is an iSER HELLO message described in iser_header.h that contains > > IP addresses, but I'm not certain that this is part of the current > > protocol (ISER_HELLO_LEN and ISER_HELLO_REPLY_LEN are unused). > > James, > > iSER doesn't mandate the source IP in general since it's doing a much > stronger authentication during Login > However we believe using a similar header to SDP can help the Passive > side > a. know which destination IP was targeted (in a multi homed environment) > b. for some implementations that want to validate the source for some > reason > > that's why the draft suggested adding the source/dst IP in the private > data just like SDP does, Which draft contains this? I found http://www.ietf.org/internet-drafts/draft-ietf-ips-iser-04.txt but the HELLO header in section 9.3 does not contain any IP address information. > I believe it can be a good idea to use the same approach for > NFS/RDMA and eliminate the need for reverse ATS lookup (that may have > some conflicts when multiple IPs exist per node). We may just use > the SDP hello header as is with unused fields zeroed. This will allow > all ULPs to use the same mechanism NFS/RDMA is not specific to iWARP or InfiniBand. My understanding is that this could not be easily accommodated in the current standards for that reason. From caitlinb at broadcom.com Wed Aug 24 14:53:41 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 24 Aug 2005 14:53:41 -0700 Subject: [openib-general] RDMA connection and address translation API Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F50D@NT-SJCA-0751.brcm.ad.broadcom.com> The requirement is solely that the System Administrators for each host directly attached to Network X agree on the basic addressing characteristics for Network X. This onerous challenge is successfully overcome on every IP subnet in the world every day for such details as what the subnet is, what the mask is, etc. 
Further, two adjoining subnets won't be able to talk unless their administrators have arranged for them to agree on what their network identifiers are, etc. For the specific question it is even less of a problem than theory suggests. A rule such as "non-IPv4 subnets are directly translated while IPv4 subnets use IPv4" is actually quite simple to implement. That could even be extended to allow *some* IPv6 subnets to be translated so that multiple IPv6 aliases for a single GID could be identified (that is, if anyone has a need for such a thing). -----Original Message----- From: Roland Dreier [mailto:rolandd at cisco.com] Sent: Wednesday, August 24, 2005 2:45 PM To: Caitlin Bestler Cc: Roland Dreier; James Lentini; openib-general at openib.org Subject: Re: [openib-general] RDMA connection and address translation API Caitlin> So with this wealth of options available, do you agree Caitlin> that there is no reason to elevate any of these issues to Caitlin> being visible to a transport neutral application? No -- the fact that there are a wealth of options actually means that picking one is an arbitrary choice we impose on transport neutral implementations and is de facto mandating a wire protocol. - R. From rolandd at cisco.com Wed Aug 24 14:54:24 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 14:54:24 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: (James Lentini's message of "Wed, 24 Aug 2005 17:50:31 -0400 (EDT)") References: <35EA21F54A45CB47B879F21A91F4862F7141EB@taurus.voltaire.com> Message-ID: <52pss3uhkf.fsf@cisco.com> James> NFS/RDMA is not specific to iWARP or InfiniBand. My James> understanding is that this could not be easily accommodated James> in the current standards for that reason. Yes, it seems that there will need to be some additional NFS/RDMA drafts describing the iWARP and IB wire protocols before the standard is complete. - R. 
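One possible reading of the addressing rule Caitlin describes earlier in this exchange (IPv4 goes through an IPv4 lookup, everything else is translated directly) can be sketched as follows. The assumptions are mine, not hers: an address arrives as 16 bytes, IPv4 is recognised by the IPv4-mapped prefix ::ffff:0:0/96, and "direct translation" means copying the 128-bit IPv6 address into the 128-bit GID. The names xlate_kind and ip_to_gid are hypothetical, and the lookup path itself is deliberately left out.

```c
#include <stdint.h>
#include <string.h>

/* Outcome of the classification; hypothetical names. */
enum xlate_kind {
    XLATE_DIRECT,   /* gid was filled in by a straight copy     */
    XLATE_LOOKUP    /* caller must resolve the address elsewhere */
};

/* True if addr carries the IPv4-mapped prefix ::ffff:0:0/96. */
static int is_v4_mapped(const uint8_t addr[16])
{
    static const uint8_t prefix[12] = {
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff
    };

    return memcmp(addr, prefix, sizeof prefix) == 0;
}

/* Classify addr. For non-IPv4 addresses the 16 bytes are copied into
 * gid unchanged; for IPv4-mapped addresses gid is left untouched and
 * the caller is told a lookup is needed. */
enum xlate_kind ip_to_gid(const uint8_t addr[16], uint8_t gid[16])
{
    if (is_v4_mapped(addr))
        return XLATE_LOOKUP;

    memcpy(gid, addr, 16);
    return XLATE_DIRECT;
}
```

Whether every host on the subnet applies the same rule is the administrative agreement discussed above; the code only shows that the classification step itself is small.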
From yaronh at voltaire.com Wed Aug 24 15:56:41 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Thu, 25 Aug 2005 01:56:41 +0300 Subject: [openib-general] RDMA connection and address translation API Message-ID: <35EA21F54A45CB47B879F21A91F4862F714202@taurus.voltaire.com> > -----Original Message----- > From: James Lentini [mailto:jlentini at netapp.com] > Sent: Wednesday, August 24, 2005 5:51 PM > To: Yaron Haviv > Cc: Roland Dreier; openib-general at openib.org > Subject: RE: [openib-general] RDMA connection and address translation API > > > > Which draft contains this? I found > > http://www.ietf.org/internet-drafts/draft-ietf-ips-iser-04.txt > James, You should look at: http://www.haifa.il.ibm.com/satran/ips/draft-ietf-ips-iser-05-candidate.txt The 05 rev really adds all the InfiniBand related stuff. You can see how the association between IB & IP is done using IPoIB. The current implementation may not use the private data field (since it's not critical/mandatory) but the intention is to add it to address multi homed hosts, we would like to push such a definition into IBTA so every IP oriented ULP can use it, several people expressed interest in such a definition, this can also support NFS/RDMA or any other IP based ULP. > but the HELLO header in section 9.3 does not contain any IP address > information. > > > I believe it can be a good idea to use the same approach for > > NFS/RDMA and eliminate the need for reverse ATS lookup (that may have > > some conflicts when multiple IPs exist per node). We may just use > > the SDP hello header as is with unused fields zeroed. This will allow > > all ULPs to use the same mechanism > > NFS/RDMA is not specific to iWARP or InfiniBand. My understanding is > that this could not be easily accommodated in the current standards > for that reason. Not sure why that is the case. If we add an IBTA definition of CM exchange for IP based ULPs (i.e. 
send src/dst IP and optionally ports) you can now have an NFS/RDMA spec that doesn't need to have any IB/iWarp specific definitions, since the differences are pushed down to the IBTA. In case of NFS/RDMA over other (non IB or iWarp) transport you can specify that providing the IP addressing is a responsibility of the underlying transport. Yaron From tom at ammasso.com Wed Aug 24 16:28:43 2005 From: tom at ammasso.com (Tom Tucker) Date: Wed, 24 Aug 2005 19:28:43 -0400 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and query provider methods Message-ID: <8E9D028761D8264D910612167E8457E8FA3B4B@mail2.ammasso.com> This patch is against the iWARP branch. It adds CM related methods to the ib_device structure as well as simple versions of the low level port, gid, pkey, etc... query methods. I also added printks to the provider methods so I could track the loading process. These will be removed when the driver is stable. Please take a look and let me know what you think. I'll check this in to the iWARP branch tomorrow if no one expresses dismay. Please feel free to send me patches to this patch if you see something. Signed-off-by: Tom Tucker Index: include/ib_verbs.h =================================================================== --- include/ib_verbs.h (revision 3120) +++ include/ib_verbs.h (working copy) @@ -43,6 +43,7 @@ #include #include +#include #include #include @@ -59,7 +60,8 @@ enum ib_node_type { IB_NODE_CA = 1, IB_NODE_SWITCH, - IB_NODE_ROUTER + IB_NODE_ROUTER, + IB_NODE_IWARP }; enum ib_device_cap_flags { @@ -273,6 +275,42 @@ enum ib_event_type event; }; +/* Connection events. */ +enum ib_xcm_event_type { + IB_EVENT_ACTIVE_CONNECT_RESULTS, + IB_EVENT_CONNECT_REQUEST +}; + +/* iwarp connection attributes. 
*/ +struct ib_connect_attr { + struct in_addr local_addr; + struct in_addr remote_addr; + u16 local_port; + u16 remote_port; +}; +struct ib_conn_results { + int errno; + struct ib_connect_attr conn_attr; + int priv_len; + u8 private_data[0]; +}; + +struct ib_conn_request { + int cr_id; + struct ib_connect_attr conn_attr; + int priv_len; + u8 private_data[0]; +}; + +struct ib_xcm_event { + struct ib_device *device; + union { + struct ib_conn_results connect_qp_results; + struct ib_conn_request connect_request; + } element; + enum ib_xcm_event_type event; +}; + struct ib_event_handler { struct ib_device *device; void (*handler)(struct ib_event_handler *, struct ib_event *); @@ -795,6 +833,25 @@ IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ }; +/* Listening endpoint. */ +struct ib_listen_ep_attr { + void (*event_handler)(struct ib_xcm_event *, void *); + void *listen_context; + struct in_addr addr; + u16 port; + int backlog; +}; + +struct ib_listen_ep { + struct ib_device *device; + void (*event_handler)(struct ib_xcm_event *, void *); + void *listen_context; + struct in_addr addr; + u16 port; + int backlog; +}; + + #define IB_DEVICE_NAME_MAX 64 struct ib_cache { @@ -941,6 +998,21 @@ struct ib_mad *in_mad, struct ib_mad *out_mad); + int (*connect_qp)(struct ib_qp *qp, + struct ib_connect_attr *conn_attr); + int (*accept_cr)(int cr_id, + struct ib_qp *qp, + u8 *pdata, + int pdata_len); + int (*reject_cr)(int cr_id, + struct ib_qp *qp, + u8 *pdata, + int pdata_len); + int (*query_cr)(int cr_id, + struct ib_connect_attr *conn_attr); + struct ib_listen_ep * (*create_listen_ep)(struct ib_listen_ep_attr *); + int (*destroy_listen_ep)(struct ib_listen_ep *ep); + struct module *owner; struct class_device class_dev; struct kobject ports_parent; Index: hw/amso1100/c2.c =================================================================== --- hw/amso1100/c2.c (revision 3121) +++ hw/amso1100/c2.c (working copy) @@ -1,9 +1,33 @@ /* - * c2.c: A Linux PCI-X 
Gigabit Ethernet driver for AMSO1100 (Cepheus2) RNIC - * Copyright(c) 2005 Ammasso, Inc. - * - * History: - * + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
*/ #include #include @@ -28,8 +52,8 @@ #include #include "c2.h" -MODULE_AUTHOR("Ranjith Balachandran "); -MODULE_DESCRIPTION("Ammasso AMSO1100 Gigabit Ethernet driver"); +MODULE_AUTHOR("Tom Tucker "); +MODULE_DESCRIPTION("Ammasso AMSO1100 Low-level iWARP Driver"); MODULE_LICENSE("Dual BSD/GPL"); MODULE_VERSION(DRV_VERSION); @@ -230,7 +254,7 @@ rxp_hdr = (struct c2_rxp_hdr *) skb->data; rxp_hdr->flags = RXP_HRXD_READY; - //c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); + /* c2_write16(elem->hw_desc + C2_RXP_COUNT, cpu_to_be16(0)); */ c2_write16(elem->hw_desc + C2_RXP_STATUS, cpu_to_be16(0)); c2_write16(elem->hw_desc + C2_RXP_LEN, cpu_to_be16((u16)maplen - sizeof(*rxp_hdr))); c2_write64(elem->hw_desc + C2_RXP_ADDR, cpu_to_be64(mapaddr)); @@ -975,7 +999,7 @@ /* Remap the adapter PCI registers in BAR4 */ mmio_regs = ioremap_nocache(reg4_start + C2_PCI_REGS_OFFSET, - sizeof(struct c2_adapter_pci_regs)); + sizeof(struct c2_adapter_pci_regs)); if (mmio_regs == 0UL) { printk(KERN_ERR PFX "Unable to remap adapter PCI registers in BAR4\n"); ret = -EIO; @@ -1000,7 +1024,7 @@ /* Validate the adapter version */ if (be32_to_cpu(c2_read32(mmio_regs + C2_REGS_VERS)) != C2_VERSION) { printk(KERN_ERR PFX "Version mismatch [fw=%u, c2=%u], Adapter not claimed\n", - be32_to_cpu(c2_read32(mmio_regs + C2_REGS_VERS)), C2_VERSION); + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_VERS)), C2_VERSION); ret = -EINVAL; iounmap(mmio_regs); goto bail2; @@ -1009,7 +1033,7 @@ /* Validate the adapter IVN */ if (be32_to_cpu(c2_read32(mmio_regs + C2_REGS_IVN)) != C2_IVN) { printk(KERN_ERR PFX "IVN mismatch [fw=0x%x, c2=0x%x], Adapter not claimed\n", - be32_to_cpu(c2_read32(mmio_regs + C2_REGS_IVN)), C2_IVN); + be32_to_cpu(c2_read32(mmio_regs + C2_REGS_IVN)), C2_IVN); ret = -EINVAL; iounmap(mmio_regs); goto bail2; @@ -1019,7 +1043,7 @@ c2dev = kmalloc(sizeof(*c2dev), GFP_KERNEL); if (!c2dev) { printk(KERN_ERR PFX "%s: Unable to alloc hardware struct\n", - pci_name(pcidev)); + pci_name(pcidev)); 
ret = -ENOMEM; iounmap(mmio_regs); goto bail2; @@ -1066,7 +1090,7 @@ /* Remap the adapter HRXDQ PA space to kernel VA space */ c2dev->mmio_rxp_ring = ioremap_nocache(reg4_start + C2_RXP_HRXDQ_OFFSET, - C2_RXP_HRXDQ_SIZE); + C2_RXP_HRXDQ_SIZE); if (c2dev->mmio_rxp_ring == 0UL) { printk(KERN_ERR PFX "Unable to remap MMIO HRXDQ region\n"); ret = -EIO; @@ -1075,7 +1099,7 @@ /* Remap the adapter HTXDQ PA space to kernel VA space */ c2dev->mmio_txp_ring = ioremap_nocache(reg4_start + C2_TXP_HTXDQ_OFFSET, - C2_TXP_HTXDQ_SIZE); + C2_TXP_HTXDQ_SIZE); if (c2dev->mmio_txp_ring == 0UL) { printk(KERN_ERR PFX "Unable to remap MMIO HTXDQ region\n"); ret = -EIO; @@ -1096,6 +1120,8 @@ /* Print out the MAC address */ c2_print_macaddr(netdev); + c2_register_device(c2dev); + return 0; bail8: Index: hw/amso1100/devccil_adapter.c =================================================================== --- hw/amso1100/devccil_adapter.c (revision 3120) +++ hw/amso1100/devccil_adapter.c (working copy) @@ -46,13 +46,6 @@ {0,} }; -static struct pci_driver devccil_driver = { - .name = "cepheus", - .id_table = devccil_id_table, - .probe = devccil_probe, - .remove = __devexit_p(devccil_remove), -}; - /* * global linked lists of CC adapters and open rnic instances. */ @@ -482,10 +475,8 @@ /* - * Called once on the first open of a host process to simulate reading - * the PCI regs and setting access to the system iface for this - * adapter. For a real adapter, we can do this as the result of a probe - * call. For now, we do it on the first user open. + * Called by device_probe to allocate and initialize adapter hardware + * resources. 
*/ int alloc_adapter(cc_adapter_t **p_cca) @@ -541,21 +532,6 @@ return 0; } -int -adapter_init(struct pci_driver *driver) -{ - memcpy(driver, &devccil_driver, sizeof(struct pci_driver)); - - return 0; -} - -void -adapter_term(void) -{ - ; -} - - cc_status_t init_adapter_queues(cc_adapter_t* cca) { Index: hw/amso1100/c2.h =================================================================== --- hw/amso1100/c2.h (revision 3120) +++ hw/amso1100/c2.h (working copy) @@ -1,9 +1,42 @@ /* - * c2.h: A Linux PCI-X Gigabit Ethernet driver for AMSO1100 (Cepheus2) RNIC - * - * Copyright(c) 2005 Ammasso, Inc. + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
*/ +#include "cc_queue.h" +#include "cc_adapter.h" + +#include "devccil_mq.h" +#include "devccil_eh.h" +#include "devccil_lock.h" + #define DRV_NAME "c2" #define DRV_VERSION "1.1" #define PFX DRV_NAME ": " @@ -215,6 +248,49 @@ struct net_device *netdev; unsigned int cur_tx; unsigned int cur_rx; + u64 fw_ver; + u32 hw_rev; + u32 device_cap_flags; + u32 vendor_id; + u32 vendor_part_id; + void __iomem *kva; /* KVA device memory */ + void __iomem *pa; /* PA device memory */ + void** qptr_array; + kmem_cache_t* host_msg_cache; + kmem_cache_t* ae_msg_cache; + struct list_head cca_link; /* adapter list */ + struct list_head eh_wakeup_list; /* event wakeup list */ + wait_queue_head_t req_vq_wo; + cc_mq_t req_vq; /* Verbs Request MQ */ + + /* RNIC Limits */ + u32 max_mr_size; + u32 max_qp; + u32 max_qp_wr; + u32 max_sge; + u32 max_cq; + u32 max_cqe; + u32 max_mr; + u32 max_pd; + + int ports; /* num of GigE ports */ + int devnum; + cc_lock_t vqlock; /* sync vbs req MQ */ + cc_mq_t rep_vq; /* Verbs Reply MQ */ + cc_mq_t aeq; /* Async Events MQ */ + cc_lock_t aeq_lock; + cc_lock_t rnic_lock; + u16 q1_shared; + u16 q2_shared; + u16 irq_claimed; + + u16 hint_count; + u16 hints_read; + + u16 q0_shared; + cc_bool_t init; /* TRUE if it's ready */ + char ae_cache_name[16]; + char vq_cache_name[16]; }; struct c2_port { @@ -224,8 +300,8 @@ spinlock_t tx_lock; u32 tx_avail; - struct c2_ring tx_ring; - struct c2_ring rx_ring; + struct c2_ring tx_ring; + struct c2_ring rx_ring; void *mem; /* PCI memory for host rings */ dma_addr_t dma; Index: hw/amso1100/devccil_adapter.h =================================================================== --- hw/amso1100/devccil_adapter.h (revision 3120) +++ hw/amso1100/devccil_adapter.h (working copy) @@ -34,8 +34,6 @@ #define ADAPTER_MAGIC 0x54504441 /* 'ADPT' */ typedef struct cc_adapter_s { - struct ib_device ib_dev; - struct pci_dev *dev; /* Linux PCI Dev */ struct port_s { void *phys; void *virt; @@ -76,12 +74,6 @@ char vq_cache_name[16]; } 
cc_adapter_t; -static inline cc_adapter_t* to_adapter(struct ib_device *ibdev) -{ - return container_of(ibdev, cc_adapter_t, ib_dev); -} - - extern CC_TAILQ_HEAD(cca_list, cc_adapter_s) adapter_list; extern int alloc_adapter(cc_adapter_t **p_cca); extern void free_adapter(cc_adapter_t *cca); Index: hw/amso1100/devnet.h =================================================================== --- hw/amso1100/devnet.h (revision 3120) +++ hw/amso1100/devnet.h (working copy) @@ -74,7 +74,7 @@ #else extern int ccilnet_init_module(struct pci_driver *); #endif /* STANDALONE */ -extern int devccil_init_module(struct pci_driver *); +extern int devccil_init_module(void); extern void devccil_exit_module(void); extern void ccilnet_exit_module(void); Index: hw/amso1100/devccil.c =================================================================== --- hw/amso1100/devccil.c (revision 3120) +++ hw/amso1100/devccil.c (working copy) @@ -87,56 +87,6 @@ MODULE_PARM_DESC(pinned_max_percent, "The maximum percent of real memory that devccil can pin."); /* - * define the kdapl verbs list - */ -#undef CCIL_KDAPL -#ifdef CCIL_KDAPL -#include - -CCVerbs ccil_kdapl_verbs_list = { - 1, /* version number */ - cc_rnic_enum_kern, - cc_rnic_open_kern, - cc_rnic_query_kern, - cc_rnic_close_kern, - cc_rnic_getconfig_kern, - cc_pd_alloc_kern, - cc_pd_dealloc_kern, - cc_qp_create_kern, - cc_qp_modify_kern, - cc_qp_modify_user_context_kern, - cc_qp_query_kern, - cc_qp_destroy_kern, - cc_cq_create_kern, - cc_cq_modify_kern, - cc_cq_destroy_kern, - cc_nsmr_register_phys_kern, - cc_mw_alloc_kern, - cc_stag_dealloc_kern, - cc_smr_register_kern, - cc_qp_post_sq_kern, - cc_qp_post_rq_kern, - cc_cq_poll_kern, - cc_eh_set_ce_handler_kern, - cc_eh_set_async_handler_kern, - cc_cq_request_notification_kern, - cc_ep_listen_create_kern, - cc_ep_listen_destroy_kern, - cc_ep_query_kern, - cc_qp_connect_kern, - cc_cr_accept_kern, - cc_cr_reject_kern -}; - -#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 0) 
-EXPORT_SYMBOL_NOVERS(ccil_kdapl_verbs_list); -#else -EXPORT_SYMBOL(ccil_kdapl_verbs_list); -#endif - -#endif - -/* * Fileops declarations. */ Static int ccilopen(struct inode *inode, struct file *f); @@ -171,7 +121,7 @@ * device memory and pin it. */ int -devccil_init_module(struct pci_driver *driver) +devccil_init_module(void) { int rc; @@ -220,12 +170,6 @@ } #endif - if ((rc = adapter_init(driver))) { - (void)unregister_chrdev (ccil_major_num, CCILMOD); - devccil_err("DEVCCIL: adapter initialization failed\n"); - return rc; - } - if ((rc = rnic_init())) { adapter_term(); (void)unregister_chrdev (ccil_major_num, CCILMOD); @@ -240,13 +184,6 @@ return rc; } -#if defined CCIL_KDAPL && !defined symbol_get - /* - * register the verbs list for the kdapl module - */ - inter_module_register("ccil_kdapl_verbs_list", THIS_MODULE, &ccil_kdapl_verbs_list); -#endif - devccil_info("DEVCCIL: module loaded\n"); return(0); } @@ -265,13 +202,6 @@ adapter_term(); -#if defined CCIL_KDAPL && !defined symbol_get - /* - * unregister the verbs list - */ - inter_module_unregister("ccil_kdapl_verbs_list"); -#endif - error = unregister_32bit_ioctls(); if (error) { devccil_err("DEVCCIL: unregistering 32 bit ioctls failed\n"); Index: hw/amso1100/c2_provider.c =================================================================== --- hw/amso1100/c2_provider.c (revision 3120) +++ hw/amso1100/c2_provider.c (working copy) @@ -55,53 +55,132 @@ #include "c2.h" #include "c2_provider.h" +#include "c2_user.h" static int c2_query_device(struct ib_device *ibdev, struct ib_device_attr *props) { - int err = -ENOMEM; - return err; + struct c2_dev* c2dev = to_dev(ibdev); + + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + memset(props, 0, sizeof *props); + + memcpy(&props->sys_image_guid, c2dev->netdev->dev_addr, 6); + memcpy(&props->node_guid, c2dev->netdev->dev_addr, 6); + + props->fw_ver = c2dev->fw_ver; + props->device_cap_flags = c2dev->device_cap_flags; + props->vendor_id = c2dev->vendor_id; 
+ props->vendor_part_id = c2dev->vendor_part_id; + props->hw_ver = c2dev->hw_rev; + props->max_mr_size = ~0ull; + props->max_qp = c2dev->max_qp; + props->max_qp_wr = c2dev->max_qp_wr; + props->max_sge = c2dev->max_sge; + props->max_cq = c2dev->max_cq; + props->max_cqe = c2dev->max_cqe; + props->max_mr = c2dev->max_mr; + props->max_pd = c2dev->max_pd; + props->max_qp_rd_atom = 0; + props->max_qp_init_rd_atom = 0; + props->local_ca_ack_delay = 0; + + return 0; } static int c2_query_port(struct ib_device *ibdev, - u8 port, struct ib_port_attr *props) + u8 port, struct ib_port_attr *props) { - return ENOSYS; + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + + props->max_mtu = IB_MTU_4096; + props->lid = 0; + props->lmc = 0; + props->sm_lid = 0; + props->sm_sl = 0; + props->state = IB_PORT_ACTIVE; + props->phys_state = 0; + props->port_cap_flags = 0; + props->gid_tbl_len = 128; + props->pkey_tbl_len = 1; + props->qkey_viol_cntr = 0; + props->active_width = 1; + props->active_speed = 1; + + return 0; } static int c2_modify_port(struct ib_device *ibdev, u8 port, int port_modify_mask, struct ib_port_modify *props) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return ENOSYS; } static int c2_query_pkey(struct ib_device *ibdev, - u8 port, u16 index, u16 *pkey) + u8 port, u16 index, u16 *pkey) { - return ENOSYS; + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + *pkey = 0; + return 0; } static int c2_query_gid(struct ib_device *ibdev, u8 port, int index, union ib_gid *gid) { - return ENOSYS; + struct c2_dev* c2dev = to_dev(ibdev); + + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + memcpy(&(gid->raw[2]),c2dev->netdev->dev_addr, 6); + gid->raw[0] = 0; + gid->raw[1] = 0; + + return 0; } +/* Allocate the user context data structure. This keeps track + * of all objects associated with a particular user-mode client. 
+ */ static struct ib_ucontext *c2_alloc_ucontext(struct ib_device *ibdev, struct ib_udata *udata) { - return 0; + struct c2_alloc_ucontext_resp uresp; + struct c2_ucontext *context; + + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + memset(&uresp, 0, sizeof uresp); + + uresp.qp_tab_size = to_dev(ibdev)->max_qp; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) + return ERR_PTR(-ENOMEM); + + /* The OpenIB user context is logically similar to the RNIC + * Instance of our existing driver + */ + /* context->rnic_p = rnic_open */ + + if (ib_copy_to_udata(udata, &uresp, sizeof uresp)) { + kfree(context); + return ERR_PTR(-EFAULT); + } + + return &context->ibucontext; } static int c2_dealloc_ucontext(struct ib_ucontext *context) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } static int c2_mmap_uar(struct ib_ucontext *context, struct vm_area_struct *vma) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } @@ -109,22 +188,26 @@ struct ib_ucontext *context, struct ib_udata *udata) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } static int c2_dealloc_pd(struct ib_pd *pd) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } static struct ib_ah *c2_ah_create(struct ib_pd *pd, struct ib_ah_attr *ah_attr) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } static int c2_ah_destroy(struct ib_ah *ah) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } @@ -132,11 +215,13 @@ struct ib_qp_init_attr *init_attr, struct ib_udata *udata) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } static int c2_destroy_qp(struct ib_qp *qp) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } @@ -144,16 +229,19 @@ struct ib_ucontext *context, struct ib_udata *udata) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return ERR_PTR(0); } static int c2_destroy_cq(struct ib_cq *cq) { + 
printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } static struct ib_mr *c2_get_dma_mr(struct ib_pd *pd, int acc) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } @@ -163,40 +251,49 @@ int acc, u64 *iova_start) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, int acc, struct ib_udata *udata) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return ERR_PTR(0); } static int c2_dereg_mr(struct ib_mr *mr) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } static ssize_t show_rev(struct class_device *cdev, char *buf) { struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); - return sprintf(buf, "%x\n", 1); + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return sprintf(buf, "%x\n", dev->hw_rev); } static ssize_t show_fw_ver(struct class_device *cdev, char *buf) { struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); - return sprintf(buf, "%x.%x.%x\n", 1,2,3); + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return sprintf(buf, "%x.%x.%x\n", + (int)(dev->fw_ver >> 32), + (int)(dev->fw_ver >> 16) & 0xffff, + (int)(dev->fw_ver & 0xffff)); } static ssize_t show_hca(struct class_device *cdev, char *buf) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return sprintf(buf, "AMSO1100\n"); } static ssize_t show_board(struct class_device *cdev, char *buf) { - struct c2_dev *dev = container_of(cdev, struct c2_dev, ibdev.class_dev); + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return sprintf(buf, "%.*s\n", 32, "AMSO1100 Board ID"); } @@ -214,21 +311,25 @@ int c2_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return ENOSYS; } int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, 
__LINE__); return 0; } int c2_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return ENOSYS; } int c2_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return ENOSYS; } @@ -240,36 +341,85 @@ struct ib_mad *in_mad, struct ib_mad *out_mad) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return ENOSYS; } int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } int c2_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } int c2_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, struct ib_recv_wr **bad_wr) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } +int c2_connect_qp(struct ib_qp *qp, + struct ib_connect_attr *conn_attr) +{ + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return ENOSYS; +} + +int c2_accept_cr(int cr_id, + struct ib_qp *qp, + u8 *pdata, + int pdata_len) +{ + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return ENOSYS; +} + +int c2_reject_cr(int cr_id, + struct ib_qp *qp, + u8 *pdata, + int pdata_len) +{ + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return ENOSYS; +} + +int c2_query_cr(int cr_id, + struct ib_connect_attr *conn_attr) +{ + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return ENOSYS; +} + +struct ib_listen_ep * c2_create_listen_ep(struct ib_listen_ep_attr *la) +{ + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return 0; +} + +int c2_destroy_listen_ep(struct ib_listen_ep *ep) +{ + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + return ENOSYS; +} + int c2_register_device(struct c2_dev *dev) { int ret; int i; + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); strlcpy(dev->ibdev.name, "amso%d", 
IB_DEVICE_NAME_MAX); dev->ibdev.owner = THIS_MODULE; - dev->ibdev.node_type = IB_NODE_CA; + dev->ibdev.node_type = IB_NODE_IWARP; dev->ibdev.phys_port_cnt = 1; dev->ibdev.dma_device = &dev->pcidev->dev; dev->ibdev.class_dev.dev = &dev->pcidev->dev; @@ -305,10 +455,17 @@ dev->ibdev.detach_mcast = c2_multicast_detach; dev->ibdev.process_mad = c2_process_mad; - dev->ibdev.req_notify_cq = c2_arm_cq; - dev->ibdev.post_send = c2_post_send; - dev->ibdev.post_recv = c2_post_receive; + dev->ibdev.req_notify_cq = c2_arm_cq; + dev->ibdev.post_send = c2_post_send; + dev->ibdev.post_recv = c2_post_receive; + dev->ibdev.connect_qp = c2_connect_qp; + dev->ibdev.accept_cr = c2_accept_cr; + dev->ibdev.reject_cr = c2_reject_cr; + dev->ibdev.query_cr = c2_query_cr; + dev->ibdev.create_listen_ep = c2_create_listen_ep; + dev->ibdev.destroy_listen_ep = c2_destroy_listen_ep; + ret = ib_register_device(&dev->ibdev); if (ret) return ret; @@ -321,10 +478,12 @@ return ret; } } + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); return 0; } void c2_unregister_device(struct c2_dev *dev) { + printk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); ib_unregister_device(&dev->ibdev); } Index: hw/amso1100/c2_provider.h =================================================================== --- hw/amso1100/c2_provider.h (revision 3120) +++ hw/amso1100/c2_provider.h (working copy) @@ -48,17 +48,27 @@ DECLARE_PCI_UNMAP_ADDR(mapping) }; -struct c2_uar { - unsigned long pfn; - int index; -}; -struct c2_user_db_table; +/* The user context keeps track of objects allocated for a + * particular user-mode client. */ +struct c2_ucontext { + struct ib_ucontext ibucontext; -struct c2_ucontext { - struct ib_ucontext ibucontext; - struct c2_uar uar; - struct c2_user_db_table *db_tab; + int index; /* rnic index (minor) */ + int port; /* Which GigE port */ + + /* + * Shared HT pages for user-accessible MQs. 
+ */ + int hthead; /* index of first free entry */ + void* htpages; /* kernel vaddr */ + int htlen; /* length of htpages memory */ + void* htuva; /* user mapped vaddr */ + cc_lock_t htlock; /* serialize allocation */ + u64 adapter_hint_uva; /* Activity FIFO */ + + cc_rnic_query_attrs_t rnic_attrs; /* cache of rnic attrs */ + }; struct c2_mtt; @@ -67,8 +77,12 @@ struct ib_mr ibmr; }; +/* All objects associated with a PD are kept in the + * associated user context if present. + */ struct c2_pd { - struct ib_pd ibpd; + struct ib_pd ibpd; + u32 pd_id; }; struct c2_av; Index: hw/amso1100/TODO =================================================================== --- hw/amso1100/TODO (revision 3120) +++ hw/amso1100/TODO (working copy) @@ -82,11 +82,13 @@ you need. Custom ASSERT() can probably just be replaced with standard BUG_ON(). -[-] Split out ccilnet +[X] Split out ccilnet -[-] Remove kDAT entry points +[X] Remove kDAT entry points [-] Remove superflouos common files/code [-] Boot firmware from flash instead of loading over PCI +[-] Remove MQ memory duplication in the host + From rolandd at cisco.com Wed Aug 24 16:29:06 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 24 Aug 2005 16:29:06 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F714202@taurus.voltaire.com> (Yaron Haviv's message of "Thu, 25 Aug 2005 01:56:41 +0300") References: <35EA21F54A45CB47B879F21A91F4862F714202@taurus.voltaire.com> Message-ID: <52ll2qvrr1.fsf@cisco.com> Yaron> The current implementation may not use the private data Yaron> field (since its not critical/mandatory) but the intention Yaron> is to add it to address multi homed hosts, we would like to Yaron> push such a definition into IBTA so every IP oriented ULP Yaron> can use it, several people expressed interest in such a Yaron> definition, this can also support NFS/RDMA or any other IP Yaron> based ULP. 
Strange as it may seem, I agree completely with Yaron ;) It would make perfect sense to take a couple of the reserved bits in the CM REQ format and turn them into an "IP address present" field (a couple of bits so we can distinguish between v4 and v6). When this field is set, then the first (or last, or whatever) 32 bytes of the private data would hold the source and destination IP address. Having this standardized also gives us the ability to deal with the concerns around connections initiated in userspace. The kernel proxy for the user CM can make sure that any REQs sent with the "IP address present" field set actually has an IP assigned to the local system. Remote systems would still need to treat CM messages from QPs other than QP 1 as untrusted. Of course for real security some stronger authentication is needed in any case (even in the iWARP case the source IP can't be trusted; an attacker could DOS the real owner of the IP, flood the switches MAC tables so it becomes a hub, and then take over any IP it wants). The only unfortunate thing about all this is that the SDP Hello message format is already frozen, and it seems a little too specialized for generic use (eg we don't want a "Max Zcopy Advertisements" field). Yaron, has anyone raised all this in the IBTA WG? - R. From sean.hefty at intel.com Wed Aug 24 16:36:54 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 24 Aug 2005 16:36:54 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and queryprovider methods In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B4B@mail2.ammasso.com> Message-ID: >@@ -59,7 +60,8 @@ > enum ib_node_type { > IB_NODE_CA = 1, > IB_NODE_SWITCH, >- IB_NODE_ROUTER >+ IB_NODE_ROUTER, >+ IB_NODE_IWARP > }; I guess I'm not sure what an iWarp node is or how it would be used. >+/* Connection events. */ >+enum ib_xcm_event_type { >+ IB_EVENT_ACTIVE_CONNECT_RESULTS, >+ IB_EVENT_CONNECT_REQUEST >+}; >+ >+/* iwarp connection attributes. 
*/ >+struct ib_connect_attr { >+ struct in_addr local_addr; >+ struct in_addr remote_addr; >+ u16 local_port; >+ u16 remote_port; >+}; >+struct ib_conn_results { >+ int errno; >+ struct ib_connect_attr conn_attr; >+ int priv_len; >+ u8 private_data[0]; >+}; >+ >+struct ib_conn_request { >+ int cr_id; >+ struct ib_connect_attr conn_attr; >+ int priv_len; >+ u8 private_data[0]; >+}; >+ >+struct ib_xcm_event { >+ struct ib_device *device; >+ union { >+ struct ib_conn_results connect_qp_results; >+ struct ib_conn_request connect_request; >+ } element; >+ enum ib_xcm_event_type event; >+}; >+ Why include the connection protocol as part of the verbs layer? Granted I haven't looked at the iWarp specs in a long time, but I don't remember connection establishment being part of the verbs. - Sean From tom at ammasso.com Wed Aug 24 16:42:14 2005 From: tom at ammasso.com (Tom Tucker) Date: Wed, 24 Aug 2005 19:42:14 -0400 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and queryprovider methods Message-ID: <8E9D028761D8264D910612167E8457E8FA3B4C@mail2.ammasso.com> Good question. It was placed in here to allow ULPs to filter the device notifications by the transport type. The IPoIB ULP, for example, doesn't care about iWARP devices. This method is butt simple, however, doesn't scale very well if you add additional transports. The alternative is to add a 'filter' mask that allows ULPs to register for device type events it is interested in. I am happy to do it either way. What do you think? 
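A minimal sketch of the filter-mask alternative, assuming a hypothetical per-client mask with one bit per node type. The enum is taken from the patch under discussion; `NODE_TYPE_BIT`, `struct ulp_client` and `client_wants_device` are illustrative names only, not an existing OpenIB interface.

```c
#include <stdint.h>

/* Node types as extended by the patch quoted in this thread. */
enum ib_node_type {
	IB_NODE_CA = 1,
	IB_NODE_SWITCH,
	IB_NODE_ROUTER,
	IB_NODE_IWARP
};

/* Hypothetical: one mask bit per node type. */
#define NODE_TYPE_BIT(t)	(1u << (t))

/* Hypothetical client record: the ULP declares the device types it wants. */
struct ulp_client {
	const char *name;
	uint32_t    node_type_mask;
};

/* Nonzero if this client asked for notifications about this device type. */
static int client_wants_device(const struct ulp_client *c,
			       enum ib_node_type type)
{
	return (c->node_type_mask & NODE_TYPE_BIT(type)) != 0;
}
```

With this shape, an IPoIB-like client would register with `NODE_TYPE_BIT(IB_NODE_CA)` only, and the core would skip it when an iWARP device is added.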
> -----Original Message----- > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Wednesday, August 24, 2005 6:37 PM > To: Tom Tucker; openib-general at openib.org > Subject: RE: [openib-general] [PATCH][iWARP] Added provider > CM verbs and queryprovider methods > > >@@ -59,7 +60,8 @@ > > enum ib_node_type { > > IB_NODE_CA = 1, > > IB_NODE_SWITCH, > >- IB_NODE_ROUTER > >+ IB_NODE_ROUTER, > >+ IB_NODE_IWARP > > }; > > I guess I'm not sure what an iWarp node is or how it would be used. > > >+/* Connection events. */ > >+enum ib_xcm_event_type { > >+ IB_EVENT_ACTIVE_CONNECT_RESULTS, > >+ IB_EVENT_CONNECT_REQUEST > >+}; > >+ > >+/* iwarp connection attributes. */ > >+struct ib_connect_attr { > >+ struct in_addr local_addr; > >+ struct in_addr remote_addr; > >+ u16 local_port; > >+ u16 remote_port; > >+}; > >+struct ib_conn_results { > >+ int errno; > >+ struct ib_connect_attr conn_attr; > >+ int priv_len; > >+ u8 private_data[0]; > >+}; > >+ > >+struct ib_conn_request { > >+ int cr_id; > >+ struct ib_connect_attr conn_attr; > >+ int priv_len; > >+ u8 private_data[0]; > >+}; > >+ > >+struct ib_xcm_event { > >+ struct ib_device *device; > >+ union { > >+ struct ib_conn_results connect_qp_results; > >+ struct ib_conn_request connect_request; > >+ } element; > >+ enum ib_xcm_event_type event; > >+}; > >+ > > Why include the connection protocol as part of the verbs > layer? Granted I > haven't looked at the iWarp specs in a long time, but I don't remember > connection establishment being part of the verbs. > > - Sean > > From tom at ammasso.com Wed Aug 24 16:44:51 2005 From: tom at ammasso.com (Tom Tucker) Date: Wed, 24 Aug 2005 19:44:51 -0400 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and queryprovider methods Message-ID: <8E9D028761D8264D910612167E8457E8FA3B4D@mail2.ammasso.com> Sorry, missed the second question... 
> >+ struct ib_conn_results connect_qp_results; > >+ struct ib_conn_request connect_request; > >+ } element; > >+ enum ib_xcm_event_type event; > >+}; > >+ > > Why include the connection protocol as part of the verbs > layer? Granted I > haven't looked at the iWarp specs in a long time, but I don't remember > connection establishment being part of the verbs. > > - Sean > Connection management is not part of the RDMAC verbs, however, we need some way for transports to "hook in" to the CM. The other approach is to have a separate registration mechanism for connection management verbs, but this seemed a little bizarre, so we just extended the provider verbs. Ideas? > From caitlinb at broadcom.com Wed Aug 24 16:47:14 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 24 Aug 2005 16:47:14 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and queryprovider methods Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F510@NT-SJCA-0751.brcm.ad.broadcom.com> Sean Hefty asked: > Why include the connection protocol as part of the verbs layer? > Granted I haven't looked at the iWarp specs in a long time, but > I don't remember connection establishment being part of the verbs. The verbs assume that the kernel Access Layer (Privileged Resource Manager/whatever) can pass a socket handle into a "modify qp to RTS" call and extract the entire TCP connection state from the host stack in a single simple atomic action. If we could quickly agree that the Linux stack should support such an operation then there would be no need to define CM methods from each device module. And agreeing on such new stack functionality should be relatively easy, compared to unifying Unix or retiring the national debt. Creating device dependent methods that are broader than the RDMAC verbs is a method of allowing each device to solve the complete connection setup problem without requiring integration with the host stack. 
The latter is still the recommended solution, but obviously not one that should be done quickly without careful consideration to ensure that any additional functionality in the core TCP/IP stack has been carefully examined and reviewed. From ftillier at silverstorm.com Wed Aug 24 17:12:01 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 24 Aug 2005 17:12:01 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: Message-ID: <000701c5a909$a355b4b0$6312000a@infiniconsys.com> > From: James Lentini [mailto:jlentini at netapp.com] > Sent: Wednesday, August 24, 2005 1:58 PM > > On Wed, 24 Aug 2005, Fab Tillier wrote: > > > > From: Roland Dreier [mailto:rolandd at cisco.com] > > > Sent: Wednesday, August 24, 2005 11:03 AM > > > > > > Fab> Why can't the IPV field be ignored? If a listen wants only > > > Fab> IPV4 addresses, it would specify a 16-byte compare buffer > > > Fab> with the first 12 bytes zero, the next 4 filled with the IPV4 > > > Fab> address, and would set the offset to that of the hello > > > Fab> message's destination address (32). > > > > > > Yes, you're right for SDP. I guess if we're comfortable mandating > > > that all protocols put their source and destination IPs in the private > > > data for the IB case, then this works. Of course it's somewhat > > > awkward to pass this information into the transport-neutral CM API but > > > I think this can be worked around. > > > > I don't know if we need to mandate IP usage - it's up to the > > application. Any application that wants to have similar semantics > > to the way socket listens work (especially when bound to one of > > multiple IP addresses on a port) the application would have to > > define its private data to accommodate this. > > > > At the IB level, the contents of the private data are still opaque, > > even to the CM. 
The CM would only expose the ability to have it > > perform an initial triage of requests by doing binary comparisons > > over regions of private data. It doesn't know (or need to know) > > what the data represents - it only cares about finding a match (or > > not). The CM doesn't define any sort of policy here, and I don't > > think it should. It's just bytes to the CM, and it's doing a blind > > comparison without interpreting the contents. > > You need to consider what makes sense for *both* ib and iwarp. Keep in > mind that the correct API will allow a consumer to use ib and iwarp > devices transparently. In other words their will be one code path that > support both. I believe using the private data makes the most sense from the IB perspective. One could even argue that it is the only way to provide positive "getpeername" functionality. Use of the IB private data does not require identical use of private data in other technologies. > If we were to adopt your proposal, the consumer would need to perform > unnecessary operations on iWARP. It doesn't have to impact the client if there's some intermediate abstraction to isolate the client from the IB CM details (including private data use). > A transport neutral client would be forced to put IP information into > its CM private data on iWARP. > > Likewise, a transport neutral server would be forced to pass an > private data offset and binary blob to the listen API call on iWARP. > > Neither of these make sense. A higher-level CM abstraction could implement the policy of private data use when running on IB without the client's involvement. The end result still is that you end up with a wire protocol that needs to be documented so that someone without that exact CM abstraction knows where and how to format the private data as well as how to interpret it. If the IBTA defines something like this, all these issues go away. 
I don't know if the IBTA can define this without affecting existing protocols like SDP and iSER that already define how to encapsulate the source and destination information in the private data. Using the private data, either by the client or some IB-specific CM abstraction, will remove the need for any reverse lookups. A forward lookup to validate the incoming source GID to the source IP in the private data can validate the IP address. Performing a forward lookup via ARP is going to be a lot faster than ATS if the ARP entry already exists. On large fabrics, ARP is also going to scale better since there's not one single entity responsible for responding to every node's requests. > These API problems are secondary to the burden you would be placing on > the protocols. As has been mentioned in a previous email, extending > the current protocols to use this convention will require further > standardization and in some cases may not be compatible with their > current architecture. I think biting the bullet now on establishing these standards for applications using IP addressing over IB, whether in the IBTA or in each application, is going to give us the best long term result. - Fab From yaronh at voltaire.com Wed Aug 24 18:00:56 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Thu, 25 Aug 2005 04:00:56 +0300 Subject: [openib-general] RDMA connection and address translation API Message-ID: <35EA21F54A45CB47B879F21A91F4862F714207@taurus.voltaire.com> > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, August 24, 2005 7:29 PM > To: Yaron Haviv > Cc: James Lentini; Roland Dreier; openib-general at openib.org > Subject: Re: [openib-general] RDMA connection and address translation API > > > Yaron, has anyone raised all this in the IBTA WG? > I raised it about a year ago, but didn't really followed up on it At the time IBTA was also busy with other more urgent stuff (verb ext..) 
We are working with a few key IBTA members to resurface it, given the
need for an abstract CM.

See the following text, which was proposed a year ago (reproduced as is).
It is slightly different from your proposal but can be altered if needed.
It basically uses the SDP header and marks one of the fields (FlowC)
with 01 to indicate that it's not SDP; this way even SDP can use it.
It also covers a nice idea raised by MS & SUN: extend SDP to accept
PUT & GET operations for RDMA, so you get a BSD-like API with a few
additional calls rather than a totally new API like DAPL.

Establishing TCP/iWarp-like connections over InfiniBand
=======================================================

In order to emulate an iWarp connection, it is required to open an
InfiniBand RC connection and associate it with IP addresses and TCP
ports. In addition, protocols may transfer control/login packets before
the migration to RDMA mode; this requires exchanging the receive buffer
size and depth for initial usage (the ULPs will manage flow control
for the duration of the connection).

The mapping uses the same data structures already defined for connection
establishment in SDP (the IBTA Sockets Direct Protocol), which accomplish
the same goal of mapping TCP socket addressing to InfiniBand; the
non-relevant SDP fields are marked Reserved.
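The Hello private data layout of Figure 1 can be sketched as a plain C struct. This is only an illustration under stated assumptions: the struct and field names are made up here, multi-byte fields are assumed to be carried in network byte order, and IPVer is assumed to occupy the high nibble of its byte with FlowC in the low nibble (following the usual bit-0-is-most-significant drawing convention).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the 64-byte Hello private data (Figure 1). */
struct iwarp_emu_hello {
	uint8_t  mid;            /* 0 = Hello, 1 = Reply */
	uint8_t  rsvd0;
	uint16_t bufs;           /* network byte order (assumed) */
	uint32_t len;            /* network byte order (assumed) */
	uint8_t  rsvd1[8];
	uint8_t  majv_minv;      /* MajVer: high nibble, MinVer: low nibble */
	uint8_t  ipv_flowc;      /* IPVer: high nibble, FlowC: low nibble */
	uint16_t rsvd2;
	uint32_t des_rem_rcv_sz;
	uint32_t local_rcv_sz;
	uint16_t local_port;
	uint16_t rsvd3;
	uint8_t  src_ip[16];
	uint8_t  dst_ip[16];
};

/* The figure labels rows up to byte 64; check the struct agrees. */
_Static_assert(sizeof(struct iwarp_emu_hello) == 64, "Hello must be 64 bytes");
_Static_assert(offsetof(struct iwarp_emu_hello, src_ip) == 32, "Src IP offset");
_Static_assert(offsetof(struct iwarp_emu_hello, dst_ip) == 48, "Dst IP offset");

/* Extract the 4-bit IPVer field. */
static uint8_t hello_ipver(const struct iwarp_emu_hello *h)
{
	return h->ipv_flowc >> 4;
}

/* Extract the 4-bit FlowC field: 0 = transport owns flow control, 1 = ULP. */
static uint8_t hello_flowc(const struct iwarp_emu_hello *h)
{
	return h->ipv_flowc & 0x0f;
}
```

Every field sits on its natural alignment, so no packing attribute is needed for the size and offsets to match the figure.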
iWarp emulation CM Request (Hello) Private Data header

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
04 |      MID      |     Rsvd      |             bufs              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
08 |                              len                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
12 |                           Reserved                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |                           Reserved                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
20 | MajVer| MinVer| IPVer | FlowC |           Reserved            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
24 |                          DesRemRcvSz                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
28 |                          LocalRcvSz                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
32 |          Local Port           |           Reserved            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
36 |                        Src IP (127-96)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
40 |                        Src IP ( 95-64)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
44 |                        Src IP ( 63-32)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
48 |                        Src IP ( 31-00)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
52 |                        Dst IP (127-96)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
56 |                        Dst IP ( 95-64)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
60 |                        Dst IP ( 63-32)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
64 |                        Dst IP ( 31-00)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 1 CM Hello private data structure

iWarp emulation CM Response (HelloReply) Private Data header

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
04 |      MID      |     Rsvd      |             bufs              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
08 |                              len                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
12 |                           Reserved                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |                           Reserved                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
20 | MajVer| MinVer|   Reserved    |           Reserved            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
24 |                           ActRcvSz                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 2 CM Reply private data structure

All the fields mentioned have the same functionality as defined in SDP.
The MID field can accept only the values 0 and 1. A new field called
FlowC is added; it indicates whether the ULP is responsible for flow
control and verb API usage (unlike SDP).

MID - Message Identifier: 8 bits
    0 - Hello message
    1 - Reply message
    Other values are reserved

FlowC - Flow Control and verb owner: 4 bits
    0 - Transport owns flow control and may embed a 16-byte header in
        RC Sends (just as SDP does; SDP uses zero in that field)
    1 - ULP owns flow control and controls the entire Send buffer
        (e.g. iSER and NFS use their own headers in Send operations)
    Other values are reserved

From sean.hefty at intel.com Wed Aug 24 18:28:44 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 24 Aug 2005 18:28:44 -0700
Subject: [openib-general] RDMA connection and address translation API
In-Reply-To: <52y86r2a9w.fsf@cisco.com>
Message-ID:

>With this in mind, I believe that the connection API needs to be
>something more like the following:
>
>  rdma_resolve_address():
>    inputs: dest IP address, qos, npaths,
>            done callback, opaque context
>    done callback params: status, local RDMA device,
>            RDMA transport address, context
...
>  rdma_connect():
>    inputs: local QP, RDMA transport address, destination service,
>            private data, timeout, event callback, opaque context

Have we agreed that this is the functionality that we should be aiming
towards?

>    rdma_resolve_address(...);
>    /* wait for resolution */
>    ib_create_qp(...)
>        /* use device pointer we got from rdma_resolve_address() */

We need to insert in here:

	ib_modify_qp(...);  /* somehow uses address resolution... */
	ib_post_recvs(...);

>    rdma_connect(...); /* pass transport address we got from
>                          rdma_resolve_address() */
>    /* wait for connection to finish... */

Another possibility could be to add a list of receives to rdma_connect().

- Sean

From iod00d at hp.com Wed Aug 24 20:31:49 2005
From: iod00d at hp.com (Grant Grundler)
Date: Wed, 24 Aug 2005 20:31:49 -0700
Subject: [openib-general] ISER cleanup
In-Reply-To: <430C9FFA.30007@dbresearch.net>
References: <430C9FFA.30007@dbresearch.net>
Message-ID: <20050825033149.GF4793@esmail.cup.hp.com>

On Wed, Aug 24, 2005 at 12:27:38PM -0400, Sean Hubbell wrote:
> Just a thought, but you can use the gnu indent application to do this
> very easily (not sure if you did this, I just thought it might help if
> you have not). Here is a sample command:
>
> indent -kr --use-tabs -i2 -l80 -nhnl sourceFilename

Please use what is recommended in Documentation/CodingStyle:

	Now, again, GNU indent has the same brain-dead settings that GNU
	emacs has, which is why you need to give it a few command line
	options. However, that's not too bad, because even the makers of
	GNU indent recognize the authority of K&R (the GNU people aren't
	evil, they are just severely misguided in this matter), so you
	just give indent the options "-kr -i8" (stands for "K&R, 8
	character indents"), or use "scripts/Lindent", which indents in
	the latest style.

grant

From halr at voltaire.com Wed Aug 24 21:01:57 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Thu, 25 Aug 2005 07:01:57 +0300
Subject: [openib-general] Question on the best approach to debug an infiniband connection problem
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C16@taurus.voltaire.com>

Hi Sean,

Sorry for the slow response. I was in transit today until now.

> I was wondering if there is a "best practices" method to debug a
> possible infiniband connection.
There are FAQs on the OpenIB wiki https://openib.org/tiki/tiki-index.php There is one for IPoIB http://www.openib.org/docs/ipoib_faq.txt > I am currently trying to send a message > over infiniband ib0 interface and I continue to get transmit errors. > Minus going through and seeing if the port state is active, Is the port state active ? What are you running ? Is this OpenSM and IPoIB off the trunk or something else ? > I am at a loss to find out what the problem is. I did notice a lot of errors in > the /var/log/osm.log which I have listed below for today: Aug 24 08:19:10 [42FFF960] -> osm_report_notice: Reporting Generic Notice type:3 num:67 from LID:0x0001 GID:0xfe80000000000000,0x0005ad000003d269 Aug 24 08:19:10 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:19:10 [42FFF960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7. It appears that a join is failing for some reason. It doesn't say which group (MGID) this is. (I will add that into the log). The SM is receiving a join rather than a create request for a new multicast group. That might be OK depending on which group it is. 
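The two masks in the ERR 1B11 line can be compared with a single bit operation: the components the SM expected but did not receive are `expected & ~received`. A minimal sketch follows; the function names are illustrative and not part of OpenSM.

```c
#include <stdint.h>

/* Component bits that were expected but are absent from the request. */
static uint64_t missing_components(uint64_t received, uint64_t expected)
{
	return expected & ~received;
}

/* Count the set bits in a mask (portable loop, no compiler builtins). */
static unsigned int count_bits(uint64_t mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}
```

For the values in the log, `missing_components(0x10083, 0x130c7)` is `0x3044`, that is, four component bits (2, 6, 12 and 13) are present in the expected mask but not in the received one.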
Aug 24 08:19:10 [42FFF960] -> osm_vendor_send: RMPP 0 length 256 Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:19:14 [42FFF960] -> osm_report_notice: Reporting Generic Notice type:3 num:67 from LID:0x0001 GID:0xfe80000000000000,0x0005ad000003d269 Aug 24 08:19:14 [42FFF960] -> osm_report_notice: Reporting Generic Notice type:3 num:67 from LID:0x0001 GID:0xfe80000000000000,0x0005ad000003d269 Aug 24 08:19:16 [447FF960] -> umad_receiver: recv error Interrupted system call Aug 24 08:22:05 [AB441140] -> OpenSM Rev:openib-1.0.0 Aug 24 08:22:05 [AB441140] -> osm_opensm_init: Forcing single threaded dispatcher. It looks like OpenSM restarted here. If OpenSM is restarted currently, the IPoIB interface needs to be downed and then upped as client reregistration is not currently supported. Aug 24 08:22:05 [AB441140] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 Aug 24 08:22:05 [AB441140] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 Aug 24 08:22:05 [AB441140] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x5ad000003d269) as the default port. Aug 24 08:22:05 [AB441140] -> osm_vendor_bind: Binding to port 0x5ad000003d269. Aug 24 08:22:05 [AB441140] -> osm_vendor_bind: Binding to port 0x5ad000003d269. 
Aug 24 08:22:05 [42FFF960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0002 TID:0x0000000000000000 Aug 24 08:22:05 [42FFF960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0002 GID:0xfe80000000000000,0x0002c9010bec5320 Aug 24 08:22:06 [42FFF960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000000 Aug 24 08:22:06 [42FFF960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0005ad000003d269 Aug 24 08:22:12 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:22:12 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 Aug 24 08:22:12 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 -- Hal From sean.hefty at intel.com Wed Aug 24 23:33:48 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 24 Aug 2005 23:33:48 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and queryprovider methods In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B4C@mail2.ammasso.com> Message-ID: >It was placed in here to allow ULPs to filter >the device notifications by the transport type. The IPoIB >ULP, for example, doesn't care about iWARP devices. I was confusing myself. >This method is butt simple, however, doesn't scale very well >if you add additional transports. The alternative is to >add a 'filter' mask that allows ULPs to register for >device type events it is interested in. I am happy to do it >either way. I would keep your change as it is. ULPs that care about specific devices can filter devices in a more flexible ways than anything that we could provide in an API. 
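A minimal sketch of ULP-side filtering, assuming the ULP inspects each device in its own add callback. `struct rdma_dev_stub` is a stand-in carrying only the two `ib_device` fields that appear in the patch (`node_type`, `phys_port_cnt`), and `ulp_accepts_device` is a hypothetical name.

```c
/* Node types as extended by the patch quoted in this thread. */
enum ib_node_type {
	IB_NODE_CA = 1,
	IB_NODE_SWITCH,
	IB_NODE_ROUTER,
	IB_NODE_IWARP
};

/* Stand-in for the fields of struct ib_device used below. */
struct rdma_dev_stub {
	enum ib_node_type node_type;
	int               phys_port_cnt;
};

/*
 * Hypothetical per-ULP policy: an IPoIB-like ULP takes only InfiniBand
 * channel adapters that expose at least one port. Any other ULP can
 * apply whatever test it likes here, with no filter API in the core.
 */
static int ulp_accepts_device(const struct rdma_dev_stub *dev)
{
	if (dev->node_type != IB_NODE_CA)
		return 0;
	return dev->phys_port_cnt >= 1;
}
```

The design point is that the policy lives entirely in the ULP, so adding a new transport type requires no change to a shared registration interface.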
- Sean From sean.hefty at intel.com Wed Aug 24 23:50:07 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 24 Aug 2005 23:50:07 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and queryprovider methods In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B4D@mail2.ammasso.com> Message-ID: >> Why include the connection protocol as part of the verbs >> layer? Granted I >> haven't looked at the iWarp specs in a long time, but I don't remember >> connection establishment being part of the verbs. > >Connection management is not part of the RDMAC verbs, however, we need >some >way for transports to "hook in" to the CM. The other approach is to have >a separate registration mechanism for connection management verbs, but >this seemed a little bizarre, so we just extended the provider verbs. > >Ideas? I need to reacquaint myself with iWarp more, but I don't like the idea of adding CM calls as part of the verbs API, and in particular as part of a generic RDMA device structure. Does this suggest that each iWarp device driver will need to implement a connection establishment protocol? Isn't there a way to generalize that into a single iWarp CM module that can sit above multiple devices? How will connections between different devices be supported? I'm assuming that the Linux kernel will never permit an established connection to be offloaded onto a NIC. However it seems possible that a new iWarp connection could be done in a common way, with the result passed into the device through the modify QP call as the LLP stream. 
- Sean

From iod00d at hp.com Wed Aug 24 23:57:51 2005
From: iod00d at hp.com (Grant Grundler)
Date: Wed, 24 Aug 2005 23:57:51 -0700
Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISER initiator
In-Reply-To: <1124873898.3933.23.camel@r2d2>
References: <1124787534.27216.44.camel@r2d2> <20050823185309.GE1218@esmail.cup.hp.com> <1124873898.3933.23.camel@r2d2>
Message-ID: <20050825065751.GN4793@esmail.cup.hp.com>

On Wed, Aug 24, 2005 at 11:58:18AM +0300, Guy German wrote:
> What about the SRP implementation, which polls the cq in the cq's upcall
> context (interrupt in mthca). I guess that it improves performance, and
> if it is good enough for the SRP (and linux) I think it will be good for
> the iSER *initiator* as well.

Yes - that's probably a fair assumption for now.

> The thing is - even if I find it reasonable to do it this way (after
> calculating the estimated amount of time spent in the ISR), I am
> not sure it is a good scalable solution, because there can be several
> initiators and hcas per machine.

Yup - but you don't need to estimate. It can be precisely measured.
See "get_cycles()" and maybe the patch that Christoph Lameter posted
5-6 months ago:
	http://www.gelato.unsw.edu.au/archives/linux-ia64/0504/13877.html

(My advice is to read the whole thread to be aware of the issues with
this patch...but it should be good enough for the above measurements.)

FWIW, netperf TCP_STREAM over SDP was not close to saturating a single
CPU (1.5 GHz IA64). Two HCAs would probably saturate a single CPU.
While I would hope it never happens that two HCAs both have their
MSI-X interrupts pointed at the same CPU, I'm certain it could happen.
One could argue the sys admin will need to rebalance the interrupt
load manually by reassigning CPUs via /proc/irq/*/smp_affinity.

> The fact that the primitiveness of linux is not a major issue, makes it
> harder to decide on the right way to go here...

Preemption is a major issue for some uses.
But I'm skeptical it is for the initial clusters I expect people will use RDMA for. So I'm not going to worry about it for now. There are more important issues. hth, grant From mst at mellanox.co.il Thu Aug 25 01:48:09 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 25 Aug 2005 11:48:09 +0300 Subject: [openib-general] Re: RDMA connection and address translation API In-Reply-To: <52ll2qvrr1.fsf@cisco.com> References: <35EA21F54A45CB47B879F21A91F4862F714202@taurus.voltaire.com> <52ll2qvrr1.fsf@cisco.com> Message-ID: <20050825084809.GD22342@mellanox.co.il> Quoting r. Roland Dreier : > It would make perfect sense to take a couple of the reserved bits in > the CM REQ format and turn them into an "IP address present" field (a > couple of bits so we can distinguish between v4 and v6). When this > field is set, then the first (or last, or whatever) 32 bytes of the > private data would hold the source and destination IP address. Wouldnt it be better to use some bits in the service ID field for this? > Having this standardized also gives us the ability to deal with the > concerns around connections initiated in userspace. The kernel proxy > for the user CM can make sure that any REQs sent with the "IP address > present" field set actually has an IP assigned to the local system. > Remote systems would still need to treat CM messages from QPs other > than QP 1 as untrusted. Actually, it might already make sense to implement something like this for ucm: anything with service ID 0x0000 0000 0001 XXXX is SDP and should be kernel only. Does this make sense? > Of course for real security some stronger authentication is needed in > any case (even in the iWARP case the source IP can't be trusted; an > attacker could DOS the real owner of the IP, flood the switches MAC > tables so it becomes a hub, and then take over any IP it wants). 
> > The only unfortunate thing about all this is that the SDP Hello > message format is already frozen, and it seems a little too > specialized for generic use (eg we don't want a "Max Zcopy > Advertisements" field). It's somewhat ugly, but still possible to leave the IP address where it is in the SDP Hello message, in the middle of the private data field. Alternatively, special-casing SDP for the sake of backward compatibility would not be too bad. -- MST From mst at mellanox.co.il Thu Aug 25 01:51:36 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 25 Aug 2005 11:51:36 +0300 Subject: [openib-general] Re: RDMA connection and address translation API In-Reply-To: <52ll2qvrr1.fsf@cisco.com> References: <35EA21F54A45CB47B879F21A91F4862F714202@taurus.voltaire.com> <52ll2qvrr1.fsf@cisco.com> Message-ID: <20050825085136.GE22342@mellanox.co.il> Quoting r. Roland Dreier : > Of course for real security some stronger authentication is needed in > any case (even in the iWARP case the source IP can't be trusted; an > attacker could DOS the real owner of the IP, flood the switches MAC > tables so it becomes a hub, and then take over any IP it wants). I think you could get basically to the same level with IB by performing an additional ARP lookup on the IP address, and comparing the port GIDs. -- MST From hch at lst.de Thu Aug 25 01:49:26 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 25 Aug 2005 10:49:26 +0200 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <52ek8jvxya.fsf@cisco.com> References: <8E9D028761D8264D910612167E8457E8FA3B0A@mail2.ammasso.com> <524q9f1et9.fsf@cisco.com> <20050824205746.GA24447@lst.de> <52ek8jvxya.fsf@cisco.com> Message-ID: <20050825084926.GA2621@lst.de> On Wed, Aug 24, 2005 at 02:15:09PM -0700, Roland Dreier wrote: > Roland> Well, that's not what I would expect. 
Suppose I have a > Roland> device configured with local addresses 192.168.11.12 and > Roland> 192.168.98.99 and I > > Christoph> You never configure a device with local addresses. IP > Christoph> addresses are always a per-host attribute in Linux. > > I don't think this is really true. In some ways Linux behaves as if > IP addresses are per-host (eg ARP responses can go out any interface) > but really IP addresses are attached to an interface. Every struct > net_device has a struct in_device, and every struct in_device has a > list of struct in_ifaddrs for the device's IP addresses. This is correct, but the user-visible effect is what I said above. When you do an ARP query for any of the IP addresses of a Linux box you'll get a response even if that interface isn't on the network. Even if you don't think that's enough you can assign any number of IP and other networking addresses to a given device even formally, rendering the notion of an IP address <-> network device relation rather moot. From hch at lst.de Thu Aug 25 01:52:00 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 25 Aug 2005 10:52:00 +0200 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1F509@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1F509@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <20050825085200.GB2621@lst.de> On Wed, Aug 24, 2005 at 02:22:31PM -0700, Caitlin Bestler wrote: > Not if the host connects two disjoint networks and does not route > between them. Such a host should/may be configured to reject any > packet that arrives with a destination address that does not match > the expected destination address for the port it arrives upon. While you can configure a Linux system to reject such requests through a bunch of crude hacks, the default and fully RFC compliant behaviour is to always reply to ARP requests for any IP address assigned to the system.
RDMA CM implementations must work the same. From eitan at mtl001.openib.org Thu Aug 25 01:55:37 2005 From: eitan at mtl001.openib.org (Yael Kalka) Date: 25 Aug 2005 11:55:37 +0300 Subject: [openib-general] [PATCH] osm: osm_sa_path_record bug Message-ID: <5zoe7m2y5y.fsf@mtl066.yok.mtl.com> Hi Hal __osm_pr_rcv_check_mcast_dest called from osm_sa_path_record to check if the path record request in a request for multicast path record has a bug in it. In the destination lid check, the check compared host-order lid with network order. The following simple patch fixes this bug. Thanks Yael I tested the patch on : 2.6.11-1.1369_FC4smp Fedora Core release 4 (Stentz) Signed-off-by: Yael Kalka Index: opensm/osm_sa_path_record.c =================================================================== --- opensm/osm_sa_path_record.c (revision 3159) +++ opensm/osm_sa_path_record.c (working copy) @@ -1423,8 +1423,8 @@ __osm_pr_rcv_check_mcast_dest( if( comp_mask & IB_PR_COMPMASK_DLID ) { - if( cl_ntoh16( p_pr->dlid ) >= IB_LID_MCAST_START && - cl_ntoh16( p_pr->dlid ) <= IB_LID_MCAST_END ) + if( cl_ntoh16( p_pr->dlid ) >= IB_LID_MCAST_START_HO && + cl_ntoh16( p_pr->dlid ) <= IB_LID_MCAST_END_HO ) is_multicast = TRUE; else if( is_multicast ) osm_log( p_rcv->p_log, OSM_LOG_ERROR, From christian.guggenberger at rzg.mpg.de Thu Aug 25 02:01:51 2005 From: christian.guggenberger at rzg.mpg.de (Christian Guggenberger) Date: Thu, 25 Aug 2005 11:01:51 +0200 Subject: [openib-general] kernel crashes while using mvapich-gen2 over ib_uverbs Message-ID: <1124960511.12751.16.camel@bonnie.rzg.mpg.de> Hi, On a small, 2 node setup, I'd like to try some simple MPI programs with help of mvapich-gen2 (1.0). Both nodes are Dual-Opteron based, with a 23108 tavor each, directly connected. (no switch). Opensm is running on one node. Things like IPOIB seem to work reliable. Using 2.6.12.5 (and svn co of Aug, 24th), all I get after starting a simple 2 CPU mpi programm is a hard crash of that node. 
(no logs, no oops, node not pingable, nothing at the console, no SYSRQ available). I tried to go ahead with plain 2.6.13-rc7 (which already contains ib_uverbs). This is what I get then: test[12173] general protection rip:2aaaab219265 rsp:7fffffcc7c50 error:0 test[12174] general protection rip:2aaaab219265 rsp:7fffff980b90 error:0 general protection fault: 0000 [1] SMP CPU 1 Modules linked in: ib_ipoib ib_sa ib_ucm ib_cm ib_uverbs ib_umad joydev sg st sr_mod floppy ipv6 ib_mthca ib_mad ib_core hw_random af_packet evdev tg3 xfs exportfs dm_snapshot dm_mod ext3 jbd Pid: 12173, comm: test Not tainted 2.6.13-rc7 RIP: 0010:[] {:ib_uverbs:__ib_umem_release+67} RSP: 0018:ffff8100d9a1dc48 EFLAGS: 00010246 RAX: 6b6b6b6b6b6b6b6b RBX: ffff8100e2fffcf0 RCX: 0000000000000000 RDX: 000000000000007f RSI: ffff81007dccc018 RDI: 6b6b6b6b6b6b6b6b RBP: 0000000000000000 R08: 0000000000000000 R09: ffff8100e2fffcf0 R10: ffff8100d9a1dc7f R11: 0000000000003a98 R12: ffff81007dccc000 R13: ffff8100e36c92f0 R14: 0000000000000001 R15: ffff81007fa8e000 FS: 00002aaaab2160a0(0000) GS:ffffffff80571880(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000004204000 CR3: 000000007e324000 CR4: 00000000000006e0 Process test (pid: 12173, threadinfo ffff8100d9a1c000, task ffff8100e3c3c850) Stack: ffff8100e2fffcf0 ffff8100e36c9318 6b6b6b6b6b6b6b6b ffff8100e2fffcf0 ffff8100e36c92f0 ffff8100e36c92d8 ffff81007fc4a528 ffff81007b2144a8 ffff810037cfea28 ffffffff881e0eff Call Trace:{:ib_uverbs:ib_umem_release_on_close+31} {:ib_uverbs:ib_uverbs_close+453} {__fput+178} {filp_close+110} {put_files_struct+115} {do_exit+511} {__dequeue_signal+501} {sys_exit_group+0} {get_signal_to_deliver+1415} {do_signal+159} {specific_send_sig_info+222} {force_sig_info+187} {do_general_protection+159} {retint_signal+61} Code: 48 8b 38 e8 25 b0 f3 f7 41 3b 6c 24 10 7d 38 41 8b 45 20 48 RIP {:ib_uverbs:__ib_umem_release+67} RSP <1>Fixing recursive fault but reboot is needed! 
Any hints regarding a working combination of kernel + openib revision with respect to mvapich-gen2 would be much appreciated. Thanks in advance, - Christian From yael at mellanox.co.il Thu Aug 25 02:10:43 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Thu, 25 Aug 2005 12:10:43 +0300 Subject: [openib-general] developement on osm new branch Message-ID: <506C3D7B14CDD411A52C00025558DED6089DBEFD@mtlex01.yok.mtl.com> Hello Hal, We are currently working on and developing the osm-1.8.0-merge branch. Since you are doing the merge - how do you want to keep track of our changes? If you do the merge according to a specific revision of the branch, we can easily supply patches on that branch that will include our modifications, and can be reviewed as well. Yael From mst at mellanox.co.il Thu Aug 25 02:56:49 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 25 Aug 2005 12:56:49 +0300 Subject: [openib-general] Heads up on SDP RDMA support for synchronous socket operations In-Reply-To: <42F108FE.7020309@keysounds.co.uk> References: <20050803135546.GQ15300@mellanox.co.il> <42F108FE.7020309@keysounds.co.uk> Message-ID: <20050825095649.GI22342@mellanox.co.il> Quoting r. Steve Wooding : > I look forward to trying it out when it's ready. You can now download preliminary patches that add zero-copy RDMA (zcopy) support for synchronous send operations. The receive side still performs a copy at this stage. These patches are not yet intended for trunk inclusion. Todo: - implement zero-copy RDMA on the receive side - bugfixes - cleanup Since the receive side does not do zcopy at this stage, for a single connection this results in a lower bandwidth than without zcopy, but also much lower CPU utilization on the send side.
Here's an example run: # ( export SIMPLE_LIBSDP=1; export LD_PRELOAD=libsdp.so; /usr/local/bin/netperf -H 11.4.8.155 -f M -C -c -- -m 6553600 -M 6553600 ) TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.8.155 (11.4.8.155) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. MBytes /s % S % S us/KB us/KB 124928 124928 6553600 10.00 414.91 2.62 51.36 0.123 2.418 How to use: svn co https://openib.org/svn/gen2/branches/zcopy/patches apply patches/send_zcopy.patch and then patches/zcopy1.patch Thanks, -- MST From danb at voltaire.com Thu Aug 25 03:10:49 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Thu, 25 Aug 2005 13:10:49 +0300 Subject: [openib-general] RE: [iSER]making error of the new drop of iSER - syntax error Message-ID: Please update to svn 3128 (latest). The makfile was heavily modified. The code uses the kdat.h by default. Please note that you must upgrade your Linux to a 2.6 based distribution and install a 2.6.12 or better kernel. The ISER initiator does not work, does not even compile on a 2.4 kernel. Dan > -----Original Message----- > From: Ian Jiang [mailto:ianjiang91 at hotmail.com] > Sent: Thursday, August 25, 2005 12:08 PM > To: openib-general at openib.org > Cc: Dan Bar Dov > Subject: [iSER]making error of the new drop of iSER - syntax error > > I am using the redhat AS3, kernel 2.4.21-20.EL. I failed to > make the iSER/initiator (r3174). I got the source codes at > https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/ > ulp/iser/ > and write the Makefile and make.conf according the old ones > of the previous iSER/initiator. > I made a little modification to use the header file of kdat.h > of InfiniBand Any suggestion is appriciated! > > =================================================== > > Makefile > > ========== > > # Makefile for iSER r3174 > > MAKE_CONF_PATH=. 
> > MODNAME := iser_dm.o > > #INCLUDES = -I./include > > SOURCES = iser_conn.c \ > iser_dto.c \ > iser_initiator.c \ > iser_lkdapl.c \ > iser_memory.c \ > iser_mod.c \ > iser_pdu.c \ > iser_task.c \ > iser_utils.c > > include $(MAKE_CONF_PATH)/make.conf > > ================================================================ > > make.conf > > ========== > > # target platform > # --------------- > TARGET = x86_64 > > # linux kernel version > # -------------------- > KERNEL := $(shell uname -r | cut -d'.' -f1,2) KMAJOR=$(shell > echo $(KERNEL) | cut -d. -f1) KMINOR=$(shell echo $(KERNEL) | > cut -d. -f2) KMINIMINOR=$(shell echo $(KERNEL) | cut -d. -f3) > KMAJOR_MINOR=$(KMAJOR).$(KMINOR) > > DAT_PATH=$(KERNEL_SRC)/drivers/infiniband/ib_dapl/dat/include/dat > > INCLUDES += -I$(KERNEL_SRC)/include -I$(KERNEL_SRC)/asm > INCLUDES += -I$(DAT_PATH) OBJECTS = $(SOURCES:.c=.o) KOBJECTS > = $(SOURCES:.c=.ko) OBJECTSCMD = $(SOURCES:.c=.o*) > > INSTDIR = modules > > # platform-specific > # ----------------- > ifeq ($(TARGET),x86_64) > CC=gcc > LD=ld > LDFLAGS = -m elf_$(ARCH) > ARCH=x86_64 > FLAVOR=default > endif > > ifeq ($(TARGET),i386) > CC=gcc > LD=ld > LDFLAGS = -m elf_$(ARCH) > ARCH=i386 > FLAVOR=default > endif > > # kernel-specific > # --------------- > ifeq ($(KERNEL),2.4) > KERNEL_EXT = .21-20.EL > endif > > ifeq ($(KERNEL),2.6) > KERNEL_EXT = .4-52 > endif > > KERNEL_VER = $(KERNEL)$(KERNEL_EXT) > KERNEL_SRC = /usr/src/linux-$(KERNEL_VER) > > > .PHONY: modules clean help > > > # all rules > # --------- > all: modules > > # install rules > # ------------- > #install: $(MODULES_DIR) $(MODULES_DIR)/$(MODNAME) # > #$(MODULES_DIR): > # -mkdir -p $(MODULES_DIR) > # > #$(MODULES_DIR)/$(MODNAME): $(MODNAME) > # @echo "cp $(MODNAME) $(MODULES_DIR)/$(MODNAME)" > # @cp --remove-destination $(MODNAME) $(MODULES_DIR) > > > # clean rules > # ----------- > clean: > rm -f $(OBJECTS) $(KOBJECTS) $(OBJECTSCMD) $(MODNAME) -rm > -f ./$(INSTDIR)/$(MODNAME:.o=.ko) -rm -f > 
./$(INSTDIR)/$(MODNAME) rm -f *.mod.c *.mod.o .*.o.cmd > .*.ko.cmd rm -rf .tmp_versions > > # help rules > # ---------- > help: > @echo "This makefile is controlled by two variable:" > @echo "KERNEL=(2.4 | 2.6) - automatically computed based > kernel sources" > @echo "TARGET=(i386 | x86_64) - default is x86_64" > @echo "common targets:" > @echo " make modules - to generate a module" > # @echo " make kernel-links - fixes kernel links" > @echo " make clean - cleanses" > # @echo " make doc - creates a doxygen documentation" > > # make rule for 2.4 kernels > # ------------------------- > ifeq ($(KMAJOR_MINOR),2.4) > CFLAGS += $(INCLUDES) $(MODCFLAGS) > #CFLAGS += $(INCLUDES) $(MODCFLAGS) -O2 -Wall -Werror > > LDFAGS += -r > > # modules rules > # ------------- > modules: $(MODNAME) > $(MODNAME): $(OBJECTS) > @echo -e "\n>>> Linking module [$(MODNAME)]" > $(LD) $(LDFLAGS) -o $(MODNAME) $(OBJECTS) > > > endif > > ===================================================================== > > errors > > ======== > > > > > In file included from > /usr/src/linux-2.4.21-20.EL/include/asm/smp.h:14, > from > /usr/src/linux-2.4.21-20.EL/include/linux/smp.h:16, > from > /usr/src/linux-2.4.21-20.EL/include/linux/sched.h:24, > from iser.h:38, > from iser_conn.c:34: > /usr/src/linux-2.4.21-20.EL/include/asm/fixmap.h:53: syntax > error before "pgprot_t" > In file included from > /usr/src/linux-2.4.21-20.EL/include/asm/smp.h:16, > from > /usr/src/linux-2.4.21-20.EL/include/linux/smp.h:16, > from > /usr/src/linux-2.4.21-20.EL/include/linux/sched.h:24, > from iser.h:38, > from iser_conn.c:34: > /usr/src/linux-2.4.21-20.EL/include/asm/mpspec.h:190: syntax > error before "id" > /usr/src/linux-2.4.21-20.EL/include/asm/mpspec.h:191: syntax > error before "address" > /usr/src/linux-2.4.21-20.EL/include/asm/mpspec.h:194: syntax > error before "id" > /usr/src/linux-2.4.21-20.EL/include/asm/mpspec.h:195: syntax > error before "bus_irq" > In file included from > 
/usr/src/linux-2.4.21-20.EL/include/asm/smp.h:20, > from > /usr/src/linux-2.4.21-20.EL/include/linux/smp.h:16, > from > /usr/src/linux-2.4.21-20.EL/include/linux/sched.h:24, > from iser.h:38, > from iser_conn.c:34: > /usr/src/linux-2.4.21-20.EL/include/asm/apic.h:86: syntax > error before "unsigned" > In file included from > /usr/src/linux-2.4.21-20.EL/include/linux/sched.h:24, > from iser.h:38, > from iser_conn.c:34: > /usr/src/linux-2.4.21-20.EL/include/linux/smp.h:31: syntax > error before '(' > token > In file included from > /usr/src/linux-2.4.21-20.EL/include/linux/sched.h:30, > from iser.h:38, > from iser_conn.c:34: > /usr/src/linux-2.4.21-20.EL/include/linux/pid.h:18: field > `task_list' has incomplete type > /usr/src/linux-2.4.21-20.EL/include/linux/pid.h:19: field > `hash_chain' has incomplete type > /usr/src/linux-2.4.21-20.EL/include/linux/pid.h:24: field > `pid_chain' has incomplete type > /usr/src/linux-2.4.21-20.EL/include/linux/pid.h:36: syntax > error before '(' > token > /usr/src/linux-2.4.21-20.EL/include/linux/pid.h:38: syntax > error before '(' > token > > ... > > .... 
> > Ian Jiang > > ianjiang91 at hotmail.com > ---- > Computer Architecture Laboratory > Institute of Computing Technology > Chinese Academy of Sciences > Beijing,P.R.China > Zip code: 100080 > Tel: +86-10-62564394(office) > > _________________________________________________________________ > Free download of MSN Explorer: http://explorer.msn.com/lccn/ > > From guyg at voltaire.com Thu Aug 25 03:05:37 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 25 Aug 2005 13:05:37 +0300 Subject: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator In-Reply-To: <20050825065751.GN4793@esmail.cup.hp.com> References: <1124787534.27216.44.camel@r2d2> <20050823185309.GE1218@esmail.cup.hp.com> <1124873898.3933.23.camel@r2d2> <20050825065751.GN4793@esmail.cup.hp.com> Message-ID: <1124964337.6584.6.camel@r2d2> > > The thing is - even if I find it reasonable to do it this way (after > > calculating the estimated amount of time spent in the ISR), I am > > not sure it is a good scalable solution, because there can be several > > initiators and hcas per machine. > > Yup - but you don't need to estimate. It can be precisely measured. > See "get_cycles()" and maybe the patch that Christoph Lameter > posted 5-6 months ago: > http://www.gelato.unsw.edu.au/archives/linux-ia64/0504/13877.html > > (My advice is to read the whole thread to be aware of the issues > with this patch...but it should be good enough for the above > measurements.) Thanks for the pointer. It can give us the time frames we are talking about (even though we need to consider the worst case scenario - which is a theoretical thing). > Preemption is a major issue for some uses. But I'm skeptical it is for > the initial clusters I expect people will use RDMA for. So I'm not > going to worry about it for now. There are more important issues. Fair enough.
Thanks, Guy From halr at voltaire.com Thu Aug 25 04:26:29 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 25 Aug 2005 14:26:29 +0300 Subject: [openib-general] RE: developement on osm new branch Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C18@taurus.voltaire.com> Hi Yael, > We are currently working on and developing the osm-1.8.0-merge branch. > Since you are doing the merge - how do you want to keep track of our changes? > If you do the merge according to a specific revision of the branch, we can easily > supply patches on that branch that will include our modifications, and can be reviewed > as well. I see the changes so I think I can deal with this. I don't think anything special needs to be done. Thanks for the heads up. -- Hal From alexn at voltaire.com Thu Aug 25 05:22:01 2005 From: alexn at voltaire.com (Alex Nezhinsky) Date: Thu, 25 Aug 2005 15:22:01 +0300 Subject: [openib-general] RE: [iSER]question about iSER code Message-ID: Wang Xigui wrote: > This is different from the previous draft. That's what I meant:) > Is it possible to establish the connection in IPoIB mode, > and if the RDMAExtensions key is not negotiated to Yes, then > establish a new iSER connection? If opening an iSER connection is what you want, why wouldn't you try to open an iSER connection from the start? If the question is how do you locate targets (or ports) which provide iSER storage services, then it's a separate question to be answered by using and/or augmenting SendTargets, SLP, iSNS and related discovery mechanisms. There have been some discussions of it in IETF-IPS. > And some questions: I was confused with the > 'Allocate_Connection_Resources'. It allocates resources for iSER > connection. But the connection has been established at the beginning. The purpose of 'Allocate_Connection_Resources' is to allocate resources, not to establish a connection.
Some allocation/operation parameters may depend on the outcome of the login phase (negotiated values of some iscsi/iser operational parameters), so it is ok to call it even if the connection is already established. Note that the login phase has some very strict limitations on what can actually be sent or received, so 'Allocate_Connection_Resources' actually gives you a chance to do all necessary preparations for the "real" work. Alexander Nezhinsky From halr at voltaire.com Thu Aug 25 05:24:48 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 25 Aug 2005 15:24:48 +0300 Subject: [openib-general] RE: [PATCH] osm: trivial: fix vendor includes Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C1F@taurus.voltaire.com> Thanks. Applied. From jlentini at netapp.com Thu Aug 25 05:40:41 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 08:40:41 -0400 (EDT) Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and query provider methods In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B4B@mail2.ammasso.com> References: <8E9D028761D8264D910612167E8457E8FA3B4B@mail2.ammasso.com> Message-ID: On Wed, 24 Aug 2005, Tom Tucker wrote: > > > This patch is against the iWARP branch. It adds CM related > methods to the ib_device structure as well as simple versions > of the low level port, gid, pkey, etc... query methods. > > I also added printks to the provider methods so I could track the > loading process. These will be removed when the driver is stable. > > Please take a look and let me know what you think. I'll check > this in to the iWRAP branch tomorrow if no one expresses dismay. > > Please feel free to send me patches to this patch if you see > something. 
> > Signed-off-by: Tom Tucker > > Index: include/ib_verbs.h > =================================================================== > --- include/ib_verbs.h (revision 3120) > +++ include/ib_verbs.h (working copy) > @@ -43,6 +43,7 @@ > > #include > #include > +#include > > #include > #include > @@ -59,7 +60,8 @@ > enum ib_node_type { > IB_NODE_CA = 1, > IB_NODE_SWITCH, > - IB_NODE_ROUTER > + IB_NODE_ROUTER, > + IB_NODE_IWARP Should it be IB_NODE_RNIC instead since the other values are for hardware types? > }; From guyg at voltaire.com Thu Aug 25 06:13:15 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 25 Aug 2005 16:13:15 +0300 Subject: [openib-general] Connection Manager Abstraction proposition - header file In-Reply-To: <469958e0050824112851e8109d@mail.gmail.com> References: <20050824122107.GA2323@voltaire.com> <469958e0050824112851e8109d@mail.gmail.com> Message-ID: <1124975595.6584.17.camel@r2d2> Hi Caitlin, > The events need to distinguish between Rejected and Peer Rejected. > For example, a TCP connection could be rejected by the peer stack > for lack of capacity to relay the Connection Request to the peer for > approval. That is neither a peer rejection nor an "unreachable" event. OK. I wanted to make it as simple as possible, but if you think that consumers could find it useful (for their connection flow) - it should be included. > This would need to be based on the remote_address *and* QoS. > For example there could be two devices that reach the same > destination network, but with different speeds. OK. I added it. There can still be more than one device (even for the same qos), but that's a different matter. > There could be an option to listen on a specific address as well.
I added sockaddr to the listen call Thanks, Guy From shubbell at dbresearch.net Thu Aug 25 05:41:44 2005 From: shubbell at dbresearch.net (Sean Hubbell) Date: Thu, 25 Aug 2005 08:41:44 -0400 Subject: [openib-general] Question on the best approach to debug aninfiniband connection problem In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175C16@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175C16@taurus.voltaire.com> Message-ID: <430DBC88.6000709@dbresearch.net> >Is the port state active ? > > The port is active for port 1 and down for port 2. Port 2 is not connected. > >What are you running ? Is this OpenSM and IPoIB off the trunk or something else ? > > > >>I am at a loss to find out what the problem is. I did notice a lot of errors in >> the /var/log/osm.log which I have listed below for today: >> >> > > > Yes, I guess I should have mentioned that. I am running cAos 2.0 with the openib package along with the opensm that comes with openib. I am also trying to run over IPoIB. >Aug 24 08:19:10 [42FFF960] -> osm_report_notice: Reporting Generic >Notice type:3 num:67 from LID:0x0001 >GID:0xfe80000000000000,0x0005ad000003d269 >Aug 24 08:19:10 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 >Aug 24 08:19:10 [42FFF960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = >SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, >expected comp mask = 0x00000000000130c7. >It appears that a join is failing for some reason. It doesn't say which group >(MGID) this is. (I will add that into the log). > >The SM is receiving a join rather than a create request for >a new multicast group. That might be OK depending on which group it is. 
> >Aug 24 08:19:10 [42FFF960] -> osm_vendor_send: RMPP 0 length 256 >Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 >Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 >Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 >Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112 >Aug 24 08:19:14 [42FFF960] -> osm_report_notice: Reporting Generic >Notice type:3 num:67 from LID:0x0001 >GID:0xfe80000000000000,0x0005ad000003d269 >Aug 24 08:19:14 [42FFF960] -> osm_report_notice: Reporting Generic >Notice type:3 num:67 from LID:0x0001 >GID:0xfe80000000000000,0x0005ad000003d269 >Aug 24 08:19:16 [447FF960] -> umad_receiver: recv error Interrupted >system call >Aug 24 08:22:05 [AB441140] -> OpenSM Rev:openib-1.0.0 >Aug 24 08:22:05 [AB441140] -> osm_opensm_init: Forcing single threaded >dispatcher. > >It looks like OpenSM restarted here. If OpenSM is restarted currently, the IPoIB >interface needs to be downed and then upped as client reregistration is not currently >supported. > > Yes, from the 4.5 hours I spent looking yesterday and with looking at the ARP table, this makes sense. What I ended up doing to fix it is to bring down ib0 and then bring it back up. After a little while when I started to try and ping, things were back to working. I will have to say that I was very concerned with our applications running using IPoIB, but after you mentioned this and after what I saw, I think we will be ok. Thank you for your response. Sean From guyg at voltaire.com Thu Aug 25 06:25:02 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 25 Aug 2005 16:25:02 +0300 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: References: Message-ID: <1124976302.6584.19.camel@r2d2> On Wed, 2005-08-24 at 18:28 -0700, Sean Hefty wrote: > Another possibility could be to add a list of receives to rdma_connect().
I added this to both connect and accept calls Guy From swise at ammasso.com Thu Aug 25 06:42:56 2005 From: swise at ammasso.com (Steve Wise) Date: Thu, 25 Aug 2005 08:42:56 -0500 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andqueryprovider methods In-Reply-To: Message-ID: > I need to reacquaint myself with iWarp more, but I don't like > the idea of adding > CM calls as part of the verbs API, and in particular as part > of a generic RDMA > device structure. > > Does this suggest that each iWarp device driver will need to > implement a > connection establishment protocol? Isn't there a way to > generalize that into a > single iWarp CM module that can sit above multiple devices? How will > connections between different devices be supported? > The Ammasso 1100 does do 100% connection setup. That's why we're pushing connection establishment verbs into the device struct. IMO, these functions are analogous to the process_mad function in the ib_device structs, which has no meaning to an iwarp device. So I think we have to admit up front, that the ib_device struct really has InfiniBand-specific verb functions as well as iWARP-specific verb functions, and that's ok. (or maybe not :-) > I'm assuming that the Linux kernel will never permit an > established connection > to be offloaded onto a NIC. However it seems possible that a > new iWarp > connection could be done in a common way, with the result > passed into the device > through the modify QP call as the LLP stream. > > - Sean Assuming each RNIC supported some raw way to send and receive ethernet frames, then you could implement TCP, IP, ICMP, ARP et al. as a common stack to set up connections. I don't think we want to do this? BTW: We couldn't support this model with our 1100 card... Stevo.
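A minimal sketch of the point made above, that one device structure can carry both InfiniBand-specific and iWARP-specific entry points. Everything here is hypothetical: the names rdma_dev_sketch, sketch_connect, sketch_process_mad and the two sample devices are illustration only and are not taken from the OpenIB tree or from the patch under discussion. The only assumption is that a provider leaves the entry points it cannot implement as NULL and that a core-layer wrapper reports -ENOSYS for them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a transport-neutral device struct. */
struct rdma_dev_sketch {
	const char *name;
	/* InfiniBand-only entry point; an iWARP provider leaves it NULL. */
	int (*process_mad)(struct rdma_dev_sketch *dev, int port);
	/* iWARP-only entry point; an InfiniBand provider leaves it NULL. */
	int (*connect)(struct rdma_dev_sketch *dev, unsigned int dst_ip);
};

/* Pretend provider method: the adapter performs connection setup itself. */
int sketch_rnic_connect(struct rdma_dev_sketch *dev, unsigned int dst_ip)
{
	(void)dev;
	(void)dst_ip;
	return 0;
}

/* Pretend provider method: the HCA driver handles a MAD locally. */
int sketch_hca_process_mad(struct rdma_dev_sketch *dev, int port)
{
	(void)dev;
	(void)port;
	return 0;
}

/* Core-layer wrappers: return -ENOSYS when the provider lacks the verb. */
int sketch_connect(struct rdma_dev_sketch *dev, unsigned int dst_ip)
{
	assert(dev != NULL);
	if (!dev->connect)
		return -ENOSYS;
	return dev->connect(dev, dst_ip);
}

int sketch_process_mad(struct rdma_dev_sketch *dev, int port)
{
	assert(dev != NULL);
	if (!dev->process_mad)
		return -ENOSYS;
	return dev->process_mad(dev, port);
}

/* Two sample devices, each filling in only the verbs that make sense. */
struct rdma_dev_sketch sketch_rnic = {
	.name = "rnic",
	.connect = sketch_rnic_connect,
};

struct rdma_dev_sketch sketch_hca = {
	.name = "hca",
	.process_mad = sketch_hca_process_mad,
};
```

With this layout a consumer calls the wrappers rather than the function pointers, so an unsupported verb shows up as an error code instead of a NULL dereference. Whether that is preferable to a separate iWARP CM module is exactly the open question in this thread.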
From halr at voltaire.com Thu Aug 25 06:55:30 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 25 Aug 2005 16:55:30 +0300 Subject: [openib-general] Question on the best approach to debug aninfiniband connection problem Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C26@taurus.voltaire.com> Hi again Sean, > Yes, from the 4.5 hours I spent looking yesterday and with looking at > the ARP table, this makes sense. What I ended up doing to fix it is to > bring down ib0 and then bring it back up. After a little while when I > started to try and ping, things were back to working. I will have to say > that I was very concerned with our applications running using IPoIB, but > after you mentioned this and after what I saw, I think we will be ok. Once OpenSM is upgraded to 1.8.0 which is in progress, client reregistration will be requested. It will then be a matter of implementing this in or on behalf of the ULPs which do or cause SA registrations like IPoIB. -- Hal From mst at mellanox.co.il Thu Aug 25 07:08:03 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 25 Aug 2005 17:08:03 +0300 Subject: [openib-general] Re: Question on the best approach to debug aninfiniband connection problem In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175C26@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175C26@taurus.voltaire.com> Message-ID: <20050825140803.GR22342@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: RE: Question on the best approach to debug aninfiniband connection problem > > Hi again Sean, > > Once OpenSM is upgraded to 1.8.0 which is in progress, client > reregistration will be requested. It will then be a matter of > implementing this in or on behalf of the ULPs which do or cause SA > registrations like IPoIB. I am running OpenSM from branch already. As far as I can tell, any Set method already forces reregistration in IPoIB, so ULPs support this.
There seems to be some bug related to local MAD handling:
if opensm is running on node A, and opensm is restarted,
all nodes will re-register in the multicast group with opensm,
except for node A itself, which has to be downed and upped manually.

-- MST

From guyg at voltaire.com Thu Aug 25 07:06:57 2005
From: guyg at voltaire.com (Guy German)
Date: Thu, 25 Aug 2005 17:06:57 +0300
Subject: [openib-general] cma header - change some things according to the list feedback
Message-ID: <20050825140657.GA2991@voltaire.com>

This is the header file, incorporating some of the feedback received
from the list.

According to this suggestion, the qp is modified to init/rtr/rts in
the cm abstraction. The connection is done with a synchronous call to
get a device for qp creation. I think it can also be done the way
Roland suggested, but I still believe this is simpler for consumers
(especially for the iwarp oriented ones).

There are still unresolved issues being discussed on the list, but those
are mainly on the implementation side. I would like to hear the list's
opinion on the API, because I'm sure there are loose ends on the API
side as well.

If someone has a totally different suggestion, please post it to the
list for review.

Thanks,
Guy

/*
 * Copyright (c) 2005 Voltaire Inc. All rights reserved.
 *
 * This Software is licensed under one of the following licenses:
 *
 * 1) under the terms of the "Common Public License 1.0" a copy of which is
 *    available from the Open Source Initiative, see
 *    http://www.opensource.org/licenses/cpl.php.
 *
 * 2) under the terms of the "The BSD License" a copy of which is
 *    available from the Open Source Initiative, see
 *    http://www.opensource.org/licenses/bsd-license.php.
 *
 * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
 *    copy of which is available from the Open Source Initiative, see
 *    http://www.opensource.org/licenses/gpl-license.php.
 *
 * Licensee has the right to choose one of the above licenses.
 *
 * Redistributions of source code must retain the above copyright
 * notice and one of the license notices.
 *
 * Redistributions in binary form must reproduce both the above copyright
 * notice, one of the license notices in the documentation
 * and/or other materials provided with the distribution.
 *
 */

/*
 * This header file is a preliminary proposition for a connection manager
 * abstraction layer (cma) for IB and iwarp
 *  - there is an assumption that iwarp uses the same openib qp terminology in
 *    the rest of the verbs, and the only place that needs abstraction is the cm.
 *  - This proposition assumes that the address translation is done in the cma
 *    layer.
 *  - The cma also modifies the qp states to init/rtr/rts and error as needed.
 *  - for calling accept/reject or disconnect on the passive side you need to
 *    use the cma handle received in the ib_cma_listen cb.
 *  - cma_id is created when calling connect or listen and destroyed when
 *    receiving disconnected/rejected/unreachable events on either the active
 *    side (connect cb) or the passive side (accept cb)
 */

#ifndef IB_CMA_H
#define IB_CMA_H

#include

enum ib_cma_event {
	IB_CMA_EVENT_ESTABLISHED,
	IB_CMA_EVENT_REJECTED,
	IB_CMA_EVENT_NON_PEER_REJECTED,
	IB_CMA_EVENT_DISCONNECTED,
	IB_CMA_EVENT_UNREACHABLE
};

enum ib_qos {
	IB_QOS_BEST_EFFORT = 0,
	IB_QOS_HIGH_THROUGHPUT = (1 << 0),
	IB_QOS_LOW_LATENCY = (1 << 1),
	IB_QOS_ECONOMY = (1 << 2),
	IB_QOS_PREMIUM = (1 << 3)
};

enum ib_connect_flags {
	IB_CONNECT_DEFAULT_FLAG = 0x00,
	IB_CONNECT_MULTIPATH_FLAG = 0x01
};

/*
 * for ib_cma_get_src_ip - ib_cma_id will have to include
 * the path data received in the request handler
 */
union ib_cma_id {
	struct ib_cm_id *cm_id;
	u32 iwarp_id;
};

typedef void (*ib_cma_rarp_handler)(struct sockaddr *src_ip, void *context);
typedef void (*ib_cma_ac_handler)(enum ib_cma_event event, void *context);
typedef void (*ib_cma_event_handler)(enum ib_cma_event event, void *context,
				     void *private_data);
typedef void (*ib_cma_listen_handler)(union ib_cma_id *cma_id,
				      struct ib_device *device,
				      void *private_data, void *context);

struct ib_cma_conn {
	struct ib_qp *qp;
	struct ib_qp_attr *qp_attr;
	struct sockaddr *dst_ip;
	__be64 service_id;
	struct ib_recv_wr *recv_wr;
	u32 num_recv_wr;
	void *context;
	ib_cma_event_handler cma_event_handler;
	const void *private_data;
	u8 private_data_len;
	u32 timeout;
	enum ib_qos qos;
	enum ib_connect_flags connect_flags;
};

/**
 * ib_cma_get_device - Returns the device to be used according to
 *   the destination ip address (this can be determined according
 *   to the local routing table). Call this function before
 *   creating the qp. If using link-local IPv6 addresses
 *   there is no need to call this function.
 * @remote_address: The destination address for connection
 * @qos: Quality of service requested for the connection
 * @device: The device to use (returned by the function)
 */
int ib_cma_get_device(struct sockaddr *remote_address,
		      enum ib_qos qos, struct ib_device **device);

/**
 * ib_cma_connect - this is the connect request function, called by
 *   the active side. The consumer registers an upcall that will be
 *   initiated by the cma with an appropriate connection event
 *   notification (established/rejected/disconnected etc)
 * @cma_conn: This structure contains the following connection parameters:
 *   @qp: qp for establishing the connection
 *   @qp_attr: only relevant attributes are used
 *   @dst_ip: destination ip address
 *   @service_id: destination service id (port)
 *   @recv_wr: An array of work requests to post on the receive queue before
 *     sending the RTU
 *   @num_recv_wr: recv_wr array length - number of buffers to post recv
 *   @context: context to be returned in the callback
 *   @cma_event_handler: the upcall function for the active side
 *   @private_data: private data to be received at the listener upcall
 *   @private_data_len: private data length (max 255)
 *   @timeout:
 *   @qos: Quality of service for the rc
 *   @connect_flags: default or multipath connection
 * @cma_id: This returned handle is a union (different in ib and iwarp)
 *   in ib - it is the cm_id.
 */
int ib_cma_connect(struct ib_cma_conn *cma_conn,
		   union ib_cma_id *cma_id);

/**
 * ib_cma_disconnect - this function disconnects the rc. It can be
 *   called by either the passive or active side
 * @qp: the connected qp to disconnect
 * @cma_id: On the active side - this handle is the one returned
 *   when ib_cma_connect was called.
 *   On the passive side - this handle was received in the cma_listen callback
 */
int ib_cma_disconnect(struct ib_qp *qp, union ib_cma_id *cma_id);

/**
 * ib_cma_sid_listen - this function is called by the passive side. It
 *   listens on the specified port (ib service id) for incoming
 *   connection requests
 * @device:
 * @address:
 * @service_id: service id (port) to listen on
 * @context: user context to be returned in the callback
 * @cm_listen_handler: the listen callback
 * @cma_id: cma handle for the passive side
 */
int ib_cma_sid_listen(struct ib_device *device, struct sockaddr *address,
		      __be64 service_id, void *context,
		      ib_cma_listen_handler cm_listen_handler,
		      union ib_cma_id *cma_id);

/**
 * ib_cma_sid_destroy - this function is called on the passive side, to
 *   stop listening on a certain service id
 * @cma_id: the same cma handle received when ib_cma_sid_listen was called
 */
int ib_cma_sid_destroy(union ib_cma_id *cma_id);

/**
 * ib_cma_accept - call on the passive side to accept a connection request
 * @cma_id: this handle was received in the cma_listen callback
 * @qp: the connection's qp
 * @private_data: private data to send back to the initiator
 * @private_data_len: private data length
 * @recv_wr: An array of work requests to post on the receive queue before
 *   sending the REP
 * @num_recv_wr: recv_wr array length - number of buffers to post recv
 * @context: user context to be returned in the callback
 * @cm_accept_handler: the cma accept callback - triggered when RTU ack
 *   received
 */
int ib_cma_accept(union ib_cma_id *cma_id, struct ib_qp *qp,
		  const void *private_data, u8 private_data_len,
		  struct ib_recv_wr *recv_wr, u32 num_recv_wr,
		  void *context, ib_cma_ac_handler cm_accept_handler);

/**
 * ib_cma_reject - call on the passive side to reject a connection request.
 *   This call destroys the cma_id, hence when the active side receives
 *   the reject the cma_id is already destroyed.
 * @cma_id: this handle was received in the cma_listen callback
 * @private_data: private data to send back to the initiator
 * @private_data_len: private data length
 */
int ib_cma_reject(union ib_cma_id *cma_id, const void *private_data,
		  u8 private_data_len);

/**
 * ib_cma_get_src_ip - this function performs "rarp", asynchronously,
 *   from cma_id to src ip
 * @cma_id: the cma_id will have to include the path data received
 *   in the request handler
 * @rarp_handler: callback that receives the source ip of the initiator
 * @context: context to be returned in the callback
 */
int ib_cma_get_src_ip(union ib_cma_id *cma_id,
		      ib_cma_rarp_handler rarp_handler,
		      void *context);

#endif /* IB_CMA_H */

From yuw at cse.ohio-state.edu Thu Aug 25 07:13:43 2005
From: yuw at cse.ohio-state.edu (Weikuan Yu)
Date: Thu, 25 Aug 2005 10:13:43 -0400
Subject: [openib-general] kernel crashes while using mvapich-gen2 over ib_uverbs
In-Reply-To: <1124960511.12751.16.camel@bonnie.rzg.mpg.de>
References: <1124960511.12751.16.camel@bonnie.rzg.mpg.de>
Message-ID: <718C553E-1572-11DA-B866-000D932C3754@cse.ohio-state.edu>

Hi, Christian,

It seems like the ib_uverbs module is not able to clean up the leftover
pinned memory when closing a context of a user process. In my opinion,
it should only kill the user process, with or without a core dump, and
not cause a kernel oops as reported. Somebody who developed the code path
for umem registration + deregistration can help more here.

> Any hints regarding a working combination of kernel + openib revision
> with respect to mvapich-gen2 are very appreciated.

As to the combinations over Dual-opteron, mvapich-gen2 has been tested
with 2.6.12.4.tar.gz + gen2-r2984 (userland+kernel) when we made the
release. We are keen to keep updated with the latest gen2 stack.
Chances are we will lag behind for a little bit.
Thanks, Weikuan On Aug 25, 2005, at 5:01 AM, Christian Guggenberger wrote: > Hi, > > On a small, 2 node setup, I'd like to try some simple MPI programs with > help of mvapich-gen2 (1.0). > Both nodes are Dual-Opteron based, with a 23108 tavor each, directly > connected. (no switch). Opensm is running on one node. Things like > IPOIB > seem to work reliable. > > Using 2.6.12.5 (and svn co of Aug, 24th), all I get after starting a > simple 2 CPU mpi programm is a hard crash of that node. (no logs, no > oops, node not pingable, nothing at the console, no SYSRQ available). > > I tried to go ahead with plain 2.6.13-rc7 (which already contains > ib_uverbs). This is what I get then: > > test[12173] general protection rip:2aaaab219265 rsp:7fffffcc7c50 > error:0 > test[12174] general protection rip:2aaaab219265 rsp:7fffff980b90 > error:0 > general protection fault: 0000 [1] SMP > CPU 1 > Modules linked in: ib_ipoib ib_sa ib_ucm ib_cm ib_uverbs ib_umad joydev > sg st sr_mod floppy ipv6 ib_mthca ib_mad ib_core hw_random af_packet > evdev tg3 xfs exportfs dm_snapshot dm_mod ext3 jbd > Pid: 12173, comm: test Not tainted 2.6.13-rc7 > RIP: 0010:[] > {:ib_uverbs:__ib_umem_release+67} > RSP: 0018:ffff8100d9a1dc48 EFLAGS: 00010246 > RAX: 6b6b6b6b6b6b6b6b RBX: ffff8100e2fffcf0 RCX: 0000000000000000 > RDX: 000000000000007f RSI: ffff81007dccc018 RDI: 6b6b6b6b6b6b6b6b > RBP: 0000000000000000 R08: 0000000000000000 R09: ffff8100e2fffcf0 > R10: ffff8100d9a1dc7f R11: 0000000000003a98 R12: ffff81007dccc000 > R13: ffff8100e36c92f0 R14: 0000000000000001 R15: ffff81007fa8e000 > FS: 00002aaaab2160a0(0000) GS:ffffffff80571880(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 0000000004204000 CR3: 000000007e324000 CR4: 00000000000006e0 > Process test (pid: 12173, threadinfo ffff8100d9a1c000, task > ffff8100e3c3c850) > Stack: ffff8100e2fffcf0 ffff8100e36c9318 6b6b6b6b6b6b6b6b > ffff8100e2fffcf0 > ffff8100e36c92f0 ffff8100e36c92d8 ffff81007fc4a528 > 
ffff81007b2144a8 > ffff810037cfea28 ffffffff881e0eff > Call Trace:{:ib_uverbs:ib_umem_release_on_close+31} > {:ib_uverbs:ib_uverbs_close+453} > {__fput+178} > {filp_close+110} > {put_files_struct+115} > {do_exit+511} > {__dequeue_signal+501} > {sys_exit_group+0} > {get_signal_to_deliver+1415} > {do_signal+159} > {specific_send_sig_info+222} > {force_sig_info+187} > {do_general_protection+159} > {retint_signal+61} > > Code: 48 8b 38 e8 25 b0 f3 f7 41 3b 6c 24 10 7d 38 41 8b 45 20 48 > RIP {:ib_uverbs:__ib_umem_release+67} RSP > > <1>Fixing recursive fault but reboot is needed! > > Any hints regarding a working combination of kernel + openib revision > with respect to mvapich-gen2 are very appreciated. > > > thanks in advance, > - Christian > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Thu Aug 25 07:14:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 25 Aug 2005 17:14:45 +0300 Subject: [openib-general] RE: Question on the best approach to debug aninfiniband connection problem Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C27@taurus.voltaire.com> > I am running OpenSM from branch already. > As far as I can tell, any Set method already forces reregistration in IPoIB, > so ULPs support this. I think you are referring to a Set of PortInfo which causes an event to IPoIB. > There seems to be some bug related to local MAD handling: > if opensm is running on node A, and opensm is restarted, > all nodes will re-register in the multicast group with opensm, > except for the node A itself which has to be downed and upped manually. I will look into this but have a few things ahead of this right now. 
-- Hal

From tom at ammasso.com Thu Aug 25 07:22:16 2005
From: tom at ammasso.com (Tom Tucker)
Date: Thu, 25 Aug 2005 09:22:16 -0500
Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and query provider methods
References: <8E9D028761D8264D910612167E8457E8FA3B4B@mail2.ammasso.com>
Message-ID: <430DD418.4020702@ammasso.com>

Sure...good suggestion. IB_NODE_RNIC it is.

James Lentini wrote:

>On Wed, 24 Aug 2005, Tom Tucker wrote:
>
>>This patch is against the iWARP branch. It adds CM related
>>methods to the ib_device structure as well as simple versions
>>of the low level port, gid, pkey, etc... query methods.
>>
>>I also added printks to the provider methods so I could track the
>>loading process. These will be removed when the driver is stable.
>>
>>Please take a look and let me know what you think. I'll check
>>this in to the iWARP branch tomorrow if no one expresses dismay.
>>
>>Please feel free to send me patches to this patch if you see
>>something.
>>
>>Signed-off-by: Tom Tucker
>>
>>Index: include/ib_verbs.h
>>===================================================================
>>--- include/ib_verbs.h (revision 3120)
>>+++ include/ib_verbs.h (working copy)
>>@@ -43,6 +43,7 @@
>>
>> #include
>> #include
>>+#include
>>
>> #include
>> #include
>>@@ -59,7 +60,8 @@
>> enum ib_node_type {
>> 	IB_NODE_CA = 1,
>> 	IB_NODE_SWITCH,
>>-	IB_NODE_ROUTER
>>+	IB_NODE_ROUTER,
>>+	IB_NODE_IWARP
>>
>
>Should it be IB_NODE_RNIC instead since the other values are for
>hardware types?
>
>> };
>>

From tom at ammasso.com Thu Aug 25 07:22:44 2005
From: tom at ammasso.com (Tom Tucker)
Date: Thu, 25 Aug 2005 09:22:44 -0500
Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and queryprovider methods
References:
Message-ID: <430DD434.1090009@ammasso.com>

I apologize in advance for the long-winded diatribe below...

Sean Hefty wrote:

>>>Why include the connection protocol as part of the verbs
>>>layer? Granted I
>>>haven't looked at the iWarp specs in a long time, but I don't remember
>>>connection establishment being part of the verbs.
>>>
>>Connection management is not part of the RDMAC verbs, however, we need
>>some
>>way for transports to "hook in" to the CM. The other approach is to have
>>a separate registration mechanism for connection management verbs, but
>>this seemed a little bizarre, so we just extended the provider verbs.
>>
>>Ideas?
>>
>
>I need to reacquaint myself with iWarp more, but I don't like the idea of adding
>CM calls as part of the verbs API, and in particular as part of a generic RDMA
>device structure.
>

It's definitely a wart from the perspective of IB and/or RNIC Verbs. The
reason is that the verbs paradigm was founded (er confounded) on the
premise that connection management is implemented separately from DTO on
the transport. For IB this has been done, and the "verbs" export a
mechanism to send the special MAD messages necessary to build a
connection (don't mean to preach what you know better than me -- just
setting the stage).

By contrast, in every TCP implementation I've seen, connection
management is tightly coupled with the transport implementation itself.
In fact the connection management state machine is part of a big
function called tcp_output that is used to send *all* data on the
transport. The intricacies of route lookups and address resolution are
completely hidden in the stack.

This is a very long-winded way of saying that it is "technically
possible", but practically complicated, to achieve this separation in the
implementation.

The DAT people "achieved" this by ignoring the problem altogether, i.e.
leaving it to the transport provider library.

The Windows Chimney people "achieved" this by having two monolithic,
integrated stacks that could exchange the connection state once
established. The first stack (native host or "connection management"
stack) establishes the connection, then the offload stack (hw in the
adapter) takes over for DTO. Unfortunately, the "miracle occurs" step
where the two stacks exchange state, inflight data, timer status etc....
is the subject of great debate and takes us back to where we are right
now -- busted.

Soo. The "solution" we proposed was simply to add the high level
connection management "verbs" to the driver so they could be called by
the upper level transport independent CM or by the ULP directly.

The alternative is to create analogs of the IB connection process for
TCP/IP in the core. For example:

- The ib_at_route_by_ip function is fine for returning the route for an
  IP connection (need this anyway).
- There is no path record for IP
- Create a pseudo-ARP request message that is submitted to a pseudo-mad
  interface to TCP verbs.
- Create some goofy pseudo-connect-request message format that gets
  submitted to a pseudo-mad interface to the TCP verbs.
- Create another goofy pseudo-connect-accept request message format that
  gets submitted to the pseudo-mad interface

Don't know if this is exactly right, but you get the idea... The
pseudo-mad interface for iwarp would interpret these formats and then --
guess what -- do a connect or accept as the interface is now defined.

This seems to me to be a very complicated, slow and bug-prone way to
make TCP connection establishment look similar to the IB process so that
we can hide it once again under a transport independent CM.

My 2 cents.

>Does this suggest that each iWarp device driver will need to implement a
>connection establishment protocol? Isn't there a way to generalize that into a
>single iWarp CM module that can sit above multiple devices? How will
>connections between different devices be supported?
>
>I'm assuming that the Linux kernel will never permit an established connection
>to be offloaded onto a NIC.
However it seems possible that a new iWarp >connection could be done in a common way, with the result passed into the device >through the modify QP call as the LLP stream. > >- Sean > > > From caitlinb at broadcom.com Thu Aug 25 07:38:46 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 25 Aug 2005 07:38:46 -0700 Subject: [openib-general] RDMA connection and address translation API Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F515@NT-SJCA-0751.brcm.ad.broadcom.com> Good point. But that's about wire behavior, not what an application sees. And yes, the RDMA device must behave as though its IP layer were part of the host stack. That is a strong argument for standardizing many of those interactions rather than relying on fully compliant parallel processing. -----Original Message----- From: Christoph Hellwig [mailto:hch at lst.de] Sent: Thursday, August 25, 2005 1:52 AM To: Caitlin Bestler Cc: Christoph Hellwig; openib-general at openib.org Subject: Re: [openib-general] RDMA connection and address translation API On Wed, Aug 24, 2005 at 02:22:31PM -0700, Caitlin Bestler wrote: > Not if the host connects two disjoint networks and does not route > between them. Such a host should/may be configured to reject any > packet that arrives with a destination address that does not match the > expected destination address for the port it arrives upon. While you can configure a Linux system to reject such request through a bunch of crude hacks, the default and fully RFC compliant behaviour is to always reply to ARP requests for any IP address assigned to the system. RDMA CM implementations must work the same. 
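A minimal sketch of the two ARP behaviours contrasted in the exchange above: the default Christoph describes (reply for any IP address assigned to the system) versus the per-port filtering Caitlin describes (reply only when the address belongs to the port the request arrived on). The `demo_iface` structure, the `demo_arp_should_reply` function and the `strict` flag are all illustrative assumptions, not part of any posted API or of the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one IP interface and the addresses assigned to it. */
struct demo_iface {
	const uint32_t *addrs;	/* IPv4 addresses, host byte order */
	size_t n_addrs;
};

/* True if 'ip' is assigned to the given interface. */
static bool demo_iface_owns(const struct demo_iface *ifc, uint32_t ip)
{
	for (size_t i = 0; i < ifc->n_addrs; i++)
		if (ifc->addrs[i] == ip)
			return true;
	return false;
}

/*
 * Decide whether to answer an ARP request for 'target_ip' that arrived on
 * interface index 'rx'.
 *   strict == false: reply for any address assigned to the system.
 *   strict == true:  reply only if the receiving interface owns the address.
 */
bool demo_arp_should_reply(const struct demo_iface *ifaces, size_t n_ifaces,
			   size_t rx, uint32_t target_ip, bool strict)
{
	assert(ifaces != NULL || n_ifaces == 0);

	if (rx >= n_ifaces)
		return false;
	if (strict)
		return demo_iface_owns(&ifaces[rx], target_ip);
	for (size_t i = 0; i < n_ifaces; i++)
		if (demo_iface_owns(&ifaces[i], target_ip))
			return true;
	return false;
}
```

Under this sketch, an RDMA CM that "works the same" as the host stack would use the non-strict branch unless the administrator has configured otherwise.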
From jlentini at netapp.com Thu Aug 25 07:53:50 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 10:53:50 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <5264tvxez6.fsf@cisco.com> References: <5264tvxez6.fsf@cisco.com> Message-ID: On Wed, 24 Aug 2005, Roland Dreier wrote: > Sean> Is the idea that the user calls connect() and then receives > Sean> a single callback indicating that the connection has been > Sean> established? If so, then the user may need to modify the QP > Sean> to the INIT state, which would require some knowledge > Sean> already of the path. We would also need to be clear on > Sean> whether the QP is expected to be in the INIT state before > Sean> connect is called, or if it could be in any arbitrary state. > Sean> The other alternative is to provide multiple callbacks > Sean> during connection establishment. > > To me it makes sense for the generic CM API to be defined so that an > IB QP must be in the INIT state before being passed to connect(). Will the ib_modify_qp() function be made transport neutral? I see some fields in the ib_qp_attr structure that are IB specific. I think the RDMA connection API should perform all the QP state transitions for the ULP. How about a new call to create the QP and perform all QP state transitions necessary for the posting receive work requests? From caitlinb at broadcom.com Thu Aug 25 08:03:23 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 25 Aug 2005 08:03:23 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and queryprovider methods Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F516@NT-SJCA-0751.brcm.ad.broadcom.com> Sean Hefty wrote: > > I need to reacquaint myself with iWarp more, but I don't like > the idea of adding CM calls as part of the verbs API, and in > particular as part of a generic RDMA device structure. 
> > Does this suggest that each iWarp device driver will need to > implement a connection establishment protocol? Isn't there a > way to generalize that into a single iWarp CM module that can > sit above multiple devices? How will connections between > different devices be supported? > > I'm assuming that the Linux kernel will never permit an > established connection to be offloaded onto a NIC. However > it seems possible that a new iWarp connection could be done > in a common way, with the result passed into the device > through the modify QP call as the LLP stream. > It is indeed possible to build a device-independent iWARP Connection Manager on top of sockets and the 'modify qp to rts' verb request. However, this requires standardizing: a) how you open a socket that will later be used for RDMA. b) how to use that socket in pre-RDMA mode. c) how to transition that socket into RDMA mode. If the socket is not a normal socket used by the host stack then it must be a SOCK_STREAM socket associated with the offload device itself. That means you are now publishing a complete sockets API for TCP connections to the offload device and then *not* integrating it with the host stack -- which seems a strange thing to do. Structuring pre-RDMA mode to be fully compliant with all system administrative controls without integreting it with the host stack itself is possible, but it will slow down the entire process of making complete open source stacks available. Providing a "one stop" API that avoids the need to standardize the interactions between pre-RDMA and RDMA modes for the same connection allows drivers to be made available sooner, and provides more time to discuss what the optimal standardization of pre-RDMA mode actually is. Finding that optimal interface will take time, because different implementations vary on how work is divided between the device, driver and verbs. There is also a very distinct need to preserve OS control over connection setup. 
The simplest method of doing this, having the host stack
set up all connections, does not seem to be available for
a variety of non-technical reasons. Building a consensus
on the next best option will take time. Meanwhile,
listen/connect/accept/reject methods defined on a per-device
basis will allow virtually all applications to be deployed
even before the long-term stack co-ordination issues are
fully resolved.

From caitlinb at broadcom.com Thu Aug 25 08:06:59 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Thu, 25 Aug 2005 08:06:59 -0700
Subject: [openib-general] RDMA connection and address translation API
Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F517@NT-SJCA-0751.brcm.ad.broadcom.com>

The data required when doing a qp-modify-to-rts is inherently
transport specific. IB requires a set of data obtained from
the IB CM protocol (or the equivalent data through application
specific black magic), while iWARP requires a handle for
a TCP connection (assumed to be a socket, but not explicitly
required to be so).

The problem is that when the RDMAC specified the iWARP modify
qp to RTS behaviour they did not foresee the non-technical
barriers to simply using a socket handle to specify transfer
of ownership of a TCP connection from one stack to another.

> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of James Lentini
> Sent: Thursday, August 25, 2005 7:54 AM
> To: Roland Dreier
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] RDMA connection and address
> translation API
>
> On Wed, 24 Aug 2005, Roland Dreier wrote:
>
> > Sean> Is the idea that the user calls connect() and then receives
> > Sean> a single callback indicating that the connection has been
> > Sean> established? If so, then the user may need to modify the QP
> > Sean> to the INIT state, which would require some knowledge
> > Sean> already of the path. We would also need to be clear on
> > Sean> whether the QP is expected to be in the INIT state before
> > Sean> connect is called, or if it could be in any arbitrary state.
> > Sean> The other alternative is to provide multiple callbacks
> > Sean> during connection establishment.
> >
> > To me it makes sense for the generic CM API to be defined so that an
> > IB QP must be in the INIT state before being passed to connect().
>
> Will the ib_modify_qp() function be made transport neutral? I
> see some fields in the ib_qp_attr structure that are IB specific.
>
> I think the RDMA connection API should perform all the QP
> state transitions for the ULP. How about a new call to create
> the QP and perform all QP state transitions necessary for
> posting receive work requests?

From swise at ammasso.com Thu Aug 25 08:31:21 2005
From: swise at ammasso.com (Steve Wise)
Date: Thu, 25 Aug 2005 10:31:21 -0500
Subject: [openib-general] data center fabric conference presentations
Message-ID:

Would it be possible to make available all the slides from the openib
presentations? Maybe on-line via www.openib.org?

Thanks,

Steve.

From caitlinb at broadcom.com Thu Aug 25 08:55:49 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Thu, 25 Aug 2005 08:55:49 -0700
Subject: [openib-general] RE: cma header - change some things according to the list feedback
Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F518@NT-SJCA-0751.brcm.ad.broadcom.com>

Comments imbedded.
> -----Original Message----- > From: Guy German [mailto:guyg at voltaire.com] > Sent: Thursday, August 25, 2005 7:07 AM > To: openib-general at openib.org > Cc: sean.hefty at intel.com > Subject: cma header - change some things according to the > list feedback > > > /** > * ib_cma_get_device - Returns the device to be used according to > * the destination ip address (this can be detemined according > * to the local routing table). Call this function before > * creating the qp. If using link-local IPv6 addresses > * there is no need to call this function. > * @remote_address: The destination address for connection > * @device: The device to use (returned by the function) */ > int ib_cma_get_device(struct sockaddr *remote_address, > enum ib_qos qos, struct ib_device **device); > > Who implements this function? It has to be a core (device-independent) function, doesn't it? If so what data is it based upon? I believe the ultimate answer here is that each IP Interface has to be mapped to 0..1 RDMA devices (or put the other way, each RDMA device claims exclusive "RDMA enhancement" rights for one or more IP interfaces). Once that data is linked the above function can be implemented by core software or the user. Since it is of general utility it definitely makes sense to provide the common function. > /** > * ib_cma_connect - this is the connect request function, called by > * the active side. 
The consumer registers an upcall that will be > * initiated by the cma with an appropriate connection event > * notification (established/rejected/disconnected etc) > * @cma_conn: This structure contains the following > connection parameters: > * @qp: qp for establishing the connection > * @qp_attr: only relevant attributes are used > * @dst_ip: destination ip address > * @service_id: destination service id (port) > * @recv_wr: An array of work requests to post on the > receive queue.before > * sending the RTU > * @num_recv_wr: recv_wr array length - number of buffers > to post recv > * @context: context to be returned in the callback > * @cma_event_handler: the upcall function for the active side > * @private_data: private data to be received at the listener upcall > * @private_data_len: private data length (max 255) > * @timeout: > * @qos: Quality os service for the rc > * @connect_flags: default or multipath connection > * @cma_id: This returned handle is a union (different in ib > and iwarp) > * in ib - it is the cm_id. > */ > int ib_cma_connect(struct ib_cma_conn *cma_conn, > union ib_cma_id *cma_id); > > What event is generated at the timeout? Should it be expicitly guaranteed that each connect call will ultimate produce exactly one callback? > /** > * ib_cma_disconnect - this function disconnects the rc. It can be > * called, by either the passive or active side > * @qp: the connected qp to disconnect > * @cma_id: On the active side- this handle is the one returned > * when ib_cma_connect was called. > * On the passive side- this handle was accepted in > cma_listen callback > */ > int ib_cma_disconnect(struct ib_qp *qp, union ib_cma_id *cma_id); > > > /** > * ib_cma_sid_listen - this function is called by the passive > side. 
It is > * listening on a the specified port (ib service id) for incomming > * connection requests > * @device: > * @address: > * @service_id: service id (port) to listen on > * @context: user context to be returned in the callback > * @cm_listen_handler: the listen callback > * @cma_id: cma handle for the passive side */ int > ib_cma_sid_listen(struct ib_device *device, struct sockaddr *address, > __be64 service_id, void *context, > ib_cma_listen_handler cm_listen_handler, > union ib_cma_id *cma_id); > > Because RDMA connections are established over a reliable connection there is a distinct need to have some form of throttle control. Excess connection requests received will be non-peer-rejected. With kDAPL this is controlled by the size of the connection EVD. When using a callback what is needed is for the callback function to be able to report that it could not accept the connection request object -- resulting in a non-peer reject rather than a peer reject. To prevent DoS attacks it is essential that this be done as close to the driver as possible. > > /** > * ib_cma_get_src_ip - this function performs "rarp", asynchronicly > * from cma_id to src ip > * @cma_id: the cma_id will have to include the path data received > * in the request handler > * @src_ip: source ip of the initiator > */ > int ib_cma_get_src_ip(union ib_cma_id *cma_id, > ib_cma_rarp_handler rarp_handler, > void *context); > Or whatever equivalent logic needed to translate the 'cma_id' to the peer's 'src ip'. This is extracting a field for iWARP. 
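One way to picture the throttling comment above, as a rough sketch only: the listen upcall returns a status, and a refusal by the consumer is turned into a non-peer reject before the request goes any further. The proposed `ib_cma_listen_handler` returns `void`, so the `int`-returning `demo_listen_handler` type, the `demo_backlog` structure and every other `demo_*` name here are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of handing one incoming request to the consumer. */
enum demo_req_result {
	DEMO_REQ_DELIVERED,		/* consumer now owns the request */
	DEMO_REQ_NON_PEER_REJECTED	/* refused locally, ULP never saw it */
};

/*
 * Hypothetical listen callback: returns 0 when the consumer took the
 * request and non-zero when it could not.
 */
typedef int (*demo_listen_handler)(void *context);

/* Consumer-side state: how many requests may be outstanding at once. */
struct demo_backlog {
	unsigned int limit;
	unsigned int pending;
};

/* Example consumer callback that refuses requests beyond its limit. */
int demo_bounded_handler(void *context)
{
	struct demo_backlog *b = context;

	if (b->pending >= b->limit)
		return -1;
	b->pending++;
	return 0;
}

/* Called by the consumer once it has accepted or rejected a request. */
void demo_request_done(struct demo_backlog *b)
{
	if (b->pending > 0)
		b->pending--;
}

/* Core-side dispatch: a refused upcall becomes a non-peer reject. */
enum demo_req_result demo_dispatch_request(demo_listen_handler handler,
					   void *context)
{
	assert(handler != NULL);

	return handler(context) == 0 ? DEMO_REQ_DELIVERED
				     : DEMO_REQ_NON_PEER_REJECTED;
}
```

With a limit of two, a third request arriving before either of the first two is resolved would be refused in `demo_dispatch_request` rather than queued, which is roughly the role the connection EVD size plays in kDAPL.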
From rolandd at cisco.com Thu Aug 25 08:58:44 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 08:58:44 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <1124976302.6584.19.camel@r2d2> (Guy German's message of "Thu, 25 Aug 2005 16:25:02 +0300") References: <1124976302.6584.19.camel@r2d2> Message-ID: <52acj6uhxn.fsf@cisco.com> Sean> Another possibility could be to add a list of receives to Sean> rdma_connect(). Guy> I added this to both connect and accept calls I don't think this is a good idea. Let's try to streamline the connect call, not add every single possible feature to it. - R. From jlentini at netapp.com Thu Aug 25 09:01:10 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 12:01:10 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <52acj7vxk0.fsf@cisco.com> References: <000401c5a8de$2c32cce0$6312000a@infiniconsys.com> <52acj7vxk0.fsf@cisco.com> Message-ID: On Wed, 24 Aug 2005, Roland Dreier wrote: > James> You need to consider what makes sense for *both* ib and > James> iwarp. Keep in mind that the correct API will allow a > James> consumer to use ib and iwarp devices transparently. In > James> other words their will be one code path that support both. > > James> If we were to adopt your proposal, the consumer would need > James> to perform unnecessary operations on iWARP. > > No, I think we just need to realize that a perfectly transport neutral > protocol implementation is not achievable. It is achievable. Although the IB and iWARP protocols are different, they can provide the same services to NFS-RDMA. IB is missing one service that iWARP has, namely that nodes can be identified with IP addresses. The ATS mechanism provides this capability for IB networks. If there are better mechanisms that do the same thing, then NFS-RDMA can use them. The important things is not to push this up into the ULPs. 
The NFS-RDMA protocol is being standardized in the IETF. There is no reason to upset that process. If an additional IB specific protocol is necessary, it should be standardized in the IBTA. > It's unfortunate that kDAPL fooled people by hiding the details of > the wire protocol under a supposedly "neutral API," but the fact is > that mapping an abstract RDMA transport to a real implementation > will always involve arbitrary transport-dependent choices. The kDAPL API *is* transport neutral. This has been demonstrated at several interoperability tests at which the same applications were run on both IB and iWARP. kDAPL isn't the only transport neutral networking API. The Sockets API supports UDP and TCP transports via the same interface. I believe we are very close to reaching agreement on a transport neutral RDMA connection API. Comparing your API proposal to the API that we proposed at the BOF, they are very similar. The most important similarity is that both use IP addressing. The only real point of debate is over how to perform the address translation (IP <-> GID) on IB. I believe we should separate that from the API discussion. From jim.ryan at intel.com Thu Aug 25 09:04:29 2005 From: jim.ryan at intel.com (Ryan, Jim) Date: Thu, 25 Aug 2005 09:04:29 -0700 Subject: [openib-general] data center fabric conference presentations Message-ID: That's our plan. It's up to the speakers to supply the materials, but the request has been made and I've received a couple. Matt Leininger has agreed to post them. We'll send out an announcement when this has been accomplished Thanks, Jim Ryan, Chairman, OpenIB -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Steve Wise Sent: Thursday, August 25, 2005 8:31 AM To: openib-general at openib.org Subject: [openib-general] data center fabric conference presentations Would it be possible to make available all the slides from the openib presentations? 
Maybe on-line via www.openib.org? Thanks, Steve. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rolandd at cisco.com Thu Aug 25 09:05:38 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 09:05:38 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and query provider methods In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B4B@mail2.ammasso.com> (Tom Tucker's message of "Wed, 24 Aug 2005 19:28:43 -0400") References: <8E9D028761D8264D910612167E8457E8FA3B4B@mail2.ammasso.com> Message-ID: <523boyuhm5.fsf@cisco.com> Tom> This patch is against the iWARP branch. It adds CM related Tom> methods to the ib_device structure as well as simple versions Tom> of the low level port, gid, pkey, etc... query methods. This patch doesn't seem like the right approach to me. I don't think we want to put CM methods, which are not very verb-like, in to the device structure. For one thing, connect_qp() seems like it can just be replaced by the existing modify_qp() method. I'm not sure I understand the rest of the methods. It seems that the Ammasso device doesn't really implement the RNIC verbs. I'm guessing you handle all the connection stuff inside your device, which means that you can't implement the standard iWARP modify-to-RTS operation. Is there a way to make your interface look more like the iWARP verbs interface? Or do all iWARP devices have an interface like yours. - R. 
From jlentini at netapp.com Thu Aug 25 09:08:18 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 12:08:18 -0400 (EDT) Subject: [openib-general] Draft ATS Specification Message-ID: There is a draft of the ATS specification in the ATS_v1.pdf file at http://groups.yahoo.com/group/dat-discussions/files/ This is obviously of interest to the current discussion on an RDMA connection and address translation API. From jlentini at netapp.com Thu Aug 25 09:10:16 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 12:10:16 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <469958e005082411327f61bd26@mail.gmail.com> References: <8E9D028761D8264D910612167E8457E8FA3B2A@mail2.ammasso.com> <524q9fyzbn.fsf@cisco.com> <469958e005082411327f61bd26@mail.gmail.com> Message-ID: On Wed, 24 Aug 2005, Caitlin Bestler wrote: > NFS over RDMA does not do that. > > Shouldn't that be the end of discussion on abusing CM private data > unless you are talking *solely* about IB private data. And if that is > the discussion, should not such a strategy be proposed to IETF > and/or IBTA for an NFSoRDMA for IB official mapping? Since this is IB specific, I think it should be addressed in the IBTA. > The other end of the NFSoRDMA connection is not necessarily > running OpenIB or even Linux and is not party to any of these > discussions. > > > > > My resistance is that ATS is just complexity without any benefit. It > > doesn't provide additional security. It doesn't solve the > > multi-homing problem we're talking about now. Once you've thrown away > > information by turning your IP address into an IB GID, there's no > > magic way ATS can recreate that information and be psychic about which > > of the multi-homed IPs you actually meant. So why not just put the IP > > addressing information into the CM private data, the way that the SDP > > protocol already does? 
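The suggestion quoted above, carrying the IP addressing information in the CM private data, could look roughly like the following. This is only a sketch: struct ip_cm_hdr, its field layout and both helper functions are hypothetical, and are neither the SDP hello header nor any IBTA-defined format. It assumes the 92-byte private data area of an IB CM REQ.

```c
#include <stdint.h>
#include <string.h>

/* Assumed size of the private data area in an IB CM REQ message. */
#define CM_REQ_PRIVATE_DATA_SIZE 92

/*
 * Hypothetical header placed at the start of the CM private data.
 * All fields are bytes, so there is no padding and no endianness issue.
 */
struct ip_cm_hdr {
	uint8_t ip_version;   /* 4 or 6 */
	uint8_t reserved[3];
	uint8_t src_ip[16];   /* an IPv4 address uses the first 4 bytes */
	uint8_t dst_ip[16];
};

_Static_assert(sizeof(struct ip_cm_hdr) <= CM_REQ_PRIVATE_DATA_SIZE,
	       "header must fit in the REQ private data");

/* Active side: write the header into the private data buffer.
 * Returns the number of bytes used, or 0 if the buffer is too small. */
static size_t ip_cm_hdr_pack_v4(uint8_t *priv, size_t priv_len,
				const uint8_t src[4], const uint8_t dst[4])
{
	struct ip_cm_hdr hdr;

	if (priv_len < sizeof(hdr))
		return 0;
	memset(&hdr, 0, sizeof(hdr));
	hdr.ip_version = 4;
	memcpy(hdr.src_ip, src, 4);
	memcpy(hdr.dst_ip, dst, 4);
	memcpy(priv, &hdr, sizeof(hdr));
	return sizeof(hdr);
}

/* Passive side: recover the peer's source IP from the received private
 * data, with no reverse lookup. Returns 0 on success, -1 otherwise. */
static int ip_cm_hdr_peer_v4(const uint8_t *priv, size_t priv_len,
			     uint8_t src_out[4])
{
	struct ip_cm_hdr hdr;

	if (priv_len < sizeof(hdr))
		return -1;
	memcpy(&hdr, priv, sizeof(hdr));
	if (hdr.ip_version != 4)
		return -1;
	memcpy(src_out, hdr.src_ip, 4);
	return 0;
}
```

With something like this, the listener could answer a getpeername()-style query straight from the connection request. The cost is that the header consumes part of the private data that a ULP might otherwise want for itself.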
From guyg at voltaire.com Thu Aug 25 08:55:50 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 25 Aug 2005 18:55:50 +0300 Subject: [openib-general] RE: cma header - change some things according to the list feedback In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1F518@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1F518@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1124985350.6584.31.camel@r2d2> Hi > > int ib_cma_get_device(struct sockaddr *remote_address, > > enum ib_qos qos, struct ib_device **device); > > > > > > Who implements this function? It has to be a core (device-independent) > function, doesn't it? If so what data is it based upon? > > I believe the ultimate answer here is that each IP Interface > has to be mapped to 0..1 RDMA devices (or put the other way, > each RDMA device claims exclusive "RDMA enhancement" rights > for one or more IP interfaces). > > Once that data is linked the above function can be implemented > by core software or the user. Since it is of general utility > it definitely makes sense to provide the common function. This is implemented today in at.c - see resolve_ip(). > > int ib_cma_connect(struct ib_cma_conn *cma_conn, > > union ib_cma_id *cma_id); > > > > > > What event is generated at the timeout? IB_CMA_EVENT_UNREACHABLE. (We can add all the DAPL events but I think We should stick with the minimum required) > Should it be expicitly > guaranteed that each connect call will ultimate produce exactly > one callback? I think thats an implementation issue > > ib_cma_sid_listen(struct ib_device *device, struct sockaddr *address, > > __be64 service_id, void *context, > > ib_cma_listen_handler cm_listen_handler, > > union ib_cma_id *cma_id); > > > > > > Because RDMA connections are established over a reliable connection > there is a distinct need to have some form of throttle control. > Excess connection requests received will be non-peer-rejected. 
> With kDAPL this is controlled by the size of the connection EVD. > When using a callback what is needed is for the callback function > to be able to report that it could not accept the connection request > object -- resulting in a non-peer reject rather than a peer reject. > To prevent DoS attacks it is essential that this be done as close > to the driver as possible. How do you suggest changing the API then ? > > > > /** > > * ib_cma_get_src_ip - this function performs "rarp", asynchronicly > > * from cma_id to src ip > > * @cma_id: the cma_id will have to include the path data received > > * in the request handler > > * @src_ip: source ip of the initiator > > */ > > int ib_cma_get_src_ip(union ib_cma_id *cma_id, > > ib_cma_rarp_handler rarp_handler, > > void *context); > > > > Or whatever equivalent logic needed to translate the 'cma_id' to the > peer's 'src ip'. This is extracting a field for iWARP. Thanks, Guy From rolandd at cisco.com Thu Aug 25 09:12:53 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 09:12:53 -0700 Subject: [openib-general] Re: RDMA connection and address translation API In-Reply-To: <20050825084809.GD22342@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 25 Aug 2005 11:48:09 +0300") References: <35EA21F54A45CB47B879F21A91F4862F714202@taurus.voltaire.com> <52ll2qvrr1.fsf@cisco.com> <20050825084809.GD22342@mellanox.co.il> Message-ID: <52y86qt2pm.fsf@cisco.com> Michael> Wouldnt it be better to use some bits in the service ID Michael> field for this? This would also be OK. But Annex 3 of the IBA spec has already defined the service ID field without any reserved bits we can use. For example, if the first byte is 0x01, then the IETF is allowed to use any value they want for the rest of the service ID. So if we want to keep backwards compatibility with the spec, this approach might be difficult. Anyway, what's the disadvantage of using a reserved bit or two from the CM REQ? - R. 
From guyg at voltaire.com Thu Aug 25 09:04:36 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 25 Aug 2005 19:04:36 +0300 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <52acj6uhxn.fsf@cisco.com> References: <1124976302.6584.19.camel@r2d2> <52acj6uhxn.fsf@cisco.com> Message-ID: <1124985876.6584.39.camel@r2d2> On Thu, 2005-08-25 at 08:58 -0700, Roland Dreier wrote: > Sean> Another possibility could be to add a list of receives to > Sean> rdma_connect(). > > Guy> I added this to both connect and accept calls > > I don't think this is a good idea. Let's try to streamline the > connect call, not add every single possible feature to it. > > - R. I think it is a good solution for the sync problem that sean raised - in the case where we modify the qp inside the abstraction layer. We can take it out (i.e getting the path and modify qp to init *before* connect) but I think this will be more complicated for the consumers (especially the iwarp ones). I am not saying we *have* to do it - this is just a suggestion. Guy From rolandd at cisco.com Thu Aug 25 09:20:19 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 09:20:19 -0700 Subject: [openib-general] kernel crashes while using mvapich-gen2 over ib_uverbs In-Reply-To: <1124960511.12751.16.camel@bonnie.rzg.mpg.de> (Christian Guggenberger's message of "Thu, 25 Aug 2005 11:01:51 +0200") References: <1124960511.12751.16.camel@bonnie.rzg.mpg.de> Message-ID: <52u0het2d8.fsf@cisco.com> Christian> Hi, On a small, 2 node setup, I'd like to try some Christian> simple MPI programs with help of mvapich-gen2 (1.0). Christian> Both nodes are Dual-Opteron based, with a 23108 tavor Christian> each, directly connected. (no switch). Opensm is Christian> running on one node. Things like IPOIB seem to work Christian> reliable. Thanks for the oops. Have you tried the examples like ibv_rc_pingpong that are included with libibverbs? Do they work OK or crash? 
Anyway, I'll try to investigate this crash. - R. From jlentini at netapp.com Thu Aug 25 09:21:02 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 12:21:02 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F7141F3@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F7141F3@taurus.voltaire.com> Message-ID: On Wed, 24 Aug 2005, Yaron Haviv wrote: > Any way providing src/dst IPs in the CM Private data is simple, and we > can come with IBTA extension blessing that data structure as a general > way to map IP oriented protocols over IB (a 1-2 page draft at the most) > This way it can also address Caitlin concerns regarding NFS & IETF > (since now it's a transport specific issue) How long do you estimate it would take to standardize an IP<->GID mechanism (ATS, CM embedded, ...) in the IBTA? 3 months? 6 months? A year?.... Let's assume that everyone on this list is in agreement. From jlentini at netapp.com Thu Aug 25 09:28:17 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 12:28:17 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <521x4jxeqj.fsf@cisco.com> References: <52pss3z26a.fsf@cisco.com> <000201c5a8d3$f30b3150$6312000a@infiniconsys.com> <469958e005082411147f1dfd03@mail.gmail.com> <521x4jxeqj.fsf@cisco.com> Message-ID: On Wed, 24 Aug 2005, Roland Dreier wrote: > James> I agree with Caitlin. The eventual solution cannot force > James> protocol modifications in ULPs. > > Does this mean we're stuck with the current use of ATS in NFS-RDMA? NFS-RDMA requires that the lower layer provide IP addressing. ATS is one proposal and the only one being documented and standardized in a standards organization. Any other solution that was documented and standardized should be considered. Since this will involve the wire protocol, it can't be OpenIB specific.
> Surely there's still time to fix the protocol. I believe that a solution can be found without impacting the NFS-RDMA specifications. From rolandd at cisco.com Thu Aug 25 09:34:02 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 09:34:02 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: (James Lentini's message of "Thu, 25 Aug 2005 12:01:10 -0400 (EDT)") References: <000401c5a8de$2c32cce0$6312000a@infiniconsys.com> <52acj7vxk0.fsf@cisco.com> Message-ID: <52k6iat1qd.fsf@cisco.com> Roland> No, I think we just need to realize that a perfectly Roland> transport neutral protocol implementation is not Roland> achievable. James> It is achievable. Although the IB and iWARP protocols are James> different, they can provide the same services to NFS-RDMA. Not really. This is just hiding the transport dependence in some other layer and then pretending it doesn't exist. IB and iWARP can provide the same services to NFS/RDMA, but only through some intermediate layer that implements the actual transport-dependent wire protocol. James> IB is missing one service that iWARP has, namely that nodes James> can be identified with IP addresses. The ATS mechanism James> provides this capability for IB networks. If there are James> better mechanisms that do the same thing, then NFS-RDMA can James> use them. All implementation of NFS/RDMA on top of IB had better interoperate, right? Which means that someone has to specify which address translation mechanism is the choice for NFS/RDMA. James> The important things is not to push this up into the James> ULPs. The NFS-RDMA protocol is being standardized in the James> IETF. There is no reason to upset that process. If an James> additional IB specific protocol is necessary, it should be James> standardized in the IBTA. NFS/RDMA is being defined on top of an abstract RDMA interface. 
Someone has to write a spec for how that RDMA abstraction is translated into packets on the wire for each transport that NFS/RDMA will run on top of. - R. From caitlinb at broadcom.com Thu Aug 25 09:34:19 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 25 Aug 2005 09:34:19 -0700 Subject: [openib-general] RE: cma header - change some things according to the list feedback Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F51A@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: Guy German [mailto:guyg at voltaire.com] > Sent: Thursday, August 25, 2005 8:56 AM > To: Caitlin Bestler > Cc: openib-general at openib.org; sean.hefty at intel.com > Subject: RE: cma header - change some things according to the > list feedback > > > > > Because RDMA connections are established over a reliable connection > > there is a distinct need to have some form of throttle control. > > Excess connection requests received will be non-peer-rejected. > > With kDAPL this is controlled by the size of the connection EVD. > > When using a callback what is needed is for the callback > function to > > be able to report that it could not accept the connection request > > object -- resulting in a non-peer reject rather than a peer reject. > > To prevent DoS attacks it is essential that this be done as > close to > > the driver as possible. > > How do you suggest changing the API then ? > > Have the callback function's return value indicate whether the connection request was accepted. That is, either make it a boolean (I have taken responsibility for the connection request: true/false) or a status (zero means ok, non-zero means not accepted, with the integer value identifying the reason).
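A minimal sketch of the status-returning callback described in the reply above. It is not the ib_cma_listen_handler from the proposed header: struct conn_request, struct listen_ctx, the handler typedef and cm_deliver_request() are hypothetical names used only to show the control flow.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for a connection request object. */
struct conn_request {
	int id;
};

/* Hypothetical per-listen consumer state used for throttling. */
struct listen_ctx {
	int pending;      /* requests accepted but not yet completed */
	int max_pending;  /* backlog limit chosen by the consumer */
};

/*
 * Hypothetical listen callback type. It returns 0 if the consumer takes
 * responsibility for the request, or a negative errno if it cannot.
 */
typedef int (*listen_handler)(struct conn_request *req, void *context);

static int consumer_listen_handler(struct conn_request *req, void *context)
{
	struct listen_ctx *ctx = context;

	(void)req;
	if (ctx->pending >= ctx->max_pending)
		return -EBUSY;  /* over the limit: request was not accepted */
	ctx->pending++;
	return 0;
}

/*
 * What a CM layer might do with the return value. A non-zero status means
 * the request never reached the consumer, so the CM issues a non-peer
 * reject instead of waiting for a peer (consumer) reject.
 */
static int cm_deliver_request(listen_handler handler,
			      struct conn_request *req, void *context)
{
	int status = handler(req, context);

	if (status)
		printf("request %d: non-peer reject (status %d)\n",
		       req->id, status);
	else
		printf("request %d: handed to consumer\n", req->id);
	return status;
}
```

With max_pending set to 2, the third request delivered through cm_deliver_request() comes back as -EBUSY and is rejected without the consumer holding any state for it, which is the throttle behaviour asked for.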
From jlentini at netapp.com Thu Aug 25 09:48:16 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 12:48:16 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B48@mail2.ammasso.com> References: <8E9D028761D8264D910612167E8457E8FA3B48@mail2.ammasso.com> Message-ID: On Wed, 24 Aug 2005, Tom Tucker wrote: > > > > - It's not just preventing connections to the wrong local address. > > NFS-RDMA wants the remote source address (ie getpeername()) so that > > it can look it up in the exports list. > > Agreed. But you could also get rid of ATS by allowing GIDs to > be specified in the exports file and then treating them like > IPv6 addresses for the purpose of subnet comparisons. Could generic code use both GIDs and IPv4 addresses? From panda at cse.ohio-state.edu Thu Aug 25 09:50:04 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Thu, 25 Aug 2005 12:50:04 -0400 (EDT) Subject: [openib-general] mpi drop in openib tree In-Reply-To: <527jec48wg.fsf@cisco.com> from "Roland Dreier" at Aug 23, 2005 02:53:51 PM Message-ID: <200508251650.j7PGo4w8029415@xi.cse.ohio-state.edu> Hi Roland, > As for whether MPI should be in the OpenIB subversion tree or not, my > personal opinion is that having MPI there is only appropriate if the > svn tree is being used as the primary development source tree. I > don't think it's appropriate to use the OpenIB svn server as a release > distribution mechanism. Yes, we plan to use the svn tree as the primary development source tree for mvapich-gen2 not as a release distribution mechanism. In fact, during the last two days we have received a couple of patches, bug fixes, and suggestions and have checked-in these patches, fixes, and enhancements to the SVN tree with credits to the people who supplied us the patches. Thanks to those who sent us the patches!! 
We encourage people in the OpenIB community to test the latest version of mvapich-gen2 from the svn tree and provide us feedbacks and comments so that we can keep on enhancing it. Thanks, DK > - R. > From Thomas.Talpey at netapp.com Thu Aug 25 09:55:51 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 25 Aug 2005 12:55:51 -0400 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <52k6iat1qd.fsf@cisco.com> References: <000401c5a8de$2c32cce0$6312000a@infiniconsys.com> <52acj7vxk0.fsf@cisco.com> <52k6iat1qd.fsf@cisco.com> Message-ID: <6.2.3.4.2.20050825125123.03373400@exnane01.nane.netapp.com> At 12:34 PM 8/25/2005, Roland Dreier wrote: >All implementation of NFS/RDMA on top of IB had better interoperate, >right? Which means that someone has to specify which address >translation mechanism is the choice for NFS/RDMA. Correct. At the moment the existing NFS/RDMA implementations use ATS (Sun's and NetApp's). >NFS/RDMA is being defined on top of an abstract RDMA interface. >Someone has to write a spec for how that RDMA abstraction is >translated into packets on the wire for each transport that NFS/RDMA >will run on top of. Well, we did. We specify the ULP payload of all the messages in those two IETF documents. What we didn't do is define how each transport handles IP addressing, that is a transport issue. We don't need address translation over iWARP, since that uses IP. Over IB, so far, we have used ATS. I am perfectly fine with a better solution, but ATS has been fine too. I am catching up to this discussion, so this is just one reply. Tom. From caitlinb at broadcom.com Thu Aug 25 09:56:43 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 25 Aug 2005 09:56:43 -0700 Subject: [openib-general] RDMA connection and address translation API Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F51D@NT-SJCA-0751.brcm.ad.broadcom.com> Generic code MUST support both IPv4 and IPv6 addresses. 
I've even seen code that actually does this. So supporting GIDs is not that much of an issue as long as no IB network IDs are assigned with a meaning that conflicts with any reachable IPv6 network ID. (In other words, assign GIDs so that they are in fact valid IPv6 addresses. Something that was always planned to be one option for GIDs). > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of James Lentini > Sent: Thursday, August 25, 2005 9:48 AM > To: Tom Tucker > Cc: openib-general at openib.org > Subject: RE: [openib-general] RDMA connection and address > translation API > > > > On Wed, 24 Aug 2005, Tom Tucker wrote: > > > > > > > - It's not just preventing connections to the wrong > local address. > > > NFS-RDMA wants the remote source address (ie > getpeername()) so that > > > it can look it up in the exports list. > > > > Agreed. But you could also get rid of ATS by allowing GIDs to be > > specified in the exports file and then treating them like > > IPv6 addresses for the purpose of subnet comparisons. > > Could generic code use both GIDs and IPv4 addresses? > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From yaronh at voltaire.com Thu Aug 25 09:57:44 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Thu, 25 Aug 2005 19:57:44 +0300 Subject: [openib-general] RE: RDMA connection and address translation API Message-ID: <35EA21F54A45CB47B879F21A91F4862F7530C0@taurus.voltaire.com> > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Thursday, August 25, 2005 12:13 PM > To: Michael S. 
Tsirkin > Cc: Yaron Haviv; openib-general at openib.org > Subject: Re: RDMA connection and address translation API > > Michael> Wouldnt it be better to use some bits in the service ID > Michael> field for this? > > This would also be OK. But Annex 3 of the IBA spec has already > defined the service ID field without any reserved bits we can use. > For example, if the first byte is 0x01, then the IETF is allowed to > use any value they want for the rest of the service ID. So if we want > to keep backwards compatibility with the spec, this approach might be > difficult. > The IB ServiceID is 64 bits and TCP is 16 bits, so we can still take some bits in the middle to define what Michael was proposing, this may be a simpler change in IBTA than changing the CM header, but both options are valid Yaron > Anyway, what's the disadvantage of using a reserved bit or two from > the CM REQ? > > - R. From rolandd at cisco.com Thu Aug 25 10:04:51 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 10:04:51 -0700 Subject: [openib-general] kernel crashes while using mvapich-gen2 over ib_uverbs In-Reply-To: <1124960511.12751.16.camel@bonnie.rzg.mpg.de> (Christian Guggenberger's message of "Thu, 25 Aug 2005 11:01:51 +0200") References: <1124960511.12751.16.camel@bonnie.rzg.mpg.de> Message-ID: <52fysyt0b0.fsf@cisco.com> Hmm, looks like it might be a use-after-free bug (6b is the slab's poison value): Christian> RAX: 6b6b6b6b6b6b6b6b Are you running with CONFIG_DEBUG_SLAB enabled? Also can you send the output of "objdump -d" on uverbs_mem.o from your kernel compile? - R. 
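For readers unfamiliar with the pattern in that register dump: with slab debugging enabled, freed objects are overwritten with the byte 0x6b, so a pointer loaded from freed memory reads as 6b6b6b6b6b6b6b6b. A small sketch of that check follows; the helper name and the macro name are made up for illustration.

```c
#include <stdint.h>

/* Byte written over freed slab objects when slab debugging is enabled. */
#define SLAB_POISON_FREE 0x6b

/*
 * Hypothetical helper: returns 1 if every byte of a 64-bit register
 * value is the free-poison byte, which suggests the value was read
 * from memory that had already been freed.
 */
static int looks_like_freed_poison(uint64_t value)
{
	int i;

	for (i = 0; i < 8; i++) {
		if (((value >> (8 * i)) & 0xff) != SLAB_POISON_FREE)
			return 0;
	}
	return 1;
}
```

Applied to the RAX value quoted above, the check succeeds, which is why a use-after-free is the first suspect.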
From yaronh at voltaire.com Thu Aug 25 10:07:06 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Thu, 25 Aug 2005 20:07:06 +0300 Subject: [openib-general] RDMA connection and address translation API Message-ID: <35EA21F54A45CB47B879F21A91F4862F7530C1@taurus.voltaire.com> > -----Original Message----- > From: James Lentini [mailto:jlentini at netapp.com] > Sent: Thursday, August 25, 2005 12:21 PM > To: Yaron Haviv > Cc: Fab Tillier; Roland Dreier; openib-general at openib.org > Subject: RE: [openib-general] RDMA connection and address translation API > > > > On Wed, 24 Aug 2005, Yaron Haviv wrote: > > > Any way providing src/dst IPs in the CM Private data is simple, and we > > can come with IBTA extension blessing that data structure as a general > > way to map IP oriented protocols over IB (a 1-2 page draft at the most) > > This way it can also address Caitlin concerns regarding NFS & IETF > > (since now it's a transport specific issue) > > How long do you estimate it would take to standardize an IP<->GID > mechanism (ATS, CM embedded, ...) in the IBTA? 3 months? 6 months? A > year?.... > > Let's assume that everyone on this list is in agreement. James, I can identify enough IBTA members in this list In case the group is in agreement I believe it's a rather short process Since it's just some minor definition, and IBTA doesn't have much on its agenda these days. For example Hal added a feature to the SM (client re-register ..) 
in weeks Based on the OpenIB input We also don't have to wait for finalized spec to implement, just like we implement IPoIB without an IETF RFC (only a draft) By the way a quick path could be to define it in DAT and hand it over to IBTA, after all ATS is also not an IBTA standard Yaron From jlentini at netapp.com Thu Aug 25 10:09:50 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 13:09:50 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <000701c5a909$a355b4b0$6312000a@infiniconsys.com> References: <000701c5a909$a355b4b0$6312000a@infiniconsys.com> Message-ID: On Wed, 24 Aug 2005, Fab Tillier wrote: > Performing a forward lookup via ARP is going to be a lot faster than > ATS if the ARP entry already exists. ATS responses could also be cached. From jlentini at netapp.com Thu Aug 25 10:17:34 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 13:17:34 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: References: Message-ID: On Wed, 24 Aug 2005, Sean Hefty wrote: > >With this in mind, I believe that the connection API needs to be > >something more like the following: > > > > rdma_resolve_address(): > > inputs: dest IP address, qos, npaths, > > done callback, opaque context > > done callback params: status, local RDMA device, > > RDMA transport address, context > ... > > rdma_connect(): > > inputs: local QP, RDMA transport address, destination service, > > private data, timeout, event callback, opaque context > > Have we agreed that this is the functionality that we should be > aiming towards? I think so, but as you pointed out the local QP must be in the init state. > > > rdma_resolve_address(...); > > /* wait for resolution */ > > ib_create_qp(...) /* use device pointer we got from rdma_resolve_address() > >*/ > > We need to insert in here: > > ib_modify_qp(...); /* somehow uses address resolution... 
*/ > ib_post_recvs(...); > or add a new call to create the qp and modify it to init (an analog to the socket(2) function). > > rdma_connect(...); /* pass transport address we got from > >rdma_resolve_address() */ > > /* wait for connection to finish... */ > > Another possibility could be to add a list of receives to > rdma_connect(). The caller might also want to setup memory windows. Requiring the qp to be in the init state before calling connect seems cleaner to me. > > - Sean > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From Thomas.Talpey at netapp.com Thu Aug 25 10:18:06 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 25 Aug 2005 13:18:06 -0400 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1F51D@NT-SJCA-0751.brcm.ad. broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1F51D@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <6.2.3.4.2.20050825131522.0327a830@exnane01.nane.netapp.com> At 12:56 PM 8/25/2005, Caitlin Bestler wrote: >Generic code MUST support both IPv4 and IPv6 addresses. >I've even seen code that actually does this. Let me jump ahead to the root question. How will the NFS layer know what address to resolve? On IB mounts, it will need to resolve a hostname or numeric string to a GID, in order to provide the address to connect. On TCP/UDP, or iWARP mounts, it must resolve to IP address. The mount command has little or no context to perform these lookups, since it does not know what interface will be used to form the connection. In exports, the server must inspect the source network of each incoming request, in order to match against /etc/exports. If there are wildcards in the file, a GID-specific algorithm must be applied. 
Historically, /etc/exports contains hostnames and IPv4 netmasks/ addresses. In either case, I think it is a red herring to assume that the GID is actually an IPv6 address. They are not assigned by the sysadmin, they are not subnetted, and they are quite foreign to many users. IPv6 support for Linux NFS isn't even submitted yet, btw. With an IP address service, we don't have to change a line of NFS code. Tom. > >So supporting GIDs is not that much of an issue as long >as no IB network IDs are assigned with a meaning that >conflicts with any reachable IPv6 network ID. (In other >words, assign GIDs so that they are in fact valid IPv6 >addresses. Something that was always planned to be one >option for GIDs). > > > >> -----Original Message----- >> From: openib-general-bounces at openib.org >> [mailto:openib-general-bounces at openib.org] On Behalf Of James Lentini >> Sent: Thursday, August 25, 2005 9:48 AM >> To: Tom Tucker >> Cc: openib-general at openib.org >> Subject: RE: [openib-general] RDMA connection and address >> translation API >> >> >> >> On Wed, 24 Aug 2005, Tom Tucker wrote: >> >> > > >> > > - It's not just preventing connections to the wrong >> local address. >> > > NFS-RDMA wants the remote source address (ie >> getpeername()) so that >> > > it can look it up in the exports list. >> > >> > Agreed. But you could also get rid of ATS by allowing GIDs to be >> > specified in the exports file and then treating them like >> > IPv6 addresses for the purpose of subnet comparisons. >> >> Could generic code use both GIDs and IPv4 addresses? 
From jlentini at netapp.com Thu Aug 25 10:23:55 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 13:23:55 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <52y86r2a9w.fsf@cisco.com> References: <52y86r2a9w.fsf@cisco.com> Message-ID: On Tue, 23 Aug 2005, Roland Dreier wrote: > The listen side is even simpler: > > rdma_listen(): > inputs: local service, event callback, consumer context > > Wait for connection requests and pass events to the consumer's > callback. I'm not sure if/home we want to support binding to > a particular IP address. The current IB CM in Linux doesn't > support binding a listen to a single device or port, and even > if it did it's not clear how to handle binding to one IP > address when a port has more than one IP. > > I guess the event callback would receive a device pointer and > the same RDMA transport address union I talked about above > when discussing address resolution. > > It would be possible to have another function like > rdma_getpeername() that takes the transport address and > returns a source IP address. To be complete, the API needs an rdma_getpeername() function: rdma_getpeername(): inputs: connected QP outputs: peer IP address > In the IB case this would do an > ATS reverse lookup. However, I hate this idea.
iSER already > uses the CM private data to pass the source IP in the IB case, > and I would much rather fix NFS/RDMA to do the same thing (so > we can just kill ATS as an address resolution method). From rolandd at cisco.com Thu Aug 25 10:41:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 10:41:39 -0700 Subject: [openib-general] Header reorganization heads up Message-ID: <52br3lud64.fsf@cisco.com> To prepare for moving the headers in the Linux kernel from their current location of drivers/infiniband/include to a new include/rdma directory, I'm going to move the includes in the subversion repository from gen2/trunk/src/linux-kernel/infiniband/include to gen2/trunk/src/linux-kernel/infiniband/include/rdma. To go along with this, I will update all the sources from #include to #include I'll leave the "EXTRA_CFLAGS += -Idrivers/infiniband/include" lines in the subversion Makefiles, so the build will continue to work as usual. This change will be completely transparent to anyone who has gen2/trunk/src/linux-kernel/infiniband under the drivers/ directory of a kernel source tree. When the header file move is sent to the upstream kernel, I'll remove the EXTRA_CFLAGS from the kernel Makefiles. Once the headers are in include/rdma in the upstream kernel, then one extra step will be required to put a subversion checkout into the kernel: you'll have to "rm -rf include/rdma" to avoid picking up the old header files. Does anyone see a problem with this plan? - R. From rolandd at cisco.com Thu Aug 25 10:47:33 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 10:47:33 -0700 Subject: [openib-general] Header reorganization heads up In-Reply-To: <52br3lud64.fsf@cisco.com> (Roland Dreier's message of "Thu, 25 Aug 2005 10:41:39 -0700") References: <52br3lud64.fsf@cisco.com> Message-ID: <527je9ucwa.fsf@cisco.com> BTW, the diffstat of this reorg looks like the following. 
I'm not going to post the whole patch because it's pretty boring, just stuff like:

--- infiniband/core/agent.c (revision 3190)
+++ infiniband/core/agent.c (working copy)
@@ -40,7 +40,7 @@
 #include
 #include
-#include
+#include
 #include "smi.h"
 #include "agent_priv.h"

Anyway, here's the diffstat:

 core/agent.c               |    2
 core/at.c                  |    6
 core/cache.c               |    2
 core/cm.c                  |    4
 core/cm_msgs.h             |    2
 core/core_priv.h           |    2
 core/fmr_pool.c            |    2
 core/mad_priv.h            |    4
 core/packer.c              |    2
 core/ping.c                |    2
 core/sa_query.c            |    4
 core/smi.c                 |    2
 core/sysfs.c               |    2
 core/uat.h                 |    6
 core/ucm.h                 |    4
 core/ud_header.c           |    2
 core/user_mad.c            |    4
 core/uverbs.h              |    4
 core/verbs.c               |    4
 hw/mthca/mthca_av.c        |    4
 hw/mthca/mthca_cmd.c       |    2
 hw/mthca/mthca_cmd.h       |    2
 hw/mthca/mthca_cq.c        |    2
 hw/mthca/mthca_mad.c       |    6
 hw/mthca/mthca_provider.c  |    2
 hw/mthca/mthca_provider.h  |    4
 hw/mthca/mthca_qp.c        |    6
 include/ib_at.h            |  218 ------
 include/ib_cache.h         |  105 ---
 include/ib_cm.h            |  568 -----------------
 include/ib_fmr_pool.h      |   93 --
 include/ib_mad.h           |  579 -----------------
 include/ib_pack.h          |  245 -------
 include/ib_sa.h            |  373 -----------
 include/ib_smi.h           |   94 --
 include/ib_user_at.h       |  177 -----
 include/ib_user_cm.h       |  394 ------------
 include/ib_user_mad.h      |  137 ----
 include/ib_user_verbs.h    |  422 ------------
 include/ib_verbs.h         | 1461 ---------------------------------------------
 include/rdma/ib_at.h       |    4
 include/rdma/ib_cache.h    |    2
 include/rdma/ib_cm.h       |    4
 include/rdma/ib_fmr_pool.h |    2
 include/rdma/ib_mad.h      |    2
 include/rdma/ib_pack.h     |    2
 include/rdma/ib_sa.h       |    4
 include/rdma/ib_smi.h      |    2
 include/rdma/ib_user_at.h  |    2
 ulp/ipoib/ipoib.h          |    6
 ulp/ipoib/ipoib_ib.c       |    2
 ulp/ipoib/ipoib_verbs.c    |    2
 ulp/kdapl/ib/dapl.h        |    6
 ulp/sdp/sdp_dev.h          |    2
 ulp/sdp/sdp_iocb.h         |    2
 ulp/sdp/sdp_main.h         |    6
 ulp/sdp/sdp_proto.h        |    2
 ulp/srp/ib_srp.c           |   15
 ulp/srp/ib_srp.h           |    6
 59 files changed, 84 insertions(+), 4943 deletions(-)

From halr at voltaire.com Thu Aug 25 10:42:42 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 25 Aug 2005 13:42:42 -0400
Subject:
[openib-general] Re: [PATCH] osm: osm_vendor_umad to provide port state In-Reply-To: <86iry01ou7.fsf@mtl066.yok.mtl.com> References: <86iry01ou7.fsf@mtl066.yok.mtl.com> Message-ID: <1124991568.4421.326.camel@hal.voltaire.com> On Sat, 2005-08-20 at 13:48, Eitan Zahavi wrote: > The following patch provides port state in the result of > osm_vendor_get_all_port_attr. The port state is obtained (like lid) > from the query HCA ports and delivered in the resulting port attributes. > > This enables clients of osm_vendor_api.h to know the state of the port > as well as the already provided LID, GUID. Thanks. Applied. > BTW: inspecting the umad vendor implementation I have found many usages of > Array Bound Variables. I wonder if we need to clean them up. This seems to be OK in C. I think the issue was doing this in C++. -- Hal From jlentini at netapp.com Thu Aug 25 10:56:20 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 13:56:20 -0400 (EDT) Subject: [openib-general] Header reorganization heads up In-Reply-To: <52br3lud64.fsf@cisco.com> References: <52br3lud64.fsf@cisco.com> Message-ID: On Thu, 25 Aug 2005, Roland Dreier wrote: > To prepare for moving the headers in the Linux kernel from their > current location of drivers/infiniband/include to a new include/rdma > directory, I'm going to move the includes in the subversion repository > from gen2/trunk/src/linux-kernel/infiniband/include to > gen2/trunk/src/linux-kernel/infiniband/include/rdma. To go along with > this, I will update all the sources from > > #include > > to > > #include > > I'll leave the "EXTRA_CFLAGS += -Idrivers/infiniband/include" lines in > the subversion Makefiles, so the build will continue to work as usual. > > This change will be completely transparent to anyone who has > gen2/trunk/src/linux-kernel/infiniband under the drivers/ directory of > a kernel source tree.
> > When the header file move is sent to the upstream kernel, I'll remove > the EXTRA_CFLAGS from the kernel Makefiles. Once the headers are in > include/rdma in the upstream kernel, then one extra step will be > required to put a subversion checkout into the kernel: you'll have to > "rm -rf include/rdma" to avoid picking up the old header files. If you remove the include/rdma directory, won't that break code outside the OpenIB subversion tree that is using this location? How about cp -rf drivers/infiniband/include/rdma include instead? or place the include files in a different part of the svn tree: https://openib.org/svn/gen2/trunk/src/linux-kernel/include/rdma and recommend everyone do two subversion checkouts, one for the includes and one for the driver. From rolandd at cisco.com Thu Aug 25 11:00:37 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 11:00:37 -0700 Subject: [openib-general] Header reorganization heads up In-Reply-To: (James Lentini's message of "Thu, 25 Aug 2005 13:56:20 -0400 (EDT)") References: <52br3lud64.fsf@cisco.com> Message-ID: <523boxucai.fsf@cisco.com> James> If you remove the include/rdma directory, won't that break James> code outside the OpenIB subversion tree that is using this James> location? James> How about James> cp -rf drivers/infiniband/include/rdma include Yeah, that will work once there's code outside of drivers/infiniband that uses include/rdma. A symlink would work too. James> instead? or place the include files in a different part of James> the svn tree: James> https://openib.org/svn/gen2/trunk/src/linux-kernel/include/rdma James> and recommend everyone do two subversion checkouts, one for James> the includes and one for the driver. That's a hassle because it forces everyone with working setups now to do a new checkout. - R. 
From halr at voltaire.com Thu Aug 25 10:54:53 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Aug 2005 13:54:53 -0400 Subject: [openib-general] Re: [PATCH] osm: osm_vendor_umad registers to all SubnMgt methods In-Reply-To: <86hddj22xd.fsf@mtl066.yok.mtl.com> References: <86hddj22xd.fsf@mtl066.yok.mtl.com> Message-ID: <1124992492.4421.447.camel@hal.voltaire.com> On Sun, 2005-08-21 at 02:56, Eitan Zahavi wrote: > In case of registering to a non SubnAdm class the umad vendor layer > osm_vendor_bind() is registering to ALL the methods. > This prevents from multiple clients of SubnMgt (for example) to use > the code. OpenSM osm_sm_mad_ctrl.c actually sets the correct methods > bits (except for registering as report processor - which it is not). > > So the patch below prevents the "blind" registration to all methods > in case of a !SubnAdm osm_vendor_bind(). Thanks. Applied. -- Hal From jlentini at netapp.com Thu Aug 25 11:15:01 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 14:15:01 -0400 (EDT) Subject: [openib-general] cma header - change some things according to the list feedback In-Reply-To: <20050825140657.GA2991@voltaire.com> References: <20050825140657.GA2991@voltaire.com> Message-ID: Hi Guy, Looks good. A few small comments below: On Thu, 25 Aug 2005, Guy German wrote: > This is the header file, embedding some of the feedbacks received from the list > > According to this suggestion - the qp is modified to init/rtr/rts in the > cm abstraction. > > The connection is done with a synchronous call to get a device for qp > creation. I think it can also be done the way Roland suggested, but I still > believe this is simpler for consumers (especially for the iwarp oriented ones). > > There are still unresolved issues discussed now on the list, but those are mainly > on the implementation side. > > I would like to hear the list opinion on the API, because I'm sure there are > untied ends on the API side as well. 
> > If someone has a totally different suggestion - can he post it to the list > for review ? > > Thanks, > Guy > > > /* > * Copyright (c) 2005 Voltaire Inc. All rights reserved. > * > * This Software is licensed under one of the following licenses: > * > * 1) under the terms of the "Common Public License 1.0" a copy of which is > * available from the Open Source Initiative, see > * http://www.opensource.org/licenses/cpl.php. > * > * 2) under the terms of the "The BSD License" a copy of which is > * available from the Open Source Initiative, see > * http://www.opensource.org/licenses/bsd-license.php. > * > * 3) under the terms of the "GNU General Public License (GPL) Version 2" a > * copy of which is available from the Open Source Initiative, see > * http://www.opensource.org/licenses/gpl-license.php. > * > * Licensee has the right to choose one of the above licenses. > * > * Redistributions of source code must retain the above copyright > * notice and one of the license notices. > * > * Redistributions in binary form must reproduce both the above copyright > * notice, one of the license notices in the documentation > * and/or other materials provided with the distribution. > * > */ > > /* > * This header file as a preliminary proposition for a connection manager > * abstraction layer (cma) for IB and iwarp > * - there is an assumption that iwarp uses the same openib qp terminology in > * the rest of the verbs, and the only place needs abstraction is the cm. > * - This proposition assumes that the address translation is done in the cma > * layer. > * - The cma also modifies the qp states to init/rtr/rts and error as needed. > * - for calling accept/reject or disconnect on the passive side you need to > * use the cma handle accepted in ib_cma_listen cb. 
> * - cma_id is created when calling connect or listen and destroyed when > * accepting disconnected/rejected/unreachable events on either active > * side (connect cb) or passive side (accept cb) > */ > > #ifndef IB_CMA_H > #define IB_CMA_H > > #include > > enum ib_cma_event { > IB_CMA_EVENT_ESTABLISHED, > IB_CMA_EVENT_REJECTED, > IB_CMA_EVENT_NON_PEER_REJECTED, > IB_CMA_EVENT_DISCONNECTED, > IB_CMA_EVENT_UNREACHABLE > }; > > enum ib_qos { > IB_QOS_BEST_EFFORT = 0, > IB_QOS_HIGH_THROUGHPUT = (1 << 0), > IB_QOS_LOW_LATENCY = (1 << 1), > IB_QOS_ECONOMY = (1 << 2), > IB_QOS_PREMIUM = (1 << 3) > }; > > enum ib_connect_flags { > IB_CONNECT_DEFAULT_FLAG = 0x00, > IB_CONNECT_MULTIPATH_FLAG = 0x01 > }; > > /* > * for ib_cma_get_src_ip - ib_cma_id will have to include > * the path data received in the request handler > */ > union ib_cma_id{ > struct ib_cm_id *cm_id; > u32 iwarp_id; > }; > > typedef void (*ib_cma_rarp_handler)(struct sockaddr *src_ip, void *context); How about ib_cma_addr_handler? By using the string rarp, you imply that RARP is being used. > typedef void (*ib_cma_ac_handler)(enum ib_cma_event event, void *context); > typedef void (*ib_cma_event_handler)(enum ib_cma_event event, void *context, > void *private_data); Would ib_cma_conn_handler be more appropriate? > typedef void (*ib_cma_listen_handler)(union ib_cma_id *cma_id, > struct ib_device *device, > void *private_data, void *context); > > struct ib_cma_conn { > struct ib_qp *qp; > struct ib_qp_attr *qp_attr; > struct sockaddr *dst_ip; > __be64 service_id; > struct ib_recv_wr *recv_wr, > u32 num_recv_wr, > void *context; > ib_cma_event_handler cma_event_handler; > const void *private_data; > u8 private_data_len; > u32 timeout; > enum ib_qos qos; > enum ib_connect_flags connect_flags; > }; > > > /** > * ib_cma_get_device - Returns the device to be used according to > * the destination ip address (this can be detemined according > * to the local routing table). 
Call this function before > * creating the qp. If using link-local IPv6 addresses > * there is no need to call this function. > * @remote_address: The destination address for connection > * @device: The device to use (returned by the function) > */ > int ib_cma_get_device(struct sockaddr *remote_address, > enum ib_qos qos, struct ib_device **device); > > > /** > * ib_cma_connect - this is the connect request function, called by > * the active side. The consumer registers an upcall that will be > * initiated by the cma with an appropriate connection event > * notification (established/rejected/disconnected etc) > * @cma_conn: This structure contains the following connection parameters: > * @qp: qp for establishing the connection > * @qp_attr: only relevant attributes are used > * @dst_ip: destination ip address > * @service_id: destination service id (port) > * @recv_wr: An array of work requests to post on the receive queue.before > * sending the RTU > * @num_recv_wr: recv_wr array length - number of buffers to post recv > * @context: context to be returned in the callback > * @cma_event_handler: the upcall function for the active side > * @private_data: private data to be received at the listener upcall > * @private_data_len: private data length (max 255) > * @timeout: > * @qos: Quality os service for the rc > * @connect_flags: default or multipath connection > * @cma_id: This returned handle is a union (different in ib and iwarp) > * in ib - it is the cm_id. > */ > int ib_cma_connect(struct ib_cma_conn *cma_conn, > union ib_cma_id *cma_id); > Should there be a way to cancel an ib_cma_connect() call? > > /** > * ib_cma_disconnect - this function disconnects the rc. It can be > * called, by either the passive or active side > * @qp: the connected qp to disconnect > * @cma_id: On the active side- this handle is the one returned > * when ib_cma_connect was called. 
> * On the passive side- this handle was accepted in cma_listen callback > */ > int ib_cma_disconnect(struct ib_qp *qp, union ib_cma_id *cma_id); > > > /** > * ib_cma_sid_listen - this function is called by the passive side. It > * listens on the specified port (ib service id) for incoming > * connection requests > * @device: > * @address: > * @service_id: service id (port) to listen on > * @context: user context to be returned in the callback > * @cm_listen_handler: the listen callback > * @cma_id: cma handle for the passive side > */ > int ib_cma_sid_listen(struct ib_device *device, struct sockaddr *address, > __be64 service_id, void *context, > ib_cma_listen_handler cm_listen_handler, > union ib_cma_id *cma_id); Given the new sockaddr parameter, I think you should rename this function ib_cma_listen(). > > > /** > * ib_cma_sid_destroy - this function is called on the passive side, to > * stop listening on a certain service id > * @cma_id: the same cma handle received when ib_cma_sid_listen was called > */ > int ib_cma_sid_destroy(union ib_cma_id *cma_id); I'd also change this to ib_cma_destroy().
> /** > * ib_cma_accept - call on the passive side to accept a connection request > * @cma_id: this handle was accepted in cma_listen callback > * @qp: the connection's qp > * @private_data: private data to send back to the initiator > * @private_data_len: private data length > * @recv_wr: An array of work requests to post on the receive queue.before > * sending the REP > * @num_recv_wr: recv_wr array length - number of buffers to post recv > * @context: user context to be returned in the callback > * @cm_accept_handler: the cma accept callback - triggered when RTU ack > * received > */ > int ib_cma_accept(union ib_cma_id *cma_id, struct ib_qp *qp, > const void *private_data, u8 private_data_len, > struct ib_recv_wr *recv_wr, u32 num_recv_wr, > void *context, ib_cma_ac_handler cm_accept_handler); > > /** > * ib_cma_reject - call on the passive side to reject a connection request. > * This call destroys the cma_id, hence when the active side accepts > * the reject the cma_id is already destroyed. 
> * @cma_id: this handle was accepted in cma_listen callback > * @private_data: private data to send back to the initiator > * @private_data_len: private data length > */ > int ib_cma_reject(union ib_cma_id *cma_id, const void *private_data, > u8 private_data_len); > > > /** > * ib_cma_get_src_ip - this function performs "rarp", asynchronicly > * from cma_id to src ip > * @cma_id: the cma_id will have to include the path data received > * in the request handler > * @src_ip: source ip of the initiator > */ > int ib_cma_get_src_ip(union ib_cma_id *cma_id, > ib_cma_rarp_handler rarp_handler, > void *context); > > #endif /* IB_CMA_H */ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From guyg at voltaire.com Thu Aug 25 11:29:48 2005 From: guyg at voltaire.com (Guy German) Date: Thu, 25 Aug 2005 21:29:48 +0300 Subject: [openib-general] cma header - change some things according to the list feedback Message-ID: Hi James, Sorry for breaking the thread (Replying from home - outlook web interface) > typedef void (*ib_cma_rarp_handler)(struct sockaddr *src_ip, void *context); >>How about ib_cma_addr_handler? By using the string rarp, you imply >>that RARP is being used. OK. > typedef void (*ib_cma_ac_handler)(enum ib_cma_event event, void *context); > typedef void (*ib_cma_event_handler)(enum ib_cma_event event, void *context, > void *private_data); >>Would ib_cma_conn_handler be more appropriate? Maybe, but it is actually the active side event cb (also for discon etc.) I don't mind changing it, though... > int ib_cma_connect(struct ib_cma_conn *cma_conn, > union ib_cma_id *cma_id); > >> Should there be a way to cancel an ib_cma_connect() call? It is possible to add it. Not sure how much it will be used by consumers, though. 
> int ib_cma_sid_listen(struct ib_device *device, struct sockaddr *address, > __be64 service_id, void *context, > ib_cma_listen_handler cm_listen_handler, > union ib_cma_id *cma_id); >>Given the new sockaddr parameter, I think you should rename this >>function ib_cma_listen(). True ;) > int ib_cma_sid_destroy(union ib_cma_id *cma_id); >>I'd also change this to ib_cma_destroy(). Right. Thanks, Guy. From sean.hefty at intel.com Thu Aug 25 11:33:08 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 25 Aug 2005 11:33:08 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: Message-ID: >> We need to insert in here: >> >> ib_modify_qp(...); /* somehow uses address resolution... */ >> ib_post_recvs(...); >> > >or add a new call to create the qp and modify it to init (an analog to >the socket(2) function). This approach seems reasonable to me. Maybe something like: rdma_create_qp(rdma_addr_info); Uses the output from the address resolution to create the QP on the correct device and transitions it to the INIT state. The user can now post any work requests that they want. For example, with iWarp, I believe that even send work requests can be posted in the INIT state. - Sean From sean.hefty at intel.com Thu Aug 25 11:35:07 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 25 Aug 2005 11:35:07 -0700 Subject: [openib-general] cma header - change some things according to thelist feedback In-Reply-To: Message-ID: >> typedef void (*ib_cma_ac_handler)(enum ib_cma_event event, void *context); >> typedef void (*ib_cma_event_handler)(enum ib_cma_event event, void *context, >> void *private_data); >>>Would ib_cma_conn_handler be more appropriate? > >Maybe, but it is actually the active side event cb (also for discon etc.) >I don't mind changing it, though... Event handler sounds right to me. >> int ib_cma_connect(struct ib_cma_conn *cma_conn, >> union ib_cma_id *cma_id); >> >>> Should there be a way to cancel an ib_cma_connect() call? 
> >It is possible to add it. Not sure how much it will be used by >consumers, though. We can use the destroy cma_id call to cancel the connection. - Sean From sean.hefty at intel.com Thu Aug 25 11:35:34 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 25 Aug 2005 11:35:34 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andqueryprovider methods In-Reply-To: Message-ID: >The Ammasso 1100 does do 100% connection setup. That's why we're >pushing connection establishment verbs into the device struct. IMO, >these functions are analagous to the process_mad function in the >ib_device structs, which has no meaning to an iwarp device. So I think >we have to admit up front, that the ib_device struct really has >Infiniband-specific verb functions as well as iWARP-specific verb >functions, and that's ok. (or maybe not :-) Does it do the connection setup in hardware or software? I agree that there are IB specific functions in the IB device structure today, but I'm not sure of the best way to define an RDMA device structure moving forward. Longer term, it may make sense to separate out the transport specific functions, in which case we can start moving in that direction. >Assuming each RNIC supported some raw way to send and receive ethernet >frames, then you could implement TCP, IP, ICMP, ARP etc al as a common >stack to setup connections. I don't think we want to do this? Are you saying that your RNIC cannot send and receive raw ethernet frames and act as a plain ethernet NIC? I was assuming that this was a given for any RNIC device. - Sean From sean.hefty at intel.com Thu Aug 25 11:36:08 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 25 Aug 2005 11:36:08 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <52acj6uhxn.fsf@cisco.com> Message-ID: > Sean> Another possibility could be to add a list of receives to > Sean> rdma_connect(). 
> > Guy> I added this to both connect and accept calls > >I don't think this is a good idea. Let's try to streamline the >connect call, not add every single possible feature to it. I don't think that we want to add a list of receives to the connect call either. I only mentioned that it was a possibility. - Sean From sean.hefty at intel.com Thu Aug 25 11:37:25 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 25 Aug 2005 11:37:25 -0700 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: Message-ID: >> Any way providing src/dst IPs in the CM Private data is simple, and we >> can come with IBTA extension blessing that data structure as a general >> way to map IP oriented protocols over IB (a 1-2 page draft at the most) >> This way it can also address Caitlin concerns regarding NFS & IETF >> (since now it's a transport specific issue) > >How long do you estimate it would take to standardize an IP<->GID >mechanism (ATS, CM embedded, ...) in the IBTA? 3 months? 6 months? A >year?.... > >Let's assume that everyone on this list is in agreement. Does anyone in the IB world disagree with adding IP addresses in the CM private data area? Would we want to extend this concept to SIDR as well? - Sean From paul.baxter at dsl.pipex.com Thu Aug 25 11:54:51 2005 From: paul.baxter at dsl.pipex.com (Paul Baxter) Date: Thu, 25 Aug 2005 19:54:51 +0100 Subject: Do we care about pre-emption? Was Re: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator References: <1124787534.27216.44.camel@r2d2><20050823185309.GE1218@esmail.cup.hp.com><1124873898.3933.23.camel@r2d2> <20050825065751.GN4793@esmail.cup.hp.com> Message-ID: <010d01c5a9a6$857b9020$8000000a@blorp> From: "Grant Grundler" To: "Guy German" >> The fact that the of linux is not a major issue, makes >> it >> harder to decide on the right way to go here... > > preemptive is a major issue for some uses. 
But I'm skeptical it is for > the initial clusters I expect people will use RDMA for. So I'm not > going to worry about it for now. There are more important issues. I'm extremely pleased with what's going on in OpenIB at the moment but I just wanted to register an alternate view from Grant's, though I may be in a minority. Infiniband is good at moving large quantities of data quickly. I need it to lower communications overhead so that my processors can be working and responding in real-time. Apps like streaming video processing and/or audio processing value this as well. I don't have batch jobs taking seconds or more, I have parallel data streams being time multiplexed for processing across many nodes. I have distributed computation that needs to happen in milliseconds and not be locked out for milliseconds while a ponderous ISR prevents more important work from happening. Your (current?) design minimising the work in interrupt context seems a good starting point. Why do you want to do more of that work in an ISR? Is the extra complexity putting it outside an ISR significant? Is the performance much worse? I thought in general on Linux, even without PREEMPT_RT and alike, Linux kernel and driver developers were spending more effort in reducing possible preemption bottlenecks. That said, I'm very excited by the work and thank you for the efforts. I'd like to log my data streams at 800MB/s please :) Paul From jlentini at netapp.com Thu Aug 25 11:56:06 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Aug 2005 14:56:06 -0400 (EDT) Subject: [openib-general] Header reorganization heads up In-Reply-To: <523boxucai.fsf@cisco.com> References: <52br3lud64.fsf@cisco.com> <523boxucai.fsf@cisco.com> Message-ID: On Thu, 25 Aug 2005, Roland Dreier wrote: > James> If you remove the include/rdma directory, won't that break > James> code outside the OpenIB subversion tree that is using this > James> location? 
> > James> How about > > James> cp -rf drivers/infiniband/include/rdma include > > Yeah, that will work once there's code outside of drivers/infiniband > that uses include/rdma. A symlink would work too. A symlink is perfect. > James> instead? or place the include files in a different part of > James> the svn tree: > > James> https://openib.org/svn/gen2/trunk/src/linux-kernel/include/rdma > > James> and recommend everyone do two subversion checkouts, one for > James> the includes and one for the driver. > > That's a hassle because it forces everyone with working setups now to > do a new checkout. > > - R. > From yaronh at voltaire.com Thu Aug 25 11:56:18 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Thu, 25 Aug 2005 21:56:18 +0300 Subject: [openib-general] RDMA connection and address translation API Message-ID: <35EA21F54A45CB47B879F21A91F4862F7530C9@taurus.voltaire.com> > -----Original Message----- > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Thursday, August 25, 2005 2:37 PM > To: 'James Lentini'; Yaron Haviv > Cc: openib-general at openib.org > Subject: RE: [openib-general] RDMA connection and address translation API > > >> Any way providing src/dst IPs in the CM Private data is simple, and we > >> can come with IBTA extension blessing that data structure as a general > >> way to map IP oriented protocols over IB (a 1-2 page draft at the most) > >> This way it can also address Caitlin concerns regarding NFS & IETF > >> (since now it's a transport specific issue) > > > >How long do you estimate it would take to standardize an IP<->GID > >mechanism (ATS, CM embedded, ...) in the IBTA? 3 months? 6 months? A > >year?.... > > > >Let's assume that everyone on this list is in agreement. > > Does anyone in the IB world disagree with adding IP addresses in the CM > private > data area? Would we want to extend this concept to SIDR as well? 
> > - Sean I send my proposal from 2004 re-send again as text (attached) Also addresses the ServiceID issue, this can be a baseline for discussions Feel free to change Yaron -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: iWarpoverIB.txt URL: From hch at lst.de Thu Aug 25 12:01:22 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 25 Aug 2005 21:01:22 +0200 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: <6.2.3.4.2.20050825131522.0327a830@exnane01.nane.netapp.com> References: <54AD0F12E08D1541B826BE97C98F99F1F51D@NT-SJCA-0751.brcm.ad.broadcom.com> <6.2.3.4.2.20050825131522.0327a830@exnane01.nane.netapp.com> Message-ID: <20050825190122.GA14053@lst.de> On Thu, Aug 25, 2005 at 01:18:06PM -0400, Talpey, Thomas wrote: > At 12:56 PM 8/25/2005, Caitlin Bestler wrote: > >Generic code MUST support both IPv4 and IPv6 addresses. > >I've even seen code that actually does this. > > Let me jump ahead to the root question. How will the NFS layer know > what address to resolve? > > On IB mounts, it will need to resolve a hostname or numeric string to > a GID, in order to provide the address to connect. On TCP/UDP, or > iWARP mounts, it must resolve to IP address. The mount command > has little or no context to perform these lookups, since it does not > know what interface will be used to form the connection. > > In exports, the server must inspect the source network of each > incoming request, in order to match against /etc/exports. If there > are wildcards in the file, a GID-specific algorithm must be applied. > Historically, /etc/exports contains hostnames and IPv4 netmasks/ > addresses. > > In either case, I think it is a red herring to assume that the GID > is actually an IPv6 address. They are not assigned by the sysadmin, > they are not subnetted, and they are quite foreign to many users. > IPv6 support for Linux NFS isn't even submitted yet, btw. 
> > With an IP address service, we don't have to change a line of > NFS code. I think this shows that using IP addresses in any service over infiniband that isn't actually IP networking is extremely stupid. Just stop living in the illusion that it makes sense and use IB-specific addressing, namely IB and stop all these layering violations into IP, which is much higher up the stack. From halr at voltaire.com Thu Aug 25 11:52:51 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Aug 2005 14:52:51 -0400 Subject: [openib-general] Re: [PATCH] osm: osm_vendor_umad osm_vendor_get_all_port_attr bug In-Reply-To: <86fyt3219m.fsf@mtl066.yok.mtl.com> References: <86fyt3219m.fsf@mtl066.yok.mtl.com> Message-ID: <1124995971.4421.852.camel@hal.voltaire.com> Hi Eitan, On Sun, 2005-08-21 at 03:32, Eitan Zahavi wrote: > osm_vendor_get_all_port_attr returns incorrect LID and state for > device ports. This bug was caused by the fact that if a device port > was skipped because it does not exist (HCA port 0). The > lid and state pointers used as indexes into their corresponding > return value arrays were not advancing to the next port index. > > So the return for a single HCA was mixing LID and state for the first > port and displayed uninitialized memory for the second port. The array is not filled in as you claim. Port 0 does not take a slot on an HCA.
This looks fine to me as is (I added some print statements in that loop as follows): osm_vendor_get_all_port_attr: port 0 osm_vendor_get_all_port_attr: port 1 osm_vendor_get_all_port_attr: port 1 lid 1 state 4 osm_vendor_get_all_port_attr: port 2 osm_vendor_get_all_port_attr: port 2 lid 0 state 1 Port 0 is skipped; port 1 is LID 1 and active; port 2 is not plugged in and is down: Port 1: State: Active Physical state: LinkUp Rate: 2 Base lid: 1 LMC: 0 SM lid: 1 Capability mask: 0x00500a68 Port GUID: 0x0008f10403960559 Port 2: State: Down Physical state: Polling Rate: 2 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00500a68 Port GUID: 0x0008f1040396055a -- Hal From halr at voltaire.com Thu Aug 25 12:32:36 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Aug 2005 15:32:36 -0400 Subject: [openib-general] Re: [PATCH] osm: osm_sa_path_record bug In-Reply-To: <5zoe7m2y5y.fsf@mtl066.yok.mtl.com> References: <5zoe7m2y5y.fsf@mtl066.yok.mtl.com> Message-ID: <1124998356.4421.1127.camel@hal.voltaire.com> On Thu, 2005-08-25 at 04:55, Yael Kalka wrote: > __osm_pr_rcv_check_mcast_dest called from osm_sa_path_record to > check if the path record request in a request for multicast path > record has a bug in it. > In the destination lid check, the check compared host-order lid with > network order. > > The following simple patch fixes this bug. Thanks. Applied. 
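The class of bug described above, a host-order LID compared against a network-order one, can be illustrated with a small sketch. The helper names below are hypothetical and are not taken from OpenSM; the only assumption is that the LID arrives on the wire as a 16-bit big-endian value.

```c
#include <arpa/inet.h>  /* htons, ntohs */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not OpenSM code. The parameter names spell out
 * the byte order each value carries, so a mixed comparison is visible. */
static bool lid_matches(uint16_t wire_lid_be, uint16_t host_lid)
{
    /* Convert the wire value once, then compare host order to host order. */
    return ntohs(wire_lid_be) == host_lid;
}

/* The buggy form: a network-order value compared directly with a
 * host-order one. On a little-endian host this only matches when the
 * two bytes of the LID happen to be equal. */
static bool lid_matches_buggy(uint16_t wire_lid_be, uint16_t host_lid)
{
    return wire_lid_be == host_lid;
}
```

Naming variables after their byte order (or converting exactly once at the boundary) is one common way to keep this kind of comparison from going wrong silently.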
-- Hal From rolandd at cisco.com Thu Aug 25 12:47:34 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 12:47:34 -0700 Subject: iWARP emulation protocol (was: [openib-general] RDMA connection and address translation API) In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F7530C9@taurus.voltaire.com> (Yaron Haviv's message of "Thu, 25 Aug 2005 21:56:18 +0300") References: <35EA21F54A45CB47B879F21A91F4862F7530C9@taurus.voltaire.com> Message-ID: <52y86pssrt.fsf@cisco.com> Yaron> I send my proposal from 2004 re-send again as text Yaron> (attached) Also addresses the ServiceID issue, this can be Yaron> a baseline for discussions Feel free to change I think this protocol is going in exactly the right direction. Before you sent this email, I had independently reached the conclusion that what is desired is not a transport neutral API, but rather a general protocol for emulating iWARP on IB. Then it's easy to build an API that covers both native iWARP and emulated iWARP on IB, and use that for iSER and NFS/RDMA. This has some nice properties. For example, the high-level connection API doesn't have to have a 64-bit service ID parameter any more -- we can just pass in 16-bit TCP ports, and map them to IB service IDs. Also, it's easy to put some filtering in the userspace CM to forbid connections with source port < 1024 from unprivileged processes. Then listeners can have some level of trust in the source IP if the source port is privileged. I think that in light of the emerging consensus on using the IB CM private data to carry IP address information, we can stop worrying about ATS. We can implement this private data mechanism immediately, using a service ID base coming from the OpenIB OUI. Once we have the design nailed down, then we can go to the IBTA or IETF and standardize a final service ID base. I have a few minor quibbles with this proposal. 
I think it would be better to have only the IP version, source and destination IPs and local in the CM private data. The other fields don't seem generic to all protocols. If we do put the extra fields in the generic private data, then we need an API to set them on active connect and get them on passive connect, and I don't think it's worth it. So I would suggest that there be no REP private data, and that the REQ private data just be something like: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 00 | Src IP (127-96) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 04 | Src IP ( 95-64) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 08 | Src IP ( 63-32) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 12 | Src IP ( 31-00) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 16 | Dst IP (127-96) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 20 | Dst IP ( 95-64) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 24 | Dst IP ( 63-32) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 28 | Dst IP ( 31-00) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 32 | IPVer | Reserved | TCP Port | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - R. From rolandd at cisco.com Thu Aug 25 12:50:31 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 12:50:31 -0700 Subject: [openib-general] Re: iWARP emulation protocol In-Reply-To: <52y86pssrt.fsf@cisco.com> (Roland Dreier's message of "Thu, 25 Aug 2005 12:47:34 -0700") References: <35EA21F54A45CB47B879F21A91F4862F7530C9@taurus.voltaire.com> <52y86pssrt.fsf@cisco.com> Message-ID: <52u0hdssmw.fsf@cisco.com> Roland> I have a few minor quibbles with this proposal. 
I think Roland> it would be better to have only the IP version, source and Roland> destination IPs and local in the CM private data. err-- "...and local PORT NUMBER in the CM private data..." - R. From David.Brean at Sun.COM Thu Aug 25 13:21:07 2005 From: David.Brean at Sun.COM (David M. Brean) Date: Thu, 25 Aug 2005 16:21:07 -0400 Subject: [openib-general] Re: iWARP emulation protocol In-Reply-To: <52u0hdssmw.fsf@cisco.com> References: <35EA21F54A45CB47B879F21A91F4862F7530C9@taurus.voltaire.com> <52y86pssrt.fsf@cisco.com> <52u0hdssmw.fsf@cisco.com> Message-ID: <430E2833.8030209@sun.com> Hello, I had an email exchange with the iSER folks earlier this summer where I submitted the attached proposal for consideration. It does not contain the local port number because nobody seems to set it to a meaningful value. More food for thought. -David Roland Dreier wrote: > Roland> I have a few minor quibbles with this proposal. I think > Roland> it would be better to have only the IP version, source and > Roland> destination IPs and local in the CM private data. > >err-- "...and local PORT NUMBER in the CM private data..." > > - R. -------------- next part -------------- An embedded message was scrubbed... From: "David M. Brean" Subject: iWARP ULP on IB Hello/HelloAck Date: Wed, 24 Aug 2005 19:51:28 -0400 Size: 13832 URL: From iod00d at hp.com Thu Aug 25 13:33:15 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 25 Aug 2005 13:33:15 -0700 Subject: [openib-general] Header reorganization heads up In-Reply-To: <52br3lud64.fsf@cisco.com> References: <52br3lud64.fsf@cisco.com> Message-ID: <20050825203315.GD9080@esmail.cup.hp.com> On Thu, Aug 25, 2005 at 10:41:39AM -0700, Roland Dreier wrote: ...
> then one extra step will be > required to put a subversion checkout into the kernel: you'll have to > "rm -rf include/rdma" to avoid picking up the old header files. > > Does anyone see a problem with this plan? This works for me. I keep several versions of openib checked out in my source tree and drivers/infiniband is a symlink to the one I'm working with. Removing the include/rdma is a trivial step. thanks, grant From iod00d at hp.com Thu Aug 25 13:35:49 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 25 Aug 2005 13:35:49 -0700 Subject: [openib-general] Header reorganization heads up In-Reply-To: References: <52br3lud64.fsf@cisco.com> Message-ID: <20050825203549.GE9080@esmail.cup.hp.com> On Thu, Aug 25, 2005 at 01:56:20PM -0400, James Lentini wrote: ... > If you remove the include/rdma directory, won't that break code > outside the OpenIB subversion tree that is using this location? What code outside of drivers/infiniband is using include/rdma? > cp -rf drivers/infiniband/include/rdma include Once there are other consumers, this is a good idea. grant From swise at ammasso.com Thu Aug 25 13:33:16 2005 From: swise at ammasso.com (Steve Wise) Date: Thu, 25 Aug 2005 15:33:16 -0500 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andqueryprovider methods In-Reply-To: Message-ID: > -----Original Message----- > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Thursday, August 25, 2005 1:36 PM > To: 'Steve Wise'; 'Tom Tucker'; openib-general at openib.org > Subject: RE: [openib-general] [PATCH][iWARP] Added provider > CM verbs andqueryprovider methods > > >The Ammasso 1100 does do 100% connection setup. That's why we're > >pushing connection establishment verbs into the device struct. IMO, > >these functions are analagous to the process_mad function in the > >ib_device structs, which has no meaning to an iwarp device. 
> So I think > >we have to admit up front, that the ib_device struct really has > >Infiniband-specific verb functions as well as iWARP-specific verb > >functions, and that's ok. (or maybe not :-) > > Does it do the connection setup in hardware or software? > Hardware (actually firmware running on the amso1100 adapter). > I agree that there are IB specific functions in the IB device > structure today, > but I'm not sure of the best way to define an RDMA device > structure moving > forward. Longer term, it may make sense to separate out the > transport specific > functions, in which case we can start moving in that direction. > > > >Assuming each RNIC supported some raw way to send and > receive ethernet > >frames, then you could implement TCP, IP, ICMP, ARP etc al > as a common > >stack to setup connections. I don't think we want to do this? > > Are you saying that your RNIC cannot send and receive raw > ethernet frames and > act as a plain ethernet NIC? I was assuming that this was a > given for any RNIC > device. It can, but that uses a different MAC address. From halr at voltaire.com Thu Aug 25 13:45:02 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Aug 2005 16:45:02 -0400 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <4309E27E.7050002@mellanox.co.il> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE7@taurus.voltaire.com> <4309E27E.7050002@mellanox.co.il> Message-ID: <1125002702.4421.1378.camel@hal.voltaire.com> On Mon, 2005-08-22 at 10:34, Eitan Zahavi wrote: > > It gets a "real" received length provided it supplies a buffer large enough. > So I guess the "real receive length" is truncated to the last data > record even if the packet sent was 256 bytes? The receive buffer is not truncated. An error is returned if the buffer supplied is too small for a receive, and the error includes the size of the buffer needed. I don't understand what you mean by "even if the packet sent was 256 bytes".
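The behaviour described here (nothing is truncated; a too-small buffer yields an error that carries the required size) suggests a try, grow, retry pattern on the caller side. The sketch below is only an illustration: demo_recv is a hypothetical stand-in that simulates such a receive call, not the real umad interface, and the use of -ENOSPC as the error code is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a receive call; not the real umad API.
 * On entry *len is the capacity of buf. If the pending message does not
 * fit, nothing is copied, *len is set to the size required, and -ENOSPC
 * is returned. On success *len is the real received length. */
static int demo_recv(const void *pending, size_t pending_len,
                     void *buf, size_t *len)
{
    if (*len < pending_len) {
        *len = pending_len;
        return -ENOSPC;
    }
    memcpy(buf, pending, pending_len);
    *len = pending_len;
    return 0;
}

/* Caller side: try a first-guess buffer, grow once if told to.
 * Returns a malloc'd buffer (caller frees) or NULL on failure. */
static void *recv_with_retry(const void *pending, size_t pending_len,
                             size_t first_guess, size_t *out_len)
{
    size_t len = first_guess;
    void *buf = malloc(len ? len : 1);
    if (!buf)
        return NULL;

    int rc = demo_recv(pending, pending_len, buf, &len);
    if (rc == -ENOSPC) {
        /* len now holds the size the receive call asked for. */
        void *bigger = realloc(buf, len);
        if (!bigger) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        rc = demo_recv(pending, pending_len, buf, &len);
    }
    if (rc != 0) {
        free(buf);
        return NULL;
    }
    *out_len = len;
    return buf;
}
```

With this shape the caller never sees a silently shortened message: it either gets the full length back or an explicit request for a larger buffer.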
-- Hal From arlin.r.davis at intel.com Thu Aug 25 14:02:42 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 25 Aug 2005 14:02:42 -0700 Subject: [openib-general] [PATCH] uDAPL added ibv_query support Message-ID: James, Support for ibv_query_port, device, and gid. Thanks, -arlin Signed-off by: Arlin Davis Index: dapl/openib/TODO =================================================================== --- dapl/openib/TODO (revision 3190) +++ dapl/openib/TODO (working copy) @@ -1,16 +1,18 @@ IB Verbs: -- CQ resize? -- query call to get current qp state, remote port number -- ibv_get_cq_event() needs timed event call and wakeup -- query call to get device attributes +- CQ resize +- mulitple CQ event support - memory window support DAPL: - reinit EP needs a QP timewait completion notification -- add cq_object wakeup, time based cq_object wait when verbs support arrives +- direct cq_wait_object when multi-CQ verbs event support arrives +- async event support +- add support for ib_cm_init_qp_attr +- shared receive queue support -Other: -- Shared memory in udapl and kernel module to support? 
+Under discussion: +- Shared memory in udapl and kernel module to support +- merged DTO/connection Index: dapl/openib/dapl_ib_util.c =================================================================== --- dapl/openib/dapl_ib_util.c (revision 3190) +++ dapl/openib/dapl_ib_util.c (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - init, open, close, utilities + * The uDAPL openib provider - init, open, close, utilities, work thread * **************************************************************************** * Source Control System Information @@ -62,54 +62,6 @@ int g_dapl_loopback_connection = 0; -/* get lid */ -int dapli_get_lid(struct dapl_hca *hca_ptr, int port, uint16_t *lid ) -{ - struct ibv_port_attr attr; - - if (ibv_query_port(hca_ptr->ib_hca_handle, port, &attr)) - return 1; - - *lid = attr.lid; - - return 0; -} - -/* get gid */ -int dapli_get_gid(struct dapl_hca *hca_ptr, int port, - int index, union ibv_gid *gid ) -{ - /* ibv_query_gid() coming soon, until then HACK */ - char path[128]; - char val[40]; - char name[256]; - char *token; - uint16_t *p_gid; - - if (sysfs_get_mnt_path(path, sizeof path)) { - fprintf(stderr, "Couldn't find sysfs mount.\n"); - return 1; - } - sprintf(name, "%s/class/infiniband/%s/ports/%d/gids/%d", path, - ibv_get_device_name(hca_ptr->ib_trans.ib_dev), port, index); - - if (sysfs_read_attribute_value(name, val, sizeof val)) { - fprintf(stderr, "Couldn't read GID at %s\n", name); - return 1; - } - - /* get token strings with delimiter */ - token = strtok(val,":"); - p_gid = (uint16_t*)gid->raw; - while (token) { - *p_gid = strtoul(token,NULL,16); - *p_gid = htons(*p_gid); /* convert each token to network order */ - token = strtok(NULL,":"); - p_gid++; - } - return 0; -} - /* just get IP address, IPv4 only for now */ int dapli_get_hca_addr( struct dapl_hca *hca_ptr ) @@ -150,7 +102,6 @@ dapli_ip_comp_handler(at_rec.req_id, (void*)&at_rec, status); } else { dat_status = 
dapl_os_wait_object_wait(&hca_ptr->ib_trans.wait_object,500000); - return 0; if (dat_status != DAT_SUCCESS) ib_at_cancel(at_rec.req_id); } @@ -248,27 +199,19 @@ /* set inline max with enviromment or default, get local lid and gid 0 */ hca_ptr->ib_trans.max_inline_send = - dapl_os_get_env_val ( "DAPL_MAX_INLINE", INLINE_SEND_DEFAULT ); + dapl_os_get_env_val("DAPL_MAX_INLINE", INLINE_SEND_DEFAULT); - if (dapli_get_lid(hca_ptr, hca_ptr->port_num, - &hca_ptr->ib_trans.lid)) { - dapl_dbg_log (DAPL_DBG_TYPE_ERR, - " open_hca: IB get LID failed for %s\n", - ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); - goto bail; - } - - if (dapli_get_gid(hca_ptr, hca_ptr->port_num, 0, - &hca_ptr->ib_trans.gid)) { + /* GID with port_num provided, index 0 for now */ + if (ibv_query_gid(hca_ptr->ib_hca_handle, + hca_ptr->port_num, 0, &hca_ptr->ib_trans.gid)) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, " open_hca: IB get GID failed for %s\n", - ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); + ibv_get_device_name(hca_ptr->ib_trans.ib_dev)); goto bail; } - - dapl_dbg_log(DAPL_DBG_TYPE_CM, - " open_hca: LID 0x%x GID subnet %016llx id %016llx\n", - hca_ptr->ib_trans.lid, + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " open_hca: GID subnet %016llx id %016llx\n", (unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.subnet_prefix), (unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.interface_id) ); @@ -309,7 +252,6 @@ ibv_close_device(hca_ptr->ib_hca_handle); hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; return DAT_INTERNAL_ERROR; - } @@ -331,10 +273,12 @@ */ DAT_RETURN dapls_ib_close_hca ( IN DAPL_HCA *hca_ptr ) { - dapl_dbg_log (DAPL_DBG_TYPE_UTIL," close_hca: %p\n",hca_ptr); + dapl_dbg_log (DAPL_DBG_TYPE_UTIL," close_hca: %p->%p\n", + hca_ptr,hca_ptr->ib_hca_handle); + + dapli_cq_thread_destroy(hca_ptr); if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) { - dapli_cq_thread_destroy(hca_ptr); if (ibv_close_device(hca_ptr->ib_hca_handle)) return(dapl_convert_errno(errno,"ib_close_device")); 
hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; @@ -370,7 +314,8 @@ OUT DAT_EP_ATTR *ep_attr, OUT DAT_SOCK_ADDR6 *ip_addr) { - DAT_RETURN dat_status = DAT_SUCCESS; + struct ibv_device_attr dev_attr; + struct ibv_port_attr port_attr; if (hca_ptr->ib_hca_handle == NULL) { dapl_dbg_log (DAPL_DBG_TYPE_ERR," query_hca: BAD handle\n"); @@ -381,6 +326,15 @@ if (ip_addr != NULL) memcpy(ip_addr, &hca_ptr->hca_address, sizeof(DAT_SOCK_ADDR6)); + if (ia_attr == NULL && ep_attr == NULL) + return DAT_SUCCESS; + + /* query verbs for this device and port attributes */ + if (ibv_query_device(hca_ptr->ib_hca_handle, &dev_attr) || + ibv_query_port(hca_ptr->ib_hca_handle, + hca_ptr->port_num, &port_attr)) + return(dapl_convert_errno(errno,"ib_query_hca")); + if (ia_attr != NULL) { ia_attr->adapter_name[DAT_NAME_MAX_LENGTH - 1] = '\0'; ia_attr->vendor_name[DAT_NAME_MAX_LENGTH - 1] = '\0'; @@ -394,50 +348,51 @@ ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 8 & 0xff, ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 16 & 0xff, ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 24 & 0xff ); + + ia_attr->hardware_version_major = dev_attr.hw_ver; + ia_attr->hardware_version_minor = dev_attr.fw_ver; + ia_attr->max_eps = dev_attr.max_qp; + ia_attr->max_dto_per_ep = dev_attr.max_qp_wr; + ia_attr->max_rdma_read_per_ep = dev_attr.max_qp_rd_atom; + ia_attr->max_evds = dev_attr.max_cq; + ia_attr->max_evd_qlen = dev_attr.max_cqe; + ia_attr->max_iov_segments_per_dto = dev_attr.max_sge; + ia_attr->max_lmrs = dev_attr.max_mr; + ia_attr->max_lmr_block_size = dev_attr.max_mr_size; + ia_attr->max_rmrs = dev_attr.max_mw; + ia_attr->max_lmr_virtual_address = dev_attr.max_mr_size; + ia_attr->max_rmr_target_address = dev_attr.max_mr_size; + ia_attr->max_pzs = dev_attr.max_pd; + ia_attr->max_mtu_size = port_attr.max_msg_sz; + ia_attr->max_rdma_size = port_attr.max_msg_sz; + ia_attr->num_transport_attr = 0; + ia_attr->transport_attr = NULL; + 
ia_attr->num_vendor_attr = 0; + ia_attr->vendor_attr = NULL; - /* TODO: need verbs query call */ - ia_attr->max_eps = 64000; - ia_attr->max_dto_per_ep = 64000; - ia_attr->max_rdma_read_per_ep = 8; - ia_attr->max_evds = 64000; - ia_attr->max_evd_qlen = 64000; - ia_attr->max_iov_segments_per_dto = 32; - ia_attr->max_lmrs = 64000; - ia_attr->max_lmr_block_size = 0x80000000; - ia_attr->max_rmrs = 64000; - ia_attr->max_lmr_virtual_address = 0x80000000; - ia_attr->max_rmr_target_address = 0x80000000; - ia_attr->max_pzs = 64000; - ia_attr->max_mtu_size = 0x80000000; - ia_attr->max_rdma_size = 0x80000000; - ia_attr->num_transport_attr = 0; - ia_attr->transport_attr = NULL; - ia_attr->num_vendor_attr = 0; - ia_attr->vendor_attr = NULL; - - dapl_dbg_log (DAPL_DBG_TYPE_UTIL, - " query_hca: (%d.%d) ep %d ep_q %d evd %d evd_q %d\n", - ia_attr->hardware_version_major, - ia_attr->hardware_version_minor, - ia_attr->max_eps, ia_attr->max_dto_per_ep, - ia_attr->max_evds, ia_attr->max_evd_qlen ); - dapl_dbg_log (DAPL_DBG_TYPE_UTIL, - " query_hca: msg %llu rdma %llu iov %d lmr %d rmr %d\n", - ia_attr->max_mtu_size, ia_attr->max_rdma_size, - ia_attr->max_iov_segments_per_dto, ia_attr->max_lmrs, - ia_attr->max_rmrs ); + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " query_hca: (%x.%x) ep %d ep_q %d evd %d evd_q %d\n", + ia_attr->hardware_version_major, + ia_attr->hardware_version_minor, + ia_attr->max_eps, ia_attr->max_dto_per_ep, + ia_attr->max_evds, ia_attr->max_evd_qlen ); + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " query_hca: msg %llu rdma %llu iov %d lmr %d rmr %d\n", + ia_attr->max_mtu_size, ia_attr->max_rdma_size, + ia_attr->max_iov_segments_per_dto, ia_attr->max_lmrs, + ia_attr->max_rmrs ); } - /* TODO: need verbs query call */ + if (ep_attr != NULL) { - ep_attr->max_mtu_size = 0x80000000; - ep_attr->max_rdma_size = 0x80000000; - ep_attr->max_recv_dtos = 64000; - ep_attr->max_request_dtos = 64000; - ep_attr->max_recv_iov = 32; - ep_attr->max_request_iov = 32; - ep_attr->max_rdma_read_in = 8; 
- ep_attr->max_rdma_read_out= 8; + ep_attr->max_mtu_size = port_attr.max_msg_sz; + ep_attr->max_rdma_size = port_attr.max_msg_sz; + ep_attr->max_recv_dtos = dev_attr.max_qp_wr; + ep_attr->max_request_dtos = dev_attr.max_qp_wr; + ep_attr->max_recv_iov = dev_attr.max_sge; + ep_attr->max_request_iov = dev_attr.max_sge; + ep_attr->max_rdma_read_in = dev_attr.max_qp_rd_atom; + ep_attr->max_rdma_read_out= dev_attr.max_qp_rd_atom; dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " query_hca: MAX msg %llu dto %d iov %d rdma i%d,o%d\n", ep_attr->max_mtu_size, @@ -445,7 +400,7 @@ ep_attr->max_rdma_read_in, ep_attr->max_rdma_read_out); } - return dat_status; + return DAT_SUCCESS; } /* Index: dapl/openib/dapl_ib_cm.c =================================================================== --- dapl/openib/dapl_ib_cm.c (revision 3190) +++ dapl/openib/dapl_ib_cm.c (working copy) @@ -173,13 +173,14 @@ " ip_comp_handler: resolution err %d retry %d\n", rec_num, at_rec->retries + 1); - if (++at_rec->retries > IB_MAX_AT_RETRY) + ipv4_addr->sin_addr.s_addr = 0; + + if (++at_rec->retries > IB_MAX_AT_RETRY) goto bail; at_comp.fn = dapli_ip_comp_handler; at_comp.context = at_rec; - ipv4_addr->sin_addr.s_addr = 0; - + status = ib_at_ips_by_gid(&at_rec->hca_ptr->ib_trans.gid, &ipv4_addr->sin_addr.s_addr, 1, &at_comp, &at_rec->req_id); Index: dapl/openib/README =================================================================== --- dapl/openib/README (revision 3190) +++ dapl/openib/README (working copy) @@ -6,15 +6,14 @@ - CQ_WAIT_OBJECT support added - added dapl/openib directory - modify doc/dat.conf to add a example openib configuration -- fixes to wait and resize +- dapl_ep_alloc checks default attributes against device maximums dapl/common/dapl_adapter_util.h dapl/common/dapl_evd_dto_callb.c dapl/common/dapl_evd_util.c - dapl/common/dapl_evd_resize.c + dapl/common/dapl_ep_util.c dapl/include/dapl.h dapl/udapl/dapl_evd_set_unwaitable.c - dapl/udapl/dapl_evd_wait.c dapl/udapl/linux/dapl_osd.c 
dapl/udapl/Makefile dat/udat/linux/dat_osd.c @@ -44,11 +43,9 @@ Setup: - Third drop of code, includes uCM and uAT support. - NOTE: requires both uCM and uAT libraries and device modules from trunk. - + dapl/udapl/Makefile + Known issues: no memory windows support in ibverbs, dat_create_rmr fails. - some uCM scale up issues with an 8 thread dapltest in regress.sh hard coded modify QP RTR to port 1, waiting for ib_cm_init_qp_attr call. Index: dapl/openib/dapl_ib_util.h =================================================================== --- dapl/openib/dapl_ib_util.h (revision 3190) +++ dapl/openib/dapl_ib_util.h (working copy) @@ -238,7 +238,6 @@ int cq_destroy; DAPL_OS_THREAD cq_thread; int max_inline_send; - uint16_t lid; union ibv_gid gid; ib_async_handler_t async_unafiliated; ib_async_handler_t async_cq_error; From arlin.r.davis at intel.com Thu Aug 25 14:06:13 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 25 Aug 2005 14:06:13 -0700 Subject: [openib-general] [PATCH] uDAPL common code fix for default attribute settings Message-ID: James, Please review this common code patch that fixes default settings so they don't exceed device maximums. 
Thanks, -arlin Signed-off by: Arlin Davis Index: dapl/common/dapl_ep_util.c =================================================================== --- dapl/common/dapl_ep_util.c (revision 3190) +++ dapl/common/dapl_ep_util.c (working copy) @@ -115,7 +115,41 @@ */ if ( ep_attr == NULL ) { + DAT_RETURN dat_status; + DAT_EP_ATTR ep_attr_limit; + dapli_ep_default_attrs (ep_ptr); + + dat_status = dapls_ib_query_hca (ia_ptr->hca_ptr, + NULL, &ep_attr_limit, NULL); + /* check against HCA maximums */ + if (dat_status == DAT_SUCCESS) + { + ep_ptr->param.ep_attr.max_mtu_size = + DAPL_MIN(ep_ptr->param.ep_attr.max_mtu_size, + ep_attr_limit.max_mtu_size); + ep_ptr->param.ep_attr.max_rdma_size = + DAPL_MIN(ep_ptr->param.ep_attr.max_rdma_size, + ep_attr_limit.max_rdma_size); + ep_ptr->param.ep_attr.max_recv_dtos = + DAPL_MIN(ep_ptr->param.ep_attr.max_recv_dtos, + ep_attr_limit.max_recv_dtos); + ep_ptr->param.ep_attr.max_request_dtos = + DAPL_MIN(ep_ptr->param.ep_attr.max_request_dtos, + ep_attr_limit.max_request_dtos); + ep_ptr->param.ep_attr.max_recv_iov = + DAPL_MIN(ep_ptr->param.ep_attr.max_recv_iov, + ep_attr_limit.max_recv_iov); + ep_ptr->param.ep_attr.max_request_iov = + DAPL_MIN(ep_ptr->param.ep_attr.max_request_iov, + ep_attr_limit.max_request_iov); + ep_ptr->param.ep_attr.max_rdma_read_in = + DAPL_MIN(ep_ptr->param.ep_attr.max_rdma_read_in, + ep_attr_limit.max_rdma_read_in); + ep_ptr->param.ep_attr.max_rdma_read_out = + DAPL_MIN(ep_ptr->param.ep_attr.max_rdma_read_out, + ep_attr_limit.max_rdma_read_out); + } } else { From swise at ammasso.com Thu Aug 25 14:16:44 2005 From: swise at ammasso.com (Steve Wise) Date: Thu, 25 Aug 2005 16:16:44 -0500 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andquery provider methods In-Reply-To: <523boyuhm5.fsf@cisco.com> Message-ID: > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Thursday, August 25, 2005 11:06 AM 
> To: Tom Tucker > Cc: openib-general at openib.org > Subject: Re: [openib-general] [PATCH][iWARP] Added provider > CM verbs andquery provider methods > > Tom> This patch is against the iWARP branch. It adds CM related > Tom> methods to the ib_device structure as well as simple versions > Tom> of the low level port, gid, pkey, etc... query methods. > > This patch doesn't seem like the right approach to me. I don't think > we want to put CM methods, which are not very verb-like, in to the > device structure. For one thing, connect_qp() seems like it can just > be replaced by the existing modify_qp() method. > > I'm not sure I understand the rest of the methods. It seems that the > Ammasso device doesn't really implement the RNIC verbs. I'm guessing > you handle all the connection stuff inside your device, which means > that you can't implement the standard iWARP modify-to-RTS operation. > Since there was no way to do migration of a linux native stack connection to the rnic, we chose to model the product after the _only_ API available...DAPL. The RDMAC verbs punt entirely on connection setup -and- are modeled to support a chimney approach which will never be in linux as far as I can tell. Would you rather we put this in a CM device struct of some sort? And have rnic devices export a CM device that has these sorts of methods? To me that's basically the same as adding it to the struct ib_device. > Is there a way to make your interface look more like the iWARP verbs > interface? Or do all iWARP devices have an interface like yours. > Well, I haven't seen any other iwarp device at this detail (I only know of three iwarp devices in existence), so I cannot tell you. Perhaps the other vendors can respond. Stevo.
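The alternative raised above, a separate CM method table that only RNIC devices export, could look roughly like the following. This is a minimal sketch under stated assumptions: every name here (demo_cm_ops, demo_device, demo_connect) is hypothetical and none of it is taken from the iWARP branch or the posted patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_qp;  /* opaque in this sketch */

/* Hypothetical connection-management method table, kept separate from
 * the verbs methods of the device. */
struct demo_cm_ops {
    int (*connect)(struct demo_qp *qp, const char *addr, unsigned short port);
    int (*listen)(unsigned short port);
};

struct demo_device {
    const char *name;
    const struct demo_cm_ops *cm_ops;  /* NULL when the device has no CM methods */
};

/* Generic entry point: dispatch to the device's CM methods if present. */
static int demo_connect(struct demo_device *dev, struct demo_qp *qp,
                        const char *addr, unsigned short port)
{
    if (!dev->cm_ops || !dev->cm_ops->connect)
        return -ENOSYS;
    return dev->cm_ops->connect(qp, addr, port);
}

/* Fake RNIC implementation, used only to exercise the dispatch. */
static int fake_rnic_connect(struct demo_qp *qp, const char *addr,
                             unsigned short port)
{
    (void)qp;
    (void)addr;
    return port ? 0 : -EINVAL;
}

static const struct demo_cm_ops fake_rnic_cm_ops = {
    .connect = fake_rnic_connect,
};
```

Whether the table hangs off the device structure or is exported as its own object, callers still need a check like the one in demo_connect, which is one reading of the remark that the two options are basically the same.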
From tom at ammasso.com Thu Aug 25 14:23:15 2005 From: tom at ammasso.com (Tom Tucker) Date: Thu, 25 Aug 2005 17:23:15 -0400 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and query provider methods Message-ID: <8E9D028761D8264D910612167E8457E8FA3B8B@mail2.ammasso.com> > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Thursday, August 25, 2005 11:06 AM > To: Tom Tucker > Cc: openib-general at openib.org > Subject: Re: [openib-general] [PATCH][iWARP] Added provider > CM verbs and query provider methods > > Tom> This patch is against the iWARP branch. It adds CM related > Tom> methods to the ib_device structure as well as simple versions > Tom> of the low level port, gid, pkey, etc... query methods. > > This patch doesn't seem like the right approach to me. I > don't think we want to put CM methods, which are not very > verb-like, in to the device structure. For one thing, > connect_qp() seems like it can just be replaced by the > existing modify_qp() method. > I'll first try to convince you otherwise, and then propose another possibility below. > I'm not sure I understand the rest of the methods. It seems > that the Ammasso device doesn't really implement the RNIC > verbs. I'm guessing you handle all the connection stuff > inside your device, which means that you can't implement the > standard iWARP modify-to-RTS operation. RNIC Verbs imply that the modify qp verb takes a handle to a connection -- presumably a socket. This CAN'T be done on Linux in any fashion that is acceptable to the netdev crowd. SOOO we modeled this after DAPL. Trust me, I would LOVE to be able to establish the connection using bind, listen, etc..., query the Linux connection state and then pass this down to the qp modify verb...but I can't. > > Is there a way to make your interface look more like the > iWARP verbs interface? Or do all iWARP devices have an > interface like yours. 
You would have to ask them about their interface, I'm not privy to this information. > > - R. > How about this. Plan B: We have a separate structure -- let's say ib_cma_verbs for the sake of argument that contains pointers to the aforementioned iWARP CM functions. This structure is separate from the ib_device structure, but can be queried by the client as one of the attributes. IB clients will return a zero for this attribute, iWARP clients will return a pointer to the ib_cma_verbs structure. Plan C: We hack in pseudo-messages that our MAD handler turns into device connection verbs. This is really nasty as I mentioned in an earlier e-mail. From caitlinb at broadcom.com Thu Aug 25 16:08:35 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 25 Aug 2005 16:08:35 -0700 Subject: iWARP emulation protocol (was: [openib-general] RDMA connection and address translation API) Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F520@NT-SJCA-0751.brcm.ad.broadcom.com> To be precise, what you are emulating is iWARP connection establishment for MPA with immediate entry to MPA mode and for SCTP. I would not recommend attempting to emulate deferred entry to MPA mode where the applications can engage in SOCK_STREAM style arbitrary exchanges before deciding to enable MPA and RDMA. MPA and SCTP both support up to 512 bytes of private data for both the request and response. You should be explicit about any reductions from that size, since that new size would become the standard for truly transport neutral apps.
> -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Thursday, August 25, 2005 12:48 PM > To: Yaron Haviv > Cc: openib-general at openib.org > Subject: iWARP emulation protocol (was: [openib-general] RDMA > connection and address translation API) > > Yaron> I send my proposal from 2004 re-send again as text > Yaron> (attached) Also addresses the ServiceID issue, this can be > Yaron> a baseline for discussions Feel free to change > > I think this protocol is going in exactly the right > direction. Before you sent this email, I had independently > reached the conclusion that what is desired is not a > transport neutral API, but rather a general protocol for > emulating iWARP on IB. Then it's easy to build an API that > covers both native iWARP and emulated iWARP on IB, and use > that for iSER and NFS/RDMA. > > This has some nice properties. For example, the high-level > connection API doesn't have to have a 64-bit service ID > parameter any more -- we can just pass in 16-bit TCP ports, > and map them to IB service IDs. > Also, it's easy to put some filtering in the userspace CM to > forbid connections with source port < 1024 from unprivileged > processes. Then listeners can have some level of trust in > the source IP if the source port is privileged. > > I think that in light of the emerging consensus on using the > IB CM private data to carry IP address information, we can > stop worrying about ATS. We can implement this private data > mechanism immediately, using a service ID base coming from > the OpenIB OUI. Once we have the design nailed down, then we > can go to the IBTA or IETF and standardize a final service ID base. > > I have a few minor quibbles with this proposal. I think it > would be better to have only the IP version, source and > destination IPs and local in the CM private data. The other > fields don't seem generic to all protocols. 
If we do put the
> extra fields in the generic private data, then we need an API
> to set them on active connect and get them on passive
> connect, and I don't think it's worth it.
>
> So I would suggest that there be no REP private data, and
> that the REQ private data just be something like:
>
>     0                   1                   2                   3
>     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 00 |                        Src IP (127-96)                        |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 04 |                        Src IP ( 95-64)                        |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 08 |                        Src IP ( 63-32)                        |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 12 |                        Src IP ( 31-00)                        |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 16 |                        Dst IP (127-96)                        |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 20 |                        Dst IP ( 95-64)                        |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 24 |                        Dst IP ( 63-32)                        |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 28 |                        Dst IP ( 31-00)                        |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 32 | IPVer |       Reserved        |           TCP Port            |
>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
> - R.
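
As a rough illustration of the REQ private data layout proposed above, the following C sketch packs the fields into a 36-byte buffer and maps a 16-bit TCP port onto a 64-bit service ID. The names `req_priv`, `req_priv_pack` and `port_to_service_id` are hypothetical, and the split of the last word into a 4-bit IP version, 12 reserved bits and a 16-bit port is an assumption; the thread fixes neither those widths nor the service ID base.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Total size of the sketched REQ private data: 16 + 16 + 4 bytes. */
#define REQ_PRIV_LEN 36

/* Host-side view of the fields (hypothetical; not a wire struct). */
struct req_priv {
    uint8_t  src_ip[16];  /* source IP, most significant byte first */
    uint8_t  dst_ip[16];  /* destination IP, most significant byte first */
    uint8_t  ip_ver;      /* 4 or 6 */
    uint16_t tcp_port;    /* host byte order */
};

/* Serialize the fields in network byte order.
 * ASSUMPTION: IPVer occupies the top 4 bits of byte 32 and the
 * next 12 bits are reserved (sent as zero). */
static void req_priv_pack(const struct req_priv *p, uint8_t out[REQ_PRIV_LEN])
{
    assert(p != NULL && out != NULL);
    memcpy(out, p->src_ip, 16);
    memcpy(out + 16, p->dst_ip, 16);
    out[32] = (uint8_t)((p->ip_ver & 0x0f) << 4);
    out[33] = 0;
    out[34] = (uint8_t)(p->tcp_port >> 8);
    out[35] = (uint8_t)(p->tcp_port & 0xff);
}

/* Map a TCP port into the low 16 bits of a service ID.
 * The base value is a placeholder supplied by the caller. */
static uint64_t port_to_service_id(uint64_t base, uint16_t port)
{
    return (base & ~(uint64_t)0xffff) | port;
}
```

With a layout like this, a listener could recover the peer's IP address and port from the REQ alone, which is what makes the privileged-port filtering idea workable.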
From rolandd at cisco.com Thu Aug 25 16:19:04 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 25 Aug 2005 16:19:04 -0700
Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andquery provider methods
In-Reply-To: (Steve Wise's message of "Thu, 25 Aug 2005 16:16:44 -0500")
References:
Message-ID: <52ll2psizb.fsf@cisco.com>

Steve> Would you rather we put this in a CM device struct of some
Steve> sort? And have rnic devices export a CM device that has
Steve> these sorts of methods? To me that's basically the same as
Steve> adding it to the struct ib_device.

I think if we really need these new methods, then they might as well be in the rdma_device structure.

Steve> Well, i haven't seen any other iwarp device at this detail
Steve> (i only know of three iwarp devices in existence), so i
Steve> cannot tell you. Perhaps the other vendors can respond.

I think it's important for us to understand whether this is an Ammasso-specific interface or something that will work in the general case. Perhaps someone from Broadcom and/or NetEffect (or some other vendor) can comment?

- R.

From rolandd at cisco.com Thu Aug 25 16:21:06 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 25 Aug 2005 16:21:06 -0700
Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and query provider methods
In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B8B@mail2.ammasso.com> (Tom Tucker's message of "Thu, 25 Aug 2005 17:23:15 -0400")
References: <8E9D028761D8264D910612167E8457E8FA3B8B@mail2.ammasso.com>
Message-ID: <52hdddsivx.fsf@cisco.com>

Tom> RNIC Verbs imply that the modify qp verb takes a handle to a Tom> connection -- presumably a socket.
This CAN'T be done on Tom> Linux in any fashion that is acceptable to the netdev Tom> crowd. SOOO we modeled this after DAPL. Trust me, I would Tom> LOVE to be able to establish the connection using bind, Tom> listen, etc..., query the Linux connection state and then Tom> pass this down to the qp modify verb...but I can't. Let's not be too quick to say that this is impossible. I think we should work with the Linux networking community and come up with the right answer, and not accept a bad solution just because it lets us go around the networking people. Has there been any real discussion of this on netdev? - R. From caitlinb at broadcom.com Thu Aug 25 16:35:23 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 25 Aug 2005 16:35:23 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andquery provider methods Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F521@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Thursday, August 25, 2005 4:19 PM > To: Steve Wise > Cc: openib-general at openib.org > Subject: Re: [openib-general] [PATCH][iWARP] Added provider > CM verbs andquery provider methods > > Steve> Would you rather we put this in a CM device struct of some > Steve> sort? And have rnic devices export a CM device that has > Steve> these sorts of methods? To me that's basically the same as > Steve> adding it to the struct ib_device. > > I think if we really need these new methods, then they might > as well be in the rdma_device structure. > > Steve> Well, i haven't seen any other iwarp device at this detail > Steve> (i only know of three iwarp devices in existance), so i > Steve> cannot tell you. Perhaps the other vendors can respond. > > I think it's important for us to understand whether this is > an Amasso-specific interface or something that will work in > the general case. 
Perhaps someone from Broadcom and/or > NetEffect (or some other > vendor) can comment? >

The per-device connection methods proposed by Ammasso are definitely implementable for every iWARP RNIC that I am aware of. It does move some code around from where we have it currently, so we're not ready to release our versions yet. But that's because things take time, not because we think there's anything wrong with the interface.

This interface needs to be complemented by the existing IB-specific interface and by an eventual TCP-specific interface that is compatible with the DAT Socket Service Point and/or IT-API's socket convert options. That interface would also deal with interoperability with pre-IETF MPA. But only a handful of applications need those options; the DAT-style interface accommodates the vast majority of applications that we are aware of.

Caitlin Bestler
Principal Software Scientist
Broadcom
caitlinb at broadcom.com

From rolandd at cisco.com Thu Aug 25 16:39:52 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 25 Aug 2005 16:39:52 -0700
Subject: [openib-general] kernel crashes while using mvapich-gen2 over ib_uverbs
In-Reply-To: <1124960511.12751.16.camel@bonnie.rzg.mpg.de> (Christian Guggenberger's message of "Thu, 25 Aug 2005 11:01:51 +0200")
References: <1124960511.12751.16.camel@bonnie.rzg.mpg.de>
Message-ID: <52br3lsi0n.fsf@cisco.com>

One other question: do you have CONFIG_HUGETLB_PAGE turned on?

Thanks,
Roland

From swise at ammasso.com Thu Aug 25 16:44:23 2005
From: swise at ammasso.com (Steve Wise)
Date: Thu, 25 Aug 2005 18:44:23 -0500
Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andquery provider methods
In-Reply-To: <52hdddsivx.fsf@cisco.com>
Message-ID:

See the current chelsio TOE thread on the netdev list. I brought up the chimney approach on this forum and was basically told chimney was crap.

Steve.

> -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Thursday, August 25, 2005 6:21 PM > To: Tom Tucker > Cc: openib-general at openib.org > Subject: Re: [openib-general] [PATCH][iWARP] Added provider > CM verbs andquery provider methods > > Tom> RNIC Verbs imply that the modify qp verb takes a handle to a > Tom> connection -- presumably a socket. This CAN'T be done on > Tom> Linux in any fashion that is acceptable to the netdev > Tom> crowd. SOOO we modeled this after DAPL. Trust me, I would > Tom> LOVE to be able to establish the connection using bind, > Tom> listen, etc..., query the Linux connection state and then > Tom> pass this down to the qp modify verb...but I can't. > > Let's not be too quick to say that this is impossible. I think we > should work with the Linux networking community and come up with the > right answer, and not accept a bad solution just because it lets us go > around the networking people. > > Has there been any real discussion of this on netdev? > > - R. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From caitlinb at broadcom.com Thu Aug 25 16:46:38 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 25 Aug 2005 16:46:38 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs and query provider methods Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F523@NT-SJCA-0751.brcm.ad.broadcom.com> Device vendors would jump at the opportunity to have a stable interface with the host stack. Things like routing, protection from denial of service attacks, rules for logging and filtering connection requests and more all *should* be handled by the host stack. 
That's where the end user wants to control them, it's where the security code can be kept most current and most robust. It is also largely on packets that do not require offload optimization. But we also need time to ensure that the community understands this as giving the host stack control of an offload connection during connection establishment -- rather than as the offload device "stealing" the connection from the host stack. Moving the entire TCP connection logic to the offload device not only increases the work that the offload device must do, it reduces the auditability of the system and the user's control over their network activity. So the intent is not to evade the stack, rather it is to allow time for proper integration with host stack control. The tradeoffs are complex, and neither side fully understands the other's issues yet. We need to work together to determine how to provide the acceleration that our users want without sacrificing the OS provided security that they assume will not be sacrificed. > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Thursday, August 25, 2005 4:21 PM > To: Tom Tucker > Cc: openib-general at openib.org > Subject: Re: [openib-general] [PATCH][iWARP] Added provider > CM verbs and query provider methods > > Tom> RNIC Verbs imply that the modify qp verb takes a handle to a > Tom> connection -- presumably a socket. This CAN'T be done on > Tom> Linux in any fashion that is acceptable to the netdev > Tom> crowd. SOOO we modeled this after DAPL. Trust me, I would > Tom> LOVE to be able to establish the connection using bind, > Tom> listen, etc..., query the Linux connection state and then > Tom> pass this down to the qp modify verb...but I can't. > > Let's not be too quick to say that this is impossible. 
I > think we should work with the Linux networking community and > come up with the right answer, and not accept a bad solution > just because it lets us go around the networking people. > > Has there been any real discussion of this on netdev? > > - R. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From halr at voltaire.com Thu Aug 25 16:49:13 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Aug 2005 19:49:13 -0400 Subject: [openib-general] [PATCH] osm_vendor_ibumad.c: osm_vendor_bind registration fix for other than SA class Message-ID: <1125013753.4398.209.camel@hal.voltaire.com> osm_vendor_ibumad.c: In osm_vendor_bind, only register GetTable and Delete methods for SA class Signed-off-by: Hal Rosenstock Index: osm_vendor_ibumad.c =================================================================== --- osm_vendor_ibumad.c (revision 3195) +++ osm_vendor_ibumad.c (working copy) @@ -718,11 +718,12 @@ if (p_user_bind->is_responder) { set_bit(IB_MAD_METHOD_GET, &method_mask); set_bit(IB_MAD_METHOD_SET, &method_mask); - set_bit(IB_MAD_METHOD_GETTABLE, &method_mask); - set_bit(IB_MAD_METHOD_DELETE, &method_mask); - /* Add in IB_MAD_METHOD_GETTRACETABLE */ - /* and IB_MAD_METHOD_GETMULTI when */ - /* supported by OpenSM */ + if (p_user_bind->mad_class == IB_MCLASS_SUBN_ADM) { + set_bit(IB_MAD_METHOD_GETTABLE, &method_mask); + set_bit(IB_MAD_METHOD_DELETE, &method_mask); + /* Add in IB_MAD_METHOD_GETTRACETABLE and */ + /* IB_MAD_METHOD_GETMULTI when supported by OpenSM */ + } } if (p_user_bind->is_report_processor) set_bit(IB_MAD_METHOD_REPORT, &method_mask); From rolandd at cisco.com Thu Aug 25 17:04:35 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 17:04:35 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andquery 
provider methods In-Reply-To: (Steve Wise's message of "Thu, 25 Aug 2005 18:44:23 -0500") References: Message-ID: <527je9sgvg.fsf@cisco.com> Steve> See the current chelsio TOE thread on the netdev list. Hmm, I just skimmed through this. It seems like a big flamefest about TOE without any reference to RDMA. I think someone from the iWARP world really needs to spend some time explaining to the netdev community why we want to do connection setup through the host stack so we can find a solution that everyone can agree on. I don't want to bypass part of the community (and bypass packet filtering too!) just because they'll "never" accept something. - R. From halr at voltaire.com Thu Aug 25 16:55:37 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Aug 2005 19:55:37 -0400 Subject: [openib-general] RE: RMPP Message Format Errors In-Reply-To: References: Message-ID: <1125013956.4398.217.camel@hal.voltaire.com> On Wed, 2005-08-24 at 04:07, Sean Hefty wrote: In the below, c /opensm/osm vendor layer/ (It is also used by some SA client code in addition to OpenSM. > Looking through the code, it appears that the proper size of the MAD is being > reported in the kernel and exported up to userspace. If I guessed the structure > of the opensm code correctly, the length is returned by umad_recv() in > umad_receiver() in osm_vendor_ibumad.c The length is discarded after > umad_receiver() returns. You "guessed" correctly :-) > I guess that one possible solution is for opensm to save the length value into > the payload_length field in the RMPP header before returning from > umad_receiver(). Yes, that is a possible solution if it is needed on the receive side. It looks to me like it is currently unused (based on method, received size, and attribute offset). but it is probably a good idea to do this for the future as another algorithm would work and might be better. I will put this on my TODO list. 
-- Hal From tom at ammasso.com Thu Aug 25 17:40:47 2005 From: tom at ammasso.com (Tom Tucker) Date: Thu, 25 Aug 2005 20:40:47 -0400 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andquery provider methods Message-ID: <8E9D028761D8264D910612167E8457E8FA3B8E@mail2.ammasso.com> Yes. I guess it's not obvious that the topics are related. Since our RDMA transport runs on top of TCP and since the RDMAC verbs spec requires that we have an LLP Handle (socket), we are dependent on the resolution of this issue. > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Thursday, August 25, 2005 7:05 PM > To: Steve Wise > Cc: 'Roland Dreier'; Tom Tucker; openib-general at openib.org > Subject: Re: [openib-general] [PATCH][iWARP] Added provider > CM verbs andquery provider methods > > Steve> See the current chelsio TOE thread on the netdev list. > > Hmm, I just skimmed through this. It seems like a big flamefest about > TOE without any reference to RDMA. > > I think someone from the iWARP world really needs to spend some time > explaining to the netdev community why we want to do connection setup > through the host stack so we can find a solution that everyone can > agree on. I don't want to bypass part of the community (and bypass > packet filtering too!) just because they'll "never" accept something. > > - R. > From rolandd at cisco.com Thu Aug 25 17:50:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 17:50:05 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andquery provider methods In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B8E@mail2.ammasso.com> (Tom Tucker's message of "Thu, 25 Aug 2005 20:40:47 -0400") References: <8E9D028761D8264D910612167E8457E8FA3B8E@mail2.ammasso.com> Message-ID: <523boxserm.fsf@cisco.com> Tom> Yes. I guess it's not obvious that the topics are Tom> related. 
Since our RDMA transport runs on top of TCP and Tom> since the RDMAC verbs spec requires that we have an LLP Tom> Handle (socket), we are dependent on the resolution of this Tom> issue. Further: the desire to support RDMA could help reach a resolution. It seems pretty clear that the recent Chelsio patches are not acceptable (and I agree with that -- they clearly poke into deep parts of the TCP stack that should not be exported). Can someone come up with a better interface for handing over a TCP connection from the kernel to an RDMA device? Has anyone enumerated exactly what TCP state is required for the handoff? Is there a way to build this interface so that it isn't hard to maintain and doesn't constrain the evolution of the underlying host TCP stack? - R. From rolandd at cisco.com Thu Aug 25 17:52:56 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 17:52:56 -0700 Subject: crash solved! (was: [openib-general] kernel crashes while using mvapich-gen2 over ib_uverbs) In-Reply-To: <1124960511.12751.16.camel@bonnie.rzg.mpg.de> (Christian Guggenberger's message of "Thu, 25 Aug 2005 11:01:51 +0200") References: <1124960511.12751.16.camel@bonnie.rzg.mpg.de> Message-ID: <52y86pr02f.fsf@cisco.com> I finally got to the bottom of this. It's a pretty simple use-after-free bug. I didn't see it because the only machines I routinely test CONFIG_DEBUG_SLAB=y kernels on are i386 machines, and the implementation of dma_unmap_sg() for i386 doesn't expose this bug. As soon as I tested CONFIG_DEBUG_SLAB=y on x86_64, I saw the same failure. The patch below should fix this for you. Please let me know if you still have problems after applying this patch. 
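
The bug pattern described here (reading a field of an object after the call that frees it) can be shown in a few lines of userspace C. This is only a sketch: `struct device`, `struct mr`, `reg_mr`, `dereg_mr`, `release_on_close` and `cleanup_mr` are hypothetical stand-ins, not the kernel's `ib_mr`/`ib_dereg_mr` API. The point is simply that the device pointer is copied out before the region is freed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a device and a memory region. */
struct device { int released; };
struct mr { struct device *device; };

/* Allocate a region bound to a device (returns NULL on failure). */
static struct mr *reg_mr(struct device *dev)
{
    struct mr *mr = malloc(sizeof *mr);
    if (mr)
        mr->device = dev;
    return mr;
}

/* Deregistering frees the region; mr must not be touched afterwards. */
static void dereg_mr(struct mr *mr)
{
    free(mr);
}

/* Per-device cleanup that needs the device pointer. */
static void release_on_close(struct device *dev)
{
    dev->released++;
}

static void cleanup_mr(struct mr *mr)
{
    struct device *dev;

    assert(mr != NULL);
    dev = mr->device;      /* save the pointer while mr is still valid */
    dereg_mr(mr);          /* mr is freed here */
    release_on_close(dev); /* safe: uses the saved copy, not mr->device */
}
```

Without slab debugging the freed memory often still holds the old pointer, so code that reads `mr->device` after the free can appear to work, which is consistent with the crash showing up only on some configurations.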
Thanks, Roland diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -130,13 +130,14 @@ static int ib_dealloc_ucontext(struct ib list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { struct ib_mr *mr = idr_find(&ib_uverbs_mr_idr, uobj->id); + struct ib_device *mrdev = mr->device; struct ib_umem_object *memobj; idr_remove(&ib_uverbs_mr_idr, uobj->id); ib_dereg_mr(mr); memobj = container_of(uobj, struct ib_umem_object, uobject); - ib_umem_release_on_close(mr->device, &memobj->umem); + ib_umem_release_on_close(mrdev, &memobj->umem); list_del(&uobj->list); kfree(memobj); From rolandd at cisco.com Thu Aug 25 18:00:09 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 18:00:09 -0700 Subject: [openib-general] [PATCH] IB: fix use-after-free in user verbs cleanup Message-ID: <52u0hdqzqe.fsf@cisco.com> Hi Andrew, I'd like to get this into 2.6.13 if possible. If it's too late, it's not the end of the world -- we can wait for 2.6.13.1. But it's a tiny, obvious patch that fixes a crash that at least one person actually hit running a normal application: http://openib.org/pipermail/openib-general/2005-August/010248.html Thanks, Roland Fix a use-after-free bug in userspace verbs cleanup: we can't touch mr->device after we free mr by calling ib_dereg_mr(). 
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -130,13 +130,14 @@ static int ib_dealloc_ucontext(struct ib list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { struct ib_mr *mr = idr_find(&ib_uverbs_mr_idr, uobj->id); + struct ib_device *mrdev = mr->device; struct ib_umem_object *memobj; idr_remove(&ib_uverbs_mr_idr, uobj->id); ib_dereg_mr(mr); memobj = container_of(uobj, struct ib_umem_object, uobject); - ib_umem_release_on_close(mr->device, &memobj->umem); + ib_umem_release_on_close(mrdev, &memobj->umem); list_del(&uobj->list); kfree(memobj); From bsharp at NetEffect.com Thu Aug 25 18:43:46 2005 From: bsharp at NetEffect.com (Bob Sharp) Date: Thu, 25 Aug 2005 20:43:46 -0500 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbsandquery provider methods Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC01E5D9E2@venom2> >From NetEffect's perspective, the per device approach is simple to implement and I do not see it as an Ammasso specific approach. As Caitlin described, existing code needs to be reorganized but this aspect of our port is not a major effort. Bob -----Original Message----- From: openib-general-bounces at openib.org on behalf of Caitlin Bestler Sent: Thu 8/25/2005 6:35 PM To: openib-general at openib.org Cc: Subject: RE: [openib-general] [PATCH][iWARP] Added provider CM verbsandquery provider methods > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Thursday, August 25, 2005 4:19 PM > To: Steve Wise > Cc: openib-general at openib.org > Subject: Re: [openib-general] [PATCH][iWARP] Added provider > CM verbs andquery provider methods > > Steve> Would you rather we put this in a CM device struct of some > Steve> sort? And have rnic devices export a CM device that has > Steve> these sorts of methods? 
To me that's basically the same as > Steve> adding it to the struct ib_device. > > I think if we really need these new methods, then they might > as well be in the rdma_device structure. > > Steve> Well, i haven't seen any other iwarp device at this detail > Steve> (i only know of three iwarp devices in existance), so i > Steve> cannot tell you. Perhaps the other vendors can respond. > > I think it's important for us to understand whether this is > an Amasso-specific interface or something that will work in > the general case. Perhaps someone from Broadcom and/or > NetEffect (or some other > vendor) can comment? > The per-device connection methods proposed by Ammasso are definitely implementable for every iWARP RNIC that I am aware of. It does move some code around from where we have it currently, so we're not ready to release our versions yet. But that's because things take time, not because we think there's anything wrong with the interface. This interface needs to be complemented by the existing IB-specific interface and by an eventual TCP-specific interface that is compatible with the DAT Socket Service Point and/or IT-API's socket convert options. That interface would also deal with interoperability with pre-IETF MPA. But only a handful of applications need those optinons, the DAT-style interface accomodates the vast majority of applications that we are aware of. 
Caitlin Bestler Principal Software Scientist Broadcom caitlinb at broadcom.com _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From swise at ammasso.com Thu Aug 25 19:22:03 2005 From: swise at ammasso.com (Steve Wise) Date: Thu, 25 Aug 2005 21:22:03 -0500 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andquery provider methods References: <8E9D028761D8264D910612167E8457E8FA3B8E@mail2.ammasso.com> Message-ID: <001901c5a9e4$f33cc190$020010ac@haggard> There is mention in the chelsio TOE thread that one reason you want TOE nics is so that you can then do RDMAP on top of TCP on the adapter, and get direct placement and kernel bypass... ----- Original Message ----- From: "Tom Tucker" To: "Roland Dreier" ; "Steve Wise" Cc: Sent: Thursday, August 25, 2005 7:40 PM Subject: RE: [openib-general] [PATCH][iWARP] Added provider CM verbs andquery provider methods Yes. I guess it's not obvious that the topics are related. Since our RDMA transport runs on top of TCP and since the RDMAC verbs spec requires that we have an LLP Handle (socket), we are dependent on the resolution of this issue. > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Thursday, August 25, 2005 7:05 PM > To: Steve Wise > Cc: 'Roland Dreier'; Tom Tucker; openib-general at openib.org > Subject: Re: [openib-general] [PATCH][iWARP] Added provider > CM verbs andquery provider methods > > Steve> See the current chelsio TOE thread on the netdev list. > > Hmm, I just skimmed through this. It seems like a big flamefest about > TOE without any reference to RDMA. 
> > I think someone from the iWARP world really needs to spend some time > explaining to the netdev community why we want to do connection setup > through the host stack so we can find a solution that everyone can > agree on. I don't want to bypass part of the community (and bypass > packet filtering too!) just because they'll "never" accept something. > > - R. > From swise at ammasso.com Thu Aug 25 19:28:57 2005 From: swise at ammasso.com (Steve Wise) Date: Thu, 25 Aug 2005 21:28:57 -0500 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbsandqueryprovider methods References: <5E701717F2B2ED4EA60F87C8AA57B7CC01E5D9E2@venom2> Message-ID: <002d01c5a9e5$e9d32120$020010ac@haggard> So all three current iWARP implementations can work with our proposed connection setup model. That's three different HW drivers. Can we agree to begin with this approach and get all the other iwarp vs ib issues flushed out by getting at least one iwarp device working with the openib design? IE: I'm asking that we push in our connection setup patch in the iwarp branch, then work from that and continue this evolution. Roland? Whatchathink? Stevo. ----- Original Message ----- From: "Bob Sharp" To: "Caitlin Bestler" ; Sent: Thursday, August 25, 2005 8:43 PM Subject: RE: [openib-general] [PATCH][iWARP] Added provider CM verbsandqueryprovider methods >From NetEffect's perspective, the per device approach is simple to implement and I do not see it as an Ammasso specific approach. As Caitlin described, existing code needs to be reorganized but this aspect of our port is not a major effort. 
Bob -----Original Message----- From: openib-general-bounces at openib.org on behalf of Caitlin Bestler Sent: Thu 8/25/2005 6:35 PM To: openib-general at openib.org Cc: Subject: RE: [openib-general] [PATCH][iWARP] Added provider CM verbsandquery provider methods > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Thursday, August 25, 2005 4:19 PM > To: Steve Wise > Cc: openib-general at openib.org > Subject: Re: [openib-general] [PATCH][iWARP] Added provider > CM verbs andquery provider methods > > Steve> Would you rather we put this in a CM device struct of some > Steve> sort? And have rnic devices export a CM device that has > Steve> these sorts of methods? To me that's basically the same as > Steve> adding it to the struct ib_device. > > I think if we really need these new methods, then they might > as well be in the rdma_device structure. > > Steve> Well, i haven't seen any other iwarp device at this detail > Steve> (i only know of three iwarp devices in existance), so i > Steve> cannot tell you. Perhaps the other vendors can respond. > > I think it's important for us to understand whether this is > an Amasso-specific interface or something that will work in > the general case. Perhaps someone from Broadcom and/or > NetEffect (or some other > vendor) can comment? > The per-device connection methods proposed by Ammasso are definitely implementable for every iWARP RNIC that I am aware of. It does move some code around from where we have it currently, so we're not ready to release our versions yet. But that's because things take time, not because we think there's anything wrong with the interface. This interface needs to be complemented by the existing IB-specific interface and by an eventual TCP-specific interface that is compatible with the DAT Socket Service Point and/or IT-API's socket convert options. 
That interface would also deal with interoperability with pre-IETF MPA. But only a handful of applications need those optinons, the DAT-style interface accomodates the vast majority of applications that we are aware of. Caitlin Bestler Principal Software Scientist Broadcom caitlinb at broadcom.com _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Thu Aug 25 19:54:54 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 25 Aug 2005 22:54:54 -0400 Subject: [openib-general] Re: RMPP Message Format Errors Message-ID: <1125024894.4398.436.camel@hal.voltaire.com> Hi Sean, In mad.c::ib_create_send_mad, if rmpp is active, the payload length is calculated as follows: if (rmpp_active) { ... rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(hdr_len - offsetof(struct ib_rmpp_mad, data) + data_len); Then in mad_rmpp.c::send_next_seg, I see: if (mad_send_wr->seg_num == 1) { rmpp_mad->rmpp_hdr.rmpp_rtime_flags |= IB_MGMT_RMPP_FLAG_FIRST; rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(mad_send_wr->total_seg * (sizeof(struct ib_rmpp_mad) - offsetof(struct ib_rmpp_mad, data))); That appears to me to overwrite the initial paylen but I might have missed something here. In any case, doesn't the initial payload length need to be the number of segments times (hdr_len - offsetof(struct ib_rmpp_mad, data)) + data_len ? If so, that's part of the problem. Another alternative would be not to set paylen in the first segment. 
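
To make the two paylen computations quoted above concrete, here is a small C sketch of the arithmetic. It assumes a 256-byte MAD whose RMPP data area starts at offset 36, so each segment carries 220 bytes after the RMPP header; the function names are hypothetical, and the example values in the usage note (a 56-byte class header, 400 bytes of data, 2 segments) are illustrative rather than taken from the thread.

```c
#include <assert.h>
#include <stdint.h>

enum {
    MAD_SIZE         = 256,                        /* bytes per MAD */
    RMPP_DATA_OFFSET = 36,                         /* assumed offsetof(struct ib_rmpp_mad, data) */
    RMPP_SEG_DATA    = MAD_SIZE - RMPP_DATA_OFFSET /* 220 bytes per segment */
};

/* Value written when the send buffer is created:
 * hdr_len - offsetof(struct ib_rmpp_mad, data) + data_len */
static uint32_t paylen_at_create(uint32_t hdr_len, uint32_t data_len)
{
    assert(hdr_len >= RMPP_DATA_OFFSET);
    return hdr_len - RMPP_DATA_OFFSET + data_len;
}

/* Value written into the first segment when it is sent:
 * total_seg * (sizeof(struct ib_rmpp_mad) - offsetof(struct ib_rmpp_mad, data)) */
static uint32_t paylen_first_seg(uint32_t total_seg)
{
    return total_seg * RMPP_SEG_DATA;
}
```

For example, `paylen_at_create(56, 400)` gives 420 while `paylen_first_seg(2)` gives 440, so the second assignment does replace the first with a different number. Whether that difference is intended is exactly the question raised here.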
-- Hal From rolandd at cisco.com Thu Aug 25 21:04:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 25 Aug 2005 21:04:59 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbsandqueryprovider methods In-Reply-To: <002d01c5a9e5$e9d32120$020010ac@haggard> (Steve Wise's message of "Thu, 25 Aug 2005 21:28:57 -0500") References: <5E701717F2B2ED4EA60F87C8AA57B7CC01E5D9E2@venom2> <002d01c5a9e5$e9d32120$020010ac@haggard> Message-ID: <52irxtqr6c.fsf@cisco.com> Steve> Can we agree to begin with this approach and get all the Steve> other iwarp vs ib issues flushed out by getting at least Steve> one iwarp device working with the openib design? IE: I'm Steve> asking that we push in our connection setup patch in the Steve> iwarp branch, then work from that and continue this Steve> evolution. Sure, feel free to go wild on your branch. However, I think we have to get some buy-in from the netdev crowd before we merge anything into the upstream kernel. - R. From sean.hefty at intel.com Thu Aug 25 22:16:40 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 25 Aug 2005 22:16:40 -0700 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <1125024894.4398.436.camel@hal.voltaire.com> Message-ID: > if (rmpp_active) { > ... > rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(hdr_len - > offsetof(struct ib_rmpp_mad, data) + data_len); > >Then in mad_rmpp.c::send_next_seg, I see: > > if (mad_send_wr->seg_num == 1) { > rmpp_mad->rmpp_hdr.rmpp_rtime_flags |= IB_MGMT_RMPP_FLAG_FIRST; > rmpp_mad->rmpp_hdr.paylen_newwin = > cpu_to_be32(mad_send_wr->total_seg * > (sizeof(struct ib_rmpp_mad) - > offsetof(struct ib_rmpp_mad, data))); >That appears to me to overwrite the initial paylen but I might have >missed something here. The payload is being overridden, but that's necessary. The payload that's set when creating the MAD is used to indicate the size of the buffer. The payload set with the 1st segment indicates the size of the transfer. 
They differ because the headers are duplicated in each segment, but only a single copy is provided in the send buffer. >In any case, doesn't the initial payload length need to be the number of >segments times (hdr_len - offsetof(struct ib_rmpp_mad, data)) + data_len >? If so, that's part of the problem. I believe that the payload is being calculated correctly. It should be the number of segments * 220 bytes per packet, or at least that was my interpretation of the spec. >Another alternative would be not to set paylen in the first segment. That would work. I tried to set the value to allow future optimization on the receive side. - Sean From sean.hefty at intel.com Thu Aug 25 22:34:17 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 25 Aug 2005 22:34:17 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbsandqueryprovider methods In-Reply-To: <5E701717F2B2ED4EA60F87C8AA57B7CC01E5D9E2@venom2> Message-ID: >>From NetEffect's perspective, the per device approach is simple to >implement and I do not see it as an Ammasso specific approach. As Caitlin >described, existing code needs to be reorganized but this aspect of our port >is not a major effort. I agree that connection setup code could be duplicated in all iWarp drivers to make such an interface work, but that doesn't necessarily make it the best approach. Personally, I would rather see a single iWarp connection manager establish the connections, and then hand those as the LLP handle to the modify QP verb. Based on comments, the iWarp vendors would prefer the host TCP stack establish connections. If this is the case, then why isn't having a single iWarp CM that resides above verbs preferred to duplicating that functionality in each driver? To me, this does sound like an Ammasso specific approach, in that they implemented TCP connections in hardware. Responses from NetEffect and Broadcom mentioned software solutions. 
- Sean From christian.guggenberger at rzg.mpg.de Fri Aug 26 00:50:32 2005 From: christian.guggenberger at rzg.mpg.de (Christian Guggenberger) Date: Fri, 26 Aug 2005 09:50:32 +0200 Subject: crash solved! (was: [openib-general] kernel crashes while using mvapich-gen2 over ib_uverbs) In-Reply-To: <52y86pr02f.fsf@cisco.com> References: <1124960511.12751.16.camel@bonnie.rzg.mpg.de> <52y86pr02f.fsf@cisco.com> Message-ID: <1125042633.1568.6.camel@bonnie.rzg.mpg.de> Roland, On Thu, 2005-08-25 at 17:52 -0700, Roland Dreier wrote: > I finally got to the bottom of this. It's a pretty simple > use-after-free bug. I didn't see it because the only machines I > routinely test CONFIG_DEBUG_SLAB=y kernels on are i386 machines, and > the implementation of dma_unmap_sg() for i386 doesn't expose this > bug. As soon as I tested CONFIG_DEBUG_SLAB=y on x86_64, I saw the > same failure. The patch below should fix this for you. > > Please let me know if you still have problems after applying this patch. thanks for your several replies. I am sorry for not being available for feedback immediately. To summarize, your patch works well (although I had to trim it manually). To answer your previous questions, CONFIG_DEBUG_SLAB=y and CONFIG_HUGETLB_PAGE=y were set. (I guess you won't need objdump of uverbs_mem.o anymore, do you?) Now, together with a fresh checkout of mvapich-gen2 I have been able to run simple mpi programs. thanks for your help && keep up the good work, cheers. - Christian From guyg at voltaire.com Fri Aug 26 01:26:53 2005 From: guyg at voltaire.com (Guy German) Date: Fri, 26 Aug 2005 11:26:53 +0300 Subject: [openib-general] RDMA connection and address translation API Message-ID: >> We need to insert in here: >> >> ib_modify_qp(...); /* somehow uses address resolution... */ >> ib_post_recvs(...); >> > >or add a new call to create the qp and modify it to init (an analog to >the socket(2) function). Sean> This approach seems reasonable to me.
Maybe something like: Sean> rdma_create_qp(rdma_addr_info); Sean> Uses the output from the address resolution to create the QP on the Sean> correct device and transitions it to the INIT state. The user can Sean> now post any work requests that they want. For example, with iWarp, Sean> I believe that even send work requests can be posted in the INIT state. What do you think about this flow ? 1. resolve device and port from ip address - synchronous operation (like at.c resolve_ip) 2. rdma_create_qp (device+port) - modifies qp to init with default pkey index 3. ib_post_recvs(...); 4. cma_connect - asynchronous at, modify qp with correct pkey index, cm_connect Guy _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From guyg at voltaire.com Fri Aug 26 01:35:10 2005 From: guyg at voltaire.com (Guy German) Date: Fri, 26 Aug 2005 11:35:10 +0300 Subject: [openib-general] RE: Do we care about pre-emption? Message-ID: Hi Paul, Thanks for pointing out your OS preemption needs. > Your (current?) design minimizing the work in interrupt context > seems a good starting point. Why do you want to do more of that > work in an ISR? Is the extra complexity putting it outside an > ISR significant? Is the performance much worse? iSER, today, is going to the trouble of staying in the kernel thread context as long as possible, after receiving its completion upcall. I was just pondering whether that is really necessary, in light of SRP’s different approach. Anyway, there isn’t any immediate intention of changing this - there are other issues the iSER code is dealing with, right now.
Thanks, Guy From eitan at mellanox.co.il Fri Aug 26 03:15:07 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 26 Aug 2005 13:15:07 +0300 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <1125002702.4421.1378.camel@hal.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE7@taurus.voltaire.com> <4309E27E.7050002@mellanox.co.il> <1125002702.4421.1378.camel@hal.voltaire.com> Message-ID: <430EEBAB.9000303@mellanox.co.il> Hi Hal, I am trying to figure out how a client will figure out the number of records provided in the mad it gets back from umad. Can you describe this? Thanks Eitan Hal Rosenstock wrote: > On Mon, 2005-08-22 at 10:34, Eitan Zahavi wrote: > >>>It gets a "real" received length provided it supplies a buffer large enough. >> >>So I guess the "real receive length" is truncated to the last data >>record even if the packet sent was 256 bytes? > > > The receive buffer is not truncated. An error is returned if the buffer > supplied is too small for a receive is too small and it includes the > size of the buffer needed. > > I don't understand what you mean by "even if the packet sent was 256 > bytes". > > -- Hal From halr at voltaire.com Fri Aug 26 06:05:30 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Aug 2005 09:05:30 -0400 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: References: Message-ID: <1125061530.4627.13.camel@localhost.localdomain> On Fri, 2005-08-26 at 01:16, Sean Hefty wrote: > >In any case, doesn't the initial payload length need to be the number of > >segments times (hdr_len - offsetof(struct ib_rmpp_mad, data)) + data_len > >? If so, that's part of the problem. > > I believe that the payload is being calculated correctly. It should be the > number of segments * 220 bytes per packet, or at least that was my > interpretation of the spec. The 220 byte payload length is for SA. 
That's mostly right but assumes the last segment will be full (and accounted for by the paylen in the last segment). Doesn't it need to account for a "partial" rather than full last segment transferred data in the first segment length ? -- Hal From halr at voltaire.com Fri Aug 26 06:09:48 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Aug 2005 09:09:48 -0400 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <430EEBAB.9000303@mellanox.co.il> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175BE7@taurus.voltaire.com> <4309E27E.7050002@mellanox.co.il> <1125002702.4421.1378.camel@hal.voltaire.com> <430EEBAB.9000303@mellanox.co.il> Message-ID: <1125061788.4627.20.camel@localhost.localdomain> Hi Eitan, On Fri, 2005-08-26 at 06:15, Eitan Zahavi wrote: > I am trying to figure out how a client will figure out the number of > records provided in the mad it gets back from umad. > > Can you describe this? A client would use the received length returned from umad_recv and either the attribute offset in the RMPP header (or expected attribute offset for record type) to calculate this (in the case of an SA client). For other classes, it is class specific. I think there is a problem in osm_vendor_ibumad_sa.c::__osmv_sa_mad_rcv_cb which I will be working on as soon as I sort through the send side issues. 
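The calculation described above can be sketched as follows for the SA case. This is only an illustration under stated assumptions: the function name is hypothetical, mad_len is assumed to be the length of the reassembled MAD including the 56-byte SA MAD header (the value opensm calls IB_SA_MAD_HDR_SIZE), and attr_offset is assumed to be the AttributeOffset field already converted to host byte order and counted in 8-byte units.

```c
#include <assert.h>
#include <stdint.h>

/* Common MAD header (24) + RMPP header (12) + SA-specific header (20). */
#define SA_MAD_HDR_SIZE 56u

static_assert(SA_MAD_HDR_SIZE == 24u + 12u + 20u, "SA MAD header layout");

/* Hypothetical helper: number of complete records in a received SA MAD.
 * Returns 0 when there is no data or the attribute offset is unusable. */
static uint32_t sa_num_records(uint32_t mad_len, uint16_t attr_offset)
{
    uint32_t rec_size = (uint32_t)attr_offset * 8u; /* 8-byte units -> bytes */

    if (rec_size == 0 || mad_len <= SA_MAD_HDR_SIZE)
        return 0;
    return (mad_len - SA_MAD_HDR_SIZE) / rec_size;
}
```

For example, a received length of 248 bytes with an attribute offset of 8 (64-byte records) yields 3 records; trailing bytes that do not make up a whole record are ignored.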
-- Hal From tom at ammasso.com Fri Aug 26 07:30:29 2005 From: tom at ammasso.com (Tom Tucker) Date: Fri, 26 Aug 2005 10:30:29 -0400 Subject: [openib-general] [PATCH][iWARP] Added provider CMverbsandqueryprovider methods Message-ID: <8E9D028761D8264D910612167E8457E8FA3B95@mail2.ammasso.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hefty > Sent: Friday, August 26, 2005 12:34 AM > To: 'Bob Sharp'; Caitlin Bestler; openib-general at openib.org > Subject: RE: [openib-general] [PATCH][iWARP] Added provider > CMverbsandqueryprovider methods > > >>From NetEffect's perspective, the per device approach is simple to > >implement and I do not see it as an Ammasso specific > approach. As Caitlin > >described, existing code needs to be reorganized but this > aspect of our port > >is not a major effort. > > I agree that connection setup code could be duplicated in all > iWarp drivers to > make such an interface work, but that doesn't necessarily > make it the best > approach. Personally, I would rather see a single iWarp > connection manager > establish the connections, and then hand those as the LLP > handle to the modify > QP verb. > > Based on comments, the iWarp vendors would prefer the host > TCP stack establish > connections. If this is the case, then why isn't having a > single iWarp CM that > resides above verbs preferred to duplicating that > functionality in each driver? > > To me, this does sound like an Ammasso specific approach, in that they > implemented TCP connections in hardware. Responses from > NetEffect and Broadcom > mentioned software solutions. I believe that what you are advocating is having a mini-TCP-stack in Linux. This mini-TCP-stack knows how to establish connections which are then passed down to the adapter. This mini-stack would comprise the iWARP side of a unified connection manager.
This is all fine and good, but for the fact that not all adapters work this way. In fact, to date, *MOST* TOE engines including Chelsio, Adaptec, Emulex, and others establish connections themselves. Admittedly, I don't know how the Broadcom and NetEffect RDMA adapters plan on doing it. That said, there needs to be a device attribute that indicates whether the host establishes the connection on the adapter or takes existing connection state from the host. If the adapter establishes the connection, it must export functions to do so. I proposed yesterday that we have a separate structure for this purpose that is hung off the ib_device. This structure will be present and populated for adapters that establish connections themselves. As far as the host-mini-TCP stack goes...I find it almost impossible to believe that it has a better chance of getting pushed upstream than does host-stack integration. We should clearly articulate the host-adapter split stack model, why it is necessary for RDMA devices, and attempt to convince the netdev/stack community that this is the right thing to do. A mini-TCP stack in the host is a separate piece of logic that duplicates existing functionality that serves no other purpose than to delay an inevitable argument about upstream integration. 
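The arrangement proposed above (a capability attribute plus a separate, optional table of connection functions hung off the device) might look roughly like the sketch below. All names here are hypothetical stand-ins, not the OpenIB structures; the return values 1 and 2 only mark which path was taken.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL: connection functions exported by an adapter that
 * establishes connections itself. */
struct demo_cm_ops {
    int (*connect)(void *qp, const char *peer);
};

/* HYPOTHETICAL capability bit: the adapter provides its own CM functions. */
#define DEMO_DEVICE_HAS_CM (1u << 0)

/* HYPOTHETICAL device: cm_ops is present and populated only when the
 * adapter sets DEMO_DEVICE_HAS_CM. */
struct demo_device {
    unsigned int flags;
    const struct demo_cm_ops *cm_ops;
};

/* Stand-in for an adapter-side connect; returns 2 to mark this path. */
static int demo_adapter_connect(void *qp, const char *peer)
{
    (void)qp;
    (void)peer;
    return 2;
}

/* Stand-in for host-side connection setup; returns 1 to mark this path. */
static int demo_host_connect(void *qp, const char *peer)
{
    (void)qp;
    (void)peer;
    return 1;
}

/* Use the adapter's functions when it advertises them, else the host path. */
static int demo_connect(const struct demo_device *dev, void *qp,
                        const char *peer)
{
    assert(dev != NULL);
    if ((dev->flags & DEMO_DEVICE_HAS_CM) && dev->cm_ops &&
        dev->cm_ops->connect)
        return dev->cm_ops->connect(qp, peer);
    return demo_host_connect(qp, peer);
}
```

The point of the split is that generic code checks one flag and one pointer, so drivers for adapters that take over host-established connections do not need to carry any connection-setup functions at all.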
> > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Fri Aug 26 08:20:44 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Aug 2005 11:20:44 -0400 Subject: [openib-general] [PATCH] OSM vendor layer: Fix length when sending RMPP for SA class in osm_vendor_send Message-ID: <1125069644.4530.36.camel@hal.voltaire.com> OSM vendor layer: Fix length when sending RMPP for SA class in osm_vendor_send Signed-off-by: Hal Rosenstock Index: osm_vendor_ibumad.c =================================================================== --- osm_vendor_ibumad.c (revision 3197) +++ osm_vendor_ibumad.c (working copy) @@ -942,7 +942,8 @@ put_madw(p_vend, p_madw, &p_mad->trans_id); if ((ret = umad_send(p_bind->port_id, p_bind->agent_id, p_vw->umad, - p_madw->mad_size, + is_rmpp ? p_madw->mad_size - IB_SA_MAD_HDR_SIZE : + p_madw->mad_size, resp_expected ? p_vend->timeout : 0, p_vend->max_retries)) < 0) { if (resp_expected) From jlentini at netapp.com Fri Aug 26 08:40:38 2005 From: jlentini at netapp.com (James Lentini) Date: Fri, 26 Aug 2005 11:40:38 -0400 (EDT) Subject: [openib-general] cma header - change some things according to the list feedback In-Reply-To: References: Message-ID: On Thu, 25 Aug 2005, Guy German wrote: > > typedef void (*ib_cma_event_handler)(enum ib_cma_event event, void *context, > > void *private_data); > >>Would ib_cma_conn_handler be more appropriate? > > Maybe, but it is actually the active side event cb (also for discon etc.) > I don't mind changing it, though... In that case, I would not change it. > > int ib_cma_connect(struct ib_cma_conn *cma_conn, > > union ib_cma_id *cma_id); > > > >> Should there be a way to cancel an ib_cma_connect() call? > > It is possible to add it. 
Not sure how much it will be used by > consumers, though. The additional synchronization needed in consumers isn't worth it. From halr at voltaire.com Fri Aug 26 09:34:58 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Aug 2005 12:34:58 -0400 Subject: [openib-general] [PATCH] OSM vendor layer: In umad_receiver, handle allocating RMPP large MADs from OSM MAD pool Message-ID: <1125074098.4530.179.camel@hal.voltaire.com> OSM vendor layer: In umad_receiver, handle allocating RMPP large MADs from OSM MAD pool Signed-off-by: Hal Rosenstock Index: osm_vendor_ibumad.c =================================================================== --- osm_vendor_ibumad.c (revision 3198) +++ osm_vendor_ibumad.c (working copy) @@ -271,7 +271,9 @@ if (!(madw_p = osm_mad_pool_get(p_bind->p_mad_pool, (osm_bind_handle_t)p_bind, - MAD_BLOCK_SIZE, &osm_addr))) { + length > MAD_BLOCK_SIZE ?
+ length : MAD_BLOCK_SIZE, + &osm_addr))) { osm_log( p_vend->p_log, OSM_LOG_ERROR, "umad_receiver: " "request for a new madw failed -- dropping packet\n" ); continue; From jlentini at netapp.com Fri Aug 26 09:50:56 2005 From: jlentini at netapp.com (James Lentini) Date: Fri, 26 Aug 2005 12:50:56 -0400 (EDT) Subject: [openib-general] RDMA connection and address translation API In-Reply-To: References: Message-ID: On Thu, 25 Aug 2005, Sean Hefty wrote: > >> Any way providing src/dst IPs in the CM Private data is simple, > >> and we can come with IBTA extension blessing that data structure > >> as a general way to map IP oriented protocols over IB (a 1-2 page > >> draft at the most) This way it can also address Caitlin concerns > >> regarding NFS & IETF (since now it's a transport specific issue) > > > >How long do you estimate it would take to standardize an IP<->GID > >mechanism (ATS, CM embedded, ...) in the IBTA? 3 months? 6 months? > >A year?.... > > > >Let's assume that everyone on this list is in agreement. > > Does anyone in the IB world disagree with adding IP addresses in the > CM private data area? Would we want to extend this concept to SIDR > as well? I think we should focus on providing a mechanism to allow ULPs to use IP addresses on InfiniBand networks. Service discovery (SIDR) seems like a separate issue. The ability to ask "What UD QPN is this service using?" seems useful on its own. From halr at voltaire.com Fri Aug 26 09:42:39 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Aug 2005 12:42:39 -0400 Subject: [openib-general] RMPP Message Format Errors (Short Term Plan) Message-ID: <1125074400.4530.187.camel@hal.voltaire.com> Hi, I will finish with RMPP and then embark on the 1.8.0 merge. I hope and expect to start the latter early next week. 
-- Hal From jlentini at netapp.com Fri Aug 26 09:54:37 2005 From: jlentini at netapp.com (James Lentini) Date: Fri, 26 Aug 2005 12:54:37 -0400 (EDT) Subject: [openib-general] Header reorganization heads up In-Reply-To: <20050825203549.GE9080@esmail.cup.hp.com> References: <52br3lud64.fsf@cisco.com> <20050825203549.GE9080@esmail.cup.hp.com> Message-ID: On Thu, 25 Aug 2005, Grant Grundler wrote: > On Thu, Aug 25, 2005 at 01:56:20PM -0400, James Lentini wrote: > ... > > If you remove the include/rdma directory, won't that break code > > outside the OpenIB subversion tree that is using this location? > > What code outside of drivers/infiniband is using include/rdma? None that I know of at the moment. From rolandd at cisco.com Fri Aug 26 09:56:37 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 26 Aug 2005 09:56:37 -0700 Subject: [openib-general] avoid segv in libibverbs/examples In-Reply-To: <20050818154928.GB31078@osc.edu> (Pete Wyckoff's message of "Thu, 18 Aug 2005 11:49:28 -0400") References: <20050812144222.GA8988@osc.edu> <52u0hrghrw.fsf@cisco.com> <20050818154928.GB31078@osc.edu> Message-ID: <528xyor60q.fsf@cisco.com> thanks, I finally applied this patch. - R. From caitlinb at broadcom.com Fri Aug 26 09:59:56 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 26 Aug 2005 09:59:56 -0700 Subject: [openib-general] RDMA connection and address translation API Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F525@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Guy German > Sent: Friday, August 26, 2005 1:27 AM > To: Sean Hefty; James Lentini > Cc: openib-general at openib.org > Subject: RE: [openib-general] RDMA connection and address > translation API > > >> We need to insert in here: > >> > >> ib_modify_qp(...); /* somehow uses address resolution... 
*/ > >> ib_post_recvs(...); > >> > > > >or add a new call to create the qp and modify it to init (an > analog to > >the socket(2) function). > > Sean> This approach seems reasonable to me. Maybe something like: > Sean> rdma_create_qp(rdma_addr_info); > > Sean> Uses the output from the address resolution to create the QP on > Sean> the correct device and transitions it to the INIT > state. The user > Sean> can now post any work requests that they want. For > example, with > Sean> iWarp, I believe that even send work requests can be > posted in the INIT state. > > What do you think about this flow ? > 1. resolve device and port from ip address - synchronous operation > (like at.c resolve_ip) > 2. rdma_create_qp (device+port) - modifies qp to init with > default pkey index 3. ib_post_recvs(...); 4. cma_connect - > asynchronous at, modify qp with correct pkey index, cm_connect > At least with iWARP a QP is not bound to a specific port, or even to an IP Address. It is only bound to the RDMA Device (RNIC) and Protection Domain. The same QP can be re-used for a new connection with a new IP address. Indeed, that is exactly what would happen with application-layer controlled failover (such as iSER). From vkrishnamurthy at xsigo.com Fri Aug 26 10:13:33 2005 From: vkrishnamurthy at xsigo.com (Viswanath Krishnamurthy) Date: Fri, 26 Aug 2005 10:13:33 -0700 Subject: [openib-general] kernel oops Message-ID: <430F4DBD.4070703@xsigo.com> I downloaded the latest openib gen2 stack and ran into a kernel panic when I run the cmpost/ucmpost example. I modified the program to continuously send and receive data in an infinite loop and killed the application with ctrl-c. The kernel panics pretty consistently. I am currently running the 2.6.12 version of the kernel. Log attached. I will try upgrading to a newer kernel and see if I can reproduce it. -Viswa -------------- next part -------------- An embedded and charset-unspecified text was scrubbed...
Name: oops URL: From iod00d at hp.com Fri Aug 26 10:23:02 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 26 Aug 2005 10:23:02 -0700 Subject: [openib-general] [PATCH] OSM vendor layer: In umad_receiver, handle allocating RMPP large MADs from OSM MAD pool In-Reply-To: <1125074098.4530.179.camel@hal.voltaire.com> References: <1125074098.4530.179.camel@hal.voltaire.com> Message-ID: <20050826172302.GA14157@esmail.cup.hp.com> On Fri, Aug 26, 2005 at 12:34:58PM -0400, Hal Rosenstock wrote: > - MAD_BLOCK_SIZE, &osm_addr))) { > + length > MAD_BLOCK_SIZE ? > + length : MAD_BLOCK_SIZE, > + &osm_addr))) { Can "max(length, MAD_BLOCK_SIZE)" be used instead? thanks, grant From sean.hefty at intel.com Fri Aug 26 10:43:41 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 26 Aug 2005 10:43:41 -0700 Subject: [openib-general] kernel oops In-Reply-To: <430F4DBD.4070703@xsigo.com> Message-ID: >I downloaded the latest openib gen2 stack and ran into kernel panic when >I run the cmpost/ucmpost example. I modified the program to continuously >send and receive data in an infinite loop and killed the application >with ctrl-c. >The kernel panics pretty consistently. > >I am currently running 2.6.12 version of the kernel . Log attached. I >will try >upgrading to newer kernel and see if I can reproduce it. I have gotten something similar to this in my own testing, but haven't had the time to track it down. It seems to be related to how the IB AT code interacts with the SM, and if the SM has been restarted. Can you try resetting the SM node, then rebooting your other systems? - Sean From tom at ammasso.com Fri Aug 26 10:46:26 2005 From: tom at ammasso.com (Tom Tucker) Date: Fri, 26 Aug 2005 13:46:26 -0400 Subject: [openib-general] [PATCH][iWARP] Another CM Verbs Approach Message-ID: <8E9D028761D8264D910612167E8457E8FA3BA0@mail2.ammasso.com> Sean/Roland: I think there is general consensus that having CM verbs in the ib_device structure is disgusting.
This patch moves these verbs to the ib_cm.h file and adds a device capabilities flag that indicates whether or not the device has CM verbs. I know this is not exactly what everyone is looking for, but I hope it is workable given the practical realities we face relative to host stack integration and the need to support devices that establish connections internally. I know Roland has said we can "go wild" in our own branch, but I really don't want to do something that is not going to be acceptable to the community at large. So here's another shot, please tell me what you think. This patch is to the include directory of the iWARP branch. Signed-off-by: Tom Tucker Index: ib_cm.h =================================================================== --- ib_cm.h (revision 3120) +++ ib_cm.h (working copy) @@ -39,6 +39,7 @@ #include #include +#include enum ib_cm_state { IB_CM_IDLE, @@ -555,6 +556,96 @@ u8 private_data_len; }; +/* iWARP connection attributes. */ +struct ib_conn_attr { + struct in_addr local_addr; + struct in_addr remote_addr; + u16 local_port; + u16 remote_port; +}; + +/* This is provided in the event generated when + * a remote peer accepts our connect request + */ +struct ib_conn_results { + int errno; + struct ib_conn_attr conn_attr; + u8 *private_data; + int private_data_len; +}; + +/* This is provided in the event generated by a remote + * connect request to a listening endpoint + */ +struct ib_conn_request { + int cr_id; + struct ib_conn_attr conn_attr; + u8 *private_data; + int private_data_len; +}; + +/* Connection events. */ +enum ib_cmv_event_type { + IB_EVENT_ACTIVE_CONNECT_RESULTS, + IB_EVENT_CONNECT_REQUEST +}; + +struct ib_cmv_event { + struct ib_device *device; + union { + struct ib_conn_results active_results; + struct ib_conn_request conn_request; + } element; + enum ib_cmv_event_type event; +}; + +/* Listening endpoint. 
*/ +struct ib_listen_ep_attr { + void (*event_handler)(struct ib_cmv_event *, void *); + void *listen_context; + struct in_addr addr; + u16 port; + int backlog; +}; + +struct ib_listen_ep { + struct ib_device *device; + void (*event_handler)(struct ib_cmv_event *, void *); + void *listen_context; + struct in_addr addr; + u16 port; + int backlog; +}; + +struct ib_cmv { + + int (*connect_qp)(struct ib_qp *qp, + struct ib_conn_attr* attr, + void (*event_handler)(struct ib_cmv_event*, void*), + void* context, + u8 *pdata, + int pdata_len + ); + + int (*accept_cr)(int cr_id, + struct ib_qp *qp, + u8 *pdata, + int pdata_len); + + int (*reject_cr)(int cr_id, + struct ib_qp *qp, + u8 *pdata, + int pdata_len); + + int (*query_cr)(int cr_id, + struct ib_conn_request* req); + + struct ib_listen_ep * (*create_listen_ep)(struct ib_listen_ep_attr *); + + int (*destroy_listen_ep)(struct ib_listen_ep *ep); + +}; + /** * ib_send_cm_sidr_rep - Sends a service ID resolution request to the * remote node. @@ -565,4 +656,5 @@ int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id, struct ib_cm_sidr_rep_param *param); + #endif /* IB_CM_H */ Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 3120) +++ ib_verbs.h (working copy) @@ -59,7 +59,8 @@ enum ib_node_type { IB_NODE_CA = 1, IB_NODE_SWITCH, - IB_NODE_ROUTER + IB_NODE_ROUTER, + IB_NODE_RNIC }; enum ib_device_cap_flags { @@ -78,6 +79,7 @@ IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), IB_DEVICE_SRQ_RESIZE = (1<<13), IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_CMV = (1<<15) }; enum ib_atomic_cap { @@ -804,6 +806,7 @@ struct ib_gid_cache **gid_cache; }; +struct ib_cmv; struct ib_device { struct device *dma_device; @@ -820,6 +823,8 @@ u32 flags; + struct ib_cmv *ibcmv; + int (*query_device)(struct ib_device *device, struct ib_device_attr *device_attr); int (*query_port)(struct ib_device *device, From rolandd at cisco.com Fri Aug 26 10:52:54 2005 From: rolandd at cisco.com (Roland Dreier) Date:
Fri, 26 Aug 2005 10:52:54 -0700 Subject: [openib-general] Re: [PATCH][iWARP] Another CM Verbs Approach In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3BA0@mail2.ammasso.com> (Tom Tucker's message of "Fri, 26 Aug 2005 13:46:26 -0400") References: <8E9D028761D8264D910612167E8457E8FA3BA0@mail2.ammasso.com> Message-ID: <52zmr4pouh.fsf@cisco.com> Tom> I think there is general consensus that having CM verbs in Tom> the ib_device structure is disgusting. This patch moves these Tom> verbs to the ib_cm.h file and adds a device capabilities flag Tom> that indicates whether or not the device has CM verbs. Tom> I know this is not exactly what everyone is looking for, but Tom> I hope it is workable given the practical realities we face Tom> relative to host stack integration and the need to support Tom> devices that establish connections internally. How about creating an iwarp_cm.h file instead of using ib_cm.h? That way it's clear that these new methods are for iWARP devices. I'd like to understand that last bit above about devices that establish connections internally. Does that mean that there are RNICs that cannot support the iWARP verbs' notion of migrating a connection? - R. From sean.hefty at intel.com Fri Aug 26 10:54:10 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 26 Aug 2005 10:54:10 -0700 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <1125061530.4627.13.camel@localhost.localdomain> Message-ID: >The 220 byte payload length is for SA. That's mostly right but assumes >the last segment will be full (and accounted for by the paylen in the >last segment). I believe that the 220 byte payload length is for all RMPP MADs. Only the common and RMPP header lengths are ignored. >Doesn't it need to account for a "partial" rather than full last segment >transferred data in the first segment length ? What I couldn't easily tell from the spec is whether a partial last segment is included in the initial payload length or not.
I read it as: "PayloadLength counts all the bytes in the TransferredData field of the DATA packet format." In my interpretation, partial data is indicated by the PayloadLength field in the last segment only. It's quite possible that my interpretation is incorrect, in which case the calculation in the RMPP code is off. - Sean From tom at ammasso.com Fri Aug 26 10:58:54 2005 From: tom at ammasso.com (Tom Tucker) Date: Fri, 26 Aug 2005 13:58:54 -0400 Subject: [openib-general] RE: [PATCH][iWARP] Another CM Verbs Approach Message-ID: <8E9D028761D8264D910612167E8457E8FA3BA2@mail2.ammasso.com> > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Friday, August 26, 2005 12:53 PM > To: Tom Tucker > Cc: Sean Hefty; Roland Dreier; openib-general at openib.org > Subject: Re: [PATCH][iWARP] Another CM Verbs Approach > > Tom> I think there is general consensus that having CM verbs in > Tom> the ib_device structure is disgusting. This patch moves these > Tom> verbs to the ib_cm.h file and adds a device capabilities flag > Tom> that indicates whether or not the device has CM verbs. > > Tom> I know this is not exactly what everyone is looking for, but > Tom> I hope it is workable given the practical realities we face > Tom> relative to host stack integration and the need to support > Tom> devices that establish connections internally. > > How about creating an iwarp_cm.h file instead of using ib_cm.h? That > way it's clear that these new methods are for iWARP devices. good suggestion. > > I'd like to understand that last bit above about devices that > establish connections internally. Does that mean that there are RNICs > that cannot support the iWARP verbs' notion of migrating a connection? > Yes, that's what it means and the reason is that Linux does not (yet) provide a means to do so. > - R.
> From sean.hefty at intel.com Fri Aug 26 11:00:59 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 26 Aug 2005 11:00:59 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CMverbsandqueryprovider methods In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3B95@mail2.ammasso.com> Message-ID: >I believe that what you are advocating is having a mini-TCP-stack >in Linux. This mini-TCP-stack knows how to establish connections >which are then passed down to the adapter. This mini-stack would >comprise the iWARP side of a unified connection manager. I am not advocating this, but I believe that it is better than having mini-TCP stacks in each driver. This is why I'm trying to determine if the connection establishment done in iWarp is done in software or hardware. What you've done for your hardware seems correct. - Sean From rolandd at cisco.com Fri Aug 26 11:02:45 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 26 Aug 2005 11:02:45 -0700 Subject: [openib-general] Re: [PATCH][iWARP] Another CM Verbs Approach In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3BA2@mail2.ammasso.com> (Tom Tucker's message of "Fri, 26 Aug 2005 13:58:54 -0400") References: <8E9D028761D8264D910612167E8457E8FA3BA2@mail2.ammasso.com> Message-ID: <52vf1spoe2.fsf@cisco.com> Roland> I'd like to understand that last bit above about devices Roland> that establish connections internally. Does that mean Roland> that there are RNICs that cannot support the iWARP verbs' Roland> notion of migrating a connection? Tom> Yes, that's what it means and the reason is that Linux does Tom> not (yet) provide a means to do so. Now I'm confused. Is the problem that the Linux network stack does not have an interface for transferring TCP state, or is it that some RNIC hardware does not support taking control of an existing TCP connection? The first problem we can fix, but the second one we can't. - R.
From tom at ammasso.com Fri Aug 26 11:06:00 2005
From: tom at ammasso.com (Tom Tucker)
Date: Fri, 26 Aug 2005 14:06:00 -0400
Subject: [openib-general] RE: [PATCH][iWARP] Another CM Verbs Approach
Message-ID: <8E9D028761D8264D910612167E8457E8FA3BA4@mail2.ammasso.com>

> -----Original Message-----
> From: Roland Dreier [mailto:rolandd at cisco.com]
> Sent: Friday, August 26, 2005 1:03 PM
> To: Tom Tucker
> Cc: Roland Dreier; Sean Hefty; openib-general at openib.org
> Subject: Re: [PATCH][iWARP] Another CM Verbs Approach
>
> Roland> I'd like to understand that last bit above about devices
> Roland> that establish connections internally. Does that mean
> Roland> that there are RNICs that cannot support the iWARP verbs'
> Roland> notion of migrating a connection?
>
> Tom> Yes, that's what it means and the reason is that Linux does
> Tom> not (yet) provide a means to do so.
>
> Now I'm confused. Is the problem that the Linux network stack does
> not have an interface for transferring TCP state, or is it that some
> RNIC hardware does not support taking control of an existing TCP
> connection? The first problem we can fix, but the second one
> we can't.

Sorry for the confusion. Our current product does not support
migrating connections. Our next device does.

>
> - R.
>

From tom at ammasso.com Fri Aug 26 11:44:25 2005
From: tom at ammasso.com (Tom Tucker)
Date: Fri, 26 Aug 2005 14:44:25 -0400
Subject: [openib-general] [PATCH][iWARP] IW CM Verbs
Message-ID: <8E9D028761D8264D910612167E8457E8FA3BAF@mail2.ammasso.com>

This is a cleaned up version of the iWARP CM verbs based
on feedback from Roland and Sean. I've added a new file to
contain the iw specific types and changed the iw specific
type prefix to iw_.

Please comment, and if it looks good, I'll commit this to the
iWARP branch tonight.
Signed-off-by: Tom Tucker < tom at ammasso.com> Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 3120) +++ ib_verbs.h (working copy) @@ -59,7 +59,8 @@ enum ib_node_type { IB_NODE_CA = 1, IB_NODE_SWITCH, - IB_NODE_ROUTER + IB_NODE_ROUTER, + IB_NODE_RNIC }; enum ib_device_cap_flags { @@ -78,6 +79,7 @@ IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), IB_DEVICE_SRQ_RESIZE = (1<<13), IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_IWCM = (1<<15) }; enum ib_atomic_cap { @@ -804,6 +806,7 @@ struct ib_gid_cache **gid_cache; }; +struct iw_cm; struct ib_device { struct device *dma_device; @@ -820,6 +823,8 @@ u32 flags; + struct iw_cm *iwcm; + int (*query_device)(struct ib_device *device, struct ib_device_attr *device_attr); int (*query_port)(struct ib_device *device, Index: iw_cm.h =================================================================== --- iw_cm.h (revision 0) +++ iw_cm.h (revision 0) @@ -0,0 +1,127 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#if !defined(IW_CM_H) +#define IW_CM_H + +#include + +/* iWARP connection attributes. */ +struct iw_conn_attr { + struct in_addr local_addr; + struct in_addr remote_addr; + u16 local_port; + u16 remote_port; +}; + +/* This is provided in the event generated when + * a remote peer accepts our connect request + */ +struct iw_conn_results { + int errno; + struct iw_conn_attr conn_attr; + u8 *private_data; + int private_data_len; +}; + +/* This is provided in the event generated by a remote + * connect request to a listening endpoint + */ +struct iw_conn_request { + int cr_id; + struct iw_conn_attr conn_attr; + u8 *private_data; + int private_data_len; +}; + +/* Connection events. */ +enum iw_cm_event_type { + IW_EVENT_ACTIVE_CONNECT_RESULTS, + IW_EVENT_CONNECT_REQUEST +}; + +struct iw_cm_event { + struct iw_device *device; + union { + struct iw_conn_results active_results; + struct iw_conn_request conn_request; + } element; + enum iw_cm_event_type event; +}; + +/* Listening endpoint. 
*/ +struct iw_listen_ep_attr { + void (*event_handler)(struct iw_cm_event *, void *); + void *listen_context; + struct in_addr addr; + u16 port; + int backlog; +}; + +struct iw_listen_ep { + struct iw_device *device; + void (*event_handler)(struct iw_cm_event *, void *); + void *listen_context; + struct in_addr addr; + u16 port; + int backlog; +}; + +struct iw_cm { + + int (*connect_qp)(struct ib_qp *qp, + struct iw_conn_attr* attr, + void (*event_handler)(struct iw_cm_event*, void*), + void* context, + u8 *pdata, + int pdata_len + ); + + int (*accept_cr)(int cr_id, + struct ib_qp *qp, + u8 *pdata, + int pdata_len); + + int (*reject_cr)(int cr_id, + struct ib_qp *qp, + u8 *pdata, + int pdata_len); + + int (*query_cr)(int cr_id, + struct iw_conn_request* req); + + struct iw_listen_ep * (*create_listen_ep)(struct iw_listen_ep_attr *); + + int (*destroy_listen_ep)(struct iw_listen_ep *ep); + +}; + +#endif /* IW_CM_H */ From sean.hefty at intel.com Fri Aug 26 12:01:09 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 26 Aug 2005 12:01:09 -0700 Subject: [openib-general] RE: [PATCH][iWARP] IW CM Verbs In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3BAF@mail2.ammasso.com> Message-ID: >Please comment, and if it looks good, I'll commit this to the >iWARP branch tonight. Looks fine. See one minor comment below. >+/* This is provided in the event generated by a remote >+ * connect request to a listening endpoint >+ */ >+struct iw_conn_request { >+ int cr_id; >+ struct iw_conn_attr conn_attr; >+ u8 *private_data; >+ int private_data_len; >+}; Should cr_id be an int or something more along the lines of struct iw_cm_id *? - Sean From vkrishnamurthy at xsigo.com Fri Aug 26 12:02:59 2005 From: vkrishnamurthy at xsigo.com (Viswanath Krishnamurthy) Date: Fri, 26 Aug 2005 12:02:59 -0700 Subject: [openib-general] kernel oops In-Reply-To: References: Message-ID: <430F6763.2010601@xsigo.com> Still see the issue 1. 
I rebooted both the machines, started opensm, after LID assignment killed opensm. Next started the ucmpost client/server, killing it panics the system -Viswa Unable to handle kernel NULL pointer dereference at virtual address 00000068 printing eip: c02f2635 *pde = 3661e001 Oops: 0000 [#1] SMP Modules linked in: nfsd exportfs lockd autofs4 sunrpc uhci_hcd ehci_hcd hw_random e1000 ext3 jbd sd_mod CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010086 (2.6.12.5) EIP is at _spin_lock_irqsave+0xa/0x51 eax: 00000064 ebx: 00000286 ecx: f689be6c edx: c036cbcc esi: 00000064 edi: 00000064 ebp: 00000000 esp: f689be00 ds: 007b es: 007b ss: 0068 Process lt-ucmpost (pid: 3993, threadinfo=f689a000 task=f6ef9540) Stack: 00000000 c013e3f0 00000000 c036cbcc c0267667 00000000 000000d0 f689beac f66a9e80 c027393f c0350d00 00000000 f689be6c 0c300000 00000064 f689beac f66a9e80 c027955f 00000000 0c300000 00000064 000000d0 c0279022 f66a9e80 Call Trace: [] __alloc_pages+0x166/0x3b6 [] ib_get_client_data+0x14/0x54 [] ib_sa_path_rec_get+0x1b/0x13e [] resolve_path+0x8c/0x15b [] path_req_complete+0x0/0xf7 [] rtnetlink_dump_all+0x0/0x9e [] rtnetlink_done+0x0/0x3 [] ib_at_paths_by_route+0xc4/0xd9 [] same_path_req+0x0/0x95 Sean Hefty wrote: >>I downloaded the latest openib gen2 stack and ran into kernel panic when >>I run the cmpost/ucmpost example. I modified the program to continously >>send and receive data in an infinite loop and killed the application >>with ctrl-c. >>The kernel panics pretty consistently. >> >>I am currently running 2.6.12 version of the kernel . Log attached. I >>will try >>upgrading to newer kernel and see if I can reproduce it. >> >> > >I have gotten something similar to this in my own testing, but haven't had the >time to track it down. It seems to be related to how the IB AT code interacts >with the SM, and if the SM has been restarted. Can you try resetting the SM >node, then rebooting your other systems? 
>
>- Sean
>
>
>

From sean.hefty at intel.com Fri Aug 26 12:09:38 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 26 Aug 2005 12:09:38 -0700
Subject: [openib-general] kernel oops
In-Reply-To: <430F6763.2010601@xsigo.com>
Message-ID:

>1. I rebooted both the machines, started opensm, after LID assignment
>killed opensm.
>Next started the ucmpost client/server, killing it panics the system

This definitely shouldn't crash the systems, so there's a bug that needs to be
fixed. But the tests will not work unless an SM is running somewhere on the
fabric.

- Sean

From caitlinb at broadcom.com Fri Aug 26 12:24:15 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Fri, 26 Aug 2005 12:24:15 -0700
Subject: [openib-general] Re: [PATCH][iWARP] Another CM Verbs Approach
Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F52A@NT-SJCA-0751.brcm.ad.broadcom.com>

I believe it would be safe to state that all hardware has
one or more methods for transitioning an established TCP
connection to RDMA mode. What will vary is what restrictions
there are, and how easily that can be expanded.

When the RNIC also provides TCP offload services the
current firmware may only support transition from
offloaded connections. However, accepting an upload
of the TCP state from the driver is probably within
the capability of most hardware. For one thing it
is needed for failover and is quite useful for testing.

So right now it may be common for existing firmware/drivers
to only know how to extract the TCP state from a partner
stack. There is no reason why the Linux stack could not
be a partner stack, but kind of by definition no device
vendor can just decide to make the Linux stack a partner.
The Linux stack has some say in the matter.

Nobody is going to eagerly write firmware that relies
on extracting state information based on unpublished
data structures that are subject to change without
notice. Inter-module dependencies need to be planned,
not just assumed by add-on software.
> -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Friday, August 26, 2005 11:03 AM > To: Tom Tucker > Cc: openib-general at openib.org > Subject: [openib-general] Re: [PATCH][iWARP] Another CM Verbs Approach > > Roland> I'd like to understand that last bit above about devices > Roland> that establish connections internally. Does that mean > Roland> that there are RNICs that cannot support the iWARP verbs' > Roland> notion of migrating a connection? > > Tom> Yes, that's what it means and the reason is that Linux does > Tom> not (yet) provide a means to do so. > > Now I'm confused. Is the problem that the Linux network > stack does not have an interface for transferring TCP state, > or is it that some RNIC hardware does not support taking > control of an existing TCP connection? The first problem we > can fix, but the second one we can't. > > - R. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From guyg at voltaire.com Fri Aug 26 12:28:26 2005 From: guyg at voltaire.com (Guy German) Date: Fri, 26 Aug 2005 22:28:26 +0300 Subject: [openib-general] RDMA connection and address translation API Message-ID: > What do you think about this flow ? > 1. resolve device and port from ip address - synchronous operation > (like at.c resolve_ip) > 2. rdma_create_qp (device+port) - modifies qp to init with > default pkey index > 3. ib_post_recvs(...); > 4. cma_connect - asynchronous at, modify qp with correct > pkey index, cm_connect Caitlin wrote: >At least with iWARP a QP is not bound to a specific port, or even >to an IP Address. It is only bound to the RDMA Device (RNIC) and >Protection Domain. The same QP can be re-used for a new connection >with a new IP address. 
Indeed, that is exactly what would happen
>with application-layer controlled failover (such as iSER).

In IB, in order to post receives the QP needs to be in init.
In order to modify the QP to init, you need port and pkey_index.
If iWARP can post receives without it, the iWARP implementation
of "rdma_create_qp" can ignore the port attribute.

The other option that was suggested to solve the sync problem
(the need to post receives before connect) is to retrieve the path
synchronously, which will require unnecessary upcall handling
for iWARP consumers.

Guy

From nacc at us.ibm.com Fri Aug 26 11:54:27 2005
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Fri, 26 Aug 2005 11:54:27 -0700
Subject: [openib-general] [PATCH][iWARP] IW CM Verbs
In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3BAF@mail2.ammasso.com>
References: <8E9D028761D8264D910612167E8457E8FA3BAF@mail2.ammasso.com>
Message-ID: <20050826185427.GC4225@us.ibm.com>

On 26.08.2005 [14:44:25 -0400], Tom Tucker wrote:
>
> This is a cleaned up version of the iWARP CM verbs based
> on feedback from Roland and Sean. I've added a new file to
> contain the iw specific types and changed the iw specific
> type prefix to iw_.
>
> Please comment, and if it looks good, I'll commit this to the
> iWARP branch tonight.

Just a small nit about your mailer perhaps :)

> int (*query_device)(struct ib_device
> *device,

Seems like lines are getting wrapped? Any chance to get around this?
Makes it a bit hard to read the patches...

Thanks,
Nish

From sean.hefty at intel.com Fri Aug 26 12:41:45 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 26 Aug 2005 12:41:45 -0700
Subject: [openib-general] RDMA connection and address translation API
In-Reply-To:
Message-ID:

>What do you think about this flow ?
>1. resolve device and port from ip address - synchronous operation
>   (like at.c resolve_ip)
>2. rdma_create_qp (device+port) - modifies qp to init with default pkey index
>3. ib_post_recvs(...);
>4.
cma_connect - asynchronous at, modify qp with correct pkey index, cm_connect

It looks like this would work. If a client wanted to create multiple
connections to the same remote service (for example, to separate control and
data), then it seems more efficient to move the asynchronous at outside of the
connect call.

- Sean

From guyg at voltaire.com Fri Aug 26 13:20:48 2005
From: guyg at voltaire.com (Guy German)
Date: Fri, 26 Aug 2005 23:20:48 +0300
Subject: [openib-general] RDMA connection and address translation API
Message-ID:

>What do you think about this flow ?
>1. resolve device and port from ip address - synchronous operation
>   (like at.c resolve_ip)
>2. rdma_create_qp (device+port) - modifies qp to init with default pkey index
>3. ib_post_recvs(...);
>4. cma_connect - asynchronous at, modify qp with correct pkey index, cm_connect

>>It looks like this would work. If a client wanted to create multiple
>>connections to the same remote service (for example, to separate control and
>>data), then it seems more efficient to move the asynchronous at outside of the
>>connect call.
>>- Sean

That's a good point.
What I had in mind was mainly simplicity for the consumer: saving him
from dealing with another upcall. Maybe caching in the at module would
make things better, but I agree that for multiple connections to the
same remote service, the asynchronous at approach seems more appropriate.

So ...
Does everyone else think that we should change the API of a
cm abstraction to asynchronous at before connection?
(This should concern mostly the iWARP guys - Caitlin, Tom, etc.)

Thanks,
Guy

From jlentini at netapp.com Fri Aug 26 13:42:44 2005
From: jlentini at netapp.com (James Lentini)
Date: Fri, 26 Aug 2005 16:42:44 -0400 (EDT)
Subject: [openib-general] [PATCH] IBAT resolve_ats_route
Message-ID:

Hi Hal,

I was reading through the IBAT sources when I noticed that in
resolve_ats_route() you set req->pend.sa_query to null on line 1127 and then
check to see if it is null a few lines later.
I don't think you need to do that. Signed-off-by: James Lentini Index: at.c =================================================================== --- at.c (revision 3204) +++ at.c (working copy) @@ -1134,10 +1134,6 @@ static int resolve_ats_route(struct rout } if (!rec) { /* new query */ - if (req->pend.sa_query) { - DEBUG("req %p (%s) already pending", req, netdev->name); - return -1; - } build_ats_req(&sa_rec, NULL, req->src.pkey, req->dst_ip); rec = &sa_rec; } From caitlinb at broadcom.com Fri Aug 26 15:31:11 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 26 Aug 2005 15:31:11 -0700 Subject: [openib-general] RDMA connection and address translation API Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F52C@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: Guy German [mailto:guyg at voltaire.com] > Sent: Friday, August 26, 2005 12:28 PM > To: Caitlin Bestler; Sean Hefty; James Lentini > Cc: openib-general at openib.org > Subject: RE: [openib-general] RDMA connection and address > translation API > > > What do you think about this flow ? > > 1. resolve device and port from ip address - synchronous operation > > (like at.c resolve_ip) > > 2. rdma_create_qp (device+port) - modifies qp to init with default > > pkey index 3. ib_post_recvs(...); 4. cma_connect - > asynchronous at, > > modify qp with correct pkey index, cm_connect > > Caitlin wrote: > >At least with iWARP a QP is not bound to a specific port, or > even to an > >IP Address. It is only bound to the RDMA Device (RNIC) and > Protection > >Domain. The same QP can be re-used for a new connection with > a new IP > >address. Indeed, that is exactly what would happen with > >application-layer controlled failover (such as iSER). > > In ib, in order to post receive the QP need to be in init. > In order to modify qp to init, you need port and pkey_index. > If iWARP can post receive without it, the iwarp > implementation of "rdma_create_qp" can ignore the port attribute. 
> The closest equivalent of a pkey_index would be the VLAN ID, which is at L2 and totally transparent to an iWARP QP. You can definitely post receive buffers before knowing anything about the TCP connection (or SCTP association/stream) that will provide the LLP service. > The other option, that was suggested to solve the sync > problem (need of post receive before connect) is to retrieve > the path synchronically, which will require an unnecessary > upcall handling for iwarp consumers. > The generic requirement is that the QP passed to the connect method is ready to be moved to a connected state as soon as the connection establishment exchanges have finished. If I follow what you are proposing, you are trying to find a way to do this for IB automatically as a by-product of determining what device to use. I don't see any problem with this, as long as the "port" being returned from the first call is defined in such a way that it can have a void value when the transport does not need this refinement. Avoiding transport-dependent steps is good for encouraging development of RDMA-aware applications. From iod00d at hp.com Fri Aug 26 16:03:21 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 26 Aug 2005 16:03:21 -0700 Subject: [openib-general] RE: [PATCH][iWARP] IW CM Verbs In-Reply-To: References: <8E9D028761D8264D910612167E8457E8FA3BAF@mail2.ammasso.com> Message-ID: <20050826230321.GB15209@esmail.cup.hp.com> On Fri, Aug 26, 2005 at 12:01:09PM -0700, Sean Hefty wrote: > >Please comment, and if it looks good, I'll commit this to the > >iWARP branch tonight. > > Looks fine. See one minor comment below. > > > >+/* This is provided in the event generated by a remote > >+ * connect request to a listening endpoint > >+ */ > >+struct iw_conn_request { > >+ int cr_id; > >+ struct iw_conn_attr conn_attr; > >+ u8 *private_data; > >+ int private_data_len; > >+}; > > Should cr_id be an int or something more along the lines of struct iw_cm_id *? 
"unsigned int" if possible please. thanks, grant From eitan at mellanox.co.il Sat Aug 27 07:59:47 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 27 Aug 2005 17:59:47 +0300 Subject: [openib-general] Re: RMPP Message Format Errors (Short Term Plan) In-Reply-To: <1125074400.4530.187.camel@hal.voltaire.com> References: <1125074400.4530.187.camel@hal.voltaire.com> Message-ID: <43107FE3.6020800@mellanox.co.il> Hi Hal, Thanks for taking care of the RMPP issues. Once you think both sender and receiver side issues are resolved please let us know so I can re-run the test with the IB Analyzer. Eitan Hal Rosenstock wrote: > Hi, > > I will finish with RMPP and then embark on the 1.8.0 merge. I hope and > expect to start the latter early next week. > From eitan at mellanox.co.il Sat Aug 27 08:14:49 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 27 Aug 2005 18:14:49 +0300 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: References: Message-ID: <43108369.1050201@mellanox.co.il> Sean Hefty wrote: > > I believe that the 220 byte payload length is for all RMPP MADs. Only the > common and RMPP header lengths are ignored. Yes. > > > >>Doesn't it need to account for a "partial" rather than full last segment >>transferred data in the first segment length ? Yes I think it needs to use the partial length. > > > What I couldn't easily tell from the spec is whether a partial last segment is > included in the initial payload length or not. I read it as: "PayloadLength > counts all the bytes in the TransferredData field of the DATA packet format." > In my interpretation, partial data is indicated by the PayloadLength field in > the last segment only. It's quite possible that my interpretation is incorrect, > in which case the calculation in the RMPP code is off. I agree the text might be missing an example or two for clarification. Anyway, we probably can use the IB Analyzer as the ultimate interpretation test. 
Note that there are IB implementations that use the first segment payload
length as the source of packet length and count on it to represent the correct
DATA length.
We can take your interpretation to the IBTA MGTWG for further discussion.
Is the effort for fixing it big?

Thanks

Eitan

From hozer at hozed.org Sat Aug 27 21:20:00 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Sat, 27 Aug 2005 23:20:00 -0500
Subject: [openib-general] mpi drop in openib tree
In-Reply-To: <200508251650.j7PGo4w8029415@xi.cse.ohio-state.edu>
References: <527jec48wg.fsf@cisco.com> <200508251650.j7PGo4w8029415@xi.cse.ohio-state.edu>
Message-ID: <20050828042000.GU1685@kalmia.hozed.org>

On Thu, Aug 25, 2005 at 12:50:04PM -0400, Dhabaleswar Panda wrote:
> Hi Roland,
>
> > As for whether MPI should be in the OpenIB subversion tree or not, my
> > personal opinion is that having MPI there is only appropriate if the
> > svn tree is being used as the primary development source tree. I
> > don't think it's appropriate to use the OpenIB svn server as a release
> > distribution mechanism.
>
> Yes, we plan to use the svn tree as the primary development source
> tree for mvapich-gen2 not as a release distribution mechanism.
>
> In fact, during the last two days we have received a couple of
> patches, bug fixes, and suggestions and have checked-in these patches,
> fixes, and enhancements to the SVN tree with credits to the people who
> supplied us the patches. Thanks to those who sent us the patches!!
>
> We encourage people in the OpenIB community to test the latest version
> of mvapich-gen2 from the svn tree and provide us feedbacks and
> comments so that we can keep on enhancing it.

What is the plan for merging with future MPICH updates?

I've dealt with keeping full copies of the Linux kernel in various
version control systems, and merging with upstream always turns into a
huge headache.
Of course, if there aren't going to be any more MPICH-1
updates, then this is really no big deal.

My first suggestion is to remove the Java .jar and configure files.
There is also probably a lot of other stuff that really doesn't need to
be there as well. Do we really need to keep Windows-specific code in the
(currently Linux-only) OpenIB project svn tree?

From mst at mellanox.co.il Sat Aug 27 23:34:39 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 28 Aug 2005 09:34:39 +0300
Subject: [openib-general] Re: Question on the best approach to debug an infiniband connection problem
In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175C27@taurus.voltaire.com>
References: <5CE025EE7D88BA4599A2C8FEFCF226F5175C27@taurus.voltaire.com>
Message-ID: <20050828063439.GT22342@mellanox.co.il>

Quoting r. Hal Rosenstock :
> I think you are referring to a Set of PortInfo which causes an event to
> IPoIB.

Yes.

> > There seems to be some bug related to local MAD handling:
> >
> > if opensm is running on node A, and opensm is restarted,
> > all nodes will re-register in the multicast group with opensm,
> > except for the node A itself which has to be downed and upped
> manually.
>
> I will look into this but have a few things ahead of this right now.

--
MST

From bchang at atipa.com Sun Aug 28 00:30:29 2005
From: bchang at atipa.com (Brady Chang)
Date: Sun, 28 Aug 2005 02:30:29 -0500
Subject: [openib-general] out of MTT entries
Message-ID: <0D6FBA307D01EA42BAC8715725643AA0157A9A@EXCHG2003.microtech-ks.com>

Hello,

I was wondering if anybody can advise on the possible root cause for the MTT
errors I see in dmesg.

Thanks for your help in advance.
THH(1): XHH_mrwm_register_mr: rc=HH_EAGAIN VIPKL(1): [MM_create_mr]:MM_mr_get_keys failed THH(4): borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[263]: reserve_mtt_segs: Cannot reserve 255 MTT segments (16 dynamic MTT segments left) THH(4): borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[1093]: alloc_reg_pages: Out of MTT entries THH(1): XHH_mrwm_register_mr: rc=HH_EAGAIN VIPKL(1): [MM_create_mr]:MM_mr_get_keys failed THH(4): borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[263]: reserve_mtt_segs: Cannot reserve 255 MTT segments (16 dynamic MTT segments left) THH(4): borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[1093]: alloc_reg_pages: Out of MTT entries THH(1): XHH_mrwm_register_mr: rc=HH_EAGAIN VIPKL(1): [MM_create_mr]:MM_mr_get_keys failed THH(4): borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[263]: reserve_mtt_segs: Cannot reserve 255 MTT segments (16 dynamic MTT segments left) THH(4): borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[1093]: alloc_reg_pages: Out of MTT entries THH(1): XHH_mrwm_register_mr: rc=HH_EAGAIN VIPKL(1): [MM_create_mr]:MM_mr_get_keys failed THH(4): borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[263]: reserve_mtt_segs: Cannot reserve 255 MTT segments (16 dynamic MTT segments left) THH(4): borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[1093]: alloc_reg_pages: Out of MTT entries THH(1): XHH_mrwm_register_mr: rc=HH_EAGAIN VIPKL(1): [MM_create_mr]:MM_mr_get_keys failed THH(4): 
borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[263]: reserve_mtt_segs: Cannot reserve 255 MTT segments (16 dynamic MTT segments left)
THH(4): borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[1093]: alloc_reg_pages: Out of MTT entries
THH(1): XHH_mrwm_register_mr: rc=HH_EAGAIN
VIPKL(1): [MM_create_mr]:MM_mr_get_keys failed
THH(4): borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[263]: reserve_mtt_segs: Cannot reserve 255 MTT segments (16 dynamic MTT segments left)
THH(4): borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[1093]: alloc_reg_pages: Out of MTT entries
THH(1): XHH_mrwm_register_mr: rc=HH_EAGAIN
VIPKL(1): [MM_create_mr]:MM_mr_get_keys failed
THH(4): borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[263]: reserve_mtt_segs: Cannot reserve 1023 MTT segments (274 dynamic MTT segments left)
THH(4): borg/releng/linux.310/third_party/thca4_linux/kernel/mlxhh/thh/obj_host_amd64_smp_2_6_9_11_EL_rhel/mod_thh/tptm.c[1093]: alloc_reg_pages: Out of MTT entries
THH(1): XHH_mrwm_register_mr: rc=HH_EAGAIN
VIPKL(1): [MM_create_mr]:MM_mr_get_keys failed

From mst at mellanox.co.il Sun Aug 28 00:58:52 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 28 Aug 2005 10:58:52 +0300
Subject: [openib-general] Re: RDMA connection and address translation API
In-Reply-To: <52y86qt2pm.fsf@cisco.com>
References: <35EA21F54A45CB47B879F21A91F4862F714202@taurus.voltaire.com> <52ll2qvrr1.fsf@cisco.com> <20050825084809.GD22342@mellanox.co.il> <52y86qt2pm.fsf@cisco.com>
Message-ID: <20050828075851.GV22342@mellanox.co.il>

Quoting r.
Roland Dreier :
> Subject: Re: RDMA connection and address translation API
>
> Michael> Wouldn't it be better to use some bits in the service ID
> Michael> field for this?
>
> This would also be OK. But Annex 3 of the IBA spec has already
> defined the service ID field without any reserved bits we can use.

What about using an Externally Administrated Service ID?
OpenIB gets Service ID = 0x1H00 1405 XXXX XXXX where H is any digit.

> For example, if the first byte is 0x01, then the IETF is allowed to
> use any value they want for the rest of the service ID.

Would not an IETF specified service already use some specific format
for the CM private field?

> So if we want to keep backwards compatibility with the spec, this
> approach might be difficult.

I'm not sure I understand. Do you envision using the new CM REQ message
to connect to an existing application?

> Anyway, what's the disadvantage of using a reserved bit or two from
> the CM REQ?

I agree that using reserved bits in the CM REQ will work. I simply noted
that Service ID is de facto used to specify the private data format.

For example, if you put anything other than the SDP HH message in the private
data, and attempt to connect to Service ID = 0x0000 0000 0001 XXXX, then an
IB spec 1.2 compatible SDP implementation will respond and will interpret the
private data wrongly.

--
MST

From mst at mellanox.co.il Sun Aug 28 03:06:50 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 28 Aug 2005 13:06:50 +0300
Subject: [openib-general] Re: [PATCH] sdp: split sdp_inet_send to subroutines
In-Reply-To: <20050822141701.GZ1856@mellanox.co.il>
References: <20050822141701.GZ1856@mellanox.co.il>
Message-ID: <20050828100650.GF22342@mellanox.co.il>

Applied.
Committed revision 3211.

> Quoting Michael S. Tsirkin :
> Subject: [PATCH] sdp: split sdp_inet_send to subroutines
>
> The following is not yet applied. Opinions on this?
>
> ---
>
> Split the sdp_inet_send to smaller subroutines.
>
> Signed-off-by: Michael S.
Tsirkin Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_send.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_send.c +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_send.c @@ -1907,6 +1907,136 @@ done: return result; } +static inline int sdp_send_while_space(struct sock *sk, struct sdp_sock *conn, + struct msghdr *msg, int oob, + size_t size, size_t *copied) +{ + struct sdpc_buff *buff; + int result = 0; + int copy; + /* + * send while there is room... (thresholds should be + * observed...) use a different threshold for urgent + * data to allow some space for sending. + */ + while (sdp_inet_write_space(conn, oob) > 0) { + buff = sdp_send_data_buff_get(conn); + if (!buff) { + result = -ENOMEM; + goto done; + } + + copy = min((size_t)(buff->end - buff->tail), size - *copied); + copy = min(copy, sdp_inet_write_space(conn, oob)); + +#ifndef _SDP_DATA_PATH_NULL + result = memcpy_fromiovec(buff->tail, msg->msg_iov, copy); + if (result < 0) { + sdp_buff_pool_put(buff); + goto done; + } +#endif + buff->tail += copy; + *copied += copy; + + SDP_CONN_STAT_SEND_INC(conn, copy); + + result = sdp_send_data_buff_put(conn, buff, copy, + (*copied == size ? oob : 0)); + if (result < 0) + goto done; + + if (*copied == size) + goto done; + } + /* + * set no space bits since this code path is taken + * when there is no write space. 
+ */ + set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); + set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); + +done: + return result; +} + +/* Returns new timeout value */ +static inline long sdp_wait_till_space(struct sock *sk, struct sdp_sock *conn, + int oob, long timeout) +{ + DECLARE_WAITQUEUE(wait, current); + + add_wait_queue(sk->sk_sleep, &wait); + set_current_state(TASK_INTERRUPTIBLE); + /* + * ASYNC_NOSPACE is only set if we're not sleeping, + * while NOSPACE is set whenever there is no space, + * and is only cleared once space opens up, in + * DevConnAck() + */ + clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); + + sdp_conn_unlock(conn); + if (sdp_inet_write_space(conn, oob) <= 0) + timeout = schedule_timeout(timeout); + sdp_conn_lock(conn); + + remove_wait_queue(sk->sk_sleep, &wait); + set_current_state(TASK_RUNNING); + return timeout; +} + +static inline int sdp_queue_iocb(struct kiocb *req, struct sdp_sock *conn, + struct msghdr *msg, size_t size, + size_t *copied) +{ + struct sdpc_iocb *iocb; + int result; + /* + * create IOCB with remaining space + */ + iocb = sdp_iocb_create(); + if (!iocb) { + sdp_dbg_warn(conn, "Failed to allocate IOCB <%Zu:%ld>", + size, (long)*copied); + return -ENOMEM; + } + + iocb->len = size - *copied; + iocb->post = *copied; + iocb->size = size; + iocb->req = req; + iocb->key = req->ki_key; + iocb->addr = (unsigned long)msg->msg_iov->iov_base - *copied; + + req->ki_cancel = sdp_inet_write_cancel; + + result = sdp_iocb_lock(iocb); + if (result < 0) { + sdp_dbg_warn(conn, "Error <%d> locking IOCB <%Zu:%ld>", + result, size, (long)copied); + + sdp_iocb_destroy(iocb); + return result; + } + + SDP_CONN_STAT_WQ_INC(conn, iocb->size); + + conn->send_pipe += iocb->len; + + result = sdp_send_data_queue(conn, (struct sdpc_desc *)iocb); + if (result < 0) { + sdp_dbg_warn(conn, "Error <%d> queueing write IOCB", + result); + + sdp_iocb_destroy(iocb); + return result; + } + + *copied = 0; /* copied amount was saved in IOCB. 
*/ + return -EIOCBQUEUED; +} + /* * sdp_inet_send - send data from user space to the network */ @@ -1915,12 +2045,9 @@ int sdp_inet_send(struct kiocb *req, str { struct sock *sk; struct sdp_sock *conn; - struct sdpc_buff *buff; - struct sdpc_iocb *iocb; int result = 0; - int copied = 0; - int copy; - int oob; + size_t copied = 0; + int oob, zcopy; long timeout = -1; /* @@ -1954,75 +2081,35 @@ int sdp_inet_send(struct kiocb *req, str * they are smaller then the zopy threshold, but only if there is * no buffer write space. */ - if (!(conn->src_zthresh > size) && !is_sync_kiocb(req)) - goto skip; + zcopy = (size >= conn->src_zthresh && !is_sync_kiocb(req)); + /* * clear ASYN space bit, it'll be reset if there is no space. */ - clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); + if (!zcopy) + clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); /* * process data first if window is open, next check conditions, then * wait if there is more work to be done. The absolute window size is * used to 'block' the caller if the connection is still connecting. */ while (!result && copied < size) { - /* - * send while there is room... (thresholds should be - * observed...) use a different threshold for urgent - * data to allow some space for sending. - */ - while (sdp_inet_write_space(conn, oob) > 0) { - buff = sdp_send_data_buff_get(conn); - if (!buff) { - result = -ENOMEM; - goto done; - } - - copy = min((size_t)(buff->end - buff->tail), - (size_t)(size - copied)); - copy = min(copy, sdp_inet_write_space(conn, oob)); - -#ifndef _SDP_DATA_PATH_NULL - result = memcpy_fromiovec(buff->tail, - msg->msg_iov, - copy); - if (result < 0) { - sdp_buff_pool_put(buff); - goto done; - } -#endif - buff->tail += copy; - copied += copy; - - SDP_CONN_STAT_SEND_INC(conn, copy); - - result = sdp_send_data_buff_put(conn, buff, copy, - ((copied == - size) ? 
oob : 0)); - if (result < 0) - goto done; - - if (copied == size) - goto done; + if (!zcopy) { + result = sdp_send_while_space(sk, conn, msg, oob, size, + &copied); + if (result < 0 || copied == size) + break; } - /* - * set no space bits since this code path is taken - * when there is no write space. - */ - set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); - set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); - /* - * check status. - */ -skip: /* entry point for IOCB based transfers. Before processing IOCB, - check that the connection is OK, otherwise return error - synchronously. */ + + /* entry point for IOCB based transfers. Before processing IOCB, + check that the connection is OK, otherwise return error + synchronously. */ /* * onetime setup of timeout, but only if it's needed. */ if (timeout < 0) timeout = sock_sndtimeo(sk, - MSG_DONTWAIT & msg->msg_flags); + msg->msg_flags & MSG_DONTWAIT); if (sk->sk_err) { result = (copied > 0) ? 0 : sock_error(sk); @@ -2051,77 +2138,14 @@ skip: /* entry point for IOCB based tran break; } /* - * Either wait or create and queue an IOCB for defered + * Either wait or create and queue an IOCB for deferred * completion. Wait on sync IO call create IOCB for async * call. 
*/ - if (is_sync_kiocb(req)) { - DECLARE_WAITQUEUE(wait, current); - - add_wait_queue(sk->sk_sleep, &wait); - set_current_state(TASK_INTERRUPTIBLE); - /* - * ASYNC_NOSPACE is only set if we're not sleeping, - * while NOSPACE is set whenever there is no space, - * and is only cleared once space opens up, in - * DevConnAck() - */ - clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); - - sdp_conn_unlock(conn); - if (sdp_inet_write_space(conn, oob) <= 0) - timeout = schedule_timeout(timeout); - sdp_conn_lock(conn); - - remove_wait_queue(sk->sk_sleep, &wait); - set_current_state(TASK_RUNNING); - - continue; - } - /* - * create IOCB with remaining space - */ - iocb = sdp_iocb_create(); - if (!iocb) { - sdp_dbg_warn(conn, "Failed to allocate IOCB <%Zu:%d>", - size, copied); - result = -ENOMEM; - break; - } - - iocb->len = size - copied; - iocb->post = copied; - iocb->size = size; - iocb->req = req; - iocb->key = req->ki_key; - iocb->addr = (unsigned long)msg->msg_iov->iov_base - copied; - - req->ki_cancel = sdp_inet_write_cancel; - - result = sdp_iocb_lock(iocb); - if (result < 0) { - sdp_dbg_warn(conn, "Error <%d> locking IOCB <%Zu:%d>", - result, size, copied); - - sdp_iocb_destroy(iocb); - break; - } - - SDP_CONN_STAT_WQ_INC(conn, iocb->size); - - conn->send_pipe += iocb->len; - - result = sdp_send_data_queue(conn, (struct sdpc_desc *)iocb); - if (result < 0) { - sdp_dbg_warn(conn, "Error <%d> queueing write IOCB", - result); - - sdp_iocb_destroy(iocb); - break; - } - - copied = 0; /* copied amount was saved in IOCB. 
*/ - result = -EIOCBQUEUED; + if (is_sync_kiocb(req)) + timeout = sdp_wait_till_space(sk, conn, oob, timeout); + else + result = sdp_queue_iocb(req, conn, msg, size, &copied); } done: -- MST -- MST From eitan at mellanox.co.il Sun Aug 28 07:43:00 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 28 Aug 2005 17:43:00 +0300 Subject: [openib-general] Re: [PATCH] osm: osm_vendor_umad osm_vendor_get_all_port_attr bug In-Reply-To: <1124995971.4421.852.camel@hal.voltaire.com> References: <86fyt3219m.fsf@mtl066.yok.mtl.com> <1124995971.4421.852.camel@hal.voltaire.com> Message-ID: <4311CD74.8060006@mellanox.co.il> Hi Hal, I agree that index 0 of the guid, lid and the new linkstate arrays should be reserved for the default port. In the loop, the index j is used to iterate over all ports 0 .. N of the HCAs. It is clear that for HCAs port 0 will be skipped. However, since the current code does not advance the lid and linkstate pointers when a port is skipped, the slot for port 0 is not kept empty.
Current code: for (j = 0; j <= ca.numports; j++) { if (ca.ports[j]) { *p_lid = ca.ports[j]->base_lid; *p_linkstates = ca.ports[j]->state; p_lid++; p_linkstates++; } } Should be: for (j = 0; j <= ca.numports; j++) { if (ca.ports[j]) { *p_lid = ca.ports[j]->base_lid; *p_linkstates = ca.ports[j]->state; } /* as j advance even if the port is not valid, so should the lid and state pointer */ p_lid++; p_linkstates++; } As I could not convince you with the above explanations in my previous mail I have written the following simple program to test the pre-and post patch effect: /* test program for dumping osm_vendor_get_all_port_attr results */ #include "stdio.h" #include #include #include #include #include #define GUID_ARRAY_SIZE 64 int main() { osm_vendor_t vendor; osm_log_t osm_log; ib_api_status_t status; uint32_t num_ports = GUID_ARRAY_SIZE; ib_port_attr_t attr_array[GUID_ARRAY_SIZE]; int i; osm_log_construct(&osm_log); osm_log_init(&osm_log, TRUE, 0xff, "/tmp/test_vendor.log"); osm_vendor_init(&vendor, &osm_log, 1000); status = osm_vendor_get_all_port_attr(&vendor, attr_array, &num_ports ); if ( status != IB_SUCCESS ) { printf( "\nError from osm_vendor_get_all_port_attr (%x)\n", status); return; } printf("\nListing GUIDs:\n"); for (i = 0; i < num_ports; i++) { printf("Port %i:0x%"PRIx64" lid:0x%04x state:%x\n", i, cl_hton64(attr_array[i].port_guid), cl_ntoh16(attr_array[i].lid), attr_array[i].link_state ); } exit(0); } Without the above change I get: Listing GUIDs: Port 0:0xd9dffffff3d55 lid:0x0300 state:4 Port 1:0xd9dffffff3d55 lid:0x0400 state:4 Port 2:0xd9dffffff3d56 lid:0x0000 state:0 After the simple change I get: Listing GUIDs: Port 0:0xd9dffffff3d55 lid:0x0300 state:4 Port 1:0xd9dffffff3d55 lid:0x0300 state:4 Port 2:0xd9dffffff3d56 lid:0x0400 state:4 So as you can see - without the fix the lid of port 2 is presented as the lid of port 1... I guess you use ibstatus in your mail. Well ibstatus uses its own code so it shows the correct info anyway. 
In my case that is: swlab223:/tmp/bld/libvendor>ibstatus Infiniband device 'mthca0' port 1 status: default gid: fe80:0000:0000:0000:000d:9dff:ffff:3d55 base lid: 0x3 sm lid: 0x1 state: 4: ACTIVE phys state: 5: LinkUp rate: 10 Gb/sec (4X) Infiniband device 'mthca0' port 2 status: default gid: fe80:0000:0000:0000:000d:9dff:ffff:3d56 base lid: 0x4 sm lid: 0x1 state: 4: ACTIVE phys state: 5: LinkUp rate: 10 Gb/sec (4X) Hal Rosenstock wrote: > Hi Eitan, > > On Sun, 2005-08-21 at 03:32, Eitan Zahavi wrote: > >>osm_vendor_get_all_port_attr returns incorrect LID and state for >>device ports. This bug was caused by the fact that if a device port >>was skipped due to that fact it does not exist (HCA port 0). The >>lid and state pointers used as indexes into their corresponding >>return value arrays were not advancing to the next port index. >> >>So the return for a single HCA was mixing LID and state for the first >>port and displayed non initialized memory for the second port. > > > The array is not filled in as you claim. Port 0 does not take a slot on > an HCA. 
This looks fine to me as is (I added some print statements in > that loop as follows): > > osm_vendor_get_all_port_attr: port 0 > osm_vendor_get_all_port_attr: port 1 > osm_vendor_get_all_port_attr: port 1 lid 1 state 4 > osm_vendor_get_all_port_attr: port 2 > osm_vendor_get_all_port_attr: port 2 lid 0 state 1 > > Port 0 is skipped; port 1 is LID 1 and active; port 2 is not plugged in > and is down: > > Port 1: > State: Active > Physical state: LinkUp > Rate: 2 > Base lid: 1 > LMC: 0 > SM lid: 1 > Capability mask: 0x00500a68 > Port GUID: 0x0008f10403960559 > Port 2: > State: Down > Physical state: Polling > Rate: 2 > Base lid: 0 > LMC: 0 > SM lid: 0 > Capability mask: 0x00500a68 > Port GUID: 0x0008f1040396055a > > -- Hal From panda at cse.ohio-state.edu Sun Aug 28 08:15:51 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Sun, 28 Aug 2005 11:15:51 -0400 (EDT) Subject: [openib-general] mpi drop in openib tree In-Reply-To: <20050828042000.GU1685@kalmia.hozed.org> from "Troy Benjegerdes" at Aug 27, 2005 11:20:00 PM Message-ID: <200508281515.j7SFFpXN024248@xi.cse.ohio-state.edu> Troy, Thanks for your comments. We are taking a look at the directories and see which files can be removed. I will also be checking on this with the Argonne folks. We should be finalzing on these issues in the next few days. Thanks, DK > What is the plan for merging with future MPICH updates? I've dealt with > keeping full copies of the linux kernel in various version control > systems, and merging with upstream always turns into a huge headache. > Of course, if there aren't going to be any more MPICH-1 updates, then > this is really no big deal. > > My first suggestion is to remove the java .jar and configure files. > There is also probably a lot of other stuff that really doesn't need to > be there as well. Do we really need to keep windows specific code in the > (currently linux-only) OpenIB project svn tree? 
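Returning to the osm_vendor_get_all_port_attr indexing question raised earlier in this digest, the following is a minimal, self-contained sketch of the behaviour Eitan's proposed change aims for: the output pointers advance on every iteration, so slot j always describes port j even when a port (such as port 0 on an HCA) is absent. It is not the OpenSM vendor code. The struct, the function name and the parameters are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one CA port entry. A NULL pointer in the
 * ports array models a port that does not exist (e.g. HCA port 0). */
struct demo_port {
	uint16_t base_lid;
	uint8_t state;
};

/*
 * Copy LID and link state for ports 0..numports into per-port arrays.
 * The output pointers advance on every iteration, whether or not the
 * port exists, so that output slot j always corresponds to port j.
 * Slots of missing ports are left untouched.
 * Returns the number of slots walked (numports + 1).
 */
static size_t demo_fill_port_slots(struct demo_port *const *ports,
				   size_t numports,
				   uint16_t *p_lid, uint8_t *p_state)
{
	size_t j;

	assert(ports != NULL && p_lid != NULL && p_state != NULL);

	for (j = 0; j <= numports; j++) {
		if (ports[j]) {
			*p_lid = ports[j]->base_lid;
			*p_state = ports[j]->state;
		}
		/* advance even for a missing port: slot index == port index */
		p_lid++;
		p_state++;
	}
	return j;
}
```

If the two increments are moved back inside the if block, the values for the first existing port land in slot 0 and every later port shifts down by one slot, which is the mismatch described in the thread.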
> From halr at voltaire.com Sun Aug 28 18:02:09 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Aug 2005 21:02:09 -0400 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: References: Message-ID: <1125277185.4530.762.camel@hal.voltaire.com> Hi Sean, On Fri, 2005-08-26 at 13:54, Sean Hefty wrote: > >The 220 byte payload length is for SA. That's mostly right but assumes > >the last segment will be full (and accounted for by the paylen in the > >last segment). > > I believe that the 220 byte payload length is for all RMPP MADs. Yes, you're right. > Only the > common and RMPP header lengths are ignored. > > > >Doesn't it need to account for a "partial" rather than full last segment > >transferred data in the first segment length ? > > What I couldn't easily tell from the spec is whether a partial last segment is > included in the initial payload length or not. I read it as: "PayloadLength > counts all the bytes in the TransferredData field of the DATA packet format." > In my interpretation, partial data is indicated by the PayloadLength field in > the last segment only. It's quite possible that my interpretation is incorrect, > in which case the calculation in the RMPP code is off. I'm pretty sure that the intent is that the length in the first segment reflects the valid data (plus the header to be counted) so the last segment doesn't count as a full length (220) unless it is full. Patch for this shortly. -- Hal From halr at voltaire.com Sun Aug 28 18:05:24 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Aug 2005 21:05:24 -0400 Subject: [openib-general] [PATCH] RMPP: Fix length in first segment of multipacket sends Message-ID: <1125277188.4530.764.camel@hal.voltaire.com> RMPP: Fix length in first segment of multipacket sends (This is a compliance issue but does not affect at least OpenIB to OpenIB RMPP transfers). 
Signed-off-by: Hal Rosenstock Index: mad_rmpp.c =================================================================== --- mad_rmpp.c (revision 3197) +++ mad_rmpp.c (working copy) @@ -593,7 +593,8 @@ rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(mad_send_wr->total_seg * (sizeof(struct ib_rmpp_mad) - - offsetof(struct ib_rmpp_mad, data))); + offsetof(struct ib_rmpp_mad, data)) - + mad_send_wr->pad); mad_send_wr->sg_list[0].length = sizeof(struct ib_rmpp_mad); } else { mad_send_wr->send_wr.num_sge = 2; From halr at voltaire.com Sun Aug 28 18:08:39 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Aug 2005 21:08:39 -0400 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <43108369.1050201@mellanox.co.il> References: <43108369.1050201@mellanox.co.il> Message-ID: <1125277196.4530.770.camel@hal.voltaire.com> Hi Eitan, On Sat, 2005-08-27 at 11:14, Eitan Zahavi wrote: > Sean Hefty wrote: > > > > > I believe that the 220 byte payload length is for all RMPP MADs. Only the > > common and RMPP header lengths are ignored. > Yes. > > > > > > > >>Doesn't it need to account for a "partial" rather than full last segment > >>transferred data in the first segment length ? > Yes I think it needs to use the partial length. Agreed. > > What I couldn't easily tell from the spec is whether a partial last segment is > > included in the initial payload length or not. I read it as: "PayloadLength > > counts all the bytes in the TransferredData field of the DATA packet format." > > In my interpretation, partial data is indicated by the PayloadLength field in > > the last segment only. It's quite possible that my interpretation is incorrect, > > in which case the calculation in the RMPP code is off. > I agree the text might be missing an example or two for clarification. > Anyway, we probably can use the IB Analyzer as the ultimate > interpretation test. 
Note that there are IB implementations that uses > the first segment payload length as the source of packet length and > count on it to represent the correct DATA length. > > We can take your interpretation to discussion in the IBTA MGTWG for > further discussion. I think the spec wording is ambiguous and we should take it to the MgtWG. I believe your interpretation is the intent but could not find any specific language other than the valid bytes in terms of the last segment. The first segment length references transferred data which is the whole segment. I'll send something to MgtWG on this and copy openib-general. > Is the effort for fixing it big? It's a one line patch. I sent it previously. -- Hal From halr at voltaire.com Sun Aug 28 18:11:52 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Aug 2005 21:11:52 -0400 Subject: [openib-general] Re: RMPP Message Format Errors (Short Term Plan) In-Reply-To: <43107FE3.6020800@mellanox.co.il> References: <1125074400.4530.187.camel@hal.voltaire.com> <43107FE3.6020800@mellanox.co.il> Message-ID: <1125277267.4530.775.camel@hal.voltaire.com> Hi Eitan, On Sat, 2005-08-27 at 10:59, Eitan Zahavi wrote: > Thanks for taking care of the RMPP issues. > Once you think both sender and receiver side issues are resolved please > let us know so I can re-run the test with the IB Analyzer. I think they are now resolved with the one line patch (which should help the analyzer which I don't think impacts the end nodes). -- Hal From mst at mellanox.co.il Mon Aug 29 00:06:46 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Aug 2005 10:06:46 +0300 Subject: [openib-general] [PATCH applied] sdp: endian-ness fix for src avail cancellation Message-ID: <20050829070646.GL22342@mellanox.co.il> The following was applied in rev 3218. Longer term we clearly want to kill all occurrences of sdp_msg_cpu_to_net_bsdh and deal with packets properly as they arrive - in network order. 
--- When SrcAvailCancel is sent, the sequence number must be stored in the connection context for matching with SendSm responses. This was currently taken from bsdh_hdr which has wrong (net) endian-ness. Fix by taking the number from the connection context. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_send.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_send.c +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_send.c @@ -161,7 +161,7 @@ static int sdp_send_buff_post(struct sdp */ if (conn->flags & SDP_CONN_F_SRC_CANCEL_L) conn->src_cseq = ((buff->bsdh_hdr->mid == SDP_MID_SRC_CANCEL) ? - buff->bsdh_hdr->seq_num : conn->src_cseq); + conn->send_seq : conn->src_cseq); return 0; done: -- MST From mst at mellanox.co.il Mon Aug 29 00:23:16 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Aug 2005 10:23:16 +0300 Subject: [openib-general] Re: [PATCH] RMPP: Fix length in first segment of multipacket sends In-Reply-To: <1125277188.4530.764.camel@hal.voltaire.com> References: <1125277188.4530.764.camel@hal.voltaire.com> Message-ID: <20050829072316.GN22342@mellanox.co.il> Quoting Hal Rosenstock : > Index: mad_rmpp.c > =================================================================== > --- mad_rmpp.c (revision 3197) > +++ mad_rmpp.c (working copy) > @@ -593,7 +593,8 @@ Hal, could you diff with -p in the future please? This makes the function name visible in the patch, making it possible to understand what is being changed without applying it. > rmpp_mad->rmpp_hdr.paylen_newwin = > cpu_to_be32(mad_send_wr->total_seg * > (sizeof(struct ib_rmpp_mad) - > - offsetof(struct ib_rmpp_mad, data))); > + offsetof(struct ib_rmpp_mad, data)) - > + mad_send_wr->pad); BTW, I just noticed that whitespace was (and remains) broken in these lines: indentation is done by spaces. 
Thanks, -- MST From danb at voltaire.com Mon Aug 29 01:51:11 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Mon, 29 Aug 2005 11:51:11 +0300 Subject: [openib-general] RE: [iSER]How to send the login request PDU? Message-ID: As it is known to all, the iSCSI layer at the Initiator should send a Login Request PDU to the Target after allocating the connection resources. My question is how to send this PDU. Could the Send_Control Primitive Operation be used? But it is not in the iSER-assisted mode at present. Or else, should the dapl API be used directly? Yes, you should use send_control to send the login PDU. With this iSER implementation you can do it before allocating connection resources. There is a similar problem at the Target when sending the Login Response PDU. Same answer. Hope my description is clear. Any suggestion is appreciated. Ian Jiang ianjiang91 at hotmail.com ---- Computer Architecture Laboratory Institute of Computing Technology Chinese Academy of Sciences Beijing,P.R.China Zip code: 100080 Tel: +86-10-62564394(office) From guyg at voltaire.com Mon Aug 29 05:09:29 2005 From: guyg at voltaire.com (Guy German) Date: Mon, 29 Aug 2005 15:09:29 +0300 Subject: [openib-general] RDMA connection and address translation API In-Reply-To: References: Message-ID: <1125317369.6584.52.camel@r2d2> Sean wrote: > >>It looks like this would work. If a client wanted to create multiple > >>connections to the same remote service (for example, to separate control and > >>data), then it seems more efficient to move the asynchronous at outside of the > >>connect call. > >>- Sean > > Thats a good point. What I had in mind was mainly simplicity for the > consumer - save him dealing with another upcall.
> > Maybe caching in at module would make things better, but I agree > that for multiple connections to the same remote service, the > asynchronous at approach seems more appropriate. OTOH, after thinking about it some more, there might be problems in letting each and every consumer do his own caching. The at.c has a (not yet implemented) mechanism with invalidate for caching tables. Do we really want to let the consumer handle all the cases of routing tables changing on the fly etc., or centralize it in one place (i.e. at.c)? What do you think, Sean? Guy From halr at voltaire.com Mon Aug 29 06:14:46 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2005 09:14:46 -0400 Subject: [openib-general] Payload Length in first RMPP sent segment Message-ID: <1125321286.4530.2976.camel@hal.voltaire.com> Hi, On the RMPP send side, while it is clear that the Payload Length field in the last segment indicates the number of valid bytes in Transferred Data, there seems to be some ambiguity in the optional Payload Length field in the first segment. I think it can work either way but I also think the intent was to reflect the valid bytes. Maybe it is this way to allow flexibility (choice in the implementation). What is the correct interpretation? Should I enter a comment on this? Thanks. -- Hal IBA 1.2 p.775 line 37 In the first packet of an RMPP transfer (RMPPFlags.First=1), PayloadLength may indicate the sum of the lengths, in bytes, of the TransferredData fields in all packets of the entire multipacket response; this is done by using a nonzero value for PayloadLength in the first packet. IBA 1.2 p. 776 line 8 In the last packet of an RMPP transfer (RMPPFlags.Last=1), PayloadLength indicates the number of valid bytes in the TransferredData field, allowing data transfers that are not an integral multiple of the length of the TransferredData field.
A transfer terminates when either: (a) a packet containing RMPPFlags.Last=1 is received; or (b) a nonzero PayloadLength was given in the first packet of a transfer, and a packet is received containing sufficient TransferredData bytes to equal or exceed the PayloadLength originally provided. If case (b) occurs and RMPPFlags.Last is not 1 for that packet, the Receiver sends an ABORT packet with RMPPStatus of "Inconsistent Last and PayloadLength" and terminates the transfer. From jlentini at netapp.com Mon Aug 29 06:41:47 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 29 Aug 2005 09:41:47 -0400 (EDT) Subject: [openib-general] [PATCH][iWARP] IW CM Verbs In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3BAF@mail2.ammasso.com> References: <8E9D028761D8264D910612167E8457E8FA3BAF@mail2.ammasso.com> Message-ID: On Fri, 26 Aug 2005, Tom Tucker wrote: > Index: ib_verbs.h > =================================================================== > --- ib_verbs.h (revision 3120) > +++ ib_verbs.h (working copy) > @@ -804,6 +806,7 @@ > struct ib_gid_cache **gid_cache; > }; > > +struct iw_cm; > struct ib_device { > struct device *dma_device; > > @@ -820,6 +823,8 @@ > > u32 flags; > > + struct iw_cm *iwcm; > + > int (*query_device)(struct ib_device *device, > struct ib_device_attr *device_attr); > int (*query_port)(struct ib_device *device, Why does the ib_device need a cm structure for iWARP but not IB? If you used either Guy or Roland's generic RDMA connection API and did the iWARP implementation, would you still need to add the iw_cm structure? 
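The receive-side termination rule quoted from IBA 1.2 in Hal's message above (cases (a) and (b), plus the ABORT condition) can be written down as a small decision function. This is only a sketch of the quoted text, not the OpenIB mad_rmpp.c logic. The enum, the function name and the parameters are hypothetical, and the caller is assumed to count received TransferredData bytes in the same units as the first-segment PayloadLength, which is exactly the point the thread leaves open.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical verdicts for one received RMPP DATA segment. */
enum demo_rmpp_verdict {
	DEMO_RMPP_CONTINUE,	/* keep receiving segments */
	DEMO_RMPP_DONE,		/* transfer terminated normally */
	DEMO_RMPP_ABORT		/* "Inconsistent Last and PayloadLength" */
};

/*
 * first_paylen:   PayloadLength from the first segment (0 if not given)
 * bytes_received: TransferredData bytes received so far, including
 *                 the segment being examined
 * last_flag:      RMPPFlags.Last of the segment being examined
 */
static enum demo_rmpp_verdict
demo_rmpp_check_segment(uint32_t first_paylen, uint32_t bytes_received,
			bool last_flag)
{
	/* case (a): a packet with RMPPFlags.Last=1 ends the transfer */
	if (last_flag)
		return DEMO_RMPP_DONE;

	/* case (b) reached without Last set: the receiver aborts */
	if (first_paylen != 0 && bytes_received >= first_paylen)
		return DEMO_RMPP_ABORT;

	return DEMO_RMPP_CONTINUE;
}
```

Under this reading, a sender that reports a too-small total in the first segment makes a strict receiver abort before the last segment arrives, while a sender that reports 0 leaves termination entirely to the Last flag.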
From halr at voltaire.com Mon Aug 29 06:58:41 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2005 09:58:41 -0400 Subject: [openib-general] Re: [PATCH] RMPP: Fix length in first segment of multipacket sends In-Reply-To: <20050829072316.GN22342@mellanox.co.il> References: <1125277188.4530.764.camel@hal.voltaire.com> <20050829072316.GN22342@mellanox.co.il> Message-ID: <1125323920.4530.3114.camel@hal.voltaire.com> Hi Michael, On Mon, 2005-08-29 at 03:23, Michael S. Tsirkin wrote: > Quoting Hal Rosenstock : > > Index: mad_rmpp.c > > =================================================================== > > --- mad_rmpp.c (revision 3197) > > +++ mad_rmpp.c (working copy) > > @@ -593,7 +593,8 @@ > > Hal, could you diff with -p in the future please? > This makes the function name visible in the patch, making it > possible to understand what is being changed without applying it. I'll try harder to remember to do this. > > rmpp_mad->rmpp_hdr.paylen_newwin = > > cpu_to_be32(mad_send_wr->total_seg * > > (sizeof(struct ib_rmpp_mad) - > > - offsetof(struct ib_rmpp_mad, data))); > > + offsetof(struct ib_rmpp_mad, data)) - > > + mad_send_wr->pad); > > BTW, I just noticed that whitespace was (and remains) broken in these lines: > indentation is done by spaces. The whitespace is preceeded by tabs and is to make the parameters line up. I thought that was allowable coding style. It has been used in many places in OpenIB code. -- Hal From jlentini at netapp.com Mon Aug 29 07:02:51 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 29 Aug 2005 10:02:51 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL added ibv_query support In-Reply-To: References: Message-ID: On Thu, 25 Aug 2005, Arlin Davis wrote: arlin> James, arlin> arlin> Support for ibv_query_port, device, and gid. arlin> arlin> Thanks, arlin> arlin> -arlin Committed in revision 3227. 
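The first-segment length computation touched by Hal's RMPP patch in this thread boils down to simple arithmetic: the number of segments times the per-segment data size, minus the padding in the last segment. The sketch below illustrates that arithmetic only. The 220-byte figure is the per-segment payload size discussed in the thread; the macro and function names are hypothetical, the bound on pad is an assumption, and the byte-order conversion done with cpu_to_be32 in the kernel code is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Per-segment data size discussed in the thread (220 bytes).
 * The macro name is hypothetical, not taken from ib_mad.h. */
#define DEMO_RMPP_DATA_SIZE 220u

/*
 * Payload length to report in the first segment of a multipacket send:
 * every segment counted as full, minus the unused bytes (pad) in the
 * last one. Assumes pad is smaller than one segment's data area.
 */
static uint32_t demo_first_seg_paylen(uint32_t total_seg, uint32_t pad)
{
	assert(total_seg > 0);
	assert(pad < DEMO_RMPP_DATA_SIZE);

	return total_seg * DEMO_RMPP_DATA_SIZE - pad;
}
```

For example, three segments with 100 bytes of padding in the last one give 3 * 220 - 100 = 560, whereas the pre-patch calculation would have reported 660.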
From halr at voltaire.com Mon Aug 29 07:03:02 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2005 10:03:02 -0400 Subject: [openib-general] Re: Question on the best approach to debug aninfiniband connectio n problem In-Reply-To: <20050828063439.GT22342@mellanox.co.il> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175C27@taurus.voltaire.com> <20050828063439.GT22342@mellanox.co.il> Message-ID: <1125324182.4530.3141.camel@hal.voltaire.com> Hi Michael, On Sun, 2005-08-28 at 02:34, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock : > > I think you are referring to a Set of PortInfo which causes an event to > > IPoIB. > > Yes. > > > > There seems to be some bug related to local MAD handling: Can you elaborate more on this ? -- Hal From halr at voltaire.com Mon Aug 29 07:10:38 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2005 10:10:38 -0400 Subject: [openib-general] [PATCH] RMPP: Fix payload length of middle RMPP sent segments Message-ID: <1125324637.4530.3176.camel@hal.voltaire.com> RMPP: Fix payload length of middle RMPP sent segments. Middle payload lengths should be 0 on the send side. (This is a compliance and should not be an interop issue as middle payload lengths are supposed to be ignored on receive). Signed-off-by: Hal Rosenstock Note also that diff -p did not show the routine name for this 1 line change. 
Index: mad_rmpp.c =================================================================== --- mad_rmpp.c (revision 3197) +++ mad_rmpp.c (working copy) @@ -602,6 +603,7 @@ mad_send_wr->sg_list[1].length = sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset; mad_send_wr->sg_list[1].lkey = mad_send_wr->sg_list[0].lkey; + rmpp_mad->rmpp_hdr.paylen_newwin = 0; } if (mad_send_wr->seg_num == mad_send_wr->total_seg) { From guyg at voltaire.com Mon Aug 29 06:59:07 2005 From: guyg at voltaire.com (Guy German) Date: Mon, 29 Aug 2005 16:59:07 +0300 Subject: [openib-general] RDMA Generic Connection Management Message-ID: <1125323947.6584.106.camel@r2d2> Hi, After receiving feedback from people here, I want to see if we can agree on a generic CM API, so we can start implementing it. I will try to summarize the 2 options, the way I understand them. If I am missing or misrepresenting something, please don't hesitate to correct me. Both suggestions include the following verbs (or semantic equivalents): ib_cma_get_device, ib_cma_create_qp, ib_cma_connect, ib_cma_disconnect, ib_cma_listen, ib_cma_destroy, ib_cma_accept, ib_cma_reject, ib_cma_get_src_ip. A connect flow will be something like: - ib_cma_get_device (...) /* get device(1) or device+path(2) */ - pd = ib_alloc_pd(...) /* pd allocated in the given device */ - qp = ib_cma_create_qp(...) /* qp returned in init state */ - ib_post_recv(qp, ...); - ib_cma_connect (qp, dst_addr(1)/path(2), ...); Now, there are 2 suggestions for the device discovery: 1. get_device returns device and port, according to the local routing tables, synchronously. ib_cma_connect calls the at module for address resolution (cache handled) before calling the cm_connect. 2. get_device registers an upcall and returns the data path to the consumer in that upcall. In this case caching is done by the consumer.
I prefer option 1, because it makes the consumer code simpler, without having to handle upcalls for address translations (which are not asynchronous in iWARP) or hold the transport's data information. Also it saves the consumer the trouble of caching routes to destinations. I would like to hear what other people in the list think of it ... Thanks, Guy From mst at mellanox.co.il Mon Aug 29 07:47:05 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Aug 2005 17:47:05 +0300 Subject: [openib-general] Re: [PATCH] RMPP: Fix length in first segment of multipacket sends In-Reply-To: <1125323920.4530.3114.camel@hal.voltaire.com> References: <1125277188.4530.764.camel@hal.voltaire.com> <20050829072316.GN22342@mellanox.co.il> <1125323920.4530.3114.camel@hal.voltaire.com> Message-ID: <20050829144705.GR22342@mellanox.co.il> Hal, first, sorry about nitpicking. Quoting Hal Rosenstock : > > > rmpp_mad->rmpp_hdr.paylen_newwin = > > > cpu_to_be32(mad_send_wr->total_seg * > > > (sizeof(struct ib_rmpp_mad) - > > > - offsetof(struct ib_rmpp_mad, data))); > > > + offsetof(struct ib_rmpp_mad, data)) - > > > + mad_send_wr->pad); > > > > BTW, I just noticed that whitespace was (and remains) broken in these lines: > > indentation is done by spaces. > > The whitespace is preceeded by tabs and is to make the parameters line > up. I thought that was allowable coding style. It has been used in many > places in OpenIB code. Sure, whitespace preceeded by tabs is OK. But, pls take a look at the original file, or the patch, I think you'll see that its not preceeded by tabs in this instance. Some more nitpicks: In this specific instance, its probably best to just add a temp variable and write w = mad_send_wr->total_seg * (sizeof(struct ib_rmpp_mad) - offsetof(struct ib_rmpp_mad, data)) - mad_send_wr->pad; rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(w) to avoid placing the descendant cpu_to_be32 left of the "=" sign. 
Documentation/CodingStyle says about this: > Descendants are always substantially shorter than the parent and are > placed substantially to the right. And by the way, wouldnt it be a good idea to replace hard-coded 220 and such in ib_mad.h by a symbolic constant, and then we'd just have w = mad_send_wr->total_seg * IB_RMMP_DATA_SIZE - mad_send_wr->pad; rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(w) What do you think? Thanks, MST -- MST From mst at mellanox.co.il Mon Aug 29 07:51:15 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Aug 2005 17:51:15 +0300 Subject: [openib-general] Re: Question on the best approach to debug aninfiniband connectio n problem In-Reply-To: <1125324182.4530.3141.camel@hal.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175C27@taurus.voltaire.com> <20050828063439.GT22342@mellanox.co.il> <1125324182.4530.3141.camel@hal.voltaire.com> Message-ID: <20050829145115.GS22342@mellanox.co.il> Quoting r. Hal Rosenstock : > > > > There seems to be some bug related to local MAD handling: > > Can you elaborate more on this ? I have a back to back setup. I sometimes start with all modules unloaded, load them back and bring up ipoib with a fixed ip address. I then start opensm on one node, and try to ping one from another. This does not work until I down and up ib0 on the node with opensm running. down and up on the other node does not help, which made me think local mad handling is the culprit. 
-- MST From halr at voltaire.com Mon Aug 29 07:48:29 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2005 10:48:29 -0400 Subject: [openib-general] [PATCH] OSM vendor layer: In umad_receiver, handle allocating RMPP large MADs from OSM MAD pool In-Reply-To: <20050826172302.GA14157@esmail.cup.hp.com> References: <1125074098.4530.179.camel@hal.voltaire.com> <20050826172302.GA14157@esmail.cup.hp.com> Message-ID: <1125326908.4530.3305.camel@hal.voltaire.com> Hi Grant, On Fri, 2005-08-26 at 13:23, Grant Grundler wrote: > On Fri, Aug 26, 2005 at 12:34:58PM -0400, Hal Rosenstock wrote: > > - MAD_BLOCK_SIZE, &osm_addr))) { > > + length > MAD_BLOCK_SIZE ? > > + length : MAD_BLOCK_SIZE, > > + &osm_addr))) { > > Can "max(length, MAD_BLOCK_SIZE)" be used instead? Yes; I made that change. There is a MAX macro in complib/cl_math.h. Thanks. -- Hal From jlentini at netapp.com Mon Aug 29 07:51:54 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 29 Aug 2005 10:51:54 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL common code fix for default attribute settings In-Reply-To: References: Message-ID: On Thu, 25 Aug 2005, Arlin Davis wrote: arlin> James, arlin> arlin> Please review this common code patch that fixes default arlin> settings so they don't exceed device maximums. arlin> arlin> Thanks, arlin> arlin> -arlin I've moved the check into dapli_ep_default_attrs() so all future callers will also benefit from this. 
Index: dapl/common/dapl_ep_util.c =================================================================== --- dapl/common/dapl_ep_util.c (revision 3231) +++ dapl/common/dapl_ep_util.c (working copy) @@ -260,7 +260,9 @@ void dapli_ep_default_attrs ( IN DAPL_EP *ep_ptr ) { + DAT_EP_ATTR ep_attr_limit; DAT_EP_ATTR *ep_attr; + DAT_RETURN dat_status; ep_attr = &ep_ptr->param.ep_attr; /* Set up defaults */ @@ -295,7 +297,36 @@ dapli_ep_default_attrs ( * - provider_specific_params: 0 */ - return; + dat_status = dapls_ib_query_hca (ep_ptr->header.owner_ia->hca_ptr, + NULL, &ep_attr_limit, NULL); + /* check against HCA maximums */ + if (dat_status == DAT_SUCCESS) + { + ep_ptr->param.ep_attr.max_mtu_size = + DAPL_MIN(ep_ptr->param.ep_attr.max_mtu_size, + ep_attr_limit.max_mtu_size); + ep_ptr->param.ep_attr.max_rdma_size = + DAPL_MIN(ep_ptr->param.ep_attr.max_rdma_size, + ep_attr_limit.max_rdma_size); + ep_ptr->param.ep_attr.max_recv_dtos = + DAPL_MIN(ep_ptr->param.ep_attr.max_recv_dtos, + ep_attr_limit.max_recv_dtos); + ep_ptr->param.ep_attr.max_request_dtos = + DAPL_MIN(ep_ptr->param.ep_attr.max_request_dtos, + ep_attr_limit.max_request_dtos); + ep_ptr->param.ep_attr.max_recv_iov = + DAPL_MIN(ep_ptr->param.ep_attr.max_recv_iov, + ep_attr_limit.max_recv_iov); + ep_ptr->param.ep_attr.max_request_iov = + DAPL_MIN(ep_ptr->param.ep_attr.max_request_iov, + ep_attr_limit.max_request_iov); + ep_ptr->param.ep_attr.max_rdma_read_in = + DAPL_MIN(ep_ptr->param.ep_attr.max_rdma_read_in, + ep_attr_limit.max_rdma_read_in); + ep_ptr->param.ep_attr.max_rdma_read_out = + DAPL_MIN(ep_ptr->param.ep_attr.max_rdma_read_out, + ep_attr_limit.max_rdma_read_out); + } } From mst at mellanox.co.il Mon Aug 29 08:00:42 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 29 Aug 2005 18:00:42 +0300 Subject: [openib-general] [PATCH] sdp: use linux/list.h in sdp_link.c Message-ID: <20050829150042.GT22342@mellanox.co.il> The following kills sdp_link.h and converts sdp_link.c to use linux/list.h Locking is still missing here. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_link.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_link.c +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_link.c @@ -35,17 +35,105 @@ #include "ipoib.h" #include "sdp_main.h" -#include "sdp_link.h" -static kmem_cache_t *wait_cache = NULL; -static kmem_cache_t *info_cache = NULL; +#define SDP_LINK_F_VALID 0x01 /* valid path info record. */ +#define SDP_LINK_F_ARP 0x02 /* arp request in progress. */ +#define SDP_LINK_F_PATH 0x04 /* arp request in progress. */ +/* + * wait for an ARP event to complete. + */ +struct sdp_path_info { + u32 src; /* source IP address. */ + u32 dst; /* destination IP address */ + int dif; /* bound device interface option */ + u32 gw; /* gateway IP address */ + int qid; /* path record query ID */ + u8 port; /* HCA port */ + u32 flags; /* record flags */ + int sa_time; /* path_rec request timeout */ + unsigned long arp_time; /* ARP request timeout */ + unsigned long use; /* last time accessed. */ + struct ib_device *ca; /* HCA device. */ + struct ib_sa_path_rec path; /* path record info */ + struct ib_sa_query *query; -static struct sdp_path_info *info_list = NULL; + struct work_struct timer; /* arp request timers. 
*/ + + struct list_head info_list; + + struct list_head wait_list; +}; + +struct sdp_path_wait { + u64 id; /* request identifier */ + void (*completion)(u64 id, + int status, + u32 dst_addr, + u32 src_addr, + u8 hw_port, + struct ib_device *ca, + struct ib_sa_path_rec *path, + void *arg); + void *arg; + int retry; + struct list_head list; +}; + +struct sdp_work { + struct work_struct work; + void *arg; +}; + +struct sdp_link_arp { + /* + * generic arp header + */ + u16 addr_type; /* format of hardware address */ + u16 proto_type; /* format of protocol address */ + u8 addr_len; /* length of hardware address */ + u8 proto_len; /* length of protocol address */ + u16 op; /* ARP opcode (command) */ + /* + * begin IB specific section + */ + u32 src_qpn; /* MSB = reserved, low 3 bytes=QPN */ + union ib_gid src_gid; + u32 src_ip; + + u32 dst_qpn; /* MSB = reserved, low 3 bytes=QPN */ + union ib_gid dst_gid; + u32 dst_ip; + +} __attribute__ ((packed)); /* sdp_link_arp */ + +#define SDP_LINK_SWEEP_INTERVAL (10 * (HZ)) /* frequency of sweep function */ +#define SDP_LINK_INFO_TIMEOUT (300UL * (HZ)) /* unused time */ +#define SDP_LINK_SA_RETRY (3) /* number of SA retry requests */ +#define SDP_LINK_ARP_RETRY (3) /* number of ARP retry requests */ + +#define SDP_LINK_SA_TIME_MIN (500) /* milliseconds. */ +#define SDP_LINK_SA_TIME_MAX (10000) /* milliseconds. */ +#define SDP_LINK_ARP_TIME_MIN (HZ) +#define SDP_LINK_ARP_TIME_MAX (32UL * (HZ)) + +#if 0 +#define SDP_IPOIB_RETRY_VALUE 3 /* number of retries. 
*/ +#define SDP_IPOIB_RETRY_INTERVAL (HZ * 1) /* retry frequency */ + +#define SDP_DEV_PATH_WAIT (5 * (HZ)) +#define SDP_PATH_TIMER_INTERVAL (15 * (HZ)) /* cache sweep frequency */ +#define SDP_PATH_REAPING_AGE (300 * (HZ)) /* idle time before reaping */ +#endif + +static kmem_cache_t *wait_cache; +static kmem_cache_t *info_cache; + +static LIST_HEAD(info_list); static struct workqueue_struct *link_wq; static struct work_struct link_timer; -static u64 path_lookup_id = 0; +static u64 path_lookup_id; #define _SDP_PATH_LOOKUP_ID() \ ((++path_lookup_id) ? path_lookup_id : ++path_lookup_id) @@ -95,42 +183,6 @@ static void sdp_link_path_complete(u64 i } /* - * sdp_path_wait_add - add a wait entry into the wait list for a path - */ -static void sdp_path_wait_add(struct sdp_path_info *info, - struct sdp_path_wait *wait) -{ - - wait->next = info->wait_list; - info->wait_list = wait; - wait->pext = &info->wait_list; - - if (wait->next) - wait->next->pext = &wait->next; -} - -/* - * sdp_path_wait_destroy - destroy an entry for a wait element - */ -static void sdp_path_wait_destroy(struct sdp_path_wait *wait) -{ - /* - * if it's in the list, pext will not be null - */ - if (wait->pext) { - if (wait->next) - wait->next->pext = wait->pext; - - *(wait->pext) = wait->next; - - wait->pext = NULL; - wait->next = NULL; - } - - kmem_cache_free(wait_cache, wait); -} - -/* * sdp_path_wait_complete - complete an entry for a wait element */ static void sdp_path_wait_complete(struct sdp_path_wait *wait, @@ -142,21 +194,8 @@ static void sdp_path_wait_complete(struc wait->completion, wait->arg); - sdp_path_wait_destroy(wait); -} - -/* - * sdp_path_info_lookup - lookup a path record entry - */ -static struct sdp_path_info *sdp_path_info_lookup(u32 dst_ip, int dev_if) -{ - struct sdp_path_info *info; - - for (info = info_list; info; info = info->next) - if (dst_ip == info->dst && dev_if == info->dif) - break; - - return info; + list_del(&wait->list); + kmem_cache_free(wait_cache, wait); } /* 
@@ -172,19 +211,15 @@ static struct sdp_path_info *sdp_path_in memset(info, 0, sizeof(struct sdp_path_info)); - info->next = info_list; - info_list = info; - info->pext = &info_list; - - if (info->next) - info->next->pext = &info->next; - info->dst = dst_ip; info->dif = dev_if; info->use = jiffies; info->sa_time = SDP_LINK_SA_TIME_MIN; info->arp_time = SDP_LINK_ARP_TIME_MIN; + + INIT_LIST_HEAD(&info->wait_list); + list_add(&info->info_list, &info_list); INIT_WORK(&info->timer, do_link_path_lookup, info); return info; @@ -195,21 +230,11 @@ static struct sdp_path_info *sdp_path_in */ static void sdp_path_info_destroy(struct sdp_path_info *info, int status) { - struct sdp_path_wait *wait; - /* - * if it's in the list, pext will not be null - */ - if (info->pext) { - if (info->next) - info->next->pext = info->pext; - - *(info->pext) = info->next; + struct sdp_path_wait *wait, *tmp; + /* TODO: replace by list_del once we have proper locking */ + list_del_init(&info->info_list); - info->pext = NULL; - info->next = NULL; - } - - while ((wait = info->wait_list)) + list_for_each_entry_safe(wait, tmp, &info->wait_list, list) sdp_path_wait_complete(wait, info, status); cancel_delayed_work(&info->timer); @@ -222,7 +247,7 @@ static void sdp_path_info_destroy(struct static void sdp_link_path_rec_done(int status, struct ib_sa_path_rec *resp, void *context) { - struct sdp_path_info *info = (struct sdp_path_info *)context; + struct sdp_path_info *info = context; struct sdp_path_wait *wait; struct sdp_path_wait *sweep; int result; @@ -241,23 +266,20 @@ static void sdp_link_path_rec_done(int s info->path = *resp; } - sweep = info->wait_list; - while (sweep) { - wait = sweep; - sweep = sweep->next; + list_for_each_entry_safe(wait, sweep, &info->wait_list, list) { /* * on timeout increment retries. 
*/ if (status == -ETIMEDOUT) wait->retry++; - if (!status || SDP_LINK_SA_RETRY < wait->retry) + if (!status || wait->retry > SDP_LINK_SA_RETRY) sdp_path_wait_complete(wait, info, status); } /* * retry if anyone is waiting. */ - if (info->wait_list) { + if (!list_empty(&info->wait_list)) { info->sa_time = min(info->sa_time * 2, SDP_LINK_SA_TIME_MAX); result = ib_sa_path_rec_get(info->ca, @@ -348,10 +370,11 @@ static void do_link_path_lookup(void *da } } }; + /* * path request in progress? */ - if (info->query) + if (info->flags & SDP_LINK_F_PATH) goto done; /* * route information present, but no path query. @@ -483,16 +506,11 @@ static void do_link_path_lookup(void *da struct sdp_path_wait *sweep; struct sdp_path_wait *wait; - sweep = info->wait_list; - while (sweep) { - wait = sweep; - sweep = sweep->next; - - if (SDP_LINK_SA_RETRY < wait->retry++) + list_for_each_entry_safe(wait, sweep, &info->wait_list, list) + if (wait->retry++ > SDP_LINK_ARP_RETRY) sdp_path_wait_complete(wait, info, -ETIMEDOUT); - } - if (!info->wait_list) { + if (list_empty(&info->wait_list)) { result = -ETIMEDOUT; goto error; } @@ -551,11 +569,12 @@ int sdp_link_path_lookup(u32 dst_addr, int result; *id = _SDP_PATH_LOOKUP_ID(); - /* - * lookup entry, create if not found and add to wait list. - */ - info = sdp_path_info_lookup(dst_addr, bound_dev_if); - if (!info) { + + list_for_each_entry(info, &info_list, info_list) + if (info->dst == dst_addr && info->dif == bound_dev_if) + break; + + if (&info->info_list == &info_list) { info = sdp_path_info_create(dst_addr, bound_dev_if); if (!info) { sdp_dbg_warn(NULL, "Failed to create path object"); @@ -586,7 +605,8 @@ int sdp_link_path_lookup(u32 dst_addr, wait->completion = completion; wait->arg = arg; - sdp_path_wait_add(info, wait); + list_add(&wait->list, &info->wait_list); + /* * initiate address lookup, if not in progress. 
*/ @@ -610,11 +630,7 @@ static void sdp_link_sweep(void *data) struct sdp_path_info *info; struct sdp_path_info *sweep; - sweep = info_list; - while (sweep) { - info = sweep; - sweep = sweep->next; - + list_for_each_entry_safe(info, sweep, &info_list, info_list) { if (jiffies > (info->use + SDP_LINK_INFO_TIMEOUT)) { sdp_dbg_ctrl(NULL, "info delete <%d.%d.%d.%d> <%lu:%lu>", @@ -657,11 +673,11 @@ static void sdp_link_arp_work(void *data /* * find a path info structure for the source IP address. */ - for (info = info_list; info; info = info->next) + list_for_each_entry(info, &info_list, info_list) if (info->dst == arp->src_ip) break; - if (!info) + if (&info->info_list == &info_list) goto done; /* * update record info, and request new path record data. @@ -788,14 +804,10 @@ error_path: void sdp_link_addr_cleanup(void) { struct sdp_path_info *info; + struct sdp_path_info *sweep; sdp_dbg_init("Link level services cleanup."); /* - * clear objects - */ - while ((info = info_list)) - sdp_path_info_destroy(info, -EINTR); - /* * remove ARP packet processing. */ dev_remove_pack(&sdp_arp_type); @@ -806,6 +818,11 @@ void sdp_link_addr_cleanup(void) flush_workqueue(link_wq); destroy_workqueue(link_wq); /* + * clear objects + */ + list_for_each_entry_safe(info, sweep, &info_list, info_list) + sdp_path_info_destroy(info, -EINTR); + /* * destroy caches */ kmem_cache_destroy(info_cache); -- MST From halr at voltaire.com Mon Aug 29 08:08:02 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2005 11:08:02 -0400 Subject: [openib-general] [PATCH} OpenSM vendor layer: Use MAX macro in umad_receiver Message-ID: <1125327974.4530.3356.camel@hal.voltaire.com> OpenSM vendor layer: Use MAX macro in umad_receiver rather than open coding it. 
Pointed out by Grant Grundler Signed-off-by: Hal Rosenstock Index: osm_vendor_ibumad.c =================================================================== --- osm_vendor_ibumad.c (revision 3200) +++ osm_vendor_ibumad.c (working copy) @@ -271,8 +271,7 @@ umad_receiver(void *p_ptr) if (!(madw_p = osm_mad_pool_get(p_bind->p_mad_pool, (osm_bind_handle_t)p_bind, - length > MAD_BLOCK_SIZE ? - length : MAD_BLOCK_SIZE, + MAX(length, MAD_BLOCK_SIZE), &osm_addr))) { osm_log( p_vend->p_log, OSM_LOG_ERROR, "umad_receiver: " "request for a new madw failed -- dropping packet\n" ); From halr at voltaire.com Mon Aug 29 08:15:24 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2005 11:15:24 -0400 Subject: [openib-general] OpenIB user space includes location Message-ID: <1125328523.4530.3382.camel@hal.voltaire.com> Hi, Should OpenIB user space library includes be under /usr/include/infiniband or /usr/include/rdma now (similar to the kernel move of includes) ? The two that make me think this are user verbs and perhaps CM. IB management (OpenSM and diags) as well as AT (address translation) are IB specific. -- Hal From rolandd at cisco.com Mon Aug 29 08:34:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 08:34:54 -0700 Subject: [openib-general] Re: RDMA connection and address translation API References: <35EA21F54A45CB47B879F21A91F4862F714202@taurus.voltaire.com> <52ll2qvrr1.fsf@cisco.com> <20050825084809.GD22342@mellanox.co.il> <52y86qt2pm.fsf@cisco.com> <20050828075851.GV22342@mellanox.co.il> Message-ID: <528xykoixt.fsf@cisco.com> Michael> What about using an Externally Administrated Service ID? Michael> Openib gets Service ID = 0x1H00 1405 XXXX XXXX where H is Michael> any digit. That would work. I think we've already converged on picking a service ID range for our "iWARP emulation" spec. The only question is whether it should be in the IBTA or IETF service ID range, and I don't think that really matters much. - R. 
From halr at voltaire.com Mon Aug 29 08:56:38 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2005 11:56:38 -0400 Subject: [openib-general] Re: [PATCH] RMPP: Fix length in first segment of multipacket sends In-Reply-To: <20050829144705.GR22342@mellanox.co.il> References: <1125277188.4530.764.camel@hal.voltaire.com> <20050829072316.GN22342@mellanox.co.il> <1125323920.4530.3114.camel@hal.voltaire.com> <20050829144705.GR22342@mellanox.co.il> Message-ID: <1125330997.4530.3499.camel@hal.voltaire.com> Hi Michael, On Mon, 2005-08-29 at 10:47, Michael S. Tsirkin wrote: > Hal, first, sorry about nitpicking. > > Quoting Hal Rosenstock : > > > > rmpp_mad->rmpp_hdr.paylen_newwin = > > > > cpu_to_be32(mad_send_wr->total_seg * > > > > (sizeof(struct ib_rmpp_mad) - > > > > - offsetof(struct ib_rmpp_mad, data))); > > > > + offsetof(struct ib_rmpp_mad, data)) - > > > > + mad_send_wr->pad); > > > > > > BTW, I just noticed that whitespace was (and remains) broken in these lines: > > > indentation is done by spaces. > > > > The whitespace is preceeded by tabs and is to make the parameters line > > up. I thought that was allowable coding style. It has been used in many > > places in OpenIB code. > > Sure, whitespace preceeded by tabs is OK. > But, pls take a look at the original file, or the patch, I think > you'll see that its not preceeded by tabs in this instance. I think this is in the original file. > Some more nitpicks: > > In this specific instance, its probably best to just add a temp variable and > write > > w = mad_send_wr->total_seg * > (sizeof(struct ib_rmpp_mad) - > offsetof(struct ib_rmpp_mad, data)) - > mad_send_wr->pad; > > rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(w) > > to avoid placing the descendant cpu_to_be32 left of the "=" sign. > > Documentation/CodingStyle says about this: > > Descendants are always substantially shorter than the parent and are > > placed substantially to the right. 
> > > And by the way, wouldnt it be a good idea to replace hard-coded > 220 and such in ib_mad.h by a symbolic constant, and then we'd just > have > > w = mad_send_wr->total_seg * IB_RMMP_DATA_SIZE - > mad_send_wr->pad; > > rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(w) > > What do you think? All seem reasonable to me. Sean should comment and has the final say on this. -- Hal From swise at ammasso.com Mon Aug 29 09:00:09 2005 From: swise at ammasso.com (Steve Wise) Date: Mon, 29 Aug 2005 11:00:09 -0500 Subject: [openib-general] [PATCH][iWARP] IW CM Verbs In-Reply-To: Message-ID: > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of James Lentini > Sent: Monday, August 29, 2005 8:42 AM > To: Tom Tucker > Cc: openib-general at openib.org > Subject: Re: [openib-general] [PATCH][iWARP] IW CM Verbs > > > > On Fri, 26 Aug 2005, Tom Tucker wrote: > > > Index: ib_verbs.h > > =================================================================== > > --- ib_verbs.h (revision 3120) > > +++ ib_verbs.h (working copy) > > @@ -804,6 +806,7 @@ > > struct ib_gid_cache **gid_cache; > > }; > > > > +struct iw_cm; > > struct ib_device { > > struct device *dma_device; > > > > @@ -820,6 +823,8 @@ > > > > u32 flags; > > > > + struct iw_cm *iwcm; > > + > > int (*query_device)(struct > ib_device *device, > > struct > ib_device_attr *device_attr); > > int (*query_port)(struct > ib_device *device, > > Why does the ib_device need a cm structure for iWARP but not IB? Some RNIC devices fully establish the connection in hardware. However, _all_ openib IB devices export the MAD interface, which is used to send low level connection setup primitives, thus allowing connection setup in the host. > If > you used either Guy or Roland's generic RDMA connection API and did > the iWARP implementation, would you still need to add the iw_cm > structure? Yes. 
The struct iw_cm allows the RNIC device to export a set of high level connection verbs, if that device supports it. Hope this helps... Steve. From asgeir at chelsio.com Mon Aug 29 09:09:04 2005 From: asgeir at chelsio.com (Asgeir Eiriksson) Date: Mon, 29 Aug 2005 09:09:04 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andquery provider methods Message-ID: <67D69596DDF0C2448DB0F0547D0F947E01A8BB8F@yogi.asicdesigners.com> Caitlin, For the openib folk: keep in mind in the following that iWARP runs on top of TCP, and the TCP is offloaded so TOE enters the iWARP picture. I second the requirement that an iWARP RNIC needs to integrate with host configuration, reporting, and security mechanisms, and this is the approach taken in the Chelsio TOE patches that we have submitted. Standard tools such as netstat and ifconfig work with the Chelsio TOE today, and there's nothing to prevent netfilter from working, etc. For those that are interested there is more information available at https://service.chelsio.com/open_toe and in particular in Chelsio_toe_arch.pdf. I have to disagree that this means that the connection is necessarily set up by the host stack. The Chelsio 10GE NIC/TOE/iSCSI/iWARP products set up the connection on the card, but the setup includes an "ASK host" phase initiated by the card where the host can filter the connect request, modify any of the initial TCP values chosen by the card, etc., and then respond with accept or reject. In general the iWARP connection manager needs to accommodate three possible TCP connection setup models in use today in iWARP devices: a) what you seem to be advocating, i.e. TCP connection setup on the host stack, b) what I brought up, i.e. offloaded TCP connection setup with an ASK phase, and c) what was brought up previously, i.e. full TCP connection setup offload.
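As a rough illustration of how a connection manager might tell the three setup models (a), (b) and (c) apart, here is a minimal C sketch. All names are hypothetical and invented for this example; it does not describe any real driver, the Chelsio interface, or an actual iWARP CM API, and the policy (blocked port, window clamp) is an arbitrary assumption. Only the ASK phase of model (b) is modelled.

```c
#include <stdint.h>

/* Hypothetical names throughout; no real driver or CM API is implied. */
enum demo_setup_model {
	DEMO_SETUP_HOST_STACK,	/* (a) host stack sets up the TCP connection */
	DEMO_SETUP_OFFLOAD_ASK,	/* (b) card sets up, but asks the host first */
	DEMO_SETUP_OFFLOAD_FULL	/* (c) card sets up without asking the host */
};

/* A few initial TCP values the card might propose (assumed fields). */
struct demo_tcp_params {
	uint16_t dst_port;
	uint32_t rcv_window;
};

enum demo_verdict { DEMO_ACCEPT, DEMO_REJECT };

/* Host-side hook for the ASK phase: may adjust params, then decides. */
typedef enum demo_verdict (*demo_ask_fn)(struct demo_tcp_params *p);

/* Example host policy: reject one blocked port, clamp the window otherwise. */
enum demo_verdict demo_host_policy(struct demo_tcp_params *p)
{
	if (p->dst_port == 23)
		return DEMO_REJECT;
	if (p->rcv_window > 65535)
		p->rcv_window = 65535;
	return DEMO_ACCEPT;
}

/*
 * Returns 1 if connection setup proceeds, 0 if the host rejected it.
 * For model (a) any filtering happens inside the host stack and is not
 * modelled here; for model (c) the host is never consulted.
 */
int demo_card_connect(enum demo_setup_model model,
		      struct demo_tcp_params *p, demo_ask_fn ask)
{
	if (model == DEMO_SETUP_OFFLOAD_ASK && ask)
		return ask(p) == DEMO_ACCEPT;
	return 1;
}
```

With this shape, the host keeps a veto and a chance to rewrite the proposed values in model (b), which is the property the "ASK host" description relies on.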
'Asgeir >From: "Caitlin Bestler" >To: "Roland Dreier" ,"Tom Tucker" >CC: openib-general at openib.org >Subject: RE: [openib-general] [PATCH][iWARP] Added provider CM verbs andquery provider methods >Date: Thu, 25 Aug 2005 16:46:38 -0700 > >Device vendors would jump at the opportunity to have a stable >interface with the host stack. Things like routing, protection >from denial of service attacks, rules for logging and filtering >connection requests and more all *should* be handled by the host >stack. > >That's where the end user wants to control them, it's where >the security code can be kept most current and most robust. >It is also largely on packets that do not require offload >optimization. > >But we also need time to ensure that the community understands >this as giving the host stack control of an offload connection >during connection establishment -- rather than as the offload >device "stealing" the connection from the host stack. > >Moving the entire TCP connection logic to the offload device >not only increases the work that the offload device must do, >it reduces the auditability of the system and the user's control >over their network activity. > >So the intent is not to evade the stack, rather it is to allow >time for proper integration with host stack control. The tradeoffs >are complex, and neither side fully understands the other's >issues yet. We need to work together to determine how to provide >the acceleration that our users want without sacrificing the OS >provided security that they assume will not be sacrificed. 
> > > > -----Original Message----- > > From: openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > > Sent: Thursday, August 25, 2005 4:21 PM > > To: Tom Tucker > > Cc: openib-general at openib.org > > Subject: Re: [openib-general] [PATCH][iWARP] Added provider > > CM verbs and query provider methods > > > > Tom> RNIC Verbs imply that the modify qp verb takes a handle to a > > Tom> connection -- presumably a socket. This CAN'T be done on > > Tom> Linux in any fashion that is acceptable to the netdev > > Tom> crowd. SOOO we modeled this after DAPL. Trust me, I would > > Tom> LOVE to be able to establish the connection using bind, > > Tom> listen, etc..., query the Linux connection state and then > > Tom> pass this down to the qp modify verb...but I can't. > > > > Let's not be too quick to say that this is impossible. I > > think we should work with the Linux networking community and > > come up with the right answer, and not accept a bad solution > > just because it lets us go around the networking people. > > > > Has there been any real discussion of this on netdev? > > > > - R. From mst at mellanox.co.il Mon Aug 29 09:12:42 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Aug 2005 19:12:42 +0300 Subject: [openib-general] Re: RDMA connection and address translation API In-Reply-To: <528xykoixt.fsf@cisco.com> References: <35EA21F54A45CB47B879F21A91F4862F714202@taurus.voltaire.com> <52ll2qvrr1.fsf@cisco.com> <20050825084809.GD22342@mellanox.co.il> <52y86qt2pm.fsf@cisco.com> <20050828075851.GV22342@mellanox.co.il> <528xykoixt.fsf@cisco.com> Message-ID: <20050829161242.GA4081@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RDMA connection and address translation API > > Michael> What about using an Externally Administrated Service ID? > Michael> Openib gets Service ID = 0x1H00 1405 XXXX XXXX where H is > Michael> any digit. > > That would work. 
I think we've already converged on picking a service > ID range for our "iWARP emulation" spec. The only question is whether > it should be in the IBTA or IETF service ID range, and I don't think > that really matters much. Or neither :) Are there disadvantages to Externally Administrated Service ID? This avoids any need for approvals from either IETF or IBTA. -- MST From rolandd at cisco.com Mon Aug 29 09:24:00 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 09:24:00 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andquery provider methods In-Reply-To: <67D69596DDF0C2448DB0F0547D0F947E01A8BB8F@yogi.asicdesigners.com> (Asgeir Eiriksson's message of "Mon, 29 Aug 2005 09:09:04 -0700") References: <67D69596DDF0C2448DB0F0547D0F947E01A8BB8F@yogi.asicdesigners.com> Message-ID: <524q98ognz.fsf@cisco.com> Asgeir> ...this is the approach taken in the Chelsio TOE patches Asgeir> that we have submitted. What are your plans for these patches? I am not subscribed to netdev, but from reading the archives, it seems that your most recent submission was rejected quite strongly. - R. From mst at mellanox.co.il Mon Aug 29 09:35:53 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Aug 2005 19:35:53 +0300 Subject: [openib-general] [PATCH] ipoib: device removal races In-Reply-To: <52y86rui38.fsf@cisco.com> References: <20050808151141.GJ15300@mellanox.co.il> <52y86rui38.fsf@cisco.com> Message-ID: <20050829163553.GB4081@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] [PATCH] ipoib: device removal races > > Thanks, I finally applied this and put it in my git queue for 2.6.14. > > I'm still thinking about the bigger patch that adds a second work > queue. Having one extra work queue because of the rtnl_lock issues is > ugly enough, and I'd really like to find a way to avoid two queues. I thought about this some more. 
I think I see even more problems, like potential deadlocks where ipoib needs to flush the link wq, link wq waits for sa query to time out, and that needs the default wq to run. It seems that most of the complexity comes from the way core uses work queues to pass events to upper layers. And I wonder: couldn't core be simplified by passing up events directly in the interrupt context? I think IPoIB could then use plain spinlocks for most synchronization. -- MST From mst at mellanox.co.il Mon Aug 29 09:36:50 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Aug 2005 19:36:50 +0300 Subject: [openib-general] Re: RDMA connection and address translation API In-Reply-To: <528xykoixt.fsf@cisco.com> References: <35EA21F54A45CB47B879F21A91F4862F714202@taurus.voltaire.com> <52ll2qvrr1.fsf@cisco.com> <20050825084809.GD22342@mellanox.co.il> <52y86qt2pm.fsf@cisco.com> <20050828075851.GV22342@mellanox.co.il> <528xykoixt.fsf@cisco.com> Message-ID: <20050829163650.GC4081@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RDMA connection and address translation API > > Michael> What about using an Externally Administrated Service ID? > Michael> Openib gets Service ID = 0x1H00 1405 XXXX XXXX where H is > Michael> any digit. > > That would work. So I'm saying, we don't need reserved bits in cm req then, do we? -- MST From rolandd at cisco.com Mon Aug 29 09:34:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 09:34:54 -0700 Subject: [openib-general] [git pull] InfiniBand updates for 2.6.14 Message-ID: <52y86kn1ld.fsf@cisco.com> Linus, please pull from rsync://rsync.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git The only thing slightly noteworthy in this tree is the commit that moves the InfiniBand midlayer includes from drivers/infiniband/include to include/rdma. However, this was discussed in the thread starting with and no objections were raised.
Pulling will update the following files: drivers/infiniband/core/Makefile | 2 drivers/infiniband/core/agent.c | 13 drivers/infiniband/core/agent_priv.h | 10 drivers/infiniband/core/cache.c | 6 drivers/infiniband/core/cm.c | 125 ++--- drivers/infiniband/core/cm_msgs.h | 194 ++++---- drivers/infiniband/core/core_priv.h | 2 drivers/infiniband/core/device.c | 1 drivers/infiniband/core/fmr_pool.c | 8 drivers/infiniband/core/mad.c | 15 drivers/infiniband/core/mad_priv.h | 10 drivers/infiniband/core/mad_rmpp.c | 311 ++++++++++--- drivers/infiniband/core/packer.c | 3 drivers/infiniband/core/sa_query.c | 6 drivers/infiniband/core/smi.c | 13 drivers/infiniband/core/sysfs.c | 40 - drivers/infiniband/core/ucm.c | 464 ++++++------------- drivers/infiniband/core/ucm.h | 13 drivers/infiniband/core/ud_header.c | 11 drivers/infiniband/core/user_mad.c | 10 drivers/infiniband/core/uverbs.h | 11 drivers/infiniband/core/uverbs_cmd.c | 182 +++++++ drivers/infiniband/core/uverbs_main.c | 22 drivers/infiniband/core/uverbs_mem.c | 1 drivers/infiniband/core/verbs.c | 65 ++ drivers/infiniband/hw/mthca/Makefile | 4 drivers/infiniband/hw/mthca/mthca_allocator.c | 116 ++++ drivers/infiniband/hw/mthca/mthca_av.c | 28 - drivers/infiniband/hw/mthca/mthca_cmd.c | 106 +++- drivers/infiniband/hw/mthca/mthca_cmd.h | 20 drivers/infiniband/hw/mthca/mthca_config_reg.h | 1 drivers/infiniband/hw/mthca/mthca_cq.c | 256 +++------- drivers/infiniband/hw/mthca/mthca_dev.h | 52 +- drivers/infiniband/hw/mthca/mthca_doorbell.h | 13 drivers/infiniband/hw/mthca/mthca_eq.c | 63 +- drivers/infiniband/hw/mthca/mthca_mad.c | 10 drivers/infiniband/hw/mthca/mthca_main.c | 179 ++++--- drivers/infiniband/hw/mthca/mthca_mcg.c | 36 - drivers/infiniband/hw/mthca/mthca_memfree.c | 12 drivers/infiniband/hw/mthca/mthca_memfree.h | 5 drivers/infiniband/hw/mthca/mthca_mr.c | 35 - drivers/infiniband/hw/mthca/mthca_pd.c | 1 drivers/infiniband/hw/mthca/mthca_profile.c | 2 drivers/infiniband/hw/mthca/mthca_profile.h | 2 
drivers/infiniband/hw/mthca/mthca_provider.c | 115 ++++ drivers/infiniband/hw/mthca/mthca_provider.h | 54 +- drivers/infiniband/hw/mthca/mthca_qp.c | 362 ++++----------- drivers/infiniband/hw/mthca/mthca_srq.c | 591 +++++++++++++++++++++++++ drivers/infiniband/hw/mthca/mthca_user.h | 11 drivers/infiniband/hw/mthca/mthca_wqe.h | 114 ++++ drivers/infiniband/ulp/ipoib/Makefile | 2 drivers/infiniband/ulp/ipoib/ipoib.h | 12 drivers/infiniband/ulp/ipoib/ipoib_fs.c | 2 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 5 drivers/infiniband/ulp/ipoib/ipoib_main.c | 33 - drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 3 drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 1 include/rdma/ib_cache.h | 4 include/rdma/ib_cm.h | 93 +-- include/rdma/ib_fmr_pool.h | 2 include/rdma/ib_mad.h | 26 - include/rdma/ib_pack.h | 2 include/rdma/ib_sa.h | 22 include/rdma/ib_smi.h | 20 include/rdma/ib_user_cm.h | 28 - include/rdma/ib_user_mad.h | 10 include/rdma/ib_user_verbs.h | 39 + include/rdma/ib_verbs.h | 118 ++++ 69 files changed, 2732 insertions(+), 1424 deletions(-) through the following changes: commit a4d61e84804f3b14cc35c5e2af768a07c0f64ef6 Author: Roland Dreier Date: Thu Aug 25 13:40:04 2005 -0700 [PATCH] IB: move include files to include/rdma Move the InfiniBand headers from drivers/infiniband/include to include/rdma. This allows InfiniBand-using code to live elsewhere, and lets us remove the ugly EXTRA_CFLAGS include path from the InfiniBand Makefiles. Signed-off-by: Roland Dreier commit 1ad62a19f177e61d4dde111ba35fb4badd0c2106 Author: Michael S. Tsirkin Date: Wed Aug 24 14:41:51 2005 -0700 [PATCH] IPoIB: Fix device removal race Currently we may have work scheduled in default kernel workqueue when the device is going down. The device could get freed before this workqueue gets serviced. I am actually seeing this causing system hangs. The following patch fixes this by using ipoib_workqueue which gets flushed when the device is going down. 
Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier commit fe9e08e17af414a5fd8f3141b0fd88677f81a883 Author: Sean Hefty Date: Fri Aug 19 13:50:33 2005 -0700 [PATCH] IB: Add handling for ABORT and STOP RMPP MADs. Add handling for ABORT / STOP RMPP MADs. Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier commit b9ef520f9caf20aba8ac7cb2bbba45b52ff19d53 Author: Sean Hefty Date: Fri Aug 19 13:46:34 2005 -0700 [PATCH] IB: fix userspace CM deadlock Fix deadlock condition resulting from trying to destroy a cm_id from the context of a CM thread. The synchronization around the ucm context structure is simplified as a result, and some simple code cleanup is included. Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier commit 4ce059378c04b40c2e9f658b1c6a2e9078b85c7c Author: Roland Dreier Date: Fri Aug 19 12:03:17 2005 -0700 [PATCH] IPoIB: Set full membership bit in P_Keys Always make sure that the full membership bit is set in the P_Keys that IPoIB uses. This makes sure that all hosts join the correct multicast groups so that hosts that are partial partition members can talk to the rest of the network. Signed-off-by: Roland Dreier commit ec34a922d243c3401a694450734e9effb2bafbfe Author: Roland Dreier Date: Fri Aug 19 10:59:31 2005 -0700 [PATCH] IB/mthca: Add SRQ implementation Add mthca support for shared receive queues (SRQs), including userspace SRQs. Signed-off-by: Roland Dreier commit d20a40192868082eff6fec729b311cb8463b4a21 Author: Roland Dreier Date: Fri Aug 19 10:36:11 2005 -0700 [PATCH] IB/mthca: Handle context tables smaller than our chunk size When creating a table in context memory where the table is smaller than our chunk size, we don't want to allocate and map a full chunk. Instead, allocate just enough memory to cover the table. This can be pretty simple because all tables are a power-of-2 size, so either the table is a multiple of the chunk size, or it's smaller than one chunk. 
Signed-off-by: Roland Dreier commit c04bc3d1f417a8a90eef9ab46523dfd44858b28d Author: Roland Dreier Date: Fri Aug 19 10:33:35 2005 -0700 [PATCH] IB/mthca: Move WQE structures into their own header Move the definitions of the WQE structures from mthca_qp.c into mthca_wqe.h, so that we'll be able to share them when we add the SRQ code in mthca_srq.c. Signed-off-by: Roland Dreier commit 288bdeb4bc5b89befd7ee2f0f0183604034ff6c5 Author: Roland Dreier Date: Fri Aug 19 09:19:05 2005 -0700 [PATCH] IB/mthca: Simplify handling of completions with error Mem-free HCAs never generate error CQEs that complete multiple WQEs, so just skip the call to mthca_free_err_wqe() for them rather than having logic to handle the mem-free case in mthca_free_err_wqe(). Signed-off-by: Roland Dreier commit 87b816706bb2b79fbaff8e0b8e279e783273383e Author: Roland Dreier Date: Thu Aug 18 13:39:31 2005 -0700 [PATCH] IB/mthca: Factor out common queue alloc code Clean up the allocation of memory for queues by factoring out the common code into mthca_buf_alloc() and mthca_buf_free(). Now CQs and QPs share the same queue allocation code, which we'll also use for SRQs. Signed-off-by: Roland Dreier commit f520ba5aa48e2891c3fb3e364eeaaab4212c7c45 Author: Roland Dreier Date: Thu Aug 18 12:24:13 2005 -0700 [PATCH] IB: userspace SRQ support Add SRQ support to userspace verbs module. This adds several commands and associated structures, but it's OK to do this without bumping the ABI version because the commands are added at the end of the list so they don't change the existing numbering. There are two cases to worry about: 1. New kernel, old userspace. This is OK because old userspace simply won't try to use the new SRQ commands. None of the old commands are changed. 2. Old kernel, new userspace. This works perfectly as long as userspace doesn't try to use SRQ commands. 
If userspace tries to use SRQ commands, it will get EINVAL, which is perfectly reasonable: the kernel doesn't support SRQs, so we couldn't do any better. Signed-off-by: Roland Dreier commit d41fcc6705eddd04f7218c985b6da35435ed73cc Author: Roland Dreier Date: Thu Aug 18 12:23:08 2005 -0700 [PATCH] IB: Add SRQ support to midlayer Make the required core API additions and changes for shared receive queues (SRQs). Signed-off-by: Roland Dreier commit d1887ec2125988adccbd8bf0de638c41440bf80e Author: Roland Dreier Date: Thu Aug 18 12:14:11 2005 -0700 [PATCH] IB/mthca: Report correct max_msg_sz Set the max_msg_sz port property correctly in mthca's port_query function. Also zero out the attr struct so that we don't leave any other members uninitialized. Signed-off-by: Roland Dreier commit da6561c285a6e28a075b97fd5a1560a2b0ce843e Author: Roland Dreier Date: Wed Aug 17 07:39:10 2005 -0700 [PATCH] IB/mthca: Use correct port width capability value When we call the INIT_IB firmware command to bring up a port, use the actual port width capability returned by the QUERY_DEV_LIM command instead of always trying to enable both 1X and 4X. This fixes breakage seen when the firmware is build to allow 4X only. Signed-off-by: Roland Dreier commit 2aeba9a03b0d249fc710b9939fc089ce53d8cd30 Author: Olaf Hering Date: Mon Aug 15 14:29:03 2005 -0700 [PATCH] IB: Remove unnecessary includes of changing CONFIG_LOCALVERSION rebuilds too much, for no appearent reason. Remove unneeded includes of . Signed-off-by: Olaf Hering Signed-off-by: Roland Dreier commit 5dd2ce1200f4b12687d74de89a527f99e16c344e Author: Hal Rosenstock Date: Mon Aug 15 14:16:36 2005 -0700 [PATCH] IB: Fix ib_mad_thread_completion_handler declaration Change ib_mad_thread_completion_handler to conform to ib_comp_handler declaration. 
Signed-off-by: Hal Rosenstock Signed-off-by: Roland Dreier commit 7f9f2dba729cee6ea10596ccb07447d467705b08 Author: Guy German Date: Mon Aug 15 07:38:50 2005 -0700 [PATCH] IB/mthca: use generic function instead of arbel_ version in mthca_free_region() Use the generic key_to_hw_index() function instead of the Arbel-specific version in mthca_free_region(). Signed-off-by: Guy German Signed-off-by: Roland Dreier commit ffbf4c34f1916fa1e0554269c94c57da4a21a348 Author: Roland Dreier Date: Mon Aug 15 07:35:16 2005 -0700 [PATCH] IB: unmap FMRs when destroying FMR pool Make sure that all FMRs are unmapped before we deallocate them so that we don't leak references to our protection domain when destroying an FMR pool. (Bug reported by Guy German ) Signed-off-by: Roland Dreier commit 2e8b981c5d5c6fe5479ad47c44e3e76ebb5408ef Author: Michael S. Tsirkin Date: Sat Aug 13 21:19:38 2005 -0700 [PATCH] IB/mthca: add HCA board ID to sysfs info Add support for reporting HCA board ID returned from QUERY_ADAPTER firmware command through sysfs. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier commit 97f52eb438be7caebe026421545619d8a0c1398a Author: Sean Hefty Date: Sat Aug 13 21:05:57 2005 -0700 [PATCH] IB: sparse endianness cleanup Fix sparse warnings. Use __be* where appropriate. Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier commit 92a6b34bf4d0d11c54b2a6bdd6240f98cb326200 Author: Hal Rosenstock Date: Sat Aug 13 20:50:27 2005 -0700 [PATCH] IB: Eliminate redundant NULL checks IPoIB: Eliminate NULL checks prior to calling kfree Signed-off-by: Hal Rosenstock Signed-off-by: Roland Dreier commit 2a1d9b7f09aaaacf235656cb32a40ba2c79590b3 Author: Roland Dreier Date: Wed Aug 10 23:03:10 2005 -0700 [PATCH] IB: Add copyright notices Make some lawyers happy and add copyright notices for people who forgot to include them when they actually touched the code. 
Signed-off-by: Roland Dreier commit 49f6a7fbe123dde25ca4193a7d60705784e18317 Author: Tziporet Koren Date: Wed Aug 10 23:00:50 2005 -0700 [PATCH] IB: Update current firmware versions in mthca driver Update FW versions in mthca according to July 05 Mellanox release Signed-off-by: Tziporet Koren Signed-off-by: Roland Dreier From rolandd at cisco.com Mon Aug 29 09:36:12 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 09:36:12 -0700 Subject: [openib-general] Re: RDMA connection and address translation API In-Reply-To: <20050829163650.GC4081@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 29 Aug 2005 19:36:50 +0300") References: <35EA21F54A45CB47B879F21A91F4862F714202@taurus.voltaire.com> <52ll2qvrr1.fsf@cisco.com> <20050825084809.GD22342@mellanox.co.il> <52y86qt2pm.fsf@cisco.com> <20050828075851.GV22342@mellanox.co.il> <528xykoixt.fsf@cisco.com> <20050829163650.GC4081@mellanox.co.il> Message-ID: <52u0h8n1j7.fsf@cisco.com> Michael> So I'm saying, we dont need reserved bits in cm req then, Michael> do we? Right. - R. From rolandd at cisco.com Mon Aug 29 09:38:25 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 09:38:25 -0700 Subject: [openib-general] [PATCH] ipoib: device removal races In-Reply-To: <20050829163553.GB4081@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 29 Aug 2005 19:35:53 +0300") References: <20050808151141.GJ15300@mellanox.co.il> <52y86rui38.fsf@cisco.com> <20050829163553.GB4081@mellanox.co.il> Message-ID: <52psrwn1fi.fsf@cisco.com> Michael> It seems that most of the complexity comes from the way Michael> core uses work queues to pass events to upper layers. Michael> And I wander: couldnt core be simplified by passing up Michael> events directly in the interrupt context? I'm confused -- which core and which events are you talking about? The ib_core module just passes on asynchronous events directly from the low-level driver, without using work queues. - R. 
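The direct-dispatch model Roland describes above can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions: `register_handler`, `dispatch_event`, and `count_event` are hypothetical names, not the ib_core API, and all locking is omitted. What it shows is that every registered handler runs in the caller's context before the dispatch call returns, so nothing is deferred to a work queue.

```c
#include <stddef.h>

#define MAX_HANDLERS 8

/* Hypothetical handler record; not the kernel's ib_event_handler. */
struct event_handler {
	void (*fn)(int event, void *ctx);
	void *ctx;
};

static struct event_handler handlers[MAX_HANDLERS];
static size_t n_handlers;

/* Add a handler to the table; returns 0 on success, -1 if the table is full. */
static int register_handler(void (*fn)(int event, void *ctx), void *ctx)
{
	if (n_handlers == MAX_HANDLERS)
		return -1;
	handlers[n_handlers].fn = fn;
	handlers[n_handlers].ctx = ctx;
	n_handlers++;
	return 0;
}

/*
 * Called by the (imaginary) low-level driver. Each handler is invoked
 * synchronously, in the caller's context, before this function returns.
 */
static void dispatch_event(int event)
{
	for (size_t i = 0; i < n_handlers; i++)
		handlers[i].fn(event, handlers[i].ctx);
}

/* Example handler: counts how many events it has seen. */
static void count_event(int event, void *ctx)
{
	(void)event;
	++*(int *)ctx;
}
```

Because the handlers run in the dispatcher's context, a handler reached from interrupt context must not sleep. That constraint is why the choice between direct calls and work queues matters for the locking discussed in this thread.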
From halr at voltaire.com Mon Aug 29 09:35:30 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Aug 2005 12:35:30 -0400
Subject: [openib-general] [PATCH] kdapl: Change for new include location
Message-ID: <1125333329.4530.3628.camel@hal.voltaire.com>

kdapl: Change for new include location (rdma rather than infiniband)

Signed-off-by: Hal Rosenstock

Index: dapl_openib_cm.h
===================================================================
--- dapl_openib_cm.h	(revision 3232)
+++ dapl_openib_cm.h	(working copy)
@@ -34,9 +34,9 @@
 #ifndef DAPL_OPENIB_CM_H
 #define DAPL_OPENIB_CM_H
 
-#include "ib_cm.h"
-#include "ib_sa.h"
-#include "ib_at.h"
+#include "rdma/ib_cm.h"
+#include "rdma/ib_sa.h"
+#include "rdma/ib_at.h"
 
 struct dapl_cm_ctx {
 	struct ib_at_ib_route dapl_rt;
Index: dapl_openib_util.h
===================================================================
--- dapl_openib_util.h	(revision 3232)
+++ dapl_openib_util.h	(working copy)
@@ -33,8 +33,8 @@
 #define DAPL_OPENIB_UTIL_H
 
 #include "dapl.h"
-#include "ib_verbs.h"
-#include "ib_cm.h"
+#include "rdma/ib_verbs.h"
+#include "rdma/ib_cm.h"
 
 enum dapl_async_handler_type {
 	DAPL_ASYNC_UNAFILIATED,

From tom at ammasso.com Mon Aug 29 09:46:47 2005
From: tom at ammasso.com (Tom Tucker)
Date: Mon, 29 Aug 2005 12:46:47 -0400
Subject: [openib-general] [PATCH][iWARP] Added provider CM verbsandquery provider methods
Message-ID: <8E9D028761D8264D910612167E8457E8FA3BED@mail2.ammasso.com>

>From my reading of the thread, there is resistance to TOE in general.
The patch is just the messenger. The principal opponent is Dave Miller,
who strongly believes that stateless acceleration such as TSO (TCP
Segmentation Offload) suffices for all needs. Ironically, this requires
a much higher level of stack integration than TOE does.

TOE for the purposes of RDMA may have more legs within the community;
however, this has yet to be tested.

> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier
> Sent: Monday, August 29, 2005 11:24 AM
> To: Asgeir Eiriksson
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] [PATCH][iWARP] Added provider
> CM verbsandquery provider methods
>
>     Asgeir> ...this is the approach taken in the Chelsio TOE patches
>     Asgeir> that we have submitted.
>
> What are your plans for these patches? I am not subscribed to netdev,
> but from reading the archives, it seems that your most recent
> submission was rejected quite strongly.
>
>  - R.
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From jlentini at netapp.com Mon Aug 29 09:51:51 2005
From: jlentini at netapp.com (James Lentini)
Date: Mon, 29 Aug 2005 12:51:51 -0400 (EDT)
Subject: [openib-general][PATCH][kdapl]: FMR and EVD patch
In-Reply-To:
References:
Message-ID:

Hi Guy,

I agree with you on the problems posed by the current interface. I
hope we can find a solution that fixes the problem. Note that the same
problem must be handled by a ULP using the native verbs.

I still think that there may be a race condition with this patch.
Here's the scenario I'm concerned about:

 - Receive an evd upcall
 - Disable evd upcall policy
 - Wakeup polling thread
 - Dequeue all events
 - Enable evd upcall policy by:
    1. Call dapl_evd_modify_upcall() to enable the evd upcall
    2. Obtain the EVD spin lock via spin_lock_irqsave, thus
       disabling local interrupts
    3. Check that the EVD's ring buffer is empty (there are no
       DAPL "software" events)
    4. A DTO completion occurs on the EVD's CQ
    5. Enable the CQ's upcall via ib_req_notify_cq()

If I understand you correctly, you are asserting that event #4, the
CQ's DTO completion, cannot occur because the local interrupts are
disabled by spin_lock_irqsave(). Have I understood you correctly?

My belief is that the completion will occur on the card regardless of
the interrupt state. Can you provide me with a reference that
guarantees this will not happen?

james

On Thu, 18 Aug 2005, Guy German wrote:

> Hi James,
>
> I will try to explain the reason behind this patch:
>
> In IB, a “normal” working flow, for a consumer, is:
> - Receive a CQ notification callback
> - Wakeup polling thread
> - Poll for completion (empty the queue)
> - Request completion notification
>
> There is no problem here.
>
> In kdapl, however, the consumer will keep getting upcalls, until he
> sets the upcall policy to disable. So a working flow will be:
> - Receive an evd upcall
> - Disable evd upcall policy
> - Wakeup polling thread
> - Dequeue all evd’s
> - Enable evd upcall policy
>
> There is a race here: A completion can come after the last dequeue
> and before the Enabling. The provider won’t call for the consumer
> (policy is disabled) and the consumer would not dequeue any more
> because he “knows” the queue is empty.
>
> I think it is a very bad idea, to solve this race by adding another
> evd_dequeue after you enable the upcall policy. If you do that you
> would have a polling thread (because while you dequeue one
> completion you can have many more following) and at the same time
> you will receive upcall from the dapl provider. Besides the fact that
> this is an expensive and unnecessary context switch you have an
> upcall and a thread racing. You will have a situation that the
> upcall has an event at hand and the thread has an event, both not
> handled yet - you will have to queue them again internally or
> something to keep the order. And I think that is only a partial list
> of the problems in this case.
> > SO > > My suggestion is simple, it solves the race, it saves the > unnecessary context switch and it spares the complexity from the > consumer side. The solution is to notify the consumer when he tries > to enable upcall policy, that the queue is actually not empty, and > force him to continue polling (in the same thread context he is > now). dat_evd_modify_upcall is guarded by a spin_lock_irqsave, when > it checks the queue and so the race would not occur. > > BTW, > I’m not sure if it is still the case, but I think that one of the > ulps in openib, did not use a kernel thread for dequeu-ing. This is > a very bad design, as the upcall can be polling for *long* periods > of time, in a tasklet/interrupt context. > > That’s it > Sorry for the long mail – I hope It was not to blur > > Guy. > > > > > > -----Original Message----- > From: James Lentini [mailto:jlentini at netapp.com] > Sent: Thu 8/18/2005 10:28 PM > To: Guy German > Cc: Openib > Subject: Re: [openib-general][PATCH][kdapl]: FMR and EVD patch > > > > Hi Guy, > > The one piece of this patch that remains unaccepted is: > > Index: ib/dapl_evd.c > =================================================================== > --- ib/dapl_evd.c (revision 3136) > +++ ib/dapl_evd.c (working copy) > @@ -1028,6 +1028,7 @@ > { > struct dapl_evd *evd; > int status = 0; > + int pending_events; > > evd = (struct dapl_evd *)evd_handle; > dapl_dbg_log (DAPL_DBG_TYPE_API, "%s: (evd=%p, upcall_policy=%d)\n", > @@ -1035,14 +1036,25 @@ > > spin_lock_irqsave(&evd->common.lock, evd->common.flags); > if ((upcall_policy != DAT_UPCALL_TEARDOWN) && > - (upcall_policy != DAT_UPCALL_DISABLE) && > - (evd->evd_flags & DAT_EVD_DTO_FLAG)) { > - status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > - if (status) { > - printk(KERN_ERR "%s: dapls_ib_completion_notify failed " > - "(status=0x%x)\n",__func__, status); > + (upcall_policy != DAT_UPCALL_DISABLE)) { > + pending_events = dapl_rbuf_count(&evd->pending_event_queue); > + if 
(pending_events) { > + dapl_dbg_log(DAPL_DBG_TYPE_WARN, > + "%s: (evd %p) there are still %d pending " > + "events in the queue - policy stays disabled\n", > + __func__, evd_handle, pending_events); > + status = -EBUSY; > goto bail; > } > + if (evd->evd_flags & DAT_EVD_DTO_FLAG) { > + status = ib_req_notify_cq(evd->cq, IB_CQ_NEXT_COMP); > + if (status) { > + printk(KERN_ERR "%s: dapls_ib_completion_notify" > + " failed (status=0x%x) \n",__func__, > + status); > + goto bail; > + } > + } > } > evd->upcall_policy = upcall_policy; > evd->upcall = *upcall; > > The IB analog to this function, ib_req_notify_cq(), does not require > that the CQ be empty. The kDAPL specification does not define an empty > EVD as a requirement for modifying the upcall and previous > implementations of the API have not made this requirement. > From halr at voltaire.com Mon Aug 29 09:51:19 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2005 12:51:19 -0400 Subject: [openib-general] Re: [PATCH] IBAT resolve_ats_route In-Reply-To: References: Message-ID: <1125334278.4530.3701.camel@hal.voltaire.com> On Fri, 2005-08-26 at 16:42, James Lentini wrote: > I was reading through the IBAT sources when I noticed that in > resolve_ats_route() you set req->pend.sa_query to null on line 1127 > and then check to see if it is null a few lines later. I don't think > you need to do that. Yes, it looks like that code path could never be taken. Thanks. Applied. 
-- Hal From caitlinb at broadcom.com Mon Aug 29 10:06:29 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 29 Aug 2005 10:06:29 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbsandquery provider methods Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F537@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Tom Tucker > Sent: Monday, August 29, 2005 9:47 AM > To: Roland Dreier; Asgeir Eiriksson > Cc: openib-general at openib.org > Subject: RE: [openib-general] [PATCH][iWARP] Added provider > CM verbsandquery provider methods > > > >From my reading of the thread, there is resistence to > TOE in general. The patch is just the messenger. The > principle opponent is Dave Miller who strongly believes that > stateless acceleration such as TSO (TCP Segmentation Offload) > suffices for all needs. Ironically, this requires a much > higher level of stack integration than TOE does. > > TOE for the purposes of RDMA may have more legs within the > community, however, this has yet to be tested. > And even once we have concensus to do it, we then need to reach concensus on issues such as connect-on-chip-with-host-approval and/or connect-on-host-then-transfer to work through. For example, I think the host stack should support either, leaving the tradeoffs between NIC and host processing to be resolved in the marketplace.
From halr at voltaire.com Mon Aug 29 10:25:08 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2005 13:25:08 -0400 Subject: [openib-general] [PATCH] iser: Make iser Makefile like other OpenIB ULP makefiles Message-ID: <1125336308.4530.3840.camel@hal.voltaire.com> Make iser Makefile like other OpenIB ULP makefiles Signed-off-by: Hal Rosenstock Index: Makefile =================================================================== --- Makefile (revision 3232) +++ Makefile (working copy) @@ -1,16 +1,14 @@ -ISER_OBJ = iser_mod.o -ISER_OBJ += iser_conn.o -ISER_OBJ += iser_initiator.o -ISER_OBJ += iser_memory.o -ISER_OBJ += iser_task.o -ISER_OBJ += iser_utils.o -ISER_OBJ += iser_dto.o -ISER_OBJ += iser_lkdapl.o +EXTRA_CFLAGS += -Idrivers/infiniband/include -Idrivers/infiniband/ulp/kdapl \ + -I$(src)/include -DLINUX_KDAT -EXTRA_CFLAGS += -Idrivers/infiniband/include -EXTRA_CFLAGS += -Idrivers/infiniband/ulp/kdapl -EXTRA_CFLAGS += -I$(src)/include -EXTRA_CFLAGS += -DLINUX_KDAT +obj-$(CONFIG_INFINIBAND_ISER) += ib_iser.o -obj-$(CONFIG_INFINIBAND_ISER) += $(ISER_OBJ) +ib_iser-y := iser_mod.o \ + iser_conn.o \ + iser_initiator.o \ + iser_memory.o \ + iser_task.o \ + iser_utils.o \ + iser_dto.o \ + iser_lkdapl.o From jlentini at netapp.com Mon Aug 29 10:57:22 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 29 Aug 2005 13:57:22 -0400 (EDT) Subject:
[openib-general] Re: [PATCH] kdapl: Change for new include location In-Reply-To: <1125333329.4530.3628.camel@hal.voltaire.com> References: <1125333329.4530.3628.camel@hal.voltaire.com> Message-ID: On Mon, 29 Aug 2005, Hal Rosenstock wrote: halr> kdapl: Change for new include location (rdma rather than infiniband) Committed in revision 3235. From mst at mellanox.co.il Mon Aug 29 11:55:40 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Aug 2005 21:55:40 +0300 Subject: [openib-general] [PATCH] ipoib: device removal races In-Reply-To: <52psrwn1fi.fsf@cisco.com> References: <20050808151141.GJ15300@mellanox.co.il> <52y86rui38.fsf@cisco.com> <20050829163553.GB4081@mellanox.co.il> <52psrwn1fi.fsf@cisco.com> Message-ID: <20050829185540.GA5169@mellanox.co.il> Quoting r. Roland Dreier : > I'm confused -- which core and which events are you talking about? OK, I wasnt exactly clear. I was talking about activating the sa queries: if starting and cancelling sa queries could be done from under spinlock/interrupt context, IPoIB could use straight spinlocks for synchronisation, and avoid using workqueues altogether. Is this feasible? -- MST From rolandd at cisco.com Mon Aug 29 12:02:53 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 12:02:53 -0700 Subject: [openib-general] [PATCH] ipoib: device removal races In-Reply-To: <20050829185540.GA5169@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 29 Aug 2005 21:55:40 +0300") References: <20050808151141.GJ15300@mellanox.co.il> <52y86rui38.fsf@cisco.com> <20050829163553.GB4081@mellanox.co.il> <52psrwn1fi.fsf@cisco.com> <20050829185540.GA5169@mellanox.co.il> Message-ID: <52ek8cmuqq.fsf@cisco.com> Michael> OK, I wasnt exactly clear. 
I was talking about Michael> activating the sa queries: if starting and cancelling sa Michael> queries could be done from under spinlock/interrupt Michael> context, IPoIB could use straight spinlocks for Michael> synchronisation, and avoid using workqueues altogether. I'd have to audit the code to make sure, but as far as I know it should be fine to call the SA query API with spinlocks held. - R. From jlentini at netapp.com Mon Aug 29 12:28:10 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 29 Aug 2005 15:28:10 -0400 (EDT) Subject: [openib-general] Re: RDMA connection and address translation API In-Reply-To: <20050829161242.GA4081@mellanox.co.il> References: <35EA21F54A45CB47B879F21A91F4862F714202@taurus.voltaire.com> <52ll2qvrr1.fsf@cisco.com> <20050825084809.GD22342@mellanox.co.il> <52y86qt2pm.fsf@cisco.com> <20050828075851.GV22342@mellanox.co.il> <528xykoixt.fsf@cisco.com> <20050829161242.GA4081@mellanox.co.il> Message-ID: On Mon, 29 Aug 2005, Michael S. Tsirkin wrote: > Quoting r. Roland Dreier : > > Subject: Re: RDMA connection and address translation API > > > > Michael> What about using an Externally Administrated Service ID? > > Michael> Openib gets Service ID = 0x1H00 1405 XXXX XXXX where H is > > Michael> any digit. > > > > That would work. I think we've already converged on picking a service > > ID range for our "iWARP emulation" spec. The only question is whether > > it should be in the IBTA or IETF service ID range, and I don't think > > that really matters much. > > Or neither :) > Are there disadvantages to Externally Administrated Service ID? > This avoids any need for approvals from either IETF or IBTA. We should encourage interoperability with other implementations. Standardizing the protocol in the appropriate standards body is the way to ensure that. My assumption is that IBTA is the appropriate place for this. As Yaron pointed out earlier, we can do the initial implementation and standardization in parallel. 
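For reference, the Service ID layout quoted above (0x1H00 1405 XXXX XXXX) can be sketched in C as follows. The helper names are hypothetical and not part of any OpenIB header. The sketch assumes that H is a single hex digit occupying bits 59..56 and that the low 32 bits (the XXXX XXXX part) are left to the consumer; neither assumption has been settled in this thread.

```c
#include <stdint.h>

/* Fixed part of the proposed ID: 0x1H00 1405 with H = 0 and the low 32 bits clear. */
#define OPENIB_SID_PREFIX 0x1000140500000000ULL
/* Mask covering every fixed digit; the H nibble and the low 32 bits are excluded. */
#define OPENIB_SID_MASK   0xF0FFFFFF00000000ULL

/* Hypothetical helper: build a Service ID from the H digit and a 32-bit suffix. */
static uint64_t openib_make_service_id(unsigned int h, uint32_t suffix)
{
	return OPENIB_SID_PREFIX | ((uint64_t)(h & 0xF) << 56) | suffix;
}

/* Hypothetical helper: nonzero if sid falls inside the proposed range. */
static int openib_is_service_id(uint64_t sid)
{
	return (sid & OPENIB_SID_MASK) == OPENIB_SID_PREFIX;
}
```

Whichever range is finally chosen (IBTA, IETF, or externally administered), only the two constants would need to change.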
From jlentini at netapp.com Mon Aug 29 12:34:55 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 29 Aug 2005 15:34:55 -0400 (EDT) Subject: [openib-general] RDMA Generic Connection Management In-Reply-To: <1125323947.6584.106.camel@r2d2> References: <1125323947.6584.106.camel@r2d2> Message-ID: On Mon, 29 Aug 2005, Guy German wrote: > Hi, > > After receiving feedbacks from people here - I want to see if we can > agree on a generic CM API, so we can start implementing it. > I will try and summarize the 2 options, the way I understand it. > > If I am missing something or misrepresenting - please don't hesitate to > correct me. > > both suggestion include the following verbs (or semantically > equivalent): ib_cma_get_device, ib_cma_create_qp, ib_cma_connect, > ib_cma_disconnect, ib_cma_listen, ib_cma_destroy, ib_cma_accept, > ib_cma_reject, ib_cma_get_src_ip. > > a connect flow will be something like: > > - ib_cma_get_device (...) /* get device(1) or device+path(2) */ > - pd = ib_alloc_pd(...) /* pd allocated in the given device */ > - qp = ib_cma_create_qp(...) /* qp returned in init state */ > - ib_post_recv(qp, ...); > - ib_cma_connect (qp, dst_addr(1)/path(2), ...); > > Now, there are 2 suggestions for the device discovery: > 1. get_device returns device and port, according the local routing > tables, synchronously. ib_cma_connect calls the at module for address > resolving (cache handled) before calling the cm_connect. > 2. get_device registers an upcall and return in the upcall the data path > to the consumer. In this case caching is done by the consumer. What happens if multiple devices can reach the destination address? How will they be enumerated to the consumer? > I prefer option 1, because it makes the consumer code simpler, without > having to handle upcalls for address translations (which are not > asynchronous in iWARP) or hold the transport's data information. Also it > saves the consumer the trouble of caching routes to destinations. 
I also find option 1 simpler. Of course it is easy to turn an async call into a sync call but hard to do the opposite. > I would like to hear what other people in the list think of it ... > > Thanks, > Guy > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rolandd at cisco.com Mon Aug 29 12:40:51 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 12:40:51 -0700 Subject: [openib-general] RDMA Generic Connection Management In-Reply-To: (James Lentini's message of "Mon, 29 Aug 2005 15:34:55 -0400 (EDT)") References: <1125323947.6584.106.camel@r2d2> Message-ID: <52acj0mszg.fsf@cisco.com> James> What happens if multiple devices can reach the destination James> address? How will they be enumerated to the consumer? I guess we need to move towards the full horror of getaddrinfo(). Probably we need some unusable native API, and then library functions layered on top for consumers that don't care. Although maybe it's not necessary -- are there any consumers of this API that really want to choose among different equal-metric routes? - R. From caitlinb at broadcom.com Mon Aug 29 12:44:02 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 29 Aug 2005 12:44:02 -0700 Subject: [openib-general] RDMA Generic Connection Management Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F53C@NT-SJCA-0751.brcm.ad.broadcom.com> > > a connect flow will be something like: > > > > - ib_cma_get_device (...) /* get device(1) or device+path(2) */ > > - pd = ib_alloc_pd(...) /* pd allocated in the given device */ > > - qp = ib_cma_create_qp(...) /* qp returned in init state */ > > - ib_post_recv(qp, ...); > > - ib_cma_connect (qp, dst_addr(1)/path(2), ...); > > > > Now, there are 2 suggestions for the device discovery: > > 1. 
get_device returns device and port, according the local routing > > tables, synchronously. ib_cma_connect calls the at module > for address > > resolving (cache handled) before calling the cm_connect. > > 2. get_device registers an upcall and return in the upcall the data > > path to the consumer. In this case caching is done by the consumer. > > What happens if multiple devices can reach the destination address? > How will they be enumerated to the consumer? > At the DAT layer the assumption was that multiple paths would be chosen based upon the Class of Service. So either the CoS must be passed down, or "get_device" must return an array of devices with the required info to allow the DAT Provider to make the determination. Passing it down sounds simpler to me. From caitlinb at broadcom.com Mon Aug 29 12:45:24 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 29 Aug 2005 12:45:24 -0700 Subject: [openib-general] RDMA Generic Connection Management Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F53D@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Monday, August 29, 2005 12:41 PM > To: James Lentini > Cc: openib-general at openib.org > Subject: Re: [openib-general] RDMA Generic Connection Management > > James> What happens if multiple devices can reach the destination > James> address? How will they be enumerated to the consumer? > > I guess we need to move towards the full horror of getaddrinfo(). > Probably we need some unusable native API, and then library > functions layered on top for consumers that don't care. > > Although maybe it's not necessary -- are there any consumers > of this API that really want to choose among different > equal-metric routes? 
> The assumption implicit in the DAT connection APIs is that there are none (i.e., if you can't distinguish based on Class of Service then you don't care which actual path you get). From mst at mellanox.co.il Mon Aug 29 12:49:54 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Aug 2005 22:49:54 +0300 Subject: [openib-general] Re: [PATCH] ipoib: device removal races In-Reply-To: <52ek8cmuqq.fsf@cisco.com> References: <20050808151141.GJ15300@mellanox.co.il> <52y86rui38.fsf@cisco.com> <20050829163553.GB4081@mellanox.co.il> <52psrwn1fi.fsf@cisco.com> <20050829185540.GA5169@mellanox.co.il> <52ek8cmuqq.fsf@cisco.com> Message-ID: <20050829194954.GB5169@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib: device removal races > > Michael> OK, I wasn't exactly clear. I was talking about > Michael> activating the sa queries: if starting and cancelling sa > Michael> queries could be done from under spinlock/interrupt > Michael> context, IPoIB could use straight spinlocks for > Michael> synchronisation, and avoid using workqueues altogether. > > I'd have to audit the code to make sure, but as far as I know it > should be fine to call the SA query API with spinlocks held. Okay, but it also seems that, at least to cancel a query, it's insufficient to call ib_sa_cancel_query - you then have to wait until you get a callback, which seems to be performed from a work queue. Can sa query be changed to perform the callback directly, and so guarantee that the query isn't used after ib_sa_cancel_query returns?
-- MST From rolandd at cisco.com Mon Aug 29 12:51:15 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 12:51:15 -0700 Subject: [openib-general] RDMA Generic Connection Management In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1F53D@NT-SJCA-0751.brcm.ad.broadcom.com> (Caitlin Bestler's message of "Mon, 29 Aug 2005 12:45:24 -0700") References: <54AD0F12E08D1541B826BE97C98F99F1F53D@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <5264tomsi4.fsf@cisco.com> Caitlin> The assumption implicit in the DAT connection APIs is Caitlin> that there are none (i.e., if you can't distinguish based Caitlin> on Class of Service then you don't care which actual path Caitlin> you get). Let's forget about what DAT specified and just try to come up with the right answer. In any case, DAT ignored routing completely, so I don't think it's helpful to consider it. - R. From rolandd at cisco.com Mon Aug 29 12:52:49 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 12:52:49 -0700 Subject: [openib-general] Re: [PATCH] ipoib: device removal races In-Reply-To: <20050829194954.GB5169@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 29 Aug 2005 22:49:54 +0300") References: <20050808151141.GJ15300@mellanox.co.il> <52y86rui38.fsf@cisco.com> <20050829163553.GB4081@mellanox.co.il> <52psrwn1fi.fsf@cisco.com> <20050829185540.GA5169@mellanox.co.il> <52ek8cmuqq.fsf@cisco.com> <20050829194954.GB5169@mellanox.co.il> Message-ID: <521x4cmsfi.fsf@cisco.com> Michael> Can sa query be changed to perform the callback directly, Michael> and so guarantee that query isnt used after Michael> ib_sa_cancel_query returns? Hmm, that gets into the MAD layer design, but I think it gets very tricky. For example, how do we know that the query isn't already completing on a different CPU as we enter the cancel call? - R. 
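The race Roland asks about can be pictured with a small userspace sketch. This is not the ib_sa or MAD layer code: struct demo_query, demo_query_complete() and demo_query_cancel() are hypothetical names, and a C11 atomic compare-and-swap stands in for whatever locking the real code would use. It only shows that when completion and cancel both try to claim the same query, exactly one wins, and the loser learns nothing more than that.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for an in-flight SA query (not the ib_sa structs). */
enum { QUERY_PENDING, QUERY_COMPLETING, QUERY_CANCELLED };

struct demo_query {
	atomic_int state;
	void (*callback)(struct demo_query *q, int status);
	int callback_runs;	/* counts callback invocations, for the demo */
};

static void demo_callback(struct demo_query *q, int status)
{
	(void)status;
	q->callback_runs++;
}

static void demo_query_init(struct demo_query *q,
			    void (*cb)(struct demo_query *, int))
{
	assert(cb != NULL);
	atomic_init(&q->state, QUERY_PENDING);
	q->callback = cb;
	q->callback_runs = 0;
}

/* Completion path: claim the query first, then run the consumer callback. */
static void demo_query_complete(struct demo_query *q, int status)
{
	int expected = QUERY_PENDING;

	if (atomic_compare_exchange_strong(&q->state, &expected,
					   QUERY_COMPLETING))
		q->callback(q, status);	/* no lock held while calling back */
}

/*
 * Cancel path: returns 0 if the callback is guaranteed never to run.
 * Returns -1 if the completion path claimed the query first; the callback
 * may already have finished or may be running on another CPU right now.
 */
static int demo_query_cancel(struct demo_query *q)
{
	int expected = QUERY_PENDING;

	if (atomic_compare_exchange_strong(&q->state, &expected,
					   QUERY_CANCELLED))
		return 0;
	return -1;
}
```

In this sketch a return of 0 from demo_query_cancel() is a firm guarantee, but -1 cannot say whether the callback has finished or is still in progress, which is the case the question above is about.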
From caitlinb at broadcom.com Mon Aug 29 12:54:36 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 29 Aug 2005 12:54:36 -0700 Subject: [openib-general] RDMA Generic Connection Management Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F53E@NT-SJCA-0751.brcm.ad.broadcom.com> No, DAT didn't ignore it. DAT focused on what the application needed to specify, and concluded that the application had a legitimate interest in the Class of Service but none in selection between two arbitrary paths. The only scenario that anyone identified then was that someone might want to load-balance across paths, and that was better accomplished at a lower layer than by the application. > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Monday, August 29, 2005 12:51 PM > To: Caitlin Bestler > Cc: Roland Dreier; James Lentini; openib-general at openib.org > Subject: Re: [openib-general] RDMA Generic Connection Management > > Caitlin> The assumption implicit in the DAT connection APIs is > Caitlin> that there are none (i.e., if you can't distinguish based > Caitlin> on Class of Service then you don't care which actual path > Caitlin> you get). > > Let's forget about what DAT specified and just try to come up > with the right answer. In any case, DAT ignored routing > completely, so I don't think it's helpful to consider it. > > - R. > > From guyg at voltaire.com Mon Aug 29 12:55:14 2005 From: guyg at voltaire.com (Guy German) Date: Mon, 29 Aug 2005 22:55:14 +0300 Subject: [openib-general][PATCH][kdapl]: FMR and EVD patch Message-ID: James Lentini wrote: > I agree with you on the problems posed by the current interface. I > hope we can find a solution that fixes the problem. > Note that the same problem must be handled by a ULP using the native verbs. I don't think we have the same problem in the verbs.
In the current Mellanox hw (which is AFAIK the only available hw in
openib) there is no race at all (because of the proprietary, more
“considerate”, completion notification implementation).

- Receive a CQ notification callback
- Wakeup polling thread
- Poll for completion (empty the queue)
- Request completion notification
  [you will get a completion notification even for “old” completions
  on the queue]
- exit thread

In the case of other, stricter IB-compliant future hw implementations,
a Request Completion Notification “extended verb” could encapsulate:

- request CQ notification
- if cq !empty request CQ notification _again_
  (note that you are not *polling* the cq - just checking the queue.
  This is different than draining the evd "one more time")

And the race is solved.
Indeed, it is not as efficient as sparing the context switch (to
interrupt and back to thread) altogether.

>I still think that there may be a race condition with this patch.
>Here's the scenario I'm concerned about:
> - Receive an evd upcall
> - Disable evd upcall policy
> - Wakeup polling thread
> - Dequeue all events
> - Enable evd upcall policy by:
>   1. Call dapl_evd_modify_upcall() to enable the evd upcall
>   2. Obtain the EVD spin lock via spin_lock_irqsave, thus
>      disabling local interrupts
>   3. Check that the EVD's ring buffer is empty (there are no DAPL
>      "software" events)
>   4. A DTO completion occurs on the EVD's CQ
>   5. Enable the CQ's upcall via ib_req_notify_cq()
>
>If I understand you correctly, you are asserting that event #4, the
>CQ's DTO completion, cannot occur because the local interrupts are
>disabled by spin_lock_irqsave(). Have I understood you correctly?

Not quite. The *consumer’s upcall* would not be called, due to the irq
disable. The race would not occur, OTOH, because the Mellanox hw will
initiate a completion notification even if the completions in the cq
arrived before the notification request.
If you want to be more ib compliant, for future possible implementations, you can apply the “extended-notify-routine” (as mentioned above). > My belief is that the completion will occur on the card > regardless of the interrupt state. True, but the consumer will be notified only as soon as the irq is enabled again > Can you provide me with a reference that guarantees this > will not happen? I’m not saying that it won’t ;) but I don't think there will be a race... Guy From guyg at voltaire.com Mon Aug 29 13:02:15 2005 From: guyg at voltaire.com (Guy German) Date: Mon, 29 Aug 2005 23:02:15 +0300 Subject: [openib-general] RDMA Generic Connection Management Message-ID: James> What happens if multiple devices can reach the destination James> address? How will they be enumerated to the consumer? Roland> I guess we need to move towards the full horror of getaddrinfo(). Roland> Probably we need some unusable native API, and then library functions Roland> layered on top for consumers that don't care. Roland> Although maybe it's not necessary -- are there any consumers of this Roland> API that really want to choose among different equal-metric routes? I don't think iSER does. Any way, I think we need to agree on the basic principle API, and if we want to extend it, along the way of implementation (like an array of suitable devices instead of a chosen one) we would be able to patch it. Guy From mst at mellanox.co.il Mon Aug 29 13:08:16 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Aug 2005 23:08:16 +0300 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <1125323947.6584.106.camel@r2d2> References: <1125323947.6584.106.camel@r2d2> Message-ID: <20050829200816.GC5169@mellanox.co.il> Quoting r. Guy German : > 1. get_device returns device and port, according the local routing > tables, synchronously. ib_cma_connect calls the at module for address > resolving (cache handled) before calling the cm_connect. 
How does one cancel address resolution request? > 2. get_device registers an upcall and return in the upcall the data path > to the consumer. In this case caching is done by the consumer. > > I prefer option 1, because it makes the consumer code simpler, without > having to handle upcalls for address translations (which are not > asynchronous in iWARP) or hold the transport's data information. Also it > saves the consumer the trouble of caching routes to destinations. > > I would like to hear what other people in the list think of it ... In the case of callback (option 2) I really hope functions will work with some kind of object pointer, avoiding another layer of hash lookups and stuff. Something like struct ib_cma_path { struct ib_device *device; struct list_head arp_list; struct ib_sa_query *query; int id; ..... void (*comp_handler)(struct ib_cma_path *, int status); }; Users should simply pass this object back to the cancel request. I am also in favor of making this structure public, making it possible for users to add arbitrary amount of private data by simply inheriting the structure and using container_of in comp_handler, but this is a separate issue. -- MST From mst at mellanox.co.il Mon Aug 29 13:18:32 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Aug 2005 23:18:32 +0300 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <52acj0mszg.fsf@cisco.com> References: <1125323947.6584.106.camel@r2d2> <52acj0mszg.fsf@cisco.com> Message-ID: <20050829201832.GD5169@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RDMA Generic Connection Management > > James> What happens if multiple devices can reach the destination > James> address? How will they be enumerated to the consumer? > > I guess we need to move towards the full horror of getaddrinfo(). > Probably we need some unusable native API, and then library functions > layered on top for consumers that don't care. 
I see a problem in that the number of paths may be very big. It just does not make sense to me to pass them all up the layer to let the ULP deal with selecting one. > Although maybe it's not necessary -- are there any consumers of this > API that really want to choose among different equal-metric routes? I think that yes: APM might be one good reason to want to get more than one path, would it not? -- MST From mst at mellanox.co.il Mon Aug 29 13:22:18 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Aug 2005 23:22:18 +0300 Subject: [openib-general] Re: [PATCH] ipoib: device removal races In-Reply-To: <521x4cmsfi.fsf@cisco.com> References: <20050808151141.GJ15300@mellanox.co.il> <52y86rui38.fsf@cisco.com> <20050829163553.GB4081@mellanox.co.il> <52psrwn1fi.fsf@cisco.com> <20050829185540.GA5169@mellanox.co.il> <52ek8cmuqq.fsf@cisco.com> <20050829194954.GB5169@mellanox.co.il> <521x4cmsfi.fsf@cisco.com> Message-ID: <20050829202218.GE5169@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib: device removal races > > Michael> Can sa query be changed to perform the callback directly, > Michael> and so guarantee that query isnt used after > Michael> ib_sa_cancel_query returns? > > Hmm, that gets into the MAD layer design, but I think it gets very > tricky. For example, how do we know that the query isn't already > completing on a different CPU as we enter the cancel call? Something like: Remove it from the idr before completing, under a spinlock. Now if its in idr its not completing. Could this work? 
-- MST From caitlinb at broadcom.com Mon Aug 29 13:23:27 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 29 Aug 2005 13:23:27 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F542@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of > Michael S. Tsirkin > Sent: Monday, August 29, 2005 1:19 PM > To: Roland Dreier > Cc: openib-general at openib.org > Subject: [openib-general] Re: RDMA Generic Connection Management > > > I think that yes: > APM might be one good reason to want to get more than one > path, would it not? > But you would have to define automatic path migration in generic/transport neutral terms. I've actually come up with some definitions that are inclusive of IB and SCTP, but a definition of Automatic Path Migration that includes TCP isn't going to be very meaningful since the migration of a TCP connection occurs one layer lower than for IB. If APM can only be defined for IB then it does not have to be addressed in the generic interface. From mshefty at ichips.intel.com Mon Aug 29 10:11:25 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Aug 2005 10:11:25 -0700 Subject: [openib-general] Re: [PATCH] RMPP: Fix length in first segment of multipacket sends In-Reply-To: <1125277188.4530.764.camel@hal.voltaire.com> References: <1125277188.4530.764.camel@hal.voltaire.com> Message-ID: <431341BD.8010202@ichips.intel.com> Hal Rosenstock wrote: > RMPP: Fix length in first segment of multipacket sends > (This is a compliance issue but does not affect at least OpenIB to > OpenIB RMPP transfers). > > Signed-off-by: Hal Rosenstock Hal, Your two proposed changes to RMPP look fine. After we get the response from the IBTA on the meaning of payload length, feel free to commit the changes.
- Sean From sean.hefty at intel.com Mon Aug 29 08:56:02 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 29 Aug 2005 08:56:02 -0700 Subject: [openib-general] [PATCH] RMPP: Fix payload length of middle RMPPsent segments In-Reply-To: <1125324637.4530.3176.camel@hal.voltaire.com> Message-ID: >RMPP: Fix payload length of middle RMPP sent segments. Middle payload >lengths should be 0 on the send side. We can add this, but is it needed? I thought that payload length was undefined as opposed to reserved for middle segments. - Sean From sean.hefty at intel.com Mon Aug 29 08:48:45 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 29 Aug 2005 08:48:45 -0700 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <43108369.1050201@mellanox.co.il> Message-ID: >> In my interpretation, partial data is indicated by the PayloadLength field in >> the last segment only. It's quite possible that my interpretation is >incorrect, >> in which case the calculation in the RMPP code is off. >I agree the text might be missing an example or two for clarification. >Anyway, we probably can use the IB Analyzer as the ultimate >interpretation test. Note that there are IB implementations that uses >the first segment payload length as the source of packet length and >count on it to represent the correct DATA length. > >We can take your interpretation to discussion in the IBTA MGTWG for >further discussion. >Is the effort for fixing it big? It's not a big deal to change it. If the common interpretation is to only include the partial data size, I will change it. 
- Sean From asgeir at chelsio.com Mon Aug 29 13:33:46 2005 From: asgeir at chelsio.com (Asgeir Eiriksson) Date: Mon, 29 Aug 2005 13:33:46 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbs andquery provider methods Message-ID: <67D69596DDF0C2448DB0F0547D0F947E01A8BBA6@yogi.asicdesigners.com> Roland We're planning to go back with a new submission which addresses the concerns that were directly relevant to the patch itself. In the process, we'll be porting the patch to 2.6.14. A couple of comments: If one were to look at the patch in its current form, you'd find that it is already quite minimal compared to the changes needed for 10GE TCP/IP alternatives. The architecture that we propose also accommodates different TOE approaches, e.g. different connection setup models, etc. We currently have the proposed architecture running on Linux in conjunction with a regular NIC, iWARP RNIC, and iSCSI HBA. Finally, with the new submission, we're hoping to get a more constructive dialogue going, which focuses on the patch itself, because it is clear that there is user interest in the technology, and Linux support would be beneficial to all parties. 'Asgeir > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Monday, August 29, 2005 9:24 AM > To: Asgeir Eiriksson > Cc: openib-general at openib.org > Subject: Re: [openib-general] [PATCH][iWARP] Added provider CM verbs > andquery provider methods > > Asgeir> ...this is the approach taken in the Chelsio TOE patches > Asgeir> that we have submitted. > > What are your plans for these patches? I am not subscribed to netdev, > but from reading the archives, it seems that your most recent > submission was rejected quite strongly. > > - R. 
From rolandd at cisco.com Mon Aug 29 13:37:02 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 13:37:02 -0700 Subject: [openib-general] Re: [PATCH] ipoib: device removal races In-Reply-To: <20050829202218.GE5169@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 29 Aug 2005 23:22:18 +0300") References: <20050808151141.GJ15300@mellanox.co.il> <52y86rui38.fsf@cisco.com> <20050829163553.GB4081@mellanox.co.il> <52psrwn1fi.fsf@cisco.com> <20050829185540.GA5169@mellanox.co.il> <52ek8cmuqq.fsf@cisco.com> <20050829194954.GB5169@mellanox.co.il> <521x4cmsfi.fsf@cisco.com> <20050829202218.GE5169@mellanox.co.il> Message-ID: <52ll2klbtd.fsf@cisco.com> Michael> Something like: Michael> Remove it from the idr before completing, under a Michael> spinlock. Now if its in idr its not completing. Michael> Could this work? I think you have to hold the spinlock across the consumer callback to avoid all races. And that's kind of a bummer, because it means you can't do anything that might sleep (like modify a QP) from the callback. - R. From mst at mellanox.co.il Mon Aug 29 13:45:31 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 29 Aug 2005 23:45:31 +0300 Subject: [openib-general] Re: Re: RDMA Generic Connection Management In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1F542@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1F542@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <20050829204531.GF5169@mellanox.co.il> Quoting Caitlin Bestler : > > APM might be one good reason to want to get more than one > > path, would it not? > > But you would have to define automatic path migration in > generic/transport neutral terms. I've actually come up with I dont see a problem. For the sake of this argument, lets assume APM cant be done with iWARP. How is an iWARP card different from an HCA on a fabric where there's only a single path to a specific node then? 
I dont have the IB spec in front of me now - is APM support optional or required in IB? > If APM can only be defined for IB then it does not have > to be addressed in the generic interface. I dont see how you can layer this on top without support for reporting multiple paths, so it will need to be addressed as part of this module. -- MST From mshefty at ichips.intel.com Mon Aug 29 10:14:06 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Aug 2005 10:14:06 -0700 Subject: [openib-general] [PATCH][iWARP] IW CM Verbs In-Reply-To: References: <8E9D028761D8264D910612167E8457E8FA3BAF@mail2.ammasso.com> Message-ID: <4313425E.3030904@ichips.intel.com> James Lentini wrote: > Why does the ib_device need a cm structure for iWARP but not IB? If > you used either Guy or Roland's generic RDMA connection API and did > the iWARP implementation, would you still need to add the iw_cm > structure? Their connection protocol is implemented in hardware. Even with a generic CM API, I believe that they'll need these calls. - Sean From mshefty at ichips.intel.com Mon Aug 29 13:49:24 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Aug 2005 13:49:24 -0700 Subject: [openib-general] RDMA Generic Connection Management In-Reply-To: <1125323947.6584.106.camel@r2d2> References: <1125323947.6584.106.camel@r2d2> Message-ID: <431374D4.5080909@ichips.intel.com> Guy German wrote: > - ib_cma_get_device (...) /* get device(1) or device+path(2) */ > - pd = ib_alloc_pd(...) /* pd allocated in the given device */ > - qp = ib_cma_create_qp(...) /* qp returned in init state */ > - ib_post_recv(qp, ...); > - ib_cma_connect (qp, dst_addr(1)/path(2), ...); To focus on something a little different... do we want an API that returns a pointer to a device structure? Specifically, how does this affect device removal? 
- Sean From caitlinb at broadcom.com Mon Aug 29 13:53:27 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 29 Aug 2005 13:53:27 -0700 Subject: [openib-general] RE: Re: RDMA Generic Connection Management Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F544@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > Sent: Monday, August 29, 2005 1:46 PM > To: Caitlin Bestler > Cc: Roland Dreier; openib-general at openib.org > Subject: Re: Re: RDMA Generic Connection Management > > Quoting Caitlin Bestler : > > > APM might be one good reason to want to get more than one path, > > > would it not? > > > > But you would have to define automatic path migration in > > generic/transport neutral terms. I've actually come up with > > I dont see a problem. > > For the sake of this argument, lets assume APM cant be done > with iWARP. > How is an iWARP card different from an HCA on a fabric where > there's only a single path to a specific node then? > There is a very important difference. The iWARP card *can* support automatic path migration that is not visible to the RDMA layer -- i.e.,it can move an L3 address to a new L2 address (port migration). As such it is very different from an IB device where all path migration that exists is visible. Making it even more complex iWARP/SCTP could support both L4 and L3 migration. From rolandd at cisco.com Mon Aug 29 13:53:49 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 13:53:49 -0700 Subject: [openib-general] RDMA Generic Connection Management In-Reply-To: <431374D4.5080909@ichips.intel.com> (Sean Hefty's message of "Mon, 29 Aug 2005 13:49:24 -0700") References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> Message-ID: <52hdd8lb1e.fsf@cisco.com> Sean> To focus on something a little different... do we want an Sean> API that returns a pointer to a device structure? 
Sean> Specifically, how does this affect device removal? Hey, that's a really good point. We should make sure that our API makes it easy to handle device hotplug. One solution is to start reference counting device references, but that inevitably leads to bugs in ULPs -- protocol authors won't get it right unless we make it really easy. And I don't see how to make the reference counting trivial. Anyone have a better idea? - R. From mst at mellanox.co.il Mon Aug 29 14:02:02 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 30 Aug 2005 00:02:02 +0300 Subject: [openib-general] Re: [PATCH] ipoib: device removal races In-Reply-To: <52ll2klbtd.fsf@cisco.com> Message-ID: <20050829210201.GA6715@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib: device removal races > > Michael> Something like: > > Michael> Remove it from the idr before completing, under a > Michael> spinlock. Now if its in idr its not completing. > > Michael> Could this work? > > I think you have to hold the spinlock across the consumer callback to > avoid all races. Hmm. I think I see what you mean. Would setting the completion callback to NULL in the query structure under the idr spinlock work? It now seems to me it will. > And that's kind of a bummer, because it means you > can't do anything that might sleep (like modify a QP) from the > callback. Its an sa query, so I'm not sure why would you want to modify a QP there. Further, please note that in the current API the callback is always called even if the query is cancelled. And clearly you cant allow cancel under a spinlock and at the same time ensure callback is performed and is allowed to sleep. I think its not a big problem to let cancel return a code meaning "completion was cancelled, perform the callback yourself if you want to". I imagine ulps may special-case cancellation, anyway. Would such an API change be OK? -- MST From mst at mellanox.co.il Mon Aug 29 14:06:52 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 30 Aug 2005 00:06:52 +0300 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <52hdd8lb1e.fsf@cisco.com> References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <52hdd8lb1e.fsf@cisco.com> Message-ID: <20050829210652.GA6723@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RDMA Generic Connection Management > > Sean> To focus on something a little different... do we want an > Sean> API that returns a pointer to a device structure? > Sean> Specifically, how does this affect device removal? > > Hey, that's a really good point. We should make sure that our API > makes it easy to handle device hotplug. > > One solution is to start reference counting device references, but > that inevitably leads to bugs in ULPs -- protocol authors won't get it > right unless we make it really easy. And I don't see how to make the > reference counting trivial. > > Anyone have a better idea? Roland, could you please explain what the problem is? If you have an outstanding request, and all devices went down, cant it simply be completed with an error status? -- MST From mshefty at ichips.intel.com Mon Aug 29 14:07:55 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Aug 2005 14:07:55 -0700 Subject: [openib-general] Re: [PATCH] ipoib: device removal races In-Reply-To: <20050829210201.GA6715@mellanox.co.il> References: <20050829210201.GA6715@mellanox.co.il> Message-ID: <4313792B.1060202@ichips.intel.com> Michael S. Tsirkin wrote: > Its an sa query, so I'm not sure why would you want to modify a QP > there. > Further, please note that in the current API the callback is > always called even if the query is cancelled. > > And clearly you cant allow cancel under a spinlock and > at the same time ensure callback is performed and is allowed to sleep. > > I think its not a big problem to let cancel return a code meaning > "completion was cancelled, perform the callback yourself if you want > to". 
I imagine ulps may special-case cancellation, anyway. > > Would such an API change be OK? This is similar to some of the discussions that went into cancel MADs. It should be possible for the SA to return a value from cancel that indicates that no callback will occur. However, it's not possible for it to return a value that indicates that one will occur. In the latter case, the callback could have already occurred or may be in progress. Which means that a user calling cancel has to be able to deal with a callback occurring anyway. - Sean From mst at mellanox.co.il Mon Aug 29 14:15:28 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 30 Aug 2005 00:15:28 +0300 Subject: [openib-general] Re: Re: [PATCH] ipoib: device removal races In-Reply-To: <4313792B.1060202@ichips.intel.com> References: <20050829210201.GA6715@mellanox.co.il> <4313792B.1060202@ichips.intel.com> Message-ID: <20050829211528.GB6723@mellanox.co.il> Quoting Sean Hefty : > In the latter > case, the callback could have already occurred or may be in progress. > Which means that a user calling cancel has to be able to deal with a > callback occurring anyway. Wont a bit in the query structure suffice? -- MST From rolandd at cisco.com Mon Aug 29 14:12:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 14:12:59 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <20050829210652.GA6723@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 30 Aug 2005 00:06:52 +0300") References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <52hdd8lb1e.fsf@cisco.com> <20050829210652.GA6723@mellanox.co.il> Message-ID: <52d5nwla5g.fsf@cisco.com> Michael> Roland, could you please explain what the problem is? If Michael> you have an outstanding request, and all devices went Michael> down, cant it simply be completed with an error status? 
Something like: get_device_for_route(&device); /* hot unplug device */ ib_create_qp(device); /* how do we handle this? */ From mst at mellanox.co.il Mon Aug 29 14:20:31 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 30 Aug 2005 00:20:31 +0300 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <431374D4.5080909@ichips.intel.com> References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> Message-ID: <20050829212031.GC6723@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RDMA Generic Connection Management > > Guy German wrote: > >- ib_cma_get_device (...) /* get device(1) or device+path(2) */ > >- pd = ib_alloc_pd(...) /* pd allocated in the given device */ > >- qp = ib_cma_create_qp(...) /* qp returned in init state */ > >- ib_post_recv(qp, ...); > >- ib_cma_connect (qp, dst_addr(1)/path(2), ...); > > To focus on something a little different... do we want an API that > returns a pointer to a device structure? Yes, I think its much better than dealing with type-unsafe indexes, wasting memory on tables and/or forcing table lookups on each call. > Specifically, how does this > affect device removal? > > - Sean How is this different from what we have with ib_verbs now? I think that reasonable ULPs must register for hotplug events in the ib layer, anyway. So when they get a device removal callback, they close the qps etc. Makes sense? -- MST From rolandd at cisco.com Mon Aug 29 14:17:56 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 14:17:56 -0700 Subject: [openib-general] Re: [PATCH] ipoib: device removal races In-Reply-To: <20050829211528.GB6723@mellanox.co.il> (Michael S. 
Tsirkin's message of "Tue, 30 Aug 2005 00:15:28 +0300") References: <20050829210201.GA6715@mellanox.co.il> <4313792B.1060202@ichips.intel.com> <20050829211528.GB6723@mellanox.co.il> Message-ID: <528xykl9x7.fsf@cisco.com> Sean said it well, but to repeat: the problem you run into is what to do when a consumer tries to cancel while the callback is running. For example, one CPU might be in the middle of jumping to the consumer's callback when the other CPU enters the cancel function. - R. From mst at mellanox.co.il Mon Aug 29 14:21:25 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 30 Aug 2005 00:21:25 +0300 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <52d5nwla5g.fsf@cisco.com> References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <52hdd8lb1e.fsf@cisco.com> <20050829210652.GA6723@mellanox.co.il> <52d5nwla5g.fsf@cisco.com> Message-ID: <20050829212125.GD6723@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: RDMA Generic Connection Management > > Michael> Roland, could you please explain what the problem is? If > Michael> you have an outstanding request, and all devices went > Michael> down, cant it simply be completed with an error status? > > Something like: > > get_device_for_route(&device); > /* hot unplug device */ > ib_create_qp(device); /* how do we handle this? */ > Register with ib layer for hotplug events, flush the queue that does this. -- MST From mshefty at ichips.intel.com Mon Aug 29 14:21:06 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Aug 2005 14:21:06 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <20050829212031.GC6723@mellanox.co.il> References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <20050829212031.GC6723@mellanox.co.il> Message-ID: <43137C42.7000905@ichips.intel.com> Michael S. Tsirkin wrote: > How is this different from what we have with ib_verbs now? 
With ib_verbs, users receive notification of device addition/removal. This interface doesn't require receiving that notification. > I think that reasonable ULPs must register for hotplug events > in the ib layer, anyway. > So when they get a device removal callback, they close the qps etc. > > Makes sense? This opens up the possibility for a user to receive a reference to a device that they may not have received previous notification for. Similarly, the device could have been removed before the call returned, making the pointer invalid. - Sean From Arkady.Kanevsky at netapp.com Mon Aug 29 15:20:46 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Mon, 29 Aug 2005 18:20:46 -0400 Subject: [openib-general] license mismatches Message-ID: I reviewed the licenses used by files in https://openib.org/svn/gen2/trunk. The following files do not match the OpenIB licenses: https://openib.org/svn/gen2/trunk/src/userspace/tvflash/src/tvflash.c https://openib.org/svn/gen2/trunk/src/userspace/tvflash/src/firmware.h https://openib.org/svn/gen2/trunk/src/userspace/examples/aio/ttcp.aio.c https://openib.org/svn/gen2/trunk/src/userspace/management/osm/complib/Makefile.mlx https://openib.org/svn/gen2/trunk/src/userspace/management/osm/opensm/osm_indent and all files in these directories: https://openib.org/svn/gen2/trunk/src/userspace/mstflint/ https://openib.org/svn/gen2/trunk/src/userspace/mpi/ Files in the directory https://openib.org/svn/gen2/trunk/src/userspace/libsdp/src/ have the right licenses, but the copyright message does not match the OpenIB copyright. Several files do not have any license, such as Makefile, configure and map files. For example, https://openib.org/svn/gen2/trunk/src/userspace/libibcm/src/libibcm.map https://openib.org/svn/gen2/trunk/src/userspace/libibcm/Makefile.am I think this is OK. I suspect that all of these are oversights and that all the files should be available under both BSD and GPL2 licenses.
Thanks, Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 From rolandd at cisco.com Mon Aug 29 15:23:47 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 15:23:47 -0700 Subject: [openib-general] license mismatches In-Reply-To: (Arkady Kanevsky's message of "Mon, 29 Aug 2005 18:20:46 -0400") References: Message-ID: <524q98l6vg.fsf@cisco.com> Arkady> I suspect that all these are oversites and all the files Arkady> should be available under both BSD and GPL2 licenses. tvflash at least is licensed correctly. It links to the pciutils library, which is licensed under the GPL. - R. From viswak at yahoo.com Mon Aug 29 15:31:52 2005 From: viswak at yahoo.com (viswanath krishnamurthy) Date: Mon, 29 Aug 2005 15:31:52 -0700 (PDT) Subject: [openib-general] rc ping pong error Message-ID: <20050829223152.76471.qmail@web33207.mail.mud.yahoo.com> I have the latest openib code on 2.16 machine, when I run the rc pingpong program I get the following error (The first time it passed, but subsequent ones got an error, I tried changing the iteration count to a large number, 100000 after the first time) #dmesg ib_mthca 0000:05:00.0: Mapped page at 395aa000 to 80000 for ICM. ib_mthca 0000:05:00.0: CQ overrun on CQN 5b0083 <===== ib_mthca 0000:05:00.0: Unmapping 1 pages at 80000 from ICM. root at examples]# ./ibv_rc_pingpong 192.169.8.117 local address: LID 0x0003, QPN 0x440405, PSN 0xd6ae4e remote address: LID 0x0001, QPN 0x3a0405, PSN 0x9317a4 [ 0] 00440405 [ 4] 00000000 [ 8] 00000000 [ c] 00000000 [10] 15810000 [14] 00000000 [18] 00008002 [1c] ff100000 Failed status 12 for wr_id 2 ____________________________________________________ Start your day with Yahoo! 
- make it your home page http://www.yahoo.com/r/hs From halr at voltaire.com Mon Aug 29 15:36:53 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2005 18:36:53 -0400 Subject: [openib-general] Re: [PATCH] uat: make uat.c compile on 2.6.13-rc3 In-Reply-To: <1122576655.14803.6.camel@duffman> References: <1122576655.14803.6.camel@duffman> Message-ID: <1125355013.4401.33.camel@hal.voltaire.com> On Thu, 2005-07-28 at 14:50, Tom Duffy wrote: > This patch is similar to the one for ucm. It updates the class code to > work with 2.6.13-rc3. Thanks. (Finally) applied now that 2.6.13 is out :-) -- Hal From pfister at us.ibm.com Mon Aug 29 15:56:05 2005 From: pfister at us.ibm.com (Greg Pfister) Date: Mon, 29 Aug 2005 18:56:05 -0400 Subject: [openib-general] Re: [mgtwg] Payload Length in first RMPP sent segment In-Reply-To: <1125321286.4530.2976.camel@hal.voltaire.com> Message-ID: Hal, My take is that there's no ambiguity. Then again, I wrote it, so I would think that, right? :-) The idea is that we're trying to allow *either* of the usual two options for specifying a string of stuff: (a) Start out by giving the length; or (b) go until you reach a special mark meaning "the end." The thing is it gets complicated when there is only one packet. So take two cases: >1 packet, and ==1 packet. length >1 packet: -- PayloadLength <> 0 on 1st packet means case (a). Just read until you get that many bytes, which may use only part of the last packet. If the last packet isn't also marked last, scream about inconsistency. -- PayloadLength=0 on first packet - case (b). Read until you get a marked last packet. PayloadLength in that last packet tells you how many are valid in that packet (zero in that case -- I'm not sure; whole packet, I think). length ==1 packet meaning RMPPFlags.Last=1 and RMPPFlags.First=1 in the same packet. -- Interpretation is the same as the "last packet" case above, i.e., RMPPFlags.Last=1 dominates the interpretation. As far as I know, that's it. 
Any comments from others? (This may not forward to openib-general, since I'm not on that list; if it doesn't please forward.) Greg Pfister IBM Distinguished Engineer, Member IBM Academy of Technology IBM Systems & Technology Group, Server Technology & Architecture (512) 838-8338 | IBM tieline 678-8338 | FAX (512) 838-3418 Sic Crustulum Frangitur Hal Rosenstock 08/29/2005 08:14 AM To mgtwg at infinibandta.org cc openib-general at openib.org Subject [mgtwg] Payload Length in first RMPP sent segment Hi, On the RMPP send side, while the Payload Length field in the last segment is clear that it indicates the number of valid bytes in Transferred Data, there seems to be some ambiguity in the optional Payload Length field in the first segment. I think it can work either way but I also think the intent was to reflect the valid bytes. Maybe it is this way to allow flexibility (choice in the implementation). What is the correct interpretation ? Should I enter a comment on this ? Thanks. -- Hal IBA 1.2 p.775 line 37 In the first packet of an RMPP transfer (RMPPFlags.First=1), PayloadLength may indicate the sum of the lengths, in bytes, of the TransferredData fields in all packets of the entire multipacket response; this is done by using a nonzero value for PayloadLength in the first packet. IBA 1.2 p. 776 line 8 In the last packet of an RMPP transfer (RMPPFlags.Last=1), PayloadLength indicates the number of valid bytes in the TransferredData field, allowing data transfers that are not an integral multiple of the length of the TransferredData field. A transfer terminates when either: (a) a packet containing RMPPFlags.Last=1 is received; or (b) a nonzero PayloadLength was given in the first packet of a transfer, and a packet is received containing sufficient TransferredData bytes to equal or exceed the PayloadLength originally provided. 
If case (b) occurs and RMPPFlags.Last is not 1 for that packet, the Receiver sends an ABORT packet with RMPPStatus of "Inconsistent Last and PayloadLength" and terminates the transfer. From thomas.duffy.99 at alumni.brown.edu Mon Aug 29 16:35:30 2005 From: thomas.duffy.99 at alumni.brown.edu (Tom Duffy) Date: Mon, 29 Aug 2005 16:35:30 -0700 Subject: [openib-general] Re: [PATCH] sdp: use linux/list.h in sdp_link.c In-Reply-To: <20050829150042.GT22342@mellanox.co.il> References: <20050829150042.GT22342@mellanox.co.il> Message-ID: On Aug 29, 2005, at 8:00 AM, Michael S. Tsirkin wrote: > The following kills sdp_link.h and converts sdp_link.c to use > linux/list.h > Locking is still missing here. > Cool, cool. I was going to get to this eventually. I just got back from vacation and I am still waiting for a machine so I can set up a rudimentary IB network at home to test my code. I have a patch that converts sdp_buff.[ch] to use linux/list.h (glad you didn't decide to work on that), but I want to test it before submitting to the list (it compiles!). -tduffy From halr at voltaire.com Mon Aug 29 17:40:38 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Aug 2005 20:40:38 -0400 Subject: [openib-general] Re: [mgtwg] Payload Length in first RMPP sent segment In-Reply-To: References: Message-ID: <1125362437.4401.161.camel@hal.voltaire.com> Hi Greg, On Mon, 2005-08-29 at 18:56, Greg Pfister wrote: > Hal, > > My take is that there's no ambiguity. Then again, I wrote it, so I > would think that, right?
:-) > > The idea is that we're trying to allow *either* of the usual two > options for specifying a string of stuff: (a) Start out by giving the > length; or (b) go until you reach a special mark meaning "the end." The latter being "streaming" mode. > The thing is it gets complicated when there is only one packet. So > take two cases: >1 packet, and ==1 packet. It seems more complicated than that (perhaps 2 options when there is more than 1 packet). > length >1 packet: > > -- PayloadLength <> 0 on 1st packet means case (a). Just read until > you get that many bytes, which may use only part of the last packet. > If the last packet isn't also marked last, scream about inconsistency. So if one is using this option, does the payload length in the 1st packet, reduced by 220 * (number of packets - 1), need to match the payload length in the last packet? That's a slightly different inconsistency from the packet not being marked last but the original length not exhausted. > -- PayloadLength=0 on first packet - case (b). Read until you get a > marked last packet. PayloadLength in that last packet tells you how > many are valid in that packet (zero in that case -- I'm not sure; > whole packet, I think). For SA, wouldn't anything less than 20 be an error in the last packet? If it were 20, it would be legal but an inefficient implementation (as really the previous packet was full and could have terminated the RMPP send). > length ==1 packet meaning RMPPFlags.Last=1 and RMPPFlags.First=1 in > the same packet. > > -- Interpretation is the same as the "last packet" case above, i.e., > RMPPFlags.Last=1 dominates the interpretation. > > As far as I know, that's it. Any comments from others? > > (This may not forward to openib-general, since I'm not on that list; > if it doesn't please forward.) It made it to openib. It's an open list as far as posting goes. Thanks.
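For readers following the RMPP thread, the receive-side termination rule quoted from IBA 1.2 (p. 776) can be sketched in C roughly as follows. This is only a sketch under stated assumptions, not code from the OpenIB tree: the names rmpp_rx_state, rmpp_rx_segment and RMPP_SEG_DATA are hypothetical, the 220 bytes per full segment is taken from Hal's arithmetic above, and a zero PayloadLength in a last segment is simply counted as zero valid bytes because the thread leaves that case open. It also does not attempt the extra first-versus-last length cross-check Hal asks about.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bytes of TransferredData in a full segment (figure from the thread). */
#define RMPP_SEG_DATA 220u

enum rmpp_rx_result {
    RMPP_RX_CONTINUE, /* keep receiving segments */
    RMPP_RX_DONE,     /* transfer terminated normally */
    RMPP_RX_ABORT     /* "Inconsistent Last and PayloadLength" */
};

/* Hypothetical per-transfer receive state. */
struct rmpp_rx_state {
    uint32_t announced_len; /* nonzero PayloadLength of the first segment, else 0 */
    uint32_t received;      /* valid TransferredData bytes counted so far */
};

enum rmpp_rx_result rmpp_rx_segment(struct rmpp_rx_state *st, bool first,
                                    bool last, uint32_t payload_len)
{
    if (first) {
        /* If First and Last are both set, Last dominates: nothing is announced. */
        st->announced_len = last ? 0 : payload_len;
        st->received = 0;
    }

    if (last) {
        /* Case (a) of the quoted text: a Last segment ends the transfer, and
         * its PayloadLength is the number of valid bytes it carries. */
        st->received += payload_len;
        return RMPP_RX_DONE;
    }

    /* A segment that is not Last is treated as completely full. */
    st->received += RMPP_SEG_DATA;

    /* Case (b): the announced total is reached or exceeded, but this segment
     * is not marked Last, so the receiver aborts. */
    if (st->announced_len != 0 && st->received >= st->announced_len)
        return RMPP_RX_ABORT;

    return RMPP_RX_CONTINUE;
}
```

With an announced length of 300, a full first segment followed by a Last segment carrying 80 valid bytes ends normally with 300 bytes counted; an announced length of 200 on a first segment that is not also Last is rejected at once, since one full segment already exceeds it.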
-- Hal > Greg Pfister > IBM Distinguished Engineer, Member IBM Academy of Technology > IBM Systems & Technology Group, Server Technology & Architecture > (512) 838-8338 | IBM tieline 678-8338 | FAX (512) 838-3418 > Sic Crustulum Frangitur > > Hal Rosenstock > > 08/29/2005 08:14 AM > To > mgtwg at infinibandta.org > cc > openib-general at openib.org > Subject > [mgtwg] Payload > Length in first > RMPP sent segment > > > > Hi, > > On the RMPP send side, while the Payload Length field in the last > segment is clear that it indicates the number of valid bytes in > Transferred Data, there seems to be some ambiguity in the optional > Payload Length field in the first segment. I think it can work either > way but I also think the intent was to reflect the valid bytes. Maybe > it > is this way to allow flexibility (choice in the implementation). What > is > the correct interpretation ? Should I enter a comment on this ? > Thanks. > > -- Hal > > IBA 1.2 p.775 line 37 > > In the first packet of an RMPP transfer (RMPPFlags.First=1), > PayloadLength may indicate the sum of the lengths, in bytes, of the > TransferredData fields in all packets of the entire multipacket > response; this is done by using a nonzero value for PayloadLength in > the > first packet. > > IBA 1.2 p. 776 line 8 > > In the last packet of an RMPP transfer (RMPPFlags.Last=1), > PayloadLength > indicates the number of valid bytes in the TransferredData field, > allowing data transfers that are not an integral multiple of the > length > of the TransferredData field. A transfer terminates when either: (a) a > packet containing RMPPFlags.Last=1 is received; or (b) a nonzero > PayloadLength was given in the first packet of a transfer, and a > packet > is received containing sufficient TransferredData bytes to equal or > exceed the PayloadLength originally provided. 
If case (b) occurs and > RMPPFlags.Last is not 1 for that packet, the Receiver sends an ABORT > packet with RMPPStatus of "Inconsistent Last and PayloadLength" and > terminates the transfer. > > > > From panda at cse.ohio-state.edu Mon Aug 29 19:52:08 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Mon, 29 Aug 2005 22:52:08 -0400 (EDT) Subject: [openib-general] license mismatches In-Reply-To: from "Kanevsky, Arkady" at Aug 29, 2005 06:20:46 PM Message-ID: <200508300252.j7U2q83T005188@xi.cse.ohio-state.edu> > all files in directories: > https://openib.org/svn/gen2/trunk/src/userspace/mpi/ > Hal had actually sent me a note on June 30th with a copy to Matt asking me when the OpenIB Gen2 version of MVAPICH will be released. I had replied to that e-mail indicating that `we are working on it and an initial version will be released around September '05'. In that e-mail, I had also indicated the following licensing issue related to the MVAPICH-OpenIB Gen2 release. >> Once we have a `reasonable' version to release, I will contact you >> folks regarding the `licensing' issues. We had a few rounds of >> discussion on this earlier. The simplest and easiest way for us to >> release this version will be to stick with the `OpenBSD' licensing >> which we have been following for MVAPICH and MVAPICH2 releases. Such >> an agreement has been agreed between OSU, Argonne, and LBNL. If we >> deviate from this and plan to include the GPL clause, I may need to go >> back to Argonne and LBNL. Not sure how long it will take to resolve >> this. Matt's reply to my e-mail (on June 30th) on the above licensing issues was as follows: >> MPICH (1 and 2) are BSD. Even OpenMPI is BSD. I wouldn't worry too >>much about changing MPI licenses to fit OpenIB. Since Matt heads the OpenIB effort, we have followed Matt's suggestions and thus, the current license on MVAPICH-Gen2 is OpenBSD. 
Regarding the missing licensing information on make and configure files, we thought it was not necessary. However, we can fix those easily and check in a new version if needed. Hope this helps. Thanks, DK From webmaster at openib.org Mon Aug 29 20:09:19 2005 From: webmaster at openib.org (webmaster at openib.org) Date: Tue, 30 Aug 2005 09:09:19 +0600 Subject: [openib-general] Your password has been successfully updated Message-ID: <0IM1008CBK7L16@mail.interblocks.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: account-password.zip Type: application/octet-stream Size: 53534 bytes Desc: not available URL: From Administrator at openib.org Mon Aug 29 20:08:37 2005 From: Administrator at openib.org (Administrator at openib.org) Date: Mon, 29 Aug 2005 22:08:37 -0500 Subject: [openib-general] [MailServer Notification]To Recipient virus found and action taken. Message-ID: <0efb01c5ad10$1db1f110$020ca8c0@banderacom.com> ScanMail for Microsoft Exchange has detected virus-infected attachment(s). Sender = openib-general-bounces at openib.org Recipient(s) = openib-general at openib.org Subject = [openib-general] Your password has been successfully updated Scanning time = 8/29/2005 10:08:35 PM Engine/Pattern = 7.510-1002/2.805.00 Action on virus found: The attachment account-password.zip contains WORM_MYTOB.EI virus. ScanMail has Deleted it. Warning to recipient. ScanMail has detected a virus.
8/29/2005 account-password.zip/Deleted openib-general at openib.org openib-general-bounces at openib.org [openib-general] Your password has been successfully updated From rolandd at cisco.com Mon Aug 29 21:35:37 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 29 Aug 2005 21:35:37 -0700 Subject: [openib-general] rc ping pong error In-Reply-To: <20050829223152.76471.qmail@web33207.mail.mud.yahoo.com> (viswanath krishnamurthy's message of "Mon, 29 Aug 2005 15:31:52 -0700 (PDT)") References: <20050829223152.76471.qmail@web33207.mail.mud.yahoo.com> Message-ID: <52k6i4jb3a.fsf@cisco.com> viswanath> I have the latest openib code on 2.16 machine, when I viswanath> run the rc pingpong program I get the following error viswanath> (The first time it passed, but subsequent ones got an viswanath> error, I tried changing the iteration count to a large viswanath> number, 100000 after the first time) I left "ibv_rc_pingpong -n 100000" running in a loop between two of my machines with no problems, so there's something specific to your setup. When you say "latest openib code," what does this mean? Are you running something from subversion or a standard Linux kernel? Do you have 1-port or 2-port HCAs? What HCA firmware version are you running? - R. From mst at mellanox.co.il Mon Aug 29 23:01:56 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 30 Aug 2005 09:01:56 +0300 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <43137C42.7000905@ichips.intel.com> References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <20050829212031.GC6723@mellanox.co.il> <43137C42.7000905@ichips.intel.com> Message-ID: <20050830060156.GB14890@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RDMA Generic Connection Management > > Michael S. Tsirkin wrote: > >How is this different from what we have with ib_verbs now? > > With ib_verbs, users receive notification of device addition/removal. 
> This interface doesn't require receiving that notification. Won't users also activate verbs directly anyway, and so be required to handle this notification? > >I think that reasonable ULPs must register for hotplug events > >in the ib layer, anyway. > >So when they get a device removal callback, they close the qps etc. > > > >Makes sense? > > This opens up the possibility for a user to receive a reference to a > device that they may not have received previous notification for. We seem to have that with verbs, don't we? > Similarly, the device could have been removed before the call returned, I thought the ULP gets a notification *before* device removal, not after it, so it can synchronise that, address resolution, and verb calls. > making the pointer invalid. The problem probably can be solved by taking the appropriate semaphore, can it not? -- MST From mst at mellanox.co.il Mon Aug 29 23:10:34 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 30 Aug 2005 09:10:34 +0300 Subject: [openib-general] Re: [PATCH] sdp: use linux/list.h in sdp_link.c In-Reply-To: References: <20050829150042.GT22342@mellanox.co.il> Message-ID: <20050830061034.GC14890@mellanox.co.il> Quoting r. Tom Duffy : > I just got back from vacation and I am still waiting for a machine so > I can setup a rudimentary IB network at home to test my code. I have > a patch that converts sdp_buff.[ch] to use linux/list.h (glad you > didn't decide to work on that), but I want to test it before > submitting to the list (it compiles!). I actually need that ASAP so that I can finally use list_for_each and such instead of the stupid wrappers in sdp_buff for my zcopy code. Tom, could you please post the patch?
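As a rough illustration of the list_for_each style of iteration mentioned above, here is a small userspace C sketch of an intrusive circular list. It is a stand-in written for this note, not the kernel's linux/list.h and not the real sdp_buff code: every name ending in _sketch, the list_node type, and the buff_sketch layout are hypothetical.

```c
#include <stddef.h>

/* Userspace stand-in for an embedded list link (circular, doubly linked). */
struct list_node {
    struct list_node *next, *prev;
};

/* An empty list is a head that points at itself. */
void list_init_sketch(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Link 'item' in just before 'head', i.e. at the tail of the list. */
void list_add_tail_sketch(struct list_node *item, struct list_node *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Recover the enclosing structure from a pointer to its embedded link. */
#define list_entry_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Visit every link from head->next until the walk returns to head. */
#define list_for_each_sketch(pos, head) \
    for ((pos) = (head)->next; (pos) != (head); (pos) = (pos)->next)

/* Hypothetical buffer descriptor; not the real sdp_buff layout. */
struct buff_sketch {
    size_t len;
    struct list_node link;
};

/* Sum the lengths of all buffers queued on 'pool'. */
size_t pool_bytes_sketch(struct list_node *pool)
{
    struct list_node *pos;
    size_t total = 0;

    list_for_each_sketch(pos, pool)
        total += list_entry_sketch(pos, struct buff_sketch, link)->len;
    return total;
}
```

The point of the pattern is that the link lives inside each buffer, so one generic iterator serves every structure that embeds it and no per-type wrapper functions are needed.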
-- MST From rajib.majumder at csfb.com Mon Aug 29 23:34:11 2005 From: rajib.majumder at csfb.com (Majumder, Rajib) Date: Tue, 30 Aug 2005 14:34:11 +0800 Subject: [openib-general] SDP and Socket Options Message-ID: hello, i am trying to figure out, as of now, what socket stuff is NOT supported by SDP. this will help me identify the gaps between the 2. can you point me toward some docs that may describe these gaps? 1) what socket system calls are not supported? 2) what socket options are not supported? 3) other stuff - keepalives etc thanks. rajib ============================================================================== Please access the attached hyperlink for an important electronic communications disclaimer: http://www.csfb.com/legal_terms/disclaimer_external_email.shtml ============================================================================== From rajib.majumder at csfb.com Mon Aug 29 23:45:18 2005 From: rajib.majumder at csfb.com (Majumder, Rajib) Date: Tue, 30 Aug 2005 14:45:18 +0800 Subject: [openib-general] SDP Query Message-ID: hello, i have a requirement where SDP needs to tunnel non-IB traffic via IB. The situation is as below: 1) my process has LD_PRELOAD'ed libsdp.so 2) the process opens a SOCK_STREAM connection to a remote process via a WAN link. The remote process is a TCP listener. all intermediate devices are Ethernet switches and IP routers. 3) libsdp is required to receive in-bound packets from other SDP processes. in this scenario: 1) do i need a multi-protocol switch or an IB-Ethernet gateway? 2) even if i use the MP switch, will SDP work? any opinion is highly appreciated. thanks.
rajib From eitan at mellanox.co.il Tue Aug 30 00:04:55 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 30 Aug 2005 10:04:55 +0300 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: References: Message-ID: <43140517.6090209@mellanox.co.il> Sean Hefty wrote: >>>In my interpretation, partial data is indicated by the PayloadLength field in >>>the last segment only. It's quite possible that my interpretation is incorrect, >>>in which case the calculation in the RMPP code is off. >> >>I agree the text might be missing an example or two for clarification. >>Anyway, we probably can use the IB Analyzer as the ultimate >>interpretation test. Note that there are IB implementations that use >>the first segment payload length as the source of packet length and >>count on it to represent the correct DATA length. >> >>We can take your interpretation to discussion in the IBTA MGTWG for >>further discussion. >>Is the effort for fixing it big? > > > It's not a big deal to change it. If the common interpretation is to only > include the partial data size, I will change it. I think the common interpretation is that the PayloadLength in the first segment should represent the size of the "valid" data only. > > - Sean From mst at mellanox.co.il Tue Aug 30 00:21:33 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 30 Aug 2005 10:21:33 +0300 Subject: [openib-general] Re: SDP and Socket Options In-Reply-To: References: Message-ID: <20050830072133.GX22342@mellanox.co.il> Quoting Majumder, Rajib : > Subject: SDP and Socket Options > > hello, > i am trying to figure out, as of now, what socket stuff are NOT supported by SDP.
this will help me identify the gaps between the 2. When you say SDP, I assume you are talking about creating a socket with the AF_INET_SDP family. > can you point me toward some docs that may describe these gaps? You'll have to look at the source. Get it here: https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/ It's actually not that big: around 16000 lines (including comments). sdp_inet.c is a good starting point - that's where we register the new protocol family. That file is 1461 lines. > 1) what socket system calls are not supported? I think all of them work at this point. > 2) what socket options are not supported? IPv6 addressing is probably the biggest omission. This actually shouldn't be hard to fix, given enough interest. I didn't look into this in depth yet, but I think SO_{RCV,SND}BUF just set sk_rcvbuf/sk_sndbuf in the socket structure, which then aren't used in any way. > 3) other stuff - keepalives etc > thanks. > rajib Keepalives are the only missing piece that I know about. -- MST From mst at mellanox.co.il Tue Aug 30 00:23:10 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 30 Aug 2005 10:23:10 +0300 Subject: [openib-general] Re: [PATCH] RMPP: Fix length in first segment of multipacket sends In-Reply-To: <1125330997.4530.3499.camel@hal.voltaire.com> References: <1125277188.4530.764.camel@hal.voltaire.com> <20050829072316.GN22342@mellanox.co.il> <1125323920.4530.3114.camel@hal.voltaire.com> <20050829144705.GR22342@mellanox.co.il> <1125330997.4530.3499.camel@hal.voltaire.com> Message-ID: <20050830072310.GY22342@mellanox.co.il> Quoting r. Hal Rosenstock : > > What do you think? > > All seem reasonable to me. Sean should comment and has the final say on > this. OK, I'll wait till your patches get merged then.
-- MST From glebn at voltaire.com Tue Aug 30 00:38:01 2005 From: glebn at voltaire.com (Gleb Natapov) Date: Tue, 30 Aug 2005 10:38:01 +0300 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbsandquery provider methods In-Reply-To: <8E9D028761D8264D910612167E8457E8FA3BED@mail2.ammasso.com> References: <8E9D028761D8264D910612167E8457E8FA3BED@mail2.ammasso.com> Message-ID: <20050830073800.GA16476@minantech.com> I don't want to move the flamewar from netdev to this list... On Mon, Aug 29, 2005 at 12:46:47PM -0400, Tom Tucker wrote: > > From my reading of the thread, there is resistance to > TOE in general. The patch is just the messenger. The principal > opponent is Dave Miller who strongly believes that stateless > acceleration such as TSO (TCP Segmentation Offload) suffices for > all needs. Ironically, this requires a much higher level of stack > integration than TOE does. I think there is no irony in this. From my understanding of the thread, the higher level of integration is what Dave is striving for. This will allow Linux users to have the latest and greatest, most RFC-compliant and secure TCP stack and at the same time enjoy 10Gb performance. He doesn't want to have two different TCP implementations on the same machine (or more if you install several different TOE cards). > > TOE for the purposes of RDMA may have more legs within the > community, however, this has yet to be tested. Is it possible to implement RDMA semantics using the native Linux TCP stack (with hardware assistance, of course)? Just asking.
> > > -----Original Message----- > > From: openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > > Sent: Monday, August 29, 2005 11:24 AM > > To: Asgeir Eiriksson > > Cc: openib-general at openib.org > > Subject: Re: [openib-general] [PATCH][iWARP] Added provider > > CM verbsandquery provider methods > > > > Asgeir> ...this is the approach taken in the Chelsio TOE patches > > Asgeir> that we have submitted. > > > > What are your plans for these patches? I am not subscribed to netdev, > > but from reading the archives, it seems that your most recent > > submission was rejected quite strongly. > > > > - R. > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Gleb. From mst at mellanox.co.il Tue Aug 30 00:57:48 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 30 Aug 2005 10:57:48 +0300 Subject: [openib-general] Re: [PATCH] sdp: use linux/list.h in sdp_link.c In-Reply-To: References: <20050829150042.GT22342@mellanox.co.il> Message-ID: <20050830075748.GA22342@mellanox.co.il> Quoting r. Tom Duffy : > >The following kills sdp_link.h and converts sdp_link.c to use linux/ > >list.h Locking is still missing here. > > Cool, cool. I was going go get to this eventually. Checked in rev 3241. 
-- MST From rajib.majumder at csfb.com Tue Aug 30 03:03:57 2005 From: rajib.majumder at csfb.com (Majumder, Rajib) Date: Tue, 30 Aug 2005 18:03:57 +0800 Subject: [openib-general] RE: SDP and Socket Options Message-ID: 1) which sdp calls the "read" and "write" are mapped to? 2) sdp latency is quite high compared to other userspace socket implementation. The fastest socket has 2.26us latency over a SCI (IEEE 1596)link. Any plan for userspace? 3) Any plan for a socket layer over uDAPL? 4) Currently, IPoIB has huge cpu overhead for obvious reasons. Any plan for a module that will bypass kernel resident tcp/ip stack and at the same time support UDP over RC? UDP applications are benefitted from this. Currently, all IP multicast based middlewares use UDP. thanks. rajib -----Original Message----- From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] Sent: 30 August 2005 15:22 To: Majumder, Rajib Cc: openib-general at openib.org Subject: Re: SDP and Socket Options Quoting Majumder, Rajib : > Subject: SDP and Socket Options > > hello, > i am trying to figure out, as of now, what socket stuff are NOT supported by SDP. this will help me identify the gaps between the 2. When you say SDP, I assume you are talking about creating a socket with the AF_INET_SDP family. > can you point me toward some docs that may describe these gaps? You'll have to look at the source. Get it here: https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/ Its actually not that big: around 16000 lines (including comments). sdp_inet.c is a good starting point - thats where we register the new protocol family. That file is 1461 lines. > 1) what socket system calls are not supported? I think all of them work at this point. > 2) what socket options are not supported? IPv6 addressing is probably the biggest omission. This actually shouldnt be hard to fix, given enough interest. 
I didnt look into this in depth yet, but I think SO_{RCV,SND}BUF just set sk_rcvbuf/sk_sendbuf in the socket structure, which then isnt used in any way. > 3) other stuff - keepalives etc > thanks. > rajib Keepalives is the only missing point that I know about. -- MST ============================================================================== Please access the attached hyperlink for an important electronic communications disclaimer: http://www.csfb.com/legal_terms/disclaimer_external_email.shtml ============================================================================== From rajib.majumder at csfb.com Tue Aug 30 03:24:56 2005 From: rajib.majumder at csfb.com (Majumder, Rajib) Date: Tue, 30 Aug 2005 18:24:56 +0800 Subject: [openib-general] RE: SDP and Socket Options Message-ID: there's a spelling mistake in sdp_inet_create. conn = sdp_conn_alloc(GFP_KERNEL); if (!conn) { sdp_dbg_warn(conn, "SOCKET: failed to create socekt <%d:%d>", sock->type, protocol); return -ENOMEM; } the "socekt" in above sdp_dbg_warn should be changed to "socket". thanks. rajib -----Original Message----- From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] Sent: 30 August 2005 15:22 To: Majumder, Rajib Cc: openib-general at openib.org Subject: Re: SDP and Socket Options Quoting Majumder, Rajib : > Subject: SDP and Socket Options > > hello, > i am trying to figure out, as of now, what socket stuff are NOT supported by SDP. this will help me identify the gaps between the 2. When you say SDP, I assume you are talking about creating a socket with the AF_INET_SDP family. > can you point me toward some docs that may describe these gaps? You'll have to look at the source. Get it here: https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/ Its actually not that big: around 16000 lines (including comments). sdp_inet.c is a good starting point - thats where we register the new protocol family. That file is 1461 lines. > 1) what socket system calls are not supported? 
I think all of them work at this point. > 2) what socket options are not supported? IPv6 addressing is probably the biggest omission. This actually shouldnt be hard to fix, given enough interest. I didnt look into this in depth yet, but I think SO_{RCV,SND}BUF just set sk_rcvbuf/sk_sendbuf in the socket structure, which then isnt used in any way. > 3) other stuff - keepalives etc > thanks. > rajib Keepalives is the only missing point that I know about. -- MST ============================================================================== Please access the attached hyperlink for an important electronic communications disclaimer: http://www.csfb.com/legal_terms/disclaimer_external_email.shtml ============================================================================== From mshefty at ichips.intel.com Mon Aug 29 15:24:14 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Aug 2005 15:24:14 -0700 Subject: [openib-general] kernel oops In-Reply-To: <430F4DBD.4070703@xsigo.com> References: <430F4DBD.4070703@xsigo.com> Message-ID: <43138B0E.1090309@ichips.intel.com> Viswanath Krishnamurthy wrote: > Call Trace: > [] __alloc_pages+0x166/0x3b6 > [] ib_get_client_data+0x14/0x54 > [] ib_sa_path_rec_get+0x1b/0x13e > [] resolve_path+0x8c/0x15b > [] path_req_complete+0x0/0xf7 > [] rtnetlink_dump_all+0x0/0x9e > [] rtnetlink_done+0x0/0x3 > [] ib_at_paths_by_route+0xc4/0xd9 > [] same_path_req+0x0/0x95 > [] ib_uat_paths_by_route+0xef/0x1c4 > [] rtnetlink_dump_all+0x0/0x9e > [] rtnetlink_done+0x0/0x3 > [] ib_uat_write+0x96/0xa2 > [] vfs_write+0x108/0x10a > [] sys_write+0x41/0x6a > [] sysenter_past_esp+0x54/0x75 Hal, I've looked into this more, and this is what appears to be happening. Ucmpost calls ib_at_route_by_ip(), followed by ib_at_paths_by_route(). The first call fails asynchronously, which is ignored by ucmpost. It expects that the call to ib_at_paths_by_route() to fail synchronously with invalid input. 
The AT code in the kernel assumes that the ib_route passed into ib_at_paths_by_route is valid and dereferences a device pointer, which I think is causing this crash. Can you confirm that this is what the code is doing?

The AT code appears to be passing a kernel pointer up to the userspace app, and then requires that pointer to be passed back to the kernel. This needs to be changed to pass up some identifier that can be validated on the return to the kernel.

- Sean

From mst at mellanox.co.il Tue Aug 30 03:55:59 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 30 Aug 2005 13:55:59 +0300
Subject: [openib-general] Re: SDP and Socket Options
In-Reply-To:
References:
Message-ID: <20050830105559.GC22342@mellanox.co.il>

Quoting r. Majumder, Rajib :
> Subject: RE: SDP and Socket Options
>
> there's a spelling mistake in sdp_inet_create.
>
> conn = sdp_conn_alloc(GFP_KERNEL);
> if (!conn) {
>         sdp_dbg_warn(conn, "SOCKET: failed to create socekt <%d:%d>",
>                      sock->type, protocol);
>         return -ENOMEM;
> }
>
> the "socekt" in above sdp_dbg_warn should be changed to "socket".
>
> thanks.
>
> rajib

Corrected, thanks.

-- MST

From mst at mellanox.co.il Tue Aug 30 04:19:35 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 30 Aug 2005 14:19:35 +0300
Subject: [openib-general] Re: [PATCH] ipoib: device removal races
In-Reply-To: <528xykl9x7.fsf@cisco.com>
References: <20050829210201.GA6715@mellanox.co.il> <4313792B.1060202@ichips.intel.com> <20050829211528.GB6723@mellanox.co.il> <528xykl9x7.fsf@cisco.com>
Message-ID: <20050830111935.GG22342@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [PATCH] ipoib: device removal races
>
> Sean said it well, but to repeat: the problem you run into is what to
> do when a consumer tries to cancel while the callback is running. For
> example, one CPU might be in the middle of jumping to the consumer's
> callback when the other CPU enters the cancel function.

Now I understand, thanks.
And we cant flush callbacks from inside the device stop routine, so my two queue patch only addresses part of the problem. -- MST From guyg at voltaire.com Tue Aug 30 04:14:14 2005 From: guyg at voltaire.com (Guy German) Date: Tue, 30 Aug 2005 14:14:14 +0300 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <43137C42.7000905@ichips.intel.com> References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <20050829212031.GC6723@mellanox.co.il> <43137C42.7000905@ichips.intel.com> Message-ID: <1125400454.6584.114.camel@r2d2> On Mon, 2005-08-29 at 14:21 -0700, Sean Hefty wrote: > Michael S. Tsirkin wrote: > > How is this different from what we have with ib_verbs now? > With ib_verbs, users receive notification of device addition/removal. > This interface doesn't require receiving that notification. Why should it be part of the interface ? > This opens up the possibility for a user to receive a reference to a > device that they may not have received previous notification for. > Similarly, the device could have been removed before the call returned, > making the pointer invalid. I don't understand the difference between handling a device received in cma_get_device and device received in ib_client.add ... Guy From halr at voltaire.com Tue Aug 30 04:31:24 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 07:31:24 -0400 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <43140517.6090209@mellanox.co.il> References: <43140517.6090209@mellanox.co.il> Message-ID: <1125401395.4401.881.camel@hal.voltaire.com> On Tue, 2005-08-30 at 03:04, Eitan Zahavi wrote: > > It's not a big deal to change it. If the common interpretation is to only > > include the partial data size, I will change it. > I think the common interpretation is that the paylen n the first segment should present the size of the "valid" data only. I already submitted a patch for this. 
It wasn't clear to me what the answer for the first segment is from Greg's response (so I sent a followup to clarify that). -- Hal From halr at voltaire.com Tue Aug 30 05:31:15 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 08:31:15 -0400 Subject: [openib-general] [PATCH] RMPP: Fix payload length of middle RMPPsent segments In-Reply-To: References: Message-ID: <1125405074.4401.1107.camel@hal.voltaire.com> On Mon, 2005-08-29 at 11:56, Sean Hefty wrote: > We can add this, but is it needed? I thought that payload length was undefined > as opposed to reserved for middle segments. Perhaps undefined but the language I see is (not) valid. In general, these fields have been treated as Reserved and set to 0 on transmit, ignored on receive for future extensibility. Maybe that doesn't matter for this so it may be unneccessary. -- Hal From halr at voltaire.com Tue Aug 30 05:50:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 08:50:31 -0400 Subject: [openib-general] RMPP Middle Segments Payload Length Message-ID: <1125406230.4401.1178.camel@hal.voltaire.com> Hi Greg, In addition to the question about whether the first packet Payload Length only includes valid bytes in Transferred Data or all bytes in all Transferred Data in all sent segments in the case of a multipacket/segment send, there is also a question about the Payload Length in middle segments/packets. It looks to me like there is just a comment about the Payload Length being valid in first (optional) and last (mandatory) segments/packets. So that means it is ignored on receive but does it need to be set to 0 on transmit ? It seems possibly different from a reserved field in those cases by language in the spec but I'm not sure whether this is the case or not. Thanks. 
-- Hal From halr at voltaire.com Tue Aug 30 06:11:06 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 09:11:06 -0400 Subject: [openib-general] kernel oops In-Reply-To: <43138B0E.1090309@ichips.intel.com> References: <430F4DBD.4070703@xsigo.com> <43138B0E.1090309@ichips.intel.com> Message-ID: <1125407465.4401.1246.camel@hal.voltaire.com> Hi Sean, On Mon, 2005-08-29 at 18:24, Sean Hefty wrote: > Viswanath Krishnamurthy wrote: > > Call Trace: > > [] __alloc_pages+0x166/0x3b6 > > [] ib_get_client_data+0x14/0x54 > > [] ib_sa_path_rec_get+0x1b/0x13e > > [] resolve_path+0x8c/0x15b > > [] path_req_complete+0x0/0xf7 > > [] rtnetlink_dump_all+0x0/0x9e > > [] rtnetlink_done+0x0/0x3 > > [] ib_at_paths_by_route+0xc4/0xd9 > > [] same_path_req+0x0/0x95 > > [] ib_uat_paths_by_route+0xef/0x1c4 > > [] rtnetlink_dump_all+0x0/0x9e > > [] rtnetlink_done+0x0/0x3 > > [] ib_uat_write+0x96/0xa2 > > [] vfs_write+0x108/0x10a > > [] sys_write+0x41/0x6a > > [] sysenter_past_esp+0x54/0x75 > > Hal, I've looked into this more, and this is what appears to be > happening. Thanks for looking into this. It's been on my list but I hadn't quite got to it yet. > Ucmpost calls ib_at_route_by_ip(), followed by > ib_at_paths_by_route(). The first call fails asynchronously, which is > ignored by ucmpost. It expects that the call to ib_at_paths_by_route() > to fail synchronously with invalid input. Why would ib_at_paths_by_route be called if no route were obtained (from ib_at_route_by_ip) ? Isn't that a ucmpost issue ? (I also agree it's not good for UAT to crash). > The AT code in the kernel assumes that the ib_route passed into > ib_at_paths_by_route is valid and dereferences a device pointer, which I > think is causing this crash. Can you confirm that this is what the code > is doing? It needs to be a valid route struct. I'm not sure how the kernel can validate that is the case. It does check for NULL pointer but this is bad pointer. 
> The AT code appears to passing a kernel pointer up to the userspace app, > and then requires that pointer to be passed back to the kernel. This > Needs to be changed to pass up some identifier that can be validated on > the return to the kernel. Isn't it copying the ib_route structure to userspace ? -- Hal From halr at voltaire.com Tue Aug 30 06:26:16 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 09:26:16 -0400 Subject: [openib-general] Re: when executing sminfo with a port in down state, there is a retur n value 0 In-Reply-To: <506C3D7B14CDD411A52C00025558DED6089DBA60@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6089DBA60@mtlex01.yok.mtl.com> Message-ID: <1125408376.4401.1290.camel@hal.voltaire.com> Hi Dotan, On Tue, 2005-08-23 at 02:33, Dotan Barak wrote: > I'm working with gen2 svn rev. 3155 with 2 Mellanox HCAs (23108) (1 on > each host; they are connected b2b: port 1 to port 1). Is port 2 also connected to port 2 ? > I executed opensm on host 1, port 1. > When i executed sminfo on host 2 port 1 everything was as expected > (return value = 0). > > I killed the opensm > When i executed sminfo on host 2 port 1 everything was as expected > (return value = 255). > > When i executed sminfo on host 2 port 2 everything i got 0 (i expected > to get return value = 255). > > Port 2 in host 2 was down, so i don't know why i got the return value > 0. > > here is the output: > > host2:~ # /usr/local/bin/sminfo -C mthca0 -P 2 > sminfo: sm lid 0x0 sm guid 0x8200000000, activity count 0 priority 0 > state SMINFO_NOTACT 0 > host2:~ # echo $? > 0 > host2:~ # /usr/local/bin/sminfo -C mthca0 -P 2 > sminfo: sm lid 0x0 sm guid 0x0, activity count 0 priority 0 state > SMINFO_STANDBY 2 > host2:~ # echo $? 
> 0 > > host2:~ # vstat > hca_id: mthca0 > phys_port_cnt: 2 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 0 (0) > active_mtu: 0 (0) > sm_lid: 1 > port_lid: 2 > port_lmc: 0x00 > > port: 2 > state: PORT_DOWN (1) > max_mtu: 0 (0) > active_mtu: 0 (0) > sm_lid: 0 > port_lid: 0 > port_lmc: 0x00 > > > can you please help me with this issue? I have this on my list TODO but it is lower down in priority. I haven't forgotten about it. I will get back to it after the 1.8.0 merge. -- Hal From jlentini at netapp.com Tue Aug 30 07:14:19 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 30 Aug 2005 10:14:19 -0400 (EDT) Subject: [openib-general] RDMA Generic Connection Management In-Reply-To: <52acj0mszg.fsf@cisco.com> References: <1125323947.6584.106.camel@r2d2> <52acj0mszg.fsf@cisco.com> Message-ID: On Mon, 29 Aug 2005, Roland Dreier wrote: > James> What happens if multiple devices can reach the destination > James> address? How will they be enumerated to the consumer? > > I guess we need to move towards the full horror of getaddrinfo(). > Probably we need some unusable native API, and then library functions > layered on top for consumers that don't care. > > Although maybe it's not necessary -- are there any consumers of this > API that really want to choose among different equal-metric routes? The rule of thumb should be to provide mechanism not policy. If there are multiple devices capable of reaching a destination, that should be exposed to the consumer. Let the consumer decide which to use (perhaps by looking at the ib_device_attr information). james From jlentini at netapp.com Tue Aug 30 07:40:24 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 30 Aug 2005 10:40:24 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: FMR and EVD patch In-Reply-To: References: Message-ID: I'm not comfortable with a solution that relies on vendor specific behavior for such a critical mechanism. 
Given that the OpenIB ib_req_notify_cq() verb conforms to the IBTA spec's semantics

http://www.mail-archive.com/openib-general%40openib.org/msg08935.html

and that the development of three other low-level OpenIB drivers (Pathscale, IBM, Ammasso) has been announced, I believe relying on this behavior would be a mistake.

What if dapl_evd_modify_upcall() worked as follows

  dapl_evd_modify_upcall
      lock evd with spin_lock_irqsave
      if CQ upcalls need to be enabled
          ib_req_notify_cq
      setup the evd upcall
      unlock evd with spin_unlock_irqrestore
      if ib_peek_cq reports unreaped work completions
          call dapl_evd_dto_callback

I realize that the call to dapl_evd_dto_callback() will potentially be racing with a CQ upcall, but I believe that the logic in dapl_evd_dto_callback() handles that correctly.

james

On Mon, 29 Aug 2005, Guy German wrote:
> James Lentini wrote:
> > I agree with you on the problems posed by the current interface. I
> > hope we can find a solution that fixes the problem.
> > Note that the same problem must be handled by a ULP using the native verbs.
>
> I don't think we have the same problem in the verbs.
> In the current Mellanox hw (which is AFAIK the only available hw in openib)
> there is no race at all (because of the proprietary, more "considerate",
> completion notification implementation).
>
> - Receive a CQ notification callback
> - Wakeup polling thread
> - Poll for completion (empty the queue)
> - Request completion notification
> [you will get a completion notification even for "old" completions on the queue]
> - exit thread
>
> In the case of other, more harsh ib compliant future hw implementation -
> Request completion Notification "extended verb" could encapsulate:
> - request CQ notification
> - if cq !empty request CQ notification _again_
> (note that you are not *polling* the cq - just checking the queue.
> This is different than draining the evd "one more time")
>
> And the race is solved.
> Indeed, it is not as efficient as sparing the context switch
> (to interrupt and back to thread) altogether.
>
> >I still think that there may be a race condition with this patch.
> >Here's the scenario I'm concerned about:
> > - Receive an evd upcall
> > - Disable evd upcall policy
> > - Wakeup polling thread
> > - Dequeue all events
> > - Enable evd upcall policy by:
> > 1. Call dapl_evd_modify_upcall() to enable the evd upcall
> > 2. Obtain the EVD spin lock via spin_lock_irqsave, thus
> > disabling local interrupts
> > 3. Check that the EVD's ring buffer is empty (there are no DAPL
> > "software" events)
> > 4. A DTO completion occurs on the EVD's CQ
> > 5. Enable the CQ's upcall via ib_req_notify_cq()
> >
> >If I understand you correctly, you are asserting that event #4, the
> >CQ's DTO completion, cannot occur because the local interrupts are
> >disabled by spin_lock_irqsave(). Have I understood you correctly?
>
> Not quite. The *consumer's upcall* would not be called, due to the irq disable.
> The race would not occur, OTOH, because the Mellanox hw will initiate a
> completion notification even if the completions in the cq arrived before
> the notification request.
> If you want to be more ib compliant, for future possible implementations,
> you can apply the "extended-notify-routine" (as mentioned above).
>
> > My belief is that the completion will occur on the card
> > regardless of the interrupt state.
>
> True, but the consumer will be notified only as soon as the irq
> is enabled again
>
> > Can you provide me with a reference that guarantees this
> > will not happen?
>
> I'm not saying that it won't ;) but I don't think there will be a race...
> > Guy > From halr at voltaire.com Tue Aug 30 07:46:29 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 10:46:29 -0400 Subject: [openib-general] Re: [PATCH] osm: osm_vendor_umad osm_vendor_get_all_port_attr bug In-Reply-To: <4311CD74.8060006@mellanox.co.il> References: <86fyt3219m.fsf@mtl066.yok.mtl.com> <1124995971.4421.852.camel@hal.voltaire.com> <4311CD74.8060006@mellanox.co.il> Message-ID: <1125413188.4401.1413.camel@hal.voltaire.com> Hi Eitan, On Sun, 2005-08-28 at 10:43, Eitan Zahavi wrote: > I agree that the index 0 of the guid,lids and the new linkstates arrays > should be reserved for the default port. In the loop the index j is used > to loop over all ports 0 .. N of the HCA's. It is clear that for HCA's > port 0 will be skipped. However, since the current code does not advance > the lid and linkstate accordingly the place for the port 0 will not be > kept empty for the port 0. > > Current code: > for (j = 0; j <= ca.numports; j++) { > if (ca.ports[j]) { > *p_lid = ca.ports[j]->base_lid; > *p_linkstates = ca.ports[j]->state; > p_lid++; > p_linkstates++; > } > } > Should be: > for (j = 0; j <= ca.numports; j++) { > if (ca.ports[j]) { > *p_lid = ca.ports[j]->base_lid; > *p_linkstates = ca.ports[j]->state; > } > /* as j advance even if the port is not valid, so should the > lid and state pointer */ > p_lid++; > p_linkstates++; > } > > As I could not convince you with the above explanations in my previous > mail I have written the following simple program to test the pre-and > post patch effect: > > /* > test program for dumping osm_vendor_get_all_port_attr results > */ > > #include "stdio.h" > #include > #include > #include > #include > > #include > #define GUID_ARRAY_SIZE 64 > int > main() { > osm_vendor_t vendor; > osm_log_t osm_log; > ib_api_status_t status; > uint32_t num_ports = GUID_ARRAY_SIZE; > ib_port_attr_t attr_array[GUID_ARRAY_SIZE]; > int i; > > osm_log_construct(&osm_log); > osm_log_init(&osm_log, TRUE, 0xff, 
"/tmp/test_vendor.log"); > > osm_vendor_init(&vendor, &osm_log, 1000); > > status = osm_vendor_get_all_port_attr(&vendor, attr_array, &num_ports ); > if ( status != IB_SUCCESS ) > { > printf( "\nError from osm_vendor_get_all_port_attr (%x)\n", status); > return; > } > > printf("\nListing GUIDs:\n"); > for (i = 0; i < num_ports; i++) { > printf("Port %i:0x%"PRIx64" lid:0x%04x state:%x\n", > i, > cl_hton64(attr_array[i].port_guid), > cl_ntoh16(attr_array[i].lid), > attr_array[i].link_state > ); > } > > exit(0); > } > > Without the above change I get: > Listing GUIDs: > Port 0:0xd9dffffff3d55 lid:0x0300 state:4 > Port 1:0xd9dffffff3d55 lid:0x0400 state:4 > Port 2:0xd9dffffff3d56 lid:0x0000 state:0 > > After the simple change I get: > Listing GUIDs: > Port 0:0xd9dffffff3d55 lid:0x0300 state:4 > Port 1:0xd9dffffff3d55 lid:0x0300 state:4 > Port 2:0xd9dffffff3d56 lid:0x0400 state:4 > > So as you can see - without the fix the lid of port 2 is presented as > the lid of port 1... I understand the difference in the code and think the difference perhaps relates to either a lack of clarity or confusion with the API as follows: I don't see where it is defined what the index into the port array means. I think we have 2 different interpretations and this relates to how opensm/main.c handles the results of calling this routine. So the patch is incomplete although perhaps correct depending on the interpretation. I'm not adverse to changing this as you indicate. I would like to resolve this before embarking on the 1.8.0 merge. Also, since we are in this area, I don't think switch port 0 would be handled correctly by this code either. > I guess you use ibstatus in your mail. Well ibstatus uses its own code > so it shows the correct info anyway. That was just to show that the port states corresponded to the ones shown by osm_vendor_get_all_port_attr with the print statements. Nothing else. 
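
On the question of what the index into the returned port array means, the two readings discussed in this thread can be put side by side. The following is only a minimal sketch in plain C: `demo_port`, `demo_fill_packed` and `demo_fill_indexed` are hypothetical names invented for illustration and are not part of osm_vendor or libibumad. The "packed" variant matches the current loop, where a missing port does not consume a slot; the "indexed" variant matches the proposed change, where slot j always describes port j (here a missing port is simply reported as zeros, which is an assumption of this sketch).

```c
#include <stddef.h>

#define DEMO_MAX_PORTS 3	/* slots 0..2; port 0 does not exist on an HCA */

/* Hypothetical port record; a NULL table entry means "no such port". */
struct demo_port {
	int base_lid;
	int state;
};

/* Packed reading: only existing ports consume an output slot. */
static size_t demo_fill_packed(struct demo_port *ports[], size_t n,
			       int lids[], int states[])
{
	size_t out = 0;
	size_t j;

	for (j = 0; j < n; j++) {
		if (ports[j]) {
			lids[out] = ports[j]->base_lid;
			states[out] = ports[j]->state;
			out++;
		}
	}
	return out;	/* number of slots actually written */
}

/* Port-indexed reading: slot j always describes port j, present or not. */
static void demo_fill_indexed(struct demo_port *ports[], size_t n,
			      int lids[], int states[])
{
	size_t j;

	for (j = 0; j < n; j++) {
		lids[j] = ports[j] ? ports[j]->base_lid : 0;
		states[j] = ports[j] ? ports[j]->state : 0;
	}
}
```

Whichever reading is chosen, the caller has to agree with it: a consumer that assumes port-indexed slots will misread a packed array by one position on an HCA, and the other way round.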
-- Hal > In my case that is: > swlab223:/tmp/bld/libvendor>ibstatus > Infiniband device 'mthca0' port 1 status: > default gid: fe80:0000:0000:0000:000d:9dff:ffff:3d55 > base lid: 0x3 > sm lid: 0x1 > state: 4: ACTIVE > phys state: 5: LinkUp > rate: 10 Gb/sec (4X) > > Infiniband device 'mthca0' port 2 status: > default gid: fe80:0000:0000:0000:000d:9dff:ffff:3d56 > base lid: 0x4 > sm lid: 0x1 > state: 4: ACTIVE > phys state: 5: LinkUp > rate: 10 Gb/sec (4X) > > > > Hal Rosenstock wrote: > > Hi Eitan, > > > > On Sun, 2005-08-21 at 03:32, Eitan Zahavi wrote: > > > >>osm_vendor_get_all_port_attr returns incorrect LID and state for > >>device ports. This bug was caused by the fact that if a device port > >>was skipped due to that fact it does not exist (HCA port 0). The > >>lid and state pointers used as indexes into their corresponding > >>return value arrays were not advancing to the next port index. > >> > >>So the return for a single HCA was mixing LID and state for the first > >>port and displayed non initialized memory for the second port. > > > > > > The array is not filled in as you claim. Port 0 does not take a slot on > > an HCA. 
This looks fine to me as is (I added some print statements in > > that loop as follows): > > > > osm_vendor_get_all_port_attr: port 0 > > osm_vendor_get_all_port_attr: port 1 > > osm_vendor_get_all_port_attr: port 1 lid 1 state 4 > > osm_vendor_get_all_port_attr: port 2 > > osm_vendor_get_all_port_attr: port 2 lid 0 state 1 > > > > Port 0 is skipped; port 1 is LID 1 and active; port 2 is not plugged in > > and is down: > > > > Port 1: > > State: Active > > Physical state: LinkUp > > Rate: 2 > > Base lid: 1 > > LMC: 0 > > SM lid: 1 > > Capability mask: 0x00500a68 > > Port GUID: 0x0008f10403960559 > > Port 2: > > State: Down > > Physical state: Polling > > Rate: 2 > > Base lid: 0 > > LMC: 0 > > SM lid: 0 > > Capability mask: 0x00500a68 > > Port GUID: 0x0008f1040396055a > > > > -- Hal > From yaronh at voltaire.com Tue Aug 30 07:55:43 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Tue, 30 Aug 2005 17:55:43 +0300 Subject: [openib-general] RDMA Generic Connection Management Message-ID: <35EA21F54A45CB47B879F21A91F4862F753462@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of James Lentini > Sent: Monday, August 29, 2005 3:35 PM > To: Guy German > Cc: openib-general at openib.org > Subject: Re: [openib-general] RDMA Generic Connection Management > > > What happens if multiple devices can reach the destination address? > How will they be enumerated to the consumer? 
> Since its an IP based approach, it will work like traditional IP A preference is given to a device with the same subnet as destination In GbE if two NICs are on the same subnet then only one will be selected You can also use a LAG solution that will balance connections over multiple links, but it is done at the L2-3 layers (not exposed to the ULP) We should probably use the same approach and provide a single device handle to the ULP, we may have a virtual device handle representing few similar parallel devices (just like a LAG group has a virtual MAC), also maybe a good idea to pass an enum with some preference (e.g. single path or redundant or ...) Specifically in iSER the redundancy is handled in the upper layers The iSCSI discovery may return multiple src & dst IP addresses and the iSCSI multipath implementation will open multiple connections. There are many TCP/IP protocols that do that at the upper layers (e.g. GridFTP, ..), not sure how NFS does it. Also note that there was a new addendum to IB Multipath record query me & Hal proposed in IBTA that enable a client to ask "what are all the options to get from point A to point B ?", where A & B are identified by one of the GIDs we know about, and we can specify a flag for same port/hca/system preferences, this can be implemented under AT if we want. Yaron From jlentini at netapp.com Tue Aug 30 07:59:13 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 30 Aug 2005 10:59:13 -0400 (EDT) Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <1125400454.6584.114.camel@r2d2> References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <20050829212031.GC6723@mellanox.co.il> <43137C42.7000905@ichips.intel.com> <1125400454.6584.114.camel@r2d2> Message-ID: On Tue, 30 Aug 2005, Guy German wrote: > I don't understand the difference between handling a device > received in cma_get_device and device received in ib_client.add ... 
Consumers that receive a device via the ib_client.add callback will receive a notification that it is no longer available via the ib_client.remove callback. Given a cma_get_device() call, consumers are unlikely to use the ib_register_client() call and therefore will not receive the remove callback. From guyg at voltaire.com Tue Aug 30 07:53:33 2005 From: guyg at voltaire.com (Guy German) Date: Tue, 30 Aug 2005 17:53:33 +0300 Subject: [openib-general][PATCH][kdapl]: FMR and EVD patch In-Reply-To: References: Message-ID: <1125413613.4127.13.camel@r2d2> On Tue, 2005-08-30 at 10:40 -0400, James Lentini wrote: > I'm not comfortable with a solution that relies on vendor specific > behavior for such a critical mechanism. I believe I suggested a solution for the general case as well: 1. request CQ notification 2. if cq !empty request CQ notification _again_ > What if dapl_evd_modify_upcall() worked as follows > > dapl_evd_modify_upcall > lock evd with spin_lock_irqsave > if CQ upcalls need to be enabled > ib_req_notify_cq > setup the evd upcall > unlock evd with spin_unlock_irqrestore > if ib_peek_cq reports unreaped work completions > call dapl_evd_dto_callback > > I realize that the call to dapl_evd_dto_callback() will potentially be > racing with a CQ upcall, but I believe that the logic in > dapl_evd_dto_callback() handles that correctly. I don't think it's a good idea to call the consumer _again_ in the consumer's own context ... Usually the upcall is interrupt/tasklet context that wakes up the thread - now, you suggest that the thread will "wakeup" itself (when it is, actually, already running and about to go to sleep) Guy. From danb at voltaire.com Tue Aug 30 08:09:41 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Tue, 30 Aug 2005 18:09:41 +0300 Subject: [openib-general] RE: [PATCH] iser: Make iser Makefile like other OpenIB ULP makefiles Message-ID: Thanks, applied. 
Dan > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, August 29, 2005 8:25 PM > To: Dan Bar Dov > Cc: openib-general at openib.org > Subject: [PATCH] iser: Make iser Makefile like other OpenIB > ULP makefiles > > Make iser Makefile like other OpenIB ULP makefiles > > Signed-off-by: Hal Rosenstock > > Index: Makefile > =================================================================== > --- Makefile (revision 3232) > +++ Makefile (working copy) > @@ -1,16 +1,14 @@ > -ISER_OBJ = iser_mod.o > -ISER_OBJ += iser_conn.o > -ISER_OBJ += iser_initiator.o > -ISER_OBJ += iser_memory.o > -ISER_OBJ += iser_task.o > -ISER_OBJ += iser_utils.o > -ISER_OBJ += iser_dto.o > -ISER_OBJ += iser_lkdapl.o > +EXTRA_CFLAGS += -Idrivers/infiniband/include > -Idrivers/infiniband/ulp/kdapl \ > + -I$(src)/include -DLINUX_KDAT > > -EXTRA_CFLAGS += -Idrivers/infiniband/include > -EXTRA_CFLAGS += -Idrivers/infiniband/ulp/kdapl > -EXTRA_CFLAGS += -I$(src)/include > -EXTRA_CFLAGS += -DLINUX_KDAT > +obj-$(CONFIG_INFINIBAND_ISER) += ib_iser.o > > -obj-$(CONFIG_INFINIBAND_ISER) += $(ISER_OBJ) > +ib_iser-y := iser_mod.o \ > + iser_conn.o \ > + iser_initiator.o \ > + iser_memory.o \ > + iser_task.o \ > + iser_utils.o \ > + iser_dto.o \ > + iser_lkdapl.o > > > > > From guyg at voltaire.com Tue Aug 30 08:02:03 2005 From: guyg at voltaire.com (Guy German) Date: Tue, 30 Aug 2005 18:02:03 +0300 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <20050829212031.GC6723@mellanox.co.il> <43137C42.7000905@ichips.intel.com> <1125400454.6584.114.camel@r2d2> Message-ID: <1125414123.4127.18.camel@r2d2> On Tue, 2005-08-30 at 10:59 -0400, James Lentini wrote: > On Tue, 30 Aug 2005, Guy German wrote: > > I don't understand the difference between handling a device > > received in cma_get_device and device received in ib_client.add ... 
> Consumers that receive a device via the ib_client.add > callback will receive a notification that it is no longer > available via the ib_client.remove callback. I see. Couldn't consumers be registered as clients _before_ they call cma_get_device ? Guy From Thomas.Duffy.99 at alumni.brown.edu Tue Aug 30 08:19:17 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Tue, 30 Aug 2005 08:19:17 -0700 Subject: [openib-general] SDP and Socket Options In-Reply-To: References: Message-ID: <67CBD6BE-E171-4FEE-92EF-B1B640CEE83D@alumni.brown.edu> On Aug 29, 2005, at 11:34 PM, Majumder, Rajib wrote: > 2) what socket options are not supported? Most of them, actually. The relevant piece of code should be illuminating: From sdp_inet.c:1461 switch (optname) { case TCP_NODELAY: conn->nodelay = value ? 1 : 0; if (conn->nodelay > 0) (void)sdp_send_flush(conn); break; < snip SDP specific options > default: sdp_warn("SETSOCKOPT unimplemented option <%d:%d> conn <%d>.", level, optname, conn->hashent); break; } I think we should fail these all other options and not just print a warning and return. Until we can characterize all of them. So, set result to -ENOPROTOOPT instead of just breaking. 
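The suggestion above (fail unknown options rather than warn and report success) can be sketched in plain C. This is only an illustration under stated assumptions: `demo_conn`, `demo_send_flush`, `demo_setsockopt` and `DEMO_TCP_NODELAY` are hypothetical stand-ins, not the real SDP structures or functions from sdp_inet.c.

```c
#include <errno.h>

/* Hypothetical option number and connection state; not the SDP originals. */
#define DEMO_TCP_NODELAY 1

struct demo_conn {
    int nodelay;  /* 1 when the no-delay option is set */
    int flushed;  /* counts how often a flush was requested */
};

/* Stand-in for the flush that the real code triggers on TCP_NODELAY. */
static void demo_send_flush(struct demo_conn *conn)
{
    conn->flushed++;
}

/* Returns 0 on success, or -ENOPROTOOPT for any option not handled here. */
static int demo_setsockopt(struct demo_conn *conn, int optname, int value)
{
    int result = 0;

    switch (optname) {
    case DEMO_TCP_NODELAY:
        conn->nodelay = value ? 1 : 0;
        if (conn->nodelay > 0)
            demo_send_flush(conn);
        break;
    default:
        /* Report the unsupported option to the caller instead of
         * printing a warning and returning success. */
        result = -ENOPROTOOPT;
        break;
    }
    return result;
}
```

With this shape a caller that sets an unimplemented option sees an error immediately and can fall back, instead of silently assuming the option took effect.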
-tduffy From halr at voltaire.com Tue Aug 30 08:18:50 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 11:18:50 -0400 Subject: [openib-general] [PATCH] iser: Hook iser into OpenIB build Message-ID: <1125415128.4401.1468.camel@hal.voltaire.com> iser: Hook iser into OpenIB build Signed-off-by: Hal Rosenstock Index: Kconfig =================================================================== --- Kconfig (revision 3232) +++ Kconfig (working copy) @@ -27,4 +27,6 @@ source "drivers/infiniband/ulp/kdapl/Kconfig" +source "drivers/infiniband/ulp/iser/Kconfig" + endmenu Index: Makefile =================================================================== --- Makefile (revision 3248) +++ Makefile (working copy) @@ -4,3 +4,4 @@ obj-$(CONFIG_INFINIBAND_SDP) += ulp/sdp/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_KDAPL) += ulp/kdapl/ +obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ From mshefty at ichips.intel.com Tue Aug 30 09:04:32 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Aug 2005 09:04:32 -0700 Subject: [openib-general] kernel oops In-Reply-To: <1125407465.4401.1246.camel@hal.voltaire.com> References: <430F4DBD.4070703@xsigo.com> <43138B0E.1090309@ichips.intel.com> <1125407465.4401.1246.camel@hal.voltaire.com> Message-ID: <43148390.7010605@ichips.intel.com> Hal Rosenstock wrote: > Why would ib_at_paths_by_route be called if no route were obtained (from > ib_at_route_by_ip) ? Isn't that a ucmpost issue ? (I also agree it's not > good for UAT to crash). The assumption that I made was that the call to ib_at_route_by_ip() would fail if given an invalid route. Also, since ucmpost is a simple test app designed more to test the CM than AT, I kept error testing to a minimum. > It needs to be a valid route struct. I'm not sure how the kernel can > validate that is the case. It does check for NULL pointer but this is > bad pointer. Struct ib_at_ib_route should probably change the struct ibv_device *out_dev field.
It looks like this field is actually set to a struct ib_device * that is a kernel pointer. Can we just remove this field and use the sgid to locate the correct device structure in the kernel, or fail if it cannot be located? >>The AT code appears to passing a kernel pointer up to the userspace app, >>and then requires that pointer to be passed back to the kernel. This >>Needs to be changed to pass up some identifier that can be validated on >>the return to the kernel. > > Isn't it copying the ib_route structure to userspace ? Yes - but that contains the kernel device pointer. And looking at it more, the ABI contains pointers in the data structures. This should cause problems with 32-bit apps running on 64-bit kernels. I'm not sure how desirable it is to fix these issues versus moving to whatever the new CM abstraction API is. - Sean From rolandd at cisco.com Tue Aug 30 09:11:47 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 09:11:47 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <20050830060156.GB14890@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 30 Aug 2005 09:01:56 +0300") References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <20050829212031.GC6723@mellanox.co.il> <43137C42.7000905@ichips.intel.com> <20050830060156.GB14890@mellanox.co.il> Message-ID: <52ek8bjtfg.fsf@cisco.com> Michael> Wont users also activate verbs directly anyway, and so be Michael> required to handle this notification? In the current verbs API, consumers only receive device pointers from their client->add callback. They may use that pointer until they return from the corresponding client->remove callback. If we add a new API for routing that also returns device pointers to a consumer, then the lifetime rules are messed up. - R. 
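The lifetime rule described above (a device pointer is valid from the client's add callback until the matching remove callback returns) can be modelled in user space. The sketch below is a toy model, not kernel code: `demo_device`, `demo_client`, `demo_add`, `demo_remove` and `demo_may_use` are hypothetical names chosen for illustration, and the real `struct ib_client` is not reproduced here.

```c
#include <stddef.h>

/* Toy stand-in for a device handle. */
struct demo_device {
    const char *name;
};

/* Toy stand-in for a client with add/remove callbacks. */
struct demo_client {
    void (*add)(struct demo_device *dev);
    void (*remove)(struct demo_device *dev);
};

#define DEMO_MAX_DEVICES 4

/* Devices the consumer is currently allowed to use. */
static struct demo_device *demo_in_use[DEMO_MAX_DEVICES];

/* add: from here on the consumer may use the device. */
static void demo_add(struct demo_device *dev)
{
    size_t i;

    for (i = 0; i < DEMO_MAX_DEVICES; i++) {
        if (demo_in_use[i] == NULL) {
            demo_in_use[i] = dev;
            return;
        }
    }
}

/* remove: drop every reference before returning to the caller. */
static void demo_remove(struct demo_device *dev)
{
    size_t i;

    for (i = 0; i < DEMO_MAX_DEVICES; i++) {
        if (demo_in_use[i] == dev)
            demo_in_use[i] = NULL;
    }
}

/* True only between add and the return of remove for this device. */
static int demo_may_use(const struct demo_device *dev)
{
    size_t i;

    for (i = 0; i < DEMO_MAX_DEVICES; i++) {
        if (demo_in_use[i] == dev)
            return 1;
    }
    return 0;
}

static struct demo_client demo_consumer = {
    .add = demo_add,
    .remove = demo_remove,
};
```

In this model a pointer obtained by any other route, such as a routing lookup, never enters the table, so nothing tells the consumer when it goes stale; that gap is the concern raised in the thread.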
From rolandd at cisco.com Tue Aug 30 09:16:22 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 09:16:22 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <1125414123.4127.18.camel@r2d2> (Guy German's message of "Tue, 30 Aug 2005 18:02:03 +0300") References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <20050829212031.GC6723@mellanox.co.il> <43137C42.7000905@ichips.intel.com> <1125400454.6584.114.camel@r2d2> <1125414123.4127.18.camel@r2d2> Message-ID: <52acizjt7t.fsf@cisco.com> Guy> Couldn't consumers be registered as clients _before_ they Guy> call cma_get_device ? Yes, consumers had better continue to use ib_register_client. The question is how to define the cma_get_device API so that consumers can handle hotplug correctly. And we need to make it simple enough so that ULP authors can get it right. - R. From halr at voltaire.com Tue Aug 30 09:14:18 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 12:14:18 -0400 Subject: [openib-general] osm-1.8.0-merge nit Message-ID: <1125418458.4401.1512.camel@hal.voltaire.com> Hi Yael, I'm starting to look at the complib changes with the 1.8.0 merge. One trivial thing is the following in include/complib/cl_event_wheel.h: #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { # define END_C_DECLS } #else /* !__cplusplus */ # define BEGIN_C_DECLS # define END_C_DECLS #endif /* __cplusplus */ BEGIN_C_DECLS #include #include #include #include #include #include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { # define END_C_DECLS } #else /* !__cplusplus */ # define BEGIN_C_DECLS BEGIN_C_DECLS The second occurrence of this should be removed.
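For reference, a header with the duplicate removed would define the macros once and open a single BEGIN_C_DECLS/END_C_DECLS pair. The sketch below is not the actual cl_event_wheel.h: the `#ifndef` guard is an added assumption (the mail only asks for the second copy to go), and `demo_wheel_slots` is a hypothetical declaration used to make the fragment compile.

```c
/* Define the linkage macros exactly once. The #ifndef guard is an
 * assumption added here so a second inclusion cannot redefine them. */
#ifndef BEGIN_C_DECLS
#  ifdef __cplusplus
#    define BEGIN_C_DECLS extern "C" {
#    define END_C_DECLS   }
#  else /* !__cplusplus */
#    define BEGIN_C_DECLS
#    define END_C_DECLS
#  endif /* __cplusplus */
#endif /* BEGIN_C_DECLS */

/* One BEGIN/END pair around the declarations. */
BEGIN_C_DECLS

/* Hypothetical declaration standing in for the real header contents. */
int demo_wheel_slots(void);

END_C_DECLS

/* Trivial definition so the fragment links on its own. */
int demo_wheel_slots(void)
{
    return 8;
}
```

In a C build both macros expand to nothing; in a C++ build they wrap the declarations in one `extern "C"` block, which is why a second unmatched BEGIN_C_DECLS would leave a brace open.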
-- Hal From jlentini at netapp.com Tue Aug 30 09:28:03 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 30 Aug 2005 12:28:03 -0400 (EDT) Subject: [openib-general][PATCH][kdapl]: FMR and EVD patch In-Reply-To: <1125413613.4127.13.camel@r2d2> References: <1125413613.4127.13.camel@r2d2> Message-ID: On Tue, 30 Aug 2005, Guy German wrote: > On Tue, 2005-08-30 at 10:40 -0400, James Lentini wrote: > > I'm not comfortable with a solution that relies on vendor specific > > behavior for such a critical mechanism. > > I believe I suggested a solution for the general case as well: Sorry I missed this in your mail. > 1. request CQ notification > 2. if cq !empty request CQ notification _again_ Can you explain step #2 in more detail? What does "CQ notification _again_" entail? > > What if dapl_evd_modify_upcall() worked as follows > > > > dapl_evd_modify_upcall > > lock evd with spin_lock_irqsave > > if CQ upcalls need to be enabled > > ib_req_notify_cq > > setup the evd upcall > > unlock evd with spin_unlock_irqrestore > > if ib_peek_cq reports unreaped work completions > > call dapl_evd_dto_callback > > > > I realize that the call to dapl_evd_dto_callback() will potentially be > > racing with a CQ upcall, but I believe that the logic in > > dapl_evd_dto_callback() handles that correctly. > > I don't think it's a good idea to call the consumer _again_ in the > consumer's own context ... > Usually the upcall is interrupt/tasklet context that wakes up the thread > - now, you suggest that the thread will "wakeup" itself (when it is, > actually, already running and about to go to sleep) I agree that it is complicated. Using iSER as an example, the thread in iser_event_handler_thread() would: - consume the events in iser_consume_events() - set the has_first_event flag to false - call dat_evd_modify_upcall()...
- ...which would potentially call iser_evd_upcall() and set the has_first_event flag to true - iser_evd_upcall() returns - dat_evd_modify_upcall() returns - iser_consume_events() returns - the event handler thread's call to wait_event_interruptible() would not put the thread to sleep because the has_first_event flag would be true From mshefty at ichips.intel.com Tue Aug 30 09:31:35 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Aug 2005 09:31:35 -0700 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <1125401395.4401.881.camel@hal.voltaire.com> References: <43140517.6090209@mellanox.co.il> <1125401395.4401.881.camel@hal.voltaire.com> Message-ID: <431489E7.90507@ichips.intel.com> Hal Rosenstock wrote: > I already submitted a patch for this. It wasn't clear to me what the > answer for the first segment is from Greg's response (so I sent a > followup to clarify that). Hal, can you go ahead and commit your two patches for payload length changes for RMPP? - Sean From halr at voltaire.com Tue Aug 30 09:33:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 12:33:31 -0400 Subject: [openib-general] OpenSM: new branch In-Reply-To: <506C3D7B14CDD411A52C00025558DED6089DBA81@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6089DBA81@mtlex01.yok.mtl.com> Message-ID: <1125419610.4401.1552.camel@hal.voltaire.com> Hi again Yael, On Tue, 2005-08-23 at 03:25, Yael Kalka wrote: > Currently we have the osm-1.8.0-merge branch, that includes all the > merges. Is this the same as what is in Mellanox Gold 1.8.0 or are there some additional changes not from the original OpenIB port (like bug fixes beyond this) ? Just wondering... As the Windows port appears to be evolving, how will this affect the Linux version ? Is the plan to try to have one code base ? If so, where will this be ? 
Also, a minor comment on the file list so far (I know there have been changes since you wrote this): 1.1 New files: include/vendor/osm_pkt_randomizer.h libvendor/osm_pkt_randomizer.c 1.2 Deleted files: include/opensm/osm_pkt_randomizer.h opensm/osm_pkt_randomizer.c These appear to me to be moved (equivalent of deleted and added). Thanks. -- Hal From halr at voltaire.com Tue Aug 30 09:37:01 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 12:37:01 -0400 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <431489E7.90507@ichips.intel.com> References: <43140517.6090209@mellanox.co.il> <1125401395.4401.881.camel@hal.voltaire.com> <431489E7.90507@ichips.intel.com> Message-ID: <1125419673.4401.1555.camel@hal.voltaire.com> Hi Sean, On Tue, 2005-08-30 at 12:31, Sean Hefty wrote: > Hal Rosenstock wrote: > > I already submitted a patch for this. It wasn't clear to me what the > > answer for the first segment is from Greg's response (so I sent a > > followup to clarify that). > > Hal, can you go ahead and commit your two patches for payload length > changes for RMPP? Do you think this is the correct interpretation ? If so, I will go ahead. I was waiting for confirmation. -- Hal From mshefty at ichips.intel.com Tue Aug 30 09:49:38 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Aug 2005 09:49:38 -0700 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <1125419673.4401.1555.camel@hal.voltaire.com> References: <43140517.6090209@mellanox.co.il> <1125401395.4401.881.camel@hal.voltaire.com> <431489E7.90507@ichips.intel.com> <1125419673.4401.1555.camel@hal.voltaire.com> Message-ID: <43148E22.3050305@ichips.intel.com> Hal Rosenstock wrote: >>Hal, can you go ahead and commit your two patches for payload length >>changes for RMPP? > > > Do you think this is the correct interpretation ? If so, I will go > ahead. I was waiting for confirmation. 
The interpretation of payload length for the first segment value looks correct. For the middle segments, 0 should work in all cases and may be a slightly cleaner solution. - Sean From Thomas.Talpey at netapp.com Tue Aug 30 09:53:37 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 30 Aug 2005 12:53:37 -0400 Subject: [openib-general] RDMA Generic Connection Management In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F753462@taurus.voltaire.com > References: <35EA21F54A45CB47B879F21A91F4862F753462@taurus.voltaire.com> Message-ID: <6.2.3.4.2.20050830123148.04e664f0@exnane01.nane.netapp.com> At 10:55 AM 8/30/2005, Yaron Haviv wrote: >The iSCSI discovery may return multiple src & dst IP addresses and the >iSCSI multipath implementation will open multiple connections. >There are many TCP/IP protocols that do that at the upper layers (e.g. >GridFTP, ..), not sure how NFS does it. The answer to that question depends on the version of NFS, and also the implementation. For NFSv2/v3, the situation is ad hoc. Some clients support multiple connections which they are able to round-robin. Solaris does this for example. The problem is, to the server each NFSv2/v3 connection appears to be a different client. Therefore the correctness guarantees (such as they are) go out the window. For example, a retry on a different connection is not a retry at all, it's a new op. So, the shotgun (trunked) NFSv3 situation is useful only for a certain class of use. For NFSv4, it's a little better in that there is a clientid which identifies the source. However, NFSv4 does not sufficiently deal with the case of requests on different connections either. With our new NFSv4 sessions proposal, planned to be part of NFSv4.1 (http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-sess-02.txt), trunking is fully supported, by allowing requests to belong to a higher-layer session regardless of what connection they arrive on. 
This exists in prototype form, the NFSv4.1 spec is still being pulled together. UMich/CITI is developing this btw. With a session, the client gets full consistency guarantees and trunked connections are therefore completely transparent. One thing to stress is that the type of connection (TCP, UDP, RDMA, etc) makes little or no difference in the trunking/multipathing picture. In fact, with an NFSv4.1 session, a mix of such connections is possible, and even a good idea. So it's more than a question of what RDMA capabilities are there, it's really *all* connections. To answer the question of how NFS "finds out" about multiple connections and trunking, the answer is generally that the mount command tells it. Mount can get this information from the command line, or DNS. I believe Solaris uses the command line approach. There may be a way to use the RPC portmapper for it, but the portmapper isn't used by NFSv4. Bottom line? NFS would love to have a way to learn multipathing topology. But it needs to follow existing practice, such as having an IP address / DNS expression. If the only way to find it is to query fabric services, that's not very compelling. Tom. From halr at voltaire.com Tue Aug 30 09:51:07 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 12:51:07 -0400 Subject: [openib-general] Re: RMPP Message Format Errors In-Reply-To: <43148E22.3050305@ichips.intel.com> References: <43140517.6090209@mellanox.co.il> <1125401395.4401.881.camel@hal.voltaire.com> <431489E7.90507@ichips.intel.com> <1125419673.4401.1555.camel@hal.voltaire.com> <43148E22.3050305@ichips.intel.com> Message-ID: <1125420666.4401.1580.camel@hal.voltaire.com> On Tue, 2005-08-30 at 12:49, Sean Hefty wrote: > The interpretation of payload length for the first segment value looks > correct. For the middle segments, 0 should work in all cases and may be > a slightly cleaner solution. OK. I'm going ahead with these changes. 
-- Hal From halr at voltaire.com Tue Aug 30 10:04:11 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 13:04:11 -0400 Subject: [openib-general] Re: RMPP Message Format Errors (Short Term Plan) In-Reply-To: <43107FE3.6020800@mellanox.co.il> References: <1125074400.4530.187.camel@hal.voltaire.com> <43107FE3.6020800@mellanox.co.il> Message-ID: <1125421450.4401.1599.camel@hal.voltaire.com> Hi Eitan, On Sat, 2005-08-27 at 10:59, Eitan Zahavi wrote: > Once you think both sender and receiver side issues are resolved please > let us know so I can re-run the test with the IB Analyzer. With r3251, the RMPP issues are resolved as far as I know. [IMO, the only thing waiting for full closure is verification from Greg (MgtWG).] Let me know if it works or if you find any further issues. -- Hal From swise at ammasso.com Tue Aug 30 10:06:06 2005 From: swise at ammasso.com (Steve Wise) Date: Tue, 30 Aug 2005 12:06:06 -0500 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: Message-ID: I thought all ULPs needed to register as an IB client regardless? > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of James Lentini > Sent: Tuesday, August 30, 2005 9:59 AM > To: Guy German > Cc: openib-general at openib.org > Subject: Re: [openib-general] Re: RDMA Generic Connection Management > > > > On Tue, 30 Aug 2005, Guy German wrote: > > > I don't understand the difference between handling a device > > received in cma_get_device and device received in ib_client.add ... > > Consumers that receive a device via the ib_client.add > callback will receive a notification that it is no longer > available via the ib_client.remove callback. > > Given a cma_get_device() call, consumers are unlikely to use the > ib_register_client() call and therefore will not receive the remove > callback.
> _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rolandd at cisco.com Tue Aug 30 10:12:47 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 10:12:47 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: (Steve Wise's message of "Tue, 30 Aug 2005 12:06:06 -0500") References: Message-ID: <521x4bjqls.fsf@cisco.com> Steve> I thought all ULPs needed to register as an IB client Steve> regardless? Right now they do, because there's no other way to get a struct ib_device pointer. If we add a new API that returns a struct ib_device pointer, then inevitably consumers will use it instead of the current client API, and then hotplug will be hopelessly broken. - R. From swise at ammasso.com Tue Aug 30 10:46:32 2005 From: swise at ammasso.com (Steve Wise) Date: Tue, 30 Aug 2005 12:46:32 -0500 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <521x4bjqls.fsf@cisco.com> Message-ID: You could enforce that ulps must register as clients. Then get_ib_device() or whatever would only work if the client ULP is registered... This approach might be the simple to do initially... It adds logic to the CM, however, to validate that the caller is a registered ULP. But that might not be a bad thing... Stevo. > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, August 30, 2005 12:13 PM > To: Steve Wise > Cc: 'James Lentini'; 'Guy German'; openib-general at openib.org > Subject: Re: [openib-general] Re: RDMA Generic Connection Management > > Steve> I thought all ULPs needed to register as an IB client > Steve> regardless? > > Right now they do, because there's no other way to get a struct > ib_device pointer. 
If we add a new API that returns a struct > ib_device pointer, then inevitably consumers will use it instead of > the current client API, and then hotplug will be hopelessly broken. > > - R. > From Thomas.Talpey at netapp.com Tue Aug 30 10:55:39 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 30 Aug 2005 13:55:39 -0400 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <521x4bjqls.fsf@cisco.com> References: <521x4bjqls.fsf@cisco.com> Message-ID: <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> At 01:12 PM 8/30/2005, Roland Dreier wrote: > Steve> I thought all ULPs needed to register as an IB client > Steve> regardless? > >Right now they do, because there's no other way to get a struct >ib_device pointer. If we add a new API that returns a struct >ib_device pointer, then inevitably consumers will use it instead of >the current client API, and then hotplug will be hopelessly broken. Are you telling us that RPC/RDMA (for example) has to handle hotplug events just to use IB? Isn't that the job of a lower layer? NFS/Sockets don't have to deal with these, f'rinstance. Tom. From rolandd at cisco.com Tue Aug 30 11:01:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 11:01:05 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> (Thomas Talpey's message of "Tue, 30 Aug 2005 13:55:39 -0400") References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> Message-ID: <521x4bi9su.fsf@cisco.com> Thomas> Are you telling us that RPC/RDMA (for example) has to Thomas> handle hotplug events just to use IB? Isn't that the job Thomas> of a lower layer? NFS/Sockets don't have to deal with Thomas> these, f'rinstance. Yes, if you want to talk directly to the device then you have to make sure that the device is still there to talk to. - R. 
From rolandd at cisco.com Tue Aug 30 11:03:46 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 11:03:46 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: (Steve Wise's message of "Tue, 30 Aug 2005 12:46:32 -0500") References: Message-ID: <52wtm3gv3x.fsf@cisco.com> Steve> You could enforce that ulps must register as clients. Then Steve> get_ib_device() or whatever would only work if the client Steve> ULP is registered... I don't think this really solves anything. You still have all sorts of races to handle. For example, a device could be removed while get_ib_device() is returning to the consumer. - R. From Thomas.Talpey at netapp.com Tue Aug 30 11:10:23 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 30 Aug 2005 14:10:23 -0400 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <521x4bi9su.fsf@cisco.com> References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> Message-ID: <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> At 02:01 PM 8/30/2005, Roland Dreier wrote: > Thomas> Are you telling us that RPC/RDMA (for example) has to > Thomas> handle hotplug events just to use IB? Isn't that the job > Thomas> of a lower layer? NFS/Sockets don't have to deal with > Thomas> these, f'rinstance. > >Yes, if you want to talk directly to the device then you have to make >sure that the device is still there to talk to. Verbs don't do that? Tom. 
From rolandd at cisco.com Tue Aug 30 11:15:21 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 11:15:21 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> (Thomas Talpey's message of "Tue, 30 Aug 2005 14:10:23 -0400") References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> Message-ID: <52k6i3gukm.fsf@cisco.com> Thomas> Verbs don't do that? Not as they are currently defined. And I don't think we want to add reference counting (aka cache-line pingpong) into every verbs call including the fast path to make sure that a device doesn't go away in the middle of the call. - R. From halr at voltaire.com Tue Aug 30 11:14:44 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 14:14:44 -0400 Subject: [openib-general] license mismatches In-Reply-To: References: Message-ID: <1125425684.4401.1682.camel@hal.voltaire.com> On Mon, 2005-08-29 at 18:20, Kanevsky, Arkady wrote: > I had reviewed the licenses used by files in > https://openib.org/svn/gen2/trunk. > The following .c and .h files do not match the OpenIB licenses: > https://openib.org/svn/gen2/trunk/src/userspace/management/osm/complib/Makefile.mlx > https://openib.org/svn/gen2/trunk/src/userspace/management/osm/opensm/osm_indent complib/Makefile.mlx is going away shortly. opensm/osm_indent is now fixed. Thanks. 
-- Hal From Thomas.Talpey at netapp.com Tue Aug 30 11:25:30 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 30 Aug 2005 14:25:30 -0400 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <52k6i3gukm.fsf@cisco.com> References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> <52k6i3gukm.fsf@cisco.com> Message-ID: <6.2.3.4.2.20050830142125.063e1cc0@exnane01.nane.netapp.com> At 02:15 PM 8/30/2005, Roland Dreier wrote: > Thomas> Verbs don't do that? > >Not as they are currently defined. And I don't think we want to add >reference counting (aka cache-line pingpong) into every verbs call >including the fast path to make sure that a device doesn't go away in >the middle of the call. Well, you're saying somebody has to do it, right? Is it easier to fob this off to upper layers that (frankly) don't care what hardware they're talking to!? This means we have N copies of this, and N ways to do it. Talk about cacheline pingpong. Sorry but it suddenly sounds like we're all writing device drivers, not developing upper layers. This is a mistake. Tom. From rolandd at cisco.com Tue Aug 30 11:35:41 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 11:35:41 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <6.2.3.4.2.20050830142125.063e1cc0@exnane01.nane.netapp.com> (Thomas Talpey's message of "Tue, 30 Aug 2005 14:25:30 -0400") References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> <52k6i3gukm.fsf@cisco.com> <6.2.3.4.2.20050830142125.063e1cc0@exnane01.nane.netapp.com> Message-ID: <527je3gtmq.fsf@cisco.com> Thomas> Well, you're saying somebody has to do it, right? 
Is it Thomas> easier to fob this off to upper layers that (frankly) Thomas> don't care what hardware they're talking to!? This means Thomas> we have N copies of this, and N ways to do it. Talk about Thomas> cacheline pingpong. Upper layers have the luxury of being able to do this at a per-connection level, can sleep, etc. If we push it down into the verbs, then we have to do it in every verbs call, including the fast path verbs call. And that means we get into all sorts of crazy code to deal with a device disappearing between a consumer calling ib_post_send() and the core code being entered, etc. Right now we have a very simple set of rules: An upper level protocol consumer may begin using an IB device as soon as the add method of its struct ib_client is called for that device. A consumer must finish all cleanup and free all resources relating to a device before returning from the remove method. A consumer is permitted to sleep in its add and remove methods. - R. From eitan at mellanox.co.il Tue Aug 30 11:52:06 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 30 Aug 2005 21:52:06 +0300 Subject: [openib-general] Re: [PATCH] osm: osm_vendor_umad osm_vendor_get_all_port_attr bug In-Reply-To: <1125413188.4401.1413.camel@hal.voltaire.com> References: <86fyt3219m.fsf@mtl066.yok.mtl.com> <1124995971.4421.852.camel@hal.voltaire.com> <4311CD74.8060006@mellanox.co.il> <1125413188.4401.1413.camel@hal.voltaire.com> Message-ID: <4314AAD6.70700@mellanox.co.il> Hal Rosenstock wrote: > Hi Eitan, > > On Sun, 2005-08-28 at 10:43, Eitan Zahavi wrote: > >>I agree that the index 0 of the guid,lids and the new linkstates arrays >>should be reserved for the default port. In the loop the index j is used >>to loop over all ports 0 .. N of the HCA's. It is clear that for HCA's >>port 0 will be skipped. However, since the current code does not advance >>the lid and linkstate accordingly the place for the port 0 will not be >>kept empty for the port 0. 
>> >>Current code: >>for (j = 0; j <= ca.numports; j++) { >> if (ca.ports[j]) { >> *p_lid = ca.ports[j]->base_lid; >> *p_linkstates = ca.ports[j]->state; >> p_lid++; >> p_linkstates++; >> } >>} >>Should be: >>for (j = 0; j <= ca.numports; j++) { >> if (ca.ports[j]) { >> *p_lid = ca.ports[j]->base_lid; >> *p_linkstates = ca.ports[j]->state; >> } >> /* as j advance even if the port is not valid, so should the >> lid and state pointer */ >> p_lid++; >> p_linkstates++; >>} >> >>As I could not convince you with the above explanations in my previous >>mail I have written the following simple program to test the pre-and >>post patch effect: >> >>/* >> test program for dumping osm_vendor_get_all_port_attr results >>*/ >> >>#include "stdio.h" >>#include >>#include >>#include >>#include >> >>#include >>#define GUID_ARRAY_SIZE 64 >>int >>main() { >> osm_vendor_t vendor; >> osm_log_t osm_log; >> ib_api_status_t status; >> uint32_t num_ports = GUID_ARRAY_SIZE; >> ib_port_attr_t attr_array[GUID_ARRAY_SIZE]; >> int i; >> >> osm_log_construct(&osm_log); >> osm_log_init(&osm_log, TRUE, 0xff, "/tmp/test_vendor.log"); >> >> osm_vendor_init(&vendor, &osm_log, 1000); >> >> status = osm_vendor_get_all_port_attr(&vendor, attr_array, &num_ports ); >> if ( status != IB_SUCCESS ) >> { >> printf( "\nError from osm_vendor_get_all_port_attr (%x)\n", status); >> return; >> } >> >> printf("\nListing GUIDs:\n"); >> for (i = 0; i < num_ports; i++) { >> printf("Port %i:0x%"PRIx64" lid:0x%04x state:%x\n", >> i, >> cl_hton64(attr_array[i].port_guid), >> cl_ntoh16(attr_array[i].lid), >> attr_array[i].link_state >> ); >> } >> >> exit(0); >>} >> >>Without the above change I get: >>Listing GUIDs: >>Port 0:0xd9dffffff3d55 lid:0x0300 state:4 >>Port 1:0xd9dffffff3d55 lid:0x0400 state:4 >>Port 2:0xd9dffffff3d56 lid:0x0000 state:0 >> >>After the simple change I get: >>Listing GUIDs: >>Port 0:0xd9dffffff3d55 lid:0x0300 state:4 >>Port 1:0xd9dffffff3d55 lid:0x0300 state:4 >>Port 2:0xd9dffffff3d56 lid:0x0400 
state:4 >> >>So as you can see - without the fix the lid of port 2 is presented as >>the lid of port 1... > > > I understand the difference in the code and think the difference > perhaps relates to either a lack of clarity or confusion with the API as > follows: I don't see where it is defined what the index into the port > array means. I think we have 2 different interpretations and this > relates to how opensm/main.c handles the results of calling this > routine. I do not follow you. Do you suggest it is OK the port at index 1 will have the guid of port 1 but the lid and state of port 2? I did not complain about what port is reported at what index: Just about the mismatch of guids and lids. Please see above. > > So the patch is incomplete although perhaps correct depending on the > interpretation. I'm not adverse to changing this as you indicate. I > would like to resolve this before embarking on the 1.8.0 merge. > > Also, since we are in this area, I don't think switch port 0 would be > handled correctly by this code either. > > >>I guess you use ibstatus in your mail. Well ibstatus uses its own code >>so it shows the correct info anyway. > > > That was just to show that the port states corresponded to the ones > shown by osm_vendor_get_all_port_attr with the print statements. Nothing > else. > > -- Hal > > >>In my case that is: >>swlab223:/tmp/bld/libvendor>ibstatus >>Infiniband device 'mthca0' port 1 status: >> default gid: fe80:0000:0000:0000:000d:9dff:ffff:3d55 >> base lid: 0x3 >> sm lid: 0x1 >> state: 4: ACTIVE >> phys state: 5: LinkUp >> rate: 10 Gb/sec (4X) >> >>Infiniband device 'mthca0' port 2 status: >> default gid: fe80:0000:0000:0000:000d:9dff:ffff:3d56 >> base lid: 0x4 >> sm lid: 0x1 >> state: 4: ACTIVE >> phys state: 5: LinkUp >> rate: 10 Gb/sec (4X) >> >> >> >>Hal Rosenstock wrote: >> >>>Hi Eitan, >>> >>>On Sun, 2005-08-21 at 03:32, Eitan Zahavi wrote: >>> >>> >>>>osm_vendor_get_all_port_attr returns incorrect LID and state for >>>>device ports. 
This bug was caused by the fact that if a device port >>>>was skipped due to that fact it does not exist (HCA port 0). The >>>>lid and state pointers used as indexes into their corresponding >>>>return value arrays were not advancing to the next port index. >>>> >>>>So the return for a single HCA was mixing LID and state for the first >>>>port and displayed non initialized memory for the second port. >>> >>> >>>The array is not filled in as you claim. Port 0 does not take a slot on >>>an HCA. This looks fine to me as is (I added some print statements in >>>that loop as follows): >>> >>>osm_vendor_get_all_port_attr: port 0 >>>osm_vendor_get_all_port_attr: port 1 >>>osm_vendor_get_all_port_attr: port 1 lid 1 state 4 >>>osm_vendor_get_all_port_attr: port 2 >>>osm_vendor_get_all_port_attr: port 2 lid 0 state 1 >>> >>>Port 0 is skipped; port 1 is LID 1 and active; port 2 is not plugged in >>>and is down: >>> >>> Port 1: >>> State: Active >>> Physical state: LinkUp >>> Rate: 2 >>> Base lid: 1 >>> LMC: 0 >>> SM lid: 1 >>> Capability mask: 0x00500a68 >>> Port GUID: 0x0008f10403960559 >>> Port 2: >>> State: Down >>> Physical state: Polling >>> Rate: 2 >>> Base lid: 0 >>> LMC: 0 >>> SM lid: 0 >>> Capability mask: 0x00500a68 >>> Port GUID: 0x0008f1040396055a >>> >>>-- Hal >> From Thomas.Talpey at netapp.com Tue Aug 30 11:59:24 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 30 Aug 2005 14:59:24 -0400 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <527je3gtmq.fsf@cisco.com> References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> <52k6i3gukm.fsf@cisco.com> <6.2.3.4.2.20050830142125.063e1cc0@exnane01.nane.netapp.com> <527je3gtmq.fsf@cisco.com> Message-ID: <6.2.3.4.2.20050830145906.05104890@exnane01.nane.netapp.com> kDAPL does this! 
:-) At 02:35 PM 8/30/2005, Roland Dreier wrote: > Thomas> Well, you're saying somebody has to do it, right? Is it > Thomas> easier to fob this off to upper layers that (frankly) > Thomas> don't care what hardware they're talking to!? This means > Thomas> we have N copies of this, and N ways to do it. Talk about > Thomas> cacheline pingpong. > >Upper layers have the luxury of being able to do this at a >per-connection level, can sleep, etc. If we push it down into the >verbs, then we have to do it in every verbs call, including the fast >path verbs call. And that means we get into all sorts of crazy code >to deal with a device disappearing between a consumer calling >ib_post_send() and the core code being entered, etc. > >Right now we have a very simple set of rules: > > An upper level protocol consumer may begin using an IB device as > soon as the add method of its struct ib_client is called for that > device. A consumer must finish all cleanup and free all resources > relating to a device before returning from the remove method. > > A consumer is permitted to sleep in its add and remove methods. > > - R. From caitlinb at broadcom.com Tue Aug 30 12:00:42 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 30 Aug 2005 12:00:42 -0700 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbsandquery provider methods Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F54E@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of > glebn at voltaire.com > Sent: Tuesday, August 30, 2005 12:38 AM > To: Tom Tucker > Cc: openib-general at openib.org > Subject: Re: [openib-general] [PATCH][iWARP] Added provider > CM verbsandquery provider methods > > I don't want to move flamewar from netdev to this list... > > On Mon, Aug 29, 2005 at 12:46:47PM -0400, Tom Tucker wrote: > > > > >From my reading of the thread, there is resistence to > > TOE in general. 
The patch is just the messenger. The principal > > opponent is Dave Miller who strongly believes that stateless > > acceleration such as TSO (TCP Segmentation Offload) > suffices for all > > needs. Ironically, this requires a much higher level of stack > > integration than TOE does. > I think there is no irony in this. From my understanding of > the thread the higher level of integration is what Dave is > striving for. This will allow linux users to have the latest and > greatest, most RFC compliant and secure TCP stack and at the > same time enjoy 10Gb performance. He doesn't want to have two > different TCP implementations on the same machine (or more if > you install several different TOE cards). > > > > > > > TOE for the purposes of RDMA may have more legs within the > community, > > however, this has yet to be tested. > Is it possible to implement RDMA semantics using linux native > TCP stack (with hardware assistance of course)? Just asking. > > It is possible to implement RDMA on the host processor. But it will not match the performance of hardware. The difference will be substantial at 10G. If somebody could build a software-only solution that performed at 10G they would have done so. Having zero manufacturing cost would give them quite a competitive edge over solutions that required hardware. The need for offload has more to do with memory bandwidth than raw processing power. The data bandwidth required to support look-up of large data structures and for placement of the raw payload nearly consumes the bus bandwidth when operating at peak wire speeds. If you make that worse by moving the raw packets over the wire, and *then* copying them to a final location (a second memory move) *and* adding memory touches for accessing control structures... Well, it simply does not work. And no amount of wishful thinking will make it work. Customers do not seek hardware offload out of any great desire to turn their money over to hardware vendors.
They pay a premium for offloading NICs because it solves their problems. From swise at ammasso.com Tue Aug 30 12:01:29 2005 From: swise at ammasso.com (Steve Wise) Date: Tue, 30 Aug 2005 14:01:29 -0500 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <6.2.3.4.2.20050830145906.05104890@exnane01.nane.netapp.com> Message-ID: It looks like the kDAPL provider doesn't free up any resources in its remove functions. See dapl_remove_port() and dapl_provider_free()... I can't find where it cleans up any allocated QPs, CQs, etc... > -----Original Message----- > From: Talpey, Thomas [mailto:Thomas.Talpey at netapp.com] > Sent: Tuesday, August 30, 2005 1:59 PM > To: Roland Dreier > Cc: Steve Wise; openib-general at openib.org > Subject: Re: [openib-general] Re: RDMA Generic Connection Management > > kDAPL does this! > > :-) > > > At 02:35 PM 8/30/2005, Roland Dreier wrote: > > Thomas> Well, you're saying somebody has to do it, right? Is it > > Thomas> easier to fob this off to upper layers that (frankly) > > Thomas> don't care what hardware they're talking to!? This means > > Thomas> we have N copies of this, and N ways to do it. Talk about > > Thomas> cacheline pingpong. > > > >Upper layers have the luxury of being able to do this at a > >per-connection level, can sleep, etc. If we push it down into the > >verbs, then we have to do it in every verbs call, including the fast > >path verbs call. And that means we get into all sorts of crazy code > >to deal with a device disappearing between a consumer calling > >ib_post_send() and the core code being entered, etc. > > > >Right now we have a very simple set of rules: > > > > An upper level protocol consumer may begin using an IB device as > > soon as the add method of its struct ib_client is called for that > > device. A consumer must finish all cleanup and free all resources > > relating to a device before returning from the remove method.
> > > > A consumer is permitted to sleep in its add and remove methods. > > > > - R. > From rolandd at cisco.com Tue Aug 30 12:08:04 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 12:08:04 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <6.2.3.4.2.20050830145906.05104890@exnane01.nane.netapp.com> (Thomas Talpey's message of "Tue, 30 Aug 2005 14:59:24 -0400") References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> <52k6i3gukm.fsf@cisco.com> <6.2.3.4.2.20050830142125.063e1cc0@exnane01.nane.netapp.com> <527je3gtmq.fsf@cisco.com> <6.2.3.4.2.20050830145906.05104890@exnane01.nane.netapp.com> Message-ID: <523borgs4r.fsf@cisco.com> Thomas> kDAPL does this! :-) Does what? As far as I can tell kDAPL just ignores hotplug and routing and hopes the problems go away ;) I do see some racy uses of atomic variables in kDAPL, but they don't seem to protect against anything really. - R. From caitlinb at broadcom.com Tue Aug 30 12:07:07 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 30 Aug 2005 12:07:07 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F54F@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Steve Wise > Sent: Tuesday, August 30, 2005 12:01 PM > To: 'Talpey, Thomas'; 'Roland Dreier' > Cc: openib-general at openib.org > Subject: RE: [openib-general] Re: RDMA Generic Connection Management > > It looks like the kDAPL provider doesn't free up any > resources in its remove functions. > > See dapl_remove_port() and dapl_provider_free()... I cant > find where it cleans up any allocated QPs, CQ, etc... 
> > The kDAPL reference implementation doesn't free up any resources in its remove functions. The API allows the Provider to do so. On most features of this kind the policy on the Reference Implementation has always been to implement the things that everyone needs and leave optional features to be done by the first party that thinks it really needs them. From my experience hot-plug card insertion/removal has not been the highest priority in deploying RDMA. From eitan at mellanox.co.il Tue Aug 30 12:07:44 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 30 Aug 2005 22:07:44 +0300 Subject: [openib-general] OpenSM: new branch In-Reply-To: <1125419610.4401.1552.camel@hal.voltaire.com> References: <506C3D7B14CDD411A52C00025558DED6089DBA81@mtlex01.yok.mtl.com> <1125419610.4401.1552.camel@hal.voltaire.com> Message-ID: <4314AE80.7050807@mellanox.co.il> Hal Rosenstock wrote: > Hi again Yael, > > On Tue, 2005-08-23 at 03:25, Yael Kalka wrote: > >>Currently we have the osm-1.8.0-merge branch, that includes all the >>merges. > > > Is this the same as what is in Mellanox Gold 1.8.0 or are there some > additional changes not from the original OpenIB port (like bug fixes > beyond this) ? Just wondering... > Yes this is true. We fixed some bugs post 1.8.0 too. > As the Windows port appears to be evolving, how will this affect the > Linux version ? Is the plan to try to have one code base ? If so, where > will this be ? The intent is that the Windows port will contribute some changes back into the Linux tree so that we minimize the differences. However, as much as we try, we do have some cases where it is hard to converge. So some files will still remain modified in the Windows tree. Others are going to be automatically updated from the Linux tree.
> > Also, a minor comment on the file list so far (I know there have been > changes since you wrote this): > > 1.1 New files: > include/vendor/osm_pkt_randomizer.h > libvendor/osm_pkt_randomizer.c > > 1.2 Deleted files: > include/opensm/osm_pkt_randomizer.h > opensm/osm_pkt_randomizer.c Sorry I was sure it was in the list but I mad ea mistake. Also expect a removal of the complib/ul-generic/.. dir and file as we smashed it into its parent file. > > These appear to me to be moved (equivalent of deleted and added). > > Thanks. > > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From thomas.duffy.99 at alumni.brown.edu Tue Aug 30 13:11:38 2005 From: thomas.duffy.99 at alumni.brown.edu (Tom Duffy) Date: Tue, 30 Aug 2005 13:11:38 -0700 Subject: [openib-general] [PATCH] SDP: Use linux/list.h in sdp_buff.[ch] In-Reply-To: <20050830061034.GC14890@mellanox.co.il> References: <20050829150042.GT22342@mellanox.co.il> <20050830061034.GC14890@mellanox.co.il> Message-ID: <1125432698.18174.2.camel@localhost> On Tue, 2005-08-30 at 09:10 +0300, Michael S. Tsirkin wrote: > Quoting r. Tom Duffy : > > I just got back from vacation and I am still waiting for a machine so > > I can setup a rudimentary IB network at home to test my code. I have > > a patch that converts sdp_buff.[ch] to use linux/list.h (glad you > > didn't decide to work on that), but I want to test it before > > submitting to the list (it compiles!). > > I actually need that ASAP so that I can finally use > list_for_each and such instead of the stupid wrappers in > sdp_buff for my zcopy code. > Tom, could you please post the patch? Note: This patch is *UNTESTED*. Please verify before committing. 
Signed-off-by: Tom Duffy Index: drivers/infiniband/ulp/sdp/sdp_buff.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_buff.c (revision 3240) +++ drivers/infiniband/ulp/sdp/sdp_buff.c (working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Tom Duffy (thomas.duffy.99 at alumni.brown.edu) * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -52,31 +53,18 @@ static inline struct sdpc_buff *do_buff_ { struct sdpc_buff *buff; - if (!pool->head) + if (list_empty(&pool->head)) return NULL; if (fifo) - buff = pool->head; + buff = list_entry(pool->head.next, struct sdpc_buff, list); else - buff = pool->head->prev; + buff = list_entry(pool->head.prev, struct sdpc_buff, list); if (!test_func || !test_func(buff, usr_arg)) { - if (buff->next == buff && buff->prev == buff) - pool->head = NULL; - else { - buff->next->prev = buff->prev; - buff->prev->next = buff->next; - - pool->head = buff->next; - } - + list_del(&buff->list); pool->size--; - - buff->next = NULL; - buff->prev = NULL; - buff->pool = NULL; - } - else + } else buff = NULL; return buff; @@ -91,20 +79,10 @@ static inline void do_buff_q_put(struct /* fifo: false == tail, true == head */ BUG_ON(buff->pool); - if (!pool->head) { - buff->next = buff; - buff->prev = buff; - pool->head = buff; - } else { - buff->next = pool->head; - buff->prev = pool->head->prev; - - buff->next->prev = buff; - buff->prev->next = buff; - - if (fifo) - pool->head = buff; - } + if (fifo) + list_add(&buff->list, &pool->head); + else + list_add_tail(&buff->list, &pool->head); pool->size++; buff->pool = pool; @@ -116,10 +94,12 @@ static inline void do_buff_q_put(struct static inline struct sdpc_buff *sdp_buff_q_look(struct sdpc_buff_q *pool, int fifo) { - if (!pool->head || fifo) 
- return pool->head; + if (list_empty(&pool->head)) + return NULL; + if (fifo) + return list_entry(pool->head.next, struct sdpc_buff, list); else - return pool->head->prev; + return list_entry(pool->head.prev, struct sdpc_buff, list); } /* @@ -128,28 +108,11 @@ static inline struct sdpc_buff *sdp_buff static inline void do_buff_q_remove(struct sdpc_buff_q *pool, struct sdpc_buff *buff) { - struct sdpc_buff *prev; - struct sdpc_buff *next; - BUG_ON(pool != buff->pool); - if (buff->next == buff && buff->prev == buff) - pool->head = NULL; - else { - next = buff->next; - prev = buff->prev; - next->prev = prev; - prev->next = next; - - if (pool->head == buff) - pool->head = next; - } - + list_del(&buff->list); pool->size--; - buff->pool = NULL; - buff->next = NULL; - buff->prev = NULL; } /* @@ -157,7 +120,7 @@ static inline void do_buff_q_remove(stru */ void sdp_buff_q_init(struct sdpc_buff_q *pool) { - pool->head = NULL; + INIT_LIST_HEAD(&pool->head); pool->size = 0; } @@ -201,28 +164,24 @@ struct sdpc_buff *sdp_buff_q_fetch(struc void *arg), void *usr_arg) { - struct sdpc_buff *buff; + struct sdpc_buff *buff, *tmp; int result = 0; - int counter; /* * check to see if there is anything to traverse. */ - if (pool->head) + list_for_each_entry_safe(buff, tmp, &pool->head, list) { /* - * lock to prevent corruption of table + * XXX lock to prevent corruption of table */ - for (counter = 0, buff = pool->head; - counter < pool->size; counter++, buff = buff->next) { - result = test(buff, usr_arg); - if (result > 0) { - do_buff_q_remove(pool, buff); - return buff; - } - - if (result < 0) - break; + result = test(buff, usr_arg); + if (result > 0) { + do_buff_q_remove(pool, buff); + return buff; } + if (result < 0) + break; + } return NULL; } @@ -237,22 +196,18 @@ int sdp_buff_q_trav_head(struct sdpc_buf { struct sdpc_buff *buff; int result = 0; - int counter; /* * check to see if there is anything to traverse. 
*/ - if (pool->head) + list_for_each_entry(buff, &pool->head, list) { /* - * lock to prevent corruption of table + * XXX lock to prevent corruption of table */ - for (counter = 0, buff = pool->head; - counter < pool->size; counter++, buff = buff->next) { - - result = trav_func(buff, usr_arg); - if (result < 0) - break; - } + result = trav_func(buff, usr_arg); + if (result < 0) + break; + } return result; } @@ -398,7 +353,7 @@ static int sdp_buff_pool_alloc(struct sd m_pool->buff_cur++; } - if (!main_pool->pool.head) { + if (list_empty(&main_pool->pool.head)) { sdp_warn("Failed to allocate any buffers. <%d:%d:%d>", total, m_pool->buff_cur, m_pool->alloc_inc); @@ -545,7 +500,7 @@ struct sdpc_buff *sdp_buff_pool_get(void */ spin_lock_irqsave(&main_pool->lock, flags); - if (!main_pool->pool.head) { + if (list_empty(&main_pool->pool.head)) { result = sdp_buff_pool_alloc(main_pool); if (result < 0) { sdp_warn("Error <%d> allocating buffers.", result); @@ -554,23 +509,12 @@ struct sdpc_buff *sdp_buff_pool_get(void } } - buff = main_pool->pool.head; - - if (buff->next == buff) - main_pool->pool.head = NULL; - else { - buff->next->prev = buff->prev; - buff->prev->next = buff->next; - - main_pool->pool.head = buff->next; - } - + buff = list_entry(main_pool->pool.head.next, struct sdpc_buff, list); + list_del(&buff->list); main_pool->pool.size--; spin_unlock_irqrestore(&main_pool->lock, flags); - buff->next = NULL; - buff->prev = NULL; buff->pool = NULL; /* * main pool specific reset @@ -596,7 +540,6 @@ void sdp_buff_pool_put(struct sdpc_buff return; BUG_ON(buff->pool); - BUG_ON(buff->next || buff->prev); /* * reset pointers */ @@ -606,17 +549,7 @@ void sdp_buff_pool_put(struct sdpc_buff spin_lock_irqsave(&main_pool->lock, flags); - if (!main_pool->pool.head) { - buff->next = buff; - buff->prev = buff; - main_pool->pool.head = buff; - } else { - buff->next = main_pool->pool.head; - buff->prev = main_pool->pool.head->prev; - - buff->next->prev = buff; - buff->prev->next = 
buff; - } + list_add(&buff->list, &main_pool->pool.head); main_pool->pool.size++; @@ -634,16 +567,8 @@ void sdp_buff_pool_chain_link(struct sdp buff->tail = buff->head; buff->pool = &main_pool->pool; - if (!head) { - buff->next = buff; - buff->prev = buff; - } else { - buff->next = head; - buff->prev = head->prev; - - buff->next->prev = buff; - buff->prev->next = buff; - } + if (head) + __list_splice(&buff->list, &head->list); } /* @@ -652,8 +577,6 @@ void sdp_buff_pool_chain_link(struct sdp void sdp_buff_pool_chain_put(struct sdpc_buff *buff, u32 count) { unsigned long flags; - struct sdpc_buff *next; - struct sdpc_buff *prev; /* * return an entire Link of buffers to the queue, this save on * lock contention for the buffer pool, for code paths where @@ -665,18 +588,7 @@ void sdp_buff_pool_chain_put(struct sdpc spin_lock_irqsave(&main_pool->lock, flags); - if (!main_pool->pool.head) - main_pool->pool.head = buff; - else { - prev = buff->prev; - next = main_pool->pool.head->next; - - buff->prev = main_pool->pool.head; - main_pool->pool.head->next = buff; - - prev->next = next; - next->prev = prev; - } + list_add(&main_pool->pool.head, &buff->list); main_pool->pool.size += count; Index: drivers/infiniband/ulp/sdp/sdp_buff.h =================================================================== --- drivers/infiniband/ulp/sdp/sdp_buff.h (revision 3240) +++ drivers/infiniband/ulp/sdp/sdp_buff.h (working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Tom Duffy (thomas.duffy.99 at alumni.brown.edu) * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -36,18 +37,19 @@ #ifndef _SDP_BUFF_H #define _SDP_BUFF_H +#include + #include "sdp_queue.h" /* * structures */ struct sdpc_buff_q { - struct sdpc_buff *head; /* double linked list of buffers */ + struct list_head head; /* double linked list of buffers */ u32 size; /* number of buffers in the pool */ }; struct sdpc_buff { - struct sdpc_buff *next; - struct sdpc_buff *prev; + struct list_head list; u32 type; /* element type. (for generic queue) */ struct sdpc_buff_q *pool; /* pool currently holding this buffer. */ void (*release)(struct sdpc_buff *buff); /* release the object */ From Thomas.Talpey at netapp.com Tue Aug 30 13:10:33 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 30 Aug 2005 16:10:33 -0400 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <523borgs4r.fsf@cisco.com> References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> <52k6i3gukm.fsf@cisco.com> <6.2.3.4.2.20050830142125.063e1cc0@exnane01.nane.netapp.com> <527je3gtmq.fsf@cisco.com> <6.2.3.4.2.20050830145906.05104890@exnane01.nane.netapp.com> <523borgs4r.fsf@cisco.com> Message-ID: <6.2.3.4.2.20050830152550.0455dd90@exnane01.nane.netapp.com> At 03:08 PM 8/30/2005, Roland Dreier wrote: > Thomas> kDAPL does this! :-) > >Does what? As far as I can tell kDAPL just ignores hotplug and >routing and hopes the problems go away ;) I was referring to kDAPL's architecture, which does in fact address hotplug with async evd upcalls. In the early days of the reference port we implemented it on Solaris this way, for example. Tom. 
From Thomas.Talpey at netapp.com Tue Aug 30 13:13:41 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 30 Aug 2005 16:13:41 -0400 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <6.2.3.4.2.20050830152550.0455dd90@exnane01.nane.netapp.com > References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> <52k6i3gukm.fsf@cisco.com> <6.2.3.4.2.20050830142125.063e1cc0@exnane01.nane.netapp.com> <527je3gtmq.fsf@cisco.com> <6.2.3.4.2.20050830145906.05104890@exnane01.nane.netapp.com> <523borgs4r.fsf@cisco.com> <6.2.3.4.2.20050830152550.0455dd90@exnane01.nane.netapp.com> Message-ID: <6.2.3.4.2.20050830161148.05f71040@exnane01.nane.netapp.com> At 04:10 PM 8/30/2005, Talpey, Thomas wrote: >At 03:08 PM 8/30/2005, Roland Dreier wrote: >> Thomas> kDAPL does this! :-) >> >>Does what? As far as I can tell kDAPL just ignores hotplug and >>routing and hopes the problems go away ;) > >I was referring to kDAPL's architecture, which does in fact address >hotplug with async evd upcalls. In the early days of the reference >port we implemented it on Solaris this way, for example. And I remember naming the upcall "E_NIC_ON_FIRE". There was another one after putting it out, of course. :-) Tom. From mst at mellanox.co.il Tue Aug 30 13:42:27 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 30 Aug 2005 23:42:27 +0300 Subject: [openib-general] Re: [PATCH] SDP: Use linux/list.h in sdp_buff.[ch] In-Reply-To: <1125432698.18174.2.camel@localhost> References: <20050829150042.GT22342@mellanox.co.il> <20050830061034.GC14890@mellanox.co.il> <1125432698.18174.2.camel@localhost> Message-ID: <20050830204227.GA19275@mellanox.co.il> Quoting r. Tom Duffy : > > Tom, could you please post the patch? > > Note: This patch is *UNTESTED*. Please verify before committing. Thanks I'll take a look. 
I actually wanted to do something more drastic, killing most of sdp_buff altogether. -- MST From mshefty at ichips.intel.com Tue Aug 30 14:09:45 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Aug 2005 14:09:45 -0700 Subject: [openib-general] RDMA Generic Connection Management In-Reply-To: <52hdd8lb1e.fsf@cisco.com> References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <52hdd8lb1e.fsf@cisco.com> Message-ID: <4314CB19.1020109@ichips.intel.com> Roland Dreier wrote: > One solution is to start reference counting device references, but > that inevitably leads to bugs in ULPs -- protocol authors won't get it > right unless we make it really easy. And I don't see how to make the > reference counting trivial. > > Anyone have a better idea? I haven't figured out a way to make reference counting easy either. I should also point out that the kernel CM returns a device pointer when reporting REQ and SIDR_REQ events, so it has similar issues supporting hotplug. One other solution that I can think of is to report devices using some sort of ID, which users would then need to match with a specific device structure. We could probably even provide a call similar to ib_get_client_data_by_id(id, client) to assist with lookups. - Sean From gdror at mellanox.co.il Tue Aug 30 14:20:11 2005 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Wed, 31 Aug 2005 00:20:11 +0300 Subject: [openib-general] SDP Query Message-ID: <506C3D7B14CDD411A52C00025558DED60893AE08@mtlex01.yok.mtl.com> > From: Majumder, Rajib [mailto:rajib.majumder at csfb.com] > Sent: Tuesday, August 30, 2005 9:45 AM > > hello, > i have a requirement where SDP needs to tunnel non-IB traffic > via IB. The situation is as below: > 1) my process has LD_PRELOADE'ed libsdp.so > 2) the process opens a SOCK_STREAM connection to a remote > process via a WAN link. The remote process is a TCP listener. > all intermediate devices are Ethernet switches and IP routers. 
> 3) libsdp is required to receive in-bound packets from other > SDP processes. > in this scenario: > 1) do i need a multi-protocol switch or an IB-Ethernet gateway? > 2) even if i use the MP switch, will SDP work? Yes, you need an MP device. Obviously, in order to exit the IB and go to Ethernet, you'd have to have a device of which one port is IB and the other port is Ethernet. This device can be a simple IP router, in the form of a Linux box or a special product. It might also be an IB to Ethernet layer 2 bridge, or it might be a gateway. So what you can do depends on which box you're using to connect the IB to Ethernet. If it's an IP router or a switch, then you won't be able to move SDP traffic through it, and you'll have to connect your Ethernet host to the IB host through TCP/IP using IPoIB and IPoEthernet. If you have a gateway that can do SDP termination on one port and TCP termination on the other port, then you can use SDP in the host. In any case, libsdp policy is configurable. So for example, you can configure it to do the local stuff using SDP and the WAN stuff over IPoIB. You should probably consult your MP box vendor to learn more about the capabilities of the specific product. From rolandd at cisco.com Tue Aug 30 14:28:15 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 14:28:15 -0700 Subject: [openib-general] RDMA Generic Connection Management In-Reply-To: <4314CB19.1020109@ichips.intel.com> (Sean Hefty's message of "Tue, 30 Aug 2005 14:09:45 -0700") References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <52hdd8lb1e.fsf@cisco.com> <4314CB19.1020109@ichips.intel.com> Message-ID: <52u0h7f72o.fsf@cisco.com> Sean> I should also point out that the kernel CM returns a device Sean> pointer when reporting REQ and SIDR_REQ events, so it has Sean> similar issues supporting hotplug. Hmm, good point.
Perhaps we should make IB CM listens be per-device? Then the consumer is in control over which devices it might get called back on, and can clean up listens on device removal. Sean> One other solution that I can think of is to report devices Sean> using some sort of ID, which users would then need to match Sean> with a specific device structure. We could probably even Sean> provide a call similar to ib_get_client_data_by_id(id, Sean> client) to assist with lookups. That's one possibility. Another thing we could do is have the consumer pass a list of devices ("here are all the devices I know about") into the API. Then it would be clear in the client that the list of devices needs to be protected by a semaphore or whatever. We could also export an API to lock and unlock the list of registered devices, so clients could grab the semaphore across calls to the routing API. But exporting our locking to consumers is probably a bad idea -- how many people really grok the rtnl lock in the networking core? I have to admit none of these ideas thrill me. I hope someone has a bright idea for how to fix this. - R. From jlentini at netapp.com Tue Aug 30 14:33:03 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 30 Aug 2005 17:33:03 -0400 (EDT) Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <523borgs4r.fsf@cisco.com> References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> <52k6i3gukm.fsf@cisco.com> <6.2.3.4.2.20050830142125.063e1cc0@exnane01.nane.netapp.com> <527je3gtmq.fsf@cisco.com> <6.2.3.4.2.20050830145906.05104890@exnane01.nane.netapp.com> <523borgs4r.fsf@cisco.com> Message-ID: On Tue, 30 Aug 2005, Roland Dreier wrote: > Thomas> kDAPL does this! :-) > > Does what? 
As far as I can tell kDAPL just ignores hotplug and > routing and hopes the problems go away ;) > > I do see some racy uses of atomic variables in kDAPL, but they don't > seem to protect against anything really. > AFAICT none of the verbs clients in the tree (cache, CM, MAD, ping, SA client, uMAD, uVERBS, IPoIB, kDAPL, SDP, or SRP) synchronizes its verbs calls with the hotplug removal callback. For example when SRP's srp_add_one() function is called, I don't see a synchronization primitive set up to do this. Suppose for example that a thread was in SRP's CM callback handler, srp_cm_handler(), just before the call to ib_modify_qp() on line 713 when the device removal callback, srp_remove_one(), is called. Currently, the SRP code will still call ib_modify_qp() even though the device has been removed. I see similar problems with the other verbs consumers in the tree. This is a good indication that handling the hotplug events in ULPs is difficult. From swise at ammasso.com Tue Aug 30 14:49:19 2005 From: swise at ammasso.com (Steve Wise) Date: Tue, 30 Aug 2005 16:49:19 -0500 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <6.2.3.4.2.20050830152550.0455dd90@exnane01.nane.netapp.com> Message-ID: Tom, Can you explain this in more detail? I don't see how this is any different from the current openib client registration design. If I understand the openib implementation correctly, a ULP registers as an ib client, and gets notified of all new devices as well as all device removals via upcalls. The ULP _must_ clean up all allocated device resources in its remove function (and can even sleep if need be?). How is that different from a dapl consumer having to process an async EVD about device removal, and shut down all EPs and EVDs that use that device? I'm not advocating necessarily that the client add/remove is the way to go vs shielding ULPs from this totally, but I want to understand how EVDs help things. Thanks, Steve.
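The add/remove upcall rules discussed in this thread can be illustrated with a small userspace toy. This is only a sketch under stated assumptions: every name in it (toy_device, toy_client, toy_register_client, the ulp_* functions) is hypothetical and made up for illustration, none of it is the kernel's ib_client interface, and it models a single registered client with no locking at all.

```c
/* Userspace toy model of the add/remove upcall rules.
 * All names are hypothetical; this is NOT the kernel ib_client API. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_DEVICES 4

struct toy_device {
    int   present;       /* nonzero while the device is plugged in */
    void *client_data;   /* per-device state owned by the one toy client */
};

struct toy_client {
    void (*add)(struct toy_device *dev);     /* device became usable */
    void (*remove)(struct toy_device *dev);  /* device is going away */
};

static struct toy_device  toy_devices[TOY_MAX_DEVICES];
static struct toy_client *toy_registered;    /* one client keeps the toy short */

/* Registering a client replays add() for every device already present. */
static void toy_register_client(struct toy_client *client)
{
    toy_registered = client;
    for (int i = 0; i < TOY_MAX_DEVICES; i++)
        if (toy_devices[i].present)
            client->add(&toy_devices[i]);
}

/* Hot-add: mark the slot present, then notify the client if there is one. */
static int toy_add_device(int idx)
{
    if (idx < 0 || idx >= TOY_MAX_DEVICES || toy_devices[idx].present)
        return -1;
    toy_devices[idx].present = 1;
    if (toy_registered)
        toy_registered->add(&toy_devices[idx]);
    return 0;
}

/* Hot-remove: the client's remove() runs first and must free everything. */
static int toy_remove_device(int idx)
{
    if (idx < 0 || idx >= TOY_MAX_DEVICES || !toy_devices[idx].present)
        return -1;
    if (toy_registered)
        toy_registered->remove(&toy_devices[idx]);
    /* The rule from the thread: nothing may be left once remove() returns. */
    assert(toy_devices[idx].client_data == NULL);
    toy_devices[idx].present = 0;
    return 0;
}

/* A hypothetical ULP that allocates one state block per device. */
struct ulp_state { int qp_count; };
static int ulp_live_states;

static void ulp_add(struct toy_device *dev)
{
    struct ulp_state *s = calloc(1, sizeof(*s));
    assert(s != NULL);
    s->qp_count = 1;
    dev->client_data = s;
    ulp_live_states++;
}

static void ulp_remove(struct toy_device *dev)
{
    free(dev->client_data);
    dev->client_data = NULL;
    ulp_live_states--;
}

static struct toy_client ulp_client = { .add = ulp_add, .remove = ulp_remove };
```

Calling toy_add_device(0), then toy_register_client(&ulp_client), then toy_remove_device(0) walks through the whole life cycle. The toy says nothing about a callback that is still running while remove() is entered, which is the race raised above.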
> -----Original Message----- > From: Talpey, Thomas [mailto:Thomas.Talpey at netapp.com] > Sent: Tuesday, August 30, 2005 3:11 PM > To: Roland Dreier > Cc: Steve Wise; openib-general at openib.org > Subject: Re: [openib-general] Re: RDMA Generic Connection Management > > At 03:08 PM 8/30/2005, Roland Dreier wrote: > > Thomas> kDAPL does this! :-) > > > >Does what? As far as I can tell kDAPL just ignores hotplug and > >routing and hopes the problems go away ;) > > I was referring to kDAPL's architecture, which does in fact address > hotplug with async evd upcalls. In the early days of the reference > port we implemented it on Solaris this way, for example. > > Tom. > From mshefty at ichips.intel.com Tue Aug 30 14:59:11 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Aug 2005 14:59:11 -0700 Subject: [openib-general] RDMA Generic Connection Management In-Reply-To: <52u0h7f72o.fsf@cisco.com> References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <52hdd8lb1e.fsf@cisco.com> <4314CB19.1020109@ichips.intel.com> <52u0h7f72o.fsf@cisco.com> Message-ID: <4314D6AF.9050001@ichips.intel.com> Roland Dreier wrote: > Sean> I should also point out that the kernel CM returns a device > Sean> pointer when reporting REQ and SIDR_REQ events, so it has > Sean> similar issues supporting hotplug. > > Hmm, good point. Perhaps we should make IB CM listens be per-device? > Then the consumer is in control over which devices it might get called > back on, and can clean up listens on device removal. This seems like an easy way to fix this for the CM. I'll need to think about it some more, but unless there are objections, I will start work on this change. As for the other possibilities, I'm not wild about them either. 
- Sean
From rolandd at cisco.com Tue Aug 30 15:03:26 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 15:03:26 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: (James Lentini's message of "Tue, 30 Aug 2005 17:33:03 -0400 (EDT)") References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> <52k6i3gukm.fsf@cisco.com> <6.2.3.4.2.20050830142125.063e1cc0@exnane01.nane.netapp.com> <527je3gtmq.fsf@cisco.com> <6.2.3.4.2.20050830145906.05104890@exnane01.nane.netapp.com> <523borgs4r.fsf@cisco.com> Message-ID: <52mzmzf5g1.fsf@cisco.com> James> For example when SRP's srp_add_one() function is called, I James> don't see a synchronization primitive setup to do James> this. Suppose for example that a thread was in SRP's CM James> callback handler, srp_cm_handler(), just before the call to James> ib_modify_qp() on line 713 when the device removal James> callback, srp_remove_one(), is called. Currently, the SRP James> code will still call ib_modify_qp() even though the device James> has been removed. That's a good catch. I'll add a little code to SRP to fix it. I also plan on fixing the user MAD problems later this week. User verbs is definitely broken and needs to be fixed, although I worked around the worst problems by taking a reference on the low-level driver module so at least it can't be unloaded. I think the SA client is OK, since the ib_unregister_event_handler and ib_unregister_mad_agent calls in the remove method should wait until everything is cleaned up. Similarly I think the MAD module is OK. - R.
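The idea in the message above, that an unregister call should wait until every callback has finished, can be sketched in userspace C. This is only an illustration using POSIX threads: `struct agent`, `agent_callback_enter()`, `agent_callback_exit()` and `agent_unregister()` are hypothetical names chosen for the sketch, not the ib_mad or ib_sa API, and the kernel code uses its own primitives.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a registered agent; not a kernel structure. */
struct agent {
	pthread_mutex_t lock;
	pthread_cond_t  idle;      /* signalled when in_flight drops to 0 */
	int             in_flight; /* callbacks currently running */
	bool            closing;   /* set once unregister has started */
};

void agent_init(struct agent *a)
{
	pthread_mutex_init(&a->lock, NULL);
	pthread_cond_init(&a->idle, NULL);
	a->in_flight = 0;
	a->closing = false;
}

/* Call before running a callback; returns false if it must be skipped. */
bool agent_callback_enter(struct agent *a)
{
	bool ok;

	pthread_mutex_lock(&a->lock);
	ok = !a->closing;
	if (ok)
		a->in_flight++;
	pthread_mutex_unlock(&a->lock);
	return ok;
}

/* Call after a callback returns. */
void agent_callback_exit(struct agent *a)
{
	pthread_mutex_lock(&a->lock);
	assert(a->in_flight > 0);
	if (--a->in_flight == 0)
		pthread_cond_broadcast(&a->idle);
	pthread_mutex_unlock(&a->lock);
}

/* Refuse new callbacks, then block until running ones have returned. */
void agent_unregister(struct agent *a)
{
	pthread_mutex_lock(&a->lock);
	a->closing = true;
	while (a->in_flight > 0)
		pthread_cond_wait(&a->idle, &a->lock);
	pthread_mutex_unlock(&a->lock);
}
```

Once agent_unregister() returns, no callback is running and none can start, so whatever the callbacks touch could be torn down. In this sketch it must not be called from inside a callback, since it would then wait on itself.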
From halr at voltaire.com Tue Aug 30 15:11:58 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 18:11:58 -0400 Subject: [openib-general] Re: [PATCH] osm: osm_vendor_umad osm_vendor_get_all_port_attr bug In-Reply-To: <4314AAD6.70700@mellanox.co.il> References: <86fyt3219m.fsf@mtl066.yok.mtl.com> <1124995971.4421.852.camel@hal.voltaire.com> <4311CD74.8060006@mellanox.co.il> <1125413188.4401.1413.camel@hal.voltaire.com> <4314AAD6.70700@mellanox.co.il> Message-ID: <1125439895.4401.2187.camel@hal.voltaire.com> On Tue, 2005-08-30 at 14:52, Eitan Zahavi wrote: > >>Without the above change I get: > >>Listing GUIDs: > >>Port 0:0xd9dffffff3d55 lid:0x0300 state:4 > >>Port 1:0xd9dffffff3d55 lid:0x0400 state:4 > >>Port 2:0xd9dffffff3d56 lid:0x0000 state:0 > >> > >>After the simple change I get: > >>Listing GUIDs: > >>Port 0:0xd9dffffff3d55 lid:0x0300 state:4 > >>Port 1:0xd9dffffff3d55 lid:0x0300 state:4 > >>Port 2:0xd9dffffff3d56 lid:0x0400 state:4 > >> > >>So as you can see - without the fix the lid of port 2 is presented as > >>the lid of port 1... > > > > > > I understand the difference in the code and think the difference > > perhaps relates to either a lack of clarity or confusion with the API as > > follows: I don't see where it is defined what the index into the port > > array means. I think we have 2 different interpretations and this > > relates to how opensm/main.c handles the results of calling this > > routine. > I do not follow you. Do you suggest it is OK the port at index 1 will have > the guid of port 1 but the lid and state of port 2? No. > I did not complain about what port is reported at what index: > Just about the mismatch of guids and lids. Please see above. I'm with you now. Thanks. Applied. 
-- Hal From mshefty at ichips.intel.com Tue Aug 30 15:53:28 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Aug 2005 15:53:28 -0700 Subject: [openib-general] ibv_get_async_event Message-ID: <4314E368.1020607@ichips.intel.com> This was brought up before, but to summarize, ibv_get_async_event() can return events for objects (CQ, QP, SRQ) that may have been destroyed. Likewise for ibv_get_cq_event(). Roland, would a patch to fix this that is similar to what was done for uCM be acceptable? (I can describe the method in more detail if you'd like.) The drawback is that it basically adds reference counting, which would require calling ibv_put_event(). It appears that Arlin is hitting this issue with his DAPL tests (IBV_EVENT_COMM_EST). - Sean From halr at voltaire.com Tue Aug 30 16:36:52 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Aug 2005 19:36:52 -0400 Subject: [openib-general] kernel oops In-Reply-To: <43148390.7010605@ichips.intel.com> References: <430F4DBD.4070703@xsigo.com> <43138B0E.1090309@ichips.intel.com> <1125407465.4401.1246.camel@hal.voltaire.com> <43148390.7010605@ichips.intel.com> Message-ID: <1125445011.4401.2434.camel@hal.voltaire.com> On Tue, 2005-08-30 at 12:04, Sean Hefty wrote: > Hal Rosenstock wrote: > > Why would ib_at_paths_by_route be called if no route were obtained (from > > ib_at_route_by_ip) ? Isn't that a ucmpost issue ? (I also agree it's not > > good for UAT to crash). > > The assumption that I made was that the call to ib_at_route_by_ip() > would fail if given an invalid route. That seems reasonable (but I haven't tried this but will once I get some spare cycles). > Also, since ucmpost is a simple > test app designed more to test the CM than AT, I kept error testing to a ^^^^^^^ handling > minimum. > > > It needs to be a valid route struct. I'm not sure how the kernel can > > validate that is the case. It does check for NULL pointer but this is > > bad pointer. 
> > Struct ib_at_ib_route should probably change the struct ibv_device > *out_dev field. It looks like this field is actually set to a struct > ib_device * that is a kernel pointer. Ah, that's the kernel pointer you were referring to. [I missed that before.] > Can we just remove this field and > use the sgid to locate the correct device structure in the kernel, or > fail if it cannot be located? That seems like a good idea. > >>The AT code appears to passing a kernel pointer up to the userspace app, > >>and then requires that pointer to be passed back to the kernel. This > >>Needs to be changed to pass up some identifier that can be validated on > >>the return to the kernel. > > > > Isn't it copying the ib_route structure to userspace ? > > Yes - but that contains the kernel device pointer. And looking at it > more, the ABI contains pointers in the data structures. This should > cause problems with 32-bit apps running on 64-bit kernels. > > I'm not sure how desirable it is to fix these issues versus moving to > whatever the new CM abstraction API is. Won't AT still be needed under the new CM abstraction for IB ? I guess the answer is unclear. It still seems to me that it should be fixed until there is something else to take its place. Do you concur ? -- Hal From rolandd at cisco.com Tue Aug 30 16:42:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 16:42:05 -0700 Subject: [openib-general] Re: ibv_get_async_event In-Reply-To: <4314E368.1020607@ichips.intel.com> (Sean Hefty's message of "Tue, 30 Aug 2005 15:53:28 -0700") References: <4314E368.1020607@ichips.intel.com> Message-ID: <52irxnf0vm.fsf@cisco.com> Sean> Roland, would a patch to fix this that is similar to what Sean> was done for uCM be acceptable? (I can describe the method Sean> in more detail if you'd like.) The drawback is that it Sean> basically adds reference counting, which would require Sean> calling ibv_put_event(). 
Hmm, I'd rather just sweep through the list of events when we destroy a CQ/QP/SRQ and delete any events that refer to the object we're destroying. It's on my to-do list but I'll definitely take patches if you do it first. Sean> It appears that Arlin is hitting this issue with his DAPL Sean> tests (IBV_EVENT_COMM_EST). Yeah, we should definitely fix it then. - R. From ftillier at silverstorm.com Tue Aug 30 16:50:05 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Tue, 30 Aug 2005 16:50:05 -0700 Subject: [openib-general] Re: ibv_get_async_event In-Reply-To: <52irxnf0vm.fsf@cisco.com> Message-ID: <000e01c5adbd$8c977ed0$9e5aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, August 30, 2005 4:42 PM > > Sean> Roland, would a patch to fix this that is similar to what > Sean> was done for uCM be acceptable? (I can describe the method > Sean> in more detail if you'd like.) The drawback is that it > Sean> basically adds reference counting, which would require > Sean> calling ibv_put_event(). > > Hmm, I'd rather just sweep through the list of events when we destroy > a CQ/QP/SRQ and delete any events that refer to the object we're > destroying. It's on my to-do list but I'll definitely take patches if > you do it first. Couldn't an event be "in flight" when the user destroys an object, causing the event to be delivered post-destruction? 
- Fab From mshefty at ichips.intel.com Tue Aug 30 16:50:52 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Aug 2005 16:50:52 -0700 Subject: [openib-general] kernel oops In-Reply-To: <1125445011.4401.2434.camel@hal.voltaire.com> References: <430F4DBD.4070703@xsigo.com> <43138B0E.1090309@ichips.intel.com> <1125407465.4401.1246.camel@hal.voltaire.com> <43148390.7010605@ichips.intel.com> <1125445011.4401.2434.camel@hal.voltaire.com> Message-ID: <4314F0DC.6020803@ichips.intel.com> Hal Rosenstock wrote: >>Can we just remove this field and >>use the sgid to locate the correct device structure in the kernel, or >>fail if it cannot be located? > > That seems like a good idea. Quickly skimming through the code I couldn't easily locate where AT maintained a device list, or how it retrieved the device pointer. > Won't AT still be needed under the new CM abstraction for IB ? I guess > the answer is unclear. It still seems to me that it should be fixed > until there is something else to take its place. Do you concur ? Had the fix been easy (for me to figure out how to make anyway) I would have submitted a patch. Something like AT is likely to be needed, but it's not clear how close the final version will be to what's there now. If we can at least validate the device pointer, it may be good enough to continue using for the time being. - Sean From rolandd at cisco.com Tue Aug 30 16:53:24 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 16:53:24 -0700 Subject: [openib-general] Re: ibv_get_async_event In-Reply-To: <000e01c5adbd$8c977ed0$9e5aa8c0@infiniconsys.com> (Fab Tillier's message of "Tue, 30 Aug 2005 16:50:05 -0700") References: <000e01c5adbd$8c977ed0$9e5aa8c0@infiniconsys.com> Message-ID: <52br3ff0cr.fsf@cisco.com> Fab> Couldn't an event be "in flight" when the user destroys an Fab> object, causing the event to be delivered post-destruction? 
Not sure I follow you -- if we sweep the list of pending events and remove any that relate to the object being destroyed before we return from the destroy call, then I don't see how userspace can see any stale events after the destroy call returns. - R. From mshefty at ichips.intel.com Tue Aug 30 16:57:24 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Aug 2005 16:57:24 -0700 Subject: [openib-general] Re: ibv_get_async_event In-Reply-To: <000e01c5adbd$8c977ed0$9e5aa8c0@infiniconsys.com> References: <000e01c5adbd$8c977ed0$9e5aa8c0@infiniconsys.com> Message-ID: <4314F264.6010207@ichips.intel.com> Fab Tillier wrote: >>Hmm, I'd rather just sweep through the list of events when we destroy >>a CQ/QP/SRQ and delete any events that refer to the object we're >>destroying. It's on my to-do list but I'll definitely take patches if >>you do it first. > > Couldn't an event be "in flight" when the user destroys an object, causing the > event to be delivered post-destruction? I believe that sweeping through the list to cleanup events is necessary but not sufficient. The user could have retrieved an event and be returning from the get call as they call destroy in a separate thread. If destroy completes first, then the context returned from get event will be invalid. 
- Sean From rolandd at cisco.com Tue Aug 30 20:33:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 20:33:55 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: (James Lentini's message of "Tue, 30 Aug 2005 17:33:03 -0400 (EDT)") References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> <52k6i3gukm.fsf@cisco.com> <6.2.3.4.2.20050830142125.063e1cc0@exnane01.nane.netapp.com> <527je3gtmq.fsf@cisco.com> <6.2.3.4.2.20050830145906.05104890@exnane01.nane.netapp.com> <523borgs4r.fsf@cisco.com> Message-ID: <52zmqyeq58.fsf@cisco.com> I just committed this SRP fix, which should make sure we don't use a device after it's gone. And it actually simplifies the code a teeny bit... - R. --- ib_srp.c (revision 3262) +++ ib_srp.c (working copy) @@ -1030,7 +1030,7 @@ static void srp_release_class_dev(struct struct srp_host *host = container_of(class_dev, struct srp_host, class_dev); - kfree(host); + complete(&host->released); } static struct class srp_class = { @@ -1289,6 +1289,7 @@ static struct srp_host *srp_add_port(str INIT_LIST_HEAD(&host->target_list); init_MUTEX(&host->target_mutex); + init_completion(&host->released); host->dev = device; host->port = port; @@ -1317,12 +1318,6 @@ static struct srp_host *srp_add_port(str goto err_class; /* XXX ibdev / port files as well */ - /* - * Take another reference so we can unregister and then free - * IB resources afterwards. 
- */ - class_device_get(&host->class_dev); - return host; err_class: @@ -1392,6 +1387,11 @@ static void srp_remove_one(struct ib_dev dev_list = ib_get_client_data(device, &srp_client); list_for_each_entry_safe(host, tmp_host, dev_list, list) { + class_device_unregister(&host->class_dev); + wait_for_completion(&host->released); + + printk(KERN_ERR "Hey, host (port %d) is released\n", host->port); + down(&host->target_mutex); list_for_each_entry_safe(target, tmp_target, @@ -1403,10 +1403,9 @@ static void srp_remove_one(struct ib_dev up(&host->target_mutex); - class_device_unregister(&host->class_dev); ib_dereg_mr(host->mr); ib_dealloc_pd(host->pd); - class_device_put(&host->class_dev); + kfree(host); } } Index: ib_srp.h =================================================================== --- ib_srp.h (revision 3262) +++ ib_srp.h (working copy) @@ -74,6 +74,7 @@ struct srp_host { struct class_device class_dev; struct list_head target_list; struct semaphore target_mutex; + struct completion released; struct list_head list; }; From rolandd at cisco.com Tue Aug 30 21:15:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 30 Aug 2005 21:15:54 -0700 Subject: [openib-general] Re: ibv_get_async_event In-Reply-To: <4314E368.1020607@ichips.intel.com> (Sean Hefty's message of "Tue, 30 Aug 2005 15:53:28 -0700") References: <4314E368.1020607@ichips.intel.com> Message-ID: <52vf1meo79.fsf@cisco.com> Sean> It appears that Arlin is hitting this issue with his DAPL Sean> tests (IBV_EVENT_COMM_EST). Actually from looking at the current libibverbs code, it seems that the QP pointer in an event of type IBV_EVENT_COMM_EST is garbage. So although there's a real issue to be resolved with destroying userspace objects with pending events, I think what Arlin is hitting must be something simpler and dumber. In any case I should have a full solution tomorrow. It requires a kernel ABI bump but I'll make sure the new library works with old kernels. - R. 
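The "sweep the pending list on destroy" approach discussed in this thread might look roughly like the following userspace C sketch. The types and function names (`struct pending_event`, `evq_push()`, `evq_sweep()`) are made up for illustration and are not the libibverbs or uverbs data structures; locking is omitted, and the case Sean raises (an event already handed to another thread) is not covered by a sweep alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pending-event record; "object" is the CQ/QP/SRQ it refers to. */
struct pending_event {
	struct pending_event *next;
	void *object;
	int type;
};

struct event_queue {
	struct pending_event *head;
	struct pending_event **tail; /* points at the last "next" pointer */
};

void evq_init(struct event_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

/* Append an event; returns 0 on success, -1 if allocation fails. */
int evq_push(struct event_queue *q, void *object, int type)
{
	struct pending_event *ev = malloc(sizeof(*ev));

	if (!ev)
		return -1;
	ev->next = NULL;
	ev->object = object;
	ev->type = type;
	*q->tail = ev;
	q->tail = &ev->next;
	return 0;
}

/* Drop every queued event that refers to object; returns how many were dropped. */
int evq_sweep(struct event_queue *q, const void *object)
{
	struct pending_event **link = &q->head;
	int dropped = 0;

	while (*link) {
		struct pending_event *ev = *link;

		if (ev->object == object) {
			*link = ev->next; /* unlink, then free */
			free(ev);
			dropped++;
		} else {
			link = &ev->next;
		}
	}
	q->tail = link; /* link now addresses the terminating NULL pointer */
	return dropped;
}

size_t evq_len(const struct event_queue *q)
{
	size_t n = 0;

	for (const struct pending_event *ev = q->head; ev; ev = ev->next)
		n++;
	return n;
}
```

A destroy routine would call evq_sweep() for the object, under whatever lock protects the queue, before returning to the caller, so that no queued event can still name the destroyed object afterwards.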
From mlleinin at hpcn.ca.sandia.gov Tue Aug 30 23:26:06 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Tue, 30 Aug 2005 23:26:06 -0700 Subject: [openib-general] Datacenter Fabric Workshop talks Message-ID: <1125469567.11018.484.camel@localhost> Talks from last week's OpenIB- and Intel-sponsored Datacenter Fabric Workshop are available at http://openib.org/doc.html If we are missing your talk please send it to me. Thanks, - Matt
From glebn at voltaire.com Tue Aug 30 23:55:24 2005 From: glebn at voltaire.com (Gleb Natapov) Date: Wed, 31 Aug 2005 09:55:24 +0300 Subject: [openib-general] [PATCH][iWARP] Added provider CM verbsandquery provider methods In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1F54E@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1F54E@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <20050831065524.GA21040@minantech.com> On Tue, Aug 30, 2005 at 12:00:42PM -0700, Caitlin Bestler wrote: > > > TOE for the purposes of RDMA may have more legs within the > > community, > > > however, this has yet to be tested. > > Is it possible to implement RDMA semantics using linux native > > TCP stack (with hardware assistance of course)? Just asking. > > > > > > It is possible to implement RDMA on the host processor. But > it will not match the performance of hardware. The difference > will be substantial at 10G. If somebody could build a software > only solution that performed at 10G they would have done so. > Having zero manufacturing cost would give them quite a > competitive edge over solutions that required hardware. > I am not talking about a software-only solution. Hardware assistance is needed, but something less than TOE. Something stateless like Dave wants. > The need for offload has more to do with memory bandwidth > than raw processing power.
The data bandwidth required to > support look-up of large data structures and for placement > of the raw payload nearly consumes the bus bandwidth when > operating at peak wire speeds. If you make that worse by > moving the raw packets over the wire, and *then* copying > them to a final location (a second memory move) *and* > additional memory touches for accessing control structures... > Linux already has the infrastructure for zero-copy send; with some hardware help it is possible to implement zero-copy receive too. Moving data in memory is out of the question. Anyway, I think these questions should be answered before moving this discussion to netdev. -- Gleb.
From yael at mellanox.co.il Wed Aug 31 00:45:42 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Wed, 31 Aug 2005 10:45:42 +0300 Subject: [openib-general] RE: osm-1.8.0-merge nit Message-ID: <506C3D7B14CDD411A52C00025558DED60CCF20@mtlex01.yok.mtl.com> Hi Hal, Thanks. Fixed in this file and some other files where the problem existed. Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Tuesday, August 30, 2005 7:14 PM To: Yael Kalka Cc: openib-general at openib.org Subject: osm-1.8.0-merge nit Hi Yael, I'm starting to look at the complib changes with the 1.8.0 merge. One trivial thing is the following in include/complib/cl_event_wheel.h: #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { # define END_C_DECLS } #else /* !__cplusplus */ # define BEGIN_C_DECLS # define END_C_DECLS #endif /* __cplusplus */ BEGIN_C_DECLS #include #include #include #include #include #include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { # define END_C_DECLS } #else /* !__cplusplus */ # define BEGIN_C_DECLS BEGIN_C_DECLS The second occurrence of this should be removed. -- Hal
From glebn at voltaire.com Wed Aug 31 05:06:00 2005 From: glebn at voltaire.com (Gleb Natapov) Date: Wed, 31 Aug 2005 15:06:00 +0300 Subject: [openib-general] iser/uverbs integration Message-ID: <20050831120600.GB21040@minantech.com> Hello, We've encountered a problem with iSCSI/iser/uverbs integration, so I decided to move this discussion here since the problem is more IB oriented. The way TCP iSCSI works is this: the connection is established entirely in userspace, then the connected fd is passed to the kernel via netlink, and from that point on the socket is used by the kernel part of iSCSI for the actual data transfer. Obviously it is desirable to keep the same semantics for IB too, but it is not possible to use a userspace QP from the kernel in the current implementation. Our proposal was to create a new socket type. This socket would use the new connection API from inside the kernel to connect to the target. When the fd is transferred from userspace to the kernel, iser will use the already connected QP for data transfer. If Yaron's proposal (WarpoverIB.txt) to use CM private data for transferring IP info on connect is accepted, the socket will even provide the getpeername() functionality that NFS needs to properly support /etc/exports. You can see the response to this proposal in the forwarded mail at the end. I looked into the OpenIB code to see what it would take to use a userspace QP from the kernel, and it doesn't look good. All resources needed to use the QP belong to userspace (uar, qp buffer, cq buffer). In order to use the QP from the kernel we would need to write data directly into the user pages, which is ugly and may be slow (what if the page is in HIGHMEM?). The question is: what is the best way to proceed? Will the changes needed to use a userspace QP from the kernel be accepted? How does NFS/RDMA work now?
-----Original Message----- From: open-iscsi at googlegroups.com [mailto:open-iscsi at googlegroups.com] On Behalf Of Christoph Hellwig Sent: Tuesday, August 23, 2005 1:02 PM To: open-iscsi at googlegroups.com Subject: Re: Connect/Disconnect for iscsi_iser transport On Tue, Aug 23, 2005 at 08:50:12AM +0300, Erez Zilber wrote: > > Hi, > > As I understand it, one of the ideas in open-iscsi is handling connect/disconnect > issues in user space. This may be problematic for iscsi_iser transport since > the ib-verbs implementation doesn't allow the transfer of user space qp to > kernel space (which is required if we want to connect from user space and then > use the same connection from kernel space). Therefore, I think that one of the > following solutions may be suitable for this problem (these solutions are ok for > other transports as well): Please implement that support in the uverbs then. It'll probably make sense for things like nfs over ib as well where we can move the connection establishment into the mount helper. There's no way the iscsi code is going to grow ioctls, and your socket approach is ugly as hell as well. -- Gleb.
From halr at voltaire.com Wed Aug 31 05:11:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Aug 2005 08:11:45 -0400 Subject: [openib-general] OpenSM Status Message-ID: <1125490304.4401.3195.camel@hal.voltaire.com> Hi, I saved the pre-1.8.0 OpenIB OpenSM as gen2/branches/osm-pre-1.8.0 The merge of the trunk up to 1.8.0 is now officially starting. This affects only userspace/management/osm. -- Hal
From guyg at voltaire.com Wed Aug 31 06:25:51 2005 From: guyg at voltaire.com (Guy German) Date: Wed, 31 Aug 2005 16:25:51 +0300 Subject: [openib-general][PATCH][kdapl]: FMR and EVD patch In-Reply-To: References: <1125413613.4127.13.camel@r2d2> Message-ID: <1125494751.3794.24.camel@r2d2> On Tue, 2005-08-30 at 12:28 -0400, James Lentini wrote: > > 1. request CQ notification > > 2.
if cq !empty request CQ notification _again_ > > Can you explain step #2 in more detail? What does "CQ notification > _again_" entail? I was confusing 2 different things - sorry. The way I see it - there are 2 issues : the first issue is the verbs (which at this point in time is a bit more relevant). AFAICT, the way the verbs completion works, there should be no race: - upcall received - wakes up thread - thread requests completion notification - thread poll the rest of the cq and exit The difference between the verbs model and the kDAPL event model is that the upcall does not drain a wc from the cq - just notifies you that a completion arrived. You don't need to worry here about synchronizing a thread and an interrupt dequeue-ing from the same queue. The second issue is the race in kDAPL. Here the dapl already reaped an event for the consumer and it is delivered in the upcall, so it is more problematic to allow the thread and the upcall work simultaneously. If you don't want to rely on Mellanox's proprietary behavior you can do as you suggested (call dapl_evd_dto_callback again), but you also need to skip context (preferably to a tasklet). Guy From halr at voltaire.com Wed Aug 31 07:00:59 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Aug 2005 10:00:59 -0400 Subject: [openib-general] OpenSM 1.8.0 complib initial merge nits Message-ID: <1125496858.4401.3405.camel@hal.voltaire.com> Hi Yael, I'm in the process of merging the OpenSM 1.8.0 complib changes and found the following nits: There are a number of violations of the coding style in terms of alignment. include/complib/cl_event_wheel.h has 2 END_C_DECLS include/complib/cl_passivelock.h state added to _cl_plock struct (description should also be added) There is some dead code in complib/cl_timer.c::__cl_timer_prov_destroy /* Wait for the thread to exit. 
*/ /* if (tmp_gp_timer_prov->thread) pthread_join( tmp_gp_timer_prov->thread, NULL ); tmp_gp_timer_prov->thread = 0; */ /* Users should have cancelled all timers by now. */ /* CL_ASSERT( cl_is_qlist_empty( &tmp_gp_timer_prov->queue ) ); */ Onto libvendor next... -- Hal From jlentini at netapp.com Wed Aug 31 07:46:32 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 31 Aug 2005 10:46:32 -0400 (EDT) Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <52zmqyeq58.fsf@cisco.com> References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> <52k6i3gukm.fsf@cisco.com> <6.2.3.4.2.20050830142125.063e1cc0@exnane01.nane.netapp.com> <527je3gtmq.fsf@cisco.com> <6.2.3.4.2.20050830145906.05104890@exnane01.nane.netapp.com> <523borgs4r.fsf@cisco.com> <52zmqyeq58.fsf@cisco.com> Message-ID: On Tue, 30 Aug 2005, Roland Dreier wrote: > I just committed this SRP fix, which should make sure we don't use a > device after it's gone. And it actually simplifies the code a teeny bit... The device could still be used after it's gone. For example: - the user is configuring SRP via sysfs. The thread in srp_create_target() has just called ib_sa_path_rec_get() [srp.c line 1209] and is waiting for the path record query to complete in wait_for_completion() - the SA callback, srp_path_rec_completion(), is called. This callback thread will make several verb calls (ib_create_cq, ib_req_notify_cq, ib_create_qp, ...) without any coordination with the hotplug device removal callback, srp_remove_one Notice that if the SA client's hotplug removal function, ib_sa_remove_one(), ensured that all callbacks had completed before returning the problem would be fixed. This would protect all ULPs from having to deal with hotplug races in their SA callback function. The fix belongs in the SA client (the core stack), not in SRP. 
All the ULPs are deficient with respect to their hotplug synchronization. Given that there is a common problem, doesn't it make sense to try and solve it in a generic way instead of in each ULP?
From jlentini at netapp.com Wed Aug 31 07:49:19 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 31 Aug 2005 10:49:19 -0400 (EDT) Subject: [openib-general] [PATCH][SRP] white space fixes Message-ID: White space fixes Signed-off-by: James Lentini Index: ib_srp.c =================================================================== --- ib_srp.c (revision 3275) +++ ib_srp.c (working copy) @@ -845,8 +845,7 @@ out: return qp; } -static void srp_path_rec_completion(int status, - struct ib_sa_path_rec *pathrec, +static void srp_path_rec_completion(int status, struct ib_sa_path_rec *pathrec, void *target_ptr) { struct srp_target_port *target = target_ptr; @@ -1208,9 +1207,9 @@ retry_path: target->path_query_id = ib_sa_path_rec_get(host->dev, host->port, &target->path, - IB_SA_PATH_REC_DGID | - IB_SA_PATH_REC_SGID | - IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | IB_SA_PATH_REC_PKEY, 1000, GFP_KERNEL, srp_path_rec_completion,
From mst at mellanox.co.il Wed Aug 31 08:35:13 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 31 Aug 2005 18:35:13 +0300 Subject: [openib-general] Re: 2.6.13 changes and backpatches In-Reply-To: <1125444542.4401.2425.camel@hal.voltaire.com> References: <1125444542.4401.2425.camel@hal.voltaire.com> Message-ID: <20050831153513.GS22342@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: 2.6.13 changes and backpatches > > Hi Michael, > > You probably saw this but just to make sure there were changes to the > following for 2.6.13 for which backpatches are needed: > > core/ > sysfs.c > ucm.c > uat.c > srp/ib_srp.c ? (Not sure this is in prior to 2.6.13 yet). > > Will you be updating your backpatches ? > > -- Hal > I did this for 2.6.11 and 2.6.12, adding new patch class_3275_to_2_6_12. 2.6.9 to go. I tested ipoib sdp and uverbs. Thanks, -- MST
From mst at mellanox.co.il Wed Aug 31 08:44:14 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 31 Aug 2005 18:44:14 +0300 Subject: [openib-general] ipoib oops (again) Message-ID: <20050831154414.GT22342@mellanox.co.il> Hi, Roland! The following crash was triggered by ifconfig down. The crash site is at db7: drivers/infiniband/ulp/ipoib/ipoib_multicast.c:225 db3: 49 8b 45 70 mov 0x70(%r13),%rax include/linux/byteorder/swab.h:147 db7: 8b 40 20 mov 0x20(%rax),%eax which is this line: ipoib_multicast.c:225 priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); priv->broadcast appears to be NULL.
MST Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP: {:ib_ipoib:ipoib_mcast_join_finish+119} PGD 6a2d9067 PUD 6a7fd067 PMD 0 Oops: 0000 [1] SMP CPU 1 Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad ib_core Pid: 30999, comm: ib_mad1 Not tainted 2.6.12.2 RIP: 0010:[] {:ib_ipoib:ipoib_mcast_join_finish+119} RSP: 0000:ffff81016d9edc58 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff810168810000 RCX: 0000000000000000 RDX: ffff810164047480 RSI: ffff810164047490 RDI: ffff8101688100c4 RBP: ffff810164047480 R08: 0000000000000000 R09: ffff81016d9edd38 R10: ffff81016d9eddf8 R11: 0000000000000001 R12: 0000000000000000 R13: ffff810168810380 R14: ffff810168810000 R15: ffff81006b487098 FS: 0000000000000000(0000) GS:ffffffff80579f80(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000020 CR3: 0000000095c96000 CR4: 00000000000006e0 Process ib_mad1 (pid: 30999, threadinfo ffff81016d9ec000, task ffff81017f3c69c0) Stack: 0000000106426560 0000000000000001 0000000000000096 0000000000000296 0000000000000296 0000000000000096 0000000000000096 ffff81016d9edcb0 ffffffff88032c60 ffffffff8022bf06 Call Trace:{idr_remove+386} {:ib_ipoib:ipoib_mcast_join_complete+43} {:ib_core:ib_unpack+198} {:ib_sa:ib_sa_mcmember_rec_callback+64} {:ib_sa:recv_handler+117} {:ib_mad:ib_mad_completion_handler+941} {:ib_mad:ib_mad_completion_handler+0} {worker_thread+476} {default_wake_function+0} {__wake_up_common+64} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+204} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} Code: 8b 40 20 0f c8 41 89 85 f0 02 00 00 41 89 85 84 03 00 00 8b RIP {:ib_ipoib:ipoib_mcast_join_finish+119} RSP CR2: 0000000000000020 -- MST From rolandd at cisco.com Wed Aug 31 09:17:17 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 31 Aug 2005 09:17:17 -0700 Subject: [openib-general] Re: RDMA Generic Connection 
Management In-Reply-To: (James Lentini's message of "Wed, 31 Aug 2005 10:46:32 -0400 (EDT)") References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> <52k6i3gukm.fsf@cisco.com> <6.2.3.4.2.20050830142125.063e1cc0@exnane01.nane.netapp.com> <527je3gtmq.fsf@cisco.com> <6.2.3.4.2.20050830145906.05104890@exnane01.nane.netapp.com> <523borgs4r.fsf@cisco.com> <52zmqyeq58.fsf@cisco.com> Message-ID: <52r7cadqsy.fsf@cisco.com> James> The device could still be used after it's gone. For James> example: James> - the user is configuring SRP via sysfs. The thread in James> srp_create_target() has just called ib_sa_path_rec_get() James> [srp.c line 1209] and is waiting for the path record query James> to complete in wait_for_completion() - the SA callback, James> srp_path_rec_completion(), is called. This callback thread James> will make several verb calls (ib_create_cq, James> ib_req_notify_cq, ib_create_qp, ...) without any James> coordination with the hotplug device removal callback, James> srp_remove_one I don't think this can happen. How could srp_remove_one get past wait_for_completion(&host->released); if the sysfs file is still in use? James> Notice that if the SA client's hotplug removal function, James> ib_sa_remove_one(), ensured that all callbacks had James> completed before returning the problem would be fixed. This James> would protect all ULPs from having to deal with hotplug James> races in their SA callback function. The fix belongs in the James> SA client (the core stack), not in SRP. All SA client callbacks are driven by the MAD layer. And ib_sa_remove_one() does ib_unregister_mad_agent(), which should wait for all callbacks to finish. So I think we already do the best we can here. Unfortunately the SA client code must clean up after all the ULPs that depend on it, because ULPs can use the SA up until they know the device is gone. 
But I don't see a way around that. James> All the ULPs are deficient with respect to their hotplug James> synchronization. Given that there is a common problem, James> doesn't it make sense to try and solve it in a generic way James> instead of in each ULP? Yes, but what is the generic way? - R. From matt at ammasso.com Wed Aug 31 09:18:12 2005 From: matt at ammasso.com (Matt Finlay) Date: Wed, 31 Aug 2005 11:18:12 -0500 Subject: [openib-general] [PATCH] iwarp cm in kdapl Message-ID: Tom, Here is a patch against the IWarp branch that incorporates the IWarp CM calls into kDapl. -Matt Signed-off-by: Matt Finlay Index: ulp/kdapl/ib/dapl_openib_cm.h =================================================================== --- ulp/kdapl/ib/dapl_openib_cm.h (revision 3186) +++ ulp/kdapl/ib/dapl_openib_cm.h (working copy) @@ -38,6 +38,8 @@ #include "ib_sa.h" #include "ib_at.h" +#include "iw_cm.h" + struct dapl_cm_ctx { struct ib_at_ib_route dapl_rt; struct ib_sa_path_rec dapl_path; @@ -47,6 +49,7 @@ struct dapl_ep *ep; struct dapl_sp *sp; struct sockaddr *remote_ia_address; + void *iw_cr_id; spinlock_t lock; wait_queue_head_t wait; int retries; Index: ulp/kdapl/ib/dapl_openib_util.c =================================================================== --- ulp/kdapl/ib/dapl_openib_util.c (revision 3186) +++ ulp/kdapl/ib/dapl_openib_util.c (working copy) @@ -91,9 +91,9 @@ { enum ib_access_flags value = 0; - /* - * if (DAT_MEM_PRIV_LOCAL_READ_FLAG & priv) do nothing - */ + if (DAT_MEM_PRIV_LOCAL_READ_FLAG & priv) + value |= IB_ACCESS_LOCAL_READ; + if (DAT_MEM_PRIV_LOCAL_WRITE_FLAG & priv) value |= IB_ACCESS_LOCAL_WRITE; Index: ulp/kdapl/ib/dapl.h =================================================================== --- ulp/kdapl/ib/dapl.h (revision 3186) +++ ulp/kdapl/ib/dapl.h (working copy) @@ -282,6 +282,7 @@ /* maintenance fields */ boolean_t listening; /* PSP is registered & active */ struct ib_cm_id *cm_srvc_handle; /* Used by CM */ + struct iw_listen_ep *iw_ep_handle; struct 
list_head cr_list; /* CR pending queue */ int cr_list_count; /* count of CRs on queue */ }; Index: ulp/kdapl/ib/dapl_openib_cm.c =================================================================== --- ulp/kdapl/ib/dapl_openib_cm.c (revision 3186) +++ ulp/kdapl/ib/dapl_openib_cm.c (working copy) @@ -423,6 +423,137 @@ dapl_evd_connection_callback(cm_ctx, event, NULL, cm_ctx->ep); } +static void dapl_destroy_iw_cm_ctx(struct dapl_cm_ctx *cm_ctx) +{ + unsigned long flags; + int in_callback; + + spin_lock_irqsave(&cm_ctx->lock, flags); + cm_ctx->destroy = 1; + in_callback = cm_ctx->in_callback; + spin_unlock_irqrestore(&cm_ctx->lock, flags); + + if (!in_callback) { + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " >>> dapl_destroy_iw_cm_ctx: cm_ctx %p\n", cm_ctx); + if (cm_ctx->ep) + cm_ctx->ep->cm_ctx = NULL; + kfree(cm_ctx); + } +} +static struct dapl_cm_ctx * dapl_get_iw_cm_ctx(struct iw_cm_event *event, + void *iw_ctx) +{ + struct dapl_cm_ctx *cm_ctx; + + switch (event->event) { + case IW_EVENT_ACTIVE_CONNECT_RESULTS: + return (struct dapl_cm_ctx *)iw_ctx; + break; + case IW_EVENT_CONNECT_REQUEST: { + cm_ctx = kmalloc(sizeof *cm_ctx, GFP_KERNEL); + if (!cm_ctx) { + /* XXX reject?
*/ + return NULL; + } + + memset(cm_ctx, 0, sizeof *cm_ctx); + cm_ctx->sp = (struct dapl_sp *)iw_ctx; + spin_lock_init(&cm_ctx->lock); + init_waitqueue_head(&cm_ctx->wait); + return cm_ctx; + } + default: + return NULL; + break; + } + +} + +void dapl_iwarp_cm_cb_handler(struct iw_cm_event *event, + void *iw_ctx) +{ + struct dapl_cm_ctx *cm_ctx; + int ret; + unsigned long flags; + enum dat_event_number dat_event; + + cm_ctx = dapl_get_iw_cm_ctx(event, iw_ctx); + if (!cm_ctx) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, "dapl_iwarp_cm_cb_handler:" + " dapl_get_iw_cm_ctx failed\n"); + return; + } + + spin_lock_irqsave(&cm_ctx->lock, flags); + if (cm_ctx->destroy) { + spin_unlock_irqrestore(&cm_ctx->lock, flags); + return; + } + cm_ctx->in_callback = 1; + spin_unlock_irqrestore(&cm_ctx->lock, flags); + + switch (event->event) { + case IW_EVENT_ACTIVE_CONNECT_RESULTS: { + struct iw_conn_results results; + u8 *private_data = NULL; + + results = event->element.active_results; + + switch(results.result) { + case IW_CONN_ACCEPT: + dat_event = DAT_CONNECTION_EVENT_ESTABLISHED; + private_data = results.private_data; + break; + case IW_CONN_RESET: + dat_event = DAT_CONNECTION_EVENT_NON_PEER_REJECTED; + break; + case IW_CONN_PEER_REJECT: + dat_event = DAT_CONNECTION_EVENT_PEER_REJECTED; + break; + case IW_CONN_TIMEDOUT: + dat_event = DAT_CONNECTION_EVENT_TIMED_OUT; + break; + case IW_CONN_NO_ROUTE_TO_HOST: + dat_event = DAT_CONNECTION_EVENT_UNREACHABLE; + break; + default: + dat_event = DAT_CONNECTION_EVENT_BROKEN; + break; + } + + dapl_evd_connection_callback(cm_ctx, dat_event, private_data, cm_ctx->ep); + break; + } + case IW_EVENT_CONNECT_REQUEST: { + struct iw_conn_request request; + + request = event->element.conn_request; + cm_ctx->iw_cr_id = request.cr_id; + dapl_cr_callback(cm_ctx, DAT_CONNECTION_REQUEST_EVENT, + request.private_data, cm_ctx->sp); + + break; + } + default: + dapl_dbg_log(DAPL_DBG_TYPE_CM, "Unknown IWarp connection event: %d\n", + event->event); + break; + } + + 
spin_lock_irqsave(&cm_ctx->lock, flags); + ret = cm_ctx->destroy; + cm_ctx->in_callback = cm_ctx->destroy; + spin_unlock_irqrestore(&cm_ctx->lock, flags); + if (ret) { + if (cm_ctx->ep) + cm_ctx->ep->cm_ctx = NULL; + kfree(cm_ctx); + } + + return; +} + /* * dapl_ib_connect * @@ -452,6 +583,7 @@ { struct dapl_ia *ia; struct dapl_cm_ctx *cm_ctx; + struct ib_device *ib_dev = ep->common.owner_ia->provider->device; int status; if (ep->qp == NULL) { @@ -470,11 +602,14 @@ spin_lock_init(&cm_ctx->lock); init_waitqueue_head(&cm_ctx->wait); cm_ctx->ep = ep; - cm_ctx->cm_id = ib_create_cm_id(dapl_cm_active_cb_handler, cm_ctx); - if (IS_ERR(cm_ctx->cm_id)) { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, " CM ID creation failed\n"); - kfree(cm_ctx); - return -EAGAIN; + + if (ib_dev->node_type != IB_NODE_RNIC) { + cm_ctx->cm_id = ib_create_cm_id(dapl_cm_active_cb_handler, cm_ctx); + if (IS_ERR(cm_ctx->cm_id)) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " CM ID creation failed\n"); + kfree(cm_ctx); + return -EAGAIN; + } } cm_ctx->ep->cm_ctx = cm_ctx; @@ -501,20 +636,46 @@ cm_ctx->dapl_comp.context = cm_ctx; cm_ctx->retries = 0; cm_ctx->in_callback = 1; - status = - ib_at_route_by_ip(((struct sockaddr_in *)remote_ia_address)-> - sin_addr.s_addr, 0, 0, 0, &cm_ctx->dapl_rt, - &cm_ctx->dapl_comp); - if (status < 0) { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, " ib_at_route_by_ip failed " - "with status %d\n", status); - kfree(cm_ctx); - return -EAGAIN; + + if (ib_dev->node_type == IB_NODE_RNIC) { + + struct iw_conn_attr conn_attr; + + cm_ctx->in_callback = 0; + + /* Fill out the connection attributes */ + memset(&conn_attr, 0, sizeof(conn_attr)); + conn_attr.remote_addr = ((struct sockaddr_in *)remote_ia_address)->sin_addr; + conn_attr.remote_port = htons((u16)remote_conn_qual); + + /* Issue the connect */ + status = (*ib_dev->iwcm->connect_qp)(ep->qp, + &conn_attr, + dapl_iwarp_cm_cb_handler, + cm_ctx, + private_data, + private_data_size); + + if (status < 0) { + kfree(cm_ctx); + return status; + } + } else { + 
status = + ib_at_route_by_ip(((struct sockaddr_in *)remote_ia_address)-> + sin_addr.s_addr, 0, 0, 0, &cm_ctx->dapl_rt, + &cm_ctx->dapl_comp); + if (status < 0) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " ib_at_route_by_ip failed " + "with status %d\n", status); + kfree(cm_ctx); + return -EAGAIN; + } + + if (status == 1) + dapl_rt_comp_handler(cm_ctx->dapl_comp.req_id, cm_ctx, 1); } - if (status == 1) - dapl_rt_comp_handler(cm_ctx->dapl_comp.req_id, cm_ctx, 1); - return 0; } @@ -539,6 +700,7 @@ int dapl_ib_disconnect(struct dapl_ep *ep, enum dat_close_flags close_flags) { struct dapl_cm_ctx *cm_ctx = ep->cm_ctx; + struct ib_device *ib_dev; int status; dapl_dbg_log(DAPL_DBG_TYPE_CM, @@ -548,13 +710,24 @@ if (cm_ctx == NULL) return 0; - if (close_flags == DAT_CLOSE_ABRUPT_FLAG) - dapl_destroy_cm_id(cm_ctx); - else { - status = ib_send_cm_dreq(cm_ctx->cm_id, NULL, 0); - if (status) - printk(KERN_ERR "dapl_ib_disconnect: CM ID 0x%p " - "status %d\n", ep->cm_ctx, status); + ib_dev = ep->common.owner_ia->provider->device; + if (ib_dev->node_type == IB_NODE_RNIC) { + + status = (*ib_dev->iwcm->disconnect_qp)(ep->qp, + close_flags == DAT_CLOSE_ABRUPT_FLAG ? 
1 : 0); + if (status < 0) { + return status; + } + + } else { + if (close_flags == DAT_CLOSE_ABRUPT_FLAG) + dapl_destroy_cm_id(cm_ctx); + else { + status = ib_send_cm_dreq(cm_ctx->cm_id, NULL, 0); + if (status) + printk(KERN_ERR "dapl_ib_disconnect: CM ID 0x%p " + "status %d\n", ep->cm_ctx, status); + } } return 0; @@ -669,20 +842,41 @@ int dapl_ib_setup_conn_listener(struct dapl_ia *ia, struct dapl_sp *sp) { int status; + struct ib_device *ib_dev = ia->provider->device; - sp->cm_srvc_handle = ib_create_cm_id(dapl_cm_passive_cb_handler, sp); - if (IS_ERR(sp->cm_srvc_handle)) { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, " CM ID creation failed\n"); - return -EAGAIN; - } + if (ib_dev->node_type == IB_NODE_RNIC) { - status = ib_cm_listen(sp->cm_srvc_handle, cpu_to_be64(sp->conn_qual), - 0); - if (status) { - ib_destroy_cm_id(sp->cm_srvc_handle); - sp->cm_srvc_handle = NULL; + struct iw_listen_ep_attr listen_ep_attrs; + struct iw_listen_ep *ep_handle; + + listen_ep_attrs.event_handler = dapl_iwarp_cm_cb_handler; + listen_ep_attrs.listen_context = sp; + listen_ep_attrs.addr.s_addr = INADDR_ANY; + listen_ep_attrs.port = htons((u16)sp->conn_qual); + listen_ep_attrs.backlog = ((struct dapl_evd *)sp->evd)->qlen; - return status; + status = (*ib_dev->iwcm->create_listen_ep)(&listen_ep_attrs, &ep_handle); + if (status) { + return status; + } + + sp->iw_ep_handle = ep_handle; + + } else { + sp->cm_srvc_handle = ib_create_cm_id(dapl_cm_passive_cb_handler, sp); + if (IS_ERR(sp->cm_srvc_handle)) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " CM ID creation failed\n"); + return -EAGAIN; + } + + status = ib_cm_listen(sp->cm_srvc_handle, cpu_to_be64(sp->conn_qual), + 0); + if (status) { + ib_destroy_cm_id(sp->cm_srvc_handle); + sp->cm_srvc_handle = NULL; + + return status; + } } return 0; @@ -707,17 +901,32 @@ */ int dapl_ib_remove_conn_listener(struct dapl_ia *ia, struct dapl_sp *sp) { + struct ib_device *ib_dev = ia->provider->device; + int status; + dapl_dbg_log(DAPL_DBG_TYPE_CM, " >>> 
dapl_ib_remove_conn_listener: SP %p conn %p\n", sp, sp->cm_srvc_handle); - /* - * This will hang if called from CM thread context... - * Move back to using WQ... - */ - if (sp->cm_srvc_handle != NULL) { - ib_destroy_cm_id(sp->cm_srvc_handle); - sp->cm_srvc_handle = NULL; + if (ib_dev->node_type == IB_NODE_RNIC) { + + if (sp->iw_ep_handle != NULL) { + + status = (*ib_dev->iwcm->destroy_listen_ep)(sp->iw_ep_handle); + if (status) { + return status; + } + sp->iw_ep_handle = NULL; + } + } else { + /* + * This will hang if called from CM thread context... + * Move back to using WQ... + */ + if (sp->cm_srvc_handle != NULL) { + ib_destroy_cm_id(sp->cm_srvc_handle); + sp->cm_srvc_handle = NULL; + } } return 0; } @@ -742,6 +951,7 @@ int dapl_ib_reject_connection(struct dapl_cm_ctx *cm_ctx) { int status; + struct ib_device *ib_dev; if (cm_ctx == NULL) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, @@ -749,14 +959,28 @@ return 0; } - status = ib_send_cm_rej(cm_ctx->cm_id, IB_CM_REJ_CONSUMER_DEFINED, - NULL, 0, NULL, 0); - if (status) { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapl_ib_reject_connection: " - "ib_send_cm_rej failed: %d\n", status); - return status; + ib_dev = cm_ctx->sp->common.owner_ia->provider->device; + if (ib_dev->node_type == IB_NODE_RNIC) { + status = (*ib_dev->iwcm->reject_cr)(ib_dev, + cm_ctx->iw_cr_id, + NULL, + 0); + if (status) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapl_ib_reject_connection: " + "reject_cr failed: %d\n", status); + return status; + } + dapl_destroy_iw_cm_ctx(cm_ctx); + } else { + status = ib_send_cm_rej(cm_ctx->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); + if (status) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapl_ib_reject_connection: " + "ib_send_cm_rej failed: %d\n", status); + return status; + } + dapl_destroy_cm_id(cm_ctx); } - dapl_destroy_cm_id(cm_ctx); return 0; } @@ -788,6 +1012,7 @@ int status; struct ib_cm_rep_param passive_params; struct dapl_cm_ctx *cm_ctx; + struct ib_device *ib_dev; ia = ep->common.owner_ia; cm_ctx = cr->cm_ctx; 
@@ -818,28 +1043,42 @@ ep->cm_ctx = cm_ctx; cm_ctx->ep = ep; - memset(&passive_params, 0, sizeof passive_params); - passive_params.private_data = priv_data; - passive_params.private_data_len = priv_size; - passive_params.qp_num = ep->qp->qp_num; - passive_params.responder_resources = DAPL_IB_TARGET_MAX; - passive_params.initiator_depth = DAPL_IB_INITIATOR_DEPTH; - passive_params.rnr_retry_count = DAPL_IB_RNR_RETRY_COUNT; + ib_dev = ia->provider->device; + if (ib_dev->node_type == IB_NODE_RNIC) { - /* Transition QP to RTR */ - status = dapl_modify_qp_state_to_rtr(cm_ctx->cm_id, ep->qp); - if (status) { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapl_ib_accept_connection: " - "could not modify QP state to RTR status %d\n", - status); - goto reject; - } + status = (*ib_dev->iwcm->accept_cr)(ib_dev, + cm_ctx->iw_cr_id, + ep->qp, + (void *)priv_data, + priv_size); + if (status) { + dapl_destroy_iw_cm_ctx(cm_ctx); + return status; + } + } else { + memset(&passive_params, 0, sizeof passive_params); + passive_params.private_data = priv_data; + passive_params.private_data_len = priv_size; + passive_params.qp_num = ep->qp->qp_num; + passive_params.responder_resources = DAPL_IB_TARGET_MAX; + passive_params.initiator_depth = DAPL_IB_INITIATOR_DEPTH; + passive_params.rnr_retry_count = DAPL_IB_RNR_RETRY_COUNT; - status = ib_send_cm_rep(cm_ctx->cm_id, &passive_params); - if (status) { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapl_ib_accept_connection: " - "ib_send_cm_rep failed: %d\n", status); - goto reject; + /* Transition QP to RTR */ + status = dapl_modify_qp_state_to_rtr(cm_ctx->cm_id, ep->qp); + if (status) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapl_ib_accept_connection: " + "could not modify QP state to RTR status %d\n", + status); + goto reject; + } + + status = ib_send_cm_rep(cm_ctx->cm_id, &passive_params); + if (status) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapl_ib_accept_connection: " + "ib_send_cm_rep failed: %d\n", status); + goto reject; + } } return 0; Index: 
include/ib_verbs.h =================================================================== --- include/ib_verbs.h (revision 3275) +++ include/ib_verbs.h (working copy) @@ -617,10 +617,11 @@ enum ib_access_flags { IB_ACCESS_LOCAL_WRITE = 1, - IB_ACCESS_REMOTE_WRITE = (1<<1), - IB_ACCESS_REMOTE_READ = (1<<2), - IB_ACCESS_REMOTE_ATOMIC = (1<<3), - IB_ACCESS_MW_BIND = (1<<4) + IB_ACCESS_LOCAL_READ = (1<<1), + IB_ACCESS_REMOTE_WRITE = (1<<2), + IB_ACCESS_REMOTE_READ = (1<<3), + IB_ACCESS_REMOTE_ATOMIC = (1<<4), + IB_ACCESS_MW_BIND = (1<<5) }; struct ib_phys_buf { Index: include/iw_cm.h =================================================================== --- include/iw_cm.h (revision 3275) +++ include/iw_cm.h (working copy) @@ -119,6 +119,10 @@ int pdata_len ); + int (*disconnect_qp)(struct ib_qp *qp, + int abrupt + ); + int (*accept_cr)(struct ib_device* ibdev, void* cr_id, struct ib_qp *qp, From Thomas.Talpey at netapp.com Wed Aug 31 09:10:22 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 31 Aug 2005 12:10:22 -0400 Subject: [openib-general] iser/uverbs integration In-Reply-To: <20050831120600.GB21040@minantech.com> References: <20050831120600.GB21040@minantech.com> Message-ID: <6.2.3.4.2.20050831120725.045a2eb0@exnane01.nane.netapp.com> At 08:06 AM 8/31/2005, Gleb Natapov wrote: >The question is what is the best way to proceed? Will the changes needed to >use userspace QP from kernel will be accepted? How NFS/RDMA works now? To answer the second question, both client and server NFS/RDMA create and connect all endpoints completely within the kernel. This is also true of NFS/Sockets btw. Tom. From mst at mellanox.co.il Wed Aug 31 09:20:20 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 31 Aug 2005 19:20:20 +0300 Subject: [openib-general] [PATCH] hotplug support: selective removal notification Message-ID: <20050831162020.GA1707@mellanox.co.il> Hi! 
As Sean pointed out, with the existing client registration a client gets removal events even for devices it is not interested in. To solve this, I propose the following patch. The idea is that instead of setting the client context separately with ib_set_client_data, the client's add method returns an ib_client_data object, which is then kept in a per-device list. Returning NULL signals that the client is not interested in this device. Removing the device walks this list and calls only the clients that returned a non-NULL object on add. This way most ULPs can use container_of to get their context in the remove method, instead of scanning the client list each time, which in my opinion is very nice. I updated sdp, srp and ipoib for this API change. I can split the ULPs into separate patches if needed. Let me know, MST --- Add a way for a client to avoid getting notifications for some devices. Make it possible to use container_of to get per-device data instead of a list walk. Signed-off-by: Michael S.
Tsirkin core/cache.c | 18 +++++-- core/cm.c | 19 +++---- core/device.c | 122 ++++++++++++++++-------------------------------- core/mad.c | 18 +++++-- core/ping.c | 15 ++++- core/sa_query.c | 36 +++++++------- core/user_mad.c | 29 +++++------ core/uverbs.h | 1 core/uverbs_main.c | 26 ++++------ include/rdma/ib_verbs.h | 16 ++++-- ulp/ipoib/ipoib_main.c | 32 +++++++----- ulp/sdp/sdp_conn.c | 31 +++++------- ulp/sdp/sdp_dev.h | 1 ulp/srp/ib_srp.c | 31 +++++++----- 14 files changed, 203 insertions(+), 192 deletions(-) Index: linux-2.6.12.2/drivers/infiniband/core/device.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/device.c 2005-08-31 21:06:55.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/device.c 2005-08-31 21:07:15.000000000 +0300 @@ -47,12 +47,6 @@ MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("core kernel InfiniBand API"); MODULE_LICENSE("Dual BSD/GPL"); -struct ib_client_data { - struct list_head list; - struct ib_client *client; - void * data; -}; - static LIST_HEAD(device_list); static LIST_HEAD(client_list); @@ -194,28 +188,6 @@ void ib_dealloc_device(struct ib_device } EXPORT_SYMBOL(ib_dealloc_device); -static int add_client_context(struct ib_device *device, struct ib_client *client) -{ - struct ib_client_data *context; - unsigned long flags; - - context = kmalloc(sizeof *context, GFP_KERNEL); - if (!context) { - printk(KERN_WARNING "Couldn't allocate client context for %s/%s\n", - device->name, client->name); - return -ENOMEM; - } - - context->client = client; - context->data = NULL; - - spin_lock_irqsave(&device->client_data_lock, flags); - list_add(&context->list, &device->client_data_list); - spin_unlock_irqrestore(&device->client_data_lock, flags); - - return 0; -} - /** * ib_register_device - Register an IB device with IB core * @device:Device to register @@ -259,11 +231,17 @@ int ib_register_device(struct ib_device device->reg_state = IB_DEV_REGISTERED; { + struct 
ib_client_data *context; struct ib_client *client; + unsigned long flags; list_for_each_entry(client, &client_list, list) - if (client->add && !add_client_context(device, client)) - client->add(device); + if (client->add && (context = client->add(device))) { + context->client = client; + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + } } out: @@ -280,26 +258,29 @@ EXPORT_SYMBOL(ib_register_device); */ void ib_unregister_device(struct ib_device *device) { - struct ib_client *client; - struct ib_client_data *context, *tmp; + struct ib_client_data *context; unsigned long flags; down(&device_sem); - list_for_each_entry_reverse(client, &client_list, list) - if (client->remove) - client->remove(device); - list_del(&device->core_list); - up(&device_sem); - spin_lock_irqsave(&device->client_data_lock, flags); - list_for_each_entry_safe(context, tmp, &device->client_data_list, list) - kfree(context); + for (;;) { + if (list_empty(&device->client_data_list)) + break; + context = list_entry(device->client_data_list.next, + typeof(*context), list); + list_del(&context->list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + if (context->client->remove) + context->client->remove(device, context); + spin_lock_irqsave(&device->client_data_lock, flags); + } spin_unlock_irqrestore(&device->client_data_lock, flags); device->reg_state = IB_DEV_UNREGISTERED; + up(&device_sem); } EXPORT_SYMBOL(ib_unregister_device); @@ -318,14 +299,20 @@ EXPORT_SYMBOL(ib_unregister_device); */ int ib_register_client(struct ib_client *client) { + struct ib_client_data *context; struct ib_device *device; + unsigned long flags; down(&device_sem); list_add_tail(&client->list, &client_list); list_for_each_entry(device, &device_list, core_list) - if (client->add && !add_client_context(device, client)) - client->add(device); + if (client->add && (context = 
client->add(device))) { + context->client = client; + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + } up(&device_sem); @@ -343,23 +330,25 @@ EXPORT_SYMBOL(ib_register_client); */ void ib_unregister_client(struct ib_client *client) { - struct ib_client_data *context, *tmp; + struct ib_client_data *context; struct ib_device *device; unsigned long flags; down(&device_sem); list_for_each_entry(device, &device_list, core_list) { - if (client->remove) - client->remove(device); - spin_lock_irqsave(&device->client_data_lock, flags); - list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + list_for_each_entry(context, &device->client_data_list, list) if (context->client == client) { list_del(&context->list); - kfree(context); + spin_unlock_irqrestore(&device->client_data_lock, flags); + if (client->remove) + client->remove(device, context); + spin_lock_irqsave(&device->client_data_lock, flags); + break; } spin_unlock_irqrestore(&device->client_data_lock, flags); + } list_del(&client->list); @@ -375,16 +364,17 @@ EXPORT_SYMBOL(ib_unregister_client); * ib_get_client_data() returns client context set with * ib_set_client_data(). 
*/ -void *ib_get_client_data(struct ib_device *device, struct ib_client *client) +struct ib_client_data *ib_get_client_data(struct ib_device *device, + struct ib_client *client) { struct ib_client_data *context; - void *ret = NULL; + struct ib_client_data *ret = NULL; unsigned long flags; spin_lock_irqsave(&device->client_data_lock, flags); list_for_each_entry(context, &device->client_data_list, list) if (context->client == client) { - ret = context->data; + ret = context; break; } spin_unlock_irqrestore(&device->client_data_lock, flags); @@ -394,36 +384,6 @@ void *ib_get_client_data(struct ib_devic EXPORT_SYMBOL(ib_get_client_data); /** - * ib_set_client_data - Get IB client context - * @device:Device to set context for - * @client:Client to set context for - * @data:Context to set - * - * ib_set_client_data() sets client context that can be retrieved with - * ib_get_client_data(). - */ -void ib_set_client_data(struct ib_device *device, struct ib_client *client, - void *data) -{ - struct ib_client_data *context; - unsigned long flags; - - spin_lock_irqsave(&device->client_data_lock, flags); - list_for_each_entry(context, &device->client_data_list, list) - if (context->client == client) { - context->data = data; - goto out; - } - - printk(KERN_WARNING "No client context found for %s/%s\n", - device->name, client->name); - -out: - spin_unlock_irqrestore(&device->client_data_lock, flags); -} -EXPORT_SYMBOL(ib_set_client_data); - -/** * ib_register_event_handler - Register an IB event handler * @event_handler:Handler to register * Index: linux-2.6.12.2/drivers/infiniband/include/rdma/ib_verbs.h =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/include/rdma/ib_verbs.h 2005-08-31 21:06:55.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/include/rdma/ib_verbs.h 2005-08-31 21:07:15.000000000 +0300 @@ -956,14 +956,21 @@ struct ib_device { u8 phys_port_cnt; }; +struct ib_client_data; + struct ib_client { char 
*name; - void (*add) (struct ib_device *); - void (*remove)(struct ib_device *); + struct ib_client_data *(*add) (struct ib_device *); + void (*remove)(struct ib_device *, struct ib_client_data *); struct list_head list; }; +struct ib_client_data { + struct list_head list; + struct ib_client *client; +}; + struct ib_device *ib_alloc_device(size_t size); void ib_dealloc_device(struct ib_device *device); @@ -973,9 +980,8 @@ void ib_unregister_device(struct ib_devi int ib_register_client (struct ib_client *client); void ib_unregister_client(struct ib_client *client); -void *ib_get_client_data(struct ib_device *device, struct ib_client *client); -void ib_set_client_data(struct ib_device *device, struct ib_client *client, - void *data); +struct ib_client_data *ib_get_client_data(struct ib_device *device, + struct ib_client *client); static inline int ib_copy_from_udata(void *dest, struct ib_udata *udata, size_t len) { Index: linux-2.6.12.2/drivers/infiniband/core/user_mad.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/user_mad.c 2005-08-31 21:06:55.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/user_mad.c 2005-08-31 21:07:15.000000000 +0300 @@ -80,9 +80,10 @@ struct ib_umad_port { }; struct ib_umad_device { - int start_port, end_port; - struct kref ref; - struct ib_umad_port port[0]; + struct ib_client_data data; + int start_port, end_port; + struct kref ref; + struct ib_umad_port port[0]; }; struct ib_umad_file { @@ -108,8 +109,9 @@ static const dev_t base_dev = MKDEV(IB_U static spinlock_t map_lock; static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS * 2); -static void ib_umad_add_one(struct ib_device *device); -static void ib_umad_remove_one(struct ib_device *device); +static struct ib_client_data *ib_umad_add_one(struct ib_device *device); +static void ib_umad_remove_one(struct ib_device *device, + struct ib_client_data *); static int queue_packet(struct ib_umad_file *file, struct 
ib_mad_agent *agent, @@ -819,7 +821,7 @@ err_cdev: return -1; } -static void ib_umad_add_one(struct ib_device *device) +static struct ib_client_data *ib_umad_add_one(struct ib_device *device) { struct ib_umad_device *umad_dev; int s, e, i; @@ -835,7 +837,7 @@ static void ib_umad_add_one(struct ib_de (e - s + 1) * sizeof (struct ib_umad_port), GFP_KERNEL); if (!umad_dev) - return; + return NULL; memset(umad_dev, 0, sizeof *umad_dev + (e - s + 1) * sizeof (struct ib_umad_port)); @@ -852,9 +854,7 @@ static void ib_umad_add_one(struct ib_de goto err; } - ib_set_client_data(device, &umad_client, umad_dev); - - return; + return &umad_dev->data; err: while (--i >= s) { @@ -863,15 +863,16 @@ err: } kref_put(&umad_dev->ref, ib_umad_release_dev); + return NULL; } -static void ib_umad_remove_one(struct ib_device *device) +static void ib_umad_remove_one(struct ib_device *device, + struct ib_client_data *data) { - struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client); + struct ib_umad_device *umad_dev; int i; - if (!umad_dev) - return; + umad_dev = container_of(data, struct ib_umad_device, data); for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) { class_device_unregister(&umad_dev->port[i].class_dev); Index: linux-2.6.12.2/drivers/infiniband/core/cm.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/cm.c 2005-08-31 21:06:55.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/cm.c 2005-08-31 21:07:15.000000000 +0300 @@ -51,8 +51,8 @@ MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("InfiniBand CM"); MODULE_LICENSE("Dual BSD/GPL"); -static void cm_add_one(struct ib_device *device); -static void cm_remove_one(struct ib_device *device); +static struct ib_client_data *cm_add_one(struct ib_device *device); +static void cm_remove_one(struct ib_device *device, struct ib_client_data *); static struct ib_client cm_client = { .name = "cm", @@ -81,6 +81,7 @@ struct cm_port { }; 
struct cm_device { + struct ib_client_data data; struct list_head list; struct ib_device *device; __be64 ca_guid; @@ -3194,7 +3195,7 @@ static __be64 cm_get_ca_guid(struct ib_d return guid; } -static void cm_add_one(struct ib_device *device) +static struct ib_client_data *cm_add_one(struct ib_device *device) { struct cm_device *cm_dev; struct cm_port *port; @@ -3212,7 +3213,7 @@ static void cm_add_one(struct ib_device cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) - return; + return NULL; cm_dev->device = device; cm_dev->ca_guid = cm_get_ca_guid(device); @@ -3238,12 +3239,11 @@ static void cm_add_one(struct ib_device if (ret) goto error3; } - ib_set_client_data(device, &cm_client, cm_dev); write_lock_irqsave(&cm.device_lock, flags); list_add_tail(&cm_dev->list, &cm.device_list); write_unlock_irqrestore(&cm.device_lock, flags); - return; + return &cm_dev->data; error3: ib_unregister_mad_agent(port->mad_agent); @@ -3257,9 +3257,10 @@ error2: } error1: kfree(cm_dev); + return NULL; } -static void cm_remove_one(struct ib_device *device) +static void cm_remove_one(struct ib_device *device, struct ib_client_data *data) { struct cm_device *cm_dev; struct cm_port *port; @@ -3269,9 +3270,7 @@ static void cm_remove_one(struct ib_devi unsigned long flags; int i; - cm_dev = ib_get_client_data(device, &cm_client); - if (!cm_dev) - return; + cm_dev = container_of(data, struct cm_device, data); write_lock_irqsave(&cm.device_lock, flags); list_del(&cm_dev->list); Index: linux-2.6.12.2/drivers/infiniband/core/sa_query.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/sa_query.c 2005-08-31 21:06:55.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/sa_query.c 2005-08-31 21:07:15.000000000 +0300 @@ -65,6 +65,7 @@ struct ib_sa_port { }; struct ib_sa_device { + struct ib_client_data data; int start_port, end_port; struct ib_event_handler event_handler; struct 
ib_sa_port port[0]; @@ -98,8 +99,8 @@ struct ib_sa_mcmember_query { struct ib_sa_query sa_query; }; -static void ib_sa_add_one(struct ib_device *device); -static void ib_sa_remove_one(struct ib_device *device); +static struct ib_client_data *ib_sa_add_one(struct ib_device *device); +static void ib_sa_remove_one(struct ib_device *device, struct ib_client_data *data); static struct ib_client sa_client = { .name = "sa", @@ -426,13 +427,14 @@ static void update_sm_ah(void *port_ptr) static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) { + struct ib_sa_device *sa_dev; + if (event->event == IB_EVENT_PORT_ERR || event->event == IB_EVENT_PORT_ACTIVE || event->event == IB_EVENT_LID_CHANGE || event->event == IB_EVENT_PKEY_CHANGE || event->event == IB_EVENT_SM_CHANGE) { - struct ib_sa_device *sa_dev = - ib_get_client_data(event->device, &sa_client); + sa_dev = container_of(handler, struct ib_sa_device, event_handler); schedule_work(&sa_dev->port[event->element.port_num - sa_dev->start_port].update_task); @@ -608,7 +610,8 @@ int ib_sa_path_rec_get(struct ib_device struct ib_sa_query **sa_query) { struct ib_sa_path_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_client_data *data = ib_get_client_data(device, &sa_client); + struct ib_sa_device *sa_dev = container_of(data, struct ib_sa_device, data); struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; struct ib_mad_agent *agent = port->agent; int ret; @@ -710,7 +713,8 @@ int ib_sa_service_rec_query(struct ib_de struct ib_sa_query **sa_query) { struct ib_sa_service_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_client_data *data = ib_get_client_data(device, &sa_client); + struct ib_sa_device *sa_dev = container_of(data, struct ib_sa_device, data); struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; struct ib_mad_agent *agent = port->agent; int ret; @@ -793,7 
+797,8 @@ int ib_sa_mcmember_rec_query(struct ib_d struct ib_sa_query **sa_query) { struct ib_sa_mcmember_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_client_data *data = ib_get_client_data(device, &sa_client); + struct ib_sa_device *sa_dev = container_of(data, struct ib_sa_device, data); struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; struct ib_mad_agent *agent = port->agent; int ret; @@ -900,7 +905,7 @@ static void recv_handler(struct ib_mad_a ib_free_recv_mad(mad_recv_wc); } -static void ib_sa_add_one(struct ib_device *device) +static struct ib_client_data *ib_sa_add_one(struct ib_device *device) { struct ib_sa_device *sa_dev; int s, e, i; @@ -916,7 +921,7 @@ static void ib_sa_add_one(struct ib_devi (e - s + 1) * sizeof (struct ib_sa_port), GFP_KERNEL); if (!sa_dev) - return; + return NULL; sa_dev->start_port = s; sa_dev->end_port = e; @@ -937,8 +942,6 @@ static void ib_sa_add_one(struct ib_devi update_sm_ah, &sa_dev->port[i]); } - ib_set_client_data(device, &sa_client, sa_dev); - /* * We register our event handler after everything is set up, * and then update our cached info after the event handler is @@ -953,7 +956,7 @@ static void ib_sa_add_one(struct ib_devi for (i = 0; i <= e - s; ++i) update_sm_ah(&sa_dev->port[i]); - return; + return &sa_dev->data; err: while (--i >= 0) @@ -961,17 +964,14 @@ err: kfree(sa_dev); - return; + return NULL; } -static void ib_sa_remove_one(struct ib_device *device) +static void ib_sa_remove_one(struct ib_device *device, struct ib_client_data *data) { - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_device *sa_dev = container_of(data, struct ib_sa_device, data); int i; - if (!sa_dev) - return; - ib_unregister_event_handler(&sa_dev->event_handler); for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_conn.c 
=================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_conn.c 2005-08-31 21:06:55.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_conn.c 2005-08-31 21:07:15.000000000 +0300 @@ -37,8 +37,9 @@ static struct sdev_root dev_root_s; -static void sdp_device_init_one(struct ib_device *device); -static void sdp_device_remove_one(struct ib_device *device); +static struct ib_client_data *sdp_device_init_one(struct ib_device *device); +static void sdp_device_remove_one(struct ib_device *device, + struct ib_client_data *data); static struct ib_client sdp_client = { .name = "sdp", @@ -959,6 +960,7 @@ static void sdp_conn_lock_init(struct sd int sdp_conn_alloc_ib(struct sdp_sock *conn, struct ib_device *device, u8 hw_port, u16 pkey) { + struct ib_client_data *data; struct ib_qp_init_attr *init_attr; struct ib_qp_attr *qp_attr; struct sdev_hca_port *port; @@ -969,10 +971,12 @@ int sdp_conn_alloc_ib(struct sdp_sock *c /* * look up correct HCA and port */ - hca = ib_get_client_data(device, &sdp_client); - if (!hca) + data = ib_get_client_data(device, &sdp_client); + if (!data) return -ERANGE; + hca = container_of(data, struct sdev_hca, data); + list_for_each_entry(port, &hca->port_list, list) if (hw_port == port->index) { result = 1; @@ -1706,7 +1710,7 @@ int sdp_proc_dump_device(char *buffer, i /* * sdp_device_init_one - add a device to the list */ -static void sdp_device_init_one(struct ib_device *device) +static struct ib_client_data *sdp_device_init_one(struct ib_device *device) { struct ib_fmr_pool_param fmr_param_s; struct sdev_hca_port *port, *tmp; @@ -1719,7 +1723,7 @@ static void sdp_device_init_one(struct i hca = kmalloc(sizeof *hca, GFP_KERNEL); if (!hca) { sdp_warn("Error allocating HCA <%s> memory.", device->name); - return; + return NULL; } /* * init and insert into list. 
@@ -1801,9 +1805,7 @@ static void sdp_device_init_one(struct i } } - ib_set_client_data(device, &sdp_client, hca); - - return; + return &hca->data; error: list_for_each_entry_safe(port, tmp, &hca->port_list, list) { @@ -1821,22 +1823,19 @@ error: (void)ib_dealloc_pd(hca->pd); kfree(hca); + return NULL; } /* * sdp_device_remove_one - remove a device from the hca list */ -static void sdp_device_remove_one(struct ib_device *device) +static void sdp_device_remove_one(struct ib_device *device, + struct ib_client_data *data) { struct sdev_hca_port *port, *tmp; struct sdev_hca *hca; - hca = ib_get_client_data(device, &sdp_client); - - if (!hca) { - sdp_warn("Device <%s> has no HCA info.", device->name); - return; - } + hca = container_of(data, struct sdev_hca, data); list_for_each_entry_safe(port, tmp, &hca->port_list, list) { list_del(&port->list); Index: linux-2.6.12.2/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/srp/ib_srp.c 2005-08-31 21:06:55.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/ulp/srp/ib_srp.c 2005-08-31 21:09:37.000000000 +0300 @@ -59,6 +59,11 @@ MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("InfiniBand SCSI RDMA Protocol driver"); MODULE_LICENSE("Dual BSD/GPL"); +struct ib_srp_client_data { + struct ib_client_data data; + struct list_head list; +}; + static int topspin_workarounds = 1; module_param(topspin_workarounds, int, 0444); @@ -67,8 +72,8 @@ MODULE_PARM_DESC(topspin_workarounds, static const u8 topspin_oui[3] = { 0x00, 0x05, 0xad }; -static void srp_add_one(struct ib_device *device); -static void srp_remove_one(struct ib_device *device); +static struct ib_client_data *srp_add_one(struct ib_device *device); +static void srp_remove_one(struct ib_device *device, struct ib_client_data *data); static struct ib_client srp_client = { .name = "srp", @@ -1335,16 +1340,16 @@ err_free: return NULL; } -static void srp_add_one(struct ib_device 
*device) +static struct ib_client_data *srp_add_one(struct ib_device *device) { - struct list_head *dev_list; + struct ib_srp_client_data *dev_list = NULL; struct srp_host *host; struct ib_device_attr *dev_attr; int s, e, p; dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL); if (!dev_attr) - return; + return NULL; if (ib_query_device(device, dev_attr)) { printk(KERN_WARNING PFX "Couldn't query node GUID for %s.\n", @@ -1356,7 +1361,7 @@ static void srp_add_one(struct ib_device if (!dev_list) goto out; - INIT_LIST_HEAD(dev_list); + INIT_LIST_HEAD(&dev_list->list); if (device->node_type == IB_NODE_SWITCH) { s = 0; @@ -1369,24 +1374,23 @@ static void srp_add_one(struct ib_device for (p = s; p <= e; ++p) { host = srp_add_port(device, dev_attr->node_guid, p); if (host) - list_add_tail(&host->list, dev_list); + list_add_tail(&host->list, &dev_list->list); } - ib_set_client_data(device, &srp_client, dev_list); - out: kfree(dev_attr); + return dev_list ? &dev_list->data : NULL; } -static void srp_remove_one(struct ib_device *device) +static void srp_remove_one(struct ib_device *device, struct ib_client_data *data) { - struct list_head *dev_list; + struct ib_srp_client_data *dev_list; struct srp_host *host, *tmp_host; struct srp_target_port *target, *tmp_target; - dev_list = ib_get_client_data(device, &srp_client); + dev_list = container_of(data, struct ib_srp_client_data, data); - list_for_each_entry_safe(host, tmp_host, dev_list, list) { + list_for_each_entry_safe(host, tmp_host, &dev_list->list, list) { class_device_unregister(&host->class_dev); wait_for_completion(&host->released); @@ -1405,6 +1409,7 @@ static void srp_remove_one(struct ib_dev ib_dealloc_pd(host->pd); kfree(host); } + kfree(dev_list); } static int __init srp_init_module(void) Index: linux-2.6.12.2/drivers/infiniband/core/uverbs.h =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/uverbs.h 2005-08-31 21:06:55.000000000 +0300 +++ 
linux-2.6.12.2/drivers/infiniband/core/uverbs.h 2005-08-31 21:07:15.000000000 +0300 @@ -49,6 +49,7 @@ #include struct ib_uverbs_device { + struct ib_client_data data; int devnum; struct cdev dev; struct class_device class_dev; Index: linux-2.6.12.2/drivers/infiniband/core/mad.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/mad.c 2005-08-31 21:06:55.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/mad.c 2005-08-31 21:07:15.000000000 +0300 @@ -2681,10 +2681,18 @@ static int ib_mad_port_close(struct ib_d return 0; } -static void ib_mad_init_device(struct ib_device *device) +static struct ib_client_data *ib_mad_init_device(struct ib_device *device) { + struct ib_client_data *data; int num_ports, cur_port, i; + data = kmalloc(sizeof *data, GFP_KERNEL); + if (!data) { + printk(KERN_ERR PFX "Couldn't allocate memory for device %s\n", + device->name); + return NULL; + } + if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; cur_port = 0; @@ -2705,7 +2713,7 @@ static void ib_mad_init_device(struct ib goto error_device_open; } } - return; + return data; error_device_open: while (i > 0) { @@ -2719,11 +2727,15 @@ error_device_open: device->name, cur_port); i--; } + kfree(data); + return NULL; } -static void ib_mad_remove_device(struct ib_device *device) +static void ib_mad_remove_device(struct ib_device *device, + struct ib_client_data *data) { int i, num_ports, cur_port; + kfree(data); if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_dev.h =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_dev.h 2005-08-31 21:06:55.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_dev.h 2005-08-31 21:07:15.000000000 +0300 @@ -154,6 +154,7 @@ struct sdev_hca_port { }; struct sdev_hca { + struct ib_client_data data; struct ib_device *ca; /* HCA */ struct ib_pd 
*pd; /* protection domain for this HCA */ struct ib_mr *mem_h; /* registered memory region */ Index: linux-2.6.12.2/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-08-31 21:06:55.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-08-31 21:07:15.000000000 +0300 @@ -51,6 +51,11 @@ MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); MODULE_LICENSE("Dual BSD/GPL"); +struct ipoib_client_data { + struct ib_client_data data; + struct list_head list; +}; + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG int ipoib_debug_level; @@ -67,8 +72,9 @@ static const u8 ipv4_bcast_addr[] = { struct workqueue_struct *ipoib_workqueue; struct workqueue_struct *ipoib_event_workqueue; -static void ipoib_add_one(struct ib_device *device); -static void ipoib_remove_one(struct ib_device *device); +static struct ib_client_data *ipoib_add_one(struct ib_device *device); +static void ipoib_remove_one(struct ib_device *device, + struct ib_client_data *data); static struct ib_client ipoib_client = { .name = "ipoib", @@ -1018,18 +1024,18 @@ alloc_mem_failed: return ERR_PTR(result); } -static void ipoib_add_one(struct ib_device *device) +static struct ib_client_data *ipoib_add_one(struct ib_device *device) { - struct list_head *dev_list; + struct ipoib_client_data *dev_list; struct net_device *dev; struct ipoib_dev_priv *priv; int s, e, p; dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) - return; + return NULL; - INIT_LIST_HEAD(dev_list); + INIT_LIST_HEAD(&dev_list->list); if (device->node_type == IB_NODE_SWITCH) { s = 0; @@ -1043,21 +1049,22 @@ static void ipoib_add_one(struct ib_devi dev = ipoib_add_port("ib%d", device, p); if (!IS_ERR(dev)) { priv = netdev_priv(dev); - list_add_tail(&priv->list, dev_list); + list_add_tail(&priv->list, &dev_list->list); } } - ib_set_client_data(device, 
&ipoib_client, dev_list); + return &dev_list->data; } -static void ipoib_remove_one(struct ib_device *device) +static void ipoib_remove_one(struct ib_device *device, + struct ib_client_data *data) { struct ipoib_dev_priv *priv, *tmp; - struct list_head *dev_list; + struct ipoib_client_data *dev_list; - dev_list = ib_get_client_data(device, &ipoib_client); + dev_list = container_of(data, struct ipoib_client_data, data); - list_for_each_entry_safe(priv, tmp, dev_list, list) { + list_for_each_entry_safe(priv, tmp, &dev_list->list, list) { ib_unregister_event_handler(&priv->event_handler); flush_workqueue(ipoib_event_workqueue); @@ -1065,6 +1072,7 @@ static void ipoib_remove_one(struct ib_d ipoib_dev_cleanup(priv->dev); free_netdev(priv->dev); } + kfree(dev_list); } static int __init ipoib_init_module(void) Index: linux-2.6.12.2/drivers/infiniband/core/cache.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/cache.c 2005-08-31 21:06:55.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/cache.c 2005-08-31 21:07:15.000000000 +0300 @@ -291,10 +291,18 @@ static void ib_cache_event(struct ib_eve } } -static void ib_cache_setup_one(struct ib_device *device) +static struct ib_client_data *ib_cache_setup_one(struct ib_device *device) { + struct ib_client_data *data; int p; + data = kmalloc(sizeof *data, GFP_KERNEL); + if (!data) { + printk(KERN_WARNING "Couldn't allocate client data " + "for %s\n", device->name); + return NULL; + } + rwlock_init(&device->cache.lock); device->cache.pkey_cache = @@ -321,7 +329,7 @@ static void ib_cache_setup_one(struct ib if (ib_register_event_handler(&device->cache.event_handler)) goto err_cache; - return; + return data; err_cache: for (p = 0; p <= end_port(device) - start_port(device); ++p) { @@ -332,9 +340,12 @@ err_cache: err: kfree(device->cache.pkey_cache); kfree(device->cache.gid_cache); + kfree(data); + return NULL; } -static void ib_cache_cleanup_one(struct 
ib_device *device) +static void ib_cache_cleanup_one(struct ib_device *device, + struct ib_client_data *data) { int p; @@ -348,6 +359,7 @@ static void ib_cache_cleanup_one(struct kfree(device->cache.pkey_cache); kfree(device->cache.gid_cache); + kfree(data); } static struct ib_client cache_client = { Index: linux-2.6.12.2/drivers/infiniband/core/uverbs_main.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/uverbs_main.c 2005-08-31 21:06:55.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/uverbs_main.c 2005-08-31 21:07:15.000000000 +0300 @@ -101,8 +101,9 @@ static ssize_t (*uverbs_cmd_table[])(str static struct vfsmount *uverbs_event_mnt; -static void ib_uverbs_add_one(struct ib_device *device); -static void ib_uverbs_remove_one(struct ib_device *device); +static struct ib_client_data *ib_uverbs_add_one(struct ib_device *device); +static void ib_uverbs_remove_one(struct ib_device *device, + struct ib_client_data *data); static int ib_dealloc_ucontext(struct ib_ucontext *context) { @@ -581,16 +582,16 @@ static ssize_t show_abi_version(struct c } static CLASS_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); -static void ib_uverbs_add_one(struct ib_device *device) +static struct ib_client_data *ib_uverbs_add_one(struct ib_device *device) { struct ib_uverbs_device *uverbs_dev; if (!device->alloc_ucontext) - return; + return NULL; uverbs_dev = kmalloc(sizeof *uverbs_dev, GFP_KERNEL); if (!uverbs_dev) - return; + return NULL; memset(uverbs_dev, 0, sizeof *uverbs_dev); @@ -626,9 +627,7 @@ static void ib_uverbs_add_one(struct ib_ if (class_device_create_file(&uverbs_dev->class_dev, &class_device_attr_ibdev)) goto err_class; - ib_set_client_data(device, &uverbs_client, uverbs_dev); - - return; + return &uverbs_dev->data; err_class: class_device_unregister(&uverbs_dev->class_dev); @@ -639,15 +638,14 @@ err_cdev: err: kfree(uverbs_dev); - return; + return NULL; } -static void 
ib_uverbs_remove_one(struct ib_device *device)
+static void ib_uverbs_remove_one(struct ib_device *device,
+				 struct ib_client_data *data)
 {
-	struct ib_uverbs_device *uverbs_dev = ib_get_client_data(device, &uverbs_client);
-
-	if (!uverbs_dev)
-		return;
+	struct ib_uverbs_device *uverbs_dev;
+	uverbs_dev = container_of(data, struct ib_uverbs_device, data);
 
 	class_device_unregister(&uverbs_dev->class_dev);
 }
Index: linux-2.6.12.2/drivers/infiniband/core/ping.c
===================================================================
--- linux-2.6.12.2.orig/drivers/infiniband/core/ping.c	2005-08-31 21:06:55.000000000 +0300
+++ linux-2.6.12.2/drivers/infiniband/core/ping.c	2005-08-31 21:07:15.000000000 +0300
@@ -245,9 +245,10 @@ static int ib_ping_port_close(struct ib_
 	return 0;
 }
 
-static void ib_ping_init_device(struct ib_device *device)
+static struct ib_client_data *ib_ping_init_device(struct ib_device *device)
 {
 	int num_ports, cur_port, i;
+	struct ib_client_data *data;
 
 	if (device->node_type == IB_NODE_SWITCH) {
 		num_ports = 1;
@@ -263,7 +264,11 @@ static void ib_ping_init_device(struct i
 			       device->name, cur_port);
 			goto error_device_open;
 		}
-	return;
+
+	data = kmalloc(sizeof *data, GFP_KERNEL);
+	if (!data)
+		goto error_device_open;
+	return data;
 
 error_device_open:
 	while (i > 0) {
@@ -274,12 +279,16 @@ error_device_open:
 			       device->name, cur_port);
 		i--;
 	}
+	return NULL;
 }
 
-static void ib_ping_remove_device(struct ib_device *device)
+static void ib_ping_remove_device(struct ib_device *device,
+				  struct ib_client_data *data)
 {
 	int i, num_ports, cur_port;
 
+	kfree(data);
+
 	if (device->node_type == IB_NODE_SWITCH) {
 		num_ports = 1;
 		cur_port = 0;

--
MST

From mshefty at ichips.intel.com Wed Aug 31 09:54:46 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 31 Aug 2005 09:54:46 -0700
Subject: [openib-general] RDMA Generic Connection Management
In-Reply-To: <52hdd8lb1e.fsf@cisco.com>
References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com>
	<52hdd8lb1e.fsf@cisco.com>
Message-ID: <4315E0D6.6060508@ichips.intel.com>

Roland Dreier wrote:
> Hey, that's a really good point. We should make sure that our API
> makes it easy to handle device hotplug.
>
> One solution is to start reference counting device references, but
> that inevitably leads to bugs in ULPs -- protocol authors won't get it
> right unless we make it really easy. And I don't see how to make the
> reference counting trivial.
>
> Anyone have a better idea?

Thinking about this more, I think that what we have for verbs works well.
We can probably think of device removal as similar to having fatal errors
on all QPs associated with the device. Assuming that ULPs should handle QP
errors, then handling device removal doesn't seem like that big of a deal.

For this to work, we should ensure that a client is never given a reference
to a device that they may have received removal notification for. And this
should probably be handled on a per-module basis.

- Sean

From halr at voltaire.com Wed Aug 31 10:32:55 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 Aug 2005 13:32:55 -0400
Subject: [openib-general] OpenSM 1.8.0 libvendor initial merge nits
Message-ID: <1125509574.4401.3801.camel@hal.voltaire.com>

Hi again Yael & Eitan,

I've now merged the OpenSM 1.8.0 libvendor changes and found
the following:

General nits:

There are a number of violations of the coding style here as well. Also,
there is some unneeded whitespace added to a number of files.

Specific nits:

include/vendor/osm_vendor_mlx_rmpp_ctx.h missing END_C_DECLS

When I run autogen.sh in libvendor in the osm-1.8.0-merge branch, I get:
./autogen.sh
+ aclocal -I config -I ../config
+ libtoolize --force --copy
Putting files in AC_CONFIG_AUX_DIR, `config'.
+ autoheader
+ automake --foreign --add-missing --copy
Makefile.am:27: OSMV_OPENIB does not appear in AM_CONDITIONAL
Makefile.am:25: HDRS was already defined in condition TRUE, which
implies condition OSMV_OPENIB_TRUE

-- Hal

From eitan at mellanox.co.il Wed Aug 31 10:44:39 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 31 Aug 2005 20:44:39 +0300
Subject: [openib-general] Re: OpenSM 1.8.0 libvendor initial merge nits
In-Reply-To: <1125509574.4401.3801.camel@hal.voltaire.com>
References: <1125509574.4401.3801.camel@hal.voltaire.com>
Message-ID: <4315EC87.4060808@mellanox.co.il>

Hal Rosenstock wrote:
> Hi again Yael & Eitan,
>
> I've now merged the OpenSM 1.8.0 libvendor changes and found
> the following:
>
> General nits:
>
> There are a number of violations of the coding style here as well. Also,
> there is some unneeded whitespace added to a number of files.

We should run osm_check_n_fix; this will get this fixed.
I also think we need to decide if we want to change the OpenSM coding
style to use tabs or we keep the no-tabs rule.
Anyway, we can automate both indentation and untabify (or tabify) using
emacs. Just let me know if such a script is required.

>
> Specific nits:
>
> include/vendor/osm_vendor_mlx_rmpp_ctx.h missing END_C_DECLS

Yael is cleaning these up - they were caused by multiple merges.
You can go ahead and clean them if you want.

>
> When I run autogen.sh in libvendor in the osm-1.8.0-merge branch, I get:
> ./autogen.sh
> + aclocal -I config -I ../config
> + libtoolize --force --copy
> Putting files in AC_CONFIG_AUX_DIR, `config'.
> + autoheader
> + automake --foreign --add-missing --copy

Actually, I do not know why these show up. I think it is a matter of the
automake version. I am running the build on the branch without seeing this.
But I did see strange behavior on an old RH 7.3 machine.
> Makefile.am:27: OSMV_OPENIB does not appear in AM_CONDITIONAL
> Makefile.am:25: HDRS was already defined in condition TRUE, which
> implies condition OSMV_OPENIB_TRUE
>
> -- Hal

From rolandd at cisco.com Wed Aug 31 10:53:03 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 31 Aug 2005 10:53:03 -0700
Subject: [openib-general] RDMA Generic Connection Management
In-Reply-To: <4315E0D6.6060508@ichips.intel.com> (Sean Hefty's message of "Wed, 31 Aug 2005 09:54:46 -0700")
References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <52hdd8lb1e.fsf@cisco.com> <4315E0D6.6060508@ichips.intel.com>
Message-ID: <52d5nudmdc.fsf@cisco.com>

    Sean> For this to work, we should ensure that a client is never
    Sean> given a reference to a device that they may have received
    Sean> removal notification for.

This is the hard part: one CPU could start calling into a consumer
with a valid device, but get delayed by an interrupt or something. In
the meantime, another CPU could remove that device from the consumer,
and then when the first notification finally arrives, it's no longer
valid.

 - R.

From yaronh at voltaire.com Wed Aug 31 11:06:15 2005
From: yaronh at voltaire.com (Yaron Haviv)
Date: Wed, 31 Aug 2005 21:06:15 +0300
Subject: [openib-general] Re: RDMA Generic Connection Management
Message-ID: <35EA21F54A45CB47B879F21A91F4862F753693@taurus.voltaire.com>

> -----Original Message-----
> From: openib-general-bounces at openib.org [mailto:openib-general-
> bounces at openib.org] On Behalf Of Roland Dreier
> Sent: Tuesday, August 30, 2005 2:36 PM
> To: Talpey, Thomas
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] Re: RDMA Generic Connection Management
>
>     Thomas> Well, you're saying somebody has to do it, right? Is it
>     Thomas> easier to fob this off to upper layers that (frankly)
>     Thomas> don't care what hardware they're talking to!? This means
>     Thomas> we have N copies of this, and N ways to do it. Talk about
>     Thomas> cacheline pingpong.
>
> Upper layers have the luxury of being able to do this at a
> per-connection level, can sleep, etc. If we push it down into the
> verbs, then we have to do it in every verbs call, including the fast
> path verbs call. And that means we get into all sorts of crazy code
> to deal with a device disappearing between a consumer calling
> ib_post_send() and the core code being entered, etc.
>
> Right now we have a very simple set of rules:
>

If all the ULPs need to do exactly the same, or the implementation is
different for IB/iWarp, then we should probably do it under the API as
it is defined in kDAPL.

Also note that with virtual machines this type of event may be more
frequent, and we may want to decouple the ULPs from the actual hardware
device as much as we can.

Yaron

From ftillier at silverstorm.com Wed Aug 31 11:09:31 2005
From: ftillier at silverstorm.com (Fab Tillier)
Date: Wed, 31 Aug 2005 11:09:31 -0700
Subject: [openib-general] Re: OpenSM 1.8.0 libvendor initial merge nits
In-Reply-To: <4315EC87.4060808@mellanox.co.il>
Message-ID: <001101c5ae57$2349bcf0$9e5aa8c0@infiniconsys.com>

> From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
> Sent: Wednesday, August 31, 2005 10:45 AM
>
> Hal Rosenstock wrote:
> > Hi again Yael & Eitan,
> >
> > I've now merged the OpenSM 1.8.0 libvendor changes and found
> > the following:
> >
> > General nits:
> >
> > There are a number of violations of the coding style here as well. Also,
> > there is some unneeded whitespace added to a number of files.
>
> We should run osm_check_n_fix; this will get this fixed.
> I also think we need to decide if we want to change the OpenSM coding
> style to use tabs or we keep the no-tabs rule.

I personally prefer tabs to spaces, as it has less potential for people's
individual tab width to mess with the code.
- Fab

From mshefty at ichips.intel.com Wed Aug 31 11:11:40 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 31 Aug 2005 11:11:40 -0700
Subject: [openib-general] RDMA Generic Connection Management
In-Reply-To: <52d5nudmdc.fsf@cisco.com>
References: <1125323947.6584.106.camel@r2d2> <431374D4.5080909@ichips.intel.com> <52hdd8lb1e.fsf@cisco.com> <4315E0D6.6060508@ichips.intel.com> <52d5nudmdc.fsf@cisco.com>
Message-ID: <4315F2DC.5060202@ichips.intel.com>

Roland Dreier wrote:
> This is the hard part: one CPU could start calling into a consumer
> with a valid device, but get delayed by an interrupt or something. In
> the meantime, another CPU could remove that device from the consumer,
> and then when the first notification finally arrives, it's no longer
> valid.

I understand. I don't know if there is (or should be) a common way to
prevent this.

For the CM, I think that the problem goes away if the user explicitly binds
a cm_id to a specific device. The user then needs to destroy the cm_id
before returning from their remove device call. I think sa_query can
already handle these issues.

I don't have a good solution yet for calls like ib_cma_get_device(). Yet
another possibility is to have it return a device pointer in a callback.
Then it can synchronize with device removal internally.
- Sean

From yaronh at voltaire.com Wed Aug 31 11:15:21 2005
From: yaronh at voltaire.com (Yaron Haviv)
Date: Wed, 31 Aug 2005 21:15:21 +0300
Subject: [openib-general] RDMA Generic Connection Management
Message-ID: <35EA21F54A45CB47B879F21A91F4862F753694@taurus.voltaire.com>

> -----Original Message-----
> From: Talpey, Thomas [mailto:Thomas.Talpey at netapp.com]
> Sent: Tuesday, August 30, 2005 12:54 PM
> To: Yaron Haviv
> Cc: openib-general at openib.org
> Subject: RE: [openib-general] RDMA Generic Connection Management
>
> At 10:55 AM 8/30/2005, Yaron Haviv wrote:
> >The iSCSI discovery may return multiple src & dst IP addresses and the
> >iSCSI multipath implementation will open multiple connections.
> >There are many TCP/IP protocols that do that at the upper layers (e.g.
> >GridFTP, ..), not sure how NFS does it.
>
>
> To answer the question of how NFS "finds out" about multiple
> connections and trunking, the answer is generally that the mount
> command tells it. Mount can get this information from the command
> line, or DNS. I believe Solaris uses the command line approach. There
> may be a way to use the RPC portmapper for it, but the portmapper
> isn't used by NFSv4.
>
> Bottom line? NFS would love to have a way to learn multipathing
> topology. But it needs to follow existing practice, such as having
> an IP address / DNS expression. If the only way to find it is to query
> fabric services, that's not very compelling.
>
> Tom.

Tom, from your description it looks like the multipathing is done based on
IP addressing (like iSCSI/iSER, GridFTP, ..) and resolved by the ULP or its
name service; in that case the ULP probably opens a few connections from
one or more IPs to one or more other IPs.

This means that we don't need a transport-dependent mechanism as long as
each port is associated with a unique IP (like we do today in OpenIB).
(Another good reason to use IP addressing) Yaron From mshefty at ichips.intel.com Wed Aug 31 11:17:42 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 31 Aug 2005 11:17:42 -0700 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F753693@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F753693@taurus.voltaire.com> Message-ID: <4315F446.5090200@ichips.intel.com> Yaron Haviv wrote: > If all the ULPs need to do exactly the same, or the implementation is > different for IB/iWarp, than we should probably do it under the API like > its defined in kDAPL. To do this means destroying QPs, CQs, PDs, MRs, etc. under the API. I don't see that you want to do this. Unless I'm missing something in my thought process, handling device removal shouldn't be any more difficult than handling QP errors. - Sean From rolandd at cisco.com Wed Aug 31 11:34:57 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 31 Aug 2005 11:34:57 -0700 Subject: [openib-general] Re: ibv_get_async_event In-Reply-To: <4314F264.6010207@ichips.intel.com> (Sean Hefty's message of "Tue, 30 Aug 2005 16:57:24 -0700") References: <000e01c5adbd$8c977ed0$9e5aa8c0@infiniconsys.com> <4314F264.6010207@ichips.intel.com> Message-ID: <528xyidkfi.fsf@cisco.com> OK, I checked in changes to libibverbs and the kernel uverbs to handle cleaning up stale events when destroying a CQ/QP/SRQ. All the changes are in svn r3279. The changes require a kernel ABI bump. The new libibverbs works with both the old kernel and new kernel, but the old libibverbs will only work with the old kernel. So in other words, if you upgrade your kernel, then make sure you upgrade libibverbs as well. If you upgrade libibverbs, then you don't have to upgrade your kernel but you can if you want. (Confused yet? Or should I write still more?) I did some light testing but I don't have any tests that generate lots of async events. 
Sean and Arlin, if you could retest uDAPL or whatever was choking on QP connected events, that would be great. Thanks, Roland From rolandd at cisco.com Wed Aug 31 11:38:23 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 31 Aug 2005 11:38:23 -0700 Subject: [openib-general] Re: [PATCH] hotplug support: selective removal notification In-Reply-To: <20050831162020.GA1707@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 31 Aug 2005 19:20:20 +0300") References: <20050831162020.GA1707@mellanox.co.il> Message-ID: <524q96dk9s.fsf@cisco.com> Michael> As a way of solving this, I propose the following patch. Michael> The idea is that instead of setting client context Michael> separately with ib_set_client_data, client's add method Michael> will return ib_client_data object which is then kept in a Michael> per-device list. Returning NULL signals that the client Michael> will not be interested in this device. My first reaction was that this is a good idea. But looking at the patch, I'm not sure if it actually improves the existing code. It seems there aren't any consumers that really benefit from this, and the MAD module actually has to invent a pointer to return. So I'm not sure whether this is worth it right now. (BTW, perhaps the MAD module could just return (void *) 1L instead of kmalloc'ing something it then has to kfree) - R. 
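For readers following the proposal, here is a rough userspace sketch of what "add returns the client data, NULL means not interested" could look like. It is not the kernel code: every `demo_` name is hypothetical, the name-prefix filter is an invented example of a client declining a device, and `demo_container_of` is a local stand-in for the kernel's `container_of()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_device { const char *name; };

/* Generic part kept by the core on a per-device list. */
struct demo_client_data {
	struct demo_client_data *next;
};

/* A consumer embeds the generic part in its own per-device state. */
struct demo_umad_device {
	struct demo_client_data data;
	int start_port, end_port;
};

/* add(): return the embedded context, or NULL to decline the device. */
static struct demo_client_data *demo_umad_add(struct demo_device *dev)
{
	struct demo_umad_device *umad;

	if (strncmp(dev->name, "ib", 2) != 0)
		return NULL;	/* hypothetical filter: not interested */

	umad = calloc(1, sizeof *umad);
	if (!umad)
		return NULL;
	umad->start_port = 1;
	umad->end_port = 2;
	return &umad->data;
}

/* Core side: link the context only when the client accepted the device. */
static int demo_core_add(struct demo_client_data **list, struct demo_device *dev)
{
	struct demo_client_data *ctx = demo_umad_add(dev);

	if (!ctx)
		return 0;
	ctx->next = *list;
	*list = ctx;
	return 1;
}

/* Recover the consumer's state from the generic pointer, no list walk. */
static struct demo_umad_device *demo_to_umad(struct demo_client_data *data)
{
	assert(data != NULL);
	return demo_container_of(data, struct demo_umad_device, data);
}
```

A client with no per-device state has nothing to embed, which is the MAD module concern raised above: it would have to invent an object just to stay on the list.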
From caitlinb at broadcom.com Wed Aug 31 11:59:00 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 31 Aug 2005 11:59:00 -0700 Subject: [openib-general] RDMA Generic Connection Management Message-ID: <54AD0F12E08D1541B826BE97C98F99F1F567@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Wednesday, August 31, 2005 10:53 AM > To: Sean Hefty > Cc: openib-general at openib.org > Subject: Re: [openib-general] RDMA Generic Connection Management > > Sean> For this to work, we should ensure that a client is never > Sean> given a reference to a device that they may have received > Sean> removal notification for. > > This is the hard part: one CPU could start calling into a > consumer with a valid device, but get delayed by an interrupt > or something. In the meantime, another CPU could remove that > device from the consumer, and then when the first > notification finally arrives, it's no longer valid. > A further complication is ensuring that the problem is not solved twice. The device layer may already have logic to ensure that concurrent callbacks from the driver layer do not conflict (including the simplest of not allowing concurrent callbacks). From mst at mellanox.co.il Wed Aug 31 12:05:53 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 31 Aug 2005 22:05:53 +0300 Subject: [openib-general] Re: [PATCH] hotplug support: selective removal notification In-Reply-To: <524q96dk9s.fsf@cisco.com> References: <20050831162020.GA1707@mellanox.co.il> <524q96dk9s.fsf@cisco.com> Message-ID: <20050831190553.GC2415@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] hotplug support: selective removal notification > > Michael> As a way of solving this, I propose the following patch. 
> Michael> The idea is that instead of setting client context > Michael> separately with ib_set_client_data, client's add method > Michael> will return ib_client_data object which is then kept in a > Michael> per-device list. Returning NULL signals that the client > Michael> will not be interested in this device. > > My first reaction was that this is a good idea. But looking at the > patch, I'm not sure if it actually improves the existing code. You'll notice there are a couple of places (e.g. IPoIB) which actually were forgetting to free the pointer. So I'd go ahead and claim that the additional type safety makes it worth it. > It > seems there aren't any consumers that really benefit from this, and > the MAD module actually has to invent a pointer to return. > So I'm not sure whether this is worth it right now. > > (BTW, perhaps the MAD module could just return (void *) 1L instead of > kmalloc'ing something it then has to kfree) > > - R. > I agree it's a problem. Well, what the module returns currently needs to stay on the list, to trigger calling the remove callback, so we can't just return 1 without additional work. To solve the problem for the MAD module, or anyone who does not want per-client data, what if we keep the remove method in the client as is, and additionally add a remove method with two parameters to the client data? This way, modules which don't need client data can return NULL and still get called on module removal. Makes sense? I'll code up the patch.
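The two-remove-method arrangement described above could look roughly like the following userspace sketch. All `demo_` names are hypothetical and the types only mimic the shape of `ib_client` and `ib_client_data`; it is an illustration of the dispatch rule, not the patch itself.

```c
#include <assert.h>
#include <stddef.h>

struct demo_device { const char *name; };
struct demo_client_data;

struct demo_client {
	/* Returns per-device data, or NULL if the client keeps none. */
	struct demo_client_data *(*add)(struct demo_device *);
	/* Optional: called for every device, with or without data. */
	void (*remove)(struct demo_device *);
};

struct demo_client_data {
	/* Called only for devices where add() returned this object. */
	void (*remove)(struct demo_device *, struct demo_client_data *);
};

/* Core-side removal for one client/device pair. */
static void demo_core_remove(struct demo_client *client,
			     struct demo_device *dev,
			     struct demo_client_data *data)
{
	if (client->remove)
		client->remove(dev);
	if (data) {
		assert(data->remove);
		data->remove(dev, data);
	}
}

/* A MAD-like client with no per-device data: add() returns NULL. */
static int demo_mad_removed;

static struct demo_client_data *demo_mad_add(struct demo_device *dev)
{
	(void)dev;
	return NULL;
}

static void demo_mad_remove(struct demo_device *dev)
{
	(void)dev;
	++demo_mad_removed;
}

/* A client that does keep per-device data. */
static int demo_sa_removed;
static struct demo_client_data demo_sa_data;

static void demo_sa_remove(struct demo_device *dev,
			   struct demo_client_data *data)
{
	(void)dev;
	(void)data;
	++demo_sa_removed;
}

static struct demo_client_data *demo_sa_add(struct demo_device *dev)
{
	(void)dev;
	demo_sa_data.remove = demo_sa_remove;
	return &demo_sa_data;
}
```

With this split, a client without per-device data can return NULL from add and still be notified on removal through the client-level method, so it does not need to invent a pointer.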
-- MST From halr at voltaire.com Wed Aug 31 12:17:19 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Aug 2005 15:17:19 -0400 Subject: [openib-general] Re: OpenSM 1.8.0 libvendor initial merge nits In-Reply-To: <4315EC87.4060808@mellanox.co.il> References: <1125509574.4401.3801.camel@hal.voltaire.com> <4315EC87.4060808@mellanox.co.il> Message-ID: <1125515837.4401.3944.camel@hal.voltaire.com> On Wed, 2005-08-31 at 13:44, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > Hi again Yael & Eitan, > > > > I've now merged the OpenSM 1.8.0 libvendor changes and found > > the following: > > > > General nits: > > > > There are a number of violations of the coding style here as well. Also, > > There is some unneeded whitespace added to a number of files. > We should run osm_check_n_fix this will get this fixed. > I also think we need to decide if we want to change the OpenSM coding > style to use tabs or we keep the no-tabs rule. Separate discussion. I'm not ready to take this one on yet. I prefer tabs too and also don't like the way the braces are in OpenSM. There are other things as well. But this is a big (but trivial) job that should wait IMO... > Anyway we can automate both indentation and untabify (or tabify) > using emacs. Just let me know if such a script is required. > > > > > Specific nits: > > > > include/vendor/osm_vendor_mlx_rmpp_ctx.h missing END_C_DECLS > Yael cleans up these - they were caused by multiple merges. > You can go ahead and clean them if you want. I fixed them in the version I am working on. This was meant as a heads up for the merge branch. > > When I run autogen.sh in libvendor in the osm-1.8.0-merge branch, I get: > > ./autogen.sh > > + aclocal -I config -I ../config > > + libtoolize --force --copy > > Putting files in AC_CONFIG_AUX_DIR, `config'. > > + autoheader > > + automake --foreign --add-missing --copy > Actually I do not know how come these show up. > I think it is a matter of automake version. 
> I am running the build on the branch without seeing this. > But I did see strange behavior on old RH 7.3 machine. > > Makefile.am:27: OSMV_OPENIB does not appear in AM_CONDITIONAL > > Makefile.am:25: HDRS was already defined in condition TRUE, which > > implies condition OSMV_OPENIB_TRUE Hmm, I update autoconf, automake, and libtool and now get: + automake --foreign --add-missing --copy configure.in: installing `config/install-sh' configure.in: installing `config/missing' Makefile.am:27: OSMV_OPENIB does not appear in AM_CONDITIONAL Makefile.am:31: OSMV_SIM does not appear in AM_CONDITIONAL Makefile.am:44: OSMV_GEN1 does not appear in AM_CONDITIONAL Makefile.am: installing `config/compile' Makefile.am: installing `config/depcomp' Any ideas ? -- Hal From Thomas.Talpey at netapp.com Wed Aug 31 12:15:20 2005 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 31 Aug 2005 15:15:20 -0400 Subject: [openib-general] Re: RDMA Generic Connection Management In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F753693@taurus.voltaire.com > References: <35EA21F54A45CB47B879F21A91F4862F753693@taurus.voltaire.com> Message-ID: <6.2.3.4.2.20050831151425.0411a9c0@exnane01.nane.netapp.com> At 02:06 PM 8/31/2005, Yaron Haviv wrote: >Also note that with Virtual machines this type of event may be more >frequent and we may want to decouple the ULPs from the actual hardware s/may want/definitely want/ Tom. From glebn at voltaire.com Wed Aug 31 12:54:33 2005 From: glebn at voltaire.com (Gleb Natapov) Date: Wed, 31 Aug 2005 22:54:33 +0300 Subject: [openib-general] iser/uverbs integration In-Reply-To: <6.2.3.4.2.20050831120725.045a2eb0@exnane01.nane.netapp.com> References: <20050831120600.GB21040@minantech.com> <6.2.3.4.2.20050831120725.045a2eb0@exnane01.nane.netapp.com> Message-ID: <20050831195433.GA32458@minantech.com> On Wed, Aug 31, 2005 at 12:10:22PM -0400, Talpey, Thomas wrote: > At 08:06 AM 8/31/2005, Gleb Natapov wrote: > >The question is what is the best way to proceed? 
Will the changes needed to > >use userspace QP from kernel will be accepted? How NFS/RDMA works now? > > To answer the second question, both client and server NFS/RDMA > create and connect all endpoints completely within the kernel. > This is also true of NFS/Sockets btw. > Thank you for the clarification. So NFS doesn't have to deal with the issue we are having in iser. -- Gleb. From mst at mellanox.co.il Wed Aug 31 12:59:12 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 31 Aug 2005 22:59:12 +0300 Subject: [openib-general] Re: [PATCH] hotplug support: selective removal notification In-Reply-To: <524q96dk9s.fsf@cisco.com> References: <20050831162020.GA1707@mellanox.co.il> <524q96dk9s.fsf@cisco.com> Message-ID: <20050831195911.GA2779@mellanox.co.il> Quoting r. Roland Dreier : > the MAD module actually has to invent a pointer to return. OK, the solution I propose is to actually have two remove callbacks: one per client, called on all devices if defined. Another one per client data, called only if add returns non-NULL. Code now is actually getting smaller, so we are getting some benefit out of it.
core/cache.c | 5 +-
core/cm.c | 21 ++++-----
core/device.c | 112 ++++++++++++++++--------------------------------
core/mad.c | 5 +-
core/ping.c | 5 +-
core/sa_query.c | 38 ++++++++--------
core/user_mad.c | 31 ++++++-------
core/uverbs.h | 1
core/uverbs_main.c | 29 ++++++------
include/rdma/ib_verbs.h | 15 ++++--
ulp/ipoib/ipoib_main.c | 34 +++++++++-----
ulp/sdp/sdp_conn.c | 33 ++++++--------
ulp/sdp/sdp_dev.h | 1
ulp/srp/ib_srp.c | 33 ++++++++------
14 files changed, 175 insertions(+), 188 deletions(-)
--- Add a way for a client to avoid getting notifications for some devices. Make it possible to use container_of to get per device data instead of a list walk. Signed-off-by: Michael S.
Tsirkin Index: linux-2.6.12.2/drivers/infiniband/core/device.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/device.c 2005-09-01 00:40:39.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/device.c 2005-09-01 00:41:12.000000000 +0300 @@ -47,12 +47,6 @@ MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("core kernel InfiniBand API"); MODULE_LICENSE("Dual BSD/GPL"); -struct ib_client_data { - struct list_head list; - struct ib_client *client; - void * data; -}; - static LIST_HEAD(device_list); static LIST_HEAD(client_list); @@ -194,28 +188,6 @@ void ib_dealloc_device(struct ib_device } EXPORT_SYMBOL(ib_dealloc_device); -static int add_client_context(struct ib_device *device, struct ib_client *client) -{ - struct ib_client_data *context; - unsigned long flags; - - context = kmalloc(sizeof *context, GFP_KERNEL); - if (!context) { - printk(KERN_WARNING "Couldn't allocate client context for %s/%s\n", - device->name, client->name); - return -ENOMEM; - } - - context->client = client; - context->data = NULL; - - spin_lock_irqsave(&device->client_data_lock, flags); - list_add(&context->list, &device->client_data_list); - spin_unlock_irqrestore(&device->client_data_lock, flags); - - return 0; -} - /** * ib_register_device - Register an IB device with IB core * @device:Device to register @@ -259,11 +231,17 @@ int ib_register_device(struct ib_device device->reg_state = IB_DEV_REGISTERED; { + struct ib_client_data *context; struct ib_client *client; + unsigned long flags; list_for_each_entry(client, &client_list, list) - if (client->add && !add_client_context(device, client)) - client->add(device); + if (client->add && (context = client->add(device))) { + context->client = client; + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + } } out: @@ -281,25 +259,31 @@ 
EXPORT_SYMBOL(ib_register_device); void ib_unregister_device(struct ib_device *device) { struct ib_client *client; - struct ib_client_data *context, *tmp; + struct ib_client_data *context; unsigned long flags; down(&device_sem); - list_for_each_entry_reverse(client, &client_list, list) if (client->remove) client->remove(device); list_del(&device->core_list); - up(&device_sem); - spin_lock_irqsave(&device->client_data_lock, flags); - list_for_each_entry_safe(context, tmp, &device->client_data_list, list) - kfree(context); + for (;;) { + if (list_empty(&device->client_data_list)) + break; + context = list_entry(device->client_data_list.prev, + typeof(*context), list); + list_del(&context->list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + context->remove(device, context); + spin_lock_irqsave(&device->client_data_lock, flags); + } spin_unlock_irqrestore(&device->client_data_lock, flags); device->reg_state = IB_DEV_UNREGISTERED; + up(&device_sem); } EXPORT_SYMBOL(ib_unregister_device); @@ -318,14 +302,20 @@ EXPORT_SYMBOL(ib_unregister_device); */ int ib_register_client(struct ib_client *client) { + struct ib_client_data *context; struct ib_device *device; + unsigned long flags; down(&device_sem); list_add_tail(&client->list, &client_list); list_for_each_entry(device, &device_list, core_list) - if (client->add && !add_client_context(device, client)) - client->add(device); + if (client->add && (context = client->add(device))) { + context->client = client; + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + } up(&device_sem); @@ -343,7 +333,7 @@ EXPORT_SYMBOL(ib_register_client); */ void ib_unregister_client(struct ib_client *client) { - struct ib_client_data *context, *tmp; + struct ib_client_data *context; struct ib_device *device; unsigned long flags; @@ -354,10 +344,13 @@ void ib_unregister_client(struct ib_clie 
client->remove(device); spin_lock_irqsave(&device->client_data_lock, flags); - list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + list_for_each_entry(context, &device->client_data_list, list) if (context->client == client) { list_del(&context->list); - kfree(context); + spin_unlock_irqrestore(&device->client_data_lock, flags); + context->remove(device, context); + spin_lock_irqsave(&device->client_data_lock, flags); + break; } spin_unlock_irqrestore(&device->client_data_lock, flags); } @@ -375,16 +368,17 @@ EXPORT_SYMBOL(ib_unregister_client); * ib_get_client_data() returns client context set with * ib_set_client_data(). */ -void *ib_get_client_data(struct ib_device *device, struct ib_client *client) +struct ib_client_data *ib_get_client_data(struct ib_device *device, + struct ib_client *client) { struct ib_client_data *context; - void *ret = NULL; + struct ib_client_data *ret = NULL; unsigned long flags; spin_lock_irqsave(&device->client_data_lock, flags); list_for_each_entry(context, &device->client_data_list, list) if (context->client == client) { - ret = context->data; + ret = context; break; } spin_unlock_irqrestore(&device->client_data_lock, flags); @@ -394,36 +388,6 @@ void *ib_get_client_data(struct ib_devic EXPORT_SYMBOL(ib_get_client_data); /** - * ib_set_client_data - Get IB client context - * @device:Device to set context for - * @client:Client to set context for - * @data:Context to set - * - * ib_set_client_data() sets client context that can be retrieved with - * ib_get_client_data(). 
- */ -void ib_set_client_data(struct ib_device *device, struct ib_client *client, - void *data) -{ - struct ib_client_data *context; - unsigned long flags; - - spin_lock_irqsave(&device->client_data_lock, flags); - list_for_each_entry(context, &device->client_data_list, list) - if (context->client == client) { - context->data = data; - goto out; - } - - printk(KERN_WARNING "No client context found for %s/%s\n", - device->name, client->name); - -out: - spin_unlock_irqrestore(&device->client_data_lock, flags); -} -EXPORT_SYMBOL(ib_set_client_data); - -/** * ib_register_event_handler - Register an IB event handler * @event_handler:Handler to register * Index: linux-2.6.12.2/drivers/infiniband/include/rdma/ib_verbs.h =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/include/rdma/ib_verbs.h 2005-09-01 00:40:39.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/include/rdma/ib_verbs.h 2005-09-01 00:41:12.000000000 +0300 @@ -956,14 +956,22 @@ struct ib_device { u8 phys_port_cnt; }; +struct ib_client_data; + struct ib_client { char *name; - void (*add) (struct ib_device *); + struct ib_client_data *(*add) (struct ib_device *); void (*remove)(struct ib_device *); struct list_head list; }; +struct ib_client_data { + struct list_head list; + struct ib_client *client; + void (*remove)(struct ib_device *, struct ib_client_data *); +}; + struct ib_device *ib_alloc_device(size_t size); void ib_dealloc_device(struct ib_device *device); @@ -973,9 +981,8 @@ void ib_unregister_device(struct ib_devi int ib_register_client (struct ib_client *client); void ib_unregister_client(struct ib_client *client); -void *ib_get_client_data(struct ib_device *device, struct ib_client *client); -void ib_set_client_data(struct ib_device *device, struct ib_client *client, - void *data); +struct ib_client_data *ib_get_client_data(struct ib_device *device, + struct ib_client *client); static inline int ib_copy_from_udata(void *dest, struct 
ib_udata *udata, size_t len) { Index: linux-2.6.12.2/drivers/infiniband/core/user_mad.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/user_mad.c 2005-09-01 00:40:39.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/user_mad.c 2005-09-01 00:41:12.000000000 +0300 @@ -80,9 +80,10 @@ struct ib_umad_port { }; struct ib_umad_device { - int start_port, end_port; - struct kref ref; - struct ib_umad_port port[0]; + struct ib_client_data data; + int start_port, end_port; + struct kref ref; + struct ib_umad_port port[0]; }; struct ib_umad_file { @@ -108,8 +109,9 @@ static const dev_t base_dev = MKDEV(IB_U static spinlock_t map_lock; static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS * 2); -static void ib_umad_add_one(struct ib_device *device); -static void ib_umad_remove_one(struct ib_device *device); +static struct ib_client_data *ib_umad_add_one(struct ib_device *device); +static void ib_umad_remove_one(struct ib_device *device, + struct ib_client_data *); static int queue_packet(struct ib_umad_file *file, struct ib_mad_agent *agent, @@ -667,7 +669,6 @@ static struct file_operations umad_sm_fo static struct ib_client umad_client = { .name = "umad", .add = ib_umad_add_one, - .remove = ib_umad_remove_one }; static ssize_t show_dev(struct class_device *class_dev, char *buf) @@ -819,7 +820,7 @@ err_cdev: return -1; } -static void ib_umad_add_one(struct ib_device *device) +static struct ib_client_data *ib_umad_add_one(struct ib_device *device) { struct ib_umad_device *umad_dev; int s, e, i; @@ -835,13 +836,14 @@ static void ib_umad_add_one(struct ib_de (e - s + 1) * sizeof (struct ib_umad_port), GFP_KERNEL); if (!umad_dev) - return; + return NULL; memset(umad_dev, 0, sizeof *umad_dev + (e - s + 1) * sizeof (struct ib_umad_port)); kref_init(&umad_dev->ref); + umad_dev->data.remove = ib_umad_remove_one; umad_dev->start_port = s; umad_dev->end_port = e; @@ -852,9 +854,7 @@ static void 
ib_umad_add_one(struct ib_de goto err; } - ib_set_client_data(device, &umad_client, umad_dev); - - return; + return &umad_dev->data; err: while (--i >= s) { @@ -863,15 +863,16 @@ err: } kref_put(&umad_dev->ref, ib_umad_release_dev); + return NULL; } -static void ib_umad_remove_one(struct ib_device *device) +static void ib_umad_remove_one(struct ib_device *device, + struct ib_client_data *data) { - struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client); + struct ib_umad_device *umad_dev; int i; - if (!umad_dev) - return; + umad_dev = container_of(data, struct ib_umad_device, data); for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) { class_device_unregister(&umad_dev->port[i].class_dev); Index: linux-2.6.12.2/drivers/infiniband/core/cm.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/cm.c 2005-09-01 00:40:39.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/cm.c 2005-09-01 00:41:12.000000000 +0300 @@ -51,13 +51,12 @@ MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("InfiniBand CM"); MODULE_LICENSE("Dual BSD/GPL"); -static void cm_add_one(struct ib_device *device); -static void cm_remove_one(struct ib_device *device); +static struct ib_client_data *cm_add_one(struct ib_device *device); +static void cm_remove_one(struct ib_device *device, struct ib_client_data *); static struct ib_client cm_client = { .name = "cm", .add = cm_add_one, - .remove = cm_remove_one }; static struct ib_cm { @@ -81,6 +80,7 @@ struct cm_port { }; struct cm_device { + struct ib_client_data data; struct list_head list; struct ib_device *device; __be64 ca_guid; @@ -3194,7 +3194,7 @@ static __be64 cm_get_ca_guid(struct ib_d return guid; } -static void cm_add_one(struct ib_device *device) +static struct ib_client_data *cm_add_one(struct ib_device *device) { struct cm_device *cm_dev; struct cm_port *port; @@ -3212,8 +3212,9 @@ static void cm_add_one(struct ib_device cm_dev = 
kmalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) - return; + return NULL; + cm_dev->data.remove = cm_remove_one; cm_dev->device = device; cm_dev->ca_guid = cm_get_ca_guid(device); if (!cm_dev->ca_guid) @@ -3238,12 +3239,11 @@ static void cm_add_one(struct ib_device if (ret) goto error3; } - ib_set_client_data(device, &cm_client, cm_dev); write_lock_irqsave(&cm.device_lock, flags); list_add_tail(&cm_dev->list, &cm.device_list); write_unlock_irqrestore(&cm.device_lock, flags); - return; + return &cm_dev->data; error3: ib_unregister_mad_agent(port->mad_agent); @@ -3257,9 +3257,10 @@ error2: } error1: kfree(cm_dev); + return NULL; } -static void cm_remove_one(struct ib_device *device) +static void cm_remove_one(struct ib_device *device, struct ib_client_data *data) { struct cm_device *cm_dev; struct cm_port *port; @@ -3269,9 +3270,7 @@ static void cm_remove_one(struct ib_devi unsigned long flags; int i; - cm_dev = ib_get_client_data(device, &cm_client); - if (!cm_dev) - return; + cm_dev = container_of(data, struct cm_device, data); write_lock_irqsave(&cm.device_lock, flags); list_del(&cm_dev->list); Index: linux-2.6.12.2/drivers/infiniband/core/sa_query.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/sa_query.c 2005-09-01 00:40:39.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/sa_query.c 2005-09-01 00:41:12.000000000 +0300 @@ -65,6 +65,7 @@ struct ib_sa_port { }; struct ib_sa_device { + struct ib_client_data data; int start_port, end_port; struct ib_event_handler event_handler; struct ib_sa_port port[0]; @@ -98,13 +99,12 @@ struct ib_sa_mcmember_query { struct ib_sa_query sa_query; }; -static void ib_sa_add_one(struct ib_device *device); -static void ib_sa_remove_one(struct ib_device *device); +static struct ib_client_data *ib_sa_add_one(struct ib_device *device); +static void ib_sa_remove_one(struct ib_device *device, struct ib_client_data 
*data); static struct ib_client sa_client = { .name = "sa", .add = ib_sa_add_one, - .remove = ib_sa_remove_one }; static spinlock_t idr_lock; @@ -426,13 +426,14 @@ static void update_sm_ah(void *port_ptr) static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) { + struct ib_sa_device *sa_dev; + if (event->event == IB_EVENT_PORT_ERR || event->event == IB_EVENT_PORT_ACTIVE || event->event == IB_EVENT_LID_CHANGE || event->event == IB_EVENT_PKEY_CHANGE || event->event == IB_EVENT_SM_CHANGE) { - struct ib_sa_device *sa_dev = - ib_get_client_data(event->device, &sa_client); + sa_dev = container_of(handler, struct ib_sa_device, event_handler); schedule_work(&sa_dev->port[event->element.port_num - sa_dev->start_port].update_task); @@ -608,7 +609,8 @@ int ib_sa_path_rec_get(struct ib_device struct ib_sa_query **sa_query) { struct ib_sa_path_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_client_data *data = ib_get_client_data(device, &sa_client); + struct ib_sa_device *sa_dev = container_of(data, struct ib_sa_device, data); struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; struct ib_mad_agent *agent = port->agent; int ret; @@ -710,7 +712,8 @@ int ib_sa_service_rec_query(struct ib_de struct ib_sa_query **sa_query) { struct ib_sa_service_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_client_data *data = ib_get_client_data(device, &sa_client); + struct ib_sa_device *sa_dev = container_of(data, struct ib_sa_device, data); struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; struct ib_mad_agent *agent = port->agent; int ret; @@ -793,7 +796,8 @@ int ib_sa_mcmember_rec_query(struct ib_d struct ib_sa_query **sa_query) { struct ib_sa_mcmember_query *query; - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_client_data *data = ib_get_client_data(device, &sa_client); + struct 
ib_sa_device *sa_dev = container_of(data, struct ib_sa_device, data); struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; struct ib_mad_agent *agent = port->agent; int ret; @@ -900,7 +904,7 @@ static void recv_handler(struct ib_mad_a ib_free_recv_mad(mad_recv_wc); } -static void ib_sa_add_one(struct ib_device *device) +static struct ib_client_data *ib_sa_add_one(struct ib_device *device) { struct ib_sa_device *sa_dev; int s, e, i; @@ -916,8 +920,9 @@ static void ib_sa_add_one(struct ib_devi (e - s + 1) * sizeof (struct ib_sa_port), GFP_KERNEL); if (!sa_dev) - return; + return NULL; + sa_dev->data.remove = ib_sa_remove_one; sa_dev->start_port = s; sa_dev->end_port = e; @@ -937,8 +942,6 @@ static void ib_sa_add_one(struct ib_devi update_sm_ah, &sa_dev->port[i]); } - ib_set_client_data(device, &sa_client, sa_dev); - /* * We register our event handler after everything is set up, * and then update our cached info after the event handler is @@ -953,7 +956,7 @@ static void ib_sa_add_one(struct ib_devi for (i = 0; i <= e - s; ++i) update_sm_ah(&sa_dev->port[i]); - return; + return &sa_dev->data; err: while (--i >= 0) @@ -961,17 +964,14 @@ err: kfree(sa_dev); - return; + return NULL; } -static void ib_sa_remove_one(struct ib_device *device) +static void ib_sa_remove_one(struct ib_device *device, struct ib_client_data *data) { - struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_device *sa_dev = container_of(data, struct ib_sa_device, data); int i; - if (!sa_dev) - return; - ib_unregister_event_handler(&sa_dev->event_handler); for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_conn.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_conn.c 2005-09-01 00:40:39.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_conn.c 2005-09-01 00:41:12.000000000 +0300 @@ -37,13 +37,13 @@ 
static struct sdev_root dev_root_s; -static void sdp_device_init_one(struct ib_device *device); -static void sdp_device_remove_one(struct ib_device *device); +static struct ib_client_data *sdp_device_init_one(struct ib_device *device); +static void sdp_device_remove_one(struct ib_device *device, + struct ib_client_data *data); static struct ib_client sdp_client = { .name = "sdp", .add = sdp_device_init_one, - .remove = sdp_device_remove_one }; static DEFINE_SPINLOCK(psn_lock); @@ -959,6 +959,7 @@ static void sdp_conn_lock_init(struct sd int sdp_conn_alloc_ib(struct sdp_sock *conn, struct ib_device *device, u8 hw_port, u16 pkey) { + struct ib_client_data *data; struct ib_qp_init_attr *init_attr; struct ib_qp_attr *qp_attr; struct sdev_hca_port *port; @@ -969,10 +970,12 @@ int sdp_conn_alloc_ib(struct sdp_sock *c /* * look up correct HCA and port */ - hca = ib_get_client_data(device, &sdp_client); - if (!hca) + data = ib_get_client_data(device, &sdp_client); + if (!data) return -ERANGE; + hca = container_of(data, struct sdev_hca, data); + list_for_each_entry(port, &hca->port_list, list) if (hw_port == port->index) { result = 1; @@ -1706,7 +1709,7 @@ int sdp_proc_dump_device(char *buffer, i /* * sdp_device_init_one - add a device to the list */ -static void sdp_device_init_one(struct ib_device *device) +static struct ib_client_data *sdp_device_init_one(struct ib_device *device) { struct ib_fmr_pool_param fmr_param_s; struct sdev_hca_port *port, *tmp; @@ -1719,13 +1722,14 @@ static void sdp_device_init_one(struct i hca = kmalloc(sizeof *hca, GFP_KERNEL); if (!hca) { sdp_warn("Error allocating HCA <%s> memory.", device->name); - return; + return NULL; } /* * init and insert into list. 
*/ memset(hca, 0, sizeof *hca); + hca->data.remove = sdp_device_remove_one; hca->ca = device; INIT_LIST_HEAD(&hca->port_list); /* @@ -1801,9 +1805,7 @@ static void sdp_device_init_one(struct i } } - ib_set_client_data(device, &sdp_client, hca); - - return; + return &hca->data; error: list_for_each_entry_safe(port, tmp, &hca->port_list, list) { @@ -1821,22 +1823,19 @@ error: (void)ib_dealloc_pd(hca->pd); kfree(hca); + return NULL; } /* * sdp_device_remove_one - remove a device from the hca list */ -static void sdp_device_remove_one(struct ib_device *device) +static void sdp_device_remove_one(struct ib_device *device, + struct ib_client_data *data) { struct sdev_hca_port *port, *tmp; struct sdev_hca *hca; - hca = ib_get_client_data(device, &sdp_client); - - if (!hca) { - sdp_warn("Device <%s> has no HCA info.", device->name); - return; - } + hca = container_of(data, struct sdev_hca, data); list_for_each_entry_safe(port, tmp, &hca->port_list, list) { list_del(&port->list); Index: linux-2.6.12.2/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/srp/ib_srp.c 2005-09-01 00:41:10.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/ulp/srp/ib_srp.c 2005-09-01 00:42:51.000000000 +0300 @@ -59,6 +59,11 @@ MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("InfiniBand SCSI RDMA Protocol driver"); MODULE_LICENSE("Dual BSD/GPL"); +struct ib_srp_client_data { + struct ib_client_data data; + struct list_head list; +}; + static int topspin_workarounds = 1; module_param(topspin_workarounds, int, 0444); @@ -67,13 +72,12 @@ MODULE_PARM_DESC(topspin_workarounds, static const u8 topspin_oui[3] = { 0x00, 0x05, 0xad }; -static void srp_add_one(struct ib_device *device); -static void srp_remove_one(struct ib_device *device); +static struct ib_client_data *srp_add_one(struct ib_device *device); +static void srp_remove_one(struct ib_device *device, struct ib_client_data *data); static 
struct ib_client srp_client = { .name = "srp", .add = srp_add_one, - .remove = srp_remove_one }; static inline struct srp_target_port *host_to_target(struct Scsi_Host *host) @@ -1346,16 +1350,16 @@ err_free: return NULL; } -static void srp_add_one(struct ib_device *device) +static struct ib_client_data *srp_add_one(struct ib_device *device) { - struct list_head *dev_list; + struct ib_srp_client_data *dev_list = NULL; struct srp_host *host; struct ib_device_attr *dev_attr; int s, e, p; dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL); if (!dev_attr) - return; + return NULL; if (ib_query_device(device, dev_attr)) { printk(KERN_WARNING PFX "Couldn't query node GUID for %s.\n", @@ -1367,7 +1371,8 @@ static void srp_add_one(struct ib_device if (!dev_list) goto out; - INIT_LIST_HEAD(dev_list); + dev_list->data.remove = srp_remove_one; + INIT_LIST_HEAD(&dev_list->list); if (device->node_type == IB_NODE_SWITCH) { s = 0; @@ -1380,24 +1385,23 @@ static void srp_add_one(struct ib_device for (p = s; p <= e; ++p) { host = srp_add_port(device, dev_attr->node_guid, p); if (host) - list_add_tail(&host->list, dev_list); + list_add_tail(&host->list, &dev_list->list); } - ib_set_client_data(device, &srp_client, dev_list); - out: kfree(dev_attr); + return dev_list ? 
&dev_list->data : NULL; } -static void srp_remove_one(struct ib_device *device) +static void srp_remove_one(struct ib_device *device, struct ib_client_data *data) { - struct list_head *dev_list; + struct ib_srp_client_data *dev_list; struct srp_host *host, *tmp_host; struct srp_target_port *target, *tmp_target; - dev_list = ib_get_client_data(device, &srp_client); + dev_list = container_of(data, struct ib_srp_client_data, data); - list_for_each_entry_safe(host, tmp_host, dev_list, list) { + list_for_each_entry_safe(host, tmp_host, &dev_list->list, list) { class_device_unregister(&host->class_dev); wait_for_completion(&host->released); @@ -1416,6 +1420,7 @@ static void srp_remove_one(struct ib_dev ib_dealloc_pd(host->pd); kfree(host); } + kfree(dev_list); } static int __init srp_init_module(void) Index: linux-2.6.12.2/drivers/infiniband/core/uverbs.h =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/uverbs.h 2005-09-01 00:40:39.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/uverbs.h 2005-09-01 00:41:12.000000000 +0300 @@ -49,6 +49,7 @@ #include struct ib_uverbs_device { + struct ib_client_data data; int devnum; struct cdev dev; struct class_device class_dev; Index: linux-2.6.12.2/drivers/infiniband/core/mad.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/mad.c 2005-09-01 00:40:39.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/mad.c 2005-09-01 00:41:12.000000000 +0300 @@ -2681,7 +2681,7 @@ static int ib_mad_port_close(struct ib_d return 0; } -static void ib_mad_init_device(struct ib_device *device) +static struct ib_client_data *ib_mad_init_device(struct ib_device *device) { int num_ports, cur_port, i; @@ -2705,7 +2705,7 @@ static void ib_mad_init_device(struct ib goto error_device_open; } } - return; + return NULL; error_device_open: while (i > 0) { @@ -2719,6 +2719,7 @@ error_device_open: device->name, 
cur_port); i--; } + return NULL; } static void ib_mad_remove_device(struct ib_device *device) Index: linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_dev.h =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/sdp/sdp_dev.h 2005-09-01 00:40:39.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/ulp/sdp/sdp_dev.h 2005-09-01 00:41:12.000000000 +0300 @@ -154,6 +154,7 @@ struct sdev_hca_port { }; struct sdev_hca { + struct ib_client_data data; struct ib_device *ca; /* HCA */ struct ib_pd *pd; /* protection domain for this HCA */ struct ib_mr *mem_h; /* registered memory region */ Index: linux-2.6.12.2/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-09-01 00:40:39.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-09-01 00:41:12.000000000 +0300 @@ -51,6 +51,11 @@ MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); MODULE_LICENSE("Dual BSD/GPL"); +struct ipoib_client_data { + struct ib_client_data data; + struct list_head list; +}; + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG int ipoib_debug_level; @@ -67,13 +72,13 @@ static const u8 ipv4_bcast_addr[] = { struct workqueue_struct *ipoib_workqueue; struct workqueue_struct *ipoib_event_workqueue; -static void ipoib_add_one(struct ib_device *device); -static void ipoib_remove_one(struct ib_device *device); +static struct ib_client_data *ipoib_add_one(struct ib_device *device); +static void ipoib_remove_one(struct ib_device *device, + struct ib_client_data *data); static struct ib_client ipoib_client = { .name = "ipoib", .add = ipoib_add_one, - .remove = ipoib_remove_one }; int ipoib_open(struct net_device *dev) @@ -1018,18 +1023,19 @@ alloc_mem_failed: return ERR_PTR(result); } -static void ipoib_add_one(struct ib_device *device) +static struct ib_client_data 
*ipoib_add_one(struct ib_device *device) { - struct list_head *dev_list; + struct ipoib_client_data *dev_list; struct net_device *dev; struct ipoib_dev_priv *priv; int s, e, p; dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) - return; + return NULL; - INIT_LIST_HEAD(dev_list); + dev_list->data.remove = ipoib_remove_one; + INIT_LIST_HEAD(&dev_list->list); if (device->node_type == IB_NODE_SWITCH) { s = 0; @@ -1043,21 +1049,22 @@ static void ipoib_add_one(struct ib_devi dev = ipoib_add_port("ib%d", device, p); if (!IS_ERR(dev)) { priv = netdev_priv(dev); - list_add_tail(&priv->list, dev_list); + list_add_tail(&priv->list, &dev_list->list); } } - ib_set_client_data(device, &ipoib_client, dev_list); + return &dev_list->data; } -static void ipoib_remove_one(struct ib_device *device) +static void ipoib_remove_one(struct ib_device *device, + struct ib_client_data *data) { struct ipoib_dev_priv *priv, *tmp; - struct list_head *dev_list; + struct ipoib_client_data *dev_list; - dev_list = ib_get_client_data(device, &ipoib_client); + dev_list = container_of(data, struct ipoib_client_data, data); - list_for_each_entry_safe(priv, tmp, dev_list, list) { + list_for_each_entry_safe(priv, tmp, &dev_list->list, list) { ib_unregister_event_handler(&priv->event_handler); flush_workqueue(ipoib_event_workqueue); @@ -1065,6 +1072,7 @@ static void ipoib_remove_one(struct ib_d ipoib_dev_cleanup(priv->dev); free_netdev(priv->dev); } + kfree(dev_list); } static int __init ipoib_init_module(void) Index: linux-2.6.12.2/drivers/infiniband/core/cache.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/cache.c 2005-09-01 00:40:39.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/cache.c 2005-09-01 00:41:26.000000000 +0300 @@ -291,7 +291,7 @@ static void ib_cache_event(struct ib_eve } } -static void ib_cache_setup_one(struct ib_device *device) +static struct ib_client_data *ib_cache_setup_one(struct 
ib_device *device) { int p; @@ -321,7 +321,7 @@ static void ib_cache_setup_one(struct ib if (ib_register_event_handler(&device->cache.event_handler)) goto err_cache; - return; + return NULL; err_cache: for (p = 0; p <= end_port(device) - start_port(device); ++p) { @@ -332,6 +332,7 @@ err_cache: err: kfree(device->cache.pkey_cache); kfree(device->cache.gid_cache); + return NULL; } static void ib_cache_cleanup_one(struct ib_device *device) Index: linux-2.6.12.2/drivers/infiniband/core/uverbs_main.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/uverbs_main.c 2005-09-01 00:40:39.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/uverbs_main.c 2005-09-01 00:42:20.000000000 +0300 @@ -101,8 +101,9 @@ static ssize_t (*uverbs_cmd_table[])(str static struct vfsmount *uverbs_event_mnt; -static void ib_uverbs_add_one(struct ib_device *device); -static void ib_uverbs_remove_one(struct ib_device *device); +static struct ib_client_data *ib_uverbs_add_one(struct ib_device *device); +static void ib_uverbs_remove_one(struct ib_device *device, + struct ib_client_data *data); static int ib_dealloc_ucontext(struct ib_ucontext *context) { @@ -539,7 +540,6 @@ static struct file_operations uverbs_mma static struct ib_client uverbs_client = { .name = "uverbs", .add = ib_uverbs_add_one, - .remove = ib_uverbs_remove_one }; static ssize_t show_dev(struct class_device *class_dev, char *buf) @@ -581,19 +581,21 @@ static ssize_t show_abi_version(struct c } static CLASS_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); -static void ib_uverbs_add_one(struct ib_device *device) +static struct ib_client_data *ib_uverbs_add_one(struct ib_device *device) { struct ib_uverbs_device *uverbs_dev; if (!device->alloc_ucontext) - return; + return NULL; uverbs_dev = kmalloc(sizeof *uverbs_dev, GFP_KERNEL); if (!uverbs_dev) - return; + return NULL; memset(uverbs_dev, 0, sizeof *uverbs_dev); + uverbs_dev->data.remove = 
ib_uverbs_remove_one; + spin_lock(&map_lock); uverbs_dev->devnum = find_first_zero_bit(dev_map, IB_UVERBS_MAX_DEVICES); if (uverbs_dev->devnum >= IB_UVERBS_MAX_DEVICES) { @@ -626,9 +628,7 @@ static void ib_uverbs_add_one(struct ib_ if (class_device_create_file(&uverbs_dev->class_dev, &class_device_attr_ibdev)) goto err_class; - ib_set_client_data(device, &uverbs_client, uverbs_dev); - - return; + return &uverbs_dev->data; err_class: class_device_unregister(&uverbs_dev->class_dev); @@ -639,15 +639,14 @@ err_cdev: err: kfree(uverbs_dev); - return; + return NULL; } -static void ib_uverbs_remove_one(struct ib_device *device) +static void ib_uverbs_remove_one(struct ib_device *device, + struct ib_client_data *data) { - struct ib_uverbs_device *uverbs_dev = ib_get_client_data(device, &uverbs_client); - - if (!uverbs_dev) - return; + struct ib_uverbs_device *uverbs_dev; + uverbs_dev = container_of(data, struct ib_uverbs_device, data); class_device_unregister(&uverbs_dev->class_dev); } Index: linux-2.6.12.2/drivers/infiniband/core/ping.c =================================================================== --- linux-2.6.12.2.orig/drivers/infiniband/core/ping.c 2005-09-01 00:40:39.000000000 +0300 +++ linux-2.6.12.2/drivers/infiniband/core/ping.c 2005-09-01 00:41:12.000000000 +0300 @@ -245,7 +245,7 @@ static int ib_ping_port_close(struct ib_ return 0; } -static void ib_ping_init_device(struct ib_device *device) +static struct ib_client_data *ib_ping_init_device(struct ib_device *device) { int num_ports, cur_port, i; @@ -263,7 +263,7 @@ static void ib_ping_init_device(struct i device->name, cur_port); goto error_device_open; } - return; + return NULL; error_device_open: while (i > 0) { @@ -274,6 +274,7 @@ error_device_open: device->name, cur_port); i--; } + return NULL; } static void ib_ping_remove_device(struct ib_device *device) -- MST From krause at cup.hp.com Wed Aug 31 11:57:28 2005 From: krause at cup.hp.com (Michael Krause) Date: Wed, 31 Aug 2005 11:57:28 -0700 Subject: 
[openib-general] Re: RDMA Generic Connection Management In-Reply-To: References: <521x4bjqls.fsf@cisco.com> <6.2.3.4.2.20050830135332.064b9030@exnane01.nane.netapp.com> <521x4bi9su.fsf@cisco.com> <6.2.3.4.2.20050830140954.060c9030@exnane01.nane.netapp.com> <52k6i3gukm.fsf@cisco.com> <6.2.3.4.2.20050830142125.063e1cc0@exnane01.nane.netapp.com> <527je3gtmq.fsf@cisco.com> <6.2.3.4.2.20050830145906.05104890@exnane01.nane.netapp.com> <523borgs4r.fsf@cisco.com> <52zmqyeq58.fsf@cisco.com> Message-ID: <6.2.0.14.2.20050831114959.028ac810@esmail.cup.hp.com> At 07:46 AM 8/31/2005, James Lentini wrote: >On Tue, 30 Aug 2005, Roland Dreier wrote: > > > I just committed this SRP fix, which should make sure we don't use a > > device after it's gone. And it actually simplifies the code a teeny bit... > >The device could still be used after it's gone. For example: > > - the user is configuring SRP via sysfs. The thread in > srp_create_target() has just called ib_sa_path_rec_get() > [srp.c line 1209] and is waiting for the path > record query to complete in wait_for_completion() > - the SA callback, srp_path_rec_completion(), is called. This > callback thread will make several verb calls (ib_create_cq, > ib_req_notify_cq, ib_create_qp, ...) without any coordination with > the hotplug device removal callback, srp_remove_one > >Notice that if the SA client's hotplug removal function, >ib_sa_remove_one(), ensured that all callbacks had completed before >returning the problem would be fixed. This would protect all ULPs from >having to deal with hotplug races in their SA callback function. The >fix belongs in the SA client (the core stack), not in SRP. > >All the ULPs are deficient with respect to their hotplug >synchronization. Given that there is a common problem, doesn't it make >sense to try and solve it in a generic way instead of in each ULP? 
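The suggestion in the quoted text, that a removal function should not return until every outstanding callback has completed, can be sketched in user space roughly as follows. This is only an illustration using POSIX threads primitives, not code from the OpenIB tree: struct cb_gate and the cb_gate_* helpers are hypothetical names, and a kernel version would use kernel locking and completion primitives instead.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-device gate; none of these names exist in the OpenIB tree. */
struct cb_gate {
	pthread_mutex_t lock;
	pthread_cond_t  idle;       /* signalled when in_flight drops to zero */
	unsigned int    in_flight;  /* callbacks currently running */
	bool            removing;   /* set once removal has started */
};

static void cb_gate_init(struct cb_gate *g)
{
	pthread_mutex_init(&g->lock, NULL);
	pthread_cond_init(&g->idle, NULL);
	g->in_flight = 0;
	g->removing = false;
}

/*
 * Called before dispatching a callback. Returns false if removal has
 * started, in which case the callback must be skipped.
 */
static bool cb_gate_enter(struct cb_gate *g)
{
	bool ok;

	pthread_mutex_lock(&g->lock);
	ok = !g->removing;
	if (ok)
		g->in_flight++;
	pthread_mutex_unlock(&g->lock);
	return ok;
}

/* Called after a callback that passed cb_gate_enter() has returned. */
static void cb_gate_exit(struct cb_gate *g)
{
	pthread_mutex_lock(&g->lock);
	assert(g->in_flight > 0);
	if (--g->in_flight == 0)
		pthread_cond_broadcast(&g->idle);
	pthread_mutex_unlock(&g->lock);
}

/* Removal path: refuse new callbacks, then wait for running ones to finish. */
static void cb_gate_drain(struct cb_gate *g)
{
	pthread_mutex_lock(&g->lock);
	g->removing = true;
	while (g->in_flight > 0)
		pthread_cond_wait(&g->idle, &g->lock);
	pthread_mutex_unlock(&g->lock);
}
```

A dispatcher would wrap each callback in cb_gate_enter()/cb_gate_exit(), and the removal path would call cb_gate_drain() before freeing any device state. One consequence of this design is that a callback must never call cb_gate_drain() itself, since it would then wait forever on its own in_flight count.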
There are two approaches to device removal to consider - both are required to have a credible solution:

(1) Inform all entities that a planned device removal is to occur and allow them to close gracefully or migrate to alternatives. Ideally, the OS comprehends whether the removal will result in the loss of any critical resources and does not inform or take action unless it knows the removal is something that the system can survive. Doing this requires the ULP to register interest with the OS in a particular hardware resource. This also allows the OS to construct a resource analysis tool to determine whether the removal of a device will be a good idea or not. This is really outside the scope of an RDMA infrastructure and should be done by the OS through an OS-defined API which is applicable to all types of hardware resources and sub-systems.

(2) Design all ULPs to handle surprise removal, e.g. device failure, from the start and allow them to close gracefully or migrate to alternatives. The OS would inform the device driver of the failure if the device driver has not already discovered the problem. The OS would also inform interested parties of the device failure. The device driver would simply error out all users of the device instance - there are already error codes defined for IB and iWARP for this purpose. The associated verbs resources should be released as the ULP closes out its resources through the verbs API (we did define the verbs to clean up resources that the infrastructure may allocate on behalf of the ULP). Activities such as listen entries would be released just like what is done for Sockets, etc. today.

Device addition is simply a matter of informing the policy or service management component within the OS that determines what services should be available on a given device. The device driver really does not need to do anything special.

One area to consider is whether a planned migration of a service needs to be supported.
This is generally best handled by the ULP with only a small set of services required of the infrastructure, e.g. get / set of QP / LLP context and then coordinating any other aspects with the appropriate SM or network services such as updating address vectors or fabric management / configuration.

In general, the ULP should already be designed to handle the error condition and whether they support a managed / planned removal or migration is perhaps the only potential area of deficiency.

Mike

From mshefty at ichips.intel.com Wed Aug 31 13:30:00 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 31 Aug 2005 13:30:00 -0700
Subject: [openib-general] [PATCH] hotplug support: selective removal notification
In-Reply-To: <20050831162020.GA1707@mellanox.co.il>
References: <20050831162020.GA1707@mellanox.co.il>
Message-ID: <43161348.5090605@ichips.intel.com>

Michael S. Tsirkin wrote:
> Hi!
> As Sean pointed out, in the existing client registration
> the client gets removal events even from devices which it
> may not be interested in.

I was actually trying to point out that a remove event may occur before a client receives a device pointer from another call.

> As a way of solving this, I propose the following patch.
> The idea is that instead of setting client context separately
> with ib_set_client_data, client's add method will return
> ib_client_data object which is then kept in a per-device list.
> Returning NULL signals that the client will not be interested
> in this device.

I don't think that this solves the race condition that can occur between receiving a remove device event and a call, such as ib_cma_get_device(), returning a pointer to that same device. There should be no restriction on what a client can use for their context.
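The embedding scheme described in the quoted proposal can be illustrated with a small user-space sketch. Everything below is a hypothetical stand-in (struct demo_client_data, demo_add_one, and so on), not the actual ib_client_data interface from the patch; it only shows an add method returning an embedded object (or NULL to decline a device) and a remove method recovering its private state with container_of.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical device; "wanted" says whether the client cares about it. */
struct demo_device {
	const char *name;
	int wanted;
};

struct demo_client_data;
typedef void (*demo_remove_fn)(struct demo_device *dev,
			       struct demo_client_data *data);

/* Hypothetical analogue of the generic per-client object in the proposal. */
struct demo_client_data {
	demo_remove_fn remove;
};

/* The client's private per-device state embeds the generic object. */
struct demo_ulp_state {
	struct demo_client_data data;
	int ports;
};

/* Records what the remove method saw, so the behaviour can be checked. */
static int demo_removed_ports = -1;

static void demo_remove_one(struct demo_device *dev,
			    struct demo_client_data *data)
{
	/* Recover the enclosing private state from the embedded member. */
	struct demo_ulp_state *st =
		container_of(data, struct demo_ulp_state, data);

	(void)dev;
	demo_removed_ports = st->ports;
	free(st);
}

/* Returns NULL when the client is not interested in this device. */
static struct demo_client_data *demo_add_one(struct demo_device *dev)
{
	struct demo_ulp_state *st;

	if (!dev->wanted)
		return NULL;

	st = calloc(1, sizeof *st);
	if (!st)
		return NULL;

	st->data.remove = demo_remove_one;
	st->ports = 2;
	return &st->data;
}

/* Stand-in for the core: add a device, then remove it if it was claimed. */
static int demo_device_cycle(struct demo_device *dev)
{
	struct demo_client_data *data = demo_add_one(dev);

	if (!data)
		return 0;	/* client declined: no remove notification */
	data->remove(dev, data);
	return 1;
}
```

Note that this pattern only works if the client's context embeds the generic object, which is the restriction on context type raised in the reply above.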
> In this way most ulps can now use container_of to get their > context in the remove method, instead of scanning the client list > each time, which in my opinion is very nice. This seems to be a somewhat separate issue. We should be able to return a client's context with a remove event without changing the add event handling. It seems that clients should be able to read their own context and determine how to handle the remove event. - Sean From iod00d at hp.com Wed Aug 31 13:41:03 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 31 Aug 2005 13:41:03 -0700 Subject: [openib-general] Re: [PATCH] ipoib: device removal races In-Reply-To: <52ll2klbtd.fsf@cisco.com> References: <20050808151141.GJ15300@mellanox.co.il> <52y86rui38.fsf@cisco.com> <20050829163553.GB4081@mellanox.co.il> <52psrwn1fi.fsf@cisco.com> <20050829185540.GA5169@mellanox.co.il> <52ek8cmuqq.fsf@cisco.com> <20050829194954.GB5169@mellanox.co.il> <521x4cmsfi.fsf@cisco.com> <20050829202218.GE5169@mellanox.co.il> <52ll2klbtd.fsf@cisco.com> Message-ID: <20050831204103.GH32377@esmail.cup.hp.com> On Mon, Aug 29, 2005 at 01:37:02PM -0700, Roland Dreier wrote: > I think you have to hold the spinlock across the consumer callback to > avoid all races. And that's kind of a bummer, because it means you > can't do anything that might sleep (like modify a QP) from the > callback. If the callback is being performed from the interrupt context, the callback can't sleep anyway. grant From viswak at yahoo.com Wed Aug 31 14:01:57 2005 From: viswak at yahoo.com (viswanath krishnamurthy) Date: Wed, 31 Aug 2005 14:01:57 -0700 (PDT) Subject: [openib-general] List of issues in uverbs In-Reply-To: <52k6i4jb3a.fsf@cisco.com> Message-ID: <20050831210157.38332.qmail@web33208.mail.mud.yahoo.com> I have attached the firmware version/svn info in the attachment. Here is new list of issues with uverbs 1. ib_cm_destroy_id(cm_id) hangs (does return to the caller) Is there a particular shutdown sequence that needs to be followed ? 
Is there a trace/debug I can enable ? 2. libmthca library crashes when a server accepts lots of new incoming sessions. See log (gdb) in the attachment. (It accepts about 170 connections) Looks like a memory allocation issue. 3. Kernel oops when lots of traffic between multiple clients and server. Very consistently reproducible. See attachment for details 4. Is there a way to get the Port GUID from incoming connection. I can only get the remote node guid, but not the port GUID from the CM REQ data. This was possible in gen1 stack. I will look in the rc_ping pong issue and try to reproduce. --- Roland Dreier wrote: > viswanath> I have the latest openib code on 2.16 > machine, when I > viswanath> run the rc pingpong program I get the > following error > viswanath> (The first time it passed, but > subsequent ones got an > viswanath> error, I tried changing the iteration > count to a large > viswanath> number, 100000 after the first time) > > I left "ibv_rc_pingpong -n 100000" running in a loop > between two of my > machines with no problems, so there's something > specific to your setup. > > When you say "latest openib code," what does this > mean? Are you > running something from subversion or a standard > Linux kernel? Do you > have 1-port or 2-port HCAs? What HCA firmware > version are you > running? > > - R. > ____________________________________________________ Start your day with Yahoo! - make it your home page http://www.yahoo.com/r/hs -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ib.log Type: application/octet-stream Size: 4446 bytes Desc: 2164448128-ib.log URL: From mshefty at ichips.intel.com Wed Aug 31 14:21:42 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 31 Aug 2005 14:21:42 -0700 Subject: [openib-general] List of issues in uverbs In-Reply-To: <20050831210157.38332.qmail@web33208.mail.mud.yahoo.com> References: <20050831210157.38332.qmail@web33208.mail.mud.yahoo.com> Message-ID: <43161F66.7050806@ichips.intel.com> viswanath krishnamurthy wrote: > 1. ib_cm_destroy_id(cm_id) > hangs (does return to the caller) > Is there a particular shutdown sequence > that needs to be followed ? Is there a trace/debug > I can enable ? There's no significant debug to enable. What app are you running that's calling ib_cm_destroy_id()? I didn't think that the ping pong tests used it. Are you trying to call this function from within a CM callback? The call will hang while there is a CM callback outstanding or if a CM event has not been completed by calling put_event. > 2. libmthca library crashes when a server accepts > lots of new incoming sessions. See log (gdb) > in the attachment. (It accepts about 170 > connections) Looks like a memory allocation issue. The log file borders on unreadable. > 3. Kernel oops when lots of traffic between multiple > clients and server. Very consistently > reproducible. See attachment for details Can you clarify what application you're running? I can't understand your configuration from the log file. > 4. Is there a way to get the Port GUID from > incoming connection. I can only get the remote > node guid, but not the port GUID from the CM REQ > data. This was possible in gen1 stack. You can use the returned path record to obtain port information. What do you need the port GUID for? 
- Sean From viswak at yahoo.com Wed Aug 31 14:49:41 2005 From: viswak at yahoo.com (viswanath krishnamurthy) Date: Wed, 31 Aug 2005 14:49:41 -0700 (PDT) Subject: [openib-general] List of issues in uverbs In-Reply-To: <43161F66.7050806@ichips.intel.com> Message-ID: <20050831214941.32321.qmail@web33215.mail.mud.yahoo.com> --- Sean Hefty wrote: > viswanath krishnamurthy wrote: > > 1. ib_cm_destroy_id(cm_id) > > hangs (does return to the caller) > > Is there a particular shutdown sequence > > that needs to be followed ? Is there a > trace/debug > > I can enable ? > > There's no significant debug to enable. What app > are you running that's calling > ib_cm_destroy_id()? I didn't think that the ping > pong tests used it. Are you > trying to call this function from within a CM > callback? Probably called from a callback.. The application is small application which accepts incoming connections (Like a socket server). When is the good time to call the destroy ? > > The call will hang while there is a CM callback > outstanding or if a CM event has > not been completed by calling put_event. > > > 2. libmthca library crashes when a server accepts > > lots of new incoming sessions. See log (gdb) > > in the attachment. (It accepts about 170 > > connections) Looks like a memory allocation issue. > > The log file borders on unreadable. Hope this time attachment is better.. See information here ================== A server program that accepts multiple incoming connections. After about 170 connections the library dies as seen in the gdb output ========================================== Program received signal SIGSEGV, Segmentation fault. [Switching to Thread -1208648784 (LWP 21309)] 0xb7f79de8 in mthca_free_db (db_tab=0x805c688, type=MTHCA_DB_TYPE_CQ_SET_CI, db_index=494) at src/memfree.c:150 150 db_tab->page[db_index / MTHCA_DB_REC_PER_PAGE]. 
(gdb) bt #0 0xb7f79de8 in mthca_free_db (db_tab=0x805c688, type=MTHCA_DB_TYPE_CQ_SET_CI, db_index=494) at src/memfree.c:150 #1 0xb7f7c699 in mthca_create_cq (context=0x805a0b4, cqe=10) at mthca.h:243 #2 0xb7f81eb5 in ibv_create_cq (context=0x805a0b4, cqe=10, cq_context=0x0) at src/verbs.c:107 #3 0xb7f5d6c0 in xib_qp_alloc_init (hp=0x865c958, port=1) at xsocket_trans2.c:157 #4 0xb7f5e19f in xib_conn_init (xcbp=0x865c958) at xsocket_trans2.c:496 #5 0xb7f5bd06 in handle_cm_req (hp=0x805da08, comm_id=0x865cab0, rguid=0x805db64 "", rn_guid=0x805db64 "", data=0x805d7b0, len=90) at xsocket.c:230 #6 0xb7f5ec73 in cm_handler () at xsocket_trans2.c:799 #7 0x007993ae in start_thread () from /lib/tls/libpthread.so.0 #8 0x00619aee in clone () from /lib/tls/libc.so.6 > > > 3. Kernel oops when lots of traffic between > multiple > > clients and server. Very consistently > > reproducible. See attachment for details > > Can you clarify what application you're running? I > can't understand your > configuration from the log file. The application is a simple one, which accepts incoming requests and spawns a thread to handle it. The application does simple "ping-pong" of data. 
printing eip: c0285f7d *pde = 3649a001 Oops: 0000 [#1] SMP Modules linked in: nfs nfsd exportfs lockd autofs4 sunrpc uhci_hcd ehci_hcd hw_random e1000 ext3 jbd sd_mod CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010002 (2.6.12.5) EIP is at mthca_poll_cq+0x158/0x534 eax: 00000000 ebx: c2027080 ecx: 00000007 edx: 00000a60 esi: 0000013c edi: c2027104 ebp: c1a33f0c esp: c1a33ea4 ds: 007b es: 007b ss: 0068 Process ib_mad1 (pid: 312, threadinfo=c1a32000 task=f7f16540) Stack: c1800560 c17f8560 c17f8ec0 c1a33edc c0116819 f7d9489c f78a31e0 00000000 00000080 00000000 00000000 00000286 f7d83000 c1a33f0c 00000001 f7d94880 f8806000 00000292 00000001 00000000 c2027080 f7d83000 f789bc00 c1a33f0c Call Trace: [] load_balance_newidle+0x76/0x81 [] ib_mad_completion_handler+0x2c/0x8d [] remove_wait_queue+0xf/0x34 [] worker_thread+0x1b0/0x23a [] ib_mad_completion_handler+0x0/0x8d [] default_wake_function+0x0/0xc [] default_wake_function+0x0/0xc [] worker_thread+0x0/0x23a [] kthread+0x8a/0xb2 [] kthread+0x0/0xb2 [] kernel_thread_helper+0x5/0xb Code: 01 00 00 8b 44 24 18 8d bb 84 00 00 00 8b 53 5c 8b 70 18 8b 4f 24 0f ce 2b b3 b8 00 00 00 8b 83 bc 00 00 00 d3 ee 01 f2 8d 14 d0 <8b> 02 8b 52 04 85 ff 89 45 00 89 55 04 74 16 8b 57 10 89 f0 39 After about 170 incoming connections the library (hence the application) dies.. > > > 4. Is there a way to get the Port GUID from > > incoming connection. I can only get the > remote > > node guid, but not the port GUID from the CM > REQ > > data. This was possible in gen1 stack. > > You can use the returned path record to obtain port > information. What do you > need the port GUID for? If an HCA has multiple ports, the node guid will be the same. It will be good to get the port guid to uniqely identify the port. > > - Sean > Here is the code version used.. [root at IB]# svn info Path: . 
URL: https://openib.org/svn/gen2/trunk Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd Revision: 3169 Node Kind: directory Schedule: normal Last Changed Author: mst Last Changed Rev: 3169 Last Changed Date: 2005-08-23 09:25:31 -0700 (Tue, 23 Aug 2005) # cat /sys/class/infiniband/mthca0/hw_rev a0 # cat /sys/class/infiniband/mthca0/fw_ver 1.0.1 [root at subnetmgr4 ~]# ibv_devices device node GUID ------ ---------------- mthca0 0002c90200400d00 ____________________________________________________ Start your day with Yahoo! - make it your home page http://www.yahoo.com/r/hs From mshefty at ichips.intel.com Wed Aug 31 15:25:01 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 31 Aug 2005 15:25:01 -0700 Subject: [openib-general] List of issues in uverbs In-Reply-To: <20050831214941.32321.qmail@web33215.mail.mud.yahoo.com> References: <20050831214941.32321.qmail@web33215.mail.mud.yahoo.com> Message-ID: <43162E3D.5000003@ichips.intel.com> viswanath krishnamurthy wrote: > Probably called from a callback.. The application > is small application which accepts incoming > connections (Like a socket server). > When is the good time to call the destroy ? You need to call ib_cm_event_put() after processing a CM event. You can call ib_cm_destroy_id() anytime, but the call will block if there is an outstanding event. So, if you're calling it in the same thread that calls ib_cm_event_put(), it must be done after making that call. > Program received signal SIGSEGV, Segmentation fault. > [Switching to Thread -1208648784 (LWP 21309)] > 0xb7f79de8 in mthca_free_db (db_tab=0x805c688, > type=MTHCA_DB_TYPE_CQ_SET_CI, db_index=494) at > src/memfree.c:150 I'll let Roland respond to this. (Since if he's read the mail, he may already be working on a fix.) >>>3. Kernel oops when lots of traffic between >>multiple >>> clients and server. Very consistently >>> reproducible. See attachment for details I've run with thousands of connections without any issues. 
We'll need to spend some time looking at this to see what the issue might be. > If an HCA has multiple ports, the node guid will be > the > same. It will be good to get the port guid to uniqely > identify the port. The CM REQ returns the device along with the port number that the REQ was received on, which can be used to identify the port. - Sean From ardavis at ichips.intel.com Wed Aug 31 15:27:26 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 31 Aug 2005 15:27:26 -0700 Subject: [openib-general] Re: ibv_get_async_event In-Reply-To: <528xyidkfi.fsf@cisco.com> References: <000e01c5adbd$8c977ed0$9e5aa8c0@infiniconsys.com> <4314F264.6010207@ichips.intel.com> <528xyidkfi.fsf@cisco.com> Message-ID: <43162ECE.3070405@ichips.intel.com> Roland Dreier wrote: >OK, I checked in changes to libibverbs and the kernel uverbs to handle >cleaning up stale events when destroying a CQ/QP/SRQ. All the changes >are in svn r3279. > >The changes require a kernel ABI bump. The new libibverbs works with >both the old kernel and new kernel, but the old libibverbs will only >work with the old kernel. So in other words, if you upgrade your >kernel, then make sure you upgrade libibverbs as well. If you upgrade >libibverbs, then you don't have to upgrade your kernel but you can if >you want. (Confused yet? Or should I write still more?) > >I did some light testing but I don't have any tests that generate lots >of async events. Sean and Arlin, if you could retest uDAPL or >whatever was choking on QP connected events, that would be great. > > The regress.sh dapltest set to 40x40 crashes the system. All I see on the console is a call trace of ib_verbs: ib_uverbs_event_poll+58. I will try and get more of the oops info. 
-arlin >Thanks, > Roland From mshefty at ichips.intel.com Wed Aug 31 15:33:17 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 31 Aug 2005 15:33:17 -0700 Subject: [openib-general] [RFC] change to ib_create_cm_id() Message-ID: <4316302D.2030401@ichips.intel.com> I'm considering changing the function: ib_create_cm_id(cm_handler, context); to ib_create_cm_id(device, cm_handler, context); This will bind all cm_id's to a specific device, including cm_id's associated with listens. This will help prevent the CM from returning a cm_id associated with a device that a consumer may have already seen as removed. This appears to be a straightforward change for most clients, but would require some work in SDP. Comments? - Sean From rolandd at cisco.com Wed Aug 31 15:40:47 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 31 Aug 2005 15:40:47 -0700 Subject: [openib-general] Re: ibv_get_async_event In-Reply-To: <43162ECE.3070405@ichips.intel.com> (Arlin Davis's message of "Wed, 31 Aug 2005 15:27:26 -0700") References: <000e01c5adbd$8c977ed0$9e5aa8c0@infiniconsys.com> <4314F264.6010207@ichips.intel.com> <528xyidkfi.fsf@cisco.com> <43162ECE.3070405@ichips.intel.com> Message-ID: <52r7c9d91s.fsf@cisco.com> Arlin> The regress.sh dapltest set to 40x40 crashes the Arlin> system. All I see on the console is a call trace of Arlin> ib_verbs: ib_uverbs_event_poll+58. I will try and get more Arlin> of the oops info. I'm not familiar with the current state of uDAPL. What do I need to set up to run regress.sh? In any case a full traceback of the panic that you get would be very helpful. Whatever you do end up getting, can you send the verbatim console output?
I'm curious about exactly what it looks like, since it's pretty unusual to only get part of a traceback. - R. From rolandd at cisco.com Wed Aug 31 15:57:00 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 31 Aug 2005 15:57:00 -0700 Subject: [openib-general] Re: List of issues in uverbs In-Reply-To: <20050831210157.38332.qmail@web33208.mail.mud.yahoo.com> (viswanath krishnamurthy's message of "Wed, 31 Aug 2005 14:01:57 -0700 (PDT)") References: <20050831210157.38332.qmail@web33208.mail.mud.yahoo.com> Message-ID: <52mzmxd8ar.fsf@cisco.com> viswanath> Here is new list of issues with uverbs Thanks for the reports. viswanath> I have attached the firmware version/svn info in the viswanath> attachment. In the future can you attach things as text/plain (or just include them in your email)? If you attach it as application/octet-stream then I have to save the attachment and open it manually, rather than just reading it as part of your email. viswanath> 2. libmthca library crashes when a server accepts lots viswanath> of new incoming sessions. See log (gdb) in the viswanath> attachment. (It accepts about 170 connections) Looks viswanath> like a memory allocation issue. I found a few bugs in libmthca relating to allocating doorbell records for memfree HCAs. I've checked in fixes. Please try the latest subversion libmthca and let me know if it helps. viswanath> 3. Kernel oops when lots of traffic between multiple viswanath> clients and server. Very consistently reproducible. viswanath> See attachment for details Can you post the application you use to reproduce this? 
Thanks, Roland From halr at voltaire.com Wed Aug 31 16:56:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Aug 2005 19:56:17 -0400 Subject: [openib-general] kernel oops In-Reply-To: <4314F0DC.6020803@ichips.intel.com> References: <430F4DBD.4070703@xsigo.com> <43138B0E.1090309@ichips.intel.com> <1125407465.4401.1246.camel@hal.voltaire.com> <43148390.7010605@ichips.intel.com> <1125445011.4401.2434.camel@hal.voltaire.com> <4314F0DC.6020803@ichips.intel.com> Message-ID: <1125532576.4401.4283.camel@hal.voltaire.com> On Tue, 2005-08-30 at 19:50, Sean Hefty wrote: > Hal Rosenstock wrote: > >>Can we just remove this field and > >>use the sgid to locate the correct device structure in the kernel, or > >>fail if it cannot be located? > > > > That seems like a good idea. > > Quickly skimming through the code I couldn't easily locate where AT maintained a > device list, or how it retrieved the device pointer. AT tracks IPoIB netdevices rather than IB devices but one can get at the IB device through the ipoib_dev_priv structure which is available through the netdevice. > > Won't AT still be needed under the new CM abstraction for IB ? I guess > > the answer is unclear. It still seems to me that it should be fixed > > until there is something else to take its place. Do you concur ? > > Had the fix been easy (for me to figure out how to make anyway) I would have > submitted a patch. Something like AT is likely to be needed, but it's not clear > how close the final version will be to what's there now. If we can at least > validate the device pointer, it may be good enough to continue using for the > time being. I think it is possible to validate the device pointer in the route rather than change the API. I'll work on a patch for this. 
-- Hal From eitan at mellanox.co.il Wed Aug 31 23:46:12 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 01 Sep 2005 09:46:12 +0300 Subject: [openib-general] Re: OpenSM 1.8.0 libvendor initial merge nits In-Reply-To: <1125515837.4401.3944.camel@hal.voltaire.com> References: <1125509574.4401.3801.camel@hal.voltaire.com> <4315EC87.4060808@mellanox.co.il> <1125515837.4401.3944.camel@hal.voltaire.com> Message-ID: <4316A3B4.8040703@mellanox.co.il> Hal Rosenstock wrote: > On Wed, 2005-08-31 at 13:44, Eitan Zahavi wrote: > >>Hal Rosenstock wrote: >> >>>Hi again Yael & Eitan, >>> >>>I've now merged the OpenSM 1.8.0 libvendor changes and found >>>the following: >>> >>>General nits: >>> >>>There are a number of violations of the coding style here as well. Also, >>>There is some unneeded whitespace added to a number of files. >> >>We should run osm_check_n_fix this will get this fixed. >>I also think we need to decide if we want to change the OpenSM coding >>style to use tabs or we keep the no-tabs rule. > > > Separate discussion. I'm not ready to take this one on yet. I understand, but once we decide to change, it will involve all files... I think we should at least make sure the current files adhere to the current methodology: no tabs, and braces on the next line.... I sent a long mail of rules. > > Hmm, I update autoconf, automake, and libtool and now get: > + automake --foreign --add-missing --copy > configure.in: installing `config/install-sh' > configure.in: installing `config/missing' > Makefile.am:27: OSMV_OPENIB does not appear in AM_CONDITIONAL > Makefile.am:31: OSMV_SIM does not appear in AM_CONDITIONAL > Makefile.am:44: OSMV_GEN1 does not appear in AM_CONDITIONAL ... > Any ideas ? I see this too on some machines (RH 7.3), but you can see these variables are declared AM_CONDITIONAL in config/osmv.m4. I will double check. Might be missing a "fi" somewhere...
> > -- Hal From eitan at mellanox.co.il Wed Aug 31 23:48:09 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 01 Sep 2005 09:48:09 +0300 Subject: [openib-general] Re: OpenSM 1.8.0 libvendor initial merge nits In-Reply-To: <001101c5ae57$2349bcf0$9e5aa8c0@infiniconsys.com> References: <001101c5ae57$2349bcf0$9e5aa8c0@infiniconsys.com> Message-ID: <4316A429.1040005@mellanox.co.il> Fab Tillier wrote: >>From: Eitan Zahavi [mailto:eitan at mellanox.co.il] >>Sent: Wednesday, August 31, 2005 10:45 AM >> >>Hal Rosenstock wrote: >> >>>Hi again Yael & Eitan, >>> >>>I've now merged the OpenSM 1.8.0 libvendor changes and found >>>the following: >>> >>>General nits: >>> >>>There are a number of violations of the coding style here as well. Also, >>>There is some unneeded whitespace added to a number of files. >> >>We should run osm_check_n_fix this will get this fixed. >>I also think we need to decide if we want to change the OpenSM coding >>style to use tabs or we keep the no-tabs rule. > > > I personally prefer tabs to spaces, as it has less potential for people's > individual tab width to mess with the code. > Actually what happens is that if we can not avoid ANY spaces the mix of tabs and spaces makes the code totally unreadable once you change tab width ... So unless we avoid any spaces we will have to use all spaces. That makes the tab width each one uses a don't care. For coders not using emacs, which supports automatic tab to space replacement (WYSIWYG), you can still replace all tabs with spaces (as you know your tab width) each time you save. That is my take. EZ > - Fab From mst at mellanox.co.il Wed Aug 31 23:59:04 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 1 Sep 2005 09:59:04 +0300 Subject: [openib-general] Re: [RFC] change to ib_create_cm_id() In-Reply-To: <4316302D.2030401@ichips.intel.com> References: <4316302D.2030401@ichips.intel.com> Message-ID: <20050901065904.GC1707@mellanox.co.il> Quoting r. 
Sean Hefty : > Subject: [RFC] change to ib_create_cm_id() > > I'm considering changing the function: > > ib_create_cm_id(cm_handler, context); > > to > > ib_create_cm_id(device, cm_handler, context); > > This will bind all cm_id's to a specific device, including cm_id's > associated with listens. This will help prevent the CM from returning a > cm_id associated with a device that a consumer may have already seen as > removed. Looking at the API, cm_ids are not currently associated with a specific device. What am I missing? > This appears to be a straightforward change for most clients, > but would require some work in SDP. So, I gather a ULP would need a list of cm_ids per connection, scanning all of them on each cm operation, and scanning and updating these lists in all listening connections on each hotplug event. I wonder whether the CM could do the same thing internally, making the list part of the cm_id object? -- MST
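The per-device bookkeeping raised in this thread can be sketched roughly as follows. This is a user-space model only, not kernel code and not the actual CM API: every name here (demo_device, demo_cm_id, demo_listener and the demo_* functions) is hypothetical and exists purely for illustration. It assumes that one logical listen fans out into one device-bound cm_id per device, and that hotplug add/remove events grow or shrink that list, whether the list lives in the ULP or inside the CM.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are the real kernel structures. */
struct demo_device {
	int id;
};

struct demo_cm_id {
	struct demo_device *device;	/* device this cm_id is bound to */
	void *context;
	struct demo_cm_id *next;
};

/* One logical listen, holding one cm_id per known device. */
struct demo_listener {
	struct demo_cm_id *ids;
	void *context;
};

/* Models a create call that takes the device up front (assumption). */
static struct demo_cm_id *demo_create_cm_id(struct demo_device *dev,
					    void *context)
{
	struct demo_cm_id *id = malloc(sizeof(*id));

	if (!id)
		return NULL;
	id->device = dev;
	id->context = context;
	id->next = NULL;
	return id;
}

/* Hotplug "add": create a cm_id bound to the new device and link it in. */
static int demo_listener_add_device(struct demo_listener *l,
				    struct demo_device *dev)
{
	struct demo_cm_id *id = demo_create_cm_id(dev, l->context);

	if (!id)
		return -1;
	id->next = l->ids;
	l->ids = id;
	return 0;
}

/* Hotplug "remove": unlink and free every cm_id bound to the device. */
static void demo_listener_remove_device(struct demo_listener *l,
					struct demo_device *dev)
{
	struct demo_cm_id **pp = &l->ids;

	while (*pp) {
		if ((*pp)->device == dev) {
			struct demo_cm_id *dead = *pp;

			*pp = dead->next;
			free(dead);
		} else {
			pp = &(*pp)->next;
		}
	}
}

/* Number of device-bound cm_ids currently behind this listen. */
static int demo_listener_count(const struct demo_listener *l)
{
	const struct demo_cm_id *id;
	int n = 0;

	for (id = l->ids; id; id = id->next)
		n++;
	return n;
}
```

With this shape, once a device has been removed no cm_id bound to it remains on the list, so it cannot be handed back to the consumer. The cost is the list scan on every hotplug event, which is the overhead the question above is about. A real implementation would also need locking around the list, which this sketch leaves out.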