From eitan at mellanox.co.il Fri Apr 1 01:29:12 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 1 Apr 2005 12:29:12 +0300
Subject: [openib-general] [RMPP] RMPP formatting assumptions
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF062@mtlex01.yok.mtl.com>

Seems ok to me.

> -----Original Message-----
> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Friday, April 01, 2005 2:16 AM
> To: openib-general
> Subject: Re: [openib-general] [RMPP] RMPP formatting assumptions
>
> So far, here are my assumptions regarding the formatting of the RMPP MADs.
>
> The following fields in the RMPP header are set by the user:
> Version, Type = DATA, RTime, Flags = ACTIVE, and Status = 0
>
> The RMPP code will set the SegNum and update the Flags, but uses the
> ACTIVE bit to determine if the user requires RMPP for a given transfer.
> I could easily have the RMPP code set some of these fields, but
> thought that the caller might be able to initialize them more efficiently.
>
> The WR length of a transfer should equal the size of the MAD header,
> the RMPP header, class specific header for SA or vendor, plus a data
> buffer that is evenly divisible by the size of the class' Data field.
> This requirement is needed to prevent the RMPP code from allocating and
> copying data segments.
>
> The payload field in the RMPP header should be set to the size of the
> class specific header plus the number of valid bytes of user data in
> the data buffer. The RMPP code will adjust the payload value to
> account for multiple headers.
>
> Comments?
>
> - Sean
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From halr at voltaire.com Fri Apr 1 04:08:29 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Apr 2005 07:08:29 -0500
Subject: [openib-general] [RMPP] RMPP formatting assumptions
In-Reply-To: <424C92CE.7040709@ichips.intel.com>
References: <42488FDF.2050608@ichips.intel.com> <424C92CE.7040709@ichips.intel.com>
Message-ID: <1112357309.4490.63.camel@localhost.localdomain>

On Thu, 2005-03-31 at 19:16, Sean Hefty wrote:
> So far, here are my assumptions regarding the formatting of the RMPP MADs.
>
> The following fields in the RMPP header are set by the user:
> Version, Type = DATA, RTime, Flags = ACTIVE, and Status = 0

Should RMPP set the status rather than the user, or is this an
efficiency thing?

> The RMPP code will set the SegNum and update the Flags, but uses the
> ACTIVE bit to determine if the user requires RMPP for a given transfer.
> I could easily have the RMPP code set some of these fields, but
> thought that the caller might be able to initialize them more efficiently.
>
> The WR length of a transfer should equal the size of the MAD header,
> the RMPP header, class specific header for SA or vendor, plus a data
> buffer that is evenly divisible by the size of the class' Data field.
> This requirement is needed to prevent the RMPP code from allocating and
> copying data segments.
>
> The payload field in the RMPP header should be set to the size of the
> class specific header plus the number of valid bytes of user data in
> the data buffer. The RMPP code will adjust the payload value to
> account for multiple headers.

So it sounds like the streaming mode of RMPP is not handled on transmit.
It is optional on that side. What about receive? Can it/will it handle
streaming mode?

-- Hal

From roland at topspin.com Fri Apr 1 08:27:19 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 01 Apr 2005 08:27:19 -0800
Subject: [openib-general] Re: problem with SDP/AIO on mem-free HCA
In-Reply-To: <001301c5367f$e86a8a50$1802a8c0@infiniconsys.com> (Fab Tillier's message of "Thu, 31 Mar 2005 21:59:02 -0800")
References: <001301c5367f$e86a8a50$1802a8c0@infiniconsys.com>
Message-ID: <528y42laxk.fsf@topspin.com>

    Fab> If you are blessed with a Tavor PRM, see section 8.2.1.6 (in
    Fab> PRM 1.0.0). It states that a length of zero in a data
    Fab> segment indicates a 2GB transfer (MSb is used as a flag to
    Fab> indicate normal vs. inline data segments). A zero-byte
    Fab> request must not reference any data segments.

Yup, that must be the problem. I guess mthca can skip over 0-length
data segments. Another option would be to say that such work requests
aren't allowed. Not sure which way I think we should go. I need to
talk to Libor and find out why SDP is generating such requests.

 - R.

From libor at topspin.com Fri Apr 1 09:03:31 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 1 Apr 2005 09:03:31 -0800
Subject: [openib-general] Re: problem with SDP/AIO on mem-free HCA
In-Reply-To: <52is37kuz4.fsf@topspin.com>; from roland@topspin.com on Thu, Mar 31, 2005 at 07:59:43PM -0800
References: <52acojmt34.fsf@topspin.com> <20050331231023.GC6807@mellanox.co.il> <52is37kuz4.fsf@topspin.com>
Message-ID: <20050401090331.A2870@topspin.com>

On Thu, Mar 31, 2005 at 07:59:43PM -0800, Roland Dreier wrote:
>
> SDP is generating the 0-length RDMA by posting an RDMA READ with a
> single scatter entry whose length is zero, which may behave
> differently from posting an RDMA READ with no scatter entries. I need
> to check this out, and also test on Tavor.

  I'll look into why SDP is generating a 0-length RDMA read; this
should not be happening.

-Libor From ftillier at infiniconsys.com Fri Apr 1 09:34:43 2005 From: ftillier at infiniconsys.com (Fab Tillier) Date: Fri, 1 Apr 2005 09:34:43 -0800 Subject: [openib-general] Re: problem with SDP/AIO on mem-free HCA In-Reply-To: <528y42laxk.fsf@topspin.com> Message-ID: <001401c536e1$18303080$1802a8c0@infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Friday, April 01, 2005 8:27 AM > > Fab> If you are blessed with a Tavor PRM, see section 8.2.1.6 (in > Fab> PRM 1.0.0). It states that a length of zero in a data > Fab> segment indicates a 2GB transfer (MSb is used as a flag to > Fab> indicate normal vs. inline data segments). A zero-byte > Fab> request must not reference any data segments. > > Yup, that must be the problem. I guess mthca can skip over 0-length > data segments. Another option would be to say that such work requests > aren't allowed. Not sure which way I think we should go. I need to > talk to Libor and find out why SDP is generating such requests. > If the overhead of checking for zero-length is negligible, I would recommend trapping this in mthca. My reasoning is that unless the IB spec states that 0-length operations can't have data segments, this is a HW specific limitation and should be handled within the driver. - Fab From mshefty at ichips.intel.com Fri Apr 1 09:35:39 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 01 Apr 2005 09:35:39 -0800 Subject: [openib-general] [RMPP] RMPP formatting assumptions In-Reply-To: <1112357309.4490.63.camel@localhost.localdomain> References: <42488FDF.2050608@ichips.intel.com> <424C92CE.7040709@ichips.intel.com> <1112357309.4490.63.camel@localhost.localdomain> Message-ID: <424D866B.50600@ichips.intel.com> Hal Rosenstock wrote: > On Thu, 2005-03-31 at 19:16, Sean Hefty wrote: > >>So far, here are my assumptions regarding the formatting of the RMPP MADs. 
>> >>The following fields in the RMPP header are set by the user: >>Version, Type = DATA, RTime, Flags = ACTIVE, and Status = 0 > > Should RMPP set the status rather than the user or is this an efficiency > thing ? I was trying to limit the fields that the RMPP header would need to touch. The RMPP layer would change the status if an error occurred. > So it sounds like the streaming mode of RMPPis not handled on transmit. > It is optional on that side. I'm assuming that you're referring to the case where the payload length is set to 0. This is not handled. I'm not even sure how you could handle such a transfer without changes to the MAD API and having the client be aware of the RMPP implementation. > What about receive ? Can it/will it handle streaming mode ? The receive side uses the LAST bit to check for the end of the data transfer, so should work if payload length is 0. - Sean From roland at topspin.com Fri Apr 1 09:45:33 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 01 Apr 2005 09:45:33 -0800 Subject: [openib-general] [PATCH][4/3] IPoIB: document conversion to debugfs In-Reply-To: <20053311936.XaQmN4N9new7dTCP@topspin.com> (Roland Dreier's message of "Thu, 31 Mar 2005 19:36:12 -0800") References: <20053311936.XaQmN4N9new7dTCP@topspin.com> Message-ID: <52r7hujsqq.fsf@topspin.com> Update IPoIB documentation now that multicast debugging files have moved from ipoibdebugfs to debugfs. Signed-off-by: Roland Dreier --- linux-export.orig/Documentation/infiniband/ipoib.txt 2005-03-31 19:07:01.000000000 -0800 +++ linux-export/Documentation/infiniband/ipoib.txt 2005-04-01 09:43:27.122520190 -0800 @@ -32,14 +32,13 @@ mcast_debug_level to 1. These parameters can be controlled at runtime through files in /sys/module/ib_ipoib/. - CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs" + CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs virtual filesystem. 
By mounting this filesystem, for example with - mkdir -p /ipoib_debugfs - mount -t ipoib_debugfs none /ipoib_debufs + mount -t debugfs none /sys/kernel/debug - it is possible to get statistics about multicast groups from the - files /ipoib_debugfs/ib0_mcg and so on. + it is possible to get statistics about munlticast groups from the + files /sys/kernel/debug/ipoib/ib0_mcg and so on. The performance impact of this option is negligible, so it is safe to enable this option with debug_level set to 0 for normal From roland at topspin.com Fri Apr 1 10:23:50 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 10:23:50 -0800 Subject: [openib-general] [PATCH][2/6] IB: remove unneeded includes In-Reply-To: <2005411023.BIKgS4OLfFzZN9qI@topspin.com> Message-ID: <2005411023.AERMWYHGiX8V5KDM@topspin.com> From: Hal Rosenstock Eliminate no longer needed include files Signed-off-by: Hal Rosenstock Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/mad.c 2005-04-01 10:08:54.939957801 -0800 +++ linux-export/drivers/infiniband/core/mad.c 2005-04-01 10:08:56.473624910 -0800 @@ -33,9 +33,6 @@ */ #include -#include - -#include #include "mad_priv.h" #include "smi.h" From roland at topspin.com Fri Apr 1 10:23:51 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 10:23:51 -0800 Subject: [openib-general] [PATCH][3/6] IB: Fix FMR pool crash In-Reply-To: <2005411023.AERMWYHGiX8V5KDM@topspin.com> Message-ID: <2005411023.09JoUTQ2SAMPiKPQ@topspin.com> Mask bits correctly from jhash result in ib_fmr_hash() so that the computed bucket index is within our hash table. This fixes an SDP crash. 
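
For readers following the FMR pool fix, the masking step can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: EXAMPLE_HASH_SIZE and example_mix() are hypothetical stand-ins for IB_FMR_HASH_SIZE and jhash_2words(), and the table size is assumed to be a power of two, which the (size - 1) mask in the patch implies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table size for illustration; must be a power of two. */
#define EXAMPLE_HASH_SIZE 128u

/* Stand-in mixer; this is NOT the kernel's jhash_2words() algorithm. */
static uint32_t example_mix(uint32_t a, uint32_t b)
{
	uint32_t h = a * 2654435761u;

	h ^= b + 0x9e3779b9u + (h << 6) + (h >> 2);
	return h;
}

/* Reduce a 64-bit page address to a bucket index inside the table. */
static uint32_t example_bucket(uint64_t first_page)
{
	/* The (size - 1) mask is only valid for power-of-two sizes. */
	assert((EXAMPLE_HASH_SIZE & (EXAMPLE_HASH_SIZE - 1)) == 0);

	return example_mix((uint32_t) first_page,
			   (uint32_t) (first_page >> 32)) &
		(EXAMPLE_HASH_SIZE - 1);
}
```

Without the final mask, the full 32-bit hash would be used as an array index, which is the kind of out-of-range bucket access the description above refers to.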
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/fmr_pool.c 2005-03-31 19:07:05.000000000 -0800 +++ linux-export/drivers/infiniband/core/fmr_pool.c 2005-04-01 10:08:58.240241456 -0800 @@ -103,9 +103,8 @@ static inline u32 ib_fmr_hash(u64 first_page) { - return jhash_2words((u32) first_page, - (u32) (first_page >> 32), - 0); + return jhash_2words((u32) first_page, (u32) (first_page >> 32), 0) & + (IB_FMR_HASH_SIZE - 1); } /* Caller must hold pool_lock */ From roland at topspin.com Fri Apr 1 10:23:50 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 10:23:50 -0800 Subject: [openib-general] [PATCH][1/6] IB: Keep MAD work completion valid Message-ID: <2005411023.BIKgS4OLfFzZN9qI@topspin.com> From: Sean Hefty Replace the *wc field in ib_mad_recv_wc from pointing to a structure on the stack to one allocated with the received MAD buffer. This allows a client to access the *wc field after their receive completion handler has returned. Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/mad.c 2005-03-31 19:07:01.000000000 -0800 +++ linux-export/drivers/infiniband/core/mad.c 2005-04-01 10:08:54.939957801 -0800 @@ -1600,7 +1600,8 @@ DMA_FROM_DEVICE); /* Setup MAD receive work completion from "normal" work completion */ - recv->header.recv_wc.wc = wc; + recv->header.wc = *wc; + recv->header.recv_wc.wc = &recv->header.wc; recv->header.recv_wc.mad_len = sizeof(struct ib_mad); recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; recv->header.recv_wc.recv_buf.grh = &recv->grh; --- linux-export.orig/drivers/infiniband/core/mad_priv.h 2005-03-31 19:07:14.000000000 -0800 +++ linux-export/drivers/infiniband/core/mad_priv.h 2005-04-01 10:08:54.961953027 -0800 @@ -69,6 +69,7 @@ struct ib_mad_private_header { struct ib_mad_list_head mad_list; struct ib_mad_recv_wc recv_wc; + struct ib_wc wc; DECLARE_PCI_UNMAP_ADDR(mapping) } __attribute__ ((packed)); From roland at topspin.com Fri Apr 1 
10:23:51 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 10:23:51 -0800 Subject: [openib-general] [PATCH][4/6] IB: Trivial FMR printk cleanup In-Reply-To: <2005411023.09JoUTQ2SAMPiKPQ@topspin.com> Message-ID: <2005411023.5oEZz0iawuKxVyay@topspin.com> From: Libor Michalek Add missing newline in printk. Signed-off-by: Libor Michalek Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/fmr_pool.c 2005-04-01 10:08:58.240241456 -0800 +++ linux-export/drivers/infiniband/core/fmr_pool.c 2005-04-01 10:08:59.539959345 -0800 @@ -442,7 +442,7 @@ list_add(&fmr->list, &pool->free_list); spin_unlock_irqrestore(&pool->pool_lock, flags); - printk(KERN_WARNING "fmr_map returns %d", + printk(KERN_WARNING "fmr_map returns %d\n", result); return ERR_PTR(result); From roland at topspin.com Fri Apr 1 10:23:51 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 10:23:51 -0800 Subject: [openib-general] [PATCH][5/6] IB: Fix user MAD registrations with class 0 In-Reply-To: <2005411023.5oEZz0iawuKxVyay@topspin.com> Message-ID: <2005411023.Wt2K1CXaZGIHp9sH@topspin.com> Fix handling of MAD agent registrations with mgmt_class == 0. In this case ib_umad should pass a NULL registration request to the MAD core rather than a request with mgmt_class set to 0. 
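
The intended behaviour can be sketched outside the kernel roughly as follows. The struct and function names here are hypothetical and heavily cut down; they are not the ib_umad or MAD core definitions, and only illustrate the "class 0 means no registration request" rule described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for a MAD registration request. */
struct example_reg_req {
	uint8_t mgmt_class;
	uint8_t mgmt_class_version;
};

/*
 * Return the request to hand to the core, or NULL when the caller
 * asked for management class 0. The request is only filled in when
 * it is actually going to be used.
 */
static const struct example_reg_req *
example_pick_req(const struct example_reg_req *ureq,
		 struct example_reg_req *req)
{
	assert(ureq != NULL && req != NULL);

	if (!ureq->mgmt_class)
		return NULL;

	req->mgmt_class         = ureq->mgmt_class;
	req->mgmt_class_version = ureq->mgmt_class_version;
	return req;
}
```

The design point is that "no class" is expressed by the absence of a request rather than by a request whose class field happens to be zero.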
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/user_mad.c 2005-03-31 19:06:42.000000000 -0800 +++ linux-export/drivers/infiniband/core/user_mad.c 2005-04-01 10:09:01.250588043 -0800 @@ -389,15 +389,17 @@ goto out; found: - req.mgmt_class = ureq.mgmt_class; - req.mgmt_class_version = ureq.mgmt_class_version; - memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); - memcpy(req.oui, ureq.oui, sizeof req.oui); + if (ureq.mgmt_class) { + req.mgmt_class = ureq.mgmt_class; + req.mgmt_class_version = ureq.mgmt_class_version; + memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + memcpy(req.oui, ureq.oui, sizeof req.oui); + } agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, ureq.qpn ? IB_QPT_GSI : IB_QPT_SMI, - &req, 0, send_handler, recv_handler, - file); + ureq.mgmt_class ? &req : NULL, + 0, send_handler, recv_handler, file); if (IS_ERR(agent)) { ret = PTR_ERR(agent); goto out; From roland at topspin.com Fri Apr 1 10:23:51 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 10:23:51 -0800 Subject: [openib-general] [PATCH][6/6] IB: Remove incorrect comments In-Reply-To: <2005411023.Wt2K1CXaZGIHp9sH@topspin.com> Message-ID: <2005411023.sEUedyez566a4lDQ@topspin.com> From: Hal Rosenstock Eliminate unneeded and misleading comments Signed-off-by: Hal Rosenstock Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/agent.c 2005-03-31 19:06:48.000000000 -0800 +++ linux-export/drivers/infiniband/core/agent.c 2005-04-01 10:09:02.621290525 -0800 @@ -129,7 +129,6 @@ goto out; agent_send_wr->mad = mad_priv; - /* PCI mapping */ gather_list.addr = dma_map_single(mad_agent->device->dma_device, &mad_priv->mad, sizeof(mad_priv->mad), @@ -261,7 +260,6 @@ list_del(&agent_send_wr->send_list); spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - /* Unmap PCI */ dma_unmap_single(mad_agent->device->dma_device, pci_unmap_addr(agent_send_wr, mapping), 
sizeof(agent_send_wr->mad->mad), --- linux-export.orig/drivers/infiniband/core/mad.c 2005-04-01 10:08:56.473624910 -0800 +++ linux-export/drivers/infiniband/core/mad.c 2005-04-01 10:09:02.768258624 -0800 @@ -2283,7 +2283,6 @@ /* Remove from posted receive MAD list */ list_del(&mad_list->list); - /* Undo PCI mapping */ dma_unmap_single(qp_info->port_priv->device->dma_device, pci_unmap_addr(&recv->header, mapping), sizeof(struct ib_mad_private) - From tduffy at sun.com Fri Apr 1 10:24:13 2005 From: tduffy at sun.com (Tom Duffy) Date: Fri, 01 Apr 2005 10:24:13 -0800 Subject: [openib-general] [PATCH][MTHCA] add in SINAI defines into mtcha code WAS: [openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches In-Reply-To: <20050331204331.4320C2283D9@openib.ca.sandia.gov> References: <20050331204331.4320C2283D9@openib.ca.sandia.gov> Message-ID: <1112379853.18939.11.camel@duffman> On Thu, 2005-03-31 at 12:43 -0800, roland at openib.org wrote: > Author: roland > Date: 2005-03-31 12:43:29 -0800 (Thu, 31 Mar 2005) > New Revision: 2101 > > Added: > gen2/trunk/src/linux-kernel/patches/linux-2.6.11-sinai.diff > Log: > Add patch adding Sinai device IDs for 2.6.11 kernel. 
Roland, please consider applying this for svn ease of use: Signed-off-by: Tom Duffy Index: drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- drivers/infiniband/hw/mthca/mthca_dev.h (revision 2102) +++ drivers/infiniband/hw/mthca/mthca_dev.h (working copy) @@ -49,6 +49,14 @@ #define DRV_VERSION "0.06-pre" #define DRV_RELDATE "November 8, 2004" +/* XXX remove once SINAI defines make it into kernel.org */ +#ifndef PCI_DEVICE_ID_MELLANOX_SINAI_OLD +#define PCI_DEVICE_ID_MELLANOX_SINAI_OLD 0x5e8c +#endif +#ifndef PCI_DEVICE_ID_MELLANOX_SINAI +#define PCI_DEVICE_ID_MELLANOX_SINAI 0x6274 +#endif + enum { MTHCA_FLAG_DDR_HIDDEN = 1 << 1, MTHCA_FLAG_SRQ = 1 << 2, From peter at pantasys.com Fri Apr 1 11:37:18 2005 From: peter at pantasys.com (Peter Buckingham) Date: Fri, 01 Apr 2005 11:37:18 -0800 Subject: [openib-general] uverbs and OSU MPI/MPI in general? Message-ID: <424DA2EE.7050802@pantasys.com> Hi All, How does gen2's uverbs compare to VAPI? Is it meant to be the same API? Should OSU's MPI run on top of this or is there some other MPI implementation that will be able to run 'natively' over IB? thanks, peter From roland at topspin.com Fri Apr 1 11:39:59 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 01 Apr 2005 11:39:59 -0800 Subject: [openib-general] uverbs and OSU MPI/MPI in general? In-Reply-To: <424DA2EE.7050802@pantasys.com> (Peter Buckingham's message of "Fri, 01 Apr 2005 11:37:18 -0800") References: <424DA2EE.7050802@pantasys.com> Message-ID: <523buajng0.fsf@topspin.com> Peter> Hi All, How does gen2's uverbs compare to VAPI? Is it meant Peter> to be the same API? Should OSU's MPI run on top of this or Peter> is there some other MPI implementation that will be able to Peter> run 'natively' over IB? The basic functionality is the same but the API is different. For example completion events are handled in a different way that allows better performance. 
None of the current MPI implementations that use IB will run unmodified, but everyone (including OSU) is porting to the new API. - R. From panda at cse.ohio-state.edu Fri Apr 1 12:13:52 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Fri, 1 Apr 2005 15:13:52 -0500 (EST) Subject: [openib-general] uverbs and OSU MPI/MPI in general? In-Reply-To: <523buajng0.fsf@topspin.com> from "Roland Dreier" at Apr 01, 2005 11:39:59 AM Message-ID: <200504012013.j31KDqus007452@xi.cse.ohio-state.edu> Peter, > Peter> Hi All, How does gen2's uverbs compare to VAPI? Is it meant > Peter> to be the same API? Should OSU's MPI run on top of this or > Peter> is there some other MPI implementation that will be able to > Peter> run 'natively' over IB? > > The basic functionality is the same but the API is different. For > example completion events are handled in a different way that allows > better performance. > > None of the current MPI implementations that use IB will run > unmodified, but everyone (including OSU) is porting to the new API. We have already started working on porting OSU MPI to the Gen2 stack. We plan to release MVAPICH 0.9.5 (on VAPI stack) during the next 1-2 weeks. After that we will make a subsequent release of 0.9.5 on the OpenIB Gen2 stack. Hope this helps. Thanks, DK > - R. 

From iod00d at hp.com Fri Apr 1 10:43:46 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 1 Apr 2005 10:43:46 -0800
Subject: [openib-general] [PATCH][MTHCA] add in SINAI defines into mtcha code WAS: [openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches
In-Reply-To: <1112379853.18939.11.camel@duffman>
References: <20050331204331.4320C2283D9@openib.ca.sandia.gov> <1112379853.18939.11.camel@duffman>
Message-ID: <20050401184346.GD11094@esmail.cup.hp.com>

On Fri, Apr 01, 2005 at 10:24:13AM -0800, Tom Duffy wrote:
> On Thu, 2005-03-31 at 12:43 -0800, roland at openib.org wrote:
> > Author: roland
> > Date: 2005-03-31 12:43:29 -0800 (Thu, 31 Mar 2005)
> > New Revision: 2101
> >
> > Added:
> >    gen2/trunk/src/linux-kernel/patches/linux-2.6.11-sinai.diff
> > Log:
> > Add patch adding Sinai device IDs for 2.6.11 kernel.
>
> Roland, please consider applying this for svn ease of use:

No - I think Roland is doing the right thing with a separate patch.
I ran into the same issue since I'm still poking at 2.6.11.

By keeping "backport" patches separate, distros will have an easier
time figuring out which backport cruft they will need for their
release. And Roland won't have to remember to clean it out later, and
it won't cause pain for distros when they grab a new version of openib
but are still shipping the same base release.

grant

From roland at topspin.com Fri Apr 1 12:49:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:52 -0800
Subject: [openib-general] [PATCH][2/27] IB/mthca: fill in more device query fields
In-Reply-To: <2005411249.NCfupdZrkMmfcKnV@topspin.com>
Message-ID: <2005411249.WCbW5NdE7NBIkIcr@topspin.com>

Implement more of the device_query method in mthca.

Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-03-31 19:07:00.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-04-01 12:38:20.843436141 -0800 @@ -987,6 +987,8 @@ if (dev->hca_type == ARBEL_NATIVE) { MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSZ_SRQ_OFFSET); dev_lim->hca.arbel.resize_srq = field & 1; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET); + dev_lim->max_sg = min_t(int, field, dev_lim->max_sg); MTHCA_GET(size, outbox, QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET); dev_lim->mtt_seg_sz = size; MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-03-31 19:07:00.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-01 12:38:20.839437009 -0800 @@ -52,6 +52,8 @@ if (!in_mad || !out_mad) goto out; + memset(props, 0, sizeof props); + props->fw_ver = mdev->fw_ver; memset(in_mad, 0, sizeof *in_mad); @@ -71,14 +73,26 @@ goto out; } - props->device_cap_flags = mdev->device_cap_flags; - props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 36)) & + props->device_cap_flags = mdev->device_cap_flags; + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 36)) & 0xffffff; - props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 30)); - props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 32)); + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 30)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 32)); memcpy(&props->sys_image_guid, out_mad->data + 4, 8); memcpy(&props->node_guid, out_mad->data + 12, 8); + props->max_mr_size = ~0ull; + props->max_qp = mdev->limits.num_qps - mdev->limits.reserved_qps; + props->max_qp_wr = 0xffff; + props->max_sge = mdev->limits.max_sg; + props->max_cq = mdev->limits.num_cqs - mdev->limits.reserved_cqs; + props->max_cqe = 0xffff; + props->max_mr = mdev->limits.num_mpts - mdev->limits.reserved_mrws; + props->max_pd = 
mdev->limits.num_pds - mdev->limits.reserved_pds; + props->max_qp_rd_atom = 1 << mdev->qp_table.rdb_shift; + props->max_qp_init_rd_atom = 1 << mdev->qp_table.rdb_shift; + props->local_ca_ack_delay = mdev->limits.local_ca_ack_delay; + err = 0; out: kfree(in_mad); From roland at topspin.com Fri Apr 1 12:49:52 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:52 -0800 Subject: [openib-general] [PATCH][1/27] IB/mthca: map MPT/MTT context in mem-free mode Message-ID: <2005411249.NCfupdZrkMmfcKnV@topspin.com> In mem-free mode, when allocating memory regions, make sure that the HCA has context memory mapped to cover the virtual space used for the MPT and MTTs being used. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-03-31 19:06:51.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-01 12:38:19.884644268 -0800 @@ -390,7 +390,7 @@ } mdev->mr_table.mtt_table = mthca_alloc_icm_table(mdev, init_hca->mtt_base, - init_hca->mtt_seg_sz, + dev_lim->mtt_seg_sz, mdev->limits.num_mtt_segs, mdev->limits.reserved_mtts, 1); if (!mdev->mr_table.mtt_table) { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-03-31 19:06:42.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-04-01 12:38:19.911638409 -0800 @@ -192,6 +192,38 @@ up(&table->mutex); } +int mthca_table_get_range(struct mthca_dev *dev, struct mthca_icm_table *table, + int start, int end) +{ + int inc = MTHCA_TABLE_CHUNK_SIZE / table->obj_size; + int i, err; + + for (i = start; i <= end; i += inc) { + err = mthca_table_get(dev, table, i); + if (err) + goto fail; + } + + return 0; + +fail: + while (i > start) { + i -= inc; + mthca_table_put(dev, table, i); + } + + return err; +} + +void mthca_table_put_range(struct mthca_dev *dev, struct mthca_icm_table *table, + int start, int end) +{ + int i; + + for (i = start; i <= end; i += MTHCA_TABLE_CHUNK_SIZE / table->obj_size) + 
mthca_table_put(dev, table, i); +} + struct mthca_icm_table *mthca_alloc_icm_table(struct mthca_dev *dev, u64 virt, int obj_size, int nobj, int reserved, --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-03-31 19:06:56.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-04-01 12:38:19.895641881 -0800 @@ -85,6 +85,10 @@ void mthca_free_icm_table(struct mthca_dev *dev, struct mthca_icm_table *table); int mthca_table_get(struct mthca_dev *dev, struct mthca_icm_table *table, int obj); void mthca_table_put(struct mthca_dev *dev, struct mthca_icm_table *table, int obj); +int mthca_table_get_range(struct mthca_dev *dev, struct mthca_icm_table *table, + int start, int end); +void mthca_table_put_range(struct mthca_dev *dev, struct mthca_icm_table *table, + int start, int end); static inline void mthca_icm_first(struct mthca_icm *icm, struct mthca_icm_iter *iter) --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c 2005-03-31 19:07:06.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c 2005-04-01 12:38:19.903640145 -0800 @@ -38,6 +38,7 @@ #include "mthca_dev.h" #include "mthca_cmd.h" +#include "mthca_memfree.h" /* * Must be packed because mtt_seg is 64 bits but only aligned to 32 bits. 
@@ -71,7 +72,7 @@ * through the bitmaps) */ -static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +static u32 __mthca_alloc_mtt(struct mthca_dev *dev, int order) { int o; int m; @@ -105,7 +106,7 @@ return seg; } -static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +static void __mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) { seg >>= order; @@ -122,6 +123,32 @@ spin_unlock(&dev->mr_table.mpt_alloc.lock); } +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + u32 seg = __mthca_alloc_mtt(dev, order); + + if (seg == -1) + return -1; + + if (dev->hca_type == ARBEL_NATIVE) + if (mthca_table_get_range(dev, dev->mr_table.mtt_table, seg, + seg + (1 << order) - 1)) { + __mthca_free_mtt(dev, seg, order); + seg = -1; + } + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + __mthca_free_mtt(dev, seg, order); + + if (dev->hca_type == ARBEL_NATIVE) + mthca_table_put_range(dev, dev->mr_table.mtt_table, seg, + seg + (1 << order) - 1); +} + static inline u32 hw_index_to_key(struct mthca_dev *dev, u32 ind) { if (dev->hca_type == ARBEL_NATIVE) @@ -141,7 +168,7 @@ int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, u32 access, struct mthca_mr *mr) { - void *mailbox; + void *mailbox = NULL; struct mthca_mpt_entry *mpt_entry; u32 key; int err; @@ -155,11 +182,17 @@ return -ENOMEM; mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); + if (dev->hca_type == ARBEL_NATIVE) { + err = mthca_table_get(dev, dev->mr_table.mpt_table, key); + if (err) + goto err_out_mpt_free; + } + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); if (!mailbox) { - mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); - return -ENOMEM; + err = -ENOMEM; + goto err_out_table; } mpt_entry = MAILBOX_ALIGN(mailbox); @@ -180,16 +213,27 @@ err = mthca_SW2HW_MPT(dev, mpt_entry, key & (dev->limits.num_mpts - 1), &status); - if (err) + if (err) { mthca_warn(dev, "SW2HW_MPT failed (%d)\n", 
err); - else if (status) { + goto err_out_table; + } else if (status) { mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", status); err = -EINVAL; + goto err_out_table; } kfree(mailbox); return err; + +err_out_table: + if (dev->hca_type == ARBEL_NATIVE) + mthca_table_put(dev, dev->mr_table.mpt_table, key); + +err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + kfree(mailbox); + return err; } int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, @@ -213,6 +257,12 @@ return -ENOMEM; mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); + if (dev->hca_type == ARBEL_NATIVE) { + err = mthca_table_get(dev, dev->mr_table.mpt_table, key); + if (err) + goto err_out_mpt_free; + } + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; i < list_len; i <<= 1, ++mr->order) @@ -220,7 +270,7 @@ mr->first_seg = mthca_alloc_mtt(dev, mr->order); if (mr->first_seg == -1) - goto err_out_mpt_free; + goto err_out_table; /* * If list_len is odd, we add one more dummy entry for @@ -307,13 +357,17 @@ kfree(mailbox); return err; - err_out_mailbox_free: +err_out_mailbox_free: kfree(mailbox); - err_out_free_mtt: +err_out_free_mtt: mthca_free_mtt(dev, mr->first_seg, mr->order); - err_out_mpt_free: +err_out_table: + if (dev->hca_type == ARBEL_NATIVE) + mthca_table_put(dev, dev->mr_table.mpt_table, key); + +err_out_mpt_free: mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); return err; } @@ -338,6 +392,9 @@ if (mr->order >= 0) mthca_free_mtt(dev, mr->first_seg, mr->order); + if (dev->hca_type == ARBEL_NATIVE) + mthca_table_put(dev, dev->mr_table.mpt_table, + key_to_hw_index(dev, mr->ibmr.lkey)); mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, mr->ibmr.lkey)); } From roland at topspin.com Fri Apr 1 12:49:52 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:52 -0800 Subject: [openib-general] [PATCH][3/27] IB/mthca: fix calculation of RDB shift In-Reply-To: <2005411249.WCbW5NdE7NBIkIcr@topspin.com> Message-ID: 
<2005411249.ETBNcLeftemLukfd@topspin.com> Fix calculation of rdb_shift by using original number of QPs, not their slot in profile[] (which will be rearranged when we sort it). Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c 2005-03-31 19:07:14.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c 2005-04-01 12:38:21.237350633 -0800 @@ -208,8 +208,7 @@ break; case MTHCA_RES_RDB: for (dev->qp_table.rdb_shift = 0; - profile[MTHCA_RES_QP].num << dev->qp_table.rdb_shift < - profile[i].num; + request->num_qp << dev->qp_table.rdb_shift < profile[i].num; ++dev->qp_table.rdb_shift) ; /* nothing */ dev->qp_table.rdb_base = (u32) profile[i].start; From roland at topspin.com Fri Apr 1 12:49:52 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:52 -0800 Subject: [openib-general] [PATCH][4/27] IB/mthca: fix posting sends with immediate data In-Reply-To: <2005411249.ETBNcLeftemLukfd@topspin.com> Message-ID: <2005411249.dKg4ijljsqXo1Rt6@topspin.com> When posting a work request with immediate data, put the immediate data in the immediate data field of the hardware's work request (rather than overwriting the flags field). 
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-03-31 19:06:41.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-04-01 12:38:21.580276194 -0800 @@ -1465,7 +1465,7 @@ cpu_to_be32(1); if (wr->opcode == IB_WR_SEND_WITH_IMM || wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) - ((struct mthca_next_seg *) wqe)->flags = wr->imm_data; + ((struct mthca_next_seg *) wqe)->imm = wr->imm_data; wqe += sizeof (struct mthca_next_seg); size = sizeof (struct mthca_next_seg) / 16; @@ -1769,7 +1769,7 @@ cpu_to_be32(1); if (wr->opcode == IB_WR_SEND_WITH_IMM || wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) - ((struct mthca_next_seg *) wqe)->flags = wr->imm_data; + ((struct mthca_next_seg *) wqe)->imm = wr->imm_data; wqe += sizeof (struct mthca_next_seg); size = sizeof (struct mthca_next_seg) / 16; From roland at topspin.com Fri Apr 1 12:49:52 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:52 -0800 Subject: [openib-general] [PATCH][6/27] IB/mthca: allocate correct number of doorbell pages In-Reply-To: <2005411249.cEJmE9mY2eziJTR6@topspin.com> Message-ID: <2005411249.VaroeECWUvqcGQCD@topspin.com> Doorbell record pages are allocated in HCA page size chunks (always 4096 bytes), so we need to divide by 4096 and not PAGE_SIZE when figuring out how many pages we'll need space for. 
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-04-01 12:38:19.911638409 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-04-01 12:38:22.274125578 -0800 @@ -446,7 +446,7 @@ init_MUTEX(&dev->db_tab->mutex); - dev->db_tab->npages = dev->uar_table.uarc_size / PAGE_SIZE; + dev->db_tab->npages = dev->uar_table.uarc_size / 4096; dev->db_tab->max_group1 = 0; dev->db_tab->min_group2 = dev->db_tab->npages - 1; From roland at topspin.com Fri Apr 1 12:49:52 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:52 -0800 Subject: [openib-general] [PATCH][8/27] IB/mthca: fix MR allocation error path In-Reply-To: <2005411249.i5VdQJiPqpmwTj3T@topspin.com> Message-ID: <2005411249.mKyALgAB0GbtFnjH@topspin.com> From: Michael S. Tsirkin Fix error handling in MR allocation for mem-free mode: mthca_free must get an MR index, not a key. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c 2005-04-01 12:38:19.903640145 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c 2005-04-01 12:38:22.968974746 -0800 @@ -231,7 +231,7 @@ mthca_table_put(dev, dev->mr_table.mpt_table, key); err_out_mpt_free: - mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + mthca_free(&dev->mr_table.mpt_alloc, key); kfree(mailbox); return err; } @@ -368,7 +368,7 @@ mthca_table_put(dev, dev->mr_table.mpt_table, key); err_out_mpt_free: - mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + mthca_free(&dev->mr_table.mpt_alloc, key); return err; } From roland at topspin.com Fri Apr 1 12:49:52 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:52 -0800 Subject: [openib-general] [PATCH][5/27] IB/mthca: allow unaligned memory regions In-Reply-To: <2005411249.dKg4ijljsqXo1Rt6@topspin.com> Message-ID: <2005411249.cEJmE9mY2eziJTR6@topspin.com> From: Michael S. 
Tsirkin The first buffer of a memory region is not required to be page-aligned, so don't return an error if it's not. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-01 12:38:20.839437009 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-01 12:38:21.926201103 -0800 @@ -494,7 +494,7 @@ mask = 0; total_size = 0; for (i = 0; i < num_phys_buf; ++i) { - if (buffer_list[i].addr & ~PAGE_MASK) + if (i != 0 && buffer_list[i].addr & ~PAGE_MASK) return ERR_PTR(-EINVAL); if (i != 0 && i != num_phys_buf - 1 && (buffer_list[i].size & ~PAGE_MASK)) From roland at topspin.com Fri Apr 1 12:49:52 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:52 -0800 Subject: [openib-general] [PATCH][9/27] IB/mthca: release mutex on doorbell alloc error path In-Reply-To: <2005411249.mKyALgAB0GbtFnjH@topspin.com> Message-ID: <2005411249.XnosdnfHawyDkITW@topspin.com> Release mutex on error return path from mthca_alloc_db(). Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-04-01 12:38:22.274125578 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-04-01 12:38:23.500859288 -0800 @@ -337,7 +337,8 @@ break; default: - return -1; + ret = -EINVAL; + goto out; } for (i = start; i != end; i += dir) From roland at topspin.com Fri Apr 1 12:49:52 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:52 -0800 Subject: [openib-general] [PATCH][7/27] IB/mthca: clean up mthca_dereg_mr() In-Reply-To: <2005411249.VaroeECWUvqcGQCD@topspin.com> Message-ID: <2005411249.i5VdQJiPqpmwTj3T@topspin.com> Signed-off-by: Michael S. Tsirkin It's cleaner to kfree mthca_mr, and not rely on the fact that ib_mr is the first field in mthca_mr. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-01 12:38:21.926201103 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-01 12:38:22.630048317 -0800 @@ -568,8 +568,9 @@ static int mthca_dereg_mr(struct ib_mr *mr) { - mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); - kfree(mr); + struct mthca_mr *mmr = to_mmr(mr); + mthca_free_mr(to_mdev(mr->device), mmr); + kfree(mmr); return 0; } From roland at topspin.com Fri Apr 1 12:49:52 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:52 -0800 Subject: [openib-general] [PATCH][11/27] IB/mthca: only free doorbell records in mem-free mode In-Reply-To: <2005411249.tAq0qtfjGbz3oHeg@topspin.com> Message-ID: <2005411249.0RpxZQTVnbUL56cR@topspin.com> On error path, only free doorbell records if we're in mem-free mode. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-03-31 19:06:42.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-04-01 12:38:24.207705852 -0800 @@ -817,10 +817,12 @@ err_out_mailbox: kfree(mailbox); - mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index); + if (dev->hca_type == ARBEL_NATIVE) + mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index); err_out_ci: - mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index); + if (dev->hca_type == ARBEL_NATIVE) + mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index); err_out_icm: mthca_table_put(dev, dev->cq_table.table, cq->cqn); From roland at topspin.com Fri Apr 1 12:49:52 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:52 -0800 Subject: [openib-general] [PATCH][10/27] IB/mthca: print assigned IRQ when interrupt test fails In-Reply-To: <2005411249.XnosdnfHawyDkITW@topspin.com> Message-ID: <2005411249.tAq0qtfjGbz3oHeg@topspin.com> Print IRQ number when NOP command interrupt test fails to help debugging. 
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-01 12:38:19.884644268 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-01 12:38:23.852782896 -0800 @@ -672,7 +672,10 @@ err = mthca_NOP(dev, &status); if (err || status) { - mthca_err(dev, "NOP command failed to generate interrupt, aborting.\n"); + mthca_err(dev, "NOP command failed to generate interrupt (IRQ %d), aborting.\n", + dev->mthca_flags & MTHCA_FLAG_MSI_X ? + dev->eq_table.eq[MTHCA_EQ_CMD].msi_x_vector : + dev->pdev->irq); if (dev->mthca_flags & (MTHCA_FLAG_MSI | MTHCA_FLAG_MSI_X)) mthca_err(dev, "Try again with MSI/MSI-X disabled.\n"); else From roland at topspin.com Fri Apr 1 12:49:53 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:53 -0800 Subject: [openib-general] [PATCH][13/27] IB/mthca: implement RDMA/atomic operations for mem-free mode In-Reply-To: <2005411249.mBxBGEwdeob5Gy84@topspin.com> Message-ID: <2005411249.0FJpqa4lTtcUTWSU@topspin.com> Add code to support RDMA and atomic send work requests in mem-free mode. 
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-04-01 12:38:21.580276194 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-04-01 12:38:25.023528759 -0800 @@ -1775,6 +1775,53 @@ size = sizeof (struct mthca_next_seg) / 16; switch (qp->transport) { + case RC: + switch (wr->opcode) { + case IB_WR_ATOMIC_CMP_AND_SWP: + case IB_WR_ATOMIC_FETCH_AND_ADD: + ((struct mthca_raddr_seg *) wqe)->raddr = + cpu_to_be64(wr->wr.atomic.remote_addr); + ((struct mthca_raddr_seg *) wqe)->rkey = + cpu_to_be32(wr->wr.atomic.rkey); + ((struct mthca_raddr_seg *) wqe)->reserved = 0; + + wqe += sizeof (struct mthca_raddr_seg); + + if (wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP) { + ((struct mthca_atomic_seg *) wqe)->swap_add = + cpu_to_be64(wr->wr.atomic.swap); + ((struct mthca_atomic_seg *) wqe)->compare = + cpu_to_be64(wr->wr.atomic.compare_add); + } else { + ((struct mthca_atomic_seg *) wqe)->swap_add = + cpu_to_be64(wr->wr.atomic.compare_add); + ((struct mthca_atomic_seg *) wqe)->compare = 0; + } + + wqe += sizeof (struct mthca_atomic_seg); + size += (sizeof (struct mthca_raddr_seg) + + sizeof (struct mthca_atomic_seg)) / 16; + break; + + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + case IB_WR_RDMA_READ: + ((struct mthca_raddr_seg *) wqe)->raddr = + cpu_to_be64(wr->wr.rdma.remote_addr); + ((struct mthca_raddr_seg *) wqe)->rkey = + cpu_to_be32(wr->wr.rdma.rkey); + ((struct mthca_raddr_seg *) wqe)->reserved = 0; + wqe += sizeof (struct mthca_raddr_seg); + size += sizeof (struct mthca_raddr_seg) / 16; + break; + + default: + /* No extra segments required for sends */ + break; + } + + break; + case UD: memcpy(((struct mthca_arbel_ud_seg *) wqe)->av, to_mah(wr->wr.ud.ah)->av, MTHCA_AV_SIZE); From roland at topspin.com Fri Apr 1 12:49:52 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:52 -0800 Subject: [openib-general] [PATCH][12/27] IB/mthca: fix format of CQ number for CQ events In-Reply-To:
<2005411249.0RpxZQTVnbUL56cR@topspin.com> Message-ID: <2005411249.mBxBGEwdeob5Gy84@topspin.com> CQ numbers are only 24 bits, so only print 6 hex digits and mask off reserved part when reporting a CQ event. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2005-03-31 19:06:55.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_eq.c 2005-04-01 12:38:24.575625986 -0800 @@ -344,10 +344,10 @@ break; case MTHCA_EVENT_TYPE_CQ_ERROR: - mthca_warn(dev, "CQ %s on CQN %08x\n", + mthca_warn(dev, "CQ %s on CQN %06x\n", eqe->event.cq_err.syndrome == 1 ? "overrun" : "access violation", - be32_to_cpu(eqe->event.cq_err.cqn)); + be32_to_cpu(eqe->event.cq_err.cqn) & 0xffffff); break; case MTHCA_EVENT_TYPE_EQ_OVERFLOW: From roland at topspin.com Fri Apr 1 12:49:53 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:53 -0800 Subject: [openib-general] [PATCH][14/27] IB/mthca: fix MTT allocation in mem-free mode In-Reply-To: <2005411249.0FJpqa4lTtcUTWSU@topspin.com> Message-ID: <2005411249.E7CWkenJFFkWDs2q@topspin.com> Fix bug in MTT allocation in mem-free mode. I misunderstood the MTT size value returned by the firmware -- it is really the size of a single MTT entry, since mem-free mode does not segment the MTT as the original firmware did. This meant that our MTT addresses ended up being off by a factor of 8. This meant that our MTT allocations might overlap, and so we could overwrite and corrupt earlier memory regions when writing new MTT entries. We fix this by always using our 64-byte MTT segment size. This allows some simplification of the code as well, since there's no reason to put the MTT segment size in a variable -- we can always use our enum value directly. 
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-04-01 12:38:20.843436141 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-04-01 12:38:25.574409178 -0800 @@ -990,7 +990,6 @@ MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET); dev_lim->max_sg = min_t(int, field, dev_lim->max_sg); MTHCA_GET(size, outbox, QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET); - dev_lim->mtt_seg_sz = size; MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET); dev_lim->mpt_entry_sz = size; MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET); @@ -1018,7 +1017,6 @@ } else { MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); dev_lim->hca.tavor.max_avs = 1 << (field & 0x3f); - dev_lim->mtt_seg_sz = MTHCA_MTT_SEG_SIZE; dev_lim->mpt_entry_sz = MTHCA_MPT_ENTRY_SIZE; } --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.h 2005-03-31 19:06:42.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.h 2005-04-01 12:38:25.578408310 -0800 @@ -162,7 +162,6 @@ int cqc_entry_sz; int srq_entry_sz; int uar_scratch_entry_sz; - int mtt_seg_sz; int mpt_entry_sz; union { struct { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-03-31 19:06:41.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-01 12:38:25.561412000 -0800 @@ -121,7 +121,6 @@ int reserved_eqs; int num_mpts; int num_mtt_segs; - int mtt_seg_size; int reserved_mtts; int reserved_mrws; int reserved_uars; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-01 12:38:23.852782896 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-01 12:38:25.566410914 -0800 @@ -390,7 +390,7 @@ } mdev->mr_table.mtt_table = mthca_alloc_icm_table(mdev, init_hca->mtt_base, - dev_lim->mtt_seg_sz, + MTHCA_MTT_SEG_SIZE, mdev->limits.num_mtt_segs, mdev->limits.reserved_mtts, 1); if (!mdev->mr_table.mtt_table) { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c 2005-04-01 12:38:22.968974746 
-0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c 2005-04-01 12:38:25.582407442 -0800 @@ -263,7 +263,7 @@ goto err_out_mpt_free; } - for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + for (i = MTHCA_MTT_SEG_SIZE / 8, mr->order = 0; i < list_len; i <<= 1, ++mr->order) ; /* nothing */ @@ -286,7 +286,7 @@ mtt_entry = MAILBOX_ALIGN(mailbox); mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + - mr->first_seg * dev->limits.mtt_seg_size); + mr->first_seg * MTHCA_MTT_SEG_SIZE); mtt_entry[1] = 0; for (i = 0; i < list_len; ++i) mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | @@ -330,7 +330,7 @@ memset(&mpt_entry->lkey, 0, sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + - mr->first_seg * dev->limits.mtt_seg_size); + mr->first_seg * MTHCA_MTT_SEG_SIZE); if (0) { mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c 2005-04-01 12:38:21.237350633 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c 2005-04-01 12:38:25.570410046 -0800 @@ -95,7 +95,7 @@ profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; profile[MTHCA_RES_MPT].size = dev_lim->mpt_entry_sz; - profile[MTHCA_RES_MTT].size = dev_lim->mtt_seg_sz; + profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; profile[MTHCA_RES_UARC].size = request->uarc_size; @@ -229,10 +229,9 @@ break; case MTHCA_RES_MTT: dev->limits.num_mtt_segs = profile[i].num; - dev->limits.mtt_seg_size = dev_lim->mtt_seg_sz; dev->mr_table.mtt_base = profile[i].start; init_hca->mtt_base = profile[i].start; - init_hca->mtt_seg_sz = ffs(dev_lim->mtt_seg_sz) - 7; + init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7; break; case MTHCA_RES_UAR: dev->limits.num_uars = profile[i].num; From roland at topspin.com Fri Apr 1 12:49:53 2005 
From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:53 -0800 Subject: [openib-general] [PATCH][15/27] IB/mthca: fill in opcode field for send completions In-Reply-To: <2005411249.E7CWkenJFFkWDs2q@topspin.com> Message-ID: <2005411249.qipkNNwvZYuE2KBu@topspin.com> From: Michael S. Tsirkin Fill in missing fields in send completions. Signed-off-by: Itamar Rabenstein Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-04-01 12:38:24.207705852 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-04-01 12:38:26.177278312 -0800 @@ -473,7 +473,41 @@ } if (is_send) { - entry->opcode = IB_WC_SEND; /* XXX */ + entry->wc_flags = 0; + switch (cqe->opcode) { + case MTHCA_OPCODE_RDMA_WRITE: + entry->opcode = IB_WC_RDMA_WRITE; + break; + case MTHCA_OPCODE_RDMA_WRITE_IMM: + entry->opcode = IB_WC_RDMA_WRITE; + entry->wc_flags |= IB_WC_WITH_IMM; + break; + case MTHCA_OPCODE_SEND: + entry->opcode = IB_WC_SEND; + break; + case MTHCA_OPCODE_SEND_IMM: + entry->opcode = IB_WC_SEND; + entry->wc_flags |= IB_WC_WITH_IMM; + break; + case MTHCA_OPCODE_RDMA_READ: + entry->opcode = IB_WC_RDMA_READ; + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + break; + case MTHCA_OPCODE_ATOMIC_CS: + entry->opcode = IB_WC_COMP_SWAP; + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + break; + case MTHCA_OPCODE_ATOMIC_FA: + entry->opcode = IB_WC_FETCH_ADD; + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + break; + case MTHCA_OPCODE_BIND_MW: + entry->opcode = IB_WC_BIND_MW; + break; + default: + entry->opcode = MTHCA_OPCODE_INVALID; + break; + } } else { entry->byte_len = be32_to_cpu(cqe->byte_cnt); switch (cqe->opcode & 0x1f) { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-01 12:38:25.561412000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-01 12:38:26.173279180 -0800 @@ -88,6 +88,19 @@ MTHCA_NUM_EQ }; +enum { + MTHCA_OPCODE_NOP = 0x00, + 
MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + struct mthca_cmd { int use_events; struct semaphore hcr_sem; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-04-01 12:38:25.023528759 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-04-01 12:38:26.181277444 -0800 @@ -171,19 +171,6 @@ }; enum { - MTHCA_OPCODE_NOP = 0x00, - MTHCA_OPCODE_RDMA_WRITE = 0x08, - MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, - MTHCA_OPCODE_SEND = 0x0a, - MTHCA_OPCODE_SEND_IMM = 0x0b, - MTHCA_OPCODE_RDMA_READ = 0x10, - MTHCA_OPCODE_ATOMIC_CS = 0x11, - MTHCA_OPCODE_ATOMIC_FA = 0x12, - MTHCA_OPCODE_BIND_MW = 0x18, - MTHCA_OPCODE_INVALID = 0xff -}; - -enum { MTHCA_NEXT_DBD = 1 << 7, MTHCA_NEXT_FENCE = 1 << 6, MTHCA_NEXT_CQ_UPDATE = 1 << 3, From roland at topspin.com Fri Apr 1 12:49:53 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:53 -0800 Subject: [openib-general] [PATCH][16/27] IB/mthca: allow address handle creation in interrupt context In-Reply-To: <2005411249.qipkNNwvZYuE2KBu@topspin.com> Message-ID: <2005411249.gEJosMqrkm8KOH4C@topspin.com> Make address handle verbs usable from interrupt context. 
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_av.c 2005-03-31 19:07:01.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_av.c 2005-04-01 12:38:26.648176093 -0800 @@ -63,7 +63,7 @@ ah->type = MTHCA_AH_PCI_POOL; if (dev->hca_type == ARBEL_NATIVE) { - ah->av = kmalloc(sizeof *ah->av, GFP_KERNEL); + ah->av = kmalloc(sizeof *ah->av, GFP_ATOMIC); if (!ah->av) return -ENOMEM; @@ -77,7 +77,7 @@ if (index == -1) goto on_hca_fail; - av = kmalloc(sizeof *av, GFP_KERNEL); + av = kmalloc(sizeof *av, GFP_ATOMIC); if (!av) goto on_hca_fail; @@ -89,7 +89,7 @@ on_hca_fail: if (ah->type == MTHCA_AH_PCI_POOL) { ah->av = pci_pool_alloc(dev->av_table.pool, - SLAB_KERNEL, &ah->avdma); + SLAB_ATOMIC, &ah->avdma); if (!ah->av) return -ENOMEM; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-01 12:38:22.630048317 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-01 12:38:26.644176961 -0800 @@ -315,7 +315,7 @@ int err; struct mthca_ah *ah; - ah = kmalloc(sizeof *ah, GFP_KERNEL); + ah = kmalloc(sizeof *ah, GFP_ATOMIC); if (!ah) return ERR_PTR(-ENOMEM); From roland at topspin.com Fri Apr 1 12:49:53 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:53 -0800 Subject: [openib-general] [PATCH][17/27] IB/mthca: encapsulate MTT buddy allocator In-Reply-To: <2005411249.gEJosMqrkm8KOH4C@topspin.com> Message-ID: <2005411249.S2hhmQaEpM8vK71i@topspin.com> From: Michael S. Tsirkin Encapsulate the buddy allocator used for MTT segments. This cleans up the code and also gets us ready to add FMR support. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-01 12:38:26.173279180 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-01 12:38:27.068084943 -0800 @@ -170,10 +170,15 @@ struct mthca_alloc alloc; }; +struct mthca_buddy { + unsigned long **bits; + int max_order; + spinlock_t lock; +}; + struct mthca_mr_table { struct mthca_alloc mpt_alloc; - int max_mtt_order; - unsigned long **mtt_buddy; + struct mthca_buddy mtt_buddy; u64 mtt_base; struct mthca_icm_table *mtt_table; struct mthca_icm_table *mpt_table; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c 2005-04-01 12:38:25.582407442 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c 2005-04-01 12:38:27.075083423 -0800 @@ -72,60 +72,108 @@ * through the bitmaps) */ -static u32 __mthca_alloc_mtt(struct mthca_dev *dev, int order) +static u32 mthca_buddy_alloc(struct mthca_buddy *buddy, int order) { int o; int m; u32 seg; - spin_lock(&dev->mr_table.mpt_alloc.lock); + spin_lock(&buddy->lock); - for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { - m = 1 << (dev->mr_table.max_mtt_order - o); - seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + for (o = order; o <= buddy->max_order; ++o) { + m = 1 << (buddy->max_order - o); + seg = find_first_bit(buddy->bits[o], m); if (seg < m) goto found; } - spin_unlock(&dev->mr_table.mpt_alloc.lock); + spin_unlock(&buddy->lock); return -1; found: - clear_bit(seg, dev->mr_table.mtt_buddy[o]); + clear_bit(seg, buddy->bits[o]); while (o > order) { --o; seg <<= 1; - set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + set_bit(seg ^ 1, buddy->bits[o]); } - spin_unlock(&dev->mr_table.mpt_alloc.lock); + spin_unlock(&buddy->lock); seg <<= order; return seg; } -static void __mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +static void mthca_buddy_free(struct mthca_buddy *buddy, u32 seg, int order) { seg >>= order; - spin_lock(&dev->mr_table.mpt_alloc.lock); + 
spin_lock(&buddy->lock); - while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { - clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + while (test_bit(seg ^ 1, buddy->bits[order])) { + clear_bit(seg ^ 1, buddy->bits[order]); seg >>= 1; ++order; } - set_bit(seg, dev->mr_table.mtt_buddy[order]); + set_bit(seg, buddy->bits[order]); - spin_unlock(&dev->mr_table.mpt_alloc.lock); + spin_unlock(&buddy->lock); } -static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +static int __devinit mthca_buddy_init(struct mthca_buddy *buddy, int max_order) { - u32 seg = __mthca_alloc_mtt(dev, order); + int i, s; + + buddy->max_order = max_order; + spin_lock_init(&buddy->lock); + + buddy->bits = kmalloc((buddy->max_order + 1) * sizeof (long *), + GFP_KERNEL); + if (!buddy->bits) + goto err_out; + + memset(buddy->bits, 0, (buddy->max_order + 1) * sizeof (long *)); + + for (i = 0; i <= buddy->max_order; ++i) { + s = BITS_TO_LONGS(1 << (buddy->max_order - i)); + buddy->bits[i] = kmalloc(s * sizeof (long), GFP_KERNEL); + if (!buddy->bits[i]) + goto err_out_free; + bitmap_zero(buddy->bits[i], + 1 << (buddy->max_order - i)); + } + + set_bit(0, buddy->bits[buddy->max_order]); + + return 0; + +err_out_free: + for (i = 0; i <= buddy->max_order; ++i) + kfree(buddy->bits[i]); + + kfree(buddy->bits); + +err_out: + return -ENOMEM; +} + +static void __devexit mthca_buddy_cleanup(struct mthca_buddy *buddy) +{ + int i; + + for (i = 0; i <= buddy->max_order; ++i) + kfree(buddy->bits[i]); + + kfree(buddy->bits); +} + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order, + struct mthca_buddy *buddy) +{ + u32 seg = mthca_buddy_alloc(buddy, order); if (seg == -1) return -1; @@ -133,16 +181,17 @@ if (dev->hca_type == ARBEL_NATIVE) if (mthca_table_get_range(dev, dev->mr_table.mtt_table, seg, seg + (1 << order) - 1)) { - __mthca_free_mtt(dev, seg, order); + mthca_buddy_free(buddy, seg, order); seg = -1; } return seg; } -static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int 
order) +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order, + struct mthca_buddy* buddy) { - __mthca_free_mtt(dev, seg, order); + mthca_buddy_free(buddy, seg, order); if (dev->hca_type == ARBEL_NATIVE) mthca_table_put_range(dev, dev->mr_table.mtt_table, seg, @@ -268,7 +317,8 @@ i <<= 1, ++mr->order) ; /* nothing */ - mr->first_seg = mthca_alloc_mtt(dev, mr->order); + mr->first_seg = mthca_alloc_mtt(dev, mr->order, + &dev->mr_table.mtt_buddy); if (mr->first_seg == -1) goto err_out_table; @@ -361,7 +411,7 @@ kfree(mailbox); err_out_free_mtt: - mthca_free_mtt(dev, mr->first_seg, mr->order); + mthca_free_mtt(dev, mr->first_seg, mr->order, &dev->mr_table.mtt_buddy); err_out_table: if (dev->hca_type == ARBEL_NATIVE) @@ -390,7 +440,7 @@ status); if (mr->order >= 0) - mthca_free_mtt(dev, mr->first_seg, mr->order); + mthca_free_mtt(dev, mr->first_seg, mr->order, &dev->mr_table.mtt_buddy); if (dev->hca_type == ARBEL_NATIVE) mthca_table_put(dev, dev->mr_table.mpt_table, @@ -401,7 +451,6 @@ int __devinit mthca_init_mr_table(struct mthca_dev *dev) { int err; - int i, s; err = mthca_alloc_init(&dev->mr_table.mpt_alloc, dev->limits.num_mpts, @@ -409,53 +458,24 @@ if (err) return err; - err = -ENOMEM; - - for (i = 1, dev->mr_table.max_mtt_order = 0; - i < dev->limits.num_mtt_segs; - i <<= 1, ++dev->mr_table.max_mtt_order) - ; /* nothing */ - - dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * - sizeof (long *), - GFP_KERNEL); - if (!dev->mr_table.mtt_buddy) - goto err_out; - - for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) - dev->mr_table.mtt_buddy[i] = NULL; - - for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { - s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); - dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), - GFP_KERNEL); - if (!dev->mr_table.mtt_buddy[i]) - goto err_out_free; - bitmap_zero(dev->mr_table.mtt_buddy[i], - 1 << (dev->mr_table.max_mtt_order - i)); - } - - set_bit(0, 
dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); - - for (i = 0; i < dev->mr_table.max_mtt_order; ++i) - if (1 << i >= dev->limits.reserved_mtts) - break; + err = mthca_buddy_init(&dev->mr_table.mtt_buddy, + fls(dev->limits.num_mtt_segs - 1)); + if (err) + goto err_mtt_buddy; - if (i == dev->mr_table.max_mtt_order) { - mthca_err(dev, "MTT table of order %d is " - "too small.\n", i); - goto err_out_free; + if (dev->limits.reserved_mtts) { + if (mthca_alloc_mtt(dev, fls(dev->limits.reserved_mtts - 1), + &dev->mr_table.mtt_buddy) == -1) { + mthca_warn(dev, "MTT table of order %d is too small.\n", + dev->mr_table.mtt_buddy.max_order); + err = -ENOMEM; + goto err_mtt_buddy; + } } - (void) mthca_alloc_mtt(dev, i); - return 0; - err_out_free: - for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) - kfree(dev->mr_table.mtt_buddy[i]); - - err_out: +err_mtt_buddy: mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); return err; @@ -463,11 +483,7 @@ void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) { - int i; - /* XXX check if any MRs are still allocated? */ - for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) - kfree(dev->mr_table.mtt_buddy[i]); - kfree(dev->mr_table.mtt_buddy); + mthca_buddy_cleanup(&dev->mr_table.mtt_buddy); mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); } From roland at topspin.com Fri Apr 1 12:49:53 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:53 -0800 Subject: [openib-general] [PATCH][18/27] IB/mthca: add SYNC_TPT firmware command In-Reply-To: <2005411249.S2hhmQaEpM8vK71i@topspin.com> Message-ID: <2005411249.Wiedh3QohPRJi9Sp@topspin.com> From: Michael S. Tsirkin Add code for SYNC_TPT firmware command, which will be used by FMR implementation. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-04-01 12:38:25.574409178 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-04-01 12:38:27.495992056 -0800 @@ -1404,6 +1404,11 @@ return err; } +int mthca_SYNC_TPT(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYNC_TPT, CMD_TIME_CLASS_B, status); +} + int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, int eq_num, u8 *status) { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.h 2005-04-01 12:38:25.578408310 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.h 2005-04-01 12:38:27.500990971 -0800 @@ -276,6 +276,7 @@ int mpt_index, u8 *status); int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, int num_mtt, u8 *status); +int mthca_SYNC_TPT(struct mthca_dev *dev, u8 *status); int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, int eq_num, u8 *status); int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, From roland at topspin.com Fri Apr 1 12:49:53 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:53 -0800 Subject: [openib-general] [PATCH][19/27] IB/mthca: add mthca_write64_raw() for writing to MTT table directly In-Reply-To: <2005411249.Wiedh3QohPRJi9Sp@topspin.com> Message-ID: <2005411249.t0DdCtarOabubO3D@topspin.com> From: Michael S. Tsirkin Add mthca_write64_raw() function, which will be used to write FMR entries that are in ioremapped PCI memory. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_doorbell.h 2005-03-31 19:06:52.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_doorbell.h 2005-04-01 12:38:27.898904595 -0800 @@ -51,6 +51,11 @@ #define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) #define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) +static inline void mthca_write64_raw(__be64 val, void __iomem *dest) +{ + __raw_writeq((__force u64) val, dest); +} + static inline void mthca_write64(u32 val[2], void __iomem *dest, spinlock_t *doorbell_lock) { @@ -74,6 +79,12 @@ #define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) #define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) +static inline void mthca_write64_raw(__be64 val, void __iomem *dest) +{ + __raw_writel(((__force u32 *) &val)[0], dest); + __raw_writel(((__force u32 *) &val)[1], dest + 4); +} + static inline void mthca_write64(u32 val[2], void __iomem *dest, spinlock_t *doorbell_lock) { From roland at topspin.com Fri Apr 1 12:49:53 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:53 -0800 Subject: [openib-general] [PATCH][20/27] IB/mthca: add mthca_table_find() function In-Reply-To: <2005411249.t0DdCtarOabubO3D@topspin.com> Message-ID: <2005411249.Tkvt1lzz8zEHUMmz@topspin.com> From: Michael S. Tsirkin Add mthca_table_find() function, which returns the lowmem address of an entry in a mem-free HCA's context tables. This will be used by the FMR implementation. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-04-01 12:38:23.500859288 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-04-01 12:38:28.285820606 -0800 @@ -192,6 +192,40 @@ up(&table->mutex); } +void *mthca_table_find(struct mthca_icm_table *table, int obj) +{ + int idx, offset, i; + struct mthca_icm_chunk *chunk; + struct mthca_icm *icm; + struct page *page = NULL; + + if (!table->lowmem) + return NULL; + + down(&table->mutex); + + idx = (obj & (table->num_obj - 1)) * table->obj_size; + icm = table->icm[idx / MTHCA_TABLE_CHUNK_SIZE]; + offset = idx % MTHCA_TABLE_CHUNK_SIZE; + + if (!icm) + goto out; + + list_for_each_entry(chunk, &icm->chunk_list, list) { + for (i = 0; i < chunk->npages; ++i) { + if (chunk->mem[i].length > offset) { + page = chunk->mem[i].page; + goto out; + } + offset -= chunk->mem[i].length; + } + } + +out: + up(&table->mutex); + return page ? lowmem_page_address(page) + offset : NULL; +} + int mthca_table_get_range(struct mthca_dev *dev, struct mthca_icm_table *table, int start, int end) { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-04-01 12:38:19.895641881 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-04-01 12:38:28.280821691 -0800 @@ -85,6 +85,7 @@ void mthca_free_icm_table(struct mthca_dev *dev, struct mthca_icm_table *table); int mthca_table_get(struct mthca_dev *dev, struct mthca_icm_table *table, int obj); void mthca_table_put(struct mthca_dev *dev, struct mthca_icm_table *table, int obj); +void *mthca_table_find(struct mthca_icm_table *table, int obj); int mthca_table_get_range(struct mthca_dev *dev, struct mthca_icm_table *table, int start, int end); void mthca_table_put_range(struct mthca_dev *dev, struct mthca_icm_table *table, From roland at topspin.com Fri Apr 1 12:49:53 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:53 -0800 Subject: [openib-general] [PATCH][21/27]
IB/mthca: split MR key munging routines In-Reply-To: <2005411249.Tkvt1lzz8zEHUMmz@topspin.com> Message-ID: <2005411249.VplL6XJIvCp9HHyP@topspin.com> From: Michael S. Tsirkin Split Tavor and Arbel/mem-free index<->hw key munging routines, so that FMR implementation can call correct implementation without testing HCA type (which it already knows). Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c 2005-04-01 12:38:27.075083423 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c 2005-04-01 12:38:28.676735749 -0800 @@ -198,20 +198,40 @@ seg + (1 << order) - 1); } +static inline u32 tavor_hw_index_to_key(u32 ind) +{ + return ind; +} + +static inline u32 tavor_key_to_hw_index(u32 key) +{ + return key; +} + +static inline u32 arbel_hw_index_to_key(u32 ind) +{ + return (ind >> 24) | (ind << 8); +} + +static inline u32 arbel_key_to_hw_index(u32 key) +{ + return (key << 24) | (key >> 8); +} + static inline u32 hw_index_to_key(struct mthca_dev *dev, u32 ind) { if (dev->hca_type == ARBEL_NATIVE) - return (ind >> 24) | (ind << 8); + return arbel_hw_index_to_key(ind); else - return ind; + return tavor_hw_index_to_key(ind); } static inline u32 key_to_hw_index(struct mthca_dev *dev, u32 key) { if (dev->hca_type == ARBEL_NATIVE) - return (key << 24) | (key >> 8); + return arbel_key_to_hw_index(key); else - return key; + return tavor_key_to_hw_index(key); } int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, From roland at topspin.com Fri Apr 1 12:49:54 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:54 -0800 Subject: [openib-general] [PATCH][23/27] IB/mthca: tweaks to mthca_cmd.c In-Reply-To: <2005411249.CxF3RBWpNJELwaqL@topspin.com> Message-ID: <2005411249.5GDmFAellTSOT0Ai@topspin.com> Minor tweaks to firmware command handling: kill off an unused get of a value, and add a little more info to debug output. 
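For reference, the Arbel/mem-free munging split out in [PATCH][21/27] is a 32-bit rotate by 8 bits, so the two directions should be exact inverses of each other. Below is a minimal userspace sketch of that property. It assumes u32 can be spelled uint32_t outside the kernel, and arbel_key_roundtrip_ok() is a hypothetical helper for illustration only, not part of the driver:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace copies of the two helpers from the patch (u32 -> uint32_t). */
static inline uint32_t arbel_hw_index_to_key(uint32_t ind)
{
	/* Rotate left by 8: the top byte of the index becomes the low byte of the key. */
	return (ind >> 24) | (ind << 8);
}

static inline uint32_t arbel_key_to_hw_index(uint32_t key)
{
	/* Rotate right by 8: undoes the rotation above. */
	return (key << 24) | (key >> 8);
}

/*
 * Hypothetical helper, not in the driver: returns 1 if converting an
 * index to a key and back yields the original index.
 */
static inline int arbel_key_roundtrip_ok(uint32_t ind)
{
	return arbel_key_to_hw_index(arbel_hw_index_to_key(ind)) == ind;
}
```

With this layout the low 8 bits of an Arbel key hold what was the top byte of the hardware index, while the Tavor variants are the identity mapping.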
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-04-01 12:38:27.495992056 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-04-01 12:38:30.084430178 -0800 @@ -989,7 +989,6 @@ dev_lim->hca.arbel.resize_srq = field & 1; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET); dev_lim->max_sg = min_t(int, field, dev_lim->max_sg); - MTHCA_GET(size, outbox, QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET); MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET); dev_lim->mpt_entry_sz = size; MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET); @@ -1297,8 +1296,8 @@ pci_free_consistent(dev->pdev, 16, inbox, indma); if (!err) - mthca_dbg(dev, "Mapped page at %llx for ICM.\n", - (unsigned long long) virt); + mthca_dbg(dev, "Mapped page at %llx to %llx for ICM.\n", + (unsigned long long) dma_addr, (unsigned long long) virt); return err; } From roland at topspin.com Fri Apr 1 12:49:53 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:53 -0800 Subject: [openib-general] [PATCH][22/27] IB/mthca: add fast memory region implementation In-Reply-To: <2005411249.VplL6XJIvCp9HHyP@topspin.com> Message-ID: <2005411249.CxF3RBWpNJELwaqL@topspin.com> From: Michael S. Tsirkin Implement fast memory regions (FMRs), where the driver writes directly into the HCA's translation tables rather than requiring a firmware command. For Tavor, MTTs for FMR are separate from regular MTTs, and are reserved at driver initialization. This is done to limit the amount of virtual memory needed to map the MTTs. For Arbel, there's no such limitation, and all MTTs and MPTs may be used for FMR or for regular MR. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-01 12:38:27.068084943 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-01 12:38:29.460565601 -0800 @@ -61,7 +61,8 @@ MTHCA_FLAG_SRQ = 1 << 2, MTHCA_FLAG_MSI = 1 << 3, MTHCA_FLAG_MSI_X = 1 << 4, - MTHCA_FLAG_NO_LAM = 1 << 5 + MTHCA_FLAG_NO_LAM = 1 << 5, + MTHCA_FLAG_FMR = 1 << 6 }; enum { @@ -134,6 +135,7 @@ int reserved_eqs; int num_mpts; int num_mtt_segs; + int fmr_reserved_mtts; int reserved_mtts; int reserved_mrws; int reserved_uars; @@ -178,10 +180,17 @@ struct mthca_mr_table { struct mthca_alloc mpt_alloc; - struct mthca_buddy mtt_buddy; + struct mthca_buddy mtt_buddy; + struct mthca_buddy *fmr_mtt_buddy; u64 mtt_base; + u64 mpt_base; struct mthca_icm_table *mtt_table; struct mthca_icm_table *mpt_table; + struct { + void __iomem *mpt_base; + void __iomem *mtt_base; + struct mthca_buddy mtt_buddy; + } tavor_fmr; }; struct mthca_eq_table { @@ -380,7 +389,17 @@ u64 *buffer_list, int buffer_size_shift, int list_len, u64 iova, u64 total_size, u32 access, struct mthca_mr *mr); -void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_fmr_alloc(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_fmr *fmr); +int mthca_tavor_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, + int list_len, u64 iova); +void mthca_tavor_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr); +int mthca_arbel_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, + int list_len, u64 iova); +void mthca_arbel_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr); +int mthca_free_fmr(struct mthca_dev *dev, struct mthca_fmr *fmr); int mthca_map_eq_icm(struct mthca_dev *dev, u64 icm_virt); void mthca_unmap_eq_icm(struct mthca_dev *dev); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-01 12:38:25.566410914 -0800 +++ 
linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-01 12:38:29.466564299 -0800 @@ -73,14 +73,15 @@ DRV_VERSION " (" DRV_RELDATE ")\n"; static struct mthca_profile default_profile = { - .num_qp = 1 << 16, - .rdb_per_qp = 4, - .num_cq = 1 << 16, - .num_mcg = 1 << 13, - .num_mpt = 1 << 17, - .num_mtt = 1 << 20, - .num_udav = 1 << 15, /* Tavor only */ - .uarc_size = 1 << 18, /* Arbel only */ + .num_qp = 1 << 16, + .rdb_per_qp = 4, + .num_cq = 1 << 16, + .num_mcg = 1 << 13, + .num_mpt = 1 << 17, + .num_mtt = 1 << 20, + .num_udav = 1 << 15, /* Tavor only */ + .fmr_reserved_mtts = 1 << 18, /* Tavor only */ + .uarc_size = 1 << 18, /* Arbel only */ }; static int __devinit mthca_tune_pci(struct mthca_dev *mdev) --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c 2005-04-01 12:38:28.676735749 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c 2005-04-01 12:38:29.493558440 -0800 @@ -66,6 +66,9 @@ #define MTHCA_MTT_FLAG_PRESENT 1 +#define MTHCA_MPT_STATUS_SW 0xF0 +#define MTHCA_MPT_STATUS_HW 0x00 + /* * Buddy allocator for MTT segments (currently not very efficient * since it doesn't keep a free list and just searches linearly @@ -442,6 +445,20 @@ return err; } +/* Free mr or fmr */ +static void mthca_free_region(struct mthca_dev *dev, u32 lkey, int order, + u32 first_seg, struct mthca_buddy *buddy) +{ + if (order >= 0) + mthca_free_mtt(dev, first_seg, order, buddy); + + if (dev->hca_type == ARBEL_NATIVE) + mthca_table_put(dev, dev->mr_table.mpt_table, + arbel_key_to_hw_index(lkey)); + + mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, lkey)); +} + void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) { int err; @@ -459,18 +476,288 @@ mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", status); - if (mr->order >= 0) - mthca_free_mtt(dev, mr->first_seg, mr->order, &dev->mr_table.mtt_buddy); + mthca_free_region(dev, mr->ibmr.lkey, mr->order, mr->first_seg, + &dev->mr_table.mtt_buddy); +} + +int mthca_fmr_alloc(struct mthca_dev 
*dev, u32 pd, + u32 access, struct mthca_fmr *mr) +{ + struct mthca_mpt_entry *mpt_entry; + void *mailbox; + u64 mtt_seg; + u32 key, idx; + u8 status; + int list_len = mr->attr.max_pages; + int err = -ENOMEM; + int i; + + might_sleep(); + + if (mr->attr.page_size < 12 || mr->attr.page_size >= 32) + return -EINVAL; + + /* For Arbel, all MTTs must fit in the same page. */ + if (dev->hca_type == ARBEL_NATIVE && + mr->attr.max_pages * sizeof *mr->mem.arbel.mtts > PAGE_SIZE) + return -EINVAL; + + mr->maps = 0; + + key = mthca_alloc(&dev->mr_table.mpt_alloc); + if (key == -1) + return -ENOMEM; + + idx = key & (dev->limits.num_mpts - 1); + mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); + + if (dev->hca_type == ARBEL_NATIVE) { + err = mthca_table_get(dev, dev->mr_table.mpt_table, key); + if (err) + goto err_out_mpt_free; + + mr->mem.arbel.mpt = mthca_table_find(dev->mr_table.mpt_table, key); + BUG_ON(!mr->mem.arbel.mpt); + } else + mr->mem.tavor.mpt = dev->mr_table.tavor_fmr.mpt_base + + sizeof *(mr->mem.tavor.mpt) * idx; + + for (i = MTHCA_MTT_SEG_SIZE / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + ; /* nothing */ + + mr->first_seg = mthca_alloc_mtt(dev, mr->order, + dev->mr_table.fmr_mtt_buddy); + if (mr->first_seg == -1) + goto err_out_table; + + mtt_seg = mr->first_seg * MTHCA_MTT_SEG_SIZE; + + if (dev->hca_type == ARBEL_NATIVE) { + mr->mem.arbel.mtts = mthca_table_find(dev->mr_table.mtt_table, + mr->first_seg); + BUG_ON(!mr->mem.arbel.mtts); + } else + mr->mem.tavor.mtts = dev->mr_table.tavor_fmr.mtt_base + mtt_seg; + + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(mr->attr.page_size - 12); + mpt_entry->key = cpu_to_be32(key); + mpt_entry->pd = cpu_to_be32(pd); + 
memset(&mpt_entry->start, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, start)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + mtt_seg); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + err = mthca_SW2HW_MPT(dev, mpt_entry, + key & (dev->limits.num_mpts - 1), + &status); + if (err) { + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + kfree(mailbox); + return 0; + +err_out_mailbox_free: + kfree(mailbox); + +err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order, + dev->mr_table.fmr_mtt_buddy); +err_out_table: if (dev->hca_type == ARBEL_NATIVE) - mthca_table_put(dev, dev->mr_table.mpt_table, - key_to_hw_index(dev, mr->ibmr.lkey)); - mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, mr->ibmr.lkey)); + mthca_table_put(dev, dev->mr_table.mpt_table, key); + +err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +int mthca_free_fmr(struct mthca_dev *dev, struct mthca_fmr *fmr) +{ + if (fmr->maps) + return -EBUSY; + + mthca_free_region(dev, fmr->ibmr.lkey, fmr->order, fmr->first_seg, + dev->mr_table.fmr_mtt_buddy); + return 0; +} + +static inline int mthca_check_fmr(struct mthca_fmr *fmr, u64 *page_list, + int list_len, u64 iova) +{ + int i, page_mask; + + if (list_len > fmr->attr.max_pages) + return -EINVAL; + + page_mask = (1 << fmr->attr.page_size) - 1; + + /* We are getting page lists, so va must be page aligned. 
*/ + if (iova & page_mask) + return -EINVAL; + + /* Trust the user not to pass misaligned data in page_list */ + if (0) + for (i = 0; i < list_len; ++i) { + if (page_list[i] & ~page_mask) + return -EINVAL; + } + + if (fmr->maps >= fmr->attr.max_maps) + return -EINVAL; + + return 0; +} + + +int mthca_tavor_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, + int list_len, u64 iova) +{ + struct mthca_fmr *fmr = to_mfmr(ibfmr); + struct mthca_dev *dev = to_mdev(ibfmr->device); + struct mthca_mpt_entry mpt_entry; + u32 key; + int i, err; + + err = mthca_check_fmr(fmr, page_list, list_len, iova); + if (err) + return err; + + ++fmr->maps; + + key = tavor_key_to_hw_index(fmr->ibmr.lkey); + key += dev->limits.num_mpts; + fmr->ibmr.lkey = fmr->ibmr.rkey = tavor_hw_index_to_key(key); + + writeb(MTHCA_MPT_STATUS_SW, fmr->mem.tavor.mpt); + + for (i = 0; i < list_len; ++i) { + __be64 mtt_entry = cpu_to_be64(page_list[i] | + MTHCA_MTT_FLAG_PRESENT); + mthca_write64_raw(mtt_entry, fmr->mem.tavor.mtts + i); + } + + mpt_entry.lkey = cpu_to_be32(key); + mpt_entry.length = cpu_to_be64(list_len * (1ull << fmr->attr.page_size)); + mpt_entry.start = cpu_to_be64(iova); + + writel(mpt_entry.lkey, &fmr->mem.tavor.mpt->key); + memcpy_toio(&fmr->mem.tavor.mpt->start, &mpt_entry.start, + offsetof(struct mthca_mpt_entry, window_count) - + offsetof(struct mthca_mpt_entry, start)); + + writeb(MTHCA_MPT_STATUS_HW, fmr->mem.tavor.mpt); + + return 0; +} + +int mthca_arbel_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, + int list_len, u64 iova) +{ + struct mthca_fmr *fmr = to_mfmr(ibfmr); + struct mthca_dev *dev = to_mdev(ibfmr->device); + u32 key; + int i, err; + + err = mthca_check_fmr(fmr, page_list, list_len, iova); + if (err) + return err; + + ++fmr->maps; + + key = arbel_key_to_hw_index(fmr->ibmr.lkey); + key += dev->limits.num_mpts; + fmr->ibmr.lkey = fmr->ibmr.rkey = arbel_hw_index_to_key(key); + + *(u8 *) fmr->mem.arbel.mpt = MTHCA_MPT_STATUS_SW; + + wmb(); + + for (i = 0; i < list_len; 
++i) + fmr->mem.arbel.mtts[i] = cpu_to_be64(page_list[i] | + MTHCA_MTT_FLAG_PRESENT); + + fmr->mem.arbel.mpt->key = cpu_to_be32(key); + fmr->mem.arbel.mpt->lkey = cpu_to_be32(key); + fmr->mem.arbel.mpt->length = cpu_to_be64(list_len * (1ull << fmr->attr.page_size)); + fmr->mem.arbel.mpt->start = cpu_to_be64(iova); + + wmb(); + + *(u8 *) fmr->mem.arbel.mpt = MTHCA_MPT_STATUS_HW; + + wmb(); + + return 0; +} + +void mthca_tavor_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr) +{ + u32 key; + + if (!fmr->maps) + return; + + key = tavor_key_to_hw_index(fmr->ibmr.lkey); + key &= dev->limits.num_mpts - 1; + fmr->ibmr.lkey = fmr->ibmr.rkey = tavor_hw_index_to_key(key); + + fmr->maps = 0; + + writeb(MTHCA_MPT_STATUS_SW, fmr->mem.tavor.mpt); +} + +void mthca_arbel_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr) +{ + u32 key; + + if (!fmr->maps) + return; + + key = arbel_key_to_hw_index(fmr->ibmr.lkey); + key &= dev->limits.num_mpts - 1; + fmr->ibmr.lkey = fmr->ibmr.rkey = arbel_hw_index_to_key(key); + + fmr->maps = 0; + + *(u8 *) fmr->mem.arbel.mpt = MTHCA_MPT_STATUS_SW; } int __devinit mthca_init_mr_table(struct mthca_dev *dev) { - int err; + int err, i; err = mthca_alloc_init(&dev->mr_table.mpt_alloc, dev->limits.num_mpts, @@ -478,23 +765,93 @@ if (err) return err; + if (dev->hca_type != ARBEL_NATIVE && + (dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) + dev->limits.fmr_reserved_mtts = 0; + else + dev->mthca_flags |= MTHCA_FLAG_FMR; + err = mthca_buddy_init(&dev->mr_table.mtt_buddy, fls(dev->limits.num_mtt_segs - 1)); + if (err) goto err_mtt_buddy; + dev->mr_table.tavor_fmr.mpt_base = NULL; + dev->mr_table.tavor_fmr.mtt_base = NULL; + + if (dev->limits.fmr_reserved_mtts) { + i = fls(dev->limits.fmr_reserved_mtts - 1); + + if (i >= 31) { + mthca_warn(dev, "Unable to reserve 2^31 FMR MTTs.\n"); + err = -EINVAL; + goto err_fmr_mpt; + } + + dev->mr_table.tavor_fmr.mpt_base = + ioremap(dev->mr_table.mpt_base, + (1 << i) * sizeof (struct mthca_mpt_entry)); + + if 
(!dev->mr_table.tavor_fmr.mpt_base) { + mthca_warn(dev, "MPT ioremap for FMR failed.\n"); + err = -ENOMEM; + goto err_fmr_mpt; + } + + dev->mr_table.tavor_fmr.mtt_base = + ioremap(dev->mr_table.mtt_base, + (1 << i) * MTHCA_MTT_SEG_SIZE); + if (!dev->mr_table.tavor_fmr.mtt_base) { + mthca_warn(dev, "MTT ioremap for FMR failed.\n"); + err = -ENOMEM; + goto err_fmr_mtt; + } + + err = mthca_buddy_init(&dev->mr_table.tavor_fmr.mtt_buddy, i); + if (err) + goto err_fmr_mtt_buddy; + + /* Prevent regular MRs from using FMR keys */ + err = mthca_buddy_alloc(&dev->mr_table.mtt_buddy, i); + if (err) + goto err_reserve_fmr; + + dev->mr_table.fmr_mtt_buddy = + &dev->mr_table.tavor_fmr.mtt_buddy; + } else + dev->mr_table.fmr_mtt_buddy = &dev->mr_table.mtt_buddy; + + /* FMR table is always the first, take reserved MTTs out of there */ if (dev->limits.reserved_mtts) { - if (mthca_alloc_mtt(dev, fls(dev->limits.reserved_mtts - 1), - &dev->mr_table.mtt_buddy) == -1) { + i = fls(dev->limits.reserved_mtts - 1); + + if (mthca_alloc_mtt(dev, i, dev->mr_table.fmr_mtt_buddy) == -1) { mthca_warn(dev, "MTT table of order %d is too small.\n", - dev->mr_table.mtt_buddy.max_order); + dev->mr_table.fmr_mtt_buddy->max_order); err = -ENOMEM; - goto err_mtt_buddy; + goto err_reserve_mtts; } } return 0; +err_reserve_mtts: +err_reserve_fmr: + if (dev->limits.fmr_reserved_mtts) + mthca_buddy_cleanup(&dev->mr_table.tavor_fmr.mtt_buddy); + +err_fmr_mtt_buddy: + if (dev->mr_table.tavor_fmr.mtt_base) + iounmap(dev->mr_table.tavor_fmr.mtt_base); + +err_fmr_mtt: + if (dev->mr_table.tavor_fmr.mpt_base) + iounmap(dev->mr_table.tavor_fmr.mpt_base); + +err_fmr_mpt: + mthca_buddy_cleanup(&dev->mr_table.mtt_buddy); + err_mtt_buddy: mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); @@ -504,6 +861,15 @@ void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) { /* XXX check if any MRs are still allocated? 
*/ + if (dev->limits.fmr_reserved_mtts) + mthca_buddy_cleanup(&dev->mr_table.tavor_fmr.mtt_buddy); + mthca_buddy_cleanup(&dev->mr_table.mtt_buddy); + + if (dev->mr_table.tavor_fmr.mtt_base) + iounmap(dev->mr_table.tavor_fmr.mtt_base); + if (dev->mr_table.tavor_fmr.mpt_base) + iounmap(dev->mr_table.tavor_fmr.mpt_base); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); } --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c 2005-04-01 12:38:25.570410046 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c 2005-04-01 12:38:29.480561261 -0800 @@ -223,9 +223,10 @@ init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); break; case MTHCA_RES_MPT: - dev->limits.num_mpts = profile[i].num; - init_hca->mpt_base = profile[i].start; - init_hca->log_mpt_sz = profile[i].log_num; + dev->limits.num_mpts = profile[i].num; + dev->mr_table.mpt_base = profile[i].start; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; break; case MTHCA_RES_MTT: dev->limits.num_mtt_segs = profile[i].num; @@ -259,6 +260,18 @@ */ dev->limits.num_pds = MTHCA_NUM_PDS; + /* + * For Tavor, FMRs use ioremapped PCI memory. For 32 bit + * systems it may use too much vmalloc space to map all MTT + * memory, so we reserve some MTTs for FMR access, taking them + * out of the MR pool. They don't use additional memory, but + * we assign them as part of the HCA profile anyway. 
+ */ + if (dev->hca_type == ARBEL_NATIVE) + dev->limits.fmr_reserved_mtts = 0; + else + dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts; + kfree(profile); return total_size; } --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.h 2005-03-31 19:07:01.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.h 2005-04-01 12:38:29.484560393 -0800 @@ -48,6 +48,7 @@ int num_udav; int num_uar; int uarc_size; + int fmr_reserved_mtts; }; u64 mthca_make_profile(struct mthca_dev *mdev, --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-01 12:38:26.644176961 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-01 12:38:29.471563214 -0800 @@ -574,6 +574,74 @@ return 0; } +static struct ib_fmr *mthca_alloc_fmr(struct ib_pd *pd, int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct mthca_fmr *fmr; + int err; + + fmr = kmalloc(sizeof *fmr, GFP_KERNEL); + if (!fmr) + return ERR_PTR(-ENOMEM); + + memcpy(&fmr->attr, fmr_attr, sizeof *fmr_attr); + err = mthca_fmr_alloc(to_mdev(pd->device), to_mpd(pd)->pd_num, + convert_access(mr_access_flags), fmr); + + if (err) { + kfree(fmr); + return ERR_PTR(err); + } + + return &fmr->ibmr; +} + +static int mthca_dealloc_fmr(struct ib_fmr *fmr) +{ + struct mthca_fmr *mfmr = to_mfmr(fmr); + int err; + + err = mthca_free_fmr(to_mdev(fmr->device), mfmr); + if (err) + return err; + + kfree(mfmr); + return 0; +} + +static int mthca_unmap_fmr(struct list_head *fmr_list) +{ + struct ib_fmr *fmr; + int err; + u8 status; + struct mthca_dev *mdev = NULL; + + list_for_each_entry(fmr, fmr_list, list) { + if (mdev && to_mdev(fmr->device) != mdev) + return -EINVAL; + mdev = to_mdev(fmr->device); + } + + if (!mdev) + return 0; + + if (mdev->hca_type == ARBEL_NATIVE) { + list_for_each_entry(fmr, fmr_list, list) + mthca_arbel_fmr_unmap(mdev, to_mfmr(fmr)); + + wmb(); + } else + list_for_each_entry(fmr, fmr_list, list) + mthca_tavor_fmr_unmap(mdev, to_mfmr(fmr)); + + 
err = mthca_SYNC_TPT(mdev, &status); + if (err) + return err; + if (status) + return -EINVAL; + return 0; +} + static ssize_t show_rev(struct class_device *cdev, char *buf) { struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); @@ -637,6 +705,17 @@ dev->ib_dev.get_dma_mr = mthca_get_dma_mr; dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; dev->ib_dev.dereg_mr = mthca_dereg_mr; + + if (dev->mthca_flags & MTHCA_FLAG_FMR) { + dev->ib_dev.alloc_fmr = mthca_alloc_fmr; + dev->ib_dev.unmap_fmr = mthca_unmap_fmr; + dev->ib_dev.dealloc_fmr = mthca_dealloc_fmr; + if (dev->hca_type == ARBEL_NATIVE) + dev->ib_dev.map_phys_fmr = mthca_arbel_map_phys_fmr; + else + dev->ib_dev.map_phys_fmr = mthca_tavor_map_phys_fmr; + } + dev->ib_dev.attach_mcast = mthca_multicast_attach; dev->ib_dev.detach_mcast = mthca_multicast_detach; dev->ib_dev.process_mad = mthca_process_mad; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-03-31 19:06:47.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-04-01 12:38:29.475562346 -0800 @@ -60,6 +60,24 @@ u32 first_seg; }; +struct mthca_fmr { + struct ib_fmr ibmr; + struct ib_fmr_attr attr; + int order; + u32 first_seg; + int maps; + union { + struct { + struct mthca_mpt_entry __iomem *mpt; + u64 __iomem *mtts; + } tavor; + struct { + struct mthca_mpt_entry *mpt; + __be64 *mtts; + } arbel; + } mem; +}; + struct mthca_pd { struct ib_pd ibpd; u32 pd_num; @@ -218,6 +236,11 @@ dma_addr_t header_dma; }; +static inline struct mthca_fmr *to_mfmr(struct ib_fmr *ibmr) +{ + return container_of(ibmr, struct mthca_fmr, ibmr); +} + static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) { return container_of(ibmr, struct mthca_mr, ibmr); From roland at topspin.com Fri Apr 1 12:49:54 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:54 -0800 Subject: [openib-general] [PATCH][24/27] IB/mthca: encapsulate mem-free check into mthca_is_memfree() In-Reply-To: 
<2005411249.5GDmFAellTSOT0Ai@topspin.com> Message-ID: <2005411249.qaesrlpuSaCRRPRE@topspin.com> Clean up mem-free mode support by introducing mthca_is_memfree() function, which encapsulates the logic of deciding if a device is mem-free. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_av.c 2005-04-01 12:38:26.648176093 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_av.c 2005-04-01 12:38:30.803274137 -0800 @@ -62,7 +62,7 @@ ah->type = MTHCA_AH_PCI_POOL; - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { ah->av = kmalloc(sizeof *ah->av, GFP_ATOMIC); if (!ah->av) return -ENOMEM; @@ -192,7 +192,7 @@ { int err; - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) return 0; err = mthca_alloc_init(&dev->av_table.alloc, @@ -231,7 +231,7 @@ void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) { - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) return; if (dev->av_table.av_map) --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-04-01 12:38:30.084430178 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c 2005-04-01 12:38:30.790276958 -0800 @@ -651,7 +651,7 @@ mthca_dbg(dev, "FW version %012llx, max commands %d\n", (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); @@ -984,7 +984,7 @@ mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSZ_SRQ_OFFSET); dev_lim->hca.arbel.resize_srq = field & 1; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET); @@ -1148,7 +1148,7 @@ /* TPT attributes */ MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); - if 
(dev->hca_type != ARBEL_NATIVE) + if (!mthca_is_memfree(dev)) MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); @@ -1161,7 +1161,7 @@ MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { MTHCA_PUT(inbox, param->log_uarc_sz, INIT_HCA_UARC_SZ_OFFSET); MTHCA_PUT(inbox, param->log_uar_sz, INIT_HCA_LOG_UAR_SZ_OFFSET); MTHCA_PUT(inbox, param->uarc_base, INIT_HCA_UAR_CTX_BASE_OFFSET); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-04-01 12:38:26.177278312 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-04-01 12:38:30.794276090 -0800 @@ -180,7 +180,7 @@ { u32 doorbell[2]; - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { *cq->set_ci_db = cpu_to_be32(cq->cons_index); wmb(); } else { @@ -760,7 +760,7 @@ if (cq->cqn == -1) return -ENOMEM; - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { cq->arm_sn = 1; err = mthca_table_get(dev, dev->cq_table.table, cq->cqn); @@ -811,7 +811,7 @@ cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); cq_context->cqn = cpu_to_be32(cq->cqn); - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { cq_context->ci_db = cpu_to_be32(cq->set_ci_db_index); cq_context->state_db = cpu_to_be32(cq->arm_db_index); } @@ -851,11 +851,11 @@ err_out_mailbox: kfree(mailbox); - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index); err_out_ci: - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index); err_out_icm: @@ -916,7 +916,7 @@ mthca_free_mr(dev, &cq->mr); mthca_free_cq_buf(dev, cq); - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, 
cq->arm_db_index); mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index); mthca_table_put(dev, dev->cq_table.table, cq->cqn); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-01 12:38:29.460565601 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-01 12:38:30.772280864 -0800 @@ -470,4 +470,9 @@ return container_of(ibdev, struct mthca_dev, ib_dev); } +static inline int mthca_is_memfree(struct mthca_dev *dev) +{ + return dev->hca_type == ARBEL_NATIVE; +} + #endif /* MTHCA_DEV_H */ --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2005-04-01 12:38:24.575625986 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_eq.c 2005-04-01 12:38:30.799275005 -0800 @@ -198,7 +198,7 @@ static inline void set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci) { - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) arbel_set_eq_ci(dev, eq, ci); else tavor_set_eq_ci(dev, eq, ci); @@ -223,7 +223,7 @@ static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) { - if (dev->hca_type != ARBEL_NATIVE) { + if (!mthca_is_memfree(dev)) { u32 doorbell[2]; doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); @@ -535,11 +535,11 @@ MTHCA_EQ_OWNER_HW | MTHCA_EQ_STATE_ARMED | MTHCA_EQ_FLAG_TR); - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) eq_context->flags |= cpu_to_be32(MTHCA_EQ_STATE_ARBEL); eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24); - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { eq_context->arbel_pd = cpu_to_be32(dev->driver_pd.pd_num); } else { eq_context->logsize_usrpage |= cpu_to_be32(dev->driver_uar.index); @@ -686,7 +686,7 @@ mthca_base = pci_resource_start(dev->pdev, 0); - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { /* * We assume that the EQ arm and EQ set CI registers * fall within the first BAR. 
We can't trust the @@ -756,7 +756,7 @@ static void __devexit mthca_unmap_eq_regs(struct mthca_dev *dev) { - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { mthca_unmap_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) & dev->fw.arbel.eq_set_ci_base, MTHCA_EQ_SET_CI_SIZE, @@ -880,7 +880,7 @@ for (i = 0; i < MTHCA_NUM_EQ; ++i) { err = request_irq(dev->eq_table.eq[i].msi_x_vector, - dev->hca_type == ARBEL_NATIVE ? + mthca_is_memfree(dev) ? mthca_arbel_msi_x_interrupt : mthca_tavor_msi_x_interrupt, 0, eq_name[i], dev->eq_table.eq + i); @@ -890,7 +890,7 @@ } } else { err = request_irq(dev->pdev->irq, - dev->hca_type == ARBEL_NATIVE ? + mthca_is_memfree(dev) ? mthca_arbel_interrupt : mthca_tavor_interrupt, SA_SHIRQ, DRV_NAME, dev); @@ -918,7 +918,7 @@ dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); for (i = 0; i < MTHCA_EQ_CMD; ++i) - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) arbel_eq_req_not(dev, dev->eq_table.eq[i].eqn_mask); else tavor_eq_req_not(dev, dev->eq_table.eq[i].eqn); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-01 12:38:29.466564299 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-01 12:38:30.776279996 -0800 @@ -601,7 +601,7 @@ static int __devinit mthca_init_hca(struct mthca_dev *mdev) { - if (mdev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(mdev)) return mthca_init_arbel(mdev); else return mthca_init_tavor(mdev); @@ -835,7 +835,7 @@ mthca_CLOSE_HCA(mdev, 0, &status); - if (mdev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(mdev)) { mthca_free_icm_table(mdev, mdev->cq_table.table); mthca_free_icm_table(mdev, mdev->qp_table.eqp_table); mthca_free_icm_table(mdev, mdev->qp_table.qp_table); @@ -939,7 +939,7 @@ mdev->pdev = pdev; mdev->hca_type = id->driver_data; - if (mdev->hca_type == ARBEL_NATIVE && !mthca_memfree_warned++) + if (mthca_is_memfree(mdev) && !mthca_memfree_warned++) mthca_warn(mdev, "Warning: native MT25208 mode support is incomplete. 
" "Your HCA may not work properly.\n"); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-04-01 12:38:28.285820606 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-04-01 12:38:30.831268060 -0800 @@ -472,7 +472,7 @@ { int i; - if (dev->hca_type != ARBEL_NATIVE) + if (!mthca_is_memfree(dev)) return 0; dev->db_tab = kmalloc(sizeof *dev->db_tab, GFP_KERNEL); @@ -504,7 +504,7 @@ int i; u8 status; - if (dev->hca_type != ARBEL_NATIVE) + if (!mthca_is_memfree(dev)) return; /* --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c 2005-04-01 12:38:29.493558440 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c 2005-04-01 12:38:30.822270013 -0800 @@ -181,7 +181,7 @@ if (seg == -1) return -1; - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) if (mthca_table_get_range(dev, dev->mr_table.mtt_table, seg, seg + (1 << order) - 1)) { mthca_buddy_free(buddy, seg, order); @@ -196,7 +196,7 @@ { mthca_buddy_free(buddy, seg, order); - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) mthca_table_put_range(dev, dev->mr_table.mtt_table, seg, seg + (1 << order) - 1); } @@ -223,7 +223,7 @@ static inline u32 hw_index_to_key(struct mthca_dev *dev, u32 ind) { - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) return arbel_hw_index_to_key(ind); else return tavor_hw_index_to_key(ind); @@ -231,7 +231,7 @@ static inline u32 key_to_hw_index(struct mthca_dev *dev, u32 key) { - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) return arbel_key_to_hw_index(key); else return tavor_key_to_hw_index(key); @@ -254,7 +254,7 @@ return -ENOMEM; mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { err = mthca_table_get(dev, dev->mr_table.mpt_table, key); if (err) goto err_out_mpt_free; @@ -299,7 +299,7 @@ return err; err_out_table: - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) mthca_table_put(dev, 
dev->mr_table.mpt_table, key); err_out_mpt_free: @@ -329,7 +329,7 @@ return -ENOMEM; mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { err = mthca_table_get(dev, dev->mr_table.mpt_table, key); if (err) goto err_out_mpt_free; @@ -437,7 +437,7 @@ mthca_free_mtt(dev, mr->first_seg, mr->order, &dev->mr_table.mtt_buddy); err_out_table: - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) mthca_table_put(dev, dev->mr_table.mpt_table, key); err_out_mpt_free: @@ -452,7 +452,7 @@ if (order >= 0) mthca_free_mtt(dev, first_seg, order, buddy); - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) mthca_table_put(dev, dev->mr_table.mpt_table, arbel_key_to_hw_index(lkey)); @@ -498,7 +498,7 @@ return -EINVAL; /* For Arbel, all MTTs must fit in the same page. */ - if (dev->hca_type == ARBEL_NATIVE && + if (mthca_is_memfree(dev) && mr->attr.max_pages * sizeof *mr->mem.arbel.mtts > PAGE_SIZE) return -EINVAL; @@ -511,7 +511,7 @@ idx = key & (dev->limits.num_mpts - 1); mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { err = mthca_table_get(dev, dev->mr_table.mpt_table, key); if (err) goto err_out_mpt_free; @@ -534,7 +534,7 @@ mtt_seg = mr->first_seg * MTHCA_MTT_SEG_SIZE; - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { mr->mem.arbel.mtts = mthca_table_find(dev->mr_table.mtt_table, mr->first_seg); BUG_ON(!mr->mem.arbel.mtts); @@ -596,7 +596,7 @@ dev->mr_table.fmr_mtt_buddy); err_out_table: - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) mthca_table_put(dev, dev->mr_table.mpt_table, key); err_out_mpt_free: @@ -765,7 +765,7 @@ if (err) return err; - if (dev->hca_type != ARBEL_NATIVE && + if (!mthca_is_memfree(dev) && (dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) dev->limits.fmr_reserved_mtts = 0; else --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c 
2005-04-01 12:38:29.480561261 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c 2005-04-01 12:38:30.785278043 -0800 @@ -116,11 +116,11 @@ profile[i].type = i; profile[i].log_num = max(ffs(profile[i].num) - 1, 0); profile[i].size *= profile[i].num; - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) profile[i].size = max(profile[i].size, (u64) PAGE_SIZE); } - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { mem_base = 0; mem_avail = dev_lim->hca.arbel.max_icm_sz; } else { @@ -165,7 +165,7 @@ (unsigned long long) profile[i].size); } - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) mthca_dbg(dev, "HCA context memory: reserving %d KB\n", (int) (total_size >> 10)); else @@ -267,7 +267,7 @@ * out of the MR pool. They don't use additional memory, but * we assign them as part of the HCA profile anyway. */ - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) dev->limits.fmr_reserved_mtts = 0; else dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-01 12:38:29.471563214 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-01 12:38:30.780279128 -0800 @@ -625,7 +625,7 @@ if (!mdev) return 0; - if (mdev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(mdev)) { list_for_each_entry(fmr, fmr_list, list) mthca_arbel_fmr_unmap(mdev, to_mfmr(fmr)); @@ -710,7 +710,7 @@ dev->ib_dev.alloc_fmr = mthca_alloc_fmr; dev->ib_dev.unmap_fmr = mthca_unmap_fmr; dev->ib_dev.dealloc_fmr = mthca_dealloc_fmr; - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) dev->ib_dev.map_phys_fmr = mthca_arbel_map_phys_fmr; else dev->ib_dev.map_phys_fmr = mthca_tavor_map_phys_fmr; @@ -720,7 +720,7 @@ dev->ib_dev.detach_mcast = mthca_multicast_detach; dev->ib_dev.process_mad = mthca_process_mad; - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { dev->ib_dev.req_notify_cq = mthca_arbel_arm_cq; 
dev->ib_dev.post_send = mthca_arbel_post_send; dev->ib_dev.post_recv = mthca_arbel_post_receive; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-04-01 12:38:26.181277444 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-04-01 12:38:30.827268928 -0800 @@ -639,7 +639,7 @@ else if (attr_mask & IB_QP_PATH_MTU) qp_context->mtu_msgmax = (attr->path_mtu << 5) | 31; - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { qp_context->rq_size_stride = ((ffs(qp->rq.max) - 1) << 3) | (qp->rq.wqe_shift - 4); qp_context->sq_size_stride = @@ -731,7 +731,7 @@ qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { qp_context->snd_wqe_base_l = cpu_to_be32(qp->send_wqe_offset); qp_context->snd_db_index = cpu_to_be32(qp->sq.db_index); } @@ -822,7 +822,7 @@ qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) qp_context->rcv_db_index = cpu_to_be32(qp->rq.db_index); if (attr_mask & IB_QP_QKEY) { @@ -897,7 +897,7 @@ size += 2 * sizeof (struct mthca_data_seg); break; case UD: - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) size += sizeof (struct mthca_arbel_ud_seg); else size += sizeof (struct mthca_tavor_ud_seg); @@ -1016,7 +1016,7 @@ { int ret = 0; - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { ret = mthca_table_get(dev, dev->qp_table.qp_table, qp->qpn); if (ret) return ret; @@ -1057,7 +1057,7 @@ static void mthca_free_memfree(struct mthca_dev *dev, struct mthca_qp *qp) { - if (dev->hca_type == ARBEL_NATIVE) { + if (mthca_is_memfree(dev)) { mthca_free_db(dev, MTHCA_DB_TYPE_SQ, qp->sq.db_index); mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index); mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn); @@ -1104,7 +1104,7 @@ return ret; } - if (dev->hca_type == ARBEL_NATIVE) { + if 
(mthca_is_memfree(dev)) { for (i = 0; i < qp->rq.max; ++i) { wqe = get_recv_wqe(qp, i); wqe->nda_op = cpu_to_be32(((i + 1) & (qp->rq.max - 1)) << @@ -1127,7 +1127,7 @@ { int i; - if (dev->hca_type != ARBEL_NATIVE) + if (!mthca_is_memfree(dev)) return; for (i = 0; 1 << i < qp->rq.max; ++i) @@ -2011,7 +2011,7 @@ else next = get_recv_wqe(qp, index); - if (dev->hca_type == ARBEL_NATIVE) + if (mthca_is_memfree(dev)) *dbd = 1; else *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); From roland at topspin.com Fri Apr 1 12:49:54 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:54 -0800 Subject: [openib-general] [PATCH][25/27] IB/mthca: map context for RDMA responder in mem-free mode In-Reply-To: <2005411249.qaesrlpuSaCRRPRE@topspin.com> Message-ID: <2005411249.Yyk7PJUeNHG0154S@topspin.com> Fix RDMA in mem-free mode: we need to make sure that the RDMA context memory is mapped for the HCA. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-01 12:38:30.772280864 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-01 12:38:31.661087929 -0800 @@ -222,6 +222,7 @@ struct mthca_array qp; struct mthca_icm_table *qp_table; struct mthca_icm_table *eqp_table; + struct mthca_icm_table *rdb_table; }; struct mthca_av_table { --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-01 12:38:30.776279996 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-01 12:38:31.666086844 -0800 @@ -430,14 +430,25 @@ goto err_unmap_qp; } - mdev->cq_table.table = mthca_alloc_icm_table(mdev, init_hca->cqc_base, + mdev->qp_table.rdb_table = mthca_alloc_icm_table(mdev, init_hca->rdb_base, + MTHCA_RDB_ENTRY_SIZE, + mdev->limits.num_qps << + mdev->qp_table.rdb_shift, + 0, 0); + if (!mdev->qp_table.rdb_table) { + mthca_err(mdev, "Failed to map RDB context memory, aborting\n"); + err = -ENOMEM; + goto err_unmap_eqp; + } + + mdev->cq_table.table = mthca_alloc_icm_table(mdev, 
init_hca->cqc_base, dev_lim->cqc_entry_sz, mdev->limits.num_cqs, mdev->limits.reserved_cqs, 0); if (!mdev->cq_table.table) { mthca_err(mdev, "Failed to map CQ context memory, aborting.\n"); err = -ENOMEM; - goto err_unmap_eqp; + goto err_unmap_rdb; } /* @@ -463,6 +474,9 @@ err_unmap_cq: mthca_free_icm_table(mdev, mdev->cq_table.table); +err_unmap_rdb: + mthca_free_icm_table(mdev, mdev->qp_table.rdb_table); + err_unmap_eqp: mthca_free_icm_table(mdev, mdev->qp_table.eqp_table); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-04-01 12:38:30.827268928 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-04-01 12:38:31.673085325 -0800 @@ -1025,11 +1025,16 @@ if (ret) goto err_qpc; + ret = mthca_table_get(dev, dev->qp_table.rdb_table, + qp->qpn << dev->qp_table.rdb_shift); + if (ret) + goto err_eqpc; + qp->rq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_RQ, qp->qpn, &qp->rq.db); if (qp->rq.db_index < 0) { ret = -ENOMEM; - goto err_eqpc; + goto err_rdb; } qp->sq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_SQ, @@ -1045,6 +1050,10 @@ err_rq_db: mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index); +err_rdb: + mthca_table_put(dev, dev->qp_table.rdb_table, + qp->qpn << dev->qp_table.rdb_shift); + err_eqpc: mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn); @@ -1060,6 +1069,8 @@ if (mthca_is_memfree(dev)) { mthca_free_db(dev, MTHCA_DB_TYPE_SQ, qp->sq.db_index); mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index); + mthca_table_put(dev, dev->qp_table.rdb_table, + qp->qpn << dev->qp_table.rdb_shift); mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn); mthca_table_put(dev, dev->qp_table.qp_table, qp->qpn); } From roland at topspin.com Fri Apr 1 12:49:54 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:54 -0800 Subject: [openib-general] [PATCH][26/27] IB/mthca: update receive queue initialization for new HCAs In-Reply-To: <2005411249.Yyk7PJUeNHG0154S@topspin.com> Message-ID: 
<2005411249.gE8d9QQAmCCNZRp6@topspin.com> Update initialization of receive queue to match new documentation. This change is required to support new MT25204 HCA. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-04-01 12:38:31.673085325 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-04-01 12:38:32.124987229 -0800 @@ -181,6 +181,10 @@ MTHCA_MLX_SLR = 1 << 16 }; +enum { + MTHCA_INVAL_LKEY = 0x100 +}; + struct mthca_next_seg { u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ @@ -1093,7 +1097,6 @@ enum ib_sig_type send_policy, struct mthca_qp *qp) { - struct mthca_next_seg *wqe; int ret; int i; @@ -1116,18 +1119,28 @@ } if (mthca_is_memfree(dev)) { + struct mthca_next_seg *next; + struct mthca_data_seg *scatter; + int size = (sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg)) / 16; + for (i = 0; i < qp->rq.max; ++i) { - wqe = get_recv_wqe(qp, i); - wqe->nda_op = cpu_to_be32(((i + 1) & (qp->rq.max - 1)) << - qp->rq.wqe_shift); - wqe->ee_nds = cpu_to_be32(1 << (qp->rq.wqe_shift - 4)); + next = get_recv_wqe(qp, i); + next->nda_op = cpu_to_be32(((i + 1) & (qp->rq.max - 1)) << + qp->rq.wqe_shift); + next->ee_nds = cpu_to_be32(size); + + for (scatter = (void *) (next + 1); + (void *) scatter < (void *) next + (1 << qp->rq.wqe_shift); + ++scatter) + scatter->lkey = cpu_to_be32(MTHCA_INVAL_LKEY); } for (i = 0; i < qp->sq.max; ++i) { - wqe = get_send_wqe(qp, i); - wqe->nda_op = cpu_to_be32((((i + 1) & (qp->sq.max - 1)) << - qp->sq.wqe_shift) + - qp->send_wqe_offset); + next = get_send_wqe(qp, i); + next->nda_op = cpu_to_be32((((i + 1) & (qp->sq.max - 1)) << + qp->sq.wqe_shift) + + qp->send_wqe_offset); } } @@ -1986,7 +1999,7 @@ if (i < qp->rq.max_gs) { ((struct mthca_data_seg *) wqe)->byte_count = 0; - ((struct mthca_data_seg *) wqe)->lkey = cpu_to_be32(0x100); + ((struct mthca_data_seg *) wqe)->lkey = 
cpu_to_be32(MTHCA_INVAL_LKEY); ((struct mthca_data_seg *) wqe)->addr = 0; } From roland at topspin.com Fri Apr 1 12:49:54 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 1 Apr 2005 12:49:54 -0800 Subject: [openib-general] [PATCH][27/27] IB/mthca: add support for new MT25204 HCA In-Reply-To: <2005411249.gE8d9QQAmCCNZRp6@topspin.com> Message-ID: <2005411249.RHQWyM8AFcqb1PM4@topspin.com> Decouple table of HCA features from exact HCA device type. Add a current FW version field so we can warn when someone is using old FW. Add support for new MT25204 HCA. Remove the warning about mem-free support, since it should be pretty solid at this point. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-01 12:38:31.661087929 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-01 12:38:32.606882623 -0800 @@ -49,20 +49,15 @@ #define DRV_VERSION "0.06-pre" #define DRV_RELDATE "November 8, 2004" -/* Types of supported HCA */ -enum { - TAVOR, /* MT23108 */ - ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ - ARBEL_NATIVE /* MT25208 with extended features */ -}; - enum { MTHCA_FLAG_DDR_HIDDEN = 1 << 1, MTHCA_FLAG_SRQ = 1 << 2, MTHCA_FLAG_MSI = 1 << 3, MTHCA_FLAG_MSI_X = 1 << 4, MTHCA_FLAG_NO_LAM = 1 << 5, - MTHCA_FLAG_FMR = 1 << 6 + MTHCA_FLAG_FMR = 1 << 6, + MTHCA_FLAG_MEMFREE = 1 << 7, + MTHCA_FLAG_PCIE = 1 << 8 }; enum { @@ -473,7 +468,7 @@ static inline int mthca_is_memfree(struct mthca_dev *dev) { - return dev->hca_type == ARBEL_NATIVE; + return dev->mthca_flags & MTHCA_FLAG_MEMFREE; } #endif /* MTHCA_DEV_H */ --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-01 12:38:31.666086844 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-01 12:38:32.611881538 -0800 @@ -103,7 +103,7 @@ "aborting.\n"); return -ENODEV; } - } else if (mdev->hca_type == TAVOR) + } else if (!(mdev->mthca_flags & MTHCA_FLAG_PCIE)) mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); 
cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); @@ -119,8 +119,7 @@ "register, aborting.\n"); return -ENODEV; } - } else if (mdev->hca_type == ARBEL_NATIVE || - mdev->hca_type == ARBEL_COMPAT) + } else if (mdev->mthca_flags & MTHCA_FLAG_PCIE) mthca_info(mdev, "No PCI Express capability, " "not setting Max Read Request Size.\n"); @@ -438,7 +437,7 @@ if (!mdev->qp_table.rdb_table) { mthca_err(mdev, "Failed to map RDB context memory, aborting\n"); err = -ENOMEM; - goto err_unmap_eqp; + goto err_unmap_rdb; } mdev->cq_table.table = mthca_alloc_icm_table(mdev, init_hca->cqc_base, @@ -593,6 +592,7 @@ err_free_icm: mthca_free_icm_table(mdev, mdev->cq_table.table); + mthca_free_icm_table(mdev, mdev->qp_table.rdb_table); mthca_free_icm_table(mdev, mdev->qp_table.eqp_table); mthca_free_icm_table(mdev, mdev->qp_table.qp_table); mthca_free_icm_table(mdev, mdev->mr_table.mpt_table); @@ -851,6 +851,7 @@ if (mthca_is_memfree(mdev)) { mthca_free_icm_table(mdev, mdev->cq_table.table); + mthca_free_icm_table(mdev, mdev->qp_table.rdb_table); mthca_free_icm_table(mdev, mdev->qp_table.eqp_table); mthca_free_icm_table(mdev, mdev->qp_table.qp_table); mthca_free_icm_table(mdev, mdev->mr_table.mpt_table); @@ -869,11 +870,32 @@ mthca_SYS_DIS(mdev, &status); } +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE, /* MT25208 with extended features */ + SINAI /* MT25204 */ +}; + +#define MTHCA_FW_VER(major, minor, subminor) \ + (((u64) (major) << 32) | ((u64) (minor) << 16) | (u64) (subminor)) + +static struct { + u64 latest_fw; + int is_memfree; + int is_pcie; +} mthca_hca_table[] = { + [TAVOR] = { .latest_fw = MTHCA_FW_VER(3, 3, 2), .is_memfree = 0, .is_pcie = 0 }, + [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 6, 2), .is_memfree = 0, .is_pcie = 1 }, + [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 0, 1), .is_memfree = 1, .is_pcie = 1 }, + [SINAI] = { .latest_fw = MTHCA_FW_VER(1, 0, 1), .is_memfree = 1, 
.is_pcie = 1 } +}; + static int __devinit mthca_init_one(struct pci_dev *pdev, const struct pci_device_id *id) { static int mthca_version_printed = 0; - static int mthca_memfree_warned = 0; int ddr_hidden = 0; int err; struct mthca_dev *mdev; @@ -886,6 +908,12 @@ printk(KERN_INFO PFX "Initializing %s (%s)\n", pci_pretty_name(pdev), pci_name(pdev)); + if (id->driver_data >= ARRAY_SIZE(mthca_hca_table)) { + printk(KERN_ERR PFX "%s (%s) has invalid driver data %lx\n", + pci_pretty_name(pdev), pci_name(pdev), id->driver_data); + return -ENODEV; + } + err = pci_enable_device(pdev); if (err) { dev_err(&pdev->dev, "Cannot enable PCI device, " @@ -950,15 +978,14 @@ goto err_free_res; } - mdev->pdev = pdev; - mdev->hca_type = id->driver_data; - - if (mthca_is_memfree(mdev) && !mthca_memfree_warned++) - mthca_warn(mdev, "Warning: native MT25208 mode support is incomplete. " - "Your HCA may not work properly.\n"); + mdev->pdev = pdev; if (ddr_hidden) mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + if (mthca_hca_table[id->driver_data].is_memfree) + mdev->mthca_flags |= MTHCA_FLAG_MEMFREE; + if (mthca_hca_table[id->driver_data].is_pcie) + mdev->mthca_flags |= MTHCA_FLAG_PCIE; /* * Now reset the HCA before we touch the PCI capabilities or @@ -997,6 +1024,16 @@ if (err) goto err_iounmap; + if (mdev->fw_ver < mthca_hca_table[id->driver_data].latest_fw) { + mthca_warn(mdev, "HCA FW version %x.%x.%x is old (%x.%x.%x is current).\n", + (int) (mdev->fw_ver >> 32), (int) (mdev->fw_ver >> 16) & 0xffff, + (int) (mdev->fw_ver & 0xffff), + (int) (mthca_hca_table[id->driver_data].latest_fw >> 32), + (int) (mthca_hca_table[id->driver_data].latest_fw >> 16) & 0xffff, + (int) (mthca_hca_table[id->driver_data].latest_fw & 0xffff)); + mthca_warn(mdev, "If you have problems, try updating your HCA FW.\n"); + } + err = mthca_setup_hca(mdev); if (err) goto err_close; @@ -1112,6 +1149,14 @@ .driver_data = ARBEL_NATIVE }, { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), .driver_data = 
ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_SINAI), + .driver_data = SINAI }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_SINAI), + .driver_data = SINAI }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_SINAI_OLD), + .driver_data = SINAI }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_SINAI_OLD), + .driver_data = SINAI }, { 0, } }; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-01 12:38:30.780279128 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-01 12:38:32.615880670 -0800 @@ -659,11 +659,18 @@ static ssize_t show_hca(struct class_device *cdev, char *buf) { struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); - switch (dev->hca_type) { - case TAVOR: return sprintf(buf, "MT23108\n"); - case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); - case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); - default: return sprintf(buf, "unknown\n"); + switch (dev->pdev->device) { + case PCI_DEVICE_ID_MELLANOX_TAVOR: + return sprintf(buf, "MT23108\n"); + case PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT: + return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case PCI_DEVICE_ID_MELLANOX_ARBEL: + return sprintf(buf, "MT25208\n"); + case PCI_DEVICE_ID_MELLANOX_SINAI: + case PCI_DEVICE_ID_MELLANOX_SINAI_OLD: + return sprintf(buf, "MT25204\n"); + default: + return sprintf(buf, "unknown\n"); } } --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_reset.c 2005-03-31 19:06:41.000000000 -0800 +++ linux-export/drivers/infiniband/hw/mthca/mthca_reset.c 2005-04-01 12:38:32.594885228 -0800 @@ -63,7 +63,7 @@ * header as well. */ - if (mdev->hca_type == TAVOR) { + if (!(mdev->mthca_flags & MTHCA_FLAG_PCIE)) { /* Look for the bridge -- its device ID will be 2 more than HCA's device ID. 
*/ while ((bridge = pci_get_device(mdev->pdev->vendor,

From tduffy at sun.com Fri Apr 1 13:02:18 2005
From: tduffy at sun.com (Tom Duffy)
Date: Fri, 01 Apr 2005 13:02:18 -0800
Subject: [openib-general] [PATCH][MTHCA] add in SINAI defines into mtcha code WAS: [openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches
In-Reply-To: <20050401184346.GD11094@esmail.cup.hp.com>
References: <20050331204331.4320C2283D9@openib.ca.sandia.gov> <1112379853.18939.11.camel@duffman> <20050401184346.GD11094@esmail.cup.hp.com>
Message-ID: <1112389338.14094.7.camel@duffman>

On Fri, 2005-04-01 at 10:43 -0800, Grant Grundler wrote:
> No - I think Roland is doing the right thing with a separate patch.
> I ran into the same issue since I'm still poking at 2.6.11.

I thought the consensus was that openib gen2 trunk would always build
against the latest stable 2.6.x kernel, in this case 2.6.11 which
doesn't have the SINAI defines.

-tduffy

From iod00d at hp.com Fri Apr 1 13:32:49 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 1 Apr 2005 13:32:49 -0800
Subject: [openib-general] [PATCH][MTHCA] add in SINAI defines into mtcha code WAS: [openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches
In-Reply-To: <1112389338.14094.7.camel@duffman>
References: <20050331204331.4320C2283D9@openib.ca.sandia.gov> <1112379853.18939.11.camel@duffman> <20050401184346.GD11094@esmail.cup.hp.com> <1112389338.14094.7.camel@duffman>
Message-ID: <20050401213249.GF11094@esmail.cup.hp.com>

On Fri, Apr 01, 2005 at 01:02:18PM -0800, Tom Duffy wrote:
> I thought the consensus was that openib gen2 trunk would always build
> against the latest stable 2.6.x kernel, in this case 2.6.11 which
> doesn't have the SINAI defines.

hrm...you are right. Forgot about that.
Can we add the SINAI #defines to a local "compat.h" file?

grant

From peter at pantasys.com Fri Apr 1 13:45:14 2005
From: peter at pantasys.com (Peter Buckingham)
Date: Fri, 01 Apr 2005 13:45:14 -0800
Subject: [openib-general] uverbs and OSU MPI/MPI in general?
In-Reply-To: <200504012013.j31KDqus007452@xi.cse.ohio-state.edu>
References: <200504012013.j31KDqus007452@xi.cse.ohio-state.edu>
Message-ID: <424DC0EA.3070701@pantasys.com>

Dhabaleswar Panda wrote:
> Peter,
>
>
>> Peter> Hi All, How does gen2's uverbs compare to VAPI? Is it meant
>> Peter> to be the same API? Should OSU's MPI run on top of this or
>> Peter> is there some other MPI implementation that will be able to
>> Peter> run 'natively' over IB?
>>
>>The basic functionality is the same but the API is different. For
>>example completion events are handled in a different way that allows
>>better performance.
>>
>>None of the current MPI implementations that use IB will run
>>unmodified, but everyone (including OSU) is porting to the new API.
>
>
> We have already started working on porting OSU MPI to the Gen2 stack.
>
> We plan to release MVAPICH 0.9.5 (on VAPI stack) during the next 1-2
> weeks. After that we will make a subsequent release of 0.9.5 on the
> OpenIB Gen2 stack.

excellent! thanks for the info.

peter

From roland at topspin.com Fri Apr 1 14:01:19 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 01 Apr 2005 14:01:19 -0800
Subject: [openib-general] [PATCH][MTHCA] add in SINAI defines into mtcha code WAS: [openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches
In-Reply-To: <20050401213249.GF11094@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 1 Apr 2005 13:32:49 -0800")
References: <20050331204331.4320C2283D9@openib.ca.sandia.gov> <1112379853.18939.11.camel@duffman> <20050401184346.GD11094@esmail.cup.hp.com> <1112389338.14094.7.camel@duffman> <20050401213249.GF11094@esmail.cup.hp.com>
Message-ID: <52ll82i2c0.fsf@topspin.com>

I could go either way on this.
The point of the patches directory is to make the
trunk build against the current 2.6.11 tree. On the other hand, Tom's
patch doesn't break anything (since the new symbols are inside an
#ifdef) and I know I can trust Tom to remember to take it out once
2.6.12 comes out.

So I think I'll go ahead and commit this change.

 - R.

From roland at topspin.com Fri Apr 1 14:06:50 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 01 Apr 2005 14:06:50 -0800
Subject: [openib-general] [PATCH][26.5/27] Add MT25204 PCI IDs
In-Reply-To: <2005411249.RHQWyM8AFcqb1PM4@topspin.com> (Roland Dreier's message of "Fri, 1 Apr 2005 12:49:54 -0800")
References: <2005411249.RHQWyM8AFcqb1PM4@topspin.com>
Message-ID: <52hdiqi22t.fsf@topspin.com>

Ugh, this patch is required to build support for the new Mellanox
HCAs. Greg K-H applied it to his tree a while ago but it hasn't made
it to Linus yet.

Sorry,
  Roland

Add PCI device IDs for new Mellanox MT25204 "Sinai" InfiniHost III Lx HCA.

Signed-off-by: Roland Dreier

--- linux-export.orig/include/linux/pci_ids.h 2005-03-31 19:07:14.000000000 -0800
+++ linux-export/include/linux/pci_ids.h 2005-04-01 14:03:16.468519075 -0800
@@ -2122,6 +2122,8 @@
 #define PCI_DEVICE_ID_MELLANOX_TAVOR 0x5a44
 #define PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT 0x6278
 #define PCI_DEVICE_ID_MELLANOX_ARBEL 0x6282
+#define PCI_DEVICE_ID_MELLANOX_SINAI_OLD 0x5e8c
+#define PCI_DEVICE_ID_MELLANOX_SINAI 0x6274
 
 #define PCI_VENDOR_ID_PDC 0x15e9
 #define PCI_DEVICE_ID_PDC_1841 0x1841

From mshefty at ichips.intel.com Fri Apr 1 14:12:40 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 01 Apr 2005 14:12:40 -0800
Subject: [openib-general] [RMPP] RMPP formatting assumptions
In-Reply-To: <424C92CE.7040709@ichips.intel.com>
References: <42488FDF.2050608@ichips.intel.com> <424C92CE.7040709@ichips.intel.com>
Message-ID: <424DC758.4040504@ichips.intel.com>

Sean Hefty wrote:
> The payload field in the RMPP header should be set to the size of the
> class specific header plus the
number of valid bytes of user data in the
> data buffer. The RMPP code will adjust the payload value to account for
> multiple headers.

Doing this brings up the issue with the byte-ordering of the payload
value set by the user. On one hand, the value is used to communicate
with the RMPP code, so could be in host-order. But on the other hand,
the value is in a data structure where all of the other fields are in
network-byte order...

I'm leaning towards network-byte order, which would set the payload to
the correct value for a single-segment RMPP MAD.

- Sean

From libor at topspin.com Fri Apr 1 14:54:14 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 1 Apr 2005 14:54:14 -0800
Subject: [openib-general] Re: problem with SDP/AIO on mem-free HCA
In-Reply-To: <528y42laxk.fsf@topspin.com>; from roland@topspin.com on Fri, Apr 01, 2005 at 08:27:19AM -0800
References: <001301c5367f$e86a8a50$1802a8c0@infiniconsys.com> <528y42laxk.fsf@topspin.com>
Message-ID: <20050401145414.B2870@topspin.com>

On Fri, Apr 01, 2005 at 08:27:19AM -0800, Roland Dreier wrote:
> Fab> If you are blessed with a Tavor PRM, see section 8.2.1.6 (in
> Fab> PRM 1.0.0). It states that a length of zero in a data
> Fab> segment indicates a 2GB transfer (MSb is used as a flag to
> Fab> indicate normal vs. inline data segments). A zero-byte
> Fab> request must not reference any data segments.
>
> Yup, that must be the problem. I guess mthca can skip over 0-length
> data segments. Another option would be to say that such work requests
> aren't allowed. Not sure which way I think we should go. I need to
> talk to Libor and find out why SDP is generating such requests.

Roland,

  Can you try this patch? It should close a gap to prevent a zero
length IOCB from getting into the receive data path.

Thanks.
-Libor Index: sdp_recv.c =================================================================== --- sdp_recv.c (revision 2094) +++ sdp_recv.c (working copy) @@ -297,13 +297,13 @@ * if there is no more advertised space, queue the * advertisment for completion */ - if (advt->size <= 0) + if (!advt->size) sdp_advt_q_put(&conn->src_actv, sdp_advt_q_get(&conn->src_pend)); /* * if there is no more iocb space queue the it for completion */ - if (iocb->len <= 0) { + if (!iocb->len) { iocb = sdp_iocb_q_get_head(&conn->r_pend); if (!iocb) { sdp_dbg_warn(conn, "read IOCB disappeared. <%d>", @@ -1368,26 +1371,11 @@ * RDMA advertisements are checked to determine if remote * data is pending and accessible. */ - if (!(copied < low_water) && - !conn->src_recv) { -#if 0 /* performance cheat. LM */ - if (!(conn->snk_zthresh > size)) { + if (copied == size) + break; - conn->nond_recv--; - - result = sdp_send_ctrl_snk_avail(conn, - 0, 0, 0); - if (result < 0) { - /* - * since the message did not go out, - * back out the non_discard counter - */ - conn->nond_recv++; - } - } -#endif + if (!(copied < low_water) && !conn->src_recv) break; - } /* * check connection errors, and then wait for more data. * check status. POSIX 1003.1g order. 
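
For comparison, the other option Roland mentions in the quoted text (having the driver skip over 0-length data segments) might look roughly like the sketch below. This is only an illustration: `struct sge` and `compact_sge_list()` are made-up stand-ins, not the mthca or ib_verbs definitions, and the real driver would do this while building the WQE. The idea is simply that zero-length entries are dropped before any data segments are written, so a zero-byte request ends up referencing no data segments at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scatter/gather entry, for illustration only. */
struct sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/*
 * Drop zero-length entries from the list in place, preserving the
 * order of the remaining entries.  Returns the new entry count; a
 * return value of 0 means the request carries no data segments.
 */
static size_t compact_sge_list(struct sge *list, size_t n)
{
	size_t in, out = 0;

	for (in = 0; in < n; ++in) {
		if (list[in].length == 0)
			continue;	/* would otherwise be read as a 2GB segment */
		list[out++] = list[in];
	}

	return out;
}
```

A caller would then post the request with the returned count as its number of gather entries.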
From halr at voltaire.com Fri Apr 1 14:56:00 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Apr 2005 17:56:00 -0500
Subject: [openib-general] [PATCH] FMR support in mthca
In-Reply-To: <20050331134116.B1541@topspin.com>
References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> <20050330010814.GB24794@esmail.cup.hp.com> <20050329181228.H31683@topspin.com> <20050330032904.GA24936@esmail.cup.hp.com> <20050330164349.B32764@topspin.com> <1112304328.7331.42.camel@localhost.localdomain> <20050331134116.B1541@topspin.com>
Message-ID: <1112395932.4476.217.camel@localhost.localdomain>

On Thu, 2005-03-31 at 16:41, Libor Michalek wrote:
> On Thu, Mar 31, 2005 at 04:25:28PM -0500, Hal Rosenstock wrote:
> > On Wed, 2005-03-30 at 19:43, Libor Michalek wrote:
> > > The program has a decent help for available parameters, but here are
> > > some reasonable defaults:
> > >
> > > server:
> > >
> > > ./ttcp.aio.x -r -l 65536 -a 20
> > >
> > > client:
> > >
> > > ./ttcp.aio.x -t -l 65536 -n 100000 -a 20 192.168.0.100
> >
> > Are these the parameters used to achieve the throughput numbers you
> > published ?
> >
> > Sounds like you tweaked the numbers in sdp_dev.h. Anywhere else ?
> >
> > Can you provide the tuning numbers used and where they were found so these
> > results can be reproduced ?
>
> No tweaking or changes to the SDP code itself. The parameters above
> should give similar results, but here are the exact parameters I used
> for the two async tests I mentioned in the original results I posted.
>
> > > For async socket I kept 20 96K buffers in flight. For the FMR pool cache
> > > hit async results I used only 20 different buffers.
>
> ./ttcp.aio.x -r -l 98304 -a 20 -f M
> ./ttcp.aio.x -t -l 98304 -n 200000 -a 20 -f M 192.168.0.100
>
> > > For the FMR pool cache miss async results I used 1000 different
> > > buffers, of which only 20 were in flight at a time.
>
> ./ttcp.aio.x -r -l 98304 -a 20 -x 1000 -f M
> ./ttcp.aio.x -t -l 98304 -n 200000 -a 20 -x 1000 -f M 192.168.0.100

We are seeing issues with both buffer size and iterations. We get back
-ENOMEM and also see VMA lock errors. Are the 2 related ? Should we turn
on SDP debug to see what specifically can't be allocated ? In that case,
what could be done ?

When using the default parameters, we see the following:

On the server:
[root at openib1 ~]# ./ttcp.aio.x -r -l 65536 -a 20
ttcp-r: buflen = 65536 nbuf = 0 align = 16384/0 port = 5001
ttcp-r: socket
ttcp-r: accept from 192.168.1.4
ttcp-r: Event error <-12> <5275648>
ttcp-r: 0 bytes in 0.00 real seconds = 0.00 Mbit/sec +++
ttcp-r: 2 I/O calls, usec/call = 112.00, calls/sec = 8928.57
ttcp-r: user: 0 sys: 0 total: 0 real: 224 (microseconds)
[root at openib1 ~]#

On the client:
[root at openib2 ~]# ./ttcp.aio.x -t -l 65536 -n 100000 -a 20 192.168.1.3
ttcp-t: buflen = 65536 nbuf = 100000 align = 16384/0 port = 5001 192.168.1.3
ttcp-t: socket
ttcp-t: connect
ttcp-t: Event error <-12> <5275648>
ttcp-t: 0 bytes in 0.00 real seconds = 0.00 Mbit/sec +++
ttcp-t: 2 I/O calls, usec/call = 83.00, calls/sec = 12048.19
ttcp-t: user: 0 sys: 0 total: 0 real: 166 (microseconds)
[root at openib2 ~]#

Here's the output from the dmesg on the server:
ERR: : VMA lock <620000:65536> error <-12> <16:0:8>
ERR: : VMA lock <634000:65536> error <-12> <16:0:8>
ERR: : VMA lock <648000:65536> error <-12> <16:0:8>
......

Here's the output from the dmesg (client):
ERR: : VMA lock <580000:65536> error <-12> <16:0:8>
ERR: : VMA lock <594000:65536> error <-12> <16:0:8>
ERR: : VMA lock <5a8000:65536> error <-12> <16:0:8>
......

If the value of -l (length of network read/write buffers) is reduced,
it runs (up to a buffer size of 4K).
However, there still is dmesg output on the server side: Here's the output from the dmesg on the server: ERR: : VMA lock <550000:1024> error <-12> <1:8:8> ERR: : VMA lock <554000:1024> error <-12> <1:8:8> ERR: : VMA lock <558000:1024> error <-12> <1:8:8> WARN: : Cancel read with no IOCB. <2:0:00000005> WARN: : Cancel read with no IOCB. <2:0:00000005> ERR: : VMA lock <528000:1024> error <-12> <1:8:8> ERR: : VMA lock <52c000:1024> error <-12> <1:8:8> ...... Is this related to system configuration somehow ? How much system memory in your machines ? Is this a factor ? Thanks. -- Hal From libor at topspin.com Fri Apr 1 15:07:50 2005 From: libor at topspin.com (Libor Michalek) Date: Fri, 1 Apr 2005 15:07:50 -0800 Subject: [openib-general] [PATCH] FMR support in mthca In-Reply-To: <1112395932.4476.217.camel@localhost.localdomain>; from halr@voltaire.com on Fri, Apr 01, 2005 at 05:56:00PM -0500 References: <20050327153112.GA26108@mellanox.co.il> <20050328170351.B30499@topspin.com> <20050330010814.GB24794@esmail.cup.hp.com> <20050329181228.H31683@topspin.com> <20050330032904.GA24936@esmail.cup.hp.com> <20050330164349.B32764@topspin.com> <1112304328.7331.42.camel@localhost.localdomain> <20050331134116.B1541@topspin.com> <1112395932.4476.217.camel@localhost.localdomain> Message-ID: <20050401150750.C2870@topspin.com> On Fri, Apr 01, 2005 at 05:56:00PM -0500, Hal Rosenstock wrote: > On Thu, 2005-03-31 at 16:41, Libor Michalek wrote: > > > > ./ttcp.aio.x -r -l 98304 -a 20 -f M > > ./ttcp.aio.x -t -l 98304 -n 200000 -a 20 -f M 192.168.0.100 > > > > > > For the FMR pool cache miss async results I used 1000 different > > > > buffers, of which only 20 were in flight at a time. > > > > ./ttcp.aio.x -r -l 98304 -a 20 -x 1000 -f M > > ./ttcp.aio.x -t -l 98304 -n 200000 -a 20 -x 1000 -f M 192.168.0.100 > > We are seeing issues with both buffer size and iterations. We get back > -ENOMEM and also see VMA lock errors. Are the 2 related ? 
Should we turn
> on SDP debug to see what specifically can't be allocated ? In that case,
> what could be done ?

Hal,

  You need to increase the amount of memory that the user is allowed
to lock. Run the following command in each shell from which you are
running ttcp:

  limit memorylocked unlimited

In 2.6, mlock() looks at the rlimits for the process executing the lock.
I decided not to artificially increase the limit while locking, instead
relying on the user/admin to set the appropriate value. The default is
pretty small.

-Libor

From halr at voltaire.com Fri Apr 1 15:21:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Apr 2005 18:21:24 -0500
Subject: [openib-general] [PATCH] FMR support in mthca
In-Reply-To: <20050401150750.C2870@topspin.com>
References: <20050327153112.GA26108@mellanox.co.il>
 <20050328170351.B30499@topspin.com>
 <20050330010814.GB24794@esmail.cup.hp.com>
 <20050329181228.H31683@topspin.com>
 <20050330032904.GA24936@esmail.cup.hp.com>
 <20050330164349.B32764@topspin.com>
 <1112304328.7331.42.camel@localhost.localdomain>
 <20050331134116.B1541@topspin.com>
 <1112395932.4476.217.camel@localhost.localdomain>
 <20050401150750.C2870@topspin.com>
Message-ID: <1112397606.4476.232.camel@localhost.localdomain>

On Fri, 2005-04-01 at 18:07, Libor Michalek wrote:
> You need to increase the amount of memory that the user is allowed
> to lock. Run the following command in each shell from which you are
> running ttcp:
>
> limit memorylocked unlimited
>
> In 2.6, mlock() looks at the rlimits for the process executing the lock.
> I decided not to artificially increase the limit while locking, instead
> relying on the user/admin to set the appropriate value. The default is
> pretty small.

Thanks. Should this go into an SDP FAQ ? Perhaps the alternatives for
the SDP protocol family should go there as well. Anything else ?
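
For reference, the limit discussed above is RLIMIT_MEMLOCK. The
"limit memorylocked unlimited" form is csh/tcsh syntax; in Bourne-style
shells such as bash the equivalent is "ulimit -l unlimited". What follows
is only a minimal sketch, assuming a Linux system, of how a program could
inspect that limit and raise its own soft limit up to the hard limit. The
function names are made up for illustration and are not part of ttcp or
the SDP code.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Read the locked-memory limits of the calling process.
 * Returns 0 on success, -1 on error (errno is set by getrlimit). */
int get_memlock_limit(struct rlimit *rl)
{
    assert(rl != NULL);
    return getrlimit(RLIMIT_MEMLOCK, rl);
}

/* Raise the soft limit to the hard limit.  An unprivileged process may
 * do this; raising the hard limit itself requires privilege, which is
 * why the shell command is normally run as root or set by the admin. */
int raise_memlock_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_MEMLOCK, &rl);
}

/* Print one limit value in bytes, or "unlimited". */
static void print_one(const char *name, rlim_t value)
{
    if (value == RLIM_INFINITY)
        printf("%s: unlimited\n", name);
    else
        printf("%s: %llu bytes\n", name, (unsigned long long)value);
}

/* Print the soft and hard locked-memory limits. */
void print_memlock_limit(void)
{
    struct rlimit rl;

    if (get_memlock_limit(&rl) == 0) {
        print_one("soft", rl.rlim_cur);
        print_one("hard", rl.rlim_max);
    }
}
```

Calling print_memlock_limit() before and after raise_memlock_soft_limit()
shows whether the hard limit is itself too small for twenty 64K or 96K
buffers, in which case only the admin-level setting will help.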
-- Hal

From iod00d at hp.com Fri Apr 1 17:43:03 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 1 Apr 2005 17:43:03 -0800
Subject: [openib-general] [PATCH] libsdp debug output
Message-ID: <20050402014303.GK11094@esmail.cup.hp.com>

Hi,
I found this output from libsdp.so less than helpful:
  default libsdp configuration is used

First, I need a clue what the default location is if I just want
to hack the file. I was expecting it to live in /etc/libsdp.conf
based on email describing gen1 in 2004:
  http://openib.org/pipermail/openib-general/2004-June/003222.html

Secondly, the "make install" puts the libsdp.so in /usr/local/etc
by default. That's ok if the lib tells me that.

A future enhancement would be to *always* print the path if a
"verbose=1" (or something) exists in the .conf file. At some point,
customers don't want to know. I don't mind it since it's a good
reminder when I'm testing.

thanks,
grant

Signed-off-by: Grant Grundler

Index: src/userspace/libsdp/src/port.c
===================================================================
--- src/userspace/libsdp/src/port.c (revision 2103)
+++ src/userspace/libsdp/src/port.c (working copy)
@@ -1202,8 +1202,10 @@ void __sdp_init(
 	if (config_file) {
 		__sdp_read_config(config_file);
 	} else {
-		printf("default libsdp configuration is used\n");
-#define LIBSDP_DEFAULT_CONFIG_FILE "/usr/local/ibgd/etc/libsdp.conf"
+/* #define LIBSDP_DEFAULT_CONFIG_FILE "/etc/libsdp.conf" */
+#define LIBSDP_DEFAULT_CONFIG_FILE "/usr/local/etc/libsdp.conf"
+		printf("libsdp.so: $LIBSDP_CONFIG_FILE not set.
Using " + LIBSDP_DEFAULT_CONFIG_FILE "\n"); __sdp_read_config(LIBSDP_DEFAULT_CONFIG_FILE); } } /* __sdp_init */ From iod00d at hp.com Fri Apr 1 17:51:29 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 1 Apr 2005 17:51:29 -0800 Subject: [openib-general] [PATCH] improve tvflash output (verbose and error cases) Message-ID: <20050402015129.GM11094@esmail.cup.hp.com> Roland, This patch adds more output when "-v" is specified and adds details to output from some of the error cases. BTW, this is the version of code I used to flash all my cards to v3.3.2 on rx2600 and rx4640 (HP IA64/ZX1 boxen). thanks, grant Signed-off-by: Grant Grundler Index: src/userspace/tvflash/src/tvflash.c =================================================================== --- src/userspace/tvflash/src/tvflash.c (revision 2103) +++ src/userspace/tvflash/src/tvflash.c (working copy) @@ -460,6 +460,9 @@ static int open_hca(struct tvdevice *tvd { cur_hca = tvdev->pdev; + if (verbose) + fprintf(stderr, "open_hca(%d)\n", tvdev->num); + if (!config && tvdev->can_map) { int fd = open("/dev/mem", O_RDWR, O_SYNC); if (fd < 0) { @@ -485,6 +488,9 @@ static int open_hca(struct tvdevice *tvd static void close_hca(void) { + if (verbose) + fprintf(stderr, "close_hca()\n"); + if (bar0) munmap(bar0, 1 << 20); } @@ -563,6 +569,9 @@ static void flash_write_cmd(unsigned int static void flash_chip_reset(void) { + if (verbose) + fprintf(stderr, "flash_chip_reset()\n"); + /* Issue Flash Reset Command*/ flash_write_cmd(0x0, 0xf0); } @@ -784,6 +793,9 @@ static int flash_image_read_from_file(ch char *buf; unsigned int sector_sz; + if (verbose) + fprintf(stderr, "flash_image_read_from_file(%s)\n", fname); + /* Open and read image files */ fimg = fopen(fname, "r"); if (fimg == NULL) { @@ -872,6 +884,9 @@ static int flash_check_failsafe(void) char *psbuf; int i; + if (verbose) + fprintf(stderr, "flash_check_failsafe()\n"); + /* Grab the sector size first */ sector_sz_ptr = (flash_byte_read(0x16) << 8) | 
flash_byte_read(0x17); sector_sz = (flash_byte_read(0x32 + sector_sz_ptr) << 8) | @@ -882,6 +897,8 @@ static int flash_check_failsafe(void) * than 1MB is suspicious and thrown out */ if (sector_sz < 12 || sector_sz > 20) { + fprintf(stderr, "flash_check_failsafe(): sector_sz (%d) not" + " valid. Set to zero.\n", sector_sz); failsafe.sector_sz = TV_FLASH_DEFAULT_SECTOR_SIZE; failsafe.valid = 0; return 0; @@ -1192,6 +1209,8 @@ static int identify_hca(int num, struct case PCI_DEVICE_MELLANOX_MT25208_COMPAT: printf("HCA #%d: Found MT25208 (MT23108 mode)", num); break; + default: + printf("HCA #%d: WTF? 0x%x", num, tvdev->pdev->device_id); } switch (identify_board(tvdev)) { @@ -1236,7 +1255,10 @@ static int identify_hca(int num, struct ver_str, failsafe.images[0].vsd.data.vendor.topspin.hw_label); } else - printf(" Primary image is valid, unknown source\n"); + printf(" Primary image is valid, " + "unknown source (sig 0x%x/0x%x)\n", + failsafe.images[0].vsd.data.signature, + failsafe.images[0].vsd.data.vendor.topspin.signature2); } else printf(" Primary image is NOT valid\n"); @@ -1257,7 +1279,10 @@ static int identify_hca(int num, struct ver_str, failsafe.images[1].vsd.data.vendor.topspin.hw_label); } else - printf(" Secondary image is valid, unknown source\n"); + printf(" Secondary image is valid," + " unknown source (sig 0x%x/0x%x)\n", + failsafe.images[1].vsd.data.signature, + failsafe.images[1].vsd.data.vendor.topspin.signature2); } else printf(" Secondary image is NOT valid\n"); } else @@ -1429,6 +1454,9 @@ static int flash_image_write_to_file(cha int i, fd; unsigned int offset; + if (verbose) + fprintf(stderr, "flash_image_write_to_file(%s)\n", fname); + buffer = malloc(failsafe.sector_sz); if (!buffer) { fprintf(stderr, "couldn't allocated %d bytes of memory for buffer\n", @@ -1460,12 +1488,10 @@ static int flash_image_write_to_file(cha } write(fd, buffer, failsafe.sector_sz); - offset += failsafe.sector_sz; } close(fd); - return 0; } @@ -1474,6 +1500,9 @@ static 
int download_firmware(int hca, ch struct tvdevice *tvdev; int ret; + if (verbose) + fprintf(stderr, "download_firmware(%d,%s)\n", hca, ofname); + tvdev = find_device(hca); if (!tvdev) { fprintf(stderr, "couldn't find HCA #%d on the PCI bus\n", hca); From iod00d at hp.com Fri Apr 1 18:40:48 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 1 Apr 2005 18:40:48 -0800 Subject: [openib-general] ia64 perf and FMR Message-ID: <20050402024048.GN11094@esmail.cup.hp.com> Hi, Just wanted to share initial perf results (and surprise) that I'm getting on the HP ZX1/IA64 boxes. Before FMR support was committed, netperf was reporting around 1720 Mb/s (215 MB/s) for IPoIB with msi_x=1 and netserver pinned to the CPU that wasn't taking interrupts. After FMR was committed, netperf is reporting about 3500 Mb/s (437 MB/s) for IPoIB. CPU was saturated on the send side in all cases. I've a vague idea what "Fast Memory Registration" is but not a good understanding. Can someone point me at a decent explanation of FMR? I'd like to understand the 2X in performance. Maybe we are doing 1/2 as much DMA mapping in one of the bug fixes? And I'm suspicious of the IPoIB numbers since SDP is also seeing a bit over 3500 Mb/s and sending CPU is also saturated. I was hoping SDP would be 40-60% faster than TCP (ipoib). Maybe I'm just not configuring libsdp.conf correctly for netperf and maybe the IPoIB numbers are correct. I've "rmmod ib_sdp" on both boxes, unloaded and reloaded all the other IB drivers, and "unset LD_PRELOAD". Is unloading ib_sdp sufficient to be sure SDP isn't used? (I do get "module in use" when netserver is running with LD_PRELOAD pointing at libsdp.so) I also reviewed all the "__attribute__ ((packed))" uses in include/ib_mad.h and include/ib_smi.h. It looks safe to me to remove them since every field is "naturally" aligned from the start of it's respective structure. I also checked nested cases. 
However, while it worked fine, removing all use from the two files didn't matter for netperf TCP_STREAM. I didn't realize other files also use "packed" and will have to revisit the issue. I'm mostly worried some new use will not be well aligned and cause the compiler to insert padding. That will be a PITA to debug. What we need is a compiler warning to tell us when/where padding is inserted in a structure with a similar __attribute__. Reminder: not pinning the netserver thread to the other CPU costs around 25% performance. I think that's true for any single threaded networking perf test that saturates the CPU. thanks, grant From mst at mellanox.co.il Sat Apr 2 12:29:44 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 2 Apr 2005 23:29:44 +0300 Subject: [openib-general] Re: [PATCH][MTHCA] add in SINAI defines into mtcha code WAS: [openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches In-Reply-To: <1112379853.18939.11.camel@duffman> References: <20050331204331.4320C2283D9@openib.ca.sandia.gov> <1112379853.18939.11.camel@duffman> Message-ID: <20050402202944.GB29843@mellanox.co.il> Quoting r. Tom Duffy : > Subject: [PATCH][MTHCA] add in SINAI defines into mtcha code WAS: [openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches > > On Thu, 2005-03-31 at 12:43 -0800, roland at openib.org wrote: > > Author: roland > > Date: 2005-03-31 12:43:29 -0800 (Thu, 31 Mar 2005) > > New Revision: 2101 > > > > Added: > > gen2/trunk/src/linux-kernel/patches/linux-2.6.11-sinai.diff > > Log: > > Add patch adding Sinai device IDs for 2.6.11 kernel. > > Roland, please consider applying this for svn ease of use: > Just adding defines wont make sinai work for you. RQ formatting needs to be fixed. I posted patches that make Sinai work earlier: http://www.mail-archive.com/openib-general at openib.org/msg03891.html and http://www.mail-archive.com/openib-general at openib.org/msg03892.html I can repost if needed. -- MST - Michael S. 
Tsirkin From mst at mellanox.co.il Sun Apr 3 10:35:48 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 3 Apr 2005 20:35:48 +0300 Subject: [openib-general] Re: ia64 perf and FMR In-Reply-To: <20050402024048.GN11094@esmail.cup.hp.com> References: <20050402024048.GN11094@esmail.cup.hp.com> Message-ID: <20050403173548.GA14915@mellanox.co.il> Quoting r. Grant Grundler : > Subject: ia64 perf and FMR > > Hi, > Just wanted to share initial perf results (and surprise) > that I'm getting on the HP ZX1/IA64 boxes. > > Before FMR support was committed, netperf was reporting around > 1720 Mb/s (215 MB/s) for IPoIB with msi_x=1 and netserver pinned > to the CPU that wasn't taking interrupts. After FMR was committed, > netperf is reporting about 3500 Mb/s (437 MB/s) for IPoIB. CPU was > saturated on the send side in all cases. > > I've a vague idea what "Fast Memory Registration" is but not a good > understanding. Can someone point me at a decent explanation of FMR? > > I'd like to understand the 2X in performance. > Maybe we are doing 1/2 as much DMA mapping in one of > the bug fixes? > > And I'm suspicious of the IPoIB numbers since SDP is also seeing > a bit over 3500 Mb/s and sending CPU is also saturated. I was hoping > SDP would be 40-60% faster than TCP (ipoib). Maybe I'm just not > configuring libsdp.conf correctly for netperf and maybe the IPoIB > numbers are correct. I've "rmmod ib_sdp" on both boxes, unloaded > and reloaded all the other IB drivers, and "unset LD_PRELOAD". > Is unloading ib_sdp sufficient to be sure SDP isn't used? > > (I do get "module in use" when netserver is running with LD_PRELOAD > pointing at libsdp.so) > > > I also reviewed all the "__attribute__ ((packed))" uses in > include/ib_mad.h and include/ib_smi.h. It looks safe to me > to remove them since every field is "naturally" aligned from > the start of it's respective structure. I also checked > nested cases. 
However, while it worked fine, removing all use > from the two files didn't matter for netperf TCP_STREAM. > > I didn't realize other files also use "packed" and will > have to revisit the issue. I'm mostly worried some > new use will not be well aligned and cause the compiler > to insert padding. That will be a PITA to debug. > What we need is a compiler warning to tell us when/where > padding is inserted in a structure with a similar __attribute__. > > Reminder: not pinning the netserver thread to the other CPU > costs around 25% performance. I think that's true for any single > threaded networking perf test that saturates the CPU. > > thanks, > grant Can you try with hide DDR? this will disable FMRs for tavor. -- MST - Michael S. Tsirkin From iod00d at hp.com Sun Apr 3 14:13:48 2005 From: iod00d at hp.com (Grant Grundler) Date: Sun, 3 Apr 2005 14:13:48 -0700 Subject: [openib-general] Re: ia64 perf and FMR In-Reply-To: <20050403173548.GA14915@mellanox.co.il> References: <20050402024048.GN11094@esmail.cup.hp.com> <20050403173548.GA14915@mellanox.co.il> Message-ID: <20050403211348.GA18395@esmail.cup.hp.com> On Sun, Apr 03, 2005 at 08:35:48PM +0300, Michael S. Tsirkin wrote: > Can you try with hide DDR? this will disable FMRs for tavor. I could if someone provided the v3.3.2 "failsafe" FW image with DDR hidden. I'm not equipped with a windows machine nor infiniburn to create my own. I'll need a "failsafe" image for both cougar and cougarcub. Once I'm done with this round of testing, I'd be happy to try a newer version of firmware w/ and w/o DDR hidden. 
thanks, grant From iod00d at hp.com Sun Apr 3 22:51:31 2005 From: iod00d at hp.com (Grant Grundler) Date: Sun, 3 Apr 2005 22:51:31 -0700 Subject: [openib-general] ia64 perf and FMR In-Reply-To: <20050402024048.GN11094@esmail.cup.hp.com> References: <20050402024048.GN11094@esmail.cup.hp.com> Message-ID: <20050404055131.GA19409@esmail.cup.hp.com> On Fri, Apr 01, 2005 at 06:40:48PM -0800, Grant Grundler wrote: > Hi, > Just wanted to share initial perf results (and surprise) > that I'm getting on the HP ZX1/IA64 boxes. > > Before FMR support was committed, netperf was reporting around > 1720 Mb/s (215 MB/s) for IPoIB with msi_x=1 and netserver pinned > to the CPU that wasn't taking interrupts. After FMR was committed, > netperf is reporting about 3500 Mb/s (437 MB/s) for IPoIB. CPU was > saturated on the send side in all cases. FMR is a red herring. I tried SVN r2080 and it has roughly the same performance as r2082 (when FMR was committed) and later r210x. "packed" attribute is a red herring too. Performance stunk with r2050 and I will do a binary search this week until I sort out which changes doubled the perf. ISTR there was one change related to a "double mapping" issue and I will be tracking that down in a few days. > I've a vague idea what "Fast Memory Registration" is but not a good > understanding. Can someone point me at a decent explanation of FMR? I'm still fishing for this. Even tips on which docs I might scrounge through are welcome. ... > Maybe I'm just not configuring libsdp.conf correctly for netperf > and maybe the IPoIB numbers are correct. This was in fact the case. The explanations aren't very good in the default .conf file. Is there other documentation to describe libsdp.conf file? "match program *" worked. Variations of "match destination" and "match listen *:12866" did not. Well, it might have worked for one side or the other, but not both. I'm now getting ~5300-5500 Mb/s (~660 MB/s) using SDP with netperf. 
(256KB socket size....probably too small). So HP ZX1 chipset is doing quite well for a 3yr old PCI-X chipset. thanks, grant From ftillier at infiniconsys.com Sun Apr 3 23:16:35 2005 From: ftillier at infiniconsys.com (Fab Tillier) Date: Sun, 3 Apr 2005 23:16:35 -0700 Subject: [openib-general] ia64 perf and FMR In-Reply-To: <20050404055131.GA19409@esmail.cup.hp.com> Message-ID: <001901c538dd$db3e0590$1802a8c0@infiniconsys.com> > From: Grant Grundler [mailto:iod00d at hp.com] > Sent: Sunday, April 03, 2005 10:52 PM > > On Fri, Apr 01, 2005 at 06:40:48PM -0800, Grant Grundler wrote: > > I've a vague idea what "Fast Memory Registration" is but not a good > > understanding. Can someone point me at a decent explanation of FMR? > > I'm still fishing for this. > Even tips on which docs I might scrounge through are welcome. > If you have access to a Tavor PRM, you can see what they are and how they work. The Mellanox implementation of FMR is not the same as FMR defined in the 1.2 IB spec. Basically, FMR lets you register memory without using the command interface, using memory mapped HCA resource to access the translation tables directly. There are pitfalls with them related to the HCA caching translation entries and cache coherency between the HCA and what the app wants it to do. That's my current understanding, and will gladly stand corrected. I hope that helps some. - Fab From iod00d at hp.com Sun Apr 3 23:28:29 2005 From: iod00d at hp.com (Grant Grundler) Date: Sun, 3 Apr 2005 23:28:29 -0700 Subject: [openib-general] ia64 perf and FMR In-Reply-To: <001901c538dd$db3e0590$1802a8c0@infiniconsys.com> References: <20050404055131.GA19409@esmail.cup.hp.com> <001901c538dd$db3e0590$1802a8c0@infiniconsys.com> Message-ID: <20050404062829.GC19481@esmail.cup.hp.com> On Sun, Apr 03, 2005 at 11:16:35PM -0700, Fab Tillier wrote: > If you have access to a Tavor PRM, you can see what they are and how they > work. 
The Mellanox implementation of FMR is not the same as FMR defined in > the 1.2 IB spec. I know who in HP does. But I hate contaminating myself with docs that are only available under NDA. That's why I haven't looked at them yet. Is "competitive advantage" still a reason for Mellanox to NOT publish the PRM for older PCI-X chips? (e.g. Tavor) > Basically, FMR lets you register memory without using the command interface, > using memory mapped HCA resource to access the translation tables directly. > There are pitfalls with them related to the HCA caching translation entries > and cache coherency between the HCA and what the app wants it to do. ok - I can see how FMR helps with latency and PCI bus utilization. But not a 2x increase in throughput. > That's my current understanding, and will gladly stand corrected. I hope > that helps some. It does. thanks, grant From roland at topspin.com Mon Apr 4 07:06:53 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 04 Apr 2005 07:06:53 -0700 Subject: [openib-general] ia64 perf and FMR References: <20050402024048.GN11094@esmail.cup.hp.com> Message-ID: <52sm26eiv6.fsf@topspin.com> Grant> Before FMR support was committed, netperf was reporting Grant> around 1720 Mb/s (215 MB/s) for IPoIB with msi_x=1 and Grant> netserver pinned to the CPU that wasn't taking Grant> interrupts. After FMR was committed, netperf is reporting Grant> about 3500 Mb/s (437 MB/s) for IPoIB. CPU was saturated on Grant> the send side in all cases. Grant> I've a vague idea what "Fast Memory Registration" is but Grant> not a good understanding. Can someone point me at a decent Grant> explanation of FMR? A memory region (MR) is a memory translation mapping in the HCA's context. Usually, we create MRs via a firmware command, which is prohibitively expensive to do in the data path. However, it is possible for the driver to write directly into the HCA's context, bypassing the firmware. 
This is very cheap, just some posted writes, and so we can do it in the data path. For example, for AIO, SDP uses this to map a bunch of random userspace pages into something virtually contiguous in the HCA's memory map, so that it can be used as for RDMA. However this shouldn't affect IPoIB in the least since a) it doesn't do any dynamic memory registration and b) it doesn't call any FMR functions anyway. Grant> I'd like to understand the 2X in performance. Maybe we are Grant> doing 1/2 as much DMA mapping in one of the bug fixes? Grant> And I'm suspicious of the IPoIB numbers since SDP is also Grant> seeing a bit over 3500 Mb/s and sending CPU is also Grant> saturated. I was hoping SDP would be 40-60% faster than TCP Grant> (ipoib). Maybe I'm just not configuring libsdp.conf Grant> correctly for netperf and maybe the IPoIB numbers are Grant> correct. I've "rmmod ib_sdp" on both boxes, unloaded and Grant> reloaded all the other IB drivers, and "unset LD_PRELOAD". Grant> Is unloading ib_sdp sufficient to be sure SDP isn't used? This is really odd. I don't see how FMRs could directly change IPoIB performance, since IPoIB isn't using FMRs, even indirectly. If SDP is not loaded, then I don't see how it could be used, but the fact that you get the same number for SDP and IPoIB really makes me think that the IPoIB number is really an SDP number. Grant> I also reviewed all the "__attribute__ ((packed))" uses in Grant> include/ib_mad.h and include/ib_smi.h. It looks safe to me Grant> to remove them since every field is "naturally" aligned Grant> from the start of it's respective structure. I also checked Grant> nested cases. However, while it worked fine, removing all Grant> use from the two files didn't matter for netperf Grant> TCP_STREAM. Yeah, none of that code is in the data path, so I wouldn't expect it to make a difference one way or another. The one that might make a difference is struct mthca_eqe in mthca_eq.c. 
Unfortunately simply removing the packed attribute will break things on 64 bit archs unless the structure is written slightly differently. It shouldn't be that difficult, so I should have something for you to test in a day or two. - R. From roland at topspin.com Mon Apr 4 06:51:53 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 04 Apr 2005 06:51:53 -0700 Subject: [openib-general] Re: [PATCH][MTHCA] add in SINAI defines into mtcha code WAS: [openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches References: <20050331204331.4320C2283D9@openib.ca.sandia.gov> <1112379853.18939.11.camel@duffman> <20050402202944.GB29843@mellanox.co.il> Message-ID: <52y8byejk6.fsf@topspin.com> Michael> Just adding defines wont make sinai work for you. RQ Michael> formatting needs to be fixed. I posted patches that make Michael> Sinai work earlier: Yes, I already committed a similar change based on the latest PRM. I finally got a Sinai board and it seems to be working with the current code. - R. From mst at mellanox.co.il Mon Apr 4 08:02:35 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 4 Apr 2005 18:02:35 +0300 Subject: [openib-general] [PATCH] SEND_INLINE support in libmthca Message-ID: <20050404150235.GZ15034@mellanox.co.il> Adds support for posting SEND_INLINE work requests in libmthca. With this patch, I get latency as low as 3.35 usec unidirectional with Arbel Tavor mode. Passed basic testing on Tavor and Arbel mode. Signed-off-by: Michael S. 
Tsirkin Index: src/qp.c =================================================================== --- src/qp.c (revision 2104) +++ src/qp.c (working copy) @@ -57,6 +57,10 @@ enum { MTHCA_NEXT_SOLICIT = 1 << 1, }; +enum { + MTHCA_INLINE_SEG = 1<<31 +}; + struct mthca_next_seg { uint32_t nda_op; /* [31:6] next WQE [4:0] next opcode */ uint32_t ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ @@ -107,6 +111,10 @@ struct mthca_data_seg { uint64_t addr; }; +struct mthca_inline_seg { + uint32_t byte_count; +}; + static const uint8_t mthca_opcode[] = { [IBV_WR_SEND] = MTHCA_OPCODE_SEND, [IBV_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, @@ -255,15 +263,38 @@ int mthca_tavor_post_send(struct ibv_qp goto out; } - for (i = 0; i < wr->num_sge; ++i) { - ((struct mthca_data_seg *) wqe)->byte_count = - htonl(wr->sg_list[i].length); - ((struct mthca_data_seg *) wqe)->lkey = - htonl(wr->sg_list[i].lkey); - ((struct mthca_data_seg *) wqe)->addr = - htonll(wr->sg_list[i].addr); - wqe += sizeof (struct mthca_data_seg); - size += sizeof (struct mthca_data_seg) / 16; + if (wr->send_flags & IBV_SEND_INLINE) { + struct mthca_inline_seg *seg = wqe; + int s = 0; + wqe += sizeof *seg; + for (i = 0; i < wr->num_sge; ++i) { + struct ibv_sge *sge = &wr->sg_list[i]; + int l; + l = sge->length; + s += l; + + if (s + sizeof *seg > (1 << qp->sq.wqe_shift)) { + ret = -1; + *bad_wr = wr; + goto out; + } + + memcpy(wqe, (void*)(intptr_t)sge->addr, l); + wqe += l; + } + seg->byte_count = htonl(MTHCA_INLINE_SEG | s); + + size += align(s + sizeof *seg, 16) / 16; + } else { + struct mthca_data_seg *seg; + for (i = 0; i < wr->num_sge; ++i) { + seg = wqe; + seg->byte_count = htonl(wr->sg_list[i].length); + seg->lkey = htonl(wr->sg_list[i].lkey); + seg->addr = htonll(wr->sg_list[i].addr); + wqe += sizeof *seg; + } + size += wr->num_sge * sizeof *seg / 16; } qp->wrid[ind + qp->rq.max] = wr->wr_id; @@ -512,15 +543,37 @@ int mthca_arbel_post_send(struct ibv_qp goto out; } - for (i = 0; i < wr->num_sge; 
++i) { - ((struct mthca_data_seg *) wqe)->byte_count = - htonl(wr->sg_list[i].length); - ((struct mthca_data_seg *) wqe)->lkey = - htonl(wr->sg_list[i].lkey); - ((struct mthca_data_seg *) wqe)->addr = - htonll(wr->sg_list[i].addr); - wqe += sizeof (struct mthca_data_seg); - size += sizeof (struct mthca_data_seg) / 16; + if (wr->send_flags & IBV_SEND_INLINE) { + struct mthca_inline_seg *seg = wqe; + int s = 0; + wqe += sizeof *seg; + for (i = 0; i < wr->num_sge; ++i) { + int l = wr->sg_list[i].length; + s += l; + + if (s + sizeof *seg > (1 << qp->sq.wqe_shift)) { + ret = -1; + *bad_wr = wr; + goto out; + } + + memcpy(wqe, + (void*)(intptr_t)wr->sg_list[i].addr, l); + wqe += l; + } + seg->byte_count = htonl(MTHCA_INLINE_SEG | s); + + size += align(s + sizeof *seg, 16) / 16; + } else { + struct mthca_data_seg *seg; + for (i = 0; i < wr->num_sge; ++i) { + seg = wqe; + seg->byte_count = htonl(wr->sg_list[i].length); + seg->lkey = htonl(wr->sg_list[i].lkey); + seg->addr = htonll(wr->sg_list[i].addr); + wqe += sizeof *seg; + } + size += wr->num_sge * sizeof *seg / 16; } qp->wrid[ind + qp->rq.max] = wr->wr_id; -- MST - Michael S. Tsirkin From iod00d at hp.com Mon Apr 4 08:29:05 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 4 Apr 2005 08:29:05 -0700 Subject: [openib-general] ia64 perf and FMR In-Reply-To: <52sm26eiv6.fsf@topspin.com> References: <20050402024048.GN11094@esmail.cup.hp.com> <52sm26eiv6.fsf@topspin.com> Message-ID: <20050404152905.GA20973@esmail.cup.hp.com> On Mon, Apr 04, 2005 at 07:06:53AM -0700, Roland Dreier wrote: > A memory region (MR) is a memory translation mapping in the HCA's > context. Usually, we create MRs via a firmware command, which is > prohibitively expensive to do in the data path. However, it is > possible for the driver to write directly into the HCA's context, > bypassing the firmware. This is very cheap, just some posted writes, > and so we can do it in the data path. 
For example, for AIO, SDP uses
> this to map a bunch of random userspace pages into something virtually
> contiguous in the HCA's memory map, so that it can be used for RDMA.

Thanks! I understood about 1/2 of that before.
I'd like to read a bit more detail though...

> However this shouldn't affect IPoIB in the least since a) it doesn't
> do any dynamic memory registration and b) it doesn't call any FMR
> functions anyway.

*nod* - FMR is clearly a red herring in this case.

> This is really odd. I don't see how FMRs could directly change IPoIB
> performance, since IPoIB isn't using FMRs, even indirectly.

Sorry - I said "FMR" when I should have said r210x release.
FMR was just recently committed and I assumed (there's that
word again) that was related somehow. My bad.

> If SDP is
> not loaded, then I don't see how it could be used, but the fact that
> you get the same number for SDP and IPoIB really makes me think that
> the IPoIB number is really an SDP number.

Nope - those really were IPoIB numbers.

...
> The one that might make a difference is struct mthca_eqe in
> mthca_eq.c. Unfortunately simply removing the packed attribute will
> break things on 64 bit archs unless the structure is written slightly
> differently. It shouldn't be that difficult, so I should have
> something for you to test in a day or two.

Ok. I won't be able to test that until next week...I'll note which
rev picks that up and make sure to test it separately.

BTW, SDP uses "packed" for a dozen or so structures.
I haven't looked at any q-syscollect or pfmon data yet to see
where SDP is spending time or if "packed" is part of the code path.
But I don't have the impression SDP is CPU bound like IPoIB is.
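
On the "packed" question, one way to check whether the attribute changes a
layout is to compare the packed and unpacked versions of the same
structure. The sketch below assumes GCC-style __attribute__((packed)); the
structures are hypothetical and are not copied from ib_mad.h, ib_smi.h,
mthca_eq.c or the SDP headers. GCC also has a -Wpadded warning that
reports structures into which the compiler inserted padding, which may be
close to the compiler warning wished for earlier in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header in which every field is naturally aligned
 * relative to the start of the structure. */
struct demo_hdr_plain {
    uint8_t  version;
    uint8_t  type;
    uint16_t status;
    uint32_t seg_num;
    uint64_t tid;
};

/* Same fields, same order, with the packed attribute. */
struct demo_hdr_packed {
    uint8_t  version;
    uint8_t  type;
    uint16_t status;
    uint32_t seg_num;
    uint64_t tid;
} __attribute__((packed));

/* Hypothetical layout where a 32-bit field follows a single byte, so the
 * unpacked version needs padding before 'value'. */
struct demo_odd_plain {
    uint8_t  tag;
    uint32_t value;
};

struct demo_odd_packed {
    uint8_t  tag;
    uint32_t value;
} __attribute__((packed));

/* Returns 1 if packing does not change the size or offsets of demo_hdr. */
int packed_is_redundant_for_hdr(void)
{
    return sizeof(struct demo_hdr_plain) == sizeof(struct demo_hdr_packed) &&
           offsetof(struct demo_hdr_plain, seg_num) ==
               offsetof(struct demo_hdr_packed, seg_num) &&
           offsetof(struct demo_hdr_plain, tid) ==
               offsetof(struct demo_hdr_packed, tid);
}

/* Returns 1 if packing changes the size of demo_odd, i.e. the compiler
 * inserted padding in the unpacked version. */
int packed_changes_odd(void)
{
    return sizeof(struct demo_odd_plain) != sizeof(struct demo_odd_packed);
}

/* Run-time check of both cases; aborts if an assumption does not hold. */
void check_demo_layouts(void)
{
    assert(packed_is_redundant_for_hdr());
    assert(packed_changes_odd());
}
```

A check like this only shows that the layout is unchanged. It says nothing
about the code the compiler generates for accesses to a packed structure,
which is the part that could matter for performance.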
thanks,
grant

From libor at topspin.com Mon Apr 4 08:11:54 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 4 Apr 2005 08:11:54 -0700
Subject: [openib-general] ia64 perf and FMR
In-Reply-To: <20050402024048.GN11094@esmail.cup.hp.com>; from iod00d@hp.com on Fri, Apr 01, 2005 at 06:40:48PM -0800
References: <20050402024048.GN11094@esmail.cup.hp.com>
Message-ID: <20050404081154.A10315@topspin.com>

On Fri, Apr 01, 2005 at 06:40:48PM -0800, Grant Grundler wrote:
> Hi,
> Just wanted to share initial perf results (and surprise)
> that I'm getting on the HP ZX1/IA64 boxes.
>
> Before FMR support was committed, netperf was reporting around
> 1720 Mb/s (215 MB/s) for IPoIB with msi_x=1 and netserver pinned
> to the CPU that wasn't taking interrupts. After FMR was committed,
> netperf is reporting about 3500 Mb/s (437 MB/s) for IPoIB. CPU was
> saturated on the send side in all cases.
>
> I've a vague idea what "Fast Memory Registration" is but not a good
> understanding. Can someone point me at a decent explanation of FMR?

  A few people responded with good FMR descriptions. However, I'd like
to add that when using an unmodified version of netperf, neither IPoIB
nor SDP is using FMRs. IPoIB never uses FMRs, and currently SDP only
uses FMRs when the application is using Linux AIO to read or write data
on the socket. In that instance, if the buffers are larger than a
threshold value they will be registered using FMRs, and the contiguous
address is then shared with the remote half of the connection, which can
then RDMA to/from the buffer.

  The example code (ttcp.aio.c) I checked in will use AIO and FMRs if
the transfer size (-l) is over the 5K default threshold.
-Libor

From libor at topspin.com Mon Apr 4 08:20:02 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 4 Apr 2005 08:20:02 -0700
Subject: [openib-general] ia64 perf and FMR
In-Reply-To: <20050404055131.GA19409@esmail.cup.hp.com>; from iod00d@hp.com on Sun, Apr 03, 2005 at 10:51:31PM -0700
References: <20050402024048.GN11094@esmail.cup.hp.com> <20050404055131.GA19409@esmail.cup.hp.com>
Message-ID: <20050404082002.B10315@topspin.com>

On Sun, Apr 03, 2005 at 10:51:31PM -0700, Grant Grundler wrote:
> ...
> > Maybe I'm just not configuring libsdp.conf correctly for netperf
> > and maybe the IPoIB numbers are correct.
>
> This was in fact the case.
> The explanations aren't very good in the default .conf file.
> Is there other documentation to describe libsdp.conf file?
>
> "match program *" worked. Variations of "match destination"
> and "match listen *:12866" did not. Well, it might have worked
> for one side or the other, but not both.

Using libsdp with netperf shows some of the limitations of libsdp.
netperf connects to the server on a well-known socket, but then the
server creates a second socket which it autobinds, checks to see
which port was assigned, and passes the port number to the client,
which connects to the port. This second connection is then used for
the data transfer.

Since the connection is not on a well-known port, the way to match
it is with the 'program' keyword:

  match program netperf

> I'm now getting ~5300-5500 Mb/s (~660 MB/s) using SDP with netperf.
> (256KB socket size....probably too small).

The socket size socket option is still in the TODO file.
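
For reference, the data-connection setup described above (the server
binds a second socket to a kernel-chosen port, then looks up which port
it got so it can tell the client) can be sketched roughly as follows.
This is only an illustration using plain sockets calls, not netperf's
actual code; the helper name listen_on_autobound_port is made up, and
binding to loopback is an assumption to keep the sketch self-contained.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from netperf): create a TCP listening
 * socket on a port chosen by the kernel.  Returns the socket fd and
 * stores the assigned port (host byte order) in *port, or returns -1
 * on error.
 */
int listen_on_autobound_port(unsigned short *port)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof addr);
	addr.sin_family      = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port        = 0;	/* 0 asks the kernel to pick a port */

	/* bind, listen, then ask the kernel which port was assigned */
	if (bind(fd, (struct sockaddr *) &addr, sizeof addr) < 0 ||
	    listen(fd, 1) < 0 ||
	    getsockname(fd, (struct sockaddr *) &addr, &len) < 0) {
		close(fd);
		return -1;
	}

	*port = ntohs(addr.sin_port);
	return fd;
}
```

Because the assigned port differs from run to run, a libsdp.conf rule
keyed on a fixed port number cannot match this second connection, which
is why the match has to be done on the program name.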
-Libor

From iod00d at hp.com Mon Apr 4 12:20:03 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 4 Apr 2005 12:20:03 -0700
Subject: [openib-general] ia64 perf and FMR
In-Reply-To: <20050404082002.B10315@topspin.com>
References: <20050402024048.GN11094@esmail.cup.hp.com> <20050404055131.GA19409@esmail.cup.hp.com> <20050404082002.B10315@topspin.com>
Message-ID: <20050404192003.GE20973@esmail.cup.hp.com>

On Mon, Apr 04, 2005 at 08:20:02AM -0700, Libor Michalek wrote:
> Using libsdp with netperf shows some of the limitations of libsdp.
> netperf connects to the server on a well known socket, but then the
> server creates a second socket which it autobinds, checks to see
> which port was assigned, and passes the port number to the client
> which connects to the port. This second connection is then used for
> the data transfer.

Ah ok. That explains why just keying off the port # didn't work.
I wasn't aware of that.

> Since the connection is not on a well known port the way to match
> it is with the 'program' keyword:
>
> match program netperf

Wouldn't we also need a line like this?

  match program netserver

For a sanity check, I sometimes run netperf first in one direction and
then the other. It helps confirm the two boxes are symmetrical (same HW
config, same CPU, same firmware, same drivers, etc). Having one
libsdp.conf would keep things easy.

And in fact, my current configuration is NOT symmetrical. I have the
same HW config but system firmware is not the same. This results in
~8-10% loss in performance in one direction vs the other.

> > I'm now getting ~5300-5500 Mb/s (~660 MB/s) using SDP with netperf.
> > (256KB socket size....probably too small).
>
> The socket size socket option is still in the TODO file.

OH! I was working out how long it takes an IB card to transmit or fill
a 256KB buffer and it's really not very long. Kind of limits how many
transactions can be coalesced into one interrupt and how long the
interrupt handler can be deferred.
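
As a rough check of that 256KB figure, here is a small sketch of the
arithmetic. It assumes the buffer drains at the ~5300-5500 Mb/s that
netperf reported above (netperf's Mb/s taken as 10^6 bits/s) and that
256KB means 262144 bytes; the helper name drain_time_us is made up for
illustration.

```c
/*
 * Hypothetical helper: microseconds needed to move `bytes` bytes at
 * `mbit_per_s` megabits per second (10^6 bits/s).
 *
 *   seconds      = bytes * 8 / (mbit_per_s * 10^6)
 *   microseconds = bytes * 8 / mbit_per_s
 */
double drain_time_us(double bytes, double mbit_per_s)
{
	return bytes * 8.0 / mbit_per_s;
}
```

With those assumptions, drain_time_us(262144, 5500) and
drain_time_us(262144, 5300) come out at roughly 380 and 400
microseconds, so a single 256KB socket buffer covers well under half a
millisecond of traffic at the measured rate.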
But that's probably not a burning issue (yet).

thanks,
grant

From roland at topspin.com Mon Apr 4 13:37:41 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 04 Apr 2005 13:37:41 -0700
Subject: [openib-general] Re: problem with SDP/AIO on mem-free HCA
In-Reply-To: <20050401145414.B2870@topspin.com> (Libor Michalek's message of "Fri, 1 Apr 2005 14:54:14 -0800")
References: <001301c5367f$e86a8a50$1802a8c0@infiniconsys.com> <528y42laxk.fsf@topspin.com> <20050401145414.B2870@topspin.com>
Message-ID: <52psxaffca.fsf@topspin.com>

This patch seems to fix it for me. With the patch applied, ttcp.aio
runs through to the end and switches from 4 KB RDMAs to 8 KB RDMAs
after 256 KB has been transferred. Without the patch, ttcp.aio does a
0-length after 256 KB and fails.

 - R.

From roland at topspin.com Mon Apr 4 13:51:01 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 04 Apr 2005 13:51:01 -0700
Subject: [openib-general] [PATCH] SEND_INLINE support in libmthca
In-Reply-To: <20050404150235.GZ15034@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 4 Apr 2005 18:02:35 +0300")
References: <20050404150235.GZ15034@mellanox.co.il>
Message-ID: <52is32feq2.fsf@topspin.com>

Is the test here correct?

+ if (s + sizeof *seg > (1 << qp->sq.wqe_shift)) {

It seems we need to take into account the size of the next segment and
any RDMA segment that we may be posting as well.

Also, does it make sense to put the code for gathering inline data
segments and writing gather lists into an inline function that can be
called from both the tavor and arbel post send functions? Will gcc
actually inline this function?

 - R.

From halr at voltaire.com Mon Apr 4 15:08:03 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 Apr 2005 18:08:03 -0400
Subject: [openib-general] IPoIB
Message-ID: <1112652482.4490.281.camel@localhost.localdomain>

A while ago, Tom brought up the issue of IPoIB link level broadcasting
from user space (with the arping tool).
Is it possible to do this from kernel space ? For example, how would/could sendto() work when sending to a IPoIB link layer address ? If all we wanted to support was broadcast, perhaps there could be a remapping of the ethernet MAC broadcast address to the all hosts MGID and QPN for that IPoIB interface. Or perhaps the entire ipoib pseudoheader should be supported in this mode. This is needed to support RARPing. Some hosts want to RARP for their IP address and this should be supported over IPoIB. -- Hal From roland at topspin.com Mon Apr 4 15:09:00 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 4 Apr 2005 15:09:00 -0700 Subject: [openib-general] [PATCH][RFC][1/4] IB: core changes for userspace verbs In-Reply-To: <200544159.Ahk9l0puXy39U6u6@topspin.com> Message-ID: <200544159.Qg0tUfQc4xGRabsc@topspin.com> Add new structs and struct members required by userspace verbs to IB core. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/core/verbs.c 2005-01-11 09:35:27.046388000 -0800 +++ linux-export/drivers/infiniband/core/verbs.c 2005-04-04 14:50:59.579791210 -0700 @@ -47,10 +47,11 @@ { struct ib_pd *pd; - pd = device->alloc_pd(device); + pd = device->alloc_pd(device, NULL, NULL, 0); if (!IS_ERR(pd)) { - pd->device = device; + pd->device = device; + pd->uobject = NULL; atomic_set(&pd->usecnt, 0); } @@ -76,8 +77,9 @@ ah = pd->device->create_ah(pd, ah_attr); if (!IS_ERR(ah)) { - ah->device = pd->device; - ah->pd = pd; + ah->device = pd->device; + ah->pd = pd; + ah->uobject = NULL; atomic_inc(&pd->usecnt); } @@ -122,7 +124,7 @@ { struct ib_qp *qp; - qp = pd->device->create_qp(pd, qp_init_attr); + qp = pd->device->create_qp(pd, qp_init_attr, NULL, 0); if (!IS_ERR(qp)) { qp->device = pd->device; @@ -130,6 +132,7 @@ qp->send_cq = qp_init_attr->send_cq; qp->recv_cq = qp_init_attr->recv_cq; qp->srq = qp_init_attr->srq; + qp->uobject = NULL; qp->event_handler = qp_init_attr->event_handler; qp->qp_context = qp_init_attr->qp_context; qp->qp_type = 
qp_init_attr->qp_type; @@ -197,10 +200,11 @@ { struct ib_cq *cq; - cq = device->create_cq(device, cqe); + cq = device->create_cq(device, cqe, NULL, NULL, 0); if (!IS_ERR(cq)) { cq->device = device; + cq->uobject = NULL; cq->comp_handler = comp_handler; cq->event_handler = event_handler; cq->cq_context = cq_context; @@ -245,8 +249,9 @@ mr = pd->device->get_dma_mr(pd, mr_access_flags); if (!IS_ERR(mr)) { - mr->device = pd->device; - mr->pd = pd; + mr->device = pd->device; + mr->pd = pd; + mr->uobject = NULL; atomic_inc(&pd->usecnt); atomic_set(&mr->usecnt, 0); } @@ -267,8 +272,9 @@ mr_access_flags, iova_start); if (!IS_ERR(mr)) { - mr->device = pd->device; - mr->pd = pd; + mr->device = pd->device; + mr->pd = pd; + mr->uobject = NULL; atomic_inc(&pd->usecnt); atomic_set(&mr->usecnt, 0); } @@ -344,8 +350,9 @@ mw = pd->device->alloc_mw(pd); if (!IS_ERR(mw)) { - mw->device = pd->device; - mw->pd = pd; + mw->device = pd->device; + mw->pd = pd; + mw->uobject = NULL; atomic_inc(&pd->usecnt); } --- linux-export.orig/drivers/infiniband/include/ib_verbs.h 2005-02-22 10:14:06.623746000 -0800 +++ linux-export/drivers/infiniband/include/ib_verbs.h 2005-04-04 14:50:42.054602327 -0700 @@ -41,7 +41,9 @@ #include #include + #include +#include union ib_gid { u8 raw[16]; @@ -618,29 +620,78 @@ u8 page_size; }; +struct ib_ucontext { + struct ib_device *device; + struct list_head pd_list; + struct list_head mr_list; + struct list_head mw_list; + struct list_head cq_list; + struct list_head qp_list; + struct list_head srq_list; + struct list_head ah_list; + spinlock_t lock; +}; + +struct ib_uobject { + u64 user_handle; /* handle given to us by userspace */ + struct ib_ucontext *context; /* associated user context */ + struct list_head list; /* link to context's list */ + u32 id; /* index into kernel idr */ +}; + +struct ib_umem { + unsigned long user_base; + unsigned long virt_base; + size_t length; + int offset; + int page_size; + struct list_head chunk_list; +}; + +struct ib_umem_chunk { 
+ struct list_head list; + int nents; + int nmap; + struct scatterlist page_list[0]; +}; + +#define IB_UMEM_MAX_PAGE_CHUNK \ + ((PAGE_SIZE - offsetof(struct ib_umem_chunk, page_list)) / \ + ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] - \ + (void *) &((struct ib_umem_chunk *) 0)->page_list[0])) + +struct ib_umem_object { + struct ib_uobject uobject; + struct ib_umem umem; +}; + struct ib_pd { - struct ib_device *device; - atomic_t usecnt; /* count all resources */ + struct ib_device *device; + struct ib_uobject *uobject; + atomic_t usecnt; /* count all resources */ }; struct ib_ah { struct ib_device *device; struct ib_pd *pd; + struct ib_uobject *uobject; }; typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context); struct ib_cq { - struct ib_device *device; - ib_comp_handler comp_handler; - void (*event_handler)(struct ib_event *, void *); - void * cq_context; - int cqe; - atomic_t usecnt; /* count number of work queues */ + struct ib_device *device; + struct ib_uobject *uobject; + ib_comp_handler comp_handler; + void (*event_handler)(struct ib_event *, void *); + void * cq_context; + int cqe; + atomic_t usecnt; /* count number of work queues */ }; struct ib_srq { struct ib_device *device; + struct ib_uobject *uobject; struct ib_pd *pd; void *srq_context; atomic_t usecnt; @@ -652,6 +703,7 @@ struct ib_cq *send_cq; struct ib_cq *recv_cq; struct ib_srq *srq; + struct ib_uobject *uobject; void (*event_handler)(struct ib_event *, void *); void *qp_context; u32 qp_num; @@ -659,16 +711,18 @@ }; struct ib_mr { - struct ib_device *device; - struct ib_pd *pd; - u32 lkey; - u32 rkey; - atomic_t usecnt; /* count number of MWs */ + struct ib_device *device; + struct ib_pd *pd; + struct ib_uobject *uobject; + u32 lkey; + u32 rkey; + atomic_t usecnt; /* count number of MWs */ }; struct ib_mw { struct ib_device *device; struct ib_pd *pd; + struct ib_uobject *uobject; u32 rkey; }; @@ -737,7 +791,12 @@ int (*modify_port)(struct ib_device *device, u8 port_num, 
int port_modify_mask, struct ib_port_modify *port_modify); - struct ib_pd * (*alloc_pd)(struct ib_device *device); + struct ib_ucontext * (*alloc_ucontext)(struct ib_device *device, + const void __user *udata, int udatalen); + int (*dealloc_ucontext)(struct ib_ucontext *context); + struct ib_pd * (*alloc_pd)(struct ib_device *device, + struct ib_ucontext *context, + const void __user *udata, int udatalen); int (*dealloc_pd)(struct ib_pd *pd); struct ib_ah * (*create_ah)(struct ib_pd *pd, struct ib_ah_attr *ah_attr); @@ -747,7 +806,8 @@ struct ib_ah_attr *ah_attr); int (*destroy_ah)(struct ib_ah *ah); struct ib_qp * (*create_qp)(struct ib_pd *pd, - struct ib_qp_init_attr *qp_init_attr); + struct ib_qp_init_attr *qp_init_attr, + const void __user *udata, int udatalen); int (*modify_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask); @@ -762,8 +822,9 @@ int (*post_recv)(struct ib_qp *qp, struct ib_recv_wr *recv_wr, struct ib_recv_wr **bad_recv_wr); - struct ib_cq * (*create_cq)(struct ib_device *device, - int cqe); + struct ib_cq * (*create_cq)(struct ib_device *device, int cqe, + struct ib_ucontext *context, + const void __user *udata, int udatalen); int (*destroy_cq)(struct ib_cq *cq); int (*resize_cq)(struct ib_cq *cq, int *cqe); int (*poll_cq)(struct ib_cq *cq, int num_entries, @@ -780,6 +841,11 @@ int num_phys_buf, int mr_access_flags, u64 *iova_start); + struct ib_mr * (*reg_user_mr)(struct ib_pd *pd, + struct ib_umem *region, + int mr_access_flags, + const void __user *udata, + int udatalen); int (*query_mr)(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int (*dereg_mr)(struct ib_mr *mr); @@ -816,7 +882,10 @@ struct ib_grh *in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad); + int (*mmap)(struct ib_ucontext *context, + struct vm_area_struct *vma); + struct module *owner; struct class_device class_dev; struct kobject ports_parent; struct list_head port_list; From roland at topspin.com Mon Apr 4 15:09:00 2005 From: roland at topspin.com 
(Roland Dreier)
Date: Mon, 4 Apr 2005 15:09:00 -0700
Subject: [openib-general] [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Message-ID: <200544159.Ahk9l0puXy39U6u6@topspin.com>

Here is an initial implementation of InfiniBand userspace verbs. I
plan to commit this code to the OpenIB repository shortly, and submit
it for inclusion during the 2.6.13 cycle, so I am posting it early for
comments.

This code, in conjunction with the libibverbs and libmthca userspace
libraries available from the subversion trees at

  https://openib.org/svn/gen2/branches/roland-uverbs/src/userspace/libibverbs
  https://openib.org/svn/gen2/branches/roland-uverbs/src/userspace/libmthca

enables userspace processes to access InfiniBand HCAs directly. For
those not familiar with the InfiniBand architecture, this so-called
"userspace verbs" support allows userspace to post data path commands
directly to the HCA. Resource allocation and other control path
operations still go through the kernel driver.

Please take a look at this code if you have a chance. I would
appreciate high-level criticism of the design and implementation as
well as nitpicky complaints about coding style and typos.

In particular, the memory pinning code in uverbs_mem.c could stand a
looking over. In addition, a sanity check of the write()-based scheme
for passing commands into the kernel in uverbs_main.c and uverbs_cmd.c
is probably worthwhile.

Thanks,
  Roland

From roland at topspin.com Mon Apr 4 15:09:00 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 4 Apr 2005 15:09:00 -0700
Subject: [openib-general] [PATCH][RFC][2/4] IB: userspace verbs main module
In-Reply-To: <200544159.Qg0tUfQc4xGRabsc@topspin.com>
Message-ID: <200544159.3X7p8nZ87qWqA7cv@topspin.com>

Add device-independent userspace verbs support (ib_uverbs module).
Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-export/drivers/infiniband/core/uverbs.h 2005-04-04 14:55:10.496227053 -0700 @@ -0,0 +1,124 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: uverbs.h 2001 2005-03-16 04:15:41Z roland $ + */ + +#ifndef UVERBS_H +#define UVERBS_H + +/* Include device.h and fs.h until cdev.h is self-sufficient */ +#include +#include +#include +#include +#include + +#include +#include + +struct ib_uverbs_device { + int devnum; + struct cdev dev; + struct class_device class_dev; + struct ib_device *ib_dev; + int num_comp; +}; + +struct ib_uverbs_event_file { + struct ib_uverbs_file *uverbs_file; + spinlock_t lock; + int fd; + int is_async; + wait_queue_head_t poll_wait; + struct list_head event_list; +}; + +struct ib_uverbs_file { + struct kref ref; + struct ib_uverbs_device *device; + struct ib_ucontext *ucontext; + struct ib_event_handler event_handler; + struct ib_uverbs_event_file async_file; + struct ib_uverbs_event_file comp_file[1]; +}; + +struct ib_uverbs_async_event { + struct ib_uverbs_async_event_desc desc; + struct list_head list; +}; + +struct ib_uverbs_comp_event { + struct ib_uverbs_comp_event_desc desc; + struct list_head list; +}; + +struct ib_uobject_mr { + struct ib_uobject uobj; + struct page *page_list; + struct scatterlist *sg_list; +}; + +extern struct semaphore ib_uverbs_idr_mutex; +extern struct idr ib_uverbs_pd_idr; +extern struct idr ib_uverbs_mr_idr; +extern struct idr ib_uverbs_mw_idr; +extern struct idr ib_uverbs_ah_idr; +extern struct idr ib_uverbs_cq_idr; +extern struct idr ib_uverbs_qp_idr; + +void ib_uverbs_comp_handler(struct ib_cq *cq, void *cq_context); +void ib_uverbs_cq_event_handler(struct ib_event *event, void *context_ptr); +void ib_uverbs_qp_event_handler(struct ib_event *event, void *context_ptr); + +int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, + void *addr, size_t size); +void ib_umem_release(struct ib_device *dev, struct ib_umem *umem); + +#define IB_UVERBS_DECLARE_CMD(name) \ + ssize_t ib_uverbs_##name(struct ib_uverbs_file *file, \ + const char __user *buf, int in_len, \ + int out_len) + +IB_UVERBS_DECLARE_CMD(query_params); 
+IB_UVERBS_DECLARE_CMD(get_context); +IB_UVERBS_DECLARE_CMD(query_port); +IB_UVERBS_DECLARE_CMD(alloc_pd); +IB_UVERBS_DECLARE_CMD(dealloc_pd); +IB_UVERBS_DECLARE_CMD(reg_mr); +IB_UVERBS_DECLARE_CMD(dereg_mr); +IB_UVERBS_DECLARE_CMD(create_cq); +IB_UVERBS_DECLARE_CMD(destroy_cq); +IB_UVERBS_DECLARE_CMD(create_qp); +IB_UVERBS_DECLARE_CMD(modify_qp); +IB_UVERBS_DECLARE_CMD(destroy_qp); + +#endif /* UVERBS_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-export/drivers/infiniband/core/uverbs_cmd.c 2005-04-04 14:53:12.136965074 -0700 @@ -0,0 +1,790 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: uverbs_cmd.c 1995 2005-03-15 19:25:10Z roland $ + */ + +#include + +#include "uverbs.h" + +ssize_t ib_uverbs_query_params(struct ib_uverbs_file *file, + const char __user *buf, + int in_len, int out_len) +{ + struct ib_uverbs_query_params cmd; + struct ib_uverbs_query_params_resp resp; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + resp.num_cq_events = file->device->num_comp; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) + return -EFAULT; + + return in_len; +} + +ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, + const char __user *buf, + int in_len, int out_len) +{ + struct ib_uverbs_get_context cmd; + struct ib_uverbs_get_context_resp *resp; + struct ib_device *ibdev = file->device->ib_dev; + int outsz; + int i; + int ret = in_len; + + outsz = sizeof *resp + (file->device->num_comp - 1) * sizeof (__u32); + + if (out_len < outsz) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + resp = kmalloc(outsz, GFP_KERNEL); + if (!resp) + return -ENOMEM; + + file->ucontext = ibdev->alloc_ucontext(ibdev, buf + sizeof cmd, + in_len - sizeof cmd - + sizeof (struct ib_uverbs_cmd_hdr)); + if (IS_ERR(file->ucontext)) { + ret = PTR_ERR(file->ucontext); + file->ucontext = NULL; + kfree(resp); + return ret; + } + + file->ucontext->device = ibdev; + INIT_LIST_HEAD(&file->ucontext->pd_list); + INIT_LIST_HEAD(&file->ucontext->mr_list); + INIT_LIST_HEAD(&file->ucontext->mw_list); + INIT_LIST_HEAD(&file->ucontext->cq_list); + INIT_LIST_HEAD(&file->ucontext->qp_list); + INIT_LIST_HEAD(&file->ucontext->srq_list); + INIT_LIST_HEAD(&file->ucontext->ah_list); + spin_lock_init(&file->ucontext->lock); + + resp->async_fd = file->async_file.fd; + for (i = 0; i < file->device->num_comp; ++i) + resp->cq_fd[i] = file->comp_file[i].fd; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, resp, outsz)) { + 
ibdev->dealloc_ucontext(file->ucontext); + file->ucontext = NULL; + ret = -EFAULT; + } + + kfree(resp); + return ret; +} + +ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file, + const char __user *buf, + int in_len, int out_len) +{ + struct ib_uverbs_query_port cmd; + struct ib_uverbs_query_port_resp resp; + struct ib_port_attr attr; + int ret; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + ret = ib_query_port(file->device->ib_dev, cmd.port_num, &attr); + if (ret) + return ret; + + resp.state = attr.state; + resp.max_mtu = attr.max_mtu; + resp.active_mtu = attr.active_mtu; + resp.gid_tbl_len = attr.gid_tbl_len; + resp.port_cap_flags = attr.port_cap_flags; + resp.max_msg_sz = attr.max_msg_sz; + resp.bad_pkey_cntr = attr.bad_pkey_cntr; + resp.qkey_viol_cntr = attr.qkey_viol_cntr; + resp.pkey_tbl_len = attr.pkey_tbl_len; + resp.lid = attr.lid; + resp.sm_lid = attr.sm_lid; + resp.lmc = attr.lmc; + resp.max_vl_num = attr.max_vl_num; + resp.sm_sl = attr.sm_sl; + resp.subnet_timeout = attr.subnet_timeout; + resp.init_type_reply = attr.init_type_reply; + resp.active_width = attr.active_width; + resp.active_speed = attr.active_speed; + resp.phys_state = attr.phys_state; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) + return -EFAULT; + + return in_len; +} + +ssize_t ib_uverbs_alloc_pd(struct ib_uverbs_file *file, + const char __user *buf, + int in_len, int out_len) +{ + struct ib_uverbs_alloc_pd cmd; + struct ib_uverbs_alloc_pd_resp resp; + struct ib_uobject *uobj; + struct ib_pd *pd; + int ret; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + uobj = kmalloc(sizeof *uobj, GFP_KERNEL); + if (!uobj) + return -ENOMEM; + + uobj->context = file->ucontext; + + pd = file->device->ib_dev->alloc_pd(file->device->ib_dev, + file->ucontext, buf + sizeof cmd, + in_len - sizeof cmd - + sizeof (struct 
ib_uverbs_cmd_hdr)); + if (IS_ERR(pd)) { + ret = PTR_ERR(pd); + goto err; + } + + pd->device = file->device->ib_dev; + pd->uobject = uobj; + atomic_set(&pd->usecnt, 0); + +retry: + if (!idr_pre_get(&ib_uverbs_pd_idr, GFP_KERNEL)) { + ret = -ENOMEM; + goto err_pd; + } + + down(&ib_uverbs_idr_mutex); + ret = idr_get_new(&ib_uverbs_pd_idr, pd, &uobj->id); + up(&ib_uverbs_idr_mutex); + + if (ret == -EAGAIN) + goto retry; + if (ret) + goto err_pd; + + spin_lock_irq(&file->ucontext->lock); + list_add_tail(&uobj->list, &file->ucontext->pd_list); + spin_unlock_irq(&file->ucontext->lock); + + resp.pd_handle = uobj->id; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) { + ret = -EFAULT; + goto err_list; + } + + return in_len; + +err_list: + spin_lock_irq(&file->ucontext->lock); + list_del(&uobj->list); + spin_unlock_irq(&file->ucontext->lock); + + down(&ib_uverbs_idr_mutex); + idr_remove(&ib_uverbs_pd_idr, uobj->id); + up(&ib_uverbs_idr_mutex); + +err_pd: + ib_dealloc_pd(pd); + +err: + kfree(uobj); + return ret; +} + +ssize_t ib_uverbs_dealloc_pd(struct ib_uverbs_file *file, + const char __user *buf, + int in_len, int out_len) +{ + struct ib_uverbs_dealloc_pd cmd; + struct ib_pd *pd; + int ret = -EINVAL; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + down(&ib_uverbs_idr_mutex); + + pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); + if (!pd || pd->uobject->context != file->ucontext) + goto out; + + ret = ib_dealloc_pd(pd); + if (ret) + goto out; + + idr_remove(&ib_uverbs_pd_idr, cmd.pd_handle); + + spin_lock_irq(&file->ucontext->lock); + list_del(&pd->uobject->list); + spin_unlock_irq(&file->ucontext->lock); + + kfree(pd->uobject); + +out: + up(&ib_uverbs_idr_mutex); + + return ret ? 
ret : in_len; +} + +ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_reg_mr cmd; + struct ib_uverbs_reg_mr_resp resp; + struct ib_umem_object *obj; + struct ib_pd *pd; + struct ib_mr *mr; + int ret; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + if ((cmd.start & ~PAGE_MASK) != (cmd.hca_va & ~PAGE_MASK)) + return -EINVAL; + + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) + return -ENOMEM; + + obj->uobject.context = file->ucontext; + + ret = ib_umem_get(file->device->ib_dev, &obj->umem, + (void *) (unsigned long) cmd.start, + cmd.length); + if (ret) + goto err_free; + + obj->umem.virt_base = cmd.hca_va; + + down(&ib_uverbs_idr_mutex); + + pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); + if (!pd || pd->uobject->context != file->ucontext) { + ret = -EINVAL; + goto err_up; + } + + if (!pd->device->reg_user_mr) { + ret = -ENOSYS; + goto err_up; + } + + mr = pd->device->reg_user_mr(pd, &obj->umem, + cmd.access_flags, + buf + sizeof cmd, + in_len - sizeof cmd - + sizeof (struct ib_uverbs_cmd_hdr)); + if (IS_ERR(mr)) { + ret = PTR_ERR(mr); + goto err_up; + } + + mr->device = pd->device; + mr->pd = pd; + mr->uobject = &obj->uobject; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + + resp.lkey = mr->lkey; + resp.rkey = mr->rkey; + +retry: + if (!idr_pre_get(&ib_uverbs_mr_idr, GFP_KERNEL)) { + ret = -ENOMEM; + goto err_unreg; + } + + ret = idr_get_new(&ib_uverbs_mr_idr, mr, &obj->uobject.id); + + if (ret == -EAGAIN) + goto retry; + if (ret) + goto err_unreg; + + resp.mr_handle = obj->uobject.id; + + spin_lock_irq(&file->ucontext->lock); + list_add_tail(&obj->uobject.list, &file->ucontext->mr_list); + spin_unlock_irq(&file->ucontext->lock); + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) { + ret = -EFAULT; + goto err_list; + } + + up(&ib_uverbs_idr_mutex); + + return 
in_len; + +err_list: + spin_lock_irq(&file->ucontext->lock); + list_del(&obj->uobject.list); + spin_unlock_irq(&file->ucontext->lock); + +err_unreg: + ib_dereg_mr(mr); + +err_up: + up(&ib_uverbs_idr_mutex); + + ib_umem_release(file->device->ib_dev, &obj->umem); + +err_free: + kfree(obj); + return ret; +} + +ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_dereg_mr cmd; + struct ib_mr *mr; + struct ib_umem_object *memobj; + int ret = -EINVAL; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + down(&ib_uverbs_idr_mutex); + + mr = idr_find(&ib_uverbs_mr_idr, cmd.mr_handle); + if (!mr || mr->uobject->context != file->ucontext) + goto out; + + ret = ib_dereg_mr(mr); + if (ret) + goto out; + + idr_remove(&ib_uverbs_mr_idr, cmd.mr_handle); + + spin_lock_irq(&file->ucontext->lock); + list_del(&mr->uobject->list); + spin_unlock_irq(&file->ucontext->lock); + + memobj = container_of(mr->uobject, struct ib_umem_object, uobject); + ib_umem_release(file->device->ib_dev, &memobj->umem); + kfree(memobj); + +out: + up(&ib_uverbs_idr_mutex); + + return ret ? 
ret : in_len; +} + +ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_create_cq cmd; + struct ib_uverbs_create_cq_resp resp; + struct ib_uobject *uobj; + struct ib_cq *cq; + int ret; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + uobj = kmalloc(sizeof *uobj, GFP_KERNEL); + if (!uobj) + return -ENOMEM; + + uobj->user_handle = cmd.user_handle; + uobj->context = file->ucontext; + + cq = file->device->ib_dev->create_cq(file->device->ib_dev, cmd.cqe, + file->ucontext, buf + sizeof cmd, + in_len - sizeof cmd - + sizeof (struct ib_uverbs_cmd_hdr)); + if (IS_ERR(cq)) { + ret = PTR_ERR(cq); + goto err; + } + + cq->device = file->device->ib_dev; + cq->uobject = uobj; + cq->comp_handler = ib_uverbs_comp_handler; + cq->event_handler = ib_uverbs_cq_event_handler; + cq->cq_context = file; + atomic_set(&cq->usecnt, 0); + +retry: + if (!idr_pre_get(&ib_uverbs_cq_idr, GFP_KERNEL)) { + ret = -ENOMEM; + goto err_cq; + } + + down(&ib_uverbs_idr_mutex); + ret = idr_get_new(&ib_uverbs_cq_idr, cq, &uobj->id); + up(&ib_uverbs_idr_mutex); + + if (ret == -EAGAIN) + goto retry; + if (ret) + goto err_cq; + + spin_lock_irq(&file->ucontext->lock); + list_add_tail(&uobj->list, &file->ucontext->cq_list); + spin_unlock_irq(&file->ucontext->lock); + + resp.cq_handle = uobj->id; + resp.cqe = cq->cqe; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) { + ret = -EFAULT; + goto err_list; + } + + return in_len; + +err_list: + spin_lock_irq(&file->ucontext->lock); + list_del(&uobj->list); + spin_unlock_irq(&file->ucontext->lock); + + down(&ib_uverbs_idr_mutex); + idr_remove(&ib_uverbs_cq_idr, uobj->id); + up(&ib_uverbs_idr_mutex); + +err_cq: + ib_destroy_cq(cq); + +err: + kfree(uobj); + return ret; +} + +ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) 
+{ + struct ib_uverbs_destroy_cq cmd; + struct ib_cq *cq; + int ret = -EINVAL; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + down(&ib_uverbs_idr_mutex); + + cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); + if (!cq || cq->uobject->context != file->ucontext) + goto out; + + ret = ib_destroy_cq(cq); + if (ret) + goto out; + + idr_remove(&ib_uverbs_cq_idr, cmd.cq_handle); + + spin_lock_irq(&file->ucontext->lock); + list_del(&cq->uobject->list); + spin_unlock_irq(&file->ucontext->lock); + + kfree(cq->uobject); + +out: + up(&ib_uverbs_idr_mutex); + + return ret ? ret : in_len; +} + +ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_create_qp cmd; + struct ib_uverbs_create_qp_resp resp; + struct ib_uobject *uobj; + struct ib_pd *pd; + struct ib_cq *scq, *rcq; + struct ib_qp *qp; + struct ib_qp_init_attr attr; + int ret; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + uobj = kmalloc(sizeof *uobj, GFP_KERNEL); + if (!uobj) + return -ENOMEM; + + down(&ib_uverbs_idr_mutex); + + pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); + scq = idr_find(&ib_uverbs_cq_idr, cmd.send_cq_handle); + rcq = idr_find(&ib_uverbs_cq_idr, cmd.recv_cq_handle); + + if (!pd || pd->uobject->context != file->ucontext || + !scq || scq->uobject->context != file->ucontext || + !rcq || rcq->uobject->context != file->ucontext) { + ret = -EINVAL; + goto err_up; + } + + attr.event_handler = ib_uverbs_qp_event_handler; + attr.qp_context = file; + attr.send_cq = scq; + attr.recv_cq = rcq; + attr.srq = NULL; + attr.sq_sig_type = cmd.sq_sig_all ? 
IB_SIGNAL_ALL_WR : IB_SIGNAL_REQ_WR; + attr.qp_type = cmd.qp_type; + + attr.cap.max_send_wr = cmd.max_send_wr; + attr.cap.max_recv_wr = cmd.max_recv_wr; + attr.cap.max_send_sge = cmd.max_send_sge; + attr.cap.max_recv_sge = cmd.max_recv_sge; + attr.cap.max_inline_data = cmd.max_inline_data; + + uobj->user_handle = cmd.user_handle; + uobj->context = file->ucontext; + + qp = pd->device->create_qp(pd, &attr, buf + sizeof cmd, + in_len - sizeof cmd - + sizeof (struct ib_uverbs_cmd_hdr)); + if (IS_ERR(qp)) { + ret = PTR_ERR(qp); + goto err_up; + } + + qp->device = pd->device; + qp->pd = pd; + qp->send_cq = attr.send_cq; + qp->recv_cq = attr.recv_cq; + qp->srq = attr.srq; + qp->uobject = uobj; + qp->event_handler = attr.event_handler; + qp->qp_context = attr.qp_context; + qp->qp_type = attr.qp_type; + atomic_inc(&pd->usecnt); + atomic_inc(&attr.send_cq->usecnt); + atomic_inc(&attr.recv_cq->usecnt); + if (attr.srq) + atomic_inc(&attr.srq->usecnt); + + resp.qpn = qp->qp_num; + +retry: + if (!idr_pre_get(&ib_uverbs_qp_idr, GFP_KERNEL)) { + ret = -ENOMEM; + goto err_destroy; + } + + ret = idr_get_new(&ib_uverbs_qp_idr, qp, &uobj->id); + + if (ret == -EAGAIN) + goto retry; + if (ret) + goto err_destroy; + + resp.qp_handle = uobj->id; + + spin_lock_irq(&file->ucontext->lock); + list_add_tail(&uobj->list, &file->ucontext->qp_list); + spin_unlock_irq(&file->ucontext->lock); + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) { + ret = -EFAULT; + goto err_list; + } + + up(&ib_uverbs_idr_mutex); + + return in_len; + +err_list: + spin_lock_irq(&file->ucontext->lock); + list_del(&uobj->list); + spin_unlock_irq(&file->ucontext->lock); + +err_destroy: + ib_destroy_qp(qp); + +err_up: + up(&ib_uverbs_idr_mutex); + + kfree(uobj); + return ret; +} + +ssize_t ib_uverbs_modify_qp(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_modify_qp cmd; + struct ib_qp *qp; + struct ib_qp_attr *attr; + int ret; 
+ + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) + return -ENOMEM; + + down(&ib_uverbs_idr_mutex); + + qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); + if (!qp || qp->uobject->context != file->ucontext) { + ret = -EINVAL; + goto out; + } + + attr->qp_state = cmd.qp_state; + attr->cur_qp_state = cmd.cur_qp_state; + attr->path_mtu = cmd.path_mtu; + attr->path_mig_state = cmd.path_mig_state; + attr->qkey = cmd.qkey; + attr->rq_psn = cmd.rq_psn; + attr->sq_psn = cmd.sq_psn; + attr->dest_qp_num = cmd.dest_qp_num; + attr->qp_access_flags = cmd.qp_access_flags; + attr->pkey_index = cmd.pkey_index; + attr->alt_pkey_index = cmd.alt_pkey_index; + attr->en_sqd_async_notify = cmd.en_sqd_async_notify; + attr->max_rd_atomic = cmd.max_rd_atomic; + attr->max_dest_rd_atomic = cmd.max_dest_rd_atomic; + attr->min_rnr_timer = cmd.min_rnr_timer; + attr->port_num = cmd.port_num; + attr->timeout = cmd.timeout; + attr->retry_cnt = cmd.retry_cnt; + attr->rnr_retry = cmd.rnr_retry; + attr->alt_port_num = cmd.alt_port_num; + attr->alt_timeout = cmd.alt_timeout; + + memcpy(attr->ah_attr.grh.dgid.raw, cmd.dest.dgid, 16); + attr->ah_attr.grh.flow_label = cmd.dest.flow_label; + attr->ah_attr.grh.sgid_index = cmd.dest.sgid_index; + attr->ah_attr.grh.hop_limit = cmd.dest.hop_limit; + attr->ah_attr.grh.traffic_class = cmd.dest.traffic_class; + attr->ah_attr.dlid = cmd.dest.dlid; + attr->ah_attr.sl = cmd.dest.sl; + attr->ah_attr.src_path_bits = cmd.dest.src_path_bits; + attr->ah_attr.static_rate = cmd.dest.static_rate; + attr->ah_attr.ah_flags = cmd.dest.is_global ?
IB_AH_GRH : 0; + attr->ah_attr.port_num = cmd.dest.port_num; + + memcpy(attr->alt_ah_attr.grh.dgid.raw, cmd.alt_dest.dgid, 16); + attr->alt_ah_attr.grh.flow_label = cmd.alt_dest.flow_label; + attr->alt_ah_attr.grh.sgid_index = cmd.alt_dest.sgid_index; + attr->alt_ah_attr.grh.hop_limit = cmd.alt_dest.hop_limit; + attr->alt_ah_attr.grh.traffic_class = cmd.alt_dest.traffic_class; + attr->alt_ah_attr.dlid = cmd.alt_dest.dlid; + attr->alt_ah_attr.sl = cmd.alt_dest.sl; + attr->alt_ah_attr.src_path_bits = cmd.alt_dest.src_path_bits; + attr->alt_ah_attr.static_rate = cmd.alt_dest.static_rate; + attr->alt_ah_attr.ah_flags = cmd.alt_dest.is_global ? IB_AH_GRH : 0; + attr->alt_ah_attr.port_num = cmd.alt_dest.port_num; + + ret = ib_modify_qp(qp, attr, cmd.attr_mask); + if (ret) + goto out; + + ret = in_len; + +out: + up(&ib_uverbs_idr_mutex); + kfree(attr); + + return ret; +} + +ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_destroy_qp cmd; + struct ib_qp *qp; + int ret = -EINVAL; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + down(&ib_uverbs_idr_mutex); + + qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); + if (!qp || qp->uobject->context != file->ucontext) + goto out; + + ret = ib_destroy_qp(qp); + if (ret) + goto out; + + idr_remove(&ib_uverbs_qp_idr, cmd.qp_handle); + + spin_lock_irq(&file->ucontext->lock); + list_del(&qp->uobject->list); + spin_unlock_irq(&file->ucontext->lock); + + kfree(qp->uobject); + +out: + up(&ib_uverbs_idr_mutex); + + return ret ? ret : in_len; +} + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-export/drivers/infiniband/core/uverbs_main.c 2005-04-04 14:53:17.824728218 -0700 @@ -0,0 +1,688 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: uverbs_main.c 2109 2005-04-04 21:10:34Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "uverbs.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand userspace verbs access"); +MODULE_LICENSE("Dual BSD/GPL"); + +#define INFINIBANDEVENTFS_MAGIC 0x49426576 /* "IBev" */ + +enum { + IB_UVERBS_MAJOR = 231, + IB_UVERBS_BASE_MINOR = 128, + IB_UVERBS_MAX_DEVICES = 32 +}; + +#define IB_UVERBS_BASE_DEV MKDEV(IB_UVERBS_MAJOR, IB_UVERBS_BASE_MINOR) + +DECLARE_MUTEX(ib_uverbs_idr_mutex); +DEFINE_IDR(ib_uverbs_pd_idr); +DEFINE_IDR(ib_uverbs_mr_idr); +DEFINE_IDR(ib_uverbs_mw_idr); +DEFINE_IDR(ib_uverbs_ah_idr); +DEFINE_IDR(ib_uverbs_cq_idr); +DEFINE_IDR(ib_uverbs_qp_idr); + +static spinlock_t map_lock; +static DECLARE_BITMAP(dev_map, IB_UVERBS_MAX_DEVICES); + +static ssize_t (*uverbs_cmd_table[])(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) = { + [IB_USER_VERBS_CMD_QUERY_PARAMS] = ib_uverbs_query_params, + [IB_USER_VERBS_CMD_GET_CONTEXT] = ib_uverbs_get_context, + [IB_USER_VERBS_CMD_QUERY_PORT] = ib_uverbs_query_port, + [IB_USER_VERBS_CMD_ALLOC_PD] = ib_uverbs_alloc_pd, + [IB_USER_VERBS_CMD_DEALLOC_PD] = ib_uverbs_dealloc_pd, + [IB_USER_VERBS_CMD_REG_MR] = ib_uverbs_reg_mr, + [IB_USER_VERBS_CMD_DEREG_MR] = ib_uverbs_dereg_mr, + [IB_USER_VERBS_CMD_CREATE_CQ] = ib_uverbs_create_cq, + [IB_USER_VERBS_CMD_DESTROY_CQ] = ib_uverbs_destroy_cq, + [IB_USER_VERBS_CMD_CREATE_QP] = ib_uverbs_create_qp, + [IB_USER_VERBS_CMD_MODIFY_QP] = ib_uverbs_modify_qp, + [IB_USER_VERBS_CMD_DESTROY_QP] = ib_uverbs_destroy_qp, +}; + +static struct vfsmount *uverbs_event_mnt; + +static void ib_uverbs_add_one(struct ib_device *device); +static void ib_uverbs_remove_one(struct ib_device *device); + +static int ib_dealloc_ucontext(struct ib_ucontext *context) +{ + struct ib_uobject *uobj, *tmp; + + if (!context) + return 0; + + /* Free AHs */ + + 
list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) { + struct ib_qp *qp = idr_find(&ib_uverbs_qp_idr, uobj->id); + idr_remove(&ib_uverbs_qp_idr, uobj->id); + ib_destroy_qp(qp); + list_del(&uobj->list); + kfree(uobj); + } + + list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) { + struct ib_cq *cq = idr_find(&ib_uverbs_cq_idr, uobj->id); + idr_remove(&ib_uverbs_cq_idr, uobj->id); + ib_destroy_cq(cq); + list_del(&uobj->list); + kfree(uobj); + } + + /* XXX Free SRQs */ + /* XXX Free MWs */ + + list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { + struct ib_mr *mr = idr_find(&ib_uverbs_mr_idr, uobj->id); + struct ib_umem_object *memobj; + + memobj = container_of(uobj, struct ib_umem_object, uobject); + ib_umem_release(mr->device, &memobj->umem); + + idr_remove(&ib_uverbs_mr_idr, uobj->id); + ib_dereg_mr(mr); + list_del(&uobj->list); + kfree(memobj); + } + + list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) { + struct ib_pd *pd = idr_find(&ib_uverbs_pd_idr, uobj->id); + idr_remove(&ib_uverbs_pd_idr, uobj->id); + ib_dealloc_pd(pd); + list_del(&uobj->list); + kfree(uobj); + } + + return context->device->dealloc_ucontext(context); +} + +static void ib_uverbs_release_file(struct kref *ref) +{ + struct ib_uverbs_file *file = + container_of(ref, struct ib_uverbs_file, ref); + + module_put(file->device->ib_dev->owner); + kfree(file); +} + +static ssize_t ib_uverbs_event_read(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_uverbs_event_file *file = filp->private_data; + void *event; + int eventsz; + int ret = 0; + + spin_lock_irq(&file->lock); + + while (list_empty(&file->event_list) && file->fd >= 0) { + spin_unlock_irq(&file->lock); + + if (filp->f_flags & O_NONBLOCK) + return -EAGAIN; + + if (wait_event_interruptible(file->poll_wait, + !list_empty(&file->event_list) || + file->fd < 0)) + return -ERESTARTSYS; + + spin_lock_irq(&file->lock); + } + + if (file->fd < 0) { + spin_unlock_irq(&file->lock); + 
return -ENODEV; + } + + if (file->is_async) { + event = list_entry(file->event_list.next, + struct ib_uverbs_async_event, list); + eventsz = sizeof (struct ib_uverbs_async_event_desc); + } else { + event = list_entry(file->event_list.next, + struct ib_uverbs_comp_event, list); + eventsz = sizeof (struct ib_uverbs_comp_event_desc); + } + + if (eventsz > count) { + ret = -EINVAL; + event = NULL; + } else + list_del(file->event_list.next); + + spin_unlock_irq(&file->lock); + + if (event) { + if (copy_to_user(buf, event, eventsz)) + ret = -EFAULT; + else + ret = eventsz; + } + + kfree(event); + + return ret; +} + +static unsigned int ib_uverbs_event_poll(struct file *filp, + struct poll_table_struct *wait) +{ + unsigned int pollflags = 0; + struct ib_uverbs_event_file *file = filp->private_data; + + poll_wait(filp, &file->poll_wait, wait); + + spin_lock_irq(&file->lock); + if (file->fd < 0) + pollflags = POLLERR; + else if (!list_empty(&file->event_list)) + pollflags = POLLIN | POLLRDNORM; + spin_unlock_irq(&file->lock); + + return pollflags; +} + +static void ib_uverbs_event_release(struct ib_uverbs_event_file *file) +{ + struct list_head *entry, *tmp; + int put = 0; + + spin_lock_irq(&file->lock); + if (file->fd != -1) { + put = 1; + file->fd = -1; + list_for_each_safe(entry, tmp, &file->event_list) + if (file->is_async) + kfree(list_entry(entry, struct ib_uverbs_async_event, list)); + else + kfree(list_entry(entry, struct ib_uverbs_comp_event, list)); + } + spin_unlock_irq(&file->lock); + + if (put) + kref_put(&file->uverbs_file->ref, ib_uverbs_release_file); + +} + +static int ib_uverbs_event_close(struct inode *inode, struct file *filp) +{ + struct ib_uverbs_event_file *file = filp->private_data; + + ib_uverbs_event_release(file); + + return 0; +} + +static struct file_operations uverbs_event_fops = { + /* + * No .owner field since we artificially create event files, + * so there is no increment to the module reference count in + * the open path. 
All event files come from a uverbs command + * file, which already takes a module reference, so this is OK. + */ + .read = ib_uverbs_event_read, + .poll = ib_uverbs_event_poll, + .release = ib_uverbs_event_close +}; + +void ib_uverbs_comp_handler(struct ib_cq *cq, void *cq_context) +{ + struct ib_uverbs_file *file = cq_context; + struct ib_uverbs_comp_event *entry; + unsigned long flags; + + entry = kmalloc(sizeof *entry, GFP_ATOMIC); + if (!entry) + return; + + entry->desc.cq_handle = cq->uobject->user_handle; + + spin_lock_irqsave(&file->comp_file[0].lock, flags); + list_add_tail(&entry->list, &file->comp_file[0].event_list); + spin_unlock_irqrestore(&file->comp_file[0].lock, flags); + + wake_up_interruptible(&file->comp_file[0].poll_wait); +} + +void ib_uverbs_cq_event_handler(struct ib_event *event, void *context_ptr) +{ + +} + +void ib_uverbs_qp_event_handler(struct ib_event *event, void *context_ptr) +{ + +} + +static void ib_uverbs_event_handler(struct ib_event_handler *handler, + struct ib_event *event) +{ + struct ib_uverbs_file *file = + container_of(handler, struct ib_uverbs_file, event_handler); + struct ib_uverbs_async_event *entry; + unsigned long flags; + + entry = kmalloc(sizeof *entry, GFP_ATOMIC); + if (!entry) + return; + + entry->desc.event_type = event->event; + entry->desc.element = event->element.port_num; + + spin_lock_irqsave(&file->async_file.lock, flags); + list_add_tail(&entry->list, &file->async_file.event_list); + spin_unlock_irqrestore(&file->async_file.lock, flags); + + wake_up_interruptible(&file->async_file.poll_wait); +} + +static int ib_uverbs_event_init(struct ib_uverbs_event_file *file, + struct ib_uverbs_file *uverbs_file) +{ + struct file *filp; + + spin_lock_init(&file->lock); + INIT_LIST_HEAD(&file->event_list); + init_waitqueue_head(&file->poll_wait); + file->uverbs_file = uverbs_file; + + file->fd = get_unused_fd(); + if (file->fd < 0) + return file->fd; + + filp = get_empty_filp(); + if (!filp) { + 
put_unused_fd(file->fd); + return -ENFILE; + } + + filp->f_op = &uverbs_event_fops; + filp->f_vfsmnt = mntget(uverbs_event_mnt); + filp->f_dentry = dget(uverbs_event_mnt->mnt_root); + filp->f_mapping = filp->f_dentry->d_inode->i_mapping; + filp->f_flags = O_RDONLY; + filp->f_mode = FMODE_READ; + filp->private_data = file; + + fd_install(file->fd, filp); + + return 0; +} + +static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_uverbs_file *file = filp->private_data; + struct ib_uverbs_cmd_hdr hdr; + + if (count < sizeof hdr) + return -EINVAL; + + if (copy_from_user(&hdr, buf, sizeof hdr)) + return -EFAULT; + + if (hdr.in_words * 4 != count) + return -EINVAL; + + if (hdr.command < 0 || hdr.command >= ARRAY_SIZE(uverbs_cmd_table)) + return -EINVAL; + + if (!file->ucontext && + hdr.command != IB_USER_VERBS_CMD_QUERY_PARAMS && + hdr.command != IB_USER_VERBS_CMD_GET_CONTEXT) + return -EINVAL; + + return uverbs_cmd_table[hdr.command](file, buf + sizeof hdr, + hdr.in_words * 4, hdr.out_words * 4); +} + +static int ib_uverbs_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct ib_uverbs_file *file = filp->private_data; + + return file->device->ib_dev->mmap(file->ucontext, vma); +} + +static int ib_uverbs_open(struct inode *inode, struct file *filp) +{ + struct ib_uverbs_device *dev = + container_of(inode->i_cdev, struct ib_uverbs_device, dev); + struct ib_uverbs_file *file; + int i = 0; + int ret; + + if (!try_module_get(dev->ib_dev->owner)) + return -ENODEV; + + file = kmalloc(sizeof *file + + (dev->num_comp - 1) * sizeof (struct ib_uverbs_event_file), + GFP_KERNEL); + if (!file) + return -ENOMEM; + + file->device = dev; + kref_init(&file->ref); + + file->ucontext = NULL; + + ret = ib_uverbs_event_init(&file->async_file, file); + if (ret) + goto err; + + file->async_file.is_async = 1; + + kref_get(&file->ref); + + for (i = 0; i < dev->num_comp; ++i) { + ret = 
ib_uverbs_event_init(&file->comp_file[i], file); + if (ret) + goto err_async; + kref_get(&file->ref); + file->comp_file[i].is_async = 0; + } + + + filp->private_data = file; + + INIT_IB_EVENT_HANDLER(&file->event_handler, dev->ib_dev, + ib_uverbs_event_handler); + if (ib_register_event_handler(&file->event_handler)) + goto err_async; + + return 0; + +err_async: + while (i--) + ib_uverbs_event_release(&file->comp_file[i]); + + ib_uverbs_event_release(&file->async_file); + +err: + kref_put(&file->ref, ib_uverbs_release_file); + + return ret; +} + +static int ib_uverbs_close(struct inode *inode, struct file *filp) +{ + struct ib_uverbs_file *file = filp->private_data; + int i; + + ib_unregister_event_handler(&file->event_handler); + ib_uverbs_event_release(&file->async_file); + ib_dealloc_ucontext(file->ucontext); + + for (i = 0; i < file->device->num_comp; ++i) + ib_uverbs_event_release(&file->comp_file[i]); + + kref_put(&file->ref, ib_uverbs_release_file); + + return 0; +} + +static struct file_operations uverbs_fops = { + .owner = THIS_MODULE, + .write = ib_uverbs_write, + .open = ib_uverbs_open, + .release = ib_uverbs_close +}; + +static struct file_operations uverbs_mmap_fops = { + .owner = THIS_MODULE, + .write = ib_uverbs_write, + .mmap = ib_uverbs_mmap, + .open = ib_uverbs_open, + .release = ib_uverbs_close +}; + +static struct ib_client uverbs_client = { + .name = "uverbs", + .add = ib_uverbs_add_one, + .remove = ib_uverbs_remove_one +}; + +static ssize_t show_dev(struct class_device *class_dev, char *buf) +{ + struct ib_uverbs_device *dev = + container_of(class_dev, struct ib_uverbs_device, class_dev); + + return print_dev_t(buf, dev->dev.dev); +} +static CLASS_DEVICE_ATTR(dev, S_IRUGO, show_dev, NULL); + +static ssize_t show_ibdev(struct class_device *class_dev, char *buf) +{ + struct ib_uverbs_device *dev = + container_of(class_dev, struct ib_uverbs_device, class_dev); + + return sprintf(buf, "%s\n", dev->ib_dev->name); +} +static CLASS_DEVICE_ATTR(ibdev, 
S_IRUGO, show_ibdev, NULL); + +static void ib_uverbs_release_class_dev(struct class_device *class_dev) +{ + struct ib_uverbs_device *dev = + container_of(class_dev, struct ib_uverbs_device, class_dev); + + cdev_del(&dev->dev); + clear_bit(dev->devnum, dev_map); + kfree(dev); +} + +static struct class uverbs_class = { + .name = "infiniband_verbs", + .release = ib_uverbs_release_class_dev +}; + +static ssize_t show_abi_version(struct class *class, char *buf) +{ + return sprintf(buf, "%d\n", IB_USER_VERBS_ABI_VERSION); +} +static CLASS_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); + +static void ib_uverbs_add_one(struct ib_device *device) +{ + struct ib_uverbs_device *uverbs_dev; + + if (!device->alloc_ucontext) + return; + + uverbs_dev = kmalloc(sizeof *uverbs_dev, GFP_KERNEL); + if (!uverbs_dev) + return; + + memset(uverbs_dev, 0, sizeof *uverbs_dev); + + spin_lock(&map_lock); + uverbs_dev->devnum = find_first_zero_bit(dev_map, IB_UVERBS_MAX_DEVICES); + if (uverbs_dev->devnum >= IB_UVERBS_MAX_DEVICES) { + spin_unlock(&map_lock); + goto err; + } + set_bit(uverbs_dev->devnum, dev_map); + spin_unlock(&map_lock); + + uverbs_dev->ib_dev = device; + uverbs_dev->num_comp = 1; + + if (device->mmap) + cdev_init(&uverbs_dev->dev, &uverbs_mmap_fops); + else + cdev_init(&uverbs_dev->dev, &uverbs_fops); + uverbs_dev->dev.owner = THIS_MODULE; + kobject_set_name(&uverbs_dev->dev.kobj, "uverbs%d", uverbs_dev->devnum); + if (cdev_add(&uverbs_dev->dev, IB_UVERBS_BASE_DEV + uverbs_dev->devnum, 1)) + goto err; + + uverbs_dev->class_dev.class = &uverbs_class; + uverbs_dev->class_dev.dev = device->dma_device; + snprintf(uverbs_dev->class_dev.class_id, BUS_ID_SIZE, "uverbs%d", uverbs_dev->devnum); + if (class_device_register(&uverbs_dev->class_dev)) + goto err_cdev; + + if (class_device_create_file(&uverbs_dev->class_dev, &class_device_attr_dev)) + goto err_class; + if (class_device_create_file(&uverbs_dev->class_dev, &class_device_attr_ibdev)) + goto err_class; + + 
ib_set_client_data(device, &uverbs_client, uverbs_dev); + + return; + +err_class: + class_device_unregister(&uverbs_dev->class_dev); + +err_cdev: + cdev_del(&uverbs_dev->dev); + clear_bit(uverbs_dev->devnum, dev_map); + +err: + kfree(uverbs_dev); + return; +} + +static void ib_uverbs_remove_one(struct ib_device *device) +{ + struct ib_uverbs_device *uverbs_dev = ib_get_client_data(device, &uverbs_client); + + if (!uverbs_dev) + return; + + class_device_unregister(&uverbs_dev->class_dev); +} + +static struct super_block *uverbs_event_get_sb(struct file_system_type *fs_type, int flags, + const char *dev_name, void *data) +{ + return get_sb_pseudo(fs_type, "infinibandevent:", NULL, + INFINIBANDEVENTFS_MAGIC); +} + +static struct file_system_type uverbs_event_fs = { + /* No owner field so module can be unloaded */ + .name = "infinibandeventfs", + .get_sb = uverbs_event_get_sb, + .kill_sb = kill_litter_super +}; + +static int __init ib_uverbs_init(void) +{ + int ret; + + spin_lock_init(&map_lock); + + ret = register_chrdev_region(IB_UVERBS_BASE_DEV, IB_UVERBS_MAX_DEVICES, + "infiniband_verbs"); + if (ret) { + printk(KERN_ERR "user_verbs: couldn't register device number\n"); + goto out; + } + + ret = class_register(&uverbs_class); + if (ret) { + printk(KERN_ERR "user_verbs: couldn't create class infiniband_verbs\n"); + goto out_chrdev; + } + + ret = class_create_file(&uverbs_class, &class_attr_abi_version); + if (ret) { + printk(KERN_ERR "user_verbs: couldn't create abi_version attribute\n"); + goto out_class; + } + + ret = register_filesystem(&uverbs_event_fs); + if (ret) { + printk(KERN_ERR "user_verbs: couldn't register infinibandeventfs\n"); + goto out_class; + } + + uverbs_event_mnt = kern_mount(&uverbs_event_fs); + if (IS_ERR(uverbs_event_mnt)) { + ret = PTR_ERR(uverbs_event_mnt); + printk(KERN_ERR "user_verbs: couldn't mount infinibandeventfs\n"); + goto out_fs; + } + + ret = ib_register_client(&uverbs_client); + if (ret) { + printk(KERN_ERR "user_verbs: couldn't 
register client\n"); + goto out_mnt; + } + + return 0; + +out_mnt: + mntput(uverbs_event_mnt); + +out_fs: + unregister_filesystem(&uverbs_event_fs); + +out_class: + class_unregister(&uverbs_class); + +out_chrdev: + unregister_chrdev_region(IB_UVERBS_BASE_DEV, IB_UVERBS_MAX_DEVICES); + +out: + return ret; +} + +static void __exit ib_uverbs_cleanup(void) +{ + ib_unregister_client(&uverbs_client); + unregister_filesystem(&uverbs_event_fs); + mntput(uverbs_event_mnt); + class_unregister(&uverbs_class); + unregister_chrdev_region(IB_UVERBS_BASE_DEV, IB_UVERBS_MAX_DEVICES); +} + +module_init(ib_uverbs_init); +module_exit(ib_uverbs_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-export/drivers/infiniband/core/uverbs_mem.c 2005-04-04 14:53:17.825728001 -0700 @@ -0,0 +1,202 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: uverbs_mem.c 1979 2005-03-11 21:17:00Z roland $ + */ + +#include +#include + +#include "uverbs.h" + +static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem) +{ + struct ib_umem_chunk *chunk, *tmp; + int i; + + list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) { + dma_unmap_sg(dev->dma_device, chunk->page_list, + chunk->nents, DMA_BIDIRECTIONAL); + for (i = 0; i < chunk->nents; ++i) { + set_page_dirty_lock(chunk->page_list[i].page); + put_page(chunk->page_list[i].page); + } + + kfree(chunk); + } +} + +static void __ib_umem_unmark(struct ib_umem *umem, struct mm_struct *mm) +{ + struct vm_area_struct *vma; + unsigned long cur_base; + + vma = find_vma(mm, umem->user_base); + + for (cur_base = umem->user_base; + cur_base < umem->user_base + umem->length; + cur_base = vma->vm_end) { + if (!vma || vma->vm_start > umem->user_base + umem->length) + break; + + if (!(vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) + vma->vm_flags &= ~VM_DONTCOPY; + + vma = vma->vm_next; + } +} + +int ib_umem_get(struct ib_device *dev, struct ib_umem *mem, + void *addr, size_t size) +{ + struct page **page_list; + struct vm_area_struct *vma; + struct ib_umem_chunk *chunk; + unsigned long cur_base; + int npages; + int ret = 0; + int off; + int i; + + page_list = (struct page **) __get_free_page(GFP_KERNEL); + if (!page_list) + return -ENOMEM; + + mem->user_base = (unsigned long) addr; + mem->length = size; + mem->offset = (unsigned long) addr & ~PAGE_MASK; + mem->page_size = PAGE_SIZE; + + INIT_LIST_HEAD(&mem->chunk_list); + + npages = PAGE_ALIGN(size + mem->offset) >> PAGE_SHIFT; + + down_write(&current->mm->mmap_sem); + + vma = find_vma(current->mm, mem->user_base); + +
for (cur_base = mem->user_base; + cur_base < mem->user_base + size; + cur_base = vma->vm_end) { + if (!vma || vma->vm_start > cur_base) { + ret = -ENOMEM; + goto out; + } + + if (!(vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) + vma->vm_flags |= VM_DONTCOPY; + + vma = vma->vm_next; + } + + cur_base = (unsigned long) addr & PAGE_MASK; + + while (npages) { + ret = get_user_pages(current, current->mm, cur_base, + min_t(int, npages, + PAGE_SIZE / sizeof (struct page *)), + 1, 0, page_list, NULL); + + if (ret < 0) + goto out; + + cur_base += ret * PAGE_SIZE; + npages -= ret; + + off = 0; + + while (ret) { + chunk = kmalloc(sizeof *chunk + sizeof (struct scatterlist) * + min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK), + GFP_KERNEL); + if (!chunk) { + ret = -ENOMEM; + goto out; + } + + chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK); + for (i = 0; i < chunk->nents; ++i) { + chunk->page_list[i].page = page_list[i + off]; + chunk->page_list[i].offset = 0; + chunk->page_list[i].length = PAGE_SIZE; + } + + chunk->nmap = dma_map_sg(dev->dma_device, + &chunk->page_list[0], + chunk->nents, + DMA_BIDIRECTIONAL); + if (chunk->nmap <= 0) { + for (i = 0; i < chunk->nents; ++i) + put_page(chunk->page_list[i].page); + kfree(chunk); + + ret = -ENOMEM; + goto out; + } + + ret -= chunk->nents; + off += chunk->nents; + list_add_tail(&chunk->list, &mem->chunk_list); + } + + ret = 0; + } + +out: + if (ret < 0) { + __ib_umem_unmark(mem, current->mm); + __ib_umem_release(dev, mem); + } + + up_write(&current->mm->mmap_sem); + free_page((unsigned long) page_list); + + return ret; +} + +void ib_umem_release(struct ib_device *dev, struct ib_umem *umem) +{ + struct mm_struct *mm; + + mm = get_task_mm(current); + + if (mm) { + down_write(&mm->mmap_sem); + __ib_umem_unmark(umem, mm); + } + + __ib_umem_release(dev, umem); + + if (mm) { + up_write(&current->mm->mmap_sem); + mmput(mm); + } +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++
linux-export/drivers/infiniband/include/ib_user_verbs.h 2005-04-04 14:55:47.946083444 -0700 @@ -0,0 +1,275 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_user_verbs.h 2001 2005-03-16 04:15:41Z roland $ + */ + +#ifndef IB_USER_VERBS_H +#define IB_USER_VERBS_H + +#include + +/* + * Increment this value if any changes that break userspace ABI + * compatibility are made. 
+ */ +#define IB_USER_VERBS_ABI_VERSION 1 + +enum { + IB_USER_VERBS_CMD_QUERY_PARAMS, + IB_USER_VERBS_CMD_GET_CONTEXT, + IB_USER_VERBS_CMD_QUERY_PORT, + IB_USER_VERBS_CMD_ALLOC_PD, + IB_USER_VERBS_CMD_DEALLOC_PD, + IB_USER_VERBS_CMD_REG_MR, + IB_USER_VERBS_CMD_DEREG_MR, + IB_USER_VERBS_CMD_CREATE_CQ, + IB_USER_VERBS_CMD_DESTROY_CQ, + IB_USER_VERBS_CMD_CREATE_QP, + IB_USER_VERBS_CMD_MODIFY_QP, + IB_USER_VERBS_CMD_DESTROY_QP, +}; + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. + */ + +struct ib_uverbs_async_event_desc { + __u64 element; + __u32 event_type; /* enum ib_event_type */ + __u32 reserved; +}; + +struct ib_uverbs_comp_event_desc { + __u64 cq_handle; +}; + +/* + * All commands from userspace should start with a __u32 command field + * followed by __u16 in_words and out_words fields (which give the + * length of the command block and response buffer if any in 32-bit + * words). The kernel driver will read these fields first and read + * the rest of the command struct based on these values. + */ + +struct ib_uverbs_cmd_hdr { + __u32 command; + __u16 in_words; + __u16 out_words; +}; + +/* + * No driver_data for "query params" command, since this is intended + * to be a core function with no possible device dependence.
+ */ +struct ib_uverbs_query_params { + __u64 response; +}; + +struct ib_uverbs_query_params_resp { + __u32 num_cq_events; +}; + +struct ib_uverbs_get_context { + __u64 response; + __u64 driver_data[0]; +}; + +struct ib_uverbs_get_context_resp { + __u32 async_fd; + __u32 cq_fd[1]; +}; + +struct ib_uverbs_query_port { + __u64 response; + __u8 port_num; + __u8 reserved[7]; + __u64 driver_data[0]; +}; + +struct ib_uverbs_query_port_resp { + __u32 port_cap_flags; + __u32 max_msg_sz; + __u32 bad_pkey_cntr; + __u32 qkey_viol_cntr; + __u32 gid_tbl_len; + __u16 pkey_tbl_len; + __u16 lid; + __u16 sm_lid; + __u8 state; + __u8 max_mtu; + __u8 active_mtu; + __u8 lmc; + __u8 max_vl_num; + __u8 sm_sl; + __u8 subnet_timeout; + __u8 init_type_reply; + __u8 active_width; + __u8 active_speed; + __u8 phys_state; + __u8 reserved[3]; +}; + +struct ib_uverbs_alloc_pd { + __u64 response; + __u64 driver_data[0]; +}; + +struct ib_uverbs_alloc_pd_resp { + __u32 pd_handle; +}; + +struct ib_uverbs_dealloc_pd { + __u32 pd_handle; +}; + +struct ib_uverbs_reg_mr { + __u64 response; + __u64 start; + __u64 length; + __u64 hca_va; + __u32 pd_handle; + __u32 access_flags; + __u64 driver_data[0]; +}; + +struct ib_uverbs_reg_mr_resp { + __u32 mr_handle; + __u32 lkey; + __u32 rkey; +}; + +struct ib_uverbs_dereg_mr { + __u32 mr_handle; +}; + +struct ib_uverbs_create_cq { + __u64 response; + __u64 user_handle; + __u32 cqe; + __u32 reserved; + __u64 driver_data[0]; +}; + +struct ib_uverbs_create_cq_resp { + __u32 cq_handle; + __u32 cqe; +}; + +struct ib_uverbs_destroy_cq { + __u32 cq_handle; +}; + +struct ib_uverbs_create_qp { + __u64 response; + __u64 user_handle; + __u32 pd_handle; + __u32 send_cq_handle; + __u32 recv_cq_handle; + __u32 srq_handle; + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; + __u8 sq_sig_all; + __u8 qp_type; + __u8 is_srq; + __u8 reserved; + __u64 driver_data[0]; +}; + +struct ib_uverbs_create_qp_resp { + __u32 
qp_handle; + __u32 qpn; +}; + +/* + * This struct needs to remain a multiple of 8 bytes to keep the + * alignment of the modify QP parameters. + */ +struct ib_uverbs_qp_dest { + __u8 dgid[16]; + __u32 flow_label; + __u16 dlid; + __u16 reserved; + __u8 sgid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; +}; + +struct ib_uverbs_modify_qp { + struct ib_uverbs_qp_dest dest; + struct ib_uverbs_qp_dest alt_dest; + __u32 qp_handle; + __u32 attr_mask; + __u32 qkey; + __u32 rq_psn; + __u32 sq_psn; + __u32 dest_qp_num; + __u32 qp_access_flags; + __u16 pkey_index; + __u16 alt_pkey_index; + __u8 qp_state; + __u8 cur_qp_state; + __u8 path_mtu; + __u8 path_mig_state; + __u8 en_sqd_async_notify; + __u8 max_rd_atomic; + __u8 max_dest_rd_atomic; + __u8 min_rnr_timer; + __u8 port_num; + __u8 timeout; + __u8 retry_cnt; + __u8 rnr_retry; + __u8 alt_port_num; + __u8 alt_timeout; + __u8 reserved[2]; + __u64 driver_data[0]; +}; + +struct ib_uverbs_modify_qp_resp { +}; + +struct ib_uverbs_destroy_qp { + __u32 qp_handle; +}; + +#endif /* IB_USER_VERBS_H */ From roland at topspin.com Mon Apr 4 15:09:00 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 4 Apr 2005 15:09:00 -0700 Subject: [openib-general] [PATCH][RFC][3/4] IB: userspace verbs mthca changes In-Reply-To: <200544159.3X7p8nZ87qWqA7cv@topspin.com> Message-ID: <200544159.AzH1nqpM3uTQZaKG@topspin.com> Add Mellanox HCA-specific userspace verbs support to mthca. 
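The layout rule used by these userspace ABI structs (fixed-width fields only, user pointers carried in __u64, no implicit padding) can be checked at compile time. What follows is only a sketch under stated assumptions, not part of the patch: it mirrors struct mthca_create_cq from the mthca_user.h added here, with the __u32/__u64 types spelled as their <stdint.h> equivalents so it builds outside the kernel tree. The helpers ptr_to_u64() and fill_create_cq() are hypothetical names, showing one way a userspace driver library might fill in the command.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Mirror of struct mthca_create_cq from mthca_user.h, with __u64/__u32
 * written as <stdint.h> types so this builds standalone.
 */
struct mthca_create_cq {
	uint64_t cqnbuf;	/* user address the kernel writes the CQN to */
	uint32_t lkey;
	uint32_t pdn;
	uint64_t arm_db_page;	/* user address of the arm doorbell page */
	uint64_t set_db_page;	/* user address of the set_ci doorbell page */
	uint32_t arm_db_index;
	uint32_t set_db_index;
};

/*
 * Every 64-bit member sits at a multiple of 8 and there is no tail
 * padding, so 32-bit and 64-bit builds should agree on the layout.
 */
_Static_assert(sizeof(struct mthca_create_cq) == 40, "unexpected size");
_Static_assert(offsetof(struct mthca_create_cq, arm_db_page) == 16,
	       "unexpected offset of arm_db_page");
_Static_assert(offsetof(struct mthca_create_cq, arm_db_index) == 32,
	       "unexpected offset of arm_db_index");

/* Hypothetical helper: widen a user pointer into a 64-bit ABI field. */
static uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t) (uintptr_t) p;
}

/*
 * Hypothetical helper: fill a create-CQ command.  The doorbell page
 * addresses are assumed to be 4096-byte aligned by the caller, since
 * mthca_map_user_db() in this patch rejects addresses that are not.
 */
static void fill_create_cq(struct mthca_create_cq *cmd, uint32_t *cqn_out,
			   uint32_t lkey, uint32_t pdn,
			   const void *set_page, uint32_t set_index,
			   const void *arm_page, uint32_t arm_index)
{
	cmd->cqnbuf       = ptr_to_u64(cqn_out);
	cmd->lkey         = lkey;
	cmd->pdn          = pdn;
	cmd->arm_db_page  = ptr_to_u64(arm_page);
	cmd->set_db_page  = ptr_to_u64(set_page);
	cmd->arm_db_index = arm_index;
	cmd->set_db_index = set_index;
}
```

A 32-bit caller zero-extends its pointers through uintptr_t, so the kernel should see the same 40-byte block whichever way userspace was compiled.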
Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-04-04 14:57:12.228756073 -0700 +++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c 2005-04-04 14:58:12.364679525 -0700 @@ -743,6 +743,7 @@ } int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_ucontext *ctx, u32 pdn, struct mthca_cq *cq) { int size = nent * MTHCA_CQ_ENTRY_SIZE; @@ -754,30 +755,33 @@ might_sleep(); - cq->ibcq.cqe = nent - 1; + cq->ibcq.cqe = nent - 1; + cq->is_kernel = !ctx; cq->cqn = mthca_alloc(&dev->cq_table.alloc); if (cq->cqn == -1) return -ENOMEM; if (mthca_is_memfree(dev)) { - cq->arm_sn = 1; - err = mthca_table_get(dev, dev->cq_table.table, cq->cqn); if (err) goto err_out; - err = -ENOMEM; + if (cq->is_kernel) { + cq->arm_sn = 1; + + err = -ENOMEM; - cq->set_ci_db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, - cq->cqn, &cq->set_ci_db); - if (cq->set_ci_db_index < 0) - goto err_out_icm; - - cq->arm_db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_CQ_ARM, - cq->cqn, &cq->arm_db); - if (cq->arm_db_index < 0) - goto err_out_ci; + cq->set_ci_db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, + cq->cqn, &cq->set_ci_db); + if (cq->set_ci_db_index < 0) + goto err_out_icm; + + cq->arm_db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_CQ_ARM, + cq->cqn, &cq->arm_db); + if (cq->arm_db_index < 0) + goto err_out_ci; + } } mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, @@ -787,12 +791,14 @@ cq_context = MAILBOX_ALIGN(mailbox); - err = mthca_alloc_cq_buf(dev, size, cq); - if (err) - goto err_out_mailbox; + if (cq->is_kernel) { + err = mthca_alloc_cq_buf(dev, size, cq); + if (err) + goto err_out_mailbox; - for (i = 0; i < nent; ++i) - set_cqe_hw(get_cqe(cq, i)); + for (i = 0; i < nent; ++i) + set_cqe_hw(get_cqe(cq, i)); + } spin_lock_init(&cq->lock); atomic_set(&cq->refcount, 1); @@ -803,11 +809,14 @@ MTHCA_CQ_STATE_DISARMED | MTHCA_CQ_FLAG_TR); cq_context->start = cpu_to_be64(0); - cq_context->logsize_usrpage = 
cpu_to_be32((ffs(nent) - 1) << 24 | - dev->driver_uar.index); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24); + if (ctx) + cq_context->logsize_usrpage |= cpu_to_be32(ctx->uar.index); + else + cq_context->logsize_usrpage |= cpu_to_be32(dev->driver_uar.index); cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); - cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->pd = cpu_to_be32(pdn); cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); cq_context->cqn = cpu_to_be32(cq->cqn); @@ -845,17 +854,19 @@ return 0; err_out_free_mr: - mthca_free_mr(dev, &cq->mr); - mthca_free_cq_buf(dev, cq); + if (cq->is_kernel) { + mthca_free_mr(dev, &cq->mr); + mthca_free_cq_buf(dev, cq); + } err_out_mailbox: kfree(mailbox); - if (mthca_is_memfree(dev)) + if (cq->is_kernel && mthca_is_memfree(dev)) mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index); err_out_ci: - if (mthca_is_memfree(dev)) + if (cq->is_kernel) mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index); err_out_icm: @@ -895,7 +906,8 @@ int j; printk(KERN_ERR "context for CQN %x (cons index %x, next sw %d)\n", - cq->cqn, cq->cons_index, !!next_cqe_sw(cq)); + cq->cqn, cq->cons_index, + cq->is_kernel ? 
!!next_cqe_sw(cq) : 0); for (j = 0; j < 16; ++j) printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); } @@ -913,15 +925,17 @@ atomic_dec(&cq->refcount); wait_event(cq->wait, !atomic_read(&cq->refcount)); - mthca_free_mr(dev, &cq->mr); - mthca_free_cq_buf(dev, cq); - - if (mthca_is_memfree(dev)) { - mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index); - mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index); - mthca_table_put(dev, dev->cq_table.table, cq->cqn); + if (cq->is_kernel) { + mthca_free_mr(dev, &cq->mr); + mthca_free_cq_buf(dev, cq); + if (mthca_is_memfree(dev)) { + mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index); + mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index); + } } + if (mthca_is_memfree(dev)) + mthca_table_put(dev, dev->cq_table.table, cq->cqn); mthca_free(&dev->cq_table.alloc, cq->cqn); kfree(mailbox); } --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-04 14:57:12.254750421 -0700 +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-04 14:58:12.411669307 -0700 @@ -49,14 +49,6 @@ #define DRV_VERSION "0.06-pre" #define DRV_RELDATE "November 8, 2004" -/* XXX remove once SINAI defines make it into kernel.org */ -#ifndef PCI_DEVICE_ID_MELLANOX_SINAI_OLD -#define PCI_DEVICE_ID_MELLANOX_SINAI_OLD 0x5e8c -#endif -#ifndef PCI_DEVICE_ID_MELLANOX_SINAI -#define PCI_DEVICE_ID_MELLANOX_SINAI 0x6274 -#endif - enum { MTHCA_FLAG_DDR_HIDDEN = 1 << 1, MTHCA_FLAG_SRQ = 1 << 2, @@ -413,6 +405,7 @@ int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_ucontext *ctx, u32 pdn, struct mthca_cq *cq); void mthca_free_cq(struct mthca_dev *dev, struct mthca_cq *cq); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-04-04 14:57:12.256749986 -0700 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c 2005-04-04 
14:58:12.412669090 -0700 @@ -45,6 +45,15 @@ MTHCA_TABLE_CHUNK_SIZE = 1 << 18 }; +struct mthca_user_db_table { + struct semaphore mutex; + struct { + u64 uvirt; + struct scatterlist mem; + int refcount; + } page[0]; +}; + void mthca_free_icm(struct mthca_dev *dev, struct mthca_icm *icm) { struct mthca_icm_chunk *chunk, *tmp; @@ -334,13 +343,132 @@ kfree(table); } -static u64 mthca_uarc_virt(struct mthca_dev *dev, int page) +static u64 mthca_uarc_virt(struct mthca_dev *dev, struct mthca_uar *uar, int page) { return dev->uar_table.uarc_base + - dev->driver_uar.index * dev->uar_table.uarc_size + + uar->index * dev->uar_table.uarc_size + page * 4096; } +int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar, + struct mthca_user_db_table *db_tab, int index, u64 uaddr) +{ + int ret = 0; + u8 status; + int i; + + if (!mthca_is_memfree(dev)) + return 0; + + if (index < 0 || index > dev->uar_table.uarc_size / 8) + return -EINVAL; + + down(&db_tab->mutex); + + i = index / MTHCA_DB_REC_PER_PAGE; + + if ((db_tab->page[i].refcount >= MTHCA_DB_REC_PER_PAGE) || + (db_tab->page[i].uvirt && db_tab->page[i].uvirt != uaddr) || + (uaddr & 4095)) { + ret = -EINVAL; + goto out; + } + + if (db_tab->page[i].refcount) { + ++db_tab->page[i].refcount; + goto out; + } + + ret = get_user_pages(current, current->mm, uaddr & PAGE_MASK, 1, 1, 0, + &db_tab->page[i].mem.page, NULL); + if (ret < 0) + goto out; + + db_tab->page[i].mem.offset = uaddr & ~PAGE_MASK; + + ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE); + if (ret < 0) { + put_page(db_tab->page[i].mem.page); + goto out; + } + + ret = mthca_MAP_ICM_page(dev, sg_dma_address(&db_tab->page[i].mem), + mthca_uarc_virt(dev, uar, i), &status); + if (!ret && status) + ret = -EINVAL; + if (ret) { + pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE); + put_page(db_tab->page[i].mem.page); + goto out; + } + + db_tab->page[i].uvirt = uaddr; + db_tab->page[i].refcount = 1; + +out: + up(&db_tab->mutex); + 
return ret; +} + +void mthca_unmap_user_db(struct mthca_dev *dev, struct mthca_uar *uar, + struct mthca_user_db_table *db_tab, int index) +{ + if (!mthca_is_memfree(dev)) + return; + + /* + * To make our bookkeeping simpler, we don't unmap DB + * pages until we clean up the whole db table. + */ + + down(&db_tab->mutex); + + --db_tab->page[index / MTHCA_DB_REC_PER_PAGE].refcount; + + up(&db_tab->mutex); +} + +struct mthca_user_db_table *mthca_init_user_db_tab(struct mthca_dev *dev) +{ + struct mthca_user_db_table *db_tab; + int npages; + int i; + + if (!mthca_is_memfree(dev)) + return NULL; + + npages = dev->uar_table.uarc_size / 4096; + db_tab = kmalloc(sizeof *db_tab + npages * sizeof *db_tab->page, GFP_KERNEL); + if (!db_tab) + return ERR_PTR(-ENOMEM); + + init_MUTEX(&db_tab->mutex); + for (i = 0; i < npages; ++i) { + db_tab->page[i].refcount = 0; + db_tab->page[i].uvirt = 0; + } + + return db_tab; +} + +void mthca_cleanup_user_db_tab(struct mthca_dev *dev, struct mthca_uar *uar, + struct mthca_user_db_table *db_tab) +{ + int i; + u8 status; + + if (!mthca_is_memfree(dev)) + return; + + for (i = 0; i < dev->uar_table.uarc_size / 4096; ++i) { + if (db_tab->page[i].uvirt) { + mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1, &status); + pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE); + put_page(db_tab->page[i].mem.page); + } + } +} + int mthca_alloc_db(struct mthca_dev *dev, int type, u32 qn, u32 **db) { int group; @@ -397,7 +525,8 @@ } memset(page->db_rec, 0, 4096); - ret = mthca_MAP_ICM_page(dev, page->mapping, mthca_uarc_virt(dev, i), &status); + ret = mthca_MAP_ICM_page(dev, page->mapping, + mthca_uarc_virt(dev, &dev->driver_uar, i), &status); if (!ret && status) ret = -EINVAL; if (ret) { @@ -451,7 +580,7 @@ if (bitmap_empty(page->used, MTHCA_DB_REC_PER_PAGE) && i >= dev->db_tab->max_group1 - 1) { - mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, i), 1, &status); + mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, &dev->driver_uar, i), 1, &status); 
dma_free_coherent(&dev->pdev->dev, 4096, page->db_rec, page->mapping); @@ -520,7 +649,7 @@ if (!bitmap_empty(dev->db_tab->page[i].used, MTHCA_DB_REC_PER_PAGE)) mthca_warn(dev, "Kernel UARC page %d not empty\n", i); - mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, i), 1, &status); + mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, &dev->driver_uar, i), 1, &status); dma_free_coherent(&dev->pdev->dev, 4096, dev->db_tab->page[i].db_rec, --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-04-04 14:57:12.256749986 -0700 +++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.h 2005-04-04 14:58:12.413668872 -0700 @@ -148,7 +148,7 @@ struct semaphore mutex; }; -enum { +enum mthca_db_type { MTHCA_DB_TYPE_INVALID = 0x0, MTHCA_DB_TYPE_CQ_SET_CI = 0x1, MTHCA_DB_TYPE_CQ_ARM = 0x2, @@ -158,6 +158,17 @@ MTHCA_DB_TYPE_GROUP_SEP = 0x7 }; +struct mthca_user_db_table; +struct mthca_uar; + +int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar, + struct mthca_user_db_table *db_tab, int index, u64 uaddr); +void mthca_unmap_user_db(struct mthca_dev *dev, struct mthca_uar *uar, + struct mthca_user_db_table *db_tab, int index); +struct mthca_user_db_table *mthca_init_user_db_tab(struct mthca_dev *dev); +void mthca_cleanup_user_db_tab(struct mthca_dev *dev, struct mthca_uar *uar, + struct mthca_user_db_table *db_tab); + int mthca_init_db_tab(struct mthca_dev *dev); void mthca_cleanup_db_tab(struct mthca_dev *dev); int mthca_alloc_db(struct mthca_dev *dev, int type, u32 qn, u32 **db); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-04 14:57:12.286743464 -0700 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c 2005-04-04 14:58:12.444662133 -0700 @@ -29,13 +29,17 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: mthca_provider.c 2100 2005-03-31 20:43:01Z roland $ + * $Id: mthca_provider.c 2109 2005-04-04 21:10:34Z roland $ */ +#include + #include #include "mthca_dev.h" #include "mthca_cmd.h" +#include "mthca_user.h" +#include "mthca_memfree.h" static int mthca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) @@ -283,11 +287,78 @@ return err; } -static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +static struct ib_ucontext *mthca_alloc_ucontext(struct ib_device *ibdev, + const void __user *udata, int udatalen) +{ + struct mthca_alloc_ucontext ucmd; + struct mthca_alloc_ucontext_resp uresp; + struct mthca_ucontext *context; + int err; + + if (copy_from_user(&ucmd, udata, sizeof ucmd)) + return ERR_PTR(-EFAULT); + + uresp.qp_tab_size = to_mdev(ibdev)->limits.num_qps; + if (mthca_is_memfree(to_mdev(ibdev))) + uresp.uarc_size = to_mdev(ibdev)->uar_table.uarc_size; + else + uresp.uarc_size = 0; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) + return ERR_PTR(-ENOMEM); + + err = mthca_uar_alloc(to_mdev(ibdev), &context->uar); + if (err) { + kfree(context); + return ERR_PTR(err); + } + + context->db_tab = mthca_init_user_db_tab(to_mdev(ibdev)); + if (IS_ERR(context->db_tab)) { + err = PTR_ERR(context->db_tab); + mthca_uar_free(to_mdev(ibdev), &context->uar); + kfree(context); + return ERR_PTR(err); + } + + if (copy_to_user((void __user *) (unsigned long) ucmd.respbuf, + &uresp, sizeof uresp)) { + mthca_cleanup_user_db_tab(to_mdev(ibdev), &context->uar, context->db_tab); + mthca_uar_free(to_mdev(ibdev), &context->uar); + kfree(context); + return ERR_PTR(-EFAULT); + } + + return &context->ibucontext; +} + +static int mthca_dealloc_ucontext(struct ib_ucontext *context) { + mthca_cleanup_user_db_tab(to_mdev(context->device), &to_mucontext(context)->uar, + to_mucontext(context)->db_tab); + mthca_uar_free(to_mdev(context->device), &to_mucontext(context)->uar); + kfree(to_mucontext(context)); + + return 0; +} + +static struct 
ib_pd *mthca_alloc_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + const void __user *udata, int udatalen) +{ + struct mthca_alloc_pd ucmd; struct mthca_pd *pd; int err; + if (context) { + if (udatalen != sizeof ucmd) + return ERR_PTR(-EINVAL); + + if (copy_from_user(&ucmd, udata, sizeof ucmd)) + return ERR_PTR(-EFAULT); + } + pd = kmalloc(sizeof *pd, GFP_KERNEL); if (!pd) return ERR_PTR(-ENOMEM); @@ -298,6 +369,14 @@ return ERR_PTR(err); } + if (context) { + if (put_user(pd->pd_num, (u32 __user *) (unsigned long) ucmd.pdnbuf)) { + mthca_pd_free(to_mdev(ibdev), pd); + kfree(pd); + return ERR_PTR(-EFAULT); + } + } + return &pd->ibpd; } @@ -337,8 +416,10 @@ } static struct ib_qp *mthca_create_qp(struct ib_pd *pd, - struct ib_qp_init_attr *init_attr) + struct ib_qp_init_attr *init_attr, + const void __user *udata, int udatalen) { + struct mthca_create_qp ucmd; struct mthca_qp *qp; int err; @@ -347,10 +428,48 @@ case IB_QPT_UC: case IB_QPT_UD: { + struct mthca_ucontext *context; + qp = kmalloc(sizeof *qp, GFP_KERNEL); if (!qp) return ERR_PTR(-ENOMEM); + if (pd->uobject) { + context = to_mucontext(pd->uobject->context); + + if (udatalen != sizeof ucmd) + return ERR_PTR(-EINVAL); + + if (copy_from_user(&ucmd, udata, sizeof ucmd)) + return ERR_PTR(-EFAULT); + + err = mthca_map_user_db(to_mdev(pd->device), &context->uar, + context->db_tab, + ucmd.sq_db_index, ucmd.sq_db_page); + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + err = mthca_map_user_db(to_mdev(pd->device), &context->uar, + context->db_tab, + ucmd.rq_db_index, ucmd.rq_db_page); + if (err) { + mthca_unmap_user_db(to_mdev(pd->device), + &context->uar, + context->db_tab, + ucmd.sq_db_index); + kfree(qp); + return ERR_PTR(err); + } + } + + if (pd->uobject) { + qp->mr.ibmr.lkey = ucmd.lkey; + qp->sq.db_index = ucmd.sq_db_index; + qp->rq.db_index = ucmd.rq_db_index; + } + qp->sq.max = init_attr->cap.max_send_wr; qp->rq.max = init_attr->cap.max_recv_wr; qp->sq.max_gs = init_attr->cap.max_send_sge; 
@@ -361,12 +480,30 @@ to_mcq(init_attr->recv_cq), init_attr->qp_type, init_attr->sq_sig_type, qp); + + if (err && pd->uobject) { + context = to_mucontext(pd->uobject->context); + + mthca_unmap_user_db(to_mdev(pd->device), + &context->uar, + context->db_tab, + ucmd.sq_db_index); + mthca_unmap_user_db(to_mdev(pd->device), + &context->uar, + context->db_tab, + ucmd.rq_db_index); + } + qp->ibqp.qp_num = qp->qpn; break; } case IB_QPT_SMI: case IB_QPT_GSI: { + /* Don't allow userspace to create special QPs */ + if (pd->uobject) + return ERR_PTR(-EINVAL); + qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); if (!qp) return ERR_PTR(-ENOMEM); @@ -396,42 +533,116 @@ return ERR_PTR(err); } - init_attr->cap.max_inline_data = 0; + init_attr->cap.max_inline_data = 0; + init_attr->cap.max_send_wr = qp->sq.max; + init_attr->cap.max_recv_wr = qp->rq.max; return &qp->ibqp; } static int mthca_destroy_qp(struct ib_qp *qp) { + if (qp->uobject) { + mthca_unmap_user_db(to_mdev(qp->device), + &to_mucontext(qp->uobject->context)->uar, + to_mucontext(qp->uobject->context)->db_tab, + to_mqp(qp)->sq.db_index); + mthca_unmap_user_db(to_mdev(qp->device), + &to_mucontext(qp->uobject->context)->uar, + to_mucontext(qp->uobject->context)->db_tab, + to_mqp(qp)->rq.db_index); + } mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); kfree(qp); return 0; } -static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + const void __user *udata, int udatalen) { + struct mthca_create_cq ucmd; struct mthca_cq *cq; int nent; int err; + if (context) { + if (udatalen != sizeof ucmd) + return ERR_PTR(-EINVAL); + + if (copy_from_user(&ucmd, udata, sizeof ucmd)) + return ERR_PTR(-EFAULT); + + err = mthca_map_user_db(to_mdev(ibdev), &to_mucontext(context)->uar, + to_mucontext(context)->db_tab, + ucmd.set_db_index, ucmd.set_db_page); + if (err) + return ERR_PTR(err); + + err = 
mthca_map_user_db(to_mdev(ibdev), &to_mucontext(context)->uar, + to_mucontext(context)->db_tab, + ucmd.arm_db_index, ucmd.arm_db_page); + if (err) + goto err_unmap_set; + } + cq = kmalloc(sizeof *cq, GFP_KERNEL); - if (!cq) - return ERR_PTR(-ENOMEM); + if (!cq) { + err = -ENOMEM; + goto err_unmap_arm; + } + + if (context) { + cq->mr.ibmr.lkey = ucmd.lkey; + cq->set_ci_db_index = ucmd.set_db_index; + cq->arm_db_index = ucmd.arm_db_index; + } for (nent = 1; nent <= entries; nent <<= 1) ; /* nothing */ - err = mthca_init_cq(to_mdev(ibdev), nent, cq); - if (err) { - kfree(cq); - cq = ERR_PTR(err); + err = mthca_init_cq(to_mdev(ibdev), nent, + context ? to_mucontext(context) : NULL, + context ? ucmd.pdn : to_mdev(ibdev)->driver_pd.pd_num, + cq); + if (err) + goto err_free; + + if (context && put_user(cq->cqn, (u32 __user *) (unsigned long) ucmd.cqnbuf)) { + mthca_free_cq(to_mdev(ibdev), cq); + goto err_free; } return &cq->ibcq; + +err_free: + kfree(cq); + +err_unmap_arm: + if (context) + mthca_unmap_user_db(to_mdev(ibdev), &to_mucontext(context)->uar, + to_mucontext(context)->db_tab, ucmd.arm_db_index); + +err_unmap_set: + if (context) + mthca_unmap_user_db(to_mdev(ibdev), &to_mucontext(context)->uar, + to_mucontext(context)->db_tab, ucmd.set_db_index); + + return ERR_PTR(err); } static int mthca_destroy_cq(struct ib_cq *cq) { + if (cq->uobject) { + mthca_unmap_user_db(to_mdev(cq->device), + &to_mucontext(cq->uobject->context)->uar, + to_mucontext(cq->uobject->context)->db_tab, + to_mcq(cq)->arm_db_index); + mthca_unmap_user_db(to_mdev(cq->device), + &to_mucontext(cq->uobject->context)->uar, + to_mucontext(cq->uobject->context)->db_tab, + to_mcq(cq)->set_ci_db_index); + } mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); kfree(cq); @@ -558,6 +769,57 @@ convert_access(acc), mr); if (err) { + kfree(page_list); + kfree(mr); + return ERR_PTR(err); + } + + kfree(page_list); + return &mr->ibmr; +} + +static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, struct ib_umem 
*region, + int acc, const void __user *udata, int udatalen) +{ + struct ib_umem_chunk *chunk; + int npages = 0; + u64 *page_list; + struct mthca_mr *mr; + int shift; + int i, j, k; + int err; + + shift = ffs(region->page_size) - 1; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + list_for_each_entry(chunk, &region->chunk_list, list) + npages += chunk->nents; + + page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) { + kfree(mr); + return ERR_PTR(-ENOMEM); + } + + i = 0; + + list_for_each_entry(chunk, &region->chunk_list, list) + for (j = 0; j < chunk->nmap; ++j) + for (k = 0; k < sg_dma_len(&chunk->page_list[j]) >> shift; ++k) + page_list[i++] = sg_dma_address(&chunk->page_list[j]) + + region->page_size * k; + + err = mthca_mr_alloc_phys(to_mdev(pd->device), + to_mpd(pd)->pd_num, + page_list, shift, npages, + region->virt_base, region->length, + convert_access(acc), mr); + + if (err) { + kfree(page_list); kfree(mr); return ERR_PTR(err); } @@ -574,6 +836,22 @@ return 0; } +static int mthca_mmap_uar(struct ib_ucontext *context, + struct vm_area_struct *vma) +{ + if (vma->vm_end - vma->vm_start != PAGE_SIZE) + return -EINVAL; + + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + + if (remap_pfn_range(vma, vma->vm_start, + to_mucontext(context)->uar.pfn, + PAGE_SIZE, vma->vm_page_prot)) + return -EAGAIN; + + return 0; +} + static struct ib_fmr *mthca_alloc_fmr(struct ib_pd *pd, int mr_access_flags, struct ib_fmr_attr *fmr_attr) { @@ -690,6 +968,8 @@ int i; strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.owner = THIS_MODULE; + dev->ib_dev.node_type = IB_NODE_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; @@ -699,6 +979,8 @@ dev->ib_dev.modify_port = mthca_modify_port; dev->ib_dev.query_pkey = mthca_query_pkey; dev->ib_dev.query_gid = mthca_query_gid; + dev->ib_dev.alloc_ucontext = mthca_alloc_ucontext; + dev->ib_dev.dealloc_ucontext = 
mthca_dealloc_ucontext; dev->ib_dev.alloc_pd = mthca_alloc_pd; dev->ib_dev.dealloc_pd = mthca_dealloc_pd; dev->ib_dev.create_ah = mthca_ah_create; @@ -711,6 +993,7 @@ dev->ib_dev.poll_cq = mthca_poll_cq; dev->ib_dev.get_dma_mr = mthca_get_dma_mr; dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; + dev->ib_dev.reg_user_mr = mthca_reg_user_mr; dev->ib_dev.dereg_mr = mthca_dereg_mr; if (dev->mthca_flags & MTHCA_FLAG_FMR) { @@ -726,6 +1009,7 @@ dev->ib_dev.attach_mcast = mthca_multicast_attach; dev->ib_dev.detach_mcast = mthca_multicast_detach; dev->ib_dev.process_mad = mthca_process_mad; + dev->ib_dev.mmap = mthca_mmap_uar; if (mthca_is_memfree(dev)) { dev->ib_dev.req_notify_cq = mthca_arbel_arm_cq; --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2005-04-04 14:57:12.287743246 -0700 +++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h 2005-04-04 14:58:12.445661916 -0700 @@ -54,6 +54,14 @@ int index; }; +struct mthca_user_db_table; + +struct mthca_ucontext { + struct ib_ucontext ibucontext; + struct mthca_uar uar; + struct mthca_user_db_table *db_tab; +}; + struct mthca_mr { struct ib_mr ibmr; int order; @@ -167,6 +175,7 @@ int cqn; u32 cons_index; int is_direct; + int is_kernel; /* Next fields are Arbel only */ int set_ci_db_index; @@ -236,6 +245,11 @@ dma_addr_t header_dma; }; +static inline struct mthca_ucontext *to_mucontext(struct ib_ucontext *ibucontext) +{ + return container_of(ibucontext, struct mthca_ucontext, ibucontext); +} + static inline struct mthca_fmr *to_mfmr(struct ib_fmr *ibmr) { return container_of(ibmr, struct mthca_fmr, ibmr); --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2005-04-04 14:57:12.320736072 -0700 +++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c 2005-04-04 14:58:12.491651915 -0700 @@ -652,7 +652,11 @@ /* leave arbel_sched_queue as 0 */ - qp_context->usr_page = cpu_to_be32(dev->driver_uar.index); + if (qp->ibqp.uobject) + qp_context->usr_page = + 
cpu_to_be32(to_mucontext(qp->ibqp.uobject->context)->uar.index); + else + qp_context->usr_page = cpu_to_be32(dev->driver_uar.index); qp_context->local_qpn = cpu_to_be32(qp->qpn); if (attr_mask & IB_QP_DEST_QPN) { qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); @@ -917,6 +921,15 @@ qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, 1 << qp->sq.wqe_shift); + + /* + * If this is a userspace QP, we don't actually have to + * allocate anything. All we need is to calculate the WQE + * sizes and the send_wqe_offset, so we're done now. + */ + if (pd->ibpd.uobject) + return 0; + size = PAGE_ALIGN(qp->send_wqe_offset + (qp->sq.max << qp->sq.wqe_shift)); @@ -1015,10 +1028,33 @@ return err; } -static int mthca_alloc_memfree(struct mthca_dev *dev, +static void mthca_free_wqe_buf(struct mthca_dev *dev, struct mthca_qp *qp) { - int ret = 0; + int i; + int size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); +} + +static int mthca_map_memfree(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + int ret; if (mthca_is_memfree(dev)) { ret = mthca_table_get(dev, dev->qp_table.qp_table, qp->qpn); @@ -1029,35 +1065,15 @@ if (ret) goto err_qpc; - ret = mthca_table_get(dev, dev->qp_table.rdb_table, - qp->qpn << dev->qp_table.rdb_shift); - if (ret) - goto err_eqpc; - - qp->rq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_RQ, - qp->qpn, &qp->rq.db); - if (qp->rq.db_index < 0) { - ret = -ENOMEM; - goto err_rdb; - } + ret = mthca_table_get(dev, dev->qp_table.rdb_table, + qp->qpn << dev->qp_table.rdb_shift); + if (ret) + goto err_eqpc; - qp->sq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_SQ, - qp->qpn, 
&qp->sq.db); - if (qp->sq.db_index < 0) { - ret = -ENOMEM; - goto err_rq_db; - } } return 0; -err_rq_db: - mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index); - -err_rdb: - mthca_table_put(dev, dev->qp_table.rdb_table, - qp->qpn << dev->qp_table.rdb_shift); - err_eqpc: mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn); @@ -1067,16 +1083,43 @@ return ret; } +static void mthca_unmap_memfree(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + if (mthca_is_memfree(dev)) { + mthca_table_put(dev, dev->qp_table.rdb_table, + qp->qpn << dev->qp_table.rdb_shift); + mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn); + mthca_table_put(dev, dev->qp_table.qp_table, qp->qpn); + } +} + +static int mthca_alloc_memfree(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + int ret = 0; + + if (mthca_is_memfree(dev)) { + qp->rq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_RQ, + qp->qpn, &qp->rq.db); + if (qp->rq.db_index < 0) + return ret; + + qp->sq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_SQ, + qp->qpn, &qp->sq.db); + if (qp->sq.db_index < 0) + mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index); + } + + return ret; +} + static void mthca_free_memfree(struct mthca_dev *dev, struct mthca_qp *qp) { if (mthca_is_memfree(dev)) { mthca_free_db(dev, MTHCA_DB_TYPE_SQ, qp->sq.db_index); mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index); - mthca_table_put(dev, dev->qp_table.rdb_table, - qp->qpn << dev->qp_table.rdb_shift); - mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn); - mthca_table_put(dev, dev->qp_table.qp_table, qp->qpn); } } @@ -1108,13 +1151,28 @@ mthca_wq_init(&qp->sq); mthca_wq_init(&qp->rq); - ret = mthca_alloc_memfree(dev, qp); + ret = mthca_map_memfree(dev, qp); if (ret) return ret; ret = mthca_alloc_wqe_buf(dev, pd, qp); if (ret) { - mthca_free_memfree(dev, qp); + mthca_unmap_memfree(dev, qp); + return ret; + } + + /* + * If this is a userspace QP, we're done now. The doorbells + * will be allocated and buffers will be initialized in + * userspace. 
+ */ + if (pd->ibpd.uobject) + return 0; + + ret = mthca_alloc_memfree(dev, qp); + if (ret) { + mthca_free_wqe_buf(dev, qp); + mthca_unmap_memfree(dev, qp); return ret; } @@ -1274,8 +1332,6 @@ struct mthca_qp *qp) { u8 status; - int size; - int i; struct mthca_cq *send_cq; struct mthca_cq *recv_cq; @@ -1305,31 +1361,22 @@ if (qp->state != IB_QPS_RESET) mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); - mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); - if (qp->ibqp.send_cq != qp->ibqp.recv_cq) - mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); - - mthca_free_mr(dev, &qp->mr); - - size = PAGE_ALIGN(qp->send_wqe_offset + - (qp->sq.max << qp->sq.wqe_shift)); + /* + * If this is a userspace QP, the buffers, MR, CQs and so on + * will be cleaned up in userspace, so all we have to do is + * unref the mem-free tables and free the QPN in our table. + */ + if (!qp->ibqp.uobject) { + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); - if (qp->is_direct) { - pci_free_consistent(dev->pdev, size, - qp->queue.direct.buf, - pci_unmap_addr(&qp->queue.direct, mapping)); - } else { - for (i = 0; i < size / PAGE_SIZE; ++i) { - pci_free_consistent(dev->pdev, PAGE_SIZE, - qp->queue.page_list[i].buf, - pci_unmap_addr(&qp->queue.page_list[i], - mapping)); - } + mthca_free_mr(dev, &qp->mr); + mthca_free_memfree(dev, qp); + mthca_free_wqe_buf(dev, qp); } - kfree(qp->wrid); - - mthca_free_memfree(dev, qp); + mthca_unmap_memfree(dev, qp); if (is_sqp(dev, qp)) { atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-export/drivers/infiniband/hw/mthca/mthca_user.h 2005-04-04 14:58:12.491651915 -0700 @@ -0,0 +1,89 @@ +/* + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef MTHCA_USER_H +#define MTHCA_USER_H + +#include + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. 
+ */ + +struct mthca_alloc_ucontext { + __u64 respbuf; +}; + +struct mthca_alloc_ucontext_resp { + __u32 qp_tab_size; + __u32 uarc_size; +}; + +struct mthca_alloc_pd { + __u64 pdnbuf; +}; + +struct mthca_alloc_pd_resp { + __u32 pdn; + __u32 reserved; +}; + +struct mthca_create_cq { + __u64 cqnbuf; + __u32 lkey; + __u32 pdn; + __u64 arm_db_page; + __u64 set_db_page; + __u32 arm_db_index; + __u32 set_db_index; +}; + +struct mthca_create_cq_resp { + __u32 cqn; + __u32 reserved; +}; + +struct mthca_create_qp { + __u32 lkey; + __u32 reserved; + __u64 sq_db_page; + __u64 rq_db_page; + __u32 sq_db_index; + __u32 rq_db_index; +}; + +#endif /* MTHCA_USER_H */ From roland at topspin.com Mon Apr 4 15:09:00 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 4 Apr 2005 15:09:00 -0700 Subject: [openib-general] [PATCH][RFC][4/4] IB: userspace verbs Kconfig/Makefile changes In-Reply-To: <200544159.AzH1nqpM3uTQZaKG@topspin.com> Message-ID: <200544159.LHYjypUjDczyHP7A@topspin.com> Hook userspace verbs up to Kconfig and Makefile. Signed-off-by: Roland Dreier --- linux-export.orig/drivers/infiniband/Kconfig 2005-04-04 14:58:53.397756926 -0700 +++ linux-export/drivers/infiniband/Kconfig 2005-04-04 15:01:08.716332258 -0700 @@ -7,6 +7,14 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware. +config INFINIBAND_USER_VERBS + tristate "InfiniBand userspace verbs support" + depends on INFINIBAND + ---help--- + Userspace InfiniBand verbs support. This is the kernel side + of userspace verbs. You will also need libibverbs and a + hardware driver library from . 
+ source "drivers/infiniband/hw/mthca/Kconfig" source "drivers/infiniband/ulp/ipoib/Kconfig" --- linux-export.orig/drivers/infiniband/core/Makefile 2005-04-04 14:58:53.398756709 -0700 +++ linux-export/drivers/infiniband/core/Makefile 2005-04-04 15:00:44.933503748 -0700 @@ -1,7 +1,8 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o \ - ib_cm.o ib_sa.o ib_umad.o ib_ucm.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o \ + ib_cm.o ib_sa.o ib_umad.o ib_ucm.o +obj-$(CONFIG_INFINIBAND_USER_VERBS) += ib_uverbs.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o @@ -16,4 +17,6 @@ ib_umad-y := user_mad.o +ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o + ib_ucm-y := ucm.o From roland at topspin.com Mon Apr 4 15:15:12 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 04 Apr 2005 15:15:12 -0700 Subject: [openib-general] Re: IPoIB In-Reply-To: <1112652482.4490.281.camel@localhost.localdomain> (Hal Rosenstock's message of "04 Apr 2005 18:08:03 -0400") References: <1112652482.4490.281.camel@localhost.localdomain> Message-ID: <524qemfatr.fsf@topspin.com> Hal> A while ago, Tom brought up the issue of IPoIB link level Hal> broadcasting from user space (with the arping tool). Is it Hal> possible to do this from kernel space ? For example, how Hal> would/could sendto() work when sending to a IPoIB link layer Hal> address ? If all we wanted to support was broadcast, perhaps Hal> there could be a remapping of the ethernet MAC broadcast Hal> address to the all hosts MGID and QPN for that IPoIB Hal> interface. Or perhaps the entire ipoib pseudoheader should be Hal> supported in this mode. This is needed to support Hal> RARPing. Some hosts want to RARP for their IP address and Hal> this should be supported over IPoIB. I think it should "just work" with the current setup. 
ipoib_hard_header() will look at the skb it's passed, and if there's no neighbour struct, it will just save off the destination link address. Then ipoib_start_xmit() will look at the destination address and handle multicast link addresses correctly. - R. From iod00d at hp.com Mon Apr 4 15:35:16 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 4 Apr 2005 15:35:16 -0700 Subject: [openib-general] IPoIB In-Reply-To: <1112652482.4490.281.camel@localhost.localdomain> References: <1112652482.4490.281.camel@localhost.localdomain> Message-ID: <20050404223516.GC22119@esmail.cup.hp.com> On Mon, Apr 04, 2005 at 06:08:03PM -0400, Hal Rosenstock wrote: > A while ago, Tom brought up the issue of IPoIB link level broadcasting > from user space (with the arping tool). Is it possible to do this from > kernel space? I would think any driver can call hard_xmit() for any "NIC". pktgen.c does. > For example, how would/could sendto() work when sending > to a IPoIB link layer address? Would net/core/pktgen.c help? * A tool for loading the network with preconfigurated packets. * The tool is implemented as a linux module. Parameters are output * device, delay (to hard_xmit), number of packets, and whether * to use multiple SKBs or just the same one. * pktgen uses the installed interface's output routine. That's one of the tools I use occasionally for performance analysis. This certainly would be useful to test TCP/IP <-> IB bridge/router support in the kernel. > If all we wanted to support was > broadcast, perhaps there could be a remapping of the ethernet MAC > broadcast address to the all hosts MGID and QPN for that IPoIB > interface. Or perhaps the entire ipoib pseudoheader should be supported > in this mode. This is needed to support RARPing. Some hosts want to RARP > for their IP address and this should be supported over IPoIB. sorry - the above is mostly greek to me... 
grant > > -- Hal From tduffy at sun.com Mon Apr 4 15:49:35 2005 From: tduffy at sun.com (Tom Duffy) Date: Mon, 04 Apr 2005 15:49:35 -0700 Subject: [openib-general] [PATCH][RFC][3/4] IB: userspace verbs mthca changes In-Reply-To: <200544159.AzH1nqpM3uTQZaKG@topspin.com> References: <200544159.AzH1nqpM3uTQZaKG@topspin.com> Message-ID: <1112654975.22537.12.camel@duffman> On Mon, 2005-04-04 at 15:09 -0700, Roland Dreier wrote: > --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-04 14:57:12.254750421 -0700 > +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h 2005-04-04 14:58:12.411669307 -0700 > @@ -49,14 +49,6 @@ > #define DRV_VERSION "0.06-pre" > #define DRV_RELDATE "November 8, 2004" > > -/* XXX remove once SINAI defines make it into kernel.org */ > -#ifndef PCI_DEVICE_ID_MELLANOX_SINAI_OLD > -#define PCI_DEVICE_ID_MELLANOX_SINAI_OLD 0x5e8c > -#endif > -#ifndef PCI_DEVICE_ID_MELLANOX_SINAI > -#define PCI_DEVICE_ID_MELLANOX_SINAI 0x6274 > -#endif > - > enum { > MTHCA_FLAG_DDR_HIDDEN = 1 << 1, > MTHCA_FLAG_SRQ = 1 << 2, Now, you are really gonna hate me for asking you to put this in as you probably did not want to include this in the patch to lkml. So, maybe Grant was right ;-) -tduffy -------------- next part -------------- A non-text attachment was scrubbed...
From halr at voltaire.com Mon Apr 4 15:48:19 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Apr 2005 18:48:19 -0400 Subject: [openib-general] IPoIB In-Reply-To: <20050404223516.GC22119@esmail.cup.hp.com> References: <1112652482.4490.281.camel@localhost.localdomain> <20050404223516.GC22119@esmail.cup.hp.com> Message-ID: <1112654898.4490.293.camel@localhost.localdomain> On Mon, 2005-04-04 at 18:35, Grant Grundler wrote: > On Mon, Apr 04, 2005 at 06:08:03PM -0400, Hal Rosenstock wrote: > > A while ago, Tom brought up the issue of IPoIB link level broadcasting > > from user space (with the arping tool). Is it possible to do this from > > kernel space? > > I would think any driver can call hard_xmit() for any "NIC". > pktgen.c does. Yes, but I was looking at a different "use" case. How does net/packet/af_packet.c work with link layer sends rather than IP based sends ? Can this be made to work for IPoIB and how ? > > For example, how would/could sendto() work when sending > > to a IPoIB link layer address? > > Would net/core/pktgen.c help? Glancing at pktgen.c, there would need to be some mods made for IPoIB as IPoIB does not deal with MAC addresses (random src/dest MACs). pktgen.c uses the driver's transmit routine directly so this is a different case from what I was describing. > * A tool for loading the network with preconfigurated packets. > * The tool is implemented as a linux module. Parameters are output > * device, delay (to hard_xmit), number of packets, and whether > * to use multiple SKBs or just the same one. > * pktgen uses the installed interface's output routine. > > That's one of the tools I use occasionally for performance analysis. > This certainly would be useful to test TCP/IP <-> IB bridge/router > support in the kernel. Do you mean IB or IP bridge/router ? IB bridges are switches.
IB routers forward at the IB network layer and are not completely specified. I suspect you mean an IP router with one or more IPoIB interfaces. -- Hal From halr at voltaire.com Mon Apr 4 16:16:28 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Apr 2005 19:16:28 -0400 Subject: [openib-general] Re: IPoIB In-Reply-To: <524qemfatr.fsf@topspin.com> References: <1112652482.4490.281.camel@localhost.localdomain> <524qemfatr.fsf@topspin.com> Message-ID: <1112656588.4490.300.camel@localhost.localdomain> On Mon, 2005-04-04 at 18:15, Roland Dreier wrote: > I think it should "just work" with the current setup. > ipoib_hard_header() will look at the skb it's passed, and if there's > no neighbour struct, it will just save off the destination link > address. Then ipoib_start_xmit() will look at the destination address > and handle multicast link addresses correctly. That's good to hear. There are going to be some other changes for this. At a quick glance, ipoib_main.c::ipoib_start_xmit drops a unicast link level response if it is not ARP. RARP is also possible there, right ? I'm not sure the Linux code above this is set up to support the larger link level address needed by IPoIB either. -- Hal From libor at topspin.com Mon Apr 4 16:22:59 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 4 Apr 2005 16:22:59 -0700 Subject: [openib-general] What context can CM be called from? 
In-Reply-To: <20050331105141.A1541@topspin.com>; from libor@topspin.com on Thu, Mar 31, 2005 at 10:51:41AM -0800 References: <506C3D7B14CDD411A52C00025558DED6064BF21D@mtlex01.yok.mtl.com> <52u0muqvt7.fsf@topspin.com> <4249907F.8010101@ichips.intel.com> <20050329153826.D31683@topspin.com> <4249ED30.3060208@ichips.intel.com> <1112290139.4490.20.camel@localhost.localdomain> <424C3667.7040909@ichips.intel.com> <20050331105141.A1541@topspin.com> Message-ID: <20050404162259.C10315@topspin.com> On Thu, Mar 31, 2005 at 10:51:41AM -0800, Libor Michalek wrote: > On Thu, Mar 31, 2005 at 09:41:59AM -0800, Sean Hefty wrote: > > Hal Rosenstock wrote: > > > Is this just the kmalloc in cm_alloc_msg or is there more to this ? > > > > I _think_ that the kmalloc in cm_alloc_msg is all that needs to change. > > Yes, this CM change should be sufficient. I'm testing it now and > it looks good. I'll run some more tests and then check in the change. Sean, this change works correctly in all of my tests, so I checked it in. -Libor From roland at topspin.com Mon Apr 4 16:34:18 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 04 Apr 2005 16:34:18 -0700 Subject: [openib-general] [PATCH][RFC][3/4] IB: userspace verbs mthca changes In-Reply-To: <1112654975.22537.12.camel@duffman> (Tom Duffy's message of "Mon, 04 Apr 2005 15:49:35 -0700") References: <200544159.AzH1nqpM3uTQZaKG@topspin.com> <1112654975.22537.12.camel@duffman> Message-ID: <52vf72dslh.fsf@topspin.com> Tom> Now, you are really gonna hate me for asking you to put this Tom> in as you probably did not want to include this in the patch Tom> to lkml. Tom> So, maybe Grant was right ;-) Oh well, I didn't read the patches over carefully enough. Fortunately it was just my "for review" version. - R. 
From roland at topspin.com Mon Apr 4 16:43:21 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 04 Apr 2005 16:43:21 -0700 Subject: [openib-general] ia64 perf and FMR In-Reply-To: <20050404055131.GA19409@esmail.cup.hp.com> (Grant Grundler's message of "Sun, 3 Apr 2005 22:51:31 -0700") References: <20050402024048.GN11094@esmail.cup.hp.com> <20050404055131.GA19409@esmail.cup.hp.com> Message-ID: <52hdimds6e.fsf@topspin.com> Grant> FMR is a red herring. I tried SVN r2080 and it has roughly Grant> the same performance as r2082 (when FMR was committed) and Grant> later r210x. "packed" attribute is a red herring too. Grant> Performance stunk with r2050 and I will do a binary search Grant> this week until I sort out which changes doubled the Grant> perf. ISTR there was one change related to a "double Grant> mapping" issue and I will be tracking that down in a few Grant> days. A binary search to find the changeset that makes the difference would be really useful. I read through the svn log from r2046 through r2082 and I don't see anything that should make a difference to IPoIB. The only changes that seem remotely plausible are r2059 "Set skb->mac.raw on receive" r2068 "Make address handle verbs usable from interrupt context" but I don't see how either one could really have an effect. So I wonder what obvious thing I'm missing... - R. From roland at topspin.com Mon Apr 4 16:37:10 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 04 Apr 2005 16:37:10 -0700 Subject: [openib-general] Re: IPoIB In-Reply-To: <1112656588.4490.300.camel@localhost.localdomain> (Hal Rosenstock's message of "04 Apr 2005 19:16:28 -0400") References: <1112652482.4490.281.camel@localhost.localdomain> <524qemfatr.fsf@topspin.com> <1112656588.4490.300.camel@localhost.localdomain> Message-ID: <52ll7ydsgp.fsf@topspin.com> Hal> That's good to hear. There are going to be some other changes Hal> for this. 
At a quick glance, ipoib_main.c::ipoib_start_xmit Hal> drops a unicast link level response if it is not ARP. RARP is Hal> also possible there, right ? Yeah, you're right. That check can probably just be deleted. The driver should trust the kernel to pass it packets it means to send. Hal> I'm not sure the Linux code above this is set up to support Hal> the larger link level address needed by IPoIB either. Not sure what you mean by this. - R. From iod00d at hp.com Mon Apr 4 17:26:23 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 4 Apr 2005 17:26:23 -0700 Subject: [openib-general] IPoIB In-Reply-To: <1112654898.4490.293.camel@localhost.localdomain> References: <1112652482.4490.281.camel@localhost.localdomain> <20050404223516.GC22119@esmail.cup.hp.com> <1112654898.4490.293.camel@localhost.localdomain> Message-ID: <20050405002623.GH22119@esmail.cup.hp.com> On Mon, Apr 04, 2005 at 06:48:19PM -0400, Hal Rosenstock wrote: > Do you mean IB or IP bridge/router ? IB bridges are switches. IB routers > forward at the IB network layer and are not completely specified. I > suspect you mean an IP router with one or more IPoIB interfaces. Yes, I was thinking IP router. thanks, grant From halr at voltaire.com Mon Apr 4 17:44:01 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Apr 2005 20:44:01 -0400 Subject: [openib-general] Re: IPoIB In-Reply-To: <52ll7ydsgp.fsf@topspin.com> References: <1112652482.4490.281.camel@localhost.localdomain> <524qemfatr.fsf@topspin.com> <1112656588.4490.300.camel@localhost.localdomain> <52ll7ydsgp.fsf@topspin.com> Message-ID: <1112661841.4490.320.camel@localhost.localdomain> On Mon, 2005-04-04 at 19:37, Roland Dreier wrote: > Hal> That's good to hear. There are going to be some other changes > Hal> for this. At a quick glance, ipoib_main.c::ipoib_start_xmit > Hal> drops a unicast link level response if it is not ARP. RARP is > Hal> also possible there, right ? > > Yeah, you're right. That check can probably just be deleted. 
The > driver should trust the kernel to pass it packets it means to send. OK. It does look like unicast_arp_send would work for this case if the ARP check wasn't made. I'll play with this and propose a patch. > Hal> I'm not sure the Linux code above this is set up to support > Hal> the larger link level address needed by IPoIB either. > > Not sure what you mean by this. There are a couple of things that might be problematic but I'm not sure. The first has to do with some data structures: include/linux/socket.h: struct sockaddr { sa_family_t sa_family; /* address family, AF_xxx */ char sa_data[14]; /* 14 bytes of protocol address */ }; sa_data is not large enough for the IPoIB hardware address. Also, similarly for sll_addr in include/linux/if_packet.h: struct sockaddr_ll { ... unsigned char sll_halen; unsigned char sll_addr[8]; }; I'm not sure what all the implications of changing these are. Then, in net/packet/af_packet.c::packet_sendmsg: if (saddr == NULL) { ... } else { err = -EINVAL; if (msg->msg_namelen < sizeof(struct sockaddr_ll)) goto out; ifindex = saddr->sll_ifindex; proto = saddr->sll_protocol; addr = saddr->sll_addr; } -- Hal From hozer at hozed.org Mon Apr 4 19:50:37 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Mon, 4 Apr 2005 21:50:37 -0500 Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB In-Reply-To: <1112310139.4490.24.camel@localhost.localdomain> References: <1112190300.4495.67.camel@localhost.localdomain> <424B4CA7.1050606@sandia.gov> <1112310139.4490.24.camel@localhost.localdomain> Message-ID: <20050405025037.GR26127@kalmia.hozed.org> On Thu, Mar 31, 2005 at 06:06:30PM -0500, Hal Rosenstock wrote: > On Wed, 2005-03-30 at 20:04, Josh England wrote: > > Are there any plans to modify the linux DHCP client so it would be > > possible to do kernel-level DHCP and NFSroot over IB? > > I took a quick look at this and it looks pretty straightforward. Stay > tuned... I'd say don't. Using initrd/initramfs is a much better solution. 
At some point the in-kernel dhcp is going to get so buggy and old it's going to get removed. I boot all my cluster systems with NFS root servers, and I'm trying to get everything moved to using Debian packaged kernels with initrd's. With an initrd, you at least have a chance to get a shell and figure out why you couldn't find your nfs server, instead of "kernel panic, I'm going to die now" you get with in-kernel dhcp/nfs. From jjengla at sandia.gov Mon Apr 4 20:17:43 2005 From: jjengla at sandia.gov (Josh England) Date: Mon, 04 Apr 2005 20:17:43 -0700 Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB In-Reply-To: <20050405025037.GR26127@kalmia.hozed.org> References: <1112190300.4495.67.camel@localhost.localdomain> <424B4CA7.1050606@sandia.gov> <1112310139.4490.24.camel@localhost.localdomain> <20050405025037.GR26127@kalmia.hozed.org> Message-ID: <42520357.6090607@sandia.gov> Troy Benjegerdes wrote: > On Thu, Mar 31, 2005 at 06:06:30PM -0500, Hal Rosenstock wrote: > >>On Wed, 2005-03-30 at 20:04, Josh England wrote: >> >>>Are there any plans to modify the linux DHCP client so it would be >>>possible to do kernel-level DHCP and NFSroot over IB? >> >>I took a quick look at this and it looks pretty straightforward. Stay >>tuned... > > > I'd say don't. > > Using initrd/initramfs is a much better solution. At some point the > in-kernel dhcp is going to get so buggy and old it's going to get > removed. I know...it's just crummy to have to ship another 1.3 Megs out to every node. > I boot all my cluster systems with NFS root servers, and I'm trying to > get everything moved to using Debian packaged kernels with initrd's. > With an initrd, you at least have a chance to get a shell and figure out > why you couldn't find your nfs server, instead of "kernel panic, I'm > going to die now" you get with in-kernel dhcp/nfs. Check out oneSIS (http://onesis.org). It can build initrds for you that do NFSroot (and drop to a shell when things go sour).
I'd love to hear some feedback from people familiar with running NFSroot. -JE From mst at mellanox.co.il Tue Apr 5 00:42:13 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 5 Apr 2005 10:42:13 +0300 Subject: [openib-general] [PATCH] SEND_INLINE support in libmthca In-Reply-To: <52is32feq2.fsf@topspin.com> References: <20050404150235.GZ15034@mellanox.co.il> <52is32feq2.fsf@topspin.com> Message-ID: <20050405074213.GC15034@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] [PATCH] SEND_INLINE support in libmthca > > Is the test here correct? > > + if (s + sizeof *seg > (1 << qp->sq.wqe_shift)) { > It seems we need to take into account the size of next segment and any > RDMA segment that we may be posting as well. Right. Fixed (below). > Also does it make sense to put the code for gathering inline data > segments and writing gather lists into an inline function that can be > called from both the tavor and arbel post send function? Will gcc > actually inline this function? > > - R. > It does get inlined, but the function would have to return both size and status, so however I rearrange it I get either extra loads/stores or extra branches. Inline data support for libmthca (important for latency). Signed-off-by: Michael S. 
Tsirkin Index: src/qp.c =================================================================== --- src/qp.c (revision 2096) +++ src/qp.c (working copy) @@ -57,6 +57,10 @@ enum { MTHCA_NEXT_SOLICIT = 1 << 1, }; +enum { + MTHCA_INLINE_SEG = 1<<31 +}; + struct mthca_next_seg { uint32_t nda_op; /* [31:6] next WQE [4:0] next opcode */ uint32_t ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ @@ -107,6 +111,10 @@ struct mthca_data_seg { uint64_t addr; }; +struct mthca_inline_seg { + uint32_t byte_count; +}; + static const uint8_t mthca_opcode[] = { [IBV_WR_SEND] = MTHCA_OPCODE_SEND, [IBV_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, @@ -255,15 +263,39 @@ int mthca_tavor_post_send(struct ibv_qp goto out; } - for (i = 0; i < wr->num_sge; ++i) { - ((struct mthca_data_seg *) wqe)->byte_count = - htonl(wr->sg_list[i].length); - ((struct mthca_data_seg *) wqe)->lkey = - htonl(wr->sg_list[i].lkey); - ((struct mthca_data_seg *) wqe)->addr = - htonll(wr->sg_list[i].addr); - wqe += sizeof (struct mthca_data_seg); - size += sizeof (struct mthca_data_seg) / 16; + if (wr->send_flags & IBV_SEND_INLINE) { + struct mthca_inline_seg *seg = wqe; + int wqe_size = 1 << qp->sq.wqe_shift; + int s = 0; + wqe += sizeof *seg; + for (i = 0; i < wr->num_sge; ++i) { + struct ibv_sge *sge = &wr->sg_list[i]; + int l; + l = sge->length; + s += l; + + if (s + sizeof *seg + size * 16 > wqe_size) { + ret = -1; + *bad_wr = wr; + goto out; + } + + memcpy(wqe, (void*)(intptr_t)sge->addr, l); + wqe += l; + } + seg->byte_count = htonl(MTHCA_INLINE_SEG | s); + + size += align(s + sizeof *seg, 16) / 16; + } else { + struct mthca_data_seg *seg; + for (i = 0; i < wr->num_sge; ++i) { + seg = wqe; + seg->byte_count = htonl(wr->sg_list[i].length); + seg->lkey = htonl(wr->sg_list[i].lkey); + seg->addr = htonll(wr->sg_list[i].addr); + wqe += sizeof *seg; + } + size += wr->num_sge * sizeof *seg / 16; } qp->wrid[ind + qp->rq.max] = wr->wr_id; @@ -512,15 +544,39 @@ int mthca_arbel_post_send(struct ibv_qp goto 
out; } - for (i = 0; i < wr->num_sge; ++i) { - ((struct mthca_data_seg *) wqe)->byte_count = - htonl(wr->sg_list[i].length); - ((struct mthca_data_seg *) wqe)->lkey = - htonl(wr->sg_list[i].lkey); - ((struct mthca_data_seg *) wqe)->addr = - htonll(wr->sg_list[i].addr); - wqe += sizeof (struct mthca_data_seg); - size += sizeof (struct mthca_data_seg) / 16; + if (wr->send_flags & IBV_SEND_INLINE) { + struct mthca_inline_seg *seg = wqe; + int wqe_size = 1 << qp->sq.wqe_shift; + int s = 0; + wqe += sizeof *seg; + for (i = 0; i < wr->num_sge; ++i) { + struct ibv_sge *sge = &wr->sg_list[i]; + int l; + l = sge->length; + s += l; + + if (s + sizeof *seg + size * 16 > wqe_size) { + ret = -1; + *bad_wr = wr; + goto out; + } + + memcpy(wqe, (void*)(intptr_t)sge->addr, l); + wqe += l; + } + seg->byte_count = htonl(MTHCA_INLINE_SEG | s); + + size += align(s + sizeof *seg, 16) / 16; + } else { + struct mthca_data_seg *seg; + for (i = 0; i < wr->num_sge; ++i) { + seg = wqe; + seg->byte_count = htonl(wr->sg_list[i].length); + seg->lkey = htonl(wr->sg_list[i].lkey); + seg->addr = htonll(wr->sg_list[i].addr); + wqe += sizeof *seg; + } + size += wr->num_sge * sizeof *seg / 16; } qp->wrid[ind + qp->rq.max] = wr->wr_id; -- MST - Michael S. Tsirkin From bunk at stusta.de Tue Apr 5 07:24:49 2005 From: bunk at stusta.de (Adrian Bunk) Date: Tue, 5 Apr 2005 16:24:49 +0200 Subject: [openib-general] [-mm patch] drivers/infiniband/hw/mthca/mthca_main.c: remove an unused label In-Reply-To: <20050405000524.592fc125.akpm@osdl.org> References: <20050405000524.592fc125.akpm@osdl.org> Message-ID: <20050405142449.GF6885@stusta.de> On Tue, Apr 05, 2005 at 12:05:24AM -0700, Andrew Morton wrote: >... > Changes since 2.6.12-rc1-mm4: >... > +ib-mthca-add-support-for-new-mt25204-hca.patch > > Infiniband update >... This patch causes the following compile warning: <-- snip --> ... 
CC drivers/infiniband/hw/mthca/mthca_main.o drivers/infiniband/hw/mthca/mthca_main.c: In function `mthca_init_icm': drivers/infiniband/hw/mthca/mthca_main.c:479: warning: label `err_unmap_eqp' defined but not used ... <-- snip --> I'm not sure whether this patch to remove this label is correct, but if it isn't correct there must be a bug somewhere. Signed-off-by: Adrian Bunk --- linux-2.6.12-rc2-mm1-full/drivers/infiniband/hw/mthca/mthca_main.c.old 2005-04-05 16:18:09.000000000 +0200 +++ linux-2.6.12-rc2-mm1-full/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-05 16:19:15.000000000 +0200 @@ -475,8 +475,6 @@ err_unmap_rdb: mthca_free_icm_table(mdev, mdev->qp_table.rdb_table); - -err_unmap_eqp: mthca_free_icm_table(mdev, mdev->qp_table.eqp_table); err_unmap_qp: From halr at voltaire.com Tue Apr 5 07:37:25 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2005 10:37:25 -0400 Subject: [openib-general] Re: [-mm patch] drivers/infiniband/hw/mthca/mthca_main.c: remove an unused label In-Reply-To: <20050405142449.GF6885@stusta.de> References: <20050405000524.592fc125.akpm@osdl.org> <20050405142449.GF6885@stusta.de> Message-ID: <1112711845.4490.4.camel@localhost.localdomain> On Tue, 2005-04-05 at 10:24, Adrian Bunk wrote: > On Tue, Apr 05, 2005 at 12:05:24AM -0700, Andrew Morton wrote: > >... > > Changes since 2.6.12-rc1-mm4: > >... > > +ib-mthca-add-support-for-new-mt25204-hca.patch > > > > Infiniband update > >... > > > This patch causes the following compile warning: > > <-- snip --> > > ... > CC drivers/infiniband/hw/mthca/mthca_main.o > drivers/infiniband/hw/mthca/mthca_main.c: In function `mthca_init_icm': > drivers/infiniband/hw/mthca/mthca_main.c:479: warning: label > `err_unmap_eqp' defined but not used > ... > > <-- snip --> > > > I'm not sure whether this patch to remove this label is correct, but if > it isn't correct there must be a bug somewhere. 
> > > Signed-off-by: Adrian Bunk > > --- linux-2.6.12-rc2-mm1-full/drivers/infiniband/hw/mthca/mthca_main.c.old 2005-04-05 16:18:09.000000000 +0200 > +++ linux-2.6.12-rc2-mm1-full/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-05 16:19:15.000000000 +0200 > @@ -475,8 +475,6 @@ > > err_unmap_rdb: > mthca_free_icm_table(mdev, mdev->qp_table.rdb_table); > - > -err_unmap_eqp: > mthca_free_icm_table(mdev, mdev->qp_table.eqp_table); > > err_unmap_qp: Roland caught this recently and there is a patch for this which will be sent upstream. The proper fix is different from this. -- Hal From hozer at hozed.org Tue Apr 5 08:53:24 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Tue, 5 Apr 2005 10:53:24 -0500 Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB In-Reply-To: <42520357.6090607@sandia.gov> References: <1112190300.4495.67.camel@localhost.localdomain> <424B4CA7.1050606@sandia.gov> <1112310139.4490.24.camel@localhost.localdomain> <20050405025037.GR26127@kalmia.hozed.org> <42520357.6090607@sandia.gov> Message-ID: <20050405155323.GT26127@kalmia.hozed.org> On Mon, Apr 04, 2005 at 08:17:43PM -0700, Josh England wrote: > Troy Benjegerdes wrote: > > On Thu, Mar 31, 2005 at 06:06:30PM -0500, Hal Rosenstock wrote: > > > >>On Wed, 2005-03-30 at 20:04, Josh England wrote: > >> > >>>Are there any plans to modify the linux DHCP client so it would be > >>>possible to do kernel-level DHCP and NFSroot over IB? > >> > >>I took a quick look at this and it looks pretty straightforward. Stay > >>tuned... > > > > > > I'd say don't. > > > > Using initrd/initramfs is a much better solution. At some point the > > in-kernel dhcp is going to get so buggy and old it's going to get > > removed. > > I know...it's just crummy to have to ship another 1.3 Megs out to every node. > > > I boot all my cluster systems with NFS root servers, and I'm trying to > > get everything moved to using Debian packaged kernels with initrd's.
> > With an initrd, you at least have a chance to get a shell and figure out > > why you couldn't find your nfs server, instead of "kernel panic, I'm > > going to die now" you get with in-kernel dhcp/nfs. > > Check out oneSIS (http://onesis.org). It can build initrds for you that > do NFSroot (and drop to a shell when things go sour). I'd love to hear > some feedback from people familiar with running NFSroot. Debian has a package called "lessdisks" that does some similar stuff. If I do "apt-get install initrd-netboot-tools", and then install a debian kernel image, it builds an initrd that can netboot. I suppose my next trick after I get the debian kernel maintainers to make sure infiniband is enabled is to try booting over IPoIB. What dhcp client does onesis use? The debian lessdisks stuff uses 'udhcpc', advertised as a "very small DHCP client". From halr at voltaire.com Tue Apr 5 08:51:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2005 11:51:31 -0400 Subject: [openib-general] CM misuse of in_atomic/irqs_disabled Message-ID: <1112716291.4490.8.camel@localhost.localdomain> Hi Sean, Should the following in the CM be changed: int ib_cm_establish(struct ib_cm_id *cm_id) { ... work = kmalloc(sizeof *work, (in_atomic() || irqs_disabled()) ? GFP_ATOMIC : GFP_KERNEL); to just work = kmalloc(sizeof *work, GFP_ATOMIC); similar to the other core changes for this issue ? -- Hal From mshefty at ichips.intel.com Tue Apr 5 09:06:35 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 05 Apr 2005 09:06:35 -0700 Subject: [openib-general] Re: CM misuse of in_atomic/irqs_disabled In-Reply-To: <1112716291.4490.8.camel@localhost.localdomain> References: <1112716291.4490.8.camel@localhost.localdomain> Message-ID: <4252B78B.10801@ichips.intel.com> Hal Rosenstock wrote: > Hi Sean, > > Should the following in the CM be changed: > > int ib_cm_establish(struct ib_cm_id *cm_id) > { > ... > work = kmalloc(sizeof *work, (in_atomic() || irqs_disabled()) ?
> GFP_ATOMIC : GFP_KERNEL); > to just > work = kmalloc(sizeof *work, GFP_ATOMIC); > > similar to the other core changes for this issue ? I think so. This was structured similar to the MAD code. I'll commit a patch to change this. - Sean From roland at topspin.com Tue Apr 5 09:05:31 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 05 Apr 2005 09:05:31 -0700 Subject: [openib-general] CM misuse of in_atomic/irqs_disabled In-Reply-To: <1112716291.4490.8.camel@localhost.localdomain> (Hal Rosenstock's message of "05 Apr 2005 11:51:31 -0400") References: <1112716291.4490.8.camel@localhost.localdomain> Message-ID: <52y8bxcipg.fsf@topspin.com> Or we could make ib_cm_establish() take a gfp_mask... - R. From peter at pantasys.com Tue Apr 5 09:32:26 2005 From: peter at pantasys.com (Peter Buckingham) Date: Tue, 05 Apr 2005 09:32:26 -0700 Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB In-Reply-To: <42520357.6090607@sandia.gov> References: <1112190300.4495.67.camel@localhost.localdomain> <424B4CA7.1050606@sandia.gov> <1112310139.4490.24.camel@localhost.localdomain> <20050405025037.GR26127@kalmia.hozed.org> <42520357.6090607@sandia.gov> Message-ID: <4252BD9A.6000800@pantasys.com> Josh England wrote: > Check out oneSIS (http://onesis.org). It can build initrds for you that > do NFSroot (and drop to a shell when things go sour). I'd love to hear > some feedback from people familiar with running NFSroot. i'll check it out. we actually build our own initrd's here. part of that is because gen1 IB drivers can't be built into the kernel, but it also gives us better fail over options too.
peter From jjengla at sandia.gov Tue Apr 5 09:37:42 2005 From: jjengla at sandia.gov (Josh England) Date: Tue, 05 Apr 2005 09:37:42 -0700 Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB In-Reply-To: <20050405155323.GT26127@kalmia.hozed.org> References: <1112190300.4495.67.camel@localhost.localdomain> <424B4CA7.1050606@sandia.gov> <1112310139.4490.24.camel@localhost.localdomain> <20050405025037.GR26127@kalmia.hozed.org> <42520357.6090607@sandia.gov> <20050405155323.GT26127@kalmia.hozed.org> Message-ID: <4252BED6.1000105@sandia.gov> Troy Benjegerdes wrote: > On Mon, Apr 04, 2005 at 08:17:43PM -0700, Josh England wrote: > >>Troy Benjegerdes wrote: >> >>>On Thu, Mar 31, 2005 at 06:06:30PM -0500, Hal Rosenstock wrote: >>> >>> >>>>On Wed, 2005-03-30 at 20:04, Josh England wrote: >>>> >>>> >>>>>Are there any plans to modify the linux DHCP client so it would be >>>>>possible to do kernel-level DHCP and NFSroot over IB? >>>> >>>>I took a quick look at this and it looks pretty straightforward. Stay >>>>tuned... >>> >>> >>>I'd say don't. >>> >>>Using initrd/initramfs is a much better solution. At some point the >>>in-kernel dhcp is going to get so buggy and old it's going to get >>>removed. >> >>I know...it's just crummy to have to ship another 1.3 Megs out to every node. >> >> >>>I boot all my cluster systems with NFS root servers, and I'm trying to >>>get everything moved to using Debian packaged kernels with initrd's. >>>With an initrd, you at least have a chance to get a shell and figure out >>>why you couldn't find your nfs server, instead of "kernel panic, I'm >>>going to die now" you get with in-kernel dhcp/nfs. >> >>Check out oneSIS (http://onesis.org). It can build initrds for you that >>do NFSroot (and drop to a shell when things go sour). I'd love to hear >>some feedback from people familiar with running NFSroot. > > > Debian has a package called "lessdisks" that does some similar stuff.
> If I do "apt-get install initrd-netboot-tools", and then install a > debian kernel image, it builds an initrd that can netboot. I suppose my > next trick after I get the debian kernel maintainers to make sure > infiniband is enabled is to try booting over IPoIB. > > What dhcp client does onesis use? The debian lessdisks stuff uses > 'udhcpc', advertised as a "very small DHCP client". It has used udhcpc in the past, but I noticed it going significantly slower on some machines and never found out why. Right now it uses dhclient, but that is easy enough to change. The initrd itself is a full mini-linux (busybox) with some helpful utilities (though I still need to add lspci). Maybe this discussion could be taken off the openib list, though. I'd love to talk more about the other cool stuff that oneSIS can do. -JE From peter at pantasys.com Tue Apr 5 09:40:48 2005 From: peter at pantasys.com (Peter Buckingham) Date: Tue, 05 Apr 2005 09:40:48 -0700 Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB In-Reply-To: <20050405155323.GT26127@kalmia.hozed.org> References: <1112190300.4495.67.camel@localhost.localdomain> <424B4CA7.1050606@sandia.gov> <1112310139.4490.24.camel@localhost.localdomain> <20050405025037.GR26127@kalmia.hozed.org> <42520357.6090607@sandia.gov> <20050405155323.GT26127@kalmia.hozed.org> Message-ID: <4252BF90.1070803@pantasys.com> Troy Benjegerdes wrote: > What dhcp client does onesis use? The debian lessdisks stuff uses > 'udhcpc', advertised as a "very small DHCP client". I'd really suggest having a look at the klibc project. it provides pretty much all you need (apart from good documentation ;-). This was done as part of the move to initramfs by the kernel guys (hpa, al viro, etc). 
sources: http://www.kernel.org/pub/linux/libs/klibc/ peter From roland at topspin.com Tue Apr 5 09:53:21 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 05 Apr 2005 09:53:21 -0700 Subject: [openib-general] Re: [-mm patch] drivers/infiniband/hw/mthca/mthca_main.c: remove an unused label In-Reply-To: <20050405142449.GF6885@stusta.de> (Adrian Bunk's message of "Tue, 5 Apr 2005 16:24:49 +0200") References: <20050405000524.592fc125.akpm@osdl.org> <20050405142449.GF6885@stusta.de> Message-ID: <52psx9cghq.fsf@topspin.com> > CC drivers/infiniband/hw/mthca/mthca_main.o > drivers/infiniband/hw/mthca/mthca_main.c: In function `mthca_init_icm': > drivers/infiniband/hw/mthca/mthca_main.c:479: warning: label > `err_unmap_eqp' defined but not used Thanks, good catch. I screwed up the error path in that function a little while merging patches. Here's the correct fix. Correct unwinding in error path of mthca_init_icm(). Signed-off-by: Roland Dreier --- linux-2.6.12-rc2-mm1.orig/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-05 09:49:02.944473724 -0700 +++ linux-2.6.12-rc2-mm1/drivers/infiniband/hw/mthca/mthca_main.c 2005-04-05 09:49:15.679708865 -0700 @@ -437,7 +437,7 @@ if (!mdev->qp_table.rdb_table) { mthca_err(mdev, "Failed to map RDB context memory, aborting\n"); err = -ENOMEM; - goto err_unmap_rdb; + goto err_unmap_eqp; } mdev->cq_table.table = mthca_alloc_icm_table(mdev, init_hca->cqc_base, From jjengla at sandia.gov Tue Apr 5 10:26:45 2005 From: jjengla at sandia.gov (Josh England) Date: Tue, 05 Apr 2005 10:26:45 -0700 Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB In-Reply-To: <4252BF90.1070803@pantasys.com> References: <1112190300.4495.67.camel@localhost.localdomain> <424B4CA7.1050606@sandia.gov> <1112310139.4490.24.camel@localhost.localdomain> <20050405025037.GR26127@kalmia.hozed.org> <42520357.6090607@sandia.gov> <20050405155323.GT26127@kalmia.hozed.org> <4252BF90.1070803@pantasys.com> Message-ID: <4252CA55.3000001@sandia.gov> Peter 
Buckingham wrote: > Troy Benjegerdes wrote: > >> What dhcp client does onesis use? The debian lessdisks stuff uses >> 'udhcpc', advertised as a "very small DHCP client". > > > I'd really suggest having a look at the klibc project. it provides > pretty much all you need (apart from good documentation ;-). This was > done as part of the move to initramfs by the kernel guys (hpa, al viro, > etc). > > sources: > > http://www.kernel.org/pub/linux/libs/klibc/ Yeah...it definitely looks to be the way to go for 2.6 systems. -JE From rf at q-leap.de Tue Apr 5 10:26:52 2005 From: rf at q-leap.de (Roland Fehrenbacher) Date: Tue, 5 Apr 2005 19:26:52 +0200 Subject: [openib-general] gen2 opensm Message-ID: <16978.51804.357810.302532@gargle.gargle.HOWL> Hi, I have tried the kernel 2.6.11 drivers on an x86-64 machine with a MT23108 card. The driver loads ok after $ modprobe ib_mthca; modprobe ib_umad Since I use devfs, I have to manually create $ mknod /dev/infiniband/umad0 c 231 0 $ mknod /dev/infiniband/umad1 c 231 1 $ mknod /dev/infiniband/issm0 c 231 64 $ mknod /dev/infiniband/issm1 c 231 65 I get $ /usr/local/ib/bin/ibstat CA 'mthca0' CA type: MT23108 Number of ports: 2 Firmware version: 3.2.0 Hardware version: a1 Node GUID: 0x000000008815bcaa System image GUID: 0x000000008815bcaa Port 1: State: Initializing Physical state: LinkUp Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00500a68 Port GUID: 0x0000000000000000 Port 2: State: Down Physical state: Polling Rate: 2 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00500a68 Port GUID: 0x0000000000000000 which already looks strange (GUID 0 ???). Running opensm then doesn't activate the ports: Apr 05 19:18:25 [4000] -> OpenSM Rev:openib-1.0.0 Apr 05 19:18:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher. 
Apr 05 19:18:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000 Apr 05 19:18:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000 Apr 05 19:18:25 [4000] -> osm_vendor_get_all_port_attr: assign CA 0x7fffffffd010ort 1 guid (0x65babaa) as the default port. Apr 05 19:18:25 [4000] -> osm_vendor_bind: Binding to port 0x225dabaa. Apr 05 19:18:25 [4000] -> osm_vendor_bind: Binding to port 0x8000000. Apr 05 19:18:25 [2400A] -> umad_receiver: Failed to obtain request madw for received MAD(method=81 attr=11) -- dropping. What could have gone wrong? Roland From halr at voltaire.com Tue Apr 5 10:42:16 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2005 13:42:16 -0400 Subject: [openib-general] gen2 opensm In-Reply-To: <16978.51804.357810.302532@gargle.gargle.HOWL> References: <16978.51804.357810.302532@gargle.gargle.HOWL> Message-ID: <1112722935.4634.12.camel@localhost.localdomain> On Tue, 2005-04-05 at 13:26, Roland Fehrenbacher wrote: > Hi, > > I have tried the kernel 2.6.11 drivers on an x86-64 machine with a > MT23108 card. The driver loads ok after > $ modprobe ib_mthca; modprobe ib_umad > > Since I use devfs, I have to manually create > > $ mknod /dev/infiniband/umad0 c 231 0 > $ mknod /dev/infiniband/umad1 c 231 1 > $ mknod /dev/infiniband/issm0 c 231 64 > $ mknod /dev/infiniband/issm1 c 231 65 What are the permissions on those ? Are they crw ? 
> I get > > $ /usr/local/ib/bin/ibstat > CA 'mthca0' > CA type: MT23108 > Number of ports: 2 > Firmware version: 3.2.0 > Hardware version: a1 > Node GUID: 0x000000008815bcaa > System image GUID: 0x000000008815bcaa > Port 1: > State: Initializing > Physical state: LinkUp > Rate: 10 > Base lid: 0 > LMC: 0 > SM lid: 0 > Capability mask: 0x00500a68 > Port GUID: 0x0000000000000000 > Port 2: > State: Down > Physical state: Polling > Rate: 2 > Base lid: 0 > LMC: 0 > SM lid: 0 > Capability mask: 0x00500a68 > Port GUID: 0x0000000000000000 > > which already looks strange (GUID 0 ???). It looks like the port GUIDs are not set in NVRAM. > Running opensm then doesn't activate the ports: > > Apr 05 19:18:25 [4000] -> OpenSM Rev:openib-1.0.0 > Apr 05 19:18:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher. > Apr 05 19:18:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000 > Apr 05 19:18:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000 > Apr 05 19:18:25 [4000] -> osm_vendor_get_all_port_attr: assign CA 0x7fffffffd010ort 1 guid (0x65babaa) as the default port. I see a bug in this message. I will fix it. Please sync OpenSM to at least version 2111 and rerun. > Apr 05 19:18:25 [4000] -> osm_vendor_bind: Binding to port 0x225dabaa. > Apr 05 19:18:25 [4000] -> osm_vendor_bind: Binding to port 0x8000000. Two binds. This looks wrong to me. > Apr 05 19:18:25 [2400A] -> umad_receiver: Failed to obtain request madw for received MAD(method=81 attr=11) -- dropping. The vendor layer couldn't find the matching request to a response which came in. This is pretty fishy but probably related to the port issue. > What could have gone wrong? I would start with setting the port GUIDs for this HCA and see if the problem persists. 
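One way to cross-check whether a port GUID is really unset is to look at the port's default GID: for GID index 0, the high 64 bits are the subnet prefix and the low 64 bits are the port GUID. The following is only a minimal sketch, assuming the colon-separated GID format that ibstatus prints. The helper name split_gid is made up for illustration, and the example value is the port 1 default GID reported elsewhere in this thread.

```python
from typing import Tuple


def split_gid(gid: str) -> Tuple[int, int]:
    """Split a colon-separated 128-bit GID into (subnet prefix, port GUID).

    Hypothetical helper: assumes eight 16-bit hex groups, as ibstatus prints.
    """
    groups = gid.split(":")
    if len(groups) != 8:
        raise ValueError("expected eight 16-bit groups")
    value = 0
    for group in groups:
        value = (value << 16) | int(group, 16)
    # High 64 bits: subnet prefix. Low 64 bits: port GUID.
    return value >> 64, value & 0xFFFFFFFFFFFFFFFF


prefix, guid = split_gid("fe80:0000:0000:0000:0002:c902:0000:771d")
print(f"subnet prefix 0x{prefix:016x}, port GUID 0x{guid:016x}")
```

A non-zero low half here would suggest the GUID is programmed and that the zero shown by ibstat comes from somewhere else in the stack.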
-- Hal > > Roland From halr at voltaire.com Tue Apr 5 12:30:08 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2005 15:30:08 -0400 Subject: [openib-general] gen2 opensm In-Reply-To: <16978.51804.357810.302532@gargle.gargle.HOWL> References: <16978.51804.357810.302532@gargle.gargle.HOWL> Message-ID: <1112729408.4490.12.camel@localhost.localdomain> On Tue, 2005-04-05 at 13:26, Roland Fehrenbacher wrote: > $ /usr/local/ib/bin/ibstat > CA 'mthca0' > CA type: MT23108 > Number of ports: 2 > Firmware version: 3.2.0 You might also want to upgrade to 3.3.2. I forget what problems 3.2.0 had and whether they will ultimately get in your way. -- Hal From rf at q-leap.de Tue Apr 5 13:33:34 2005 From: rf at q-leap.de (Roland Fehrenbacher) Date: Tue, 5 Apr 2005 22:33:34 +0200 Subject: [openib-general] gen2 opensm In-Reply-To: <1112722935.4634.12.camel@localhost.localdomain> References: <16978.51804.357810.302532@gargle.gargle.HOWL> <1112722935.4634.12.camel@localhost.localdomain> Message-ID: <16978.63006.330557.956153@gargle.gargle.HOWL> >>>>> "Hal" == Hal Rosenstock writes: Hal> On Tue, 2005-04-05 at 13:26, Roland Fehrenbacher wrote: >> I have tried the kernel 2.6.11 drivers on an x86-64 machine >> with a MT23108 card. The driver loads ok after >> $ modprobe ib_mthca; modprobe ib_umad >> Since I use devfs, I have to manually create >> $ mknod /dev/infiniband/umad0 c 231 0 >> $ mknod /dev/infiniband/umad1 c 231 1 >> $ mknod /dev/infiniband/issm0 c 231 64 >> $ mknod /dev/infiniband/issm1 c 231 65 Hal> What are the permissions on those ? Are they crw ?
$ ls -l /dev/infiniband total 0 crw-r--r-- 1 root root 231, 64 Apr 5 18:53 issm0 crw-r--r-- 1 root root 231, 65 Apr 5 18:54 issm1 crw-r--r-- 1 root root 231, 0 Apr 5 18:52 umad0 crw-r--r-- 1 root root 231, 1 Apr 5 18:54 umad1 >> I get >> >> $ /usr/local/ib/bin/ibstat >> CA 'mthca0' >> CA type: MT23108 >> Number of ports: 2 >> Firmware version: 3.2.0 >> Hardware version: a1 >> Node GUID: 0x000000008815bcaa >> System image GUID: 0x000000008815bcaa >> Port 1: >> State: Initializing >> Physical state: LinkUp >> Rate: 10 >> Base lid: 0 >> LMC: 0 >> SM lid: 0 >> Capability mask: 0x00500a68 >> Port GUID: 0x0000000000000000 >> Port 2: >> State: Down >> Physical state: Polling >> Rate: 2 >> Base lid: 0 >> LMC: 0 >> SM lid: 0 >> Capability mask: 0x00500a68 >> Port GUID: 0x0000000000000000 >> >> which already looks strange (GUID 0 ???). Hal> It looks like the port GUIDs are not set in NVRAM. They seem to be shown alright with ibstatus (or isn't gid = GUID?): $ /usr/local/ib/bin/ibstatus Infiniband device 'mthca0' port 1 status: default gid: fe80:0000:0000:0000:0002:c902:0000:771d base lid: 0x0 sm lid: 0x0 state: 2: INIT phys state: 5: LinkUp rate: 10 Gb/sec (4X) Infiniband device 'mthca0' port 2 status: default gid: fe80:0000:0000:0000:0002:c902:0000:771e base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 2: Polling rate: 2.5 Gb/sec (1X) > Running opensm then doesn't activate the ports: > > Apr 05 19:18:25 [4000] -> OpenSM Rev:openib-1.0.0 > Apr 05 19:18:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher. > Apr 05 19:18:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000 > Apr 05 19:18:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000 > Apr 05 19:18:25 [4000] -> osm_vendor_get_all_port_attr: assign CA 0x7fffffffd010ort 1 guid (0x65babaa) as the default port. Hal> I see a bug in this message. I will fix it. 
Please sync Hal> OpenSM to at least version 2111 and rerun. I will recompile tomorrow, and try a firmware upgrade. Roland From halr at voltaire.com Tue Apr 5 14:02:14 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2005 17:02:14 -0400 Subject: [openib-general] IB Address Translation service In-Reply-To: <4229CEA0.7060904@sun.com> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <20050305020402.GA3297@greglaptop.internal.keyresearch.com> <4229CEA0.7060904@sun.com> Message-ID: <1112734934.4490.99.camel@localhost.localdomain> Reviving an old thread... On Sat, 2005-03-05 at 10:22, David M. Brean wrote: > There is an I-D for DHCP on IB. IPoIB defines a "broadcast" address and > DHCP (and ARP) on IB use it. Could make RARP work using this mechanism, > but as someone else pointed out, the IB hardware address contains a > QPN. The I-D for IPoIB says something like: > > The link-layer address for IPoIB includes the QPN which might not be > constant across reboots or even across network interface resets. > Cached QPN entries, such as in static ARP entries or in RARP servers > will only work if the implementation(s) using these options ensure > that the QPN associated with an interface is invariant across > reboots/network resets. > > So, there are requirements on the IPoIB implementation to make RARP > work. Folks in the IPoIB work group decided not to go much further than > these statements for RARP support since most folks felt that DHCP is (de > facto) replacement. There are 3 cases I can envision: 1. A single IPoIB interface per HCA port. In this case, the RARP server can just match on the hardware address (port GID) without the QPN. 2. In the case of VLANs, I think we are likely OK as well. In that case, there is a separate IP subnet (per PKey) so the port GID is unique per IP subnet (the port GID is unique on that partition (IP subnet)). 
I think there is a different QPN per VLAN. So I don't think that the above 2 cases require an invariant QPN. 3. The third case is multihomed interfaces on the same IPoIB subnet. I don't think this is currently supported by IPoIB (but may someday). That would either not be supported by RARP or some way to have invariant QPNs would be needed. I'm not sure how important this case is. Is the above correct ? Are there other cases ? -- Hal From mst at mellanox.co.il Wed Apr 6 00:19:33 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 6 Apr 2005 10:19:33 +0300 Subject: [openib-general] Re: IB Address Translation service In-Reply-To: <1112734934.4490.99.camel@localhost.localdomain> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <20050305020402.GA3297@greglaptop.internal.keyresearch.com> <4229CEA0.7060904@sun.com> <1112734934.4490.99.camel@localhost.localdomain> Message-ID: <20050406071933.GL15034@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: IB Address Translation service > > Reviving an old thread... > > On Sat, 2005-03-05 at 10:22, David M. Brean wrote: > > There is an I-D for DHCP on IB. IPoIB defines a "broadcast" address and > > DHCP (and ARP) on IB use it. Could make RARP work using this mechanism, > > but as someone else pointed out, the IB hardware address contains a > > QPN. The I-D for IPoIB says something like: > > > > The link-layer address for IPoIB includes the QPN which might not be > > constant across reboots or even across network interface resets. > > Cached QPN entries, such as in static ARP entries or in RARP servers > > will only work if the implementation(s) using these options ensure > > that the QPN associated with an interface is invariant across > > reboots/network resets. > > > > So, there are requirements on the IPoIB implementation to make RARP > > work. 
Folks in the IPoIB work group decided not to go much further than > > these statements for RARP support since most folks felt that DHCP is (de > > facto) replacement. > > There are 3 cases I can envision: > > 1. A single IPoIB interface per HCA port. In this case, the RARP server > can just match on the hardware address (port GID) without the QPN. > > 2. In the case of VLANs, I think we are likely OK as well. In that case, > there is a separate IP subnet (per PKey) so the port GID is unique per > IP subnet (the port GID is unique on that partition (IP subnet)). I > think there is a different QPN per VLAN. > > So I don't think that the above 2 cases require an invariant QPN. > > 3. The third case is multihomed interfaces on the same IPoIB subnet. I > don't think this is currently supported by IPoIB (but may someday). That > would either not be supported by RARP or some way to have invariant QPNs > would be needed. I'm not sure how important this case is. > > Is the above correct ? Are there other cases ? > > -- Hal > Some DHCP servers (dhcpd) let you configure a fixed IP per hardware address. It seems to me that making this work requires an invariant QP, right? -- MST - Michael S. Tsirkin From mst at mellanox.co.il Wed Apr 6 04:30:57 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 6 Apr 2005 14:30:57 +0300 Subject: [openib-general] Re: IB Address Translation service In-Reply-To: <1112785737.4809.28.camel@localhost.localdomain> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <20050305020402.GA3297@greglaptop.internal.keyresearch.com> <4229CEA0.7060904@sun.com> <1112734934.4490.99.camel@localhost.localdomain> <20050406071933.GL15034@mellanox.co.il> <1112785737.4809.28.camel@localhost.localdomain> Message-ID: <20050406113057.GU15034@mellanox.co.il> Quoting r. 
Hal Rosenstock : > Subject: Re: IB Address Translation service > > On Wed, 2005-04-06 at 03:19, Michael S. Tsirkin wrote: > > > So I don't think that the above 2 cases require an invariant QPN. > > > > > > 3. The third case is multihomed interfaces on the same IPoIB subnet. I > > > don't think this is currently supported by IPoIB (but may someday). That > > > would either not be supported by RARP or some way to have invariant QPNs > > > would be needed. I'm not sure how important this case is. > > > > > > Is the above correct ? Are there other cases ? > > > > > > -- Hal > > > > > > > Some DHCP servers (dhcpd) let you configure a fixed IP per hardware address. > > It seems to me that making this work requires an invariant QP, right? > > I believe that DHCP servers require work in order to do this for IPoIB > as they do not understand an IPoIB hardware address. Are we talking about IPoIB hardware address size issues? > So if they do > support client identifier mapping to IP address, then I think the answer > is maybe (rather than a definitive yes). The reason I say this is that > the DHCP draft appears to allow any unique client identifier per IP > subnet to be used. In that case, as long as there is a scheme to make each > identifier unique per IP subnet, DHCP is fine. > > The "client-identifier" option includes a type and identifier pair. > The identifier included in the "client-identifier" option may > consist of a hardware address or any other unique value such as the > DNS name of the client. When a hardware address is used, the type > field should be one of the ARP hardware types listed in [ARPPARAM]. > > The most common (simple to implement) client identifier from the > DHCP client perspective is the IPoIB hardware address.
> > http://www.ietf.org/internet-drafts/draft-ietf-dhc-3315id-for-v4-04.txt > states: > > Client identities are ephemeral > > RFC2132 recommends that client identifiers be generated by using > the permanent link-layer address of the network interface that the > client is trying to configure. > > Requirements > > In order to address the problems stated in section 2, DHCPv4 client > identifiers must have the following characteristics: > > - They must be persistent, in the sense that a particular host's > client identifier must not change simply because a piece of > network hardware is added or removed. > > ... > > - DHCPv4 client identifiers used by dual-stack hosts that also use > DHCPv6 must use the same host identification string for both > DHCPv4 and DHCPv6 - for example, a DHCPv4 server that uses the > client's identity to update the DNS on behalf of a DHCPv4 client > must register the same client identity in the DNS that would be > registered by the DHCPv6 server on behalf of the DHCPv6 client > running on that host, and vice versa. > > In order to satisfy all but the last of these requirements, we need > to construct a DHCPv4 client identifier out of two parts. One part > must be unique to the host on which the client is running. The > other must be unique to the network identity being presented. The > DHCP Unique Identifier (DUID) and Identity Association Identifier > (IAID) specified in RFC3315 satisfy these requirements. > > DHCPv4 Client behavior > > DHCPv4 clients conforming to this specification MUST use stable > DHCPv4 node identifiers in the dhcp-client-identifier option. > DHCPv4 clients MUST NOT use client identifiers based solely on > layer two addresses that are hard-wired to the layer two device > (e.g., the ethernet MAC address) as suggested in RFC2131, except as > allowed in section 9.2 of RFC3315. 
DHCPv4 clients MUST send a > 'client identifier' option containing an Identity Association > Unique Identifier, as defined in section 10 of RFC3315, and a DHCP > Unique Identifier, as defined in section 9 of RFC3315. These > together constitute an RFC3315-style binding identifier. > > The general format of the DHCPv4 'client identifier' option is > defined in section 9.14 of RFC2132. > > To send an RFC3315-style binding identifier in a DHCPv4 'client > identifier' option, the type of the 'client identifier' option is > set to 255. The type field is immediately followed by the IAID, > which is an opaque 32-bit quantity. The IAID is immediately > followed by the DUID, which consumes the remaining contents of the > 'client identifier' option. The format of the 'client identifier' > option is as follows:
>
>  Code Len  Type  IAID                DUID
> +----+----+-----+----+----+----+----+----+----+---
> | 61 | n  | 255 | i1 | i2 | i3 | i4 | d1 | d2 |...
> +----+----+-----+----+----+----+----+----+----+---
>
> Any DHCPv4 or DHCPv6 client that conforms to this specification > SHOULD provide a means by which an operator can learn what DUID the > client has chosen. Such clients SHOULD also provide a means by > which the operator can configure the DUID. A device that is > normally configured with both a DHCPv4 and DHCPv6 client SHOULD > automatically use the same DUID for DHCPv4 and DHCPv6 without any > operator intervention. > > DHCPv4 clients that support more than one network interface SHOULD > use the same DUID on every interface. DHCPv4 clients that support > more than one network interface SHOULD use a different IAID on > each interface.
> > From RFC 3315, there are multiple DUID types: DUID-LLT (link-layer address > plus time), DUID-EN (assigned by vendor based on enterprise number), and DUID-LL > (based on link-layer address). > > Identity Association > > An "identity-association" (IA) is a construct through which a server > and a client can identify, group, and manage a set of related IPv6 > addresses. Each IA consists of an IAID and associated configuration > information. > > A client must associate at least one distinct IA with each of its > network interfaces for which it is to request the assignment of IPv6 > addresses from a DHCP server. The client uses the IAs assigned to an > interface to obtain configuration information from a server for that > interface. Each IA must be associated with exactly one interface. > > The IAID uniquely identifies the IA and must be chosen to be unique > among the IAIDs on the client. The IAID is chosen by the client. > For any given use of an IA by the client, the IAID for that IA MUST > be consistent across restarts of the DHCP client. The client may > maintain consistency either by storing the IAID in non-volatile > storage or by using an algorithm that will consistently produce the > same IAID as long as the configuration of the client has not changed. > There may be no way for a client to maintain consistency of the IAIDs > if it does not have non-volatile storage and the client's hardware > configuration changes. So a client could, for example, mask the QP number and use the remaining non-volatile portion as the identifier? > Using a scheme along these lines removes the need for a nonvolatile > QPN for DHCP. > > -- Hal > Do you know whether dhcp clients / servers support this? dhcpd man page seems to talk only about hardware address. -- MST - Michael S. Tsirkin From mst at mellanox.co.il Wed Apr 6 04:46:24 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 6 Apr 2005 14:46:24 +0300 Subject: [openib-general] [PATCH] two fixes for INIT_IB Message-ID: <20050406114624.GV15034@mellanox.co.il> fixes the INIT_IB command: 1. Allocate 64 bytes for the inbox, so that the address is 16 bytes aligned as required by the manual. 2. Free the exact number of bytes we allocate. Signed-off-by: Michael S. Tsirkin Index: mthca_cmd.c =================================================================== --- mthca_cmd.c (revision 2115) +++ mthca_cmd.c (working copy) @@ -1183,7 +1183,7 @@ int mthca_INIT_IB(struct mthca_dev *dev, int err; u32 flags; -#define INIT_IB_IN_SIZE 56 +#define INIT_IB_IN_SIZE 0x40 #define INIT_IB_FLAGS_OFFSET 0x00 #define INIT_IB_FLAG_SIG (1 << 18) #define INIT_IB_FLAG_NG (1 << 17) @@ -1224,7 +1224,7 @@ int mthca_INIT_IB(struct mthca_dev *dev, err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB, CMD_TIME_CLASS_A, status); - pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + pci_free_consistent(dev->pdev, INIT_IB_IN_SIZE, inbox, indma); return err; } -- MST - Michael S. Tsirkin From halr at voltaire.com Wed Apr 6 04:08:57 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2005 07:08:57 -0400 Subject: [openib-general] Re: IB Address Translation service In-Reply-To: <20050406071933.GL15034@mellanox.co.il> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <20050305020402.GA3297@greglaptop.internal.keyresearch.com> <4229CEA0.7060904@sun.com> <1112734934.4490.99.camel@localhost.localdomain> <20050406071933.GL15034@mellanox.co.il> Message-ID: <1112785737.4809.28.camel@localhost.localdomain> On Wed, 2005-04-06 at 03:19, Michael S. Tsirkin wrote: > > So I don't think that the above 2 cases require an invariant QPN. > > > > 3. The third case is multihomed interfaces on the same IPoIB subnet. I > > don't think this is currently supported by IPoIB (but may someday). 
That > > would either not be supported by RARP or some way to have invariant QPNs > > would be needed. I'm not sure how important this case is. > > > > Is the above correct ? Are there other cases ? > > > > -- Hal > > > > Some DHCP servers (dhcpd) let you configure a fixed IP per hardware address. > It seems to me that making this work requires an invariant QP, right? I believe that DHCP servers require work in order to do this for IPoIB as they do not understand an IPoIB hardware address. So if they do support client identifier mapping to IP address, then I think the answer is maybe (rather than a definitive yes). The reason I say this is that the DHCP draft appears to allow any unique client identifier per IP subnet to be used. In that case, as long as there is a scheme to make each identifier unique per IP subnet, DHCP is fine. The "client-identifier" option includes a type and identifier pair. The identifier included in the "client-identifier" option may consist of a hardware address or any other unique value such as the DNS name of the client. When a hardware address is used, the type field should be one of the ARP hardware types listed in [ARPPARAM]. The most common (simple to implement) client identifier from the DHCP client perspective is the IPoIB hardware address. http://www.ietf.org/internet-drafts/draft-ietf-dhc-3315id-for-v4-04.txt states: Client identities are ephemeral RFC2132 recommends that client identifiers be generated by using the permanent link-layer address of the network interface that the client is trying to configure. Requirements In order to address the problems stated in section 2, DHCPv4 client identifiers must have the following characteristics: - They must be persistent, in the sense that a particular host's client identifier must not change simply because a piece of network hardware is added or removed. ...
- DHCPv4 client identifiers used by dual-stack hosts that also use DHCPv6 must use the same host identification string for both DHCPv4 and DHCPv6 - for example, a DHCPv4 server that uses the client's identity to update the DNS on behalf of a DHCPv4 client must register the same client identity in the DNS that would be registered by the DHCPv6 server on behalf of the DHCPv6 client running on that host, and vice versa. In order to satisfy all but the last of these requirements, we need to construct a DHCPv4 client identifier out of two parts. One part must be unique to the host on which the client is running. The other must be unique to the network identity being presented. The DHCP Unique Identifier (DUID) and Identity Association Identifier (IAID) specified in RFC3315 satisfy these requirements. DHCPv4 Client behavior DHCPv4 clients conforming to this specification MUST use stable DHCPv4 node identifiers in the dhcp-client-identifier option. DHCPv4 clients MUST NOT use client identifiers based solely on layer two addresses that are hard-wired to the layer two device (e.g., the ethernet MAC address) as suggested in RFC2131, except as allowed in section 9.2 of RFC3315. DHCPv4 clients MUST send a 'client identifier' option containing an Identity Association Unique Identifier, as defined in section 10 of RFC3315, and a DHCP Unique Identifier, as defined in section 9 of RFC3315. These together constitute an RFC3315-style binding identifier. The general format of the DHCPv4 'client identifier' option is defined in section 9.14 of RFC2132. To send an RFC3315-style binding identifier in a DHCPv4 'client identifier' option, the type of the 'client identifier' option is set to 255. The type field is immediately followed by the IAID, which is an opaque 32-bit quantity. The IAID is immediately followed by the DUID, which consumes the remaining contents of the 'client identifier' option.
The format of the 'client identifier' option is as follows:

   Code  Len  Type  IAID                DUID
   +----+----+-----+----+----+----+----+----+----+---
   | 61 | n  | 255 | i1 | i2 | i3 | i4 | d1 | d2 |...
   +----+----+-----+----+----+----+----+----+----+---

Any DHCPv4 or DHCPv6 client that conforms to this specification SHOULD
provide a means by which an operator can learn what DUID the client has
chosen. Such clients SHOULD also provide a means by which the operator
can configure the DUID. A device that is normally configured with both a
DHCPv4 and DHCPv6 client SHOULD automatically use the same DUID for
DHCPv4 and DHCPv6 without any operator intervention.

DHCPv4 clients that support more than one network interface SHOULD use
the same DUID on every interface. DHCPv4 clients that support more than
one network interface SHOULD use a different IAID on each interface.

From RFC 3315, there are multiple DUID types: DUID-LLT (link-layer
address plus time), DUID-EN (assigned by vendor based on enterprise
number), DUID-LL (based on link-layer address).

Identity Association

An "identity-association" (IA) is a construct through which a server and
a client can identify, group, and manage a set of related IPv6
addresses. Each IA consists of an IAID and associated configuration
information.

A client must associate at least one distinct IA with each of its
network interfaces for which it is to request the assignment of IPv6
addresses from a DHCP server. The client uses the IAs assigned to an
interface to obtain configuration information from a server for that
interface. Each IA must be associated with exactly one interface.

The IAID uniquely identifies the IA and must be chosen to be unique
among the IAIDs on the client. The IAID is chosen by the client. For any
given use of an IA by the client, the IAID for that IA MUST be
consistent across restarts of the DHCP client.
The client may maintain consistency either by storing the IAID in
non-volatile storage or by using an algorithm that will consistently
produce the same IAID as long as the configuration of the client has not
changed. There may be no way for a client to maintain consistency of the
IAIDs if it does not have non-volatile storage and the client's hardware
configuration changes.

Using a scheme along these lines precludes the requirement for a
nonvolatile QPN for DHCP.

-- Hal

From halr at voltaire.com Wed Apr 6 05:16:25 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Apr 2005 08:16:25 -0400
Subject: [openib-general] Re: IB Address Translation service
In-Reply-To: <20050406113057.GU15034@mellanox.co.il>
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com>
	<1109715208.11800.41.camel@duffman>
	<1109966313.20238.11.camel@duffman>
	<20050305020402.GA3297@greglaptop.internal.keyresearch.com>
	<4229CEA0.7060904@sun.com>
	<1112734934.4490.99.camel@localhost.localdomain>
	<20050406071933.GL15034@mellanox.co.il>
	<1112785737.4809.28.camel@localhost.localdomain>
	<20050406113057.GU15034@mellanox.co.il>
Message-ID: <1112789785.4809.86.camel@localhost.localdomain>

On Wed, 2005-04-06 at 07:30, Michael S. Tsirkin wrote:
> > > Some DHCP servers (dhcpd) let you configure a fixed IP per hardware address.
> > > It seems to me that making this work requires an invariant QP, right?
> >
> > I believe that DHCP servers require work in order to do this for IPoIB
> > as they do not understand an IPoIB hardware address.
>
> Are we talking about IPoIB hardware address size issues?

I was referring to dealing with IPoIB DHCP zeroing the chaddr field with
hlen 0 and indicating htype of IPoIB. Not sure whether this is supported
without any changes to DHCP servers.

The other issue would be the support for the client identifier field and
whether this is supported or needs work.
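
For illustration, the 'client identifier' layout quoted from the draft
earlier in this thread (code 61, length, type 255, 4-byte IAID, then the
DUID) can be sketched in Python. This is a minimal sketch, not taken
from any DHCP implementation: the function names are made up here, the
DUID-LL encoding (2-byte type 3, 2-byte hardware type, address) follows
RFC 3315, and feeding the 8-byte port GUID as the link-layer part is
only an assumption chosen to keep the QPN out of the identifier.

```python
import struct

DHCP_OPT_CLIENT_ID = 61       # option code shown in the quoted diagram
CLIENT_ID_TYPE_3315 = 255     # type value for an IAID + DUID identifier
DUID_TYPE_LL = 3              # DUID-LL type code from RFC 3315
ARP_HW_TYPE_INFINIBAND = 32   # ARP hardware type assigned to InfiniBand


def duid_ll(hw_type: int, link_layer_addr: bytes) -> bytes:
    """Build a DUID-LL: 2-byte DUID type, 2-byte hardware type, address."""
    return struct.pack("!HH", DUID_TYPE_LL, hw_type) + link_layer_addr


def build_client_id_option(iaid: int, duid: bytes) -> bytes:
    """Build option 61 as: code, length, type 255, 32-bit IAID, DUID."""
    if not 0 <= iaid <= 0xFFFFFFFF:
        raise ValueError("IAID must fit in 32 bits")
    payload = bytes([CLIENT_ID_TYPE_3315]) + struct.pack("!I", iaid) + duid
    if len(payload) > 255:
        raise ValueError("option payload exceeds 255 bytes")
    return bytes([DHCP_OPT_CLIENT_ID, len(payload)]) + payload


# Hypothetical input: a port GUID of the kind shown in this thread,
# used here without the QPN (an assumption of this sketch).
port_guid = bytes.fromhex("0002c9020000771d")
option = build_client_id_option(1, duid_ll(ARP_HW_TYPE_INFINIBAND, port_guid))
print(option.hex())
```

Because the IAID and DUID are chosen by the client and do not embed the
QPN, an identifier built this way stays the same across interface
resets, which is the property the thread is after.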
> So a client could, for example, mask the QP number and use the > remaining non-volatile portion as the identifier? (This is what I said in a previous email in terms of making this work for RARP). That works if this is unique in the IP subnet. That's not true in all cases. Also, per the emerging DHCP requirement, it does not follow the format for the client identifier as an IAID is also required. I suppose if there is only 1 interface then IAID isn't a problem either. > Do you know whether dhcp clients / servers support this? Not sure exactly which this you are referring to. The I-D requirement (RFC3315id) is likely not supported. > dhcpd man page seems to talk only about hardware address. I think it may be dependent on which DHCP client/server. For the ISC one, the changes were minimal (I put them out; there is one update since) but this doesn't support 3315id. -- Hal From Roland.Fehrenbacher at transtec.de Wed Apr 6 07:44:05 2005 From: Roland.Fehrenbacher at transtec.de (Roland Fehrenbacher) Date: Wed, 6 Apr 2005 16:44:05 +0200 Subject: [openib-general] gen2 opensm In-Reply-To: <1112788349.4809.39.camel@localhost.localdomain> References: <16978.51804.357810.302532@gargle.gargle.HOWL> <1112722935.4634.12.camel@localhost.localdomain> <16978.63006.330557.956153@gargle.gargle.HOWL> <1112788349.4809.39.camel@localhost.localdomain> Message-ID: <16979.62901.482352.325628@gargle.gargle.HOWL> > $ /usr/local/ib/bin/ibstatus > Infiniband device 'mthca0' port 1 status: > default gid: fe80:0000:0000:0000:0002:c902:0000:771d > base lid: 0x0 > sm lid: 0x0 > state: 2: INIT > phys state: 5: LinkUp > rate: 10 Gb/sec (4X) > > Infiniband device 'mthca0' port 2 status: > default gid: fe80:0000:0000:0000:0002:c902:0000:771e > base lid: 0x0 > sm lid: 0x0 > state: 1: DOWN > phys state: 2: Polling > rate: 2.5 Gb/sec (1X) Hal> That's strange that you can get the port GIDs via ibstatus Hal> but not via ibstat. 
Hal> The one thing different I see is that the NodeGUID is very Hal> different from the PortGUIDs. Not sure if this messes things Hal> up. Somehow the tools don't seem to get the correct information, but it's there: $ cat /sys/class/infiniband/mthca0/node_guid 0002:c902:0000:771c $ cat /sys/class/infiniband/mthca0/sys_image_guid 0002:c902:0000:771f How can this happen? > > Running opensm then doesn't activate the ports: > > > > Apr 05 19:18:25 [4000] -> OpenSM Rev:openib-1.0.0 ....... Hal> I see a bug in this message. I will fix it. Please sync Hal> OpenSM to at least version 2111 and rerun. > I will recompile tomorrow, and try a firmware upgrade. The error log with the recompiled opensm is now: Apr 06 14:39:14 [4000] -> OpenSM Rev:openib-1.0.0 Apr 06 14:39:14 [4000] -> osm_opensm_init: Forcing single threaded dispatcher. Apr 06 14:39:14 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000 Apr 06 14:39:14 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000 Apr 06 14:39:14 [4000] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x65babaa) as the default port. Apr 06 14:39:14 [4000] -> osm_vendor_bind: Binding to port 0x225dabaa. Apr 06 14:39:14 [4000] -> osm_vendor_bind: Binding to port 0x8000000. Apr 06 14:39:14 [2400A] -> umad_receiver: Failed to obtain request madw for received MAD(method=81 attr=11) -- dropping. I couldn't do a firmware update yet, since I haven't gotten Mellanox mst to compile with kernel 2.6.11. Do you have another suggestion how I could do the upgrade? Thanks, Roland From David.Brean at Sun.COM Wed Apr 6 07:45:27 2005 From: David.Brean at Sun.COM (David M. 
Brean) Date: Wed, 06 Apr 2005 10:45:27 -0400 Subject: [openib-general] IB Address Translation service In-Reply-To: <1112734934.4490.99.camel@localhost.localdomain> References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com> <1109715208.11800.41.camel@duffman> <1109966313.20238.11.camel@duffman> <20050305020402.GA3297@greglaptop.internal.keyresearch.com> <4229CEA0.7060904@sun.com> <1112734934.4490.99.camel@localhost.localdomain> Message-ID: <4253F607.7060300@sun.com> Your case #3 is an application where the limitations of RARP on IB appear. I can't think of any other interesting configurations beyond 1-3. -David Hal Rosenstock wrote: >Reviving an old thread... > >On Sat, 2005-03-05 at 10:22, David M. Brean wrote: > > >>There is an I-D for DHCP on IB. IPoIB defines a "broadcast" address and >>DHCP (and ARP) on IB use it. Could make RARP work using this mechanism, >>but as someone else pointed out, the IB hardware address contains a >>QPN. The I-D for IPoIB says something like: >> >> The link-layer address for IPoIB includes the QPN which might not be >> constant across reboots or even across network interface resets. >> Cached QPN entries, such as in static ARP entries or in RARP servers >> will only work if the implementation(s) using these options ensure >> that the QPN associated with an interface is invariant across >> reboots/network resets. >> >>So, there are requirements on the IPoIB implementation to make RARP >>work. Folks in the IPoIB work group decided not to go much further than >>these statements for RARP support since most folks felt that DHCP is (de >>facto) replacement. >> >> > >There are 3 cases I can envision: > >1. A single IPoIB interface per HCA port. In this case, the RARP server >can just match on the hardware address (port GID) without the QPN. > >2. In the case of VLANs, I think we are likely OK as well. 
In that case, >there is a separate IP subnet (per PKey) so the port GID is unique per >IP subnet (the port GID is unique on that partition (IP subnet)). I >think there is a different QPN per VLAN. > >So I don't think that the above 2 cases require an invariant QPN. > >3. The third case is multihomed interfaces on the same IPoIB subnet. I >don't think this is currently supported by IPoIB (but may someday). That >would either not be supported by RARP or some way to have invariant QPNs >would be needed. I'm not sure how important this case is. > >Is the above correct ? Are there other cases ? > >-- Hal > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From halr at voltaire.com Wed Apr 6 07:58:03 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2005 10:58:03 -0400 Subject: [openib-general] gen2 opensm In-Reply-To: <16979.62901.482352.325628@gargle.gargle.HOWL> References: <16978.51804.357810.302532@gargle.gargle.HOWL> <1112722935.4634.12.camel@localhost.localdomain> <16978.63006.330557.956153@gargle.gargle.HOWL> <1112788349.4809.39.camel@localhost.localdomain> <16979.62901.482352.325628@gargle.gargle.HOWL> Message-ID: <1112799483.4906.32.camel@localhost.localdomain> On Wed, 2005-04-06 at 10:44, Roland Fehrenbacher wrote: > > $ /usr/local/ib/bin/ibstatus > > Infiniband device 'mthca0' port 1 status: > > default gid: fe80:0000:0000:0000:0002:c902:0000:771d > > base lid: 0x0 > > sm lid: 0x0 > > state: 2: INIT > > phys state: 5: LinkUp > > rate: 10 Gb/sec (4X) > > > > Infiniband device 'mthca0' port 2 status: > > default gid: fe80:0000:0000:0000:0002:c902:0000:771e > > base lid: 0x0 > > sm lid: 0x0 > > state: 1: DOWN > > phys state: 2: Polling > > rate: 2.5 Gb/sec (1X) > > Hal> That's strange that you can get the port GIDs via ibstatus > Hal> but not via ibstat. 
> > Hal> The one thing different I see is that the NodeGUID is very > Hal> different from the PortGUIDs. Not sure if this messes things > Hal> up. > > Somehow the tools don't seem to get the correct information, but it's > there: > > $ cat /sys/class/infiniband/mthca0/node_guid > 0002:c902:0000:771c > > $ cat /sys/class/infiniband/mthca0/sys_image_guid > 0002:c902:0000:771f > > How can this happen? > > > > Running opensm then doesn't activate the ports: > > > > > > Apr 05 19:18:25 [4000] -> OpenSM Rev:openib-1.0.0 ....... > > Hal> I see a bug in this message. I will fix it. Please sync > Hal> OpenSM to at least version 2111 and rerun. > > > I will recompile tomorrow, and try a firmware upgrade. > > The error log with the recompiled opensm is now: > > Apr 06 14:39:14 [4000] -> OpenSM Rev:openib-1.0.0 > Apr 06 14:39:14 [4000] -> osm_opensm_init: Forcing single threaded dispatcher. > Apr 06 14:39:14 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000 > Apr 06 14:39:14 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000 > Apr 06 14:39:14 [4000] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x65babaa) as the default port. > Apr 06 14:39:14 [4000] -> osm_vendor_bind: Binding to port 0x225dabaa. > Apr 06 14:39:14 [4000] -> osm_vendor_bind: Binding to port 0x8000000. > Apr 06 14:39:14 [2400A] -> umad_receiver: Failed to obtain request madw for received MAD(method=81 attr=11) -- dropping. > > I couldn't do a firmware update yet, since I haven't gotten Mellanox > mst to compile with kernel 2.6.11. Do you have another suggestion how > I could do the upgrade? Mellanox mst ? Are you using the Mellanox drivers and not OpenIB gen2 ? -- Hal From mst at mellanox.co.il Wed Apr 6 08:13:11 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 6 Apr 2005 18:13:11 +0300 Subject: [openib-general] Re: gen2 opensm In-Reply-To: <16979.62901.482352.325628@gargle.gargle.HOWL> References: <16978.51804.357810.302532@gargle.gargle.HOWL> <1112722935.4634.12.camel@localhost.localdomain> <16978.63006.330557.956153@gargle.gargle.HOWL> <1112788349.4809.39.camel@localhost.localdomain> <16979.62901.482352.325628@gargle.gargle.HOWL> Message-ID: <20050406151311.GE20567@mellanox.co.il> Quoting r. Roland Fehrenbacher : > Subject: Re: gen2 opensm > I couldn't do a firmware update yet, since I haven't gotten Mellanox > mst to compile with kernel 2.6.11. Do you have another suggestion how > I could do the upgrade? > > Thanks, > > Roland Yes, use mstflint from src/userspace/mstflint Latest gold disk in general and mst in particular does not support 2.6.11 -- MST - Michael S. Tsirkin From rf at q-leap.de Wed Apr 6 08:19:52 2005 From: rf at q-leap.de (Roland Fehrenbacher) Date: Wed, 6 Apr 2005 17:19:52 +0200 Subject: [openib-general] gen2 opensm In-Reply-To: <1112799483.4906.32.camel@localhost.localdomain> References: <16978.51804.357810.302532@gargle.gargle.HOWL> <1112722935.4634.12.camel@localhost.localdomain> <16978.63006.330557.956153@gargle.gargle.HOWL> <1112788349.4809.39.camel@localhost.localdomain> <16979.62901.482352.325628@gargle.gargle.HOWL> <1112799483.4906.32.camel@localhost.localdomain> Message-ID: <16979.65048.982358.532089@gargle.gargle.HOWL> >>>>> "Hal" == Hal Rosenstock writes: >> I couldn't do a firmware update yet, since I haven't gotten >> Mellanox mst to compile with kernel 2.6.11. Do you have another >> suggestion how I could do the upgrade? Hal> Mellanox mst ? Are you using the Mellanox drivers and not Hal> OpenIB gen2 ? No, I just wanted to use them for firmware flashing. But now Michael told me to flash with mstflint. I'll try. Thanks Michael. 
Roland From halr at voltaire.com Wed Apr 6 08:21:05 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2005 11:21:05 -0400 Subject: [openib-general] gen2 opensm In-Reply-To: <1112799483.4906.32.camel@localhost.localdomain> References: <16978.51804.357810.302532@gargle.gargle.HOWL> <1112722935.4634.12.camel@localhost.localdomain> <16978.63006.330557.956153@gargle.gargle.HOWL> <1112788349.4809.39.camel@localhost.localdomain> <16979.62901.482352.325628@gargle.gargle.HOWL> <1112799483.4906.32.camel@localhost.localdomain> Message-ID: <1112800865.4906.52.camel@localhost.localdomain> On Wed, 2005-04-06 at 10:58, Hal Rosenstock wrote: > Mellanox mst ? Are you using the Mellanox drivers and not OpenIB gen2 ? If you are, that combination doesn't work. You either need to use OpenIB gen2 (mthca) or use the OpenSM from Mellanox Gold of whatever variant you are using. -- Hal From halr at voltaire.com Wed Apr 6 08:27:36 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2005 11:27:36 -0400 Subject: [openib-general] gen2 opensm In-Reply-To: <16979.65048.982358.532089@gargle.gargle.HOWL> References: <16978.51804.357810.302532@gargle.gargle.HOWL> <1112722935.4634.12.camel@localhost.localdomain> <16978.63006.330557.956153@gargle.gargle.HOWL> <1112788349.4809.39.camel@localhost.localdomain> <16979.62901.482352.325628@gargle.gargle.HOWL> <1112799483.4906.32.camel@localhost.localdomain> <16979.65048.982358.532089@gargle.gargle.HOWL> Message-ID: <1112800951.4906.54.camel@localhost.localdomain> On Wed, 2005-04-06 at 11:19, Roland Fehrenbacher wrote: > >>>>> "Hal" == Hal Rosenstock writes: > > >> I couldn't do a firmware update yet, since I haven't gotten > >> Mellanox mst to compile with kernel 2.6.11. Do you have another > >> suggestion how I could do the upgrade? > > Hal> Mellanox mst ? Are you using the Mellanox drivers and not > Hal> OpenIB gen2 ? > > No, Good. > I just wanted to use them for firmware flashing. But now Michael > told me to flash with mstflint. 
I'll try. Just out of curiousity, what is the architecture of the machine you are using ? -- Hal From rf at q-leap.de Wed Apr 6 09:19:54 2005 From: rf at q-leap.de (Roland Fehrenbacher) Date: Wed, 6 Apr 2005 18:19:54 +0200 Subject: [openib-general] Re: gen2 opensm In-Reply-To: <20050406151311.GE20567@mellanox.co.il> References: <16978.51804.357810.302532@gargle.gargle.HOWL> <1112722935.4634.12.camel@localhost.localdomain> <16978.63006.330557.956153@gargle.gargle.HOWL> <1112788349.4809.39.camel@localhost.localdomain> <16979.62901.482352.325628@gargle.gargle.HOWL> <20050406151311.GE20567@mellanox.co.il> Message-ID: <16980.3114.482668.956134@gargle.gargle.HOWL> >>>>> "Michael" == Michael S Tsirkin writes: Michael> Quoting r. Roland Fehrenbacher Michael> : >> Subject: Re: gen2 opensm I couldn't do a firmware update yet, >> since I haven't gotten Mellanox mst to compile with kernel >> 2.6.11. Do you have another suggestion how I could do the >> upgrade? >> >> Thanks, >> >> Roland Michael> Yes, use mstflint from src/userspace/mstflint Latest gold Michael> disk in general and mst in particular does not support Michael> 2.6.11 I get $ mstflint -d /proc/bus/pci/03/00.0 q Image type: FailSafe Chip rev.: A1 GUIDs: 0002c9020000771c 0002c9020000771d 0002c9020000771e 0002c9020000771f Board ID: (MT_0030000001) What would be the right way to flash this card using the files from fw-23108-rel-3_3_2/ fw-23108-rel-3_3_2/fw-23108-a1-debug.mlx fw-23108-rel-3_3_2/fw-23108-a1-rel.mlx fw-23108-rel-3_3_2/MTLP23108_128MB.brd fw-23108-rel-3_3_2/MTLP23108_256MB.brd fw-23108-rel-3_3_2/MTLP23108_512MB.brd fw-23108-rel-3_3_2/MTPB23108_128MB.brd fw-23108-rel-3_3_2/MTPB23108_256MB.brd fw-23108-rel-3_3_2/BUILD_ID Roland From rf at q-leap.de Wed Apr 6 09:29:41 2005 From: rf at q-leap.de (Roland Fehrenbacher) Date: Wed, 6 Apr 2005 18:29:41 +0200 Subject: [openib-general] gen2 opensm In-Reply-To: <1112800951.4906.54.camel@localhost.localdomain> References: <16978.51804.357810.302532@gargle.gargle.HOWL> 
<1112722935.4634.12.camel@localhost.localdomain> <16978.63006.330557.956153@gargle.gargle.HOWL> <1112788349.4809.39.camel@localhost.localdomain> <16979.62901.482352.325628@gargle.gargle.HOWL> <1112799483.4906.32.camel@localhost.localdomain> <16979.65048.982358.532089@gargle.gargle.HOWL> <1112800951.4906.54.camel@localhost.localdomain> Message-ID: <16980.3701.448062.413906@gargle.gargle.HOWL> >>>>> "Hal" == Hal Rosenstock writes: Hal> Just out of curiousity, what is the architecture of the Hal> machine you are using ? It is a Tyan S2882 with 2 x Opteron 250, 2Gb RAM. Roland From mst at mellanox.co.il Wed Apr 6 09:44:56 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 6 Apr 2005 19:44:56 +0300 Subject: [openib-general] Re: gen2 opensm In-Reply-To: <16980.3114.482668.956134@gargle.gargle.HOWL> References: <16978.51804.357810.302532@gargle.gargle.HOWL> <1112722935.4634.12.camel@localhost.localdomain> <16978.63006.330557.956153@gargle.gargle.HOWL> <1112788349.4809.39.camel@localhost.localdomain> <16979.62901.482352.325628@gargle.gargle.HOWL> <20050406151311.GE20567@mellanox.co.il> <16980.3114.482668.956134@gargle.gargle.HOWL> Message-ID: <20050406164456.GA23565@mellanox.co.il> Quoting r. Roland Fehrenbacher : > Subject: Re: gen2 opensm > > >>>>> "Michael" == Michael S Tsirkin writes: > > Michael> Quoting r. Roland Fehrenbacher > Michael> : > >> Subject: Re: gen2 opensm I couldn't do a firmware update yet, > >> since I haven't gotten Mellanox mst to compile with kernel > >> 2.6.11. Do you have another suggestion how I could do the > >> upgrade? 
> >> > >> Thanks, > >> > >> Roland > > Michael> Yes, use mstflint from src/userspace/mstflint Latest gold > Michael> disk in general and mst in particular does not support > Michael> 2.6.11 > > I get > > $ mstflint -d /proc/bus/pci/03/00.0 q > Image type: FailSafe > Chip rev.: A1 > GUIDs: 0002c9020000771c 0002c9020000771d 0002c9020000771e 0002c9020000771f > Board ID: (MT_0030000001) > > What would be the right way to flash this card using the files from > > fw-23108-rel-3_3_2/ > fw-23108-rel-3_3_2/fw-23108-a1-debug.mlx > fw-23108-rel-3_3_2/fw-23108-a1-rel.mlx > fw-23108-rel-3_3_2/MTLP23108_128MB.brd > fw-23108-rel-3_3_2/MTLP23108_256MB.brd > fw-23108-rel-3_3_2/MTLP23108_512MB.brd > fw-23108-rel-3_3_2/MTPB23108_128MB.brd > fw-23108-rel-3_3_2/MTPB23108_256MB.brd > fw-23108-rel-3_3_2/BUILD_ID > > Roland > You want MTLP23108_128MB.brd and fw-23108-a1-rel.mlx then. Create a binary image with infiniburn (select raw binary format), And burn the result with mstflint. If you select a wrong brd file, mstflint will warn you that PSID (Board ID) is being changed. MST -- MST - Michael S. Tsirkin From ardavis at ichips.intel.com Wed Apr 6 14:20:47 2005 From: ardavis at ichips.intel.com (ardavis) Date: Wed, 06 Apr 2005 14:20:47 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <52oeczoghb.fsf@topspin.com> References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> Message-ID: <425452AF.6010207@ichips.intel.com> Roland Dreier wrote: > ardavis> Has anyone successfully run uverbs examples with events > ardavis> using ibv_get_cq_event? It seems to block forever on my > ardavis> system with the pingpong test. > >Yes, I have. I'll try again with the latest code to make sure I >haven't broken anything recently. > > - R. > > > Roland, Did you get a chance to retry events? I pulled the latest from your branch and my ibv_pingpong -e testing still blocks forever. 
-arlin From roland at topspin.com Wed Apr 6 15:17:38 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 06 Apr 2005 15:17:38 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <425452AF.6010207@ichips.intel.com> (ardavis@ichips.intel.com's message of "Wed, 06 Apr 2005 14:20:47 -0700") References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> Message-ID: <527jjf8s8t.fsf@topspin.com> ardavis> Did you get a chance to retry events? I pulled the latest ardavis> from your branch and my ibv_pingpong -e testing still ardavis> blocks forever. Yes, it works for me. I just tried it again and it worked fine. What kind of system/HCA are you using? Does ibv_pingpong without -e work for you? - R. From rf at q-leap.de Thu Apr 7 05:38:17 2005 From: rf at q-leap.de (Roland Fehrenbacher) Date: Thu, 7 Apr 2005 14:38:17 +0200 Subject: [openib-general] Flashing Mellanox MT23108 Message-ID: <16981.10681.472289.311124@gargle.gargle.HOWL> Hi, can anyone tell me how to flash a Mellanox MT23108 card with mstflint. When I try the firmware file fw-23108-a1-rel.mlx from Mellanox I get $ mstflint -d /proc/bus/pci/03/00.0 -i fw-23108-a1-rel.mlx b Not a valid image Roland From Jerome.Pioux at bull.com Thu Apr 7 09:09:56 2005 From: Jerome.Pioux at bull.com (Jerome Pioux) Date: Thu, 7 Apr 2005 09:09:56 -0700 Subject: [openib-general] Flashing Mellanox MT23108 References: <16981.10681.472289.311124@gargle.gargle.HOWL> Message-ID: <004a01c53b8c$45a4dde0$0211708d@gpv.az05.bull.com> Hi Roland, I don't know about mstflint but I used flint in the past from IBGD so it may be the same? With flint, you need to use the raw image (bin) and not the Mellanox one (mlx). 
This is what I used: flint -d /dev/mst/mt23108_pci_cr2 -i /etc/ibfw/fw/fw-cougar-a1-3.3.2-jerome.bin b Image type: FailSafe Chip rev.: A1 GUIDs: 0005ad0000016770 0005ad0000016771 0005ad0000016772 0005ad000100d050 Board ID: ­ Burn image with the following GUIDs: Node: 0005ad0000016770 Port1: 0005ad0000016771 Port2: 0005ad0000016772 Sys.Image: 0005ad000100d050 etc... I think that I created the bin image using infiniburn. I read fw-23108-a1-rel.mlx using infiniburn and write the bin image (raw format). But if you have infiniburn working, I think that you can burn the mlx image directly. Jerome ----- Original Message ----- From: Roland Fehrenbacher To: openib-general at openib.org Sent: Thursday, April 07, 2005 5:38 AM Subject: [openib-general] Flashing Mellanox MT23108 Hi, can anyone tell me how to flash a Mellanox MT23108 card with mstflint. When I try the firmware file fw-23108-a1-rel.mlx from Mellanox I get $ mstflint -d /proc/bus/pci/03/00.0 -i fw-23108-a1-rel.mlx b Not a valid image Roland _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From ardavis at ichips.intel.com Thu Apr 7 09:47:57 2005 From: ardavis at ichips.intel.com (ardavis) Date: Thu, 07 Apr 2005 09:47:57 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <527jjf8s8t.fsf@topspin.com> References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> Message-ID: <4255643D.30002@ichips.intel.com> Roland Dreier wrote: > ardavis> Did you get a chance to retry events? I pulled the latest > ardavis> from your branch and my ibv_pingpong -e testing still > ardavis> blocks forever. > >Yes, it works for me. I just tried it again and it worked fine. 
> >What kind of system/HCA are you using? Does ibv_pingpong without -e >work for you? > > - R. > > > EM64T server with MT25208 (MT23108 compat mode), fw_ver 4.6.0, hw_rev A0 From halr at voltaire.com Thu Apr 7 09:50:15 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Apr 2005 12:50:15 -0400 Subject: [openib-general] Re: uverbs events In-Reply-To: <4255643D.30002@ichips.intel.com> References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> Message-ID: <1112892615.4877.18.camel@localhost.localdomain> On Thu, 2005-04-07 at 12:47, ardavis wrote: > EM64T server with MT25208 (MT23108 compat mode), fw_ver 4.6.0, hw_rev A0 Didn't 4.6.0 have a issue with CQ handling ? Can you try 4.6.2 ? -- Hal From roland at topspin.com Thu Apr 7 10:09:02 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 07 Apr 2005 10:09:02 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <1112892615.4877.18.camel@localhost.localdomain> (Hal Rosenstock's message of "07 Apr 2005 12:50:15 -0400") References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> <1112892615.4877.18.camel@localhost.localdomain> Message-ID: <52zmwa5xap.fsf@topspin.com> Hal> Didn't 4.6.0 have a issue with CQ handling ? Can you try 4.6.2 ? It's possible that firmware bug is the problem, but if it is I would expect the non-event mode to fail as well. - R. From mst at mellanox.co.il Thu Apr 7 11:55:32 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Thu, 7 Apr 2005 21:55:32 +0300
Subject: [openib-general] Re: Flashing Mellanox MT23108
In-Reply-To: <004a01c53b8c$45a4dde0$0211708d@gpv.az05.bull.com>
References: <16981.10681.472289.311124@gargle.gargle.HOWL>
	<004a01c53b8c$45a4dde0$0211708d@gpv.az05.bull.com>
Message-ID: <20050407185532.GC13172@mellanox.co.il>

Quoting r. Jerome Pioux :
> Subject: Re: Flashing Mellanox MT23108
>
> Hi Roland,
>
> I don't know about mstflint but I used flint in the past from IBGD so it may be
> the same?
> With flint, you need to use the raw image (bin) and not the Mellanox one (mlx).
> This is what I used:
>
> flint -d /dev/mst/mt23108_pci_cr2 -i /etc/ibfw/fw/fw-cougar-a1-3.3.2-jerome.bin
> b
> Image type: FailSafe
> Chip rev.: A1
> GUIDs: 0005ad0000016770 0005ad0000016771 0005ad0000016772 0005ad000100d050
> Board ID:
>
> Burn image with the following GUIDs:
> Node: 0005ad0000016770
> Port1: 0005ad0000016771
> Port2: 0005ad0000016772
> Sys.Image: 0005ad000100d050
> etc...
>
> I think that I created the bin image using infiniburn.
> I read fw-23108-a1-rel.mlx using infiniburn and write the bin image (raw
> format).
> But if you have infiniburn working, I think that you can burn the mlx image
> directly.
>
> Jerome
>
>
> ----- Original Message -----
>
> From: Roland Fehrenbacher
> To: openib-general at openib.org
> Sent: Thursday, April 07, 2005 5:38 AM
> Subject: [openib-general] Flashing Mellanox MT23108
>
> Hi,
>
> can anyone tell me how to flash a Mellanox MT23108 card with
> mstflint. When I try the firmware file fw-23108-a1-rel.mlx from
> Mellanox I get
>
> $ mstflint -d /proc/bus/pci/03/00.0 -i fw-23108-a1-rel.mlx b
> Not a valid image
>
> Roland
>

With mstflint you pass in the device location: -d 03:00.0
Otherwise it's the same.

--
MST - Michael S.
Tsirkin

From halr at voltaire.com Thu Apr 7 12:32:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Apr 2005 15:32:24 -0400
Subject: [openib-general] SM Bad Port Handling
Message-ID: <1112902344.4490.91.camel@localhost.localdomain>

Hi,

Below is a writeup on bad port handling by the SM. I would appreciate
any comments on this before I move on to the implementation. Thanks.

-- Hal

Problem Statement:

Currently, OpenSM issues (directed route) SubnGet for NodeInfo and
NodeDescription to any node it finds. It then requests PortInfo for
each port which is physically up.

There are scenarios where the port is physically up, but there is no
response to the SM get requests. In this case, the OpenSM keeps
retrying, never gives up, and doesn't service anything else in the
subnet (I'm not 100% positive on this last point).

Assumption:

The proposed solution assumes that the ignore GUIDs file option of
OpenSM only impacts the routing algorithm (path counting) and should not
be extended for bad port handling.

Proposed Solution:

The OpenSM will implement a configurable policy (some number of
consecutive lack of responses to SM requests). At the point of
exhaustion of the timeout/retry strategy, that port will be marked as
"bad" by OpenSM.

At this point, should it attempt to revive the port by bringing the
physical link down and back up ? Should it try this several times before
declaring the port as "bad" ? In any case, this is a refinement on the
basic strategy for dealing with this scenario.

Also, there could also be a periodic "ping" at a slower rate to check if
the "bad" ports revive.

A "bad" port per this scenario still maintains its LID and other state.
OpenSM will indicate a "bad" port detected via an internal port physical
state which it will set to down. The "real" port physical state will be
reflected accurately inside OpenSM.

Once a "bad" port is detected, it will no longer be polled and the
routing algorithm should be invoked to route around this.
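
As an illustration only, the policy proposed above (count consecutive
missed responses, mark the port "bad" once a configurable limit is
reached, then check it at a slower rate until it answers again) could be
sketched in Python as follows. The class name, field names and default
values are assumptions made for this sketch; none of them come from
OpenSM.

```python
from dataclasses import dataclass


@dataclass
class PortHealth:
    """Per-port bookkeeping for the proposed bad-port policy (hypothetical)."""
    max_misses: int = 4          # consecutive misses before "bad" (assumed)
    fast_interval: float = 1.0   # seconds between polls of a healthy port
    slow_interval: float = 30.0  # seconds between "pings" of a bad port
    misses: int = 0
    bad: bool = False

    def record(self, responded: bool) -> bool:
        """Record one SM request outcome.

        Returns True when the bad/good state changed, which is the point
        at which routing would need to be recomputed.
        """
        was_bad = self.bad
        if responded:
            self.misses = 0
            self.bad = False
        else:
            self.misses += 1
            if self.misses >= self.max_misses:
                self.bad = True
        return was_bad != self.bad

    def next_poll_delay(self) -> float:
        """Bad ports are only checked at the slower rate."""
        return self.slow_interval if self.bad else self.fast_interval


port = PortHealth(max_misses=3)
changes = [port.record(False) for _ in range(3)]
print(changes, port.bad, port.next_poll_delay())
```

The LID and other port state are left untouched here on purpose; only
the polling rate and the bad/good flag change, matching the statement
that a "bad" port keeps its LID.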
Is there a need to store these "bad" ports persistently (and ignore them
on startup) ?

From eitan at mellanox.co.il Thu Apr 7 13:02:39 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 7 Apr 2005 23:02:39 +0300
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com>

Hi Hal,

Please see my comments below.

Eitan Zahavi

> Problem Statement:
>
> Currently, OpenSM issues (directed route) SubnGet for NodeInfo and
> NodeDescription to any node it finds. It then requests PortInfo for
> each port which is physically up.
>
> There are scenarios where the port is physically up, but there is no
> response to the SM get requests. In this case, the OpenSM keeps
> retrying, never gives up, and doesn't service anything else in the
> subnet (I'm not 100% positive on this last point).

[EZ] I have never seen this! Are you sure about it? Are you sure we are
talking about gen1 ported to gen2? What will happen in the case of a
non-responding port is that OpenSM will retry the send (actually the
lower level does it) for the number of retries OpenSM is configured to
use (actually 4 times) and then ignore the port and everything behind
it. The reported topology (on stdout) will have the word UNKNOWN on the
remote side of the link this port connects to.

I will be happy to see a log file that shows what you claim happens, or
an explanation of how and where the code causes that. I have been
checking the way OpenSM handles irresponsive ports during the last two
weeks, and did not see such a case.

>
> Assumption:
>
> The proposed solution assumes that the ignore GUIDs file option of
> OpenSM only impacts the routing algorithm (path counting) and should not
> be extended for bad port handling.

[EZ] This is correct.

>
> Proposed Solution:
>
> The OpenSM will implement a configurable policy (some number of
> consecutive lack of responses to SM requests).
At the point of > exhaustion of the timeout/retry strategy, that port will be marked as > "bad" by OpenSM. [EZ] This is already the current behavior. Nothing should be done. > > At this point, should it attempt to revive the port by bringing the > physical link down and back up ? Should it try this several times before > declaring the port as "bad" ? In any case, this is a refinement on the > basic strategy for dealing with this scenario. > > Also, there could also be a periodic "ping" at a slower rate to check if > the "bad" ports revive. [EZ] This will be released in gen1 within 2 weeks or so. The enhancement to light sweep will include the irresponsive ports in the light sweep. Once they respond a new heavy sweep will be generated. > > A "bad" port per this scenario still maintains its LID and other state. > OpenSM will indicate a "bad" port detected via an internal port physical > state which it will set to down. The "real" port physical state will be > reflected accurately inside OpenSM. [EZ] It is better to use the "un-healthy" bit of the physical port - which OpenSM is already maintaining. > > Once a "bad" port is detected, it will no longer be polled and the > routing algorithm should be invoked to route around this. > > Is there a need to store these "bad" ports persistently (and ignore them > on startup) ? [EZ] No I do not think so. > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From iod00d at hp.com Thu Apr 7 13:12:41 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 7 Apr 2005 13:12:41 -0700
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <1112902344.4490.91.camel@localhost.localdomain>
References: <1112902344.4490.91.camel@localhost.localdomain>
Message-ID: <20050407201241.GJ32545@esmail.cup.hp.com>

On Thu, Apr 07, 2005 at 03:32:24PM -0400, Hal Rosenstock wrote:
...
> Assumption:
>
> The proposed solution assumes that the ignore GUIDs file option of
> OpenSM only impacts the routing algorithm (path counting) and should not
> be extended for bad port handling.
>
> Proposed Solution:
>
> The OpenSM will implement a configurable policy (some number of
> consecutive lack of responses to SM requests). At the point of
> exhaustion of the timeout/retry strategy, that port will be marked as
> "bad" by OpenSM.

Generally speaking, separating recovery "policy" from "detection" is a good thing.

...
> Is there a need to store these "bad" ports persistently (and ignore them
> on startup) ?

If opensm can see the physical link is ok, I would think it save any state. It's possible a system just hasn't loaded whatever SW is necessary to talk to the SM and might require operator intervention to kick that off (e.g. none of my systems auto-reboot unless I'm testing a specific customer environment).

I expect it's a separate policy on how long to save information after the physical link has been dropped - similar to DHCP.

grant

From halr at voltaire.com Thu Apr 7 13:11:02 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Apr 2005 16:11:02 -0400
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com>
Message-ID: <1112904662.4490.99.camel@localhost.localdomain>

On Thu, 2005-04-07 at 16:02, Eitan Zahavi wrote:
> Hi Hal,
>
> Please see my comments below.
> > Eitan Zahavi > > > Problem Statement: > > > > Currently, OpenSM issues (directed route) SubnGet for NodeInfo and > > NodeDescription to any node it finds. It then requests PortInfo for > > each port which is physically up. > > > > There are scenarios where the port is physically up, but there is no > > response to the SM get requests. In this case, the OpenSM keeps > > retrying, never gives up, and doesn't service anything else in the > > subnet (I'm not 100% positive on this last point). > [EZ] I have never seen this! Are you sure about it? Are you sure we > are talking about gen1 ported to gen2? > > What will happen in a case of non responding port is that OpenSM will > retry the send (actually the lower level does it) for the number of > retries OpenSM is configured to use (actually 4 times) and then ignore > the port and everything behind it. The reported topology (on stdout) > will have the word UNKNOWN on the remote side of the link this port > connects to. > > I will be happy to see a log file that shows what you claim happens. > Or even if you can explain to me how and where in the code causes > that. This was reported by Ron a while ago on this list. He sent log extracts of what was going on. It was around when I asked about the Anafa firmware issue with LFTTop. > I have been checking the way OpenSM handles irresponsive ports during > the the last two weeks, and did not see such case. Is this in both Gold 1.6.1 (OpenSM 1.7/1.7.1 ?) and Gold 1.7 (OpenSM 1.8) ? > > Assumption: > > > > The proposed solution assumes that the ignore GUIDs file option of > > OpenSM only impacts the routing algorithm (path counting) and should > not > > be extended for bad port handling. > [EZ] This is correct. > > > > Proposed Solution: > > > > The OpenSM will implement a configurable policy (some number of > > consecutive lack of responses to SM requests). At the point of > > exhaustion of the timeout/retry strategy, that port will be marked > as > > "bad" by OpenSM. 
> [EZ] This is already the current behavior. Nothing should be done. > > > > At this point, should it attempt to revive the port by bringing the > > physical link down and back up ? Should it try this several times > before > > declaring the port as "bad" ? In any case, this is a refinement on > the > > basic strategy for dealing with this scenario. > > > > Also, there could also be a periodic "ping" at a slower rate to > check if > > the "bad" ports revive. > [EZ] This will be released in gen1 within 2 weeks or so. What OpenSM release will this be ? > The enhancement to light sweep will include the irresponsive ports in > the light sweep. Once they respond a new heavy sweep will be > generated. > > > > > A "bad" port per this scenario still maintains its LID and other > state. > > OpenSM will indicate a "bad" port detected via an internal port > physical > > state which it will set to down. The "real" port physical state will > be > > reflected accurately inside OpenSM. > [EZ] It is better to use the "un-healthy" bit of the physical port - > which OpenSM is already maintaining. > > > > Once a "bad" port is detected, it will no longer be polled and the > > routing algorithm should be invoked to route around this. > > > > Is there a need to store these "bad" ports persistently (and ignore > them > > on startup) ? > [EZ] No I do not think so. Thanks. -- Hal From mlleinin at hpcn.ca.sandia.gov Thu Apr 7 13:14:46 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 07 Apr 2005 13:14:46 -0700 Subject: [openib-general] SM Bad Port Handling In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com> Message-ID: <1112904886.15180.202.camel@localhost> On Thu, 2005-04-07 at 23:02 +0300, Eitan Zahavi wrote: > > > > At this point, should it attempt to revive the port by bringing the > > physical link down and back up ? 
Should it try this several times > before > > declaring the port as "bad" ? In any case, this is a refinement on > the > > basic strategy for dealing with this scenario. > > > > Also, there could also be a periodic "ping" at a slower rate to > check if > > the "bad" ports revive. > [EZ] This will be released in gen1 within 2 weeks or so. The > enhancement to light sweep will include the irresponsive ports in the > light sweep. Once they respond a new heavy sweep will be generated. > Are you submitting these changes to gen2? If not, why not? - Matt From halr at voltaire.com Thu Apr 7 13:27:41 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Apr 2005 16:27:41 -0400 Subject: [openib-general] SM Bad Port Handling In-Reply-To: <20050407201241.GJ32545@esmail.cup.hp.com> References: <1112902344.4490.91.camel@localhost.localdomain> <20050407201241.GJ32545@esmail.cup.hp.com> Message-ID: <1112905460.4490.109.camel@localhost.localdomain> On Thu, 2005-04-07 at 16:12, Grant Grundler wrote: > > Is there a need to store these "bad" ports persistently (and ignore them > > on startup) ? > > If opensm can see the physical link is ok, I would think it save > any state. It's possible a system just hasn't loaded whatever > SW is necessary to talk to the SM and might require operator > intervention to kick that off (e.g. none of my systems auto-reboot > unless I'm testing a specific customer environment). Yes, I think there is also a partial boot up case where physical link can be up but the node won't respond to SM MADs. Still, I'm not sure why this would need to be saved persistently by the SM. It seems like a transient state that would be detected and if it goes away that should be detected too. The only issue being that the detection of the recovery might be longer. 
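To make the policy under discussion concrete, here is a minimal sketch of "mark a port bad after N consecutive missed responses, then keep pinging it at a slower rate so recovery is still detected". It is an illustration only and is not taken from OpenSM: the structure names, field names, and the idea of a per-port next-poll timestamp are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy knobs; none of these names come from OpenSM. */
struct bad_port_policy {
    unsigned max_consecutive_timeouts; /* mark "bad" after this many misses */
    uint64_t normal_poll_interval_ms;  /* polling period for healthy ports */
    uint64_t bad_poll_interval_ms;     /* slower "ping" period for bad ports */
};

/* Hypothetical per-port bookkeeping kept by the SM. */
struct port_health {
    unsigned consecutive_timeouts;
    bool bad;
    uint64_t next_poll_ms;
};

/*
 * Record the outcome of one SM request to a port.
 * Returns true when the port's good/bad state changed, which is the
 * point at which a caller might decide to recompute routing.
 */
static bool port_health_update(struct port_health *ph,
                               const struct bad_port_policy *pol,
                               bool responded, uint64_t now_ms)
{
    bool was_bad = ph->bad;

    if (responded) {
        /* Any response clears the history: the state is transient. */
        ph->consecutive_timeouts = 0;
        ph->bad = false;
    } else {
        if (ph->consecutive_timeouts < pol->max_consecutive_timeouts)
            ph->consecutive_timeouts++;
        if (ph->consecutive_timeouts >= pol->max_consecutive_timeouts)
            ph->bad = true;
    }

    /* Bad ports are still probed, only less often. */
    ph->next_poll_ms = now_ms + (ph->bad ? pol->bad_poll_interval_ms
                                         : pol->normal_poll_interval_ms);
    return ph->bad != was_bad;
}
```

With a threshold of 3 and intervals of 1000 ms and 10000 ms, the third consecutive miss flips the port to bad and schedules the next probe 10 seconds out; the first response after that flips it back. Nothing here is stored persistently, which matches the view that this is a transient state whose only cost is a longer recovery-detection time.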
-- Hal From roland at topspin.com Thu Apr 7 14:42:49 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 07 Apr 2005 14:42:49 -0700 Subject: [openib-general] [ANNOUNCE] Userspace verbs/roland-uverbs branch merged to trunk Message-ID: <52k6ne5kme.fsf@topspin.com> I've just finished merging the userspace verbs support from the roland-uverbs branch to the main trunk (https://openib.org/svn/gen2/trunk/). For now, all userspace verbs development will be on the main trunk, so I would suggest that everyone switch from using the roland-uverbs branch to the trunk as soon as convenient. Thanks, Roland From greg at kroah.com Thu Apr 7 17:10:12 2005 From: greg at kroah.com (Greg KH) Date: Thu, 7 Apr 2005 17:10:12 -0700 Subject: [openib-general] Re: [PATCH][26.5/27] Add MT25204 PCI IDs In-Reply-To: <52hdiqi22t.fsf@topspin.com> References: <2005411249.RHQWyM8AFcqb1PM4@topspin.com> <52hdiqi22t.fsf@topspin.com> Message-ID: <20050408001011.GB7010@kroah.com> On Fri, Apr 01, 2005 at 02:06:50PM -0800, Roland Dreier wrote: > Ugh, this patch is required to build support for the new Mellanox > HCAs. Greg K-H applied it to his tree a while ago but it hasn't made > it to Linus yet. > > Sorry, > Roland > > Add PCI device IDs for new Mellanox MT25204 "Sinai" InfiniHost III Lx HCA. > > Signed-off-by: Roland Dreier Already in 2.6.12-rc2. thanks, greg k-h From rf at q-leap.de Thu Apr 7 23:25:46 2005 From: rf at q-leap.de (Roland Fehrenbacher) Date: Fri, 8 Apr 2005 08:25:46 +0200 Subject: [openib-general] Re: Flashing Mellanox MT23108 In-Reply-To: <20050407185532.GC13172@mellanox.co.il> References: <16981.10681.472289.311124@gargle.gargle.HOWL> <004a01c53b8c$45a4dde0$0211708d@gpv.az05.bull.com> <20050407185532.GC13172@mellanox.co.il> Message-ID: <16982.9194.487907.278262@gargle.gargle.HOWL> >>>>> "Michael" == Michael S Tsirkin writes: Michael> Quoting r. Jerome Pioux : >> I don't know about mstflint but I used flint in the past from >> IBGD so it may be the same? 
With flint, you need to use the >> raw image (bin) and not the Mellanox one (mlx). This is what I >> used: Hi Jerome, thanks for your help. I knew 'Mellanox flint'. The disadvantage of it is that it needs all the drivers loaded, while flint goes directly on the PCI device. >> flint -d /dev/mst/mt23108_pci_cr2 -i >> /etc/ibfw/fw/fw-cougar-a1-3.3.2-jerome.bin b Image type: >> FailSafe Chip rev.: A1 GUIDs: 0005ad0000016770 0005ad0000016771 >> 0005ad0000016772 0005ad000100d050 Board ID: ­ >> >> Burn image with the following GUIDs: Node: 0005ad0000016770 >> Port1: 0005ad0000016771 Port2: 0005ad0000016772 Sys.Image: >> 0005ad000100d050 etc... >> >> I think that I created the bin image using infiniburn. I read >> fw-23108-a1-rel.mlx using infiniburn and write the bin image >> (raw format). But if you have infiniburn working, I think that >> you can burn the mlx image directly. >> ----- Original Message ----- >> >> From: Roland Fehrenbacher To: openib-general at openib.org Sent: >> Thursday, April 07, 2005 5:38 AM Subject: [openib-general] >> Flashing Mellanox MT23108 >> >> Hi, >> >> can anyone tell me how to flash a Mellanox MT23108 card with >> mstflint. When I try the firmware file fw-23108-a1-rel.mlx from >> Mellanox I get >> >> $ mstflint -d /proc/bus/pci/03/00.0 -i fw-23108-a1-rel.mlx b >> Not a valid image Michael> With mstflint you pass in the device location: -d 03:00.0 Michael> Otherwise its the same. Unfortunately, now even with the raw image prepared by using infiniburn from fw-23108-a1-rel.mlx and the correct .brd, I get mstflint -d /proc/bus/pci/03/00.0 -i fw-23108.bin b /0x00030028/ (BOOT2) - read error (Address (0x3002c) is out of image limits ) Not a valid image Roland From mst at mellanox.co.il Fri Apr 8 02:32:07 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Fri, 8 Apr 2005 12:32:07 +0300 Subject: [openib-general] Re: Flashing Mellanox MT23108 In-Reply-To: <16982.9194.487907.278262@gargle.gargle.HOWL> References: <16981.10681.472289.311124@gargle.gargle.HOWL> <004a01c53b8c$45a4dde0$0211708d@gpv.az05.bull.com> <20050407185532.GC13172@mellanox.co.il> <16982.9194.487907.278262@gargle.gargle.HOWL> Message-ID: <20050408093207.GA21709@mellanox.co.il> Quoting r. Roland Fehrenbacher : > Subject: Re: Flashing Mellanox MT23108 > > >>>>> "Michael" == Michael S Tsirkin writes: > > Michael> Quoting r. Jerome Pioux : > > >> I don't know about mstflint but I used flint in the past from > >> IBGD so it may be the same? With flint, you need to use the > >> raw image (bin) and not the Mellanox one (mlx). This is what I > >> used: > > Hi Jerome, > > thanks for your help. I knew 'Mellanox flint'. The disadvantage of it > is that it needs all the drivers loaded, while flint goes directly on > the PCI device. > > >> flint -d /dev/mst/mt23108_pci_cr2 -i > >> /etc/ibfw/fw/fw-cougar-a1-3.3.2-jerome.bin b Image type: > >> FailSafe Chip rev.: A1 GUIDs: 0005ad0000016770 0005ad0000016771 > >> 0005ad0000016772 0005ad000100d050 Board ID: ­ > >> > >> Burn image with the following GUIDs: Node: 0005ad0000016770 > >> Port1: 0005ad0000016771 Port2: 0005ad0000016772 Sys.Image: > >> 0005ad000100d050 etc... > >> > >> I think that I created the bin image using infiniburn. I read > >> fw-23108-a1-rel.mlx using infiniburn and write the bin image > >> (raw format). But if you have infiniburn working, I think that > >> you can burn the mlx image directly. > > >> ----- Original Message ----- > >> > >> From: Roland Fehrenbacher To: openib-general at openib.org Sent: > >> Thursday, April 07, 2005 5:38 AM Subject: [openib-general] > >> Flashing Mellanox MT23108 > >> > >> Hi, > >> > >> can anyone tell me how to flash a Mellanox MT23108 card with > >> mstflint. 
When I try the firmware file fw-23108-a1-rel.mlx from > >> Mellanox I get > >> > >> $ mstflint -d /proc/bus/pci/03/00.0 -i fw-23108-a1-rel.mlx b > >> Not a valid image > > Michael> With mstflint you pass in the device location: -d 03:00.0 > Michael> Otherwise its the same. > > Unfortunately, now even with the raw image prepared by using infiniburn > from fw-23108-a1-rel.mlx and the correct .brd, I get > > mstflint -d /proc/bus/pci/03/00.0 -i fw-23108.bin b > /0x00030028/ (BOOT2) - read error (Address (0x3002c) is out of image limits > ) > Not a valid image > > Roland > I'll check this on Sunday. What does mstflint -d 03:00.0 v show? -- MST - Michael S. Tsirkin From rf at q-leap.de Fri Apr 8 02:36:39 2005 From: rf at q-leap.de (Roland Fehrenbacher) Date: Fri, 8 Apr 2005 11:36:39 +0200 Subject: [openib-general] Re: Flashing Mellanox MT23108 In-Reply-To: <20050408093207.GA21709@mellanox.co.il> References: <16981.10681.472289.311124@gargle.gargle.HOWL> <004a01c53b8c$45a4dde0$0211708d@gpv.az05.bull.com> <20050407185532.GC13172@mellanox.co.il> <16982.9194.487907.278262@gargle.gargle.HOWL> <20050408093207.GA21709@mellanox.co.il> Message-ID: <16982.20647.399413.653949@gargle.gargle.HOWL> >>>>> "Michael" == Michael S Tsirkin writes: Michael> With mstflint you pass in the device location: -d 03:00.0 Michael> Otherwise its the same. >> Unfortunately, now even with the raw image prepared by using >> infiniburn from fw-23108-a1-rel.mlx and the correct .brd, I get >> >> mstflint -d /proc/bus/pci/03/00.0 -i fw-23108.bin b >> /0x00030028/ (BOOT2) - read error (Address (0x3002c) is out of >> image limits ) Not a valid image Michael> I'll check this on Sunday. What does mstflint -d 03:00.0 Michael> v show? 
I have mstflint -d 03:00.0 v Failsafe image: Invariant /0x00000028-0x000006f7 (0x0006d0)/ (BOOT2) - OK Primary Image /0x00010000-0x00010107 (0x000108)/ (Pointer Sector)- OK /0x00030028-0x00030b3b (0x000b14)/ (BOOT2) - OK /0x00030b3c-0x00034aa7 (0x003f6c)/ (BOOT2) - OK /0x00034aa8-0x000375f3 (0x002b4c)/ (Configuration) - OK /0x000375f4-0x00037627 (0x000034)/ (GUID) - OK /0x00037628-0x00044c83 (0x00d65c)/ (DDR) - OK /0x00044c84-0x0004d30f (0x00868c)/ (DDR) - OK /0x0004d310-0x00061c03 (0x0148f4)/ (DDR) - OK /0x00061c04-0x0006d80b (0x00bc08)/ (DDR) - OK /0x0006d80c-0x0007099f (0x003194)/ (DDR) - OK /0x000709a0-0x0007a5af (0x009c10)/ (DDR) - OK /0x0007a5b0-0x0007a707 (0x000158)/ (Configuration) - OK /0x0007a708-0x0007a74b (0x000044)/ (Jump addresses) - OK /0x0007a74c-0x000841cb (0x009a80)/ (EMT Service) - OK Secondary Image /0x00020000-0x00020107 (0x000108)/ (Pointer Sector)- OK /0x00090028-0x00090b3b (0x000b14)/ (BOOT2) - OK /0x00090b3c-0x00094aa7 (0x003f6c)/ (BOOT2) - OK /0x00094aa8-0x000975f3 (0x002b4c)/ (Configuration) - OK /0x000975f4-0x00097627 (0x000034)/ (GUID) - OK /0x00097628-0x000a4c83 (0x00d65c)/ (DDR) - OK /0x000a4c84-0x000ad30f (0x00868c)/ (DDR) - OK /0x000ad310-0x000c1c03 (0x0148f4)/ (DDR) - OK /0x000c1c04-0x000cd80b (0x00bc08)/ (DDR) - OK /0x000cd80c-0x000d099f (0x003194)/ (DDR) - OK /0x000d09a0-0x000da5af (0x009c10)/ (DDR) - OK /0x000da5b0-0x000da707 (0x000158)/ (Configuration) - OK /0x000da708-0x000da74b (0x000044)/ (Jump addresses) - OK /0x000da74c-0x000e41cb (0x009a80)/ (EMT Service) - OK Roland From halr at voltaire.com Fri Apr 8 07:47:49 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Apr 2005 10:47:49 -0400 Subject: [openib-general] A Couple More CM Queries Message-ID: <1112971669.4522.147.camel@localhost.localdomain> Hi Sean, I have a couple more questions about the CM: 1. cm_alloc_id does an idr_get_new_above starting at 1. 
Might this be better saving the highest value and starting there so connection IDs are less likely to repeat as soon ? 2. Should ib_create_cm_id check return an error if cm_handler == NULL just to make sure ? Thanks. -- Hal From eitan at mellanox.co.il Fri Apr 8 08:11:08 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 8 Apr 2005 18:11:08 +0300 Subject: [openib-general] SM Bad Port Handling Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0BB@mtlex01.yok.mtl.com> Hi Mat, > Are you submitting these changes to gen2? If not, why not? > [EZ] Mellanox is focused on improving the gen1 stack while contributing everything it builds to the community. OpenSM from gen1 is ported (merged) into gen2 tree by Hal and Shahar from Voltaire. I will be publishing the latest changes to OpenSM once they pass minimal QA. They will be posted in: https://openib.org/svn/gen1/trunk/src/userspace/osm I will publish the list of changes from previous version in a separate mail -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Fri Apr 8 08:23:44 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Apr 2005 11:23:44 -0400 Subject: [openib-general] SM Bad Port Handling In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com> Message-ID: <1112973823.4522.150.camel@localhost.localdomain> Hi Eitan, On Thu, 2005-04-07 at 16:02, Eitan Zahavi wrote: > [EZ] It is better to use the "un-healthy" bit of the physical port - > which OpenSM is already maintaining. What is the name of this bit and in what structure does it appear ? Thanks. 
-- Hal

From eitan at mellanox.co.il Fri Apr 8 08:55:39 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 8 Apr 2005 18:55:39 +0300
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0BC@mtlex01.yok.mtl.com>

Hi Hal,

This is a physical port attribute so the file is osm_port.h and the structure is osm_physp_t.
From the doc on the structure:
*
* healthy
*    Tracks the health of the port. Normally should be TRUE but
*    might change as a result of incoming traps indicating the port
*    healthy is questionable.
*

I have been trying my best to find how it can happen that a port that does not respond will cause OpenSM to continuously poll it. This can not happen so unless you can explain how it happens please do not contaminate the code with un-needed code.

The only thing that comes to mind is the case of failure to "Set" some attributes of devices in the fabric. This happens only after discovery is completed and only after the validity of the data base is verified (i.e. each node has ports, each port has a node ...)

In that case of failure to set some attributes (aka LFT, PortInfo etc) OpenSM will output a clear error message: "Errors in intialization" and will restart a full sweep of the fabric.

This is the only way one gets infinite polling on the entire subnet.

In general, it might make sense to try and improve how OpenSM qualifies each fabric port for the statistics of the number of packet drops versus good packets it passed through. Note this is complex due to the fact a port might affect packets that go through it. And there is no way to know on which hop on the path the packet was dropped.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O.
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Friday, April 08, 2005 6:24 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: RE: [openib-general] SM Bad Port Handling > > Hi Eitan, > > On Thu, 2005-04-07 at 16:02, Eitan Zahavi wrote: > > [EZ] It is better to use the "un-healthy" bit of the physical port - > > which OpenSM is already maintaining. > > What is the name of this bit and in what structure does it appear ? > > Thanks. > > -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Fri Apr 8 09:34:24 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 08 Apr 2005 09:34:24 -0700 Subject: [openib-general] Re: A Couple More CM Queries In-Reply-To: <1112971669.4522.147.camel@localhost.localdomain> References: <1112971669.4522.147.camel@localhost.localdomain> Message-ID: <4256B290.7020704@ichips.intel.com> Hal Rosenstock wrote: > 1. cm_alloc_id does an idr_get_new_above starting at 1. Might this be > better saving the highest value and starting there so connection IDs are > less likely to repeat as soon ? I _think_ this would result in the IDR tables growing to their maximum size, which seems worse than repeating the IDs immediately after their timewait expires. > 2. Should ib_create_cm_id check return an error if cm_handler == NULL > just to make sure ? Personally, I don't think it's worth this check for kernel clients, unless we want to start checking for NULL parameters everywhere. While on the CM, I did look at the issue of calling the API out of order that you had pointed out before (which could result in accessing a NULL port pointer). I'm not convinced that a simple check for a NULL port pointer covers all potential problems. For example, I'm not sure how well the codebase will handle the dynamic removal of a device while users are attempting to access the device. 
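As a rough illustration of the trade-off in question 1, the sketch below contrasts the two allocation orders over a tiny ID space: "lowest free ID at or above a start value" (which is how idr_get_new_above is described in this thread) versus "search upward from the last ID handed out". This is a userspace toy with a flat array, not the kernel idr; the names id_pool, alloc_lowest, alloc_cyclic and free_id are made up for the example, and because it does not model the idr's internal table layers it cannot show the table growth mentioned above.

```c
#include <stdbool.h>

#define ID_SPACE 8  /* toy ID space: valid IDs are 1..ID_SPACE-1 */

/* Hypothetical flat allocator state; not the kernel idr. */
struct id_pool {
    bool used[ID_SPACE];
    int last;  /* most recently allocated ID (0 = none yet) */
};

/* Return the lowest free ID >= start, or -1 if the space is full. */
static int alloc_lowest(struct id_pool *p, int start)
{
    for (int id = start; id < ID_SPACE; id++) {
        if (!p->used[id]) {
            p->used[id] = true;
            p->last = id;
            return id;
        }
    }
    return -1;
}

/* Search upward from last+1, wrapping back to 1; -1 if the space is full. */
static int alloc_cyclic(struct id_pool *p)
{
    for (int i = 0; i < ID_SPACE - 1; i++) {
        int id = 1 + (p->last + i) % (ID_SPACE - 1);
        if (!p->used[id]) {
            p->used[id] = true;
            p->last = id;
            return id;
        }
    }
    return -1;
}

static void free_id(struct id_pool *p, int id)
{
    p->used[id] = false;
}
```

Allocating IDs 1 and 2, freeing 1, and allocating again gives 1 back immediately with alloc_lowest, but 3 with alloc_cyclic, so the cyclic order delays reuse at the cost of walking through the whole ID range over time.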
- Sean

From eitan at mellanox.co.il Fri Apr 8 09:40:09 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 8 Apr 2005 19:40:09 +0300
Subject: [openib-general] OpenSM work
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0BE@mtlex01.yok.mtl.com>

Hi All,

FYI: Mellanox has been focusing on the following items in OpenSM development for the last few weeks:

1. Stability testing over the IB management simulator:
   a. Randomly pick bad links with high packet drop statistics – success is SUBNET UP
   b. Route using up/down algorithm – success is no credit loops
2. Semi-static LID assignment:
   a. Developed an interface for persistent storage of arbitrary data. The goal is to enable further development of LDAP (ala Troy’s request) or SQL module. Please see osm_db.h attached <>
   b. Developed file based implementation for osm_db.h
   c. Modify osm_lid_mgr (lid assignment algorithm) to use the LIDs stored in the persistent storage. Handle all cases of bad file and new LIDs on the fabric. The -r flag now lets OpenSM overwrite the known data. Persistent Guid to LIDs data is kept even if the GUID disappears for a while. The code also handles LID assignment for LMC > 0 in a way better than the previous algorithm: It used to assign 2^LMC LIDs for every port – even for switch port 0. Now it will only preserve 1 LID for switch port 0.
3. Irresponsive port:
   a. The phenomenon is: A port does not respond to the SM during the discovery stage. OpenSM can not obtain enough data about the port and thus it does not appear in the final database. Since OpenSM uses light sweeps when there is no “change detected” it will not query the port until either a switch sets its “change bit” or sends a trap. So that irresponsive port will never be polled again if there is no heavy sweep.
   b. The solution:
      i. During discovery track ports (physical ports) that have their logical link state != DOWN but the port on the other side of the link is not known to the SM.
      ii. During light sweep: not only scan the switches' “change bit” but also test to see if the port on the other side of these ports (from i) is responding. If it does – issue a heavy sweep.
4. Head of Queue Life:
   a. Problem: In cases of PCI hardware failure HCAs can not complete RDMA requests and lose all credits from their input ports (in other words: their input buffers are filled). So they create back pressure on the fabric.
   b. Solution: use a fast head of queue time limit on every switch port that drives an HCA.
5. SA queries stress testing:
   a. We are exploring max performance of the SA and ways to improve it.

Eitan

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

-------------- next part --------------
A non-text attachment was scrubbed...
Name: osm_db.h
Type: application/octet-stream
Size: 11514 bytes
Desc: not available
URL:

From halr at voltaire.com Fri Apr 8 12:06:44 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Apr 2005 15:06:44 -0400
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0BC@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF0BC@mtlex01.yok.mtl.com>
Message-ID: <1112987204.4903.19.camel@localhost.localdomain>

On Fri, 2005-04-08 at 11:55, Eitan Zahavi wrote:
> Hi Hal,
>
> This is a physical port attribute so the file is osm_port.h and the
> structure is osm_physp_t.
> From the doc on the structure:
> *
> * healthy
> * Tracks the health of the port. Normally should be TRUE but
> * might change as a result of incoming traps indicating the port
> * healthy is questionable.
> *

Yup. It's definitely there in the gen2 code base.

> I have been trying my best to find how it can happen that a port that
> does not respond will cause OpenSM to continuously poll it.
This can > not happen so unless you can explain how it happens please do not > contaminate the code with un-needed code. This part of the code has not been touched. I've put all meaningful patches and ideas on how things might change out on this list. > The only thing that comes to mind it the case of failure to "Set" some > attributes of devices in the fabric. I dug out Ron's emails on this. I don't think that was what was going on. It was a SubnGet(NodeInfo) which failed. See http://openib.org/pipermail/openib-general/2005-February/009125.html for more details. -- Hal > This happens only after discovery is completed and only after the > validity of the data base is verified (i.e. each node has ports, each > port have a node ...) > > In that case of failure to set some attributes (aka LFT, PortInfo etc) > OpenSM will output a clear error message: "Errors in intialization" > and will restart a full sweep of the fabric. > > This is the only way one get an infinite polling on the entire subnet. > In general, it might make sense to try and improve how OpenSM > qualifies each fabric port for the statistics of the number of packet > drops versus good packets it passed through. Note this is complex due > to the fact a port might affect packets that goes through it. And > there is no way to know on which hop on the path the packet was > dropped. From halr at voltaire.com Fri Apr 8 12:37:32 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Apr 2005 15:37:32 -0400 Subject: [openib-general] SDP Performance Message-ID: <1112988342.4546.12.camel@localhost.localdomain> Hi Libor, A couple of questions about SDP performance: 1. When running the AIO version of TTCP, there appears to be a bandwidth degradation when using buffer sizes from about 5K to 13K. Do you see this too ? If so, is there an explanation for this ? 2. Also, is there a program you use to measure SDP latency ? Thanks. 
-- Hal From eitan at mellanox.co.il Fri Apr 8 13:05:57 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 8 Apr 2005 23:05:57 +0300 Subject: [openib-general] SM Bad Port Handling Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0BF@mtlex01.yok.mtl.com> Hi Hal, I have looked up the mail thread. I could not find a log file indicating there was a repetitive query of the bad port. I know the code and I can not find a reason for it to do a repetitive port. Can you explain how this can happen? Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Friday, April 08, 2005 10:07 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: RE: [openib-general] SM Bad Port Handling > > On Fri, 2005-04-08 at 11:55, Eitan Zahavi wrote: > > Hi Hal, > > > > This is a physical port attribute so the file is osm_port.h and the > > structure is osm_physp_t. > > From the doc on the structure: > > * > > * healthy > > * Tracks the health of the port. Normally should be TRUE but > > * might change as a result of incoming traps indicating the port > > * healthy is questionable. > > * > > Yup. It's definitely there in the gen2 code base. > > > I have been trying my best to find how it can happen that a port that > > does not respond will cause OpenSM to continuously poll it. This can > > not happen so unless you can explain how it happens please do not > > contaminate the code with un-needed code. > > This part of the code has not been touched. I've put all meaningful > patches and ideas on how things might change out on this list. > > > The only thing that comes to mind it the case of failure to "Set" some > > attributes of devices in the fabric. > > I dug out Ron's emails on this. I don't think that was what was going > on. It was a SubnGet(NodeInfo) which failed. 
See > http://openib.org/pipermail/openib-general/2005-February/009125.html > for more details. > > -- Hal > > > This happens only after discovery is completed and only after the > > validity of the data base is verified (i.e. each node has ports, each > > port have a node ...) > > > > In that case of failure to set some attributes (aka LFT, PortInfo etc) > > OpenSM will output a clear error message: "Errors in intialization" > > and will restart a full sweep of the fabric. > > > > This is the only way one get an infinite polling on the entire subnet. > > > In general, it might make sense to try and improve how OpenSM > > qualifies each fabric port for the statistics of the number of packet > > drops versus good packets it passed through. Note this is complex due > > to the fact a port might affect packets that goes through it. And > > there is no way to know on which hop on the path the packet was > > dropped. -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Fri Apr 8 13:16:34 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Apr 2005 16:16:34 -0400 Subject: [openib-general] SM Bad Port Handling In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0BF@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6047EF0BF@mtlex01.yok.mtl.com> Message-ID: <1112991393.4490.3.camel@localhost.localdomain> On Fri, 2005-04-08 at 16:05, Eitan Zahavi wrote: > Hi Hal, > > I have looked up the mail thread. > I could not find a log file indicating there was a repetitive query of > the bad port. That was what started this thread: I believe that was what Ron reported at the time which was back at the end of February. Perhaps there is insufficient log to back that up. > I know the code and I can not find a reason for it to do a repetitive > port. > > Can you explain how this can happen? I haven't looked at the code. I can't explain it. I don't even know for sure whether that was what was going on. 
I will go back through the thread again.

-- Hal

From iod00d at hp.com Fri Apr 8 17:34:48 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 8 Apr 2005 17:34:48 -0700
Subject: [openib-general] ia64 perf and FMR
In-Reply-To: <52hdimds6e.fsf@topspin.com>
References: <20050402024048.GN11094@esmail.cup.hp.com> <20050404055131.GA19409@esmail.cup.hp.com> <52hdimds6e.fsf@topspin.com>
Message-ID: <20050409003448.GI3844@esmail.cup.hp.com>

On Mon, Apr 04, 2005 at 04:43:21PM -0700, Roland Dreier wrote:
> A binary search to find the changeset that makes the difference would
> be really useful. I read through the svn log from r2046 through r2082
> and I don't see anything that should make a difference to IPoIB.

I've worked backwards and didn't see any changes with netperf TCP_STREAM.
I can't establish why the performance is substantially different
with r2050 compared to before:

SVN Rev        Best     Worst
r2104-MSIX     3609     2623
r2081-MSIX     3580     2639
r2062-MSIX     3609     2618
r2054-MSIX     3602     2635
r2050-MSIX     3598     2636
r2050-IRQ      3594     2433
r2050-orig     1738(*)

Numbers are in Mbits/s. 3600 is ~450 MB/s. 2600 is ~325 MB/s.
Differences between Best/Worst are caused by binding netperf
and netserver to the same (or different) CPU as the one handling
interrupts. See "-T" in netperf 2.4.0-rc1 "experimental" release.

(*) I didn't use -T with netperf to explore IRQ assignment.

I think differences in "Best" column are not significant.
(Except for the r2050-orig number of course...)

...
> So I wonder what obvious thing I'm missing...

I'm using gcc-3.3.5 (Debian 1:3.3.5-8) now and may have used gcc 3.4
or a slightly older gcc-3.3. I might be confusing gcc version with
other work I've done too. I'm of course kicking myself for being
sloppy and not tracking that precisely.

I'm also using a different version of netperf (2.4.0-rc1). But I don't
expect substantial changes in TCP_STREAM implementation that might
cause 2x difference in performance. TCP_STREAM test is "mature" code.
My best/worst theory right now is the TS90 switch was in a semi-comatose
state when I collected the perf numbers in January. When I tried to
collect perf data in March, the TS90 switch was non-responsive at the
serial console and a power cycle took care of that. I hadn't cycled
power on the switch since installing it in June 2004 or so.

thanks,
grant

From roland at topspin.com Sat Apr 9 13:24:20 2005
From: roland at topspin.com (Roland Dreier)
Date: Sat, 09 Apr 2005 13:24:20 -0700
Subject: [openib-general] [PATCH] SEND_INLINE support in libmthca
In-Reply-To: <20050405074213.GC15034@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 5 Apr 2005 10:42:13 +0300")
References: <20050404150235.GZ15034@mellanox.co.il> <52is32feq2.fsf@topspin.com> <20050405074213.GC15034@mellanox.co.il>
Message-ID: <528y3r3dhn.fsf@topspin.com>

Thanks, applied.

From mst at mellanox.co.il Sun Apr 10 01:47:25 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Apr 2005 11:47:25 +0300
Subject: [openib-general] [PATCH] uverbs with static libraries
Message-ID: <20050410084724.GZ20567@mellanox.co.il>

Hello, Roland!
I'd like to get userspace verbs working with static libraries.
My motivation is currently enabling our code coverage tools which
only work well with static libraries, but I expect there to be other uses.

The following patch makes it possible to link libmthca directly
into the main executable.

Signed-off-by: Michael S. Tsirkin

Index: libibverbs/src/init.c
===================================================================
--- libibverbs/src/init.c (revision 2104)
+++ libibverbs/src/init.c (working copy)
@@ -105,6 +105,8 @@
 		return;
 	}
 
+	load_driver(NULL);
+
 	for (i = 0; i < so_glob.gl_pathc; ++i)
 		load_driver(so_glob.gl_pathv[i]);
 }

--
MST - Michael S.
Tsirkin

From hozer at hozed.org Mon Apr 11 07:22:13 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Mon, 11 Apr 2005 09:22:13 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <200544159.Ahk9l0puXy39U6u6@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
Message-ID: <20050411142213.GC26127@kalmia.hozed.org>

> In particular, the memory pinning code in uverbs_mem.c could stand
> a looking over. In addition, a sanity check of the write()-based
> scheme for passing commands into the kernel in uverbs_main.c and
> uverbs_cmd.c is probably worthwhile.

How is memory pinning handled? (I haven't had time to read all the
code, so please excuse my ignorance of something obvious).

The old Mellanox drivers used to have a hack to call 'sys_mlock', and
promiscuously lock memory any old userspace application asked for. What
is the API for the new uverbs memory registration, and how will things
like memory hotplug and NUMA page migration be able to unpin pages
locked by a user program?

I have applications that would benefit from being able to register 15GB
of memory on a machine with 16GB. Right now, MPI and other possible
users of InfiniBand in userspace have to play caching games and limit
what they can register. But locking all that memory without providing
the kernel a way to unlock it under memory pressure or for page
migration seems like a bad idea.

From roland at topspin.com Mon Apr 11 08:34:19 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 08:34:19 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <20050411142213.GC26127@kalmia.hozed.org> (Troy Benjegerdes's message of "Mon, 11 Apr 2005 09:22:13 -0500")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org>
Message-ID: <52mzs51g5g.fsf@topspin.com>

    Troy> How is memory pinning handled?
(I haven't had time to read Troy> all the code, so please excuse my ignorance of something Troy> obvious). The userspace library calls mlock() and then the kernel does get_user_pages(). Troy> The old mellanox drivers used to have a hack to call Troy> 'sys_mlock', and promiscuously lock memory any old userspace Troy> application asked for. What is the API for the new uverbs Troy> memory registration, and how will things like memory hotplug Troy> and NUMA page migration be able to unpin pages locked by a Troy> user program? The API for uverbs memory registration is ibv_reg_mr(), and right now the memory is pinned and that's it. - R. From roland at topspin.com Mon Apr 11 08:36:37 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 11 Apr 2005 08:36:37 -0700 Subject: [openib-general] Re: [PATCH] uverbs with static libraries In-Reply-To: <20050410084724.GZ20567@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 10 Apr 2005 11:47:25 +0300") References: <20050410084724.GZ20567@mellanox.co.il> Message-ID: <52is2t1g1m.fsf@topspin.com> Michael> I'd like to get userspace verbs working with static Michael> libraries. My motivation is currently enabling our code Michael> coverage tools which only work well with static Michael> libraries, but I expect there to be other uses. Looks reasonable. With this, do you then do --enable-static when configuring libmthca or is there anything else required? - R. From mst at mellanox.co.il Mon Apr 11 09:20:42 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 11 Apr 2005 19:20:42 +0300 Subject: [openib-general] Re: [PATCH] uverbs with static libraries In-Reply-To: <52is2t1g1m.fsf@topspin.com> References: <20050410084724.GZ20567@mellanox.co.il> <52is2t1g1m.fsf@topspin.com> Message-ID: <20050411162042.GQ2477@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] uverbs with static libraries > > Michael> I'd like to get userspace verbs working with static > Michael> libraries. 
My motivation is currently enabling our code
> Michael> coverage tools which only work well with static
> Michael> libraries, but I expect there to be other uses.
>
> Looks reasonable. With this, do you then do --enable-static when
> configuring libmthca or is there anything else required?
>
> - R.
>

A small patch in makefile seems to be required, I'll send that
separately after I clean it up.

--
MST - Michael S. Tsirkin

From rf at q-leap.de Mon Apr 11 09:28:20 2005
From: rf at q-leap.de (Roland Fehrenbacher)
Date: Mon, 11 Apr 2005 18:28:20 +0200
Subject: [openib-general] OpenSM (again)
Message-ID: <16986.42404.540439.952094@gargle.gargle.HOWL>

Hi,

I got gen2 opensm running fine now (there was a problem with a wrong
include file), and managed to get IP running on a network of
currently 40 machines (final size will be 144). Performance is pretty
impressive (initial tests with a simple netpipe): I got a latency of
18microsec, and a maximum throughput of approx. 400MB/sec at packet
size approx. 1MB which then levels off at about 340MB/s for larger
packets.

One problem and two questions:

Problem: When I reboot all the 40 nodes (apart from the one the opensm
is running), the network is non-functional (no pings go through, even
though ports show status "Active") for quite a while (more than 10
minutes) after all the nodes have come up. It then recovers without
intervention. Is this normal? Single node reboots don't affect the
network operation. osm Log file is appended.

Question 1: Can I run opensm in a master slave configuration? I noticed
that there is a priority commandline option, but am not sure how to
apply this.

Question 2: I plan to run the gen1/Mellanox IBGD drivers on the
compute nodes (need fast MPI), and gen2 on the control/storage nodes
(need only IP) with gen2 opensm running on the control nodes. Is there
any reason why this should not work reliably?

Roland

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: osm-port1.log URL: From hozer at hozed.org Mon Apr 11 09:33:42 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Mon, 11 Apr 2005 11:33:42 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <52mzs51g5g.fsf@topspin.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> Message-ID: <20050411163342.GE26127@kalmia.hozed.org> On Mon, Apr 11, 2005 at 08:34:19AM -0700, Roland Dreier wrote: > Troy> How is memory pinning handled? (I haven't had time to read > Troy> all the code, so please excuse my ignorance of something > Troy> obvious). > > The userspace library calls mlock() and then the kernel does > get_user_pages(). Is there a check in the kernel that the memory is actually mlock()ed? What if a malicious (or broken) application does ibv_reg_mr() but doesn't lock the memory? Does the IB card get a physical address for a page that might get swapped out? > Troy> The old mellanox drivers used to have a hack to call > Troy> 'sys_mlock', and promiscuously lock memory any old userspace > Troy> application asked for. What is the API for the new uverbs > Troy> memory registration, and how will things like memory hotplug > Troy> and NUMA page migration be able to unpin pages locked by a > Troy> user program? > > The API for uverbs memory registration is ibv_reg_mr(), and right now > the memory is pinned and that's it. > > - R. 
From roland at topspin.com Mon Apr 11 09:56:53 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 11 Apr 2005 09:56:53 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050411163342.GE26127@kalmia.hozed.org> (Troy Benjegerdes's message of "Mon, 11 Apr 2005 11:33:42 -0500") References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> Message-ID: <5264yt1cbu.fsf@topspin.com> Troy> Is there a check in the kernel that the memory is actually Troy> mlock()ed? No. Troy> What if a malicious (or broken) application does Troy> ibv_reg_mr() but doesn't lock the memory? Does the IB card Troy> get a physical address for a page that might get swapped Troy> out? No, the kernel does get_user_pages(). So the pages that the HCA gets will not be swapped or used for anything else. The only thing a malicious userspace app can do is screw itself up. - R. 
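[Editor's sketch] The exchange above splits pinning into two halves: the userspace library calls mlock() on the buffer, and the kernel takes its own page references with get_user_pages(). The following is a minimal sketch of the userspace half only, using plain POSIX calls. The helper names alloc_pages_aligned, pin_buffer and unpin_buffer are hypothetical and are not part of libibverbs; the kernel side is not shown, and mlock() may fail where RLIMIT_MEMLOCK is small.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not a libibverbs call): allocate npages of
 * page-aligned, zeroed memory. Returns NULL on failure and stores the
 * length in bytes in *len_out on success. */
static void *alloc_pages_aligned(size_t npages, size_t *len_out)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    size_t len;

    if (page <= 0 || npages == 0)
        return NULL;
    len = npages * (size_t)page;
    if (posix_memalign(&buf, (size_t)page, len) != 0)
        return NULL;
    memset(buf, 0, len);    /* touch the pages so they are backed */
    *len_out = len;
    return buf;
}

/* Hypothetical helper: lock the range into RAM so the process keeps the
 * same physical pages mapped at these virtual addresses.
 * Returns 0 on success, -1 on failure with errno set by mlock(). */
static int pin_buffer(void *buf, size_t len)
{
    return mlock(buf, len);
}

/* Hypothetical helper: undo pin_buffer(). */
static int unpin_buffer(void *buf, size_t len)
{
    return munlock(buf, len);
}
```

A caller would pin the buffer before handing it to a registration call such as ibv_reg_mr() and unpin it only after deregistering. Whether a given library does this for you is something to confirm against its own documentation.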
From halr at voltaire.com Mon Apr 11 10:50:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Apr 2005 13:50:24 -0400
Subject: [openib-general] [PATCH] mthca: Don't call CQ completion handler if it doesn't exist
Message-ID: <1113241284.4490.19.camel@localhost.localdomain>

mthca: Don't call CQ completion handler if it doesn't exist

Signed-off-by: Hal Rosenstock

Index: mthca_cq.c
===================================================================
--- mthca_cq.c (revision 2154)
+++ mthca_cq.c (working copy)
@@ -206,7 +206,8 @@
 
 	++cq->arm_sn;
 
-	cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
+	if (cq->ibcq.comp_handler)
+		cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
 }
 
 void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn)

From roland at topspin.com Mon Apr 11 10:58:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 10:58:53 -0700
Subject: [openib-general] Re: [PATCH] mthca: Don't call CQ completion handler if it doesn't exist
In-Reply-To: <1113241284.4490.19.camel@localhost.localdomain> (Hal Rosenstock's message of "11 Apr 2005 13:50:24 -0400")
References: <1113241284.4490.19.camel@localhost.localdomain>
Message-ID: <52sm1xyz36.fsf@topspin.com>

    Hal> mthca: Don't call CQ completion handler if it doesn't exist

Why do we want to add this test? This is adding a conditional branch
in what I think is a fast path, and I would consider it a bug in the
consumer if it creates a CQ with an invalid completion handler and
then requests a completion event for that CQ. Am I missing something?

 - R.
From hozer at hozed.org Mon Apr 11 11:01:08 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Mon, 11 Apr 2005 13:01:08 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <5264yt1cbu.fsf@topspin.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> Message-ID: <20050411180107.GF26127@kalmia.hozed.org> On Mon, Apr 11, 2005 at 09:56:53AM -0700, Roland Dreier wrote: > Troy> Is there a check in the kernel that the memory is actually > Troy> mlock()ed? > > No. > > Troy> What if a malicious (or broken) application does > Troy> ibv_reg_mr() but doesn't lock the memory? Does the IB card > Troy> get a physical address for a page that might get swapped > Troy> out? > > No, the kernel does get_user_pages(). So the pages that the HCA gets > will not be swapped or used for anything else. The only thing a > malicious userspace app can do is screw itself up. > > - R. Do we even need the mlock in userspace then? From halr at voltaire.com Mon Apr 11 11:01:32 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2005 14:01:32 -0400 Subject: [openib-general] Re: A Couple More CM Queries In-Reply-To: <4256B290.7020704@ichips.intel.com> References: <1112971669.4522.147.camel@localhost.localdomain> <4256B290.7020704@ichips.intel.com> Message-ID: <1113242491.4490.4.camel@localhost.localdomain> On Fri, 2005-04-08 at 12:34, Sean Hefty wrote: > Hal Rosenstock wrote: > > 1. cm_alloc_id does an idr_get_new_above starting at 1. Might this be > > better saving the highest value and starting there so connection IDs are > > less likely to repeat as soon ? > > I _think_ this would result in the IDR tables growing to their maximum > size, which seems worse than repeating the IDs immediately after their > timewait expires. > > > 2. 
Should ib_create_cm_id check return an error if cm_handler == NULL > > just to make sure ? > > Personally, I don't think it's worth this check for kernel clients, > unless we want to start checking for NULL parameters everywhere. Incoming REQs currently use this capability anyhow. > While on the CM, I did look at the issue of calling the API out of > order that you had pointed out before (which could result in accessing > a NULL port pointer). I'm not convinced that a simple check for a NULL > port pointer covers all potential problems. For example, I'm not sure > how well the codebase will handle the dynamic removal of a device while > users are attempting to access the device. We may need to handle this at some point. Guess the changes may be larger if/when we get there. A couple more questions: It looks like sending private data in REQ/REP/RTU, but incoming private data isn't handled on the receiving side. Also, in cm_process_send_error(), where the handler is called cm_id_priv->id.cm_handler(&cm_id_priv->id, &cm_event); might that callback request the CM ID destruction ? If so, some code is missing to handle this. Thanks. -- Hal From roland at topspin.com Mon Apr 11 11:03:08 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 11 Apr 2005 11:03:08 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050411180107.GF26127@kalmia.hozed.org> (Troy Benjegerdes's message of "Mon, 11 Apr 2005 13:01:08 -0500") References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> Message-ID: <52oeclyyw3.fsf@topspin.com> Troy> Do we even need the mlock in userspace then? Yes, because the kernel may go through and unmap pages from userspace while trying to swap. 
Since we have the page locked in the kernel, the physical page won't go anywhere, but userspace might end up with a different page mapped at the same virtual address. - R. From ardavis at ichips.intel.com Mon Apr 11 11:31:51 2005 From: ardavis at ichips.intel.com (ardavis) Date: Mon, 11 Apr 2005 11:31:51 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <1112892615.4877.18.camel@localhost.localdomain> References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> <1112892615.4877.18.camel@localhost.localdomain> Message-ID: <425AC297.9090706@ichips.intel.com> Hal Rosenstock wrote: >On Thu, 2005-04-07 at 12:47, ardavis wrote: > > >>EM64T server with MT25208 (MT23108 compat mode), fw_ver 4.6.0, hw_rev A0 >> >> > >Didn't 4.6.0 have a issue with CQ handling ? Can you try 4.6.2 ? > >-- Hal > > > 4.6.2 did not help. I don't see any indication of mthca_cq_event firing. Could it be an issue with the user mode mthca arm_cq mappings? From mshefty at ichips.intel.com Mon Apr 11 11:47:32 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 11 Apr 2005 11:47:32 -0700 Subject: [openib-general] Re: A Couple More CM Queries In-Reply-To: <1113242491.4490.4.camel@localhost.localdomain> References: <1112971669.4522.147.camel@localhost.localdomain> <4256B290.7020704@ichips.intel.com> <1113242491.4490.4.camel@localhost.localdomain> Message-ID: <425AC644.1060405@ichips.intel.com> Hal Rosenstock wrote: >>>2. Should ib_create_cm_id check return an error if cm_handler == NULL >>>just to make sure ? >> >>Personally, I don't think it's worth this check for kernel clients, >>unless we want to start checking for NULL parameters everywhere. > > Incoming REQs currently use this capability anyhow. Incoming REQs use the cm_handler associated with the listen request. 
>>While on the CM, I did look at the issue of calling the API out of >>order that you had pointed out before (which could result in accessing >>a NULL port pointer). I'm not convinced that a simple check for a NULL >>port pointer covers all potential problems. For example, I'm not sure >>how well the codebase will handle the dynamic removal of a device while >>users are attempting to access the device. > > We may need to handle this at some point. Guess the changes may be larger > if/when we get there. One of the side effects of changing the CM from using a pointer to a QP to just the QPN is that the CM can no longer rely on the device being around. And I agree, this will need to be handled at some point, but may not be a huge issue as long as the client is reasonable and disconnects before destroying their QP. > It looks like sending private data in REQ/REP/RTU, but incoming private data > isn't handled on the receiving side. The private_data is given to the user in the cm_event structure. Look for work->cm_event.private_data = in cm_format_req_event, cm_format_rep_event, and cm_rtu_handler. Note that the private_data is only available while in the CM event callback. > Also, in cm_process_send_error(), where the handler is called > > cm_id_priv->id.cm_handler(&cm_id_priv->id, &cm_event); > > might that callback request the CM ID destruction ? If so, some > code is missing to handle this. Yep - this is a bug. Send errors should probably be handled using the same cm_process_work routine that the receive handling goes through. I'll generate a patch for this, but it'll take me a few days, unless this is urgent. 
- Sean From halr at voltaire.com Mon Apr 11 11:57:47 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2005 14:57:47 -0400 Subject: [openib-general] Re: [PATCH] mthca: Don't call CQ completion handler if it doesn't exist In-Reply-To: <52sm1xyz36.fsf@topspin.com> References: <1113241284.4490.19.camel@localhost.localdomain> <52sm1xyz36.fsf@topspin.com> Message-ID: <1113245695.4616.8.camel@localhost.localdomain> On Mon, 2005-04-11 at 13:58, Roland Dreier wrote: > Hal> mthca: Don't call CQ completion handler if it doesn't exist > > Why do we want to add this test? This is adding a conditional branch > in what I think is a fast path, and I would consider it a bug in the > consumer if it creates a CQ with an invalid completion handler and > then requests a completion event for that CQ. Am I missing something? Then shouldn't this be indicated as an error at create_cq time ? -- Hal From halr at voltaire.com Mon Apr 11 12:01:00 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2005 15:01:00 -0400 Subject: [openib-general] Re: A Couple More CM Queries In-Reply-To: <425AC644.1060405@ichips.intel.com> References: <1112971669.4522.147.camel@localhost.localdomain> <4256B290.7020704@ichips.intel.com> <1113242491.4490.4.camel@localhost.localdomain> <425AC644.1060405@ichips.intel.com> Message-ID: <1113245847.4616.12.camel@localhost.localdomain> On Mon, 2005-04-11 at 14:47, Sean Hefty wrote: > Hal Rosenstock wrote: > >>>2. Should ib_create_cm_id check return an error if cm_handler == NULL > >>>just to make sure ? > >> > >>Personally, I don't think it's worth this check for kernel clients, > >>unless we want to start checking for NULL parameters everywhere. > > > > Incoming REQs currently use this capability anyhow. > > Incoming REQs use the cm_handler associated with the listen request. Right, but the CM ID is initially created with the NULL handler. That's all I was saying... 
> > It looks like sending private data in REQ/REP/RTU, but incoming private data > > isn't handled on the receiving side. > > The private_data is given to the user in the cm_event structure. Look > for work->cm_event.private_data = in cm_format_req_event, > cm_format_rep_event, and cm_rtu_handler. Note that the private_data is > only available while in the CM event callback. Got it. Thanks. > > Also, in cm_process_send_error(), where the handler is called > > > > cm_id_priv->id.cm_handler(&cm_id_priv->id, &cm_event); > > > > might that callback request the CM ID destruction ? If so, some > > code is missing to handle this. > > Yep - this is a bug. Send errors should probably be handled using the > same cm_process_work routine that the receive handling goes through. > I'll generate a patch for this, but it'll take me a few days, unless > this is urgent. Nope; not urgent. Just stumbled across it while looking through things. -- Hal From mshefty at ichips.intel.com Mon Apr 11 12:14:02 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 11 Apr 2005 12:14:02 -0700 Subject: [openib-general] Re: A Couple More CM Queries In-Reply-To: <1113245847.4616.12.camel@localhost.localdomain> References: <1112971669.4522.147.camel@localhost.localdomain> <4256B290.7020704@ichips.intel.com> <1113242491.4490.4.camel@localhost.localdomain> <425AC644.1060405@ichips.intel.com> <1113245847.4616.12.camel@localhost.localdomain> Message-ID: <425ACC7A.8090904@ichips.intel.com> Hal Rosenstock wrote: >>>Also, in cm_process_send_error(), where the handler is called >>> >>>cm_id_priv->id.cm_handler(&cm_id_priv->id, &cm_event); >>> >>>might that callback request the CM ID destruction ? If so, some >>>code is missing to handle this. >> >>Yep - this is a bug. Send errors should probably be handled using the >>same cm_process_work routine that the receive handling goes through. >>I'll generate a patch for this, but it'll take me a few days, unless >>this is urgent. > > Nope; not urgent. 
Just stumbled across it while looking through things.

Okay - I will try to get to this after finishing RMPP debug.

Thinking about this more, send errors were not handled in the same way
as receive handling, because I wanted to ensure that send errors were
always reported to the user. I.e. I didn't want to deal with a failed
memory allocation. I'll try to get a fix in next week.

- Sean

From libor at topspin.com Mon Apr 11 12:00:14 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 11 Apr 2005 12:00:14 -0700
Subject: [openib-general] Re: SDP Performance
In-Reply-To: <1112988342.4546.12.camel@localhost.localdomain>; from halr@voltaire.com on Fri, Apr 08, 2005 at 03:37:32PM -0400
References: <1112988342.4546.12.camel@localhost.localdomain>
Message-ID: <20050411120014.A6958@topspin.com>

On Fri, Apr 08, 2005 at 03:37:32PM -0400, Hal Rosenstock wrote:
> Hi Libor,
>
> A couple of questions about SDP performance:
>
> 1. When running the AIO version of TTCP, there appears to be a bandwidth
> degradation when using buffer sizes from about 5K to 13K. Do you see
> this too ? If so, is there an explanation for this ?

This would be the result of transitioning from buffered to zcopy
mode at 5K, which is the zcopy threshold. You can change the
threshold with a socket option, which is exposed in ttcp.aio.c
using the -z option. I was not planning on spending time to
determine the correct value of the default threshold until the
code stabilized a bit. At this point I'm investigating what
appears to be an RDMA going into an incorrect location.

> 2. Also, is there a program you use to measure SDP latency ?

I've used netperf in the past which has a roundtrip test to
measure latency using the regular sockets API. I don't have an
AIO latency test handy at the moment, but I could fix one up
and place it in the examples directory...
-Libor From roland at topspin.com Mon Apr 11 11:55:07 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 11 Apr 2005 11:55:07 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <425AC297.9090706@ichips.intel.com> (ardavis@ichips.intel.com's message of "Mon, 11 Apr 2005 11:31:51 -0700") References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> <1112892615.4877.18.camel@localhost.localdomain> <425AC297.9090706@ichips.intel.com> Message-ID: <52fyxxywhg.fsf@topspin.com> ardavis> 4.6.2 did not help. Not surprising. ardavis> I don't see any indication of mthca_cq_event ardavis> firing. Could it be an issue with the user mode mthca ardavis> arm_cq mappings? It's possible, I guess. You never said before -- does ibv_pingpong without the "-e" work? If so then the "update consumer index" doorbell is working. So it's kind of a mystery to me why the "arm CQ" doorbell would not work. Are other CQs in the kernel generating events? For example does IPoIB work for you? - R. From roland at topspin.com Mon Apr 11 12:07:35 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 11 Apr 2005 12:07:35 -0700 Subject: [openib-general] Re: [PATCH] mthca: Don't call CQ completion handler if it doesn't exist In-Reply-To: <1113245695.4616.8.camel@localhost.localdomain> (Hal Rosenstock's message of "11 Apr 2005 14:57:47 -0400") References: <1113241284.4490.19.camel@localhost.localdomain> <52sm1xyz36.fsf@topspin.com> <1113245695.4616.8.camel@localhost.localdomain> Message-ID: <527jj9yvwo.fsf@topspin.com> Roland> Why do we want to add this test? This is adding a Roland> conditional branch in what I think is a fast path, and I Roland> would consider it a bug in the consumer if it creates a CQ Roland> with an invalid completion handler and then requests a Roland> completion event for that CQ. Am I missing something? 
Hal> Then shouldn't this be indicated as an error at create_cq time ? How can it be? We don't know if the consumer is going to call ib_req_notify_cq() or not. - R. From halr at voltaire.com Mon Apr 11 12:26:32 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2005 15:26:32 -0400 Subject: [openib-general] Re: SDP Performance In-Reply-To: <20050411120014.A6958@topspin.com> References: <1112988342.4546.12.camel@localhost.localdomain> <20050411120014.A6958@topspin.com> Message-ID: <1113247451.4616.33.camel@localhost.localdomain> On Mon, 2005-04-11 at 15:00, Libor Michalek wrote: > On Fri, Apr 08, 2005 at 03:37:32PM -0400, Hal Rosenstock wrote: > > 2. Also, is there a program you use to measure SDP latency ? > > I've used netperf in the past which has a roundtrip test to > measure latency using the regular sockets API. I don't have an > AIO latency test handy at the moment, but I could fix one up > and place it in the examples directory... That would be handy when you get a chance. Thanks. -- Hal From ardavis at ichips.intel.com Mon Apr 11 12:32:53 2005 From: ardavis at ichips.intel.com (ardavis) Date: Mon, 11 Apr 2005 12:32:53 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <52fyxxywhg.fsf@topspin.com> References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> <1112892615.4877.18.camel@localhost.localdomain> <425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com> Message-ID: <425AD0E5.3040805@ichips.intel.com> Roland Dreier wrote: > ardavis> 4.6.2 did not help. > >Not surprising. > > ardavis> I don't see any indication of mthca_cq_event > ardavis> firing. Could it be an issue with the user mode mthca > ardavis> arm_cq mappings? > >It's possible, I guess. You never said before -- does ibv_pingpong >without the "-e" work? If so then the "update consumer index" >doorbell is working. 
So it's kind of a mystery to me why the "arm CQ" >doorbell would not work. > >Are other CQs in the kernel generating events? For example does IPoIB >work for you? > > - R. > > > Yes, ibv_pingpong works without -e and IPoIB is generating events and working fine. From halr at voltaire.com Mon Apr 11 13:23:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2005 16:23:17 -0400 Subject: [openib-general] OpenSM (again) In-Reply-To: <16986.42404.540439.952094@gargle.gargle.HOWL> References: <16986.42404.540439.952094@gargle.gargle.HOWL> Message-ID: <1113250997.4616.53.camel@localhost.localdomain> On Mon, 2005-04-11 at 12:28, Roland Fehrenbacher wrote: > Hi, > > I got gen2 opensm running fine now (there was a problem with a wrong > include file), and managed to get IP running on a network of > currently 40 machines (final size will be 144). Performance is pretty > impressive (initial tests with a simple netpipe): I got a latency of > 18microsec, and a maximum throughput of approx. 400MB/sec at packet > size approx. 1MB which then levels of at about 340MB/s for larger > packets. That's all good to hear :-) > One problem and two questions: > > Problem: When I reboot all the 40 nodes (apart from the one the opensm > is running), the network is non-functional (no pings go through, even > though ports show status "Active") for quite a while (more than 10 > minutes) after all the nodes have come up. It then recovers without > intervention. Is this normal? Single node reboots don't affect the > network operation. osm Log file is appended. Can you describe your topology ? Is it the following: the SM is connected to a switch/or switches with the 40 nodes connected off these switches ? I'll respond to the log (and these questions) in a separate email response. > Question 1: Can I run opensm in a master slave configuration? Yes. Others are doing this. > I noticed > that there is a priority commandline option, but am not sure how to > apply this. 
SM election goes by highest priority, then lowest GUID. So if you don't care which SM is the master then you don't need to do anything. If you want a specific order (and it is not in GUID order) then you need to specify priority. > Question 2: I plan to run the gen1/Mellanox IBGD drivers on the > compute nodes (need fast MPI), and gen2 on the control/storage nodes > (need only IP) with gen2 opensm running on the control nodes. Is there > any reason why this should not work reliably? So basically this appears to be an interop question: 1. Will gen2 OpenSM support IBGD nodes ? 2. Will gen2 IPoIB interoperate with IBGD IPoIB ? I haven't done this but know of no reasons this should not work. Perhaps others can add to this. -- Hal ________________________________________________________________________ From iod00d at hp.com Mon Apr 11 13:46:56 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 11 Apr 2005 13:46:56 -0700 Subject: [openib-general] Re: SDP Performance In-Reply-To: <20050411120014.A6958@topspin.com> References: <1112988342.4546.12.camel@localhost.localdomain> <20050411120014.A6958@topspin.com> Message-ID: <20050411204656.GC13577@esmail.cup.hp.com> On Mon, Apr 11, 2005 at 12:00:14PM -0700, Libor Michalek wrote: > > 2. Also, is there a program you use to measure SDP latency ? > > I've used netperf in the past which has a roundtrip test to > measure latency using the regular sockets API. I don't have an > AIO latency test handy at the moment, but I could fix one up > and place it in the examples directory... netperf -t TCP_RR is really easy to run. I suggest the *experimental* 2.4.0-rc1 version from www.netperf.org. 2.4.0-rc1 is much more linux friendly than previous versions. Just do "make config && make install" You *must* use the "-T" option (bind apps to the CPU handling interrupts) for performance characterization. I've run this a lot for gige but have not run a full set for ia64/IB. Output sample for IPoIB on HP/ZX1 ia64 (rx2600) below.
I expect SDP to be a bit better and would like to generate full sets for both IPoIB and SDP this week. ISTR netpipe also has latency tests but I've not played with netpipe yet. grant # /usr/local/bin/netperf -c -C -l 60 -H 10.0.0.30 -t TCP_RR -T 0,0 TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.30 (10.0.0.30) port 0 AF_INET Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 16313.15 5.85 5.98 7.174 7.326 16384 87380 From halr at voltaire.com Mon Apr 11 13:52:26 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2005 16:52:26 -0400 Subject: [openib-general] OpenSM (again) In-Reply-To: <1113250997.4616.53.camel@localhost.localdomain> References: <16986.42404.540439.952094@gargle.gargle.HOWL> <1113250997.4616.53.camel@localhost.localdomain> Message-ID: <1113252518.4476.3.camel@localhost.localdomain> On Mon, 2005-04-11 at 16:23, Hal Rosenstock wrote: > > Problem: When I reboot all the 40 nodes (apart from the one the opensm > > is running), the network is non-functional (no pings go through, even > > though ports show status "Active") for quite a while (more than 10 > > minutes) after all the nodes have come up. It then recovers without > > intervention. Is this normal? Single node reboots don't affect the > > network operation. osm Log file is appended. > > Can you describe your topology ? Is it the following: the SM is > connected to a switch/or switches with the 40 nodes connected off these > switches ? What is the mix of those 40 nodes in terms of OpenIB (gen2) and gen1 ? Is there no difference in the behavior of gen2 and gen1 in terms of the above symptoms ? 
-- Hal From roland at topspin.com Mon Apr 11 13:58:03 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 11 Apr 2005 13:58:03 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <425AD0E5.3040805@ichips.intel.com> (ardavis@ichips.intel.com's message of "Mon, 11 Apr 2005 12:32:53 -0700") References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> <1112892615.4877.18.camel@localhost.localdomain> <425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com> <425AD0E5.3040805@ichips.intel.com> Message-ID: <52wtr9xc84.fsf@topspin.com> Hmm... Has anyone else tried userspace verbs on a PCI Express HCA running 4.6.x FW? If so does "ibv_pingpong -e" work for you? All of the PCI Express HCAs I have handy are mem-free, but CQ events work for me with both mem-free HCAs and PCI-X HCAs. - R. From robert.j.woodruff at intel.com Mon Apr 11 15:09:50 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Mon, 11 Apr 2005 15:09:50 -0700 Subject: [openib-general] Re: SDP Performance In-Reply-To: <20050411204656.GC13577@esmail.cup.hp.com> Message-ID: Libor ># /usr/local/bin/netperf -c -C -l 60 -H 10.0.0.30 -t TCP_RR -T 0,0 >TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.30 (10.0.0.30) port 0 AF_INET >Local /Remote >Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem >Send Recv Size Size Time Rate local remote local remote >bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr >16384 87380 1 1 60.00 16313.15 5.85 5.98 7.174 7.326 >16384 87380 Hi Libor, What type of platform did you run this on ? CPU speed, type of HCA, etc. Also, have you run netpipe on SDP, it shows BW and latency for various sizes. 
woody From robert.j.woodruff at intel.com Mon Apr 11 15:13:06 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Mon, 11 Apr 2005 15:13:06 -0700 Subject: [openib-general] Re: SDP Performance In-Reply-To: Message-ID: woody wrote >What type of platform did you run this on ? CPU speed, type of HCA, etc. Sorry I just read the email and saw that it was an IPF box. woody From libor at topspin.com Mon Apr 11 15:12:59 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 11 Apr 2005 15:12:59 -0700 Subject: [openib-general] Re: SDP Performance In-Reply-To: ; from robert.j.woodruff@intel.com on Mon, Apr 11, 2005 at 03:09:50PM -0700 References: <20050411204656.GC13577@esmail.cup.hp.com> Message-ID: <20050411151259.B6958@topspin.com> On Mon, Apr 11, 2005 at 03:09:50PM -0700, Bob Woodruff wrote: > > Libor ># /usr/local/bin/netperf -c -C -l 60 -H 10.0.0.30 -t TCP_RR -T 0,0 > >TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to > 10.0.0.30 (10.0.0.30) port 0 AF_INET > >Local /Remote > >Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > >Send Recv Size Size Time Rate local remote local remote > >bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > >16384 87380 1 1 60.00 16313.15 5.85 5.98 7.174 7.326 > >16384 87380 > > Hi Libor, > > What type of platform did you run this on ? CPU speed, type of HCA, etc. I didn't run this, it was Grant, and those results were for IPoIB if I remember his email correctly. > Also, have you run netpipe on SDP, it shows BW and latency for various > sizes. No, I usually use ttcp or netperf which do pretty much the same thing, I would be surprised if netpipe showed radically different numbers. 
-Libor From robert.j.woodruff at intel.com Mon Apr 11 15:34:21 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Mon, 11 Apr 2005 15:34:21 -0700 Subject: [openib-general] Re: SDP Performance In-Reply-To: <20050411151259.B6958@topspin.com> Message-ID: Libor > I didn't run this, it was Grant, and those results were for IPoIB if I >remember his email correctly. > Also, have you run netpipe on SDP, it shows BW and latency for various > sizes. > No, I usually use ttcp or netperf which do pretty much the same thing, >I would be surprised if netpipe showed radically different numbers. >-Libor I know now, after I hit return I actually read the email. Sorry for the confusion. If I get a chance, I will try to get SDP running and run netpipe. I agree the numbers won't be much different, but it reports data for various sizes from 1 byte up to a couple of megabytes, so one can see the curve. woody From ardavis at ichips.intel.com Mon Apr 11 16:17:09 2005 From: ardavis at ichips.intel.com (ardavis) Date: Mon, 11 Apr 2005 16:17:09 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <52wtr9xc84.fsf@topspin.com> References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> <1112892615.4877.18.camel@localhost.localdomain> <425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com> <425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com> Message-ID: <425B0575.5090702@ichips.intel.com> Roland Dreier wrote: >Hmm... > >Has anyone else tried userspace verbs on a PCI Express HCA running >4.6.x FW? If so does "ibv_pingpong -e" work for you? > >All of the PCI Express HCAs I have handy are mem-free, but CQ events >work for me with both mem-free HCAs and PCI-X HCAs. > > - R. > > > Roland, I was debugging this problem and when I added some debug prints in mthca_tavor_arm_cq (cq.c), just before the mthca_write64() call it started working. 
I will take a closer look.... -arlin From roland at topspin.com Mon Apr 11 16:20:54 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 11 Apr 2005 16:20:54 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <425B0575.5090702@ichips.intel.com> (ardavis@ichips.intel.com's message of "Mon, 11 Apr 2005 16:17:09 -0700") References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> <1112892615.4877.18.camel@localhost.localdomain> <425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com> <425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com> <425B0575.5090702@ichips.intel.com> Message-ID: <52k6n8yk6h.fsf@topspin.com> ardavis> I was debugging this problem and when I added some debug ardavis> prints in mthca_tavor_arm_cq (cq.c), just before the ardavis> mthca_write64() call it started working. I will take a ardavis> closer look.... Ugh, smells like a compiler optimization problem or a timing problem. I'm still not seeing what could be going wrong, though. - R. From roland at topspin.com Mon Apr 11 16:37:49 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 11 Apr 2005 16:37:49 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <52k6n8yk6h.fsf@topspin.com> (Roland Dreier's message of "Mon, 11 Apr 2005 16:20:54 -0700") References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> <1112892615.4877.18.camel@localhost.localdomain> <425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com> <425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com> <425B0575.5090702@ichips.intel.com> <52k6n8yk6h.fsf@topspin.com> Message-ID: <52fyxwyjea.fsf@topspin.com> What distribution and compiler version are you running? I assume you're running 64-bit userspace on a 64-bit kernel, right? 
What optimization level is libmthca being built with? - R. From ardavis at ichips.intel.com Mon Apr 11 16:52:22 2005 From: ardavis at ichips.intel.com (ardavis) Date: Mon, 11 Apr 2005 16:52:22 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <52fyxwyjea.fsf@topspin.com> References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> <1112892615.4877.18.camel@localhost.localdomain> <425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com> <425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com> <425B0575.5090702@ichips.intel.com> <52k6n8yk6h.fsf@topspin.com> <52fyxwyjea.fsf@topspin.com> Message-ID: <425B0DB6.9090002@ichips.intel.com> Roland Dreier wrote: >What distribution and compiler version are you running? I assume >you're running 64-bit userspace on a 64-bit kernel, right? What >optimization level is libmthca being built with? > > - R. > > > Redhat EL 4.0, 64-bit gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3) libmthca built with default settings (-O2) From akpm at osdl.org Mon Apr 11 17:13:47 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 11 Apr 2005 17:13:47 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <52oeclyyw3.fsf@topspin.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> Message-ID: <20050411171347.7e05859f.akpm@osdl.org> Roland Dreier wrote: > > Troy> Do we even need the mlock in userspace then? > > Yes, because the kernel may go through and unmap pages from userspace > while trying to swap. 
Since we have the page locked in the kernel, > the physical page won't go anywhere, but userspace might end up with a > different page mapped at the same virtual address. That shouldn't happen. If get_user_pages() has elevated the refcount on a page then the following can happen: - The VM may decide to add the page to swapcache (if it's not mmapped from a file). - Once the page is backed by either swapcache or a (mmapped) file, the VM may decide to unmap the application's pte's. A later minor fault by the app will cause the same physical page to be remapped. - The VM may decide to try to write the page to its backing file or swap. If it does, the page is still in core, but is now clean. - Once all pte's are unmapped and the page is clean, the VM may decide to try to reclaim the page. The VM will then see the elevated refcount and will bail out, leaving the page in core. - If your code was doing a read-from-disk (modifying memory), then your code should run set_page_dirty() or set_page_dirty_lock() against the page before dropping the refcount which get_user_pages() added. Once the page is dirty, the VM can't reclaim it until it has been written to swap or mmapped backing file. IOW: while the page has an elevated refcount from get_user_pages(), that physical page is 100% pinned. Once you've done the set_page_dirty+put_page(), the page is again under control of the VM. There should be no need to run mlock() from userspace.
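The sequence Andrew describes can be sketched as a toy userspace model. This is only an illustration of the rule "reclaim bails out while the refcount is elevated"; struct toy_page and every toy_* function are hypothetical names invented here, not kernel APIs, and the real VM tracks far more state than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one page. All names are hypothetical, not kernel APIs. */
struct toy_page {
	int  pins;     /* extra references, as taken by a get_user_pages()-like call */
	bool mapped;   /* still mapped by the application's ptes */
	bool dirty;    /* must be written back before it can be reclaimed */
	bool in_core;  /* physical page is still resident */
};

/* Take a pinning reference on a resident page. */
static inline void toy_pin(struct toy_page *p)
{
	assert(p->in_core);
	p->pins++;
}

/* Models set_page_dirty() followed by put_page(): mark dirty, drop the pin. */
static inline void toy_unpin_dirty(struct toy_page *p)
{
	assert(p->pins > 0);
	p->dirty = true;
	p->pins--;
}

/* The VM unmaps the application's ptes; the physical page stays put. */
static inline void toy_unmap(struct toy_page *p)
{
	p->mapped = false;
}

/* The VM writes the page to swap or its backing file; it is now clean. */
static inline void toy_writeback(struct toy_page *p)
{
	p->dirty = false;
}

/* Reclaim succeeds only for an unmapped, clean page with no pins. */
static inline bool toy_try_reclaim(struct toy_page *p)
{
	if (p->mapped || p->dirty || p->pins > 0)
		return false;   /* bail out, page stays in core */
	p->in_core = false;
	return true;
}
```

In this model a pinned page survives unmap and writeback, and only becomes reclaimable after the pin is dropped and the dirtied page has been written back again.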
From roland at topspin.com Mon Apr 11 17:21:04 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 11 Apr 2005 17:21:04 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050411171347.7e05859f.akpm@osdl.org> (Andrew Morton's message of "Mon, 11 Apr 2005 17:13:47 -0700") References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> Message-ID: <521x9gyhe7.fsf@topspin.com> Roland> Yes, because the kernel may go through and unmap pages Roland> from userspace while trying to swap. Since we have the Roland> page locked in the kernel, the physical page won't go Roland> anywhere, but userspace might end up with a different page Roland> mapped at the same virtual address. Andrew> That shouldn't happen. If get_user_pages() has elevated Andrew> the refcount on a page then the following can happen: ... Andrew> IOW: while the page has an elevated refcount from Andrew> get_user_pages(), that physical page is 100% pinned. Andrew> Once you've done the set_page_dirty+put_page(), the page Andrew> is again under control of the VM. Hmm... I've never tested it firsthand but Libor assures me there is something like what I said. Libor, did I get the explanation right? - R.
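On the doorbell problem in the "uverbs events" thread: the code in question fills a uint32_t[2] and then hands it to hardware as one 64-bit word. A minimal sketch, not libmthca's actual code, of ways to build that 64-bit value without reading the array through a uint64_t pointer follows. The helper names pair_to_64_le, pair_to_64_be and pair_to_64_native are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration; not part of libmthca. */

/* The value a little-endian host would see if val[] were one 64-bit word. */
static inline uint64_t pair_to_64_le(const uint32_t val[2])
{
	assert(val != NULL);
	return (uint64_t) val[1] << 32 | val[0];
}

/* The value a big-endian host would see if val[] were one 64-bit word. */
static inline uint64_t pair_to_64_be(const uint32_t val[2])
{
	assert(val != NULL);
	return (uint64_t) val[0] << 32 | val[1];
}

/*
 * Native-order alternative: memcpy copies the object representation,
 * so the compiler must treat the writes to val[] as live.
 */
static inline uint64_t pair_to_64_native(const uint32_t val[2])
{
	uint64_t out;

	assert(val != NULL);
	memcpy(&out, val, sizeof out);
	return out;
}
```

The shift-and-or forms make the byte order explicit, which is why a build would normally pick one of them based on the host's endianness. The memcpy form gives whatever the host's native layout is.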
From roland at topspin.com Mon Apr 11 17:10:51 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 11 Apr 2005 17:10:51 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <425B0DB6.9090002@ichips.intel.com> (ardavis@ichips.intel.com's message of "Mon, 11 Apr 2005 16:52:22 -0700") References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> <1112892615.4877.18.camel@localhost.localdomain> <425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com> <425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com> <425B0575.5090702@ichips.intel.com> <52k6n8yk6h.fsf@topspin.com> <52fyxwyjea.fsf@topspin.com> <425B0DB6.9090002@ichips.intel.com> Message-ID: <527jj8yhv8.fsf@topspin.com> ardavis> Redhat EL 4.0, 64-bit OK, I found a system with that distro installed, although I can't test the results of the build. However, I built libmthca with the same CFLAGS that rpm seems to use, namely "-g -O2 -m64 -pipe". I found that mthca_tavor_arm_cq() compiles to the following tiny fragment: 0000000000001d10 : 1d10: 48 8b 07 mov (%rdi),%rax 1d13: 48 8b 90 a8 ef ff ff mov 0xffffffffffffefa8(%rax),%rdx 1d1a: 48 8b 44 24 f8 mov 0xfffffffffffffff8(%rsp),%rax 1d1f: 48 89 42 20 mov %rax,0x20(%rdx) 1d23: 31 c0 xor %eax,%eax 1d25: c3 retq in other words, the compiler seems to be discarding all the assignments to doorbell[0] and doorbell[1]. I'm not sure if this is a compiler bug or what -- I need to investigate further. 
In any case can you try the following patch to libmthca and see if it fixes things: Index: src/cq.c =================================================================== --- src/cq.c (revision 2156) +++ src/cq.c (working copy) @@ -441,6 +441,8 @@ int mthca_tavor_arm_cq(struct ibv_cq *cq to_mcq(cq)->cqn); doorbell[1] = 0xffffffff; + mb(); + mthca_write64(doorbell, to_mctx(cq->context), MTHCA_CQ_DOORBELL); return 0; From ardavis at ichips.intel.com Mon Apr 11 18:08:10 2005 From: ardavis at ichips.intel.com (ardavis) Date: Mon, 11 Apr 2005 18:08:10 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <527jj8yhv8.fsf@topspin.com> References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> <1112892615.4877.18.camel@localhost.localdomain> <425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com> <425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com> <425B0575.5090702@ichips.intel.com> <52k6n8yk6h.fsf@topspin.com> <52fyxwyjea.fsf@topspin.com> <425B0DB6.9090002@ichips.intel.com> <527jj8yhv8.fsf@topspin.com> Message-ID: <425B1F7A.9070100@ichips.intel.com> Roland Dreier wrote: > ardavis> Redhat EL 4.0, 64-bit > >OK, I found a system with that distro installed, although I can't test >the results of the build. However, I built libmthca with the same >CFLAGS that rpm seems to use, namely "-g -O2 -m64 -pipe". I found >that mthca_tavor_arm_cq() compiles to the following tiny fragment: > >0000000000001d10 : > 1d10: 48 8b 07 mov (%rdi),%rax > 1d13: 48 8b 90 a8 ef ff ff mov 0xffffffffffffefa8(%rax),%rdx > 1d1a: 48 8b 44 24 f8 mov 0xfffffffffffffff8(%rsp),%rax > 1d1f: 48 89 42 20 mov %rax,0x20(%rdx) > 1d23: 31 c0 xor %eax,%eax > 1d25: c3 retq > >in other words, the compiler seems to be discarding all the >assignments to doorbell[0] and doorbell[1]. I'm not sure if this is a >compiler bug or what -- I need to investigate further. 
In any case >can you try the following patch to libmthca and see if it fixes >things: > >Index: src/cq.c >=================================================================== >--- src/cq.c (revision 2156) >+++ src/cq.c (working copy) >@@ -441,6 +441,8 @@ int mthca_tavor_arm_cq(struct ibv_cq *cq > to_mcq(cq)->cqn); > doorbell[1] = 0xffffffff; > >+ mb(); >+ > mthca_write64(doorbell, to_mctx(cq->context), MTHCA_CQ_DOORBELL); > > return 0; > > > Yes, this fixes my problem. Thanks! From roland at topspin.com Mon Apr 11 19:51:24 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 11 Apr 2005 19:51:24 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <425B0DB6.9090002@ichips.intel.com> (ardavis@ichips.intel.com's message of "Mon, 11 Apr 2005 16:52:22 -0700") References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> <1112892615.4877.18.camel@localhost.localdomain> <425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com> <425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com> <425B0575.5090702@ichips.intel.com> <52k6n8yk6h.fsf@topspin.com> <52fyxwyjea.fsf@topspin.com> <425B0DB6.9090002@ichips.intel.com> Message-ID: <52fyxwwvv7.fsf@topspin.com> OK, I think I understand the problem. The old code violates the assumptions that gcc makes with -fstrict-aliasing (which is one of the optimizations turned on by -O2). Can you back out the patch to cq.c I sent and try this patch instead? 
Thanks, Roland Index: src/doorbell.h =================================================================== --- src/doorbell.h (revision 2156) +++ src/doorbell.h (working copy) @@ -69,14 +69,22 @@ static inline void mthca_write_db_rec(ui #elif SIZEOF_LONG == 8 +#if __BYTE_ORDER == __LITTLE_ENDIAN +# define MTHCA_PAIR_TO_64(val) ((uint64_t) val[1] << 32 | val[0]) +#elif __BYTE_ORDER == __BIG_ENDIAN +# define MTHCA_PAIR_TO_64(val) ((uint64_t) val[0] << 32 | val[1]) +#else +# error __BYTE_ORDER not defined +#endif + static inline void mthca_write64(uint32_t val[2], struct mthca_context *ctx, int offset) { - *(volatile uint64_t *) (ctx->uar + offset) = *(uint64_t *) val; + *(volatile uint64_t *) (ctx->uar + offset) = MTHCA_PAIR_TO_64(val); } static inline void mthca_write_db_rec(uint32_t val[2], uint32_t *db) { - *(volatile uint64_t *) db = *(uint64_t *) val; + *(volatile uint64_t *) db = MTHCA_PAIR_TO_64(val); } #else From iod00d at hp.com Mon Apr 11 19:58:35 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 11 Apr 2005 19:58:35 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <527jj8yhv8.fsf@topspin.com> References: <1112892615.4877.18.camel@localhost.localdomain> <425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com> <425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com> <425B0575.5090702@ichips.intel.com> <52k6n8yk6h.fsf@topspin.com> <52fyxwyjea.fsf@topspin.com> <425B0DB6.9090002@ichips.intel.com> <527jj8yhv8.fsf@topspin.com> Message-ID: <20050412025835.GE13577@esmail.cup.hp.com> On Mon, Apr 11, 2005 at 05:10:51PM -0700, Roland Dreier wrote: > ardavis> Redhat EL 4.0, 64-bit > > OK, I found a system with that distro installed, although I can't test > the results of the build. However, I built libmthca with the same > CFLAGS that rpm seems to use, namely "-g -O2 -m64 -pipe". 
I found > that mthca_tavor_arm_cq() compiles to the following tiny fragment: > > 0000000000001d10 : > 1d10: 48 8b 07 mov (%rdi),%rax > 1d13: 48 8b 90 a8 ef ff ff mov 0xffffffffffffefa8(%rax),%rdx > 1d1a: 48 8b 44 24 f8 mov 0xfffffffffffffff8(%rsp),%rax > 1d1f: 48 89 42 20 mov %rax,0x20(%rdx) > 1d23: 31 c0 xor %eax,%eax > 1d25: c3 retq > > in other words, the compiler seems to be discarding all the > assignments to doorbell[0] and doorbell[1]. doorbell[] is a local variable and mthca_write64() is static inline. I don't see a problem with the assignments to doorbell getting optimized out since the scope of that variable is completely visible to gcc. A smart compiler would just use registers and reduce the 32-bit stores. I see a problem with "(notify == IB_CQ_SOLICITED ? ....)" code getting optimized away. "notifier" is passed in parameter (not a constant) and the function is only invoked as an indirect function call. I don't see how gcc could know what value notifier will have and optimize the test away. Hrm...maybe the bug is "notifier" is somehow overloaded to a constant. You'd have to look at the intermediate "-E" (preprocessed) output. > I'm not sure if this is a compiler bug or what -- I need to > investigate further.> In any case > can you try the following patch to libmthca and see if it fixes > things: > > Index: src/cq.c > =================================================================== > --- src/cq.c (revision 2156) > +++ src/cq.c (working copy) > @@ -441,6 +441,8 @@ int mthca_tavor_arm_cq(struct ibv_cq *cq > to_mcq(cq)->cqn); > doorbell[1] = 0xffffffff; > > + mb(); > + > mthca_write64(doorbell, to_mctx(cq->context), MTHCA_CQ_DOORBELL); I don't get how this fixes the problem. mthca_write64() uses a spinlock and I thought that has to enforce some sort of memory/instruction ordering already. I'm sketchy on details and can't look it up right now. 
hth, grant From iod00d at hp.com Mon Apr 11 20:08:01 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 11 Apr 2005 20:08:01 -0700 Subject: [openib-general] Re: SDP Performance In-Reply-To: References: <20050411204656.GC13577@esmail.cup.hp.com> Message-ID: <20050412030801.GG13577@esmail.cup.hp.com> On Mon, Apr 11, 2005 at 03:09:50PM -0700, Bob Woodruff wrote: > > Libor ># /usr/local/bin/netperf -c -C -l 60 -H 10.0.0.30 -t TCP_RR -T 0,0 > >TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to > 10.0.0.30 (10.0.0.30) port 0 AF_INET > >Local /Remote > >Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > >Send Recv Size Size Time Rate local remote local remote > >bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > >16384 87380 1 1 60.00 16313.15 5.85 5.98 7.174 7.326 > >16384 87380 > > Hi Libor, s/Libor/Grant/ > What type of platform did you run this on ? CPU speed, type of HCA, etc. > > Also, have you run netpipe on SDP, it shows BW and latency for various > sizes. All of that (except CPU speed: 1.5Ghz Madison) was answered in the original email. In case you didn't save it, I bounced you another copy. grant From roland at topspin.com Mon Apr 11 20:14:12 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 11 Apr 2005 20:14:12 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <20050412025835.GE13577@esmail.cup.hp.com> (Grant Grundler's message of "Mon, 11 Apr 2005 19:58:35 -0700") References: <1112892615.4877.18.camel@localhost.localdomain> <425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com> <425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com> <425B0575.5090702@ichips.intel.com> <52k6n8yk6h.fsf@topspin.com> <52fyxwyjea.fsf@topspin.com> <425B0DB6.9090002@ichips.intel.com> <527jj8yhv8.fsf@topspin.com> <20050412025835.GE13577@esmail.cup.hp.com> Message-ID: <527jj8wut7.fsf@topspin.com> Grant> doorbell[] is a local variable and mthca_write64() is Grant> static inline. 
I don't see a problem with the assignments Grant> to doorbell getting optimized out since the scope of that Grant> variable is completely visible to gcc. A smart compiler Grant> would just use registers and reduce the 32-bit stores. Actually, what happens is that the compiler sees that we write to doorbell[] as a uint32_t but then read from it by dereferencing a uint64_t*. -O2 turns on -fstrict-aliasing, which allows the compiler to assume that pointers of different types never alias each other. So gcc says, hey, all you do is write to that local doorbell[] variable and never do anything with the values you write, so I'll just throw away that dead code. So gcc ends up only generating code for the store in mthca_write64() without any code to initialize doorbell[]. - R. From eitan at mellanox.co.il Mon Apr 11 22:07:11 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 12 Apr 2005 08:07:11 +0300 Subject: [openib-general] OpenSM (again) Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0DE@mtlex01.yok.mtl.com> Hi Roland, If the case is reproducible, please run "opensm -V" and send us the osm.log Thanks Eitan Zahavi > -----Original Message----- > From: Roland Fehrenbacher [mailto:rf at q-leap.de] > Sent: Monday, April 11, 2005 7:28 PM > To: openib-general at openib.org > Subject: [openib-general] OpenSM (again) > > Hi, > > I got gen2 opensm running fine now (there was a problem with a wrong > include file), and managed to get IP running on a network of > currently 40 machines (final size will be 144). Performance is pretty > impressive (initial tests with a simple netpipe): I got a latency of > 18microsec, and a maximum throughput of approx. 400MB/sec at packet > size approx. 1MB which then levels off at about 340MB/s for larger > packets.
> > One problem and two questions: > > Problem: When I reboot all the 40 nodes (apart from the one the opensm > is running), the network is non-functional (no pings go through, even > though ports show status "Active") for quite a while (more than 10 > minutes) after all the nodes have come up. It then recovers without > intervention. Is this normal? Single node reboots don't affect the > network operation. osm Log file is appended. > > Question 1: Can I run opensm in a master slave configuration? I noticed > that there is a priority commandline option, but am not sure how to > apply this. > > Question 2: I plan to run the gen1/Mellanox IBGD drivers on the > compute nodes (need fast MPI), and gen2 on the control/storage nodes > (need only IP) with gen2 opensm running on the control nodes. Is there > any reason why this should not work reliably? > > Roland -------------- next part -------------- An HTML attachment was scrubbed... URL: From iod00d at hp.com Mon Apr 11 22:34:12 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 11 Apr 2005 22:34:12 -0700 Subject: [openib-general] Re: SDP Performance In-Reply-To: References: <20050411151259.B6958@topspin.com> Message-ID: <20050412053412.GH13577@esmail.cup.hp.com> On Mon, Apr 11, 2005 at 03:34:21PM -0700, Bob Woodruff wrote: > If I get a chance, I will try to get SDP running and run netpipe. > I agree the numbers won't be much different, but it reports data for various > sizes from 1 byte up to a couple of megabytes, so one can see the curve. 
netperf includes a shell script to do the same thing: tcp_rr_script grant From tziporet at mellanox.co.il Mon Apr 11 22:36:52 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 12 Apr 2005 08:36:52 +0300 Subject: [openib-general] OpenSM (again) Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF347@mtlex01.yok.mtl.com> Regarding question 2: > Question 2: I plan to run the gen1/Mellanox IBGD drivers on the > compute nodes (need fast MPI), and gen2 on the control/storage nodes > (need only IP) with gen2 opensm running on the control nodes. Is there > any reason why this should not work reliably? We tried it in Mellanox once and it did work properly (we used OpenSM from gen1 and IPoIB from gen1 & gen2 on 2 different machines). So although it's not QAed I see no reason that it will not work for you. Tziporet From tziporet at mellanox.co.il Mon Apr 11 23:28:14 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 12 Apr 2005 09:28:14 +0300 Subject: [openib-general] Re: uverbs events Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF34C@mtlex01.yok.mtl.com> Very important - there is a bug in gcc version 3.4.2 that has been fixed in gcc 3.4.3. This bug (#17581) hurt us in VAPI when full optimization is working in bits or on 64-bit systems. So I suggest that you replace gcc with gcc 3.4.3. Tziporet -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Tuesday, April 12, 2005 6:14 AM To: Grant Grundler Cc: openib-general at openib.org Subject: Re: [openib-general] Re: uverbs events Grant> doorbell[] is a local variable and mthca_write64() is Grant> static inline. I don't see a problem with the assignments Grant> to doorbell getting optimized out since the scope of that Grant> variable is completely visible to gcc. A smart compiler Grant> would just use registers and reduce the 32-bit stores.
Actually, what happens is that the compiler sees that we write to doorbell[] as a uint32_t but then read from it by dereferencing a uint64_t*. -O2 turns on -fstrict-aliasing, which allows the compiler to assume that pointers of different types never alias each other. So gcc says, hey, all you do is write to that local doorbell[] variable and never do anything with the values you write, so I'll just throw away that dead code. So gcc ends up only generating code for the store in mthca_write64() without any code to initialize doorbell[].

 - R.

From halr at voltaire.com Tue Apr 12 02:56:29 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Apr 2005 05:56:29 -0400
Subject: [openib-general] OpenSM (again)
In-Reply-To: <16986.42404.540439.952094@gargle.gargle.HOWL>
References: <16986.42404.540439.952094@gargle.gargle.HOWL>
Message-ID: <1113299742.4476.40.camel@localhost.localdomain>

On Mon, 2005-04-11 at 12:28, Roland Fehrenbacher wrote:
> Problem: When I reboot all the 40 nodes (apart from the one the opensm
> is running), the network is non-functional (no pings go through, even
> though ports show status "Active") for quite a while (more than 10
> minutes) after all the nodes have come up. It then recovers without
> intervention. Is this normal? Single node reboots don't affect the
> network operation. osm Log file is appended.
>
> ______________________________________________________________________
> Apr 10 15:05:55 [4000] -> OpenSM Rev:openib-1.0.0
> Apr 10 15:05:55 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
> Apr 10 15:05:55 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > Apr 10 15:05:55 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > Apr 10 15:05:55 [4000] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c902004013c1) as the default port. > Apr 10 15:05:55 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c1. > Apr 10 15:05:55 [4000] -> osm_vendor_bind: Unable to register class 129 version 1. > Apr 10 15:05:55 [4000] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed. > Apr 10 15:05:55 [4000] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR). > Apr 10 15:06:58 [4000] -> OpenSM Rev:openib-1.0.0 > Apr 10 15:06:58 [4000] -> osm_opensm_init: Forcing single threaded dispatcher. > Apr 10 15:06:58 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > Apr 10 15:06:58 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > Apr 10 15:06:58 [4000] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c902004013c1) as the default port. > Apr 10 15:06:58 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c1. > Apr 10 15:06:58 [4000] -> osm_vendor_bind: Unable to register class 129 version 1. > Apr 10 15:06:58 [4000] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed. > Apr 10 15:06:58 [4000] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR). > Apr 10 15:07:44 [4000] -> OpenSM Rev:openib-1.0.0 > Apr 10 15:07:44 [4000] -> osm_opensm_init: Forcing single threaded dispatcher. 
> Apr 10 15:07:44 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > Apr 10 15:07:44 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > Apr 10 15:07:44 [4000] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c902004013c1) as the default port. > Apr 10 15:07:44 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c1. > Apr 10 15:07:44 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c1. > Apr 10 15:07:44 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0011 TID:0x000000000000000a > Apr 10 15:07:44 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping. This is a SubnGet of NodeInfo which is timing out. > Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping. > Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping. > Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping. > Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping. > Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping. > This is a SubnGet of PkeyTable which is timing out. > Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. 
Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. 
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. 
Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. > Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored. These are SA MADs being received when SM is not yet ready to handle them. They could be SA sets of MCMemberRecord (from IPoIB). SA clients in end nodes should retry them (assuming not exhaust their timeout/retry strategy). For debug purposes, it might be nice to display the method and attribute of the SA MAD. > Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SELF. > Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET. > Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET. > Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET. > Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET. > Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET. 
> Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping. > Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping. > Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping. > Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping. > Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping. > Apr 10 15:07:47 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping. > Apr 10 15:07:47 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET. > Apr 10 15:07:47 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET. > Apr 10 15:08:16 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:08:16 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:08:16 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. 
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. 
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. 
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. 
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. 
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. 
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. 
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. In the most recent OpenSM (gen1), this has been changed from error to warning. (That doesn't explain the delay in connectivity). > Apr 11 08:32:17 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0028 TID:0x000000000000004c > Apr 11 08:32:17 [18007] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0028 GID:0xfe80000000000000,0x0002c9010befe900 > Apr 11 08:32:17 [18007] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0011 GID:0xfe80000000000000,0x0002c902004013c1 > Apr 11 08:32:17 [18007] -> Discovered new port with GUID:0x0002c902004012e9 LID range [0x3D,0x3D] of node:MT23108 InfiniHost Mellanox Technologies > Apr 11 08:32:17 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT. > Apr 11 08:35:27 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0028 TID:0x000000000000004d > Apr 11 08:35:27 [18007] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0028 GID:0xfe80000000000000,0x0002c9010befe900 > Apr 11 08:35:27 [18007] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0011 GID:0xfe80000000000000,0x0002c902004013c1 > Apr 11 08:35:27 [18007] -> Removed port with GUID:0x0002c902004012e9 LID range [0x3D,0x3D] of node:MT23108 InfiniHost Mellanox Technologies At what point, did it start working again ? Was it at 15:24 ? (That appears to be a 16-17 minute delay in connectivity). 
-- Hal

From SWOODING at qinetiq.com Tue Apr 12 04:18:44 2005
From: SWOODING at qinetiq.com (Wooding Steve)
Date: Tue, 12 Apr 2005 12:18:44 +0100
Subject: [openib-general] AIO SDP and ttcp.aio: Event errors
Message-ID:

Hi,

I have been putting ttcp.aio through its paces and have a few questions.

1. When -l is larger than 131072 I get an Event error <-22> on the transmit side and no data is transferred. Changing the values of -n and -a does not make any difference.

2. When using a value of 1 for -a (so I suppose this is non-aio), I get an Event error of <-32> on the transmit side and a <-104> on the receiver end. Only some of the data is transferred.

3. For future reference, where can I find out what these Event error codes mean, to give me a clue of what's going wrong?

4. I sometimes see significant differences in the transfer speed reported on the transmit and receiver ends. Is one more right than the other?

My system details are:
Two nodes with Dual Xeon 64-bit processors
HCA: MT25208 (in MT23108 compat mode) with 128MB of RAM
OS: RHEL 4 (64-bit)
Gen2 stack version: trunk of 2113 (subversion revision number)

Thanks,
Steve.

The Information contained in this E-Mail and any subsequent correspondence is private and is intended solely for the intended recipient(s). For those other than the recipient any disclosure, copying, distribution, or any action taken or omitted to be taken in reliance on such information is prohibited and may be unlawful. Emails and other electronic communication with QinetiQ may be monitored. Calls to QinetiQ may be recorded for quality control, regulatory and monitoring purposes.
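
On question 3 above: a minimal sketch, assuming the Event error values printed by ttcp.aio are negated Linux errno codes. That is the usual convention for Linux AIO completion results, but the thread does not confirm it for this tool. The helper names below are hypothetical and are not part of ttcp.aio or the SDP code.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: map a negative event result to an errno name,
 * assuming the value is a negated Linux errno. Only the three codes
 * seen in the report above are named; anything else is "UNKNOWN". */
const char *event_err_name(long res)
{
    switch (-res) {
    case EINVAL:     return "EINVAL";
    case EPIPE:      return "EPIPE";
    case ECONNRESET: return "ECONNRESET";
    default:         return "UNKNOWN";
    }
}

/* Print the raw value, the symbolic name, and the C library's
 * description of the corresponding errno. */
void print_event_err(long res)
{
    printf("%ld: %s (%s)\n", res, event_err_name(res),
           strerror((int) -res));
}
```

Under that assumption, and with Linux errno numbering, -22 would be EINVAL (invalid argument), -32 EPIPE (broken pipe), and -104 ECONNRESET (connection reset by peer). Calling print_event_err(-22) from a small main() prints the library's text for each code.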
From roland at topspin.com Tue Apr 12 08:38:36 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 12 Apr 2005 08:38:36 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <506C3D7B14CDD411A52C00025558DED6064BF34C@mtlex01.yok.mtl.com> (Tziporet Koren's message of "Tue, 12 Apr 2005 09:28:14 +0300") References: <506C3D7B14CDD411A52C00025558DED6064BF34C@mtlex01.yok.mtl.com> Message-ID: <52r7hguhs3.fsf@topspin.com> Tziporet> Very important - there is a bug in gcc version 3.4.2 Tziporet> that had been fixed in gcc 3.4.3. This bug ((# 17581) Tziporet> heart us in VAPI when full optimizations is working in Tziporet> bits or on 64 bits systems. Thanks, but if the bug you're talking about is http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17581 then I don't think that's going to affect us -- we don't seem to do any 64-bit arithmetic inside a switch statement. - R. From ardavis at ichips.intel.com Tue Apr 12 09:07:30 2005 From: ardavis at ichips.intel.com (ardavis) Date: Tue, 12 Apr 2005 09:07:30 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <52fyxwwvv7.fsf@topspin.com> References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com> <425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com> <4255643D.30002@ichips.intel.com> <1112892615.4877.18.camel@localhost.localdomain> <425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com> <425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com> <425B0575.5090702@ichips.intel.com> <52k6n8yk6h.fsf@topspin.com> <52fyxwyjea.fsf@topspin.com> <425B0DB6.9090002@ichips.intel.com> <52fyxwwvv7.fsf@topspin.com> Message-ID: <425BF242.5070003@ichips.intel.com> Roland Dreier wrote: >OK, I think I understand the problem. The old code violates the >assumptions that gcc makes with -fstrict-aliasing (which is one of the >optimizations turned on by -O2). Can you back out the patch to cq.c I >sent and try this patch instead? 
>
>Thanks,
> Roland
>
>Index: src/doorbell.h
>===================================================================
>--- src/doorbell.h (revision 2156)
>+++ src/doorbell.h (working copy)
>@@ -69,14 +69,22 @@ static inline void mthca_write_db_rec(ui
>
> #elif SIZEOF_LONG == 8
>
>+#if __BYTE_ORDER == __LITTLE_ENDIAN
>+# define MTHCA_PAIR_TO_64(val) ((uint64_t) val[1] << 32 | val[0])
>+#elif __BYTE_ORDER == __BIG_ENDIAN
>+# define MTHCA_PAIR_TO_64(val) ((uint64_t) val[0] << 32 | val[1])
>+#else
>+# error __BYTE_ORDER not defined
>+#endif
>+
> static inline void mthca_write64(uint32_t val[2], struct mthca_context *ctx, int offset)
> {
>- *(volatile uint64_t *) (ctx->uar + offset) = *(uint64_t *) val;
>+ *(volatile uint64_t *) (ctx->uar + offset) = MTHCA_PAIR_TO_64(val);
> }
>
> static inline void mthca_write_db_rec(uint32_t val[2], uint32_t *db)
> {
>- *(volatile uint64_t *) db = *(uint64_t *) val;
>+ *(volatile uint64_t *) db = MTHCA_PAIR_TO_64(val);
> }
>
> #else
>
>

Done. Works fine.

From rf at q-leap.de Tue Apr 12 09:46:59 2005
From: rf at q-leap.de (Roland Fehrenbacher)
Date: Tue, 12 Apr 2005 18:46:59 +0200
Subject: [openib-general] OpenSM (again)
In-Reply-To: <1113250997.4616.53.camel@localhost.localdomain>
References: <16986.42404.540439.952094@gargle.gargle.HOWL> <1113250997.4616.53.camel@localhost.localdomain>
Message-ID: <16987.64387.656148.774577@gargle.gargle.HOWL>

>>>>> "Hal" == Hal Rosenstock writes:

    >> Problem: When I reboot all the 40 nodes (apart from the one the
    >> opensm is running), the network is non-functional (no pings go
    >> through, even though ports show status "Active") for quite a
    >> while (more than 10 minutes) after all the nodes have come
    >> up. It then recovers without intervention. Is this normal?
    >> Single node reboots don't affect the network operation. osm Log
    >> file is appended.

    Hal> Can you describe your topology ?
Is it the following: the SM
    Hal> is connected to a switch/or switches with the 40 nodes
    Hal> connected off these switches ?

Yes, the 40 nodes are connected to a single 144 port switch.

    Hal> I'll respond to the log (and these questions) in a separate
    Hal> email response.

    >> Question 1: Can I run opensm in a master slave configuration?

    Hal> Yes. Others are doing this.

    >> I noticed that there is a priority commandline option, but am
    >> not sure how to apply this.

    Hal> SM election occurs per high priority low GUID. So if you
    Hal> don't care which SM is the master then you don't need to do
    Hal> anything. If you want a specific order (and it is not in GUID
    Hal> order) then you need to specify priority.

Ok. I tried this, specifying priority 0 on one server, and priority 15 on another one. I assume priority 15 will be the master. If I first start the priority 0 opensm, and then the priority 15 one, things look normal:

Log excerpts

priority 0 server

Apr 12 18:41:06 [4000] -> OpenSM Rev:openib-1.0.0
Apr 12 18:41:06 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000011 Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c902004013c2 Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000d Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a Apr 12 18:42:25 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000e Apr 12 18:42:25 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a priority 15 server Apr 12 18:42:25 [4000] -> OpenSM Rev:openib-1.0.0 Apr 12 18:42:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher. Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a. Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a. Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request. Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request. Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request. Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request. 
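The election rule Hal describes above ("high priority, low GUID") can be sketched as a comparison function. This is only an illustration, not OpenSM code: the struct and function names are made up, and the priorities and port GUIDs are the ones that appear in the log excerpts.

```c
#include <stdint.h>

/* Hypothetical record for one SM candidate; not an OpenSM type. */
struct sm_candidate {
    uint8_t  priority;   /* 0..15, higher is preferred */
    uint64_t port_guid;  /* lower GUID wins when priorities are equal */
};

/* Return nonzero if a should become master over b:
 * compare priority first, then fall back to the lower GUID. */
static int sm_beats(const struct sm_candidate *a, const struct sm_candidate *b)
{
    if (a->priority != b->priority)
        return a->priority > b->priority;
    return a->port_guid < b->port_guid;
}
```

With the two ports from the logs, { 0, 0x2c902004013c2 } and { 15, 0x2c9020040133a }, the priority 15 port wins. If both ran at the same priority, 0x2c9020040133a would still win because it is the lower GUID.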
When I kill the priority 15 server however, the priority 0 server runs amok with continous log messages like: Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping. Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping. I assume that the handover to the priority 0 opensm hasn't worked then. For additional information: This test was done on a point-to-point connection between 2 adapters. Roland From rf at q-leap.de Tue Apr 12 09:47:51 2005 From: rf at q-leap.de (Roland Fehrenbacher) Date: Tue, 12 Apr 2005 18:47:51 +0200 Subject: [openib-general] OpenSM (again) In-Reply-To: <1113252518.4476.3.camel@localhost.localdomain> References: <16986.42404.540439.952094@gargle.gargle.HOWL> <1113250997.4616.53.camel@localhost.localdomain> <1113252518.4476.3.camel@localhost.localdomain> Message-ID: <16987.64439.352837.517744@gargle.gargle.HOWL> >>>>> "Hal" == Hal Rosenstock writes: Hal> On Mon, 2005-04-11 at 16:23, Hal Rosenstock wrote: >> > Problem: When I reboot all the 40 nodes (apart from the one >> the opensm > is running), the network is non-functional (no >> pings go through, even > though ports show status "Active") for >> quite a while (more than 10 > minutes) after all the nodes have >> come up. It then recovers without > intervention. Is this >> normal? Single node reboots don't affect the > network >> operation. osm Log file is appended. >> >> Can you describe your topology ? Is it the following: the SM is >> connected to a switch/or switches with the 40 nodes connected >> off these switches ? Hal> What is the mix of those 40 nodes in terms of OpenIB (gen2) Hal> and gen1 ? Is there no difference in the behavior of gen2 Hal> and gen1 in terms of the above symptoms ? So far all nodes are gen2. 
Roland From halr at voltaire.com Tue Apr 12 10:00:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Apr 2005 13:00:17 -0400 Subject: [openib-general] OpenSM (again) In-Reply-To: <16987.64387.656148.774577@gargle.gargle.HOWL> References: <16986.42404.540439.952094@gargle.gargle.HOWL> <1113250997.4616.53.camel@localhost.localdomain> <16987.64387.656148.774577@gargle.gargle.HOWL> Message-ID: <1113325216.4523.8.camel@localhost.localdomain> On Tue, 2005-04-12 at 12:46, Roland Fehrenbacher wrote: > Hal> SM election occurs per high priority low GUID. So if you > Hal> don't care which SM is the master than you don't need to do > Hal> anything. If you want a specific order (and it is not in GUID > Hal> order) then you need to specify priority. > > Ok. I tried this, specifying priority 0 on one server, and priority 15 > on another one. I assume priority 15, will be the master. > If I first start the priority 0 opensm, and then the priority 15 one, > things look normal: Log excerpts > > priority 0 server > > Apr 12 18:41:06 [4000] -> OpenSM Rev:openib-1.0.0 > Apr 12 18:41:06 [4000] -> osm_opensm_init: Forcing single threaded dispatcher. > Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2. > Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2. > Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request. > Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request. > Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request. > Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request. 
> Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000011 > Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c902004013c2 > Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000d > Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a > Apr 12 18:42:25 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000e > Apr 12 18:42:25 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a > > priority 15 server > > Apr 12 18:42:25 [4000] -> OpenSM Rev:openib-1.0.0 > Apr 12 18:42:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher. > Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a. > Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a. > Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request. > Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request. > Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request. > Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request. 
> > When I kill the priority 15 server however, the priority 0 server runs > amok with continous log messages like: > > Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping. > Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping. Attribute 0x20 is SMInfo. This is just the SubnGet(SMInfo) from the priority 0 server failing (no matching SubnGetResp received) which is "normal" if you killed the priority 15 server. Do the messages ever subside ? > I assume that the handover to the priority 0 opensm hasn't worked > then. This isn't really handover but that is another matter. You should be able to use the sminfo diag to see whether this SM has assumed the MASTER role. -- Hal From iod00d at hp.com Tue Apr 12 10:34:37 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 12 Apr 2005 10:34:37 -0700 Subject: [openib-general] Re: uverbs events In-Reply-To: <527jj8wut7.fsf@topspin.com> References: <52fyxxywhg.fsf@topspin.com> <425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com> <425B0575.5090702@ichips.intel.com> <52k6n8yk6h.fsf@topspin.com> <52fyxwyjea.fsf@topspin.com> <425B0DB6.9090002@ichips.intel.com> <527jj8yhv8.fsf@topspin.com> <20050412025835.GE13577@esmail.cup.hp.com> <527jj8wut7.fsf@topspin.com> Message-ID: <20050412173437.GB17646@esmail.cup.hp.com> On Mon, Apr 11, 2005 at 08:14:12PM -0700, Roland Dreier wrote: > Grant> doorbell[] is a local variable and mthca_write64() is > Grant> static inline. I don't see a problem with the assignments > Grant> to doorbell getting optimized out since the scope of that > Grant> variable is completely visible to gcc. A smart compiler > Grant> would just use registers and reduce the 32-bit stores. > > Actually, what is that the compiler sees that we write to doorbell[] > as a uint32_t but then read from it by dereferencing a uint64_t*. 
-O2 > turns on -fstrict-aliasing, which allows the compiler to assume that > pointers of different types never alias each other. So gcc says, hey, > all you do is write to that local doorbell[] variable and never do > anything with the values you write, so I'll just throw away that dead > code. So gcc ends up only generating code for the store in > mthca_write64() without any code to initialize doorbell[]. Yup - I saw your followup right after I posted. But this is a better explanation...thanks! grant From mst at mellanox.co.il Tue Apr 12 11:23:57 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 12 Apr 2005 21:23:57 +0300 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <521x9gyhe7.fsf@topspin.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <521x9gyhe7.fsf@topspin.com> Message-ID: <20050412182357.GA24047@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation > > Roland> Yes, because the kernel may go through and unmap pages > Roland> from userspace while trying to swap. Since we have the > Roland> page locked in the kernel, the physical page won't go > Roland> anywhere, but userspace might end up with a different page > Roland> mapped at the same virtual address. > > Andrew> That shouldn't happen. If get_user_pages() has elevated > Andrew> the refcount on a page then the following can happen: > > ... > > Andrew> IOW: while the page has an elevated refcount from > Andrew> get_user_pages(), that physical page is 100% pinned. > Andrew> Once you've done the set_page_dirty+put_page(), the page > Andrew> is again under control of the VM. > > Hmm... 
I've never tested it first hand but Libor assures me there is a
> something like what I said. Libor, did I get the explanation right?
>
> - R.

Roland, is it possible that what you describe is the behaviour of older kernels?

Digging around in rmap.c, I see the following code in try_to_unmap_one:

	/*
	 * Don't pull an anonymous page out from under get_user_pages.
	 * GUP carefully breaks COW and raises page count (while holding
	 * page_table_lock, as we have here) to make sure that the page
	 * cannot be freed. If we unmap that page here, a user write
	 * access to the virtual address will bring back the page, but
	 * its raised count will (ironically) be taken to mean it's not
	 * an exclusive swap page, do_wp_page will replace it by a copy
	 * page, and the user never get to see the data GUP was holding
	 * the original page for.
	 *
	 * This test is also useful for when swapoff (unuse_process) has
	 * to drop page lock: its reference to the page stops existing
	 * ptes from being unmapped, so swapoff can make progress.
	 */
	if (PageSwapCache(page) &&
	    page_count(page) != page_mapcount(page) + 2) {
		ret = SWAP_FAIL;
		goto out_unmap;
	}

This was added in
http://linus.bkbits.net:8080/linux-2.5/patch at 1.1722.120.6
on 2004-06-05, i.e. as far as I can see around 2.6.7, and the comment says:

>>>>>>>>>>>>>>>>>>>>>>
> [PATCH] mm: get_user_pages vs. try_to_unmap
>
> Andrea Arcangeli's fix to an ironic weakness with get_user_pages.
>
> try_to_unmap_one must check page_count against page->mapcount before unmapping
> a swapcache page: because the raised pagecount by which get_user_pages ensures
> the page cannot be freed, will cause any write fault to see that page as not
> exclusively owned, and therefore a copy page will be substituted for it - the
> reverse of what's intended.
>
> rmap.c was entirely free of such page_count heuristics before, I tried hard to
> avoid putting this in. But Andrea's fix rarely gives a false positive; and
> although it might be nicer to change exclusive_swap_page etc.
to rely on
> page->mapcount instead, it seems likely that we'll want to get rid of
> page->mapcount later, so better not to entrench its use.
>
> Signed-off-by: Hugh Dickins
> Signed-off-by: Andrew Morton
> Signed-off-by: Linus Torvalds
>>>>>>>>>>>>>>>>>>>>>>

Seems quite like the situation that you described.
Does my analysis make sense?

Since this case seems to be explicitly handled, it is probably safe to
rely on this behaviour of try_to_unmap, avoiding the need for mlock,
is it not?

--
MST - Michael S. Tsirkin

From iod00d at hp.com Tue Apr 12 11:47:30 2005
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 12 Apr 2005 11:47:30 -0700
Subject: [openib-general] failed to allocate buffer page
Message-ID: <20050412184730.GE17646@esmail.cup.hp.com>

Hi,
Haven't checked yet what the "side effect" of this error was,
but here is the output so people are aware of it.

This is the first time I've seen this. I've been doing
(unload, reload, test) loops a lot last week. Just scripted
the set of commands and got the error on the first try.

Not reproducible on this or other boxes.

grant

gsyprf3:/usr/src/linux-2.6# reload_ib
+ IPoIB=51
+ ifconfig ib0 down
+ rmmod ib_ipoib ib_sdp ib_cm ib_sa ib_mthca ib_mad ib_core
ERROR: Module ib_sdp does not exist in /proc/modules
ERROR: Module ib_cm does not exist in /proc/modules
ACPI: PCI interrupt for device 0000:81:00.0 disabled
GSI 60 (level, low) -> vector 69 unregisterd.
+ modprobe ib_mthca msi_x=1
ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004)
ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:81:00.0)
GSI 60 (level, low) -> CPU 1 (0x0100) vector 69
ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 69
+ modprobe ib_ipoib
+ modprobe ib_sdp
modprobe: page allocation failure.
order:0, mode:0x20 Call Trace: [] show_stack+0x80/0xa0 sp=e00000002920fc50 bsp=e0000000292090c0 [] dump_stack+0x30/0x60 sp=e00000002920fe20 bsp=e0000000292090a8 [] __alloc_pages+0x5d0/0x8a0 sp=e00000002920fe20 bsp=e000000029209028 [] __get_free_pages+0x60/0x120 sp=e00000002920fe30 bsp=e000000029209000 [] sdp_buff_pool_alloc+0xf0/0x3e0 [ib_sdp] sp=e00000002920fe30 bsp=e000000029208f70 [] sdp_buff_pool_init+0x480/0x620 [ib_sdp] sp=e00000002920fe30 bsp=e000000029208f28 [] sdp_init+0xe0/0x4e0 [ib_sdp] sp=e00000002920fe30 bsp=e000000029208ef8 [] sys_init_module+0x470/0x640 sp=e00000002920fe30 bsp=e000000029208e80 [] ia64_ret_from_syscall+0x0/0x20 sp=e00000002920fe30 bsp=e000000029208e80 WARN: : Failed to allocate buffer page. <1024:747> NET: Registered protocol family 27 + ifconfig ib0 10.0.0.51 netmask 255.255.255.0 broadcast 10.0.0.255 + ifconfig ib1 10.0.1.51 netmask 255.255.255.0 broadcast 10.0.1.255 gsyprf3:/usr/src/linux-2.6# From iod00d at hp.com Tue Apr 12 12:25:01 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 12 Apr 2005 12:25:01 -0700 Subject: [openib-general] NULL ptr derefence Message-ID: <20050412192501.GA18034@esmail.cup.hp.com> System panic'd when I ran the "reload_ib" script with NULL ptr. Odd that I didn't see any problems with switching around module versions by hand before. Scripting it seems to have exposed more race conditions or something. Sorry, I'm not sure which rev of openib code was running on this machine. Is there some way I can tell what SVN version from the binaries in /lib/modules/'uname -r' directory? It's possible this was already fixed... 
thanks, grant ionize:/usr/src/linux-2.6# reload_ib + IPoIB=113 + ifconfig ib0 down Unable to handle kernel NULL pointer dereference (address 0000000000000000) ib_mad1[1882]: Oops 8813272891392 [1] Modules linked in: ib_ipoib ib_sdp ib_cm ib_sa ib_mthca ib_mad ib_core tg3 dm_mod e1000 e100 Pid: 1882, CPU 1, comm: ib_mad1 psr : 0000101008026018 ifs : 800000000000038b ip : [] Not tainted ip is at ib_sa_mcmember_rec_callback+0x90/0xe0 [ib_sa] unat: 0000000000000000 pfs : 000000000000048d rsc : 0000000000000003 rnat: 0000000000000000 bsps: 0000000000000000 pr : 000000000000a941 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a74433f csd : 0000000000000000 ssd : 0000000000000000 b0 : a000000200121a30 b6 : a000000100002d70 b7 : a000000200121440 f6 : 1003e8080808080808081 f7 : 1003e0000000000001400 f8 : 1003e0000000000001400 f9 : 1003e00000000000027d8 f10 : 1003e000000000ff00000 f11 : 1003e000000003b5f2d38 r1 : a000000200320000 r2 : a000000200123270 r3 : e0000001014a7d98 r8 : a000000200121440 r9 : 0000000000000006 r10 : 0000000000000003 r11 : 0000000000000001 r12 : e0000001014a7d20 r13 : e0000001014a0000 r14 : 0000000000000000 r15 : e0000002ead26588 r16 : a0000002001252d8 r17 : 0000000000000000 r18 : 0000000000000001 r19 : 0000000000000000 r20 : e00000000f05cf60 r21 : 0000000000000000 r22 : e00000000f05cf60 r23 : 0000000000000000 r24 : 0000000000000000 r25 : 0000000000200200 r26 : e000000100d22e70 r27 : 0000001008026018 r28 : e0000002e907c418 r29 : 0000000000100100 r30 : 0000000000000000 r31 : a000000200125da0 Call Trace: [] show_stack+0x80/0xa0 sp=e0000001014a78e0 bsp=e0000001014a1190 [] show_regs+0x7e0/0x800 sp=e0000001014a7ab0 bsp=e0000001014a1130 [] die+0x150/0x1c0 sp=e0000001014a7ac0 bsp=e0000001014a10f0 [] ia64_do_page_fault+0x370/0x980 sp=e0000001014a7ac0 bsp=e0000001014a1088 [] ia64_leave_kernel+0x0/0x260 sp=e0000001014a7b50 bsp=e0000001014a1088 [] ib_sa_mcmember_rec_callback+0x90/0xe0 [ib_sa] sp=e0000001014a7d20 bsp=e0000001014a1030 [] 
send_handler+0x110/0x280 [ib_sa] sp=e0000001014a7d70 bsp=e0000001014a0fe0 [] ib_mad_complete_send_wr+0x330/0x380 [ib_mad] sp=e0000001014a7d70 bsp=e0000001014a0f90 [] ib_mad_send_done_handler+0x1e0/0x2e0 [ib_mad] sp=e0000001014a7d70 bsp=e0000001014a0f20 [] ib_mad_completion_handler+0x180/0x200 [ib_mad] sp=e0000001014a7d80 bsp=e0000001014a0ed0 [] worker_thread+0x3d0/0x520 sp=e0000001014a7db0 bsp=e0000001014a0e48 [] kthread+0x160/0x180 sp=e0000001014a7e20 bsp=e0000001014a0e10 [] kernel_thread_helper+0xd0/0x100 sp=e0000001014a7e30 bsp=e0000001014a0de0 [] start_kernel_thread+0x20/0x40 sp=e0000001014a7e30 bsp=e0000001014a0de0 + rmmod ib_ipoib ib_sdp ib_cm ib_sa ib_mthca ib_mad ib_core <6>NET: Unregistered protocol family 27 From roland at topspin.com Tue Apr 12 12:29:59 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 12 Apr 2005 12:29:59 -0700 Subject: [openib-general] NULL ptr derefence In-Reply-To: <20050412192501.GA18034@esmail.cup.hp.com> (Grant Grundler's message of "Tue, 12 Apr 2005 12:25:01 -0700") References: <20050412192501.GA18034@esmail.cup.hp.com> Message-ID: <52br8ju72g.fsf@topspin.com> Grant> System panic'd when I ran the "reload_ib" script with NULL Grant> ptr. Odd that I didn't see any problems with switching Grant> around module versions by hand before. Scripting it seems Grant> to have exposed more race conditions or something. I think we've seen this before but I never tracked it down. I'll take another look. Grant> Sorry, I'm not sure which rev of openib code was running on Grant> this machine. Is there some way I can tell what SVN Grant> version from the binaries in /lib/modules/'uname -r' Grant> directory? Not really that I know of, unfortunately. - R. 
From roland at topspin.com Tue Apr 12 12:28:36 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 12 Apr 2005 12:28:36 -0700
Subject: [openib-general] failed to allocate buffer page
In-Reply-To: <20050412184730.GE17646@esmail.cup.hp.com> (Grant Grundler's message of "Tue, 12 Apr 2005 11:47:30 -0700")
References: <20050412184730.GE17646@esmail.cup.hp.com>
Message-ID: <52fyxvu74r.fsf@topspin.com>

This looks like a GFP_ATOMIC allocation failing. Not sure where in
the SDP code it's being triggered.

 - R.

From steve at wooding.uklinux.net Tue Apr 12 13:08:15 2005
From: steve at wooding.uklinux.net (Steven Wooding)
Date: Tue, 12 Apr 2005 21:08:15 +0100
Subject: [openib-general] Repost: AIO SDP and ttcp.aio: Event errors
Message-ID: <425C2AAF.2050700@wooding.uklinux.net>

Hi,

I have been putting ttcp.aio through its paces and have a few questions.

1. When -l is larger than 131072 I get an Event error <-22> on the transmit
side and no data is transferred. Changing the values of -n and -a does not
make any difference.

2. When using a value of 1 for -a (so I suppose this is non-aio), I get an
Event error of <-32> on the transmit side and an <-104> on the receiver end.
Only some of the data is transferred.

3. For future reference, where can I find out what these Event error codes
mean, to give me a clue of what's going wrong?

4. I sometimes see significant differences in the transfer speed reported on
the transmit and receiver ends. Is one more right than the other?

My system details are:

Two nodes with Dual Xeon 64-bit processors
HCA: MT25208 (in MT23108 compat mode) with 128MB of ram
OS: RHEL 4 (64-bit)
Gen2 stack version: trunk of 2113 (subversion revision number)

Thanks,

Steve.
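The Event error values reported above are negated errno values, as a later reply in this thread points out. A minimal sketch for decoding them, assuming a Linux system where 22, 32 and 104 are EINVAL, EPIPE and ECONNRESET. The helper names are made up for illustration and are not part of ttcp.aio.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: map a negative "Event error" value to an errno. */
static int event_error_to_errno(long event_error)
{
    return event_error < 0 ? (int) -event_error : (int) event_error;
}

/* Print the event error together with the system's description of it. */
static void print_event_error(long event_error)
{
    int err = event_error_to_errno(event_error);

    printf("Event error <%ld>: errno %d (%s)\n",
           event_error, err, strerror(err));
}
```

Calling print_event_error() with -22, -32 and -104 prints whatever description strerror() gives for each code on the machine it runs on.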
From libor at topspin.com Tue Apr 12 13:03:05 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 12 Apr 2005 13:03:05 -0700 Subject: [openib-general] failed to allocate buffer page In-Reply-To: <20050412184730.GE17646@esmail.cup.hp.com>; from iod00d@hp.com on Tue, Apr 12, 2005 at 11:47:30AM -0700 References: <20050412184730.GE17646@esmail.cup.hp.com> Message-ID: <20050412130305.C6958@topspin.com> On Tue, Apr 12, 2005 at 11:47:30AM -0700, Grant Grundler wrote: > Hi, > Haven't checked yet waht the "side effect" of this error was, > but here is the output so people are aware of it. > > This is the first time I've seen this. I've been doing > (unload, reload, test) loops alot last week. Just scripted > the set of commands and got the error on the first try. > > Not reproducible on this or other boxes. > > [] sdp_buff_pool_alloc+0xf0/0x3e0 [ib_sdp] > sp=e00000002920fe30 bsp=e000000029208f70 This is the SDP buffer allocator, at init time it pre-allocates some buffers that are used for transfers. The alloctor uses ATOMIC since it can be called during run time. I'll add a function parameter to determine how the allocator should be called. -Libor From libor at topspin.com Tue Apr 12 13:46:13 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 12 Apr 2005 13:46:13 -0700 Subject: [openib-general] Repost: AIO SDP and ttcp.aio: Event errors In-Reply-To: <425C2AAF.2050700@wooding.uklinux.net>; from steve@wooding.uklinux.net on Tue, Apr 12, 2005 at 09:08:15PM +0100 References: <425C2AAF.2050700@wooding.uklinux.net> Message-ID: <20050412134613.D6958@topspin.com> On Tue, Apr 12, 2005 at 09:08:15PM +0100, Steven Wooding wrote: > Hi, > > I have been putting ttcp.aio through its paces and have a few questions. > > 1. When -l is larger than 131072 I get an Event error <-22> on the transmit > side and no data to transferred. Changing values of -n and -a do not make > any difference. The FMRs need to be sized at initialization time. 
The code currently picks 128K as the size for the FMRs, and does not support an AIO operation that would span multiple FMRs. If you want to try larger AIO operations with the current code you will need to recompile SDP with a larger FMR size, which is determined by the constant SDP_IOCB_SIZE_MAX in sdp_iocb.h It's been a while since I've last tried this, if you try it and have problems let me know. > 2. When using a value of 1 for -a (so I suppose this is non-aio), I get an > Event error of <-32> on the transmit side and an <-104> on the receiver end. > Only some of the data is transferred. I'll look into this, I'm seeing a problem on longer runs myself. With a value of 1 for -a it still uses aio, the value only means how many aio operations can be outstanding at a given time. This just means that a single buffer will be submitted for read/write and a new one will not be submitted until that buffer's IO completes. > 3. For future reference, where can I find out what these Event error codes > mean to give me a glue of what's going wrong. The errors are errno values. I'll make a note to write up which errors are possible and what they are likely to mean. > 4. I sometimes see significant differences in the transfer speed reported on > the transmit and receiver ends. Is one more right than the other? Are the wall clock times for the data transfers small, on the order of a few seconds? How big of a wall clock time difference are you seeing? -Libor From steve at wooding.uklinux.net Tue Apr 12 14:28:20 2005 From: steve at wooding.uklinux.net (Steven Wooding) Date: Tue, 12 Apr 2005 22:28:20 +0100 Subject: [openib-general] Repost: AIO SDP and ttcp.aio: Event errors In-Reply-To: <20050412134613.D6958@topspin.com> References: <425C2AAF.2050700@wooding.uklinux.net> <20050412134613.D6958@topspin.com> Message-ID: <425C3D74.8090705@wooding.uklinux.net> 1. OK. That's fair enough. I'll give that ago. 2. Yeah, it also occurs for large values of -n, say 10000. 3. Great. 4. 
Yeah, the times are small as I'm only doing short runs (-n 1000) to avoid the -32/-104 errors. I'll try pushing -n up a bit. Thanks Libor, Steve. Libor Michalek wrote: >On Tue, Apr 12, 2005 at 09:08:15PM +0100, Steven Wooding wrote: > > >>Hi, >> >>I have been putting ttcp.aio through its paces and have a few questions. >> >>1. When -l is larger than 131072 I get an Event error <-22> on the transmit >>side and no data to transferred. Changing values of -n and -a do not make >>any difference. >> >> > > The FMRs need to be sized at initialization time. The code currently >picks 128K as the size for the FMRs, and does not support an AIO operation >that would span multiple FMRs. If you want to try larger AIO operations >with the current code you will need to recompile SDP with a larger FMR >size, which is determined by the constant SDP_IOCB_SIZE_MAX in sdp_iocb.h >It's been a while since I've last tried this, if you try it and have >problems let me know. > > > >>2. When using a value of 1 for -a (so I suppose this is non-aio), I get an >>Event error of <-32> on the transmit side and an <-104> on the receiver end. >>Only some of the data is transferred. >> >> > > I'll look into this, I'm seeing a problem on longer runs myself. With a >value of 1 for -a it still uses aio, the value only means how many aio >operations can be outstanding at a given time. This just means that a >single buffer will be submitted for read/write and a new one will not >be submitted until that buffer's IO completes. > > > >>3. For future reference, where can I find out what these Event error codes >>mean to give me a glue of what's going wrong. >> >> > > The errors are errno values. I'll make a note to write up which errors >are possible and what they are likely to mean. > > > >>4. I sometimes see significant differences in the transfer speed reported on >>the transmit and receiver ends. Is one more right than the other? 
>> >> > > Are the wall clock times for the data transfers small, on the order >of a few seconds? How big of a wall clock time difference are you seeing? > >-Libor > > > > From halr at voltaire.com Tue Apr 12 14:58:58 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Apr 2005 17:58:58 -0400 Subject: [openib-general] SM Bad Port Handling In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0BC@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6047EF0BC@mtlex01.yok.mtl.com> Message-ID: <1113343036.4476.0.camel@localhost.localdomain> Hi Eitan, On Fri, 2005-04-08 at 11:55, Eitan Zahavi wrote: > Hi Hal, > > This is a physical port attribute so the file is osm_port.h and the > structure is osm_physp_t. > From the doc on the structure: > * > * healthy > * Tracks the health of the port. Normally should be TRUE but > * might change as a result of incoming traps indicating the port > * healthy is questionable. > * > > I have been trying my best to find how it can happen that a port that > does not respond will cause OpenSM to continuously poll it. This can > not happen so unless you can explain how it happens please do not > contaminate the code with un-needed code. In looking at the unhealthy code, it appears to me that the unhealthy bit is only set if the SM receives traps 129-131 and not if the SMA does not respond to SM MADs so these ports will not be detected and hence not bypassed. -- Hal From surs at cse.ohio-state.edu Tue Apr 12 16:28:05 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Tue, 12 Apr 2005 19:28:05 -0400 Subject: [openib-general] segmentation fault with ibv_pingpong Message-ID: <20050412232804.GA30632@cse.ohio-state.edu> Hi, I am facing a segmentation fault problem with the OpenIB Gen2 drivers while executing `ibv_pingpong' test. The description of the problem is given below. Can someone point out what may be going wrong here? 
I have included as much information as I thought would be required, but if more specific information is needed, I can provide it. Thanks, Sayantan. Hardware: --------- Two Dual Intel Xeon EM64T 3.4 GHz nodes PCI-Express I/O bus MT25208 Mellanox HCAs (rev a0) Software: --------- RedHat AS 4 2.6.11.6/2.6.11.7 kernel with Gen2 InfiniBand drivers Firmware version 5.0.1 OpenIB Gen2 drivers (user verbs from main branch) OpenSM (OpenIB version/IBGD 1.7.0 both of them result in the same) Both the machines display their ports as ACTIVE. [surs at x1:~] cat /sys/class/infiniband/mthca0/ports/1/state 4: ACTIVE [surs at x5:bin] lsmod | grep ib ib_uverbs 28056 0 ib_umad 17696 0 ib_mthca 113952 0 ib_mad 38576 2 ib_umad,ib_mthca ib_core 52352 4 ib_uverbs,ib_umad,ib_mthca,ib_mad libata 53000 1 ata_piix scsi_mod 151888 3 libata,aic79xx,sd_mod Now, if I try to run ibv_pingpong, I get this error: ---> [surs at x1:~] ibv_pingpong Segmentation fault [surs at x1:~] Message from syslogd at x1 at Mon Apr 11 18:37:18 2005 ... 
x1 kernel: invalid operand: 0000 [1] SMP <--- The relevant part from the kernel log: ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at pci_gart:537 invalid operand: 0000 [1] SMP CPU 0 Modules linked in: ib_uverbs ib_umad ib_mthca ib_mad ib_core parport_pc lp parport autofs4 nfs lockd sunrpc dm_mod video button battery ac md5 ipv6 uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core e1000 floppy ext3 jbd ata_piix libata aic79xx sd_mod scsi_mod Pid: 4034, comm: ibv_pingpong Not tainted 2.6.11.6 RIP: 0010:[] {dma_map_sg+223} RSP: 0018:ffff81001d4dfd58 EFLAGS: 00010246 RAX: 000000001b92b000 RBX: ffff81001764fbf8 RCX: 000000001b92b000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000ff0 R11: 0000000000000246 R12: ffff81001764f000 R13: ffff81001764fbf8 R14: 0000000000000001 R15: ffff81001f92c070 FS: 00002aaaaacca000(0000) GS:ffffffff804c6380(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000515000 CR3: 0000000013145000 CR4: 00000000000006e0 Process ibv_pingpong (pid: 4034, threadinfo ffff81001d4de000, task ffff81001dc641f0) Stack: ffff81001dc641f0 0000000000000000 0000000100000000 ffff81001764fbd0 ffff8100176f3000 ffff81001764f000 0000000000513000 0000000000000001 ffff81001facfa40 ffffffff8824b621 Call Trace:{:ib_mthca:mthca_map_user_db+366} {:ib_mthca:mthca_create_cq+115} {:ib_uverbs:ib_uverbs_create_cq+165} {:ib_uverbs:ib_uverbs_write+139} {vfs_write+207} {sys_write+69} {system_call+126} Code: 0f 0b 0f 91 32 80 ff ff ff ff 19 02 89 f8 49 8b 97 08 01 00 RIP {dma_map_sg+223} RSP ---------------------------------------------------------------- Now, if I try to run ibv_pingpong under gdb (sender side), I get it it to progress a little bit more (but not to completion). 
The receiver prints this now:

<---
[surs at x5:examples] ibv_pingpong 192.168.107.2
local address: LID 0x0002, QPN 0x000404, PSN 0x104788
remote address: LID 0x0001, QPN 0x000404, PSN 0x08b81e
[ 0] 00000404
[ 4] b3000000
[ 8] fd000003
[ c] 110000c0
[10] 15810000
[14] 00000010
[18] 00008002
[1c] ff100000
Failed status 12 for wr_id 2
--->

--
---------------------------------------------------------
Sayantan Sur             Graduate Research Assistant
395 Dreese Labs,         Computer Science and Engineering
Ohio State University,   Office : 774, Dreese Labs
Columbus,                email : surs at cse.ohio-state.edu
Ohio - 43210.            phone(res) : 614.688.9792
USA.                     phone(off) : 614.292.8501
---------------------------------------------------------

From roland at topspin.com Tue Apr 12 16:38:56 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 12 Apr 2005 16:38:56 -0700
Subject: [openib-general] segmentation fault with ibv_pingpong
In-Reply-To: <20050412232804.GA30632@cse.ohio-state.edu> (Sayantan Sur's message of "Tue, 12 Apr 2005 19:28:05 -0400")
References: <20050412232804.GA30632@cse.ohio-state.edu>
Message-ID: <52u0mbsgz3.fsf@topspin.com>

Thanks for the report. I think I have all the information I need and
I'll try to figure out what's happening.

 - R.

From roland at topspin.com Tue Apr 12 16:50:11 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 12 Apr 2005 16:50:11 -0700
Subject: [openib-general] segmentation fault with ibv_pingpong
In-Reply-To: <20050412232804.GA30632@cse.ohio-state.edu> (Sayantan Sur's message of "Tue, 12 Apr 2005 19:28:05 -0400")
References: <20050412232804.GA30632@cse.ohio-state.edu>
Message-ID: <52ll7nsggc.fsf@topspin.com>

OK, I think I see the problem. Can you please try this patch and let
me know if it helps?
Thanks,
  Roland

--- infiniband/hw/mthca/mthca_memfree.c (revision 2156)
+++ infiniband/hw/mthca/mthca_memfree.c (working copy)
@@ -384,6 +384,7 @@ int mthca_map_user_db(struct mthca_dev *
 	if (ret < 0)
 		goto out;
 
+	db_tab->page[i].mem.length = 4096;
 	db_tab->page[i].mem.offset = uaddr & ~PAGE_MASK;
 
 	ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);

From surs at cse.ohio-state.edu Tue Apr 12 17:15:19 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Tue, 12 Apr 2005 20:15:19 -0400
Subject: [openib-general] segmentation fault with ibv_pingpong
In-Reply-To: <52ll7nsggc.fsf@topspin.com>
References: <20050412232804.GA30632@cse.ohio-state.edu> <52ll7nsggc.fsf@topspin.com>
Message-ID: <425C6497.7060703@cse.ohio-state.edu>

Roland Dreier wrote:

>OK, I think I see the problem. Can you please try this patch and let
>me know if it helps?
>
>Thanks,
> Roland
>
>--- infiniband/hw/mthca/mthca_memfree.c (revision 2156)
>+++ infiniband/hw/mthca/mthca_memfree.c (working copy)
>@@ -384,6 +384,7 @@ int mthca_map_user_db(struct mthca_dev *
> 	if (ret < 0)
> 		goto out;
>
>+	db_tab->page[i].mem.length = 4096;
> 	db_tab->page[i].mem.offset = uaddr & ~PAGE_MASK;
>
> 	ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
>
>

Great! This helps. I can execute ibv_pingpong without any errors.

[surs at x5:~] ibv_pingpong
local address: LID 0x0001, QPN 0x000404, PSN 0x297822
remote address: LID 0x0002, QPN 0x000404, PSN 0x1c5b52
8192000 bytes in 0.02 seconds = 3403.05 Mbit/sec
1000 iters in 0.02 seconds = 19.26 usec/iter

Thanks,
Sayantan.
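
[Editor's sketch] A minimal userspace model of why the one-line patch in this thread matters. It assumes, and the thread does not confirm this, that the check behind "Kernel BUG at pci_gart:537" rejects a scatterlist entry whose length is zero. The names sg_entry, fill_db_entry, map_sg_model and SKETCH_PAGE_SIZE are hypothetical and only illustrate the idea; they are not the kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 4096 used in the patch. */
#define SKETCH_PAGE_SIZE 4096u

/* Simplified model of one scatterlist entry (not struct scatterlist). */
struct sg_entry {
	uint64_t page_addr;   /* page-aligned base address */
	unsigned int offset;  /* offset of the buffer within the page */
	unsigned int length;  /* number of bytes to map */
};

/*
 * Fill an entry for a user doorbell page at uaddr.
 * set_length == 0 models the code before the patch (length left at zero),
 * set_length != 0 models the code after the patch.
 */
static void fill_db_entry(struct sg_entry *e, uint64_t uaddr, int set_length)
{
	e->page_addr = uaddr & ~(uint64_t)(SKETCH_PAGE_SIZE - 1);
	/* Same idea as "uaddr & ~PAGE_MASK": keep only the in-page bits. */
	e->offset = (unsigned int)(uaddr & (SKETCH_PAGE_SIZE - 1));
	e->length = set_length ? SKETCH_PAGE_SIZE : 0;
}

/*
 * Model of a mapping routine that refuses zero-length entries.
 * Returns the number of entries mapped, or -1 where the assumed
 * kernel check would have fired.
 */
static int map_sg_model(const struct sg_entry *sg, size_t nents)
{
	size_t i;

	for (i = 0; i < nents; i++) {
		if (sg[i].length == 0)
			return -1;
	}
	return (int)nents;
}
```

In this model, an entry filled without a length is rejected and the same entry with the length set is accepted, which matches the before/after behaviour reported above.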

From roland at topspin.com Tue Apr 12 17:27:26 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 12 Apr 2005 17:27:26 -0700
Subject: [openib-general] segmentation fault with ibv_pingpong
In-Reply-To: <425C6497.7060703@cse.ohio-state.edu> (Sayantan Sur's message of "Tue, 12 Apr 2005 20:15:19 -0400")
References: <20050412232804.GA30632@cse.ohio-state.edu> <52ll7nsggc.fsf@topspin.com> <425C6497.7060703@cse.ohio-state.edu>
Message-ID: <52hdibseq9.fsf@topspin.com>

    Sayantan> Great! This helps. I can execute ibv_pingpong without
    Sayantan> any errors.

Thanks for testing. I committed this change to the subversion tree.

 - R.

From libor at topspin.com Tue Apr 12 18:04:47 2005
From: libor at topspin.com (Libor Michalek)
Date: Tue, 12 Apr 2005 18:04:47 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <20050411171347.7e05859f.akpm@osdl.org>; from akpm@osdl.org on Mon, Apr 11, 2005 at 05:13:47PM -0700
References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org>
Message-ID: <20050412180447.E6958@topspin.com>

On Mon, Apr 11, 2005 at 05:13:47PM -0700, Andrew Morton wrote:
> Roland Dreier wrote:
> >
> > Troy> Do we even need the mlock in userspace then?
> >
> > Yes, because the kernel may go through and unmap pages from userspace
> > while trying to swap. Since we have the page locked in the kernel,
> > the physical page won't go anywhere, but userspace might end up with a
> > different page mapped at the same virtual address.

With the last few kernels I haven't had a chance to retest the problem
that pushed us in the direction of using mlock. I will go back and do
so with the latest kernel. Below I've given a quick description of the
issue.

> That shouldn't happen. If get_user_pages() has elevated the refcount on a
> page then the following can happen:
>
> - The VM may decide to add the page to swapcache (if it's not mmapped
> from a file).
>
> - Once the page is backed by either swapcache of a (mmapped) file, the VM
> may decide the unmap the application's pte's. A later minor fault by the
> app will cause the same physical page to be remapped.

The driver did use get_user_pages() to elevate the refcount on all the
pages it was going to use for IO, as well as call set_page_dirty() since
the pages were going to have data written to them from the device.

The problem we were seeing is that the minor fault by the app resulted
in a new physical page getting mapped for the application. The page that
had the elevated refcount was still waiting for the data to be written
to by the driver at the time that the app accessed the page causing the
minor fault. Obviously, since the app had a new mapping, the data written
by the driver was lost.

It looks like code was added to try_to_unmap_one() to address this, so
hopefully it's no longer an issue...

-Libor

From eitan at mellanox.co.il Tue Apr 12 22:28:28 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 13 Apr 2005 08:28:28 +0300
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0F0@mtlex01.yok.mtl.com>

> -- Hal
> In looking at the unhealthy code, it appears to me that the unhealthy
> bit is only set if the SM receives traps 129-131 and not if the SMA does
> not respond to SM MADs so these ports will not be detected and hence not
> bypassed.
>
[EZ] This is true. Currently there is only one cause for the un-healthy
bits to be set - which are exactly as you point - these traps.
The point I was trying to make was that this bit is the mechanism for
flagging a port status is bad.

What I did recommend was to write a "statistical" analysis of Directed
Route packet drop - such that we can find the ports with a high drop
rate and mark them as un-healthy. If you mark every port that does not
respond to a MAD as un-healthy you can suffer from flaky links somewhere
on the route to that port. Only analysis of the number of good packets
vs. dropped packets can lead you to the right bad port.

From eitan at mellanox.co.il Tue Apr 12 22:50:25 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 13 Apr 2005 08:50:25 +0300
Subject: [openib-general] OpenSM (again)
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0F2@mtlex01.yok.mtl.com>

FYI: OpenSM implements master handover in a "lazy" or "less intrusive"
manner:

OpenSM will only handoff a subnet to the new master on a heavy sweep
sequence. So if you start an SM and then start one with higher priority -
the handoff will not happen unless there was some change in the subnet
(trap or switch "change bit").

The main reason for this behavior is the concept of "light sweep" that
minimizes the discovery to checking of "change bits" and now also
"irresponsive ports". So the new SM is not even discovered by the SM.

The benefit is that as long as there is no change in the subnet the
active SM does not transfer the ownership to the new one - which has an
overhead on the entire subnet (client re-registration or even LID
changes).

This behavior is compliant as the spec says:

C14-60.2.1: If a Master SM finds another Master SM with lower priority
(or same priority and higher GUID) it shall ensure that it is the
highest priority (or same priority and lower GUID) on the subnet, and if
so it shall wait for the other Master (or Masters) to relinquish control
if its portion of the subnet.
C14-61.2.2: If a Master SM determines that a lower priority Master SM has not performed a handover within a vendor-specific time period, then it shall not change the state of the subnet. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, April 12, 2005 8:00 PM > To: rf at q-leap.de > Cc: openib-general at openib.org > Subject: Re: [openib-general] OpenSM (again) > > On Tue, 2005-04-12 at 12:46, Roland Fehrenbacher wrote: > > Hal> SM election occurs per high priority low GUID. So if you > > Hal> don't care which SM is the master than you don't need to do > > Hal> anything. If you want a specific order (and it is not in GUID > > Hal> order) then you need to specify priority. > > > > Ok. I tried this, specifying priority 0 on one server, and priority 15 > > on another one. I assume priority 15, will be the master. > > If I first start the priority 0 opensm, and then the priority 15 one, > > things look normal: Log excerpts > > > > priority 0 server > > > > Apr 12 18:41:06 [4000] -> OpenSM Rev:openib-1.0.0 > > Apr 12 18:41:06 [4000] -> osm_opensm_init: Forcing single threaded dispatcher. > > Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 > num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > > Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 > num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > > Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2. > > Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2. > > Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an > Invalid Delete Request. > > Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an > Invalid Delete Request. 
> > Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an > Invalid Delete Request. > > Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an > Invalid Delete Request. > > Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic > Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000011 > > Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 > num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c902004013c2 > > Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic > Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000d > > Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 > num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a > > Apr 12 18:42:25 [18007] -> __osm_trap_rcv_process_request: Received Generic > Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000e > > Apr 12 18:42:25 [18007] -> osm_report_notice: Reporting Generic Notice type:4 > num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a > > > > priority 15 server > > > > Apr 12 18:42:25 [4000] -> OpenSM Rev:openib-1.0.0 > > Apr 12 18:42:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher. > > Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 > num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > > Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 > num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 > > Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a. > > Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a. > > Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an > Invalid Delete Request. > > Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an > Invalid Delete Request. 
> > Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an > Invalid Delete Request. > > Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an > Invalid Delete Request. > > > > When I kill the priority 15 server however, the priority 0 server runs > > amok with continous log messages like: > > > > Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 > attr=20) -- dropping. > > Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 > attr=20) -- dropping. > > Attribute 0x20 is SMInfo. This is just the SubnGet(SMInfo) from the > priority 0 server failing (no matching SubnGetResp received) which is > "normal" if you killed the priority 15 server. > > Do the messages ever subside ? > > > I assume that the handover to the priority 0 opensm hasn't worked > > then. > > This isn't really handover but that is another matter. > You should be able to use the sminfo diag to see whether this SM has > assumed the MASTER role. > > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From tziporet at mellanox.co.il Wed Apr 13 01:19:44 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 13 Apr 2005 11:19:44 +0300 Subject: [openib-general] Re: uverbs events Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF35D@mtlex01.yok.mtl.com> Well this is the description of this bug, but we got this problem of 64-bits arithmetic on a simple if and not switch. >From this reason I suggested to give it a try. 

Tziporet

-----Original Message-----
From: Roland Dreier [mailto:roland at topspin.com]
Sent: Tuesday, April 12, 2005 6:39 PM
To: Tziporet Koren
Cc: Grant Grundler; openib-general at openib.org
Subject: Re: [openib-general] Re: uverbs events

    Tziporet> Very important - there is a bug in gcc version 3.4.2
    Tziporet> that had been fixed in gcc 3.4.3. This bug ((# 17581)
    Tziporet> heart us in VAPI when full optimizations is working in
    Tziporet> bits or on 64 bits systems.

Thanks, but if the bug you're talking about is
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17581 then I don't think
that's going to affect us -- we don't seem to do any 64-bit arithmetic
inside a switch statement.

 - R.

From halr at voltaire.com Wed Apr 13 02:00:05 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Apr 2005 05:00:05 -0400
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0F0@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF0F0@mtlex01.yok.mtl.com>
Message-ID: <1113380170.4479.18.camel@localhost.localdomain>

On Wed, 2005-04-13 at 01:28, Eitan Zahavi wrote:
> [EZ] This is true. Currently there is only one cause for the
> un-healthy bits to be set - which are exactly as you point - these
> traps. The point I was trying to make was that this bit is the
> mechanism for flagging a port status is bad.
>
> What I did recommend was to write a "statistical" analysis of Directed
> Route packet drop - such that we can find the ports with a high drop
> rate and mark them as un-healthy. If you mark every port that does not
> respond to a MAD as un-healthy you can suffer from flaky links
> somewhere on the route to that port. Only analysis of the number of
> good packets vs. dropped packets can lead you to the right bad port.

The original proposal on this said the following:

"The OpenSM will implement a configurable policy (some number of
consecutive lack of responses to SM requests). At the point of
exhaustion of the timeout/retry strategy, that port will be marked as
"bad" by OpenSM."

Any idea on what might make a good default threshold (for consecutive
retries) ? Do you think there is no sufficient default ?

If a link is flaky and MADs can't get through, should it be used for non
MAD traffic ?

Also note that the proposal also said:

"Also, there could also be a periodic "ping" at a slower rate to check
if the "bad" ports revive."

In terms of analysis of good v. errored and dropped packets (along the
path to that node), there are OpenIB diagnostic tools to help with this.

-- Hal

From eitan at mellanox.co.il Wed Apr 13 02:20:52 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 13 Apr 2005 12:20:52 +0300
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0F9@mtlex01.yok.mtl.com>

I probably did not make my point very clear:

It is bad (not to say wrong) to disqualify a port and mark it as a bad
port if it did not respond to queries. The cause of the issue might be a
flaky link on the directed route to the port. If the SM were able to
find that flaky link port it would avoid marking the wrong ports.

Moreover, the port that was almost marked as bad by the simplistic
algorithm you propose will be discovered and operational, as there are
many other paths to reach it - walking around the real bad port !

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Wednesday, April 13, 2005 12:00 PM
> To: Eitan Zahavi
> Cc: openib-general at openib.org
> Subject: RE: [openib-general] SM Bad Port Handling
>
> On Wed, 2005-04-13 at 01:28, Eitan Zahavi wrote:
> > [EZ] This is true.
> > Currently there is only one cause for the
> > un-healthy bits to be set - which are exactly as you point - these
> > traps. The point I was trying to make was that this bit is the
> > mechanism for flagging a port status is bad.
> >
> > What I did recommend was to write a "statistical" analysis of Directed
> > Route packet drop - such that we can find the ports with a high drop
> > rate and mark them as un-healthy. If you mark every port that does not
> > respond to a MAD as un-healthy you can suffer from flaky links
> > somewhere on the route to that port. Only analysis of the number of
> > good packets vs. dropped packets can lead you to the right bad port.
>
> The original proposal on this said the following:
>
> "The OpenSM will implement a configurable policy (some number of
> consecutive lack of responses to SM requests). At the point of
> exhaustion of the timeout/retry strategy, that port will be marked as
> "bad" by OpenSM."
>
> Any idea on what might make a good default threshold (for consecutive
> retries) ? Do you think there is no sufficient default ?
>
> If a link is flaky and MADs can't get through, should it be used for non
> MAD traffic ?
>
> Also note that the proposal also said:
>
> "Also, there could also be a periodic "ping" at a slower rate to check
> if the "bad" ports revive."
>
> In terms of analysis of good v. errored and dropped packets (along the
> path to that node), there are OpenIB diagnostic tools to help with this.
>
> -- Hal

From shaharf at voltaire.com Wed Apr 13 07:03:21 2005
From: shaharf at voltaire.com (shaharf)
Date: Wed, 13 Apr 2005 17:03:21 +0300
Subject: [openib-general] SM Bad Port Handling
Message-ID:

Eitan,

Your analysis is not completely accurate. The SM configures the subnet
using direct mads only, and it builds a spanning tree of direct routes.
What I want to say is that it doesn't matter why exactly a port is
unreachable.
Once a port cannot be reached, you can retry the entire heavy sweep
process, but if the problem repeats itself (X times) on the same port,
you have no alternative other than to disable it. If the SM had an
alternative method of building direct paths, then such an alternative
path could be attempted. Currently it is not relevant.

Speaking of "statistical analysis", what are the odds that a port will
behave well when it is queried directly, but starts to lose packets when
a direct route is routed through it, and behaves consistently during all
retries? Again, even if this is the case (and in understatement, I am
not sure how frequent it is), the port behind it is unreachable and
therefore "bad".

The current unhealthy port mechanism is not redundant to this "bad" port
mechanism because it does not handle the same case. Both mechanisms are
required. Whether they can share the same status bit is really an
implementation issue. Relying on traps is very problematic in some
cases, particularly in the initial bring-up sweep when the SM LID is not
even configured (remember VTEC?).

Shahar

________________________________________
From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Eitan Zahavi
Sent: Wednesday, April 13, 2005 11:21 AM
To: Hal Rosenstock; Eitan Zahavi
Cc: openib-general at openib.org
Subject: RE: [openib-general] SM Bad Port Handling

I probably did not make point very clear:

It is bad (not to say wrong) to disqualify a port and mark it as bad
port if it did not respond to queries. The cause of the issue might be a
flaky link on the directed route to the port. If the SM would be able to
find that flaky link port it would avoid marking the wrong ports.

More over, the port that was almost marked as bad by the simplistic
algorithm you propose will be discovered and operational as there many
other paths to reach it - walking around the real bad port !
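
[Editor's sketch] The two positions in this thread can be contrasted with a small model of the "statistical" approach: every directed-route MAD is charged to each port on its path, and the port with the highest drop ratio is blamed, rather than the unreachable endpoint. This is only a sketch under stated assumptions; port_stats, record_mad and worst_port are hypothetical names and are not part of OpenSM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-port counters for directed-route MADs. */
struct port_stats {
	unsigned int sent;     /* MADs whose route crossed this port */
	unsigned int dropped;  /* of those, MADs that got no response */
};

/*
 * Charge one MAD to every port on its directed route.
 * path[] holds port indices into stats[]; answered is non-zero
 * when a response was received.
 */
static void record_mad(struct port_stats *stats, const int *path,
		       size_t hops, int answered)
{
	size_t i;

	for (i = 0; i < hops; i++) {
		stats[path[i]].sent++;
		if (!answered)
			stats[path[i]].dropped++;
	}
}

/*
 * Return the index of the port with the highest drop ratio strictly
 * above threshold, considering only ports with at least min_samples
 * MADs routed through them. Returns -1 if no port qualifies.
 */
static int worst_port(const struct port_stats *stats, size_t nports,
		      unsigned int min_samples, double threshold)
{
	int worst = -1;
	double worst_ratio = threshold;
	size_t i;

	for (i = 0; i < nports; i++) {
		double ratio;

		if (stats[i].sent < min_samples)
			continue;
		ratio = (double)stats[i].dropped / (double)stats[i].sent;
		if (ratio > worst_ratio) {
			worst = (int)i;
			worst_ratio = ratio;
		}
	}
	return worst;
}
```

With a topology where port 3 is reachable both through port 2 and through port 1, drops on routes through port 2 raise port 2's ratio to the top while port 3 keeps some successful responses via port 1, so the model blames the intermediate port. A consecutive-failure counter on the endpoint alone would have marked port 3 instead.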
Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, April 13, 2005 12:00 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: RE: [openib-general] SM Bad Port Handling > > On Wed, 2005-04-13 at 01:28, Eitan Zahavi wrote: > > [EZ] This is true. Currently there is only one cause for the > > un-healthy bits to be set - which are exactly as you point - these > > traps. The point I was trying to make was that this bit is the > > mechanism for flagging a port status is bad. > > > > What I did recommend was to write a "statistical" analysis of Directed > > Route packet drop - such that we can find the ports with a high drop > > rate and mark them as un-healthy. If you mark every port that does not > > respond to a MAD as un-healthy you can suffer from flaky links > > somewhere on the route to that port. Only analysis of the number of > > good packets vs. dropped packets can lead you to the right bad port. > > The original proposal on this said the following: > > "The OpenSM will implement a configurable policy (some number of > consecutive lack of responses to SM requests). At the point of > exhaustion of the timeout/retry strategy, that port will be marked as > "bad" by OpenSM." > > Any idea on what might make a good default threshold (for consecutive > retries) ? Do you think there is no sufficient default ? > > If a link is flaky and MADs can't get through, should it be used for non > MAD traffic ? > > Also note that the proposal also said: > > "Also, there could also be a periodic "ping" at a slower rate to check > if the "bad" ports revive." > > In terms of analysis of good v. errored and dropped packets (along the > path to that node), there are OpenIB diagnostic tools to help with this. 
>
> -- Hal

From roland at topspin.com Wed Apr 13 09:22:37 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 13 Apr 2005 09:22:37 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6064BF35D@mtlex01.yok.mtl.com> (Tziporet Koren's message of "Wed, 13 Apr 2005 11:19:44 +0300")
References: <506C3D7B14CDD411A52C00025558DED6064BF35D@mtlex01.yok.mtl.com>
Message-ID: <52d5syr6ia.fsf@topspin.com>

    Tziporet> Well this is the description of this bug, but we got
    Tziporet> this problem of 64-bits arithmetic on a simple if and
    Tziporet> not switch.

OK, thanks. Do you know of any distributions that are shipping gcc
3.4.2?

 - R.

From roland at topspin.com Wed Apr 13 11:28:03 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 13 Apr 2005 11:28:03 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <20050412182357.GA24047@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 12 Apr 2005 21:23:57 +0300")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <521x9gyhe7.fsf@topspin.com> <20050412182357.GA24047@mellanox.co.il>
Message-ID: <52sm1upm4s.fsf@topspin.com>

OK, I'm by no means an expert on this, but Libor and I looked at
rmap.c a little more, and there is code:

	if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) ||
			ptep_clear_flush_young(vma, address, pte)) {
		ret = SWAP_FAIL;
		goto out_unmap;
	}

before the check

	if (PageSwapCache(page) &&
	    page_count(page) != page_mapcount(page) + 2) {
		ret = SWAP_FAIL;
		goto out_unmap;
	}

If userspace allocates some memory but doesn't touch it aside from
passing the address in to the kernel, which does get_user_pages(), the
PTE will be young in that first test, right?
Does that mean that the userspace mapping will be cleared and userspace
will get a different physical page if it faults that address back in?

 - R.

From akpm at osdl.org Wed Apr 13 12:32:30 2005
From: akpm at osdl.org (Andrew Morton)
Date: Wed, 13 Apr 2005 12:32:30 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <52sm1upm4s.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <521x9gyhe7.fsf@topspin.com> <20050412182357.GA24047@mellanox.co.il> <52sm1upm4s.fsf@topspin.com>
Message-ID: <20050413123230.7a18dff5.akpm@osdl.org>

Roland Dreier wrote:
>
> OK, I'm by no means an expert on this, but Libor and I looked at
> rmap.c a little more, and there is code:
>
> 	if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) ||
> 			ptep_clear_flush_young(vma, address, pte)) {
> 		ret = SWAP_FAIL;
> 		goto out_unmap;
> 	}
>
> before the check
>
> 	if (PageSwapCache(page) &&
> 	    page_count(page) != page_mapcount(page) + 2) {
> 		ret = SWAP_FAIL;
> 		goto out_unmap;
> 	}
>
> If userspace allocates some memory but doesn't touch it aside from
> passing the address in to the kernel, which does get_user_pages(), the
> PTE will be young in that first test, right?

If get_user_pages() was called with write=1, get_user_pages() will fault
in a real page and yes, I guess it'll be pte_young.

If get_user_pages() was called with write=0, get_user_pages() will fault
in a mapping of the zero page and we'd never get this far.

> Does that mean that
> the userspace mapping will be cleared and userspace will get a
> different physical page if it faults that address back in?
>

We won't try to unmap a page's ptes until that page has
file-or-swapcache backing.
If the pte is then cleared, a subsequent minor fault will reestablish the mapping to the same physical page. A major fault cannot happen because the page was pinned by get_user_pages(). From halr at voltaire.com Wed Apr 13 13:33:59 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2005 16:33:59 -0400 Subject: [openib-general] Re: [PATCH] teach ifconfig about ib [WAS: Latest IPoIB Bringup Questions] In-Reply-To: <1107811374.6917.6.camel@duffman> References: <1098985903.17991.74.camel@hpc-1> <1107811374.6917.6.camel@duffman> Message-ID: <1113424438.4479.140.camel@localhost.localdomain> Hi Tom, On Mon, 2005-02-07 at 16:22, Tom Duffy wrote: > [Responding to an old message] > > On Thu, 2004-10-28 at 13:51 -0400, Hal Rosenstock wrote: > > Should we teach ifconfig to display Link Encap: INFINIBAND ? > > Still has the problem of truncating the address to the first 14 bytes. I finally did this. The HWaddr does not look right to me. Although the formatting is correct now, it doesn't contain the port GUID as I think it should. Does it appear for you ? Thanks. -- Hal ./ifconfig ib0 ib0 Link encap:InfiniBand HWaddr 00:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 /usr/local/ib/bin/ibstatus Infiniband device 'mthca0' port 1 status: default gid: fe80:0000:0000:0000:0008:f104:0396:0559 /usr/local/ib/bin/ibstat CA 'mthca0' ... Node GUID: 0x0008f10403960558 ... Port 1: ... Port GUID: 0x0008f10403960559 Port 2: ... Port GUID: 0x0008f1040396055a > diff -Nur net-tools-1.60/config.in net-tools-1.60-ib/config.in > --- net-tools-1.60/config.in 2000-05-21 07:32:12.000000000 -0700 > +++ net-tools-1.60-ib/config.in 2005-02-07 10:45:14.108286619 -0800 > @@ -82,6 +82,7 @@ > bool '(Cisco)-HDLC/LAPB support' HAVE_HWHDLCLAPB n > bool 'IrDA support' HAVE_HWIRDA y > bool 'Econet hardware support' HAVE_HWEC n > +bool 'InfiniBand hardware support' HAVE_HWIB y > * > * > * Other Features. 
> diff -Nur net-tools-1.60/config.make net-tools-1.60-ib/config.make > --- net-tools-1.60/config.make 2005-02-07 11:58:18.536146922 -0800 > +++ net-tools-1.60-ib/config.make 2005-02-07 12:04:03.596462891 -0800 > @@ -30,6 +30,7 @@ > HAVE_HWHDLCLAPB=1 > HAVE_HWIRDA=1 > HAVE_HWEC=1 > +HAVE_HWIB=1 > HAVE_FW_MASQUERADE=1 > HAVE_IP_TOOLS=1 > HAVE_MII=1 > Binary files net-tools-1.60/ipmaddr and net-tools-1.60-ib/ipmaddr differ > Binary files net-tools-1.60/iptunnel and net-tools-1.60-ib/iptunnel differ > diff -Nur net-tools-1.60/lib/hw.c net-tools-1.60-ib/lib/hw.c > --- net-tools-1.60/lib/hw.c 2000-05-20 11:27:25.000000000 -0700 > +++ net-tools-1.60-ib/lib/hw.c 2005-02-07 09:56:22.315428035 -0800 > @@ -73,6 +73,8 @@ > > extern struct hwtype ec_hwtype; > > +extern struct hwtype ib_hwtype; > + > static struct hwtype *hwtypes[] = > { > > @@ -144,6 +146,9 @@ > #if HAVE_HWX25 > &x25_hwtype, > #endif > +#if HAVE_HWIB > + &ib_hwtype, > +#endif > &unspec_hwtype, > NULL > }; > @@ -217,6 +222,9 @@ > #if HAVE_HWEC > ec_hwtype.title = _("Econet"); > #endif > +#if HAVE_HWIB > + ib_hwtype.title = _("InfiniBand"); > +#endif > sVhwinit = 1; > } > > diff -Nur net-tools-1.60/lib/ib.c net-tools-1.60-ib/lib/ib.c > --- net-tools-1.60/lib/ib.c 1969-12-31 16:00:00.000000000 -0800 > +++ net-tools-1.60-ib/lib/ib.c 2005-02-07 12:55:04.635559244 -0800 > @@ -0,0 +1,147 @@ > +/* > + * lib/ib.c This file contains an implementation of the "Infiniband" > + * support functions. > + * > + * Version: $Id: ib.c,v 1.1 2005/02/06 11:00:47 tduffy Exp $ > + * > + * Author: Fred N. van Kempen, > + * Copyright 1993 MicroWalt Corporation > + * Tom Duffy > + * > + * This program is free software; you can redistribute it > + * and/or modify it under the terms of the GNU General > + * Public License as published by the Free Software > + * Foundation; either version 2 of the License, or (at > + * your option) any later version. 
> + */
> +#include "config.h"
> +
> +#if HAVE_HWIB
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include "net-support.h"
> +#include "pathnames.h"
> +#include "intl.h"
> +#include "util.h"
> +
> +extern struct hwtype ib_hwtype;
> +
> +
> +/* Display an InfiniBand address in readable format. */
> +static char *pr_ib(unsigned char *ptr)
> +{
> +    static char buff[128];
> +    char *pos;
> +    unsigned int i;
> +
> +    pos = buff;
> +    for (i = 0; i < INFINIBAND_ALEN; i++) {
> +        pos += sprintf(pos, "%02X:", (*ptr++ & 0377));
> +    }
> +    buff[strlen(buff) - 1] = '\0';
> +
> +    /* snprintf(buff, sizeof(buff), "%02X:%02X:%02X:%02X:%02X:%02X",
> +       (ptr[0] & 0377), (ptr[1] & 0377), (ptr[2] & 0377),
> +       (ptr[3] & 0377), (ptr[4] & 0377), (ptr[5] & 0377)
> +       );
> +     */
> +    return (buff);
> +}
> +
> +
> +/* Input an Infiniband address and convert to binary. */
> +static int in_ib(char *bufp, struct sockaddr *sap)
> +{
> +    unsigned char *ptr;
> +    char c, *orig;
> +    int i;
> +    unsigned val;
> +
> +    sap->sa_family = ib_hwtype.type;
> +    ptr = sap->sa_data;
> +
> +    i = 0;
> +    orig = bufp;
> +    while ((*bufp != '\0') && (i < INFINIBAND_ALEN)) {
> +        val = 0;
> +        c = *bufp++;
> +        if (isdigit(c))
> +            val = c - '0';
> +        else if (c >= 'a' && c <= 'f')
> +            val = c - 'a' + 10;
> +        else if (c >= 'A' && c <= 'F')
> +            val = c - 'A' + 10;
> +        else {
> +#ifdef DEBUG
> +            fprintf(stderr, _("in_ib(%s): invalid infiniband address!\n"), orig);
> +#endif
> +            errno = EINVAL;
> +            return (-1);
> +        }
> +        val <<= 4;
> +        c = *bufp;
> +        if (isdigit(c))
> +            val |= c - '0';
> +        else if (c >= 'a' && c <= 'f')
> +            val |= c - 'a' + 10;
> +        else if (c >= 'A' && c <= 'F')
> +            val |= c - 'A' + 10;
> +        else if (c == ':' || c == 0)
> +            val >>= 4;
> +        else {
> +#ifdef DEBUG
> +            fprintf(stderr, _("in_ib(%s): invalid infiniband address!\n"), orig);
> +#endif
> +            errno = EINVAL;
> +            return (-1);
> +        }
> +        if (c != 0)
> +            bufp++;
> +        *ptr++ = (unsigned char)
(val & 0377);
> +        i++;
> +
> +        /* We might get a colon here - not required. */
> +        if (*bufp == ':') {
> +            if (i == INFINIBAND_ALEN) {
> +#ifdef DEBUG
> +                fprintf(stderr, _("in_ib(%s): trailing : ignored!\n"),
> +                        orig)
> +#endif
> +                    ; /* nothing */
> +            }
> +            bufp++;
> +        }
> +    }
> +
> +    /* That's it. Any trailing junk? */
> +    if ((i == INFINIBAND_ALEN) && (*bufp != '\0')) {
> +#ifdef DEBUG
> +        fprintf(stderr, _("in_ib(%s): trailing junk!\n"), orig);
> +        errno = EINVAL;
> +        return (-1);
> +#endif
> +    }
> +#ifdef DEBUG
> +    fprintf(stderr, "in_ib(%s): %s\n", orig, pr_ib(sap->sa_data));
> +#endif
> +
> +    return (0);
> +}
> +
> +
> +struct hwtype ib_hwtype =
> +{
> +    "infiniband", NULL, ARPHRD_INFINIBAND, INFINIBAND_ALEN,
> +    pr_ib, in_ib, NULL
> +};
> +
> +
> +#endif /* HAVE_HWIB */
> diff -Nur net-tools-1.60/lib/Makefile net-tools-1.60-ib/lib/Makefile
> --- net-tools-1.60/lib/Makefile 2000-10-28 03:59:42.000000000 -0700
> +++ net-tools-1.60-ib/lib/Makefile 2005-02-07 10:02:14.662640164 -0800
> @@ -16,7 +16,7 @@
>  #
>
>
> -HWOBJS = hw.o loopback.o slip.o ether.o ax25.o ppp.o arcnet.o tr.o tunnel.o frame.o sit.o rose.o ash.o fddi.o hippi.o hdlclapb.o strip.o irda.o ec_hw.o x25.o
> +HWOBJS = hw.o loopback.o slip.o ether.o ax25.o ppp.o arcnet.o tr.o tunnel.o frame.o sit.o rose.o ash.o fddi.o hippi.o hdlclapb.o strip.o irda.o ec_hw.o x25.o ib.o
>  AFOBJS = unix.o inet.o inet6.o ax25.o ipx.o ddp.o ipx.o netrom.o af.o rose.o econet.o x25.o
>  AFGROBJS = inet_gr.o inet6_gr.o ipx_gr.o ddp_gr.o netrom_gr.o ax25_gr.o rose_gr.o getroute.o x25_gr.o
>  AFSROBJS = inet_sr.o inet6_sr.o netrom_sr.o ipx_sr.o setroute.o x25_sr.o
> Binary files net-tools-1.60/mii-tool and net-tools-1.60-ib/mii-tool differ
>

From jlentini at netapp.com Wed Apr 13 13:49:15 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 13 Apr 2005 16:49:15 -0400 (EDT)
Subject: [openib-general] target InfiniBand release
Message-ID:

Which version of the InfiniBand specification does the gen2 stack target?
Release 1.0, 1.1, or 1.2?

james

From halr at voltaire.com Wed Apr 13 13:51:43 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Apr 2005 16:51:43 -0400
Subject: [openib-general] target InfiniBand release
In-Reply-To:
References:
Message-ID: <1113425503.4479.155.camel@localhost.localdomain>

On Wed, 2005-04-13 at 16:49, James Lentini wrote:
> Which version of the InfiniBand specification does the gen2 stack
> target? Release 1.0, 1.1, or 1.2?

Mostly 1.1 but a little 1.2

-- Hal

From mshefty at ichips.intel.com Wed Apr 13 13:57:48 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 13 Apr 2005 13:57:48 -0700
Subject: [openib-general] target InfiniBand release
In-Reply-To:
References:
Message-ID: <425D87CC.9080109@ichips.intel.com>

James Lentini wrote:
>
> Which version of the InfiniBand specification does the gen2 stack
> target? Release 1.0, 1.1, or 1.2?

I reference the 1.2 version of the spec when coding.

- Sean

From tduffy at sun.com Wed Apr 13 14:27:54 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 13 Apr 2005 14:27:54 -0700
Subject: [openib-general] Re: [PATCH] teach ifconfig about ib [WAS: Latest IPoIB Bringup Questions]
In-Reply-To: <1113424438.4479.140.camel@localhost.localdomain>
References: <1098985903.17991.74.camel@hpc-1> <1107811374.6917.6.camel@duffman> <1113424438.4479.140.camel@localhost.localdomain>
Message-ID: <1113427674.26977.0.camel@duffman>

On Wed, 2005-04-13 at 16:33 -0400, Hal Rosenstock wrote:
> Hi Tom,
>
> On Mon, 2005-02-07 at 16:22, Tom Duffy wrote:
> > [Responding to an old message]
> >
> > On Thu, 2004-10-28 at 13:51 -0400, Hal Rosenstock wrote:
> > > Should we teach ifconfig to display Link Encap: INFINIBAND ?
> >
> > Still has the problem of truncating the address to the first 14 bytes.
>
> I finally did this. The HWaddr does not look right to me. Although the
> formatting is correct now, it doesn't contain the port GUID as I think
> it should. Does it appear for you ? Thanks.

No, the address is always truncated.
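
[Editorial sketch, not part of the original thread.] The truncation discussed above is consistent with the size mismatch between a 20-byte IPoIB link-layer address (flags/QPN followed by a 16-byte GID, whose last 8 bytes carry the port GUID) and the 14-byte sa_data array of struct sockaddr. The C fragment below is a minimal illustration under those assumptions; IB_HW_ALEN, format_hwaddr() and bytes_that_fit_in_sockaddr() are hypothetical names and do not come from the net-tools patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* IPoIB link-layer address length; defined locally (hypothetical name)
 * so the sketch does not depend on kernel headers. */
#define IB_HW_ALEN 20

/* Hypothetical helper: write `len` bytes as colon-separated upper-case
 * hex. Returns the number of characters written (excluding the NUL),
 * or -1 if `out` is too small or `len` is zero. */
static int format_hwaddr(const unsigned char *addr, size_t len,
                         char *out, size_t outlen)
{
    size_t i, pos = 0;

    /* Each byte needs "XX:", and the final ':' slot holds the NUL. */
    if (len == 0 || outlen < len * 3)
        return -1;
    for (i = 0; i < len; i++) {
        int last = (i + 1 == len);

        snprintf(out + pos, outlen - pos, "%02X%s",
                 (unsigned int)addr[i], last ? "" : ":");
        pos += last ? 2 : 3;
    }
    return (int)pos;
}

/* Number of hardware-address bytes that can be stored in the sa_data
 * member of struct sockaddr. */
static size_t bytes_that_fit_in_sockaddr(size_t hw_len)
{
    size_t room = sizeof(((struct sockaddr *)0)->sa_data);

    return hw_len < room ? hw_len : room;
}
```

On a typical Linux system bytes_that_fit_in_sockaddr(IB_HW_ALEN) yields 14, so any interface that passes the address through a plain struct sockaddr can carry only part of the GID and loses most of the port GUID.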
-tduffy

From tziporet at mellanox.co.il Wed Apr 13 23:08:28 2005
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 14 Apr 2005 09:08:28 +0300
Subject: [openib-general] Re: uverbs events
Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF376@mtlex01.yok.mtl.com>

Fedora Core 3 and RH AS 4 already use gcc 3.4.3.

Tziporet

-----Original Message-----
From: Roland Dreier [mailto:roland at topspin.com]
Sent: Wednesday, April 13, 2005 7:23 PM
To: Tziporet Koren
Cc: Grant Grundler; openib-general at openib.org
Subject: Re: [openib-general] Re: uverbs events

    Tziporet> Well this is the description of this bug, but we got
    Tziporet> this problem of 64-bits arithmetic on a simple if and
    Tziporet> not switch.

OK, thanks. Do you know of any distributions that are shipping gcc 3.4.2?

 - R.

From eitan at mellanox.co.il Wed Apr 13 23:29:08 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 14 Apr 2005 09:29:08 +0300
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF101@mtlex01.yok.mtl.com>

Hi Shahar,

> Your analysis is not completely accurate. The SM configure the
> subnet using direct mads only, and it builds a spanning tree of direct
> routes. What I want to say, is that that it doesn't matter why exactly a
> port is unreachable. Once a port can not be reached, you can either
> retry the entire heavy sweep process, but if the problem repeats itself
> (X times) on the same port, you have no alternative other then disable
> it.

The point is that the real "bad" ports are not the ones that are killing 100% of packets
(since they will simply have a "DOWN" state and vanish).
The real bad ports are the ones that pass < 25% (as we use retry of 4) of the packets that go through them.
When such a port happens to be on a switch, it will normally cause other ports to appear "bad" - NOT ITSELF !
The reason is that the number of packets sent through a switch port (not a leaf switch port) is much larger
than the number of packets that deal with the discovery of the port itself. All the ports "behind" the switch
port are reached through that port, and there is a much higher chance for ALL the packets that go to an
end-port to be dropped than for ALL the packets that go through the switch port to be dropped.

So if you implement the feature the way it was proposed, you will end up disconnecting end-ports and not
the real bad port. Why is it bad? It is bad since in a tree topology the end-ports always have an alternate
path to the SM. If you could find the real flaky bad port, you could still communicate with all the end-ports.

So how do we find that bad port/cable that causes other ports to appear bad?
We have internally had many long discussions on this topic. The algorithm is not fully developed yet.
But several things are clear:
1. One needs to track the number of successful and bad packets flowing through each port,
   such that a failure rate can be obtained for each port.
2. Topology based analysis should be used to find the common point that is first to have a high
   drop rate on the directed route tree.
3. Alternate directed routes might be used to invalidate "suspicious" ports.

In any case, I was not proposing relying on traps. I was suggesting to use the "healthy" bit on physical
ports as the way to carry the information about "bad" ports (once we correctly find them) into the rest of
the algorithms used by the SM.

Regarding the need to "disconnect" a bad HCA "end-port" - I still have not seen any log
showing OpenSM going through infinite "polling" of bad ports.
As I know the code - I can not believe this is possible - so unless you have a log that shows this
phenomenon (and not another one) please do not change this path.

One last word. I would highly recommend using the management simulator for setting arbitrary (random)
bad packet drops and testing any algorithm you might think of.

EZ

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: shaharf [mailto:shaharf at voltaire.com]
> Sent: Wednesday, April 13, 2005 5:03 PM
> To: Eitan Zahavi; Hal Rosenstock
> Cc: openib-general at openib.org
> Subject: RE: [openib-general] SM Bad Port Handling
>
> Eitan,
>
> Your analysis is not completely accurate. The SM configure the
> subnet using direct mads only, and it builds a spanning tree of direct
> routes. What I want to say, is that that it doesn't matter why exactly a
> port is unreachable. Once a port can not be reached, you can either
> retry the entire heavy sweep process, but if the problem repeats itself
> (X times) on the same port, you have no alternative other then disable
> it. If the SM will have an alternative method of building direct paths,
> then such alternative path could be attempted. Currently it is not
> relevant. Speaking of "statistical analysis", what are the odds that a
> port will behave well when it is queried directly, but starts to loose
> packets when a direct route is routed through it, and behave
> consistently during all retries? Again, even if this is the case (and in
> understatement, I am not sure how frequent it is), the port behind it is
> unreachable and therefore "bad".
>
> The current unhealthy port mechanism is not redundant to this "bad" port
> mechanism because it does not handle the same case. Both mechanisms are
> required. The issue if they can share the same status bit is really an
> implementation issue.
>
> Relying of traps is very problematic in some cases, particularly in
> initial bring up sweep when the SM lid is not even configured (remember
> VTEC?).
>
> Shahar
>
>
> ________________________________________
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Eitan Zahavi
> Sent: Wednesday, April 13, 2005 11:21 AM
> To: Hal Rosenstock; Eitan Zahavi
> Cc: openib-general at openib.org
> Subject: RE: [openib-general] SM Bad Port Handling
>
> I probably did not make point very clear:
> It is bad (not to say wrong) to disqualify a port and mark it as bad
> port if it did not respond to queries.
> The cause of the issue might be a flaky link on the directed route to
> the port.
> If the SM would be able to find that flaky link port it would avoid
> marking the wrong ports. More over, the port that was almost marked as
> bad by the simplistic algorithm you propose will be discovered and
> operational as there many other paths to reach it - walking around the
> real bad port !
> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Wednesday, April 13, 2005 12:00 PM
> > To: Eitan Zahavi
> > Cc: openib-general at openib.org
> > Subject: RE: [openib-general] SM Bad Port Handling
> >
> > On Wed, 2005-04-13 at 01:28, Eitan Zahavi wrote:
> > > [EZ] This is true. Currently there is only one cause for the
> > > un-healthy bits to be set - which are exactly as you point - these
> > > traps. The point I was trying to make was that this bit is the
> > > mechanism for flagging a port status is bad.
> > >
> > > What I did recommend was to write a "statistical" analysis of Directed
> > > Route packet drop - such that we can find the ports with a high drop
> > > rate and mark them as un-healthy.
> > > If you mark every port that does not
> > > respond to a MAD as un-healthy you can suffer from flaky links
> > > somewhere on the route to that port. Only analysis of the number of
> > > good packets vs. dropped packets can lead you to the right bad port.
> >
> > The original proposal on this said the following:
> >
> > "The OpenSM will implement a configurable policy (some number of
> > consecutive lack of responses to SM requests). At the point of
> > exhaustion of the timeout/retry strategy, that port will be marked as
> > "bad" by OpenSM."
> >
> > Any idea on what might make a good default threshold (for consecutive
> > retries) ? Do you think there is no sufficient default ?
> >
> > If a link is flaky and MADs can't get through, should it be used for non
> > MAD traffic ?
> >
> > Also note that the proposal also said:
> >
> > "Also, there could also be a periodic "ping" at a slower rate to check
> > if the "bad" ports revive."
> >
> > In terms of analysis of good v. errored and dropped packets (along the
> > path to that node), there are OpenIB diagnostic tools to help with this.
> >
> > -- Hal

From mst at mellanox.co.il Thu Apr 14 01:06:48 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Apr 2005 11:06:48 +0300
Subject: [openib-general] Re: patches
In-Reply-To: <52hdif3ggn.fsf@topspin.com>
References: <20050408093558.GB21709@mellanox.co.il> <52psx545sy.fsf@topspin.com> <20050409172150.GA31200@mellanox.co.il> <52hdif3ggn.fsf@topspin.com>
Message-ID: <20050414080648.GE32526@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: patches
>
>     Michael> If I remember
>     Michael> correctly alloc_consistent and free consistent in init_ib
>     Michael> currently get different sizes, isnt that wrong?
>
> Yes, that needs to be fixed.
>
>  - R.
>

Signed-off-by: Michael S.
Tsirkin

Index: src/linux-kernel/infiniband/hw/mthca/mthca_cmd.c
===================================================================
--- src/linux-kernel/infiniband/hw/mthca/mthca_cmd.c (revision 2169)
+++ src/linux-kernel/infiniband/hw/mthca/mthca_cmd.c (working copy)
@@ -1224,7 +1224,7 @@ int mthca_INIT_IB(struct mthca_dev *dev,
 	err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB,
 			CMD_TIME_CLASS_A, status);

-	pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma);
+	pci_free_consistent(dev->pdev, INIT_IB_IN_SIZE, inbox, indma);
 	return err;
 }

--
MST - Michael S. Tsirkin

From halr at voltaire.com Thu Apr 14 03:32:02 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 Apr 2005 06:32:02 -0400
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF101@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF101@mtlex01.yok.mtl.com>
Message-ID: <1113474722.4479.299.camel@localhost.localdomain>

On Thu, 2005-04-14 at 02:29, Eitan Zahavi wrote:
> The point is that the real "bad" ports are not the ones that are
> killing 100% of packets
> (since they will simply have a "DOWN" state and vanish).
>
> The real bad ports are the ones that pass < 25% (as we use retry of 4)
> of packets that goes through them.

When the SM sends a direct route MAD it saves the port guid (and port
num) in the madw context, so that when there is a reply or timeout you
can easily find the port. That means you don't have to walk the entire DR
path to find the unhealthy port. That means that the peer port (from
which we arrived to the bad port) is unhealthy. Does this address your
concern ?

-- Hal

From eitan at mellanox.co.il Thu Apr 14 04:13:18 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 14 Apr 2005 14:13:18 +0300
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF10B@mtlex01.yok.mtl.com>

>
> When the SM sends a direct route MAD it saves the port guid (and port
> num) in the madw context, so that when there is a reply or timeout you
> can easily find the port. That means you dont have to walk the entire DR
> path to find the unhealthy port. That means that the peer port (from
> which we arrived to the bad port) is unhealthy. Does this address your
> concern ?
>

[EZ] Not at all. Although the target port is known, the flaky link that fails the MAD might be
anywhere along the path to the port. So, if you mark the target port as bad you might be
marking the wrong port!

[EZ] Let me clarify with an example:

SM=HCA1/P1 -> SW1/P1....SW1/P2->SW2/P1..SW2/P2->SW3/P1....SW3/P3->HCA2/P1
                     \..SW4/P4->SW3/P4..SW3/P5->SW3/P2../

If the flaky link is between SW2/P2 and SW3/P1 then the packet sent to HCA2 using
DR : [0][1][1][2][3] might fail. If you mark HCA2/P1 as bad then you actually will lose that HCA
for no good reason, since another path from SM to HCA2 exists.

EZ

From roland at topspin.com Thu Apr 14 11:39:17 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 14 Apr 2005 11:39:17 -0700
Subject: [openib-general] NULL ptr derefence
In-Reply-To: <20050412192501.GA18034@esmail.cup.hp.com> (Grant Grundler's message of "Tue, 12 Apr 2005 12:25:01 -0700")
References: <20050412192501.GA18034@esmail.cup.hp.com>
Message-ID: <52zmw1mcdm.fsf@topspin.com>

I think I have this figured out: if you unload ib_ipoib and
ib_sa_query in quick succession, ib_ipoib sends MCMember requests to
the SA to leave its multicast groups.
Normally, because IPoIB sets a
timeout of 0, no callback is generated and so it's fine that IPoIB
passes a NULL callback. However, if ib_sa_query is unloaded right
afterwards, the send of the request doesn't get a chance to complete
and so a cancel callback is generated.

If this crash is at all reproducible for you, can you try this patch
and see if it helps?

Thanks,
  Roland

--- infiniband/core/sa_query.c (revision 1781)
+++ infiniband/core/sa_query.c (working copy)
@@ -587,7 +587,7 @@

 	init_mad(query->sa_query.mad, agent);

-	query->sa_query.callback = ib_sa_path_rec_callback;
+	query->sa_query.callback = callback ? ib_sa_path_rec_callback : NULL;
 	query->sa_query.release = ib_sa_path_rec_release;
 	query->sa_query.port = port;
 	query->sa_query.mad->mad_hdr.method = IB_MGMT_METHOD_GET;
@@ -663,7 +663,7 @@

 	init_mad(query->sa_query.mad, agent);

-	query->sa_query.callback = ib_sa_mcmember_rec_callback;
+	query->sa_query.callback = callback ? ib_sa_mcmember_rec_callback : NULL;
 	query->sa_query.release = ib_sa_mcmember_rec_release;
 	query->sa_query.port = port;
 	query->sa_query.mad->mad_hdr.method = method;
@@ -698,20 +698,21 @@
 	if (!query)
 		return;

-	switch (mad_send_wc->status) {
-	case IB_WC_SUCCESS:
-		/* No callback -- already got recv */
-		break;
-	case IB_WC_RESP_TIMEOUT_ERR:
-		query->callback(query, -ETIMEDOUT, NULL);
-		break;
-	case IB_WC_WR_FLUSH_ERR:
-		query->callback(query, -EINTR, NULL);
-		break;
-	default:
-		query->callback(query, -EIO, NULL);
-		break;
-	}
+	if (query->callback)
+		switch (mad_send_wc->status) {
+		case IB_WC_SUCCESS:
+			/* No callback -- already got recv */
+			break;
+		case IB_WC_RESP_TIMEOUT_ERR:
+			query->callback(query, -ETIMEDOUT, NULL);
+			break;
+		case IB_WC_WR_FLUSH_ERR:
+			query->callback(query, -EINTR, NULL);
+			break;
+		default:
+			query->callback(query, -EIO, NULL);
+			break;
+		}

 	dma_unmap_single(agent->device->dma_device,
 			 pci_unmap_addr(query, mapping),
@@ -736,7 +737,7 @@
 	query = idr_find(&query_idr, mad_recv_wc->wc->wr_id);
 	spin_unlock_irqrestore(&idr_lock, flags);

-	if (query) {
+	if (query && query->callback) {
 		if (mad_recv_wc->wc->status == IB_WC_SUCCESS)
 			query->callback(query,
 					mad_recv_wc->recv_buf.mad->mad_hdr.status ?

From roland at topspin.com Thu Apr 14 12:35:48 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 14 Apr 2005 12:35:48 -0700
Subject: [openib-general] Topspin, Cisco and OpenIB
Message-ID: <52r7hdm9rf.fsf@topspin.com>

By now I'm sure most of you have heard the news that Cisco is
acquiring Topspin. As you can see from the headline of the release
(http://newsroom.cisco.com/dlls/2005/corp_041405.html?CMP=ILC-001):

    Cisco Systems to Acquire Topspin Communications

    Broadens Data Center Portfolio with Server Fabric Switches,
    InfiniBand Technology, and Server Virtualization Software

Cisco is putting InfiniBand front and center, and everything I've
heard from Cisco during the process confirms that they are excited
about IB and want to continue to expand the IB market in high
performance computing and beyond.

Open source InfiniBand software is a key part of this plan, and Libor
and I will continue our efforts in OpenIB. If anything, the Cisco
acquisition will allow us to focus even more resources on OpenIB, and
I think the deal will be a huge win for OpenIB and the InfiniBand
world.

Thanks,
  Roland

From iod00d at hp.com Thu Apr 14 14:20:34 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 14 Apr 2005 14:20:34 -0700
Subject: [openib-general] NULL ptr derefence
In-Reply-To: <52zmw1mcdm.fsf@topspin.com>
References: <20050412192501.GA18034@esmail.cup.hp.com> <52zmw1mcdm.fsf@topspin.com>
Message-ID: <20050414212034.GH25145@esmail.cup.hp.com>

On Thu, Apr 14, 2005 at 11:39:17AM -0700, Roland Dreier wrote:
> I think I have this figured out: if you unload ib_ipoib and
> ib_sa_query in quick succession, ib_ipoib sends MCMember requests to
> the SA to leave its multicast groups.
> Normally, because IPoIB sets a
> timeout of 0, no callback is generated and so it's fine that IPoIB
> passes a NULL callback. However, if ib_sa_query is unloaded right
> afterwards, the send of the request doesn't get a chance to complete
> and so a cancel callback is generated.
>
> If this crash is at all reproducible for you, can you try this patch
> and see if it helps?

I haven't reproduced it yet...but I'm going to put a machine in an
infinite loop running the unload/load script. Once I know how long
it takes to reproduce, I can comfortably tell you if it's fixed or not.

thanks,
grant

From iod00d at hp.com Thu Apr 14 16:58:13 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 14 Apr 2005 16:58:13 -0700
Subject: [openib-general] NULL ptr derefence
In-Reply-To: <52zmw1mcdm.fsf@topspin.com>
References: <20050412192501.GA18034@esmail.cup.hp.com> <52zmw1mcdm.fsf@topspin.com>
Message-ID: <20050414235813.GA26772@esmail.cup.hp.com>

On Thu, Apr 14, 2005 at 11:39:17AM -0700, Roland Dreier wrote:
...
> If this crash is at all reproducible for you,
...

I tried to reproduce the crash with SVN r2168.
But I was only able to produce a "hang".
The one liner was:

	while :
	do
		date
		reload_ib
	done

"reload_ib" just unloads all the modules, loads ib_mthca, ib_ipoib and
ib_sdp modules, and lastly ifconfig's up the ib0/1 interfaces.

I had a "sleep 3" after 'date' and that ran for 10 minutes or so with
no problems. Without the sleep, it ran for 5 minutes with no problem.
I then ran "ping -f 10.0.0.113" from another host just to get the
target a bit busy and that hung the target machine after a few minutes.

I've parked the reload_ib script, System.map, and "errdump init"
(ib_hang-2.6.11-pa1.txt) output on
	ftp://gsyprf10.external.hp.com/pub/openib/

I've got a couple other fires and administrivia to deal with
and won't be able to mess with this again today.

grant

From kjreilly at us.ibm.com Thu Apr 14 19:32:37 2005
From: kjreilly at us.ibm.com (Kevin Reilly)
Date: Thu, 14 Apr 2005 22:32:37 -0400
Subject: [openib-general] openIB gen2 user space verbs API
Message-ID:

I was wondering where I could find information on the openIB gen2 user
space verbs API? One of my key questions is how different it is from
VAPI. Is it a superset of VAPI, and if so what functions were added?

Kevin J. Reilly
STSM, HPC Architecture
-Federation/HPS Chief Engineer
-HPC interconnect architect
(office) 845-433-7976 (tieline) 8-293-7976

From roland at topspin.com Thu Apr 14 19:55:40 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 14 Apr 2005 19:55:40 -0700
Subject: [openib-general] openIB gen2 user space verbs API
In-Reply-To: (Kevin Reilly's message of "Thu, 14 Apr 2005 22:32:37 -0400")
References:
Message-ID: <527jj4n3yr.fsf@topspin.com>

    Kevin> I was wonder where i could find information on the openIB
    Kevin> gen2 user space verbs API? One of my key questions is how
    Kevin> much different then VAPI. Is it a superset of VAPI and if
    Kevin> so what function was added?

Right now the best way to find out about the userspace verbs API is to
look at the libibverbs source. The include file infiniband/verbs.h
and the code in the examples directory are probably the best places to
get started.

The current code implements all the main verbs likely to be used by
userspace applications. There are no functions that I can think of
added beyond what's in VAPI.

 - R.

From RAISCH at de.ibm.com Fri Apr 15 06:15:10 2005
From: RAISCH at de.ibm.com (Christoph Raisch)
Date: Fri, 15 Apr 2005 15:15:10 +0200
Subject: [openib-general] openIB gen2 user space verbs API
In-Reply-To: <527jj4n3yr.fsf@topspin.com>
Message-ID:

Roland,
reading the userspace infiniband/verbs.h file,
where did query QP go?

Gruss / Regards . . .
Christoph Raisch

openib-general-bounces at openib.org wrote on 15.04.2005 04:55:40:

>     Kevin> I was wonder where i could find information on the openIB
>     Kevin> gen2 user space verbs API? One of my key questions is how
>     Kevin> much different then VAPI. Is it a superset of VAPI and if
>     Kevin> so what function was added?
>
> Right now the best way to find out about the userspace verbs API is to
> look at the libibverbs source. The include file infiniband/verbs.h
> and the code in the examples directory are probably the best places to
> get started.
>
> The current code implements all the main verbs likely to be used by
> userspace applications. There are no functions that I can think of
> added beyond what's in VAPI.
>
>  - R.

From halr at voltaire.com Fri Apr 15 06:35:10 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Apr 2005 09:35:10 -0400
Subject: [openib-general] openIB gen2 user space verbs API
In-Reply-To:
References:
Message-ID: <1113572109.4479.68.camel@localhost.localdomain>

On Fri, 2005-04-15 at 09:15, Christoph Raisch wrote:
> Roland,
> reading the userspace infiniband/verbs.h file,
> where did query QP go?

That's one that is missing in user verbs (and mthca) currently. It is in
kernel verbs but returns -ENOSYS currently.

-- Hal

>
> Gruss / Regards . . . Christoph Raisch
>
>
> openib-general-bounces at openib.org wrote on 15.04.2005 04:55:40:
>
> >     Kevin> I was wonder where i could find information on the openIB
> >     Kevin> gen2 user space verbs API? One of my key questions is how
> >     Kevin> much different then VAPI. Is it a superset of VAPI and if
> >     Kevin> so what function was added?
> >
> > Right now the best way to find out about the userspace verbs API is to
> > look at the libibverbs source. The include file infiniband/verbs.h
> > and the code in the examples directory are probably the best places to
> > get started.
> >
> > The current code implements all the main verbs likely to be used by
> > userspace applications. There are no functions that I can think of
> > added beyond what's in VAPI.
> >
> >  - R.

From roland at topspin.com Fri Apr 15 08:49:20 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 15 Apr 2005 08:49:20 -0700
Subject: [openib-general] openIB gen2 user space verbs API
References:
Message-ID: <527jj4kpkv.fsf@topspin.com>

    Christoph> Roland, reading the userspace infiniband/verbs.h file,
    Christoph> where did query QP go?

It's not implemented yet. Is there an application that needs it?

 - R.

From ardavis at ichips.intel.com Fri Apr 15 13:27:41 2005
From: ardavis at ichips.intel.com (ardavis)
Date: Fri, 15 Apr 2005 13:27:41 -0700
Subject: [openib-general] Kernel oops: NULL ptr dereference in ib_umem_get
Message-ID: <426023BD.8080504@ichips.intel.com>

Hello Roland,

I have openib uDAPL up and running with most of our internal MPI test
suites (Intel-MPI). Pretty impressive with such an early code drop of
user verbs. Nice job!

With a little stress, I see the following oops (running latest from the trunk).
Let me know if you need any more information.

Apr 15 13:03:27 iclust-19 kernel: <1>Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP:
Apr 15 13:03:27 iclust-19 kernel: {ib_umem_get+272}
Apr 15 13:03:27 iclust-19 kernel: PGD 33933067 PUD 32a58067 PMD 0
Apr 15 13:03:27 iclust-19 kernel: Oops: 0000 [2] SMP
Apr 15 13:03:27 iclust-19 kernel: CPU 0
Apr 15 13:03:27 iclust-19 kernel: Modules linked in:
Apr 15 13:03:27 iclust-19 kernel: Pid: 13502, comm: transpose2 Not tainted 2.6.11
Apr 15 13:03:27 iclust-19 kernel: RIP: 0010:[] {ib_umem_get+272}
Apr 15 13:03:27 iclust-19 kernel: RSP: 0018:ffff81002ed4ddd8 EFLAGS: 00010206
Apr 15 13:03:27 iclust-19 kernel: RAX: 0000800000000000 RBX: 000000000000b000 RCX: 00007fffffff5000
Apr 15 13:03:27 iclust-19 kernel: RDX: 0000000000000000 RSI: 00007fffffff5000 RDI: ffff810027f9e940
Apr 15 13:03:27 iclust-19 kernel: RBP: 00007fffffff5000 R08: 0000000000000000 R09: 0000000000000000
Apr 15 13:03:27 iclust-19 kernel: R10: 0000000000030b24 R11: 0000000000000000 R12: ffff810031815c80
Apr 15 13:03:27 iclust-19 kernel: R13: 0000000000000000 R14: 00007fffffff5000 R15: ffff81002ed15000
Apr 15 13:03:27 iclust-19 kernel: FS: 00002aaaaae55f40(0000) GS:ffffffff805fe400(0000) knlGS:0000000000000000
Apr 15 13:03:27 iclust-19 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 15 13:03:27 iclust-19 kernel: CR2: 0000000000000010 CR3: 0000000034b16000 CR4: 00000000000006e0
Apr 15 13:03:27 iclust-19 kernel: Process transpose2 (pid: 13502, threadinfo ffff81002ed4c000, task ffff81003e3f62f0)
Apr 15 13:03:27 iclust-19 kernel: Stack: ffff810033391ab8 ffffffff80168e62 000000000000000d ffff810031815cc8
Apr 15 13:03:27 iclust-19 kernel: 000000000000000b 0000000000000000 ffff810031815ca8 ffff81000235a000
Apr 15 13:03:27 iclust-19 kernel: ffffffff804ca110 0000000000000030
Apr 15 13:03:27 iclust-19 kernel: Call Trace:{handle_mm_fault+418} {ib_uverbs_reg_mr+212}
Apr 15 13:03:27 iclust-19 kernel: {ib_uverbs_write+150} {vfs_write+196}
Apr 15 13:03:27 iclust-19 kernel: {sys_write+83} {system_call+126} Apr 15 13:03:27 iclust-19 kernel: Apr 15 13:03:27 iclust-19 kernel: Apr 15 13:03:27 iclust-19 kernel: Code: 4c 8b 72 10 eb ba 49 89 ee 49 81 e6 00 f0 ff ff 8b 4c 24 20 Apr 15 13:03:27 iclust-19 kernel: RIP {ib_umem_get+272} RSP Apr 15 13:03:27 iclust-19 kernel: CR2: 0000000000000010 Thanks, -arlin From roland at topspin.com Fri Apr 15 13:19:27 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 15 Apr 2005 13:19:27 -0700 Subject: [openib-general] Re: Gen2 User verbs usage In-Reply-To: <20050415182702.GA5572@cse.ohio-state.edu> (Sayantan Sur's message of "Fri, 15 Apr 2005 14:27:02 -0400") References: <20050415182702.GA5572@cse.ohio-state.edu> Message-ID: <52hdi7kd2o.fsf@topspin.com> >>>>> "Sayantan" == Sayantan Sur writes: Sayantan> Hi Roland, I have some questions regarding the usage of Sayantan> the new Gen2 verbs. Sayantan> 1. Polling CQ : I notice that this verb is little Sayantan> different from VAPI_poll_cq, in the sense that it Sayantan> accepts a parameter to poll for multiple completion Sayantan> entries. So, if I have a statement like: Sayantan> 497 ne = ibv_poll_cq(hca.cq, 1, &wc); Sayantan> I want to poll for one completion. Does `ne' hold the Sayantan> return status or number of elements actually pulled out Sayantan> of CQ? Sorry, this should be documented better in the userspace library. The semantics are identical to the ib_poll_cq() function in the kernel: * Poll a CQ for (possibly multiple) completions. If the return value * is < 0, an error occurred. If the return value is >= 0, it is the * number of completions returned. If the return value is * non-negative and < num_entries, then the CQ was emptied. Sayantan> 2. Posting RDMA write : Do these statements for Sayantan> preparing a RDMA write IB descriptor make sense? 
Sayantan> 472 sr_desc.send_flags = IBV_SEND_SIGNALED; Sayantan> 473 sr_desc.opcode = IBV_WR_RDMA_WRITE; Sayantan> 474 sr_desc.wr_id = 0; Sayantan> 475 sr_desc.num_sge = 1; Sayantan> 477 sr_desc.sg_list = &(sg_entry); Sayantan> 479 sr_desc.wr.rdma.remote_addr = (uintptr_t) (rbuf.buf); Sayantan> 480 sr_desc.wr.rdma.rkey = rbuf.rkey; Sayantan> 483 sg_entry.addr = (uintptr_t) (lbuf.buf); Sayantan> 484 sg_entry.length = len; Sayantan> 485 sg_entry.lkey = lbuf.mr->lkey; Sayantan> Essentially, I don't understand what the `send_flags' Sayantan> field means. Yes, this all makes sense. The send_flags field can hold any combination (|'ed together) of the flags IBV_SEND_FENCE, IBV_SEND_SIGNALED, IBV_SEND_SOLICITED and IBV_SEND_INLINE. FENCE means that strict ordering will be enforced, as described in section 10.8 of the IB spec. IBV_SEND_SIGNALED means that a CQ entry will be generated when the send is completed (this flag is ignored if the QP is created with sq_sig_all != 0, since all sends will generate CQ entries anyway). SOLICITED means that the solicited bit will be set in the message so that the remote side will receive a solicited completion event. INLINE means the verbs driver should try to copy the data directly into the send work request to reduce latency. Sayantan> On a side note, if you think that this sort of Sayantan> discussion is useful in openib-general, please feel free Sayantan> to Cc to that list. Yes, I definitely think these questions should go through the mailing list so that all the subscribers (plus any future archive searchers!) can learn from the answers. 
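A minimal sketch of the two points above, not code from libibverbs. The first helper classifies the return value of a poll call the way the quoted ib_poll_cq() comment describes; the second shows when a send produces a CQ entry. All sketch_* names and the numeric flag values are assumptions made for illustration; real code would use the IBV_SEND_* constants from infiniband/verbs.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the IBV_SEND_* flags. The numeric values are
 * assumptions for this sketch, not taken from infiniband/verbs.h. */
enum sketch_send_flags {
	SKETCH_SEND_FENCE     = 1 << 0,
	SKETCH_SEND_SIGNALED  = 1 << 1,
	SKETCH_SEND_SOLICITED = 1 << 2,
	SKETCH_SEND_INLINE    = 1 << 3,
};

/* Outcomes of one poll call, following the quoted kernel comment. */
enum sketch_poll_state {
	SKETCH_POLL_ERROR,      /* return value < 0 */
	SKETCH_POLL_CQ_EMPTIED, /* 0 <= return value < num_entries */
	SKETCH_POLL_MAYBE_MORE, /* return value == num_entries */
};

/* Classify the value returned by a poll for up to num_entries completions. */
enum sketch_poll_state sketch_classify_poll(int ne, int num_entries)
{
	assert(num_entries > 0);
	if (ne < 0)
		return SKETCH_POLL_ERROR;
	if (ne < num_entries)
		return SKETCH_POLL_CQ_EMPTIED;
	return SKETCH_POLL_MAYBE_MORE;
}

/* A send generates a CQ entry if the QP signals all sends (sq_sig_all != 0),
 * or if the work request itself carries the SIGNALED flag. */
bool sketch_send_generates_cqe(int sq_sig_all, int send_flags)
{
	return sq_sig_all != 0 || (send_flags & SKETCH_SEND_SIGNALED) != 0;
}
```

With num_entries = 1, as in the ibv_poll_cq(hca.cq, 1, &wc) call above, a return value of 0 means the CQ was empty and 1 means one completion was written to wc.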
Thanks, Roland From roland at topspin.com Fri Apr 15 13:30:24 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 15 Apr 2005 13:30:24 -0700 Subject: [openib-general] Kernel oops: NULL ptr dereference in ib_umem_get In-Reply-To: <426023BD.8080504@ichips.intel.com> (ardavis@ichips.intel.com's message of "Fri, 15 Apr 2005 13:27:41 -0700") References: <426023BD.8080504@ichips.intel.com> Message-ID: <52r7hbixzz.fsf@topspin.com> ardavis> Hello Roland, I have openib uDAPL up and running with ardavis> most of our internal MPI test suites (Intel-MPI). Pretty ardavis> impressive with such an early code drop of user ardavis> verbs. Nice job! Cool! ardavis> With a little stress, I see the following oops (running ardavis> latest from the trunk). Let me know if you need any more ardavis> information. Thanks, I'll try to take a look at that code and see if I can figure it out. - R. From surs at cse.ohio-state.edu Fri Apr 15 14:58:38 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Fri, 15 Apr 2005 17:58:38 -0400 Subject: [openib-general] openIB gen2 user space verbs API In-Reply-To: <527jj4kpkv.fsf@topspin.com> References: <527jj4kpkv.fsf@topspin.com> Message-ID: <20050415215836.GA6479@cse.ohio-state.edu> Hi, * On Apr,5 Roland Dreier wrote : > Christoph> Roland, reading the userspace infiniband/verbs.h file, > Christoph> where did query QP go? > > It's not implemented yet. Is there an application that needs it? In VAPI, the query QP is used to find the inline size. e.g. ret = VAPI_query_qp(viadev.nic, viadev.qp_hndl[i], &qp_query_attr, &qp_query_attr_mask, &qp_query_init_attr); [...] inline_size = qp_query_attr.cap.max_inline_data_sq; Is there another way to find out the inline size in Gen2 verbs? Thanks, Sayantan. > > - R. 
> _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- --------------------------------------------------------- Sayantan Sur Graduate Research Assistant 395 Dreese Labs, Computer Science and Engineering Ohio State University, Office : 774, Dreese Labs Columbus, email : surs at cse.ohio-state.edu Ohio - 43210. phone(res) : 614.688.9792 USA. phone(off) : 614.292.8501 --------------------------------------------------------- From surs at cse.ohio-state.edu Fri Apr 15 15:10:09 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Fri, 15 Apr 2005 18:10:09 -0400 Subject: [openib-general] IBV_WC_LOC_LEN_ERR in ibv_pingpong for message < 4KB Message-ID: <20050415221008.GB6479@cse.ohio-state.edu> Hi, If I run `ibv_pingpong' with msg size argument < 4096, I am encountering a IBV_WC_LOC_LEN_ERR. I have pasted the output from my run. Is this usage correct? Is anybody else getting the same error too? If I may ask, what do the messages like: [ 0] 00860404 mean? They are not coming from `ibv_pingpong' but from somewhere else. Are they supposed to be debug messages? If yes, how can I interpret them? Thanks, Sayantan. 
============== [surs at x5:latency] ibv_pingpong --size=4095 local address: LID 0x0001, QPN 0x860404, PSN 0x51bebd remote address: LID 0x0002, QPN 0x860404, PSN 0x4be4ee [ 0] 00860404 [ 4] 00000000 [ 8] 0020669c [ c] 00000000 [10] 01d70000 [14] 00207388 [18] 00000004 [1c] fe100000 [ 0] 00860404 [ 4] 0001000a [ 8] 1f20669c [ c] 00000300 [10] 05f90000 [14] 000001f3 [18] 00000044 [1c] fe100000 Failed status 1 for wr_id 1 ============== [surs at x5:latency] ibv_pingpong --size=4096 local address: LID 0x0001, QPN 0x850404, PSN 0x79dca6 remote address: LID 0x0002, QPN 0x850404, PSN 0xff01c2 8192000 bytes in 0.02 seconds = 3411.38 Mbit/sec 1000 iters in 0.02 seconds = 19.21 usec/iter ============== Platform description - Hardware: --------- Two Dual Intel Xeon EM64T 3.4 GHz nodes PCI-Express I/O bus MT25208 Mellanox HCAs (rev a0) Software: --------- RedHat AS 4 2.6.11.6/2.6.11.7 kernel with Gen2 InfiniBand drivers Firmware version 5.0.1 OpenIB Gen2 drivers (user verbs from main branch) OpenSM (OpenIB version/IBGD 1.7.0 both of them result in the same) -- --------------------------------------------------------- Sayantan Sur Graduate Research Assistant 395 Dreese Labs, Computer Science and Engineering Ohio State University, Office : 774, Dreese Labs Columbus, email : surs at cse.ohio-state.edu Ohio - 43210. phone(res) : 614.688.9792 USA. phone(off) : 614.292.8501 --------------------------------------------------------- From roland at topspin.com Fri Apr 15 15:07:35 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 15 Apr 2005 15:07:35 -0700 Subject: [openib-general] openIB gen2 user space verbs API In-Reply-To: <20050415215836.GA6479@cse.ohio-state.edu> (Sayantan Sur's message of "Fri, 15 Apr 2005 17:58:38 -0400") References: <527jj4kpkv.fsf@topspin.com> <20050415215836.GA6479@cse.ohio-state.edu> Message-ID: <52r7hbhexk.fsf@topspin.com> Sayantan> In VAPI, the query QP is used to find the inline size. 
Sayantan> Is there another way to find out the inline size in Gen2 Sayantan> verbs? It's not implemented yet, but I would have ibv_create_qp(): struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr); to pass the max inline size back in the qp_init_attr->qp_cap.max_inline_data member. I'll code this up on Monday, it's pretty trivial. - R. From roland at topspin.com Fri Apr 15 15:15:18 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 15 Apr 2005 15:15:18 -0700 Subject: [openib-general] IBV_WC_LOC_LEN_ERR in ibv_pingpong for message < 4KB In-Reply-To: <20050415221008.GB6479@cse.ohio-state.edu> (Sayantan Sur's message of "Fri, 15 Apr 2005 18:10:09 -0400") References: <20050415221008.GB6479@cse.ohio-state.edu> Message-ID: <52mzrzhekp.fsf@topspin.com> Sayantan> Hi, If I run `ibv_pingpong' with msg size argument < Sayantan> 4096, I am encountering a IBV_WC_LOC_LEN_ERR. I have Sayantan> pasted the output from my run. Is this usage correct? Is Sayantan> anybody else getting the same error too? Are you passing the --size argument to both the client and server ibv_pingpong programs? If not, then one side will post a receive buffer too small to receive the message that the other side sends, and you will get a local length error. Sayantan> If I may ask, what do the messages like: Sayantan> [ 0] 00860404 Sayantan> mean? They are not coming from `ibv_pingpong' but from Sayantan> somewhere else. Are they supposed to be debug messages? Sayantan> If yes, how can I interpret them? That's temporary debugging code from libmthca, specifically the CQ polling code. It's dumping out the 32 byte CQ entry written by the HCA hardware. You will need Mellanox documentation to interpret it (or you can just read the code in libmthca/src/cq.c). - R. 
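Both answers above can be sketched in a few lines of C, under stated assumptions. sketch_recv_fits() models the length check behind the local length error: a message only fits if the posted receive buffer is at least as large. sketch_cqe_line() reproduces the layout of the debug dump, assuming each 32-bit word of the 32-byte CQ entry is printed in big-endian byte order next to its hex byte offset. The sketch_* names are hypothetical and this is not the libmthca code; the sample bytes are the first dump from the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* True if a posted receive buffer can hold an incoming message. */
bool sketch_recv_fits(size_t recv_buf_len, size_t msg_len)
{
	return msg_len <= recv_buf_len;
}

/* The first 32-byte CQ entry from the dump in the report. */
const uint8_t sketch_sample_cqe[32] = {
	0x00, 0x86, 0x04, 0x04,  0x00, 0x00, 0x00, 0x00,
	0x00, 0x20, 0x66, 0x9c,  0x00, 0x00, 0x00, 0x00,
	0x01, 0xd7, 0x00, 0x00,  0x00, 0x20, 0x73, 0x88,
	0x00, 0x00, 0x00, 0x04,  0xfe, 0x10, 0x00, 0x00,
};

/* Format the word at byte offset off (a multiple of 4, below 32) as one
 * dump line. Returns a pointer to a static buffer that the next call
 * overwrites. */
const char *sketch_cqe_line(const uint8_t *cqe, size_t off)
{
	static char line[16];
	uint32_t word;

	assert(off % 4 == 0 && off + 4 <= 32);
	word = ((uint32_t)cqe[off] << 24) | ((uint32_t)cqe[off + 1] << 16) |
	       ((uint32_t)cqe[off + 2] << 8) | (uint32_t)cqe[off + 3];
	snprintf(line, sizeof(line), "[%2x] %08x", (unsigned)off, (unsigned)word);
	return line;
}
```

For example, sketch_recv_fits(4095, 4096) is false, which corresponds to one side posting 4095-byte receives while the other sends 4096-byte messages.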
From roland at topspin.com Fri Apr 15 15:25:10 2005 From: roland at topspin.com (Roland Dreier) Date: Fri, 15 Apr 2005 15:25:10 -0700 Subject: [openib-general] Kernel oops: NULL ptr dereference in ib_umem_get In-Reply-To: <426023BD.8080504@ichips.intel.com> (ardavis@ichips.intel.com's message of "Fri, 15 Apr 2005 13:27:41 -0700") References: <426023BD.8080504@ichips.intel.com> Message-ID: <52ekdbhe49.fsf@topspin.com> ardavis> With a little stress, I see the following oops (running ardavis> latest from the trunk). Let me know if you need any more ardavis> information. Can you try this patch and let me know if it helps at all? Thanks, Roland --- infiniband/core/uverbs_mem.c (revision 2156) +++ infiniband/core/uverbs_mem.c (working copy) @@ -60,16 +60,16 @@ static void __ib_umem_unmark(struct ib_u unsigned long cur_base; vma = find_vma(mm, umem->user_base); + cur_base = umem->user_base; - for (cur_base = umem->user_base; - cur_base < umem->user_base + umem->length; - cur_base = vma->vm_end) { - if (!vma || vma->vm_start > umem->user_base + umem->length) + while (cur_base < umem->user_base + umem->length && vma && + vma->vm_start < umem->user_base + umem->length) { break; if (!(vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) vma->vm_flags &= ~VM_DONTCOPY; + cur_base = vma->vm_end; vma = vma->vm_next; } } @@ -102,10 +102,9 @@ int ib_umem_get(struct ib_device *dev, s down_write(&current->mm->mmap_sem); vma = find_vma(current->mm, mem->user_base); + cur_base = mem->user_base; - for (cur_base = mem->user_base; - cur_base < mem->user_base + size; - cur_base = vma->vm_end) { + while (cur_base < mem->user_base + size) { if (!vma || vma->vm_start > cur_base) { ret = -ENOMEM; goto out; @@ -114,6 +113,7 @@ int ib_umem_get(struct ib_device *dev, s if (!(vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) vma->vm_flags |= VM_DONTCOPY; + cur_base = vma->vm_end; vma = vma->vm_next; } From robert.j.woodruff
at intel.com (Bob Woodruff) Date: Fri, 15 Apr 2005 15:37:40 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel Message-ID: People have been asking for a backport patch of the openib.org code for the 2.6.9 kernel, so I backported the latest code that was released to kernel.org. Attached is a patch that can be applied to a 2.6.9 kernel that contains the same code that is in 2.6.12-rc2-mm3. Roland tells me that there probably will not be any more changes, so this should match what is released in 2.6.12. I have done limited testing with IPoIB on 2.6.9 from kernel.org and the RedHat version of 2.6.9 that is in EL4.0 and it seems to work fine. It was tested on an Itanium tiger2 platform, but the changes were small and it should work for all other platforms as well. Matt, should we post this to the downloads web page for people that want to download it. woody -------------- next part -------------- A non-text attachment was scrubbed... Name: linux-2.6.9-ib.patch.bz2 Type: application/octet-stream Size: 120133 bytes Desc: not available URL: From iod00d at hp.com Fri Apr 15 15:52:49 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 15 Apr 2005 15:52:49 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: References: Message-ID: <20050415225249.GC30386@esmail.cup.hp.com> On Fri, Apr 15, 2005 at 03:37:40PM -0700, Bob Woodruff wrote: > > People have been asking for a backport patch > of the openib.org code for the 2.6.9 kernel, so I backported > the latest code that was released to kernel.org. ... > Matt, should we post this to the downloads web page for > people that want to download it. openib.org has a "gen2/src/linux-kernel/patches/" directory for this sort of backport patches. Maybe add a link to an SVN web interface for patches? 
grant From robert.j.woodruff at intel.com Fri Apr 15 15:57:47 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 15 Apr 2005 15:57:47 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel Message-ID: <1AC79F16F5C5284499BB9591B33D6F00041FAA70@orsmsx408> >openib.org has a "gen2/src/linux-kernel/patches/" directory >for this sort of backport patches. >Maybe add a link to an SVN web interface for patches? >grant Makes sense. Perhaps we should make a subdirectory for gen2/src/linux-kernel/patches/2.6.9, since I suspect that we will have additional backport patches for 2.6.9 going forward, i.e., when the user-mode and SDP support is complete and released to kernel.org someone may want a backport to 2.6.9. woody From iod00d at hp.com Fri Apr 15 16:11:08 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 15 Apr 2005 16:11:08 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00041FAA70@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F00041FAA70@orsmsx408> Message-ID: <20050415231108.GD30386@esmail.cup.hp.com> On Fri, Apr 15, 2005 at 03:57:47PM -0700, Woodruff, Robert J wrote: > Perhaps we should make a subdirectory for > gen2/src/linux-kernel/patches/2.6.9, I don't think that's needed since the file name should make that obvious and we won't have that many separate 2.6.9 patches. > since I suspect that we will have additional backport patches > for 2.6.9 going forward, i.e., when the user-mode and SDP support is > complete and released to kernel.org someone may want a backport to 2.6.9. Just name new chunks appropriately or update the existing ones. e.g. linux-2.6.9-uverbs.diff could be the uverbs patch for 2.6.9 support. Ditto for SDP. Maybe add an enumerator so that one can apply patches that would collide in the same files without having to physically glob the patches together.
BTW, patches can contain the same text that we post to the mailing list to document the contents. The existing linux-2.6.11-sinai.diff is a good example. grant From robert.j.woodruff at intel.com Fri Apr 15 16:21:52 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 15 Apr 2005 16:21:52 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: <20050415231108.GD30386@esmail.cup.hp.com> Message-ID: >Just name new chunks appropriately or update the existing ones. >e.g. linux-2.6.9-uverbs.diff could be the uverbs patch for 2.6.9 support. >Ditto for SDP. >Maybe add an enumerator so that one apply patches that would collide >in some same files without having to physically glob the patches together. >grant My current thinking was to provide one patch that contained all of the InfiniBand code that was released in a specific kernel.org release that also contained the changes for the backport. That way, the user would only have to apply one patch to do the backport. Alternatively, we could take the approach of just putting a patch for each component that is a diff of what was released to kernel.org. In this case the user would first apply the kernel.org patches and then multiple backport patches. Not sure which way will be harder to maintain going forward. woody From iod00d at hp.com Fri Apr 15 16:26:58 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 15 Apr 2005 16:26:58 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: References: Message-ID: <20050415232658.GE30386@esmail.cup.hp.com> On Fri, Apr 15, 2005 at 03:37:40PM -0700, Bob Woodruff wrote: ... > Attached is a patch that can be applied to a 2.6.9 kernel > that contains the same code that is in 2.6.12-rc2-mm3. Normally patches shouldn't be compressed - search engines can't/don't dig through those. And my selfish reason for complaining is I have to save the patch and manually uncompress it to look at it. 
diff -Naurp linux-2.6.9/Documentation/ioctl-number.txt.rej linux-2.6.9-ib/Docume ntation/ioctl-number.txt.rej --- linux-2.6.9/Documentation/ioctl-number.txt.rej 1969-12-31 16:00:00.0000 00000 -0800 +++ linux-2.6.9-ib/Documentation/ioctl-number.txt.rej 2005-04-13 12:18:20.0000 00000 -0700 Is this really a file in the tree? Or just an artifact that should be removed? ... > I have done limited testing with IPoIB on 2.6.9 from kernel.org and > the RedHat version of 2.6.9 that is in EL4.0 and it seems to > work fine. It was tested on an Itanium tiger2 platform, but the changes > were small and it should work for all other platforms as well. "the changes were small" is not how I would describe an 800k patch. This "patch" isn't a patch in the traditional sense since most of the 800k is new (drivers/infiniband). I'm pretty sure you were referring to changes you made, not counting new code. But I can't see that since it's all globbed together. This would be better split up into three patches: 1) add drivers/infiniband (and document which SVN rev you started with) 2) changes to drivers/infiniband to use 2.6.9 services 3) changes to the kernel to support drivers/infiniband That way we can just update each peice as necessary. Possible names to call each peice (just ideas, you probably have better names): diff-2.6.9-01-openib_drivers diff-2.6.9-02-openib_fixup diff-2.6.9-03-ib_kernel_changes $0.02, grant From iod00d at hp.com Fri Apr 15 16:33:25 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 15 Apr 2005 16:33:25 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: References: <20050415231108.GD30386@esmail.cup.hp.com> Message-ID: <20050415233325.GF30386@esmail.cup.hp.com> On Fri, Apr 15, 2005 at 04:21:52PM -0700, Bob Woodruff wrote: > My current thinking was to provide one patch that contained all of > the InfiniBand code that was released in a specific kernel.org release > that also contained the changes for the backport. 
> That way, the user would only have to apply one patch to do the backport. I agree that's more convenient for users, but harder to maintain and review. I don't expect that many people to apply this kind of patch to a RH release which they bought support for. It will basically void the support contract. Wouldn't someone rolling their own kernels is more likely to be running something newer than 2.6.9? (to follow this example) > Alternatively, we could take the approach of just putting a patch for > each component that is a diff of what was released to kernel.org. In this > case the user would first apply the kernel.org patches and then multiple > backport patches. Not sure which way will be harder to maintain going > forward. *nod*. Exactly what I was thinking. As a developer, I'm inclined to keep my life easier (smaller patches) and I think the distro's would prefer smaller patches. Is anyone from the SuSE/RH distro's on this list and care to comment? thanks, grant From robert.j.woodruff at intel.com Fri Apr 15 17:23:25 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 15 Apr 2005 17:23:25 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: <20050415232658.GE30386@esmail.cup.hp.com> Message-ID: Grant wrote, >Possible names to call each peice (just ideas, you probably >have better names): > diff-2.6.9-01-openib_drivers > diff-2.6.9-02-openib_fixup > diff-2.6.9-03-ib_kernel_changes >$0.02, >grant Sure, if that would make more sense and would be easier to review and maintain, we can certainly do it that way. Might want to have something like, diff-2.6.9-01-openib_drivers-SVNxxx diff-2.6.9-02-openib_fixup-SVNxxx diff-2.6.9-03-ib_kernel_changes where xxx is the SVN version so that we can have fixups that match a specific SVN version. 
From robert.j.woodruff at intel.com Fri Apr 15 17:26:13 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 15 Apr 2005 17:26:13 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: <20050415232658.GE30386@esmail.cup.hp.com> Message-ID: Grant wrote, >diff -Naurp linux-2.6.9/Documentation/ioctl-number.txt.rej linux-2.6.9-ib/Docume >ntation/ioctl-number.txt.rej >--- linux-2.6.9/Documentation/ioctl-number.txt.rej 1969-12-31 16:00:00.0000 >00000 -0800 >+++ linux-2.6.9-ib/Documentation/ioctl-number.txt.rej 2005-04-13 12:18:20.0000 >00000 -0700 >Is this really a file in the tree? >Or just an artifact that should be removed? I'll check. Looks like something that can be removed. woody From iod00d at hp.com Fri Apr 15 19:07:15 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 15 Apr 2005 19:07:15 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: References: <20050415232658.GE30386@esmail.cup.hp.com> Message-ID: <20050416020715.GG30386@esmail.cup.hp.com> On Fri, Apr 15, 2005 at 05:23:25PM -0700, Bob Woodruff wrote: > Might want to have something like, > > diff-2.6.9-01-openib_drivers-SVNxxx > diff-2.6.9-02-openib_fixup-SVNxxx > diff-2.6.9-03-ib_kernel_changes > > where xxx is the SVN version > > so that we can have fixups that match a specific SVN version. Sure, that's a good idea. And it reminds me that we still have no clue which version of SVN someone's IB kernel driver is based on. Supporting this is going to be painful unless this is dealt with. Anyone have a clue how to get SVN or "make" to embed the version number in any resulting .o ? It's bad enough when distro's patch drivers without updating the rev number. But not having a reliable rev number to start with is even worse. 
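One possible approach to the question above, sketched under assumptions: have the build pass a revision string (for instance the output of the svnversion utility) as a preprocessor define, and compile it into the object as a constant string. SKETCH_SVN_REVISION, sketch_svn_revision and sketch_driver_revision() are hypothetical names, not anything in the openib tree, and the exact make rule that supplies the define is left open here.

```c
#include <assert.h>
#include <string.h>

/* Fallback used when the build does not pass a revision string,
 * e.g. via -DSKETCH_SVN_REVISION='"..."' on the compiler command line. */
#ifndef SKETCH_SVN_REVISION
#define SKETCH_SVN_REVISION "unknown"
#endif

#define SKETCH_REV_PREFIX "svn-revision: "

/* The string ends up in the object file, so a tool such as strings(1)
 * can find it by searching for the prefix. */
const char sketch_svn_revision[] = SKETCH_REV_PREFIX SKETCH_SVN_REVISION;

/* Return only the revision part, without the searchable prefix. */
const char *sketch_driver_revision(void)
{
	return sketch_svn_revision + strlen(SKETCH_REV_PREFIX);
}
```

Built without the define, sketch_driver_revision() returns "unknown", which at least makes a missing revision visible rather than silent.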
thanks, grant From surs at cse.ohio-state.edu Fri Apr 15 22:26:37 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Sat, 16 Apr 2005 01:26:37 -0400 Subject: [openib-general] IBV_WC_LOC_LEN_ERR in ibv_pingpong for message < 4KB In-Reply-To: <52mzrzhekp.fsf@topspin.com> References: <20050415221008.GB6479@cse.ohio-state.edu> <52mzrzhekp.fsf@topspin.com> Message-ID: <4260A20D.1090009@cse.ohio-state.edu> Roland Dreier wrote: > Sayantan> Hi, If I run `ibv_pingpong' with msg size argument < > Sayantan> 4096, I am encountering a IBV_WC_LOC_LEN_ERR. I have > Sayantan> pasted the output from my run. Is this usage correct? Is > Sayantan> anybody else getting the same error too? > > Are you passing the --size argument to both the client and server > ibv_pingpong programs? If not, then one side will post a receive > buffer too small to receive the message that the other side sends, and > you will get a local length error. Thanks for your reply. Yes, I was assuming a `perf_main' style of usage, where it isn't necessary to pass the size information to the receiver command line. > > Sayantan> If I may ask, what do the messages like: > > Sayantan> [ 0] 00860404 > > Sayantan> mean? They are not coming from `ibv_pingpong' but from > Sayantan> somewhere else. Are they supposed to be debug messages? > Sayantan> If yes, how can I interpret them? > > That's temporary debugging code from libmthca, specifically the CQ > polling code. It's dumping out the 32 byte CQ entry written by the > HCA hardware. You will need Mellanox documentation to interpret it > (or you can just read the code in libmthca/src/cq.c). Okay. I'll look at that file. Thanks, Sayantan. > > - R. -- --------------------------------------------------------- Sayantan Sur Graduate Research Assistant 395 Dreese Labs, Computer Science and Engineering Ohio State University, Office : 774, Dreese Labs Columbus, email : surs at cse.ohio-state.edu Ohio - 43210. phone(res) : 614.688.9792 USA. 
phone(off) : 614.292.8501 --------------------------------------------------------- From kjreilly at us.ibm.com Sat Apr 16 07:49:56 2005 From: kjreilly at us.ibm.com (Kevin Reilly) Date: Sat, 16 Apr 2005 10:49:56 -0400 Subject: [openib-general] openIB gen2 user space verbs API In-Reply-To: <527jj4n3yr.fsf@topspin.com> Message-ID: Thanks Roland, I guess i can make the assumption that any prototype work layered over VAPI can be fairly easily ported over the new gen2 interface? Kevin J. Reilly STSM, HPC Architecture -Federation/HPS Chief Engineer -HPC interconnect architect (office) 845-433-7976 (tieline) 8-293-7976 Roland Dreier To Kevin Reilly/Poughkeepsie/IBM at IBMUS 04/14/2005 10:55 cc PM openib-general at openib.org Subject Re: [openib-general] openIB gen2 user space verbs API Kevin> I was wonder where i could find information on the openIB Kevin> gen2 user space verbs API? One of my key questions is how Kevin> much different then VAPI. Is it a superset of VAPI and if Kevin> so what function was added? Right now the best way to find out about the userspace verbs API is to look at the libibverbs source. The include file infiniband/verbs.h and the code in the examples directory are probably the best places to get started. The current code implements all the main verbs likely to be used by userspace applications. There are no functions that I can think of added beyond what's in VAPI. - R. From roland at topspin.com Sat Apr 16 08:50:01 2005 From: roland at topspin.com (Roland Dreier) Date: Sat, 16 Apr 2005 08:50:01 -0700 Subject: [openib-general] openIB gen2 user space verbs API In-Reply-To: (Kevin Reilly's message of "Sat, 16 Apr 2005 10:49:56 -0400") References: Message-ID: <52y8big1qt.fsf@topspin.com> Kevin> Thanks Roland, I guess i can make the assumption that any Kevin> prototype work layered over VAPI can be fairly easily Kevin> ported over the new gen2 interface? Yes, I think so. - R. 
From mst at mellanox.co.il Sat Apr 16 09:58:04 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 16 Apr 2005 19:58:04 +0300 Subject: [openib-general] Re: Gen2 User verbs usage In-Reply-To: <52hdi7kd2o.fsf@topspin.com> References: <20050415182702.GA5572@cse.ohio-state.edu> <52hdi7kd2o.fsf@topspin.com> Message-ID: <20050416165804.GA854@mellanox.co.il> Quoting r. Roland Dreier : > INLINE means the verbs driver should try to copy > the data directly into the send work request to reduce latency. There currently doesn't seem to be a way for userspace to know when setting the INLINE flag is possible. It would seem what we need is another attribute passed to create_qp that would specify the max inline buffer size, and probably an hca attribute to give the maximum legal value for this attribute. Does this make sense, and would you accept a patch like this? -- MST - Michael S. Tsirkin From mst at mellanox.co.il Sat Apr 16 10:00:13 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 16 Apr 2005 20:00:13 +0300 Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: <20050416020715.GG30386@esmail.cup.hp.com> References: <20050415232658.GE30386@esmail.cup.hp.com> <20050416020715.GG30386@esmail.cup.hp.com> Message-ID: <20050416170013.GB854@mellanox.co.il> Quoting r. Grant Grundler : > Subject: Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel > > On Fri, Apr 15, 2005 at 05:23:25PM -0700, Bob Woodruff wrote: > > Might want to have something like, > > > > diff-2.6.9-01-openib_drivers-SVNxxx > > diff-2.6.9-02-openib_fixup-SVNxxx > > diff-2.6.9-03-ib_kernel_changes > > > > where xxx is the SVN version > > > > so that we can have fixups that match a specific SVN version. > > Sure, that's a good idea. > > And it reminds me that we still have no clue which version > of SVN someone's IB kernel driver is based on. Supporting > this is going to be painful unless this is dealt with.
> Anyone have a clue how to get SVN or "make" to embed > the version number in any resulting .o ? > > It's bad enough when distro's patch drivers without updating > the rev number. But not having a reliable rev number to start > with is even worse. > > thanks, > grant You have to run the svnversion utility to get the revision. -- MST - Michael S. Tsirkin From mst at mellanox.co.il Sat Apr 16 10:23:03 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 16 Apr 2005 20:23:03 +0300 Subject: [openib-general] Re: openIB gen2 user space verbs API In-Reply-To: <52r7hbhexk.fsf@topspin.com> References: <527jj4kpkv.fsf@topspin.com> <20050415215836.GA6479@cse.ohio-state.edu> <52r7hbhexk.fsf@topspin.com> Message-ID: <20050416172303.GC854@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: openIB gen2 user space verbs API > > Sayantan> In VAPI, the query QP is used to find the inline size. > > Sayantan> Is there another way to find out the inline size in Gen2 > Sayantan> verbs? > > It's not implemented yet, but I would have ibv_create_qp(): > > struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, > struct ibv_qp_init_attr *qp_init_attr); > > to pass the max inline size back in the qp_init_attr->qp_cap.max_inline_data > member. > > I'll code this up on Monday, it's pretty trivial. > > - R. An application would need to know what values is it legal to pass to create_qp. Maybe it makes sence to implement something like query_hca, and let it return the maximum legal value? -- MST - Michael S. Tsirkin From gdror at mellanox.co.il Sat Apr 16 14:09:00 2005 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Sun, 17 Apr 2005 00:09:00 +0300 Subject: [openib-general] Static Rate Questions Message-ID: <506C3D7B14CDD411A52C00025558DED6078F2B30@mtlex01.yok.mtl.com> Hi, I have couple of questions regarding static rate implementation. - In struct ib_ah_attr, static_rate is defined as u8. What are the expected values that static_rate is supposed to take ? Is it absolute Gb/s ? 
Gb/s in 2.5Gb/s units? Or a rate relative to port speed? Looking at ipoib_main I understand that the static_rate is supposed to be the rate relative to port speed. In other words, a divider for the current port speed. - For some reason I don't see static_rate initialization for SDP. This should either be in SDP code or in the CM (cm_init_qp_rtr_attr()). - In mthca, there are two places setting up the static_rate. One for AH which looks fine. The other one for QP which I believe has a bug. mthca_qp.c: qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; Should be qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate); Because max_stat_rate is bits 10:3 at offset 8h of Address Path. - A question for next generation HW. Would you find it more useful for the HCA to support static rate as an absolute speed (Gb/s) or as a ratio relative to the current port speed? Thanks Dror -------------- next part -------------- An HTML attachment was scrubbed... URL: From ftillier at infiniconsys.com Sat Apr 16 15:26:48 2005 From: ftillier at infiniconsys.com (Fab Tillier) Date: Sat, 16 Apr 2005 15:26:48 -0700 Subject: [openib-general] Static Rate Questions In-Reply-To: <506C3D7B14CDD411A52C00025558DED6078F2B30@mtlex01.yok.mtl.com> Message-ID: <000001c542d3$62c382c0$6501a8c0@infiniconsys.com> > From: Dror Goldenberg [mailto:gdror at mellanox.co.il] > Sent: Saturday, April 16, 2005 2:09 PM > > - A question for next generation HW. Would you find it more useful that > the HCA supports static rate as an absolute speed (Gb/s) or as a relative > ratio to the current port speed? Personally I'd like to see everything use absolute not relative rates. It seems that the static rate is really used as IPD, which I find a bit counter-intuitive. - Fab From mst at mellanox.co.il Sun Apr 17 02:32:45 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Sun, 17 Apr 2005 12:32:45 +0300
Subject: [openib-general] [PATCH] fix management/README
Message-ID: <20050417093245.GA16996@mellanox.co.il>

Fix build instructions to refer to directories that actually exist.

Signed-off-by: Michael S. Tsirkin

Index: src/userspace/management/README
===================================================================
--- src/userspace/management/README (revision 2171)
+++ src/userspace/management/README (working copy)
@@ -61,7 +61,7 @@
 ./autogen.sh && ./configure && make && make install
 2. In osm/complib and osm/libvendor, run:
 ./autogen.sh && ./configure && make && make install
-3. In all util, diag, and osm/opensm subdirectories, run:
+3. In all util/mad_test, diags, and osm/opensm subdirectories, run:
 ./autogen.sh && ./configure
 4. At top level of management, run:
 make && make install

--
MST - Michael S. Tsirkin

From mst at mellanox.co.il Sun Apr 17 04:36:47 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 17 Apr 2005 14:36:47 +0300
Subject: [openib-general] performance tests uploaded to contrib
Message-ID: <20050417113647.GF16996@mellanox.co.il>

Hello!
I created
https://openib.org/svn/trunk/contrib/mellanox/perftest

This directory includes gen2 uverbs microbenchmarks - useful as usage examples
and for performance tuning.

Testing methodology:

- CPU clock instruction is used to get CPU clock without context switch.
- Median (as opposed to average) result is reported.
  The median is less sensitive to extreme scores.
  An option to report the full result distribution for alternative
  statistical analysis is provided.

Architectures supported:

- i686, x86_64

Tests in this directory:

there is currently one test:
rdma_lat.c - latency test with RDMA write transactions

Code is originally based on the pingpong test.
I intentionally did not rename functions from pingpong_ to rdma_
to make it easier to share some code with libibverbs/examples later.

Current results:

I currently observe latency below 3.5 usec.
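
The median reporting described under "Testing methodology" can be sketched
roughly as follows. This is an illustration only and is not taken from
rdma_lat.c; the names cmp_cycles and median_cycles and the sample values
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Order two cycle counts for qsort(). */
static int cmp_cycles(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/*
 * Hypothetical helper: sort the per-iteration cycle deltas in place and
 * return the middle sample (the upper middle one when n is even).
 */
static uint64_t median_cycles(uint64_t *delta, size_t n)
{
    assert(n > 0);
    qsort(delta, n, sizeof *delta, cmp_cycles);
    return delta[n / 2];
}
```

With the samples 3400, 3350, 90000, 3300 and 3380, a single outlier pulls
the mean up to 20686 while the median stays at 3380.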

Drop me a note if you find this useful.

Thanks,

--
MST - Michael S. Tsirkin

From mst at mellanox.co.il Sun Apr 17 06:56:58 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 17 Apr 2005 16:56:58 +0300
Subject: [openib-general] Re: [PATCH] uverbs with static libraries
In-Reply-To: <52is2t1g1m.fsf@topspin.com>
References: <20050410084724.GZ20567@mellanox.co.il> <52is2t1g1m.fsf@topspin.com>
Message-ID: <20050417135658.GK16996@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [PATCH] uverbs with static libraries
>
> Michael> I'd like to get userspace verbs working with static
> Michael> libraries. My motivation is currently enabling our code
> Michael> coverage tools which only work well with static
> Michael> libraries, but I expect there to be other uses.
>
> Looks reasonable. With this, do you then do --enable-static when
> configuring libmthca or is there anything else required?
>
> - R.
>

You also need the patch below to enable static libraries.
Then when you link you must pass

-u openib_driver_init -rdynamic

to gcc, to pull in the driver library.
Roland, please let me know whether you plan to apply this and the
previous patch.

Enable static version of libmthca.

Signed-off-by: Michael S. Tsirkin

Index: libmthca/configure.in
===================================================================
--- libmthca/configure.in (revision 2171)
+++ libmthca/configure.in (working copy)
@@ -6,7 +6,6 @@
 AC_CONFIG_SRCDIR([src/mthca.h])
 AC_CONFIG_AUX_DIR(config)
 AM_CONFIG_HEADER(config.h)
 AM_INIT_AUTOMAKE(libmthca, 0.9.0)
-AC_DISABLE_STATIC
 AM_PROG_LIBTOOL
 dnl Checks for programs

--
MST - Michael S. 
Tsirkin From steve at wooding.uklinux.net Sun Apr 17 08:06:02 2005 From: steve at wooding.uklinux.net (Steven Wooding) Date: Sun, 17 Apr 2005 16:06:02 +0100 Subject: [openib-general] Advice about adapting ibv_pingpong to use UC Message-ID: <42627B5A.7010709@wooding.uklinux.net> Hi, I wonder if someone working on the gen2 uverbs would be so kind as to give me some advice on adapting the ibv_pingpong program to use a UC QP type rather than RC. I was previously able to do this with the Mellanox stack by changing the qp_type attribute and then not setting variables that are only needed for RC (timeout and retry periods etc). However, when I perform the same trick with ibv_pingpong it errors on the function call that should put the QP into the INIT state. I can't see what to change in that function call to get it to the next state change. I realise that the general demand for the UC type connection is low, but my application is a real-time interface where retries are not an option I'm afraid. Thank you in advance for any help the busy gen2 developers are able to offer. Regards, Steve. x86_64, RHEL 4, gen2 2169. From tduffy at sun.com Sun Apr 17 12:40:37 2005 From: tduffy at sun.com (Tom Duffy) Date: Sun, 17 Apr 2005 12:40:37 -0700 Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: <20050416170013.GB854@mellanox.co.il> References: <20050415232658.GE30386@esmail.cup.hp.com> <20050416020715.GG30386@esmail.cup.hp.com> <20050416170013.GB854@mellanox.co.il> Message-ID: <1113766837.9390.2.camel@duffman> On Sat, 2005-04-16 at 20:00 +0300, Michael S. Tsirkin wrote: > You have to run the svnversion utility to get the revision. You could embed the $Revision$ in the MODULE_VERSION. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From mst at mellanox.co.il Sun Apr 17 12:47:45 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 17 Apr 2005 22:47:45 +0300
Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6 .9 kernel
In-Reply-To: <1113766837.9390.2.camel@duffman>
References: <1113766837.9390.2.camel@duffman>
Message-ID: <20050417194745.GA9442@mellanox.co.il>

Quoting r. Tom Duffy :
> Subject: Re: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6 .9 kernel
>
> On Sat, 2005-04-16 at 20:00 +0300, Michael S. Tsirkin wrote:
> > You have to run the svnversion utility to get the revision.
>
> You could embed the $Revision$ in the MODULE_VERSION.
>
> -tduffy
>

AFAIK that's the last revision in which the specific file changed, which is
typically not what you want.

--
MST - Michael S. Tsirkin

From surs at cse.ohio-state.edu Sun Apr 17 12:54:33 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Sun, 17 Apr 2005 15:54:33 -0400
Subject: [openib-general] performance tests uploaded to contrib
In-Reply-To: <20050417113647.GF16996@mellanox.co.il>
References: <20050417113647.GF16996@mellanox.co.il>
Message-ID: <20050417195432.GA22185@cse.ohio-state.edu>

Michael,

> Current results:
>
> I currently observe latency below 3.5 usec.
>
> Drop me a note if you find this useful.

Thanks for putting this up on the contrib tree. I have run this
rdma_latency, and am getting around 3.35 us (without switch).

Our platform is Dual Xeon EM64T with RH AS 4. Kernel version 2.6.11.7

Do you have any idea when a port of the popular `perf_main' will be
available? As more people try to use the Gen2 verbs, `perf_main' (or
something similar) can help people evaluate different IB operations and
also to have example code to use different features of IB.

Thanks,
Sayantan.

>
> Thanks,
>
> --
> MST - Michael S. 
Tsirkin > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- --------------------------------------------------------- Sayantan Sur Graduate Research Assistant 395 Dreese Labs, Computer Science and Engineering Ohio State University, Office : 774, Dreese Labs Columbus, email : surs at cse.ohio-state.edu Ohio - 43210. phone(res) : 614.688.9792 USA. phone(off) : 614.292.8501 --------------------------------------------------------- From mst at mellanox.co.il Sun Apr 17 13:40:07 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 17 Apr 2005 23:40:07 +0300 Subject: [openib-general] performance tests uploaded to contrib In-Reply-To: <20050417195432.GA22185@cse.ohio-state.edu> References: <20050417113647.GF16996@mellanox.co.il> <20050417195432.GA22185@cse.ohio-state.edu> Message-ID: <20050417204007.GB9442@mellanox.co.il> Quoting r. Sayantan Sur : > Subject: Re: [openib-general] performance tests uploaded to contrib > > Michael, > > > Current results: > > > > I currently observe latency below 3.5 usec. > > > > Drop me a note if you find this useful. > > Thanks for putting this up on the contrib tree. I have run this > rdma_latency, and am getting around 3.35 us (without switch). > > Our platform is Dual Xeon EM64T with RH AS 4. Kernel version 2.6.11.7 > > Do you have any idea when a port of the popular `perf_main' will be > available? As more people try to use the Gen2 verbs, `perf_main' (or > something similar) can help people evaluate different IB operations and > also to have example code to use different features of IB. > > Thanks, > Sayantan. > I dont plan to port the monolithic perf_main to gen2. Instead, I plan to upload a set of microbenchmarks each testing a specific feature: rdma latency/rc send latency/rdma bandwidth/rc send bandwidth etc. 
I hope that this will help achieve better code clarity than what we have in perf_main. What are the features you are most interested in? By the way, you can already see some example code in libibverbs/examples, although that is not necessarily benchmark-oriented. I used the code in pingpong.c as a starting point for rdma latency test, and it can be used with very little changes as rc send latency test. -- MST - Michael S. Tsirkin From abhijitngpune at indiatimes.com Mon Apr 18 08:20:55 2005 From: abhijitngpune at indiatimes.com (abhijitngpune) Date: Mon, 18 Apr 2005 11:20:55 -0400 Subject: [openib-general] Re: Re: Re: Subment Management Message-ID: <200504180458.KAA17566@WS0005.indiatimes.com> hi, >Are you using Voltaire SM ? It has richer capabilities here than OpenSM. >Someone in support should be able to explain how to use the various >routing/pathing algorithms supported by VSM. Who is your support person >? Could you plz suggest me any person from Support Team who can help me in this problem? Abhijeet "Hal Rosenstock" wrote: Hi Abhi, On Thu, 2005-04-14 at 07:26, abhijitngpune wrote: > Hi, > > Thanks, > > Regd. Fabric manager : It was just a theoritical concern. I > was doubtful regd voltaire SM. > > Regd. Subnet topology: > > Let me explain the scenario. I want to create > (logical) topology like this : > > N > > / | \ > > N ---- | ---- N > > / \ | / \ > > N --- N --- N > > N: node (hope u understand the topology) > > How to create such kind of virtual subnet topology? This is where LMC > 0 comes in. It allows for multiple paths to be utilized between nodes. > Do u think it is possible over star connected nodes(Interconnected > topology is star, becoz we have one voltaire switch to which all 78 > nodes are connected )? Yes. > How subnet management will help me to get this topology? What should > be my basic step in this scenario?. Are you using Voltaire SM ? It has richer capabilities here than OpenSM. 
Someone in support should be able to explain how to use the various routing/pathing algorithms supported by VSM. Who is your support person ? What host stack are you running on the end nodes ? (They appear to be HP nodes). Also, what applications/ULPs/protocols are you intending on running in this topology ? Thanks. -- Hal > Abhi, > > CML > > > > > > "Hal Rosenstock" wrote: > > > > Hi Abhi, > > On Thu, 2005-04-14 at 04:40, abhijitngpune wrote: > > Hi, My lab has recently purchased Voltaire 9288 - 288 Port > Infiniband > > Cluster Switch and HP cluster with 78 nodes. I have some > doubts > > related to subnet management. 1. Does the fabric manager > work on > > Graph(non fat tree) topologies? > > As Shahar has explained in previous emails, it does. Is this a > theoretical concern or are you having a real world problem ? > > > 2. Since the interconnect topology is star how can i get > logical > > topology which is graph (containing cycles)? > > The SM uses one of its routing/pathing policies to determine > the > topology. Which SM are you using ? Are you using Voltaire's SM > or > OpenSM ? > > > How can i create subnet with (logical) topology on the top > of the > > underlying star topology? > > The SM automatically does this based on the routing/pathing > policy. Note > that the routing/pathing policy is beyo n! d the IB spec (and > left to the > SM vendor/implementor). > > > 3. Can i install Mellanox gold s/w for infiniband? > > The short answer is yes. You can run Mellanox Gold on the end > nodes, > OpenIB, or the Voltaire host stack on the end nodes. It > depends on what > applications/ULPs you are trying to run. For things in common, > not all > interop experiments have been run if you are trying for a > heterogeneous > environment (a mix of those end node stacks). > > > 4. Does voltiare has its own subnet simulator? Abhi CML > > Shahar will need to answer this one. 
> > -- Hal

From ogerlitz at voltaire.com Sun Apr 17 23:26:32 2005
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 18 Apr 2005 09:26:32 +0300
Subject: [openib-general] openIB gen2 user space verbs API
Message-ID:

Christoph> Roland, reading the userspace infiniband/verbs.h file, where did query QP go?

Roland>It's not implemented yet. Is there an application that needs it?

Roland,

Other than getting the max inline size, query qp is used to get the current QP state. Examples are the app error flow (e.g. when modify qp failed) and the app APM flow, to sense some of the state transitions done by the HW. These are only examples I quickly thought of; I guess there are more.

I understand and like the attitude of striving for simplicity, but I guess that for production, mthca and the uverbs library would need to support all the query functions existing in VAPI (see below, with the possible exception of query ah / eec / mw).

Indeed, it may be that some of them (e.g. the first four below) can be implemented in mthca only and their results cached/mmapped to be used by the uverbs libs.

Or.
[root at zeta ogerlitz]# grep query /usr/mellanox/include/vapi.h | grep Func | awk '{ print $3}' VAPI_query_hca_cap VAPI_query_hca_port_prop VAPI_query_hca_gid_tbl VAPI_query_hca_pkey_tbl VAPI_query_addr_hndl VAPI_query_qp VAPI_query_qp_ext VAPI_query_srq VAPI_query_cq VAPI_query_eec_attr VAPI_query_mr VAPI_query_mw -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier Sent: Saturday, April 16, 2005 1:08 AM To: Sayantan Sur Cc: Christoph Raisch; openib-general at openib.org Subject: Re: [openib-general] openIB gen2 user space verbs API Sayantan> In VAPI, the query QP is used to find the inline size. Sayantan> Is there another way to find out the inline size in Gen2 Sayantan> verbs? It's not implemented yet, but I would have ibv_create_qp(): struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr); to pass the max inline size back in the qp_init_attr->qp_cap.max_inline_data member. I'll code this up on Monday, it's pretty trivial. - R. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From surs at cse.ohio-state.edu Mon Apr 18 06:00:48 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Mon, 18 Apr 2005 09:00:48 -0400 Subject: [openib-general] performance tests uploaded to contrib In-Reply-To: <20050417204007.GB9442@mellanox.co.il> References: <20050417113647.GF16996@mellanox.co.il> <20050417195432.GA22185@cse.ohio-state.edu> <20050417204007.GB9442@mellanox.co.il> Message-ID: <4263AF80.1000601@cse.ohio-state.edu> Michael S. Tsirkin wrote: > Quoting r. Sayantan Sur : > >>Subject: Re: [openib-general] performance tests uploaded to contrib >> >>Michael, >> >> >>>Current results: >>> >>>I currently observe latency below 3.5 usec. 
>>> >>>Drop me a note if you find this useful. >> >>Thanks for putting this up on the contrib tree. I have run this >>rdma_latency, and am getting around 3.35 us (without switch). >> >>Our platform is Dual Xeon EM64T with RH AS 4. Kernel version 2.6.11.7 >> >>Do you have any idea when a port of the popular `perf_main' will be >>available? As more people try to use the Gen2 verbs, `perf_main' (or >>something similar) can help people evaluate different IB operations and >>also to have example code to use different features of IB. >> >>Thanks, >>Sayantan. >> > > > I dont plan to port the monolithic perf_main to gen2. > > Instead, I plan to upload a set of microbenchmarks each testing > a specific feature: rdma latency/rc send latency/rdma bandwidth/rc send > bandwidth etc. > > I hope that this will help achieve better code clarity than what we > have in perf_main. That is fine. As long as there is some ibverbs level benchmark suite, it will ease the transition. > > What are the features you are most interested in? By RDMA latency/bandwidth, do you mean both RDMA write & read? Will there be any Atomic latency tests also? > > By the way, you can already see some example code in libibverbs/examples, > although that is not necessarily benchmark-oriented. > I used the code in pingpong.c as a starting point for rdma latency > test, and it can be used with very little changes as rc send latency > test. Yes, this was helpful for me too. Apart from providing code examples, I was suggesting that a much more comprehensive benchmark suite will help people moving from other vendor stacks to Gen2 to have a quick comparison of performance offered by both stacks. Thanks, Sayantan. -- --------------------------------------------------------- Sayantan Sur Graduate Research Assistant 395 Dreese Labs, Computer Science and Engineering Ohio State University, Office : 774, Dreese Labs Columbus, email : surs at cse.ohio-state.edu Ohio - 43210. phone(res) : 614.688.9792 USA. 
phone(off) : 614.292.8501 --------------------------------------------------------- From mst at mellanox.co.il Mon Apr 18 06:08:56 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 18 Apr 2005 16:08:56 +0300 Subject: [openib-general] performance tests uploaded to contrib In-Reply-To: <4263AF80.1000601@cse.ohio-state.edu> References: <20050417113647.GF16996@mellanox.co.il> <20050417195432.GA22185@cse.ohio-state.edu> <20050417204007.GB9442@mellanox.co.il> <4263AF80.1000601@cse.ohio-state.edu> Message-ID: <20050418130856.GF17566@mellanox.co.il> Quoting r. Sayantan Sur : > Subject: Re: [openib-general] performance tests uploaded to contrib > > Michael S. Tsirkin wrote: > >Quoting r. Sayantan Sur : > > > >>Subject: Re: [openib-general] performance tests uploaded to contrib > >> > >>Michael, > >> > >> > >>>Current results: > >>> > >>>I currently observe latency below 3.5 usec. > >>> > >>>Drop me a note if you find this useful. > >> > >>Thanks for putting this up on the contrib tree. I have run this > >>rdma_latency, and am getting around 3.35 us (without switch). > >> > >>Our platform is Dual Xeon EM64T with RH AS 4. Kernel version 2.6.11.7 > >> > >>Do you have any idea when a port of the popular `perf_main' will be > >>available? As more people try to use the Gen2 verbs, `perf_main' (or > >>something similar) can help people evaluate different IB operations and > >>also to have example code to use different features of IB. > >> > >>Thanks, > >>Sayantan. > >> > > > > > >I dont plan to port the monolithic perf_main to gen2. > > > >Instead, I plan to upload a set of microbenchmarks each testing > >a specific feature: rdma latency/rc send latency/rdma bandwidth/rc send > >bandwidth etc. > > > >I hope that this will help achieve better code clarity than what we > >have in perf_main. > > That is fine. As long as there is some ibverbs level benchmark suite, it > will ease the transition. > > > > >What are the features you are most interested in? 
> > By RDMA latency/bandwidth, do you mean both RDMA write & read? RDMA write test is out there, I hope to upload the read test RSN. > Will > there be any Atomic latency tests also? Sure, why not. Is it a priority for you? > > > >By the way, you can already see some example code in libibverbs/examples, > >although that is not necessarily benchmark-oriented. > >I used the code in pingpong.c as a starting point for rdma latency > >test, and it can be used with very little changes as rc send latency > >test. > > Yes, this was helpful for me too. Apart from providing code examples, I > was suggesting that a much more comprehensive benchmark suite will help > people moving from other vendor stacks to Gen2 to have a quick > comparison of performance offered by both stacks. > > Thanks, > Sayantan. > I agree. -- MST - Michael S. Tsirkin From mst at mellanox.co.il Mon Apr 18 07:50:46 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 18 Apr 2005 17:50:46 +0300 Subject: [openib-general] ttcp.aio - kernel NULL pointer dereference Message-ID: <20050418145046.GG17566@mellanox.co.il> Hello, Libor! Every once in a while, when I run ttcp I get a kernel NULL pointer dereference from SDP I compiled the ttcp.aio test with gcc -I../../../linux-kernel/infiniband/ulp/sdp ttcp.aio.c -O2 -o ttcp.aio.x -laio I run ttcp on the server as ./ttcp.aio.x -r -l 100 -a 10 and the client as ./ttcp.aio.x -t -l 100 -n 100 -a 10 11.4.8.155 I repeated this test several times, sometimes getting ttcp-t: Event error <-32> <5275648> messages and sometimes not. It was the server that finally crashed. My kernel is 2.6.11 + latest openib svn (rev 2171). 
The log file leading to the crash is below:

Apr 18 17:34:11 swlab155 kernel: ERR: : IOCB <0> cancel <0> flag <0040> size <1:0:1>
Apr 18 17:34:22 swlab155 kernel: ERR: : IOCB <0> cancel <0> flag <0040> size <100:0:100>
Apr 18 17:34:41 swlab155 kernel: ERR: : VMA lock <528000:100> error <-12> <1:8:8>
Apr 18 17:34:41 swlab155 kernel: ERR: : VMA lock <52c000:100> error <-12> <1:8:8>
Apr 18 17:34:49 swlab155 kernel: ERR: : VMA lock <528000:100> error <-12> <1:8:8>
Apr 18 17:34:49 swlab155 kernel: ERR: : VMA lock <52c000:100> error <-12> <1:8:8>
Apr 18 17:34:59 swlab155 kernel: ERR: : VMA lock <528000:100> error <-12> <1:8:8>
Apr 18 17:34:59 swlab155 kernel: ERR: : VMA lock <52c000:100> error <-12> <1:8:8>
Apr 18 17:34:59 swlab155 kernel: WARN: : Unexpected conn state. conn <9> state
Apr 18 17:35:22 swlab155 kernel: ERR: : IOCB <0> cancel <0> flag <0040> size <100:0:100>
Apr 18 17:35:34 swlab155 last message repeated 5 times
Apr 18 17:35:44 swlab155 kernel: ERR: : VMA lock <528000:100> error <-12> <1:8:8>
Apr 18 17:35:44 swlab155 kernel: ERR: : VMA lock <52c000:100> error <-12> <1:8:8>
Apr 18 17:35:52 swlab155 kernel: ERR: : VMA lock <528000:100> error <-12> <1:8:8>
Apr 18 17:35:52 swlab155 kernel: ERR: : VMA lock <52c000:100> error <-12> <1:8:8>
Apr 18 17:35:52 swlab155 kernel: WARN: : Cancel read with no IOCB. 
<2:0:00000005>
Apr 18 17:35:52 swlab155 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000038 RIP:
Apr 18 17:35:52 swlab155 kernel: {_spin_lock_irqsave+9}
Apr 18 17:35:52 swlab155 kernel: PGD 15cb56067 PUD 15cbb4067 PMD 0
Apr 18 17:35:52 swlab155 kernel: Oops: 0002 [1] SMP
Apr 18 17:35:52 swlab155 kernel: CPU 0
Apr 18 17:35:52 swlab155 kernel: Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad ib_core
Apr 18 17:35:52 swlab155 kernel: Pid: 6, comm: events/0 Not tainted 2.6.11-openib
Apr 18 17:35:52 swlab155 kernel: RIP: 0010:[_spin_lock_irqsave+9/27] {_spin_lock_irqsave+9}
Apr 18 17:35:52 swlab155 kernel: RIP: 0010:[] {_spin_lock_irqsave+9}
Apr 18 17:35:52 swlab155 kernel: RSP: 0000:ffff8100dfe9fe08 EFLAGS: 00010092
Apr 18 17:35:52 swlab155 kernel: RAX: 0000000000000064 RBX: 0000000000000000 RCX: ffff81015c596528
Apr 18 17:35:52 swlab155 kernel: RDX: 0000000000000000 RSI: 0000000000000064 RDI: 0000000000000038
Apr 18 17:35:52 swlab155 kernel: RBP: ffff81014dd23080 R08: ffff8100dfe9e000 R09: 0000000000000000
Apr 18 17:35:52 swlab155 kernel: R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000068
Apr 18 17:35:52 swlab155 kernel: R13: 0000000000000064 R14: 0000000000000038 R15: 0000000000000000
Apr 18 17:35:52 swlab155 kernel: FS: 0000000000000000(0000) GS:ffffffff80522c80(0000) knlGS:0000000000000000
Apr 18 17:35:52 swlab155 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Apr 18 17:35:52 swlab155 kernel: CR2: 0000000000000038 CR3: 000000015c626000 CR4: 00000000000006e0
Apr 18 17:35:52 swlab155 kernel: Process events/0 (pid: 6, threadinfo ffff8100dfe9e000, task ffff8100dff02750)
Apr 18 17:35:52 swlab155 kernel: Stack: 0000000000000292 ffffffff8018663b 0000000000000286 ffff81014dec1680
Apr 18 17:35:52 swlab155 kernel: ffff81014dec1718 ffff8100dffa2000 ffff81014dec1680 0000000000000292
Apr 18 17:35:52 swlab155 kernel: ffffffff8804a8c2 ffffffff8804a956
Apr 18 17:35:52 swlab155 kernel: Call 
Trace:{aio_complete+129} {:ib_sdp:do_iocb_complete+0}
Apr 18 17:35:52 swlab155 kernel: {:ib_sdp:do_iocb_complete+148} {worker_thread+476}
Apr 18 17:35:52 swlab155 kernel: {default_wake_function+0} {default_wake_function+0}
Apr 18 17:35:52 swlab155 kernel: {worker_thread+0} {kthread+206}
Apr 18 17:35:52 swlab155 kernel: {child_rip+8} {kthread+0}
Apr 18 17:35:52 swlab155 kernel: {child_rip+0}
Apr 18 17:35:52 swlab155 kernel:
Apr 18 17:35:52 swlab155 kernel: Code: f0 fe 0f 0f 88 8b 01 00 00 48 8b 04 24 48 83 c4 08 c3 fa f0
Apr 18 17:35:52 swlab155 kernel: RIP {_spin_lock_irqsave+9} RSP
Apr 18 17:35:52 swlab155 kernel: CR2: 0000000000000038

--
MST - Michael S. Tsirkin

From roland at topspin.com Mon Apr 18 07:42:36 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 07:42:36 -0700
Subject: [openib-general] Re: [PATCH] uverbs with static libraries
References: <20050410084724.GZ20567@mellanox.co.il> <52is2t1g1m.fsf@topspin.com> <20050417135658.GK16996@mellanox.co.il>
Message-ID: <52u0m4cfj7.fsf@topspin.com>

    Michael> You also need the patch below to enable static libraries.
    Michael> Then when you link you must pass

    Michael> -u openib_driver_init -rdynamic

    Michael> to gcc, to pull in the driver library. Roland, please
    Michael> let me know whether you plan to apply this and the
    Michael> previous patch.

I was waiting to see what changes you required to libmthca. However I
don't really see the point of deleting AC_DISABLE_STATIC -- if someone
wants a static library then all that's required is passing
"--enable-static" to the configure script. What am I missing?

I guess I will apply the patch to libibverbs to call openib_driver_init
if it's linked in directly.

 - R.
From roland at topspin.com Mon Apr 18 07:42:37 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 18 Apr 2005 07:42:37 -0700 Subject: [openib-general] Re: openIB gen2 user space verbs API References: <527jj4kpkv.fsf@topspin.com> <20050415215836.GA6479@cse.ohio-state.edu> <52r7hbhexk.fsf@topspin.com> <20050416172303.GC854@mellanox.co.il> Message-ID: <52oecccfj6.fsf@topspin.com> Michael> An application would need to know what values is it legal Michael> to pass to create_qp. Maybe it makes sence to implement Michael> something like query_hca, and let it return the maximum Michael> legal value? The Mellanox VAPI just used the maximum number of sg entries for the send queue to calculate the inline data value for a QP. Does it make sense to change this interface? - R. From surs at cse.ohio-state.edu Mon Apr 18 07:52:09 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Mon, 18 Apr 2005 10:52:09 -0400 Subject: [openib-general] performance tests uploaded to contrib In-Reply-To: <20050418130856.GF17566@mellanox.co.il> References: <20050417113647.GF16996@mellanox.co.il> <20050417195432.GA22185@cse.ohio-state.edu> <20050417204007.GB9442@mellanox.co.il> <4263AF80.1000601@cse.ohio-state.edu> <20050418130856.GF17566@mellanox.co.il> Message-ID: <20050418145208.GA23304@cse.ohio-state.edu> * On Apr,5 Michael S. Tsirkin wrote : > Quoting r. Sayantan Sur : > > Subject: Re: [openib-general] performance tests uploaded to contrib > > > > Michael S. Tsirkin wrote: > > >Quoting r. Sayantan Sur : > > > > > >>Subject: Re: [openib-general] performance tests uploaded to contrib > > >> > > >>Michael, > > >> > > >> > > >>>Current results: > > >>> > > >>>I currently observe latency below 3.5 usec. > > >>> > > >>>Drop me a note if you find this useful. > > >> > > >>Thanks for putting this up on the contrib tree. I have run this > > >>rdma_latency, and am getting around 3.35 us (without switch). > > >> > > >>Our platform is Dual Xeon EM64T with RH AS 4. 
Kernel version 2.6.11.7 > > >> > > >>Do you have any idea when a port of the popular `perf_main' will be > > >>available? As more people try to use the Gen2 verbs, `perf_main' (or > > >>something similar) can help people evaluate different IB operations and > > >>also to have example code to use different features of IB. > > >> > > >>Thanks, > > >>Sayantan. > > >> > > > > > > > > >I dont plan to port the monolithic perf_main to gen2. > > > > > >Instead, I plan to upload a set of microbenchmarks each testing > > >a specific feature: rdma latency/rc send latency/rdma bandwidth/rc send > > >bandwidth etc. > > > > > >I hope that this will help achieve better code clarity than what we > > >have in perf_main. > > > > That is fine. As long as there is some ibverbs level benchmark suite, it > > will ease the transition. > > > > > > > >What are the features you are most interested in? > > > > By RDMA latency/bandwidth, do you mean both RDMA write & read? > > RDMA write test is out there, I hope to upload the read test RSN. > > > Will > > there be any Atomic latency tests also? > > Sure, why not. Is it a priority for you? Nope. Not a priority RDMA Write/Read will do for some time to come. Thanks, Sayantan. > > > > > > >By the way, you can already see some example code in libibverbs/examples, > > >although that is not necessarily benchmark-oriented. > > >I used the code in pingpong.c as a starting point for rdma latency > > >test, and it can be used with very little changes as rc send latency > > >test. > > > > Yes, this was helpful for me too. Apart from providing code examples, I > > was suggesting that a much more comprehensive benchmark suite will help > > people moving from other vendor stacks to Gen2 to have a quick > > comparison of performance offered by both stacks. > > > > Thanks, > > Sayantan. > > > > I agree. > > -- > MST - Michael S. 
Tsirkin -- --------------------------------------------------------- Sayantan Sur Graduate Research Assistant 395 Dreese Labs, Computer Science and Engineering Ohio State University, Office : 774, Dreese Labs Columbus, email : surs at cse.ohio-state.edu Ohio - 43210. phone(res) : 614.688.9792 USA. phone(off) : 614.292.8501 --------------------------------------------------------- From mst at mellanox.co.il Mon Apr 18 08:24:52 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 18 Apr 2005 18:24:52 +0300 Subject: [openib-general] Re: [PATCH] uverbs with static libraries In-Reply-To: <52u0m4cfj7.fsf@topspin.com> References: <20050410084724.GZ20567@mellanox.co.il> <52is2t1g1m.fsf@topspin.com> <20050417135658.GK16996@mellanox.co.il> <52u0m4cfj7.fsf@topspin.com> Message-ID: <20050418152452.GH17566@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] uverbs with static libraries > > Michael> You also need the patch below to enable static libraries. > Michael> Then when you link you must pass > > Michael> -u openib_driver_init -rdynamic > > Michael> to gcc, to pull in the driver library. By the way, any ideas on how to make this step easier for users of the static library? > Michael> Roland, please let me know whether you plan to > Michael> apply this and the previous patch. > > I was waiting to see what changes you required to libmthca. However I > don't really see the point of deleting AC_DISABLE_STATIC -- if someone > wants a static library then all that's required is passing > "--enable-static" to the configure script. What am I missing? Does this work for you? Is mthca.a created? This does not seem to work for me - static library isnt created unless I remove the AC_DISABLE_STATIC. Put another way - whats the harm in always building the static version as well? Other libraries (e.g. libibverbs) build both static and shared versions by default. 
> I guess I will apply the patch to libibverbs to call openib_driver_init > if it's linked in directly. > > - R. > -- MST - Michael S. Tsirkin From roland at topspin.com Mon Apr 18 07:57:38 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 18 Apr 2005 07:57:38 -0700 Subject: [openib-general] Re: Gen2 User verbs usage References: <20050415182702.GA5572@cse.ohio-state.edu> <52hdi7kd2o.fsf@topspin.com> <20050416165804.GA854@mellanox.co.il> Message-ID: <52is2kceu5.fsf@topspin.com> Michael> There currently doesnt seem to exist a way for userspace Michael> to know when is setting the INLINE flag possible. I seem to remember some discussion where we said the inline flag was just a hint for the low-level driver, so it's always possible to set it. Michael> It would seem what we need is another attribute passed to Michael> create_qp that would specify the max inline buffer size, Michael> and probably an hca attribute to give the maximum legal Michael> value for this attribute. Do we care enough about this special feature to do this? It seems Mellanox VAPI worked out OK without that. - R. From robert.j.woodruff at intel.com Mon Apr 18 08:27:10 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Mon, 18 Apr 2005 08:27:10 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: <20050416020715.GG30386@esmail.cup.hp.com> Message-ID: >On Fri, Apr 15, 2005 at 05:23:25PM -0700, Bob Woodruff wrote: >> Might want to have something like, >> >> diff-2.6.9-01-openib_drivers-SVNxxx >> diff-2.6.9-02-openib_fixup-SVNxxx >> diff-2.6.9-03-ib_kernel_changes >> >> where xxx is the SVN version >> >> so that we can have fixups that match a specific SVN version. >Sure, that's a good idea. >And it reminds me that we still have no clue which version >of SVN someone's IB kernel driver is based on. Supporting >this is going to be painful unless this is dealt with. 
>Anyone have a clue how to get SVN or "make" to embed >the version number in any resulting .o ? >It's bad enough when distro's patch drivers without updating >the rev number. But not having a reliable rev number to start >with is even worse. >thanks, >grant Ok, I will look at generating patches for the base drivers, the fixups, and the kernel patches. I also think that it is a good idea to affix a SVN rev or some other rev. number to a particular driver. Not sure of the best way to do this. In the past I have used schemes where the rev number is generated with a build number and date and put into a string in a version.h file that is built into the module. Not sure what is the best thing for this project. woody From mst at mellanox.co.il Mon Apr 18 08:33:29 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 18 Apr 2005 18:33:29 +0300 Subject: [openib-general] Re: Gen2 User verbs usage In-Reply-To: <52is2kceu5.fsf@topspin.com> References: <20050415182702.GA5572@cse.ohio-state.edu> <52hdi7kd2o.fsf@topspin.com> <20050416165804.GA854@mellanox.co.il> <52is2kceu5.fsf@topspin.com> Message-ID: <20050418153329.GI17566@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Gen2 User verbs usage > > Michael> There currently doesnt seem to exist a way for userspace > Michael> to know when is setting the INLINE flag possible. > > I seem to remember some discussion where we said the inline flag was > just a hint for the low-level driver, so it's always possible to set it. I'm not against this approach, on principle, what bothers me latency vs cpu utilization tradeoff is involved so the low-level driver may not be the right place to take that decision. There's also the point that for inline you dont need to pass in a valid rkey, so the app may be better off knowing about it. Finally, if its just a hint a separate pass over the s/g list would be needed to calculate the size and check it fits inline, which kind of implies performance penalty. What is your opinion? 
> Michael> It would seem what we need is another attribute passed to > Michael> create_qp that would specify the max inline buffer size, > Michael> and probably an hca attribute to give the maximum legal > Michael> value for this attribute. > > Do we care enough about this special feature to do this? It seems > Mellanox VAPI worked out OK without that. > > - R. > It may be users are only recently waking to this feature. Latency benefits for small to medium sized messages are significant. -- MST - Michael S. Tsirkin From mst at mellanox.co.il Mon Apr 18 08:41:46 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 18 Apr 2005 18:41:46 +0300 Subject: [openib-general] Re: openIB gen2 user space verbs API In-Reply-To: <52oecccfj6.fsf@topspin.com> References: <527jj4kpkv.fsf@topspin.com> <20050415215836.GA6479@cse.ohio-state.edu> <52r7hbhexk.fsf@topspin.com> <20050416172303.GC854@mellanox.co.il> <52oecccfj6.fsf@topspin.com> Message-ID: <20050418154146.GJ17566@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: openIB gen2 user space verbs API > > Michael> An application would need to know what values is it legal > Michael> to pass to create_qp. Maybe it makes sence to implement > Michael> something like query_hca, and let it return the maximum > Michael> legal value? > > The Mellanox VAPI just used the maximum number of sg entries for the > send queue to calculate the inline data value for a QP. Does it make > sense to change this interface? > > - R. > It was actually you who first proposed the change :) But I'd like to defend that decision. I think applications have different latency/CPU utilization tradeoffs. A microbenchmark may want to push as much data as possible inline, another application may not. So I think what you end up doing with VAPI API is applications starting with the inline size they actually want and hardcoding the tavor size to s/g entries ratio to get the right values in query qp. 
If they do it right they will even work with other HCAs, just more slowly (say the app checks qp properties and sees it didnt get the inline size that it wanted, so it doesnt use inline), but isnt what I proposed cleaner? Since its an ABI change anyway - lets do it right? Its easy enough to implement - just say the word ... -- MST - Michael S. Tsirkin From robert.j.woodruff at intel.com Mon Apr 18 08:41:08 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 18 Apr 2005 08:41:08 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel Message-ID: <1AC79F16F5C5284499BB9591B33D6F000424683E@orsmsx408> >>On Fri, Apr 15, 2005 at 05:23:25PM -0700, Bob Woodruff wrote: >>> Might want to have something like, >>> >>> diff-2.6.9-01-openib_drivers-SVNxxx >>> diff-2.6.9-02-openib_fixup-SVNxxx >>> diff-2.6.9-03-ib_kernel_changes >>> >>> where xxx is the SVN version >>> >>> so that we can have fixups that match a specific SVN version. >>Sure, that's a good idea. >Ok, I will look at generating patches for the base drivers, the fixups, and >the kernel patches. Roland, do you know what the SVN rev was for that latest code that was submitted to 2.6.12-rc2-mm3. That is the version that we discussed starting with for an initial 2.6.9 backport, but as suggested, I want to embed the SVN rev. into the file name of the patches for clarity. woody From mst at mellanox.co.il Mon Apr 18 08:48:46 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 18 Apr 2005 18:48:46 +0300 Subject: [openib-general] Re: openIB gen2 user space verbs API In-Reply-To: <52oecccfj6.fsf@topspin.com> References: <527jj4kpkv.fsf@topspin.com> <20050415215836.GA6479@cse.ohio-state.edu> <52r7hbhexk.fsf@topspin.com> <20050416172303.GC854@mellanox.co.il> <52oecccfj6.fsf@topspin.com> Message-ID: <20050418154846.GK17566@mellanox.co.il> Quoting r. 
Roland Dreier : > Subject: Re: openIB gen2 user space verbs API > > Michael> An application would need to know what values is it legal > Michael> to pass to create_qp. Maybe it makes sence to implement > Michael> something like query_hca, and let it return the maximum > Michael> legal value? > > The Mellanox VAPI just used the maximum number of sg entries for the > send queue to calculate the inline data value for a QP. Does it make > sense to change this interface? > > - R. > Lets look at an example like pingpong.c I think its cleanest for this test to have --inline flag which would make all work requests inline. The test will then create the qp setting the right inline size. If the user passed a size too big to be inline, we want the test to be able to detect this and print a clear error message, not fail in create_qp. Right? -- MST - Michael S. Tsirkin From roland at topspin.com Mon Apr 18 08:42:38 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 18 Apr 2005 08:42:38 -0700 Subject: [openib-general] Advice about adapting ibv_pingpong to use UC References: <42627B5A.7010709@wooding.uklinux.net> Message-ID: <52d5ssccr5.fsf@topspin.com> Steven> Hi, I wonder if someone working on the gen2 uverbs would Steven> be so kind as to give me some advice on adapting the Steven> ibv_pingpong program to use a UC QP type rather than RC. I Steven> was previously able to do this with the Mellanox stack by Steven> changing the qp_type attribute and then not setting Steven> variables that are only needed for RC (timeout and retry Steven> periods etc). Unfortunately UC support has not been implemented in the gen2 stack. It probably wouldn't be that hard but I can't say when I'll get a chance to look at it. - R. 
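The check Michael asks for in pingpong.c above (reject an oversized --inline request with a clear message before create_qp is ever called) might look roughly like the sketch below. It assumes the HCA attribute he proposes exists; struct hyp_hca_limits, its max_inline field and check_inline_size() are hypothetical names for illustration and are not part of the gen2 verbs API discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical: stands in for the proposed "max legal inline size" HCA attribute. */
struct hyp_hca_limits {
    size_t max_inline;   /* largest inline payload the HCA would accept, in bytes */
};

/*
 * Returns 0 if the test may proceed, -1 if the user asked for inline
 * sends with a message size the (hypothetical) HCA limit cannot satisfy.
 * When inline was not requested, any message size is accepted.
 */
int check_inline_size(const struct hyp_hca_limits *lim, size_t msg_size, int want_inline)
{
    assert(lim != NULL);

    if (!want_inline)
        return 0;

    if (msg_size > lim->max_inline) {
        fprintf(stderr, "message size %zu exceeds max inline size %zu\n",
                msg_size, lim->max_inline);
        return -1;
    }
    return 0;
}
```

With a query of this kind the test can fail early with a readable error instead of failing inside create_qp, which is the behaviour Michael argues for.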
From roland at topspin.com Mon Apr 18 08:42:41 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 18 Apr 2005 08:42:41 -0700 Subject: [openib-general] Static Rate Questions References: <506C3D7B14CDD411A52C00025558DED6078F2B30@mtlex01.yok.mtl.com> Message-ID: <527jj0ccr2.fsf@topspin.com> Dror> - In struct ib_ah_attr, static_rate is defined as u8. What Dror> are the expected values that static_rate is supposed to take Dror> ? Is it absolute Gb/s ? Gb/s in 2.5Gb/s units ? or relative Dror> rate to port speed ? My understanding is that it is the inter-packet delay. However I don't have any strong objection to changing this interface. Dror> - In mthca, there are two places setting up the Dror> static_rate. One for AH which looks fine. The other one for Dror> QP which I believe has a bug. Yes, you're right. I'm not sure where the " << 3" came from; I've deleted it. Dror> - A question for next generation HW. Would you find it more Dror> useful that the HCA supports static rate as an absolute Dror> speed (Gb/s) or as a relative ratio to the current port Dror> speed ? I'm not sure it makes much difference one way or the other. We're not talking about a lot of complex code to deal with static rates, whatever the hardware interface is. - R. From roland at topspin.com Mon Apr 18 08:46:43 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 18 Apr 2005 08:46:43 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000424683E@orsmsx408> (Robert J. Woodruff's message of "Mon, 18 Apr 2005 08:41:08 -0700") References: <1AC79F16F5C5284499BB9591B33D6F000424683E@orsmsx408> Message-ID: <523btocckc.fsf@topspin.com> Robert> Roland, do you know what the SVN rev was for that latest Robert> code that was submitted to 2.6.12-rc2-mm3. That is the Robert> version that we discussed starting with for an initial Robert> 2.6.9 backport, but as suggested, I want to embed the SVN Robert> rev. 
into the file name of the patches for clarity. It doesn't really make sense to talk about the svn rev for that tree, since I went through and picked some patches but not others to merge upstream. - R. From timur.tabi at ammasso.com Mon Apr 18 09:09:35 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 18 Apr 2005 11:09:35 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <52mzs51g5g.fsf@topspin.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> Message-ID: <4263DBBF.9040801@ammasso.com> Roland Dreier wrote: > Troy> How is memory pinning handled? (I haven't had time to read > Troy> all the code, so please excuse my ignorance of something > Troy> obvious). > > The userspace library calls mlock() and then the kernel does > get_user_pages(). Why do you call mlock() and get_user_pages()? In our code, we only call mlock(), and the memory is pinned. We have a test case that fails if only get_user_pages() is called, but it passes if only mlock() is called. -- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From arjan at infradead.org Mon Apr 18 09:16:12 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Mon, 18 Apr 2005 18:16:12 +0200 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <4263DBBF.9040801@ammasso.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> Message-ID: <1113840973.6274.84.camel@laptopd505.fenrus.org> On Mon, 2005-04-18 at 11:09 -0500, Timur Tabi wrote: > Roland Dreier wrote: > > Troy> How is memory pinning handled? (I haven't had time to read > > Troy> all the code, so please excuse my ignorance of something > > Troy> obvious). > > > > The userspace library calls mlock() and then the kernel does > > get_user_pages(). 
> > Why do you call mlock() and get_user_pages()? In our code, we only call mlock(), and the > memory is pinned. this is a myth; linux is free to move the page about in physical memory even if it's mlock()ed!! And even then, the user can munlock the memory from another thread etc etc. Not a good idea. get_user_pages() is used from AIO and other parts of the kernel for similar purposes and in fact is designed for it, so it better work. If it has bugs those should be fixed, not worked around! From timur.tabi at ammasso.com Mon Apr 18 09:22:29 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 18 Apr 2005 11:22:29 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050411171347.7e05859f.akpm@osdl.org> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> Message-ID: <4263DEC5.5080909@ammasso.com> Andrew Morton wrote: > Roland Dreier wrote: > >> Troy> Do we even need the mlock in userspace then? >> >>Yes, because the kernel may go through and unmap pages from userspace >>while trying to swap. Since we have the page locked in the kernel, >>the physical page won't go anywhere, but userspace might end up with a >>different page mapped at the same virtual address. > > > That shouldn't happen. If get_user_pages() has elevated the refcount on a > page then the following can happen: > > - The VM may decide to add the page to swapcache (if it's not mmapped > from a file). > > - Once the page is backed by either swapcache of a (mmapped) file, the VM > may decide the unmap the application's pte's. A later minor fault by the > app will cause the same physical page to be remapped. That's not what we're seeing. 
We have hardware that does DMA over the network (much like the Infiniband stuff), and we have a testcase that fails if get_user_pages() is used, but not if mlock() is used.

Consider two computers on a network, X and Y. Both have our hardware, which can transfer a page of memory from a given physical address on X to a physical address on Y.

1) Application on X allocates a block of memory, and passes the virtual address to the driver.
2) Driver on X calls get_user_pages() and then obtains a physical address for the memory.
3) Application and driver on Y do the same thing.
4) App X fills memory with some data D.
5) App X then allocates as much memory as it possibly can. It touches every page in this memory, and then frees the memory. This will force other pages to be swapped out, including the supposedly pinned memory.
6) App X then tells Driver X to transfer data D to computer Y.
7) App Y compares data D and finds that it doesn't match what it's supposed to be.

Conclusion: during step 5, the data in pinned memory is swapped out or something. I'm not sure where it goes.

We can only demonstrate this problem using our hardware, because you need the ability to transfer memory without using the CPU. We were going to prepare a test case and ship some hardware to a few kernel developers to prove our point, but now that we're able to call mlock() in non-user processes, we decided it wasn't worth our time.

Actually, I discovered that I can call cap_raise() and set the ulimit structure, which gives me the ability to call mlock() on any amount of memory from any process in 2.4 and 2.6 kernels, which we need to support. If I had thought of that earlier, I wouldn't have needed all those hacks to call sys_mlock() from the driver.
-- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From timur.tabi at ammasso.com Mon Apr 18 09:25:20 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 18 Apr 2005 11:25:20 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <1113840973.6274.84.camel@laptopd505.fenrus.org> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> <1113840973.6274.84.camel@laptopd505.fenrus.org> Message-ID: <4263DF70.2060702@ammasso.com> Arjan van de Ven wrote: > this is a myth; linux is free to move the page about in physical memory > even if it's mlock()ed!! Then Linux has a very odd definition of the word "locked". > And even then, the user can munlock the memory from another thread etc > etc. Not a good idea. Well, that's okay, because then the app is doing something stupid, so we don't worry about that. > get_user_pages() is used from AIO and other parts of the kernel for > similar purposes and in fact is designed for it, so it better work. If > it has bugs those should be fixed, not worked around! I've been complaining about get_user_pages() not working for a long time now, but I can only demonstrate the problem with our hardware. See my other post in this thread for details. -- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From iod00d at hp.com Mon Apr 18 09:31:39 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 18 Apr 2005 09:31:39 -0700 Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: <20050416170013.GB854@mellanox.co.il> References: <20050415232658.GE30386@esmail.cup.hp.com> <20050416020715.GG30386@esmail.cup.hp.com> <20050416170013.GB854@mellanox.co.il> Message-ID: <20050418163139.GB6931@esmail.cup.hp.com> On Sat, Apr 16, 2005 at 08:00:13PM +0300, Michael S. 
Tsirkin wrote:
> You have to run the svnversion utility to get the revision.

That's fine if I have the source tree. (I usually just look in .svn/entries) But I need to track version number in existing binaries so I know which source tree they come from.

grant

From roland at topspin.com Mon Apr 18 08:57:38 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 08:57:38 -0700
Subject: [openib-general] Re: patches
References: <20050408093558.GB21709@mellanox.co.il> <52psx545sy.fsf@topspin.com> <20050409172150.GA31200@mellanox.co.il> <52hdif3ggn.fsf@topspin.com> <20050414080648.GE32526@mellanox.co.il>
Message-ID: <52wtr0axhp.fsf@topspin.com>

Thanks, it turns out I made the same mistake several times:

--- infiniband/hw/mthca/mthca_cmd.c (revision 2156)
+++ infiniband/hw/mthca/mthca_cmd.c (working copy)
@@ -1053,7 +1053,7 @@ int mthca_QUERY_ADAPTER(struct mthca_dev
 	MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET);
 
 out:
-	pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma);
+	pci_free_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, outbox, outdma);
 	return err;
 }
 
@@ -1224,7 +1224,7 @@ int mthca_INIT_IB(struct mthca_dev *dev,
 	err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB,
 			CMD_TIME_CLASS_A, status);
 
-	pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma);
+	pci_free_consistent(dev->pdev, INIT_IB_IN_SIZE, inbox, indma);
 	return err;
 }
 
@@ -1269,7 +1269,7 @@ int mthca_SET_IB(struct mthca_dev *dev,
 	err = mthca_cmd(dev, indma, port, 0, CMD_SET_IB,
 			CMD_TIME_CLASS_B, status);
 
-	pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma);
+	pci_free_consistent(dev->pdev, INIT_SET_IB_IN_SIZE, inbox, indma);
 	return err;
 }

From roland at topspin.com Mon Apr 18 08:57:39 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 08:57:39 -0700
Subject: [openib-general] Re: [PATCH] uverbs with static libraries
References: <20050410084724.GZ20567@mellanox.co.il>
Message-ID: <52r7h8axho.fsf@topspin.com>

It made
more sense to me to load a static driver once before calling find_drivers() for each entry in our driver path. Is there anything wrong with this?

 - R.

--- libibverbs/src/init.c (revision 2156)
+++ libibverbs/src/init.c (working copy)
@@ -198,6 +198,11 @@ static void INIT ibverbs_init(void)
 	if (ibv_init_mem_map())
 		return;
 
+	/*
+	 * Check if a driver is statically linked, and if so load it first.
+	 */
+	load_driver(NULL);
+
 	user_path = getenv(OPENIB_DRIVER_PATH_ENV);
 	if (user_path) {
 		wr_path = strdupa(user_path);

From roland at topspin.com Mon Apr 18 09:11:56 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 09:11:56 -0700
Subject: [openib-general] Re: [PATCH] uverbs with static libraries
In-Reply-To: <20050418152452.GH17566@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 18 Apr 2005 18:24:52 +0300")
References: <20050410084724.GZ20567@mellanox.co.il> <52is2t1g1m.fsf@topspin.com> <20050417135658.GK16996@mellanox.co.il> <52u0m4cfj7.fsf@topspin.com> <20050418152452.GH17566@mellanox.co.il>
Message-ID: <52mzrwawtv.fsf@topspin.com>

    Michael> Does this work for you? Is mthca.a created?

I just tried this:

    $ ../libmthca/configure --enable-static CPPFLAGS=-I$(pwd)/../libibverbs/include --prefix=$HOME/junk
    $ make
    $ make install

and I get this:

    $ tree ~/junk
    /data/home/roland/junk
    `-- lib
        `-- infiniband
            |-- mthca.a
            |-- mthca.la
            `-- mthca.so

so yes, it looks like it works.

    Michael> Put another way - whats the harm in always building the
    Michael> static version as well? Other libraries (e.g. libibverbs)
    Michael> build both static and shared versions by default.

I don't think of libmthca as a library really. It's a plug in loaded by libibverbs. In some specialized circumstances it may be useful to build it statically but in general it's just unneeded confusion.

 - R.
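Why the extra link flags Michael mentions earlier in the thread (-u openib_driver_init -rdynamic) matter for a statically linked plug-in can be illustrated with a small sketch. It assumes, without the thread confirming it, that load_driver(NULL) ends up passing NULL to dlopen() and then looking up the init symbol with dlsym(); find_static_driver_init() is a hypothetical helper, not a libibverbs function.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/*
 * Look up a driver init symbol in the running executable itself.
 *
 * dlopen(NULL, ...) returns a handle for the main program. A driver that
 * was linked in statically can only be found this way if its init symbol
 * was (a) pulled into the link at all and (b) exported to the dynamic
 * symbol table. Returns NULL when the symbol is not visible.
 *
 * The handle is intentionally left open: it refers to the main program,
 * which stays loaded for the life of the process anyway.
 */
void *find_static_driver_init(const char *symbol)
{
    void *handle;

    assert(symbol != NULL);

    handle = dlopen(NULL, RTLD_NOW);
    if (!handle)
        return NULL;

    return dlsym(handle, symbol);
}
```

Under that assumption, -u forces the linker to keep the otherwise unreferenced init function from the static archive, and -rdynamic makes it visible to dlsym(). On older glibc versions the sketch needs to be linked with -ldl.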
From roland at topspin.com Mon Apr 18 09:12:45 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 18 Apr 2005 09:12:45 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <4263DBBF.9040801@ammasso.com> (Timur Tabi's message of "Mon, 18 Apr 2005 11:09:35 -0500") References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> Message-ID: <52is2kawsi.fsf@topspin.com> Timur> Why do you call mlock() and get_user_pages()? In our code, Timur> we only call mlock(), and the memory is pinned. We have a Timur> test case that fails if only get_user_pages() is called, Timur> but it passes if only mlock() is called. What if a buggy/malicious userspace program doesn't call mlock()? - R. From roland at topspin.com Mon Apr 18 09:27:38 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 18 Apr 2005 09:27:38 -0700 Subject: [openib-general] openIB gen2 user space verbs API References: Message-ID: <52d5ssaw3p.fsf@topspin.com> Or> Other than getting the max inline size, query qp is used to Or> get the current QP state, examples are app error flow (eg when Or> modify qp failed) and app APM flow to sense some of the state Or> transitions done by the HW. These are only examples I quickly Or> thought of, I guess there are more. Which applications are using these operations as you describe? - R. 
From iod00d at hp.com Mon Apr 18 09:34:50 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 18 Apr 2005 09:34:50 -0700 Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: <523btocckc.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F000424683E@orsmsx408> <523btocckc.fsf@topspin.com> Message-ID: <20050418163450.GC6931@esmail.cup.hp.com> On Mon, Apr 18, 2005 at 08:46:43AM -0700, Roland Dreier wrote: > Robert> Roland, do you know what the SVN rev was for that latest > Robert> code that was submitted to 2.6.12-rc2-mm3. That is the > Robert> version that we discussed starting with for an initial > Robert> 2.6.9 backport, but as suggested, I want to embed the SVN > Robert> rev. into the file name of the patches for clarity. > > It doesn't really make sense to talk about the svn rev for that tree, > since I went through and picked some patches but not others to merge > upstream. I agree. What gets delivered by kernel.org has it's own version control. I'm mostly concerned about when people pull from SVN on openib.org. grant From mshefty at ichips.intel.com Mon Apr 18 09:38:17 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 18 Apr 2005 09:38:17 -0700 Subject: [openib-general] Re: openIB gen2 user space verbs API In-Reply-To: <52oecccfj6.fsf@topspin.com> References: <527jj4kpkv.fsf@topspin.com> <20050415215836.GA6479@cse.ohio-state.edu> <52r7hbhexk.fsf@topspin.com> <20050416172303.GC854@mellanox.co.il> <52oecccfj6.fsf@topspin.com> Message-ID: <4263E279.1010401@ichips.intel.com> Roland Dreier wrote: > Michael> An application would need to know what values is it legal > Michael> to pass to create_qp. Maybe it makes sence to implement > Michael> something like query_hca, and let it return the maximum > Michael> legal value? > > The Mellanox VAPI just used the maximum number of sg entries for the > send queue to calculate the inline data value for a QP. Does it make > sense to change this interface? 
IMO, from a pure API perspective, max inline size and SG entries should be separate, even if a current implementation ties them together. - Sean From shaharf at voltaire.com Mon Apr 18 09:39:25 2005 From: shaharf at voltaire.com (shaharf) Date: Mon, 18 Apr 2005 19:39:25 +0300 Subject: [openib-general] mthca_cmd is broken? Message-ID: It seems that there is a naming problem with INIT_SET_IB_IN_SIZE. I guess it should be SET_IB_IN_SIZE. Shahar From hch at infradead.org Mon Apr 18 09:43:16 2005 From: hch at infradead.org (Christoph Hellwig) Date: Mon, 18 Apr 2005 17:43:16 +0100 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <4263DEC5.5080909@ammasso.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> Message-ID: <20050418164316.GA27697@infradead.org> On Mon, Apr 18, 2005 at 11:22:29AM -0500, Timur Tabi wrote: > That's not what we're seeing. We have hardware that does DMA over the > network (much like the Infiniband stuff), and we have a testcase that fails > if get_user_pages() is used, but not if mlock() is used. If you don't share your testcase it's unlikely to be fixed.
From timur.tabi at ammasso.com Mon Apr 18 09:45:57 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 18 Apr 2005 11:45:57 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050418164316.GA27697@infradead.org> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> Message-ID: <4263E445.8000605@ammasso.com> Christoph Hellwig wrote: > On Mon, Apr 18, 2005 at 11:22:29AM -0500, Timur Tabi wrote: > >>That's not what we're seeing. We have hardware that does DMA over the >>network (much like the Infiniband stuff), and we have a testcase that fails >>if get_user_pages() is used, but not if mlock() is used. > > > If you don't share your testcase it's unlikely to be fixed. As I said, the testcase only works with our hardware, and it's also very large. It's one small test that's part of a huge test suite. It takes a couple hours just to install the damn thing. We want to produce a simpler test case that demonstrates the problem in an easy-to-understand manner, but we don't have time to do that now. 
-- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From timur.tabi at ammasso.com Mon Apr 18 09:50:06 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 18 Apr 2005 11:50:06 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <52is2kawsi.fsf@topspin.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> <52is2kawsi.fsf@topspin.com> Message-ID: <4263E53E.3090107@ammasso.com> Roland Dreier wrote: > Timur> Why do you call mlock() and get_user_pages()? In our code, > Timur> we only call mlock(), and the memory is pinned. We have a > Timur> test case that fails if only get_user_pages() is called, > Timur> but it passes if only mlock() is called. > > What if a buggy/malicious userspace program doesn't call mlock()? Our library calls mlock() when the apps requests memory to be "registered". We then call munlock() when the app requests the memory to be unregistered. All apps talk to our library for all services. No apps talk to the driver directly. -- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From roland at topspin.com Mon Apr 18 09:45:18 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 18 Apr 2005 09:45:18 -0700 Subject: [openib-general] mthca_cmd is broken? In-Reply-To: (shaharf@voltaire.com's message of "Mon, 18 Apr 2005 19:39:25 +0300") References: Message-ID: <528y3gava9.fsf@topspin.com> Yes, you're right. Fixed. 
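The register/unregister pairing Timur describes above (mlock() when the app registers memory, munlock() when it unregisters) can be sketched in userspace as follows. This is only an illustration of that scheme, not Ammasso's library code: page_span(), hyp_register() and hyp_unregister() are hypothetical names, and the sketch says nothing about whether mlock() alone is sufficient for DMA, which is the point under dispute in this thread.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round the byte range [addr, addr + len) outwards to whole pages. */
void page_span(uintptr_t addr, size_t len, size_t page,
               uintptr_t *start, size_t *bytes)
{
    uintptr_t first, last;

    assert(page != 0 && (page & (page - 1)) == 0);   /* power of two */

    first = addr & ~(uintptr_t)(page - 1);
    last = (addr + len + page - 1) & ~(uintptr_t)(page - 1);

    *start = first;
    *bytes = (size_t)(last - first);
}

/* Hypothetical "register": lock the pages backing buf. Returns 0 or -errno. */
int hyp_register(void *buf, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t start;
    size_t bytes;

    if (page <= 0)
        return -EINVAL;

    page_span((uintptr_t)buf, len, (size_t)page, &start, &bytes);
    return mlock((void *)start, bytes) == 0 ? 0 : -errno;
}

/* Hypothetical "unregister": unlock the same page range. Returns 0 or -errno. */
int hyp_unregister(void *buf, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t start;
    size_t bytes;

    if (page <= 0)
        return -EINVAL;

    page_span((uintptr_t)buf, len, (size_t)page, &start, &bytes);
    return munlock((void *)start, bytes) == 0 ? 0 : -errno;
}
```

One design caveat: memory locks do not nest, so if two registered buffers share a page, unregistering one also unlocks the shared page for the other. A library built this way would need its own per-page reference counts.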
From timur.tabi at ammasso.com Mon Apr 18 10:15:02 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 18 Apr 2005 12:15:02 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050412180447.E6958@topspin.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <20050412180447.E6958@topspin.com> Message-ID: <4263EB16.1090904@ammasso.com> Libor Michalek wrote: > The problem we were seeing is that the minor fault by the app resulted > in a new physical page getting mapped for the application. The page that > had the elevated refcount was still waiting for the data to be written > to by the driver at the time that the app accessed the page causing the > minor fault. Obviously since the app had a new mapping the data written > by the driver was lost. Thanks Libor, this is much better explanation of the problem than what I posted. > It looks like code was added to try_to_unmap_one() to address this, so > hopefully it's no longer an issue... I doubt it. I tried this with an earlier 2.6 kernel, and get_user_pages() was still not enough to really pin the memory down. Maybe it works in 2.6.12, but that doesn't help me any, because our driver needs to support all 2.4 and 2.6 kernels. Currently, mlock() alone seems to be good enough, but I'm going to add calls to get_user_pages() just to be sure. -- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From mst at mellanox.co.il Mon Apr 18 10:39:55 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 18 Apr 2005 20:39:55 +0300 Subject: [openib-general] Re: [PATCH] uverbs with static libraries In-Reply-To: <52r7h8axho.fsf@topspin.com> References: <20050410084724.GZ20567@mellanox.co.il> <52r7h8axho.fsf@topspin.com> Message-ID: <20050418173955.GB19702@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] uverbs with static libraries > > It made more sense to me to load a static driver once before calling > find_drivers() for each entry in our driver path. Is there anything > wrong with this? > > - R. > > --- libibverbs/src/init.c (revision 2156) > +++ libibverbs/src/init.c (working copy) > @@ -198,6 +198,11 @@ static void INIT ibverbs_init(void) > if (ibv_init_mem_map()) > return; > > + /* > + * Check if a driver is statically linked, and if so load it first. > + */ > + load_driver(NULL); > + > user_path = getenv(OPENIB_DRIVER_PATH_ENV); > if (user_path) { > wr_path = strdupa(user_path); > ok -- MST - Michael S. Tsirkin From mst at mellanox.co.il Mon Apr 18 11:15:19 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 18 Apr 2005 21:15:19 +0300 Subject: [openib-general] Re: [PATCH] uverbs with static libraries In-Reply-To: <52mzrwawtv.fsf@topspin.com> References: <20050410084724.GZ20567@mellanox.co.il> <52is2t1g1m.fsf@topspin.com> <20050417135658.GK16996@mellanox.co.il> <52u0m4cfj7.fsf@topspin.com> <20050418152452.GH17566@mellanox.co.il> <52mzrwawtv.fsf@topspin.com> Message-ID: <20050418181519.GA19943@mellanox.co.il> Hi, Roland! Quoting r. Roland Dreier : > Michael> Put another way - what's the harm in always building the > Michael> static version as well? Other libraries (e.g. libibverbs) > Michael> build both static and shared versions by default. > > I don't think of libmthca as a library really. It's a plug in loaded > by libibverbs. What's the point of a static libibverbs then? Some people may want to build an executable without external dependencies, they clearly need both libraries static. 
Others may not care, they may be better off with shared. > In some specialized circumstances it may be useful to > build it statically Hopefully it shall be there for the developer, who shall have no need to build it. If the default is not to include the static version, distributions won't package it, so it won't be there for developers to use. If the default does build static and shared versions, distributions will put the static version in a separate -devel rpm together with header files, so people who don't build apps won't need it. > but in general it's just unneeded confusion. > > - R. What kind of confusion? -- MST - Michael S. Tsirkin From libor at topspin.com Mon Apr 18 11:15:26 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 18 Apr 2005 11:15:26 -0700 Subject: [openib-general] Re: ttcp.aio - kernel NULL pointer dereference In-Reply-To: <20050418145046.GG17566@mellanox.co.il>; from mst@mellanox.co.il on Mon, Apr 18, 2005 at 05:50:46PM +0300 References: <20050418145046.GG17566@mellanox.co.il> Message-ID: <20050418111526.A7553@topspin.com> On Mon, Apr 18, 2005 at 05:50:46PM +0300, Michael S. Tsirkin wrote: > Hello, Libor! > Every once in a while, when I run ttcp > > I get a kernel NULL pointer dereference from SDP > > My kernel is 2.6.11 + latest openib svn (rev 2171). Michael, on what type of system are you seeing this? -Libor From mst at mellanox.co.il Mon Apr 18 11:29:47 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 18 Apr 2005 21:29:47 +0300 Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: <523btocckc.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F000424683E@orsmsx408> <523btocckc.fsf@topspin.com> Message-ID: <20050418182947.GC19943@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel > > Robert> Roland, do you know what the SVN rev was for that latest > Robert> code that was submitted to 2.6.12-rc2-mm3. 
That is the > Robert> version that we discussed starting with for an initial > Robert> 2.6.9 backport, but as suggested, I want to embed the SVN > Robert> rev. into the file name of the patches for clarity. > > It doesn't really make sense to talk about the svn rev for that tree, > since I went through and picked some patches but not others to merge > upstream. > > - R. > Roland, maybe, when you do this, you can put a copy of the source as you submit it under gen2/branches? I think that would solve Grant's problem but may be too hard to do in practice. Would that fit easily with your workflow - you are using quilt, right? -- MST - Michael S. Tsirkin From mst at mellanox.co.il Mon Apr 18 11:31:51 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 18 Apr 2005 21:31:51 +0300 Subject: [openib-general] Re: ttcp.aio - kernel NULL pointer dereference In-Reply-To: <20050418111526.A7553@topspin.com> References: <20050418145046.GG17566@mellanox.co.il> <20050418111526.A7553@topspin.com> Message-ID: <20050418183151.GD19943@mellanox.co.il> Quoting r. Libor Michalek : > Subject: Re: ttcp.aio - kernel NULL pointer dereference > > On Mon, Apr 18, 2005 at 05:50:46PM +0300, Michael S. Tsirkin wrote: > > Hello, Libor! > > Every once in a while, when I run ttcp > > > > I get a kernel NULL pointer dereference from SDP > > > > My kernel is 2.6.11 + latest openib svn (rev 2171). > > Michael, on what type of system are you seeing this? > > -Libor > Intel nocona with Arbel native. -- MST - Michael S. 
Tsirkin From arjan at infradead.org Mon Apr 18 12:40:40 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Mon, 18 Apr 2005 21:40:40 +0200 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <4263DF70.2060702@ammasso.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> <1113840973.6274.84.camel@laptopd505.fenrus.org> <4263DF70.2060702@ammasso.com> Message-ID: <1113853240.6274.99.camel@laptopd505.fenrus.org> On Mon, 2005-04-18 at 11:25 -0500, Timur Tabi wrote: > Arjan van de Ven wrote: > > > this is a myth; linux is free to move the page about in physical memory > > even if it's mlock()ed!! > > Then Linux has a very odd definition of the word "locked". > > > And even then, the user can munlock the memory from another thread etc > > etc. Not a good idea. > > Well, that's okay, because then the app is doing something stupid, so we don't worry about > that. you should since that physical page can be reused, say by a root process, and you'd be majorly screwed From timur.tabi at ammasso.com Mon Apr 18 13:00:02 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 18 Apr 2005 15:00:02 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <1113853240.6274.99.camel@laptopd505.fenrus.org> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> <1113840973.6274.84.camel@laptopd505.fenrus.org> <4263DF70.2060702@ammasso.com> <1113853240.6274.99.camel@laptopd505.fenrus.org> Message-ID: <426411C2.5040703@ammasso.com> Arjan van de Ven wrote: > you should since that physical page can be reused, say by a root > process, and you'd be majorly screwed I don't understand what you mean by "reused". 
The whole point behind pinning the memory is that it stays where it is. It doesn't get moved around and it doesn't get swapped out. -- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From iod00d at hp.com Mon Apr 18 13:02:49 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 18 Apr 2005 13:02:49 -0700 Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel In-Reply-To: <20050418182947.GC19943@mellanox.co.il> References: <1AC79F16F5C5284499BB9591B33D6F000424683E@orsmsx408> <523btocckc.fsf@topspin.com> <20050418182947.GC19943@mellanox.co.il> Message-ID: <20050418200249.GG6931@esmail.cup.hp.com> On Mon, Apr 18, 2005 at 09:29:47PM +0300, Michael S. Tsirkin wrote: > Quoting r. Roland Dreier : > > It doesn't really make sense to talk about the svn rev for that tree, > > since I went through and picked some patches but not others to merge > > upstream. > > Roland, maybe, when you do this, you can put a copy of the source as you > submit it under gen2/branches? I think that would solve Grant's problem > but may be too hard to do in practice. Not really since I can see the kernel rev and then pull the matching kernel source. Well unless people are using -mm or linus' bk tree directly. But not that many people do that. 
grant From arjan at infradead.org Mon Apr 18 13:05:42 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Mon, 18 Apr 2005 22:05:42 +0200 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <426411C2.5040703@ammasso.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> <1113840973.6274.84.camel@laptopd505.fenrus.org> <4263DF70.2060702@ammasso.com> <1113853240.6274.99.camel@laptopd505.fenrus.org> <426411C2.5040703@ammasso.com> Message-ID: <1113854742.6274.101.camel@laptopd505.fenrus.org> On Mon, 2005-04-18 at 15:00 -0500, Timur Tabi wrote: > Arjan van de Ven wrote: > > > you should since that physical page can be reused, say by a root > > process, and you'd be majorly screwed > > I don't understand what you mean by "reused". The whole point behind pinning the memory > is that it stays where it is. It doesn't get moved around and it doesn't get swapped out. > you just said that you didn't care that it got munlock'd. So you don't care that it gets freed either. And then reused. 
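A minimal userspace sketch of the register/unregister pattern debated in this thread, assuming a library that locks a buffer with mlock() when it is "registered" and releases it with munlock() when it is unregistered. The names span_for(), region_register() and region_unregister() are hypothetical and are not taken from any driver or library discussed here. As the replies above point out, this only asks the kernel to keep the pages resident: it does not by itself guarantee that the physical pages stay in place, and any other thread in the process can call munlock() on the same range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper type: a byte range widened to whole pages. */
struct page_span {
    uintptr_t start; /* page-aligned start address */
    size_t len;      /* length in bytes, a multiple of the page size */
};

/* Widen [addr, addr + len) so it starts and ends on page boundaries.
 * 'page' must be a power of two. */
struct page_span span_for(uintptr_t addr, size_t len, size_t page)
{
    struct page_span s;
    uintptr_t mask = ~(uintptr_t)(page - 1);
    uintptr_t end;

    assert(page != 0 && (page & (page - 1)) == 0);
    s.start = addr & mask;
    end = (addr + len + page - 1) & mask;
    s.len = (size_t)(end - s.start);
    return s;
}

/* Hypothetical "register": ask the kernel to keep the pages resident.
 * Returns 0 on success, -1 with errno set on failure (for example when
 * RLIMIT_MEMLOCK is exceeded). */
int region_register(void *buf, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    struct page_span s = span_for((uintptr_t)buf, len, page);

    return mlock((void *)s.start, s.len);
}

/* Hypothetical "unregister": undo the lock. Nothing stops another
 * thread from calling munlock() on this range earlier than the
 * library expects. */
int region_unregister(void *buf, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    struct page_span s = span_for((uintptr_t)buf, len, page);

    return munlock((void *)s.start, s.len);
}
```

A caller would pair region_register(buf, len) with region_unregister(buf, len) around the lifetime of the transfer and check both return values. A driver that needs the pages to stay put would still have to take its own references in the kernel, which is the point argued above.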
From blist at aon.at Mon Apr 18 13:07:12 2005 From: blist at aon.at (Bernhard Fischer) Date: Mon, 18 Apr 2005 22:07:12 +0200 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <1113853240.6274.99.camel@laptopd505.fenrus.org> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> <1113840973.6274.84.camel@laptopd505.fenrus.org> <4263DF70.2060702@ammasso.com> <1113853240.6274.99.camel@laptopd505.fenrus.org> Message-ID: <20050418200711.GI15688@aon.at> On Mon, Apr 18, 2005 at 09:40:40PM +0200, Arjan van de Ven wrote: >On Mon, 2005-04-18 at 11:25 -0500, Timur Tabi wrote: >> Arjan van de Ven wrote: >> >> > this is a myth; linux is free to move the page about in physical memory >> > even if it's mlock()ed!! darn, yes, this is true. I know people who introduced #define VM_RESERVED 0x00080000 /* Don't unmap it from swap_out */ to vm_flags just because of this. I'll just hold my breath and won't delve further. >> >> Then Linux has a very odd definition of the word "locked". >> >> > And even then, the user can munlock the memory from another thread etc >> > etc. Not a good idea. >> >> Well, that's okay, because then the app is doing something stupid, so we don't worry about >> that. 
> >you should since that physical page can be reused, say by a root >process, and you'd be majorly screwed From timur.tabi at ammasso.com Mon Apr 18 13:19:33 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 18 Apr 2005 15:19:33 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <1113854742.6274.101.camel@laptopd505.fenrus.org> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> <1113840973.6274.84.camel@laptopd505.fenrus.org> <4263DF70.2060702@ammasso.com> <1113853240.6274.99.camel@laptopd505.fenrus.org> <426411C2.5040703@ammasso.com> <1113854742.6274.101.camel@laptopd505.fenrus.org> Message-ID: <42641655.1080403@ammasso.com> Arjan van de Ven wrote: > you just said that you didn't care that it got munlock'd. So you don't > care that it gets freed either. And then reused. Well, I can live with the app being able to call munlock(), because the apps that our customers use don't call munlock(). What I can't live with is a bug in the kernel that causes pinned pages to be swapped or moved. Obviously, I would rather call get_user_pages() instead of mlock(), but I can't, because get_user_pages doesn't work. The page doesn't stay pinned at the physical address, but it does if I call mlock() and get_user_pages(). Actually, in our tests, calling mlock() appears to be good enough, but I'll update our code to call get_user_pages() as well. -- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From hermannv at supermicro.com Mon Apr 18 18:37:37 2005 From: hermannv at supermicro.com (Hermann von Drateln) Date: Mon, 18 Apr 2005 18:37:37 -0700 Subject: [openib-general] Infiniband embedded dual Xeon 800 MHz MBD Message-ID: To all Open IB Members! This week we are commencing the verification and validation of our Dual Xeon 800 MHz FSB MBD. 
The board is very similar to our http://www.supermicro.com/products/motherboard/Xeon800/E7320/X6DVA-EG.cfm and it is made to fit our CSE-513 http://www.supermicro.com/products/chassis/1U/?chs=513 Anyone who would like more detailed information on this new board and our new product road map can send me an email, and I will provide you with a PowerPoint presentation of the information. Best regards, Hermann von Drateln Director Business Development USA TEL 1 408 503 8110 USA CEL 1 408 306 8110 Eu Tel + 49 173 286 6883 Eu Fax + 49 69 255 77303 From yulia.plavunova at t-platforms.ru Tue Apr 19 03:01:34 2005 From: yulia.plavunova at t-platforms.ru (Yulia Plavunova) Date: Tue, 19 Apr 2005 14:01:34 +0400 Subject: [openib-general] Infiniband embedded dual Xeon 800 MHz MBD Message-ID: <3DD22B58943FBB47A82692E3D4671BF5142928@srv04.merle.ru> Dear Hermann, I represent T-Platforms, a Russian HPC integrator (www.t-platforms.ru). We are very interested in getting more info on your Dual Xeon 800 MHz FSB MBD. I'd appreciate it if you could send the presentation and the roadmap. Looking forward to your reply, Best regards, Yulia Plavunova Manager of Vendor&International Relations, T-Platforms Tlf.: (+7-095) 9565414 From ogerlitz at voltaire.com Tue Apr 19 05:26:58 2005 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 19 Apr 2005 15:26:58 +0300 Subject: [openib-general] openIB gen2 user space verbs API Message-ID: Roland> Which applications are using these operations as you describe? Ignoring APM, one can claim that using query qp to get the qp state is a means of debugging and not needed for an app's regular flow. Correct, we are using it in our dapl code when modify qp fails and I know an HPC app using it in its startup code in the same manner. Taking APM into account, as the QP REARMED to ARMED state transition is done by the HCA HW when data is delivered over the RC connection, there are apps that do qp query to sense this transition and modify the qp mig state to MIGRATED. Other than debugging and APM one can implement resource tracking code that can query a specific qp per request, or a qp caching scheme that keeps created/init-ed or even connected QPs and before/after handing them to consumers queries the QP to verify the state. Some of the gen1 stacks have resource tracking as I describe here, and there are also apps doing the caching I mentioned. For getting the inline size (which you indeed propose to return its max size from the qp create func) every app which wishes to use inline (eg MVAPICH) would query the qp state. 
To me, querying the qp and hca capabilities seems a must; the other queries (CQ, MR, etc.) are less important. Or. -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier Sent: Monday, April 18, 2005 7:28 PM To: Or Gerlitz Cc: openib-general at openib.org Subject: Re: [openib-general] openIB gen2 user space verbs API Or> Other than getting the max inline size, query qp is used to Or> get the current QP state, examples are app error flow (eg when Or> modify qp failed) and app APM flow to sense some of the state Or> transitions done by the HW. These are only examples I quickly Or> thought of, I guess there are more. Which applications are using these operations as you describe? - R. From halr at voltaire.com Tue Apr 19 06:56:16 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Apr 2005 09:56:16 -0400 Subject: [openib-general] [PATCH] fix management/README In-Reply-To: <20050417093245.GA16996@mellanox.co.il> References: <20050417093245.GA16996@mellanox.co.il> Message-ID: <1113918975.4880.14.camel@localhost.localdomain> On Sun, 2005-04-17 at 05:32, Michael S. Tsirkin wrote: > Fix build instructions to refer to directories that actually exist. Thanks. Applied. -- Hal From halr at voltaire.com Tue Apr 19 07:00:05 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Apr 2005 10:00:05 -0400 Subject: [openib-general] SM Bad Port Handling In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF10B@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6047EF10B@mtlex01.yok.mtl.com> Message-ID: <1113919205.4880.24.camel@localhost.localdomain> On Thu, 2005-04-14 at 07:13, Eitan Zahavi wrote: > [EZ] Not at all. Although the target port is known. 
The flaky link > that fails the mad might be anywhere along the path to the port. So, > if you mark the target port as bad you might be marking the wrong > port! OpenSM does look for traps 128-131 which includes Local link integrity (129) which is likely from these noisy ports, right? > [EZ] Let me clarify with an example: > SM=HCA1/P1 -> > SW1/P1....SW1/P2->SW2/P1..SW2/P2->SW3/P1....SW3/P3->HCA2/P1 > > \..SW4/P4->SW3/P4..SW3/P5->SW3/P2../ > > If the flaky link is between SW2/P2 and SW3/P1 then the packet sent to > HCA2 using DR : [0][1][1][2][3] might fail. If you mark HCA2/P1 as > bad then you actually will lose that HCA for no good reason since > another path from SM to HCA2 exists. OK. To be really sure about the failed port, one could then walk the entire DR path from the SM to the perceived non-responding port and if the same port along the path doesn't respond some number of times (say 4) in a row, that port's peer port can be marked as unhealthy. Is this algorithm acceptable? -- Hal From eitan at mellanox.co.il Tue Apr 19 08:05:32 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 19 Apr 2005 18:05:32 +0300 Subject: [openib-general] SM Bad Port Handling Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF134@mtlex01.yok.mtl.com> Yes, the algorithm looks reasonable. I would make the number of packets required to qualify the ports on the way a parameter with default value of 10 or 20 (surely not 4). Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL From tduffy at sun.com Tue Apr 19 10:28:07 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 19 Apr 2005 10:28:07 -0700 Subject: [openib-general] Infiniband embedded dual Xeon 800 MHz MBD In-Reply-To: References: Message-ID: <1113931687.13847.1.camel@duffman> On Mon, 2005-04-18 at 18:37 -0700, Hermann von Drateln wrote: > This week we are commencing the verification and validation of our > Dual Xeon 800 MHz FSB MBD. So, does this board have built-in Infiniband? If not, how is it relevant to this list? Or can I conclude that this was simply SPAM? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From ardavis at ichips.intel.com Tue Apr 19 11:42:22 2005 From: ardavis at ichips.intel.com (ardavis) Date: Tue, 19 Apr 2005 11:42:22 -0700 Subject: [openib-general] Kernel oops: NULL ptr dereference in ib_umem_get In-Reply-To: <52ekdbhe49.fsf@topspin.com> References: <426023BD.8080504@ichips.intel.com> <52ekdbhe49.fsf@topspin.com> Message-ID: <4265510E.2080502@ichips.intel.com> Roland Dreier wrote: > ardavis> With a little stress, I see the following oops (running > ardavis> latest from the trunk). Let me know if you need any more > ardavis> information. > >Can you try this patch and let me know if it helps at all? > >Thanks, > Roland > > > Yes, works great. Thanks! -arlin From roland at topspin.com Tue Apr 19 11:49:01 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 19 Apr 2005 11:49:01 -0700 Subject: [openib-general] Kernel oops: NULL ptr dereference in ib_umem_get In-Reply-To: <4265510E.2080502@ichips.intel.com> (ardavis@ichips.intel.com's message of "Tue, 19 Apr 2005 11:42:22 -0700") References: <426023BD.8080504@ichips.intel.com> <52ekdbhe49.fsf@topspin.com> <4265510E.2080502@ichips.intel.com> Message-ID: <527jiy7gbm.fsf@topspin.com> ardavis> Yes, works great. Thanks! Cool, I've committed this. - R. From ardavis at ichips.intel.com Tue Apr 19 14:49:56 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 19 Apr 2005 14:49:56 -0700 Subject: [openib-general] [PATCH] udapl provider Message-ID: <20050419144956.359e2d6c.ardavis@ichips.intel.com> Fixes for socket CM to prevent blocking and allow more uDAPL applications to run successfully. 
Signed-off-by: Arlin Davis Index: udapl/Makefile =================================================================== --- udapl/Makefile (revision 2190) +++ udapl/Makefile (working copy) @@ -122,7 +122,6 @@ endif # ifeq ($(VERBS),openib) PROVIDER = $(TOPDIR)/../openib -DAPL_IBLIB_DIR = /usr/local/lib CFLAGS += -DSOCKET_CM -DOPENIB -DCQ_WAIT_OBJECT CFLAGS += -I/usr/local/include/infiniband endif @@ -139,7 +138,7 @@ endif CFLAGS += -I. CFLAGS += -I.. -CFLAGS += -I../../dat/include +CFLAGS += -I../dat/include CFLAGS += -I../include CFLAGS += -I$(PROVIDER) @@ -234,8 +233,9 @@ PROVIDER_SRCS += dapl_openib_util.c dapl endif ifeq ($(VERBS),openib) -LDFLAGS += -libverbs -LDFLAGS += -L /usr/local/lib/ +LDFLAGS += -libverbs /usr/local/lib/infiniband/mthca.so +LDFLAGS += -rpath /usr/local/lib -L /usr/local/lib +LDFLAGS += -rpath /usr/local/lib/infiniband PROVIDER_SRCS = dapl_ib_util.c dapl_ib_cq.c dapl_ib_qp.c PROVIDER_SRCS += dapl_ib_cm.c dapl_ib_mem.c endif Index: openib/TODO =================================================================== --- openib/TODO (revision 2190) +++ openib/TODO (working copy) @@ -2,16 +2,17 @@ IB Verbs: - CQ resize? - query call to get current qp state -- ibv_get_cq_event() blocks until event arrives. need timed event and wakeup +- ibv_get_cq_event() needs timed event call and wakeup - query call to get device attributes -- poll_cq return codes not exported +- current implementation only supports one event per device +- memory window support DAPL: - Build udapl issues with mthca having reverse dependencies to ibverbs -- When CM arrives: change modify_qp_state RTS RTR calls +- When real CM arrives: change modify_qp_state RTS RTR calls - reinit EP needs a QP timewait completion notification -- disconnect clean -- add cq_object wakeup, time based cq_object wait +- code disconnect clean +- add cq_object wakeup, time based cq_object wait when verbs support arrives - update uDAPL code with real CM and ATS support - etc, etc. 
Index: openib/dapl_ib_util.c =================================================================== --- openib/dapl_ib_util.c (revision 2190) +++ openib/dapl_ib_util.c (working copy) @@ -53,18 +53,12 @@ static const char rcsid[] = "$Id: $"; #include "dapl_adapter_util.h" #include "dapl_ib_util.h" -#include #include #include #include #include #include -/* set default path */ -#define OPENIB_VERBS_PATH_DEFAULT "/usr/local/lib/libibverbs.so" -static char * ibv_path; -static void * ibv_handle = NULL; - int g_dapl_loopback_connection = 0; #ifdef SOCKET_CM @@ -110,32 +104,11 @@ DAT_RETURN getipaddr( char *addr, int ad */ int32_t dapls_ib_init (void) { - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, "dapls_ib_init() called\n"); - - ibv_path = getenv("OPENIB_VERBS_PATH"); - - if (ibv_path == NULL) - ibv_path = OPENIB_VERBS_PATH_DEFAULT; - - dapl_dbg_log(DAPL_DBG_TYPE_UTIL," loading verbs library %s\n",ibv_path); - - ibv_handle = dlopen(ibv_path, RTLD_NOW | RTLD_GLOBAL); - if (ibv_handle == NULL ) { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " library load failure %s\n", dlerror()); - return -1; - } - return 0; } int32_t dapls_ib_release (void) { - dapl_dbg_log (DAPL_DBG_TYPE_UTIL, "dapls_ib_release() called\n"); - - if (ibv_handle) - dlclose(ibv_handle); - return 0; } @@ -166,13 +139,6 @@ DAT_RETURN dapls_ib_open_hca ( dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " open_hca: %s - %p\n", hca_name, hca_ptr ); - if (ibv_handle == NULL) { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " Failure loading IB verbs library %s \n", - ibv_path); - return DAT_PROVIDER_NOT_FOUND; - } - /* Get list of all IB devices, find match, open */ dev_list = ibv_get_devices(); dlist_start(dev_list); @@ -201,6 +167,29 @@ DAT_RETURN dapls_ib_open_hca ( } #ifdef SOCKET_CM + /* initialize cr_list lock */ + dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.lock); + if (dat_status != DAT_SUCCESS) + { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: failed to init lock\n"); + return dat_status; + } + + /* initialize CM list for listens on this HCA 
*/ + dapl_llist_init_head(&hca_ptr->ib_trans.list); + + /* create thread to process inbound connect request */ + dat_status = dapl_os_thread_create(cr_thread, + (void*)hca_ptr, + &hca_ptr->ib_trans.thread ); + if (dat_status != DAT_SUCCESS) + { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: failed to create thread\n"); + return dat_status; + } + /* get the IP address of the device */ dat_status = getipaddr((char*)&hca_ptr->hca_address, sizeof(DAT_SOCK_ADDR6) ); @@ -243,6 +232,20 @@ DAT_RETURN dapls_ib_close_hca ( IN DAP return(dapl_convert_errno(errno,"ib_close_device")); hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; } + +#if SOCKET_CM + /* destroy cr_thread and lock */ + hca_ptr->ib_trans.destroy = 1; + while (hca_ptr->ib_trans.destroy) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 10000000; /* 10 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " close_hca: waiting for cr_thread\n"); + nanosleep (&sleep, &remain); + } + dapl_os_lock_destroy(&hca_ptr->ib_trans.lock); +#endif return (DAT_SUCCESS); } @@ -297,18 +300,18 @@ DAT_RETURN dapls_ib_query_hca ( ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 24 & 0xff ); /* TODO: need verbs query call */ - ia_attr->max_eps = 1000; - ia_attr->max_dto_per_ep = 1000; - ia_attr->max_rdma_read_per_ep = 4; - ia_attr->max_evds = 1000; - ia_attr->max_evd_qlen = 1000; - ia_attr->max_iov_segments_per_dto = 10; - ia_attr->max_lmrs = 1000; + ia_attr->max_eps = 64000; + ia_attr->max_dto_per_ep = 64000; + ia_attr->max_rdma_read_per_ep = 8; + ia_attr->max_evds = 64000; + ia_attr->max_evd_qlen = 64000; + ia_attr->max_iov_segments_per_dto = 32; + ia_attr->max_lmrs = 64000; ia_attr->max_lmr_block_size = 0x80000000; - ia_attr->max_rmrs = 1000; + ia_attr->max_rmrs = 64000; ia_attr->max_lmr_virtual_address = 0x80000000; ia_attr->max_rmr_target_address = 0x80000000; - ia_attr->max_pzs = 1000; + ia_attr->max_pzs = 64000; ia_attr->max_mtu_size = 0x80000000; ia_attr->max_rdma_size = 0x80000000; 
ia_attr->num_transport_attr = 0; @@ -333,12 +336,12 @@ DAT_RETURN dapls_ib_query_hca ( if (ep_attr != NULL) { ep_attr->max_mtu_size = 0x80000000; ep_attr->max_rdma_size = 0x80000000; - ep_attr->max_recv_dtos = 1000; - ep_attr->max_request_dtos = 1000; - ep_attr->max_recv_iov = 10; - ep_attr->max_request_iov = 10; - ep_attr->max_rdma_read_in = 4; - ep_attr->max_rdma_read_out= 4; + ep_attr->max_recv_dtos = 64000; + ep_attr->max_request_dtos = 64000; + ep_attr->max_recv_iov = 32; + ep_attr->max_request_iov = 32; + ep_attr->max_rdma_read_in = 8; + ep_attr->max_rdma_read_out= 8; dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " query_hca: MAX msg %llu dto %d iov %d rdma i%d,o%d\n", ep_attr->max_mtu_size, @@ -394,7 +397,7 @@ DAT_RETURN dapls_ib_setup_async_callback hca_ptr->async_cq_error = callback; break; case DAPL_ASYNC_CQ_COMPLETION: - hca_ptr->async_cq_completion = callback; + hca_ptr->async_cq = callback; break; case DAPL_ASYNC_QP_ERROR: hca_ptr->async_qp_error = callback; Index: openib/dapl_ib_mem.c =================================================================== --- openib/dapl_ib_mem.c (revision 2190) +++ openib/dapl_ib_mem.c (working copy) @@ -1,26 +1,25 @@ /* - * Copyright (c) 2002-2003, Network Appliance, Inc. All rights reserved. - * - * This Software is licensed under either one of the following two licenses: + * This Software is licensed under one of the following licenses: * * 1) under the terms of the "Common Public License 1.0" a copy of which is - * in the file LICENSE.txt in the root directory. The license is also * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * OR * - * 2) under the terms of the "The BSD License" a copy of which is in the file - * LICENSE2.txt in the root directory. 
The license is also available from - * the Open Source Initiative, see + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. * - * Licensee has the right to choose either one of the above two licenses. + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. * - * Redistributions of source code must retain both the above copyright - * notice and either one of the license notices. + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. * * Redistributions in binary form must reproduce both the above copyright - * notice, either one of the license notices in the documentation + * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. 
*/ @@ -31,7 +30,7 @@ * PURPOSE: Intel DET APIs: Memory windows, registration, * and protection domain * - * $Id:$ + * $Id: $ * **********************************************************************/ @@ -182,8 +181,11 @@ dapls_ib_mr_register ( ia_ptr, lmr, virt_addr, length, privileges ); /* TODO: shared memory */ - if (lmr->param.mem_type == DAT_MEM_TYPE_SHARED_VIRTUAL) + if (lmr->param.mem_type == DAT_MEM_TYPE_SHARED_VIRTUAL) { + dapl_dbg_log( DAPL_DBG_TYPE_ERR, + " mr_register_shared: NOT IMPLEMENTED\n"); return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); + } /* local read is default on IB */ lmr->mr_handle = @@ -266,6 +268,7 @@ dapls_ib_mr_register_shared ( IN DAPL_LMR *lmr, IN DAT_MEM_PRIV_FLAGS privileges ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mr_register_shared: NOT IMPLEMENTED\n"); return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); } @@ -289,6 +292,8 @@ DAT_RETURN dapls_ib_mw_alloc ( IN DAPL_RMR *rmr ) { + + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_alloc: NOT IMPLEMENTED\n"); return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); } @@ -312,6 +317,7 @@ DAT_RETURN dapls_ib_mw_free ( IN DAPL_RMR *rmr ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_free: NOT IMPLEMENTED\n"); return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); } @@ -343,6 +349,7 @@ dapls_ib_mw_bind ( IN DAT_MEM_PRIV_FLAGS mem_priv, IN DAT_BOOLEAN is_signaled) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_bind: NOT IMPLEMENTED\n"); return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); } @@ -371,6 +378,7 @@ dapls_ib_mw_unbind ( IN DAPL_COOKIE *cookie, IN DAT_BOOLEAN is_signaled ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_unbind: NOT IMPLEMENTED\n"); return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); } Index: openib/dapl_ib_cm.c =================================================================== --- openib/dapl_ib_cm.c (revision 2190) +++ openib/dapl_ib_cm.c (working copy) @@ -74,10 +74,12 @@ static DAT_RETURN dapli_socket_listen ( DAT_CONN_QUAL serviceID, DAPL_SP *sp_ptr ); -static DAT_RETURN 
dapli_socket_accept( DAPL_EP *ep_ptr, - DAPL_CR *cr_ptr, - DAT_COUNT p_size, - DAT_PVOID p_data ); +static DAT_RETURN dapli_socket_accept( ib_cm_srvc_handle_t cm_ptr ); + +static DAT_RETURN dapli_socket_accept_final( DAPL_EP *ep_ptr, + DAPL_CR *cr_ptr, + DAT_COUNT p_size, + DAT_PVOID p_data ); /* XXX temporary hack to get lid */ static uint16_t dapli_get_lid(IN struct ibv_device *dev, IN int port) @@ -114,6 +116,7 @@ dapli_socket_connect ( DAPL_EP *ep_ptr DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; int len, opt = 1; struct iovec iovec[2]; + short rtu_data = htons(0x0E0F); dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: r_qual %d\n", r_qual); @@ -133,7 +136,7 @@ dapli_socket_connect ( DAPL_EP *ep_ptr return DAT_INSUFFICIENT_RESOURCES; } - ((struct sockaddr_in*)r_addr)->sin_port = r_qual; + ((struct sockaddr_in*)r_addr)->sin_port = htons(r_qual); if ( connect(cm_ptr->socket, r_addr, sizeof(*r_addr)) < 0 ) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, @@ -153,9 +156,11 @@ dapli_socket_connect ( DAPL_EP *ep_ptr cm_ptr->dst.p_size = p_size; iovec[0].iov_base = &cm_ptr->dst; iovec[0].iov_len = sizeof(ib_qp_cm_t); - iovec[1].iov_base = p_data; - iovec[1].iov_len = p_size; - len = writev( cm_ptr->socket, iovec, 2 ); + if ( p_size ) { + iovec[1].iov_base = p_data; + iovec[1].iov_len = p_size; + } + len = writev( cm_ptr->socket, iovec, (p_size ? 
2:1) ); if ( len != (p_size + sizeof(ib_qp_cm_t)) ) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, " connect write: ERR %s, wcnt=%d\n", @@ -190,8 +195,8 @@ dapli_socket_connect ( DAPL_EP *ep_ptr /* read private data into cm_handle if any present */ if ( cm_ptr->dst.p_size ) { - iovec[1].iov_base = cm_ptr->p_data; - iovec[1].iov_len = cm_ptr->dst.p_size; + iovec[0].iov_base = cm_ptr->p_data; + iovec[0].iov_len = cm_ptr->dst.p_size; len = readv( cm_ptr->socket, iovec, 1 ); if ( len != cm_ptr->dst.p_size ) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, @@ -213,10 +218,11 @@ dapli_socket_connect ( DAPL_EP *ep_ptr ep_ptr->qp_state = IB_QP_STATE_RTS; /* complete handshake after final QP state change */ - write(cm_ptr->socket, "QP_RTR_RTS", sizeof "QP_RTR_RTS"); + write(cm_ptr->socket, &rtu_data, sizeof(rtu_data) ); /* init cm_handle and post the event with private data */ ep_ptr->cm_handle = cm_ptr; + dapl_dbg_log( DAPL_DBG_TYPE_EP," ACTIVE: connected!\n" ); dapl_evd_connection_callback( ep_ptr->cm_handle, IB_CME_CONNECTED, cm_ptr->p_data, @@ -248,9 +254,7 @@ dapli_socket_listen ( DAPL_IA *ia_ptr, { struct sockaddr_in addr; ib_cm_srvc_handle_t cm_ptr = NULL; - void *p_data = NULL; - int l_sock = -1; - int len, opt = 1; + int opt = 1; DAT_RETURN dat_status = DAT_SUCCESS; dapl_dbg_log ( DAPL_DBG_TYPE_EP, @@ -263,26 +267,30 @@ dapli_socket_listen ( DAPL_IA *ia_ptr, (void) dapl_os_memzero( cm_ptr, sizeof( *cm_ptr ) ); - cm_ptr->socket = -1; + cm_ptr->socket = cm_ptr->l_socket = -1; cm_ptr->sp = sp_ptr; cm_ptr->hca_ptr = ia_ptr->hca_ptr; /* bind, listen, set sockopt, accept, exchange data */ - if ((l_sock = socket(AF_INET, SOCK_STREAM, 0)) < 0) { + if ((cm_ptr->l_socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, "socket for listen returned %d\n", errno); dat_status = DAT_INSUFFICIENT_RESOURCES; goto bail; } - setsockopt(l_sock,SOL_SOCKET,SO_REUSEADDR,&opt,sizeof(opt)); - addr.sin_port = serviceID; + setsockopt(cm_ptr->l_socket,SOL_SOCKET,SO_REUSEADDR,&opt,sizeof(opt)); 
+ addr.sin_port = htons(serviceID); addr.sin_family = AF_INET; addr.sin_addr.s_addr = INADDR_ANY; - if (( bind( l_sock,(struct sockaddr*)&addr, sizeof(addr) ) < 0) || - (listen( l_sock, 1 ) < 0) ) { + if (( bind( cm_ptr->l_socket,(struct sockaddr*)&addr, sizeof(addr) ) < 0) || + (listen( cm_ptr->l_socket, 128 ) < 0) ) { + dapl_dbg_log( DAPL_DBG_TYPE_ERR, + " listen: ERROR %s on conn_qual 0x%x\n", + strerror(errno),serviceID); + if ( errno == EADDRINUSE ) dat_status = DAT_CONN_QUAL_IN_USE; else @@ -290,109 +298,144 @@ dapli_socket_listen ( DAPL_IA *ia_ptr, goto bail; } + + /* set cm_handle for this service point, save listen socket */ + sp_ptr->cm_srvc_handle = cm_ptr; - /* block on the accept */ - len = sizeof(cm_ptr->dst.ia_address); - cm_ptr->socket = accept(l_sock, - (struct sockaddr*)&cm_ptr->dst.ia_address, + /* add to SP->CR thread list */ + dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&cm_ptr->entry); + dapl_os_lock( &cm_ptr->hca_ptr->ib_trans.lock ); + dapl_llist_add_tail(&cm_ptr->hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cm_ptr->entry, cm_ptr); + dapl_os_unlock(&cm_ptr->hca_ptr->ib_trans.lock); + + dapl_dbg_log( DAPL_DBG_TYPE_CM, + " listen: qual 0x%x cr %p s_fd %d\n", + ntohs(serviceID), cm_ptr, cm_ptr->l_socket ); + + return dat_status; +bail: + dapl_dbg_log( DAPL_DBG_TYPE_ERR, + " listen: ERROR on conn_qual 0x%x\n",serviceID); + if ( cm_ptr->l_socket >= 0 ) + close( cm_ptr->l_socket ); + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + return dat_status; +} + + +/* + * PASSIVE: send local QP information, private data, and wait for + * active side to respond with QP RTS/RTR status + */ +static DAT_RETURN +dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) +{ + ib_cm_handle_t acm_ptr; + void *p_data = NULL; + int len; + DAT_RETURN dat_status = DAT_SUCCESS; + + /* Allocate accept CM and initialize */ + if ((acm_ptr = dapl_os_alloc(sizeof(*acm_ptr))) == NULL) + return DAT_INSUFFICIENT_RESOURCES; + + (void) dapl_os_memzero( acm_ptr, sizeof( *acm_ptr ) ); + + 
acm_ptr->socket = -1; + acm_ptr->sp = cm_ptr->sp; + acm_ptr->hca_ptr = cm_ptr->hca_ptr; + + len = sizeof(acm_ptr->dst.ia_address); + acm_ptr->socket = accept(cm_ptr->l_socket, + (struct sockaddr*)&acm_ptr->dst.ia_address, &len ); - if ( cm_ptr->socket < 0 ) { + if ( acm_ptr->socket < 0 ) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " listen accept: ERR %s\n",strerror(errno)); + " accept: ERR %s on FD %d l_cr %p\n", + strerror(errno),cm_ptr->l_socket,cm_ptr); dat_status = DAT_INTERNAL_ERROR; goto bail; } /* read in DST QP info, IA address. check for private data */ - len = read( cm_ptr->socket, &cm_ptr->dst, sizeof(ib_qp_cm_t) ); + len = read( acm_ptr->socket, &acm_ptr->dst, sizeof(ib_qp_cm_t) ); if ( len != sizeof(ib_qp_cm_t) ) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " listen read: ERR %s, rcnt=%d\n", - strerror(errno), len); + " accept read: ERR %s, rcnt=%d\n", + strerror(errno), len); dat_status = DAT_INTERNAL_ERROR; goto bail; } - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " listen: DST port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", - cm_ptr->dst.port, cm_ptr->dst.lid, - cm_ptr->dst.qpn, cm_ptr->dst.p_size ); + " accept: DST port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", + acm_ptr->dst.port, acm_ptr->dst.lid, + acm_ptr->dst.qpn, acm_ptr->dst.p_size ); /* validate private data size before reading */ - if ( cm_ptr->dst.p_size > IB_MAX_REQ_PDATA_SIZE ) { + if ( acm_ptr->dst.p_size > IB_MAX_REQ_PDATA_SIZE ) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " listen read: psize (%d) wrong\n", - cm_ptr->dst.p_size ); + " accept read: psize (%d) wrong\n", + acm_ptr->dst.p_size ); dat_status = DAT_INTERNAL_ERROR; goto bail; } /* read private data into cm_handle if any present */ - if ( cm_ptr->dst.p_size ) { - len = read( cm_ptr->socket, - cm_ptr->p_data, - cm_ptr->dst.p_size ); - if ( len != cm_ptr->dst.p_size ) { + if ( acm_ptr->dst.p_size ) { + len = read( acm_ptr->socket, + acm_ptr->p_data, acm_ptr->dst.p_size ); + if ( len != acm_ptr->dst.p_size ) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " listen read pdata: ERR 
%s, rcnt=%d\n", + " accept read pdata: ERR %s, rcnt=%d\n", strerror(errno), len ); dat_status = DAT_INTERNAL_ERROR; goto bail; } - p_data = cm_ptr->p_data; + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " accept: psize=%d read\n", + acm_ptr->dst.p_size); + p_data = acm_ptr->p_data; } - - /* set cm_handle for this service point */ - sp_ptr->cm_srvc_handle = cm_ptr; - /* - * dapls_ib_accept_connection send QP information - * and complete CM handshake - */ - /* trigger CR event and return SUCCESS */ - dapls_cr_callback( cm_ptr, + dapls_cr_callback( acm_ptr, IB_CME_CONNECTION_REQUEST_PENDING, p_data, - sp_ptr ); + acm_ptr->sp ); - return dat_status; + return DAT_SUCCESS; bail: - if ( l_sock >= 0 ) - close( l_sock ); - if ( cm_ptr->socket >= 0 ) - close( cm_ptr->socket ); - if ( cm_ptr ) - dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); - - return dat_status; + if ( acm_ptr->socket >=0 ) + close( acm_ptr->socket ); + dapl_os_free( acm_ptr, sizeof( *acm_ptr ) ); + return DAT_INTERNAL_ERROR; } - -/* - * PASSIVE: send local QP information, private data, and wait for - * active side to respond with QP RTS/RTR status - */ static DAT_RETURN -dapli_socket_accept( DAPL_EP *ep_ptr, - DAPL_CR *cr_ptr, - DAT_COUNT p_size, - DAT_PVOID p_data ) +dapli_socket_accept_final( DAPL_EP *ep_ptr, + DAPL_CR *cr_ptr, + DAT_COUNT p_size, + DAT_PVOID p_data ) { - ib_cm_handle_t cm_ptr = cr_ptr->ib_cm_handle; DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; + ib_cm_handle_t cm_ptr = cr_ptr->ib_cm_handle; ib_qp_cm_t qp_cm; struct iovec iovec[2]; int len; - char r_buf[10] = "XX_XXX_XXX"; + short rtu_data = 0; if (p_size > IB_MAX_REP_PDATA_SIZE) - return (DAT_LENGTH_ERROR); + return DAT_LENGTH_ERROR; + /* must have a accepted socket */ + if ( cm_ptr->socket < 0 ) + return DAT_INTERNAL_ERROR; + /* modify QP to RTR and then to RTS with remote info already read */ if ( dapls_modify_qp_state( ep_ptr->qp_handle, IBV_QPS_RTR, &cm_ptr->dst ) != DAT_SUCCESS ) @@ -413,42 +456,42 @@ dapli_socket_accept( DAPL_EP *ep_ptr, 
qp_cm.p_size = p_size; iovec[0].iov_base = &qp_cm; iovec[0].iov_len = sizeof(ib_qp_cm_t); - iovec[1].iov_base = p_data; - iovec[1].iov_len = p_size; - len = writev( cm_ptr->socket, iovec, 2 ); + if (p_size) { + iovec[1].iov_base = p_data; + iovec[1].iov_len = p_size; + } + len = writev( cm_ptr->socket, iovec, (p_size ? 2:1) ); if (len != (p_size + sizeof(ib_qp_cm_t))) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " connect write: ERR %s, wcnt=%d\n", + " accept_final: ERR %s, wcnt=%d\n", strerror(errno), len); goto bail; } dapl_dbg_log(DAPL_DBG_TYPE_EP, - " accept: SRC port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", + " accept_final: SRC port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", qp_cm.port, qp_cm.lid, qp_cm.qpn, qp_cm.p_size ); - + /* complete handshake after final QP state change */ - len = read(cm_ptr->socket, r_buf, sizeof(r_buf) ); - if ( len != sizeof(r_buf) ) { + len = read(cm_ptr->socket, &rtu_data, sizeof(rtu_data) ); + if ( len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f ) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " accept: ERR %s, rcnt=%d\n", - strerror(errno), len); + " accept_final: ERR %s, rcnt=%d rdata=%x\n", + strerror(errno), len, ntohs(rtu_data) ); goto bail; } /* final data exchange if remote QP state is good to go */ - dapl_dbg_log( DAPL_DBG_TYPE_EP," accept: %s \n", r_buf); - - dapls_cr_callback ( cm_ptr, - IB_CME_CONNECTED, - NULL, - cm_ptr->sp ); - + dapl_dbg_log( DAPL_DBG_TYPE_EP," PASSIVE: connected!\n" ); + dapls_cr_callback ( cm_ptr, IB_CME_CONNECTED, NULL, cm_ptr->sp ); return DAT_SUCCESS; bail: - dapl_dbg_log( DAPL_DBG_TYPE_ERR," accept: ERR !QP_RTR_RTS \n"); - close( cm_ptr->socket ); + dapl_dbg_log( DAPL_DBG_TYPE_ERR," accept_final: ERR !QP_RTR_RTS \n"); + if ( cm_ptr->socket >= 0 ) + close( cm_ptr->socket ); + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); dapls_ib_reinit_ep( ep_ptr ); /* reset QP state */ + return DAT_INTERNAL_ERROR; } @@ -482,7 +525,7 @@ dapls_ib_connect ( IN DAT_IA_ADDRESS_PTR remote_ia_address, IN DAT_CONN_QUAL remote_conn_qual, IN DAT_COUNT
private_data_size, - IN DAT_PVOID private_data ) + IN void *private_data ) { DAPL_EP *ep_ptr; ib_qp_handle_t qp_ptr; @@ -545,18 +588,19 @@ dapls_ib_disconnect ( dapls_ib_reinit_ep(ep_ptr); #endif - - if ( ep_ptr->cr_ptr ) + if ( ep_ptr->cr_ptr ) { dapls_cr_callback ( ep_ptr->cm_handle, IB_CME_DISCONNECTED, NULL, ((DAPL_CR *)ep_ptr->cr_ptr)->sp_ptr ); - else + } else { dapl_evd_connection_callback ( ep_ptr->cm_handle, IB_CME_DISCONNECTED, NULL, ep_ptr ); - + ep_ptr->cm_handle = NULL; + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + } return DAT_SUCCESS; } @@ -584,7 +628,6 @@ dapls_ib_disconnect_clean ( IN DAT_BOOLEAN active, IN const ib_cm_events_t ib_cm_event ) { - return; } @@ -644,25 +687,22 @@ dapls_ib_remove_conn_listener ( IN DAPL_IA *ia_ptr, IN DAPL_SP *sp_ptr ) { - ib_cm_srvc_handle_t cm_ptr = sp_ptr->cm_srvc_handle; dapl_dbg_log (DAPL_DBG_TYPE_EP, "dapls_ib_remove_conn_listener(ia_ptr %p sp_ptr %p cm_ptr %p)\n", ia_ptr, sp_ptr, cm_ptr ); #ifdef SOCKET_CM - /* close accepted socket, free cm_srvc_handle and return */ if ( cm_ptr != NULL ) { - if ( cm_ptr->socket > 0 ) { - close( cm_ptr->socket ); - cm_ptr->socket = 0; + if ( cm_ptr->l_socket >= 0 ) { + close( cm_ptr->l_socket ); + cm_ptr->socket = -1; } - dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + /* cr_thread will free */ sp_ptr->cm_srvc_handle = NULL; } return DAT_SUCCESS; - #else return DAT_NOT_IMPLEMENTED; @@ -717,7 +757,7 @@ dapls_ib_accept_connection ( } #ifdef SOCKET_CM - return ( dapli_socket_accept(ep_ptr, cr_ptr, p_size, p_data) ); + return ( dapli_socket_accept_final(ep_ptr, cr_ptr, p_size, p_data) ); #else return DAT_NOT_IMPLEMENTED; #endif @@ -756,13 +796,13 @@ dapls_ib_reject_connection ( /* just close the socket and return */ if ( cm_ptr->socket > 0 ) { close( cm_ptr->socket ); - cm_ptr->socket = 0; + cm_ptr->socket = -1; } - return DAT_SUCCESS; - +#else + return DAT_NOT_IMPLEMENTED; #endif - return DAT_SUCCESS; + } @@ -984,6 +1024,76 @@ dapls_ib_get_cm_event ( return ib_cm_event; } +/* async 
CR processing thread to avoid blocking applications */ +void cr_thread(void *arg) +{ + struct dapl_hca *hca_ptr = arg; + ib_cm_srvc_handle_t cr, next_cr; + int max_fd; + fd_set rfd,rfds; + struct timeval to; + + dapl_os_lock( &hca_ptr->ib_trans.lock ); + while ( !hca_ptr->ib_trans.destroy ) { + + FD_ZERO( &rfds ); + max_fd = -1; + + if (!dapl_llist_is_empty(&hca_ptr->ib_trans.list)) + next_cr = dapl_llist_peek_head (&hca_ptr->ib_trans.list); + else + next_cr = NULL; + + while (next_cr) { + cr = next_cr; + dapl_dbg_log (DAPL_DBG_TYPE_CM," thread: cm_ptr %p\n", cr ); + if (cr->l_socket == -1 || hca_ptr->ib_trans.destroy) { + + dapl_dbg_log(DAPL_DBG_TYPE_CM," thread: Freeing %p\n", cr); + next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cr->entry ); + dapl_llist_remove_entry(&hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cr->entry); + dapl_os_free( cr, sizeof(*cr) ); + continue; + } + + FD_SET( cr->l_socket, &rfds ); /* add to select set */ + if ( cr->l_socket > max_fd ) + max_fd = cr->l_socket; + + /* individual select poll to check for work */ + FD_ZERO(&rfd); + FD_SET(cr->l_socket, &rfd); + dapl_os_unlock(&hca_ptr->ib_trans.lock); + to.tv_sec = 0; + to.tv_usec = 0; + if ( select(cr->l_socket + 1,&rfd, NULL, NULL, &to) < 0) { + dapl_dbg_log (DAPL_DBG_TYPE_CM, + " thread: ERR %s on cr %p sk %d\n", + strerror(errno), cr, cr->l_socket); + close(cr->l_socket); + cr->l_socket = -1; + } else if ( FD_ISSET(cr->l_socket, &rfd) && + dapli_socket_accept(cr)) { + close(cr->l_socket); + cr->l_socket = -1; + } + dapl_os_lock( &hca_ptr->ib_trans.lock ); + next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cr->entry ); + } + dapl_os_unlock( &hca_ptr->ib_trans.lock ); + to.tv_sec = 0; + to.tv_usec = 500000; /* wakeup and check destroy */ + select(max_fd + 1, &rfds, NULL, NULL, &to); + dapl_os_lock( &hca_ptr->ib_trans.lock ); + } + dapl_os_unlock( &hca_ptr->ib_trans.lock ); + hca_ptr->ib_trans.destroy = 0; + 
dapl_dbg_log(DAPL_DBG_TYPE_CM," thread(hca %p) exit\n",hca_ptr); +} + /* Real IBv CM */ #else Index: openib/dapl_ib_qp.c =================================================================== --- openib/dapl_ib_qp.c (revision 2190) +++ openib/dapl_ib_qp.c (working copy) @@ -1,26 +1,25 @@ /* - * Copyright (c) 2002-2003, Network Appliance, Inc. All rights reserved. - * - * This Software is licensed under either one of the following two licenses: + * This Software is licensed under one of the following licenses: * * 1) under the terms of the "Common Public License 1.0" a copy of which is - * in the file LICENSE.txt in the root directory. The license is also * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * OR * - * 2) under the terms of the "The BSD License" a copy of which is in the file - * LICENSE2.txt in the root directory. The license is also available from - * the Open Source Initiative, see + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. * - * Licensee has the right to choose either one of the above two licenses. + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. * - * Redistributions of source code must retain both the above copyright - * notice and either one of the license notices. + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. * * Redistributions in binary form must reproduce both the above copyright - * notice, either one of the license notices in the documentation + * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. 
*/ @@ -30,7 +29,7 @@ * * PURPOSE: QP routines for access to DET Verbs * - * $Id:$ + * $Id: $ **********************************************************************/ #include "dapl.h" @@ -311,7 +310,7 @@ dapls_modify_qp_state ( IN ib_qp_handle_ qp_attr.path_mtu = IBV_MTU_1024; qp_attr.dest_qp_num = qp_cm->qpn; qp_attr.rq_psn = 1; - qp_attr.max_dest_rd_atomic = 1; + qp_attr.max_dest_rd_atomic = 8; qp_attr.min_rnr_timer = 12; qp_attr.ah_attr.is_global = 0; qp_attr.ah_attr.dlid = qp_cm->lid; @@ -338,7 +337,7 @@ dapls_modify_qp_state ( IN ib_qp_handle_ qp_attr.retry_cnt = 7; qp_attr.rnr_retry = 7; qp_attr.sq_psn = 1; - qp_attr.max_rd_atomic = 1; + qp_attr.max_rd_atomic = 8; dapl_dbg_log (DAPL_DBG_TYPE_EP, " modify_qp_rts: psn %x or %x\n", qp_attr.sq_psn, qp_attr.max_rd_atomic ); Index: openib/README =================================================================== --- openib/README (revision 2190) +++ openib/README (working copy) @@ -41,8 +41,8 @@ A simple dapl test just for initial open known issues: - early drop, good luck! Only tested with a simple dtest. - see TODO for more details - events not working?? + early drop, only tested with simple dtest and dapltest SR. + no memory windows support in ibverbs, dat_create_rmr fails. + Index: openib/dapl_ib_util.h =================================================================== --- openib/dapl_ib_util.h (revision 2190) +++ openib/dapl_ib_util.h (working copy) @@ -79,10 +79,23 @@ typedef struct _ib_qp_cm } ib_qp_cm_t; -/* EP->cm_handle for connect, SP->cm_srvc_handle for listen */ +/* + * dapl_llist_entry in dapl.h but dapl.h depends on provider + * typedef's in this file first. 
move dapl_llist_entry out of dapl.h + */ +struct ib_llist_entry +{ + struct dapl_llist_entry *flink; + struct dapl_llist_entry *blink; + void *data; + struct dapl_llist_entry *list_head; +}; + struct ib_cm_handle { - int socket; + struct ib_llist_entry entry; + int socket; + int l_socket; struct dapl_hca *hca_ptr; DAT_HANDLE cr; DAT_HANDLE sp; @@ -112,6 +125,9 @@ typedef enum } ib_cm_events_t; +/* prototype for cm thread */ +void cr_thread (void *arg); + #else /* TODO: Waiting for IB CM to define */ @@ -205,11 +221,6 @@ typedef struct dapl_evd *ib_wait_obj_ha * ibv_post_recv - Return 0, -1 & bad_wr */ -/* definitions from libmthca/src/cq.c, should be in verbs.h */ -#define IB_CQ_OK 0 -#define IB_CQ_EMPTY -1 -#define IB_POLL_ERR -2 - /* async handler for CQ, QP, and unafiliated */ typedef void (*ib_async_handler_t)( IN ib_hca_handle_t ib_hca_handle, @@ -221,11 +232,18 @@ typedef struct _ib_hca_transport { struct ibv_device *ib_dev; ib_cq_handle_t ib_cq_empty; + +#if SOCKET_CM + int destroy; + DAPL_OS_THREAD thread; + DAPL_OS_LOCK lock; + struct dapl_llist_entry *list; +#endif ib_async_handler_t async_unafiliated; ib_async_handler_t async_cq_error; - ib_async_handler_t async_cq_completion; + ib_async_handler_t async_cq; ib_async_handler_t async_qp_error; - + } ib_hca_tranport_t; /* provider specfic fields for shared memory support */ Index: openib/dapl_ib_cq.c =================================================================== --- openib/dapl_ib_cq.c (revision 2190) +++ openib/dapl_ib_cq.c (working copy) @@ -382,10 +382,10 @@ DAT_RETURN dapls_ib_completion_notify ( * Output: * none * - * Returns: + * Returns: * DAT_SUCCESS * DAT_QUEUE_EMPTY - * dapl_convert_errno + * */ DAT_RETURN dapls_ib_completion_poll ( IN DAPL_HCA *hca_ptr, @@ -393,15 +393,12 @@ DAT_RETURN dapls_ib_completion_poll ( IN ib_work_completion_t *wc_ptr) { int ret; - - ret = ibv_poll_cq(evd_ptr->ib_cq_handle, 1, wc_ptr); + + ret = ibv_poll_cq(evd_ptr->ib_cq_handle, 1, wc_ptr); if (ret == 1) return 
DAT_SUCCESS; - else if ((ret == IB_CQ_OK) || (ret == IB_CQ_EMPTY)) - return DAT_QUEUE_EMPTY; - else - return(dapl_convert_errno(EFAULT,"poll_cq"));; - + + return DAT_QUEUE_EMPTY; } #ifdef CQ_WAIT_OBJECT @@ -447,24 +444,45 @@ dapls_ib_wait_object_wait ( IN ib_wait_obj_handle_t p_cq_wait_obj_handle, IN u_int32_t timeout) { - int status; - ib_cq_handle_t cq = p_cq_wait_obj_handle->ib_cq_handle; - struct ibv_cq *ibv_cq; - void *ibv_ctx; + DAPL_EVD *evd_ptr = p_cq_wait_obj_handle; + ib_cq_handle_t cq = evd_ptr->ib_cq_handle; + struct ibv_cq *ibv_cq; + void *ibv_ctx; + int status = EINVAL; /* invalid handle */ dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, " cq_object_wait: dev %p evd %p cq %p, time %d\n", - cq->context, p_cq_wait_obj_handle, cq, timeout ); + cq->context, evd_ptr, cq, timeout ); - /* will block forever, only 1 per device for now?? */ /* TODO: add timeout, map each CQ created?? */ - if (cq) { - status = ibv_get_cq_event(cq->context, 0, &ibv_cq, &ibv_ctx); - if (!status && (ibv_cq == cq)) - return DAT_SUCCESS; + /* Multiple EVD's sharing one event handle for now */ + while (evd_ptr->ib_cq_handle) { + + status = ibv_get_cq_event(cq->context, + 0, &ibv_cq, &ibv_ctx); + if (status) + break; + + /* EVD mismatch, process DTO callback for this EVD */ + if (ibv_cq != cq) { + ib_hca_tranport_t *hca_ptr = + &evd_ptr->header.owner_ia->hca_ptr->ib_trans; + + if ( hca_ptr->async_cq ) + hca_ptr->async_cq(cq->context, + (ib_error_record_t*)ibv_cq, + ibv_ctx); + + continue; + } + break; } - return(dapl_convert_errno(EFAULT,"cq_wait_object_wait")); + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " cq_object_wait: RET cq %p ibv_cq %p ibv_ctx %p %x\n", + cq,ibv_cq,ibv_ctx,status); + + return(dapl_convert_errno(status,"cq_wait_object_wait")); } #endif From mshefty at ichips.intel.com Tue Apr 19 16:01:45 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 19 Apr 2005 16:01:45 -0700 Subject: [openib-general] [PATCH] [VERBS] new verbs call to allocate AH using WC Message-ID: 
<20050419160145.6244bce6.mshefty@ichips.intel.com> This patch will add a new call to ib_verbs.h to allocate an address handle using a received work completion. This call will be used by the MAD RMPP code (to be submitted shortly in a separate patch). Signed-off-by: Sean Hefty Index: include/ib_verbs.h =================================================================== --- include/ib_verbs.h (revision 2168) +++ include/ib_verbs.h (working copy) @@ -971,6 +971,21 @@ struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); /** + * ib_create_ah_from_wc - Creates an address handle associated with the + * sender of the specified work completion. + * @pd: The protection domain associated with the address handle. + * @wc: Work completion information associated with a received message. + * @grh: References the received global route header. This parameter is + * ignored unless the work completion indicates that the GRH is valid. + * @port_num: The outbound port number to associate with the address. + * + * The address handle is used to reference a local or global destination + * in all UD QP post sends. + */ +struct ib_ah *ib_create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc, + struct ib_grh *grh, u8 port_num); + +/** * ib_modify_ah - Modifies the address vector associated with an address * handle. * @ah: The address handle to modify. 
Index: core/verbs.c =================================================================== --- core/verbs.c (revision 2168) +++ core/verbs.c (working copy) @@ -40,6 +40,7 @@ #include #include +#include /* Protection domains */ @@ -87,6 +88,40 @@ } EXPORT_SYMBOL(ib_create_ah); +struct ib_ah *ib_create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc, + struct ib_grh *grh, u8 port_num) +{ + struct ib_ah_attr ah_attr; + u32 flow_class; + u16 gid_index; + int ret; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = wc->slid; + ah_attr.sl = wc->sl; + ah_attr.src_path_bits = wc->dlid_path_bits; + ah_attr.port_num = port_num; + + if (wc->wc_flags & IB_WC_GRH) { + ah_attr.ah_flags = IB_AH_GRH; + ah_attr.grh.dgid = grh->sgid; + + ret = ib_find_cached_gid(pd->device, &grh->dgid, &port_num, + &gid_index); + if (ret) + return ERR_PTR(ret); + + ah_attr.grh.sgid_index = (u8) gid_index; + flow_class = be32_to_cpu(grh->version_tclass_flow); + ah_attr.grh.flow_label = flow_class & 0xFFFFF; + ah_attr.grh.traffic_class = (flow_class >> 20) & 0xFF; + ah_attr.grh.hop_limit = grh->hop_limit; + } + + return ib_create_ah(pd, &ah_attr); +} +EXPORT_SYMBOL(ib_create_ah_from_wc); + int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) { return ah->device->modify_ah ? From robert.j.woodruff at intel.com Tue Apr 19 16:17:58 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 19 Apr 2005 16:17:58 -0700 Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9kernel In-Reply-To: <20050418200249.GG6931@esmail.cup.hp.com> Message-ID: Grant wrote, >> >> Roland, maybe, when you do this, you can put a copy of the source as you >> submit it under gen2/branches? I think that would solve Grant's problem >> but may be too hard to do in practice. >Not really since I can see the kernel rev and then pull the matching >kernel source. Well unless people are using -mm or linus' bk tree >directly. But not that many people do that.
>grant Ok, I split the large patch into 3, one for kernel diffs, one for the openib drivers, and one for openib fixups that are needed to backport. It might be good to come up with some way to track which SVN rev went into which kernel.org release, but until that is done, we can just embed the rev. of the kernel.org infiniband source that was back ported, as I have done in the three patches attached. These allow a back port of what is in 2.6.12 back to 2.6.9. The patches should be applied in order, 01, 02, and 03. For that matter, we may only want to back port the stable versions of code that are released with each kernel.org release and thus may not need the SVN number in the back port patches. Note that since 2.6.12 is not quite released, we should also consider these release candidates and not put them into SVN until 2.6.12 is released. I have done limited testing with IPoIB and it appears to work ok. It would be good if someone else could also try them, perhaps on some other platform type. I tested them on Itanium. woody -------------- next part -------------- A non-text attachment was scrubbed... Name: infiniband-2.6.12-to-2.6.9-kernel-fixups-01.diff Type: application/octet-stream Size: 12580 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: infiniband-2.6.12-to-2.6.9-openib-drivers-02.diff Type: application/octet-stream Size: 802662 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: infiniband-2.6.12-to-2.6.9-openib-fixups-03.diff Type: application/octet-stream Size: 864 bytes Desc: not available URL: From mshefty at ichips.intel.com Tue Apr 19 16:20:32 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 19 Apr 2005 16:20:32 -0700 Subject: [openib-general] [PATCH] relocate SA MAD definitions to ib_mad.h Message-ID: <20050419162032.46e339ab.mshefty@ichips.intel.com> This patch moves the definitions of the SA MAD and header from ib_sa.h and sa_query.c to ib_mad.h. The definitions are needed by RMPP. Signed-off-by: Sean Hefty Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 2168) +++ include/ib_mad.h (working copy) @@ -39,6 +39,8 @@ #if !defined( IB_MAD_H ) #define IB_MAD_H +#include + #include /* Management base version */ @@ -115,6 +117,12 @@ union ib_gid dgid; } __attribute__ ((packed)); +/* + * These structures must be packed because they have 64-bit fields + * that are only 32-bit aligned. 64-bit architectures will lay them + * out wrong otherwise. 
(And unfortunately they are sent on the wire + * so we can't change the layout) + */ struct ib_mad_hdr { u8 base_version; u8 mgmt_class; @@ -137,6 +145,17 @@ u32 paylen_newwin; } __attribute__ ((packed)); +typedef u64 __bitwise ib_sa_comp_mask; + +#define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) + +struct ib_sa_hdr { + u64 sm_key; + u16 attr_offset; + u16 reserved; + ib_sa_comp_mask comp_mask; +} __attribute__ ((packed)); + struct ib_mad { struct ib_mad_hdr mad_hdr; u8 data[232]; @@ -148,6 +167,13 @@ u8 data[220]; } __attribute__ ((packed)); +struct ib_sa_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + struct ib_sa_hdr sa_hdr; + u8 data[200]; +} __attribute__ ((packed)); + struct ib_vendor_mad { struct ib_mad_hdr mad_hdr; struct ib_rmpp_hdr rmpp_hdr; @@ -418,8 +444,7 @@ void *buf); /** - * ib_free_recv_mad - Returns data buffers used to receive a MAD to the - * access layer. + * ib_free_recv_mad - Returns data buffers used to receive a MAD. * @mad_recv_wc: Work completion information for a received MAD. * * Clients receiving MADs through their ib_mad_recv_handler must call this Index: include/ib_sa.h =================================================================== --- include/ib_sa.h (revision 2168) +++ include/ib_sa.h (working copy) @@ -89,10 +89,6 @@ } } -typedef u64 __bitwise ib_sa_comp_mask; - -#define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) - /* * Structures for SA records are named "struct ib_sa_xxx_rec." No * attempt is made to pack structures to match the physical layout of Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 2168) +++ core/sa_query.c (working copy) @@ -50,26 +50,6 @@ MODULE_DESCRIPTION("InfiniBand subnet administration query support"); MODULE_LICENSE("Dual BSD/GPL"); -/* - * These two structures must be packed because they have 64-bit fields - * that are only 32-bit aligned. 
64-bit architectures will lay them - * out wrong otherwise. (And unfortunately they are sent on the wire - * so we can't change the layout) - */ -struct ib_sa_hdr { - u64 sm_key; - u16 attr_offset; - u16 reserved; - ib_sa_comp_mask comp_mask; -} __attribute__ ((packed)); - -struct ib_sa_mad { - struct ib_mad_hdr mad_hdr; - struct ib_rmpp_hdr rmpp_hdr; - struct ib_sa_hdr sa_hdr; - u8 data[200]; -} __attribute__ ((packed)); - struct ib_sa_sm_ah { struct ib_ah *ah; struct kref ref; From ardavis at ichips.intel.com Tue Apr 19 16:27:14 2005 From: ardavis at ichips.intel.com (ardavis) Date: Tue, 19 Apr 2005 16:27:14 -0700 Subject: [openib-general] uverbs API Message-ID: <426593D2.4030707@ichips.intel.com> Hello Roland, Now that I have the initial drop of uDAPL running I would like to discuss some possible modifications/additions. Here is my TODO list that I need some feedback on.... - resize_cq - query_device - ib_query_gid - ibv_get_cq_event(), need timed event call and wakeup - current implementation supports one event per device, plans for more? - memory window support Thanks, -arlin From roland at topspin.com Tue Apr 19 16:28:50 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 19 Apr 2005 16:28:50 -0700 Subject: [openib-general] [PATCH] relocate SA MAD definitions to ib_mad.h In-Reply-To: <20050419162032.46e339ab.mshefty@ichips.intel.com> (Sean Hefty's message of "Tue, 19 Apr 2005 16:20:32 -0700") References: <20050419162032.46e339ab.mshefty@ichips.intel.com> Message-ID: <52br8a5ost.fsf@topspin.com> Why do you need to add this: > +#include to ib_mad.h? I didn't see anything new that would use it. - R. 
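The packing comment in the patch above can be checked with a small userspace sketch. This is only an illustration under stated assumptions: the kernel types u64/u16 are mapped to stdint types, struct sa_hdr_natural and sa_hdr_natural_padding() are hypothetical names introduced here for comparison, and __attribute__ ((packed)) assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for struct ib_sa_hdr from the patch, packed as in ib_mad.h. */
struct sa_hdr_packed {
	uint64_t sm_key;
	uint16_t attr_offset;
	uint16_t reserved;
	uint64_t comp_mask;
} __attribute__ ((packed));

/* Hypothetical copy of the same fields with natural alignment, for comparison only. */
struct sa_hdr_natural {
	uint64_t sm_key;
	uint16_t attr_offset;
	uint16_t reserved;
	uint64_t comp_mask;
};

/* Wire layout: comp_mask starts right after the 12 bytes before it; 20 bytes total. */
static_assert(offsetof(struct sa_hdr_packed, comp_mask) == 12,
	      "comp_mask must sit at byte 12 on the wire");
static_assert(sizeof(struct sa_hdr_packed) == 20,
	      "SA header must be 20 bytes on the wire");

/* Bytes of padding the compiler inserts before comp_mask when the struct is not packed. */
size_t sa_hdr_natural_padding(void)
{
	return offsetof(struct sa_hdr_natural, comp_mask) -
	       offsetof(struct sa_hdr_packed, comp_mask);
}
```

On targets that align 64-bit integers to 8 bytes, sa_hdr_natural_padding() is nonzero, which is the "laid out wrong" case the comment warns about; on targets that only align them to 4 bytes it is zero and the attribute changes nothing.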
From roland at topspin.com Tue Apr 19 16:32:38 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 19 Apr 2005 16:32:38 -0700
Subject: [openib-general] [PATCH] relocate SA MAD definitions to ib_mad.h
In-Reply-To: <20050419162032.46e339ab.mshefty@ichips.intel.com> (Sean Hefty's message of "Tue, 19 Apr 2005 16:20:32 -0700")
References: <20050419162032.46e339ab.mshefty@ichips.intel.com>
Message-ID: <527jiy5omh.fsf@topspin.com>

By the way:

 > +/*
 > + * These structures must be packed because they have 64-bit fields
 > + * that are only 32-bit aligned. 64-bit architectures will lay them
 > + * out wrong otherwise. (And unfortunately they are sent on the wire
 > + * so we can't change the layout)
 > + */

I just had a quick look at ib_mad.h and it seems that none of the
packed structures already in that file actually need the
__attribute__((packed)) -- everything is already aligned to its size
as far as I can tell. It might be worth checking to make sure I'm not
missing anything, and then removing the packed attribute -- if nothing
else, this will shrink the IA64 code a fair bit.

 - R.

From mshefty at ichips.intel.com Tue Apr 19 16:34:12 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 19 Apr 2005 16:34:12 -0700
Subject: [openib-general] [PATCH] relocate SA MAD definitions to ib_mad.h
In-Reply-To: <52br8a5ost.fsf@topspin.com>
References: <20050419162032.46e339ab.mshefty@ichips.intel.com> <52br8a5ost.fsf@topspin.com>
Message-ID: <42659574.2010802@ichips.intel.com>

Roland Dreier wrote:
> Why do you need to add this:
>
> > +#include
>
> to ib_mad.h? I didn't see anything new that would use it.

I don't think it's needed with this patch. Sorry... I was trying to
break apart my RMPP changes into a few smaller patches to make the
review easier. I'll remove it from this patch.
- Sean From mshefty at ichips.intel.com Tue Apr 19 16:35:25 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 19 Apr 2005 16:35:25 -0700 Subject: [openib-general] [PATCH] relocate SA MAD definitions to ib_mad.h In-Reply-To: <527jiy5omh.fsf@topspin.com> References: <20050419162032.46e339ab.mshefty@ichips.intel.com> <527jiy5omh.fsf@topspin.com> Message-ID: <426595BD.4060500@ichips.intel.com> Roland Dreier wrote: > By the way: > > > +/* > > + * These structures must be packed because they have 64-bit fields > > + * that are only 32-bit aligned. 64-bit architectures will lay them > > + * out wrong otherwise. (And unfortunately they are sent on the wire > > + * so we can't change the layout) > > + */ > > I just had a quick look at ib_mad.h and it seems that none of the > packed structures already in that file actually need the > __attribute__((packed)) -- everything is already aligned to its size > as far as I can tell. It might be worth checking to make sure I'm not > missing anything, and then removing the packed attribute -- if nothing > else, this will shrink the IA64 code a fair bit. I'll double check this and remove the attribute packed if so. - Sean From mshefty at ichips.intel.com Tue Apr 19 16:39:28 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 19 Apr 2005 16:39:28 -0700 Subject: [openib-general] [PATCH] [VERBS] new verbs call to allocate AH using WC In-Reply-To: <20050419160145.6244bce6.mshefty@ichips.intel.com> References: <20050419160145.6244bce6.mshefty@ichips.intel.com> Message-ID: <426596B0.8030505@ichips.intel.com> Sean Hefty wrote: > Index: include/ib_verbs.h > =================================================================== ... > +struct ib_ah *ib_create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc, > + struct ib_grh *grh, u8 port_num); > + ... 
> +struct ib_ah *ib_create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc, > + struct ib_grh *grh, u8 port_num) > +{ It looks like I missed including the change that moved struct ib_grh from ib_mad.h to ib_verbs.h. - Sean From mshefty at ichips.intel.com Tue Apr 19 16:54:30 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 19 Apr 2005 16:54:30 -0700 Subject: [openib-general] [RFC] patch for new send MAD allocation routines Message-ID: <20050419165430.786a1dfc.mshefty@ichips.intel.com> This pseudo-patch (meaning I haven't tested it separately from the other RMPP changes) defines a new structure and calls for allocation of a MAD that can be posted on the send queue. It tries to combine functionality common to several MAD agents into a single location. It is currently only used by an RMPP test program. My plan was to update other agents once these calls were part of the standard build. Signed-off-by: Sean Hefty Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 2168) +++ include/ib_mad.h (working copy) @@ -39,6 +39,8 @@ #if !defined( IB_MAD_H ) #define IB_MAD_H +#include + #include /* Management base version */ @@ -157,6 +159,30 @@ } __attribute__ ((packed)); /** + * ib_mad_send_buf - MAD data buffer and work request for sends. + * @mad: References an allocated MAD data buffer. The size of the data + * buffer is specified in the @send_wr.length field. + * @mapping: DMA mapping information. + * @mad_agent: MAD agent that allocated the buffer. + * @context: User-controlled context fields. + * @send_wr: An initialized work request structure used when sending the MAD. + * The wr_id field of the work request is initialized to reference this + * data structure. + * @sge: A scatter-gather list referenced by the work request. + * + * Users are responsible for initializing the MAD buffer itself, with the + * exception of specifying the payload length field in any RMPP MAD. 
+ */ +struct ib_mad_send_buf { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) + struct ib_mad_agent *mad_agent; + void *context[2]; + struct ib_send_wr send_wr; + struct ib_sge sge; +}; + +/** * ib_get_rmpp_resptime - Returns the RMPP response time. * @rmpp_hdr: An RMPP header. */ @@ -478,4 +504,35 @@ int ib_process_mad_wc(struct ib_mad_agent *mad_agent, struct ib_wc *wc); +/** + * ib_create_send_mad - Allocate and initialize a data buffer and work request + * for sending a MAD. + * @mad_agent: Specifies the registered MAD service to associate with the MAD. + * @remote_qpn: Specifies the QPN of the receiving node. + * @pkey_index: Specifies which PKey the MAD will be send using. This field + * is valid only if the remote_qpn is QP 1. + * @ah: References the address handle used to transfer to the remote node. + * @hdr_len: Indicates the size of the data header of the MAD. This length + * should include the common MAD header, RMPP header, plus any class + * specific header. + * @data_len: Indicates the size of any user-transfered data. The call will + * automatically adjust the allocated buffer size to account for any + * additional padding that may be necessary. + * + * This is a helper routine that may be used to allocate a MAD. Users are + * not required to allocate outbound MADs using this call. The returned + * MAD send buffer will reference a data buffer usable for sending a MAD, along + * with an intialized work request structure. + */ +struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, + u32 remote_qpn, u16 pkey_index, + struct ib_ah *ah, + int hdr_len, int data_len); + +/** + * ib_free_send_mad - Returns data buffers used to send a MAD. + * @send_buf: Previously allocated send data buffer. 
+ */ +void ib_free_send_mad(struct ib_mad_send_buf *send_buf); + #endif /* IB_MAD_H */ Index: core/mad.c =================================================================== --- core/mad.c (revision 2168) +++ core/mad.c (working copy) @@ -766,6 +766,89 @@ return ret; } +static int get_buf_length(int hdr_len, int data_len) +{ + int seg_size, pad; + + seg_size = sizeof(struct ib_mad) - hdr_len; + if (data_len && seg_size) { + pad = seg_size - data_len % seg_size; + if (pad == seg_size) + pad = 0; + } else + pad = seg_size; + return hdr_len + data_len + pad; +} + +struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, + u32 remote_qpn, u16 pkey_index, + struct ib_ah *ah, + int hdr_len, int data_len) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_buf *send_buf; + int buf_size; + void *buf; + + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, agent); + buf_size = get_buf_length(hdr_len, data_len); + + buf = kmalloc(sizeof *send_buf + buf_size, GFP_KERNEL); + if (!buf) + return ERR_PTR(-ENOMEM); + + send_buf = buf + buf_size; + memset(send_buf, 0, sizeof *send_buf); + send_buf->mad = buf; + + send_buf->sge.addr = dma_map_single(mad_agent->device->dma_device, + buf, buf_size, DMA_TO_DEVICE); + pci_unmap_addr_set(send_buf, mapping, send_buf->sge.addr); + send_buf->sge.length = buf_size; + send_buf->sge.lkey = mad_agent->mr->lkey; + + send_buf->send_wr.wr_id = (unsigned long) send_buf; + send_buf->send_wr.sg_list = &send_buf->sge; + send_buf->send_wr.num_sge = 1; + send_buf->send_wr.opcode = IB_WR_SEND; + send_buf->send_wr.send_flags = IB_SEND_SIGNALED; + send_buf->send_wr.wr.ud.ah = ah; + send_buf->send_wr.wr.ud.mad_hdr = &send_buf->mad->mad_hdr; + send_buf->send_wr.wr.ud.remote_qpn = remote_qpn; + send_buf->send_wr.wr.ud.remote_qkey = IB_QP_SET_QKEY; + send_buf->send_wr.wr.ud.pkey_index = pkey_index; + + if (mad_agent->rmpp_version) { + struct ib_rmpp_mad *rmpp_mad; + rmpp_mad = (struct ib_rmpp_mad 
*)send_buf->mad; + rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(hdr_len - + offsetof(struct ib_rmpp_mad, data) + data_len); + } + + send_buf->mad_agent = mad_agent; + atomic_inc(&mad_agent_priv->refcount); + return send_buf; +} +EXPORT_SYMBOL(ib_create_send_mad); + +void ib_free_send_mad(struct ib_mad_send_buf *send_buf) +{ + struct ib_mad_agent_private *mad_agent_priv; + + mad_agent_priv = container_of(send_buf->mad_agent, + struct ib_mad_agent_private, agent); + + dma_unmap_single(send_buf->mad_agent->device->dma_device, + pci_unmap_addr(send_buf, mapping), + send_buf->sge.length, DMA_TO_DEVICE); + kfree(send_buf->mad); + + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); +} +EXPORT_SYMBOL(ib_free_send_mad); + static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, struct ib_mad_send_wr_private *mad_send_wr) { From roland at topspin.com Tue Apr 19 17:07:40 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 19 Apr 2005 17:07:40 -0700 Subject: [openib-general] [RFC] patch for new send MAD allocation routines In-Reply-To: <20050419165430.786a1dfc.mshefty@ichips.intel.com> (Sean Hefty's message of "Tue, 19 Apr 2005 16:54:30 -0700") References: <20050419165430.786a1dfc.mshefty@ichips.intel.com> Message-ID: <52is2i48fn.fsf@topspin.com> Instead of hard-coding GFP_KERNEL: > + buf = kmalloc(sizeof *send_buf + buf_size, GFP_KERNEL); would it make sense to put a gfp_mask into the API and avoid some of the heartburn about interrupt context that we've had lately? - R. From roland at topspin.com Tue Apr 19 17:24:46 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 19 Apr 2005 17:24:46 -0700 Subject: [openib-general] Re: uverbs API References: <426593D2.4030707@ichips.intel.com> Message-ID: <528y3e47n5.fsf@topspin.com> > Here is my TODO list that I need some feedback on.... > - resize_cq A fair bit of work, probably won't get done too soon. > - query_device > - ib_query_gid Both easy, I'll add them shortly. 
 > - ibv_get_cq_event(), need timed event call and wakeup

Can you explain what this means a little more? Is there something you
need that you can't get by using select()/poll() with a timeout on the
CQ event FD?

 > - current implementation supports one event per device, plans for more?

Yes, in the medium term I plan to add support for multiple MSI-X
vectors so that multiple CQ events are possible.

 > - memory window support

Right now I don't have any plans to implement this. All the feedback
I've seen is that with current hardware, performance is not good
enough to make MWs worth using.

 - R.

From krkumar2 at in.ibm.com Wed Apr 20 00:14:28 2005
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Wed, 20 Apr 2005 12:44:28 +0530
Subject: [openib-general] [PATCH][RFC][1/4] IB: core changes for userspace verbs
Message-ID:

Hi Roland,

> In particular, the memory pinning code in uverbs_mem.c could stand a
> looking over.

1. In ib_umem_get(), I see you set ret = 0, which is unnecessary
   because chunk->nents is set based on ret value. Plus you already
   have a "while (ret)" to break out. "ret = 0" can be safely removed.

2. Also, as an optimization, in __ib_umem_release(), you could add
   another argument "page_dirty" which if set will do
   set_page_dirty_lock() (it seems to be a costly routine), and pass
   that argument as 0 in ib_umem_get() and 1 in ib_umem_release().

3. In __ib_umem_unmark() (sorry, I don't fully know this code very
   well and could be wrong), should the for loop have
   cur_base = vma->vm_start (instead of vm_end) since vma is set to
   the next one before this statement is executed?

thanks,

- KK

From roland at topspin.com Wed Apr 20 06:57:06 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 20 Apr 2005 06:57:06 -0700
Subject: [openib-general] kernel/user verbs interface changed
Message-ID: <521x954klp.fsf@topspin.com>

I just committed a change to ib_user_verbs.h in the kernel and
kern-abi.h in libibverbs to add command codes for all verbs. This
means that you must update both the kernel and libibverbs at the same
time; for example, a new kernel will not work with old libibverbs.

 - R.

From ardavis at ichips.intel.com Wed Apr 20 08:51:27 2005
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Wed, 20 Apr 2005 08:51:27 -0700
Subject: [openib-general] Kernel oops: NULL ptr dereference in ib_umem_get
In-Reply-To: <52r7hbixzz.fsf@topspin.com>
References: <426023BD.8080504@ichips.intel.com> <52r7hbixzz.fsf@topspin.com>
Message-ID: <42667A7F.8090604@ichips.intel.com>

Here is a new oops from my overnight run....

Apr 19 12:14:57 iclust-19 kernel: idr_remove called for id=0 which is not allocated.
Apr 19 12:14:57 iclust-19 kernel: Apr 19 12:14:57 iclust-19 kernel: Call Trace:{idr_remove+244} {ib_uverbs_event_release+126} Apr 19 12:14:57 iclust-19 kernel: {ib_uverbs_close+566} {__fput+98} Apr 19 12:14:57 iclust-19 kernel: {remove_vm_struct+125} {do_munmap+918} Apr 19 12:14:57 iclust-19 kernel: {__down_read+49} {sys_munmap+77} Apr 19 12:14:57 iclust-19 kernel: {system_call+126} Apr 19 12:14:57 iclust-19 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP: Apr 19 12:14:57 iclust-19 kernel: {ib_dealloc_pd+0} Apr 19 12:14:57 iclust-19 kernel: PGD 2feee067 PUD 312c7067 PMD 0 Apr 19 12:14:57 iclust-19 kernel: Oops: 0000 [1] SMP Apr 19 12:14:57 iclust-19 kernel: CPU 0 Apr 19 12:14:57 iclust-19 kernel: Modules linked in: Apr 19 12:14:57 iclust-19 kernel: Pid: 19391, comm: putfence1 Not tainted 2.6.11 Apr 19 12:14:57 iclust-19 kernel: RIP: 0010:[] {ib_dealloc_pd+0} Apr 19 12:14:57 iclust-19 kernel: RSP: 0018:ffff81002f66fe40 EFLAGS: 00010296 Apr 19 12:14:57 iclust-19 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000040000 Apr 19 12:14:57 iclust-19 kernel: RDX: 00000000ffffff01 RSI: ffff8100325bb400 RDI: 0000000000000000 Apr 19 12:14:57 iclust-19 kernel: RBP: ffff8100311fe900 R08: 00000000fffffff8 R09: 0000000000000002 Apr 19 12:14:57 iclust-19 kernel: R10: 00000000ffffffff R11: 0000000000000000 R12: ffff8100311fe910 Apr 19 12:14:57 iclust-19 kernel: R13: ffff81003a3e7d78 R14: ffff81003a3e7880 R15: ffff81003a3e7d88 Apr 19 12:14:57 iclust-19 kernel: FS: 00002aaaaae55f40(0000) GS:ffffffff805fe400(0000) knlGS:0000000000000000 Apr 19 12:14:57 iclust-19 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Apr 19 12:14:57 iclust-19 kernel: CR2: 0000000000000010 CR3: 0000000031e8f000 CR4: 00000000000006e0 Apr 19 12:14:57 iclust-19 kernel: Process putfence1 (pid: 19391, threadinfo ffff81002f66e000, task ffff81002fbd54a0) Apr 19 12:14:57 iclust-19 kernel: Stack: ffffffff8037f88e ffff81003a3e7d80 ffff81003227a2c0 
ffff810037ff6440 Apr 19 12:14:57 iclust-19 kernel: ffff81003d864108 ffff81003f289870 00002aaaab4d2000 ffff810032eb6e00 Apr 19 12:14:57 iclust-19 kernel: ffffffff8017baa2 00002aaaab4d2000 Apr 19 12:14:57 iclust-19 kernel: Call Trace:{ib_uverbs_close+574} {__fput+98} Apr 19 12:14:57 iclust-19 kernel: {remove_vm_struct+125} {do_munmap+918} Apr 19 12:14:57 iclust-19 kernel: {__down_read+49} {sys_munmap+77} Apr 19 12:14:57 iclust-19 kernel: {system_call+126} Apr 19 12:14:57 iclust-19 kernel: Apr 19 12:14:57 iclust-19 kernel: Code: 8b 47 10 85 c0 75 0d 48 8b 07 4c 8b 98 18 01 00 00 41 ff e3 Apr 19 12:14:57 iclust-19 kernel: RIP {ib_dealloc_pd+0} RSP Apr 19 12:14:57 iclust-19 kernel: CR2: 0000000000000010 From mshefty at ichips.intel.com Wed Apr 20 09:02:11 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 20 Apr 2005 09:02:11 -0700 Subject: [openib-general] [RFC] patch for new send MAD allocation routines In-Reply-To: <52is2i48fn.fsf@topspin.com> References: <20050419165430.786a1dfc.mshefty@ichips.intel.com> <52is2i48fn.fsf@topspin.com> Message-ID: <42667D03.1030502@ichips.intel.com> Roland Dreier wrote: > Instead of hard-coding GFP_KERNEL: > > > + buf = kmalloc(sizeof *send_buf + buf_size, GFP_KERNEL); > > would it make sense to put a gfp_mask into the API and avoid some of > the heartburn about interrupt context that we've had lately? That's easy enough to do. - Sean From IBMEHCAD at de.ibm.com Wed Apr 20 09:17:43 2005 From: IBMEHCAD at de.ibm.com (IBMEHCA DD) Date: Wed, 20 Apr 2005 18:17:43 +0200 Subject: [openib-general] IBM eHCA Device Driver for gen1 IB stack Message-ID: Hi, we've just released the first linux device driver for the IBM eServer HCA for Power5. It's gen1 based and runs on SLES9 SP1. Main testvehicle for this code was IPoIB. gen2 and full userspace support will be next. http://sourceforge.net/projects/ibmehcad/ Hardware device driver development for gen2 is so much easier... 
Christoph Raisch
HCAD teamlead, ibm boeblingen lab,

From roland at topspin.com Wed Apr 20 09:38:13 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 20 Apr 2005 09:38:13 -0700
Subject: [openib-general] Kernel oops: NULL ptr dereference in ib_umem_get
In-Reply-To: <42667A7F.8090604@ichips.intel.com> (Arlin Davis's message of "Wed, 20 Apr 2005 08:51:27 -0700")
References: <426023BD.8080504@ichips.intel.com> <52r7hbixzz.fsf@topspin.com> <42667A7F.8090604@ichips.intel.com>
Message-ID: <52sm1l2ykq.fsf@topspin.com>

Thanks for the report. Did your overnight run involve multiple
processes using IB verbs running at the same time?

I think I might see a race condition that could possibly cause this
crash...

 - R.

From roland at topspin.com Wed Apr 20 09:44:45 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 20 Apr 2005 09:44:45 -0700
Subject: [openib-general] IBM eHCA Device Driver for gen1 IB stack
In-Reply-To: (IBMEHCA DD's message of "Wed, 20 Apr 2005 18:17:43 +0200")
References:
Message-ID: <52ll7d2y9u.fsf@topspin.com>

> Hi, we've just released the first linux device driver for
> the IBM eServer HCA for Power5. It's gen1 based and runs
> on SLES9 SP1. Main testvehicle for this code was IPoIB.
> gen2 and full userspace support will be next.

Excellent, I'm glad to see this released. I'm looking forward to
seeing the gen2 support.

If I may make a small suggestion for future releases: please have the
tar file contain a top-level directory like ehca-0021, with everything
contained in that directory. It's a little annoying to unpack a tar
file and have it spread 5 files in your working directory, especially
when some have generic names like "INSTALL" or "patches."
Thanks, Roland From roland at topspin.com Wed Apr 20 09:48:05 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 20 Apr 2005 09:48:05 -0700 Subject: [openib-general] [PATCH][RFC][1/4] IB: core changes for userspace verbs In-Reply-To: (Krishna Kumar2's message of "Wed, 20 Apr 2005 12:44:28 +0530") References: Message-ID: <52hdi12y4a.fsf@topspin.com> Krishna> 1. In ib_umem_get(), I see you set ret = 0, which is Krishna> unnecessary because chunk-> nents is set based on ret Krishna> value. Plus you already have a "while (ret)" to break Krishna> out. "ret = 0" can be safely removed. Actually, I know why "ret = 0" is there: although it's not strictly needed, gcc isn't smart enough to see that, and without the initialization, it warns that "'ret' might be used uninitialized in this function." - R. From ardavis at ichips.intel.com Wed Apr 20 09:50:05 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 20 Apr 2005 09:50:05 -0700 Subject: [openib-general] Kernel oops: NULL ptr dereference in ib_umem_get In-Reply-To: <52sm1l2ykq.fsf@topspin.com> References: <426023BD.8080504@ichips.intel.com> <52r7hbixzz.fsf@topspin.com> <42667A7F.8090604@ichips.intel.com> <52sm1l2ykq.fsf@topspin.com> Message-ID: <4266883D.1050306@ichips.intel.com> Roland Dreier wrote: >Thanks for the report. Did your overnight run involve multiple >processes using IB verbs running at the same time? > >I think I might see a race condition that could possibly cause this >crash... > > - R. > > > Yes, this MPI test is running across 2 nodes, with 2 processes on each node. From roland at topspin.com Wed Apr 20 09:50:26 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 20 Apr 2005 09:50:26 -0700 Subject: [openib-general] [PATCH][RFC][1/4] IB: core changes for userspace verbs In-Reply-To: (Krishna Kumar2's message of "Wed, 20 Apr 2005 12:44:28 +0530") References: Message-ID: <52fyxl2y0d.fsf@topspin.com> Krishna> 1. 
In ib_umem_get(), I see you set ret = 0, which is Krishna> unnecessary because chunk-> nents is set based on ret Krishna> value. Plus you already have a "while (ret)" to break Krishna> out. "ret = 0" can be safely removed. True, done. Krishna> 2. Also, as an optimization, in __ib_umem_release(), you Krishna> could add another argument "page_dirty" which if set will Krishna> do set_page_dirty_lock() (it seems to be a costly Krishna> routine), and pass that argument as 0 in ib_umem_get() Krishna> and 1 in ib_umem_release(). Seems reasonable, done as well. Krishna> 3. In __ib_umem_unmark() (sorry, I don't fully know this Krishna> code very well and could be wrong), should the for loop Krishna> have cur_base = vma->vm_start (instead of vm_end) since Krishna> vma is set to the next one before this statement is Krishna> executed ? Yes, there was a bug there. I think it's already fixed in the latest code, though. Thanks, Roland From roland at topspin.com Wed Apr 20 11:09:16 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 20 Apr 2005 11:09:16 -0700 Subject: [openib-general] Kernel oops: NULL ptr dereference in ib_umem_get In-Reply-To: <4266883D.1050306@ichips.intel.com> (Arlin Davis's message of "Wed, 20 Apr 2005 09:50:05 -0700") References: <426023BD.8080504@ichips.intel.com> <52r7hbixzz.fsf@topspin.com> <42667A7F.8090604@ichips.intel.com> <52sm1l2ykq.fsf@topspin.com> <4266883D.1050306@ichips.intel.com> Message-ID: <523btl2ucz.fsf@topspin.com> OK, I'm not absolutely sure this fixes the cause of the oops you saw, but I am pretty sure it is a necessary fix. You can apply the patch below or just pull the latest subversion. Remember that the latest subversion kernel code requires up-to-date libibverbs code as well. My current theory is that you had two MPI processes exiting simultaneously, and ib_dealloc_ucontext() ended up accessing the same struct idr for both processes, which is a no-no. - R. 
--- infiniband/core/uverbs_main.c (revision 2193) +++ infiniband/core/uverbs_main.c (working copy) @@ -99,6 +99,8 @@ static int ib_dealloc_ucontext(struct ib if (!context) return 0; + down(&ib_uverbs_idr_mutex); + /* XXX Free AHs */ list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) { @@ -141,6 +143,8 @@ static int ib_dealloc_ucontext(struct ib kfree(uobj); } + up(&ib_uverbs_idr_mutex); + return context->device->dealloc_ucontext(context); } From ardavis at ichips.intel.com Wed Apr 20 11:35:31 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 20 Apr 2005 11:35:31 -0700 Subject: [openib-general] Re: uverbs API In-Reply-To: <528y3e47n5.fsf@topspin.com> References: <426593D2.4030707@ichips.intel.com> <528y3e47n5.fsf@topspin.com> Message-ID: <4266A0F3.4020804@ichips.intel.com> Roland Dreier wrote: > > Here is my TODO list that I need some feedback on.... > > > - resize_cq > >A fair bit of work, probably won't get done too soon. > > > - query_device > > - ib_query_gid > >Both easy, I'll add them shortly. > > > - ibv_get_cq_event(), need timed event call and wakeup > >Can you explain what this means a little more? Is there something you >need that you can't get by using select()/poll() with a timeout on the >CQ event FD? > > > As long as you can tell me that a thread blocked on get_cq_event will wakeup on a device close, signal, or CQ error then I don't need a wakeup call from userspace. and yes, select() was exactly my thinking, but I was hoping we could get it added to the ibv_get_cq_event() code and just include a new timeout parameter (in usecs) with the call. -arlin > > - current implementation supports one event per device, plans for more? > >Yes, in the medium term I plan to add support for multiple MSI-X >vectors so that multiple CQ events are possible. > > > - memory window support > >Right now I don't have any plans to implement this. 
All the feedback >I've seen is that with current hardware, performance is not good >enough to make MWs worth using. > > - R. > > > From tduffy at sun.com Wed Apr 20 13:12:35 2005 From: tduffy at sun.com (Tom Duffy) Date: Wed, 20 Apr 2005 13:12:35 -0700 Subject: [openib-general] Re: [openib-commits] r2195 - in gen2/trunk/src/linux-kernel/infiniband: core include In-Reply-To: <20050420173553.D22902283D6@openib.ca.sandia.gov> References: <20050420173553.D22902283D6@openib.ca.sandia.gov> Message-ID: <1114027955.11751.4.camel@duffman> On Wed, 2005-04-20 at 10:35 -0700, sean.hefty at openib.org wrote: > Modified: gen2/trunk/src/linux-kernel/infiniband/core/verbs.c > =================================================================== > --- gen2/trunk/src/linux-kernel/infiniband/core/verbs.c 2005-04-20 16:54:38 UTC (rev 2194) > +++ gen2/trunk/src/linux-kernel/infiniband/core/verbs.c 2005-04-20 17:35:52 UTC (rev 2195) > @@ -40,6 +40,7 @@ > #include > > #include > +#include > > /* Protection domains */ > > @@ -87,6 +88,40 @@ > } > EXPORT_SYMBOL(ib_create_ah); > > +struct ib_ah *ib_create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc, > + struct ib_grh *grh, u8 port_num) > +{ > + struct ib_ah_attr ah_attr; > + u32 flow_class; > + u16 gid_index; > + int ret; > + > + memset(&ah_attr, 0, sizeof ah_attr); > + ah_attr.dlid = wc->slid; > + ah_attr.sl = wc->sl; > + ah_attr.src_path_bits = wc->dlid_path_bits; > + ah_attr.port_num = port_num; > + > + if (wc->wc_flags & IB_WC_GRH) { > + ah_attr.ah_flags = IB_AH_GRH; > + ah_attr.grh.dgid = grh->dgid; > + > + ret = ib_find_cached_gid(pd->device, &grh->sgid, &port_num, > + &gid_index); > + if (ret) > + return ERR_PTR(ret); > + > + ah_attr.grh.sgid_index = (u8) gid_index; > + flow_class = be32_to_cpu(&grh->version_tclass_flow); > + ah_attr.grh.flow_label = flow_class & 0xFFFFF; > + ah_attr.grh.traffic_class = (flow_class >> 20) & 0xFF; > + ah_attr.grh.hop_limit = grh->hop_limit; > + } > + > + return ib_create_ah(pd, &ah_attr); > +} > 
+EXPORT_SYMBOL(ib_create_ah_from_wc);
> +
> int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr)
> {
> return ah->device->modify_ah ?

Causes build warning on 64bit:

  CC [M] drivers/infiniband/core/verbs.o
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/verbs.c: In function ‘ib_create_ah_from_wc’:
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/verbs.c:115: warning: cast from pointer to integer of different size
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/verbs.c:115: warning: cast from pointer to integer of different size
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/verbs.c:115: warning: cast from pointer to integer of different size

-tduffy

From roland at topspin.com Wed Apr 20 13:24:06 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 20 Apr 2005 13:24:06 -0700
Subject: [openib-general] Re: [openib-commits] r2195 - in gen2/trunk/src/linux-kernel/infiniband: core include
In-Reply-To: <1114027955.11751.4.camel@duffman> (Tom Duffy's message of "Wed, 20 Apr 2005 13:12:35 -0700")
References: <20050420173553.D22902283D6@openib.ca.sandia.gov> <1114027955.11751.4.camel@duffman>
Message-ID: <52u0m119jt.fsf@topspin.com>

    Tom> Causes build warning on 64bit:
    Tom> CC [M] drivers/infiniband/core/verbs.o
    Tom> /build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/verbs.c:
    Tom> In function ‘ib_create_ah_from_wc’:
    Tom> /build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/verbs.c:115:
    Tom> warning: cast from pointer to integer of different size

Looks like a real bug -- I think

	flow_class = be32_to_cpu(&grh->version_tclass_flow);

should be

	flow_class = be32_to_cpu(grh->version_tclass_flow);

(ie no "&" -- we want to swap the value, not the
address!) - R. From tduffy at sun.com Wed Apr 20 13:29:06 2005 From: tduffy at sun.com (Tom Duffy) Date: Wed, 20 Apr 2005 13:29:06 -0700 Subject: [openib-general] Re: [openib-commits] r2195 - in gen2/trunk/src/linux-kernel/infiniband: core include In-Reply-To: <52u0m119jt.fsf@topspin.com> References: <20050420173553.D22902283D6@openib.ca.sandia.gov> <1114027955.11751.4.camel@duffman> <52u0m119jt.fsf@topspin.com> Message-ID: <1114028946.11751.7.camel@duffman> On Wed, 2005-04-20 at 13:24 -0700, Roland Dreier wrote: > Looks like a real bug -- I think > > flow_class = be32_to_cpu(&grh->version_tclass_flow); > > should be > > flow_class = be32_to_cpu(grh->version_tclass_flow); > > (ie no "&" -- we want to swap the value, not the address!) That is what I thought as well, but there are other places in the code that do the same thing. agent.c:159 ping.c:137 -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From libor at topspin.com Wed Apr 20 13:32:44 2005 From: libor at topspin.com (Libor Michalek) Date: Wed, 20 Apr 2005 13:32:44 -0700 Subject: [openib-general] Re: uverbs API In-Reply-To: <4266A0F3.4020804@ichips.intel.com>; from ardavis@ichips.intel.com on Wed, Apr 20, 2005 at 11:35:31AM -0700 References: <426593D2.4030707@ichips.intel.com> <528y3e47n5.fsf@topspin.com> <4266A0F3.4020804@ichips.intel.com> Message-ID: <20050420133244.A9497@topspin.com> On Wed, Apr 20, 2005 at 11:35:31AM -0700, Arlin Davis wrote: > Roland Dreier wrote: > > > >Can you explain what this means a little more? Is there something you > >need that you can't get by using select()/poll() with a timeout on the > >CQ event FD? > > > > > > > As long as you can tell me that a thread blocked on get_cq_event will > wakeup on a device close, signal, or CQ error then I don't need a wakeup > call from userspace. 
> > and yes, select() was exactly my thinking, but I was hoping we could > get it added to the ibv_get_cq_event() code and just include a new > timeout parameter (in usecs) with the call. I guess it would be trivial for someone to write an ibv_get_cq_event() wrapper which took a timeout, and performed the select before calling the real ibv_get_cq_event... However, that seems limiting from a real application point of view where you will almost certainly want to have more than one file descriptor on which you are waiting. If the consumer is responsible for getting the FD and placing it into a select, in order to support timeouts and event notification, then it's trivial to add a CM fd and a SA query fd, not to mention any other file descriptor in your application. -Libor From roland at topspin.com Wed Apr 20 13:34:26 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 20 Apr 2005 13:34:26 -0700 Subject: [openib-general] Re: [openib-commits] r2195 - in gen2/trunk/src/linux-kernel/infiniband: core include In-Reply-To: <1114028946.11751.7.camel@duffman> (Tom Duffy's message of "Wed, 20 Apr 2005 13:29:06 -0700") References: <20050420173553.D22902283D6@openib.ca.sandia.gov> <1114027955.11751.4.camel@duffman> <52u0m119jt.fsf@topspin.com> <1114028946.11751.7.camel@duffman> Message-ID: <52pswp192l.fsf@topspin.com> Tom> That is what I thought as well, but there are other places in Tom> the code that do the same thing. Tom> agent.c:159, ping.c:137 Those are doing be32_to_cpup() -- notice the "p" at the end, which means it dereferences a pointer. be32_to_cpup(&val) is rather obfuscated but still technically correct. It's sort of like writing "*&val" instead of just "val." In other words, it's probably worth cleaning up those other places, but they're not actually bugs. - R.
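[Editor's illustration] The value-versus-pointer distinction described in the message above can be sketched in plain userspace C. This is only a sketch under assumptions: demo_be32_to_cpu(), demo_be32_to_cpup() and demo_make_be32() are hypothetical stand-ins written for this note, not the kernel's byteorder helpers, and the sample field value is made up. The 20-bit flow label and 8-bit traffic class masks follow the verbs.c code discussed in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for a "value" conversion helper: reads the
 * bytes of `be` in memory order and treats them as big-endian, so it
 * behaves the same on little- and big-endian hosts.
 */
uint32_t demo_be32_to_cpu(uint32_t be)
{
    unsigned char b[4];

    memcpy(b, &be, sizeof(b));
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Hypothetical "pointer" variant: dereference first, then convert. */
uint32_t demo_be32_to_cpup(const uint32_t *p)
{
    return demo_be32_to_cpu(*p);
}

/* Build a big-endian 32-bit field from four wire-order bytes. */
uint32_t demo_make_be32(unsigned char b0, unsigned char b1,
                        unsigned char b2, unsigned char b3)
{
    unsigned char b[4] = { b0, b1, b2, b3 };
    uint32_t be;

    memcpy(&be, b, sizeof(be));
    return be;
}

/* Self-check: both forms agree, and the masks pick out the sub-fields. */
void demo_check(void)
{
    /* Made-up sample: version 6, traffic class 0x12, flow label 0x34567. */
    uint32_t field = demo_make_be32(0x61, 0x23, 0x45, 0x67);
    uint32_t flow_class = demo_be32_to_cpu(field);

    /* Value form on the field equals pointer form on its address. */
    assert(flow_class == demo_be32_to_cpup(&field));
    assert((flow_class & 0xFFFFF) == 0x34567);
    assert(((flow_class >> 20) & 0xFF) == 0x12);
}
```

Handing the field's address to the value form is the mistake behind the 64-bit warning: the pointer itself would be converted to an integer and byte-swapped, rather than the 32-bit word it points at.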
From tduffy at sun.com Wed Apr 20 13:48:01 2005 From: tduffy at sun.com (Tom Duffy) Date: Wed, 20 Apr 2005 13:48:01 -0700 Subject: [openib-general] [PATCH][CORE] use be32_to_cpu instead of be32_to_cpup In-Reply-To: <52pswp192l.fsf@topspin.com> References: <20050420173553.D22902283D6@openib.ca.sandia.gov> <1114027955.11751.4.camel@duffman> <52u0m119jt.fsf@topspin.com> <1114028946.11751.7.camel@duffman> <52pswp192l.fsf@topspin.com> Message-ID: <1114030081.11751.22.camel@duffman> On Wed, 2005-04-20 at 13:34 -0700, Roland Dreier wrote: > Those are doing be32_to_cpup() -- notice the "p" at the end, which > means it dereferences a pointer. be32_to_cpup(&val) is rather > obfuscated but still technically correct. It's sort of like writing > "*&val" instead of just "val." > > In other words, it's probably worth cleaning up those other places, > but they're not actually bugs. Ah, didn't see that. Good eye. I propose the following patch then. Pointed-out-by: Roland Dreier Signed-off-by: Tom Duffy Index: linux-2.6.11-openib/drivers/infiniband/core/agent.c =================================================================== --- linux-2.6.11-openib/drivers/infiniband/core/agent.c (revision 2198) +++ linux-2.6.11-openib/drivers/infiniband/core/agent.c (working copy) @@ -155,10 +155,10 @@ static int agent_mad_send(struct ib_mad_ /* Should sgid be looked up ? 
*/ ah_attr.grh.sgid_index = 0; ah_attr.grh.hop_limit = grh->hop_limit; - ah_attr.grh.flow_label = be32_to_cpup( - &grh->version_tclass_flow) & 0xfffff; - ah_attr.grh.traffic_class = (be32_to_cpup( - &grh->version_tclass_flow) >> 20) & 0xff; + ah_attr.grh.flow_label = be32_to_cpu( + grh->version_tclass_flow) & 0xfffff; + ah_attr.grh.traffic_class = (be32_to_cpu( + grh->version_tclass_flow) >> 20) & 0xff; memcpy(ah_attr.grh.dgid.raw, grh->sgid.raw, sizeof(ah_attr.grh.dgid)); Index: linux-2.6.11-openib/drivers/infiniband/core/verbs.c =================================================================== --- linux-2.6.11-openib/drivers/infiniband/core/verbs.c (revision 2198) +++ linux-2.6.11-openib/drivers/infiniband/core/verbs.c (working copy) @@ -112,7 +112,7 @@ struct ib_ah *ib_create_ah_from_wc(struc return ERR_PTR(ret); ah_attr.grh.sgid_index = (u8) gid_index; - flow_class = be32_to_cpu(&grh->version_tclass_flow); + flow_class = be32_to_cpu(grh->version_tclass_flow); ah_attr.grh.flow_label = flow_class & 0xFFFFF; ah_attr.grh.traffic_class = (flow_class >> 20) & 0xFF; ah_attr.grh.hop_limit = grh->hop_limit; Index: linux-2.6.11-openib/drivers/infiniband/core/ping.c =================================================================== --- linux-2.6.11-openib/drivers/infiniband/core/ping.c (revision 2198) +++ linux-2.6.11-openib/drivers/infiniband/core/ping.c (working copy) @@ -133,10 +133,10 @@ static int ping_mad_send(struct ib_mad_a /* Should sgid be looked up ? 
*/ ah_attr.grh.sgid_index = 0; ah_attr.grh.hop_limit = grh->hop_limit; - ah_attr.grh.flow_label = be32_to_cpup( - &grh->version_tclass_flow) & 0xfffff; - ah_attr.grh.traffic_class = (be32_to_cpup( - &grh->version_tclass_flow) >> 20) & 0xff; + ah_attr.grh.flow_label = be32_to_cpu( + grh->version_tclass_flow) & 0xfffff; + ah_attr.grh.traffic_class = (be32_to_cpu( + grh->version_tclass_flow) >> 20) & 0xff; memcpy(ah_attr.grh.dgid.raw, grh->sgid.raw, sizeof(ah_attr.grh.dgid)); From mshefty at ichips.intel.com Wed Apr 20 14:19:20 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 20 Apr 2005 14:19:20 -0700 Subject: [openib-general] [PATCH][CORE] use be32_to_cpu instead of be32_to_cpup In-Reply-To: <1114030081.11751.22.camel@duffman> References: <20050420173553.D22902283D6@openib.ca.sandia.gov> <1114027955.11751.4.camel@duffman> <52u0m119jt.fsf@topspin.com> <1114028946.11751.7.camel@duffman> <52pswp192l.fsf@topspin.com> <1114030081.11751.22.camel@duffman> Message-ID: <4266C758.1080401@ichips.intel.com> Tom Duffy wrote: > Ah, didn't see that. Good eye. I propose the following patch then. > > Pointed-out-by: Roland Dreier > Signed-off-by: Tom Duffy > Thanks for catching this. I'll go ahead and take the patch and apply it. My plan is to eventually go through the code and identify areas where the newer call can be made in place of the existing code. I had already identified agent.c, but it looks like ping.c is a candidate as well. - Sean From robert.j.woodruff at intel.com Wed Apr 20 14:29:22 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 20 Apr 2005 14:29:22 -0700 Subject: [openib-general] User-verbs Broken in SVN 2194 ? Message-ID: <1AC79F16F5C5284499BB9591B33D6F00042A40A1@orsmsx408> I tried to build and run the usermode verbs from the SVN 2194 I checked out this morning. It appears to build ok. 
When I run the example ibv_devices it shows device Node GUID ------------------------------------ mthca0 0002c900011da040 But when I run ibv_pingpong I get Couldn't get context for mthca0 Any ideas ? woody From robert.j.woodruff at intel.com Wed Apr 20 14:51:51 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 20 Apr 2005 14:51:51 -0700 Subject: [openib-general] User-verbs Broken in SVN 2194 ? In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00042A40A1@orsmsx408> Message-ID: > Couldn't get context for mthca0 > Any ideas ? > woody Never mind. I needed to make the /dev nodes. From mshefty at ichips.intel.com Wed Apr 20 16:27:26 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 20 Apr 2005 16:27:26 -0700 Subject: [openib-general] [PATCH] fix bug matching responses with non-DATA RMPP MADs Message-ID: <20050420162726.10e35922.mshefty@ichips.intel.com> The following patch fixes an issue where a response MAD could have been incorrectly matched with an internally generated RMPP ACK. 
Signed-off-by: Sean Hefty Index: core/mad.c =================================================================== --- core/mad.c (revision 2202) +++ core/mad.c (working copy) @@ -1550,6 +1550,18 @@ return mad_recv_wc; } +static int is_data_mad(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_hdr *mad_hdr) +{ + struct ib_rmpp_mad *rmpp_mad; + + rmpp_mad = (struct ib_rmpp_mad *)mad_hdr; + return !mad_agent_priv->agent.rmpp_version || + !(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & + IB_MGMT_RMPP_FLAG_ACTIVE) || + (rmpp_mad->rmpp_hdr.rmpp_type == IB_MGMT_RMPP_TYPE_DATA); +} + static struct ib_mad_send_wr_private* find_send_req(struct ib_mad_agent_private *mad_agent_priv, u64 tid) @@ -1568,7 +1580,9 @@ */ list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, agent_list) { - if (mad_send_wr->tid == tid && mad_send_wr->timeout) { + if (is_data_mad(mad_agent_priv, + mad_send_wr->send_wr.wr.ud.mad_hdr) && + mad_send_wr->tid == tid && mad_send_wr->timeout) { /* Verify request has not been canceled */ return (mad_send_wr->status == IB_WC_SUCCESS) ? 
mad_send_wr : NULL; @@ -2055,7 +2069,9 @@ list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, agent_list) { - if (mad_send_wr->wr_id == wr_id) + if (is_data_mad(mad_agent_priv, + mad_send_wr->send_wr.wr.ud.mad_hdr) && + mad_send_wr->wr_id == wr_id) return mad_send_wr; } return NULL; From tduffy at sun.com Wed Apr 20 17:37:11 2005 From: tduffy at sun.com (Tom Duffy) Date: Wed, 20 Apr 2005 17:37:11 -0700 Subject: [PATCH][MTHCA] fix sparc build WAS: Re: [openib-general] [PATCH][RFC][3/4] IB: userspace verbs mthca changes In-Reply-To: <200544159.AzH1nqpM3uTQZaKG@topspin.com> References: <200544159.AzH1nqpM3uTQZaKG@topspin.com> Message-ID: <1114043831.18198.17.camel@duffman> On Mon, 2005-04-04 at 15:09 -0700, Roland Dreier wrote: > @@ -574,6 +836,22 @@ > return 0; > } > > +static int mthca_mmap_uar(struct ib_ucontext *context, > + struct vm_area_struct *vma) > +{ > + if (vma->vm_end - vma->vm_start != PAGE_SIZE) > + return -EINVAL; > + > + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); > + > + if (remap_pfn_range(vma, vma->vm_start, > + to_mucontext(context)->uar.pfn, > + PAGE_SIZE, vma->vm_page_prot)) > + return -EAGAIN; > + > + return 0; > +} > + This breaks building on sparc64: CC [M] drivers/infiniband/hw/mthca/mthca_provider.o /build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/hw/mthca/mthca_provider.c: In function `mthca_mmap_uar': /build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/hw/mthca/mthca_provider.c:352: warning: implicit declaration of function `pgprot_noncached' /build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/hw/mthca/mthca_provider.c:352: error: incompatible types in assignment make[3]: *** [drivers/infiniband/hw/mthca/mthca_provider.o] Error 1 make[2]: *** [drivers/infiniband/hw/mthca] Error 2 make[1]: *** [_module_drivers/infiniband] Error 2 make: *** [_all] Error 2 This is ugly, but fixes the build. Perhaps sparc needs pgprot_noncached() to be a noop? 
Signed-off-by: Tom Duffy Index: linux-2.6.11-openib/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- linux-2.6.11-openib/drivers/infiniband/hw/mthca/mthca_provider.c (revision 2202) +++ linux-2.6.11-openib/drivers/infiniband/hw/mthca/mthca_provider.c (working copy) @@ -349,7 +349,9 @@ static int mthca_mmap_uar(struct ib_ucon if (vma->vm_end - vma->vm_start != PAGE_SIZE) return -EINVAL; +#ifdef pgprot_noncached vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); +#endif if (remap_pfn_range(vma, vma->vm_start, to_mucontext(context)->uar.pfn, From davem at davemloft.net Wed Apr 20 17:38:20 2005 From: davem at davemloft.net (David S. Miller) Date: Wed, 20 Apr 2005 17:38:20 -0700 Subject: [PATCH][MTHCA] fix sparc build WAS: Re: [openib-general] [PATCH][RFC][3/4] IB: userspace verbs mthca changes In-Reply-To: <1114043831.18198.17.camel@duffman> References: <200544159.AzH1nqpM3uTQZaKG@topspin.com> <1114043831.18198.17.camel@duffman> Message-ID: <20050420173820.24c512ae.davem@davemloft.net> On Wed, 20 Apr 2005 17:37:11 -0700 Tom Duffy wrote: > This breaks building on sparc64: ... > This is ugly, but fixes the build. Perhaps sparc needs > pgprot_noncached() to be a noop? No, it should actually do something, like so: include/asm-sparc64/pgtable.h: af9bf175a223cf44310293287d50302e0fd3f9e9 --- a/include/asm-sparc64/pgtable.h +++ b/include/asm-sparc64/pgtable.h @@ -416,6 +416,11 @@ extern int io_remap_pfn_range(struct vm_ unsigned long pfn, unsigned long size, pgprot_t prot); +/* Clear virtual and physical cachability, set side-effect bit. */ +#define pgprot_noncached(prot) \ + (__pgprot((pgprot_val(prot) & ~(_PAGE_CP | _PAGE_CV)) | \ + _PAGE_E)) + /* * For sparc32&64, the pfn in io_remap_pfn_range() carries in * its high 4 bits. These macros/functions put it there or get it from there. 
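[Editor's illustration] The macro in the patch above clears two cacheability bits and sets a side-effect bit. A minimal userspace sketch of that clear-then-set pattern follows. The DEMO_* bit positions and demo_pgprot_t are placeholders invented for this note; they are not the real sparc64 PTE layout and should not be read as such.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit positions, NOT the real sparc64 PTE bits. */
#define DEMO_PAGE_E     (UINT64_C(1) << 0)  /* stand-in for "side-effect" */
#define DEMO_PAGE_CV    (UINT64_C(1) << 1)  /* stand-in for "cacheable, virtual" */
#define DEMO_PAGE_CP    (UINT64_C(1) << 2)  /* stand-in for "cacheable, physical" */
#define DEMO_PAGE_OTHER (UINT64_C(1) << 8)  /* some unrelated protection bit */

/* Hypothetical wrapper type, loosely modelled on an opaque pgprot value. */
typedef struct { uint64_t bits; } demo_pgprot_t;

/* Clear both cacheability bits, set the side-effect bit, keep the rest. */
demo_pgprot_t demo_pgprot_noncached(demo_pgprot_t prot)
{
    demo_pgprot_t out;

    out.bits = (prot.bits & ~(DEMO_PAGE_CP | DEMO_PAGE_CV)) | DEMO_PAGE_E;
    return out;
}

/* Self-check: cache bits gone, side-effect bit set, other bits untouched. */
void demo_pgprot_check(void)
{
    demo_pgprot_t cached = { DEMO_PAGE_CP | DEMO_PAGE_CV | DEMO_PAGE_OTHER };
    demo_pgprot_t nc = demo_pgprot_noncached(cached);

    assert((nc.bits & (DEMO_PAGE_CP | DEMO_PAGE_CV)) == 0);
    assert((nc.bits & DEMO_PAGE_E) != 0);
    assert((nc.bits & DEMO_PAGE_OTHER) != 0);
}
```

This also shows why the #ifdef workaround is only a build fix: with the call compiled out, the mapping would keep whatever cacheability bits it started with.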
From hozer at hozed.org Wed Apr 20 19:17:13 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Wed, 20 Apr 2005 21:17:13 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050418200711.GI15688@aon.at> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> <1113840973.6274.84.camel@laptopd505.fenrus.org> <4263DF70.2060702@ammasso.com> <1113853240.6274.99.camel@laptopd505.fenrus.org> <20050418200711.GI15688@aon.at> Message-ID: <20050421021713.GP999@kalmia.hozed.org> On Mon, Apr 18, 2005 at 10:07:12PM +0200, Bernhard Fischer wrote: > On Mon, Apr 18, 2005 at 09:40:40PM +0200, Arjan van de Ven wrote: > >On Mon, 2005-04-18 at 11:25 -0500, Timur Tabi wrote: > >> Arjan van de Ven wrote: > >> > >> > this is a myth; linux is free to move the page about in physical memory > >> > even if it's mlock()ed!! > darn, yes, this is true. > I know people who introduced > #define VM_RESERVED 0x00080000 /* Don't unmap it from swap_out > */ Someone (aka Tospin, infinicon, and Amasso) should probably post a patch adding '#define VM_REGISTERD 0x01000000', and some extensions to something like 'madvise' to set pages to be registered. My preference is said patch will also allow a way for the kernel to reclaim registered memory from an application under memory pressure. 
From timur.tabi at ammasso.com Wed Apr 20 20:07:45 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Wed, 20 Apr 2005 22:07:45 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050421021713.GP999@kalmia.hozed.org> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> <1113840973.6274.84.camel@laptopd505.fenrus.org> <4263DF70.2060702@ammasso.com> <1113853240.6274.99.camel@laptopd505.fenrus.org> <20050418200711.GI15688@aon.at> <20050421021713.GP999@kalmia.hozed.org> Message-ID: <42671901.4000805@ammasso.com> Troy Benjegerdes wrote: > Someone (aka Tospin, infinicon, and Amasso) should probably post a patch > adding '#define VM_REGISTERD 0x01000000', and some extensions to > something like 'madvise' to set pages to be registered. > > My preference is said patch will also allow a way for the kernel to > reclaim registered memory from an application under memory pressure. I don't know if VM_REGISTERED is a good idea or not, but it should be absolutely impossible for the kernel to reclaim "registered" (aka pinned) memory, no matter what. For RDMA services (such as Infiniband, iWARP, etc), it's normal for non-root processes to pin hundreds of megabytes of memory, and that memory better be locked to those physical pages until the application deregisters them. If kernel really thinks it needs to unpin those pages, then at the very least it should kill the process, and the syslog better have a very clear message indicating why. That way, the application doesn't continue thinking that everything's still going to work. If those pages become unpinned, the applications are going to experience serious data corruption. 
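[Editor's illustration] For readers unfamiliar with the userspace side of this argument, here is a rough sketch of locking a page-aligned buffer with mlock(). It is a stand-in only: as quoted earlier in this thread, mlock() does not by itself stop the kernel from moving a page in physical memory, so this is not equivalent to registering memory with an adapter. demo_alloc_locked() and demo_free_locked() are hypothetical helper names, and the lock may fail if the caller's RLIMIT_MEMLOCK is too low.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Allocate `pages` page-aligned pages and try to lock them in RAM.
 * Returns the buffer (or NULL on allocation failure) and reports via
 * *locked whether mlock() succeeded.
 */
void *demo_alloc_locked(size_t pages, int *locked)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    size_t len;

    if (locked == NULL || pages == 0 || page <= 0)
        return NULL;

    len = pages * (size_t)page;
    if (posix_memalign(&buf, (size_t)page, len) != 0)
        return NULL;

    /* May fail with EPERM/ENOMEM depending on privileges and limits. */
    *locked = (mlock(buf, len) == 0);
    return buf;
}

/* Undo demo_alloc_locked(): unlock if we locked, then free. */
void demo_free_locked(void *buf, size_t pages, int locked)
{
    long page = sysconf(_SC_PAGESIZE);

    if (buf == NULL)
        return;
    if (locked && page > 0)
        munlock(buf, pages * (size_t)page);
    free(buf);
}
```

A caller would check the locked flag and decide whether to continue unlocked or give up, which is roughly the choice an application cannot make once its registered memory has been taken away behind its back.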
From tduffy at sun.com Thu Apr 21 07:57:20 2005 From: tduffy at sun.com (Tom Duffy) Date: Thu, 21 Apr 2005 07:57:20 -0700 Subject: [openib-general] [Fwd: rpms/udev/devel pam_console.dev, 1.9, 1.10 udev.rules, 1.23, 1.24 udev.spec, 1.82, 1.83] Message-ID: <1114095440.17167.1.camel@duffman> Looks like Fedora Core 4 will support IB out of the box. They have the kernel modules, udev support, glibc-headers. Is there anything else that would be nice to get right before 4 ships? -tduffy -------------- next part -------------- An embedded message was scrubbed... From: fedora-cvs-commits at redhat.com Subject: rpms/udev/devel pam_console.dev, 1.9, 1.10 udev.rules, 1.23, 1.24 udev.spec, 1.82, 1.83 Date: Thu, 21 Apr 2005 09:11:36 -0400 Size: 7538 URL: From robert.j.woodruff at intel.com Thu Apr 21 08:32:00 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Thu, 21 Apr 2005 08:32:00 -0700 Subject: [openib-general] [Fwd: rpms/udev/devel pam_console.dev, 1.9, 1.10 udev.rules, 1.23, 1.24 udev.spec, 1.82, 1.83] In-Reply-To: <1114095440.17167.1.camel@duffman> Message-ID: >Looks like Fedora Core 4 will support IB out of the box. They have the >kernel modules, udev support, glibc-headers. Is there anything else >that would be nice to get right before 4 ships? >-tduffy Cool. Do you know what the time frame is ? i.e., how long till it ships ? It would be nice to get the user-mode verbs support in, but I think that it may need some more testing and I don't think the user-mode kernel module has been submitted upstream yet. Roland, what were your thoughts on when we would be ready to submit the user mode support upstream ?
woody From mshefty at ichips.intel.com Thu Apr 21 09:48:17 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 21 Apr 2005 09:48:17 -0700 Subject: [openib-general] MAD/RMPP test program Message-ID: <4267D951.7030606@ichips.intel.com> For those interested (likely a few developers only), I've checked in a kernel test program that I used to stress the MAD/RMPP code. gen2/utils/src/linux-kernel/infiniband/util/grmpp - Sean From mshefty at ichips.intel.com Thu Apr 21 10:31:06 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 21 Apr 2005 10:31:06 -0700 Subject: [openib-general] [PATCH] [MAD] fix race completing request MAD with timeout/cancel Message-ID: <20050421103106.30eb3df4.mshefty@ichips.intel.com> This patch should fix an issue processing a sent MAD after it has timed out or been canceled. The race occurs when a response MAD matches with the sent request. The request could time out or be canceled after the response MAD matches with the request, but before the request completion can be processed. 
Signed-off-by: Sean Hefty Index: core/mad.c =================================================================== --- core/mad.c (revision 2203) +++ core/mad.c (working copy) @@ -342,6 +342,7 @@ spin_lock_init(&mad_agent_priv->lock); INIT_LIST_HEAD(&mad_agent_priv->send_list); INIT_LIST_HEAD(&mad_agent_priv->wait_list); + INIT_LIST_HEAD(&mad_agent_priv->done_list); INIT_LIST_HEAD(&mad_agent_priv->rmpp_list); INIT_WORK(&mad_agent_priv->timed_work, timeout_sends, mad_agent_priv); INIT_LIST_HEAD(&mad_agent_priv->local_list); @@ -1591,6 +1592,16 @@ return NULL; } +static void ib_mark_req_done(struct ib_mad_send_wr_private *mad_send_wr) +{ + mad_send_wr->timeout = 0; + if (mad_send_wr->refcount == 1) { + list_del(&mad_send_wr->agent_list); + list_add_tail(&mad_send_wr->agent_list, + &mad_send_wr->mad_agent_priv->done_list); + } +} + static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv, struct ib_mad_recv_wc *mad_recv_wc) { @@ -1619,8 +1630,7 @@ wake_up(&mad_agent_priv->wait); return; } - /* Timeout = 0 means that we won't wait for a response */ - mad_send_wr->timeout = 0; + ib_mark_req_done(mad_send_wr); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); /* Defined behavior is to complete response before request */ Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 2202) +++ core/mad_priv.h (working copy) @@ -92,6 +92,7 @@ spinlock_t lock; struct list_head send_list; struct list_head wait_list; + struct list_head done_list; struct work_struct timed_work; unsigned long timeout; struct list_head local_list; From adi at hexapodia.org Thu Apr 21 10:38:21 2005 From: adi at hexapodia.org (Andy Isaacson) Date: Thu, 21 Apr 2005 10:38:21 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <42671901.4000805@ammasso.com> Message-ID: <20050421173821.GA13312@hexapodia.org> On Wed, Apr 20, 2005 at 10:07:45PM -0500, Timur Tabi wrote: > 
Troy Benjegerdes wrote: > >Someone (aka Tospin, infinicon, and Amasso) should probably post a patch > >adding '#define VM_REGISTERD 0x01000000', and some extensions to > >something like 'madvise' to set pages to be registered. > > > >My preference is said patch will also allow a way for the kernel to > >reclaim registered memory from an application under memory pressure. > > I don't know if VM_REGISTERED is a good idea or not, but it should be > absolutely impossible for the kernel to reclaim "registered" (aka pinned) > memory, no matter what. For RDMA services (such as Infiniband, iWARP, etc), > it's normal for non-root processes to pin hundreds of megabytes of memory, > and that memory better be locked to those physical pages until the > application deregisters them. If you take the hardline position that "the app is the only thing that matters", your code is unlikely to get merged. Linux is a general-purpose OS. I don't think that Troy was suggesting the kernel should deregister memory without notifying the application. Personally, I envision something like the NetBSD Scheduler Activations (SA) work, where the kernel can notify the app of changes to its state in a very efficient manner. (According to the NetBSD design whitepaper, the kernel does an upcall whenever the multithreaded app gains or loses a CPU!) In a Linux context, I doubt that fullblown SA is necessary or appropriate. Rather, I'd suggest two new signals, SIGMEMLOW and SIGMEMCRIT. The userland comms library registers handlers for both. When the kernel decides that it needs to reclaim some memory from the app, it sends SIGMEMLOW. The comms library then has the responsibility to un-reserve some memory in an orderly fashion. If a reasonable [1] time has expired since SIGMEMLOW and the kernel is still hungry, the kernel sends SIGMEMCRIT. 
At this point, the comms lib *must* unregister some memory [2] even if it has to drop state to do so; if it returns from the signal handler without having unregistered the memory, the kernel will SIGKILL. [1] Part of the interface spec should cover the expectation as to how long the library is allowed to take; I'd guess that 2 timeslices should suffice. [2] Is there a way for the kernel to pass down to userspace how many pages it wants, maybe in the sigcontext? > If kernel really thinks it needs to unpin those pages, then at the very > least it should kill the process, and the syslog better have a very clear > message indicating why. That way, the application doesn't continue > thinking that everything's still going to work. If those pages become > unpinned, the applications are going to experience serious data corruption. You might want to consider what happens with your communication system in a machine running power-saving modes (in the limit, suspend-to-disk). Of course most machines with Infiniband adapters aren't running swsusp, but it's not inconceivable that blade servers might sleep to lower power and cooling costs. -andy From roland at topspin.com Thu Apr 21 08:37:33 2005 From: roland at topspin.com (Roland Dreier) Date: Thu, 21 Apr 2005 08:37:33 -0700 Subject: [openib-general] [Fwd: rpms/udev/devel pam_console.dev, 1.9,1.10 udev.rules, 1.23, 1.24 udev.spec, 1.82, 1.83] In-Reply-To: (Bob Woodruff's message of "Thu, 21 Apr 2005 08:32:00 -0700") References: Message-ID: <52k6mw16pu.fsf@topspin.com> Bob> Do you know what the time frame is ? i.e., how long till it ships ? http://fedora.redhat.com/participate/schedule/ Bob> Roland, what were your thoughts on when we would be ready to Bob> submit the user mode support upstream ? I hope to be able to send the patches upstream soon after 2.6.12 is released. 2.6.12-rc3 just came out so I would hope that the final release will be in about a month or so. - R. 
From timur.tabi at ammasso.com Thu Apr 21 11:39:35 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Thu, 21 Apr 2005 13:39:35 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050421173821.GA13312@hexapodia.org> References: <20050421173821.GA13312@hexapodia.org> Message-ID: <4267F367.3090508@ammasso.com> Andy Isaacson wrote: > If you take the hardline position that "the app is the only thing that > matters", your code is unlikely to get merged. Linux is a > general-purpose OS. The problem is that our driver and library implement an API that we don't fully control. The API states that the application allocates the memory and tells the library to register it. The app then goes on its merry way until it's done, at which point it tells the library to deregister the memory. Neither the app nor the API has any provision for the app to be notified that the memory is no longer pinned and therefore can't be trusted. That would be considered a critical failure from the app's perspective, so the kernel would be doing it a favor by killing the process. > You might want to consider what happens with your communication system > in a machine running power-saving modes (in the limit, suspend-to-disk). > Of course most machines with Infiniband adapters aren't running swsusp, > but it's not inconceivable that blade servers might sleep to lower power > and cooling costs. Any application that registers memory, will in all likelihood be running at 100% CPU non-stop. The computer is not going to be doing anything else but whatever that app is trying to do. The application could conceiveable register gigabytes of RAM, and if even a single page becomes unpinned, the whole thing is worthless. The application cannot do anything meaningful if it gets a message saying that some of the memory has become unpinned and should not be used. 
So the real question is: how important is it to the kernel developers that Linux support these kinds of enterprise-class applications? -- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com One thing a Southern boy will never say is, "I don't think duct tape will fix it." -- Ed Smylie, NASA engineer for Apollo 13 From adi at hexapodia.org Thu Apr 21 12:56:41 2005 From: adi at hexapodia.org (Andy Isaacson) Date: Thu, 21 Apr 2005 12:56:41 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <4267F367.3090508@ammasso.com> References: <20050421173821.GA13312@hexapodia.org> <4267F367.3090508@ammasso.com> Message-ID: <20050421195641.GB13312@hexapodia.org> On Thu, Apr 21, 2005 at 01:39:35PM -0500, Timur Tabi wrote: > Andy Isaacson wrote: > >If you take the hardline position that "the app is the only thing that > >matters", your code is unlikely to get merged. Linux is a > >general-purpose OS. > > The problem is that our driver and library implement an API that we don't > fully control. The API states that the application allocates the memory and > tells the library to register it. The app then goes on its merry way until > it's done, at which point it tells the library to deregister the memory. > Neither the app nor the API has any provision for the app to be notified > that the memory is no longer pinned and therefore can't be trusted. That > would be considered a critical failure from the app's perspective, so the > kernel would be doing it a favor by killing the process. I'm familiar with MPI 1.0 and 2.0, but I haven't been following the development of modern messaging APIs, so I might not make sense here... Assuming that the app calls into the library on a fairly regular basis, you could implement a fast-path/slow-path scheme where the library normally operates in go-fast mode, but if a "unregister" event has occurred, the library falls back to a less performant mode. 
But now having written that I'm thinking that it's not worth the bother - if you've got a 512P MPP job, it's basically equivalent to job death for one of the nodes to go away in this manner -- even if the process is still running on the node, the fact that you took a giant performance hiccup is unacceptable. Therefore, cluster admins are going to do their darndest to avoid this behavior, so we might as well just kill the job and make it explicit. > >You might want to consider what happens with your communication system > >in a machine running power-saving modes (in the limit, suspend-to-disk). > >Of course most machines with Infiniband adapters aren't running swsusp, > >but it's not inconceivable that blade servers might sleep to lower power > >and cooling costs. > > Any application that registers memory, will in all likelihood be running at > 100% CPU non-stop. The computer is not going to be doing anything else but > whatever that app is trying to do. The application could conceiveable > register gigabytes of RAM, and if even a single page becomes unpinned, the > whole thing is worthless. The application cannot do anything meaningful if > it gets a message saying that some of the memory has become unpinned and > should not be used. > > So the real question is: how important is it to the kernel developers that > Linux support these kinds of enterprise-class applications? While I understand your arguments, this kind of rhetoric is more likely to harden ears than to convince people you're right. I refer you to the "Live Patching Function" thread. *You* need to come up with a solution that looks good to *the community* if you want it merged. In the long run, this process is likely to result in *your* systems working better than if you had just gone off and done your thing. If you have to do something that "tastes bad" to the average l-k hacker, *justify* it by addressing the alternatives and explaining why your solution is the right one. 
I'm leaning towards agreeing that mlock()-alicious code is the right way to solve this problem, and it's not clear to me what the benefit of adding a new VM_REGISTERED flag would be. Do you guys simply raise RLIMIT_MEMLOCK to allow apps to lock their pages? Or are you doing something more nasty? (Oh, I see that Libor has contributed to the other branch of this thread... off to read...) -andy From timur.tabi at ammasso.com Thu Apr 21 13:07:42 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Thu, 21 Apr 2005 15:07:42 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050421195641.GB13312@hexapodia.org> References: <20050421173821.GA13312@hexapodia.org> <4267F367.3090508@ammasso.com> <20050421195641.GB13312@hexapodia.org> Message-ID: <4268080E.3000303@ammasso.com> Andy Isaacson wrote: > I'm familiar with MPI 1.0 and 2.0, but I haven't been following the > development of modern messaging APIs, so I might not make sense here... > > Assuming that the app calls into the library on a fairly regular basis, Not really. The whole point is to have the adapter DMA the data directly from memory to the network. That's why it's called RDMA - remote DMA. > Therefore, cluster admins are going to do their > darndest to avoid this behavior, so we might as well just kill the job > and make it explicit. Yes, and if it turns out that the same MPI application dies on Linux but not on Solaris because Linux doesn't really care if the memory stays pinned, then we're going to see a lot of MPI customers transitioning away from Linux. > *You* need to come up with a solution that looks good to *the community* > if you want it merged. True, but I'm not going to waste my time adding this support if the consensus I get from the kernel developers that they don't want Linux to behave this way. > Do you guys simply raise RLIMIT_MEMLOCK to allow apps to lock their > pages? Or are you doing something more nasty? A little more nasty. 
I raise RLIMIT_MEMLOCK in the driver to "unlimited" and also set
cap_raise(IPC_LOCK).  I do this because I need to support all 2.4 and
2.6 kernel versions with the same driver, but only 2.6.10 and later have
any support for non-root mlock().

If and when our driver is submitted to the official kernel, that
nastiness will be removed of course.

--
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
     -- Ed Smylie, NASA engineer for Apollo 13

From chrisw at osdl.org Thu Apr 21 13:12:27 2005
From: chrisw at osdl.org (Chris Wright)
Date: Thu, 21 Apr 2005 13:12:27 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <4268080E.3000303@ammasso.com>
References: <20050421173821.GA13312@hexapodia.org> <4267F367.3090508@ammasso.com> <20050421195641.GB13312@hexapodia.org> <4268080E.3000303@ammasso.com>
Message-ID: <20050421201227.GI23013@shell0.pdx.osdl.net>

* Timur Tabi (timur.tabi at ammasso.com) wrote:
> Andy Isaacson wrote:
> >Do you guys simply raise RLIMIT_MEMLOCK to allow apps to lock their
> >pages?  Or are you doing something more nasty?
>
> A little more nasty.  I raise RLIMIT_MEMLOCK in the driver to "unlimited"
> and also set cap_raise(IPC_LOCK).  I do this because I need to support all
> 2.4 and 2.6 kernel versions with the same driver, but only 2.6.10 and later
> have any support for non-root mlock().

FYI, that will not work on all 2.6 kernels.  Specifically anything that's
not using capabilities.
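
For reference, the unprivileged route mentioned above (a raised
RLIMIT_MEMLOCK plus plain mlock()) looks roughly like this from
userspace.  This is only a sketch: the helper names are made up for
illustration and are not taken from any driver discussed in this thread,
and whether the lock succeeds depends on the hard limit the admin has
configured.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft RLIMIT_MEMLOCK up to the hard
 * limit.  Raising soft to hard needs no privilege.
 * Returns 0 on success, -1 with errno set on failure. */
static int raise_memlock_soft_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	rl.rlim_cur = rl.rlim_max;
	return setrlimit(RLIMIT_MEMLOCK, &rl);
}

/* Hypothetical helper: ask the kernel to keep buf[0..len) resident.
 * Returns 0 on success or a negative errno value on failure. */
static int pin_buffer(void *buf, size_t len)
{
	assert(buf != NULL);
	if (mlock(buf, len) != 0)
		return -errno;
	return 0;
}
```

A caller would run raise_memlock_soft_limit() once at startup and then
pin_buffer() on each region before handing it to the adapter, treating a
negative return as "limit too low" rather than escalating privileges.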

thanks,
-chris
--
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

From timur.tabi at ammasso.com Thu Apr 21 13:14:59 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Thu, 21 Apr 2005 15:14:59 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <20050421201227.GI23013@shell0.pdx.osdl.net>
References: <20050421173821.GA13312@hexapodia.org> <4267F367.3090508@ammasso.com> <20050421195641.GB13312@hexapodia.org> <4268080E.3000303@ammasso.com> <20050421201227.GI23013@shell0.pdx.osdl.net>
Message-ID: <426809C3.7010101@ammasso.com>

Chris Wright wrote:

> FYI, that will not work on all 2.6 kernels.  Specifically anything that's
> not using capabilities.

It works with every kernel I've tried.  I'm sure there are plenty of kernel
configuration options that will break our driver.  But as long as all the
distros our customers use work, as well as reasonably-configured custom
kernels, we're happy.

--
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
     -- Ed Smylie, NASA engineer for Apollo 13

From chrisw at osdl.org Thu Apr 21 13:25:03 2005
From: chrisw at osdl.org (Chris Wright)
Date: Thu, 21 Apr 2005 13:25:03 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <426809C3.7010101@ammasso.com>
References: <20050421173821.GA13312@hexapodia.org> <4267F367.3090508@ammasso.com> <20050421195641.GB13312@hexapodia.org> <4268080E.3000303@ammasso.com> <20050421201227.GI23013@shell0.pdx.osdl.net> <426809C3.7010101@ammasso.com>
Message-ID: <20050421202503.GO493@shell0.pdx.osdl.net>

* Timur Tabi (timur.tabi at ammasso.com) wrote:
> It works with every kernel I've tried.  I'm sure there are plenty of kernel
> configuration options that will break our driver.
> But as long as all the
> distros our customers use work, as well as reasonably-configured custom
> kernels, we're happy.
>

Hey, if you're happy (and, as you said, you don't intend to merge that
bit), I'm happy ;-)

thanks,
-chris
--
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

From arjan at infradead.org Thu Apr 21 13:30:57 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Thu, 21 Apr 2005 22:30:57 +0200
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <20050421202503.GO493@shell0.pdx.osdl.net>
References: <20050421173821.GA13312@hexapodia.org> <4267F367.3090508@ammasso.com> <20050421195641.GB13312@hexapodia.org> <4268080E.3000303@ammasso.com> <20050421201227.GI23013@shell0.pdx.osdl.net> <426809C3.7010101@ammasso.com> <20050421202503.GO493@shell0.pdx.osdl.net>
Message-ID: <1114115458.6277.84.camel@laptopd505.fenrus.org>

On Thu, 2005-04-21 at 13:25 -0700, Chris Wright wrote:
> * Timur Tabi (timur.tabi at ammasso.com) wrote:
> > It works with every kernel I've tried.  I'm sure there are plenty of kernel
> > configuration options that will break our driver.  But as long as all the
> > distros our customers use work, as well as reasonably-configured custom
> > kernels, we're happy.
> >
>
> Hey, if you're happy (and, as you said, you don't intend to merge that
> bit), I'm happy ;-)

yeah... drivers giving unprivileged processes more privs belong on
bugtraq though, not in the core kernel :)

From tduffy at sun.com Thu Apr 21 15:31:06 2005
From: tduffy at sun.com (Tom Duffy)
Date: Thu, 21 Apr 2005 15:31:06 -0700
Subject: [openib-general] [PATCH][SDP] Allow SDP to compile on 2.6.12-rc3
Message-ID: <1114122666.6858.5.camel@duffman>

The sock structure was changed in 2.6.12-rc? and SDP no longer compiles
against it.  This patch allows SDP to build with either 2.6.11 or
2.6.12-rc3 as we must preserve building on current stable tree.
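
The version guard the patch relies on can be checked outside the kernel.
The sketch below reproduces the KERNEL_VERSION() encoding from
<linux/version.h>; uses_old_sock_fields() and check_encoding() are
hypothetical helpers added only for illustration.  It assumes, as the
kernel Makefile implies, that a 2.6.12-rc3 tree already reports
LINUX_VERSION_CODE as 2.6.12, so the "<= 2.6.11" test selects the old
code path only on released 2.6.11 and earlier.

```c
#include <assert.h>

/* Same encoding as <linux/version.h>: major in bits 16 and up, then
 * one byte each for the minor and patch numbers. */
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical helper: nonzero when the pre-2.6.12 struct sock layout
 * (sk_debug, sk_localroute, sk_rcvtstamp as plain fields) applies. */
static int uses_old_sock_fields(unsigned int version_code)
{
	return version_code <= KERNEL_VERSION(2, 6, 11);
}

/* Sanity check: encoded versions compare in release order. */
static void check_encoding(void)
{
	assert(KERNEL_VERSION(2, 6, 12) > KERNEL_VERSION(2, 6, 11));
	assert(KERNEL_VERSION(2, 6, 0) > KERNEL_VERSION(2, 4, 255));
}
```

Because the comparison is on a single integer, the same #if works for
any later kernel without further changes, which is why the XXX comments
in the patch only ask for the old branch to be dropped eventually.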

Signed-off-by: Tom Duffy

Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(revision 2207)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(working copy)
@@ -356,13 +356,23 @@ static int sdp_cm_listen_lookup(struct s
 	 */
 	sk->sk_lingertime = listen_sk->sk_lingertime;
 	sk->sk_rcvlowat = listen_sk->sk_rcvlowat;
+/* XXX Remove once 2.6.12 is released */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	sk->sk_debug = listen_sk->sk_debug;
 	sk->sk_localroute = listen_sk->sk_localroute;
+	sk->sk_rcvtstamp = listen_sk->sk_rcvtstamp;
+#else
+	if (sock_flag(sk, SOCK_DBG))
+		sock_set_flag(listen_sk, SOCK_DBG);
+	if (sock_flag(sk, SOCK_LOCALROUTE))
+		sock_set_flag(listen_sk, SOCK_LOCALROUTE);
+	if (sock_flag(sk, SOCK_RCVTSTAMP))
+		sock_set_flag(listen_sk, SOCK_RCVTSTAMP);
+#endif
 	sk->sk_sndbuf = listen_sk->sk_sndbuf;
 	sk->sk_rcvbuf = listen_sk->sk_rcvbuf;
 	sk->sk_no_check = listen_sk->sk_no_check;
 	sk->sk_priority = listen_sk->sk_priority;
-	sk->sk_rcvtstamp = listen_sk->sk_rcvtstamp;
 	sk->sk_rcvtimeo = listen_sk->sk_rcvtimeo;
 	sk->sk_sndtimeo = listen_sk->sk_sndtimeo;
 	sk->sk_reuse = listen_sk->sk_reuse;

From roland at topspin.com Thu Apr 21 15:36:19 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 21 Apr 2005 15:36:19 -0700
Subject: [openib-general] [PATCH][SDP] Allow SDP to compile on 2.6.12-rc3
In-Reply-To: <1114122666.6858.5.camel@duffman> (Tom Duffy's message of "Thu, 21 Apr 2005 15:31:06 -0700")
References: <1114122666.6858.5.camel@duffman>
Message-ID: <52y8bbzrj0.fsf@topspin.com>

    Tom> The sock structure was changed in 2.6.12-rc? and SDP no
    Tom> longer compiles against it.  This patch allows SDP to build
    Tom> with either 2.6.11 or 2.6.12-rc3 as we must preserve building
    Tom> on current stable tree.

Is this really the only change required?
It seems that the socket
allocation function changed too -- 2.6.11 has

	extern struct sock * sk_alloc(int family, int priority, int zero_it,
				      kmem_cache_t *slab);

while my up-to-date Linus tree has

	extern struct sock *sk_alloc(int family, int priority,
				     struct proto *prot, int zero_it);

so I think sdp_conn.c at least needs some fixing up.

(I don't have time to track down what goes in struct proto right now, so
I'm not posting a real patch)

 - R.

From tduffy at sun.com Thu Apr 21 15:52:43 2005
From: tduffy at sun.com (Tom Duffy)
Date: Thu, 21 Apr 2005 15:52:43 -0700
Subject: [openib-general] [PATCH][SDP] Allow SDP to compile on 2.6.12-rc3
In-Reply-To: <52y8bbzrj0.fsf@topspin.com>
References: <1114122666.6858.5.camel@duffman> <52y8bbzrj0.fsf@topspin.com>
Message-ID: <1114123963.6858.11.camel@duffman>

On Thu, 2005-04-21 at 15:36 -0700, Roland Dreier wrote:
> Is this really the only change required?  It seems that the socket
> allocation function changed too -- 2.6.11 has
>
>     extern struct sock * sk_alloc(int family, int priority, int zero_it,
>                                   kmem_cache_t *slab);
>
> while my up-to-date Linus tree has
>
>     extern struct sock *sk_alloc(int family, int priority,
>                                  struct proto *prot, int zero_it);
>
> so I think sdp_conn.c at least needs some fixing up.

Oh, you are right, I missed the compile warning, but I see it now.

Why does the SDP code pass in sizeof(struct inet_sock) for the zero_it
bool?

-tduffy

From libor at topspin.com Thu Apr 21 16:17:30 2005
From: libor at topspin.com (Libor Michalek)
Date: Thu, 21 Apr 2005 16:17:30 -0700
Subject: [openib-general] [PATCH][SDP] Allow SDP to compile on 2.6.12-rc3
In-Reply-To: <1114123963.6858.11.camel@duffman>; from tduffy@sun.com on Thu, Apr 21, 2005 at 03:52:43PM -0700
References: <1114122666.6858.5.camel@duffman> <52y8bbzrj0.fsf@topspin.com> <1114123963.6858.11.camel@duffman>
Message-ID: <20050421161730.A27238@topspin.com>

On Thu, Apr 21, 2005 at 03:52:43PM -0700, Tom Duffy wrote:
> On Thu, 2005-04-21 at 15:36 -0700, Roland Dreier wrote:
> > Is this really the only change required?  It seems that the socket
> > allocation function changed too -- 2.6.11 has
> >
> >     extern struct sock * sk_alloc(int family, int priority, int zero_it,
> >                                   kmem_cache_t *slab);
> >
> > while my up-to-date Linus tree has
> >
> >     extern struct sock *sk_alloc(int family, int priority,
> >                                  struct proto *prot, int zero_it);
> >
> > so I think sdp_conn.c at least needs some fixing up.
>
> Oh, you are right, I missed the compile warning, but I see it now.
>
> Why does the SDP code pass in sizeof(struct inet_sock) for the zero_it
> bool?

  Is this a trick question? :) Because it's not a bool but an integer,
which used to be a bool in the 2.4 kernel days. Here's the relevant
code snip from net/core/sock.c:

	if (zero_it) {
		memset(sk, 0,
		       zero_it == 1 ?
sizeof(struct sock) : zero_it); sk->sk_family = family; sock_lock_init(sk); } -Libor From tduffy at sun.com Thu Apr 21 16:27:04 2005 From: tduffy at sun.com (Tom Duffy) Date: Thu, 21 Apr 2005 16:27:04 -0700 Subject: [openib-general] [PATCHv2][SDP] Allow SDP to compile on 2.6.12-rc3 In-Reply-To: <52y8bbzrj0.fsf@topspin.com> References: <1114122666.6858.5.camel@duffman> <52y8bbzrj0.fsf@topspin.com> Message-ID: <1114126024.6858.21.camel@duffman> On Thu, 2005-04-21 at 15:36 -0700, Roland Dreier wrote: > (I don't have time track down what goes in struct proto right now, so > I'm not posting a real patch) Ok, this patch now builds without warning on 2.6.11 and 2.6.12-rc3. Libor, what do you think? Signed-off-by: Tom Duffy Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c (revision 2207) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c (working copy) @@ -1112,6 +1112,15 @@ error_attr: return result; } +/* XXX remove if/else (leave struct) once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE > KERNEL_VERSION(2,6,11) ) +static struct proto sdp_proto = { + .name = "sdp_sock", + .owner = THIS_MODULE, + .obj_size = sizeof(struct inet_sock), +}; +#endif + /* * sdp_conn_alloc - allocate a new socket, and init. */ @@ -1122,7 +1131,12 @@ struct sdp_opt *sdp_conn_alloc(int prior int result; sk = sk_alloc(dev_root_s.proto, priority, +/* XXX Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) sizeof(struct inet_sock), dev_root_s.sock_cache); +#else + &sdp_proto, sizeof(struct inet_sock)); +#endif if (!sk) { sdp_dbg_warn(NULL, "socket alloc error for protocol. 
<%d:%d>", dev_root_s.proto, priority); @@ -1966,6 +1980,8 @@ int sdp_conn_table_init(int proto_family goto error_conn; } +/* XXX Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) dev_root_s.sock_cache = kmem_cache_create("sdp_sock", sizeof(struct inet_sock), 0, SLAB_HWCACHE_ALIGN, @@ -1975,6 +1991,13 @@ int sdp_conn_table_init(int proto_family result = -ENOMEM; goto error_sock; } +#else + if (proto_register(&sdp_proto, 1) != 0) { + sdp_warn("Failed to register sdp proto."); + result = -ENOMEM; + goto error_sock; + } +#endif /* * start listening @@ -2002,7 +2025,12 @@ int sdp_conn_table_init(int proto_family error_listen: (void)ib_destroy_cm_id(dev_root_s.listen_id); error_cm_id: +/* XXX Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) kmem_cache_destroy(dev_root_s.sock_cache); +#else + proto_unregister(&sdp_proto); +#endif error_sock: kmem_cache_destroy(dev_root_s.conn_cache); error_conn: @@ -2049,10 +2077,15 @@ int sdp_conn_table_clear(void) * delete conn cache */ kmem_cache_destroy(dev_root_s.conn_cache); +/* Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) /* * delete sock cache */ kmem_cache_destroy(dev_root_s.sock_cache); +#else + proto_unregister(&sdp_proto); +#endif /* * stop listening */ Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c (revision 2207) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c (working copy) @@ -356,13 +356,23 @@ static int sdp_cm_listen_lookup(struct s */ sk->sk_lingertime = listen_sk->sk_lingertime; sk->sk_rcvlowat = listen_sk->sk_rcvlowat; +/* XXX Remove once 2.6.12 is released */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) sk->sk_debug = listen_sk->sk_debug; sk->sk_localroute = listen_sk->sk_localroute; + sk->sk_rcvtstamp = listen_sk->sk_rcvtstamp; 
+#else + if (sock_flag(sk, SOCK_DBG)) + sock_set_flag(listen_sk, SOCK_DBG); + if (sock_flag(sk, SOCK_LOCALROUTE)) + sock_set_flag(listen_sk, SOCK_LOCALROUTE); + if (sock_flag(sk, SOCK_RCVTSTAMP)) + sock_set_flag(listen_sk, SOCK_RCVTSTAMP); +#endif sk->sk_sndbuf = listen_sk->sk_sndbuf; sk->sk_rcvbuf = listen_sk->sk_rcvbuf; sk->sk_no_check = listen_sk->sk_no_check; sk->sk_priority = listen_sk->sk_priority; - sk->sk_rcvtstamp = listen_sk->sk_rcvtstamp; sk->sk_rcvtimeo = listen_sk->sk_rcvtimeo; sk->sk_sndtimeo = listen_sk->sk_sndtimeo; sk->sk_reuse = listen_sk->sk_reuse; Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h (revision 2207) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h (working copy) @@ -201,7 +201,10 @@ struct sdev_root { * cache's */ kmem_cache_t *conn_cache; +/* XXX Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) kmem_cache_t *sock_cache; +#endif }; #endif /* _SDP_DEV_H */ From ftillier at infiniconsys.com Thu Apr 21 16:31:59 2005 From: ftillier at infiniconsys.com (Fab Tillier) Date: Thu, 21 Apr 2005 16:31:59 -0700 Subject: [openib-general] [PATCHv2][SDP] Allow SDP to compile on 2.6.12-rc3 In-Reply-To: <1114126024.6858.21.camel@duffman> Message-ID: <000001c546ca$51807b80$8d5aa8c0@infiniconsys.com> > From: Tom Duffy [mailto:tduffy at sun.com] > Sent: Thursday, April 21, 2005 4:27 PM > > Ok, this patch now builds without warning on 2.6.11 and 2.6.12-rc3. > > Libor, what do you think? 
> > Signed-off-by: Tom Duffy > > Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c > =================================================================== > --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c > (revision 2207) > +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c (working > copy) > @@ -356,13 +356,23 @@ static int sdp_cm_listen_lookup(struct s > */ > sk->sk_lingertime = listen_sk->sk_lingertime; > sk->sk_rcvlowat = listen_sk->sk_rcvlowat; > +/* XXX Remove once 2.6.12 is released */ > +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) > sk->sk_debug = listen_sk->sk_debug; > sk->sk_localroute = listen_sk->sk_localroute; > + sk->sk_rcvtstamp = listen_sk->sk_rcvtstamp; > +#else > + if (sock_flag(sk, SOCK_DBG)) > + sock_set_flag(listen_sk, SOCK_DBG); > + if (sock_flag(sk, SOCK_LOCALROUTE)) > + sock_set_flag(listen_sk, SOCK_LOCALROUTE); > + if (sock_flag(sk, SOCK_RCVTSTAMP)) > + sock_set_flag(listen_sk, SOCK_RCVTSTAMP); > +#endif Isn't the above change backwards? The original code was copying settings from listen_sk to sk, and the new code seems to be checking flags in sk to determine whether to set them in listen_sk. - Fab From tduffy at sun.com Thu Apr 21 16:33:14 2005 From: tduffy at sun.com (Tom Duffy) Date: Thu, 21 Apr 2005 16:33:14 -0700 Subject: [openib-general] [PATCHv3][SDP] Allow SDP to compile on 2.6.12-rc3 In-Reply-To: <20050421161730.A27238@topspin.com> References: <1114122666.6858.5.camel@duffman> <52y8bbzrj0.fsf@topspin.com> <1114123963.6858.11.camel@duffman> <20050421161730.A27238@topspin.com> Message-ID: <1114126394.6858.26.camel@duffman> On Thu, 2005-04-21 at 16:17 -0700, Libor Michalek wrote: > Is this a trick question? :) Because it's not a bool but an integer, > which use to be a bool in the 2.4 kernel days. Here's the relevant > code snip from net/core/sock.c: > > if (zero_it) { > memset(sk, 0, > zero_it == 1 ? 
sizeof(struct sock) : zero_it); > sk->sk_family = family; > sock_lock_init(sk); > } Sorry, I was looking at the new code where it is (used as) a bool again. In fact, I fucked up and on my v2 patch, the *new* call to sk_alloc should just be 1. Here is v3. Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c (revision 2207) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c (working copy) @@ -1112,6 +1112,15 @@ error_attr: return result; } +/* XXX remove if/else (leave struct) once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE > KERNEL_VERSION(2,6,11) ) +static struct proto sdp_proto = { + .name = "sdp_sock", + .owner = THIS_MODULE, + .obj_size = sizeof(struct inet_sock), +}; +#endif + /* * sdp_conn_alloc - allocate a new socket, and init. */ @@ -1122,7 +1131,12 @@ struct sdp_opt *sdp_conn_alloc(int prior int result; sk = sk_alloc(dev_root_s.proto, priority, +/* XXX Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) sizeof(struct inet_sock), dev_root_s.sock_cache); +#else + &sdp_proto, 1); +#endif if (!sk) { sdp_dbg_warn(NULL, "socket alloc error for protocol. 
<%d:%d>", dev_root_s.proto, priority); @@ -1966,6 +1980,8 @@ int sdp_conn_table_init(int proto_family goto error_conn; } +/* XXX Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) dev_root_s.sock_cache = kmem_cache_create("sdp_sock", sizeof(struct inet_sock), 0, SLAB_HWCACHE_ALIGN, @@ -1975,6 +1991,13 @@ int sdp_conn_table_init(int proto_family result = -ENOMEM; goto error_sock; } +#else + if (proto_register(&sdp_proto, 1) != 0) { + sdp_warn("Failed to register sdp proto."); + result = -ENOMEM; + goto error_sock; + } +#endif /* * start listening @@ -2002,7 +2025,12 @@ int sdp_conn_table_init(int proto_family error_listen: (void)ib_destroy_cm_id(dev_root_s.listen_id); error_cm_id: +/* XXX Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) kmem_cache_destroy(dev_root_s.sock_cache); +#else + proto_unregister(&sdp_proto); +#endif error_sock: kmem_cache_destroy(dev_root_s.conn_cache); error_conn: @@ -2049,10 +2077,15 @@ int sdp_conn_table_clear(void) * delete conn cache */ kmem_cache_destroy(dev_root_s.conn_cache); +/* Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) /* * delete sock cache */ kmem_cache_destroy(dev_root_s.sock_cache); +#else + proto_unregister(&sdp_proto); +#endif /* * stop listening */ Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c (revision 2207) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c (working copy) @@ -356,13 +356,23 @@ static int sdp_cm_listen_lookup(struct s */ sk->sk_lingertime = listen_sk->sk_lingertime; sk->sk_rcvlowat = listen_sk->sk_rcvlowat; +/* XXX Remove once 2.6.12 is released */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) sk->sk_debug = listen_sk->sk_debug; sk->sk_localroute = listen_sk->sk_localroute; + sk->sk_rcvtstamp = listen_sk->sk_rcvtstamp; 
+#else + if (sock_flag(sk, SOCK_DBG)) + sock_set_flag(listen_sk, SOCK_DBG); + if (sock_flag(sk, SOCK_LOCALROUTE)) + sock_set_flag(listen_sk, SOCK_LOCALROUTE); + if (sock_flag(sk, SOCK_RCVTSTAMP)) + sock_set_flag(listen_sk, SOCK_RCVTSTAMP); +#endif sk->sk_sndbuf = listen_sk->sk_sndbuf; sk->sk_rcvbuf = listen_sk->sk_rcvbuf; sk->sk_no_check = listen_sk->sk_no_check; sk->sk_priority = listen_sk->sk_priority; - sk->sk_rcvtstamp = listen_sk->sk_rcvtstamp; sk->sk_rcvtimeo = listen_sk->sk_rcvtimeo; sk->sk_sndtimeo = listen_sk->sk_sndtimeo; sk->sk_reuse = listen_sk->sk_reuse; Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h (revision 2207) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h (working copy) @@ -201,7 +201,10 @@ struct sdev_root { * cache's */ kmem_cache_t *conn_cache; +/* XXX Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) kmem_cache_t *sock_cache; +#endif }; #endif /* _SDP_DEV_H */ From tduffy at sun.com Thu Apr 21 16:37:54 2005 From: tduffy at sun.com (Tom Duffy) Date: Thu, 21 Apr 2005 16:37:54 -0700 Subject: [openib-general] [PATCHv4][SDP] Allow SDP to compile on 2.6.12-rc3 In-Reply-To: <000001c546ca$51807b80$8d5aa8c0@infiniconsys.com> References: <000001c546ca$51807b80$8d5aa8c0@infiniconsys.com> Message-ID: <1114126674.6858.31.camel@duffman> On Thu, 2005-04-21 at 16:31 -0700, Fab Tillier wrote: > Isn't the above change backwards? The original code was copying settings > from listen_sk to sk, and the new code seems to be checking flags in sk to > determine whether to set them in listen_sk. You are so right. My brain ain't on today or something. 
Signed-off-by: Tom Duffy Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c (revision 2207) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c (working copy) @@ -1112,6 +1112,15 @@ error_attr: return result; } +/* XXX remove if/else (leave struct) once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE > KERNEL_VERSION(2,6,11) ) +static struct proto sdp_proto = { + .name = "sdp_sock", + .owner = THIS_MODULE, + .obj_size = sizeof(struct inet_sock), +}; +#endif + /* * sdp_conn_alloc - allocate a new socket, and init. */ @@ -1122,7 +1131,12 @@ struct sdp_opt *sdp_conn_alloc(int prior int result; sk = sk_alloc(dev_root_s.proto, priority, +/* XXX Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) sizeof(struct inet_sock), dev_root_s.sock_cache); +#else + &sdp_proto, 1); +#endif if (!sk) { sdp_dbg_warn(NULL, "socket alloc error for protocol. 
<%d:%d>", dev_root_s.proto, priority); @@ -1966,6 +1980,8 @@ int sdp_conn_table_init(int proto_family goto error_conn; } +/* XXX Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) dev_root_s.sock_cache = kmem_cache_create("sdp_sock", sizeof(struct inet_sock), 0, SLAB_HWCACHE_ALIGN, @@ -1975,6 +1991,13 @@ int sdp_conn_table_init(int proto_family result = -ENOMEM; goto error_sock; } +#else + if (proto_register(&sdp_proto, 1) != 0) { + sdp_warn("Failed to register sdp proto."); + result = -ENOMEM; + goto error_sock; + } +#endif /* * start listening @@ -2002,7 +2025,12 @@ int sdp_conn_table_init(int proto_family error_listen: (void)ib_destroy_cm_id(dev_root_s.listen_id); error_cm_id: +/* XXX Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) kmem_cache_destroy(dev_root_s.sock_cache); +#else + proto_unregister(&sdp_proto); +#endif error_sock: kmem_cache_destroy(dev_root_s.conn_cache); error_conn: @@ -2049,10 +2077,15 @@ int sdp_conn_table_clear(void) * delete conn cache */ kmem_cache_destroy(dev_root_s.conn_cache); +/* Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) /* * delete sock cache */ kmem_cache_destroy(dev_root_s.sock_cache); +#else + proto_unregister(&sdp_proto); +#endif /* * stop listening */ Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c (revision 2207) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c (working copy) @@ -356,13 +356,23 @@ static int sdp_cm_listen_lookup(struct s */ sk->sk_lingertime = listen_sk->sk_lingertime; sk->sk_rcvlowat = listen_sk->sk_rcvlowat; +/* XXX Remove once 2.6.12 is released */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) sk->sk_debug = listen_sk->sk_debug; sk->sk_localroute = listen_sk->sk_localroute; + sk->sk_rcvtstamp = listen_sk->sk_rcvtstamp; 
+#else + if (sock_flag(listen_sk, SOCK_DBG)) + sock_set_flag(sk, SOCK_DBG); + if (sock_flag(listen_sk, SOCK_LOCALROUTE)) + sock_set_flag(sk, SOCK_LOCALROUTE); + if (sock_flag(listen_sk, SOCK_RCVTSTAMP)) + sock_set_flag(sk, SOCK_RCVTSTAMP); +#endif sk->sk_sndbuf = listen_sk->sk_sndbuf; sk->sk_rcvbuf = listen_sk->sk_rcvbuf; sk->sk_no_check = listen_sk->sk_no_check; sk->sk_priority = listen_sk->sk_priority; - sk->sk_rcvtstamp = listen_sk->sk_rcvtstamp; sk->sk_rcvtimeo = listen_sk->sk_rcvtimeo; sk->sk_sndtimeo = listen_sk->sk_sndtimeo; sk->sk_reuse = listen_sk->sk_reuse; Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h (revision 2207) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h (working copy) @@ -201,7 +201,10 @@ struct sdev_root { * cache's */ kmem_cache_t *conn_cache; +/* XXX Remove once 2.6.12 is out */ +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) ) kmem_cache_t *sock_cache; +#endif }; #endif /* _SDP_DEV_H */ From mshefty at ichips.intel.com Thu Apr 21 18:34:39 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 21 Apr 2005 18:34:39 -0700 Subject: [openib-general] [PATCH] [MAD RMPP] add RMPP send support to MAD layer Message-ID: <20050421183439.262c8233.mshefty@ichips.intel.com> The following patch adds RMPP send support to the kernel MAD layer. - NACKs are not implemented - Spec compliant double-sided transfers are not implemented. Request/ reply matching works, but missing is the ACK to the ACK that occurs during the RMPP direction switch. - Clients are limited to a single sge. - Timeout values are hard-coded until such time that packet lifetimes magically appear. 
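
The segment arithmetic used by the send path in the patch below can be
sketched in userspace as follows.  This is an illustration, not the
patch's code: the function names are made up, and it assumes 256-byte
MADs with the data area starting 36 bytes in for plain RMPP MADs and 56
bytes in for SA MADs.  As in the formatting assumptions discussed
earlier on the list, the caller's data buffer must be a whole number of
segments.

```c
#include <assert.h>

enum {
	MAD_SIZE         = 256, /* every MAD on the wire is 256 bytes */
	RMPP_DATA_OFFSET = 36,  /* 24-byte MAD header + 12-byte RMPP header */
	SA_DATA_OFFSET   = 56,  /* the above plus a 20-byte SA header */
};

/* Hypothetical helper: number of segments for a send buffer of
 * total_len bytes laid out as one header of data_offset bytes followed
 * by whole data segments. */
static int rmpp_total_segs(int total_len, int data_offset)
{
	int seg_data = MAD_SIZE - data_offset;

	assert(total_len > data_offset);
	assert((total_len - data_offset) % seg_data == 0);
	return (total_len - data_offset) / seg_data;
}

/* Hypothetical helper: per-segment timeout in milliseconds.  Uses the
 * caller's value if it is set and under 5 s, otherwise the hard-coded
 * 5 s default. */
static int rmpp_seg_timeout_ms(int requested_ms)
{
	if (requested_ms && requested_ms < 5000)
		return requested_ms;
	return 5000;
}
```

For example, a plain RMPP send carrying three full data segments is
36 + 3 * 220 = 696 bytes long, and an SA send carrying two is
56 + 2 * 200 = 456 bytes long.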
Signed-off-by: Sean Hefty Index: include/ib_verbs.h =================================================================== --- include/ib_verbs.h (revision 2207) +++ include/ib_verbs.h (working copy) @@ -573,6 +573,7 @@ u32 remote_qpn; u32 remote_qkey; int timeout_ms; /* valid for MADs only */ + int retries; /* valid for MADs only */ u16 pkey_index; /* valid for GSI only */ u8 port_num; /* valid for DR SMPs on switch only */ } ud; Index: core/mad_rmpp.c =================================================================== --- core/mad_rmpp.c (revision 2207) +++ core/mad_rmpp.c (working copy) @@ -76,20 +76,6 @@ struct ib_sge sge; }; -static struct ib_ah * create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc, - u8 port_num) -{ - struct ib_ah_attr ah_attr; - - memset(&ah_attr, 0, sizeof ah_attr); - ah_attr.dlid = wc->slid; - ah_attr.sl = wc->sl; - ah_attr.src_path_bits = wc->dlid_path_bits; - ah_attr.port_num = port_num; - - return ib_create_ah(pd, &ah_attr); -} - static void destroy_rmpp_recv(struct mad_rmpp_recv *rmpp_recv) { atomic_dec(&rmpp_recv->refcount); @@ -164,9 +150,10 @@ if (!rmpp_recv) return NULL; - rmpp_recv->ah = create_ah_from_wc(agent->agent.qp->pd, - mad_recv_wc->wc, - agent->agent.port_num); + rmpp_recv->ah = ib_create_ah_from_wc(agent->agent.qp->pd, + mad_recv_wc->wc, + mad_recv_wc->recv_buf.grh, + agent->agent.port_num); if (IS_ERR(rmpp_recv->ah)) goto error; @@ -291,18 +278,28 @@ kfree(msg); } +static int data_offset(u8 mgmt_class) +{ + if (mgmt_class == IB_MGMT_CLASS_SUBN_ADM) + return offsetof(struct ib_sa_mad, data); + else if ((mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START) && + (mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return offsetof(struct ib_vendor_mad, data); + else + return offsetof(struct ib_rmpp_mad, data); +} + static void format_ack(struct ib_rmpp_mad *ack, struct ib_rmpp_mad *data, struct mad_rmpp_recv *rmpp_recv) { unsigned long flags; - ack->mad_hdr = data->mad_hdr; + memcpy(&ack->mad_hdr, &data->mad_hdr, + 
data_offset(data->mad_hdr.mgmt_class)); + ack->mad_hdr.method ^= IB_MGMT_METHOD_RESP; - ack->rmpp_hdr.rmpp_version = data->rmpp_hdr.rmpp_version; ack->rmpp_hdr.rmpp_type = IB_MGMT_RMPP_TYPE_ACK; - ib_set_rmpp_resptime(&ack->rmpp_hdr, - ib_get_rmpp_resptime(&data->rmpp_hdr)); ib_set_rmpp_flags(&ack->rmpp_hdr, IB_MGMT_RMPP_FLAG_ACTIVE); spin_lock_irqsave(&rmpp_recv->lock, flags); @@ -392,12 +389,18 @@ static inline int get_mad_len(struct mad_rmpp_recv *rmpp_recv) { - int hdr_size; + struct ib_rmpp_mad *rmpp_mad; + int hdr_size, data_size, pad; - /* TODO: need to check for SA MADs - requires access to SA header */ - hdr_size = sizeof(struct ib_mad_hdr) + sizeof(struct ib_rmpp_hdr); - return rmpp_recv->seg_num * (sizeof(struct ib_mad) - hdr_size) + - hdr_size; + rmpp_mad = (struct ib_rmpp_mad *)rmpp_recv->cur_seg_buf->mad; + + hdr_size = data_offset(rmpp_mad->mad_hdr.mgmt_class); + data_size = sizeof(struct ib_rmpp_mad) - hdr_size; + pad = be32_to_cpu(rmpp_mad->rmpp_hdr.paylen_newwin); + if (pad > data_size) + pad = 0; + + return hdr_size + rmpp_recv->seg_num * data_size - pad; } static struct ib_mad_recv_wc * complete_rmpp(struct mad_rmpp_recv *rmpp_recv) @@ -513,6 +516,121 @@ return mad_recv_wc; } +static inline u64 get_seg_addr(struct ib_mad_send_wr_private *mad_send_wr) +{ + return mad_send_wr->sg_list[0].addr + mad_send_wr->data_offset + + (sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset) * + (mad_send_wr->seg_num - 1); +} + +static int send_next_seg(struct ib_mad_send_wr_private *mad_send_wr) +{ + struct ib_rmpp_mad *rmpp_mad; + int timeout; + + rmpp_mad = (struct ib_rmpp_mad *)mad_send_wr->send_wr.wr.ud.mad_hdr; + ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr, IB_MGMT_RMPP_FLAG_ACTIVE); + rmpp_mad->rmpp_hdr.seg_num = cpu_to_be32(mad_send_wr->seg_num); + + if (mad_send_wr->seg_num == 1) { + rmpp_mad->rmpp_hdr.rmpp_rtime_flags |= IB_MGMT_RMPP_FLAG_FIRST; + rmpp_mad->rmpp_hdr.paylen_newwin = + cpu_to_be32(mad_send_wr->total_seg * + (sizeof(struct ib_rmpp_mad) - + 
offsetof(struct ib_rmpp_mad, data))); + mad_send_wr->sg_list[0].length = sizeof(struct ib_rmpp_mad); + } else { + mad_send_wr->send_wr.num_sge = 2; + mad_send_wr->sg_list[0].length = mad_send_wr->data_offset; + mad_send_wr->sg_list[1].addr = get_seg_addr(mad_send_wr); + mad_send_wr->sg_list[1].length = sizeof(struct ib_rmpp_mad) - + mad_send_wr->data_offset; + mad_send_wr->sg_list[1].lkey = mad_send_wr->sg_list[0].lkey; + } + + if (mad_send_wr->seg_num == mad_send_wr->total_seg) { + rmpp_mad->rmpp_hdr.rmpp_rtime_flags |= IB_MGMT_RMPP_FLAG_LAST; + rmpp_mad->rmpp_hdr.paylen_newwin = + cpu_to_be32(sizeof(struct ib_rmpp_mad) - + offsetof(struct ib_rmpp_mad, data) - + mad_send_wr->pad); + } + + /* 5 seconds until we can find the packet lifetime */ + timeout = mad_send_wr->send_wr.wr.ud.timeout_ms; + if (timeout && timeout < 5000) + mad_send_wr->timeout = msecs_to_jiffies(timeout); + else + mad_send_wr->timeout = msecs_to_jiffies(5000); + mad_send_wr->seg_num++; + + return ib_send_mad(mad_send_wr); +} + +static void process_rmpp_ack(struct ib_mad_agent_private *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_rmpp_mad *rmpp_mad; + unsigned long flags; + int seg_num, newwin, ret; + + rmpp_mad = (struct ib_rmpp_mad *)mad_recv_wc->recv_buf.mad; + if (rmpp_mad->rmpp_hdr.rmpp_status) + return; + + seg_num = be32_to_cpu(rmpp_mad->rmpp_hdr.seg_num); + newwin = be32_to_cpu(rmpp_mad->rmpp_hdr.paylen_newwin); + + spin_lock_irqsave(&agent->lock, flags); + mad_send_wr = ib_find_send_mad(agent, rmpp_mad->mad_hdr.tid); + if (!mad_send_wr) + goto out; /* Unmatched ACK */ + + if ((mad_send_wr->last_ack == mad_send_wr->total_seg) || + (!mad_send_wr->timeout) || (mad_send_wr->status != IB_WC_SUCCESS)) + goto out; /* Send is already done */ + + if (seg_num > mad_send_wr->total_seg) + goto out; /* Bad ACK */ + + if (newwin < mad_send_wr->newwin || seg_num < mad_send_wr->last_ack) + goto out; /* Old ACK */ + + if (seg_num > 
mad_send_wr->last_ack) { + mad_send_wr->last_ack = seg_num; + mad_send_wr->retries = mad_send_wr->send_wr.wr.ud.retries; + } + mad_send_wr->newwin = newwin; + if (mad_send_wr->refcount > 1) + goto out; /* Send is active */ + + if (mad_send_wr->last_ack == mad_send_wr->total_seg) { + /* If no response is expected, the ACK completes the send */ + if (!mad_send_wr->send_wr.wr.ud.timeout_ms) { + struct ib_mad_send_wc wc; + + ib_mark_mad_done(mad_send_wr); + spin_unlock_irqrestore(&agent->lock, flags); + + wc.status = IB_WC_SUCCESS; + wc.vendor_err = 0; + wc.wr_id = mad_send_wr->wr_id; + ib_mad_complete_send_wr(mad_send_wr, &wc); + return; + } + ib_reset_mad_timeout(mad_send_wr, + mad_send_wr->send_wr.wr.ud.timeout_ms); + } else if (mad_send_wr->seg_num < mad_send_wr->newwin) { + /* Send failure will just result in a timeout/retry */ + ret = send_next_seg(mad_send_wr); + if (!ret) + mad_send_wr->refcount++; + } +out: + spin_unlock_irqrestore(&agent->lock, flags); +} + struct ib_mad_recv_wc * ib_process_rmpp_recv_wc(struct ib_mad_agent_private *agent, struct ib_mad_recv_wc *mad_recv_wc) @@ -523,6 +641,9 @@ if (!(rmpp_mad->rmpp_hdr.rmpp_rtime_flags & IB_MGMT_RMPP_FLAG_ACTIVE)) return mad_recv_wc; + if (rmpp_mad->rmpp_hdr.rmpp_version != IB_MGMT_RMPP_VERSION) + goto out; + switch (rmpp_mad->rmpp_hdr.rmpp_type) { case IB_MGMT_RMPP_TYPE_DATA: if (rmpp_mad->rmpp_hdr.seg_num == __constant_htonl(1)) @@ -530,38 +651,121 @@ else return continue_rmpp(agent, mad_recv_wc); case IB_MGMT_RMPP_TYPE_ACK: - /* process_rmpp_ack(agent, mad_recv_wc); */ + process_rmpp_ack(agent, mad_recv_wc); break; case IB_MGMT_RMPP_TYPE_STOP: case IB_MGMT_RMPP_TYPE_ABORT: - /* process_rmpp_nack(agent, mad_recv_wc); */ + /* TODO: process_rmpp_nack(agent, mad_recv_wc); */ break; default: break; } +out: ib_free_recv_mad(mad_recv_wc); return NULL; } +int ib_send_rmpp_mad(struct ib_mad_send_wr_private *mad_send_wr) +{ + struct ib_rmpp_mad *rmpp_mad; + int i, total_len, ret; + + rmpp_mad = (struct ib_rmpp_mad 
*)mad_send_wr->send_wr.wr.ud.mad_hdr; + if (!(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & + IB_MGMT_RMPP_FLAG_ACTIVE)) + return IB_RMPP_RESULT_UNHANDLED; + + if (rmpp_mad->rmpp_hdr.rmpp_type != IB_MGMT_RMPP_TYPE_DATA) + return IB_RMPP_RESULT_INTERNAL; + + if (mad_send_wr->send_wr.num_sge > 1) + return -EINVAL; /* TODO: support num_sge > 1 */ + + mad_send_wr->seg_num = 1; + mad_send_wr->newwin = 1; + mad_send_wr->data_offset = data_offset(rmpp_mad->mad_hdr.mgmt_class); + + total_len = 0; + for (i = 0; i < mad_send_wr->send_wr.num_sge; i++) + total_len += mad_send_wr->send_wr.sg_list[i].length; -enum ib_mad_result -ib_process_rmpp_send_wc(struct ib_mad_send_wr_private *mad_send_wr, - struct ib_mad_send_wc *mad_send_wc) + mad_send_wr->total_seg = (total_len - mad_send_wr->data_offset) / + (sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset); + mad_send_wr->pad = total_len - offsetof(struct ib_rmpp_mad, data) - + be32_to_cpu(rmpp_mad->rmpp_hdr.paylen_newwin); + mad_send_wr->retries = mad_send_wr->send_wr.wr.ud.retries; + + /* We need to wait for the final ACK even if there isn't a response */ + mad_send_wr->refcount += (mad_send_wr->timeout == 0); + + ret = send_next_seg(mad_send_wr); + if (!ret) + return IB_RMPP_RESULT_CONSUMED; + return ret; +} + +int ib_process_rmpp_send_wc(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc) { struct ib_rmpp_mad *rmpp_mad; struct rmpp_msg *msg; + int ret; rmpp_mad = (struct ib_rmpp_mad *)mad_send_wr->send_wr.wr.ud.mad_hdr; if (!(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & IB_MGMT_RMPP_FLAG_ACTIVE)) - return IB_MAD_RESULT_SUCCESS; + return IB_RMPP_RESULT_UNHANDLED; /* RMPP not active */ if (rmpp_mad->rmpp_hdr.rmpp_type != IB_MGMT_RMPP_TYPE_DATA) { msg = (struct rmpp_msg *) (unsigned long) mad_send_wc->wr_id; free_rmpp_msg(msg); - return IB_MAD_RESULT_CONSUMED; + return IB_RMPP_RESULT_INTERNAL; /* ACK, STOP, or ABORT */ } - /* TODO: continue send until done - ACKed or we have a response */ - return 
IB_MAD_RESULT_SUCCESS; + if (mad_send_wc->status != IB_WC_SUCCESS || + mad_send_wr->status != IB_WC_SUCCESS) + return IB_RMPP_RESULT_PROCESSED; /* Canceled or send error */ + + if (!mad_send_wr->timeout) + return IB_RMPP_RESULT_PROCESSED; /* Response received */ + + if (mad_send_wr->last_ack == mad_send_wr->total_seg) { + mad_send_wr->timeout = + msecs_to_jiffies(mad_send_wr->send_wr.wr.ud.timeout_ms); + return IB_RMPP_RESULT_PROCESSED; /* Send done */ + } + + if (mad_send_wr->seg_num > mad_send_wr->newwin || + mad_send_wr->seg_num > mad_send_wr->total_seg) + return IB_RMPP_RESULT_PROCESSED; /* Wait for ACK */ + + ret = send_next_seg(mad_send_wr); + if (ret) { + mad_send_wc->status = IB_WC_GENERAL_ERR; + return IB_RMPP_RESULT_PROCESSED; + } + return IB_RMPP_RESULT_CONSUMED; +} + +int ib_timeout_rmpp(struct ib_mad_send_wr_private *mad_send_wr) +{ + struct ib_rmpp_mad *rmpp_mad; + int ret; + + rmpp_mad = (struct ib_rmpp_mad *)mad_send_wr->send_wr.wr.ud.mad_hdr; + if (!(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & + IB_MGMT_RMPP_FLAG_ACTIVE)) + return IB_RMPP_RESULT_UNHANDLED; /* RMPP not active */ + + if (mad_send_wr->last_ack == mad_send_wr->total_seg || + !mad_send_wr->retries--) + return IB_RMPP_RESULT_PROCESSED; + + mad_send_wr->seg_num = mad_send_wr->last_ack + 1; + ret = send_next_seg(mad_send_wr); + if (ret) + return IB_RMPP_RESULT_PROCESSED; + + mad_send_wr->refcount++; + return IB_RMPP_RESULT_CONSUMED; } Index: core/mad.c =================================================================== --- core/mad.c (revision 2207) +++ core/mad.c (working copy) @@ -63,8 +63,6 @@ static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, struct ib_mad_private *mad); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); -static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, - struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); static void cancel_sends(void *data); static void local_completions(void 
*data); @@ -851,7 +849,7 @@ } EXPORT_SYMBOL(ib_free_send_mad); -static int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr) +int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr) { struct ib_mad_qp_info *qp_info; struct ib_send_wr *bad_send_wr; @@ -953,19 +951,18 @@ ret = -ENOMEM; goto error2; } + memset(mad_send_wr, 0, sizeof *mad_send_wr); mad_send_wr->send_wr = *send_wr; mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; memcpy(mad_send_wr->sg_list, send_wr->sg_list, sizeof *send_wr->sg_list * send_wr->num_sge); - mad_send_wr->wr_id = mad_send_wr->send_wr.wr_id; - mad_send_wr->send_wr.next = NULL; + mad_send_wr->wr_id = send_wr->wr_id; mad_send_wr->tid = send_wr->wr.ud.mad_hdr->tid; mad_send_wr->mad_agent_priv = mad_agent_priv; /* Timeout will be updated after send completes */ mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. ud.timeout_ms); - mad_send_wr->retry = 0; /* One reference for each work request to QP + response */ mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); mad_send_wr->status = IB_WC_SUCCESS; @@ -977,8 +974,13 @@ &mad_agent_priv->send_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - ret = ib_send_mad(mad_send_wr); - if (ret) { + if (mad_agent_priv->agent.rmpp_version) { + ret = ib_send_rmpp_mad(mad_send_wr); + if (ret >= 0 && ret != IB_RMPP_RESULT_CONSUMED) + ret = ib_send_mad(mad_send_wr); + } else + ret = ib_send_mad(mad_send_wr); + if (ret < 0) { /* Fail send request */ spin_lock_irqsave(&mad_agent_priv->lock, flags); list_del(&mad_send_wr->agent_list); @@ -1538,19 +1540,6 @@ return valid; } -static struct ib_mad_recv_wc * -process_recv(struct ib_mad_agent_private *mad_agent_priv, - struct ib_mad_recv_wc *mad_recv_wc) -{ - INIT_LIST_HEAD(&mad_recv_wc->rmpp_list); - list_add(&mad_recv_wc->recv_buf.list, &mad_recv_wc->rmpp_list); - - if (mad_agent_priv->agent.rmpp_version) - return ib_process_rmpp_recv_wc(mad_agent_priv, mad_recv_wc); - else - return mad_recv_wc; -} - static int is_data_mad(struct 
ib_mad_agent_private *mad_agent_priv, struct ib_mad_hdr *mad_hdr) { @@ -1563,9 +1552,8 @@ (rmpp_mad->rmpp_hdr.rmpp_type == IB_MGMT_RMPP_TYPE_DATA); } -static struct ib_mad_send_wr_private* -find_send_req(struct ib_mad_agent_private *mad_agent_priv, - u64 tid) +struct ib_mad_send_wr_private* +ib_find_send_mad(struct ib_mad_agent_private *mad_agent_priv, u64 tid) { struct ib_mad_send_wr_private *mad_send_wr; @@ -1592,7 +1580,7 @@ return NULL; } -static void ib_mark_req_done(struct ib_mad_send_wr_private *mad_send_wr) +void ib_mark_mad_done(struct ib_mad_send_wr_private *mad_send_wr) { mad_send_wr->timeout = 0; if (mad_send_wr->refcount == 1) { @@ -1610,19 +1598,24 @@ unsigned long flags; u64 tid; - /* Process the receive before giving it to the user. */ - mad_recv_wc = process_recv(mad_agent_priv, mad_recv_wc); - if (!mad_recv_wc) { - if (atomic_dec_and_test(&mad_agent_priv->refcount)) - wake_up(&mad_agent_priv->wait); - return; + INIT_LIST_HEAD(&mad_recv_wc->rmpp_list); + list_add(&mad_recv_wc->recv_buf.list, &mad_recv_wc->rmpp_list); + + if (mad_agent_priv->agent.rmpp_version) { + mad_recv_wc = ib_process_rmpp_recv_wc(mad_agent_priv, + mad_recv_wc); + if (!mad_recv_wc) { + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } } /* Complete corresponding request */ if (response_mad(mad_recv_wc->recv_buf.mad)) { tid = mad_recv_wc->recv_buf.mad->mad_hdr.tid; spin_lock_irqsave(&mad_agent_priv->lock, flags); - mad_send_wr = find_send_req(mad_agent_priv, tid); + mad_send_wr = ib_find_send_mad(mad_agent_priv, tid); if (!mad_send_wr) { spin_unlock_irqrestore(&mad_agent_priv->lock, flags); ib_free_recv_mad(mad_recv_wc); @@ -1630,7 +1623,7 @@ wake_up(&mad_agent_priv->wait); return; } - ib_mark_req_done(mad_send_wr); + ib_mark_mad_done(mad_send_wr); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); /* Defined behavior is to complete response before request */ @@ -1821,23 +1814,33 @@ } } +void ib_reset_mad_timeout(struct 
ib_mad_send_wr_private *mad_send_wr, + int timeout_ms) +{ + mad_send_wr->timeout = msecs_to_jiffies(timeout_ms); + wait_for_response(mad_send_wr); + adjust_timeout(mad_send_wr->mad_agent_priv); +} + /* * Process a send work completion */ -static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, - struct ib_mad_send_wc *mad_send_wc) +void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc) { struct ib_mad_agent_private *mad_agent_priv; unsigned long flags; - enum ib_mad_result ret; + int ret; mad_agent_priv = mad_send_wr->mad_agent_priv; - if (mad_agent_priv->agent.rmpp_version) + spin_lock_irqsave(&mad_agent_priv->lock, flags); + if (mad_agent_priv->agent.rmpp_version) { ret = ib_process_rmpp_send_wc(mad_send_wr, mad_send_wc); - else - ret = IB_MAD_RESULT_SUCCESS; + if (ret == IB_RMPP_RESULT_CONSUMED) + goto done; + } else + ret = IB_RMPP_RESULT_UNHANDLED; - spin_lock_irqsave(&mad_agent_priv->lock, flags); if (mad_send_wc->status != IB_WC_SUCCESS && mad_send_wr->status == IB_WC_SUCCESS) { mad_send_wr->status = mad_send_wc->status; @@ -1849,8 +1852,7 @@ mad_send_wr->status == IB_WC_SUCCESS) { wait_for_response(mad_send_wr); } - spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - return; + goto done; } /* Remove send from MAD agent and notify client of completion */ @@ -1860,7 +1862,7 @@ if (mad_send_wr->status != IB_WC_SUCCESS ) mad_send_wc->status = mad_send_wr->status; - if (ret == IB_MAD_RESULT_SUCCESS) + if (ret != IB_RMPP_RESULT_INTERNAL) mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, mad_send_wc); @@ -1869,6 +1871,9 @@ wake_up(&mad_agent_priv->wait); kfree(mad_send_wr); + return; +done: + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); } static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, @@ -2066,8 +2071,7 @@ } static struct ib_mad_send_wr_private* -find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv, - u64 wr_id) 
+find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv, u64 wr_id) { struct ib_mad_send_wr_private *mad_send_wr; @@ -2234,6 +2238,7 @@ struct ib_mad_send_wr_private *mad_send_wr; struct ib_mad_send_wc mad_send_wc; unsigned long flags, delay; + int ret; mad_agent_priv = (struct ib_mad_agent_private *)data; @@ -2257,6 +2262,14 @@ } list_del(&mad_send_wr->agent_list); + if (mad_agent_priv->agent.rmpp_version) { + ret = ib_timeout_rmpp(mad_send_wr); + if (ret == IB_RMPP_RESULT_CONSUMED) { + list_add_tail(&mad_send_wr->agent_list, + &mad_agent_priv->send_list); + continue; + } + } spin_unlock_irqrestore(&mad_agent_priv->lock, flags); mad_send_wc.wr_id = mad_send_wr->wr_id; Index: core/mad_rmpp.h =================================================================== --- core/mad_rmpp.h (revision 2207) +++ core/mad_rmpp.h (working copy) @@ -37,14 +37,24 @@ #include "mad_priv.h" +enum { + IB_RMPP_RESULT_PROCESSED, + IB_RMPP_RESULT_CONSUMED, + IB_RMPP_RESULT_INTERNAL, + IB_RMPP_RESULT_UNHANDLED +}; + +int ib_send_rmpp_mad(struct ib_mad_send_wr_private *mad_send_wr); + struct ib_mad_recv_wc * -ib_process_rmpp_recv_wc(struct ib_mad_agent_private *mad_agent_priv, +ib_process_rmpp_recv_wc(struct ib_mad_agent_private *agent, struct ib_mad_recv_wc *mad_recv_wc); -enum ib_mad_result -ib_process_rmpp_send_wc(struct ib_mad_send_wr_private *mad_send_wr, - struct ib_mad_send_wc *mad_send_wc); +int ib_process_rmpp_send_wc(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc); + +void ib_cancel_rmpp_recvs(struct ib_mad_agent_private *agent); -void ib_cancel_rmpp_recvs(struct ib_mad_agent_private *mad_agent_priv); +int ib_timeout_rmpp(struct ib_mad_send_wr_private *mad_send_wr); #endif /* __MAD_RMPP_H__ */ Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 2207) +++ core/mad_priv.h (working copy) @@ -126,6 +126,15 @@ int retry; int refcount; enum ib_wc_status status; + + /* RMPP 
control */ + int last_ack; + int seg_num; + int newwin; + int total_seg; + int data_offset; + int pad; + int retries; }; struct ib_mad_local_private { @@ -198,4 +207,17 @@ extern kmem_cache_t *ib_mad_cache; +int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr); + +struct ib_mad_send_wr_private * +ib_find_send_mad(struct ib_mad_agent_private *mad_agent_priv, u64 tid); + +void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc); + +void ib_mark_mad_done(struct ib_mad_send_wr_private *mad_send_wr); + +void ib_reset_mad_timeout(struct ib_mad_send_wr_private *mad_send_wr, + int timeout_ms); + #endif /* __IB_MAD_PRIV_H__ */ From greg at kroah.com Thu Apr 21 23:14:43 2005 From: greg at kroah.com (Greg KH) Date: Thu, 21 Apr 2005 23:14:43 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <4268080E.3000303@ammasso.com> References: <20050421173821.GA13312@hexapodia.org> <4267F367.3090508@ammasso.com> <20050421195641.GB13312@hexapodia.org> <4268080E.3000303@ammasso.com> Message-ID: <20050422061443.GA10499@kroah.com> On Thu, Apr 21, 2005 at 03:07:42PM -0500, Timur Tabi wrote: > >*You* need to come up with a solution that looks good to *the community* > >if you want it merged. > > True, but I'm not going to waste my time adding this support if the > consensus I get from the kernel developers that they don't want Linux to > behave this way. I think we have been giving you that consensus from the very beginning :) The very fact that you tried to trot out the "enterprise" card should have raised a huge flag... 
thanks, greg k-h From pavel at suse.cz Thu Apr 21 12:47:06 2005 From: pavel at suse.cz (Pavel Machek) Date: Thu, 21 Apr 2005 21:47:06 +0200 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <4263E53E.3090107@ammasso.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> <52is2kawsi.fsf@topspin.com> <4263E53E.3090107@ammasso.com> Message-ID: <20050421194706.GE475@openzaurus.ucw.cz> Hi! > > Timur> Why do you call mlock() and get_user_pages()? In our > > code, > > Timur> we only call mlock(), and the memory is pinned. We have a > > Timur> test case that fails if only get_user_pages() is called, > > Timur> but it passes if only mlock() is called. > > > >What if a buggy/malicious userspace program doesn't call mlock()? > > Our library calls mlock() when the apps requests memory to be > "registered". We then call munlock() when the app requests the > memory to be unregistered. All apps talk to our library for all > services. No apps talk to the driver directly. That does not cover "malicious" part. Pavel -- 64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms From 7eggert at gmx.de Fri Apr 22 06:10:09 2005 From: 7eggert at gmx.de (Bodo Eggert ) Date: Fri, 22 Apr 2005 15:10:09 +0200 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation References: <3VAeQ-1To-7@gated-at.bofh.it> <3VNYt-4M4-15@gated-at.bofh.it> Message-ID: Andy Isaacson wrote: > On Wed, Apr 20, 2005 at 10:07:45PM -0500, Timur Tabi wrote: >> I don't know if VM_REGISTERED is a good idea or not, but it should be >> absolutely impossible for the kernel to reclaim "registered" (aka pinned) >> memory, no matter what. 
For RDMA services (such as Infiniband, iWARP, etc), >> it's normal for non-root processes to pin hundreds of megabytes of memory, >> and that memory better be locked to those physical pages until the >> application deregisters them. > > If you take the hardline position that "the app is the only thing that > matters", your code is unlikely to get merged. Linux is a > general-purpose OS. All userspace hardware drivers with DMA will require pinned pages (and some of them will require contiguous memory). Since this memory may be scheduled to be accessed by DMA, reclaiming those pages may (aka. will) result in "random" memory corruption unless done by the driver itself. You can't even set a time limit, the driver may have allocated all DMA memory to queued transfers, and some media needs to get plugged in by the lazy robot. As soon as the robot arrives - boom. (For the same reason, this memory MUST NOT be freed if the application terminates abnormally, e.g. killed by OOM). In other words, you need to make this memory as inaccessible as the framebuffer on a graphics card. If that causes a lockup, you had better have prevented that while allocating. > In a Linux context, I doubt that fullblown SA is necessary or > appropriate. Rather, I'd suggest two new signals, SIGMEMLOW and > SIGMEMCRIT. The userland comms library registers handlers for both. > When the kernel decides that it needs to reclaim some memory from the > app, it sends SIGMEMLOW. The comms library then has the responsibility > to un-reserve some memory in an orderly fashion. If a reasonable [1] > time has expired since SIGMEMLOW and the kernel is still hungry, the > kernel sends SIGMEMCRIT. At this point, the comms lib *must* unregister > some memory [2] even if it has to drop state to do so; if it returns > from the signal handler without having unregistered the memory, the > kernel will SIGKILL. Choosing data loss over a finitely stalled system may sometimes be a bad decision. 
If I designed an application that might get a "gimme memory or die", I'd reserve an extra bunch of memory with the only purpose of being released in this situation. If the kernel had done that instead, this part of memory could have been used e.g. as a read-only disk cache in the meantime (of course, provided somebody cared to implement that). > [2] Is there a way for the kernel to pass down to userspace how many > pages it wants, maybe in the sigcontext? Then you'd need only one signal. I think this interface is useful; it would e.g. allow a picture viewer to cache as many decoded and scaled pictures as the RAM permits, freeing them if the RAM gets full and the swap would have to be used. -- "When the pin is pulled, Mr. Grenade is not our friend." -U.S. Marine Corps From ftillier at infiniconsys.com Fri Apr 22 10:01:55 2005 From: ftillier at infiniconsys.com (Fab Tillier) Date: Fri, 22 Apr 2005 10:01:55 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation In-Reply-To: Message-ID: <000101c5475c$fe3c5fa0$8d5aa8c0@infiniconsys.com> > From: Bodo Eggert > Sent: Friday, April 22, 2005 6:10 AM > > All userspace hardware drivers with DMA will require pinned pages (and > some of them will require contiguous memory). Since this memory may be > scheduled to be accessed by DMA, reclaiming those pages may (aka. will) > result in "random" memory corruption unless done by the driver itself. Any reclaim must involve the driver. That doesn't mean that it must involve the application. That said, this isn't trivial to implement. > > You can't even set a time limit, the driver may have allocated all DMA > memory to queued transfers, and some media needs to get plugged in by > the lazy robot. As soon as the robot arrives - boom. (For the same reason, > this memory MUST NOT be freed if the application terminates abnormally, > e.g. killed by OOM). 
InfiniBand provides support for deregistering memory that might be referenced at some future time by an RDMA operation. The only side effect this has is that the QP on both sides of the connection transition to an error state. Upon abnormal termination, all registrations must be undone and the memory unpinned. This must be synchronized with the hardware so that there are no races. The IB deregistration semantics provide such synchronization. I'd venture that any HW design that does not do this is broken. Requiring the memory to never be freed upon abnormal termination equates to a serious memory leak, in that physical memory is leaked, not virtual. - Fab From eitan at mellanox.co.il Fri Apr 22 10:41:58 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 22 Apr 2005 20:41:58 +0300 Subject: [openib-general] MAD/RMPP test program Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF16B@mtlex01.yok.mtl.com> Hi Sean, Were you able to qualify the protocol implementation using an IB analyzer? Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Thursday, April 21, 2005 7:48 PM > To: openib-general > Subject: [openib-general] MAD/RMPP test program > > For those interested (likely a few developers only), I've checked in a > kernel test program that I used to stress the MAD/RMPP code. > > gen2/utils/src/linux-kernel/infiniband/util/grmpp > > - Sean 
From timur.tabi at ammasso.com Fri Apr 22 10:55:22 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Fri, 22 Apr 2005 12:55:22 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <1113840973.6274.84.camel@laptopd505.fenrus.org> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> <1113840973.6274.84.camel@laptopd505.fenrus.org> Message-ID: <42693A8A.80105@ammasso.com> Arjan van de Ven wrote: > On Mon, 2005-04-18 at 11:09 -0500, Timur Tabi wrote: > >>Roland Dreier wrote: >> >>> Troy> How is memory pinning handled? (I haven't had time to read >>> Troy> all the code, so please excuse my ignorance of something >>> Troy> obvious). >>> >>>The userspace library calls mlock() and then the kernel does >>>get_user_pages(). >> >>Why do you call mlock() and get_user_pages()? In our code, we only call mlock(), and the >>memory is pinned. > > > this is a myth; linux is free to move the page about in physical memory > even if it's mlock()ed!! Can you tell me when Linux actually does this? I know in theory it can happen, but I've never seen it. Does the code to implement moving of data from one physical page to another even exist in any version of Linux? Also, what would be the point? What reason would there be to move some data from one physical page to another, while keeping the same virtual address? -- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com One thing a Southern boy will never say is, "I don't think duct tape will fix it." 
-- Ed Smylie, NASA engineer for Apollo 13 From mshefty at ichips.intel.com Fri Apr 22 11:00:44 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 22 Apr 2005 11:00:44 -0700 Subject: [openib-general] MAD/RMPP test program In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF16B@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6047EF16B@mtlex01.yok.mtl.com> Message-ID: <42693BCC.7080106@ichips.intel.com> Eitan Zahavi wrote: > Hi Sean, > > Were you able to qualify the protocol implementation using an IB analyzer? Lacking a usable IB analyzer... no. I did use the madeye utility to examine the headers for the window size, ACK format, timeouts, retries, etc. If someone does run this against an analyzer and notices any issues, please let me know of them. - Sean From arjan at infradead.org Fri Apr 22 11:12:58 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Fri, 22 Apr 2005 20:12:58 +0200 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <42693A8A.80105@ammasso.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com> <1113840973.6274.84.camel@laptopd505.fenrus.org> <42693A8A.80105@ammasso.com> Message-ID: <1114193579.10355.38.camel@laptopd505.fenrus.org> On Fri, 2005-04-22 at 12:55 -0500, Timur Tabi wrote: > Arjan van de Ven wrote: > > On Mon, 2005-04-18 at 11:09 -0500, Timur Tabi wrote: > > > >>Roland Dreier wrote: > >> > >>> Troy> How is memory pinning handled? (I haven't had time to read > >>> Troy> all the code, so please excuse my ignorance of something > >>> Troy> obvious). > >>> > >>>The userspace library calls mlock() and then the kernel does > >>>get_user_pages(). > >> > >>Why do you call mlock() and get_user_pages()? In our code, we only call mlock(), and the > >>memory is pinned. 
> > > > > > this is a myth; linux is free to move the page about in physical memory > > even if it's mlock()ed!! > > Can you tell me when Linux actually does this? I know in theory it can happen, but I've > never seen it. Does the code to implement moving of data from one physical page to > another even exist in any version of Linux? hot(un)plug memory. > > Also, what would be the point? What reason would there be to move some data from one > physical page to another, while keeping the same virtual address? so that you can hot unplug the dimm in question. I guess that's a bit of a high end though though... so maybe you don't care about it. From 7eggert at gmx.de Fri Apr 22 15:01:23 2005 From: 7eggert at gmx.de (Bodo Eggert) Date: Sat, 23 Apr 2005 00:01:23 +0200 (CEST) Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation In-Reply-To: <000101c5475c$fe3c5fa0$8d5aa8c0@infiniconsys.com> References: <000101c5475c$fe3c5fa0$8d5aa8c0@infiniconsys.com> Message-ID: On Fri, 22 Apr 2005, Fab Tillier wrote: > > From: Bodo Eggert > > Sent: Friday, April 22, 2005 6:10 AM > > You can't even set a time limit, the driver may have allocated all DMA > > memory to queued transfers, and some media needs to get plugged in by > > the lazy robot. As soon as the robot arrives - boom. (For the same reason, > > this memory MUST NOT be freed if the application terminates abnormally, > > e.g. killed by OOM). > > InfiniBand provides support for deregistering memory that might be > referenced at some future time by an RDMA operation. The only side effect > this has is that the QP on both sides of the connection transition to an > error state. > > Upon abnormal termination, all registrations must be undone and the memory > unpinned. This must be synchronized with the hardware so that there are no > races. If you know the hardware. 
If you have userspace drivers, this will be impossible, and even if you have kernel drivers, you'll need to know which of them is responsible for each part of the pinned memory. This doesn't imply the affected memory to be lost. The same application that created the pinned memory can reset the hardware (provided nobody changed the configuration), then reconnect to the shared memory segment you'll use for that purpose and use or free it. -- To iterate is human; to recurse, divine. From tduffy at sun.com Fri Apr 22 15:57:32 2005 From: tduffy at sun.com (Tom Duffy) Date: Fri, 22 Apr 2005 15:57:32 -0700 Subject: [openib-general] [PATCHv4][SDP] Allow SDP to compile on 2.6.12-rc3 In-Reply-To: <1114126674.6858.31.camel@duffman> References: <000001c546ca$51807b80$8d5aa8c0@infiniconsys.com> <1114126674.6858.31.camel@duffman> Message-ID: <1114210652.5519.1.camel@duffman> On Thu, 2005-04-21 at 16:37 -0700, Tom Duffy wrote: > On Thu, 2005-04-21 at 16:31 -0700, Fab Tillier wrote: > > Isn't the above change backwards? The original code was copying settings > > from listen_sk to sk, and the new code seems to be checking flags in sk to > > determine whether to set them in listen_sk. > > You are so right. My brain ain't on today or something. You know what, cancel this whole patch. I have it wrong, and I am reworking a new patch to work with the new sk_alloc(). -tduffy From libor at topspin.com Fri Apr 22 17:57:19 2005 From: libor at topspin.com (Libor Michalek) Date: Fri, 22 Apr 2005 17:57:19 -0700 Subject: [openib-general] [ANNOUNCE] Userspace Connection Manager Message-ID: <20050422175719.A1735@topspin.com> I've made the initial check-in of the userspace connection manager library. 
The kernel module that provides the access from userspace to the kernel CM was checked in previously, and is already being built as part of the core IB support (ib_ucm.ko). To use the Userspace CM you'll need to create a single character device file: mknod /dev/infiniband/ucm c 231 255 There's a dependency on infiniband/verbs.h so you'll need libibverbs installed on the same system. Check out src/userspace/libibcm and build: ./autogen.sh && ./configure && make && sudo make install The API is very similar to the kernel CM, as you will be able to tell by looking at infiniband/cm.h (thanks Sean), with the one notable exception being CM event notification. Unlike the kernel, which delivers events through a callback, the userspace CM does not deliver events; they must be solicited (ib_cm_event_get()). The file descriptor used by the CM can be retrieved for use in poll/select, so a user does not need to block on event solicitation. Ideally an app should be able to use the CM and verbs without needing to use threads for event handling. There exists a simple example, which drives the CM through the standard connection states, but does not actually create any QPs. Next step is more testing and to create a real example which actually uses libibverbs, moves data, and uses the real SA to get path records. 
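For readers following along, the non-blocking event retrieval described above might look roughly like the sketch below. It polls a descriptor with a zero timeout and only asks for an event when one is ready, so no dedicated thread is needed. The names drain_events, event_fn and read_one_byte are hypothetical, and the callback is merely a stand-in for wherever ib_cm_event_get() would be called; the real libibcm signatures are not given in this message, so none are assumed here.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical callback: fetch and handle one pending event on fd.
 * Returns 0 on success, nonzero to stop. In a real program this is
 * where the CM's event-get call would go (signature not assumed).
 */
typedef int (*event_fn)(int fd, void *ctx);

/*
 * Dispatch every event that is ready on fd without ever blocking.
 * Returns the number of events handled, or -1 if poll() failed.
 */
static int drain_events(int fd, event_fn get_event, void *ctx)
{
	int handled = 0;

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		int n = poll(&pfd, 1, 0);	/* timeout 0: return immediately */

		if (n < 0)
			return -1;
		if (n == 0 || !(pfd.revents & POLLIN))
			break;			/* nothing (more) to read */
		if (get_event(fd, ctx) != 0)
			break;
		handled++;
	}
	return handled;
}

/* Stand-in event source for trying the loop out: one byte is one event. */
static int read_one_byte(int fd, void *ctx)
{
	unsigned char b;

	(void)ctx;
	return read(fd, &b, 1) == 1 ? 0 : -1;
}
```

In an application with its own poll/select loop, the CM descriptor would simply be added to the existing descriptor set and the event-get call made when it becomes readable.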
-Libor From akpm at osdl.org Sat Apr 23 19:44:21 2005 From: akpm at osdl.org (Andrew Morton) Date: Sat, 23 Apr 2005 19:44:21 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <4263E445.8000605@ammasso.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> Message-ID: <20050423194421.4f0d6612.akpm@osdl.org> Timur Tabi wrote: > > Christoph Hellwig wrote: > > On Mon, Apr 18, 2005 at 11:22:29AM -0500, Timur Tabi wrote: > > > >>That's not what we're seeing. We have hardware that does DMA over the > >>network (much like the Infiniband stuff), and we have a testcase that fails > >>if get_user_pages() is used, but not if mlock() is used. > > > > > > If you don't share your testcase it's unlikely to be fixed. > > As I said, the testcase only works with our hardware, and it's also very large. It's one > small test that's part of a huge test suite. It takes a couple hours just to install the > damn thing. > > We want to produce a simpler test case that demonstrates the problem in an > easy-to-understand manner, but we don't have time to do that now. If your theory is correct then it should be able to demonstrate this problem without any special hardware at all: pin some user memory, then generate memory pressure then check the contents of those pinned pages. But if, for the DMA transfer, you're using the array of page*'s which were originally obtained from get_user_pages() then it's rather hard to see how the kernel could alter the page's contents. Then again, if mlock() fixes it then something's up. Very odd. 
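A hardware-free check along the lines suggested above could be sketched as follows. This is only a userspace approximation under stated assumptions: it uses mlock() as the stand-in for pinning, creates pressure by touching a large allocation, and verifies the contents through the virtual address. It cannot observe whether a page moved physically, which is what matters for DMA. All function names are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1	/* for MAP_ANONYMOUS under strict -std modes */
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fill buf with a position-dependent byte pattern. */
static void fill_pattern(unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] = (unsigned char)(i * 31u + 7u);
}

/* Count the bytes in buf that no longer match the pattern. */
static size_t count_mismatches(const unsigned char *buf, size_t len)
{
	size_t bad = 0;

	for (size_t i = 0; i < len; i++)
		if (buf[i] != (unsigned char)(i * 31u + 7u))
			bad++;
	return bad;
}

/* Touch pressure_bytes of heap, one write per 4 KiB, then release it. */
static void apply_pressure(size_t pressure_bytes)
{
	volatile unsigned char *p = malloc(pressure_bytes);

	if (!p)
		return;		/* allocation refused: no pressure applied */
	for (size_t i = 0; i < pressure_bytes; i += 4096)
		p[i] = 0xA5;
	free((void *)p);
}

/*
 * Map one page, try to lock it, fill it, apply pressure, then verify.
 * Returns the number of mismatched bytes, or -1 if mmap() failed.
 */
static long pinned_pattern_test(size_t pressure_bytes)
{
	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);
	size_t len = (size_t)page;

	unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* mlock() can fail (e.g. RLIMIT_MEMLOCK); the check then runs unpinned */
	int locked = (mlock(buf, len) == 0);

	fill_pattern(buf, len);
	apply_pressure(pressure_bytes);
	long bad = (long)count_mismatches(buf, len);

	if (locked)
		munlock(buf, len);
	munmap(buf, len);
	return bad;
}
```

The interesting run would pass a pressure size close to the machine's RAM; a nonzero return there would point at the kernel rather than at any adapter.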
From timur.tabi at ammasso.com Sun Apr 24 07:23:48 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Sun, 24 Apr 2005 09:23:48 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050423194421.4f0d6612.akpm@osdl.org> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> Message-ID: <426BABF4.3050205@ammasso.com> Andrew Morton wrote: > If your theory is correct then it should be able to demonstrate this > problem without any special hardware at all: pin some user memory, then > generate memory pressure then check the contents of those pinned pages. I tried that, but I couldn't get it to fail. But that was a while ago, and I've learned a few things since then, so I'll try again. > But if, for the DMA transfer, you're using the array of page*'s which were > originally obtained from get_user_pages() then it's rather hard to see how > the kernel could alter the page's contents. > > Then again, if mlock() fixes it then something's up. Very odd. With mlock(), we don't need to use get_user_pages() at all. Arjan tells me the only time an mlocked page can move is with hot (un)plug of memory, but that isn't supported on the systems that we support. We actually prefer mlock() over get_user_pages(), because if the process dies, the locks automatically go away too. 
From greg at kroah.com Sun Apr 24 13:53:10 2005 From: greg at kroah.com (Greg KH) Date: Sun, 24 Apr 2005 13:53:10 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <426BABF4.3050205@ammasso.com> References: <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> Message-ID: <20050424205309.GA5386@kroah.com> On Sun, Apr 24, 2005 at 09:23:48AM -0500, Timur Tabi wrote: > Andrew Morton wrote: > > >If your theory is correct then it should be able to demonstrate this > >problem without any special hardware at all: pin some user memory, then > >generate memory pressure then check the contents of those pinned pages. > > I tried that, but I couldn't get it to fail. But that was a while ago, and > I've learned a few things since then, so I'll try again. > > >But if, for the DMA transfer, you're using the array of page*'s which were > >originally obtained from get_user_pages() then it's rather hard to see how > >the kernel could alter the page's contents. > > > >Then again, if mlock() fixes it then something's up. Very odd. > > With mlock(), we don't need to use get_user_pages() at all. Arjan tells me > the only time an mlocked page can move is with hot (un)plug of memory, but > that isn't supported on the systems that we support. You don't "support" i386 or ia64 or x86-64 or ppc64 systems? What hardware do you support? And what about the fact that you are aiming to get this code into mainline, right? If not, why are you asking here? 
:) thanks, greg k-h From timur.tabi at ammasso.com Sun Apr 24 14:52:31 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Sun, 24 Apr 2005 16:52:31 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050424205309.GA5386@kroah.com> References: <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <20050424205309.GA5386@kroah.com> Message-ID: <426C151F.3000407@ammasso.com> Greg KH wrote: > You don't "support" i386 or ia64 or x86-64 or ppc64 systems? What > hardware do you support? I've never seen or heard of any x86-32 or x86-64 system that supports hot-swap RAM. Our hardware does not support PPC, and our software doesn't support ia-64. > And what about the fact that you are aiming to > get this code into mainline, right? If not, why are you asking here? > :) Well, our primary concern is getting our stuff to work. Since get_user_pages() doesn't work, but mlock() does, that's what we use. I don't know how to fix get_user_pages(), and I don't have the time right now to figure it out. I know that technically mlock() is not the right way to do it, and so we're not going to be submitting our code for the mainline until get_user_pages() works and our code uses it instead of mlock(). 
From greg at kroah.com Sun Apr 24 18:03:51 2005 From: greg at kroah.com (Greg KH) Date: Sun, 24 Apr 2005 18:03:51 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <426C151F.3000407@ammasso.com> References: <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <20050424205309.GA5386@kroah.com> <426C151F.3000407@ammasso.com> Message-ID: <20050425010351.GA21246@kroah.com> On Sun, Apr 24, 2005 at 04:52:31PM -0500, Timur Tabi wrote: > Greg KH wrote: > > >You don't "support" i386 or ia64 or x86-64 or ppc64 systems? What > >hardware do you support? > > I've never seen or heard of any x86-32 or x86-64 system that supports > hot-swap RAM. I know of at least 1 x86-32 box from a three-letter-named company with this feature that has been shipping for a few _years_ now. That box is pretty much everywhere now, and I know that other versions of it are also quite popular (despite the high cost...) > Our hardware does not support PPC, and our software doesn't support > ia-64. Your hardware is just a pci card, right? Why wouldn't it work on ppc64 and ia64 then? > > And what about the fact that you are aiming to > >get this code into mainline, right? If not, why are you asking here? > >:) > > Well, our primary concern is getting our stuff to work. Since > get_user_pages() doesn't work, but mlock() does, that's what we use. I > don't know how to fix get_user_pages(), and I don't have the time right now > to figure it out. I know that technically mlock() is not the right way to > do it, and so we're not going to be submitting our code for the mainline > until get_user_pages() works and our code uses it instead of mlock(). Wait, what _is_ "your stuff"? The open-ib code? Or some other, private fork? 
Any pointers to this stuff? thanks, greg k-h From timur.tabi at ammasso.com Sun Apr 24 21:12:20 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Sun, 24 Apr 2005 23:12:20 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425010351.GA21246@kroah.com> References: <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <20050424205309.GA5386@kroah.com> <426C151F.3000407@ammasso.com> <20050425010351.GA21246@kroah.com> Message-ID: <426C6E24.5050203@ammasso.com> Greg KH wrote: > I know of at least 1 x86-32 box from a three-letter-named company with > this feature that has been shipping for a few _years_ now. That box is > pretty much everywhere now, and I know that other versions of it are > also quite popular (despite the high cost...) Hmm... Well, I think we were already planning on telling our customers that we don't support hot-swap RAM. Is there a CONFIG option for that feature? > Your hardware is just a pci card, right? Why wouldn't it work on ppc64 > and ia64 then? It's PCI-X, actually, and I don't think we've ever actually plugged it into a PPC box. Isn't Open Firmware support required for all PPC boxes, anyway? Our PCI card is not OF compatible, AFAIK. As for IA64, well, we could support it, but it's not a high enough priority. We do have some CPU-specific code in our driver that we would need to port to IA-64. > Wait, what _is_ "your stuff"? The open-ib code? No, if anything, it's the competition to IB. It's called iWARP (RDMA over TCP/IP), and it's similar to IB except it uses gigabit ethernet instead of whatever hardware IB uses. 
Because we also support RDMA, we have the same problems as OpenIB. However, we would prefer that the kernel support OpenRDMA instead, since it's more generic.

> Or some other, private
> fork? Any pointers to this stuff?

http://ammasso.com/support.html

The current version of the code calls sys_mlock() directly from the driver. We haven't yet released the version that calls mlock().

From roland at topspin.com Mon Apr 25 06:15:10 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 25 Apr 2005 06:15:10 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
Message-ID: <52is2bvvz5.fsf@topspin.com>

    Timur> With mlock(), we don't need to use get_user_pages() at all.
    Timur> Arjan tells me the only time an mlocked page can move is
    Timur> with hot (un)plug of memory, but that isn't supported on
    Timur> the systems that we support. We actually prefer mlock()
    Timur> over get_user_pages(), because if the process dies, the
    Timur> locks automatically go away too.

There actually is another way pages can move, with both get_user_pages() and mlock(): copy-on-write after a fork(). If userspace does a fork(), then all PTEs are marked read-only, and if the original process touches the page after the fork(), a new page will be allocated and mapped at the original virtual address.

This is actually a pretty big pain, because the only good solution seems to be for the kernel to mark these registered regions as VM_DONTCOPY.
Right now this means that driver code ends up monkeying with vm_flags for user vmas. Does it seem reasonable to add a new system call to let userspace mark memory it doesn't want copied into forked processes? Something like long sys_mark_nocopy(unsigned long addr, size_t len, int mark) which would set VM_DONTCOPY if mark != 0, and clear it if mark == 0. A better name would be gratefully accepted... Then to register memory for RDMA, userspace would call sys_mark_nocopy() (with appropriate accounting to handle possibly overlapping regions) and the kernel would call get_user_pages(). The get_user_pages() is of course required because the kernel can't trust userspace to keep the pages locked. mlock() would no longer be necessary. We can trust userspace to call sys_mark_nocopy() as needed, because a process can only hurt itself and its children by misusing the sys_mark_nocopy() call. If this seems reasonable then I can code a patch. - R. From hch at infradead.org Mon Apr 25 06:17:53 2005 From: hch at infradead.org (Christoph Hellwig) Date: Mon, 25 Apr 2005 14:17:53 +0100 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <52is2bvvz5.fsf@topspin.com> References: <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> Message-ID: <20050425131753.GA8860@infradead.org> On Mon, Apr 25, 2005 at 06:15:10AM -0700, Roland Dreier wrote: > Does it seem reasonable to add a new system call to let userspace mark > memory it doesn't want copied into forked processes? Something like > > long sys_mark_nocopy(unsigned long addr, size_t len, int mark) > > which would set VM_DONTCOPY if mark != 0, and clear it if mark == 0. 
> A better name would be gratefully accepted... add a new MAP_DONTCOPY flag and accept it in mmap and mprotect? From haveblue at us.ibm.com Mon Apr 25 06:30:22 2005 From: haveblue at us.ibm.com (Dave Hansen) Date: Mon, 25 Apr 2005 06:30:22 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <426C6E24.5050203@ammasso.com> References: <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <20050424205309.GA5386@kroah.com> <426C151F.3000407@ammasso.com> <20050425010351.GA21246@kroah.com> <426C6E24.5050203@ammasso.com> Message-ID: <1114435822.14501.51.camel@localhost> On Sun, 2005-04-24 at 23:12 -0500, Timur Tabi wrote: > Greg KH wrote: > > I know of at least 1 x86-32 box from a three-letter-named company with > > this feature that has been shipping for a few _years_ now. That box is > > pretty much everywhere now, and I know that other versions of it are > > also quite popular (despite the high cost...) > > Hmm... Well, I think we were already planning on telling our customers that we don't > support hot-swap RAM. Is there a CONFIG option for that feature? The driver to do the ACPI portion of both add and remove is in the kernel today, so it's certainly a feature that's coming relatively soon. There is a large variety of x86_64, ppc64, ia64 and ia64 hardware that will be doing memory hotplug. I believe that every POWER5 system is capable of supporting it, at least virtually. I don't think your concerns end with memory hotplug. The same approaches to moving memory around will be used for NUMA memory balancing and for memory defragmentation. Can you say that your cards will never be used on a system which has memory which becomes fragmented? 
-- Dave From hozer at hozed.org Mon Apr 25 06:31:31 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Mon, 25 Apr 2005 08:31:31 -0500 Subject: [openib-general] IBM eHCA Device Driver for gen1 IB stack In-Reply-To: <52ll7d2y9u.fsf@topspin.com> References: <52ll7d2y9u.fsf@topspin.com> Message-ID: <20050425133131.GT999@kalmia.hozed.org> On Wed, Apr 20, 2005 at 09:44:45AM -0700, Roland Dreier wrote: > > Hi, we've just released the first linux device driver for > > the IBM eServer HCA for Power5. It's gen1 based and runs > > on SLES9 SP1. Main testvehicle for this code was IPoIB. > > > gen2 and full userspace support will be next. > > Excellent, I'm glad to see this released. I'm looking forward to > seeing the gen2 support. > > If I may make a small suggestion for future releases: please have the > tar file contain a top-level directory like ehca-0021, with everything > contained in that directory. It's a little annoying to unpack a tar > file and have it spread 5 files in your working directory, especially > when some have generic names like "INSTALL" or "patches." What will it take to get the Gen2 support into the openib.org subversion tree? How much is the low-level driver likely to change once it's written and working? From panda at cse.ohio-state.edu Mon Apr 25 07:00:44 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Mon, 25 Apr 2005 10:00:44 -0400 (EDT) Subject: [openib-general] Annoucing the release of OSU MVAPICH 0.9.5 Message-ID: <200504251400.j3PE0jkL009406@xi.cse.ohio-state.edu> The MVAPICH (MPI over InfiniBand) team at the Ohio State University is pleased to announce the release of MVAPICH 0.9.5 for multiple platforms (EM64T, G5, IA-32, IA-64, and Opteron) and network interfaces (PCI-X and PCI-Express-including the new mem-free cards). MVAPICH 0.9.5 is being distributed as a single integrated package (with the latest MPICH 1.2.6 and MVICH). It can be downloaded with a `single click' and installed. It is available under BSD license. 
MVAPICH/MVAPICH2 software is being used by more than 200 organizations world-wide (in 26 countries) to extract the potential of InfiniBand networking technology for designing high-end computing systems and servers. It is also being distributed by many IBA vendors in their software distributions.

The current version (MVAPICH 0.9.5) provides support for the VAPI layer. As indicated below, an implementation of MVAPICH 0.9.5 on the OpenIB Gen2 interface will be available soon.

This new release has the following features:

- multi-rail support (multiple adapters per node and multiple ports per adapter)
- optimized intra-node shared memory support (both for bus-based and NUMA-based systems)
- enhanced MPI broadcast support with IBA hardware-based multicast
- flexible mechanisms for minimizing memory resource usage on large scale clusters
- support for TotalView debugger
- optimized and tuned for the above platforms and different network interfaces (PCI-X and PCI-Express)
- single code base for all of the above platforms

Other features of this release include:

- Excellent performance: MVAPICH 0.9.5 with multi-rail (1-NIC, 2-port) delivers 4.0 microsec latency, up to 1498 MB/sec unidirectional bandwidth, and up to 2704 MB/sec bidirectional bandwidth on an EM64T system with PCI-Express. Detailed performance numbers for other platforms are available on the project's web page.
- An enhanced and detailed `User and Tuning Guide' to assist users:
    - to install this package on different platforms with different options
    - to vary different parameters of the MPI installation to extract maximum performance and achieve scalability, especially on large-scale systems.

You are welcome to download the MVAPICH 0.9.5 package and access relevant information from the following URL:

http://nowlab.cis.ohio-state.edu/projects/mpi-iba/

Since the 0.9.4 release, we have introduced a set of patches based on user feedback.
If you plan to continue using 0.9.4 for some more time, we strongly encourage you to download and apply these patches to your current installation.

Our upcoming releases include:

- an OpenIB Gen2 version of MVAPICH 0.9.5
- MVAPICH2 0.6.5 with uDAPL support to run on different networks with uDAPL interface

All feedback, including bug reports and hints for performance tuning, is welcome. Please send an e-mail to mvapich-help at cse.ohio-state.edu.

Thanks,

MVAPICH Team at OSU/NBCL

From roland at topspin.com Mon Apr 25 07:16:23 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 25 Apr 2005 07:16:23 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <20050425131753.GA8860@infradead.org> (Christoph Hellwig's message of "Mon, 25 Apr 2005 14:17:53 +0100")
References: <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425131753.GA8860@infradead.org>
Message-ID: <523btfvt54.fsf@topspin.com>

    Roland> Does it seem reasonable to add a new system call to let
    Roland> userspace mark memory it doesn't want copied into forked
    Roland> processes? Something like

    Roland> long sys_mark_nocopy(unsigned long addr, size_t len, int
    Roland> mark)

    Roland> which would set VM_DONTCOPY if mark != 0, and clear it if
    Roland> mark == 0. A better name would be gratefully accepted...

    Christoph> add a new MAP_DONTCOPY flag and accept it in mmap and
    Christoph> mprotect?

That is much better, thanks. But I think it would need to be PROT_DONTCOPY to work with mprotect(), right?

 - R.
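The copy-on-write behaviour that motivates VM_DONTCOPY in this thread can be observed from plain userspace. The following is only a sketch under stated assumptions: it uses a private anonymous mapping rather than registered RDMA memory, the function name child_view_after_parent_write() is made up for illustration, and it does not use sys_mark_nocopy(), MAP_DONTCOPY or PROT_DONTCOPY, none of which exist as of this discussion. It shows that once the parent writes to the page after fork(), the child no longer sees the parent's data, i.e. the two processes have stopped sharing the page.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: store `before` in a private page, fork, have the
 * parent overwrite it with `after`, then return the byte the child reads.
 * Returns -1 on setup failure. `before` must not be 255, because the
 * child uses exit status 255 to signal its own errors.
 */
int child_view_after_parent_write(unsigned char before, unsigned char after)
{
    long page_size = sysconf(_SC_PAGESIZE);
    unsigned char *page;
    int sync_pipe[2];
    int status = 0;
    char token = 0;
    ssize_t wrote;
    pid_t pid;

    assert(before != 255);
    if (page_size <= 0)
        return -1;

    page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    page[0] = before;               /* fault the page in before fork() */

    if (pipe(sync_pipe) != 0) {
        munmap(page, (size_t)page_size);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(sync_pipe[0]);
        close(sync_pipe[1]);
        munmap(page, (size_t)page_size);
        return -1;
    }

    if (pid == 0) {
        /* Child: wait until the parent has written, then report our view. */
        close(sync_pipe[1]);
        if (read(sync_pipe[0], &token, 1) != 1)
            _exit(255);
        _exit(page[0]);
    }

    /* Parent: this write triggers copy-on-write for the shared page. */
    close(sync_pipe[0]);
    page[0] = after;
    wrote = write(sync_pipe[1], &token, 1);
    close(sync_pipe[1]);

    if (waitpid(pid, &status, 0) != pid)
        status = -1;
    munmap(page, (size_t)page_size);

    if (wrote != 1 || status < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The child still reads the old value, which is harmless for ordinary memory. The trouble discussed in this thread is that a device may keep writing to the physical page that was registered before the fork(), while the parent's virtual address now points at a fresh copy; userspace cannot see which process kept the original physical page, so this sketch only demonstrates the split, not the lost DMA data.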
From caitlin.bestler at gmail.com Mon Apr 25 07:43:22 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Mon, 25 Apr 2005 07:43:22 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <52is2bvvz5.fsf@topspin.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> Message-ID: <469958e00504250743340ff9e9@mail.gmail.com> On 4/25/05, Roland Dreier wrote: > Timur> With mlock(), we don't need to use get_user_pages() at all. > Timur> Arjan tells me the only time an mlocked page can move is > Timur> with hot (un)plug of memory, but that isn't supported on > Timur> the systems that we support. We actually prefer mlock() > Timur> over get_user_pages(), because if the process dies, the > Timur> locks automatically go away too. > > There actually is another way pages can move, with both > get_user_pages() and mlock(): copy-on-write after a fork(). If > userspace does a fork(), then all PTEs are marked read-only, and if > the original process touches the page after the fork(), a new page > will be allocated and mapped at the original virtual address. > > This is actually a pretty big pain, because the only good solution > seems to be for the kernel to mark these registered regions as > VM_DONTCOPY. Right now this means that driver code ends up monkeying > with vm_flags for user vmas. > > Does it seem reasonable to add a new system call to let userspace mark > memory it doesn't want copied into forked processes? Something like > > long sys_mark_nocopy(unsigned long addr, size_t len, int mark) > > which would set VM_DONTCOPY if mark != 0, and clear it if mark == 0. 
> A better name would be gratefully accepted... > > Then to register memory for RDMA, userspace would call > sys_mark_nocopy() (with appropriate accounting to handle possibly > overlapping regions) and the kernel would call get_user_pages(). The > get_user_pages() is of course required because the kernel can't trust > userspace to keep the pages locked. mlock() would no longer be > necessary. We can trust userspace to call sys_mark_nocopy() as > needed, because a process can only hurt itself and its children by > misusing the sys_mark_nocopy() call. > > If this seems reasonable then I can code a patch. > Who is responsible for counting within a process, and then between processes (in case shared memory is being registered)? The application? Middleware? Driver? My concern here is that the application layer may not be fully aware when middleware is registering memory, and middleware may not be fully aware when the memory it receives from the application is shared with another process. From roland at topspin.com Mon Apr 25 08:34:06 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 25 Apr 2005 08:34:06 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <469958e00504250743340ff9e9@mail.gmail.com> (Caitlin Bestler's message of "Mon, 25 Apr 2005 07:43:22 -0700") References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <469958e00504250743340ff9e9@mail.gmail.com> Message-ID: <52y8b6vpjl.fsf@topspin.com> Caitlin> Who is responsible for counting within a process, and Caitlin> then between processes (in case shared memory is being Caitlin> registered)? The application? Middleware? Driver? 
The verbs code doing the registration should do it as part of the registration. Shared memory does not cause any additional issues because it is mapped into the virtual memory map of each process and must be marked VM_DONTCOPY in each process separately.

 - R.

From caitlin.bestler at gmail.com Mon Apr 25 08:49:40 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Mon, 25 Apr 2005 08:49:40 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <52y8b6vpjl.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <469958e00504250743340ff9e9@mail.gmail.com> <52y8b6vpjl.fsf@topspin.com>
Message-ID: <469958e005042508494a6cc9c9@mail.gmail.com>

That leaves a problem when the same memory region is registered with different vendors: Verbs A marks the area, Verbs B sees that it is already marked, and Verbs A unmarks the area when it is done, not knowing that B is relying on the memory staying pinned.

I do not believe there is a solution to this problem when working at arm's length from Linux other than documenting the problem and informing applications of workarounds required when using multiple vendors concurrently with the same memory (i.e., destroy the most recently created memory region first, or pin the memory yourself before creating the first memory region).

The only other alternative is to make the pinning some sort of shared service that would apply across multiple vendors. That is doable, but might not be worthwhile given that a single process using multiple vendor devices concurrently is decidedly the exception. But those users deserve at least a warning.
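To make the "shared service" alternative concrete, here is a rough sketch of per-page reference counting, which is one way the Verbs A / Verbs B scenario above could be handled. Everything in it is hypothetical: struct pin_table, pin_range(), unpin_range(), is_pinned() and the fixed MAX_PAGES bound are invented for illustration and do not correspond to any existing kernel or verbs interface.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PAGES 1024          /* hypothetical bound, for the sketch only */

/* One counter per page: how many registrations currently cover it. */
struct pin_table {
    unsigned int refs[MAX_PAGES];
};

/*
 * Record a registration of pages [first, first + count).
 * Returns how many pages went from unpinned to pinned, i.e. the pages
 * that would actually need to be marked.
 */
size_t pin_range(struct pin_table *t, size_t first, size_t count)
{
    size_t i, newly_pinned = 0;

    assert(first <= MAX_PAGES && count <= MAX_PAGES - first);
    for (i = first; i < first + count; i++) {
        if (t->refs[i]++ == 0)
            newly_pinned++;
    }
    return newly_pinned;
}

/*
 * Drop a registration of pages [first, first + count).
 * Returns how many pages dropped to zero users, i.e. the pages that
 * may really be unmarked. Pages still covered by another registration
 * stay pinned.
 */
size_t unpin_range(struct pin_table *t, size_t first, size_t count)
{
    size_t i, released = 0;

    assert(first <= MAX_PAGES && count <= MAX_PAGES - first);
    for (i = first; i < first + count; i++) {
        if (t->refs[i] == 0)
            continue;           /* unbalanced unpin: ignore */
        if (--t->refs[i] == 0)
            released++;
    }
    return released;
}

/* Returns 1 if at least one registration still covers the page. */
int is_pinned(const struct pin_table *t, size_t page)
{
    assert(page < MAX_PAGES);
    return t->refs[page] != 0;
}
```

With a table like this shared by both vendors' libraries, Verbs A unregistering first would only release the pages that B does not also cover. Without a common table, each library sees only its own count, which is exactly the problem described above.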
On 4/25/05, Roland Dreier wrote: > Caitlin> Who is responsible for counting within a process, and > Caitlin> then between processes (in case shared memory is being > Caitlin> registered)? The application? Middleware? Driver? > > The verbs code doing the registration should do it as part of the > registration. Shared memory does not cause any additional issues > because it is mapped into the virtual memory map of each process and > must be marked VM_DONTCOPY in each process separately. > > - R. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From tduffy at sun.com Mon Apr 25 10:09:00 2005 From: tduffy at sun.com (Tom Duffy) Date: Mon, 25 Apr 2005 10:09:00 -0700 Subject: [openib-general] [PATCH][SDP] fix panic when cat'ing /proc/net/sdp/conn_main Message-ID: <1114448940.13354.8.camel@duffman> If you start up a something like ./ttcp.aio.x -r -l 65536 -a 20 with no SM running on your subnet, and then cat /proc/net/sdp/conn_main, you will panic: Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP: {:ib_sdp:sdp_proc_dump_conn_main+469} PGD 33943067 PUD 338ad067 PMD 0 Oops: 0000 [1] SMP CPU 0 Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa md5 ipv6 parport_pc lp parport autofs4 nfs lockd rfcomm l2cap bluetooth pcmcia yenta_socket rsrc_nonstatic pcmcia_core sunrpc ext3 jbd dm_mod video container button battery ac ohci_hcd i2c_amd756 i2c_core ib_mthca ib_mad ib_core tg3 floppy xfs exportfs mptscsih mptbase sd_mod scsi_mod Pid: 5548, comm: cat Not tainted 2.6.11.7openib RIP: 0010:[] {:ib_sdp:sdp_proc_dump_conn_main+469} RSP: 0018:ffff8100778cbd78 EFLAGS: 00010056 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff882c24f0 RDI: ffff810033f9418a RBP: 000000000000018a R08: 0000000000000000 R09: 0000000000000000 
R10: 0000000000000000 R11: ffff81003a9219c0 R12: 0000000000000000
R13: ffff810033f94000 R14: 0000000000000400 R15: ffff8100778cbe98
FS: 00002aaaaaad4b00(0000) GS:ffffffff8047dc00(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000028 CR3: 000000003385d000 CR4: 00000000000006e0
Process cat (pid: 5548, threadinfo ffff8100778ca000, task ffff81007f3bcf50)
Stack: ffff8100778cbddc ffff81003b402010 ffff81003bbfb9b0 ffff81003bb6a940 ffff81003bb6a940 0000000000000292 0000000000000292 ffffffff8016be89 ffff8100000015a5 0000000000000000
Call Trace:{do_no_page+729} {:ib_sdp:sdp_proc_read_parse+37} {proc_file_read+227} {vfs_read+229} {sys_read+83} {system_call+126}

After this patch:

[root at sins-stinger-10 ~]# cat /proc/net/sdp/conn_main
dst address:port src address:port ID comm_id pid dst guid src guid dlid slid dqpn sqpn data sent buff'd data rcvd_buff'd data written data read src_serv snk_serv
---------------- ---------------- ---- -------- ---- ---------------- ---------------- ---- ---- ------ ------ ---------------- ---------------- ---------------- ---------------- -------- --------
00.00.00.00:0000 00.00.00.00:1389 0000 00000000 155a 0000000000000000 0000000000000000 0000 0000 000000 000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 00000000 00000000

Signed-off-by: Tom Duffy

Index: linux-2.6.11-openib/drivers/infiniband/ulp/sdp/sdp_conn.c
===================================================================
--- linux-2.6.11-openib/drivers/infiniband/ulp/sdp/sdp_conn.c (revision 2207)
+++ linux-2.6.11-openib/drivers/infiniband/ulp/sdp/sdp_conn.c (working copy)
@@ -1384,7 +1384,7 @@ int sdp_proc_dump_conn_main(char *buffer
 			      ((conn->src_addr >> 24) & 0xff),
 			      conn->src_port,
 			      conn->hashent,
-			      conn->cm_id->local_id,
+			      conn->cm_id ? conn->cm_id->local_id : 0,
 			      conn->pid,
 			      (u32)((d_guid >> 32) & 0xffffffff),
 			      (u32)(d_guid & 0xffffffff),

From adi at hexapodia.org Mon Apr 25 12:11:11 2005
From: adi at hexapodia.org (Andy Isaacson)
Date: Mon, 25 Apr 2005 12:11:11 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <20050423194421.4f0d6612.akpm@osdl.org>
References: <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org>
Message-ID: <20050425191111.GC2511@hexapodia.org>

On Sat, Apr 23, 2005 at 07:44:21PM -0700, Andrew Morton wrote:
> Timur Tabi wrote:
> > As I said, the testcase only works with our hardware, and it's also
> > very large. It's one small test that's part of a huge test suite.
> > It takes a couple hours just to install the damn thing.
> >
> > We want to produce a simpler test case that demonstrates the problem in an
> > easy-to-understand manner, but we don't have time to do that now.
>
> If your theory is correct then it should be able to demonstrate this
> problem without any special hardware at all: pin some user memory, then
> generate memory pressure then check the contents of those pinned pages.
>
> But if, for the DMA transfer, you're using the array of page*'s which were
> originally obtained from get_user_pages() then it's rather hard to see how
> the kernel could alter the page's contents.
>
> Then again, if mlock() fixes it then something's up. Very odd.

Andrew,

Libor Michalek posted a much more reasonable (to my limited understanding) bug description in <20050412180447.E6958 at topspin.com>. (And I'd love to provide a URL, but damned if I can figure out how to find that message on gmane. Clue-bat applications gladly accepted.)
Libor Michalek wrote:
# The driver did use get_user_pages() to elevate the refcount on all the
# pages it was going to use for IO, as well as call set_page_dirty() since
# the pages were going to have data written to them from the device.
#
# The problem we were seeing is that the minor fault by the app resulted
# in a new physical page getting mapped for the application. The page that
# had the elevated refcount was still waiting for the data to be written
# to by the driver at the time that the app accessed the page causing the
# minor fault. Obviously since the app had a new mapping the data written
# by the driver was lost.
#
# It looks like code was added to try_to_unmap_one() to address this, so
# hopefully it's no longer an issue...

Which makes me think that Timur's bug is just an insufficiently-understood
version of Libor's.

-andy

From akpm at osdl.org Mon Apr 25 13:54:01 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 13:54:01 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <52is2bvvz5.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org>
 <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org>
 <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org>
 <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org>
 <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org>
 <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org>
 <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
Message-ID: <20050425135401.65376ce0.akpm@osdl.org>

Roland Dreier wrote:
>
> Timur> With mlock(), we don't need to use get_user_pages() at all.
> Timur> Arjan tells me the only time an mlocked page can move is
> Timur> with hot (un)plug of memory, but that isn't supported on
> Timur> the systems that we support.
We actually prefer mlock() > Timur> over get_user_pages(), because if the process dies, the > Timur> locks automatically go away too. > > There actually is another way pages can move, with both > get_user_pages() and mlock(): copy-on-write after a fork(). If > userspace does a fork(), then all PTEs are marked read-only, and if > the original process touches the page after the fork(), a new page > will be allocated and mapped at the original virtual address. Do we care about that? A straightforward scenario under which this can happen is: a) app starts some read I/O in an asynchronous manner b) app forks c) child writes to one of the pages which is still under read I/O d) the read I/O completes e) the child is left with the old data plus the child's modification instead of the new data which is a very silly application which is giving itself unpredictable memory contents anyway. I assume there's a more sensible scenario? From roland at topspin.com Mon Apr 25 14:12:40 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 25 Apr 2005 14:12:40 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425135401.65376ce0.akpm@osdl.org> (Andrew Morton's message of "Mon, 25 Apr 2005 13:54:01 -0700") References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> Message-ID: <521x8yv9vb.fsf@topspin.com> Andrew> Do we care about that? 
A straightforward scenario under
    Andrew> which this can happen is:

    Andrew> a) app starts some read I/O in an asynchronous manner
    Andrew> b) app forks
    Andrew> c) child writes to one of the pages which is still under read I/O
    Andrew> d) the read I/O completes
    Andrew> e) the child is left with the old data plus the child's modification instead
    Andrew>    of the new data

    Andrew> which is a very silly application which is giving itself
    Andrew> unpredictable memory contents anyway.

    Andrew> I assume there's a more sensible scenario?

You're right, that is a silly scenario ;) In fact if we mark vmas
with VM_DONTCOPY, then the child just crashes with a seg fault.

The type of thing I'm worried about is something like, for example:

 a) app registers memory region with RDMA hardware -- in other words,
    loads the device's translation table for future I/O
 b) app forks
 c) app writes to the registered memory region, and the kernel breaks
    the COW for the (now read-only) page by mapping a new page
 d) app starts an I/O that will do a DMA read from the region
 e) device reads using the wrong, old mapping

This can be pretty insidious because for example fork() + immediate
exec() or just using system() still leaves the parent with PTEs marked
read-only. If an application does overlapping memory registrations so
get_user_pages() is called a lot, then as far as I can see
can_share_swap_page() will always return 0 and the COW will happen
even if the child process has thrown out its original vmas.

Or if the counts are in the correct range, then there's a small window
between fork() and exec() where the parent process can screw itself
up, so most of the time the app works, until it doesn't.

 - R.

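The VM_DONTCOPY remark above can be tried from userspace. What follows is only a minimal sketch, assuming a Linux kernel that offers madvise(MADV_DONTFORK), an interface that postdates this thread and asks the kernel not to copy a mapping into children. The helper name child_faults_on_dontfork_buffer() is made up for illustration; it is not part of any InfiniBand code.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if a forked child is killed by SIGSEGV when it touches a
 * buffer the parent marked MADV_DONTFORK, 0 if the child could read it,
 * and -1 on any setup failure.
 */
int child_faults_on_dontfork_buffer(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *buf;
	pid_t pid;
	int status;
	int parent_intact;

	if (page <= 0)
		return -1;

	buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;
	memset(buf, 0x5a, (size_t)page);

	/* Ask the kernel not to copy this mapping into children. */
	if (madvise(buf, (size_t)page, MADV_DONTFORK) != 0) {
		munmap(buf, (size_t)page);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		munmap(buf, (size_t)page);
		return -1;
	}
	if (pid == 0) {
		/* Child: make sure SIGSEGV has its default (fatal) action. */
		signal(SIGSEGV, SIG_DFL);
		/* The range is not mapped in the child, so this read faults. */
		volatile char c = *(volatile char *)buf;
		(void)c;
		_exit(0);
	}

	if (waitpid(pid, &status, 0) != pid) {
		munmap(buf, (size_t)page);
		return -1;
	}

	/* The parent's own mapping is untouched by the fork. */
	parent_intact = ((unsigned char)buf[0] == 0x5a);
	munmap(buf, (size_t)page);
	if (!parent_intact)
		return -1;

	return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

Because the range is never shared with the child, the parent's pages are not marked copy-on-write by the fork, which is the property the a) through e) scenario depends on. Whether that is an acceptable answer for registered memory is exactly what the thread is debating.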
From caitlin.bestler at gmail.com Mon Apr 25 14:42:55 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Mon, 25 Apr 2005 14:42:55 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <521x8yv9vb.fsf@topspin.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> Message-ID: <469958e005042514421b57c833@mail.gmail.com> On 4/25/05, Roland Dreier wrote: > Andrew> Do we care about that? A straightforward scenario under > Andrew> which this can happen is: > > Andrew> a) app starts some read I/O in an asynchronous manner > Andrew> b) app forks > Andrew> c) child writes to one of the pages which is still under read I/O > Andrew> d) the read I/O completes > Andrew> e) the child is left with the old data plus the child's modification instead > Andrew> of the new data > > Andrew> which is a very silly application which is giving itself > Andrew> unpredictable memory contents anyway. > > Andrew> I assume there's a more sensible scenario? > > You're right, that is a silly scenario ;) In fact if we mark vmas > with VM_DONTCOPY, then the child just crashes with a seg fault. 
> > The type of thing I'm worried about is something like, for example: > > a) app registers memory region with RDMA hardware -- in other words, > loads the device's translation table for future I/O > b) app forks > c) app writes to the registered memory region, and the kernel breaks > the COW for the (now read-only) page by mapping a new page > d) app starts an I/O that will do a DMA read from the region > e) device reads using the wrong, old mapping > > This can be pretty insiduous because for example fork() + immediate > exec() or just using system() still leaves the parent with PTEs marked > read-only. If an application does overlapping memory registrations so > get_user_pages() is called a lot, then as far as I can see > can_share_swap_page() will always return 0 and the COW will happen > even if the child process has thrown out its original vmas. > > Or if the counts are in the correct range, then there's a small window > between fork() and exec() where the parent process can screw itself > up, so most of the time the app works, until it doesn't. > Every RDMA related interface specification that I know of specifically excludes support of RDMA resources being inherited by child processes, with the warning that excellent implementations will give the child process an error for attempting to use the parent's RDMA resources. More streamlined implementations will simply be unpredictable. As for forking while the parent has a pending read: since the parent has not reaped the completion at the time of the fork the buffers in question are undefined. The child's buffers will be consistent, that is they are undefined. 
From akpm at osdl.org Mon Apr 25 15:14:59 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 25 Apr 2005 15:14:59 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <521x8yv9vb.fsf@topspin.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> Message-ID: <20050425151459.1f5fb378.akpm@osdl.org> Roland Dreier wrote: > > Andrew> Do we care about that? A straightforward scenario under > Andrew> which this can happen is: > > Andrew> a) app starts some read I/O in an asynchronous manner > Andrew> b) app forks > Andrew> c) child writes to one of the pages which is still under read I/O > Andrew> d) the read I/O completes > Andrew> e) the child is left with the old data plus the child's modification instead > Andrew> of the new data > > Andrew> which is a very silly application which is giving itself > Andrew> unpredictable memory contents anyway. > > Andrew> I assume there's a more sensible scenario? > > You're right, that is a silly scenario ;) In fact if we mark vmas > with VM_DONTCOPY, then the child just crashes with a seg fault. > > The type of thing I'm worried about is something like, for example: > > a) app registers memory region with RDMA hardware -- in other words, > loads the device's translation table for future I/O Whoa, hang on. The way we expect get_user_pages() to be used is that the kernel will use get_user_pages() once per application I/O request. 
Are you saying that RDMA clients will semi-permanently own pages which were pinned by get_user_pages()? That those pages will be used for multiple separate I/O operations? If so, then that's a significant design departure and it would be good to hear why it is necessary. > b) app forks > c) app writes to the registered memory region, and the kernel breaks > the COW for the (now read-only) page by mapping a new page > d) app starts an I/O that will do a DMA read from the region > e) device reads using the wrong, old mapping Sure. But such an app could be declared to be buggy... > This can be pretty insiduous because for example fork() + immediate > exec() or just using system() still leaves the parent with PTEs marked > read-only. If an application does overlapping memory registrations so > get_user_pages() is called a lot, then as far as I can see > can_share_swap_page() will always return 0 and the COW will happen > even if the child process has thrown out its original vmas. > > Or if the counts are in the correct range, then there's a small window > between fork() and exec() where the parent process can screw itself > up, so most of the time the app works, until it doesn't. > > - R. 
From timur.tabi at ammasso.com Mon Apr 25 15:21:28 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 25 Apr 2005 17:21:28 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <20050425151459.1f5fb378.akpm@osdl.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org>
 <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org>
 <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org>
 <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org>
 <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org>
 <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org>
 <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
 <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com>
 <20050425151459.1f5fb378.akpm@osdl.org>
Message-ID: <426D6D68.6040504@ammasso.com>

Andrew Morton wrote:

> The way we expect get_user_pages() to be used is that the kernel will use
> get_user_pages() once per application I/O request.
>
> Are you saying that RDMA clients will semi-permanently own pages which were
> pinned by get_user_pages()? That those pages will be used for multiple
> separate I/O operations?

Yes, absolutely!

The memory buffer is allocated by the process (usually just via malloc) and
registered/pinned by the driver. It then stays pinned for the life of the
process (typically).

> If so, then that's a significant design departure and it would be good to
> hear why it is necessary.

That's just how RDMA works. Once the memory is pinned, if the app wants to
send data to another node, it does two things:

1) Puts the data into its buffer
2) Sends a "work request" to the driver with (among other things) the offset
   and length of the data.

This is a time-critical operation. It must occur as fast as possible, which
means the memory must have already been pinned.

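The register-once, post-many pattern described above can be sketched in a few lines of plain C. This is not the Ammasso driver interface or any real verbs API: struct registered_region, struct work_request and make_work_request() are hypothetical names, and the registration (pinning) step is assumed to have happened earlier. The point is only that the per-send path does bounds arithmetic and nothing else.

```c
#include <stddef.h>

/* Hypothetical: a buffer that was registered (pinned) once, up front. */
struct registered_region {
	void *base;	/* start of the pinned buffer */
	size_t len;	/* length of the pinned buffer in bytes */
};

/* Hypothetical: what the fast path hands to the driver for each send. */
struct work_request {
	size_t offset;	/* where the data starts inside the region */
	size_t length;	/* how many bytes to send */
};

/*
 * Build a work request for data already placed in the region.
 * Returns 0 on success, -1 if the range does not fit in the region.
 * No pinning or page-table work happens here; that cost was paid at
 * registration time, which is why the memory stays pinned.
 */
int make_work_request(const struct registered_region *mr,
		      size_t offset, size_t length,
		      struct work_request *wr)
{
	/* Written this way so offset + length cannot overflow. */
	if (offset > mr->len || length > mr->len - offset)
		return -1;

	wr->offset = offset;
	wr->length = length;
	return 0;
}
```

A caller would fill the buffer, call make_work_request(), and pass the result to whatever submission mechanism the driver provides.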
-- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com One thing a Southern boy will never say is, "I don't think duct tape will fix it." -- Ed Smylie, NASA engineer for Apollo 13 From timur.tabi at ammasso.com Mon Apr 25 15:23:54 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 25 Apr 2005 17:23:54 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425151459.1f5fb378.akpm@osdl.org> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> Message-ID: <426D6DFA.4090908@ammasso.com> Andrew Morton wrote: > The way we expect get_user_pages() to be used is that the kernel will use > get_user_pages() once per application I/O request. Are you saying that the mapping obtained by get_user_pages() is valid only within the context of the IOCtl call? That once the driver returns from the IOCtl, the mapping should no longer be used? -- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com One thing a Southern boy will never say is, "I don't think duct tape will fix it." 
-- Ed Smylie, NASA engineer for Apollo 13 From akpm at osdl.org Mon Apr 25 15:32:56 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 25 Apr 2005 15:32:56 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <426D6D68.6040504@ammasso.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> Message-ID: <20050425153256.3850ee0a.akpm@osdl.org> Timur Tabi wrote: > > Andrew Morton wrote: > > > The way we expect get_user_pages() to be used is that the kernel will use > > get_user_pages() once per application I/O request. > > > > Are you saying that RDMA clients will semi-permanently own pages which were > > pinned by get_user_pages()? That those pages will be used for multiple > > separate I/O operations? > > Yes, absolutely! > > The memory buffer is allocated by the process (usually just via malloc) and > registed/pinned by the driver. It then stays pinned for the life of the process (typically). ug. What stops the memory from leaking if the process exits? I hope this is a privileged operation? > > If so, then that's a significant design departure and it would be good to > > hear why it is necessary. > > That's just how RMDA works. 
Once the memory is pinned, if the app wants to send data to > another node, it does two things: > > 1) Puts the data into its buffer > 2) Sends a "work request" to the driver with (among other things) the offset and length of > the data. > > This is a time-critical operation. It must occurs as fast as possible, which means the > memory must have already been pinned. It would be better to obtain this memory via a mmap() of some special device node, so we can perform appropriate permission checking and clean everything up on unclean application exit. From akpm at osdl.org Mon Apr 25 15:35:42 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 25 Apr 2005 15:35:42 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <426D6DFA.4090908@ammasso.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6DFA.4090908@ammasso.com> Message-ID: <20050425153542.70197e6a.akpm@osdl.org> Timur Tabi wrote: > > Andrew Morton wrote: > > > The way we expect get_user_pages() to be used is that the kernel will use > > get_user_pages() once per application I/O request. > > Are you saying that the mapping obtained by get_user_pages() is valid only within the > context of the IOCtl call? That once the driver returns from the IOCtl, the mapping > should no longer be used? 
Yes, we expect that all the pages which get_user_pages() pinned will become unpinned within the context of the syscall which pinned the pages. Or shortly after, in the case of async I/O. This is because there is no file descriptor or anything else associated with the pages which permits the kernel to clean stuff up on unclean application exit. Also there are the obvious issues with permitting pinning of unbounded amounts of memory. From timur.tabi at ammasso.com Mon Apr 25 15:42:36 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 25 Apr 2005 17:42:36 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425153542.70197e6a.akpm@osdl.org> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6DFA.4090908@ammasso.com> <20050425153542.70197e6a.akpm@osdl.org> Message-ID: <426D725C.4070103@ammasso.com> Andrew Morton wrote: > This is because there is no file descriptor or anything else associated > with the pages which permits the kernel to clean stuff up on unclean > application exit. Also there are the obvious issues with permitting > pinning of unbounded amounts of memory. Then that might explain the "bug" that we're seeing with get_user_pages(). We've been assuming that get_user_pages() mappings are permanent. Well, I was just about to re-implement get_user_pages() support in our driver to demonstrate the bug. I guess I'll hold off on that. 
If you look at the Infiniband code that was recently submitted, I think
you'll see it does exactly that: after calling mlock(), the driver calls
get_user_pages(), and it stores the page mappings for future use.

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
     -- Ed Smylie, NASA engineer for Apollo 13

From robert.j.woodruff at intel.com Mon Apr 25 15:51:03 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Mon, 25 Apr 2005 15:51:03 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation
In-Reply-To: <20050425153542.70197e6a.akpm@osdl.org>
Message-ID:

Andrew Morton wrote,
>Yes, we expect that all the pages which get_user_pages() pinned will become
>unpinned within the context of the syscall which pinned the pages. Or
>shortly after, in the case of async I/O.

>This is because there is no file descriptor or anything else associated
>with the pages which permits the kernel to clean stuff up on unclean
>application exit. Also there are the obvious issues with permitting
>pinning of unbounded amounts of memory.

There definitely needs to be a mechanism to prevent people from pinning
too much memory. We saw issues in the sourceforge stack and some of the
vendors' stacks where we could lock memory till the system hung.
In the sourceforge InfiniBand stack, we put in a check to make sure
that people did not pin too much memory. It was sort of a crude,
brute-force mechanism, but effective. I think that we limited people
from locking down more than 1/2 of kernel memory or 70% of all memory
(it was tunable with a module option), and if they exceeded the limit,
their requests to register memory would begin to fail.
Arlin can provide details on how we did it or people can look at
the IBAL code for an example.

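A percentage cap of the kind described above comes down to a small amount of arithmetic. The sketch below is an assumption about how such a check could look, not the IBAL code: pin_limit_pages() and pin_request_allowed() are hypothetical names, and limit_percent stands in for the tunable module option.

```c
#include <stdint.h>

/*
 * Hypothetical: number of pages that may be pinned in total, given a
 * percentage cap (for example 70 for "70% of all memory").
 */
uint64_t pin_limit_pages(uint64_t total_pages, unsigned int limit_percent)
{
	if (limit_percent > 100)
		limit_percent = 100;

	/* Split the multiplication so it cannot overflow 64 bits. */
	return total_pages / 100 * limit_percent +
	       total_pages % 100 * limit_percent / 100;
}

/*
 * Hypothetical: returns 1 if registering request_pages more pages keeps
 * the total at or below the cap, 0 if the registration should fail.
 */
int pin_request_allowed(uint64_t total_pages, uint64_t pinned_pages,
			uint64_t request_pages, unsigned int limit_percent)
{
	uint64_t limit = pin_limit_pages(total_pages, limit_percent);

	if (pinned_pages > limit)
		return 0;

	/* Compare against the remaining headroom to avoid overflow. */
	return request_pages <= limit - pinned_pages;
}
```

With 1000 pages of memory and a 70% cap, a process that already holds 600 pinned pages may register 100 more, and a request for 101 is refused.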
woody

From timur.tabi at ammasso.com Mon Apr 25 16:13:01 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 25 Apr 2005 18:13:01 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation
In-Reply-To:
References:
Message-ID: <426D797D.3000108@ammasso.com>

Bob Woodruff wrote:

> There definitely needs to be a mechanism to prevent people from pinning
> too much memory.

Any limit would have to be very high - definitely more than just half. What
if the application needs to pin 2GB? The customer is not going to buy 4+ GB
of RAM just because Linux doesn't like pinning more than half. In an x86-32
system, that would require PAE support and slow everything down.

Off the top of my head, I'd say Linux would need to allow all but 512MB to be
pinned. So if you have 3GB of RAM, Linux should allow you to pin 2.5GB.

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
-- Ed Smylie, NASA engineer for Apollo 13 From akpm at osdl.org Mon Apr 25 16:13:30 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 25 Apr 2005 16:13:30 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <426D725C.4070103@ammasso.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6DFA.4090908@ammasso.com> <20050425153542.70197e6a.akpm@osdl.org> <426D725C.4070103@ammasso.com> Message-ID: <20050425161330.32c32b4b.akpm@osdl.org> Timur Tabi wrote: > > Andrew Morton wrote: > > > This is because there is no file descriptor or anything else associated > > with the pages which permits the kernel to clean stuff up on unclean > > application exit. Also there are the obvious issues with permitting > > pinning of unbounded amounts of memory. > > Then that might explain the "bug" that we're seeing with get_user_pages(). We've been > assuming that get_user_pages() mappings are permanent. They are permanent until someone runs put_page() against all the pages. What I'm saying is that all current callers of get_user_pages() _do_ run put_page() within the same syscall or upon I/O termination. > Well, I was just about to re-implement get_user_pages() support in our driver to > demonstrate the bug. I guess I'll hold off on that. 
> > If you look at the Infiniband code that was recently submitted, I think you'll see it does > exactly that: after calling mlock(), the driver calls get_user_pages(), and it stores the > page mappings for future use. Where? bix:/usr/src/linux-2.6.12-rc3> grep -rl get_user_pages . ./arch/i386/lib/usercopy.c ./arch/sparc64/kernel/ptrace.c ./drivers/video/pvr2fb.c ./drivers/media/video/video-buf.c ./drivers/scsi/sg.c ./drivers/scsi/st.c ./include/asm-ia64/pgtable.h ./include/linux/mm.h ./include/asm-um/archparam-i386.h ./include/asm-i386/fixmap.h ./fs/nfs/direct.c ./fs/aio.c ./fs/binfmt_elf.c ./fs/bio.c ./fs/direct-io.c ./kernel/futex.c ./kernel/ptrace.c ./mm/memory.c ./mm/nommu.c ./mm/rmap.c ./mm/mempolicy.c From libor at topspin.com Mon Apr 25 16:17:13 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 25 Apr 2005 16:17:13 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425153542.70197e6a.akpm@osdl.org>; from akpm@osdl.org on Mon, Apr 25, 2005 at 03:35:42PM -0700 References: <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6DFA.4090908@ammasso.com> <20050425153542.70197e6a.akpm@osdl.org> Message-ID: <20050425161713.A9002@topspin.com> On Mon, Apr 25, 2005 at 03:35:42PM -0700, Andrew Morton wrote: > Timur Tabi wrote: > > > > Andrew Morton wrote: > > > > > The way we expect get_user_pages() to be used is that the kernel will use > > > get_user_pages() once per application I/O request. > > > > Are you saying that the mapping obtained by get_user_pages() is valid only within the > > context of the IOCtl call? That once the driver returns from the IOCtl, the mapping > > should no longer be used? 
> > Yes, we expect that all the pages which get_user_pages() pinned will become > unpinned within the context of the syscall which pinned the pages. Or > shortly after, in the case of async I/O. When a network protocol is making use of async I/O the amount of time between posting the read request and getting the completion for that request is unbounded since it depends on the other half of the connection sending some data. In this case the buffer that was pinned during the io_submit() may be pinned, and holding the pages, for a long time. During this time the process might fork, at this point any data received will be placed into the wrong spot. > This is because there is no file descriptor or anything else associated > with the pages which permits the kernel to clean stuff up on unclean > application exit. Also there are the obvious issues with permitting > pinning of unbounded amounts of memory. Correct, the driver must be able to determine that the process has died and clean up after it, so the pinned region in most implementations is associated with an open file descriptor. -Libor From akpm at osdl.org Mon Apr 25 16:17:47 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 25 Apr 2005 16:17:47 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation In-Reply-To: <426D797D.3000108@ammasso.com> References: <426D797D.3000108@ammasso.com> Message-ID: <20050425161747.28b03800.akpm@osdl.org> Timur Tabi wrote: > > Bob Woodruff wrote: > > > There definitely needs to be a mechanism to prevent people from pinning > > too much memory. > > Any limit would have to be very high - definitely more than just half. What if the > application needs to pin 2GB? The customer is not going to buy 4+ GB of RAM just because > Linux doesn't like pinning more than half. In an x86-32 system, that would required PAE > support and slow everything down. > > Off the top of my head, I'd say Linux would need to allow all but 512MB to be pinned. 
So
> you have 3GB of RAM, Linux should allow you to pin 2.5GB.
>

You can pin the whole darn lot *if you have the correct privileges*.

From timur.tabi at ammasso.com Mon Apr 25 16:21:51 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 25 Apr 2005 18:21:51 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <20050425161330.32c32b4b.akpm@osdl.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org>
 <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org>
 <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org>
 <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org>
 <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org>
 <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org>
 <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
 <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com>
 <20050425151459.1f5fb378.akpm@osdl.org> <426D6DFA.4090908@ammasso.com>
 <20050425153542.70197e6a.akpm@osdl.org> <426D725C.4070103@ammasso.com>
 <20050425161330.32c32b4b.akpm@osdl.org>
Message-ID: <426D7B8F.6000903@ammasso.com>

Andrew Morton wrote:

> They are permanent until someone runs put_page() against all the pages.
> What I'm saying is that all current callers of get_user_pages() _do_ run
> put_page() within the same syscall or upon I/O termination.

Oh, okay then. I guess I'll get back to work!

Actually, with RDMA, "I/O termination" technically doesn't happen until the
memory is deregistered. When the memory is registered, all that means is that
it should be pinned and the virtual-to-physical mapping should be stored. No
actual I/O occurs at that point.

>>If you look at the Infiniband code that was recently submitted, I think you'll see it does
>>exactly that: after calling mlock(), the driver calls get_user_pages(), and it stores the
>>page mappings for future
>
> Where?

I was talking about the code that Roland mentioned in the first message of this thread - the user-space verbs support. He said the code calls mlock() and get_user_pages(). FYI, our driver detects the process termination and cleans up everything itself. -- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com One thing a Southern boy will never say is, "I don't think duct tape will fix it." -- Ed Smylie, NASA engineer for Apollo 13 From akpm at osdl.org Mon Apr 25 16:24:05 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 25 Apr 2005 16:24:05 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425161713.A9002@topspin.com> References: <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6DFA.4090908@ammasso.com> <20050425153542.70197e6a.akpm@osdl.org> <20050425161713.A9002@topspin.com> Message-ID: <20050425162405.0889093e.akpm@osdl.org> Libor Michalek wrote: > > On Mon, Apr 25, 2005 at 03:35:42PM -0700, Andrew Morton wrote: > > Timur Tabi wrote: > > > > > > Andrew Morton wrote: > > > > > > > The way we expect get_user_pages() to be used is that the kernel will use > > > > get_user_pages() once per application I/O request. > > > > > > Are you saying that the mapping obtained by get_user_pages() is valid only within the > > > context of the IOCtl call? That once the driver returns from the IOCtl, the mapping > > > should no longer be used? > > > > Yes, we expect that all the pages which get_user_pages() pinned will become > > unpinned within the context of the syscall which pinned the pages. Or > > shortly after, in the case of async I/O. 
> > When a network protocol is making use of async I/O the amount of time > between posting the read request and getting the completion for that > request is unbounded since it depends on the other half of the connection > sending some data. In this case the buffer that was pinned during the > io_submit() may be pinned, and holding the pages, for a long time. Sure. > During > this time the process might fork, at this point any data received will be > placed into the wrong spot. Well the data is placed in _a_ spot. That's only the "wrong" spot because you've defined it to be wrong! IOW: what behaviour are you actually looking for here, and why, and does it matter? > > This is because there is no file descriptor or anything else associated > > with the pages which permits the kernel to clean stuff up on unclean > > application exit. Also there are the obvious issues with permitting > > pinning of unbounded amounts of memory. > > Correct, the driver must be able to determine that the process has died > and clean up after it, so the pinned region in most implementations is > associated with an open file descriptor. How is that association created? 
From akpm at osdl.org Mon Apr 25 16:27:40 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 25 Apr 2005 16:27:40 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <426D7B8F.6000903@ammasso.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6DFA.4090908@ammasso.com> <20050425153542.70197e6a.akpm@osdl.org> <426D725C.4070103@ammasso.com> <20050425161330.32c32b4b.akpm@osdl.org> <426D7B8F.6000903@ammasso.com> Message-ID: <20050425162740.702a171b.akpm@osdl.org> Timur Tabi wrote: > > FYI, our driver detects the process termination and cleans up everything itself. How is this implemented? From robert.j.woodruff at intel.com Mon Apr 25 16:29:34 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Mon, 25 Apr 2005 16:29:34 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation In-Reply-To: <426D797D.3000108@ammasso.com> Message-ID: Timur Tabi wrote, >Any limit would have to be very high - definitely more than just half. What if the >application needs to pin 2GB? The customer is not going to buy 4+ GB of RAM just >because >Linux doesn't like pinning more than half. In an x86-32 system, that would required >PAE >support and slow everything down. >Off the top of my head, I'd say Linux would need to allow all but 512MB to be pinned. >So >you have 3GB of RAM, Linux should allow you to pin 2.5GB. 
That is why we made it tunable, so that people could decide how much to allow. There is probably a better way to do it than some hard limit, but that would take a little more understanding of the VM system than we had, and that is why some of the core kernel folks may be able to help us come up with a better solution. woody From caitlin.bestler at gmail.com Mon Apr 25 16:37:56 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Mon, 25 Apr 2005 16:37:56 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425162405.0889093e.akpm@osdl.org> References: <20050418164316.GA27697@infradead.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6DFA.4090908@ammasso.com> <20050425153542.70197e6a.akpm@osdl.org> <20050425161713.A9002@topspin.com> <20050425162405.0889093e.akpm@osdl.org> Message-ID: <469958e00504251637350cc8c@mail.gmail.com> On 4/25/05, Andrew Morton wrote: > > > > This is because there is no file descriptor or anything else associated > > > with the pages which permits the kernel to clean stuff up on unclean > > > application exit. Also there are the obvious issues with permitting > > > pinning of unbounded amounts of memory. > > > > Correct, the driver must be able to determine that the process has died > > and clean up after it, so the pinned region in most implementations is > > associated with an open file descriptor. > > How is that association created? There is not a file descriptor, but there is an rnic handle. Both DAPL and IT-API require that process death will result in the handle and all of its dependent objects being released. The rnic handle can always be declared to be a "file descriptor" if that makes it follow normal OS conventions more precisely. There is also a need for some form of resource manager to approve creation of Memory Regions.
Obviously you cannot have multiple applications claiming half of physical memory. But if you merely require the user to have root privileges in order to create a Memory Region, and then take a first-come first-served attitude, I don't think you end up with something that is truly a general purpose capability. A general purpose RDMA capability requires the ability to indefinitely pin large portions of user memory. It makes sense to integrate that with OS policy control over resource utilization and to integrate it with memory suspend/resume capabilities so that hotplug memory can be supported. What you can't do is downgrade a Memory Region so that it is no longer a memory region. Doing that means that you are not truly supporting RDMA. From roland at topspin.com Mon Apr 25 16:58:03 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 25 Apr 2005 16:58:03 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425153256.3850ee0a.akpm@osdl.org> (Andrew Morton's message of "Mon, 25 Apr 2005 15:32:56 -0700") References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> Message-ID: <52vf6atnn8.fsf@topspin.com> Andrew> ug. What stops the memory from leaking if the process Andrew> exits? Andrew> I hope this is a privileged operation? I don't think it has to be privileged. 
In my implementation, the driver keeps a per-process list of registered memory regions and unpins/cleans up on process exit. Andrew> It would be better to obtain this memory via a mmap() of Andrew> some special device node, so we can perform appropriate Andrew> permission checking and clean everything up on unclean Andrew> application exit. This seems to interact poorly with how applications want to use RDMA, ie typically through a library interface such as MPI. People doing HPC don't want to recode their apps to use a new allocator, they just want to link to a new MPI library and have the app go fast. - R. From roland at topspin.com Mon Apr 25 17:02:36 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 25 Apr 2005 17:02:36 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425151459.1f5fb378.akpm@osdl.org> (Andrew Morton's message of "Mon, 25 Apr 2005 15:14:59 -0700") References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> Message-ID: <52r7gytnfn.fsf@topspin.com> Andrew> Whoa, hang on. Andrew> The way we expect get_user_pages() to be used is that the Andrew> kernel will use get_user_pages() once per application I/O Andrew> request. Andrew> Are you saying that RDMA clients will semi-permanently own Andrew> pages which were pinned by get_user_pages()? That those Andrew> pages will be used for multiple separate I/O operations? 
Andrew> If so, then that's a significant design departure and it Andrew> would be good to hear why it is necessary. The idea is that applications manage the lifetime of pinned memory regions. They can do things like post multiple I/O operations without any page-walking overhead, or pass a buffer descriptor to a remote host who will send data at some indeterminate time in the future. In addition, InfiniBand has the notion of atomic operations, so a cluster application may be using some memory region to implement a global lock. This might not be the most kernel-friendly design but it is pretty deeply ingrained in the design of RDMA transports like InfiniBand and iWARP (RDMA over IP). I'm also not opposed to implementing some other mechanism to make this work, but the combination of get_user_pages() in the kernel and extending mprotect() to allow setting VM_DONTCOPY seems to work fine. - R. From roland at topspin.com Mon Apr 25 17:04:02 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 25 Apr 2005 17:04:02 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <469958e005042514421b57c833@mail.gmail.com> (Caitlin Bestler's message of "Mon, 25 Apr 2005 14:42:55 -0700") References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <469958e005042514421b57c833@mail.gmail.com> Message-ID: <52mzrmtnd9.fsf@topspin.com> Caitlin> Every RDMA related interface specification that I know of Caitlin> specifically excludes support of RDMA resources being Caitlin> inherited by child processes, with the warning that Caitlin> excellent implementations will give the child process an Caitlin> error for attempting to use the
parent's RDMA resources. Caitlin> More streamlined implementations will simply be Caitlin> unpredictable. Caitlin> As for forking while the parent has a pending read: since Caitlin> the parent has not reaped the completion at the time of Caitlin> the fork the buffers in question are undefined. The Caitlin> child's buffers will be consistent, that is they are Caitlin> undefined. I think you've missed the point: unless a process sets VM_DONTCOPY on its RDMA memory regions, then incorrect memory mappings may be used if the app does something as simple as calling system("ls"). - R. From roland at topspin.com Mon Apr 25 17:08:57 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 25 Apr 2005 17:08:57 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425161330.32c32b4b.akpm@osdl.org> (Andrew Morton's message of "Mon, 25 Apr 2005 16:13:30 -0700") References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6DFA.4090908@ammasso.com> <20050425153542.70197e6a.akpm@osdl.org> <426D725C.4070103@ammasso.com> <20050425161330.32c32b4b.akpm@osdl.org> Message-ID: <52is2atn52.fsf@topspin.com> Timur> If you look at the Infiniband code that was recently Timur> submitted, I think you'll see it does exactly that: after Timur> calling mlock(), the driver calls get_user_pages(), and it Timur> stores the page mappings for future use. Andrew> Where? 
The code isn't merged yet. I sent a version to lkml for review -- in fact it was this very thread that we're in now. The code in question is in http://lkml.org/lkml/2005/4/4/266 This implements a "userspace verbs" character device that memory registration goes through. This means the kernel has a device node that will be closed when a process dies, and so the memory can be cleaned up. - R. From akpm at osdl.org Mon Apr 25 17:10:50 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 25 Apr 2005 17:10:50 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <469958e00504251637350cc8c@mail.gmail.com> References: <20050418164316.GA27697@infradead.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6DFA.4090908@ammasso.com> <20050425153542.70197e6a.akpm@osdl.org> <20050425161713.A9002@topspin.com> <20050425162405.0889093e.akpm@osdl.org> <469958e00504251637350cc8c@mail.gmail.com> Message-ID: <20050425171050.5ba25918.akpm@osdl.org> Caitlin Bestler wrote: > > > > > > > This is because there is no file descriptor or anything else associated > > > > with the pages which permits the kernel to clean stuff up on unclean > > > > application exit. Also there are the obvious issues with permitting > > > > pinning of unbounded amounts of memory. > > > > > > Correct, the driver must be able to determine that the process has died > > > and clean up after it, so the pinned region in most implementations is > > > associated with an open file descriptor. > > > > How is that association created? > > > There is not a file descrptor, but there is an rnic handle. Both DAPL > and IT-API require that process death will result in the handle and all > of its dependent objects being released. What's an "rnic handle", in Linux terms? 
> The rnic handle can always be declared to be a "file descriptor" if > that makes it follow normal OS conventions more precisiely. Does that mean that the code has not yet been implemented? Yes, a Linux fd is appropriate. But we don't have any sane way right now of saying "you need to run put_page() against all these pages in the ->release() handler". That'll need to be coded by yourselves. > There is also a need for some form of resource manager to approve > creation of Memory Regions. Obviously you cannot have multiple > applications claiming half of physical memory. The kernel already has considerable resource management capabilities. Please consider using/extending/generalising those before inventing anything new. RLIMIT_MEMLOCK would be a starting point. > But if you merely require the user to have root privileges in order > to create a Memory Region, and then take a first-come first-served > attitude, I don't think you end up with something that is truly a > general purpose capability. We don't want code in the kernel which will permit hostile unprivileged users to trivially cause the box to lock up. RLIMIT_MEMLOCK and, if necessary, CAP_IPC_LOCK sound appropriate here. > A general purpose RDMA capability requires the ability to indefinitely > pin large portions of user memory. It makes sense to integrate that > with OS policy control over resource utilization and to integrate it with > memory suspend/resume capabilities so that hotplug memory can > be supported. What you can't do is downgrade a Memory Region so > that it is no longer a memory region. Doing that means that you are > not truly supporting RDMA. 
From akpm at osdl.org Mon Apr 25 17:11:45 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 25 Apr 2005 17:11:45 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <52vf6atnn8.fsf@topspin.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> Message-ID: <20050425171145.2f0fd7f8.akpm@osdl.org> Roland Dreier wrote: > > Andrew> ug. What stops the memory from leaking if the process > Andrew> exits? > > Andrew> I hope this is a privileged operation? > > I don't think it has to be privileged. In my implementation, the > driver keeps a per-process list of registered memory regions and > unpins/cleans up on process exit. How does the driver detect process exit? > Andrew> It would be better to obtain this memory via a mmap() of > Andrew> some special device node, so we can perform appropriate > Andrew> permission checking and clean everything up on unclean > Andrew> application exit. > > This seems to interact poorly with how applications want to use RDMA, > ie typically through a library interface such as MPI. People doing > HPC don't want to recode their apps to use a new allocator, they just > want to link to a new MPI library and have the app go fast. Fair enough. 
From roland at topspin.com Mon Apr 25 17:23:17 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 25 Apr 2005 17:23:17 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425171145.2f0fd7f8.akpm@osdl.org> (Andrew Morton's message of "Mon, 25 Apr 2005 17:11:45 -0700") References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> Message-ID: <52acnmtmh6.fsf@topspin.com> Andrew> How does the driver detect process exit? I already answered earlier but just to be clear: registration goes through a character device, and all regions are cleaned up in the ->release() of that device. I don't currently have any code accounting against RLIMIT_MEMLOCK or testing CAP_FOO, but I have no problem adding whatever is thought appropriate. Userspace also has control over the permissions and owner/group of the /dev node. - R. 
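The cleanup scheme described above (registration goes through one open device file, and every region is torn down in that file's ->release()) can be sketched in plain userspace C. This is only a model under stated assumptions, not the code from the posted patch: `demo_ctx`, `demo_region`, `demo_register` and `demo_release` are hypothetical names, `page_limit` merely stands in for whatever RLIMIT_MEMLOCK-style accounting ends up being chosen, and a comment marks where a real driver would presumably call put_page() on each pinned page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record for one registered (pinned) memory region. */
struct demo_region {
    unsigned long start;        /* user virtual address of the region */
    size_t npages;              /* number of pages pinned for it */
    struct demo_region *next;
};

/* Hypothetical per-open-file state, standing in for file->private_data. */
struct demo_ctx {
    struct demo_region *regions;
    size_t pinned_pages;        /* pages currently accounted to this context */
    size_t page_limit;          /* assumed cap, playing the role of RLIMIT_MEMLOCK */
};

/* Register a region: refuse it if the cap would be exceeded, else record it. */
int demo_register(struct demo_ctx *ctx, unsigned long start, size_t npages)
{
    struct demo_region *mr;

    /* Overflow-safe form of "pinned_pages + npages > page_limit". */
    if (ctx->pinned_pages > ctx->page_limit ||
        npages > ctx->page_limit - ctx->pinned_pages)
        return -1;

    mr = malloc(sizeof(*mr));
    if (mr == NULL)
        return -1;

    mr->start = start;
    mr->npages = npages;
    mr->next = ctx->regions;
    ctx->regions = mr;
    ctx->pinned_pages += npages;
    return 0;
}

/*
 * Model of the ->release() handler: runs when the last reference to the
 * open file goes away, including on unclean process exit. Returns the
 * number of pages whose accounting was undone.
 */
size_t demo_release(struct demo_ctx *ctx)
{
    size_t unpinned = 0;
    struct demo_region *mr = ctx->regions;

    while (mr != NULL) {
        struct demo_region *next = mr->next;

        /* A real driver would unpin each page here (e.g. put_page()). */
        unpinned += mr->npages;
        free(mr);
        mr = next;
    }

    /* Every page that was accounted must have been given back. */
    assert(unpinned == ctx->pinned_pages);
    ctx->regions = NULL;
    ctx->pinned_pages = 0;
    return unpinned;
}
```

With a context initialised as `{ NULL, 0, 32 }`, registering 4 and then 16 pages succeeds, a further 64-page request is refused by the cap, and `demo_release()` hands back all 20 pages. Tying the list to the open file rather than to the process is what lets the cleanup happen without the driver having to detect process death itself.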
From akpm at osdl.org Mon Apr 25 17:37:57 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 25 Apr 2005 17:37:57 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <52acnmtmh6.fsf@topspin.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> Message-ID: <20050425173757.1dbab90b.akpm@osdl.org> Roland Dreier wrote: > > Andrew> How does the driver detect process exit? > > I already answered earlier but just to be clear: registration goes > through a character device, and all regions are cleaned up in the > ->release() of that device. yup. > I don't currently have any code accounting against RLIMIT_MEMLOCK or > testing CAP_FOO, but I have no problem adding whatever is thought > appropriate. Userspace also has control over the permissions and > owner/group of the /dev node. I guess device node permissions won't be appropriate here, if only because it sounds like everyone will go and set them to 0666. RLIMIT_MEMLOCK sounds like the appropriate mechanism. We cannot rely upon userspace running mlock(), so perhaps it is appropriate to run sys_mlock() in-kernel because that gives us the appropriate RLIMIT_MEMLOCK checking. 
However a hostile app can just go and run munlock() and then allocate some more pinned-by-get_user_pages() memory.

umm, how about we

- force the special pages into a separate vma
- run get_user_pages() against it all
- use RLIMIT_MEMLOCK accounting to check whether the user is allowed to do this thing
- undo the RLIMIT_MEMLOCK accounting in ->release

This will all interact with user-initiated mlock/munlock in messy ways. Maybe a new kernel-internal vma->vm_flag which works like VM_LOCKED but is unaffected by mlock/munlock activity is needed.

A bit of generalisation in do_mlock() should suit?

From iwamoto at valinux.co.jp Mon Apr 25 19:03:38 2005 From: iwamoto at valinux.co.jp (IWAMOTO Toshihiro) Date: Tue, 26 Apr 2005 11:03:38 +0900 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <52vf6atnn8.fsf@topspin.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> Message-ID: <20050426020338.5909570488@sv1.valinux.co.jp> At Mon, 25 Apr 2005 16:58:03 -0700, Roland Dreier wrote: > Andrew> It would be better to obtain this memory via a mmap() of > Andrew> some special device node, so we can perform appropriate > Andrew> permission checking and clean everything up on unclean > Andrew> application exit.
> > This seems to interact poorly with how applications want to use RDMA, > ie typically through a library interface such as MPI. People doing > HPC don't want to recode their apps to use a new allocator, they just > want to link to a new MPI library and have the app go fast. Such HPC users cannot use the memory hotremoval feature, and something needs to be implemented so that the NUMA migration can handle such memory properly, but I see your point. If such memory were allocated by a driver, the memory could be placed in non-hotremovable areas to avoid the above problems. -- IWAMOTO Toshihiro From timur.tabi at ammasso.com Mon Apr 25 19:16:53 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 25 Apr 2005 21:16:53 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050426020338.5909570488@sv1.valinux.co.jp> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050426020338.5909570488@sv1.valinux.co.jp> Message-ID: <426DA495.4040700@ammasso.com> IWAMOTO Toshihiro wrote: > If such memory were allocated by a driver, the memory could be placed > in non-hotremovable areas to avoid the above problems. How can the driver allocated 3GB of pinned memory on a system with 3.5GB of RAM? Can vmalloc() or get_free_pages() allocate that much memory? 
From timur.tabi at ammasso.com Mon Apr 25 19:21:03 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 25 Apr 2005 21:21:03 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425173757.1dbab90b.akpm@osdl.org> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> Message-ID: <426DA58F.3020508@ammasso.com> Andrew Morton wrote: > RLIMIT_MEMLOCK sounds like the appropriate mechanism. We cannot rely upon > userspace running mlock(), so perhaps it is appropriate to run sys_mlock() > in-kernel because that gives us the appropriate RLIMIT_MEMLOCK checking. I don't see what's wrong with relying on userspace to call mlock(). First of all, all RDMA apps call a third-party API, like DAPL or MPI, to register memory. The memory needs to be registered in order for the driver and adapter to know where it is. During this registration, the memory is also pinned. That's when we call mlock(). > > However a hostile app can just go and run munlock() and then allocate > some more pinned-by-get_user_pages() memory. Isn't mlock() on a per-process basis anyway? How can one process call munlock() on another process' memory?
> umm, how about we > > - force the special pages into a separate vma > > - run get_user_pages() against it all > > - use RLIMIT_MEMLOCK accounting to check whether the user is allowed to > do this thing > > - undo the RMLIMIT_MEMLOCK accounting in ->release Isn't this kinda what mlock() does already? Create a new VMA and then VM_LOCK it? > This will all interact with user-initiated mlock/munlock in messy ways. > Maybe a new kernel-internal vma->vm_flag which works like VM_LOCKED but is > unaffected by mlock/munlock activity is needed. > > A bit of generalisation in do_mlock() should suit? Yes, but do_mlock() needs to prevent pages from being moved during memory hotswap. From steve.langdon at hp.com Mon Apr 25 19:26:04 2005 From: steve.langdon at hp.com (Stephen Langdon) Date: Mon, 25 Apr 2005 22:26:04 -0400 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050426020338.5909570488@sv1.valinux.co.jp> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050426020338.5909570488@sv1.valinux.co.jp> Message-ID: <426DA6BC.9070703@hp.com> I don't think that we should jump to the conclusion that in the long term HPC users cannot benefit from support of mechanisms such as hotremoval of memory or other forms of page migration in physical memory. 
In an earlier exchange on the openib-general list Mike Krause sent the message quoted below on very much the same topic. On the other hand I am willing to accept that there is practical value to implementations which are not (yet) sophisticated enough to support the migration functions. Steve Langdon > Michael Krause wrote: At 05:35 PM 3/14/2005, Caitlin Bestler wrote: > >> >> >> > -----Original Message----- >> > From: Troy Benjegerdes [ mailto:hozer at hozed.org] >> > Sent: Monday, March 14, 2005 5:06 PM >> > To: Caitlin Bestler >> > Cc: openib-general at openib.org >> > Subject: Re: [openib-general] Getting rid of pinned memory requirement >> > >> > > >> > > The key is that the entire operation either has to be fast >> > > enough so that no connection or application session layer >> > > time-outs occur, or an end-to-end agreement to suspend the >> > > connection is a requirement. The first option seems more >> > > plausible to me, the second essentially >> > > requires extending the CM protocol. That's a tall order even for >> > > InfiniBand, and it's even worse for iWARP where the CM >> > > functionality typically ends when the connection is established. >> > >> > I'll buy the good network design argument. > > > I and others designed InfiniBand RNR (Receiver not ready) operations > to allow one to adjust V-to-P mappings (not change the address that > was advertised) in order to allow an OS to safely play some games with > memory and not drop a connection. The time values associated with RNR > allow a solution to tolerate up to an infinite amount of time to perform > such operations but the envisioned goal was to do this on the order of > a handful of milliseconds in the worst case.
For iWARP, there was no > support for defining RNR functionality as indeed many people claimed > one could just drop in-bound segments and allow the retransmission > protocol to deal with the delay (even if this has performance > implications due to back-off algorithms though some claim SACK would > minimize this to a large extent). Again, the idea was to minimize the > worst case to milliseconds of down time. BTW, all of this assumed > that the OS would not perform these types of changes that often so the > long-term impact on an application would be minimal. > >> > >> > I suppose if the kernel wants to revoke a card's pinned >> > memory, we should be able to guarantee that it gets new >> > pinned memory within a bounded time. What sort of timing do >> > we need? Milliseconds? >> > Microseconds? >> > >> > In the case of iWarp, isn't this just TCP underneath? If so, >> > can't we just drop any packets in the pipe on the floor and >> > let them get retransmitted? (I suppose the same argument goes >> > for infiniband.. >> > what sort of a time window do we have for retransmission?) >> > >> > What are the limits on end-to-end flow control in IB and iWarp? >> > >> >> From the RDMA Provider's perspective, the short answer is "quick >> enough so that I don't have to do anything heroic to keep the >> connection alive." > > > It should not require anything heroic. What it does require is a > local method to suspend the local QP(s) so that it cannot place or > read memory in the affected area. That can take some time depending > upon the implementation. There is then the time to overwrite the > mappings which again depending upon the implementation and the number > of mappings could be milliseconds in length. > >> With TCP you also have to add "and healthy". If you've ever had a >> long download that got effectively stalled by a burst of noise and >> you just hit the 'reload' button on your browser then you know what >> I'm talking about.
>> >> But in transport neutral terms I would think that one RTT is >> definitely safe -- that much data could have >> been dropped by one switch failure or one nasty spike in inbound noise. >> >> > > >> > > Yes, there are limits on how much memory you can mlock, or even >> > > allocate. Applications are required to register memory precisely >> > > because the required guarantees are not there by default. >> > Eliminating >> > > those guarantees *is* effectively rewriting every RDMA application >> > > without even letting them know. >> > >> > Some of this argument is a policy issue, which I would argue >> > shouldn't be hard-coded in the code or in the network hardware. >> > >> > At least in my view, the guarantees are only there to make >> > applications go fast. We are getting low latency and high >> > performance with infiniband by making memory registration go >> > really really slow. If, to make big HPC simulation >> > applications work, we wind up doing memcpy() to put the data >> > into a registered buffer because we can't register half of >> > physical memory, the application isn't going very fast. >> > >> >> What you are looking for is a distinction between registering >> memory to *enable* the RNIC to optimize local access and >> registering memory to enable its being advertised to the >> remote end. >> >> Early implementations of RDMA, both IB and iWARP, have not >> distinguished between the two. But theoretically *applications* >> do not need memory regions that are not enabled for remote >> access to be pinned. That is an RNIC requirement that could >> evolve. But applications themselves *do* need remotely >> accessible memory regions, portions of which they intend >> to advertise with RKeys, to be truly available (i.e., pinned). >> >> You are also making a policy assumption that an application >> that actually needs half of physical memory should be using >> paged memory. 
Memory is cheap, and if performance is critical >> why should this memory be swapped out to disk? >> >> Is the limitation on not being able to register half of >> physical memory based upon some assumption that swapping >> is a requirement? Or is it a limitation in the memory region >> size? If it's the latter, you need to get the OS to support >> larger page sizes. > > > For some OS, you can pin very large areas. I've seen 15/16 of memory > being able to be pinned with no adverse impacts on the applications. > For these OS, kernel memory is effectively pinned memory. As such, > depending upon the mix of services being provided, the system may > operate quite nicely with such large amounts of memory being pinned. > As more services are "ported" to operate over RDMA technologies, > memory management isn't necessarily any harder; it just becomes > something people have to think more about. Today's VM designs have > allowed people to get sloppy as they assume that swapping will occur > and since many platforms are not that loaded, they don't see any real > adverse impacts. User-space RDMA applications requires people to > think once again about memory management and that swapping isn't a > get-out-of-jail card. One needs to develop resource management tools > to determine who obtains specified amounts of resources and their > priorities. For the most part, this is somewhat a re-invention of > some thinking that went into the micro-kernel work in past years. > These problems are not intractable; they are only constrained by the > legacy inertia inherent in all technologies today. > > Mike > > > IWAMOTO Toshihiro wrote: >At Mon, 25 Apr 2005 16:58:03 -0700, >Roland Dreier wrote: > > >> Andrew> It would be better to obtain this memory via a mmap() of >> Andrew> some special device node, so we can perform appropriate >> Andrew> permission checking and clean everything up on unclean >> Andrew> application exit. 
>> >>This seems to interact poorly with how applications want to use RDMA, >>ie typically through a library interface such as MPI. People doing >>HPC don't want to recode their apps to use a new allocator, they just >>want to link to a new MPI library and have the app go fast. >> >> > >Such HPC users cannot use the memory hotremoval feature, and something >needs to be implemented so that the NUMA migration can handle such >memory properly, but I see your point. > >If such memory were allocated by a driver, the memory could be placed >in non-hotremovable areas to avoid the above problems. > >-- >IWAMOTO Toshihiro > > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/x-pkcs7-signature Size: 6189 bytes Desc: S/MIME Cryptographic Signature URL: From akpm at osdl.org Mon Apr 25 20:16:29 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 25 Apr 2005 20:16:29 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <426DA58F.3020508@ammasso.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <426DA58F.3020508@ammasso.com> Message-ID: <20050425201629.11d9118f.akpm@osdl.org> Timur Tabi wrote: > > Andrew Morton wrote: > > > RLIMIT_MEMLOCK sounds like the appropriate mechanism. We cannot rely upon > > userspace running mlock(), so perhaps it is appropriate to run sys_mlock() > > in-kernel because that gives us the appropriate RLIMIT_MEMLOCK checking. > > I don't see what's wrong with relying on userspace to call mlock(). First all, all RDMA > apps call a third-party API, like DAPL or MPI, to register memory. The memory needs to be > registered in order for the driver and adapter to know where it is. During this > registration, the memory is also pinned. That's when we call mlock(). All the above refers to well-behaved applications. 
Now think about how the syscalls which you provide may be used by applications which are *designed* to cripple or to compromise the machine. > > > > However a hostile app can just go and run munlock() and then allocate > > some more pinned-by-get_user_pages() memory. > > Isn't mlock() on a per-process basis anyway? How can one process call munlock() on > another process' memory? I'm referring to an application which uses your syscalls to obtain pinned memory and uses munlock() so that it may then use your syscalls to obtain even more pinned memory. With the objective of taking the machine down. > > umm, how about we > > > > - force the special pages into a separate vma > > > > - run get_user_pages() against it all > > > > - use RLIMIT_MEMLOCK accounting to check whether the user is allowed to > > do this thing > > > > - undo the RLIMIT_MEMLOCK accounting in ->release > > Isn't this kinda what mlock() does already? Create a new VMA and then VM_LOCK it? Kinda. But applications can undo the mlock which the kernel did. > > This will all interact with user-initiated mlock/munlock in messy ways. > > Maybe a new kernel-internal vma->vm_flag which works like VM_LOCKED but is > > unaffected by mlock/munlock activity is needed. > > > > A bit of generalisation in do_mlock() should suit? > > Yes, but do_mlock() needs to prevent pages from being moved during memory hotswap. I haven't even thought about memory hotswap. Surely it'll fail if the pages are pinned by get_user_pages()? 
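The accounting concern in the message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the structure and function names (lock_account, model_mlock, model_munlock, model_driver_pin, model_release) are hypothetical and do not exist in the kernel, and the byte counters merely stand in for whatever RLIMIT_MEMLOCK accounting a real implementation would use. The point it tries to show is that a driver-owned counter which munlock() cannot touch keeps the limit meaningful.

```c
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
struct lock_account {
	size_t limit;          /* stands in for RLIMIT_MEMLOCK, in bytes */
	size_t user_locked;    /* bytes locked via mlock(); munlock() may undo */
	size_t kernel_pinned;  /* bytes pinned by a driver; only release undoes */
};

/* Return 0 if 'bytes' more would still fit under the limit, -1 otherwise. */
int account_check(const struct lock_account *a, size_t bytes)
{
	size_t used = a->user_locked + a->kernel_pinned;

	if (used > a->limit || bytes > a->limit - used)
		return -1;
	return 0;
}

/* Model of a user-initiated mlock(): charged against the shared limit. */
int model_mlock(struct lock_account *a, size_t bytes)
{
	if (account_check(a, bytes))
		return -1;
	a->user_locked += bytes;
	return 0;
}

/* Model of munlock(): only the user-controlled counter can shrink. */
void model_munlock(struct lock_account *a, size_t bytes)
{
	a->user_locked -= (bytes < a->user_locked) ? bytes : a->user_locked;
}

/* Model of a driver pinning pages on the application's behalf. */
int model_driver_pin(struct lock_account *a, size_t bytes)
{
	if (account_check(a, bytes))
		return -1;
	a->kernel_pinned += bytes;
	return 0;
}

/* Model of ->release: the driver's charge is dropped when the fd closes. */
void model_release(struct lock_account *a)
{
	a->kernel_pinned = 0;
}
```

In this model, a process that pins through the driver and then calls munlock() gains no extra headroom, because the driver's charge lives in a counter that only the release path clears. If the driver's pinning were instead recorded in the same counter that munlock() decrements, the pin-then-munlock loop described above would let the process exceed the limit.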
From libor at topspin.com Mon Apr 25 20:31:10 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 25 Apr 2005 20:31:10 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050412180447.E6958@topspin.com>; from libor@topspin.com on Tue, Apr 12, 2005 at 06:04:47PM -0700 References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <20050412180447.E6958@topspin.com> Message-ID: <20050425203110.A9729@topspin.com> On Tue, Apr 12, 2005 at 06:04:47PM -0700, Libor Michalek wrote: > On Mon, Apr 11, 2005 at 05:13:47PM -0700, Andrew Morton wrote: > > Roland Dreier wrote: > > > > > > Troy> Do we even need the mlock in userspace then? > > > > > > Yes, because the kernel may go through and unmap pages from userspace > > > while trying to swap. Since we have the page locked in the kernel, > > > the physical page won't go anywhere, but userspace might end up with a > > > different page mapped at the same virtual address. > > With the last few kernels I haven't had a chance to retest the problem > that pushed us in the direction of using mlock. I will go back and do > so with the latest kernel. Below I've given a quick description of the > issue. > > > That shouldn't happen. If get_user_pages() has elevated the refcount on a > > page then the following can happen: > > > > - The VM may decide to add the page to swapcache (if it's not mmapped > > from a file). > > > > - Once the page is backed by either swapcache of a (mmapped) file, the VM > > may decide the unmap the application's pte's. A later minor fault by the > > app will cause the same physical page to be remapped. 
> > The driver did use get_user_pages() to elevate the refcount on all the > pages it was going to use for IO, as well as call set_page_dirty() since > the pages were going to have data written to them from the device. > > The problem we were seeing is that the minor fault by the app resulted > in a new physical page getting mapped for the application. The page that > had the elevated refcount was still waiting for the data to be written > to by the driver at the time that the app accessed the page causing the > minor fault. Obviously since the app had a new mapping the data written > by the driver was lost. > > It looks like code was added to try_to_unmap_one() to address this, so > hopefully it's no longer an issue... I wrote a quick test module and program to confirm that the problem we saw in older kernels with get_user_pages() no longer exists. The module creates a character device with three different ioctl commands: - Pin the pages of a buffer using get_user_pages() - Check the pages by calling get_user_pages() a second time and comparing the new and original page list. - Release the pages using put_page() The program opens the character device file descriptor, pins the pages, and waits for a signal before checking the pages; the signal is sent to the process after running some other program which exercises the VM. On older kernels the check fails; on my 2.6.11 kernel the check succeeds. So mlock is not needed on top of get_user_pages() as it was before. Thanks for the heads up. Module and program attached. -Libor -------------- next part -------------- /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * * $Id: $ */ #include #include #include #include #include #include #include #include #include #include #include #include MODULE_AUTHOR("Libor Michalek"); MODULE_DESCRIPTION("Get pages test"); MODULE_LICENSE("GPL"); enum { TEST_MAJOR = 232, TEST_MINOR = 255 }; #define TEST_DEV MKDEV(TEST_MAJOR, TEST_MINOR) enum { TEST_CMD_REGISTER = 1, TEST_CMD_UNREGISTER = 2, TEST_CMD_CHECK = 3 }; struct ioctl_arg { __u64 addr; __u64 size; }; struct region_root { struct semaphore mutex; struct list_head regions; /* list of pending events. 
*/ struct file *filp; int nr_region; }; struct test_region { unsigned long user; unsigned long addr; unsigned long size; int nr_pages; struct page **pages; struct region_root *root; struct list_head region_list; /* member in root region list */ }; static void test_unlock(struct test_region *region) { long i; list_del(&region->region_list); for (i = 0; i < region->nr_pages; i++) put_page(region->pages[i]); printk(KERN_ERR "TEST: Unlocked address <%016lx>\n", region->user); kfree(region->pages); kfree(region); } static struct test_region *test_lookup(struct region_root *root, unsigned long addr) { struct test_region *region; list_for_each_entry(region, &root->regions, region_list) if (region->user == addr) return region; return NULL; } static int test_lock(struct region_root *root, unsigned long uaddr, unsigned long size) { struct test_region *region; int nr_pages; int result; region = kmalloc(sizeof(*region), GFP_KERNEL); if (!region) return -ENOMEM; region->user = uaddr; region->addr = uaddr & PAGE_MASK; region->size = PAGE_ALIGN(size + (uaddr & ~PAGE_MASK)); region->root = root; nr_pages = (region->size + PAGE_SIZE-1) >> PAGE_SHIFT; region->pages = kmalloc(sizeof(struct page *) * nr_pages, GFP_KERNEL); if (!region->pages) { result = -ENOMEM; goto page_err; } region->nr_pages = get_user_pages(current, current->mm, region->addr, nr_pages, 1, 0, region->pages, NULL); if (region->nr_pages != nr_pages) { result = -EFAULT; goto get_err; } list_add_tail(&region->region_list, &root->regions); printk(KERN_ERR "TEST: Locked address <%016lx>\n", region->user); return 0; get_err: kfree(region->pages); page_err: kfree(region); return result; } static int test_check(struct test_region *region) { struct page **pages; int nr_pages; int result = 0; int i; pages = kmalloc(sizeof(struct page *) * region->nr_pages, GFP_KERNEL); if (!pages) return -ENOMEM; nr_pages = get_user_pages(current, current->mm, region->addr, region->nr_pages, 1, 0, pages, NULL); if (region->nr_pages != nr_pages) { 
result = -EFAULT; goto get_err; } for (i = 0; i < nr_pages; i++) { if (region->pages[i] != pages[i]) printk(KERN_ERR "TEST: Check error <%p:%p> " "page <%u> of <%u>\n", pages[i], region->pages[i], i, nr_pages); put_page(pages[i]); } get_err: kfree(pages); return result; } static long test_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { struct region_root *root = filp->private_data; struct test_region *region; struct ioctl_arg ureq; int result = 0; if (!root) return -EINVAL; if (copy_from_user(&ureq, (void __user *)arg, sizeof(ureq))) return -EFAULT; down(&root->mutex); switch (cmd) { case TEST_CMD_REGISTER: result = test_lock(root, ureq.addr, ureq.size); break; case TEST_CMD_UNREGISTER: region = test_lookup(root, ureq.addr); if (!region) result = -ENOENT; else test_unlock(region); break; case TEST_CMD_CHECK: region = test_lookup(root, ureq.addr); if (!region) result = -ENOENT; else result = test_check(region); break; default: result = -ERANGE; break; } up(&root->mutex); return result; } static int test_open(struct inode *inode, struct file *filp) { struct region_root *root; root = kmalloc(sizeof(*root), GFP_KERNEL); if (!root) return -ENOMEM; memset(root, 0, sizeof(*root)); INIT_LIST_HEAD(&root->regions); init_MUTEX(&root->mutex); filp->private_data = root; root->filp = filp; printk(KERN_ERR "TEST: Created root struct\n"); return 0; } static int test_close(struct inode *inode, struct file *filp) { struct region_root *root = filp->private_data; struct test_region *region; down(&root->mutex); while (!list_empty(&root->regions)) { region = list_entry(root->regions.next, struct test_region, region_list); test_unlock(region); } up(&root->mutex); kfree(root); filp->private_data = NULL; printk(KERN_ERR "TEST: Deleted root struct\n"); return 0; } static struct file_operations test_fops = { .owner = THIS_MODULE, .open = test_open, .release = test_close, .compat_ioctl = test_ioctl, .unlocked_ioctl = test_ioctl, }; static struct cdev test_cdev; static int 
__init test_init(void) { int result; result = register_chrdev_region(TEST_DEV, 1, "mltest"); if (result) { printk(KERN_ERR "TEST: Error <%d> registering dev\n", result); goto err_chr; } cdev_init(&test_cdev, &test_fops); result = cdev_add(&test_cdev, TEST_DEV, 1); if (result) { printk(KERN_ERR "TEST: Error <%d> adding cdev\n", result); goto err_cdev; } return 0; err_cdev: unregister_chrdev_region(TEST_DEV, 1); err_chr: return result; } static void __exit test_cleanup(void) { cdev_del(&test_cdev); unregister_chrdev_region(TEST_DEV, 1); } module_init(test_init); module_exit(test_cleanup); -------------- next part -------------- /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* * $Id: $ */ #include #include #include #include #include #include #include #include #include #include #include #define TEST_DEV_PATH "/dev/mltest" #define TEST_SLEEP_UTIME 50000 struct ioctl_arg { __u64 addr; __u64 size; }; enum { TEST_CMD_REGISTER = 1, TEST_CMD_UNREGISTER = 2, TEST_CMD_CHECK = 3 }; static int hold = 1; void sig_usr(int value) { hold = 0; } int main(int argc, char **argv) { struct ioctl_arg req; void *data; int param_c = 0; int size; int fd; int result; if (2 != argc || 0 > (size = atoi(argv[++param_c]))) { fprintf(stderr, "usage: %s <size>\n", argv[0]); fprintf(stderr, " size - allocated region size in bytes.\n"); exit(1); } signal(SIGUSR1, sig_usr); data = malloc(size); if (!data) { fprintf(stderr, "Failed to allocate region of size <%d>\n", size); exit(1); } fd = open(TEST_DEV_PATH, O_RDWR); if (fd < 0) { fprintf(stderr, "Error <%d:%d> opening device <%s>\n", fd, errno, TEST_DEV_PATH); goto open_err; } req.addr = (unsigned long)data; req.size = size; result = ioctl(fd, TEST_CMD_REGISTER, &req); if (result) { fprintf(stderr, "Error <%d:%d> registering region\n", result, errno); goto ioctl_err; } fprintf(stdout, "Address <%016lx> registered, process <%d> waiting...\n", (unsigned long)data, getpid()); while (hold) { usleep(TEST_SLEEP_UTIME); } fprintf(stdout, "Process continuing, checking address <%016lx>\n", (unsigned long)data); result = ioctl(fd, TEST_CMD_CHECK, &req); if (result) { fprintf(stderr, "Error <%d:%d> checking region\n", result, errno); goto ioctl_err; } result = ioctl(fd, TEST_CMD_UNREGISTER, &req); if (result) { fprintf(stderr, "Error <%d:%d> unregistering region\n", result, errno); goto ioctl_err; } ioctl_err: close(fd); open_err: free(data); return 0; } From timur.tabi at ammasso.com Mon Apr 25 20:38:26 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Mon, 25 Apr 2005 22:38:26 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425201629.11d9118f.akpm@osdl.org> References: 
<200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <426DA58F.3020508@ammasso.com> <20050425201629.11d9118f.akpm@osdl.org> Message-ID: <426DB7B2.7000409@ammasso.com> Andrew Morton wrote: > I'm referring to an application which uses your syscalls to obtain pinned > memory and uses munlock() so that it may then use your syscalls to obtain > evem more pinned memory. With the objective of taking the machine down. I'm in favor of having drivers call do_mlock() and do_munlock() on behalf of the application. All we need to do is export those functions, and my driver can call them. However, that still doesn't prevent an app from calling munlock(). But I don't understand the distinction between having the driver call do_mlock() vs. the application calling mlock(). Won't we still have the same problems? A malicious app can just call our driver instead of calling mlock() or munlock(). The driver won't know the difference between an authorized app and an unauthorized one. Besides, isn't the whole point behind RLIMIT_MEMLOCK to limit how much one process can lock? > I haven't even thought about memory hotswap. Surely it'll fail if the > pages are pinned by get_user_pages()? 
Any memory registered for RDMA devices obviously can't be swapped out. Technically, the driver could detect that, and reject any attempt to transfer data to those regions until everything is remapped to other RAM. But that's opening a whole new can of worms. I don't know how the memory hotswap mechanism works, so I can't guess what recovery mechanisms can be implemented in the driver. From libor at topspin.com Mon Apr 25 20:55:12 2005 From: libor at topspin.com (Libor Michalek) Date: Mon, 25 Apr 2005 20:55:12 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425162405.0889093e.akpm@osdl.org>; from akpm@osdl.org on Mon, Apr 25, 2005 at 04:24:05PM -0700 References: <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6DFA.4090908@ammasso.com> <20050425153542.70197e6a.akpm@osdl.org> <20050425161713.A9002@topspin.com> <20050425162405.0889093e.akpm@osdl.org> Message-ID: <20050425205512.B9729@topspin.com> On Mon, Apr 25, 2005 at 04:24:05PM -0700, Andrew Morton wrote: > Libor Michalek wrote: > > On Mon, Apr 25, 2005 at 03:35:42PM -0700, Andrew Morton wrote: > > > > > Yes, we expect that all the pages which get_user_pages() pinned > > > will become unpinned within the context of the syscall which pinned > > > the pages. Or shortly after, in the case of async I/O. > > > > When a network protocol is making use of async I/O the amount of time > > between posting the read request and getting the completion for that > > request is unbounded since it depends on the other half of the connection > > sending some data. In this case the buffer that was pinned during the > > io_submit() may be pinned, and holding the pages, for a long time. > > Sure. 
> > > During > > this time the process might fork, at this point any data received will be > > placed into the wrong spot. > > Well the data is placed in _a_ spot. That's only the "wrong" spot because > you've defined it to be wrong! > > IOW: what behaviour are you actually looking for here, and why, and does it > matter? For example, a network server app has an open connection on which it uses async IO to submit two buffers for a read operation. Both buffers are pinned using get_user_pages() and the connection waits for data to arrive. The connection receives data, which is written into the first buffer; the app is notified using async IO, and it retrieves the async IO completion. The app reads the buffer, which happens to contain a command to spawn a child, so the app forks a child. Now there is still a buffer posted for read, and if more data arrives on the connection that data is copied to the pages which were saved when the buffer was pinned. The app is notified and retrieves the async IO completion, but when it goes to read that buffer it will not have the new data. > > > This is because there is no file descriptor or anything else associated > > > with the pages which permits the kernel to clean stuff up on unclean > > > application exit. Also there are the obvious issues with permitting > > > pinning of unbounded amounts of memory. > > > > Correct, the driver must be able to determine that the process has died > > and clean up after it, so the pinned region in most implementations is > > associated with an open file descriptor. > > How is that association created? The kernel module which pinned the memory is responsible for unpinning it if the file descriptor, which was used to deliver the command that resulted in the pinning, is closed. 
-Libor From akpm at osdl.org Mon Apr 25 21:33:15 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 25 Apr 2005 21:33:15 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <426DB7B2.7000409@ammasso.com> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <426DA58F.3020508@ammasso.com> <20050425201629.11d9118f.akpm@osdl.org> <426DB7B2.7000409@ammasso.com> Message-ID: <20050425213315.27db35db.akpm@osdl.org> Timur Tabi wrote: > > Andrew Morton wrote: > > > I'm referring to an application which uses your syscalls to obtain pinned > > memory and uses munlock() so that it may then use your syscalls to obtain > > evem more pinned memory. With the objective of taking the machine down. > > I'm in favor of having drivers call do_mlock() and do_munlock() on behalf of the > application. All we need to do is export those functions, and my driver can call them. > However, that still doesn't prevent an app from calling munlock(). Precisely. That's why I suggested that we have an alternative vma->vm_flag bit which behaves in a similar manner to VM_LOCKED (say, VM_LOCKED_KERNEL), only userspace cannot alter it. 
> But I don't understand the distinction between having the driver call do_mlock() vs. the > application calling mlock(). Won't we still have the same problems? A malicious app can > just call our driver instead of calling mlock() or munlock(). The driver won't know the > difference between an authorized app and an unauthorized one. The driver will set VM_LOCKED_KERNEL, not VM_LOCKED. > Besides, isn't the whole point behind RLIMIT_MEMLOCK to limit how much one process can lock? Sure. The internal setting of VM_LOCKED_KERNEL should still use RLIMIT_MEMLOCK accounting. From hch at infradead.org Mon Apr 25 23:12:36 2005 From: hch at infradead.org (Christoph Hellwig) Date: Tue, 26 Apr 2005 07:12:36 +0100 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <52r7gytnfn.fsf@topspin.com> References: <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <52r7gytnfn.fsf@topspin.com> Message-ID: <20050426061236.GA27220@infradead.org> On Mon, Apr 25, 2005 at 05:02:36PM -0700, Roland Dreier wrote: > The idea is that applications manage the lifetime of pinned memory > regions. They can do things like post multiple I/O operations without > any page-walking overhead, or pass a buffer descriptor to a remote > host who will send data at some indeterminate time in the future. In > addition, InfiniBand has the notion of atomic operations, so a cluster > application may be using some memory region to implement a global lock. > > This might not be the most kernel-friendly design but it is pretty > deeply ingrained in the design of RDMA transports like InfiniBand and > iWARP (RDMA over IP). Actually, no it isn't. 
All these transports would work just fine with the approach of mmap()ing a character device to hand out memory from the kernel, which I told you to use multiple times and which Andrew mentioned in this thread as well. What doesn't work with that design are the braindead designed-by-committee APIs in the RDMA world - but I don't think we should care about them too much. From caitlin.bestler at gmail.com Tue Apr 26 06:45:42 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Tue, 26 Apr 2005 06:45:42 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050426061236.GA27220@infradead.org> References: <4263DEC5.5080909@ammasso.com> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <52r7gytnfn.fsf@topspin.com> <20050426061236.GA27220@infradead.org> Message-ID: <469958e00504260645dd2218d@mail.gmail.com> On 4/25/05, Christoph Hellwig wrote: > On Mon, Apr 25, 2005 at 05:02:36PM -0700, Roland Dreier wrote: > > The idea is that applications manage the lifetime of pinned memory > > regions. They can do things like post multiple I/O operations without > > any page-walking overhead, or pass a buffer descriptor to a remote > > host who will send data at some indeterminate time in the future. In > > addition, InfiniBand has the notion of atomic operations, so a cluster > > application may be using some memory region to implement a global lock. > > > > This might not be the most kernel-friendly design but it is pretty > > deeply ingrained in the design of RDMA transports like InfiniBand and > > iWARP (RDMA over IP). > > Actually, no it isn't. All these transports would work just fine with > the mmap a character device to hand out memory from the kernel approach > I told you to use multiple times and Andrew mentioned in this thread as well. 
> What doesn't work with that design are the braindead designed by committee
> APIs in the RDMA world - but I don't think we should care about them too
> much.
>

RDMA registers and uses the memory the user specifies. That is why
byte granularity and multiple redundant registrations are explicitly
specified.

The mechanism by which this requirement is implemented is of course
OS dependent. But the requirements are that the application specifies
what portion of their memory they want registered (or what set of
physical pages if they have sufficient privilege) and that request is
either honored or refused by a resource manager (one preferably as
integrated with general OS resource management as possible).

The other aspect is that remotely enabled memory regions and memory
windows must be enabled for hardware access for the duration of the
region or window -- indefinitely until process death or explicit
termination by the application layer.

Theoretically there is nothing in the wire protocols that requires
source buffers to be pinned indefinitely, but that is the only way
any RDMA interface has ever worked -- so "brain death" must be
pretty widespread.

The fact that this problem must be solved for remotely accessible
buffers, and that for cluster applications like MPI there is no
distinction between buffers used for inbound messages and outbound
messages, might have something to do with this.

User verbs needs to deal with these actual Memory Registration
requirements, including the very real application need for
Memory Windows. The solution should map to existing OS controls
as much as possible.
From timur.tabi at ammasso.com Tue Apr 26 07:07:03 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Tue, 26 Apr 2005 09:07:03 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425213315.27db35db.akpm@osdl.org> References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <426DA58F.3020508@ammasso.com> <20050425201629.11d9118f.akpm@osdl.org> <426DB7B2.7000409@ammasso.com> <20050425213315.27db35db.akpm@osdl.org> Message-ID: <426E4B07.1040400@ammasso.com> Andrew Morton wrote: > Precisely. That's why I suggested that we have an alternative vma->vm_flag > bit which behaves in a similar manner to VM_LOCKED (say, VM_LOCKED_KERNEL), > only userspace cannot alter it. How about calling it VM_PINNED? 
That way, we can define

Locked - won't be swapped to disk, but can be moved around in memory
Pinned - won't be swapped to disk or moved around in memory

From halr at voltaire.com Tue Apr 26 07:48:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Apr 2005 10:48:24 -0400
Subject: [openib-general] Re: [openib-commits] r2196 - in gen2/trunk/src/linux-kernel/infiniband: core include]
Message-ID: <1114526788.1764.200.camel@localhost.localdomain>

Hi Sean,

I may have missed this but how is the need for the non natural
alignment accommodated now ?

I am unable to IPoIB ping from one node to another as the SA query for
PathRecord is not answered as something is now wrong in the query. When
I add back the packing (patch below), it works again.

I do think the packing is for more than just 64 bit architectures as I
am running this on an Intel 386.

-- Hal

Index: ib_mad.h
===================================================================
--- ib_mad.h	(revision 2209)
+++ ib_mad.h	(working copy)
@@ -134,12 +134,18 @@
 
 #define IB_SA_COMP_MASK(n)	((__force ib_sa_comp_mask) cpu_to_be64(1ull << n))
 
+/*
+ * ib_sa_hdr and ib_sa_mad structures must be packed because they have
+ * 64-bit fields that are only 32-bit aligned. 64-bit architectures will
+ * lay them out wrong otherwise.
(And unfortunately they are sent on
+ * the wire so we can't change the layout)
+ */
 struct ib_sa_hdr {
 	u64			sm_key;
 	u16			attr_offset;
 	u16			reserved;
 	ib_sa_comp_mask		comp_mask;
-};
+} __attribute__ ((packed));
 
 struct ib_mad {
 	struct ib_mad_hdr	mad_hdr;
@@ -157,7 +163,7 @@
 	struct ib_rmpp_hdr	rmpp_hdr;
 	struct ib_sa_hdr	sa_hdr;
 	u8			data[200];
-};
+} __attribute__ ((packed));
 
 struct ib_vendor_mad {
 	struct ib_mad_hdr	mad_hdr;

--Forwarded Message--
From: sean.hefty at openib.org
To: openib-commits at openib.org
Subject: [openib-commits] r2196 - in gen2/trunk/src/linux-kernel/infiniband: core include
Date: 20 Apr 2005 10:58:56 -0700

Author: sean.hefty
Date: 2005-04-20 10:58:55 -0700 (Wed, 20 Apr 2005)
New Revision: 2196

Modified:
   gen2/trunk/src/linux-kernel/infiniband/core/sa_query.c
   gen2/trunk/src/linux-kernel/infiniband/include/ib_mad.h
   gen2/trunk/src/linux-kernel/infiniband/include/ib_sa.h
Log:
Move SA MAD definitions to ib_mad.h. Removed unneeded packed attribute
from MAD structure definitions.

Signed-off-by: Sean Hefty

Modified: gen2/trunk/src/linux-kernel/infiniband/core/sa_query.c
===================================================================
--- gen2/trunk/src/linux-kernel/infiniband/core/sa_query.c	2005-04-20 17:35:52 UTC (rev 2195)
+++ gen2/trunk/src/linux-kernel/infiniband/core/sa_query.c	2005-04-20 17:58:55 UTC (rev 2196)
@@ -50,26 +50,6 @@
 MODULE_DESCRIPTION("InfiniBand subnet administration query support");
 MODULE_LICENSE("Dual BSD/GPL");
 
-/*
- * These two structures must be packed because they have 64-bit fields
- * that are only 32-bit aligned. 64-bit architectures will lay them
- * out wrong otherwise.
(And unfortunately they are sent on the wire
- * so we can't change the layout)
- */
-struct ib_sa_hdr {
-	u64			sm_key;
-	u16			attr_offset;
-	u16			reserved;
-	ib_sa_comp_mask		comp_mask;
-} __attribute__ ((packed));
-
-struct ib_sa_mad {
-	struct ib_mad_hdr	mad_hdr;
-	struct ib_rmpp_hdr	rmpp_hdr;
-	struct ib_sa_hdr	sa_hdr;
-	u8			data[200];
-} __attribute__ ((packed));
-
 struct ib_sa_sm_ah {
 	struct ib_ah        *ah;
 	struct kref          ref;

Modified: gen2/trunk/src/linux-kernel/infiniband/include/ib_mad.h
===================================================================
--- gen2/trunk/src/linux-kernel/infiniband/include/ib_mad.h	2005-04-20 17:35:52 UTC (rev 2195)
+++ gen2/trunk/src/linux-kernel/infiniband/include/ib_mad.h	2005-04-20 17:58:55 UTC (rev 2196)
@@ -117,7 +117,7 @@
 	u16	attr_id;
 	u16	resv;
 	u32	attr_mod;
-} __attribute__ ((packed));
+};
 
 struct ib_rmpp_hdr {
 	u8	rmpp_version;
@@ -126,26 +126,44 @@
 	u8	rmpp_status;
 	u32	seg_num;
 	u32	paylen_newwin;
-} __attribute__ ((packed));
+};
 
+typedef u64 __bitwise ib_sa_comp_mask;
+
+#define IB_SA_COMP_MASK(n)	((__force ib_sa_comp_mask) cpu_to_be64(1ull << n))
+
+struct ib_sa_hdr {
+	u64			sm_key;
+	u16			attr_offset;
+	u16			reserved;
+	ib_sa_comp_mask		comp_mask;
+};
+
 struct ib_mad {
 	struct ib_mad_hdr	mad_hdr;
 	u8			data[232];
-} __attribute__ ((packed));
+};
 
 struct ib_rmpp_mad {
 	struct ib_mad_hdr	mad_hdr;
 	struct ib_rmpp_hdr	rmpp_hdr;
 	u8			data[220];
-} __attribute__ ((packed));
+};
 
+struct ib_sa_mad {
+	struct ib_mad_hdr	mad_hdr;
+	struct ib_rmpp_hdr	rmpp_hdr;
+	struct ib_sa_hdr	sa_hdr;
+	u8			data[200];
+};
+
 struct ib_vendor_mad {
 	struct ib_mad_hdr	mad_hdr;
 	struct ib_rmpp_hdr	rmpp_hdr;
 	u8			reserved;
 	u8			oui[3];
 	u8			data[216];
-} __attribute__ ((packed));
+};

From roland at topspin.com Tue Apr 26 08:14:08 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 26 Apr 2005 08:14:08 -0700
Subject: [openib-general] Re: [openib-commits] r2196 - in gen2/trunk/src/linux-kernel/infiniband: core include]
In-Reply-To:
<1114526788.1764.200.camel@localhost.localdomain> (Hal Rosenstock's message of "26 Apr 2005 10:48:24 -0400") References: <1114526788.1764.200.camel@localhost.localdomain> Message-ID: <521x8xtvsv.fsf@topspin.com> Hal> Hi Sean, I may have missed this but how is the need for the Hal> non natural alignment accomodated now ? Hal> I am unable to IPoIB ping from one node to another as the SA Hal> query for PathRecord is not answered as something is now Hal> wrong in the query. When I add back the packing (patch Hal> below), it works again. Hal> I do think the packing is for more than just 64 bit Hal> architectures as I am running this on an Intel 386. Yes, the two SA structures definitely need the packed attribute as explained in the comment. I believe that the other MAD structures do not need the attribute but that needs to be checked. - R. From robert.j.woodruff at intel.com Tue Apr 26 08:14:33 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 26 Apr 2005 08:14:33 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspaceverbs implementation In-Reply-To: <52mzrmtnd9.fsf@topspin.com> Message-ID: Roland wrote, >I think you've missed the point: unless a process sets VM_DONTCOPY on >its RDMA memory regions, then incorrect memory mappings may be used if >the app does something as simple as calling system("ls"). > - R. This is the exact problem that we saw with the Mellanox vapi driver. It set VM_DONTCOPY and the result was that when someone did a system("ls"), the call often caused a segv in the child. The issue seemed to be that someone had done a malloc of a buffer and then registered it, which caused the entire page to be set to VM_DONTCOPY. Then someone else (like the pthreads library) did a malloc that happened to reside in the same page. When the user called system() which did a fork()/exec(), the pthreads library would segv when trying to clean things up before the exec(). 
We found that if we did not set VM_DONTCOPY, the child would
no longer segv, but in some instances, the Mellanox card seemed to be
hosed after the system call and would no longer transfer data.
We never did understand why. We then found that if we set VM_DONTCOPY
on only the register space pages (doorbells and such), but not on the
registered memory, that it seemed to work OK.

woody

From timur.tabi at ammasso.com Tue Apr 26 08:24:03 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Tue, 26 Apr 2005 10:24:03 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <20050426061236.GA27220@infradead.org>
References: <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <52r7gytnfn.fsf@topspin.com> <20050426061236.GA27220@infradead.org>
Message-ID: <426E5D13.6000200@ammasso.com>

Christoph Hellwig wrote:

> What doesn't work with that design are the braindead designed by committee
> APIs in the RDMA world - but I don't think we should care about them too
> much.

I think you should. The whole point behind RDMA is that these APIs exist and are being
used by real-world applications. You can't just ignore them because they're inconvenient.

If you're not willing to cater to these APIs' needs, then you may as well tell all the RDMA
developers to forget about Linux and port everything to Windows instead. The APIs are
here to stay, and the whole point behind this thread is to discuss how Linux can support them.

--
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
-- Ed Smylie, NASA engineer for Apollo 13 From tduffy at sun.com Tue Apr 26 08:27:21 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 26 Apr 2005 08:27:21 -0700 Subject: [openib-general] Re: [openib-commits] r2196 - in gen2/trunk/src/linux-kernel/infiniband: core include] In-Reply-To: <1114526788.1764.200.camel@localhost.localdomain> References: <1114526788.1764.200.camel@localhost.localdomain> Message-ID: <1114529241.11580.1.camel@duffman> On Tue, 2005-04-26 at 10:48 -0400, Hal Rosenstock wrote: > Hi Sean, > > I may have missed this but how is the need for the non natural > alignment accomodated now ? > > I am unable to IPoIB ping from one node to another as the SA query for > PathRecord is not answered as something is now wrong in the query. When > I add back the packing (patch below), it works again. Ah good. It is not just me. Please apply. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Tue Apr 26 08:31:32 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 26 Apr 2005 08:31:32 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050425173757.1dbab90b.akpm@osdl.org> (Andrew Morton's message of "Mon, 25 Apr 2005 17:37:57 -0700") References: <200544159.Ahk9l0puXy39U6u6@topspin.com> <20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com> <20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> 
<20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> Message-ID: <52wtqpsgff.fsf@topspin.com> Andrew> umm, how about we Andrew> - force the special pages into a separate vma Andrew> - run get_user_pages() against it all Andrew> - use RLIMIT_MEMLOCK accounting to check whether the user Andrew> is allowed to do this thing Andrew> - undo the RMLIMIT_MEMLOCK accounting in ->release Andrew> This will all interact with user-initiated mlock/munlock Andrew> in messy ways. Maybe a new kernel-internal vma->vm_flag Andrew> which works like VM_LOCKED but is unaffected by Andrew> mlock/munlock activity is needed. Andrew> A bit of generalisation in do_mlock() should suit? Yes, it seems that modifying do_mlock() to something like int do_mlock(unsigned long start, size_t len, unsigned int set, unsigned int clear) and then exporting a function along the lines of int do_mem_pin(unsigned long start, size_t len, int on) that sets/clears (VM_LOCKED_KERNEL | VM_DONTCOPY) according to the on flag. Seem reasonable? If so I can code this up. - R. 
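
For readers following the proposal above, the set/clear idea can be sketched as a small user-space C model. This is only an illustration under stated assumptions: the SK_* flag values, apply_flags() and mem_pin_flags() are hypothetical names invented for this sketch, not the kernel's definitions and not code from this thread. It shows why a separate kernel-internal bit would survive a user-initiated munlock().

```c
#include <assert.h>

/* Hypothetical flag values for this sketch only; NOT the kernel's. */
#define SK_VM_LOCKED        0x1u  /* toggled by user mlock()/munlock() */
#define SK_VM_DONTCOPY      0x2u  /* region is not copied to a child on fork() */
#define SK_VM_LOCKED_KERNEL 0x4u  /* proposed bit that userspace cannot alter */

/* Model of a generalised do_mlock(): set some bits, clear others. */
static unsigned int apply_flags(unsigned int vm_flags,
                                unsigned int set, unsigned int clear)
{
    /* A caller is assumed never to set and clear the same bit. */
    assert((set & clear) == 0);
    return (vm_flags | set) & ~clear;
}

/* Model of the proposed do_mem_pin(): pin (on != 0) or unpin (on == 0). */
static unsigned int mem_pin_flags(unsigned int vm_flags, int on)
{
    const unsigned int pin = SK_VM_LOCKED_KERNEL | SK_VM_DONTCOPY;

    return on ? apply_flags(vm_flags, pin, 0)
              : apply_flags(vm_flags, 0, pin);
}
```

In this model, a later munlock() is just apply_flags(flags, 0, SK_VM_LOCKED), which leaves SK_VM_LOCKED_KERNEL in place. What the model does not capture is reference counting, which is the issue raised in the replies that follow.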
From libor at topspin.com Tue Apr 26 08:42:34 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 26 Apr 2005 08:42:34 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <52wtqpsgff.fsf@topspin.com>; from roland@topspin.com on Tue, Apr 26, 2005 at 08:31:32AM -0700 References: <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com> Message-ID: <20050426084234.A10366@topspin.com> On Tue, Apr 26, 2005 at 08:31:32AM -0700, Roland Dreier wrote: > Andrew> umm, how about we > > Andrew> - force the special pages into a separate vma > > Andrew> - run get_user_pages() against it all > > Andrew> - use RLIMIT_MEMLOCK accounting to check whether the user > Andrew> is allowed to do this thing > > Andrew> - undo the RMLIMIT_MEMLOCK accounting in ->release > > Andrew> This will all interact with user-initiated mlock/munlock > Andrew> in messy ways. Maybe a new kernel-internal vma->vm_flag > Andrew> which works like VM_LOCKED but is unaffected by > Andrew> mlock/munlock activity is needed. > > Andrew> A bit of generalisation in do_mlock() should suit? > > Yes, it seems that modifying do_mlock() to something like > > int do_mlock(unsigned long start, size_t len, > unsigned int set, unsigned int clear) > > and then exporting a function along the lines of > > int do_mem_pin(unsigned long start, size_t len, int on) > > that sets/clears (VM_LOCKED_KERNEL | VM_DONTCOPY) according to the on > flag. Do you mean that the set/clear parameters to do_mlock() are the actual flags which are set/cleared by the caller? 
Also, the issue remains that the flags are not reference counted which is a problem if you are dealing with overlapping memory region, or even if one region ends and another begins on the same page. Since the desire is to be able to pin any memory that a user can malloc this is a likely scenario. -Libor From roland at topspin.com Tue Apr 26 08:49:17 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 26 Apr 2005 08:49:17 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050426084234.A10366@topspin.com> (Libor Michalek's message of "Tue, 26 Apr 2005 08:42:34 -0700") References: <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com> Message-ID: <52mzrlsflu.fsf@topspin.com> Libor> Do you mean that the set/clear parameters to do_mlock() Libor> are the actual flags which are set/cleared by the caller? Libor> Also, the issue remains that the flags are not reference Libor> counted which is a problem if you are dealing with Libor> overlapping memory region, or even if one region ends and Libor> another begins on the same page. Since the desire is to be Libor> able to pin any memory that a user can malloc this is a Libor> likely scenario. Good point... we need to figure out how to handle: a) app registers 0x0000 through 0x17ff b) app registers 0x1800 through 0x2fff c) app unregisters 0x0000 through 0x17ff d) the page at 0x1000 must stay pinned hmm... - R. 
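
The a) through d) scenario above can be made concrete with a per-page pin count. The following is a minimal user-space sketch, assuming 4 KB pages and a toy 16-page address space; pin_count, reg_range(), unreg_range() and page_pinned() are hypothetical names for illustration and do not correspond to any existing kernel or library interface.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SHIFT 12   /* assumes 4 KB pages */
#define SK_NPAGES     16   /* toy address space: 0x0000 .. 0xffff */

/* One counter per page: how many registrations currently touch it. */
static unsigned int pin_count[SK_NPAGES];

/* Index of the last page touched by the byte range [start, start + len). */
static size_t last_page(unsigned long start, unsigned long len)
{
    assert(len > 0);
    return (size_t)((start + len - 1) >> SK_PAGE_SHIFT);
}

/* Register a byte range: bump the count of every page it touches. */
static void reg_range(unsigned long start, unsigned long len)
{
    size_t p, last = last_page(start, len);

    assert(last < SK_NPAGES);
    for (p = (size_t)(start >> SK_PAGE_SHIFT); p <= last; p++)
        pin_count[p]++;
}

/* Unregister a byte range: drop the count of every page it touches. */
static void unreg_range(unsigned long start, unsigned long len)
{
    size_t p, last = last_page(start, len);

    assert(last < SK_NPAGES);
    for (p = (size_t)(start >> SK_PAGE_SHIFT); p <= last; p++) {
        assert(pin_count[p] > 0);   /* unbalanced unregister */
        pin_count[p]--;
    }
}

/* A page must stay pinned while any registration still covers it. */
static int page_pinned(unsigned long addr)
{
    return pin_count[addr >> SK_PAGE_SHIFT] > 0;
}
```

Running steps a), b) and c) as reg_range(0x0000, 0x1800), reg_range(0x1800, 0x1800) and unreg_range(0x0000, 0x1800) leaves the page at 0x1000 with a count of one, so it stays pinned as step d) requires, while the page at 0x0000 drops to zero. Where such counts should live (kernel or userspace library) is exactly the open question in the thread.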
From mshefty at ichips.intel.com Tue Apr 26 09:13:27 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 26 Apr 2005 09:13:27 -0700
Subject: [openib-general] Re: [openib-commits] r2196 - in gen2/trunk/src/linux-kernel/infiniband: core include]
In-Reply-To: <1114526788.1764.200.camel@localhost.localdomain>
References: <1114526788.1764.200.camel@localhost.localdomain>
Message-ID: <426E68A7.6040305@ichips.intel.com>

Hal Rosenstock wrote:
> Hi Sean,
>
> I may have missed this but how is the need for the non natural
> alignment accommodated now ?

Sorry about that. I was looking at the layout for a single structure, and
not nesting of structures, when I made this change, and my testing is on a
32-bit system.

- Sean

From olivier.cozette at seanodes.com Tue Apr 26 09:56:32 2005
From: olivier.cozette at seanodes.com (Olivier Cozette)
Date: Tue, 26 Apr 2005 18:56:32 +0200
Subject: [openib-general] kernel vapi
Message-ID: <1114534592.15717.33.camel@olivier.toulouse>

Hello,

Sorry, but I don't know the right list for my problem, so I am
posting it to this list.

I'm using a 2.4 kernel with openib 1.0 on an x86-64 SMP machine, and I tried
to port vping to kernel space
(ib-support/third_party/mellanox_thca_host_native_2_6/src/HCA/examples/vping).

I have removed all the TCP things, and all the thread-related things.

If I prevent schedule() from being called (uninterruptible process, so the PC
is no longer usable), the server receives data from one userspace client.

But if I call schedule() anywhere in the code, for example in
get_next_rq_cqe(), my module oopses with a problem in __switch_to.

With more debugging, I find that current->thread.io_bitmap_ptr has
very strange values ( 0x1 , 0x3 , 0x565554535251504f ) or sometimes a
good value (in normal kernel space). When the value is bad,
__switch_to crashes within "memcpy(tss->io_bitmap,
next->io_bitmap_ptr,IO_BITMAP_SIZE*sizeof(u32));" or with a "mov %db6,%rax"
instruction.
My module seems stable when I set current->thread.io_bitmap_ptr to NULL,
but it crashes when I type "top" or "ps aux".

So if anyone knows of such an issue, please help me.

Olivier

From halr at voltaire.com Tue Apr 26 10:39:56 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Apr 2005 13:39:56 -0400
Subject: [openib-general] Re: [openib-commits] r2196 - in gen2/trunk/src/linux-kernel/infiniband: core include]
In-Reply-To: <1114529241.11580.1.camel@duffman>
References: <1114526788.1764.200.camel@localhost.localdomain> <1114529241.11580.1.camel@duffman>
Message-ID: <1114537000.1764.243.camel@localhost.localdomain>

On Tue, 2005-04-26 at 11:27, Tom Duffy wrote:
> On Tue, 2005-04-26 at 10:48 -0400, Hal Rosenstock wrote:
> > Hi Sean,
> >
> > I may have missed this but how is the need for the non natural
> > alignment accommodated now ?
> >
> > I am unable to IPoIB ping from one node to another as the SA query for
> > PathRecord is not answered as something is now wrong in the query. When
> > I add back the packing (patch below), it works again.
>
> Ah good. It is not just me. Please apply.

And I thought it was (just) me too :-) Applied.
-- Hal From akpm at osdl.org Tue Apr 26 12:28:50 2005 From: akpm at osdl.org (Andrew Morton) Date: Tue, 26 Apr 2005 12:28:50 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <52mzrlsflu.fsf@topspin.com> References: <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com> Message-ID: <20050426122850.44d06fa6.akpm@osdl.org> Roland Dreier wrote: > > Libor> Do you mean that the set/clear parameters to do_mlock() > Libor> are the actual flags which are set/cleared by the caller? > Libor> Also, the issue remains that the flags are not reference > Libor> counted which is a problem if you are dealing with > Libor> overlapping memory region, or even if one region ends and > Libor> another begins on the same page. Since the desire is to be > Libor> able to pin any memory that a user can malloc this is a > Libor> likely scenario. > > Good point... we need to figure out how to handle: > > a) app registers 0x0000 through 0x17ff > b) app registers 0x1800 through 0x2fff > c) app unregisters 0x0000 through 0x17ff > d) the page at 0x1000 must stay pinned The userspace library should be able to track the tree and the overlaps, etc. Things might become interesting when the memory is MAP_SHARED pagecache and multiple independent processes are involved, although I guess that'd work OK. But afaict the problem wherein part of a page needs VM_DONTCOPY and the other part does not cannot be solved. 
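
The sub-page problem above comes down to arithmetic: per-vma flags such as VM_DONTCOPY apply to whole pages, so a registration that is not page-aligned also covers bytes belonging to unrelated objects. A small sketch of that arithmetic follows, assuming 4 KB pages; SK_PAGE_SIZE, page_floor(), page_ceil() and collateral_bytes() are hypothetical helper names used only for this illustration.

```c
#include <assert.h>

#define SK_PAGE_SIZE 4096ul   /* assumes 4 KB pages */

/* Round an address down to the start of its page. */
static unsigned long page_floor(unsigned long addr)
{
    return addr & ~(SK_PAGE_SIZE - 1);
}

/* Round an address up to the next page boundary (unchanged if aligned). */
static unsigned long page_ceil(unsigned long addr)
{
    return (addr + SK_PAGE_SIZE - 1) & ~(SK_PAGE_SIZE - 1);
}

/*
 * Number of bytes outside the buffer [start, start + len) that still
 * share a page with it, and so would inherit any page-granular flag.
 */
static unsigned long collateral_bytes(unsigned long start, unsigned long len)
{
    assert(len > 0);
    return (page_ceil(start + len) - page_floor(start)) - len;
}
```

For the registration of 0x1800 through 0x2fff used earlier in the thread, collateral_bytes(0x1800, 0x1800) is 0x800: the 2048 bytes from 0x1000 to 0x17ff get the same treatment even though they were never registered. A page-aligned, page-sized buffer gives zero.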
From woodennickel at gmail.com Tue Apr 26 12:57:15 2005 From: woodennickel at gmail.com (Bill Jordan) Date: Tue, 26 Apr 2005 15:57:15 -0400 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050426122850.44d06fa6.akpm@osdl.org> References: <20050425135401.65376ce0.akpm@osdl.org> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com> <20050426122850.44d06fa6.akpm@osdl.org> Message-ID: <5ebee0d1050426125764409335@mail.gmail.com> On 4/26/05, Andrew Morton wrote: > But afaict the problem wherein part of a page needs VM_DONTCOPY and the > other part does not cannot be solved. There may be an opportunity to create a solution where we can mark the page as "copy on fork" so the child has a page with a copy of the contents (at the time of the fork) instead of marking the page copy-on-write. 
-- Bill Jordan InfiniCon Systems From roland at topspin.com Tue Apr 26 13:14:31 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 26 Apr 2005 13:14:31 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050426122850.44d06fa6.akpm@osdl.org> (Andrew Morton's message of "Tue, 26 Apr 2005 12:28:50 -0700") References: <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com> <20050426122850.44d06fa6.akpm@osdl.org> Message-ID: <5264y9s3bs.fsf@topspin.com> Roland> a) app registers 0x0000 through 0x17ff Roland> b) app registers 0x1800 through 0x2fff Roland> c) app unregisters 0x0000 through 0x17ff Roland> d) the page at 0x1000 must stay pinned Andrew> The userspace library should be able to track the tree and Andrew> the overlaps, etc. Things might become interesting when Andrew> the memory is MAP_SHARED pagecache and multiple Andrew> independent processes are involved, although I guess Andrew> that'd work OK. I used to think I knew how to handle this, but in your scheme where the kernel is doing accounting for pinned memory by marking vmas with VM_KERNEL_LOCKED, at step c), I don't see why the kernel won't unlock vmas covering 0x0000 through 0x1fff and credit 8K back to the process's pinning count. Sorry to be so dense but can you spell out what you think should happen at steps a), b) and c) above? Andrew> But afaict the problem wherein part of a page needs Andrew> VM_DONTCOPY and the other part does not cannot be solved. Yes, I agree. 
If an app wants to register half a page and pass the other half to a child process, I think the only answer is "don't do that then." - R. From timur.tabi at ammasso.com Tue Apr 26 13:18:40 2005 From: timur.tabi at ammasso.com (Timur Tabi) Date: Tue, 26 Apr 2005 15:18:40 -0500 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <5264y9s3bs.fsf@topspin.com> References: <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com> <20050426122850.44d06fa6.akpm@osdl.org> <5264y9s3bs.fsf@topspin.com> Message-ID: <426EA220.6010007@ammasso.com> Roland Dreier wrote: > Yes, I agree. If an app wants to register half a page and pass the > other half to a child process, I think the only answer is "don't do > that then." How can the app know that, though? It would have to allocate I/O buffers with knowledge of page boundaries. Today, the apps just malloc() a bunch of memory and pay no attention to whether the beginning or the end of the buffer shares a page with some other, unrelated object. We may as well tell the app that it needs to page-align all I/O buffers. My point is that we can't just simply say, "Don't do that". Some entity (the kernel, libraries, whatever) should be able to tell the app that its usage of memory is going to break in some unpredictable way. -- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com One thing a Southern boy will never say is, "I don't think duct tape will fix it." 
-- Ed Smylie, NASA engineer for Apollo 13 From akpm at osdl.org Tue Apr 26 13:32:29 2005 From: akpm at osdl.org (Andrew Morton) Date: Tue, 26 Apr 2005 13:32:29 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <5264y9s3bs.fsf@topspin.com> References: <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com> <20050426122850.44d06fa6.akpm@osdl.org> <5264y9s3bs.fsf@topspin.com> Message-ID: <20050426133229.416a5e66.akpm@osdl.org> Roland Dreier wrote: > > Roland> a) app registers 0x0000 through 0x17ff > Roland> b) app registers 0x1800 through 0x2fff > Roland> c) app unregisters 0x0000 through 0x17ff > Roland> d) the page at 0x1000 must stay pinned > > Andrew> The userspace library should be able to track the tree and > Andrew> the overlaps, etc. Things might become interesting when > Andrew> the memory is MAP_SHARED pagecache and multiple > Andrew> independent processes are involved, although I guess > Andrew> that'd work OK. > > I used to think I knew how to handle this, but in your scheme where > the kernel is doing accounting for pinned memory by marking vmas with > VM_KERNEL_LOCKED, at step c), I don't see why the kernel won't unlock > vmas covering 0x0000 through 0x1fff and credit 8K back to the > process's pinning count. > > Sorry to be so dense but can you spell out what you think should > happen at steps a), b) and c) above? Well I was vaguely proposing that the userspace library keep track of the byteranges and the underlying page states. 
So in the above scenario userspace would leave the page at 0x1000 registered until all registrations against that page have been undone. From akpm at osdl.org Tue Apr 26 13:37:52 2005 From: akpm at osdl.org (Andrew Morton) Date: Tue, 26 Apr 2005 13:37:52 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <426EA220.6010007@ammasso.com> References: <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com> <20050426122850.44d06fa6.akpm@osdl.org> <5264y9s3bs.fsf@topspin.com> <426EA220.6010007@ammasso.com> Message-ID: <20050426133752.37d74805.akpm@osdl.org> Timur Tabi wrote: > > Roland Dreier wrote: > > > Yes, I agree. If an app wants to register half a page and pass the > > other half to a child process, I think the only answer is "don't do > > that then." > > How can the app know that, though? It would have to allocate I/O buffers with knowledge > of page boundaries. Today, the apps just malloc() a bunch of memory and pay no attention > to whether the beginning or the end of the buffer shares a page with some other, unrelated > object. We may as well tell the app that it needs to page-align all I/O buffers. > > My point is that we can't just simply say, "Don't do that". Some entity (the kernel, > libraries, whatever) should be able to tell the app that its usage of memory is going to > break in some unpredictable way. Our point is that contemporary microprocessors cannot electrically do what you want them to do! 
Now, conceeeeeeiveably the kernel could keep track of the state of the
pages down to the byte level, and could keep track of all COWed pages and
could look at faulting addresses at the byte level and could copy sub-page
ranges by hand from one process's address space into another process's
after I/O completion. I don't think we want to do that.

Methinks your specification is busted.

From roland at topspin.com Tue Apr 26 14:23:28 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 26 Apr 2005 14:23:28 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
In-Reply-To: <20050426133229.416a5e66.akpm@osdl.org> (Andrew Morton's message of "Tue, 26 Apr 2005 13:32:29 -0700")
References: <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com>
	<20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com>
	<20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com>
	<20050425173757.1dbab90b.akpm@osdl.org>
	<52wtqpsgff.fsf@topspin.com>
	<20050426084234.A10366@topspin.com>
	<52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org>
	<5264y9s3bs.fsf@topspin.com>
	<20050426133229.416a5e66.akpm@osdl.org>
Message-ID: <521x8xs04v.fsf@topspin.com>

    Andrew> Well I was vaguely proposing that the userspace library
    Andrew> keep track of the byteranges and the underlying page
    Andrew> states. So in the above scenario userspace would leave
    Andrew> the page at 0x1000 registered until all registrations
    Andrew> against that page have been undone.

OK, I already have code in userspace that keeps reference counts for
overlapping regions, etc. However I'm not sure how to tie this in
with reliable accounting of pinned memory -- we don't want malicious
userspace code to be able to fool the accounting, right?

So I'm still trying to puzzle out what to do.
I don't want to keep a
complicated data structure in the kernel keeping track of what memory
has been registered. Right now, I just keep a list of structs, one
for each region, and when a process dies, I just go through region by
region and do a put_page() to balance off the get_user_pages().

However I don't see how to make it work if I put the reference
counting for overlapping regions in userspace but want mlock()
accounting in the kernel. If a buggy/malicious app does:

  a) register from 0x0000 to 0x2fff
  b) register from 0x1000 to 0x1fff
  c) unregister from 0x0000 to 0x2fff

then it seems the kernel is screwed unless it counts how many times a
vma has been pinned. And adding a pin_count member to vm_struct seems
like a pretty damn major step.

We definitely have to make sure that userspace is never able to either
unpin a page that is still registered with RDMA hardware, because that
can lead to DMA into memory that someone else owns. On the other
hand, we don't want userspace to be able to defeat resource accounting
by tricking the kernel into keeping page_count elevated after it
credits the memory back to a process's limit on locked pages.

The limit on the number of locked pages seems like a natural thing to
check against, but perhaps we need a different limit for the number of
pages pinned for use by RDMA hardware. Sort of the same way that
there's a separate limit on the number of in-flight aios.

 - R.

From tduffy at sun.com Tue Apr 26 15:36:33 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 26 Apr 2005 15:36:33 -0700
Subject: [openib-general] [SDP] having moving data on ttcp.aio connection
Message-ID: <1114554993.22383.7.camel@duffman>

Has anybody seen this type of error when doing SDP?

ERR: : VMA lock <508000:65536> error <-12> <16:0:8>
WARN: <1> <0404:2480> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <51c000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <530000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <544000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <558000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <56c000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <580000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <594000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <5a8000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <5bc000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <5d0000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <5e4000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <5f8000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <60c000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <620000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <634000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <648000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <65c000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
ERR: : VMA lock <670000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB
lock <65536:0>
ERR: : VMA lock <684000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
WARN: <1> CM state <0> event <9> error <-2>

ttcp reports:

[root at sins-stinger-10 ~]# ./ttcp -r -l 65536 -a 20
ttcp-r: buflen = 65536 nbuf = 0 align = 16384/0 port = 5001
ttcp-r: socket
ttcp-r: accept from 192.168.0.26
ttcp-r: Event error <-12> <5275648>
ttcp-r: 0 bytes in 2.50 real seconds = 0.00 Mbit/sec +++
ttcp-r: 2 I/O calls, usec/call = 1248114.00, calls/sec = 0.80
ttcp-r: user: 0 sys: 41994 total: 41994 real: 2496228 (microseconds)

[root at flopteron2 ~]# ./ttcp -t -l 65536 -n 100000 -a 20 192.168.0.233
ttcp-t: buflen = 65536 nbuf = 100000 align = 16384/0 port = 5001 192.168.0.233
ttcp-t: socket
ttcp-t: connect
ttcp-t: Event error <-12> <5275648>
ttcp-t: 0 bytes in 2.56 real seconds = 0.00 Mbit/sec +++
ttcp-t: 2 I/O calls, usec/call = 1282312.00, calls/sec = 0.78
ttcp-t: user: 0 sys: 42993 total: 42993 real: 2564624 (microseconds)

BTW, this is with both ends on 2.6.11 with stock openib revision 2214, not
my modified 2.6.12-rc3 version. Also, I am using opensm for my SM (back
to back configuration).

-tduffy
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: 

From tduffy at sun.com Tue Apr 26 16:05:22 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 26 Apr 2005 16:05:22 -0700
Subject: [openib-general] .org pavilion spot in LW 2005 in SF
In-Reply-To: <1110826633.21708.8.camel@duffman>
References: <1110826633.21708.8.camel@duffman>
Message-ID: <1114556723.22383.12.camel@duffman>

On Mon, 2005-03-14 at 10:57 -0800, Tom Duffy wrote:
> Duncan,
>
> Hello, I am contacting you as a representative from the OpenIB.org
> alliance.
> We are a non-profit organization that is dedicated to
> providing an open-source, multi-vendor, best-of-breed Infiniband stack
> for the Linux kernel as well as all the related userland libraries and
> utilities.
>
> Our website is http://www.openib.org. All of our projects are available
> under the GPL as well as a BSD license.
>
> We would like a slot in the .org pavilion for LinuxWorld 2005 in San
> Francisco. The booth will have demos of InfiniBand in action using the
> recently accepted code in the 2.6.11 kernel running on multiple vendors
> hardware.
>
> Please "reply all" as I have CC'ed the developer list for OpenIB.

Duncan, are you still the person involved with setting up the .org
pavilion at Linux World? If not, can you please forward my message to
the appropriate person.

Thanks,

-tduffy
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: 

From tduffy at sun.com Tue Apr 26 16:18:37 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 26 Apr 2005 16:18:37 -0700
Subject: [openib-general] [SDP] having moving data on ttcp.aio connection
In-Reply-To: <20050426160835.A10906@topspin.com>
References: <1114554993.22383.7.camel@duffman>
	<20050426160835.A10906@topspin.com>
Message-ID: <1114557517.22383.19.camel@duffman>

On Tue, 2005-04-26 at 16:08 -0700, Libor Michalek wrote:
> limit memorylocked unlimited

or in a real shell:

$ ulimit -l unlimited

;-)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Tue Apr 26 16:51:40 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 26 Apr 2005 16:51:40 -0700 Subject: [openib-general] [PATCHv5][SDP] Allow SDP to compile on 2.6.12-rc3 In-Reply-To: <1114210652.5519.1.camel@duffman> References: <000001c546ca$51807b80$8d5aa8c0@infiniconsys.com> <1114126674.6858.31.camel@duffman> <1114210652.5519.1.camel@duffman> Message-ID: <1114559500.22383.27.camel@duffman> So, here is a version that is tested working on 2.6.12-rc3. I have combined the inet_sock struct with the sdp_opt struct. I have not bothered to put #ifdef's in here to keep working with 2.6.11 as the code has changed too much to bother. So, I don't expect this patch to be applied to openib trunk until 2.6.12-final comes out. Take a look that I didn't make any more dumb mistakes, and please test on your configurations. Signed-off-by: Tom Duffy Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_rcvd.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_rcvd.c (revision 2212) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_rcvd.c (working copy) @@ -1236,7 +1236,7 @@ int sdp_event_recv(struct sdp_opt *conn, * If data was consumed by the protocol, signal * the user. 
*/ - sdp_inet_wake_recv(conn->sk, conn->byte_strm); + sdp_inet_wake_recv(sk_sdp(conn), conn->byte_strm); /* * It's possible that a new recv buffer advertisment opened up the * recv window and we can flush buffered send data Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_inet.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_inet.c (revision 2212) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_inet.c (working copy) @@ -101,9 +101,9 @@ module_param(sdp_debug_level, int, 0); */ void sdp_inet_wake_send(struct sock *sk) { - struct sdp_opt *conn; + struct sdp_opt *conn = sdp_sk(sk); - if (!sk || !(conn = SDP_GET_CONN(sk))) + if (sk == NULL) return; if (sk->sk_socket && test_bit(SOCK_NOSPACE, &sk->sk_socket->flags) && @@ -312,7 +312,7 @@ static int sdp_inet_release(struct socke } sk = sock->sk; - conn = SDP_GET_CONN(sk); + conn = sdp_sk(sk); sdp_dbg_ctrl(conn, "RELEASE: linger <%d:%lu> data <%d:%d>", sock_flag(sk, SOCK_LINGER), sk->sk_lingertime, @@ -429,6 +429,7 @@ done: sock_orphan(sk); sdp_conn_unlock(conn); sdp_conn_put(conn); + sock_put(sk); return 0; } @@ -446,7 +447,7 @@ static int sdp_inet_bind(struct socket * int result; sk = sock->sk; - conn = SDP_GET_CONN(sk); + conn = sdp_sk(sk); sdp_dbg_ctrl(conn, "BIND: family <%d> addr <%08x:%04x>", addr->sin_family, addr->sin_addr.s_addr, addr->sin_port); @@ -537,7 +538,7 @@ static int sdp_inet_connect(struct socke int result; sk = sock->sk; - conn = SDP_GET_CONN(sk); + conn = sdp_sk(sk); sdp_dbg_ctrl(conn, "CONNECT: family <%d> addr <%08x:%04x>", addr->sin_family, addr->sin_addr.s_addr, addr->sin_port); @@ -699,7 +700,7 @@ static int sdp_inet_listen(struct socket int result; sk = sock->sk; - conn = SDP_GET_CONN(sk); + conn = sdp_sk(sk); sdp_dbg_ctrl(conn, "LISTEN: addr <%08x:%04x> backlog <%04x>", conn->src_addr, conn->src_port, backlog); @@ -760,7 +761,7 @@ static int sdp_inet_accept(struct socket long timeout; 
listen_sk = listen_sock->sk; - listen_conn = SDP_GET_CONN(listen_sk); + listen_conn = sdp_sk(listen_sk); sdp_dbg_ctrl(listen_conn, "ACCEPT: addr <%08x:%04x>", listen_conn->src_addr, listen_conn->src_port); @@ -816,7 +817,7 @@ static int sdp_inet_accept(struct socket goto listen_done; } } else { - accept_sk = accept_conn->sk; + accept_sk = sk_sdp(accept_conn); switch (accept_conn->istate) { case SDP_SOCK_ST_ACCEPTED: @@ -913,7 +914,7 @@ static int sdp_inet_getname(struct socke struct sdp_opt *conn; sk = sock->sk; - conn = SDP_GET_CONN(sk); + conn = sdp_sk(sk); sdp_dbg_ctrl(conn, "GETNAME: src <%08x:%04x> dst <%08x:%04x>", conn->src_addr, conn->src_port, @@ -953,7 +954,7 @@ static unsigned int sdp_inet_poll(struct * recheck the falgs on being woken. */ sk = sock->sk; - conn = SDP_GET_CONN(sk); + conn = sdp_sk(sk); sdp_dbg_data(conn, "POLL: socket flags <%08lx>", sock->flags); @@ -1040,7 +1041,7 @@ static int sdp_inet_ioctl(struct socket int value; sk = sock->sk; - conn = SDP_GET_CONN(sk); + conn = sdp_sk(sk); sdp_dbg_ctrl(conn, "IOCTL: command <%d> argument <%08lx>", cmd, arg); /* @@ -1162,7 +1163,7 @@ static int sdp_inet_setopt(struct socket int result = 0; sk = sock->sk; - conn = SDP_GET_CONN(sk); + conn = sdp_sk(sk); sdp_dbg_ctrl(conn, "SETSOCKOPT: level <%d> option <%d>", level, optname); @@ -1229,7 +1230,7 @@ static int sdp_inet_getopt(struct socket int len; sk = sock->sk; - conn = SDP_GET_CONN(sk); + conn = sdp_sk(sk); sdp_dbg_ctrl(conn, "GETSOCKOPT: level <%d> option <%d>", level, optname); @@ -1287,7 +1288,7 @@ static int sdp_inet_shutdown(struct sock int result = 0; struct sdp_opt *conn; - conn = SDP_GET_CONN(sock->sk); + conn = sdp_sk(sock->sk); sdp_dbg_ctrl(conn, "SHUTDOWN: flag <%d>", flag); /* @@ -1422,7 +1423,7 @@ static int sdp_inet_create(struct socket sock->ops = &lnx_stream_ops; sock->state = SS_UNCONNECTED; - sock_graft(conn->sk, sock); + sock_graft(sk_sdp(conn), sock); conn->pid = current->pid; Index: 
linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_read.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_read.c (revision 2212) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_read.c (working copy) @@ -185,7 +185,7 @@ int sdp_event_read(struct sdp_opt *conn, */ conn->byte_strm += result; - sdp_inet_wake_recv(conn->sk, conn->byte_strm); + sdp_inet_wake_recv(sk_sdp(conn), conn->byte_strm); } else { if (result < 0) sdp_dbg_warn(conn, "Error <%d> receiving buff", @@ -229,7 +229,7 @@ int sdp_event_read(struct sdp_opt *conn, iocb->flags &= ~(SDP_IOCB_F_ACTIVE | SDP_IOCB_F_RDMA_R); - if (conn->sk->sk_rcvlowat > iocb->post) + if (sk_sdp(conn)->sk_rcvlowat > iocb->post) break; iocb = sdp_iocb_q_get_head(&conn->r_pend); Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_send.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_send.c (revision 2212) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_send.c (working copy) @@ -1792,7 +1792,7 @@ static int sdp_inet_write_cancel(struct /* * lock the socket while we operate. */ - conn = SDP_GET_CONN(si->sock->sk); + conn = sdp_sk(si->sock->sk); sdp_conn_lock(conn); sdp_dbg_ctrl(conn, "Cancel Write IOCB. 
<%08x:%04x> <%08x:%04x>", @@ -2002,7 +2002,7 @@ int sdp_send_flush(struct sdp_opt *conn) /* * see if there is enough buffer to wake/notify writers */ - sdp_inet_wake_send(conn->sk); /* conn->sk->write_space(conn->sk); */ + sdp_inet_wake_send(sk_sdp(conn)); /* conn->sk->write_space(conn->sk); */ return 0; done: @@ -2031,7 +2031,7 @@ int sdp_inet_send(struct kiocb *req, str oob = (msg->msg_flags & MSG_OOB); sk = sock->sk; - conn = SDP_GET_CONN(sk); + conn = sdp_sk(sk); sdp_dbg_data(conn, "send state <%04x> size <%Zu> flags <%08x>", conn->state, size, msg->msg_flags); Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_actv.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_actv.c (revision 2212) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_actv.c (working copy) @@ -40,6 +40,7 @@ void sdp_cm_actv_error(struct sdp_opt *conn, int error) { int result; + struct sock *sk; /* * error value is positive error. 
* @@ -95,11 +96,12 @@ void sdp_cm_actv_error(struct sdp_opt *c conn->shutdown = SHUTDOWN_MASK; conn->send_buf = 0; - if (conn->sk->sk_socket) - conn->sk->sk_socket->state = SS_UNCONNECTED; + sk = sk_sdp(conn); + if (sk->sk_socket) + sk->sk_socket->state = SS_UNCONNECTED; sdp_iocb_q_cancel_all(conn, (0 - error)); - sdp_inet_wake_error(conn->sk); + sdp_inet_wake_error(sk); return; } @@ -117,7 +119,7 @@ static int sdp_cm_actv_establish(struct conn->src_addr, conn->src_port, conn->dst_addr, conn->dst_port); - sk = conn->sk; + sk = sk_sdp(conn); qp_attr = kmalloc(sizeof(*qp_attr), GFP_KERNEL); if (!qp_attr) @@ -550,7 +552,7 @@ int sdp_cm_connect(struct sdp_opt *conn) result = sdp_link_path_lookup(htonl(conn->dst_addr), htonl(conn->src_addr), - conn->sk->sk_bound_dev_if, + sk_sdp(conn)->sk_bound_dev_if, sdp_cm_path_complete, conn, &conn->plid); Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c (revision 2212) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c (working copy) @@ -312,7 +312,7 @@ int sdp_inet_port_get(struct sdp_opt *co static s32 rover = -1; unsigned long flags; - sk = conn->sk; + sk = sk_sdp(conn); /* * lock table */ @@ -323,7 +323,7 @@ int sdp_inet_port_get(struct sdp_opt *co if (port > 0) { for (look = dev_root_s.bind_list, port_ok = 1; look; look = look->bind_next) { - srch = look->sk; + srch = sk_sdp(look); /* * 1) same port * 2) linux force reuse is off. @@ -756,17 +756,6 @@ void sdp_conn_destruct(struct sdp_opt *c if (dump) sdp_conn_state_dump(conn); - /* - * free the OS socket structure - */ - if (!conn->sk) - sdp_dbg_warn(conn, "destruct, no socket! 
continuing."); - else { - sk_free(conn->sk); - conn->sk = NULL; - } - - kmem_cache_free(dev_root_s.conn_cache, conn); } /* @@ -1112,6 +1101,12 @@ error_attr: return result; } +static struct proto sdp_sk_proto = { + .name = "SDP", + .owner = THIS_MODULE, + .obj_size = sizeof(struct sdp_opt), +}; + /* * sdp_conn_alloc - allocate a new socket, and init. */ @@ -1121,8 +1116,7 @@ struct sdp_opt *sdp_conn_alloc(int prior struct sock *sk; int result; - sk = sk_alloc(dev_root_s.proto, priority, - sizeof(struct inet_sock), dev_root_s.sock_cache); + sk = sk_alloc(dev_root_s.proto, priority, &sdp_sk_proto, 1); if (!sk) { sdp_dbg_warn(NULL, "socket alloc error for protocol. <%d:%d>", dev_root_s.proto, priority); @@ -1146,23 +1140,8 @@ struct sdp_opt *sdp_conn_alloc(int prior sk->sk_state_change = sdp_inet_wake_generic; sk->sk_data_ready = sdp_inet_wake_recv; sk->sk_error_report = sdp_inet_wake_error; - /* - * Allocate must be called from process context, since QP - * create/modifies must be in that context. - */ - conn = kmem_cache_alloc(dev_root_s.conn_cache, priority); - if (!conn) { - sdp_dbg_warn(conn, "connection alloc error. <%d>", priority); - result = -ENOMEM; - goto error; - } - memset(conn, 0, sizeof(struct sdp_opt)); - /* - * The STRM interface specific data is map/cast over the TCP specific - * area of the sock. 
- */ - SDP_SET_CONN(sk, conn); + conn = sdp_sk(sk); SDP_CONN_ST_INIT(conn); conn->cm_id = NULL; @@ -1179,7 +1158,6 @@ struct sdp_opt *sdp_conn_alloc(int prior conn->parent = NULL; conn->pid = 0; - conn->sk = sk; conn->hashent = SDP_DEV_SK_INVALID; conn->istate = SDP_SOCK_ST_CLOSED; conn->flags = 0; @@ -1286,7 +1264,7 @@ struct sdp_opt *sdp_conn_alloc(int prior sdp_dbg_warn(conn, "Error <%d> conn table insert <%d:%d>", result, dev_root_s.sk_entry, dev_root_s.sk_size); - goto error_conn; + goto error; } /* * set reference @@ -1300,8 +1278,6 @@ struct sdp_opt *sdp_conn_alloc(int prior * done */ return conn; -error_conn: - kmem_cache_free(dev_root_s.conn_cache, conn); error: sk_free(sk); return NULL; @@ -1470,7 +1446,7 @@ int sdp_proc_dump_conn_data(char *buffer continue; conn = dev_root_s.sk_array[counter]; - sk = conn->sk; + sk = sk_sdp(conn); offset += sprintf((buffer + offset), SDP_PROC_CONN_DATA_FORM, conn->hashent, @@ -1956,26 +1932,13 @@ int sdp_conn_table_init(int proto_family goto error_iocb; } - dev_root_s.conn_cache = kmem_cache_create("sdp_conn", - sizeof(struct sdp_opt), - 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); - if (!dev_root_s.conn_cache) { - sdp_warn("Failed to initialize connection cache."); + sdp_dbg_init("Registering socket proto."); + if (proto_register(&sdp_sk_proto, 1) != 0) { + sdp_warn("Failed to register sdp proto."); result = -ENOMEM; goto error_conn; } - dev_root_s.sock_cache = kmem_cache_create("sdp_sock", - sizeof(struct inet_sock), - 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); - if (!dev_root_s.sock_cache) { - sdp_warn("Failed to initialize sock cache."); - result = -ENOMEM; - goto error_sock; - } - /* * start listening */ @@ -2002,9 +1965,7 @@ int sdp_conn_table_init(int proto_family error_listen: (void)ib_destroy_cm_id(dev_root_s.listen_id); error_cm_id: - kmem_cache_destroy(dev_root_s.sock_cache); -error_sock: - kmem_cache_destroy(dev_root_s.conn_cache); + proto_unregister(&sdp_sk_proto); error_conn: sdp_main_iocb_cleanup(); error_iocb: @@ 
-2045,14 +2006,7 @@ int sdp_conn_table_clear(void) * delete IOCB table */ sdp_main_iocb_cleanup(); - /* - * delete conn cache - */ - kmem_cache_destroy(dev_root_s.conn_cache); - /* - * delete sock cache - */ - kmem_cache_destroy(dev_root_s.sock_cache); + proto_unregister(&sdp_sk_proto); /* * stop listening */ Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_recv.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_recv.c (revision 2212) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_recv.c (working copy) @@ -780,7 +780,7 @@ static int sdp_recv_buff_iocb_pending(st */ if (!iocb->len || (!conn->src_recv && - !(conn->sk->sk_rcvlowat > iocb->post))) { + !(sk_sdp(conn)->sk_rcvlowat > iocb->post))) { /* * complete IOCB */ @@ -835,7 +835,7 @@ int sdp_recv_buff(struct sdp_opt *conn, */ if (buff->flags & SDP_BUFF_F_OOB_PEND) { conn->rcv_urg_cnt++; - sdp_inet_wake_urg(conn->sk); + sdp_inet_wake_urg(sk_sdp(conn)); } /* * loop while there are available IOCB's, break if there is no @@ -933,7 +933,7 @@ static int sdp_inet_read_cancel(struct k /* * lock the socket while we operate. */ - conn = SDP_GET_CONN(si->sock->sk); + conn = sdp_sk(si->sock->sk); sdp_conn_lock(conn); sdp_dbg_ctrl(conn, "Cancel Read IOCB. 
<%08x:%04x> <%08x:%04x>", @@ -1086,7 +1086,7 @@ static int sdp_inet_recv_urg(struct sock int result = 0; u8 value; - conn = SDP_GET_CONN(sk); + conn = sdp_sk(sk); if (sock_flag(sk, SOCK_URGINLINE) || !conn->rcv_urg_cnt) return -EINVAL; @@ -1173,7 +1173,7 @@ int sdp_inet_recv(struct kiocb *req, st struct sdpc_buff_q peek_queue; sk = sock->sk; - conn = SDP_GET_CONN(sk); + conn = sdp_sk(sk); sdp_dbg_data(conn, "state <%08x> size <%Zu> pending <%d> falgs <%08x>", conn->state, size, conn->byte_strm, flags); Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_wall.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_wall.c (revision 2212) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_wall.c (working copy) @@ -294,8 +294,8 @@ int sdp_wall_recv_close(struct sdp_opt * /* * async notification. POLL_HUP on full duplex close only. */ - sdp_inet_wake_generic(conn->sk); - sk_wake_async(conn->sk, 1, POLL_IN); + sdp_inet_wake_generic(sk_sdp(conn)); + sk_wake_async(sk_sdp(conn), 1, POLL_IN); break; } @@ -327,8 +327,8 @@ int sdp_wall_recv_closing(struct sdp_opt /* * async notification. POLL_HUP on full duplex close only. 
*/ - sdp_inet_wake_generic(conn->sk); - sk_wake_async(conn->sk, 1, POLL_HUP); + sdp_inet_wake_generic(sk_sdp(conn)); + sk_wake_async(sk_sdp(conn), 1, POLL_HUP); return 0; } @@ -368,7 +368,7 @@ int sdp_wall_recv_abort(struct sdp_opt * */ sdp_iocb_q_cancel_all(conn, -ECONNRESET); - sdp_inet_wake_error(conn->sk); + sdp_inet_wake_error(sk_sdp(conn)); return 0; } @@ -402,7 +402,7 @@ void sdp_wall_recv_drop(struct sdp_opt * break; case SDP_SOCK_ST_CLOSING: conn->istate = SDP_SOCK_ST_CLOSED; - sdp_inet_wake_generic(conn->sk); + sdp_inet_wake_generic(sk_sdp(conn)); break; default: @@ -418,7 +418,7 @@ void sdp_wall_recv_drop(struct sdp_opt * */ sdp_iocb_q_cancel_all(conn, -ECONNRESET); - sdp_inet_wake_error(conn->sk); + sdp_inet_wake_error(sk_sdp(conn)); break; } Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.h =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.h (revision 2212) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.h (working copy) @@ -146,16 +146,18 @@ enum sdp_mode { */ #define SDP_MSG_EVENT_TABLE_SIZE 0x20 -/* - * connection handle within a socket. - */ -#define SDP_GET_CONN(sk) \ - (*((struct sdp_opt **)&(sk)->sk_protinfo)) -#define SDP_SET_CONN(sk, conn) \ - (*((struct sdp_opt **)&(sk)->sk_protinfo) = (conn)) +static inline struct sdp_opt *sdp_sk(struct sock *sk) +{ + return (struct sdp_opt *)sk; +} + +static inline struct sock *sk_sdp(struct sdp_opt *conn) +{ + return (struct sock *)conn; +} #define SDP_CONN_SET_ERR(conn, val) \ - ((conn)->error = (conn)->sk->sk_err = (val)) + ((conn)->error = sk_sdp(conn)->sk_err = (val)) #define SDP_CONN_GET_ERR(conn) \ ((conn)->error) @@ -214,10 +216,15 @@ struct sdp_conn_lock { * SDP Connection structure. 
*/ struct sdp_opt { + /* + * inet_sock must be first member of sdp_sock + * NOTE: this depends on inet_sock having struct sock as its + * first member + */ + struct inet_sock in; __s32 hashent; /* connection ID/hash entry */ atomic_t refcnt; /* connection reference count. */ - struct sock *sk; /* * SDP specific data */ @@ -530,7 +537,7 @@ static inline int sdp_conn_error(struct * lock, however the linux socket error, needs to be xchg'd since the * SO_ERROR getsockopt happens outside of the connection lock. */ - int error = xchg(&conn->sk->sk_err, 0); + int error = xchg(&sk_sdp(conn)->sk_err, 0); SDP_CONN_SET_ERR(conn, 0); return -error; Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c (revision 2212) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c (working copy) @@ -40,6 +40,7 @@ void sdp_cm_pass_error(struct sdp_opt *conn, int error) { int result; + struct sock *sk; sdp_dbg_ctrl(conn, "passive error. 
src <%08x:%04x> dst <%08x:%04x> <%d>", @@ -59,11 +60,12 @@ void sdp_cm_pass_error(struct sdp_opt *c conn->shutdown = SHUTDOWN_MASK; conn->send_buf = 0; - if (conn->sk->sk_socket) - conn->sk->sk_socket->state = SS_UNCONNECTED; + sk = sk_sdp(conn); + if (sk->sk_socket) + sk->sk_socket->state = SS_UNCONNECTED; sdp_iocb_q_cancel_all(conn, (0 - error)); - sdp_inet_wake_error(conn->sk); + sdp_inet_wake_error(sk); } /* @@ -130,7 +132,7 @@ int sdp_cm_pass_establish(struct sdp_opt goto error; } - sdp_inet_wake_send(conn->sk); + sdp_inet_wake_send(sk_sdp(conn)); kfree(qp_attr); return 0; @@ -320,8 +322,8 @@ static int sdp_cm_listen_lookup(struct s /* * check backlog */ - listen_sk = listen_conn->sk; - sk = conn->sk; + listen_sk = sk_sdp(listen_conn); + sk = sk_sdp(conn); if (listen_conn->backlog_cnt > listen_conn->backlog_max) { sdp_dbg_warn(listen_conn, @@ -356,13 +358,16 @@ static int sdp_cm_listen_lookup(struct s */ sk->sk_lingertime = listen_sk->sk_lingertime; sk->sk_rcvlowat = listen_sk->sk_rcvlowat; - sk->sk_debug = listen_sk->sk_debug; - sk->sk_localroute = listen_sk->sk_localroute; + if (sock_flag(listen_sk, SOCK_DBG)) + sock_set_flag(sk, SOCK_DBG); + if (sock_flag(listen_sk, SOCK_LOCALROUTE)) + sock_set_flag(sk, SOCK_LOCALROUTE); sk->sk_sndbuf = listen_sk->sk_sndbuf; sk->sk_rcvbuf = listen_sk->sk_rcvbuf; sk->sk_no_check = listen_sk->sk_no_check; sk->sk_priority = listen_sk->sk_priority; - sk->sk_rcvtstamp = listen_sk->sk_rcvtstamp; + if (sock_flag(listen_sk, SOCK_RCVTSTAMP)) + sock_set_flag(sk, SOCK_RCVTSTAMP); sk->sk_rcvtimeo = listen_sk->sk_rcvtimeo; sk->sk_sndtimeo = listen_sk->sk_sndtimeo; sk->sk_reuse = listen_sk->sk_reuse; @@ -501,7 +506,7 @@ int sdp_cm_req_handler(struct ib_cm_id * goto done; } /* - * Lock the new connection before modifyingg it into any tables. + * Lock the new connection before modifying it into any tables. 
*/ sdp_conn_lock(conn); /* Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h =================================================================== --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h (revision 2212) +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h (working copy) @@ -197,11 +197,6 @@ struct sdev_root { * SDP wide listen */ struct ib_cm_id *listen_id; /* listen handle */ - /* - * cache's - */ - kmem_cache_t *conn_cache; - kmem_cache_t *sock_cache; }; #endif /* _SDP_DEV_H */ From akpm at osdl.org Tue Apr 26 17:05:13 2005 From: akpm at osdl.org (Andrew Morton) Date: Tue, 26 Apr 2005 17:05:13 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <521x8xs04v.fsf@topspin.com> References: <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com> <20050426122850.44d06fa6.akpm@osdl.org> <5264y9s3bs.fsf@topspin.com> <20050426133229.416a5e66.akpm@osdl.org> <521x8xs04v.fsf@topspin.com> Message-ID: <20050426170513.33b81f76.akpm@osdl.org> Roland Dreier wrote: > > Andrew> Well I was vaguely proposing that the userspace library > Andrew> keep track of the byteranges and the underlying page > Andrew> states. So in the above scenario userspace would leave > Andrew> the page at 0x1000 registered until all registrations > Andrew> against that page have been undone. > > OK, I already have code in userspace that keeps reference counts for > overlapping regions, etc. 
> However I'm not sure how to tie this in
> with reliable accounting of pinned memory -- we don't want malicious
> userspace code to be able to fool the accounting, right?
>
> So I'm still trying to puzzle out what to do. I don't want to keep a
> complicated data structure in the kernel keeping track of what memory
> has been registered. Right now, I just keep a list of structs, one
> for each region, and when a process dies, I just go through region by
> region and do a put_page() to balance off the get_user_pages().
>
> However I don't see how to make it work if I put the reference
> counting for overlapping regions in userspace but want mlock()
> accounting in the kernel. If a buggy/malicious app does:
>
>   a) register from 0x0000 to 0x2fff
>   b) register from 0x1000 to 0x1fff
>   c) unregister from 0x0000 to 0x2fff

As far as the kernel is concerned, step b) should be a no-op. (The kernel
might choose to split the vma, but that's not significant).

> then it seems the kernel is screwed unless it counts how many times a
> vma has been pinned. And adding a pin_count member to vm_struct seems
> like a pretty damn major step.
>
> We definitely have to make sure that userspace is never able to either
> unpin a page that is still registered with RDMA hardware, because that
> can lead to DMA into memory that someone else owns. On the other
> hand, we don't want userspace to be able to defeat resource accounting
> by tricking the kernel into keeping page_count elevated after it
> credits the memory back to a process's limit on locked pages.

The kernel can simply register and unregister ranges for RDMA. So
effectively a particular page is in either the registered or unregistered
state. Kernel accounting counts the number of registered pages and
compares this with rlimits.

On top of all that, your userspace library needs to keep track of when
pages should really be registered and unregistered with the kernel. Using
overlap logic and per-page refcounting or whatever.

No?
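
The split proposed above (the kernel only knows whether a page is
registered; the userspace library refcounts overlapping registrations)
might be sketched roughly as below. This is an illustration only, not
code from this thread or from any OpenIB library: struct reg_tracker,
track_register() and track_unregister() are hypothetical names, 4 KiB
pages and a 16-page toy address space are assumed, and the return value
merely stands in for the number of pages the library would have to ask
the kernel to register or unregister.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_NPAGES     16	/* assumption: toy 64 KiB address space */

/* Hypothetical userspace-side tracker; not an OpenIB data structure. */
struct reg_tracker {
	unsigned int refs[SKETCH_NPAGES];	/* registrations covering each page */
	unsigned int pinned;			/* pages currently registered with the kernel */
};

/*
 * Record a registration of the inclusive byte range [start, end].
 * Returns the number of pages whose count went 0 -> 1, i.e. the pages
 * the library would now have to register with the kernel.
 */
static unsigned int track_register(struct reg_tracker *t,
				   unsigned long start, unsigned long end)
{
	unsigned long pg, last = end >> SKETCH_PAGE_SHIFT;
	unsigned int newly = 0;

	assert(start <= end && last < SKETCH_NPAGES);
	for (pg = start >> SKETCH_PAGE_SHIFT; pg <= last; pg++)
		if (t->refs[pg]++ == 0)
			newly++;
	t->pinned += newly;
	return newly;
}

/*
 * Drop a registration of the inclusive byte range [start, end].
 * Returns the number of pages whose count went 1 -> 0, i.e. the pages
 * the library could now unregister with the kernel.
 */
static unsigned int track_unregister(struct reg_tracker *t,
				     unsigned long start, unsigned long end)
{
	unsigned long pg, last = end >> SKETCH_PAGE_SHIFT;
	unsigned int released = 0;

	assert(start <= end && last < SKETCH_NPAGES);
	for (pg = start >> SKETCH_PAGE_SHIFT; pg <= last; pg++) {
		assert(t->refs[pg] > 0);	/* range was never registered */
		if (--t->refs[pg] == 0)
			released++;
	}
	t->pinned -= released;
	return released;
}
```

With the earlier sequence from this thread (register 0x0000 through
0x17ff, register 0x1800 through 0x2fff, unregister 0x0000 through
0x17ff), the page at 0x1000 keeps a count of 1 after the unregister, so
the library would leave it registered. Note that this only covers the
library side; it does nothing about a malicious caller that bypasses the
library, which is the accounting concern raised above.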
From tduffy at sun.com Tue Apr 26 20:00:35 2005 From: tduffy at sun.com (Tom Duffy) Date: Tue, 26 Apr 2005 20:00:35 -0700 Subject: [openib-general] kernel vapi In-Reply-To: <1114534592.15717.33.camel@olivier.toulouse> References: <1114534592.15717.33.camel@olivier.toulouse> Message-ID: <1114570835.22627.7.camel@duffman> On Tue, 2005-04-26 at 18:56 +0200, Olivier Cozette wrote: > Hello, > > Sorry, but I don't know the right list for my problem, so I'm posting it > to this list. > > I'm using a 2.4 kernel with openib 1.0 on an x86-64 SMP machine, and I tried > to port vping to kernel space > (ib-support/third_party/mellanox_thca_host_native_2_6/src/HCA/examples/vping). Can you reproduce your problem on a 2.6.11 kernel with the gen2 code? Unfortunately, gen1 is no longer supported. Nor is the 2.4 kernel. -tduffy From tomduffy at gmail.com Tue Apr 26 20:05:44 2005 From: tomduffy at gmail.com (Tom Duffy) Date: Tue, 26 Apr 2005 20:05:44 -0700 Subject: [openib-general] rendering openib.org on Firefox/Linux Message-ID: <9d3b7de7050426200569e83b68@mail.gmail.com> Has anybody else noticed that openib.org doesn't seem to render properly on Firefox/Linux? Check out this screenshot. -------------- next part -------------- A non-text attachment was scrubbed...
Name: openib_banner.png Type: image/png Size: 11157 bytes Desc: not available URL: From caitlin.bestler at gmail.com Tue Apr 26 20:15:46 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Tue, 26 Apr 2005 20:15:46 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050426122850.44d06fa6.akpm@osdl.org> References: <20050425135401.65376ce0.akpm@osdl.org> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com> <20050426122850.44d06fa6.akpm@osdl.org> Message-ID: <469958e00504262015772c9181@mail.gmail.com> On 4/26/05, Andrew Morton wrote: > Roland Dreier wrote: > > > > Libor> Do you mean that the set/clear parameters to do_mlock() > > Libor> are the actual flags which are set/cleared by the caller? > > Libor> Also, the issue remains that the flags are not reference > > Libor> counted which is a problem if you are dealing with > > Libor> overlapping memory region, or even if one region ends and > > Libor> another begins on the same page. Since the desire is to be > > Libor> able to pin any memory that a user can malloc this is a > > Libor> likely scenario. > > > > Good point... we need to figure out how to handle: > > > > a) app registers 0x0000 through 0x17ff > > b) app registers 0x1800 through 0x2fff > > c) app unregisters 0x0000 through 0x17ff > > d) the page at 0x1000 must stay pinned > > The userspace library should be able to track the tree and the overlaps, > etc. Things might become interesting when the memory is MAP_SHARED > pagecache and multiple independent processes are involved, although I guess > that'd work OK. > > But afaict the problem wherein part of a page needs VM_DONTCOPY and the > other part does not cannot be solved. > Which portion of the userspace library? 
HCA-dependent code, or common code? The HCA-dependent code would fail to count when the same memory was registered to different HCAs (for example to the internal network device and the external network device). The vendor-independent code *could* do it, but only by maintaining a complete list of all registrations that had been issued but not cancelled. That data would be redundant with data kept at the verb layer, and by the kernel. It *would* work, but maintaining highly redundant data at multiple layers is something that I generally try to avoid. From caitlin.bestler at gmail.com Tue Apr 26 20:21:23 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Tue, 26 Apr 2005 20:21:23 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050426170513.33b81f76.akpm@osdl.org> References: <20050425135401.65376ce0.akpm@osdl.org> <20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com> <20050426122850.44d06fa6.akpm@osdl.org> <5264y9s3bs.fsf@topspin.com> <20050426133229.416a5e66.akpm@osdl.org> <521x8xs04v.fsf@topspin.com> <20050426170513.33b81f76.akpm@osdl.org> Message-ID: <469958e0050426202144a1fdf4@mail.gmail.com> On 4/26/05, Andrew Morton wrote: > > > > However I don't see how to make it work if I put the reference > > counting for overlapping regions in userspace but want mlock() > > accounting in the kernel. If a buggy/malicious app does: > > > > a) register from 0x0000 to 0x2fff > > b) register from 0x1000 to 0x1fff > > c) unregister from 0x0000 to 0x2fff > > As far as the kernel is concerned, step b) should be a no-op. (The kernel > might choose to split the vma, but that's not significant). > If "register" and "unregister" are meant in the RDMA sense then the above sequence is totally reasonable. The b) registration could be for a different protection domain that did not require access to all of the larger region.
Unless a full counting lock is available from the kernel, the responsibility of the collective RDMA components would be to a) pin 0x0000 to 0x2fff, b) do nothing, and c) unpin 0x0000 to 0x0fff and 0x2000 to 0x2fff From halr at voltaire.com Wed Apr 27 03:53:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Apr 2005 06:53:31 -0400 Subject: [openib-general] [SDP] having moving data on ttcp.aio connection In-Reply-To: <1114554993.22383.7.camel@duffman> References: <1114554993.22383.7.camel@duffman> Message-ID: <1114556937.1764.331.camel@localhost.localdomain> On Tue, 2005-04-26 at 18:36, Tom Duffy wrote: > BTW, this with both ends on 2.6.11 with stock openib revision 2214, not > my modified 2.6.12-rc3 version. Also, I am using opensm for my SM (back > to back configuration). I haven't run OpenSM on top of the latest changes in a while so I need to ask: Does ibstat/ibstatus show both ports as active with LIDs assigned ? Thanks. -- Hal From halr at voltaire.com Wed Apr 27 06:09:07 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Apr 2005 09:09:07 -0400 Subject: [openib-general] [SDP] having moving data on ttcp.aio connection In-Reply-To: <1114556937.1764.331.camel@localhost.localdomain> References: <1114554993.22383.7.camel@duffman> <1114556937.1764.331.camel@localhost.localdomain> Message-ID: <1114607232.1764.379.camel@localhost.localdomain> On Wed, 2005-04-27 at 06:53, Hal Rosenstock wrote: > On Tue, 2005-04-26 at 18:36, Tom Duffy wrote: > > BTW, this with both ends on 2.6.11 with stock openib revision 2214, not > > my modified 2.6.12-rc3 version. Also, I am using opensm for my SM (back > > to back configuration). > > I haven't run OpenSM on top of the latest changes in a while Just did this and OpenSM appears to work (as does IPoIB). -- Hal > so I need to ask: > > Does ibstat/ibstatus show both ports as active with LIDs assigned ? > > Thanks.
> > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From libor at topspin.com Tue Apr 26 16:08:35 2005 From: libor at topspin.com (Libor Michalek) Date: Tue, 26 Apr 2005 16:08:35 -0700 Subject: [openib-general] [SDP] having moving data on ttcp.aio connection In-Reply-To: <1114554993.22383.7.camel@duffman>; from tduffy@sun.com on Tue, Apr 26, 2005 at 03:36:33PM -0700 References: <1114554993.22383.7.camel@duffman> Message-ID: <20050426160835.A10906@topspin.com> On Tue, Apr 26, 2005 at 03:36:33PM -0700, Tom Duffy wrote: > Has anybody seen this type of error when doing SDP? > > ERR: : VMA lock <508000:65536> error <-12> <16:0:8> > WARN: <1> <0404:2480> Error <-12> IOCB lock <65536:0> > > ttcp reports: > > [root at sins-stinger-10 ~]# ./ttcp -r -l 65536 -a 20 > ttcp-r: buflen = 65536 nbuf = 0 align = 16384/0 port = 5001 > ttcp-r: socket > ttcp-r: accept from 192.168.0.26 > ttcp-r: Event error <-12> <5275648> > ttcp-r: 0 bytes in 2.50 real seconds = 0.00 Mbit/sec +++ > ttcp-r: 2 I/O calls, usec/call = 1248114.00, calls/sec = 0.80 > ttcp-r: user: 0 sys: 41994 total: 41994 real: 2496228 (microseconds) > > BTW, this with both ends on 2.6.11 with stock openib revision 2214, not > my modified 2.6.12-rc3 version. Also, I am using opensm for my SM (back > to back configuration). The error -12 is errno ENOMEM, and the most common caused for ENOMEM with AIO, especially on an unloaded system, is that the mlock failed, and the most common reason is that you cannot lock as many pages as you are attempting to lock. You should increase the amount of memory that the user is allowed to lock. The following command in each shell from which you are running ttcp: limit memorylocked unlimited I've already told Hal I'd add this to the README, I'll do it right now. 
-Libor From roland at topspin.com Tue Apr 26 19:13:24 2005 From: roland at topspin.com (Roland Dreier) Date: Tue, 26 Apr 2005 19:13:24 -0700 Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation In-Reply-To: <20050426170513.33b81f76.akpm@osdl.org> (Andrew Morton's message of "Tue, 26 Apr 2005 17:05:13 -0700") References: <20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org> <426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com> <20050426122850.44d06fa6.akpm@osdl.org> <5264y9s3bs.fsf@topspin.com> <20050426133229.416a5e66.akpm@osdl.org> <521x8xs04v.fsf@topspin.com> <20050426170513.33b81f76.akpm@osdl.org> Message-ID: <521x8xq857.fsf@topspin.com> Andrew> The kernel can simply register and unregister ranges for Andrew> RDMA. So effectively a particular page is in either the Andrew> registered or unregistered state. Kernel accounting Andrew> counts the number of registered pages and compares this Andrew> with rlimits. Andrew> On top of all that, your userspace library needs to keep Andrew> track of when pages should really be registered and Andrew> unregistered with the kernel. Using overlap logic and Andrew> per-page refcounting or whatever. This is OK as long as userspace is trusted. However I don't see how this works when we don't trust userspace. The problem is that for an RDMA device (IB HCA or iWARP RNIC), a process can create many memory regions, each of which has a separate virtual-to-physical translation map.
For example, an app can do: a) register 0x0000 through 0xffff and get memory handle 1 b) register 0x0000 through 0xffff and get memory handle 2 c) use memory handle 1 for communication with remote app A d) use memory handle 2 for communication with remote app B Even though memory handles 1 and 2 both refer to exactly the same memory, they may have different lifetimes, might be attached to different connections, and so on. Clearly the memory at 0x0000 must stay pinned as long as the RDMA device thinks either memory handle 1 or memory handle 2 is valid. Furthermore, the kernel must be the one keeping track of how many regions refer to a given page because we can't allow userspace to be able to tell a device to go DMA to memory it doesn't own any more. Creation and destruction of these memory handles will always go through the kernel driver, so this isn't so bad. And get_user_pages() is almost exactly what we need: it stacks perfectly, since it operates on the page_count rather than just setting a bit in vm_flags. The main problem is that it doesn't check against RLIMIT_MEMLOCK. The most reasonable thing to do would seem to be having the IB kernel memory region code update current->mm->locked_vm and check it against RLIMIT_MEMLOCK. I guess it would be good to figure out an appropriate abstraction to export rather than monkeying with current->mm directly. We could also put this directly in get_user_pages(), but I'd be worried about messing with current users. I just don't see a way to make VM_KERNEL_LOCKED work. It would also be nice to have a way for apps to set VM_DONTCOPY appropriately. Christoph's suggestion of extending mmap() and mprotect() with PROT_DONTCOPY seems good to me, especially since it means we don't have to export do_mlock() functionality to modules. - R. 
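The accounting described in the message above (one record per memory handle, stacked page references as with get_user_pages(), and a charge against RLIMIT_MEMLOCK) can be modelled in plain userspace C. This is a rough sketch under stated assumptions, not kernel code: mm_model, region_model, region_pin and region_unpin are hypothetical names, 4 KiB pages are assumed, and charging every region for all of its pages (so that overlapping regions are charged twice) is one possible policy that the message does not spell out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  12u  /* assumption: 4 KiB pages */
#define MODEL_PAGES 32u  /* size of the modelled address space, in pages */

/* Stand-in for the pieces of mm state the message talks about. */
struct mm_model {
    unsigned long locked_vm;          /* pages charged, in the role of mm->locked_vm */
    unsigned long memlock_limit;      /* RLIMIT_MEMLOCK expressed in pages */
    unsigned page_count[MODEL_PAGES]; /* per-page pin count, in the role of page_count */
};

/* One record per memory handle, like the per-region list of structs. */
struct region_model {
    size_t first_page;
    size_t npages;
    int valid;
};

/* Pin [start, start + len) for one memory handle; 0 on success, -errno on failure. */
static int region_pin(struct mm_model *mm, struct region_model *r,
                      uintptr_t start, size_t len)
{
    size_t first, last, npages, p;

    if (len == 0 || start > UINTPTR_MAX - (len - 1))
        return -EINVAL;
    first = (size_t)(start >> PAGE_SHIFT);
    last = (size_t)((start + (len - 1)) >> PAGE_SHIFT);
    if (last >= MODEL_PAGES)
        return -EINVAL;
    npages = last - first + 1;

    /* Check the limit before touching any page so failure has no side effects. */
    if (mm->locked_vm + npages > mm->memlock_limit)
        return -ENOMEM;

    for (p = first; p <= last; p++)
        mm->page_count[p]++;          /* models get_user_pages() taking a reference */
    mm->locked_vm += npages;

    r->first_page = first;
    r->npages = npages;
    r->valid = 1;
    return 0;
}

/* Drop one handle's references and credit its pages back; safe to call twice. */
static void region_unpin(struct mm_model *mm, struct region_model *r)
{
    size_t p;

    if (!r->valid)
        return;
    for (p = r->first_page; p < r->first_page + r->npages; p++) {
        assert(mm->page_count[p] > 0);
        mm->page_count[p]--;          /* models put_page() */
    }
    mm->locked_vm -= r->npages;
    r->valid = 0;
}
```

With two handles created over 0x0000 through 0xffff, as in the example above, destroying handle 1 leaves every page with one remaining reference, so the memory stays pinned for handle 2. Because the record of what each handle pinned is kept on the trusted side, an unbalanced or repeated unpin cannot drop a reference that another handle still needs.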
From halr at voltaire.com Wed Apr 27 07:15:22 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Apr 2005 10:15:22 -0400 Subject: [openib-general] [PATCH] mad.c: Minor cleanup during startup and shutdown Message-ID: <1114611322.1764.385.camel@localhost.localdomain> mad.c: Minor cleanup during startup and shutdown Signed-off-by: Hal Rosenstock Index: mad.c =================================================================== --- mad.c (revision 2212) +++ mad.c (working copy) @@ -2534,14 +2534,6 @@ unsigned long flags; char name[sizeof "ib_mad123"]; - /* First, check if port already open at MAD layer */ - port_priv = ib_get_mad_port(device, port_num); - if (port_priv) { - printk(KERN_DEBUG PFX "%s port %d already open\n", - device->name, port_num); - return 0; - } - /* Create new device info */ port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); if (!port_priv) { @@ -2666,7 +2658,7 @@ static void ib_mad_init_device(struct ib_device *device) { - int ret, num_ports, cur_port, i, ret2; + int num_ports, cur_port, i; if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; @@ -2676,47 +2668,37 @@ cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { - ret = ib_mad_port_open(device, cur_port); - if (ret) { + if (ib_mad_port_open(device, cur_port)) { printk(KERN_ERR PFX "Couldn't open %s port %d\n", device->name, cur_port); goto error_device_open; } - ret = ib_agent_port_open(device, cur_port); - if (ret) { + if (ib_agent_port_open(device, cur_port)) { printk(KERN_ERR PFX "Couldn't open %s port %d " "for agents\n", device->name, cur_port); goto error_device_open; } } + return; - goto error_device_query; - error_device_open: while (i > 0) { cur_port--; - ret2 = ib_agent_port_close(device, cur_port); - if (ret2) { + if (ib_agent_port_close(device, cur_port)) printk(KERN_ERR PFX "Couldn't close %s port %d " "for agents\n", device->name, cur_port); - } - ret2 = ib_mad_port_close(device, cur_port); - if (ret2) { + if (ib_mad_port_close(device, cur_port)) 
printk(KERN_ERR PFX "Couldn't close %s port %d\n", device->name, cur_port); - } i--; } - -error_device_query: - return; } static void ib_mad_remove_device(struct ib_device *device) { - int ret = 0, i, num_ports, cur_port, ret2; + int i, num_ports, cur_port; if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; @@ -2726,21 +2708,13 @@ cur_port = 1; } for (i = 0; i < num_ports; i++, cur_port++) { - ret2 = ib_agent_port_close(device, cur_port); - if (ret2) { + if (ib_agent_port_close(device, cur_port)) printk(KERN_ERR PFX "Couldn't close %s port %d " "for agents\n", device->name, cur_port); - if (!ret) - ret = ret2; - } - ret2 = ib_mad_port_close(device, cur_port); - if (ret2) { + if (ib_mad_port_close(device, cur_port)) printk(KERN_ERR PFX "Couldn't close %s port %d\n", device->name, cur_port); - if (!ret) - ret = ret2; - } } } From olivier.cozette at seanodes.com Wed Apr 27 08:14:32 2005 From: olivier.cozette at seanodes.com (Olivier Cozette) Date: Wed, 27 Apr 2005 17:14:32 +0200 Subject: [openib-general] kernel vapi In-Reply-To: <1114570835.22627.7.camel@duffman> References: <1114534592.15717.33.camel@olivier.toulouse> <1114570835.22627.7.camel@duffman> Message-ID: <1114614872.15717.37.camel@olivier.toulouse> On Tuesday, 26 April 2005 at 20:00 -0700, Tom Duffy wrote: > On Tue, 2005-04-26 at 18:56 +0200, Olivier Cozette wrote: > > Hello, > > > > Sorry, but I don't know the right list for my problem, so I'm posting it > > to this list. > > > > I'm using a 2.4 kernel with openib 1.0 on an x86-64 SMP machine, and I tried > > to port vping to kernel space > > (ib-support/third_party/mellanox_thca_host_native_2_6/src/HCA/examples/vping). > > Can you reproduce your problem on a 2.6.11 kernel with the gen2 code? > Unfortunately, gen1 is no longer supported. Nor is the 2.4 kernel. > > -tduffy Hello, Actually I don't have any 2.6.11 kernel installed now, but I will try tomorrow.
Olivier From jlentini at netapp.com Wed Apr 27 09:06:01 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 27 Apr 2005 12:06:01 -0400 (EDT) Subject: [openib-general] rendering openib.org on Firefox/Linux In-Reply-To: <9d3b7de7050426200569e83b68@mail.gmail.com> References: <9d3b7de7050426200569e83b68@mail.gmail.com> Message-ID: I see it too. On Tue, 26 Apr 2005, Tom Duffy wrote: > Has anybody else noticed that openib.org doesn't seem to render > properly on Firefox/Linux? Check out this screenshot. > From jcarr at linuxmachines.com Wed Apr 27 09:26:56 2005 From: jcarr at linuxmachines.com (Jeff Carr) Date: Wed, 27 Apr 2005 09:26:56 -0700 Subject: [openib-general] rendering openib.org on Firefox/Linux In-Reply-To: References: <9d3b7de7050426200569e83b68@mail.gmail.com> Message-ID: <426FBD50.5070704@linuxmachines.com> James Lentini wrote: > > I see it too. > > On Tue, 26 Apr 2005, Tom Duffy wrote: > >> Has anybody else noticed that openib.org doesn't seem to render >> properly on Firefox/Linux? Check out this screenshot. Not for me. Perhaps reload fixes it? Jeff (using debian sid - firefox 1.0.1) From jlentini at netapp.com Wed Apr 27 09:28:29 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 27 Apr 2005 12:28:29 -0400 (EDT) Subject: [openib-general] rendering openib.org on Firefox/Linux In-Reply-To: <426FBD50.5070704@linuxmachines.com> References: <9d3b7de7050426200569e83b68@mail.gmail.com> <426FBD50.5070704@linuxmachines.com> Message-ID: I'm using Firefox 1.0.3 on Fedora Core 3. On Wed, 27 Apr 2005, Jeff Carr wrote: > James Lentini wrote: >> >> I see it too. >> >> On Tue, 26 Apr 2005, Tom Duffy wrote: >> >>> Has anybody else noticed that openib.org doesn't seem to render >>> properly on Firefox/Linux? Check out this screenshot. > > Not for me. Perhaps reload fixes it? 
> > Jeff > (using debian sid - firefox 1.0.1) > From tduffy at sun.com Wed Apr 27 10:13:15 2005 From: tduffy at sun.com (Tom Duffy) Date: Wed, 27 Apr 2005 10:13:15 -0700 Subject: [openib-general] rendering openib.org on Firefox/Linux In-Reply-To: <426FBD50.5070704@linuxmachines.com> References: <9d3b7de7050426200569e83b68@mail.gmail.com> <426FBD50.5070704@linuxmachines.com> Message-ID: <1114621995.2221.2.camel@duffman> On Wed, 2005-04-27 at 09:26 -0700, Jeff Carr wrote: > Not for me. Perhaps reload fixes it? The problem seems to stem from the fact that the horizontal blue bar does not move when the font is increased or decreased. Here is a series of screenshots to demonstrate the issue: -------------- next part -------------- A non-text attachment was scrubbed... Name: openib-font-issue.jpg Type: image/jpeg Size: 27598 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Wed Apr 27 10:25:54 2005 From: roland at topspin.com (Roland Dreier) Date: Wed, 27 Apr 2005 10:25:54 -0700 Subject: [openib-general] rendering openib.org on Firefox/Linux In-Reply-To: <1114621995.2221.2.camel@duffman> (Tom Duffy's message of "Wed, 27 Apr 2005 10:13:15 -0700") References: <9d3b7de7050426200569e83b68@mail.gmail.com> <426FBD50.5070704@linuxmachines.com> <1114621995.2221.2.camel@duffman> Message-ID: <523btcp1wd.fsf@topspin.com> Tom> The problem seems to stem from the fact that the horizontal Tom> blue bar does not move when the font is increased or Tom> decreased. Here is a series of screenshots to demonstrate Tom> the issue: Looks like there's some absolute positioning hard-coded in the html: