From wkyuwk at gmail.com Sat Apr 1 06:43:49 2006
From: wkyuwk at gmail.com (Weikuan Yu)
Date: Sat, 1 Apr 2006 09:43:49 -0500
Subject: [openib-general] port space
In-Reply-To: <01c501c65520$965f24d0$0281a8c0@ebpc>
References: <01c501c65520$965f24d0$0281a8c0@ebpc>
Message-ID:

> Thanks for your clear and comprehensive answers.
>
>> Sorry, this probably doesn't help you much. What are your
>> requirements for how
>> the port space should behave?
>
> I want to listen on the same service number everywhere in a cluster
> without colliding with anyone else; that's all really.

Isn't it sufficient to listen on the dedicated IBNAL_PORT?

Weikuan

From rdreier at cisco.com Sat Apr 1 06:53:11 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 01 Apr 2006 06:53:11 -0800
Subject: [openib-general] IPoIB destructor patches
In-Reply-To: (Pradeep Satyanarayana's message of "Fri, 31 Mar 2006 22:42:50 -0700")
References:
Message-ID:

    Pradeep> I guess I did not make myself very clear previously. The
    Pradeep> rc2 I meant was openib-1.0-rc2.

Yes, I understood. What I was trying to say is that the kernel is not
released by openib/openfabrics. Therefore it doesn't really make
sense to ask about kernel patches in the context of an openib release.

 - R.

From mst at mellanox.co.il Sat Apr 1 22:30:13 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 2 Apr 2006 09:30:13 +0300
Subject: [openib-general] Re: RDMA CM and loopback addresses
In-Reply-To:
References: <20060331001424.GC28869@obsidianresearch.com>
Message-ID: <20060402063013.GB1399@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: RE: RDMA CM and loopback addresses
>
> >Hmm. This all sounds very reasonable, but if resolve_addr is
> >available, why not get rid of bind()? Do not provide any way
> >in the API for a non-listening socket to get the bound hardware device
> >without providing both source and dest.
>
> This makes sense, and was the way the initial implementation was coded.
>
> Support for bind() on the active side was added to support DAPL. In fact,
> calling bind() doesn't even alleviate the need to call resolve_addr().
>
> I will see if there's a reasonable way to support DAPL without binding
> early to a device.
>
> - Sean

SDP also needs this I think.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From ogerlitz at voltaire.com Sat Apr 1 22:35:37 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 02 Apr 2006 08:35:37 +0200
Subject: [openib-general] port space
In-Reply-To: <200603311202.k2VC2QLP018538@robert.bartonsoftware.com>
References: <200603311202.k2VC2QLP018538@robert.bartonsoftware.com>
Message-ID: <442F70B9.5080206@voltaire.com>

Eric Barton wrote:
> If I use RDMA_PS_TCP, will this not conflict potentially with IPoIB
> sockets?

IPoIB is not using the IB CM, it just tunnels IP packets over IB UD
transport, so in this respect there's no contention between the port of
a listener socket and the SID of an IB RC based ULP.

Or.

From mst at mellanox.co.il Sat Apr 1 22:58:10 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 2 Apr 2006 09:58:10 +0300
Subject: [openib-general] Re: updated InfiniBand 2.6.17 merge plans
In-Reply-To:
References:
Message-ID: <20060402065810.GC1399@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: updated InfiniBand 2.6.17 merge plans
>
> OK, here's a quick update on 2.6.17 merge plans:

These are just the features, right?
There are, as usual, bugfixes which aren't listed here.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From jackm at mellanox.co.il Sun Apr 2 00:43:16 2006
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 2 Apr 2006 11:43:16 +0300
Subject: [openib-general] Re: static rate encoding change support
In-Reply-To:
References: <20060330121820.GH14808@mellanox.co.il> <200603301918.35548.jackm@mellanox.co.il>
Message-ID: <200604021143.16383.jackm@mellanox.co.il>

On Friday 31 March 2006 22:59, Roland Dreier wrote:
> Since I have nothing better to do, I fixed up the patch myself.
> Here's what I have so far; there are still some cleanups that I need
> to make before I commit this.
>
Looks fine (I came up with the identical patch for mthca files on Thurs).

Please note, though, that the patch does not include the mthca changes
for returning the correct static rate to the ib layer in verb "query qp".

I have a patch for this, too, but had hoped to add it to the larger patch,
since that was not yet checked in.

Jack

From leonida at voltaire.com Sun Apr 2 04:08:48 2006
From: leonida at voltaire.com (Leonid Arsh)
Date: Sun, 2 Apr 2006 14:08:48 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
Message-ID: <20060402110848.GA5379@voltaire.com>

Hello,
this is a patch implementing the kernel mode client reregister event support on MTHCA
(see the InfiniBand SPEC 1.2 14.4.11 Client Reregistration.)

The patch handles the MTHCA event by scheduling an IB_EVENT_CLIENT_REREGISTER event.
We checked it on Mellanox PCI-Express HCAs with FW 4.7.0
on the kernel 2.6.15, on fabric with Voltaire SM and it worked fine.

Note, some older FW didn't set ClientReregistration capability bit in the port info,
and the event wasn't generated.

Regards,
Leonid

Signed-off-by: Leonid Arsh

Index: linux-kernel/infiniband/include/rdma/ib_verbs.h
===================================================================
--- linux-kernel/infiniband/include/rdma/ib_verbs.h (revision 8165)
+++ linux-kernel/infiniband/include/rdma/ib_verbs.h (working copy)
@@ -283,7 +283,8 @@
 	IB_EVENT_SM_CHANGE,
 	IB_EVENT_SRQ_ERR,
 	IB_EVENT_SRQ_LIMIT_REACHED,
-	IB_EVENT_QP_LAST_WQE_REACHED
+	IB_EVENT_QP_LAST_WQE_REACHED,
+	IB_EVENT_CLIENT_REREGISTER
 };
 
 struct ib_event {
Index: linux-kernel/infiniband/hw/mthca/mthca_eq.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_eq.c (revision 8504)
+++ linux-kernel/infiniband/hw/mthca/mthca_eq.c (working copy)
@@ -93,6 +93,7 @@
 	MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10,
 	MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11,
 	MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12,
+	MTHCA_EVENT_TYPE_CLIENT_REREGIST = 0x16,
 	MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08,
 	MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09,
 	MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f,
@@ -111,6 +112,7 @@
 				(1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \
 				(1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \
 				(1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \
+				(1ULL << MTHCA_EVENT_TYPE_CLIENT_REREGIST) | \
 				(1ULL << MTHCA_EVENT_TYPE_ECC_DETECT))
 #define MTHCA_SRQ_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \
 				(1ULL << MTHCA_EVENT_TYPE_SRQ_QP_LAST_WQE) | \
@@ -274,6 +276,20 @@
 	ib_dispatch_event(&record);
 }
 
+static void client_reregister_event(struct mthca_dev *dev, int port)
+{
+	struct ib_event record;
+
+	mthca_dbg(dev, "Client reregister for port %d\n",
+		  port);
+
+	record.device = &dev->ib_dev;
+	record.event = IB_EVENT_CLIENT_REREGISTER;
+	record.element.port_num = port;
+
+	ib_dispatch_event(&record);
+}
+
 static int mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq)
 {
 	struct mthca_eqe *eqe;
@@ -365,7 +381,10 @@
 		case MTHCA_EVENT_TYPE_EQ_OVERFLOW:
 			mthca_warn(dev, "EQ overrun on EQN %d\n", eq->eqn);
 			break;
-
+		case MTHCA_EVENT_TYPE_CLIENT_REREGIST:
+			client_reregister_event(dev,
+				(be32_to_cpu(eqe->event.port_change.port) >> 28) & 3);
+			break;
 		case MTHCA_EVENT_TYPE_EEC_CATAS_ERROR:
 		case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR:
 		case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR:

From mulix at mulix.org Sun Apr 2 05:17:32 2006
From: mulix at mulix.org (Muli Ben-Yehuda)
Date: Sun, 02 Apr 2006 15:17:32 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
In-Reply-To: <20060402110848.GA5379@voltaire.com>
References: <20060402110848.GA5379@voltaire.com>
Message-ID: <20060402121732.GD7856@granada.merseine.nu>

On Sun, Apr 02, 2006 at 02:08:48PM +0300, Leonid Arsh wrote:

>  struct ib_event {
> Index: linux-kernel/infiniband/hw/mthca/mthca_eq.c
> ===================================================================
> --- linux-kernel/infiniband/hw/mthca/mthca_eq.c (revision 8504)
> +++ linux-kernel/infiniband/hw/mthca/mthca_eq.c (working copy)
> @@ -93,6 +93,7 @@
>  	MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10,
>  	MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11,
>  	MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12,
> +	MTHCA_EVENT_TYPE_CLIENT_REREGIST = 0x16,

Why not REREGISTER?

> +		case MTHCA_EVENT_TYPE_CLIENT_REREGIST:
> +			client_reregister_event(dev,
> +				(be32_to_cpu(eqe->event.port_change.port) >> 28) & 3);
> +			break;

80 columns per line please.

Cheers,
Muli
-- 
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/

From dotanb at mellanox.co.il Sun Apr 2 05:25:47 2006
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Sun, 2 Apr 2006 15:25:47 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
In-Reply-To: <20060402110848.GA5379@voltaire.com>
References: <20060402110848.GA5379@voltaire.com>
Message-ID: <200604021525.48072.dotanb@mellanox.co.il>

Hi leonid.

On Sunday 02 April 2006 14:08, Leonid Arsh wrote:
> Hello,
> this is a patch implementing the kernel mode client reregister event support on MTHCA

What about adding an enumeration to the verb.h in the user level as well?

Dotan

From leonida at voltaire.com Sun Apr 2 05:31:07 2006
From: leonida at voltaire.com (Leonid Arsh)
Date: Sun, 2 Apr 2006 15:31:07 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
Message-ID:

Sure. To be done.

-----Original Message-----
From: Dotan Barak [mailto:dotanb at mellanox.co.il]
Sent: Sunday, April 02, 2006 2:26 PM
To: openib-general at openib.org
Cc: Leonid Arsh
Subject: Re: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support

Hi leonid.

On Sunday 02 April 2006 14:08, Leonid Arsh wrote:
> Hello,
> this is a patch implementing the kernel mode client reregister event
> support on MTHCA

What about adding an enumeration to the verb.h in the user level as well?

Dotan

From mst at mellanox.co.il Sun Apr 2 05:48:53 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 2 Apr 2006 15:48:53 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
In-Reply-To: <20060402110848.GA5379@voltaire.com>
References: <20060402110848.GA5379@voltaire.com>
Message-ID: <20060402124853.GQ14808@mellanox.co.il>

Quoting r. Leonid Arsh :
> Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
>
> Hello,
> this is a patch implementing the kernel mode client reregister event support on MTHCA
> (see the InfiniBand SPEC 1.2 14.4.11 Client Reregistration.)
>
> The patch handles the MTHCA event by scheduling an IB_EVENT_CLIENT_REREGISTER event.
> We checked it on Mellanox PCI-Express HCAs with FW 4.7.0
> on the kernel 2.6.15, on fabric with Voltaire SM and it worked fine.
>
> Note, some older FW didn't set ClientReregistration capability bit in the port info,
> and the event wasn't generated.
>
> Regards,
> Leonid

Adding hardware-generated events has drawbacks as it increases the risk of
event queue overrun.

Wouldn't pushing this event from mthca_mad.c be a better way to do this? We
already do this from smp_snoop for LID and pkey change events. This way it will
work on any firmware.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From mst at mellanox.co.il Sun Apr 2 05:49:43 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 2 Apr 2006 15:49:43 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
In-Reply-To: <20060402121732.GD7856@granada.merseine.nu>
References: <20060402110848.GA5379@voltaire.com> <20060402121732.GD7856@granada.merseine.nu>
Message-ID: <20060402124943.GR14808@mellanox.co.il>

Quoting r. Muli Ben-Yehuda :
> 80 columns per line please.

Roland's actually pretty lax on this.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From leonida at voltaire.com Sun Apr 2 04:49:35 2006
From: leonida at voltaire.com (Leonid Arsh)
Date: Sun, 02 Apr 2006 14:49:35 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
In-Reply-To: <20060402121732.GD7856@granada.merseine.nu>
References: <20060402110848.GA5379@voltaire.com> <20060402121732.GD7856@granada.merseine.nu>
Message-ID: <442FBA4F.7030107@voltaire.com>

Thank you for the corrections. See also remarks below.

Muli Ben-Yehuda wrote:
> On Sun, Apr 02, 2006 at 02:08:48PM +0300, Leonid Arsh wrote:
>
>>  struct ib_event {
>> Index: linux-kernel/infiniband/hw/mthca/mthca_eq.c
>> ===================================================================
>> --- linux-kernel/infiniband/hw/mthca/mthca_eq.c (revision 8504)
>> +++ linux-kernel/infiniband/hw/mthca/mthca_eq.c (working copy)
>> @@ -93,6 +93,7 @@
>>  	MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10,
>>  	MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11,
>>  	MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12,
>> +	MTHCA_EVENT_TYPE_CLIENT_REREGIST = 0x16,
>>
>
> Why not REREGISTER?
>

Right, CLIENT_REREGIST was just the event name in the Mellanox VAPI based driver.
I'll change it here.

>> +		case MTHCA_EVENT_TYPE_CLIENT_REREGIST:
>> +			client_reregister_event(dev,
>> +				(be32_to_cpu(eqe->event.port_change.port) >> 28) & 3);
>> +			break;
>>
>
> 80 columns per line please.
>
> Cheers,
> Muli
>

From mulix at mulix.org Sun Apr 2 06:05:12 2006
From: mulix at mulix.org (Muli Ben-Yehuda)
Date: Sun, 02 Apr 2006 16:05:12 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
In-Reply-To: <20060402124943.GR14808@mellanox.co.il>
References: <20060402110848.GA5379@voltaire.com> <20060402121732.GD7856@granada.merseine.nu> <20060402124943.GR14808@mellanox.co.il>
Message-ID: <20060402130512.GG7856@granada.merseine.nu>

On Sun, Apr 02, 2006 at 03:49:43PM +0300, Michael S. Tsirkin wrote:
> Quoting r. Muli Ben-Yehuda :
> > 80 columns per line please.
>
> Roland's actually pretty lax on this.

It's not exactly critical, but good kernel manners are to stay
below 80 columns.

"The limit on the length of lines is 80 columns and this is a hard limit."

Cheers,
Muli
-- 
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/

From leonida at voltaire.com Sun Apr 2 05:14:35 2006
From: leonida at voltaire.com (Leonid Arsh)
Date: Sun, 02 Apr 2006 15:14:35 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
In-Reply-To: <20060402124853.GQ14808@mellanox.co.il>
References: <20060402110848.GA5379@voltaire.com> <20060402124853.GQ14808@mellanox.co.il>
Message-ID: <442FC02B.2010801@voltaire.com>

Michael,
the event is quite rare, so I see no risk for the event queue.
I also think that using the HW event is a bit more elegant, so I
implemented it like in the VAPI based driver.

The new FW generates the event in any case, and we just add a compact
event handler.
Only smp_snoop() on the older FW will not help, since the older FW
doesn't set the ClientReregistration port capability bit automatically,
so the SM will not generate ClientReregister requests. There is a way
to set the bit by the SW, but it would add some redundant complexity.

Thanks,
Leonid

Michael S. Tsirkin wrote:
> Quoting r. Leonid Arsh :
>
>> Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
>>
>> Hello,
>> this is a patch implementing the kernel mode client reregister event support on MTHCA
>> (see the InfiniBand SPEC 1.2 14.4.11 Client Reregistration.)
>>
>> The patch handles the MTHCA event by scheduling an IB_EVENT_CLIENT_REREGISTER event.
>> We checked it on Mellanox PCI-Express HCAs with FW 4.7.0
>> on the kernel 2.6.15, on fabric with Voltaire SM and it worked fine.
>>
>> Note, some older FW didn't set ClientReregistration capability bit in the port info,
>> and the event wasn't generated.
>>
>> Regards,
>> Leonid
>>
>
> Adding hardware-generated events has drawbacks as it increases the risk of
> event queue overrun.
>
> Wouldn't pushing this event from mthca_mad.c be a better way to do this? We
> already do this from smp_snoop for LID and pkey change events. This way it will
> work on any firmware.
>
>

From mst at mellanox.co.il Sun Apr 2 06:28:13 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 2 Apr 2006 16:28:13 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
In-Reply-To: <442FC02B.2010801@voltaire.com>
References: <20060402110848.GA5379@voltaire.com> <20060402124853.GQ14808@mellanox.co.il> <442FC02B.2010801@voltaire.com>
Message-ID: <20060402132813.GT14808@mellanox.co.il>

Quoting r. Leonid Arsh :
> Subject: Re: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
>
> Michael,
> the event is quite rare, so I see no risk for the event queue.
> I also think that using the HW event is a bit more elegant, so I
> implemented it like in the VAPI based driver.

No idea why VAPI did it this way: I personally prefer software solutions since
they are so much easier to debug.

> The new FW generates the event in any case, and we just add a compact
> event handler.
> Only smp_snoop() on the older FW will not help, since the older FW
> doesn't set the ClientReregistration port capability bit automatically,
> so the SM will not generate ClientReregister requests.

Looks like you'll just need to set this bit in mthca_process_mad. No?

> There is a way
> to set the bit by the SW, but it would add some redundant complexity.

There need not be any redundancy: I don't advocate using both the software and
hardware-based mechanism: let's just handle this event from smp_snoop and be done
with it.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From jackm at mellanox.co.il Sun Apr 2 07:03:04 2006
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 2 Apr 2006 17:03:04 +0300
Subject: [openib-general] Re: static rate encoding change support
In-Reply-To:
References: <20060330121820.GH14808@mellanox.co.il> <200603301918.35548.jackm@mellanox.co.il>
Message-ID: <200604021703.07238.jackm@mellanox.co.il>

On Friday 31 March 2006 22:59, Roland Dreier wrote:
> Since I have nothing better to do, I fixed up the patch myself.
> Here's what I have so far; there are still some cleanups that I need
> to make before I commit this.
>
On Sunday 02 April 2006 11:43, Jack Morgenstein wrote:
>Please note, though, that the patch does not include the mthca changes
>for returning the correct static rate to the ib layer in verb "query qp".
>
> I have a patch for this, too, but had hoped to add it to the larger patch,
> since that was not yet checked in.
>
> Jack

Here is the static rate query-qp patch, to be applied on top of your
static-rate patch. If possible, I think the two patches (yours and mine)
should be combined into a single check-in.

- Jack

---

Return proper static rate (for new encoding) in query_qp.

Signed-off-by: Jack Morgenstein

Index: src/drivers/infiniband/hw/mthca/mthca_av.c
===================================================================
--- src.orig/drivers/infiniband/hw/mthca/mthca_av.c 2006-04-02 11:48:53.348045000 +0300
+++ src/drivers/infiniband/hw/mthca/mthca_av.c 2006-04-02 16:59:25.695664000 +0300
@@ -71,6 +71,37 @@ static inline u8 ib_rate_to_memfree(stru
 	}
 }
 
+enum ib_rate mthca_rate_to_ib(struct mthca_dev *dev, u8 mthca_rate, u8 port)
+{
+	if (!mthca_rate)
+		return mult_to_ib_rate(dev->rate[port - 1]);
+
+	if (mthca_is_memfree(dev)) {
+		switch (mthca_rate) {
+		case MTHCA_RATE_MEMFREE_EIGHTH:
+			return mult_to_ib_rate(dev->rate[port - 1] / 8);
+		case MTHCA_RATE_MEMFREE_QUARTER:
+			return mult_to_ib_rate(dev->rate[port - 1] / 4);
+		case MTHCA_RATE_MEMFREE_HALF:
+			return mult_to_ib_rate(dev->rate[port - 1] / 2);
+		case MTHCA_RATE_MEMFREE_FULL:
+		default:
+			return mult_to_ib_rate(dev->rate[port - 1]);
+		}
+	}
+
+	switch (mthca_rate) {
+	case MTHCA_RATE_TAVOR_1X:
+		return IB_RATE_2_5_GBPS;
+	case MTHCA_RATE_TAVOR_1X_DDR:
+		return IB_RATE_5_GBPS;
+	case MTHCA_RATE_TAVOR_4X:
+		return IB_RATE_10_GBPS;
+	default:
+		return mult_to_ib_rate(dev->rate[port - 1]);
+	}
+}
+
 u8 mthca_get_rate(struct mthca_dev *dev, struct ib_ah_attr *attr, u8 port)
 {
 	if (!attr->static_rate)
Index: src/drivers/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- src.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2006-04-02 11:48:53.289045000 +0300
+++ src/drivers/infiniband/hw/mthca/mthca_dev.h 2006-04-02 16:09:34.421360000 +0300
@@ -575,6 +575,7 @@ int mthca_process_mad(struct ib_device *
 		      struct ib_mad *out_mad);
 int mthca_update_rate(struct mthca_dev *dev, u8 port_num);
 u8 mthca_get_rate(struct mthca_dev *dev, struct ib_ah_attr *attr, u8 port);
+enum ib_rate mthca_rate_to_ib(struct mthca_dev *dev, u8 mthca_rate, u8 port);
 
 int mthca_create_agents(struct mthca_dev *dev);
 void mthca_free_agents(struct mthca_dev *dev);
Index: src/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- src.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2006-04-02 11:48:53.385044000 +0300
+++ src/drivers/infiniband/hw/mthca/mthca_qp.c 2006-04-02 16:11:36.555940000 +0300
@@ -393,10 +393,17 @@ static void to_ib_ah_attr(struct mthca_d
 {
 	memset(ib_ah_attr, 0, sizeof *path);
 	ib_ah_attr->port_num = (be32_to_cpu(path->port_pkey) >> 24) & 0x3;
+
+	if (ib_ah_attr->port_num == 0 ||
+	    ib_ah_attr->port_num > dev->limits.num_ports )
+		return;
+
 	ib_ah_attr->dlid = be16_to_cpu(path->rlid);
 	ib_ah_attr->sl = be32_to_cpu(path->sl_tclass_flowlabel) >> 28;
 	ib_ah_attr->src_path_bits = path->g_mylmc & 0x7f;
-	ib_ah_attr->static_rate = path->static_rate & 0x7;
+	ib_ah_attr->static_rate = mthca_rate_to_ib(dev,
+			path->static_rate & 0x7,
+			ib_ah_attr->port_num);
 	ib_ah_attr->ah_flags = (path->g_mylmc & (1 << 7)) ? IB_AH_GRH : 0;
 	if (ib_ah_attr->ah_flags) {
 		ib_ah_attr->grh.sgid_index = path->mgid_index & (dev->limits.gid_table_len - 1);
@@ -421,6 +428,7 @@ int mthca_query_qp(struct ib_qp *ibqp, s
 	struct mthca_qp_context *context;
 	int mthca_state;
 	u8 status;
+	int i;
 
 	mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL);
 	if (IS_ERR(mailbox))
@@ -435,10 +443,16 @@ int mthca_query_qp(struct ib_qp *ibqp, s
 		goto out;
 	}
 
+	if (qp->transport == RC || qp->transport == UC)
+		for (i = 1; i < dev->limits.num_ports; ++i)
+			if ((err = mthca_update_rate(dev,i)))
+				goto out;
+
 	qp_param = mailbox->buf;
 	context = &qp_param->context;
 	mthca_state = be32_to_cpu(context->flags) >> 28;
 
+	memset(qp_attr, 0, sizeof *qp_attr);
 	qp_attr->qp_state = to_ib_qp_state(mthca_state);
 	qp_attr->cur_qp_state = qp_attr->qp_state;
 	qp_attr->path_mtu = context->mtu_msgmax >> 5;
@@ -456,8 +470,10 @@ int mthca_query_qp(struct ib_qp *ibqp, s
 	qp_attr->cap.max_recv_sge = qp->rq.max_gs;
 	qp_attr->cap.max_inline_data = qp->max_inline_data;
 
-	to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path);
-	to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path);
+	if (qp->transport == RC || qp->transport == UC){
+		to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path);
+		to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path);
+	}
 
 	qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 0x7f;
 	qp_attr->alt_pkey_index = be32_to_cpu(context->alt_path.port_pkey) & 0x7f;
Index: src/drivers/infiniband/include/rdma/ib_verbs.h
===================================================================
--- src.orig/drivers/infiniband/include/rdma/ib_verbs.h 2006-04-02 11:47:26.155973000 +0300
+++ src/drivers/infiniband/include/rdma/ib_verbs.h 2006-04-02 12:03:27.156727000 +0300
@@ -366,6 +366,22 @@ static inline int ib_rate_to_mult(enum i
 	}
 }
 
+static inline enum ib_rate mult_to_ib_rate(u8 mult)
+{
+	switch (mult) {
+	case 1: return IB_RATE_2_5_GBPS;
+	case 2: return IB_RATE_5_GBPS;
+	case 4: return IB_RATE_10_GBPS;
+	case 8: return IB_RATE_20_GBPS;
+	case 12: return IB_RATE_30_GBPS;
+	case 16: return IB_RATE_40_GBPS;
+	case 24: return IB_RATE_60_GBPS;
+	case 32: return IB_RATE_80_GBPS;
+	case 48: return IB_RATE_120_GBPS;
+	default: return IB_RATE_PORT_CURRENT;
+	}
+}
+
 struct ib_ah_attr {
 	struct ib_global_route grh;

From leonida at voltaire.com Sun Apr 2 06:05:34 2006
From: leonida at voltaire.com (Leonid Arsh)
Date: Sun, 2 Apr 2006 16:05:34 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
Message-ID: <20060402130534.GB6970@voltaire.com>

This is the fixed patch:

Signed-off-by: Leonid Arsh

Index: linux-kernel/infiniband/include/rdma/ib_verbs.h
===================================================================
--- linux-kernel/infiniband/include/rdma/ib_verbs.h (revision 8165)
+++ linux-kernel/infiniband/include/rdma/ib_verbs.h (working copy)
@@ -283,7 +283,8 @@
 	IB_EVENT_SM_CHANGE,
 	IB_EVENT_SRQ_ERR,
 	IB_EVENT_SRQ_LIMIT_REACHED,
-	IB_EVENT_QP_LAST_WQE_REACHED
+	IB_EVENT_QP_LAST_WQE_REACHED,
+	IB_EVENT_CLIENT_REREGISTER
 };
 
 struct ib_event {
Index: linux-kernel/infiniband/hw/mthca/mthca_eq.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_eq.c (revision 8504)
+++ linux-kernel/infiniband/hw/mthca/mthca_eq.c (working copy)
@@ -93,6 +93,7 @@
 	MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10,
 	MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11,
 	MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12,
+	MTHCA_EVENT_TYPE_CLIENT_REREGISTER = 0x16,
 	MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08,
 	MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09,
 	MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f,
@@ -111,6 +112,7 @@
 				(1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \
 				(1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \
 				(1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \
+				(1ULL << MTHCA_EVENT_TYPE_CLIENT_REREGISTER) | \
 				(1ULL << MTHCA_EVENT_TYPE_ECC_DETECT))
 #define MTHCA_SRQ_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \
 				(1ULL << MTHCA_EVENT_TYPE_SRQ_QP_LAST_WQE) | \
@@ -274,6 +276,20 @@
 	ib_dispatch_event(&record);
 }
 
+static void client_reregister_event(struct mthca_dev *dev, int port)
+{
+	struct ib_event record;
+
+	mthca_dbg(dev, "Client reregister for port %d\n",
+		  port);
+
+	record.device = &dev->ib_dev;
+	record.event = IB_EVENT_CLIENT_REREGISTER;
+	record.element.port_num = port;
+
+	ib_dispatch_event(&record);
+}
+
 static int mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq)
 {
 	struct mthca_eqe *eqe;
@@ -365,7 +381,12 @@
 		case MTHCA_EVENT_TYPE_EQ_OVERFLOW:
 			mthca_warn(dev, "EQ overrun on EQN %d\n", eq->eqn);
 			break;
-
+		case MTHCA_EVENT_TYPE_CLIENT_REREGISTER:
+			client_reregister_event(dev,
+				(be32_to_cpu
+				(eqe->event.port_change.
+				port) >> 28) & 3);
+			break;
 		case MTHCA_EVENT_TYPE_EEC_CATAS_ERROR:
 		case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR:
 		case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR:

From sashak at voltaire.com Sun Apr 2 07:34:06 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Apr 2006 17:34:06 +0300
Subject: [openib-general] Re: [PATCH] opensm: observe PartitionEnforcementCap of zero
In-Reply-To: <1143742324.10449.14165.camel@hal.voltaire.com>
References: <1143718893.10449.10852.camel@hal.voltaire.com> <442C1DC5.8060504@obsidianresearch.com> <1143742324.10449.14165.camel@hal.voltaire.com>
Message-ID: <20060402143406.GB12288@sashak.voltaire.com>

Hello Hal,

On 13:15 Thu 30 Mar , Hal Rosenstock wrote:
>
> One more thing:
>
> With this patch, I have noticed the following:
>
> Anafa based switch
>    0: 0x0000 0xffff 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>    8: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>   16: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>   24: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
> 32 pkeys capacity for this port

I have the same pkey table with AnafaI just after power-on, even without
OpenSM at all (and without Rolf's patch obviously). So I think Rolf's
patch works correctly, and it is just a feature of AnafaI.

Sasha.

From eitan at mellanox.co.il Sun Apr 2 07:57:17 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 2 Apr 2006 17:57:17 +0300
Subject: [openib-general] Re: [PATCH] opensm: observe PartitionEnforcementCap of zero
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B9F7@mtlexch01.mtl.com>

You observed right. I just verified this...

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O.
Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: openib-general-bounces at openib.org [mailto:openib-general-
> bounces at openib.org] On Behalf Of Sasha Khapyorsky
> Sent: Sunday, April 02, 2006 5:34 PM
> To: Hal Rosenstock
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] Re: [PATCH] opensm: observe PartitionEnforcementCap
> of zero
>
> Hello Hal,
>
> On 13:15 Thu 30 Mar , Hal Rosenstock wrote:
> >
> > One more thing:
> >
> > With this patch, I have noticed the following:
> >
> > Anafa based switch
> >    0: 0x0000 0xffff 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
> >    8: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
> >   16: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
> >   24: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
> > 32 pkeys capacity for this port
>
> I have the same pkey table with AnafaI just after power-on, even without
> OpenSM at all (and without Rolf's patch obviously). So I think Rolf's
> patch works correct, and it is just feature of AnafaI.
>
> Sasha.

From halr at voltaire.com Sun Apr 2 07:51:05 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Apr 2006 10:51:05 -0400
Subject: [openib-general] Re: [PATCH] opensm: add Obsidian vendor id
In-Reply-To:
References:
Message-ID: <1143989314.4480.29061.camel@hal.voltaire.com>

On Fri, 2006-03-31 at 13:55, Rolf Manderscheid wrote:
> Hi Hal,
>
> Trivial patch to add Obsidian's official vendor id.
>
> Signed-off-by: Rolf Manderscheid
>
> ---
>
> Rolf

Thanks. Applied.

-- Hal

From mst at mellanox.co.il Sun Apr 2 08:01:15 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 2 Apr 2006 18:01:15 +0300
Subject: [openib-general] Re: IPoIB destructor patches
In-Reply-To: <20060331212642.GC17344@greglaptop>
References: <20060331212642.GC17344@greglaptop>
Message-ID: <20060402150115.GW14808@mellanox.co.il>

Quoting r. Greg Lindahl :
> > If you want to forward the necessary patches to stable at kernel.org for
> > inclusion in 2.6.16.x releases, that would be great.
>
> It would be nice to have a single person responsible for spotting
> things that qualify for the stable kernel, and shepherd them into it.
> This one is a good example, something that hurts IB 1.0 on some
> new hardware.
>
> I'm not volunteering Roland; anyone else want to volunteer?

I'll do that. I'll start by sending the destructor patch to the stable team.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From leonida at voltaire.com Sun Apr 2 07:10:55 2006
From: leonida at voltaire.com (Leonid Arsh)
Date: Sun, 02 Apr 2006 17:10:55 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
In-Reply-To: <20060402132813.GT14808@mellanox.co.il>
References: <20060402110848.GA5379@voltaire.com> <20060402124853.GQ14808@mellanox.co.il> <442FC02B.2010801@voltaire.com> <20060402132813.GT14808@mellanox.co.il>
Message-ID: <442FDB6F.7080402@voltaire.com>

Yes, I agree with you about your preference to have a software solution,
but the actual ClientReregister request handling is done by the FW,
isn't it?
If so, it can be a kind of mixture between SW and HW anyway.

As to the bit setting in the mthca_set_mad for old FW, it is not enough,
since the actual request handling would not be done.
We'll have to add the request handling to the SW too.
But why should we, if the new FW excellently does it? It also generates
an event for us.

I like the approach to handle as much as possible in SW, but in my
opinion both ways are acceptable in our case.

Michael S. Tsirkin wrote:
> Quoting r. Leonid Arsh :
>
>> Subject: Re: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
>>
>> Michael,
>> the event is quite rare, so I see no risk for the event queue.
>> I also think that using the HW event is a bit more elegant, so I
>> implemented it like in the VAPI based driver.
>>
>
> No idea why VAPI did it this way: I personally prefer software solutions since
> they are so much easier to debug.
>
>
>> The new FW generates the event in any case, and we just add a compact
>> event handler.
>> Only smp_snoop() on the older FW will not help, since the older FW
>> doesn't set the ClientReregistration port capability bit automatically,
>> so the SM will not generate ClientReregister requests.
>>
>
> Looks like you'll just need to set this bit in mthca_process_mad. No?
>
>
>> There is a way
>> to set the bit by the SW, but it would add some redundant complexity.
>>
>
> There need not be any redundancy: I don't advocate using both the software and
> hardware-based mechanism: let's just handle this event from smp_snoop and be done
> with it.
>
>
>

From mst at mellanox.co.il Sun Apr 2 08:23:42 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 2 Apr 2006 18:23:42 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support
In-Reply-To: <442FDB6F.7080402@voltaire.com>
References: <20060402110848.GA5379@voltaire.com> <20060402124853.GQ14808@mellanox.co.il> <442FC02B.2010801@voltaire.com> <20060402132813.GT14808@mellanox.co.il> <442FDB6F.7080402@voltaire.com>
Message-ID: <20060402152341.GX14808@mellanox.co.il>

Quoting r. Leonid Arsh :
> Yes, I agree with you about your preference to have a software solution,
> but the actual ClientReregister request handling is done by the FW,
> isn't it?

No, FW does not do any handling AFAIK. I think it has the capability to push the
event to software, but the actual re-registration still has to be done by ULPs.

-- 
Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From leonida at voltaire.com Sun Apr 2 08:01:05 2006 From: leonida at voltaire.com (Leonid Arsh) Date: Sun, 02 Apr 2006 18:01:05 +0300 Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support In-Reply-To: <20060402152341.GX14808@mellanox.co.il> References: <20060402110848.GA5379@voltaire.com> <20060402124853.GQ14808@mellanox.co.il> <442FC02B.2010801@voltaire.com> <20060402132813.GT14808@mellanox.co.il> <442FDB6F.7080402@voltaire.com> <20060402152341.GX14808@mellanox.co.il> Message-ID: <442FE731.3050100@voltaire.com> Of course, the FW doesn't handle the event, it handles the request by sending the port info to the SM. It also generates an appropriate event to SW. The actual re-registration is to be done by the SW in ULPs. An alternative way to generate the event is catching the request MAD in smp_snoop, as you suggested, but the port info still will be sent by the FW. Michael S. Tsirkin wrote: > Quoting r. Leonid Arsh : > >> Yes, I agree with you about your preference to have a software solution, >> but the actual ClientReregister request handling is done by the FW, >> isn't it? >> > > No, FW does not do any handling AFAIK. I think it has the capability to push the > event to software, but the actual re-registration still has to be done by ULPs. 
> > From rdreier at cisco.com Sun Apr 2 10:43:00 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 02 Apr 2006 10:43:00 -0700 Subject: [openib-general] Re: static rate encoding change support In-Reply-To: <200604021703.07238.jackm@mellanox.co.il> (Jack Morgenstein's message of "Sun, 2 Apr 2006 17:03:04 +0300") References: <20060330121820.GH14808@mellanox.co.il> <200603301918.35548.jackm@mellanox.co.il> <200604021703.07238.jackm@mellanox.co.il> Message-ID: > @@ -435,10 +443,16 @@ int mthca_query_qp(struct ib_qp *ibqp, s > goto out; > } > > + if (qp->transport == RC || qp->transport == UC) > + for (i = 1; i < dev->limits.num_ports; ++i) > + if ((err = mthca_update_rate(dev,i))) > + goto out; I haven't read through all of this very carefully yet, but I'm wondering about this. Why is it necessary to update the rates of all the ports on every query QP call? Any all of this seems like a whole lot of complexity for not much value. If we really care about returning static rate from query QP, why not just store away the original state rate passed in through modify QP? - R. From mst at mellanox.co.il Sun Apr 2 12:17:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 2 Apr 2006 22:17:27 +0300 Subject: [openib-general] Re: Re: static rate encoding change support In-Reply-To: References: <20060330121820.GH14808@mellanox.co.il> <200603301918.35548.jackm@mellanox.co.il> <200604021703.07238.jackm@mellanox.co.il> Message-ID: <20060402191727.GA3771@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Re: static rate encoding change support > > > @@ -435,10 +443,16 @@ int mthca_query_qp(struct ib_qp *ibqp, s > > goto out; > > } > > > > + if (qp->transport == RC || qp->transport == UC) > > + for (i = 1; i < dev->limits.num_ports; ++i) > > + if ((err = mthca_update_rate(dev,i))) > > + goto out; > > I haven't read through all of this very carefully yet, but I'm > wondering about this. 
Why is it necessary to update the rates of all > the ports on every query QP call? I'm not sure. I assume we can just kill it. > Any all of this seems like a whole lot of complexity for not much > value. If we really care about returning static rate from query QP, > why not just store away the original state rate passed in through > modify QP? Hmm, seems like a waste of memory ... recoding back from hardware seems cheaper - its global, not per QP. No? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Sun Apr 2 13:14:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 02 Apr 2006 13:14:28 -0700 Subject: [openib-general] Re: static rate encoding change support In-Reply-To: <20060402191727.GA3771@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 2 Apr 2006 22:17:27 +0300") References: <20060330121820.GH14808@mellanox.co.il> <200603301918.35548.jackm@mellanox.co.il> <200604021703.07238.jackm@mellanox.co.il> <20060402191727.GA3771@mellanox.co.il> Message-ID: Michael> Hmm, seems like a waste of memory ... recoding back from Michael> hardware seems cheaper - its global, not per QP. No? One byte per QP doesn't seem like much of a waste. We can probably lay out the QP struct so it's free in fact. - R. From mst at mellanox.co.il Sun Apr 2 13:39:30 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 2 Apr 2006 23:39:30 +0300 Subject: [openib-general] Re: static rate encoding change support In-Reply-To: References: <20060330121820.GH14808@mellanox.co.il> <200603301918.35548.jackm@mellanox.co.il> <200604021703.07238.jackm@mellanox.co.il> <20060402191727.GA3771@mellanox.co.il> Message-ID: <20060402203930.GA3993@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: static rate encoding change support > > Michael> Hmm, seems like a waste of memory ... recoding back from > Michael> hardware seems cheaper - its global, not per QP. No? > > One byte per QP doesn't seem like much of a waste. 
We can probably > lay out the QP struct so it's free in fact. Okay. That's a trivial fix then - do you want to add it to your patch yourself? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Sun Apr 2 14:37:15 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 02 Apr 2006 14:37:15 -0700 Subject: [openib-general] Re: static rate encoding change support In-Reply-To: <20060402203930.GA3993@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 2 Apr 2006 23:39:30 +0300") References: <20060330121820.GH14808@mellanox.co.il> <200603301918.35548.jackm@mellanox.co.il> <200604021703.07238.jackm@mellanox.co.il> <20060402191727.GA3771@mellanox.co.il> <20060402203930.GA3993@mellanox.co.il> Message-ID: Michael> Okay. That's a trivial fix then - do you want to add it Michael> to your patch yourself? Sure, I'll do that. - R. From rdreier at cisco.com Sun Apr 2 15:36:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 02 Apr 2006 15:36:36 -0700 Subject: [openib-general] Re: updated InfiniBand 2.6.17 merge plans In-Reply-To: <20060402065810.GC1399@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 2 Apr 2006 09:58:10 +0300") References: <20060402065810.GC1399@mellanox.co.il> Message-ID: Michael> These are just the features, right? There's as the usual Michael> bugfixes which aren't listed here. Yes, of course. We can always merge bugfixes. Although I don't know of any pending bugfixes that are not at least in my for-2.6.17 branch... - R. 
From tziporet at mellanox.co.il Sun Apr 2 23:40:05 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 03 Apr 2006 08:40:05 +0200 Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support In-Reply-To: <442FE731.3050100@voltaire.com> References: <20060402110848.GA5379@voltaire.com> <20060402124853.GQ14808@mellanox.co.il> <442FC02B.2010801@voltaire.com> <20060402132813.GT14808@mellanox.co.il> <442FDB6F.7080402@voltaire.com> <20060402152341.GX14808@mellanox.co.il> <442FE731.3050100@voltaire.com> Message-ID: <4430C345.2050002@mellanox.co.il> Leonid Arsh wrote: > Of course, the FW doesn't handle the event, it handles the request by > sending the port info to the SM. > It also generates an appropriate event to SW. The actual > re-registration is to be done by the SW in ULPs. > An alternative way to generate the event is catching the request MAD > in smp_snoop, as you suggested, but the port info still will be sent > by the FW. > I actually prefer that the FW will generate the ClientReregister event since its already generating this event and in this way we can save logic in the driver. Tziporet From eitan at mellanox.co.il Mon Apr 3 00:11:15 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 03 Apr 2006 10:11:15 +0300 Subject: [openib-general] [PATCH] osm: add support for 1.2 errata - SA enhanced capability mask matching Message-ID: <86y7yngmlo.fsf@mtl066.yok.mtl.com> Hi Hal This patch adds support for the following 1.2 errata MGTWG8372. This should be useful for scalability of: * SRP target discovery and * Queries for all SM ports. Reference ID: 4291 Add to table: 186 SA-Specific ClassPortInfo:CapabilityMask Name | Bit | Description =========================================================================================== IsPortInfoCapMaskMatchSupported | 13 | If this value is 1, SA shall support matching the | | PortInfo:CapabilityMask component as described in | | . 
Reference ID: 4292 If SA's ClassPortInfo:CapabilityMask.IsPortInfoCapMaskMatchSupported is 1, then the AttributeModifier of the SubnAdmGet() and SubnAdmGetTable() methods affects the matching behavior on the PortInfo:CapabilityMask component. If the high-order bit (bit 31) of the AttributeModifier is set to 1, matching on the CapabilityMask component will not be an exact bitwise match as described in . Instead, matching will only be performed on those bits which are set to 1 in the PortInfo:CapabilityMask embedded in the query. In , bits in the PortInfo:CapabilityMask embedded in the query that are set to 0 are bitwise wildcards for purposes of matching. This gives a requester the ability to select desired capabilities and query for ports which support those capabilities. If SA's ClassPortInfo:CapabilityMask.IsPortInfoCapMaskMatchSupported is 0, or if bit 31 of the AttributeModifier is 0, then any matching performed on the PortInfo:CapabilityMask component is as described in . Eitan Signed-off-by: Eitan Zahavi Index: include/opensm/osm_base.h =================================================================== --- include/opensm/osm_base.h (revision 6144) +++ include/opensm/osm_base.h (working copy) @@ -545,6 +545,26 @@ typedef enum _osm_thread_state #define OSM_CAP_IS_REINIT_SUP (1 << 11); /***********/ +/****d* OpenSM: Base/OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED +* Name +* OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED +* +* DESCRIPTION +* SM/SA supports enhanced SA PortInfoRecord searches per 1.2 Errata: +* ClassPortInfo:CapabilityMask.IsPortInfoCapMaskMatchSupported is 1, +* then the AttributeModifier of the SubnAdmGet() and SubnAdmGetTable() +* methods affects the matching behavior on the PortInfo:CapabilityMask +* component. If the high-order bit (bit 31) of the AttributeModifier +* is set to 1, matching on the CapabilityMask component will not be an +* exact bitwise match as described in . 
Instead, +* matching will only be performed on those bits which are set to 1 in +* the PortInfo:CapabilityMask embedded in the query. +* +* SYNOPSIS +*/ +#define OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED (1 << 13); +/***********/ + /****d* OpenSM: Base/osm_sm_state_t * NAME * osm_sm_state_t Index: opensm/osm_sa_portinfo_record.c =================================================================== --- opensm/osm_sa_portinfo_record.c (revision 6144) +++ opensm/osm_sa_portinfo_record.c (working copy) @@ -85,7 +85,7 @@ typedef struct _osm_pir_search_ctxt cl_qlist_t* p_list; osm_pir_rcv_t* p_rcv; const osm_physp_t* p_req_physp; - + boolean_t is_enhanced_comp_mask; } osm_pir_search_ctxt_t; /********************************************************************** @@ -288,11 +288,23 @@ __osm_sa_pir_check_physp( if( p_comp_pi->master_sm_base_lid != p_pi->master_sm_base_lid ) goto Exit; } + + /* IBTA 1.2 errata provides support for bitwise compare if the bit 31 + of the attribute modifier of the Get/GetTable is set */ if( comp_mask & IB_PIR_COMPMASK_CAPMASK ) { + if (p_ctxt->is_enhanced_comp_mask) + { + if ( (p_comp_pi->capability_mask & p_pi->capability_mask) != p_comp_pi->capability_mask ) + goto Exit; + } + else + { if( p_comp_pi->capability_mask != p_pi->capability_mask ) goto Exit; } + } + if( comp_mask & IB_PIR_COMPMASK_DIAGCODE ) { if( p_comp_pi->diag_code != p_pi->diag_code ) @@ -648,6 +660,7 @@ osm_pir_rcv_process( context.comp_mask = p_rcvd_mad->comp_mask; context.p_rcv = p_rcv; context.p_req_physp = p_req_physp; + context.is_enhanced_comp_mask = (cl_ntoh32(p_rcvd_mad->attr_mod) & (1 << 31)); cl_plock_acquire( p_rcv->p_lock ); Index: opensm/osm_sa_class_port_info.c =================================================================== --- opensm/osm_sa_class_port_info.c (revision 6144) +++ opensm/osm_sa_class_port_info.c (working copy) @@ -212,15 +212,21 @@ __osm_cpi_rcv_respond( MultiPathRecord, TraceRecord - OSM_CAP_IS_SUBN_OPT_REINIT_SUP: + OSM_CAP_IS_REINIT_SUP:
For reinitialization functionality. So not sending traps, but supporting Get(Notice) and Set(Notice): */ - p_resp_cpi->cap_mask = 0x2; /* Note host notation replaced later */ + + /* Note host notation replaced later */ + p_resp_cpi->cap_mask = 0x2; /* Generic mask: support Get/Set attributes */ + if (p_rcv->p_subn->opt.no_multicast_option != TRUE) p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP; + p_resp_cpi->cap_mask |= OSM_CAP_IS_REINIT_SUP; + p_resp_cpi->cap_mask |= OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED; + p_resp_cpi->cap_mask = cl_hton16(p_resp_cpi->cap_mask); if( osm_log_is_active( p_rcv->p_log, OSM_LOG_FRAMES ) ) From tziporet at mellanox.co.il Mon Apr 3 01:01:42 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 03 Apr 2006 10:01:42 +0200 Subject: [openib-general] [PATCH] osm: add support for 1.2 errata - SA enhanced capability mask matching In-Reply-To: <86y7yngmlo.fsf@mtl066.yok.mtl.com> References: <86y7yngmlo.fsf@mtl066.yok.mtl.com> Message-ID: <4430D666.7000700@mellanox.co.il> Eitan Zahavi wrote: > Hi Hal > > This patch adds support for the following 1.2 errata MGTWG8372. > This should be useful for scalability of: > * SRP target discovery and > * Queries for all SM ports. > > Hal, Please check-in this patch to the branch too since we need it for SRP discovery Thanks, Tziporet From mst at mellanox.co.il Mon Apr 3 01:20:56 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 3 Apr 2006 11:20:56 +0300 Subject: [openib-general] CMA deadlock Message-ID: <20060403082056.GZ14808@mellanox.co.il> Sean, in case of errors I am seeing hangs in CMA which I think I have tracked to the following deadlock scenario: A ULP requests address resolution; on success requests route resolution; route resolution succeeds; inside the callback ULP requests rdma_connect. Now, a failure (e.g. out of memory) occurs at ULP level and so it decides to destroy the ID. To this end it returns failure code from the route callback. 
Note that route resolution callback runs in the per-port MAD workqueue. Now, CMA will call rdma_destroy_id to destroy the ID. Since CM ID exists, it will try to destroy it. This might deadlock: since a CM MAD (REQ) has been created, CM ID destroy will now block, waiting for the MAD to be freed, but MADs might not complete since we are blocking the MAD workqueue. A possible solution could be to bounce the SA query callback out to the rdma WQ. Does this make sense? Further, a comment in ib_cm.h says: * Users may not call ib_destroy_cm_id while in the context of this callback; * however, returning a non-zero value instructs the communication manager to * destroy the @cm_id after the callback completes. And it seems that, if the user callback returns failure, the CMA actually calls rdma_destroy_id which in turn may call ib_destroy_cm_id from inside the CM callback. I think this might deadlock in a similiar way. Again, bouncing the CM event to the rdma WQ will solve this I think. Sean, could you look at this please? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From jackm at mellanox.co.il Mon Apr 3 01:34:25 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 3 Apr 2006 11:34:25 +0300 Subject: [openib-general] Re: static rate encoding change support In-Reply-To: References: <20060330121820.GH14808@mellanox.co.il> <200604021703.07238.jackm@mellanox.co.il> Message-ID: <200604031134.25798.jackm@mellanox.co.il> On Sunday 02 April 2006 20:43, Roland Dreier wrote: > > @@ -435,10 +443,16 @@ int mthca_query_qp(struct ib_qp *ibqp, s > > goto out; > > } > > > > + if (qp->transport == RC || qp->transport == UC) > > + for (i = 1; i < dev->limits.num_ports; ++i) > > + if ((err = mthca_update_rate(dev,i))) > > + goto out; > > I haven't read through all of this very carefully yet, but I'm > wondering about this. Why is it necessary to update the rates of all > the ports on every query QP call? 
> > Any all of this seems like a whole lot of complexity for not much > value. If we really care about returning static rate from query QP, > why not just store away the original state rate passed in through > modify QP? > In memfree, static rate changes are tracked (the original patch) while in Tavor, they are not. Easiest thing to do is to track static rate changes for Tavor and memfree both (modification of the original large patch). Then, you can get rid of the port static rate update in query-qp. I've indicated the required changes below in a new patch. (That was one of the reasons I wanted to merge the two patches -- sorry for not indicating this). The merged mthca patch, which tracks static rate changes in Tavor as well, includes the static-rate changes for Query-QP, and gets rid of the update_rate in query-qp, is below. - Jack ----- Push translation of static rate to HCA format into low-level drivers, where it belongs. For static rate encoding, use encoding of rate field from IB standard PathRecord, with addition of value 0, for backwards compatibility with current usage. The changes are: - Add enum ib_rate to midlayer includes. - Get rid of static rate translation in IPoIB; just use static rate directly from Path and MulticastGroup records. - Update mthca driver to translate absolute static rate into the format used by hardware. From jackm at mellanox.co.il Mon Apr 3 02:41:54 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 3 Apr 2006 12:41:54 +0300 Subject: [openib-general] Re: static rate encoding change support In-Reply-To: References: <20060330121820.GH14808@mellanox.co.il> <20060402191727.GA3771@mellanox.co.il> Message-ID: <200604031241.56298.jackm@mellanox.co.il> On Sunday 02 April 2006 23:14, Roland Dreier wrote: > Michael> Hmm, seems like a waste of memory ... recoding back from > Michael> hardware seems cheaper - its global, not per QP. No? > > One byte per QP doesn't seem like much of a waste. 
We can probably > lay out the QP struct so it's free in fact. > > - R. In memfree, static rate changes are tracked (the original patch) while in Tavor, they are not -- that was the reason for performing the update inside query-qp. Easiest thing to do is to track static rate changes for Tavor and memfree both (modification of the original large patch). Then, you can get rid of the port static rate update in query-qp. I've indicated the required changes below in a new combined patch (replacement for the patch you proposed). The logic is simpler than before (no memfree check for updating port static rate, and no update at all in query-qp). The merged mthca patch, which tracks static rate changes in Tavor as well, includes the static-rate changes for Query-QP, and gets rid of the update_rate in query-qp, is below. Note that there is no need to store the static rates now. All static rate changes are tracked, so we can use the stored port static rate for the path and alt-path static rate computation. One dividend of this is that you are returned the actual static rate in operation, rather than zero (if you created the path with a default static rate). Sorry I did not see your correspondence/questions last night. I added your header comments to the patch below, replacing mine. I didn't add your name to Signed-off-by list, although I think it belongs there. - Jack -------------------------------- IB: simplify static rate encoding Push translation of static rate to HCA format into low-level drivers, where it belongs. For static rate encoding, use encoding of rate field from IB standard PathRecord, with addition of value 0, for backwards compatibility with current usage. The changes are: - Add enum ib_rate to midlayer includes. - Get rid of static rate translation in IPoIB; just use static rate directly from Path and MulticastGroup records. - Update mthca driver to translate absolute static rate into the format used by hardware.
Signed-off-by: Jack Morgenstein Index: src/drivers/infiniband/include/rdma/ib_verbs.h =================================================================== --- src.orig/drivers/infiniband/include/rdma/ib_verbs.h 2006-04-02 09:41:54.478573000 +0300 +++ src/drivers/infiniband/include/rdma/ib_verbs.h 2006-04-03 11:23:44.078150000 +0300 @@ -337,6 +337,52 @@ enum ib_ah_flags { IB_AH_GRH = 1 }; +enum ib_rate { + IB_RATE_PORT_CURRENT = 0, + IB_RATE_2_5_GBPS = 2, + IB_RATE_5_GBPS = 5, + IB_RATE_10_GBPS = 3, + IB_RATE_20_GBPS = 6, + IB_RATE_30_GBPS = 4, + IB_RATE_40_GBPS = 7, + IB_RATE_60_GBPS = 8, + IB_RATE_80_GBPS = 9, + IB_RATE_120_GBPS = 10 +}; + +static inline int ib_rate_to_mult(enum ib_rate rate) +{ + switch (rate) { + case IB_RATE_2_5_GBPS: return 1; + case IB_RATE_5_GBPS: return 2; + case IB_RATE_10_GBPS: return 4; + case IB_RATE_20_GBPS: return 8; + case IB_RATE_30_GBPS: return 12; + case IB_RATE_40_GBPS: return 16; + case IB_RATE_60_GBPS: return 24; + case IB_RATE_80_GBPS: return 32; + case IB_RATE_120_GBPS: return 48; + default: return -1; + } +} + +static inline enum ib_rate mult_to_ib_rate(u8 mult) +{ + switch (mult) { + case 1: return IB_RATE_2_5_GBPS; + case 2: return IB_RATE_5_GBPS; + case 4: return IB_RATE_10_GBPS; + case 8: return IB_RATE_20_GBPS; + case 12: return IB_RATE_30_GBPS; + case 16: return IB_RATE_40_GBPS; + case 24: return IB_RATE_60_GBPS; + case 32: return IB_RATE_80_GBPS; + case 48: return IB_RATE_120_GBPS; + default: return IB_RATE_PORT_CURRENT; + } +} + + struct ib_ah_attr { struct ib_global_route grh; u16 dlid; Index: src/drivers/infiniband/include/rdma/ib_sa.h =================================================================== --- src.orig/drivers/infiniband/include/rdma/ib_sa.h 2006-04-02 09:41:54.595569000 +0300 +++ src/drivers/infiniband/include/rdma/ib_sa.h 2006-04-02 11:47:26.174973000 +0300 @@ -91,34 +91,6 @@ enum ib_sa_selector { IB_SA_BEST = 3 }; -enum ib_sa_rate { - IB_SA_RATE_2_5_GBPS = 2, - IB_SA_RATE_5_GBPS = 5, - 
IB_SA_RATE_10_GBPS = 3, - IB_SA_RATE_20_GBPS = 6, - IB_SA_RATE_30_GBPS = 4, - IB_SA_RATE_40_GBPS = 7, - IB_SA_RATE_60_GBPS = 8, - IB_SA_RATE_80_GBPS = 9, - IB_SA_RATE_120_GBPS = 10 -}; - -static inline int ib_sa_rate_enum_to_int(enum ib_sa_rate rate) -{ - switch (rate) { - case IB_SA_RATE_2_5_GBPS: return 1; - case IB_SA_RATE_5_GBPS: return 2; - case IB_SA_RATE_10_GBPS: return 4; - case IB_SA_RATE_20_GBPS: return 8; - case IB_SA_RATE_30_GBPS: return 12; - case IB_SA_RATE_40_GBPS: return 16; - case IB_SA_RATE_60_GBPS: return 24; - case IB_SA_RATE_80_GBPS: return 32; - case IB_SA_RATE_120_GBPS: return 48; - default: return -1; - } -} - /* * Structures for SA records are named "struct ib_sa_xxx_rec." No * attempt is made to pack structures to match the physical layout of Index: src/drivers/infiniband/ulp/ipoib/ipoib_fs.c =================================================================== --- src.orig/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2006-04-03 11:16:33.615619000 +0300 +++ src/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2006-04-03 11:22:19.792587000 +0300 @@ -213,7 +213,7 @@ static int ipoib_path_seq_show(struct se gid_buf, path.pathrec.dlid ? 
"yes" : "no"); if (path.pathrec.dlid) { - rate = ib_sa_rate_enum_to_int(path.pathrec.rate) * 25; + rate = ib_rate_to_mult(path.pathrec.rate) * 25; seq_printf(file, " DLID: 0x%04x\n" Index: src/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- src.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-04-03 11:16:33.427621000 +0300 +++ src/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-04-03 11:22:19.773588000 +0300 @@ -378,16 +378,9 @@ static void path_rec_completion(int stat struct ib_ah_attr av = { .dlid = be16_to_cpu(pathrec->dlid), .sl = pathrec->sl, - .port_num = priv->port + .port_num = priv->port, + .static_rate = pathrec->rate }; - int path_rate = ib_sa_rate_enum_to_int(pathrec->rate); - - if (path_rate > 0 && priv->local_rate > path_rate) - av.static_rate = (priv->local_rate - 1) / path_rate; - - ipoib_dbg(priv, "static_rate %d for local port %dX, path %dX\n", - av.static_rate, priv->local_rate, - ib_sa_rate_enum_to_int(pathrec->rate)); ah = ipoib_create_ah(dev, priv->pd, &av); } Index: src/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- src.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-04-03 11:16:33.522619000 +0300 +++ src/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-04-03 11:22:19.784588000 +0300 @@ -250,6 +250,7 @@ static int ipoib_mcast_join_finish(struc .port_num = priv->port, .sl = mcast->mcmember.sl, .ah_flags = IB_AH_GRH, + .static_rate = mcast->mcmember.rate, .grh = { .flow_label = be32_to_cpu(mcast->mcmember.flow_label), .hop_limit = mcast->mcmember.hop_limit, @@ -257,17 +258,8 @@ static int ipoib_mcast_join_finish(struc .traffic_class = mcast->mcmember.traffic_class } }; - int path_rate = ib_sa_rate_enum_to_int(mcast->mcmember.rate); - av.grh.dgid = mcast->mcmember.mgid; - if (path_rate > 0 && priv->local_rate > path_rate) - av.static_rate = (priv->local_rate - 1) / path_rate; - - 
ipoib_dbg_mcast(priv, "static_rate %d for local port %dX, mcmember %dX\n", - av.static_rate, priv->local_rate, - ib_sa_rate_enum_to_int(mcast->mcmember.rate)); - ah = ipoib_create_ah(dev, priv->pd, &av); if (!ah) { ipoib_warn(priv, "ib_address_create failed\n"); Index: src/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_av.c 2006-04-03 11:16:21.388995000 +0300 +++ src/drivers/infiniband/hw/mthca/mthca_av.c 2006-04-03 11:23:43.843152000 +0300 @@ -53,6 +53,83 @@ struct mthca_av { __be32 dgid[4]; }; +static inline u8 ib_rate_to_memfree(struct mthca_dev *dev, u8 req_rate, + u8 curr_rate) +{ + u8 ipd; /* Inter Packet Delay. See IB Spec Vol 1, 9.11.1 */ + + if (curr_rate <= req_rate) + return 0; + + ipd = (curr_rate - 1) / req_rate; + switch (ipd) { + case 0: return MTHCA_RATE_MEMFREE_FULL; + case 1: return MTHCA_RATE_MEMFREE_HALF; + case 2: /* fall through */ + case 3: return MTHCA_RATE_MEMFREE_QUARTER; + default: return MTHCA_RATE_MEMFREE_EIGHTH; + } +} + +enum ib_rate mthca_rate_to_ib(struct mthca_dev *dev, u8 mthca_rate, u8 port) +{ + if (!mthca_rate) + return mult_to_ib_rate(dev->rate[port - 1]); + + if (mthca_is_memfree(dev)) { + switch (mthca_rate) { + case MTHCA_RATE_MEMFREE_EIGHTH: + return mult_to_ib_rate(dev->rate[port - 1] / 8); + case MTHCA_RATE_MEMFREE_QUARTER: + return mult_to_ib_rate(dev->rate[port - 1] / 4); + case MTHCA_RATE_MEMFREE_HALF: + return mult_to_ib_rate(dev->rate[port - 1] / 2); + case MTHCA_RATE_MEMFREE_FULL: + default: + return mult_to_ib_rate(dev->rate[port - 1]); + } + } + + switch (mthca_rate) { + case MTHCA_RATE_TAVOR_1X: + return IB_RATE_2_5_GBPS; + case MTHCA_RATE_TAVOR_1X_DDR: + return IB_RATE_5_GBPS; + case MTHCA_RATE_TAVOR_4X: + return IB_RATE_10_GBPS; + default: + return mult_to_ib_rate(dev->rate[port - 1]); + } +} + +u8 mthca_get_rate(struct mthca_dev *dev, struct ib_ah_attr *attr, u8 port) +{ + if (!attr->static_rate) + 
return 0; + + if (mthca_is_memfree(dev)) + return ib_rate_to_memfree(dev, + ib_rate_to_mult(attr->static_rate), + dev->rate[port - 1]); + + if ((dev->limits.stat_rate_support & MTHCA_RATE_SUPP_TAVOR_ALL) == + MTHCA_RATE_SUPP_TAVOR_ALL) + /* full Tavor absolute rates*/ + switch (attr->static_rate) { + case IB_RATE_2_5_GBPS: return MTHCA_RATE_TAVOR_1X; + case IB_RATE_5_GBPS: return MTHCA_RATE_TAVOR_1X_DDR; + case IB_RATE_10_GBPS: return MTHCA_RATE_TAVOR_4X; + default: return MTHCA_RATE_TAVOR_FULL; + } + else + /* old (partial) Tavor absolute rates */ + switch (attr->static_rate) { + case IB_RATE_2_5_GBPS: /* fall through */ + case IB_RATE_5_GBPS: return MTHCA_RATE_TAVOR_1X; + default: return MTHCA_RATE_TAVOR_FULL; + } +} + int mthca_create_ah(struct mthca_dev *dev, struct mthca_pd *pd, struct ib_ah_attr *ah_attr, @@ -105,7 +182,7 @@ on_hca_fail: av->g_slid = ah_attr->src_path_bits; av->dlid = cpu_to_be16(ah_attr->dlid); av->msg_sr = (3 << 4) | /* 2K message */ - ah_attr->static_rate; + mthca_get_rate(dev, ah_attr, ah_attr->port_num); av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); if (ah_attr->ah_flags & IB_AH_GRH) { av->g_slid |= 0x80; Index: src/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-04-03 11:16:20.935996000 +0300 +++ src/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-04-03 11:23:28.618616000 +0300 @@ -995,6 +995,7 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev #define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 #define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 #define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_RATE_SUPPORT_OFFSET 0x3c #define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f #define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 #define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 @@ -1086,6 +1087,8 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev dev_lim->num_ports = field & 0xf; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); dev_lim->max_gids = 1 << 
(field & 0xf); + MTHCA_GET(size, outbox, QUERY_DEV_LIM_RATE_SUPPORT_OFFSET); + dev_lim->stat_rate_support = size; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); dev_lim->max_pkeys = 1 << (field & 0xf); MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); Index: src/drivers/infiniband/hw/mthca/mthca_cmd.h =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_cmd.h 2006-04-03 11:16:21.025998000 +0300 +++ src/drivers/infiniband/hw/mthca/mthca_cmd.h 2006-04-03 11:23:28.627616000 +0300 @@ -146,6 +146,7 @@ struct mthca_dev_lim { int max_vl; int num_ports; int max_gids; + u16 stat_rate_support; int max_pkeys; u32 flags; int reserved_uars; Index: src/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2006-04-03 11:16:21.116996000 +0300 +++ src/drivers/infiniband/hw/mthca/mthca_dev.h 2006-04-03 11:23:43.852153000 +0300 @@ -120,6 +120,24 @@ enum { MTHCA_CMD_NUM_DBELL_DWORDS = 8 }; +enum { + MTHCA_RATE_SUPP_TAVOR_ALL = 0xF +}; + +enum { + MTHCA_RATE_TAVOR_FULL = 0, /* 4X SDR / DDR depending on HCA and link*/ + MTHCA_RATE_TAVOR_1X = 1, + MTHCA_RATE_TAVOR_4X = 2, + MTHCA_RATE_TAVOR_1X_DDR = 3 +}; + +enum { + MTHCA_RATE_MEMFREE_FULL = 0, /* 4X SDR / DDR depending on HCA and link*/ + MTHCA_RATE_MEMFREE_QUARTER = 1, + MTHCA_RATE_MEMFREE_EIGHTH = 2, + MTHCA_RATE_MEMFREE_HALF = 3 +}; + struct mthca_cmd { struct pci_pool *pool; struct mutex hcr_mutex; @@ -172,6 +190,7 @@ struct mthca_limits { int reserved_pds; u32 page_size_cap; u32 flags; + u16 stat_rate_support; u8 port_width_cap; }; @@ -353,6 +372,7 @@ struct mthca_dev { struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; spinlock_t sm_lock; + u8 rate[MTHCA_MAX_PORTS]; }; #define mthca_dbg(mdev, format, arg...) 
\ @@ -553,6 +573,9 @@ int mthca_process_mad(struct ib_device * struct ib_grh *in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad); +int mthca_update_rate(struct mthca_dev *dev, u8 port_num); +u8 mthca_get_rate(struct mthca_dev *dev, struct ib_ah_attr *attr, u8 port); +enum ib_rate mthca_rate_to_ib(struct mthca_dev *dev, u8 mthca_rate, u8 port); int mthca_create_agents(struct mthca_dev *dev); void mthca_free_agents(struct mthca_dev *dev); Index: src/drivers/infiniband/hw/mthca/mthca_mad.c =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_mad.c 2006-04-03 11:16:21.297997000 +0300 +++ src/drivers/infiniband/hw/mthca/mthca_mad.c 2006-04-03 11:23:28.886586000 +0300 @@ -46,6 +46,26 @@ enum { MTHCA_VENDOR_CLASS2 = 0xa }; +int mthca_update_rate(struct mthca_dev *dev, u8 port_num) +{ + struct ib_port_attr *tprops = NULL; + int ret; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + return -ENOMEM; + + ret = ib_query_port(&dev->ib_dev, port_num, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed (%d) for %s port %d\n", + ret, dev->ib_dev.name, port_num); + return ret; + } + dev->rate[port_num - 1] = tprops->active_speed * + ib_width_enum_to_int(tprops->active_width); + return 0; +} + static void update_sm_ah(struct mthca_dev *dev, u8 port_num, u16 lid, u8 sl) { @@ -87,6 +107,7 @@ static void smp_snoop(struct ib_device * mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && mad->mad_hdr.method == IB_MGMT_METHOD_SET) { if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) { + mthca_update_rate(to_mdev(ibdev), port_num); update_sm_ah(to_mdev(ibdev), port_num, be16_to_cpup((__be16 *) (mad->data + 58)), (*(u8 *) (mad->data + 76)) & 0xf); Index: src/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_main.c 2006-04-03 11:16:21.207998000 +0300 +++ 
src/drivers/infiniband/hw/mthca/mthca_main.c 2006-04-03 11:23:28.877589000 +0300 @@ -191,6 +191,7 @@ static int __devinit mthca_dev_lim(struc mdev->limits.port_width_cap = dev_lim->max_port_width; mdev->limits.page_size_cap = ~(u32) (dev_lim->min_page_sz - 1); mdev->limits.flags = dev_lim->flags; + mdev->limits.stat_rate_support = dev_lim->stat_rate_support; /* IB_DEVICE_RESIZE_MAX_WR not supported by driver. May be doable since hardware supports it for SRQ. @@ -957,6 +958,7 @@ static int __devinit mthca_init_one(stru int ddr_hidden = 0; int err; struct mthca_dev *mdev; + int i; if (!mthca_version_printed) { printk(KERN_INFO "%s", mthca_version); @@ -1095,8 +1097,18 @@ static int __devinit mthca_init_one(stru pci_set_drvdata(pdev, mdev); + for (i = 1; i <= mdev->limits.num_ports; ++i) + if (mthca_update_rate(mdev, i)) { + mthca_err(mdev, "Failed to obtain port %d rate." + " aborting.\n", i); + goto err_free_agents; + } + return 0; +err_free_agents: + mthca_free_agents(mdev); + err_unregister: mthca_unregister_device(mdev); Index: src/drivers/infiniband/hw/mthca/mthca_provider.h =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2006-04-03 11:16:21.568967000 +0300 +++ src/drivers/infiniband/hw/mthca/mthca_provider.h 2006-04-03 11:23:29.034584000 +0300 @@ -257,6 +257,8 @@ struct mthca_qp { atomic_t refcount; u32 qpn; int is_direct; + u16 port; + u16 alt_port; u8 transport; u8 state; u8 atomic_rd_en; @@ -278,7 +280,6 @@ struct mthca_qp { struct mthca_sqp { struct mthca_qp qp; - int port; int pkey_index; u32 qkey; u32 send_psn; Index: src/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2006-04-03 11:16:21.478968000 +0300 +++ src/drivers/infiniband/hw/mthca/mthca_qp.c 2006-04-03 11:28:23.361059000 +0300 @@ -246,6 +246,9 @@ void mthca_qp_event(struct mthca_dev *de return; } + if 
(event_type == IB_EVENT_PATH_MIG) + qp->port = qp->alt_port; + event.device = &dev->ib_dev; event.event = event_type; event.element.qp = &qp->ibqp; @@ -390,10 +393,17 @@ static void to_ib_ah_attr(struct mthca_d { memset(ib_ah_attr, 0, sizeof *path); ib_ah_attr->port_num = (be32_to_cpu(path->port_pkey) >> 24) & 0x3; + + if (ib_ah_attr->port_num == 0 || + ib_ah_attr->port_num > dev->limits.num_ports ) + return; + ib_ah_attr->dlid = be16_to_cpu(path->rlid); ib_ah_attr->sl = be32_to_cpu(path->sl_tclass_flowlabel) >> 28; ib_ah_attr->src_path_bits = path->g_mylmc & 0x7f; - ib_ah_attr->static_rate = path->static_rate & 0x7; + ib_ah_attr->static_rate = mthca_rate_to_ib(dev, + path->static_rate & 0x7, + ib_ah_attr->port_num); ib_ah_attr->ah_flags = (path->g_mylmc & (1 << 7)) ? IB_AH_GRH : 0; if (ib_ah_attr->ah_flags) { ib_ah_attr->grh.sgid_index = path->mgid_index & (dev->limits.gid_table_len - 1); @@ -436,6 +446,7 @@ int mthca_query_qp(struct ib_qp *ibqp, s context = &qp_param->context; mthca_state = be32_to_cpu(context->flags) >> 28; + memset(qp_attr, 0, sizeof *qp_attr); qp_attr->qp_state = to_ib_qp_state(mthca_state); qp_attr->cur_qp_state = qp_attr->qp_state; qp_attr->path_mtu = context->mtu_msgmax >> 5; @@ -453,8 +464,10 @@ int mthca_query_qp(struct ib_qp *ibqp, s qp_attr->cap.max_recv_sge = qp->rq.max_gs; qp_attr->cap.max_inline_data = qp->max_inline_data; - to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path); - to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path); + if (qp->transport == RC || qp->transport == UC){ + to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path); + to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path); + } qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 0x7f; qp_attr->alt_pkey_index = be32_to_cpu(context->alt_path.port_pkey) & 0x7f; @@ -482,11 +495,11 @@ out: } static int mthca_path_set(struct mthca_dev *dev, struct ib_ah_attr *ah, - struct mthca_qp_path *path) + struct mthca_qp_path *path, u8 port) 
{ path->g_mylmc = ah->src_path_bits & 0x7f; path->rlid = cpu_to_be16(ah->dlid); - path->static_rate = !!ah->static_rate; + path->static_rate = mthca_get_rate(dev, ah, port); if (ah->ah_flags & IB_AH_GRH) { if (ah->grh.sgid_index >= dev->limits.gid_table_len) { @@ -632,7 +645,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (qp->transport == MLX) qp_context->pri_path.port_pkey |= - cpu_to_be32(to_msqp(qp)->port << 24); + cpu_to_be32(qp->port << 24); else { if (attr_mask & IB_QP_PORT) { qp_context->pri_path.port_pkey |= @@ -655,7 +668,8 @@ int mthca_modify_qp(struct ib_qp *ibqp, } if (attr_mask & IB_QP_AV) { - if (mthca_path_set(dev, &attr->ah_attr, &qp_context->pri_path)) + if (mthca_path_set(dev, &attr->ah_attr, &qp_context->pri_path, + attr_mask & IB_QP_PORT ? attr->port_num : qp->port)) return -EINVAL; qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); @@ -679,7 +693,8 @@ int mthca_modify_qp(struct ib_qp *ibqp, return -EINVAL; } - if (mthca_path_set(dev, &attr->alt_ah_attr, &qp_context->alt_path)) + if (mthca_path_set(dev, &attr->alt_ah_attr, &qp_context->alt_path, + attr->alt_ah_attr.port_num)) return -EINVAL; qp_context->alt_path.port_pkey |= cpu_to_be32(attr->alt_pkey_index | @@ -789,6 +804,10 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp->atomic_rd_en = attr->qp_access_flags; if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) qp->resp_depth = attr->max_dest_rd_atomic; + if (attr_mask & IB_QP_PORT) + qp->port = attr->port_num; + if (attr_mask & IB_QP_ALT_PATH) + qp->alt_port = attr->alt_port_num; if (is_sqp(dev, qp)) store_attrs(to_msqp(qp), attr, attr_mask); @@ -800,13 +819,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (is_qp0(dev, qp)) { if (cur_state != IB_QPS_RTR && new_state == IB_QPS_RTR) - init_port(dev, to_msqp(qp)->port); + init_port(dev, qp->port); if (cur_state != IB_QPS_RESET && cur_state != IB_QPS_ERR && (new_state == IB_QPS_RESET || new_state == IB_QPS_ERR)) - mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + 
mthca_CLOSE_IB(dev, qp->port, &status); } /* @@ -1210,6 +1229,9 @@ int mthca_alloc_qp(struct mthca_dev *dev if (qp->qpn == -1) return -ENOMEM; + /* initialize port to zero for error-catching. */ + qp->port = 0; + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, send_policy, qp); if (err) { @@ -1259,7 +1281,7 @@ int mthca_alloc_sqp(struct mthca_dev *de if (err) goto err_out; - sqp->port = port; + sqp->qp.port = port; sqp->qp.qpn = mqpn; sqp->qp.transport = MLX; @@ -1402,10 +1424,10 @@ static int build_mlx_header(struct mthca sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE; sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); if (!sqp->qp.ibqp.qp_num) - ib_get_cached_pkey(&dev->ib_dev, sqp->port, + ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port, sqp->pkey_index, &pkey); else - ib_get_cached_pkey(&dev->ib_dev, sqp->port, + ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey); sqp->ud_header.bth.pkey = cpu_to_be16(pkey); sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); From ftillier at silverstorm.com Mon Apr 3 03:09:31 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Mon, 3 Apr 2006 03:09:31 -0700 Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support In-Reply-To: <4430C345.2050002@mellanox.co.il> References: <20060402110848.GA5379@voltaire.com> <20060402124853.GQ14808@mellanox.co.il> <442FC02B.2010801@voltaire.com> <20060402132813.GT14808@mellanox.co.il> <442FDB6F.7080402@voltaire.com> <20060402152341.GX14808@mellanox.co.il> <442FE731.3050100@voltaire.com> <4430C345.2050002@mellanox.co.il> Message-ID: <79ae2f320604030309m46563712o584d33c94378d043@mail.gmail.com> On 4/2/06, Tziporet Koren wrote: > Leonid Arsh wrote: > > Of course, the FW doesn't handle the event, it handles the request by > > sending the port info to the SM. > > It also generates an appropriate event to SW. The actual > > re-registration is to be done by the SW in ULPs. 
> > An alternative way to generate the event is catching the request MAD > > in smp_snoop, as you suggested, but the port info still will be sent > > by the FW. > > > I actually prefer that the FW will generate the ClientReregister event > since its already generating this event and in this way we can save > logic in the driver. I have to weigh in on the side of SW here - there's already logic to handle other events in SW, so supporting client reregister in SW is trivial. Further, the Windows stack handles this in SW, and I don't see any reason why that should change since it's already in place and working. It would be nice to have some similarities between the stacks, especially for cases where the FW would affect the behavior of the HCA. We don't want a FW for Linux and a different one for Windows. - Fab From bardov at gmail.com Mon Apr 3 03:31:20 2006 From: bardov at gmail.com (Dan Bar Dov) Date: Mon, 3 Apr 2006 12:31:20 +0200 Subject: [openib-general] InfiniBand 2.6.17 merge plans In-Reply-To: References: Message-ID: On the other hand, there are user-space consumers for CMA, such as udapl, so although ISER will attempt only at 2.6.18, maybe it is a good idea to get CMA into 2.6.17 Dan On 3/29/06, Roland Dreier wrote: > Dan> Therefore the plan for iSER is to push it into 2.6.18 > > OK, thanks. > > Based on all of this I'm thinking that it's better to hold back the > RDMA CM as well, since there will be no code using it ready to merge > for 2.6.17. > > - R. > From leonida at voltaire.com Mon Apr 3 02:31:09 2006 From: leonida at voltaire.com (Leonid Arsh) Date: Mon, 3 Apr 2006 12:31:09 +0300 Subject: [openib-general] verbs.h client reregister event support Message-ID: <20060403093109.GA13584@voltaire.com> Hello, this is an additional patch for the user space verbs. 
The patch is to be applied in addition to the kernel module patch -- (Please see RE: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support) Signed-off-by: Leonid Arsh Index: userspace/libibverbs/include/infiniband/verbs.h =================================================================== --- userspace/libibverbs/include/infiniband/verbs.h (revision 8165) +++ userspace/libibverbs/include/infiniband/verbs.h (working copy) @@ -190,7 +190,8 @@ IBV_EVENT_SM_CHANGE, IBV_EVENT_SRQ_ERR, IBV_EVENT_SRQ_LIMIT_REACHED, - IBV_EVENT_QP_LAST_WQE_REACHED + IBV_EVENT_QP_LAST_WQE_REACHED, + IBV_EVENT_CLIENT_REREGISTER }; struct ibv_async_event { From bugzilla-daemon at openib.org Mon Apr 3 04:04:09 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 3 Apr 2006 04:04:09 -0700 (PDT) Subject: [openib-general] [Bug 26] QP Tests fails between 32-bit node and 64-bit node Message-ID: <20060403110409.6F2542283F3@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=26 dotanb at mellanox.co.il changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Additional Comments From dotanb at mellanox.co.il 2006-04-03 04:04 ------- structure that was sent between the two sides contains a pointer (32 bit in one machine, 64 in the other machine), pointer was changed to be 64 bit and casting was done where needed to support this issue. please check this version. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
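For illustration only, here is a minimal sketch of the kind of fix described in the Bug 26 comment above. It is not the actual test code: `struct wire_msg` and the two helpers are hypothetical names. It assumes the structure exchanged between nodes carries a buffer address and shows that address stored as a fixed 64-bit field, with casts through `uintptr_t` on each side. Byte order between the nodes is deliberately ignored here.

```c
#include <stdint.h>

/*
 * Hypothetical message exchanged between a 32-bit and a 64-bit node.
 * A raw "void *" member would be 4 bytes on one side and 8 on the
 * other, so the address is carried as a fixed-width 64-bit integer.
 */
struct wire_msg {
	uint64_t remote_addr;	/* buffer address, always 64 bits */
	uint32_t rkey;
	uint32_t length;
};

/* Store a local pointer in the fixed-width field. */
static inline void wire_msg_set_addr(struct wire_msg *msg, void *buf)
{
	msg->remote_addr = (uint64_t)(uintptr_t)buf;
}

/* Recover the pointer; only meaningful on the node that owns the buffer. */
static inline void *wire_msg_get_addr(const struct wire_msg *msg)
{
	return (void *)(uintptr_t)msg->remote_addr;
}
```

With this layout the structure has the same size on both node types, which is the property the bug comment is after.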
From halr at voltaire.com Mon Apr 3 03:49:57 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Apr 2006 06:49:57 -0400 Subject: [openib-general] Re: [PATCH] opensm: observe PartitionEnforcementCap of zero In-Reply-To: References: Message-ID: <1144061397.4480.40476.camel@hal.voltaire.com> On Wed, 2006-03-29 at 18:54, Rolf Manderscheid wrote: > Hi Hal, > > opensm attempts to set pkey table entries on external switch ports > even if the switch declares a PartitionEnforcementCap of zero. The > consequence is ERR 4108. > > The decision to set the block is based on the size of the > pkeys->blocks vector, which is initialized to one. There is a > comment that claims there must be a pre-allocated block in said vector > "for the sake of empty table test", but I can't see why it's > necessary. Is this comment wrong or am I missing something? > > The vector size grows, if necessary, when processing the response from > a SubnGet(PKeyTable). The query happens during a sweep after > obtaining the portinfo, and the query code is careful to observe the > partition cap. All this means that the vector size can still be used > to decide whether to do the set provided the vector size starts out at > zero. So, the patch below just initializes the vector to size zero > and removes the code that inserts the pre-allocated block. > > Rolf Thanks. Applied. -- Hal From halr at voltaire.com Mon Apr 3 03:59:58 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Apr 2006 06:59:58 -0400 Subject: [openib-general] Re: [PATCH] opensm: MTU/rate setup fixes in MCG creation In-Reply-To: <20060329194719.GC6926@sashak.voltaire.com> References: <20060329194719.GC6926@sashak.voltaire.com> Message-ID: <1144061991.4480.40569.camel@hal.voltaire.com> On Wed, 2006-03-29 at 14:47, Sasha Khapyorsky wrote: > Hello, > > There are fixes in MTU/rate verification/setup when MC Group is created. 
> Most of them were discussed in this thread: > > http://openib.org/pipermail/openib-general/2006-March/018888.html > > Sasha. > > > Fixes in MTU/rate verification/setup when MC Group is created. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (to trunk only). -- Hal From ogerlitz at voltaire.com Mon Apr 3 04:59:03 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 03 Apr 2006 14:59:03 +0300 Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator In-Reply-To: <44214A5D.8040100@voltaire.com> References: <20060222162507.GB24303@lst.de> <43FD8353.3020909@voltaire.com> <43FD8DF9.5090200@voltaire.com> <20060224211615.GA30927@lst.de> <4401A6AA.3080508@voltaire.com> <440423B8.2080302@voltaire.com> <440BEC4D.3010503@voltaire.com> <20060308152603.GA13621@lst.de> <441016D2.70400@voltaire.com> <20060322095708.GA23491@lst.de> <44213A00.4010001@voltaire.com> <44214A5D.8040100@voltaire.com> Message-ID: <44310E07.5010606@voltaire.com> Or Gerlitz wrote: >> Christoph Hellwig wrote: >>>>> looks like you have a testcase for SCSI_IOCTL_SEND_COMMAND, nice. >>>>> Could you test the patch below, which should make this remaning user >>>>> of non-SG commands go away? >> Cool, I have tested your patch against the official 2.6.16 and it works! > OK, your patch goes into 2.6.17 then i will patch iser with the below, > please let me know Haven't heard from you re the patch you have supplied me which removes at least this SCSI IOCTL issuing a non SG SCSI command. As i wrote you i have patched 2.6.16 and tested it, works great. Is it queued for 2.6.17? Or. 
From halr at voltaire.com Mon Apr 3 05:25:12 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Apr 2006 08:25:12 -0400 Subject: [openib-general] [PATCH] osm: add support for 1.2 errata - SA enhanced capability mask matching In-Reply-To: <4430D666.7000700@mellanox.co.il> References: <86y7yngmlo.fsf@mtl066.yok.mtl.com> <4430D666.7000700@mellanox.co.il> Message-ID: <1144067112.4480.41450.camel@hal.voltaire.com> On Mon, 2006-04-03 at 04:01, Tziporet Koren wrote: > Eitan Zahavi wrote: > > Hi Hal > > > > This patch adds support for the following 1.2 errata MGTWG8372. > > This should be useful for scalability of: > > * SRP target discovery and > > * Queries for all SM ports. > > > > > Hal, > > Please check-in this patch to the branch too since we need it for SRP > discovery Do you mean the 1.0 branch ? The policy is to check in all appropriate changes that are going in on the trunk to the 1.0 branch. Since there is some minor divergence of these files between the trunk and 1.0, we'll see if a separate 1.0 patch is needed or not. -- Hal > Thanks, > Tziporet > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tziporet at mellanox.co.il Mon Apr 3 05:45:07 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 03 Apr 2006 14:45:07 +0200 Subject: [openib-general] [PATCH] osm: add support for 1.2 errata - SA enhanced capability mask matching In-Reply-To: <1144067112.4480.41450.camel@hal.voltaire.com> References: <86y7yngmlo.fsf@mtl066.yok.mtl.com> <4430D666.7000700@mellanox.co.il> <1144067112.4480.41450.camel@hal.voltaire.com> Message-ID: <443118D3.7000205@mellanox.co.il> Hal Rosenstock wrote: >> Hal, >> Please check-in this patch to the branch too since we need it for SRP >> discovery >> > > Do you mean the 1.0 branch ? 
> > The policy is to check in all appropriate changes that are going in on > the trunk to the 1.0 branch. > > Since there is some minor divergence of these files between the trunk > and 1.0, we'll see if a separate 1.0 patch is needed or not. > > -- Hal > I meant the 1.0 branch From dotanb at mellanox.co.il Mon Apr 3 05:56:41 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 3 Apr 2006 15:56:41 +0300 Subject: [openib-general] which cm should we test? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301D163E9@mtlexch01.mtl.com> Hi sean. we would like to start testing the CM. which CM should we use in the user level? (librdmacm or libibcm) do you have any test plan that covers the CM features? thanks Dotan Barak Software Verification Engineer Mellanox Technologies Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Mon Apr 3 05:57:25 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Apr 2006 08:57:25 -0400 Subject: [openib-general] Re: [PATCH] osm: add support for 1.2 errata - SA enhanced capability mask matching In-Reply-To: <86y7yngmlo.fsf@mtl066.yok.mtl.com> References: <86y7yngmlo.fsf@mtl066.yok.mtl.com> Message-ID: <1144069044.4480.41777.camel@hal.voltaire.com> On Mon, 2006-04-03 at 03:11, Eitan Zahavi wrote: > Hi Hal > > This patch adds support for the following 1.2 errata MGTWG8372. > This should be useful for scalability of: > * SRP target discovery and > * Queries for all SM ports. > > Reference ID: 4291 > > Add to table: 186 SA-Specific ClassPortInfo:CapabilityMask > Name | Bit | Description > =========================================================================================== > IsPortInfoCapMaskMatchSupported | 13 | If this value is 1, SA shall support matching the > | | PortInfo:CapabilityMask component as described in > | | . 
> > Reference ID: 4292 > If SA's ClassPortInfo:CapabilityMask.IsPortInfoCapMaskMatchSupported is 1, > then the AttributeModifier of the SubnAdmGet() and SubnAdmGetTable() > methods affects the matching behavior on the PortInfo:CapabilityMask > component. If the high-order bit (bit 31) of the AttributeModifier > is set to 1, matching on the CapabilityMask component will not be an > exact bitwise match as described in . Instead, > matching will only be performed on those bits which are set to 1 in > the PortInfo:CapabilityMask embedded in the query. > > In , bits in the PortInfo:CapabilityMask embedded > in the query that are set to 0 are bitwise wildcards for purposes of > matching. > > This gives a requester the ability to select desired capabilities > and query for ports which support those capabilities. > > If SA's ClassPortInfo:CapabilityMask.IsPortInfoCapMaskMatchSupported > is 0, or if bit 31 of the AttributeModifier is 0, then any matching > performed on the PortInfo:CapabilityMask component is as described > in . > > Eitan > > Signed-off-by: Eitan Zahavi Thanks. Applied to trunk and 1.0 branch. 
-- Hal From halr at voltaire.com Mon Apr 3 06:14:45 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Apr 2006 09:14:45 -0400 Subject: [openib-general] [PATCH] OpenSM/osm_sa_portinfo_record.c: Fix precedence in bitwise compare of PortInfo capability mask Message-ID: <1144070084.4480.41958.camel@hal.voltaire.com> OpenSM/osm_sa_portinfo_record.c: Fix precedence in bitwise compare of PortInfo capability mask Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_portinfo_record.c =================================================================== --- opensm/osm_sa_portinfo_record.c (revision 6161) +++ opensm/osm_sa_portinfo_record.c (working copy) @@ -294,7 +294,7 @@ __osm_sa_pir_check_physp( { if (p_ctxt->is_enhanced_comp_mask) { - if ( (p_comp_pi->capability_mask & p_pi->capability_mask != p_comp_pi->capability_mask) ) + if ( ( ( p_comp_pi->capability_mask & p_pi->capability_mask ) != p_comp_pi->capability_mask) ) goto Exit; } else From eitan at mellanox.co.il Mon Apr 3 06:32:45 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 3 Apr 2006 16:32:45 +0300 Subject: [openib-general] RE: [PATCH] OpenSM/osm_sa_portinfo_record.c: Fix precedence in bitwisecompare of PortInfo capability mask Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102BA06@mtlexch01.mtl.com> Oops - thanks for catching it Eitan > > OpenSM/osm_sa_portinfo_record.c: Fix precedence in bitwise compare of > PortInfo capability mask > > Signed-off-by: Hal Rosenstock > > Index: opensm/osm_sa_portinfo_record.c > =================================================================== > --- opensm/osm_sa_portinfo_record.c (revision 6161) > +++ opensm/osm_sa_portinfo_record.c (working copy) > @@ -294,7 +294,7 @@ __osm_sa_pir_check_physp( > { > if (p_ctxt->is_enhanced_comp_mask) > { > - if ( (p_comp_pi->capability_mask & p_pi->capability_mask != p_comp_pi- > >capability_mask) ) > + if ( ( ( p_comp_pi->capability_mask & p_pi->capability_mask ) != p_comp_pi- > >capability_mask) ) > goto Exit; > } > else 
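As a small standalone illustration of the precedence problem fixed in the patch above (the function names are made up for this sketch and are not part of OpenSM): in C, `!=` binds tighter than `&`, so the original condition compared first and masked second. The first function spells out how the unparenthesized expression is parsed. The second is the intended enhanced match, where only the bits set to 1 in the query have to be present in the port's capability mask.

```c
#include <stdint.h>

/*
 * How "query & port != query" is actually parsed:
 * the comparison yields 0 or 1, which is then ANDed with the query.
 */
static int mismatch_as_parsed(uint32_t query, uint32_t port)
{
	return (query & (port != query)) != 0;
}

/*
 * Intended test: report a mismatch unless every bit set in the
 * query mask is also set in the port's capability mask.
 */
static int mismatch_fixed(uint32_t query, uint32_t port)
{
	return (query & port) != query;
}
```

For example, with a query mask of 0x2 and a port mask of 0x1 the parsed-as-written form reports no mismatch even though the requested bit is absent, and with a query of 0x1 and a port of 0x3 it reports a mismatch even though the requested bit is present.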
From dotanb at mellanox.co.il Mon Apr 3 06:54:09 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 3 Apr 2006 16:54:09 +0300 Subject: [openib-general] [PATCH] limrdmacm/cmatose: added a check to the return value of the post send requests Message-ID: <200604031654.09048.dotanb@mellanox.co.il> Added a check to the return value of the post send requests. Signed-off-by: Dotan Barak Index: latest/src/userspace/librdmacm/examples/cmatose.c =================================================================== --- latest.orig/src/userspace/librdmacm/examples/cmatose.c 2006-04-03 08:05:39.000000000 +0300 +++ latest/src/userspace/librdmacm/examples/cmatose.c 2006-04-03 16:47:24.000000000 +0300 @@ -204,9 +204,13 @@ static int post_sends(struct cmatest_nod sge.lkey = node->mr->lkey; sge.addr = (uintptr_t) node->mem; - for (i = 0; i < message_count && !ret; i++) + for (i = 0; i < message_count && !ret; i++) { ret = ibv_post_send(node->cma_id->qp, &send_wr, &bad_send_wr); - + if (ret) { + printf("failed to post sends: %d\n", ret); + break; + } + } return ret; } From jlentini at netapp.com Mon Apr 3 07:00:53 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 3 Apr 2006 10:00:53 -0400 (EDT) Subject: [openib-general] Re: [PATCH] printf fix for dapl/dtest In-Reply-To: <1143831015.23845.32.camel@stevo-desktop> References: <1143831015.23845.32.camel@stevo-desktop> Message-ID: I committed this on the trunk in revision 6168. On Fri, 31 Mar 2006, Steve Wise wrote: > Looks good. > > > Steve, Arlin, do either of you see problems with this? If not, I'll > > check it in. From mst at mellanox.co.il Mon Apr 3 08:47:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 3 Apr 2006 18:47:41 +0300 Subject: [openib-general] [PATCH -stable] Move destructor from neigh->ops to neigh_param Message-ID: <20060403154741.GB14808@mellanox.co.il> Hello, -stable team! 
The following patch is a backport from 2.6.17-rc1: it solves an oops/crash condition in ipoib that people are observing in 2.6.16/2.6.16.1. The patch exceeds the 100 line limit slightly, but only because it removes a static function which now becomes unused. Let me know if this is a problem. --- From: Michael S. Tsirkin struct neigh_ops currently has a destructor field, but not a constructor field. The infiniband/ulp/ipoib in-tree driver stashes some info in the neighbour structure (the results of the second-stage lookup from ARP results to real link-level path), and it uses neigh->ops->destructor to get a callback so it can clean up this extra info when a neighbour is freed. We've run into problems with this: since the destructor is in an ops field that is shared between neighbours that may belong to different net devices, there's no way to set/clear it safely. The following patch moves this field to neigh_parms where it can be safely set, together with its twin neigh_setup, and switches the only two in-kernel users (ipoib and clip) to this interface. Signed-off-by: Michael S.
Tsirkin Signed-off-by: Roland Dreier Index: linux-2.6.16/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.16.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-03-20 07:53:29.000000000 +0200 +++ linux-2.6.16/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-04-03 17:09:23.000000000 +0200 @@ -247,7 +247,6 @@ static void path_free(struct net_device if (neigh->ah) ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; kfree(neigh); } @@ -530,7 +529,6 @@ static void neigh_add_path(struct sk_buf err: *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - neigh->neighbour->ops->destructor = NULL; kfree(neigh); ++priv->stats.tx_dropped; @@ -769,21 +767,9 @@ static void ipoib_neigh_destructor(struc ipoib_put_ah(ah); } -static int ipoib_neigh_setup(struct neighbour *neigh) -{ - /* - * Is this kosher? I can't find anybody in the kernel that - * sets neigh->destructor, so we should be able to set it here - * without trouble. 
- */ - neigh->ops->destructor = ipoib_neigh_destructor; - - return 0; -} - static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { - parms->neigh_setup = ipoib_neigh_setup; + parms->neigh_destructor = ipoib_neigh_destructor; return 0; } Index: linux-2.6.16/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.16.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-03-20 07:53:29.000000000 +0200 +++ linux-2.6.16/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-04-03 17:09:23.000000000 +0200 @@ -115,7 +115,6 @@ static void ipoib_mcast_free(struct ipoi if (neigh->ah) ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; kfree(neigh); } Index: linux-2.6.16/include/net/neighbour.h =================================================================== --- linux-2.6.16.orig/include/net/neighbour.h 2006-03-20 07:53:29.000000000 +0200 +++ linux-2.6.16/include/net/neighbour.h 2006-04-03 17:09:23.000000000 +0200 @@ -68,6 +68,7 @@ struct neigh_parms struct net_device *dev; struct neigh_parms *next; int (*neigh_setup)(struct neighbour *); + void (*neigh_destructor)(struct neighbour *); struct neigh_table *tbl; void *sysctl_table; @@ -145,7 +146,6 @@ struct neighbour struct neigh_ops { int family; - void (*destructor)(struct neighbour *); void (*solicit)(struct neighbour *, struct sk_buff*); void (*error_report)(struct neighbour *, struct sk_buff*); int (*output)(struct sk_buff*); Index: linux-2.6.16/net/atm/clip.c =================================================================== --- linux-2.6.16.orig/net/atm/clip.c 2006-03-20 07:53:29.000000000 +0200 +++ linux-2.6.16/net/atm/clip.c 2006-04-03 17:14:45.000000000 +0200 @@ -289,7 +289,6 @@ static void clip_neigh_error(struct neig static struct neigh_ops clip_neigh_ops = { .family = AF_INET, - .destructor = clip_neigh_destroy, .solicit = clip_neigh_solicit, 
.error_report = clip_neigh_error, .output = dev_queue_xmit, @@ -346,6 +345,7 @@ static struct neigh_table clip_tbl = { /* parameters are copied from ARP ... */ .parms = { + .neigh_destructor = clip_neigh_destroy, .tbl = &clip_tbl, .base_reachable_time = 30 * HZ, .retrans_time = 1 * HZ, Index: linux-2.6.16/net/core/neighbour.c =================================================================== --- linux-2.6.16.orig/net/core/neighbour.c 2006-03-20 07:53:29.000000000 +0200 +++ linux-2.6.16/net/core/neighbour.c 2006-04-03 17:09:23.000000000 +0200 @@ -586,8 +586,8 @@ void neigh_destroy(struct neighbour *nei kfree(hh); } - if (neigh->ops && neigh->ops->destructor) - (neigh->ops->destructor)(neigh); + if (neigh->parms->neigh_destructor) + (neigh->parms->neigh_destructor)(neigh); skb_queue_purge(&neigh->arp_queue); -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Mon Apr 3 08:46:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 03 Apr 2006 08:46:52 -0700 Subject: [openib-general] InfiniBand 2.6.17 merge plans In-Reply-To: (Dan Bar Dov's message of "Mon, 3 Apr 2006 12:31:20 +0200") References: Message-ID: Dan> On the other hand, there are user-space consumers for CMA, Dan> such as udapl, so although ISER will attempt only at 2.6.18, Dan> maybe it is a good idea to get CMA into 2.6.17 Well, too late now anyway. But Sean and I agreed that the userspace interface to CMA was not mature enough to merge for 2.6.17. - R. 
From bugzilla-daemon at openib.org Mon Apr 3 08:59:18 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 3 Apr 2006 08:59:18 -0700 (PDT) Subject: [openib-general] [Bug 31] New: ifconfig up/down while ssh connection alive cause oops Message-ID: <20060403155918.302032283F3@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=31 Summary: ifconfig up/down while ssh connection alive cause oops Product: OpenIB Version: gen2 Platform: X86-64 OS/Version: 2.6.9 Status: NEW Severity: major Priority: P2 Component: IPoIB AssignedTo: bugzilla at openib.org ReportedBy: amitk at mellanox.co.il Topology: 2 machines connected back to back port 1 to port 1 HCA:Cougar cub FW:3.4.0 Test flow: Ssh from machine 1 to machine 2 using the IPoIB interface, machine 2 bring IPoIB interface down and up (ifconfig ib0 down/up) Notes: This bug may be related to bug number 28 ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mshefty at ichips.intel.com Mon Apr 3 09:39:35 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 03 Apr 2006 09:39:35 -0700 Subject: [openib-general] [PATCH] limrdmacm/cmatose: added a check to the return value of the post send requests In-Reply-To: <200604031654.09048.dotanb@mellanox.co.il> References: <200604031654.09048.dotanb@mellanox.co.il> Message-ID: <44314FC7.3050600@ichips.intel.com> Dotan Barak wrote: > - for (i = 0; i < message_count && !ret; i++) > + for (i = 0; i < message_count && !ret; i++) { > ret = ibv_post_send(node->cma_id->qp, &send_wr, &bad_send_wr); > - > + if (ret) { > + printf("failed to post sends: %d\n", ret); > + break; > + } > + } > return ret; The code will already drop out of the for loop if ret is non-zero. - Sean From mshefty at ichips.intel.com Mon Apr 3 09:43:46 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 03 Apr 2006 09:43:46 -0700 Subject: [openib-general] which cm should we test? 
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301D163E9@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301D163E9@mtlexch01.mtl.com> Message-ID: <443150C2.5080904@ichips.intel.com> Dotan Barak wrote: > Hi sean. > > we would like to start testing the CM. > which CM should we use in the user level? (librdmacm or libibcm) It entirely depends on which one users will be using. My recommendation is to test only librdmacm. Use of the libibcm is difficult because of a lack of userspace SA query support. > do you have any test plan that covers the CM features? Our testing of the librdmacm is limited to the functionality needed by uDAPL, including Intel's MPI, and what cmatose tests. - Sean From mshefty at ichips.intel.com Mon Apr 3 09:46:50 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 03 Apr 2006 09:46:50 -0700 Subject: [openib-general] CMA deadlock In-Reply-To: <20060403082056.GZ14808@mellanox.co.il> References: <20060403082056.GZ14808@mellanox.co.il> Message-ID: <4431517A.5060407@ichips.intel.com> Michael S. Tsirkin wrote: > And it seems that, if the user callback returns failure, the CMA actually calls > rdma_destroy_id which in turn may call ib_destroy_cm_id from inside the CM > callback. I think this might deadlock in a similiar way. Again, bouncing the CM > event to the rdma WQ will solve this I think. > > Sean, could you look at this please? I will take a look at this today. Thanks for the description. - Sean From halr at voltaire.com Mon Apr 3 09:44:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Apr 2006 12:44:43 -0400 Subject: [openib-general] IPoIB interface for unauthorized partition Message-ID: <1144082680.4480.44271.camel@hal.voltaire.com> Hi Roland, I have a port which only has the full default partition configured but ifconfig allows an IPoIB interface with a PKey which is not in the Pkey table. Shouldn't the ifconfig fail for this (rather than the subsequent ping) ?
-- Hal smpquery pkeys 1 1 0: 0xffff 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 8: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 16: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 24: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 32: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 40: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 48: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 56: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 64 pkeys capacity for this port echo 0x8001 > /sys/class/net/ib0/create_child /sbin/ifconfig ib0.8001 192.168.2.1 ping -b 192.168.2.255 WARNING: pinging broadcast address PING 192.168.2.255 (192.168.2.255) 56(84) bytes of data. --- 192.168.2.255 ping statistics --- 2 packets transmitted, 0 received, 100% packet loss, time 999ms From swise at opengridcomputing.com Mon Apr 3 10:03:26 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 03 Apr 2006 12:03:26 -0500 Subject: [openib-general] which cm should we test? In-Reply-To: <443150C2.5080904@ichips.intel.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301D163E9@mtlexch01.mtl.com> <443150C2.5080904@ichips.intel.com> Message-ID: <1144083806.14424.46.camel@stevo-desktop> On Mon, 2006-04-03 at 09:43 -0700, Sean Hefty wrote: > Dotan Barak wrote: > > Hi sean. > > > > we would like to start testing the CM. > > which CM should we use in the user level? (librdmacm or libibcm) > > It entirely depends on which one users will be using. My recommendation is to > test only librdmacm. Use of the libibcm is difficult because of a lack of > userspace SA query support. > > > do you have any test plan that covers the CM features? > > Our testing of the librdmacm is limited to the functionality needed by uDAPL, > including Intel's MPI, and what cmatose tests. and rping.
From sean.hefty at intel.com Mon Apr 3 10:17:09 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 3 Apr 2006 10:17:09 -0700 Subject: [openib-general] RE: CMA deadlock In-Reply-To: <20060403082056.GZ14808@mellanox.co.il> Message-ID: > A ULP requests address resolution; on success requests route resolution; > route resolution succeeds; inside the callback ULP requests rdma_connect. > Now, a failure (e.g. out of memory) occurs at ULP level and so it decides to > destroy the ID. To this end it returns failure code from the route callback. I didn't consider this possibility. The only solution I can see at the moment is to schedule route resolution to a separate thread, as you suggested. >And it seems that, if the user callback returns failure, the CMA actually calls >rdma_destroy_id which in turn may call ib_destroy_cm_id from inside the CM >callback. I think this might deadlock in a similar way. Again, bouncing the >CM event to the rdma WQ will solve this I think. This should be handled by the code. See the comment near the bottom of the cma_ib_handler() routine. - Sean From mshefty at ichips.intel.com Mon Apr 3 10:23:34 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 03 Apr 2006 10:23:34 -0700 Subject: [openib-general] [PATCH] limrdmacm/cmatose: added a check to the return value of the post send requests In-Reply-To: <200604031654.09048.dotanb@mellanox.co.il> References: <200604031654.09048.dotanb@mellanox.co.il> Message-ID: <44315A16.3010301@ichips.intel.com> Dotan Barak wrote: > Added a check to the return value of the post send requests. Committed the change to print an error message if post send fails. Thanks.
- Sean From bugzilla-daemon at openib.org Mon Apr 3 11:02:59 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 3 Apr 2006 11:02:59 -0700 (PDT) Subject: [openib-general] [Bug 31] ifconfig up/down while ssh connection alive cause oops Message-ID: <20060403180259.15D322283F3@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=31 ------- Additional Comments From amitk at mellanox.co.il 2006-04-03 11:02 ------- ['Cougar cub\n', 'MHXL-CF128-T\n', 'C-04\n', 'MT0542X00026\n'] ib0: dev_queue_xmit failed to requeue packet general protection fault: 0000 [1] SMP CPU 1 Modules linked in: nfsd exportfs md5 ipv6 parport_pc lp parport autofs4 i2c_dd Pid: 1117, comm: ib_mad1 Not tainted 2.6.9-22.ELsmp RIP: 0010:[] {_spin_lock_irqsave+12} RSP: 0018:000001007c2afc58 EFLAGS: 00010086 RAX: 000001007c2afcb8 RBX: 0000ffff1b40167f RCX: ffffffffa01043f9 RDX: 000001007b66e8c0 RSI: 0000000000000000 RDI: 0000ffff1b40167f RBP: 000001007b66e8c0 R08: 00000000fffffffc R09: dead4ead00000001 R10: ffffffff801f51d3 R11: ffffffff801f51d3 R12: 0000000000000000 R13: 0000ffff1b40167f R14: 0000ffff1b4012ff R15: 0000000000000003 FS: 000000000058a1a0(0000) GS:ffffffff804d3180(0000) knlGS:00000000f7fd76c0 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000002a956fb070 CR3: 00000000032d8000 CR4: 00000000000006e0 Process ib_mad1 (pid: 1117, threadinfo 000001007c2ae000, task 000001007da6603) Stack: 0000000000000000 0000000000000286 0000000000000000 ffffffffa010a74a 000001007ba52000 ffffffff804081c0 000001007c2afd68 0000000000000246 0000000000000246 ffffffff802a9ab7 Call Trace:{:ib_ipoib:path_rec_completion+309} {dev_queue_xmit+525} {:ib_sa:ib_sa {:ib_sa:send_handler+74} {:ib_mad: {:ib_mad:ib_mad_completion_handler+993} {:ib_mad:ib_mad_completion_handler+0} {worker_thread+419} {default_wake_ {default_wake_function+0} {keventd {worker_thread+0} {keventd_create_ {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthrea ffffffff80110c9b>{child_rip+0} Code: 
81 7f 04 ad 4e ad de 74 1f 48 8b 74 24 18 48 c7 c7 25 d6 31 RIP {_spin_lock_irqsave+12} RSP <000001007c2afc58> <0>Kernel panic - not syncing: Oops ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From jlentini at netapp.com Mon Apr 3 11:05:48 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 3 Apr 2006 14:05:48 -0400 (EDT) Subject: [openib-general] [DAPL] Provider initialialization Message-ID: Arlin, As part of the uDAPL autotools patch, we changed the mechanism by which the uDAPL provider library's init and fini functions were specified. I've seen (and received reports) of systems on which the init and fini functions are not being called. I'd like to move back to the old mechanism (see patch below). Do you see any problems with this? Index: dapl/udapl/dapl_init.c =================================================================== --- dapl/udapl/dapl_init.c (revision 6180) +++ dapl/udapl/dapl_init.c (working copy) @@ -66,7 +66,7 @@ * * Return Values: */ -static void __attribute__((constructor)) dapl_init ( void ) +void dapl_init ( void ) { DAT_RETURN dat_status; @@ -138,7 +138,7 @@ * * Return Values: */ -static void __attribute__((destructor)) dapl_fini ( void ) +void dapl_fini ( void ) { DAT_RETURN dat_status; Index: Makefile.am =================================================================== --- Makefile.am (revision 6180) +++ Makefile.am (working copy) @@ -173,7 +173,8 @@ dapl/openib_cma/dapl_ib_mem.c dapl_udapl_libdaplcma_la_LDFLAGS = -version-info 1:2:0 $(daplcma_version_script) \ - -lpthread -libverbs -lrdmacm + -Wl,-init,dapl_init -Wl,-fini,dapl_fini \ + -lpthread -libverbs -lrdmacm # @@ -282,6 +283,7 @@ dapl/openib_scm/dapl_ib_mem.c dapl_udapl_libdaplscm_la_LDFLAGS = -version-info 1:2:0 $(daplscm_version_script) \ + -Wl,-init,dapl_init -Wl,-fini,dapl_fini \ -lpthread -libverbs libdatincludedir = $(includedir)/dat From mst at mellanox.co.il Mon Apr 3 11:46:22 2006 From: mst at 
mellanox.co.il (Michael S. Tsirkin) Date: Mon, 3 Apr 2006 21:46:22 +0300 Subject: [openib-general] Re: CMA deadlock In-Reply-To: References: <20060403082056.GZ14808@mellanox.co.il> Message-ID: <20060403184622.GA13767@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: CMA deadlock > > > A ULP requests address resolution; on success requests route resolution; > > route resolution succeeds; inside the callback ULP requests rdma_connect. > > Now, a failure (e.g. out of memory) occurs at ULP level and so it decides to > > destroy the ID. To this end it returns failure code from the route callback. > > I didn't consider this possibility. The only solution I can see at the moment > is to schedule route resolution to a separate thread, as you suggested. OK. I gather you'll fix it then? > >And it seems that, if the user callback returns failure, the CMA actually > >calls rdma_destroy_id which in turn may call ib_destroy_cm_id from inside the > >CM callback. I think this might deadlock in a similar way. Again, bouncing > >the CM event to the rdma WQ will solve this I think. > > This should be handled by the code. See the comment near the bottom of the > cma_ib_handler() routine. I don't really understand the comment. if (ret) { /* Destroy the CM ID by returning a non-zero value. */ conn_id->cm_id.ib = NULL; cma_exch(conn_id, CMA_DESTROYING); cma_release_remove(conn_id); rdma_destroy_id(&conn_id->id); } We seem to be calling rdma_destroy_id, which seems to be calling ib_destroy_cm_id directly. No? -- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Mon Apr 3 11:49:42 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 03 Apr 2006 11:49:42 -0700 Subject: [openib-general] [git pull] InfiniBand: small post 2.6.17-rc1 fixes Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus The pull will get the following changes: Michael S. Tsirkin: IB/mad: fix oops in cancel_mads Roland Dreier: IPoIB: Always build debugging code unless CONFIG_EMBEDDED=y IB/mthca: Always build debugging code unless CONFIG_EMBEDDED=y IB/srp: Fix memory leak in options parsing drivers/infiniband/core/mad.c | 2 +- drivers/infiniband/hw/mthca/Kconfig | 11 ++++++----- drivers/infiniband/hw/mthca/Makefile | 4 ---- drivers/infiniband/hw/mthca/mthca_dev.h | 17 +++++++++++++++-- drivers/infiniband/hw/mthca/mthca_main.c | 8 ++++++++ drivers/infiniband/ulp/ipoib/Kconfig | 3 ++- drivers/infiniband/ulp/srp/ib_srp.c | 1 + 7 files changed, 33 insertions(+), 13 deletions(-) diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index ba54c85..3a702da 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2311,6 +2311,7 @@ static void local_completions(void *data local = list_entry(mad_agent_priv->local_list.next, struct ib_mad_local_private, completion_list); + list_del(&local->completion_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); if (local->mad_priv) { recv_mad_agent = local->recv_mad_agent; @@ -2362,7 +2363,6 @@ local_send_completion: &mad_send_wc); spin_lock_irqsave(&mad_agent_priv->lock, flags); - list_del(&local->completion_list); atomic_dec(&mad_agent_priv->refcount); if (!recv) kmem_cache_free(ib_mad_cache, local->mad_priv); diff --git a/drivers/infiniband/hw/mthca/Kconfig b/drivers/infiniband/hw/mthca/Kconfig index 
e88be85..9aa5a44 100644 --- a/drivers/infiniband/hw/mthca/Kconfig +++ b/drivers/infiniband/hw/mthca/Kconfig @@ -7,10 +7,11 @@ config INFINIBAND_MTHCA ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). config INFINIBAND_MTHCA_DEBUG - bool "Verbose debugging output" + bool "Verbose debugging output" if EMBEDDED depends on INFINIBAND_MTHCA - default n + default y ---help--- - This option causes the mthca driver produce a bunch of debug - messages. Select this is you are developing the driver or - trying to diagnose a problem. + This option causes debugging code to be compiled into the + mthca driver. The output can be turned on via the + debug_level module parameter (which can also be set after + the driver is loaded through sysfs). diff --git a/drivers/infiniband/hw/mthca/Makefile b/drivers/infiniband/hw/mthca/Makefile index 47ec5a7..e388d95 100644 --- a/drivers/infiniband/hw/mthca/Makefile +++ b/drivers/infiniband/hw/mthca/Makefile @@ -1,7 +1,3 @@ -ifdef CONFIG_INFINIBAND_MTHCA_DEBUG -EXTRA_CFLAGS += -DDEBUG -endif - obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index ad52edb..bb2a9d6 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -355,8 +355,21 @@ struct mthca_dev { spinlock_t sm_lock; }; -#define mthca_dbg(mdev, format, arg...) \ - dev_dbg(&mdev->pdev->dev, format, ## arg) +#ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +extern int mthca_debug_level; + +#define mthca_dbg(mdev, format, arg...) \ + do { \ + if (mthca_debug_level) \ + dev_printk(KERN_DEBUG, &mdev->pdev->dev, format, ## arg); \ + } while (0) + +#else /* CONFIG_INFINIBAND_MTHCA_DEBUG */ + +#define mthca_dbg(mdev, format, arg...) do { (void) mdev; } while (0) + +#endif /* CONFIG_INFINIBAND_MTHCA_DEBUG */ + #define mthca_err(mdev, format, arg...) 
\ dev_err(&mdev->pdev->dev, format, ## arg) #define mthca_info(mdev, format, arg...) \ diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 266f347..597d7dc 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -52,6 +52,14 @@ MODULE_DESCRIPTION("Mellanox InfiniBand MODULE_LICENSE("Dual BSD/GPL"); MODULE_VERSION(DRV_VERSION); +#ifdef CONFIG_INFINIBAND_MTHCA_DEBUG + +int mthca_debug_level = 0; +module_param_named(debug_level, mthca_debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); + +#endif /* CONFIG_INFINIBAND_MTHCA_DEBUG */ + #ifdef CONFIG_PCI_MSI static int msi_x = 0; diff --git a/drivers/infiniband/ulp/ipoib/Kconfig b/drivers/infiniband/ulp/ipoib/Kconfig index 8d2e04c..13d6d01 100644 --- a/drivers/infiniband/ulp/ipoib/Kconfig +++ b/drivers/infiniband/ulp/ipoib/Kconfig @@ -10,8 +10,9 @@ config INFINIBAND_IPOIB group: . config INFINIBAND_IPOIB_DEBUG - bool "IP-over-InfiniBand debugging" + bool "IP-over-InfiniBand debugging" if EMBEDDED depends on INFINIBAND_IPOIB + default y ---help--- This option causes debugging code to be compiled into the IPoIB driver. The output can be turned on via the diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index fd8a95a..5f2b3f6 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1434,6 +1434,7 @@ static int srp_parse_options(const char p = match_strdup(args); if (strlen(p) != 32) { printk(KERN_WARNING PFX "bad dest GID parameter '%s'\n", p); + kfree(p); goto out; } From rdreier at cisco.com Mon Apr 3 11:54:04 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 03 Apr 2006 11:54:04 -0700 Subject: [openib-general] [PATCH] IB/ipath: Make more names static Message-ID: Make symbols that are only used in a single source file static. Also completely delete the unused ipath_diag_bringup_link() function. 
This patch doesn't get rid of things like IPATH_SRC_OUI_1 that are not used right now but might be used by the ipath_ether driver. That would be a good cleanup too but I'll leave that for another day. Signed-off-by: Roland Dreier --- If this looks good to you, please check it in to svn and let me know so I can queue it for merging into 2.6.17. diff --git a/drivers/infiniband/hw/ipath/ipath_diag.c b/drivers/infiniband/hw/ipath/ipath_diag.c index cd533cf..7d3fb69 100644 --- a/drivers/infiniband/hw/ipath/ipath_diag.c +++ b/drivers/infiniband/hw/ipath/ipath_diag.c @@ -365,15 +365,3 @@ static ssize_t ipath_diag_write(struct f bail: return ret; } - -void ipath_diag_bringup_link(struct ipath_devdata *dd) -{ - if (diag_set_link || (dd->ipath_flags & IPATH_LINKACTIVE)) - return; - - diag_set_link = 1; - ipath_cdbg(VERBOSE, "Trying to set to set link active for " - "diag pkt\n"); - ipath_layer_set_linkstate(dd, IPATH_IB_LINKARM); - ipath_layer_set_linkstate(dd, IPATH_IB_LINKACTIVE); -} diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 58a94ef..e7617c3 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -1729,7 +1729,7 @@ void ipath_free_pddata(struct ipath_devd } } -int __init infinipath_init(void) +static int __init infinipath_init(void) { int ret; diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 159d0ae..0ce5f19 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -528,7 +528,6 @@ extern spinlock_t ipath_devs_lock; extern struct ipath_devdata *ipath_lookup(int unit); extern u16 ipath_layer_rcv_opcode; -extern int ipath_verbs_registered; extern int __ipath_layer_intr(struct ipath_devdata *, u32); extern int ipath_layer_intr(struct ipath_devdata *, u32); extern int __ipath_layer_rcv(struct ipath_devdata *, void *, diff --git 
a/drivers/infiniband/hw/ipath/ipath_layer.c b/drivers/infiniband/hw/ipath/ipath_layer.c index 2cabf63..69ed110 100644 --- a/drivers/infiniband/hw/ipath/ipath_layer.c +++ b/drivers/infiniband/hw/ipath/ipath_layer.c @@ -52,7 +52,7 @@ static int (*layer_rcv)(void *, void *, static int (*layer_rcv_lid)(void *, void *); static int (*verbs_piobufavail)(void *); static void (*verbs_rcv)(void *, void *, void *, u32); -int ipath_verbs_registered; +static int ipath_verbs_registered; static void *(*layer_add_one)(int, struct ipath_devdata *); static void (*layer_remove_one)(void *); diff --git a/drivers/infiniband/hw/ipath/ipath_pe800.c b/drivers/infiniband/hw/ipath/ipath_pe800.c index e693a7a..e1dc4f7 100644 --- a/drivers/infiniband/hw/ipath/ipath_pe800.c +++ b/drivers/infiniband/hw/ipath/ipath_pe800.c @@ -305,8 +305,8 @@ static const struct ipath_cregs ipath_pe * we'll print them and continue. We reuse the same message buffer as * ipath_handle_errors() to avoid excessive stack usage. */ -void ipath_pe_handle_hwerrors(struct ipath_devdata *dd, char *msg, - size_t msgl) +static void ipath_pe_handle_hwerrors(struct ipath_devdata *dd, char *msg, + size_t msgl) { ipath_err_t hwerrs; u32 bits, ctrl; @@ -552,7 +552,7 @@ static int ipath_pe_boardname(struct ipa * freeze mode), and enable hardware errors as errors (along with * everything else) in errormask */ -void ipath_pe_init_hwerrors(struct ipath_devdata *dd) +static void ipath_pe_init_hwerrors(struct ipath_devdata *dd) { ipath_err_t val; u64 extsval; @@ -577,7 +577,7 @@ void ipath_pe_init_hwerrors(struct ipath * ipath_pe_bringup_serdes - bring up the serdes * @dd: the infinipath device */ -int ipath_pe_bringup_serdes(struct ipath_devdata *dd) +static int ipath_pe_bringup_serdes(struct ipath_devdata *dd) { u64 val, tmp, config1; int ret = 0, change = 0; @@ -694,7 +694,7 @@ int ipath_pe_bringup_serdes(struct ipath * @dd: the infinipath device * Called when driver is being unloaded */ -void ipath_pe_quiet_serdes(struct 
ipath_devdata *dd) +static void ipath_pe_quiet_serdes(struct ipath_devdata *dd) { u64 val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_serdesconfig0); diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index 6058d70..1889071 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ -188,8 +188,8 @@ static void free_qpn(struct ipath_qp_tab * Allocate the next available QPN and put the QP into the hash table. * The hash table holds a reference to the QP. */ -int ipath_alloc_qpn(struct ipath_qp_table *qpt, struct ipath_qp *qp, - enum ib_qp_type type) +static int ipath_alloc_qpn(struct ipath_qp_table *qpt, struct ipath_qp *qp, + enum ib_qp_type type) { unsigned long flags; u32 qpn; @@ -232,7 +232,7 @@ bail: * Remove the QP from the table so it can't be found asynchronously by * the receive interrupt routine. */ -void ipath_free_qp(struct ipath_qp_table *qpt, struct ipath_qp *qp) +static void ipath_free_qp(struct ipath_qp_table *qpt, struct ipath_qp *qp) { struct ipath_qp *q, **qpp; unsigned long flags; @@ -358,6 +358,65 @@ static void ipath_reset_qp(struct ipath_ } /** + * ipath_error_qp - put a QP into an error state + * @qp: the QP to put into an error state + * + * Flushes both send and receive work queues. + * QP r_rq.lock and s_lock should be held. + */ + +static void ipath_error_qp(struct ipath_qp *qp) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ib_wc wc; + + _VERBS_INFO("QP%d/%d in error state\n", + qp->ibqp.qp_num, qp->remote_qpn); + + spin_lock(&dev->pending_lock); + /* XXX What if its already removed by the timeout code? 
*/ + if (qp->timerwait.next != LIST_POISON1) + list_del(&qp->timerwait); + if (qp->piowait.next != LIST_POISON1) + list_del(&qp->piowait); + spin_unlock(&dev->pending_lock); + + wc.status = IB_WC_WR_FLUSH_ERR; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + + while (qp->s_last != qp->s_head) { + struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); + + wc.wr_id = wqe->wr.wr_id; + wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; + if (++qp->s_last >= qp->s_size) + qp->s_last = 0; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1); + } + qp->s_cur = qp->s_tail = qp->s_head; + qp->s_hdrwords = 0; + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + + wc.opcode = IB_WC_RECV; + while (qp->r_rq.tail != qp->r_rq.head) { + wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id; + if (++qp->r_rq.tail >= qp->r_rq.size) + qp->r_rq.tail = 0; + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + } +} + +/** * ipath_modify_qp - modify the attributes of a queue pair * @ibqp: the queue pair who's attributes we're modifying * @attr: the new attributes @@ -821,65 +880,6 @@ void ipath_sqerror_qp(struct ipath_qp *q } /** - * ipath_error_qp - put a QP into an error state - * @qp: the QP to put into an error state - * - * Flushes both send and receive work queues. - * QP r_rq.lock and s_lock should be held. - */ - -void ipath_error_qp(struct ipath_qp *qp) -{ - struct ipath_ibdev *dev = to_idev(qp->ibqp.device); - struct ib_wc wc; - - _VERBS_INFO("QP%d/%d in error state\n", - qp->ibqp.qp_num, qp->remote_qpn); - - spin_lock(&dev->pending_lock); - /* XXX What if its already removed by the timeout code? 
*/ - if (qp->timerwait.next != LIST_POISON1) - list_del(&qp->timerwait); - if (qp->piowait.next != LIST_POISON1) - list_del(&qp->piowait); - spin_unlock(&dev->pending_lock); - - wc.status = IB_WC_WR_FLUSH_ERR; - wc.vendor_err = 0; - wc.byte_len = 0; - wc.imm_data = 0; - wc.qp_num = qp->ibqp.qp_num; - wc.src_qp = 0; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = 0; - wc.sl = 0; - wc.dlid_path_bits = 0; - wc.port_num = 0; - - while (qp->s_last != qp->s_head) { - struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); - - wc.wr_id = wqe->wr.wr_id; - wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - if (++qp->s_last >= qp->s_size) - qp->s_last = 0; - ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1); - } - qp->s_cur = qp->s_tail = qp->s_head; - qp->s_hdrwords = 0; - qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; - - wc.opcode = IB_WC_RECV; - while (qp->r_rq.tail != qp->r_rq.head) { - wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id; - if (++qp->r_rq.tail >= qp->r_rq.size) - qp->r_rq.tail = 0; - ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); - } -} - -/** * ipath_get_credit - flush the send work queue of a QP * @qp: the qp who's send work queue to flush * @aeth: the Acknowledge Extended Transport Header diff --git a/drivers/infiniband/hw/ipath/ipath_ud.c b/drivers/infiniband/hw/ipath/ipath_ud.c index 5ff3de6..01cfb30 100644 --- a/drivers/infiniband/hw/ipath/ipath_ud.c +++ b/drivers/infiniband/hw/ipath/ipath_ud.c @@ -46,8 +46,8 @@ * This is called from ipath_post_ud_send() to forward a WQE addressed * to the same HCA. 
*/ -void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, - u32 length, struct ib_send_wr *wr, struct ib_wc *wc) +static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, + u32 length, struct ib_send_wr *wr, struct ib_wc *wc) { struct ipath_ibdev *dev = to_idev(sqp->ibqp.device); struct ipath_qp *qp; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 9f27fd3..e3be492 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -41,7 +41,7 @@ /* Not static, because we don't want the compiler removing it */ const char ipath_verbs_version[] = "ipath_verbs " IPATH_IDSTR; -unsigned int ib_ipath_qp_table_size = 251; +static unsigned int ib_ipath_qp_table_size = 251; module_param_named(qp_table_size, ib_ipath_qp_table_size, uint, S_IRUGO); MODULE_PARM_DESC(qp_table_size, "QP table size"); @@ -87,7 +87,7 @@ const enum ib_wc_opcode ib_ipath_wc_opco /* * System image GUID. 
*/ -__be64 sys_image_guid; +static __be64 sys_image_guid; /** * ipath_copy_sge - copy data to SGE memory @@ -1110,7 +1110,7 @@ static void ipath_unregister_ib_device(v ib_dealloc_device(ibdev); } -int __init ipath_verbs_init(void) +static int __init ipath_verbs_init(void) { return ipath_verbs_register(ipath_register_ib_device, ipath_unregister_ib_device, @@ -1118,7 +1118,7 @@ int __init ipath_verbs_init(void) ipath_ib_timer); } -void __exit ipath_verbs_cleanup(void) +static void __exit ipath_verbs_cleanup(void) { ipath_verbs_unregister(); } diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index b824632..fcafbc7 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -577,8 +577,6 @@ int ipath_init_qp_table(struct ipath_ibd void ipath_sqerror_qp(struct ipath_qp *qp, struct ib_wc *wc); -void ipath_error_qp(struct ipath_qp *qp); - void ipath_get_credit(struct ipath_qp *qp, u32 aeth); void ipath_do_rc_send(unsigned long data); @@ -607,9 +605,6 @@ void ipath_rc_rcv(struct ipath_ibdev *de void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc); -void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, - u32 length, struct ib_send_wr *wr, struct ib_wc *wc); - int ipath_post_ud_send(struct ipath_qp *qp, struct ib_send_wr *wr); void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, From rdreier at cisco.com Mon Apr 3 11:55:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 03 Apr 2006 11:55:33 -0700 Subject: [openib-general] Re: IPoIB interface for unauthorized partition In-Reply-To: <1144082680.4480.44271.camel@hal.voltaire.com> (Hal Rosenstock's message of "03 Apr 2006 12:44:43 -0400") References: <1144082680.4480.44271.camel@hal.voltaire.com> Message-ID: Hal> Hi Roland, I have a port which only has the full default Hal> partition configured but ifconfig allows an IPoIB interface Hal> with a PKey which is not in the 
Pkey table. Shouldn't the Hal> ifconfig fail for this (rather than the subsequent ping) ? No, I don't think so. The philosophy of the IPoIB driver is to allow configuration to be done before the IB port is fully ready. This allows for things such as boot scripts configuring interfaces before the SM has brought the port up. Otherwise you end up in a situation where a system may come up without network interfaces configured, just because of a transient timing issue during boot. - R. From sean.hefty at intel.com Mon Apr 3 12:01:38 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 3 Apr 2006 12:01:38 -0700 Subject: [openib-general] [PATCH]: CMA deadlock In-Reply-To: <20060403082056.GZ14808@mellanox.co.il> Message-ID: Michael, Can you see if the following patch fixes your problem? - Sean --- Fix deadlock condition caused by destroying an IB CM ID from within a MAD callback thread (SA query callback). See note from Michael Tsirkin about this bug: A ULP requests address resolution; on success requests route resolution; route resolution succeeds; inside the callback ULP requests rdma_connect. Now, a failure (e.g. out of memory) occurs at ULP level and so it decides to destroy the ID. To this end it returns failure code from the route callback. Note that route resolution callback runs in the per-port MAD workqueue. Now, CMA will call rdma_destroy_id to destroy the ID. Since CM ID exists, it will try to destroy it. This might deadlock: since a CM MAD (REQ) has been created, CM ID destroy will now block, waiting for the MAD to be freed, but MADs might not complete since we are blocking the MAD workqueue. Fix condition by queuing SA query callbacks to the CMA's thread. 
Signed-off-by: Sean Hefty --- Index: cma.c =================================================================== --- cma.c (revision 6134) +++ cma.c (working copy) @@ -1030,44 +1030,25 @@ EXPORT_SYMBOL(rdma_listen); static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, void *context) { - struct rdma_id_private *id_priv = context; - struct rdma_route *route = &id_priv->id.route; - enum rdma_cm_event_type event = RDMA_CM_EVENT_ROUTE_RESOLVED; + struct cma_work *work = context; + struct rdma_route *route; - atomic_inc(&id_priv->dev_remove); - if (!status) { - route->path_rec = kmalloc(sizeof *route->path_rec, GFP_KERNEL); - if (route->path_rec) { - route->num_paths = 1; - *route->path_rec = *path_rec; - if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, - CMA_ROUTE_RESOLVED)) { - kfree(route->path_rec); - goto out; - } - } else - status = -ENOMEM; - } + route = &work->id->id.route; - if (status) { - if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED)) - goto out; - event = RDMA_CM_EVENT_ROUTE_ERROR; + if (!status) { + route->num_paths = 1; + *route->path_rec = *path_rec; + } else { + work->old_state = CMA_ROUTE_QUERY; + work->new_state = CMA_ADDR_RESOLVED; + work->event.event = RDMA_CM_EVENT_ROUTE_ERROR; } - if (cma_notify_user(id_priv, event, status, NULL, 0)) { - cma_exch(id_priv, CMA_DESTROYING); - cma_release_remove(id_priv); - cma_deref_id(id_priv); - rdma_destroy_id(&id_priv->id); - return; - } -out: - cma_release_remove(id_priv); - cma_deref_id(id_priv); + queue_work(cma_wq, &work->work); } -static int cma_query_ib_route(struct rdma_id_private *id_priv, int timeout_ms) +static int cma_query_ib_route(struct rdma_id_private *id_priv, int timeout_ms, + struct cma_work *work) { struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; struct ib_sa_path_rec path_rec; @@ -1083,7 +1064,7 @@ static int cma_query_ib_route(struct rdm IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, timeout_ms, 
GFP_KERNEL, - cma_query_handler, id_priv, &id_priv->query); + cma_query_handler, work, &id_priv->query); return (id_priv->query_id < 0) ? id_priv->query_id : 0; } @@ -1121,6 +1102,12 @@ static int cma_resolve_ib_route(struct r if (!work) return -ENOMEM; + work->id = id_priv; + INIT_WORK(&work->work, cma_work_handler, work); + work->old_state = CMA_ROUTE_QUERY; + work->new_state = CMA_ROUTE_RESOLVED; + work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; + route->path_rec = kmalloc(sizeof *route->path_rec, GFP_KERNEL); if (!route->path_rec) { ret = -ENOMEM; @@ -1130,24 +1117,21 @@ static int cma_resolve_ib_route(struct r ret = ib_get_path_rec(id_priv->id.device, id_priv->id.port_num, ib_addr_get_sgid(addr), ib_addr_get_dgid(addr), ib_addr_get_pkey(addr), route->path_rec); - if (ret) - goto err2; - - route->num_paths = 1; - work->id = id_priv; - INIT_WORK(&work->work, cma_work_handler, work); - work->old_state = CMA_ROUTE_QUERY; - work->new_state = CMA_ROUTE_RESOLVED; - work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; - queue_work(cma_wq, &work->work); + if (!ret) { + route->num_paths = 1; + queue_work(cma_wq, &work->work); + } else { + if (ret == -ENODATA) + ret = cma_query_ib_route(id_priv, timeout_ms, work); + if (ret) + goto err2; + } return 0; err2: kfree(route->path_rec); route->path_rec = NULL; err1: kfree(work); - if (ret == -ENODATA) - ret = cma_query_ib_route(id_priv, timeout_ms); return ret; } From bugzilla-daemon at openib.org Mon Apr 3 12:11:32 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 3 Apr 2006 12:11:32 -0700 (PDT) Subject: [openib-general] [Bug 31] ifconfig up/down while ssh connection alive cause oops Message-ID: <20060403191132.124162283F3@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=31 ------- Additional Comments From roland at topspin.com 2006-04-03 12:11 ------- How reproducible is this? I cannot make it happen on my test systems. 
Does this happen with a current kernel (2.6.16 and/or 2.6.17-rc1)? I see that your oops is coming from a 2.6.9 backport, so the bug may be introduced as part of the backport. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Mon Apr 3 12:12:16 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 3 Apr 2006 12:12:16 -0700 (PDT) Subject: [openib-general] [Bug 28] ipoib_mcast_sendonly_join_complete oops Message-ID: <20060403191216.9875B2283F3@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=28 ------- Additional Comments From roland at topspin.com 2006-04-03 12:12 ------- Does this still happen with all the IPoIB fixes in 2.6.17-rc1? ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From sean.hefty at intel.com Mon Apr 3 12:06:00 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 3 Apr 2006 12:06:00 -0700 Subject: [openib-general] RE: CMA deadlock In-Reply-To: <20060403184622.GA13767@mellanox.co.il> Message-ID: > if (ret) { > /* Destroy the CM ID by returning a non-zero value. */ > conn_id->cm_id.ib = NULL; > cma_exch(conn_id, CMA_DESTROYING); > cma_release_remove(conn_id); > rdma_destroy_id(&conn_id->id); > } > > >We seem to be calling rdma_destroy_id, which seems to be calling >ib_destroy_cm_id directly. No? We're setting cm_id.ib to NULL, which prevents rdma_destroy_id() from calling ib_destroy_cm_id(). The handler will return a non-zero value to destroy cm_id.ib. 
- Sean From rjwalsh at pathscale.com Mon Apr 3 12:13:05 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 03 Apr 2006 12:13:05 -0700 Subject: [openib-general] [PATCH] IB/ipath: Make more names static In-Reply-To: References: Message-ID: <1144091585.23196.0.camel@hematite.internal.keyresearch.com> On Mon, 2006-04-03 at 11:54 -0700, Roland Dreier wrote: > Make symbols that are only used in a single source file static. Also > completely delete the unused ipath_diag_bringup_link() function. > > This patch doesn't get rid of things like IPATH_SRC_OUI_1 that are not > used right now but might be used by the ipath_ether driver. That > would be a good cleanup too but I'll leave that for another day. > > Signed-off-by: Roland Dreier Thanks, Roland. I'll merge them in as appropriate. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From mst at mellanox.co.il Mon Apr 3 12:48:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 3 Apr 2006 22:48:15 +0300 Subject: [openib-general] Re: CMA deadlock In-Reply-To: References: <20060403184622.GA13767@mellanox.co.il> Message-ID: <20060403194815.GC13767@mellanox.co.il> Quoting r. Sean Hefty : > >We seem to be calling rdma_destroy_id, which seems to be calling > >ib_destroy_cm_id directly. No? > > We're setting cm_id.ib to NULL, which prevents rdma_destroy_id() from calling > ib_destroy_cm_id(). The handler will return a non-zero value to destroy > cm_id.ib. I see. I would be happier with a variant of rdma_destroy_id with an explicit flag destroy_cm_id. The current approach seems racy: what if e.g. listen call is in progress? 
It looks at id_priv->cm_id.ib which seemingly might change under its feet. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Mon Apr 3 12:51:28 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 3 Apr 2006 22:51:28 +0300 Subject: [openib-general] Re: [PATCH]: CMA deadlock In-Reply-To: References: <20060403082056.GZ14808@mellanox.co.il> Message-ID: <20060403195128.GD13767@mellanox.co.il> Quoting r. Sean Hefty : > Subject: [PATCH]: CMA deadlock > > Michael, > > Can you see if the following patch fixes your problem? > > - Sean Looks right, but I can't test it today. If it works for you, I suggest you check this patch in - this will make testing easier for me. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Mon Apr 3 12:56:20 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 3 Apr 2006 22:56:20 +0300 Subject: [openib-general] Re: CMA deadlock In-Reply-To: References: <20060403082056.GZ14808@mellanox.co.il> Message-ID: <20060403195620.GE13767@mellanox.co.il> Quoting r. Sean Hefty : > This should be handled by the code. See the comment near the bottom of the > cma_ib_handler() routine. BTW, am I right in saying that error handling is mostly untested? If so, a variant of cmatose that fails the connection randomly at various stages would be quite useful, wouldn't it? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Mon Apr 3 13:07:16 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 03 Apr 2006 13:07:16 -0700 Subject: [openib-general] Re: CMA deadlock In-Reply-To: <20060403194815.GC13767@mellanox.co.il> References: <20060403184622.GA13767@mellanox.co.il> <20060403194815.GC13767@mellanox.co.il> Message-ID: <44318074.8010207@ichips.intel.com> Michael S. Tsirkin wrote: > The current approach seems racy: what if e.g. listen call is in progress? 
It > looks at id_priv->cm_id.ib which seemingly might change under its feet. There shouldn't be any races. Callbacks from the IB CM are serialized, so we're in the only callback that can occur. And if the user destroys the rdma_id by returning a non-zero value in the callback, then they cannot be using that same id in any other call. - Sean From mtravis9999 at yahoo.com Mon Apr 3 13:08:20 2006 From: mtravis9999 at yahoo.com (Uncle Fester) Date: Mon, 3 Apr 2006 13:08:20 -0700 (PDT) Subject: [openib-general] Newbie Questions Regarding How To Start Message-ID: <20060403200821.737.qmail@web36505.mail.mud.yahoo.com> I am interested in getting my hands messy with Infiniband--and perhaps in eventually contributing in some small way to its progress. I have done very rudimentary research, but would like to confirm my assumptions with the experts here so that I don't spend my money unwisely. I would like to add infiniband capability to two amd64 systems in my lab. These have PCI-X Tyan S2882 motherboards. I'm running linux on these systems. Can infiniband function properly in a physical point to point topology? I would prefer to not invest in switches if it's not entirely necessary. The documentation I've glanced at so far seems to indicate that infiniband is logically point-to-point. But the diagrams representing physical topology all seem to have switches. Other than obviously not being able to communicate across a fabric, is there functionality missing from a point-to-point topology? Does it even work? What's the best HCA card for me? I'm drawn to Mellanox MT23108-based systems because OpenIB seems to support them well (https://openib.org/tiki/tiki-index.php?page=MellanoxHcaFirmware). I'm considering purchasing two HP NC570C cards (http://h30094.www3.hp.com/product.asp?sku=2603660&), which are based on that Mellanox chip. Does anybody know if those cards would be functional in non-hp gear? 
In other words, does HP have their own proprietary goodies in the cards that would cause them to break in generic systems such as what I have? Are there better cards for me? Thank you very much for your time and advice. Sincerely, Mark Travis From mst at mellanox.co.il Mon Apr 3 13:55:13 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 3 Apr 2006 23:55:13 +0300 Subject: [openib-general] Re: Re: CMA deadlock In-Reply-To: <44318074.8010207@ichips.intel.com> References: <20060403184622.GA13767@mellanox.co.il> <20060403194815.GC13767@mellanox.co.il> <44318074.8010207@ichips.intel.com> Message-ID: <20060403205513.GF13767@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: Re: CMA deadlock > > Michael S. Tsirkin wrote: > >The current approach seems racy: what if e.g. listen call is in progress? > >It > >looks at id_priv->cm_id.ib which seemingly might change under its feet. > > There shouldn't be any races. Callbacks from the IB CM are serialized, so > we're in the only callback that can occur. And if the user destroys the > rdma_id by returning a non-zero value in the callback, then they cannot be > using that same id in any other call. Sounds sane. OK, please commit the patch you've sent and I'll test tomorrow. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Mon Apr 3 14:01:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 03 Apr 2006 14:01:54 -0700 Subject: [openib-general] Newbie Questions Regarding How To Start In-Reply-To: <20060403200821.737.qmail@web36505.mail.mud.yahoo.com> (Uncle Fester's message of "Mon, 3 Apr 2006 13:08:20 -0700 (PDT)") References: <20060403200821.737.qmail@web36505.mail.mud.yahoo.com> Message-ID: Uncle> Can infiniband function properly in a physical point to Uncle> point topology? 
I would prefer to not invest in switches Uncle> if it's not entirely necessary. The documentation I've Uncle> glanced at so far seems to indicate that infiniband is Uncle> logically point-to-point. But the diagrams representing Uncle> physical topology all seem to have switches. Other than Uncle> obviously not being able to communicate across a fabric, is Uncle> there functionality missing from a point-to-point topology? Uncle> Does it even work? Yes, this works. You will still need to run a subnet manager on one or both of the two hosts in your fabric. OpenSM will work fine. Uncle> What's the best HCA card for me? I'm drawn to Mellanox Uncle> MT23108-based systems because OpenIB seems to support them Uncle> well Uncle> (https://openib.org/tiki/tiki-index.php?page=MellanoxHcaFirmware). Yes, I think MT23108 HCAs are your only choice. I don't know of any other PCI-X HCAs. Uncle> I'm considering purchasing two HP NC570C cards Uncle> (http://h30094.www3.hp.com/product.asp?sku=2603660&), which Uncle> are based on that Mellanox chip. Does anybody know if Uncle> those cards would be functional in non-hp gear? In other Uncle> words, does HP have their own proprietary goodies in the Uncle> cards that would cause them to break in generic systems Uncle> such as what I have? Are there better cards for me? I would guess that all MT23108-based cards are pretty much the same, but you would have to confirm with HP to find out for sure. Using them in non-HP systems might affect your ability to get support if they don't work. You could also consider the Cisco SFS-HX4XCI2-LTM1= HCA for example, and I'm sure lots of other vendors have PCI-X HCAs available as well. Probably the best thing to do would be to go for the best price and availability that you can find. - R. From swise at opengridcomputing.com Mon Apr 3 14:05:03 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 03 Apr 2006 16:05:03 -0500 Subject: [openib-general] possible dapl bug? 
Message-ID: <1144098303.14424.79.camel@stevo-desktop> Hey James, I'm trying to get dapltest to run over the Chelsio RNIC in user mode. I'm running into an intermittent failure where the server side fails to properly clean up its resources. It has to do with disconnect vs ep freeing (go figure :). Basically if the disconnect event handler thread doesn't get done (and turn off in_callback) before the main dapltest thread attempts to destroy the EP, then dat_ep_free() will return "ok I freed it" even though it doesn't because in_callback == 1 in the dapl_cm_id struct. The main thread then tries to free the EVD and PZ and gets errors because they are still in use. So dapli_destroy_conn() defers destroying the ib_qp if ! conn->in_callback. This, however, leads to the dapltest program trying to destroy CQs and PDs with a QP still attached to them. I'll keep poking into this, but I wanted to bring this to your attention now. I haven't seen this on IB, but I hit it regularly on iwarp. But it seems like a bug since dat_ep_free() returns that the ep was destroyed and it really wasn't... Steve. --------- trace of server side dapltest --------- Test[b0df]: End Successfully dapl_lmr_free (0x8084598) dapl_lmr_free (0x808b758) dapl_lmr_free (0x8085e78) dapl_ep_disconnect (0x8083f00, 0) disconnect(ep 0x8083f00, conn 0x80872a0, id 134749624 flags 0) ib_thread(7575) poll_event: async=0x0 pipe=0x0 cm=0x1 cq=0x0 cm_event() cm_event: EVENT=10 ID=0x8081db8 LID=0x4015aae8 CTX=0x80872a0 passive_cb: conn 0x80872a0 id 134749624 event 10 --> dapl_cr_callback! context: 0x8085f28 event: 1 cm_handle 0x80872a0 dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005 --> dapli_get_sp_ep! 
disconnect dump sp: 0x8085f28 remove_listen(ia_ptr 0x8077220 sp_ptr 0x8085f28 cm_ptr 0x8085f98) destroy_conn: conn 0x8085f98 id 134766784 --> dapl_evd_connection_callback: ctxt: 0x8083f00 event: 1 cm_handle 0x80872a0 dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005 dapli_evd_post_event: Called with event # 4005 dapl_evd_connection_callback () returns dapl_ep_disconnect () returns 0x0 dapl_evd_wait (0x8082da0, -1, 1, 0x403b48c0, 0x403b4894) dapl_evd_wait: EVD 0x8082da0, CQ (nil) dapl_evd_wait () returns 0x0 dapl_evd_dequeue (0x8081ea8, 0x403b4940) dapl_evd_dequeue () returns 0xd0000 dapl_ep_free (0x8083f00) dapl_ep_disconnect (0x8083f00, 0) dapl_ep_disconnect () returns 0x0 dapl_ep_free: Free EP: b, ep 0x8083f00 qp_state 1 qp_handle 80872a0 qp_free: ep_ptr 0x8083f00 qp 0x80872a0 destroy_conn: conn 0x80872a0 id 134749624 dapli_destroy_conn IN CALLBACK! dapl_evd_free (0x8082da0) dapl_evd_free () returns 0x0 dapl_evd_free (0x8082b98) dapl_evd_free () returns 0x0 dapl_evd_free (0x8082520) destroy_cq Device or resource busy dapl_evd_free () returns 0x110000 Test[b0df]: dat_evd_free (reqt) error: DAT_PROVIDER_IN_USE dapl_evd_free (0x8081ea8) destroy_cq Device or resource busy dapl_evd_free () returns 0x110000 Test[b0df]: dat_evd_free (recv) error: DAT_PROVIDER_IN_USE dapl_pz_free (0x807a390) dealloc_pd Device or resource busy Test[b0df]: dat_pz_free error: DAT_PROVIDER_IN_USE Test[b0df]: cleanup is done passive_cb: destroy 1 in_callback 1 cma_cb: DESTROY conn 0x80872a0 cm_id 0x8081db8 qp 0x8084790 Server: Transaction Test Finished for this client From halr at voltaire.com Mon Apr 3 14:31:24 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Apr 2006 17:31:24 -0400 Subject: [openib-general] Newbie Questions Regarding How To Start In-Reply-To: <20060403200821.737.qmail@web36505.mail.mud.yahoo.com> References: <20060403200821.737.qmail@web36505.mail.mud.yahoo.com> Message-ID: <1144099882.4480.47693.camel@hal.voltaire.com> Hi Mark, On Mon, 2006-04-03 at 16:08, 
Uncle Fester wrote: > I am interested in getting my hands messy with Infiniband--and perhaps > in eventually contributing in some small way to its progress. I have > done very rudimentary research, but would like to confirm my > assumptions with the experts here so that I don't spend my money > unwisely. > > I would like to add infiniband capability to two amd64 systems in my > lab. These have PCI-X Tyan S2882 motherboards. I'm running linux on > these systems. I have a couple of these same motherboards too that I'm using for IB development and testing. > Can infiniband function properly in a physical point to point > topology? I would prefer to not invest in switches if it's not > entirely necessary. The documentation I've glanced at so far seems to > indicate that infiniband is logically point-to-point. But the > diagrams representing physical topology all seem to have switches. > Other than obviously not being able to communicate across a fabric, is > there functionality missing from a point-to-point topology? Does it > even work? Yes. You can run back to back HCAs. You will need to run an SM on one of the hosts. > What's the best HCA card for me? I'm drawn to Mellanox MT23108-based > systems because OpenIB seems to support them well > (https://openib.org/tiki/tiki-index.php?page=MellanoxHcaFirmware). > I'm considering purchasing two HP NC570C cards > (http://h30094.www3.hp.com/product.asp?sku=2603660&), which are based > on that Mellanox chip. Does anybody know if those cards would be > functional in non-hp gear? In other words, does HP have their own > proprietary goodies in the cards that would cause them to break in > generic systems such as what I have? Are there better cards for me? There are a number of Mellanox based cards which would work (for PCI-X). You can get these from a variety of vendors (Voltaire, Cisco, SilverStorm) as well as HP. -- Hal > Thank you very much for your time and advice. 
> > Sincerely, > Mark Travis > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From iod00d at hp.com Mon Apr 3 14:43:27 2006 From: iod00d at hp.com (Grant Grundler) Date: Mon, 3 Apr 2006 14:43:27 -0700 Subject: [openib-general] Newbie Questions Regarding How To Start In-Reply-To: <20060403200821.737.qmail@web36505.mail.mud.yahoo.com> References: <20060403200821.737.qmail@web36505.mail.mud.yahoo.com> Message-ID: <20060403214327.GB24455@esmail.cup.hp.com> On Mon, Apr 03, 2006 at 01:08:20PM -0700, Uncle Fester wrote: > I would like to add infiniband capability to two amd64 systems in my lab. > These have PCI-X Tyan S2882 motherboards. I'm running linux on these systems. > Can infiniband function properly in a physical point to point topology? Yes. I've tested it that way in the past. > Other than obviously not being able to communicate across a fabric, > is there functionality missing from a point-to-point topology? > Does it even work? Yes, but then you lose the integrated Subnet Manager that is running on the switch. You will have to run OpenSM instead. > What's the best HCA card for me? You need to define your criteria (rank of what is important to you). E.g. HW/OS support vs perf vs MPI support etc > I'm considering purchasing two HP NC570C cards > (http://h30094.www3.hp.com/product.asp?sku=2603660&), > which are based on that Mellanox chip. > Does anybody know if those cards would be functional in non-hp gear? I believe they will but can't guarantee that. Some of the AMD chipsets are broken in subtle ways (e.g. 
MSI won't work on "8131" - or something like that). > In other words, does HP have their own proprietary goodies in the cards > that would cause them to break in generic systems such as what I have? I'm not aware of any. HP has tested the cards/drivers/OSs on Proliants so they are known to work in officially supported configurations. ie they have documented a working set of firmware/drivers/OS versions. HP MPI and other commercial apps will be looking for that. hth, grant From halr at voltaire.com Mon Apr 3 14:48:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Apr 2006 17:48:03 -0400 Subject: [openib-general] Re: IPoIB interface for unauthorized partition In-Reply-To: References: <1144082680.4480.44271.camel@hal.voltaire.com> Message-ID: <1144100882.4480.47903.camel@hal.voltaire.com> On Mon, 2006-04-03 at 14:55, Roland Dreier wrote: > Hal> Hi Roland, I have a port which only has the full default > Hal> partition configured but ifconfig allows an IPoIB interface > Hal> with a PKey which is not in the Pkey table. Shouldn't the > Hal> ifconfig fail for this (rather than the subsequent ping) ? > > No, I don't think so. The philosophy of the IPoIB driver is to allow > configuration to be done before the IB port is fully ready. This > allows for things such as boot scripts configuring interfaces before > the SM has brought the port up. Otherwise you end up in a situation > where a system may come up without network interfaces configured, just > because of a transient timing issue during boot. Are the joins for those IPoIB interfaces deferred until the PKey table is initialized ? If they aren't deferred, should they be ? 
-- Hal From rdreier at cisco.com Mon Apr 3 14:53:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 03 Apr 2006 14:53:58 -0700 Subject: [openib-general] Re: IPoIB interface for unauthorized partition In-Reply-To: <1144100882.4480.47903.camel@hal.voltaire.com> (Hal Rosenstock's message of "03 Apr 2006 17:48:03 -0400") References: <1144082680.4480.44271.camel@hal.voltaire.com> <1144100882.4480.47903.camel@hal.voltaire.com> Message-ID: Hal> Are the joins for those IPoIB interfaces deferred until the Hal> PKey table is initialized ? If they aren't deferred, should Hal> they be ? Yes, they are deferred. - R. From mst at mellanox.co.il Mon Apr 3 15:00:26 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Apr 2006 01:00:26 +0300 Subject: [openib-general] Re: [Bug 28] ipoib_mcast_sendonly_join_complete oops In-Reply-To: <20060403191216.9875B2283F3@openib.ca.sandia.gov> References: <20060403191216.9875B2283F3@openib.ca.sandia.gov> Message-ID: <20060403220026.GA14847@mellanox.co.il> Quoting r. bugzilla-daemon at openib.org : > Subject: [Bug 28] ipoib_mcast_sendonly_join_complete oops > > Does this still happen with all the IPoIB fixes in 2.6.17-rc1? This is with svn trunk, so it's almost identical to 2.6.17-rc1. Especially for 2.6.14 the backport patches are really minor for ipoib. You can see them here https://openib.org/svn/gen2/branches/backport/2.6.14 One thing that I consider gross in our ipoib code is the way ipoib_mcast_start_thread is autostarted from restart_task if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) ipoib_mcast_start_thread(dev); -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Mon Apr 3 15:02:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Apr 2006 01:02:22 +0300 Subject: [openib-general] Re: updated InfiniBand 2.6.17 merge plans In-Reply-To: References: <20060402065810.GC1399@mellanox.co.il> Message-ID: <20060403220222.GB14847@mellanox.co.il> Quoting r. 
Roland Dreier : > Subject: Re: updated InfiniBand 2.6.17 merge plans > > Michael> These are just the features, right? There's as the usual > Michael> bugfixes which aren't listed here. > > Yes, of course. We can always merge bugfixes. Although I don't know > of any pending bugfixes that are not at least in my for-2.6.17 branch... One thing I wanted to do was kill the global mutexes in ipoib, replacing all uses with priv->lock: we never do anything blocking there. Is this 2.6.17 material? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Mon Apr 3 15:04:23 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 03 Apr 2006 15:04:23 -0700 Subject: [openib-general] Re: updated InfiniBand 2.6.17 merge plans In-Reply-To: <20060403220222.GB14847@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 4 Apr 2006 01:02:22 +0300") References: <20060402065810.GC1399@mellanox.co.il> <20060403220222.GB14847@mellanox.co.il> Message-ID: Michael> One thing I wanted to do was kill the global mutexes in Michael> ipoib, replacing all uses with priv->lock: we never do Michael> anything blocking there. Is this 2.6.17 material? I don't have a strong feeling either way. It doesn't fix anything, but on the other hand it's a small low-risk cleanup. - R. From mtravis9999 at yahoo.com Mon Apr 3 15:05:55 2006 From: mtravis9999 at yahoo.com (Uncle Fester) Date: Mon, 3 Apr 2006 15:05:55 -0700 (PDT) Subject: [openib-general] Newbie Questions Regarding How To Start In-Reply-To: <20060403200821.737.qmail@web36505.mail.mud.yahoo.com> Message-ID: <20060403220555.59058.qmail@web36502.mail.mud.yahoo.com> Thank you all for your responses and advice. I think I'll be shopping a bit for the best priced Mellanox-based PCI-X cards, and I'll make sure to use the openIB subnet manager when the time comes. 
From rdreier at cisco.com Mon Apr 3 15:06:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 03 Apr 2006 15:06:48 -0700 Subject: [openib-general] Re: [Bug 28] ipoib_mcast_sendonly_join_complete oops In-Reply-To: <20060403220026.GA14847@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 4 Apr 2006 01:00:26 +0300") References: <20060403191216.9875B2283F3@openib.ca.sandia.gov> <20060403220026.GA14847@mellanox.co.il> Message-ID: Michael> This is with svn trunk, so its almost identical to Michael> 2.6.17-rc1. svn trunk as of when? Michael> One thing that I consider gross in our ipoib code is the Michael> way ipoib_mcast_start_thread is autostarted from Michael> restart_task Michael> if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) Michael> ipoib_mcast_start_thread(dev); But the multicast task can't start send-only joins, can it? So I don't think this is related. - R. From bugzilla-daemon at openib.org Mon Apr 3 15:23:18 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 3 Apr 2006 15:23:18 -0700 (PDT) Subject: [openib-general] [Bug 28] ipoib_mcast_sendonly_join_complete oops Message-ID: <20060403222318.8FB8A2283D7@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=28 ------- Additional Comments From roland at topspin.com 2006-04-03 15:23 ------- Unfortunately the output looks cut off in places too: - Is there anything before the first "Oops" line? For example, the second oops shows "Unable to handle kernel NULL pointer..." - The code line Code: a5 6e c7 8b 45 b0 83 c4 28 f0 0f ba b0 88 00 00 00 03 8b 45 a4 e9 98 fc ff ff 8d b6 00 00 00 00 55 85 c0 89 e5 57 56 53 53 89 ce <8 got cut off right at the instruction causing the oops, so we can't disassemble and try to understand what's going on. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From mst at mellanox.co.il Mon Apr 3 15:15:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Apr 2006 01:15:36 +0300 Subject: [openib-general] Re: [Bug 28] ipoib_mcast_sendonly_join_complete oops In-Reply-To: References: <20060403191216.9875B2283F3@openib.ca.sandia.gov> <20060403220026.GA14847@mellanox.co.il> Message-ID: <20060403221536.GC14847@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [Bug 28] ipoib_mcast_sendonly_join_complete oops > > Michael> This is with svn trunk, so its almost identical to > Michael> 2.6.17-rc1. > > svn trunk as of when? Good question, bugzilla has a singularly uninformative "gen2" in the version field. I think it's the date of filing or the day before that - 04/01 or 04/02. > Michael> One thing that I consider gross in our ipoib code is the > Michael> way ipoib_mcast_start_thread is autostarted from > Michael> restart_task > > Michael> if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) > Michael> ipoib_mcast_start_thread(dev); > > But the multicast task can't start send-only joins, can it? So I > don't think this is related. Hmm, good point. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From lindahl at pathscale.com Mon Apr 3 15:14:56 2006 From: lindahl at pathscale.com (Greg Lindahl) Date: Mon, 3 Apr 2006 15:14:56 -0700 Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel Message-ID: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> Red Hat has started turning off CONFIG_PCI_MSI in their kernels (FC5 and the latest FC4 update). I remember a while back there was a discussion about how MSI made the Mellanox HCA run faster, can someone please add some concrete details about this to the bug? Thanks. 
http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=186520 -- greg From bugzilla-daemon at openib.org Mon Apr 3 15:29:34 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 3 Apr 2006 15:29:34 -0700 (PDT) Subject: [openib-general] [Bug 28] ipoib_mcast_sendonly_join_complete oops Message-ID: <20060403222934.DD88C2283D7@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=28 ------- Additional Comments From roland at topspin.com 2006-04-03 15:29 ------- Disassembly ('objdump -d') of ipoib_mcast_sendonly_join_complete for the driver running here probably wouldn't hurt either. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From xma at us.ibm.com Mon Apr 3 20:19:41 2006 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 3 Apr 2006 21:19:41 -0600 Subject: [openib-general] [PATCH] repost: IPoIB queue size tune patch Message-ID: Hello Roland, I am reposting this patch for review with some updates. txqueuelen is set to 128 (default sendq_size * 2) by default. Please let me know your plans for this patch. Signed-off-by: Shirley Ma diff -urN infiniband/ulp/ipoib/ipoib.h infiniband-queue/ulp/ipoib/ipoib.h --- infiniband/ulp/ipoib/ipoib.h 2006-03-26 11:57:15.000000000 -0800 +++ infiniband-queue/ulp/ipoib/ipoib.h 2006-03-31 08:46:34.000000000 -0800 @@ -66,8 +66,8 @@ IPOIB_ENCAP_LEN = 4, - IPOIB_RX_RING_SIZE = 128, - IPOIB_TX_RING_SIZE = 64, + IPOIB_SENDQ_SIZE = 64, + IPOIB_RECVQ_SIZE = 128, IPOIB_NUM_WC = 4, @@ -186,6 +186,8 @@ struct dentry *mcg_dentry; struct dentry *path_dentry; #endif + int sendq_size; + int recvq_size; }; struct ipoib_ah { @@ -338,6 +340,8 @@ #define ipoib_warn(priv, format, arg...) 
\ ipoib_printk(KERN_WARNING, priv, format , ## arg) +extern int ipoib_sendq_size; +extern int ipoib_recvq_size; #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG extern int ipoib_debug_level; diff -urN infiniband/ulp/ipoib/ipoib_ib.c infiniband-queue/ulp/ipoib/ipoib_ib.c --- infiniband/ulp/ipoib/ipoib_ib.c 2006-03-26 11:57:15.000000000 -0800 +++ infiniband-queue/ulp/ipoib/ipoib_ib.c 2006-03-31 08:46:34.000000000 -0800 @@ -161,7 +161,7 @@ struct ipoib_dev_priv *priv = netdev_priv(dev); int i; - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + for (i = 0; i < priv->recvq_size; ++i) { if (ipoib_alloc_rx_skb(dev, i)) { ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); return -ENOMEM; @@ -187,7 +187,7 @@ if (wr_id & IPOIB_OP_RECV) { wr_id &= ~IPOIB_OP_RECV; - if (wr_id < IPOIB_RX_RING_SIZE) { + if (wr_id < priv->recvq_size) { struct sk_buff *skb = priv->rx_ring[wr_id].skb; dma_addr_t addr = priv->rx_ring[wr_id].mapping; @@ -252,9 +252,9 @@ struct ipoib_tx_buf *tx_req; unsigned long flags; - if (wr_id >= IPOIB_TX_RING_SIZE) { + if (wr_id >= priv->sendq_size) { ipoib_warn(priv, "completion event with wrid %d (> %d)\n", - wr_id, IPOIB_TX_RING_SIZE); + wr_id, priv->sendq_size); return; } @@ -275,7 +275,7 @@ spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; if (netif_queue_stopped(dev) && - priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + priv->tx_head - priv->tx_tail <= priv->sendq_size / 2) netif_wake_queue(dev); spin_unlock_irqrestore(&priv->tx_lock, flags); @@ -344,13 +344,13 @@ * means we have to make sure everything is properly recorded and * our state is consistent before we call post_send(). 
 	 */
-	tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)];
+	tx_req = &priv->tx_ring[priv->tx_head & (priv->sendq_size - 1)];
 	tx_req->skb = skb;
 	addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len,
 			      DMA_TO_DEVICE);
 	pci_unmap_addr_set(tx_req, mapping, addr);
 
-	if (unlikely(post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1),
+	if (unlikely(post_send(priv, priv->tx_head & (priv->sendq_size - 1),
 			       address->ah, qpn, addr, skb->len))) {
 		ipoib_warn(priv, "post_send failed\n");
 		++priv->stats.tx_errors;
@@ -363,7 +363,7 @@
 		address->last_send = priv->tx_head;
 		++priv->tx_head;
 
-		if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) {
+		if (priv->tx_head - priv->tx_tail == priv->sendq_size) {
 			ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
 			netif_stop_queue(dev);
 		}
@@ -488,7 +488,7 @@
 	int pending = 0;
 	int i;
 
-	for (i = 0; i < IPOIB_RX_RING_SIZE; ++i)
+	for (i = 0; i < priv->recvq_size; ++i)
 		if (priv->rx_ring[i].skb)
 			++pending;
 
@@ -527,7 +527,7 @@
 			 */
 			while ((int) priv->tx_tail - (int) priv->tx_head < 0) {
 				tx_req = &priv->tx_ring[priv->tx_tail &
-							(IPOIB_TX_RING_SIZE - 1)];
+							(priv->sendq_size - 1)];
 				dma_unmap_single(priv->ca->dma_device,
 						 pci_unmap_addr(tx_req, mapping),
 						 tx_req->skb->len,
@@ -536,7 +536,7 @@
 				++priv->tx_tail;
 			}
 
-			for (i = 0; i < IPOIB_RX_RING_SIZE; ++i)
+			for (i = 0; i < priv->recvq_size; ++i)
 				if (priv->rx_ring[i].skb) {
 					dma_unmap_single(priv->ca->dma_device,
 							 pci_unmap_addr(&priv->rx_ring[i],
diff -urN infiniband/ulp/ipoib/ipoib_main.c infiniband-queue/ulp/ipoib/ipoib_main.c
--- infiniband/ulp/ipoib/ipoib_main.c	2006-03-28 19:20:21.000000000 -0800
+++ infiniband-queue/ulp/ipoib/ipoib_main.c	2006-04-03 21:01:39.105803776 -0700
@@ -41,6 +41,7 @@
 #include
 #include
 #include
+#include
 
 #include /* For ARPHRD_xxx */
 
@@ -53,6 +54,17 @@
 MODULE_DESCRIPTION("IP-over-InfiniBand net driver");
 MODULE_LICENSE("Dual BSD/GPL");
 
+#define IPOIB_MAX_QUEUE_SIZE 4096 /* max is 4k */
+#define IPOIB_MIN_QUEUE_SIZE 64 /* min is 64 */
+
+int ipoib_sendq_size = IPOIB_SENDQ_SIZE;
+int ipoib_recvq_size = IPOIB_RECVQ_SIZE;
+
+module_param_named(sendq_size, ipoib_sendq_size, int, 0444);
+MODULE_PARM_DESC(sendq_size, "Number of wqe in send queue");
+module_param_named(recvq_size, ipoib_recvq_size, int, 0444);
+MODULE_PARM_DESC(recvq_size, "Number of wqe in receive queue");
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 int ipoib_debug_level;
 
@@ -843,21 +855,45 @@
 
 	/* Allocate RX/TX "rings" to hold queued skbs */
 
-	priv->rx_ring = kzalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf),
+	if (ipoib_recvq_size > IPOIB_MAX_QUEUE_SIZE) {
+		ipoib_recvq_size = IPOIB_MAX_QUEUE_SIZE;
+		printk(KERN_WARNING "%s: ipoib_recvq_size is too big, use max %d instead\n", ca->name, IPOIB_MAX_QUEUE_SIZE);
+	}
+	if (ipoib_recvq_size < IPOIB_MIN_QUEUE_SIZE) {
+		ipoib_recvq_size = IPOIB_MIN_QUEUE_SIZE;
+		printk(KERN_WARNING "%s: ipoib_recvq_size is too small, use min %d instead\n", ca->name, IPOIB_MIN_QUEUE_SIZE);
+	}
+	priv->recvq_size = roundup_pow_of_two(ipoib_recvq_size);
+	priv->rx_ring = kzalloc(priv->recvq_size * sizeof (struct ipoib_rx_buf),
 				GFP_KERNEL);
 	if (!priv->rx_ring) {
 		printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
-		       ca->name, IPOIB_RX_RING_SIZE);
+		       ca->name, priv->sendq_size);
 		goto out;
 	}
+	printk(KERN_INFO "%s: RX_RING_SIZE is set to %d entries\n",
+	       ca->name, priv->recvq_size);
+
+	if (ipoib_sendq_size > IPOIB_MAX_QUEUE_SIZE) {
+		ipoib_sendq_size = IPOIB_MAX_QUEUE_SIZE;
+		printk(KERN_WARNING "%s: ipoib_sendq_size is too big, use max %d instead\n", ca->name, IPOIB_MAX_QUEUE_SIZE);
+	}
+	if (ipoib_sendq_size < IPOIB_MIN_QUEUE_SIZE) {
+		ipoib_sendq_size = IPOIB_MIN_QUEUE_SIZE;
+		printk(KERN_WARNING "%s: ipoib_recvq_size is too small, use min %d instead\n", ca->name, IPOIB_MIN_QUEUE_SIZE);
+	}
+
+	priv->sendq_size = roundup_pow_of_two(ipoib_sendq_size);
 
-	priv->tx_ring = kzalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf),
+	priv->tx_ring = kzalloc(priv->sendq_size * sizeof (struct ipoib_tx_buf),
 				GFP_KERNEL);
 	if (!priv->tx_ring) {
 		printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n",
-		       ca->name, IPOIB_TX_RING_SIZE);
+		       ca->name, priv->sendq_size);
 		goto out_rx_ring_cleanup;
 	}
+	printk(KERN_INFO "%s: TX_RING_SIZE is set to %d entries\n",
+	       ca->name, priv->sendq_size);
 
 	/* priv->tx_head & tx_tail are already 0 */
 
@@ -923,7 +959,7 @@
 	dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN;
 	dev->addr_len = INFINIBAND_ALEN;
 	dev->type = ARPHRD_INFINIBAND;
-	dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2;
+	dev->tx_queue_len = IPOIB_SENDQ_SIZE * 2;
 	dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX;
 
 	/* MTU will be reset when mcast join happens */
diff -urN infiniband/ulp/ipoib/ipoib_verbs.c infiniband-queue/ulp/ipoib/ipoib_verbs.c
--- infiniband/ulp/ipoib/ipoib_verbs.c	2006-03-26 11:57:15.000000000 -0800
+++ infiniband-queue/ulp/ipoib/ipoib_verbs.c	2006-03-31 08:46:34.000000000 -0800
@@ -159,8 +159,8 @@
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr init_attr = {
 		.cap = {
-			.max_send_wr = IPOIB_TX_RING_SIZE,
-			.max_recv_wr = IPOIB_RX_RING_SIZE,
+			.max_send_wr = priv->sendq_size,
+			.max_recv_wr = priv->recvq_size,
 			.max_send_sge = 1,
 			.max_recv_sge = 1
 		},
@@ -175,7 +175,7 @@
 	}
 
 	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev,
-				IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1);
+				priv->sendq_size + priv->recvq_size + 1);
 	if (IS_ERR(priv->cq)) {
 		printk(KERN_WARNING "%s: failed to create CQ\n", ca->name);
 		goto out_free_pd;

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
-------------- next part --------------
A non-text attachment was scrubbed...
Name: infiniband-tune-queue.patch
Type: application/octet-stream
Size: 8306 bytes
Desc: not available

From iod00d at hp.com Mon Apr 3 21:36:00 2006
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 3 Apr 2006 21:36:00 -0700
Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel
In-Reply-To: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch>
References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch>
Message-ID: <20060404043600.GF24455@esmail.cup.hp.com>

On Mon, Apr 03, 2006 at 03:14:56PM -0700, Greg Lindahl wrote:
> Red Hat has started turning off CONFIG_PCI_MSI in their kernels (FC5
> and the latest FC4 update). I remember a while back there was a
> discussion about how MSI made the Mellanox HCA run faster, can someone
> please add some concrete details about this to the bug? Thanks.

Greg,
The only evidence I have is one AMD chipset is buggy WRT MSI.
And this was already noted in the bug report.

I also don't see why MSI support should be disabled.
MSI should work fine on most of the platforms out there.
For the cases where MSI does not work correctly and there
is no workaround in place yet, Matthew Wilcox proposed a
"platform" MSI disable patch. User can then disable
MSI globally at boot time.

Besides InfiniBand, newer Broadcom gige chips, LSI MPT cards, and
most 10gige cards support MSI/MSI-X. I'm sure there are a lot more
devices than that though. So yes, optimal system performance
will want MSI working.

hth,
grant

>
> http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=186520
>
> -- greg

From rdreier at cisco.com Mon Apr 3 22:50:37 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 03 Apr 2006 22:50:37 -0700
Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune patch
In-Reply-To: (Shirley Ma's message of "Mon, 3 Apr 2006 21:19:41 -0600")
References:
Message-ID:

Sorry I hadn't gotten a chance to read this over until now...

> - IPOIB_RX_RING_SIZE = 128,
> - IPOIB_TX_RING_SIZE = 64,
> + IPOIB_SENDQ_SIZE = 64,
> + IPOIB_RECVQ_SIZE = 128,

Can you explain again why it's a good idea to rename these? Is the
name "sendq_size" really clearer than "tx_ring_size," especially in
the context of a network driver?

> + int sendq_size;
> + int recvq_size;

Why does every device need a private copy of the ring sizes? It seems
better to just use ipoib_sendq_size and ipoib_recvq_size directly --
round them up in the module init function, and I guess mark them
__read_mostly.

I wonder if it's also worth having masks to avoid subtracting 1 every
time the driver does something like

> + tx_req = &priv->tx_ring[priv->tx_head & (priv->sendq_size - 1)];

> +#define IPOIB_MAX_QUEUE_SIZE 4096 /* max is 4k */
> +#define IPOIB_MIN_QUEUE_SIZE 64 /* min is 64 */

Where do these limits come from? Why shouldn't someone be able to use
a bigger or smaller ring?

> printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
> - ca->name, IPOIB_RX_RING_SIZE);
> + ca->name, priv->sendq_size);

Looks like this should be recvq_size, not sendq_size.

> + printk(KERN_INFO "%s: RX_RING_SIZE is set to %d entries\n",
> + ca->name, priv->recvq_size);

Seems kind of chatty -- I think anyone who cared could just look at
the module parameter in sysfs.

 - R.

From dotanb at mellanox.co.il Mon Apr 3 23:38:09 2006
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Tue, 4 Apr 2006 09:38:09 +0300
Subject: [openib-general] [PATCH] limrdmacm/cmatose: added a check to the return value of the post send requests
In-Reply-To: <44315A16.3010301@ichips.intel.com>
References: <200604031654.09048.dotanb@mellanox.co.il> <44315A16.3010301@ichips.intel.com>
Message-ID: <200604040938.09840.dotanb@mellanox.co.il>

> > Added a check to the return value of the post send requests.
>
> Committed the change to print an error message if post send fails.

Thanks.

My target is to add this example to the regression, so the output in case of an
error is a nice to have but not a must.
I would like to fix the test and make it exit with a value different than zero
in case of an error.

About the patch that I sent you, the function returns the result of the last
post_send:
so if (for example) for some unknown reason only the first post_send fails, the
function will still return 0 ...

I think that new users that would like to use the librdmacm will take this test
as an example, so this test should be a good example.

What do you think?

Dotan

From jgunthorpe at obsidianresearch.com Tue Apr 4 00:02:29 2006
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Tue, 4 Apr 2006 01:02:29 -0600
Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel
In-Reply-To: <20060404043600.GF24455@esmail.cup.hp.com>
References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com>
Message-ID: <20060404070229.GF10080@obsidianresearch.com>

On Mon, Apr 03, 2006 at 09:36:00PM -0700, Grant Grundler wrote:
> On Mon, Apr 03, 2006 at 03:14:56PM -0700, Greg Lindahl wrote:
> > Red Hat has started turning off CONFIG_PCI_MSI in their kernels (FC5
> > and the latest FC4 update). I remember a while back there was a
> > discussion about how MSI made the Mellanox HCA run faster, can someone
> > please add some concrete details about this to the bug? Thanks.

> I also don't see why MSI support should be disabled.
> MSI should work fine on most of the platforms out there.
> For the cases where MSI does not work correctly and there
> is no workaround in place yet, Matthew Wilcox proposed a
> "platform" MSI disable patch. User can then disable
> MSI globally at boot time.

One thing that struck me as strange with the MSI code was that it just
unconditionally turns on without any checks that the system actually
supports it, so that it breaks in some cases doesn't seem to be too
surprising. MSI requires end device support and something in a bridge
to transform the MSI message into an APIC message, but the kernel
currently only looks for end device support.

Presumably only some subset of Intel PCI/PCI-X bridges support MSI
translation to an APIC message on the FSB, and we know on HT systems
that the MSI PCI capability block is required (and must be *enabled*)
in the PCI/PCI-X/PCIe bridge for there to be support.

IMHO what is needed is that if a device X wants MSI on then the path
from X to the root bridge must hit a MSI capable bridge[1]. White list
the known good intel chips, detect HT capability blocks, turn on MSI
for *every* device (ie 100% of PCIe devices), not just high
performance ones that specifically ask for it - then black list any
broken hardware ;>

In the long term as PCIe becomes dominant this is best since linux
becomes firmly in control of all interrupt routing and doesn't rely on
working ACPI for IOAPIC tables..

It is hard to blame RedHat for turning it off by default when I know
some BIOS vendors are not enabling MSI translation in the HT
chipsets. I have some MBs that are broken like that and MSI only works
with a kernel patch that forces the MSI translation to be turned on.
Ideally the kernel would detect that and just refuse to do MSI.

[1: In this same way the current hack to disable MSI if an 8131 is
detected is bogus, there are systems with 8131's and other nvidia
bridges that support MSI on PCIe slots (via a nvidia bridge) but
not on PCI-X (since the 8131 doesn't support it)]

Regards,
Jason

From sean.hefty at intel.com Tue Apr 4 00:31:46 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 4 Apr 2006 00:31:46 -0700
Subject: [openib-general] [PATCH] limrdmacm/cmatose: added a check to the return value of the post send requests
In-Reply-To: <200604040938.09840.dotanb@mellanox.co.il>
Message-ID:

>My target is to add this example to the regression, so the output in case of an
>error is a nice to have but not a must.
>I would like to fix the test and make it exit with a value different than zero
>in case of an error.

I added the printf if post_send fails.

>About the patch that I sent you, the function returns the result of the last
>post_send:
>so if (for example) for some unknown reason only the first post_send fails, the
>function will still return 0 ...

I think you're missing the "&& !ret" in the for-loop check. If post_send fails,
the function returns the failure.

- Sean

From dotanb at mellanox.co.il Tue Apr 4 00:36:35 2006
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Tue, 4 Apr 2006 10:36:35 +0300
Subject: [openib-general] [PATCH] limrdmacm/cmatose: added a check to the return value of the post send requests
In-Reply-To:
References:
Message-ID: <200604041036.36041.dotanb@mellanox.co.il>

On Tuesday 04 April 2006 10:31, Sean Hefty wrote:
> >My target is to add this example to the regression, so the output in case of an
> >error is a nice to have but not a must.
> >I would like to fix the test and make it exit with a value different than zero
> >in case of an error.
>
> I think you're missing the "&& !ret" in the for-loop check. If post_send fails,
> the function returns the failure.

You are right, I missed it ... (oops ..)

In the function that handles the post_recv there is a break when the rc is not
zero ....

Dotan

From bugzilla-daemon at openib.org Tue Apr 4 01:12:58 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Tue, 4 Apr 2006 01:12:58 -0700 (PDT)
Subject: [openib-general] [Bug 28] ipoib_mcast_sendonly_join_complete oops
Message-ID: <20060404081258.7C01C2283EB@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=28

------- Additional Comments From eli at mellanox.co.il 2006-04-04 01:12 -------
1. Nothing before the oops
2. svn rev 6107

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at openib.org Tue Apr 4 01:18:10 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Tue, 4 Apr 2006 01:18:10 -0700 (PDT)
Subject: [openib-general] [Bug 31] ifconfig up/down while ssh connection alive cause oops
Message-ID: <20060404081810.19712228444@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=31

------- Additional Comments From eli at mellanox.co.il 2006-04-04 01:18 -------
Created an attachment (id=6)
 --> (http://openib.org/bugzilla/attachment.cgi?id=6&action=view)
objdump -d for ib_ipoib

objdump -d for ib_ipoib

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
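
A minimal sketch of the loop guard discussed in the cmatose thread above, assuming each post call returns 0 on success and nonzero on failure. The names `post_fn`, `post_all`, and `fail_at` are hypothetical and exist only for this illustration; they are not librdmacm or cmatose functions. The point is the one Sean Hefty makes: with `&& !ret` in the loop condition, the first failure ends the loop and is what the function returns, so a later successful post cannot overwrite it.

```c
#include <assert.h>

/* Hypothetical stand-in for a post_send style call: 0 means success. */
typedef int (*post_fn)(int index, void *ctx);

/*
 * Post up to `count` requests. The "&& !ret" guard stops the loop at the
 * first failure, so `ret` still holds that failure when we return.
 * If `posted` is non-NULL it receives the number of successful posts.
 */
static int post_all(post_fn post, void *ctx, int count, int *posted)
{
	int i, ret = 0;

	for (i = 0; i < count && !ret; i++)
		ret = post(i, ctx);

	/* On failure the loop still ran its final i++, hence i - 1. */
	if (posted)
		*posted = ret ? i - 1 : i;
	return ret;
}

/* Hypothetical test helper: fails only at the index stored in *ctx. */
static int fail_at(int index, void *ctx)
{
	int bad = *(int *)ctx;

	return index == bad ? -1 : 0;
}
```

Calling `post_all(fail_at, &bad, 4, &posted)` with `bad` set to 0 reports the failure of the very first post, which is the case Dotan Barak was worried about.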

From bugzilla-daemon at openib.org Tue Apr 4 01:20:01 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Tue, 4 Apr 2006 01:20:01 -0700 (PDT)
Subject: [openib-general] [Bug 28] ipoib_mcast_sendonly_join_complete oops
Message-ID: <20060404082001.90FF0228467@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=28

------- Additional Comments From eli at mellanox.co.il 2006-04-04 01:20 -------
Created an attachment (id=7)
 --> (http://openib.org/bugzilla/attachment.cgi?id=7&action=view)
objdump -d for ib_ipoib

objdump -d for ib_ipoib

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From k_mahesh85 at yahoo.co.in Tue Apr 4 01:34:13 2006
From: k_mahesh85 at yahoo.co.in (keshetti mahesh)
Date: Tue, 4 Apr 2006 09:34:13 +0100 (BST)
Subject: [openib-general] where can i get the offload protocol switch(OPS) patch for SDP?
Message-ID: <20060404083413.98396.qmail@web8316.mail.in.yahoo.com>

Can anybody tell me which version of Red Hat is coming with complete InfiniBand support?

I want to run my socket/ftp applications through SDP over InfiniBand.
I have installed the SDP module() in my nodes and it is working fine if I mention AF_INET_OFFLOAD in my application,
but I want a transparent mechanism.
I heard that a patch for socket.c can do the thing.

Can anybody tell me where I can get that patch?

thanks & regards
K.Mahesh

From lindahl at pathscale.com Tue Apr 4 01:36:54 2006
From: lindahl at pathscale.com (Greg Lindahl)
Date: Tue, 4 Apr 2006 01:36:54 -0700
Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel
In-Reply-To: <20060404043600.GF24455@esmail.cup.hp.com>
References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com>
Message-ID: <20060404083654.GA1689@greglaptop.mpp.unique.ch>

On Mon, Apr 03, 2006 at 09:36:00PM -0700, Grant Grundler wrote:

> The only evidence I have is one AMD chipset is buggy WRT MSI.

Grant,

I know about that case, the kernel disables stuff if it sees an AMD
8131 due to a bug. What I am referring to was IPoIB performance on
Mellanox HCAs being improved with MSI. I figure if it's gotten this
far with Red Hat turning it off, concrete examples are in order.

-- greg

From mst at mellanox.co.il Tue Apr 4 01:54:42 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 4 Apr 2006 11:54:42 +0300
Subject: [openib-general] Re: Help with CONFIG_PCI_MSI in the kernel
In-Reply-To: <20060404070229.GF10080@obsidianresearch.com>
References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com> <20060404070229.GF10080@obsidianresearch.com>
Message-ID: <20060404085442.GF14808@mellanox.co.il>

Quoting r. Jason Gunthorpe:
> [1: In this same way the current hack to disable MSI if an 8131 is
> detected is bogus, there are systems with 8131's and other nvidia
> bridges that support MSI on PCIe slots (via a nvidia bridge) but
> not on PCI-X (since the 8131 doesn't support it)]

I have fixed this in 2.6.17-rc1: MSI is now only disabled for devices behind
8131. Do you have such hardware? Please test.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From mst at mellanox.co.il Tue Apr 4 02:07:03 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 4 Apr 2006 12:07:03 +0300
Subject: [openib-general] Re: where can i get the offload protocol switch(OPS) patch for SDP?
In-Reply-To: <20060404083413.98396.qmail@web8316.mail.in.yahoo.com>
References: <20060404083413.98396.qmail@web8316.mail.in.yahoo.com>
Message-ID: <20060404090703.GG14808@mellanox.co.il>

Quoting r. keshetti mahesh:
> Subject: where can i get the offload protocol switch(OPS) patch for SDP?
>
> Can anybody tell me which version of Red Hat is coming with complete InfiniBand support?
>
> I want to run my socket/ftp applications through SDP over InfiniBand.
> I have installed the SDP module() in my nodes and it is working fine if I mention AF_INET_OFFLOAD in my application,
> but I want a transparent mechanism.
> I heard that a patch for socket.c can do the thing.
>
> Can anybody tell me where I can get that patch?
>
> thanks & regards
> K.Mahesh

I don't have such a patch.
You can build libsdp and preload libsdp_sys.so - this hack will convert
all sockets to AF_INET_SDP.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From k_mahesh85 at yahoo.co.in Tue Apr 4 03:10:20 2006
From: k_mahesh85 at yahoo.co.in (keshetti mahesh)
Date: Tue, 4 Apr 2006 11:10:20 +0100 (BST)
Subject: [openib-general] how can i guide FTP like applications to SDP module (infiniband)?
Message-ID: <20060404101020.79974.qmail@web8327.mail.in.yahoo.com>

I have installed the openIB stack in my InfiniBand cluster.
Whenever I try to run FTP-like applications they take the path through IPoIB.
Can anybody tell me how I can guide those applications through the SDP module of the openIB stack?

Also, can anybody tell me which Red Hat version is coming with complete InfiniBand support?

thanks n regards
K.Mahesh

From mst at mellanox.co.il Tue Apr 4 03:29:37 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 4 Apr 2006 13:29:37 +0300
Subject: [openib-general] Re: Help with CONFIG_PCI_MSI in the kernel
In-Reply-To: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch>
References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch>
Message-ID: <20060404102937.GI14808@mellanox.co.il>

Quoting r. Greg Lindahl:
> Subject: Help with CONFIG_PCI_MSI in the kernel
>
> Red Hat has started turning off CONFIG_PCI_MSI in their kernels (FC5
> and the latest FC4 update). I remember a while back there was a
> discussion about how MSI made the Mellanox HCA run faster, can someone
> please add some concrete details about this to the bug? Thanks.
>
> http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=186520

Unfortunately it's not a clear-cut win, or not on all systems.

At least for me, IRQ rebalancing seems to work much worse with MSI than with
regular interrupt messages: the interrupts seem to bounce between CPUs all the
time. I am only getting good performance from MSI by disabling IRQ rebalancing
and binding the task to a specific CPU.

It works fine with regular interrupts and I never had the time to get to the
bottom of it.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From bugzilla-daemon at openib.org Tue Apr 4 08:07:56 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Tue, 4 Apr 2006 08:07:56 -0700 (PDT)
Subject: [openib-general] [Bug 31] ifconfig up/down while ssh connection alive cause oops
Message-ID: <20060404150756.B65072283EB@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=31

------- Additional Comments From amitk at mellanox.co.il 2006-04-04 08:07 -------
Comment 3 and first attachment is related to bug 28, not to this one.
I didn't succeed to reproduce the oops on 2.6.16, objdump file is added

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at openib.org Tue Apr 4 08:07:05 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Tue, 4 Apr 2006 08:07:05 -0700 (PDT)
Subject: [openib-general] [Bug 31] ifconfig up/down while ssh connection alive cause oops
Message-ID: <20060404150705.1084A2283D7@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=31

------- Additional Comments From amitk at mellanox.co.il 2006-04-04 08:07 -------
Created an attachment (id=8)
 --> (http://openib.org/bugzilla/attachment.cgi?id=8&action=view)
objdump -dS ib_ipoib.ko

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at openib.org Tue Apr 4 08:26:26 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Tue, 4 Apr 2006 08:26:26 -0700 (PDT)
Subject: [openib-general] [Bug 28] ipoib_mcast_sendonly_join_complete oops
Message-ID: <20060404152626.C42B72283D7@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=28

------- Additional Comments From roland at topspin.com 2006-04-04 08:26 -------
The oops is happening at the last line here:

00003d50 :
    3d50:  55                    push   %ebp
    3d51:  85 c0                 test   %eax,%eax
    3d53:  89 e5                 mov    %esp,%ebp
    3d55:  57                    push   %edi
    3d56:  56                    push   %esi
    3d57:  53                    push   %ebx
    3d58:  53                    push   %ebx
    3d59:  89 ce                 mov    %ecx,%esi
    3d5b:  8b 89 b4 00 00 00     mov    0xb4(%ecx),%ecx

ecx holds the third parameter of ipoib_mcast_sendonly_join_complete,
namely mcast_ptr, so it seems the oops is happening when doing

    struct net_device *dev = mcast->dev;

Now the question is why this callback is getting passed an invalid mcast_ptr.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
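
A minimal sketch of the reasoning in the Bug 28 comment above, where the displacement in `mov 0xb4(%ecx),%ecx` is matched to the load of `mcast->dev`. The structure below is hypothetical: `struct demo_mcast` and its members are invented so that one member lands at offset 0xb4, and it is not the real `struct ipoib_mcast` layout. It assumes a 4-byte `int`. The idea is only that a base register plus a constant displacement corresponds to a member whose `offsetof` equals that constant.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical layout, NOT the real struct ipoib_mcast: the padding is
 * sized so that dev_id sits at byte offset 0xb4 (assuming 4-byte int).
 */
struct demo_mcast {
	char other_fields[0xb4];
	int  dev_id;
	int  flags;
};

/* Map an instruction displacement back to the member it addresses. */
static const char *member_at(size_t disp)
{
	if (disp == offsetof(struct demo_mcast, dev_id))
		return "dev_id";
	if (disp == offsetof(struct demo_mcast, flags))
		return "flags";
	if (disp < offsetof(struct demo_mcast, dev_id))
		return "other_fields";
	return "unknown";
}
```

With this layout `member_at(0xb4)` names `dev_id`, which mirrors how the faulting load in the disassembly is attributed to a single member access through the pointer held in `%ecx`.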

From bugzilla-daemon at openib.org Tue Apr 4 08:31:59 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Tue, 4 Apr 2006 08:31:59 -0700 (PDT)
Subject: [openib-general] [Bug 28] ipoib_mcast_sendonly_join_complete oops
Message-ID: <20060404153159.5CE092283D7@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=28

------- Additional Comments From eli at mellanox.co.il 2006-04-04 08:31 -------
It looks to me like this is because the restart task frees mcast objects without
making sure there are no pending mcmember joins. I am testing a patch which does
something similar to stop_thread, e.g. check if mcast->query != NULL and wait for
completion.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From sweitzen at cisco.com Tue Apr 4 08:43:52 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 4 Apr 2006 08:43:52 -0700
Subject: [openib-general] how can i guide FTP like applications to SDPmodule (infiniband)?
Message-ID:

You can LD_PRELOAD libsdp.so to transparently change applications to use
SDP instead of TCP. libsdp.so comes with a libsdp.conf config file to
control when to use SDP vs TCP.

RHEL4 U3 has a technology preview of OpenIB.

Scott

________________________________

From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of keshetti mahesh
Sent: Tuesday, April 04, 2006 3:10 AM
To: openIB
Subject: [openib-general] how can i guide FTP like applications to SDPmodule (infiniband)?

I have installed the openIB stack in my InfiniBand cluster.
Whenever I try to run FTP-like applications they take the path through IPoIB.
Can anybody tell me how I can guide those applications through the SDP module of the openIB stack?

Also, can anybody tell me which Red Hat version is coming with complete InfiniBand support?

thanks n regards
K.Mahesh

From mst at mellanox.co.il Tue Apr 4 08:52:33 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 4 Apr 2006 18:52:33 +0300
Subject: [openib-general] [PATCH] ipoib_mcast_restart_task
Message-ID: <20060404155233.GR14808@mellanox.co.il>

Roland, Eli spotted the following race that might explain part of
the crashes in sendonly complete. Please take a look.

---

ipoib_mcast_restart_task might free an mcast object while a join request (sa
query) is still outstanding, leading to an oops when the query completes.
Fix this by waiting for query to complete, similar to what ipoib_stop_thread is
doing.

Signed-off-by: Eli Cohen
Signed-off-by: Michael S. Tsirkin

Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	(revision 5992)
+++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	(working copy)
@@ -910,6 +910,16 @@ void ipoib_mcast_restart_task(void *dev_
 
 	/* We have to cancel outside of the spinlock */
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
+		spin_lock_irq(&priv->lock);
+		if (mcast->query) {
+			ib_sa_cancel_query(mcast->query_id, mcast->query);
+			mcast->query = NULL;
+			spin_unlock_irq(&priv->lock);
+			ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n",
+					IPOIB_GID_ARG(mcast->mcmember.mgid));
+			wait_for_completion(&mcast->done);
+		} else
+			spin_unlock_irq(&priv->lock);
 		ipoib_mcast_leave(mcast->dev, mcast);
 		ipoib_mcast_free(mcast);
 	}

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From xma at us.ibm.com Tue Apr 4 09:04:04 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 4 Apr 2006 09:04:04 -0700
Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune patch
In-Reply-To:
Message-ID:

Hello Roland,

Thanks.

Roland Dreier wrote on 04/03/2006 10:50:37 PM:

> Sorry I hadn't gotten a chance to read this over until now...
>
> > - IPOIB_RX_RING_SIZE = 128,
> > - IPOIB_TX_RING_SIZE = 64,
> > + IPOIB_SENDQ_SIZE = 64,
> > + IPOIB_RECVQ_SIZE = 128,
>
> Can you explain again why it's a good idea to rename these? Is the
> name "sendq_size" really clearer than "tx_ring_size," especially in
> the context of a network driver?

tx_ring and rx_ring are not good names for IPoIB. tx_ring in IPoIB is a
place to save pointers and rx_ring a place to allocate skb buffs. I am
working on a patch to remove tx_ring and replace rx_ring with a list,
which would reduce some tx_ring overhead.

>
> > + int sendq_size;
> > + int recvq_size;
>
> Why does every device need a private copy of the ring sizes? It seems
> better to just use ipoib_sendq_size and ipoib_recvq_size directly --
> round them up in the module init function, and I guess mark them
> __read_mostly.
>

Michael suggested saving them in the priv before. I did some tests and
didn't see a difference. I will move them out of priv.

> I wonder if it's also worth having masks to avoid subtracting 1 every
> time the driver does something like
>
> > + tx_req = &priv->tx_ring[priv->tx_head & (priv->sendq_size - 1)];
>

Sounds good.

> > +#define IPOIB_MAX_QUEUE_SIZE 4096 /* max is 4k */
> > +#define IPOIB_MIN_QUEUE_SIZE 64 /* min is 64 */
>
> Where do these limits come from? Why shouldn't someone be able to use
> a bigger or smaller ring?
>

I did some tests: too small a value causes network problems, and too big
a value is not helpful. 1K is sufficient in my environment. We could use
max 8k, min 32 instead. Which values do you suggest?

> > printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
> > - ca->name, IPOIB_RX_RING_SIZE);
> > + ca->name, priv->sendq_size);
>
> Looks like this should be recvq_size, not sendq_size.

Yes, you are right.

>
> > + printk(KERN_INFO "%s: RX_RING_SIZE is set to %d entries\n",
> > + ca->name, priv->recvq_size);
>
> Seems kind of chatty -- I think anyone who cared could just look at
> the module parameter in sysfs.
>
> - R.

Ok.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From mst at mellanox.co.il Tue Apr 4 09:03:23 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 4 Apr 2006 19:03:23 +0300
Subject: [openib-general] 2.6.17 merge
Message-ID: <20060404160323.GS14808@mellanox.co.il>

Roland, could you please merge the neigh free patch for 2.6.17?
It's small, harmless, has been in svn a while.
It's better for git not to diverge from svn.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From rdreier at cisco.com Tue Apr 4 09:15:01 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Apr 2006 09:15:01 -0700
Subject: [openib-general] Re: 2.6.17 merge
In-Reply-To: <20060404160323.GS14808@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 4 Apr 2006 19:03:23 +0300")
References: <20060404160323.GS14808@mellanox.co.il>
Message-ID:

    Michael> Roland, could you please merge the neigh free patch for
    Michael> 2.6.17? It's small, harmless, has been in svn a while.
    Michael> It's better for git not to diverge from svn.

Which patch is that? The best thing is if you send me a patch against
current git.

 - R.

From mst at mellanox.co.il Tue Apr 4 09:33:32 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 4 Apr 2006 19:33:32 +0300
Subject: [openib-general] CM patch for 2.6.17 merge
Message-ID: <20060404163331.GU14808@mellanox.co.il>

Roland, Sean,
CM API in svn is currently richer than what we have in git.
Two parts that seem to be missing: support for private data comparison in
ib_cm_listen() and the ability to reject requests for userspace that try to use
the SDP or CMA service IDs.

The first is small and harmless if not used - I want it for things that are now
out of kernel, and to keep the API matching between svn and kernel.
The second is a security fix, it's a must.

Sean, do you agree these two changes in CM are ready for 2.6.17?

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From iod00d at hp.com Tue Apr 4 09:45:09 2006
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 4 Apr 2006 09:45:09 -0700
Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel
In-Reply-To: <20060404070229.GF10080@obsidianresearch.com>
References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com> <20060404070229.GF10080@obsidianresearch.com>
Message-ID: <20060404164509.GA29487@esmail.cup.hp.com>

On Tue, Apr 04, 2006 at 01:02:29AM -0600, Jason Gunthorpe wrote:
> One thing that struck me as strange with the MSI code was that it just
> unconditionally turns on without any checks that the system actually
> supports it, so that it breaks in some cases doesn't seem to be too
> surprising.

I agree it's not surprising that some implementations are broken.

> MSI requires end device support and something in a bridge
> to transform the MSI message into an APIC message, but the kernel
> currently only looks for end device support.

APIC Message? MSI is just a DMA-write from the card point of view.
So if PCI is working and DMA is working, MSI should work too.
The difference is routing of the transaction and the fact
that it's not targeting Host memory but some other part of the
chipset.

> Presumably only some subset of Intel PCI/PCI-X bridges support MSI
> translation to an APIC message on the FSB, and we know on HT systems
> that the MSI PCI capability block is required (and must be *enabled*)
> in the PCI/PCI-X/PCIe bridge for there to be support.

Well, can't linux enable that block if it's present?
That isn't a reason to disable MSI for _all_ systems.
> IMHO what is needed is that if a device X wants MSI on then the path > from X to the root bridge must hit a MSI capable bridge[1]. White list > the known good intel chips, detect HT capability blocks, turn on MSI > for *every* device (ie 100% of PCIe devices), not just high > performance ones that specifically ask for it - then black list any > broken hardware ;> I'd prefer to just black list the broken ones or encourage people to use "nomsi". Secondly, we've already had this discussion on the linux-pci mailing list about turning it on for _ALL_ devices that advertise MSI/MSI-X - it's a non-starter. Too many devices have broken implementations (e.g. don't turn off Line IRQs). The driver needs to ask for MSI since it's the only one to know that the device can really handle MSI properly. > In the long term as PCIe becomes dominant this is best since linux > becomes firmly in control of all interrupt routing and doesn't rely on > working APCI for IOAPIC tables.. Sorry...I can't agree. Line based interrupt routing is dependent on firmware to give us the IRQ->APIC routing tables, enough info to identify CPUs (ID/EID info for Intel implementations) and program IO-xAPIC entries. Essentially, MSI only needs the CPU Info so MSI transactions get routed correctly. Then MSI/-X entries on the devices can be programmed (essentially the same way an IO-xAPIC gets programmed). > It is hard to blame RedHat for turning it off by default when I know > some BIOS vendors are not enabling MSI translation in the HT > chipsets. I have some MBs that are broken like that and MSI only works > with a kernel patch that forces the MSI translation to be turned on. > Ideally the kernel would detect that and just refuse to do MSI. If it's "just a BIOS" issue, then can't AMD help linux turn MSI support on? ie linux can bang some values into the chipset like we do for other types of initialization when BIOS doesn't do it right. 
> [1: In this same way the current hack to disable MSI if an 8131 is > detected is bogus, there are systems with 8131's and other nvidia > bridges that support MSI on PCIe slots (via a nvidia bridge) but > not on PCI-X (since the 8131 doesn't support it)] Can you submit a patch with the right way to disable MSI for 8131 chips? thanks, grant > > Regards, > Jason From mshefty at ichips.intel.com Tue Apr 4 09:57:33 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 04 Apr 2006 09:57:33 -0700 Subject: [openib-general] CM patch for 2.6.17 merge In-Reply-To: <20060404163331.GU14808@mellanox.co.il> References: <20060404163331.GU14808@mellanox.co.il> Message-ID: <4432A57D.2010604@ichips.intel.com> Michael S. Tsirkin wrote: > Sean, do you agree these two changes in CM are ready for 2.6.17? I believe that both of these changes are ready, and it would be nice to merge them upstream, even if separate from the RDMA CM. - Sean From mst at mellanox.co.il Tue Apr 4 09:59:40 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Apr 2006 19:59:40 +0300 Subject: [openib-general] Re: 2.6.17 merge In-Reply-To: References: <20060404160323.GS14808@mellanox.co.il> Message-ID: <20060404165940.GV14808@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: 2.6.17 merge > > Michael> Roland, could you please merge the neigh free patch for > Michael> 2.6.17? Its small, harmless, has been in svn a while. > Michael> Its better for git not to diverge from svn. > > Which patch is that? The best thing is if you send me a patch against > current git. Here it is: I just did diff between svn and trunk and removed 2.6.16 things. OK? --- Consolidate IPoIB's private neighbour data handling into ipoib_neigh_alloc() and ipoib_neigh_free(). 
This will make it easier to keep track of the neighbour structures that IPoIB is handling, and is a nice cleanup of the code: add/remove: 2/1 grow/shrink: 1/8 up/down: 100/-178 (-78) function old new delta ipoib_neigh_alloc - 61 +61 ipoib_neigh_free - 36 +36 ipoib_mcast_join_finish 1288 1291 +3 path_rec_completion 575 573 -2 ipoib_mcast_join_task 664 660 -4 ipoib_neigh_destructor 101 92 -9 ipoib_neigh_setup_dev 14 3 -11 ipoib_neigh_setup 17 - -17 path_free 238 215 -23 ipoib_mcast_free 329 306 -23 ipoib_mcast_send 718 684 -34 neigh_add_path 705 650 -55 Signed-off-by: Michael S. Tsirkin Index: latest/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ latest/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -252,8 +252,8 @@ static void path_free(struct net_device */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - kfree(neigh); + + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -481,7 +481,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (!neigh) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -489,8 +489,6 @@ static void neigh_add_path(struct sk_buf } skb_queue_head_init(&neigh->queue); - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; /* * We can only be called from ipoib_start_xmit, so we're @@ -503,7 +501,7 @@ static void neigh_add_path(struct sk_buf path = path_rec_create(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); if (!path) - goto err; + goto err_path; __path_add(dev, path); } @@ -521,17 +519,17 @@ static void neigh_add_path(struct sk_buf __skb_queue_tail(&neigh->queue, skb); if (!path->query && path_rec_start(dev, path)) - goto err; + goto err_list; } spin_unlock(&priv->lock); 
return; -err: - *to_ipoib_neigh(skb->dst->neighbour) = NULL; +err_list: list_del(&neigh->list); - kfree(neigh); +err_path: + ipoib_neigh_free(neigh); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -763,8 +761,7 @@ static void ipoib_neigh_destructor(struc if (neigh->ah) ah = neigh->ah; list_del(&neigh->list); - *to_ipoib_neigh(n) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -773,6 +770,26 @@ static void ipoib_neigh_destructor(struc ipoib_put_ah(ah); } +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) +{ + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) + return NULL; + + neigh->neighbour = neighbour; + *to_ipoib_neigh(neighbour) = neigh; + + return neigh; +} + +void ipoib_neigh_free(struct ipoib_neigh *neigh) +{ + *to_ipoib_neigh(neigh->neighbour) = NULL; + kfree(neigh); +} + static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { parms->neigh_destructor = ipoib_neigh_destructor; Index: latest/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- latest.orig/drivers/infiniband/ulp/ipoib/ipoib.h +++ latest/drivers/infiniband/ulp/ipoib/ipoib.h @@ -230,6 +230,9 @@ static inline struct ipoib_neigh **to_ip INFINIBAND_ALEN, sizeof(void *)); } +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +void ipoib_neigh_free(struct ipoib_neigh *neigh); + extern struct workqueue_struct *ipoib_workqueue; /* functions */ Index: latest/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ latest/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -114,8 +114,7 @@ static void ipoib_mcast_free(struct ipoi */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } 
spin_unlock_irqrestore(&priv->lock, flags); @@ -772,13 +771,11 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (neigh) { kref_get(&mcast->ah->ref); neigh->ah = mcast->ah; - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; list_add_tail(&neigh->list, &mcast->neigh_list); } } -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From iod00d at hp.com Tue Apr 4 10:01:32 2006 From: iod00d at hp.com (Grant Grundler) Date: Tue, 4 Apr 2006 10:01:32 -0700 Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel In-Reply-To: <20060404083654.GA1689@greglaptop.mpp.unique.ch> References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com> <20060404083654.GA1689@greglaptop.mpp.unique.ch> Message-ID: <20060404170132.GB29487@esmail.cup.hp.com> On Tue, Apr 04, 2006 at 01:36:54AM -0700, Greg Lindahl wrote: > On Mon, Apr 03, 2006 at 09:36:00PM -0700, Grant Grundler wrote: > > > The only evidence I have is one AMD chipset is buggy WRT MSI. > > Grant, > > I know about that case, the kernel disables stuff if it sees an AMD > 8131 due to a bug. What I am referring to was IPoIB performance on > Mellanox HCAs being improved with MSI. I figure if it's gotten this > far with Red Hat turning it off, concrete examples are in order. I understood and I'm trying to provide an argument for turning MSI support back ON. I'm very aware of the impact that MSI has on performance: http://iou.parisc-linux.org/ols_2002/porting_zx1.pdf http://iou.parisc-linux.org/ols_2004/pfmon_for_iodorks.pdf Search for "MSI" in either pdf. I might have some of the minor details wrong in either paper. Please send corrections to me privately only if there are gross errors of fact. 
thanks, grant From mst at mellanox.co.il Tue Apr 4 10:11:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Apr 2006 20:11:22 +0300 Subject: [openib-general] CM patch for 2.6.17 merge In-Reply-To: <4432A57D.2010604@ichips.intel.com> References: <20060404163331.GU14808@mellanox.co.il> <4432A57D.2010604@ichips.intel.com> Message-ID: <20060404171122.GW14808@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] CM patch for 2.6.17 merge > > Michael S. Tsirkin wrote: > >Sean, do you agree these two changes in CM are ready for 2.6.17? > > I believe that both of these changes are ready, and it would be nice to > merge them upstream, even if separate from the RDMA CM. OK then. Roland, could you update CM from svn? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Apr 4 10:13:37 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Apr 2006 20:13:37 +0300 Subject: [openib-general] Re: Help with CONFIG_PCI_MSI in the kernel In-Reply-To: <20060404164509.GA29487@esmail.cup.hp.com> References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com> <20060404070229.GF10080@obsidianresearch.com> <20060404164509.GA29487@esmail.cup.hp.com> Message-ID: <20060404171337.GX14808@mellanox.co.il> Quoting r. Grant Grundler : > Can you submit a patch with the right way to disable MSI for 8131 chips? I've already fixed this in 2.6.17-rc1. Please test that. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Apr 4 10:22:08 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Apr 2006 20:22:08 +0300 Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune patch In-Reply-To: References: Message-ID: <20060404172208.GY14808@mellanox.co.il> Quoting r. Shirley Ma : > Subject: Re: [PATCH] repost: IPoIB queue size tune patch > > > Hello Roland, > > Thanks. 
> > Roland Dreier wrote on 04/03/2006 10:50:37 PM: > > > Sorry I hadn't gotten a chance to read this over until now... > > > > > - IPOIB_RX_RING_SIZE = 128, > > > - IPOIB_TX_RING_SIZE = 64, > > > + IPOIB_SENDQ_SIZE = 64, > > > + IPOIB_RECVQ_SIZE = 128, > > > > Can you explain again why it's a good idea to rename these? Is the > > name "sendq_size" really clearer than "tx_ring_size," especially in > > the context of a network driver? > > tx_ring and rx_ring are not good names for IPoIB. tx_ring in IPoIB is a place > to save pointers and rx_ring a place to allocate skb buffs. Actually names make sense to me - they are cyclic buffers, hence ring. > I am working on > a patch to remove tx_ring and replace rx_ring with a list, which would > reduce some tx_ring overhead. What overhead? Please don't, lists are bad for cache, require locks ... Circular buffers are much better, I don't have the time now but I think we can even make event handler completely lockless with them. > > > > > + int sendq_size; > > > + int recvq_size; > > > > Why does every device need a private copy of the ring sizes? It seems > > better to just use ipoib_sendq_size and ipoib_recvq_size directly -- > > round them up in the module init function, and I guess mark them > > __read_mostly. > > > > Michael suggested to save them in the priv before. I did some test and didn't > see the difference. I wil move them out of priv. The point was to make the parameter writeable in sysfs. If you don't do this there's no point. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Apr 4 10:25:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Apr 2006 20:25:25 +0300 Subject: [openib-general] git head for 2.6.17 Message-ID: <20060404172525.GZ14808@mellanox.co.il> Roland, I noticed that sometimes you use for-2.6.17 and sometimes for-linus head to push changes upstream. 
I'd like to have a git tree here auto-synched with whatever will be in 2.6.17, and test that - could you help by selecting one head for upstream merges and sticking to it? E.g. for-linus seems like a sane head name. Right? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Apr 4 10:33:17 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Apr 2006 20:33:17 +0300 Subject: [openib-general] ANNOUNCE: -stable patches Message-ID: <20060404173317.GA14808@mellanox.co.il> Hi! I have started back-porting stability patches from current svn (queued for the upcoming 2.6.17) and sending them to the 2.6.16 -stable maintainers. I'll maintain the list of outstanding patches here: https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/2.6.16-stable There's a single patch in that directory now - I have already sent it to the stable alias, and I plan to add to that shortly. Comments welcome. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From sean.hefty at intel.com Tue Apr 4 10:47:21 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 4 Apr 2006 10:47:21 -0700 Subject: [openib-general] ipoib send-only vs full join Message-ID: When does ipoib join a multicast group for send and receive membership? It looks like it joins the broadcast group for send/receive, but all other groups are send-only. Is this correct?
- Sean From jgunthorpe at obsidianresearch.com Tue Apr 4 11:07:23 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 4 Apr 2006 12:07:23 -0600 Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel In-Reply-To: <20060404164509.GA29487@esmail.cup.hp.com> References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com> <20060404070229.GF10080@obsidianresearch.com> <20060404164509.GA29487@esmail.cup.hp.com> Message-ID: <20060404180723.GA29589@obsidianresearch.com> On Tue, Apr 04, 2006 at 09:45:09AM -0700, Grant Grundler wrote: > > MSI requires end device support and something in a bridge > > to transform the MSI message into an APIC message, but the kernel > > currently only looks for end device support. > > APIC Message? > MSI is just a DMA-write from the card point of view. > So if PCI is working and DMA is working, MSI should work too. > The difference is routing of the transaction and the fact that > it's not targeting Host memory but some other part of the chipset. You still need chipset support to get from the memory write to an interrupt message transaction on the FSB to the processor APIC. For an example of how Intel documents this in their chipsets see http://www.intel.com/design/chipsets/datashts/30146403.pdf (pages 19 and 165). I doubt all Intel chipsets ever produced have this transformation. I also know that a lot of embedded systems don't have host bridges that support this. The list of host bridges that work is definitely smaller than the list that don't. :< > > Presumably only some subset of Intel PCI/PCI-X bridges support MSI > > translation to an APIC message on the FSB, and we know on HT systems > > that the MSI PCI capability block is required (and must be *enabled*) > > in the PCI/PCI-X/PCIe bridge for there to be support. > Well, can't linux enable that block if it's present? > That isn't a reason to disable MSI for _all_ systems.
I have a patch that does that, but monkying with the memory map always makes me nervous since you never really know what the BIOS has done. Intel style MSI's overlap the high memory BIOS area so there is a potential problem. HT MSI translation can be configured to use a high address, so it might be very safe to enable the translation and set something >4G as the base. > > In the long term as PCIe becomes dominant this is best since linux > > becomes firmly in control of all interrupt routing and doesn't rely on > > working APCI for IOAPIC tables.. > Sorry...I can't agree. > Line based interrupt routing is dependent on firmware to give > us the IRQ->APIC routing tables, enough info to identify CPUs (ID/EID > info for Intel implementations) and program IO-xAPIC entries. > Essentially, MSI only needs the CPU Info so MSI transactions get > routed correctly. Then MSI/-X entries on the devices can be > programmed (essentially the same way an IO-xAPIC gets programmed). ?? That's what I was trying to say - on a system with only PCIe (granted, with working MSI in the devices..) there should be limited need for IOAPIC based routing. > If it's "just a BIOS" issue, then can't AMD help linux turn MSI support on? > ie linux can bang some values into the chipset like we do for other types > of initialization when BIOS doesn't do it right. Yes, simple patch attached.. Jason -------------- next part -------------- --- linux-2.6.15.4/drivers/pci/quirks.c 2006-02-16 12:08:59.000000000 -0700 +++ lin/drivers/pci/quirks.c 2006-02-16 12:12:30.000000000 -0700 @@ -1257,6 +1257,29 @@ } DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NCR, PCI_DEVICE_ID_NCR_53C810, fixup_rev1_53c810); +#ifdef CONFIG_PCI_MSI +static void __devinit fixup_ht_msi(struct pci_dev* dev) +{ + /* Some BIOS's do not enable the hypertransport MSI mapping capability + on the chipset. This breaks MSI support.. 
*/ + int pos = pci_find_capability(dev,PCI_CAP_ID_HT); + while (pos != 0) + { + u32 cap; + pci_read_config_dword(dev,pos,&cap); + if (((cap >> 16) & PCI_HT_CMD_TYP) == PCI_HT_CMD_TYP_MSIM) { + if ((cap & PCI_HT_MSIM_ENABLE) == 0) { + printk("BIOS BUG: HyperTransport MSI mapping not enabled for %s, enabling.\n",pci_name(dev)); + cap |= PCI_HT_MSIM_ENABLE; + pci_write_config_dword(dev,pos,cap); + } + break; + } + pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT); + } +} +DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, fixup_ht_msi); +#endif static void pci_do_fixups(struct pci_dev *dev, struct pci_fixup *f, struct pci_fixup *end) { --- linux-2.6.15.4/include/linux/pci_regs.h 2006-02-16 12:09:05.000000000 -0700 +++ lin/include/linux/pci_regs.h 2006-02-16 12:12:30.000000000 -0700 @@ -196,12 +196,14 @@ #define PCI_CAP_ID_MSI 0x05 /* Message Signalled Interrupts */ #define PCI_CAP_ID_CHSWP 0x06 /* CompactPCI HotSwap */ #define PCI_CAP_ID_PCIX 0x07 /* PCI-X */ +#define PCI_CAP_ID_HT 0x08 /* HyperTransport */ #define PCI_CAP_ID_SHPC 0x0C /* PCI Standard Hot-Plug Controller */ #define PCI_CAP_ID_EXP 0x10 /* PCI Express */ #define PCI_CAP_ID_MSIX 0x11 /* MSI-X */ #define PCI_CAP_LIST_NEXT 1 /* Next capability in the list */ #define PCI_CAP_FLAGS 2 /* Capability defined flags (16 bits) */ #define PCI_CAP_SIZEOF 4 +#define PCI_HT_CMD_TYP 0xf800 /* Hypertransport capability type mask */ /* Power Management Registers */ @@ -285,6 +287,10 @@ #define PCI_MSI_DATA_64 12 /* 16 bits of data for 64-bit devices */ #define PCI_MSI_MASK_BIT 16 /* Mask bits register */ +/* HyperTransport MSI Mapping registers */ +#define PCI_HT_CMD_TYP_MSIM 0xa800 // MSI Mapping type +#define PCI_HT_MSIM_ENABLE (1<<16) + /* CompactPCI Hotswap Register */ #define PCI_CHSWP_CSR 2 /* Control and Status Register */ From xma at us.ibm.com Tue Apr 4 11:10:23 2006 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 4 Apr 2006 11:10:23 -0700 Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune 
patch In-Reply-To: <20060404172208.GY14808@mellanox.co.il> Message-ID: Hello Michael, "Michael S. Tsirkin" wrote on 04/04/2006 10:22:08 AM: > Quoting r. Shirley Ma : > > Subject: Re: [PATCH] repost: IPoIB queue size tune patch > > > > > > Hello Roland, > > > > Thanks. > > > > Roland Dreier wrote on 04/03/2006 10:50:37 PM: > > > > > Sorry I hadn't gotten a chance to read this over until now... > > > > > > > - IPOIB_RX_RING_SIZE = 128, > > > > - IPOIB_TX_RING_SIZE = 64, > > > > + IPOIB_SENDQ_SIZE = 64, > > > > + IPOIB_RECVQ_SIZE = 128, > > > > > > Can you explain again why it's a good idea to rename these? Is the > > > name "sendq_size" really clearer than "tx_ring_size," especially in > > > the context of a network driver? > > > > tx_ring and rx_ring are not good names for IPoIB. tx_ring in IPoIB is a place > > to save pointers and rx_ring a place to allocate skb buffs. > > Actually names make sense to me - they are cyclic buffers, hence ring. > > > I am working on > > a patch to remove tx_ring and replace rx_ring with a list, which would > > reduce some tx_ring overhead. > > What overhead? Please don't, lists are bad for cache, require locks ... > Circular buffers are much better, I don't have the time now but I think we can > even make event handler completely lockless with them. IPoIB is not a real driver. Packets can be dropped from dev xmit queue to QPs send queue directly. Why do we use tx_ring and induce overhead in the send path? Also, using a list doesn't necessarily mean adding a lock. I will test the patch performance before I submit it for review. We really want this patch to be in the RC2 release. I will change the names back if they would prevent the patch from being merged upstream. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed...
URL: From ardavis at ichips.intel.com Tue Apr 4 11:10:26 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 04 Apr 2006 11:10:26 -0700 Subject: [openib-general] Re: [DAPL] Provider initialialization In-Reply-To: References: Message-ID: <4432B692.8000606@ichips.intel.com> James Lentini wrote: >Arlin, > >As part of the uDAPL autotools patch, we changed the mechanism by >which the uDAPL provider library's init and fini functions were >specified. > >I've seen (and received reports) of systems on which the init and fini >functions are not being called. I'd like to move back to the old >mechanism (see patch below). Do you see any problems with this? > > no problem. From jlentini at netapp.com Tue Apr 4 11:23:55 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 4 Apr 2006 14:23:55 -0400 (EDT) Subject: [openib-general] Re: [DAPL] Provider initialialization In-Reply-To: <4432B692.8000606@ichips.intel.com> References: <4432B692.8000606@ichips.intel.com> Message-ID: On Tue, 4 Apr 2006, Arlin Davis wrote: > James Lentini wrote: > > > Arlin, > > > > As part of the uDAPL autotools patch, we changed the mechanism by > > which the uDAPL provider library's init and fini functions were > > specified. > > > > I've seen (and received reports) of systems on which the init and > > fini functions are not being called. I'd like to move back to the > > old mechanism (see patch below). Do you see any problems with > > this? > > > no problem. Committed in the gen2 trunk and 1.0 branch in revision 6222. From swise at opengridcomputing.com Tue Apr 4 12:04:03 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 04 Apr 2006 14:04:03 -0500 Subject: [openib-general] [PATCH] [RFC] - dapl - dat_ep_free() can return without freeing the endpoint Message-ID: <1144177443.6427.34.camel@stevo-desktop> Arlin/James, I'm running dapltest over the Chelsio cxgb3 openib/iwarp driver. 
I'm running into an intermittent failure where the server side fails to properly clean up its resources. It has to do with disconnect vs ep freeing. Basically if the disconnect event handler thread doesn't get done (and turn off conn->in_callback) before the main dapltest thread attempts to destroy the EP, then dat_ep_free() will return "ok I freed it" even though it doesn't because in_callback == 1 in the dapl_cm_id struct. The idea, I assume, was to let the callback finish and destroy the connection on the callback thread. However, once dat_ep_free() returns, the main dapltest thread then tries to free the EVDs and PZ and gets errors because they are still in use. So dapli_destroy_conn() defers destroying the ib_qp if conn->in_callback == 1. This, however, leads to the dapltest program trying to destroy CQs and PDs with a QP still attached to them. Here is a patch that fixes the problem. Please review. Basically, I changed the logic so that dapli_destroy_conn() will wait until the callback finishes. Is this gonna break anything else? The patch probably needs to get rid of the usleep() in favor of a pthread_cond_wait() or something, but you'll get the idea. I'd like to get this patch into the trunk. Lemme know what you think. This fix seems to work fine with my testing so far over the chelsio driver. Thanks, Steve. 
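The usleep() polling mentioned above could be replaced by a condition variable, as the message itself suggests. What follows is only a minimal sketch of that idea, not uDAPL code: struct conn_state and the conn_* helpers are hypothetical names, and plain pthreads are assumed in place of the dapl_os_* lock wrappers. The callback clears in_callback and signals; the destroy path sets destroy and sleeps until no callback is running.

```c
#include <pthread.h>

/* Hypothetical stand-in for the connection object. The field names
 * loosely mirror the patch (in_callback, destroy), but this struct is
 * an illustration and not the real struct dapl_cm_id. */
struct conn_state {
	pthread_mutex_t lock;
	pthread_cond_t  callback_done;	/* signalled when a callback exits */
	int             in_callback;
	int             destroy;
};

static void conn_state_init(struct conn_state *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->callback_done, NULL);
	c->in_callback = 0;
	c->destroy = 0;
}

/* Callback side, entry: returns 0 if teardown has already started and
 * the event should be ignored, 1 if the callback may proceed. */
static int conn_callback_enter(struct conn_state *c)
{
	int ok;

	pthread_mutex_lock(&c->lock);
	ok = !c->destroy;
	if (ok)
		c->in_callback = 1;
	pthread_mutex_unlock(&c->lock);
	return ok;
}

/* Callback side, exit: clear the flag and wake any waiting destroyer. */
static void conn_callback_exit(struct conn_state *c)
{
	pthread_mutex_lock(&c->lock);
	c->in_callback = 0;
	pthread_cond_broadcast(&c->callback_done);
	pthread_mutex_unlock(&c->lock);
}

/* Destroy side: mark the connection as dying, then block (without
 * polling) until no callback is running. The caller can free the
 * underlying resources after this returns. */
static void conn_wait_for_callback(struct conn_state *c)
{
	pthread_mutex_lock(&c->lock);
	c->destroy = 1;
	while (c->in_callback)
		pthread_cond_wait(&c->callback_done, &c->lock);
	pthread_mutex_unlock(&c->lock);
}
```

One caveat with this shape: if the destroy path can ever be reached from inside the callback thread itself, the wait would never return, so that case would need separate handling.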
Signed-off-by: Steve Wise Index: openib_cma/dapl_ib_cm.c =================================================================== --- openib_cma/dapl_ib_cm.c (revision 6107) +++ openib_cma/dapl_ib_cm.c (working copy) @@ -62,9 +62,9 @@ /* local prototypes */ static struct dapl_cm_id * dapli_req_recv(struct dapl_cm_id *conn, struct rdma_cm_event *event); -static int dapli_cm_active_cb(struct dapl_cm_id *conn, +static void dapli_cm_active_cb(struct dapl_cm_id *conn, struct rdma_cm_event *event); -static int dapli_cm_passive_cb(struct dapl_cm_id *conn, +static void dapli_cm_passive_cb(struct dapl_cm_id *conn, struct rdma_cm_event *event); static void dapli_addr_resolve(struct dapl_cm_id *conn); static void dapli_route_resolve(struct dapl_cm_id *conn); @@ -164,28 +164,34 @@ void dapli_destroy_conn(struct dapl_cm_id *conn) { int in_callback; + struct rdma_cm_id *cm_id; dapl_dbg_log(DAPL_DBG_TYPE_CM, " destroy_conn: conn %p id %d\n", conn,conn->cm_id); - dapl_os_lock(&conn->lock); conn->destroy = 1; in_callback = conn->in_callback; - dapl_os_unlock(&conn->lock); - - if (!in_callback) { - if (conn->ep) - conn->ep->cm_handle = IB_INVALID_HANDLE; - if (conn->cm_id) { - if (conn->cm_id->qp) - rdma_destroy_qp(conn->cm_id); - rdma_destroy_id(conn->cm_id); + do { + if (in_callback) { + dapl_os_unlock(&conn->lock); + usleep(10); + dapl_os_lock(&conn->lock); } - - conn->cm_id = NULL; - dapl_os_free(conn, sizeof(*conn)); + in_callback = conn->in_callback; + } while (in_callback); + + if (conn->ep) + conn->ep->cm_handle = IB_INVALID_HANDLE; + cm_id = conn->cm_id; + conn->cm_id = NULL; + dapl_os_unlock(&conn->lock); + if (cm_id) { + if (cm_id->qp) + rdma_destroy_qp(cm_id); + rdma_destroy_id(cm_id); } + dapl_os_free(conn, sizeof(*conn)); } static struct dapl_cm_id * dapli_req_recv(struct dapl_cm_id *conn, @@ -243,11 +249,9 @@ return new_conn; } -static int dapli_cm_active_cb(struct dapl_cm_id *conn, +static void dapli_cm_active_cb(struct dapl_cm_id *conn, struct rdma_cm_event *event) { - 
int destroy; - dapl_dbg_log(DAPL_DBG_TYPE_CM, " active_cb: conn %p id %d event %d\n", conn, conn->cm_id, event->event ); @@ -255,7 +259,7 @@ dapl_os_lock(&conn->lock); if (conn->destroy) { dapl_os_unlock(&conn->lock); - return 0; + return; } conn->in_callback = 1; dapl_os_unlock(&conn->lock); @@ -303,16 +307,14 @@ } dapl_os_lock(&conn->lock); - destroy = conn->destroy; - conn->in_callback = conn->destroy; + conn->in_callback = 0; dapl_os_unlock(&conn->lock); - return(destroy); + return; } -static int dapli_cm_passive_cb(struct dapl_cm_id *conn, +static void dapli_cm_passive_cb(struct dapl_cm_id *conn, struct rdma_cm_event *event) { - int destroy; struct dapl_cm_id *new_conn; dapl_dbg_log(DAPL_DBG_TYPE_CM, @@ -322,7 +324,7 @@ dapl_os_lock(&conn->lock); if (conn->destroy) { dapl_os_unlock(&conn->lock); - return 0; + return; } conn->in_callback = 1; dapl_os_unlock(&conn->lock); @@ -377,10 +379,9 @@ } dapl_os_lock(&conn->lock); - destroy = conn->destroy; - conn->in_callback = conn->destroy; + conn->in_callback = 0; dapl_os_unlock(&conn->lock); - return(destroy); + return; } @@ -1021,7 +1022,6 @@ /* process one CM event, fairness */ if(!ret) { struct dapl_cm_id *conn; - int ret; /* set proper conn from cm_id context*/ if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) @@ -1059,24 +1059,9 @@ case RDMA_CM_EVENT_DISCONNECTED: /* passive or active */ if (conn->sp) - ret = dapli_cm_passive_cb(conn,event); + dapli_cm_passive_cb(conn,event); else - ret = dapli_cm_active_cb(conn,event); - - /* destroy both qp and cm_id */ - if (ret) { - dapl_dbg_log(DAPL_DBG_TYPE_CM, - " cma_cb: DESTROY conn %p" - " cm_id %p qp %p\n", - conn, conn->cm_id, - conn->cm_id->qp); - - if (conn->cm_id->qp) - rdma_destroy_qp(conn->cm_id); - - rdma_destroy_id(conn->cm_id); - dapl_os_free(conn, sizeof(*conn)); - } + dapli_cm_active_cb(conn,event); break; case RDMA_CM_EVENT_CONNECT_RESPONSE: default: From mst at mellanox.co.il Tue Apr 4 12:22:16 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 4 Apr 2006 22:22:16 +0300 Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune patch In-Reply-To: References: <20060404172208.GY14808@mellanox.co.il> Message-ID: <20060404192216.GA25763@mellanox.co.il> Quoting r. Shirley Ma : > IPoIB is not a real driver. Packets can be dropped from dev xmit queue to > QPs send queue directly. Why we use tx_ring to induce overhead in > send path? As I see it, rx_ring and tx_ring are mostly to keep the dma mapping. The fact that we have to track it implies, I think, a level of indirection, and since the number of these matches the qp size, it seems best to allocate them at interface up and not dynamically. But: by all means please go ahead and experiment: I was just trying to help with these suggestions - I guess we'll see what you come up with. In particular, lots of people recommend looking at NAPI as a way to reduce interrupt rate on receive. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From xma at us.ibm.com Tue Apr 4 13:20:36 2006 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 4 Apr 2006 13:20:36 -0700 Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune patch In-Reply-To: <20060404192216.GA25763@mellanox.co.il> Message-ID: "Michael S. Tsirkin" wrote on 04/04/2006 12:22:16 PM: > Quoting r. Shirley Ma : > > IPoIB is not a real driver. Packets can be dropped from dev xmit queue to > > QPs send queue directly. Why we use tx_ring to induce overhead in > > send path? > > As I see it, rx_ring and tx_ring are mostly to keep the dma mapping. > The fact that we have to track it implies, I think, a level of indirection, > and since the number of these matches the qp size, it seems best to > allocate them at interface up and not dynamically. > But: by all means please go ahead and experiment: I was just trying > to help with > these suggestions - I guess we'll see what you come up with. Sure. I will submit the patch for review once I've finished all the tests.
> In particular, lots of people recommend looking at NAPI as a way > to reduce interrupt rate on receive. I have several experimental patches which do reduce the interrupts/CPU utilization and increase throughput. These patches are not mature enough to submit yet. In the meantime, we are looking at NAPI. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From rdreier at cisco.com Tue Apr 4 13:33:39 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 13:33:39 -0700 Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune patch In-Reply-To: (Shirley Ma's message of "Tue, 4 Apr 2006 13:20:36 -0700") References: Message-ID: Shirley> I have several experimental patches which do reduce the Shirley> interrupts/CPU utilization and increase Shirley> throughput. These patches are not mature enough to submit Shirley> yet. Could you post them so people can look at them? Even if they're not ready to merge it's helpful to be able to discuss them, and you may even get suggestions for improvements. - R. From rdreier at cisco.com Tue Apr 4 13:34:47 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 13:34:47 -0700 Subject: [openib-general] CM patch for 2.6.17 merge In-Reply-To: <20060404171122.GW14808@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 4 Apr 2006 20:11:22 +0300") References: <20060404163331.GU14808@mellanox.co.il> <4432A57D.2010604@ichips.intel.com> <20060404171122.GW14808@mellanox.co.il> Message-ID: Michael> OK then. Roland, could you update CM from svn? I have to say that I don't think now is the appropriate time to merge this stuff. Let's merge this when a consumer is ready too. - R.
From rdreier at cisco.com Tue Apr 4 13:35:13 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 13:35:13 -0700 Subject: [openib-general] Re: CM patch for 2.6.17 merge In-Reply-To: <20060404163331.GU14808@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 4 Apr 2006 19:33:32 +0300") References: <20060404163331.GU14808@mellanox.co.il> Message-ID: Michael> The second is a security fix, it's a must. Not sure I understand this. What's the exploit? - R. From rdreier at cisco.com Tue Apr 4 13:36:55 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 13:36:55 -0700 Subject: [openib-general] Re: git head for 2.6.17 In-Reply-To: <20060404172525.GZ14808@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 4 Apr 2006 20:25:25 +0300") References: <20060404172525.GZ14808@mellanox.co.il> Message-ID: Michael> Roland, I noticed that sometimes you use for-2.6.17 and Michael> sometimes for-linus head to push changes upstream. No, when have I ever asked Linus to pull from for-2.6.17? I queue things in for-2.6.17 (or for-2.6.18, or whatever), and when I want Linus to pull, I merge onto the for-linus branch and ask him to pull that. This lets me keep adding new stuff to the for-2.6.17 branch even before Linus pulls from me. - R. From rdreier at cisco.com Tue Apr 4 13:37:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 13:37:46 -0700 Subject: [openib-general] Re: ipoib send-only vs full join In-Reply-To: (Sean Hefty's message of "Tue, 4 Apr 2006 10:47:21 -0700") References: Message-ID: Sean> When does ipoib join a multicast group for send and receive Sean> membership? It looks like it joins the broadcast group for Sean> send/receive, but all other groups are send-only. Is this Sean> correct? Not quite. It joins for receive when the networking core asks it to receive multicasts on a given group. It joins send-only when it has a packet to send for a group it hasn't joined yet. - R. 
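The join policy described in the reply above can be sketched as a small user-space model in C. This is a toy illustration only, not the ipoib driver code: the names mcast_table, mcast_rx_requested, mcast_tx and the MCAST_* states are all hypothetical, and the real driver also has to issue SA queries and handle their completion, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_GROUPS 16 /* hypothetical table size for this sketch */

/* Hypothetical membership states for one multicast group. */
enum mcast_state { MCAST_NONE, MCAST_SENDONLY, MCAST_FULL };

struct mcast_entry {
    uint32_t group;          /* IPv4 multicast address, host byte order */
    enum mcast_state state;
};

struct mcast_table {
    struct mcast_entry entry[MAX_GROUPS];
    size_t count;
};

/* Return the entry for a group, or NULL if the group was never seen. */
static struct mcast_entry *mcast_find(struct mcast_table *t, uint32_t group)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->count; i++)
        if (t->entry[i].group == group)
            return &t->entry[i];
    return NULL;
}

/* Find or create the entry for a group; NULL if the table is full. */
static struct mcast_entry *mcast_get(struct mcast_table *t, uint32_t group)
{
    struct mcast_entry *e = mcast_find(t, group);
    if (e)
        return e;
    if (t->count == MAX_GROUPS)
        return NULL;
    e = &t->entry[t->count++];
    e->group = group;
    e->state = MCAST_NONE;
    return e;
}

static enum mcast_state mcast_state_of(struct mcast_table *t, uint32_t group)
{
    struct mcast_entry *e = mcast_find(t, group);
    return e ? e->state : MCAST_NONE;
}

/* The networking core asks to receive this group: become a full member.
 * A group that was send-only is upgraded. */
static int mcast_rx_requested(struct mcast_table *t, uint32_t group)
{
    struct mcast_entry *e = mcast_get(t, group);
    if (!e)
        return -1;
    e->state = MCAST_FULL;
    return 0;
}

/* A packet is about to be sent to this group: join send-only, but only
 * if the group has not been joined in any form yet. */
static int mcast_tx(struct mcast_table *t, uint32_t group)
{
    struct mcast_entry *e = mcast_get(t, group);
    if (!e)
        return -1;
    if (e->state == MCAST_NONE)
        e->state = MCAST_SENDONLY;
    return 0;
}
```

In this model, sending to a group never downgrades a full membership, and a receive request always wins over an earlier send-only join.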
From rdreier at cisco.com Tue Apr 4 13:39:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 13:39:18 -0700 Subject: [openib-general] Re: 2.6.17 merge In-Reply-To: <20060404165940.GV14808@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 4 Apr 2006 19:59:40 +0300") References: <20060404160323.GS14808@mellanox.co.il> <20060404165940.GV14808@mellanox.co.il> Message-ID: Michael> Here it is: I just did diff between svn and trunk and Michael> removed 2.6.16 things. OK? Yes, I'll queue this. To be honest I thought I had already merged it but I guess I dropped it somewhere. - R. From jlentini at netapp.com Tue Apr 4 13:57:01 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 4 Apr 2006 16:57:01 -0400 (EDT) Subject: [openib-general] Re: SPAM: [PATCH] [RFC] - dapl - dat_ep_free() can return without freeing the endpoint In-Reply-To: <1144177443.6427.34.camel@stevo-desktop> References: <1144177443.6427.34.camel@stevo-desktop> Message-ID: Comments below: On Tue, 4 Apr 2006, Steve Wise wrote: > Index: openib_cma/dapl_ib_cm.c > =================================================================== > --- openib_cma/dapl_ib_cm.c (revision 6107) > +++ openib_cma/dapl_ib_cm.c (working copy) > @@ -62,9 +62,9 @@ > /* local prototypes */ > static struct dapl_cm_id * dapli_req_recv(struct dapl_cm_id *conn, > struct rdma_cm_event *event); > -static int dapli_cm_active_cb(struct dapl_cm_id *conn, > +static void dapli_cm_active_cb(struct dapl_cm_id *conn, > struct rdma_cm_event *event); > -static int dapli_cm_passive_cb(struct dapl_cm_id *conn, > +static void dapli_cm_passive_cb(struct dapl_cm_id *conn, > struct rdma_cm_event *event); > static void dapli_addr_resolve(struct dapl_cm_id *conn); > static void dapli_route_resolve(struct dapl_cm_id *conn); > @@ -164,28 +164,34 @@ > void dapli_destroy_conn(struct dapl_cm_id *conn) > { > int in_callback; > + struct rdma_cm_id *cm_id; > > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " destroy_conn: conn %p id %d\n", > 
conn,conn->cm_id); > - > dapl_os_lock(&conn->lock); > conn->destroy = 1; > in_callback = conn->in_callback; > - dapl_os_unlock(&conn->lock); > - > - if (!in_callback) { > - if (conn->ep) > - conn->ep->cm_handle = IB_INVALID_HANDLE; > - if (conn->cm_id) { > - if (conn->cm_id->qp) > - rdma_destroy_qp(conn->cm_id); > - rdma_destroy_id(conn->cm_id); > + do { > + if (in_callback) { > + dapl_os_unlock(&conn->lock); > + usleep(10); > + dapl_os_lock(&conn->lock); > } > - > - conn->cm_id = NULL; > - dapl_os_free(conn, sizeof(*conn)); > + in_callback = conn->in_callback; > + } while (in_callback); > + > + if (conn->ep) > + conn->ep->cm_handle = IB_INVALID_HANDLE; > + cm_id = conn->cm_id; > + conn->cm_id = NULL; > + dapl_os_unlock(&conn->lock); > + if (cm_id) { > + if (cm_id->qp) > + rdma_destroy_qp(cm_id); > + rdma_destroy_id(cm_id); > } > + dapl_os_free(conn, sizeof(*conn)); > } This is an improvement. If dat_ep_free() returns success, DAPL consumers expect to be able to free the free'd EP's resources (PZ, etc.). We must have been getting lucky on IB. I'm worried that a callback could occur ... void dapli_destroy_conn(struct dapl_cm_id *conn) { int in_callback; struct rdma_cm_id *cm_id; dapl_dbg_log(DAPL_DBG_TYPE_CM, " destroy_conn: conn %p id %d\n", conn,conn->cm_id); dapl_os_lock(&conn->lock); conn->destroy = 1; in_callback = conn->in_callback; do { if (in_callback) { dapl_os_unlock(&conn->lock); usleep(10); dapl_os_lock(&conn->lock); } in_callback = conn->in_callback; } while (in_callback); if (conn->ep) conn->ep->cm_handle = IB_INVALID_HANDLE; cm_id = conn->cm_id; conn->cm_id = NULL; dapl_os_unlock(&conn->lock); /* ... here */ if (cm_id) { if (cm_id->qp) rdma_destroy_qp(cm_id); rdma_destroy_id(cm_id); } dapl_os_free(conn, sizeof(*conn)); } Destroying the cm_id while in a callback would be bad. 
> static struct dapl_cm_id * dapli_req_recv(struct dapl_cm_id *conn, > @@ -243,11 +249,9 @@ > return new_conn; > } > > -static int dapli_cm_active_cb(struct dapl_cm_id *conn, > +static void dapli_cm_active_cb(struct dapl_cm_id *conn, > struct rdma_cm_event *event) > { > - int destroy; > - > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " active_cb: conn %p id %d event %d\n", > conn, conn->cm_id, event->event ); > @@ -255,7 +259,7 @@ > dapl_os_lock(&conn->lock); > if (conn->destroy) { > dapl_os_unlock(&conn->lock); > - return 0; > + return; > } > conn->in_callback = 1; > dapl_os_unlock(&conn->lock); > @@ -303,16 +307,14 @@ > } > > dapl_os_lock(&conn->lock); > - destroy = conn->destroy; > - conn->in_callback = conn->destroy; > + conn->in_callback = 0; > dapl_os_unlock(&conn->lock); > - return(destroy); > + return; > } > > -static int dapli_cm_passive_cb(struct dapl_cm_id *conn, > +static void dapli_cm_passive_cb(struct dapl_cm_id *conn, > struct rdma_cm_event *event) > { > - int destroy; > struct dapl_cm_id *new_conn; > > dapl_dbg_log(DAPL_DBG_TYPE_CM, > @@ -322,7 +324,7 @@ > dapl_os_lock(&conn->lock); > if (conn->destroy) { > dapl_os_unlock(&conn->lock); > - return 0; > + return; > } > conn->in_callback = 1; > dapl_os_unlock(&conn->lock); > @@ -377,10 +379,9 @@ > } > > dapl_os_lock(&conn->lock); > - destroy = conn->destroy; > - conn->in_callback = conn->destroy; > + conn->in_callback = 0; > dapl_os_unlock(&conn->lock); > - return(destroy); > + return; > } > > > @@ -1021,7 +1022,6 @@ > /* process one CM event, fairness */ > if(!ret) { > struct dapl_cm_id *conn; > - int ret; > > /* set proper conn from cm_id context*/ > if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) > @@ -1059,24 +1059,9 @@ > case RDMA_CM_EVENT_DISCONNECTED: > /* passive or active */ > if (conn->sp) > - ret = dapli_cm_passive_cb(conn,event); > + dapli_cm_passive_cb(conn,event); > else > - ret = dapli_cm_active_cb(conn,event); > - > - /* destroy both qp and cm_id */ > - if (ret) { > - 
dapl_dbg_log(DAPL_DBG_TYPE_CM, > - " cma_cb: DESTROY conn %p" > - " cm_id %p qp %p\n", > - conn, conn->cm_id, > - conn->cm_id->qp); > - > - if (conn->cm_id->qp) > - rdma_destroy_qp(conn->cm_id); > - > - rdma_destroy_id(conn->cm_id); > - dapl_os_free(conn, sizeof(*conn)); > - } > + dapli_cm_active_cb(conn,event); If this code is removed, we'll need to update functions that set conn->destroy to 1 to destroy the cm_id. What happens if a consumer attempts to free the EP from a callback? With this change (or any one that blocked a callback thread from attempting to free the EP), I believe we would deadlock. Is it possible to destroy CM IDs from within a callback? From rdreier at cisco.com Tue Apr 4 14:04:10 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 14:04:10 -0700 Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task In-Reply-To: <20060404155233.GR14808@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 4 Apr 2006 18:52:33 +0300") References: <20060404155233.GR14808@mellanox.co.il> Message-ID: Michael> ipoib_mcast_restart_task might free an mcast object while Michael> a join request (sa query) is still outstanding, leading Michael> to an oops when the query completes. Fix this by waiting Michael> for query to complete, similiar to what ipoib_stop_thread Michael> is doing. Yes, looks like there might be problem here. However, is there any way to consolidate the "cancel and wait for done" code in one place, rather than just cut-and-pasting it from ipoib_stop_thread()? This could explain the oops in ipoib_mcast_sendonly_join_complete(), but only if a send-only group is being replaced by a full-member join. Is Eli's test doing that? - R. From makia at llnl.gov Tue Apr 4 14:04:43 2006 From: makia at llnl.gov (Makia Minich) Date: Tue, 4 Apr 2006 14:04:43 -0700 Subject: [openib-general] APIC error Message-ID: <20060404210443.GB26741@langley.llnl.gov> I was wondering if anyone's come across this. 
On a Tyan 4881 motherboard with BIOS version V1.04 and openib-1.0, I'm seeing APIC errors: ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006) ib_mthca: Initializing 0000:05:00.0 ACPI: PCI interrupt 0000:05:00.0[A] -> GSI 18 (level, high) -> IRQ 217 PCI: Setting latency timer of device 0000:05:00.0 to 64 MSI INIT SUCCESS APIC error on CPU1: 00(08) APIC error on CPU0: 00(08) In fact, anytime I attempt to run a command that would query the card, I get an APIC error: (on the command line) # sminfo ibpanic: [4100] madrpc_init: client_register for mgmt 1 failed: (Cannot allocate memory) # (on the console) APIC error on CPU1: 08(08) APIC error on CPU0: 08(08) ib_mthca 0000:05:00.0: SW2HW_MPT failed (-16) Using earlier BIOS versions, I don't see the issue at all, but on the latest BIOS versions I'm dead in the water. So, I was hoping that someone else has seen this issue and perhaphs might know of a workaround. (((((((((((((((((((((((((((((((((()))))))))))))))))))))))))))))))))) Makia Minich Money is the Devil's toothpaste. 925.424.5675 --The Flea (Mucha Lucha) (((((((((((((((((((((((((((((((((()))))))))))))))))))))))))))))))))) From mshefty at ichips.intel.com Tue Apr 4 14:05:19 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 04 Apr 2006 14:05:19 -0700 Subject: [openib-general] Re: ipoib send-only vs full join In-Reply-To: References: Message-ID: <4432DF8F.60702@ichips.intel.com> Roland Dreier wrote: > Not quite. It joins for receive when the networking core asks it to > receive multicasts on a given group. It joins send-only when it has a > packet to send for a group it hasn't joined yet. Okay - I think I see what's happening now. Thanks. 
- Sean From halr at voltaire.com Tue Apr 4 14:02:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Apr 2006 17:02:28 -0400 Subject: [openib-general] Re: ipoib send-only vs full join In-Reply-To: References: Message-ID: <1144184544.4480.63878.camel@hal.voltaire.com> On Tue, 2006-04-04 at 16:37, Roland Dreier wrote: > Sean> When does ipoib join a multicast group for send and receive > Sean> membership? It looks like it joins the broadcast group for > Sean> send/receive, but all other groups are send-only. Is this > Sean> correct? > > Not quite. It joins for receive when the networking core asks it to > receive multicasts on a given group. There is a setsockopt for adding a IP member which causes this: ret = setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, ...); You should be seeing a number of full joins (like for some 224.0.0.x addresses). > It joins send-only when it has a > packet to send for a group it hasn't joined yet. That occurs when IPmc is doing the sender role to a IPmc group without previously joining the group via the setsockopt mentioned above. -- Hal > > - R. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Tue Apr 4 14:08:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 14:08:46 -0700 Subject: [openib-general] APIC error In-Reply-To: <20060404210443.GB26741@langley.llnl.gov> (Makia Minich's message of "Tue, 4 Apr 2006 14:04:43 -0700") References: <20060404210443.GB26741@langley.llnl.gov> Message-ID: Makia> Using earlier BIOS versions, I don't see the issue at all, Makia> but on the latest BIOS versions I'm dead in the water. So, Makia> I was hoping that someone else has seen this issue and Makia> perhaphs might know of a workaround. Are you enabling MSI or MSI-X for the ib_mthca module? 
You could try without that if you are. Otherwise, given that a BIOS update broke your setup, rolling back your BIOS seems like a reasonable workaround for now. I would definitely try to get your BIOS vendor to help diagnose the issue if there's a fix you need in the newer BIOS. - R. From rdreier at cisco.com Tue Apr 4 14:16:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 14:16:33 -0700 Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task In-Reply-To: <20060404155233.GR14808@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 4 Apr 2006 18:52:33 +0300") References: <20060404155233.GR14808@mellanox.co.il> Message-ID: Michael> Roland, Eli spotted the following race that might explain Michael> part of the crashes in sendonly complete. Actually I don't see how this could explain it. The oops seems to be happening when the CPU follows the mcast pointer to get mcast->dev, and even if mcast has been freed, it should still point to valid kernel memory. - R. From makia at llnl.gov Tue Apr 4 14:26:09 2006 From: makia at llnl.gov (Makia Minich) Date: Tue, 4 Apr 2006 14:26:09 -0700 Subject: [openib-general] APIC error In-Reply-To: References: <20060404210443.GB26741@langley.llnl.gov> Message-ID: <20060404212609.GC26741@langley.llnl.gov> * Roland Dreier (rdreier at cisco.com) wrote: > Makia> Using earlier BIOS versions, I don't see the issue at all, > Makia> but on the latest BIOS versions I'm dead in the water. So, > Makia> I was hoping that someone else has seen this issue and > Makia> perhaphs might know of a workaround. > > Are you enabling MSI or MSI-X for the ib_mthca module? You could try > without that if you are. > > Otherwise, given that a BIOS update broke your setup, rolling back > your BIOS seems like a reasonable workaround for now. I would > definitely try to get your BIOS vendor to help diagnose the issue if > there's a fix you need in the newer BIOS. > > - R. I'll try the MSI disabling, perhaps that might work. 
Sadly I don't have an option to down-rev the BIOS (this version is needed for IPMI 2.0 testing) and Tyan appears to not have the same problem (IBGD seems to work for them). I was just hoping that someone else has seen the issues as well. (((((((((((((((((((((((((((((((((()))))))))))))))))))))))))))))))))) Makia Minich Money is the Devil's toothpaste. 925.424.5675 --The Flea (Mucha Lucha) (((((((((((((((((((((((((((((((((()))))))))))))))))))))))))))))))))) From rdreier at cisco.com Tue Apr 4 14:36:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 14:36:36 -0700 Subject: [openib-general] APIC error In-Reply-To: <20060404212609.GC26741@langley.llnl.gov> (Makia Minich's message of "Tue, 4 Apr 2006 14:26:09 -0700") References: <20060404210443.GB26741@langley.llnl.gov> <20060404212609.GC26741@langley.llnl.gov> Message-ID: Makia> I'll try the MSI disabling, perhaps that might work. Sadly Makia> I don't have an option to down-rev the BIOS (this version Makia> is needed for IPMI 2.0 testing) and Tyan appears to not Makia> have the same problem (IBGD seems to work for them). I was Makia> just hoping that someone else has seen the issues as well. Do you know if Tyan has tried with MSI and/or MSI-X (whichever you're using)? They may have messed up MSI in the BIOS. - R. From mst at mellanox.co.il Tue Apr 4 14:40:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 00:40:35 +0300 Subject: [openib-general] Re: CM patch for 2.6.17 merge In-Reply-To: References: <20060404163331.GU14808@mellanox.co.il> Message-ID: <20060404214035.GA26692@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: CM patch for 2.6.17 merge > > Michael> The second is a security fix, its a must. > > Not sure I understand this. What's the exploit? Connecting from userspace to an SDP socket. People expect sockets to be kernel-level. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Apr 4 14:41:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 00:41:51 +0300 Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task In-Reply-To: References: <20060404155233.GR14808@mellanox.co.il> Message-ID: <20060404214151.GB26692@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib_mcast_restart_task > > Michael> ipoib_mcast_restart_task might free an mcast object while > Michael> a join request (sa query) is still outstanding, leading > Michael> to an oops when the query completes. Fix this by waiting > Michael> for query to complete, similiar to what ipoib_stop_thread > Michael> is doing. > > Yes, looks like there might be problem here. However, is there any > way to consolidate the "cancel and wait for done" code in one place, > rather than just cut-and-pasting it from ipoib_stop_thread()? Why not? Please go ahead. > This could explain the oops in ipoib_mcast_sendonly_join_complete(), > but only if a send-only group is being replaced by a full-member > join. Is Eli's test doing that? I don't think so. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Apr 4 14:43:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 00:43:04 +0300 Subject: [openib-general] Re: git head for 2.6.17 In-Reply-To: References: <20060404172525.GZ14808@mellanox.co.il> Message-ID: <20060404214304.GC26692@mellanox.co.il> Quoting r. Roland Dreier : > I queue things in for-2.6.17 (or for-2.6.18, or whatever), and when I > want Linus to pull, I merge onto the for-linus branch and ask him to > pull that. This let's me keep adding new stuff to the for-2.6.17 > branch even before Linus pulls from me. OK. So for-2.6.17 is likely the thing to test. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From caitlinb at broadcom.com Tue Apr 4 14:42:27 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 4 Apr 2006 14:42:27 -0700 Subject: [openib-general] Re: CM patch for 2.6.17 merge Message-ID: <54AD0F12E08D1541B826BE97C98F99F13C99CD@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Quoting r. Roland Dreier : >> Subject: Re: CM patch for 2.6.17 merge >> >> Michael> The second is a security fix, its a must. >> >> Not sure I understand this. What's the exploit? > > Connecting from userspace to an SDP socket. People expect > sockets to be kernel-level. To be fair, I do not think that users have a reasonable expectation that merely because they are using a socket that all traffic will be subject to kernel validation and inspection. But I do believe that most people assume that when they connect a socket that the kernel will block them if the connection is contrary to netfilter policies. From rdreier at cisco.com Tue Apr 4 14:43:43 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 14:43:43 -0700 Subject: [openib-general] Re: CM patch for 2.6.17 merge In-Reply-To: <20060404214035.GA26692@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 5 Apr 2006 00:40:35 +0300") References: <20060404163331.GU14808@mellanox.co.il> <20060404214035.GA26692@mellanox.co.il> Message-ID: Roland> Not sure I understand this. What's the exploit? Michael> Connecting from userspace to an SDP socket. People expect Michael> sockets to be kernel-level. Without SDP upstream I don't see the security issue. Even with SDP upstream it's dubious: everything coming in from the network should be untrusted. I don't see how you can prevent userspace from sending CM messages on an arbitrary UD QP. - R. From mst at mellanox.co.il Tue Apr 4 14:47:29 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 5 Apr 2006 00:47:29 +0300 Subject: [openib-general] CM patch for 2.6.17 merge In-Reply-To: References: <20060404163331.GU14808@mellanox.co.il> <4432A57D.2010604@ichips.intel.com> <20060404171122.GW14808@mellanox.co.il> Message-ID: <20060404214729.GD26692@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] CM patch for 2.6.17 merge > > Michael> OK then. Roland, could you update CM from svn? > > I have to say that I don't think now is the appropriate time to merge > this stuff. Let's merge this when a consumer is ready too. That's unfortunate, I think CMA needs this and it would be nice to develop CMA out of kernel against in-kernel CM. Is that right, Sean? Does CMA require private data matching to work? Oh well, maybe you are right. Still, we really want to prevent userspace from using CMA/SDP service IDs, so I think we need at least this part. Sean, if you agree, please send Roland this patch. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From makia at llnl.gov Tue Apr 4 14:53:04 2006 From: makia at llnl.gov (Makia Minich) Date: Tue, 4 Apr 2006 14:53:04 -0700 Subject: [openib-general] APIC error In-Reply-To: References: <20060404210443.GB26741@langley.llnl.gov> <20060404212609.GC26741@langley.llnl.gov> Message-ID: <20060404215304.GE26741@langley.llnl.gov> * Roland Dreier (rdreier at cisco.com) wrote: > Makia> I'll try the MSI disabling, perhaps that might work. Sadly > Makia> I don't have an option to down-rev the BIOS (this version > Makia> is needed for IPMI 2.0 testing) and Tyan appears to not > Makia> have the same problem (IBGD seems to work for them). I was > Makia> just hoping that someone else has seen the issues as well. > > Do you know if Tyan has tried with MSI and/or MSI-X (whichever you're > using)? They may have messed up MSI in the BIOS. > > - R. It looks like your idea worked. I'm pinging Tyan now to see what tests they tried. In the meantime, at least I have a *mostly* working system again. 
(((((((((((((((((((((((((((((((((()))))))))))))))))))))))))))))))))) Makia Minich Money is the Devil's toothpaste. 925.424.5675 --The Flea (Mucha Lucha) (((((((((((((((((((((((((((((((((()))))))))))))))))))))))))))))))))) From mst at mellanox.co.il Tue Apr 4 14:53:44 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 00:53:44 +0300 Subject: [openib-general] Re: CM patch for 2.6.17 merge In-Reply-To: References: <20060404163331.GU14808@mellanox.co.il> <20060404214035.GA26692@mellanox.co.il> Message-ID: <20060404215344.GE26692@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: CM patch for 2.6.17 merge > > Roland> Not sure I understand this. What's the exploit? > > Michael> Connecting from userspace to an SDP socket. People expect > Michael> sockets to be kernel-level. > > Without SDP upstream I don't see the security issue. We are protecting the remote system here. Think about the time when SDP/CMA are upstream, or about a non-Linux system with SDP/CMA listening, connected over IB to a 2.6.17 Linux system. > Even with SDP > upstream it's dubious: everything coming in from the network should be > untrusted. Yes, but in Linux, for example, sending ARP packets is limited to privileged users. I agree it's weak but ... > I don't see how you can prevent userspace from sending CM > messages on an arbitrary UD QP. Does the IB spec require me to accept them? Maybe we should validate the source QP ... -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From bugzilla-daemon at openib.org Tue Apr 4 15:04:22 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Tue, 4 Apr 2006 15:04:22 -0700 (PDT) Subject: [openib-general] [Bug 32] New: IBED RC2 fails to build on FC4 2.6.15 Message-ID: <20060404220422.7AA422283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=32 Summary: IBED RC2 fails to build on FC4 2.6.15 Product: OpenIB Version: 1.0rc2 Platform: X86-64 OS/Version: Other Status: NEW Severity: blocker Priority: P2 Component: iSER AssignedTo: bugzilla at openib.org ReportedBy: kball at pathscale.com IBED RC2 fails to build on FC4 with the 2.6.15 kernel. It fails in the kernel build (I will attach all log files shortly) with errors that look like /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/iser/iscsi_iser.c: In function 'iscsi_iser_conn_set_param': /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/iser/iscsi_iser.c:1681: error: 'ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/iser/iscsi_iser.c:1681: error: (Each undeclared identifier is reported only once /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/iser/iscsi_iser.c:1681: error: for each function it appears in.) 
/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/iser/iscsi_iser.c: In function 'iscsi_iser_conn_get_param': /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/iser/iscsi_iser.c:1742: error: 'ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/iser/iscsi_iser.c: At top level: /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/iser/iscsi_iser.c:1776: error: unknown field 'af' specified in initializer /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/iser/iscsi_iser.c:1776: warning: initialization makes pointer from integer without a cast /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/iser/iscsi_iser.c:1777: error: unknown field 'rdma' specified in initializer make[3]: *** [/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/iser/iscsi_iser.o] Error 1 make[2]: *** [/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/iser] Error 2 make[1]: *** [_module_/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband] Error 2 I could find no mention in the documentation of what kernels/distros are supported, in any of the documentation, though 2.6.15 and 2.6.16 are both mentioned in the build script. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mshefty at ichips.intel.com Tue Apr 4 14:54:56 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 04 Apr 2006 14:54:56 -0700 Subject: [openib-general] CM patch for 2.6.17 merge In-Reply-To: <20060404214729.GD26692@mellanox.co.il> References: <20060404163331.GU14808@mellanox.co.il> <4432A57D.2010604@ichips.intel.com> <20060404171122.GW14808@mellanox.co.il> <20060404214729.GD26692@mellanox.co.il> Message-ID: <4432EB30.1050502@ichips.intel.com> Michael S. 
Tsirkin wrote: > That's unfortunate, I think CMA needs this and it would be nice to develop CMA > out of kernel against in-kernel CM. Is that right, Sean? Does CMA require > private data matching to work? The RDMA CM requires these changes. Yes, it would have been nice to sync the kernel tree, but only so I wouldn't have to generate those patches again. :) > Still, we really want to prevent userspace from using CMA/SDP service IDs, so I > think we need at least this part. Sean, if you agree, please send Roland this > patch. The CMA/SDP service IDs are defined by the IB spec, so a userspace consumer shouldn't be able to use those ranges. Allowing this probably isn't a big issue until either are in the kernel. I.e. if the CMA isn't loaded, then it doesn't really matter if someone uses the CMA's service ID. And if the CMA is loaded, then this patch would have been applied locally. - Sean From rdreier at cisco.com Tue Apr 4 14:56:24 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 14:56:24 -0700 Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune patch In-Reply-To: (Shirley Ma's message of "Tue, 4 Apr 2006 11:10:23 -0700") References: Message-ID: Shirley> IPoIB is not a real driver. Packets can be dropped from Shirley> dev xmit queue to QPs send queue directly. Why we use Shirley> tx_ring to induce overhead in send path? I don't follow how you can do this -- we need to keep track of skb so that when the send completes, we can release it. As Michael said we also need to keep track of DMA mapping so that we can unmap after the send is done. This implies some sort of data structure for each packet in the send queue, and allocating a circular buffer up front seems at least as good as keeping a linked list to me. - R. 
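The bookkeeping argument above (each posted send needs a record of its packet and its DMA mapping until the completion arrives) can be sketched as a fixed-size circular buffer. This is a user-space toy under stated assumptions, not the IPoIB tx_ring code: TX_RING_SIZE, struct tx_slot, tx_post and tx_complete are hypothetical names, the "DMA address" is just an integer, and completions are assumed to arrive in posting order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TX_RING_SIZE 4 /* hypothetical; must be a power of two */

/* What has to be remembered for each send that is still in flight. */
struct tx_slot {
    void    *pkt;       /* stands in for the skb to release on completion */
    uint64_t dma_addr;  /* stands in for the mapping to undo on completion */
};

struct tx_ring {
    struct tx_slot slot[TX_RING_SIZE];
    unsigned head;      /* count of sends posted so far */
    unsigned tail;      /* count of sends completed so far */
};

static void tx_ring_init(struct tx_ring *r)
{
    memset(r, 0, sizeof(*r));
}

/* Record a send in the next free slot; -1 if every slot is in flight. */
static int tx_post(struct tx_ring *r, void *pkt, uint64_t dma_addr)
{
    struct tx_slot *s;

    if (r->head - r->tail == TX_RING_SIZE)
        return -1;
    s = &r->slot[r->head & (TX_RING_SIZE - 1)];
    s->pkt = pkt;
    s->dma_addr = dma_addr;
    r->head++;
    return 0;
}

/* Retire the oldest in-flight send and hand its record back to the
 * caller, who would unmap dma_addr and free pkt; -1 if nothing is
 * in flight. */
static int tx_complete(struct tx_ring *r, struct tx_slot *done)
{
    struct tx_slot *s;

    if (r->head == r->tail)
        return -1;
    s = &r->slot[r->tail & (TX_RING_SIZE - 1)];
    *done = *s;
    s->pkt = NULL;
    s->dma_addr = 0;
    r->tail++;
    return 0;
}
```

Because the ring has as many slots as the send queue has entries, no allocation happens on the send path; a linked list would need a node allocated or recycled per packet to hold the same two fields.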
From bugzilla-daemon at openib.org Tue Apr 4 15:06:25 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Tue, 4 Apr 2006 15:06:25 -0700 (PDT) Subject: [openib-general] [Bug 32] IBED RC2 fails to build on FC4 2.6.15 Message-ID: <20060404220625.C17012283DC@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=32 ------- Additional Comments From kball at pathscale.com 2006-04-04 15:06 ------- Created an attachment (id=9) --> (http://openib.org/bugzilla/attachment.cgi?id=9&action=view) debug_info.tgz It looks like this tarball contains all useful logs, but if there is anything else that is desired to fix this let me know. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mst at mellanox.co.il Tue Apr 4 15:06:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 01:06:41 +0300 Subject: [openib-general] Re: [PATCH]: CMA deadlock In-Reply-To: References: <20060403082056.GZ14808@mellanox.co.il> Message-ID: <20060404220641.GA27248@mellanox.co.il> Quoting r. Sean Hefty : > Subject: [PATCH]: CMA deadlock > > Michael, > > Can you see if the following patch fixes your problem? Seems to work. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Tue Apr 4 15:05:47 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 04 Apr 2006 15:05:47 -0700 Subject: [openib-general] Re: SPAM: [PATCH] [RFC] - dapl - dat_ep_free() can return without freeing the endpoint In-Reply-To: References: <1144177443.6427.34.camel@stevo-desktop> Message-ID: <4432EDBB.4080304@ichips.intel.com> James Lentini wrote: >> void dapli_destroy_conn(struct dapl_cm_id *conn) >> { >> int in_callback; >>+ struct rdma_cm_id *cm_id; >> >> dapl_dbg_log(DAPL_DBG_TYPE_CM, >> " destroy_conn: conn %p id %d\n", >> conn,conn->cm_id); >>- >> dapl_os_lock(&conn->lock); >> conn->destroy = 1; >> in_callback = conn->in_callback; >>- dapl_os_unlock(&conn->lock); >>- >>- if (!in_callback) { >>- if (conn->ep) >>- conn->ep->cm_handle = IB_INVALID_HANDLE; >>- if (conn->cm_id) { >>- if (conn->cm_id->qp) >>- rdma_destroy_qp(conn->cm_id); >>- rdma_destroy_id(conn->cm_id); >>+ do { >>+ if (in_callback) { >>+ dapl_os_unlock(&conn->lock); >>+ usleep(10); >>+ dapl_os_lock(&conn->lock); In general this doesn't work. The calling thread may be the callback thread, which would lead to deadlock. This is why we don't just call rdma_destroy_id() directly, and let it wait for the callback to complete. - Sean From jgunthorpe at obsidianresearch.com Tue Apr 4 15:06:13 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 4 Apr 2006 16:06:13 -0600 Subject: [openib-general] APIC error In-Reply-To: <20060404210443.GB26741@langley.llnl.gov> References: <20060404210443.GB26741@langley.llnl.gov> Message-ID: <20060404220613.GB29808@obsidianresearch.com> On Tue, Apr 04, 2006 at 02:04:43PM -0700, Makia Minich wrote: > Using earlier BIOS versions, I don't see the issue at all, but on the latest > BIOS versions I'm dead in the water. So, I was hoping that someone else has > seen this issue and perhaphs might know of a workaround. 
You might want to check that the BIOS hasn't changed the APIC
configuration in some strange way between the two revs. The error code
you are getting is documented as 'that a message received by this APIC
was not accepted by this or any other APIC', which probably means the
encoded destination bits in the MSI address don't match what the APICs
have been programmed for...

I'd suggest dumping APIC registers ID (0x20), LOG_DEST (0xD0), and
DEST_FORMAT (0xE0) for your two BIOS versions [i.e. add some printks
to the verify_local_APIC function]. If they are different, then perhaps
a patch to the MSI code could be devised.

Also, it is probably worth capturing the MSI address with lspci -v and
comparing that between revs.

Do any IO-APIC based interrupts work (cat /proc/interrupts)?

The other thing you can try is to capture the IO APIC configuration
debug messages (i.e. the stuff from io_apic.c:print_IO_APIC) and see if
the debug output changes between BIOS revisions. That may give a clue
as to what is going on.

Good Luck,
Jason

From mst at mellanox.co.il Tue Apr 4 15:08:40 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 5 Apr 2006 01:08:40 +0300
Subject: [openib-general] Re: [PATCH] ib_addr: local/loopback address handling
In-Reply-To: <442AE3FF.6040703@ichips.intel.com>
References: <20060326142350.GT1802@mellanox.co.il> <20060326144906.GU1802@mellanox.co.il> <44282512.1040302@ichips.intel.com> <20060327180443.GA26455@mellanox.co.il> <44282E82.7060301@ichips.intel.com> <20060329081115.GB25712@mellanox.co.il> <442ACAF0.8080508@ichips.intel.com> <20060329192619.GD17318@mellanox.co.il> <442AE3FF.6040703@ichips.intel.com>
Message-ID: <20060404220840.GB27248@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: Re: [PATCH] ib_addr: local/loopback address handling
>
> Michael S.
Tsirkin wrote:
> >>     if (ZERONET(src_ip)) {
> >>         src_in->sin_family = dst_in->sin_family;
> >>         src_in->sin_addr.s_addr = dst_ip;
> >>         ret = copy_addr(addr, dev, dev->dev_addr);
> >>     } else if (LOOPBACK(src_ip)) {
> >>         ret = rdma_translate_ip((struct sockaddr *)dst_in, addr);
> >>         if (!ret)
> >>             memcpy(addr->dst_dev_addr, dev->dev_addr,
> >>                    MAX_ADDR_LEN);
> >>     } else {
> >>         ret = rdma_translate_ip((struct sockaddr *)src_in, addr);
> >>         if (!ret)
> >>             memcpy(addr->dst_dev_addr, dev->dev_addr,
> >>                    MAX_ADDR_LEN);
> >>     }
> >
> >This will put the IP of an actual IB device in the SDP hello message,
> >right?
> >I don't think we should have 127.0.0.1 there ...
>
> After some testing, this appears to work the way I expected anyway.
>
> My test system has a local IP address of 192.168.0.102.
>
> I could connect from 127.0.0.1 to 127.0.0.1, 192.168.0.102 to 127.0.0.1,
> and 127.0.0.1 to 192.168.0.102. I could not connect between 127.0.0.1 to
> 192.168.0.101, which is my other test system.
>
> I believe that this does carry 127.0.0.1 in the connection messages, since
> that's the address that we're bound to.

Sean, this wasn't ever checked in, was it?
If so, please do.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From swise at opengridcomputing.com Tue Apr 4 15:09:32 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 04 Apr 2006 17:09:32 -0500
Subject: [openib-general] Re: SPAM: [PATCH] [RFC] - dapl - dat_ep_free() can return without freeing the endpoint
In-Reply-To: <4432EDBB.4080304@ichips.intel.com>
References: <1144177443.6427.34.camel@stevo-desktop> <4432EDBB.4080304@ichips.intel.com>
Message-ID: <1144188572.7326.36.camel@stevo-desktop>

> In general this doesn't work. The calling thread may be the callback thread,
> which would lead to deadlock. This is why we don't just call rdma_destroy_id()
> directly, and let it wait for the callback to complete.

Ya'll are right. Lemme chew on this some more.
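
[Editor's sketch] The concern quoted above is that a destroy routine which waits for in_callback to clear can never finish if it is itself running on the callback thread. The code being removed by the patch avoids waiting: destroy only sets a flag while a callback is running, and the event loop releases the connection after the callback returns. A minimal model of that flag handoff might look like the following. All names here (struct conn, conn_request_destroy, conn_callback, event_loop_dispatch, conn_release) are hypothetical stand-ins, not uDAPL or librdmacm APIs, and locking is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct dapl_cm_id. The real code guards these
 * flags with dapl_os_lock(); locking is omitted to keep the sketch short. */
struct conn {
    bool destroy;      /* teardown has been requested */
    bool in_callback;  /* a CM callback is currently running */
    bool freed;        /* stands in for destroying the QP/cm_id and freeing */
};

static void conn_release(struct conn *c)
{
    assert(!c->freed);  /* catch a double release */
    c->freed = true;
}

/* Request teardown without ever waiting on in_callback. If a callback is
 * running (possibly on this very thread), only mark the connection and let
 * the dispatcher release it once the callback has returned. */
static void conn_request_destroy(struct conn *c)
{
    c->destroy = true;
    if (!c->in_callback)
        conn_release(c);
}

/* Returns nonzero when the dispatcher must release the connection. */
static int conn_callback(struct conn *c, bool free_from_callback)
{
    if (c->destroy)
        return 0;
    c->in_callback = true;
    if (free_from_callback)
        conn_request_destroy(c);  /* safe: only sets the flag */
    c->in_callback = false;
    return c->destroy ? 1 : 0;
}

/* Event loop side: the deferred release happens here, outside the callback. */
static void event_loop_dispatch(struct conn *c, bool free_from_callback)
{
    if (conn_callback(c, free_from_callback))
        conn_release(c);
}
```

In this model a teardown requested from inside the callback completes when event_loop_dispatch() returns, and a teardown requested from any other context completes immediately. Whether the callback thread can actually reach the destroy path in user-space DAPL is the open question in this thread; the sketch only illustrates why waiting in place is unsafe if it can.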
From sean.hefty at intel.com Tue Apr 4 15:09:53 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 4 Apr 2006 15:09:53 -0700
Subject: [openib-general] Re: [PATCH] ib_addr: local/loopback address handling
In-Reply-To: <20060404220840.GB27248@mellanox.co.il>
Message-ID:

>Sean, this wasn't ever checked in, was it?
>If so, please do.

These changes were committed. Let me know if you run into any other issues.

- Sean

From bugzilla-daemon at openib.org Tue Apr 4 15:22:30 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Tue, 4 Apr 2006 15:22:30 -0700 (PDT)
Subject: [openib-general] [Bug 32] IBED RC2 fails to build on FC4 2.6.15
Message-ID: <20060404222230.EF6192283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=32

kball at pathscale.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|blocker                     |major

------- Additional Comments From kball at pathscale.com 2006-04-04 15:22 -------
I forgot to note, this was done with the simple 'build all' option in build.sh.

I have worked around this to some extent by not building some of the packages,
and it appears I may be able to test some of what I want to test without this
being fixed, so I am lowering the severity.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From mshefty at ichips.intel.com Tue Apr 4 15:12:58 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 04 Apr 2006 15:12:58 -0700
Subject: [openib-general] Re: SPAM: [PATCH] [RFC] - dapl - dat_ep_free() can return without freeing the endpoint
In-Reply-To:
References: <1144177443.6427.34.camel@stevo-desktop>
Message-ID: <4432EF6A.5030909@ichips.intel.com>

James Lentini wrote:
>     /* ...
here */
>
>     if (cm_id) {
>         if (cm_id->qp)
>             rdma_destroy_qp(cm_id);
>         rdma_destroy_id(cm_id);
>     }
>     dapl_os_free(conn, sizeof(*conn));
> }
>
> Destroying the cm_id while in a callback would be bad.

rdma_destroy_id() will block if a callback is in progress. The issue is making
sure that this routine is not called from the callback thread.

There shouldn't be any issue calling rdma_destroy_qp() regardless of whether
we're in a callback or not, though. So, the fix may be to always call
rdma_destroy_qp() somewhere in this call path.

- Sean

From greg at kroah.com Tue Apr 4 15:12:13 2006
From: greg at kroah.com (Greg KH)
Date: Tue, 4 Apr 2006 15:12:13 -0700
Subject: [openib-general] Re: [stable] [PATCH -stable] Move destructor from neigh->ops to neigh_param
In-Reply-To: <20060403154741.GB14808@mellanox.co.il>
References: <20060403154741.GB14808@mellanox.co.il>
Message-ID: <20060404221213.GC6150@kroah.com>

On Mon, Apr 03, 2006 at 06:47:41PM +0300, Michael S. Tsirkin wrote:
> Hello, -stable team!
> The following patch is a backport from 2.6.17-rc1: it solves an oops/crash
> condition in ipoib that people are observing in 2.6.16/2.6.16.1.
>
> The patch exceeds the 100 line limit slightly, but only because it
> removes a static function which now becomes unused. Let me know if this
> is a problem.

queued to -stable, thanks.
greg k-h

From swise at opengridcomputing.com Tue Apr 4 15:21:56 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 04 Apr 2006 17:21:56 -0500
Subject: [openib-general] Re: SPAM: [PATCH] [RFC] - dapl - dat_ep_free() can return without freeing the endpoint
In-Reply-To: <4432EDBB.4080304@ichips.intel.com>
References: <1144177443.6427.34.camel@stevo-desktop> <4432EDBB.4080304@ichips.intel.com>
Message-ID: <1144189316.7326.43.camel@stevo-desktop>

On Tue, 2006-04-04 at 15:05 -0700, Sean Hefty wrote:
> James Lentini wrote:
> >> void dapli_destroy_conn(struct dapl_cm_id *conn)
> >> {
> >>     int in_callback;
> >>+    struct rdma_cm_id *cm_id;
> >>
> >>     dapl_dbg_log(DAPL_DBG_TYPE_CM,
> >>                  " destroy_conn: conn %p id %d\n",
> >>                  conn,conn->cm_id);
> >>-
> >>     dapl_os_lock(&conn->lock);
> >>     conn->destroy = 1;
> >>     in_callback = conn->in_callback;
> >>-    dapl_os_unlock(&conn->lock);
> >>-
> >>-    if (!in_callback) {
> >>-        if (conn->ep)
> >>-            conn->ep->cm_handle = IB_INVALID_HANDLE;
> >>-        if (conn->cm_id) {
> >>-            if (conn->cm_id->qp)
> >>-                rdma_destroy_qp(conn->cm_id);
> >>-            rdma_destroy_id(conn->cm_id);
> >>+    do {
> >>+        if (in_callback) {
> >>+            dapl_os_unlock(&conn->lock);
> >>+            usleep(10);
> >>+            dapl_os_lock(&conn->lock);
>
> In general this doesn't work. The calling thread may be the callback thread,
> which would lead to deadlock. This is why we don't just call rdma_destroy_id()
> directly, and let it wait for the callback to complete.

I'm not sure we can get a deadlock here. This is in user space. And the
"callback thread" is a thread created by the dapl library to process ibv
and cma events. It just posts events to a software EVD and maybe wakes
up a consumer thread. It's not like the kernel-mode direct function
calls.

Looking through the code, I don't see where dapli_destroy_conn() can be
called by the function that sets in_callback=1.
From swise at opengridcomputing.com Tue Apr 4 15:23:16 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 04 Apr 2006 17:23:16 -0500 Subject: [openib-general] Re: SPAM: [PATCH] [RFC] - dapl - dat_ep_free() can return without freeing the endpoint In-Reply-To: <4432EF6A.5030909@ichips.intel.com> References: <1144177443.6427.34.camel@stevo-desktop> <4432EF6A.5030909@ichips.intel.com> Message-ID: <1144189396.7326.46.camel@stevo-desktop> On Tue, 2006-04-04 at 15:12 -0700, Sean Hefty wrote: > James Lentini wrote: > > /* ... here */ > > > > if (cm_id) { > > if (cm_id->qp) > > rdma_destroy_qp(cm_id); > > rdma_destroy_id(cm_id); > > } > > dapl_os_free(conn, sizeof(*conn)); > > } > > > > Destroying the cm_id while in a callback would be bad. > > rdma_destroy_id() will block if a callback is in progress. The issue is making > sure that this routine is not called from the callback thread. > This is all in user mode. Does this issue still exist? > There shouldn't be any issue calling rdma_destroy_qp() regardless of if we're in > a callback or not though. So, the fix may be to always call rdma_destroy_qp() > somewhere in this call path. > > - Sean From swise at opengridcomputing.com Tue Apr 4 15:30:19 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 04 Apr 2006 17:30:19 -0500 Subject: [openib-general] Re: [PATCH] [RFC] - dapl - dat_ep_free() can return without freeing the endpoint In-Reply-To: References: <1144177443.6427.34.camel@stevo-desktop> Message-ID: <1144189819.7326.53.camel@stevo-desktop> > resources (PZ, etc.). We must have been getting lucky on IB. > > I'm worried that a callback could occur ... 
> > void dapli_destroy_conn(struct dapl_cm_id *conn) > { > int in_callback; > struct rdma_cm_id *cm_id; > > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " destroy_conn: conn %p id %d\n", > conn,conn->cm_id); > > dapl_os_lock(&conn->lock); > conn->destroy = 1; > in_callback = conn->in_callback; > do { > if (in_callback) { > dapl_os_unlock(&conn->lock); > usleep(10); > dapl_os_lock(&conn->lock); > } > in_callback = conn->in_callback; > } while (in_callback); > > if (conn->ep) > conn->ep->cm_handle = IB_INVALID_HANDLE; > cm_id = conn->cm_id; > conn->cm_id = NULL; > dapl_os_unlock(&conn->lock); > > /* ... here */ > > if (cm_id) { > if (cm_id->qp) > rdma_destroy_qp(cm_id); > rdma_destroy_id(cm_id); > } > dapl_os_free(conn, sizeof(*conn)); > } > > Destroying the cm_id while in a callback would be bad. > > > static struct dapl_cm_id * dapli_req_recv(struct dapl_cm_id *conn, > > @@ -243,11 +249,9 @@ > > return new_conn; > > } > > > > -static int dapli_cm_active_cb(struct dapl_cm_id *conn, > > +static void dapli_cm_active_cb(struct dapl_cm_id *conn, > > struct rdma_cm_event *event) > > { > > - int destroy; > > - > > dapl_dbg_log(DAPL_DBG_TYPE_CM, > > " active_cb: conn %p id %d event %d\n", > > conn, conn->cm_id, event->event ); > > @@ -255,7 +259,7 @@ > > dapl_os_lock(&conn->lock); > > if (conn->destroy) { > > dapl_os_unlock(&conn->lock); > > - return 0; > > + return; > > } > > conn->in_callback = 1; > > dapl_os_unlock(&conn->lock); > > @@ -303,16 +307,14 @@ > > } > > > > dapl_os_lock(&conn->lock); > > - destroy = conn->destroy; > > - conn->in_callback = conn->destroy; > > + conn->in_callback = 0; > > dapl_os_unlock(&conn->lock); > > - return(destroy); > > + return; > > } > > > > -static int dapli_cm_passive_cb(struct dapl_cm_id *conn, > > +static void dapli_cm_passive_cb(struct dapl_cm_id *conn, > > struct rdma_cm_event *event) > > { > > - int destroy; > > struct dapl_cm_id *new_conn; > > > > dapl_dbg_log(DAPL_DBG_TYPE_CM, > > @@ -322,7 +324,7 @@ > > dapl_os_lock(&conn->lock); > > if 
(conn->destroy) { > > dapl_os_unlock(&conn->lock); > > - return 0; > > + return; > > } > > conn->in_callback = 1; > > dapl_os_unlock(&conn->lock); > > @@ -377,10 +379,9 @@ > > } > > > > dapl_os_lock(&conn->lock); > > - destroy = conn->destroy; > > - conn->in_callback = conn->destroy; > > + conn->in_callback = 0; > > dapl_os_unlock(&conn->lock); > > - return(destroy); > > + return; > > } > > > > > > @@ -1021,7 +1022,6 @@ > > /* process one CM event, fairness */ > > if(!ret) { > > struct dapl_cm_id *conn; > > - int ret; > > > > /* set proper conn from cm_id context*/ > > if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) > > @@ -1059,24 +1059,9 @@ > > case RDMA_CM_EVENT_DISCONNECTED: > > /* passive or active */ > > if (conn->sp) > > - ret = dapli_cm_passive_cb(conn,event); > > + dapli_cm_passive_cb(conn,event); > > else > > - ret = dapli_cm_active_cb(conn,event); > > - > > - /* destroy both qp and cm_id */ > > - if (ret) { > > - dapl_dbg_log(DAPL_DBG_TYPE_CM, > > - " cma_cb: DESTROY conn %p" > > - " cm_id %p qp %p\n", > > - conn, conn->cm_id, > > - conn->cm_id->qp); > > - > > - if (conn->cm_id->qp) > > - rdma_destroy_qp(conn->cm_id); > > - > > - rdma_destroy_id(conn->cm_id); > > - dapl_os_free(conn, sizeof(*conn)); > > - } > > + dapli_cm_active_cb(conn,event); > > If this code is removed, we'll need to update functions that set > conn->destroy to 1 to destroy the cm_id. > I thought only dapli_destroy_conn() sets this, but now I see that dapli_thread() can also set this so I'll go check this out. > What happens if a consumer attempts to free the EP from a callback? There are no direct consumer callbacks in usermode are there? consumers call dat_evd_wait() or whatever and get scheduled. Not like kernel mode... Or am I confused? > With this change (or any one that blocked a callback thread from > attempting to free the EP), I believe we would deadlock. > > Is it possible to destroy CM IDs from within a callback? 
I'll go verify this again, but I don't think the callback dapl thread does this. Stevo. From rdreier at cisco.com Tue Apr 4 15:54:02 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 15:54:02 -0700 Subject: [openib-general] [PATCH] IB/ipath: Make more names static In-Reply-To: <1144091585.23196.0.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Mon, 03 Apr 2006 12:13:05 -0700") References: <1144091585.23196.0.camel@hematite.internal.keyresearch.com> Message-ID: And here are a couple of minor fixes to make the ipath driver build out of svn... Signed-off-by: Roland Dreier --- infiniband/hw/ipath/Makefile (revision 6230) +++ infiniband/hw/ipath/Makefile (working copy) @@ -1,5 +1,6 @@ EXTRA_CFLAGS += -DIPATH_IDSTR='"PathScale kernel.org driver"' \ - -DIPATH_KERN_TYPE=0 + -DIPATH_KERN_TYPE=0 \ + -Idrivers/infiniband/include obj-$(CONFIG_IPATH_CORE) += ipath_core.o obj-$(CONFIG_INFINIBAND_IPATH) += ib_ipath.o --- infiniband/hw/ipath/ipath_verbs.c (revision 6230) +++ infiniband/hw/ipath/ipath_verbs.c (working copy) @@ -1002,7 +1002,7 @@ static void *ipath_register_ib_device(in (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); - dev->node_type = IB_NODE_CA; + dev->node_type = RDMA_NODE_IB_CA; dev->phys_port_cnt = 1; dev->dma_device = ipath_layer_get_device(dd); dev->class_dev.dev = dev->dma_device; From rdreier at cisco.com Tue Apr 4 15:59:07 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 15:59:07 -0700 Subject: [openib-general] [PATCH] static rate encoding changes In-Reply-To: <20060324081644.GC31619@mellanox.co.il> (Michael S. 
Tsirkin's message of "Fri, 24 Mar 2006 10:16:44 +0200") References: <20060316165855.GA32324@mellanox.co.il> <1142529460.25297.117.camel@camp4.serpentine.com> <200603191759.36800.jackm@mellanox.co.il> <20060321154959.GC1802@mellanox.co.il> <20060324081644.GC31619@mellanox.co.il> Message-ID: Here's the static rate patch I have right now. Does anyone see issues with this? I think it can be justified for 2.6.17, since it fixes static rate handling for 4X DDR. Jack, I've reworked the mthca part quite significantly according to my particular taste, so please let me know if you think I've broken something... Thanks, Roland Push translation of static rate to HCA format into low-level drivers, where it belongs. For static rate encoding, use encoding of rate field from IB standard PathRecord, with addition of value 0, for backwards compatibility with current usage. The changes are: - Add enum ib_rate to midlayer includes. - Get rid of static rate translation in IPoIB; just use static rate directly from Path and MulticastGroup records. - Update mthca driver to translate absolute static rate into the format used by hardware. This also fixes mthca's static rate handling for HCAs that are capable of 4X DDR. 
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index cae0845..b78e7dc 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -45,6 +45,40 @@ #include #include +int ib_rate_to_mult(enum ib_rate rate) +{ + switch (rate) { + case IB_RATE_2_5_GBPS: return 1; + case IB_RATE_5_GBPS: return 2; + case IB_RATE_10_GBPS: return 4; + case IB_RATE_20_GBPS: return 8; + case IB_RATE_30_GBPS: return 12; + case IB_RATE_40_GBPS: return 16; + case IB_RATE_60_GBPS: return 24; + case IB_RATE_80_GBPS: return 32; + case IB_RATE_120_GBPS: return 48; + default: return -1; + } +} +EXPORT_SYMBOL(ib_rate_to_mult); + +enum ib_rate mult_to_ib_rate(int mult) +{ + switch (mult) { + case 1: return IB_RATE_2_5_GBPS; + case 2: return IB_RATE_5_GBPS; + case 4: return IB_RATE_10_GBPS; + case 8: return IB_RATE_20_GBPS; + case 12: return IB_RATE_30_GBPS; + case 16: return IB_RATE_40_GBPS; + case 24: return IB_RATE_60_GBPS; + case 32: return IB_RATE_80_GBPS; + case 48: return IB_RATE_120_GBPS; + default: return IB_RATE_PORT_CURRENT; + } +} +EXPORT_SYMBOL(mult_to_ib_rate); + /* Protection domains */ struct ib_pd *ib_alloc_pd(struct ib_device *device) diff --git a/drivers/infiniband/hw/mthca/mthca_av.c b/drivers/infiniband/hw/mthca/mthca_av.c index bc5bdcb..87e7c63 100644 --- a/drivers/infiniband/hw/mthca/mthca_av.c +++ b/drivers/infiniband/hw/mthca/mthca_av.c @@ -42,6 +42,20 @@ #include "mthca_dev.h" +enum { + MTHCA_RATE_TAVOR_FULL = 0, + MTHCA_RATE_TAVOR_1X = 1, + MTHCA_RATE_TAVOR_4X = 2, + MTHCA_RATE_TAVOR_1X_DDR = 3 +}; + +enum { + MTHCA_RATE_MEMFREE_FULL = 0, + MTHCA_RATE_MEMFREE_QUARTER = 1, + MTHCA_RATE_MEMFREE_EIGHTH = 2, + MTHCA_RATE_MEMFREE_HALF = 3 +}; + struct mthca_av { __be32 port_pd; u8 reserved1; @@ -55,6 +69,86 @@ struct mthca_av { __be32 dgid[4]; }; +static enum ib_rate memfree_rate_to_ib(u8 mthca_rate, u8 port_rate) +{ + switch (mthca_rate) { + case MTHCA_RATE_MEMFREE_EIGHTH: return port_rate / 8; + case 
MTHCA_RATE_MEMFREE_QUARTER: return port_rate / 4; + case MTHCA_RATE_MEMFREE_HALF: return port_rate / 2; + case MTHCA_RATE_MEMFREE_FULL: return port_rate; + default: return port_rate; + } +} + +static enum ib_rate tavor_rate_to_ib(u8 mthca_rate, u8 port_rate) +{ + switch (mthca_rate) { + case MTHCA_RATE_TAVOR_1X: return IB_RATE_2_5_GBPS; + case MTHCA_RATE_TAVOR_1X_DDR: return IB_RATE_5_GBPS; + case MTHCA_RATE_TAVOR_4X: return IB_RATE_10_GBPS; + default: return port_rate; + } +} + +enum ib_rate mthca_rate_to_ib(struct mthca_dev *dev, u8 mthca_rate, u8 port) +{ + if (mthca_is_memfree(dev)) { + /* Handle old Arbel FW */ + if (dev->limits.stat_rate_support == 0x3 && mthca_rate) + return IB_RATE_2_5_GBPS; + + return memfree_rate_to_ib(mthca_rate, dev->rate[port - 1]); + } else + return tavor_rate_to_ib(mthca_rate, dev->rate[port - 1]); +} + +static u8 ib_rate_to_memfree(u8 req_rate, u8 cur_rate) +{ + if (cur_rate <= req_rate) + return 0; + + /* + * Inter-packet delay (IPD) to get from rate X down to a rate + * no more than Y is (X - 1) / Y. 
+ */ + switch ((cur_rate - 1) / req_rate) { + case 0: return MTHCA_RATE_MEMFREE_FULL; + case 1: return MTHCA_RATE_MEMFREE_HALF; + case 2: /* fall through */ + case 3: return MTHCA_RATE_MEMFREE_QUARTER; + default: return MTHCA_RATE_MEMFREE_EIGHTH; + } +} + +static u8 ib_rate_to_tavor(u8 static_rate) +{ + switch (static_rate) { + case IB_RATE_2_5_GBPS: return MTHCA_RATE_TAVOR_1X; + case IB_RATE_5_GBPS: return MTHCA_RATE_TAVOR_1X_DDR; + case IB_RATE_10_GBPS: return MTHCA_RATE_TAVOR_4X; + default: return MTHCA_RATE_TAVOR_FULL; + } +} + +u8 mthca_get_rate(struct mthca_dev *dev, int static_rate, u8 port) +{ + u8 rate; + + if (!static_rate || ib_rate_to_mult(static_rate) >= dev->rate[port - 1]) + return 0; + + if (mthca_is_memfree(dev)) + rate = ib_rate_to_memfree(ib_rate_to_mult(static_rate), + dev->rate[port - 1]); + else + rate = ib_rate_to_tavor(static_rate); + + if (!(dev->limits.stat_rate_support & (1 << rate))) + rate = 1; + + return rate; +} + int mthca_create_ah(struct mthca_dev *dev, struct mthca_pd *pd, struct ib_ah_attr *ah_attr, @@ -107,7 +201,7 @@ on_hca_fail: av->g_slid = ah_attr->src_path_bits; av->dlid = cpu_to_be16(ah_attr->dlid); av->msg_sr = (3 << 4) | /* 2K message */ - ah_attr->static_rate; + mthca_get_rate(dev, ah_attr->static_rate, ah_attr->port_num); av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); if (ah_attr->ah_flags & IB_AH_GRH) { av->g_slid |= 0x80; diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index 343eca5..1985b5d 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -965,6 +965,7 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev u32 *outbox; u8 field; u16 size; + u16 stat_rate; int err; #define QUERY_DEV_LIM_OUT_SIZE 0x100 @@ -995,6 +996,7 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev #define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 #define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 #define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define 
QUERY_DEV_LIM_RATE_SUPPORT_OFFSET 0x3c #define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f #define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 #define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 @@ -1086,6 +1088,8 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev dev_lim->num_ports = field & 0xf; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(stat_rate, outbox, QUERY_DEV_LIM_RATE_SUPPORT_OFFSET); + dev_lim->stat_rate_support = stat_rate; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); dev_lim->max_pkeys = 1 << (field & 0xf); MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.h b/drivers/infiniband/hw/mthca/mthca_cmd.h index e4ec35c..2f976f2 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.h +++ b/drivers/infiniband/hw/mthca/mthca_cmd.h @@ -146,6 +146,7 @@ struct mthca_dev_lim { int max_vl; int num_ports; int max_gids; + u16 stat_rate_support; int max_pkeys; u32 flags; int reserved_uars; diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index ad52edb..d3df395 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -172,6 +172,7 @@ struct mthca_limits { int reserved_pds; u32 page_size_cap; u32 flags; + u16 stat_rate_support; u8 port_width_cap; }; @@ -353,6 +354,7 @@ struct mthca_dev { struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; spinlock_t sm_lock; + u8 rate[MTHCA_MAX_PORTS]; }; #define mthca_dbg(mdev, format, arg...) 
\ @@ -542,6 +544,8 @@ int mthca_read_ah(struct mthca_dev *dev, struct ib_ud_header *header); int mthca_ah_query(struct ib_ah *ibah, struct ib_ah_attr *attr); int mthca_ah_grh_present(struct mthca_ah *ah); +u8 mthca_get_rate(struct mthca_dev *dev, int static_rate, u8 port); +enum ib_rate mthca_rate_to_ib(struct mthca_dev *dev, u8 mthca_rate, u8 port); int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); diff --git a/drivers/infiniband/hw/mthca/mthca_mad.c b/drivers/infiniband/hw/mthca/mthca_mad.c index dfb482e..f235c7e 100644 --- a/drivers/infiniband/hw/mthca/mthca_mad.c +++ b/drivers/infiniband/hw/mthca/mthca_mad.c @@ -49,6 +49,30 @@ enum { MTHCA_VENDOR_CLASS2 = 0xa }; +int mthca_update_rate(struct mthca_dev *dev, u8 port_num) +{ + struct ib_port_attr *tprops = NULL; + int ret; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + return -ENOMEM; + + ret = ib_query_port(&dev->ib_dev, port_num, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed (%d) for %s port %d\n", + ret, dev->ib_dev.name, port_num); + goto out; + } + + dev->rate[port_num - 1] = tprops->active_speed * + ib_width_enum_to_int(tprops->active_width); + +out: + kfree(tprops); + return ret; +} + static void update_sm_ah(struct mthca_dev *dev, u8 port_num, u16 lid, u8 sl) { @@ -90,6 +114,7 @@ static void smp_snoop(struct ib_device * mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && mad->mad_hdr.method == IB_MGMT_METHOD_SET) { if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) { + mthca_update_rate(to_mdev(ibdev), port_num); update_sm_ah(to_mdev(ibdev), port_num, be16_to_cpup((__be16 *) (mad->data + 58)), (*(u8 *) (mad->data + 76)) & 0xf); @@ -246,6 +271,7 @@ int mthca_create_agents(struct mthca_dev { struct ib_mad_agent *agent; int p, q; + int ret; spin_lock_init(&dev->sm_lock); @@ -255,11 +281,23 @@ int mthca_create_agents(struct mthca_dev q ? 
IB_QPT_GSI : IB_QPT_SMI, NULL, 0, send_handler, NULL, NULL); - if (IS_ERR(agent)) + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); goto err; + } dev->send_agent[p][q] = agent; } + + for (p = 1; p <= dev->limits.num_ports; ++p) { + ret = mthca_update_rate(dev, p); + if (ret) { + mthca_err(dev, "Failed to obtain port %d rate." + " aborting.\n", p); + goto err; + } + } + return 0; err: @@ -268,7 +306,7 @@ err: if (dev->send_agent[p][q]) ib_unregister_mad_agent(dev->send_agent[p][q]); - return PTR_ERR(agent); + return ret; } void __devexit mthca_free_agents(struct mthca_dev *dev) diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 266f347..279de23 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -191,6 +191,18 @@ static int __devinit mthca_dev_lim(struc mdev->limits.port_width_cap = dev_lim->max_port_width; mdev->limits.page_size_cap = ~(u32) (dev_lim->min_page_sz - 1); mdev->limits.flags = dev_lim->flags; + /* + * For old FW that doesn't return static rate support, use a + * value of 0x3 (only static rate values of 0 or 1 are handled), + * except on Sinai, where even old FW can handle static rate + * values of 2 and 3. + */ + if (dev_lim->stat_rate_support) + mdev->limits.stat_rate_support = dev_lim->stat_rate_support; + else if (mdev->mthca_flags & MTHCA_FLAG_SINAI_OPT) + mdev->limits.stat_rate_support = 0xf; + else + mdev->limits.stat_rate_support = 0x3; /* IB_DEVICE_RESIZE_MAX_WR not supported by driver. May be doable since hardware supports it for SRQ. 
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h index 2e7f521..6676a78 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.h +++ b/drivers/infiniband/hw/mthca/mthca_provider.h @@ -257,6 +257,8 @@ struct mthca_qp { atomic_t refcount; u32 qpn; int is_direct; + u8 port; /* for SQP and memfree use only */ + u8 alt_port; /* for memfree use only */ u8 transport; u8 state; u8 atomic_rd_en; @@ -278,7 +280,6 @@ struct mthca_qp { struct mthca_sqp { struct mthca_qp qp; - int port; int pkey_index; u32 qkey; u32 send_psn; diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 057c8e6..f37b0e3 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -248,6 +248,9 @@ void mthca_qp_event(struct mthca_dev *de return; } + if (event_type == IB_EVENT_PATH_MIG) + qp->port = qp->alt_port; + event.device = &dev->ib_dev; event.event = event_type; event.element.qp = &qp->ibqp; @@ -392,10 +395,16 @@ static void to_ib_ah_attr(struct mthca_d { memset(ib_ah_attr, 0, sizeof *path); ib_ah_attr->port_num = (be32_to_cpu(path->port_pkey) >> 24) & 0x3; + + if (ib_ah_attr->port_num == 0 || ib_ah_attr->port_num > dev->limits.num_ports) + return; + ib_ah_attr->dlid = be16_to_cpu(path->rlid); ib_ah_attr->sl = be32_to_cpu(path->sl_tclass_flowlabel) >> 28; ib_ah_attr->src_path_bits = path->g_mylmc & 0x7f; - ib_ah_attr->static_rate = path->static_rate & 0x7; + ib_ah_attr->static_rate = mthca_rate_to_ib(dev, + path->static_rate & 0x7, + ib_ah_attr->port_num); ib_ah_attr->ah_flags = (path->g_mylmc & (1 << 7)) ? 
IB_AH_GRH : 0; if (ib_ah_attr->ah_flags) { ib_ah_attr->grh.sgid_index = path->mgid_index & (dev->limits.gid_table_len - 1); @@ -455,8 +464,10 @@ int mthca_query_qp(struct ib_qp *ibqp, s qp_attr->cap.max_recv_sge = qp->rq.max_gs; qp_attr->cap.max_inline_data = qp->max_inline_data; - to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path); - to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path); + if (qp->transport == RC || qp->transport == UC) { + to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path); + to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path); + } qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 0x7f; qp_attr->alt_pkey_index = be32_to_cpu(context->alt_path.port_pkey) & 0x7f; @@ -484,11 +495,11 @@ out: } static int mthca_path_set(struct mthca_dev *dev, struct ib_ah_attr *ah, - struct mthca_qp_path *path) + struct mthca_qp_path *path, u8 port) { path->g_mylmc = ah->src_path_bits & 0x7f; path->rlid = cpu_to_be16(ah->dlid); - path->static_rate = !!ah->static_rate; + path->static_rate = mthca_get_rate(dev, ah->static_rate, port); if (ah->ah_flags & IB_AH_GRH) { if (ah->grh.sgid_index >= dev->limits.gid_table_len) { @@ -634,7 +645,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (qp->transport == MLX) qp_context->pri_path.port_pkey |= - cpu_to_be32(to_msqp(qp)->port << 24); + cpu_to_be32(qp->port << 24); else { if (attr_mask & IB_QP_PORT) { qp_context->pri_path.port_pkey |= @@ -657,7 +668,8 @@ int mthca_modify_qp(struct ib_qp *ibqp, } if (attr_mask & IB_QP_AV) { - if (mthca_path_set(dev, &attr->ah_attr, &qp_context->pri_path)) + if (mthca_path_set(dev, &attr->ah_attr, &qp_context->pri_path, + attr_mask & IB_QP_PORT ? 
attr->port_num : qp->port)) return -EINVAL; qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); @@ -681,7 +693,8 @@ int mthca_modify_qp(struct ib_qp *ibqp, return -EINVAL; } - if (mthca_path_set(dev, &attr->alt_ah_attr, &qp_context->alt_path)) + if (mthca_path_set(dev, &attr->alt_ah_attr, &qp_context->alt_path, + attr->alt_ah_attr.port_num)) return -EINVAL; qp_context->alt_path.port_pkey |= cpu_to_be32(attr->alt_pkey_index | @@ -791,6 +804,10 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp->atomic_rd_en = attr->qp_access_flags; if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) qp->resp_depth = attr->max_dest_rd_atomic; + if (attr_mask & IB_QP_PORT) + qp->port = attr->port_num; + if (attr_mask & IB_QP_ALT_PATH) + qp->alt_port = attr->alt_port_num; if (is_sqp(dev, qp)) store_attrs(to_msqp(qp), attr, attr_mask); @@ -802,13 +819,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (is_qp0(dev, qp)) { if (cur_state != IB_QPS_RTR && new_state == IB_QPS_RTR) - init_port(dev, to_msqp(qp)->port); + init_port(dev, qp->port); if (cur_state != IB_QPS_RESET && cur_state != IB_QPS_ERR && (new_state == IB_QPS_RESET || new_state == IB_QPS_ERR)) - mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + mthca_CLOSE_IB(dev, qp->port, &status); } /* @@ -1212,6 +1229,9 @@ int mthca_alloc_qp(struct mthca_dev *dev if (qp->qpn == -1) return -ENOMEM; + /* initialize port to zero for error-catching. 
*/ + qp->port = 0; + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, send_policy, qp); if (err) { @@ -1261,7 +1281,7 @@ int mthca_alloc_sqp(struct mthca_dev *de if (err) goto err_out; - sqp->port = port; + sqp->qp.port = port; sqp->qp.qpn = mqpn; sqp->qp.transport = MLX; @@ -1404,10 +1424,10 @@ static int build_mlx_header(struct mthca sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE; sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); if (!sqp->qp.ibqp.qp_num) - ib_get_cached_pkey(&dev->ib_dev, sqp->port, + ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port, sqp->pkey_index, &pkey); else - ib_get_cached_pkey(&dev->ib_dev, sqp->port, + ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey); sqp->ud_header.bth.pkey = cpu_to_be16(pkey); sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_fs.c b/drivers/infiniband/ulp/ipoib/ipoib_fs.c index 685258e..5dde380 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_fs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_fs.c @@ -213,7 +213,7 @@ static int ipoib_path_seq_show(struct se gid_buf, path.pathrec.dlid ? 
"yes" : "no"); if (path.pathrec.dlid) { - rate = ib_sa_rate_enum_to_int(path.pathrec.rate) * 25; + rate = ib_rate_to_mult(path.pathrec.rate) * 25; seq_printf(file, " DLID: 0x%04x\n" diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 9b0bd7c..f13b9bb 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -373,16 +373,9 @@ static void path_rec_completion(int stat struct ib_ah_attr av = { .dlid = be16_to_cpu(pathrec->dlid), .sl = pathrec->sl, - .port_num = priv->port + .port_num = priv->port, + .static_rate = pathrec->rate }; - int path_rate = ib_sa_rate_enum_to_int(pathrec->rate); - - if (path_rate > 0 && priv->local_rate > path_rate) - av.static_rate = (priv->local_rate - 1) / path_rate; - - ipoib_dbg(priv, "static_rate %d for local port %dX, path %dX\n", - av.static_rate, priv->local_rate, - ib_sa_rate_enum_to_int(pathrec->rate)); ah = ipoib_create_ah(dev, priv->pd, &av); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 93c462e..a6a9d09 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -251,6 +251,7 @@ static int ipoib_mcast_join_finish(struc .port_num = priv->port, .sl = mcast->mcmember.sl, .ah_flags = IB_AH_GRH, + .static_rate = mcast->mcmember.rate, .grh = { .flow_label = be32_to_cpu(mcast->mcmember.flow_label), .hop_limit = mcast->mcmember.hop_limit, @@ -258,17 +259,8 @@ static int ipoib_mcast_join_finish(struc .traffic_class = mcast->mcmember.traffic_class } }; - int path_rate = ib_sa_rate_enum_to_int(mcast->mcmember.rate); - av.grh.dgid = mcast->mcmember.mgid; - if (path_rate > 0 && priv->local_rate > path_rate) - av.static_rate = (priv->local_rate - 1) / path_rate; - - ipoib_dbg_mcast(priv, "static_rate %d for local port %dX, mcmember %dX\n", - av.static_rate, priv->local_rate, - 
ib_sa_rate_enum_to_int(mcast->mcmember.rate)); - ah = ipoib_create_ah(dev, priv->pd, &av); if (!ah) { ipoib_warn(priv, "ib_address_create failed\n"); diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index f404fe2..ad63c21 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -91,34 +91,6 @@ enum ib_sa_selector { IB_SA_BEST = 3 }; -enum ib_sa_rate { - IB_SA_RATE_2_5_GBPS = 2, - IB_SA_RATE_5_GBPS = 5, - IB_SA_RATE_10_GBPS = 3, - IB_SA_RATE_20_GBPS = 6, - IB_SA_RATE_30_GBPS = 4, - IB_SA_RATE_40_GBPS = 7, - IB_SA_RATE_60_GBPS = 8, - IB_SA_RATE_80_GBPS = 9, - IB_SA_RATE_120_GBPS = 10 -}; - -static inline int ib_sa_rate_enum_to_int(enum ib_sa_rate rate) -{ - switch (rate) { - case IB_SA_RATE_2_5_GBPS: return 1; - case IB_SA_RATE_5_GBPS: return 2; - case IB_SA_RATE_10_GBPS: return 4; - case IB_SA_RATE_20_GBPS: return 8; - case IB_SA_RATE_30_GBPS: return 12; - case IB_SA_RATE_40_GBPS: return 16; - case IB_SA_RATE_60_GBPS: return 24; - case IB_SA_RATE_80_GBPS: return 32; - case IB_SA_RATE_120_GBPS: return 48; - default: return -1; - } -} - /* * Structures for SA records are named "struct ib_sa_xxx_rec." No * attempt is made to pack structures to match the physical layout of diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index c1ad627..6bbf1b3 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -314,6 +314,34 @@ enum ib_ah_flags { IB_AH_GRH = 1 }; +enum ib_rate { + IB_RATE_PORT_CURRENT = 0, + IB_RATE_2_5_GBPS = 2, + IB_RATE_5_GBPS = 5, + IB_RATE_10_GBPS = 3, + IB_RATE_20_GBPS = 6, + IB_RATE_30_GBPS = 4, + IB_RATE_40_GBPS = 7, + IB_RATE_60_GBPS = 8, + IB_RATE_80_GBPS = 9, + IB_RATE_120_GBPS = 10 +}; + +/** + * ib_rate_to_mult - Convert the IB rate enum to a multiple of the + * base rate of 2.5 Gbit/sec. For example, IB_RATE_5_GBPS will be + * converted to 2, since 5 Gbit/sec is 2 * 2.5 Gbit/sec. + * @rate: rate to convert. 
+ */ +int ib_rate_to_mult(enum ib_rate rate) __attribute_const__; + +/** + * mult_to_ib_rate - Convert a multiple of 2.5 Gbit/sec to an IB rate + * enum. + * @mult: multiple to convert. + */ +enum ib_rate mult_to_ib_rate(int mult) __attribute_const__; + struct ib_ah_attr { struct ib_global_route grh; u16 dlid; From rjwalsh at pathscale.com Tue Apr 4 16:16:58 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Tue, 04 Apr 2006 16:16:58 -0700 Subject: [openib-general] [PATCH] IB/ipath: Make more names static In-Reply-To: References: <1144091585.23196.0.camel@hematite.internal.keyresearch.com> Message-ID: <1144192619.12594.27.camel@hematite.internal.keyresearch.com> > - dev->node_type = IB_NODE_CA; > + dev->node_type = RDMA_NODE_IB_CA; Which of these #defines is valid in 2.6.17? Just curious. What we have in svn matches what you put in your git tree, you see... Perhaps an #ifdef until 2.6.17 goes through? Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 483 bytes Desc: This is a digitally signed message part URL: From rdreier at cisco.com Tue Apr 4 16:19:15 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 16:19:15 -0700 Subject: [openib-general] [PATCH] IB/ipath: Make more names static In-Reply-To: <1144192619.12594.27.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Tue, 04 Apr 2006 16:16:58 -0700") References: <1144091585.23196.0.camel@hematite.internal.keyresearch.com> <1144192619.12594.27.camel@hematite.internal.keyresearch.com> Message-ID: Robert> Which of these #defines is valid in 2.6.17? Just curious. Robert> What we have in svn matches what you put in your git tree, Robert> you see... Perhaps an #ifdef until 2.6.17 goes through? 
2.6.17 still has IB_NODE_CA. The svn tree has some iWARP-related changes that make it RDMA_NODE_IB_CA, but those haven't been merged upstream and won't be in 2.6.17. - R. From sean.hefty at intel.com Tue Apr 4 16:19:08 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 4 Apr 2006 16:19:08 -0700 Subject: [openib-general] [PATCH] IB/ipath: Make more names static In-Reply-To: <1144192619.12594.27.camel@hematite.internal.keyresearch.com> Message-ID: >> - dev->node_type = IB_NODE_CA; >> + dev->node_type = RDMA_NODE_IB_CA; > >Which of these #defines is valid in 2.6.17? Just curious. What we have >in svn matches what you put in your git tree, you see... Perhaps an >#ifdef until 2.6.17 goes through? 2.6.17 has IB_NODE_CA. The tip of svn has RDMA_NODE_IB_CA. - Sean From rjwalsh at pathscale.com Tue Apr 4 16:20:13 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Tue, 04 Apr 2006 16:20:13 -0700 Subject: [openib-general] [PATCH] IB/ipath: Make more names static In-Reply-To: References: <1144091585.23196.0.camel@hematite.internal.keyresearch.com> <1144192619.12594.27.camel@hematite.internal.keyresearch.com> Message-ID: <1144192813.12594.30.camel@hematite.internal.keyresearch.com> On Tue, 2006-04-04 at 16:19 -0700, Roland Dreier wrote: > Robert> Which of these #defines is valid in 2.6.17? Just curious. > Robert> What we have in svn matches what you put in your git tree, > Robert> you see... Perhaps an #ifdef until 2.6.17 goes through? > > 2.6.17 still has IB_NODE_CA. The svn tree has some iWARP-related > changes that make it RDMA_NODE_IB_CA, but those haven't been merged > upstream and won't be in 2.6.17. Would an ifdef be OK? -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043.
From rdreier at cisco.com Tue Apr 4 16:24:26 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 16:24:26 -0700 Subject: [openib-general] [PATCH] IB/ipath: Make more names static In-Reply-To: <1144192813.12594.30.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Tue, 04 Apr 2006 16:20:13 -0700") References: <1144091585.23196.0.camel@hematite.internal.keyresearch.com> <1144192619.12594.27.camel@hematite.internal.keyresearch.com> <1144192813.12594.30.camel@hematite.internal.keyresearch.com> Message-ID: Robert> Would an ifdef be OK? Sure, but I'm not sure what you can test on, and I'm not sure what it buys you -- you still have svn diverging from the upstream kernel, except svn has an #ifdef too. - R. From weiny2 at llnl.gov Tue Apr 4 16:27:40 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 04 Apr 2006 16:27:40 -0700 Subject: [openib-general] Re: SRP compile warning in Branch 1.0 In-Reply-To: <20060304231742.GB11216@mellanox.co.il> References: <20060302140044.72dcac10.weiny2@llnl.gov> <000001c63e50$8c826760$cba1070a@amr.corp.intel.com> <20060304231742.GB11216@mellanox.co.il> Message-ID: <20060404162740.3afa1c90.weiny2@llnl.gov> On Sun, 05 Mar 2006 01:17:42 +0200 "Michael S. Tsirkin" wrote: > Quoting r. Bob Woodruff : > > I am not sure the backport for SRP is completely correct. > > I have never tested it as I don't have a target or time to test it > > right now. > > An open-source gen1 based SRP target implementation can be found under > https://openib.org/svn/trunk/contrib/mellanox/gen1/ib_srpt/ > We needed an SRP target and using the above and a gen1 stack supplied by Mellanox I have gotten SRP working. However, in response to my original post about the compile error... 
> > /tftpboot/weiny2/openib/openib-modules/openib/infiniband/ulp/srp/ib_srp.c: In function `srp_add_target': > /tftpboot/weiny2/openib/openib-modules/openib/infiniband/ulp/srp/ib_srp.c:1263: warning: passing arg 1 of `scsi_scan_target' from incompatible pointer type > I had to apply the following patch. Indeed my initial assumption was correct. Thanks, Ira =================================================================== --- openib/infiniband/ulp/srp/ib_srp.c (revision 2070) +++ openib/infiniband/ulp/srp/ib_srp.c (working copy) @@ -1259,8 +1259,7 @@ target->state = SRP_TARGET_LIVE; /* XXX: are we supposed to have a definition of SCAN_WILD_CARD ?? */ - scsi_scan_target(&target->scsi_host->shost_gendev, - 0, target->scsi_id, ~0, 0); + scsi_scan_target(target->scsi_host, 0, target->scsi_id, ~0, 0); return 0; } From xma at us.ibm.com Tue Apr 4 16:31:47 2006 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 4 Apr 2006 17:31:47 -0600 Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune patch In-Reply-To: Message-ID: Roland, Here is the updated patch for review. I have updated the max value to 8k and min to 32. Attachment is for you to apply the patch. Signed-off-by: Shirley Ma diff -urpN infiniband/ulp/ipoib/ipoib.h infiniband-queue/ulp/ipoib/ipoib.h --- infiniband/ulp/ipoib/ipoib.h 2006-03-26 11:57:15.000000000 -0800 +++ infiniband-queue/ulp/ipoib/ipoib.h 2006-04-04 16:53:24.702300792 -0700 @@ -338,6 +338,8 @@ static inline void ipoib_unregister_debu #define ipoib_warn(priv, format, arg...) 
\ ipoib_printk(KERN_WARNING, priv, format , ## arg) +extern int ipoib_sendq_size; +extern int ipoib_recvq_size; #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG extern int ipoib_debug_level; diff -urpN infiniband/ulp/ipoib/ipoib_ib.c infiniband-queue/ulp/ipoib/ipoib_ib.c --- infiniband/ulp/ipoib/ipoib_ib.c 2006-03-26 11:57:15.000000000 -0800 +++ infiniband-queue/ulp/ipoib/ipoib_ib.c 2006-04-04 16:56:49.475170584 -0700 @@ -161,7 +161,7 @@ static int ipoib_ib_post_receives(struct struct ipoib_dev_priv *priv = netdev_priv(dev); int i; - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + for (i = 0; i < ipoib_recvq_size; ++i) { if (ipoib_alloc_rx_skb(dev, i)) { ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); return -ENOMEM; @@ -187,7 +187,7 @@ static void ipoib_ib_handle_wc(struct ne if (wr_id & IPOIB_OP_RECV) { wr_id &= ~IPOIB_OP_RECV; - if (wr_id < IPOIB_RX_RING_SIZE) { + if (wr_id < ipoib_recvq_size) { struct sk_buff *skb = priv->rx_ring[wr_id].skb; dma_addr_t addr = priv->rx_ring[wr_id].mapping; @@ -252,9 +252,9 @@ static void ipoib_ib_handle_wc(struct ne struct ipoib_tx_buf *tx_req; unsigned long flags; - if (wr_id >= IPOIB_TX_RING_SIZE) { + if (wr_id >= ipoib_sendq_size) { ipoib_warn(priv, "completion event with wrid %d (> %d)\n", - wr_id, IPOIB_TX_RING_SIZE); + wr_id, ipoib_sendq_size); return; } @@ -275,7 +275,7 @@ static void ipoib_ib_handle_wc(struct ne spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; if (netif_queue_stopped(dev) && - priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + priv->tx_head - priv->tx_tail <= ipoib_sendq_size / 2) netif_wake_queue(dev); spin_unlock_irqrestore(&priv->tx_lock, flags); @@ -344,13 +344,13 @@ void ipoib_send(struct net_device *dev, * means we have to make sure everything is properly recorded and * our state is consistent before we call post_send(). 
*/ - tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)]; tx_req->skb = skb; addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, DMA_TO_DEVICE); pci_unmap_addr_set(tx_req, mapping, addr); - if (unlikely(post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), address->ah, qpn, addr, skb->len))) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; @@ -363,7 +363,7 @@ void ipoib_send(struct net_device *dev, address->last_send = priv->tx_head; ++priv->tx_head; - if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) { ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); netif_stop_queue(dev); } @@ -488,7 +488,7 @@ static int recvs_pending(struct net_devi int pending = 0; int i; - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + for (i = 0; i < ipoib_recvq_size; ++i) if (priv->rx_ring[i].skb) ++pending; @@ -527,7 +527,7 @@ int ipoib_ib_dev_stop(struct net_device */ while ((int) priv->tx_tail - (int) priv->tx_head < 0) { tx_req = &priv->tx_ring[priv->tx_tail & - (IPOIB_TX_RING_SIZE - 1)]; + (ipoib_sendq_size - 1)]; dma_unmap_single(priv->ca->dma_device, pci_unmap_addr(tx_req, mapping), tx_req->skb->len, @@ -536,7 +536,7 @@ int ipoib_ib_dev_stop(struct net_device ++priv->tx_tail; } - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + for (i = 0; i < ipoib_recvq_size; ++i) if (priv->rx_ring[i].skb) { dma_unmap_single(priv->ca->dma_device, pci_unmap_addr(&priv->rx_ring[i], diff -urpN infiniband/ulp/ipoib/ipoib_main.c infiniband-queue/ulp/ipoib/ipoib_main.c --- infiniband/ulp/ipoib/ipoib_main.c 2006-03-28 19:20:21.000000000 -0800 +++ infiniband-queue/ulp/ipoib/ipoib_main.c 2006-04-04 17:17:14.643916624 -0700 @@ -41,6 +41,7 @@ #include #include #include +#include #include /* For ARPHRD_xxx */ @@ -53,6 +54,17 @@ MODULE_AUTHOR("Roland Dreier"); 
MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); MODULE_LICENSE("Dual BSD/GPL"); +#define IPOIB_MAX_QUEUE_SIZE 8192 /* max is 8k */ +#define IPOIB_MIN_QUEUE_SIZE 32 /* min is 32 */ + +int ipoib_sendq_size = IPOIB_TX_RING_SIZE; +int ipoib_recvq_size = IPOIB_RX_RING_SIZE; + +module_param_named(sendq_size, ipoib_sendq_size, int, 0444); +MODULE_PARM_DESC(sendq_size, "Number of wqe in send queue"); +module_param_named(recvq_size, ipoib_recvq_size, int, 0444); +MODULE_PARM_DESC(recvq_size, "Number of wqe in receive queue"); + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG int ipoib_debug_level; @@ -843,19 +855,39 @@ int ipoib_dev_init(struct net_device *de /* Allocate RX/TX "rings" to hold queued skbs */ - priv->rx_ring = kzalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf), + if (ipoib_recvq_size > IPOIB_MAX_QUEUE_SIZE) { + ipoib_recvq_size = IPOIB_MAX_QUEUE_SIZE; + printk(KERN_WARNING "%s: ipoib_recvq_size is too big, use max %d instead\n", ca->name, IPOIB_MAX_QUEUE_SIZE); + } + if (ipoib_recvq_size < IPOIB_MIN_QUEUE_SIZE) { + ipoib_recvq_size = IPOIB_MIN_QUEUE_SIZE; + printk(KERN_WARNING "%s: ipoib_recvq_size is too small, use min %d instead\n", ca->name, IPOIB_MIN_QUEUE_SIZE); + } + ipoib_recvq_size = roundup_pow_of_two(ipoib_recvq_size); + priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof (struct ipoib_rx_buf), GFP_KERNEL); if (!priv->rx_ring) { printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", - ca->name, IPOIB_RX_RING_SIZE); + ca->name, ipoib_recvq_size); goto out; } - priv->tx_ring = kzalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf), + if (ipoib_sendq_size > IPOIB_MAX_QUEUE_SIZE) { + ipoib_sendq_size = IPOIB_MAX_QUEUE_SIZE; + printk(KERN_WARNING "%s: ipoib_sendq_size is too big, use max %d instead\n", ca->name, IPOIB_MAX_QUEUE_SIZE); + } + if (ipoib_sendq_size < IPOIB_MIN_QUEUE_SIZE) { + ipoib_sendq_size = IPOIB_MIN_QUEUE_SIZE; + printk(KERN_WARNING "%s: ipoib_sendq_size is too small, use min %d instead\n", ca->name, 
IPOIB_MIN_QUEUE_SIZE); + } + + ipoib_sendq_size = roundup_pow_of_two(ipoib_sendq_size); + + priv->tx_ring = kzalloc(ipoib_sendq_size * sizeof (struct ipoib_tx_buf), GFP_KERNEL); if (!priv->tx_ring) { printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", - ca->name, IPOIB_TX_RING_SIZE); + ca->name, ipoib_sendq_size); goto out_rx_ring_cleanup; } @@ -923,7 +955,7 @@ static void ipoib_setup(struct net_devic dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; dev->addr_len = INFINIBAND_ALEN; dev->type = ARPHRD_INFINIBAND; - dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->tx_queue_len = ipoib_sendq_size * 2; dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; /* MTU will be reset when mcast join happens */ diff -urpN infiniband/ulp/ipoib/ipoib_verbs.c infiniband-queue/ulp/ipoib/ipoib_verbs.c --- infiniband/ulp/ipoib/ipoib_verbs.c 2006-03-26 11:57:15.000000000 -0800 +++ infiniband-queue/ulp/ipoib/ipoib_verbs.c 2006-04-04 16:57:07.098491432 -0700 @@ -159,8 +159,8 @@ int ipoib_transport_dev_init(struct net_ struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr init_attr = { .cap = { - .max_send_wr = IPOIB_TX_RING_SIZE, - .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_wr = ipoib_sendq_size, + .max_recv_wr = ipoib_recvq_size, .max_send_sge = 1, .max_recv_sge = 1 }, @@ -175,7 +175,7 @@ int ipoib_transport_dev_init(struct net_ } priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, - IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + ipoib_sendq_size + ipoib_recvq_size + 1); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); goto out_free_pd; Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: infiniband-queue.patch Type: application/octet-stream Size: 8329 bytes Desc: not available URL: From rdreier at cisco.com Tue Apr 4 16:51:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 16:51:32 -0700 Subject: [openib-general] Re: SRP compile warning in Branch 1.0 In-Reply-To: <20060404162740.3afa1c90.weiny2@llnl.gov> (Ira Weiny's message of "Tue, 04 Apr 2006 16:27:40 -0700") References: <20060302140044.72dcac10.weiny2@llnl.gov> <000001c63e50$8c826760$cba1070a@amr.corp.intel.com> <20060304231742.GB11216@mellanox.co.il> <20060404162740.3afa1c90.weiny2@llnl.gov> Message-ID: Ira> I had to apply the following patch. Indeed my initial Ira> assumption was correct. What kernel are you using? This might make SRP compile but I have serious doubts about this being enough to make old SCSI midlayers handle scanning remote target ports... From iod00d at hp.com Tue Apr 4 16:52:32 2006 From: iod00d at hp.com (Grant Grundler) Date: Tue, 4 Apr 2006 16:52:32 -0700 Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel In-Reply-To: <20060404180723.GA29589@obsidianresearch.com> References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com> <20060404070229.GF10080@obsidianresearch.com> <20060404164509.GA29487@esmail.cup.hp.com> <20060404180723.GA29589@obsidianresearch.com> Message-ID: <20060404235232.GH29487@esmail.cup.hp.com> On Tue, Apr 04, 2006 at 12:07:23PM -0600, Jason Gunthorpe wrote: > On Tue, Apr 04, 2006 at 09:45:09AM -0700, Grant Grundler wrote: > > > MSI requires end device support and something in a bridge > > > to transform the MSI message into an APIC message, but the kernel > > > currently only looks for end device support. > > > > APIC Message? > > MSI is just a DMA-write from the card point of view. > > So if PCI is working and DMA is working, MSI should work too. 
> > The difference is routing of the transaction and the fact that > > it's not targeting Host memory but some other part of the chipset. > > You still need chipset support to get from the memory write to an > interrupt message transaction on the FSB to the processor APIC. > http://www.intel.com/design/chipsets/datashts/30146403.pdf > Pages 19 and 165 Yes, that's what I meant with "routing of the transaction". "interrupt message transaction" is not "transformed" - just routed. Page 19 uses the same language. The "FSB Interrupt Memory Space" described on page 165 is also known as the "PIB" (Processor Interrupt Block). See: drivers/parisc/iosapic.c http://www.intel.com/design/itanium/downloads/251350.htm http://en.wikipedia.org/wiki/IO-APIC The PIB can be anywhere and it's up to firmware to tell the OS if it's not hardcoded in the kernel like it is today. Hardcoding of the PIB is one of the problems that Mark Maule is attempting to fix since it just doesn't work for large SGI systems. > I doubt all intel chipsets ever produced have this transformation. I expect any Intel box that uses a LOCAL xAPIC _must_ support a PIB (and therefore routing of MSI). > I also know that a lot of embedded systems don't have host bridges that > support this. Ok. Could that be because embedded systems aren't x86 or PCI-e based? Is Fedora targeting embedded systems? (I think not) > The list of host bridges that work is definitely smaller > than the list that don't. :< This thread started with x86-64 arch disabling MSI and I'm pretty sure _all_ of those from Intel support this in HW. It's clear one variant of x86-64 from AMD (8131) is broken and others from Intel might be too. So for now, I'll disagree when talking about x86-64 architecture. > > Well, can't linux enable that block if it's present? > > That isn't a reason to disable MSI for _all_ systems. 
> > I have a patch that does that, but monkeying with the memory map always > makes me nervous since you never really know what the BIOS has done. > Intel style MSI's overlap the high memory BIOS area so there is > a potential problem. HT MSI translation can be configured to use a > high address, so it might be very safe to enable the translation and > set something >4G as the base. This fits nicely with the patches that Mark Maule (SGI) has submitted to make the hardcoding of PIB an x86 thing. Sounds like we have another interested party in making the MSI support less Intel specific. > > Sorry...I can't agree. > > Line based interrupt routing is dependent on firmware to give > > us the IRQ->APIC routing tables, enough info to identify CPUs (ID/EID > > info for Intel implementations) and program IO-xAPIC entries. > > Essentially, MSI only needs the CPU Info so MSI transactions get > > routed correctly. Then MSI/-X entries on the devices can be > > programmed (essentially the same way an IO-xAPIC gets programmed). > > ?? That's what I was trying to say - on a system with only PCIe > (granted, with working MSI in the devices..) there should be limited > need for IOAPIC based routing. ah ok. But I was commenting on the dependency on firmware. It's still there since we don't know which devices need IRQ lines and which can live without them. The concept of "IRQ Lines" doesn't go away with PCI-e. PCI-e just virtualizes the concept into transactions that a parent bridge can convert into MSI transactions. The parent bridge in this case is essentially doing the same thing that an IO xAPIC (or SAPIC) would do. > > If it's "just a BIOS" issue, then can't AMD help linux turn MSI support on? > > ie linux can bang some values into the chipset like we do for other types > > of initialization when BIOS doesn't do it right. > > Yes, simple patch attached.. Nice. Looks good to me. I'll bounce your mail to linux-pci. I've cc'd linux-pci on this reply. 
thanks, grant > > Jason > --- linux-2.6.15.4/drivers/pci/quirks.c 2006-02-16 12:08:59.000000000 -0700 > +++ lin/drivers/pci/quirks.c 2006-02-16 12:12:30.000000000 -0700 > @@ -1257,6 +1257,29 @@ > } > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NCR, PCI_DEVICE_ID_NCR_53C810, fixup_rev1_53c810); > > +#ifdef CONFIG_PCI_MSI > +static void __devinit fixup_ht_msi(struct pci_dev* dev) > +{ > + /* Some BIOS's do not enable the hypertransport MSI mapping capability > + on the chipset. This breaks MSI support.. */ > + int pos = pci_find_capability(dev,PCI_CAP_ID_HT); > + while (pos != 0) > + { > + u32 cap; > + pci_read_config_dword(dev,pos,&cap); > + if (((cap >> 16) & PCI_HT_CMD_TYP) == PCI_HT_CMD_TYP_MSIM) { > + if ((cap & PCI_HT_MSIM_ENABLE) == 0) { > + printk("BIOS BUG: HyperTransport MSI mapping not enabled for %s, enabling.\n",pci_name(dev)); > + cap |= PCI_HT_MSIM_ENABLE; > + pci_write_config_dword(dev,pos,cap); > + } > + break; > + } > + pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT); > + } > +} > +DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, fixup_ht_msi); > +#endif > > static void pci_do_fixups(struct pci_dev *dev, struct pci_fixup *f, struct pci_fixup *end) > { > --- linux-2.6.15.4/include/linux/pci_regs.h 2006-02-16 12:09:05.000000000 -0700 > +++ lin/include/linux/pci_regs.h 2006-02-16 12:12:30.000000000 -0700 > @@ -196,12 +196,14 @@ > #define PCI_CAP_ID_MSI 0x05 /* Message Signalled Interrupts */ > #define PCI_CAP_ID_CHSWP 0x06 /* CompactPCI HotSwap */ > #define PCI_CAP_ID_PCIX 0x07 /* PCI-X */ > +#define PCI_CAP_ID_HT 0x08 /* HyperTransport */ > #define PCI_CAP_ID_SHPC 0x0C /* PCI Standard Hot-Plug Controller */ > #define PCI_CAP_ID_EXP 0x10 /* PCI Express */ > #define PCI_CAP_ID_MSIX 0x11 /* MSI-X */ > #define PCI_CAP_LIST_NEXT 1 /* Next capability in the list */ > #define PCI_CAP_FLAGS 2 /* Capability defined flags (16 bits) */ > #define PCI_CAP_SIZEOF 4 > +#define PCI_HT_CMD_TYP 0xf800 /* Hypertransport capability type mask */ > > /* Power Management 
Registers */ > > @@ -285,6 +287,10 @@ > #define PCI_MSI_DATA_64 12 /* 16 bits of data for 64-bit devices */ > #define PCI_MSI_MASK_BIT 16 /* Mask bits register */ > > +/* HyperTransport MSI Mapping registers */ > +#define PCI_HT_CMD_TYP_MSIM 0xa800 // MSI Mapping type > +#define PCI_HT_MSIM_ENABLE (1<<16) > + > /* CompactPCI Hotswap Register */ > > #define PCI_CHSWP_CSR 2 /* Control and Status Register */ From gregkh at suse.de Tue Apr 4 17:00:30 2006 From: gregkh at suse.de (gregkh at suse.de) Date: Tue, 4 Apr 2006 17:00:30 -0700 Subject: [openib-general] [patch 11/26] IPOB: Move destructor from neigh->ops to neigh_param In-Reply-To: <20060404235927.GA27049@kroah.com> References: <20060404235634.696852000@quad.kroah.org> Message-ID: <20060405000030.GL27049@kroah.com> From: Michael Tsirkin struct neigh_ops currently has a destructor field, but not a constructor field. The infiniband/ulp/ipoib in-tree driver stashes some info in the neighbour structure (the results of the second-stage lookup from ARP results to real link-level path), and it uses neigh->ops->destructor to get a callback so it can clean up this extra info when a neighbour is freed. We've run into problems with this: since the destructor is in an ops field that is shared between neighbours that may belong to different net devices, there's no way to set/clear it safely. The following patch moves this field to neigh_parms where it can be safely set, together with its twin neigh_setup, and switches the only two in-kernel users (ipoib and clip) to this interface. 
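To make the problem described above concrete, here is a minimal user-space sketch. It is not kernel code and not part of the patch. The struct layouts are hypothetical, stripped-down stand-ins that borrow only the field names destructor and neigh_destructor from the patch; demo_destructor, destroy_via_ops, destroy_via_parms and demo_shared_ops_problem are names invented for illustration. Under those assumptions it shows a callback installed in a shared ops table firing for a neighbour that belongs to a different device, while a per-device parms hook stays confined to its own device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct neighbour;

struct neigh_ops {			/* shared by neighbours of many devices */
	void (*destructor)(struct neighbour *);
};

struct neigh_parms {			/* one instance per net device */
	void (*neigh_destructor)(struct neighbour *);
};

struct neighbour {
	struct neigh_ops   *ops;
	struct neigh_parms *parms;
	int                 cleaned;	/* set by the demo destructor */
};

static void demo_destructor(struct neighbour *n)
{
	n->cleaned = 1;
}

/* Old scheme: the hook is looked up through the shared ops table. */
static void destroy_via_ops(struct neighbour *n)
{
	if (n->ops && n->ops->destructor)
		n->ops->destructor(n);
}

/* New scheme: the hook is looked up through the per-device parms. */
static void destroy_via_parms(struct neighbour *n)
{
	if (n->parms->neigh_destructor)
		n->parms->neigh_destructor(n);
}

/*
 * Returns 1 when the shared-ops hook leaks onto a foreign neighbour
 * and the per-device hook does not.
 */
static int demo_shared_ops_problem(void)
{
	struct neigh_ops shared_ops = { NULL };
	struct neigh_parms ipoib_parms = { NULL };
	struct neigh_parms other_parms = { NULL };
	struct neighbour a = { &shared_ops, &ipoib_parms, 0 };
	struct neighbour b = { &shared_ops, &other_parms, 0 };
	int old_leaks, new_leaks;

	/* One driver sets the hook through its own neighbour "a"... */
	a.ops->destructor = demo_destructor;
	/* ...and it also runs for "b", which belongs to another device. */
	destroy_via_ops(&b);
	old_leaks = b.cleaned;

	/* With the hook in parms, only the owning device is affected. */
	b.cleaned = 0;
	ipoib_parms.neigh_destructor = demo_destructor;
	destroy_via_parms(&b);
	new_leaks = b.cleaned;
	destroy_via_parms(&a);

	return old_leaks == 1 && new_leaks == 0 && a.cleaned == 1;
}
```

The sketch only models the lookup path; it says nothing about locking or neighbour lifetime in the real stack.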
Signed-off-by: Michael Tsirkin Signed-off-by: Roland Dreier Signed-off-by: Greg Kroah-Hartman --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 16 +--------------- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 1 - include/net/neighbour.h | 2 +- net/atm/clip.c | 2 +- net/core/neighbour.c | 4 ++-- 5 files changed, 5 insertions(+), 20 deletions(-) --- linux-2.6.16.1.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ linux-2.6.16.1/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -247,7 +247,6 @@ static void path_free(struct net_device if (neigh->ah) ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; kfree(neigh); } @@ -530,7 +529,6 @@ static void neigh_add_path(struct sk_buf err: *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - neigh->neighbour->ops->destructor = NULL; kfree(neigh); ++priv->stats.tx_dropped; @@ -769,21 +767,9 @@ static void ipoib_neigh_destructor(struc ipoib_put_ah(ah); } -static int ipoib_neigh_setup(struct neighbour *neigh) -{ - /* - * Is this kosher? I can't find anybody in the kernel that - * sets neigh->destructor, so we should be able to set it here - * without trouble. 
- */ - neigh->ops->destructor = ipoib_neigh_destructor; - - return 0; -} - static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { - parms->neigh_setup = ipoib_neigh_setup; + parms->neigh_destructor = ipoib_neigh_destructor; return 0; } --- linux-2.6.16.1.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ linux-2.6.16.1/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -115,7 +115,6 @@ static void ipoib_mcast_free(struct ipoi if (neigh->ah) ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; kfree(neigh); } --- linux-2.6.16.1.orig/include/net/neighbour.h +++ linux-2.6.16.1/include/net/neighbour.h @@ -68,6 +68,7 @@ struct neigh_parms struct net_device *dev; struct neigh_parms *next; int (*neigh_setup)(struct neighbour *); + void (*neigh_destructor)(struct neighbour *); struct neigh_table *tbl; void *sysctl_table; @@ -145,7 +146,6 @@ struct neighbour struct neigh_ops { int family; - void (*destructor)(struct neighbour *); void (*solicit)(struct neighbour *, struct sk_buff*); void (*error_report)(struct neighbour *, struct sk_buff*); int (*output)(struct sk_buff*); --- linux-2.6.16.1.orig/net/atm/clip.c +++ linux-2.6.16.1/net/atm/clip.c @@ -289,7 +289,6 @@ static void clip_neigh_error(struct neig static struct neigh_ops clip_neigh_ops = { .family = AF_INET, - .destructor = clip_neigh_destroy, .solicit = clip_neigh_solicit, .error_report = clip_neigh_error, .output = dev_queue_xmit, @@ -346,6 +345,7 @@ static struct neigh_table clip_tbl = { /* parameters are copied from ARP ... 
*/ .parms = { + .neigh_destructor = clip_neigh_destroy, .tbl = &clip_tbl, .base_reachable_time = 30 * HZ, .retrans_time = 1 * HZ, --- linux-2.6.16.1.orig/net/core/neighbour.c +++ linux-2.6.16.1/net/core/neighbour.c @@ -586,8 +586,8 @@ void neigh_destroy(struct neighbour *nei kfree(hh); } - if (neigh->ops && neigh->ops->destructor) - (neigh->ops->destructor)(neigh); + if (neigh->parms->neigh_destructor) + (neigh->parms->neigh_destructor)(neigh); skb_queue_purge(&neigh->arp_queue); -- From davem at davemloft.net Tue Apr 4 17:07:20 2006 From: davem at davemloft.net (David S. Miller) Date: Tue, 04 Apr 2006 17:07:20 -0700 (PDT) Subject: [openib-general] Re: [patch 11/26] IPOB: Move destructor from neigh->ops to neigh_param In-Reply-To: <20060405000030.GL27049@kroah.com> References: <20060404235634.696852000@quad.kroah.org> <20060404235927.GA27049@kroah.com> <20060405000030.GL27049@kroah.com> Message-ID: <20060404.170720.61536177.davem@davemloft.net> From: gregkh at suse.de Date: Tue, 4 Apr 2006 17:00:30 -0700 > From: Michael Tsirkin > > struct neigh_ops currently has a destructor field, but not a constructor field. > The infiniband/ulp/ipoib in-tree driver stashes some info in the neighbour > structure (the results of the second-stage lookup from ARP results to real > link-level path), and it uses neigh->ops->destructor to get a callback so it can > clean up this extra info when a neighbour is freed. We've run into problems > with this: since the destructor is in an ops field that is shared between > neighbours that may belong to different net devices, there's no way to set/clear > it safely. > > The following patch moves this field to neigh_parms where it can be safely set, > together with its twin neigh_setup, and switches the only two in-kernel users > (ipoib and clip) to this interface. Major NAK. This does not fix a bug, it is merely and API change that the inifiniband folks want for some of their infrastructure. 
It was accepted for 2.6.17, but this change is not appropriate for the -stable release branch. Furthermore, this version of the patch here will break the build of ATM. From greg at kroah.com Tue Apr 4 17:12:26 2006 From: greg at kroah.com (Greg KH) Date: Tue, 4 Apr 2006 17:12:26 -0700 Subject: [openib-general] Re: [stable] Re: [patch 11/26] IPOB: Move destructor from neigh->ops to neigh_param In-Reply-To: <20060404.170720.61536177.davem@davemloft.net> References: <20060404235634.696852000@quad.kroah.org> <20060404235927.GA27049@kroah.com> <20060405000030.GL27049@kroah.com> <20060404.170720.61536177.davem@davemloft.net> Message-ID: <20060405001226.GA29002@kroah.com> On Tue, Apr 04, 2006 at 05:07:20PM -0700, David S. Miller wrote: > From: gregkh at suse.de > Date: Tue, 4 Apr 2006 17:00:30 -0700 > > > From: Michael Tsirkin > > > > struct neigh_ops currently has a destructor field, but not a constructor field. > > The infiniband/ulp/ipoib in-tree driver stashes some info in the neighbour > > structure (the results of the second-stage lookup from ARP results to real > > link-level path), and it uses neigh->ops->destructor to get a callback so it can > > clean up this extra info when a neighbour is freed. We've run into problems > > with this: since the destructor is in an ops field that is shared between > > neighbours that may belong to different net devices, there's no way to set/clear > > it safely. > > > > The following patch moves this field to neigh_parms where it can be safely set, > > together with its twin neigh_setup, and switches the only two in-kernel users > > (ipoib and clip) to this interface. > > Major NAK. > > This does not fix a bug, it is merely and API change that the > inifiniband folks want for some of their infrastructure. > > It was accepted for 2.6.17, but this change is not appropriate > for the -stable release branch. > > Furthermore, this version of the patch here will break the build of > ATM. 
Thanks for the information and the review, I've dropped this patch from
the queue now.

greg k-h

From rdreier at cisco.com Tue Apr 4 17:14:27 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Apr 2006 17:14:27 -0700
Subject: [openib-general] Re: [patch 11/26] IPOB: Move destructor from neigh->ops to neigh_param
In-Reply-To: <20060404.170720.61536177.davem@davemloft.net> (David S. Miller's message of "Tue, 04 Apr 2006 17:07:20 -0700 (PDT)")
References: <20060404235634.696852000@quad.kroah.org> <20060404235927.GA27049@kroah.com> <20060405000030.GL27049@kroah.com> <20060404.170720.61536177.davem@davemloft.net>
Message-ID:

    David> Major NAK.

    David> This does not fix a bug, it is merely and API change that
    David> the inifiniband folks want for some of their infrastructure.

It definitely does fix a bug: without the change, because
ops->destructor is shared (possibly with other net devices), IPoIB
can't set it or clear it safely. I don't have exact details at hand
but this was definitely causing panics for people.

    David> Furthermore, this version of the patch here will break the
    David> build of ATM.

I'll admit I haven't tested but it looks OK to me -- it seems to have
the required chunk in clip.c.

I'm not going to fight too hard for it (I'll let Michael champion it
if he really cares), but I think this is a legitimate -stable patch:
it fixes a panic that real users are seeing.

 - R.
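
The sharing problem described above can be illustrated outside the kernel. What follows is only a minimal user-space sketch: the structures are stripped-down stand-ins that borrow the kernel's field names, and ipoib_like_destructor, destroy_via_ops, destroy_via_parms and the two stray_calls_* functions are hypothetical names made up for this illustration. It assumes one ops table shared by neighbours on two devices and one parms instance per device, as the patch description states.

```c
#include <stddef.h>

/* Stand-ins for the kernel structures; only the callback fields are kept. */
struct neighbour;

struct neigh_ops {                 /* one table shared by many neighbours */
	void (*destructor)(struct neighbour *);
};

struct neigh_parms {               /* one instance per net device */
	void (*neigh_destructor)(struct neighbour *);
};

struct neighbour {
	struct neigh_ops *ops;
	struct neigh_parms *parms;
};

static int ipoib_cleanups;         /* counts runs of the IPoIB-style callback */

/* Hypothetical callback standing in for ipoib_neigh_destructor. */
static void ipoib_like_destructor(struct neighbour *n)
{
	(void)n;
	ipoib_cleanups++;
}

/* Lookup as before the patch: through the shared ops table. */
static void destroy_via_ops(struct neighbour *n)
{
	if (n->ops && n->ops->destructor)
		n->ops->destructor(n);
}

/* Lookup as after the patch: through the per-device parms. */
static void destroy_via_parms(struct neighbour *n)
{
	if (n->parms->neigh_destructor)
		n->parms->neigh_destructor(n);
}

/* Number of IPoIB cleanups run for a neighbour on another device (old scheme). */
int stray_calls_with_shared_ops(void)
{
	struct neigh_ops shared_ops = { NULL };
	struct neigh_parms ib_parms = { NULL }, eth_parms = { NULL };
	struct neighbour ib_neigh = { &shared_ops, &ib_parms };
	struct neighbour eth_neigh = { &shared_ops, &eth_parms };

	ipoib_cleanups = 0;
	/* Setting the callback for the IB neighbour writes into the shared table. */
	ib_neigh.ops->destructor = ipoib_like_destructor;
	destroy_via_ops(&eth_neigh);   /* the other device's neighbour runs it too */
	return ipoib_cleanups;
}

/* Same scenario with the callback stored in the per-device parms (new scheme). */
int stray_calls_with_parms(void)
{
	struct neigh_ops shared_ops = { NULL };
	struct neigh_parms ib_parms = { NULL }, eth_parms = { NULL };
	struct neighbour ib_neigh = { &shared_ops, &ib_parms };
	struct neighbour eth_neigh = { &shared_ops, &eth_parms };

	ipoib_cleanups = 0;
	ib_neigh.parms->neigh_destructor = ipoib_like_destructor;
	destroy_via_parms(&eth_neigh); /* nothing registered for this device */
	return ipoib_cleanups;
}
```

Called from a small main(), stray_calls_with_shared_ops() reports one stray cleanup and stray_calls_with_parms() reports none. The real kernel code has locking and lifetime concerns that this sketch does not model.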
From greg at kroah.com Tue Apr 4 17:19:37 2006 From: greg at kroah.com (Greg KH) Date: Tue, 4 Apr 2006 17:19:37 -0700 Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel In-Reply-To: <20060404180723.GA29589@obsidianresearch.com> References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com> <20060404070229.GF10080@obsidianresearch.com> <20060404164509.GA29487@esmail.cup.hp.com> <20060404180723.GA29589@obsidianresearch.com> Message-ID: <20060405001937.GB30049@kroah.com> On Tue, Apr 04, 2006 at 12:07:23PM -0600, Jason Gunthorpe wrote: > +#ifdef CONFIG_PCI_MSI No #ifdef is needed really, right? > +static void __devinit fixup_ht_msi(struct pci_dev* dev) > +{ > + /* Some BIOS's do not enable the hypertransport MSI mapping capability > + on the chipset. This breaks MSI support.. */ > + int pos = pci_find_capability(dev,PCI_CAP_ID_HT); > + while (pos != 0) > + { Wrong coding style :) > + u32 cap; > + pci_read_config_dword(dev,pos,&cap); > + if (((cap >> 16) & PCI_HT_CMD_TYP) == PCI_HT_CMD_TYP_MSIM) { > + if ((cap & PCI_HT_MSIM_ENABLE) == 0) { > + printk("BIOS BUG: HyperTransport MSI mapping not enabled for %s, enabling.\n",pci_name(dev)); > + cap |= PCI_HT_MSIM_ENABLE; > + pci_write_config_dword(dev,pos,cap); > + } > + break; > + } > + pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT); > + } > +} > +DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, fixup_ht_msi); > +#endif What's the odds that we don't want to do this for every pci device? Do any of them lie about their capabilities? 
> > static void pci_do_fixups(struct pci_dev *dev, struct pci_fixup *f, struct pci_fixup *end) > { > --- linux-2.6.15.4/include/linux/pci_regs.h 2006-02-16 12:09:05.000000000 -0700 > +++ lin/include/linux/pci_regs.h 2006-02-16 12:12:30.000000000 -0700 > @@ -196,12 +196,14 @@ > #define PCI_CAP_ID_MSI 0x05 /* Message Signalled Interrupts */ > #define PCI_CAP_ID_CHSWP 0x06 /* CompactPCI HotSwap */ > #define PCI_CAP_ID_PCIX 0x07 /* PCI-X */ > +#define PCI_CAP_ID_HT 0x08 /* HyperTransport */ No tabs used :( > #define PCI_CAP_ID_SHPC 0x0C /* PCI Standard Hot-Plug Controller */ > #define PCI_CAP_ID_EXP 0x10 /* PCI Express */ > #define PCI_CAP_ID_MSIX 0x11 /* MSI-X */ > #define PCI_CAP_LIST_NEXT 1 /* Next capability in the list */ > #define PCI_CAP_FLAGS 2 /* Capability defined flags (16 bits) */ > #define PCI_CAP_SIZEOF 4 > +#define PCI_HT_CMD_TYP 0xf800 /* Hypertransport capability type mask */ > > /* Power Management Registers */ > > @@ -285,6 +287,10 @@ > #define PCI_MSI_DATA_64 12 /* 16 bits of data for 64-bit devices */ > #define PCI_MSI_MASK_BIT 16 /* Mask bits register */ > > +/* HyperTransport MSI Mapping registers */ > +#define PCI_HT_CMD_TYP_MSIM 0xa800 // MSI Mapping type > +#define PCI_HT_MSIM_ENABLE (1<<16) No tabs used :( And don't use // thanks, greg k-h From jgunthorpe at obsidianresearch.com Tue Apr 4 17:31:44 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 4 Apr 2006 18:31:44 -0600 Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel In-Reply-To: <20060404235232.GH29487@esmail.cup.hp.com> References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com> <20060404070229.GF10080@obsidianresearch.com> <20060404164509.GA29487@esmail.cup.hp.com> <20060404180723.GA29589@obsidianresearch.com> <20060404235232.GH29487@esmail.cup.hp.com> Message-ID: <20060405003144.GA9618@obsidianresearch.com> On Tue, Apr 04, 2006 at 04:52:32PM -0700, Grant Grundler wrote: > Yes, that's what I 
meant with "routing of the transaction". > "interrupt message transaction" is not "transformed" - just routed. > Page 19 uses the same language. Yes, I think we are saying exactly the same thing - I was saying 'transformed' because I don't know the exact bus protocol used on an Intel FSB. (Not knowing that I have no idea if it is a memory write to the CPU or some kind of special APIC interrupt message or whatever.) > Mark Maul is attempting to fix since it just doesn't > work for large SGI systems. Thats sounds great! > > I doubt all intel chipsets ever produced have this transformation. > > I expect any Intel box that uses a LOCAL xAPIC _must_ support a PIB > (and therefore routing of MSI). Ok. I just remember long ago when APIC first came out there used to be a dedicated out of band APIC bus that carried the vectors from the IO APCIs directly. These days I really only know exact bus protocol specifics about HT and AMD64.. [clip about fedora] You are totally right of course WRT to fedora. I was just thinking aloud about how to have MSI turned on and work for all supported platforms more generally. Certainly it sounds like all Intel EM64T systems should work and a varient of my patch can be used to identify the AMD64 ones that don't. (look for the HT MSI MAPPING cap on each bridge, if it is absent then it is not supported, 8131 does not have a HT MSI MAPPING cap) > This fits nicely with the patches that Mark Maule (SGI) has submitted > to make the hardcoding of PIB an x86 thing. Sounds like we have > another interested party in making the MSI support less Intel > specific. Yes, I think it would go well with HT MSI support. > ah ok. But I was commenting on the dependency on firmware. It's still > there since we don't know which devices need IRQ lines and which can > live without them. The concept of "IRQ Lines" doesn't go away > with PCI-e. PCI-e just virtualizes the concept into transactions > that a parent bridge can convert into MSI transactions. 
> The parent bridge in this case is essentially doing the same > thing that an IO xAPIC (or SAPIC) would do. Mm, can you clarify this? Your description matches my understanding of PCIe legacy (8252 and IOAPIC style) interrupts. However with MSI at the device the parent bridge isn't involved in vector selection so the APCI IOAPIC linkage information is unused. I only point this out because it seems to be a common problem that some el-chepo motherboards have broken APCI and important things like HD controllers don't work because their interrupt linkages are mangled somehow. In a UP system alot of the other APCI information is not critical (or as error prone..) so using MSI would be a nice alternative way to get such a system to work 100%. > Nice. Looks good to me. > I'll bounce your mail linux-pci. > I've cc'd linux-pci on this reply. Two notes about it: 1) It doesn't force the base address to 0xfee00000 [this is the documented reset value however] 2) If used with Mark Maule's work, then the base address should be taken from the first bridge with a HT MSI Mapping cap, going along the bridge chain toward the host from the device that the MSI address is being generated for. Thanks, Jason From davem at davemloft.net Tue Apr 4 17:17:39 2006 From: davem at davemloft.net (David S. Miller) Date: Tue, 04 Apr 2006 17:17:39 -0700 (PDT) Subject: [openib-general] Re: [patch 11/26] IPOB: Move destructor from neigh->ops to neigh_param In-Reply-To: References: <20060405000030.GL27049@kroah.com> <20060404.170720.61536177.davem@davemloft.net> Message-ID: <20060404.171739.92845421.davem@davemloft.net> From: Roland Dreier Date: Tue, 04 Apr 2006 17:14:27 -0700 > I'm not going to fight too hard for it (I'll let Michael champion it > if he really cares), but I think this is a legitimate -stable patch: > it fixes a panic that real users are seeing. You were using an interface in an unintended way. 
Do you know %100 for certain that moving that callback to a different
location won't break anything?

From rdreier at cisco.com Tue Apr 4 17:42:20 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Apr 2006 17:42:20 -0700
Subject: [openib-general] Re: [patch 11/26] IPOB: Move destructor from neigh->ops to neigh_param
In-Reply-To: <20060404.171739.92845421.davem@davemloft.net> (David S. Miller's message of "Tue, 04 Apr 2006 17:17:39 -0700 (PDT)")
References: <20060405000030.GL27049@kroah.com> <20060404.170720.61536177.davem@davemloft.net> <20060404.171739.92845421.davem@davemloft.net>
Message-ID:

    David> You were using an interface in an unintended way.

There were a lot of opportunities to suggest a better way or even just
raise the alarm when IPoIB was first being reviewed. And I don't
remember anyone giving any guidance or insight into the neighbour
destructor design the three or four times Michael raised the issue of
the IPoIB crash and posted this patch for review....

    David> Do you know %100 for certain that moving that callback to a
    David> different location won't break anything?

Of course it's not %100 certain, but it definitely fixes a panic in
IPoIB, and the clip.c change looks "obviously correct."

If this patch is too risky for -stable, that's fine. But let's be
clear that it _does_ fix a panic people hit in practice, and as far as
I know it doesn't break the ATM build.

 - R.
From jgunthorpe at obsidianresearch.com Tue Apr 4 17:58:28 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 4 Apr 2006 18:58:28 -0600 Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel In-Reply-To: <20060405001937.GB30049@kroah.com> References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com> <20060404070229.GF10080@obsidianresearch.com> <20060404164509.GA29487@esmail.cup.hp.com> <20060404180723.GA29589@obsidianresearch.com> <20060405001937.GB30049@kroah.com> Message-ID: <20060405005828.GB9618@obsidianresearch.com> On Tue, Apr 04, 2006 at 05:19:37PM -0700, Greg KH wrote: Just as a preface, I whipped this patch up because I have a Asus with a broken BIOS and I wanted to test with MSI-X on IB cards. It needs a little more thought before being merged, but the method seemed relevent given the discussion on Fedora disabling MSI. > > +#ifdef CONFIG_PCI_MSI > No #ifdef is needed really, right? This fixup _only_ enables MSI, if there is no MSI in the kernel then it can only harm things. > Wrong coding style :) Yes, sorry, let me send you something more suitable for merging. > What's the odds that we don't want to do this for every pci device? Do > any of them lie about their capabilities? This capability is defined in the latest HT IOLink spec. I know that at least modern AMD64 HT chipsets from ATI and NVidia report the capability. The problem with limiting it to only some devices is that the list grows with every ATI/nVidia chipset rev and I think that keeping tabs on all the PCI IDs would be difficult. The risk with this patch is that by enabling MSI mapping at the usual address of 0xfee00000 may break something the BIOS has setup, but without it MSI doesn't work at all. 
Optimally the BIOS would turn it on and make sure 0xfee00000 is clear, but some don't :< Alternatively something like this could be used to detect disabled MSI support (busted BIOS) or absent MSI support (AMD 8131) and then disable MSI in the kernel. Jason From davem at davemloft.net Tue Apr 4 17:47:41 2006 From: davem at davemloft.net (David S. Miller) Date: Tue, 04 Apr 2006 17:47:41 -0700 (PDT) Subject: [openib-general] Re: [patch 11/26] IPOB: Move destructor from neigh->ops to neigh_param In-Reply-To: References: <20060404.171739.92845421.davem@davemloft.net> Message-ID: <20060404.174741.63557413.davem@davemloft.net> From: Roland Dreier Date: Tue, 04 Apr 2006 17:42:20 -0700 > David> You were using an interface in an unintended way. > > There were a lot of opportunities to suggest a better way or even just > raise the alarm when IPoIB was first being reviewed. And I don't > remember anyone giving any guidance or insight into the neighbour > destructor design the three or four times Michael raised the issue of > the IPoIB crash and posted this patch for review.... If I thought your change was appropriate for 2.6.16 I would have put it into that tree back then. Instead, I did not consider it appropriate, that's why we decided to put it into 2.6.17 Nothing since then has changed the situation. > If this patch is too risky for -stable, that's fine. But let's be > clear that it _does_ fix a panic people hit in practice, and as far as > I know it doesn't break the ATM build I think it's too risky. It fixes a panic for infiniband. I think you should not have submitted such a core networking change to -stable without passing it by netdev CC:'ing me first. From rdreier at cisco.com Tue Apr 4 18:08:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Apr 2006 18:08:34 -0700 Subject: [openib-general] Re: [patch 11/26] IPOB: Move destructor from neigh->ops to neigh_param In-Reply-To: <20060404.174741.63557413.davem@davemloft.net> (David S. 
Miller's message of "Tue, 04 Apr 2006 17:47:41 -0700 (PDT)")
References: <20060404.171739.92845421.davem@davemloft.net> <20060404.174741.63557413.davem@davemloft.net>
Message-ID:

    David> I think it's too risky. It fixes a panic for infiniband.

Fair enough.

    David> I think you should not have submitted such a core
    David> networking change to -stable without passing it by netdev
    David> CC:'ing me first.

Noted. Glad I wasn't the one who submitted it ;)

 - R.

From rdreier at cisco.com Tue Apr 4 18:29:24 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Apr 2006 18:29:24 -0700
Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel
In-Reply-To: <20060405005828.GB9618@obsidianresearch.com> (Jason Gunthorpe's message of "Tue, 4 Apr 2006 18:58:28 -0600")
References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com> <20060404070229.GF10080@obsidianresearch.com> <20060404164509.GA29487@esmail.cup.hp.com> <20060404180723.GA29589@obsidianresearch.com> <20060405001937.GB30049@kroah.com> <20060405005828.GB9618@obsidianresearch.com>
Message-ID:

    Jason> Just as a preface, I whipped this patch up because I have a
    Jason> Asus with a broken BIOS and I wanted to test with MSI-X on
    Jason> IB cards. It needs a little more thought before being
    Jason> merged, but the method seemed relevent given the discussion
    Jason> on Fedora disabling MSI.

This patch probably is a good thing to work around buggy BIOSes. But
I don't think it helps Fedora at all. The issue (as I understand it)
is not that some devices try to use MSI and don't work, but that some
systems completely fail to boot because of the (non-MSI) IRQ changes
that CONFIG_PCI_MSI also (confusingly) enables.

To put it naively, what breaks things is all interrupt handling stuff
that changes /proc/interrupts from

     17:       4797          0   IO-APIC-level  eth0

to

    177:    1942286          0   IO-APIC-level  eth0

_That's_ the stuff you have to fix everywhere to get Fedora to enable MSI.

 - R.
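
For readers following the BIOS workaround discussed in this thread, the capability test in Jason's patch comes down to two mask operations on the 32-bit word read at the capability offset. The following is a minimal user-space sketch, not the patch itself: the three constants are taken from the quoted patch (with a `u` suffix added here), while the helper names ht_cap_is_msi_mapping and ht_cap_enable_msi_mapping are hypothetical, and no PCI config space is touched.

```c
#include <stdint.h>

/* Values as defined in the patch quoted earlier in this thread. */
#define PCI_HT_CMD_TYP      0xf800u     /* HyperTransport capability type mask */
#define PCI_HT_CMD_TYP_MSIM 0xa800u     /* MSI Mapping type */
#define PCI_HT_MSIM_ENABLE  (1u << 16)  /* enable bit in the 32-bit word */

/* Hypothetical helper: is this HT capability word an MSI Mapping capability?
 * The command field sits in the upper 16 bits of the first dword. */
static int ht_cap_is_msi_mapping(uint32_t cap)
{
	return ((cap >> 16) & PCI_HT_CMD_TYP) == PCI_HT_CMD_TYP_MSIM;
}

/* Hypothetical helper: return the word the fixup would write back.
 * Only an MSI Mapping capability with the enable bit clear is changed. */
static uint32_t ht_cap_enable_msi_mapping(uint32_t cap)
{
	if (ht_cap_is_msi_mapping(cap) && (cap & PCI_HT_MSIM_ENABLE) == 0)
		cap |= PCI_HT_MSIM_ENABLE;
	return cap;
}
```

With a made-up sample word such as 0xa8000008 (type field 0xa800, capability ID 0x08, enable bit clear), the second helper returns 0xa8010008; a word that is already enabled, or one with a different type field, comes back unchanged. In the patch the same test runs inside a loop over pci_find_capability() and pci_find_next_capability().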
From iod00d at hp.com Tue Apr 4 18:57:44 2006 From: iod00d at hp.com (Grant Grundler) Date: Tue, 4 Apr 2006 18:57:44 -0700 Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel In-Reply-To: <20060405003144.GA9618@obsidianresearch.com> References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com> <20060404070229.GF10080@obsidianresearch.com> <20060404164509.GA29487@esmail.cup.hp.com> <20060404180723.GA29589@obsidianresearch.com> <20060404235232.GH29487@esmail.cup.hp.com> <20060405003144.GA9618@obsidianresearch.com> Message-ID: <20060405015744.GM29487@esmail.cup.hp.com> On Tue, Apr 04, 2006 at 06:31:44PM -0600, Jason Gunthorpe wrote: ... > [clip about fedora] > > You are totally right of course WRT to fedora. I was just thinking > aloud about how to have MSI turned on and work for all supported > platforms more generally. Yeah, each arch has to decide if to enable MSI or not. e.g. PARISC does not. Mathew Wilcox and I will hack this out late some evening once we realize Mark's patches have enabled us to do so. > > The concept of "IRQ Lines" doesn't go away > > with PCI-e. PCI-e just virtualizes the concept into transactions > > that a parent bridge can convert into MSI transactions. > > The parent bridge in this case is essentially doing the same > > thing that an IO xAPIC (or SAPIC) would do. > > Mm, can you clarify this? Your description matches my understanding of > PCIe legacy (8252 and IOAPIC style) interrupts. However with MSI at > the device the parent bridge isn't involved in vector selection so > the APCI IOAPIC linkage information is unused. Right. I forget the right terminlogy for the "port" that handles PCIe legacy interrupts - I just called it "parent bridge". It's some "upstream" component from the "legacy" device but below a "root port" (IIRC). 
> I only point this out because it seems to be a common problem that > some el-chepo motherboards have broken APCI and important things like HD > controllers don't work because their interrupt linkages are mangled > somehow. In a UP system alot of the other APCI information is not > critical (or as error prone..) so using MSI would be a nice > alternative way to get such a system to work 100%. hrm...interesting thought. That might be feasible in some cases but el-cheapo MBs probably have PCI devices with broken MSI implementations (e.g. bcm5701). Given the track record for IDE HW is generally not so good, I wouldn't count on MSI working for most IDE devices that advertise the capability until I saw it working. I also wonder if some of the same ACPI info that causes IRQ lines not to work will also cause MSI not to work. The ia64/i386 linux IO SAPIC support depends on ACPI but I don't know offhand how much of that is used to program IO SAPIC (vs discover IO SAPICs). [ re patch ] ... > Two notes about it: > 1) It doesn't force the base address to 0xfee00000 [this is the > documented reset value however] > 2) If used with Mark Maule's work, then the base address should be > taken from the first bridge with a HT MSI Mapping cap, going along the > bridge chain toward the host from the device that the MSI address is > being generated for. ok - others will understand this better than me. 
thanks again, grant From jgunthorpe at obsidianresearch.com Tue Apr 4 19:16:26 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 4 Apr 2006 20:16:26 -0600 Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel In-Reply-To: References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com> <20060404070229.GF10080@obsidianresearch.com> <20060404164509.GA29487@esmail.cup.hp.com> <20060404180723.GA29589@obsidianresearch.com> <20060405001937.GB30049@kroah.com> <20060405005828.GB9618@obsidianresearch.com> Message-ID: <20060405021626.GA29395@obsidianresearch.com> On Tue, Apr 04, 2006 at 06:29:24PM -0700, Roland Dreier wrote: > Jason> Just as a preface, I whipped this patch up because I have a > Jason> Asus with a broken BIOS and I wanted to test with MSI-X on > Jason> IB cards. It needs a little more thought before being > Jason> merged, but the method seemed relevent given the discussion > Jason> on Fedora disabling MSI. > This patch probably is a good thing to work around buggy BIOSes. But > I don't think it helps Fedora at all. The issue (as I understand it) > is not that some devices try to use MSI and don't work, but that some > systems completely fail to boot because of the (non-MSI) IRQ changes > that CONFIG_PCI_MSI also (confusingly) enables. I looked around on google for some rational or boot log or something and found nothing but someone on ubuntu and an old kernel that didn't work. I'm hoping to address this situation: Someone with an e1000 (supports MSI on some revs in 2.6.16) reported that their card didn't work in XX AMD64 system because it had broken MSI support. 
Thanks,
Jason

From rdreier at cisco.com Tue Apr 4 19:21:16 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Apr 2006 19:21:16 -0700
Subject: [openib-general] Help with CONFIG_PCI_MSI in the kernel
In-Reply-To: <20060405021626.GA29395@obsidianresearch.com> (Jason Gunthorpe's message of "Tue, 4 Apr 2006 20:16:26 -0600")
References: <20060403221456.GF1442@greglaptop.pwlan-swisscom-mobile.ch> <20060404043600.GF24455@esmail.cup.hp.com> <20060404070229.GF10080@obsidianresearch.com> <20060404164509.GA29487@esmail.cup.hp.com> <20060404180723.GA29589@obsidianresearch.com> <20060405001937.GB30049@kroah.com> <20060405005828.GB9618@obsidianresearch.com> <20060405021626.GA29395@obsidianresearch.com>
Message-ID:

    Jason> I looked around on google for some rational or boot log or
    Jason> something and found nothing but someone on ubuntu and an
    Jason> old kernel that didn't work.

Yeah, I haven't seen much hard info myself. But Dave Jones wouldn't
turn off the config option just for the fun of it.

 - R.

From yael at mellanox.co.il Tue Apr 4 23:24:57 2006
From: yael at mellanox.co.il (Yael Kalka)
Date: 05 Apr 2006 09:24:57 +0300
Subject: [openib-general] [PATCH] OpenSM - complib fix for branch
Message-ID: <5zy7yksfnq.fsf@mtl066.yok.mtl.com>

Hi Hal,

I saw that the complib patch (removal of the constructor and destructor
attributes) wasn't fully added to the branch.
Attached is a patch for the branch.
Thanks, Yael Signed-off-by: Yael Kalka Index: complib/cl_complib.c =================================================================== --- complib/cl_complib.c (revision 6203) +++ complib/cl_complib.c (working copy) @@ -65,7 +65,6 @@ __cl_timer_prov_destroy( void ); cl_spinlock_t cl_atomic_spinlock; void -__attribute (( constructor )) complib_init(void) { cl_status_t status = CL_SUCCESS; @@ -90,14 +89,6 @@ complib_init(void) } void -__attribute (( destructor )) -complib_fini(void) -{ - __cl_timer_prov_destroy(); - __cl_user_syshelper_exit(); -} - -void complib_exit(void) { __cl_timer_prov_destroy(); Index: opensm/main.c =================================================================== --- opensm/main.c (revision 6203) +++ opensm/main.c (working copy) @@ -44,9 +44,6 @@ * * $Revision: 1.23 $ */ -#ifdef __WIN__ -#pragma warning(disable : 4996) -#endif #if HAVE_CONFIG_H # include @@ -557,9 +554,7 @@ main( { NULL, 0, NULL, 0 } /* Required at the end of the array */ }; -#ifdef __WIN__ complib_init(); -#endif /* Make sure that the opensm and complib were compiled using same modes (debug/free) */ From yael at mellanox.co.il Tue Apr 4 23:31:52 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 05 Apr 2006 09:31:52 +0300 Subject: [openib-general] [PATCH] OpenSM - osm_vendor_mlx_svc.h fix for branch1.0 Message-ID: <5zwte4sfc7.fsf@mtl066.yok.mtl.com> Hi Hal, Attached is a patch for branch 1.0, for the osm_vendor_mlx_svc.h that was applied on the trunk but not on the branch 1.0. Yael OpenSM/osm_vendor_mlx_svc.h: Identify RMPP MADs RMPP mads can only sent in 4 MAD classes: 1. IB_MCLASS_SUBN_ADM 2. IB_MCLASS_DEV_MGMT 3. BIS 4. DevAdm If the packet is not 1 or 2 - the mad is not checked and returned in advance as not a rmpp packet. Since 3,4 are not management classes (see 13.4.4 page 720 in the spec), I don't know whether they are defined as constants or not, so I didn't handle them. 
Signed-off-by: Ofer Gigi Signed-off-by: Yael Kalka Index: include/vendor/osm_vendor_mlx_svc.h =================================================================== --- include/vendor/osm_vendor_mlx_svc.h (revision 5887) +++ include/vendor/osm_vendor_mlx_svc.h (revision 5888) @@ -116,6 +116,10 @@ osmv_mad_is_rmpp(IN const ib_mad_t *p_ma CL_ASSERT(NULL != p_mad); rmpp_flags = ((ib_rmpp_mad_t*)p_mad)->rmpp_flags; + /* HACK - JUST SA and DevMgt for now - need to add BIS and DevAdm */ + if ( (p_mad->mgmt_class != CL_NTOH16(IB_MCLASS_SUBN_ADM)) && + (p_mad->mgmt_class != CL_NTOH16(IB_MCLASS_DEV_MGMT)) ) + return(0); return (0 != (rmpp_flags & IB_RMPP_FLAG_ACTIVE)); } From devesh28 at gmail.com Tue Apr 4 23:44:31 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Wed, 5 Apr 2006 12:14:31 +0530 Subject: [openib-general] Question on get_dma_mr() In-Reply-To: <1143727210.25287.0.camel@stevo-desktop> References: <309a667c0603232321v278a7ae4k71db679eb6220a2f@mail.gmail.com> <309a667c0603282203g2aeaf069g7890cae1c6e40b1d@mail.gmail.com> <309a667c0603292027p1d5a016du771048e84415c3e8@mail.gmail.com> <1143727210.25287.0.camel@stevo-desktop> Message-ID: <309a667c0604042344s16495622w637aa541def67459@mail.gmail.com> Hi list and Roland, Is this verb (ib_get_dma_mr) is equivalent to the verb explained in the section 11.2.8.1 Allocate L_key? On 3/30/06, Steve Wise wrote: > > On Wed, 2006-03-29 at 20:35 -0800, Roland Dreier wrote: > > Devesh> Here I am saying that assigning Key is sufficient Or there > > Devesh> are some other specific setps to be taken? > > > > It would depend on the device. You can look at the mthca, ipath and > ehca > > drivers' implementation of get_dma_mr() for examples. > > > > As well as the iwarp devices in the iwarp branch. amso1100 and cxgb3. > > > -------------- next part -------------- An HTML attachment was scrubbed... 

From krkumar2 at in.ibm.com Wed Apr 5 00:13:53 2006
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Wed, 5 Apr 2006 12:43:53 +0530
Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune patch
In-Reply-To:
Message-ID:

Shirley,

Some nits :

1. You can use :
        priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring, GFP_KERNEL);
   (for tx & rx) instead (though this is not a change introduced by you).

2. make :
        + *else* if (ipoib_recvq_size < IPOIB_MIN_QUEUE_SIZE) {
   (for sendq/recvq)

3. Error messages can be changed from (since "too big" implies cutting at
   "max", same for "too small" and "min") :
        printk(KERN_WARNING "%s: ipoib_sendq_size is too big, use max %d instead\n", ca->name, IPOIB_MAX_QUEUE_SIZE);
   to :
        printk(KERN_WARNING "%s: ipoib_sendq_size is too big, using %d instead\n", ca->name, IPOIB_MAX_QUEUE_SIZE);

Thanks,

- KK

From k_mahesh85 at yahoo.co.in Wed Apr 5 00:24:20 2006
From: k_mahesh85 at yahoo.co.in (keshetti mahesh)
Date: Wed, 5 Apr 2006 08:24:20 +0100 (BST)
Subject: [openib-general] how to build a library from a source code?
Message-ID: <20060405072420.81173.qmail@web8326.mail.in.yahoo.com>

I am using an InfiniBand cluster in my lab and have set up the OpenIB
stack on it. I came to know that to make SDP transparent I need to
PRELOAD the libsdp.so library. I now have the libsdp source code
(downloaded from openib.org). Can anybody please tell me how I can build
libsdp.so from that code and how to preload it?

Thanks and regards,
K. Mahesh

From mst at mellanox.co.il Wed Apr 5 00:49:04 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 5 Apr 2006 10:49:04 +0300 Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task In-Reply-To: References: <20060404155233.GR14808@mellanox.co.il> Message-ID: <20060405074904.GC14808@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib_mcast_restart_task > > Michael> Roland, Eli spotted the following race that might explain > Michael> part of the crashes in sendonly complete. > > Actually I don't see how this could explain it. The oops seems to be > happening when the CPU follows the mcast pointer to get mcast->dev, > and even if mcast has been freed, it should still point to valid > kernel memory. Not sure I read you. It'd still be use after free, won't it? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Wed Apr 5 00:58:39 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 10:58:39 +0300 Subject: [openib-general] Re: [patch 11/26] IPOB: Move destructor from neigh->ops to neigh_param In-Reply-To: <20060404.174741.63557413.davem@davemloft.net> References: <20060404.171739.92845421.davem@davemloft.net> <20060404.174741.63557413.davem@davemloft.net> Message-ID: <20060405075839.GD14808@mellanox.co.il> Quoting r. David S. Miller : > I think it's too risky. It fixes a panic for infiniband. Fair enough. > I think you should not have submitted such a core networking change to > -stable without passing it by netdev CC:'ing me first. OK, note taken. -- Michael S. Tsirkin From k_mahesh85 at yahoo.co.in Wed Apr 5 01:29:13 2006 From: k_mahesh85 at yahoo.co.in (keshetti mahesh) Date: Wed, 5 Apr 2006 09:29:13 +0100 (BST) Subject: [openib-general] how can i know whether my appl'n is using SDP or not? 
Message-ID: <20060405082913.35553.qmail@web8314.mail.in.yahoo.com>

I am working on an InfiniBand cluster, and the OpenIB stack (including
SDP) is installed on the hosts. Can anybody tell me, when I run an
application over InfiniBand, how I can know whether it is using SDP or
IPoIB?

Regards,
K.Mahesh

From ogerlitz at voltaire.com Wed Apr 5 01:42:27 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 5 Apr 2006 11:42:27 +0300
Subject: [openib-general] the cma is not in the for-2.6.18 branch of the git tree
Message-ID: 

Roland,

From the 2.6.17 related discussion I understand that the kernel portion
of the cma is ready for upstream.

As part of the work to push iser for 2.6.18 I am going to send an RFC on
it to linux-scsi (and of course resend the RFC to open-ib). For the
people who review it on linux-scsi to be able to compile iser, they
would need rdma_cm.h and the associated changes in other headers (e.g.
ib_cm.h) under include/rdma.

I see now that there is also an rdma_cm branch in the git tree, but I
was thinking that if you agree that the kernel cma is ready for
upstream, it makes sense for the for-2.6.18 branch to have it.

Or.

From yael at mellanox.co.il Wed Apr 5 01:54:15 2006
From: yael at mellanox.co.il (Yael Kalka)
Date: 05 Apr 2006 11:54:15 +0300
Subject: [openib-general] [PATCH] OpenSM - fix osm_vendor_send on vendor mads
Message-ID: <5zvetos8qw.fsf@mtl066.yok.mtl.com>

Hi Hal,

We saw the following problem in osm_vendor_send (in osm_vendor_ibumad.c).
Currently there is a switch on the management class value, with cases for
IB_MCLASS_SUBN_DIR and IB_MCLASS_SUBN_LID plus a default, where the
default case assumes that the management class is IB_MCLASS_SUBN_ADM.
So when libosmvendor is used to send, for example, vendor-class MADs, we
treat them as SA MADs and access the RMPP fields, which are not relevant
in this case.
The following patch turns the old default branch into an explicit case
for IB_MCLASS_SUBN_ADM, and changes the default case so that it does not
check the RMPP header.
Please apply the patch on both trunk and branch 1.0.

Thanks,
Yael

Signed-off-by: Yael Kalka 

Index: libvendor/osm_vendor_ibumad.c
===================================================================
--- libvendor/osm_vendor_ibumad.c (revision 6192)
+++ libvendor/osm_vendor_ibumad.c (working copy)
@@ -1053,7 +1053,7 @@ osm_vendor_send(
 		umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, 0, 0, 0);
 		umad_set_grh(p_vw->umad, 0);
 		break;
-	default: /* GSI FIXME: no GRH */
+	case IB_MCLASS_SUBN_ADM: /* GSI FIXME: no GRH */
 		umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
 				  p_mad_addr->addr_type.gsi.remote_qp,
 				  p_mad_addr->addr_type.gsi.service_level,
@@ -1087,6 +1087,14 @@ osm_vendor_send(
 		}
 #endif
 		break;
+	default: /* GSI FIXME: no GRH */
+		umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
+				  p_mad_addr->addr_type.gsi.remote_qp,
+				  p_mad_addr->addr_type.gsi.service_level,
+				  IB_QP1_WELL_KNOWN_Q_KEY);
+		umad_set_grh(p_vw->umad, 0); /* FIXME: GRH support */
+		umad_set_pkey(p_vw->umad, p_mad_addr->addr_type.gsi.pkey);
+		break;
 	}
 
 	if (resp_expected)

From halr at voltaire.com Wed Apr 5 03:54:18 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Apr 2006 06:54:18 -0400
Subject: [openib-general] Re: [PATCH] OpenSM - osm_vendor_mlx_svc.h fix for branch1.0
In-Reply-To: <5zwte4sfc7.fsf@mtl066.yok.mtl.com>
References: <5zwte4sfc7.fsf@mtl066.yok.mtl.com>
Message-ID: <1144234457.4480.71788.camel@hal.voltaire.com>

On Wed, 2006-04-05 at 02:31, Yael Kalka wrote:
> Hi Hal,
>
> Attached is a patch for branch 1.0, for the osm_vendor_mlx_svc.h that
> was applied on the trunk but not on the branch 1.0.
> > Yael > > OpenSM/osm_vendor_mlx_svc.h: Identify RMPP MADs > > RMPP mads can only sent in 4 MAD classes: > 1. IB_MCLASS_SUBN_ADM > 2. IB_MCLASS_DEV_MGMT > 3. BIS > 4. DevAdm > > If the packet is not 1 or 2 - the mad is not checked and returned in advance as not a rmpp packet. > > Since 3,4 are not management classes (see 13.4.4 page 720 in the spec), > I don't know whether they are defined as constants or not, so I didn't > handle them. > > Signed-off-by: Ofer Gigi > Signed-off-by: Yael Kalka Thanks. Applied to 1.0 branch. From xma at us.ibm.com Wed Apr 5 04:02:45 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Apr 2006 05:02:45 -0600 Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune patch In-Reply-To: Message-ID: Thanks, updated. Signed-off-by: Shirley Ma diff -urpN infiniband/ulp/ipoib/ipoib.h infiniband-queue/ulp/ipoib/ipoib.h --- infiniband/ulp/ipoib/ipoib.h 2006-03-26 11:57:15.000000000 -0800 +++ infiniband-queue/ulp/ipoib/ipoib.h 2006-04-04 16:53:24.000000000 -0700 @@ -338,6 +338,8 @@ static inline void ipoib_unregister_debu #define ipoib_warn(priv, format, arg...) 
\ ipoib_printk(KERN_WARNING, priv, format , ## arg) +extern int ipoib_sendq_size; +extern int ipoib_recvq_size; #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG extern int ipoib_debug_level; Binary files infiniband/ulp/ipoib/.ipoib.h.swp and infiniband-queue/ulp/ipoib/.ipoib.h.swp differ diff -urpN infiniband/ulp/ipoib/ipoib_ib.c infiniband-queue/ulp/ipoib/ipoib_ib.c --- infiniband/ulp/ipoib/ipoib_ib.c 2006-03-26 11:57:15.000000000 -0800 +++ infiniband-queue/ulp/ipoib/ipoib_ib.c 2006-04-04 16:56:49.000000000 -0700 @@ -161,7 +161,7 @@ static int ipoib_ib_post_receives(struct struct ipoib_dev_priv *priv = netdev_priv(dev); int i; - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + for (i = 0; i < ipoib_recvq_size; ++i) { if (ipoib_alloc_rx_skb(dev, i)) { ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); return -ENOMEM; @@ -187,7 +187,7 @@ static void ipoib_ib_handle_wc(struct ne if (wr_id & IPOIB_OP_RECV) { wr_id &= ~IPOIB_OP_RECV; - if (wr_id < IPOIB_RX_RING_SIZE) { + if (wr_id < ipoib_recvq_size) { struct sk_buff *skb = priv->rx_ring[wr_id].skb; dma_addr_t addr = priv->rx_ring[wr_id].mapping; @@ -252,9 +252,9 @@ static void ipoib_ib_handle_wc(struct ne struct ipoib_tx_buf *tx_req; unsigned long flags; - if (wr_id >= IPOIB_TX_RING_SIZE) { + if (wr_id >= ipoib_sendq_size) { ipoib_warn(priv, "completion event with wrid %d (> %d)\n", - wr_id, IPOIB_TX_RING_SIZE); + wr_id, ipoib_sendq_size); return; } @@ -275,7 +275,7 @@ static void ipoib_ib_handle_wc(struct ne spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; if (netif_queue_stopped(dev) && - priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + priv->tx_head - priv->tx_tail <= ipoib_sendq_size / 2) netif_wake_queue(dev); spin_unlock_irqrestore(&priv->tx_lock, flags); @@ -344,13 +344,13 @@ void ipoib_send(struct net_device *dev, * means we have to make sure everything is properly recorded and * our state is consistent before we call post_send(). 
*/ - tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)]; tx_req->skb = skb; addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, DMA_TO_DEVICE); pci_unmap_addr_set(tx_req, mapping, addr); - if (unlikely(post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), address->ah, qpn, addr, skb->len))) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; @@ -363,7 +363,7 @@ void ipoib_send(struct net_device *dev, address->last_send = priv->tx_head; ++priv->tx_head; - if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) { ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); netif_stop_queue(dev); } @@ -488,7 +488,7 @@ static int recvs_pending(struct net_devi int pending = 0; int i; - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + for (i = 0; i < ipoib_recvq_size; ++i) if (priv->rx_ring[i].skb) ++pending; @@ -527,7 +527,7 @@ int ipoib_ib_dev_stop(struct net_device */ while ((int) priv->tx_tail - (int) priv->tx_head < 0) { tx_req = &priv->tx_ring[priv->tx_tail & - (IPOIB_TX_RING_SIZE - 1)]; + (ipoib_sendq_size - 1)]; dma_unmap_single(priv->ca->dma_device, pci_unmap_addr(tx_req, mapping), tx_req->skb->len, @@ -536,7 +536,7 @@ int ipoib_ib_dev_stop(struct net_device ++priv->tx_tail; } - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + for (i = 0; i < ipoib_recvq_size; ++i) if (priv->rx_ring[i].skb) { dma_unmap_single(priv->ca->dma_device, pci_unmap_addr(&priv->rx_ring[i], diff -urpN infiniband/ulp/ipoib/ipoib_main.c infiniband-queue/ulp/ipoib/ipoib_main.c --- infiniband/ulp/ipoib/ipoib_main.c 2006-03-28 19:20:21.000000000 -0800 +++ infiniband-queue/ulp/ipoib/ipoib_main.c 2006-04-05 04:54:57.664770104 -0700 @@ -41,6 +41,7 @@ #include #include #include +#include #include /* For ARPHRD_xxx */ @@ -53,6 +54,17 @@ MODULE_AUTHOR("Roland Dreier"); 
MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); MODULE_LICENSE("Dual BSD/GPL"); +#define IPOIB_MAX_QUEUE_SIZE 8192 /* max is 8k */ +#define IPOIB_MIN_QUEUE_SIZE 32 /* min is 32 */ + +int ipoib_sendq_size = IPOIB_TX_RING_SIZE; +int ipoib_recvq_size = IPOIB_RX_RING_SIZE; + +module_param_named(sendq_size, ipoib_sendq_size, int, 0444); +MODULE_PARM_DESC(sendq_size, "Number of wqe in send queue"); +module_param_named(recvq_size, ipoib_recvq_size, int, 0444); +MODULE_PARM_DESC(recvq_size, "Number of wqe in receive queue"); + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG int ipoib_debug_level; @@ -843,19 +855,41 @@ int ipoib_dev_init(struct net_device *de /* Allocate RX/TX "rings" to hold queued skbs */ - priv->rx_ring = kzalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf), + if (ipoib_recvq_size > IPOIB_MAX_QUEUE_SIZE) { + ipoib_recvq_size = IPOIB_MAX_QUEUE_SIZE; + printk(KERN_WARNING "%s: ipoib_recvq_size is too big, using %d instead\n", + ca->name, IPOIB_MAX_QUEUE_SIZE); + } else if (ipoib_recvq_size < IPOIB_MIN_QUEUE_SIZE) { + ipoib_recvq_size = IPOIB_MIN_QUEUE_SIZE; + printk(KERN_WARNING "%s: ipoib_recvq_size is too small, using %d instead\n", + ca->name, IPOIB_MIN_QUEUE_SIZE); + } + ipoib_recvq_size = roundup_pow_of_two(ipoib_recvq_size); + priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof (*priv->rx_ring), GFP_KERNEL); if (!priv->rx_ring) { printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", - ca->name, IPOIB_RX_RING_SIZE); + ca->name, ipoib_sendq_size); goto out; } - priv->tx_ring = kzalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf), + if (ipoib_sendq_size > IPOIB_MAX_QUEUE_SIZE) { + ipoib_sendq_size = IPOIB_MAX_QUEUE_SIZE; + printk(KERN_WARNING "%s: ipoib_sendq_size is too big, using %d instead\n", + ca->name, IPOIB_MAX_QUEUE_SIZE); + } else if (ipoib_sendq_size < IPOIB_MIN_QUEUE_SIZE) { + ipoib_sendq_size = IPOIB_MIN_QUEUE_SIZE; + printk(KERN_WARNING "%s: ipoib_sendq_size is too small, using %d instead\n", + ca->name, 
IPOIB_MIN_QUEUE_SIZE); + } + + ipoib_sendq_size = roundup_pow_of_two(ipoib_sendq_size); + + priv->tx_ring = kzalloc(ipoib_sendq_size * sizeof (*priv->tx_ring), GFP_KERNEL); if (!priv->tx_ring) { printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", - ca->name, IPOIB_TX_RING_SIZE); + ca->name, ipoib_sendq_size); goto out_rx_ring_cleanup; } @@ -923,7 +957,7 @@ static void ipoib_setup(struct net_devic dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; dev->addr_len = INFINIBAND_ALEN; dev->type = ARPHRD_INFINIBAND; - dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->tx_queue_len = ipoib_sendq_size * 2; dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; /* MTU will be reset when mcast join happens */ diff -urpN infiniband/ulp/ipoib/ipoib_verbs.c infiniband-queue/ulp/ipoib/ipoib_verbs.c --- infiniband/ulp/ipoib/ipoib_verbs.c 2006-03-26 11:57:15.000000000 -0800 +++ infiniband-queue/ulp/ipoib/ipoib_verbs.c 2006-04-04 16:57:07.000000000 -0700 @@ -159,8 +159,8 @@ int ipoib_transport_dev_init(struct net_ struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr init_attr = { .cap = { - .max_send_wr = IPOIB_TX_RING_SIZE, - .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_wr = ipoib_sendq_size, + .max_recv_wr = ipoib_recvq_size, .max_send_sge = 1, .max_recv_sge = 1 }, @@ -175,7 +175,7 @@ int ipoib_transport_dev_init(struct net_ } priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, - IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + ipoib_sendq_size + ipoib_recvq_size + 1); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); goto out_free_pd; Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: infiniband-tune-queue.patch Type: application/octet-stream Size: 8456 bytes Desc: not available URL: From halr at voltaire.com Wed Apr 5 03:58:51 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2006 06:58:51 -0400 Subject: [openib-general] Re: [PATCH] OpenSM - complib fix for branch In-Reply-To: <5zy7yksfnq.fsf@mtl066.yok.mtl.com> References: <5zy7yksfnq.fsf@mtl066.yok.mtl.com> Message-ID: <1144234575.4480.71812.camel@hal.voltaire.com> Hi Yael, On Wed, 2006-04-05 at 02:24, Yael Kalka wrote: > Hi Hal, > > I saw that the complib patch (removal of constructor and destructor > attribute), wasn't fully added to the branch. > Attached is a patch for the branch. Is this needed for 1.0 ? Is this safe to add ? Was there more to it than just this ? -- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: complib/cl_complib.c > =================================================================== > --- complib/cl_complib.c (revision 6203) > +++ complib/cl_complib.c (working copy) > @@ -65,7 +65,6 @@ __cl_timer_prov_destroy( void ); > cl_spinlock_t cl_atomic_spinlock; > > void > -__attribute (( constructor )) > complib_init(void) > { > cl_status_t status = CL_SUCCESS; > @@ -90,14 +89,6 @@ complib_init(void) > } > > void > -__attribute (( destructor )) > -complib_fini(void) > -{ > - __cl_timer_prov_destroy(); > - __cl_user_syshelper_exit(); > -} > - > -void > complib_exit(void) > { > __cl_timer_prov_destroy(); > Index: opensm/main.c > =================================================================== > --- opensm/main.c (revision 6203) > +++ opensm/main.c (working copy) > @@ -44,9 +44,6 @@ > * > * $Revision: 1.23 $ > */ > -#ifdef __WIN__ > -#pragma warning(disable : 4996) > -#endif > > #if HAVE_CONFIG_H > # include > @@ -557,9 +554,7 @@ main( > { NULL, 0, NULL, 0 } /* Required at the end of the array */ > }; > > -#ifdef __WIN__ > complib_init(); > -#endif > > /* Make sure that the opensm and complib were compiled using > same modes (debug/free) */ > 
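
The complib patch quoted above drops the constructor/destructor
attributes from complib_init()/complib_fini() and has main() call
complib_init() unconditionally. The general pattern of an explicit
init/exit pair can be sketched roughly as follows. This is only an
illustration: demo_lib_init, demo_lib_exit and demo_lib_use are
hypothetical names, not complib functions, and the state they manage is
made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical library state; stands in for whatever a real init sets up. */
static bool demo_lib_ready = false;
static int demo_lib_init_calls = 0;

/*
 * Explicit initializer: the application decides when this runs,
 * instead of the loader running it through a constructor attribute.
 */
void demo_lib_init(void)
{
	if (demo_lib_ready)
		return;		/* tolerate a repeated call */
	demo_lib_ready = true;
	++demo_lib_init_calls;
}

/* Explicit teardown, paired with demo_lib_init(). */
void demo_lib_exit(void)
{
	demo_lib_ready = false;
}

/* An entry point can verify that initialization really happened. */
int demo_lib_use(int x)
{
	assert(demo_lib_ready);
	return x + 1;
}
```

An application using such a library would call demo_lib_init() once near
the top of main() and demo_lib_exit() before returning, so the ordering
is visible in the source rather than left to the loader.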
From halr at voltaire.com Wed Apr 5 04:11:51 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2006 07:11:51 -0400 Subject: [openib-general] Re: [PATCH] OpenSM - fix osm_vendor_send on vendor mads In-Reply-To: <5zvetos8qw.fsf@mtl066.yok.mtl.com> References: <5zvetos8qw.fsf@mtl066.yok.mtl.com> Message-ID: <1144235507.4480.71959.camel@hal.voltaire.com> Hi Yael, On Wed, 2006-04-05 at 04:54, Yael Kalka wrote: > Hi Hal, > > We saw the following problem in the osm_vendor_send mad (in > osm_vendor_ibumad.c). Currently, there is a case on the Management > Class values, where the cases are > IB_MCLASS_SUBN_DIR/IB_MCLASS_SUBN_LID and default, when the assumption > is that in the default case the management class is > IB_MCLASS_SUBN_ADM. > So when the libosmvendor is used for sending for example Vendor type > mads, we address it as SA mad, and address the rmpp fields, which are > not relevant in this case. > The following patch addes the default as case of IB_MCLASS_SUBN_ADM, > and changes the default case to not to check the rmpp header. > Please apply the patch on both trunk and branch 1.0. The convention is that the RMPP active flag is off when not sending RMPP. That needs to be conformed to in the GSI classes (consumers of this). Is that the case ? 
-- Hal > Thanks, > > Yael > > Signed-off-by: Yael Kalka > > Index: libvendor/osm_vendor_ibumad.c > =================================================================== > --- libvendor/osm_vendor_ibumad.c (revision 6192) > +++ libvendor/osm_vendor_ibumad.c (working copy) > @@ -1053,7 +1053,7 @@ osm_vendor_send( > umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, 0, 0, 0); > umad_set_grh(p_vw->umad, 0); > break; > - default: /* GSI FIXME: no GRH */ > + case IB_MCLASS_SUBN_ADM: /* GSI FIXME: no GRH */ > umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, > p_mad_addr->addr_type.gsi.remote_qp, > p_mad_addr->addr_type.gsi.service_level, > @@ -1087,6 +1087,14 @@ osm_vendor_send( > } > #endif > break; > + default: /* GSI FIXME: no GRH */ > + umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, > + p_mad_addr->addr_type.gsi.remote_qp, > + p_mad_addr->addr_type.gsi.service_level, > + IB_QP1_WELL_KNOWN_Q_KEY); > + umad_set_grh(p_vw->umad, 0); /* FIXME: GRH support */ > + umad_set_pkey(p_vw->umad, p_mad_addr->addr_type.gsi.pkey); > + break; > } > > if (resp_expected) > From halr at voltaire.com Wed Apr 5 04:23:11 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2006 07:23:11 -0400 Subject: [openib-general] Re: [PATCH] OpenSM - fix osm_vendor_send on vendor mads In-Reply-To: <1144235507.4480.71959.camel@hal.voltaire.com> References: <5zvetos8qw.fsf@mtl066.yok.mtl.com> <1144235507.4480.71959.camel@hal.voltaire.com> Message-ID: <1144236191.4480.72072.camel@hal.voltaire.com> On Wed, 2006-04-05 at 07:11, Hal Rosenstock wrote: > Hi Yael, > > On Wed, 2006-04-05 at 04:54, Yael Kalka wrote: > > Hi Hal, > > > > We saw the following problem in the osm_vendor_send mad (in > > osm_vendor_ibumad.c). Currently, there is a case on the Management > > Class values, where the cases are > > IB_MCLASS_SUBN_DIR/IB_MCLASS_SUBN_LID and default, when the assumption > > is that in the default case the management class is > > IB_MCLASS_SUBN_ADM. 
> > So when the libosmvendor is used for sending for example Vendor type > > mads, we address it as SA mad, and address the rmpp fields, which are > > not relevant in this case. > > The following patch addes the default as case of IB_MCLASS_SUBN_ADM, > > and changes the default case to not to check the rmpp header. > > Please apply the patch on both trunk and branch 1.0. > > The convention is that the RMPP active flag is off when not sending > RMPP. That needs to be conformed to in the GSI classes (consumers of > this). Is that the case ? Neither the way it is nor the patch are correct. I'm working on a more complete patch for this issue. -- Hal > -- Hal > > > Thanks, > > > > Yael > > > > Signed-off-by: Yael Kalka > > > > Index: libvendor/osm_vendor_ibumad.c > > =================================================================== > > --- libvendor/osm_vendor_ibumad.c (revision 6192) > > +++ libvendor/osm_vendor_ibumad.c (working copy) > > @@ -1053,7 +1053,7 @@ osm_vendor_send( > > umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, 0, 0, 0); > > umad_set_grh(p_vw->umad, 0); > > break; > > - default: /* GSI FIXME: no GRH */ > > + case IB_MCLASS_SUBN_ADM: /* GSI FIXME: no GRH */ > > umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, > > p_mad_addr->addr_type.gsi.remote_qp, > > p_mad_addr->addr_type.gsi.service_level, > > @@ -1087,6 +1087,14 @@ osm_vendor_send( > > } > > #endif > > break; > > + default: /* GSI FIXME: no GRH */ > > + umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, > > + p_mad_addr->addr_type.gsi.remote_qp, > > + p_mad_addr->addr_type.gsi.service_level, > > + IB_QP1_WELL_KNOWN_Q_KEY); > > + umad_set_grh(p_vw->umad, 0); /* FIXME: GRH support */ > > + umad_set_pkey(p_vw->umad, p_mad_addr->addr_type.gsi.pkey); > > + break; > > } > > > > if (resp_expected) > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please 
visit http://openib.org/mailman/listinfo/openib-general From eli at mellanox.co.il Wed Apr 5 04:31:24 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Apr 2006 14:31:24 +0300 Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task Message-ID: <200604051431.25364.eli@mellanox.co.il> > Yes, looks like there might be problem here. However, is there any > way to consolidate the "cancel and wait for done" code in one place, > rather than just cut-and-pasting it from ipoib_stop_thread()? An appropriate patch will follow. > This could explain the oops in ipoib_mcast_sendonly_join_complete(), > but only if a send-only group is being replaced by a full-member > join. Is Eli's test doing that? No, not deliberately but it did not happen again after a full night runs. I will keep running the tests for a while and track this. From mst at mellanox.co.il Wed Apr 5 04:52:32 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 14:52:32 +0300 Subject: [openib-general] IPoIB descructor for 2.6.16-stable? Message-ID: <20060405115232.GA21115@mellanox.co.il> Roland, given that the cleaner backport from 2.6.17 got voted down from 2.6.16, how about pushing a work-around from subversion in there? Something along the lines of the patch below? Is this small/obvious enough to be considered for stable? What do you think? Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.16.1/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.16.1.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-03-28 08:49:02.000000000 +0200 +++ linux-2.6.16.1/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-04-05 11:30:35.000000000 +0300 @@ -73,6 +73,9 @@ static const u8 ipv4_bcast_addr[] = { struct workqueue_struct *ipoib_workqueue; +static DEFINE_SPINLOCK(ipoib_all_neigh_list_lock); +static LIST_HEAD(ipoib_all_neigh_list); + static void ipoib_add_one(struct ib_device *device); static void ipoib_remove_one(struct ib_device *device); @@ -246,8 +249,8 @@ static void path_free(struct net_device */ if (neigh->ah) ipoib_put_ah(neigh->ah); + ipoib_neigh_cleanup(neigh); *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; kfree(neigh); } @@ -486,6 +489,7 @@ static void neigh_add_path(struct sk_buf skb_queue_head_init(&neigh->queue); neigh->neighbour = skb->dst->neighbour; *to_ipoib_neigh(skb->dst->neighbour) = neigh; + ipoib_neigh_setup(neigh) /* * We can only be called from ipoib_start_xmit, so we're @@ -528,9 +532,9 @@ static void neigh_add_path(struct sk_buf return; err: + ipoib_neigh_cleanup(neigh); *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - neigh->neighbour->ops->destructor = NULL; kfree(neigh); ++priv->stats.tx_dropped; @@ -747,6 +751,17 @@ static void ipoib_neigh_destructor(struc unsigned long flags; struct ipoib_ah *ah = NULL; + struct ipoib_neigh *tn, *nn = NULL; + spin_lock(&ipoib_all_neigh_list_lock); + list_for_each_entry(tn, &ipoib_all_neigh_list, all_neigh_list) + if (tn->neighbour == n) { + nn = tn; + break; + } + spin_unlock(&ipoib_all_neigh_list_lock); + if (!nn) + return; + ipoib_dbg(priv, "neigh_destructor for %06x " IPOIB_GID_FMT "\n", be32_to_cpup((__be32 *) n->ha), @@ -759,6 +774,7 @@ static void ipoib_neigh_destructor(struc if (neigh->ah) ah = neigh->ah; list_del(&neigh->list); + 
ipoib_neigh_cleanup(neigh); *to_ipoib_neigh(n) = NULL; kfree(neigh); } @@ -769,23 +785,31 @@ static void ipoib_neigh_destructor(struc ipoib_put_ah(ah); } -static int ipoib_neigh_setup(struct neighbour *neigh) +void ipoib_neigh_setup(struct ipoib_neigh *neigh) { /* * Is this kosher? I can't find anybody in the kernel that * sets neigh->destructor, so we should be able to set it here * without trouble. */ - neigh->ops->destructor = ipoib_neigh_destructor; - - return 0; + spin_lock(&ipoib_all_neigh_list_lock); + list_add_tail(&neigh->all_neigh_list, &ipoib_all_neigh_list); + neigh->neighbour->ops->destructor = ipoib_neigh_destructor; + spin_unlock(&ipoib_all_neigh_list_lock); } -static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) -{ - parms->neigh_setup = ipoib_neigh_setup; +int ipoib_neigh_cleanup(struct ipoib_neigh *neigh) +{ + struct ipoib_neigh *nn; + spin_lock(&ipoib_all_neigh_list_lock); + list_del(&neigh->all_neigh_list); + list_for_each_entry(nn, &ipoib_all_neigh_list, all_neigh_list) + if (nn->neighbour->ops == neigh->neighbour->ops) + goto found; - return 0; + neigh->neighbour->ops->destructor = NULL; +found: + spin_unlock(&ipoib_all_neigh_list_lock); } int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) @@ -861,7 +885,6 @@ static void ipoib_setup(struct net_devic dev->tx_timeout = ipoib_timeout; dev->hard_header = ipoib_hard_header; dev->set_multicast_list = ipoib_set_mcast_list; - dev->neigh_setup = ipoib_neigh_setup_dev; dev->watchdog_timeo = HZ; Index: linux-2.6.16.1/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.16.1.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-03-28 08:49:02.000000000 +0200 +++ linux-2.6.16.1/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-04-05 11:30:57.000000000 +0300 @@ -114,8 +114,8 @@ static void ipoib_mcast_free(struct ipoi */ if (neigh->ah) ipoib_put_ah(neigh->ah); + 
ipoib_neigh_cleanup(neigh) *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; kfree(neigh); } @@ -766,6 +766,7 @@ out: neigh->ah = mcast->ah; neigh->neighbour = skb->dst->neighbour; *to_ipoib_neigh(skb->dst->neighbour) = neigh; + ipoib_neigh_setup(neigh) list_add_tail(&neigh->list, &mcast->neigh_list); } } -- MST From eli at mellanox.co.il Wed Apr 5 04:59:40 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Apr 2006 14:59:40 +0300 Subject: [openib-general] [PATCH] ipoib_mcast_restart_task Message-ID: <200604051459.41367.eli@mellanox.co.il> ipoib_mcast_restart_task might free an mcast object while a join request is still outstanding, leading to an oops when the query completes. Fix this by waiting for query to complete, similar to what ipoib_stop_thread is doing. The wait for mcast completion code is consolidated in wait_join_complete(). Signed-off-by: Eli Cohen Signed-off-by: Michael S. Tsirkin Index: latest/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ latest/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -617,6 +617,22 @@ int ipoib_mcast_start_thread(struct net_ return 0; } +static void wait_join_complete(struct ipoib_dev_priv *priv, + struct ipoib_mcast *mcast) +{ + spin_lock_irq(&priv->lock); + if (mcast && mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + spin_unlock_irq(&priv->lock); + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } + else + spin_unlock_irq(&priv->lock); +} + int ipoib_mcast_stop_thread(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -636,28 +652,10 @@ int ipoib_mcast_stop_thread(struct net_d if (flush) flush_workqueue(ipoib_workqueue); - spin_lock_irq(&priv->lock); - if (priv->broadcast 
&& priv->broadcast->query) { - ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); - priv->broadcast->query = NULL; - spin_unlock_irq(&priv->lock); - ipoib_dbg_mcast(priv, "waiting for bcast\n"); - wait_for_completion(&priv->broadcast->done); - } else - spin_unlock_irq(&priv->lock); + wait_join_complete(priv, priv->broadcast); - list_for_each_entry(mcast, &priv->multicast_list, list) { - spin_lock_irq(&priv->lock); - if (mcast->query) { - ib_sa_cancel_query(mcast->query_id, mcast->query); - mcast->query = NULL; - spin_unlock_irq(&priv->lock); - ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(mcast->mcmember.mgid)); - wait_for_completion(&mcast->done); - } else - spin_unlock_irq(&priv->lock); - } + list_for_each_entry(mcast, &priv->multicast_list, list) + wait_join_complete(priv, mcast); return 0; } @@ -910,6 +908,7 @@ void ipoib_mcast_restart_task(void *dev_ /* We have to cancel outside of the spinlock */ list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { + wait_join_complete(priv, mcast); ipoib_mcast_leave(mcast->dev, mcast); ipoib_mcast_free(mcast); } From vlad at mellanox.co.il Wed Apr 5 05:00:15 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 05 Apr 2006 15:00:15 +0300 Subject: [openib-general] Re: [DAPL] Provider initialialization In-Reply-To: References: <4432B692.8000606@ichips.intel.com> Message-ID: <4433B14F.9010701@mellanox.co.il> Hi James, Does uDAPL support PPC64 architecture? 
Description: OS: Fedora Core release 4 (Stentz) Kernel: 2.6.11-1.1369_FC4 Arch: ppc64 I have the following compilation error: make -C src/userspace/dapl \ CPPFLAGS="-I../libibverbs/include/infiniband -I../librdmacm/include \ -I../libibverbs/include" \ LDFLAGS="-L../libibverbs/src -libverbs -L../librdmacm/src/.libs -lrdmacm" make[1]: Entering directory `/var/tmp/IBED/tmp/openib/openib/src/userspace/dapl' make all-am make[2]: Entering directory `/var/tmp/IBED/tmp/openib/openib/src/userspace/dapl' if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I. -I../libibverbs/include/infiniband -I../librdmacm/include -I../libibverbs/inc lude -Wall -g -D_GNU_SOURCE -DOPENIB -DCQ_WAIT_OBJECT -I./dat/include/ -I./dapl/include/ -I./dapl/common -I./dapl/udapl/linux -I./dapl/openib_cma -g -O2 -M T dapl_udapl_libdaplcma_la-dapl_init.lo -MD -MP -MF ".deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo" -c -o dapl_udapl_libdaplcma_la-dapl_init.lo `test -f 'dapl /udapl/dapl_init.c' || echo './'`dapl/udapl/dapl_init.c; \ then mv -f ".deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo" ".deps/dapl_udapl_libdaplcma_la-dapl_init.Plo"; else rm -f ".deps/dapl_udapl_libdaplcma_la-dapl_ini t.Tpo"; exit 1; fi mkdir .libs gcc -DHAVE_CONFIG_H -I. -I. -I. 
-I../libibverbs/include/infiniband -I../librdmacm/include -I../libibverbs/include -Wall -g -D_GNU_SOURCE -DOPENIB -DCQ_WAIT_ OBJECT -I./dat/include/ -I./dapl/include/ -I./dapl/common -I./dapl/udapl/linux -I./dapl/openib_cma -g -O2 -MT dapl_udapl_libdaplcma_la-dapl_init.lo -MD -MP - MF .deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo -c dapl/udapl/dapl_init.c -fPIC -DPIC -o .libs/dapl_udapl_libdaplcma_la-dapl_init.o In file included from ./dapl/include/dapl.h:50, from dapl/udapl/dapl_init.c:39: ./dapl/udapl/linux/dapl_osd.h:53:2: error: #error UNDEFINED ARCH make[2]: *** [dapl_udapl_libdaplcma_la-dapl_init.lo] Error 1 make[2]: Leaving directory `/var/tmp/IBED/tmp/openib/openib/src/userspace/dapl' make[1]: *** [all] Error 2 make[1]: Leaving directory `/var/tmp/IBED/tmp/openib/openib/src/userspace/dapl' make: *** [dapl] Error 2 Thanks, /*Best Regards,*/ /*Vladimir Sokolovsky*/ /*Software Integration Engineer*/ /**//*Mellanox Technologies Ltd.*/ From halr at voltaire.com Wed Apr 5 04:55:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2006 07:55:13 -0400 Subject: [openib-general] [PATCH] OpenSM: Fix osm_vendor_send for GSI classes Message-ID: <1144238111.4480.72379.camel@hal.voltaire.com> Hi Yael, Below is a complete fix for the problem you identified. Let me know if this works for you and I will check it into both the trunk and 1.0 branch. Thanks. -- Hal OpenSM: Fix osm_vendor_send for GSI classes Currently, the default for GSI classes assumes RMPP. There are two groups of GSI classes: those which support RMPP and those which don't. This patch handles them properly in osm_vendor_send. 
Problem pointed out by Yael Kalka Signed-off-by: Hal Rosenstock Index: include/iba/ib_types.h =================================================================== --- include/iba/ib_types.h (revision 6219) +++ include/iba/ib_types.h (working copy) @@ -515,6 +515,30 @@ BEGIN_C_DECLS #define IB_MCLASS_VENDOR_LOW_RANGE_MAX 0x0f /**********/ +/****d* IBA Base: Constants/IB_MCLASS_DEV_ADM +* NAME +* IB_MCLASS_DEV_ADM +* +* DESCRIPTION +* Subnet Management Class, Device Administration +* +* SOURCE +*/ +#define IB_MCLASS_DEV_ADM 0x10 +/**********/ + +/****d* IBA Base: Constants/IB_MCLASS_BIS +* NAME +* IB_MCLASS_BIS +* +* DESCRIPTION +* Subnet Management Class, BIS +* +* SOURCE +*/ +#define IB_MCLASS_BIS 0x12 +/**********/ + /****d* IBA Base: Constants/IB_MCLASS_VENDOR_HIGH_RANGE_MIN * NAME * IB_MCLASS_VENDOR_HIGH_RANGE_MIN Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 6219) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -1044,16 +1044,21 @@ osm_vendor_send( CL_ASSERT( p_vw->h_bind == h_bind ); CL_ASSERT( p_mad == umad_get_mad(p_vw->umad) ); - switch (p_mad->mgmt_class) { - case IB_MCLASS_SUBN_DIR: + if (p_mad->mgmt_class == IB_MCLASS_SUBN_DIR) { umad_set_addr_net(p_vw->umad, 0xffff, 0, 0, 0); umad_set_grh(p_vw->umad, 0); - break; - case IB_MCLASS_SUBN_LID: + goto Resp; + } + if (p_mad->mgmt_class == IB_MCLASS_SUBN_LID) { umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, 0, 0, 0); umad_set_grh(p_vw->umad, 0); - break; - default: /* GSI FIXME: no GRH */ + goto Resp; + } + if ((p_mad->mgmt_class == IB_MCLASS_SUBN_ADM) || + (p_mad->mgmt_class == IB_MCLASS_DEV_MGMT) || + (p_mad->mgmt_class == IB_MCLASS_DEV_ADM) || + (p_mad->mgmt_class == IB_MCLASS_BIS) || + ib_class_is_vendor_specific_high(p_mad->mgmt_class)) { /* RMPP GSI classes FIXME: no GRH */ umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, p_mad_addr->addr_type.gsi.remote_qp, 
p_mad_addr->addr_type.gsi.service_level, @@ -1086,9 +1091,17 @@ osm_vendor_send( p_sa->paylen_newwin = cl_ntoh32(paylen); } #endif - break; + goto Resp; + } else { /* non RMPP GSI classes FIXME: no GRH */ + umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, + p_mad_addr->addr_type.gsi.remote_qp, + p_mad_addr->addr_type.gsi.service_level, + IB_QP1_WELL_KNOWN_Q_KEY); + umad_set_grh(p_vw->umad, 0); /* FIXME: GRH support */ + umad_set_pkey(p_vw->umad, p_mad_addr->addr_type.gsi.pkey); } +Resp: if (resp_expected) put_madw(p_vend, p_madw, &p_mad->trans_id); From mst at mellanox.co.il Wed Apr 5 05:08:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 15:08:01 +0300 Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task In-Reply-To: <200604051431.25364.eli@mellanox.co.il> References: <200604051431.25364.eli@mellanox.co.il> Message-ID: <20060405120801.GC21115@mellanox.co.il> Quoting Eli Cohen : > > This could explain the oops in ipoib_mcast_sendonly_join_complete(), > > but only if a send-only group is being replaced by a full-member > > join. Is Eli's test doing that? > > No, not deliberately but it did not happen again after a full night runs. I > will keep running the tests for a while and track this. This could happen without you knowing it - some distros seem to set up multicast in scripts run from ifconfig, and this could be preceded by a send. -- MST From yael at mellanox.co.il Wed Apr 5 05:13:54 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Wed, 5 Apr 2006 15:13:54 +0300 Subject: [openib-general] RE: [PATCH] OpenSM - fix osm_vendor_send on vendor mads Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FE83@mtlexch01.mtl.com> When we are sending MAD with ManagementClass Vendor, then it isn't defined with rmpp head. Thus when looking at the rmpp header we are actually looking at part of the mad data. 
Yael > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, April 05, 2006 2:12 PM > To: Yael Kalka > Cc: openib-general at openib.org; Eitan Zahavi; Sasha Khapyorsky; Ofer Gigi > Subject: Re: [PATCH] OpenSM - fix osm_vendor_send on vendor mads > > Hi Yael, > > On Wed, 2006-04-05 at 04:54, Yael Kalka wrote: > > Hi Hal, > > > > We saw the following problem in the osm_vendor_send mad (in > > osm_vendor_ibumad.c). Currently, there is a case on the Management > > Class values, where the cases are > > IB_MCLASS_SUBN_DIR/IB_MCLASS_SUBN_LID and default, when the assumption > > is that in the default case the management class is > > IB_MCLASS_SUBN_ADM. > > So when the libosmvendor is used for sending for example Vendor type > > mads, we address it as SA mad, and address the rmpp fields, which are > > not relevant in this case. > > The following patch addes the default as case of IB_MCLASS_SUBN_ADM, > > and changes the default case to not to check the rmpp header. > > Please apply the patch on both trunk and branch 1.0. > > The convention is that the RMPP active flag is off when not sending > RMPP. That needs to be conformed to in the GSI classes (consumers of > this). Is that the case ? 
> > -- Hal > > > Thanks, > > > > Yael > > > > Signed-off-by: Yael Kalka > > > > Index: libvendor/osm_vendor_ibumad.c > > =================================================================== > > --- libvendor/osm_vendor_ibumad.c (revision 6192) > > +++ libvendor/osm_vendor_ibumad.c (working copy) > > @@ -1053,7 +1053,7 @@ osm_vendor_send( > > umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, 0, 0, 0); > > umad_set_grh(p_vw->umad, 0); > > break; > > - default: /* GSI FIXME: no GRH */ > > + case IB_MCLASS_SUBN_ADM: /* GSI FIXME: no GRH */ > > umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, > > p_mad_addr->addr_type.gsi.remote_qp, > > p_mad_addr->addr_type.gsi.service_level, > > @@ -1087,6 +1087,14 @@ osm_vendor_send( > > } > > #endif > > break; > > + default: /* GSI FIXME: no GRH */ > > + umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, > > + p_mad_addr->addr_type.gsi.remote_qp, > > + p_mad_addr->addr_type.gsi.service_level, > > + IB_QP1_WELL_KNOWN_Q_KEY); > > + umad_set_grh(p_vw->umad, 0); /* FIXME: GRH support */ > > + umad_set_pkey(p_vw->umad, p_mad_addr->addr_type.gsi.pkey); > > + break; > > } > > > > if (resp_expected) > > From dotanb at mellanox.co.il Wed Apr 5 05:35:58 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 5 Apr 2006 15:35:58 +0300 Subject: [openib-general] what should be the result of create a CQ with size 0? Message-ID: <200604051535.58448.dotanb@mellanox.co.il> Hi. We have a test case in which we try to create a CQ with 0 entries. Over the gen2 driver the creation of this CQ fails. what is the expected behavior of the driver? a) the CQ creation should fail b) the CQ should be created (with 1 or 2 entries) Thanks Dotan From mst at mellanox.co.il Wed Apr 5 05:47:16 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 5 Apr 2006 15:47:16 +0300 Subject: [openib-general] [PATCH] mthca: disable pci_tune Message-ID: <20060405124716.GF21115@mellanox.co.il> Roland, here's a patch for stability problems with mthca that were reported to Mellanox by some of our customers. I know how you think about options, and I agree, but this seems the best we can do in this case: it seems too risky to remove this tuning outright. --- PCI spec recommends against drivers playing with PCI read burst size, and says that systems software should configure it. And we actually have customers that report that changing it from the default set by BIOS hurts performance and/or stability for them. On the other hand, PRM recommends turning it up all the way to the maximum value. Some tests conducted here in the lab do not show performance improvement from this tuning, but this might be just me. As a work-around, lets make this tuning an option, off by default (safe value), with an eye towards removing it completely one day if no one complains. Signed-off-by: Michael S. 
Tsirkin Index: openib/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- openib.orig/drivers/infiniband/hw/mthca/mthca_main.c 2006-04-05 14:47:59.000000000 +0300 +++ openib/drivers/infiniband/hw/mthca/mthca_main.c 2006-04-05 14:57:21.000000000 +0300 @@ -69,6 +69,10 @@ MODULE_PARM_DESC(msi, "attempt to use MS #endif /* CONFIG_PCI_MSI */ +static int tune_pci = 0; +module_param(tune_pci, int, 0444); +MODULE_PARM_DESC(tune_pci, "increase PCI burst from the default set by BIOS"); + static const char mthca_version[] __devinitdata = DRV_NAME ": Mellanox InfiniBand HCA driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; @@ -90,6 +94,9 @@ static int __devinit mthca_tune_pci(stru int cap; u16 val; + if (!tune_pci) + return 0; + /* First try to max out Read Byte Count */ cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); if (cap) { -- MST From halr at voltaire.com Wed Apr 5 05:42:07 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2006 08:42:07 -0400 Subject: [openib-general] RE: [PATCH] OpenSM - fix osm_vendor_send on vendor mads In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FE83@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FE83@mtlexch01.mtl.com> Message-ID: <1144240925.4480.72821.camel@hal.voltaire.com> On Wed, 2006-04-05 at 08:13, Yael Kalka wrote: > When we are sending MAD with ManagementClass Vendor, then it isn't > defined with rmpp head. Thus when looking at the rmpp header we are > actually looking at part of the mad data. Right. See subsequent email and proposed patch. 
-- Hal > > Yael > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Wednesday, April 05, 2006 2:12 PM > > To: Yael Kalka > > Cc: openib-general at openib.org; Eitan Zahavi; Sasha Khapyorsky; Ofer > Gigi > > Subject: Re: [PATCH] OpenSM - fix osm_vendor_send on vendor mads > > > > Hi Yael, > > > > On Wed, 2006-04-05 at 04:54, Yael Kalka wrote: > > > Hi Hal, > > > > > > We saw the following problem in the osm_vendor_send mad (in > > > osm_vendor_ibumad.c). Currently, there is a case on the Management > > > Class values, where the cases are > > > IB_MCLASS_SUBN_DIR/IB_MCLASS_SUBN_LID and default, when the > assumption > > > is that in the default case the management class is > > > IB_MCLASS_SUBN_ADM. > > > So when the libosmvendor is used for sending for example Vendor type > > > mads, we address it as SA mad, and address the rmpp fields, which > are > > > not relevant in this case. > > > The following patch addes the default as case of IB_MCLASS_SUBN_ADM, > > > and changes the default case to not to check the rmpp header. > > > Please apply the patch on both trunk and branch 1.0. > > > > The convention is that the RMPP active flag is off when not sending > > RMPP. That needs to be conformed to in the GSI classes (consumers of > > this). Is that the case ? 
> > > > -- Hal > > > > > Thanks, > > > > > > Yael > > > > > > Signed-off-by: Yael Kalka > > > > > > Index: libvendor/osm_vendor_ibumad.c > > > =================================================================== > > > --- libvendor/osm_vendor_ibumad.c (revision 6192) > > > +++ libvendor/osm_vendor_ibumad.c (working copy) > > > @@ -1053,7 +1053,7 @@ osm_vendor_send( > > > umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, 0, > 0, 0); > > > umad_set_grh(p_vw->umad, 0); > > > break; > > > - default: /* GSI FIXME: no GRH */ > > > + case IB_MCLASS_SUBN_ADM: /* GSI FIXME: no GRH */ > > > umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, > > > p_mad_addr->addr_type.gsi.remote_qp, > > > > p_mad_addr->addr_type.gsi.service_level, > > > @@ -1087,6 +1087,14 @@ osm_vendor_send( > > > } > > > #endif > > > break; > > > + default: /* GSI FIXME: no GRH */ > > > + umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, > > > + p_mad_addr->addr_type.gsi.remote_qp, > > > + > p_mad_addr->addr_type.gsi.service_level, > > > + IB_QP1_WELL_KNOWN_Q_KEY); > > > + umad_set_grh(p_vw->umad, 0); /* FIXME: GRH support */ > > > + umad_set_pkey(p_vw->umad, > p_mad_addr->addr_type.gsi.pkey); > > > + break; > > > } > > > > > > if (resp_expected) > > > From eli at mellanox.co.il Wed Apr 5 05:59:34 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 5 Apr 2006 15:59:34 +0300 Subject: [openib-general] [PATCH] ipoib_flush_paths Message-ID: <200604051559.34828.eli@mellanox.co.il> ib_sa_cancel_query must be called with priv->lock held since a completion might arrive and set path->query to NULL Signed-off-by: Eli Cohen Index: latest/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ latest/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -342,14 +342,16 @@ void ipoib_flush_paths(struct net_device list_for_each_entry(path, &remove_list, list) rb_erase(&path->rb_node, &priv->path_tree); - 
spin_unlock_irqrestore(&priv->lock, flags); list_for_each_entry_safe(path, tp, &remove_list, list) { if (path->query) ib_sa_cancel_query(path->query_id, path->query); + spin_unlock_irqrestore(&priv->lock, flags); wait_for_completion(&path->done); path_free(dev, path); + spin_lock_irqsave(&priv->lock, flags); } + spin_unlock_irqrestore(&priv->lock, flags); } static void path_rec_completion(int status, From halr at voltaire.com Wed Apr 5 05:54:59 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Apr 2006 08:54:59 -0400 Subject: [openib-general] [PATCHv2] OpenSM: Fix osm_vendor_send for GSI classes Message-ID: <1144241694.4480.72945.camel@hal.voltaire.com> Hi Yael, Below is a slightly modified version of the previous patch. It is a complete fix for the problem you identified. Let me know if this works for you and I will check it into both the trunk and 1.0 branch. Thanks. -- Hal OpenSM: Fix osm_vendor_send for GSI classes Currently, the default for GSI classes assumes RMPP. There are two groups of GSI classes: those which support RMPP and those which don't. This patch handles them properly in osm_vendor_send. 
Problem pointed out by Yael Kalka Signed-off-by: Hal Rosenstock Index: include/iba/ib_types.h =================================================================== --- include/iba/ib_types.h (revision 6219) +++ include/iba/ib_types.h (working copy) @@ -515,6 +515,30 @@ BEGIN_C_DECLS #define IB_MCLASS_VENDOR_LOW_RANGE_MAX 0x0f /**********/ +/****d* IBA Base: Constants/IB_MCLASS_DEV_ADM +* NAME +* IB_MCLASS_DEV_ADM +* +* DESCRIPTION +* Subnet Management Class, Device Administration +* +* SOURCE +*/ +#define IB_MCLASS_DEV_ADM 0x10 +/**********/ + +/****d* IBA Base: Constants/IB_MCLASS_BIS +* NAME +* IB_MCLASS_BIS +* +* DESCRIPTION +* Subnet Management Class, BIS +* +* SOURCE +*/ +#define IB_MCLASS_BIS 0x12 +/**********/ + /****d* IBA Base: Constants/IB_MCLASS_VENDOR_HIGH_RANGE_MIN * NAME * IB_MCLASS_VENDOR_HIGH_RANGE_MIN @@ -544,7 +568,7 @@ BEGIN_C_DECLS * ib_class_is_vendor_specific_low * * DESCRIPTION -* Indicitates if the Class Code if a vendor specific class from +* Indicates if the Class Code if a vendor specific class from * the low range * * SYNOPSIS @@ -576,7 +600,7 @@ ib_class_is_vendor_specific_low( * ib_class_is_vendor_specific_high * * DESCRIPTION -* Indicitates if the Class Code if a vendor specific class from +* Indicates if the Class Code if a vendor specific class from * the high range * * SYNOPSIS @@ -609,7 +633,7 @@ ib_class_is_vendor_specific_high( * ib_class_is_vendor_specific * * DESCRIPTION -* Indicitates if the Class Code if a vendor specific class +* Indicates if the Class Code if a vendor specific class * * SYNOPSIS */ @@ -635,6 +659,38 @@ ib_class_is_vendor_specific( * ib_class_is_vendor_specific_low, ib_class_is_vendor_specific_high *********/ +/****f* IBA Base: Types/ib_class_is_rmpp +* NAME +* ib_class_is_rmpp +* +* DESCRIPTION +* Indicates if the Class Code supports RMPP +* +* SYNOPSIS +*/ +static inline boolean_t +ib_class_is_rmpp( + IN const uint8_t class_code ) +{ + return( (class_code == IB_MCLASS_SUBN_ADM) || + (class_code == 
IB_MCLASS_DEV_MGMT) || + (class_code == IB_MCLASS_DEV_ADM) || + (class_code == IB_MCLASS_BIS) || + ib_class_is_vendor_specific_high( class_code ) ); +} +/* +* PARAMETERS +* class_code +* [in] The Management Datagram Class Code +* +* RETURN VALUE +* TRUE if the class supports RMPP +* FALSE otherwise. +* +* NOTES +* +*********/ + /* * MAD methods */ @@ -1811,7 +1867,7 @@ ib_pkey_get_base( * ib_pkey_is_full_member * * DESCRIPTION -* Indicitates if the port is a full member of the parition. +* Indicates if the port is a full member of the parition. * * SYNOPSIS */ Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 6219) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -1044,16 +1044,17 @@ osm_vendor_send( CL_ASSERT( p_vw->h_bind == h_bind ); CL_ASSERT( p_mad == umad_get_mad(p_vw->umad) ); - switch (p_mad->mgmt_class) { - case IB_MCLASS_SUBN_DIR: + if (p_mad->mgmt_class == IB_MCLASS_SUBN_DIR) { umad_set_addr_net(p_vw->umad, 0xffff, 0, 0, 0); umad_set_grh(p_vw->umad, 0); - break; - case IB_MCLASS_SUBN_LID: + goto Resp; + } + if (p_mad->mgmt_class == IB_MCLASS_SUBN_LID) { umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, 0, 0, 0); umad_set_grh(p_vw->umad, 0); - break; - default: /* GSI FIXME: no GRH */ + goto Resp; + } + if (ib_class_is_rmpp(p_mad->mgmt_class)) { /* RMPP GSI classes FIXME: no GRH */ umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, p_mad_addr->addr_type.gsi.remote_qp, p_mad_addr->addr_type.gsi.service_level, @@ -1086,9 +1087,16 @@ osm_vendor_send( p_sa->paylen_newwin = cl_ntoh32(paylen); } #endif - break; + } else { /* non RMPP GSI classes FIXME: no GRH */ + umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, + p_mad_addr->addr_type.gsi.remote_qp, + p_mad_addr->addr_type.gsi.service_level, + IB_QP1_WELL_KNOWN_Q_KEY); + umad_set_grh(p_vw->umad, 0); /* FIXME: GRH support */ + umad_set_pkey(p_vw->umad, p_mad_addr->addr_type.gsi.pkey); } +Resp: if 
(resp_expected) put_madw(p_vend, p_madw, &p_mad->trans_id); From mst at mellanox.co.il Wed Apr 5 06:21:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 16:21:38 +0300 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: <200604051559.34828.eli@mellanox.co.il> References: <200604051559.34828.eli@mellanox.co.il> Message-ID: <20060405132138.GG21115@mellanox.co.il> Quoting r. Eli Cohen : > Subject: [PATCH] ipoib_flush_paths > > ib_sa_cancel_query must be called with priv->lock held since > a completion might arrive and set path->query to NULL > > Signed-off-by: Eli Cohen Wow. This looks very similiar to what was fixed in r5586 - only for path record rather than multicast SA queries. Good catch. Acked-by: Michael S. Tsirkin -- MST From swise at opengridcomputing.com Wed Apr 5 07:01:18 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 05 Apr 2006 09:01:18 -0500 Subject: [openib-general] [PATCH] [DAPL] - dapl doesn't set max read iov attributes Message-ID: <1144245678.28591.5.camel@stevo-desktop> Set the IA attribute max_iov_segments_per_rdma_read and the EP attribute max_rdma_read_iov based on the openib max_sge_rd device attribute. 
Signed-off-by: Steve Wise Index: openib_cma/dapl_ib_util.c =================================================================== --- openib_cma/dapl_ib_util.c (revision 6229) +++ openib_cma/dapl_ib_util.c (working copy) @@ -442,6 +442,7 @@ ia_attr->transport_attr = NULL; ia_attr->num_vendor_attr = 0; ia_attr->vendor_attr = NULL; + ia_attr->max_iov_segments_per_rdma_read = dev_attr.max_sge_rd; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " query_hca: (ver=%x) ep %d ep_q %d evd %d evd_q %d\n", @@ -464,6 +465,7 @@ ep_attr->max_request_iov = dev_attr.max_sge; ep_attr->max_rdma_read_in = dev_attr.max_qp_rd_atom; ep_attr->max_rdma_read_out= dev_attr.max_qp_rd_atom; + ep_attr->max_rdma_read_iov= dev_attr.max_sge_rd; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " query_hca: MAX msg %llu dto %d iov %d rdma i%d,o%d\n", ep_attr->max_mtu_size, From swise at opengridcomputing.com Wed Apr 5 07:19:34 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 05 Apr 2006 09:19:34 -0500 Subject: [openib-general] [PATCH] [MTHCA] - set max_sge_rd attribute Message-ID: <1144246774.28591.20.camel@stevo-desktop> While testing the cxgb3 iwarp driver, I noticed that mthca is not setting the max_sge_rd attribute. Dunno if the limits.max_sg is the correct value, but it seemed like the only reasonable one... 
Signed-off-by: Steve Wise Index: mthca_provider.c =================================================================== --- mthca_provider.c (revision 6243) +++ mthca_provider.c (working copy) @@ -97,6 +97,7 @@ props->max_qp = mdev->limits.num_qps - mdev->limits.reserved_qps; props->max_qp_wr = mdev->limits.max_wqes; props->max_sge = mdev->limits.max_sg; + props->max_sge_rd = mdev->limits.max_sg; props->max_cq = mdev->limits.num_cqs - mdev->limits.reserved_cqs; props->max_cqe = mdev->limits.max_cqes; props->max_mr = mdev->limits.num_mpts - mdev->limits.reserved_mrws; From dotanb at mellanox.co.il Wed Apr 5 07:47:23 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 5 Apr 2006 17:47:23 +0300 Subject: [openib-general] [PATCH] [MTHCA] - set max_sge_rd attribute In-Reply-To: <1144246774.28591.20.camel@stevo-desktop> References: <1144246774.28591.20.camel@stevo-desktop> Message-ID: <200604051747.23748.dotanb@mellanox.co.il> On Wednesday 05 April 2006 17:19, Steve Wise wrote: > While testing the cxgb3 iwarp driver, I noticed that mthca is not > setting the max_sge_rd attribute. Dunno if the limits.max_sg is the > correct value, but it seemed like the only reasonable one... This value is not being set because RD (Reliable Datagram) is not supported. Dotan
From swise at opengridcomputing.com Wed Apr 5 07:54:12 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 05 Apr 2006 09:54:12 -0500 Subject: [openib-general] [PATCH] [MTHCA] - set max_sge_rd attribute In-Reply-To: <200604051747.23748.dotanb@mellanox.co.il> References: <1144246774.28591.20.camel@stevo-desktop> <200604051747.23748.dotanb@mellanox.co.il> Message-ID: <1144248852.28591.23.camel@stevo-desktop> On Wed, 2006-04-05 at 17:47 +0300, Dotan Barak wrote: > On Wednesday 05 April 2006 17:19, Steve Wise wrote: > > While testing the cxgb3 iwarp driver, I noticed that mthca is not > > setting the max_sge_rd attribute. Dunno if the limits.max_sg is the > > correct value, but it seemed like the only reasonable one... > > This value is not being set because RD (Reliable Datagram) is not supported. > > Dotan Well then! My mistake. I assumed max_sge_rd pertained to the max sge of an rdma read WR. :-( I'll need to add such an attribute for iwarp support since iwarp only supports a sge count of 1 in rdma reads. Sorry... From swise at opengridcomputing.com Wed Apr 5 07:54:56 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 05 Apr 2006 09:54:56 -0500 Subject: [openib-general] [PATCH] [DAPL] - dapl doesn't set max read iov attributes In-Reply-To: <1144245678.28591.5.camel@stevo-desktop> References: <1144245678.28591.5.camel@stevo-desktop> Message-ID: <1144248896.28591.25.camel@stevo-desktop> Ignore this patch. max_sge_rd is not the correct attribute... Sorry... STeve. On Wed, 2006-04-05 at 09:01 -0500, Steve Wise wrote: > Set the IA attribute max_iov_segments_per_rdma_read and the EP attribute > max_rdma_read_iov based on the openib max_sge_rd device attribute.
> > Signed-off-by: Steve Wise > > > Index: openib_cma/dapl_ib_util.c > =================================================================== > --- openib_cma/dapl_ib_util.c (revision 6229) > +++ openib_cma/dapl_ib_util.c (working copy) > @@ -442,6 +442,7 @@ > ia_attr->transport_attr = NULL; > ia_attr->num_vendor_attr = 0; > ia_attr->vendor_attr = NULL; > + ia_attr->max_iov_segments_per_rdma_read = dev_attr.max_sge_rd; > > dapl_dbg_log(DAPL_DBG_TYPE_UTIL, > " query_hca: (ver=%x) ep %d ep_q %d evd %d evd_q %d\n", > @@ -464,6 +465,7 @@ > ep_attr->max_request_iov = dev_attr.max_sge; > ep_attr->max_rdma_read_in = dev_attr.max_qp_rd_atom; > ep_attr->max_rdma_read_out= dev_attr.max_qp_rd_atom; > + ep_attr->max_rdma_read_iov= dev_attr.max_sge_rd; > dapl_dbg_log(DAPL_DBG_TYPE_UTIL, > " query_hca: MAX msg %llu dto %d iov %d rdma i%d,o%d\n", > ep_attr->max_mtu_size, > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From dotanb at mellanox.co.il Wed Apr 5 08:18:16 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 5 Apr 2006 18:18:16 +0300 Subject: [openib-general] how to execute the dtest? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301D16BE2@mtlexch01.mtl.com> Hi. We would like to start executing the dtest. 
When i tried to execute it, i got the following output: DAT Registry: Started (dat_init) DAT Registry: static registry file DAT Registry: token type eor value <> DAT Registry: token type string value DAT Registry: token type string value DAT Registry: token type string value DAT Registry: token type string value DAT Registry: token type string value DAT Registry: token type eor value <> DAT Registry: token type eof value <> 9402 Running as server DAT Registry: dat_ia_openv (OpenIB-cma-ip,1:2,0) called DAT Registry: dat_ia_open () provider information for IA name OpenIB-cma-ip not found in dynamic registry 9402: Error Adaptor open: DAT_PROVIDER_NOT_FOUND DAT_NAME_NOT_REGISTERED DAT Registry: Stopped (dat_fini) here is the content of the file /etc/dat.conf: # # DAT 1.1 and 1.2 configuration file # # Each entry should have the following fields: # # \ # # # Example for openib using the first Mellanox adapter, port 1 and port 2 IB1 u1.2 nonthreadsafe default /usr/local/lib can someone tell me what am i doing wrong? thanx Dotan Barak Software Verification Engineer Mellanox Technologies Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 From dotanb at mellanox.co.il Wed Apr 5 08:27:28 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 5 Apr 2006 18:27:28 +0300 Subject: [openib-general] how to execute the dtest?
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301D16BE2@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301D16BE2@mtlexch01.mtl.com> Message-ID: <200604051827.28808.dotanb@mellanox.co.il> Some more info: when i changed the dat.conf to be: OpenIB-cma u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "192.168.0.22 0" "" OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "svr1-ib0 0" "" OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" OpenIB-scm1 u1.2 nonthreadsafe default /usr/local/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" "" OpenIB-scm2 u1.2 nonthreadsafe default /usr/local/lib/libdaplscm.so mv_dapl.1.2 "mthca0 2" "" i got the following message: DAT Registry: entry ia_name OpenIB-scm2 api_version type 0x0 major.minor 1.2 is_thread_safe 0 is_default 1 lib_path /usr/local/lib/libdaplscm.so provider_version id mv_dapl major.minor 1.2 ia_params mthca0 2 DAT Registry: loading provider for OpenIB-scm2 DAT Registry: token type eof value <> 10127 Running as server DAT Registry: dat_ia_openv (OpenIB-cma-ip,1:2,0) called DAT Registry: IA OpenIB-cma-ip, trying to load library /usr/local/lib/libdaplcma.so DAT Registry: dat_registry_add_provider (OpenIB-cma-ip,1:2,0) 10127: Error Adaptor open: DAT_INVALID_ADDRESS DAT Registry: Stopped (dat_fini) Dotan On Wednesday 05 April 2006 18:18, Dotan Barak wrote: > Hi. > > We would like to start executing the dtest. 
> When i tried to execute it, i got the following output: > > DAT Registry: Started (dat_init) > DAT Registry: static registry file > > DAT Registry: token > type eor > value <> > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type eor > value <> > > > DAT Registry: token > type eof > value <> > > 9402 Running as server > DAT Registry: dat_ia_openv (OpenIB-cma-ip,1:2,0) called > DAT Registry: dat_ia_open () provider information for IA name > OpenIB-cma-ip not found in dynamic registry > 9402: Error Adaptor open: DAT_PROVIDER_NOT_FOUND DAT_NAME_NOT_REGISTERED > DAT Registry: Stopped (dat_fini) > > > > > > here is the content of the file /etc/dat.conf: > > > > # > # DAT 1.1 and 1.2 configuration file > # > # Each entry should have the following fields: > # > # \ > # > # > # Example for openib using the first Mellanox adapter, port 1 and port > 2 > > IB1 u1.2 nonthreadsafe default /usr/local/lib > > > can someone tell me what am i doing wrong? > > thanx > Dotan Barak > > Software Verification Engineer > Mellanox Technologies > Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 > P.O. Box 86 Yokneam 20692 ISRAEL. > Home: +972-77-8841095 Cell: 052-4222383 > > From swise at opengridcomputing.com Wed Apr 5 08:30:22 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 05 Apr 2006 10:30:22 -0500 Subject: [openib-general] how to execute the dtest? 
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301D16BE2@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301D16BE2@mtlexch01.mtl.com> Message-ID: <1144251022.28591.37.camel@stevo-desktop> Here is what works for me (using mthca): vic17:~ # cat /etc/dat.conf OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/lib/libdapl.so mv_dapl.1.2 "192.168.79.147 0" "" vic17:~ # And ib1 has the above ip address: vic17:~ # ifconfig ib1 ib1 Link encap:UNSPEC HWaddr 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:192.168.79.147 Bcast:192.168.79.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:14 errors:0 dropped:0 overruns:0 frame:0 TX packets:14 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:840 (840.0 b) TX bytes:896 (896.0 b) My other system looks similar but the ipaddr is 192.168.79.148. Hope this helps. Steve. On Wed, 2006-04-05 at 18:18 +0300, Dotan Barak wrote: > Hi. > > We would like to start executing the dtest. 
> When i tried to execute it, i got the following output: > > DAT Registry: Started (dat_init) > DAT Registry: static registry file > > DAT Registry: token > type eor > value <> > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type string > value > > > DAT Registry: token > type eor > value <> > > > DAT Registry: token > type eof > value <> > > 9402 Running as server > DAT Registry: dat_ia_openv (OpenIB-cma-ip,1:2,0) called > DAT Registry: dat_ia_open () provider information for IA name > OpenIB-cma-ip not found in dynamic registry > 9402: Error Adaptor open: DAT_PROVIDER_NOT_FOUND > DAT_NAME_NOT_REGISTERED > DAT Registry: Stopped (dat_fini) > > > > > > here is the content of the file /etc/dat.conf: > > > > # > # DAT 1.1 and 1.2 configuration file > # > # Each entry should have the following fields: > # > # \ > # > # > # Example for openib using the first Mellanox adapter, port 1 and > port 2 > > IB1 u1.2 nonthreadsafe default /usr/local/lib > > > can someone tell me what am i doing wrong? > > thanx > Dotan Barak > > Software Verification Engineer > Mellanox Technologies > Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 > P.O. Box 86 Yokneam 20692 ISRAEL. 
> Home: +972-77-8841095 Cell: 052-4222383 > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From vlad at mellanox.co.il Wed Apr 5 08:30:48 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 05 Apr 2006 18:30:48 +0300 Subject: [openib-general] [DAPL] tests In-Reply-To: <4433B14F.9010701@mellanox.co.il> References: <4432B692.8000606@ichips.intel.com> <4433B14F.9010701@mellanox.co.il> Message-ID: <4433E2A8.1050702@mellanox.co.il> Hi James, Can you add dapl tests to EXTRA_DIST list in the dapl/Makefile.am? Thanks, Regards, Vladimir Vladimir Sokolovsky wrote: > Hi James, > Does uDAPL support PPC64 architecture? > > Description: > > OS: Fedora Core release 4 (Stentz) > Kernel: 2.6.11-1.1369_FC4 > Arch: ppc64 > > > I have the following compilation error: > > make -C src/userspace/dapl \ > CPPFLAGS="-I../libibverbs/include/infiniband -I../librdmacm/include \ > -I../libibverbs/include" \ > LDFLAGS="-L../libibverbs/src -libverbs -L../librdmacm/src/.libs -lrdmacm" > make[1]: Entering directory > `/var/tmp/IBED/tmp/openib/openib/src/userspace/dapl' > make all-am > make[2]: Entering directory > `/var/tmp/IBED/tmp/openib/openib/src/userspace/dapl' > if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. > -I. -I. 
-I../libibverbs/include/infiniband -I../librdmacm/include > -I../libibverbs/inc > lude -Wall -g -D_GNU_SOURCE -DOPENIB -DCQ_WAIT_OBJECT > -I./dat/include/ -I./dapl/include/ -I./dapl/common > -I./dapl/udapl/linux -I./dapl/openib_cma -g -O2 -M > T dapl_udapl_libdaplcma_la-dapl_init.lo -MD -MP -MF > ".deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo" -c -o > dapl_udapl_libdaplcma_la-dapl_init.lo `test -f 'dapl > /udapl/dapl_init.c' || echo './'`dapl/udapl/dapl_init.c; \ > then mv -f ".deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo" > ".deps/dapl_udapl_libdaplcma_la-dapl_init.Plo"; else rm -f > ".deps/dapl_udapl_libdaplcma_la-dapl_ini > t.Tpo"; exit 1; fi > mkdir .libs > gcc -DHAVE_CONFIG_H -I. -I. -I. -I../libibverbs/include/infiniband > -I../librdmacm/include -I../libibverbs/include -Wall -g -D_GNU_SOURCE > -DOPENIB -DCQ_WAIT_ > OBJECT -I./dat/include/ -I./dapl/include/ -I./dapl/common > -I./dapl/udapl/linux -I./dapl/openib_cma -g -O2 -MT > dapl_udapl_libdaplcma_la-dapl_init.lo -MD -MP - > MF .deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo -c > dapl/udapl/dapl_init.c -fPIC -DPIC -o > .libs/dapl_udapl_libdaplcma_la-dapl_init.o > In file included from ./dapl/include/dapl.h:50, > from dapl/udapl/dapl_init.c:39: > ./dapl/udapl/linux/dapl_osd.h:53:2: error: #error UNDEFINED ARCH > make[2]: *** [dapl_udapl_libdaplcma_la-dapl_init.lo] Error 1 > make[2]: Leaving directory > `/var/tmp/IBED/tmp/openib/openib/src/userspace/dapl' > make[1]: *** [all] Error 2 > make[1]: Leaving directory > `/var/tmp/IBED/tmp/openib/openib/src/userspace/dapl' > make: *** [dapl] Error 2 > > Thanks, > > /*Best Regards,*/ > /*Vladimir Sokolovsky*/ > /*Software Integration Engineer*/ > /**//*Mellanox Technologies Ltd.*/ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Wed Apr 5 08:43:51 
2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 08:43:51 -0700 Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task In-Reply-To: <20060405074904.GC14808@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 5 Apr 2006 10:49:04 +0300") References: <20060404155233.GR14808@mellanox.co.il> <20060405074904.GC14808@mellanox.co.il> Message-ID: Michael> Not sure I read you. It'd still be use after free, won't it? It's definitely a bug. But it doesn't explain the specific oops we saw. In other words, doing: kfree(mcast); dev = mcast->dev; shouldn't cause an oops, because mcast is still a valid kernel pointer, even if the memory it points to might be reused and corrupted. Following the dev pointer after that snippet might cause an oops, because it might be overwritten. - R. From rdreier at cisco.com Wed Apr 5 08:49:39 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 08:49:39 -0700 Subject: [openib-general] Re: [PATCH] mthca: disable pci_tune In-Reply-To: <20060405124716.GF21115@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 5 Apr 2006 15:47:16 +0300") References: <20060405124716.GF21115@mellanox.co.il> Message-ID: Looks reasonable to me. How about the PCI Express max read request value? Does that matter? Do we want to disable setting that by default too? (that's what your patch does) - R. From rdreier at cisco.com Wed Apr 5 08:51:27 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 08:51:27 -0700 Subject: [openib-general] Re: IPoIB descructor for 2.6.16-stable? In-Reply-To: <20060405115232.GA21115@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 5 Apr 2006 14:52:32 +0300") References: <20060405115232.GA21115@mellanox.co.il> Message-ID: Michael> Is this small/obvious enough to be considered for stable? Michael> What do you think? I'd be a little worried. Hal had an oops that looked related to this, but I was never able to reproduce it or figure it out. 
But in any case I'm uncomfortable that this patch is replacing one problem with another. - R. From mst at mellanox.co.il Wed Apr 5 09:30:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 19:30:34 +0300 Subject: [openib-general] Re: [PATCH] mthca: disable pci_tune In-Reply-To: References: <20060405124716.GF21115@mellanox.co.il> Message-ID: <20060405163034.GB23186@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] mthca: disable pci_tune > > Looks reasonable to me. How about the PCI Express max read request > value? Does that matter? Do we want to disable setting that by > default too? (that's what your patch does) Yes, I disable that too and for all the same reasons, with an option to turn it back on to make it easier to debug performance problems should they appear. -- MST From mst at mellanox.co.il Wed Apr 5 09:43:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 19:43:29 +0300 Subject: [openib-general] Re: IPoIB destructor for 2.6.16-stable? In-Reply-To: References: <20060405115232.GA21115@mellanox.co.il> Message-ID: <20060405164329.GC23186@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: IPoIB destructor for 2.6.16-stable? > > Michael> Is this small/obvious enough to be considered for stable? > Michael> What do you think? > > I'd be a little worried. Hal had an oops that looked related to this, > but I was never able to reproduce it or figure it out. Neither could I. > But in any case I'm uncomfortable that this patch is replacing one problem > with another. I don't see any way to fix crashes in ipoib in 2.6.16, then. Do you? -- MST From ardavis at ichips.intel.com Wed Apr 5 09:46:10 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 05 Apr 2006 09:46:10 -0700 Subject: [openib-general] how to execute the dtest? 
In-Reply-To: <200604051827.28808.dotanb@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301D16BE2@mtlexch01.mtl.com> <200604051827.28808.dotanb@mellanox.co.il> Message-ID: <4433F452.8030807@ichips.intel.com> Dotan Barak wrote: >Some more info: > >when i changed the dat.conf to be: >OpenIB-cma u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" >OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "192.168.0.22 0" "" >OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "svr1-ib0 0" "" >OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" >OpenIB-scm1 u1.2 nonthreadsafe default /usr/local/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" "" >OpenIB-scm2 u1.2 nonthreadsafe default /usr/local/lib/libdaplscm.so mv_dapl.1.2 "mthca0 2" "" > > > The dtest makefile builds with DAPL_PROVIDER == OpenIB-cma-ip by default so it will use the second line of the configuration file. This requires the IP address of the IB device on your system to be supplied in the dat.conf. Change the default IP address (192.168.0.22) to match your ib device network address that you ifconfig'ed. -arlin From xma at us.ibm.com Wed Apr 5 09:51:06 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Apr 2006 09:51:06 -0700 Subject: [openib-general] IPoIB descructor for 2.6.16-stable? In-Reply-To: <20060405115232.GA21115@mellanox.co.il> Message-ID: Michael, I have tested your workaround patch. It has been working pretty well. It's easy to hit this problem with ehca driver. I need to relook at the patch to see whether it's the same or not. thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 
From rdreier at cisco.com Wed Apr 5 09:50:14 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 09:50:14 -0700 Subject: [openib-general] Re: IPoIB destructor for 2.6.16-stable? In-Reply-To: <20060405164329.GC23186@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 5 Apr 2006 19:43:29 +0300") References: <20060405115232.GA21115@mellanox.co.il> <20060405164329.GC23186@mellanox.co.il> Message-ID: Michael> I don't see any way to fix crashes in ipoib in 2.6.16, Michael> then. Do you? Unfortunately no. If we could get to the bottom of Hal's crash then I would be fine with adding something like this to 2.6.16.stable. But I don't have much interest in debugging code that's already obsolete. - R. From mst at mellanox.co.il Wed Apr 5 09:53:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 19:53:03 +0300 Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task In-Reply-To: References: <20060404155233.GR14808@mellanox.co.il> <20060405074904.GC14808@mellanox.co.il> Message-ID: <20060405165303.GA23402@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib_mcast_restart_task > > Michael> Not sure I read you. It'd still be use after free, won't it? > > It's definitely a bug. But it doesn't explain the specific oops we > saw. The mcast pointer comes from stack. Surely we could have use after free in ipoib_mcast_join_complete trigger data corruption on stack and then trip on it? 
-- MST From vlad at mellanox.co.il Wed Apr 5 09:52:44 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 05 Apr 2006 19:52:44 +0300 Subject: [openib-general] ipath module compilation on 2.6.15 and 2.6.16 In-Reply-To: References: Message-ID: <4433F5DC.2080503@mellanox.co.il> Hi Bryan, I tried to compile ipath module taken from trunk (REV=6237) on 2.6.16 and on 2.6.15 kernels and it fails with the following errors: gcc -m32 -Wp,-MD,/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/.ipath_verbs.o.d -nostdinc -isystem /usr/lib/gcc/i586-suse-linux/4.0.2/include -D__KERNEL__ -I/var/tmp/IBED/tmp/openib/openib/include -I/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/include -Iinclude -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -pipe -msoft-float -mpreferred-stack-boundary=2 -march=i686 -mtune=pentium4 -mregparm=3 -Iinclude/asm-i386/mach-generic -Iinclude/asm-i386/mach-default -Wdeclaration-after-statement -Wno-pointer-sign -I/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/include -I/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/ipoib -I/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/kdapl -I/var/tmp/IBED/tmp/openib/openib/drivers/infiniband/debug -DIPATH_IDSTR='"PathScale kernel.org driver"' -DIPATH_KERN_TYPE=0 -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipath_verbs)" -D"KBUILD_MODNAME=KBUILD_STR(ib_ipath)" -c -o /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/.tmp_ipath_verbs.o /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c In file included from /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c:37: /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_kernel.h: In function ‘ipath_write_ureg’: /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_kernel.h:760: 
warning: implicit declaration of function ‘writeq’ /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_kernel.h: In function ‘ipath_read_kreg64’: /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_kernel.h:777: warning: implicit declaration of function ‘readq’ /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c: In function ‘ipath_register_ib_device’: /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c:1005: error: ‘IB_NODE_CA’ undeclared (first use in this function) /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c:1005: error: (Each undeclared identifier is reported only once /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c:1005: error: for each function it appears in.) make[3]: *** [/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.o] Error 1 make[2]: *** [/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath] Error 2 make[1]: *** [_module_/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband] Error 2 make[1]: Leaving directory `/usr/src/linux-2.6.16' make: *** [kernel] Error 2 I am trying to add ipath support to IBED release. Can you please help? Thanks, Best Regards, Vladimir Sokolovsky Software Integration Engineer Mellanox Technologies Ltd. From bos at pathscale.com Wed Apr 5 09:56:15 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 05 Apr 2006 09:56:15 -0700 Subject: [openib-general] ipath module compilation on 2.6.15 and 2.6.16 In-Reply-To: <4433F5DC.2080503@mellanox.co.il> References: <4433F5DC.2080503@mellanox.co.il> Message-ID: <1144256175.3984.5.camel@chalcedony.internal.keyresearch.com> On Wed, 2006-04-05 at 19:52 +0300, Vladimir Sokolovsky wrote: > I tried to compile ipath module taken from trunk (REV=6237) on 2.6.16 > and on 2.6.15 kernels and it fails with the following errors: Ah. 
You're using a kernel patched with SVN headers. I haven't figured out what to do about this. People quite reasonably don't like having ifdefs in drivers, but when macro names change out from under us in header files, I don't know what else to do. Roland, what's your preference? (Michael S. Tsirkin's message of "Wed, 5 Apr 2006 19:53:03 +0300") References: <20060404155233.GR14808@mellanox.co.il> <20060405074904.GC14808@mellanox.co.il> <20060405165303.GA23402@mellanox.co.il> Message-ID: Michael> The mcast pointer comes from stack. Surely we could have Michael> use after free in ipoib_mcast_join_complete trigger data Michael> corruption on stack and then trip on it? Now you're confusing me. Isn't the mcast pointer kmalloc()ed? - R. From rdreier at cisco.com Wed Apr 5 09:58:44 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 09:58:44 -0700 Subject: [openib-general] ipath module compilation on 2.6.15 and 2.6.16 In-Reply-To: <4433F5DC.2080503@mellanox.co.il> (Vladimir Sokolovsky's message of "Wed, 05 Apr 2006 19:52:44 +0300") References: <4433F5DC.2080503@mellanox.co.il> Message-ID: Looks like you are building with a 32-bit compiler. But in the Kconfig, ipath depends on 64BIT. So how are you enabling the driver? - R. From rdreier at cisco.com Wed Apr 5 10:01:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 10:01:36 -0700 Subject: [openib-general] ipath module compilation on 2.6.15 and 2.6.16 In-Reply-To: <1144256175.3984.5.camel@chalcedony.internal.keyresearch.com> (Bryan O'Sullivan's message of "Wed, 05 Apr 2006 09:56:15 -0700") References: <4433F5DC.2080503@mellanox.co.il> <1144256175.3984.5.camel@chalcedony.internal.keyresearch.com> Message-ID: Bryan> I haven't figured out what to do about this. People quite Bryan> reasonably don't like having ifdefs in drivers, but when Bryan> macro names change out from under us in header files, I Bryan> don't know what else to do. Bryan> Roland, what's your preference? Huh?? 
This error seemed to have been because writeq() doesn't exist in a 32-bit kernel. If you're talking about IB_NODE_CA vs RDMA_NODE_IB_CA, just have one thing in svn and one thing upstream. The differences are pretty minimal right now. - R. From rdreier at cisco.com Wed Apr 5 10:04:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 10:04:53 -0700 Subject: [openib-general] [PATCH] static rate encoding changes In-Reply-To: (Roland Dreier's message of "Tue, 04 Apr 2006 15:59:07 -0700") References: <20060316165855.GA32324@mellanox.co.il> <1142529460.25297.117.camel@camp4.serpentine.com> <200603191759.36800.jackm@mellanox.co.il> <20060321154959.GC1802@mellanox.co.il> <20060324081644.GC31619@mellanox.co.il> Message-ID: OK, I went ahead and committed this to svn and queued it for 2.6.17. If you see any problems, let me know. (It will be a while before this gets merged upstream anyway because Linus is away this week) - R. From rdreier at cisco.com Wed Apr 5 10:06:39 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 10:06:39 -0700 Subject: [openib-general] Re: [PATCH 1 of 3] core: static rate encoding changesupport In-Reply-To: (Christoph Raisch's message of "Fri, 24 Mar 2006 16:59:16 +0100") References: Message-ID: I merged the core static rate patch into svn. This means that the struct ib_ah_attr.static_rate field is now defined to be an absolute rate (as defined in enum ib_rate), rather than a relative inter-packet delay (IPD) value. I believe ehca will need to be updated to handle this in modify QP and create AH operations. Thanks, Roland From vlad at mellanox.co.il Wed Apr 5 10:07:50 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 05 Apr 2006 20:07:50 +0300 Subject: [openib-general] ipath module compilation on 2.6.15 and 2.6.16 In-Reply-To: References: <4433F5DC.2080503@mellanox.co.il> Message-ID: <4433F966.6010504@mellanox.co.il> Roland Dreier wrote: > Looks like you are building with a 32-bit compiler. 
But in the > Kconfig, ipath depends on 64BIT. So how are you enabling the driver? > > - R. > > I tried compilation on 32 bit machine. Should it be supported by ipath? Vladimir From rdreier at cisco.com Wed Apr 5 10:09:22 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 10:09:22 -0700 Subject: [openib-general] Re: the cma is not in the for-2.6.18 branch of the git tree In-Reply-To: (Or Gerlitz's message of "Wed, 5 Apr 2006 11:42:27 +0300") References: Message-ID: Or> As part of the work to push iser for 2.6.18 I am going to send Or> RFC on it to linux-scsi (and ofcourse resend RFC to Or> open-ib). To have the people who review it in linux-scsi being Or> able to compile iser, they would need rdma_cm.h and the Or> associated changes in whatever (eg ib_cm.h) under Or> include/rdma. Or> I see now that there's also rdma_cm branch at the git tree, Or> but i was thinking that if you agree that the kernel cma is Or> ready for upstream, it makes sense to me that the for-2.6.18 Or> branch would have it. Yes, makes sense. I think there have been some updates and fixes to CMA code (loopback handling, etc). Sean, when you get a chance, can send me updates for the rdma_cm branch? Even a single rolled-up patch against the head of that branch is fine -- it's easy for me to split up and update the individual patches. For iSER review purposes, I think the rdma_cm branch should be OK for the time being. - R. From rdreier at cisco.com Wed Apr 5 10:10:19 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 10:10:19 -0700 Subject: [openib-general] ipath module compilation on 2.6.15 and 2.6.16 In-Reply-To: <4433F966.6010504@mellanox.co.il> (Vladimir Sokolovsky's message of "Wed, 05 Apr 2006 20:07:50 +0300") References: <4433F5DC.2080503@mellanox.co.il> <4433F966.6010504@mellanox.co.il> Message-ID: Vladimir> I tried compilation on 32 bit machine. Should it be Vladimir> supported by ipath? 
No, PathScale hasn't done the work to make the driver work on 32-bit archs yet. But what I don't understand is how you were able to configure your kernel to build the driver, since the Kconfig has "depends on 64BIT" in it. - R. From mst at mellanox.co.il Wed Apr 5 10:13:14 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 20:13:14 +0300 Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task In-Reply-To: References: <20060404155233.GR14808@mellanox.co.il> <20060405074904.GC14808@mellanox.co.il> <20060405165303.GA23402@mellanox.co.il> Message-ID: <20060405171314.GB23402@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib_mcast_restart_task > > Michael> The mcast pointer comes from stack. Surely we could have > Michael> use after free in ipoib_mcast_join_complete trigger data > Michael> corruption on stack and then trip on it? > > Now you're confusing me. Isn't the mcast pointer kmalloc()ed? Sorry about that. I think the memory *it points to* is kmalloc()ed - the pointer itself I think comes from stack. static void ipoib_mcast_sendonly_join_complete(int status, struct ib_sa_mcmember_rec *mcmember, void *mcast_ptr) { struct ipoib_mcast *mcast = mcast_ptr; struct net_device *dev = mcast->dev; So all I had in mind was obvious things like: Assume that you have mcast point to random kernel data. doing things like skb_dequeue(&mcast->pkt_queue) will now do random things to random memory locations, it could be stack or anything else. -- MST From rdreier at cisco.com Wed Apr 5 10:15:26 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 10:15:26 -0700 Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task In-Reply-To: <20060405171314.GB23402@mellanox.co.il> (Michael S. 
Tsirkin's message of "Wed, 5 Apr 2006 20:13:14 +0300") References: <20060404155233.GR14808@mellanox.co.il> <20060405074904.GC14808@mellanox.co.il> <20060405165303.GA23402@mellanox.co.il> <20060405171314.GB23402@mellanox.co.il> Message-ID: Michael> Assume that you have mcast point to random kernel data. Michael> doing things like skb_dequeue(&mcast->pkt_queue) will now Michael> do random things to random memory locations, it could be Michael> stack or anything else. Yes, true. Maybe it is just random corruption that causes the crash right at the beginning of ipoib_mcast_sendonly_join_complete(). - R. From mst at mellanox.co.il Wed Apr 5 10:19:20 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 20:19:20 +0300 Subject: [openib-general] Re: [PATCH] static rate encoding changes In-Reply-To: References: <20060316165855.GA32324@mellanox.co.il> <1142529460.25297.117.camel@camp4.serpentine.com> <200603191759.36800.jackm@mellanox.co.il> <20060321154959.GC1802@mellanox.co.il> <20060324081644.GC31619@mellanox.co.il> Message-ID: <20060405171920.GC23402@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] static rate encoding changes > > OK, I went ahead and committed this to svn and queued it for 2.6.17. > If you see any problems, let me know. (It will be a while before this > gets merged upstream anyway because Linus is away this week) OK, good, I hope to be testing bits from for-2.6.17 next week. -- MST From vlad at mellanox.co.il Wed Apr 5 10:18:46 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 05 Apr 2006 20:18:46 +0300 Subject: [openib-general] ipath module compilation on 2.6.15 and 2.6.16 In-Reply-To: References: <4433F5DC.2080503@mellanox.co.il> <4433F966.6010504@mellanox.co.il> Message-ID: <4433FBF6.1040301@mellanox.co.il> Roland Dreier wrote: > Vladimir> I tried compilation on 32 bit machine. Should it be > Vladimir> supported by ipath? 
> > No, PathScale hasn't done the work to make the driver work on 32-bit > archs yet. But what I don't understand is how you were able to > configure your kernel to build the driver, since the Kconfig has > "depends on 64BIT" in it. > > - R. > > I am not compiling kernel, but only ipath using SUBDIRS: $(MAKE) -C /lib/modules/$(uname -r)/source SUBDIRS="./linux-kernel/infiniband" \ EXTRAVERSION=$(EXTRAVERSION) V=1 \ CONFIG_INFINIBAND=m \ CONFIG_IPATH_CORE=m \ CONFIG_INFINIBAND_IPATH=m \ CONFIG_IPATH_ETHER =m .... Vladimir From rdreier at cisco.com Wed Apr 5 10:21:24 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 10:21:24 -0700 Subject: [openib-general] ipath module compilation on 2.6.15 and 2.6.16 In-Reply-To: <4433FBF6.1040301@mellanox.co.il> (Vladimir Sokolovsky's message of "Wed, 05 Apr 2006 20:18:46 +0300") References: <4433F5DC.2080503@mellanox.co.il> <4433F966.6010504@mellanox.co.il> <4433FBF6.1040301@mellanox.co.il> Message-ID: Vladimir> I am not compiling kernel, but only ipath using SUBDIRS: Vladimir> CONFIG_IPATH_CORE=m \ CONFIG_INFINIBAND_IPATH=m \ Vladimir> CONFIG_IPATH_ETHER =m .... I see. I guess in that case if you create a broken config then there's no one else to blame... - R. 
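An out-of-tree SUBDIRS build like the one above never evaluates Kconfig, so the "depends on 64BIT" restriction is silently skipped. A minimal sketch of a guard that a build script could run before forcing CONFIG_INFINIBAND_IPATH=m; the helper name kernel_is_64bit and the KCONFIG variable are hypothetical (not part of the kernel or OpenIB build system), and the config path is an assumption that may differ per distribution:

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not existing tooling):
# succeeds only if the given kernel .config contains CONFIG_64BIT=y.
kernel_is_64bit() {
    grep -q '^CONFIG_64BIT=y$' "$1"
}

# Assumed location of the target kernel's config; override KCONFIG if needed.
KCONFIG="${KCONFIG:-/lib/modules/$(uname -r)/build/.config}"

if [ -r "$KCONFIG" ] && kernel_is_64bit "$KCONFIG"; then
    echo "64-bit kernel config: enabling ipath is reasonable"
else
    echo "no 64-bit kernel config found: leave ipath disabled"
fi
```

A wrapper could use the exit status of kernel_is_64bit to decide whether to pass the ipath CONFIG_ overrides to make at all, which mirrors by hand what Kconfig would have enforced.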
From ardavis at ichips.intel.com Wed Apr 5 10:24:27 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 05 Apr 2006 10:24:27 -0700 Subject: [openib-general] Re: SPAM: [PATCH] [RFC] - dapl - dat_ep_free() can return without freeing the endpoint In-Reply-To: <4432EDBB.4080304@ichips.intel.com> References: <1144177443.6427.34.camel@stevo-desktop> <4432EDBB.4080304@ichips.intel.com> Message-ID: <4433FD4B.6040505@ichips.intel.com> Sean Hefty wrote: > James Lentini wrote: > >>> void dapli_destroy_conn(struct dapl_cm_id *conn) >>> { >>> int in_callback; >>> + struct rdma_cm_id *cm_id; >>> >>> dapl_dbg_log(DAPL_DBG_TYPE_CM, " destroy_conn: conn >>> %p id %d\n", >>> conn,conn->cm_id); >>> - >>> dapl_os_lock(&conn->lock); >>> conn->destroy = 1; >>> in_callback = conn->in_callback; >>> - dapl_os_unlock(&conn->lock); >>> - >>> - if (!in_callback) { >>> - if (conn->ep) >>> - conn->ep->cm_handle = IB_INVALID_HANDLE; >>> - if (conn->cm_id) { >>> - if (conn->cm_id->qp) >>> - rdma_destroy_qp(conn->cm_id); >>> - rdma_destroy_id(conn->cm_id); >>> + do { >>> + if (in_callback) { >>> + dapl_os_unlock(&conn->lock); >>> + usleep(10); >>> + dapl_os_lock(&conn->lock); >> > > In general this doesn't work. The calling thread may be the callback > thread, which would lead to deadlock. This is why we don't just call > rdma_destroy_id() directly, and let it wait for the callback to complete. > Sorry, the callback names should be changed since it is really an async event processing thread and not a direct callback from CMA. The async thread can destroy the cm_id if we no longer hold any cm_id event references, we destroy the associated QP, and we are synchronized with any other thread that could be destroying at the same time. This is how the code currently works. 
I did not see the original thread/patch from Steve so I don't have the entire context of this issue but it sounds like we need to fix the code so that the destroy QP (dat_ep_free) blocks until the event processing is complete, always destroy the QP and cm_id from this call, and remove cleanup from any async event processing threads. Is this what Steve was attempting to do with his patch? I seemed to have missed the posting of the patch so could someone point me to the original patch so I can review and test any changes. thanks -arlin From mst at mellanox.co.il Wed Apr 5 10:29:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 20:29:22 +0300 Subject: [openib-general] Re: ipath module compilation on 2.6.15 and 2.6.16 In-Reply-To: References: <4433F5DC.2080503@mellanox.co.il> <4433F966.6010504@mellanox.co.il> Message-ID: <20060405172922.GE23402@mellanox.co.il> Quoting r. Roland Dreier : > But what I don't understand is how you were able to > configure your kernel to build the driver, since the Kconfig has > "depends on 64BIT" in it. Vlad is compiling things as an out of kernel module. KConfig does not work in this configuration. -- MST From bos at pathscale.com Wed Apr 5 10:37:59 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 05 Apr 2006 10:37:59 -0700 Subject: [openib-general] ipath module compilation on 2.6.15 and 2.6.16 In-Reply-To: References: <4433F5DC.2080503@mellanox.co.il> <1144256175.3984.5.camel@chalcedony.internal.keyresearch.com> Message-ID: <1144258679.3984.14.camel@chalcedony.internal.keyresearch.com> On Wed, 2006-04-05 at 10:01 -0700, Roland Dreier wrote: > Huh?? This error seemed to have been because writeq() doesn't exist > in a 32-bit kernel. I didn't see that error, just the IB_NODE_CA one. Vladimir, please set LANG=C when you're going to compile a test case for posting the output, so the error messages are more readable. 
Message-ID: Hello Bryan, openib-general-bounces at openib.org wrote on 03/28/2006 04:10:36 PM: > Hi - > > Due to being swamped with several different things at once, I will not > be able to release 1.0 RC2 this week. I expect that I will be able to > release it by April 4. > > Apologies for the inconvenience, > > From xma at us.ibm.com Wed Apr 5 10:46:39 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Apr 2006 10:46:39 -0700 Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune patch In-Reply-To: Message-ID: Hello Roland, I have been working hard on this patch. Do you think it is ready to be merged? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From rdreier at cisco.com Wed Apr 5 10:52:51 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 10:52:51 -0700 Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune patch In-Reply-To: (Shirley Ma's message of "Wed, 5 Apr 2006 10:46:39 -0700") References: Message-ID: Shirley> Hello Roland, I have been working hard on this patch. Do Shirley> you think it is ready to be merged? Yes, I will fix it up and apply it. Hopefully we can come up with something better for 2.6.18. - R. 
From swise at opengridcomputing.com Wed Apr 5 10:56:36 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 05 Apr 2006 12:56:36 -0500 Subject: [openib-general] Re: SPAM: [PATCH] [RFC] - dapl - dat_ep_free() can return without freeing the endpoint In-Reply-To: <4433FD4B.6040505@ichips.intel.com> References: <1144177443.6427.34.camel@stevo-desktop> <4432EDBB.4080304@ichips.intel.com> <4433FD4B.6040505@ichips.intel.com> Message-ID: <1144259796.28591.61.camel@stevo-desktop> > I did not see the original thread/patch from Steve so I don't have the > entire context of this issue but it sounds like we need to fix the code > so that the destroy QP (dat_ep_free) blocks until the event processing > is complete, always destroy the QP and cm_id from this call, and remove > cleanup from any async event processing threads. Is this what Steve was > attempting to do with his patch? > > I seemed to have missed the posting of the patch so could someone point > me to the original patch so I can review and test any changes. see: http://openib.org/pipermail/openib-general/2006-April/019479.html From mshefty at ichips.intel.com Wed Apr 5 10:56:49 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 05 Apr 2006 10:56:49 -0700 Subject: [openib-general] Re: SPAM: [PATCH] [RFC] - dapl - dat_ep_free() can return without freeing the endpoint In-Reply-To: <4433FD4B.6040505@ichips.intel.com> References: <1144177443.6427.34.camel@stevo-desktop> <4432EDBB.4080304@ichips.intel.com> <4433FD4B.6040505@ichips.intel.com> Message-ID: <443404E1.5050607@ichips.intel.com> Arlin Davis wrote: > I did not see the original thread/patch from Steve so I don't have the > entire context of this issue but it sounds like we need to fix the code > so that the destroy QP (dat_ep_free) blocks until the event processing > is complete, always destroy the QP and cm_id from this call, and remove > cleanup from any async event processing threads. 
Is this what Steve was > attempting to do with his patch? There's still the issue that the event processing thread is the one calling dat_ep_free, in which case, you need to ensure that it is finished with all event processing before this call is made. - Sean From bos at pathscale.com Wed Apr 5 10:58:33 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 05 Apr 2006 10:58:33 -0700 Subject: [openib-general] RC2 delayed a bit In-Reply-To: References: Message-ID: <1144259913.3984.30.camel@chalcedony.internal.keyresearch.com> On Wed, 2006-04-05 at 10:43 -0700, Shirley Ma wrote: > Today is April 5. What the current plan for RC2? Several vendors have formed an "Enterprise Working Group" (EWG) to produce a set of sources and binary packages that they can provide to customers until such time as the Linux distribution vendors ship enough OpenIB userspace components to make this unnecessary. The intention is to ship the same basic userspace components as the OpenIB 1.0 software release, with some additional parts. In addition, some people want to ship kernel components that are not upstream, such as SDP and iSER. Unfortunately, the coordination of this with the 1.0 process has thus far not been very effective, so I am spending a lot of time manually filtering diffs to see what has changed between the first EWG software release (named "IBED") and the 1.0 tree, so that I can reunify the two. Also, for some reason, there has not yet been an official announcement of the EWG's existence on this mailing list. Until that announcement comes along, you can see the EWG archives and sign up for the mailing list at http://openib.org/mailman/listinfo/openfabrics-ewg Please pardon the delay as we work to get things back on track. References: Message-ID: <443405C4.4090703@ichips.intel.com> Roland Dreier wrote: > Yes, makes sense. I think there have been some updates and fixes to > CMA code (loopback handling, etc). 
Sean, when you get a chance, can > send me updates for the rdma_cm branch? Even a single rolled-up patch > against the head of that branch is fine -- it's easy for me to split > up and update the individual patches. I will send a patch against your tree at the end of this week. - Sean From mshefty at ichips.intel.com Wed Apr 5 11:02:54 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 05 Apr 2006 11:02:54 -0700 Subject: [openib-general] RC2 delayed a bit In-Reply-To: <1144259913.3984.30.camel@chalcedony.internal.keyresearch.com> References: <1144259913.3984.30.camel@chalcedony.internal.keyresearch.com> Message-ID: <4434064E.3090200@ichips.intel.com> Bryan O'Sullivan wrote: > The intention is to ship the same basic userspace components as the > OpenIB 1.0 software release, with some additional parts. In addition, > some people want to ship kernel components that are not upstream, such > as SDP and iSER. So, we went from having no openib release to now having two? That's confusing. Are these vendors members of openib? 
- Sean From swise at opengridcomputing.com Wed Apr 5 11:03:27 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 05 Apr 2006 13:03:27 -0500 Subject: [openib-general] Re: SPAM: [PATCH] [RFC] - dapl - dat_ep_free() can return without freeing the endpoint In-Reply-To: <443404E1.5050607@ichips.intel.com> References: <1144177443.6427.34.camel@stevo-desktop> <4432EDBB.4080304@ichips.intel.com> <4433FD4B.6040505@ichips.intel.com> <443404E1.5050607@ichips.intel.com> Message-ID: <1144260207.28591.68.camel@stevo-desktop> On Wed, 2006-04-05 at 10:56 -0700, Sean Hefty wrote: > Arlin Davis wrote: > > I did not see the original thread/patch from Steve so I don't have the > > entire context of this issue but it sounds like we need to fix the code > > so that the destroy QP (dat_ep_free) blocks until the event processing > > is complete, always destroy the QP and cm_id from this call, and remove > > cleanup from any async event processing threads. Is this what Steve was > > attempting to do with his patch? > > There's still the issue that the event processing thread is the one calling > dat_ep_free, in which case, you need to ensure that it is finished with all > event processing before this call is made. > The event processing thread is not the one calling dat_ep_free(). As I said before, with my new patch, the event processing thread never calls dat_ep_free() (or really dapli_conn_destroy()). The dat event processing thread only posts events into an EVD and wakes up the consumer thread. Arlin/James, please confirm this? Cuz maybe I'm confused? Steve. 
From bos at pathscale.com Wed Apr 5 11:04:37 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 05 Apr 2006 11:04:37 -0700 Subject: [openib-general] RC2 delayed a bit In-Reply-To: <4434064E.3090200@ichips.intel.com> References: <1144259913.3984.30.camel@chalcedony.internal.keyresearch.com> <4434064E.3090200@ichips.intel.com> Message-ID: <1144260277.3984.34.camel@chalcedony.internal.keyresearch.com> On Wed, 2006-04-05 at 11:02 -0700, Sean Hefty wrote: > So, we went from having no openib release to now having two? That's confusing. Indeed. > Are these vendors members of openib? Yes. (Sean Hefty's message of "Wed, 05 Apr 2006 11:02:54 -0700") References: <1144259913.3984.30.camel@chalcedony.internal.keyresearch.com> <4434064E.3090200@ichips.intel.com> Message-ID: Sean> So, we went from having no openib release to now having two? Sean> That's confusing. I think the right way to think about it is that we have one OpenFabrics release, namely the 1.0 release managed by Bryan, and one distribution (so far) of that release plus other components (MPI, kernel modules, etc). I'm not sure whether it makes sense to have that distribution produced by an OpenFabrics working group, but that's where it stands right now. The initial mistake was probably to think that the OF1.0 release would be something targeted at end users. It would be much better to do as nearly all other free software projects do and produce a release targeted at distributors, and let distributors get it to end users. - R. From jlentini at netapp.com Wed Apr 5 11:10:48 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 5 Apr 2006 14:10:48 -0400 (EDT) Subject: [openib-general] Re: [DAPL] Provider initialialization In-Reply-To: <4433B14F.9010701@mellanox.co.il> Message-ID: On Wed, 5 Apr 2006, Vladimir Sokolovsky wrote: > Hi James, > Does uDAPL support PPC64 architecture? On PPC64, the code expects __PPC64__ to be defined. 
The PPC64 support isn't as well tested as other architectures, so there may be some problems. From robert.j.woodruff at intel.com Wed Apr 5 11:10:01 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 5 Apr 2006 11:10:01 -0700 Subject: [openib-general] RC2 delayed a bit In-Reply-To: <4434064E.3090200@ichips.intel.com> Message-ID: <000001c658dc$2867f9b0$6aa1070a@amr.corp.intel.com> Bryan wrote, >So, we went from having no openib release to now having two? That's confusing. >Are these vendors members of openib? >- Sean I know that I am confused. Can someone from the ibed (openfabrics-ewg) people please enlighten us ? BTW. I built some kernel RPMs based on the 1.0 branch kernel code and the backport patches for RedHat EL4.0 U3. If someone wants me to post them somewhere, I will. woody From mst at mellanox.co.il Wed Apr 5 11:36:12 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 21:36:12 +0300 Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task In-Reply-To: References: <20060404155233.GR14808@mellanox.co.il> <20060405074904.GC14808@mellanox.co.il> <20060405165303.GA23402@mellanox.co.il> <20060405171314.GB23402@mellanox.co.il> Message-ID: <20060405183612.GA25302@mellanox.co.il> Quoting r. Roland Dreier : > Maybe it is just random corruption that causes the crash > right at the beginning of ipoib_mcast_sendonly_join_complete(). Anyway, the updated patch Eli posted looks good, doesn't it? -- MST From mst at mellanox.co.il Wed Apr 5 11:43:33 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Apr 2006 21:43:33 +0300 Subject: [openib-general] Re: ipath module compilation on 2.6.15 and 2.6.16 In-Reply-To: References: <4433F5DC.2080503@mellanox.co.il> <4433F966.6010504@mellanox.co.il> <4433FBF6.1040301@mellanox.co.il> Message-ID: <20060405184333.GC25302@mellanox.co.il> Quoting r. 
Roland Dreier : > Subject: Re: ipath module compilation on 2.6.15 and 2.6.16 > > Vladimir> I am not compiling kernel, but only ipath using SUBDIRS: > > Vladimir> CONFIG_IPATH_CORE=m \ CONFIG_INFINIBAND_IPATH=m \ > Vladimir> CONFIG_IPATH_ETHER =m .... > > I see. I guess in that case if you create a broken config then > there's no one else to blame... Actually, it might be useful to try and make KConfig work for out of tree modules (or even non-kernel projects) someday. -- MST From robert.j.woodruff at intel.com Wed Apr 5 11:44:16 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 5 Apr 2006 11:44:16 -0700 Subject: [openib-general] RC2 delayed a bit In-Reply-To: Message-ID: <000201c658e0$f1876a20$6aa1070a@amr.corp.intel.com> Roland wrote, >The initial mistake was probably to think that the OF1.0 release would >be something targeted at end users. It would be much better to do as >nearly all other free software projects do and produce a release > > - R. I agree. OF1.0 should probably target distros, rather than end users. I think that part of the problem is that people want to use the code now and don't want to wait for the distros to pick it up. Problem is that openib/openfabrics does not have the infrastructure and resources to support every end user in the world. This is what the distros do for a living. I know that it is taking a lot of my cycles trying to help people get the code installed an running and it will only get worse as more newbie's want to use it. woody From rdreier at cisco.com Wed Apr 5 13:48:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 13:48:57 -0700 Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task In-Reply-To: <200604051459.41367.eli@mellanox.co.il> (Eli Cohen's message of "Wed, 5 Apr 2006 14:59:40 +0300") References: <200604051459.41367.eli@mellanox.co.il> Message-ID: Thanks, looks really good (consolidating the code even shrinks the .text of the driver). 
Applied to svn and 2.6.17 From iod00d at hp.com Wed Apr 5 14:32:43 2006 From: iod00d at hp.com (Grant Grundler) Date: Wed, 5 Apr 2006 14:32:43 -0700 Subject: [openib-general] how can i know whether my appl'n is using SDP or not? In-Reply-To: <20060405082913.35553.qmail@web8314.mail.in.yahoo.com> References: <20060405082913.35553.qmail@web8314.mail.in.yahoo.com> Message-ID: <20060405213243.GE3556@esmail.cup.hp.com> On Wed, Apr 05, 2006 at 09:29:13AM +0100, keshetti mahesh wrote: > i am working on the infiniband cluster and the openIB stack is > unstalled in hosts(including SDP) > > can anybody tell me wen i run an appl'n over infiniband > how can i know whether it is using SDP or IPoIB? Are you using libsdp.so and LD_PRELOAD? Or does the aplication know about the AF_INET_SDP address family? If no to both, then you have the answer. hth, grant From sweitzen at cisco.com Wed Apr 5 14:38:13 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 5 Apr 2006 14:38:13 -0700 Subject: [openib-general] how can i know whether my appl'n is using SDP ornot? Message-ID: 1) You can use tcpdump to see if there is IPoIB traffic (assuming your tcpdump supports OpenIB IPoIB, RHEL4 tcpdump does not yet). 2) You can strace the application see if it is using AF_INET or AF_INET_SDP. Scott > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Grant Grundler > Sent: Wednesday, April 05, 2006 2:33 PM > To: keshetti mahesh > Cc: openIB > Subject: Re: [openib-general] how can i know whether my > appl'n is using SDP ornot? > > On Wed, Apr 05, 2006 at 09:29:13AM +0100, keshetti mahesh wrote: > > i am working on the infiniband cluster and the openIB stack is > > unstalled in hosts(including SDP) > > > > can anybody tell me wen i run an appl'n over infiniband > > how can i know whether it is using SDP or IPoIB? > > Are you using libsdp.so and LD_PRELOAD? 
> Or does the aplication know about the AF_INET_SDP address family? > > If no to both, then you have the answer. > > hth, > grant > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Wed Apr 5 15:31:37 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 15:31:37 -0700 Subject: [openib-general] Re: [PATCH] repost: IPoIB queue size tune patch In-Reply-To: (Shirley Ma's message of "Wed, 5 Apr 2006 05:02:45 -0600") References: Message-ID: Thanks, here's the version I committed to svn and queued for 2.6.17. I made the module parameters "send_queue_size" and "recv_queue_size" because I think that "sendq_size" might be too obscure for people to understand. I made the queue size variables __read_mostly to avoid false sharing of cache lines. I changed one "/ 2" into ">> 1", because now that the queue size is not a compile-time constant, the compiler is forced to generate a divide instruction for the "/ 2". And I did a few other minor cleanups... diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 374109d..12a1e05 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -65,6 +65,8 @@ enum { IPOIB_RX_RING_SIZE = 128, IPOIB_TX_RING_SIZE = 64, + IPOIB_MAX_QUEUE_SIZE = 8192, + IPOIB_MIN_QUEUE_SIZE = 2, IPOIB_NUM_WC = 4, @@ -332,6 +334,8 @@ static inline void ipoib_unregister_debu #define ipoib_warn(priv, format, arg...) 
\ ipoib_printk(KERN_WARNING, priv, format , ## arg) +extern int ipoib_sendq_size; +extern int ipoib_recvq_size; #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG extern int ipoib_debug_level; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index ed65202..a54da42 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -161,7 +161,7 @@ static int ipoib_ib_post_receives(struct struct ipoib_dev_priv *priv = netdev_priv(dev); int i; - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + for (i = 0; i < ipoib_recvq_size; ++i) { if (ipoib_alloc_rx_skb(dev, i)) { ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); return -ENOMEM; @@ -187,7 +187,7 @@ static void ipoib_ib_handle_wc(struct ne if (wr_id & IPOIB_OP_RECV) { wr_id &= ~IPOIB_OP_RECV; - if (wr_id < IPOIB_RX_RING_SIZE) { + if (wr_id < ipoib_recvq_size) { struct sk_buff *skb = priv->rx_ring[wr_id].skb; dma_addr_t addr = priv->rx_ring[wr_id].mapping; @@ -252,9 +252,9 @@ static void ipoib_ib_handle_wc(struct ne struct ipoib_tx_buf *tx_req; unsigned long flags; - if (wr_id >= IPOIB_TX_RING_SIZE) { + if (wr_id >= ipoib_sendq_size) { ipoib_warn(priv, "completion event with wrid %d (> %d)\n", - wr_id, IPOIB_TX_RING_SIZE); + wr_id, ipoib_sendq_size); return; } @@ -275,7 +275,7 @@ static void ipoib_ib_handle_wc(struct ne spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; if (netif_queue_stopped(dev) && - priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) netif_wake_queue(dev); spin_unlock_irqrestore(&priv->tx_lock, flags); @@ -344,13 +344,13 @@ void ipoib_send(struct net_device *dev, * means we have to make sure everything is properly recorded and * our state is consistent before we call post_send(). 
*/ - tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)]; tx_req->skb = skb; addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, DMA_TO_DEVICE); pci_unmap_addr_set(tx_req, mapping, addr); - if (unlikely(post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), address->ah, qpn, addr, skb->len))) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; @@ -363,7 +363,7 @@ void ipoib_send(struct net_device *dev, address->last_send = priv->tx_head; ++priv->tx_head; - if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) { ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); netif_stop_queue(dev); } @@ -488,7 +488,7 @@ static int recvs_pending(struct net_devi int pending = 0; int i; - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + for (i = 0; i < ipoib_recvq_size; ++i) if (priv->rx_ring[i].skb) ++pending; @@ -527,7 +527,7 @@ int ipoib_ib_dev_stop(struct net_device */ while ((int) priv->tx_tail - (int) priv->tx_head < 0) { tx_req = &priv->tx_ring[priv->tx_tail & - (IPOIB_TX_RING_SIZE - 1)]; + (ipoib_sendq_size - 1)]; dma_unmap_single(priv->ca->dma_device, pci_unmap_addr(tx_req, mapping), tx_req->skb->len, @@ -536,7 +536,7 @@ int ipoib_ib_dev_stop(struct net_device ++priv->tx_tail; } - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + for (i = 0; i < ipoib_recvq_size; ++i) if (priv->rx_ring[i].skb) { dma_unmap_single(priv->ca->dma_device, pci_unmap_addr(&priv->rx_ring[i], diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 9cb9e43..5bf7e26 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -41,6 +41,7 @@ #include #include #include +#include #include /* For ARPHRD_xxx */ @@ -53,6 +54,14 @@ MODULE_AUTHOR("Roland Dreier"); 
MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); MODULE_LICENSE("Dual BSD/GPL"); +int ipoib_sendq_size __read_mostly = IPOIB_TX_RING_SIZE; +int ipoib_recvq_size __read_mostly = IPOIB_RX_RING_SIZE; + +module_param_named(send_queue_size, ipoib_sendq_size, int, 0444); +MODULE_PARM_DESC(send_queue_size, "Number of descriptors in send queue"); +module_param_named(recv_queue_size, ipoib_recvq_size, int, 0444); +MODULE_PARM_DESC(recv_queue_size, "Number of descriptors in receive queue"); + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG int ipoib_debug_level; @@ -795,20 +804,19 @@ int ipoib_dev_init(struct net_device *de struct ipoib_dev_priv *priv = netdev_priv(dev); /* Allocate RX/TX "rings" to hold queued skbs */ - - priv->rx_ring = kzalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf), + priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring, GFP_KERNEL); if (!priv->rx_ring) { printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", - ca->name, IPOIB_RX_RING_SIZE); + ca->name, ipoib_recvq_size); goto out; } - priv->tx_ring = kzalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf), + priv->tx_ring = kzalloc(ipoib_sendq_size * sizeof *priv->tx_ring, GFP_KERNEL); if (!priv->tx_ring) { printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", - ca->name, IPOIB_TX_RING_SIZE); + ca->name, ipoib_sendq_size); goto out_rx_ring_cleanup; } @@ -876,7 +884,7 @@ static void ipoib_setup(struct net_devic dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; dev->addr_len = INFINIBAND_ALEN; dev->type = ARPHRD_INFINIBAND; - dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->tx_queue_len = ipoib_sendq_size * 2; dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; /* MTU will be reset when mcast join happens */ @@ -1128,6 +1136,14 @@ static int __init ipoib_init_module(void { int ret; + ipoib_recvq_size = roundup_pow_of_two(ipoib_recvq_size); + ipoib_recvq_size = min(ipoib_recvq_size, IPOIB_MAX_QUEUE_SIZE); + ipoib_recvq_size = max(ipoib_recvq_size, 
IPOIB_MIN_QUEUE_SIZE); + + ipoib_sendq_size = roundup_pow_of_two(ipoib_sendq_size); + ipoib_sendq_size = min(ipoib_sendq_size, IPOIB_MAX_QUEUE_SIZE); + ipoib_sendq_size = max(ipoib_sendq_size, IPOIB_MIN_QUEUE_SIZE); + ret = ipoib_register_debugfs(); if (ret) return ret; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 5f03880..1d49d16 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -159,8 +159,8 @@ int ipoib_transport_dev_init(struct net_ struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr init_attr = { .cap = { - .max_send_wr = IPOIB_TX_RING_SIZE, - .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_wr = ipoib_sendq_size, + .max_recv_wr = ipoib_recvq_size, .max_send_sge = 1, .max_recv_sge = 1 }, @@ -175,7 +175,7 @@ int ipoib_transport_dev_init(struct net_ } priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, - IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + ipoib_sendq_size + ipoib_recvq_size + 1); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); goto out_free_pd; From swise at opengridcomputing.com Wed Apr 5 15:39:02 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 05 Apr 2006 17:39:02 -0500 Subject: [openib-general] [PATCH] [DAPL] [RFC] - remove duplicate disconnect event. Message-ID: <1144276742.28591.82.camel@stevo-desktop> James, Running a 4 thread, 8 ep/thread dapltest (the last test in regress.sh), I was intermittently seeing a seg fault in dapltest. This is running over the chelsio rnic using the iwarp branch. After debugging I found out that dapltest was freeing an already freed endpoint due to it receiving duplicate disconnect events during test shutdown. The code assumes it will get exactly one disconnect event for every endpoint (rightly so I guess). 
I tracked this down to the code in dapl_ep_disconnect() that generates its own disconnect event in certain circumstances. I removed this and ran regress.sh over both mthca and cxgb3 with no problems. So my question to the dapl experts is: why is this code here? For our iwarp devices, it ends up sometimes generating duplicate disconnect events. I don't see why its needed. If anyone can explain the logic, that would be great. With this patch and the previous patch the fixes dat_ep_free() to always free the endpoint, I'm able to run dapltest 1-6 over the chelsio rnic. As part of pulling in the iwarp support, I'd like the group to consider pulling in these patches that fix issues with udapl (once we agree on the final patches). For now, I'll maintain these patches in the iwarp branch... Thanks, Steve. Signed-off-by: Steve Wise Index: dapl_ep_disconnect.c =================================================================== --- dapl_ep_disconnect.c (revision 6253) +++ dapl_ep_disconnect.c (working copy) @@ -70,8 +70,6 @@ { DAPL_EP *ep_ptr; DAPL_EVD *evd_ptr; - DAPL_CR *cr_ptr; - ib_cm_events_t ib_cm_event; DAT_RETURN dat_status; dapl_dbg_log (DAPL_DBG_TYPE_API | DAPL_DBG_TYPE_CM, @@ -175,51 +173,6 @@ dapl_os_unlock ( &ep_ptr->header.lock ); dat_status = dapls_ib_disconnect ( ep_ptr, disconnect_flags ); - /* - * Reacquire the lock and make sure we didn't get a callback - * that cleaned up. - */ - dapl_os_lock ( &ep_ptr->header.lock ); - if (disconnect_flags == DAT_CLOSE_ABRUPT_FLAG && - ep_ptr->param.ep_state == DAT_EP_STATE_DISCONNECT_PENDING ) - { - /* - * If this is an ABRUPT close, the provider will not generate - * a disconnect message so we do it manually here. Just invoke - * the CM callback as it will clean up the appropriate - * data structures, reset the state, and generate the event - * on the way out. Obtain the provider dependent cm_event to - * pass into the callback for a disconnect. 
- */ - ib_cm_event = dapls_ib_get_cm_event (DAT_CONNECTION_EVENT_DISCONNECTED); - - cr_ptr = ep_ptr->cr_ptr; - dapl_os_unlock ( &ep_ptr->header.lock ); - - if (cr_ptr != NULL) - { - dapl_dbg_log (DAPL_DBG_TYPE_API | DAPL_DBG_TYPE_CM, - " dapl_ep_disconnect force callback on EP %p CM handle %x\n", - ep_ptr, cr_ptr->ib_cm_handle); - - dapls_cr_callback (cr_ptr->ib_cm_handle, - ib_cm_event, - NULL, - cr_ptr->sp_ptr); - } - else - { - dapl_evd_connection_callback (ep_ptr->cm_handle, - ib_cm_event, - NULL, - (void *) ep_ptr); - } - } - else - { - dapl_os_unlock ( &ep_ptr->header.lock ); - } - bail: dapl_dbg_log (DAPL_DBG_TYPE_RTN | DAPL_DBG_TYPE_CM, "dapl_ep_disconnect () returns 0x%x\n", From rdreier at cisco.com Wed Apr 5 15:39:15 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 15:39:15 -0700 Subject: [openib-general] Re: [PATCH] mthca: disable pci_tune In-Reply-To: <20060405124716.GF21115@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 5 Apr 2006 15:47:16 +0300") References: <20060405124716.GF21115@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Wed Apr 5 15:55:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 15:55:12 -0700 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: <200604051559.34828.eli@mellanox.co.il> (Eli Cohen's message of "Wed, 5 Apr 2006 15:59:34 +0300") References: <200604051559.34828.eli@mellanox.co.il> Message-ID: This makes sense but I'm trying to see exactly what goes wrong without it. Suppose path->query gets set to NULL between testing it and calling ib_sa_cancel_query(). What's the worst that can happen? It looks safe to ib_sa_cancel_query() with a stale or NULL query pointer. - R. From mst at mellanox.co.il Wed Apr 5 16:13:42 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 6 Apr 2006 02:13:42 +0300 Subject: [openib-general] CMA and SDP hh Message-ID: <20060405231342.GA26398@mellanox.co.il> Sean, the following code in CMA *ip_ver = sdp_get_ip_ver(hdr); *port = ((struct sdp_hh *) hdr)->port; *src = &((struct sdp_hh *) hdr)->src_addr; *dst = &((struct sdp_hh *) hdr)->dst_addr; seems to assume that SDP places the HH message at the beginning of the private data in CM messages. However, I think in SDP HH is preceded by the BSDH, so this looks wrong. -- MST From sean.hefty at intel.com Wed Apr 5 16:20:16 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 5 Apr 2006 16:20:16 -0700 Subject: [openib-general] RE: CMA and SDP hh In-Reply-To: <20060405231342.GA26398@mellanox.co.il> Message-ID: > *ip_ver = sdp_get_ip_ver(hdr); > *port = ((struct sdp_hh *) hdr)->port; > *src = &((struct sdp_hh *) hdr)->src_addr; > *dst = &((struct sdp_hh *) hdr)->dst_addr; > >seems to assume that SDP places the HH message at the beginning >of the private data in CM messages. > >However, I think in SDP HH is preceded by the BSDH, so this looks wrong. Yes - you're correct. I assumed that the HH was at the start of the private data, but the BSDH should go there. To fix this, I think we'll want to add bsdh in struct sdp_hh in cma.c. 
struct sdp_hh { u8 bsdh[16]; u8 sdp_version; u8 ip_version; /* IP version: 7:4 */ u8 sdp_specific1[10]; __u16 port; __u16 sdp_specific2; union cma_ip_addr src_addr; union cma_ip_addr dst_addr; }; - Sean
From mst at mellanox.co.il Wed Apr 5 16:40:23 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 02:40:23 +0300 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: References: <200604051559.34828.eli@mellanox.co.il> Message-ID: <20060405234023.GA26557@mellanox.co.il> Quoting r. Roland Dreier : > It looks safe to ib_sa_cancel_query() with a stale or NULL query pointer. I don't think so. Look into ib_sa_cancel_query: void ib_sa_cancel_query(int id, struct ib_sa_query *query) { unsigned long flags; struct ib_mad_agent *agent; struct ib_mad_send_buf *mad_buf; spin_lock_irqsave(&idr_lock, flags); if (idr_find(&query_idr, id) != query) { <--------- !!!
spin_unlock_irqrestore(&idr_lock, flags); return; } agent = query->port->agent; mad_buf = query->mad_buf; spin_unlock_irqrestore(&idr_lock, flags); ib_cancel_mad(agent, mad_buf); } See what happens if you pass a stale id (query finished) and a NULL query? -- MST From mst at mellanox.co.il Wed Apr 5 16:43:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 02:43:55 +0300 Subject: [openib-general] Re: CMA and SDP hh In-Reply-To: References: <20060405231342.GA26398@mellanox.co.il> Message-ID: <20060405234355.GB26557@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: CMA and SDP hh > > > *ip_ver = sdp_get_ip_ver(hdr); > > *port = ((struct sdp_hh *) hdr)->port; > > *src = &((struct sdp_hh *) hdr)->src_addr; > > *dst = &((struct sdp_hh *) hdr)->dst_addr; > > > >seems to assume that SDP places the HH message at the beginning > >of the private data in CM messages. > > > >However, I think in SDP HH is preceded by the BSDH, so this looks wrong. > > Yes - you're correct. I assumed that the HH was at the start of the private > data, but the BSDH should go there. To fix this, I think we'll want to add bsdh > in struct sdp_hh in cma.c. > > struct sdp_hh { > u8 bsdh[16]; > u8 sdp_version; > u8 ip_version; /* IP version: 7:4 */ > u8 sdp_specific1[10]; > __u16 port; > __u16 sdp_specific2; > union cma_ip_addr src_addr; > union cma_ip_addr dst_addr; > }; Yes. Please do that. -- MST From rdreier at cisco.com Wed Apr 5 16:46:00 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 16:46:00 -0700 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060405234023.GA26557@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 6 Apr 2006 02:40:23 +0300") References: <200604051559.34828.eli@mellanox.co.il> <20060405234023.GA26557@mellanox.co.il> Message-ID: Michael> See what happens if you pass a stale id (query finished) Michael> and a NULL query? Ah, I see. 
It can deal with a stale id or a NULL query, but both at once leads to a bad coincidence. I guess that's not quite a bug in ib_sa_cancel_query(), although it might be nice if it dealt with NULL query pointers. Well, at least this would lead to an instant crash as it is now. Out of curiosity, have you guys seen this actually happen? - R. From mst at mellanox.co.il Wed Apr 5 17:01:17 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 03:01:17 +0300 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: References: <200604051559.34828.eli@mellanox.co.il> <20060405234023.GA26557@mellanox.co.il> Message-ID: <20060406000117.GC26557@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib_flush_paths > > Michael> See what happens if you pass a stale id (query finished) > Michael> and a NULL query? > > Ah, I see. It can deal with a stale id or a NULL query, but both at > once leads to a bad coincidence. I guess that's not quite a bug in > ib_sa_cancel_query(), although it might be nice if it dealt with NULL > query pointers. > > Well, at least this would lead to an instant crash as it is now. Out > of curiosity, have you guys seen this actually happen? Yes, I think we have. By the way, I was thinking about SA queries, and I came to a conclusion that we have an unfixable race at module unload: nothing I as the user of SA do in my callback can ensure that my callback is not still running when my module is unloaded. The problem here is lack of explicit registration with SA: I just push queries so there's no time SA can flush its queues for me. The only easy way out that I see is some kind of sa_flush function that will flush MAD wqs, and ask all users to call that. And I think we have the same bug in addr so this needs a flush function too. Makes sense?
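To make the race above concrete, here is a minimal user-space sketch in C, assuming pthreads stand in for the kernel's work queues. The names provider_submit, provider_complete, provider_flush and mark_done are hypothetical and are not part of ib_sa or ib_addr. The only point is that the code invoking the callback, not the callback itself, can know that the callback has returned, which is why a flush would have to live on the provider side.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical provider-side bookkeeping; not the real ib_sa structures. */
struct provider {
    pthread_mutex_t lock;
    pthread_cond_t  idle;        /* signalled when outstanding drops to 0 */
    int             outstanding; /* queries submitted but not fully completed */
};

typedef void (*query_cb)(void *context);

void provider_init(struct provider *p)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->idle, NULL);
    p->outstanding = 0;
}

/* The user posts a query; count it from this point on. */
void provider_submit(struct provider *p)
{
    pthread_mutex_lock(&p->lock);
    p->outstanding++;
    pthread_mutex_unlock(&p->lock);
}

/* Run by the provider's worker when the query finishes. */
void provider_complete(struct provider *p, query_cb cb, void *context)
{
    cb(context); /* user code: may set a "done" flag long before it returns */

    /* Only after cb() has returned is the query really finished. */
    pthread_mutex_lock(&p->lock);
    assert(p->outstanding > 0);
    if (--p->outstanding == 0)
        pthread_cond_broadcast(&p->idle);
    pthread_mutex_unlock(&p->lock);
}

/* Block until every submitted query's callback has returned. */
void provider_flush(struct provider *p)
{
    pthread_mutex_lock(&p->lock);
    while (p->outstanding > 0)
        pthread_cond_wait(&p->idle, &p->lock);
    pthread_mutex_unlock(&p->lock);
}

/* Example callback: sets the flag a user might be tempted to wait on. */
void mark_done(void *context)
{
    *(int *)context = 1;
}
```

In this sketch a user would call provider_flush() on its exit path instead of polling the flag set by mark_done(); the decrement runs in provider code after the callback has returned, so the wait cannot end while user code is still executing.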
-- MST From xma at us.ibm.com Wed Apr 5 17:03:16 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Apr 2006 17:03:16 -0700 Subject: [openib-general] RC2 delayed a bit In-Reply-To: Message-ID: > The initial mistake was probably to think that the OF1.0 release would > be something targeted at end users. It would be much better to do as > nearly all other free software projects do and produce a release > targeted at distributors, and let distributors get it to end users. > > - R. How to handle this OF1.0 release to be synced with distros' releases? Without clear milestones and schedules, it would be tough to target distros. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From mshefty at ichips.intel.com Wed Apr 5 17:04:23 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 05 Apr 2006 17:04:23 -0700 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406000117.GC26557@mellanox.co.il> References: <200604051559.34828.eli@mellanox.co.il> <20060405234023.GA26557@mellanox.co.il> <20060406000117.GC26557@mellanox.co.il> Message-ID: <44345B07.6050002@ichips.intel.com> Michael S. Tsirkin wrote: > By the way, I was thinking about SA queries, and I came to a conclusion > that we have an unfixable race at module unload: nothing I as the user of > SA do in my callback can ensure that my callback is not still running > when my module is unloaded. > > The problem here is lack of explicit registration with SA: I just push queries > so there's no time SA can flush its queues for me. > > The only easy way out that I see is some kind of sa_flush function that > will flush MAD wqs, and ask all users to call that. > And I think we have the same bug in addr so this needs a flush function too. > > Makes sense? Both ib_sa and ib_addr will always perform exactly one callback per request.
A user only needs to wait for that callback to occur. The rdma_cm does this for ib_addr, and ib_multicast will do this for ib_sa. - Sean From sean.hefty at intel.com Wed Apr 5 17:05:52 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 5 Apr 2006 17:05:52 -0700 Subject: [openib-general] RE: CMA and SDP hh In-Reply-To: <20060405234355.GB26557@mellanox.co.il> Message-ID: >> struct sdp_hh { >> u8 bsdh[16]; >> u8 sdp_version; >> u8 ip_version; /* IP version: 7:4 */ >> u8 sdp_specific1[10]; >> __u16 port; >> __u16 sdp_specific2; >> union cma_ip_addr src_addr; >> union cma_ip_addr dst_addr; >> }; > >Yes. Please do that. I committed a fix like the one shown above. - Sean From mst at mellanox.co.il Wed Apr 5 17:13:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 03:13:07 +0300 Subject: [openib-general] Re: Re: [PATCH] ipoib_flush_paths In-Reply-To: <44345B07.6050002@ichips.intel.com> References: <200604051559.34828.eli@mellanox.co.il> <20060405234023.GA26557@mellanox.co.il> <20060406000117.GC26557@mellanox.co.il> <44345B07.6050002@ichips.intel.com> Message-ID: <20060406001307.GD26557@mellanox.co.il> Quoting r. Sean Hefty : > >By the way, I was thinking about SA queries, and I came to a conclusion > >that we have an unfixable race at module unload: nothing I as the user of > >SA do in my callback can ensure that my callback is not still running > >when my module is unloaded. > > > >The problem here is lack of explicit registration with SA: I just push > >queries > >so there's no time SA can flush its queues for me. > > > >The only easy way out that I see is some kind of sa_flush function that > >will flush MAD wqs, and ask all users to call that. > >And I think we have the same bug in addr so this needs a flush function > >too. > > > >Makes sense? > > Both ib_sa and ib_addr will always perform exactly one callback per > request. A user only needs to wait for that callback to occur.
Sean, there's no safe way to do this in presence of loadable modules: callback() { set flag return; } flag is set but it is still not OK to unload module - you are executing return from the module text section. See now? -- MST From mshefty at ichips.intel.com Wed Apr 5 17:12:23 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 05 Apr 2006 17:12:23 -0700 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: <44345B07.6050002@ichips.intel.com> References: <200604051559.34828.eli@mellanox.co.il> <20060405234023.GA26557@mellanox.co.il> <20060406000117.GC26557@mellanox.co.il> <44345B07.6050002@ichips.intel.com> Message-ID: <44345CE7.9000807@ichips.intel.com> >> By the way, I was thinking about SA queries, and I came to a conclusion >> that we have an unfixable race at module unload: nothing I as the user of >> SA do in my callback can ensure that my callback is not still running >> when my module is unloaded. I missed what you were saying before. I agree, I don't see how you can prevent your callback from completing. - Sean From mst at mellanox.co.il Wed Apr 5 17:20:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 03:20:51 +0300 Subject: [openib-general] Re: Re: [PATCH] ipoib_flush_paths In-Reply-To: <44345CE7.9000807@ichips.intel.com> References: <200604051559.34828.eli@mellanox.co.il> <20060405234023.GA26557@mellanox.co.il> <20060406000117.GC26557@mellanox.co.il> <44345B07.6050002@ichips.intel.com> <44345CE7.9000807@ichips.intel.com> Message-ID: <20060406002051.GE26557@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: Re: [PATCH] ipoib_flush_paths > > >>By the way, I was thinking about SA queries, and I came to a conclusion > >>that we have an unfixable race at module unload: nothing I as the user of > >>SA do in my callback can ensure that my callback is not still running > >>when my module is unloaded. > > I missed what you were saying before. 
I agree, I don't see how you can > prevent your callback from completing. OK, so what I propose is: /** * ib_sa_flush - flush query callbacks in progress * @device: device to flush * @port_num: port number to flush */ void ib_sa_flush(struct ib_device *device, u8 port_num); and in similar fashion for addr. -- MST From mshefty at ichips.intel.com Wed Apr 5 17:19:36 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 05 Apr 2006 17:19:36 -0700 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406000117.GC26557@mellanox.co.il> References: <200604051559.34828.eli@mellanox.co.il> <20060405234023.GA26557@mellanox.co.il> <20060406000117.GC26557@mellanox.co.il> Message-ID: <44345E98.6000300@ichips.intel.com> Michael S. Tsirkin wrote: > The only easy way out that I see is some kind of sa_flush function that > will flush MAD wqs, and ask all users to call that. > And I think we have the same bug in addr so this needs a flush function too. My preference would be to add a registration function to ib_sa and ib_addr. Otherwise, we weaken the encapsulation between the modules, since ib_sa doesn't own the threads that it uses in its callbacks. Registration could be global, or per port. The latter probably makes the most sense. - Sean From rdreier at cisco.com Wed Apr 5 17:25:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 05 Apr 2006 17:25:54 -0700 Subject: [openib-general] RC2 delayed a bit In-Reply-To: (Shirley Ma's message of "Wed, 5 Apr 2006 17:03:16 -0700") References: Message-ID: Shirley> How to handle this OF1.0 release to be synced with Shirley> distros' releases? Without clear milestones and Shirley> schedules, it would be tough to target distros. I think it's up to OF to publish its release schedule, and then distros can decide what to ship. It's no different than what distros face with the kernel, gcc, glibc, X.org, gnome, KDE, etc., etc. - R.
From mst at mellanox.co.il Wed Apr 5 17:31:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 03:31:41 +0300 Subject: [openib-general] Re: RC2 delayed a bit In-Reply-To: References: Message-ID: <20060406003140.GF26557@mellanox.co.il> Quoting r. Shirley Ma : > How to handle this OF1.0 release to be synced with distros' releases? 1. Publish tarballs of stable versions on the openib website 2. Send an announcement by mail 3. Have distros pick it up Seems to work for most every open-source project out there. > Without clear milestones and schedules, it would be tough to target > distros. Why? Just have them download the last stable version. -- MST From xma at us.ibm.com Wed Apr 5 17:36:24 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Apr 2006 17:36:24 -0700 Subject: [openib-general] RC2 delayed a bit In-Reply-To: Message-ID: Roland Dreier wrote on 04/05/2006 05:25:54 PM: > Shirley> How to handle this OF1.0 release to be synced with > Shirley> distros' releases? Without clear milestones and > Shirley> schedules, it would be tough to target distros. > > I think it's up to OF to publish its release schedule, and then > distros can decide what to ship. It's no different than what distros > face with the kernel, gcc, glibc, X.org, gnome, KDE, etc., etc. > > - R. It would be risky for distros to target future OF releases if, for some reason, OF slips its release schedule. Thanks Shirley Ma From mst at mellanox.co.il Wed Apr 5 17:36:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 03:36:35 +0300 Subject: [openib-general] Re: Re: [PATCH] ipoib_flush_paths In-Reply-To: <44345E98.6000300@ichips.intel.com> References: <200604051559.34828.eli@mellanox.co.il> <20060405234023.GA26557@mellanox.co.il> <20060406000117.GC26557@mellanox.co.il> <44345E98.6000300@ichips.intel.com> Message-ID: <20060406003635.GG26557@mellanox.co.il> Quoting r.
Sean Hefty : > Subject: Re: Re: [PATCH] ipoib_flush_paths > > Michael S. Tsirkin wrote: > >The only easy way out that I see is some kind of sa_flush function that > >will flush MAD wqs, and ask all users to call that. > >And I think we have the same bug in addr so this needs a flush function > >too. > > My preference would be to add a registration function to ib_sa and ib_addr. > Otherwise, we weaken the encapsulation between the modules, since ib_sa > doesn't own the threads that it uses in its callbacks. I don't see the connection (since we only need to de-register, let's just call it flush and be done with it), but fine. Please propose an API then. > Registration could be global, or per port. The latter probably makes the > most sense. Probably global for ib_addr, per port for ib_sa: we don't want to force ib_addr users to deal with devices. -- MST From arlin.r.davis at intel.com Wed Apr 5 17:35:16 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 5 Apr 2006 17:35:16 -0700 Subject: [openib-general] [PATCH][RFC] uDAPL openIB provider with IB extensions based on latest DAT 2.0 draft Message-ID: Here is a patch for the 1.2 OpenIB cma provider (svn6260) that validates the extensions per the latest spec draft. A new test/dtest/dtest_ext.c can be used to test the IB immed data and atomic extensions. I did not change the Makefile.am, so use the modified dapl/udapl/makefile and the dat/udat/makefile to build. Default "make" will build the extensions and openib_cma. Patch includes 24 modified files and 3 new files. The only changes required from the 2.0 specification draft are as follows: - page 442 extendedop_func should be handle_extendedop_func - Missing dat_extended_op definition for the dat_extension prototype - the following dat_event data type was used for extended events.
typedef struct dat_event { DAT_EVENT_NUMBER event_number; DAT_EVD_HANDLE evd_handle; DAT_EVENT_DATA event_data; DAT_UINT64 extension_data[8]; } DAT_EVENT; Signed-off-by: Arlin Davis ardavis at ichips.intel.com Index: test/dtest/dtest_ext.c =================================================================== --- test/dtest/dtest_ext.c (revision 0) +++ test/dtest/dtest_ext.c (revision 0) @@ -0,0 +1,965 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "dat/udat.h" +#include "dat/dat_ib_extensions.h" + +#ifndef DAPL_PROVIDER +#define DAPL_PROVIDER "OpenIB-cma" +#endif + +/* + * Map DAT_RETURN values to readable strings, + * but don't assume the values are zero-based or contiguous. + */ +char errmsg[256] = {0}; +char majmsg[64] = {0}; +char minmsg[64] = {0}; +char extmsg[64] = {0}; +const char * +DT_RetToString (DAT_RETURN ret_value) +{ + const char *major_msg = majmsg; + const char *minor_msg = minmsg; + const char *ext_msg = extmsg; + int sz; + + /* DAT_NOT_IMPLEMENTED definition masked improperly in dat_error.h */ + if (ret_value == DAT_NOT_IMPLEMENTED) { + strcpy(errmsg, "DAT_NOT_IMPLEMENTED"); + return errmsg; + } + + dat_strerror(ret_value, &major_msg, &minor_msg); + dat_strerror_extension(ret_value, &ext_msg); + strcpy(errmsg, major_msg); + strcat(errmsg, " "); + strcat(errmsg, ext_msg); + strcat(errmsg, " "); + strcat(errmsg, minor_msg); + + return errmsg; +} + +/* + * Map DAT_EVENT_CODE values to readable strings + */ +const char * +DT_EventToSTr (DAT_EVENT_NUMBER event_code) +{ + unsigned int i; + static struct { + const char *name; + DAT_RETURN value; + } + dat_events[] = + { + # define DATxx(x) { # x, x } + DATxx (DAT_DTO_COMPLETION_EVENT), + DATxx (DAT_RMR_BIND_COMPLETION_EVENT), + DATxx (DAT_CONNECTION_REQUEST_EVENT), + DATxx (DAT_CONNECTION_EVENT_ESTABLISHED), + DATxx (DAT_CONNECTION_EVENT_PEER_REJECTED), + DATxx (DAT_CONNECTION_EVENT_NON_PEER_REJECTED), + DATxx
(DAT_CONNECTION_EVENT_ACCEPT_COMPLETION_ERROR), + DATxx (DAT_CONNECTION_EVENT_DISCONNECTED), + DATxx (DAT_CONNECTION_EVENT_BROKEN), + DATxx (DAT_CONNECTION_EVENT_TIMED_OUT), + DATxx (DAT_CONNECTION_EVENT_UNREACHABLE), + DATxx (DAT_ASYNC_ERROR_EVD_OVERFLOW), + DATxx (DAT_ASYNC_ERROR_IA_CATASTROPHIC), + DATxx (DAT_ASYNC_ERROR_EP_BROKEN), + DATxx (DAT_ASYNC_ERROR_TIMED_OUT), + DATxx (DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR), + DATxx (DAT_SOFTWARE_EVENT) + # undef DATxx + }; + # define NUM_EVENTS (sizeof(dat_events)/sizeof(dat_events[0])) + + for (i = 0; i < NUM_EVENTS; i++) { + if (dat_events[i].value == event_code) + { + return ( dat_events[i].name ); + } + } + return ( "Invalid_DAT_EVENT_NUMBER" ); +} + +/* + * Map DAT_EVENT_CODE values to readable strings + */ +const char * +DT_DtoStatusToSTr (DAT_DTO_COMPLETION_STATUS dto_status ) +{ + unsigned int i; + static struct { + const char *name; + DAT_RETURN value; + } + dat_dto[] = + { + # define DATxx(x) { # x, x } + DATxx (DAT_DTO_SUCCESS), + DATxx (DAT_DTO_ERR_FLUSHED), + DATxx (DAT_DTO_ERR_LOCAL_LENGTH), + DATxx (DAT_DTO_ERR_LOCAL_EP), + DATxx (DAT_DTO_ERR_LOCAL_PROTECTION), + DATxx (DAT_DTO_ERR_BAD_RESPONSE), + DATxx (DAT_DTO_ERR_REMOTE_ACCESS), + DATxx (DAT_DTO_ERR_REMOTE_RESPONDER), + DATxx (DAT_DTO_ERR_TRANSPORT), + DATxx (DAT_DTO_ERR_RECEIVER_NOT_READY), + DATxx (DAT_DTO_ERR_PARTIAL_PACKET), + DATxx (DAT_RMR_OPERATION_FAILED) + # undef DATxx + }; + # define NUM_DTO_ERRS (sizeof(dat_dto)/sizeof(dat_dto[0])) + + for (i = 0; i < NUM_DTO_ERRS; i++) { + if (dat_dto[i].value == dto_status) + { + return ( dat_dto[i].name ); + } + } + return ( "Invalid DAT_DTO_COMPLETION_STATUS" ); +} + +#define _OK( status, str ) \ +{ \ + if ( status != DAT_SUCCESS ) { \ + fprintf(stderr, str " returned %s\n", \ + DT_RetToString(status) ); \ + exit ( 1 ); \ + } \ +} + +#define _OK_EVENT( event, status, str ) \ +{ \ + if ( status != DAT_SUCCESS ) { \ + fprintf(stderr, str " event %s status %s\n", \ + DT_EventToSTr(event), 
DT_DtoStatusToSTr(status)); \ + exit ( 1 ); \ + } \ +} + +#define SECONDS( secs ) (1000*1000*secs) + +#define SERVER_CONN_QUAL 31111 +#define BUF_SIZE 256 +#define BUF_SIZE_ATOMIC 8 +#define REG_MEM_COUNT 10 + +#define SND_RDMA_BUF_INDEX 0 +#define RCV_RDMA_BUF_INDEX 1 +#define SEND_BUF_INDEX 2 +#define RECV_BUF_INDEX 3 + +u_int64_t *atomic_buf; +DAT_LMR_HANDLE lmr_atomic; +DAT_LMR_CONTEXT lmr_atomic_context; +DAT_RMR_CONTEXT rmr_atomic_context; +DAT_VLEN reg_atomic_size; +DAT_VADDR reg_atomic_addr; + +DAT_LMR_HANDLE lmr[ REG_MEM_COUNT ]; +DAT_LMR_CONTEXT lmr_context[ REG_MEM_COUNT ]; +DAT_RMR_TRIPLET rmr[ REG_MEM_COUNT ]; +DAT_RMR_CONTEXT rmr_context[ REG_MEM_COUNT ]; +DAT_VLEN reg_size[ REG_MEM_COUNT ]; +DAT_VADDR reg_addr[ REG_MEM_COUNT ]; +DAT_RMR_TRIPLET * buf[ REG_MEM_COUNT ]; +DAT_EP_HANDLE ep; +DAT_EVD_HANDLE async_evd = DAT_HANDLE_NULL; +DAT_IA_HANDLE ia = DAT_HANDLE_NULL; +DAT_PZ_HANDLE pz = DAT_HANDLE_NULL; +DAT_EVD_HANDLE cr_evd = DAT_HANDLE_NULL; +DAT_EVD_HANDLE con_evd = DAT_HANDLE_NULL; +DAT_EVD_HANDLE dto_evd = DAT_HANDLE_NULL; +DAT_PSP_HANDLE psp = DAT_HANDLE_NULL; +DAT_CR_HANDLE cr = DAT_HANDLE_NULL; +int server = 1; + +void +send_msg( + void *data, + DAT_COUNT size, + DAT_LMR_CONTEXT context, + DAT_DTO_COOKIE cookie, + DAT_COMPLETION_FLAGS flags ) +{ + DAT_LMR_TRIPLET iov; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_RETURN status; + + iov.lmr_context = context; + iov.pad = 0; + iov.virtual_address = (DAT_VADDR)(unsigned long)data; + iov.segment_length = (DAT_VLEN)size; + + status = dat_ep_post_send( ep, + 1, + &iov, + cookie, + flags ); + _OK( status, "dat_ep_post_send" ); + + if ( ! 
(flags & DAT_COMPLETION_SUPPRESS_FLAG) ) { + status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore ); + _OK( status, "dat_evd_wait after dat_ep_post_send" ); + + if ( event.event_number != DAT_DTO_COMPLETION_EVENT ) { + printf("unexpected event waiting for post_send completion - 0x%x\n", event.event_number); + exit ( 1 ); + } + + _OK( event.event_data.dto_completion_event_data.status, "event status for post_send" ); + } +} + +int +connect_ep( char *hostname ) +{ + DAT_SOCK_ADDR remote_addr; + DAT_EP_ATTR ep_attr; + DAT_RETURN status; + DAT_REGION_DESCRIPTION region; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_LMR_TRIPLET iov; + DAT_RMR_TRIPLET r_iov; + DAT_DTO_COOKIE cookie; + DAT_PROVIDER_ATTR provider_attr; + DAT_NAMED_ATTR named_attrs; + int i,ext_cnt; + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = &event.event_data.dto_completion_event_data; + + status = dat_ia_open( DAPL_PROVIDER, 8, &async_evd, &ia ); + _OK( status, "dat_ia_open" ); + + /* query for immediate data and atomic operation extensions */ + status = dat_ia_query( ia, NULL, DAT_IA_FIELD_NONE, NULL, + DAT_PROVIDER_FIELD_PROVIDER_SPECIFIC_ATTR, + &provider_attr ); + _OK( status, "dat_ia_query" ); + + /* look for extension support, ALL or nothing */ + ext_cnt=0; + printf(" Extension Attributes:\n"); + for (i=0;iai_addr)->sin_addr.s_addr; + printf ("Server Name: %s \n", hostname); + printf ("Server Net Address: %d.%d.%d.%d\n", + (rval >> 0) & 0xff, + (rval >> 8) & 0xff, + (rval >> 16) & 0xff, + (rval >> 24) & 0xff); + + remote_addr = *((DAT_IA_ADDRESS_PTR)target->ai_addr); + + strcpy( (char*)buf[ SND_RDMA_BUF_INDEX ], "client written data" ); + + status = dat_ep_connect( ep, + &remote_addr, + SERVER_CONN_QUAL, + SECONDS( 20 ), + 0, + (DAT_PVOID)0, + 0, + DAT_CONNECT_DEFAULT_FLAG ); + _OK( status, "dat_psp_create" ); + + + } + + printf("Client waiting for connect response\n"); + status = dat_evd_wait( con_evd, SECONDS( 5 ), 1, &event, &nmore ); + _OK( status, "connect dat_evd_wait" ); + + if ( 
event.event_number != DAT_CONNECTION_EVENT_ESTABLISHED ) { + printf("unexpected event after dat_ep_connect: 0x%x\n", event.event_number); + exit ( 1 ); + } + + printf("Connected!\n"); + + /* + * Setup our remote memory and tell the other side about it + */ + printf("Sending RMR data to remote\n"); + r_iov.rmr_context = rmr_context[ RCV_RDMA_BUF_INDEX ]; + r_iov.pad = 0; + r_iov.target_address = (DAT_VADDR)((unsigned long)buf[ RCV_RDMA_BUF_INDEX ]); + r_iov.segment_length = BUF_SIZE; + + *buf[ SEND_BUF_INDEX ] = r_iov; + + send_msg(buf[ SEND_BUF_INDEX ], + sizeof( DAT_RMR_TRIPLET ), + lmr_context[ SEND_BUF_INDEX ], + cookie, + DAT_COMPLETION_SUPPRESS_FLAG ); + + /* + * Wait for their RMR + */ + printf("Waiting for remote to send RMR data\n"); + status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore ); + _OK( status, "dat_evd_wait after dat_ep_post_send" ); + + if ( event.event_number != DAT_DTO_COMPLETION_EVENT ) { + printf("unexpected event waiting for RMR context - 0x%x\n", + event.event_number); + exit ( 1 ); + } + + _OK_EVENT( event.event_number, dto_event->status, " post_send" ); + if ( (dto_event->transfered_length != sizeof( DAT_RMR_TRIPLET )) || + (dto_event->user_cookie.as_64 != RECV_BUF_INDEX) ) { + printf("unexpected event data for receive: len=%d cookie=%d expected %d/%d\n", + (int)dto_event->transfered_length, + (int)dto_event->user_cookie.as_64, + sizeof(DAT_RMR_TRIPLET), RECV_BUF_INDEX); + exit ( 1 ); + } + + r_iov = *buf[ RECV_BUF_INDEX ]; + + printf("Received RMR from remote: r_iov: ctx=%x,pad=%x,va=%p,len=%d\n", + r_iov.rmr_context, + r_iov.pad, + (void*)(unsigned long)r_iov.target_address, + r_iov.segment_length ); + + return ( 0 ); +} + +int +disconnect_ep( ) +{ + DAT_RETURN status; + int i; + + status = dat_ep_disconnect( ep, DAT_CLOSE_DEFAULT ); + _OK( status, "dat_ep_disconnect" ); + + printf("EP disconnected\n"); + + if ( server ) { + status = dat_psp_free( psp ); + _OK( status, "dat_ep_disconnect" ); + } + + for ( i = 0; i < 
REG_MEM_COUNT; i++ ) { + status = dat_lmr_free( lmr[ i ] ); + _OK( status, "dat_lmr_free" ); + } + + status = dat_lmr_free( lmr_atomic ); + _OK( status, "dat_lmr_free_atomic" ); + + status = dat_ep_free( ep ); + _OK( status, "dat_ep_free" ); + + status = dat_evd_free( dto_evd ); + _OK( status, "dat_evd_free DTO" ); + status = dat_evd_free( con_evd ); + _OK( status, "dat_evd_free CON" ); + status = dat_evd_free( cr_evd ); + _OK( status, "dat_evd_free CR" ); + + status = dat_pz_free( pz ); + _OK( status, "dat_pz_free" ); + + status = dat_ia_close( ia, DAT_CLOSE_DEFAULT ); + _OK( status, "dat_ia_close" ); + + return ( 0 ); +} + +int +do_immediate( ) +{ + DAT_REGION_DESCRIPTION region; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_LMR_TRIPLET iov; + DAT_RMR_TRIPLET r_iov; + DAT_DTO_COOKIE cookie; + DAT_RMR_CONTEXT their_context; + DAT_RETURN status; + DAT_UINT32 immed_data; + DAT_UINT32 immed_data_recv; + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = + &event.event_data.dto_completion_event_data; + + printf("\nDoing RDMA WRITE IMMEDIATE DATA\n"); + + if ( server ) { + immed_data = 0x1111; + } else { + immed_data = 0x7777; + } + + cookie.as_64 = 0x5555; + + r_iov = *buf[ RECV_BUF_INDEX ]; + + iov.lmr_context = lmr_context[ SND_RDMA_BUF_INDEX ]; + iov.pad = 0; + iov.virtual_address = (DAT_VADDR)(unsigned long)buf[ SND_RDMA_BUF_INDEX ]; + iov.segment_length = BUF_SIZE; + + cookie.as_64 = 0x9999; + + status = dat_ib_post_rdma_write_immed( ep, // ep_handle + 1, // num_segments + &iov, // LMR + cookie, // user_cookie + &r_iov, // RMR + immed_data, + DAT_COMPLETION_DEFAULT_FLAG ); + + _OK( status, "dat_ep_post_rdma_write_immed" ); + printf("dat_ep_post_rdma_write_immed posted\n"); + + /* + * Collect first event, write completion or the inbound rdma with immed + */ + status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore ); + _OK( status, "dat_evd_wait after dat_ep_post_rdma_write" ); + + if ( event.event_number != DAT_IB_DTO_EVENT ) + { + printf("unexpected event waiting 
for immediate data - 0x%x\n", + event.event_number ); + exit ( 1 ); + } + + _OK_EVENT( event.event_number, dto_event->status, " rdma_write_immed" ); + if (dto_event->operation == DAT_IB_DTO_RDMA_WRITE_IMMED) + { + if ((dto_event->transfered_length != BUF_SIZE) || + (dto_event->user_cookie.as_64 != 0x9999) ) + { + printf("unexpected event data for rdma_write_immed: len=%d cookie=0x%x\n", + (int)dto_event->transfered_length, + (int)dto_event->user_cookie.as_64); + exit ( 1 ); + } + } + else if (dto_event->operation == DAT_IB_DTO_RECV_IMMED) + { + if ((dto_event->transfered_length != BUF_SIZE) || + (dto_event->user_cookie.as_64 != RECV_BUF_INDEX+1)) { + printf("unexpected event data of immediate write:" + "len=%d cookie=%d expected %d/%d\n", + (int)dto_event->transfered_length, + (int)dto_event->user_cookie.as_64, + sizeof(int), RECV_BUF_INDEX+1); + exit ( 1 ); + } + + /* get immediate data from DTO event */ + immed_data_recv = ((DAT_EXTENSION_EVENT_DATA*)event.extension_data)->val.immed_data.data; + printf(" Received idata = 0x%x\n",immed_data_recv); + } + else + { + printf("unexpected op type - evt=0x%x, op=0x%x type=0x%x\n", + event.event_number, dto_event->operation, + ((DAT_EXTENSION_EVENT_DATA*)event.extension_data)->type); + exit ( 1 ); + } + + /* + * Collect second event, write completion or the inbound rdma with immed + */ + status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore ); + _OK( status, "dat_evd_wait after dat_ep_post_rdma_write" ); + + if ( event.event_number != DAT_IB_DTO_EVENT ) + { + printf("unexpected event waiting for immediate data - 0x%x\n", + event.event_number ); + exit ( 1 ); + } + + _OK_EVENT( event.event_number, dto_event->status, " rdma_write_immed" ); + if (dto_event->operation == DAT_IB_DTO_RDMA_WRITE_IMMED) + { + if ((dto_event->transfered_length != BUF_SIZE) || + (dto_event->user_cookie.as_64 != 0x9999) ) + { + printf("unexpected event data for rdma_write_immed: len=%d cookie=0x%x\n", + (int)dto_event->transfered_length, + 
(int)dto_event->user_cookie.as_64); + exit ( 1 ); + } + } + else if (dto_event->operation == DAT_IB_DTO_RECV_IMMED) + { + if ((dto_event->transfered_length != BUF_SIZE) || + (dto_event->user_cookie.as_64 != RECV_BUF_INDEX+1)) { + printf("unexpected event data of immediate write:" + "len=%d cookie=%d expected %d/%d\n", + (int)dto_event->transfered_length, + (int)dto_event->user_cookie.as_64, + sizeof(int), RECV_BUF_INDEX+1); + exit ( 1 ); + } + + /* get immediate data from DTO event */ + immed_data_recv = ((DAT_EXTENSION_EVENT_DATA*)event.extension_data)->val.immed_data.data; + printf(" received idata = 0x%x\n",immed_data_recv); + } + else + { + printf("unexpected op type - evt=0x%x, op=0x%x type=0x%x\n", + event.event_number, dto_event->operation, + ((DAT_EXTENSION_EVENT_DATA*)event.extension_data)->type); + exit ( 1 ); + } + + if ((server) && (immed_data_recv != 0x7777)) + { + printf("Server got unexpected immed_data_recv 0x%x/0x%x\n", + 0x7777, immed_data_recv ); + exit ( 1 ); + } + else if ((!server) && (immed_data_recv != 0x1111)) + { + printf("Client got unexpected immed_data_recv 0x%x/0x%x\n", + 0x1111, immed_data_recv ); + exit ( 1 ); + } + + if (server) + printf("Server received immed_data=0x%x, expected 0x7777\n", immed_data_recv ); + else + printf("Client received immed_data=0x%x, expected 0x1111\n", immed_data_recv ); + + printf("RCV buffer %p contains: %s\n", + buf[ RCV_RDMA_BUF_INDEX ], buf[ RCV_RDMA_BUF_INDEX ]); + + return ( 0 ); +} + +int +do_cmp_swap() +{ + DAT_DTO_COOKIE cookie; + DAT_RETURN status; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_LMR_TRIPLET l_iov; + DAT_RMR_TRIPLET r_iov; + volatile DAT_UINT64 *target = (DAT_UINT64*)buf[ RCV_RDMA_BUF_INDEX ]; + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = + &event.event_data.dto_completion_event_data; + + printf("\nDoing CMP and SWAP\n"); + + r_iov = *buf[ RECV_BUF_INDEX ]; + + l_iov.lmr_context = lmr_atomic_context; + l_iov.pad = 0; + l_iov.virtual_address = (DAT_VADDR)(unsigned long)atomic_buf; + 
l_iov.segment_length = BUF_SIZE_ATOMIC; + + cookie.as_64 = 3333; + if ( server ) { + *target = 0x12345; + sleep(1); + /* server does not compare and should not swap */ + status = dat_ib_post_cmp_and_swap( ep, + (DAT_UINT64)0x654321, + (DAT_UINT64)0x6789A, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } else { + *target = 0x54321; + sleep(1); + /* client does compare and should swap */ + status = dat_ib_post_cmp_and_swap( ep, + (DAT_UINT64)0x12345, + (DAT_UINT64)0x98765, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } + _OK( status, "dat_ep_post_cmp_and_swap" ); + printf("dat_ep_post_cmp_and_swap posted\n"); + + status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore ); + _OK( status, "dat_evd_wait for compare and swap" ); + if ( event.event_number != DAT_IB_DTO_EVENT ) { + printf("unexpected event after post_cmp_and_swap: 0x%x\n", + event.event_number); + exit ( 1 ); + } + + _OK_EVENT( event.event_number, dto_event->status, " cmp_swap" ); + if ( dto_event->operation != DAT_IB_DTO_CMP_AND_SWAP ) { + printf("unexpected event data of cmp and swap : type=%d cookie=%d original 0x%llx\n", + (int)((DAT_EXTENSION_EVENT_DATA*)event.extension_data)->type, + (int)dto_event->user_cookie.as_64, + *atomic_buf); + exit ( 1 ); + } + sleep(1); + if ( server ) { + printf("Server got original data = 0x%llx, expected 0x54321\n", *atomic_buf); + printf("Client final result (on server) = 0x%llx, expected 0x98765\n", *target); + } else { + printf("Client got original data = 0x%llx, expected 0x12345\n",*atomic_buf); + printf("Server final result (on client) = 0x%llx, expected 0x54321\n", *target); + } + return(0); +} + +int +do_fetch_add() +{ + DAT_DTO_COOKIE cookie; + DAT_RETURN status; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_LMR_TRIPLET l_iov; + DAT_RMR_TRIPLET r_iov; + volatile DAT_UINT64 *target = (DAT_UINT64*)buf[ RCV_RDMA_BUF_INDEX ]; + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = + &event.event_data.dto_completion_event_data; + + 
printf("\nDoing FETCH and ADD\n"); + + r_iov = *buf[ RECV_BUF_INDEX ]; + + l_iov.lmr_context = lmr_atomic_context; + l_iov.pad = 0; + l_iov.virtual_address = (DAT_VADDR)(unsigned long)atomic_buf; + l_iov.segment_length = BUF_SIZE_ATOMIC; + + cookie.as_64 = 0x7777; + if ( server ) { + *target = 0x10; + sleep( 1 ); + status = dat_ib_post_fetch_and_add( ep, + (DAT_UINT64)0x100, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } else { + *target = 0x100; + sleep( 1 ); + status = dat_ib_post_fetch_and_add( ep, + (DAT_UINT64)0x10, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } + _OK( status, "dat_ep_post_fetch_and_add" ); + printf("dat_ep_post_fetch_and_add posted\n"); + status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore ); + _OK( status, "dat_evd_wait for fetch and add" ); + if ( event.event_number != DAT_IB_DTO_EVENT ) { + printf("unexpected event after post_fetch_and_add: 0x%x\n", event.event_number); + exit ( 1 ); + } + + _OK_EVENT( event.event_number, dto_event->status, " fetch_add" ); + if ( dto_event->operation != DAT_IB_DTO_FETCH_AND_ADD ) { + printf("unexpected event data of fetch and add : type=%d cookie=%d original%d\n", + (int)((DAT_EXTENSION_EVENT_DATA*)event.extension_data)->type, + (int)dto_event->user_cookie.as_64, + (int)*atomic_buf ); + exit ( 1 ); + } + + if ( server ) { + printf("Client original data (on server) = 0x%llx, expected 0x100\n", *atomic_buf ); + } else { + printf("Server original data (on client) = 0x%llx, expected 0x10\n", *atomic_buf ); + } + + sleep( 1 ); + + if ( server ) { + status = dat_ib_post_fetch_and_add( ep, + (DAT_UINT64)0x100, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } else { + status = dat_ib_post_fetch_and_add( ep, + (DAT_UINT64)0x10, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } + + status = dat_evd_wait( dto_evd, SECONDS( 3 ), 1, &event, &nmore ); + _OK( status, "dat_evd_wait for second fetch and add" ); + if ( event.event_number != 
DAT_IB_DTO_EVENT ) { + printf("unexpected event after second post_fetch_and_add: 0x%x\n", event.event_number); + exit ( 1 ); + } + + _OK_EVENT( event.event_number, dto_event->status, " fetch_add" ); + if ( dto_event->operation != DAT_IB_DTO_FETCH_AND_ADD ) { + printf("unexpected event data of second fetch and add : type=%d cookie=%d original%d\n", + (int)((DAT_EXTENSION_EVENT_DATA*)event.extension_data)->type, + (int)dto_event->user_cookie.as_64, + (long)atomic_buf); + exit ( 1 ); + } + + sleep( 1 ); + if ( server ) { + printf("Server got original data = 0x%llx, expected 0x200\n", *atomic_buf); + printf("Client final result (on server) = 0x%llx, expected 0x30\n", *target); + } else { + printf("Server side original data = 0x%llx, expected 0x20\n", *atomic_buf); + printf("Server final result (on client) = 0x%llx, expected 0x300\n", *target); + } + + return ( 0 ); +} + +void print_usage() +{ + printf("\n dtest_ext usage \n\n"); + printf("s: server\n"); + printf("h: hostname\n"); + printf("\n"); +} + +int +main(int argc, char **argv) +{ + int i,c; + char hostname[100]; + + /* parse arguments */ + while ((c = getopt(argc, argv, "sh:")) != -1) + { + switch(c) + { + case 's': + server = 1; + break; + case 'h': + server = 0; + strcpy(hostname, optarg); + break; + default: + print_usage(); + exit(1); + } + } + + if (server) + printf("Server: using provider %s\n",DAPL_PROVIDER); + else + printf("Client: using provider %s, connect to %s \n", + DAPL_PROVIDER,hostname); + + /* connect and send rdma buffer information */ + if (connect_ep(hostname)) + exit(1); + + if (do_immediate()) + goto bail; + + if (do_fetch_add()) + goto bail; + + if (do_cmp_swap()) + goto bail; + +bail: + return (disconnect_ep()); +} Index: test/dtest/makefile =================================================================== --- test/dtest/makefile (revision 6260) +++ test/dtest/makefile (working copy) @@ -2,15 +2,19 @@ CC = gcc CFLAGS = -O2 -g DAT_INC = ../../dat/include -DAT_LIB = /usr/local/lib 
+DAT_LIB = ../../dat/udat/Target/$(ARCH) -all: dtest +all: dtest dtest_ext clean: - rm -f *.o;touch *.c;rm -f dtest + rm -f *.o;touch *.c;rm -f dtest dtest_ext dtest: ./dtest.c $(CC) $(CFLAGS) ./dtest.c -o dtest \ - -DDAPL_PROVIDER='"OpenIB-cma-ip"' \ + -DDAPL_PROVIDER='"OpenIB-cma"' \ -I $(DAT_INC) -L $(DAT_LIB) -ldat +dtest_ext: ./dtest_ext.c + $(CC) $(CFLAGS) ./dtest_ext.c -o dtest_ext \ + -DDAPL_PROVIDER='"OpenIB-cma"' -DDAT_EXTENSIONS \ + -I $(DAT_INC) -L $(DAT_LIB) -ldat Index: dapl/include/dapl.h =================================================================== --- dapl/include/dapl.h (revision 6260) +++ dapl/include/dapl.h (working copy) @@ -563,6 +563,10 @@ typedef enum dapl_dto_type DAPL_DTO_TYPE_RECV, DAPL_DTO_TYPE_RDMA_WRITE, DAPL_DTO_TYPE_RDMA_READ, +#ifdef DAT_EXTENSIONS + DAPL_DTO_TYPE_EXTENSION +#endif + } DAPL_DTO_TYPE; typedef enum dapl_cookie_type @@ -570,6 +574,7 @@ typedef enum dapl_cookie_type DAPL_COOKIE_TYPE_NULL, DAPL_COOKIE_TYPE_DTO, DAPL_COOKIE_TYPE_RMR, + } DAPL_COOKIE_TYPE; /* DAPL_DTO_COOKIE used as context for DTO WQEs */ @@ -1116,6 +1121,14 @@ extern DAT_RETURN dapl_srq_set_lw( IN DAT_SRQ_HANDLE, /* srq_handle */ IN DAT_COUNT); /* low_watermark */ +#ifdef DAT_EXTENSIONS +extern DAT_RETURN dapl_extensions( + IN DAT_HANDLE, /* dat_handle */ + IN DAT_EXTENDED_OP, /* extension operation */ + IN va_list ); /* va_list args */ + +#endif + /* * DAPL internal utility function prototpyes */ Index: dapl/include/dapl_debug.h =================================================================== --- dapl/include/dapl_debug.h (revision 6260) +++ dapl/include/dapl_debug.h (working copy) @@ -112,7 +112,13 @@ extern void dapl_internal_dbg_log ( DAPL #define DCNT_EVD_DEQUEUE_NOT_FOUND 18 #define DCNT_TIMER_SET 19 #define DCNT_TIMER_CANCEL 20 +#ifdef DAT_EXTENSIONS +#define DCNT_EXTENSION 21 +#define DCNT_NUM_COUNTERS 22 +#else #define DCNT_NUM_COUNTERS 21 +#endif + #define DCNT_ALL_COUNTERS DCNT_NUM_COUNTERS #if defined(DAPL_COUNTERS) Index: 
dapl/udapl/Makefile =================================================================== --- dapl/udapl/Makefile (revision 6260) +++ dapl/udapl/Makefile (working copy) @@ -80,6 +80,12 @@ ifdef OS_VENDOR CFLAGS += -D$(OS_VENDOR) endif +# If an implementation supports immediate data and extensions +CFLAGS += -DDAT_EXTENSIONS + +# If an implementation supports DAPL provider specific attributes +CFLAGS += -DDAPL_PROVIDER_SPECIFIC_ATTR + # # dummy provider # @@ -283,6 +289,8 @@ LDFLAGS += -libverbs -lrdmacm LDFLAGS += -rpath /usr/local/lib -L /usr/local/lib PROVIDER_SRCS = dapl_ib_util.c dapl_ib_cq.c dapl_ib_qp.c \ dapl_ib_cm.c dapl_ib_mem.c +# implementation supports extensions +PROVIDER_SRCS += dapl_ib_extensions.c endif UDAPL_SRCS = dapl_init.c \ Index: dapl/common/dapl_ia_query.c =================================================================== --- dapl/common/dapl_ia_query.c (revision 6260) +++ dapl/common/dapl_ia_query.c (working copy) @@ -151,6 +151,7 @@ dapl_ia_query ( provider_attr->dat_qos_supported = DAT_QOS_BEST_EFFORT; provider_attr->completion_flags_supported = DAT_COMPLETION_DEFAULT_FLAG; provider_attr->is_thread_safe = DAT_FALSE; + /* * N.B. The second part of the following equation will evaluate * to 0 unless IBHOSTS_NAMING is enabled. @@ -167,6 +168,14 @@ dapl_ia_query ( #if !defined(__KDAPL__) provider_attr->pz_support = DAT_PZ_UNIQUE; #endif /* !KDAPL */ + + /* + * Query for provider specific attributes + */ +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR + dapls_query_provider_specific_attr(provider_attr); +#endif + /* * Set up evd_stream_merging_supported options. 
Note there is * one bit per allowable combination, using the ordinal Index: dapl/common/dapl_adapter_util.h =================================================================== --- dapl/common/dapl_adapter_util.h (revision 6260) +++ dapl/common/dapl_adapter_util.h (working copy) @@ -256,6 +256,21 @@ dapls_ib_wait_object_wait ( IN u_int32_t timeout); #endif +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR +void +dapls_query_provider_specific_attr( + IN DAT_PROVIDER_ATTR *provider_attr ); +#endif + +#ifdef DAT_EXTENSIONS +void +dapls_cqe_to_event_extension( + IN DAPL_EP *ep_ptr, + IN DAPL_COOKIE *cookie, + IN ib_work_completion_t *cqe_ptr, + OUT DAT_EVENT *event_ptr); +#endif + /* * Values for provider DAT_NAMED_ATTR */ Index: dapl/common/dapl_provider.c =================================================================== --- dapl/common/dapl_provider.c (revision 6260) +++ dapl/common/dapl_provider.c (working copy) @@ -221,7 +221,12 @@ DAT_PROVIDER g_dapl_provider_template = &dapl_srq_post_recv, &dapl_srq_query, &dapl_srq_resize, - &dapl_srq_set_lw + &dapl_srq_set_lw, + +#ifdef DAT_EXTENSIONS + /* dat-2.0 */ + &dapl_extensions +#endif }; #endif /* __KDAPL__ */ Index: dapl/common/dapl_init.h =================================================================== --- dapl/common/dapl_init.h (revision 6260) +++ dapl/common/dapl_init.h (working copy) @@ -39,6 +39,9 @@ #ifndef _DAPL_INIT_H_ #define _DAPL_INIT_H_ +extern void dapl_init ( void ); +extern void dapl_fini ( void ); + extern void DAT_PROVIDER_INIT_FUNC_NAME ( IN const DAT_PROVIDER_INFO *, Index: dapl/common/dapl_evd_util.c =================================================================== --- dapl/common/dapl_evd_util.c (revision 6260) +++ dapl/common/dapl_evd_util.c (working copy) @@ -502,6 +502,21 @@ dapli_evd_eh_print_cqe ( #ifdef DAPL_DBG static char *optable[] = { +#ifdef OPENIB + /* different order for openib verbs */ + "OP_RDMA_WRITE", + "OP_RDMA_WRITE_IMMED", + "OP_SEND", + "OP_SEND_IMMED", + "OP_RDMA_READ", + 
"OP_COMP_AND_SWAP", + "OP_FETCH_AND_ADD", + "OP_RECEIVE", + "OP_RECEIVE_IMMED", + "OP_RECEIVE_RDMA_IMMED", + "OP_BIND_MW", + "OP_INVALID", +#else "OP_SEND", "OP_RDMA_READ", "OP_RDMA_WRITE", @@ -509,6 +524,7 @@ dapli_evd_eh_print_cqe ( "OP_FETCH_AND_ADD", "OP_RECEIVE", "OP_BIND_MW", +#endif 0 }; @@ -1041,30 +1057,24 @@ dapli_evd_cqe_to_event ( buffer = &ep_ptr->req_buffer; } +#ifdef DAT_EXTENSIONS + if ( DAPL_GET_CQE_DTOS_OPTYPE(cqe_ptr) >= DAT_DTO_EXTENSION_BASE ) + { + dapls_cqe_to_event_extension(ep_ptr, cookie, cqe_ptr, event_ptr); + dapls_cookie_dealloc (buffer, cookie); + break; + } +#endif event_ptr->event_number = DAT_DTO_COMPLETION_EVENT; event_ptr->event_data.dto_completion_event_data.ep_handle = cookie->ep; event_ptr->event_data.dto_completion_event_data.user_cookie = cookie->val.dto.cookie; event_ptr->event_data.dto_completion_event_data.status = dto_status; - -#ifdef DAPL_DBG - if (dto_status == DAT_DTO_SUCCESS) - { - uint32_t ibtype; - - ibtype = DAPL_GET_CQE_OPTYPE (cqe_ptr); - - dapl_os_assert ((ibtype == OP_SEND && - cookie->val.dto.type == DAPL_DTO_TYPE_SEND) - || (ibtype == OP_RECEIVE && - cookie->val.dto.type == DAPL_DTO_TYPE_RECV) - || (ibtype == OP_RDMA_WRITE && - cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_WRITE) - || (ibtype == OP_RDMA_READ && - cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_READ)); - } -#endif /* DAPL_DBG */ + + /* new operation field for DAT 2.0 */ + event_ptr->event_data.dto_completion_event_data.operation = + DAPL_GET_CQE_DTOS_OPTYPE(cqe_ptr); if ( cookie->val.dto.type == DAPL_DTO_TYPE_SEND || cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_WRITE ) @@ -1113,6 +1123,7 @@ dapli_evd_cqe_to_event ( dapls_cookie_dealloc (&ep_ptr->req_buffer, cookie); break; } + default: { dapl_os_assert (!"Invalid Operation type"); Index: dapl/common/dapl_debug.c =================================================================== --- dapl/common/dapl_debug.c (revision 6260) +++ dapl/common/dapl_debug.c (working copy) @@ -86,6 +86,9 @@ char 
*dapl_dbg_counter_names[] = { "dapl_evd_not_found", "dapls_timer_set", "dapls_timer_cancel", +#ifdef DAT_EXTENSIONS + "dapls_extension", +#endif }; void dapl_dump_cntr( int cntr ) Index: dapl/openib_cma/dapl_ib_dto.h =================================================================== --- dapl/openib_cma/dapl_ib_dto.h (revision 6260) +++ dapl/openib_cma/dapl_ib_dto.h (working copy) @@ -233,6 +233,174 @@ dapls_ib_optional_prv_dat( return DAT_SUCCESS; } +#ifdef DAT_EXTENSIONS + +/* Map WCs to DAT DTOS opcodes, extensions start at DAT_DTO_EXTENSION_BASE */ +STATIC _INLINE_ DAT_DTOS dapls_cqe_dtos_opcode(ib_work_completion_t *cqe_p) +{ + switch (cqe_p->opcode) { + + case IBV_WC_SEND: + return (DAT_SEND); + case IBV_WC_RDMA_READ: + return (DAT_RDMA_READ); + case IBV_WC_BIND_MW: + return (DAT_BIND_MW); + case IBV_WC_RDMA_WRITE: + if (cqe_p->wc_flags & IBV_WC_WITH_IMM) + return (DAT_IB_DTO_RDMA_WRITE_IMMED); + else + return (DAT_RDMA_WRITE); + case IBV_WC_COMP_SWAP: + return (DAT_IB_DTO_CMP_AND_SWAP); + case IBV_WC_FETCH_ADD: + return (DAT_IB_DTO_FETCH_AND_ADD); + case IBV_WC_RECV_RDMA_WITH_IMM: + return (DAT_IB_DTO_RECV_IMMED); + case IBV_WC_RECV: + return (DAT_RECEIVE); + default: + return (0xff); + } +} + +#define DAPL_GET_CQE_DTOS_OPTYPE(cqe_p) dapls_cqe_dtos_opcode(cqe_p) + +/* + * dapls_ib_post_ext_send + * + * Provider specific extended Post SEND function for atomics + * OP_COMP_AND_SWAP and OP_FETCH_AND_ADD + */ +STATIC _INLINE_ DAT_RETURN +dapls_ib_post_ext_send ( + IN DAPL_EP *ep_ptr, + IN ib_send_op_type_t op_type, + IN DAPL_COOKIE *cookie, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_UINT32 immed_data, + IN DAT_UINT64 compare_add, + IN DAT_UINT64 swap, + IN DAT_COMPLETION_FLAGS completion_flags) +{ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: ep %p op %d ck %p sgs " + "%d l_iov %p r_iov %p f %d\n", + ep_ptr, op_type, cookie, segments, local_iov, + remote_iov, completion_flags); + + 
ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; + ib_data_segment_t *ds_array_p; + struct ibv_send_wr wr; + struct ibv_send_wr *bad_wr; + DAT_COUNT i, total_len; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: ep %p cookie %p segs %d l_iov %p\n", + ep_ptr, cookie, segments, local_iov); + + if(segments <= DEFAULT_DS_ENTRIES) + ds_array_p = ds_array; + else + ds_array_p = + dapl_os_alloc(segments * sizeof(ib_data_segment_t)); + + if (NULL == ds_array_p) + return (DAT_INSUFFICIENT_RESOURCES); + + /* setup the work request */ + wr.next = 0; + wr.opcode = op_type; + wr.num_sge = 0; + wr.send_flags = 0; + wr.wr_id = (uint64_t)(uintptr_t)cookie; + wr.sg_list = ds_array_p; + total_len = 0; + + for (i = 0; i < segments; i++ ) { + if ( !local_iov[i].segment_length ) + continue; + + ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; + ds_array_p->length = local_iov[i].segment_length; + ds_array_p->lkey = local_iov[i].lmr_context; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: lkey 0x%x va %p len %d\n", + ds_array_p->lkey, ds_array_p->addr, + ds_array_p->length ); + + total_len += ds_array_p->length; + wr.num_sge++; + ds_array_p++; + } + + if (cookie != NULL) + cookie->val.dto.size = total_len; + + switch (op_type) { + case OP_RDMA_WRITE_IMM: + /* OP_RDMA_WRITE_IMM has direct IB wr_type mapping */ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_ext: rkey 0x%x va %#016Lx immed=0x%x\n", + remote_iov->rmr_context, + remote_iov->target_address, immed_data); + + wr.imm_data = immed_data; + wr.wr.rdma.remote_addr = remote_iov->target_address; + wr.wr.rdma.rkey = remote_iov->rmr_context; + break; + case OP_COMP_AND_SWAP: + /* OP_COMP_AND_SWAP has direct IB wr_type mapping */ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_ext: OP_COMP_AND_SWAP=%lx," + "%lx rkey 0x%x va %#016Lx\n", + compare_add, swap, remote_iov->rmr_context, + remote_iov->target_address); + + wr.wr.atomic.compare_add = compare_add; + wr.wr.atomic.swap = swap; + wr.wr.atomic.remote_addr = 
remote_iov->target_address; + wr.wr.atomic.rkey = remote_iov->rmr_context; + break; + case OP_FETCH_AND_ADD: + /* OP_FETCH_AND_ADD has direct IB wr_type mapping */ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_ext: OP_FETCH_AND_ADD=%lx," + " rkey 0x%x va %#016Lx\n", + compare_add, remote_iov->rmr_context, + remote_iov->target_address); + + wr.wr.atomic.compare_add = compare_add; + wr.wr.atomic.remote_addr = remote_iov->target_address; + wr.wr.atomic.rkey = remote_iov->rmr_context; + break; + default: + break; + } + + /* set completion flags in work request */ + wr.send_flags |= (DAT_COMPLETION_SUPPRESS_FLAG & + completion_flags) ? 0 : IBV_SEND_SIGNALED; + wr.send_flags |= (DAT_COMPLETION_BARRIER_FENCE_FLAG & + completion_flags) ? IBV_SEND_FENCE : 0; + wr.send_flags |= (DAT_COMPLETION_SOLICITED_WAIT_FLAG & + completion_flags) ? IBV_SEND_SOLICITED : 0; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: op 0x%x flags 0x%x sglist %p, %d\n", + wr.opcode, wr.send_flags, wr.sg_list, wr.num_sge); + + if (ibv_post_send(ep_ptr->qp_handle->cm_id->qp, &wr, &bad_wr)) + return( dapl_convert_errno(EFAULT,"ibv_post_send") ); + + dapl_dbg_log(DAPL_DBG_TYPE_EP," post_snd: returned\n"); + return DAT_SUCCESS; +} +#endif /* DAT_EXTENSIONS */ + STATIC _INLINE_ int dapls_cqe_opcode(ib_work_completion_t *cqe_p) { switch (cqe_p->opcode) { Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 6260) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -843,3 +843,55 @@ void dapli_thread(void *arg) dapl_os_unlock(&g_hca_lock); } + +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR + +/* + * dapls_query_provider_specific_attr + * + * Input: + * attr_ptr Pointer to provider attributes + * + * Output: + * none + * + * Returns: + * void + */ +DAT_NAMED_ATTR ib_attrs[] = { + +#ifdef DAT_EXTENSIONS + { + DAT_EXTENSION_ATTR, " " + }, + { + DAT_EXTENSION_ATTR_VERSION, DAT_EXTENSION_ATTR_VERSION_VALUE + }, + { + 
DAT_IB_ATTR_FETCH_AND_ADD, " " + }, + { + DAT_IB_ATTR_CMP_AND_SWAP, " " + }, + { + DAT_IB_ATTR_IMMED_DATA, " " + }, +#endif /* DAT_EXTENSIONS */ + +}; + +#define SPEC_ATTR_SIZE( x ) (sizeof( x ) / sizeof( DAT_NAMED_ATTR)) + +/* + * Query for all provider specific attributes and + */ +void dapls_query_provider_specific_attr( + IN DAT_PROVIDER_ATTR *attr_ptr ) +{ + attr_ptr->num_provider_specific_attr = SPEC_ATTR_SIZE(ib_attrs); + attr_ptr->provider_specific_attr = ib_attrs; +} + +#endif + + Index: dapl/openib_cma/dapl_ib_mem.c =================================================================== --- dapl/openib_cma/dapl_ib_mem.c (revision 6260) +++ dapl/openib_cma/dapl_ib_mem.c (working copy) @@ -74,10 +74,10 @@ dapls_convert_privileges(IN DAT_MEM_PRIV access |= IBV_ACCESS_REMOTE_WRITE; if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) access |= IBV_ACCESS_REMOTE_READ; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) - access |= IBV_ACCESS_REMOTE_READ; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) - access |= IBV_ACCESS_REMOTE_READ; +#ifdef DAT_EXTENSIONS + if (DAT_IB_MEM_PRIV_REMOTE_ATOMIC & privileges) + access |= IBV_ACCESS_REMOTE_ATOMIC; +#endif return access; } Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 6260) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -87,7 +87,9 @@ static inline uint64_t cpu_to_be64(uint6 static void dapli_addr_resolve(struct dapl_cm_id *conn) { int ret; - +#ifdef DAPL_DBG + struct rdma_addr *ipaddr = &conn->cm_id->route.addr; +#endif dapl_dbg_log(DAPL_DBG_TYPE_CM, " addr_resolve: cm_id %p SRC %x DST %x\n", conn->cm_id, @@ -111,6 +113,10 @@ static void dapli_route_resolve(struct d { int ret; struct rdma_cm_id *cm_id = conn->cm_id; +#ifdef DAPL_DBG + struct rdma_addr *ipaddr = &cm_id->route.addr; + struct ib_addr *ibaddr = &cm_id->route.addr.addr.ibaddr; +#endif dapl_dbg_log(DAPL_DBG_TYPE_CM, " route_resolve: cm_id %p SRC %x DST %x 
PORT %d\n", @@ -189,7 +195,10 @@ static struct dapl_cm_id * dapli_req_rec struct rdma_cm_event *event) { struct dapl_cm_id *new_conn; - +#ifdef DAPL_DBG + struct rdma_addr *ipaddr = &event->id->route.addr; +#endif + if (conn->sp == NULL) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapli_rep_recv: on invalid listen " Index: dapl/openib_cma/dapl_ib_extensions.c =================================================================== --- dapl/openib_cma/dapl_ib_extensions.c (revision 0) +++ dapl/openib_cma/dapl_ib_extensions.c (revision 0) @@ -0,0 +1,316 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ +/********************************************************************** + * + * MODULE: dapl_ib_extensions.c + * + * PURPOSE: Extensions routines for OpenIB uCMA provider + * + * $Id: $ + **********************************************************************/ +#ifdef DAT_EXTENSIONS + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_evd_util.h" +#include "dapl_ib_util.h" +#include "dapl_ep_util.h" +#include "dapl_cookie.h" +#include + + +DAT_RETURN +dapli_post_ext( IN DAT_EP_HANDLE ep_handle, + IN DAT_UINT64 cmp_add, + IN DAT_UINT64 swap, + IN DAT_UINT32 immed_data, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov, + IN DAT_DTO_COOKIE user_cookie, + IN const DAT_RMR_TRIPLET *remote_iov, + IN int op_type, + IN DAT_COMPLETION_FLAGS flags ); + + +/* + * dapl_extensions + * + * Process extension requests + * + * Input: + * ext_type, + * ... + * + * Output: + * Depends.... + * + * Returns: + * DAT_SUCCESS + * DAT_NOT_IMPLEMENTED + * ..... + * + */ +DAT_RETURN +dapl_extensions(IN DAT_HANDLE dat_handle, + IN DAT_EXTENDED_OP ext_op, + IN va_list args) +{ + DAT_EP_HANDLE ep; + DAT_LMR_TRIPLET *lmr_p; + DAT_DTO_COOKIE cookie; + const DAT_RMR_TRIPLET *rmr_p; + DAT_UINT64 dat_uint64a, dat_uint64b; + DAT_UINT32 dat_uint32; + DAT_COUNT segments = 1; + DAT_COMPLETION_FLAGS comp_flags; + DAT_RETURN status = DAT_NOT_IMPLEMENTED; + + dapl_dbg_log(DAPL_DBG_TYPE_API, + "dapl_extensions(hdl %p operation %d, ...)\n", + dat_handle, ext_op); + + DAPL_CNTR(DCNT_EXTENSION); + + switch ((int)ext_op) + { + + case DAT_IB_RDMA_WRITE_IMMED: + dapl_dbg_log(DAPL_DBG_TYPE_RTN, + " WRITE_IMMED_DATA extension call\n"); + + ep = dat_handle; /* ep_handle */ + segments = va_arg( args, DAT_COUNT); /* num segments */ + lmr_p = va_arg( args, DAT_LMR_TRIPLET*); + cookie = va_arg( args, DAT_DTO_COOKIE); + rmr_p = va_arg( args, const DAT_RMR_TRIPLET*); + dat_uint32 = va_arg( args, DAT_UINT32); /* immed data */ + comp_flags = va_arg( args, DAT_COMPLETION_FLAGS); + + status = 
dapli_post_ext(ep, 0, 0, dat_uint32, segments, lmr_p, + cookie, rmr_p, OP_RDMA_WRITE_IMM, + comp_flags ); + break; + + case DAT_IB_CMP_AND_SWAP: + dapl_dbg_log(DAPL_DBG_TYPE_RTN, + " CMP_AND_SWAP extension call\n"); + + ep = dat_handle; /* ep_handle */ + dat_uint64a = va_arg( args, DAT_UINT64); /* cmp_value */ + dat_uint64b = va_arg( args, DAT_UINT64); /* swap_value */ + lmr_p = va_arg( args, DAT_LMR_TRIPLET*); + cookie = va_arg( args, DAT_DTO_COOKIE); + rmr_p = va_arg( args, const DAT_RMR_TRIPLET*); + comp_flags = va_arg( args, DAT_COMPLETION_FLAGS); + + status = dapli_post_ext(ep, dat_uint64a, dat_uint64b, + 0, segments, lmr_p, cookie, rmr_p, + OP_COMP_AND_SWAP, comp_flags ); + break; + + case DAT_IB_FETCH_AND_ADD: + dapl_dbg_log(DAPL_DBG_TYPE_RTN, + " FETCH_AND_ADD extension call\n"); + + ep = dat_handle; /* ep_handle */ + dat_uint64a = va_arg( args, DAT_UINT64); /* add value */ + lmr_p = va_arg( args, DAT_LMR_TRIPLET*); + cookie = va_arg( args, DAT_DTO_COOKIE); + rmr_p = va_arg( args, const DAT_RMR_TRIPLET*); + comp_flags = va_arg( args, DAT_COMPLETION_FLAGS); + + status = dapli_post_ext(ep, dat_uint64a, 0, 0, segments, + lmr_p, cookie, rmr_p, + OP_FETCH_AND_ADD, comp_flags ); + + break; + + default: + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + "unsupported extension(%d)\n", (int)ext_op); + } + + return(status); +} + + +DAT_RETURN +dapli_post_ext( IN DAT_EP_HANDLE ep_handle, + IN DAT_UINT64 cmp_add, + IN DAT_UINT64 swap, + IN DAT_UINT32 immed_data, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov, + IN DAT_DTO_COOKIE user_cookie, + IN const DAT_RMR_TRIPLET *remote_iov, + IN int op_type, + IN DAT_COMPLETION_FLAGS flags ) +{ + DAPL_EP *ep_ptr; + ib_qp_handle_t qp_ptr; + DAPL_COOKIE *cookie; + DAT_RETURN dat_status = DAT_SUCCESS; + + dapl_dbg_log(DAPL_DBG_TYPE_API, + " post_ext_op: ep %p cmp_val %d " + "swap_val %d cookie 0x%x, r_iov %p, flags 0x%x\n", + ep_handle, (unsigned)cmp_add, (unsigned)swap, + (unsigned)user_cookie.as_64, remote_iov, flags); + + if 
(DAPL_BAD_HANDLE(ep_handle, DAPL_MAGIC_EP)) + return(DAT_ERROR(DAT_INVALID_HANDLE, DAT_INVALID_HANDLE_EP)); + + if ((NULL == remote_iov) || (NULL == local_iov)) + return DAT_INVALID_PARAMETER; + + ep_ptr = (DAPL_EP *) ep_handle; + qp_ptr = ep_ptr->qp_handle; + + /* + * Synchronization ok since this buffer is only used for send + * requests, which aren't allowed to race with each other. + * only if completion is expected + */ + if (!(DAT_COMPLETION_SUPPRESS_FLAG & flags)) { + + dat_status = dapls_dto_cookie_alloc( + &ep_ptr->req_buffer, + DAPL_DTO_TYPE_EXTENSION, + user_cookie, + &cookie ); + + if ( dat_status != DAT_SUCCESS ) + goto bail; + + /* + * Take reference before posting to avoid race conditions with + * completions + */ + dapl_os_atomic_inc(&ep_ptr->req_count); + } + + /* + * Invoke provider specific routine to post DTO + */ + dat_status = dapls_ib_post_ext_send(ep_ptr, + op_type, + cookie, + segments, /* data segments */ + local_iov, + remote_iov, + immed_data, /* immed data */ + cmp_add, /* compare or add */ + swap, /* swap */ + flags); + + if (dat_status != DAT_SUCCESS) { + if ( cookie != NULL ) { + dapl_os_atomic_dec(&ep_ptr->req_count); + dapls_cookie_dealloc(&ep_ptr->req_buffer, cookie); + } + } + +bail: + return dat_status; + +} + +/* + * New provider routine to process extended DTO events - IB transport specific + */ +void +dapls_cqe_to_event_extension(IN DAPL_EP *ep_ptr, + IN DAPL_COOKIE *cookie, + IN ib_work_completion_t *cqe_ptr, + IN DAT_EVENT *event_ptr) +{ + DAT_EXTENSION_EVENT_DATA *ext_ptr = + (DAT_EXTENSION_EVENT_DATA*)event_ptr->extension_data; + DAT_DTO_COMPLETION_EVENT_DATA *dto_ptr = + &event_ptr->event_data.dto_completion_event_data; + + dapl_dbg_log(DAPL_DBG_TYPE_EVD, + " cqe_to_event_ext: event_ptr %p ext_ptr %p\n", + event_ptr, ext_ptr); + + event_ptr->event_number = DAT_IB_DTO_EVENT; + dto_ptr->ep_handle = cookie->ep; + dto_ptr->user_cookie = cookie->val.dto.cookie; + dto_ptr->status = dapls_ib_get_dto_status(cqe_ptr); + 
dto_ptr->operation = DAPL_GET_CQE_DTOS_OPTYPE(cqe_ptr); + + if (dapls_ib_get_dto_status(cqe_ptr) != DAT_DTO_SUCCESS) + return; + + ext_ptr->status = DAT_OP_SUCCESS; + + switch (dto_ptr->operation) { + + case DAT_IB_DTO_RDMA_WRITE_IMMED: + dapl_dbg_log(DAPL_DBG_TYPE_EVD, + " cqe_to_event_ext: RDMA_WRITE_IMMED\n"); + /* type and outbound rdma write transfer size */ + ext_ptr->type = DAT_IB_RDMA_WRITE_IMMED_STATUS; + dto_ptr->transfered_length = cookie->val.dto.size; + break; + + case DAT_IB_DTO_RECV_IMMED: + { + /* Data delivered */ + dapl_dbg_log(DAPL_DBG_TYPE_EVD, + " cqe_to_event_ext: RECV_IMMED\n"); + /* type and inbound rdma write transfer size */ + ext_ptr->type = DAT_IB_RDMA_WRITE_IMMED_DATA; + ext_ptr->val.immed_data.data = + DAPL_GET_CQE_IMMED_DATA(cqe_ptr); + dto_ptr->transfered_length = DAPL_GET_CQE_BYTESNUM(cqe_ptr); + break; + } + case DAT_IB_DTO_CMP_AND_SWAP: + dapl_dbg_log(DAPL_DBG_TYPE_EVD, + " cqe_to_event_ext: CMP_AND_SWAP\n"); + /* original data is returned in LMR provided with post */ + ext_ptr->type = DAT_IB_CMP_AND_SWAP_STATUS; + dto_ptr->transfered_length = DAPL_GET_CQE_BYTESNUM(cqe_ptr); + break; + + case DAT_IB_DTO_FETCH_AND_ADD: + dapl_dbg_log(DAPL_DBG_TYPE_EVD, + " cqe_to_event_ext: FETCH_AND_ADD\n"); + /* original data is returned in LMR provided with post */ + ext_ptr->type = DAT_IB_FETCH_AND_ADD_STATUS; + dto_ptr->transfered_length = DAPL_GET_CQE_BYTESNUM(cqe_ptr); + break; + + default: + /* not extended operation */ + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + " cqe_to_event_ext: unknown operation(0x%x)\n", + dto_ptr->operation); + break; + } +} + +#endif /* DAT_EXTENSIONS */ Index: dapl/openib_cma/dapl_ib_util.h =================================================================== --- dapl/openib_cma/dapl_ib_util.h (revision 6260) +++ dapl/openib_cma/dapl_ib_util.h (working copy) @@ -53,6 +53,10 @@ #include #include +#ifdef DAT_EXTENSIONS +#include "dat/dat_ib_extensions.h" +#endif + /* Typedefs to map common DAPL provider types to IB verbs */ 
typedef struct dapl_cm_id *ib_qp_handle_t; typedef struct ibv_cq *ib_cq_handle_t; Index: dapl/openib_cma/dapl_ib_cq.c =================================================================== --- dapl/openib_cma/dapl_ib_cq.c (revision 6260) +++ dapl/openib_cma/dapl_ib_cq.c (working copy) @@ -517,8 +517,8 @@ dapls_ib_wait_object_wait(IN ib_wait_obj status = errno; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " cq_object_wait: RET evd %p ibv_cq %p ibv_ctx %p %s\n", - evd_ptr, ibv_cq,ibv_ctx,strerror(errno)); + " cq_object_wait: RET evd %p ibv_cq %p %s\n", + evd_ptr, ibv_cq, strerror(errno)); return(dapl_convert_errno(status,"cq_wait_object_wait")); Index: dat/include/dat/udat.h =================================================================== --- dat/include/dat/udat.h (revision 6260) +++ dat/include/dat/udat.h (working copy) @@ -73,7 +73,10 @@ typedef enum dat_handle_type DAT_HANDLE_TYPE_RMR, DAT_HANDLE_TYPE_RSP, DAT_HANDLE_TYPE_CNO, - DAT_HANDLE_TYPE_SRQ + DAT_HANDLE_TYPE_SRQ, +#ifdef DAT_EXTENSIONS + DAT_HANDLE_TYPE_EXTENSION_BASE +#endif } DAT_HANDLE_TYPE; /* EVD state consists of three orthogonal substates. 
One for Index: dat/include/dat/dat_redirection.h =================================================================== --- dat/include/dat/dat_redirection.h (revision 6260) +++ dat/include/dat/dat_redirection.h (working copy) @@ -395,6 +395,14 @@ typedef struct dat_provider DAT_PROVIDER (lbuf), \ (cookie)) +#ifdef DAT_EXTENSIONS +#define DAT_HANDLE_EXTENDEDOP(handle, op, args) \ + (*DAT_HANDLE_TO_PROVIDER (handle)->handle_extendedop_func) (\ + (handle), \ + (op), \ + (args)) +#endif + /*************************************************************** * * FUNCTION PROTOTYPES @@ -720,4 +728,12 @@ typedef DAT_RETURN (*DAT_SRQ_POST_RECV_F IN DAT_LMR_TRIPLET *, /* local_iov */ IN DAT_DTO_COOKIE ); /* user_cookie */ +#ifdef DAT_EXTENSIONS +#include +typedef DAT_RETURN (*DAT_HANDLE_EXTENDEDOP_FUNC)( + IN DAT_HANDLE, /* handle */ + IN DAT_EXTENDED_OP, /* extended op */ + IN va_list); /* argument list */ +#endif /* DAT_EXTENSIONS */ + #endif /* _DAT_REDIRECTION_H_ */ Index: dat/include/dat/dat_ib_extensions.h =================================================================== --- dat/include/dat/dat_ib_extensions.h (revision 0) +++ dat/include/dat/dat_ib_extensions.h (revision 0) @@ -0,0 +1,290 @@ +/* + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. 
+ * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is in the file LICENSE3.txt in the root directory. The + * license is also available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ +/********************************************************************** + * + * HEADER: dat_ib_extensions.h + * + * PURPOSE: extensions to the DAT API for IB transport specific services + * NOTE: Prototyped IB extension support in openib-cma 1.2 provider. + * Applications MUST recompile with new dat.h definitions + * and include this file. + * + * Description: Header file for "uDAPL: User Direct Access Programming + * Library, Version: 1.2" + * + * Mapping rules: + * All global symbols are prepended with "DAT_" or "dat_" + * All DAT objects have an 'api' tag which, such as 'ep' or 'lmr' + * The method table is in the provider definition structure. 
+ * + * + **********************************************************************/ +#ifndef _DAT_IB_EXTENSIONS_H_ + +/* + * Provider specific attribute strings for extension support + * returned with dat_ia_query() and + * DAT_PROVIDER_ATTR_MASK == DAT_PROVIDER_FIELD_PROVIDER_SPECIFIC_ATTR + * + * DAT_NAMED_ATTR name == extended operations and version, + * version_value = version number of extension API + */ +#define DAT_EXTENSION_ATTR "DAT_EXTENSION_INTERFACE" +#define DAT_EXTENSION_ATTR_VERSION "DAT_EXTENSION_VERSION" +#define DAT_EXTENSION_ATTR_VERSION_VALUE "2.0.1" +#define DAT_IB_ATTR_FETCH_AND_ADD "DAT_IB_FETCH_AND_ADD" +#define DAT_IB_ATTR_CMP_AND_SWAP "DAT_IB_CMP_AND_SWAP" +#define DAT_IB_ATTR_IMMED_DATA "DAT_IB_IMMED_DATA" + +/* + * Extension operations + */ +typedef enum dat_ib_op +{ + DAT_IB_FETCH_AND_ADD, + DAT_IB_CMP_AND_SWAP, + DAT_IB_RDMA_WRITE_IMMED, + +} DAT_IB_OP; + +/* + * Definition for extended EVENT numbers, DAT_GENERIC_EXTENSION_BASE_RANGE + * is used by these extensions as a starting point for extended event numbers + */ +typedef enum dat_ib_event_number +{ + DAT_IB_DTO_EVENT = DAT_IB_EXTENSION_BASE_RANGE, + +} DAT_IB_EVENT_NUMBER; + +/* + * Definition for new extended event types; + * + * set when EVENT_NUMBER == DAT_EXT_EVENT_NUMBER + */ +/* + * Definition for extended event types; + * set when EVENT_NUMBER == DAT_EXT_EVENT + */ +typedef enum dat_ib_type +{ + DAT_IB_FETCH_AND_ADD_STATUS, // 0 + DAT_IB_CMP_AND_SWAP_STATUS, // 1 + DAT_IB_RDMA_WRITE_IMMED_STATUS, // 2 + DAT_IB_RDMA_WRITE_IMMED_DATA, // 3 + +} DAT_IB_TYPE; + + +/* + * Definitions for additional extension type RETURN codes above + * standard DAT types. Included with standard DAT_TYPE_STATUS + * bits using a DAT_EXTENSION BASE as a starting point. + */ +typedef enum dat_ib_return +{ + DAT_IB_ERR = DAT_EXTENSION_BASE, + +} DAT_IB_RETURN; + +/* + * Definitions for additional extension handle types beyond + * standard DAT handle. 
New Bit definitions MUST start at + * DAT_HANDLE_TYPE_EXTENSION_BASE + */ +typedef enum dat_ib_handle_type +{ + DAT_IB_HANDLE_TYPE_EXT = DAT_HANDLE_TYPE_EXTENSION_BASE, + +} DAT_IB_HANDLE_TYPE; + +/* + * Definition for memory privilege extension flags. + * New privileges required for new atomic DTO type extensions. + * New Bit definitions MUST start at DAT_MEM_PRIV_EXTENSION + */ +typedef enum dat_ib_mem_priv_flags +{ + DAT_IB_MEM_PRIV_REMOTE_ATOMIC = DAT_MEM_PRIV_EXTENSION_BASE, + +} DAT_IB_MEM_PRIV_FLAGS; + +/* + * Definition for extended IB DTO operations, DAT_DTO_EXTENSION_BASE + * is used by DAT extensions as a starting point of extension DTOs + */ +typedef enum dat_ib_dtos +{ + DAT_IB_DTO_RDMA_WRITE_IMMED = DAT_DTO_EXTENSION_BASE, + DAT_IB_DTO_RECV_IMMED, + DAT_IB_DTO_FETCH_AND_ADD, + DAT_IB_DTO_CMP_AND_SWAP, + +} DAT_IB_DTOS; + +/* + * Extension event status + */ +typedef enum dat_ib_status +{ + DAT_OP_SUCCESS = DAT_SUCCESS, + DAT_IB_OP_ERR, + +} DAT_IB_STATUS; + +/* + * Definitions for extended event data: + * When dat_event->event_number >= DAT_IB_EXTENSION_BASE_RANGE + * then dat_event->extension_data == DAT_IB_EXT_EVENT_DATA type + * and ((DAT_IB_EXT_EVENT_DATA*)dat_event->extension_data)->type + * specifies extension data values. 
+ * NOTE: DAT_IB_EXT_EVENT_DATA cannot exceed 64 bytes as defined by + * "DAT_UINT64 extension_data[8]" in DAT_EVENT (dat.h) + */ +typedef struct dat_ib_immediate_data +{ + DAT_UINT32 data; + +} DAT_IB_IMMED_DATA; + +typedef struct dat_extension_event_data +{ + DAT_IB_TYPE type; + DAT_IB_STATUS status; + DAT_CONTEXT user_context; + union { + DAT_IB_IMMED_DATA immed_data; + } val; +} DAT_EXTENSION_EVENT_DATA; + + +/* Extended RETURN and EVENT STATUS string helper functions */ + +/* DAT_EXT_RETURN error to string */ +static __inline__ DAT_RETURN +dat_strerror_extension ( + IN DAT_IB_RETURN value, + OUT const char **message ) +{ + switch( DAT_GET_TYPE(value) ) { + case DAT_IB_ERR: + *message = "DAT_IB_ERR"; + return DAT_SUCCESS; + default: + /* standard DAT return type */ + return(dat_strerror(value, message, NULL)); + } +} + +/* DAT_EXT_STATUS error to string */ +static __inline__ DAT_RETURN +dat_strerror_ext_status ( + IN DAT_IB_STATUS value, + OUT const char **message ) +{ + switch(value) { + case 0: + *message = " "; + return DAT_SUCCESS; + case DAT_IB_OP_ERR: + *message = "DAT_IB_OP_ERR"; + return DAT_SUCCESS; + default: + *message = "unknown extension status"; + return DAT_INVALID_PARAMETER; + } +} + +/* + * Extended IB transport specific APIs + * redirection via DAT extension function + */ + +/* + * This asynchronous call is modeled after the InfiniBand atomic + * Fetch and Add operation. The add_value is added to the 64 bit + * value stored at the remote memory location specified in remote_iov + * and the result is stored in the local_iov. + */ +#define dat_ib_post_fetch_and_add(ep, add_val, lbuf, cookie, rbuf, flgs) \ + dat_extension( ep, \ + DAT_IB_FETCH_AND_ADD, \ + (add_val), \ + (lbuf), \ + (cookie), \ + (rbuf), \ + (flgs)) + +/* + * This asynchronous call is modeled after the InfiniBand atomic + * Compare and Swap operation. The cmp_value is compared to the 64 bit + * value stored at the remote memory location specified in remote_iov. 
+ * If the two values are equal, the 64 bit swap_value is stored in + * the remote memory location. In all cases, the original 64 bit + * value stored in the remote memory location is copied to the local_iov. + */ +#define dat_ib_post_cmp_and_swap(ep, cmp_val, swap_val, lbuf, cookie, rbuf, flgs) \ + dat_extension( ep, \ + DAT_IB_CMP_AND_SWAP, \ + (cmp_val), \ + (swap_val), \ + (lbuf), \ + (cookie), \ + (rbuf), \ + (flgs)) + +/* + * RDMA Write with IMMEDIATE: + * + * This asynchronous call is modeled after the InfiniBand rdma write with + * immediate data operation. Event completion for the request completes as an + * DAT_EXTENSION with extension type set to DAT_DTO_EXTENSION_IMMED_DATA. + * Event completion on the remote endpoint completes as receive DTO operation + * type of DAT_EXTENSION with operation set to DAT_DTO_EXTENSION_IMMED_DATA. + * The immediate data will be provided in the extented DTO event data structure. + * + * Note to Consumers: the immediate data will consume a receive + * buffer at the Data Sink. 
+ * + * Other extension flags: + * n/a + */ +#define dat_ib_post_rdma_write_immed(ep, size, lbuf, cookie, rbuf, idata, flgs) \ + dat_extension( ep, \ + DAT_IB_RDMA_WRITE_IMMED, \ + (size), \ + (lbuf), \ + (cookie), \ + (rbuf), \ + (idata), \ + (flgs)) + +#endif /* _DAT_IB_VENDOR_EXTENSIONS_H_ */ + Index: dat/include/dat/dat.h =================================================================== --- dat/include/dat/dat.h (revision 6260) +++ dat/include/dat/dat.h (working copy) @@ -122,6 +122,23 @@ typedef DAT_HANDLE DAT_SRQ_HANDLE; /* dat NULL handles */ #define DAT_HANDLE_NULL ((DAT_HANDLE)NULL) +#ifdef DAT_EXTENSIONS +typedef int DAT_EXTENDED_OP; + +typedef enum dat_dtos +{ + DAT_SEND, + DAT_RDMA_WRITE, + DAT_RDMA_READ, + DAT_RECEIVE, + DAT_RECEIVE_WITH_INVALIDATE, + DAT_BIND_MW, + /* To be used by DAT extensions as a starting point of extension DTOs */ + DAT_DTO_EXTENSION_BASE +} DAT_DTOS; + +#endif + typedef DAT_SOCK_ADDR * DAT_IA_ADDRESS_PTR; typedef DAT_UINT64 DAT_CONN_QUAL; @@ -165,8 +182,10 @@ typedef enum dat_evd_flags DAT_EVD_CONNECTION_FLAG = 0x040, DAT_EVD_RMR_BIND_FLAG = 0x080, DAT_EVD_ASYNC_FLAG = 0x100, - - /* DAT events only, no software events */ +#ifdef DAT_EXTENSIONS + DAT_EVD_EXTENSION_FLAG = 0x200, +#endif + /* DAT events only, no software events,no extended events */ DAT_EVD_DEFAULT_FLAG = 0x1F0 } DAT_EVD_FLAGS; @@ -267,7 +286,11 @@ typedef enum dat_mem_priv_flags DAT_MEM_PRIV_REMOTE_READ_FLAG = 0x02, DAT_MEM_PRIV_LOCAL_WRITE_FLAG = 0x10, DAT_MEM_PRIV_REMOTE_WRITE_FLAG = 0x20, - DAT_MEM_PRIV_ALL_FLAG = 0x33 + DAT_MEM_PRIV_ALL_FLAG = 0x33, +#ifdef DAT_EXTENSIONS + /* To be used by DAT extensions as a starting point of extension memory privileges */ + DAT_MEM_PRIV_EXTENSION_BASE = 0x40 +#endif } DAT_MEM_PRIV_FLAGS; /* For backward compatibility with DAT-1.0, memory privileges values are @@ -720,6 +743,9 @@ typedef struct dat_dto_completion_event_ DAT_DTO_COOKIE user_cookie; DAT_DTO_COMPLETION_STATUS status; DAT_VLEN transfered_length; +#ifdef 
DAT_EXTENSIONS + DAT_DTOS operation; +#endif } DAT_DTO_COMPLETION_EVENT_DATA; /* RMR bind completion event data */ @@ -854,7 +880,13 @@ typedef enum dat_event_number DAT_ASYNC_ERROR_EP_BROKEN = 0x08003, DAT_ASYNC_ERROR_TIMED_OUT = 0x08004, DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR = 0x08005, - DAT_SOFTWARE_EVENT = 0x10001 + DAT_SOFTWARE_EVENT = 0x10001, +#ifdef DAT_EXTENSIONS + /* type and data embedded in extension_data in dat_event */ + DAT_IB_EXTENSION_BASE_RANGE = 0x20001, + DAT_IW_EXTENSION_BASE_RANGE = 0x40001, + DAT_VENDOR_EXTENSION_BASE_RANGE = 0x80001 +#endif } DAT_EVENT_NUMBER; /* Union for event Data */ @@ -876,8 +908,12 @@ typedef struct dat_event DAT_EVENT_NUMBER event_number; DAT_EVD_HANDLE evd_handle; DAT_EVENT_DATA event_data; +#ifdef DAT_EXTENSIONS + DAT_UINT64 extension_data[8]; +#endif } DAT_EVENT; + /* Provider/registration info */ typedef struct dat_provider_info @@ -1222,6 +1258,13 @@ extern DAT_RETURN dat_srq_set_lw ( IN DAT_SRQ_HANDLE, /* srq_handle */ IN DAT_COUNT); /* low_watermark */ +#ifdef DAT_EXTENSIONS +extern DAT_RETURN dat_extension( + IN DAT_HANDLE, + IN DAT_EXTENDED_OP, + IN ... ); +#endif + /* * DAT registry functions. * Index: dat/include/dat/dat_error.h =================================================================== --- dat/include/dat/dat_error.h (revision 6260) +++ dat/include/dat/dat_error.h (working copy) @@ -162,6 +162,11 @@ typedef enum dat_return_type /* No Connection Qualifiers are available */ DAT_CONN_QUAL_UNAVAILABLE = 0x00140000, +#ifdef DAT_EXTENSIONS + /* The DAT extensions support. */ + DAT_EXTENSION_BASE = 0x10000000, + /* range 0x10000000 - 0x3FFF0000 is reserved for extension */ +#endif /* Provider does not support the operation yet. 
*/ DAT_NOT_IMPLEMENTED = 0xFFFF0000 } DAT_RETURN_TYPE; Index: dat/include/dat/udat_redirection.h =================================================================== --- dat/include/dat/udat_redirection.h (revision 6260) +++ dat/include/dat/udat_redirection.h (working copy) @@ -294,6 +294,10 @@ struct dat_provider DAT_SRQ_QUERY_FUNC srq_query_func; DAT_SRQ_RESIZE_FUNC srq_resize_func; DAT_SRQ_SET_LW_FUNC srq_set_lw_func; + +#ifdef DAT_EXTENSIONS + DAT_HANDLE_EXTENDEDOP_FUNC handle_extendedop_func; +#endif /* DAT_EXTENSIONS */ }; #endif /* _UDAT_REDIRECTION_H_ */ Index: dat/common/dat_api.c =================================================================== --- dat/common/dat_api.c (revision 6260) +++ dat/common/dat_api.c (working copy) @@ -1142,6 +1142,35 @@ DAT_RETURN dat_srq_set_lw( low_watermark); } +#ifdef DAT_EXTENSIONS +extern int g_dat_extensions; +DAT_RETURN dat_extension( + IN DAT_HANDLE handle, + IN DAT_EXTENDED_OP ext_op, + IN ... ) + +{ + DAT_RETURN status; + va_list args; + + /* verify provider extension support */ + if (!g_dat_extensions) + { + return DAT_ERROR(DAT_NOT_IMPLEMENTED, 0); + } + + va_start(args, ext_op); + + /* extension will validate the handle based on exp_op */ + status = DAT_HANDLE_EXTENDEDOP(handle, + ext_op, + args); + va_end(args); + + return status; +} +#endif + /* * Local variables: * c-indent-level: 4 Index: dat/udat/Makefile =================================================================== --- dat/udat/Makefile (revision 6260) +++ dat/udat/Makefile (working copy) @@ -61,7 +61,7 @@ STATIC32 = $(TARGET_PATH32)/libdat.a DYNAMIC32 = $(TARGET_PATH32)/libdat.so ifeq "$(ARCH)" "x86_64" -DUAL_ARCH = true +#DUAL_ARCH = true endif OBJS = $(OBJ_PATH)/udat.o \ @@ -112,6 +112,12 @@ CFLAGS32 = -m32 endif # +# Prototype 2.0 DAT extensions +# +CFLAGS += -DDAT_EXTENSIONS + + +# # LD definitions # Index: dat/udat/udat.c =================================================================== --- dat/udat/udat.c (revision 6260) +++ dat/udat/udat.c 
(working copy) @@ -66,6 +66,10 @@ udat_check_state ( void ); * * *********************************************************************/ +/* + * Use a global to get an unresolved when run with pre-extension library + */ +int g_dat_extensions = 0; /* * @@ -226,19 +230,47 @@ dat_ia_openv ( return dat_status; } - dat_status = (*ia_open_func) (name, - async_event_qlen, - async_event_handle, - ia_handle); + dat_status = (*ia_open_func) (name, + async_event_qlen, + async_event_handle, + ia_handle); + + /* + * See if provider supports extensions + */ if (dat_status == DAT_SUCCESS) { - return_handle = dats_set_ia_handle (*ia_handle); - if (return_handle >= 0) - { - *ia_handle = (DAT_IA_HANDLE)return_handle; - } - } + DAT_PROVIDER_ATTR p_attr; + int i; + return_handle = dats_set_ia_handle (*ia_handle); + if (return_handle >= 0) + { + *ia_handle = (DAT_IA_HANDLE)return_handle; + } + + if ( dat_ia_query( *ia_handle, + NULL, + 0, + NULL, + DAT_PROVIDER_FIELD_PROVIDER_SPECIFIC_ATTR, + &p_attr ) == DAT_SUCCESS ) + { + for ( i = 0; i < p_attr.num_provider_specific_attr; i++ ) + { + if (strcmp( p_attr.provider_specific_attr[i].name, + "DAT_EXTENSION_INTERFACE" ) == 0) + { + dat_os_dbg_print(DAT_OS_DBG_TYPE_CONSUMER_API, + "DAT Registry: dat_ia_open () " + "Extension Interface supported!\n"); + + g_dat_extensions = 1; + break; + } + } + } + } return dat_status; } -------------- next part -------------- A non-text attachment was scrubbed... Name: dat_ext.patch Type: application/octet-stream Size: 74513 bytes Desc: not available URL: From mst at mellanox.co.il Wed Apr 5 17:46:43 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 03:46:43 +0300 Subject: [openib-general] Re: RC2 delayed a bit In-Reply-To: References: Message-ID: <20060406004643.GH26557@mellanox.co.il> Quoting r. Shirley Ma : > It would be risky if Distros target future OF releases if for some reason > OF slips its release schedule. I think distros learned to live with this by now. 
Or they can just play it safe and target a stable release. -- MST From sean.hefty at intel.com Wed Apr 5 19:02:22 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 5 Apr 2006 19:02:22 -0700 Subject: [openib-general] RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406003635.GG26557@mellanox.co.il> Message-ID: >I don't see the connection (since we only need de-register, let's just call >it flush and be done with it), but fine. Please propose an API then. The issue that I see is that ib_sa_flush() requires ib_mad_flush(), but the MAD layer API doesn't have the problem that we're talking about. I dislike the trickling effect that this fix has. I need to give more thought about an interface, and who will really use it. For example, the local_sa performs its own path record queries because of the need to obtain multiple paths. The ib_multicast module now handles multicast join/leave requests, and could be made to do its own queries. It may be possible to remove some of the ib_sa APIs. >> Registration could be global, or per port. The latter probably makes the >> most sense. > >Probably global for ib_addr, per port for ib_sa: we don't want to force >ib_addr users to deal with devices. I agree. - Sean From ftillier at silverstorm.com Wed Apr 5 20:31:02 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Wed, 5 Apr 2006 20:31:02 -0700 Subject: [openib-general] RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: References: <20060406003635.GG26557@mellanox.co.il> Message-ID: <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> On 4/5/06, Sean Hefty wrote: > >I don't see the connection (since we only need de-register, let's just call > >it flush and be done with it), but fine. Please propose an API then. > > The issue that I see is that ib_sa_flush() requires ib_mad_flush(), but the MAD > layer API doesn't have the problem that we're talking about. I dislike the > trickling effect that this fix has. 
> > I need to give more thought about an interface, and who will really use it. For > example, the local_sa performs its own path record queries because of the need > to obtain multiple paths. The ib_multicast module now handles multicast > join/leave requests, and could be made to do its own queries. It may be > possible to remove some of the ib_sa APIs. Can't you pass in a reference to the client module for registration, and then take a reference from the context of each request that is released after the callback unwinds? I thought Linux had module reference functions... - Fab From eli at mellanox.co.il Wed Apr 5 23:37:00 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 6 Apr 2006 09:37:00 +0300 Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task In-Reply-To: References: <20060404155233.GR14808@mellanox.co.il> <20060405074904.GC14808@mellanox.co.il> Message-ID: <200604060937.00787.eli@mellanox.co.il> On Wednesday 05 April 2006 18:43, Roland Dreier wrote: > Michael> Not sure I read you. It'd still be use after free, won't it? > > It's definitely a bug. But it doesn't explain the specific oops we > saw. In other words, doing: > > kfree(mcast); > dev = mcast->dev; > > shouldn't cause an oops, because mcast is still a valid kernel > pointer, even if the memory it points to might be reused and > corrupted. Following the dev pointer after that snippet might cause > an oops, because it might be overwritten. > The reason for that is probably because I am using a custom kernel compiled with 'Debug memory allocations' which poisons freed memory. From moshek at voltaire.com Wed Apr 5 23:49:31 2006 From: moshek at voltaire.com (Moshe Kazir) Date: Thu, 6 Apr 2006 09:49:31 +0300 Subject: [openib-general] RC2 delayed a bit Message-ID: Bob Woodruff wrote -> > BTW. I built some kernel RPMs based on the 1.0 branch kernel code and the backport patches for RedHat EL4.0 U3. If someone wants me to post them somewhere, I will. 
I'll be glad to test them on a PPC machine . Can you put the rpm on an ftp somewhere ? Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Bob Woodruff Sent: Wednesday, April 05, 2006 9:10 PM To: 'Sean Hefty'; Bryan O'Sullivan Cc: openfabrics-ewg at openib.org; Openib-general at openib.org Subject: RE: [openib-general] RC2 delayed a bit Bryan wrote, >So, we went from having no openib release to now having two? That's confusing. >Are these vendors members of openib? >- Sean I know that I am confused. Can someone from the ibed (openfabrics-ewg) people please enlighten us ? BTW. I built some kernel RPMs based on the 1.0 branch kernel code and the backport patches for RedHat EL4.0 U3. If someone wants me to post them somewhere, I will. woody _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Wed Apr 5 23:57:00 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 5 Apr 2006 23:57:00 -0700 Subject: [openib-general] RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> Message-ID: >Can't you pass in a reference to the client module for registration, >and then take a reference from the context of each request that is >released after the callback unwinds? I thought Linux had module >reference functions... Yes - this is what ib_mad does. The problem is that ib_sa, ib_addr, ib_cm, and soon to be ib_multicast can invoke callbacks without explicit registration / deregistration. 
For example, the following interface has the issue:

    ib_do_async_operation(request, my_callback, my_context);

The caller can verify that all callbacks have been invoked, but there's no way
for the caller to know if a thread is still in their callback.

I was able to come up with several possible solutions to this problem. The
easiest to implement is doing what Michael suggested, and calling some sort of
wait_until_all_current_callbacks_return routine. What I don't like about this
approach is that the interface becomes easier to misuse (i.e. callers must
remember to call wait_until_all_current_callbacks_return before unloading), plus
it requires changes to interfaces that do work.

My preference, and it's not a very strong one at this point, is to push the
responsibility into the module invoking the callback. To me, that's the
direction that the reference goes, so that's where the responsibility lies.
Besides, it's his thread that's executing random memory as code.

- Sean

From mst at mellanox.co.il Thu Apr 6 00:12:29 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 6 Apr 2006 10:12:29 +0300
Subject: [openib-general] RE: Re: [PATCH] ipoib_flush_paths
In-Reply-To: <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com>
References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com>
Message-ID: <20060406071228.GA9994@mellanox.co.il>

Quoting r. Fabian Tillier :
> Can't you pass in a reference to the client module for registration,
> and then take a reference from the context of each request that is
> released after the callback unwinds? I thought Linux had module
> reference functions...

I thought about that, but this would mean:
1. changing the API instead of extending it with new functions - lots of churn for ULPs
2. adding overhead on the data path rather than on the unload path where it belongs
3. a module that does lots of queries can't be unloaded for a large part of the time

Maybe it's still not too bad ...
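The trade-off discussed above can be sketched in userland C. This is only a sketch under stated assumptions, not the ib_sa or ib_mad interface: cb_tracker, tracker_invoke, tracker_flush and probe_cb are hypothetical names. The counter lives in the module that invokes the callback, so the client pays with a wait on its unload path instead of a module reference per request. It also assumes the client has stopped issuing new requests before it flushes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names throughout; not an existing kernel API. */
struct cb_tracker {
	atomic_int in_flight;	/* callbacks currently executing */
};

typedef void (*async_cb)(void *context);

/*
 * Run by the module that owns the thread (the invoking side).
 * The counter is dropped only after the callback has returned,
 * so a zero count means no thread is inside client code.
 */
static void tracker_invoke(struct cb_tracker *t, async_cb cb, void *context)
{
	assert(cb != NULL);
	atomic_fetch_add(&t->in_flight, 1);
	cb(context);
	atomic_fetch_sub(&t->in_flight, 1);
}

/*
 * Run by the client on its unload path, after it has stopped
 * issuing requests. A real implementation would sleep, not spin.
 */
static void tracker_flush(struct cb_tracker *t)
{
	while (atomic_load(&t->in_flight) != 0)
		;
}

/* Demo callback: records how many callbacks were in flight when it ran. */
struct probe {
	struct cb_tracker *t;
	int seen;
};

static void probe_cb(void *context)
{
	struct probe *p = context;

	p->seen = atomic_load(&p->t->in_flight);
}
```

The point of the design is where the decrement happens: it runs in the invoking module's code, after the client's callback has fully unwound, which is the property the unload path needs.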
-- MST

From mst at mellanox.co.il Thu Apr 6 00:17:32 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 6 Apr 2006 10:17:32 +0300
Subject: [openib-general] RE: Re: [PATCH] ipoib_flush_paths
In-Reply-To:
References: <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com>
Message-ID: <20060406071731.GC9994@mellanox.co.il>

Quoting r. Sean Hefty :
> My preference, and it's not a very strong one at this point, is to push the
> responsibility into the module invoking the callback. To me, that's the
> direction that the reference goes, so that's where the responsibility lies.
> Besides, it's his thread that's executing random memory as code.

Sure: that's the only way it can be done, so either the invoking module (sa)
has to take a reference on the module that has the callback, or the callback
module has to notify the invoking one that it's unloading. I like the second
approach better.

-- MST

From krkumar2 at in.ibm.com Wed Apr 5 22:28:43 2006
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Thu, 6 Apr 2006 10:58:43 +0530
Subject: [openib-general] [PATCH] Typo in ib_ping_init_device
In-Reply-To:
Message-ID:

--- a/drivers/infiniband/core/ping.c 2006-03-28 18:01:31.000000000 +0530
+++ b/drivers/infiniband/core/ping.c 2006-03-28 18:06:09.000000000 +0530
@@ -259,10 +259,11 @@ static void ib_ping_init_device(struct i
 	}

 	for (i = 0; i < num_ports; i++, cur_port++) {
-		if (ib_ping_port_open(device, cur_port))
+		if (ib_ping_port_open(device, cur_port)) {
 			printk(KERN_ERR SPFX "Couldn't open %s port %d\n",
 			       device->name, cur_port);
 			goto error_device_open;
+		}
 	}
 	return;

Thanks,

- KK

From yael at mellanox.co.il Thu Apr 6 02:01:25 2006
From: yael at mellanox.co.il (Yael Kalka)
Date: Thu, 6 Apr 2006 12:01:25 +0300
Subject: [openib-general] RE: [PATCH] OpenSM - complib fix for branch
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FE86@mtlexch01.mtl.com>

Hi Hal,

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Wednesday, April 05, 2006 1:59 PM
> To: Yael Kalka
> Cc: openib-general at openib.org; Eitan Zahavi; Sasha Khapyorsky; Ofer Gigi
> Subject: Re: [PATCH] OpenSM - complib fix for branch
>
> Hi Yael,
>
> On Wed, 2006-04-05 at 02:24, Yael Kalka wrote:
> > Hi Hal,
> >
> > I saw that the complib patch (removal of constructor and destructor
> > attribute), wasn't fully added to the branch.
> > Attached is a patch for the branch.
>
> Is this needed for 1.0 ? Is this safe to add ? Was there more to it than
> just this ?
>
[Yael Kalka] It isn't needed for 1.0 any more than for the trunk; it's just
that the fewer differences we have between the branch and the trunk, the easier
it will be to handle them. I don't see a problem with applying the patch on the
branch. The patch also included a change to osmtest/main.c, but that change
is already applied to the branch for some reason.

Another issue that is not connected to this specific patch: I think we should
apply the cosmetic-change patches to the branch in order to minimize the
differences between the trunk and the branch.
Yael

> -- Hal
>
> > Thanks,
> > Yael
> >
> > Signed-off-by: Yael Kalka
> >
> > Index: complib/cl_complib.c
> > ===================================================================
> > --- complib/cl_complib.c (revision 6203)
> > +++ complib/cl_complib.c (working copy)
> > @@ -65,7 +65,6 @@ __cl_timer_prov_destroy( void );
> > cl_spinlock_t cl_atomic_spinlock;
> >
> > void
> > -__attribute (( constructor ))
> > complib_init(void)
> > {
> > cl_status_t status = CL_SUCCESS;
> > @@ -90,14 +89,6 @@ complib_init(void)
> > }
> >
> > void
> > -__attribute (( destructor ))
> > -complib_fini(void)
> > -{
> > - __cl_timer_prov_destroy();
> > - __cl_user_syshelper_exit();
> > -}
> > -
> > -void
> > complib_exit(void)
> > {
> > __cl_timer_prov_destroy();
> > Index: opensm/main.c
> > ===================================================================
> > --- opensm/main.c (revision 6203)
> > +++ opensm/main.c (working copy)
> > @@ -44,9 +44,6 @@
> > *
> > * $Revision: 1.23 $
> > */
> > -#ifdef __WIN__
> > -#pragma warning(disable : 4996)
> > -#endif
> >
> > #if HAVE_CONFIG_H
> > # include
> > @@ -557,9 +554,7 @@ main(
> > { NULL, 0, NULL, 0 } /* Required at the end of the array */
> > };
> >
> > -#ifdef __WIN__
> > complib_init();
> > -#endif
> >
> > /* Make sure that the opensm and complib were compiled using
> > same modes (debug/free) */
> >

From eli at mellanox.co.il Thu Apr 6 02:35:53 2006
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 6 Apr 2006 12:35:53 +0300
Subject: [openib-general] Re: [PATCH] ipoib_mcast_restart_task
In-Reply-To: <200604060937.00787.eli@mellanox.co.il>
References: <20060404155233.GR14808@mellanox.co.il> <200604060937.00787.eli@mellanox.co.il>
Message-ID: <200604061235.54075.eli@mellanox.co.il>

>
> The reason for that is probably because I am using a custom kernel compiled
> with 'Debug memory allocations' which poisons freed memory.

Oops. Ignore this. I did not notice the succeeding discussion.
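For reference, the bug discussed in this thread (reading mcast->dev after kfree(mcast)) is avoided by copying out whatever is still needed before the free. What follows is a minimal userland sketch, assuming stand-in types (mcast_stub, net_device_stub), a hypothetical helper mcast_free(), and free() in place of kfree(); it is not the ipoib code.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in types for illustration only; not the ipoib structures. */
struct net_device_stub {
	int ifindex;
};

struct mcast_stub {
	struct net_device_stub *dev;
};

/*
 * Free a multicast entry and hand back the device it referred to.
 * Every field that is still needed is read before the free; after
 * free() the entry must not be dereferenced again.
 */
static struct net_device_stub *mcast_free(struct mcast_stub *mcast)
{
	struct net_device_stub *dev;

	assert(mcast != NULL);
	dev = mcast->dev;	/* read first ... */
	free(mcast);		/* ... then release */
	return dev;
}
```

The reversed order may appear to work on an ordinary build and only misbehave once the allocator reuses or overwrites the freed memory, which is why the ordering has to be right by construction.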

From k_mahesh85 at yahoo.co.in Thu Apr 6 03:28:00 2006
From: k_mahesh85 at yahoo.co.in (keshetti mahesh)
Date: Thu, 6 Apr 2006 11:28:00 +0100 (BST)
Subject: [openib-general] what is the status of sdp..........?
Message-ID: <20060406102800.5791.qmail@web8325.mail.in.yahoo.com>

Hello all,

I have recently started working on SDP in my lab, and I learned that it is
still under development. I would like to participate in the development. Can
anybody tell me what the current progress on SDP is and how I can get involved?
Also, is there a dedicated mailing list for SDP?

Regards,
K. Mahesh

From halr at voltaire.com Thu Apr 6 04:07:08 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Apr 2006 07:07:08 -0400
Subject: [openib-general] [PATCH] Typo in ib_ping_init_device
In-Reply-To:
References:
Message-ID: <1144321414.4480.90416.camel@hal.voltaire.com>

On Thu, 2006-04-06 at 01:28, Krishna Kumar2 wrote:
> --- a/drivers/infiniband/core/ping.c 2006-03-28 18:01:31.000000000 +0530
> +++ b/drivers/infiniband/core/ping.c 2006-03-28 18:06:09.000000000 +0530
> @@ -259,10 +259,11 @@ static void ib_ping_init_device(struct i
> 	}
>
> 	for (i = 0; i < num_ports; i++, cur_port++) {
> -		if (ib_ping_port_open(device, cur_port))
> +		if (ib_ping_port_open(device, cur_port)) {
> 			printk(KERN_ERR SPFX "Couldn't open %s port %d\n",
> 			       device->name, cur_port);
> 			goto error_device_open;
> +		}
> 	}
> 	return;
>
>
> Thanks,

Thanks. Applied.

In the future, please use the Signed-off-by line for patches.
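
The effect of the missing braces can be reproduced outside the kernel. What follows is a minimal sketch, assuming a hypothetical port_open_stub() in place of ib_ping_port_open() and fprintf() in place of printk(). Without the braces only the print is conditional, so the goto runs on the first iteration regardless of the result; with them the loop stops only at a port that actually fails.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for ib_ping_port_open(): returns nonzero on failure. */
static int port_open_stub(int port, int failing_port)
{
	return port == failing_port;
}

/*
 * Open num_ports ports starting at first_port and return how many
 * were opened before the first failure. The braces keep the goto
 * inside the error branch, as in the applied patch.
 */
static int open_ports(int first_port, int num_ports, int failing_port)
{
	int opened = 0;
	int i, cur_port = first_port;

	assert(num_ports >= 0);
	for (i = 0; i < num_ports; i++, cur_port++) {
		if (port_open_stub(cur_port, failing_port)) {
			fprintf(stderr, "Couldn't open port %d\n", cur_port);
			goto error;
		}
		opened++;
	}
error:
	return opened;
}
```

With the braces removed, open_ports() would return 0 for every input, because the goto would no longer depend on the result of the open call.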

-- Hal

> - KK

From jackm at mellanox.co.il Thu Apr 6 04:51:18 2006
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Thu, 6 Apr 2006 14:51:18 +0300
Subject: [openib-general] [PATCH] mad: use GID/LID on requester side when matching responses to requests
Message-ID: <200604061451.18624.jackm@mellanox.co.il>

Check the GID/LID on the requester side when searching for the request that
matches a received response. This guarantees uniqueness when the same TID is
used for requests sent via multiple source LIDs (when LMC is not zero).

To perform the check, the LMC must be added to the cache. (The previous patch
returned OK unconditionally for the LID check when the check was performed at
the requesting node.)

Further, do not perform the LID check for direct-routed packets, since the
permissive LID makes a proper check impossible.
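
The role of the LMC in this check can be sketched as follows. With LMC = n, a port answers to 2^n consecutive LIDs, and the low n bits of the LID are path bits, so two LIDs refer to the same port when they agree above those bits. This is a minimal userland sketch with hypothetical helper names (lid_belongs_to_port, same_path_bits); it is not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: does lid fall in the range of 2^lmc LIDs
 * assigned to the port whose base LID is base_lid?
 */
static int lid_belongs_to_port(uint16_t lid, uint16_t base_lid, uint8_t lmc)
{
	uint16_t mask;

	assert(lmc <= 7);	/* LMC is a 3-bit field */
	mask = (uint16_t)((1u << lmc) - 1);
	return (uint16_t)(lid & ~mask) == (uint16_t)(base_lid & ~mask);
}

/*
 * Hypothetical helper: compare only the low lmc path bits of two
 * values. With lmc == 0 there are no path bits, so any pair matches.
 */
static int same_path_bits(uint8_t a, uint8_t b, uint8_t lmc)
{
	assert(lmc <= 7);
	return ((a ^ b) & ((1u << lmc) - 1)) == 0;
}
```

For example, with a base LID of 0x10 and LMC = 2 the port owns LIDs 0x10 through 0x13, so 0x13 matches and 0x14 does not.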
Signed-off-by: Jack Morgenstein Index: src/drivers/infiniband/include/rdma/ib_verbs.h =================================================================== --- src/drivers/infiniband/include/rdma/ib_verbs.h (revision 6066) +++ src/drivers/infiniband/include/rdma/ib_verbs.h (working copy) @@ -822,6 +822,7 @@ struct ib_cache { struct ib_event_handler event_handler; struct ib_pkey_cache **pkey_cache; struct ib_gid_cache **gid_cache; + struct ib_lmc_cache *lmc_cache; }; struct ib_device { Index: src/drivers/infiniband/include/rdma/ib_cache.h =================================================================== --- src/drivers/infiniband/include/rdma/ib_cache.h (revision 6066) +++ src/drivers/infiniband/include/rdma/ib_cache.h (working copy) @@ -102,4 +102,17 @@ int ib_find_cached_pkey(struct ib_device u16 pkey, u16 *index); +/** + * ib_get_cached_lmc - Returns a cached lmc table entry + * @device: The device to query. + * @port_num: The port number of the device to query. + * @lmc: The lmc value for the specified port for that device. + * + * ib_get_cached_lmc() fetches the specified lmc table entry stored in + * the local software cache. + */ +int ib_get_cached_lmc(struct ib_device *device, + u8 port_num, + u8 *lmc); + #endif /* _IB_CACHE_H */ Index: src/drivers/infiniband/core/cache.c =================================================================== --- src/drivers/infiniband/core/cache.c (revision 6066) +++ src/drivers/infiniband/core/cache.c (working copy) @@ -59,6 +59,10 @@ struct ib_update_work { u8 port_num; }; +struct ib_lmc_cache { + u8 lmc[0]; +}; + static inline int start_port(struct ib_device *device) { return (device->node_type == RDMA_NODE_IB_SWITCH) ? 
0 : 1; @@ -191,6 +195,24 @@ int ib_find_cached_pkey(struct ib_device } EXPORT_SYMBOL(ib_find_cached_pkey); +int ib_get_cached_lmc(struct ib_device *device, + u8 port_num, + u8 *lmc) +{ + unsigned long flags; + int ret = 0; + + if (port_num < start_port(device) || port_num > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + *lmc = device->cache.lmc_cache->lmc[port_num - start_port(device)]; + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_get_cached_lmc); + static void ib_cache_update(struct ib_device *device, u8 port) { @@ -251,6 +273,8 @@ static void ib_cache_update(struct ib_de device->cache.pkey_cache[port - start_port(device)] = pkey_cache; device->cache.gid_cache [port - start_port(device)] = gid_cache; + device->cache.lmc_cache->lmc[port - start_port(device)] = tprops->lmc; + write_unlock_irq(&device->cache.lock); kfree(old_pkey_cache); @@ -305,7 +329,13 @@ static void ib_cache_setup_one(struct ib kmalloc(sizeof *device->cache.pkey_cache * (end_port(device) - start_port(device) + 1), GFP_KERNEL); - if (!device->cache.pkey_cache || !device->cache.gid_cache) { + device->cache.lmc_cache = kmalloc(sizeof *device->cache.lmc_cache * + (end_port(device) - + start_port(device) + 1), + GFP_KERNEL); + + if (!device->cache.pkey_cache || !device->cache.gid_cache || + !device->cache.lmc_cache) { printk(KERN_WARNING "Couldn't allocate cache " "for %s\n", device->name); goto err; @@ -333,6 +363,7 @@ err_cache: err: kfree(device->cache.pkey_cache); kfree(device->cache.gid_cache); + kfree(device->cache.lmc_cache); } static void ib_cache_cleanup_one(struct ib_device *device) @@ -349,6 +380,7 @@ static void ib_cache_cleanup_one(struct kfree(device->cache.pkey_cache); kfree(device->cache.gid_cache); + kfree(device->cache.lmc_cache); } static struct ib_client cache_client = { Index: src/drivers/infiniband/core/mad.c =================================================================== --- 
src/drivers/infiniband/core/mad.c (revision 6066) +++ src/drivers/infiniband/core/mad.c (working copy) @@ -34,6 +34,7 @@ * $Id$ */ #include +#include #include "mad_priv.h" #include "mad_rmpp.h" @@ -1669,20 +1670,21 @@ static inline int rcv_has_same_class(str rwc->recv_buf.mad->mad_hdr.mgmt_class; } -static inline int rcv_has_same_gid(struct ib_mad_send_wr_private *wr, +static inline int rcv_has_same_gid(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *wr, struct ib_mad_recv_wc *rwc ) { struct ib_ah_attr attr; u8 send_resp, rcv_resp; + union ib_gid sgid; + struct ib_device *device = mad_agent_priv->agent.device; + u8 port_num = mad_agent_priv->agent.port_num; + u8 lmc; send_resp = ((struct ib_mad *)(wr->send_buf.mad))-> mad_hdr.method & IB_MGMT_METHOD_RESP; rcv_resp = rwc->recv_buf.mad->mad_hdr.method & IB_MGMT_METHOD_RESP; - if (!send_resp && rcv_resp) - /* is request/response. GID/LIDs are both local (same). */ - return 1; - if (send_resp == rcv_resp) /* both requests, or both responses. GIDs different */ return 0; @@ -1691,48 +1693,78 @@ static inline int rcv_has_same_gid(struc /* Assume not equal, to avoid false positives. */ return 0; - if (!(attr.ah_flags & IB_AH_GRH) && !(rwc->wc->wc_flags & IB_WC_GRH)) - return attr.dlid == rwc->wc->slid; - else if ((attr.ah_flags & IB_AH_GRH) && - (rwc->wc->wc_flags & IB_WC_GRH)) - return memcmp(attr.grh.dgid.raw, - rwc->recv_buf.grh->sgid.raw, 16) == 0; - else + if (!!(attr.ah_flags & IB_AH_GRH) != + !!(rwc->wc->wc_flags & IB_WC_GRH)) /* one has GID, other does not. Assume different */ return 0; + + if (!send_resp && rcv_resp) { + /* is request/response. 
*/ + if (!(attr.ah_flags & IB_AH_GRH)) { + if (ib_get_cached_lmc(device, port_num, &lmc)) + return 0; + return (!lmc || !((attr.src_path_bits ^ + rwc->wc->dlid_path_bits) & + ((1 << lmc) - 1))); + } else { + if (ib_get_cached_gid(device, port_num, + attr.grh.sgid_index, &sgid)) + return 0; + return !memcmp(sgid.raw, rwc->recv_buf.grh->dgid.raw, + 16); + } + } + + if (!(attr.ah_flags & IB_AH_GRH)) + return attr.dlid == rwc->wc->slid; + else + return !memcmp(attr.grh.dgid.raw, rwc->recv_buf.grh->sgid.raw, + 16); } + +static inline int is_direct(u8 class) +{ + return (class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE); +} + struct ib_mad_send_wr_private* ib_find_send_mad(struct ib_mad_agent_private *mad_agent_priv, - struct ib_mad_recv_wc *mad_recv_wc) + struct ib_mad_recv_wc *wc) { - struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wr_private *wr; struct ib_mad *mad; - mad = (struct ib_mad *)mad_recv_wc->recv_buf.mad; + mad = (struct ib_mad *)wc->recv_buf.mad; - list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, - agent_list) { - if ((mad_send_wr->tid == mad->mad_hdr.tid) && - rcv_has_same_class(mad_send_wr, mad_recv_wc) && - rcv_has_same_gid(mad_send_wr, mad_recv_wc)) - return mad_send_wr; + list_for_each_entry(wr, &mad_agent_priv->wait_list, agent_list) { + if ((wr->tid == mad->mad_hdr.tid) && + rcv_has_same_class(wr, wc) && + /* + * Don't check GID for direct routed MADs. + * These might have permissive LIDs. 
+ */ + (is_direct(wc->recv_buf.mad->mad_hdr.mgmt_class) || + rcv_has_same_gid(mad_agent_priv, wr, wc))) + return wr; } /* * It's possible to receive the response before we've * been notified that the send has completed */ - list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, - agent_list) { - if (is_data_mad(mad_agent_priv, mad_send_wr->send_buf.mad) && - mad_send_wr->tid == mad->mad_hdr.tid && - mad_send_wr->timeout && - rcv_has_same_class(mad_send_wr, mad_recv_wc) && - rcv_has_same_gid(mad_send_wr, mad_recv_wc)) { + list_for_each_entry(wr, &mad_agent_priv->send_list, agent_list) { + if (is_data_mad(mad_agent_priv, wr->send_buf.mad) && + wr->tid == mad->mad_hdr.tid && + wr->timeout && + rcv_has_same_class(wr, wc) && + /* + * Don't check GID for direct routed MADs. + * These might have permissive LIDs. + */ + (is_direct(wc->recv_buf.mad->mad_hdr.mgmt_class) || + rcv_has_same_gid(mad_agent_priv, wr, wc))) /* Verify request has not been canceled */ - return (mad_send_wr->status == IB_WC_SUCCESS) ? - mad_send_wr : NULL; - } + return (wr->status == IB_WC_SUCCESS) ? wr : NULL; } return NULL; } From yael at mellanox.co.il Thu Apr 6 04:52:59 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Thu, 6 Apr 2006 14:52:59 +0300 Subject: [openib-general] RE: [PATCHv2] OpenSM: Fix osm_vendor_send for GSI classes Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FE89@mtlexch01.mtl.com> Hi Hal, The patch fixes the problem I saw. Please apply it. Thanks, Yael > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, April 05, 2006 3:55 PM > To: Yael Kalka > Cc: openib-general at openib.org > Subject: [PATCHv2] OpenSM: Fix osm_vendor_send for GSI classes > > Hi Yael, > > Below is a slightly modified version of the previous patch. It is a > complete fix for the problem you identified. Let me know if > this works for you and I will check it into both the trunk and 1.0 > branch. > > Thanks. 
> > -- Hal > > OpenSM: Fix osm_vendor_send for GSI classes > > Currently, the default for GSI classes assumes RMPP. There are two > groups of GSI classes: those which support RMPP and those which don't. > This patch handles them properly in osm_vendor_send. > > Problem pointed out by Yael Kalka > > Signed-off-by: Hal Rosenstock > > Index: include/iba/ib_types.h > =================================================================== > --- include/iba/ib_types.h (revision 6219) > +++ include/iba/ib_types.h (working copy) > @@ -515,6 +515,30 @@ BEGIN_C_DECLS > #define IB_MCLASS_VENDOR_LOW_RANGE_MAX 0x0f > /**********/ > > +/****d* IBA Base: Constants/IB_MCLASS_DEV_ADM > +* NAME > +* IB_MCLASS_DEV_ADM > +* > +* DESCRIPTION > +* Subnet Management Class, Device Administration > +* > +* SOURCE > +*/ > +#define IB_MCLASS_DEV_ADM 0x10 > +/**********/ > + > +/****d* IBA Base: Constants/IB_MCLASS_BIS > +* NAME > +* IB_MCLASS_BIS > +* > +* DESCRIPTION > +* Subnet Management Class, BIS > +* > +* SOURCE > +*/ > +#define IB_MCLASS_BIS 0x12 > +/**********/ > + > /****d* IBA Base: Constants/IB_MCLASS_VENDOR_HIGH_RANGE_MIN > * NAME > * IB_MCLASS_VENDOR_HIGH_RANGE_MIN > @@ -544,7 +568,7 @@ BEGIN_C_DECLS > * ib_class_is_vendor_specific_low > * > * DESCRIPTION > -* Indicitates if the Class Code if a vendor specific class from > +* Indicates if the Class Code if a vendor specific class from > * the low range > * > * SYNOPSIS > @@ -576,7 +600,7 @@ ib_class_is_vendor_specific_low( > * ib_class_is_vendor_specific_high > * > * DESCRIPTION > -* Indicitates if the Class Code if a vendor specific class from > +* Indicates if the Class Code if a vendor specific class from > * the high range > * > * SYNOPSIS > @@ -609,7 +633,7 @@ ib_class_is_vendor_specific_high( > * ib_class_is_vendor_specific > * > * DESCRIPTION > -* Indicitates if the Class Code if a vendor specific class > +* Indicates if the Class Code if a vendor specific class > * > * SYNOPSIS > */ > @@ -635,6 +659,38 @@ 
ib_class_is_vendor_specific( > * ib_class_is_vendor_specific_low, ib_class_is_vendor_specific_high > *********/ > > +/****f* IBA Base: Types/ib_class_is_rmpp > +* NAME > +* ib_class_is_rmpp > +* > +* DESCRIPTION > +* Indicates if the Class Code supports RMPP > +* > +* SYNOPSIS > +*/ > +static inline boolean_t > +ib_class_is_rmpp( > + IN const uint8_t class_code ) > +{ > + return( (class_code == IB_MCLASS_SUBN_ADM) || > + (class_code == IB_MCLASS_DEV_MGMT) || > + (class_code == IB_MCLASS_DEV_ADM) || > + (class_code == IB_MCLASS_BIS) || > + ib_class_is_vendor_specific_high( class_code ) ); > +} > +/* > +* PARAMETERS > +* class_code > +* [in] The Management Datagram Class Code > +* > +* RETURN VALUE > +* TRUE if the class supports RMPP > +* FALSE otherwise. > +* > +* NOTES > +* > +*********/ > + > /* > * MAD methods > */ > @@ -1811,7 +1867,7 @@ ib_pkey_get_base( > * ib_pkey_is_full_member > * > * DESCRIPTION > -* Indicitates if the port is a full member of the parition. > +* Indicates if the port is a full member of the parition. 
> * > * SYNOPSIS > */ > Index: libvendor/osm_vendor_ibumad.c > =================================================================== > --- libvendor/osm_vendor_ibumad.c (revision 6219) > +++ libvendor/osm_vendor_ibumad.c (working copy) > @@ -1044,16 +1044,17 @@ osm_vendor_send( > CL_ASSERT( p_vw->h_bind == h_bind ); > CL_ASSERT( p_mad == umad_get_mad(p_vw->umad) ); > > - switch (p_mad->mgmt_class) { > - case IB_MCLASS_SUBN_DIR: > + if (p_mad->mgmt_class == IB_MCLASS_SUBN_DIR) { > umad_set_addr_net(p_vw->umad, 0xffff, 0, 0, 0); > umad_set_grh(p_vw->umad, 0); > - break; > - case IB_MCLASS_SUBN_LID: > + goto Resp; > + } > + if (p_mad->mgmt_class == IB_MCLASS_SUBN_LID) { > umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, 0, 0, 0); > umad_set_grh(p_vw->umad, 0); > - break; > - default: /* GSI FIXME: no GRH */ > + goto Resp; > + } > + if (ib_class_is_rmpp(p_mad->mgmt_class)) { /* RMPP GSI classes > FIXME: no GRH */ > umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, > p_mad_addr->addr_type.gsi.remote_qp, > p_mad_addr->addr_type.gsi.service_level, > @@ -1086,9 +1087,16 @@ osm_vendor_send( > p_sa->paylen_newwin = cl_ntoh32(paylen); > } > #endif > - break; > + } else { /* non RMPP GSI classes FIXME: no GRH */ > + umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, > + p_mad_addr->addr_type.gsi.remote_qp, > + p_mad_addr->addr_type.gsi.service_level, > + IB_QP1_WELL_KNOWN_Q_KEY); > + umad_set_grh(p_vw->umad, 0); /* FIXME: GRH support */ > + umad_set_pkey(p_vw->umad, p_mad_addr->addr_type.gsi.pkey); > } > > +Resp: > if (resp_expected) > put_madw(p_vend, p_madw, &p_mad->trans_id); > > From mst at mellanox.co.il Thu Apr 6 05:05:30 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 15:05:30 +0300 Subject: [openib-general] Re: what is the status of sdp..........? 
In-Reply-To: <20060406102800.5791.qmail@web8325.mail.in.yahoo.com>
References: <20060406102800.5791.qmail@web8325.mail.in.yahoo.com>
Message-ID: <20060406120530.GK21115@mellanox.co.il>

Quoting r. keshetti mahesh :
> Subject: what is the status of sdp..........?
>
> hello all
>
> i have recently started working over SDP in my lab
> and i came to know that still it is under development
>
> i want to participate in the development
>
> can anybody tell me what is the current progress in SDP and how i can involve in it.............also is ther any spl. mailing list for the SDP?
>
> regards
> K.Mahehs

Take a look at libsdp - this needs help most of all:
https://openib.org/svn/gen2/trunk/src/userspace/libsdp

-- MST

From mst at mellanox.co.il Thu Apr 6 05:07:25 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 6 Apr 2006 15:07:25 +0300
Subject: [openib-general] Re: what is the status of sdp..........?
In-Reply-To: <20060406120530.GK21115@mellanox.co.il>
References: <20060406102800.5791.qmail@web8325.mail.in.yahoo.com> <20060406120530.GK21115@mellanox.co.il>
Message-ID: <20060406120725.GL21115@mellanox.co.il>

Quoting r. Michael S. Tsirkin :
> > hello all
> >
> > i have recently started working over SDP in my lab
> > and i came to know that still it is under development
> >
> > i want to participate in the development
> >
> > can anybody tell me what is the current progress in SDP and how i can involve in it.............also is ther any spl. mailing list for the SDP?
> >
> > regards
> > K.Mahehs
>
> Take a look at libsdp - this needs help most of all:
> https://openib.org/svn/gen2/trunk/src/userspace/libsdp

Another missing SDP bit that no one is working on is IPv6 support in ib_addr.
https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/core/addr.c

-- MST

From dotanb at mellanox.co.il Thu Apr 6 05:12:08 2006
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Thu, 6 Apr 2006 15:12:08 +0300
Subject: [openib-general] how to execute the dtest?
In-Reply-To: <4433F452.8030807@ichips.intel.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E301D16BE2@mtlexch01.mtl.com> <200604051827.28808.dotanb@mellanox.co.il> <4433F452.8030807@ichips.intel.com>
Message-ID: <200604061512.08387.dotanb@mellanox.co.il>

On Wednesday 05 April 2006 19:46, Arlin Davis wrote:
> Dotan Barak wrote:
>
> >Some more info:
> >
> >when i changed the dat.conf to be:
> >OpenIB-cma u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> >OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "192.168.0.22 0" ""
> >OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "svr1-ib0 0" ""
> >OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/local/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> >OpenIB-scm1 u1.2 nonthreadsafe default /usr/local/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
> >OpenIB-scm2 u1.2 nonthreadsafe default /usr/local/lib/libdaplscm.so mv_dapl.1.2 "mthca0 2" ""
> >
> >
> >
> The dtest makefile builds with DAPL_PROVIDER == OpenIB-cma-ip by default
> so it will use the second line of the configuration file. This requires
> the IP address of the IB device on your system to be supplied in the
> dat.conf. Change the default IP address (192.168.0.22) to match your ib
> device network address that you ifconfig'ed.
>
> -arlin

Thanks for the help. After making those fixes the test worked (there are still failures that I need to debug, but that is a different story).

Dotan

From ogerlitz at voltaire.com Thu Apr 6 05:15:25 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 6 Apr 2006 15:15:25 +0300 (IDT)
Subject: [openib-general] [PATCH 0/8] [RFC] iSER initiator
Message-ID:

The patch series that follows is the second RFC sent to the openib community asking for comments on the iSER (iSCSI Extensions for RDMA) code. We target going upstream to 2.6.18, so again any feedback is welcome.

Following the first RFC, the code has gone through two sets of changes: first, the RFC comments were applied, and second, it was integrated with two big changes in the kernel part of open iscsi.

The RFC comments that were applied are:

+1 change iser_adaptor to iser_device or any better name
+2 session->state is no longer volatile (it no longer exists in its previous form)
+3 killed ISER_HDR_LEN ISER_PDU_BHS_LENGTH
+4 killed USE_OFFSET USE_NO_OFFSET USE_SIZE USE_ENTIRE_SIZE
+5 killed _p suffix for pointers all over the code
+6 use goto-unwinding, that is, each failure jumps to a different error label, so that there is no single label with many if()s under it; e.g. see send_data_out_error
+7 remove usage of GFP_NOFAIL, as it does not work with kmalloc/kmem_cache_alloc
+8 SG-ify a SCSI command which is "single" (i.e. sc->use_sg is 0); we also have a patch ready that would BUG when it gets such a command from the SCSI ML
+9 an LLD cannot assume it is fine to call page_address(sg->page), as the page might not be mapped into the kernel virtual address space, so kmap_atomic is used

The changes in iscsi are:

+1 introduce libiscsi - iscsi kernel library (module) at include/scsi/libiscsi.h and drivers/scsi/libiscsi.c. It is common code to be used by kernel iscsi transports. The two current transports, TCP and iSER, were ported to it.
 - with this change, more than 1.3K LOC were moved from drivers/scsi/iscsi_tcp.[ch] and drivers/infiniband/ulp/iscsi_iser.[ch] into libiscsi.[ch]
+2 allow for the iscsi connection establishment/teardown to be done from the kernel.
 - this change removed the need for iser to expose a pseudo AF_ISER socket to be used by open iscsi user space code. Now the oiscsi user code does the connection establishment for TCP from user space (socket) and for iSER (struct iser_conn) from the kernel.
 - with this change, iser_socket.[ch] are removed.

Both iscsi changes are to be sent in the coming days to the linux-scsi maintainers for review and an upstream push for 2.6.18.
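
Item +6 of the RFC comments listed above (goto-unwinding) can be sketched in stand-alone C as follows. This is a user-space illustration only, not code from the iSER patches: struct demo_conn, demo_alloc() and demo_conn_init() are hypothetical names, malloc/free stand in for the kernel allocators, and the fail_at parameter exists purely so that each unwind path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical resources, for illustration only. */
struct demo_conn {
    char *hdr_buf;
    char *data_buf;
    char *page_vec;
};

/* Fails the allocation whose step number equals fail_at (0 = never). */
static void *demo_alloc(size_t len, int step, int fail_at)
{
    return (step == fail_at) ? NULL : malloc(len);
}

static int demo_conn_init(struct demo_conn *c, int fail_at)
{
    assert(c != NULL);

    c->hdr_buf = demo_alloc(64, 1, fail_at);
    if (!c->hdr_buf)
        goto err_hdr;

    c->data_buf = demo_alloc(4096, 2, fail_at);
    if (!c->data_buf)
        goto err_data;

    c->page_vec = demo_alloc(512, 3, fail_at);
    if (!c->page_vec)
        goto err_vec;

    return 0;

    /* Each label releases only what was acquired before the failure. */
err_vec:
    free(c->data_buf);
    c->data_buf = NULL;
err_data:
    free(c->hdr_buf);
    c->hdr_buf = NULL;
err_hdr:
    return -ENOMEM;
}
```

The labels are ordered in reverse of the acquisitions, so a failure at any step falls through exactly the cleanups it needs and no error path has to test which resources exist.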
They are also present as the last two patches in this RFC.

The iser changes related to the two iscsi changes are not committed yet, so this code differs from the code present at the openib svn head.

The code has been tested with 2.6.16, openib r5919 and the user part of the open-iscsi svn head (r529), where the kernel part of open iscsi was replaced by oiscsi svn patched with both the library and the ep_callbacks patches.

Or.

From ogerlitz at voltaire.com Thu Apr 6 05:16:31 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 6 Apr 2006 15:16:31 +0300 (IDT)
Subject: [openib-general] [PATCH 1/8] [RFC] iscsi_iser header file
In-Reply-To:
Message-ID:

--- iser-libiscsi-canq2-ep-null/iscsi_iser.h 2006-04-06 14:47:19.000000000 +0300
+++ iser-libiscsi-canq2-ep/iscsi_iser.h 2006-04-06 11:30:55.000000000 +0300
@@ -0,0 +1,355 @@
+/*
+ * iSER transport for the Open iSCSI Initiator & iSER transport internals
+ *
+ * Copyright (C) 2004 Dmitry Yusupov
+ * Copyright (C) 2004 Alex Aizman
+ * Copyright (C) 2005 Mike Christie
+ * based on code maintained by open-iscsi at googlegroups.com
+ *
+ * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: iscsi_iser.h 5924 2006-03-21 12:21:43Z ogerlitz $ + */ +#ifndef __ISCSI_ISER_H__ +#define __ISCSI_ISER_H__ + +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include + +#define PFX "iser:" + +#define iser_dbg(fmt, arg...) \ + do { \ + if (iser_debug_level > 0) \ + printk(KERN_DEBUG PFX "%s:" fmt,\ + __func__ , ## arg); \ + } while (0) + +#define iser_err(fmt, arg...) \ + do { \ + printk(KERN_ERR PFX "%s:" fmt, \ + __func__ , ## arg); \ + } while (0) + +#define iser_bug(fmt,arg...) \ + do { \ + printk(KERN_ERR PFX "%s: PANIC! 
" fmt, \ + __func__ , ## arg); \ + BUG(); \ + } while(0) + +#define ISCSI_ISER_XMIT_CMDS_MAX 128 /* must be power of 2 */ +#define ISCSI_ISER_MGMT_CMDS_MAX 32 /* must be power of 2 */ + /* support upto 512KB in one RDMA */ +#define ISCSI_ISER_SG_TABLESIZE (0x80000 >> PAGE_SHIFT) +#define ISCSI_ISER_CMD_PER_LUN ISCSI_ISER_XMIT_CMDS_MAX +#define ISCSI_ISER_MAX_LUN 256 +#define ISCSI_ISER_MAX_CMD_LEN 16 + +/* QP settings */ +/* Maximal bounds on received asynchronous PDUs */ +#define ISER_MAX_RX_MISC_PDUS 4 /* NOOP_IN(2) , ASYNC_EVENT(2) */ + +#define ISER_MAX_TX_MISC_PDUS 6 /* NOOP_OUT(2), TEXT(1), * + * SCSI_TMFUNC(2), LOGOUT(1) */ + +#define ISER_QP_MAX_RECV_DTOS (ISCSI_ISER_XMIT_CMDS_MAX + \ + ISER_MAX_RX_MISC_PDUS + \ + ISER_MAX_TX_MISC_PDUS) + +/* the max TX (send) WR supported by the iSER QP is defined by * + * max_send_wr = T * (1 + D) + C ; D is how many inflight dataouts we expect * + * to have at max for SCSI command. The tx posting & completion handling code * + * supports -EAGAIN scheme where tx is suspended till the QP has room for more * + * send WR. 
D=8 comes from 64K/8K */ + +#define ISER_INFLIGHT_DATAOUTS 8 + +#define ISER_QP_MAX_REQ_DTOS (ISCSI_ISER_XMIT_CMDS_MAX * \ + (1 + ISER_INFLIGHT_DATAOUTS) + \ + ISER_MAX_TX_MISC_PDUS + \ + ISER_MAX_RX_MISC_PDUS) + +#define ISER_VER 0x10 +#define ISER_WSV 0x08 +#define ISER_RSV 0x04 + +struct iser_hdr { + u8 flags; + u8 rsvd[3]; + __be32 write_stag; /* write rkey */ + __be64 write_va; + __be32 read_stag; /* read rkey */ + __be64 read_va; +} __attribute__((packed)); + + +/* Length of an object name string */ +#define ISER_OBJECT_NAME_SIZE 64 + +enum iser_ib_conn_state { + ISER_CONN_INIT, /* descriptor allocd, no conn */ + ISER_CONN_PENDING, /* in the process of being established */ + ISER_CONN_UP, /* up and running */ + ISER_CONN_TERMINATING, /* in the process of being terminated */ + ISER_CONN_DOWN, /* shut down */ + ISER_CONN_STATES_NUM +}; + +enum iser_task_status { + ISER_TASK_STATUS_INIT = 0, + ISER_TASK_STATUS_STARTED, + ISER_TASK_STATUS_COMPLETED +}; + +enum iser_data_dir { + ISER_DIR_IN = 0, /* to initiator */ + ISER_DIR_OUT, /* from initiator */ + ISER_DIRS_NUM +}; + +struct iser_data_buf { + void *buf; /* pointer to the sg list */ + unsigned int size; /* num entries of this sg */ + unsigned long data_len; /* total data len */ + unsigned int dma_nents; /* returned by dma_map_sg */ + char *copy_buf; /* allocated copy buf for SGs unaligned * + * for rdma which are copied */ + struct scatterlist sg_single; /* SG-ified clone of a non SG SC or * + * unaligned SG */ + }; + +/* fwd declarations */ +struct iser_device; +struct iscsi_iser_conn; +struct iscsi_iser_cmd_task; + +struct iser_mem_reg { + u32 lkey; + u32 rkey; + u64 va; + u64 len; + void *mem_h; +}; + +struct iser_regd_buf { + struct iser_mem_reg reg; /* memory registration info */ + void *virt_addr; + struct iser_device *device; /* device->device for dma_unmap */ + dma_addr_t dma_addr; /* if non zero, addr for dma_unmap */ + enum dma_data_direction direction; /* direction for dma_unmap */ + unsigned int 
data_size; + atomic_t ref_count; /* refcount, freed when dec to 0 */ +}; + +#define MAX_REGD_BUF_VECTOR_LEN 2 + +struct iser_dto { + struct iscsi_iser_cmd_task *ctask; + struct iscsi_iser_conn *conn; + int notify_enable; + + /* vector of registered buffers */ + unsigned int regd_vector_len; + struct iser_regd_buf *regd[MAX_REGD_BUF_VECTOR_LEN]; + + /* offset into the registered buffer may be specified */ + unsigned int offset[MAX_REGD_BUF_VECTOR_LEN]; + + /* a smaller size may be specified, if 0, then full size is used */ + unsigned int used_sz[MAX_REGD_BUF_VECTOR_LEN]; +}; + +enum iser_desc_type { + ISCSI_RX, + ISCSI_TX_CONTROL , + ISCSI_TX_SCSI_COMMAND, + ISCSI_TX_DATAOUT +}; + +struct iser_desc { + struct iser_hdr iser_header; + struct iscsi_hdr iscsi_header; + struct iser_regd_buf hdr_regd_buf; + void *data; /* used by RX & TX_CONTROL */ + struct iser_regd_buf data_regd_buf; /* used by RX & TX_CONTROL */ + enum iser_desc_type type; + struct iser_dto dto; +}; + +struct iser_device { + struct ib_device *ib_device; + struct ib_pd *pd; + struct ib_cq *cq; + struct ib_mr *mr; + struct tasklet_struct cq_tasklet; + struct list_head ig_list; /* entry in ig devices list */ + int refcount; +}; + +struct iser_conn +{ + struct iscsi_iser_conn *iscsi_conn; /* iscsi conn for upcalls */ + atomic_t state; /* rdma connection state */ + struct iser_device *device; /* device context */ + struct rdma_cm_id *cma_id; /* CMA ID */ + struct ib_qp *qp; /* QP */ + struct ib_fmr_pool *fmr_pool; /* pool of IB FMRs */ + int disc_evt_flag; /* disconn event delivered */ + wait_queue_head_t wait; /* waitq for conn/disconn */ + atomic_t post_recv_buf_count; /* posted rx count */ + atomic_t post_send_buf_count; /* posted tx count */ + struct work_struct comperror_work; /* conn term sleepable ctx*/ + char name[ISER_OBJECT_NAME_SIZE]; + struct iser_page_vec *page_vec; /* represents SG to fmr maps* + * maps serialized as tx is*/ +}; + +struct iscsi_iser_conn { + struct iscsi_conn *iscsi_conn;/* 
ptr to iscsi conn */ + struct iser_conn *ib_conn; /* iSER IB conn */ + + rwlock_t lock; +}; + +struct iscsi_iser_cmd_task { + struct iser_desc desc; + struct iscsi_iser_conn *iser_conn; + int rdma_data_count;/* RDMA bytes */ + enum iser_task_status status; + int command_sent; /* set if command sent */ + int dir[ISER_DIRS_NUM]; /* set if dir use*/ + struct iser_regd_buf rdma_regd[ISER_DIRS_NUM];/* regd rdma buf */ + struct iser_data_buf data[ISER_DIRS_NUM]; /* orig. data des*/ + struct iser_data_buf data_copy[ISER_DIRS_NUM];/* contig. copy */ +}; + +struct iser_page_vec { + u64 *pages; + int length; + int offset; + int data_size; +}; + +struct iser_global { + struct mutex device_list_mutex;/* */ + struct list_head device_list; /* all iSER devices */ + + kmem_cache_t *desc_cache; +}; + +extern struct iser_global ig; +extern int iser_debug_level; + +/* allocate connection resources needed for rdma functionality */ +int iser_conn_set_full_featured_mode(struct iscsi_conn *conn); + +int iser_send_control(struct iscsi_conn *conn, + struct iscsi_mgmt_task *mtask); + +int iser_send_command(struct iscsi_conn *conn, + struct iscsi_cmd_task *ctask); + +int iser_send_data_out(struct iscsi_conn *conn, + struct iscsi_cmd_task *ctask, + struct iscsi_data *hdr); + +int iscsi_iser_recv(struct iscsi_conn *conn, + struct iscsi_hdr *hdr, + char *rx_data, + int rx_data_len); + +int iser_conn_init(struct iser_conn *ib_conn); + +void iser_conn_terminate(struct iser_conn *ib_conn); + +void iser_rcv_completion(struct iser_desc *desc, + unsigned long dto_xfer_len); + +void iser_snd_completion(struct iser_desc *desc); + +void iser_ctask_rdma_init(struct iscsi_iser_cmd_task *ctask); + +void iser_ctask_rdma_finalize(struct iscsi_iser_cmd_task *ctask); + +void iser_dto_buffs_release(struct iser_dto *dto); + +int iser_regd_buff_release(struct iser_regd_buf *regd_buf); + +void iser_reg_single(struct iser_device *device, + struct iser_regd_buf *regd_buf, + enum dma_data_direction direction); + +int 
iser_sg_size(struct iser_data_buf *mem); + +int iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task *ctask, + enum iser_data_dir cmd_dir); + +void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *ctask, + enum iser_data_dir cmd_dir); + +int iser_reg_rdma_mem(struct iscsi_iser_cmd_task *ctask, + enum iser_data_dir cmd_dir); + +int iser_connect(struct iser_conn *ib_conn, + struct sockaddr_in *src_addr, + struct sockaddr_in *dst_addr, + int non_blocking); + +int iser_reg_page_vec(struct iser_conn *ib_conn, + struct iser_page_vec *page_vec, + struct iser_mem_reg *mem_reg); + +void iser_unreg_mem(struct iser_mem_reg *mem_reg); + +int iser_post_recv(struct iser_desc *rx_desc); +int iser_post_send(struct iser_desc *tx_desc); +#endif From ogerlitz at voltaire.com Thu Apr 6 05:17:19 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 6 Apr 2006 15:17:19 +0300 (IDT) Subject: [openib-general] [PATCH 2/8] [RFC] open iscsi iser transport provider code In-Reply-To: Message-ID: --- iser-libiscsi-canq2-ep-null/iscsi_iser.c 2006-04-06 14:47:18.000000000 +0300 +++ iser-libiscsi-canq2-ep/iscsi_iser.c 2006-04-06 14:14:08.000000000 +0300 @@ -0,0 +1,779 @@ +/* + * iSCSI Initiator over iSER Data-Path + * + * Copyright (C) 2004 Dmitry Yusupov + * Copyright (C) 2004 Alex Aizman + * Copyright (C) 2005 Mike Christie + * Copyright (c) 2005, 2006 Voltaire, Inc. All rights reserved. + * maintained by openib-general at openib.org + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Credits: + * Christoph Hellwig + * FUJITA Tomonori + * Arne Redlich + * Zhenyu Wang + * Modified by: + * Erez Zilber + * + * + * $Id: iscsi_iser.c 5924 2006-03-21 12:21:43Z ogerlitz $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "iscsi_iser.h" + +#ifdef DEBUG_ISER +#define debug_iser(fmt...) printk(KERN_DEBUG "iser: " fmt) +#else +#define debug_iser(fmt...) +#endif + +#ifdef DEBUG_SCSI +#define debug_scsi(fmt...) printk(KERN_DEBUG "scsi: " fmt) +#else +#define debug_scsi(fmt...) 
+#endif + +static unsigned int iscsi_max_lun = 512; +module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO); + +#define DRV_VER "$Rev: 227 $" +#define DRV_DATE "$LastChangedDate: 2006-03-22 16:47:30 +0200 (Wed, 22 Mar 2006) $" + +int iser_debug_level = 0; + +MODULE_DESCRIPTION("iSER (iSCSI Extensions for RDMA) Datamover " + "v" DRV_VER "(" DRV_DATE ")"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Alex Nezhinsky, Dan Bar Dov"); + +module_param_named(debug_level, iser_debug_level, int, 0644); +MODULE_PARM_DESC(debug_level,"Enable debug tracing if > 0 (default:disabled)"); + +struct iser_global ig; + +int +iscsi_iser_recv(struct iscsi_conn *conn, + struct iscsi_hdr *hdr, char *rx_data, int rx_data_len) +{ + int rc = 0; + uint32_t ret_itt; + int datalen; + int ahslen; + + /* verify PDU length */ + datalen = ntoh24(hdr->dlength); + if (datalen != rx_data_len) { + printk(KERN_ERR "iscsi_iser: datalen %d (hdr) != %d (IB) \n", + datalen, rx_data_len); + return ISCSI_ERR_DATALEN; + } + + /* read AHS */ + ahslen = hdr->hlength * 4; + + /* verify itt (itt encoding: age+cid+itt) */ + rc = iscsi_verify_itt(conn, hdr, &ret_itt); + if (rc) + return rc; /* FIXME can we get here ISCSI_ERR_NO_SCSI_CMD ? 
*/ + + rc = iscsi_complete_pdu(conn, hdr, rx_data, rx_data_len); + + return rc; +} + + +/** + * iscsi_iser_cmd_init - Initialize iSCSI SCSI_READ or SCSI_WRITE commands + * + **/ +static void +iscsi_iser_cmd_init(struct iscsi_cmd_task *ctask) +{ + struct iscsi_iser_conn *iser_conn = ctask->conn->dd_data; + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + struct scsi_cmnd *sc = ctask->sc; + + iser_ctask->command_sent = 0; + iser_ctask->iser_conn = iser_conn; + + if (sc->sc_data_direction == DMA_TO_DEVICE) { + BUG_ON(ctask->total_length == 0); + /* bytes to be sent via RDMA operations */ + iser_ctask->rdma_data_count = ctask->total_length - + ctask->imm_count - + ctask->unsol_count; + + debug_scsi("cmd [itt %x total %d imm %d unsol_data %d " + "rdma_data %d]\n", + ctask->itt, ctask->total_length, ctask->imm_count, + ctask->unsol_count, iser_ctask->rdma_data_count); + } else + /* bytes to be sent via RDMA operations */ + iser_ctask->rdma_data_count = ctask->total_length; + + iser_ctask_rdma_init(iser_ctask); +} + +/** + * iscsi_iser_mtask_xmit - xmit management(immediate) task + * @conn: iscsi connection + * @mtask: task management task + * + * Notes: + * The function can return -EAGAIN in which case caller must + * call it again later, or recover. '0' return code means successful + * xmit. + * + **/ +static int +iscsi_iser_mtask_xmit(struct iscsi_conn *conn, + struct iscsi_mgmt_task *mtask) +{ + int error = 0; + + debug_scsi("mtask deq [cid %d itt 0x%x]\n", conn->id, mtask->itt); + + error = iser_send_control(conn, mtask); + + /* since iser xmits control with zero copy, mtasks can not be recycled + * right after sending them. 
+ * The recycling scheme is based on whether a response is expected + * - if yes, the mtask is recycled at iscsi_complete_pdu + * - if no, the mtask is recycled at iser_snd_completion + */ + if (error && error != -EAGAIN) + iscsi_conn_failure(conn, ISCSI_ERR_CONN_FAILED); + + return error; +} + +static int +iscsi_iser_ctask_xmit_unsol_data(struct iscsi_conn *conn, + struct iscsi_cmd_task *ctask) +{ + struct iscsi_data hdr; + int error = 0; + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + + /* Send data-out PDUs while there's still unsolicited data to send */ + while (ctask->unsol_count > 0) { + iscsi_prep_unsolicit_data_pdu(ctask, &hdr, + iser_ctask->rdma_data_count); + + debug_scsi("Sending data-out: itt 0x%x, data count %d\n", + hdr.itt, ctask->data_count); + + /* the buffer description has been passed with the command */ + /* Send the command */ + error = iser_send_data_out(conn, ctask, &hdr); + if (error) { + ctask->unsol_datasn--; + goto iscsi_iser_ctask_xmit_unsol_data_exit; + } + ctask->unsol_count -= ctask->data_count; + debug_scsi("Need to send %d more as data-out PDUs\n", + ctask->unsol_count); + } + +iscsi_iser_ctask_xmit_unsol_data_exit: + return error; +} + +static int +iscsi_iser_ctask_xmit(struct iscsi_conn *conn, + struct iscsi_cmd_task *ctask) +{ + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + int error = 0; + + debug_scsi("ctask deq [cid %d itt 0x%x]\n", + conn->id, ctask->itt); + + /* + * serialize with TMF AbortTask + */ + if (ctask->mtask) + return error; + + /* Send the cmd PDU */ + if (!iser_ctask->command_sent) { + error = iser_send_command(conn, ctask); + if (error) + goto iscsi_iser_ctask_xmit_exit; + iser_ctask->command_sent = 1; + } + + /* Send unsolicited data-out PDU(s) if necessary */ + if (ctask->unsol_count) + error = iscsi_iser_ctask_xmit_unsol_data(conn, ctask); + + iscsi_iser_ctask_xmit_exit: + if (error && error != -EAGAIN) + iscsi_conn_failure(conn, ISCSI_ERR_CONN_FAILED); + return error; +} + +static 
void +iscsi_iser_cleanup_ctask(struct iscsi_conn *conn, struct iscsi_cmd_task *ctask) +{ + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + + if(iser_ctask->status == ISER_TASK_STATUS_STARTED) { + /* FIXME - deallocate here the iser ctask associated + * resources eg do dma unmapping / fmr unmapping / etc + * also make sure iser ctask status changes are atomic + */ + } +} + +static struct iscsi_cls_conn * +iscsi_iser_conn_create(struct iscsi_cls_session *cls_session, uint32_t conn_idx) +{ + struct iscsi_conn *conn; + struct iscsi_cls_conn *cls_conn; + struct iscsi_iser_conn *iser_conn; + + cls_conn = iscsi_conn_setup(cls_session, conn_idx); + if (!cls_conn) + return NULL; + conn = cls_conn->dd_data; + + /* + * due to issues with the login code re iser semantics + * this is not set in iscsi_conn_setup - FIXME + */ + conn->max_recv_dlength = 128; + + iser_conn = kzalloc(sizeof(*iser_conn), GFP_KERNEL); + if (!iser_conn) + goto conn_alloc_fail; + + /* currently this is the only field which needs to be initialized */ + rwlock_init(&iser_conn->lock); + + conn->recv_lock = &iser_conn->lock; + + conn->dd_data = iser_conn; + iser_conn->iscsi_conn = conn; + + return cls_conn; + +conn_alloc_fail: + iscsi_conn_teardown(cls_conn); + return NULL; +} + +static void +iscsi_iser_conn_destroy(struct iscsi_cls_conn *cls_conn) +{ + struct iscsi_conn *conn = cls_conn->dd_data; + struct iscsi_iser_conn *iser_conn = conn->dd_data; + + iscsi_conn_teardown(cls_conn); + kfree(iser_conn); +} + +static int +iscsi_iser_conn_bind(struct iscsi_cls_session *cls_session, + struct iscsi_cls_conn *cls_conn, uint64_t transport_eph, + int is_leading) +{ + struct iscsi_conn *conn = cls_conn->dd_data; + struct iscsi_iser_conn *iser_conn = conn->dd_data; + struct iser_conn *ib_conn; + int error; + + error = iscsi_conn_bind(cls_session, cls_conn, is_leading); + if (error) + return error; + + if (conn->stop_stage != STOP_CONN_SUSPEND) { + /* binds the iSER connection retrieved from the previously + * 
connected ep_handle to the iSCSI layer connection. exchanges + * connection pointers */ + ib_conn = (struct iser_conn *)transport_eph; + iser_err("binding iscsi conn %p to iser_conn %p\n",conn,ib_conn); + ib_conn->iscsi_conn = iser_conn; + iser_conn->ib_conn = ib_conn; + } + + return 0; +} + +static int +iscsi_iser_conn_start(struct iscsi_cls_conn *cls_conn) +{ + struct iscsi_conn *conn = cls_conn->dd_data; + int err; + + err = iscsi_conn_start(cls_conn); + if (err) + return err; + + return iser_conn_set_full_featured_mode(conn); +} + +static void +iscsi_iser_conn_terminate(struct iscsi_conn *conn) +{ + struct iscsi_iser_conn *iser_conn = conn->dd_data; + + BUG_ON(!iser_conn->ib_conn); + + /* starts conn teardown process, waits until all previously * + * posted buffers get flushed, deallocates all conn resources */ + iser_conn_terminate(iser_conn->ib_conn); + iser_conn->ib_conn = NULL; + conn->recv_lock = NULL; +} + + +static struct iscsi_transport iscsi_iser_transport; + +static struct iscsi_cls_session * +iscsi_iser_session_create(struct iscsi_transport *iscsit, + struct scsi_transport_template *scsit, + uint32_t initial_cmdsn, uint32_t cmds_max, + uint32_t *hostno) +{ + struct iscsi_cls_session *cls_session; + struct iscsi_session *session; + int i; + uint32_t hn; + struct iscsi_cmd_task *ctask; + struct iscsi_mgmt_task *mtask; + struct iscsi_iser_cmd_task *iser_ctask; + struct iser_desc *desc; + + cls_session = iscsi_session_setup(iscsit, scsit, cmds_max, + sizeof(struct iscsi_iser_cmd_task), + sizeof(struct iser_desc), + initial_cmdsn, &hn); + if (!cls_session) + return NULL; + + *hostno = hn; + session = class_to_transport_session(cls_session); + + /* libiscsi setup itts, data and pool so just set desc fields */ + for (i = 0; i < session->cmds_max; i++) { + ctask = session->cmds[i]; + iser_ctask = ctask->dd_data; + ctask->hdr = (struct iscsi_cmd *)&iser_ctask->desc.iscsi_header; + } + + for (i = 0; i < session->mgmtpool_max; i++) { + mtask = 
session->mgmt_cmds[i]; + desc = mtask->dd_data; + mtask->hdr = &desc->iscsi_header; + desc->data = mtask->data; + } + + return cls_session; +} + +static int +iscsi_iser_conn_set_param(struct iscsi_cls_conn *cls_conn, + enum iscsi_param param, uint32_t value) +{ + struct iscsi_conn *conn = cls_conn->dd_data; + struct iscsi_session *session = conn->session; + + spin_lock_bh(&session->lock); + if (conn->c_stage != ISCSI_CONN_INITIAL_STAGE && + conn->stop_stage != STOP_CONN_RECOVER) { + printk(KERN_ERR "iscsi_iser: can not change parameter [%d]\n", + param); + spin_unlock_bh(&session->lock); + return 0; + } + spin_unlock_bh(&session->lock); + + switch (param) { + case ISCSI_PARAM_MAX_RECV_DLENGTH: + /* TBD */ + break; + case ISCSI_PARAM_MAX_XMIT_DLENGTH: + conn->max_xmit_dlength = value; + break; + case ISCSI_PARAM_HDRDGST_EN: + if (value) { + printk(KERN_ERR "HeaderDigest wasn't negotiated to None\n"); + return -EPROTO; + } + break; + case ISCSI_PARAM_DATADGST_EN: + if (value) { + printk(KERN_ERR "DataDigest wasn't negotiated to None\n"); + return -EPROTO; + } + break; + case ISCSI_PARAM_INITIAL_R2T_EN: + session->initial_r2t_en = value; + break; + case ISCSI_PARAM_IMM_DATA_EN: + session->imm_data_en = value; + break; + case ISCSI_PARAM_FIRST_BURST: + session->first_burst = value; + break; + case ISCSI_PARAM_MAX_BURST: + session->max_burst = value; + break; + case ISCSI_PARAM_PDU_INORDER_EN: + session->pdu_inorder_en = value; + break; + case ISCSI_PARAM_DATASEQ_INORDER_EN: + session->dataseq_inorder_en = value; + break; + case ISCSI_PARAM_ERL: + session->erl = value; + break; + case ISCSI_PARAM_IFMARKER_EN: + if (value) { + printk(KERN_ERR "IFMarker wasn't negotiated to No\n"); + return -EPROTO; + } + break; + case ISCSI_PARAM_OFMARKER_EN: + if (value) { + printk(KERN_ERR "OFMarker wasn't negotiated to No\n"); + return -EPROTO; + } + break; + default: + break; + } + + return 0; +} + +static int +iscsi_iser_session_get_param(struct iscsi_cls_session *cls_session, + enum 
iscsi_param param, uint32_t *value) +{ + struct Scsi_Host *shost = iscsi_session_to_shost(cls_session); + struct iscsi_session *session = iscsi_hostdata(shost->hostdata); + + switch (param) { + case ISCSI_PARAM_INITIAL_R2T_EN: + *value = session->initial_r2t_en; + break; + case ISCSI_PARAM_MAX_R2T: + *value = session->max_r2t; + break; + case ISCSI_PARAM_IMM_DATA_EN: + *value = session->imm_data_en; + break; + case ISCSI_PARAM_FIRST_BURST: + *value = session->first_burst; + break; + case ISCSI_PARAM_MAX_BURST: + *value = session->max_burst; + break; + case ISCSI_PARAM_PDU_INORDER_EN: + *value = session->pdu_inorder_en; + break; + case ISCSI_PARAM_DATASEQ_INORDER_EN: + *value = session->dataseq_inorder_en; + break; + case ISCSI_PARAM_ERL: + *value = session->erl; + break; + case ISCSI_PARAM_IFMARKER_EN: + *value = 0; + break; + case ISCSI_PARAM_OFMARKER_EN: + *value = 0; + break; + default: + return ISCSI_ERR_PARAM_NOT_FOUND; + } + + return 0; +} + +static int +iscsi_iser_conn_get_param(struct iscsi_cls_conn *cls_conn, + enum iscsi_param param, uint32_t *value) +{ + struct iscsi_conn *conn = cls_conn->dd_data; + + switch(param) { + case ISCSI_PARAM_MAX_RECV_DLENGTH: + *value = conn->max_recv_dlength; + break; + case ISCSI_PARAM_MAX_XMIT_DLENGTH: + *value = conn->max_xmit_dlength; + break; + case ISCSI_PARAM_HDRDGST_EN: + *value = 0; + break; + case ISCSI_PARAM_DATADGST_EN: + *value = 0; + break; + /*case ISCSI_PARAM_TARGET_RECV_DLENGTH: + *value = conn->target_recv_dlength; + break; + case ISCSI_PARAM_INITIATOR_RECV_DLENGTH: + *value = conn->initiator_recv_dlength; + break;*/ + default: + return ISCSI_ERR_PARAM_NOT_FOUND; + } + + return 0; +} + + +static void +iscsi_iser_conn_get_stats(struct iscsi_cls_conn *cls_conn, struct iscsi_stats *stats) +{ + struct iscsi_conn *conn = cls_conn->dd_data; + + stats->txdata_octets = conn->txdata_octets; + stats->rxdata_octets = conn->rxdata_octets; + stats->scsicmd_pdus = conn->scsicmd_pdus_cnt; + stats->dataout_pdus = 
conn->dataout_pdus_cnt; + stats->scsirsp_pdus = conn->scsirsp_pdus_cnt; + stats->datain_pdus = conn->datain_pdus_cnt; /* always 0 */ + stats->r2t_pdus = conn->r2t_pdus_cnt; /* always 0 */ + stats->tmfcmd_pdus = conn->tmfcmd_pdus_cnt; + stats->tmfrsp_pdus = conn->tmfrsp_pdus_cnt; + stats->custom_length = 3; + strcpy(stats->custom[0].desc, "qp_tx_queue_full"); + stats->custom[0].value = 0; /* FIXME iser_conn->qp_tx_queue_full; */ + strcpy(stats->custom[1].desc, "fmr_map_not_avail"); + stats->custom[1].value = 0; /* FIXME iser_conn->fmr_map_not_avail */; + strcpy(stats->custom[2].desc, "eh_abort_cnt"); + stats->custom[2].value = conn->eh_abort_cnt; +} + +static int +iscsi_iser_ep_connect(struct sockaddr *dst_addr, int non_blocking, + __u64 *ep_handle) +{ + int err; + struct iser_conn *ib_conn; + + iser_err("called\n"); + + ib_conn = kzalloc(sizeof *ib_conn, GFP_KERNEL); + if (!ib_conn) + return -ENOMEM; + + err = iser_conn_init(ib_conn); + BUG_ON(err); + + err = iser_connect(ib_conn, NULL, (struct sockaddr_in *)dst_addr, non_blocking); + if (err == 0) + *ep_handle = (__u64)ib_conn; + + return err; +} + +static int +iscsi_iser_ep_poll(__u64 ep_handle, int timeout_ms) +{ + struct iser_conn *ib_conn = (struct iser_conn *)ep_handle; + int rc; + /* FIXME 2 jiffies assumed, should translate from ms to jiffies */ + int timeout_jiffies = 2; + + rc = wait_event_interruptible_timeout(ib_conn->wait, + atomic_read(&ib_conn->state) == ISER_CONN_UP, + timeout_jiffies); + + iser_err("called eph %llx rc = %d\n", ep_handle, rc); + + if(rc > 0) + return 1; /* success, this is the equivalent of POLLOUT */ + else if(!rc) + return 0; /* timeout */ + else + return rc; /* signal */ +} + +static void +iscsi_iser_ep_disconnect(__u64 ep_handle) +{ + struct iser_conn *ib_conn = (struct iser_conn *)ep_handle; + + iser_err("called eph %llx iser ib conn state %d\n", + ep_handle, atomic_read(&ib_conn->state)); + + if (atomic_read(&ib_conn->state) == ISER_CONN_UP) { + iser_err("iSER connection %p is 
UP, terminating\n",ib_conn); + iser_conn_terminate(ib_conn); + } + + BUG_ON(atomic_read(&ib_conn->state) != ISER_CONN_DOWN); + kfree(ib_conn); +} + +static struct scsi_host_template iscsi_iser_sht = { + .name = "iSCSI Initiator over iSER, v." + ISCSI_VERSION_STR, + .queuecommand = iscsi_queuecommand, + .sg_tablesize = ISCSI_ISER_SG_TABLESIZE, + .cmd_per_lun = ISCSI_ISER_CMD_PER_LUN, + .eh_abort_handler = iscsi_eh_abort, + .eh_host_reset_handler = iscsi_eh_host_reset, + .use_clustering = DISABLE_CLUSTERING, + .proc_name = "iscsi_iser", + .this_id = -1, +}; + +static struct iscsi_transport iscsi_iser_transport = { + .owner = THIS_MODULE, + .name = "iser", + .caps = CAP_RECOVERY_L0 | CAP_MULTI_R2T, + .param_mask = ISCSI_MAX_RECV_DLENGTH | + ISCSI_MAX_XMIT_DLENGTH | + ISCSI_HDRDGST_EN | + ISCSI_DATADGST_EN | + ISCSI_INITIAL_R2T_EN | + ISCSI_MAX_R2T | + ISCSI_IMM_DATA_EN | + ISCSI_FIRST_BURST | + ISCSI_MAX_BURST | + ISCSI_PDU_INORDER_EN | + ISCSI_DATASEQ_INORDER_EN, + .host_template = &iscsi_iser_sht, + .conndata_size = sizeof(struct iscsi_conn), + .max_lun = ISCSI_ISER_MAX_LUN, + .max_cmd_len = ISCSI_ISER_MAX_CMD_LEN, + /* session management */ + .create_session = iscsi_iser_session_create, + .destroy_session = iscsi_session_teardown, + /* connection management */ + .create_conn = iscsi_iser_conn_create, + .bind_conn = iscsi_iser_conn_bind, + .destroy_conn = iscsi_iser_conn_destroy, + .set_param = iscsi_iser_conn_set_param, + .get_conn_param = iscsi_iser_conn_get_param, + .get_session_param = iscsi_iser_session_get_param, + .start_conn = iscsi_iser_conn_start, + .stop_conn = iscsi_conn_stop, + /* these are called as part of conn recovery */ + .suspend_conn_recv = NULL, /* FIXME is this relevant to iser, and if so how? 
*/ + .terminate_conn = iscsi_iser_conn_terminate, + /* IO */ + .send_pdu = iscsi_conn_send_pdu, + .get_stats = iscsi_iser_conn_get_stats, + .init_cmd_task = iscsi_iser_cmd_init, + .xmit_cmd_task = iscsi_iser_ctask_xmit, + .xmit_mgmt_task = iscsi_iser_mtask_xmit, + .cleanup_cmd_task = iscsi_iser_cleanup_ctask, + /* recovery */ + .session_recovery_timedout = iscsi_session_recovery_timedout, + + .ep_connect = iscsi_iser_ep_connect, + .ep_poll = iscsi_iser_ep_poll, + .ep_disconnect = iscsi_iser_ep_disconnect +}; + +static int __init iser_init(void) +{ + int err; + + iser_dbg("Starting iSER datamover...\n"); + + if (iscsi_max_lun < 1) { + printk(KERN_ERR "Invalid max_lun value of %u\n", iscsi_max_lun); + return -EINVAL; + } + + iscsi_iser_transport.max_lun = iscsi_max_lun; + + memset(&ig, 0, sizeof(struct iser_global)); + + ig.desc_cache = kmem_cache_create("iser_descriptors", + sizeof (struct iser_desc), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (ig.desc_cache == NULL) + return -ENOMEM; + + /* device init is called only after the first addr resolution */ + mutex_init(&ig.device_list_mutex); + INIT_LIST_HEAD(&ig.device_list); + + if (!iscsi_register_transport(&iscsi_iser_transport)) { + iser_err("iscsi_register_transport failed\n"); + err = -EINVAL; + goto register_transport_failure; + } + + return 0; + +register_transport_failure: + kmem_cache_destroy(ig.desc_cache); + + return err; +} + +static void __exit iser_exit(void) +{ + iser_dbg("Removing iSER datamover...\n"); + iscsi_unregister_transport(&iscsi_iser_transport); + kmem_cache_destroy(ig.desc_cache); +} + +module_init(iser_init); +module_exit(iser_exit); From ogerlitz at voltaire.com Thu Apr 6 05:17:56 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 6 Apr 2006 15:17:56 +0300 (IDT) Subject: [openib-general] Re: [PATCH 3/8] [RFC] iser initiator In-Reply-To: Message-ID: --- iser-libiscsi-canq2-ep-null/iser_initiator.c 2006-04-06 14:47:29.000000000 +0300 +++ iser-libiscsi-canq2-ep/iser_initiator.c 
2006-03-29 15:18:38.000000000 +0200 @@ -0,0 +1,727 @@ +/* + * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: iser_initiator.c 5959 2006-03-22 14:36:34Z ogerlitz $ + */ +#include +#include +#include +#include +#include +#include +#include +#include + +#include "iscsi_iser.h" + +/* Constant PDU lengths calculations */ +#define ISER_TOTAL_HEADERS_LEN (sizeof (struct iser_hdr) + \ + sizeof (struct iscsi_hdr)) + +/* iser_dto_add_regd_buff - increments the reference count for * + * the registered buffer & adds it to the DTO object */ +static void iser_dto_add_regd_buff(struct iser_dto *dto, + struct iser_regd_buf *regd_buf, + unsigned long use_offset, + unsigned long use_size) +{ + int add_idx; + + atomic_inc(&regd_buf->ref_count); + + add_idx = dto->regd_vector_len; + dto->regd[add_idx] = regd_buf; + dto->used_sz[add_idx] = use_size; + dto->offset[add_idx] = use_offset; + + dto->regd_vector_len++; +} + +static int iser_dma_map_task_data(struct iscsi_iser_cmd_task *ctask, + struct iser_data_buf *data, + enum iser_data_dir iser_dir, + enum dma_data_direction dma_dir) +{ + struct device *dma_device; + + ctask->dir[iser_dir] = 1; + dma_device = ctask->iser_conn->ib_conn->device->ib_device->dma_device; + + data->dma_nents = dma_map_sg(dma_device, data->buf, data->size, dma_dir); + if (data->dma_nents == 0) { + iser_err("dma_map_sg failed!!!\n"); + return -EINVAL; + } + data->data_len = iser_sg_size(data); + return 0; +} + +static void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *ctask) +{ + struct device *dma_device; + struct iser_data_buf *data; + + dma_device = ctask->iser_conn->ib_conn->device->ib_device->dma_device; + + data = &ctask->data[ISER_DIR_IN]; + if (data->buf != NULL) + dma_unmap_sg(dma_device, data->buf, data->size, DMA_FROM_DEVICE); + + data = &ctask->data[ISER_DIR_OUT]; + if (data->buf != NULL) + dma_unmap_sg(dma_device, data->buf, data->size, DMA_TO_DEVICE); +} + +/* Register user buffer memory and initialize passive rdma + * dto descriptor. 
Total data size is stored in + * ctask->data[ISER_DIR_IN].data_len + */ +static int iser_prepare_read_cmd(struct iscsi_cmd_task *ctask, + unsigned int edtl) + +{ + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + struct iser_regd_buf *regd_buf; + int err; + struct iser_hdr *hdr = &iser_ctask->desc.iser_header; + struct iser_data_buf *buf_in = &iser_ctask->data[ISER_DIR_IN]; + + err = iser_dma_map_task_data(iser_ctask, + buf_in, + ISER_DIR_IN, + DMA_FROM_DEVICE); + if (err) + return err; + + if (edtl > iser_ctask->data[ISER_DIR_IN].data_len) { + iser_err("Total data length: %ld, less than EDTL: " + "%d, in READ cmd BHS itt: %d, conn: 0x%p\n", + iser_ctask->data[ISER_DIR_IN].data_len, edtl, + ctask->itt, iser_ctask->iser_conn); + return -EINVAL; + } + + err = iser_reg_rdma_mem(iser_ctask,ISER_DIR_IN); + if (err) { + iser_err("Failed to set up Data-IN RDMA\n"); + return err; + } + regd_buf = &iser_ctask->rdma_regd[ISER_DIR_IN]; + + hdr->flags |= ISER_RSV; + hdr->read_stag = cpu_to_be32(regd_buf->reg.rkey); + hdr->read_va = cpu_to_be64(regd_buf->reg.va); + + iser_dbg("Cmd itt:%d READ tags RKEY:%#.4X VA:%#llX\n", + ctask->itt, regd_buf->reg.rkey, + (unsigned long long)regd_buf->reg.va); + + return 0; +} + +/* Register user buffer memory and initialize passive rdma + * dto descriptor. 
Total data size is stored in + * ctask->data[ISER_DIR_OUT].data_len + */ +static int +iser_prepare_write_cmd(struct iscsi_cmd_task *ctask, + unsigned int imm_sz, + unsigned int unsol_sz, + unsigned int edtl) +{ + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + struct iser_regd_buf *regd_buf; + int err; + struct iser_dto *send_dto = &iser_ctask->desc.dto; + struct iser_hdr *hdr = &iser_ctask->desc.iser_header; + struct iser_data_buf *buf_out = &iser_ctask->data[ISER_DIR_OUT]; + + err = iser_dma_map_task_data(iser_ctask, + buf_out, + ISER_DIR_OUT, + DMA_TO_DEVICE); + if (err) + return err; + + if (edtl > iser_ctask->data[ISER_DIR_OUT].data_len) { + iser_err("Total data length: %ld, less than EDTL: %d, " + "in WRITE cmd BHS itt: %d, conn: 0x%p\n", + iser_ctask->data[ISER_DIR_OUT].data_len, + edtl, ctask->itt, ctask->conn); + return -EINVAL; + } + + err = iser_reg_rdma_mem(iser_ctask,ISER_DIR_OUT); + if (err != 0) { + iser_err("Failed to register write cmd RDMA mem\n"); + return err; + } + + regd_buf = &iser_ctask->rdma_regd[ISER_DIR_OUT]; + + if (unsol_sz < edtl) { + hdr->flags |= ISER_WSV; + hdr->write_stag = cpu_to_be32(regd_buf->reg.rkey); + hdr->write_va = cpu_to_be64(regd_buf->reg.va + unsol_sz); + + iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X " + "VA:%#llX + unsol:%d\n", + ctask->itt, regd_buf->reg.rkey, + (unsigned long long)regd_buf->reg.va, unsol_sz); + } + + if (imm_sz > 0) { + iser_dbg("Cmd itt:%d, WRITE, adding imm.data sz: %d\n", + ctask->itt, imm_sz); + iser_dto_add_regd_buff(send_dto, + regd_buf, + 0, + imm_sz); + } + + return 0; +} + +/** + * iser_post_receive_control - allocates, initializes and posts receive DTO. 
+ */ +static int iser_post_receive_control(struct iscsi_conn *conn) +{ + struct iscsi_iser_conn *iser_conn = conn->dd_data; + struct iser_desc *rx_desc; + struct iser_regd_buf *regd_hdr; + struct iser_regd_buf *regd_data; + struct iser_dto *recv_dto = NULL; + struct iser_device *device = iser_conn->ib_conn->device; + int rx_data_size, err = 0; + + rx_desc = kmem_cache_alloc(ig.desc_cache, GFP_KERNEL); + if (rx_desc == NULL) { + iser_err("Failed to alloc desc for post recv\n"); + return -ENOMEM; + } + rx_desc->type = ISCSI_RX; + + /* for the login sequence we must support rx of upto 8K */ + if (conn->c_stage == ISCSI_CONN_INITIAL_STAGE) + rx_data_size = DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH; + else /* FIXME till user space sets conn->max_recv_dlength correctly */ + rx_data_size = 128; + + rx_desc->data = kmalloc(rx_data_size, GFP_KERNEL); + if (rx_desc->data == NULL) { + iser_err("Failed to alloc data buf for post recv\n"); + err = -ENOMEM; + goto post_rx_kmalloc_failure; + } + + recv_dto = &rx_desc->dto; + recv_dto->conn = iser_conn; + recv_dto->regd_vector_len = 0; + + regd_hdr = &rx_desc->hdr_regd_buf; + memset(regd_hdr, 0, sizeof(struct iser_regd_buf)); + regd_hdr->device = device; + regd_hdr->virt_addr = rx_desc; /* == &rx_desc->iser_header */ + regd_hdr->data_size = ISER_TOTAL_HEADERS_LEN; + + iser_reg_single(device, regd_hdr, DMA_FROM_DEVICE); + + iser_dto_add_regd_buff(recv_dto, regd_hdr, 0, 0); + + regd_data = &rx_desc->data_regd_buf; + memset(regd_data, 0, sizeof(struct iser_regd_buf)); + regd_data->device = device; + regd_data->virt_addr = rx_desc->data; + regd_data->data_size = rx_data_size; + + iser_reg_single(device, regd_data, DMA_FROM_DEVICE); + + iser_dto_add_regd_buff(recv_dto, regd_data, 0, 0); + + err = iser_post_recv(rx_desc); + if (!err) + return 0; + + /* iser_post_recv failed */ + iser_dto_buffs_release(recv_dto); + kfree(rx_desc->data); +post_rx_kmalloc_failure: + kmem_cache_free(ig.desc_cache, rx_desc); + return err; +} + +/* creates a new 
tx descriptor and adds header regd buffer */ +static void iser_create_send_desc(struct iscsi_iser_conn *iser_conn, + struct iser_desc *tx_desc) +{ + struct iser_regd_buf *regd_hdr = &tx_desc->hdr_regd_buf; + struct iser_dto *send_dto = &tx_desc->dto; + + memset(regd_hdr, 0, sizeof(struct iser_regd_buf)); + regd_hdr->device = iser_conn->ib_conn->device; + regd_hdr->virt_addr = tx_desc; /* == &tx_desc->iser_header */ + regd_hdr->data_size = ISER_TOTAL_HEADERS_LEN; + + send_dto->conn = iser_conn; + send_dto->notify_enable = 1; + send_dto->regd_vector_len = 0; + + memset(&tx_desc->iser_header, 0, sizeof(struct iser_hdr)); + tx_desc->iser_header.flags = ISER_VER; + + iser_dto_add_regd_buff(send_dto, regd_hdr, 0, 0); +} + +/** + * iser_conn_set_full_featured_mode - (iSER API) + */ +int iser_conn_set_full_featured_mode(struct iscsi_conn *conn) +{ + struct iscsi_iser_conn *iser_conn = conn->dd_data; + + int i; + /* no need to keep it in a var, we are after login so if this should + * be negotiated, by now the result should be available here */ + int initial_post_recv_bufs_num = ISER_MAX_RX_MISC_PDUS; + + iser_dbg("Initially post: %d\n", initial_post_recv_bufs_num); + + /* Check that there is no posted recv or send buffers left - */ + /* they must be consumed during the login phase */ + if (atomic_read(&iser_conn->ib_conn->post_recv_buf_count) != 0) + iser_bug("Number of currently posted recv bufs non-zero\n"); + if (atomic_read(&iser_conn->ib_conn->post_send_buf_count) != 0) + iser_bug("Number of currently posted send bufs non-zero\n"); + + /* Initial post receive buffers */ + for (i = 0; i < initial_post_recv_bufs_num; i++) { + if (iser_post_receive_control(conn) != 0) { + iser_err("Failed to post recv bufs at:%d conn:0x%p\n", + i, conn); + return -ENOMEM; + } + } + iser_dbg("Posted %d post recv bufs, conn:0x%p\n", i, conn); + return 0; +} + +static int +iser_check_xmit(struct iscsi_conn *conn, void *task) +{ + int rc = 0; + struct iscsi_iser_conn *iser_conn = 
conn->dd_data; + + write_lock_bh(conn->recv_lock); + if (atomic_read(&iser_conn->ib_conn->post_send_buf_count) == + ISER_QP_MAX_REQ_DTOS) { + iser_dbg("%ld can't xmit task %p, suspending tx\n",jiffies,task); + set_bit(ISCSI_SUSPEND_BIT, &conn->suspend_tx); + rc = -EAGAIN; + } + write_unlock_bh(conn->recv_lock); + return rc; +} + + +/** + * iser_send_command - send command PDU + */ +int iser_send_command(struct iscsi_conn *conn, + struct iscsi_cmd_task *ctask) +{ + struct iscsi_iser_conn *iser_conn = conn->dd_data; + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + struct iser_dto *send_dto = NULL; + unsigned long edtl; + int err = 0; + struct iser_data_buf *data_buf; + + struct iscsi_cmd *hdr = ctask->hdr; + struct scsi_cmnd *sc = ctask->sc; + + if (atomic_read(&iser_conn->ib_conn->state) != ISER_CONN_UP) { + iser_err("Failed to send, conn: 0x%p is not up\n", iser_conn->ib_conn); + return -EPERM; + } + if (iser_check_xmit(conn, ctask)) + return -EAGAIN; + + edtl = ntohl(hdr->data_length); + + /* build the tx desc regd header and add it to the tx desc dto */ + iser_ctask->desc.type = ISCSI_TX_SCSI_COMMAND; + send_dto = &iser_ctask->desc.dto; + send_dto->ctask = iser_ctask; + iser_create_send_desc(iser_conn, &iser_ctask->desc); + + if (hdr->flags & ISCSI_FLAG_CMD_READ) + data_buf = &iser_ctask->data[ISER_DIR_IN]; + else + data_buf = &iser_ctask->data[ISER_DIR_OUT]; + + if (sc->use_sg) { /* using a scatter list */ + data_buf->buf = sc->request_buffer; + data_buf->size = sc->use_sg; + } else { /* using a single buffer - convert it into one entry SG */ + sg_init_one(&data_buf->sg_single, + sc->request_buffer, sc->request_bufflen); + data_buf->buf = &data_buf->sg_single; + data_buf->size = 1; + } + + if (hdr->flags & ISCSI_FLAG_CMD_READ) { + err = iser_prepare_read_cmd(ctask, edtl); + if (err) + goto send_command_error; + } + if (hdr->flags & ISCSI_FLAG_CMD_WRITE) { + err = iser_prepare_write_cmd(ctask, + ctask->imm_count, + ctask->imm_count + + 
ctask->unsol_count, + edtl); + if (err) + goto send_command_error; + } + + iser_reg_single(iser_conn->ib_conn->device, + send_dto->regd[0], DMA_TO_DEVICE); + + if (iser_post_receive_control(conn) != 0) { + iser_err("post_recv failed!\n"); + err = -ENOMEM; + goto send_command_error; + } + + iser_ctask->status = ISER_TASK_STATUS_STARTED; + + err = iser_post_send(&iser_ctask->desc); + if (!err) + return 0; + +send_command_error: + iser_dto_buffs_release(send_dto); + iser_err("conn %p failed ctask->itt %d err %d\n",conn, ctask->itt, err); + return err; +} + +/** + * iser_send_data_out - send data out PDU + */ +int iser_send_data_out(struct iscsi_conn *conn, + struct iscsi_cmd_task *ctask, + struct iscsi_data *hdr) +{ + struct iscsi_iser_conn *iser_conn = conn->dd_data; + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + struct iser_desc *tx_desc = NULL; + struct iser_dto *send_dto = NULL; + unsigned long buf_offset; + unsigned long data_seg_len; + unsigned int itt; + int err = 0; + + if (atomic_read(&iser_conn->ib_conn->state) != ISER_CONN_UP) { + iser_err("Failed to send, conn: 0x%p is not up\n", iser_conn->ib_conn); + return -EPERM; + } + + if (iser_check_xmit(conn, ctask)) + return -EAGAIN; + + itt = ntohl(hdr->itt); + data_seg_len = ntoh24(hdr->dlength); + buf_offset = ntohl(hdr->offset); + + iser_dbg("%s itt %d dseg_len %d offset %d\n", + __func__,(int)itt,(int)data_seg_len,(int)buf_offset); + + tx_desc = kmem_cache_alloc(ig.desc_cache, GFP_KERNEL); + if (tx_desc == NULL) { + iser_err("Failed to alloc desc for post dataout\n"); + return -ENOMEM; + } + + tx_desc->type = ISCSI_TX_DATAOUT; + memcpy(&tx_desc->iscsi_header, hdr, sizeof(struct iscsi_hdr)); + + /* build the tx desc regd header and add it to the tx desc dto */ + send_dto = &tx_desc->dto; + send_dto->ctask = iser_ctask; + iser_create_send_desc(iser_conn, tx_desc); + + iser_reg_single(iser_conn->ib_conn->device, + send_dto->regd[0], DMA_TO_DEVICE); + + /* all data was registered for RDMA, we can 
use the lkey */ + iser_dto_add_regd_buff(send_dto, + &iser_ctask->rdma_regd[ISER_DIR_OUT], + buf_offset, + data_seg_len); + + if (buf_offset + data_seg_len > iser_ctask->data[ISER_DIR_OUT].data_len) { + iser_err("Offset:%ld & DSL:%ld in Data-Out " + "inconsistent with total len:%ld, itt:%d\n", + buf_offset, data_seg_len, + iser_ctask->data[ISER_DIR_OUT].data_len, itt); + err = -EINVAL; + goto send_data_out_error; + } + iser_dbg("data-out itt: %d, offset: %ld, sz: %ld\n", + itt, buf_offset, data_seg_len); + + + err = iser_post_send(tx_desc); + if (!err) + return 0; + +send_data_out_error: + iser_dto_buffs_release(send_dto); + kmem_cache_free(ig.desc_cache, tx_desc); + iser_err("conn %p failed err %d\n",conn, err); + return err; +} + +int iser_send_control(struct iscsi_conn *conn, + struct iscsi_mgmt_task *mtask) +{ + struct iscsi_iser_conn *iser_conn = conn->dd_data; + struct iser_desc *mdesc = mtask->dd_data; + struct iser_dto *send_dto = NULL; + unsigned int itt; + unsigned long data_seg_len; + int err = 0; + unsigned char opcode; + struct iser_regd_buf *regd_buf; + struct iser_device *device; + + if (atomic_read(&iser_conn->ib_conn->state) != ISER_CONN_UP) { + iser_err("Failed to send, conn: 0x%p is not up\n", iser_conn->ib_conn); + return -EPERM; + } + + if (iser_check_xmit(conn,mtask)) + return -EAGAIN; + + /* build the tx desc regd header and add it to the tx desc dto */ + mdesc->type = ISCSI_TX_CONTROL; + send_dto = &mdesc->dto; + send_dto->ctask = NULL; + iser_create_send_desc(iser_conn, mdesc); + + device = iser_conn->ib_conn->device; + + iser_reg_single(device, send_dto->regd[0], DMA_TO_DEVICE); + + itt = ntohl(mtask->hdr->itt); + opcode = mtask->hdr->opcode & ISCSI_OPCODE_MASK; + data_seg_len = ntoh24(mtask->hdr->dlength); + + if (data_seg_len > 0) { + regd_buf = &mdesc->data_regd_buf; + memset(regd_buf, 0, sizeof(struct iser_regd_buf)); + regd_buf->device = device; + regd_buf->virt_addr = mtask->data; + regd_buf->data_size = mtask->data_count; + 
iser_reg_single(device, regd_buf, + DMA_TO_DEVICE); + iser_dto_add_regd_buff(send_dto, regd_buf, + 0, + data_seg_len); + } + + if (iser_post_receive_control(conn) != 0) { + iser_err("post_rcv_buff failed!\n"); + err = -ENOMEM; + goto send_control_error; + } + + err = iser_post_send(mdesc); + if (!err) + return 0; + +send_control_error: + iser_dto_buffs_release(send_dto); + iser_err("conn %p failed err %d\n",conn, err); + return err; +} + +/** + * iser_rcv_dto_completion - recv DTO completion + */ +void iser_rcv_completion(struct iser_desc *rx_desc, + unsigned long dto_xfer_len) +{ + struct iser_dto *dto = &rx_desc->dto; + struct iscsi_iser_conn *conn = dto->conn; + struct iscsi_session *session = conn->iscsi_conn->session; + struct iscsi_cmd_task *ctask; + struct iscsi_iser_cmd_task *iser_ctask; + struct iscsi_hdr *hdr; + char *rx_data = NULL; + int rc, rx_data_len = 0; + unsigned int itt; + unsigned char opcode; + + hdr = &rx_desc->iscsi_header; + + iser_dbg("op 0x%x itt 0x%x\n", hdr->opcode,hdr->itt); + + if (dto_xfer_len > ISER_TOTAL_HEADERS_LEN) { /* we have data */ + rx_data_len = dto_xfer_len - ISER_TOTAL_HEADERS_LEN; + rx_data = dto->regd[1]->virt_addr; + rx_data += dto->offset[1]; + } + + opcode = hdr->opcode & ISCSI_OPCODE_MASK; + + if (opcode == ISCSI_OP_SCSI_CMD_RSP) { + itt = hdr->itt; + if (!(itt < session->cmds_max)) + iser_bug("itt can't be matched to task!!!" 
+ "conn %p opcode %d cmds_max %d itt %d\n", + conn->iscsi_conn,opcode,session->cmds_max,itt); + /* use the mapping given with the cmds array indexed by itt */ + ctask = (struct iscsi_cmd_task *)session->cmds[itt]; + iser_ctask = ctask->dd_data; + iser_dbg("itt %d ctask %p\n",itt,ctask); + if (ctask != NULL) { + /* if we were reading, copy back to unaligned * + * sglist, anyway dma_unmap and free the copy */ + if (iser_ctask->data_copy[ISER_DIR_IN].copy_buf != NULL) + iser_finalize_rdma_unaligned_sg(iser_ctask, ISER_DIR_IN); + if (iser_ctask->data_copy[ISER_DIR_OUT].copy_buf != NULL) + iser_finalize_rdma_unaligned_sg(iser_ctask, ISER_DIR_OUT); + iser_ctask->status = ISER_TASK_STATUS_COMPLETED; + iser_ctask_rdma_finalize(iser_ctask); + } + } + + iser_dto_buffs_release(dto); + + rc = iscsi_iser_recv(conn->iscsi_conn, hdr, rx_data, rx_data_len); + if (rc) + iscsi_conn_failure(conn->iscsi_conn, rc); + + kfree(rx_desc->data); + kmem_cache_free(ig.desc_cache, rx_desc); + + /* decrementing conn->post_recv_buf_count only --after-- freeing the * + * task eliminates the need to worry on tasks which are completed in * + * parallel to the execution of iser_conn_term. 
So the code that waits * + * for the posted rx bufs refcount to become zero handles everything */ + atomic_dec(&conn->ib_conn->post_recv_buf_count); +} + +void iser_snd_completion(struct iser_desc *tx_desc) +{ + struct iser_dto *dto = &tx_desc->dto; + struct iscsi_iser_conn *iser_conn = dto->conn; + struct iscsi_conn *conn = iser_conn->iscsi_conn; + + iser_dbg("Initiator, Data sent dto=0x%p\n", dto); + + iser_dto_buffs_release(dto); + + if (tx_desc->type == ISCSI_TX_DATAOUT) + kmem_cache_free(ig.desc_cache, tx_desc); + + atomic_dec(&iser_conn->ib_conn->post_send_buf_count); + + write_lock(conn->recv_lock); + if (conn->suspend_tx) { + iser_dbg("%ld resuming tx\n",jiffies); + clear_bit(ISCSI_SUSPEND_BIT, &conn->suspend_tx); + scsi_queue_work(conn->session->host, &conn->xmitwork); + } + write_unlock(conn->recv_lock); + +#if 0 + /* FIXME - need to get the ***iscsi*** mtask from the iser desc/dto */ + if (mtask->hdr->itt == cpu_to_be32(ISCSI_RESERVED_TAG)) { + struct iscsi_session *session = conn->session; + + spin_lock(&conn->session->lock); + list_del(&mtask->running); + __kfifo_put(session->mgmtpool.queue, (void*)&mtask, + sizeof(void*)); + spin_unlock(&session->lock); + } +#endif +} + +void iser_ctask_rdma_init(struct iscsi_iser_cmd_task *ctask) + +{ + ctask->status = ISER_TASK_STATUS_INIT; + + ctask->dir[ISER_DIR_IN] = 0; + ctask->dir[ISER_DIR_OUT] = 0; + + ctask->data[ISER_DIR_IN].data_len = 0; + ctask->data[ISER_DIR_OUT].data_len = 0; + + memset(&ctask->rdma_regd[ISER_DIR_IN], 0, + sizeof(struct iser_regd_buf)); + memset(&ctask->rdma_regd[ISER_DIR_OUT], 0, + sizeof(struct iser_regd_buf)); +} + +void iser_ctask_rdma_finalize(struct iscsi_iser_cmd_task *ctask) +{ + int deferred; + + if (ctask->dir[ISER_DIR_IN]) { + deferred = iser_regd_buff_release + (&ctask->rdma_regd[ISER_DIR_IN]); + if (deferred) + iser_bug("References remain for BUF-IN rdma reg\n"); + } + + if (ctask->dir[ISER_DIR_OUT]) { + deferred = iser_regd_buff_release + (&ctask->rdma_regd[ISER_DIR_OUT]); 
+ if (deferred) + iser_bug("References remain for BUF-OUT rdma reg\n"); + } + + iser_dma_unmap_task_data(ctask); +} + +void iser_dto_buffs_release(struct iser_dto *dto) +{ + int i; + + for (i = 0; i < dto->regd_vector_len; i++) + iser_regd_buff_release(dto->regd[i]); +} + From ogerlitz at voltaire.com Thu Apr 6 05:18:32 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 6 Apr 2006 15:18:32 +0300 (IDT) Subject: [openib-general] [PATCH 4/8] [RFC] iser cma and verbs interaction In-Reply-To: Message-ID: --- iser-libiscsi-canq2-ep-null/iser_verbs.c 2006-04-06 14:47:41.000000000 +0300 +++ iser-libiscsi-canq2-ep/iser_verbs.c 2006-04-06 11:29:36.000000000 +0300 @@ -0,0 +1,788 @@ +/* + * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: iser_verbs.c 5959 2006-03-22 14:36:34Z ogerlitz $ + */ +#include +#include +#include +#include +#include +#include + +#include "iscsi_iser.h" +#include "iser_socket.h" + +#define ISCSI_ISER_MAX_CONN 8 +#define ISER_MAX_CQ_LEN ((ISER_QP_MAX_RECV_DTOS + \ + ISER_QP_MAX_REQ_DTOS) * \ + ISCSI_ISER_MAX_CONN) + +static void iser_cq_tasklet_fn(unsigned long data); +static void iser_cq_callback(struct ib_cq *cq, void *cq_context); +static void iser_comp_error_worker(void *data); +static void iser_conn_release(struct iser_conn *ib_conn); + +static void iser_cq_event_callback(struct ib_event *cause, void *context) +{ + iser_err("got cq event %d \n", cause->event); +} + +static void iser_qp_event_callback(struct ib_event *cause, void *context) +{ + iser_err("got qp event %d\n",cause->event); +} + +/** + * iser_create_device_ib_res - creates Protection Domain (PD), Completion + * Queue (CQ), DMA Memory Region (DMA MR) with the device associated with + * the adapator. 
+ * + * returns 0 on success, -1 on failure + */ +static int iser_create_device_ib_res(struct iser_device *device) +{ + device->pd = ib_alloc_pd(device->ib_device); + if (IS_ERR(device->pd)) + goto pd_err; + + device->cq = ib_create_cq(device->ib_device, + iser_cq_callback, + iser_cq_event_callback, + (void *)device, + ISER_MAX_CQ_LEN); + if (IS_ERR(device->cq)) + goto cq_err; + + if (ib_req_notify_cq(device->cq, IB_CQ_NEXT_COMP)) + goto cq_arm_err; + + tasklet_init(&device->cq_tasklet, + iser_cq_tasklet_fn, + (unsigned long)device); + + device->mr = ib_get_dma_mr(device->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(device->mr)) + goto dma_mr_err; + + return 0; + +dma_mr_err: + tasklet_kill(&device->cq_tasklet); +cq_arm_err: + ib_destroy_cq(device->cq); +cq_err: + ib_dealloc_pd(device->pd); +pd_err: + iser_err("failed to allocate an IB resource\n"); + return -1; +} + +/** + * iser_free_device_ib_res - destory/dealloc/dereg the DMA MR, + * CQ and PD created with the device associated with the adapator. 
+ * + * returns 0 on success, -1 on failure + */ +static int iser_free_device_ib_res(struct iser_device *device) +{ + BUG_ON(device->mr == NULL); + + tasklet_kill(&device->cq_tasklet); + + (void)ib_dereg_mr(device->mr); + (void)ib_destroy_cq(device->cq); + (void)ib_dealloc_pd(device->pd); + + device->mr = NULL; + device->cq = NULL; + device->pd = NULL; + return 0; +} + +/** + * iser_create_ib_conn_res - Creates FMR pool and Queue-Pair (QP) + * + * returns 0 on success, -1 on failure + */ +static int iser_create_ib_conn_res(struct iser_conn *ib_conn) +{ + struct iser_device *device; + struct ib_qp_init_attr init_attr; + int ret; + struct ib_fmr_pool_param params; + + BUG_ON(ib_conn->device == NULL); + + device = ib_conn->device; + + params.page_shift = PAGE_SHIFT; + /* when the first/last SG element are not start/end * + * page aligned, the map whould be of N+1 pages */ + params.max_pages_per_fmr = ISCSI_ISER_SG_TABLESIZE + 1; + /* make the pool size twice the max number of SCSI commands * + * the ML is expected to queue, watermark for unmap at 50% */ + params.pool_size = ISCSI_ISER_XMIT_CMDS_MAX * 2; + params.dirty_watermark = ISCSI_ISER_XMIT_CMDS_MAX; + params.cache = 0; + params.flush_function = NULL; + params.access = (IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE | + IB_ACCESS_REMOTE_READ); + + ib_conn->fmr_pool = ib_create_fmr_pool(device->pd, &params); + if (IS_ERR(ib_conn->fmr_pool)) { + ret = PTR_ERR(ib_conn->fmr_pool); + goto fmr_pool_err; + } + + memset(&init_attr, 0, sizeof init_attr); + + init_attr.event_handler = iser_qp_event_callback; + init_attr.qp_context = (void *)ib_conn; + init_attr.send_cq = device->cq; + init_attr.recv_cq = device->cq; + init_attr.cap.max_send_wr = ISER_QP_MAX_REQ_DTOS; + init_attr.cap.max_recv_wr = ISER_QP_MAX_RECV_DTOS; + init_attr.cap.max_send_sge = MAX_REGD_BUF_VECTOR_LEN; + init_attr.cap.max_recv_sge = 2; + init_attr.sq_sig_type = IB_SIGNAL_REQ_WR; + init_attr.qp_type = IB_QPT_RC; + + ret = rdma_create_qp(ib_conn->cma_id, 
device->pd, &init_attr); + if (ret) + goto qp_err; + + ib_conn->qp = ib_conn->cma_id->qp; + iser_err("setting conn %p cma_id %p: fmr_pool %p qp %p\n", + ib_conn, ib_conn->cma_id, + ib_conn->fmr_pool, ib_conn->cma_id->qp); + return ret; + +qp_err: + (void)ib_destroy_fmr_pool(ib_conn->fmr_pool); +fmr_pool_err: + iser_err("unable to create fmr pool or qp for ib_conn: %d\n", ret); + return ret; +} + +/** + * iser_free_ib_conn_res - Releases the FMR pool, QP and CMA ID objects + * returns 0 on success, -1 on failure + */ +static int iser_free_ib_conn_res(struct iser_conn *ib_conn) +{ + BUG_ON(ib_conn == NULL); + + iser_err("freeing conn %p cma_id %p fmr pool %p qp %p\n", + ib_conn, ib_conn->cma_id, + ib_conn->fmr_pool, ib_conn->qp); + + /* qp is created only once both addr & route are resolved */ + if (ib_conn->fmr_pool != NULL) + ib_destroy_fmr_pool(ib_conn->fmr_pool); + + if (ib_conn->qp != NULL) + rdma_destroy_qp(ib_conn->cma_id); + + if (ib_conn->cma_id != NULL) + rdma_destroy_id(ib_conn->cma_id); + else + iser_bug("not supposed to be called twice\n"); + + ib_conn->fmr_pool = NULL; + ib_conn->qp = NULL; + ib_conn->cma_id = NULL; + kfree(ib_conn->page_vec); + + return 0; +} + +/** + * based on the resolved device node GUID see if there already allocated + * device for this device. If there's no such, create one. 
+ */ +static +struct iser_device *iser_device_find_by_ib_device(struct rdma_cm_id *cma_id) +{ + struct list_head *p_list; + struct iser_device *device = NULL; + + mutex_lock(&ig.device_list_mutex); + + p_list = ig.device_list.next; + while (p_list != &ig.device_list) { + device = list_entry(p_list, struct iser_device, ig_list); + /* find if there's a match using the node GUID */ + if (device->ib_device->node_guid == cma_id->device->node_guid) + break; + } + + if (device == NULL) { + device = kzalloc(sizeof *device, GFP_KERNEL); + if (device == NULL) + goto end; + /* assign this device to the device */ + device->ib_device = cma_id->device; + /* init the device and link it into ig device list */ + if (iser_create_device_ib_res(device)) { + kfree(device); + device = NULL; + goto end; + } + list_add(&device->ig_list, &ig.device_list); + } +end: + BUG_ON(device == NULL); + device->refcount++; + mutex_unlock(&ig.device_list_mutex); + return device; +} + +/* if there's no demand for this device, release it */ +static void iser_device_try_release(struct iser_device *device) +{ + mutex_lock(&ig.device_list_mutex); + device->refcount--; + iser_err("device %p refcount %d\n",device,device->refcount); + if (!device->refcount) { + iser_free_device_ib_res(device); + list_del(&device->ig_list); + kfree(device); + } + mutex_unlock(&ig.device_list_mutex); +} + +/** + * ib_conn_terminate - Triggers start of the disconnect procedures and wait + * for them to be done + */ +void iser_conn_terminate(struct iser_conn *ib_conn) +{ + int err = 0; + + atomic_set(&ib_conn->state, ISER_CONN_TERMINATING); + err = rdma_disconnect(ib_conn->cma_id); + if (err) + iser_bug("Failed to disconnect, conn: 0x%p err %d\n",ib_conn,err); + wait_event_interruptible(ib_conn->wait, + (atomic_read(&ib_conn->state) == ISER_CONN_DOWN)); + iser_conn_release(ib_conn); +} + +static void iser_connect_error(struct rdma_cm_id *cma_id) +{ + struct iser_conn *ib_conn; + ib_conn = (struct iser_conn *)cma_id->context; + + 
if (atomic_read(&ib_conn->state) == ISER_CONN_PENDING) { + atomic_set(&ib_conn->state, ISER_CONN_DOWN); + wake_up_interruptible(&ib_conn->wait); + } else + iser_err("Unexpected evt for conn.state: %d\n", + atomic_read(&ib_conn->state)); +} + +static void iser_addr_handler(struct rdma_cm_id *cma_id) +{ + struct iser_device *device; + struct iser_conn *ib_conn; + int ret; + + device = iser_device_find_by_ib_device(cma_id); + ib_conn = (struct iser_conn *)cma_id->context; + ib_conn->device = device; + + ret = rdma_resolve_route(cma_id, 1000); + if (ret) { + iser_err("resolve route failed: %d\n", ret); + iser_connect_error(cma_id); + } + return; +} + +static void iser_route_handler(struct rdma_cm_id *cma_id) +{ + struct rdma_conn_param conn_param; + int ret; + + ret = iser_create_ib_conn_res((struct iser_conn *)cma_id->context); + if (ret) + goto failure; + + iser_dbg("path.mtu is %d setting it to %d\n", + cma_id->route.path_rec->mtu, IB_MTU_1024); + + /* we must set the MTU to 1024 as this is what the target is assuming */ + if (cma_id->route.path_rec->mtu > IB_MTU_1024) + cma_id->route.path_rec->mtu = IB_MTU_1024; + + memset(&conn_param, 0, sizeof conn_param); + conn_param.responder_resources = 4; + conn_param.initiator_depth = 1; + conn_param.retry_count = 7; + conn_param.rnr_retry_count = 6; + + ret = rdma_connect(cma_id, &conn_param); + if (ret) { + iser_err("failure connecting: %d\n", ret); + goto failure; + } + + return; +failure: + iser_connect_error(cma_id); +} + +static void iser_connected_handler(struct rdma_cm_id *cma_id) +{ + struct iser_conn *ib_conn; + + ib_conn = (struct iser_conn *)cma_id->context; + atomic_set(&ib_conn->state, ISER_CONN_UP); + wake_up_interruptible(&ib_conn->wait); +} + +static void iser_disconnected_handler(struct rdma_cm_id *cma_id) +{ + struct iser_conn *ib_conn; + + ib_conn = (struct iser_conn *)cma_id->context; + ib_conn->disc_evt_flag = 1; + + /* If this event is unsolicited this means that the conn is being */ + /* terminated 
asynchronously from the iSCSI layer's perspective. */ + if (atomic_read(&ib_conn->state) == ISER_CONN_PENDING) { + atomic_set(&ib_conn->state, ISER_CONN_DOWN); + wake_up_interruptible(&ib_conn->wait); + } else { + if (atomic_read(&ib_conn->state) == ISER_CONN_UP) { + atomic_set(&ib_conn->state, ISER_CONN_TERMINATING); + iscsi_conn_failure(ib_conn->iscsi_conn->iscsi_conn, + ISCSI_ERR_CONN_FAILED); + } + /* Complete the termination process if no posts are pending */ + if ((atomic_read(&ib_conn->post_recv_buf_count) == 0) && + (atomic_read(&ib_conn->post_send_buf_count) == 0)) { + atomic_set(&ib_conn->state, ISER_CONN_DOWN); + wake_up_interruptible(&ib_conn->wait); + } + } +} + +static int iser_cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) +{ + int ret = 0; + + iser_err("event %d conn %p id %p\n",event->event,cma_id->context,cma_id); + + switch (event->event) { + case RDMA_CM_EVENT_ADDR_RESOLVED: + iser_addr_handler(cma_id); + break; + case RDMA_CM_EVENT_ROUTE_RESOLVED: + iser_route_handler(cma_id); + break; + case RDMA_CM_EVENT_ESTABLISHED: + iser_connected_handler(cma_id); + break; + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_ROUTE_ERROR: + case RDMA_CM_EVENT_CONNECT_ERROR: + case RDMA_CM_EVENT_UNREACHABLE: + case RDMA_CM_EVENT_REJECTED: + iser_err("event: %d, error: %d\n", event->event, event->status); + iser_connect_error(cma_id); + break; + case RDMA_CM_EVENT_DISCONNECTED: + iser_disconnected_handler(cma_id); + break; + case RDMA_CM_EVENT_DEVICE_REMOVAL: + iser_bug("device removal is not handled yet\n"); + break; + case RDMA_CM_EVENT_CONNECT_RESPONSE: + iser_bug("not expecting cma to deliver the REP!!!\n"); + break; + case RDMA_CM_EVENT_CONNECT_REQUEST: + default: + break; + } + return ret; +} + +int iser_conn_init(struct iser_conn *ib_conn) +{ + memset(ib_conn, 0, sizeof(struct iser_conn)); + atomic_set(&ib_conn->state, ISER_CONN_INIT); + init_waitqueue_head(&ib_conn->wait); + atomic_set(&ib_conn->post_recv_buf_count, 0); + 
atomic_set(&ib_conn->post_send_buf_count, 0); + INIT_WORK(&ib_conn->comperror_work, iser_comp_error_worker, + ib_conn); + + ib_conn->page_vec = kmalloc(sizeof(struct iser_page_vec) + + (sizeof(u64) * (ISCSI_ISER_SG_TABLESIZE +1)), + GFP_KERNEL); + if(!ib_conn->page_vec) + return -ENOMEM; + + ib_conn->page_vec->pages = (u64 *) (ib_conn->page_vec + 1); + return 0; +} + + /** + * starts the process of connecting to the target + * sleeps untill the connection is established or rejected + */ +int iser_connect(struct iser_conn *ib_conn, + struct sockaddr_in *src_addr, + struct sockaddr_in *dst_addr, + int non_blocking) +{ + struct sockaddr *src, *dst; + int err = 0; + + sprintf(ib_conn->name,"%d.%d.%d.%d:%d", + NIPQUAD(dst_addr->sin_addr.s_addr), dst_addr->sin_port); + + /* the device is known only --after-- address resolution */ + ib_conn->device = NULL; + + iser_err("connecting to: %d.%d.%d.%d, port 0x%x\n", + NIPQUAD(dst_addr->sin_addr), dst_addr->sin_port); + + atomic_set(&ib_conn->state, ISER_CONN_PENDING); + + ib_conn->cma_id = rdma_create_id(iser_cma_handler, + (void *)ib_conn, + RDMA_PS_TCP); + if (IS_ERR(ib_conn->cma_id)) { + err = PTR_ERR(ib_conn->cma_id); + iser_err("rdma_create_id failed: %d\n", err); + goto connect_failure; + } + + src = (struct sockaddr *)src_addr; + dst = (struct sockaddr *)dst_addr; + err = rdma_resolve_addr(ib_conn->cma_id, src, dst, 1000); + if (err) { + iser_err("rdma_resolve_addr failed: %d\n", err); + rdma_destroy_id(ib_conn->cma_id); + goto connect_failure; + } + + if(!non_blocking) { + wait_event_interruptible(ib_conn->wait, + atomic_read(&ib_conn->state) != ISER_CONN_PENDING); + + if (atomic_read(&ib_conn->state) != ISER_CONN_UP) { + iser_conn_release(ib_conn); + err = -EIO; + goto connect_failure; + } + } + return 0; + +connect_failure: + atomic_set(&ib_conn->state, ISER_CONN_DOWN); + return err; +} + +/** + * Frees all conn objects and deallocs conn descriptor + */ +static void iser_conn_release(struct iser_conn *ib_conn) +{ + 
struct iser_device *device = ib_conn->device; + + if (atomic_read(&ib_conn->state) == ISER_CONN_DOWN) { + iser_free_ib_conn_res(ib_conn); /* qp/id freed only once */ + ib_conn->device = NULL; + /* on EVENT_ADDR_ERROR there's no device yet for this conn */ + if (device != NULL) + iser_device_try_release(device); + } else + iser_err("conn %p state is %d doing nothing\n", + ib_conn,atomic_read(&ib_conn->state)); +} + + +/** + * iser_reg_page_vec - Register physical memory + * + * returns: 0 on success, errno code on failure + */ +int iser_reg_page_vec(struct iser_conn *ib_conn, + struct iser_page_vec *page_vec, + struct iser_mem_reg *mem_reg) +{ + struct ib_pool_fmr *mem; + u64 io_addr; + u64 *page_list; + int status; + + page_list = page_vec->pages; + io_addr = page_list[0]; + + mem = ib_fmr_pool_map_phys(ib_conn->fmr_pool, + page_list, + page_vec->length, + &io_addr); + + if (IS_ERR(mem)) { + status = (int)PTR_ERR(mem); + iser_err("ib_fmr_pool_map_phys failed: %d\n", status); + return status; + } + + mem_reg->lkey = mem->fmr->lkey; + mem_reg->rkey = mem->fmr->rkey; + mem_reg->len = page_vec->length * PAGE_SIZE; + mem_reg->va = io_addr; + mem_reg->mem_h = (void *)mem; + + mem_reg->va += page_vec->offset; + mem_reg->len = page_vec->data_size; + + iser_dbg("PHYSICAL Mem.register, [PHYS p_array: 0x%p, sz: %d, " + "entry[0]: (0x%08lx,%ld)] -> " + "[lkey: 0x%08X mem_h: 0x%p va: 0x%08lX sz: %ld]\n", + page_vec, page_vec->length, + (unsigned long)page_vec->pages[0], + (unsigned long)page_vec->data_size, + (unsigned int)mem_reg->lkey, mem_reg->mem_h, + (unsigned long)mem_reg->va, (unsigned long)mem_reg->len); + return 0; +} + +/** + * Unregister (previosuly registered) memory. 
+ */ +void iser_unreg_mem(struct iser_mem_reg *reg) +{ + int ret; + + iser_dbg("PHYSICAL Mem.Unregister mem_h %p\n",reg->mem_h); + + ret = ib_fmr_pool_unmap((struct ib_pool_fmr *)reg->mem_h); + if (ret) + iser_err("ib_fmr_pool_unmap failed %d\n", ret); + + reg->mem_h = NULL; +} + +/** + * iser_dto_to_iov - builds IOV from a dto descriptor + */ +static void iser_dto_to_iov(struct iser_dto *dto, struct ib_sge *iov, int iov_len) +{ + int i; + struct ib_sge *sge; + struct iser_regd_buf *regd_buf; + + if (dto->regd_vector_len > iov_len) + iser_bug("iov size %d too small for posting dto of len %d\n", + iov_len, dto->regd_vector_len); + + for (i = 0; i < dto->regd_vector_len; i++) { + sge = &iov[i]; + regd_buf = dto->regd[i]; + + sge->addr = regd_buf->reg.va; + sge->length = regd_buf->reg.len; + sge->lkey = regd_buf->reg.lkey; + + if (dto->used_sz[i] > 0) /* Adjust size */ + sge->length = dto->used_sz[i]; + + /* offset and length should not exceed the regd buf length */ + if (sge->length + dto->offset[i] > regd_buf->reg.len) { + iser_bug("Used len:%ld + offset:%d, exceed reg.buf.len:" + "%ld in dto:0x%p [%d], va:0x%08lX\n", + (unsigned long)sge->length, dto->offset[i], + (unsigned long)regd_buf->reg.len, dto, i, + (unsigned long)sge->addr); + } + + sge->addr += dto->offset[i]; /* Adjust offset */ + } +} + +/** + * iser_post_recv - Posts a receive buffer. 
+ * + * returns 0 on success, -1 on failure + */ +int iser_post_recv(struct iser_desc *rx_desc) +{ + int ib_ret, ret_val = 0; + struct ib_recv_wr recv_wr, *recv_wr_failed; + struct ib_sge iov[2]; + struct iser_conn *ib_conn; + struct iser_dto *recv_dto = &rx_desc->dto; + + /* Retrieve conn */ + ib_conn = recv_dto->conn->ib_conn; + + iser_dto_to_iov(recv_dto, iov, 2); + + recv_wr.next = NULL; + recv_wr.sg_list = iov; + recv_wr.num_sge = recv_dto->regd_vector_len; + recv_wr.wr_id = (unsigned long)rx_desc; + + atomic_inc(&ib_conn->post_recv_buf_count); + ib_ret = ib_post_recv(ib_conn->qp, &recv_wr, &recv_wr_failed); + if (ib_ret) { + iser_err("ib_post_recv failed ret=%d\n", ib_ret); + atomic_dec(&ib_conn->post_recv_buf_count); + ret_val = -1; + } + + return ret_val; +} + +/** + * iser_start_send - Initiate a Send DTO operation + * + * returns 0 on success, -1 on failure + */ +int iser_post_send(struct iser_desc *tx_desc) +{ + int ib_ret, ret_val = 0; + struct ib_send_wr send_wr, *send_wr_failed; + struct ib_sge iov[MAX_REGD_BUF_VECTOR_LEN]; + struct iser_conn *ib_conn; + struct iser_dto *dto = &tx_desc->dto; + + ib_conn = dto->conn->ib_conn; + + iser_dto_to_iov(dto, iov, MAX_REGD_BUF_VECTOR_LEN); + + send_wr.next = NULL; + send_wr.wr_id = (unsigned long)tx_desc; + send_wr.sg_list = iov; + send_wr.num_sge = dto->regd_vector_len; + send_wr.opcode = IB_WR_SEND; + send_wr.send_flags = dto->notify_enable ? 
IB_SEND_SIGNALED : 0; + + atomic_inc(&ib_conn->post_send_buf_count); + + ib_ret = ib_post_send(ib_conn->qp, &send_wr, &send_wr_failed); + if (ib_ret) { + iser_err("Failed to start SEND DTO, dto: 0x%p, IOV len: %d\n", + dto, dto->regd_vector_len); + iser_err("ib_post_send failed, ret:%d\n", ib_ret); + atomic_dec(&ib_conn->post_send_buf_count); + ret_val = -1; + } + + return ret_val; +} + +static void iser_comp_error_worker(void *data) +{ + struct iser_conn *ib_conn = data; + + if (atomic_read(&ib_conn->state) == ISER_CONN_UP) { + atomic_set(&ib_conn->state, ISER_CONN_TERMINATING); + iscsi_conn_failure(ib_conn->iscsi_conn->iscsi_conn, + ISCSI_ERR_CONN_FAILED); + } + + /* complete the termination process if disconnect event was delivered * + * note there are no more non completed posts to the QP */ + if (ib_conn->disc_evt_flag) { + atomic_set(&ib_conn->state, ISER_CONN_DOWN); + wake_up_interruptible(&ib_conn->wait); + } +} + +static void iser_handle_comp_error(struct iser_desc *desc) +{ + struct iser_dto *dto = &desc->dto; + struct iser_conn *ib_conn = dto->conn->ib_conn; + + iser_dto_buffs_release(dto); + + if (desc->type == ISCSI_RX) { + kfree(desc->data); + kmem_cache_free(ig.desc_cache, desc); + atomic_dec(&ib_conn->post_recv_buf_count); + } else { /* type is TX control/command/dataout */ + if (desc->type == ISCSI_TX_DATAOUT) + kmem_cache_free(ig.desc_cache, desc); + atomic_dec(&ib_conn->post_send_buf_count); + } + + if (atomic_read(&ib_conn->post_recv_buf_count) == 0 && + atomic_read(&ib_conn->post_send_buf_count) == 0) + schedule_work(&ib_conn->comperror_work); +} + +static void iser_cq_tasklet_fn(unsigned long data) +{ + struct iser_device *device = (struct iser_device *)data; + struct ib_cq *cq = device->cq; + struct ib_wc wc; + struct iser_desc *desc; + unsigned long xfer_len; + + while (ib_poll_cq(cq, 1, &wc) == 1) { + desc = (struct iser_desc *) (unsigned long) wc.wr_id; + + if (desc == NULL) + iser_bug("NULL desc\n"); + + if (wc.status == IB_WC_SUCCESS) { 
+ if (desc->type == ISCSI_RX) { + xfer_len = (unsigned long)wc.byte_len; + iser_rcv_completion(desc, xfer_len); + } else /* type == ISCSI_TX_CONTROL/SCSI_CMD/DOUT */ + iser_snd_completion(desc); + } else { + iser_err("comp w. error op %d status %d\n",desc->type,wc.status); + iser_handle_comp_error(desc); + } + } + /* #warning "it is assumed here that arming CQ only once its empty" * + * " would not cause interrupts to be missed" */ + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); +} + +static void iser_cq_callback(struct ib_cq *cq, void *cq_context) +{ + struct iser_device *device = (struct iser_device *)cq_context; + + tasklet_schedule(&device->cq_tasklet); +} From ogerlitz at voltaire.com Thu Apr 6 05:19:05 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 6 Apr 2006 15:19:05 +0300 (IDT) Subject: [openib-general] [PATCH 5/8] [RFC] iser handling of memory for RDMA In-Reply-To: Message-ID: --- iser-libiscsi-canq2-ep-null/iser_memory.c 2006-04-06 14:47:36.000000000 +0300 +++ iser-libiscsi-canq2-ep/iser_memory.c 2006-03-27 18:31:25.000000000 +0200 @@ -0,0 +1,417 @@ +/* + * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: iser_memory.c 5524 2006-02-28 10:18:43Z ogerlitz $ + */ +#include +#include +#include +#include +#include +#include +#include + +#include "iscsi_iser.h" + +#define ISER_KMALLOC_THRESHOLD 0x20000 /* 128K - kmalloc limit */ +/** + * Decrements the reference count for the + * registered buffer & releases it + * + * returns 0 if released, 1 if deferred + */ +int iser_regd_buff_release(struct iser_regd_buf *regd_buf) +{ + struct device *dma_device; + + if ((atomic_read(&regd_buf->ref_count) == 0) || + atomic_dec_and_test(&regd_buf->ref_count)) { + /* if we used the dma mr, unreg is just NOP */ + if (regd_buf->reg.rkey != 0) + iser_unreg_mem(&regd_buf->reg); + + if (regd_buf->dma_addr) { + dma_device = regd_buf->device->ib_device->dma_device; + dma_unmap_single(dma_device, + regd_buf->dma_addr, + regd_buf->data_size, + regd_buf->direction); + } + /* else this regd buf is associated with task which we */ + /* dma_unmap_single/sg later */ + return 0; + } else { + iser_dbg("Release deferred, regd.buff: 0x%p\n", regd_buf); + return 1; + } +} + +/** + * iser_reg_single - fills registered buffer descriptor with + * registration information + */ +void iser_reg_single(struct iser_device *device, + struct iser_regd_buf *regd_buf, + enum dma_data_direction direction) +{ + dma_addr_t dma_addr; + + dma_addr = dma_map_single(device->ib_device->dma_device, + regd_buf->virt_addr, + regd_buf->data_size, direction); + if (dma_mapping_error(dma_addr)) + iser_bug("dma_map_single failed at %p\n", 
regd_buf->virt_addr); + + regd_buf->reg.lkey = device->mr->lkey; + regd_buf->reg.rkey = 0; /* indicate there's no need to unreg */ + regd_buf->reg.len = regd_buf->data_size; + regd_buf->reg.va = dma_addr; + + regd_buf->dma_addr = dma_addr; + regd_buf->direction = direction; +} + + +/** + * iser_sg_size - returns the total data length in an sg list + */ +int iser_sg_size(struct iser_data_buf *data) +{ + struct scatterlist *sg = (struct scatterlist *)data->buf; + int i, total_len=0; + + for (i = 0; i < data->dma_nents; i++) + total_len += sg_dma_len(&sg[i]); + return total_len; +} + +/** + * iser_start_rdma_unaligned_sg + */ +int iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task *ctask, + enum iser_data_dir cmd_dir) +{ + int dma_nents; + struct device *dma_device; + char *mem = NULL; + struct iser_data_buf *data = &ctask->data[cmd_dir]; + unsigned long cmd_data_len = data->data_len; + + if (cmd_data_len > ISER_KMALLOC_THRESHOLD) + mem = (void *)__get_free_pages(GFP_KERNEL, + long_log2(roundup_pow_of_two(cmd_data_len)) - PAGE_SHIFT); + else + mem = kmalloc(cmd_data_len, GFP_KERNEL); + + if (mem == NULL) { + iser_err("Failed to allocate mem size %d %d for copying sglist\n", + data->size,(int)cmd_data_len); + return -ENOMEM; + } + + if (cmd_dir == ISER_DIR_OUT) { + /* copy the unaligned sg the buffer which is used for RDMA */ + struct scatterlist *sg = (struct scatterlist *)data->buf; + int i; + char *p, *from; + + for (p = mem, i = 0; i < data->size; i++) { + from = kmap_atomic(sg[i].page, KM_USER0); + memcpy(p, + from + sg[i].offset, + sg[i].length); + kunmap_atomic(from, KM_USER0); + p += sg[i].length; + } + } + + sg_init_one(&ctask->data_copy[cmd_dir].sg_single, mem, cmd_data_len); + ctask->data_copy[cmd_dir].buf = &ctask->data_copy[cmd_dir].sg_single; + ctask->data_copy[cmd_dir].size = 1; + + ctask->data_copy[cmd_dir].copy_buf = mem; + + dma_device = ctask->iser_conn->ib_conn->device->ib_device->dma_device; + + if (cmd_dir == ISER_DIR_OUT) + dma_nents = 
dma_map_sg(dma_device, + &ctask->data_copy[cmd_dir].sg_single, 1, + DMA_TO_DEVICE); + else + dma_nents = dma_map_sg(dma_device, + &ctask->data_copy[cmd_dir].sg_single, 1, + DMA_FROM_DEVICE); + + if (dma_nents == 0) + iser_bug("dma_map_sg failed at %p\n", mem); + + ctask->data_copy[cmd_dir].dma_nents = dma_nents; + return 0; +} + +/** + * iser_finalize_rdma_unaligned_sg + */ +void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *ctask, + enum iser_data_dir cmd_dir) +{ + struct device *dma_device; + struct iser_data_buf *mem_copy; + unsigned long cmd_data_len; + + dma_device = ctask->iser_conn->ib_conn->device->ib_device->dma_device; + mem_copy = &ctask->data_copy[cmd_dir]; + + if(cmd_dir == ISER_DIR_OUT) + dma_unmap_sg(dma_device, &mem_copy->sg_single, 1, + DMA_TO_DEVICE); + else + dma_unmap_sg(dma_device, &mem_copy->sg_single, 1, + DMA_FROM_DEVICE); + + if (cmd_dir == ISER_DIR_IN) { + char *mem; + struct scatterlist *sg; + unsigned char *p, *to; + unsigned int sg_size; + int i; + + /* copy back read RDMA to unaligned sg */ + mem = mem_copy->copy_buf; + + sg = (struct scatterlist *)ctask->data[ISER_DIR_IN].buf; + sg_size = ctask->data[ISER_DIR_IN].size; + + for (p = mem, i = 0; i < sg_size; i++){ + to = kmap_atomic(sg[i].page, KM_SOFTIRQ0); + memcpy(to + sg[i].offset, + p, + sg[i].length); + kunmap_atomic(to, KM_SOFTIRQ0); + p += sg[i].length; + } + } + + cmd_data_len = ctask->data[cmd_dir].data_len; + + if (cmd_data_len > ISER_KMALLOC_THRESHOLD) + free_pages((unsigned long)mem_copy->copy_buf, + long_log2(roundup_pow_of_two(cmd_data_len)) - PAGE_SHIFT); + else + kfree(mem_copy->copy_buf); + + mem_copy->copy_buf = NULL; +} + +/** + * iser_sg_to_page_vec - Translates scatterlist entries to physical addresses + * and returns the length of resulting physical address array (may be less than + * the original due to possible compaction). + * + * we build a "page vec" under the assumption that the SG meets the RDMA + * alignment requirements. 
Other then the first and last SG elements, all + * the "internal" elements can be compacted into a list whose elements are + * dma addresses of physical pages. The code supports also the weird case + * where --few fragments of the same page-- are present in the SG as + * consecutive elements. Also, it handles one entry SG. + */ +static int iser_sg_to_page_vec(struct iser_data_buf *data, + struct iser_page_vec *page_vec) +{ + struct scatterlist *sg = (struct scatterlist *)data->buf; + dma_addr_t first_addr, last_addr, page; + int start_aligned, end_aligned; + unsigned int cur_page = 0; + unsigned long total_sz = 0; + int i; + + /* compute the offset of first element */ + /* FIXME page_vec->offset type should be dma_addr_t */ + page_vec->offset = (u64) sg[0].offset; + + for (i = 0; i < data->dma_nents; i++) { + total_sz += sg_dma_len(&sg[i]); + + first_addr = sg_dma_address(&sg[i]); + last_addr = first_addr + sg_dma_len(&sg[i]); + + start_aligned = !(first_addr & ~PAGE_MASK); + end_aligned = !(last_addr & ~PAGE_MASK); + + /* continue to collect page fragments till aligned or SG ends */ + while (!end_aligned && (i + 1 < data->dma_nents)) { + i++; + total_sz += sg_dma_len(&sg[i]); + last_addr = sg_dma_address(&sg[i]) + sg_dma_len(&sg[i]); + end_aligned = !(last_addr & ~PAGE_MASK); + } + + first_addr = first_addr & PAGE_MASK; + + for (page = first_addr; page < last_addr; page += PAGE_SIZE) + page_vec->pages[cur_page++] = page; + + } + page_vec->data_size = total_sz; + iser_dbg("page_vec->data_size:%d cur_page %d\n", page_vec->data_size,cur_page); + return cur_page; +} + +#define MASK_4K ((1UL << 12) - 1) /* 0xFFF */ +#define IS_4K_ALIGNED(addr) ((((unsigned long)addr) & MASK_4K) == 0) + +/** + * iser_data_buf_aligned_len - Tries to determine the maximal correctly aligned + * for RDMA sub-list of a scatter-gather list of memory buffers, and returns + * the number of entries which are aligned correctly. 
Supports the case where + * consecutive SG elements are actually fragments of the same physcial page. + */ +static unsigned int iser_data_buf_aligned_len(struct iser_data_buf *data) +{ + struct scatterlist *sg; + dma_addr_t end_addr, next_addr; + int i, cnt; + unsigned int ret_len = 0; + + sg = (struct scatterlist *)data->buf; + + for (cnt = 0, i = 0; i < data->dma_nents; i++, cnt++) { + /* iser_dbg("Checking sg iobuf [%d]: phys=0x%08lX " + "offset: %ld sz: %ld\n", i, + (unsigned long)page_to_phys(sg[i].page), + (unsigned long)sg[i].offset, + (unsigned long)sg[i].length); */ + end_addr = sg_dma_address(&sg[i]) + + sg_dma_len(&sg[i]); + /* iser_dbg("Checking sg iobuf end address " + "0x%08lX\n", end_addr); */ + if (i + 1 < data->dma_nents) { + next_addr = sg_dma_address(&sg[i+1]); + /* are i, i+1 fragments of the same page? */ + if (end_addr == next_addr) + continue; + else if (!IS_4K_ALIGNED(end_addr)) { + ret_len = cnt + 1; + break; + } + } + } + if (i == data->dma_nents) + ret_len = cnt; /* loop ended */ + iser_dbg("Found %d aligned entries out of %d in sg:0x%p\n", + ret_len, data->dma_nents, data); + return ret_len; +} + +static void iser_data_buf_dump(struct iser_data_buf *data) +{ + struct scatterlist *sg = (struct scatterlist *)data->buf; + int i; + + for (i = 0; i < data->size; i++) + iser_err("sg[%d] dma_addr:0x%lX page:0x%p " + "off:%d sz:%d dma_len:%d\n", + i, (unsigned long)sg_dma_address(&sg[i]), + sg[i].page, sg[i].offset, + sg[i].length,sg_dma_len(&sg[i])); +} + +static void iser_dump_page_vec(struct iser_page_vec *page_vec) +{ + int i; + + iser_err("page vec length %d data size %d\n", + page_vec->length, page_vec->data_size); + for (i = 0; i < page_vec->length; i++) + iser_err("%d %lx\n",i,(unsigned long)page_vec->pages[i]); +} + +static void iser_page_vec_build(struct iser_data_buf *data, + struct iser_page_vec *page_vec) +{ + int page_vec_len = 0; + + page_vec->length = 0; + page_vec->offset = 0; + + iser_dbg("Translating sg sz: %d\n", 
data->dma_nents); + page_vec_len = iser_sg_to_page_vec(data,page_vec); + iser_dbg("sg len %d page_vec_len %d\n", data->dma_nents,page_vec_len); + + page_vec->length = page_vec_len; + + if (page_vec_len * 4096 < page_vec->data_size) { + iser_err("dumping sg\n"); + iser_data_buf_dump(data); + iser_dump_page_vec(page_vec); + iser_bug("page_vec too short to hold this SG\n"); + } +} + +/** + * iser_reg_rdma_mem - Registers memory intended for RDMA, + * obtaining rkey and va + * + * returns 0 on success, errno code on failure + */ +int iser_reg_rdma_mem(struct iscsi_iser_cmd_task *ctask, + enum iser_data_dir cmd_dir) +{ + struct iser_conn *ib_conn = ctask->iser_conn->ib_conn; + struct iser_data_buf *mem = &ctask->data[cmd_dir]; + struct iser_regd_buf *regd_buf; + int aligned_len; + int err; + + regd_buf = &ctask->rdma_regd[cmd_dir]; + + aligned_len = iser_data_buf_aligned_len(mem); + if (aligned_len != mem->size) { + iser_err("rdma alignment violation %d/%d aligned\n", + aligned_len, mem->size); + iser_data_buf_dump(mem); + /* allocate copy buf, if we are writing, copy the */ + /* unaligned scatterlist, dma map the copy */ + if (iser_start_rdma_unaligned_sg(ctask, cmd_dir) != 0) + return -ENOMEM; + mem = &ctask->data_copy[cmd_dir]; + } + + iser_page_vec_build(mem, ib_conn->page_vec); + err = iser_reg_page_vec(ib_conn, ib_conn->page_vec, &regd_buf->reg); + if(err) + return err; + + /* take a reference on this regd buf such that it will not be released * + * (eg in send dto completion) before we get the scsi response */ + atomic_inc(&regd_buf->ref_count); + return 0; +} From ogerlitz at voltaire.com Thu Apr 6 05:19:44 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 6 Apr 2006 15:19:44 +0300 (IDT) Subject: [openib-general] [PATCH 6/8] [RFC] iser socket - removed !!! 
In-Reply-To: Message-ID: --- iser-libiscsi-canq2-ep-null/iser_socket.h 2006-04-06 15:09:28.000000000 +0300 +++ iser-libiscsi-canq2-ep/iser_socket.h 2006-04-06 15:09:41.000000000 +0300 @@ -1,49 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - * $Id: iser_socket.h 5502 2006-02-27 09:09:38Z ogerlitz $ - */ - -#ifndef __ISER_SOCKETS_H__ -#define __ISER_SOCKETS_H__ - -#include - -struct iser_conn; - -#define AF_ISER 28 /* to be defined properly */ - -int iser_register_sockets(void); -void iser_unreg_sockets(void); - -struct iser_conn *iser_conn_from_sock(struct socket *sock); -struct socket *iser_conn_to_sock(struct iser_conn *iser_conn); -#endif /* __ISER_SOCKETS_H__ */ --- iser-libiscsi-canq2-ep-null/iser_socket.c 2006-04-06 15:09:24.000000000 +0300 +++ iser-libiscsi-canq2-ep/iser_socket.c 2006-04-06 15:09:38.000000000 +0300 @@ -1,215 +0,0 @@ -/* - * Copyright (c) 2005, 2006 Voltaire, Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - * $Id: iser_socket.c 5505 2006-02-27 12:48:53Z ogerlitz $ - */ - -#include -#include -#include - -#include "iscsi_iser.h" -#include "iser_socket.h" - -#define PF_ISER AF_ISER - -static int iser_sock_create(struct socket *, int); -static int iser_sock_release(struct socket *); -static int iser_sock_connect(struct socket *, struct sockaddr *, int, int); -static int iser_sock_shutdown(struct socket *,int); -static int iser_sock_getsockopt(struct socket *,int,int,char __user *,int __user *); -static unsigned int iser_sock_poll(struct file *,struct socket *, - struct poll_table_struct *); - -struct iser_sock { - struct sock sock; - struct iser_conn iser_conn; -}; - -static struct net_proto_family iser_proto_family = { - .family = PF_ISER, - .create = iser_sock_create, - .authentication = 0, - .encryption = 0, - .encrypt_net = 0 -}; - -static struct proto_ops iser_proto_ops = { - .family = AF_ISER, - .owner = THIS_MODULE, - - .connect = iser_sock_connect, - .release = iser_sock_release, - .shutdown = iser_sock_shutdown, - - .bind = sock_no_bind, - .poll = iser_sock_poll, - .socketpair = sock_no_socketpair, - .accept = sock_no_accept, - .getname = sock_no_getname, - .ioctl = sock_no_ioctl, - .listen = sock_no_listen, - .setsockopt = sock_setsockopt, - .getsockopt = iser_sock_getsockopt, - .sendmsg = sock_no_sendmsg, - .recvmsg = sock_no_recvmsg, - .mmap = sock_no_mmap, - .sendpage = sock_no_sendpage -}; - -static struct proto iser_sock_proto = { - .name = "ib_iser", - .owner = THIS_MODULE, - .obj_size = sizeof(struct iser_sock) -}; - -struct iser_conn *iser_conn_from_sock(struct socket *sock) -{ - struct iser_sock *iser_sk = (struct iser_sock *)sock->sk; - - return &iser_sk->iser_conn; -} - 
-struct socket *iser_conn_to_sock(struct iser_conn *iser_conn) -{ - struct iser_sock *iser_sk; - iser_sk = container_of(iser_conn, struct iser_sock, iser_conn); - - return iser_sk->sock.sk_socket; -} - -int iser_register_sockets(void) -{ - int error; - - error = proto_register(&iser_sock_proto, 1); - if (error < 0) { - iser_err("proto_register failed (%d)\n", error); - return error; - } - - error = sock_register(&iser_proto_family); - if (error < 0) { - iser_err("sock_register failed (%d)\n", error); - proto_unregister(&iser_sock_proto); - } - - return 0; -} - -void iser_unreg_sockets(void) -{ - sock_unregister(PF_ISER); - proto_unregister(&iser_sock_proto); -} - -static int iser_sock_create(struct socket *sock, int protocol) -{ - struct iser_sock *iser_sk = NULL; - - if (sock->type != SOCK_STREAM) - return -ESOCKTNOSUPPORT; - - iser_sk = (struct iser_sock *)sk_alloc(PF_INET, GFP_KERNEL, - &iser_sock_proto, 1); - if (iser_sk == NULL) - return -ENOBUFS; - - sock_init_data(sock, &iser_sk->sock); - iser_sk->sock.sk_destruct = NULL; - iser_sk->sock.sk_family = PF_ISER; - iser_sk->sock.sk_sndbuf = 64*1024; - - if (iser_conn_init(&iser_sk->iser_conn) != 0) - return -ENOMEM; - - sock->ops = &iser_proto_ops; - sock->state = SS_UNCONNECTED; - sock_graft(&iser_sk->sock, sock); - - return 0; -} - -int iser_sock_connect(struct socket *sock, struct sockaddr *uservaddr, - int sockaddr_len, int flags) -{ - struct sockaddr_in *dst_addr = (struct sockaddr_in *)uservaddr; - struct iser_sock *iser_sk = (struct iser_sock *)sock->sk; - struct iser_conn *iser_conn = &iser_sk->iser_conn; - int err = 0; - - iser_err("dst_addr ip %.8x (%d.%d.%d.%d) port %.4x=%d\n", - dst_addr->sin_addr.s_addr, NIPQUAD(dst_addr->sin_addr), - dst_addr->sin_port, dst_addr->sin_port); - - err = iser_connect(iser_conn, NULL, dst_addr); - if (err) - iser_err("conn_establish failed: %d\n", err); - return err; -} - -static inline void iser_sock_free(struct socket *sock) -{ - struct sock *sk = sock->sk; - sock->sk 
= NULL; - sock_orphan(sk); - sk_free(sk); -} - -int iser_sock_release(struct socket *sock) -{ - struct iser_sock *iser_sock = (struct iser_sock *)sock->sk; - struct iser_conn *iser_conn = &iser_sock->iser_conn; - int iser_err = 0; - - if (atomic_read(&iser_conn->state) == ISER_CONN_DOWN) - iser_sock_free(sock); - else - iser_err = -EPERM; - return iser_err; -} - -int iser_sock_shutdown(struct socket *sock, int how) -{ - return 0; -} - -static int iser_sock_getsockopt(struct socket *sock, int level, int optname, - char __user *optval, int __user *optlen) -{ - return 0; -} - -static unsigned int iser_sock_poll(struct file *file, struct socket *sock, - struct poll_table_struct *wait) -{ - return POLLOUT; -} From ogerlitz at voltaire.com Thu Apr 6 05:20:27 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 6 Apr 2006 15:20:27 +0300 (IDT) Subject: [openib-general] [PATCH 7/8] [RFC] include/scsi/libiscsi.h In-Reply-To: Message-ID: --- /tmp/libiscsi.h.null 2006-04-06 14:55:37.000000000 +0300 +++ kernel/libiscsi.h 2006-04-06 10:43:50.000000000 +0300 @@ -0,0 +1,284 @@ +/* + * iSCSI lib definitions + * + * Copyright (C) 2006 Red Hat, Inc. All rights reserved. + * Copyright (C) 2004 - 2006 Mike Christie + * Copyright (C) 2004 - 2005 Dmitry Yusupov + * Copyright (C) 2004 - 2005 Alex Aizman + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + */ +#ifndef LIBISCSI_H +#define LIBISCSI_H + +#include +#include +#include +#include + +struct scsi_transport_template; +struct scsi_device; +struct Scsi_Host; +struct scsi_cmnd; +struct socket; +struct iscsi_transport; +struct iscsi_cls_session; +struct iscsi_cls_conn; +struct iscsi_session; +struct iscsi_nopin; + +/* #define DEBUG_SCSI */ +#ifdef DEBUG_SCSI +#define debug_scsi(fmt...) printk(KERN_INFO "scsi: " fmt) +#else +#define debug_scsi(fmt...) +#endif + +#define ISCSI_MGMT_CMDS_MAX 32 /* must be power of 2 */ +#define ISCSI_CONN_MAX 1 + +#define ISCSI_MGMT_ITT_OFFSET 0xa00 + +#define ISCSI_DEF_CMD_PER_LUN 32 + +/* Task Mgmt states */ +#define TMABORT_INITIAL 0x0 +#define TMABORT_SUCCESS 0x1 +#define TMABORT_FAILED 0x2 +#define TMABORT_TIMEDOUT 0x3 + +/* Connection suspend "bit" */ +#define ISCSI_SUSPEND_BIT 1 + +#define ISCSI_ITT_MASK (0xfff) +#define ISCSI_CID_SHIFT 12 +#define ISCSI_CID_MASK (0xffff << ISCSI_CID_SHIFT) +#define ISCSI_AGE_SHIFT 28 +#define ISCSI_AGE_MASK (0xf << ISCSI_AGE_SHIFT) + +struct iscsi_mgmt_task { + /* + * Becuae LLDs allocate their hdr differently, this is a pointer to + * that storage. It must be setup at session creation time. + */ + struct iscsi_hdr *hdr; + char *data; /* mgmt payload */ + int data_count; /* counts data to be sent */ + uint32_t itt; /* this ITT */ + void *dd_data; /* driver/transport data */ + struct list_head running; +}; + +struct iscsi_cmd_task { + /* + * Becuae LLDs allocate their hdr differently, this is a pointer to + * that storage. It must be setup at session creation time. 
+ */ + struct iscsi_cmd *hdr; + int itt; /* this ITT */ + int datasn; /* DataSN */ + + uint32_t unsol_datasn; + int imm_count; /* imm-data (bytes) */ + int unsol_count; /* unsolicited (bytes)*/ + int data_count; /* remaining Data-Out */ + struct scsi_cmnd *sc; /* associated SCSI cmd*/ + int total_length; + struct iscsi_conn *conn; /* used connection */ + struct iscsi_mgmt_task *mtask; /* tmf mtask in progr */ + + struct list_head running; /* running cmd list */ + void *dd_data; /* driver/transport data */ +}; + +struct iscsi_conn { + struct iscsi_cls_conn *cls_conn; /* ptr to class connection */ + void *dd_data; /* iscsi_transport data */ + struct iscsi_session *session; /* parent session */ + /* + * LLDs should set this lock. It protects the transport recv + * code + */ + rwlock_t *recv_lock; + /* + * conn_stop() flag: stop to recover, stop to terminate + */ + int stop_stage; + + /* iSCSI connection-wide sequencing */ + uint32_t exp_statsn; + + /* control data */ + int id; /* CID */ + struct list_head item; /* maintains list of conns */ + int c_stage; /* connection state */ + struct iscsi_mgmt_task *login_mtask; /* mtask used for login/text */ + struct iscsi_mgmt_task *mtask; /* xmit mtask in progress */ + struct iscsi_cmd_task *ctask; /* xmit ctask in progress */ + + /* xmit */ + struct kfifo *immqueue; /* immediate xmit queue */ + struct kfifo *mgmtqueue; /* mgmt (control) xmit queue */ + struct list_head mgmt_run_list; /* list of control tasks */ + struct kfifo *xmitqueue; /* data-path cmd queue */ + struct list_head run_list; /* list of cmds in progress */ + struct work_struct xmitwork; /* per-conn. 
xmit workqueue */ + /* + * serializes connection xmit, access to kfifos: + * xmitqueue, immqueue, mgmtqueue + */ + struct mutex xmitmutex; + + unsigned long suspend_tx; /* suspend Tx */ + unsigned long suspend_rx; /* suspend Rx */ + + /* abort */ + wait_queue_head_t ehwait; /* used in eh_abort() */ + struct iscsi_tm tmhdr; + struct timer_list tmabort_timer; + int tmabort_state; /* see TMABORT_INITIAL, etc.*/ + + /* negotiated params */ + int max_recv_dlength; /* initiator_max_recv_dsl*/ + int max_xmit_dlength; /* target_max_recv_dsl */ + int hdrdgst_en; + int datadgst_en; + + /* MIB-statistics */ + uint64_t txdata_octets; + uint64_t rxdata_octets; + uint32_t scsicmd_pdus_cnt; + uint32_t dataout_pdus_cnt; + uint32_t scsirsp_pdus_cnt; + uint32_t datain_pdus_cnt; + uint32_t r2t_pdus_cnt; + uint32_t tmfcmd_pdus_cnt; + int32_t tmfrsp_pdus_cnt; + + /* custom statistics */ + uint32_t eh_abort_cnt; +}; + +struct iscsi_queue { + struct kfifo *queue; /* FIFO Queue */ + void **pool; /* Pool of elements */ + int max; /* Max number of elements */ +}; + +struct iscsi_session { + /* iSCSI session-wide sequencing */ + uint32_t cmdsn; + uint32_t exp_cmdsn; + uint32_t max_cmdsn; + + /* configuration */ + int initial_r2t_en; + int max_r2t; + int imm_data_en; + int first_burst; + int max_burst; + int time2wait; + int time2retain; + int pdu_inorder_en; + int dataseq_inorder_en; + int erl; + int ifmarker_en; + int ofmarker_en; + + /* control data */ + struct iscsi_transport *tt; + struct Scsi_Host *host; + struct iscsi_conn *leadconn; /* leading connection */ + spinlock_t lock; /* protects session state, * + * sequence numbers, * + * session resources: * + * - cmdpool, * + * - mgmtpool, * + * - r2tpool */ + int state; /* session state */ + int recovery_failed; + struct list_head item; + int conn_cnt; + int age; /* counts session re-opens */ + + struct list_head connections; /* list of connections */ + int cmds_max; /* size of cmds array */ + struct iscsi_cmd_task **cmds; /* Original 
Cmds arr */ + struct iscsi_queue cmdpool; /* PDU's pool */ + int mgmtpool_max; /* size of mgmt array */ + struct iscsi_mgmt_task **mgmt_cmds; /* Original mgmt arr */ + struct iscsi_queue mgmtpool; /* Mgmt PDU's pool */ +}; + +/* + * scsi host template + */ +extern int iscsi_change_queue_depth(struct scsi_device *sdev, int depth); +extern int iscsi_eh_abort(struct scsi_cmnd *sc); +extern int iscsi_eh_host_reset(struct scsi_cmnd *sc); +extern int iscsi_queuecommand(struct scsi_cmnd *sc, + void (*done)(struct scsi_cmnd *)); + +/* + * session management + */ +extern struct iscsi_cls_session * +iscsi_session_setup(struct iscsi_transport *, struct scsi_transport_template *, + uint32_t, int, int, uint32_t, uint32_t *); +extern void iscsi_session_teardown(struct iscsi_cls_session *); +extern struct iscsi_session *class_to_transport_session(struct iscsi_cls_session *); +extern void iscsi_start_session_recovery(struct iscsi_session *, + struct iscsi_conn *, int); +extern void iscsi_session_recovery_timedout(struct iscsi_cls_session *); + +#define session_to_cls(_sess) \ + hostdata_session(_sess->host->hostdata) + +/* + * connection management + */ +extern struct iscsi_cls_conn *iscsi_conn_setup(struct iscsi_cls_session *, + uint32_t); +extern void iscsi_conn_teardown(struct iscsi_cls_conn *); +extern int iscsi_conn_start(struct iscsi_cls_conn *); +extern void iscsi_conn_stop(struct iscsi_cls_conn *, int); +extern int iscsi_conn_bind(struct iscsi_cls_session *, struct iscsi_cls_conn *, + int); +extern void iscsi_conn_failure(struct iscsi_conn *conn, enum iscsi_err err); + +/* + * pdu and task processing + */ +extern int iscsi_check_assign_cmdsn(struct iscsi_session *, + struct iscsi_nopin *); +extern void iscsi_prep_unsolicit_data_pdu(struct iscsi_cmd_task *, + struct iscsi_data *hdr, + int transport_data_cnt); +extern int iscsi_conn_send_pdu(struct iscsi_cls_conn *, struct iscsi_hdr *, + char *, uint32_t); +extern int iscsi_complete_pdu(struct iscsi_conn *, struct iscsi_hdr 
*, + char *, int); +extern int __iscsi_complete_pdu(struct iscsi_conn *, struct iscsi_hdr *, + char *, int); +extern int iscsi_verify_itt(struct iscsi_conn *, struct iscsi_hdr *, + uint32_t *); + +/* + * generic helpers + */ +extern void iscsi_pool_free(struct iscsi_queue *, void **); +extern int iscsi_pool_init(struct iscsi_queue *, int, void ***, int); + +#endif From ogerlitz at voltaire.com Thu Apr 6 05:21:51 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 6 Apr 2006 15:21:51 +0300 (IDT) Subject: [openib-general] [PATCH 8/8] [RFC] iscsi ep callbacks support In-Reply-To: Message-ID: The kernel part of the change in open iscsi which allows to connect from the kernel (and later run the login/logout/etc negotiation from user space) diff --exclude=.svn -rup oiscsi-529-canq2/include/iscsi_if.h oiscsi-529-canq2-ep2/include/iscsi_if.h --- oiscsi-529-canq2/include/iscsi_if.h 2006-04-06 10:44:23.000000000 +0300 +++ oiscsi-529-canq2-ep2/include/iscsi_if.h 2006-04-06 13:26:39.000000000 +0300 @@ -43,6 +43,10 @@ enum iscsi_uevent_e { ISCSI_UEVENT_GET_STATS = UEVENT_BASE + 10, ISCSI_UEVENT_GET_PARAM = UEVENT_BASE + 11, + ISCSI_UEVENT_TRANSPORT_EP_CONNECT = UEVENT_BASE + 12, + ISCSI_UEVENT_TRANSPORT_EP_POLL = UEVENT_BASE + 13, + ISCSI_UEVENT_TRANSPORT_EP_DISCONNECT = UEVENT_BASE + 14, + /* up events */ ISCSI_KEVENT_RECV_PDU = KEVENT_BASE + 1, ISCSI_KEVENT_CONN_ERROR = KEVENT_BASE + 2, @@ -70,7 +74,7 @@ struct iscsi_uevent { struct msg_bind_conn { uint32_t sid; uint32_t cid; - uint32_t transport_fd; + uint64_t transport_eph; uint32_t is_leading; } b_conn; struct msg_destroy_conn { @@ -103,6 +107,16 @@ struct iscsi_uevent { uint32_t sid; uint32_t cid; } get_stats; + struct msg_transport_connect { + uint32_t non_blocking; + } ep_connect; + struct msg_transport_poll { + uint64_t ep_handle; + uint32_t timeout_ms; + } ep_poll; + struct msg_transport_disconnect { + uint64_t ep_handle; + } ep_disconnect; } u; union { /* messages k -> u */ @@ -125,6 +139,9 @@ struct 
iscsi_uevent { uint32_t cid; uint32_t error; /* enum iscsi_err */ } connerror; + struct msg_transport_connect_ret { + uint64_t handle; + } ep_connect_ret; } r; } __attribute__ ((aligned (sizeof(uint64_t)))); diff --exclude=.svn -rup oiscsi-529-canq2/kernel/iscsi_tcp.c oiscsi-529-canq2-ep2/kernel/iscsi_tcp.c --- oiscsi-529-canq2/kernel/iscsi_tcp.c 2006-04-06 10:44:23.000000000 +0300 +++ oiscsi-529-canq2-ep2/kernel/iscsi_tcp.c 2006-04-06 13:26:39.000000000 +0300 @@ -1975,7 +1975,7 @@ iscsi_tcp_conn_destroy(struct iscsi_cls_ static int iscsi_tcp_conn_bind(struct iscsi_cls_session *cls_session, - struct iscsi_cls_conn *cls_conn, uint32_t transport_fd, + struct iscsi_cls_conn *cls_conn, uint64_t transport_eph, int is_leading) { struct iscsi_conn *conn = cls_conn->dd_data; @@ -1985,7 +1985,7 @@ iscsi_tcp_conn_bind(struct iscsi_cls_ses int err; /* lookup for existing socket */ - sock = sockfd_lookup(transport_fd, &err); + sock = sockfd_lookup((int)transport_eph, &err); if (!sock) { printk(KERN_ERR "iscsi_tcp: sockfd_lookup failed %d\n", err); return -EEXIST; diff --exclude=.svn -rup oiscsi-529-canq2/kernel/scsi_transport_iscsi.c oiscsi-529-canq2-ep2/kernel/scsi_transport_iscsi.c --- oiscsi-529-canq2/kernel/scsi_transport_iscsi.c 2006-04-06 10:44:23.000000000 +0300 +++ oiscsi-529-canq2-ep2/kernel/scsi_transport_iscsi.c 2006-04-06 13:26:39.000000000 +0300 @@ -930,6 +930,40 @@ iscsi_set_param(struct iscsi_transport * } static int +iscsi_if_transport_ep(struct iscsi_transport *transport, + struct iscsi_uevent *ev, int msg_type) +{ + struct sockaddr *dst_addr; + int rc = 0; + + switch (msg_type) { + case ISCSI_UEVENT_TRANSPORT_EP_CONNECT: + if (!transport->ep_connect) + return -EINVAL; + + dst_addr = (struct sockaddr *)((char*)ev + sizeof(*ev)); + rc = transport->ep_connect(dst_addr, + ev->u.ep_connect.non_blocking, + &ev->r.ep_connect_ret.handle); + break; + case ISCSI_UEVENT_TRANSPORT_EP_POLL: + if (!transport->ep_poll) + return -EINVAL; + + ev->r.retcode = 
transport->ep_poll(ev->u.ep_poll.ep_handle, + ev->u.ep_poll.timeout_ms); + break; + case ISCSI_UEVENT_TRANSPORT_EP_DISCONNECT: + if (!transport->ep_disconnect) + return -EINVAL; + + transport->ep_disconnect(ev->u.ep_disconnect.ep_handle); + break; + } + return rc; +} + +static int iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) { int err = 0; @@ -975,7 +1009,7 @@ iscsi_if_recv_msg(struct sk_buff *skb, s if (session && conn) ev->r.retcode = transport->bind_conn(session, conn, - ev->u.b_conn.transport_fd, + ev->u.b_conn.transport_eph, ev->u.b_conn.is_leading); else err = -EINVAL; @@ -1010,6 +1044,11 @@ iscsi_if_recv_msg(struct sk_buff *skb, s case ISCSI_UEVENT_GET_STATS: err = iscsi_if_get_stats(transport, nlh); break; + case ISCSI_UEVENT_TRANSPORT_EP_CONNECT: + case ISCSI_UEVENT_TRANSPORT_EP_POLL: + case ISCSI_UEVENT_TRANSPORT_EP_DISCONNECT: + err = iscsi_if_transport_ep(transport, ev, nlh->nlmsg_type); + break; default: err = -EINVAL; break; diff --exclude=.svn -rup oiscsi-529-canq2/kernel/scsi_transport_iscsi.h oiscsi-529-canq2-ep2/kernel/scsi_transport_iscsi.h --- oiscsi-529-canq2/kernel/scsi_transport_iscsi.h 2006-04-06 10:44:23.000000000 +0300 +++ oiscsi-529-canq2-ep2/kernel/scsi_transport_iscsi.h 2006-04-06 13:26:39.000000000 +0300 @@ -89,7 +89,7 @@ struct iscsi_transport { uint32_t cid); int (*bind_conn) (struct iscsi_cls_session *session, struct iscsi_cls_conn *cls_conn, - uint32_t transport_fd, int is_leading); + uint64_t transport_eph, int is_leading); int (*start_conn) (struct iscsi_cls_conn *conn); void (*stop_conn) (struct iscsi_cls_conn *conn, int flag); void (*destroy_conn) (struct iscsi_cls_conn *conn); @@ -120,6 +120,10 @@ struct iscsi_transport { int (*xmit_mgmt_task) (struct iscsi_conn *conn, struct iscsi_mgmt_task *mtask); void (*session_recovery_timedout) (struct iscsi_cls_session *session); + int (*ep_connect) (struct sockaddr *dst_addr, int non_blocking, + uint64_t *ep_handle); + int (*ep_poll) (uint64_t ep_handle, int 
timeout_ms); + void (*ep_disconnect) (uint64_t ep_handle); }; /* From eli at mellanox.co.il Thu Apr 6 05:49:35 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 6 Apr 2006 15:49:35 +0300 Subject: [openib-general] [PATCH] ipoib_cleanup_module Message-ID: <200604061549.36061.eli@mellanox.co.il> ensure reverse order of creation or else might cause errors if debugfs is used. Signed-off-by: Eli Cohen Index: latest/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ latest/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1216,8 +1216,8 @@ err_fs: static void __exit ipoib_cleanup_module(void) { ib_unregister_client(&ipoib_client); - ipoib_unregister_debugfs(); destroy_workqueue(ipoib_workqueue); + ipoib_unregister_debugfs(); } module_init(ipoib_init_module); From halr at voltaire.com Thu Apr 6 05:43:11 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2006 08:43:11 -0400 Subject: [openib-general] RE: [PATCHv2] OpenSM: Fix osm_vendor_send for GSI classes In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FE89@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FE89@mtlexch01.mtl.com> Message-ID: <1144327380.4480.91836.camel@hal.voltaire.com> On Thu, 2006-04-06 at 07:52, Yael Kalka wrote: > Hi Hal, > The patch fixes the problem I saw. > Please apply it. Thanks. Applied to both trunk and 1.0 branch. -- Hal > Thanks, > Yael > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Wednesday, April 05, 2006 3:55 PM > > To: Yael Kalka > > Cc: openib-general at openib.org > > Subject: [PATCHv2] OpenSM: Fix osm_vendor_send for GSI classes > > > > Hi Yael, > > > > Below is a slightly modified version of the previous patch. It is a > > complete fix for the problem you identified. Let me know if > > this works for you and I will check it into both the trunk and 1.0 > > branch. > > > > Thanks. 
> > > > -- Hal > > > > OpenSM: Fix osm_vendor_send for GSI classes > > > > Currently, the default for GSI classes assumes RMPP. There are two > > groups of GSI classes: those which support RMPP and those which don't. > > This patch handles them properly in osm_vendor_send. > > > > Problem pointed out by Yael Kalka > > > > Signed-off-by: Hal Rosenstock > > > > Index: include/iba/ib_types.h > > =================================================================== > > --- include/iba/ib_types.h (revision 6219) > > +++ include/iba/ib_types.h (working copy) > > @@ -515,6 +515,30 @@ BEGIN_C_DECLS > > #define IB_MCLASS_VENDOR_LOW_RANGE_MAX 0x0f > > /**********/ > > > > +/****d* IBA Base: Constants/IB_MCLASS_DEV_ADM > > +* NAME > > +* IB_MCLASS_DEV_ADM > > +* > > +* DESCRIPTION > > +* Subnet Management Class, Device Administration > > +* > > +* SOURCE > > +*/ > > +#define IB_MCLASS_DEV_ADM 0x10 > > +/**********/ > > + > > +/****d* IBA Base: Constants/IB_MCLASS_BIS > > +* NAME > > +* IB_MCLASS_BIS > > +* > > +* DESCRIPTION > > +* Subnet Management Class, BIS > > +* > > +* SOURCE > > +*/ > > +#define IB_MCLASS_BIS 0x12 > > +/**********/ > > + > > /****d* IBA Base: Constants/IB_MCLASS_VENDOR_HIGH_RANGE_MIN > > * NAME > > * IB_MCLASS_VENDOR_HIGH_RANGE_MIN > > @@ -544,7 +568,7 @@ BEGIN_C_DECLS > > * ib_class_is_vendor_specific_low > > * > > * DESCRIPTION > > -* Indicitates if the Class Code if a vendor specific class from > > +* Indicates if the Class Code if a vendor specific class from > > * the low range > > * > > * SYNOPSIS > > @@ -576,7 +600,7 @@ ib_class_is_vendor_specific_low( > > * ib_class_is_vendor_specific_high > > * > > * DESCRIPTION > > -* Indicitates if the Class Code if a vendor specific class from > > +* Indicates if the Class Code if a vendor specific class from > > * the high range > > * > > * SYNOPSIS > > @@ -609,7 +633,7 @@ ib_class_is_vendor_specific_high( > > * ib_class_is_vendor_specific > > * > > * DESCRIPTION > > -* Indicitates if the Class Code if a 
vendor specific class > > +* Indicates if the Class Code if a vendor specific class > > * > > * SYNOPSIS > > */ > > @@ -635,6 +659,38 @@ ib_class_is_vendor_specific( > > * ib_class_is_vendor_specific_low, ib_class_is_vendor_specific_high > > *********/ > > > > +/****f* IBA Base: Types/ib_class_is_rmpp > > +* NAME > > +* ib_class_is_rmpp > > +* > > +* DESCRIPTION > > +* Indicates if the Class Code supports RMPP > > +* > > +* SYNOPSIS > > +*/ > > +static inline boolean_t > > +ib_class_is_rmpp( > > + IN const uint8_t class_code ) > > +{ > > + return( (class_code == IB_MCLASS_SUBN_ADM) || > > + (class_code == IB_MCLASS_DEV_MGMT) || > > + (class_code == IB_MCLASS_DEV_ADM) || > > + (class_code == IB_MCLASS_BIS) || > > + ib_class_is_vendor_specific_high( class_code ) ); > > +} > > +/* > > +* PARAMETERS > > +* class_code > > +* [in] The Management Datagram Class Code > > +* > > +* RETURN VALUE > > +* TRUE if the class supports RMPP > > +* FALSE otherwise. > > +* > > +* NOTES > > +* > > +*********/ > > + > > /* > > * MAD methods > > */ > > @@ -1811,7 +1867,7 @@ ib_pkey_get_base( > > * ib_pkey_is_full_member > > * > > * DESCRIPTION > > -* Indicitates if the port is a full member of the parition. > > +* Indicates if the port is a full member of the parition. 
> > * > > * SYNOPSIS > > */ > > Index: libvendor/osm_vendor_ibumad.c > > =================================================================== > > --- libvendor/osm_vendor_ibumad.c (revision 6219) > > +++ libvendor/osm_vendor_ibumad.c (working copy) > > @@ -1044,16 +1044,17 @@ osm_vendor_send( > > CL_ASSERT( p_vw->h_bind == h_bind ); > > CL_ASSERT( p_mad == umad_get_mad(p_vw->umad) ); > > > > - switch (p_mad->mgmt_class) { > > - case IB_MCLASS_SUBN_DIR: > > + if (p_mad->mgmt_class == IB_MCLASS_SUBN_DIR) { > > umad_set_addr_net(p_vw->umad, 0xffff, 0, 0, 0); > > umad_set_grh(p_vw->umad, 0); > > - break; > > - case IB_MCLASS_SUBN_LID: > > + goto Resp; > > + } > > + if (p_mad->mgmt_class == IB_MCLASS_SUBN_LID) { > > umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, 0, > 0, 0); > > umad_set_grh(p_vw->umad, 0); > > - break; > > - default: /* GSI FIXME: no GRH */ > > + goto Resp; > > + } > > + if (ib_class_is_rmpp(p_mad->mgmt_class)) { /* RMPP GSI > classes > > FIXME: no GRH */ > > umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, > > p_mad_addr->addr_type.gsi.remote_qp, > > > p_mad_addr->addr_type.gsi.service_level, > > @@ -1086,9 +1087,16 @@ osm_vendor_send( > > p_sa->paylen_newwin = cl_ntoh32(paylen); > > } > > #endif > > - break; > > + } else { /* non RMPP GSI classes FIXME: no GRH */ > > + umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid, > > + p_mad_addr->addr_type.gsi.remote_qp, > > + > p_mad_addr->addr_type.gsi.service_level, > > + IB_QP1_WELL_KNOWN_Q_KEY); > > + umad_set_grh(p_vw->umad, 0); /* FIXME: GRH support */ > > + umad_set_pkey(p_vw->umad, > p_mad_addr->addr_type.gsi.pkey); > > } > > > > +Resp: > > if (resp_expected) > > put_madw(p_vend, p_madw, &p_mad->trans_id); > > > > > From rdreier at cisco.com Thu Apr 6 05:58:41 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Apr 2006 05:58:41 -0700 Subject: [openib-general] RE: Re: [PATCH] ipoib_flush_paths References: <20060406003635.GG26557@mellanox.co.il> 
<79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> Message-ID: Fabian> Can't you pass in a reference to the client module for Fabian> registration, and then take a reference from the context Fabian> of each request that is released after the callback Fabian> unwinds? I thought Linux had module reference functions... I'm pretty sure that any scheme that tries to use module reference counting will either be horribly complex, still have subtle races, or (mostly likely) both. The sanest solution at least for the SA query module would be to make every consumer get some sort of client cookie, and then have the function that destroys the cookie wait for all callbacks to finish. - R. From rdreier at cisco.com Thu Apr 6 05:58:42 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Apr 2006 05:58:42 -0700 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths References: <200604051559.34828.eli@mellanox.co.il> <20060405234023.GA26557@mellanox.co.il> <20060406000117.GC26557@mellanox.co.il> <44345E98.6000300@ichips.intel.com> <20060406003635.GG26557@mellanox.co.il> Message-ID: Michael> Probably global for ib_addr, per port for ib_sa: we don't Michael> want to force ib_addr users to deal with devices. Why per-port for ib_sa? It seems we want to make sure all callbacks are done before removing a consumer module -- why would a consumer want to wait for each individual port separately? - R. From mst at mellanox.co.il Thu Apr 6 06:17:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 16:17:55 +0300 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> Message-ID: <20060406131755.GN21115@mellanox.co.il> Quoting r. Roland Dreier : > I'm pretty sure that any scheme that tries to use module reference > counting will either be horribly complex, still have subtle races, or > (mostly likely) both. 
Actually, it turned out to be the simplest solution - and quite elegant since there's no room for mistakes: if a query is going to be running, this means the module is still loaded, so we can take a reference to it without races. As a bonus, an assertion inside __module_get increases the chance to catch races where the user forgets to cancel the query - much nicer than crashing randomly. Please take a look, if OK I'll build a similar patch for ib_addr. Warning: compile-tested only. --- Make sure callback module isn't unloaded while callback is running. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.16/drivers/infiniband/include/rdma/ib_sa.h =================================================================== --- linux-2.6.16/drivers/infiniband/include/rdma/ib_sa.h (revision 6281) +++ linux-2.6.16/drivers/infiniband/include/rdma/ib_sa.h (working copy) @@ -262,6 +262,7 @@ struct ib_sa_path_rec *resp, void *context), void *context, + struct module *owner, struct ib_sa_query **query); int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, @@ -273,6 +274,7 @@ struct ib_sa_mcmember_rec *resp, void *context), void *context, + struct module *owner, struct ib_sa_query **query); int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, @@ -284,6 +286,7 @@ struct ib_sa_service_rec *resp, void *context), void *context, + struct module *owner, struct ib_sa_query **sa_query); /** @@ -319,13 +322,14 @@ struct ib_sa_mcmember_rec *resp, void *context), void *context, + struct module *owner, struct ib_sa_query **query) { return ib_sa_mcmember_rec_query(device, port_num, IB_MGMT_METHOD_SET, rec, comp_mask, timeout_ms, gfp_mask, callback, - context, query); + context, owner, query); } /** @@ -361,13 +365,14 @@ struct ib_sa_mcmember_rec *resp, void *context), void *context, + struct module *owner, struct ib_sa_query **query) { return ib_sa_mcmember_rec_query(device, port_num, IB_SA_METHOD_DELETE, rec, comp_mask, timeout_ms, gfp_mask, callback, - context, query); +
context, owner, query); } /** Index: linux-2.6.16/drivers/infiniband/core/at.c =================================================================== --- linux-2.6.16/drivers/infiniband/core/at.c (revision 6281) +++ linux-2.6.16/drivers/infiniband/core/at.c (working copy) @@ -220,6 +220,7 @@ GFP_KERNEL, ats_op_complete, ib_dev, + THIS_MODULE, &ib_dev->sa_query); if (ib_dev->sa_id < 0) { @@ -1122,6 +1123,7 @@ GFP_KERNEL, ats_ips_req_complete, req, + THIS_MODULE, &req->pend.sa_query); if (req->pend.sa_id < 0) { @@ -1167,6 +1169,7 @@ GFP_KERNEL, ats_route_req_complete, req, + THIS_MODULE, &req->pend.sa_query); if (req->pend.sa_id < 0) { @@ -1230,6 +1233,7 @@ GFP_KERNEL, path_req_complete, req, + THIS_MODULE, &req->pend.sa_query); if (req->pend.sa_id < 0) { Index: linux-2.6.16/drivers/infiniband/core/sa_query.c =================================================================== --- linux-2.6.16/drivers/infiniband/core/sa_query.c (revision 6281) +++ linux-2.6.16/drivers/infiniband/core/sa_query.c (working copy) @@ -73,6 +73,7 @@ struct ib_sa_query { void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); void (*release)(struct ib_sa_query *); + struct module *owner; struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; @@ -559,6 +560,7 @@ * @callback:function called when query completes, times out or is * canceled * @context:opaque user context passed to callback + * @owner:callback owner module * @sa_query:query context, used to cancel query * * Send a Path Record Get query to the SA to look up a path. The @@ -580,6 +582,7 @@ struct ib_sa_path_rec *resp, void *context), void *context, + struct module *owner, struct ib_sa_query **sa_query) { struct ib_sa_path_query *query; @@ -614,6 +617,7 @@ init_mad(mad, agent); query->sa_query.callback = callback ? 
ib_sa_path_rec_callback : NULL; + query->sa_query.owner = owner; query->sa_query.release = ib_sa_path_rec_release; query->sa_query.port = port; mad->mad_hdr.method = IB_MGMT_METHOD_GET; @@ -696,6 +700,7 @@ struct ib_sa_service_rec *resp, void *context), void *context, + struct module *owner, struct ib_sa_query **sa_query) { struct ib_sa_service_query *query; @@ -735,6 +740,7 @@ init_mad(mad, agent); query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; + query->sa_query.owner = owner; query->sa_query.release = ib_sa_service_rec_release; query->sa_query.port = port; mad->mad_hdr.method = method; @@ -793,6 +799,7 @@ struct ib_sa_mcmember_rec *resp, void *context), void *context, + struct module *owner, struct ib_sa_query **sa_query) { struct ib_sa_mcmember_query *query; @@ -827,6 +834,7 @@ init_mad(mad, agent); query->sa_query.callback = callback ? ib_sa_mcmember_rec_callback : NULL; + query->sa_query.owner = owner; query->sa_query.release = ib_sa_mcmember_rec_release; query->sa_query.port = port; mad->mad_hdr.method = method; @@ -866,13 +874,19 @@ /* No callback -- already got recv */ break; case IB_WC_RESP_TIMEOUT_ERR: + __module_get(query->owner); query->callback(query, -ETIMEDOUT, NULL); + module_put(query->owner); break; case IB_WC_WR_FLUSH_ERR: + __module_get(query->owner); query->callback(query, -EINTR, NULL); + module_put(query->owner); break; default: + __module_get(query->owner); query->callback(query, -EIO, NULL); + module_put(query->owner); break; } @@ -895,6 +909,7 @@ query = mad_buf->context[0]; if (query->callback) { + __module_get(query->owner); if (mad_recv_wc->wc->status == IB_WC_SUCCESS) query->callback(query, mad_recv_wc->recv_buf.mad->mad_hdr.status ? 
@@ -902,6 +917,7 @@ (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); else query->callback(query, -EIO, NULL); + module_put(query->owner); } ib_free_recv_mad(mad_recv_wc); Index: linux-2.6.16/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.16/drivers/infiniband/core/cma.c (revision 6281) +++ linux-2.6.16/drivers/infiniband/core/cma.c (working copy) @@ -1065,7 +1065,8 @@ IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, timeout_ms, GFP_KERNEL, - cma_query_handler, work, &id_priv->query); + cma_query_handler, work, THIS_MODULE, + &id_priv->query); return (id_priv->query_id < 0) ? id_priv->query_id : 0; } Index: linux-2.6.16/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.16/drivers/infiniband/ulp/ipoib/ipoib_main.c (revision 6281) +++ linux-2.6.16/drivers/infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -472,7 +472,7 @@ IB_SA_PATH_REC_PKEY, 1000, GFP_ATOMIC, path_rec_completion, - path, &path->query); + path, THIS_MODULE, &path->query); if (path->query_id < 0) { ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); path->query = NULL; Index: linux-2.6.16/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.16/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (revision 6281) +++ linux-2.6.16/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -367,7 +367,7 @@ IB_SA_MCMEMBER_REC_JOIN_STATE, 1000, GFP_ATOMIC, ipoib_mcast_sendonly_join_complete, - mcast, &mcast->query); + mcast, THIS_MODULE, &mcast->query); if (ret < 0) { ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", ret); @@ -487,7 +487,7 @@ ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, mcast->backoff * 1000, GFP_ATOMIC, ipoib_mcast_join_complete, - mcast, &mcast->query); + mcast, THIS_MODULE, &mcast->query); if (ret < 0) 
{ ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); @@ -686,7 +686,7 @@ IB_SA_MCMEMBER_REC_PKEY | IB_SA_MCMEMBER_REC_JOIN_STATE, 0, GFP_ATOMIC, NULL, - mcast, &mcast->query); + mcast, THIS_MODULE, &mcast->query); if (ret < 0) ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " "for leave (result = %d)\n", ret); Index: linux-2.6.16/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- linux-2.6.16/drivers/infiniband/ulp/srp/ib_srp.c (revision 6281) +++ linux-2.6.16/drivers/infiniband/ulp/srp/ib_srp.c (working copy) @@ -260,7 +260,8 @@ SRP_PATH_REC_TIMEOUT_MS, GFP_KERNEL, srp_path_rec_completion, - target, &target->path_query); + target, THIS_MODULE, + &target->path_query); if (target->path_query_id < 0) return target->path_query_id; -- MST From mst at mellanox.co.il Thu Apr 6 06:19:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 16:19:52 +0300 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: References: <200604051559.34828.eli@mellanox.co.il> Message-ID: <20060406131952.GO21115@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib_flush_paths > > This makes sense but I'm trying to see exactly what goes wrong without > it. Suppose path->query gets set to NULL between testing it and > calling ib_sa_cancel_query(). What's the worst that can happen? It > looks safe to ib_sa_cancel_query() with a stale or NULL query pointer. Just to be clear: this patch is going into 2.6.17, isn't it? 
-- MST From halr at voltaire.com Thu Apr 6 06:18:26 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2006 09:18:26 -0400 Subject: [openib-general] RE: [PATCH] OpenSM - complib fix for branch In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FE86@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FE86@mtlexch01.mtl.com> Message-ID: <1144329488.4480.92319.camel@hal.voltaire.com> Hi Yael, On Thu, 2006-04-06 at 05:01, Yael Kalka wrote: > Hi Hal, > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Wednesday, April 05, 2006 1:59 PM > > To: Yael Kalka > > Cc: openib-general at openib.org; Eitan Zahavi; Sasha Khapyorsky; Ofer > Gigi > > Subject: Re: [PATCH] OpenSM - complib fix for branch > > > > Hi Yael, > > > > On Wed, 2006-04-05 at 02:24, Yael Kalka wrote: > > > Hi Hal, > > > > > > I saw that the complib patch (removal of constructor and destructor > > > attribute), wasn't fully added to the branch. > > > Attached is a patch for the branch. > > > > Is this needed for 1.0 ? Is this safe to add ? Was there more to it > than > > just this ? > > > [Yael Kalka] It isn't needed for 1.0 more than for the trunk, just that > the minimum differences we have between the branch and the trunk, the > easier it'll be to handle them. > I don't see a problem with applying the patch on the branch. > The patch also included a change to the osmtest/main.c, but that change > is already applied to the branch for some reason. Not sure how it got in there. Anyhow, I made the requested change to the 1.0 branch (r6282). > Another issue that is not connected to this specific patch - I think we > should apply the cosmetic changes patches to the branch in order to > minimize the differences between the trunk and the branch. That has not been a priority but I agree that this would make things easier (in terms of a simple diff). I'll work on this in the background as I have time. As always, you're also welcome to submit patches. 
-- Hal > Yael > > > -- Hal > > > > > Thanks, > > > Yael > > > > > > Signed-off-by: Yael Kalka > > > > > > Index: complib/cl_complib.c > > > =================================================================== > > > --- complib/cl_complib.c (revision 6203) > > > +++ complib/cl_complib.c (working copy) > > > @@ -65,7 +65,6 @@ __cl_timer_prov_destroy( void ); > > > cl_spinlock_t cl_atomic_spinlock; > > > > > > void > > > -__attribute (( constructor )) > > > complib_init(void) > > > { > > > cl_status_t status = CL_SUCCESS; > > > @@ -90,14 +89,6 @@ complib_init(void) > > > } > > > > > > void > > > -__attribute (( destructor )) > > > -complib_fini(void) > > > -{ > > > - __cl_timer_prov_destroy(); > > > - __cl_user_syshelper_exit(); > > > -} > > > - > > > -void > > > complib_exit(void) > > > { > > > __cl_timer_prov_destroy(); > > > Index: opensm/main.c > > > =================================================================== > > > --- opensm/main.c (revision 6203) > > > +++ opensm/main.c (working copy) > > > @@ -44,9 +44,6 @@ > > > * > > > * $Revision: 1.23 $ > > > */ > > > -#ifdef __WIN__ > > > -#pragma warning(disable : 4996) > > > -#endif > > > > > > #if HAVE_CONFIG_H > > > # include > > > @@ -557,9 +554,7 @@ main( > > > { NULL, 0, NULL, 0 } /* Required at the end of > the > > array */ > > > }; > > > > > > -#ifdef __WIN__ > > > complib_init(); > > > -#endif > > > > > > /* Make sure that the opensm and complib were compiled using > > > same modes (debug/free) */ > > > > > > From mulix at mulix.org Thu Apr 6 06:25:12 2006 From: mulix at mulix.org (Muli Ben-Yehuda) Date: Thu, 06 Apr 2006 16:25:12 +0300 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406131755.GN21115@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> Message-ID: <20060406132512.GF16153@granada.merseine.nu> On Thu, Apr 06, 2006 at 04:17:55PM +0300, Michael 
S. Tsirkin wrote: > Quoting r. Roland Dreier : > > I'm pretty sure that any scheme that tries to use module reference > > counting will either be horribly complex, still have subtle races, or > > (mostly likely) both. > > Actually, it turned out to be the simplest solution - and quite > elegant since there's no room for mistakes: if query is going to be running > this means module is still loaded so we can take a reference to it > without races. Don't you have a race between the point you make the call (pass THIS_MODULE) and the point you call __module_get on it? What's keeping the module from disappearing between those two points? Cheers, Muli -- Muli Ben-Yehuda http://www.mulix.org | http://mulix.livejournal.com/ From mst at mellanox.co.il Thu Apr 6 06:38:33 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 16:38:33 +0300 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406132512.GF16153@granada.merseine.nu> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060406132512.GF16153@granada.merseine.nu> Message-ID: <20060406133833.GP21115@mellanox.co.il> Quoting r. Muli Ben-Yehuda : > Subject: Re: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths > > On Thu, Apr 06, 2006 at 04:17:55PM +0300, Michael S. Tsirkin wrote: > > Quoting r. Roland Dreier : > > > I'm pretty sure that any scheme that tries to use module reference > > > counting will either be horribly complex, still have subtle races, or > > > (mostly likely) both. > > > > Actually, it turned out to be the simplest solution - and quite > > elegant since there's no room for mistakes: if query is going to be running > > this means module is still loaded so we can take a reference to it > > without races. > > Don't you have a race between the point you make the call (pass > THIS_MODULE) and the point you call __module_get on it?
What's keeping the > module from disappearing between those two points? No, since we are keeping a callback pointer into that module. -- MST From mulix at mulix.org Thu Apr 6 06:42:12 2006 From: mulix at mulix.org (Muli Ben-Yehuda) Date: Thu, 06 Apr 2006 16:42:12 +0300 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406133833.GP21115@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060406132512.GF16153@granada.merseine.nu> <20060406133833.GP21115@mellanox.co.il> Message-ID: <20060406134212.GG16153@granada.merseine.nu> On Thu, Apr 06, 2006 at 04:38:33PM +0300, Michael S. Tsirkin wrote: > No, since we are keeping a callback pointer into that module. Sorry if I'm being dense but I don't see it in this patch. Point me at it? Cheers, Muli -- Muli Ben-Yehuda http://www.mulix.org | http://mulix.livejournal.com/ From tziporet at mellanox.co.il Thu Apr 6 06:57:23 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 06 Apr 2006 16:57:23 +0300 Subject: [openib-general] RC2 delayed a bit In-Reply-To: <1144259913.3984.30.camel@chalcedony.internal.keyresearch.com> References: <1144259913.3984.30.camel@chalcedony.internal.keyresearch.com> Message-ID: <44351E43.1040405@mellanox.co.il> Bryan O'Sullivan wrote: > Unfortunately, the coordination of this with the 1.0 process has thus > far not been very effective, so I am spending a lot of time manually > filtering diffs to see what has changed between the first EWG software > release (named "IBED") and the 1.0 tree, so that I can reunify the two. > > User space code that is needed by IBED was reviewed and checked in to the 1.0 branch. We did this for all modules except management, which is synchronized by Hal, and the ipath library, which I assumed you would take care of. An ibed directory was opened under the 1.0 branch for scripts & backport patches.
We plan to publish IBED rc3 by Monday. Tziporet From bugzilla-daemon at openib.org Thu Apr 6 07:05:28 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Thu, 6 Apr 2006 07:05:28 -0700 (PDT) Subject: [openib-general] [Bug 32] IBED RC2 fails to build on FC4 2.6.15 Message-ID: <20060406140528.9658C2283E6@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=32 danb at voltaire.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |INVALID ------- Additional Comments From danb at voltaire.com 2006-04-06 07:05 ------- The open-iscsi patches for kernel 2.6.15 were not applied prior to the ISER build. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mst at mellanox.co.il Thu Apr 6 06:59:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 16:59:50 +0300 Subject: [openib-general] librdmacm/ucma Message-ID: <20060406135950.GQ21115@mellanox.co.il> Sean, I'd like to see some things with librdmacm/ucma code improved: - dependency on libsysfs I propose we don't get into this again. I know we have it in ibverbs but let's avoid it for new code. - negative error codes I think this is kernel practice; in userspace we either set errno and return -1, or simply return a positive error code: otherwise utilities like strerror do not work. - abi versioning The RDMA_USER_CM_MAX_ABI_VERSION idea is broken - it makes it so that you are required to upgrade userspace to run on newer kernels. Bailing out in userspace if the kernel is too new is not good - please remove this test. We need to be able to run old userspace on new kernels too - that's how the rest of the kernel understands backward compatibility.
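The userspace error convention mentioned in the "negative error codes" item above can be sketched as follows. This is only an illustration under stated assumptions: demo_resolve_kernel_style() and demo_resolve() are hypothetical names, not librdmacm functions. The wrapper turns a kernel-style negative return into errno plus -1, so that strerror() gives a sensible message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical kernel-style helper: returns 0 or a negative errno value. */
int demo_resolve_kernel_style(const char *addr)
{
	if (addr == NULL || addr[0] == '\0')
		return -EINVAL;
	return 0;
}

/* Hypothetical userspace wrapper: returns 0, or -1 with errno set. */
int demo_resolve(const char *addr)
{
	int ret = demo_resolve_kernel_style(addr);

	if (ret < 0) {
		errno = -ret;	/* store the positive errno value */
		return -1;
	}
	return 0;
}
```

A caller would then check for -1 and pass errno to strerror() or perror() as with any other libc-style call.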
-- MST From ftillier at silverstorm.com Thu Apr 6 07:00:42 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Thu, 6 Apr 2006 07:00:42 -0700 Subject: [openib-general] RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: References: <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> Message-ID: <79ae2f320604060700w49ee5a4w651a7ebce8d76cec@mail.gmail.com> On 4/5/06, Sean Hefty wrote: > >Can't you pass in a reference to the client module for registration, > >and then take a reference from the context of each request that is > >released after the callback unwinds? I thought Linux had module > >reference functions... > > Yes - this is what ib_mad does. The problem is that ib_sa, ib_addr, ib_cm, and > soon to be ib_multicast can invoke callbacks without explicit registration / > deregistration. For example, the following interface has the issue: > > ib_do_async_operation(request, my_callback, my_context); You don't need explicit registration/deregistration - just add a module parameter to this function: ib_do_async_op_safe( my_module, request, my_callback, my_context ); > I was able to come up with several possible solutions to this problem. The > easiest to implement is doing what Michael suggested, and calling some sort of > wait_until_all_current_callbacks_return routine. What I don't like about this > approach is that the interface becomes easier to misuse (i.e. callers must > remember to call wait_until_all_current_callbacks_return before unloading), plus > it requires changes to interfaces that do work. I don't like this - like you said, it's error prone. > My preference, and it's not a very strong one at this point, is to push the > responsibility into the module invoking the callback. To me, that's the > direction that the reference goes, so that's where the responsibility lies. > Besides, it's his thread that's executing random memory as code. Right - that module just needs to know which module to hold a reference on. 
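The idea discussed in this thread (record the owning module with each request and hold a reference on it while its callback runs) can be sketched in plain userspace C. This is only an illustration under stated assumptions: demo_module, demo_query, demo_module_get(), demo_module_put() and demo_complete() are hypothetical stand-ins, not the kernel's struct module or its reference functions, and a plain counter stands in for the real reference count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a module with a count of extra references. */
struct demo_module {
	int refcnt;
};

/* Hypothetical request: a callback plus the module that owns it. */
struct demo_query {
	void (*callback)(struct demo_query *query, int status);
	struct demo_module *owner;
	int last_status;	/* status passed to the last callback */
	int refcnt_seen;	/* owner refcnt observed inside the callback */
};

void demo_module_get(struct demo_module *m)
{
	m->refcnt++;
}

void demo_module_put(struct demo_module *m)
{
	assert(m->refcnt > 0);
	m->refcnt--;
}

/* Completion path: pin the owner for the duration of its callback. */
void demo_complete(struct demo_query *query, int status)
{
	if (!query->callback)
		return;
	demo_module_get(query->owner);
	query->callback(query, status);
	demo_module_put(query->owner);
}

/* Example callback that records what it observed. */
void demo_callback(struct demo_query *query, int status)
{
	query->last_status = status;
	query->refcnt_seen = query->owner->refcnt;
}
```

The sketch deliberately leaves out the question raised in the thread of what keeps the owner loaded before the reference is taken; it only shows the get/callback/put bracket.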
- Fab From mst at mellanox.co.il Thu Apr 6 07:04:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 17:04:07 +0300 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406134212.GG16153@granada.merseine.nu> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060406132512.GF16153@granada.merseine.nu> <20060406133833.GP21115@mellanox.co.il> <20060406134212.GG16153@granada.merseine.nu> Message-ID: <20060406140407.GR21115@mellanox.co.il> Quoting r. Muli Ben-Yehuda : > Subject: Re: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths > > On Thu, Apr 06, 2006 at 04:38:33PM +0300, Michael S. Tsirkin wrote: > > > No, since we are keeping a callback pointer into that module. > > Sorry if I'm being dense but I don't see it in this patch. Point me at > it? You don't see it in the patch because SA already kept a callback pointer - that's the race I'm solving. Look in sa_query.c. If I have struct query { void (*callback)(); struct module *owner; }; Then it is always safe to do __module_get(query->owner); query->callback(); module_put(query->owner); since it is the called module's responsibility to invalidate all such query objects before it's unloaded. -- MST From ftillier at silverstorm.com Thu Apr 6 07:02:38 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Thu, 6 Apr 2006 07:02:38 -0700 Subject: [openib-general] RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406071228.GA9994@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406071228.GA9994@mellanox.co.il> Message-ID: <79ae2f320604060702q163f5b1alb25249f951feb017@mail.gmail.com> On 4/6/06, Michael S. Tsirkin wrote: > Quoting r.
Fabian Tillier : > > Can't you pass in a reference to the client module for registration, > > and then take a reference from the context of each request that is > > released after the callback unwinds? I thought Linux had module > > reference functions... > > I thought about that , but this would mean: > > 1. changing API instead of extending it by new functions - lots of > churn for ULPs Adding new functions to work around a flaw in an existing API is like addressing the symptom of a problem instead of the cause. If the API is broken without the client making some other call to close a race, then the API is broken - fix it. > 2. adding overhead on data path rather than unload path where > it belongs I can't imagine a module reference/dereference adds significant overhead, so I don't buy this argument. - Fab From mst at mellanox.co.il Thu Apr 6 07:06:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 17:06:29 +0300 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <79ae2f320604060702q163f5b1alb25249f951feb017@mail.gmail.com> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406071228.GA9994@mellanox.co.il> <79ae2f320604060702q163f5b1alb25249f951feb017@mail.gmail.com> Message-ID: <20060406140629.GS21115@mellanox.co.il> Quoting r. Fabian Tillier : > Subject: Re: RE: Re: [PATCH] ipoib_flush_paths > > On 4/6/06, Michael S. Tsirkin wrote: > > Quoting r. Fabian Tillier : > > > Can't you pass in a reference to the client module for registration, > > > and then take a reference from the context of each request that is > > > released after the callback unwinds? I thought Linux had module > > > reference functions... You are right, you might notice I've reversed my opinion. Please review the patch I've posted. 
-- MST From jlentini at netapp.com Thu Apr 6 07:05:18 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 6 Apr 2006 10:05:18 -0400 (EDT) Subject: [openib-general] [DAPL] tests In-Reply-To: <4433E2A8.1050702@mellanox.co.il> References: <4432B692.8000606@ichips.intel.com> <4433B14F.9010701@mellanox.co.il> <4433E2A8.1050702@mellanox.co.il> Message-ID: On Wed, 5 Apr 2006, Vladimir Sokolovsky wrote: > Can you add dapl tests to EXTRA_DIST list in the dapl/Makefile.am? Can you send me a patch with exactly what you want? From mulix at mulix.org Thu Apr 6 07:05:27 2006 From: mulix at mulix.org (Muli Ben-Yehuda) Date: Thu, 06 Apr 2006 17:05:27 +0300 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406140407.GR21115@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060406132512.GF16153@granada.merseine.nu> <20060406133833.GP21115@mellanox.co.il> <20060406134212.GG16153@granada.merseine.nu> <20060406140407.GR21115@mellanox.co.il> Message-ID: <20060406140527.GH16153@granada.merseine.nu> On Thu, Apr 06, 2006 at 05:04:07PM +0300, Michael S. Tsirkin wrote: > struct query { > void (*callback)(); > struct module *owner; > }; > > Then it is always safe to do > > __module_get(query->owner); > query->callback(); > module_put(query->owner); Ok, that makes sense. By the way, shouldn't __module_get be try_module_get(), with proper error handling if it fails?
Cheers, Muli -- Muli Ben-Yehuda http://www.mulix.org | http://mulix.livejournal.com/ From halr at voltaire.com Thu Apr 6 07:07:46 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2006 10:07:46 -0400 Subject: [openib-general] RC2 delayed a bit In-Reply-To: <44351E43.1040405@mellanox.co.il> References: <1144259913.3984.30.camel@chalcedony.internal.keyresearch.com> <44351E43.1040405@mellanox.co.il> Message-ID: <1144332463.4480.92952.camel@hal.voltaire.com> On Thu, 2006-04-06 at 09:57, Tziporet Koren wrote: > Bryan O'Sullivan wrote: > > Unfortunately, the coordination of this with the 1.0 process has thus > > far not been very effective, so I am spending a lot of time manually > > filtering diffs to see what has changed between the first EWG software > > release (named "IBED") and the 1.0 tree, so that I can reunify the two. > > > > > User space code that is needed by IBED was reviewed and check it in to > the 1.0 branch. > We did it to all modules except management that is synchronized by Hal > and ipath library that I assumed you will take care for. Does IBED maintain it's own copy of userspace ? -- Hal > ibed directory was opened under 1.0 branch for scripts & backport patches. > We plan to publish IBED rc3 by Monday. > > Tziporet > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Thu Apr 6 07:20:22 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 6 Apr 2006 17:20:22 +0300 Subject: [openib-general] Re: Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406140527.GH16153@granada.merseine.nu> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060406132512.GF16153@granada.merseine.nu> <20060406133833.GP21115@mellanox.co.il> <20060406134212.GG16153@granada.merseine.nu> <20060406140407.GR21115@mellanox.co.il> <20060406140527.GH16153@granada.merseine.nu> Message-ID: <20060406142022.GT21115@mellanox.co.il> Quoting r. Muli Ben-Yehuda : > Subject: Re: Re: RE: Re: [PATCH] ipoib_flush_paths > > On Thu, Apr 06, 2006 at 05:04:07PM +0300, Michael S. Tsirkin wrote: > > > struct query { > > void (*callback)(); > > struct module *owner; > > } > > > > Then it is always safe to do > > > > __get_module(query->owner); > > query->callback(); > > put_module(query->owner); > > Ok, that makes sense. By the way, shouldn't __get_module be > try_module_get(), with proper error handling if it fails? No, the whole point is it can't fail. If it fails its a bug - there's nothing I can do. -- MST From mst at mellanox.co.il Thu Apr 6 07:22:16 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 17:22:16 +0300 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <79ae2f320604060700w49ee5a4w651a7ebce8d76cec@mail.gmail.com> References: <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <79ae2f320604060700w49ee5a4w651a7ebce8d76cec@mail.gmail.com> Message-ID: <20060406142216.GU21115@mellanox.co.il> Quoting r. Fabian Tillier : > Subject: Re: RE: Re: [PATCH] ipoib_flush_paths > > On 4/5/06, Sean Hefty wrote: > > >Can't you pass in a reference to the client module for registration, > > >and then take a reference from the context of each request that is > > >released after the callback unwinds? I thought Linux had module > > >reference functions... 
> > > > Yes - this is what ib_mad does. The problem is that ib_sa, ib_addr, ib_cm, and > > soon to be ib_multicast can invoke callbacks without explicit registration / > > deregistration. For example, the following interface has the issue: > > > > ib_do_async_operation(request, my_callback, my_context); Correct, that's the issue. So this means we can't fix it without some kind of API change. > You don't need explicit registration/deregistration - just add a > module parameter to this function: > > ib_do_async_op_safe( my_module, request, my_callback, my_context ); That's what I did - added owner parameter after callback and context. -- MST From ftillier at silverstorm.com Thu Apr 6 07:24:51 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Thu, 6 Apr 2006 07:24:51 -0700 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406140629.GS21115@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406071228.GA9994@mellanox.co.il> <79ae2f320604060702q163f5b1alb25249f951feb017@mail.gmail.com> <20060406140629.GS21115@mellanox.co.il> Message-ID: <79ae2f320604060724y55e2131chb544a14f561b1615@mail.gmail.com> On 4/6/06, Michael S. Tsirkin wrote: > Quoting r. Fabian Tillier : > > Subject: Re: RE: Re: [PATCH] ipoib_flush_paths > > > > On 4/6/06, Michael S. Tsirkin wrote: > > > Quoting r. Fabian Tillier : > > > > Can't you pass in a reference to the client module for registration, > > > > and then take a reference from the context of each request that is > > > > released after the callback unwinds? I thought Linux had module > > > > reference functions... > > You are right, you might notice I've reversed my opinion. > Please review the patch I've posted. Yep, just noticed that. That's what I get for replying before getting the latest mails... 
- Fab > > -- > MST > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From ftillier at silverstorm.com Thu Apr 6 07:27:26 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Thu, 6 Apr 2006 07:27:26 -0700 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406140407.GR21115@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060406132512.GF16153@granada.merseine.nu> <20060406133833.GP21115@mellanox.co.il> <20060406134212.GG16153@granada.merseine.nu> <20060406140407.GR21115@mellanox.co.il> Message-ID: <79ae2f320604060727w228f0ca3h3321bb48960706b@mail.gmail.com> On 4/6/06, Michael S. Tsirkin wrote: > Quoting r. Muli Ben-Yehuda : > > Subject: Re: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths > > > > On Thu, Apr 06, 2006 at 04:38:33PM +0300, Michael S. Tsirkin wrote: > > > > > No, since we are keeping a callback pointer into that module. > > > > Sorry if I'm being dense but I don't see it in this patch. Point me at > > it? > > You don't see it in the patch because SA already kept a callback pointer - > that's the race I'm solving. Look in sa_query.c > > > If I have > > struct query { > void (*callback)(); > struct module *owner; > } > > Then it is always safe to do > > __get_module(query->owner); > query->callback(); > put_module(query->owner); > > since it is the called module's responsibility to invalidate > all such query objects before its unloaded. Wait, why are you doing __get_module just before the callback? This leaves the possibility of crashing - sure, you'll detect that things went wrong, but you haven't solved the issue. The whole point of the reference is to prevent the crash. 
You need to call __get_module from the context of the caller making the request. - Fab From mst at mellanox.co.il Thu Apr 6 07:42:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 17:42:22 +0300 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <79ae2f320604060727w228f0ca3h3321bb48960706b@mail.gmail.com> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060406132512.GF16153@granada.merseine.nu> <20060406133833.GP21115@mellanox.co.il> <20060406134212.GG16153@granada.merseine.nu> <20060406140407.GR21115@mellanox.co.il> <79ae2f320604060727w228f0ca3h3321bb48960706b@mail.gmail.com> Message-ID: <20060406144221.GA13416@mellanox.co.il> Quoting r. Fabian Tillier : > Subject: Re: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths > > On 4/6/06, Michael S. Tsirkin wrote: > > Quoting r. Muli Ben-Yehuda : > > > Subject: Re: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths > > > > > > On Thu, Apr 06, 2006 at 04:38:33PM +0300, Michael S. Tsirkin wrote: > > > > > > > No, since we are keeping a callback pointer into that module. > > > > > > Sorry if I'm being dense but I don't see it in this patch. Point me at > > > it? > > > > You don't see it in the patch because SA already kept a callback pointer - > > that's the race I'm solving. Look in sa_query.c > > > > > > If I have > > > > struct query { > > void (*callback)(); > > struct module *owner; > > } > > > > Then it is always safe to do > > > > __get_module(query->owner); > > query->callback(); > > put_module(query->owner); > > > > since it is the called module's responsibility to invalidate > > all such query objects before its unloaded. > > Wait, why are you doing __get_module just before the callback? This > leaves the possibility of crashing - sure, you'll detect that things > went wrong, but you haven't solved the issue.
The whole point of the > reference is to prevent the crash. No, this problem is already solved today in all ULPs: the caller polls on a flag (completion) that the callback sets. All ULPs do this already, since they need to track resources irrespective of the module being unloaded. > You need to call __get_module from the context of the caller making the request. No, this would prevent the module from being unloaded for extended periods of time - we don't want this. All I must prevent is the problem we missed previously: the module being unloaded while a callback is in progress. -- MST From jlentini at netapp.com Thu Apr 6 08:07:53 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 6 Apr 2006 11:07:53 -0400 (EDT) Subject: [openib-general] [PATCH] [DAPL] - dapl doesn't set max read iov In-Reply-To: <1144245678.28591.5.camel@stevo-desktop> References: <1144245678.28591.5.camel@stevo-desktop> Message-ID: On Wed, 5 Apr 2006, Steve Wise wrote: > > Set the IA attribute max_iov_segments_per_rdma_read and the EP attribute > max_rdma_read_iov based on the openib max_sge_rd device attribute. Committed in the trunk and 1.0 branch in revision 6287. From jlentini at netapp.com Thu Apr 6 08:23:35 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 6 Apr 2006 11:23:35 -0400 (EDT) Subject: [openib-general] [PATCH] [DAPL] - dapl doesn't set max read In-Reply-To: <1144248896.28591.25.camel@stevo-desktop> References: <1144245678.28591.5.camel@stevo-desktop> <1144248896.28591.25.camel@stevo-desktop> Message-ID: On Wed, 5 Apr 2006, Steve Wise wrote: > Ignore this patch. max_sge_rd is not the correct attribute... You're right.
I've fixed it on the trunk and 1.0 branch in revision 6289 with this: Index: openib_cma/dapl_ib_util.c =================================================================== --- openib_cma/dapl_ib_util.c (revision 6287) +++ openib_cma/dapl_ib_util.c (working copy) @@ -442,7 +442,7 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HC ia_attr->transport_attr = NULL; ia_attr->num_vendor_attr = 0; ia_attr->vendor_attr = NULL; - ia_attr->max_iov_segments_per_rdma_read = dev_attr.max_sge_rd; + ia_attr->max_iov_segments_per_rdma_read = dev_attr.max_sge; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " query_hca: (ver=%x) ep %d ep_q %d evd %d evd_q %d\n", @@ -465,7 +465,7 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HC ep_attr->max_request_iov = dev_attr.max_sge; ep_attr->max_rdma_read_in = dev_attr.max_qp_rd_atom; ep_attr->max_rdma_read_out= dev_attr.max_qp_rd_atom; - ep_attr->max_rdma_read_iov= dev_attr.max_sge_rd; + ep_attr->max_rdma_read_iov= dev_attr.max_sge; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " query_hca: MAX msg %llu dto %d iov %d rdma i%d,o%d\n", ep_attr->max_mtu_size, From mst at mellanox.co.il Thu Apr 6 08:26:49 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 18:26:49 +0300 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406131755.GN21115@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> Message-ID: <20060406152649.GC13416@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Actually, it turned out to be the simplest solution - and quite > elegant since there's no room for mistakes: if query is going to be running > this means module is still loaded so we can take a reference to it > without races. And Here's the patch to ib_addr. Sean, Roland, please comment. --- Prevent module from being unloaded while callback from ib_addr has signaled completion but is still running. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.16/include/rdma/ib_addr.h =================================================================== --- linux-2.6.16/include/rdma/ib_addr.h (revision 6281) +++ linux-2.6.16/include/rdma/ib_addr.h (working copy) @@ -63,12 +63,13 @@ * @callback: Call invoked once address resolution has completed, timed out, * or been canceled. A status of 0 indicates success. * @context: User-specified context associated with the call. + * @owner: Module that owns the callback. */ int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, struct rdma_dev_addr *addr, int timeout_ms, void (*callback)(int status, struct sockaddr *src_addr, struct rdma_dev_addr *addr, void *context), - void *context); + void *context, struct module *owner); void rdma_addr_cancel(struct rdma_dev_addr *addr); Index: linux-2.6.16/drivers/infiniband/core/addr.c =================================================================== --- linux-2.6.16/drivers/infiniband/core/addr.c (revision 6281) +++ linux-2.6.16/drivers/infiniband/core/addr.c (working copy) @@ -76,6 +76,7 @@ void *context; void (*callback)(int status, struct sockaddr *src_addr, struct rdma_dev_addr *addr, void *context); + struct module *owner; unsigned long timeout; int status; }; @@ -252,8 +253,10 @@ list_for_each_entry_safe(req, temp_req, &done_list, list) { list_del(&req->list); + __module_get(req->owner); req->callback(req->status, &req->src_addr, req->addr, req->context); + module_put(req->owner); kfree(req); } } @@ -293,7 +296,7 @@ struct rdma_dev_addr *addr, int timeout_ms, void (*callback)(int status, struct sockaddr *src_addr, struct rdma_dev_addr *addr, void *context), - void *context) + void *context, struct module *owner) { struct sockaddr_in *src_in, *dst_in; struct addr_req *req; @@ -310,6 +313,7 @@ req->addr = addr; req->callback = callback; req->context = context; + req->owner = owner; src_in = (struct sockaddr_in *) &req->src_addr; dst_in = (struct sockaddr_in *) &req->dst_addr; -- MST From 
rdreier at cisco.com Thu Apr 6 08:38:42 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Apr 2006 08:38:42 -0700 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths References: Message-ID: The best solution might be just to say, hey, module unloading has races. - R. From mst at mellanox.co.il Thu Apr 6 08:46:17 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 18:46:17 +0300 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: References: Message-ID: <20060406154617.GD13416@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib_flush_paths > > The best solution might be just to say, hey, module unloading has races. Ugh. Please, consider my patch instead. It solves these races in 35 LOC, including SA, ADDR and updating all ULPs. -- MST From halr at voltaire.com Thu Apr 6 08:41:32 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2006 11:41:32 -0400 Subject: [openib-general] RC2 delayed a bit In-Reply-To: <44351E43.1040405@mellanox.co.il> References: <1144259913.3984.30.camel@chalcedony.internal.keyresearch.com> <44351E43.1040405@mellanox.co.il> Message-ID: <1144337957.17533.48.camel@hal.voltaire.com> On Thu, 2006-04-06 at 09:57, Tziporet Koren wrote: > Bryan O'Sullivan wrote: > > Unfortunately, the coordination of this with the 1.0 process has thus > > far not been very effective, so I am spending a lot of time manually > > filtering diffs to see what has changed between the first EWG software > > release (named "IBED") and the 1.0 tree, so that I can reunify the two. > > > > > User space code that is needed by IBED was reviewed and check it in to > the 1.0 branch. > We did it to all modules except management that is synchronized by Hal > and ipath library that I assumed you will take care for. > ibed directory was opened under 1.0 branch for scripts & backport patches. > We plan to publish IBED rc3 by Monday. 
If OF rc2 is not done by then, does that mean that IBED rc3 is still using OF rc2 user packages ? -- Hal > Tziporet > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From vlad at mellanox.co.il Thu Apr 6 09:48:44 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 06 Apr 2006 19:48:44 +0300 Subject: [openib-general] ipath module compilation failure on SLES10 In-Reply-To: <1144258679.3984.14.camel@chalcedony.internal.keyresearch.com> References: <4433F5DC.2080503@mellanox.co.il> <1144256175.3984.5.camel@chalcedony.internal.keyresearch.com> <1144258679.3984.14.camel@chalcedony.internal.keyresearch.com> Message-ID: <4435466C.4020508@mellanox.co.il> Hi Bryan, I have the following compilation failure: OS: Novell Linux Desktop 10 (x86_64) VERSION = 10 RELEASE = 9 Kernel: 2.6.16-rc5-git9-2-smp Architecture: x86_64 gcc -Wp,-MD,/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/.ipath_verbs.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64 -suse-linux/4.1.0/include -D__KERNEL__ -I/var/tmp/IBED/tmp/openib/openib/include -I/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniba nd/include -Iinclude -Iinclude2 -I/usr/src/linux-2.6.16-rc5-git9-2/include -I/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/h w/ipath -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -ffreestandin g -Os -fomit-frame-pointer -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-un wind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wdeclaration-after-statement -Wno-pointer-sign -I/var/tmp/IBED/tmp/open ib/openib/src/linux-kernel/infiniband/include -I/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/ipoib 
-I/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/kdapl -I/var/tmp/IBED/tmp/openib/openib/drivers/infiniband/debug -DIPATH_IDSTR='"PathScale kernel.org driver"' -DIPATH_KERN_TYPE=0 -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ipath_verbs)" -D"KBUILD_MODNAME=KBUILD_STR(ib_ipath)" -c -o /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/.tmp_ipath_verbs.o /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c: In function ‘ipath_register_ib_device’: /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c:1005: error: ‘IB_NODE_CA’ undeclared (first use in this function) /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c:1005: error: (Each undeclared identifier is reported only once /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.c:1005: error: for each function it appears in.)
make[5]: *** [/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.o] Error 1 make[4]: *** [/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath] Error 2 make[3]: *** [_module_/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband] Error 2 make[2]: *** [modules] Error 2 make[1]: *** [modules] Error 2 make[1]: Leaving directory `/usr/src/linux-2.6.16-rc5-git9-2-obj/x86_64/smp' make: *** [kernel] Error 2 Regards, Vladimir From bos at pathscale.com Thu Apr 6 09:53:52 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 06 Apr 2006 09:53:52 -0700 Subject: [openib-general] ipath module compilation failure on SLES10 In-Reply-To: <4435466C.4020508@mellanox.co.il> References: <4433F5DC.2080503@mellanox.co.il> <1144256175.3984.5.camel@chalcedony.internal.keyresearch.com> <1144258679.3984.14.camel@chalcedony.internal.keyresearch.com> <4435466C.4020508@mellanox.co.il> Message-ID: <1144342432.20433.39.camel@chalcedony.internal.keyresearch.com> On Thu, 2006-04-06 at 19:48 +0300, Vladimir Sokolovsky wrote: > I have the following compilation failure: This should be fixed now. Message-ID: >- dependency on libsysfs > I propose we don't get into this again. I know we have it in ibverbs > but let's avoid for new code. > >- negative error codes > I think this is kernel practice, in userspace we > either set errno and return -1, or simply return > positive error code: otherwise utilities like strerror > do not work We can convert all negative error codes to positive. >- abi versioning > The RDMA_USER_CM_MAX_ABI_VERSION idea is broken - it makes it so that you are > required to upgrade userspace to run on older kernels. > Bailing out in userspace if the kernel is too new is not good - please > remove this test. There is both a minor and major number. A given library should support anything between its known versions. Can you be more specific about the problem? 
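The range check described here ("anything between its known versions") can be sketched as follows. This is only an illustration under stated assumptions: TOY_MIN_ABI_VERSION, TOY_MAX_ABI_VERSION and toy_abi_supported are hypothetical names and values, not the actual librdmacm definitions.

```c
#include <assert.h>

/* Hypothetical bounds; real values would come from the library's headers. */
#define TOY_MIN_ABI_VERSION 1
#define TOY_MAX_ABI_VERSION 3

/* Returns 1 when the kernel's ABI version falls inside the range the
 * library knows about, 0 otherwise. */
static int toy_abi_supported(int kernel_abi)
{
    return kernel_abi >= TOY_MIN_ABI_VERSION &&
           kernel_abi <= TOY_MAX_ABI_VERSION;
}
```

Note that the upper-bound comparison is the part being objected to in the quoted text: with it, a library refuses to run against a kernel that reports a newer version than the library knows.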
- Sean From sean.hefty at intel.com Thu Apr 6 09:55:35 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Apr 2006 09:55:35 -0700 Subject: [openib-general] RE: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406152649.GC13416@mellanox.co.il> Message-ID: >> Actually, it turned out to be the simplest solution - and quite >> elegant since there's no room for mistakes: if query is going to be running >> this means module is still loaded so we can take a reference to it >> without races. > >And Here's the patch to ib_addr. Sean, Roland, please comment. I like this approach, especially given the current implementations. Roland, Hal? What are your thoughts? Note that ib_cm and rdma_cm technically have the same issue, since cm_id's can be destroyed by returning non-zero from a callback. I.e. a user of those interfaces isn't forced to call anything when unloading. - Sean From jlentini at netapp.com Thu Apr 6 10:06:52 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 6 Apr 2006 13:06:52 -0400 (EDT) Subject: [openib-general] Re: [PATCH] [RFC] - dapl - dat_ep_free() can return without In-Reply-To: <1144189819.7326.53.camel@stevo-desktop> References: <1144177443.6427.34.camel@stevo-desktop> <1144189819.7326.53.camel@stevo-desktop> Message-ID: On Tue, 4 Apr 2006, Steve Wise wrote: > > What happens if a consumer attempts to free the EP from a callback? > > There are no direct consumer callbacks in usermode are there? consumers > call dat_evd_wait() or whatever and get scheduled. Not like kernel > mode... Or am I confused? You're right. The DAT consumer thread calling dat_ep_free() will never be a provider (or verbs) thread. It looks like there needs to be some synchronization around destroying the cm_id with the dapli_thread(), though. Could we only delete the QP in dat_ep_free as Sean suggested and leave the cm_id cleanup for later as is being done now? 
From vlad at mellanox.co.il Thu Apr 6 10:08:26 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 06 Apr 2006 20:08:26 +0300 Subject: [openib-general] ipath module compilation failure on RHEL 4.0 U3 In-Reply-To: <1144342432.20433.39.camel@chalcedony.internal.keyresearch.com> References: <4433F5DC.2080503@mellanox.co.il> <1144256175.3984.5.camel@chalcedony.internal.keyresearch.com> <1144258679.3984.14.camel@chalcedony.internal.keyresearch.com> <4435466C.4020508@mellanox.co.il> <1144342432.20433.39.camel@chalcedony.internal.keyresearch.com> Message-ID: <44354B0A.8040108@mellanox.co.il> Bryan, I have also failure on: OS: Red Hat Enterprise Linux AS release 4 (Nahant Update 3) Kernel: 2.6.9-34.ELsmp Architecture: x86_64 gcc -Wp,-MD,/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/.ipath_cq.o.d -nostdinc -iwithprefix include -D__KERNEL__ -I/var/tmp/IBED/tmp/openib/openib/include -I/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/include -Iinclude -Wall -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Os -fomit-frame-pointer -g -Wdeclaration-after-statement -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -funit-at-a-time -I/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/include -I/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/ipoib -I/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/ulp/kdapl -I/var/tmp/IBED/tmp/openib/openib/drivers/infiniband/debug -D__nocast= -DIPATH_IDSTR=PathScale kernel.org driver -DIPATH_KERN_TYPE=0 -DMODULE -DKBUILD_BASENAME=ipath_cq -DKBUILD_MODNAME=ib_ipath -c -o /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/.tmp_ipath_cq.o /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_cq.c In file included from /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_cq.c:36: /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h:395: 
error: `BITS_PER_BYTE' undeclared here (not in a function) make[3]: *** [/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_cq.o] Error 1 make[2]: *** [/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath] Error 2 make[1]: *** [_module_/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband] Error 2 make[1]: Leaving directory `/usr/src/kernels/2.6.9-34.EL-smp-x86_64' make: *** [kernel] Error 2 Regards, Vladimir From dledford at redhat.com Thu Apr 6 10:12:20 2006 From: dledford at redhat.com (Doug Ledford) Date: Thu, 6 Apr 2006 13:12:20 -0400 Subject: [openib-general] Re: ib_umad related kernel panic In-Reply-To: <67897d690604051904la898f84y5a34709a6334f315@mail.gmail.com> References: <67897d690604051904la898f84y5a34709a6334f315@mail.gmail.com> Message-ID: <20060406171220.GB7353@redhat.com> On Wed, Apr 05, 2006 at 07:04:29PM -0700, Manpreet Singh wrote: > Hi, > > I am observing the following with redhat kernel rpm at: > http://people.redhat.com/dledford/Infiniband , which uses openib version > 3965. This is on an RHEL4 install. > > When the system is rebooted, ib_core, ib_mad and ib_mthca modules get loaded > automatically. When I load ib_umad after that, I get the following panic. > However, if I unload ib_mthca and load again once before loading ib_umad, > then there is no problem subsequently in the system. The rpms posted on my site, especially the 3985 revision, are not going to be "fixed", they will simply be replaced. We are going to a later rev of the code base for the next update (the 1.0 release branch), so time spent fixing bugs in that version is basically wasted. Just to make that point clear, I've removed the old RPMs from my site and put up a new set of kernel rpms based on the 1.0 release branch code (userspace rpms will be a little later). -- Doug Ledford 919-754-3700 x44233 Red Hat, Inc. 1801 Varsity Dr. 
Raleigh, NC 27606 From ardavis at ichips.intel.com Thu Apr 6 10:53:14 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 06 Apr 2006 10:53:14 -0700 Subject: [openib-general] Re: [PATCH] [RFC] - dapl - dat_ep_free() can return without In-Reply-To: References: <1144177443.6427.34.camel@stevo-desktop> <1144189819.7326.53.camel@stevo-desktop> Message-ID: <4435558A.4050908@ichips.intel.com> James Lentini wrote: >On Tue, 4 Apr 2006, Steve Wise wrote: > > > > >>>What happens if a consumer attempts to free the EP from a callback? >>> >>> >>There are no direct consumer callbacks in usermode are there? consumers >>call dat_evd_wait() or whatever and get scheduled. Not like kernel >>mode... Or am I confused? >> >> > >You're right. The DAT consumer thread calling dat_ep_free() will never >be a provider (or verbs) thread. > >It looks like there needs to be some synchronization around destroying >the cm_id with the dapli_thread(), though. > >Could we only delete the QP in dat_ep_free as Sean suggested and leave >the cm_id cleanup for later as is being done now? > > > I think we should destroy all resources before returning, including the cm_id. This will ensure that no events will fire with context associated with an EP (qp, cm_id, etc.) that was just freed. I will take a look at Steve's patch and get back to you. From sean.hefty at intel.com Thu Apr 6 10:59:43 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Apr 2006 10:59:43 -0700 Subject: [openib-general] [RFC] [PATCH] SA query: expose retries through API Message-ID: Currently, the SA query interface does not permit retrying requests automatically. Expose this capability to take advantage of the underlying MAD layer API, which provides it basically for free because of RMPP. Without automatic retries pushed down into the SA query module, retries are assigned new TIDs, and appear as separate requests.
This means that a delayed response will be dropped, and the remote side will not detect that the request is a duplicate, so will re-calculate the response. Signed-off-by: Sean Hefty --- This will be used by the multicast code. I changed all of the APIs to be consistent, but for the purposes of multicast, only ib_sa_mcmember_rec_set and ib_sa_mcmember_rec_delete need to change. Also, the MAD layer currently uses linear retries. This could be changed to an exponential backoff algorithm instead. This would still allow for a default retry algorithm that a consumer could use, but consumers that wanted to manage their own timeout algorithm could do so by specifying a retry count of 0.. Index: include/rdma/ib_sa.h =================================================================== --- include/rdma/ib_sa.h (revision 6230) +++ include/rdma/ib_sa.h (working copy) @@ -285,7 +285,7 @@ void ib_sa_cancel_query(int id, struct i int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_path_rec *resp, void *context), @@ -296,7 +296,7 @@ int ib_sa_mcmember_rec_query(struct ib_d u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_mcmember_rec *resp, void *context), @@ -307,7 +307,7 @@ int ib_sa_service_rec_query(struct ib_de u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_service_rec *resp, void *context), @@ -321,6 +321,7 @@ int ib_sa_service_rec_query(struct ib_de * @rec:MCMember Record to send in query * @comp_mask:component mask to send in query * @timeout_ms:time to wait for response + * @retries:number of times to 
retry request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when query completes, times out or is * canceled @@ -342,7 +343,7 @@ static inline int ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_mcmember_rec *resp, void *context), @@ -352,7 +353,7 @@ ib_sa_mcmember_rec_set(struct ib_device return ib_sa_mcmember_rec_query(device, port_num, IB_MGMT_METHOD_SET, rec, comp_mask, - timeout_ms, gfp_mask, callback, + timeout_ms, retries, gfp_mask, callback, context, query); } @@ -363,6 +364,7 @@ ib_sa_mcmember_rec_set(struct ib_device * @rec:MCMember Record to send in query * @comp_mask:component mask to send in query * @timeout_ms:time to wait for response + * @retries:number of times to retry request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when query completes, times out or is * canceled @@ -384,7 +386,7 @@ static inline int ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_mcmember_rec *resp, void *context), @@ -394,7 +396,7 @@ ib_sa_mcmember_rec_delete(struct ib_devi return ib_sa_mcmember_rec_query(device, port_num, IB_SA_METHOD_DELETE, rec, comp_mask, - timeout_ms, gfp_mask, callback, + timeout_ms, retries, gfp_mask, callback, context, query); } Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 6230) +++ core/sa_query.c (working copy) @@ -482,7 +482,7 @@ static void init_mad(struct ib_sa_mad *m spin_unlock_irqrestore(&tid_lock, flags); } -static int send_mad(struct ib_sa_query *query, int timeout_ms) +static int send_mad(struct 
ib_sa_query *query, int timeout_ms, int retries) { unsigned long flags; int ret, id; @@ -499,6 +499,7 @@ retry: return ret; query->mad_buf->timeout_ms = timeout_ms; + query->mad_buf->retries = retries; query->mad_buf->context[0] = query; query->id = id; @@ -555,6 +556,7 @@ static void ib_sa_path_rec_release(struc * @rec:Path Record to send in query * @comp_mask:component mask to send in query * @timeout_ms:time to wait for response + * @retries:number of times to retry request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when query completes, times out or is * canceled @@ -575,7 +577,7 @@ static void ib_sa_path_rec_release(struc int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_path_rec *resp, void *context), @@ -624,7 +626,7 @@ int ib_sa_path_rec_get(struct ib_device *sa_query = &query->sa_query; - ret = send_mad(&query->sa_query, timeout_ms); + ret = send_mad(&query->sa_query, timeout_ms, retries); if (ret < 0) goto err2; @@ -670,6 +672,7 @@ static void ib_sa_service_rec_release(st * @rec:Service Record to send in request * @comp_mask:component mask to send in request * @timeout_ms:time to wait for response + * @retries:number of times to retry request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when request completes, times out or is * canceled @@ -691,7 +694,7 @@ static void ib_sa_service_rec_release(st int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_service_rec *resp, void *context), @@ -746,7 +749,7 @@ int ib_sa_service_rec_query(struct ib_de *sa_query = &query->sa_query; - ret = 
send_mad(&query->sa_query, timeout_ms); + ret = send_mad(&query->sa_query, timeout_ms, retries); if (ret < 0) goto err2; @@ -788,7 +791,7 @@ int ib_sa_mcmember_rec_query(struct ib_d u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_mcmember_rec *resp, void *context), @@ -838,7 +841,7 @@ int ib_sa_mcmember_rec_query(struct ib_d *sa_query = &query->sa_query; - ret = send_mad(&query->sa_query, timeout_ms); + ret = send_mad(&query->sa_query, timeout_ms, retries); if (ret < 0) goto err2; Index: core/cma.c =================================================================== --- core/cma.c (revision 6262) +++ core/cma.c (working copy) @@ -1064,7 +1064,7 @@ static int cma_query_ib_route(struct rdm id_priv->id.port_num, &path_rec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, - timeout_ms, GFP_KERNEL, + timeout_ms, 0, GFP_KERNEL, cma_query_handler, work, &id_priv->query); return (id_priv->query_id < 0) ? 
id_priv->query_id : 0; Index: core/at.c =================================================================== --- core/at.c (revision 6230) +++ core/at.c (working copy) @@ -216,7 +216,7 @@ static void ib_dev_ats_op(struct ib_at_d op, rec, mask, - IB_AT_REQ_RETRY_MS, + IB_AT_REQ_RETRY_MS, 0, GFP_KERNEL, ats_op_complete, ib_dev, @@ -1118,7 +1118,7 @@ static int resolve_ats_ips(struct ats_ip IB_MGMT_METHOD_GET, rec, IB_ATS_GET_PRIM_IP_MASK, - req->pend.timeout_ms, + req->pend.timeout_ms, 0, GFP_KERNEL, ats_ips_req_complete, req, @@ -1163,7 +1163,7 @@ static int resolve_ats_route(struct rout IB_MGMT_METHOD_GET, rec, IB_ATS_GET_GID_MASK, - req->pend.timeout_ms, + req->pend.timeout_ms, 0, GFP_KERNEL, ats_route_req_complete, req, @@ -1226,7 +1226,7 @@ static int resolve_path(struct path_req IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH), - req->pend.timeout_ms, + req->pend.timeout_ms, 0, GFP_KERNEL, path_req_complete, req, Index: ulp/srp/ib_srp.c =================================================================== --- ulp/srp/ib_srp.c (revision 6230) +++ ulp/srp/ib_srp.c (working copy) @@ -257,7 +257,7 @@ static int srp_lookup_path(struct srp_ta IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH | IB_SA_PATH_REC_PKEY, - SRP_PATH_REC_TIMEOUT_MS, + SRP_PATH_REC_TIMEOUT_MS, 0, GFP_KERNEL, srp_path_rec_completion, target, &target->path_query); Index: ulp/sdp/sdp_link.c =================================================================== --- ulp/sdp/sdp_link.c (revision 6230) +++ ulp/sdp/sdp_link.c (working copy) @@ -323,7 +323,7 @@ static void sdp_link_path_rec_done(int s IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH), - info->sa_time, + info->sa_time, 0, GFP_KERNEL, sdp_link_path_rec_done, info, @@ -359,7 +359,7 @@ static int sdp_link_path_rec_get(struct IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH), - info->sa_time, + info->sa_time, 0, GFP_KERNEL, sdp_link_path_rec_done, info, Index: ulp/ipoib/ipoib_main.c 
===================================================================
--- ulp/ipoib/ipoib_main.c	(revision 6230)
+++ ulp/ipoib/ipoib_main.c	(working copy)
@@ -468,7 +468,7 @@ static int path_rec_start(struct net_dev
 				   IB_SA_PATH_REC_SGID |
 				   IB_SA_PATH_REC_NUMB_PATH |
 				   IB_SA_PATH_REC_PKEY,
-				   1000, GFP_ATOMIC,
+				   1000, 0, GFP_ATOMIC,
 				   path_rec_completion,
 				   path, &path->query);
 	if (path->query_id < 0) {

From vaidyana at cse.ohio-state.edu Thu Apr 6 11:01:08 2006
From: vaidyana at cse.ohio-state.edu (Karthikeyan Vaidyanathan)
Date: Thu, 6 Apr 2006 14:01:08 -0400 (EDT)
Subject: [openib-general] amso1100 testing with OpenIB
Message-ID:

Hi,

I was trying to follow the discussion on getting the iWARP branch to work
with Ammasso NICs and tried the following steps:

/include/rdma --> /gen2/branches/iwarp/src/linux-kernel/infiniband/include/rdma
/drivers/infiniband --> /gen2/branches/iwarp/src/linux-kernel/infiniband

and recompiled the kernel. I'm working on linux 2.6.15.4 kernel.

I had initially updated the firmware from AMSO1100 Release 1.2 Update 1 kit.
Then I used ccflash2 to update to C2L_H23_B58_F61_080507.bit. However, when
I reboot the machine, the dmesg reports:

c2: AMSO1100 Gigabit Ethernet driver v1.1 loaded
c2: Downlevel Firmware boot loader [1/7: got 0x42, exp 0x43]. Use the cc_flash utility to update your boot loader
c2: Adapter not claimed
c2: probe of 0000:03:02.0 failed with error -5

I also tried updating the firmware boot loader from AMSO1100 Release 1.2
Update 1 kit but I still get this message when the machine boots up.
However, the following command shows that I have the latest firmware loaded:

$ ccons 0 bitfile
FPGA Bitfile ID: C2L_H23_B58_080507 Release 23

Am I missing something here?
thanks,
Karthik

From ishai at mellanox.co.il Thu Apr 6 11:11:58 2006
From: ishai at mellanox.co.il (Ishai Rabinovitz)
Date: Thu, 6 Apr 2006 21:11:58 +0300
Subject: [openib-general] FW: [openfabrics-ewg] Changes in SM for the new SRP daemon
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD2CA9@mtlexch01.mtl.com>

Yes it should.

Ishai

________________________________

From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com]
Sent: Thursday, April 06, 2006 1:31 AM
To: Ishai Rabinovitz; openfabrics-ewg at openib.org
Subject: RE: [openfabrics-ewg] Changes in SM for the new SRP daemon

Shouldn't this discussion be on openib-general?

________________________________

From: openfabrics-ewg-bounces at openib.org [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of Ishai Rabinovitz
Sent: Wednesday, April 05, 2006 2:14 PM
To: openfabrics-ewg at openib.org
Subject: [openfabrics-ewg] Changes in SM for the new SRP daemon

Summary:

In the next release of SRP we want to add a daemon that is executed in each
initiator and finds out which targets exist in the fabric. The new daemon
can use SA capabilities from IBTA 1.2 errata (details at the bottom) to
improve performance. If you are using an SM other than openIB's openSM,
please support this feature in your SM.

Details:

When one wants to find all SRP targets and their information in the fabric,
he/she currently runs "ibsrpdm". "ibsrpdm" uses the following procedure to
discover all SRP targets available in the fabric:

1. "ibsrpdm" sends a get_table query for node info with node type CA.
2. It gets quite a big table and then for every node it sends a port_info
   query.
3. From the response to this query the initiator can check if this port is
   an SRP target (the dm bit capability is set).

The problem with this procedure is that it may create too much traffic on
the fabric. Let's assume there is a cluster of 4096 nodes booting together.
Each of these 4096 nodes sends the first query and gets a list of 4096 nodes.
This list is divided into a large number of UD messages and may cause
retransmits. After getting this list each node sends 4096 queries, one for
the port of each node. This traffic is quite huge.

The SA has a new capability to answer the query: "please return a list of
the ports that have the dm bit set" (meaning return ports of SRP targets).
(This capability of the SA is part of Errata Release Version: 1.2 1/26/2006
Chapter/Subsection: 15.2.5.3 - quoted at the bottom).

Using this capability we can use the following procedure:

1. The daemon will send a "get table port_info of ports that have the dm bit
   set" query and get a table with a small number of port_info entries.
2. For each port it queries for the GUID of that port.

This will significantly reduce the traffic on the fabric. Actually, in this
solution there is so little traffic that the new daemon will run it
periodically (every minute) to look for changes. (There will be less traffic
than registering to Trap 64 and Trap 65.)

Quoting the errata:

----------------------------------------------------------------------------
Errata Tracking Number: MGTWG8372
Sub-Case Number: 0
Reference ID: 4291
Title: Enhanced SA PortInfoRecord searches
Submitter: Livingston, James (James.Livingston at necsam.com)
Volume: 1
Revision: 1.2
Errata Release Version: 1.2 1/26/2006
Chapter/Subsection: 15
Page: 885
Line: 20
AssignedIntensity:
Status/Disposition: WG_Approved

Problem Description:
Add a new row to Table 186 SA-Specific ClassPortInfo:CapabilityMask Bits:

Problem Resolution:
Add a new row to Table 186 SA-Specific ClassPortInfo:CapabilityMask Bits

Original Text:

Corrected Text:
IsPortInfoCapMaskMatchSupported 13 If this value is 1, SA shall support
matching the PortInfo:CapabilityMask component as described in .

Comment History:
Dec 14, 2005 08:11:26 PM Old: New:Pending By:Benner, Alan (bennera at us.ibm.com)

----------------------------------------------------------------------------
Errata Tracking Number: MGTWG8372
Sub-Case Number: 1
Reference ID: 4292
Title: Enhanced SA PortInfoRecord searches
Submitter: Livingston, James (James.Livingston at necsam.com)
Volume: 1
Revision: 1.2
Errata Release Version: 1.2 1/26/2006
Chapter/Subsection: 15.2.5.3
Page: 891
Line: 36
AssignedIntensity:
Status/Disposition: WG_Approved

Problem Description:
Add optional compliance statement for new capability.

Problem Resolution:
Make the proposed change to the spec text.

Original Text:

Corrected Text:
o15-0.x.y: If SA's ClassPortInfo:CapabilityMask.IsPortInfoCapMaskMatchSupported
is 1, then the AttributeModifier of the SubnAdmGet() and SubnAdmGetTable()
methods affects the matching behavior on the PortInfo:CapabilityMask
component. If the high-order bit (bit 31) of the AttributeModifier is set to
1, matching on the CapabilityMask component will not be an exact bitwise
match as described in . Instead, matching will only be performed on those
bits which are set to 1 in the PortInfo:CapabilityMask embedded in the
query. In , bits in the PortInfo:CapabilityMask embedded in the query that
are set to 0 are bitwise wildcards for purposes of matching. This gives a
requester the ability to select desired capabilities and query for ports
which support those capabilities. If SA's
ClassPortInfo:CapabilityMask.IsPortInfoCapMaskMatchSupported is 0, or if bit
31 of the AttributeModifier is 0, then any matching performed on the
PortInfo:CapabilityMask component is as described in .

Comment History:
Dec 14, 2005 08:21:47 PM Old: New:Pending By:Benner, Alan (bennera at us.ibm.com)

----------------------------------------------------------------------------

Ishai

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rdreier at cisco.com Thu Apr 6 11:11:26 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 06 Apr 2006 11:11:26 -0700
Subject: [openib-general] ipath module compilation failure on SLES10
In-Reply-To: <1144342432.20433.39.camel@chalcedony.internal.keyresearch.com> (Bryan O'Sullivan's message of "Thu, 06 Apr 2006 09:53:52 -0700")
References: <4433F5DC.2080503@mellanox.co.il> <1144256175.3984.5.camel@chalcedony.internal.keyresearch.com> <1144258679.3984.14.camel@chalcedony.internal.keyresearch.com> <4435466C.4020508@mellanox.co.il> <1144342432.20433.39.camel@chalcedony.internal.keyresearch.com>
Message-ID:

I don't think doing this:

#ifdef IB_NODE_CA
	dev->node_type = IB_NODE_CA;
#else
	dev->node_type = RDMA_NODE_IB_CA;
#endif

bought you anything. IB_NODE_CA is an enum, not a macro.

 - R.

From sean.hefty at intel.com Thu Apr 6 11:13:47 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 6 Apr 2006 11:13:47 -0700
Subject: [openib-general] [RFC] [PATCH 1/2] multicast support for multiple users
Message-ID:

Add kernel support that tracks joining and leaving multicast groups. The SA
tracks join/leave operations on a per port basis. In order to support
multiple users of the same multicast group, we need to track join / leave
requests locally.

Signed-off-by: Sean Hefty
---

This patch depends on the sa query patch that adds retries to that API.

I spent considerable time (and a couple of rewrites of the code) trying to
ensure that all race conditions were handled, and in a way that was as
simple as possible. Some additional review of the code that looked for race
conditions would be appreciated.

Also note that this code has the bug that Michael pointed out:

http://openib.org/pipermail/openib-general/2006-April/019643.html

I believe that this bug should be fixed, once we agree on a solution, but I
don't think that the bug is likely enough to occur that it should delay the
check-in.
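
As a rough illustration of the local tracking described above (this is not
code from the patch), the following userspace C sketch keeps one counter per
join-state bit and only involves the SA when the first local user takes a
bit or the last one releases it. The names group_track, local_join and
local_leave are hypothetical, and the sketch assumes the SA answers
immediately; it ignores the serialization, asynchronous completion and error
handling that a real implementation needs.

```c
#include <assert.h>

/* Hypothetical model: one counter per join-state bit. */
#define NUM_JOIN_BITS 3

struct group_track {
	int members[NUM_JOIN_BITS];	/* local users holding each bit */
	unsigned char sa_state;		/* bits currently joined at the SA */
};

/*
 * Record a local join.  Returns the bits that are not yet joined at the
 * SA and therefore need a real join request; 0 means the request can be
 * satisfied locally.
 */
unsigned char local_join(struct group_track *g, unsigned char bits)
{
	unsigned char need;
	int i;

	assert((bits >> NUM_JOIN_BITS) == 0);
	need = (unsigned char)(bits & ~g->sa_state);
	for (i = 0; i < NUM_JOIN_BITS; i++)
		if (bits & (1u << i))
			g->members[i]++;
	g->sa_state |= bits;
	return need;
}

/*
 * Record a local leave.  Returns the bits that no local user holds any
 * more and that should therefore be left at the SA.
 */
unsigned char local_leave(struct group_track *g, unsigned char bits)
{
	unsigned char drop = 0;
	int i;

	assert((bits >> NUM_JOIN_BITS) == 0);
	for (i = 0; i < NUM_JOIN_BITS; i++) {
		if (!(bits & (1u << i)))
			continue;
		if (g->members[i] > 0 && --g->members[i] == 0)
			drop |= (unsigned char)(1u << i);
	}
	drop &= g->sa_state;
	g->sa_state &= (unsigned char)~drop;
	return drop;
}
```

With two local users joining the same bit, only the first local_join call
returns a non-zero mask, and only the second local_leave call does, so the
SA would see a single join and a single leave for that port.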
Index: include/rdma/ib_multicast.h =================================================================== --- include/rdma/ib_multicast.h (revision 0) +++ include/rdma/ib_multicast.h (revision 0) @@ -0,0 +1,85 @@ +/* + * Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef IB_MULTICAST_H +#define IB_MULTICAST_H + +#include + +struct ib_multicast { + struct ib_sa_mcmember_rec rec; + ib_sa_comp_mask comp_mask; + int (*callback)(int status, + struct ib_multicast *multicast); + void *context; +}; + +/** + * ib_join_multicast - Initiates a join request to the specified multicast + * group. + * @device: Device associated with the multicast group. 
+ * @port_num: Port on the specified device to associate with the multicast + * group. + * @rec: SA multicast member record specifying group attributes. + * @comp_mask: Component mask indicating which group attributes of %rec are + * valid. + * @gfp_mask: GFP mask for memory allocations. + * @callback: User callback invoked once the join operation completes. + * @context: User specified context stored with the ib_multicast structure. + * + * This call initiates a multicast join request with the SA for the specified + * multicast group. If the join operation is started successfully, it returns + * an ib_multicast structure that is used to track the multicast operation. + * Users must free this structure by calling ib_free_multicast, even if the + * join operation later fails. (The callback status is non-zero.) + */ +struct ib_multicast *ib_join_multicast(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, gfp_t gfp_mask, + int (*callback)(int status, + struct ib_multicast + *multicast), + void *context); + +/** + * ib_free_multicast - Frees the multicast tracking structure, and releases + * any reference on the multicast group. + * @multicast: Multicast tracking structure allocated by ib_join_multicast. + * + * This call blocks until the connection identifier is destroyed. It may + * not be called from within the multicast callback; however, returning a non- + * zero value from the callback will result in destroying the multicast + * tracking structure. + */ +void ib_free_multicast(struct ib_multicast *multicast); + +#endif /* IB_MULTICAST_H */ Index: core/multicast.c =================================================================== --- core/multicast.c (revision 0) +++ core/multicast.c (revision 0) @@ -0,0 +1,659 @@ +/* + * Copyright (c) 2006 Intel Corporation.  All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include +#include +#include +#include + +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand multicast membership handling"); +MODULE_LICENSE("Dual BSD/GPL"); + +static int retry_timer = 5000; /* 5 sec */ +module_param(retry_timer, int, 0444); +MODULE_PARM_DESC(retry_timer, "Time in ms between retried requests."); + +static int retries = 3; +module_param(retries, int, 0444); +MODULE_PARM_DESC(retries, "Number of times to retry a request."); + +static void mcast_add_one(struct ib_device *device); +static void mcast_remove_one(struct ib_device *device); + +static struct ib_client mcast_client = { + .name = "ib_multicast", + .add = mcast_add_one, + .remove = mcast_remove_one +}; + +static struct workqueue_struct *mcast_wq; + +struct mcast_device; + +struct mcast_port { + struct mcast_device *dev; + spinlock_t lock; + struct rb_root table; + atomic_t refcount; + wait_queue_head_t wait; + u8 port_num; +}; + +struct mcast_device { + struct ib_device *device; + struct mcast_port port[0]; +}; + +enum mcast_state { + MCAST_IDLE, + MCAST_JOINING, + MCAST_MEMBER, + MCAST_BUSY, +}; + +struct mcast_member; + +struct mcast_group { + struct ib_sa_mcmember_rec rec; + struct rb_node node; + struct mcast_port *port; + spinlock_t lock; + struct work_struct work; + struct list_head pending_list; + struct mcast_member *last_join; + int members[3]; + atomic_t refcount; + enum mcast_state state; + struct ib_sa_query *query; + int query_id; +}; + +struct mcast_member { + struct ib_multicast multicast; + struct mcast_group *group; + struct list_head list; + enum mcast_state state; + atomic_t refcount; + wait_queue_head_t wait; +}; + +static void join_handler(int status, struct ib_sa_mcmember_rec *rec, + void *context); +static void leave_handler(int status, struct ib_sa_mcmember_rec *rec, + void *context); + +static struct mcast_group *mcast_find(struct mcast_port *port, + union ib_gid *mgid) +{ + struct rb_node *node = port->table.rb_node; 
+ struct mcast_group *group; + int ret; + + while (node) { + group = rb_entry(node, struct mcast_group, node); + ret = memcmp(mgid->raw, group->rec.mgid.raw, sizeof *mgid); + if (!ret) + return group; + + if (ret < 0) + node = node->rb_left; + else + node = node->rb_right; + } + return NULL; +} + +static struct mcast_group *mcast_insert(struct mcast_port *port, + struct mcast_group *group) +{ + struct rb_node **link = &port->table.rb_node; + struct rb_node *parent = NULL; + struct mcast_group *cur_group; + int ret; + + while (*link) { + parent = *link; + cur_group = rb_entry(parent, struct mcast_group, node); + + ret = memcmp(group->rec.mgid.raw, cur_group->rec.mgid.raw, + sizeof group->rec.mgid); + if (ret < 0) + link = &(*link)->rb_left; + else if (ret > 0) + link = &(*link)->rb_right; + else + return cur_group; + } + rb_link_node(&group->node, parent, link); + rb_insert_color(&group->node, &port->table); + return NULL; +} + +static void deref_port(struct mcast_port *port) +{ + if (atomic_dec_and_test(&port->refcount)) + wake_up(&port->wait); +} + +static void release_group(struct mcast_group *group) +{ + struct mcast_port *port = group->port; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + if (atomic_dec_and_test(&group->refcount)) { + rb_erase(&group->node, &port->table); + spin_unlock_irqrestore(&port->lock, flags); + kfree(group); + deref_port(port); + } else + spin_unlock_irqrestore(&port->lock, flags); +} + +static void deref_member(struct mcast_member *member) +{ + if (atomic_dec_and_test(&member->refcount)) + wake_up(&member->wait); +} + +static void queue_join(struct mcast_member *member) +{ + struct mcast_group *group = member->group; + unsigned long flags; + + spin_lock_irqsave(&group->lock, flags); + list_add(&member->list, &group->pending_list); + if (group->state == MCAST_IDLE) { + group->state = MCAST_BUSY; + spin_unlock_irqrestore(&group->lock, flags); + atomic_inc(&group->refcount); + queue_work(mcast_wq, &group->work); + } 
else + spin_unlock_irqrestore(&group->lock, flags); +} + +static void adjust_membership(struct mcast_group *group, u8 join_state, int inc) +{ + int i; + + for (i = 0; i < 3; i++, join_state >>= 1) + if (join_state & 0x1) + group->members[i] += inc; +} + +static u8 get_leave_state(struct mcast_group *group) +{ + u8 leave_state = 0; + int i; + + for (i = 0; i < 3; i++) + if (!group->members[i]) + leave_state |= (0x1 << i); + + return leave_state & group->rec.join_state; +} + +static int cmp_rec(struct ib_sa_mcmember_rec *src, + struct ib_sa_mcmember_rec *dst, ib_sa_comp_mask comp_mask) +{ + /* MGID must already match */ + + if (comp_mask & IB_SA_MCMEMBER_REC_PORT_GID && + memcmp(&src->port_gid, &dst->port_gid, sizeof src->port_gid)) + return -EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_QKEY && src->qkey != dst->qkey) + return -EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_MLID && src->mlid != dst->mlid) + return -EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_MTU_SELECTOR && + src->mtu_selector != dst->mtu_selector) + return -EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_MTU && src->mtu != dst->mtu) + return -EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_TRAFFIC_CLASS && + src->traffic_class != dst->traffic_class) + return -EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_PKEY && src->pkey != dst->pkey) + return -EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_RATE_SELECTOR && + src->rate_selector != dst->rate_selector) + return -EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_RATE && src->rate != dst->rate) + return -EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR && + src->packet_life_time_selector != dst->packet_life_time_selector) + return -EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME && + src->packet_life_time != dst->packet_life_time) + return -EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_SL && src->sl != dst->sl) + return -EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_FLOW_LABEL && + src->flow_label != dst->flow_label) + return 
-EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_HOP_LIMIT && + src->hop_limit != dst->hop_limit) + return -EINVAL; + if (comp_mask & IB_SA_MCMEMBER_REC_SCOPE && src->scope != dst->scope) + return -EINVAL; + + /* join_state checked separately, proxy_join ignored */ + + return 0; +} + +static int send_join(struct mcast_group *group, struct mcast_member *member) +{ + struct mcast_port *port = group->port; + int ret; + + ret = ib_sa_mcmember_rec_set(port->dev->device, port->port_num, + &member->multicast.rec, + member->multicast.comp_mask, + retry_timer, retries, GFP_KERNEL, + join_handler, group, &group->query); + if (ret > 0) { + group->query_id = ret; + ret = 0; + } + return ret; +} + +static int send_leave(struct mcast_group *group, u8 leave_state) +{ + struct mcast_port *port = group->port; + struct ib_sa_mcmember_rec rec; + int ret; + + rec = group->rec; + rec.join_state = leave_state; + + ret = ib_sa_mcmember_rec_delete(port->dev->device, port->port_num, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_JOIN_STATE, + retry_timer, retries, GFP_KERNEL, + leave_handler, group, &group->query); + if (ret > 0) { + group->query_id = ret; + ret = 0; + } + return ret; +} + +static void join_group(struct mcast_group *group, struct mcast_member *member, + u8 join_state) +{ + adjust_membership(group, join_state, 1); + group->rec.join_state |= join_state; + member->multicast.rec = group->rec; + member->multicast.rec.join_state = join_state; +} + +static int fail_join(struct mcast_group *group, struct mcast_member *member, + int status) +{ + spin_lock_irq(&group->lock); + list_del_init(&member->list); + spin_unlock_irq(&group->lock); + return member->multicast.callback(status, &member->multicast); +} + +static void mcast_work_handler(void *data) +{ + struct mcast_group *group = data; + struct mcast_member *member; + struct ib_multicast *multicast; + int status, ret; + u8 join_state; + +retest: + spin_lock_irq(&group->lock); + while 
(!list_empty(&group->pending_list)) { + member = list_entry(group->pending_list.next, + struct mcast_member, list); + multicast = &member->multicast; + join_state = multicast->rec.join_state; + atomic_inc(&member->refcount); + + if (join_state == (group->rec.join_state & join_state)) { + status = cmp_rec(&group->rec, &multicast->rec, + multicast->comp_mask); + if (!status) + join_group(group, member, join_state); + + list_del_init(&member->list); + spin_unlock_irq(&group->lock); + ret = multicast->callback(status, multicast); + } else { + spin_unlock_irq(&group->lock); + status = send_join(group, member); + if (!status) { + deref_member(member); + return; + } + ret = fail_join(group, member, status); + } + + deref_member(member); + if (ret) + ib_free_multicast(&member->multicast); + spin_lock_irq(&group->lock); + } + + join_state = get_leave_state(group); + if (join_state) { + group->rec.join_state &= ~join_state; + spin_unlock_irq(&group->lock); + if (send_leave(group, join_state)) + goto retest; + } else { + group->state = MCAST_IDLE; + spin_unlock_irq(&group->lock); + release_group(group); + } +} + +/* + * Fail a join request if it is still active - at the head of the pending queue. 
+ */ +static void process_join_error(struct mcast_group *group, int status) +{ + struct mcast_member *member; + int ret; + + spin_lock_irq(&group->lock); + member = list_entry(group->pending_list.next, + struct mcast_member, list); + if (group->last_join == member) { + atomic_inc(&member->refcount); + list_del_init(&member->list); + spin_unlock_irq(&group->lock); + ret = member->multicast.callback(status, &member->multicast); + deref_member(member); + if (ret) + ib_free_multicast(&member->multicast); + } else + spin_unlock_irq(&group->lock); +} + +static void join_handler(int status, struct ib_sa_mcmember_rec *rec, + void *context) +{ + struct mcast_group *group = context; + + if (status) + process_join_error(group, status); + else { + spin_lock_irq(&group->lock); + group->rec = *rec; + spin_unlock_irq(&group->lock); + } + mcast_work_handler(group); +} + +static void leave_handler(int status, struct ib_sa_mcmember_rec *rec, + void *context) +{ + mcast_work_handler(context); +} + +static struct mcast_group *acquire_group(struct mcast_port *port, + union ib_gid *mgid, gfp_t gfp_mask) +{ + struct mcast_group *group, *cur_group; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + group = mcast_find(port, mgid); + if (group) + goto found; + spin_unlock_irqrestore(&port->lock, flags); + + group = kzalloc(sizeof *group, gfp_mask); + if (!group) + return NULL; + + group->port = port; + group->rec.mgid = *mgid; + INIT_LIST_HEAD(&group->pending_list); + INIT_WORK(&group->work, mcast_work_handler, group); + spin_lock_init(&group->lock); + + spin_lock_irqsave(&port->lock, flags); + cur_group = mcast_insert(port, group); + if (cur_group) { + kfree(group); + group = cur_group; + } else + atomic_inc(&port->refcount); +found: + atomic_inc(&group->refcount); + spin_unlock_irqrestore(&port->lock, flags); + return group; +} + +/* + * We serialize all join requests to a single group to make our lives much + * easier. 
Otherwise, two users could try to join the same group + * simultaneously, with different configurations, one could leave while the + * join is in progress, etc., which makes locking around error recovery + * difficult. + */ +struct ib_multicast *ib_join_multicast(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, gfp_t gfp_mask, + int (*callback)(int status, + struct ib_multicast + *multicast), + void *context) +{ + struct mcast_device *dev; + struct mcast_member *member; + struct ib_multicast *multicast; + int ret; + + dev = ib_get_client_data(device, &mcast_client); + if (!dev) + return ERR_PTR(-ENODEV); + + member = kzalloc(sizeof *member, gfp_mask); + if (!member) + return ERR_PTR(-ENOMEM); + + member->multicast.rec = *rec; + member->multicast.comp_mask = comp_mask; + member->multicast.callback = callback; + member->multicast.context = context; + init_waitqueue_head(&member->wait); + atomic_set(&member->refcount, 1); + member->state = MCAST_JOINING; + + member->group = acquire_group(&dev->port[port_num - 1], + &rec->mgid, gfp_mask); + if (!member->group) { + ret = -ENOMEM; + goto err; + } + + /* + * The user will get the multicast structure in their callback. They + * could then free the multicast structure before we can return from + * this routine. So we save the pointer to return before queuing + * any callback. 
+ */ + multicast = &member->multicast; + queue_join(member); + return multicast; + +err: + kfree(member); + return ERR_PTR(ret); +} +EXPORT_SYMBOL(ib_join_multicast); + +void ib_free_multicast(struct ib_multicast *multicast) +{ + struct mcast_member *member; + struct mcast_group *group; + + member = container_of(multicast, struct mcast_member, multicast); + group = member->group; + + spin_lock_irq(&group->lock); + switch (member->state) { + case MCAST_MEMBER: + adjust_membership(group, multicast->rec.join_state, -1); + break; + case MCAST_JOINING: + list_del_init(&member->list); + break; + default: + break; + } + + if (group->state == MCAST_IDLE) { + group->state = MCAST_BUSY; + spin_unlock_irq(&group->lock); + /* Continue to hold reference on group until callback */ + queue_work(mcast_wq, &group->work); + } else { + spin_unlock_irq(&group->lock); + release_group(group); + } + + atomic_dec(&member->refcount); + wait_event(member->wait, !atomic_read(&member->refcount)); + kfree(member); +} +EXPORT_SYMBOL(ib_free_multicast); + +static void mcast_add_one(struct ib_device *device) +{ + struct mcast_device *dev; + struct mcast_port *port; + int i; + + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, + GFP_KERNEL); + if (!dev) + return; + + for (i = 1; i <= device->phys_port_cnt; i++) { + port = &dev->port[i - 1]; + port->dev = dev; + port->port_num = i; + spin_lock_init(&port->lock); + port->table = RB_ROOT; + init_waitqueue_head(&port->wait); + atomic_set(&port->refcount, 1); + } + + dev->device = device; + ib_set_client_data(device, &mcast_client, dev); +} + +/* + * Mark any existing groups as no longer having any members. This will force + * cleanup of the groups when all outstanding leave requests complete. 
+ */ +static void leave_groups(struct mcast_port *port) +{ + struct mcast_group *group; + struct rb_node *node; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + for (node = rb_first(&port->table); node; node = rb_next(node)) { + group = rb_entry(node, struct mcast_group, node); + group->rec.join_state = 0; + ib_sa_cancel_query(group->query_id, group->query); + } + spin_unlock_irqrestore(&port->lock, flags); +} + +static void mcast_remove_one(struct ib_device *device) +{ + struct mcast_device *dev; + struct mcast_port *port; + int i; + + dev = ib_get_client_data(device, &mcast_client); + if (!dev) + return; + + flush_workqueue(mcast_wq); + + for (i = 0; i < device->phys_port_cnt; i++) { + port = &dev->port[i]; + leave_groups(port); + atomic_dec(&port->refcount); + wait_event(port->wait, !atomic_read(&port->refcount)); + } + + kfree(dev); +} + +static int __init mcast_init(void) +{ + int ret; + + mcast_wq = create_singlethread_workqueue("ib_mcast_wq"); + if (!mcast_wq) + return -ENOMEM; + + ret = ib_register_client(&mcast_client); + if (ret) + goto err; + return 0; + +err: + destroy_workqueue(mcast_wq); + return ret; +} + +static void __exit mcast_cleanup(void) +{ + ib_unregister_client(&mcast_client); + destroy_workqueue(mcast_wq); +} + +module_init(mcast_init); +module_exit(mcast_cleanup); Index: core/Makefile =================================================================== --- core/Makefile (revision 6230) +++ core/Makefile (working copy) @@ -5,7 +5,7 @@ user_access-$(CONFIG_INFINIBAND_ADDR_TRA obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o \ ib_sa.o ib_at.o $(infiniband-y) \ - findex.o + findex.o ib_multicast.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o ib_uat.o $(user_access-y) @@ -30,6 +30,8 @@ ib_sa-y := sa_query.o ib_local_sa-y := local_sa.o +ib_multicast-y := multicast.o + ib_umad-y := user_mad.o ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ 
From bos at pathscale.com Thu Apr 6 11:14:34 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 06 Apr 2006 11:14:34 -0700 Subject: [openib-general] ipath module compilation failure on SLES10 In-Reply-To: References: <4433F5DC.2080503@mellanox.co.il> <1144256175.3984.5.camel@chalcedony.internal.keyresearch.com> <1144258679.3984.14.camel@chalcedony.internal.keyresearch.com> <4435466C.4020508@mellanox.co.il> <1144342432.20433.39.camel@chalcedony.internal.keyresearch.com> Message-ID: <1144347274.25128.2.camel@chalcedony.internal.keyresearch.com> On Thu, 2006-04-06 at 11:11 -0700, Roland Dreier wrote: > I don't think doing this: Thanks for spotting that. Fixed. (Michael S. Tsirkin's message of "Thu, 6 Apr 2006 16:17:55 +0300") References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> Message-ID: Michael> Actually, it turned out to be the simplest solution - and Michael> quite elegant since there's no room for mistakes: if Michael> query is going to be running this means module is still Michael> loaded so we can take a reference to it without races. Yes, this is surprisingly clean. Michael> As a bonus, an assertion inside __module_get increases Michael> the chance to catch races where user forgets to cancel Michael> the query - much nicer than crashing randomly. Actually I think __module_get() will do the wrong thing if called during module unloading -- it doesn't test module_is_live(). In other words, calling __module_get() without already holding a ref has a race: __try_stop_module() can see the ref count as 0, then __module_get() can increment it, and then __try_stop_module() sets the module state to GOING and returns. So the right thing to do is BUG_ON(!try_module_get(owner)) Also, I don't think that a consumer of ib_sa() would ever pass an owner other than THIS_MODULE.
So how about if we keep the API the same and just do the THIS_MODULE stuff in an inline wrapper? Like the following... it ends up being a pretty big diff, but just because I moved some comments around and so on. Also I put the try_module_get() stuff out of line into call_sa_callback(), because the compiled code ends up smaller that way. Does anyone disagree with this patch? Michael, are you happy with this tweaked version of yours? diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 501cc05..c43ed75 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -74,6 +74,7 @@ struct ib_sa_device { struct ib_sa_query { void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); void (*release)(struct ib_sa_query *); + struct module *owner; struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; @@ -547,15 +548,16 @@ static void ib_sa_path_rec_release(struc * error code. Otherwise it is a query ID that can be used to cancel * the query. */ -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, - struct ib_sa_path_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_path_rec *resp, - void *context), - void *context, - struct ib_sa_query **sa_query) +int __ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct module *owner, + struct ib_sa_query **sa_query) { struct ib_sa_path_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); @@ -590,6 +592,7 @@ int ib_sa_path_rec_get(struct ib_device query->sa_query.callback = callback ? 
ib_sa_path_rec_callback : NULL; query->sa_query.release = ib_sa_path_rec_release; + query->sa_query.owner = owner; query->sa_query.port = port; mad->mad_hdr.method = IB_MGMT_METHOD_GET; mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); @@ -613,7 +616,7 @@ err1: kfree(query); return ret; } -EXPORT_SYMBOL(ib_sa_path_rec_get); +EXPORT_SYMBOL(__ib_sa_path_rec_get); static void ib_sa_service_rec_callback(struct ib_sa_query *sa_query, int status, @@ -663,15 +666,16 @@ static void ib_sa_service_rec_release(st * error code. Otherwise it is a request ID that can be used to cancel * the query. */ -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, - struct ib_sa_service_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_service_rec *resp, - void *context), - void *context, - struct ib_sa_query **sa_query) +int __ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, + struct ib_sa_service_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_service_rec *resp, + void *context), + void *context, + struct module *owner, + struct ib_sa_query **sa_query) { struct ib_sa_service_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); @@ -711,6 +715,7 @@ int ib_sa_service_rec_query(struct ib_de query->sa_query.callback = callback ? 
ib_sa_service_rec_callback : NULL; query->sa_query.release = ib_sa_service_rec_release; + query->sa_query.owner = owner; query->sa_query.port = port; mad->mad_hdr.method = method; mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_SERVICE_REC); @@ -735,7 +740,7 @@ err1: kfree(query); return ret; } -EXPORT_SYMBOL(ib_sa_service_rec_query); +EXPORT_SYMBOL(__ib_sa_service_rec_query); static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, int status, @@ -759,16 +764,17 @@ static void ib_sa_mcmember_rec_release(s kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, - u8 method, - struct ib_sa_mcmember_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_mcmember_rec *resp, - void *context), - void *context, - struct ib_sa_query **sa_query) +int __ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct module *owner, + struct ib_sa_query **sa_query) { struct ib_sa_mcmember_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); @@ -803,6 +809,7 @@ int ib_sa_mcmember_rec_query(struct ib_d query->sa_query.callback = callback ? 
ib_sa_mcmember_rec_callback : NULL; query->sa_query.release = ib_sa_mcmember_rec_release; + query->sa_query.owner = owner; query->sa_query.port = port; mad->mad_hdr.method = method; mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); @@ -827,7 +834,15 @@ err1: kfree(query); return ret; } -EXPORT_SYMBOL(ib_sa_mcmember_rec_query); +EXPORT_SYMBOL(__ib_sa_mcmember_rec_query); + +static void call_sa_callback(struct ib_sa_query *query, int status, + struct ib_sa_mad *mad) +{ + BUG_ON(!try_module_get(query->owner)); + query->callback(query, status, mad); + module_put(query->owner); +} static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *mad_send_wc) @@ -841,13 +856,13 @@ static void send_handler(struct ib_mad_a /* No callback -- already got recv */ break; case IB_WC_RESP_TIMEOUT_ERR: - query->callback(query, -ETIMEDOUT, NULL); + call_sa_callback(query, -ETIMEDOUT, NULL); break; case IB_WC_WR_FLUSH_ERR: - query->callback(query, -EINTR, NULL); + call_sa_callback(query, -EINTR, NULL); break; default: - query->callback(query, -EIO, NULL); + call_sa_callback(query, -EIO, NULL); break; } @@ -871,12 +886,12 @@ static void recv_handler(struct ib_mad_a if (query->callback) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) - query->callback(query, - mad_recv_wc->recv_buf.mad->mad_hdr.status ? - -EINVAL : 0, - (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); + call_sa_callback(query, + mad_recv_wc->recv_buf.mad->mad_hdr.status ? 
+ -EINVAL : 0, + (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); else - query->callback(query, -EIO, NULL); + call_sa_callback(query, -EIO, NULL); } ib_free_recv_mad(mad_recv_wc); diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index ad63c21..6769d1b 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -254,37 +254,80 @@ struct ib_sa_query; void ib_sa_cancel_query(int id, struct ib_sa_query *query); -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, - struct ib_sa_path_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_path_rec *resp, - void *context), - void *context, - struct ib_sa_query **query); - -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, - u8 method, - struct ib_sa_mcmember_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_mcmember_rec *resp, - void *context), - void *context, - struct ib_sa_query **query); - -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, - u8 method, - struct ib_sa_service_rec *rec, +int __ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, void (*callback)(int status, - struct ib_sa_service_rec *resp, + struct ib_sa_path_rec *resp, void *context), void *context, - struct ib_sa_query **sa_query); + struct module *owner, + struct ib_sa_query **query); + +int __ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct module *owner, + struct ib_sa_query **query); + +int __ib_sa_service_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_service_rec *rec, + ib_sa_comp_mask comp_mask, + int 
timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_service_rec *resp, + void *context), + void *context, + struct module *owner, + struct ib_sa_query **sa_query); + +/** + * ib_sa_path_rec_get - Start a Path get query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:Path Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send a Path Record Get query to the SA to look up a path. The + * callback function will be called when the query completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT if the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_path_rec_get() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query. 
+ */ +static inline int +ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + return __ib_sa_path_rec_get(device, port_num, rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, THIS_MODULE, sa_query); +} /** * ib_sa_mcmember_rec_set - Start an MCMember set query @@ -321,11 +364,11 @@ ib_sa_mcmember_rec_set(struct ib_device void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, - IB_MGMT_METHOD_SET, - rec, comp_mask, - timeout_ms, gfp_mask, callback, - context, query); + return __ib_sa_mcmember_rec_query(device, port_num, + IB_MGMT_METHOD_SET, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, THIS_MODULE, query); } /** @@ -363,12 +406,54 @@ ib_sa_mcmember_rec_delete(struct ib_devi void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, - IB_SA_METHOD_DELETE, - rec, comp_mask, - timeout_ms, gfp_mask, callback, - context, query); + return __ib_sa_mcmember_rec_query(device, port_num, + IB_SA_METHOD_DELETE, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, THIS_MODULE, query); } +/** + * ib_sa_service_rec_query - Start Service Record operation + * @device:device to send request on + * @port_num: port number to send request on + * @method:SA method - should be get, set, or delete + * @rec:Service Record to send in request + * @comp_mask:component mask to send in request + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when request completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:request context, used to cancel request + * + * Send a Service Record set/get/delete to the SA to register, + * 
unregister or query a service record. + * The callback function will be called when the request completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT if the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_service_rec_query() is negative, it is an + * error code. Otherwise it is a request ID that can be used to cancel + * the query. + */ +static inline int +ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, + struct ib_sa_service_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_service_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + return __ib_sa_service_rec_query(device, port_num, method, rec, + comp_mask, timeout_ms, gfp_mask, + callback, context, THIS_MODULE, + sa_query); +} #endif /* IB_SA_H */ From rdreier at cisco.com Thu Apr 6 11:22:10 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Apr 2006 11:22:10 -0700 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406152649.GC13416@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 6 Apr 2006 18:26:49 +0300") References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060406152649.GC13416@mellanox.co.il> Message-ID: Same comments as the sa_query patch apply here: - __module_get() has a race, use try_module_get() instead - let's hide the THIS_MODULE inside an inline wrapper so consumers of the API don't change and don't have to think about this at all. Other than that it seems good. - R.
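[Editor's sketch] The race described in this thread, an unconditional get racing with unload versus a conditional try-get, can be illustrated in userspace C. This is only a rough model under stated assumptions: struct fake_module, fake_try_module_get(), fake_module_put(), fake_try_stop_module() and fake_call_callback() are hypothetical names invented for illustration, and the single-counter encoding is not how the kernel implements try_module_get(). The wrapper mirrors the shape of call_sa_callback() from the patch, except that it returns false where the patch uses BUG_ON().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sentinel: the "module" has started unloading. */
#define FAKE_MODULE_GOING (-1)

/* Hypothetical stand-in for struct module.
 * refs >= 0 means live with that many references; -1 means going away. */
struct fake_module {
    atomic_int refs;
};

/* Take a reference only if the module is still live. The check and the
 * increment happen in one compare-and-swap, so unload cannot slip in
 * between them (the window an unconditional increment would leave open). */
static bool fake_try_module_get(struct fake_module *m)
{
    int v = atomic_load(&m->refs);

    while (v != FAKE_MODULE_GOING) {
        /* On failure the CAS reloads v, and the loop re-checks it. */
        if (atomic_compare_exchange_weak(&m->refs, &v, v + 1))
            return true;
    }
    return false;
}

static void fake_module_put(struct fake_module *m)
{
    atomic_fetch_sub(&m->refs, 1);
}

/* Unload succeeds only if nobody holds a reference. */
static bool fake_try_stop_module(struct fake_module *m)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&m->refs, &expected,
                                          FAKE_MODULE_GOING);
}

/* Hypothetical query object carrying its owner, as in the patch. */
struct fake_query {
    struct fake_module *owner;
    void (*callback)(struct fake_query *q, int status);
    int last_status;
};

/* Pin the owner for the duration of the callback. The patch asserts
 * with BUG_ON() instead of returning false. */
static bool fake_call_callback(struct fake_query *q, int status)
{
    if (!fake_try_module_get(q->owner))
        return false;
    q->callback(q, status);
    fake_module_put(q->owner);
    return true;
}

/* Example callback: just remember the status it was given. */
static void record_status(struct fake_query *q, int status)
{
    q->last_status = status;
}
```

With this model, a callback delivered while the owner is live runs and leaves the count unchanged, unload is refused while any reference is held, and once unload has succeeded every later try-get (and therefore every later callback) is rejected rather than running code that is going away.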
From sean.hefty at intel.com Thu Apr 6 11:44:20 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Apr 2006 11:44:20 -0700 Subject: [openib-general] [RFC] [PATCH 2/2] ipoib: convert to use new multicast interface In-Reply-To: Message-ID: Convert IPoIB to use the new ib_multicast interfaces in place of direct SA query calls. Signed-off-by: Sean Hefty --- Testing was limited to bringing up ipoib and verifying that it worked. If I can get some guidance on some additional testing, I can do that. Code review would also be beneficial, as this patch removes a multicast fix that was just added, which I no longer believe is required. Index: ulp/ipoib/ipoib_multicast.c =================================================================== --- ulp/ipoib/ipoib_multicast.c (revision 6307) +++ ulp/ipoib/ipoib_multicast.c (working copy) @@ -45,6 +45,8 @@ #include +#include + #include "ipoib.h" #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG @@ -60,14 +62,11 @@ static DEFINE_MUTEX(mcast_mutex); /* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ struct ipoib_mcast { struct ib_sa_mcmember_rec mcmember; + struct ib_multicast *mc; struct ipoib_ah *ah; struct rb_node rb_node; struct list_head list; - struct completion done; - - int query_id; - struct ib_sa_query *query; unsigned long created; unsigned long backoff; @@ -299,18 +298,18 @@ static int ipoib_mcast_join_finish(struc return 0; } -static void +static int ipoib_mcast_sendonly_join_complete(int status, - struct ib_sa_mcmember_rec *mcmember, - void *mcast_ptr) + struct ib_multicast *multicast) { - struct ipoib_mcast *mcast = mcast_ptr; + struct ipoib_mcast *mcast = multicast->context; struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); if (!status) - ipoib_mcast_join_finish(mcast, mcmember); - else { + status = ipoib_mcast_join_finish(mcast, &multicast->rec); + + if (status) { if (mcast->logcount++ < 20) ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " 
IPOIB_GID_FMT ", status %d\n", @@ -325,10 +324,10 @@ ipoib_mcast_sendonly_join_complete(int s spin_unlock_irq(&priv->tx_lock); /* Clear the busy flag so we try again */ - clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, + &mcast->flags); } - - complete(&mcast->done); + return status; } static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) @@ -358,35 +357,31 @@ static int ipoib_mcast_sendonly_join(str rec.port_gid = priv->local_gid; rec.pkey = cpu_to_be16(priv->pkey); - init_completion(&mcast->done); - - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, - IB_SA_MCMEMBER_REC_MGID | - IB_SA_MCMEMBER_REC_PORT_GID | - IB_SA_MCMEMBER_REC_PKEY | - IB_SA_MCMEMBER_REC_JOIN_STATE, - 1000, GFP_ATOMIC, - ipoib_mcast_sendonly_join_complete, - mcast, &mcast->query); - if (ret < 0) { - ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", + mcast->mc = ib_join_multicast(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + GFP_ATOMIC, + ipoib_mcast_sendonly_join_complete, + mcast); + if (IS_ERR(mcast->mc)) { + ret = PTR_ERR(mcast->mc); + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + ipoib_warn(priv, "ib_join_multicast failed (ret = %d)\n", ret); } else { ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT ", starting join\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); - - mcast->query_id = ret; } - return ret; } -static void ipoib_mcast_join_complete(int status, - struct ib_sa_mcmember_rec *mcmember, - void *mcast_ptr) +static int ipoib_mcast_join_complete(int status, + struct ib_multicast *multicast) { - struct ipoib_mcast *mcast = mcast_ptr; + struct ipoib_mcast *mcast = multicast->context; struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -394,23 +389,20 @@ static void ipoib_mcast_join_complete(in " (status %d)\n", IPOIB_GID_ARG(mcast->mcmember.mgid), 
status); - if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { + if (!status) + status = ipoib_mcast_join_finish(mcast, &multicast->rec); + + if (!status) { mcast->backoff = 1; mutex_lock(&mcast_mutex); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_work(ipoib_workqueue, &priv->mcast_task); mutex_unlock(&mcast_mutex); - complete(&mcast->done); - return; - } - - if (status == -EINTR) { - complete(&mcast->done); - return; + return 0; } - if (status && mcast->logcount++ < 20) { - if (status == -ETIMEDOUT || status == -EINTR) { + if (mcast->logcount++ < 20) { + if (status == -ETIMEDOUT) { ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT ", status %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), @@ -427,23 +419,18 @@ static void ipoib_mcast_join_complete(in if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; - mutex_lock(&mcast_mutex); + /* Clear the busy flag so we try again */ + status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + mutex_lock(&mcast_mutex); spin_lock_irq(&priv->lock); - mcast->query = NULL; - - if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { - if (status == -ETIMEDOUT) - queue_work(ipoib_workqueue, &priv->mcast_task); - else - queue_delayed_work(ipoib_workqueue, &priv->mcast_task, - mcast->backoff * HZ); - } else - complete(&mcast->done); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + mcast->backoff * HZ); spin_unlock_irq(&priv->lock); mutex_unlock(&mcast_mutex); - return; + return status; } static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, @@ -463,18 +450,16 @@ static void ipoib_mcast_join(struct net_ rec.port_gid = priv->local_gid; rec.pkey = cpu_to_be16(priv->pkey); - comp_mask = - IB_SA_MCMEMBER_REC_MGID | - IB_SA_MCMEMBER_REC_PORT_GID | - IB_SA_MCMEMBER_REC_PKEY | - IB_SA_MCMEMBER_REC_JOIN_STATE; + comp_mask = IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + 
IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE; if (create) { - comp_mask |= - IB_SA_MCMEMBER_REC_QKEY | - IB_SA_MCMEMBER_REC_SL | - IB_SA_MCMEMBER_REC_FLOW_LABEL | - IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; + comp_mask |= IB_SA_MCMEMBER_REC_QKEY | + IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; rec.qkey = priv->broadcast->mcmember.qkey; rec.sl = priv->broadcast->mcmember.sl; @@ -482,15 +467,14 @@ static void ipoib_mcast_join(struct net_ rec.traffic_class = priv->broadcast->mcmember.traffic_class; } - init_completion(&mcast->done); - - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, - mcast->backoff * 1000, GFP_ATOMIC, - ipoib_mcast_join_complete, - mcast, &mcast->query); - - if (ret < 0) { - ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); + set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + mcast->mc = ib_join_multicast(priv->ca, priv->port, &rec, comp_mask, + GFP_KERNEL, ipoib_mcast_join_complete, + mcast); + if (IS_ERR(mcast->mc)) { + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + ret = PTR_ERR(mcast->mc); + ipoib_warn(priv, "ib_join_multicast failed, status %d\n", ret); mcast->backoff *= 2; if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) @@ -502,14 +486,14 @@ static void ipoib_mcast_join(struct net_ &priv->mcast_task, mcast->backoff * HZ); mutex_unlock(&mcast_mutex); - } else - mcast->query_id = ret; + } } void ipoib_mcast_join_task(void *dev_ptr) { struct net_device *dev = dev_ptr; struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) return; @@ -553,36 +537,27 @@ void ipoib_mcast_join_task(void *dev_ptr spin_unlock_irq(&priv->lock); } - if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) { ipoib_mcast_join(dev, priv->broadcast, 0); return; } - while 
(1) { - struct ipoib_mcast *mcast = NULL; - - spin_lock_irq(&priv->lock); - list_for_each_entry(mcast, &priv->multicast_list, list) { - if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) - && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) - && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { - /* Found the next unjoined group */ - break; - } - } - spin_unlock_irq(&priv->lock); - - if (&mcast->list == &priv->multicast_list) { - /* All done */ - break; + spin_lock_irq(&priv->lock); + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + /* Found the next unjoined group */ + spin_unlock_irq(&priv->lock); + ipoib_mcast_join(dev, mcast, 1); + return; } - - ipoib_mcast_join(dev, mcast, 1); - return; } + spin_unlock_irq(&priv->lock); priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - - IPOIB_ENCAP_LEN; + IPOIB_ENCAP_LEN; dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); @@ -609,26 +584,9 @@ int ipoib_mcast_start_thread(struct net_ return 0; } -static void wait_for_mcast_join(struct ipoib_dev_priv *priv, - struct ipoib_mcast *mcast) -{ - spin_lock_irq(&priv->lock); - if (mcast && mcast->query) { - ib_sa_cancel_query(mcast->query_id, mcast->query); - mcast->query = NULL; - spin_unlock_irq(&priv->lock); - ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(mcast->mcmember.mgid)); - wait_for_completion(&mcast->done); - } - else - spin_unlock_irq(&priv->lock); -} - -int ipoib_mcast_stop_thread(struct net_device *dev, int flush) +void ipoib_mcast_stop_thread(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_mcast *mcast; ipoib_dbg_mcast(priv, "stopping multicast thread\n"); @@ -643,55 +601,26 @@ int ipoib_mcast_stop_thread(struct 
net_d if (flush) flush_workqueue(ipoib_workqueue); - - wait_for_mcast_join(priv, priv->broadcast); - - list_for_each_entry(mcast, &priv->multicast_list, list) - wait_for_mcast_join(priv, mcast); - - return 0; } -static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) +static void ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_sa_mcmember_rec rec = { - .join_state = 1 - }; - int ret = 0; - - if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) - return 0; - - ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(mcast->mcmember.mgid)); - - rec.mgid = mcast->mcmember.mgid; - rec.port_gid = priv->local_gid; - rec.pkey = cpu_to_be16(priv->pkey); + int ret; - /* Remove ourselves from the multicast group */ - ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), - &mcast->mcmember.mgid); - if (ret) - ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); - /* - * Just make one shot at leaving and don't wait for a reply; - * if we fail, too bad. 
- */ - ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, - IB_SA_MCMEMBER_REC_MGID | - IB_SA_MCMEMBER_REC_PORT_GID | - IB_SA_MCMEMBER_REC_PKEY | - IB_SA_MCMEMBER_REC_JOIN_STATE, - 0, GFP_ATOMIC, NULL, - mcast, &mcast->query); - if (ret < 0) - ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " - "for leave (result = %d)\n", ret); + /* Remove ourselves from the multicast group */ + ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret) + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + } - return 0; + if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) + ib_free_multicast(mcast->mc); } void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, @@ -726,7 +655,7 @@ void ipoib_mcast_send(struct net_device "multicast structure\n"); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); - goto out; + goto unlock; } set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags); @@ -743,7 +672,7 @@ void ipoib_mcast_send(struct net_device dev_kfree_skb_any(skb); } - if (mcast->query) + if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) ipoib_dbg_mcast(priv, "no address vector, " "but multicast join already started\n"); else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) @@ -753,25 +682,20 @@ void ipoib_mcast_send(struct net_device * If lookup completes between here and out:, don't * want to send packet twice. 
*/ - mcast = NULL; + goto unlock; } -out: - if (mcast && mcast->ah) { - if (skb->dst && - skb->dst->neighbour && - !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); - - if (neigh) { - kref_get(&mcast->ah->ref); - neigh->ah = mcast->ah; - list_add_tail(&neigh->list, &mcast->neigh_list); - } - } + if (skb->dst && skb->dst->neighbour && + !*to_ipoib_neigh(skb->dst->neighbour)) { + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); - ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + if (neigh) { + kref_get(&mcast->ah->ref); + neigh->ah = mcast->ah; + list_add_tail(&neigh->list, &mcast->neigh_list); + } } + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); unlock: spin_unlock(&priv->lock); @@ -900,7 +824,6 @@ void ipoib_mcast_restart_task(void *dev_ /* We have to cancel outside of the spinlock */ list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { - wait_for_mcast_join(priv, mcast); ipoib_mcast_leave(mcast->dev, mcast); ipoib_mcast_free(mcast); } Index: ulp/ipoib/ipoib.h =================================================================== --- ulp/ipoib/ipoib.h (revision 6307) +++ ulp/ipoib/ipoib.h (working copy) @@ -283,7 +283,7 @@ void ipoib_mcast_send(struct net_device void ipoib_mcast_restart_task(void *dev_ptr); int ipoib_mcast_start_thread(struct net_device *dev); -int ipoib_mcast_stop_thread(struct net_device *dev, int flush); +void ipoib_mcast_stop_thread(struct net_device *dev, int flush); void ipoib_mcast_dev_down(struct net_device *dev); void ipoib_mcast_dev_flush(struct net_device *dev); From rdreier at cisco.com Thu Apr 6 12:48:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Apr 2006 12:48:56 -0700 Subject: [openib-general] Re: [PATCH] ipoib_cleanup_module In-Reply-To: <200604061549.36061.eli@mellanox.co.il> (Eli Cohen's message of "Thu, 6 Apr 2006 15:49:35 +0300") References: <200604061549.36061.eli@mellanox.co.il> Message-ID: Eli> ensure reverse 
order of creation or else might cause errors Eli> if debugfs is used. Not sure I follow this. What error could occur? - R. From rdreier at cisco.com Thu Apr 6 12:51:23 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Apr 2006 12:51:23 -0700 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: (Sean Hefty's message of "Thu, 6 Apr 2006 11:13:47 -0700") References: Message-ID: > + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, > + GFP_KERNEL); > + if (!dev) > + return; > + > + for (i = 1; i <= device->phys_port_cnt; i++) { Seems like this is implicitly assuming that the IB device is a CA. Maybe we should give up the ghost and stop trying to support IB switches? - R. From rdreier at cisco.com Thu Apr 6 13:03:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Apr 2006 13:03:06 -0700 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: (Sean Hefty's message of "Thu, 6 Apr 2006 11:13:47 -0700") References: Message-ID: > +static void queue_join(struct mcast_member *member) > +{ > + struct mcast_group *group = member->group; > + unsigned long flags; > + > + spin_lock_irqsave(&group->lock, flags); > + list_add(&member->list, &group->pending_list); > + if (group->state == MCAST_IDLE) { > + group->state = MCAST_BUSY; > + spin_unlock_irqrestore(&group->lock, flags); > + atomic_inc(&group->refcount); > + queue_work(mcast_wq, &group->work); > + } else > + spin_unlock_irqrestore(&group->lock, flags); > +} should the atomic_inc() be outside the lock here? It seems that leaves a window for things to go bad. 
It might be simpler to just do something like:

	spin_lock_irqsave(&group->lock, flags);
	list_add(&member->list, &group->pending_list);
	if (group->state == MCAST_IDLE) {
		group->state = MCAST_BUSY;
		atomic_inc(&group->refcount);
		queue_work(mcast_wq, &group->work);
	}
	spin_unlock_irqrestore(&group->lock, flags);

> +static void adjust_membership(struct mcast_group *group, u8 join_state, int inc)
> +{
> +	int i;
> +
> +	for (i = 0; i < 3; i++, join_state >>= 1)
> +		if (join_state & 0x1)
> +			group->members[i] += inc;
> +}
> +
> +static u8 get_leave_state(struct mcast_group *group)
> +{
> +	u8 leave_state = 0;
> +	int i;
> +
> +	for (i = 0; i < 3; i++)
> +		if (!group->members[i])
> +			leave_state |= (0x1 << i);
> +
> +	return leave_state & group->rec.join_state;
> +}

These look rather magical -- perhaps a comment to make them understandable?

> +	case MCAST_JOINING:
> +		list_del_init(&member->list);
> +		break;

Why not just list_del()? I don't see anywhere that this list_head is
used after this.

From rdreier at cisco.com Thu Apr 6 13:05:03 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 06 Apr 2006 13:05:03 -0700
Subject: [openib-general] Re: [RFC] [PATCH 2/2] ipoib: convert to use new multicast interface
In-Reply-To: (Sean Hefty's message of "Thu, 6 Apr 2006 11:44:20 -0700")
References:
Message-ID:

There is a fair amount of simple reformatting here like

> -	comp_mask =
> -		IB_SA_MCMEMBER_REC_MGID |
> -		IB_SA_MCMEMBER_REC_PORT_GID |
> -		IB_SA_MCMEMBER_REC_PKEY |
> -		IB_SA_MCMEMBER_REC_JOIN_STATE;
> +	comp_mask = IB_SA_MCMEMBER_REC_MGID |
> +		    IB_SA_MCMEMBER_REC_PORT_GID |
> +		    IB_SA_MCMEMBER_REC_PKEY |
> +		    IB_SA_MCMEMBER_REC_JOIN_STATE;

> 	priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) -
> -		IPOIB_ENCAP_LEN;
> +			  IPOIB_ENCAP_LEN;

and so on. Can you separate that into another patch, so that it's
easier to review the real changes here?
Thanks, Roland From rdreier at cisco.com Thu Apr 6 13:06:05 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Apr 2006 13:06:05 -0700 Subject: [openib-general] Re: [RFC] [PATCH] SA query: expose retries through API In-Reply-To: (Sean Hefty's message of "Thu, 6 Apr 2006 10:59:43 -0700") References: Message-ID: Looks fine but can you redo this on top of the module unload race fix once we agree on that? I expect the race fix to go into 2.6.17 and this API change to go into 2.6.18, so the API change needs to apply on top of the race fix. - R. From sean.hefty at intel.com Thu Apr 6 13:07:52 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Apr 2006 13:07:52 -0700 Subject: [openib-general] RE: [PATCH] ipoib_flush_paths In-Reply-To: Message-ID: Here's a similar patch for ib_addr. Signed-off-by: Sean Hefty --- Index: core/addr.c =================================================================== --- core/addr.c (revision 6307) +++ core/addr.c (working copy) @@ -73,6 +73,7 @@ struct addr_req { struct sockaddr src_addr; struct sockaddr dst_addr; struct rdma_dev_addr *addr; + struct module *owner; void *context; void (*callback)(int status, struct sockaddr *src_addr, struct rdma_dev_addr *addr, void *context); @@ -252,8 +253,10 @@ static void process_req(void *data) list_for_each_entry_safe(req, temp_req, &done_list, list) { list_del(&req->list); + BUG_ON(!try_module_get(req->owner)); req->callback(req->status, &req->src_addr, req->addr, req->context); + module_put(req->owner); kfree(req); } } @@ -289,11 +292,12 @@ static int addr_resolve_local(struct soc return ret; } -int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, - struct rdma_dev_addr *addr, int timeout_ms, - void (*callback)(int status, struct sockaddr *src_addr, - struct rdma_dev_addr *addr, void *context), - void *context) +int __rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, + struct rdma_dev_addr *addr, int timeout_ms, + void (*callback)(int 
status, struct sockaddr *src_addr, + struct rdma_dev_addr *addr, + void *context), + void *context, struct module *owner) { struct sockaddr_in *src_in, *dst_in; struct addr_req *req; @@ -310,6 +314,7 @@ int rdma_resolve_ip(struct sockaddr *src req->addr = addr; req->callback = callback; req->context = context; + req->owner = owner; src_in = (struct sockaddr_in *) &req->src_addr; dst_in = (struct sockaddr_in *) &req->dst_addr; @@ -335,7 +340,7 @@ int rdma_resolve_ip(struct sockaddr *src } return ret; } -EXPORT_SYMBOL(rdma_resolve_ip); +EXPORT_SYMBOL(__rdma_resolve_ip); void rdma_addr_cancel(struct rdma_dev_addr *addr) { Index: include/rdma/ib_addr.h =================================================================== --- include/rdma/ib_addr.h (revision 6307) +++ include/rdma/ib_addr.h (working copy) @@ -64,11 +64,16 @@ int rdma_translate_ip(struct sockaddr *a * or been canceled. A status of 0 indicates success. * @context: User-specified context associated with the call. */ -int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, - struct rdma_dev_addr *addr, int timeout_ms, - void (*callback)(int status, struct sockaddr *src_addr, - struct rdma_dev_addr *addr, void *context), - void *context); +static inline int +rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, + struct rdma_dev_addr *addr, int timeout_ms, + void (*callback)(int status, struct sockaddr *src_addr, + struct rdma_dev_addr *addr, void *context), + void *context) +{ + return __rdma_resolve_ip(src_addr, dst_addr, addr, timeout_ms, + callback, context, THIS_MODULE); +} void rdma_addr_cancel(struct rdma_dev_addr *addr); From mshefty at ichips.intel.com Thu Apr 6 13:10:01 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Apr 2006 13:10:01 -0700 Subject: [openib-general] Re: [RFC] [PATCH] SA query: expose retries through API In-Reply-To: References: Message-ID: <44357599.9000701@ichips.intel.com> Roland Dreier wrote: > Looks fine but can you redo this on 
top of the module unload race fix > once we agree on that? I expect the race fix to go into 2.6.17 and > this API change to go into 2.6.18, so the API change needs to apply on > top of the race fix. Yes - I'll redo this on top of the API fix. - Sean From mshefty at ichips.intel.com Thu Apr 6 13:11:20 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Apr 2006 13:11:20 -0700 Subject: [openib-general] Re: [RFC] [PATCH 2/2] ipoib: convert to use new multicast interface In-Reply-To: References: Message-ID: <443575E8.2020608@ichips.intel.com> Roland Dreier wrote: > and so on. Can you separate that into another patch, so that it's > easier to review the real changes here? Sure. - Sean From halr at voltaire.com Thu Apr 6 13:07:42 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2006 16:07:42 -0400 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: References: Message-ID: <1144354060.17533.3210.camel@hal.voltaire.com> On Thu, 2006-04-06 at 15:51, Roland Dreier wrote: > > + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, > > + GFP_KERNEL); > > + if (!dev) > > + return; > > + > > + for (i = 1; i <= device->phys_port_cnt; i++) { > > Seems like this is implicitly assuming that the IB device is a CA. > > Maybe we should give up the ghost and stop trying to support IB switches? I would prefer not to. SMI changes for switches have been provided but not integrated as yet. Enhanced switch port 0 would need this (at a minimum). 
-- Hal From mshefty at ichips.intel.com Thu Apr 6 13:16:32 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Apr 2006 13:16:32 -0700 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: References: Message-ID: <44357720.2040604@ichips.intel.com> Roland Dreier wrote: > > + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, > > + GFP_KERNEL); > > + if (!dev) > > + return; > > + > > + for (i = 1; i <= device->phys_port_cnt; i++) { > > Seems like this is implicitly assuming that the IB device is a CA. > > Maybe we should give up the ghost and stop trying to support IB switches? I can go either way here. Would someone want to use this module on a switch? If so, I can fix up the code so that it can work on one, or at least change the checks that the device is a CA. I thought that someone was running at least ib_mad on a switch. I'm just not sure which other modules are likely to run on them. - Sean From halr at voltaire.com Thu Apr 6 13:20:59 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2006 16:20:59 -0400 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: <44357720.2040604@ichips.intel.com> References: <44357720.2040604@ichips.intel.com> Message-ID: <1144354857.17533.3369.camel@hal.voltaire.com> On Thu, 2006-04-06 at 16:16, Sean Hefty wrote: > Roland Dreier wrote: > > > + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, > > > + GFP_KERNEL); > > > + if (!dev) > > > + return; > > > + > > > + for (i = 1; i <= device->phys_port_cnt; i++) { > > > > Seems like this is implicitly assuming that the IB device is a CA. > > > > Maybe we should give up the ghost and stop trying to support IB switches? > > I can go either way here. Would someone want to use this module on a switch? Yes. IPoIB can run on enhanced switch port 0. 
-- Hal > If so, I can fix up the code so that it can work on one, or at least change the > checks that the device is a CA. > > I thought that someone was running at least ib_mad on a switch. I'm just not > sure which other modules are likely to run on them. > > - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Thu Apr 6 13:36:18 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Apr 2006 13:36:18 -0700 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: References: Message-ID: <44357BC2.4000302@ichips.intel.com> Roland Dreier wrote: > > +static void queue_join(struct mcast_member *member) > > +{ > > + struct mcast_group *group = member->group; > > + unsigned long flags; > > + > > + spin_lock_irqsave(&group->lock, flags); > > + list_add(&member->list, &group->pending_list); > > + if (group->state == MCAST_IDLE) { > > + group->state = MCAST_BUSY; > > + spin_unlock_irqrestore(&group->lock, flags); > > + atomic_inc(&group->refcount); > > + queue_work(mcast_wq, &group->work); > > + } else > > + spin_unlock_irqrestore(&group->lock, flags); > > +} > > should the atomic_inc() be outside the lock here? It seems that > leaves a window for things to go bad. It might be simpler to just do > something like: The mcast_member already holds a reference on the group, or we couldn't have acquired the spin_lock. The second reference is taken because we're queuing a work item that will access the group from a callback. That said, your changes look cleaner, so I'll update the code. 
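For readers following the locking question, here is a minimal userspace sketch of the pattern Roland suggests, in which the extra reference for the queued work is taken before the lock is dropped. It is an illustration only, not the patch's code: a pthread mutex stands in for the spinlock, a counter stands in for queue_work(), and the names struct group, group_init() and queue_join_sketch() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the patch's struct mcast_group. */
enum group_state { GROUP_IDLE, GROUP_BUSY };

struct group {
	pthread_mutex_t lock;	/* stands in for the group spinlock */
	enum group_state state;
	atomic_int refcount;	/* the caller already holds one reference */
	int pending;		/* stands in for pending_list */
	int queued_work;	/* counts how often work was scheduled */
};

static void group_init(struct group *g)
{
	pthread_mutex_init(&g->lock, NULL);
	g->state = GROUP_IDLE;
	atomic_init(&g->refcount, 1);
	g->pending = 0;
	g->queued_work = 0;
}

/*
 * Queue a join request. Returns true if this call scheduled the work.
 * The reference owned by the work item is taken while the lock is still
 * held, so there is a single unlock path and no window between marking
 * the group busy and pinning it.
 */
static bool queue_join_sketch(struct group *g)
{
	bool scheduled = false;

	/* As Sean notes, the caller must already hold a reference. */
	assert(atomic_load(&g->refcount) >= 1);

	pthread_mutex_lock(&g->lock);
	g->pending++;
	if (g->state == GROUP_IDLE) {
		g->state = GROUP_BUSY;
		atomic_fetch_add(&g->refcount, 1);	/* owned by the work item */
		g->queued_work++;			/* queue_work() in the kernel */
		scheduled = true;
	}
	pthread_mutex_unlock(&g->lock);
	return scheduled;
}
```

Calling queue_join_sketch() twice on a freshly initialised group schedules the work once and leaves the reference count at two: the caller's reference plus the one owned by the pending work.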
> > +static void adjust_membership(struct mcast_group *group, u8 join_state, int inc) > > +{ > > + int i; > > + > > + for (i = 0; i < 3; i++, join_state >>= 1) > > + if (join_state & 0x1) > > + group->members[i] += inc; > > +} > > + > > +static u8 get_leave_state(struct mcast_group *group) > > +{ > > + u8 leave_state = 0; > > + int i; > > + > > + for (i = 0; i < 3; i++) > > + if (!group->members[i]) > > + leave_state |= (0x1 << i); > > + > > + return leave_state & group->rec.join_state; > > +} > > These look rather magical -- perhaps a comment to make them understandable? I'll add some comments. Basically, a multicast group has 3 types of members. These calls track the number of members of each type. > > + case MCAST_JOINING: > > + list_del_init(&member->list); > > + break; > > Why not just list_del()? I don't see anywhere that this list_head is > used after this. I think you're correct. list_del_init() is needed elsewhere, but I think this can be just list_del(). Thanks for the feedback. - Sean From swise at opengridcomputing.com Thu Apr 6 13:40:24 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 06 Apr 2006 15:40:24 -0500 Subject: [openib-general] amso1100 testing with OpenIB In-Reply-To: References: Message-ID: <1144356024.10701.65.camel@stevo-desktop> On Thu, 2006-04-06 at 14:01 -0400, Karthikeyan Vaidyanathan wrote: > Hi, > > I was trying to follow the discussion on getting the iWARP branch to work > with Ammasso NICs and tried the following steps: > > /include/rdma --> > /gen2/branches/iwarp/src/linux-kernel/infiniband/include/rdma > > /drivers/infiniband --> > /gen2/branches/iwarp/src/linux-kernel/infiniband > > and recompiled the kernel. I'm working on linux 2.6.15.4 kernel. > > I had initially updated the firmware from AMSO1100 Release 1.2 Update 1 > kit. Then I used ccflash2 to update to C2L_H23_B58_F61_080507.bit. 
> However, when I reboot the machine, the dmesg reports: > > c2: AMSO1100 Gigabit Ethernet driver v1.1 loaded > c2: Downlevel Firmware boot loader [1/7: got 0x42, exp 0x43]. Use the > cc_flash utility to update your boot loader > c2: Adapter not claimed > c2: probe of 0000:03:02.0 failed with error -5 > > I also tried updating the firmware boot loader from AMSO1100 Release 1.2 > Update 1 kit but I still get this message when the machine boots up. > > However, the following command shows that I have the latest firmware > loaded: > > $ ccons 0 bitfile > FPGA Bitfile ID: C2L_H23_B58_080507 Release 23 > You need the amso1100 openib kit from Open Grid Computing. It contains new firmware, a new bitfile/boot loader, and scripts to load the firmware correctly before the openib driver loads. The kit is here: http://www.opengridcomputing.com/downloads/ogc_amso_kit_20060308.tgz You also need to uninstall the Ammasso software from the system. Hope this helps. Steve. From suri at baymicrosystems.com Thu Apr 6 13:41:00 2006 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Thu, 6 Apr 2006 16:41:00 -0400 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support formultiple users In-Reply-To: <44357720.2040604@ichips.intel.com> Message-ID: <200604062041.k36Kf5Gs003447@mail.baymicrosystems.com> Sean/Roland: I may be the only one using the openib stack for a switch. I don't need the module in question at this point, but I request the group to consider putting all changes for a switch (as well) while adding new functionality. This way, if and when I get to a point of needing the functionality, I would have something to start with! And I promise to test and send in changes (if required). BTW, as far as the ib_mad/smi stuff goes, I have sent my changes to Hal, since I am running 2.6.12 and it is best if he does the merge (I think he said so already in his email). 
Thanks a lot, Suri > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Sean Hefty > Sent: Thursday, April 06, 2006 4:17 PM > To: Roland Dreier > Cc: openib-general at openib.org > Subject: Re: [openib-general] Re: [RFC] [PATCH 1/2] multicast support > formultiple users > > Roland Dreier wrote: > > > + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof > *port, > > > + GFP_KERNEL); > > > + if (!dev) > > > + return; > > > + > > > + for (i = 1; i <= device->phys_port_cnt; i++) { > > > > Seems like this is implicitly assuming that the IB device is a CA. > > > > Maybe we should give up the ghost and stop trying to support IB > switches? > > I can go either way here. Would someone want to use this module on a > switch? > If so, I can fix up the code so that it can work on one, or at least > change the > checks that the device is a CA. > > I thought that someone was running at least ib_mad on a switch. I'm just > not > sure which other modules are likely to run on them. > > - Sean From mst at mellanox.co.il Thu Apr 6 13:43:00 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Apr 2006 23:43:00 +0300 Subject: [openib-general] Re: librdmacm/ucma In-Reply-To: References: <20060406135950.GQ21115@mellanox.co.il> Message-ID: <20060406204259.GA15005@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: librdmacm/ucma > > >- dependency on libsysfs > > I propose we don't get into this again. I know we have it in ibverbs > > but let's avoid for new code.
> >
> >- negative error codes
> > I think this is kernel practice, in userspace we
> > either set errno and return -1, or simply return
> > positive error code: otherwise utilities like strerror
> > do not work
>
> We can convert all negative error codes to positive.

That's OK.

> >- abi versioning
> > The RDMA_USER_CM_MAX_ABI_VERSION idea is broken - it makes it so that you are
> > required to upgrade userspace to run on older kernels.
> > Bailing out in userspace if the kernel is too new is not good - please
> > remove this test.
>
> There is both a minor and major number. A given library should support anything
> between its known versions. Can you be more specific about the problem?

The library now has min = 1, max = 2; this means that any ABI update in
the kernel will break userspace. People expect to be able to update the
kernel and keep old userspace.

-- MST

From rdreier at cisco.com Thu Apr 6 13:41:59 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 06 Apr 2006 13:41:59 -0700
Subject: [openib-general] Re: [PATCH] ipoib_flush_paths
In-Reply-To: (Sean Hefty's message of "Thu, 6 Apr 2006 13:07:52 -0700")
References:
Message-ID:

    Sean> Here's a similar patch for ib_addr.

Looks good to me.

From rdreier at cisco.com Thu Apr 6 13:42:46 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 06 Apr 2006 13:42:46 -0700
Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users
In-Reply-To: <1144354060.17533.3210.camel@hal.voltaire.com> (Hal Rosenstock's message of "06 Apr 2006 16:07:42 -0400")
References: <1144354060.17533.3210.camel@hal.voltaire.com>
Message-ID:

    Hal> I would prefer not to. SMI changes for switches have been
    Hal> provided but not integrated as yet. Enhanced switch port 0
    Hal> would need this (at a minimum).

OK, then this code needs to do the usual "port 0 if it's a switch,
otherwise ports 1...num ports"

 - R.
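The convention Roland refers to can be sketched as below. This is a standalone illustration under stated assumptions, not code from the patch: enum node_type, struct port_range and the helper names are hypothetical stand-ins for the device's node type and physical port count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the device's node type. */
enum node_type { NODE_CA, NODE_SWITCH };

struct port_range {
	int start;	/* first port number to visit */
	int end;	/* last port number to visit (inclusive) */
};

/*
 * A switch is managed through port 0 only; a CA exposes
 * ports 1..phys_port_cnt.
 */
static struct port_range port_range_for(enum node_type type, int phys_port_cnt)
{
	struct port_range r;

	assert(phys_port_cnt >= 1);
	if (type == NODE_SWITCH) {
		r.start = 0;
		r.end = 0;
	} else {
		r.start = 1;
		r.end = phys_port_cnt;
	}
	return r;
}

/* Number of per-port slots to allocate for this range. */
static size_t port_slot_count(struct port_range r)
{
	return (size_t)(r.end - r.start + 1);
}

/* Array index of a port number, so port[0] is always the first port. */
static size_t port_slot_index(struct port_range r, int port_num)
{
	assert(port_num >= r.start && port_num <= r.end);
	return (size_t)(port_num - r.start);
}
```

With a range like this, the per-port array would be sized by port_slot_count() rather than by the physical port count, and indexed by port number minus the start of the range, so the same loop would cover both a CA and a switch.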
From halr at voltaire.com Thu Apr 6 13:38:14 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Apr 2006 16:38:14 -0400
Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users
In-Reply-To:
References:
Message-ID: <1144355892.17533.3569.camel@hal.voltaire.com>

On Thu, 2006-04-06 at 14:13, Sean Hefty wrote:
> Add kernel support that tracks joining and leaving multicast groups.
> The SA tracks join/leave operations on a per port basis. In order
> to support multiple users of the same multicast group, we need to
> track join / leave requests locally.

On initial read, this looks pretty good. Aside from the switch port 0
comment/issue, I have 2 comments/questions.

1. My main initial comment is that I think that cmp_rec needs to be more
complicated than the matching which is there. The selectors include
things like greater than, less than, and largest available in addition
to equal to, which is what is supported there now. I'm not sure whether
any of this is used right now, so it may not be an issue for IPoIB.

2. The other comment is that I didn't yet follow how multiple joins of
different JoinStates are handled. I can see there are different slots in
the groups but I didn't see whether all the joins go out on the wire
(one per JoinState) or whether there is some "promotion"/"demotion" of
these.

I will also look at this some more because I need at least a second read
as I didn't follow things sufficiently yet.

-- Hal

> Signed-off-by: Sean Hefty
>
> ---
>
> This patch depends on the sa query patch that adds retries to that API.
> I spent considerable time (and a couple of rewrites of the code) trying to
> ensure that all race conditions were handled, and in a way that was as
> simple as possible. Some additional review of the code that looked for
> race conditions would be appreciated.
> > Also note that this code has the bug that Michael pointed out: > http://openib.org/pipermail/openib-general/2006-April/019643.html > I believe that this bug should be fixed, once we agree on a solution, but > I don't think that the bug is likely enough to occur that it should delay > the check-in. > > > Index: include/rdma/ib_multicast.h > =================================================================== > --- include/rdma/ib_multicast.h (revision 0) > +++ include/rdma/ib_multicast.h (revision 0) > @@ -0,0 +1,85 @@ > +/* > + * Copyright (c) 2006 Intel Corporation. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
> + */ > + > +#ifndef IB_MULTICAST_H > +#define IB_MULTICAST_H > + > +#include > + > +struct ib_multicast { > + struct ib_sa_mcmember_rec rec; > + ib_sa_comp_mask comp_mask; > + int (*callback)(int status, > + struct ib_multicast *multicast); > + void *context; > +}; > + > +/** > + * ib_join_multicast - Initiates a join request to the specified multicast > + * group. > + * @device: Device associated with the multicast group. > + * @port_num: Port on the specified device to associate with the multicast > + * group. > + * @rec: SA multicast member record specifying group attributes. > + * @comp_mask: Component mask indicating which group attributes of %rec are > + * valid. > + * @gfp_mask: GFP mask for memory allocations. > + * @callback: User callback invoked once the join operation completes. > + * @context: User specified context stored with the ib_multicast structure. > + * > + * This call initiates a multicast join request with the SA for the specified > + * multicast group. If the join operation is started successfully, it returns > + * an ib_multicast structure that is used to track the multicast operation. > + * Users must free this structure by calling ib_free_multicast, even if the > + * join operation later fails. (The callback status is non-zero.) > + */ > +struct ib_multicast *ib_join_multicast(struct ib_device *device, u8 port_num, > + struct ib_sa_mcmember_rec *rec, > + ib_sa_comp_mask comp_mask, gfp_t gfp_mask, > + int (*callback)(int status, > + struct ib_multicast > + *multicast), > + void *context); > + > +/** > + * ib_free_multicast - Frees the multicast tracking structure, and releases > + * any reference on the multicast group. > + * @multicast: Multicast tracking structure allocated by ib_join_multicast. > + * > + * This call blocks until the connection identifier is destroyed. 
It may > + * not be called from within the multicast callback; however, returning a non- > + * zero value from the callback will result in destroying the multicast > + * tracking structure. > + */ > +void ib_free_multicast(struct ib_multicast *multicast); > + > +#endif /* IB_MULTICAST_H */ > Index: core/multicast.c > =================================================================== > --- core/multicast.c (revision 0) > +++ core/multicast.c (revision 0) > @@ -0,0 +1,659 @@ > +/* > + * Copyright (c) 2006 Intel Corporation. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
> + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > + > +MODULE_AUTHOR("Sean Hefty"); > +MODULE_DESCRIPTION("InfiniBand multicast membership handling"); > +MODULE_LICENSE("Dual BSD/GPL"); > + > +static int retry_timer = 5000; /* 5 sec */ > +module_param(retry_timer, int, 0444); > +MODULE_PARM_DESC(retry_timer, "Time in ms between retried requests."); > + > +static int retries = 3; > +module_param(retries, int, 0444); > +MODULE_PARM_DESC(retries, "Number of times to retry a request."); > + > +static void mcast_add_one(struct ib_device *device); > +static void mcast_remove_one(struct ib_device *device); > + > +static struct ib_client mcast_client = { > + .name = "ib_multicast", > + .add = mcast_add_one, > + .remove = mcast_remove_one > +}; > + > +static struct workqueue_struct *mcast_wq; > + > +struct mcast_device; > + > +struct mcast_port { > + struct mcast_device *dev; > + spinlock_t lock; > + struct rb_root table; > + atomic_t refcount; > + wait_queue_head_t wait; > + u8 port_num; > +}; > + > +struct mcast_device { > + struct ib_device *device; > + struct mcast_port port[0]; > +}; > + > +enum mcast_state { > + MCAST_IDLE, > + MCAST_JOINING, > + MCAST_MEMBER, > + MCAST_BUSY, > +}; > + > +struct mcast_member; > + > +struct mcast_group { > + struct ib_sa_mcmember_rec rec; > + struct rb_node node; > + struct mcast_port *port; > + spinlock_t lock; > + struct work_struct work; > + struct list_head pending_list; > + struct mcast_member *last_join; > + int members[3]; > + atomic_t refcount; > + enum mcast_state state; > + struct ib_sa_query *query; > + int query_id; > +}; > + > +struct mcast_member { > + struct ib_multicast multicast; > + struct mcast_group *group; > + struct list_head list; > + enum mcast_state state; > + atomic_t refcount; > + wait_queue_head_t wait; > +}; > + > +static void join_handler(int status, struct ib_sa_mcmember_rec *rec, > + void *context); > +static void leave_handler(int status, struct 
ib_sa_mcmember_rec *rec, > + void *context); > + > +static struct mcast_group *mcast_find(struct mcast_port *port, > + union ib_gid *mgid) > +{ > + struct rb_node *node = port->table.rb_node; > + struct mcast_group *group; > + int ret; > + > + while (node) { > + group = rb_entry(node, struct mcast_group, node); > + ret = memcmp(mgid->raw, group->rec.mgid.raw, sizeof *mgid); > + if (!ret) > + return group; > + > + if (ret < 0) > + node = node->rb_left; > + else > + node = node->rb_right; > + } > + return NULL; > +} > + > +static struct mcast_group *mcast_insert(struct mcast_port *port, > + struct mcast_group *group) > +{ > + struct rb_node **link = &port->table.rb_node; > + struct rb_node *parent = NULL; > + struct mcast_group *cur_group; > + int ret; > + > + while (*link) { > + parent = *link; > + cur_group = rb_entry(parent, struct mcast_group, node); > + > + ret = memcmp(group->rec.mgid.raw, cur_group->rec.mgid.raw, > + sizeof group->rec.mgid); > + if (ret < 0) > + link = &(*link)->rb_left; > + else if (ret > 0) > + link = &(*link)->rb_right; > + else > + return cur_group; > + } > + rb_link_node(&group->node, parent, link); > + rb_insert_color(&group->node, &port->table); > + return NULL; > +} > + > +static void deref_port(struct mcast_port *port) > +{ > + if (atomic_dec_and_test(&port->refcount)) > + wake_up(&port->wait); > +} > + > +static void release_group(struct mcast_group *group) > +{ > + struct mcast_port *port = group->port; > + unsigned long flags; > + > + spin_lock_irqsave(&port->lock, flags); > + if (atomic_dec_and_test(&group->refcount)) { > + rb_erase(&group->node, &port->table); > + spin_unlock_irqrestore(&port->lock, flags); > + kfree(group); > + deref_port(port); > + } else > + spin_unlock_irqrestore(&port->lock, flags); > +} > + > +static void deref_member(struct mcast_member *member) > +{ > + if (atomic_dec_and_test(&member->refcount)) > + wake_up(&member->wait); > +} > + > +static void queue_join(struct mcast_member *member) > +{ > + struct 
mcast_group *group = member->group; > + unsigned long flags; > + > + spin_lock_irqsave(&group->lock, flags); > + list_add(&member->list, &group->pending_list); > + if (group->state == MCAST_IDLE) { > + group->state = MCAST_BUSY; > + spin_unlock_irqrestore(&group->lock, flags); > + atomic_inc(&group->refcount); > + queue_work(mcast_wq, &group->work); > + } else > + spin_unlock_irqrestore(&group->lock, flags); > +} > + > +static void adjust_membership(struct mcast_group *group, u8 join_state, int inc) > +{ > + int i; > + > + for (i = 0; i < 3; i++, join_state >>= 1) > + if (join_state & 0x1) > + group->members[i] += inc; > +} > + > +static u8 get_leave_state(struct mcast_group *group) > +{ > + u8 leave_state = 0; > + int i; > + > + for (i = 0; i < 3; i++) > + if (!group->members[i]) > + leave_state |= (0x1 << i); > + > + return leave_state & group->rec.join_state; > +} > + > +static int cmp_rec(struct ib_sa_mcmember_rec *src, > + struct ib_sa_mcmember_rec *dst, ib_sa_comp_mask comp_mask) > +{ > + /* MGID must already match */ > + > + if (comp_mask & IB_SA_MCMEMBER_REC_PORT_GID && > + memcmp(&src->port_gid, &dst->port_gid, sizeof src->port_gid)) > + return -EINVAL; > + if (comp_mask & IB_SA_MCMEMBER_REC_QKEY && src->qkey != dst->qkey) > + return -EINVAL; > + if (comp_mask & IB_SA_MCMEMBER_REC_MLID && src->mlid != dst->mlid) > + return -EINVAL; > + if (comp_mask & IB_SA_MCMEMBER_REC_MTU_SELECTOR && > + src->mtu_selector != dst->mtu_selector) > + return -EINVAL; > + if (comp_mask & IB_SA_MCMEMBER_REC_MTU && src->mtu != dst->mtu) > + return -EINVAL; > + if (comp_mask & IB_SA_MCMEMBER_REC_TRAFFIC_CLASS && > + src->traffic_class != dst->traffic_class) > + return -EINVAL; > + if (comp_mask & IB_SA_MCMEMBER_REC_PKEY && src->pkey != dst->pkey) > + return -EINVAL; > + if (comp_mask & IB_SA_MCMEMBER_REC_RATE_SELECTOR && > + src->rate_selector != dst->rate_selector) > + return -EINVAL; > + if (comp_mask & IB_SA_MCMEMBER_REC_RATE && src->rate != dst->rate) > + return -EINVAL; > + 
if (comp_mask & IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR && > + src->packet_life_time_selector != dst->packet_life_time_selector) > + return -EINVAL; > + if (comp_mask & IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME && > + src->packet_life_time != dst->packet_life_time) > + return -EINVAL; > + if (comp_mask & IB_SA_MCMEMBER_REC_SL && src->sl != dst->sl) > + return -EINVAL; > + if (comp_mask & IB_SA_MCMEMBER_REC_FLOW_LABEL && > + src->flow_label != dst->flow_label) > + return -EINVAL; > + if (comp_mask & IB_SA_MCMEMBER_REC_HOP_LIMIT && > + src->hop_limit != dst->hop_limit) > + return -EINVAL; > + if (comp_mask & IB_SA_MCMEMBER_REC_SCOPE && src->scope != dst->scope) > + return -EINVAL; > + > + /* join_state checked separately, proxy_join ignored */ > + > + return 0; > +} > + > +static int send_join(struct mcast_group *group, struct mcast_member *member) > +{ > + struct mcast_port *port = group->port; > + int ret; > + > + ret = ib_sa_mcmember_rec_set(port->dev->device, port->port_num, > + &member->multicast.rec, > + member->multicast.comp_mask, > + retry_timer, retries, GFP_KERNEL, > + join_handler, group, &group->query); > + if (ret > 0) { > + group->query_id = ret; > + ret = 0; > + } > + return ret; > +} > + > +static int send_leave(struct mcast_group *group, u8 leave_state) > +{ > + struct mcast_port *port = group->port; > + struct ib_sa_mcmember_rec rec; > + int ret; > + > + rec = group->rec; > + rec.join_state = leave_state; > + > + ret = ib_sa_mcmember_rec_delete(port->dev->device, port->port_num, &rec, > + IB_SA_MCMEMBER_REC_MGID | > + IB_SA_MCMEMBER_REC_PORT_GID | > + IB_SA_MCMEMBER_REC_JOIN_STATE, > + retry_timer, retries, GFP_KERNEL, > + leave_handler, group, &group->query); > + if (ret > 0) { > + group->query_id = ret; > + ret = 0; > + } > + return ret; > +} > + > +static void join_group(struct mcast_group *group, struct mcast_member *member, > + u8 join_state) > +{ > + adjust_membership(group, join_state, 1); > + group->rec.join_state |= join_state; > + 
member->multicast.rec = group->rec; > + member->multicast.rec.join_state = join_state; > +} > + > +static int fail_join(struct mcast_group *group, struct mcast_member *member, > + int status) > +{ > + spin_lock_irq(&group->lock); > + list_del_init(&member->list); > + spin_unlock_irq(&group->lock); > + return member->multicast.callback(status, &member->multicast); > +} > + > +static void mcast_work_handler(void *data) > +{ > + struct mcast_group *group = data; > + struct mcast_member *member; > + struct ib_multicast *multicast; > + int status, ret; > + u8 join_state; > + > +retest: > + spin_lock_irq(&group->lock); > + while (!list_empty(&group->pending_list)) { > + member = list_entry(group->pending_list.next, > + struct mcast_member, list); > + multicast = &member->multicast; > + join_state = multicast->rec.join_state; > + atomic_inc(&member->refcount); > + > + if (join_state == (group->rec.join_state & join_state)) { > + status = cmp_rec(&group->rec, &multicast->rec, > + multicast->comp_mask); > + if (!status) > + join_group(group, member, join_state); > + > + list_del_init(&member->list); > + spin_unlock_irq(&group->lock); > + ret = multicast->callback(status, multicast); > + } else { > + spin_unlock_irq(&group->lock); > + status = send_join(group, member); > + if (!status) { > + deref_member(member); > + return; > + } > + ret = fail_join(group, member, status); > + } > + > + deref_member(member); > + if (ret) > + ib_free_multicast(&member->multicast); > + spin_lock_irq(&group->lock); > + } > + > + join_state = get_leave_state(group); > + if (join_state) { > + group->rec.join_state &= ~join_state; > + spin_unlock_irq(&group->lock); > + if (send_leave(group, join_state)) > + goto retest; > + } else { > + group->state = MCAST_IDLE; > + spin_unlock_irq(&group->lock); > + release_group(group); > + } > +} > + > +/* > + * Fail a join request if it is still active - at the head of the pending queue. 
> + */ > +static void process_join_error(struct mcast_group *group, int status) > +{ > + struct mcast_member *member; > + int ret; > + > + spin_lock_irq(&group->lock); > + member = list_entry(group->pending_list.next, > + struct mcast_member, list); > + if (group->last_join == member) { > + atomic_inc(&member->refcount); > + list_del_init(&member->list); > + spin_unlock_irq(&group->lock); > + ret = member->multicast.callback(status, &member->multicast); > + deref_member(member); > + if (ret) > + ib_free_multicast(&member->multicast); > + } else > + spin_unlock_irq(&group->lock); > +} > + > +static void join_handler(int status, struct ib_sa_mcmember_rec *rec, > + void *context) > +{ > + struct mcast_group *group = context; > + > + if (status) > + process_join_error(group, status); > + else { > + spin_lock_irq(&group->lock); > + group->rec = *rec; > + spin_unlock_irq(&group->lock); > + } > + mcast_work_handler(group); > +} > + > +static void leave_handler(int status, struct ib_sa_mcmember_rec *rec, > + void *context) > +{ > + mcast_work_handler(context); > +} > + > +static struct mcast_group *acquire_group(struct mcast_port *port, > + union ib_gid *mgid, gfp_t gfp_mask) > +{ > + struct mcast_group *group, *cur_group; > + unsigned long flags; > + > + spin_lock_irqsave(&port->lock, flags); > + group = mcast_find(port, mgid); > + if (group) > + goto found; > + spin_unlock_irqrestore(&port->lock, flags); > + > + group = kzalloc(sizeof *group, gfp_mask); > + if (!group) > + return NULL; > + > + group->port = port; > + group->rec.mgid = *mgid; > + INIT_LIST_HEAD(&group->pending_list); > + INIT_WORK(&group->work, mcast_work_handler, group); > + spin_lock_init(&group->lock); > + > + spin_lock_irqsave(&port->lock, flags); > + cur_group = mcast_insert(port, group); > + if (cur_group) { > + kfree(group); > + group = cur_group; > + } else > + atomic_inc(&port->refcount); > +found: > + atomic_inc(&group->refcount); > + spin_unlock_irqrestore(&port->lock, flags); > + return group; 
> +} > + > +/* > + * We serialize all join requests to a single group to make our lives much > + * easier. Otherwise, two users could try to join the same group > + * simultaneously, with different configurations, one could leave while the > + * join is in progress, etc., which makes locking around error recovery > + * difficult. > + */ > +struct ib_multicast *ib_join_multicast(struct ib_device *device, u8 port_num, > + struct ib_sa_mcmember_rec *rec, > + ib_sa_comp_mask comp_mask, gfp_t gfp_mask, > + int (*callback)(int status, > + struct ib_multicast > + *multicast), > + void *context) > +{ > + struct mcast_device *dev; > + struct mcast_member *member; > + struct ib_multicast *multicast; > + int ret; > + > + dev = ib_get_client_data(device, &mcast_client); > + if (!dev) > + return ERR_PTR(-ENODEV); > + > + member = kzalloc(sizeof *member, gfp_mask); > + if (!member) > + return ERR_PTR(-ENOMEM); > + > + member->multicast.rec = *rec; > + member->multicast.comp_mask = comp_mask; > + member->multicast.callback = callback; > + member->multicast.context = context; > + init_waitqueue_head(&member->wait); > + atomic_set(&member->refcount, 1); > + member->state = MCAST_JOINING; > + > + member->group = acquire_group(&dev->port[port_num - 1], > + &rec->mgid, gfp_mask); > + if (!member->group) { > + ret = -ENOMEM; > + goto err; > + } > + > + /* > + * The user will get the multicast structure in their callback. They > + * could then free the multicast structure before we can return from > + * this routine. So we save the pointer to return before queuing > + * any callback. 
> + */ > + multicast = &member->multicast; > + queue_join(member); > + return multicast; > + > +err: > + kfree(member); > + return ERR_PTR(ret); > +} > +EXPORT_SYMBOL(ib_join_multicast); > + > +void ib_free_multicast(struct ib_multicast *multicast) > +{ > + struct mcast_member *member; > + struct mcast_group *group; > + > + member = container_of(multicast, struct mcast_member, multicast); > + group = member->group; > + > + spin_lock_irq(&group->lock); > + switch (member->state) { > + case MCAST_MEMBER: > + adjust_membership(group, multicast->rec.join_state, -1); > + break; > + case MCAST_JOINING: > + list_del_init(&member->list); > + break; > + default: > + break; > + } > + > + if (group->state == MCAST_IDLE) { > + group->state = MCAST_BUSY; > + spin_unlock_irq(&group->lock); > + /* Continue to hold reference on group until callback */ > + queue_work(mcast_wq, &group->work); > + } else { > + spin_unlock_irq(&group->lock); > + release_group(group); > + } > + > + atomic_dec(&member->refcount); > + wait_event(member->wait, !atomic_read(&member->refcount)); > + kfree(member); > +} > +EXPORT_SYMBOL(ib_free_multicast); > + > +static void mcast_add_one(struct ib_device *device) > +{ > + struct mcast_device *dev; > + struct mcast_port *port; > + int i; > + > + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) > + return; > + > + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, > + GFP_KERNEL); > + if (!dev) > + return; > + > + for (i = 1; i <= device->phys_port_cnt; i++) { > + port = &dev->port[i - 1]; > + port->dev = dev; > + port->port_num = i; > + spin_lock_init(&port->lock); > + port->table = RB_ROOT; > + init_waitqueue_head(&port->wait); > + atomic_set(&port->refcount, 1); > + } > + > + dev->device = device; > + ib_set_client_data(device, &mcast_client, dev); > +} > + > +/* > + * Mark any existing groups as no longer having any members. This will force > + * cleanup of the groups when all outstanding leave requests complete. 
> + */ > +static void leave_groups(struct mcast_port *port) > +{ > + struct mcast_group *group; > + struct rb_node *node; > + unsigned long flags; > + > + spin_lock_irqsave(&port->lock, flags); > + for (node = rb_first(&port->table); node; node = rb_next(node)) { > + group = rb_entry(node, struct mcast_group, node); > + group->rec.join_state = 0; > + ib_sa_cancel_query(group->query_id, group->query); > + } > + spin_unlock_irqrestore(&port->lock, flags); > +} > + > +static void mcast_remove_one(struct ib_device *device) > +{ > + struct mcast_device *dev; > + struct mcast_port *port; > + int i; > + > + dev = ib_get_client_data(device, &mcast_client); > + if (!dev) > + return; > + > + flush_workqueue(mcast_wq); > + > + for (i = 0; i < device->phys_port_cnt; i++) { > + port = &dev->port[i]; > + leave_groups(port); > + atomic_dec(&port->refcount); > + wait_event(port->wait, !atomic_read(&port->refcount)); > + } > + > + kfree(dev); > +} > + > +static int __init mcast_init(void) > +{ > + int ret; > + > + mcast_wq = create_singlethread_workqueue("ib_mcast_wq"); > + if (!mcast_wq) > + return -ENOMEM; > + > + ret = ib_register_client(&mcast_client); > + if (ret) > + goto err; > + return 0; > + > +err: > + destroy_workqueue(mcast_wq); > + return ret; > +} > + > +static void __exit mcast_cleanup(void) > +{ > + ib_unregister_client(&mcast_client); > + destroy_workqueue(mcast_wq); > +} > + > +module_init(mcast_init); > +module_exit(mcast_cleanup); > Index: core/Makefile > =================================================================== > --- core/Makefile (revision 6230) > +++ core/Makefile (working copy) > @@ -5,7 +5,7 @@ user_access-$(CONFIG_INFINIBAND_ADDR_TRA > > obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o \ > ib_sa.o ib_at.o $(infiniband-y) \ > - findex.o > + findex.o ib_multicast.o > obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o > obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o ib_uat.o $(user_access-y) > > @@ -30,6 +30,8 @@ ib_sa-y := 
sa_query.o > > ib_local_sa-y := local_sa.o > > +ib_multicast-y := multicast.o > + > ib_umad-y := user_mad.o > > ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ > From mshefty at ichips.intel.com Thu Apr 6 13:53:53 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Apr 2006 13:53:53 -0700 Subject: [openib-general] Re: librdmacm/ucma In-Reply-To: <20060406204259.GA15005@mellanox.co.il> References: <20060406135950.GQ21115@mellanox.co.il> <20060406204259.GA15005@mellanox.co.il> Message-ID: <44357FE1.3090302@ichips.intel.com> Michael S. Tsirkin wrote: > The library now has > min = 1 max = 2 > this means that any ABI update in kernel will break userspace. Isn't that what it means to break the ABI? From halr at voltaire.com Thu Apr 6 13:49:59 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2006 16:49:59 -0400 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: References: Message-ID: <1144356598.17533.3706.camel@hal.voltaire.com> On Thu, 2006-04-06 at 14:13, Sean Hefty wrote: Another question: > +static int send_leave(struct mcast_group *group, u8 leave_state) > +{ > + struct mcast_port *port = group->port; > + struct ib_sa_mcmember_rec rec; > + int ret; > + > + rec = group->rec; > + rec.join_state = leave_state; > + > + ret = ib_sa_mcmember_rec_delete(port->dev->device, port->port_num, &rec, > + IB_SA_MCMEMBER_REC_MGID | > + IB_SA_MCMEMBER_REC_PORT_GID | > + IB_SA_MCMEMBER_REC_JOIN_STATE, Why did the PKey component get removed from the leave ? 
-- Hal > + retry_timer, retries, GFP_KERNEL, > + leave_handler, group, &group->query); > + if (ret > 0) { > + group->query_id = ret; > + ret = 0; > + } > + return ret; > +} From rdreier at cisco.com Thu Apr 6 13:58:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Apr 2006 13:58:57 -0700 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: <1144356598.17533.3706.camel@hal.voltaire.com> (Hal Rosenstock's message of "06 Apr 2006 16:49:59 -0400") References: <1144356598.17533.3706.camel@hal.voltaire.com> Message-ID: Hal> Why did the PKey component get removed from the leave ? I don't think it's needed. MGID and PortGID together form the record identifier for multicast groups. - R. From sean.hefty at intel.com Thu Apr 6 14:01:49 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Apr 2006 14:01:49 -0700 Subject: [openib-general] RE: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: <1144355892.17533.3569.camel@hal.voltaire.com> Message-ID: >1. My main initial comment is that I think that cmp_rec needs to be more >complicated than the matching which is there. The selectors include >things like greater than, less than, and largest available in addition >to equal to which is what is supported there now. I'm not sure whether >any of this is used right now so may not be an issue for IPoIB. I will review the spec to see where the checks need to be enhanced. This probably won't be an issue for a while, since most join requests are limited to select fields of the multicast member record. >2. The other comment is I didn't yet follow how multiple joins of >different JoinStates are handled. I can see there are different slots in >the groups but I didn't see whether all the joins go out on the wire >(one per JoinState) or whether there is some "promotion"/"demotion" of >these. The code uses a promotion/demotion mechanism based on a reference count of membership types.
The restriction is that only a single request per group is active at a time. All join requests are queued to a pending list. If a request can be met with the current join state of the group, it is added. If not, then a request is sent to promote the group. Leave requests are handled differently, but result in demotion. - Sean From jlentini at netapp.com Thu Apr 6 14:04:28 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 6 Apr 2006 17:04:28 -0400 (EDT) Subject: [openib-general] Re: [PATCH] [DAPL] [RFC] - remove duplicate disconnect event. In-Reply-To: <1144276742.28591.82.camel@stevo-desktop> References: <1144276742.28591.82.camel@stevo-desktop> Message-ID: On Wed, 5 Apr 2006, Steve Wise wrote: > James, > > Running a 4 thread, 8 ep/thread dapltest (the last test in regress.sh), > I was intermittently seeing a seg fault in dapltest. This is running > over the chelsio rnic using the iwarp branch. After debugging I found > out that dapltest was freeing an already freed endpoint due to it > receiving duplicate disconnect events during test shutdown. The code > assumes it will get exactly one disconnect event for every endpoint > (rightly so I guess). > > I tracked this down to the code in dapl_ep_disconnect() that generates > its own disconnect event in certain circumstances. I removed this and > ran regress.sh over both mthca and cxgb3 with no problems. > > So my question to the dapl experts is: why is this code here? This is an artifact of some older verbs definitions. This code should have gone in the verbs specific portion of DAPL instead of the common code. I'll play around with this and see if there are any negative effects on IB. > For our iwarp devices, it ends up sometimes generating duplicate > disconnect events. I don't see why it's needed. If anyone can > explain the logic, that would be great. > > With this patch and the previous patch that fixes dat_ep_free() to always > free the endpoint, I'm able to run dapltest 1-6 over the chelsio rnic.
> As part of pulling in the iwarp support, I'd like the group to consider > pulling in these patches that fix issues with udapl (once we agree on > the final patches). For now, I'll maintain these patches in the iwarp > branch... From halr at voltaire.com Thu Apr 6 14:04:12 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2006 17:04:12 -0400 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: References: <1144356598.17533.3706.camel@hal.voltaire.com> Message-ID: <1144357451.17533.3876.camel@hal.voltaire.com> On Thu, 2006-04-06 at 16:58, Roland Dreier wrote: > Hal> Why did the PKey component get removed from the leave ? > > I don't think it's needed. MGID and PortGID together form the record > identifier for multicast groups. I see your point so perhaps it is not needed. For IPoIB, that clearly works as the PKey is embedded in the MGID. I didn't think that was the case with generalized IB multicast. -- Hal From mst at mellanox.co.il Thu Apr 6 14:17:47 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 7 Apr 2006 00:17:47 +0300 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> Message-ID: <20060406211747.GB15005@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib_flush_paths > > Michael> Actually, it turned out to be the simplest solution - and > Michael> quite elegant since there's no room for mistakes: if > Michael> query is going to be running this means module is still > Michael> loaded so we can take a reference to it without races. > > Yes, this is surprisingly clean. > > Michael> As a bonus, an assertion inside __module_get increases > Michael> the chance to catch races where user forgets to cancel > Michael> the query - much nicer than crashing randomly.
> > Actually I think __module_get() will do the wrong thing if called > during module unloading -- it doesn't test module_is_live(). In other > words, calling __module_get() without already holding a ref has a > race: __try_stop_module() can see the ref count as 0, then > __module_get() can increment it, and then __try_stop_module() sets the > module state to GOING and returns. > > So the right thing to do is BUG_ON(!try_module_get(owner)) A bit more code but better debugging then. OK. > Also, I don't think that a consumer of ib_sa() would ever pass an > owner other than THIS_MODULE. So how about if we keep the API the > same and just do the THIS_MODULE stuff in an inline wrapper? Good idea. > Like the following... it ends up being a pretty big diff, but just > because I moved some comments around and so on. Also I put the > try_module_get() stuff out of line into call_sa_callback(), because > the compiled code ends up smaller that way. > > Does anyone disagree with this patch? Michael, are you happy with > this tweaked version of yours? I'm happy, go ahead. -- MST From rdreier at cisco.com Thu Apr 6 14:17:30 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Apr 2006 14:17:30 -0700 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: <1144357451.17533.3876.camel@hal.voltaire.com> (Hal Rosenstock's message of "06 Apr 2006 17:04:12 -0400") References: <1144356598.17533.3706.camel@hal.voltaire.com> <1144357451.17533.3876.camel@hal.voltaire.com> Message-ID: Hal> I see your point so perhaps it is not needed. For IPoIB, that Hal> clearly works as the PKey is embedded in the MGID. I didn't Hal> think that was the case with generalized IB multicast. A multicast group is uniquely identified by MGID. There can't be two different groups with the same MGID but different P_Keys. - R. From mst at mellanox.co.il Thu Apr 6 14:21:08 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Fri, 7 Apr 2006 00:21:08 +0300 Subject: [openib-general] Re: librdmacm/ucma In-Reply-To: <44357FE1.3090302@ichips.intel.com> References: <20060406135950.GQ21115@mellanox.co.il> <20060406204259.GA15005@mellanox.co.il> <44357FE1.3090302@ichips.intel.com> Message-ID: <20060406212108.GC15005@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] Re: librdmacm/ucma > > Michael S. Tsirkin wrote: > >The library now has > >min = 1 max = 2 > >this means that any ABI update in kernel will break userspace. > > Isn't that what it means to break the ABI? In normal usage I expect we must not break the ABI, but should rather extend it - without breaking userspace. How do you extend the ABI? not by bumping the ABI revision? How does userspace find out whether kernel supports new features? -- MST From sean.hefty at intel.com Thu Apr 6 14:21:00 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Apr 2006 14:21:00 -0700 Subject: [openib-general] RE: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: <1144357451.17533.3876.camel@hal.voltaire.com> Message-ID: >> I don't think it's needed. MGID and PortGID together form the record >> identifier for multicast groups. > >I see your point so perhaps it is not needed. For IPoIB, that clearly >works as the PKey is embedded in the MGID. I didn't think that was the >case with generalized IB multicast. It doesn't appear that PKey is used in deletion of the group unless proxy_join is set to 1. But in that case, the PKey that's used comes from the MAD source, and not the value specified as part of the MCMemberRecord. 
- Sean From mshefty at ichips.intel.com Thu Apr 6 14:25:06 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Apr 2006 14:25:06 -0700 Subject: [openib-general] Re: librdmacm/ucma In-Reply-To: <20060406212108.GC15005@mellanox.co.il> References: <20060406135950.GQ21115@mellanox.co.il> <20060406204259.GA15005@mellanox.co.il> <44357FE1.3090302@ichips.intel.com> <20060406212108.GC15005@mellanox.co.il> Message-ID: <44358732.20400@ichips.intel.com> Michael S. Tsirkin wrote: > In normal usage I expect we must not break the ABI, but should rather > extend it - without breaking userspace. How do you extend the ABI? > not by bumping the ABI revision? How does userspace find out > whether kernel supports new features? If the ABI didn't break, can't userspace just make the call for the new feature and check the return code? - Sean From halr at voltaire.com Thu Apr 6 14:21:39 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2006 17:21:39 -0400 Subject: [openib-general] RE: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: References: Message-ID: <1144358301.17533.4064.camel@hal.voltaire.com> On Thu, 2006-04-06 at 17:01, Sean Hefty wrote: > >1. My main initial comment is that I think that cmp_rec needs to be more > >complicated than the matching which is there. The selectors include > >things like greater than, less than, and largest available in addition > >to equal to which is what is supported there now. I'm not sure whether > >any of this is used right now so may not be an issue for IPoIB. > > I will review the spec to see where the checks need to be enhanced. There's also code in opensm/osm_sa_mcmember_record.c that you can peruse. > This > probably won't be an issue for a while, since most join requests are limited to > select fields of the multicast member record. Agreed. > >2. The other comment is I didn't yet follow how multiple joins of > >different JoinStates are handled.
I can see there are different slots in > >the groups but I didn't see whether all the joins go out on the wire > >(one per JoinState) or whether there is some "promotion"/"demotion" of > >these. > > The code uses a promotion/demotion mechanism based on a reference count of > membership types. The restriction is that only a single request per group is > active at a time. Meaning only one of the membership types is active outside the node ? If so, that seems right as long as the order of precedence is correct. How does the change in precedence occur ? Is it a leave followed by the new join or the new join followed by the old leave ? > All join requests are queued to a pending list. If a request can be met with > the current join state of the group, it is added. If not, then a request is > sent to promote the group. Leave requests are handled differently, but result > in demotion. Sounds right. Thanks. -- Hal From mst at mellanox.co.il Thu Apr 6 14:30:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 7 Apr 2006 00:30:04 +0300 Subject: [openib-general] Re: ib_umad related kernel panic In-Reply-To: <20060406171220.GB7353@redhat.com> References: <67897d690604051904la898f84y5a34709a6334f315@mail.gmail.com> <20060406171220.GB7353@redhat.com> Message-ID: <20060406213004.GD15005@mellanox.co.il> Quoting r. Doug Ledford : > Just to make that point > clear, I've removed the old RPMs from my site and put up a new set of kernel > rpms based on the 1.0 release branch code (userspace rpms will be a little > later). Doug, does this https://openib.org/svn/gen2/branches/1.0/src/linux-kernel/ happen to be the code that you are using?
-- MST From arlin.r.davis at intel.com Thu Apr 6 14:28:52 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 6 Apr 2006 14:28:52 -0700 Subject: [openib-general] [PATCH] uDAPL cma; dat_ep_free can return without freeing cm_id Message-ID: James and Steve, Here is a revised patch that was tested (free and debug build versions) with dtest, dapltest, and Intel MPI test suites. The rdma_destroy_id will block until we acknowledge the event so there was no need to add our own wait objects for synchronization. This will never be called from the async event thread so there is no chance of deadlock conditions. I also made some changes to build with configure --enable-debug. Some unused variables that were deleted are actually used in the debug messages. Please review the changes. Steve, can you test this version and see if it works for your iWARP device? Thanks, -arlin Signed-off-by: Arlin Davis ardavis at ichips.intel.com Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 6305) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -62,9 +62,9 @@ /* local prototypes */ static struct dapl_cm_id * dapli_req_recv(struct dapl_cm_id *conn, struct rdma_cm_event *event); -static int dapli_cm_active_cb(struct dapl_cm_id *conn, +static void dapli_cm_active_cb(struct dapl_cm_id *conn, struct rdma_cm_event *event); -static int dapli_cm_passive_cb(struct dapl_cm_id *conn, +static void dapli_cm_passive_cb(struct dapl_cm_id *conn, struct rdma_cm_event *event); static void dapli_addr_resolve(struct dapl_cm_id *conn); static void dapli_route_resolve(struct dapl_cm_id *conn); @@ -87,7 +87,9 @@ static inline uint64_t cpu_to_be64(uint6 static void dapli_addr_resolve(struct dapl_cm_id *conn) { int ret; - +#ifdef DAPL_DBG + struct rdma_addr *ipaddr = &conn->cm_id->route.addr; +#endif dapl_dbg_log(DAPL_DBG_TYPE_CM, " addr_resolve: cm_id %p SRC %x DST %x\n", conn->cm_id, @@ -110,8 +112,10 @@ 
static void dapli_addr_resolve(struct da static void dapli_route_resolve(struct dapl_cm_id *conn) { int ret; - struct rdma_cm_id *cm_id = conn->cm_id; - +#ifdef DAPL_DBG + struct rdma_addr *ipaddr = &conn->cm_id->route.addr; + struct ib_addr *ibaddr = &conn->cm_id->route.addr.addr.ibaddr; +#endif dapl_dbg_log(DAPL_DBG_TYPE_CM, " route_resolve: cm_id %p SRC %x DST %x PORT %d\n", conn->cm_id, @@ -158,37 +162,51 @@ bail: NULL, conn->ep); } +/* + * Called from consumer thread via dat_ep_free(). + * CANNOT be called from the async event processing thread + * dapli_cma_event_cb() since a cm_id reference is held and + * a deadlock will occur. + */ void dapli_destroy_conn(struct dapl_cm_id *conn) { - int in_callback; + struct rdma_cm_id *cm_id; dapl_dbg_log(DAPL_DBG_TYPE_CM, " destroy_conn: conn %p id %d\n", conn,conn->cm_id); - + dapl_os_lock(&conn->lock); conn->destroy = 1; - in_callback = conn->in_callback; + + if (conn->ep) + conn->ep->cm_handle = IB_INVALID_HANDLE; + + cm_id = conn->cm_id; + conn->cm_id = NULL; dapl_os_unlock(&conn->lock); - if (!in_callback) { - if (conn->ep) - conn->ep->cm_handle = IB_INVALID_HANDLE; - if (conn->cm_id) { - if (conn->cm_id->qp) - rdma_destroy_qp(conn->cm_id); - rdma_destroy_id(conn->cm_id); - } - - conn->cm_id = NULL; - dapl_os_free(conn, sizeof(*conn)); + /* + * rdma_destroy_id will force synchronization with async CM event + * thread since it blocks until the in-process event reference + * is cleared during our event processing call exit. 
+ */ + if (cm_id) { + if (cm_id->qp) + rdma_destroy_qp(cm_id); + + rdma_destroy_id(cm_id); } + dapl_os_free(conn, sizeof(*conn)); } static struct dapl_cm_id * dapli_req_recv(struct dapl_cm_id *conn, struct rdma_cm_event *event) { struct dapl_cm_id *new_conn; +#ifdef DAPL_DBG + struct rdma_addr *ipaddr = &event->id->route.addr; +#endif if (conn->sp == NULL) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, @@ -239,11 +257,9 @@ static struct dapl_cm_id * dapli_req_rec return new_conn; } -static int dapli_cm_active_cb(struct dapl_cm_id *conn, +static void dapli_cm_active_cb(struct dapl_cm_id *conn, struct rdma_cm_event *event) { - int destroy; - dapl_dbg_log(DAPL_DBG_TYPE_CM, " active_cb: conn %p id %d event %d\n", conn, conn->cm_id, event->event ); @@ -251,9 +267,8 @@ static int dapli_cm_active_cb(struct dap dapl_os_lock(&conn->lock); if (conn->destroy) { dapl_os_unlock(&conn->lock); - return 0; + return; } - conn->in_callback = 1; dapl_os_unlock(&conn->lock); switch (event->event) { @@ -298,17 +313,12 @@ static int dapli_cm_active_cb(struct dap break; } - dapl_os_lock(&conn->lock); - destroy = conn->destroy; - conn->in_callback = conn->destroy; - dapl_os_unlock(&conn->lock); - return(destroy); + return; } -static int dapli_cm_passive_cb(struct dapl_cm_id *conn, +static void dapli_cm_passive_cb(struct dapl_cm_id *conn, struct rdma_cm_event *event) { - int destroy; struct dapl_cm_id *new_conn; dapl_dbg_log(DAPL_DBG_TYPE_CM, @@ -318,9 +328,8 @@ static int dapli_cm_passive_cb(struct da dapl_os_lock(&conn->lock); if (conn->destroy) { dapl_os_unlock(&conn->lock); - return 0; + return; } - conn->in_callback = 1; dapl_os_unlock(&conn->lock); switch (event->event) { @@ -372,11 +381,7 @@ static int dapli_cm_passive_cb(struct da break; } - dapl_os_lock(&conn->lock); - destroy = conn->destroy; - conn->in_callback = conn->destroy; - dapl_os_unlock(&conn->lock); - return(destroy); + return; } @@ -1008,16 +1013,12 @@ dapls_ib_get_cm_event(IN DAT_EVENT_NUMBE void dapli_cma_event_cb(void) { struct 
rdma_cm_event *event; - int ret; - - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cm_event()\n"); - ret = rdma_get_cm_event(&event); + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cm_event()\n"); /* process one CM event, fairness */ - if(!ret) { + if(!rdma_get_cm_event(&event)) { struct dapl_cm_id *conn; - int ret; /* set proper conn from cm_id context*/ if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) @@ -1055,26 +1056,9 @@ void dapli_cma_event_cb(void) case RDMA_CM_EVENT_DISCONNECTED: /* passive or active */ if (conn->sp) - ret = dapli_cm_passive_cb(conn,event); + dapli_cm_passive_cb(conn,event); else - ret = dapli_cm_active_cb(conn,event); - - /* destroy both qp and cm_id */ - if (ret) { - dapl_dbg_log(DAPL_DBG_TYPE_CM, - " cma_cb: DESTROY conn %p" - " cm_id %p qp %p\n", - conn, conn->cm_id, - conn->cm_id->qp); - - if (conn->cm_id->qp) - rdma_destroy_qp(conn->cm_id); - - rdma_ack_cm_event(event); - rdma_destroy_id(conn->cm_id); - dapl_os_free(conn, sizeof(*conn)); - return; - } + dapli_cm_active_cb(conn,event); break; case RDMA_CM_EVENT_CONNECT_RESPONSE: default: @@ -1084,12 +1068,9 @@ void dapli_cma_event_cb(void) event->id->context); break; } + /* ack event, unblocks destroy_cm_id in consumer threads */ rdma_ack_cm_event(event); - } else { - dapl_dbg_log(DAPL_DBG_TYPE_CM, - " cm_event: ERROR: rdma_get_cm_event() %d %d %s\n", - ret, errno, strerror(errno)); - } + } } /* From mst at mellanox.co.il Thu Apr 6 14:31:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 7 Apr 2006 00:31:42 +0300 Subject: [openib-general] Re: Re: librdmacm/ucma In-Reply-To: <44358732.20400@ichips.intel.com> References: <20060406135950.GQ21115@mellanox.co.il> <20060406204259.GA15005@mellanox.co.il> <44357FE1.3090302@ichips.intel.com> <20060406212108.GC15005@mellanox.co.il> <44358732.20400@ichips.intel.com> Message-ID: <20060406213142.GE15005@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: Re: librdmacm/ucma > > Michael S. 
Tsirkin wrote: > >In normal usage I expect we must not break the ABI, but should rather > >extend it - without breaking userspace. How do you extend the ABI? > >not by bumping the ABI revision? How does userspace find out > >whether kernel supports new features? > > If the ABI didn't break, can't userspace just make the call for the new > feature and check the return code? Put another way - why do you already have backward compatibility hacks in librdmacm? There wasn't any released version of cma, was there? -- MST From sean.hefty at intel.com Thu Apr 6 14:41:58 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Apr 2006 14:41:58 -0700 Subject: [openib-general] RE: Re: librdmacm/ucma In-Reply-To: <20060406213142.GE15005@mellanox.co.il> Message-ID: >Put another way - why do you already have backward compatibility hacks >in librdmacm? There wasn't any released version of cma, was there? Because the behavior of the ABI changed. The CMA was released to openib, but has not yet been merged upstream. What is the issue here? The CMA ABI now behaves similarly to the other userspace components. - Sean From mst at mellanox.co.il Thu Apr 6 14:46:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 7 Apr 2006 00:46:55 +0300 Subject: [openib-general] Re: Re: librdmacm/ucma In-Reply-To: References: <20060406213142.GE15005@mellanox.co.il> Message-ID: <20060406214655.GF15005@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: Re: librdmacm/ucma > > >Put another way - why do you already have backward compatibility hacks > >in librdmacm? There wasn't any released version of cma, was there? > > Because the behavior of the ABI changed. The CMA was released to openib, but > has not yet been merged upstream. What is the issue here? The CMA ABI now > behaves similarly to the other userspace components. I just propose killing old ABI support in librdmacm - CMA is sufficiently new for that.
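The two ways of discovering kernel support discussed in this exchange (checking an ABI revision range, versus making the call and checking the return code) could look roughly like the sketch below. It is not librdmacm code: `LIB_MIN_ABI`, `LIB_MAX_ABI`, the command numbers, and the `old_kernel`/`new_kernel` stubs are all hypothetical stand-ins, and the use of `-ENOSYS` for "not implemented" is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical ABI revision range understood by the userspace library. */
#define LIB_MIN_ABI 1
#define LIB_MAX_ABI 2

/* Hypothetical command numbers. */
#define CMD_BASIC       1
#define CMD_NEW_FEATURE 2

/* Stand-in for "issue a command to the kernel"; returns 0 or -errno. */
typedef int (*kernel_call_fn)(int cmd);

/* Approach 1: compare the kernel's advertised ABI revision with our range. */
static int abi_supported(int kernel_abi)
{
    return kernel_abi >= LIB_MIN_ABI && kernel_abi <= LIB_MAX_ABI;
}

/*
 * Approach 2: just make the call and inspect the return code.
 * Returns 1 if the feature works, 0 if the kernel does not implement it,
 * or a negative errno for any other failure.
 */
static int feature_available(kernel_call_fn call, int cmd)
{
    int ret;

    assert(call != NULL);
    ret = call(cmd);
    if (ret == -ENOSYS)
        return 0;
    if (ret < 0)
        return ret;
    return 1;
}

/* Hypothetical kernels used only to exercise the probe. */
static int old_kernel(int cmd)
{
    return cmd == CMD_BASIC ? 0 : -ENOSYS;
}

static int new_kernel(int cmd)
{
    return (cmd == CMD_BASIC || cmd == CMD_NEW_FEATURE) ? 0 : -ENOSYS;
}
```

With these stubs, `feature_available(old_kernel, CMD_NEW_FEATURE)` reports the feature as absent without any ABI bump, which is the extension path Sean suggests; the revision check stays useful only for incompatible changes.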
-- MST From halr at voltaire.com Thu Apr 6 14:43:51 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2006 17:43:51 -0400 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: References: <1144356598.17533.3706.camel@hal.voltaire.com> <1144357451.17533.3876.camel@hal.voltaire.com> Message-ID: <1144359635.17533.4321.camel@hal.voltaire.com> On Thu, 2006-04-06 at 17:17, Roland Dreier wrote: > Hal> I see your point so perhaps it is not needed. For IPoIB, that > Hal> clearly works as the PKey is embedded in the MGID. I didn't > Hal> think that was the case with generalized IB multicast. > > A multicast group is uniquely identified by MGID. There can't be two > different groups with the same MGID but different P_Keys. Is it more than the MCMemberRecord RID ? I suppose this keeps things simpler anyhow. -- Hal From halr at voltaire.com Thu Apr 6 14:49:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2006 17:49:18 -0400 Subject: [openib-general] RE: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: References: Message-ID: <1144360157.17533.4427.camel@hal.voltaire.com> On Thu, 2006-04-06 at 17:21, Sean Hefty wrote: > >> I don't think it's needed. MGID and PortGID together form the record > >> identifier for multicast groups. > > > >I see your point so perhaps it is not needed. For IPoIB, that clearly > >works as the PKey is embedded in the MGID. I didn't think that was the > >case with generalized IB multicast. > > It doesn't appear that PKey is used in deletion of the group unless proxy_join > is set to 1. But in that case, the PKey that's used comes from the MAD source, > and not the value specified as part of the MCMemberRecord. In the case of proxy join/leave, the PKey sharing rule would (still) need to be followed (requester port would need to be in the same partition as the requested port (and the group)) but that's (enforced) on the SA side. 
As Roland pointed out, the Pkey component is not needed on leaves as it's implicit in the group. -- Hal From mst at mellanox.co.il Thu Apr 6 14:57:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 7 Apr 2006 00:57:50 +0300 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: References: <20060406152649.GC13416@mellanox.co.il> Message-ID: <20060406215750.GA16403@mellanox.co.il> Quoting r. Sean Hefty : > Note that ib_cm and rdma_cm technically have the same issue, since cm_id's can > be destroyed by returning non-zero from a callback. I.e. a user of those > interfaces isn't forced to call anything when unloading. OK, this means we must add struct module * pointer to ib_create_cm_id and rdma_create_cm_id as well. Could one of you guys build and commit such a patch for these modules? I'm not in the lab. -- MST From sean.hefty at intel.com Thu Apr 6 14:56:38 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Apr 2006 14:56:38 -0700 Subject: [openib-general] RE: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: <1144358301.17533.4064.camel@hal.voltaire.com> Message-ID: >There's also code in opensm/osm_sa_mcmember_record.c that you can >peruse. Thanks - that's helpful. >> The code uses a promotion/demotion mechanism based on a reference count of >> membership types. The restriction is that only a single request per group is >> active at a time. > >Meaning only one of the membership types is active outside the node ? If >so, that seems right as long as the order of precedence is correct. Only one MAD request per group is active at any time, regardless if the MAD is for a join or leave. >How does the change in precedence occur ? Is it a leave followed by the >new join or the new join followed by the old leave ? Either through a join or leave, depending on if the group is being promoted or demoted. 
For example, if a send-only group is joined by a full member, a new join is issued with an updated join state. If the full member later leaves, a leave request is issued for that join state. - Sean From mshefty at ichips.intel.com Thu Apr 6 15:03:46 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Apr 2006 15:03:46 -0700 Subject: [openib-general] Re: librdmacm/ucma In-Reply-To: <20060406214655.GF15005@mellanox.co.il> References: <20060406213142.GE15005@mellanox.co.il> <20060406214655.GF15005@mellanox.co.il> Message-ID: <44359042.3030901@ichips.intel.com> Michael S. Tsirkin wrote: > I just propose killing old ABI support in librdmacm - CMA is sufficiently > new for that. I'm fine doing that. The ABI versions can be reset to 1. - Sean From rdreier at cisco.com Thu Apr 6 15:08:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Apr 2006 15:08:03 -0700 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: <1144359635.17533.4321.camel@hal.voltaire.com> (Hal Rosenstock's message of "06 Apr 2006 17:43:51 -0400") References: <1144356598.17533.3706.camel@hal.voltaire.com> <1144357451.17533.3876.camel@hal.voltaire.com> <1144359635.17533.4321.camel@hal.voltaire.com> Message-ID: Hal> Is it more than the MCMemberRecord RID ? I suppose this keeps Hal> things simpler anyhow. Not really. MCMemberRecord RID identifies a single member of a group, so it needs both MGID and port GID. This means that a group is identified by just the MGID. - R. From mshefty at ichips.intel.com Thu Apr 6 15:08:33 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Apr 2006 15:08:33 -0700 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406215750.GA16403@mellanox.co.il> References: <20060406152649.GC13416@mellanox.co.il> <20060406215750.GA16403@mellanox.co.il> Message-ID: <44359161.8060903@ichips.intel.com> Michael S. 
Tsirkin wrote: > OK, this means we must add struct module * pointer to ib_create_cm_id > and rdma_create_cm_id as well. > > Could one of you guys build and commit such a patch for these modules? > I'm not in the lab. I will create the patches for these. - Sean From halr at voltaire.com Thu Apr 6 15:05:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Apr 2006 18:05:34 -0400 Subject: [openib-general] RE: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: References: Message-ID: <1144361134.17533.4616.camel@hal.voltaire.com> On Thu, 2006-04-06 at 17:56, Sean Hefty wrote: > >There's also code in opensm/osm_sa_mcmember_record.c that you can > >peruse. > > Thanks - that's helpful. > > >> The code uses a promotion/demotion mechanism based on a reference count of > >> membership types. The restriction is that only a single request per group is > >> active at a time. > > > >Meaning only one of the membership types is active outside the node ? If > >so, that seems right as long as the order of precedence is correct. > > Only one MAD request per group is active at any time, regardless if the MAD is > for a join or leave. One request per group or (per group and join state) ? > >How does the change in precedence occur ? Is it a leave followed by the > >new join or the new join followed by the old leave ? > > Either through a join or leave, depending on if the group is being promoted or > demoted. For example, if a send-only group is joined by a full member, a new > join is issued with an updated join state. If the full member later leaves, a > leave request is issued for that join state. Sorry for being dense but I'm wondering about the SA interactions. In your example, what are the requests made of the SA ? There seems to be the original send only join. Sometime later there is the full member join. Is that it ? Is it one request per group as you stated above or one request per group per join state that can be active ? That's what was confusing me. 
-- Hal From mst at mellanox.co.il Thu Apr 6 15:16:30 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 7 Apr 2006 01:16:30 +0300 Subject: [openib-general] Re: librdmacm/ucma In-Reply-To: <44359042.3030901@ichips.intel.com> References: <20060406213142.GE15005@mellanox.co.il> <20060406214655.GF15005@mellanox.co.il> <44359042.3030901@ichips.intel.com> Message-ID: <20060406221630.GG15005@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: librdmacm/ucma > > Michael S. Tsirkin wrote: > >I just propose killing old ABI support in librdmacm - CMA is sufficiently > >new for that. > > I'm fine doing that. The ABI versions can be reset to 1. Yes, let's do that. -- MST From mst at mellanox.co.il Thu Apr 6 15:19:28 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 7 Apr 2006 01:19:28 +0300 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: <200604051559.34828.eli@mellanox.co.il> References: <200604051559.34828.eli@mellanox.co.il> Message-ID: <20060406221928.GH15005@mellanox.co.il> Quoting r. Eli Cohen : > Subject: [PATCH] ipoib_flush_paths > > ib_sa_cancel_query must be called with priv->lock held since > a completion might arrive and set path->query to NULL > > Signed-off-by: Eli Cohen > Roland, with all the noise about module unloading, please don't forget about this one - please note it's unrelated to the module unloading issue. -- MST From sean.hefty at intel.com Thu Apr 6 15:18:29 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Apr 2006 15:18:29 -0700 Subject: [openib-general] RE: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: <1144361134.17533.4616.camel@hal.voltaire.com> Message-ID: >> Only one MAD request per group is active at any time, regardless if the MAD is >> for a join or leave. > >One request per group or (per group and join state) ? Per group - with a group identified by an MGID.
The intent is to ensure that the state of the group is not driven in different directions in case MADs are processed out of order or are lost. So only a single MAD request can be outstanding per group. >Sorry for being dense but I'm wondering about the SA interactions. In >your example, what are the requests made of the SA ? There seems to be >the original send only join. Sometime later there is the full member >join. Is that it ? In this example, the first join request is for send-only. The second join request is for full member and send-only. The leave request is for full member. A final leave request would later be for send-only. (This is for this example only, the actual order of the requests depends on the users and their join states.) >Is it one request per group as you stated above or one request per group >per join state that can be active ? That's what was confusing me. One request per group outstanding at a time. A change in the join state will result in a second request, but will not be issued until the previous request completes. - Sean
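The promotion/demotion scheme Sean describes in this thread can be modelled in a few lines of C. This is only an illustrative sketch, not the ib_multicast code: `mc_group`, `mc_next_request`, `mc_request_done`, and the other names are hypothetical, and the join-state bit values are assumed to follow the MCMemberRecord JoinState layout. It keeps a reference count per membership type and allows one outstanding request per group.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed JoinState bits (illustrative; check the IB spec before reuse). */
enum {
    JOIN_FULL     = 1 << 0,   /* full member */
    JOIN_NON      = 1 << 1,   /* non-member */
    JOIN_SENDONLY = 1 << 2    /* send-only non-member */
};
#define NUM_JOIN_BITS 3

enum req_type { REQ_NONE, REQ_JOIN, REQ_LEAVE };

struct mc_request {
    enum req_type type;
    uint8_t join_state;
};

/* Hypothetical per-MGID bookkeeping. */
struct mc_group {
    int refs[NUM_JOIN_BITS];  /* local users per membership type */
    uint8_t sa_state;         /* join state the SA has acknowledged */
    int busy;                 /* nonzero while a request is outstanding */
};

/* A local user joins with the given membership type(s). */
static void mc_user_join(struct mc_group *g, uint8_t join_state)
{
    int i;

    for (i = 0; i < NUM_JOIN_BITS; i++)
        if (join_state & (1u << i))
            g->refs[i]++;
}

/* A local user gives up the given membership type(s). */
static void mc_user_leave(struct mc_group *g, uint8_t join_state)
{
    int i;

    for (i = 0; i < NUM_JOIN_BITS; i++)
        if (join_state & (1u << i)) {
            assert(g->refs[i] > 0);
            g->refs[i]--;
        }
}

/* Union of all membership types that still have at least one user. */
static uint8_t mc_wanted_state(const struct mc_group *g)
{
    uint8_t state = 0;
    int i;

    for (i = 0; i < NUM_JOIN_BITS; i++)
        if (g->refs[i] > 0)
            state |= (uint8_t)(1u << i);
    return state;
}

/*
 * Pick the next SA request, or REQ_NONE if one is already outstanding
 * or the SA state already matches what the users want.
 */
static struct mc_request mc_next_request(struct mc_group *g)
{
    struct mc_request req = { REQ_NONE, 0 };
    uint8_t want = mc_wanted_state(g);
    uint8_t add = (uint8_t)(want & ~g->sa_state);
    uint8_t drop = (uint8_t)(g->sa_state & ~want);

    if (g->busy)
        return req;
    if (add) {                    /* promotion: join with the updated state */
        req.type = REQ_JOIN;
        req.join_state = want;
    } else if (drop) {            /* demotion: leave the unused state(s) */
        req.type = REQ_LEAVE;
        req.join_state = drop;
    }
    if (req.type != REQ_NONE)
        g->busy = 1;
    return req;
}

/* Called when the outstanding request completes successfully. */
static void mc_request_done(struct mc_group *g, struct mc_request req)
{
    assert(g->busy);
    if (req.type == REQ_JOIN)
        g->sa_state |= req.join_state;
    else if (req.type == REQ_LEAVE)
        g->sa_state &= (uint8_t)~req.join_state;
    g->busy = 0;
}
```

Driving this model with a send-only join, then a full-member join, then the two leaves in the same order yields the sequence from the example: join (send-only), join (full member and send-only), leave (full member), leave (send-only). Any join or leave that arrives while `busy` is set only changes the reference counts and is picked up after the previous request completes.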
From nivedita at us.ibm.com Thu Apr 6 15:26:49 2006 From: nivedita at us.ibm.com (Nivedita Singhvi) Date: Thu, 6 Apr 2006 15:26:49 -0700 Subject: [openib-general] (no subject) Message-ID: Hello! Just wanted to let everyone know Jiuxing has populated a mercurial tree (very kindly hosted by XenSource) with his code at the following site: http://xenbits.xensource.com/ext/xen-smartio.hg This contains the current source code for a xen infiniband frontend and backend driver. The source code is in very preliminary stages of development, just a proof of concept for now (works). We have a long way to go. We'd like to invite interested folks to take a look and get involved in the continuing design and development as an open-source community. We will be putting up a Wiki page for this shortly on the Xen Wiki. Stay tuned... There is also a mailing list we set up for discussion on virtualization of smart I/O in Xen at: http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-smartio However, at Ian's request we're going to contain most discussion to xen-devel itself. thanks, Nivedita From sean.hefty at intel.com Thu Apr 6 16:00:54 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 6 Apr 2006 16:00:54 -0700 Subject: [openib-general] [RFC] [PATCH 2/2 v2] ipoib: convert to use new multicast interface In-Reply-To: Message-ID: Here's a version that should remove formatting changes.
- Sean Index: ipoib_multicast.c =================================================================== --- ipoib_multicast.c (revision 6307) +++ ipoib_multicast.c (working copy) @@ -45,6 +45,8 @@ #include +#include + #include "ipoib.h" #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG @@ -60,14 +62,11 @@ static DEFINE_MUTEX(mcast_mutex); /* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ struct ipoib_mcast { struct ib_sa_mcmember_rec mcmember; + struct ib_multicast *mc; struct ipoib_ah *ah; struct rb_node rb_node; struct list_head list; - struct completion done; - - int query_id; - struct ib_sa_query *query; unsigned long created; unsigned long backoff; @@ -299,18 +298,18 @@ static int ipoib_mcast_join_finish(struc return 0; } -static void +static int ipoib_mcast_sendonly_join_complete(int status, - struct ib_sa_mcmember_rec *mcmember, - void *mcast_ptr) + struct ib_multicast *multicast) { - struct ipoib_mcast *mcast = mcast_ptr; + struct ipoib_mcast *mcast = multicast->context; struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); if (!status) - ipoib_mcast_join_finish(mcast, mcmember); - else { + status = ipoib_mcast_join_finish(mcast, &multicast->rec); + + if (status) { if (mcast->logcount++ < 20) ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " IPOIB_GID_FMT ", status %d\n", @@ -325,10 +324,10 @@ ipoib_mcast_sendonly_join_complete(int s spin_unlock_irq(&priv->tx_lock); /* Clear the busy flag so we try again */ - clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, + &mcast->flags); } - - complete(&mcast->done); + return status; } static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) @@ -358,35 +357,32 @@ static int ipoib_mcast_sendonly_join(str rec.port_gid = priv->local_gid; rec.pkey = cpu_to_be16(priv->pkey); - init_completion(&mcast->done); - - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, - IB_SA_MCMEMBER_REC_MGID | - 
IB_SA_MCMEMBER_REC_PORT_GID | - IB_SA_MCMEMBER_REC_PKEY | - IB_SA_MCMEMBER_REC_JOIN_STATE, - 1000, GFP_ATOMIC, - ipoib_mcast_sendonly_join_complete, - mcast, &mcast->query); - if (ret < 0) { - ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", + mcast->mc = ib_join_multicast(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + GFP_ATOMIC, + ipoib_mcast_sendonly_join_complete, + mcast); + if (IS_ERR(mcast->mc)) { + ret = PTR_ERR(mcast->mc); + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + ipoib_warn(priv, "ib_join_multicast failed (ret = %d)\n", ret); } else { ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT ", starting join\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); - - mcast->query_id = ret; } return ret; } -static void ipoib_mcast_join_complete(int status, - struct ib_sa_mcmember_rec *mcmember, - void *mcast_ptr) +static int ipoib_mcast_join_complete(int status, + struct ib_multicast *multicast) { - struct ipoib_mcast *mcast = mcast_ptr; + struct ipoib_mcast *mcast = multicast->context; struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -394,23 +390,20 @@ static void ipoib_mcast_join_complete(in " (status %d)\n", IPOIB_GID_ARG(mcast->mcmember.mgid), status); - if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { + if (!status) + status = ipoib_mcast_join_finish(mcast, &multicast->rec); + + if (!status) { mcast->backoff = 1; mutex_lock(&mcast_mutex); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_work(ipoib_workqueue, &priv->mcast_task); mutex_unlock(&mcast_mutex); - complete(&mcast->done); - return; - } - - if (status == -EINTR) { - complete(&mcast->done); - return; + return 0; } - if (status && mcast->logcount++ < 20) { - if (status == -ETIMEDOUT || status == -EINTR) { + if (mcast->logcount++ < 20) { + if (status == -ETIMEDOUT) { ipoib_dbg_mcast(priv, "multicast join failed for " 
IPOIB_GID_FMT ", status %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), @@ -427,23 +420,18 @@ static void ipoib_mcast_join_complete(in if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; - mutex_lock(&mcast_mutex); + /* Clear the busy flag so we try again */ + status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + mutex_lock(&mcast_mutex); spin_lock_irq(&priv->lock); - mcast->query = NULL; - - if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { - if (status == -ETIMEDOUT) - queue_work(ipoib_workqueue, &priv->mcast_task); - else - queue_delayed_work(ipoib_workqueue, &priv->mcast_task, - mcast->backoff * HZ); - } else - complete(&mcast->done); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + mcast->backoff * HZ); spin_unlock_irq(&priv->lock); mutex_unlock(&mcast_mutex); - return; + return status; } static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, @@ -482,15 +470,14 @@ static void ipoib_mcast_join(struct net_ rec.traffic_class = priv->broadcast->mcmember.traffic_class; } - init_completion(&mcast->done); - - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, - mcast->backoff * 1000, GFP_ATOMIC, - ipoib_mcast_join_complete, - mcast, &mcast->query); - - if (ret < 0) { - ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); + set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + mcast->mc = ib_join_multicast(priv->ca, priv->port, &rec, comp_mask, + GFP_KERNEL, ipoib_mcast_join_complete, + mcast); + if (IS_ERR(mcast->mc)) { + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + ret = PTR_ERR(mcast->mc); + ipoib_warn(priv, "ib_join_multicast failed, status %d\n", ret); mcast->backoff *= 2; if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) @@ -502,8 +489,7 @@ static void ipoib_mcast_join(struct net_ &priv->mcast_task, mcast->backoff * HZ); mutex_unlock(&mcast_mutex); - } else - mcast->query_id = ret; + } } void 
ipoib_mcast_join_task(void *dev_ptr) @@ -553,7 +539,8 @@ void ipoib_mcast_join_task(void *dev_ptr spin_unlock_irq(&priv->lock); } - if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) { ipoib_mcast_join(dev, priv->broadcast, 0); return; } @@ -609,26 +596,9 @@ int ipoib_mcast_start_thread(struct net_ return 0; } -static void wait_for_mcast_join(struct ipoib_dev_priv *priv, - struct ipoib_mcast *mcast) -{ - spin_lock_irq(&priv->lock); - if (mcast && mcast->query) { - ib_sa_cancel_query(mcast->query_id, mcast->query); - mcast->query = NULL; - spin_unlock_irq(&priv->lock); - ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(mcast->mcmember.mgid)); - wait_for_completion(&mcast->done); - } - else - spin_unlock_irq(&priv->lock); -} - int ipoib_mcast_stop_thread(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_mcast *mcast; ipoib_dbg_mcast(priv, "stopping multicast thread\n"); @@ -644,52 +614,27 @@ int ipoib_mcast_stop_thread(struct net_d if (flush) flush_workqueue(ipoib_workqueue); - wait_for_mcast_join(priv, priv->broadcast); - - list_for_each_entry(mcast, &priv->multicast_list, list) - wait_for_mcast_join(priv, mcast); - return 0; } static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_sa_mcmember_rec rec = { - .join_state = 1 - }; int ret = 0; - if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) - return 0; - - ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(mcast->mcmember.mgid)); - - rec.mgid = mcast->mcmember.mgid; - rec.port_gid = priv->local_gid; - rec.pkey = cpu_to_be16(priv->pkey); + if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", + 
IPOIB_GID_ARG(mcast->mcmember.mgid)); - /* Remove ourselves from the multicast group */ - ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), - &mcast->mcmember.mgid); - if (ret) - ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + /* Remove ourselves from the multicast group */ + ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret) + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + } - /* - * Just make one shot at leaving and don't wait for a reply; - * if we fail, too bad. - */ - ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, - IB_SA_MCMEMBER_REC_MGID | - IB_SA_MCMEMBER_REC_PORT_GID | - IB_SA_MCMEMBER_REC_PKEY | - IB_SA_MCMEMBER_REC_JOIN_STATE, - 0, GFP_ATOMIC, NULL, - mcast, &mcast->query); - if (ret < 0) - ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " - "for leave (result = %d)\n", ret); + if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) + ib_free_multicast(mcast->mc); return 0; } @@ -743,7 +688,7 @@ void ipoib_mcast_send(struct net_device dev_kfree_skb_any(skb); } - if (mcast->query) + if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) ipoib_dbg_mcast(priv, "no address vector, " "but multicast join already started\n"); else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) @@ -900,7 +845,6 @@ void ipoib_mcast_restart_task(void *dev_ /* We have to cancel outside of the spinlock */ list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { - wait_for_mcast_join(priv, mcast); ipoib_mcast_leave(mcast->dev, mcast); ipoib_mcast_free(mcast); } From nivedita at us.ibm.com Thu Apr 6 16:15:15 2006 From: nivedita at us.ibm.com (Nivedita Singhvi) Date: Thu, 6 Apr 2006 16:15:15 -0700 Subject: [openib-general] Xen Infiniband Source Code Hosted Message-ID: Hello! 
Just wanted to let everyone know Jiuxing has populated a mercurial tree (very kindly hosted by XenSource) with his code at the following site: http://xenbits.xensource.com/ext/xen-smartio.hg This contains the current source code for a xen infiniband frontend and backend driver. The source code is in very preliminary stages of development, just a proof of concept for now (works). We have a long way to go. We'd like to invite interested folks to take a look and get involved in the continuing design and development as an open-source community. We will be putting up a Wiki page for this shortly on the Xen Wiki. Stay tuned... There is also a mailing list we set up for discussion on virtualization of smart I/O in Xen at: http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-smartio However, at Ian's request we're going to contain most discussion to xen-devel itself. (Please pardon my earlier post without a subject). thanks, Nivedita -------------- next part -------------- An HTML attachment was scrubbed... URL: From swise at opengridcomputing.com Thu Apr 6 16:26:25 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 06 Apr 2006 18:26:25 -0500 Subject: [openib-general] Re: [PATCH] uDAPL cma; dat_ep_free can return without freeing cm_id In-Reply-To: References: Message-ID: <1144365985.10701.80.camel@stevo-desktop> > Steve, can you test this version and see if it works for your iWARP device. > I think the patch is good. I ran dapltest/regress.sh over the chelsio iwarp device using this new patch instead of my original patch, and things seem as stable as they were before (i'm fighting some intermittent connection setup failures that I think are in cxgb3 provider, not dapl). Thanks, Steve. 
From tchahande at silverstorm.com Thu Apr 6 16:27:33 2006 From: tchahande at silverstorm.com (Chahande, Takshak) Date: Thu, 6 Apr 2006 19:27:33 -0400 Subject: [openib-general] Data Structures at UserSpace Message-ID: Hi Hal and others, I find that, there are no standard header files exists at userspace which can define structure for PORT_INFO, NODE_INFO and other elements like Path Records, Service record etc. So every individual has to define his own header files to define the data structures for these elements and use in their application program or tool. If it is exists then could you please point me out or if it does not then is there any plan or shall I provide such header files to make standard header files like we have mad.h, umad.h etc. Thanks, - Takshak From Don.Albert at Bull.com Thu Apr 6 17:02:44 2006 From: Don.Albert at Bull.com (Don.Albert at Bull.com) Date: Thu, 6 Apr 2006 17:02:44 -0700 Subject: [openib-general] Problems running MPI jobs with MVAPICH In-Reply-To: Message-ID: Weikuan I previously reported that I was having problems running any MPI jobs between a pair of EM64T machines with RHEL4, Update 3 with the OpenIB modules, (kernel versions 2.6.9-34.ELsmp) and the "mvapich-gen2" code from the OpenIB svn tree.
I was having two problems: (1) When I tried to run from user mode, I would get segmentation faults. (2) When I ran from root, the jobs would fail with the following message: "cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion `len_remote == len_local' failed." The first problem turned out to be a memory problem; I had to increase the size of the max locked-in-memory address space (memlock) in the user limits. The second problem seemed to be more related to process management than to MPI itself. I remembered that when I modified the "make.mvapich.gen2" build script, there was a parameter for MPD: # Whether to use an optimized queue pair exchange scheme. This is not # checked for a setting in in the script. It must be set here explicitly. # Supported: "-DUSE_MPD_RING", "-DUSE_MPD_BASIC" and "" (to disable) HAVE_MPD_RING="" Because I wanted to use MPD to launch jobs, I set HAVE_MPD_RING="-DUSE_MPD_RING" in the build script. I went back and set the parameter to HAVE_MPD_RING="" to disable it, and rebuilt, which meant that MPD was not installed. Using "mpirun_rsh" I am now able to run the MPI jobs, including "cpi", "mping" and other benchmark tests. There seems to be a problem with "USE_MPD_RING". Have you seen this before? Should I try with "USE_MPD_BASIC" instead? -Don Albert-
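The memlock issue Don mentions can be checked up front, since InfiniBand memory registration pins pages and counts against that limit. The following is a minimal sketch, assuming a Linux system where `RLIMIT_MEMLOCK` is available through `getrlimit()`; the helper names `memlock_allows` and `check_memlock` are hypothetical and not part of MVAPICH.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Return 1 if a soft memlock limit permits locking `needed` bytes. */
static int memlock_allows(rlim_t soft_limit, rlim_t needed)
{
    return soft_limit == RLIM_INFINITY || soft_limit >= needed;
}

/*
 * Query the current process limits.
 * Returns 1 if `needed` bytes can be locked, 0 if the soft limit is
 * too low, and -1 if the limit could not be read.
 */
static int check_memlock(rlim_t needed)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    if (!memlock_allows(rl.rlim_cur, needed)) {
        fprintf(stderr,
                "memlock soft limit too low: %llu < %llu bytes\n",
                (unsigned long long)rl.rlim_cur,
                (unsigned long long)needed);
        return 0;
    }
    return 1;
}
```

A launcher could call `check_memlock()` with the amount of memory it expects to register and print a hint to raise the limit (commonly done with `ulimit -l` or the system's limits configuration) instead of failing later with a segmentation fault.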
From mshefty at ichips.intel.com Thu Apr 6 17:19:26 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Apr 2006 17:19:26 -0700 Subject: [openib-general] Re: [PATCH] ipoib_flush_paths In-Reply-To: References: Message-ID: <4435B00E.8050706@ichips.intel.com> Sean Hefty wrote: > +static inline int > +rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, > + struct rdma_dev_addr *addr, int timeout_ms, > + void (*callback)(int status, struct sockaddr *src_addr, > + struct rdma_dev_addr *addr, void *context), > + void *context) > +{ > + return __rdma_resolve_ip(src_addr, dst_addr, addr, timeout_ms, > + callback, context, THIS_MODULE); > +} I missed the compile warning before. This needs a forward declaration of __rdma_resolve_ip(). Assuming that we've decided to go this route, I'll submit a corrected version. - Sean From jgunthorpe at obsidianresearch.com Thu Apr 6 17:25:07 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 6 Apr 2006 18:25:07 -0600 Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users In-Reply-To: References: Message-ID: <20060407002507.GC29395@obsidianresearch.com> On Thu, Apr 06, 2006 at 12:51:23PM -0700, Roland Dreier wrote: > Maybe we should give up the ghost and stop trying to support IB switches? IMHO, the main reason to have the IB stack on a switch is to support ipoib for in band management with a stable and well tested code base. My company is looking closely at doing this in our switches and I'd be surprised if other companies that already use linux as the management OS are not doing the same.
Regards,
Jason

From sean.hefty at intel.com  Thu Apr  6 17:27:04 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 6 Apr 2006 17:27:04 -0700
Subject: [openib-general] Re: [RFC] [PATCH 1/2] multicast support for multiple users
In-Reply-To: <20060407002507.GC29395@obsidianresearch.com>
Message-ID:

>> Maybe we should give up the ghost and stop trying to support IB switches?
>
>IMHO, the main reason to have the IB stack on a switch is to support
>ipoib for in band management with a stable and well tested code base.
>My company is looking closely at doing this in our switches and I'd be
>surprised if other companies that already use linux as the management
>OS are not doing the same.

I've already made the changes to the ib_multicast module to support switches.
I will send out an updated version of the patch tomorrow.

- Sean

From yuw at cse.ohio-state.edu  Thu Apr  6 17:53:06 2006
From: yuw at cse.ohio-state.edu (Weikuan Yu)
Date: Thu, 6 Apr 2006 20:53:06 -0400
Subject: [openib-general] Problems running MPI jobs with MVAPICH
In-Reply-To:
References:
Message-ID:

Hi, Don,

Good to know that you are able to run mvapich with mpirun_rsh. We can
now focus on MPD problem. We never had attempted to run MPD_RING option
as root user. Just curious, were you able to mvapich2-gen2 with
MPD_RING? They are more or less similar code. So could you try the
following two possibilities and let us know all the log files and etc.

a) rpm -e lam.
The reason for this is that I noticed earlier LAM showing up in your
config.log. It might help the configure if you can remove the other MPI
packages which are on your path.
b) Try mvapich-gen2 with mpd_ring, either as root or as user. Please do
build/configure/install on one node and propagate the installation to
see if it runs. We can look into the separate build later on. BTW, make
sure you do `make install' at the end of configure/build.
c) If possible, could you try mvapich2-gen2 with mpd_ring since the
mpd_ring related code is similar there. That may help to locate the
problem.
Thanks, Weikuan On Apr 6, 2006, at 8:02 PM, Don.Albert at Bull.com wrote: > > Weikuan > > I previously reported that I was having problems running any MPI jobs > between a pair of EM64T machines with RHEL4, Update 3 with the OpenIB > modules,  (kernel versions 2.6.9-34.ELsmp) and the "mvapich-gen2" code > from the OpenIB svn tree.     I was having two problems: > > 1. When I tried to run from user mode,  I would get segmentation > faults > > 2. When I ran from root,  the jobs would fail with the following > message:   "cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion > `len_remote == len_local' failed. ". > > The first problem turned out to be a memory problem;  I had to > increase the size of the max locked-in-memory address space (memlock) > in the user limits. > > The second problem seemed to be more related to process management > than to MPI itself.   I remembered that when I modified the > "make.mvapich.gen2" build script,  there was a parameter for MPD: > >   # Whether to use an optimized queue pair exchange scheme.  This is > not >   # checked for a setting in in the script.  It must be set here > explicitly. >   # Supported: "-DUSE_MPD_RING", "-DUSE_MPD_BASIC" and "" (to disable) >   HAVE_MPD_RING="" > > Because I wanted to use MPD to launch jobs,  I set   > HAVE_MPD_RING="-DUSE_MPD_RING"  in the build script. > > I went back and set the parameter to HAVE_MPD_RING="" to disable it, > and rebuilt, which meant that MPD was not installed.   Using > "mpirun_rsh" I am now able to run the MPI jobs,  including "cpi", > "mping" and other benchmark tests. > > There seems to be a problem with "USE_MPD_RING".    Have you seen this > before?   Should I try with "USE_MPD_BASIC" instead? 
>
>         -Don Albert-
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
--
Weikuan Yu, Computer Science, OSU
http://www.cse.ohio-state.edu/~yuw

From yuw at cse.ohio-state.edu  Thu Apr  6 18:34:16 2006
From: yuw at cse.ohio-state.edu (Weikuan Yu)
Date: Thu, 6 Apr 2006 21:34:16 -0400
Subject: [openib-general] Problems running MPI jobs with MVAPICH
In-Reply-To:
References:
Message-ID:

A quick followup. I have just build/configure and propagated the
mvapich-gen2 installation on two EM64T nodes as root. mvapich-gen2 runs
fine with MPD_RING option. Here are the commands I had used. Hope they
could help.
1) prepare your mpd passwd/conf files: /root/.mpdpasswd and
   /root/.mpd.conf, they should be the same with mode 600

[root at e14-oib mvapich-gen2]# cat /root/.mpd.conf
password=56rtG9

2) make.mvapich.gen2 # select /root/installs as $PREFIX and add
   USE_MPD_RING into the option.

[root at e14-oib mvapich-gen2]# scp /root/installs e15:/root/.

3)
[root at e14-oib mvapich-gen2]# /root/installs/bin/mpicc -o /root/cpi examples/basic/cpi.c
[root at e14-oib mvapich-gen2]# scp /root/cpi e15:/root/.
cpi                                100%  294KB 293.8KB/s   00:00

4) Some system info

[root at e14-oib mvapich-gen2]# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 2)
[root at e14-oib mvapich-gen2]# uname -a
Linux e14-oib 2.6.15 #3 SMP Mon Mar 6 20:48:17 PST 2006 x86_64 x86_64 x86_64 GNU/Linux
[root at e14-oib mvapich-gen2]# LD_LIBRARY_PATH=/usr/local/lib /root/installs/bin/mpdtrace
mpdtrace: e14-oib_43520: lhs=e15-oib_60830 rhs=e15-oib_60830 rhs2=e14-oib_43520 gen=1
mpdtrace: e15-oib_60830: lhs=e14-oib_43520 rhs=e14-oib_43520 rhs2=e15-oib_60830 gen=1

5) running two processes on one or two nodes.

[root at e14-oib mvapich-gen2]# LD_LIBRARY_PATH=/usr/local/lib /root/installs/bin/mpirun_mpd -np 2 /root/cpi -MPDENV- LD_LIBRARY_PATH=/usr/local/lib
Process 0 of 2 on e14-oib
Process 1 of 2 on e15-oib
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000424
[root at e14-oib mvapich-gen2]# LD_LIBRARY_PATH=/usr/local/lib /root/installs/bin/mpirun_mpd -g 2 -np 2 /root/cpi -MPDENV- LD_LIBRARY_PATH=/usr/local/lib
Process 0 of 2 on e14-oib
Process 1 of 2 on e14-oib
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000406

Let us know how we can help further,
Weikuan

On Apr 6, 2006, at 8:53 PM, Weikuan Yu wrote:

> Hi, Don,
>
> Good to know that you are able to run mvapich with mpirun_rsh. We can
> now focus on MPD problem. We never had attempted to run MPD_RING
> option as root user.
Just curious, were you able to mvapich2-gen2 with > MPD_RING? They are more or less similar code. So could you try the > following two possibilities and let us know all the log files and etc. > > a) rpm -e lam. > The reason for this is that I noticed earlier LAM showing up in your > config.log. It might help the configure if you can remove the other > MPI packages which are on your path. > b) Try mvapich-gen2 with mpd_ring, either as root or as user. Please > do build/configure/install on one node and propagate the installation > to see if it runs. We can look into the separate build later on. BTW, > make sure you do `make install' at the end of configure/build. > c) If possible, could you try mvapich2-gen2 with mpd_ring since the > mpd_ring related code is similar there. That may help to locate the > problem. > > Thanks, > Weikuan > > > On Apr 6, 2006, at 8:02 PM, Don.Albert at Bull.com wrote: > >> >> Weikuan >> >> I previously reported that I was having problems running any MPI jobs >> between a pair of EM64T machines with RHEL4, Update 3 with the OpenIB >> modules,  (kernel versions 2.6.9-34.ELsmp) and the "mvapich-gen2" >> code from the OpenIB svn tree.     I was having two problems: >> >> 1. When I tried to run from user mode,  I would get segmentation >> faults >> >> 2. When I ran from root,  the jobs would fail with the following >> message:   "cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion >> `len_remote == len_local' failed. ". >> >> The first problem turned out to be a memory problem;  I had to >> increase the size of the max locked-in-memory address space (memlock) >> in the user limits. >> >> The second problem seemed to be more related to process management >> than to MPI itself.   I remembered that when I modified the >> "make.mvapich.gen2" build script,  there was a parameter for MPD: >> >>   # Whether to use an optimized queue pair exchange scheme.  This is >> not >>   # checked for a setting in in the script.  
It must be set here >> explicitly. >>   # Supported: "-DUSE_MPD_RING", "-DUSE_MPD_BASIC" and "" (to disable) >>   HAVE_MPD_RING="" >> >> Because I wanted to use MPD to launch jobs,  I set   >> HAVE_MPD_RING="-DUSE_MPD_RING"  in the build script. >> >> I went back and set the parameter to HAVE_MPD_RING="" to disable it, >> and rebuilt, which meant that MPD was not installed.   Using >> "mpirun_rsh" I am now able to run the MPI jobs,  including "cpi", >> "mping" and other benchmark tests. >> >> There seems to be a problem with "USE_MPD_RING".    Have you seen >> this before?   Should I try with "USE_MPD_BASIC" instead? >> >>         -Don Albert- >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > -- > Weikuan Yu, Computer Science, OSU > http://www.cse.ohio-state.edu/~yuw > > -- Weikuan Yu, Computer Science, OSU http://www.cse.ohio-state.edu/~yuw From manpreet at gmail.com Thu Apr 6 19:16:25 2006 From: manpreet at gmail.com (Manpreet Singh) Date: Thu, 6 Apr 2006 19:16:25 -0700 Subject: [openib-general] Re: ib_umad related kernel panic In-Reply-To: <20060406171220.GB7353@redhat.com> References: <67897d690604051904la898f84y5a34709a6334f315@mail.gmail.com> <20060406171220.GB7353@redhat.com> Message-ID: <67897d690604061916k14900f0nb93be4f8c06f2d9c@mail.gmail.com> Doug, Thanks for the updated rpms. Just curious as to what svn revision of openib you used for the rpms. Do the IB tools (if compiled separately) have to be from the same svn revision? Thanks, Manpreet. On 4/6/06, Doug Ledford wrote: > > On Wed, Apr 05, 2006 at 07:04:29PM -0700, Manpreet Singh wrote: > > Hi, > > > > I am observing the following with redhat kernel rpm at: > > http://people.redhat.com/dledford/Infiniband , which uses openib version > > 3965. This is on an RHEL4 install. 
> >
> > When the system is rebooted, ib_core, ib_mad and ib_mthca modules get
> loaded
> > automatically. When I load ib_umad after that, I get the following
> panic.
> > However, if I unload ib_mthca and load again once before loading
> ib_umad,
> > then there is no problem subsequently in the system.
>
> The rpms posted on my site, especially the 3985 revision, are not going to
> be "fixed", they will simply be replaced.  We are going to a later rev of
> the code base for the next update (the 1.0 release branch), so time spent
> fixing bugs in that version is basically wasted.  Just to make that point
> clear, I've removed the old RPMs from my site and put up a new set of
> kernel
> rpms based on the 1.0 release branch code (userspace rpms will be a little
> later).
>
> --
>   Doug Ledford      919-754-3700 x44233
>          Red Hat, Inc.
>          1801 Varsity Dr.
>          Raleigh, NC 27606
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From dledford at redhat.com  Thu Apr  6 20:38:43 2006
From: dledford at redhat.com (Doug Ledford)
Date: Thu, 6 Apr 2006 23:38:43 -0400
Subject: [openib-general] Re: ib_umad related kernel panic
In-Reply-To: <20060406213004.GD15005@mellanox.co.il>
References: <67897d690604051904la898f84y5a34709a6334f315@mail.gmail.com> <20060406171220.GB7353@redhat.com> <20060406213004.GD15005@mellanox.co.il>
Message-ID: <20060407033843.GD7353@redhat.com>

On Fri, Apr 07, 2006 at 12:30:04AM +0300, Michael S. Tsirkin wrote:
> Quoting r. Doug Ledford :
> > Just to make that point
> > clear, I've removed the old RPMs from my site and put up a new set of kernel
> > rpms based on the 1.0 release branch code (userspace rpms will be a little
> > later).
>
> Doug, is this
> https://openib.org/svn/gen2/branches/1.0/src/linux-kernel/
> happens to be the code that you are using?

That's the base, yes.

--
  Doug Ledford      919-754-3700 x44233
         Red Hat, Inc.
         1801 Varsity Dr.
         Raleigh, NC 27606

From dledford at redhat.com  Thu Apr  6 20:42:23 2006
From: dledford at redhat.com (Doug Ledford)
Date: Thu, 6 Apr 2006 23:42:23 -0400
Subject: [openib-general] Re: ib_umad related kernel panic
In-Reply-To: <67897d690604061916k14900f0nb93be4f8c06f2d9c@mail.gmail.com>
References: <67897d690604051904la898f84y5a34709a6334f315@mail.gmail.com> <20060406171220.GB7353@redhat.com> <67897d690604061916k14900f0nb93be4f8c06f2d9c@mail.gmail.com>
Message-ID: <20060407034223.GE7353@redhat.com>

On Thu, Apr 06, 2006 at 07:16:25PM -0700, Manpreet Singh wrote:
> Doug,
>
> Thanks for the updated rpms. Just curious as to what svn revision of openib
> you used for the rpms.
>
> Do the IB tools (if compiled separately) have to be from the same svn
> revision?

Usually, they don't have to be identical revs, but if they get too far
apart they might break.

--
  Doug Ledford      919-754-3700 x44233
         Red Hat, Inc.
         1801 Varsity Dr.
         Raleigh, NC 27606

From rdreier at cisco.com  Thu Apr  6 22:16:01 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 06 Apr 2006 22:16:01 -0700
Subject: [openib-general] Re: [PATCH] ipoib_flush_paths
In-Reply-To: <20060406221928.GH15005@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 7 Apr 2006 01:19:28 +0300")
References: <200604051559.34828.eli@mellanox.co.il> <20060406221928.GH15005@mellanox.co.il>
Message-ID:

OK, I applied the original patch here, along with this on top of it:

IPoIB: Use spin_lock_irq() instead of spin_lock_irqsave()

We know ipoib_flush_paths() is called from plain process context with
interrupts enabled, since it does wait_for_completion().  So there's
no need to use spin_lock_irqsave() -- spin_lock_irq() is fine.

Signed-off-by: Roland Dreier

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 996c6e1..cb078a7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -336,9 +336,8 @@ void ipoib_flush_paths(struct net_device
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_path *path, *tp;
 	LIST_HEAD(remove_list);
-	unsigned long flags;
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irq(&priv->lock);
 
 	list_splice(&priv->path_list, &remove_list);
 	INIT_LIST_HEAD(&priv->path_list);
@@ -349,12 +348,12 @@ void ipoib_flush_paths(struct net_device
 	list_for_each_entry_safe(path, tp, &remove_list, list) {
 		if (path->query)
 			ib_sa_cancel_query(path->query_id, path->query);
-		spin_unlock_irqrestore(&priv->lock, flags);
+		spin_unlock_irq(&priv->lock);
 		wait_for_completion(&path->done);
 		path_free(dev, path);
-		spin_lock_irqsave(&priv->lock, flags);
+		spin_lock_irq(&priv->lock);
 	}
-	spin_unlock_irqrestore(&priv->lock, flags);
+	spin_unlock_irq(&priv->lock);
 }
 
 static void path_rec_completion(int status,

From vaidyana at cse.ohio-state.edu  Thu Apr  6 22:36:00 2006
From: vaidyana at cse.ohio-state.edu (Karthikeyan Vaidyanathan)
Date: Fri, 7 Apr 2006 01:36:00 -0400 (EDT)
Subject: [openib-general] amso1100 testing with OpenIB
In-Reply-To: <1144356024.10701.65.camel@stevo-desktop>
Message-ID:

Hi,

Thanks for pointing me to the latest firmware/boot loader. After I
uninstalled the Ammasso Software, I was able to succesfully update and
load the firmware and also the module.
It worked fine on one machine, but on another machine, when I try to
bring the eth2 interface up, the machine hangs. However, I don't see any
problem with the iw2 interface. I was also able to run the krping test
successfully between these two machines using the iw2 IP address.

It is very likely that if I set this up on another node, everything might
run smoothly, but I was wondering if I was doing something wrong here. Do
you have any suggestions?

Also, will there be userspace support for the Ammasso driver soon?

thanks,
Karthik

On Thu, 6 Apr 2006, Steve Wise wrote:

> On Thu, 2006-04-06 at 14:01 -0400, Karthikeyan Vaidyanathan wrote:
> > Hi,
> >
> > I was trying to follow the discussion on getting the iWARP branch to work
> > with Ammasso NICs and tried the following steps:
> >
> > /include/rdma -->
> > /gen2/branches/iwarp/src/linux-kernel/infiniband/include/rdma
> >
> > /drivers/infiniband -->
> > /gen2/branches/iwarp/src/linux-kernel/infiniband
> >
> > and recompiled the kernel. I'm working on linux 2.6.15.4 kernel.
> >
> > I had initially updated the firmware from AMSO1100 Release 1.2 Update 1
> > kit. Then I used ccflash2 to update to C2L_H23_B58_F61_080507.bit.
> > However, when I reboot the machine, the dmesg reports:
> >
> > c2: AMSO1100 Gigabit Ethernet driver v1.1 loaded
> > c2: Downlevel Firmware boot loader [1/7: got 0x42, exp 0x43]. Use the
> > cc_flash utility to update your boot loader
> > c2: Adapter not claimed
> > c2: probe of 0000:03:02.0 failed with error -5
> >
> > I also tried updating the firmware boot loader from AMSO1100 Release 1.2
> > Update 1 kit but I still get this message when the machine boots up.
> >
> > However, the following command shows that I have the latest firmware
> > loaded:
> >
> > $ ccons 0 bitfile
> > FPGA Bitfile ID: C2L_H23_B58_080507 Release 23
> >
>
> You need the amso1100 openib kit from Open Grid Computing.
> It contains new firmware, a new bitfile/boot loader, and scripts to load
> the firmware correctly before the openib driver loads.  The kit is here:
>
> http://www.opengridcomputing.com/downloads/ogc_amso_kit_20060308.tgz
>
> You also need to uninstall the Ammasso software from the system.
>
> Hope this helps.
>
> Steve.
>
>
>

From k_mahesh85 at yahoo.co.in  Thu Apr  6 23:10:09 2006
From: k_mahesh85 at yahoo.co.in (keshetti mahesh)
Date: Fri, 7 Apr 2006 07:10:09 +0100 (BST)
Subject: [openib-general] what are the performance figures of SDP over infiniband?
Message-ID: <20060407061009.20024.qmail@web8321.mail.in.yahoo.com>

Hi,

I have started working with an InfiniBand network. In my work I am using
the OpenIB stack that is available with RHEL4 Update 3, and I ran my
benchmark to test the performance. The figures are not convincing
(max 260Mbps at 10Mb data rate). Can anybody send me the typical
performance figures of SDP over InfiniBand?

Data rate       BandWidth
2.000000e+04    1.565233e+08
5.000000e+04    2.080646e+08
8.000000e+04    2.273233e+08
1.000000e+05    2.329996e+08
2.000000e+05    2.479734e+08
5.000000e+05    2.597392e+08
8.000000e+05    2.600675e+08
1.000000e+06    2.623670e+08
2.000000e+06    2.639205e+08
5.000000e+06    2.651772e+08
8.000000e+06    2.663875e+08
1.000000e+07    2.656914e+08
1.000000e+01    5.518821e+06
1.000000e+02    3.419270e+07
1.000000e+03    4.338935e+08
1.500000e+03    7.997614e+08
2.000000e+03    8.501968e+08
5.000000e+03    1.160785e+09
8.000000e+03    9.055712e+07
1.000000e+04    1.127501e+08
2.000000e+04    1.557870e+08
5.000000e+04    2.086995e+08
8.000000e+04    2.260573e+08
1.000000e+05    2.351155e+08
2.000000e+05    2.470192e+08
5.000000e+05    2.585480e+08
8.000000e+05    2.607290e+08
1.000000e+06    2.617710e+08
2.000000e+06    2.637822e+08
5.000000e+06    2.651189e+08
8.000000e+06    2.662102e+08
1.000000e+07    2.656525e+08
1.000000e+01    5.156931e+06
1.000000e+02    3.328813e+07
1.000000e+03    3.475943e+08
1.500000e+03    8.426057e+08
2.000000e+03    6.108210e+08
5.000000e+03    1.014751e+09
8.000000e+03    9.096629e+07
1.000000e+04    1.102024e+08
2.000000e+04    1.585548e+08
5.000000e+04    2.098831e+08
8.000000e+04    2.277244e+08

These are the figures I got with SDP.

Thanks and regards,
K.Mahesh

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From hch at lst.de Thu Apr 6 23:32:35 2006 From: hch at lst.de (Christoph Hellwig) Date: Fri, 7 Apr 2006 08:32:35 +0200 Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator In-Reply-To: <44310E07.5010606@voltaire.com> References: <20060224211615.GA30927@lst.de> <4401A6AA.3080508@voltaire.com> <440423B8.2080302@voltaire.com> <440BEC4D.3010503@voltaire.com> <20060308152603.GA13621@lst.de> <441016D2.70400@voltaire.com> <20060322095708.GA23491@lst.de> <44213A00.4010001@voltaire.com> <44214A5D.8040100@voltaire.com> <44310E07.5010606@voltaire.com> Message-ID: <20060407063235.GB23760@lst.de> On Mon, Apr 03, 2006 at 02:59:03PM +0300, Or Gerlitz wrote: > Haven't heard from you re the patch you have supplied me which removes > at least this SCSI IOCTL issuing a non SG SCSI command. As i wrote you i > have patched 2.6.16 and tested it, works great. Is it queued for 2.6.17? It's in scsi-misc and will hopefully still go in before 2.6.17 From gongon23 at compass.jp Thu Apr 6 23:49:56 2006 From: gongon23 at compass.jp (gongon23 at compass.jp) Date: Thu, 6 Apr 2006 23:49:56 -0700 (PDT) Subject: [openib-general] =?utf-8?b?wotYwoLCtcKCwq3CgsKoworDqMKCwqLCgsK1?= =?utf-8?b?woLDnMKCwrc=?= Message-ID: 20060407154020.94295mail@mail.love-woman889889_gogo-server114_freesystem01_freefree-lovelove.tv --�����Z�����āA�������Ƃ̌J��Ԃ��ł������Ԃ������߂��Ă����Ȃ��� �v���Ă��܂��񂩁H ����ɏ����E�j�������Ȃ�����Ɨ����ɏ��ɓI �ɂ͂Ȃ��Ă��܂��񂩁H ���܂łɂP���o��n�ʼn���������Ȃ��c ������̂ɉ���Ă���Ȃ��E�E�E ����΂��肩�����đS����Ȃ��I�I����Ȍo���A����܂��񂩁H ���܂܂ł̃T�C�g�͂Ȃ񂾂����̂��H ����`���Ċ����čs�����Ă��������I���܂ŋC�Â��Ȃ����������� �����ƁA���‚���܂��B http://ad.love-meets.com/?kiarh �������c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c �N��I�ɂT�T���ᖳ���ł���ˁH ���̗l�Ȏ��͂������܂���B�����j���̕��ϔN��S�P�� �����̕��ς͂R�S�΂ƂȂ��Ă���ō���͒j���V�P�Ώ����U�R�΂ƂȂ�܂��B ���̐l���u�]�҂���������̂������ł��B http://ad.love-meets.com/?kiarh �������c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c 
�ǂ�Ȑl������̂ł����H ����N��̌��͉񓚂��܂������A��x���s���ꂽ���A���X���������̖����� ������ȏ����̕��������l�ł��B���ׁ̈A�Q�O��O���͔��ɏ��Ȃ��Ȃ��� ����܂��B http://ad.love-meets.com/?kiarh �������c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c�c ���܂ő��̃T�C�g�Œ��X��肭�����Ȃ��������ǁH ��肭�s���Ȃ����R�Ƃ��ĐF�X����܂����A���T�C�g�͔�r�I����ȏ����� �����̂�����ł��B���ׁ̈A���܂�ł����`�Ȃǂ̑Ή������ɂ܂������ �C�����𕷂������Ă��ꂩ��ł�x���͂���܂���B �ł�͍ł�o��̒��łm�f�ł͂Ȃ��ł��傤���B http://ad.love-meets.com/?kiarh From y.hanyul at virgilio.it Fri Apr 7 00:55:56 2006 From: y.hanyul at virgilio.it (Yeonna Hanyul.) Date: Fri, 07 Apr 2006 15:55:56 +0800 Subject: [openib-general] Hello Message-ID: <20060407080555.682532283DA@openib.ca.sandia.gov> Hello, My name is Yeonna Hanyul.From 1980-1988 I was a special aide to Mr. CHUN DOO-HWAN, the former President of South Korea who seized power in a military coup in 1980 and ruled from 1980 to 1988. My boss was pushed out of office and charged with treason ,corruption and embezzlement of over 21billion won. He was wrongly sentenced to death but fortunately AMNESTY INTERNATIONAL stepped in and commuted the sentence to life imprisonment. We thank God that he has finally been released, though still under house arrest in the sense of conditions of the freedom. I have been under house arrest at the government house all these years while he was in prison. I was also pardoned on the day of his release. During his regime as president of South Korea, we realized some reasonable amount of money from various deals that we successfully executed. I have plans to invest this money for my children's future on real estate and industrial production. Before he was overthrown, he successfully moved some amount funds out of Seoul and deposited the money with a security firm that transports valuable goods and consignments through diplomatic means in Europe. 
I am contacting you because I want you to deal with the security company and claim the money on my behalf since I have declared that the consignment belong to my foreign business partner. You shall also be required to assist me in investment in your country. I hope to trust you as a God fearing person who will not sit on this money when you claim it, rather assist me properly, I expect you to declare what percentage of the total money you will take for your assistance. When I receive your positive response I will let you know the contacts of the security company. What I need is for you to indicate your interest that you will assist me by receiving the money on my behalf in Europe. For this, you shall be considered to be the beneficiary of the money. The project in brief, is that the funds with which we intend to carry out our proposed investments in your country, is presently in the custody of a security firm in Europe. Note that it will require you to visit Europe to do the necessary clearance of the consignment before it release and final banking for you and I. I cannot do this myself because I do not want the government of my Country to know about the money because they will believe I got the money from the government. Besides I am under government surveilance. My movement is being monitored, I have to be in Korea for sometime. Once you confirm the receipt of the money ,I will come over with my Children to your Country or any Country in Europe to start a new life with my Family. As soon as payment is effected, and the amount mentioned above is successfully transferred into the bank account that will be opened for you, we will use our own share in acquiring some estates abroad. For this too you shall also be our overseas manager if you display the ability to do so. Please send me your telephone and fax number. Your quick response will be highly appreciated. Thank you in anticipation of your cooperation. Yours faithfully, Mr Yeonna Hanyul. 
y.hanyul at ausi.com

From devesh28 at gmail.com Fri Apr 7 01:50:22 2006
From: devesh28 at gmail.com (Devesh Sharma)
Date: Fri, 7 Apr 2006 14:20:22 +0530
Subject: [openib-general] Question on : ib_reg_phys_mr()
Message-ID: <309a667c0604070150sdf99ef7kfd81e2bbe45f8076@mail.gmail.com>

Hello list,

In the IB kernel verbs there is a function ib_reg_phys_mr().
I am not able to trace a call to this verb from any ULP or uverb.
Who calls this function?
Is it mandatory for the HCA driver provider to support this function?
Please guide me.

Devesh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From dotanb at mellanox.co.il Fri Apr 7 03:53:05 2006
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Fri, 7 Apr 2006 13:53:05 +0300
Subject: [openib-general] [uDAPLl] question about dapl_ib_cq_resize
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD2D2E@mtlexch01.mtl.com>

Hi.

I looked at the file src/userspace/dapl/dapl/openib/dapl_ib_cq.c,
function dapl_ib_cq_resize.

In this function, when one wants to resize a CQ, DAPL destroys the old CQ
and creates a new one instead of calling the resize CQ verb (which was
added ~3 months ago). Is there a reason for this code?

(Please note that the current implementation of the resize CQ function
will fail if there are QPs using this CQ.)

thanks
Dotan Barak
Software Verification Engineer
Mellanox Technologies
Tel: +972-4-9097200 Ext: 231
Fax: +972-4-9593245
P.O. Box 86 Yokneam 20692 ISRAEL.
Home: +972-77-8841095
Cell: 052-4222383
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From halr at voltaire.com Fri Apr 7 03:48:34 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Apr 2006 06:48:34 -0400
Subject: [openib-general] Data Structures at UserSpace
In-Reply-To: 
References: 
Message-ID: <1144406913.19061.194.camel@hal.voltaire.com>

Hi Takshak,

On Thu, 2006-04-06 at 19:27, Chahande, Takshak wrote:
> Hi Hal and others,
>
> I find that, there are no standard header files exists at userspace
> which can define structure for PORT_INFO, NODE_INFO and other elements
> like Path Records, Service record etc.

There is no user space SA client support in gen2/OpenIB currently.

> So every individual has to define his own header files to define the
> data structures for these elements and use in their application
> program or tool.
>
> If it is exists then could you please point me out or if it does not
> then is there any plan or shall I provide such header files to
> make standard header files like we have mad.h, umad.h etc.

The current plan is to expose path records and multicast support to user
space. That's the next increment of support over the next couple of months,
but it sounds like that is insufficient for your needs.

If you are planning an SA diagnostics tool, there are 3 approaches, in
increasing order of difficulty/magnitude of work:

1. Use the OpenSM SA client API for this (osmtest and some Mellanox
   diagnostic tools use this currently).
2. Use the userspace/management libraries for this. This will take more
   work, as much if not all of the SA support is missing (these libraries
   are geared more toward SMPs).
3. Develop a user space SA client library for gen2 similar to the other
   user space libraries (CM, CMA, etc.).
-- Hal

> Thanks,
> - Takshak
>
> ______________________________________________________________________
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From halr at voltaire.com Fri Apr 7 04:41:49 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Apr 2006 07:41:49 -0400
Subject: [openib-general] Dual Sided RMPP Support as well as OpenSM Implications
Message-ID: <1144410107.19061.824.camel@hal.voltaire.com>

Hi,

I am in the process of adding support for SA MultiPathRecord to OpenSM.
Dual sided RMPP is not supported by the current RMPP implementation, nor
is it used by OpenSM (or the vendor SA client). It can be added but may
require an API change and possibly an ABI change.

It seems that user space code needs to be able both to indicate and to
find out whether dual sided RMPP is supported, so that all mixes of user
space and kernel code can "work". The simplest way I can think of to add
this from an ABI/API perspective is to add an ioctl to user_mad. Prior to
the kernel supporting this, libibumad can just return "not supported" for
the "is dual sided" check for the current user_mad ABI version (which is
5). Does anyone have any better ideas on how to accomplish this?

Also, there is the question of what the existing RMPP code should do if it
receives a dual sided RMPP request on the network side. It could ABORT the
request, although RMPP currently has no specific status code for this.
This is likely an issue with any other dual sided RMPP implementation, and
there would be no guarantee of interoperability here. It's unclear what
RMPP implementations which do not support dual sided would do: likely just
handle the request as if it weren't dual sided.

On the OpenSM side, should there be a DUAL_SIDED_RMPP flag? I am reluctant
to have 2 versions of OpenSM. There may be a need for this, but I don't
think it's an OpenIB need. I think the OpenIB version can determine at
initialization time whether the kernel supports dual sided RMPP, so no
conditionalization of OpenSM would be needed for OpenIB's purposes.

Comments?

-- Hal

From jlentini at netapp.com Fri Apr 7 06:52:20 2006
From: jlentini at netapp.com (James Lentini)
Date: Fri, 7 Apr 2006 09:52:20 -0400 (EDT)
Subject: [openib-general] Question on : ib_reg_phys_mr()
In-Reply-To: <309a667c0604070150sdf99ef7kfd81e2bbe45f8076@mail.gmail.com>
References: <309a667c0604070150sdf99ef7kfd81e2bbe45f8076@mail.gmail.com>
Message-ID: 

On Fri, 7 Apr 2006, Devesh Sharma wrote:

> Hello list,
> In Ib kernel verbs there is a function ib_reg_phys_mr().
> I am not able to trace the call of this verb by any ulp or uverb.
> Who calls this function?

NFS-RDMA uses this function:

http://sourceforge.net/projects/nfs-rdma

> Is this function mandatory to be supported by the HCA driver provider?

As a ULP implementer, I expect it to be supported. It is a standard IBTA
verb.

From swise at opengridcomputing.com Fri Apr 7 07:18:41 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 07 Apr 2006 09:18:41 -0500
Subject: [openib-general] amso1100 testing with OpenIB
In-Reply-To: 
References: 
Message-ID: <1144419521.7000.10.camel@stevo-desktop>

On Fri, 2006-04-07 at 01:36 -0400, Karthikeyan Vaidyanathan wrote:
> Hi,
>
> Thanks for pointing me to the latest firmware/boot loader.
>
> After I uninstalled the Ammasso Software, I was able to successfully update
> and load the firmware and also the module. It worked fine on one machine
> but on another machine, when I try to bring the eth2 interface up, the
> machine hangs. However, I don't see any problem with iw2 interface.
>
> I was also able to do krping test successfully between these two machines
> using the iw2 ip address.

If you have the kdb patch, you might hit the debugger and see what's up
with the hang.
I haven't seen this, but we haven't done much testing with the legacy
interface. Are you bringing up eth2 and iw2 with different
addresses/subnets?

> It is very likely that if I set this up on another node, everything might
> run smoothly but I was wondering if I was doing something wrong here.
> Do you have any suggestions?
>
> Also, Will there be a userspace support for the Ammasso driver soon?
>

It's on our list. It should be fairly easy to do a user mode library that
doesn't do kernel bypass. We're looking for volunteers to help! :-)

From swise at opengridcomputing.com Fri Apr 7 08:36:43 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 07 Apr 2006 10:36:43 -0500
Subject: [openib-general] [PATCH] - rping - recv completion can occur before established event
Message-ID: <1144424203.7000.33.camel@stevo-desktop>

Sean,

Here's a patch to librdmacm/examples/rping.c to handle the case on the
server side where the first RECV completion happens before the connect
ESTABLISHED event. This can happen (and does :).

Signed-off-by: Steve Wise

Index: rping.c
===================================================================
--- rping.c	(revision 6229)
+++ rping.c	(working copy)
@@ -326,7 +326,7 @@
 	}
 
 	sem_wait(&cb->sem);
-	if (cb->state != CONNECTED) {
+	if (cb->state == ERROR) {
 		fprintf(stderr, "wait for CONNECTED state %d\n", cb->state);
 		return -1;
 	}

From mshefty at ichips.intel.com Fri Apr 7 09:29:26 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 07 Apr 2006 09:29:26 -0700
Subject: [openib-general] Re: [PATCH] - rping - recv completion can occur before established event
In-Reply-To: <1144424203.7000.33.camel@stevo-desktop>
References: <1144424203.7000.33.camel@stevo-desktop>
Message-ID: <44369366.2080806@ichips.intel.com>

Steve Wise wrote:
> Here's a patch to librdmacm/examples/rping.c to handle the case on the
> server side where the first RECV completion happens before the connect
> ESTABLISHED event. This can happen (and does :).
Do you want to just commit this? - Sean From swise at opengridcomputing.com Fri Apr 7 09:33:13 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 07 Apr 2006 11:33:13 -0500 Subject: [openib-general] Re: [PATCH] - rping - recv completion can occur before established event In-Reply-To: <44369366.2080806@ichips.intel.com> References: <1144424203.7000.33.camel@stevo-desktop> <44369366.2080806@ichips.intel.com> Message-ID: <1144427593.7000.40.camel@stevo-desktop> Ok. Committed under revision 6314. Thanks, Steve. On Fri, 2006-04-07 at 09:29 -0700, Sean Hefty wrote: > Steve Wise wrote: > > Here's a patch to librdmacm/examples/rping.c to handle the case on the > > server side where the first RECV completion happens before the connect > > ESTABLISHED event. This can happen (and does :). > > Do you want to just commit this? > > - Sean From ftillier at silverstorm.com Fri Apr 7 10:08:33 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Fri, 7 Apr 2006 10:08:33 -0700 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <20060406144221.GA13416@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060406132512.GF16153@granada.merseine.nu> <20060406133833.GP21115@mellanox.co.il> <20060406134212.GG16153@granada.merseine.nu> <20060406140407.GR21115@mellanox.co.il> <79ae2f320604060727w228f0ca3h3321bb48960706b@mail.gmail.com> <20060406144221.GA13416@mellanox.co.il> Message-ID: <79ae2f320604071008k3098d005mce453e537f908f76@mail.gmail.com> On 4/6/06, Michael S. Tsirkin wrote: > Quoting r. Fabian Tillier : > > Subject: Re: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths > > > > On 4/6/06, Michael S. Tsirkin wrote: > > > Quoting r. Muli Ben-Yehuda : > > > > Subject: Re: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths > > > > > > > > On Thu, Apr 06, 2006 at 04:38:33PM +0300, Michael S. 
Tsirkin wrote:
> > > >
> > > > > No, since we are keeping a callback pointer into that module.
> > > >
> > > > Sorry if I'm being dense but I don't see it in this patch. Point me at
> > > > it?
> > >
> > > You don't see it in the patch because SA already kept a callback pointer -
> > > that's the race I'm solving. Look in sa_query.c
> > >
> > >
> > > If I have
> > >
> > > struct query {
> > > 	void (*callback)();
> > > 	struct module *owner;
> > > }
> > >
> > > Then it is always safe to do
> > >
> > > __get_module(query->owner);
> > > query->callback();
> > > put_module(query->owner);
> > >
> > > since it is the called module's responsibility to invalidate
> > > all such query objects before it's unloaded.
> > >
> > Wait, why are you doing __get_module just before the callback? This
> > leaves the possibility of crashing - sure, you'll detect that things
> > went wrong, but you haven't solved the issue. The whole point of the
> > reference is to prevent the crash.
>
> No, this problem is solved today in all ULPs by caller polling on flag
> (completion) that callback sets: all ULPs do this already, since the need to
> track resources irrespective of module being unloaded.
>
> > You need to call __get_module from the context of the caller making the request.
>
> No, this would prevent module from being unloaded for extended periods
> of time - we don't want this.

I don't get it - as long as a query is outstanding (i.e. there will be
a callback invoked at some point in the future), it's unsafe for the
module to unload. Taking a reference on the module when the request
is issued, rather than just before the callback is invoked, should
guarantee that __get_module succeeds (since you're in the context of
the caller). Right now you're requiring the user to keep a parallel
reference count, when you could just handle it for them.
It would also be completely foolproof - there would be no way for a
module to go away while a callback is outstanding, whether or not the
callback has started executing.

> All I must prevent is problem we missed previously: module being unloaded
> while callback is in progress.

No, you must prevent the module from being unloaded while a callback
is *outstanding*, which includes in progress but is a broader scope.

Using __get_module at the time the request is issued eliminates the
possibility of __try_get_module failing right before you invoke the
callback (though in practice __try_get_module would be eliminated
since you'd already hold a reference).

- Fab

From kjreilly at us.ibm.com Fri Apr 7 10:43:19 2006
From: kjreilly at us.ibm.com (Kevin Reilly)
Date: Fri, 7 Apr 2006 13:43:19 -0400
Subject: [openib-general] Include patch for IPoIB queue size tuning into the release 1.0 branch
Message-ID: 

Hi Bryan,

Can you please include this patch, submitted by Shirley Ma, that went into
the main openIB trunk in revision 6255, into the release 1.0 branch.

Shirley Ma xma at us.ibm.com
Wed Apr 5 10:46:39 PDT 2006
[openib-general] Re: [PATCH] repost: IPoIB queue size tune patch
------------------------------------------------------------------------

Hello Roland,

I have been working hard on this patch. Do you think it is ready to be
merged?

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://openib.org/pipermail/openib-general/attachments/20060405/37964155/attachment.html

Kevin J.
Reilly STSM, HPC Architecture -Federation/HPS Chief Engineer -HPC interconnect architect (office) 845-433-7976 (tieline) 8-293-7976 From xma at us.ibm.com Fri Apr 7 10:52:48 2006 From: xma at us.ibm.com (Shirley Ma) Date: Fri, 7 Apr 2006 10:52:48 -0700 Subject: [openib-general] Re: Include patch for IPoIB queue size tuning into the release 1.0 branch In-Reply-To: Message-ID: Hello Bryan, The final patch is available: svn di -r 6254:6255 https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband >|queue_tune.patch Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Fri Apr 7 11:08:46 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Apr 2006 11:08:46 -0700 Subject: [openib-general] Re: RE: Re: [PATCH] ipoib_flush_paths In-Reply-To: <79ae2f320604071008k3098d005mce453e537f908f76@mail.gmail.com> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060406132512.GF16153@granada.merseine.nu> <20060406133833.GP21115@mellanox.co.il> <20060406134212.GG16153@granada.merseine.nu> <20060406140407.GR21115@mellanox.co.il> <79ae2f320604060727w228f0ca3h3321bb48960706b@mail.gmail.com> <20060406144221.GA13416@mellanox.co.il> <79ae2f320604071008k3098d005mce453e537f908f76@mail.gmail.com> Message-ID: <4436AAAE.6010900@ichips.intel.com> Fabian Tillier wrote: > No, you must prevent the module from being unloaded while a callback > is *outstanding*, which includes in progress but is a broader scope. We need to prevent module unload from completing while a callback is outstanding, but not necessarily from starting. If we took a reference up front, can we still initiate unloading the module? 
- Sean

From mshefty at ichips.intel.com Fri Apr 7 11:46:07 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 07 Apr 2006 11:46:07 -0700
Subject: [openib-general] commit change to IB CM for module reference fix
Message-ID: <4436B36F.7090003@ichips.intel.com>

Roland,

This is the commit for the IB CM. Can you please make sure that this
change gets merged upstream? I'll send a separate update for ib_addr and
the rdma_cm.

- Sean

---

Add an owner field to cm_id's so that the CM module can take a reference
on modules that it is about to call back into. This avoids races where
a module's callback function is still running as the module is unloaded.

This patch puts the passing of THIS_MODULE in inline wrappers, so the
externally visible API remains unchanged.

Signed-off-by: Sean Hefty

Modified: gen2/trunk/src/linux-kernel/infiniband/core/cm.c
===================================================================
--- gen2/trunk/src/linux-kernel/infiniband/core/cm.c	2006-04-07 18:45:24 UTC (rev 6323)
+++ gen2/trunk/src/linux-kernel/infiniband/core/cm.c	2006-04-07 18:51:35 UTC (rev 6324)
@@ -118,6 +118,7 @@
 
 struct cm_id_private {
 	struct ib_cm_id id;
+	struct module *owner;
 
 	struct rb_node service_node;
 	struct rb_node sidr_id_node;
@@ -590,9 +591,9 @@
 	ib_send_cm_sidr_rep(&cm_id_priv->id, &param);
 }
 
-struct ib_cm_id *ib_create_cm_id(struct ib_device *device,
+struct ib_cm_id *__ib_create_cm_id(struct ib_device *device,
 				 ib_cm_handler cm_handler,
-				 void *context)
+				 void *context, struct module *owner)
 {
 	struct cm_id_private *cm_id_priv;
 	int ret;
@@ -601,6 +602,7 @@
 	if (!cm_id_priv)
 		return ERR_PTR(-ENOMEM);
 
+	cm_id_priv->owner = owner;
 	cm_id_priv->id.state = IB_CM_IDLE;
 	cm_id_priv->id.device = device;
 	cm_id_priv->id.cm_handler = cm_handler;
@@ -621,7 +623,7 @@
 	kfree(cm_id_priv);
 	return ERR_PTR(-ENOMEM);
 }
-EXPORT_SYMBOL(ib_create_cm_id);
+EXPORT_SYMBOL(__ib_create_cm_id);
 
 static struct cm_work * cm_dequeue_work(struct cm_id_private *cm_id_priv)
 {
@@ -1151,6 +1153,18 @@
work->cm_event.private_data = &req_msg->private_data; } +static int invoke_cm_handler(struct cm_id_private *cm_id_priv, + struct ib_cm_event *event) +{ + int ret; + + BUG_ON(!try_module_get(cm_id_priv->owner)); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, event); + module_put(cm_id_priv->owner); + + return ret; +} + static void cm_process_work(struct cm_id_private *cm_id_priv, struct cm_work *work) { @@ -1158,7 +1172,7 @@ int ret; /* We will typically only have the current event to report. */ - ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->cm_event); + ret = invoke_cm_handler(cm_id_priv, &work->cm_event); cm_free_work(work); while (!ret && !atomic_add_negative(-1, &cm_id_priv->work_count)) { @@ -1166,8 +1180,7 @@ work = cm_dequeue_work(cm_id_priv); spin_unlock_irqrestore(&cm_id_priv->lock, flags); BUG_ON(!work); - ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, - &work->cm_event); + ret = invoke_cm_handler(cm_id_priv, &work->cm_event); cm_free_work(work); } cm_deref_id(cm_id_priv); @@ -1357,6 +1370,7 @@ goto error2; } + cm_id_priv->owner = listen_cm_id_priv->owner; cm_id_priv->id.cm_handler = listen_cm_id_priv->id.cm_handler; cm_id_priv->id.context = listen_cm_id_priv->id.context; cm_id_priv->id.service_id = req_msg->service_id; @@ -2729,6 +2743,7 @@ atomic_inc(&cur_cm_id_priv->refcount); spin_unlock_irqrestore(&cm.lock, flags); + cm_id_priv->owner = cur_cm_id_priv->owner; cm_id_priv->id.cm_handler = cur_cm_id_priv->id.cm_handler; cm_id_priv->id.context = cur_cm_id_priv->id.context; cm_id_priv->id.service_id = sidr_req_msg->service_id; @@ -2897,7 +2912,7 @@ cm_event.param.send_status = wc_status; /* No other events can occur on the cm_id at this point. 
*/ - ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &cm_event); + ret = invoke_cm_handler(cm_id_priv, &cm_event); cm_free_msg(msg); if (ret) ib_destroy_cm_id(&cm_id_priv->id); Modified: gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_cm.h =================================================================== --- gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_cm.h 2006-04-07 18:45:24 UTC (rev 6323) +++ gen2/trunk/src/linux-kernel/infiniband/include/rdma/ib_cm.h 2006-04-07 18:51:35 UTC (rev 6324) @@ -292,6 +292,10 @@ u32 remote_cm_qpn; /* 1 unless redirected */ }; +struct ib_cm_id *__ib_create_cm_id(struct ib_device *device, + ib_cm_handler cm_handler, + void *context, struct module *owner); + /** * ib_create_cm_id - Allocate a communication identifier. * @device: Device associated with the cm_id. All related communication will @@ -303,9 +307,12 @@ * Communication identifiers are used to track connection states, service * ID resolution requests, and listen requests. */ -struct ib_cm_id *ib_create_cm_id(struct ib_device *device, - ib_cm_handler cm_handler, - void *context); +static inline struct ib_cm_id *ib_create_cm_id(struct ib_device *device, + ib_cm_handler cm_handler, + void *context) +{ + return __ib_create_cm_id(device, cm_handler, context, THIS_MODULE); +} /** * ib_destroy_cm_id - Destroy a connection identifier. _______________________________________________ openib-commits mailing list openib-commits at openib.org http://openib.org/mailman/listinfo/openib-commits From sean.hefty at intel.com Fri Apr 7 12:08:21 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 7 Apr 2006 12:08:21 -0700 Subject: [openib-general] [RFC] [PATCH] SA query: expose retries through API In-Reply-To: Message-ID: Currently, the SA query interface does not permit retrying requests automatically. Expose this capability to take advantage of underlying MAD layer API, which provides it basically for free because of RMPP. 
Without automatic retries pushed down into the SA query module, retries are assigned new TIDs, and appear as separate requests. This means that a delayed response will be dropped, and the remote side will not detect that the request is a duplicate, so will re-calculate the response. Signed-off-by: Sean Hefty --- This version applies on top of the module reference fix. Index: include/rdma/ib_sa.h =================================================================== --- include/rdma/ib_sa.h (revision 6322) +++ include/rdma/ib_sa.h (working copy) @@ -257,7 +257,7 @@ void ib_sa_cancel_query(int id, struct i int __ib_sa_path_rec_get(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_path_rec *resp, void *context), @@ -269,7 +269,7 @@ int __ib_sa_mcmember_rec_query(struct ib u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_mcmember_rec *resp, void *context), @@ -281,7 +281,7 @@ int __ib_sa_service_rec_query(struct ib_ u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_service_rec *resp, void *context), @@ -296,6 +296,7 @@ int __ib_sa_service_rec_query(struct ib_ * @rec:Path Record to send in query * @comp_mask:component mask to send in query * @timeout_ms:time to wait for response + * @retries:number of times to retry a request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when query completes, times out or is * canceled @@ -317,7 +318,7 @@ static inline int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t 
gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_path_rec *resp, void *context), @@ -325,7 +326,7 @@ ib_sa_path_rec_get(struct ib_device *dev struct ib_sa_query **sa_query) { return __ib_sa_path_rec_get(device, port_num, rec, comp_mask, - timeout_ms, gfp_mask, callback, + timeout_ms, retries, gfp_mask, callback, context, THIS_MODULE, sa_query); } @@ -336,6 +337,7 @@ ib_sa_path_rec_get(struct ib_device *dev * @rec:MCMember Record to send in query * @comp_mask:component mask to send in query * @timeout_ms:time to wait for response + * @retries:number of times to retry a request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when query completes, times out or is * canceled @@ -357,7 +359,7 @@ static inline int ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_mcmember_rec *resp, void *context), @@ -367,7 +369,7 @@ ib_sa_mcmember_rec_set(struct ib_device return __ib_sa_mcmember_rec_query(device, port_num, IB_MGMT_METHOD_SET, rec, comp_mask, - timeout_ms, gfp_mask, callback, + timeout_ms, retries, gfp_mask, callback, context, THIS_MODULE, query); } @@ -378,6 +380,7 @@ ib_sa_mcmember_rec_set(struct ib_device * @rec:MCMember Record to send in query * @comp_mask:component mask to send in query * @timeout_ms:time to wait for response + * @retries:number of times to retry a request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when query completes, times out or is * canceled @@ -399,7 +402,7 @@ static inline int ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_mcmember_rec 
*resp, void *context), @@ -409,7 +412,7 @@ ib_sa_mcmember_rec_delete(struct ib_devi return __ib_sa_mcmember_rec_query(device, port_num, IB_SA_METHOD_DELETE, rec, comp_mask, - timeout_ms, gfp_mask, callback, + timeout_ms, retries, gfp_mask, callback, context, THIS_MODULE, query); } @@ -421,6 +424,7 @@ ib_sa_mcmember_rec_delete(struct ib_devi * @rec:Service Record to send in request * @comp_mask:component mask to send in request * @timeout_ms:time to wait for response + * @retries:number of times to retry a request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when request completes, times out or is * canceled @@ -443,7 +447,7 @@ static inline int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_service_rec *resp, void *context), @@ -451,7 +455,7 @@ ib_sa_service_rec_query(struct ib_device struct ib_sa_query **sa_query) { return __ib_sa_service_rec_query(device, port_num, method, rec, - comp_mask, timeout_ms, gfp_mask, + comp_mask, timeout_ms, retries, gfp_mask, callback, context, THIS_MODULE, sa_query); } Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 6322) +++ core/sa_query.c (working copy) @@ -483,7 +483,7 @@ static void init_mad(struct ib_sa_mad *m spin_unlock_irqrestore(&tid_lock, flags); } -static int send_mad(struct ib_sa_query *query, int timeout_ms) +static int send_mad(struct ib_sa_query *query, int timeout_ms, int retries) { unsigned long flags; int ret, id; @@ -500,6 +500,7 @@ retry: return ret; query->mad_buf->timeout_ms = timeout_ms; + query->mad_buf->retries = retries; query->mad_buf->context[0] = query; query->id = id; @@ -556,6 +557,7 @@ static void ib_sa_path_rec_release(struc * @rec:Path Record to send in query * @comp_mask:component mask 
to send in query * @timeout_ms:time to wait for response + * @retries:number of times to retry a request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when query completes, times out or is * canceled @@ -576,7 +578,7 @@ static void ib_sa_path_rec_release(struc int __ib_sa_path_rec_get(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_path_rec *resp, void *context), @@ -627,7 +629,7 @@ int __ib_sa_path_rec_get(struct ib_devic *sa_query = &query->sa_query; - ret = send_mad(&query->sa_query, timeout_ms); + ret = send_mad(&query->sa_query, timeout_ms, retries); if (ret < 0) goto err2; @@ -673,6 +675,7 @@ static void ib_sa_service_rec_release(st * @rec:Service Record to send in request * @comp_mask:component mask to send in request * @timeout_ms:time to wait for response + * @retries:number of times to retry a request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when request completes, times out or is * canceled @@ -694,7 +697,7 @@ static void ib_sa_service_rec_release(st int __ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_service_rec *resp, void *context), @@ -751,7 +754,7 @@ int __ib_sa_service_rec_query(struct ib_ *sa_query = &query->sa_query; - ret = send_mad(&query->sa_query, timeout_ms); + ret = send_mad(&query->sa_query, timeout_ms, retries); if (ret < 0) goto err2; @@ -793,7 +796,7 @@ int __ib_sa_mcmember_rec_query(struct ib u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct 
ib_sa_mcmember_rec *resp, void *context), @@ -845,7 +848,7 @@ int __ib_sa_mcmember_rec_query(struct ib *sa_query = &query->sa_query; - ret = send_mad(&query->sa_query, timeout_ms); + ret = send_mad(&query->sa_query, timeout_ms, retries); if (ret < 0) goto err2; Index: ulp/srp/ib_srp.c =================================================================== --- ulp/srp/ib_srp.c (revision 6322) +++ ulp/srp/ib_srp.c (working copy) @@ -257,7 +257,7 @@ static int srp_lookup_path(struct srp_ta IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH | IB_SA_PATH_REC_PKEY, - SRP_PATH_REC_TIMEOUT_MS, + SRP_PATH_REC_TIMEOUT_MS, 0, GFP_KERNEL, srp_path_rec_completion, target, &target->path_query); Index: ulp/ipoib/ipoib_main.c =================================================================== --- ulp/ipoib/ipoib_main.c (revision 6322) +++ ulp/ipoib/ipoib_main.c (working copy) @@ -471,7 +471,7 @@ static int path_rec_start(struct net_dev IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH | IB_SA_PATH_REC_PKEY, - 1000, GFP_ATOMIC, + 1000, 0, GFP_ATOMIC, path_rec_completion, path, &path->query); if (path->query_id < 0) { Index: ulp/sdp/sdp_link.c =================================================================== --- ulp/sdp/sdp_link.c (revision 6322) +++ ulp/sdp/sdp_link.c (working copy) @@ -323,7 +323,7 @@ static void sdp_link_path_rec_done(int s IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH), - info->sa_time, + info->sa_time, 0, GFP_KERNEL, sdp_link_path_rec_done, info, @@ -359,7 +359,7 @@ static int sdp_link_path_rec_get(struct IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH), - info->sa_time, + info->sa_time, 0, GFP_KERNEL, sdp_link_path_rec_done, info, Index: core/cma.c =================================================================== --- core/cma.c (revision 6322) +++ core/cma.c (working copy) @@ -1064,7 +1064,7 @@ static int cma_query_ib_route(struct rdm id_priv->id.port_num, &path_rec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | 
IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, - timeout_ms, GFP_KERNEL, + timeout_ms, 0, GFP_KERNEL, cma_query_handler, work, &id_priv->query); return (id_priv->query_id < 0) ? id_priv->query_id : 0; Index: core/at.c =================================================================== --- core/at.c (revision 6322) +++ core/at.c (working copy) @@ -216,7 +216,7 @@ static void ib_dev_ats_op(struct ib_at_d op, rec, mask, - IB_AT_REQ_RETRY_MS, + IB_AT_REQ_RETRY_MS, 0, GFP_KERNEL, ats_op_complete, ib_dev, @@ -1118,7 +1118,7 @@ static int resolve_ats_ips(struct ats_ip IB_MGMT_METHOD_GET, rec, IB_ATS_GET_PRIM_IP_MASK, - req->pend.timeout_ms, + req->pend.timeout_ms, 0, GFP_KERNEL, ats_ips_req_complete, req, @@ -1163,7 +1163,7 @@ static int resolve_ats_route(struct rout IB_MGMT_METHOD_GET, rec, IB_ATS_GET_GID_MASK, - req->pend.timeout_ms, + req->pend.timeout_ms, 0, GFP_KERNEL, ats_route_req_complete, req, @@ -1226,7 +1226,7 @@ static int resolve_path(struct path_req IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH), - req->pend.timeout_ms, + req->pend.timeout_ms, 0, GFP_KERNEL, path_req_complete, req, From rdreier at cisco.com Fri Apr 7 12:35:08 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 07 Apr 2006 12:35:08 -0700 Subject: [openib-general] Include patch for IPoIB queue size tuning into the release 1.0 branch In-Reply-To: (Kevin Reilly's message of "Fri, 7 Apr 2006 13:43:19 -0400") References: Message-ID: Kevin> Hi Byran, Can you please include this patch submitted by Kevin> Shirley Ma that went into the main openIB trunk in revision Kevin> 6255 into the release 1.0 branch. As we've discussed, OpenIB is not releasing kernel components -- that is Linus's job. So it doesn't really make sense to talk about this patch in relation to release 1.0, since the IPoIB kernel driver is not part of that release. - R. 
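As an illustration of the `retries` argument that the SA query patch above threads through to send_mad(), here is a minimal userspace sketch. It is not the kernel implementation. It assumes that `retries` counts re-sends after the initial attempt, so a request goes out at most `retries + 1` times and each attempt waits up to `timeout_ms`. The names `struct fake_fabric`, `fake_send` and `query_with_retries` are hypothetical and do not exist in the OpenIB tree.

```c
/* Hypothetical stand-in for a fabric that drops the first `drop` requests. */
struct fake_fabric {
	int drop;   /* number of initial sends that never get a response */
	int sends;  /* total number of sends observed so far */
};

/*
 * Pretend to send one request and wait up to timeout_ms for a response.
 * Returns 0 if a response arrived, -1 on timeout.  The timeout value is
 * not modelled here; only the number of attempts is.
 */
static int fake_send(struct fake_fabric *f, int timeout_ms)
{
	(void)timeout_ms;
	f->sends++;
	return (f->sends > f->drop) ? 0 : -1;
}

/*
 * One initial attempt plus up to `retries` re-sends (assumed semantics).
 * Returns 0 on the first successful attempt, -1 if every attempt timed out.
 */
static int query_with_retries(struct fake_fabric *f, int timeout_ms,
			      int retries)
{
	int attempt;

	for (attempt = 0; attempt <= retries; attempt++) {
		if (fake_send(f, timeout_ms) == 0)
			return 0;
	}
	return -1;
}
```

Under that assumption, passing 0 (as the updated SRP, IPoIB, SDP, CMA and AT callers in the patch do) keeps the old single-attempt behaviour, while a caller that passes 3 against a fabric that drops the first two requests would succeed on the third send.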
From mshefty at ichips.intel.com Fri Apr 7 12:39:43 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Apr 2006 12:39:43 -0700 Subject: [openib-general] switch from svn to git Message-ID: <4436BFFF.7080008@ichips.intel.com> I wanted to start a discussion about migrating the openib code repository from svn to git. - Sean From bos at pathscale.com Fri Apr 7 12:51:03 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 07 Apr 2006 12:51:03 -0700 Subject: [openib-general] Include patch for IPoIB queue size tuning into the release 1.0 branch In-Reply-To: References: Message-ID: <1144439463.14694.129.camel@chalcedony.pathscale.com> On Fri, 2006-04-07 at 12:35 -0700, Roland Dreier wrote: > Kevin> Hi Byran, Can you please include this patch submitted by > Kevin> Shirley Ma that went into the main openIB trunk in revision > Kevin> 6255 into the release 1.0 branch. > > As we've discussed, OpenIB is not releasing kernel components -- that > is Linus's job. So it doesn't really make sense to talk about this > patch in relation to release 1.0, since the IPoIB kernel driver is not > part of that release. IBM's been building their testing bits against the kernel code that's in the 1.0 branch, which I believe is the origin for this request. It costs us nothing to put the change in, so I see no problem with it. References: <4436BFFF.7080008@ichips.intel.com> Message-ID: <1144439895.14694.139.camel@chalcedony.pathscale.com> On Fri, 2006-04-07 at 12:39 -0700, Sean Hefty wrote: > I wanted to start a discussion about migrating the openib code repository from > svn to git. I'm not very open to using git; it has a horrible user interface. I'd much prefer to see a switch to something cleaner, specifically Mercurial. 
(Bryan O'Sullivan's message of "Fri, 07 Apr 2006 12:51:03 -0700") References: <1144439463.14694.129.camel@chalcedony.pathscale.com> Message-ID: Bryan> IBM's been building their testing bits against the kernel Bryan> code that's in the 1.0 branch, which I believe is the Bryan> origin for this request. It costs us nothing to put the Bryan> change in, so I see no problem with it. Well, nothing except for the time to maintain the kernel code on that branch in addition to the svn trunk and the upstream kernel. But if you're willing to do it, then that's fine by me. - R. From rdreier at cisco.com Fri Apr 7 13:00:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 07 Apr 2006 13:00:58 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: <4436BFFF.7080008@ichips.intel.com> (Sean Hefty's message of "Fri, 07 Apr 2006 12:39:43 -0700") References: <4436BFFF.7080008@ichips.intel.com> Message-ID: Sean> I wanted to start a discussion about migrating the openib Sean> code repository from svn to git. I'm in favor of it, especially for the kernel code. However for userspace stuff I think we need to think carefully about how to lay things out. Multiple repositories (one for each "project") makes the most sense to me, but there is always strong resistance to breaking up the monolith we have now. - R. From rminnich at lanl.gov Fri Apr 7 12:54:38 2006 From: rminnich at lanl.gov (Ronald G Minnich) Date: Fri, 07 Apr 2006 13:54:38 -0600 Subject: [openib-general] switch from svn to git In-Reply-To: <1144439895.14694.139.camel@chalcedony.pathscale.com> References: <4436BFFF.7080008@ichips.intel.com> <1144439895.14694.139.camel@chalcedony.pathscale.com> Message-ID: <4436C37E.90106@lanl.gov> Bryan O'Sullivan wrote: > On Fri, 2006-04-07 at 12:39 -0700, Sean Hefty wrote: > >>I wanted to start a discussion about migrating the openib code repository from >>svn to git. > > > I'm not very open to using git; it has a horrible user interface. 
I'd > much prefer to see a switch to something cleaner, specifically > Mercurial. first read andrew morton's writeup on why he does not use git. I don't particularly like git. We moved to svn for linuxbios a while back and it has been fine. xen uses mercurial and they are happy with it. But do we need multiple repo capability or not? What's missing in svn that you don't like? thanks ron From bos at pathscale.com Fri Apr 7 13:04:07 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 07 Apr 2006 13:04:07 -0700 Subject: [openib-general] Include patch for IPoIB queue size tuning into the release 1.0 branch In-Reply-To: References: <1144439463.14694.129.camel@chalcedony.pathscale.com> Message-ID: <1144440247.14694.141.camel@chalcedony.pathscale.com> On Fri, 2006-04-07 at 12:59 -0700, Roland Dreier wrote: > Well, nothing except for the time to maintain the kernel code on that > branch in addition to the svn trunk and the upstream kernel. But if > you're willing to do it, then that's fine by me. As a matter of courtesy and convenience, and provided it doesn't take much of my time, I have no problem with doing it. (Bryan O'Sullivan's message of "Fri, 07 Apr 2006 12:58:15 -0700") References: <4436BFFF.7080008@ichips.intel.com> <1144439895.14694.139.camel@chalcedony.pathscale.com> Message-ID: Bryan> I'm not very open to using git; it has a horrible user Bryan> interface. I'd much prefer to see a switch to something Bryan> cleaner, specifically Mercurial. I think mercurial lost the mindshare battle unfortunately, and git is getting further ahead at an accelerating pace. And for kernel development especially, being able to merge from a git tree directly with Linus is really convenient. If you don't like the core git UI ("porcelain") then there are plenty of others to choose from -- cogito and StGit are notable examples. I use StGit myself and it's really ideal for my workflow. - R. 
From rdreier at cisco.com Fri Apr 7 13:08:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 07 Apr 2006 13:08:03 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: <4436C37E.90106@lanl.gov> (Ronald G. Minnich's message of "Fri, 07 Apr 2006 13:54:38 -0600") References: <4436BFFF.7080008@ichips.intel.com> <1144439895.14694.139.camel@chalcedony.pathscale.com> <4436C37E.90106@lanl.gov> Message-ID: Ronald> What's missing in svn that you don't like? Merging is pretty bad, which makes developing something on a branch and then landing it on the trunk much harder than it should be. - R. From bos at pathscale.com Fri Apr 7 13:09:14 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 07 Apr 2006 13:09:14 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: <4436C37E.90106@lanl.gov> References: <4436BFFF.7080008@ichips.intel.com> <1144439895.14694.139.camel@chalcedony.pathscale.com> <4436C37E.90106@lanl.gov> Message-ID: <1144440555.14694.146.camel@chalcedony.pathscale.com> On Fri, 2006-04-07 at 13:54 -0600, Ronald G Minnich wrote: > first read andrew morton's writeup on why he does not use git. Are you talking about this? http://www.kerneltraffic.org/kernel-traffic/kt20050605_311.html#2 If so, I don't see that it has any relevance. > xen uses mercurial and they are happy with it. But do we need multiple > repo capability or not? For my work, I find it tremendously useful. > What's missing in svn that you don't like? Every single operation except diff and status requires that I talk to the server. This makes trying to deal with the entirety of a large tree (which, as release coordinator, is almost all I do) very very slow. 
Couple that with Subversion's lack of meaningful support for branches and merges (something else that I do all the time) and you begin to get the picture :-( References: <4436BFFF.7080008@ichips.intel.com> <1144439895.14694.139.camel@chalcedony.pathscale.com> Message-ID: <1144440640.14694.149.camel@chalcedony.pathscale.com> On Fri, 2006-04-07 at 13:06 -0700, Roland Dreier wrote: > I think mercurial lost the mindshare battle unfortunately, and git is > getting further ahead at an accelerating pace. I don't see much evidence for either of those, to be honest. > And for kernel > development especially, being able to merge from a git tree directly > with Linus is really convenient. But that has little to do with OpenIB. > If you don't like the core git UI ("porcelain") then there are plenty > of others to choose from -- cogito and StGit are notable examples. I > use StGit myself and it's really ideal for my workflow. I use Mercurial Queues to do the same stuff as StGit, and it's a lot faster. (Bryan O'Sullivan's message of "Fri, 07 Apr 2006 13:10:40 -0700") References: <4436BFFF.7080008@ichips.intel.com> <1144439895.14694.139.camel@chalcedony.pathscale.com> <1144440640.14694.149.camel@chalcedony.pathscale.com> Message-ID: Roland> And for kernel development especially, being able to merge Roland> from a git tree directly with Linus is really convenient. Bryan> But that has little to do with OpenIB. If we're going to split kernel driver development out of OpenIB and have OpenIB just focus on userspace stuff, then maybe the most sensible thing to do is let each component pick its own source code control strategy. That's what X.org is doing. I really doubt we'll be able to reach consensus on a single system. - R. 
From mshefty at ichips.intel.com Fri Apr 7 14:15:37 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Apr 2006 14:15:37 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: <1144439895.14694.139.camel@chalcedony.pathscale.com> References: <4436BFFF.7080008@ichips.intel.com> <1144439895.14694.139.camel@chalcedony.pathscale.com> Message-ID: <4436D679.3030503@ichips.intel.com> Bryan O'Sullivan wrote: > I'm not very open to using git; it has a horrible user interface. I'd > much prefer to see a switch to something cleaner, specifically > Mercurial. I don't see the core git interface being any worse than what we have for svn. I mentioned git specifically to simplify merging kernel code upstream. - Sean From rdreier at cisco.com Fri Apr 7 14:26:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 07 Apr 2006 14:26:58 -0700 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: <1144410107.19061.824.camel@hal.voltaire.com> (Hal Rosenstock's message of "07 Apr 2006 07:41:49 -0400") References: <1144410107.19061.824.camel@hal.voltaire.com> Message-ID: Hal> The simplest way I can think of to add this from an ABI/API Hal> perspective is to add an ioctl to user_mad for this. Prior to Hal> the kernel supporting this, libibumad can just return not Hal> supported for the "is dual sided" check for the current Hal> user_mad ABI version (which is 5). Anyone have any better Hal> ideas on how to accomplish this ? Maybe it would be simpler to add a flag to the agent registration request structure? If you wanted to, you could even steal the high order bit of the QPN field and do this in a backwards compatible way. Hal> Also, there is the question of what should the existing RMPP Hal> code do if it receives dual sided RMPP request on the network Hal> side. It could ABORT this request although RMPP currently has Hal> no specific status code for this. 
This is likely an issue Hal> with any other dual sided RMPP implementations and there Hal> would be no guarantee of interoperability here. It's unclear Hal> what RMPP implementations which do not support dual sided Hal> would do: likely just handle it as if it weren't dual sided. I don't see how we can change the behavior of the existing code without a time machine -- it does whatever it does... - R. From mshefty at ichips.intel.com Fri Apr 7 14:28:14 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Apr 2006 14:28:14 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: References: <4436BFFF.7080008@ichips.intel.com> Message-ID: <4436D96E.8000302@ichips.intel.com> Roland Dreier wrote: > I'm in favor of it, especially for the kernel code. > > However for userspace stuff I think we need to think carefully about > how to lay things out. Multiple repositories (one for each "project") > makes the most sense to me, but there is always strong resistance to > breaking up the monolith we have now. From an organizational perspective, my preference is to use a single source control tool. But, if I can get the libraries that I need to use without a source control tool, and provide patches using a standard diff technique, then I wouldn't care if different source controls were used. - Sean From bos at pathscale.com Fri Apr 7 14:31:28 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 07 Apr 2006 14:31:28 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: <4436D679.3030503@ichips.intel.com> References: <4436BFFF.7080008@ichips.intel.com> <1144439895.14694.139.camel@chalcedony.pathscale.com> <4436D679.3030503@ichips.intel.com> Message-ID: <1144445488.14694.157.camel@chalcedony.pathscale.com> On Fri, 2006-04-07 at 14:15 -0700, Sean Hefty wrote: > I don't see the core git interface being any worse than what we have for svn. Faint praise :-) > I mentioned git specifically to simplify merging kernel code upstream. 
But Roland is already using git for that, no? About half the revs on the trunk are against userspace, so choosing a tool based on kernel suitability doesn't seem completely appropriate. References: <1144410107.19061.824.camel@hal.voltaire.com> Message-ID: <1144445589.19061.7520.camel@hal.voltaire.com> On Fri, 2006-04-07 at 17:26, Roland Dreier wrote: > Hal> The simplest way I can think of to add this from an ABI/API > Hal> perspective is to add an ioctl to user_mad for this. Prior to > Hal> the kernel supporting this, libibumad can just return not > Hal> supported for the "is dual sided" check for the current > Hal> user_mad ABI version (which is 5). Anyone have any better > Hal> ideas on how to accomplish this ? > > Maybe it would be simpler to add a flag to the agent registration > request structure? If you wanted to, you could even steal the high > order bit of the QPN field and do this in a backwards compatible way. I thought about using the high order bit of the rmpp_version as this would never get to needing 8 bits. It seems this is not a per registration thing though although that approach would work. This has other minor downsides. Is this due to adverseness to ioctls ? > Hal> Also, there is the question of what should the existing RMPP > Hal> code do if it receives dual sided RMPP request on the network > Hal> side. It could ABORT this request although RMPP currently has > Hal> no specific status code for this. This is likely an issue > Hal> with any other dual sided RMPP implementations and there > Hal> would be no guarantee of interoperability here. It's unclear > Hal> what RMPP implementations which do not support dual sided > Hal> would do: likely just handle it as if it weren't dual sided. > > I don't see how we can change the behavior of the existing code > without a time machine -- it does whatever it does... I didn't explain this well. I was suggesting to make a minor modification to the existing RMPP code prior to dual sided being supported. 
-- Hal From rdreier at cisco.com Fri Apr 7 14:44:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 07 Apr 2006 14:44:45 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: <1144445488.14694.157.camel@chalcedony.pathscale.com> (Bryan O'Sullivan's message of "Fri, 07 Apr 2006 14:31:28 -0700") References: <4436BFFF.7080008@ichips.intel.com> <1144439895.14694.139.camel@chalcedony.pathscale.com> <4436D679.3030503@ichips.intel.com> <1144445488.14694.157.camel@chalcedony.pathscale.com> Message-ID: Bryan> But Roland is already using git for that, no? Yes, and it would be great if I didn't have to worry about keeping svn and my git tree in sync. - R. From mshefty at ichips.intel.com Fri Apr 7 14:51:36 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Apr 2006 14:51:36 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: <1144445488.14694.157.camel@chalcedony.pathscale.com> References: <4436BFFF.7080008@ichips.intel.com> <1144439895.14694.139.camel@chalcedony.pathscale.com> <4436D679.3030503@ichips.intel.com> <1144445488.14694.157.camel@chalcedony.pathscale.com> Message-ID: <4436DEE8.6070204@ichips.intel.com> Bryan O'Sullivan wrote: >>I mentioned git specifically to simplify merging kernel code upstream. > > But Roland is already using git for that, no? About half the revs on > the trunk are against userspace, so choosing a tool based on kernel > suitability doesn't seem completely appropriate. This also means that half the revs are against the kernel. And for kernel changes, git ends up working well, plus its an improvement over svn for userspace. I think part of the reason Roland wants each userspace maintainer to select their own tool is so that the maintainer can pick which tool is most suitable for their work flow. I don't have a strong attachment to git, it just hit me as the most natural replacement. 
- Sean From bos at pathscale.com Fri Apr 7 14:57:28 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 07 Apr 2006 14:57:28 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: <4436DEE8.6070204@ichips.intel.com> References: <4436BFFF.7080008@ichips.intel.com> <1144439895.14694.139.camel@chalcedony.pathscale.com> <4436D679.3030503@ichips.intel.com> <1144445488.14694.157.camel@chalcedony.pathscale.com> <4436DEE8.6070204@ichips.intel.com> Message-ID: <1144447048.14694.164.camel@chalcedony.pathscale.com> On Fri, 2006-04-07 at 14:51 -0700, Sean Hefty wrote: > I think part of the reason Roland wants each userspace maintainer to > select their own tool is so that the maintainer can pick which tool is most > suitable for their work flow. The "let a thousand flowers bloom" approach is going to be a *huge* pain in the ass for people like me who have to deal with every component. I already have to switch between three of them several times a day and use one or two others on a regular basis. Now I'll need a different set of working practices for every picayune component of OpenIB? Message-ID: <000001c65a8e$5d740be0$010fa8c0@amr.corp.intel.com> Roland wrote, >If we're going to split kernel driver development out of OpenIB and >have OpenIB just focus on userspace stuff, then maybe the most >sensible thing to do is let each component pick its own source code >control strategy. That's what X.org is doing. >I really doubt we'll be able to reach consensus on a single system. > - R. I really don't care what tool we use, but please don't start putting different modules into different source control tools. That would be a mess and make my life much harder. I think we already had this discussion a couple of months back and decided to stay with SVN. 
woody From rdreier at cisco.com Fri Apr 7 15:07:44 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 07 Apr 2006 15:07:44 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: <1144447048.14694.164.camel@chalcedony.pathscale.com> (Bryan O'Sullivan's message of "Fri, 07 Apr 2006 14:57:28 -0700") References: <4436BFFF.7080008@ichips.intel.com> <1144439895.14694.139.camel@chalcedony.pathscale.com> <4436D679.3030503@ichips.intel.com> <1144445488.14694.157.camel@chalcedony.pathscale.com> <4436DEE8.6070204@ichips.intel.com> <1144447048.14694.164.camel@chalcedony.pathscale.com> Message-ID: Bryan> The "let a thousand flowers bloom" approach is going to be Bryan> a *huge* pain in the ass for people like me who have to Bryan> deal with every component. I already have to switch Bryan> between three of them several times a day and use one or Bryan> two others on a regular basis. Now I'll need a different Bryan> set of working practices for every picayune component of Bryan> OpenIB? I think this may be due to a fundamental problem with how you've chosen to manage the process. I think it would be less work both for you and also the component maintainers if we made like Gnome and just said, "OK maintainers, tarballs for RC5 are due on such-and-such date." Then we avoid paying the extra cost of marching in lockstep on a single branch in a single repository. - R. From robert.j.woodruff at intel.com Fri Apr 7 15:16:45 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 7 Apr 2006 15:16:45 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: Message-ID: <000101c65a90$f4f141c0$010fa8c0@amr.corp.intel.com> Roland wrote, >I think this may be due to a fundamental problem with how you've >chosen to manage the process. I think it would be less work both for >you and also the component maintainers if we made like Gnome and just >said, "OK maintainers, tarballs for RC5 are due on such-and-such >date." 
Then we avoid paying the extra cost of marching in lockstep on >a single branch in a single repository. > - R. This approach assumes that no one wants to follow the latest version (tip) of the tree and that people only care about specific releases or RCs, which is not the case. woody From mshefty at ichips.intel.com Fri Apr 7 15:17:43 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Apr 2006 15:17:43 -0700 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: <1144410107.19061.824.camel@hal.voltaire.com> References: <1144410107.19061.824.camel@hal.voltaire.com> Message-ID: <4436E507.7000309@ichips.intel.com> Hal Rosenstock wrote: > It can be added but may require an API change and possibly an ABI > change. It seems that user space code needs to both say and know whether > dual sided RMPP is supported or not so all mixes of user space and > kernel code could "work". Do we really need to support these combinations? Does anything use the GetMulti method today? Is dual-RMPP used with anything other than MultiPathRecords? This is my understanding of what needs to happen to support dual-sided RMPP. Node A sends an RMPP message. This requires normal RMPP processing. Node A sends an ACK of the final ACK (I'll call ACK2), giving a new window. Node B receives ACKs. Node B sends the response. This requires normal RMPP processing. From the perspective of node A, the RMPP code only needs to know to send ACK2. It can do this based on the method, or per transaction if directed by the client. Node B is more complex. It must now wait for ACK2, using timeout and retries of ACK1 until ACK2 is received. And the response that will be generated by the client must be delayed until that ACK2 is received. For node B, it may be simpler to delay handing the request up to a client until ACK2 is received. The only information from ACK2 that's needed when sending the response is NewWindowLast. 
A client could be expected to give this back to the RMPP layer when sending the response. (A client that lied about NewWindowLast should only lead to sending some packets that would be dropped, with the transaction aborted.) So, if we always make the sender of an RMPP message specify NewWindowLast, with a default of 1 set when the MAD is allocated, then we can keep RMPP consistent. And we'd only be left handling ACK2. - Sean From mshefty at ichips.intel.com Fri Apr 7 15:21:46 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Apr 2006 15:21:46 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: <000101c65a90$f4f141c0$010fa8c0@amr.corp.intel.com> References: <000101c65a90$f4f141c0$010fa8c0@amr.corp.intel.com> Message-ID: <4436E5FA.8070704@ichips.intel.com> Bob Woodruff wrote: > This approach assumes that no one wants to follow the latest version (tip) > of the tree and that people only care about specific releases or RCs, > which is not the case. To be clear, I'm only suggesting that openfabrics consider using a different source control tool with better capabilities for handling branches and merges. And to me, the most natural replacement would be git. - Sean From robert.j.woodruff at intel.com Fri Apr 7 15:25:05 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 7 Apr 2006 15:25:05 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: <4436E5FA.8070704@ichips.intel.com> Message-ID: <000201c65a92$1ee3f9e0$010fa8c0@amr.corp.intel.com> Sean wrote, >To be clear, I'm only suggesting that openfabrics consider using a different >source control tool with better capabilities for handling branches and merges. >And to me, the most natural replacement would be git. >- Sean As I said before, I don't really care what the tool is, as long as it is all in the same one. 
woody From rdreier at cisco.com Fri Apr 7 16:25:16 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 07 Apr 2006 16:25:16 -0700 Subject: [openib-general] switch from svn to git In-Reply-To: <000101c65a90$f4f141c0$010fa8c0@amr.corp.intel.com> (Bob Woodruff's message of "Fri, 7 Apr 2006 15:16:45 -0700") References: <000101c65a90$f4f141c0$010fa8c0@amr.corp.intel.com> Message-ID: Bob> This approach assumes that no one wants to follow the latest Bob> version (tip) of the tree and that people only care about Bob> specific releases or RCs, which is not the case. I recognize the desire to be on the bleeding edge. But the fact is that any "tip" is an artificial construct -- there is no true linear ordering of development. If Sean is off developing CMA stuff, and Hal is off developing RMPP stuff, their two trees are not really comparable -- neither one is closer to the "tip". - R. From mshefty at ichips.intel.com Fri Apr 7 16:41:44 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Apr 2006 16:41:44 -0700 Subject: [openib-general] Re: librdmacm/ucma In-Reply-To: <20060406221630.GG15005@mellanox.co.il> References: <20060406213142.GE15005@mellanox.co.il> <20060406214655.GF15005@mellanox.co.il> <44359042.3030901@ichips.intel.com> <20060406221630.GG15005@mellanox.co.il> Message-ID: <4436F8B8.8060107@ichips.intel.com> Michael S. Tsirkin wrote: >>I'm fine doing that. The ABI versions can be reset to 1. > > Yes, let's do that, I've removed the ABI version hack from userspace, and set the ABI to 1. We'll start maintaining backwards compatibility once the RDMA CM has been merged upstream, or at least released through some other process. 
- Sean From mshefty at ichips.intel.com Fri Apr 7 17:02:02 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Apr 2006 17:02:02 -0700 Subject: [openib-general] Re: the cma is not in the for-2.6.18 branch of the git tree In-Reply-To: <443405C4.4090703@ichips.intel.com> References: <443405C4.4090703@ichips.intel.com> Message-ID: <4436FD7A.9030000@ichips.intel.com> Sean Hefty wrote: >> Yes, makes sense. I think there have been some updates and fixes to >> CMA code (loopback handling, etc). Sean, when you get a chance, can >> send me updates for the rdma_cm branch? Even a single rolled-up patch >> against the head of that branch is fine -- it's easy for me to split >> up and update the individual patches. > > I will send a patch against your tree at the end of this week. I'm having some sort of local network issue trying to pull the latest updates of your tree. I'll need to look at this again on Monday. - Sean From bos at pathscale.com Fri Apr 7 22:14:52 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 07 Apr 2006 22:14:52 -0700 Subject: [openib-general] OpenIB 1.0-rc2 available Message-ID: <1144473292.7801.2.camel@chalcedony.pathscale.com> The svn tag is here: https://openib.org/svn/gen2/tags/openib-1.0-rc2/ Tarballs: http://openib.red-bean.com/rc2/SOURCES/ RPM packages (fc4 and suse10 only so far, will add more next week): fc4 - http://openib.red-bean.com/rc2/fc4/ suse10 - http://openib.red-bean.com/rc2/suse10.0/ From devesh28 at gmail.com Fri Apr 7 22:31:22 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Sat, 8 Apr 2006 11:01:22 +0530 Subject: [openib-general] Question on : ib_reg_phys_mr() In-Reply-To: References: <309a667c0604070150sdf99ef7kfd81e2bbe45f8076@mail.gmail.com> Message-ID: <309a667c0604072231h170d9dfar828db22ccc278ec3@mail.gmail.com> Thanks James for quick reply, In your nfs-rdma context what this function is supposed to do? 
I know that this function returns a memory region, but how does it differ from the other MR-returning functions? Why can't get_dma_mr be used? Devesh On 4/7/06, James Lentini wrote: > > > > On Fri, 7 Apr 2006, Devesh Sharma wrote: > > > Hello list, > > In IB kernel verbs there is a function ib_reg_phys_mr(). > > I am not able to trace the call of this verb by any ULP or uverb. > > Who calls this function? > > NFS-RDMA uses this function: > > http://sourceforge.net/projects/nfs-rdma > > > Is this function mandatory to be supported by the HCA driver provider? > > As a ULP implementer, I expect it to be supported. It is a standard > IBTA verb. > From eli at mellanox.co.il Sat Apr 8 12:36:27 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Sat, 8 Apr 2006 22:36:27 +0300 Subject: [openib-general] RE: [PATCH] ipoib_cleanup_module Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F39C02@mtlexch01.mtl.com> > Not sure I follow this. What error could occur? It's more a theoretical problem than a real one. The ipoib workqueue is still running and could be using debugfs, which no longer exists. I know that in the current implementation there is no real problem, but it is cleaner to destroy in the opposite order of creation. From mst at mellanox.co.il Sat Apr 8 15:48:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 9 Apr 2006 01:48:41 +0300 Subject: [openib-general] Re: [PATCH] mad: use GID/LID on requester side when matching responses to requests In-Reply-To: <200604061451.18624.jackm@mellanox.co.il> References: <200604061451.18624.jackm@mellanox.co.il> Message-ID: <20060408224841.GA26452@mellanox.co.il> Quoting r. Jack Morgenstein : > Subject: [PATCH] mad: use GID/LID on requester side when matching responses to requests > > Check GID/LID for requester side when searching for request which matches > received response. 
This guarantees uniqueness if the same TID is used > when requesting via multiple source LIDs (when LMC is not zero). To perform the > check, the LMC needs to be added to the cache. > > (The previous patch returned OK unconditionally for the LID check when the check is > performed at the requesting node.) > > Further, do not perform the LID check for direct-routed packets, since the permissive > LID makes a proper check impossible. > > Signed-off-by: Jack Morgenstein Sean, could you take a look at this please? -- MST From eitan at mellanox.co.il Sun Apr 9 03:55:30 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 09 Apr 2006 13:55:30 +0300 Subject: [openib-general] IPoIB interface for unauthorized partition In-Reply-To: <1144082680.4480.44271.camel@hal.voltaire.com> References: <1144082680.4480.44271.camel@hal.voltaire.com> Message-ID: <4438E822.3070503@mellanox.co.il> Hi Roland, Hal, Regarding the usage of P_Key values in setting up IPoIB interfaces: I thought the intent of the IB spec when defining P_Key index usage (and not P_Key value) was that the P_Key values would never need to be known above the driver level. To avoid exposing the P_Key values we could use the P_Key index for creating the IPoIB interfaces. Does it make sense to work on a patch that would set up IPoIB interfaces by the P_Key index (and not by P_Key value)? Also, I think the expected behavior for IPoIB should be that IPoIB "child" interfaces are "automatically" initialized by the code that brings up the interface (ifconfig scripts). All valid IPoIB partitions (valid = have corresponding broadcast groups) should be initialized. By doing so we provide centralized control of the partitions and their IPoIB interfaces through the SM. Please advise. Eitan Hal Rosenstock wrote: > Hi Roland, > > I have a port which only has the full default partition configured but > ifconfig allows an IPoIB interface with a PKey which is not in the Pkey > table. 
Shouldn't the ifconfig fail for this (rather than the subsequent > ping) ? > > -- Hal > > smpquery pkeys 1 1 > 0: 0xffff 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 > 8: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 > 16: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 > 24: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 > 32: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 > 40: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 > 48: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 > 56: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 > 64 pkeys capacity for this port > > echo 0x8001 > /sys/class/net/ib0/create_child > > /sbin/ifconfig ib0.8001 192.168.2.1 > > ping -b 192.168.2.255 > WARNING: pinging broadcast address > PING 192.168.2.255 (192.168.2.255) 56(84) bytes of data. > From jackm at mellanox.co.il Sun Apr 9 05:46:06 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Sun, 9 Apr 2006 15:46:06 +0300 Subject: [openib-general] RE: [PATCH] static rate encoding changes Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD2F96@mtlexch01.mtl.com> Looks OK to me. Your main change seems to be to shift the static-rate manipulations to query-dev-lims (adjusting the stat_rate_supported bitmap). That is fine with me (and maybe a bit cleaner, too). - Jack -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Wednesday, April 05, 2006 1:59 AM To: Michael S. Tsirkin; Jack Morgenstein Cc: openib-general at openib.org Subject: [PATCH] static rate encoding changes Here's the static rate patch I have right now. Does anyone see issues with this? I think it can be justified for 2.6.17, since it fixes static rate handling for 4X DDR. Jack, I've reworked the mthca part quite significantly according to my particular taste, so please let me know if you think I've broken something... From mst at mellanox.co.il Sun Apr 9 08:01:30 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 9 Apr 2006 18:01:30 +0300 Subject: [openib-general] RFC: clean branches/1.0/ (was Re: Include patch for IPoIB queue size tuning into the release 1.0 branch) In-Reply-To: <1144440247.14694.141.camel@chalcedony.pathscale.com> References: <1144439463.14694.129.camel@chalcedony.pathscale.com> <1144440247.14694.141.camel@chalcedony.pathscale.com> Message-ID: <20060409150130.GJ13416@mellanox.co.il> Quoting r. Bryan O'Sullivan : > Subject: Re: Include patch for IPoIB queue size tuning into the release 1.0 branch > > On Fri, 2006-04-07 at 12:59 -0700, Roland Dreier wrote: > > > Well, nothing except for the time to maintain the kernel code on that > > branch in addition to the svn trunk and the upstream kernel. But if > > you're willing to do it, then that's fine by me. > > As a matter of courtesy and convenience, and provided it doesn't take > much of my time, I have no problem with doing it. I'm actually quite unhappy about this arrangement: the random dump in https://openib.org/svn/gen2/branches/1.0/src/linux-kernel/ appears, by virtue of its location in svn, to represent some kind of "official openib release" while it's nothing of the kind, and is, unlike https://openib.org/svn/gen2/branches/1.0/src/userspace/, quite unmaintained. This seems to be confusing people: as a result we have IBM and Red Hat people synchronizing on this, bypassing completely both the openib maintainers and the proper kernel.org review process. This does not make sense to me: people should either track svn trunk for bleeding edge development, one of the kernel git trees (probably Roland's http://www.kernel.org/git/?p=linux/kernel/git/roland/infiniband.git;a=summary) for a snoop at how 2.6.17/2.6.18 will look, kernel.org trees for stable deployment, or maintain their own infiniband tree in svn or git.
Bryan, since you stated that this kernel code is useful to you for internal purposes, I'd like to propose that this branch subdirectory https://openib.org/svn/gen2/branches/1.0/src/linux-kernel be moved to https://openib.org/svn/trunk/contrib/pathscale directory. Only userspace should stay under https://openib.org/svn/gen2/branches/1.0/src/userspace/. Please comment. Thanks, -- MST From bos at pathscale.com Sun Apr 9 09:32:18 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Sun, 09 Apr 2006 09:32:18 -0700 Subject: [openib-general] Re: RFC: clean branches/1.0/ (was Re: Include patch for IPoIB queue size tuning into the release 1.0 branch) In-Reply-To: <20060409150130.GJ13416@mellanox.co.il> References: <1144439463.14694.129.camel@chalcedony.pathscale.com> <1144440247.14694.141.camel@chalcedony.pathscale.com> <20060409150130.GJ13416@mellanox.co.il> Message-ID: <1144600338.2434.5.camel@localhost.localdomain> On Sun, 2006-04-09 at 18:01 +0300, Michael S. Tsirkin wrote: > Bryan, since you stated that this kernel code is useful to you for internal > purposes, We actually don't use it at all. Perhaps it would be worth mirroring Roland's for-2.6.17 git tree into branches/1.0/src/linux-kernel? That would avoid the lack of maintenance issue, while making it possible for IBM and others to just look in one location for everything. From mst at mellanox.co.il Sun Apr 9 10:40:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 9 Apr 2006 20:40:02 +0300 Subject: [openib-general] Re: RFC: clean branches/1.0/ (was Re: Include patch for IPoIB queue size tuning into the release 1.0 branch) In-Reply-To: <1144600338.2434.5.camel@localhost.localdomain> References: <1144439463.14694.129.camel@chalcedony.pathscale.com> <1144440247.14694.141.camel@chalcedony.pathscale.com> <20060409150130.GJ13416@mellanox.co.il> <1144600338.2434.5.camel@localhost.localdomain> Message-ID: <20060409174002.GA18546@mellanox.co.il> Quoting r. 
Bryan O'Sullivan : > Subject: Re: RFC: clean branches/1.0/ (was Re: Include patch for IPoIB queue size tuning into the release 1.0 branch) > > On Sun, 2006-04-09 at 18:01 +0300, Michael S. Tsirkin wrote: > > > Bryan, since you stated that this kernel code is useful to you for internal > > purposes, > > We actually don't use it at all. > > Perhaps it would be worth mirroring Roland's for-2.6.17 git tree into > branches/1.0/src/linux-kernel? That would avoid the lack of maintenance > issue, while making it possible for IBM and others to just look in one > location for everything. Importing for-2.6.17 into svn might help svn users, although a quick Google search didn't find any script that can do this preserving history - and I don't think there's a way to trigger this from git commit so I guess you'll just have to do this from a cron job. kernel.org has infrastructure to create and publish tarballs periodically or triggered by user - this might be just as well for anyone not interested in history: Roland, could it be possible to make these available? But keeping for-2.6.17 under branches/1.0 will just keep confusing people. branches/for-2.6.17 will be more appropriate. -- MST From rdreier at cisco.com Sun Apr 9 14:23:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 09 Apr 2006 14:23:33 -0700 Subject: [openib-general] Please don't commit without maintainer approval Message-ID: Bryan, I reverted the changes below, which you checked into the trunk with the comment "update spec file from 1.0 branch." I think it's inappropriate to commit without even giving a heads up to the maintainer. If you had asked, I would have explained why these changes are wrong: - the version macro shouldn't be used for the source tarball name, because that doesn't work out right for -rc snapshots. - I've not tried it to see for sure, but I can't imagine that depending on instead of a package name works properly either for autobuilders or automatic dependency tracking in yum, etc.
- the rest of the changes look like fiddling for no reason (reordering lines, changing buildroot away from the preferred Fedora value, etc) And finally, this development process is completely backwards. Changes shouldn't be made first on the release branch and then merged to the trunk. All fixes should go into the trunk first with maintainer approval, and then a selected subset should be merged onto the release branch. - R. --- libibverbs/libibverbs.spec.in (revision 6293) +++ libibverbs/libibverbs.spec.in (revision 6294) @@ -1,19 +1,20 @@ # $Id$ -%define ver @VERSION@ +%define ver @VERSION@ +%define RELEASE 1 +%define repl %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE} Name: libibverbs Version: 1.0.3 -Release: 1%{?dist} +Release: %rel%{?dist} Summary: A library for direct userspace use of InfiniBand - Group: System Environment/Libraries License: GPL/BSD Url: http://openib.org/ -Source: http://openib.org/downloads/libibverbs-1.0.3.tar.gz +Source: http://openib.org/downloads/libibverbs-%{version}.tar.gz BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) -BuildRequires: sysfsutils-devel +BuildRequires: %{_includedir}/sysfs/libsysfs.h %description libibverbs is a library that allows userspace processes to use @@ -27,7 +28,7 @@ also be installed. %package devel Summary: Development files for the libibverbs library Group: System Environment/Libraries -Requires: %{name} = %{version}-%{release} sysfsutils-devel +Requires: %{name} = %{version}-%{release} %{_includedir}/sysfs/libsysfs.h %description devel Static libraries and header files for the libibverbs verbs library. 
--- srptools/srptools.spec.in (revision 6300) +++ srptools/srptools.spec.in (revision 6301) @@ -1,24 +1,25 @@ # $Id$ -%define ver @VERSION@ +%define ver @VERSION@ +%define RELEASE 1 +%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE} +Summary: Tools for SRP/IB Name: srptools Version: %ver -Release: 1%{?dist} -Summary: Tools for SRP/IB - -Group: Applications/System +Release: %rel%{?dist} License: GPL/BSD -Url: http://openib.org/ +Group: Applications/System +BuildRoot: %{_tmppath}/%{name}-%{version}-root Source: http://openib.org/downloads/%{name}-%{version}.tar.gz -BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) +Url: http://openib.org/ %description In conjunction with the kernel ib_srp driver, srptools allows you to discover and use SCSI devices via the SCSI RDMA Protocol over InfiniBand. %prep -%setup -q -n %{name}-%{ver} +%setup -q %build %configure @@ -38,5 +39,8 @@ rm -rf $RPM_BUILD_ROOT %changelog +* Thu Apr 06 2006 Bryan O'Sullivan - @VERSION at -1 +- Merge spec file from 1.0 branch with spec file from mainline + * Tue Mar 21 2006 Roland Dreier - @VERSION at -1 - Initial attempt at a working spec file From rdreier at cisco.com Sun Apr 9 14:24:30 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 09 Apr 2006 14:24:30 -0700 Subject: [openib-general] Re: RFC: clean branches/1.0/ In-Reply-To: <20060409150130.GJ13416@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 9 Apr 2006 18:01:30 +0300") References: <1144439463.14694.129.camel@chalcedony.pathscale.com> <1144440247.14694.141.camel@chalcedony.pathscale.com> <20060409150130.GJ13416@mellanox.co.il> Message-ID: Michael> This seems to be confusing people: as a result we have Michael> ibm and redhat people synchronizing on this, bypassing Michael> completely both the openib maintainers and the proper Michael> kernel.org review process. I have to agree. I had the same thought myself this weekend. - R. 
From rdreier at cisco.com Sun Apr 9 14:25:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 09 Apr 2006 14:25:46 -0700 Subject: [openib-general] Re: RFC: clean branches/1.0/ In-Reply-To: <1144600338.2434.5.camel@localhost.localdomain> (Bryan O'Sullivan's message of "Sun, 09 Apr 2006 09:32:18 -0700") References: <1144439463.14694.129.camel@chalcedony.pathscale.com> <1144440247.14694.141.camel@chalcedony.pathscale.com> <20060409150130.GJ13416@mellanox.co.il> <1144600338.2434.5.camel@localhost.localdomain> Message-ID: Bryan> Perhaps it would be worth mirroring Roland's for-2.6.17 git Bryan> tree into branches/1.0/src/linux-kernel? That would avoid Bryan> the lack of maintenance issue, while making it possible for Bryan> IBM and others to just look in one location for everything. I'm not sure how feasible this is, or whether it's worth the effort. It would probably be better just to delete the directory and educate people to get kernel releases from kernel.org. - R. From rdreier at cisco.com Sun Apr 9 14:29:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 09 Apr 2006 14:29:32 -0700 Subject: [openib-general] Re: RFC: clean branches/1.0/ In-Reply-To: <20060409174002.GA18546@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 9 Apr 2006 20:40:02 +0300") References: <1144439463.14694.129.camel@chalcedony.pathscale.com> <1144440247.14694.141.camel@chalcedony.pathscale.com> <20060409150130.GJ13416@mellanox.co.il> <1144600338.2434.5.camel@localhost.localdomain> <20060409174002.GA18546@mellanox.co.il> Message-ID: Michael> Importing for-2.6.17 in svn might help svn users, Michael> although a quick google search didn't find any script Michael> that can do this preserving history - and I don't thing Michael> there'a a way to trigger this from git commit so I guess Michael> you'll just have to do this from a cron job. You could do it with git-svn (part of the core git distribution) I guess. 
Except that I'll fix up patches that are in my queue if Linus hasn't pulled yet -- I'm using StGit, so rather than put "patch A" and then "fix patch A" into the main kernel history, I'll just replace "patch A" with "fixed patch A". Of course this changes the git history so freezing it into svn is not such a good idea. Michael> kernel.org has infrastructure to create and publish Michael> tarballs periodically or triggered by user - this might Michael> be just as well for anyone not interested in history: Michael> Roland, could it be possible to make these available? I don't think kernel.org publishes snapshots of anyone's tree other than Linus's. And in any case I don't think it's worth mirroring 40 meg tarballs all over the world for the few people who want a snapshot of my tree once in a while. - R. From rdreier at cisco.com Sun Apr 9 14:34:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 09 Apr 2006 14:34:12 -0700 Subject: [openib-general] IPoIB interface for unauthorized partition In-Reply-To: <4438E822.3070503@mellanox.co.il> (Eitan Zahavi's message of "Sun, 09 Apr 2006 13:55:30 +0300") References: <1144082680.4480.44271.camel@hal.voltaire.com> <4438E822.3070503@mellanox.co.il> Message-ID: Eitan> I thought the intent of the IB spec when defining P_Key Eitan> index usage (and not P_Key value) was that the P_Key values Eitan> would never need to be known above the driver level. To Eitan> avoid exposing the P_Key values we could use P_Key index Eitan> for creating the IPoIB interfaces. Eitan> Does it make sense to work on a patch that would setup Eitan> IPoIB interfaces by the P_Key index (and not by P_Key Eitan> value)? I don't see how this is feasible. The index that a particular P_Key lands at is completely undetermined -- if two nodes wanted to talk on partition 0x8001, say, how does one know which interface to use without knowing the index of that P_Key?
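[Illustration added in editing, not part of the original message.] The point about indices can be sketched in a few lines of C. This is only a sketch under stated assumptions: the two P_Key tables are made-up examples, and find_pkey_index() is a hypothetical helper, not a kernel or libibverbs function. It assumes the usual P_Key layout, where the top bit is the membership bit and the low 15 bits are the partition's base value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): return the index at which
 * a P_Key value sits in a port's P_Key table, or -1 if it is absent.
 * Only the low 15 bits are compared; the top bit (full vs. limited
 * membership) is ignored, and entries with a zero base are skipped
 * as empty.
 */
int find_pkey_index(const uint16_t *table, size_t len, uint16_t pkey)
{
    assert(table != NULL);

    for (size_t i = 0; i < len; i++) {
        uint16_t base = table[i] & 0x7fff;

        if (base == 0)
            continue;               /* empty table slot */
        if (base == (pkey & 0x7fff))
            return (int)i;
    }
    return -1;
}

/*
 * Made-up tables for two nodes that are both members of partition
 * 0x8001. The SM is free to place the entry at a different index on
 * each port.
 */
const uint16_t node_a_pkeys[4] = { 0xffff, 0x8001, 0x0000, 0x0000 };
const uint16_t node_b_pkeys[4] = { 0xffff, 0x0000, 0x0000, 0x8001 };
```

With these example tables, looking up 0x8001 gives index 1 on the first node and index 3 on the second, so an interface named after the index would not refer to the same partition on both nodes, while one named after the value would.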
Eitan> Also I think the expected behavior for IPoIB should be that Eitan> IPoIB "child" interfaces should be "automatically" Eitan> initialized by the code that brings up the interface Eitan> (ifconfig scripts). All valid IPoIB partitions (valid = Eitan> have corresponding broadcast groups) should be Eitan> initialized. By doing so we provide a centralized control Eitan> of the partitions and their IPoIB interfaces through the Eitan> SM. Not sure if this is so. I may want a partition strictly for storage traffic or something like that, so it doesn't make sense to create an IPoIB interface for that partition. - R. From rjwalsh at pathscale.com Sun Apr 9 14:38:12 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sun, 09 Apr 2006 14:38:12 -0700 Subject: [openib-general] Please don't commit without maintainer approval In-Reply-To: References: Message-ID: <1144618692.6263.16.camel@phosphene.durables.org> > - I've not tried it to see for sure, but I can't imagine that > depending on instead of a package name works > properly either for autobuilders or automatic dependency tracking > in yum, etc. FWIW, sysfsutils-devel doesn't exist on SuSE: the header files are in sysfsutils instead. That's probably why Bryan made that change: he picked it up from a change (never submitted) I'd made many months ago. I don't think this would be a big deal for yum, since it's a BuildRequires, not a Requires. It certainly hasn't been an issue for us at PathScale, and we've been using it for ages. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com QLogic Corporation Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From mst at mellanox.co.il Sun Apr 9 14:44:11 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 10 Apr 2006 00:44:11 +0300 Subject: [openib-general] Re: RFC: clean branches/1.0/ In-Reply-To: References: <1144439463.14694.129.camel@chalcedony.pathscale.com> <1144440247.14694.141.camel@chalcedony.pathscale.com> <20060409150130.GJ13416@mellanox.co.il> <1144600338.2434.5.camel@localhost.localdomain> <20060409174002.GA18546@mellanox.co.il> Message-ID: <20060409214411.GA18887@mellanox.co.il> Quoting r. Roland Dreier : > And in any case I don't think it's worth mirroring 40 meg > tarballs all over the world for the few people who want a snapshot of > my tree once in a while. Surely not. Just a patch against the last stable tree would be enough. And we don't need it on the kernel.org frontpage :) Sticking it e.g. in http://www.kernel.org/pub/linux/kernel/people/roland/ will do. -- MST From rdreier at cisco.com Sun Apr 9 15:06:39 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 09 Apr 2006 15:06:39 -0700 Subject: [openib-general] [GIT PULL] post-2.6.17-rc1 fixes Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This includes some changes that I asked you to pull last week right before you left. There are a couple of largish changes in here, but I think they are all needed: - the IPoIB ring size tunables fix horrible performance IBM sees - the static rate change fixes big problems on mixed rate networks - the callback refcounting changes are fixing possible crashes on module unload The exact changes and patch are: Eli Cohen: IPoIB: Wait for join to finish before freeing mcast struct IPoIB: Close race in ipoib_flush_paths() Jack Morgenstein: IB: simplify static rate encoding Michael S. 
Tsirkin: IB/mad: fix oops in cancel_mads IPoIB: Consolidate private neighbour data handling IB/mthca: Disable tuning PCI read burst size IB/sa: Don't let a module be unloaded with a callback running Roland Dreier: IPoIB: Always build debugging code unless CONFIG_EMBEDDED=y IB/mthca: Always build debugging code unless CONFIG_EMBEDDED=y IB/srp: Fix memory leak in options parsing IPoIB: Use spin_lock_irq() instead of spin_lock_irqsave() Sean Hefty: IB/cm: Don't let a module be unloaded with a callback running Shirley Ma: IPoIB: Make send and receive queue sizes tunable drivers/infiniband/core/cm.c | 31 +++- drivers/infiniband/core/mad.c | 2 drivers/infiniband/core/sa_query.c | 93 +++++++----- drivers/infiniband/core/verbs.c | 34 ++++ drivers/infiniband/hw/mthca/Kconfig | 11 + drivers/infiniband/hw/mthca/Makefile | 4 - drivers/infiniband/hw/mthca/mthca_av.c | 96 ++++++++++++ drivers/infiniband/hw/mthca/mthca_cmd.c | 4 + drivers/infiniband/hw/mthca/mthca_cmd.h | 1 drivers/infiniband/hw/mthca/mthca_dev.h | 21 ++- drivers/infiniband/hw/mthca/mthca_mad.c | 42 +++++ drivers/infiniband/hw/mthca/mthca_main.c | 27 ++++ drivers/infiniband/hw/mthca/mthca_provider.h | 3 drivers/infiniband/hw/mthca/mthca_qp.c | 46 ++++-- drivers/infiniband/ulp/ipoib/Kconfig | 3 drivers/infiniband/ulp/ipoib/ipoib.h | 7 + drivers/infiniband/ulp/ipoib/ipoib_fs.c | 2 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 22 +-- drivers/infiniband/ulp/ipoib/ipoib_main.c | 88 +++++++---- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 58 +++----- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 6 - drivers/infiniband/ulp/srp/ib_srp.c | 1 include/rdma/ib_cm.h | 13 +- include/rdma/ib_sa.h | 185 ++++++++++++++++-------- include/rdma/ib_verbs.h | 28 ++++ 25 files changed, 603 insertions(+), 225 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 7cfedb8..66d1cb3 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -118,6 +118,7 @@ struct cm_timewait_info { 
struct cm_id_private { struct ib_cm_id id; + struct module *owner; struct rb_node service_node; struct rb_node sidr_id_node; @@ -538,9 +539,9 @@ static void cm_reject_sidr_req(struct cm ib_send_cm_sidr_rep(&cm_id_priv->id, &param); } -struct ib_cm_id *ib_create_cm_id(struct ib_device *device, - ib_cm_handler cm_handler, - void *context) +struct ib_cm_id *__ib_create_cm_id(struct ib_device *device, + ib_cm_handler cm_handler, + void *context, struct module *owner) { struct cm_id_private *cm_id_priv; int ret; @@ -549,6 +550,7 @@ struct ib_cm_id *ib_create_cm_id(struct if (!cm_id_priv) return ERR_PTR(-ENOMEM); + cm_id_priv->owner = owner; cm_id_priv->id.state = IB_CM_IDLE; cm_id_priv->id.device = device; cm_id_priv->id.cm_handler = cm_handler; @@ -569,7 +571,7 @@ error: kfree(cm_id_priv); return ERR_PTR(-ENOMEM); } -EXPORT_SYMBOL(ib_create_cm_id); +EXPORT_SYMBOL(__ib_create_cm_id); static struct cm_work * cm_dequeue_work(struct cm_id_private *cm_id_priv) { @@ -1086,6 +1088,18 @@ static void cm_format_req_event(struct c work->cm_event.private_data = &req_msg->private_data; } +static int invoke_cm_handler(struct cm_id_private *cm_id_priv, + struct ib_cm_event *event) +{ + int ret; + + BUG_ON(!try_module_get(cm_id_priv->owner)); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, event); + module_put(cm_id_priv->owner); + + return ret; +} + static void cm_process_work(struct cm_id_private *cm_id_priv, struct cm_work *work) { @@ -1093,7 +1107,7 @@ static void cm_process_work(struct cm_id int ret; /* We will typically only have the current event to report.
*/ - ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->cm_event); + ret = invoke_cm_handler(cm_id_priv, &work->cm_event); cm_free_work(work); while (!ret && !atomic_add_negative(-1, &cm_id_priv->work_count)) { @@ -1101,8 +1115,7 @@ static void cm_process_work(struct cm_id work = cm_dequeue_work(cm_id_priv); spin_unlock_irqrestore(&cm_id_priv->lock, flags); BUG_ON(!work); - ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, - &work->cm_event); + ret = invoke_cm_handler(cm_id_priv, &work->cm_event); cm_free_work(work); } cm_deref_id(cm_id_priv); @@ -1291,6 +1304,7 @@ static int cm_req_handler(struct cm_work goto error2; } + cm_id_priv->owner = listen_cm_id_priv->owner; cm_id_priv->id.cm_handler = listen_cm_id_priv->id.cm_handler; cm_id_priv->id.context = listen_cm_id_priv->id.context; cm_id_priv->id.service_id = req_msg->service_id; @@ -2662,6 +2676,7 @@ static int cm_sidr_req_handler(struct cm atomic_inc(&cur_cm_id_priv->refcount); spin_unlock_irqrestore(&cm.lock, flags); + cm_id_priv->owner = cur_cm_id_priv->owner; cm_id_priv->id.cm_handler = cur_cm_id_priv->id.cm_handler; cm_id_priv->id.context = cur_cm_id_priv->id.context; cm_id_priv->id.service_id = sidr_req_msg->service_id; @@ -2830,7 +2845,7 @@ static void cm_process_send_error(struct cm_event.param.send_status = wc_status; /* No other events can occur on the cm_id at this point. 
*/ - ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &cm_event); + ret = invoke_cm_handler(cm_id_priv, &cm_event); cm_free_msg(msg); if (ret) ib_destroy_cm_id(&cm_id_priv->id); diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index ba54c85..3a702da 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2311,6 +2311,7 @@ static void local_completions(void *data local = list_entry(mad_agent_priv->local_list.next, struct ib_mad_local_private, completion_list); + list_del(&local->completion_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); if (local->mad_priv) { recv_mad_agent = local->recv_mad_agent; @@ -2362,7 +2363,6 @@ local_send_completion: &mad_send_wc); spin_lock_irqsave(&mad_agent_priv->lock, flags); - list_del(&local->completion_list); atomic_dec(&mad_agent_priv->refcount); if (!recv) kmem_cache_free(ib_mad_cache, local->mad_priv); diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 501cc05..c43ed75 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -74,6 +74,7 @@ struct ib_sa_device { struct ib_sa_query { void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); void (*release)(struct ib_sa_query *); + struct module *owner; struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; @@ -547,15 +548,16 @@ static void ib_sa_path_rec_release(struc * error code. Otherwise it is a query ID that can be used to cancel * the query. 
*/ -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, - struct ib_sa_path_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_path_rec *resp, - void *context), - void *context, - struct ib_sa_query **sa_query) +int __ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct module *owner, + struct ib_sa_query **sa_query) { struct ib_sa_path_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); @@ -590,6 +592,7 @@ int ib_sa_path_rec_get(struct ib_device query->sa_query.callback = callback ? ib_sa_path_rec_callback : NULL; query->sa_query.release = ib_sa_path_rec_release; + query->sa_query.owner = owner; query->sa_query.port = port; mad->mad_hdr.method = IB_MGMT_METHOD_GET; mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); @@ -613,7 +616,7 @@ err1: kfree(query); return ret; } -EXPORT_SYMBOL(ib_sa_path_rec_get); +EXPORT_SYMBOL(__ib_sa_path_rec_get); static void ib_sa_service_rec_callback(struct ib_sa_query *sa_query, int status, @@ -663,15 +666,16 @@ static void ib_sa_service_rec_release(st * error code. Otherwise it is a request ID that can be used to cancel * the query. 
*/ -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, - struct ib_sa_service_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_service_rec *resp, - void *context), - void *context, - struct ib_sa_query **sa_query) +int __ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, + struct ib_sa_service_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_service_rec *resp, + void *context), + void *context, + struct module *owner, + struct ib_sa_query **sa_query) { struct ib_sa_service_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); @@ -711,6 +715,7 @@ int ib_sa_service_rec_query(struct ib_de query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; query->sa_query.release = ib_sa_service_rec_release; + query->sa_query.owner = owner; query->sa_query.port = port; mad->mad_hdr.method = method; mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_SERVICE_REC); @@ -735,7 +740,7 @@ err1: kfree(query); return ret; } -EXPORT_SYMBOL(ib_sa_service_rec_query); +EXPORT_SYMBOL(__ib_sa_service_rec_query); static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, int status, @@ -759,16 +764,17 @@ static void ib_sa_mcmember_rec_release(s kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, - u8 method, - struct ib_sa_mcmember_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_mcmember_rec *resp, - void *context), - void *context, - struct ib_sa_query **sa_query) +int __ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec 
*resp, + void *context), + void *context, + struct module *owner, + struct ib_sa_query **sa_query) { struct ib_sa_mcmember_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); @@ -803,6 +809,7 @@ int ib_sa_mcmember_rec_query(struct ib_d query->sa_query.callback = callback ? ib_sa_mcmember_rec_callback : NULL; query->sa_query.release = ib_sa_mcmember_rec_release; + query->sa_query.owner = owner; query->sa_query.port = port; mad->mad_hdr.method = method; mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); @@ -827,7 +834,15 @@ err1: kfree(query); return ret; } -EXPORT_SYMBOL(ib_sa_mcmember_rec_query); +EXPORT_SYMBOL(__ib_sa_mcmember_rec_query); + +static void call_sa_callback(struct ib_sa_query *query, int status, + struct ib_sa_mad *mad) +{ + BUG_ON(!try_module_get(query->owner)); + query->callback(query, status, mad); + module_put(query->owner); +} static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *mad_send_wc) @@ -841,13 +856,13 @@ static void send_handler(struct ib_mad_a /* No callback -- already got recv */ break; case IB_WC_RESP_TIMEOUT_ERR: - query->callback(query, -ETIMEDOUT, NULL); + call_sa_callback(query, -ETIMEDOUT, NULL); break; case IB_WC_WR_FLUSH_ERR: - query->callback(query, -EINTR, NULL); + call_sa_callback(query, -EINTR, NULL); break; default: - query->callback(query, -EIO, NULL); + call_sa_callback(query, -EIO, NULL); break; } @@ -871,12 +886,12 @@ static void recv_handler(struct ib_mad_a if (query->callback) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) - query->callback(query, - mad_recv_wc->recv_buf.mad->mad_hdr.status ? - -EINVAL : 0, - (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); + call_sa_callback(query, + mad_recv_wc->recv_buf.mad->mad_hdr.status ? 
+ -EINVAL : 0, + (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); else - query->callback(query, -EIO, NULL); + call_sa_callback(query, -EIO, NULL); } ib_free_recv_mad(mad_recv_wc); diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index cae0845..b78e7dc 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -45,6 +45,40 @@ #include #include +int ib_rate_to_mult(enum ib_rate rate) +{ + switch (rate) { + case IB_RATE_2_5_GBPS: return 1; + case IB_RATE_5_GBPS: return 2; + case IB_RATE_10_GBPS: return 4; + case IB_RATE_20_GBPS: return 8; + case IB_RATE_30_GBPS: return 12; + case IB_RATE_40_GBPS: return 16; + case IB_RATE_60_GBPS: return 24; + case IB_RATE_80_GBPS: return 32; + case IB_RATE_120_GBPS: return 48; + default: return -1; + } +} +EXPORT_SYMBOL(ib_rate_to_mult); + +enum ib_rate mult_to_ib_rate(int mult) +{ + switch (mult) { + case 1: return IB_RATE_2_5_GBPS; + case 2: return IB_RATE_5_GBPS; + case 4: return IB_RATE_10_GBPS; + case 8: return IB_RATE_20_GBPS; + case 12: return IB_RATE_30_GBPS; + case 16: return IB_RATE_40_GBPS; + case 24: return IB_RATE_60_GBPS; + case 32: return IB_RATE_80_GBPS; + case 48: return IB_RATE_120_GBPS; + default: return IB_RATE_PORT_CURRENT; + } +} +EXPORT_SYMBOL(mult_to_ib_rate); + /* Protection domains */ struct ib_pd *ib_alloc_pd(struct ib_device *device) diff --git a/drivers/infiniband/hw/mthca/Kconfig b/drivers/infiniband/hw/mthca/Kconfig index e88be85..9aa5a44 100644 --- a/drivers/infiniband/hw/mthca/Kconfig +++ b/drivers/infiniband/hw/mthca/Kconfig @@ -7,10 +7,11 @@ config INFINIBAND_MTHCA ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). config INFINIBAND_MTHCA_DEBUG - bool "Verbose debugging output" + bool "Verbose debugging output" if EMBEDDED depends on INFINIBAND_MTHCA - default n + default y ---help--- - This option causes the mthca driver produce a bunch of debug - messages. Select this is you are developing the driver or - trying to diagnose a problem. 
+	  This option causes debugging code to be compiled into the
+	  mthca driver. The output can be turned on via the
+	  debug_level module parameter (which can also be set after
+	  the driver is loaded through sysfs).
diff --git a/drivers/infiniband/hw/mthca/Makefile b/drivers/infiniband/hw/mthca/Makefile
index 47ec5a7..e388d95 100644
--- a/drivers/infiniband/hw/mthca/Makefile
+++ b/drivers/infiniband/hw/mthca/Makefile
@@ -1,7 +1,3 @@
-ifdef CONFIG_INFINIBAND_MTHCA_DEBUG
-EXTRA_CFLAGS += -DDEBUG
-endif
-
 obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o
 
 ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \
diff --git a/drivers/infiniband/hw/mthca/mthca_av.c b/drivers/infiniband/hw/mthca/mthca_av.c
index bc5bdcb..87e7c63 100644
--- a/drivers/infiniband/hw/mthca/mthca_av.c
+++ b/drivers/infiniband/hw/mthca/mthca_av.c
@@ -42,6 +42,20 @@
 
 #include "mthca_dev.h"
 
+enum {
+	MTHCA_RATE_TAVOR_FULL = 0,
+	MTHCA_RATE_TAVOR_1X = 1,
+	MTHCA_RATE_TAVOR_4X = 2,
+	MTHCA_RATE_TAVOR_1X_DDR = 3
+};
+
+enum {
+	MTHCA_RATE_MEMFREE_FULL = 0,
+	MTHCA_RATE_MEMFREE_QUARTER = 1,
+	MTHCA_RATE_MEMFREE_EIGHTH = 2,
+	MTHCA_RATE_MEMFREE_HALF = 3
+};
+
 struct mthca_av {
 	__be32 port_pd;
 	u8 reserved1;
@@ -55,6 +69,86 @@ struct mthca_av {
 	__be32 dgid[4];
 };
 
+static enum ib_rate memfree_rate_to_ib(u8 mthca_rate, u8 port_rate)
+{
+	switch (mthca_rate) {
+	case MTHCA_RATE_MEMFREE_EIGHTH: return port_rate / 8;
+	case MTHCA_RATE_MEMFREE_QUARTER: return port_rate / 4;
+	case MTHCA_RATE_MEMFREE_HALF: return port_rate / 2;
+	case MTHCA_RATE_MEMFREE_FULL: return port_rate;
+	default: return port_rate;
+	}
+}
+
+static enum ib_rate tavor_rate_to_ib(u8 mthca_rate, u8 port_rate)
+{
+	switch (mthca_rate) {
+	case MTHCA_RATE_TAVOR_1X: return IB_RATE_2_5_GBPS;
+	case MTHCA_RATE_TAVOR_1X_DDR: return IB_RATE_5_GBPS;
+	case MTHCA_RATE_TAVOR_4X: return IB_RATE_10_GBPS;
+	default: return port_rate;
+	}
+}
+
+enum ib_rate mthca_rate_to_ib(struct mthca_dev *dev, u8 mthca_rate, u8 port)
+{
+	if (mthca_is_memfree(dev)) {
+		/* Handle old Arbel FW */
+		if (dev->limits.stat_rate_support == 0x3 && mthca_rate)
+			return IB_RATE_2_5_GBPS;
+
+		return memfree_rate_to_ib(mthca_rate, dev->rate[port - 1]);
+	} else
+		return tavor_rate_to_ib(mthca_rate, dev->rate[port - 1]);
+}
+
+static u8 ib_rate_to_memfree(u8 req_rate, u8 cur_rate)
+{
+	if (cur_rate <= req_rate)
+		return 0;
+
+	/*
+	 * Inter-packet delay (IPD) to get from rate X down to a rate
+	 * no more than Y is (X - 1) / Y.
+	 */
+	switch ((cur_rate - 1) / req_rate) {
+	case 0: return MTHCA_RATE_MEMFREE_FULL;
+	case 1: return MTHCA_RATE_MEMFREE_HALF;
+	case 2: /* fall through */
+	case 3: return MTHCA_RATE_MEMFREE_QUARTER;
+	default: return MTHCA_RATE_MEMFREE_EIGHTH;
+	}
+}
+
+static u8 ib_rate_to_tavor(u8 static_rate)
+{
+	switch (static_rate) {
+	case IB_RATE_2_5_GBPS: return MTHCA_RATE_TAVOR_1X;
+	case IB_RATE_5_GBPS: return MTHCA_RATE_TAVOR_1X_DDR;
+	case IB_RATE_10_GBPS: return MTHCA_RATE_TAVOR_4X;
+	default: return MTHCA_RATE_TAVOR_FULL;
+	}
+}
+
+u8 mthca_get_rate(struct mthca_dev *dev, int static_rate, u8 port)
+{
+	u8 rate;
+
+	if (!static_rate || ib_rate_to_mult(static_rate) >= dev->rate[port - 1])
+		return 0;
+
+	if (mthca_is_memfree(dev))
+		rate = ib_rate_to_memfree(ib_rate_to_mult(static_rate),
+					  dev->rate[port - 1]);
+	else
+		rate = ib_rate_to_tavor(static_rate);
+
+	if (!(dev->limits.stat_rate_support & (1 << rate)))
+		rate = 1;
+
+	return rate;
+}
+
 int mthca_create_ah(struct mthca_dev *dev,
 		    struct mthca_pd *pd,
 		    struct ib_ah_attr *ah_attr,
@@ -107,7 +201,7 @@ on_hca_fail:
 	av->g_slid = ah_attr->src_path_bits;
 	av->dlid = cpu_to_be16(ah_attr->dlid);
 	av->msg_sr = (3 << 4) | /* 2K message */
-		ah_attr->static_rate;
+		mthca_get_rate(dev, ah_attr->static_rate, ah_attr->port_num);
 	av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28);
 	if (ah_attr->ah_flags & IB_AH_GRH) {
 		av->g_slid |= 0x80;
diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c
index 343eca5..1985b5d 100644
--- a/drivers/infiniband/hw/mthca/mthca_cmd.c
+++ b/drivers/infiniband/hw/mthca/mthca_cmd.c
@@ -965,6 +965,7 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev
 	u32 *outbox;
 	u8 field;
 	u16 size;
+	u16 stat_rate;
 	int err;
 
 #define QUERY_DEV_LIM_OUT_SIZE 0x100
@@ -995,6 +996,7 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev
 #define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36
 #define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37
 #define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b
+#define QUERY_DEV_LIM_RATE_SUPPORT_OFFSET 0x3c
 #define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f
 #define QUERY_DEV_LIM_FLAGS_OFFSET 0x44
 #define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48
@@ -1086,6 +1088,8 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev
 	dev_lim->num_ports = field & 0xf;
 	MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET);
 	dev_lim->max_gids = 1 << (field & 0xf);
+	MTHCA_GET(stat_rate, outbox, QUERY_DEV_LIM_RATE_SUPPORT_OFFSET);
+	dev_lim->stat_rate_support = stat_rate;
 	MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET);
 	dev_lim->max_pkeys = 1 << (field & 0xf);
 	MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET);
diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.h b/drivers/infiniband/hw/mthca/mthca_cmd.h
index e4ec35c..2f976f2 100644
--- a/drivers/infiniband/hw/mthca/mthca_cmd.h
+++ b/drivers/infiniband/hw/mthca/mthca_cmd.h
@@ -146,6 +146,7 @@ struct mthca_dev_lim {
 	int max_vl;
 	int num_ports;
 	int max_gids;
+	u16 stat_rate_support;
 	int max_pkeys;
 	u32 flags;
 	int reserved_uars;
diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
index ad52edb..49d0eae 100644
--- a/drivers/infiniband/hw/mthca/mthca_dev.h
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -172,6 +172,7 @@ struct mthca_limits {
 	int reserved_pds;
 	u32 page_size_cap;
 	u32 flags;
+	u16 stat_rate_support;
 	u8 port_width_cap;
 };
 
@@ -353,10 +354,24 @@ struct mthca_dev {
 	struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2];
 	struct ib_ah *sm_ah[MTHCA_MAX_PORTS];
 	spinlock_t sm_lock;
+	u8 rate[MTHCA_MAX_PORTS];
 };
 
-#define mthca_dbg(mdev, format, arg...) \
-	dev_dbg(&mdev->pdev->dev, format, ## arg)
+#ifdef CONFIG_INFINIBAND_MTHCA_DEBUG
+extern int mthca_debug_level;
+
+#define mthca_dbg(mdev, format, arg...) \
+	do { \
+		if (mthca_debug_level) \
+			dev_printk(KERN_DEBUG, &mdev->pdev->dev, format, ## arg); \
+	} while (0)
+
+#else /* CONFIG_INFINIBAND_MTHCA_DEBUG */
+
+#define mthca_dbg(mdev, format, arg...) do { (void) mdev; } while (0)
+
+#endif /* CONFIG_INFINIBAND_MTHCA_DEBUG */
+
 #define mthca_err(mdev, format, arg...) \
 	dev_err(&mdev->pdev->dev, format, ## arg)
 #define mthca_info(mdev, format, arg...) \
@@ -542,6 +557,8 @@ int mthca_read_ah(struct mthca_dev *dev,
 		  struct ib_ud_header *header);
 int mthca_ah_query(struct ib_ah *ibah, struct ib_ah_attr *attr);
 int mthca_ah_grh_present(struct mthca_ah *ah);
+u8 mthca_get_rate(struct mthca_dev *dev, int static_rate, u8 port);
+enum ib_rate mthca_rate_to_ib(struct mthca_dev *dev, u8 mthca_rate, u8 port);
 
 int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid);
 int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid);
diff --git a/drivers/infiniband/hw/mthca/mthca_mad.c b/drivers/infiniband/hw/mthca/mthca_mad.c
index dfb482e..f235c7e 100644
--- a/drivers/infiniband/hw/mthca/mthca_mad.c
+++ b/drivers/infiniband/hw/mthca/mthca_mad.c
@@ -49,6 +49,30 @@ enum {
 	MTHCA_VENDOR_CLASS2 = 0xa
 };
 
+int mthca_update_rate(struct mthca_dev *dev, u8 port_num)
+{
+	struct ib_port_attr *tprops = NULL;
+	int ret;
+
+	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
+	if (!tprops)
+		return -ENOMEM;
+
+	ret = ib_query_port(&dev->ib_dev, port_num, tprops);
+	if (ret) {
+		printk(KERN_WARNING "ib_query_port failed (%d) for %s port %d\n",
+		       ret, dev->ib_dev.name, port_num);
+		goto out;
+	}
+
+	dev->rate[port_num - 1] = tprops->active_speed *
+				  ib_width_enum_to_int(tprops->active_width);
+
+out:
+	kfree(tprops);
+	return ret;
+}
+
 static void update_sm_ah(struct mthca_dev *dev,
 			 u8 port_num, u16 lid, u8 sl)
 {
@@ -90,6 +114,7 @@ static void smp_snoop(struct ib_device *
 	     mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
 	    mad->mad_hdr.method == IB_MGMT_METHOD_SET) {
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) {
+			mthca_update_rate(to_mdev(ibdev), port_num);
 			update_sm_ah(to_mdev(ibdev), port_num,
 				     be16_to_cpup((__be16 *) (mad->data + 58)),
 				     (*(u8 *) (mad->data + 76)) & 0xf);
@@ -246,6 +271,7 @@ int mthca_create_agents(struct mthca_dev
 {
 	struct ib_mad_agent *agent;
 	int p, q;
+	int ret;
 
 	spin_lock_init(&dev->sm_lock);
 
@@ -255,11 +281,23 @@ int mthca_create_agents(struct mthca_dev
 						      q ? IB_QPT_GSI : IB_QPT_SMI,
 						      NULL, 0, send_handler,
 						      NULL, NULL);
-			if (IS_ERR(agent))
+			if (IS_ERR(agent)) {
+				ret = PTR_ERR(agent);
 				goto err;
+			}
 			dev->send_agent[p][q] = agent;
 		}
 
+
+	for (p = 1; p <= dev->limits.num_ports; ++p) {
+		ret = mthca_update_rate(dev, p);
+		if (ret) {
+			mthca_err(dev, "Failed to obtain port %d rate."
+				  " aborting.\n", p);
+			goto err;
+		}
+	}
+
 	return 0;
 
 err:
@@ -268,7 +306,7 @@ err:
 			if (dev->send_agent[p][q])
 				ib_unregister_mad_agent(dev->send_agent[p][q]);
 
-	return PTR_ERR(agent);
+	return ret;
 }
 
 void __devexit mthca_free_agents(struct mthca_dev *dev)
diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 266f347..7e9c97b 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -52,6 +52,14 @@ MODULE_DESCRIPTION("Mellanox InfiniBand
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_VERSION(DRV_VERSION);
 
+#ifdef CONFIG_INFINIBAND_MTHCA_DEBUG
+
+int mthca_debug_level = 0;
+module_param_named(debug_level, mthca_debug_level, int, 0644);
+MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0");
+
+#endif /* CONFIG_INFINIBAND_MTHCA_DEBUG */
+
 #ifdef CONFIG_PCI_MSI
 
 static int msi_x = 0;
@@ -69,6 +77,10 @@ MODULE_PARM_DESC(msi, "attempt to use MS
 
 #endif /* CONFIG_PCI_MSI */
 
+static int tune_pci = 0;
+module_param(tune_pci, int, 0444);
+MODULE_PARM_DESC(tune_pci, "increase PCI burst from the default set by BIOS if nonzero");
+
 static const char mthca_version[] __devinitdata =
 	DRV_NAME ": Mellanox InfiniBand HCA driver v"
 	DRV_VERSION " (" DRV_RELDATE ")\n";
@@ -90,6 +102,9 @@ static int __devinit mthca_tune_pci(stru
 	int cap;
 	u16 val;
 
+	if (!tune_pci)
+		return 0;
+
 	/* First try to max out Read Byte Count */
 	cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX);
 	if (cap) {
@@ -191,6 +206,18 @@ static int __devinit mthca_dev_lim(struc
 	mdev->limits.port_width_cap = dev_lim->max_port_width;
 	mdev->limits.page_size_cap = ~(u32) (dev_lim->min_page_sz - 1);
 	mdev->limits.flags = dev_lim->flags;
+	/*
+	 * For old FW that doesn't return static rate support, use a
+	 * value of 0x3 (only static rate values of 0 or 1 are handled),
+	 * except on Sinai, where even old FW can handle static rate
+	 * values of 2 and 3.
+	 */
+	if (dev_lim->stat_rate_support)
+		mdev->limits.stat_rate_support = dev_lim->stat_rate_support;
+	else if (mdev->mthca_flags & MTHCA_FLAG_SINAI_OPT)
+		mdev->limits.stat_rate_support = 0xf;
+	else
+		mdev->limits.stat_rate_support = 0x3;
 
 	/* IB_DEVICE_RESIZE_MAX_WR not supported by driver.
 	   May be doable since hardware supports it for SRQ.
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h
index 2e7f521..6676a78 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.h
+++ b/drivers/infiniband/hw/mthca/mthca_provider.h
@@ -257,6 +257,8 @@ struct mthca_qp {
 	atomic_t refcount;
 	u32 qpn;
 	int is_direct;
+	u8 port; /* for SQP and memfree use only */
+	u8 alt_port; /* for memfree use only */
 	u8 transport;
 	u8 state;
 	u8 atomic_rd_en;
@@ -278,7 +280,6 @@ struct mthca_qp {
 
 struct mthca_sqp {
 	struct mthca_qp qp;
-	int port;
 	int pkey_index;
 	u32 qkey;
 	u32 send_psn;
diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index 057c8e6..f37b0e3 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -248,6 +248,9 @@ void mthca_qp_event(struct mthca_dev *de
 		return;
 	}
 
+	if (event_type == IB_EVENT_PATH_MIG)
+		qp->port = qp->alt_port;
+
 	event.device = &dev->ib_dev;
 	event.event = event_type;
 	event.element.qp = &qp->ibqp;
@@ -392,10 +395,16 @@ static void to_ib_ah_attr(struct mthca_d
 {
 	memset(ib_ah_attr, 0, sizeof *path);
 	ib_ah_attr->port_num = (be32_to_cpu(path->port_pkey) >> 24) & 0x3;
+
+	if (ib_ah_attr->port_num == 0 || ib_ah_attr->port_num > dev->limits.num_ports)
+		return;
+
 	ib_ah_attr->dlid = be16_to_cpu(path->rlid);
 	ib_ah_attr->sl = be32_to_cpu(path->sl_tclass_flowlabel) >> 28;
 	ib_ah_attr->src_path_bits = path->g_mylmc & 0x7f;
-	ib_ah_attr->static_rate = path->static_rate & 0x7;
+	ib_ah_attr->static_rate = mthca_rate_to_ib(dev,
+						   path->static_rate & 0x7,
+						   ib_ah_attr->port_num);
 	ib_ah_attr->ah_flags = (path->g_mylmc & (1 << 7)) ? IB_AH_GRH : 0;
 	if (ib_ah_attr->ah_flags) {
 		ib_ah_attr->grh.sgid_index = path->mgid_index & (dev->limits.gid_table_len - 1);
@@ -455,8 +464,10 @@ int mthca_query_qp(struct ib_qp *ibqp, s
 	qp_attr->cap.max_recv_sge = qp->rq.max_gs;
 	qp_attr->cap.max_inline_data = qp->max_inline_data;
 
-	to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path);
-	to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path);
+	if (qp->transport == RC || qp->transport == UC) {
+		to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path);
+		to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path);
+	}
 
 	qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 0x7f;
 	qp_attr->alt_pkey_index = be32_to_cpu(context->alt_path.port_pkey) & 0x7f;
@@ -484,11 +495,11 @@ out:
 }
 
 static int mthca_path_set(struct mthca_dev *dev, struct ib_ah_attr *ah,
-			  struct mthca_qp_path *path)
+			  struct mthca_qp_path *path, u8 port)
 {
 	path->g_mylmc = ah->src_path_bits & 0x7f;
 	path->rlid = cpu_to_be16(ah->dlid);
-	path->static_rate = !!ah->static_rate;
+	path->static_rate = mthca_get_rate(dev, ah->static_rate, port);
 
 	if (ah->ah_flags & IB_AH_GRH) {
 		if (ah->grh.sgid_index >= dev->limits.gid_table_len) {
@@ -634,7 +645,7 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 
 	if (qp->transport == MLX)
 		qp_context->pri_path.port_pkey |=
-			cpu_to_be32(to_msqp(qp)->port << 24);
+			cpu_to_be32(qp->port << 24);
 	else {
 		if (attr_mask & IB_QP_PORT) {
 			qp_context->pri_path.port_pkey |=
@@ -657,7 +668,8 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 	}
 
 	if (attr_mask & IB_QP_AV) {
-		if (mthca_path_set(dev, &attr->ah_attr, &qp_context->pri_path))
+		if (mthca_path_set(dev, &attr->ah_attr, &qp_context->pri_path,
+				   attr_mask & IB_QP_PORT ? attr->port_num : qp->port))
 			return -EINVAL;
 
 		qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH);
@@ -681,7 +693,8 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 			return -EINVAL;
 		}
 
-		if (mthca_path_set(dev, &attr->alt_ah_attr, &qp_context->alt_path))
+		if (mthca_path_set(dev, &attr->alt_ah_attr, &qp_context->alt_path,
+				   attr->alt_ah_attr.port_num))
 			return -EINVAL;
 
 		qp_context->alt_path.port_pkey |= cpu_to_be32(attr->alt_pkey_index |
@@ -791,6 +804,10 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 		qp->atomic_rd_en = attr->qp_access_flags;
 	if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC)
 		qp->resp_depth = attr->max_dest_rd_atomic;
+	if (attr_mask & IB_QP_PORT)
+		qp->port = attr->port_num;
+	if (attr_mask & IB_QP_ALT_PATH)
+		qp->alt_port = attr->alt_port_num;
 
 	if (is_sqp(dev, qp))
 		store_attrs(to_msqp(qp), attr, attr_mask);
@@ -802,13 +819,13 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 	if (is_qp0(dev, qp)) {
 		if (cur_state != IB_QPS_RTR &&
 		    new_state == IB_QPS_RTR)
-			init_port(dev, to_msqp(qp)->port);
+			init_port(dev, qp->port);
 
 		if (cur_state != IB_QPS_RESET &&
 		    cur_state != IB_QPS_ERR &&
 		    (new_state == IB_QPS_RESET ||
 		     new_state == IB_QPS_ERR))
-			mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status);
+			mthca_CLOSE_IB(dev, qp->port, &status);
 	}
 
 	/*
@@ -1212,6 +1229,9 @@ int mthca_alloc_qp(struct mthca_dev *dev
 	if (qp->qpn == -1)
 		return -ENOMEM;
 
+	/* initialize port to zero for error-catching. */
+	qp->port = 0;
+
 	err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq,
 				    send_policy, qp);
 	if (err) {
@@ -1261,7 +1281,7 @@ int mthca_alloc_sqp(struct mthca_dev *de
 	if (err)
 		goto err_out;
 
-	sqp->port = port;
+	sqp->qp.port = port;
 	sqp->qp.qpn = mqpn;
 	sqp->qp.transport = MLX;
 
@@ -1404,10 +1424,10 @@ static int build_mlx_header(struct mthca
 		sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE;
 	sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED);
 	if (!sqp->qp.ibqp.qp_num)
-		ib_get_cached_pkey(&dev->ib_dev, sqp->port,
+		ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port,
 				   sqp->pkey_index, &pkey);
 	else
-		ib_get_cached_pkey(&dev->ib_dev, sqp->port,
+		ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port,
 				   wr->wr.ud.pkey_index, &pkey);
 	sqp->ud_header.bth.pkey = cpu_to_be16(pkey);
 	sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn);
diff --git a/drivers/infiniband/ulp/ipoib/Kconfig b/drivers/infiniband/ulp/ipoib/Kconfig
index 8d2e04c..13d6d01 100644
--- a/drivers/infiniband/ulp/ipoib/Kconfig
+++ b/drivers/infiniband/ulp/ipoib/Kconfig
@@ -10,8 +10,9 @@ config INFINIBAND_IPOIB
 	  group: .
 
 config INFINIBAND_IPOIB_DEBUG
-	bool "IP-over-InfiniBand debugging"
+	bool "IP-over-InfiniBand debugging" if EMBEDDED
 	depends on INFINIBAND_IPOIB
+	default y
 	---help---
 	  This option causes debugging code to be compiled into the
 	  IPoIB driver. The output can be turned on via the
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index b640107..12a1e05 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -65,6 +65,8 @@ enum {
 
 	IPOIB_RX_RING_SIZE = 128,
 	IPOIB_TX_RING_SIZE = 64,
+	IPOIB_MAX_QUEUE_SIZE = 8192,
+	IPOIB_MIN_QUEUE_SIZE = 2,
 
 	IPOIB_NUM_WC = 4,
 
@@ -230,6 +232,9 @@ static inline struct ipoib_neigh **to_ip
 				     INFINIBAND_ALEN, sizeof(void *));
 }
 
+struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh);
+void ipoib_neigh_free(struct ipoib_neigh *neigh);
+
 extern struct workqueue_struct *ipoib_workqueue;
 
 /* functions */
@@ -329,6 +334,8 @@ static inline void ipoib_unregister_debu
 #define ipoib_warn(priv, format, arg...) \
 	ipoib_printk(KERN_WARNING, priv, format , ## arg)
 
+extern int ipoib_sendq_size;
+extern int ipoib_recvq_size;
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 extern int ipoib_debug_level;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_fs.c b/drivers/infiniband/ulp/ipoib/ipoib_fs.c
index 685258e..5dde380 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_fs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_fs.c
@@ -213,7 +213,7 @@ static int ipoib_path_seq_show(struct se
 		   gid_buf, path.pathrec.dlid ? "yes" : "no");
 
 	if (path.pathrec.dlid) {
-		rate = ib_sa_rate_enum_to_int(path.pathrec.rate) * 25;
+		rate = ib_rate_to_mult(path.pathrec.rate) * 25;
 
 		seq_printf(file,
 			   " DLID: 0x%04x\n"
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index ed65202..a54da42 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -161,7 +161,7 @@ static int ipoib_ib_post_receives(struct
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int i;
 
-	for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) {
+	for (i = 0; i < ipoib_recvq_size; ++i) {
 		if (ipoib_alloc_rx_skb(dev, i)) {
 			ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
 			return -ENOMEM;
@@ -187,7 +187,7 @@ static void ipoib_ib_handle_wc(struct ne
 	if (wr_id & IPOIB_OP_RECV) {
 		wr_id &= ~IPOIB_OP_RECV;
 
-		if (wr_id < IPOIB_RX_RING_SIZE) {
+		if (wr_id < ipoib_recvq_size) {
 			struct sk_buff *skb = priv->rx_ring[wr_id].skb;
 			dma_addr_t addr = priv->rx_ring[wr_id].mapping;
 
@@ -252,9 +252,9 @@ static void ipoib_ib_handle_wc(struct ne
 		struct ipoib_tx_buf *tx_req;
 		unsigned long flags;
 
-		if (wr_id >= IPOIB_TX_RING_SIZE) {
+		if (wr_id >= ipoib_sendq_size) {
 			ipoib_warn(priv, "completion event with wrid %d (> %d)\n",
-				   wr_id, IPOIB_TX_RING_SIZE);
+				   wr_id, ipoib_sendq_size);
 			return;
 		}
 
@@ -275,7 +275,7 @@ static void ipoib_ib_handle_wc(struct ne
 		spin_lock_irqsave(&priv->tx_lock, flags);
 		++priv->tx_tail;
 		if (netif_queue_stopped(dev) &&
-		    priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2)
+		    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1)
 			netif_wake_queue(dev);
 		spin_unlock_irqrestore(&priv->tx_lock, flags);
 
@@ -344,13 +344,13 @@ void ipoib_send(struct net_device *dev,
 	 * means we have to make sure everything is properly recorded and
 	 * our state is consistent before we call post_send().
 	 */
-	tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)];
+	tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)];
 	tx_req->skb = skb;
 	addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len,
 			      DMA_TO_DEVICE);
 	pci_unmap_addr_set(tx_req, mapping, addr);
 
-	if (unlikely(post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1),
+	if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
 			       address->ah, qpn, addr, skb->len))) {
 		ipoib_warn(priv, "post_send failed\n");
 		++priv->stats.tx_errors;
@@ -363,7 +363,7 @@ void ipoib_send(struct net_device *dev,
 		address->last_send = priv->tx_head;
 		++priv->tx_head;
 
-		if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) {
+		if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) {
 			ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
 			netif_stop_queue(dev);
 		}
@@ -488,7 +488,7 @@ static int recvs_pending(struct net_devi
 	int pending = 0;
 	int i;
 
-	for (i = 0; i < IPOIB_RX_RING_SIZE; ++i)
+	for (i = 0; i < ipoib_recvq_size; ++i)
 		if (priv->rx_ring[i].skb)
 			++pending;
 
@@ -527,7 +527,7 @@ int ipoib_ib_dev_stop(struct net_device
 	 */
 	while ((int) priv->tx_tail - (int) priv->tx_head < 0) {
 		tx_req = &priv->tx_ring[priv->tx_tail &
-					(IPOIB_TX_RING_SIZE - 1)];
+					(ipoib_sendq_size - 1)];
 		dma_unmap_single(priv->ca->dma_device,
 				 pci_unmap_addr(tx_req, mapping),
 				 tx_req->skb->len,
@@ -536,7 +536,7 @@ int ipoib_ib_dev_stop(struct net_device
 		++priv->tx_tail;
 	}
 
-	for (i = 0; i < IPOIB_RX_RING_SIZE; ++i)
+	for (i = 0; i < ipoib_recvq_size; ++i)
 		if (priv->rx_ring[i].skb) {
 			dma_unmap_single(priv->ca->dma_device,
 					 pci_unmap_addr(&priv->rx_ring[i],
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 9b0bd7c..cb078a7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -41,6 +41,7 @@
 #include
 #include
 #include
+#include
 
 #include /* For ARPHRD_xxx */
 
@@ -53,6 +54,14 @@ MODULE_AUTHOR("Roland Dreier");
 MODULE_DESCRIPTION("IP-over-InfiniBand net driver");
 MODULE_LICENSE("Dual BSD/GPL");
 
+int ipoib_sendq_size __read_mostly = IPOIB_TX_RING_SIZE;
+int ipoib_recvq_size __read_mostly = IPOIB_RX_RING_SIZE;
+
+module_param_named(send_queue_size, ipoib_sendq_size, int, 0444);
+MODULE_PARM_DESC(send_queue_size, "Number of descriptors in send queue");
+module_param_named(recv_queue_size, ipoib_recvq_size, int, 0444);
+MODULE_PARM_DESC(recv_queue_size, "Number of descriptors in receive queue");
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 int ipoib_debug_level;
 
@@ -252,8 +261,8 @@ static void path_free(struct net_device
 		 */
 		if (neigh->ah)
 			ipoib_put_ah(neigh->ah);
-		*to_ipoib_neigh(neigh->neighbour) = NULL;
-		kfree(neigh);
+
+		ipoib_neigh_free(neigh);
 	}
 
 	spin_unlock_irqrestore(&priv->lock, flags);
@@ -327,9 +336,8 @@ void ipoib_flush_paths(struct net_device
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_path *path, *tp;
 	LIST_HEAD(remove_list);
-	unsigned long flags;
 
-	spin_lock_irqsave(&priv->lock, flags);
+	spin_lock_irq(&priv->lock);
 
 	list_splice(&priv->path_list, &remove_list);
 	INIT_LIST_HEAD(&priv->path_list);
@@ -337,14 +345,15 @@ void ipoib_flush_paths(struct net_device
 	list_for_each_entry(path, &remove_list, list)
 		rb_erase(&path->rb_node, &priv->path_tree);
 
-	spin_unlock_irqrestore(&priv->lock, flags);
-
 	list_for_each_entry_safe(path, tp, &remove_list, list) {
 		if (path->query)
 			ib_sa_cancel_query(path->query_id, path->query);
+		spin_unlock_irq(&priv->lock);
 		wait_for_completion(&path->done);
 		path_free(dev, path);
+		spin_lock_irq(&priv->lock);
 	}
+	spin_unlock_irq(&priv->lock);
 }
 
 static void path_rec_completion(int status,
@@ -373,16 +382,9 @@ static void path_rec_completion(int stat
 		struct ib_ah_attr av = {
 			.dlid = be16_to_cpu(pathrec->dlid),
 			.sl = pathrec->sl,
-			.port_num = priv->port
+			.port_num = priv->port,
+			.static_rate = pathrec->rate
 		};
-		int path_rate = ib_sa_rate_enum_to_int(pathrec->rate);
-
-		if (path_rate > 0 && priv->local_rate > path_rate)
-			av.static_rate = (priv->local_rate - 1) / path_rate;
-
-		ipoib_dbg(priv, "static_rate %d for local port %dX, path %dX\n",
-			  av.static_rate, priv->local_rate,
-			  ib_sa_rate_enum_to_int(pathrec->rate));
 
 		ah = ipoib_create_ah(dev, priv->pd, &av);
 	}
@@ -481,7 +483,7 @@ static void neigh_add_path(struct sk_buf
 	struct ipoib_path *path;
 	struct ipoib_neigh *neigh;
 
-	neigh = kmalloc(sizeof *neigh, GFP_ATOMIC);
+	neigh = ipoib_neigh_alloc(skb->dst->neighbour);
 	if (!neigh) {
 		++priv->stats.tx_dropped;
 		dev_kfree_skb_any(skb);
@@ -489,8 +491,6 @@ static void neigh_add_path(struct sk_buf
 	}
 
 	skb_queue_head_init(&neigh->queue);
-	neigh->neighbour = skb->dst->neighbour;
-	*to_ipoib_neigh(skb->dst->neighbour) = neigh;
 
 	/*
 	 * We can only be called from ipoib_start_xmit, so we're
@@ -503,7 +503,7 @@ static void neigh_add_path(struct sk_buf
 		path = path_rec_create(dev,
 				       (union ib_gid *) (skb->dst->neighbour->ha + 4));
 		if (!path)
-			goto err;
+			goto err_path;
 
 		__path_add(dev, path);
 	}
@@ -521,17 +521,17 @@ static void neigh_add_path(struct sk_buf
 		__skb_queue_tail(&neigh->queue, skb);
 
 		if (!path->query && path_rec_start(dev, path))
-			goto err;
+			goto err_list;
 	}
 
 	spin_unlock(&priv->lock);
 	return;
 
-err:
-	*to_ipoib_neigh(skb->dst->neighbour) = NULL;
+err_list:
 	list_del(&neigh->list);
-	kfree(neigh);
 
+err_path:
+	ipoib_neigh_free(neigh);
 	++priv->stats.tx_dropped;
 	dev_kfree_skb_any(skb);
 
@@ -763,8 +763,7 @@ static void ipoib_neigh_destructor(struc
 		if (neigh->ah)
 			ah = neigh->ah;
 		list_del(&neigh->list);
-		*to_ipoib_neigh(n) = NULL;
-		kfree(neigh);
+		ipoib_neigh_free(neigh);
 	}
 
 	spin_unlock_irqrestore(&priv->lock, flags);
@@ -773,6 +772,26 @@ static void ipoib_neigh_destructor(struc
 		ipoib_put_ah(ah);
 }
 
+struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour)
+{
+	struct ipoib_neigh *neigh;
+
+	neigh = kmalloc(sizeof *neigh, GFP_ATOMIC);
+	if (!neigh)
+		return NULL;
+
+	neigh->neighbour = neighbour;
+	*to_ipoib_neigh(neighbour) = neigh;
+
+	return neigh;
+}
+
+void ipoib_neigh_free(struct ipoib_neigh *neigh)
+{
+	*to_ipoib_neigh(neigh->neighbour) = NULL;
+	kfree(neigh);
+}
+
 static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms)
 {
 	parms->neigh_destructor = ipoib_neigh_destructor;
@@ -785,20 +804,19 @@ int ipoib_dev_init(struct net_device *de
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	/* Allocate RX/TX "rings" to hold queued skbs */
-
-	priv->rx_ring = kzalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf),
+	priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
 				GFP_KERNEL);
 	if (!priv->rx_ring) {
 		printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
-		       ca->name, IPOIB_RX_RING_SIZE);
+		       ca->name, ipoib_recvq_size);
 		goto out;
 	}
 
-	priv->tx_ring = kzalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf),
+	priv->tx_ring = kzalloc(ipoib_sendq_size * sizeof *priv->tx_ring,
 				GFP_KERNEL);
 	if (!priv->tx_ring) {
 		printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n",
-		       ca->name, IPOIB_TX_RING_SIZE);
+		       ca->name, ipoib_sendq_size);
 		goto out_rx_ring_cleanup;
 	}
 
@@ -866,7 +884,7 @@ static void ipoib_setup(struct net_devic
 	dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN;
 	dev->addr_len = INFINIBAND_ALEN;
 	dev->type = ARPHRD_INFINIBAND;
-	dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2;
+	dev->tx_queue_len = ipoib_sendq_size * 2;
 	dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX;
 
 	/* MTU will be reset when mcast join happens */
@@ -1118,6 +1136,14 @@ static int __init ipoib_init_module(void
 {
 	int ret;
 
+	ipoib_recvq_size = roundup_pow_of_two(ipoib_recvq_size);
+	ipoib_recvq_size = min(ipoib_recvq_size, IPOIB_MAX_QUEUE_SIZE);
+	ipoib_recvq_size = max(ipoib_recvq_size, IPOIB_MIN_QUEUE_SIZE);
+
+	ipoib_sendq_size = roundup_pow_of_two(ipoib_sendq_size);
+	ipoib_sendq_size = min(ipoib_sendq_size, IPOIB_MAX_QUEUE_SIZE);
+	ipoib_sendq_size = max(ipoib_sendq_size, IPOIB_MIN_QUEUE_SIZE);
+
 	ret = ipoib_register_debugfs();
 	if (ret)
 		return ret;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 93c462e..1dae4b2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -114,8 +114,7 @@ static void ipoib_mcast_free(struct ipoi
 		 */
 		if (neigh->ah)
 			ipoib_put_ah(neigh->ah);
-		*to_ipoib_neigh(neigh->neighbour) = NULL;
-		kfree(neigh);
+		ipoib_neigh_free(neigh);
 	}
 
 	spin_unlock_irqrestore(&priv->lock, flags);
@@ -251,6 +250,7 @@ static int ipoib_mcast_join_finish(struc
 			.port_num = priv->port,
 			.sl = mcast->mcmember.sl,
 			.ah_flags = IB_AH_GRH,
+			.static_rate = mcast->mcmember.rate,
 			.grh = {
 				.flow_label = be32_to_cpu(mcast->mcmember.flow_label),
 				.hop_limit = mcast->mcmember.hop_limit,
@@ -258,17 +258,8 @@ static int ipoib_mcast_join_finish(struc
 				.traffic_class = mcast->mcmember.traffic_class
 			}
 		};
-		int path_rate = ib_sa_rate_enum_to_int(mcast->mcmember.rate);
-
 		av.grh.dgid = mcast->mcmember.mgid;
 
-		if (path_rate > 0 && priv->local_rate > path_rate)
-			av.static_rate = (priv->local_rate - 1) / path_rate;
-
-		ipoib_dbg_mcast(priv, "static_rate %d for local port %dX, mcmember %dX\n",
-				av.static_rate, priv->local_rate,
-				ib_sa_rate_enum_to_int(mcast->mcmember.rate));
-
 		ah = ipoib_create_ah(dev, priv->pd, &av);
 		if (!ah) {
 			ipoib_warn(priv, "ib_address_create failed\n");
@@ -618,6 +609,22 @@ int ipoib_mcast_start_thread(struct net_
 	return 0;
 }
 
+static void wait_for_mcast_join(struct ipoib_dev_priv *priv,
+				struct ipoib_mcast *mcast)
+{
+	spin_lock_irq(&priv->lock);
+	if (mcast && mcast->query) {
+		ib_sa_cancel_query(mcast->query_id, mcast->query);
+		mcast->query = NULL;
+		spin_unlock_irq(&priv->lock);
+		ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n",
+				IPOIB_GID_ARG(mcast->mcmember.mgid));
+		wait_for_completion(&mcast->done);
+	}
+	else
+		spin_unlock_irq(&priv->lock);
+}
+
 int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -637,28 +644,10 @@ int ipoib_mcast_stop_thread(struct net_d
 	if (flush)
 		flush_workqueue(ipoib_workqueue);
 
-	spin_lock_irq(&priv->lock);
-	if (priv->broadcast && priv->broadcast->query) {
-		ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query);
-		priv->broadcast->query = NULL;
-		spin_unlock_irq(&priv->lock);
-		ipoib_dbg_mcast(priv, "waiting for bcast\n");
-		wait_for_completion(&priv->broadcast->done);
-	} else
-		spin_unlock_irq(&priv->lock);
+	wait_for_mcast_join(priv, priv->broadcast);
 
-	list_for_each_entry(mcast, &priv->multicast_list, list) {
-		spin_lock_irq(&priv->lock);
-		if (mcast->query) {
-			ib_sa_cancel_query(mcast->query_id, mcast->query);
-			mcast->query = NULL;
-			spin_unlock_irq(&priv->lock);
-			ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n",
-					IPOIB_GID_ARG(mcast->mcmember.mgid));
-			wait_for_completion(&mcast->done);
-		} else
-			spin_unlock_irq(&priv->lock);
-	}
+	list_for_each_entry(mcast, &priv->multicast_list, list)
+		wait_for_mcast_join(priv, mcast);
 
 	return 0;
 }
@@ -772,13 +761,11 @@ out:
 		if (skb->dst &&
 		    skb->dst->neighbour &&
 		    !*to_ipoib_neigh(skb->dst->neighbour)) {
-			struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC);
+			struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour);
 
 			if (neigh) {
 				kref_get(&mcast->ah->ref);
 				neigh->ah = mcast->ah;
-				neigh->neighbour = skb->dst->neighbour;
-				*to_ipoib_neigh(skb->dst->neighbour) = neigh;
 				list_add_tail(&neigh->list, &mcast->neigh_list);
 			}
 		}
@@ -913,6 +900,7 @@ void ipoib_mcast_restart_task(void *dev_
 
 	/* We have to cancel outside of the spinlock */
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
+		wait_for_mcast_join(priv, mcast);
 		ipoib_mcast_leave(mcast->dev, mcast);
 		ipoib_mcast_free(mcast);
 	}
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 5f03880..1d49d16 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -159,8 +159,8 @@ int ipoib_transport_dev_init(struct net_
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr init_attr = {
 		.cap = {
-			.max_send_wr = IPOIB_TX_RING_SIZE,
-			.max_recv_wr = IPOIB_RX_RING_SIZE,
+			.max_send_wr = ipoib_sendq_size,
+			.max_recv_wr = ipoib_recvq_size,
 			.max_send_sge = 1,
 			.max_recv_sge = 1
 		},
@@ -175,7 +175,7 @@ int ipoib_transport_dev_init(struct net_
 	}
 
 	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev,
-				IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1);
+				ipoib_sendq_size + ipoib_recvq_size + 1);
 	if (IS_ERR(priv->cq)) {
 		printk(KERN_WARNING "%s: failed to create CQ\n", ca->name);
 		goto out_free_pd;
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index fd8a95a..5f2b3f6 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1434,6 +1434,7 @@ static int srp_parse_options(const char
 			p = match_strdup(args);
 			if (strlen(p) != 32) {
 				printk(KERN_WARNING PFX "bad dest GID parameter '%s'\n", p);
+				kfree(p);
 				goto out;
 			}
 
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index 0a9fcd5..c552ee7 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -292,6 +292,10 @@ struct ib_cm_id {
 	u32 remote_cm_qpn; /* 1 unless redirected */
 };
 
+struct ib_cm_id *__ib_create_cm_id(struct ib_device *device,
+				   ib_cm_handler cm_handler,
+				   void *context, struct module *owner);
+
 /**
  * ib_create_cm_id - Allocate a communication identifier.
  * @device: Device associated with the cm_id. All related communication will
@@ -303,9 +307,12 @@ struct ib_cm_id {
  * Communication identifiers are used to track connection states, service
  * ID resolution requests, and listen requests.
*/ -struct ib_cm_id *ib_create_cm_id(struct ib_device *device, - ib_cm_handler cm_handler, - void *context); +static inline struct ib_cm_id *ib_create_cm_id(struct ib_device *device, + ib_cm_handler cm_handler, + void *context) +{ + return __ib_create_cm_id(device, cm_handler, context, THIS_MODULE); +} /** * ib_destroy_cm_id - Destroy a connection identifier. diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index f404fe2..6769d1b 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -91,34 +91,6 @@ enum ib_sa_selector { IB_SA_BEST = 3 }; -enum ib_sa_rate { - IB_SA_RATE_2_5_GBPS = 2, - IB_SA_RATE_5_GBPS = 5, - IB_SA_RATE_10_GBPS = 3, - IB_SA_RATE_20_GBPS = 6, - IB_SA_RATE_30_GBPS = 4, - IB_SA_RATE_40_GBPS = 7, - IB_SA_RATE_60_GBPS = 8, - IB_SA_RATE_80_GBPS = 9, - IB_SA_RATE_120_GBPS = 10 -}; - -static inline int ib_sa_rate_enum_to_int(enum ib_sa_rate rate) -{ - switch (rate) { - case IB_SA_RATE_2_5_GBPS: return 1; - case IB_SA_RATE_5_GBPS: return 2; - case IB_SA_RATE_10_GBPS: return 4; - case IB_SA_RATE_20_GBPS: return 8; - case IB_SA_RATE_30_GBPS: return 12; - case IB_SA_RATE_40_GBPS: return 16; - case IB_SA_RATE_60_GBPS: return 24; - case IB_SA_RATE_80_GBPS: return 32; - case IB_SA_RATE_120_GBPS: return 48; - default: return -1; - } -} - /* * Structures for SA records are named "struct ib_sa_xxx_rec." 
No * attempt is made to pack structures to match the physical layout of @@ -282,37 +254,80 @@ struct ib_sa_query; void ib_sa_cancel_query(int id, struct ib_sa_query *query); -int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, - struct ib_sa_path_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_path_rec *resp, - void *context), - void *context, - struct ib_sa_query **query); - -int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, - u8 method, - struct ib_sa_mcmember_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_mcmember_rec *resp, - void *context), - void *context, - struct ib_sa_query **query); - -int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, - u8 method, - struct ib_sa_service_rec *rec, +int __ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, void (*callback)(int status, - struct ib_sa_service_rec *resp, + struct ib_sa_path_rec *resp, void *context), void *context, - struct ib_sa_query **sa_query); + struct module *owner, + struct ib_sa_query **query); + +int __ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct module *owner, + struct ib_sa_query **query); + +int __ib_sa_service_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_service_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_service_rec *resp, + void *context), + void *context, + struct module *owner, + struct ib_sa_query **sa_query); + +/** + * ib_sa_path_rec_get - Start a Path get query + * 
@device:device to send query on + * @port_num: port number to send query on + * @rec:Path Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send a Path Record Get query to the SA to look up a path. The + * callback function will be called when the query completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_path_rec_get() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query. + */ +static inline int +ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + return __ib_sa_path_rec_get(device, port_num, rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, THIS_MODULE, sa_query); +} /** * ib_sa_mcmember_rec_set - Start an MCMember set query @@ -349,11 +364,11 @@ ib_sa_mcmember_rec_set(struct ib_device void *context, struct ib_sa_query **query) { - return ib_sa_mcmember_rec_query(device, port_num, - IB_MGMT_METHOD_SET, - rec, comp_mask, - timeout_ms, gfp_mask, callback, - context, query); + return __ib_sa_mcmember_rec_query(device, port_num, + IB_MGMT_METHOD_SET, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, THIS_MODULE, query); } /** @@ -391,12 +406,54 @@ ib_sa_mcmember_rec_delete(struct ib_devi void *context, struct ib_sa_query 
**query) { - return ib_sa_mcmember_rec_query(device, port_num, - IB_SA_METHOD_DELETE, - rec, comp_mask, - timeout_ms, gfp_mask, callback, - context, query); + return __ib_sa_mcmember_rec_query(device, port_num, + IB_SA_METHOD_DELETE, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, THIS_MODULE, query); } +/** + * ib_sa_service_rec_query - Start Service Record operation + * @device:device to send request on + * @port_num: port number to send request on + * @method:SA method - should be get, set, or delete + * @rec:Service Record to send in request + * @comp_mask:component mask to send in request + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when request completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:request context, used to cancel request + * + * Send a Service Record set/get/delete to the SA to register, + * unregister or query a service record. + * The callback function will be called when the request completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_service_rec_query() is negative, it is an + * error code. Otherwise it is a request ID that can be used to cancel + * the query. 
+ */ +static inline int +ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, + struct ib_sa_service_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_service_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + return __ib_sa_service_rec_query(device, port_num, method, rec, + comp_mask, timeout_ms, gfp_mask, + callback, context, THIS_MODULE, + sa_query); +} #endif /* IB_SA_H */ diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index c1ad627..6bbf1b3 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -314,6 +314,34 @@ enum ib_ah_flags { IB_AH_GRH = 1 }; +enum ib_rate { + IB_RATE_PORT_CURRENT = 0, + IB_RATE_2_5_GBPS = 2, + IB_RATE_5_GBPS = 5, + IB_RATE_10_GBPS = 3, + IB_RATE_20_GBPS = 6, + IB_RATE_30_GBPS = 4, + IB_RATE_40_GBPS = 7, + IB_RATE_60_GBPS = 8, + IB_RATE_80_GBPS = 9, + IB_RATE_120_GBPS = 10 +}; + +/** + * ib_rate_to_mult - Convert the IB rate enum to a multiple of the + * base rate of 2.5 Gbit/sec. For example, IB_RATE_5_GBPS will be + * converted to 2, since 5 Gbit/sec is 2 * 2.5 Gbit/sec. + * @rate: rate to convert. + */ +int ib_rate_to_mult(enum ib_rate rate) __attribute_const__; + +/** + * mult_to_ib_rate - Convert a multiple of 2.5 Gbit/sec to an IB rate + * enum. + * @mult: multiple to convert. 
+ */ +enum ib_rate mult_to_ib_rate(int mult) __attribute_const__; + struct ib_ah_attr { struct ib_global_route grh; u16 dlid; From rdreier at cisco.com Sun Apr 9 15:08:17 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 09 Apr 2006 15:08:17 -0700 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: <1144445589.19061.7520.camel@hal.voltaire.com> (Hal Rosenstock's message of "07 Apr 2006 17:33:11 -0400") References: <1144410107.19061.824.camel@hal.voltaire.com> <1144445589.19061.7520.camel@hal.voltaire.com> Message-ID: Hal> I thought about using the high order bit of the rmpp_version Hal> as this would never get to needing 8 bits. It seems this is Hal> not a per registration thing though although that approach Hal> would work. This has other minor downsides. Is this due to Hal> adverseness to ioctls ? Yeah, mostly adverseness to ioctls. But it does seem to be a per-registration thing. After all, I assume this is something that is a property of a MAD agent -- otherwise how does the MAD core know how to react to a two-sided RMPP when it gets one? - R. From rdreier at cisco.com Sun Apr 9 15:09:10 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 09 Apr 2006 15:09:10 -0700 Subject: [openib-general] Re: RFC: clean branches/1.0/ In-Reply-To: <20060409214411.GA18887@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 10 Apr 2006 00:44:11 +0300") References: <1144439463.14694.129.camel@chalcedony.pathscale.com> <1144440247.14694.141.camel@chalcedony.pathscale.com> <20060409150130.GJ13416@mellanox.co.il> <1144600338.2434.5.camel@localhost.localdomain> <20060409174002.GA18546@mellanox.co.il> <20060409214411.GA18887@mellanox.co.il> Message-ID: Michael> Surely not. Just a patch against the last stable tree Michael> would be enough. And we don't need it on the kernel.org Michael> frontpage :) Sticking it e.g. in Michael> http://www.kernel.org/pub/linux/kernel/people/roland/ Michael> will do. 
Hmm, I can request that directory. But I don't think there's any preexisting machinery to create snapshots. I guess it shouldn't be too hard to script though. - R. From rdreier at cisco.com Sun Apr 9 21:48:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 09 Apr 2006 21:48:46 -0700 Subject: [openib-general] Please don't commit without maintainer approval In-Reply-To: <1144618692.6263.16.camel@phosphene.durables.org> (Robert Walsh's message of "Sun, 09 Apr 2006 14:38:12 -0700") References: <1144618692.6263.16.camel@phosphene.durables.org> Message-ID: Robert> FWIW, sysfsutils-devel doesn't exist on SuSE: the header Robert> files are in sysfsutils instead. That's probably why Robert> Bryan made that change: he picked it up from a change Robert> (never submitted) I'd made many months ago. I don't think Robert> this would be a big deal for yum, since it's a Robert> BuildRequires, not a Requires. It certainly hasn't been Robert> an issue for us at PathScale, and we've been using it for Robert> ages. It was actually both BuildRequires and Requires. But yum is smarter than I gave it credit for -- it does work to depend on libsysfs.h. So I checked in the change below. However, committing this without telling anyone was still the wrong way to handle this, and the rest of the changes were either wrong or didn't matter either way. - R. --- src/userspace/libibverbs/libibverbs.spec.in (revision 6351) +++ src/userspace/libibverbs/libibverbs.spec.in (working copy) @@ -13,7 +13,7 @@ Url: http://openib.org/ Source: http://openib.org/downloads/libibverbs-1.0.3.tar.gz BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) -BuildRequires: sysfsutils-devel +BuildRequires: %{_includedir}/sysfs/libsysfs.h %description libibverbs is a library that allows userspace processes to use @@ -27,7 +27,7 @@ also be installed. 
%package devel Summary: Development files for the libibverbs library Group: System Environment/Libraries -Requires: %{name} = %{version}-%{release} sysfsutils-devel +Requires: %{name} = %{version}-%{release} %{_includedir}/sysfs/libsysfs.h %description devel Static libraries and header files for the libibverbs verbs library. From ogerlitz at voltaire.com Sun Apr 9 22:27:38 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 10 Apr 2006 08:27:38 +0300 (IDT) Subject: [openib-general] problems cloning infiniband.git Message-ID: Hi Roland, I have problems cloning your git tree, is it an issue on my side? Or. $ git clone http://www.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git infiniband ..... [downloading stuff and then getting this error] error: Couldn't get http://www.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git/refs/heads/static%5Frate for heads/static_rate The requested URL returned error: 404 error: Could not interpret heads/static_rate as something to pull From eitan at mellanox.co.il Sun Apr 9 23:35:00 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 10 Apr 2006 09:35:00 +0300 Subject: [openib-general] IPoIB interface for unauthorized partition In-Reply-To: References: <1144082680.4480.44271.camel@hal.voltaire.com> <4438E822.3070503@mellanox.co.il> Message-ID: <4439FC94.4090204@mellanox.co.il> Hi Roland, Roland Dreier wrote: > Eitan> I thought the intent of the IB spec when defining P_Key > Eitan> index usage (and not P_Key value) was that the P_Key values > Eitan> would never need to be known above the driver level. To > Eitan> avoid exposing the P_Key values we could use P_Key index > Eitan> for creating the IPoIB interfaces. > > Eitan> Does it make sense to work on a patch that would setup > Eitan> IPoIB interfaces by the P_Key index (and not by P_Key > Eitan> value)? > > I don't see how this is feasible. 
The index that a particular P_Key > lands at is completely undetermined -- if two nodes wanted to talk on > partition 0x8001 say, how does one know which interface to use without > knowing the index of that P_Key? OK, I get it. Actually the way IPoIB defines the broadcast group MGID exposes P_Key anyway. > > Eitan> Also I think the expected behavior for IPoIB should be that > Eitan> IPoIB "child" interfaces should be "automatically" > Eitan> initialized by the code that brings up the interface > Eitan> (ifconfig scripts). All valid IPoIB partitions (valid = > Eitan> have corresponding broadcast groups) should be > Eitan> initialized. By doing so we provide a centralized control > Eitan> of the partitions and their IPoIB interfaces through the > Eitan> SM. > > Not sure if this is so. I may want a partition strictly for storage > traffic something like that, so it doesn't make sense to create an > IPoIB interface for that partition. OpenSM provides this capability in the "partition policy": each partition is explicitly marked as used for IPoIB or not. So through one file one could actually control the IPoIB interfaces that will exist in the subnet. My intent is to write some extension to ifup for IPoIB such that all sub interfaces will be automatically started (based on pre-availability of IPoIB broadcast MGID). > > - R. > From leonida at voltaire.com Sun Apr 9 23:01:12 2006 From: leonida at voltaire.com (Leonid Arsh) Date: Mon, 10 Apr 2006 09:01:12 +0300 Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support by the SW Message-ID: <20060410060112.GA27077@voltaire.com> Hello, I'm resending the fixed client reregister event support patch. The event is handled by the software now, as Michael recommended, not by the hardware. 
(The previous patch with the hardware event handling may be found in the mail message "[openib-general][PATCH] mthca & ib_verbs.h client reregister event support".) Note, I moved the "port_info" struct definition from ipath_mad.c to ib_smi.h, so the ipath driver is touched too, but this doesn't change the functionality. I'm also sending the user space verbs patch again, to be applied together with the kernel space patch. (I sent it before in the mail message "[openib-general] verbs.h client reregister event support") See below. Signed-off-by: Leonid Arsh Index: linux-kernel/infiniband/include/rdma/ib_verbs.h =================================================================== --- linux-kernel/infiniband/include/rdma/ib_verbs.h (revision 8509) +++ linux-kernel/infiniband/include/rdma/ib_verbs.h (working copy) @@ -283,7 +283,8 @@ IB_EVENT_SM_CHANGE, IB_EVENT_SRQ_ERR, IB_EVENT_SRQ_LIMIT_REACHED, - IB_EVENT_QP_LAST_WQE_REACHED + IB_EVENT_QP_LAST_WQE_REACHED, + IB_EVENT_CLIENT_REREGISTER }; struct ib_event { Index: linux-kernel/infiniband/include/rdma/ib_smi.h =================================================================== --- linux-kernel/infiniband/include/rdma/ib_smi.h (revision 8509) +++ linux-kernel/infiniband/include/rdma/ib_smi.h (working copy) @@ -91,4 +91,40 @@ return ((smp->status & IB_SMP_DIRECTION) == IB_SMP_DIRECTION); } +struct port_info { + __be64 mkey; + __be64 gid_prefix; + __be16 lid; + __be16 sm_lid; + __be32 cap_mask; + __be16 diag_code; + __be16 mkey_lease_period; + u8 local_port_num; + u8 link_width_enabled; + u8 link_width_supported; + u8 link_width_active; + u8 linkspeed_portstate; /* 4 bits, 4 bits */ + u8 portphysstate_linkdown; /* 4 bits, 4 bits */ + u8 mkeyprot_resv_lmc; /* 2 bits, 3 bits, 3 bits */ + u8 linkspeedactive_enabled; /* 4 bits, 4 bits */ + u8 neighbormtu_mastersmsl; /* 4 bits, 4 bits */ + u8 vlcap_inittype; /* 4 bits, 4 bits */ + u8 vl_high_limit; + u8 vl_arb_high_cap; + u8 vl_arb_low_cap; + u8 inittypereply_mtucap; /* 4 bits, 4 bits 
*/ + u8 vlstallcnt_hoqlife; /* 3 bits, 5 bits */ + u8 operationalvl_pei_peo_fpi_fpo; /* 4 bits, 1, 1, 1, 1 */ + __be16 mkey_violations; + __be16 pkey_violations; + __be16 qkey_violations; + u8 guid_cap; + u8 clientrereg_resv_subnetto; /* 1 bit, 2 bits, 5 bits */ + u8 resv_resptimevalue; /* 3 bits, 5 bits */ + u8 localphyerrors_overrunerrors; /* 4 bits, 4 bits */ + __be16 max_credit_hint; + u8 resv; + u8 link_roundtrip_latency[3]; +} __attribute__ ((packed)); + #endif /* IB_SMI_H */ Index: linux-kernel/infiniband/hw/ipath/ipath_mad.c =================================================================== --- linux-kernel/infiniband/hw/ipath/ipath_mad.c (revision 8509) +++ linux-kernel/infiniband/hw/ipath/ipath_mad.c (working copy) @@ -136,42 +136,6 @@ return reply(smp, __LINE__); } -struct port_info { - __be64 mkey; - __be64 gid_prefix; - __be16 lid; - __be16 sm_lid; - __be32 cap_mask; - __be16 diag_code; - __be16 mkey_lease_period; - u8 local_port_num; - u8 link_width_enabled; - u8 link_width_supported; - u8 link_width_active; - u8 linkspeed_portstate; /* 4 bits, 4 bits */ - u8 portphysstate_linkdown; /* 4 bits, 4 bits */ - u8 mkeyprot_resv_lmc; /* 2 bits, 3 bits, 3 bits */ - u8 linkspeedactive_enabled; /* 4 bits, 4 bits */ - u8 neighbormtu_mastersmsl; /* 4 bits, 4 bits */ - u8 vlcap_inittype; /* 4 bits, 4 bits */ - u8 vl_high_limit; - u8 vl_arb_high_cap; - u8 vl_arb_low_cap; - u8 inittypereply_mtucap; /* 4 bits, 4 bits */ - u8 vlstallcnt_hoqlife; /* 3 bits, 5 bits */ - u8 operationalvl_pei_peo_fpi_fpo; /* 4 bits, 1, 1, 1, 1 */ - __be16 mkey_violations; - __be16 pkey_violations; - __be16 qkey_violations; - u8 guid_cap; - u8 clientrereg_resv_subnetto; /* 1 bit, 2 bits, 5 bits */ - u8 resv_resptimevalue; /* 3 bits, 5 bits */ - u8 localphyerrors_overrunerrors; /* 4 bits, 4 bits */ - __be16 max_credit_hint; - u8 resv; - u8 link_roundtrip_latency[3]; -} __attribute__ ((packed)); - static int recv_subn_get_portinfo(struct ib_smp *smp, struct ib_device *ibdev, u8 port) { 
Index: linux-kernel/infiniband/hw/mthca/mthca_mad.c =================================================================== --- linux-kernel/infiniband/hw/mthca/mthca_mad.c (revision 8509) +++ linux-kernel/infiniband/hw/mthca/mthca_mad.c (working copy) @@ -81,19 +81,27 @@ u8 port_num, struct ib_mad *mad) { - struct ib_event event; + struct ib_event event; + struct port_info *pinfo; if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && mad->mad_hdr.method == IB_MGMT_METHOD_SET) { if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) { + pinfo = (struct port_info *)((struct ib_smp *)mad)->data; + update_sm_ah(to_mdev(ibdev), port_num, - be16_to_cpup((__be16 *) (mad->data + 58)), - (*(u8 *) (mad->data + 76)) & 0xf); + be16_to_cpup(&pinfo->lid), + pinfo->neighbormtu_mastersmsl & 0xf); event.device = ibdev; - event.event = IB_EVENT_LID_CHANGE; event.element.port_num = port_num; + + if(pinfo->clientrereg_resv_subnetto & 0x80) + event.event = IB_EVENT_CLIENT_REREGISTER; + else + event.event = IB_EVENT_LID_CHANGE; + ib_dispatch_event(&event); } @@ -103,6 +111,7 @@ event.element.port_num = port_num; ib_dispatch_event(&event); } + } } This is the user space verbs patch: Signed-off-by: Leonid Arsh Index: userspace/libibverbs/include/infiniband/verbs.h =================================================================== --- userspace/libibverbs/include/infiniband/verbs.h (revision 8165) +++ userspace/libibverbs/include/infiniband/verbs.h (working copy) @@ -190,7 +190,8 @@ IBV_EVENT_SM_CHANGE, IBV_EVENT_SRQ_ERR, IBV_EVENT_SRQ_LIMIT_REACHED, - IBV_EVENT_QP_LAST_WQE_REACHED + IBV_EVENT_QP_LAST_WQE_REACHED, + IBV_EVENT_CLIENT_REREGISTER }; struct ibv_async_event { From tziporet at mellanox.co.il Mon Apr 10 00:30:55 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 10 Apr 2006 10:30:55 +0300 Subject: [openib-general] Re: RFC: clean branches/1.0/ (was Re: Include patch for IPoIB queue size tuning 
into the release 1.0 branch) In-Reply-To: <1144600338.2434.5.camel@localhost.localdomain> References: <1144439463.14694.129.camel@chalcedony.pathscale.com> <1144440247.14694.141.camel@chalcedony.pathscale.com> <20060409150130.GJ13416@mellanox.co.il> <1144600338.2434.5.camel@localhost.localdomain> Message-ID: <443A09AF.6000200@mellanox.co.il> Bryan O'Sullivan wrote: > > Perhaps it would be worth mirroring Roland's for-2.6.17 git tree into > branches/1.0/src/linux-kernel? That would avoid the lack of maintenance > issue, while making it possible for IBM and others to just look in one > location for everything. > > I think the best approach is to remove the kernel part from the 1.0 branch; that way no one will be confused. Tziporet From mst at mellanox.co.il Mon Apr 10 00:42:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 10:42:55 +0300 Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support by the SW In-Reply-To: <20060410060112.GA27077@voltaire.com> References: <20060410060112.GA27077@voltaire.com> Message-ID: <20060410074255.GL13416@mellanox.co.il> Quoting r. Leonid Arsh : > Index: linux-kernel/infiniband/hw/mthca/mthca_mad.c > =================================================================== > --- linux-kernel/infiniband/hw/mthca/mthca_mad.c (revision 8509) > +++ linux-kernel/infiniband/hw/mthca/mthca_mad.c (working copy) > @@ -81,19 +81,27 @@ > > > .... > > > event.device = ibdev; > - event.event = IB_EVENT_LID_CHANGE; > event.element.port_num = port_num; > + > + if(pinfo->clientrereg_resv_subnetto & 0x80) > + event.event = IB_EVENT_CLIENT_REREGISTER; > + else > + event.event = IB_EVENT_LID_CHANGE; > + > ib_dispatch_event(&event); > } > Hmm, might this break ipoib? 
It currently does: void ipoib_event(struct ib_event_handler *handler, struct ib_event *record) { struct ipoib_dev_priv *priv = container_of(handler, struct ipoib_dev_priv, event_handler); if (record->event == IB_EVENT_PORT_ERR || record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || record->event == IB_EVENT_SM_CHANGE) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); } } Don't we need to add IB_EVENT_CLIENT_REREGISTER too? -- MST From bardov at gmail.com Mon Apr 10 00:45:54 2006 From: bardov at gmail.com (Dan Bar Dov) Date: Mon, 10 Apr 2006 09:45:54 +0200 Subject: [openib-general] Location for iser-target code Message-ID: We are starting to work on open-source iser-target code. I want to create a directory for it in the SVN. There are several options I can think of: 1. https://openib.org/svn/gen2/ulps An empty directory; I don't know what it's for, but it could contain the iser-target 2. https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/ulp Together with the rest of the ulps, in the trunk 3. https://openib.org/svn/gen2/branches create a new branch for it (branch of what?) 4. https://openib.org/svn/gen2/branches/openib-candidate/src/linux-kernel/infiniband I have no idea what that is for, but the path seems appropriate. I'm leaning towards option 1 myself. Dan From fujita.tomonori at lab.ntt.co.jp Mon Apr 10 00:54:58 2006 From: fujita.tomonori at lab.ntt.co.jp (FUJITA Tomonori) Date: Mon, 10 Apr 2006 16:54:58 +0900 Subject: [openib-general] Location for iser-target code In-Reply-To: References: Message-ID: <20060410165458Q.fujita.tomonori@lab.ntt.co.jp> From: "Dan Bar Dov" Subject: [openib-general] Location for iser-target code Date: Mon, 10 Apr 2006 09:45:54 +0200 > We are starting to work on open-source iser-target code. > I want to create a directory for it in the SVN. > There are several options I can think of: 
> There are several options I can think of: Any chance that you could implement it as a target driver for scsi tgt? http://stgt.berlios.de/ From leonida at voltaire.com Mon Apr 10 00:14:43 2006 From: leonida at voltaire.com (Leonid Arsh) Date: Mon, 10 Apr 2006 10:14:43 +0300 Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support by the SW In-Reply-To: <20060410074255.GL13416@mellanox.co.il> References: <20060410060112.GA27077@voltaire.com> <20060410074255.GL13416@mellanox.co.il> Message-ID: <443A05E3.40009@voltaire.com> Michael, IPoIB used to interpret the CLIENT_REREGISTER event as a LID_CHANGE event before. That's why we could see that sometimes that IPoIB handled the LID_CHANGE event additional times. With the patch, IPoIB will receive the correct CLIENT_REREGISTER event. You are right, although the CLIENT_REREGISTER event handling is optional, we should add the event handling to IPoIB too. I think, we should change a little the event handling procedure in IPoIB, in order to prevent superfluous device flushing. That's why I didn't add the CLIENT_REREGISTER event to IPoIB yet. I think, we should check more information in the flush_task (such as the port number, partition etc.) in order to flush only right network interfaces. Anyway, the patch does not harm IPoIB at all. Michael S. Tsirkin wrote: > Quoting r. Leonid Arsh : > >> Index: linux-kernel/infiniband/hw/mthca/mthca_mad.c >> =================================================================== >> --- linux-kernel/infiniband/hw/mthca/mthca_mad.c (revision 8509) >> +++ linux-kernel/infiniband/hw/mthca/mthca_mad.c (working copy) >> @@ -81,19 +81,27 @@ >> >> >> .... 
>> >> >> event.device = ibdev; >> - event.event = IB_EVENT_LID_CHANGE; >> event.element.port_num = port_num; >> + >> + if(pinfo->clientrereg_resv_subnetto & 0x80) >> + event.event = IB_EVENT_CLIENT_REREGISTER; >> + else >> + event.event = IB_EVENT_LID_CHANGE; >> + >> ib_dispatch_event(&event); >> } >> >> > > Hmm, might this break ipoib? It currently does: > > void ipoib_event(struct ib_event_handler *handler, > struct ib_event *record) > { > struct ipoib_dev_priv *priv = > container_of(handler, struct ipoib_dev_priv, event_handler); > > if (record->event == IB_EVENT_PORT_ERR || > record->event == IB_EVENT_PKEY_CHANGE || > record->event == IB_EVENT_PORT_ACTIVE || > record->event == IB_EVENT_LID_CHANGE || > record->event == IB_EVENT_SM_CHANGE) { > ipoib_dbg(priv, "Port state change event\n"); > queue_work(ipoib_workqueue, &priv->flush_task); > } > } > > Don't we need to add IB_EVENT_CLIENT_REREGISTER too? > > From mst at mellanox.co.il Mon Apr 10 01:22:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 11:22:27 +0300 Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support by the SW In-Reply-To: <443A05E3.40009@voltaire.com> References: <20060410060112.GA27077@voltaire.com> <20060410074255.GL13416@mellanox.co.il> <443A05E3.40009@voltaire.com> Message-ID: <20060410082227.GO13416@mellanox.co.il> Quoting r. Leonid Arsh : > Subject: Re: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support by the SW > > Michael, > IPoIB used to interpret the CLIENT_REREGISTER event as a LID_CHANGE event > before. That's why we could see that sometimes that IPoIB handled the > LID_CHANGE event additional times. With the patch, IPoIB will receive the > correct CLIENT_REREGISTER event. But it will ignore this event, won't it? So where previously IPoIB responded to CLIENT_REREGISTER in the same way as to LID_CHANGE and re-registered with SM, it now won't do it, which seems wrong. Am I missing something? 
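A minimal sketch of the concern raised here, not the actual IPoIB code: the enum `sketch_event` and the helpers `flush_without_rereg` and `flush_with_rereg` are hypothetical names that only mirror the event names quoted in this thread. It illustrates that a handler matching events against an explicit list ignores a newly introduced event until the list is extended; whether a plain flush is the right reaction to a client reregister request is an assumption here.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the event types named in this thread;
 * the real enum is ib_event_type in include/rdma/ib_verbs.h. */
enum sketch_event {
	SKETCH_EVENT_PORT_ACTIVE,
	SKETCH_EVENT_PORT_ERR,
	SKETCH_EVENT_LID_CHANGE,
	SKETCH_EVENT_PKEY_CHANGE,
	SKETCH_EVENT_SM_CHANGE,
	SKETCH_EVENT_QP_LAST_WQE_REACHED,
	SKETCH_EVENT_CLIENT_REREGISTER
};

/* Mirrors the condition in the quoted ipoib_event(): only the listed
 * events would queue the flush task. */
bool flush_without_rereg(enum sketch_event ev)
{
	return ev == SKETCH_EVENT_PORT_ERR ||
	       ev == SKETCH_EVENT_PKEY_CHANGE ||
	       ev == SKETCH_EVENT_PORT_ACTIVE ||
	       ev == SKETCH_EVENT_LID_CHANGE ||
	       ev == SKETCH_EVENT_SM_CHANGE;
}

/* Same check with the new event added to the list (assumed reaction:
 * treat it like the other port state changes). */
bool flush_with_rereg(enum sketch_event ev)
{
	return flush_without_rereg(ev) ||
	       ev == SKETCH_EVENT_CLIENT_REREGISTER;
}
```

With the first predicate a client reregister event is dropped, while the second queues a flush for it just as for a LID change.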
-- MST From bardov at gmail.com Mon Apr 10 01:22:41 2006 From: bardov at gmail.com (Dan Bar Dov) Date: Mon, 10 Apr 2006 10:22:41 +0200 Subject: [openib-general] Location for iser-target code In-Reply-To: <20060410165458Q.fujita.tomonori@lab.ntt.co.jp> References: <20060410165458Q.fujita.tomonori@lab.ntt.co.jp> Message-ID: On 4/10/06, FUJITA Tomonori wrote: > From: "Dan Bar Dov" > Subject: [openib-general] Location for iser-target code > Date: Mon, 10 Apr 2006 09:45:54 +0200 > > > We are starting to work on open-source iser-target code. > > I want to create a directory for it in the SVN. > > There are several options I can think of: > > Any chance that you could implement it as a target driver for scsi > tgt? Our initial intention was to develop it for the IET iscsi target. I don't really understand what the scsi-tgt is. From the link I see it does not have an iscsi target yet. An iser-target is actually a transport only - it does not implement iscsi, but provides it with an API for doing the networking portion. If the idea behind scsi-tgt is to define various APIs, then I don't see any reason for not implementing the iser-target to conform, assuming actual iscsi-target e.g. IET, will use this framework. I'd be glad if you can educate me more on the scsi-tgt, the page you linked is pretty skimpy. Dan > > http://stgt.berlios.de/ > From dotanb at mellanox.co.il Mon Apr 10 01:37:17 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 10 Apr 2006 11:37:17 +0300 Subject: [openib-general] [uDAPL] who should update the file /etc/dat.conf? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD31C1@mtlexch01.mtl.com> Hi. In the gen2 driver, in the dapl folder the file dat.conf can be found. 
Here is the content of this file:

#
# DAT 1.2 configuration file
#
# Each entry should have the following fields:
#
# \
#
#
# Example for openib_cma and openib_scm
#
# For cma version you specify as:
#     network address, network hostname, or netdev name and 0 for port
#
# For scm version you specify as actual device name and port
#
# Simple (OpenIB-cma) default with netdev name provided first on list
# to enable use of same dat.conf version on all nodes
#
OpenIB-cma u1.2 nonthreadsafe default /usr/local//lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local//lib64/libdaplcma.so mv_dapl.1.2 "192.168.0.22 0" ""
OpenIB-cma-name u1.2 nonthreadsafe default /usr/local//lib64/libdaplcma.so mv_dapl.1.2 "svr1-ib0 0" ""
OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/local//lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
OpenIB-scm1 u1.2 nonthreadsafe default /usr/local//lib64/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
OpenIB-scm2 u1.2 nonthreadsafe default /usr/local//lib64/libdaplscm.so mv_dapl.1.2 "mthca0 2" ""

Who is responsible for updating this file with valid data? For example:
local IPs
local HCAs and valid port numbers

What is the meaning of the first word in each line (DAPL_PROVIDER?)?

thanks
Dotan Barak
Software Verification Engineer
Mellanox Technologies
Tel: +972-4-9097200 Ext: 231
Fax: +972-4-9593245
P.O. Box 86 Yokneam 20692 ISRAEL.
Home: +972-77-8841095
Cell: 052-4222383

From fujita.tomonori at lab.ntt.co.jp Mon Apr 10 01:45:47 2006
From: fujita.tomonori at lab.ntt.co.jp (FUJITA Tomonori)
Date: Mon, 10 Apr 2006 17:45:47 +0900
Subject: [openib-general] Location for iser-target code
In-Reply-To:
References: <20060410165458Q.fujita.tomonori@lab.ntt.co.jp>
Message-ID: <20060410174547Y.fujita.tomonori@lab.ntt.co.jp>

From: "Dan Bar Dov"
Subject: Re: [openib-general] Location for iser-target code
Date: Mon, 10 Apr 2006 10:22:41 +0200

> On 4/10/06, FUJITA Tomonori wrote:
> > From: "Dan Bar Dov"
> > Subject: [openib-general] Location for iser-target code
> > Date: Mon, 10 Apr 2006 09:45:54 +0200
> >
> > > We are starting to work on open-source iser-target code.
> > > I want to create a directory for it in the SVN.
> > > There are several options I can think of:
> >
> > Any chance that you could implement it as a target driver for scsi
> > tgt?
>
> Our initial intention was to develop it for the IET iscsi target.
>
> I don't really understand what the scsi-tgt is. From the link I see
> it does not have an iscsi target yet.

The OLS abstract may be more informative.

http://www.linuxsymposium.org/2006/view_abstract.php?content_key=19

In short, tgt is the framework for SCSI target drivers. The
combination of tgt and the iSCSI target driver for NIC provides the
similar features that IET does.

We don't have the iSCSI target driver yet. Now we are working mainly
on FCP and SRP. However, we will resume working on it.

From leonida at voltaire.com Mon Apr 10 00:48:09 2006
From: leonida at voltaire.com (Leonid Arsh)
Date: Mon, 10 Apr 2006 10:48:09 +0300
Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support by the SW
In-Reply-To: <20060410082227.GO13416@mellanox.co.il>
References: <20060410060112.GA27077@voltaire.com> <20060410074255.GL13416@mellanox.co.il> <443A05E3.40009@voltaire.com> <20060410082227.GO13416@mellanox.co.il>
Message-ID: <443A0DB9.2000104@voltaire.com>

You are right.
Without adding the event handling to IPoIB, IPoIB will not re-register with the SM in some cases. We really should add it. I'll add the event handling to IPoIB a bit later. Anyway, I think we could apply the patch. The CLIENT_REREGISTER request was not supported by older FW, and now we behave the same way. In most cases, the event comes together with the PORT_ACTIVE event, so we don't miss much here. That's why I think that the patch doesn't harm IPoIB. Michael S. Tsirkin wrote: > Quoting r. Leonid Arsh : > >> Subject: Re: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support by the SW >> >> Michael, >> IPoIB used to interpret the CLIENT_REREGISTER event as a LID_CHANGE event >> before. That's why we could see that sometimes that IPoIB handled the >> LID_CHANGE event additional times. With the patch, IPoIB will receive the >> correct CLIENT_REREGISTER event. >> > > But it will ignore this event, won't it? > So where previously IPoIB responded to CLIENT_REREGISTER in the same way > as to LID_CHANGE and re-registered with SM, it now won't do it, which seems > wrong. > > Am I missing something? > > From mst at mellanox.co.il Mon Apr 10 02:24:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 12:24:05 +0300 Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support by the SW In-Reply-To: <443A0DB9.2000104@voltaire.com> References: <20060410060112.GA27077@voltaire.com> <20060410074255.GL13416@mellanox.co.il> <443A05E3.40009@voltaire.com> <20060410082227.GO13416@mellanox.co.il> <443A0DB9.2000104@voltaire.com> Message-ID: <20060410092405.GP13416@mellanox.co.il> Quoting r. Leonid Arsh : > Subject: Re: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support by the SW > > You are right. > Without adding the event handling to IPoIB, IPoIB will not re-register > with the SM in some cases. > We really should add it. > I'll add the event handling to IPoIB a bit later. 
How about an (untested) one-liner like

Index: openib/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- openib/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	(revision 6343)
+++ openib/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	(working copy)
@@ -255,7 +255,9 @@ void ipoib_event(struct ib_event_handler
 	    record->event == IB_EVENT_PKEY_CHANGE ||
 	    record->event == IB_EVENT_PORT_ACTIVE ||
 	    record->event == IB_EVENT_LID_CHANGE ||
-	    record->event == IB_EVENT_SM_CHANGE) {
+	    record->event == IB_EVENT_SM_CHANGE ||
+	    record->event == IB_EVENT_CLIENT_REREGISTER
+	    ) {
 		ipoib_dbg(priv, "Port state change event\n");
 		queue_work(ipoib_workqueue, &priv->flush_task);
 	}

> Anyway, I think we could apply the patch.
> The CLIENT_REREGISTER request was not supported by older FW, and now we
> behave the same way.
> In most cases, the event comes together with the PORT_ACTIVE event, so
> we don't miss much here.
> That's why I think that the patch doesn't harm IPoIB.

How about when opensm is re-started? I actually do this all the time.

-- 
MST

From bardov at gmail.com Mon Apr 10 02:37:26 2006
From: bardov at gmail.com (Dan Bar Dov)
Date: Mon, 10 Apr 2006 11:37:26 +0200
Subject: [openib-general] Location for iser-target code
In-Reply-To: <20060410174547Y.fujita.tomonori@lab.ntt.co.jp>
References: <20060410165458Q.fujita.tomonori@lab.ntt.co.jp> <20060410174547Y.fujita.tomonori@lab.ntt.co.jp>
Message-ID:

On 4/10/06, FUJITA Tomonori wrote:
>
> The OLS abstract may be more informative.
>
> http://www.linuxsymposium.org/2006/view_abstract.php?content_key=19

Is the full document available as well?

>
> In short, tgt is the framework for SCSI target drivers. The
> combination of tgt and the iSCSI target driver for NIC provides the
> similar features that IET does.

SCSI targets are LLDs that sit below the mid-layer. iSCSI target on the other hand is a "network" protocol driver, that sits above SCSI (sd, st, sg, or directly over mid-layer).
Seems like you'd need two different frameworks, no? > > We don't have the iSCSI target driver yet. Now we are working mainly > on FCP and SRP. However, we will resume working on it. > To accomodate iSER, the iSCSI target driver itself would need to be broken to the iSCSI target engine, and the network (or data-mover aka DM) part. The network part needs an API that both the TCP DM and the ISER DM will provide. Does that make sense? Dan From mst at mellanox.co.il Mon Apr 10 02:44:57 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 12:44:57 +0300 Subject: [openib-general] RFC: revert module ref counting patches (was Re: [PATCH] ipoib_flush_paths) In-Reply-To: References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> Message-ID: <20060410094457.GQ13416@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] ipoib_flush_paths > > Michael> Actually, it turned out to be the simplest solution - and > Michael> quite elegant since there's no room for mistakes: if > Michael> query is going to be running this means module is still > Michael> loaded so we can take a reference to it without races. > > Yes, this is suprisingly clean. As it turns out, every problem has a simple, incorrect solution :( > Michael> As a bonus, and assertion inside __module_get increases > Michael> the chance to catch races where user forgets to cancel > Michael> the query - much nicer than crashing randomly. > > Actually I think __module_get() will do the wrong thing if called > during module unloading -- it doesn't test module_is_live(). In other > words, calling __module_get() without already holding a ref has a > race: __try_stop_module() can see the ref count as 0, then > __module_get() can increment it, and then __try_stop_module() sets the > module state to GOING and returns. > > So the right thing to do is BUG_ON(!try_module_get(owner)) Ugh. 
I was wrong: this approach does not work at all. Unfortunately we have:

asmlinkage long sys_delete_module(const char __user *name_user, unsigned int flags)
{
	....
	/* Stop the machine so refcounts can't move and disable module. */
	ret = try_stop_module(mod, flags, &forced);
	if (ret != 0)
		goto out;

	/* Never wait if forced. */
	if (!forced && module_refcount(mod) != 0)
		wait_for_zero_refcount(mod);

	/* Final destruction now noone is using it. */
	if (mod->exit != NULL) {
		up(&module_mutex);
		mod->exit();
		down(&module_mutex);
	}
	free_module(mod);

 out:
	up(&module_mutex);
	return ret;
}

----

This means that incrementing module reference count once the reference count might have once reached 0 is useless *even if the module cleanup did not finish yet*.

Worse, when a module e.g. cancels SA queries as part of module unload, we get callbacks when module is not live anymore, so BUG_ON(!try_module_get(owner)) will trigger things like

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/infiniband/core/sa_query.c:867

And since that's what ipoib does, I am actually seeing these :(

The patch I posted and tested does __module_get instead, so there's no BUG_ON and thus I did not notice anything untoward, but the fact remains that this play with reference counting around callback seems to get us zero safety against module unloading.

I guess this was what people were saying all the time: we can't safely use try_module_get/__module_get not in module context. Sigh.

At this point it seems to me that the only viable short-term approach is to revert the module reference counting patches. Opinions?

It also seems that all we need is an exported API to prevent modules from unloading while we are inside the callback. Just adding 2 exported functions

	module_mutex_lock();
	module_mutex_unlock();

and calling them around callbacks will work.
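
(Editorial illustration, not part of the original message.) The failure mode described above can be modelled in a few lines of user-space C. This is only a sketch under simplifying assumptions: `struct toy_module` and the `toy_*` helpers are hypothetical names, not kernel APIs, and they stand in very loosely for the module state, `try_module_get()`, `__module_get()`, `module_put()` and the non-blocking path of `try_stop_module()`. Locking and `stop_machine` are ignored entirely.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum toy_state { TOY_LIVE, TOY_GOING };

struct toy_module {
	enum toy_state state;
	int refcount;
};

/* Models try_module_get(): succeeds only while the module is still live. */
static bool toy_try_get(struct toy_module *m)
{
	if (m->state != TOY_LIVE)
		return false;
	m->refcount++;
	return true;
}

/* Models __module_get(): bumps the count without checking the state. */
static void toy_get(struct toy_module *m)
{
	m->refcount++;
}

/* Models module_put(): drops one reference. */
static void toy_put(struct toy_module *m)
{
	m->refcount--;
}

/*
 * Models a non-blocking unload attempt: refuse while references are
 * held, otherwise mark the module as going. After this returns true,
 * the (unmodelled) exit function would run regardless of later gets.
 */
static bool toy_try_stop(struct toy_module *m)
{
	if (m->refcount != 0)
		return false;
	m->state = TOY_GOING;
	return true;
}
```

In this model, a callback that arrives after `toy_try_stop()` has succeeded sees `toy_try_get()` fail, which corresponds to the BUG_ON firing, while `toy_get()` succeeds but leaves the state at going, so nothing holds off the exit path.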
Alternatively, we can go back to the original idea of adding API for flushing WQs to ib_mad, ib_sa and ib_addr modules and calling that at module cleanup. Comments on these ideas? All this might be 2.6.18 material though. -- MST From leonida at voltaire.com Mon Apr 10 01:53:09 2006 From: leonida at voltaire.com (Leonid Arsh) Date: Mon, 10 Apr 2006 11:53:09 +0300 Subject: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support by the SW In-Reply-To: <20060410092405.GP13416@mellanox.co.il> References: <20060410060112.GA27077@voltaire.com> <20060410074255.GL13416@mellanox.co.il> <443A05E3.40009@voltaire.com> <20060410082227.GO13416@mellanox.co.il> <443A0DB9.2000104@voltaire.com> <20060410092405.GP13416@mellanox.co.il> Message-ID: <443A1CF5.5030605@voltaire.com> Yes, this will work. This will also fix the SM restart problem. I'm simply not sure that I have time to test it today. If you want I'll test it later. Michael S. Tsirkin wrote: > Quoting r. Leonid Arsh : > >> Subject: Re: [openib-general][PATCH] mthca & ib_verbs.h client reregister event support by the SW >> >> You are right. >> Without adding the event handling to IPoIB, IPoIB will not re-register >> with the SM in some cases. >> We really should add it. >> I'll add the event handling to IPoIB a bit later. 
>> > > How about an (untested) one-liner like > > Index: openib/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > =================================================================== > --- openib/drivers/infiniband/ulp/ipoib/ipoib_verbs.c (revision 6343) > +++ openib/drivers/infiniband/ulp/ipoib/ipoib_verbs.c (working copy) > @@ -255,7 +255,9 @@ void ipoib_event(struct ib_event_handler > record->event == IB_EVENT_PKEY_CHANGE || > record->event == IB_EVENT_PORT_ACTIVE || > record->event == IB_EVENT_LID_CHANGE || > - record->event == IB_EVENT_SM_CHANGE) { > + record->event == IB_EVENT_SM_CHANGE || > + record->event == IB_EVENT_CLIENT_REREGISTER > + ) { > ipoib_dbg(priv, "Port state change event\n"); > queue_work(ipoib_workqueue, &priv->flush_task); > } > > >> Anyway, I think we could apply the patch. >> The CLIENT_REREGISTER request was not supported by older FW, and now we >> behave the same way. >> In most cases, the event comes together with the PORT_ACTIVE event, so >> we don't miss much here. >> That's why I think that the patch doesn't harm IPoIB. >> > > How about when opensm is re-started? I actually do this all the time. > > From halr at voltaire.com Mon Apr 10 03:56:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2006 06:56:08 -0400 Subject: [Fwd: Re: [openib-general] Data Structures at UserSpace] Message-ID: <1144666530.19061.52059.camel@hal.voltaire.com> Hi again Takshak, Hadn't heard anything back on what you decided so I have the following additional information to offer: If you decide on approach 1, then there is the following which may of help/guidance: 1. There are the OpenSM headers which also provide SA client definition (include/vendor/osm_vendor_sa_api.h). You may want to extend the queries or use the user defined query capability depending on what your requirements are here. 2. There is Ira Weiny's sa_query code (in the trunk) which currently handles path records. 
I have not enabled this in the makefile as yet as there is one issue I need to fix. Let me know if you have additional questions. -- Hal -----Forwarded Message----- From: Hal Rosenstock To: "Chahande, Takshak" Cc: openib-general at openib.org Subject: Re: [openib-general] Data Structures at UserSpace Date: 07 Apr 2006 06:48:34 -0400 Hi Takshak, On Thu, 2006-04-06 at 19:27, Chahande, Takshak wrote: > Hi Hal and others, > > I find that, there are no standard header files exists at userspace > which can define structure for PORT_INFO, NODE_INFO and other elements > like Path Records, Service record etc. There is no user space SA client support in gen2/OpenIB currently. > So every individual has to define his own header files to define the > data structures for these elements and use in their application > program or tool. > > If it is exists then could you please point me out or if it does not > then is there any plan or shall I provide such header files to > make standard header files like we have mad.h, umad.h etc. The current plan is to expose path records and multicast support to user space. That's the next increment of support over the next couple months but it sounds like that is insufficient for your needs. If you are planning an SA diagnostics tool, there are 3 approaches in increasing order of difficulty/magnitude of work: 1. Use the OpenSM SA client API for this (osmtest and some Mellanox diagnostic tools use this currently). 2. Use the userspace/management libraries for this. This will take more work as much if not all of the SA support is not there. (These are more geared at SMPs). 3. Develop a user space SA client library for gen2 similar to the other user space libraries (CM, CMA, etc.). 
-- Hal > Thanks, > - Takshak > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Mon Apr 10 03:59:39 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2006 06:59:39 -0400 Subject: [openib-general] IPoIB interface for unauthorized partition In-Reply-To: <4439FC94.4090204@mellanox.co.il> References: <1144082680.4480.44271.camel@hal.voltaire.com> <4438E822.3070503@mellanox.co.il> <4439FC94.4090204@mellanox.co.il> Message-ID: <1144666779.19061.52125.camel@hal.voltaire.com> Hi Eitan, On Mon, 2006-04-10 at 02:35, Eitan Zahavi wrote: > Hi Roland, > > Roland Dreier wrote: > > Eitan> I thought the intent of the IB spec when defining P_Key > > Eitan> index usage (and not P_Key value) was that the P_Key values > > Eitan> would never need to be known above the driver level. To > > Eitan> avoid exposing the P_Key values we could use P_Key index > > Eitan> for creating the IPoIB interfaces. > > > > Eitan> Does it make sense to work on a patch that would setup > > Eitan> IPoIB interfaces by the P_Key index (and not by P_Key > > Eitan> value)? > > > > I don't see how this is feasible. The index that a particular P_Key > > lands at is completely undetermined -- if two nodes wanted to talk on > > partition 0x8001 say, how does one know which interface to use without > > knowing the index of that P_Key? > OK, I get it. Actually the way IPoIB defines the broadcast group MGID exposes P_Key anyway.
> > > > > Eitan> Also I think the expected behavior for IPoIB should be that > > Eitan> IPoIB "child" interfaces should be "automatically" > > Eitan> initialized by the code that brings up the interface > > Eitan> (ifconfig scripts). All valid IPoIB partitions (valid = > > Eitan> have corresponding broadcast groups) should be > > Eitan> initialized. By doing so we provide a centralized control > > Eitan> of the partitions and their IPoIB interfaces through the > > Eitan> SM. > > > > Not sure if this is so. I may want a partition strictly for storage > > traffic something like that, so it doesn't make sense to create an > > IPoIB interface for that partition. > OpenSM provides this capability in the "partition policy": > Each partition is marked explicitly if to be used for IPoIB or not. > So through one file one could actually control the IPoIB interfaces > that will exist in the subnet. The end node does not know the SM policy for that partition though. > My intent is to write some extension to ifup for IPoIB such that all sub > interfaces will be automatically started (based on pre-availability of IPoIB > broadcast MGID). If that were to be done, it would be cleanest if the child IPoIB interface was created only if that IPoIB broadcast group for that partition exists. -- Hal > > > > - R. > > > From halr at voltaire.com Mon Apr 10 04:08:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2006 07:08:54 -0400 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: References: <1144410107.19061.824.camel@hal.voltaire.com> <1144445589.19061.7520.camel@hal.voltaire.com> Message-ID: <1144667137.19061.52221.camel@hal.voltaire.com> On Sun, 2006-04-09 at 18:08, Roland Dreier wrote: > Hal> I thought about using the high order bit of the rmpp_version > Hal> as this would never get to needing 8 bits. It seems this is > Hal> not a per registration thing though although that approach > Hal> would work. 
This has other minor downsides. Is this due to > Hal> adverseness to ioctls ? > > Yeah, mostly adverseness to ioctls. But it does seem to be a > per-registration thing. After all, I assume this is something that is > a property of a MAD agent -- otherwise how does the MAD core know how > to react to a two-sided RMPP when it gets one? Right now, it needs to be burnt into the MAD core as there is no other way to determine this (as it is class/method dependent). I have made a spec comment on this lack of flexibility but don't have a good solution for this at least yet... I suppose one could argue it only applies to those classes/methods (e.g. MAD agents which use those). It certainly can work that way. I think the ioctl way is a little cleaner though but it's only a slight preference. I started with the rmpp_version bit stealing approach. -- Hal > - R. From mst at mellanox.co.il Mon Apr 10 04:17:30 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 14:17:30 +0300 Subject: [openib-general] [PATCH 1 of 4] RFC: revert module ref counting patches In-Reply-To: <20060410094457.GQ13416@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> Message-ID: <20060410111730.GR13416@mellanox.co.il> Revert module ref counting patch for CMA. 
Index: linux-2.6.16/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.16/drivers/infiniband/core/cma.c (revision 6333) +++ linux-2.6.16/drivers/infiniband/core/cma.c (revision 6332) @@ -96,7 +96,6 @@ */ struct rdma_id_private { struct rdma_cm_id id; - struct module *owner; struct list_head list; struct list_head listen_list; @@ -280,9 +279,8 @@ wake_up(&id_priv->wait_remove); } -struct rdma_cm_id* __rdma_create_id(rdma_cm_event_handler event_handler, - void *context, enum rdma_port_space ps, - struct module *owner) +struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler event_handler, + void *context, enum rdma_port_space ps) { struct rdma_id_private *id_priv; @@ -293,7 +291,6 @@ id_priv->state = CMA_IDLE; id_priv->id.context = context; id_priv->id.event_handler = event_handler; - id_priv->owner = owner; id_priv->id.ps = ps; spin_lock_init(&id_priv->lock); init_waitqueue_head(&id_priv->wait); @@ -305,7 +302,7 @@ return &id_priv->id; } -EXPORT_SYMBOL(__rdma_create_id); +EXPORT_SYMBOL(rdma_create_id); static int cma_init_ib_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) { @@ -545,18 +542,6 @@ } } -static int invoke_event_handler(struct rdma_id_private *id_priv, - struct rdma_cm_event *event) -{ - int ret; - - BUG_ON(!try_module_get(id_priv->owner)); - ret = id_priv->id.event_handler(&id_priv->id, event); - module_put(id_priv->owner); - - return ret; -} - static int cma_notify_user(struct rdma_id_private *id_priv, enum rdma_cm_event_type type, int status, void *data, u8 data_len) @@ -568,7 +553,7 @@ event.private_data = data; event.private_data_len = data_len; - return invoke_event_handler(id_priv, &event); + return id_priv->id.event_handler(&id_priv->id, &event); } static void cma_cancel_route(struct rdma_id_private *id_priv) @@ -785,7 +770,7 @@ return ret; } -static struct rdma_id_private* cma_new_id(struct rdma_id_private *listen_id, +static struct rdma_id_private* cma_new_id(struct rdma_cm_id 
*listen_id, struct ib_cm_event *ib_event) { struct rdma_id_private *id_priv; @@ -795,8 +780,8 @@ __u16 port; u8 ip_ver; - id = __rdma_create_id(listen_id->id.event_handler, listen_id->id.context, - listen_id->id.ps, listen_id->owner); + id = rdma_create_id(listen_id->event_handler, listen_id->context, + listen_id->ps); if (IS_ERR(id)) return NULL; @@ -806,11 +791,11 @@ if (!rt->path_rec) goto err; - if (cma_get_net_info(ib_event->private_data, listen_id->id.ps, + if (cma_get_net_info(ib_event->private_data, listen_id->ps, &ip_ver, &port, &src, &dst)) goto err; - cma_save_net_info(&id->route.addr, &listen_id->id.route.addr, + cma_save_net_info(&id->route.addr, &listen_id->route.addr, ip_ver, port, src, dst); rt->path_rec[0] = *ib_event->param.req_rcvd.primary_path; if (rt->num_paths == 2) @@ -841,7 +826,7 @@ goto out; } - conn_id = cma_new_id(listen_id, ib_event); + conn_id = cma_new_id(&listen_id->id, ib_event); if (!conn_id) { ret = -ENOMEM; goto out; @@ -958,13 +943,11 @@ static int cma_listen_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) { - struct rdma_id_private *listen_id = id->context; - struct rdma_id_private *id_priv; + struct rdma_id_private *id_priv = id->context; - id_priv = container_of(id, struct rdma_id_private, id); - id->context = listen_id->id.context; - id->event_handler = listen_id->id.event_handler; - return invoke_event_handler(id_priv, event); + id->context = id_priv->id.context; + id->event_handler = id_priv->id.event_handler; + return id_priv->id.event_handler(id, event); } static void cma_listen_on_dev(struct rdma_id_private *id_priv, @@ -974,8 +957,7 @@ struct rdma_cm_id *id; int ret; - id = __rdma_create_id(cma_listen_handler, id_priv, id_priv->id.ps, - id_priv->owner); + id = rdma_create_id(cma_listen_handler, id_priv, id_priv->id.ps); if (IS_ERR(id)) return; @@ -1098,7 +1080,7 @@ if (!cma_comp_exch(id_priv, work->old_state, work->new_state)) goto out; - if (invoke_event_handler(id_priv, &work->event)) { + if 
(id_priv->id.event_handler(&id_priv->id, &work->event)) { cma_exch(id_priv, CMA_DESTROYING); destroy = 1; } Index: linux-2.6.16/drivers/infiniband/include/rdma/rdma_cm.h =================================================================== --- linux-2.6.16/drivers/infiniband/include/rdma/rdma_cm.h (revision 6333) +++ linux-2.6.16/drivers/infiniband/include/rdma/rdma_cm.h (revision 6332) @@ -106,10 +106,6 @@ u8 port_num; }; -struct rdma_cm_id *__rdma_create_id(rdma_cm_event_handler event_handler, - void *context, enum rdma_port_space ps, - struct module *owner); - /** * rdma_create_id - Create an RDMA identifier. * @@ -118,12 +114,8 @@ * @context: User specified context associated with the id. * @ps: RDMA port space. */ -static inline struct rdma_cm_id * -rdma_create_id(rdma_cm_event_handler event_handler, - void *context, enum rdma_port_space ps) -{ - return __rdma_create_id(event_handler, context, ps, THIS_MODULE); -} +struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler, + void *context, enum rdma_port_space ps); void rdma_destroy_id(struct rdma_cm_id *id); -- MST From mst at mellanox.co.il Mon Apr 10 04:17:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 14:17:55 +0300 Subject: [openib-general] [PATCH 2 of 4] RFC: revert module ref counting patches In-Reply-To: <20060410094457.GQ13416@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> Message-ID: <20060410111755.GS13416@mellanox.co.il> Revert module ref counting patch for CM. I expect the same to apply to for-2.6.17. 
Index: linux-2.6.16/drivers/infiniband/core/cm.c =================================================================== --- linux-2.6.16/drivers/infiniband/core/cm.c (revision 6324) +++ linux-2.6.16/drivers/infiniband/core/cm.c (revision 6323) @@ -118,7 +118,6 @@ struct cm_id_private { struct ib_cm_id id; - struct module *owner; struct rb_node service_node; struct rb_node sidr_id_node; @@ -591,9 +590,9 @@ ib_send_cm_sidr_rep(&cm_id_priv->id, ¶m); } -struct ib_cm_id *__ib_create_cm_id(struct ib_device *device, +struct ib_cm_id *ib_create_cm_id(struct ib_device *device, ib_cm_handler cm_handler, - void *context, struct module *owner) + void *context) { struct cm_id_private *cm_id_priv; int ret; @@ -602,7 +601,6 @@ if (!cm_id_priv) return ERR_PTR(-ENOMEM); - cm_id_priv->owner = owner; cm_id_priv->id.state = IB_CM_IDLE; cm_id_priv->id.device = device; cm_id_priv->id.cm_handler = cm_handler; @@ -623,7 +621,7 @@ kfree(cm_id_priv); return ERR_PTR(-ENOMEM); } -EXPORT_SYMBOL(__ib_create_cm_id); +EXPORT_SYMBOL(ib_create_cm_id); static struct cm_work * cm_dequeue_work(struct cm_id_private *cm_id_priv) { @@ -1153,18 +1151,6 @@ work->cm_event.private_data = &req_msg->private_data; } -static int invoke_cm_handler(struct cm_id_private *cm_id_priv, - struct ib_cm_event *event) -{ - int ret; - - BUG_ON(!try_module_get(cm_id_priv->owner)); - ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, event); - module_put(cm_id_priv->owner); - - return ret; -} - static void cm_process_work(struct cm_id_private *cm_id_priv, struct cm_work *work) { @@ -1172,7 +1158,7 @@ int ret; /* We will typically only have the current event to report. 
*/ - ret = invoke_cm_handler(cm_id_priv, &work->cm_event); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->cm_event); cm_free_work(work); while (!ret && !atomic_add_negative(-1, &cm_id_priv->work_count)) { @@ -1180,7 +1166,8 @@ work = cm_dequeue_work(cm_id_priv); spin_unlock_irqrestore(&cm_id_priv->lock, flags); BUG_ON(!work); - ret = invoke_cm_handler(cm_id_priv, &work->cm_event); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, + &work->cm_event); cm_free_work(work); } cm_deref_id(cm_id_priv); @@ -1370,7 +1357,6 @@ goto error2; } - cm_id_priv->owner = listen_cm_id_priv->owner; cm_id_priv->id.cm_handler = listen_cm_id_priv->id.cm_handler; cm_id_priv->id.context = listen_cm_id_priv->id.context; cm_id_priv->id.service_id = req_msg->service_id; @@ -2743,7 +2729,6 @@ atomic_inc(&cur_cm_id_priv->refcount); spin_unlock_irqrestore(&cm.lock, flags); - cm_id_priv->owner = cur_cm_id_priv->owner; cm_id_priv->id.cm_handler = cur_cm_id_priv->id.cm_handler; cm_id_priv->id.context = cur_cm_id_priv->id.context; cm_id_priv->id.service_id = sidr_req_msg->service_id; @@ -2912,7 +2897,7 @@ cm_event.param.send_status = wc_status; /* No other events can occur on the cm_id at this point. */ - ret = invoke_cm_handler(cm_id_priv, &cm_event); + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &cm_event); cm_free_msg(msg); if (ret) ib_destroy_cm_id(&cm_id_priv->id); Index: linux-2.6.16/drivers/infiniband/include/rdma/ib_cm.h =================================================================== --- linux-2.6.16/drivers/infiniband/include/rdma/ib_cm.h (revision 6324) +++ linux-2.6.16/drivers/infiniband/include/rdma/ib_cm.h (revision 6323) @@ -292,10 +292,6 @@ u32 remote_cm_qpn; /* 1 unless redirected */ }; -struct ib_cm_id *__ib_create_cm_id(struct ib_device *device, - ib_cm_handler cm_handler, - void *context, struct module *owner); - /** * ib_create_cm_id - Allocate a communication identifier. * @device: Device associated with the cm_id. 
All related communication will @@ -307,12 +303,9 @@ * Communication identifiers are used to track connection states, service * ID resolution requests, and listen requests. */ -static inline struct ib_cm_id *ib_create_cm_id(struct ib_device *device, - ib_cm_handler cm_handler, - void *context) -{ - return __ib_create_cm_id(device, cm_handler, context, THIS_MODULE); -} +struct ib_cm_id *ib_create_cm_id(struct ib_device *device, + ib_cm_handler cm_handler, + void *context); /** * ib_destroy_cm_id - Destroy a connection identifier. -- MST From mst at mellanox.co.il Mon Apr 10 04:18:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 14:18:02 +0300 Subject: [openib-general] [PATCH 3 of 4] RFC: revert module ref counting patches In-Reply-To: <20060410094457.GQ13416@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> Message-ID: <20060410111802.GT13416@mellanox.co.il> Revert module ref counting patch for addr. 
Index: linux-2.6.16/drivers/infiniband/core/addr.c =================================================================== --- linux-2.6.16/drivers/infiniband/core/addr.c (revision 6323) +++ linux-2.6.16/drivers/infiniband/core/addr.c (revision 6322) @@ -73,7 +73,6 @@ struct sockaddr src_addr; struct sockaddr dst_addr; struct rdma_dev_addr *addr; - struct module *owner; void *context; void (*callback)(int status, struct sockaddr *src_addr, struct rdma_dev_addr *addr, void *context); @@ -253,10 +252,8 @@ list_for_each_entry_safe(req, temp_req, &done_list, list) { list_del(&req->list); - BUG_ON(!try_module_get(req->owner)); req->callback(req->status, &req->src_addr, req->addr, req->context); - module_put(req->owner); kfree(req); } } @@ -292,12 +289,11 @@ return ret; } -int __rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, - struct rdma_dev_addr *addr, int timeout_ms, - void (*callback)(int status, struct sockaddr *src_addr, - struct rdma_dev_addr *addr, - void *context), - void *context, struct module *owner) +int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, + struct rdma_dev_addr *addr, int timeout_ms, + void (*callback)(int status, struct sockaddr *src_addr, + struct rdma_dev_addr *addr, void *context), + void *context) { struct sockaddr_in *src_in, *dst_in; struct addr_req *req; @@ -314,7 +310,6 @@ req->addr = addr; req->callback = callback; req->context = context; - req->owner = owner; src_in = (struct sockaddr_in *) &req->src_addr; dst_in = (struct sockaddr_in *) &req->dst_addr; @@ -340,7 +335,7 @@ } return ret; } -EXPORT_SYMBOL(__rdma_resolve_ip); +EXPORT_SYMBOL(rdma_resolve_ip); void rdma_addr_cancel(struct rdma_dev_addr *addr) { Index: linux-2.6.16/drivers/infiniband/include/rdma/ib_addr.h =================================================================== --- linux-2.6.16/drivers/infiniband/include/rdma/ib_addr.h (revision 6323) +++ linux-2.6.16/drivers/infiniband/include/rdma/ib_addr.h (revision 6322) @@ -43,13 
+43,6 @@ enum rdma_node_type dev_type; }; -int __rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, - struct rdma_dev_addr *addr, int timeout_ms, - void (*callback)(int status, struct sockaddr *src_addr, - struct rdma_dev_addr *addr, - void *context), - void *context, struct module *owner); - /** * rdma_translate_ip - Translate a local IP address to an RDMA hardware * address. @@ -71,16 +64,11 @@ * or been canceled. A status of 0 indicates success. * @context: User-specified context associated with the call. */ -static inline int -rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, - struct rdma_dev_addr *addr, int timeout_ms, - void (*callback)(int status, struct sockaddr *src_addr, - struct rdma_dev_addr *addr, void *context), - void *context) -{ - return __rdma_resolve_ip(src_addr, dst_addr, addr, timeout_ms, - callback, context, THIS_MODULE); -} +int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, + struct rdma_dev_addr *addr, int timeout_ms, + void (*callback)(int status, struct sockaddr *src_addr, + struct rdma_dev_addr *addr, void *context), + void *context); void rdma_addr_cancel(struct rdma_dev_addr *addr); -- MST From mst at mellanox.co.il Mon Apr 10 04:18:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 14:18:21 +0300 Subject: [openib-general] [PATCH 4 of 4] RFC: revert module ref counting patches In-Reply-To: <20060410094457.GQ13416@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> Message-ID: <20060410111821.GU13416@mellanox.co.il> Revert module ref counting patch for SA. I expect the same to apply to for-2.6.17. 
Index: linux-2.6.16/drivers/infiniband/core/sa_query.c =================================================================== --- linux-2.6.16/drivers/infiniband/core/sa_query.c (revision 6310) +++ linux-2.6.16/drivers/infiniband/core/sa_query.c (revision 6309) @@ -73,7 +73,6 @@ struct ib_sa_query { void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); void (*release)(struct ib_sa_query *); - struct module *owner; struct ib_sa_port *port; struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; @@ -573,16 +572,15 @@ * error code. Otherwise it is a query ID that can be used to cancel * the query. */ -int __ib_sa_path_rec_get(struct ib_device *device, u8 port_num, - struct ib_sa_path_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_path_rec *resp, - void *context), - void *context, - struct module *owner, - struct ib_sa_query **sa_query) +int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) { struct ib_sa_path_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); @@ -617,7 +615,6 @@ query->sa_query.callback = callback ? ib_sa_path_rec_callback : NULL; query->sa_query.release = ib_sa_path_rec_release; - query->sa_query.owner = owner; query->sa_query.port = port; mad->mad_hdr.method = IB_MGMT_METHOD_GET; mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); @@ -641,7 +638,7 @@ kfree(query); return ret; } -EXPORT_SYMBOL(__ib_sa_path_rec_get); +EXPORT_SYMBOL(ib_sa_path_rec_get); static void ib_sa_service_rec_callback(struct ib_sa_query *sa_query, int status, @@ -691,16 +688,15 @@ * error code. Otherwise it is a request ID that can be used to cancel * the query. 
*/ -int __ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, - struct ib_sa_service_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_service_rec *resp, - void *context), - void *context, - struct module *owner, - struct ib_sa_query **sa_query) +int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, + struct ib_sa_service_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_service_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) { struct ib_sa_service_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); @@ -740,7 +736,6 @@ query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; query->sa_query.release = ib_sa_service_rec_release; - query->sa_query.owner = owner; query->sa_query.port = port; mad->mad_hdr.method = method; mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_SERVICE_REC); @@ -765,7 +760,7 @@ kfree(query); return ret; } -EXPORT_SYMBOL(__ib_sa_service_rec_query); +EXPORT_SYMBOL(ib_sa_service_rec_query); static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, int status, @@ -789,17 +784,16 @@ kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } -int __ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, - u8 method, - struct ib_sa_mcmember_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_mcmember_rec *resp, - void *context), - void *context, - struct module *owner, - struct ib_sa_query **sa_query) +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query 
**sa_query) { struct ib_sa_mcmember_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); @@ -834,7 +828,6 @@ query->sa_query.callback = callback ? ib_sa_mcmember_rec_callback : NULL; query->sa_query.release = ib_sa_mcmember_rec_release; - query->sa_query.owner = owner; query->sa_query.port = port; mad->mad_hdr.method = method; mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); @@ -859,16 +852,8 @@ kfree(query); return ret; } -EXPORT_SYMBOL(__ib_sa_mcmember_rec_query); +EXPORT_SYMBOL(ib_sa_mcmember_rec_query); -static void call_sa_callback(struct ib_sa_query *query, int status, - struct ib_sa_mad *mad) -{ - BUG_ON(!try_module_get(query->owner)); - query->callback(query, status, mad); - module_put(query->owner); -} - static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *mad_send_wc) { @@ -881,13 +866,13 @@ /* No callback -- already got recv */ break; case IB_WC_RESP_TIMEOUT_ERR: - call_sa_callback(query, -ETIMEDOUT, NULL); + query->callback(query, -ETIMEDOUT, NULL); break; case IB_WC_WR_FLUSH_ERR: - call_sa_callback(query, -EINTR, NULL); + query->callback(query, -EINTR, NULL); break; default: - call_sa_callback(query, -EIO, NULL); + query->callback(query, -EIO, NULL); break; } @@ -911,12 +896,12 @@ if (query->callback) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) - call_sa_callback(query, - mad_recv_wc->recv_buf.mad->mad_hdr.status ? - -EINVAL : 0, - (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); + query->callback(query, + mad_recv_wc->recv_buf.mad->mad_hdr.status ? 
+ -EINVAL : 0, + (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); else - call_sa_callback(query, -EIO, NULL); + query->callback(query, -EIO, NULL); } ib_free_recv_mad(mad_recv_wc); Index: linux-2.6.16/drivers/infiniband/include/rdma/ib_sa.h =================================================================== --- linux-2.6.16/drivers/infiniband/include/rdma/ib_sa.h (revision 6310) +++ linux-2.6.16/drivers/infiniband/include/rdma/ib_sa.h (revision 6309) @@ -254,82 +254,39 @@ void ib_sa_cancel_query(int id, struct ib_sa_query *query); -int __ib_sa_path_rec_get(struct ib_device *device, u8 port_num, - struct ib_sa_path_rec *rec, +int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, int timeout_ms, gfp_t gfp_mask, void (*callback)(int status, - struct ib_sa_path_rec *resp, + struct ib_sa_service_rec *resp, void *context), void *context, - struct module *owner, - struct ib_sa_query **query); + struct ib_sa_query **sa_query); -int __ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, - u8 method, - struct ib_sa_mcmember_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_mcmember_rec *resp, - void *context), - void *context, - struct module *owner, - struct ib_sa_query **query); - -int __ib_sa_service_rec_query(struct 
ib_device *device, u8 port_num, - u8 method, - struct ib_sa_service_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_service_rec *resp, - void *context), - void *context, - struct module *owner, - struct ib_sa_query **sa_query); - /** - * ib_sa_path_rec_get - Start a Path get query - * @device:device to send query on - * @port_num: port number to send query on - * @rec:Path Record to send in query - * @comp_mask:component mask to send in query - * @timeout_ms:time to wait for response - * @gfp_mask:GFP mask to use for internal allocations - * @callback:function called when query completes, times out or is - * canceled - * @context:opaque user context passed to callback - * @sa_query:query context, used to cancel query - * - * Send a Path Record Get query to the SA to look up a path. The - * callback function will be called when the query completes (or - * fails); status is 0 for a successful response, -EINTR if the query - * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error - * occurred sending the query. The resp parameter of the callback is - * only valid if status is 0. - * - * If the return value of ib_sa_path_rec_get() is negative, it is an - * error code. Otherwise it is a query ID that can be used to cancel - * the query. 
- */ -static inline int -ib_sa_path_rec_get(struct ib_device *device, u8 port_num, - struct ib_sa_path_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_path_rec *resp, - void *context), - void *context, - struct ib_sa_query **sa_query) -{ - return __ib_sa_path_rec_get(device, port_num, rec, comp_mask, - timeout_ms, gfp_mask, callback, - context, THIS_MODULE, sa_query); -} - -/** * ib_sa_mcmember_rec_set - Start an MCMember set query * @device:device to send query on * @port_num: port number to send query on @@ -364,11 +321,11 @@ void *context, struct ib_sa_query **query) { - return __ib_sa_mcmember_rec_query(device, port_num, - IB_MGMT_METHOD_SET, - rec, comp_mask, - timeout_ms, gfp_mask, callback, - context, THIS_MODULE, query); + return ib_sa_mcmember_rec_query(device, port_num, + IB_MGMT_METHOD_SET, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); } /** @@ -406,57 +363,14 @@ void *context, struct ib_sa_query **query) { - return __ib_sa_mcmember_rec_query(device, port_num, - IB_SA_METHOD_DELETE, - rec, comp_mask, - timeout_ms, gfp_mask, callback, - context, THIS_MODULE, query); + return ib_sa_mcmember_rec_query(device, port_num, + IB_SA_METHOD_DELETE, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); } /** - * ib_sa_service_rec_query - Start Service Record operation - * @device:device to send request on - * @port_num: port number to send request on - * @method:SA method - should be get, set, or delete - * @rec:Service Record to send in request - * @comp_mask:component mask to send in request - * @timeout_ms:time to wait for response - * @gfp_mask:GFP mask to use for internal allocations - * @callback:function called when request completes, times out or is - * canceled - * @context:opaque user context passed to callback - * @sa_query:request context, used to cancel request - * - * Send a Service Record set/get/delete to the SA to register, - * 
unregister or query a service record. - * The callback function will be called when the request completes (or - * fails); status is 0 for a successful response, -EINTR if the query - * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error - * occurred sending the query. The resp parameter of the callback is - * only valid if status is 0. - * - * If the return value of ib_sa_service_rec_query() is negative, it is an - * error code. Otherwise it is a request ID that can be used to cancel - * the query. - */ -static inline int -ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, - struct ib_sa_service_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_service_rec *resp, - void *context), - void *context, - struct ib_sa_query **sa_query) -{ - return __ib_sa_service_rec_query(device, port_num, method, rec, - comp_mask, timeout_ms, gfp_mask, - callback, context, THIS_MODULE, - sa_query); -} - -/** * ib_sa_pack_attr - Copy an SA attribute from a host defined structure to * a network packed structure. * dst: Destination buffer. -- MST From halr at voltaire.com Mon Apr 10 04:13:01 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2006 07:13:01 -0400 Subject: [openib-general] IPoIB interface for unauthorized partition In-Reply-To: References: <1144082680.4480.44271.camel@hal.voltaire.com> <4438E822.3070503@mellanox.co.il> Message-ID: <1144667379.19061.52265.camel@hal.voltaire.com> On Sun, 2006-04-09 at 17:34, Roland Dreier wrote: > Eitan> I thought the intent of the IB spec when defining P_Key > Eitan> index usage (and not P_Key value) was that the P_Key values > Eitan> would never need to be known above the driver level. To > Eitan> avoid exposing the P_Key values we could use P_Key index > Eitan> for creating the IPoIB interfaces. 
> > Eitan> Does it make sense to work on a patch that would setup > Eitan> IPoIB interfaces by the P_Key index (and not by P_Key > Eitan> value)? > > I don't see how this is feasible. The index that a particular P_Key > lands at is completely undetermined -- if two nodes wanted to talk on > partition 0x8001 say, how does one know which interface to use without > knowing the index of that P_Key? > > Eitan> Also I think the expected behavior for IPoIB should be that > Eitan> IPoIB "child" interfaces should be "automatically" > Eitan> initialized by the code that brings up the interface > Eitan> (ifconfig scripts). All valid IPoIB partitions (valid = > Eitan> have corresponding broadcast groups) should be > Eitan> initialized. By doing so we provide a centralized control > Eitan> of the partitions and their IPoIB interfaces through the > Eitan> SM. > > Not sure if this is so. I may want a partition strictly for storage > traffic something like that, so it doesn't make sense to create an > IPoIB interface for that partition. Couldn't it be done based on the existence of the appropriate IPoIB broadcast group for that partition ? -- Hal From halr at voltaire.com Mon Apr 10 04:18:35 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2006 07:18:35 -0400 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: <4436E507.7000309@ichips.intel.com> References: <1144410107.19061.824.camel@hal.voltaire.com> <4436E507.7000309@ichips.intel.com> Message-ID: <1144667911.19061.52371.camel@hal.voltaire.com> Hi Sean, On Fri, 2006-04-07 at 18:17, Sean Hefty wrote: > Hal Rosenstock wrote: > > It can be added but may require an API change and possibly an ABI > > change. It seems that user space code needs to both say and know whether > > dual sided RMPP is supported or not so all mixes of user space and > > kernel code could "work". > > Do we really need to support these combinations? 
OpenSM is supporting not just OpenIB but also gen1 still. So I think there is relevance there if that is to continue. Also, there is the possibility of running a newer OpenSM on an older kernel which does not support this (at least properly). > Does anything use the GetMulti method today? Yes but not OpenIB. Once OpenIB does support it, then the issue becomes what happens if OpenSM/OpenIB is not being used and the other SMs lag behind or decide not to support this. > Is dual-RMPP used with anything other than MultiPathRecords? No. > This is my understanding of what needs to happen to support dual-sided RMPP. > > Node A sends an RMPP message. This requires normal RMPP processing. > Node A sends an ACK of the final ACK (I'll call ACK2), giving a new window. > Node B receives ACKs. > Node B sends the response. This requires normal RMPP processing. > > From the perspective of node A, the RMPP code only needs to know to send ACK2. There's more to the state machine in turning the direction around in terms of the sender becoming the receiver. I thought that this is the "harder" direction change. > It can do this based on the method, or per transaction if directed by the client. Yes; I was thinking of class/method based approach for this. > Node B is more complex. It must now wait for ACK2, using timeout and retries of > ACK1 until ACK2 is received. And the response that will be generated by the > client must be delayed until that ACK2 is received. Yes but isn't much of this already needed for the normal termination case or is that part not implemented yet ? > For node B, it may be simpler to delay handing the request up to a client until > ACK2 is received. As you suggest, it makes sense not to hand it up in the dual sided case until the turnaround ACK is received (ACK2) because if that ACK is not received, one would want to indicate an error on the transaction (and not send the response). 
> The only information from ACK2 that's needed when sending the > response is NewWindowLast. A client could be expected to give this back to the > RMPP layer when sending the response. (A client that lied about NewWindowLast > should only lead to sending some packets that would be dropped, with the > transaction aborted.) Good idea. That would eliminate the need for some context transfer from the receive side to the send side in the RMPP code itself. Leaving the NWL up to the client could have the effect you mentioned but this is known to the RMPP core and hence we needn't rely on the client for this. > So, if we always make the sender of an RMPP message specify NewWindowLast, with > a default of 1 set when the MAD is allocated, then we can keep RMPP consistent. > And we'd only be left handling ACK2. This is a clever idea. I want to think about it some more. Thanks. -- Hal > - Sean From mst at mellanox.co.il Mon Apr 10 04:30:53 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 14:30:53 +0300 Subject: [openib-general] [PATCH] thinko fix in core/cache.c Message-ID: <20060410113053.GV13416@mellanox.co.il> core: fixed the pointer type in memory allocation Signed-off-by: Dotan Barak Signed-off-by: Michael S. 
Tsirkin Index: last_stable/src/linux-kernel/infiniband/core/cache.c =================================================================== --- last_stable.orig/src/linux-kernel/infiniband/core/cache.c 2006-04-10 11:12:32.000000000 +0300 +++ last_stable/src/linux-kernel/infiniband/core/cache.c 2006-04-10 14:22:47.000000000 +0300 @@ -326,7 +326,7 @@ static void ib_cache_setup_one(struct ib kmalloc(sizeof *device->cache.pkey_cache * (end_port(device) - start_port(device) + 1), GFP_KERNEL); device->cache.gid_cache = - kmalloc(sizeof *device->cache.pkey_cache * + kmalloc(sizeof *device->cache.gid_cache * (end_port(device) - start_port(device) + 1), GFP_KERNEL); device->cache.lmc_cache = kmalloc(sizeof *device->cache.lmc_cache * -- MST From dotanb at mellanox.co.il Mon Apr 10 04:42:21 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 10 Apr 2006 14:42:21 +0300 Subject: [openib-general] [PATCH] core: fixed the pointer type in memory allocation Message-ID: <200604101442.21944.dotanb@mellanox.co.il> core: fixed the pointer type in memory allocation Signed-off-by: Dotan Barak Index: last_stable/src/linux-kernel/infiniband/core/cache.c =================================================================== --- last_stable.orig/src/linux-kernel/infiniband/core/cache.c 2006-04-10 11:12:32.000000000 +0300 +++ last_stable/src/linux-kernel/infiniband/core/cache.c 2006-04-10 14:22:47.000000000 +0300 @@ -326,7 +326,7 @@ static void ib_cache_setup_one(struct ib kmalloc(sizeof *device->cache.pkey_cache * (end_port(device) - start_port(device) + 1), GFP_KERNEL); device->cache.gid_cache = - kmalloc(sizeof *device->cache.pkey_cache * + kmalloc(sizeof *device->cache.gid_cache * (end_port(device) - start_port(device) + 1), GFP_KERNEL); device->cache.lmc_cache = kmalloc(sizeof *device->cache.lmc_cache * Dotan From eitan at mellanox.co.il Mon Apr 10 05:35:47 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 10 Apr 2006 15:35:47 +0300 Subject: [openib-general] IPoIB interface 
for unauthorized partition Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102BA2C@mtlexch01.mtl.com> Hi Hal, > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, April 10, 2006 2:00 PM > To: Eitan Zahavi > Cc: Roland Dreier; openib-general at openib.org > Subject: Re: [openib-general] IPoIB interface for unauthorized partition > > Hi Eitan, > > On Mon, 2006-04-10 at 02:35, Eitan Zahavi wrote: > > Hi Roland, > > > > Roland Dreier wrote: > > > Eitan> I thought the intent of the IB spec when defining P_Key > > > Eitan> index usage (and not P_Key value) was that the P_Key values > > > Eitan> would never need to be known above the driver level. To > > > Eitan> avoid exposing the P_Key values we could use P_Key index > > > Eitan> for creating the IPoIB interfaces. > > > > > > Eitan> Does it make sense to work on a patch that would setup > > > Eitan> IPoIB interfaces by the P_Key index (and not by P_Key > > > Eitan> value)? > > > > > > I don't see how this is feasible. The index that a particular P_Key > > > lands at is completely undetermined -- if two nodes wanted to talk on > > > partition 0x8001 say, how does one know which interface to use without > > > knowing the index of that P_Key? > > OK, I get it. Actually the way IPoIB defines the broadcast group MGID exposes > P_Key anyway. > > > > > > > > Eitan> Also I think the expected behavior for IPoIB should be that > > > Eitan> IPoIB "child" interfaces should be "automatically" > > > Eitan> initialized by the code that brings up the interface > > > Eitan> (ifconfig scripts). All valid IPoIB partitions (valid = > > > Eitan> have corresponding broadcast groups) should be > > > Eitan> initialized. By doing so we provide a centralized control > > > Eitan> of the partitions and their IPoIB interfaces through the > > > Eitan> SM. > > > > > > Not sure if this is so. 
I may want a partition strictly for storage > > > traffic something like that, so it doesn't make sense to create an > > > IPoIB interface for that partition. > > OpenSM provides this capability in the "partition policy": > > Each partition is marked explicitly if to be used for IPoIB or not. > > So through one file one could actually control the IPoIB interfaces > > that will exist in the subnet. > > The end node does not know the SM policy for that partition though. > > > My intent is to write some extension to ifup for IPoIB such that all sub > > interfaces will be automatically started (based on pre-availability of IPoIB > > broadcast MGID). > > If that were to be done, it would be cleanest if the child IPoIB > interface was created only if that IPoIB broadcast group for that > partition exists. [EZ] This is exactly what I had in mind. > > -- Hal > > > > > > > - R. > > > > > From halr at voltaire.com Mon Apr 10 05:34:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2006 08:34:15 -0400 Subject: [openib-general] Re: [PATCH] osm: add support for 1.2 errata - SA enhanced capability mask matching In-Reply-To: <86y7yngmlo.fsf@mtl066.yok.mtl.com> References: <86y7yngmlo.fsf@mtl066.yok.mtl.com> Message-ID: <1144672026.19061.53145.camel@hal.voltaire.com> Hi Eitan, On Mon, 2006-04-03 at 03:11, Eitan Zahavi wrote: > Hi Hal > > This patch adds support for the following 1.2 errata MGTWG8372. > This should be useful for scalability of: > * SRP target discovery and > * Queries for all SM ports. > > Reference ID: 4291 > > Add to table: 186 SA-Specific ClassPortInfo:CapabilityMask > Name | Bit | Description > =========================================================================================== > IsPortInfoCapMaskMatchSupported | 13 | If this value is 1, SA shall support matching the > | | PortInfo:CapabilityMask component as described in > | | . 
> > Reference ID: 4292 > If SA's ClassPortInfo:CapabilityMask.IsPortInfoCapMaskMatchSupported is 1, > then the AttributeModifier of the SubnAdmGet() and SubnAdmGetTable() > methods affects the matching behavior on the PortInfo:CapabilityMask > component. If the high-order bit (bit 31) of the AttributeModifier > is set to 1, matching on the CapabilityMask component will not be an > exact bitwise match as described in . Instead, > matching will only be performed on those bits which are set to 1 in > the PortInfo:CapabilityMask embedded in the query. > > In , bits in the PortInfo:CapabilityMask embedded > in the query that are set to 0 are bitwise wildcards for purposes of > matching. > > This gives a requester the ability to select desired capabilities > and query for ports which support those capabilities. > > If SA's ClassPortInfo:CapabilityMask.IsPortInfoCapMaskMatchSupported > is 0, or if bit 31 of the AttributeModifier is 0, then any matching > performed on the PortInfo:CapabilityMask component is as described > in . > > Eitan > > Signed-off-by: Eitan Zahavi ... > Index: opensm/osm_sa_class_port_info.c > =================================================================== > --- opensm/osm_sa_class_port_info.c (revision 6144) > +++ opensm/osm_sa_class_port_info.c (working copy) > @@ -212,15 +212,21 @@ __osm_cpi_rcv_respond( > MultiPathRecord, > TraceRecord > > - OSM_CAP_IS_SUBN_OPT_REINIT_SUP: > + OSM_CAP_IS_REINIT_SUP: > For reinitialization functionality. > > So not sending traps, but supporting Get(Notice) and Set(Notice): > */ > - p_resp_cpi->cap_mask = 0x2; /* Note host notation replaced later */ > + > + /* Note host notation replaced later */ > + p_resp_cpi->cap_mask = 0x2; /* Generic mask: support Get/Set attributes */ > + > if (p_rcv->p_subn->opt.no_multicast_option != TRUE) > p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP; > > + p_resp_cpi->cap_mask |= OSM_CAP_IS_REINIT_SUP; OpenSM doesn't support node reinit so this bit shouldn't be on, right ? 
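[Editorial sketch] The matching rule in the errata text quoted above can be illustrated in a few lines of C. This is a minimal standalone sketch, not OpenSM code: the names cap_mask_matches and ATTR_MOD_CAP_MASK_MATCH are hypothetical, and it assumes the query's component mask already selects PortInfo:CapabilityMask. Only the bit position (bit 31 of the AttributeModifier) and the two matching modes are taken from the quoted text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name: bit 31 of the SA AttributeModifier (from the errata text). */
#define ATTR_MOD_CAP_MASK_MATCH (UINT32_C(1) << 31)

/*
 * Hypothetical helper, not an OpenSM function.
 *
 * sa_supports_match: SA ClassPortInfo IsPortInfoCapMaskMatchSupported bit
 * attr_mod:          AttributeModifier of the SubnAdmGet()/GetTable() query
 * query_cap:         PortInfo:CapabilityMask embedded in the query
 * port_cap:          CapabilityMask of the candidate port
 *
 * All values are assumed to be in host byte order.
 */
static bool cap_mask_matches(bool sa_supports_match, uint32_t attr_mod,
                             uint32_t query_cap, uint32_t port_cap)
{
    if (sa_supports_match && (attr_mod & ATTR_MOD_CAP_MASK_MATCH)) {
        /* Enhanced mode: bits set to 0 in the query are wildcards, so
         * only the bits set to 1 in the query must be set in the port. */
        return (port_cap & query_cap) == query_cap;
    }

    /* Legacy mode: exact bitwise match on the whole mask. */
    return port_cap == query_cap;
}
```

With the enhanced mode, a requester can ask for every port that advertises a given capability regardless of what else the port supports, which is what makes the SRP target and SM port queries mentioned in the patch description cheaper.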
-- Hal From eitan at mellanox.co.il Mon Apr 10 06:16:03 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 10 Apr 2006 16:16:03 +0300 Subject: [openib-general] RE: [PATCH] osm: add support for 1.2 errata - SA enhancedcapability mask matching Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102BA2D@mtlexch01.mtl.com> Hi Hal, This is correct. OpenSM does not support RE_INIT. I think I meant adding ClientReRegistration capability. Good catch Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, April 10, 2006 3:34 PM > To: Eitan Zahavi > Cc: OPENIB > Subject: Re: [PATCH] osm: add support for 1.2 errata - SA enhancedcapability mask > matching > > Hi Eitan, > > On Mon, 2006-04-03 at 03:11, Eitan Zahavi wrote: > > Hi Hal > > > > This patch adds support for the following 1.2 errata MGTWG8372. > > This should be useful for scalability of: > > * SRP target discovery and > > * Queries for all SM ports. > > > > Reference ID: 4291 > > > > Add to table: 186 SA-Specific ClassPortInfo:CapabilityMask > > Name | Bit | Description > > > ======================================================================== ==== > =============== > > IsPortInfoCapMaskMatchSupported | 13 | If this value is 1, SA shall support > matching the > > | | PortInfo:CapabilityMask component as described in > > | | . > > > > Reference ID: 4292 > > If SA's ClassPortInfo:CapabilityMask.IsPortInfoCapMaskMatchSupported is 1, > > then the AttributeModifier of the SubnAdmGet() and SubnAdmGetTable() > > methods affects the matching behavior on the PortInfo:CapabilityMask > > component. If the high-order bit (bit 31) of the AttributeModifier > > is set to 1, matching on the CapabilityMask component will not be an > > exact bitwise match as described in . 
Instead, > > matching will only be performed on those bits which are set to 1 in > > the PortInfo:CapabilityMask embedded in the query. > > > > In , bits in the PortInfo:CapabilityMask embedded > > in the query that are set to 0 are bitwise wildcards for purposes of > > matching. > > > > This gives a requester the ability to select desired capabilities > > and query for ports which support those capabilities. > > > > If SA's ClassPortInfo:CapabilityMask.IsPortInfoCapMaskMatchSupported > > is 0, or if bit 31 of the AttributeModifier is 0, then any matching > > performed on the PortInfo:CapabilityMask component is as described > > in . > > > > Eitan > > > > Signed-off-by: Eitan Zahavi > > ... > > > Index: opensm/osm_sa_class_port_info.c > > =================================================================== > > --- opensm/osm_sa_class_port_info.c (revision 6144) > > +++ opensm/osm_sa_class_port_info.c (working copy) > > @@ -212,15 +212,21 @@ __osm_cpi_rcv_respond( > > MultiPathRecord, > > TraceRecord > > > > - OSM_CAP_IS_SUBN_OPT_REINIT_SUP: > > + OSM_CAP_IS_REINIT_SUP: > > For reinitialization functionality. > > > > So not sending traps, but supporting Get(Notice) and Set(Notice): > > */ > > - p_resp_cpi->cap_mask = 0x2; /* Note host notation replaced later */ > > + > > + /* Note host notation replaced later */ > > + p_resp_cpi->cap_mask = 0x2; /* Generic mask: support Get/Set attributes */ > > + > > if (p_rcv->p_subn->opt.no_multicast_option != TRUE) > > p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP; > > > > + p_resp_cpi->cap_mask |= OSM_CAP_IS_REINIT_SUP; > > OpenSM doesn't support node reinit so this bit shouldn't be on, right ? > > -- Hal From devesh28 at gmail.com Mon Apr 10 06:48:13 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Mon, 10 Apr 2006 19:18:13 +0530 Subject: [openib-general] SDP Memory management Message-ID: <309a667c0604100648p686e8d23t8d8559242063488a@mail.gmail.com> Hello list, I have some queries regarding SDP memory management. 
a) What is the concept of FMR? b) In the absence of FMR support, what is the buffer management scheme for Z-Copy? Will SDP work without FMR? c) How are page locking and virtual-to-physical address conversion done for Z-Copy buffers? Devesh From monil at voltaire.com Mon Apr 10 07:01:54 2006 From: monil at voltaire.com (Moni Levy) Date: Mon, 10 Apr 2006 17:01:54 +0300 Subject: [openib-general] IPoIB interface for unauthorized partition In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102BA2C@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102BA2C@mtlexch01.mtl.com> Message-ID: <6a122cc00604100701n476272dfrb2f0527fb6f48a39@mail.gmail.com> On 4/10/06, Eitan Zahavi wrote: > Hi Hal, > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Monday, April 10, 2006 2:00 PM > > To: Eitan Zahavi > > Cc: Roland Dreier; openib-general at openib.org > > Subject: Re: [openib-general] IPoIB interface for unauthorized > partition > > > > Hi Eitan, > > > > On Mon, 2006-04-10 at 02:35, Eitan Zahavi wrote: > > > Hi Roland, > > > > > > Roland Dreier wrote: > > > > Eitan> I thought the intent of the IB spec when defining P_Key > > > > Eitan> index usage (and not P_Key value) was that the P_Key > values > > > > Eitan> would never need to be known above the driver level. > To > > > > Eitan> avoid exposing the P_Key values we could use P_Key > index > > > > Eitan> for creating the IPoIB interfaces. > > > > > > > > Eitan> Does it make sense to work on a patch that would setup > > > > Eitan> IPoIB interfaces by the P_Key index (and not by P_Key > > > > Eitan> value)? > > > > > > > > I don't see how this is feasible. The index that a particular > P_Key > > > > lands at is completely undetermined -- if two nodes wanted to talk > on > > > > partition 0x8001 say, how does one know which interface to use > without > > > > knowing the index of that P_Key?
> > > OK, I get it. Actually the way IPoIB defines the broadcast group > MGID exposes > > P_Key anyway. > > > > > > > > > > > Eitan> Also I think the expected behavior for IPoIB should be > that > > > > Eitan> IPoIB "child" interfaces should be "automatically" > > > > Eitan> initialized by the code that brings up the interface > > > > Eitan> (ifconfig scripts). All valid IPoIB partitions (valid = > > > > Eitan> have corresponding broadcast groups) should be > > > > Eitan> initialized. By doing so we provide a centralized > control > > > > Eitan> of the partitions and their IPoIB interfaces through > the > > > > Eitan> SM. > > > > > > > > Not sure if this is so. I may want a partition strictly for > storage > > > > traffic something like that, so it doesn't make sense to create an > > > > IPoIB interface for that partition. > > > OpenSM provides this capability in the "partition policy": > > > Each partition is marked explicitly if to be used for IPoIB or not. > > > So through one file one could actually control the IPoIB interfaces > > > that will exist in the subnet. > > > > The end node does not know the SM policy for that partition though. > > > > > My intent is to write some extension to ifup for IPoIB such that all > sub > > > interfaces will be automatically started (based on pre-availability > of IPoIB > > > broadcast MGID). I'm not sure how ifup is related to that. From what I understand you'd like ipoib driver to behave as follows: 1. Get an event ( or figure it out) when a new PKEY is added to the relevant port partition table. 2. Try to join that new MC group with the MGID it created according to the PKEY and the spec. (or maybe query for the MC group existance but that's not atomic) 3. In case it fails nothing is done (no relevant MC group was pre-created in the SM). 4. In case it succeeds a new interface is created. Is that what you meant ? 
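For reference, step 2 relies on the broadcast MGID being derivable from the P_Key alone. A minimal sketch, assuming the IPv4 broadcast layout ff12:401b:<P_Key>:0000:0000:0000:ffff:ffff with link-local scope and the full-membership bit set in the embedded P_Key; the helper name ipoib_bcast_mgid is hypothetical and not taken from the driver:

```c
#include <stdint.h>
#include <string.h>

/*
 * Illustrative only: fill in the IPv4 IPoIB broadcast MGID for a partition.
 * Assumed layout: ff12:401b:<P_Key>:0000:0000:0000:ffff:ffff.
 * The function name and the fixed link-local scope are assumptions.
 */
static void ipoib_bcast_mgid(uint16_t pkey, uint8_t mgid[16])
{
	static const uint8_t tmpl[16] = {
		0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00,
		0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff
	};

	memcpy(mgid, tmpl, sizeof(tmpl));
	pkey |= 0x8000;		/* the MGID carries the full-member P_Key */
	mgid[4] = pkey >> 8;	/* P_Key, high byte */
	mgid[5] = pkey & 0xff;	/* P_Key, low byte */
}
```

With a helper like this, a join attempt for a newly added P_Key would need nothing beyond the P_Key value itself, which is also why the MGID exposes the P_Key as noted earlier in the thread.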
- Moni > > > > If that were to be done, it would be cleanest if the child IPoIB > > interface was created only if that IPoIB broadcast group for that > > partition exists. > [EZ] This is exactly what I had in mind. > > > > -- Hal > > > > > > > > > > - R. > > > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Mon Apr 10 07:25:10 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2006 10:25:10 -0400 Subject: [openib-general] Re: [PATCH] OpenSM - recognize port change not as duplicated guids In-Reply-To: <5z8xr6u9cj.fsf@mtl066.yok.mtl.com> References: <5z8xr6u9cj.fsf@mtl066.yok.mtl.com> Message-ID: <1144679090.19061.54417.camel@hal.voltaire.com> Hi Yael, On Sun, 2006-03-19 at 03:28, Yael Kalka wrote: > Hi Hal, > > This is a patch to solve the issue of port move during heavy sweep > being recognized as duplicated guids. > If the SM sees what seems to be duplicated guids, but it also received > an indication for immediatly forcing another heavy sweep (for example, > as a result of receiving trap 128) - then it shouldn't issue a > duplicated guids error (and exit), but just ignore and continue. > This means that only if the SM recognizes such a duplication in a > stable subnet it'll issue the error and exit. > > Thanks, > Yael > > Signed-off-by: Yael Kalka Thanks. Applied to both trunk and 1.0 branch. -- Hal From jackm at mellanox.co.il Mon Apr 10 08:04:34 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 10 Apr 2006 18:04:34 +0300 Subject: [openib-general] [PATCH v2] mad: use GID/LID on requester side when matching responses to requests Message-ID: <200604101804.34043.jackm@mellanox.co.il> Corrected and cleaner version. Check GID/LID for requester side when searching for request which matches received response. 
This, in order to guarantee uniqueness if use same TID when requesting via multiple source LIDs (when LMC is not zero). To perform check, add LMC to cache. Further, do not perform LID check for direct-routed packets, since permissive LID makes a proper check impossible. Signed-off-by: Jack Morgenstein Index: src/drivers/infiniband/include/rdma/ib_verbs.h =================================================================== --- src/drivers/infiniband/include/rdma/ib_verbs.h (revision 6066) +++ src/drivers/infiniband/include/rdma/ib_verbs.h (working copy) @@ -822,6 +822,7 @@ struct ib_cache { struct ib_event_handler event_handler; struct ib_pkey_cache **pkey_cache; struct ib_gid_cache **gid_cache; + u8 *lmc_cache; }; struct ib_device { Index: src/drivers/infiniband/include/rdma/ib_cache.h =================================================================== --- src/drivers/infiniband/include/rdma/ib_cache.h (revision 6066) +++ src/drivers/infiniband/include/rdma/ib_cache.h (working copy) @@ -102,4 +102,17 @@ int ib_find_cached_pkey(struct ib_device u16 pkey, u16 *index); +/** + * ib_get_cached_lmc - Returns a cached lmc table entry + * @device: The device to query. + * @port_num: The port number of the device to query. + * @lmc: The lmc value for the specified port for that device. + * + * ib_get_cached_lmc() fetches the specified lmc table entry stored in + * the local software cache. 
+ */ +int ib_get_cached_lmc(struct ib_device *device, + u8 port_num, + u8 *lmc); + #endif /* _IB_CACHE_H */ Index: src/drivers/infiniband/core/cache.c =================================================================== --- src/drivers/infiniband/core/cache.c (revision 6066) +++ src/drivers/infiniband/core/cache.c (working copy) @@ -191,6 +195,24 @@ int ib_find_cached_pkey(struct ib_device } EXPORT_SYMBOL(ib_find_cached_pkey); +int ib_get_cached_lmc(struct ib_device *device, + u8 port_num, + u8 *lmc) +{ + unsigned long flags; + int ret = 0; + + if (port_num < start_port(device) || port_num > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + *lmc = device->cache.lmc_cache[port_num - start_port(device)]; + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_get_cached_lmc); + static void ib_cache_update(struct ib_device *device, u8 port) { @@ -251,6 +273,8 @@ static void ib_cache_update(struct ib_de device->cache.pkey_cache[port - start_port(device)] = pkey_cache; device->cache.gid_cache [port - start_port(device)] = gid_cache; + device->cache.lmc_cache[port - start_port(device)] = tprops->lmc; + write_unlock_irq(&device->cache.lock); kfree(old_pkey_cache); @@ -305,7 +329,13 @@ static void ib_cache_setup_one(struct ib kmalloc(sizeof *device->cache.pkey_cache * (end_port(device) - start_port(device) + 1), GFP_KERNEL); - if (!device->cache.pkey_cache || !device->cache.gid_cache) { + device->cache.lmc_cache = kmalloc(sizeof *device->cache.lmc_cache * + (end_port(device) - + start_port(device) + 1), + GFP_KERNEL); + + if (!device->cache.pkey_cache || !device->cache.gid_cache || + !device->cache.lmc_cache) { printk(KERN_WARNING "Couldn't allocate cache " "for %s\n", device->name); goto err; @@ -333,6 +363,7 @@ err_cache: err: kfree(device->cache.pkey_cache); kfree(device->cache.gid_cache); + kfree(device->cache.lmc_cache); } static void ib_cache_cleanup_one(struct ib_device *device) @@ -349,6 
+380,7 @@ static void ib_cache_cleanup_one(struct kfree(device->cache.pkey_cache); kfree(device->cache.gid_cache); + kfree(device->cache.lmc_cache); } static struct ib_client cache_client = { Index: src/drivers/infiniband/core/mad.c =================================================================== --- src/drivers/infiniband/core/mad.c (revision 6066) +++ src/drivers/infiniband/core/mad.c (working copy) @@ -34,6 +34,7 @@ * $Id$ */ #include +#include #include "mad_priv.h" #include "mad_rmpp.h" @@ -1669,20 +1670,21 @@ static inline int rcv_has_same_class(str rwc->recv_buf.mad->mad_hdr.mgmt_class; } -static inline int rcv_has_same_gid(struct ib_mad_send_wr_private *wr, +static inline int rcv_has_same_gid(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *wr, struct ib_mad_recv_wc *rwc ) { struct ib_ah_attr attr; u8 send_resp, rcv_resp; + union ib_gid sgid; + struct ib_device *device = mad_agent_priv->agent.device; + u8 port_num = mad_agent_priv->agent.port_num; + u8 lmc; send_resp = ((struct ib_mad *)(wr->send_buf.mad))-> mad_hdr.method & IB_MGMT_METHOD_RESP; rcv_resp = rwc->recv_buf.mad->mad_hdr.method & IB_MGMT_METHOD_RESP; - if (!send_resp && rcv_resp) - /* is request/response. GID/LIDs are both local (same). */ - return 1; - if (send_resp == rcv_resp) /* both requests, or both responses. GIDs different */ return 0; @@ -1691,48 +1693,78 @@ static inline int rcv_has_same_gid(struc /* Assume not equal, to avoid false positives. */ return 0; - if (!(attr.ah_flags & IB_AH_GRH) && !(rwc->wc->wc_flags & IB_WC_GRH)) - return attr.dlid == rwc->wc->slid; - else if ((attr.ah_flags & IB_AH_GRH) && - (rwc->wc->wc_flags & IB_WC_GRH)) - return memcmp(attr.grh.dgid.raw, - rwc->recv_buf.grh->sgid.raw, 16) == 0; - else + if (!!(attr.ah_flags & IB_AH_GRH) != + !!(rwc->wc->wc_flags & IB_WC_GRH)) /* one has GID, other does not. Assume different */ return 0; + + if (!send_resp && rcv_resp) { + /* is request/response. 
*/ + if (!(attr.ah_flags & IB_AH_GRH)) { + if (ib_get_cached_lmc(device, port_num, &lmc)) + return 0; + return (!lmc || !((attr.src_path_bits ^ + rwc->wc->dlid_path_bits) & + ((1 << lmc) - 1))); + } else { + if (ib_get_cached_gid(device, port_num, + attr.grh.sgid_index, &sgid)) + return 0; + return !memcmp(sgid.raw, rwc->recv_buf.grh->dgid.raw, + 16); + } + } + + if (!(attr.ah_flags & IB_AH_GRH)) + return attr.dlid == rwc->wc->slid; + else + return !memcmp(attr.grh.dgid.raw, rwc->recv_buf.grh->sgid.raw, + 16); } + +static inline int is_direct(u8 class) +{ + return (class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE); +} + struct ib_mad_send_wr_private* ib_find_send_mad(struct ib_mad_agent_private *mad_agent_priv, - struct ib_mad_recv_wc *mad_recv_wc) + struct ib_mad_recv_wc *wc) { - struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wr_private *wr; struct ib_mad *mad; - mad = (struct ib_mad *)mad_recv_wc->recv_buf.mad; + mad = (struct ib_mad *)wc->recv_buf.mad; - list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, - agent_list) { - if ((mad_send_wr->tid == mad->mad_hdr.tid) && - rcv_has_same_class(mad_send_wr, mad_recv_wc) && - rcv_has_same_gid(mad_send_wr, mad_recv_wc)) - return mad_send_wr; + list_for_each_entry(wr, &mad_agent_priv->wait_list, agent_list) { + if ((wr->tid == mad->mad_hdr.tid) && + rcv_has_same_class(wr, wc) && + /* + * Don't check GID for direct routed MADs. + * These might have permissive LIDs. 
+ */ + (is_direct(wc->recv_buf.mad->mad_hdr.mgmt_class) || + rcv_has_same_gid(mad_agent_priv, wr, wc))) + return wr; } /* * It's possible to receive the response before we've * been notified that the send has completed */ - list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, - agent_list) { - if (is_data_mad(mad_agent_priv, mad_send_wr->send_buf.mad) && - mad_send_wr->tid == mad->mad_hdr.tid && - mad_send_wr->timeout && - rcv_has_same_class(mad_send_wr, mad_recv_wc) && - rcv_has_same_gid(mad_send_wr, mad_recv_wc)) { + list_for_each_entry(wr, &mad_agent_priv->send_list, agent_list) { + if (is_data_mad(mad_agent_priv, wr->send_buf.mad) && + wr->tid == mad->mad_hdr.tid && + wr->timeout && + rcv_has_same_class(wr, wc) && + /* + * Don't check GID for direct routed MADs. + * These might have permissive LIDs. + */ + (is_direct(wc->recv_buf.mad->mad_hdr.mgmt_class) || + rcv_has_same_gid(mad_agent_priv, wr, wc))) /* Verify request has not been canceled */ - return (mad_send_wr->status == IB_WC_SUCCESS) ? - mad_send_wr : NULL; - } + return (wr->status == IB_WC_SUCCESS) ? wr : NULL; } return NULL; } From jackm at mellanox.co.il Mon Apr 10 08:09:00 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 10 Apr 2006 18:09:00 +0300 Subject: [openib-general] [PATCH] mthca: fix for checked-in static rate patch Message-ID: <200604101809.00910.jackm@mellanox.co.il> rate stored for port is a multiple of 1X. Needs to be converted to the ib_rate enumeration. 
Signed-off-by: Jack Morgenstein

Index: drivers/infiniband/hw/mthca/mthca_av.c
===================================================================
--- drivers/infiniband/hw/mthca/mthca_av.c	(revision 6362)
+++ drivers/infiniband/hw/mthca/mthca_av.c	(working copy)
@@ -70,11 +70,20 @@ struct mthca_av {
 static enum ib_rate memfree_rate_to_ib(u8 mthca_rate, u8 port_rate)
 {
 	switch (mthca_rate) {
-	case MTHCA_RATE_MEMFREE_EIGHTH: return port_rate / 8;
-	case MTHCA_RATE_MEMFREE_QUARTER: return port_rate / 4;
-	case MTHCA_RATE_MEMFREE_HALF: return port_rate / 2;
-	case MTHCA_RATE_MEMFREE_FULL: return port_rate;
-	default: return port_rate;
+	case MTHCA_RATE_MEMFREE_EIGHTH:
+		return mult_to_ib_rate(port_rate / 8);
+
+	case MTHCA_RATE_MEMFREE_QUARTER:
+		return mult_to_ib_rate(port_rate / 4);
+
+	case MTHCA_RATE_MEMFREE_HALF:
+		return mult_to_ib_rate(port_rate / 2);
+
+	case MTHCA_RATE_MEMFREE_FULL:
+		return mult_to_ib_rate(port_rate);
+
+	default:
+		return mult_to_ib_rate(port_rate);
 	}
 }
 
@@ -84,7 +93,7 @@ static enum ib_rate tavor_rate_to_ib(u8
 	case MTHCA_RATE_TAVOR_1X: return IB_RATE_2_5_GBPS;
 	case MTHCA_RATE_TAVOR_1X_DDR: return IB_RATE_5_GBPS;
 	case MTHCA_RATE_TAVOR_4X: return IB_RATE_10_GBPS;
-	default: return port_rate;
+	default: return mult_to_ib_rate(port_rate);
 	}
 }
 

From michaelc at cs.wisc.edu Mon Apr 10 09:32:31 2006
From: michaelc at cs.wisc.edu (Mike Christie)
Date: Mon, 10 Apr 2006 11:32:31 -0500
Subject: [openib-general] Location for iser-target code
In-Reply-To:
References: <20060410165458Q.fujita.tomonori@lab.ntt.co.jp> <20060410174547Y.fujita.tomonori@lab.ntt.co.jp>
Message-ID: <443A889F.5050408@cs.wisc.edu>

Dan Bar Dov wrote:
> On 4/10/06, FUJITA Tomonori wrote:
>> The OLS abstract may be more informative.
>>
>> http://www.linuxsymposium.org/2006/view_abstract.php?content_key=19
>
> Is the full document available as well?

The code is shorter than the paper.

>
>> In short, tgt is the framework for SCSI target drivers.
>> The combination of tgt and the iSCSI target driver for NIC provides the
>> similar features that IET does.
>
> SCSI targets are LLDs that sit below the mid-layer. iSCSI target on
> the other hand is a "network" protocol driver, that sits above SCSI
> (sd, st, sg, or directly over mid-layer). Seems like you'd need two
> different frameworks, no?
>

tgt/stgt is probably two frameworks from your point of view. There is a
kernel part for target LLDs to hook into. The kernel part is similar to
scsi-ml; it actually builds on it, uses some of the scsi-ml functions,
and provides shared code for tasks like creating scatter lists and
mapping commands between the kernel and userspace. The target LLD
basically handles lower-level issues like DMAing the data, transport
issues, etc., pretty much what a scsi-ml initiator driver does. For
iscsi, the tgt LLD performs tasks similar to the initiator's. It parses
the iscsi PDUs or puts them on the interconnect and handles session and
connection management (this would be done like open-iscsi though), but
then passes the scsi command to tgt's kernel code.

The other part of the framework is the userspace component. The tgt
kernel component basically passes scsi commands and task management
functions to a userspace daemon. The daemon contains the scsi state
machine and executes the IO. When it is done, it informs the kernel
component, which in turn maps the data into the kernel, forms scatter
lists, and then passes them to the target LLD to send out.
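
As a rough illustration of that split (a toy sketch only, not tgt's actual interface; every struct, field, and function name below is hypothetical), the kernel part hands a command event to the daemon, the daemon executes it against its backing store, and a result goes back so the LLD can send the data and status out:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; this is not tgt's real interface. */
enum { BLK = 512, NBLKS = 8 };

struct cmd_event {		/* what the kernel part would pass up */
	uint64_t tag;
	uint8_t  opcode;	/* 0x28 = READ(10), 0x2a = WRITE(10) */
	uint32_t lba;
	uint8_t  data[BLK];	/* one block of payload */
};

struct cmd_result {		/* what the daemon reports back */
	uint64_t tag;
	int      status;	/* 0 = GOOD, 2 = CHECK CONDITION */
};

/* Stand-in for the real device or file behind the target. */
static uint8_t backing_store[NBLKS][BLK];

/* Daemon side: execute one command and build the completion. */
static struct cmd_result daemon_execute(struct cmd_event *ev)
{
	struct cmd_result res = { .tag = ev->tag, .status = 0 };

	if (ev->lba >= NBLKS) {
		res.status = 2;	/* out-of-range LBA */
		return res;
	}

	switch (ev->opcode) {
	case 0x2a:		/* WRITE(10): payload -> store */
		memcpy(backing_store[ev->lba], ev->data, BLK);
		break;
	case 0x28:		/* READ(10): store -> payload */
		memcpy(ev->data, backing_store[ev->lba], BLK);
		break;
	default:		/* anything else is unsupported here */
		res.status = 2;
	}
	return res;
}
```

In the real framework the payload would be mapped between kernel and userspace rather than copied inside the event, and the transport-specific work (iSCSI PDUs, DMA) would stay in the target LLD.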

From rdreier at cisco.com Mon Apr 10 09:34:06 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 10 Apr 2006 09:34:06 -0700
Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications
In-Reply-To: <1144667137.19061.52221.camel@hal.voltaire.com> (Hal Rosenstock's message of "10 Apr 2006 07:08:54 -0400")
References: <1144410107.19061.824.camel@hal.voltaire.com> <1144445589.19061.7520.camel@hal.voltaire.com> <1144667137.19061.52221.camel@hal.voltaire.com>
Message-ID:

    Hal> Right now, it needs to be burnt into the MAD core as there is
    Hal> no other way to determine this (as it is class/method
    Hal> dependent). I have made a spec comment on this lack of
    Hal> flexibility but don't have a good solution for this at least
    Hal> yet...

Now you've lost me. What if one process sets the ioctl and another
process doesn't? What happens to dual-sided RMPPs then?

If it's really a system-wide setting then I guess it should be a file
under /sys/class/infiniband_mad.

 - R.

From ardavis at ichips.intel.com Mon Apr 10 09:35:10 2006
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Mon, 10 Apr 2006 09:35:10 -0700
Subject: [openib-general] Re: [PATCH] uDAPL cma; dat_ep_free can return without freeing cm_id
In-Reply-To: <1144365985.10701.80.camel@stevo-desktop>
References: <1144365985.10701.80.camel@stevo-desktop>
Message-ID: <443A893E.4000109@ichips.intel.com>

Steve Wise wrote:

>>Steve, can you test this version and see if it works for your iWARP device.
>>
>>
>>
>
>I think the patch is good.
>
>I ran dapltest/regress.sh over the chelsio iwarp device using this new
>patch instead of my original patch, and things seem as stable as they
>were before (i'm fighting some intermittent connection setup failures
>that I think are in cxgb3 provider, not dapl).
>
>
James, can you go ahead and commit this patch?
-arlin

From bardov at gmail.com Mon Apr 10 09:48:38 2006
From: bardov at gmail.com (Dan Bar Dov)
Date: Mon, 10 Apr 2006 18:48:38 +0200
Subject: [openib-general] Location for iser-target code
In-Reply-To: <443A889F.5050408@cs.wisc.edu>
References: <20060410165458Q.fujita.tomonori@lab.ntt.co.jp> <20060410174547Y.fujita.tomonori@lab.ntt.co.jp> <443A889F.5050408@cs.wisc.edu>
Message-ID:

On 4/10/06, Mike Christie wrote:
> Dan Bar Dov wrote:
> > On 4/10/06, FUJITA Tomonori wrote:
> >> The OLS abstract may be more informative.
> >>
> >> http://www.linuxsymposium.org/2006/view_abstract.php?content_key=19
> >
> > Is the full document available as well?
>
> The code is shorter than the paper.
>
> >
> >> In short, tgt is the framework for SCSI target drivers. The
> >> combination of tgt and the iSCSI target driver for NIC provides the
> >> similar features that IET does.
> >
> > SCSI targets are LLDs that sit below the mid-layer. iSCSI target on
> > the other hand is a "network" protocol driver, that sits above SCSI
> > (sd, st, sg, or directly over mid-layer). Seems like you'd need two
> > different frameworks, no?
> >
>
> tgt/stgt is probably two frameworks from your point of view. There is a
> kernel part for target LLDs to hook into. The kernel part is similar to
> scsi-ml, actually it builds onto it and uses some of the scsi-ml
> functions, and provides code to share for tasks like creating scatter
> lists and mapping commands between the kernel and userspace. The target
> LLD basically handles lower level issues like DMAing the data, transport
> issues, etc, pretty much what a scsi-ml initiator driver does. For

What do you mean by scsi-ml initiator?

> iscsi, the tgt lld performs similar tasks as the initiator. It parses
> the iscsi PDUs or puts them on the interconnect, handles session and
> connection management (this would be done like open-iscsi though), but
> then passes the scsi command to tgt's kernel code.
>
> The other part of the framework is the userspace component. The tgt
> kernel component basically passes scsi commands and task management
> functions to a userspace daemon. The daemon contains the scsi state
> machine and executes the IO. When it is done it informs the kernel
> component which in turn maps the data into the kernel, forms scatter
> lists, and then passes them to the target LLD to send out.
>

I got completely confused. I understand (obviously wrongly) that to
implement an iscsi target (or srp target for that matter), I would write
the network-facing part in the kernel; it would pass the tasks and data
to user mode, user mode would perform the tasks using scsi drivers
(sd/st/sg), and once they complete it would report back to the
network-facing part.

I guess I'd need a diagram or two to understand :-)

Dan

From rdreier at cisco.com Mon Apr 10 09:49:24 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 10 Apr 2006 09:49:24 -0700
Subject: [openib-general] RFC: revert module ref counting patches
In-Reply-To: <20060410094457.GQ13416@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 10 Apr 2006 12:44:57 +0300")
References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il>
Message-ID:

Oh well, this module refcounting stuff is always harder than it looks.

I dropped these patches from my git tree, and I guess we should revert
them from svn too.

 - R.

From rdreier at cisco.com Mon Apr 10 09:50:55 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 10 Apr 2006 09:50:55 -0700
Subject: [openib-general] Re: [PATCH] mthca: fix for checked-in static rate patch
In-Reply-To: <200604101809.00910.jackm@mellanox.co.il> (Jack Morgenstein's message of "Mon, 10 Apr 2006 18:09:00 +0300")
References: <200604101809.00910.jackm@mellanox.co.il>
Message-ID:

Thanks, applied

From vlad at mellanox.co.il Mon Apr 10 09:54:33 2006
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Mon, 10 Apr 2006 19:54:33 +0300
Subject: [openib-general] IBED-1.0-rc3 is available
Message-ID: <443A8DC9.9070305@mellanox.co.il>

Hi All,

We have prepared IBED 1.0 RC3.

Release location: *https://openib.org/svn/gen2/branches/1.0/ibed/releases*
File: IBED-1.0-rc3.tgz
md5sum: 8e143fd4b63646ebc9f5c9f73d18394b

*_BUILD_ID:_*
IBED-1.0-rc3:

OpenIB:
openib_branch1.0-20060410-1551 (REV=6367)
Userspace SVN path: https://openib.org/svn/gen2/branches/1.0/src/userspace
IB Kernel modules SVN path: https://openib.org/svn/gen2/branches/1.0/ibed/tags/rc3/linux-kernel

MPI:
openmpi-1.0.2a12-1
mpi_osu-0.9.7-mlx2.1.0
mpitests-1.0-0

*OSes:*
* RH EL4 up2: 2.6.9-22.ELsmp
* RH EL4 up3: 2.6.9-34.ELsmp
* Fedora C4: 2.6.11-1.1369_FC4
* SLES10 beta 7: 2.6.16-rc5-git9-2-smp
* SUSE 10 Pro: 2.6.13-15-smp
* kernel.org: 2.6.16

*Systems:*
* x86_64
* x86
* ia64
* ppc64

*Main changes from RC2:*
1. Added support for RH EL4 up3
2. Added Open MPI package
3. OSU MPI is now based on 0.97 release (was 0.95 in RC2)
4. Added Pathscale (ipath) driver
5. Added uDapl
6. Build is based on the new method: userlevel from openib branch 1.0
   and kernel from openib trunk (will be from git in RC4)
7. Added ibutils package
8. Bug fixes

*Package limitations:*
1. iSER works on SuSE SLES 10 Beta8 only
2. MPI OSU and Open MPI compilation fails on PPC64
3. uDAPL is not supported on RH EL4 (up2 and up3) since the rdma_ucm
   module does not work on 2.6.9* kernels. If someone has a patch we
   will use it.
4. ipath driver compilation fails on RH EL4 and FedoraC4.

Please send any issues you encounter, and your testing results, to me
and Vlad.

Thanks
Tziporet & Vlad

From mst at mellanox.co.il Mon Apr 10 12:44:24 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 10 Apr 2006 21:44:24 +0200
Subject: [openib-general] Re: RFC: revert module ref counting patches
In-Reply-To:
References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il>
Message-ID: <20060410194424.GA3174@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: RFC: revert module ref counting patches
>
> Oh well, this module refcounting stuff is always harder than it looks.

BTW, how about the trick of exporting
module_mutex_lock/module_mutex_unlock?
It's a really simple approach - do you think this will be acceptable?

> I dropped these patches from my git tree, and I guess we should revert
> them from svn too.

Yea, please do.

--
MST

From ardavis at ichips.intel.com Mon Apr 10 09:59:30 2006
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Mon, 10 Apr 2006 09:59:30 -0700
Subject: [openib-general] [uDAPLl] question about dapl_ib_cq_resize
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD2D2E@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD2D2E@mtlexch01.mtl.com>
Message-ID: <443A8EF2.6060601@ichips.intel.com>

Dotan Barak wrote:

> Hi.
>
>
> I looked at the file: src/userspace/dapl/dapl/openib/dapl_ib_cq.c,
> function: dapl_ib_cq_resize:
> In this function, when one wants to resize a CQ, the dapl destroys the
> old CQ and creates a new one instead of calling to the resize CQ verb
> (which was added ~3 months ago),
>
> is there is a reason for this code?

When this was coded, the resize CQ verb was not available. I will take a
look and update the code.
-arlin From michaelc at cs.wisc.edu Mon Apr 10 10:03:42 2006 From: michaelc at cs.wisc.edu (Mike Christie) Date: Mon, 10 Apr 2006 12:03:42 -0500 Subject: [openib-general] Location for iser-target code In-Reply-To: References: <20060410165458Q.fujita.tomonori@lab.ntt.co.jp> <20060410174547Y.fujita.tomonori@lab.ntt.co.jp> <443A889F.5050408@cs.wisc.edu> Message-ID: <443A8FEE.2030803@cs.wisc.edu> Dan Bar Dov wrote: > On 4/10/06, Mike Christie wrote: >> Dan Bar Dov wrote: >>> On 4/10/06, FUJITA Tomonori wrote: >>>> The OLS abstract may be more informative. >>>> >>>> http://www.linuxsymposium.org/2006/view_abstract.php?content_key=19 >>> Is the full document available as well? >> The code is shorter than the paper. >> >>>> In short, tgt is the framework for SCSI target drivers. The >>>> combination of tgt and the iSCSI target driver for NIC provides the >>>> similar features that IET does. >>> SCSI targets are LLDs that sit below the mid-layer. iSCSI target on >>> the other hand is a "network" protocol driver, that sits above SCSI >>> (sd, st, sg, or directly over mid-layer). Seems like you'd need two >>> different frameworks, no? >>> >> tgt/stgt is probably two frameworks from your point of view. There is a >> kernel part for target LLDs to hook into. The kernel part is similar to >> scsi-ml, actually it builds onto it and uses some of the scsi-ml >> functions, and provides code to share for tasks like creating scatter >> lists and mapping commands between the kernel and userspace. The target >> LLD basically handles lower level issues like DMAing the data, transport >> issues, etc, pretty much what a scsi-ml initiator driver does. For > > What do you mean by scsi-ml initiator? SCSI Mid Layer - (this is what the SCSI Layer in linux is referred to sometimes) initiator - I guess in this context it would be the linux scsi host driver? So I was referring to drivers like iscsi_tcp, qla2xxx, aic79xx etc. > >> iscsi, the tgt lld performs similar tasks as the initiator. 
It parses >> the iscsi PDUs or puts them on the interconnect, handles session and >> connection manamgement (this would be done like open-iscsi though), but >> then then passes the scsi command to tgt's kernel code. >> >> The other part of the framework is the userspace component. The tgt >> kernel component basically passes scsi commands and task management >> functions to a userspace daemon. The daemon contains the scsi state >> machine and execute the IO. When it is done it informs the the kernel >> component which in turn maps the data into the kernel, forms scatter >> lists, and then passes them to the target LLD to send out. >> > I got completely confused. I understand (obviously wrongly) that to implement an > iscsi target (or srp target for that matter), I would write the > network facing part in kernel, > that would pass the tasks and data to the user mode, the user mode will perform > the tasks using scsi drivers (sd/st/sg), and once completed report > back to the network facing part. For tasks did you mean iscsi or scsi type of tasks or is that a rdma or infinniband term too? For software iscsi over tcp we could just do it all in usersapce. We are not 100% if a software iscsi target, even it is slimmed down and hooks into iscsi_tcp/libiscsi/scsi_transport_iscsi would be mergable, because for for software iscsi we could just open a socket in userspace to the target and never have to implement new kernel code for iscsi or scsi processing. The completely userspace iscsi target would listen on the socket for iscsi PDUs, then write/read to the device using SG_IO or to some virtualized block device. Arjan gave some good review comments about how to do a software iscsi target in userspace when IET was posted to linux-scsi for review. He did not hammer out all the details though. Could you probably do the same for iser and srp? 
Maybe we just want a userspace framework for these types of software drivers becuase at least the read/write to the device part could be shared. > > I guess I'd need a diagram or two to understand :-) > ok, i am not sure if we can post the paper. we will read the OLS rules and get back to you. From leonida at voltaire.com Mon Apr 10 09:16:44 2006 From: leonida at voltaire.com (Leonid Arsh) Date: Mon, 10 Apr 2006 19:16:44 +0300 Subject: [openib-general][PATCH] IPoIB client reregister event support Message-ID: <20060410161644.GA30726@voltaire.com> Hello, this is Michael's patch, that he proposed in addition to my linux kernel mthca & ib_verbs.h and user space verbs.h patches (see [openib-general][PATCH] mthca & ib_verbs.h client reregister event support by the SW) I checked it and it works fine. Signed-off-by: Leonid Arsh Index: linux-kernel/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- linux-kernel/infiniband/ulp/ipoib/ipoib_verbs.c (revision 6343) +++ linux-kernel/infiniband/ulp/ipoib/ipoib_verbs.c (working copy) @@ -255,7 +255,9 @@ void ipoib_event(struct ib_event_handler record->event == IB_EVENT_PKEY_CHANGE || record->event == IB_EVENT_PORT_ACTIVE || record->event == IB_EVENT_LID_CHANGE || - record->event == IB_EVENT_SM_CHANGE) { + record->event == IB_EVENT_SM_CHANGE || + record->event == IB_EVENT_CLIENT_REREGISTER + ) { ipoib_dbg(priv, "Port state change event\n"); queue_work(ipoib_workqueue, &priv->flush_task); } From tchahande at silverstorm.com Mon Apr 10 10:18:55 2006 From: tchahande at silverstorm.com (Chahande, Takshak) Date: Mon, 10 Apr 2006 13:18:55 -0400 Subject: [Fwd: Re: [openib-general] Data Structures at UserSpace] Message-ID: Hi Hal, Thanks for providing more inputs. Using 1st approach you have suggested, I can take needed things from OpenSM. As I was looking for data structures defined in the header file, I got the basic node_info, port_info structures in osm/include/iba/ib_types.h. 
I will be using this osm header file for my current requirement. Thanks & Regards, - Takshak -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Monday, April 10, 2006 6:56 AM To: Chahande, Takshak Cc: openib-general at openib.org Subject: [Fwd: Re: [openib-general] Data Structures at UserSpace] Hi again Takshak, Hadn't heard anything back on what you decided so I have the following additional information to offer: If you decide on approach 1, then there is the following which may of help/guidance: 1. There are the OpenSM headers which also provide SA client definition (include/vendor/osm_vendor_sa_api.h). You may want to extend the queries or use the user defined query capability depending on what your requirements are here. 2. There is Ira Weiny's sa_query code (in the trunk) which currently handles path records. I have not enabled this in the makefile as yet as there is one issue I need to fix. Let me know if you have additional questions. -- Hal -----Forwarded Message----- From: Hal Rosenstock To: "Chahande, Takshak" Cc: openib-general at openib.org Subject: Re: [openib-general] Data Structures at UserSpace Date: 07 Apr 2006 06:48:34 -0400 Hi Takshak, On Thu, 2006-04-06 at 19:27, Chahande, Takshak wrote: > Hi Hal and others, > > I find that, there are no standard header files exists at userspace > which can define structure for PORT_INFO, NODE_INFO and other elements > like Path Records, Service record etc. There is no user space SA client support in gen2/OpenIB currently. > So every individual has to define his own header files to define the > data structures for these elements and use in their application > program or tool. > > If it is exists then could you please point me out or if it does not > then is there any plan or shall I provide such header files to > make standard header files like we have mad.h, umad.h etc. The current plan is to expose path records and multicast support to user space. 
That's the next increment of support over the next couple months but it sounds like that is insufficient for your needs. If you are planning an SA diagnostics tool, there are 3 approaches in increasing order of difficulty/magnitude of work: 1. Use the OpenSM SA client API for this (osmtest and some Mellanox diagnostic tools use this currently). 2. Use the userspace/management libraries for this. This will take more work as much if not all of the SA support is not there. (These are more geared at SMPs). 3. Develop a user space SA client library for gen2 similar to the other user space libraries (CM, CMA, etc.). -- Hal > Thanks, > - Takshak > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From xma at us.ibm.com Mon Apr 10 10:23:00 2006 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 10 Apr 2006 10:23:00 -0700 Subject: [openib-general] Please don't commit without maintainer approval In-Reply-To: Message-ID: Hello Roland, > And finally, this development process is completely backwards. > Changes shouldn't be made first on the release branch and then merged > to the trunk. All fixes should go into the trunk first with > maintainer approval, and then a selected subset should be merged onto > the release branch. > > - R. > Where we can see the selected subset to be merged onto the release branch? The IPoIB neighbour destructor patch should be one of the subset to be merged onto the release brach. It addressed a kernel panic problem. 
Has this patch been selected to RC2 release? I couldn't find this patch in RC2. This panic is consistently hit on ehca driver. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Mon Apr 10 10:26:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Apr 2006 10:26:28 -0700 Subject: [openib-general] Please don't commit without maintainer approval In-Reply-To: (Shirley Ma's message of "Mon, 10 Apr 2006 10:23:00 -0700") References: Message-ID: Shirley> The IPoIB neighbour destructor patch should be one of the Shirley> subset to be merged onto the release brach. It addressed Shirley> a kernel panic problem. Has this patch been selected to Shirley> RC2 release? I couldn't find this patch in RC2. This Shirley> panic is consistently hit on ehca driver. As we've said many times, the kernel is not released by openib. It is released by Linus. The neighbour destructor patch was in kernel 2.6.17-rc1. - R. From bardov at gmail.com Mon Apr 10 10:33:04 2006 From: bardov at gmail.com (Dan Bar Dov) Date: Mon, 10 Apr 2006 19:33:04 +0200 Subject: [openib-general] Location for iser-target code In-Reply-To: <443A8FEE.2030803@cs.wisc.edu> References: <20060410165458Q.fujita.tomonori@lab.ntt.co.jp> <20060410174547Y.fujita.tomonori@lab.ntt.co.jp> <443A889F.5050408@cs.wisc.edu> <443A8FEE.2030803@cs.wisc.edu> Message-ID: On 4/10/06, Mike Christie wrote: > Dan Bar Dov wrote: > > On 4/10/06, Mike Christie wrote: > >> Dan Bar Dov wrote: > >>> On 4/10/06, FUJITA Tomonori wrote: > >>>> The OLS abstract may be more informative. > >>>> > >>>> http://www.linuxsymposium.org/2006/view_abstract.php?content_key=19 > >>> Is the full document available as well? > >> The code is shorter than the paper. > >> > >>>> In short, tgt is the framework for SCSI target drivers. 
The > >>>> combination of tgt and the iSCSI target driver for NIC provides the > >>>> similar features that IET does. > >>> SCSI targets are LLDs that sit below the mid-layer. iSCSI target on > >>> the other hand is a "network" protocol driver, that sits above SCSI > >>> (sd, st, sg, or directly over mid-layer). Seems like you'd need two > >>> different frameworks, no? > >>> > >> tgt/stgt is probably two frameworks from your point of view. There is a > >> kernel part for target LLDs to hook into. The kernel part is similar to > >> scsi-ml, actually it builds onto it and uses some of the scsi-ml > >> functions, and provides code to share for tasks like creating scatter > >> lists and mapping commands between the kernel and userspace. The target > >> LLD basically handles lower level issues like DMAing the data, transport > >> issues, etc, pretty much what a scsi-ml initiator driver does. For > > > > What do you mean by scsi-ml initiator? Thank you ;-) > > SCSI Mid Layer - (this is what the SCSI Layer in linux is referred to > sometimes) > initiator - I guess in this context it would be the linux scsi host driver? > > So I was referring to drivers like iscsi_tcp, qla2xxx, aic79xx etc. > So scsi-ml initiator driver is a scsi LLD. Got it. > > > >> iscsi, the tgt lld performs similar tasks as the initiator. It parses > >> the iscsi PDUs or puts them on the interconnect, handles session and > >> connection management (this would be done like open-iscsi though), but > >> then passes the scsi command to tgt's kernel code. > >> > >> The other part of the framework is the userspace component. The tgt > >> kernel component basically passes scsi commands and task management > >> functions to a userspace daemon. The daemon contains the scsi state > >> machine and executes the IO. When it is done it informs the kernel > >> component which in turn maps the data into the kernel, forms scatter > >> lists, and then passes them to the target LLD to send out.
> >> > > I got completely confused. I understand (obviously wrongly) that to implement an > > iscsi target (or srp target for that matter), I would write the > > network facing part in kernel, > > that would pass the tasks and data to the user mode, the user mode will perform > > the tasks using scsi drivers (sd/st/sg), and once completed report > > back to the network facing part. > > For tasks did you mean iscsi or scsi type of tasks or is that a rdma or I meant iscsi/scsi tasks. > infiniband term too? For software iscsi over tcp we could just do it > all in userspace. We are not 100% sure if a software iscsi target, even if it is > slimmed down and hooks into iscsi_tcp/libiscsi/scsi_transport_iscsi > would be mergeable, because for software iscsi we could just open a > socket in userspace to the target and never have to implement new kernel > code for iscsi or scsi processing. The completely userspace iscsi target > would listen on the socket for iscsi PDUs, then write/read to the device > using SG_IO or to some virtualized block device. Arjan gave some good > review comments about how to do a software iscsi target in userspace > when IET was posted to linux-scsi for review. He did not hammer out all > the details though. I think I start to understand. The tgt framework uses a model similar to linux's scsi 3 layers. The lld here would be iser, a tcp "data mover", srp etc. and they will convert from a wire protocol (iscsi/srp etc.) to scsi, and then pass the scsi commands to tgt. The mid-layer equivalent is the tgt, and the upper layer is the command handler and that is to live in userland, so it can use any block devices including md, dm etc. > > Could you probably do the same for iser and srp? Maybe we just want a > userspace framework for these types of software drivers because at least > the read/write to the device part could be shared. > I think if I got the model right, it is possible to fit ISER to it.
Again, a lot of the iser and tcp pdu processing could be shared in a library like you did for the initiator. > > > > > I guess I'd need a diagram or two to understand :-) > > > > ok, i am not sure if we can post the paper. we will read the OLS rules > and get back to you. > Thanks. Dan From mshefty at ichips.intel.com Mon Apr 10 10:33:39 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 10:33:39 -0700 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: <1144667911.19061.52371.camel@hal.voltaire.com> References: <1144410107.19061.824.camel@hal.voltaire.com> <4436E507.7000309@ichips.intel.com> <1144667911.19061.52371.camel@hal.voltaire.com> Message-ID: <443A96F3.9000706@ichips.intel.com> Hal Rosenstock wrote: >>This is my understanding of what needs to happen to support dual-sided RMPP. >> >>Node A sends an RMPP message. This requires normal RMPP processing. >>Node A sends an ACK of the final ACK (I'll call ACK2), giving a new window. >>Node B receives ACKs. >>Node B sends the response. This requires normal RMPP processing. >> >> From the perspective of node A, the RMPP code only needs to know to send ACK2. > > There's more to the state machine in turning the direction around in > terms of the sender becoming the receiver. I thought that this is the > "harder" direction change. Can you describe what more is needed that what's listed above? In terms of compliance, if node A is not, but node B is; Node B cannot send back the response until ACK2 is received. Since node A does not understand dual-sided RMPP, it will not send ACK2. Node B will never send the response. If node A is, but node B is not: Node A will send ACK2, which node B should drop. Node B will send an RMPP message assuming an initial window size of 1. If node A had set the window larger, it may delay the ACK of segment 1. Node B will eventually timeout and resend segment 1. 
Most likely, this will cause node A to ACK segment 1, which will update the window size at node B. >> It can do this based on the method, or per transaction if directed by the client. > > Yes; I was thinking of class/method based approach for this. Currently, only a MultiPathRecord query requires this. Why not limit dual-sided RMPP to _only_ this request? All other queries can just use an RMPP message one direction, followed by an RMPP message in the other direction. Beyond MultiPathRecord queries, aren't we talking about vendor specific queries anyway? Personally, I'd vote for removing dual-sided RMPP completely from the spec. It's of questionable benefit and complicates the implementation, but it's probably a little too late for that. Couldn't we just keep from defining anything else as "dual-sided"? >>Node B is more complex. It must now wait for ACK2, using timeout and retries of >>ACK1 until ACK2 is received. And the response that will be generated by the >>client must be delayed until that ACK2 is received. > > > Yes but isn't much of this already needed for the normal termination > case or is that part not implemented yet ? No - ACK2 is a new message unique to dual-sided RMPP transfers (an ACK of an ACK). >> The only information from ACK2 that's needed when sending the >>response is NewWindowLast. A client could be expected to give this back to the >>RMPP layer when sending the response. (A client that lied about NewWindowLast >>should only lead to sending some packets that would be dropped, with the >>transaction aborted.) > > > Good idea. That would eliminate the need for some context transfer from > the receive side to the send side in the RMPP code itself. > > Leaving the NWL up to the client could have the effect you mentioned but > this is known to the RMPP core and hence we needn't rely on the client > for this. For the RMPP core to know what NWL is, it would need to track that information between receiving a request and the generation of the reply. 
Since the RMPP code can't trust the client to generate a reply, it would also need some sort of timeout for how long the NWL information is valid. Tracking NWL by the client seems trivial, whereas, it's substantially more complex for the RMPP core. - Sean From rdreier at cisco.com Mon Apr 10 10:34:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Apr 2006 10:34:28 -0700 Subject: [openib-general] Re: problems cloning infiniband.git In-Reply-To: (Or Gerlitz's message of "Mon, 10 Apr 2006 08:27:38 +0300 (IDT)") References: Message-ID: Or> Hi Roland, I have problems cloning your git tree, is it an Or> issue on my side? I was able to reproduce it but I can't explain it. I can't find any trace of the "static_rate" branch in my tree on kernel.org. Maybe mirrors haven't been updated completely? - R. From mshefty at ichips.intel.com Mon Apr 10 10:38:28 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 10:38:28 -0700 Subject: [openib-general] [PATCH 3 of 4] RFC: revert module ref counting patches In-Reply-To: <20060410111802.GT13416@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> <20060410111802.GT13416@mellanox.co.il> Message-ID: <443A9814.60306@ichips.intel.com> Michael S. Tsirkin wrote: > Revert module ref counting patch for addr. thanks - committed From rdreier at cisco.com Mon Apr 10 10:39:07 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Apr 2006 10:39:07 -0700 Subject: [openib-general] Re: problems cloning infiniband.git In-Reply-To: (Roland Dreier's message of "Mon, 10 Apr 2006 10:34:28 -0700") References: Message-ID: BTW cloning via rsync:// and git:// works fine. 
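A note on the dual-sided RMPP exchange earlier in this part of the archive: the compliance cases Sean describes can be written down as a small decision table. What follows is only a sketch under stated assumptions. The enum and function names are hypothetical and do not exist in the MAD code, and the case where neither node understands dual-sided RMPP is not discussed in the thread, so it is reported as not covered rather than guessed.

```c
#include <stdbool.h>

/* Hypothetical outcome labels; none of these names exist in ib_mad. */
enum rmpp_outcome {
	RMPP_OK,                    /* A sends ACK2, B then sends the response */
	RMPP_RESPONSE_NEVER_SENT,   /* B waits for an ACK2 that A never sends */
	RMPP_LIKELY_OK_AFTER_RESEND,/* B drops ACK2, assumes window 1, may time
				     * out and resend segment 1 before A ACKs */
	RMPP_NOT_COVERED            /* case not described in the thread */
};

/*
 * Node A sends the request, node B sends the response.
 * a_dual / b_dual say whether each node implements dual-sided RMPP.
 */
static enum rmpp_outcome dual_sided_outcome(bool a_dual, bool b_dual)
{
	if (a_dual && b_dual)
		return RMPP_OK;
	if (!a_dual && b_dual)
		return RMPP_RESPONSE_NEVER_SENT;
	if (a_dual && !b_dual)
		return RMPP_LIKELY_OK_AFTER_RESEND;
	return RMPP_NOT_COVERED;
}
```

For example, `dual_sided_outcome(false, true)` yields `RMPP_RESPONSE_NEVER_SENT`, which is the interoperability hazard raised in the message: a dual-sided responder stalls against a requester that does not know to send ACK2.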
From mshefty at ichips.intel.com Mon Apr 10 10:40:29 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 10:40:29 -0700 Subject: [openib-general] [PATCH 2 of 4] RFC: revert module ref counting patches In-Reply-To: <20060410111755.GS13416@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> <20060410111755.GS13416@mellanox.co.il> Message-ID: <443A988D.5000802@ichips.intel.com> Michael S. Tsirkin wrote: > Revert module ref counting patch for CM. I expect the same to > apply to for-2.6.17. committed From mshefty at ichips.intel.com Mon Apr 10 10:42:22 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 10:42:22 -0700 Subject: [openib-general] Re: problems cloning infiniband.git In-Reply-To: References: Message-ID: <443A98FE.5040708@ichips.intel.com> Roland Dreier wrote: > BTW cloning via rsync:// and git:// works fine. I've never been able to successfully clone using git:// myself. I've always used http://. - Sean From rdreier at cisco.com Mon Apr 10 10:42:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Apr 2006 10:42:40 -0700 Subject: [openib-general] Re: [PATCH 4 of 4] RFC: revert module ref counting patches In-Reply-To: <20060410111821.GU13416@mellanox.co.il> (Michael S. 
Tsirkin's message of "Mon, 10 Apr 2006 14:18:21 +0300") References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> <20060410111821.GU13416@mellanox.co.il> Message-ID: OK, I applied this to svn too From xma at us.ibm.com Mon Apr 10 10:48:05 2006 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 10 Apr 2006 10:48:05 -0700 Subject: [openib-general] Please don't commit without maintainer approval In-Reply-To: Message-ID: Roland, >As we've said many times, the kernel is not released by openib. It is >released by Linus. The neighbour destructor patch was in kernel 2.6.17-rc1. I meant the IPoIB destructor work around patch for kernel less than 2.6.17. It is in 2.6.17-rc1 already? It's pretty confusing. When I looked at openib-1.0-rc2 release tree, there is an ipoib directory, which doesn't include IPoIB destructor patch. /gen2/tags/openib-1.0-rc2/src/linux-kernel/infiniband/ulp/ipoib Sorry for bother you same question many times. Why ipoib is a part of openib-1.0-rc2 when it's already a part of kernel release, and they are not the same? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Mon Apr 10 10:46:28 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 10:46:28 -0700 Subject: [openib-general] Re: RFC: revert module ref counting patches In-Reply-To: <20060410194424.GA3174@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> <20060410194424.GA3174@mellanox.co.il> Message-ID: <443A99F4.3070700@ichips.intel.com> Michael S. 
Tsirkin wrote: > BTW, how about the trick of exporting module_mutex_lock/module_mutex_unlock? > Its a really simple approach - do you think this will be acceptable? Who exports these calls, and who calls them? - Sean From mshefty at ichips.intel.com Mon Apr 10 10:48:19 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 10:48:19 -0700 Subject: [openib-general] [PATCH 1 of 4] RFC: revert module ref counting patches In-Reply-To: <20060410111730.GR13416@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> <20060410111730.GR13416@mellanox.co.il> Message-ID: <443A9A63.6070708@ichips.intel.com> Michael S. Tsirkin wrote: > Revert module ref counting patch for CMA. committed From rdreier at cisco.com Mon Apr 10 10:49:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Apr 2006 10:49:46 -0700 Subject: [openib-general] Please don't commit without maintainer approval In-Reply-To: (Shirley Ma's message of "Mon, 10 Apr 2006 10:48:05 -0700") References: Message-ID: Shirley> Sorry for bother you same question many times. Why ipoib Shirley> is a part of openib-1.0-rc2 when it's already a part of Shirley> kernel release, and they are not the same? Yes, we've been discussing how confusing it is to have kernel components in the release branch. The whole kernel directory should be deleted, since it's just an artifact of the subversion repository structure. For now just pretend it isn't there. - R. From mst at mellanox.co.il Mon Apr 10 10:51:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 20:51:36 +0300 Subject: [openib-general] Re: Please don't commit without maintainer approval In-Reply-To: References: Message-ID: <20060410175136.GA4746@mellanox.co.il> Quoting r. Shirley Ma : > Where we can see the selected subset to be merged onto the release branch? 
> > The IPoIB neighbour destructor patch should be one of the subset to be > merged onto the release brach. It addressed a kernel panic problem. Has > this patch been selected to RC2 release? I couldn't find this patch in RC2. > This panic is consistently hit on ehca driver. Shirley, its a mistake to use branches/1.0/src/linux-kernel/ in the first place. It's unmaintained and seems to be there for historical reasons. Only branches/1.0/src/userspace/ is maintained. This was discussed e.g. here http://openib.org/pipermail/openib-general/2006-April/019855.html If you feel a need for a stable tag for internal testing/development, I suggest to go ahead and create one under contrib/ibm. This is what Mellanox does - updating this from time to time is very low maintainance, seems to work well for us. -- MST From rdreier at cisco.com Mon Apr 10 10:50:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Apr 2006 10:50:20 -0700 Subject: [openib-general] Re: RFC: revert module ref counting patches In-Reply-To: <20060410194424.GA3174@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 10 Apr 2006 21:44:24 +0200") References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> <20060410194424.GA3174@mellanox.co.il> Message-ID: Michael> BTW, how about the trick of exporting Michael> module_mutex_lock/module_mutex_unlock? Its a really Michael> simple approach - do you think this will be acceptable? It doesn't seem that appealing to me. - R. 
From delle at pathscale.com Mon Apr 10 10:54:25 2006 From: delle at pathscale.com (Delle Maxwell) Date: Mon, 10 Apr 2006 10:54:25 -0700 Subject: [openib-general] Re: OpenIB 1.0-rc2 available In-Reply-To: <1144473292.7801.2.camel@chalcedony.pathscale.com> References: <1144473292.7801.2.camel@chalcedony.pathscale.com> Message-ID: <1144691665.11884.12.camel@concrete.internal.keyresearch.com> Are they keeping the URLs as openib.org rather than openfabrics.org for this stuff? (I need to make sure we document this correctly...) On Fri, 2006-04-07 at 22:14 -0700, Bryan O'Sullivan wrote: > The svn tag is here: > > https://openib.org/svn/gen2/tags/openib-1.0-rc2/ > > Tarballs: > > http://openib.red-bean.com/rc2/SOURCES/ > > RPM packages (fc4 and suse10 only so far, will add more next week): > > fc4 - http://openib.red-bean.com/rc2/fc4/ > suse10 - http://openib.red-bean.com/rc2/suse10.0/ > -- Delle Maxwell delle at pathscale.com QLogic Corporation dmaxwell at qlogic.com System Interconnect Group 650-934-8076 2071 Stierlin Court, Suite 200 http://www.PathScale.com Mountain View, CA http://www.qlogic.com From mst at mellanox.co.il Mon Apr 10 10:56:32 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 20:56:32 +0300 Subject: [openib-general] Re: Re: RFC: revert module ref counting patches In-Reply-To: <443A99F4.3070700@ichips.intel.com> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> <20060410194424.GA3174@mellanox.co.il> <443A99F4.3070700@ichips.intel.com> Message-ID: <20060410175632.GB4746@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: Re: RFC: revert module ref counting patches > > Michael S. Tsirkin wrote: > >BTW, how about the trick of exporting > >module_mutex_lock/module_mutex_unlock? > >Its a really simple approach - do you think this will be acceptable? 
> > Who exports these calls, and who calls them? The idea is to add this to kernel/module.c:

void module_mutex_lock(void)
{
	down(&module_mutex);
}

void module_mutex_unlock(void)
{
	up(&module_mutex);
}

EXPORT_SYMBOL_GPL(module_mutex_lock);
EXPORT_SYMBOL_GPL(module_mutex_unlock);

----------------------

And now ib_sa can just

	module_mutex_lock()
	query->callback(..)
	module_mutex_unlock()

to prevent modules from unloading while callback is in progress. Same for other modules that use callbacks without registration, like ib_addr. -- MST
From halr at voltaire.com Mon Apr 10 10:50:32 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2006 13:50:32 -0400 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: References: <1144410107.19061.824.camel@hal.voltaire.com> <1144445589.19061.7520.camel@hal.voltaire.com> <1144667137.19061.52221.camel@hal.voltaire.com> Message-ID: <1144688215.19061.56014.camel@hal.voltaire.com> On Mon, 2006-04-10 at 12:34, Roland Dreier wrote: > Hal> Right now, it needs to be burnt into the MAD core as there is > Hal> no other way to determine this (as it is class/method > Hal> dependent). I have made a spec comment on this lack of > Hal> flexibility but don't have a good solution for this at least > Hal> yet... > > Now you've lost me. What if one process sets the ioctl and another > process doesn't? What happens to dual-sided RMPPs then? It's a readable thing rather than a settable thing. > If it's really a system-wide setting It is. > then I guess it should be a file under /sys/class/infiniband_mad. Ah, that's better than an ioctl and no file would certainly be the backward compatibility mode (no dual sided RMPP support).
-- Hal From mshefty at ichips.intel.com Mon Apr 10 10:57:58 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 10:57:58 -0700 Subject: [openib-general] Re: RFC: revert module ref counting patches (was Re: [PATCH] ipoib_flush_paths) In-Reply-To: <20060410094457.GQ13416@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> Message-ID: <443A9CA6.2000306@ichips.intel.com> Michael S. Tsirkin wrote: > Alternatively, we can go back to the original idea of adding API for flushing > WQs to ib_mad, ib_sa and ib_addr modules and calling that at module cleanup. I've thought about this more, and it leads to very subtle dependencies between modules. For example, suppose modules A and B both call into module C, with module C performing callbacks into A and B. For module A to unload, it is now dependent on what module B does in its callback. Interactions between A and C should be limited to A and C to avoid potential deadlock conditions that could occur between unrelated modules. - Sean From rdreier at cisco.com Mon Apr 10 11:00:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Apr 2006 11:00:06 -0700 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: <1144688215.19061.56014.camel@hal.voltaire.com> (Hal Rosenstock's message of "10 Apr 2006 13:50:32 -0400") References: <1144410107.19061.824.camel@hal.voltaire.com> <1144445589.19061.7520.camel@hal.voltaire.com> <1144667137.19061.52221.camel@hal.voltaire.com> <1144688215.19061.56014.camel@hal.voltaire.com> Message-ID: Hal> It's a readable thing rather than a settable thing. Now I'm really confused. Why not just bump /sys/class/infiniband_mad/abi_version then? - R. 
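Both suggestions in the exchange above (a new file under /sys/class/infiniband_mad, or bumping abi_version) come down to userspace reading a small integer from sysfs and treating a missing file as "not supported". A minimal sketch of that check follows, assuming the file holds a single decimal number. The helper names are hypothetical, and no file name for the dual-sided setting is assumed because the thread never settles on one; the path is left to the caller.

```c
#include <stdio.h>

/* Return the integer stored in a sysfs-style file, or -1 if unreadable. */
static int read_sysfs_int(const char *path)
{
	FILE *f = fopen(path, "r");
	int value;

	if (!f)
		return -1;		/* no file: caller decides what that means */
	if (fscanf(f, "%d", &value) != 1)
		value = -1;		/* file exists but holds no number */
	fclose(f);
	return value;
}

/*
 * Hypothetical policy following Hal's remark: a missing file is the
 * backward compatibility mode, i.e. no dual-sided RMPP support.
 */
static int dual_sided_rmpp_supported(const char *path)
{
	return read_sysfs_int(path) > 0;
}
```

The same `read_sysfs_int()` helper could be pointed at /sys/class/infiniband_mad/abi_version if the version-bump route were taken instead; which value would signal dual-sided support is not stated in the thread.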
From alexn at Voltaire.COM Mon Apr 10 10:59:20 2006 From: alexn at Voltaire.COM (Alexander Nezhinsky) Date: Mon, 10 Apr 2006 20:59:20 +0300 Subject: [openib-general] Location for iser-target code In-Reply-To: <443A8FEE.2030803@cs.wisc.edu> References: <20060410165458Q.fujita.tomonori@lab.ntt.co.jp> <20060410174547Y.fujita.tomonori@lab.ntt.co.jp> <443A889F.5050408@cs.wisc.edu> <443A8FEE.2030803@cs.wisc.edu> Message-ID: <443A9CF8.7020600@Voltaire.COM> Mike Christie wrote: >>> tgt/stgt is probably two frameworks from your point of view. There is a >>> kernel part for target LLDs to hook into. The kernel part is similar to >>> scsi-ml, actually it builds onto it and uses some of the scsi-ml >>> functions, and provides code to share for tasks like creating scatter >>> lists and mapping commands between the kernel and userspace. The target >>> LLD basically handles lower level issues like DMAing the data, >>> transport >>> issues, etc, pretty much what a scsi-ml initiator driver does. For >>>> iscsi, the tgt lld performs similar tasks as the initiator. It parses >>>> the iscsi PDUs or puts them on the interconnect, handles session and >>>> connection manamgement (this would be done like open-iscsi though), >>>> but >>>> then then passes the scsi command to tgt's kernel code. >>>> >>>> The other part of the framework is the userspace component. The tgt >>>> kernel component basically passes scsi commands and task management >>>> functions to a userspace daemon. The daemon contains the scsi state >>>> machine and execute the IO. When it is done it informs the the kernel >>>> component which in turn maps the data into the kernel, forms scatter >>>> lists, and then passes them to the target LLD to send out. In the cited paper's abstract you wrote: > In order to provide block I/O services, users have had to use modified kernel code, > binary kernel modules, or specialized hardware. 
With Linux now having iSCSI, > Fibre Channel, and RDMA initiator support, Linux target framework (tgt) aims to > fill the gap in storage functionality by consolidating several storage target drivers... So i guess one of the added values (if not the main one) of implementing the entire scsi command interface of tgt in userspace is gaining easy access to block I/O drivers. But the block I/O subsystem has a clear intra-kernel interface. If the kernel part of tgt would anyway allocate memory and build the scatter-gather lists, it could pass the commands along with the buffer descriptors down to the storage stack, addressing either the appropriate block I/O driver or scsi-ml itself. This extra code should be quite thin, it uses only existing interfaces, makes no modification to the existing kernel code. The user space code can do all the administration stuff, and specifically choosing the right driver and passing to the kernel part all necessary identification and configuration info about it. Are there other reasons for pushing SCSI commands from kernel to user space and performing them from there? From mshefty at ichips.intel.com Mon Apr 10 11:01:02 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 11:01:02 -0700 Subject: [openib-general] Re: RFC: revert module ref counting patches In-Reply-To: <20060410175632.GB4746@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> <20060410194424.GA3174@mellanox.co.il> <443A99F4.3070700@ichips.intel.com> <20060410175632.GB4746@mellanox.co.il> Message-ID: <443A9D5E.1060809@ichips.intel.com> Michael S. Tsirkin wrote: > And now ib_sa can just > > module_mutex_lock() > query->callback(..) > module_mutex_unlock() > > to prevent modules from unloading while callback is in progress. 
> > Same for other modules that use callbacks without registration, like ib_addr. This seems like too broad of a fix. We should just make sure that the module we're calling does not unload, rather than preventing any module from unloading. Wouldn't this also serialize all callbacks? - Sean From mst at mellanox.co.il Mon Apr 10 11:03:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 21:03:03 +0300 Subject: [openib-general] Re: Please don't commit without maintainer approval In-Reply-To: References: Message-ID: <20060410180303.GC4746@mellanox.co.il> Quoting r. Shirley Ma : > >As we've said many times, the kernel is not released by openib. It is > >released by Linus. The neighbour destructor patch was in kernel 2.6.17-rc1. > > I meant the IPoIB destructor work around patch for kernel less than 2.6.17. This has been discussed here: http://openib.org/pipermail/openib-general/2006-April/019563.html > It is in 2.6.17-rc1 already? Yes. > It's pretty confusing. When I looked at openib-1.0-rc2 release tree, there is an > ipoib directory, which doesn't include IPoIB destructor patch. > /gen2/tags/openib-1.0-rc2/src/linux-kernel/infiniband/ulp/ipoib This has been discussed here: http://openib.org/pipermail/openib-general/2006-April/019855.html > Sorry for bother you same question many times. > Why ipoib is a part of openib-1.0-rc2 when it's already a part of kernel release, > and they are not the same? Basically, src/linux-kernel is under 1.0 directory by mistake. I agree it should be removed. 
-- MST From bos at pathscale.com Mon Apr 10 11:06:05 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 10 Apr 2006 11:06:05 -0700 Subject: [openib-general] Re: RFC: clean branches/1.0/ In-Reply-To: References: <1144439463.14694.129.camel@chalcedony.pathscale.com> <1144440247.14694.141.camel@chalcedony.pathscale.com> <20060409150130.GJ13416@mellanox.co.il> <1144600338.2434.5.camel@localhost.localdomain> Message-ID: <1144692365.12804.0.camel@chalcedony.pathscale.com> On Sun, 2006-04-09 at 14:25 -0700, Roland Dreier wrote: > I'm not sure how feasible this is, or whether it's worth the effort. It's easy enough, but I'll be happy to not do the work if it's convenient for people to just use the current kernel.org kernel instead. References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> <443A9CA6.2000306@ichips.intel.com> Message-ID: <20060410181104.GD4746@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC: revert module ref counting patches (was Re: [PATCH] ipoib_flush_paths) > > Michael S. Tsirkin wrote: > >Alternatively, we can go back to the original idea of adding API for > >flushing > >WQs to ib_mad, ib_sa and ib_addr modules and calling that at module > >cleanup. > > I've thought about this more, and it leads to very subtle dependencies > between modules. No, its simple: A uses C -> A does flush C at unload. B uses C -> B does flush C at unload. > For example, suppose modules A and B both call into > module C, with module C performing callbacks into A and B. For module A to > unload, it is now dependent on what module B does in its callback. This is always the case when both callbacks run form the same workqueue. > Interactions between A and C should be limited to A and C to avoid > potential deadlock conditions that could occur between unrelated modules. 
I don't see how this adds dependencies that we don't already have. Witness the recent ib_cma/sa deadlock - without flushes. -- MST From mst at mellanox.co.il Mon Apr 10 11:12:30 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 21:12:30 +0300 Subject: [openib-general] Re: Re: problems cloning infiniband.git In-Reply-To: References: Message-ID: <20060410181230.GE4746@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Re: problems cloning infiniband.git > > BTW cloning via rsync:// and git:// works fine. git is what I do. But many firewalls seem to block it. -- MST From mshefty at ichips.intel.com Mon Apr 10 11:10:32 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 11:10:32 -0700 Subject: [openib-general] [PATCH v2] mad: use GID/LID on requester side when matching responses to requests In-Reply-To: <200604101804.34043.jackm@mellanox.co.il> References: <200604101804.34043.jackm@mellanox.co.il> Message-ID: <443A9F98.9060604@ichips.intel.com> Jack Morgenstein wrote: > Corrected and cleaner version. > > Check GID/LID for requester side when searching for request which matches > received response. This, in order to guarantee uniqueness if use same TID > when requesting via multiple source LIDs (when LMC is not zero). To perform > check, add LMC to cache. > > Further, do not perform LID check for direct-routed packets, since permissive > LID makes a proper check impossible. Thanks - I'll look at this within the next couple of days. Roland, can you look over the verbs cache piece? Do you want that as a separate patch? 
- Sean From michaelc at cs.wisc.edu Mon Apr 10 11:16:12 2006 From: michaelc at cs.wisc.edu (Mike Christie) Date: Mon, 10 Apr 2006 13:16:12 -0500 Subject: [openib-general] Location for iser-target code In-Reply-To: <443A9CF8.7020600@Voltaire.COM> References: <20060410165458Q.fujita.tomonori@lab.ntt.co.jp> <20060410174547Y.fujita.tomonori@lab.ntt.co.jp> <443A889F.5050408@cs.wisc.edu> <443A8FEE.2030803@cs.wisc.edu> <443A9CF8.7020600@Voltaire.COM> Message-ID: <443AA0EC.70200@cs.wisc.edu> Alexander Nezhinsky wrote: > Mike Christie wrote: >>>> tgt/stgt is probably two frameworks from your point of view. There is a >>>> kernel part for target LLDs to hook into. The kernel part is similar to >>>> scsi-ml, actually it builds onto it and uses some of the scsi-ml >>>> functions, and provides code to share for tasks like creating scatter >>>> lists and mapping commands between the kernel and userspace. The target >>>> LLD basically handles lower level issues like DMAing the data, >>>> transport >>>> issues, etc, pretty much what a scsi-ml initiator driver does. For >>>>> iscsi, the tgt lld performs similar tasks as the initiator. It parses >>>>> the iscsi PDUs or puts them on the interconnect, handles session and >>>>> connection manamgement (this would be done like open-iscsi though), >>>>> but >>>>> then then passes the scsi command to tgt's kernel code. >>>>> >>>>> The other part of the framework is the userspace component. The tgt >>>>> kernel component basically passes scsi commands and task management >>>>> functions to a userspace daemon. The daemon contains the scsi state >>>>> machine and execute the IO. When it is done it informs the the kernel >>>>> component which in turn maps the data into the kernel, forms scatter >>>>> lists, and then passes them to the target LLD to send out. > In the cited paper's abstract you wrote: > > In order to provide block I/O services, users have had to use > modified kernel code, > > binary kernel modules, or specialized hardware. 
With Linux now having > iSCSI, > > Fibre Channel, and RDMA initiator support, Linux target framework > (tgt) aims to > > fill the gap in storage functionality by consolidating several > storage target drivers... > > So i guess one of the added values (if not the main one) of implementing > the entire > scsi command interface of tgt in userspace is gaining easy access to > block I/O drivers. > But the block I/O subsystem has a clear intra-kernel interface. If the > kernel part of Which interface are you referring to? bio, REQ_PC or REQ_BLOCK_PC, or read/write so you can take advantage of the kernel cache? > tgt would anyway allocate memory and build the scatter-gather lists, it > could pass the > commands along with the buffer descriptors down to the storage stack, > addressing > either the appropriate block I/O driver or scsi-ml itself. This extra > code should be > quite thin, it uses only existing interfaces, makes no modification to > the existing > kernel code. Some of those options above require minor changes or you have to work around them in your own code. The user space code can do all the administration stuff, > and specifically > choosing the right driver and passing to the kernel part all necessary > identification > and configuration info about it. Actually we did this, and it was not acceptable to the scsi maintainer: For example we could send IO by: 1. we use the sg io kernel interfaces to do a passthrough type of interface. 2. read/write to device from kernel If you look at the different trees on that berili site you will see different versions of this. And it sends up being the same amount of code. See below. > > Are there other reasons for pushing SCSI commands from kernel to user > space and > performing them from there? > By pushing it to userspace you please the kernel reviewers and there is not a major difference in performance (not that we have found yet). 
Also when pushing it to userspace we use the same API that is used to execute SG_IO request and push its data between the kernel and userspace so it is not like we are creating something completely new. Just hooking up some pieces. The major new part is the netlink interface which is a couple hundred lines. Some of that interface is for management though. From bos at pathscale.com Mon Apr 10 11:16:33 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 10 Apr 2006 11:16:33 -0700 Subject: [openib-general] Re: Please don't commit without maintainer approval In-Reply-To: References: Message-ID: <1144692993.12804.5.camel@chalcedony.pathscale.com> On Sun, 2006-04-09 at 14:23 -0700, Roland Dreier wrote: > Bryan, I reverted the changes below, which you checked into the trunk > with the comment "update spec file from 1.0 branch." Fair enough. > I think it's > inappropriate to commit without even giving a heads up to the > maintainer. I'll be sure to do that, then. > - I've not tried it to see for sure, but I can't imagine that > depending on instead of a package name works > properly either for autobuilders or automatic dependency tracking > in yum, etc. It works perfectly, which is why I did it. > And finally, this development process is completely backwards. No, it isn't. References: <20060410165458Q.fujita.tomonori@lab.ntt.co.jp> <20060410174547Y.fujita.tomonori@lab.ntt.co.jp> <443A889F.5050408@cs.wisc.edu> <443A8FEE.2030803@cs.wisc.edu> <443A9CF8.7020600@Voltaire.COM> <443AA0EC.70200@cs.wisc.edu> Message-ID: <443AA208.4030207@cs.wisc.edu> Mike Christie wrote: > Alexander Nezhinsky wrote: >> Mike Christie wrote: >>>>> tgt/stgt is probably two frameworks from your point of view. There >>>>> is a >>>>> kernel part for target LLDs to hook into. 
The kernel part is >>>>> similar to >>>>> scsi-ml, actually it builds onto it and uses some of the scsi-ml >>>>> functions, and provides code to share for tasks like creating scatter >>>>> lists and mapping commands between the kernel and userspace. The >>>>> target >>>>> LLD basically handles lower level issues like DMAing the data, >>>>> transport >>>>> issues, etc, pretty much what a scsi-ml initiator driver does. For >>>>>> iscsi, the tgt lld performs similar tasks as the initiator. It parses >>>>>> the iscsi PDUs or puts them on the interconnect, handles session and >>>>>> connection manamgement (this would be done like open-iscsi >>>>>> though), but >>>>>> then then passes the scsi command to tgt's kernel code. >>>>>> >>>>>> The other part of the framework is the userspace component. The tgt >>>>>> kernel component basically passes scsi commands and task management >>>>>> functions to a userspace daemon. The daemon contains the scsi state >>>>>> machine and execute the IO. When it is done it informs the the kernel >>>>>> component which in turn maps the data into the kernel, forms scatter >>>>>> lists, and then passes them to the target LLD to send out. >> In the cited paper's abstract you wrote: >> > In order to provide block I/O services, users have had to use >> modified kernel code, >> > binary kernel modules, or specialized hardware. With Linux now >> having iSCSI, >> > Fibre Channel, and RDMA initiator support, Linux target framework >> (tgt) aims to >> > fill the gap in storage functionality by consolidating several >> storage target drivers... >> >> So i guess one of the added values (if not the main one) of >> implementing the entire >> scsi command interface of tgt in userspace is gaining easy access to >> block I/O drivers. >> But the block I/O subsystem has a clear intra-kernel interface. If the >> kernel part of > > Which interface are you referring to? bio, REQ_PC or REQ_BLOCK_PC, or > read/write so you can take advantage of the kernel cache? 
> >> tgt would anyway allocate memory and build the scatter-gather lists, >> it could pass the >> commands along with the buffer descriptors down to the storage >> stack, addressing >> either the appropriate block I/O driver or scsi-ml itself. This extra >> code should be >> quite thin, it uses only existing interfaces, makes no modification to >> the existing >> kernel code. > > Some of those options above require minor changes or you have to work > around them in your own code. > > The user space code can do all the administration stuff, >> and specifically >> choosing the right driver and passing to the kernel part all necessary >> identification >> and configuration info about it. > > > Actually we did this, and it was not acceptable to the scsi maintainer: > > For example we could send IO by: > > 1. we use the sg io kernel interfaces to do a passthrough type of > interface. > > 2. read/write to device from kernel > > If you look at the different trees on that berili site you will see > different versions of this. And it sends up being the same amount of > code. See below. > >> >> Are there other reasons for pushing SCSI commands from kernel to user >> space and >> performing them from there? >> > > By pushing it to userspace you please the kernel reviewers and there is > not a major difference in performance (not that we have found yet). Also > when pushing it to userspace we use the same API that is used to execute > SG_IO request and push its data between the kernel and userspace so it > is not like we are creating something completely new. Just hooking up > some pieces. The major new part is the netlink interface which is a > couple hundred lines. Some of that interface is for management though. > That is not really answering your question... Besides the kernel reviewers, the problem with using REQ_PC or REQ_BLOCK_PC and bios is that you cannot reuse the kernel's caching layer. 
By doing it in userspace you can do an mmap and take advantage of the kernel's caching code, and it is async. To do the same thing in the kernel you have to either create a thread to do each read/write, hook in the async read/write interface in the kernel (which may be nicer to do now, but was not when we looked at it), or implement your own cache layer and I do not think that would be easy to merge. From iod00d at hp.com Mon Apr 10 11:34:49 2006 From: iod00d at hp.com (Grant Grundler) Date: Mon, 10 Apr 2006 11:34:49 -0700 Subject: [openib-general] Re: Please don't commit without maintainer approval In-Reply-To: <1144692993.12804.5.camel@chalcedony.pathscale.com> References: <1144692993.12804.5.camel@chalcedony.pathscale.com> Message-ID: <20060410183449.GF29757@esmail.cup.hp.com> On Mon, Apr 10, 2006 at 11:16:33AM -0700, Bryan O'Sullivan wrote: > > And finally, this development process is completely backwards. > > No, it isn't. Yes it is. Roland is absolutely right. Put things into trunk/mainline first then merge to the release branch. trunk is what folks are testing and watching for commits. grant From mshefty at ichips.intel.com Mon Apr 10 11:36:07 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 11:36:07 -0700 Subject: [openib-general] Re: RFC: revert module ref counting patches (was Re: [PATCH] ipoib_flush_paths) In-Reply-To: <20060410181104.GD4746@mellanox.co.il> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> <443A9CA6.2000306@ichips.intel.com> <20060410181104.GD4746@mellanox.co.il> Message-ID: <443AA597.6000307@ichips.intel.com> Michael S. Tsirkin wrote: > No, it's simple: A uses C -> A does flush C at unload. > B uses C -> B does flush C at unload. This is C's problem. We're forcing clients to provide the fix, which just seems wrong.
The issue is that A does flush C, which must wait for B's callback to complete. >>For example, suppose modules A and B both call into >>module C, with module C performing callbacks into A and B. For module A to >>unload, it is now dependent on what module B does in its callback. > > > This is always the case when both callbacks run from the same > workqueue. No - for example the IB CM uses a workqueue to invoke multiple callbacks. When a client destroys their cm_id, the call blocks only while the callback associated with that cm_id is executing. > I don't see how this adds dependencies that we don't already have. Witness the > recent ib_cma/sa deadlock - without flushes. Exactly - we hit a deadlock condition between modules where the code appeared to be correct. (Although in that case, I blame the RDMA CM for making a blocking call in the MAD thread.) Adding more dependencies can only lead to more issues. - Sean From xma at us.ibm.com Mon Apr 10 11:41:38 2006 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 10 Apr 2006 11:41:38 -0700 Subject: [openib-general] Re: Please don't commit without maintainer approval In-Reply-To: <20060410180303.GC4746@mellanox.co.il> Message-ID: "Michael S. Tsirkin" wrote on 04/10/2006 11:03:03 AM: > Quoting r. Shirley Ma : > > >As we've said many times, the kernel is not released by openib. It is > > >released by Linus. The neighbour destructor patch was in kernel > 2.6.17-rc1. > > > > I meant the IPoIB destructor work around patch for kernel less than 2.6.17. > > This has been discussed here: > http://openib.org/pipermail/openib-general/2006-April/019563.html > > > It is in 2.6.17-rc1 already? > > Yes. I have checked 2.6.17-rc1. I couldn't see the above patch (ipoib_all_neigh_list_lock) there. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 > MST
From sean.hefty at intel.com Mon Apr 10 11:41:47 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 11:41:47 -0700 Subject: [openib-general] RE: [PATCH] thinko fix in core/cache.c In-Reply-To: <20060410113053.GV13416@mellanox.co.il> Message-ID: Updated patch for the trunk. - Sean
---
Index: cache.c
===================================================================
--- cache.c (revision 6387)
+++ cache.c (working copy)
@@ -302,7 +302,7 @@ static void ib_cache_setup_one(struct ib
 		kmalloc(sizeof *device->cache.pkey_cache *
 			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
 	device->cache.gid_cache =
-		kmalloc(sizeof *device->cache.pkey_cache *
+		kmalloc(sizeof *device->cache.gid_cache *
 			(end_port(device) - start_port(device) + 1), GFP_KERNEL);
 
 	if (!device->cache.pkey_cache || !device->cache.gid_cache) {
From mst at mellanox.co.il Mon Apr 10 11:43:59 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 21:43:59 +0300 Subject: [openib-general] Re: RFC: revert module ref counting patches (was Re: [PATCH] ipoib_flush_paths) In-Reply-To: <443AA597.6000307@ichips.intel.com> References: <20060406003635.GG26557@mellanox.co.il> <79ae2f320604052031j290c8d2el645bdcb2caf9f769@mail.gmail.com> <20060406131755.GN21115@mellanox.co.il> <20060410094457.GQ13416@mellanox.co.il> <443A9CA6.2000306@ichips.intel.com> <20060410181104.GD4746@mellanox.co.il> <443AA597.6000307@ichips.intel.com> Message-ID: <20060410184359.GF4746@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC: revert module ref counting patches (was Re: [PATCH] ipoib_flush_paths) > > Michael S. Tsirkin wrote: > >No, its simple: A uses C -> A does flush C at unload. > >B uses C -> B does flush C at unload. > > This is C's problem. We're forcing clients to provide the fix, which just > seems wrong. The issue is that A does flush C, which must wait for B's > callback to complete.
Sean, as a rule, flushing never adds deadlocks unless you hold some locks while flushing. In our example it's clear that A has to wait for its callback to run to completion. If B's and A's callbacks are running on the same WQ, this means A must wait for B's callback to complete if it gets placed in the WQ before A's callback. So we are *not* adding an extra dependency. -- MST From mst at mellanox.co.il Mon Apr 10 11:49:16 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 21:49:16 +0300 Subject: [openib-general] Re: Please don't commit without maintainer approval In-Reply-To: References: <20060410180303.GC4746@mellanox.co.il> Message-ID: <20060410184916.GG4746@mellanox.co.il> Quoting r. Shirley Ma : > > > I meant the IPoIB destructor work around patch for kernel less than 2.6.17. > > > > This has been discussed here: > > http://openib.org/pipermail/openib-general/2006-April/019563.html > > > > > It is in 2.6.17-rc1 already? > > > > Yes. > > I have checked 2.6.17-rc1. I couldn't see the above patch (ipoib_all_neigh_list_lock) there. Oh, I see. It's not needed on 2.6.17. What's there is commit c5ecd62c25400a3c6856e009f84257d5bd03f03b More to the point - does IPoIB still crash for you in 2.6.17-rc1? -- MST From xma at us.ibm.com Mon Apr 10 12:01:37 2006 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 10 Apr 2006 12:01:37 -0700 Subject: [openib-general] Re: Please don't commit without maintainer approval In-Reply-To: <20060410184916.GG4746@mellanox.co.il> Message-ID: "Michael S. Tsirkin" wrote on 04/10/2006 11:49:16 AM: > Quoting r. Shirley Ma : > > > > I meant the IPoIB destructor work around patch for kernel less > than 2.6.17. > > > > > > This has been discussed here: > > > http://openib.org/pipermail/openib-general/2006-April/019563.html > > > > > > > It is in 2.6.17-rc1 already? > > > > > > Yes. > > > > I have checked 2.6.17-rc1. I couldn't see the above patch > (ipoib_all_neigh_list_lock) there. > > Oh, I see. Its not needed on 2.6.17.
What's there is > commit c5ecd62c25400a3c6856e009f84257d5bd03f03b > > More to the point - does IPoIB still crash for you in 2.6.17-rc1? > > -- > MST I haven't tested 2.6.17-rc1 yet. The problem is we need to run most recent ipoib on the old kernel. Which is the reason we need this workaround patch to be upper stream or as a part of kernel patch for RC2 release. When do you think this workaround patch is to be upper stream? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From mst at mellanox.co.il Mon Apr 10 12:21:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 22:21:54 +0300 Subject: [openib-general] Re: Please don't commit without maintainer approval In-Reply-To: References: <20060410184916.GG4746@mellanox.co.il> Message-ID: <20060410192154.GH4746@mellanox.co.il> Quoting r. Shirley Ma : > Subject: Re: Please don't commit without maintainer approval > > .... > > "Michael S. Tsirkin" wrote on 04/10/2006 11:49:16 AM: > > > > This has been discussed here: > > > > http://openib.org/pipermail/openib-general/2006-April/019563.html > > .... > > The problem is we need to run most recent ipoib on the old kernel. > Which is the reason we need this workaround patch to be upper stream > or as a part of kernel patch for RC2 release. I feel your pain. But see above link. > When do you think this workaround patch is to be upper stream? The patch in the above link is against the stable kernel. I haven't submitted it.
-- MST From sean.hefty at intel.com Mon Apr 10 12:20:26 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 12:20:26 -0700 Subject: [openib-general] RE: RFC: revert module ref counting patches (was Re: [PATCH] ipoib_flush_paths) In-Reply-To: <20060410184359.GF4746@mellanox.co.il> Message-ID: >In our example its clear that A has to wait for its callback to run to >complete. >If B and A callbacks are running on the same WQ, this means A >must wait for B's callack to complete if it gets placed in in WQ >before A callback. You're assuming a single-threaded work queue. With a multi-threaded work queue, A must wait for callbacks to all modules to complete. Maybe this isn't a big deal, and if we would deadlock using flush, we would deadlock using an alternate method. - Sean From bugzilla-daemon at openib.org Mon Apr 10 12:47:09 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 10 Apr 2006 12:47:09 -0700 (PDT) Subject: [openib-general] [Bug 33] New: Ping fails on ib1 interface - IBED - RC3 Message-ID: <20060410194709.555E52283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=33 Summary: Ping fails on ib1 interface - IBED - RC3 Product: OpenIB Version: gen2 Platform: X86 OS/Version: 2.6.9 Status: NEW Severity: normal Priority: P2 Component: IPoIB AssignedTo: bugzilla at openib.org ReportedBy: ksharma at silverstorm.com Kernel version: 2.6.9-22 Release version: IBED 1.0 - RC3 Description: After bringing up both ib0 and ib1 interfaces, ping on ib1 interfaces fails. Even if we down ib0 interface, ping via ib1 interface doesn't work. The routing table shows the entry for ib1 interface. 
[st45 Mon Apr 10 14:15:33]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.26.16.0     *               255.255.255.0   U     0      0        0 ib0
172.26.17.0     *               255.255.255.0   U     0      0        0 ib1
172.26.0.0      *               255.255.240.0   U     0      0        0 eth0
169.254.0.0     *               255.255.0.0     U     0      0        0 eth0
default         172.26.0.254    0.0.0.0         UG    0      0        0 eth0
[st45 Mon Apr 10 14:34:25]#
------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mst at mellanox.co.il Mon Apr 10 12:41:43 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Apr 2006 22:41:43 +0300 Subject: [openib-general] Re: RFC: revert module ref counting patches (was Re: [PATCH] ipoib_flush_paths) In-Reply-To: References: <20060410184359.GF4746@mellanox.co.il> Message-ID: <20060410194143.GJ4746@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: RFC: revert module ref counting patches (was Re: [PATCH] ipoib_flush_paths) > > >In our example its clear that A has to wait for its callback to run to > >complete. > >If B and A callbacks are running on the same WQ, this means A > >must wait for B's callback to complete if it gets placed in WQ > >before A callback. > > You're assuming a single-threaded work queue. With a multi-threaded work queue, > A must wait for callbacks to all modules to complete. By multi-threaded workqueue you mean a per-cpu one? But all core components use single-threaded workqueues now. Anyway, I think it's a matter of luck whether things get queued on separate CPUs in parallel, or on the same one serially: you can't solve deadlocks this way. > Maybe this isn't a big deal, and if we would deadlock using flush, we would > deadlock using an alternate method. Exactly, that's what I'm saying.
-- MST From halr at voltaire.com Mon Apr 10 12:49:02 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2006 15:49:02 -0400 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: References: <1144410107.19061.824.camel@hal.voltaire.com> <1144445589.19061.7520.camel@hal.voltaire.com> <1144667137.19061.52221.camel@hal.voltaire.com> <1144688215.19061.56014.camel@hal.voltaire.com> Message-ID: <1144698536.19061.58054.camel@hal.voltaire.com> On Mon, 2006-04-10 at 14:00, Roland Dreier wrote: > Hal> It's a readable thing rather than a settable thing. > > Now I'm really confused. Why not just bump > /sys/class/infiniband_mad/abi_version then? Yes, this certainly can be done with an ABI version change. I was trying to do this without doing that as I thought it could be done and that would be easier for users. It certainly is more deterministic if there is an ABI version change in that the errors can be disambiguated better but is it required if an ioctl is added or alternatively rmpp_version bit is "stolen" ? -- Hal From rdreier at cisco.com Mon Apr 10 13:04:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Apr 2006 13:04:28 -0700 Subject: [openib-general] [PATCH v2] mad: use GID/LID on requester side when matching responses to requests In-Reply-To: <443A9F98.9060604@ichips.intel.com> (Sean Hefty's message of "Mon, 10 Apr 2006 11:10:32 -0700") References: <200604101804.34043.jackm@mellanox.co.il> <443A9F98.9060604@ichips.intel.com> Message-ID: Sean> Thanks - I'll look at this within the next couple of days. Sean> Roland, can you look over the verbs cache piece? Do you Sean> want that as a separate patch? The lmc_cache stuff looks fine to me. It probably would be better to commit it as a separate patch -- "one idea per patch." - R. 
From swise at opengridcomputing.com Mon Apr 10 13:08:18 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 10 Apr 2006 15:08:18 -0500 Subject: [openib-general] Re: [PATCH] [DAPL] [RFC] - remove duplicate disconnect event. In-Reply-To: References: <1144276742.28591.82.camel@stevo-desktop> Message-ID: <1144699698.23502.17.camel@stevo-desktop> > This is an artifact of some older verbs definitions. This code should > have gone in the verbs specific portion of DAPL instead of the common > code. > > I'll play around with this and see if there are any negative effects > on IB. > Hey James, Any progress on this? From jlentini at netapp.com Mon Apr 10 13:12:20 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 10 Apr 2006 16:12:20 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL cma; dat_ep_free can return without freeing cm_id In-Reply-To: <443A893E.4000109@ichips.intel.com> References: <1144365985.10701.80.camel@stevo-desktop> <443A893E.4000109@ichips.intel.com> Message-ID: On Mon, 10 Apr 2006, Arlin Davis wrote: > James, can you go ahead and commit this patch? Committed in the trunk and on the 1.0 branch in revision 6393. From gdror at mellanox.co.il Mon Apr 10 13:14:39 2006 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Mon, 10 Apr 2006 23:14:39 +0300 Subject: [openib-general] SDP Memory management Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEE8C1@mtlexch01.mtl.com> >From: Devesh Sharma >Sent: Monday, April 10, 2006 4:48 PM > >a) What is the concept of FMR? > Fast memory registration. It is an optimization of the HCA to perform registration faster than regular memory region. >b) In absence of FMR support what is the buffer management scheme for Z-Copy? >Will SDP work without FMR? I think that the easiest way out is to implement FMR in the HCA provider using regular MRs. SDP strongly assumes FMR support in today's implementation. Haven't looked deeply into it, but I think that ehca has done this. 
>c) How page locking and virtual to physical address conversion is done for Z- >Copy buffers? Take a look at sdp_iocb.c >Devesh From rdreier at cisco.com Mon Apr 10 13:20:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Apr 2006 13:20:40 -0700 Subject: [openib-general] Re: [PATCH] thinko fix in core/cache.c In-Reply-To: (Sean Hefty's message of "Mon, 10 Apr 2006 11:41:47 -0700") References: Message-ID: Thanks, applied. From rdreier at cisco.com Mon Apr 10 13:23:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Apr 2006 13:23:28 -0700 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: <1144698536.19061.58054.camel@hal.voltaire.com> (Hal Rosenstock's message of "10 Apr 2006 15:49:02 -0400") References: <1144410107.19061.824.camel@hal.voltaire.com> <1144445589.19061.7520.camel@hal.voltaire.com> <1144667137.19061.52221.camel@hal.voltaire.com> <1144688215.19061.56014.camel@hal.voltaire.com> <1144698536.19061.58054.camel@hal.voltaire.com> Message-ID: Hal> It certainly is more deterministic if there is an ABI version Hal> change in that the errors can be disambiguated better but is Hal> it required if an ioctl is added or alternatively Hal> rmpp_version bit is "stolen" ? I'm pretty confused. What would the ioctl do? Return a bit saying whether or not dual-sided RMPP is supported? What would happen if an old app that didn't know about dual-sided RMPP ran and didn't perform the ioctl? I guess this isn't really an ABI change -- old binaries continue to run as long as they don't try dual-sided RMPP (which doesn't work now). So maybe the answer is /sys/class/infiniband_mad/dual_sided_rmpp? An ioctl for something that is not attached to a file descriptor isn't really the right thing. - R. From mst at mellanox.co.il Mon Apr 10 13:28:02 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 10 Apr 2006 23:28:02 +0300 Subject: [openib-general] Re: [PATCH] thinko fix in core/cache.c In-Reply-To: References: Message-ID: <20060410202802.GB4692@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] thinko fix in core/cache.c > > Thanks, applied. Pls remember to queue for 2.6.17 as well. -- MST From rdreier at cisco.com Mon Apr 10 13:27:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 10 Apr 2006 13:27:46 -0700 Subject: [openib-general] Re: [PATCH] thinko fix in core/cache.c In-Reply-To: <20060410202802.GB4692@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 10 Apr 2006 23:28:02 +0300") References: <20060410202802.GB4692@mellanox.co.il> Message-ID: Michael> Pls remember to queue for 2.6.17 as well. Yes, it's there. From jlentini at netapp.com Mon Apr 10 13:30:57 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 10 Apr 2006 16:30:57 -0400 (EDT) Subject: [openib-general] Re: [uDAPL] who should update the file /etc/dat.conf? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD31C1@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD31C1@mtlexch01.mtl.com> Message-ID: On Mon, 10 Apr 2006, Dotan Barak wrote: > Hi. > > In the gen2 driver, in the dapl folder the file dat.conf can be found. 
> Here is the content of this file:
>
> #
> # DAT 1.2 configuration file
> #
> # Each entry should have the following fields:
> #
> # \
> #
> #
> # Example for openib_cma and openib_scm
> #
> # For cma version you specify as:
> # network address, network hostname, or netdev name and 0 for port
> #
> # For scm version you specify as actual device name and port
> #
> # Simple (OpenIB-cma) default with netdev name provided first on list
> # to enable use of same dat.conf version on all nodes
> #
> OpenIB-cma u1.2 nonthreadsafe default /usr/local//lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local//lib64/libdaplcma.so mv_dapl.1.2 "192.168.0.22 0" ""
> OpenIB-cma-name u1.2 nonthreadsafe default /usr/local//lib64/libdaplcma.so mv_dapl.1.2 "svr1-ib0 0" ""
> OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/local//lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> OpenIB-scm1 u1.2 nonthreadsafe default /usr/local//lib64/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
> OpenIB-scm2 u1.2 nonthreadsafe default /usr/local//lib64/libdaplscm.so mv_dapl.1.2 "mthca0 2" ""

Where are you looking? The sample at https://openib.org/svn/gen2/trunk/src/userspace/dapl/doc/dat.conf doesn't point to 64-bit libraries.

> who is responsible to change this file with valid data?

The system administrator is supposed to edit this file with the correct values.

> for example:
> local IPs
> local HCAs and valid port numbers
>
>
> what is the meaning of the first word in each line (DAPL_PROVIDER?)?

The ia_name field (e.g. OpenIB-cma)? This is the name that the system administrator wishes the provider to use for its IA.
From halr at voltaire.com Mon Apr 10 13:29:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Apr 2006 16:29:55 -0400 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: <443A96F3.9000706@ichips.intel.com> References: <1144410107.19061.824.camel@hal.voltaire.com> <4436E507.7000309@ichips.intel.com> <1144667911.19061.52371.camel@hal.voltaire.com> <443A96F3.9000706@ichips.intel.com> Message-ID: <1144700991.19061.58524.camel@hal.voltaire.com> On Mon, 2006-04-10 at 13:33, Sean Hefty wrote: > Hal Rosenstock wrote: > >>This is my understanding of what needs to happen to support dual-sided RMPP. > >> > >>Node A sends an RMPP message. This requires normal RMPP processing. > >>Node A sends an ACK of the final ACK (I'll call ACK2), giving a new window. > >>Node B receives ACKs. > >>Node B sends the response. This requires normal RMPP processing. > >> > >> From the perspective of node A, the RMPP code only needs to know to send ACK2. > > > > There's more to the state machine in turning the direction around in > > terms of the sender becoming the receiver. I thought that this is the > > "harder" direction change. > > Can you describe what more is needed that what's listed above? I was referring to comparing the direction switch flows (Figure 181 p. 791) requires more than switch to DS in Figure 179 p. 787). > In terms of compliance, if node A is not, but node B is; Is = DS and not is not DS, right ? Just out of curiosity, where's the compliance for this ? What are you referring to here ? > Node B cannot send back the response until ACK2 is received. Since node A does > not understand dual-sided RMPP, it will not send ACK2. Node B will never send > the response. Correct. It would time out. But wouldn't it be better if the transaction were aborted with some explicit status for this ? > If node A is, but node B is not: > Node A will send ACK2, which node B should drop. 
Yes, figure 179 for receiver termination flow (IsDS false direction) shows that packet as discarded with an Abort (BadT) sent. > Node B will send an RMPP message assuming an initial window size of 1. > If node A had set the window > larger, it may delay the ACK of segment 1. Node B will eventually timeout and > resend segment 1. Most likely, this will cause node A to ACK segment 1, which > will update the window size at node B. I'm not following you on this part. And you just made me realize that the dual sidedness may be more than binary (based on your comment below on vendor class 2). > >> It can do this based on the method, or per transaction if directed by the client. > > > > Yes; I was thinking of class/method based approach for this. > > Currently, only a MultiPathRecord query requires this. Why not limit dual-sided > RMPP to _only_ this request? That would work for now. One future issue would be vendor class 2 needs here. > All other queries can just use an RMPP message one > direction, followed by an RMPP message in the other direction. I don't understand what you mean here. That's not the way it works from my understanding. If both the request and response are RMPP messages, isn't this dual sided ? > Beyond MultiPathRecord queries, aren't we talking about vendor specific queries anyway? Yes. > Personally, I'd vote for removing dual-sided RMPP completely from the spec. > It's of questionable benefit and complicates the implementation, but it's > probably a little too late for that. Couldn't we just keep from defining > anything else as "dual-sided"? I think the issue is turning things around but I'm not positive. I was wondering about this in a slightly different way: as to why the direction switch ? My initial foray into this area was to do just what you said: two single sided RMPP transfers in opposite direction. In my simple test case, the request was short (1 MAD) but that could be changed. 
I haven't figured out the reason for the turnaround ACK but I know the people who architected this although most have left the group and am quite confident that this wouldn't just be there if it weren't needed. (I'll eat my words later if necessary :-) > >>Node B is more complex. It must now wait for ACK2, using timeout and retries of > >>ACK1 until ACK2 is received. And the response that will be generated by the > >>client must be delayed until that ACK2 is received. > > > > > > Yes but isn't much of this already needed for the normal termination > > case or is that part not implemented yet ? > > No - ACK2 is a new message unique to dual-sided RMPP transfers (an ACK of an ACK). We're talking about Figure 179, right ? If so, most of that needs to be there already down to the Type decision (without the ACK direction implemented). Yes, ACK2 is new but this doesn't seem like much to add. The delay of the client response would also be "new" and that seems harder. > >> The only information from ACK2 that's needed when sending the > >>response is NewWindowLast. A client could be expected to give this back to the > >>RMPP layer when sending the response. (A client that lied about NewWindowLast > >>should only lead to sending some packets that would be dropped, with the > >>transaction aborted.) > > > > > > Good idea. That would eliminate the need for some context transfer from > > the receive side to the send side in the RMPP code itself. > > > > Leaving the NWL up to the client could have the effect you mentioned but > > this is known to the RMPP core and hence we needn't rely on the client > > for this. > > For the RMPP core to know what NWL is, it would need to track that information > between receiving a request and the generation of the reply. Since the RMPP > code can't trust the client to generate a reply, it would also need some sort of > timeout for how long the NWL information is valid. 
Tracking NWL by the client > seems trivial, whereas, it's substantially more complex for the RMPP core. Yes, that's the tradeoff. I agree it's way simpler to rely on the client for this than to implement it in the core. -- Hal > > - Sean
From jlentini at netapp.com Mon Apr 10 13:45:50 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 10 Apr 2006 16:45:50 -0400 (EDT) Subject: [openib-general] Question on : ib_reg_phys_mr() In-Reply-To: <309a667c0604072231h170d9dfar828db22ccc278ec3@mail.gmail.com> References: <309a667c0604070150sdf99ef7kfd81e2bbe45f8076@mail.gmail.com> <309a667c0604072231h170d9dfar828db22ccc278ec3@mail.gmail.com> Message-ID: On Sat, 8 Apr 2006, Devesh Sharma wrote: > In your nfs-rdma context what this function is supposed to do? It should create a memory region for the specified address range. For the exact semantics, see the IBTA spec's description of the REGISTER PHYSICAL MEMORY REGION verb (section 11.2.8.3 of the 1.2 spec). > I know that this function returns memory region, but what is the > difference from other mr returning functions? why get_dma_mr can't > be used? get_dma_mr() will return a memory region which covers all of physical memory. For security reasons, it is not always desirable to expose all of physical memory. ib_reg_phys_mr() allows for more fine grained access control.
From jlentini at netapp.com Mon Apr 10 13:47:59 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 10 Apr 2006 16:47:59 -0400 (EDT) Subject: [openib-general] Re: [uDAPLl] question about dapl_ib_cq_resize In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD2D2E@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD2D2E@mtlexch01.mtl.com> Message-ID: On Fri, 7 Apr 2006, Dotan Barak wrote: > Hi.
> > > I looked at the file: src/userspace/dapl/dapl/openib/dapl_ib_cq.c, > function: dapl_ib_cq_resize: > In this function, when one wants to resize a CQ, the dapl destroys the > old CQ and creates a new one instead of calling to the resize CQ verb > (which was added ~3 months ago), > > is there is a reason for this code? > (please notice that the current implementation of the resize CQ function > will fail if there are QPs that using this CQ). As Arlin noted, that verb was not available when the code was written. Do all the userspace hw libraries support this verb? From bos at pathscale.com Mon Apr 10 13:52:51 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 10 Apr 2006 13:52:51 -0700 Subject: [openib-general] Please don't commit without maintainer approval In-Reply-To: References: Message-ID: <1144702371.12804.7.camel@chalcedony.pathscale.com> On Mon, 2006-04-10 at 10:48 -0700, Shirley Ma wrote: > It's pretty confusing. When I looked at openib-1.0-rc2 release tree, > there is an > ipoib directory, which doesn't include IPoIB destructor patch. > /gen2/tags/openib-1.0-rc2/src/linux-kernel/infiniband/ulp/ipoib Please disregard that directory. I have deleted the linux-kernel tree from the 1.0 branch, and I will retag 1.0-rc2 and delete it there, too. References: <1144410107.19061.824.camel@hal.voltaire.com> <1144445589.19061.7520.camel@hal.voltaire.com> <1144667137.19061.52221.camel@hal.voltaire.com> <1144688215.19061.56014.camel@hal.voltaire.com> <1144698536.19061.58054.camel@hal.voltaire.com> Message-ID: <1144702202.19061.58753.camel@hal.voltaire.com> On Mon, 2006-04-10 at 16:23, Roland Dreier wrote: > Hal> It certainly is more deterministic if there is an ABI version > Hal> change in that the errors can be disambiguated better but is > Hal> it required if an ioctl is added or alternatively > Hal> rmpp_version bit is "stolen" ? > > I'm pretty confused. What would the ioctl do? Return a bit saying > whether or not dual-sided RMPP is supported? Yes. 
> What would happen if an > old app that didn't know about dual-sided RMPP ran and didn't perform > the ioctl? Nothing. An old app wouldn't do dual sided RMPP so it doesn't need to know. It's only a new app which might want to do DS RMPP which needs to know. > I guess this isn't really an ABI change -- old binaries continue to > run as long as they don't try dual-sided RMPP (which doesn't work now). > So maybe the answer is /sys/class/infiniband_mad/dual_sided_rmpp? > > An ioctl for something that is not attached to a file descriptor isn't > really the right thing. Yes, a file based approach is another way. Sean made me realize that this capability may not be so binary per machine due to vendor class 2. I need to think more on how this would be handled (burned into the RMPP core or a loadable kernel table seems like the only ways now; I would like something more flexible in the IBA spec (self identifying dual sided operations)). -- Hal > - R. From bugzilla-daemon at openib.org Mon Apr 10 14:21:58 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 10 Apr 2006 14:21:58 -0700 (PDT) Subject: [openib-general] [Bug 33] Ping fails on ib1 interface - IBED - RC3 Message-ID: <20060410212158.74E6F2283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=33 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Version|gen2 |1.0rc2 ------- Additional Comments From sweitzen at cisco.com 2006-04-10 14:21 ------- OF 1.0rc2 matches IBED 1.0rc3. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From pradeep at us.ibm.com Mon Apr 10 14:23:26 2006 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 10 Apr 2006 14:23:26 -0700 Subject: [openib-general] Re: [openfabrics-ewg] IBED-1.0-rc3 is available In-Reply-To: <443A8DC9.9070305@mellanox.co.il> Message-ID: I had a question related to this. 
IBED-1.0-rc3- has it gone through some minimal touch testing on the various platforms listed below, or has it been simply compiled on the indicated platforms (with the exceptions noted below). I did not see any such references to it in the FAQ. If it was tested, was it tested on any specific set of HCAs? How does one find out the bug fixes picked up in this release?

Pradeep
pradeep at us.ibm.com

Vladimir Sokolovsky
Sent by: openfabrics-ewg-bounces at openib.org
04/10/2006 09:54 AM
To: openfabrics-ewg at openib.org
cc: openib-general
Subject: [openfabrics-ewg] IBED-1.0-rc3 is available

Hi All,

We have prepared IBED 1.0 RC3.

Release location: *https://openib.org/svn/gen2/branches/1.0/ibed/releases*
File: IBED-1.0-rc3.tgz
md5sum: 8e143fd4b63646ebc9f5c9f73d18394b

*_BUILD_ID:_*
IBED-1.0-rc3:
OpenIB: openib_branch1.0-20060410-1551 (REV=6367)
Userspace SVN path: https://openib.org/svn/gen2/branches/1.0/src/userspace
IB Kernel modules SVN path: https://openib.org/svn/gen2/branches/1.0/ibed/tags/rc3/linux-kernel
MPI:
  openmpi-1.0.2a12-1
  mpi_osu-0.9.7-mlx2.1.0
  mpitests-1.0-0

*OSes:*
* RH EL4 up2: 2.6.9-22.ELsmp
* RH EL4 up3: 2.6.9-34.ELsmp
* Fedora C4: 2.6.11-1.1369_FC4
* SLES10 beta 7: 2.6.16-rc5-git9-2-smp
* SUSE 10 Pro: 2.6.13-15-smp
* kernel.org: 2.6.16

*Systems:*
* x86_64
* x86
* ia64
* ppc64

*Main changes from RC2:*
1. Added support in Rh EL4 up3
2. Added Open MPI package
3. OSU MPI is now based on 0.97 release (was 0.95 in RC2)
4. Added Pathscale (ipath) driver
5. Added uDapl
6. build based on the new method: Userlevel from openib branch 1.0 and kernel from openib trunk. (will be from the git in RC4)
7. Added ibutils package
8. Bug fixes

*Package limitations:*
1. iSER is working on SuSE SLES 10 Beta8 only
2. MPI OSU and Open MPI compilation fails on PPC64
3. uDAPL does not supported on RH EL4 (up2 and up3) since rdma_ucm module does not work on 2.6.9* kernels. If someone has a patch we will use it.
4. ipath driver compilation fails on RH EL4 and FedoraC4.
Please send me and Vlad any issue you encounter and testing results. Thanks Tziporet & Vlad _______________________________________________ openfabrics-ewg mailing list openfabrics-ewg at openib.org http://openib.org/mailman/listinfo/openfabrics-ewg
From Jerome at Mellanox.com Mon Apr 10 14:46:22 2006 From: Jerome at Mellanox.com (Jerome Taylor) Date: Mon, 10 Apr 2006 14:46:22 -0700 Subject: [openib-general] tcpdump command issue on IB network Message-ID: <1E3DCD1C63492545881FACB6063A57C10BB6B8@mtiexch01.mti.com> Dear developers and list subscribers, I ran into the following error message running the Linux tcpdump command on an Infiniband network. # tcpdump -i ib0 Tcpdump: ioctl: Value too large for defined data type. I am running the following configuration: - RedHat AS-4.0 U2, kernel-2.6.9-22.ELsmp - Gen2 driver (ib-verb-2.0.1-2.6.9_22.ELsmp; ib-ipoib-2.6.9-22.ELsmp) - tcpdump-3.9.4-3 Has anyone seen this issue before? Is this an issue with the kernel and support for the larger IB hardware address? Regards, Jerome Taylor ____________________ Mellanox Technologies, Inc. Voice: 978-640-0069 Fax: 978-640-0679 Mobile: 978-764-1269
From sweitzen at cisco.com Mon Apr 10 14:50:38 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 10 Apr 2006 14:50:38 -0700 Subject: [openib-general] tcpdump command issue on IB network Message-ID: I opened a bug on this a couple of months ago. https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=180980 Scott ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Jerome Taylor Sent: Monday, April 10, 2006 2:46 PM To: openib-general at openib.org Cc: Jerome Taylor Subject: [openib-general] tcpdump command issue on IB network Dear developers and list subscribers, I ran into the following error message running the Linux tcpdump command on an Infiniband network. # tcpdump -i ib0 Tcpdump: ioctl: Value too large for defined data type. I am running the following configuration: - RedHat AS-4.0 U2, kernel-2.6.9-22.ELsmp - Gen2 driver (ib-verb-2.0.1-2.6.9_22.ELsmp; ib-ipoib-2.6.9-22.ELsmp) - tcpdump-3.9.4-3 Has anyone seen this issue before? Is this an issue with the kernel and support for the larger IB hardware address? Regards, Jerome Taylor ____________________ Mellanox Technologies, Inc. Voice: 978-640-0069 Fax: 978-640-0679 Mobile: 978-764-1269
From robert.j.woodruff at intel.com Mon Apr 10 14:50:54 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 10 Apr 2006 14:50:54 -0700 Subject: [openib-general] tcpdump command issue on IB network Message-ID: <1AC79F16F5C5284499BB9591B33D6F000763BE93@orsmsx408> Jerome wrote, ># tcpdump -i ib0 >Tcpdump: ioctl: Value too large for defined data type. From what I remember, I think there may be several utilities that don't work with IPoIB since they changed the MAC address size.
woody From robert.j.woodruff at intel.com Mon Apr 10 14:54:05 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 10 Apr 2006 14:54:05 -0700 Subject: [openib-general] tcpdump command issue on IB network Message-ID: <1AC79F16F5C5284499BB9591B33D6F000763BEA8@orsmsx408> I am not sure this is a openib issue, but perhaps rather a bug in the utilities that assumed the size of a MAC address is 8 bytes. I am not sure if that is the case with this one, but I know it was the case in some of the user utilities. If this is the case, then the utility should be fixed. my 2 cents, woody ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Scott Weitzenkamp (sweitzen) Sent: Monday, April 10, 2006 2:51 PM To: Jerome Taylor; openib-general at openib.org Subject: RE: [openib-general] tcpdump command issue on IB network I opened a bug on this a couple of months ago. https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=180980 Scott ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Jerome Taylor Sent: Monday, April 10, 2006 2:46 PM To: openib-general at openib.org Cc: Jerome Taylor Subject: [openib-general] tcpdump command issue on IB network Dear developers and list subscribers, I ran into the following error message running the Linux tcpdump command on an Infiniband network. # tcpdump -i ib0 Tcpdump: ioctl: Value too large for defined data type. I am running the following configuration: - RedHat AS-4.0 U2, kernel-2.6.9-22.ELsmp - Gen2 driver (ib-verb-2.0.1-2.6.9_22.ELsmp; ib-ipoib-2.6.9-22.ELsmp) - tcpdump-3.9.4-3 Has anyone seen this issue before? Is this an issue with the kernel and support for the larger IB hardware address? Regards, Jerome Taylor ____________________ Mellanox Technologies, Inc. 
Voice: 978-640-0069 Fax: 978-640-0679 Mobile: 978-764-1269
From mshefty at ichips.intel.com Mon Apr 10 14:59:33 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 14:59:33 -0700 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: <1144700991.19061.58524.camel@hal.voltaire.com> References: <1144410107.19061.824.camel@hal.voltaire.com> <4436E507.7000309@ichips.intel.com> <1144667911.19061.52371.camel@hal.voltaire.com> <443A96F3.9000706@ichips.intel.com> <1144700991.19061.58524.camel@hal.voltaire.com> Message-ID: <443AD545.3010307@ichips.intel.com> Hal Rosenstock wrote: >>>>Node A sends an RMPP message. This requires normal RMPP processing. >>>>Node A sends an ACK of the final ACK (I'll call ACK2), giving a new window. >>>>Node B receives ACKs. >>>>Node B sends the response. This requires normal RMPP processing. >>>> >>>>From the perspective of node A, the RMPP code only needs to know to send ACK2. >>> >>>There's more to the state machine in turning the direction around in >>>terms of the sender becoming the receiver. I thought that this is the >>>"harder" direction change. >> >>Can you describe what more is needed that what's listed above? > > I was referring to comparing the direction switch flows (Figure 181 p. > 791) requires more than switch to DS in Figure 179 p. 787). It's still not clear to me what's missing from the sequence. Once node A sends ACK2, it should wait until either ACK1 is received again, or the response is received. Upon receiving ACK1, it can resend ACK2. I might not have mentioned this before, but I would have ACK2 carry a window size of 1, which lets us treat all received RMPP MADs the same. >>In terms of compliance, if node A is not, but node B is; > > Is = DS and not is not DS, right ? > > Just out of curiosity, where's the compliance for this ? What are you > referring to here ?
I mean that node A does not implement DS RMPP, but node B does. >>Node B cannot send back the response until ACK2 is received. Since node A does >>not understand dual-sided RMPP, it will not send ACK2. Node B will never send >>the response. > > Correct. It would time out. But wouldn't it be better if the transaction > were aborted with some explicit status for this ? Are you asking for an explicit status indicating that ACK2 was not received? I guess this could be added, but node B should not make any assumptions about the reason for the timeout, such as node A doesn't support DS RMPP. If node A doesn't support DS RMPP, I don't know that it should expect a MultiPathRecord query to work. >>If node A is, but node B is not: >>Node A will send ACK2, which node B should drop. > > Yes, figure 179 for receiver termination flow (IsDS false direction) > shows that packet as discarded with an Abort (BadT) sent. If ACK2 matches with the received request, then wouldn't that transaction be aborted? Does this mean that both nodes must either be DS RMPP compliant, or non-compliant for communication to work? >>Node B will send an RMPP message assuming an initial window size of 1. >>If node A had set the window >>larger, it may delay the ACK of segment 1. Node B will eventually timeout and >>resend segment 1. Most likely, this will cause node A to ACK segment 1, which >>will update the window size at node B. > > I'm not following you on this part. I'm just trying to determine what could happen if a non-compliant implementation tried talking to a compliant implementation. And now I'm leaning towards them being unable to communicate. >>>> It can do this based on the method, or per transaction if directed by the client. >>> >>>Yes; I was thinking of class/method based approach for this. >> >>Currently, only a MultiPathRecord query requires this. Why not limit dual-sided >>RMPP to _only_ this request? > > > That would work for now.
One future issue would be vendor class 2 needs > here. What I'm suggesting is that we limit "sender-initiated double-sided" RMPP transfers to only MultiPathRecord. Vendor class 2 would simply use two "sender-initiated transfers". >>All other queries can just use an RMPP message one >>direction, followed by an RMPP message in the other direction. > > > I don't understand what you mean here. That's not the way it works from > my understanding. If both the request and response are RMPP messages, > isn't this dual sided ? If it's a vendor defined MAD, can't we control the behavior and treat this as two Sender-Initiated Transfers? In 13.6.6.3, we have: It is also possible for a single transaction to involve an RMPP transfer sent in one direction followed by another RMPP transfer in the other direction... This *may* be accomplished as follows: My interpretation is that we're not restricted to using this. > I think the issue is turning things around but I'm not positive. I was > wondering about this in a slightly different way: as to why the > direction switch ? My initial foray into this area was to do just what > you said: two single sided RMPP transfers in opposite direction. In my > simple test case, the request was short (1 MAD) but that could be > changed. I haven't figured out the reason for the turnaround ACK but I > know the people who architected this although most have left the group > and am quite confident that this wouldn't just be there if it weren't > needed. (I'll eat my words later if necessary :-) (rant) IMO, the entire RMPP architecture is ridiculous. Segmentation and reassembly information is embedded in the middle of user data, with timeout constraints that would take a half dozen queries to calculate. So I'm not confident that this is needed at all. The only benefit that I see is that the initial window size could be larger than 1, which has a potential to provide for better latency. 
DS RMPP requires the same number of MADs as two single sided RMPP transfers, so even the potential gain seems fairly small. >>>>Node B is more complex. It must now wait for ACK2, using timeout and retries of >>>>ACK1 until ACK2 is received. And the response that will be generated by the >>>>client must be delayed until that ACK2 is received. >>> >>> >>>Yes but isn't much of this already needed for the normal termination >>>case or is that part not implemented yet ? >> >>No - ACK2 is a new message unique to dual-sided RMPP transfers (an ACK of an ACK). > > > We're talking about Figure 179, right ? If so, most of that needs to be > there already down to the Type decision (without the ACK direction > implemented). > > Yes, ACK2 is new but this doesn't seem like much to add. The delay of > the client response would also be "new" and that seems harder. I agree. Adding in ACK2 shouldn't be that difficult, but does require knowing if a given transaction (class/method) uses DS RMPP. The delay on the send-side is already there, in waiting for the response. The timeout of the RMPP context on the receive side is where the difficulty lies, but I think we can avoid this difficulty simply by passing NWL up to the client, and having them return it on the response. If we want to support DS RMPP for more than just MultiPathRecord, it seems that we need some sort of class/method mapping, which would require changing the kernel MAD API. - Sean From bugzilla-daemon at openib.org Mon Apr 10 15:34:18 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 10 Apr 2006 15:34:18 -0700 (PDT) Subject: [openib-general] [Bug 33] Ping fails on ib1 interface - IBED - RC3 Message-ID: <20060410223418.54AFC22847E@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=33 ------- Additional Comments From xma at us.ibm.com 2006-04-10 15:34 ------- I have seen the same problem on different kernel. It seems an ARP packet has been silently dropped somehow. 
------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From sean.hefty at intel.com Mon Apr 10 17:12:47 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 17:12:47 -0700 Subject: [openib-general] [PATCH] git: updates to rdma_cm branch Message-ID: Here's a single combined patch to update the rdma_cm git branch. Hopefully, this is what you had in mind; otherwise, let me know if you want help separating these out. And feel free to cherry pick through these, since some of the changes pertain to iWarp support, or are minor formatting changes. - Sean ib_addr updates: --- Allow resolving from the loopback address as the source address to another local address as the destination. This allows connections from 127.0.0.1 to another local address. Without this fix, the code will attempt to find a net device with the loopback address assigned to it, which will fail. Signed-off-by: Sean Hefty --- Change to support non-IB RDMA devices. - Remove filter for ARPHRD_INFINIBAND in addr_arp_recv Signed-off-by: Tom Tucker Signed-off-by: Sean Hefty --- Add support for devices that do ARP internally. Signed-off-by: Tom Tucker Signed-off-by: Sean Hefty --- rdma_cm/rdma_ucm updates: --- SDP hello header should be preceded by the base sockets direct header. Signed-off-by: Sean Hefty --- Fix deadlock condition caused by destroying an IB CM ID from within a MAD callback thread (SA query callback). See note from Michael Tsirkin about this bug: A ULP requests address resolution; on success requests route resolution; route resolution succeeds; inside the callback ULP requests rdma_connect. Now, a failure (e.g. out of memory) occurs at ULP level and so it decides to destroy the ID. To this end it returns a failure code from the route callback. Note that the route resolution callback runs in the per-port MAD workqueue. Now, CMA will call rdma_destroy_id to destroy the ID. Since CM ID exists, it will try to destroy it.
This might deadlock: since a CM MAD (REQ) has been created, CM ID destroy will now block, waiting for the MAD to be freed, but MADs might not complete since we are blocking the MAD workqueue. Fix condition by queuing SA query callbacks to the CMA's thread. Signed-off-by: Sean Hefty --- If the user calls rdma_bind_addr(), but specifies either a zero IP address or the loopback address, bind will succeed, but the cm_id will not be attached to an RDMA device. This will result in a failure if rdma_resolve_addr is called. Fix rdma_resolve_addr(), so that it handles this condition properly. To correct this, rdma_resolve_addr() calls rdma_bind_addr() if it has not already been called by the user. A minor correction to rdma_bind_addr() was made to better handle binding to a zero or loopback IP address. Also fix the userspace interface to allow querying address information after binding to the zero or loopback address. This breaks the behavior of the current ABI, so bump the ABI version, and add the proper support to allow checking the kernel ABI version from the userspace library. Signed-off-by: Sean Hefty --- Interpret any ZERONET/LOOPBACK address as INADDR_ANY. Make cma_loopback_addr match any address in LOOPBACK subnet. Signed-off-by: Michael S. Tsirkin Signed-off-by: Sean Hefty --- diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index b0353f8..810fdd5 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -27,8 +27,11 @@ * notice, one of the license notices in the documentation * and/or other materials provided with the distribution.
*/ + +#include #include #include +#include #include #include #include @@ -155,15 +158,21 @@ static int addr_resolve_remote(struct so if (ret) goto out; + /* If the device does ARP internally, return 'done' */ + if (rt->idev->dev->flags & IFF_NOARP) { + copy_addr(addr, rt->idev->dev, NULL); + goto put; + } + neigh = neigh_lookup(&arp_tbl, &rt->rt_gateway, rt->idev->dev); if (!neigh) { ret = -ENODATA; - goto err1; + goto put; } if (!(neigh->nud_state & NUD_VALID)) { ret = -ENODATA; - goto err2; + goto release; } if (!src_ip) { @@ -172,9 +181,9 @@ static int addr_resolve_remote(struct so } ret = copy_addr(addr, neigh->dev, neigh->ha); -err2: +release: neigh_release(neigh); -err1: +put: ip_rt_put(rt); out: return ret; @@ -232,10 +241,14 @@ static int addr_resolve_local(struct soc if (!dev) return -EADDRNOTAVAIL; - if (!src_ip) { + if (ZERONET(src_ip)) { src_in->sin_family = dst_in->sin_family; src_in->sin_addr.s_addr = dst_ip; ret = copy_addr(addr, dev, dev->dev_addr); + } else if (LOOPBACK(src_ip)) { + ret = rdma_translate_ip((struct sockaddr *)dst_in, addr); + if (!ret) + memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN); } else { ret = rdma_translate_ip((struct sockaddr *)src_in, addr); if (!ret) @@ -320,9 +333,8 @@ static int addr_arp_recv(struct sk_buff arp_hdr = (struct arphdr *) skb->nh.raw; - if (dev->type == ARPHRD_INFINIBAND && - (arp_hdr->ar_op == __constant_htons(ARPOP_REQUEST) || - arp_hdr->ar_op == __constant_htons(ARPOP_REPLY))) + if (arp_hdr->ar_op == __constant_htons(ARPOP_REQUEST) || + arp_hdr->ar_op == __constant_htons(ARPOP_REPLY)) set_timeout(jiffies); kfree_skb(skb); diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index b3a9623..b6f8c84 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -28,6 +28,7 @@ * and/or other materials provided with the distribution. 
* */ + #include #include #include @@ -103,7 +104,9 @@ struct rdma_id_private { int timeout_ms; struct ib_sa_query *query; int query_id; - struct ib_cm_id *cm_id; + union { + struct ib_cm_id *ib; + } cm_id; u32 seq_num; u32 qp_num; @@ -114,6 +117,9 @@ struct rdma_id_private { struct cma_work { struct work_struct work; struct rdma_id_private *id; + enum cma_state old_state; + enum cma_state new_state; + struct rdma_cm_event event; }; union cma_ip_addr { @@ -133,6 +139,7 @@ struct cma_hdr { }; struct sdp_hh { + u8 bsdh[16]; u8 sdp_version; u8 ip_version; /* IP version: 7:4 */ u8 sdp_specific1[10]; @@ -413,7 +420,7 @@ int rdma_init_qp_attr(struct rdma_cm_id id_priv = container_of(id, struct rdma_id_private, id); switch (id_priv->id.device->node_type) { case IB_NODE_CA: - ret = ib_cm_init_qp_attr(id_priv->cm_id, qp_attr, + ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr, qp_attr_mask); if (qp_attr->qp_state == IB_QPS_RTR) qp_attr->rq_psn = id_priv->seq_num; @@ -427,13 +434,12 @@ int rdma_init_qp_attr(struct rdma_cm_id } EXPORT_SYMBOL(rdma_init_qp_attr); -static inline int cma_any_addr(struct sockaddr *addr) +static inline int cma_zero_addr(struct sockaddr *addr) { struct in6_addr *ip6; if (addr->sa_family == AF_INET) - return ((struct sockaddr_in *) addr)->sin_addr.s_addr == - INADDR_ANY; + return ZERONET(((struct sockaddr_in *) addr)->sin_addr.s_addr); else { ip6 = &((struct sockaddr_in6 *) addr)->sin6_addr; return (ip6->s6_addr32[0] | ip6->s6_addr32[1] | @@ -443,8 +449,12 @@ static inline int cma_any_addr(struct so static inline int cma_loopback_addr(struct sockaddr *addr) { - return ((struct sockaddr_in *) addr)->sin_addr.s_addr == - ntohl(INADDR_LOOPBACK); + return LOOPBACK(((struct sockaddr_in *) addr)->sin_addr.s_addr); +} + +static inline int cma_any_addr(struct sockaddr *addr) +{ + return cma_zero_addr(addr) || cma_loopback_addr(addr); } static int cma_get_net_info(void *hdr, enum rdma_port_space ps, @@ -540,7 +550,8 @@ static void cma_cancel_route(struct 
rdma { switch (id_priv->id.device->node_type) { case IB_NODE_CA: - ib_sa_cancel_query(id_priv->query_id, id_priv->query); + if (id_priv->query) + ib_sa_cancel_query(id_priv->query_id, id_priv->query); break; default: break; @@ -557,12 +568,18 @@ static void cma_destroy_listen(struct rd { cma_exch(id_priv, CMA_DESTROYING); - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) - ib_destroy_cm_id(id_priv->cm_id); - - list_del(&id_priv->listen_list); - if (id_priv->cma_dev) + if (id_priv->cma_dev) { + switch (id_priv->id.device->node_type) { + case IB_NODE_CA: + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) + ib_destroy_cm_id(id_priv->cm_id.ib); + break; + default: + break; + } cma_detach_from_dev(id_priv); + } + list_del(&id_priv->listen_list); atomic_dec(&id_priv->refcount); wait_event(id_priv->wait, !atomic_read(&id_priv->refcount)); @@ -614,11 +631,16 @@ void rdma_destroy_id(struct rdma_cm_id * state = cma_exch(id_priv, CMA_DESTROYING); cma_cancel_operation(id_priv, state); - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) - ib_destroy_cm_id(id_priv->cm_id); - if (id_priv->cma_dev) { - mutex_lock(&lock); + switch (id->device->node_type) { + case IB_NODE_CA: + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) + ib_destroy_cm_id(id_priv->cm_id.ib); + break; + default: + break; + } + mutex_lock(&lock); cma_detach_from_dev(id_priv); mutex_unlock(&lock); } @@ -643,14 +665,14 @@ static int cma_rep_recv(struct rdma_id_p if (ret) goto reject; - ret = ib_send_cm_rtu(id_priv->cm_id, NULL, 0); + ret = ib_send_cm_rtu(id_priv->cm_id.ib, NULL, 0); if (ret) goto reject; return 0; reject: cma_modify_qp_err(&id_priv->id); - ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); return ret; } @@ -666,7 +688,7 @@ static int cma_rtu_recv(struct rdma_id_p return 0; reject: cma_modify_qp_err(&id_priv->id); - ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + 
ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); return ret; } @@ -727,7 +749,7 @@ static int cma_ib_handler(struct ib_cm_i private_data_len); if (ret) { /* Destroy the CM ID by returning a non-zero value. */ - id_priv->cm_id = NULL; + id_priv->cm_id.ib = NULL; cma_exch(id_priv, CMA_DESTROYING); cma_release_remove(id_priv); rdma_destroy_id(&id_priv->id); @@ -809,7 +831,7 @@ static int cma_req_handler(struct ib_cm_ goto out; } - conn_id->cm_id = cm_id; + conn_id->cm_id.ib = cm_id; cm_id->context = conn_id; cm_id->cm_handler = cma_ib_handler; @@ -819,7 +841,7 @@ static int cma_req_handler(struct ib_cm_ IB_CM_REQ_PRIVATE_DATA_SIZE - offset); if (ret) { /* Destroy the CM ID by returning a non-zero value. */ - conn_id->cm_id = NULL; + conn_id->cm_id.ib = NULL; cma_exch(conn_id, CMA_DESTROYING); cma_release_remove(conn_id); rdma_destroy_id(&conn_id->id); @@ -871,23 +893,23 @@ static int cma_ib_listen(struct rdma_id_ __be64 svc_id; int ret; - id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_req_handler, - id_priv); - if (IS_ERR(id_priv->cm_id)) - return PTR_ERR(id_priv->cm_id); + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, cma_req_handler, + id_priv); + if (IS_ERR(id_priv->cm_id.ib)) + return PTR_ERR(id_priv->cm_id.ib); addr = &id_priv->id.route.addr.src_addr; svc_id = cma_get_service_id(id_priv->id.ps, addr); if (cma_any_addr(addr)) - ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, NULL); + ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, NULL); else { cma_set_compare_data(addr, &compare_data); - ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, &compare_data); + ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, &compare_data); } if (ret) { - ib_destroy_cm_id(id_priv->cm_id); - id_priv->cm_id = NULL; + ib_destroy_cm_id(id_priv->cm_id.ib); + id_priv->cm_id.ib = NULL; } return ret; @@ -999,44 +1021,25 @@ EXPORT_SYMBOL(rdma_listen); static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, void *context) { - 
struct rdma_id_private *id_priv = context; - struct rdma_route *route = &id_priv->id.route; - enum rdma_cm_event_type event = RDMA_CM_EVENT_ROUTE_RESOLVED; + struct cma_work *work = context; + struct rdma_route *route; - atomic_inc(&id_priv->dev_remove); - if (!status) { - route->path_rec = kmalloc(sizeof *route->path_rec, GFP_KERNEL); - if (route->path_rec) { - route->num_paths = 1; - *route->path_rec = *path_rec; - if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, - CMA_ROUTE_RESOLVED)) { - kfree(route->path_rec); - goto out; - } - } else - status = -ENOMEM; - } + route = &work->id->id.route; - if (status) { - if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED)) - goto out; - event = RDMA_CM_EVENT_ROUTE_ERROR; + if (!status) { + route->num_paths = 1; + *route->path_rec = *path_rec; + } else { + work->old_state = CMA_ROUTE_QUERY; + work->new_state = CMA_ADDR_RESOLVED; + work->event.event = RDMA_CM_EVENT_ROUTE_ERROR; } - if (cma_notify_user(id_priv, event, status, NULL, 0)) { - cma_exch(id_priv, CMA_DESTROYING); - cma_release_remove(id_priv); - cma_deref_id(id_priv); - rdma_destroy_id(&id_priv->id); - return; - } -out: - cma_release_remove(id_priv); - cma_deref_id(id_priv); + queue_work(cma_wq, &work->work); } -static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) +static int cma_query_ib_route(struct rdma_id_private *id_priv, int timeout_ms, + struct cma_work *work) { struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; struct ib_sa_path_rec path_rec; @@ -1052,11 +1055,68 @@ static int cma_resolve_ib_route(struct r IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, timeout_ms, GFP_KERNEL, - cma_query_handler, id_priv, &id_priv->query); - + cma_query_handler, work, &id_priv->query); + return (id_priv->query_id < 0) ? 
id_priv->query_id : 0; } +static void cma_work_handler(void *data) +{ + struct cma_work *work = data; + struct rdma_id_private *id_priv = work->id; + int destroy = 0; + + atomic_inc(&id_priv->dev_remove); + if (!cma_comp_exch(id_priv, work->old_state, work->new_state)) + goto out; + + if (id_priv->id.event_handler(&id_priv->id, &work->event)) { + cma_exch(id_priv, CMA_DESTROYING); + destroy = 1; + } +out: + cma_release_remove(id_priv); + cma_deref_id(id_priv); + if (destroy) + rdma_destroy_id(&id_priv->id); + kfree(work); +} + +static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) +{ + struct rdma_route *route = &id_priv->id.route; + struct cma_work *work; + int ret; + + work = kzalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + work->id = id_priv; + INIT_WORK(&work->work, cma_work_handler, work); + work->old_state = CMA_ROUTE_QUERY; + work->new_state = CMA_ROUTE_RESOLVED; + work->event.event = RDMA_CM_EVENT_ROUTE_RESOLVED; + + route->path_rec = kmalloc(sizeof *route->path_rec, GFP_KERNEL); + if (!route->path_rec) { + ret = -ENOMEM; + goto err1; + } + + ret = cma_query_ib_route(id_priv, timeout_ms, work); + if (ret) + goto err2; + + return 0; +err2: + kfree(route->path_rec); + route->path_rec = NULL; +err1: + kfree(work); + return ret; +} + int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) { struct rdma_id_private *id_priv; @@ -1122,18 +1182,13 @@ static void addr_handler(int status, str { struct rdma_id_private *id_priv = context; enum rdma_cm_event_type event; - enum cma_state old_state; atomic_inc(&id_priv->dev_remove); - if (!id_priv->cma_dev) { - old_state = CMA_IDLE; - if (!status) - status = cma_acquire_dev(id_priv); - } else - old_state = CMA_ADDR_BOUND; + if (!id_priv->cma_dev && !status) + status = cma_acquire_dev(id_priv); if (status) { - if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, old_state)) + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_BOUND)) goto out; event = RDMA_CM_EVENT_ADDR_ERROR; 
} else { @@ -1156,54 +1211,37 @@ out: cma_deref_id(id_priv); } -static void loopback_addr_handler(void *data) -{ - struct cma_work *work = data; - struct rdma_id_private *id_priv = work->id; - - kfree(work); - atomic_inc(&id_priv->dev_remove); - - if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) - goto out; - - if (cma_notify_user(id_priv, RDMA_CM_EVENT_ADDR_RESOLVED, 0, NULL, 0)) { - cma_exch(id_priv, CMA_DESTROYING); - cma_release_remove(id_priv); - cma_deref_id(id_priv); - rdma_destroy_id(&id_priv->id); - return; - } -out: - cma_release_remove(id_priv); - cma_deref_id(id_priv); -} - -static int cma_resolve_loopback(struct rdma_id_private *id_priv, - struct sockaddr *src_addr, enum cma_state state) +static int cma_resolve_loopback(struct rdma_id_private *id_priv) { struct cma_work *work; - struct rdma_dev_addr *dev_addr; + struct sockaddr_in *src_in, *dst_in; int ret; - work = kmalloc(sizeof *work, GFP_KERNEL); + work = kzalloc(sizeof *work, GFP_KERNEL); if (!work) return -ENOMEM; - if (state == CMA_IDLE) { + if (!id_priv->cma_dev) { ret = cma_bind_loopback(id_priv); if (ret) goto err; - dev_addr = &id_priv->id.route.addr.dev_addr; - ib_addr_set_dgid(dev_addr, ib_addr_get_sgid(dev_addr)); - if (!src_addr || cma_any_addr(src_addr)) - src_addr = &id_priv->id.route.addr.dst_addr; - memcpy(&id_priv->id.route.addr.src_addr, src_addr, - ip_addr_size(src_addr)); + } + + ib_addr_set_dgid(&id_priv->id.route.addr.dev_addr, + ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr)); + + if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) { + src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr; + dst_in = (struct sockaddr_in *)&id_priv->id.route.addr.dst_addr; + src_in->sin_family = dst_in->sin_family; + src_in->sin_addr.s_addr = dst_in->sin_addr.s_addr; } work->id = id_priv; - INIT_WORK(&work->work, loopback_addr_handler, work); + INIT_WORK(&work->work, cma_work_handler, work); + work->old_state = CMA_ADDR_QUERY; + work->new_state = CMA_ADDR_RESOLVED; 
+ work->event.event = RDMA_CM_EVENT_ADDR_RESOLVED; queue_work(cma_wq, &work->work); return 0; err: @@ -1211,29 +1249,42 @@ err: return ret; } +static int cma_bind_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, + struct sockaddr *dst_addr) +{ + struct sockaddr_in addr_in; + + if (src_addr && src_addr->sa_family) + return rdma_bind_addr(id, src_addr); + else { + memset(&addr_in, 0, sizeof addr_in); + addr_in.sin_family = dst_addr->sa_family; + return rdma_bind_addr(id, (struct sockaddr *) &addr_in); + } +} + int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, struct sockaddr *dst_addr, int timeout_ms) { struct rdma_id_private *id_priv; - enum cma_state expected_state; int ret; id_priv = container_of(id, struct rdma_id_private, id); - if (id_priv->cma_dev) { - expected_state = CMA_ADDR_BOUND; - src_addr = &id->route.addr.src_addr; - } else - expected_state = CMA_IDLE; + if (id_priv->state == CMA_IDLE) { + ret = cma_bind_addr(id, src_addr, dst_addr); + if (ret) + return ret; + } - if (!cma_comp_exch(id_priv, expected_state, CMA_ADDR_QUERY)) + if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_ADDR_QUERY)) return -EINVAL; atomic_inc(&id_priv->refcount); memcpy(&id->route.addr.dst_addr, dst_addr, ip_addr_size(dst_addr)); if (cma_loopback_addr(dst_addr)) - ret = cma_resolve_loopback(id_priv, src_addr, expected_state); + ret = cma_resolve_loopback(id_priv); else - ret = rdma_resolve_ip(src_addr, dst_addr, + ret = rdma_resolve_ip(&id->route.addr.src_addr, dst_addr, &id->route.addr.dev_addr, timeout_ms, addr_handler, id_priv); if (ret) @@ -1241,7 +1292,7 @@ int rdma_resolve_addr(struct rdma_cm_id return 0; err: - cma_comp_exch(id_priv, CMA_ADDR_QUERY, expected_state); + cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_BOUND); cma_deref_id(id_priv); return ret; } @@ -1260,11 +1311,9 @@ int rdma_bind_addr(struct rdma_cm_id *id if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_BOUND)) return -EINVAL; - if (cma_any_addr(addr)) { + if (cma_any_addr(addr)) ret 
= 0; - } else if (cma_loopback_addr(addr)) { - ret = cma_bind_loopback(id_priv); - } else { + else { dev_addr = &id->route.addr.dev_addr; ret = rdma_translate_ip(addr, dev_addr); if (!ret) @@ -1331,10 +1380,10 @@ static int cma_connect_ib(struct rdma_id memcpy(private_data + offset, conn_param->private_data, conn_param->private_data_len); - id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler, - id_priv); - if (IS_ERR(id_priv->cm_id)) { - ret = PTR_ERR(id_priv->cm_id); + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, cma_ib_handler, + id_priv); + if (IS_ERR(id_priv->cm_id.ib)) { + ret = PTR_ERR(id_priv->cm_id.ib); goto out; } @@ -1361,7 +1410,7 @@ static int cma_connect_ib(struct rdma_id req.max_cm_retries = CMA_MAX_CM_RETRIES; req.srq = id_priv->srq ? 1 : 0; - ret = ib_send_cm_req(id_priv->cm_id, &req); + ret = ib_send_cm_req(id_priv->cm_id.ib, &req); out: kfree(private_data); return ret; @@ -1423,7 +1472,7 @@ static int cma_accept_ib(struct rdma_id_ rep.rnr_retry_count = conn_param->rnr_retry_count; rep.srq = id_priv->srq ? 1 : 0; - return ib_send_cm_rep(id_priv->cm_id, &rep); + return ib_send_cm_rep(id_priv->cm_id.ib, &rep); } int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) @@ -1476,8 +1525,9 @@ int rdma_reject(struct rdma_cm_id *id, c switch (id->device->node_type) { case IB_NODE_CA: - ret = ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, - NULL, 0, private_data, private_data_len); + ret = ib_send_cm_rej(id_priv->cm_id.ib, + IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, + private_data, private_data_len); break; default: ret = -ENOSYS; @@ -1503,8 +1553,8 @@ int rdma_disconnect(struct rdma_cm_id *i switch (id->device->node_type) { case IB_NODE_CA: /* Initiate or respond to a disconnect. 
*/ - if (ib_send_cm_dreq(id_priv->cm_id, NULL, 0)) - ib_send_cm_drep(id_priv->cm_id, NULL, 0); + if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) + ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); break; default: break; diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index 7b29d96..3e21f10 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -30,6 +30,7 @@ * SOFTWARE. */ +#include #include #include #include @@ -464,19 +465,18 @@ static ssize_t ucma_query_route(struct u if (IS_ERR(ctx)) return PTR_ERR(ctx); - if (!ctx->cm_id->device) { - ret = -ENODEV; - goto out; - } - + memset(&resp, 0, sizeof resp); addr = &ctx->cm_id->route.addr.src_addr; memcpy(&resp.src_addr, addr, addr->sa_family == AF_INET ? - sizeof(struct sockaddr_in) : + sizeof(struct sockaddr_in) : sizeof(struct sockaddr_in6)); addr = &ctx->cm_id->route.addr.dst_addr; memcpy(&resp.dst_addr, addr, addr->sa_family == AF_INET ? - sizeof(struct sockaddr_in) : + sizeof(struct sockaddr_in) : sizeof(struct sockaddr_in6)); + if (!ctx->cm_id->device) + goto out; + resp.node_guid = ctx->cm_id->device->node_guid; resp.port_num = ctx->cm_id->port_num; switch (ctx->cm_id->device->node_type) { @@ -486,11 +486,11 @@ static ssize_t ucma_query_route(struct u break; } +out: if (copy_to_user((void __user *)(unsigned long)cmd.response, &resp, sizeof(resp))) ret = -EFAULT; -out: ucma_put_ctx(ctx); return ret; } @@ -773,13 +773,36 @@ static struct miscdevice ucma_misc = { .fops = &ucma_fops, }; +static ssize_t show_abi_version(struct class_device *class_dev, char *buf) +{ + return sprintf(buf, "%d\n", RDMA_USER_CM_ABI_VERSION); +} +static CLASS_DEVICE_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); + static int __init ucma_init(void) { - return misc_register(&ucma_misc); + int ret; + + ret = misc_register(&ucma_misc); + if (ret) + return ret; + + ret = class_device_create_file(ucma_misc.class, + &class_device_attr_abi_version); + if (ret) { + printk(KERN_ERR 
"rdma_ucm: couldn't create abi_version attr\n"); + goto err; + } + return 0; +err: + misc_deregister(&ucma_misc); + return ret; } static void __exit ucma_cleanup(void) { + class_device_remove_file(ucma_misc.class, + &class_device_attr_abi_version); misc_deregister(&ucma_misc); idr_destroy(&ctx_idr); } diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h index c9f0827..2e56f25 100644 --- a/include/rdma/rdma_cm.h +++ b/include/rdma/rdma_cm.h @@ -114,7 +114,7 @@ struct rdma_cm_id { * @context: User specified context associated with the id. * @ps: RDMA port space. */ -struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler event_handler, +struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler, void *context, enum rdma_port_space ps); void rdma_destroy_id(struct rdma_cm_id *id); diff --git a/include/rdma/rdma_user_cm.h b/include/rdma/rdma_user_cm.h index 1c38f03..63381e1 100644 --- a/include/rdma/rdma_user_cm.h +++ b/include/rdma/rdma_user_cm.h @@ -38,7 +38,7 @@ #include #include -#define RDMA_USER_CM_ABI_VERSION 1 +#define RDMA_USER_CM_ABI_VERSION 1 #define RDMA_MAX_PRIVATE_DATA 256 From mshefty at ichips.intel.com Mon Apr 10 18:10:02 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Apr 2006 18:10:02 -0700 Subject: [openib-general] [RFC] [PATCH 2/2 v2] ipoib: convert to use new multicast interface In-Reply-To: References: Message-ID: <443B01EA.7040700@ichips.intel.com> FYI. If I run with this code, I get the following error message. kernel: ib0: multicast join failed for ff12:..., status -22 I think I know the issue, but I'll have to wait until tomorrow to fix it. 
- Sean From devesh28 at gmail.com Mon Apr 10 22:22:03 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Tue, 11 Apr 2006 10:52:03 +0530 Subject: [openib-general] SDP Memory management In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEE8C1@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEE8C1@mtlexch01.mtl.com> Message-ID: <309a667c0604102222n2e21c8dn1a183a0cca93f52d@mail.gmail.com> Hi Dror, Thanks for replying. I still have a few questions: On 4/11/06, Dror Goldenberg wrote: > > > >From: Devesh Sharma > >Sent: Monday, April 10, 2006 4:48 PM > > > >a) What is the concept of FMR? > > > > Fast memory registration. It is an optimization of the HCA to perform > registration faster than regular memory region. > > >b) In absence of FMR support what is the buffer management scheme for > Z-Copy? > >Will SDP work without FMR? > > I think that the easiest way out is to implement FMR in the HCA provider > using regular MRs. SDP strongly assumes FMR support in today's > implementation. Still, if some provider does not emulate FMR using regular MRs, will SDP work? Haven't looked deeply into it, but I think that ehca has > done this. >c) How page locking and virtual to physical address conversion is done > for Z- > >Copy buffers? > > Take a look at sdp_iocb.c In short, I wanted to know whether, at the driver level, the address that arrives in ib_sge.addr is a physical address. Are page locking and virtual-to-physical address translation taken care of by the SDP module? >Devesh > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sweitzen at cisco.com Mon Apr 10 23:43:05 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 10 Apr 2006 23:43:05 -0700 Subject: [openib-general] Would like to change Infiniband to InfiniBand Message-ID: Any objections to me fixing all the occurrences of Infiniband in gen2/trunk? Scott -------------- next part -------------- An HTML attachment was scrubbed...
URL: From ogerlitz at voltaire.com Tue Apr 11 00:10:54 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 11 Apr 2006 10:10:54 +0300 Subject: [openib-general] Re: problems cloning infiniband.git In-Reply-To: References: Message-ID: <443B567E.7000005@voltaire.com> Roland Dreier wrote: > Or> Hi Roland, I have problems cloning your git tree, is it an > Or> issue on my side? > > I was able to reproduce it but I can't explain it. I can't find any > trace of the "static_rate" branch in my tree on kernel.org. Maybe > mirrors haven't been updated completely? So I am downloading from a mirror of kernel.org, and it might work with a non-mirror? What would be the URL to download it from kernel.org? All the ones I've tried to derive from http://kernel.org/git/?p=linux/kernel/git/roland/infiniband.git, which is the URL used by my browser, have failed (see below). Or. $ git clone http://kernel.org/git/?p=linux/kernel/git/roland/infiniband.git Cannot get remote repository information. Perhaps git-update-server-info needs to be run there? $ git clone http://kernel.org/kernel/git/roland/infiniband.git Cannot get remote repository information. Perhaps git-update-server-info needs to be run there? $ git clone http://kernel.org/git/roland/infiniband.git infiniband Cannot get remote repository information. Perhaps git-update-server-info needs to be run there? From sweitzen at cisco.com Tue Apr 11 00:30:23 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 11 Apr 2006 00:30:23 -0700 Subject: [openib-general] IBED-1.0-rc3 installer feedback Message-ID: Here's some feedback on installation. Should I file bugs/enhancements in bugzilla for these? 0) build.sh does not compile Open MPI, forcing me to run install.sh to compile Open MPI. This makes it harder to set up a build server used to just compile the code for installation elsewhere. 1) Too many references to Mellanox in the docs.
# fgrep Mellanox README.txt docs/IBED_Installation_Guide.txt README.txt:Mellanox IBED Distribution v1.0 for Linux README.txt:This is the Mellanox InfiniBand Distribution (IBED) ver. 1.0 software package. README.txt:1) Server platform with InfiniBand HCA (see Mellanox IBED Distribution Release Notes for details) README.txt:2) Linux OS (see Mellanox IBED Distribution Release Notes for details) README.txt: o Firmware for Mellanox's switch and HCA products docs/IBED_Installation_Guide.txt:Mellanox IBED Distribution Installation Guide docs/IBED_Installation_Guide.txt: 2. Contents of the Mellanox IBED Distribution docs/IBED_Installation_Guide.txt:2. Contents of the Mellanox IBED Distribution docs/IBED_Installation_Guide.txt: o Firmware for Mellanox's switch and HCA products docs/IBED_Installation_Guide.txt: "Mellanox IBED Release Notes". 2) When I run install.sh or build.sh and tell it to compile both MPIs, I get the same questions twice. I assume one is for MVAPICH and one for Open MPI, but this needs to be clearer: The following compiler(s) on your system can be used to build/install MPI: gcc Do you wish to create/install an MPI RPM with gcc? [Y/n]: The following compiler(s) will be used to build the MPI RPM(s): gcc The following compiler(s) on your system can be used to build/install MPI: gcc Do you wish to create/install an MPI RPM with gcc? [Y/n]: 3) It would be nice if install.sh asked me if I wanted to configure an IP address, rather than forcing me to. The default IPoIB interface configuration is based on a LAN interface configura. You may change this default configuration in the following steps. Enter LAN interface to be used for setting ib0 interface [eth0]: 4) I would like to see /etc/infiniband/ifcfg-ib0 be in /etc/sysconfig/network-scripts. 5) I would like to see entries for ipoib and sdp in /etc/modprobe.conf. 6) It would be nice if install.sh offered to set up /etc/security/limits.conf.
7) If I run install.sh on one machine, then install the resulting RPMS on a different machine, I get slightly different sets of files installed on the two machines in /usr/local/ibed: < ./BUILD_ID < ./LICENSE < ./README.txt 70,71d66 < ./docs < ./docs/IBED_Installation_Guide.txt 74d68 < ./ibed.conf 114a109 > ./include/infiniband/common.h 165a161 > ./include/infiniband/mad.h 178a175 > ./include/infiniband/umad.h 260a258 > ./lib64/libibcommon.so 263a262 > ./lib64/libibmad.so 266a266 > ./lib64/libibumad.so 1425d1424 < ./uninstall.sh > -----Original Message----- > From: openfabrics-ewg-bounces at openib.org > [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of > Vladimir Sokolovsky > Sent: Monday, April 10, 2006 9:55 AM > To: openfabrics-ewg at openib.org > Cc: openib-general > Subject: [openfabrics-ewg] IBED-1.0-rc3 is available > > Hi All, > We have prepared IBED 1.0 RC3. > Release location: > *https://openib.org/svn/gen2/branches/1.0/ibed/releases* > File: IBED-1.0-rc3.tgz > md5sum: 8e143fd4b63646ebc9f5c9f73d18394b > > *_BUILD_ID:_* > IBED-1.0-rc3: > OpenIB: > openib_branch1.0-20060410-1551 (REV=6367) > Userspace SVN path: > https://openib.org/svn/gen2/branches/1.0/src/userspace > IB Kernel modules SVN path: > https://openib.org/svn/gen2/branches/1.0/ibed/tags/rc3/linux-kernel > MPI: > openmpi-1.0.2a12-1 > mpi_osu-0.9.7-mlx2.1.0 > mpitests-1.0-0 > > *OSes:* > > * RH EL4 up2: 2.6.9-22.ELsmp > * RH EL4 up3: 2.6.9-34.ELsmp > * Fedora C4: 2.6.11-1.1369_FC4 > * SLES10 beta 7: 2.6.16-rc5-git9-2-smp > * SUSE 10 Pro: 2.6.13-15-smp > * kernel.org: 2.6.16 > > *Systems:* > > * x86_64 > * x86 > * ia64 > * ppc64 > > > *Main changes from RC2:* > > 1. Added support in Rh EL4 up3 > 2. Added Open MPI package > 3. OSU MPI is now based on 0.97 release (was 0.95 in RC2) > 4. Added Pathscale (ipath) driver > 5. Added uDapl > 6. build based on the new method: Userlevel from openib branch 1.0 > and kernel from openib trunk. (will be from the git in RC4) > 7. 
Added ibutils package > 8. Bug fixes > > *Package limitations:* > > 1. iSER is working on SuSE SLES 10 Beta8 only > 2. MPI OSU and Open MPI compilation fails on PPC64 > 3. uDAPL does not supported on RH EL4 (up2 and up3) since rdma_ucm > module does not work on 2.6.9* kernels. If someone has > a patch we > will use it. > 4. ipath driver compilation fails on RH EL4 and FedoraC4. > > Please send me and Vlad any issue you encounter and testing results. > > Thanks > Tziporet & Vlad > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > From bardov at gmail.com Tue Apr 11 00:32:33 2006 From: bardov at gmail.com (Dan Bar Dov) Date: Tue, 11 Apr 2006 09:32:33 +0200 Subject: [openib-general] Re: Location for iser-target code In-Reply-To: References: Message-ID: Since there were no relevant replies, I'm going to use option 1 below. My reasoning: this is a development area. It'll take some time to get the code to compile and be even somewhat acceptable. In the meantime, I don't want to clutter the trunk. When the iser target gets to a working state, we'll move it to the trunk, possibly joining it with the iser (initiator) ulp. Dan On 4/10/06, Dan Bar Dov wrote: > We are starting to work on open-source iser-target code. > I want to create a directory for it in the SVN. > There are several options I can think of: > > 1. https://openib.org/svn/gen2/ulps > An empty directory, I don't know what it's for, but could contain the > iser-target > > 2. https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/ulp > Together with the rest of the ulps, in the trunk > > 3. https://openib.org/svn/gen2/branches > create a new branch for it (branch of what?) > > 4. https://openib.org/svn/gen2/branches/openib-candidate/src/linux-kernel/infiniband > I have no idea what that is for, but the path seems appropriate. > > I'm leaning towards option 1 myself.
> > Dan > From mst at mellanox.co.il Tue Apr 11 00:42:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Apr 2006 10:42:05 +0300 Subject: [openib-general] Re: IPoIB destructor for 2.6.16-stable? In-Reply-To: References: <20060405115232.GA21115@mellanox.co.il> <20060405164329.GC23186@mellanox.co.il> Message-ID: <20060411074205.GD24783@mellanox.co.il> Quoting r. Roland Dreier : > Subject: [openib-general] Re: IPoIB destructor for 2.6.16-stable? > > Michael> I don't see any way to fix crashes in ipoib in 2.6.16, > Michael> then. Do you? > > Unfortunately no. If we could get to the bottom of Hal's crash then I > would be fine with adding something like this to 2.6.16.stable. But I > don't have much interest in debugging code that's already obsolete. BTW, this could be just random data corruption similar to what we saw in ipoib_mcast_sendonly_join_complete, which was fixed by one of Eli's patches. Certainly before Eli's patches I was seeing weird crashes in places that do not make sense. Hal, you couldn't reproduce this by any chance, could you? -- MST From HNGUYEN at de.ibm.com Tue Apr 11 01:30:50 2006 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Tue, 11 Apr 2006 10:30:50 +0200 Subject: [openib-general] RE: static rate encoding changes In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD2F96@mtlexch01.mtl.com> Message-ID: Hi Roland! For libehca (ehca user verbs) I realized that I also need conversion functions ibv_rate_to_mult() and mult_to_ibv_rate() similar to the ones for kernel space. I guess they might be used by others as well, so I created a patch for that; see below. Could you review it? Thanks!
Regards Hoang-Nam Nguyen Index: src/userspace/libibverbs/include/infiniband/verbs.h =================================================================== --- src/userspace/libibverbs/include/infiniband/verbs.h (revision 6377) +++ src/userspace/libibverbs/include/infiniband/verbs.h (working copy) @@ -306,6 +306,21 @@ IBV_RATE_120_GBPS = 10 }; +/** + * ibv_rate_to_mult - Convert the IB rate enum to a multiple of the + * base rate of 2.5 Gbit/sec. For example, IBV_RATE_5_GBPS will be + * converted to 2, since 5 Gbit/sec is 2 * 2.5 Gbit/sec. + * @rate: rate to convert. + */ +int ibv_rate_to_mult(enum ibv_rate rate); + +/** + * mult_to_ibv_rate - Convert a multiple of 2.5 Gbit/sec to an IB rate + * enum. + * @mult: multiple to convert. + */ +enum ibv_rate mult_to_ibv_rate(int mult); + struct ibv_ah_attr { struct ibv_global_route grh; uint16_t dlid; Index: src/userspace/libibverbs/src/libibverbs.map =================================================================== --- src/userspace/libibverbs/src/libibverbs.map (revision 6377) +++ src/userspace/libibverbs/src/libibverbs.map (working copy) @@ -67,5 +67,7 @@ ib_copy_qp_attr_from_kern; ib_copy_path_rec_from_kern; ib_copy_path_rec_to_kern; + ibv_rate_to_mult; + mult_to_ibv_rate; local: *; }; Index: src/userspace/libibverbs/src/verbs.c =================================================================== --- src/userspace/libibverbs/src/verbs.c (revision 6377) +++ src/userspace/libibverbs/src/verbs.c (working copy) @@ -366,6 +366,38 @@ return qp->context->ops.destroy_qp(qp); } +int ibv_rate_to_mult(enum ibv_rate rate) +{ + switch (rate) { + case IBV_RATE_2_5_GBPS: return 1; + case IBV_RATE_5_GBPS: return 2; + case IBV_RATE_10_GBPS: return 4; + case IBV_RATE_20_GBPS: return 8; + case IBV_RATE_30_GBPS: return 12; + case IBV_RATE_40_GBPS: return 16; + case IBV_RATE_60_GBPS: return 24; + case IBV_RATE_80_GBPS: return 32; + case IBV_RATE_120_GBPS: return 48; + default: return -1; + } +} + +enum ibv_rate mult_to_ibv_rate(int mult) +{ 
+ switch (mult) { + case 1: return IBV_RATE_2_5_GBPS; + case 2: return IBV_RATE_5_GBPS; + case 4: return IBV_RATE_10_GBPS; + case 8: return IBV_RATE_20_GBPS; + case 12: return IBV_RATE_30_GBPS; + case 16: return IBV_RATE_40_GBPS; + case 24: return IBV_RATE_60_GBPS; + case 32: return IBV_RATE_80_GBPS; + case 48: return IBV_RATE_120_GBPS; + default: return IBV_RATE_MAX; + } +} + struct ibv_ah *ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr) { struct ibv_ah *ah = pd->context->ops.create_ah(pd, attr); (See attached file: myipd.diff) -------------- next part -------------- A non-text attachment was scrubbed... Name: myipd.diff Type: application/octet-stream Size: 2547 bytes Desc: not available URL: From dotanb at mellanox.co.il Tue Apr 11 01:37:08 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 11 Apr 2006 11:37:08 +0300 Subject: [openib-general] tcpdump command issue on IB network In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000763BEA8@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F000763BEA8@orsmsx408> Message-ID: <200604111137.08728.dotanb@mellanox.co.il> On Tuesday 11 April 2006 00:54, Woodruff, Robert J wrote: > I am not sure this is a openib issue, but perhaps rather a bug in the > utilities that assumed the size of a MAC address is 8 bytes. > > I am not sure if that is the case with this one, but I know it was > the case in some of the user utilities. If this is the case, then > the utility should be fixed. Here is some new info: on machine X, i got the same failure, on machine Y i got the tcp dump. 
here are the machines props (same IB driver): on machine X: ------------------- # tcpdump --version tcpdump version 3.8 libpcap version 0.8.3 # tcpdump -i ib1 tcpdump: ioctl: Value too large for defined data type # uname -a Linux X.mtl.com 2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:00:54 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux # cat /etc/redhat-release Red Hat Enterprise Linux AS release 4 (Nahant Update 2) on machine Y: -------------------- # tcpdump --version tcpdump version 3.8 libpcap version 0.8.3 # tcpdump -i ib1 tcpdump: WARNING: arptype 32 not supported by libpcap - falling back to cooked socket tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on ib1, link-type LINUX_SLL (Linux cooked), capture size 96 bytes # uname -a Linux Y.mtl.com 2.6.14.3 #1 SMP Mon Jan 9 15:19:18 IST 2006 x86_64 x86_64 x86_64 GNU/Linux # cat /etc/redhat-release Fedora Core release 4 (Stentz) I think that maybe the problem is not in the user level stack or in the utilities , but in the linux kernel (in the module that handles this socket ioctl) Dotan From dotanb at mellanox.co.il Tue Apr 11 01:39:22 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 11 Apr 2006 11:39:22 +0300 Subject: [openib-general] Re: [uDAPLl] question about dapl_ib_cq_resize In-Reply-To: References: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD2D2E@mtlexch01.mtl.com> Message-ID: <200604111139.22386.dotanb@mellanox.co.il> On Monday 10 April 2006 23:47, James Lentini wrote: > > On Fri, 7 Apr 2006, Dotan Barak wrote: > > > Hi. > > > > > > I looked at the file: src/userspace/dapl/dapl/openib/dapl_ib_cq.c, > > function: dapl_ib_cq_resize: > > In this function, when one wants to resize a CQ, the dapl destroys the > > old CQ and creates a new one instead of calling to the resize CQ verb > > (which was added ~3 months ago), > > > > is there is a reason for this code? 
> > (please notice that the current implementation of the resize CQ function > > will fail if there are QPs that using this CQ). > > As Arlin noted, that verb was not available when the code was written. > > Do all the userspace hw libraries support this verb? > The mthca driver (both user and kernel) supports resize_cq; I don't know whether the other low-level drivers support this verb. Can someone answer this question for the other low-level drivers? Thanks Dotan From dotanb at mellanox.co.il Tue Apr 11 01:44:44 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 11 Apr 2006 11:44:44 +0300 Subject: [openib-general] Re: [uDAPL] who should update the file /etc/dat.conf? In-Reply-To: References: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD31C1@mtlexch01.mtl.com> Message-ID: <200604111144.44167.dotanb@mellanox.co.il> On Monday 10 April 2006 23:30, James Lentini wrote: > > On Mon, 10 Apr 2006, Dotan Barak wrote: > > > > who is responsible to change this file with valid data? > > The system administrator is supposed to edit this file with the > correct values. > > > for example: > > local IPs > > local HCAs and valid port numbers > > > > > > what is the meaning of the first word in each line (DAPL_PROVIDER?)? > > The ia_name field (e.g. OpenIB-cma)? This is the name that the system > administrator wishes the provider to use for it's IA. > Does anyone have an automatic script / tool that searches for all of the IB devices (and the ports of each device) and creates a valid configuration file? Thanks Dotan From tziporet at mellanox.co.il Tue Apr 11 03:49:47 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 11 Apr 2006 13:49:47 +0300 Subject: [openib-general] Re: [openfabrics-ewg] IBED-1.0-rc3 is available In-Reply-To: References: Message-ID: <443B89CB.8020005@mellanox.co.il> Pradeep Satyanarayana wrote: > > I had a question related to this.
IBED-1.0-rc3- has it gone through > some minimal touch testing on the various platforms listed below, or > has it been simply compiled on the indicated platforms (with the > exceptions noted below). I did not see any such references to it in > the FAQ. > We run thorough testing of the following systems: X86, X86_64 (both Intel and AMD). We run partial testing on ia64, and compilation & modules load only on PPC (due to lack of machines). Note that we focus our testing in the following: Core verbs (on mthca), IPoIB, OSU MPI, openSM, SRP and SDP Other companies are expected to verify the rest of the modules: ipath driver, Open MPI, uDAPL, iSER, RDS > If it was tested, was it tested on any specific set of HCAs? > In Mellanox we run our regression test suite on all our HCA boards with the latest FW releases (see http://www.mellanox.com/support/firmware_table.php for Mellanox boards and FW releases). > > How does one find out the bug fixes picked up in this release? > Since it is only RC I have not prepared a release notes with all bugs fixed. I suggest that from this RC3 everyone will file any issue found in the bugzilla and then we will be able to track the fixed bugs for RC4. > > > Pradeep > pradeep at us.ibm.com > From halr at voltaire.com Tue Apr 11 03:40:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2006 06:40:23 -0400 Subject: [openib-general] Re: IPoIB destructor for 2.6.16-stable? In-Reply-To: <20060411074205.GD24783@mellanox.co.il> References: <20060405115232.GA21115@mellanox.co.il> <20060405164329.GC23186@mellanox.co.il> <20060411074205.GD24783@mellanox.co.il> Message-ID: <1144752021.19061.67420.camel@hal.voltaire.com> On Tue, 2006-04-11 at 03:42, Michael S. Tsirkin wrote: > Quoting r. Roland Dreier : > > Subject: [openib-general] Re: IPoIB destructor for 2.6.16-stable? > > > > Michael> I don't see any way to fix crashes in ipoib in 2.6.16, > > Michael> then. Do you? > > > > Unfortunately no. 
If we could get to the bottom of Hal's crash then I > > would be fine with adding something like this to 2.6.16.stable. But I > > don't have much interest in debugging code that's already obsolete. > > BTW, could be just random data corruption similiar to what we saw in > ipoib_mcast_sendonly_join_complete and fixed by one of Eli's patches. > Certainly before Eli's patches I was seeing weird crashes in places > that do not make sense. > > Hal, you couldn't reproduce this by any chance, could you? Nope; I just saw that crash that one time. -- Hal From bugzilla-daemon at openib.org Tue Apr 11 04:19:35 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Tue, 11 Apr 2006 04:19:35 -0700 (PDT) Subject: [openib-general] [Bug 22] IBED RC2 Installation fails Message-ID: <20060411111935.7A610228492@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=22 vlad at mellanox.co.il changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Additional Comments From vlad at mellanox.co.il 2006-04-11 04:19 ------- Fixed in IBED-1.0-rc3 ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Tue Apr 11 04:57:15 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Tue, 11 Apr 2006 04:57:15 -0700 (PDT) Subject: [openib-general] [Bug 33] Ping fails on ib1 interface - IBED - RC3 Message-ID: <20060411115715.1A4AB2283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=33 ------- Additional Comments From eli at mellanox.co.il 2006-04-11 04:57 ------- Are you connecting the ports through a switch? If this is the case then I believe the problem is that you have two different IP subnets, which should be on different broadcast domains, using the same broadcast domain, as dictated by the current ipoib implementation.
The solution will probably be included in the next release and will require opensm enhancements to support different partitions based on different PKEYs. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From dotanb at mellanox.co.il Tue Apr 11 05:06:18 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 11 Apr 2006 15:06:18 +0300 Subject: [openib-general] [uDAPL] dtest server never ends when using the dapl provider "OpenIB-scm1" Message-ID: <200604111506.18067.dotanb@mellanox.co.il> Hi. I'm using the dtest from the dapl example folder with the following command line: ./dtest ./dtest -h IP1 (IP1 is the IP of the IPoIB I/F on the remote side) The output of the test is: server output: ---------------------- 1074 CONNECTED! 1074 Send RMR to remote: snd_msg: r_key_ctx=1320439,pad=0,va=50aa40,len=0x40 1074 Waiting for remote to send RMR data 1074 remote RMR data arrived! 1074 Received RMR from remote: r_iov: r_key_ctx=e00436,pad=0,va=50aa40,len=0x40 1074 RDMA WRITE DATA with SEND MSG 1074 Sending completion message 1074 inbound rdma_write; send message arrived! 1074 Received RMR from remote: r_iov: ctx=e00436,pad=0,va=0x50aa40,len=0x40 1074 SERVER: RDMA write buffer contains: client written data... 1074 RDMA READ DATA with SEND MSG 1074 Sending completion message 1074 Waiting for inbound message.... 1074 inbound rdma_read; send message arrived! 1074 Received RMR from remote: r_iov: ctx=e00436,pad=0,va=0x50aa40,len=0x40 1074 SERVER: RCV RDMA read buffer contains: client read data... 1074 PING DATA with SEND MSG client output: ------------------ 13210 CONNECTED! 13210 Send RMR to remote: snd_msg: r_key_ctx=e00436,pad=0,va=50aa40,len=0x40 13210 Waiting for remote to send RMR data 13210 remote RMR data arrived!
13210 Received RMR from remote: r_iov: r_key_ctx=1320439,pad=0,va=50aa40,len=0x40 13210 RDMA WRITE DATA with SEND MSG 13210 Sending completion message 13210 inbound rdma_write; send message arrived! 13210 Received RMR from remote: r_iov: ctx=1320439,pad=0,va=0x50aa40,len=0x40 13210 CLIENT: RDMA write buffer contains: server written data... 13210 RDMA READ DATA with SEND MSG 13210 Sending completion message 13210 Waiting for inbound message.... 13210 inbound rdma_read; send message arrived! 13210 Received RMR from remote: r_iov: ctx=1320439,pad=0,va=0x50aa40,len=0x40 13210 CLIENT: RCV RDMA read buffer contains: server read data... 13210 PING DATA with SEND MSG DAT Registry: dat_ia_close () called dat_get_ia_handle from 1 to 0x5124e0 dat_get_ia_handle from 1 to 0x5124e0 dats_free_ia_handle 1 DAT Registry: IA OpenIB-scm1, unloading library /usr/local//lib64/libdaplscm.so DAT Registry: dat_registry_remove_provider () called 13210: DAPL Test Complete. 13210: Message RTT: Total= 523.81 usec, 10 bursts, itime= 52.38 usec, pc=0 13210: RDMA write: Total= 24.08 usec, 10 bursts, itime= 2.41 usec, pc=0 13210: RDMA read: Total= 138.52 usec, 4 bursts, itime= 48.88 usec, pc=0 13210: RDMA read: Total= 138.52 usec, 4 bursts, itime= 28.85 usec, pc=0 13210: RDMA read: Total= 138.52 usec, 4 bursts, itime= 32.90 usec, pc=0 13210: RDMA read: Total= 138.52 usec, 4 bursts, itime= 27.89 usec, pc=0 13210: open: 54162.98 usec 13210: close: 277697.09 usec 13210: PZ create: 35.05 usec 13210: PZ free: 9.06 usec 13210: LMR create: 215.05 usec 13210: LMR free: 95.13 usec 13210: EVD create: 36.95 usec 13210: EVD free: 152.11 usec 13210: EP create: 494.96 usec 13210: EP free: 298.02 usec 13210: TOTAL: 1292.23 usec DAT Registry: Stopped (dat_fini) when i'm using the dapl provider: OpenIB-cma-ip, everything is working and both sides of the dtest ends (without any error). 
here is my dat.conf: OpenIB-cma u1.2 nonthreadsafe default /usr/local//lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" "" OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local//lib64/libdaplcma.so mv_dapl.1.2 "11.4.3.86 0" "" <<<-- the IP of the local IPoIB I/F in each host OpenIB-cma-name u1.2 nonthreadsafe default /usr/local//lib64/libdaplcma.so mv_dapl.1.2 "svr1-ib0 0" "" OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/local//lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" "" OpenIB-scm1 u1.2 nonthreadsafe default /usr/local//lib64/libdaplscm.so mv_dapl.1.2 "mthca0 1" "" OpenIB-scm2 u1.2 nonthreadsafe default /usr/local//lib64/libdaplscm.so mv_dapl.1.2 "mthca0 2" "" Here is some info of the host i'm using (both of the hosts are identical): Host Name : sw086 Host Architecture : x86_64 Linux Distribution: Fedora Core release 4 (Stentz) Kernel Version : 2.6.14.3 Memory size : 4039344 kB Driver Version : IBED-1.0-rc3: HCA ID(s) : mthca0 HCA model(s) : 25208 FW version(s) : 4.7.600 Board(s) : MT_00A0010001 can anyone help me with this issue? thanks Dotan From bugzilla-daemon at openib.org Tue Apr 11 05:33:35 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Tue, 11 Apr 2006 05:33:35 -0700 (PDT) Subject: [openib-general] [Bug 33] Ping fails on ib1 interface - IBED - RC3 Message-ID: <20060411123335.F07BB2283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=33 ------- Additional Comments From eli at mellanox.co.il 2006-04-11 05:33 ------- One more thing. You can configure all the ipoib interfaces to reside on the same IP subnet and this should solve the problem. It worked for me. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
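The workaround in the comment above (putting all IPoIB interfaces on the same IP subnet) can be checked before touching the interfaces. The following is only a minimal sketch in Python, not part of IBED: the helper name same_subnet, the addresses, and the /24 prefix length are assumptions chosen for illustration.

```python
import ipaddress


def same_subnet(*cidrs):
    """Return True if every "address/prefix" string falls in one IP network.

    Hypothetical helper for illustration; it only compares the networks
    derived from the given strings and does not query any interface.
    """
    networks = {ipaddress.ip_interface(cidr).network for cidr in cidrs}
    return len(networks) == 1


if __name__ == "__main__":
    # Illustrative addresses for ib0 and ib1; substitute the real ones.
    print(same_subnet("11.4.3.86/24", "11.4.3.87/24"))  # same /24 network
    print(same_subnet("11.4.3.86/24", "11.4.4.86/24"))  # different networks
```

If the check reports different networks for two ports attached to the same fabric, that matches the situation described in the bug comment.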
From moshek at voltaire.com Tue Apr 11 06:39:44 2006 From: moshek at voltaire.com (Moshe Kazir) Date: Tue, 11 Apr 2006 16:39:44 +0300 Subject: [openib-general] RE: [openfabrics-ewg] FW: IBED-1.0-rc3 is available Message-ID: When tring to build IBED-1.0-rc3 on 2.6.9-34.EL-smp-x86_64 I got the following error : In file included from /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipa th_cq.c:36: /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipa th_verbs.h:395: error: `BITS_PER_BYTE' undeclared here (not in a function) make[3]: *** [/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ip ath_cq.o] Error 1 make[2]: *** [/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath] Error 2 make[1]: *** [_module_/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband] Error 2 make[1]: Leaving directory `/usr/src/kernels/2.6.9-34.EL-smp-x86_64' make: *** [kernel] Error 2 ERROR: Failed to execute: make kernel It's look as if in the file ipath_cq.c the line #defined BITS_PER_BYTE is located two lines after the line #include that uses the definition. Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com -----Original Message----- From: openfabrics-ewg-bounces at openib.org [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of Tziporet Koren Sent: Monday, April 10, 2006 10:17 PM To: Tziporet Koren; Matters, Todd; Kamen Bodourov; Moni Levy; Vladimir Sokolovsky; Amit Krig; Bryan O'Sullivan; Jeff Squyres (jsquyres); Matt Leininger Cc: svbu-ibcg at external.cisco.com; ibcg at silverstorm.com; Openfabrics-ewg at openib.org; IBCG Subject: [openfabrics-ewg] FW: IBED-1.0-rc3 is available It seems that this mail was not delivered by the openfabrics-ewg list thus I forward it to direct emails. Matt - can you add a link in the news page of OpenFabrics for this release. 
Thanks, Tziporet -----Original Message----- From: Vladimir Sokolovsky Sent: Monday, April 10, 2006 7:55 PM To: openfabrics-ewg at openib.org Cc: openib-general; Tziporet Koren Subject: IBED-1.0-rc3 is available Hi All, We have prepared IBED 1.0 RC3. Release location: https://openib.org/svn/gen2/branches/1.0/ibed/releases File: IBED-1.0-rc3.tgz md5sum: 8e143fd4b63646ebc9f5c9f73d18394b BUILD_ID: IBED-1.0-rc3: OpenIB: openib_branch1.0-20060410-1551 (REV=6367) Userspace SVN path: https://openib.org/svn/gen2/branches/1.0/src/userspace IB Kernel modules SVN path: https://openib.org/svn/gen2/branches/1.0/ibed/tags/rc3/linux-kernel MPI: openmpi-1.0.2a12-1 mpi_osu-0.9.7-mlx2.1.0 mpitests-1.0-0 OSes: * RH EL4 up2: 2.6.9-22.ELsmp * RH EL4 up3: 2.6.9-34.ELsmp * Fedora C4: 2.6.11-1.1369_FC4 * SLES10 beta 7: 2.6.16-rc5-git9-2-smp * SUSE 10 Pro: 2.6.13-15-smp * kernel.org: 2.6.16 Systems: * x86_64 * x86 * ia64 * ppc64 Main changes from RC2: 1. Added support in Rh EL4 up3 2. Added Open MPI package 3. OSU MPI is now based on 0.97 release (was 0.95 in RC2) 4. Added Pathscale (ipath) driver 5. Added uDapl 6. build based on the new method: Userlevel from openib branch 1.0 and kernel from openib trunk. (will be from the git in RC4) 7. Added ibutils package 8. Bug fixes Package limitations: 1. iSER is working on SuSE SLES 10 Beta8 only 2. MPI OSU and Open MPI compilation fails on PPC64 3. uDAPL does not supported on RH EL4 (up2 and up3) since rdma_ucm module does not work on 2.6.9* kernels. If someone has a patch we will use it. 4. ipath driver compilation fails on RH EL4 and FedoraC4. Please send me and Vlad any issue you encounter and testing results. Thanks Tziporet & Vlad -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jlentini at netapp.com Tue Apr 11 06:51:25 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 11 Apr 2006 09:51:25 -0400 (EDT) Subject: [openib-general] Would like to change Infiniband to InfiniBand In-Reply-To: References: Message-ID: On Mon, 10 Apr 2006, Scott Weitzenkamp (sweitzen) wrote: > Any objections to me fixing all the occurences of Infiniband in > gen2/trunk? The only "Infiniband" spellings I see are in the ehca driver (32 in comments, 2 in print statements). If you want to change those, you should send a patch to the ehca maintainers and cc the list. There are also 5 comments in ipath driver that use "infiniband". The more pressing need is to change the core ib_* symbols to rdma_* to reflect the stacks generic RDMA capabilities. From jlentini at netapp.com Tue Apr 11 06:51:52 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 11 Apr 2006 09:51:52 -0400 (EDT) Subject: [openib-general] Re: [uDAPL] who should update the file /etc/dat.conf? In-Reply-To: <200604111144.44167.dotanb@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301DD31C1@mtlexch01.mtl.com> <200604111144.44167.dotanb@mellanox.co.il> Message-ID: On Tue, 11 Apr 2006, Dotan Barak wrote: > Does anyone have an automatic script / tool that searches for all of > the IB devices (and ports for each device) and create a valid > configuration file? I don't know of one. It would be useful to have. 
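For reference, the kind of tool asked about above could be sketched roughly as follows. This is only a minimal sketch in Python, not an existing OpenIB or uDAPL utility. It assumes the kernel exposes devices and ports as /sys/class/infiniband/<device>/ports/<port>, and it reuses the socket-CM entry format and library path from the dat.conf posted earlier in this thread. The function name dat_conf_lines and the sequential OpenIB-scmN naming are hypothetical.

```python
import os


def dat_conf_lines(sysfs_root="/sys/class/infiniband",
                   lib_path="/usr/local//lib64/libdaplscm.so"):
    """Build one dat.conf entry per (device, port) pair found under sysfs_root.

    Assumption: each device directory contains a "ports" directory whose
    numeric subdirectories are the port numbers. The field layout copies the
    OpenIB-scm entries shown in the thread; adjust it for other providers.
    """
    lines = []
    if not os.path.isdir(sysfs_root):
        return lines  # no IB stack loaded, nothing to generate
    index = 0
    for dev in sorted(os.listdir(sysfs_root)):
        ports_dir = os.path.join(sysfs_root, dev, "ports")
        if not os.path.isdir(ports_dir):
            continue
        # Keep only numeric entries and order them by port number.
        ports = sorted((p for p in os.listdir(ports_dir) if p.isdigit()), key=int)
        for port in ports:
            index += 1
            lines.append(
                'OpenIB-scm%d u1.2 nonthreadsafe default %s mv_dapl.1.2 "%s %s" ""'
                % (index, lib_path, dev, port)
            )
    return lines


if __name__ == "__main__":
    # Print the generated entries; redirect to a file and review before use.
    for line in dat_conf_lines():
        print(line)
```

The sketch deliberately does not write /etc/dat.conf itself, since the administrator still has to decide which providers and ports should be listed.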
From tziporet at mellanox.co.il Tue Apr 11 07:00:00 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 11 Apr 2006 17:00:00 +0300 Subject: [openib-general] Re: [openfabrics-ewg] FW: IBED-1.0-rc3 is available In-Reply-To: References: Message-ID: <443BB660.2010404@mellanox.co.il> Moshe Kazir wrote: > > When tring to build IBED-1.0-rc3 on 2.6.9-34.EL-smp-x86_64 > > > > I got the following error : > > > > In file included from > /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_cq.c:36: > /var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_verbs.h:395: > error: `BITS_PER_BYTE' undeclared here (not in a function) > make[3]: *** > [/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath/ipath_cq.o] > Error 1 > make[2]: *** > [/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband/hw/ipath] > Error 2 > make[1]: *** > [_module_/var/tmp/IBED/tmp/openib/openib/src/linux-kernel/infiniband] > Error 2 > make[1]: Leaving directory `/usr/src/kernels/2.6.9-34.EL-smp-x86_64' > make: *** [kernel] Error 2 > ERROR: Failed to execute: make kernel > > > > It's look as if in the file ipath_cq.c the line > > #defined BITS_PER_BYTE > > is located two lines after the line > > #include > > that uses the definition. > > > > > > Moshe > Read this limitation: > > 4. ipath driver compilation fails on RH EL4 and FedoraC4. > Tziporet > > Thanks > > Tziporet & Vlad > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlentini at netapp.com Tue Apr 11 07:10:43 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 11 Apr 2006 10:10:43 -0400 (EDT) Subject: [openib-general] Re: [uDAPL] dtest server never ends when using the dapl provider "OpenIB-scm1" In-Reply-To: <200604111506.18067.dotanb@mellanox.co.il> References: <200604111506.18067.dotanb@mellanox.co.il> Message-ID: On Tue, 11 Apr 2006, Dotan Barak wrote: > > can anyone help me with this issue? Can you ^C and kill the server? 
From dotanb at mellanox.co.il Tue Apr 11 07:15:55 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 11 Apr 2006 17:15:55 +0300 Subject: [openib-general] Re: [uDAPL] dtest server never ends when using the dapl provider "OpenIB-scm1" In-Reply-To: References: <200604111506.18067.dotanb@mellanox.co.il> Message-ID: <200604111715.55534.dotanb@mellanox.co.il> On Tuesday 11 April 2006 17:10, James Lentini wrote: > > On Tue, 11 Apr 2006, Dotan Barak wrote: > > > > > can anyone help me with this issue? > > Can you ^C and kill the server? > Yes, there isn't any problem on the host. I can ^C and kill the server and execute it one more time without any problem ... Dotan From jackm at mellanox.co.il Tue Apr 11 08:16:27 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Tue, 11 Apr 2006 18:16:27 +0300 Subject: [openib-general] [PATCH] mthca: fix max_srq_sge returned by ib_query_device for Tavor devices Message-ID: <200604111816.27204.jackm@mellanox.co.il> The driver allocates SRQ WQEs with a power-of-2 size both for Tavor and for memfree. For Tavor, however, the WQE size is required to be only a multiple of 16, not a power of 2, and the max number of scatter/gather entries allowed is reported accordingly by the firmware (and this is the value currently returned by ib_query_device() and ibv_query_device()). If the max number of scatter/gather entries reported by the f/w is used when creating an SRQ, the creation will fail for Tavor, since the required WQE size will be increased to the next power-of-2, which turns out to be larger than the device-permitted max WQE size (which is not a power-of-2). This patch reduces the reported SRQ max wqe size so that it can be used successfully in creating an SRQ on Tavor HCAs.
Signed-off-by: Jack Morgenstein Index: src/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2006-04-11 +++ src/drivers/infiniband/hw/mthca/mthca_dev.h 2006-04-11 17:45:52 @@ -151,6 +151,7 @@ struct mthca_limits { int reserved_qps; int num_srqs; int max_srq_wqes; + int max_srq_sge; int reserved_srqs; int num_eecs; int reserved_eecs; @@ -507,6 +508,7 @@ void mthca_free_srq(struct mthca_dev *de int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, enum ib_srq_attr_mask attr_mask); int mthca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr); +int mthca_max_srq_sge(struct mthca_dev *dev); void mthca_srq_event(struct mthca_dev *dev, u32 srqn, enum ib_event_type event_type); void mthca_free_srq_wqe(struct mthca_srq *srq, u32 wqe_addr); Index: src/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_main.c 2006-04-11 +++ src/drivers/infiniband/hw/mthca/mthca_main.c 2006-04-11 17:45:52 @@ -191,6 +191,7 @@ static int __devinit mthca_dev_lim(struc mdev->limits.reserved_srqs = dev_lim->reserved_srqs; mdev->limits.reserved_eecs = dev_lim->reserved_eecs; mdev->limits.max_desc_sz = dev_lim->max_desc_sz; + mdev->limits.max_srq_sge = mthca_max_srq_sge(mdev); /* * Subtract 1 from the limit because we need to allocate a * spare CQE so the HCA HW can tell the difference between an Index: src/drivers/infiniband/hw/mthca/mthca_srq.c =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_srq.c 2006-03-29 +++ src/drivers/infiniband/hw/mthca/mthca_srq.c 2006-04-11 17:45:52 @@ -192,7 +192,7 @@ int mthca_alloc_srq(struct mthca_dev *de /* Sanity check SRQ size before proceeding */ if (attr->max_wr > dev->limits.max_srq_wqes || - attr->max_sge > dev->limits.max_sg) + attr->max_sge > 
dev->limits.max_srq_sge) return -EINVAL; srq->max = attr->max_wr; @@ -660,6 +660,30 @@ int mthca_arbel_post_srq_recv(struct ib_ return err; } +int mthca_max_srq_sge(struct mthca_dev *dev) +{ + if (mthca_is_memfree(dev)) + return dev->limits.max_sg; + /* + * SRQ allocations are based on powers-of-2 for Tavor, + * (although they only need to be multiples of 16 bytes). + * + * Therefore, we need to base the max number of sg entries + * on that power-of-2 descriptor size which closest to the actual + * WQE descriptor size, and is <= to the actual WQE descriptor size, + * rather than return the max_sg value given by the firmware + * (which is based on WQE sizes as multiples of 16, not powers-of-2). + * + * If SRQ implementation is changed for Tavor to be based on multiples + * of 16, the calculation below can be deleted and the f/w max_sg + * value returned. + */ + return min(dev->limits.max_sg, + ((1 << (fls(dev->limits.max_desc_sz) - 1)) - + sizeof(struct mthca_next_seg)) / + sizeof(struct mthca_data_seg)); +} + int __devinit mthca_init_srq_table(struct mthca_dev *dev) { int err; Index: src/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2006-03-07 +++ src/drivers/infiniband/hw/mthca/mthca_provider.c 2006-04-11 @@ -106,7 +106,7 @@ static int mthca_query_device(struct ib_ props->max_res_rd_atom = props->max_qp_rd_atom * props->max_qp; props->max_srq = mdev->limits.num_srqs - mdev->limits.reserved_srqs; props->max_srq_wr = mdev->limits.max_srq_wqes; - props->max_srq_sge = mdev->limits.max_sg; + props->max_srq_sge = mdev->limits.max_srq_sge; props->local_ca_ack_delay = mdev->limits.local_ca_ack_delay; props->atomic_cap = mdev->limits.flags & DEV_LIM_FLAG_ATOMIC ? 
IB_ATOMIC_HCA : IB_ATOMIC_NONE; From halr at voltaire.com Tue Apr 11 08:14:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2006 11:14:23 -0400 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: <443AD545.3010307@ichips.intel.com> References: <1144410107.19061.824.camel@hal.voltaire.com> <4436E507.7000309@ichips.intel.com> <1144667911.19061.52371.camel@hal.voltaire.com> <443A96F3.9000706@ichips.intel.com> <1144700991.19061.58524.camel@hal.voltaire.com> <443AD545.3010307@ichips.intel.com> Message-ID: <1144768449.19061.70653.camel@hal.voltaire.com> On Mon, 2006-04-10 at 17:59, Sean Hefty wrote: > Hal Rosenstock wrote: > >>>>Node A sends an RMPP message. This requires normal RMPP processing. > >>>>Node A sends an ACK of the final ACK (I'll call ACK2), giving a new window. > >>>>Node B receives ACKs. > >>>>Node B sends the response. This requires normal RMPP processing. > >>>> > >>>>From the perspective of node A, the RMPP code only needs to know to send ACK2. > >>> > >>>There's more to the state machine in turning the direction around in > >>>terms of the sender becoming the receiver. I thought that this is the > >>>"harder" direction change. > >> > >>Can you describe what more is needed that what's listed above? > > > > I was referring to comparing the direction switch flows (Figure 181 p. > > 791) requires more than switch to DS in Figure 179 p. 787). > > It's still not clear to me what's missing from the sequence. Where did the "missing" come from ? > Once node A sends ACK2, it should wait until either ACK1 is received again, > or the response is received. There's also ABORT, STOP, and other handling too in that loop (figure 181). > Upon receiving ACK1, it can resend ACK2. Yes. > I might not have mentioned > this before, but I would have ACK2 carry a window size of 1, > which lets us treat all received RMPP MADs the same. I don't recall that being mentioned and yes, this sounds like a good idea. 
> >>In terms of compliance, if node A is not, but node B is; > > > > Is = DS and not is not DS, right ? > > > > Just out of curiosity, where's the compliance for this ? What are you > > referring to here ? Can you clarify your use of "compliance" here ? Compliance has a certain spec connotation and I would have used a slightly different term than this to indicate following the (RMPP) protocol if that is what you mean. > I mean that node A does not implement DS RMPP, but node B does. > > >>Node B cannot send back the response until ACK2 is received. Since node A does > >>not understand dual-sided RMPP, it will not send ACK2. Node B will never send > >>the response. > > > > Correct. It would time out. But wouldn't it be better if the transaction > > were aborted with some explicit status for this ? > > Are you asking for an explicit status indicating that ACK2 was not received? I was referring to an explicit status if ACK2 is received and the node does not support dual sided RMPP so the other end does not need to wait the timeout (and retry, etc.) and can potentially indicate a different error back to the user. > I guess this could be added, but node B should not make any assumptions about the > reason for the timeout, such as node A doesn't support DS RMPP. If node A > doesn't support DS RMPP, I don't know that it should expect a MultiPathRecord > query to work. I was referring to node B not supporting DS RMPP. > >>If node A is, but node B is not: > >>Node A will send ACK2, which node B should drop. > > > > Yes, figure 179 for receiver termination flow (IsDS false direction) > > shows that packet as discarded with an Abort (BadT) sent. > > If ACK2 matches with the received request, then wouldn't that transaction be > aborted? Where do you see that ? > Does this mean that both nodes must either be DS RMPP compliant, or > non-compliant for communication to work? You lost me on this. > >>Node B will send an RMPP message assuming an initial window size of 1. 
> >>If node A had set the window > >>larger, it may delay the ACK of segment 1. Node B will eventually timeout and > >>resend segment 1. Most likely, this will cause node A to ACK segment 1, which > >>will update the window size at node B. > > > > I'm not following you on this part. > > I'm just trying to determine what could happen if a non-compliant implementation > tried talking to a compliant implementation. And now I'm leaning towards them > being unable to communication. I think the non DS RMPP implementation is supposed to send an ABORT with BadT. > >>>> It can do this based on the method, or per transaction if directed by the client. > >>> > >>>Yes; I was thinking of class/method based approach for this. > >> > >>Currently, only a MultiPathRecord query requires this. Why not limit dual-sided > >>RMPP to _only_ this request? > > > > > > That would work for now. One future issue would be vendor class 2 needs > > here. > > What I'm suggesting is that we limit "sender-initiated double-sided" RMPP > transfers to only MultiPathRecord. Vendor class 2 would simply use two > "sender-initiated transfers". I don't think that can work. If the request and response are RMPP'd, I think a direction switch is needed so this can't be done. > >>All other queries can just use an RMPP message one > >>direction, followed by an RMPP message in the other direction. > > > > > > I don't understand what you mean here. That's not the way it works from > > my understanding. If both the request and response are RMPP messages, > > isn't this dual sided ? > > If it's a vendor defined MAD, can't we control the behavior and treat this as > two Sender-Initiated Transfers? In 13.6.6.3, we have: > > It is also possible for a single transaction to involve an RMPP transfer sent > in one direction followed by another RMPP transfer in the other direction... > This *may* be accomplished as follows: > > My interpretation is that we're not restricted to using this. 
That section is entitled dual sided transfer and mentions GetMulti which clearly uses DS RMPP so I don't think that is the case. > > I think the issue is turning things around but I'm not positive. I was > > wondering about this in a slightly different way: as to why the > > direction switch ? My initial foray into this area was to do just what > > you said: two single sided RMPP transfers in opposite direction. In my > > simple test case, the request was short (1 MAD) but that could be > > changed. I haven't figured out the reason for the turnaround ACK but I > > know the people who architected this although most have left the group > > and am quite confident that this wouldn't just be there if it weren't > > needed. (I'll eat my words later if necessary :-) > > (rant) IMO, the entire RMPP architecture is ridiculous. Segmentation and > reassembly information is embedded in the middle of user data, with timeout > constraints that would take a half dozen queries to calculate. So I'm not > confident that this is needed at all. > > The only benefit that I see is that the initial window size could be larger than > 1, which has a potential to provide for better latency. DS RMPP requires the > same number of MADs as two single sided RMPP transfers, so even the potential > gain seems fairly small. I don't think the issue is gain but how to reverse the RMPP roles. When you say 2 individual sender initiated transfers, would they have the same transaction ID ? > >>>>Node B is more complex. It must now wait for ACK2, using timeout and retries of > >>>>ACK1 until ACK2 is received. And the response that will be generated by the > >>>>client must be delayed until that ACK2 is received. > >>> > >>> > >>>Yes but isn't much of this already needed for the normal termination > >>>case or is that part not implemented yet ? > >> > >>No - ACK2 is a new message unique to dual-sided RMPP transfers (an ACK of an ACK). > > > > > > We're talking about Figure 179, right ? 
If so, most of that needs to be > > there already down to the Type decision (without the ACK direction > > implemented). > > > > Yes, ACK2 is new but this doesn't seem like much to add. The delay of > > the client response would also be "new" and that seems harder. > > I agree. Adding in ACK2 shouldn't be that difficult, but does require knowing > if a given transaction (class/method) uses DS RMPP. The delay on the send-side > is already there, in waiting for the response. The timeout of the RMPP context > on the receive side is where the difficulty lies, but I think we can avoid this > difficulty simply by passing NWL up to the client, and having them return it on > the response. > > If we want to support DS RMPP for more than just MultiPathRecord, it seems that > we need some sort of class/method mapping, maybe attribute ID as well. > which would require changing the kernel MAD API. Yes unless this were somehow made self-identifying (part of the RMPP protocol rather than an internal state variable). -- Hal > > - Sean > From jlentini at netapp.com Tue Apr 11 08:32:21 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 11 Apr 2006 11:32:21 -0400 (EDT) Subject: [openib-general] Re: [uDAPL] dtest server never ends when using the dapl provider "OpenIB-scm1" In-Reply-To: <200604111715.55534.dotanb@mellanox.co.il> References: <200604111506.18067.dotanb@mellanox.co.il> <200604111715.55534.dotanb@mellanox.co.il> Message-ID: On Tue, 11 Apr 2006, Dotan Barak wrote: > On Tuesday 11 April 2006 17:10, James Lentini wrote: > > > > On Tue, 11 Apr 2006, Dotan Barak wrote: > > > > > > > > can anyone help me with this issue? > > > > Can you ^C and kill the server? > > > > yes, there isn't any problem in the host. > i can ^C and kill the server and execute it one more time without > any problem ... It sounds like the disconnect is being lost. Let me see if I can reproduce this. Arlin, have you ever seen this? 

From bugzilla-daemon at openib.org Tue Apr 11 09:19:16 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Tue, 11 Apr 2006 09:19:16 -0700 (PDT)
Subject: [openib-general] [Bug 33] Ping fails on ib1 interface - IBED - RC3
Message-ID: <20060411161916.0B90D2283D6@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=33

------- Additional Comments From xma at us.ibm.com 2006-04-11 09:19 -------
I have configured all the ipoib interfaces to reside on the same IP subnet.
I hit the problem after running netperf for a while; then it failed.
tcpdump showed that one side sent out an ARP request, but the other side
did not receive the ARP request at all.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From pradeep at us.ibm.com Tue Apr 11 09:11:54 2006
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Tue, 11 Apr 2006 10:11:54 -0600
Subject: [openib-general] Re: [openfabrics-ewg] IBED-1.0-rc3 is available
In-Reply-To: <443B89CB.8020005@mellanox.co.il>
Message-ID:

Thank you for the explanations. This was just what I was looking for.

Pradeep
pradeep at us.ibm.com

Tziporet Koren wrote on 04/11/2006 03:49:47 AM:

> Pradeep Satyanarayana wrote:
> >
> > I had a question related to this. IBED-1.0-rc3- has it gone through
> > some minimal touch testing on the various platforms listed below, or
> > has it been simply compiled on the indicated platforms (with the
> > exceptions noted below). I did not see any such references to it in
> > the FAQ.
> >
>
> We run thorough testing of the following systems: X86, X86_64 (both
> Intel and AMD).
> We run partial testing on ia64, and compilation & modules load only on
> PPC (due to lack of machines).
> > Note that we focus our testing in the following: > Core verbs (on mthca), IPoIB, OSU MPI, openSM, SRP and SDP > Other companies are expected to verify the rest of the modules: > ipath driver, Open MPI, uDAPL, iSER, RDS > > > If it was tested, was it tested on any specific set of HCAs? > > > In Mellanox we run our regression test suite on all our HCA boards with > the latest FW releases (see > http://www.mellanox.com/support/firmware_table.php for Mellanox boards > and FW releases). > > > > How does one find out the bug fixes picked up in this release? > > > Since it is only RC I have not prepared a release notes with all bugs > fixed. I suggest that from this RC3 everyone will file any issue found > in the bugzilla and then we will be able to track the fixed bugs for RC4. > > > > > > Pradeep > > pradeep at us.ibm.com > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Tue Apr 11 09:17:19 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Apr 2006 09:17:19 -0700 Subject: [openib-general] tcpdump command issue on IB network In-Reply-To: <200604111137.08728.dotanb@mellanox.co.il> (Dotan Barak's message of "Tue, 11 Apr 2006 11:37:08 +0300") References: <1AC79F16F5C5284499BB9591B33D6F000763BEA8@orsmsx408> <200604111137.08728.dotanb@mellanox.co.il> Message-ID: Dotan> I think that maybe the problem is not in the user level Dotan> stack or in the utilities , but in the linux kernel (in the Dotan> module that handles this socket ioctl) Yes, the problem is in a patch that Red Hat applies to the 2.6.9 kernel in RHEL4. Full details are in the bugzilla entry that Scott posted. From troy at scl.ameslab.gov Tue Apr 11 09:36:53 2006 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Tue, 11 Apr 2006 11:36:53 -0500 Subject: [openib-general] EHCA crash on module unload? Message-ID: <20060411163653.GA18625@scl.ameslab.gov> p5l2:/usr/src/linux-2.6.16/drivers/infiniband# svnversion . 
5988 p5l2:~# [86044.767087] Unable to handle kernel paging request for data at address 0x00000068 [86044.767115] Faulting instruction address: 0xd000000018fd4b38 [86044.767132] Oops: Kernel access of bad area, sig: 11 [#1] [86044.767149] SMP NR_CPUS=8 NUMA PSERIES LPAR [86044.767169] Modules linked in: ib_uverbs ib_umad ib_mad hcad_mod ib_core libafs ipr sd_mod sg [86044.767197] NIP: D000000018FD4B38 LR: D000000018FD4AEC CTR: C000000000160FF4 [86044.767212] REGS: c0000001df42ba20 TRAP: 0300 Tainted: P (2.6.16-power5) [86044.767225] MSR: 8000000000009032 CR: 22000024 XER: 20000020 [86044.767270] DAR: 0000000000000068, DSISR: 0000000040000000 [86044.767287] TASK = c0000003ad9c0dd0[2232] 'ehca/0' THREAD: c0000001df428000 CPU: 0 [86044.767303] GPR00: 0000000000000000 C0000001DF42BCA0 D000000018FFC350 0000000000000000 [86044.767334] GPR04: 0000000000000001 0000000000060000 0000000022000042 C000000002319460 [86044.767366] GPR08: D000000000035208 D000000018FEF138 D000000000035000 0000000000000000 [86044.767400] GPR12: D000000018FDBEA8 C000000000433C00 0000000000000000 0000000000000000 [86044.767439] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [86044.767472] GPR20: 0000000000C00000 4000000001C10000 C0000000003E6AF0 0000000001FF6D50 [86044.767493] GPR24: C000000002319490 0000000000000001 C000000002319458 C000000002319000 [86044.767514] GPR28: C000000002319490 D000000018FEF3B8 D000000018FFA578 0000000000000000 [86044.767537] NIP [D000000018FD4B38] .ehca_interrupt_eq+0x1c4/0x550 [hcad_mod] [86044.767573] LR [D000000018FD4AEC] .ehca_interrupt_eq+0x178/0x550 [hcad_mod] [86044.767601] Call Trace: [86044.767609] [C0000001DF42BCA0] [D000000018FD4A04] .ehca_interrupt_eq+0x90/0x550 [hcad_mod] (unreliable) [86044.767643] --- Exception: 2 at .__start+0x4000000000000000/0x8 [86044.767662] LR = .kernel_thread+0x4c/0x68 [86044.767673] [C0000001DF42BD50] [C0000000000561EC] .run_workqueue+0xdc/0x168 (unreliable) [86044.767695] [C0000001DF42BDF0] 
[C0000000000564B0] .worker_thread+0x128/0x198 [86044.767714] [C0000001DF42BEE0] [C00000000005B450] .kthread+0x120/0x170 [86044.767731] [C0000001DF42BF90] [C000000000021CF8] .kernel_thread+0x4c/0x68 [86044.767746] Instruction dump: [86044.767754] 4c00012c 7c0007b4 2f80ffff 409c001c 5400043e 2f800000 409e0010 7fa3eb78 [86044.767805] 48007201 e8410028 e93e8000 3ca00006 60a50139 88090006 2b800007 [86044.767863] From mshefty at ichips.intel.com Tue Apr 11 09:38:35 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 11 Apr 2006 09:38:35 -0700 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: <1144768449.19061.70653.camel@hal.voltaire.com> References: <1144410107.19061.824.camel@hal.voltaire.com> <4436E507.7000309@ichips.intel.com> <1144667911.19061.52371.camel@hal.voltaire.com> <443A96F3.9000706@ichips.intel.com> <1144700991.19061.58524.camel@hal.voltaire.com> <443AD545.3010307@ichips.intel.com> <1144768449.19061.70653.camel@hal.voltaire.com> Message-ID: <443BDB8B.1010308@ichips.intel.com> Hal Rosenstock wrote: > I don't think that can work. If the request and response are RMPP'd, I > think a direction switch is needed so this can't be done. A direction switch is only needed if we want to follow the DS RMPP protocol. Why can't both sides just follow the sender-initiated protocol instead? I don't see where this is prohibited, and we know that it works. > I don't think the issue is gain but how to reverse the RMPP roles. When > you say 2 individual sender initiated transfers, would they have the > same transaction ID ? Yes. All we're trying to accomplish is reliable segmentation and reassembly. IMO, the RMPP start-up scenarios are utter nonsense. (Wow, I'm almost beginning to sound like a Linux programmer now.) But given that its defined in the spec, and SA GetMulti is defined to use it, let's limit its use to that method. 
>>If we want to support DS RMPP for more than just MultiPathRecord, it seems that >>we need some sort of class/method mapping, > > > maybe attribute ID as well. > > >> which would require changing the kernel MAD API. > > > Yes unless this were somehow made self-identifying (part of the RMPP > protocol rather than an internal state variable). If we're going to change the RMPP protocol, my vote is to remove DS RMPP entirely, unless someone can show why the additional complexity is needed. - Sean From robert.j.woodruff at intel.com Tue Apr 11 09:40:58 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 11 Apr 2006 09:40:58 -0700 Subject: [openib-general] tcpdump command issue on IB network In-Reply-To: Message-ID: <000001c65d86$b629a060$010fa8c0@amr.corp.intel.com> Roland wrote, > Dotan> I think that maybe the problem is not in the user level > Dotan> stack or in the utilities , but in the linux kernel (in the > Dotan> module that handles this socket ioctl) >Yes, the problem is in a patch that Red Hat applies to the 2.6.9 >kernel in RHEL4. Full details are in the bugzilla entry that Scott posted. Yes, I recall it was a problem caused when they changed the the HW address for InfiniBand, but did not know the specifics. Removing the overflow check will fix the kernel part, but I think the userspace utilites might also need to be changed or recompiled to make sure the structure sizes match, Right ? 
woody

From mshefty at ichips.intel.com Tue Apr 11 09:49:16 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 11 Apr 2006 09:49:16 -0700
Subject: [openib-general] [RFC] [PATCH 2/2 v2] ipoib: convert to use new multicast interface
In-Reply-To:
References:
Message-ID: <443BDE0C.7020700@ichips.intel.com>

Sean Hefty wrote:
>  void ipoib_mcast_join_task(void *dev_ptr)
> @@ -553,7 +539,8 @@ void ipoib_mcast_join_task(void *dev_ptr
>  		spin_unlock_irq(&priv->lock);
>  	}
>
> -	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
> +	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags) &&
> +	    !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) {
>  		ipoib_mcast_join(dev, priv->broadcast, 0);
>  		return;
>  	}

The change above needs to be:

	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
		if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
			ipoib_mcast_join(dev, priv->broadcast, 0);
		return;
	}

Or additional join requests will start before we've finished joining the
broadcast group.

- Sean

From rdreier at cisco.com Tue Apr 11 09:54:58 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 11 Apr 2006 09:54:58 -0700
Subject: [openib-general] tcpdump command issue on IB network
In-Reply-To: <000001c65d86$b629a060$010fa8c0@amr.corp.intel.com> (Bob Woodruff's message of "Tue, 11 Apr 2006 09:40:58 -0700")
References: <000001c65d86$b629a060$010fa8c0@amr.corp.intel.com>
Message-ID:

    Bob> Yes, I recall it was a problem caused when they changed the
    Bob> the HW address for InfiniBand, but did not know the
    Bob> specifics. Removing the overflow check will fix the kernel
    Bob> part, but I think the userspace utilites might also need to
    Bob> be changed or recompiled to make sure the structure sizes
    Bob> match, Right ?

tcpdump will work fine. It doesn't look at the hardware address in
the structure at all.

 - R.
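
For anyone reading the archive later, the two forms of the check in Sean's ipoib_mcast_join_task follow-up above can be compared in a small standalone C sketch. This is an illustration only, not the driver code: the flag constants, the enum and both function names are hypothetical stand-ins for the test_bit() calls on IPOIB_MCAST_FLAG_ATTACHED and IPOIB_MCAST_FLAG_BUSY, and the three outcomes are one reading of the quoted hunk.

```c
/* Hypothetical stand-ins for the IPOIB_MCAST_FLAG_* bits (values are illustrative). */
enum { ATTACHED = 1 << 0, BUSY = 1 << 1 };

/* What the join task does next for the broadcast group. */
enum next_step {
	START_JOIN,	/* start the broadcast join, then return */
	RETURN_WAIT,	/* return without starting anything      */
	FALL_THROUGH	/* carry on to the remaining groups      */
};

/* Check as first posted: both tests combined with &&. */
static enum next_step posted_check(unsigned int flags)
{
	if (!(flags & ATTACHED) && !(flags & BUSY))
		return START_JOIN;
	return FALL_THROUGH;
}

/* Check as corrected: always return while not attached, join only if not busy. */
static enum next_step corrected_check(unsigned int flags)
{
	if (!(flags & ATTACHED)) {
		if (!(flags & BUSY))
			return START_JOIN;
		return RETURN_WAIT;
	}
	return FALL_THROUGH;
}
```

The two functions differ only for "not attached, but a join is already in flight" (flags == BUSY): the posted form falls through to the remaining joins, while the corrected form returns and waits, which matches the concern that additional joins would otherwise start before the broadcast join has finished.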
From halr at voltaire.com Tue Apr 11 09:48:07 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2006 12:48:07 -0400 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSM Implications In-Reply-To: <443BDB8B.1010308@ichips.intel.com> References: <1144410107.19061.824.camel@hal.voltaire.com> <4436E507.7000309@ichips.intel.com> <1144667911.19061.52371.camel@hal.voltaire.com> <443A96F3.9000706@ichips.intel.com> <1144700991.19061.58524.camel@hal.voltaire.com> <443AD545.3010307@ichips.intel.com> <1144768449.19061.70653.camel@hal.voltaire.com> <443BDB8B.1010308@ichips.intel.com> Message-ID: <1144774081.19061.71872.camel@hal.voltaire.com> On Tue, 2006-04-11 at 12:38, Sean Hefty wrote: > Hal Rosenstock wrote: > > I don't think that can work. If the request and response are RMPP'd, I > > think a direction switch is needed so this can't be done. > > A direction switch is only needed if we want to follow the DS RMPP protocol. > Why can't both sides just follow the sender-initiated protocol instead? In thinking about this a little, I see no reason this couldn't work. In fact, that's one mode I have toyed with (a non conformant SA GetMulti* mode) in developing SA MultiPathRecord. > I don't see where this is prohibited, and we know that it works. It's not prohibited. I'm not sure there are many explicit compliances related to RMPP. > > I don't think the issue is gain but how to reverse the RMPP roles. When > > you say 2 individual sender initiated transfers, would they have the > > same transaction ID ? > > Yes. > > All we're trying to accomplish is reliable segmentation and reassembly. IMO, > the RMPP start-up scenarios are utter nonsense. The two one sided transfers still need to do those things so this is a general RMPP comment rather than a DS one. > (Wow, I'm almost beginning to sound like a Linux programmer now.) 
I think you qualify :-) > But given that its defined in the spec, and > SA GetMulti is defined to use it, let's limit its use to that method. > > >>If we want to support DS RMPP for more than just MultiPathRecord, it seems that > >>we need some sort of class/method mapping, > > > > > > maybe attribute ID as well. > > > > > >> which would require changing the kernel MAD API. > > > > > > Yes unless this were somehow made self-identifying (part of the RMPP > > protocol rather than an internal state variable). > > If we're going to change the RMPP protocol, my vote is to remove DS RMPP > entirely, unless someone can show why the additional complexity is needed. There is backward compatibility at a minimum. I'm exploring more on this. -- Hal From mshefty at ichips.intel.com Tue Apr 11 09:55:57 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 11 Apr 2006 09:55:57 -0700 Subject: [openib-general] Re: [RFC] [PATCH] SA query: expose retries through API In-Reply-To: References: Message-ID: <443BDF9D.8060002@ichips.intel.com> Roland Dreier wrote: > Looks fine but can you redo this on top of the module unload race fix > once we agree on that? I expect the race fix to go into 2.6.17 and > this API change to go into 2.6.18, so the API change needs to apply on > top of the race fix. Do you want me to continue to hold off applying this patch? I'd like to check in the multicast module, which rely on at least a part of this change. (The ipoib changes can be applied separately later.) If we want to hold off applying this, I can limit the patch to only what's needed for the multicast module, or rework the multicast module, so that it doesn't need the changes. 
- Sean

From rdreier at cisco.com Tue Apr 11 09:58:16 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 11 Apr 2006 09:58:16 -0700
Subject: [openib-general] Re: [RFC] [PATCH] SA query: expose retries through API
In-Reply-To: <443BDF9D.8060002@ichips.intel.com> (Sean Hefty's message of "Tue, 11 Apr 2006 09:55:57 -0700")
References: <443BDF9D.8060002@ichips.intel.com>
Message-ID:

    Sean> Do you want me to continue to hold off applying this patch?
    Sean> I'd like to check in the multicast module, which rely on at
    Sean> least a part of this change. (The ipoib changes can be
    Sean> applied separately later.)

No, go ahead and check it in. Can you resend me a new version of it
(now that we've backed out the broken unload fix)? I'll put it in the
for-2.6.18 tree.

Thanks,
  Roland

From info at schihei.de Tue Apr 11 09:59:27 2006
From: info at schihei.de (Heiko J Schick)
Date: Tue, 11 Apr 2006 18:59:27 +0200
Subject: [openib-general] EHCA crash on module unload?
In-Reply-To: <20060411163653.GA18625@scl.ameslab.gov>
References: <20060411163653.GA18625@scl.ameslab.gov>
Message-ID: <9BA6DF44-3AC4-4BBC-8EEF-8DE9534DDFF8@schihei.de>

Hello Troy,

did you unload all OpenIB modules first and then the eHCA module,
or the other way around?

Can you see any other messages (error data) in /var/log/messages?

It looks like you unloaded the module while an interrupt was coming in.
Can you send us the steps / commands you executed when the panic
occurred?

Regards,
Heiko

On 11.04.2006, at 18:36, Troy Benjegerdes wrote:

> p5l2:/usr/src/linux-2.6.16/drivers/infiniband# svnversion .
> 5988 > > > p5l2:~# [86044.767087] Unable to handle kernel paging request for data > at address 0x00000068 > [86044.767115] Faulting instruction address: 0xd000000018fd4b38 > [86044.767132] Oops: Kernel access of bad area, sig: 11 [#1] > [86044.767149] SMP NR_CPUS=8 NUMA PSERIES LPAR > [86044.767169] Modules linked in: ib_uverbs ib_umad ib_mad hcad_mod > ib_core libafs ipr sd_mod sg > [86044.767197] NIP: D000000018FD4B38 LR: D000000018FD4AEC CTR: > C000000000160FF4 > [86044.767212] REGS: c0000001df42ba20 TRAP: 0300 Tainted: P > (2.6.16-power5) > [86044.767225] MSR: 8000000000009032 CR: 22000024 > XER: 20000020 > [86044.767270] DAR: 0000000000000068, DSISR: 0000000040000000 > [86044.767287] TASK = c0000003ad9c0dd0[2232] 'ehca/0' THREAD: > c0000001df428000 CPU: 0 > [86044.767303] GPR00: 0000000000000000 C0000001DF42BCA0 > D000000018FFC350 0000000000000000 > [86044.767334] GPR04: 0000000000000001 0000000000060000 > 0000000022000042 C000000002319460 > [86044.767366] GPR08: D000000000035208 D000000018FEF138 > D000000000035000 0000000000000000 > [86044.767400] GPR12: D000000018FDBEA8 C000000000433C00 > 0000000000000000 0000000000000000 > [86044.767439] GPR16: 0000000000000000 0000000000000000 > 0000000000000000 0000000000000000 > [86044.767472] GPR20: 0000000000C00000 4000000001C10000 > C0000000003E6AF0 0000000001FF6D50 > [86044.767493] GPR24: C000000002319490 0000000000000001 > C000000002319458 C000000002319000 > [86044.767514] GPR28: C000000002319490 D000000018FEF3B8 > D000000018FFA578 0000000000000000 > [86044.767537] NIP [D000000018FD4B38] .ehca_interrupt_eq > +0x1c4/0x550 [hcad_mod] > [86044.767573] LR [D000000018FD4AEC] .ehca_interrupt_eq+0x178/0x550 > [hcad_mod] > [86044.767601] Call Trace: > [86044.767609] [C0000001DF42BCA0] > [D000000018FD4A04] .ehca_interrupt_eq+0x90/0x550 [hcad_mod] > (unreliable) > [86044.767643] --- Exception: 2 at .__start+0x4000000000000000/0x8 > [86044.767662] LR = .kernel_thread+0x4c/0x68 > [86044.767673] [C0000001DF42BD50] 
[C0000000000561EC] .run_workqueue > +0xdc/0x168 (unreliable) > > [86044.767695] [C0000001DF42BDF0] [C0000000000564B0] .worker_thread > +0x128/0x198 > > [86044.767714] [C0000001DF42BEE0] [C00000000005B450] .kthread > +0x120/0x170 > > [86044.767731] [C0000001DF42BF90] [C000000000021CF8] .kernel_thread > +0x4c/0x68 > > [86044.767746] Instruction dump: > [86044.767754] 4c00012c 7c0007b4 2f80ffff 409c001c 5400043e > 2f800000 409e0010 7fa3eb78 > [86044.767805] 48007201 e8410028 e93e8000 3ca00006 > 60a50139 88090006 2b800007 > [86044.767863] > From troy at scl.ameslab.gov Tue Apr 11 10:03:42 2006 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Tue, 11 Apr 2006 12:03:42 -0500 Subject: [openib-general] EHCA crash on module unload? In-Reply-To: <9BA6DF44-3AC4-4BBC-8EEF-8DE9534DDFF8@schihei.de> References: <20060411163653.GA18625@scl.ameslab.gov> <9BA6DF44-3AC4-4BBC-8EEF-8DE9534DDFF8@schihei.de> Message-ID: <443BE16E.2080305@scl.ameslab.gov> I had unplugged, then re-plugged the cable, and then ran the following: rmmod hcad_mod ib_mthca ib_uverbs ib_ipoib ib_sa ib_mad ib_core Heiko J Schick wrote: > Hello Troy, > > did you unload first all OpenIB modules and then the eHCA module > or the other way around? > > Can you see any other message (error data) in /var/log/messages? > > It looks like you unloaded the module during an interrupt came in. > Can you sent us the steps / commands you've executed when the panic > was caused? 
> > Regards, > Heiko > From rdreier at cisco.com Tue Apr 11 10:10:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Apr 2006 10:10:11 -0700 Subject: [openib-general] Re: problems cloning infiniband.git In-Reply-To: <443B567E.7000005@voltaire.com> (Or Gerlitz's message of "Tue, 11 Apr 2006 10:10:54 +0300") References: <443B567E.7000005@voltaire.com> Message-ID: Or> So i am downloading from a mirror of kernel.org and it might Or> work with the a non mirror? what would be the url to download Or> it from kernel.org, all the ones I've tried to derive from master.kernel.org is the main machine. but I'm not sure whether you can clone directly from that. Are you able to clone with rsync:// or git:// URLs? - R. From trimmer at silverstorm.com Tue Apr 11 10:15:22 2006 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Tue, 11 Apr 2006 13:15:22 -0400 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSMImplications Message-ID: I haven't been watching this thread so I might be missing the point. > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org]On Behalf Of Hal Rosenstock > Sent: Tuesday, April 11, 2006 12:48 PM > > On Tue, 2006-04-11 at 12:38, Sean Hefty wrote: > > Hal Rosenstock wrote: > > > I don't think that can work. If the request and response > are RMPP'd, I > > > think a direction switch is needed so this can't be done. > > > > A direction switch is only needed if we want to follow the > DS RMPP protocol. > > Why can't both sides just follow the sender-initiated > protocol instead? > > In thinking about this a little, I see no reason this > couldn't work. In > fact, that's one mode I have toyed with (a non conformant SA GetMulti* > mode) in developing SA MultiPathRecord. It is a bad idea to implement a custom double sided approach. This will suddenly create various compliance and interop issues. 
For example Windows Open Fabrics and Linux OpenSM might not interoperate. Not to mention other OSs (such as Solaris) which have their own IB stacks. More interestingly, it is very likely that most uses of getmulti would involve a requestor providing a request which would fit into a single MAD packet and the RMPP protocol would not be fully needed by the sender (eg. just the simple case of a single packet RMPP transfer by sender with a multipacket RMPP response). You will note up to 10 GIDs can fit in the request within a single packet. The most common uses will probably involve 2 source GIDs and 2 destination GIDs. Hence perhaps the complexity of a compliant double sided solution could even be avoided for now. Todd Rimmer From Richard.Frank at oracle.com Tue Apr 11 10:25:36 2006 From: Richard.Frank at oracle.com (Richard Frank) Date: Tue, 11 Apr 2006 13:25:36 -0400 Subject: [openib-general] Are atomics planned / available via Gen2 ? Message-ID: <1144776336.4085.17.camel@localhost.localdomain> Is there a common set available over iWARP / IB ? From rdreier at cisco.com Tue Apr 11 10:23:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Apr 2006 10:23:36 -0700 Subject: [openib-general] Are atomics planned / available via Gen2 ? In-Reply-To: <1144776336.4085.17.camel@localhost.localdomain> (Richard Frank's message of "Tue, 11 Apr 2006 13:25:36 -0400") References: <1144776336.4085.17.camel@localhost.localdomain> Message-ID: Richard> Is there a common set available over iWARP / IB ? atomics work fine on IB, at least for the mthca driver. As far as I know, atomic operations are not part of the iWARP spec. - R. 
From halr at voltaire.com Tue Apr 11 10:20:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Apr 2006 13:20:21 -0400 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSMImplications In-Reply-To: References: Message-ID: <1144776016.19061.72289.camel@hal.voltaire.com> Hi Todd, On Tue, 2006-04-11 at 13:15, Rimmer, Todd wrote: > I haven't been watching this thread so I might be missing the point. > > > -----Original Message----- > > From: openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org]On Behalf Of Hal Rosenstock > > Sent: Tuesday, April 11, 2006 12:48 PM > > > > On Tue, 2006-04-11 at 12:38, Sean Hefty wrote: > > > Hal Rosenstock wrote: > > > > I don't think that can work. If the request and response > > are RMPP'd, I > > > > think a direction switch is needed so this can't be done. > > > > > > A direction switch is only needed if we want to follow the > > DS RMPP protocol. > > > Why can't both sides just follow the sender-initiated > > protocol instead? > > > > In thinking about this a little, I see no reason this > > couldn't work. In > > fact, that's one mode I have toyed with (a non conformant SA GetMulti* > > mode) in developing SA MultiPathRecord. > > It is a bad idea to implement a custom double sided approach. This will > suddenly create various compliance and interop issues. For example > Windows Open Fabrics and Linux OpenSM might not interoperate. Not to > mention other OSs (such as Solaris) which have their own IB stacks. Understood. All I said was I used this as a development vehicle. It is not intended to be checked into the tree. > More interestingly, it is very likely that most uses of getmulti would > involve a requestor providing a request which would fit into a single > MAD packet and the RMPP protocol would not be fully needed by the > sender (eg. just the simple case of a single packet RMPP transfer > by sender with a multipacket RMPP response). Right. 
> You will note up to 10 > GIDs can fit in the request within a single packet. The most common > uses will probably involve 2 source GIDs and 2 destination GIDs. Yes. > Hence perhaps the complexity of a compliant double sided solution > could even be avoided for now. You lost me here. What do you have in mind ? -- Hal > Todd Rimmer From robert.j.woodruff at intel.com Tue Apr 11 10:29:50 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 11 Apr 2006 10:29:50 -0700 Subject: [openib-general] tcpdump command issue on IB network In-Reply-To: Message-ID: <000101c65d8d$898f4e40$010fa8c0@amr.corp.intel.com> Roland wrote, >tcpdump will work fine. It doesn't look at the hardware address in >the structure at all. > - R. I removed the check and indeed tcpdump does now appear to work in a modified 2.6.9-34EL kernel. I can add this change to my backport-to-2.6.9 patches for 2.6.9-34EL for people that use them. [root at iclust-1 woody]# tcpdump -i ib0 tcpdump: WARNING: arptype 32 not supported by libpcap - falling back to cooked socket tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on ib0, link-type LINUX_SLL (Linux cooked), capture size 96 bytes 09:48:39.833316 IP iclust-1-ib0 > iclust-16-ib0: icmp 64: echo request seq 0 09:48:39.833406 IP iclust-16-ib0 > iclust-1-ib0: icmp 64: echo reply seq 0 09:48:40.833356 IP iclust-1-ib0 > iclust-16-ib0: icmp 64: echo request seq 1 09:48:40.833419 IP iclust-16-ib0 > iclust-1-ib0: icmp 64: echo reply seq 1 09:48:44.832002 arp who-has iclust-1-ib0 tell iclust-16-ib0 hardware #32 09:48:44.832023 arp reply iclust-1-ib0 is-at 00:00:04:04:fe:80:00:00:00:00:00:00:00:02:c9:01:0a:d2:58:f1 hardware #32 From mshefty at ichips.intel.com Tue Apr 11 10:33:28 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 11 Apr 2006 10:33:28 -0700 Subject: [openib-general] Re: Dual Sided RMPP Support as well as OpenSMImplications In-Reply-To: References: Message-ID: <443BE868.6090202@ichips.intel.com> 
Rimmer, Todd wrote: > It is a bad idea to implement a custom double sided approach. This will > suddenly create various compliance and interop issues. For example Windows > Open Fabrics and Linux OpenSM might not interoperate. Not to mention other > OSs (such as Solaris) which have their own IB stacks. My proposal is that the spec be updated to limit DS RMPP to SA GetMulti only, which is the only thing it is currently defined for. A vendor-specific class that requires RMPP requests followed by RMPP responses would be defined to use two sender-initiated transfers. Currently, such vendor transactions are undefined. As a second proposal, remove DS RMPP from the spec. > More interestingly, it is very likely that most uses of getmulti would > involve a requestor providing a request which would fit into a single MAD > packet and the RMPP protocol would not be fully needed by the sender (eg. > just the simple case of a single packet RMPP transfer by sender with a > multipacket RMPP response). You will note up to 10 GIDs can fit in the > request within a single packet. The most common uses will probably involve 2 > source GIDs and 2 destination GIDs. Whether the request fits into a single MAD or not, the DS RMPP protocol is supposed to be invoked. That means an ACK is sent, and an ACK of that ACK is also generated. The implementation difficulty lies in maintaining the context of the DS RMPP transaction on the receiving side, not in the segmentation and reassembly. > Hence perhaps the complexity of a compliant double sided solution could even > be avoided for now. Right now, an RMPP request followed by an RMPP response works, but it isn't compliant with the DS RMPP protocol. It is compliant with using two sender-initiated transfers. 
- Sean From sweitzen at cisco.com Tue Apr 11 10:33:42 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 11 Apr 2006 10:33:42 -0700 Subject: [openib-general] tcpdump command issue on IB network Message-ID: Could you please check in binaries to svn, or let me get them some other way? Thanks.... Scott > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Bob Woodruff > Sent: Tuesday, April 11, 2006 10:30 AM > To: Roland Dreier (rdreier) > Cc: Jerome Taylor; openib-general at openib.org > Subject: RE: [openib-general] tcpdump command issue on IB network > > Roland wrote, > > >tcpdump will work fine. It doesn't look at the hardware address in > >the structure at all. > > > - R. > > I removed the check and indeed tcpdump does now appear to work in > a modified 2.6.9-34EL kernel. I can add this change to my > backport-to-2.6.9 patches for 2.6.9-34EL for people that use them. > > [root at iclust-1 woody]# tcpdump -i ib0 > tcpdump: WARNING: arptype 32 not supported by libpcap - > falling back to > cooked socket > tcpdump: verbose output suppressed, use -v or -vv for full > protocol decode > listening on ib0, link-type LINUX_SLL (Linux cooked), capture > size 96 bytes > 09:48:39.833316 IP iclust-1-ib0 > iclust-16-ib0: icmp 64: > echo request seq 0 > 09:48:39.833406 IP iclust-16-ib0 > iclust-1-ib0: icmp 64: > echo reply seq 0 > 09:48:40.833356 IP iclust-1-ib0 > iclust-16-ib0: icmp 64: > echo request seq 1 > 09:48:40.833419 IP iclust-16-ib0 > iclust-1-ib0: icmp 64: > echo reply seq 1 > 09:48:44.832002 arp who-has iclust-1-ib0 tell iclust-16-ib0 > hardware #32 > 09:48:44.832023 arp reply iclust-1-ib0 is-at > 00:00:04:04:fe:80:00:00:00:00:00:00:00:02:c9:01:0a:d2:58:f1 > hardware #32 
From ardavis at ichips.intel.com Tue Apr 11 10:45:15 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 11 Apr 2006 10:45:15 -0700 Subject: [openib-general] Re: [uDAPL] dtest server never ends when using the dapl provider "OpenIB-scm1" In-Reply-To: References: <200604111506.18067.dotanb@mellanox.co.il> <200604111715.55534.dotanb@mellanox.co.il> Message-ID: <443BEB2B.4080203@ichips.intel.com> James Lentini wrote: > > > It sounds like the disconnect is being lost. Let me see if I can > >reproduce this. > >Arlin, have you ever seen this? > > No, it runs fine on my systems. It looks like the ping pong test on the server side did not finish. Can Dotan add a -v switch to the dtest to help isolate? What svn version are we running? Do you have the latest uDAPL fixes committed in 6393? -arlin From robert.j.woodruff at intel.com Tue Apr 11 10:55:42 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 11 Apr 2006 10:55:42 -0700 Subject: [openib-general] tcpdump command issue on IB network In-Reply-To: Message-ID: <000201c65d91$2653e030$010fa8c0@amr.corp.intel.com> >Could you please check in binaries to svn, or let me get them some other >way? Thanks.... >Scott I'll add it to my next set of backport patches and test RPMs and let you know when they are checked in. It might be a day or 2, since I will need to regression test after I make new patches. woody From sean.hefty at intel.com Tue Apr 11 11:11:54 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 11 Apr 2006 11:11:54 -0700 Subject: [openib-general] [PATCH v3] SA query: expose retries through API In-Reply-To: Message-ID: Currently, the SA query interface does not permit retrying requests automatically. Expose this capability to take advantage of the underlying MAD layer API, which provides it basically for free because of RMPP. 
Without automatic retries pushed down into the SA query module, retries are assigned new TIDs, and appear as separate requests. This means that a delayed response will be dropped, and the remote side will not detect that the request is a duplicate, so will re-calculate the response. Signed-off-by: Sean Hefty --- Index: include/rdma/ib_sa.h =================================================================== --- include/rdma/ib_sa.h (revision 6418) +++ include/rdma/ib_sa.h (working copy) @@ -257,7 +257,7 @@ void ib_sa_cancel_query(int id, struct i int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_path_rec *resp, void *context), @@ -268,7 +268,7 @@ int ib_sa_mcmember_rec_query(struct ib_d u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_mcmember_rec *resp, void *context), @@ -279,7 +279,7 @@ int ib_sa_service_rec_query(struct ib_de u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_service_rec *resp, void *context), @@ -293,6 +293,7 @@ int ib_sa_service_rec_query(struct ib_de * @rec:MCMember Record to send in query * @comp_mask:component mask to send in query * @timeout_ms:time to wait for response + * @retries:number of times to retry request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when query completes, times out or is * canceled @@ -314,7 +315,7 @@ static inline int ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t 
gfp_mask, void (*callback)(int status, struct ib_sa_mcmember_rec *resp, void *context), @@ -324,7 +325,7 @@ ib_sa_mcmember_rec_set(struct ib_device return ib_sa_mcmember_rec_query(device, port_num, IB_MGMT_METHOD_SET, rec, comp_mask, - timeout_ms, gfp_mask, callback, + timeout_ms, retries, gfp_mask, callback, context, query); } @@ -335,6 +336,7 @@ ib_sa_mcmember_rec_set(struct ib_device * @rec:MCMember Record to send in query * @comp_mask:component mask to send in query * @timeout_ms:time to wait for response + * @retries:number of times to retry request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when query completes, times out or is * canceled @@ -356,7 +358,7 @@ static inline int ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_mcmember_rec *resp, void *context), @@ -366,7 +368,7 @@ ib_sa_mcmember_rec_delete(struct ib_devi return ib_sa_mcmember_rec_query(device, port_num, IB_SA_METHOD_DELETE, rec, comp_mask, - timeout_ms, gfp_mask, callback, + timeout_ms, retries, gfp_mask, callback, context, query); } Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 6418) +++ core/sa_query.c (working copy) @@ -482,7 +482,7 @@ static void init_mad(struct ib_sa_mad *m spin_unlock_irqrestore(&tid_lock, flags); } -static int send_mad(struct ib_sa_query *query, int timeout_ms) +static int send_mad(struct ib_sa_query *query, int timeout_ms, int retries) { unsigned long flags; int ret, id; @@ -499,6 +499,7 @@ retry: return ret; query->mad_buf->timeout_ms = timeout_ms; + query->mad_buf->retries = retries; query->mad_buf->context[0] = query; query->id = id; @@ -555,6 +556,7 @@ static void ib_sa_path_rec_release(struc * @rec:Path Record to send in query * @comp_mask:component mask to 
send in query * @timeout_ms:time to wait for response + * @retries:number of times to retry request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when query completes, times out or is * canceled @@ -575,7 +577,7 @@ static void ib_sa_path_rec_release(struc int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_path_rec *resp, void *context), @@ -624,7 +626,7 @@ int ib_sa_path_rec_get(struct ib_device *sa_query = &query->sa_query; - ret = send_mad(&query->sa_query, timeout_ms); + ret = send_mad(&query->sa_query, timeout_ms, retries); if (ret < 0) goto err2; @@ -670,6 +672,7 @@ static void ib_sa_service_rec_release(st * @rec:Service Record to send in request * @comp_mask:component mask to send in request * @timeout_ms:time to wait for response + * @retries:number of times to retry request * @gfp_mask:GFP mask to use for internal allocations * @callback:function called when request completes, times out or is * canceled @@ -691,7 +694,7 @@ static void ib_sa_service_rec_release(st int ib_sa_service_rec_query(struct ib_device *device, u8 port_num, u8 method, struct ib_sa_service_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_service_rec *resp, void *context), @@ -746,7 +749,7 @@ int ib_sa_service_rec_query(struct ib_de *sa_query = &query->sa_query; - ret = send_mad(&query->sa_query, timeout_ms); + ret = send_mad(&query->sa_query, timeout_ms, retries); if (ret < 0) goto err2; @@ -788,7 +791,7 @@ int ib_sa_mcmember_rec_query(struct ib_d u8 method, struct ib_sa_mcmember_rec *rec, ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, + int timeout_ms, int retries, gfp_t gfp_mask, void (*callback)(int status, struct ib_sa_mcmember_rec *resp, 
void *context), @@ -838,7 +841,7 @@ int ib_sa_mcmember_rec_query(struct ib_d *sa_query = &query->sa_query; - ret = send_mad(&query->sa_query, timeout_ms); + ret = send_mad(&query->sa_query, timeout_ms, retries); if (ret < 0) goto err2; Index: ulp/srp/ib_srp.c =================================================================== --- ulp/srp/ib_srp.c (revision 6418) +++ ulp/srp/ib_srp.c (working copy) @@ -257,7 +257,7 @@ static int srp_lookup_path(struct srp_ta IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH | IB_SA_PATH_REC_PKEY, - SRP_PATH_REC_TIMEOUT_MS, + SRP_PATH_REC_TIMEOUT_MS, 0, GFP_KERNEL, srp_path_rec_completion, target, &target->path_query); Index: ulp/sdp/sdp_link.c =================================================================== --- ulp/sdp/sdp_link.c (revision 6418) +++ ulp/sdp/sdp_link.c (working copy) @@ -323,7 +323,7 @@ static void sdp_link_path_rec_done(int s IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH), - info->sa_time, + info->sa_time, 0, GFP_KERNEL, sdp_link_path_rec_done, info, @@ -359,7 +359,7 @@ static int sdp_link_path_rec_get(struct IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH), - info->sa_time, + info->sa_time, 0, GFP_KERNEL, sdp_link_path_rec_done, info, Index: core/at.c =================================================================== --- core/at.c (revision 6418) +++ core/at.c (working copy) @@ -216,7 +216,7 @@ static void ib_dev_ats_op(struct ib_at_d op, rec, mask, - IB_AT_REQ_RETRY_MS, + IB_AT_REQ_RETRY_MS, 0, GFP_KERNEL, ats_op_complete, ib_dev, @@ -1118,7 +1118,7 @@ static int resolve_ats_ips(struct ats_ip IB_MGMT_METHOD_GET, rec, IB_ATS_GET_PRIM_IP_MASK, - req->pend.timeout_ms, + req->pend.timeout_ms, 0, GFP_KERNEL, ats_ips_req_complete, req, @@ -1163,7 +1163,7 @@ static int resolve_ats_route(struct rout IB_MGMT_METHOD_GET, rec, IB_ATS_GET_GID_MASK, - req->pend.timeout_ms, + req->pend.timeout_ms, 0, GFP_KERNEL, ats_route_req_complete, req, @@ -1226,7 +1226,7 @@ static int 
resolve_path(struct path_req IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH), - req->pend.timeout_ms, + req->pend.timeout_ms, 0, GFP_KERNEL, path_req_complete, req, Index: core/cma.c =================================================================== --- core/cma.c (revision 6418) +++ core/cma.c (working copy) @@ -1064,7 +1064,7 @@ static int cma_query_ib_route(struct rdm id_priv->id.port_num, &path_rec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, - timeout_ms, GFP_KERNEL, + timeout_ms, 0, GFP_KERNEL, cma_query_handler, work, &id_priv->query); return (id_priv->query_id < 0) ? id_priv->query_id : 0; Index: ulp/ipoib/ipoib_main.c =================================================================== --- ulp/ipoib/ipoib_main.c (revision 6418) +++ ulp/ipoib/ipoib_main.c (working copy) @@ -471,7 +471,7 @@ static int path_rec_start(struct net_dev IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH | IB_SA_PATH_REC_PKEY, - 1000, GFP_ATOMIC, + 1000, 0, GFP_ATOMIC, path_rec_completion, path, &path->query); if (path->query_id < 0) { From mshefty at ichips.intel.com Tue Apr 11 11:17:32 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 11 Apr 2006 11:17:32 -0700 Subject: [openib-general] Re: [PATCH v3] SA query: expose retries through API In-Reply-To: References: Message-ID: <443BF2BC.7080901@ichips.intel.com> I've committed this to svn. - Sean From sweitzen at cisco.com Tue Apr 11 11:19:17 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 11 Apr 2006 11:19:17 -0700 Subject: [openib-general] tcpdump command issue on IB network Message-ID: > I'll add it to my next set of backport patches and test RPMS > and let you > know > when they are checked in, might be a day or 2, since I will need to > regression test after I make new patches. For these RPMs, what is something like kernel-smp-2.6.9-34.OpenIB.6055.trunk.EL.root.x86_64.rpm? 
Why would I want to test that instead of the RH 2.6.9-34 kernel? Scott From rdreier at cisco.com Tue Apr 11 11:20:22 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Apr 2006 11:20:22 -0700 Subject: [openib-general] tcpdump command issue on IB network In-Reply-To: (Scott Weitzenkamp's message of "Tue, 11 Apr 2006 11:19:17 -0700") References: Message-ID: Scott> For these RPMs, what is something like Scott> kernel-smp-2.6.9-34.OpenIB.6055.trunk.EL.root.x86_64.rpm? Scott> Why would I want to test that instead of the RH 2.6.9-34 Scott> kernel? Because it has things like the tcpdump problem fixed... From sweitzen at cisco.com Tue Apr 11 11:23:10 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 11 Apr 2006 11:23:10 -0700 Subject: [openib-general] tcpdump command issue on IB network Message-ID: > Scott> For these RPMs, what is something like > Scott> kernel-smp-2.6.9-34.OpenIB.6055.trunk.EL.root.x86_64.rpm? > Scott> Why would I want to test that instead of the RH 2.6.9-34 > Scott> kernel? > > Because it has things like the tcpdump problem fixed... Bummer, there's no way to get just a tcpdump binary that will work with RHEL4 kernel? Scott From rdreier at cisco.com Tue Apr 11 11:24:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Apr 2006 11:24:20 -0700 Subject: [openib-general] tcpdump command issue on IB network In-Reply-To: (Scott Weitzenkamp's message of "Tue, 11 Apr 2006 11:23:10 -0700") References: Message-ID: Scott> Bummer, there's no way to get just a tcpdump binary that Scott> will work with RHEL4 kernel? You could do it I guess but no one has come up with the patch... From sean.hefty at intel.com Tue Apr 11 11:28:03 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 11 Apr 2006 11:28:03 -0700 Subject: [openib-general] [PATCH v3 2/2] SA query: expose retries through API In-Reply-To: Message-ID: We also need the following changes to IPoIB for this patch to stand alone from the multicast changes. 
- Sean --- Index: ulp/ipoib/ipoib_multicast.c =================================================================== --- ulp/ipoib/ipoib_multicast.c (revision 6418) +++ ulp/ipoib/ipoib_multicast.c (working copy) @@ -365,7 +365,7 @@ static int ipoib_mcast_sendonly_join(str IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | IB_SA_MCMEMBER_REC_JOIN_STATE, - 1000, GFP_ATOMIC, + 1000, 0, GFP_ATOMIC, ipoib_mcast_sendonly_join_complete, mcast, &mcast->query); if (ret < 0) { @@ -485,7 +485,7 @@ static void ipoib_mcast_join(struct net_ init_completion(&mcast->done); ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, - mcast->backoff * 1000, GFP_ATOMIC, + mcast->backoff * 1000, 0, GFP_ATOMIC, ipoib_mcast_join_complete, mcast, &mcast->query); @@ -685,7 +685,7 @@ static int ipoib_mcast_leave(struct net_ IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | IB_SA_MCMEMBER_REC_JOIN_STATE, - 0, GFP_ATOMIC, NULL, + 0, 0, GFP_ATOMIC, NULL, mcast, &mcast->query); if (ret < 0) ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " From sweitzen at cisco.com Tue Apr 11 11:35:46 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 11 Apr 2006 11:35:46 -0700 Subject: [openib-general] tcpdump command issue on IB network Message-ID: > Scott> Bummer, there's no way to get just a tcpdump binary that > Scott> will work with RHEL4 kernel? > > You could do it I guess but no one has come up with the patch... > Roland, can I talk you into doing it? Scott From robert.j.woodruff at intel.com Tue Apr 11 11:36:14 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 11 Apr 2006 11:36:14 -0700 Subject: [openib-general] tcpdump command issue on IB network In-Reply-To: Message-ID: <000301c65d96$d017dc20$010fa8c0@amr.corp.intel.com> Scott wrote, >Bummer, there's no way to get just a tcpdump binary that will work with >RHEL4 kernel? >Scott I think that net/core/dev.c is part of the core kernel and not a loadable module. 
Also, not sure if tcpdump can be modified to work with the kernel that has the overflow check. woody From sean.hefty at intel.com Tue Apr 11 11:46:49 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 11 Apr 2006 11:46:49 -0700 Subject: [openib-general] [PATCH v3] ipoib: convert to use new multicast interface In-Reply-To: Message-ID: Convert ipoib to make use of the new multicast module interface. Signed-off-by: Sean Hefty --- I've committed the multicast module, so this can be applied anytime you think that it's ready. These changes work as far as my testing has gone. Index: ulp/ipoib/ipoib_multicast.c =================================================================== --- ulp/ipoib/ipoib_multicast.c (revision 6420) +++ ulp/ipoib/ipoib_multicast.c (working copy) @@ -45,6 +45,8 @@ #include +#include + #include "ipoib.h" #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG @@ -60,14 +62,11 @@ static DEFINE_MUTEX(mcast_mutex); /* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ struct ipoib_mcast { struct ib_sa_mcmember_rec mcmember; + struct ib_multicast *mc; struct ipoib_ah *ah; struct rb_node rb_node; struct list_head list; - struct completion done; - - int query_id; - struct ib_sa_query *query; unsigned long created; unsigned long backoff; @@ -299,18 +298,18 @@ static int ipoib_mcast_join_finish(struc return 0; } -static void +static int ipoib_mcast_sendonly_join_complete(int status, - struct ib_sa_mcmember_rec *mcmember, - void *mcast_ptr) + struct ib_multicast *multicast) { - struct ipoib_mcast *mcast = mcast_ptr; + struct ipoib_mcast *mcast = multicast->context; struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); if (!status) - ipoib_mcast_join_finish(mcast, mcmember); - else { + status = ipoib_mcast_join_finish(mcast, &multicast->rec); + + if (status) { if (mcast->logcount++ < 20) ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " IPOIB_GID_FMT ", status %d\n", @@ -325,10 +324,10 @@ 
ipoib_mcast_sendonly_join_complete(int s spin_unlock_irq(&priv->tx_lock); /* Clear the busy flag so we try again */ - clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, + &mcast->flags); } - - complete(&mcast->done); + return status; } static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) @@ -358,35 +357,32 @@ static int ipoib_mcast_sendonly_join(str rec.port_gid = priv->local_gid; rec.pkey = cpu_to_be16(priv->pkey); - init_completion(&mcast->done); - - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, - IB_SA_MCMEMBER_REC_MGID | - IB_SA_MCMEMBER_REC_PORT_GID | - IB_SA_MCMEMBER_REC_PKEY | - IB_SA_MCMEMBER_REC_JOIN_STATE, - 1000, 0, GFP_ATOMIC, - ipoib_mcast_sendonly_join_complete, - mcast, &mcast->query); - if (ret < 0) { - ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", + mcast->mc = ib_join_multicast(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + GFP_ATOMIC, + ipoib_mcast_sendonly_join_complete, + mcast); + if (IS_ERR(mcast->mc)) { + ret = PTR_ERR(mcast->mc); + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + ipoib_warn(priv, "ib_join_multicast failed (ret = %d)\n", ret); } else { ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT ", starting join\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); - - mcast->query_id = ret; } return ret; } -static void ipoib_mcast_join_complete(int status, - struct ib_sa_mcmember_rec *mcmember, - void *mcast_ptr) +static int ipoib_mcast_join_complete(int status, + struct ib_multicast *multicast) { - struct ipoib_mcast *mcast = mcast_ptr; + struct ipoib_mcast *mcast = multicast->context; struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -394,23 +390,20 @@ static void ipoib_mcast_join_complete(in " (status %d)\n", IPOIB_GID_ARG(mcast->mcmember.mgid), status); - if (!status && 
!ipoib_mcast_join_finish(mcast, mcmember)) { + if (!status) + status = ipoib_mcast_join_finish(mcast, &multicast->rec); + + if (!status) { mcast->backoff = 1; mutex_lock(&mcast_mutex); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_work(ipoib_workqueue, &priv->mcast_task); mutex_unlock(&mcast_mutex); - complete(&mcast->done); - return; - } - - if (status == -EINTR) { - complete(&mcast->done); - return; + return 0; } - if (status && mcast->logcount++ < 20) { - if (status == -ETIMEDOUT || status == -EINTR) { + if (mcast->logcount++ < 20) { + if (status == -ETIMEDOUT) { ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT ", status %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), @@ -427,23 +420,18 @@ static void ipoib_mcast_join_complete(in if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; - mutex_lock(&mcast_mutex); + /* Clear the busy flag so we try again */ + status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + mutex_lock(&mcast_mutex); spin_lock_irq(&priv->lock); - mcast->query = NULL; - - if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { - if (status == -ETIMEDOUT) - queue_work(ipoib_workqueue, &priv->mcast_task); - else - queue_delayed_work(ipoib_workqueue, &priv->mcast_task, - mcast->backoff * HZ); - } else - complete(&mcast->done); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + mcast->backoff * HZ); spin_unlock_irq(&priv->lock); mutex_unlock(&mcast_mutex); - return; + return status; } static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, @@ -482,15 +470,14 @@ static void ipoib_mcast_join(struct net_ rec.traffic_class = priv->broadcast->mcmember.traffic_class; } - init_completion(&mcast->done); - - ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, - mcast->backoff * 1000, 0, GFP_ATOMIC, - ipoib_mcast_join_complete, - mcast, &mcast->query); - - if (ret < 0) { - ipoib_warn(priv, 
"ib_sa_mcmember_rec_set failed, status %d\n", ret); + set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + mcast->mc = ib_join_multicast(priv->ca, priv->port, &rec, comp_mask, + GFP_KERNEL, ipoib_mcast_join_complete, + mcast); + if (IS_ERR(mcast->mc)) { + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + ret = PTR_ERR(mcast->mc); + ipoib_warn(priv, "ib_join_multicast failed, status %d\n", ret); mcast->backoff *= 2; if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) @@ -502,8 +489,7 @@ static void ipoib_mcast_join(struct net_ &priv->mcast_task, mcast->backoff * HZ); mutex_unlock(&mcast_mutex); - } else - mcast->query_id = ret; + } } void ipoib_mcast_join_task(void *dev_ptr) @@ -554,7 +540,8 @@ void ipoib_mcast_join_task(void *dev_ptr } if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { - ipoib_mcast_join(dev, priv->broadcast, 0); + if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) + ipoib_mcast_join(dev, priv->broadcast, 0); return; } @@ -609,26 +596,9 @@ int ipoib_mcast_start_thread(struct net_ return 0; } -static void wait_for_mcast_join(struct ipoib_dev_priv *priv, - struct ipoib_mcast *mcast) -{ - spin_lock_irq(&priv->lock); - if (mcast && mcast->query) { - ib_sa_cancel_query(mcast->query_id, mcast->query); - mcast->query = NULL; - spin_unlock_irq(&priv->lock); - ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(mcast->mcmember.mgid)); - wait_for_completion(&mcast->done); - } - else - spin_unlock_irq(&priv->lock); -} - int ipoib_mcast_stop_thread(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_mcast *mcast; ipoib_dbg_mcast(priv, "stopping multicast thread\n"); @@ -644,52 +614,27 @@ int ipoib_mcast_stop_thread(struct net_d if (flush) flush_workqueue(ipoib_workqueue); - wait_for_mcast_join(priv, priv->broadcast); - - list_for_each_entry(mcast, &priv->multicast_list, list) - wait_for_mcast_join(priv, mcast); - return 0; } static int ipoib_mcast_leave(struct 
net_device *dev, struct ipoib_mcast *mcast) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_sa_mcmember_rec rec = { - .join_state = 1 - }; int ret = 0; - if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) - return 0; - - ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(mcast->mcmember.mgid)); - - rec.mgid = mcast->mcmember.mgid; - rec.port_gid = priv->local_gid; - rec.pkey = cpu_to_be16(priv->pkey); + if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); - /* Remove ourselves from the multicast group */ - ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), - &mcast->mcmember.mgid); - if (ret) - ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + /* Remove ourselves from the multicast group */ + ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret) + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + } - /* - * Just make one shot at leaving and don't wait for a reply; - * if we fail, too bad. 
- */ - ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, - IB_SA_MCMEMBER_REC_MGID | - IB_SA_MCMEMBER_REC_PORT_GID | - IB_SA_MCMEMBER_REC_PKEY | - IB_SA_MCMEMBER_REC_JOIN_STATE, - 0, 0, GFP_ATOMIC, NULL, - mcast, &mcast->query); - if (ret < 0) - ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " - "for leave (result = %d)\n", ret); + if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) + ib_free_multicast(mcast->mc); return 0; } @@ -743,7 +688,7 @@ void ipoib_mcast_send(struct net_device dev_kfree_skb_any(skb); } - if (mcast->query) + if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) ipoib_dbg_mcast(priv, "no address vector, " "but multicast join already started\n"); else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) @@ -900,7 +845,6 @@ void ipoib_mcast_restart_task(void *dev_ /* We have to cancel outside of the spinlock */ list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { - wait_for_mcast_join(priv, mcast); ipoib_mcast_leave(mcast->dev, mcast); ipoib_mcast_free(mcast); } From mst at mellanox.co.il Tue Apr 11 12:36:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Apr 2006 22:36:36 +0300 Subject: [openib-general] Re: [RFC] [PATCH 2/2 v2] ipoib: convert to use new multicast interface In-Reply-To: <443BDE0C.7020700@ichips.intel.com> References: <443BDE0C.7020700@ichips.intel.com> Message-ID: <20060411193636.GA27908@mellanox.co.il> Quoting r. 
Sean Hefty : > Subject: Re: [RFC] [PATCH 2/2 v2] ipoib: convert to use new multicast interface > > Sean Hefty wrote: > > void ipoib_mcast_join_task(void *dev_ptr) > >@@ -553,7 +539,8 @@ void ipoib_mcast_join_task(void *dev_ptr > > spin_unlock_irq(&priv->lock); > > } > > > >- if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { > >+ if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags) && > >+ !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) { > > ipoib_mcast_join(dev, priv->broadcast, 0); > > return; > > } > > The change above needs to be: > > if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { > if (!test_bit(IPOIB_MCAST_FLAG_BUSY, > &priv->broadcast->flags)) > ipoib_mcast_join(dev, priv->broadcast, 0); > return; > } > > Or additional join requests will start before we've finished joining the > broadcast group. > > - Sean Hmm, but this seems like 2.6.17 material. It should have the same effect with or without multicast group patch. Right? -- MST From mshefty at ichips.intel.com Tue Apr 11 13:07:44 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 11 Apr 2006 13:07:44 -0700 Subject: [openib-general] Re: [RFC] [PATCH 2/2 v2] ipoib: convert to use new multicast interface In-Reply-To: <20060411193636.GA27908@mellanox.co.il> References: <443BDE0C.7020700@ichips.intel.com> <20060411193636.GA27908@mellanox.co.il> Message-ID: <443C0C90.2010904@ichips.intel.com> Michael S. 
Tsirkin wrote: >>>void ipoib_mcast_join_task(void *dev_ptr) >>>@@ -553,7 +539,8 @@ void ipoib_mcast_join_task(void *dev_ptr >>> spin_unlock_irq(&priv->lock); >>> } >>> >>>- if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { >>>+ if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags) && >>>+ !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) { >>> ipoib_mcast_join(dev, priv->broadcast, 0); >>> return; >>> } >> >>The change above needs to be: >> >> if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { >> if (!test_bit(IPOIB_MCAST_FLAG_BUSY, >> &priv->broadcast->flags)) >> ipoib_mcast_join(dev, priv->broadcast, 0); >> return; >> } >> >>Or additional join requests will start before we've finished joining the >>broadcast group. >> >>- Sean > > > Hmm, but this seems like 2.6.17 material. > It should have the same effect with or without multicast group patch. Right? I'm not sure if the code has the same effect with or without the rest of the multicast changes. In theory, it seems like adding this change would work, but I don't know that it's necessary. The original code keyed entirely off of the ATTACHED flag. I added a check for the BUSY flag to prevent issuing multiple join requests for the broadcast group, which is necessary for proper interaction with ib_multicast, since every join request must be followed by a call to free. - Sean From rdreier at cisco.com Tue Apr 11 13:18:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Apr 2006 13:18:54 -0700 Subject: [openib-general] tcpdump command issue on IB network In-Reply-To: (Scott Weitzenkamp's message of "Tue, 11 Apr 2006 11:35:46 -0700") References: Message-ID: Scott> Roland, can I talk you into doing it? Not any time soon... I have plenty to do as it is, and I don't have that much interest in providing a workaround for a Red Hat bug that isn't present in the standard kernel. - R. 
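A note on the tcpdump issue discussed in this thread: the kernel patch posted to the list removes a check in the SIOCGIFHWADDR handler that returns -EOVERFLOW when dev->addr_len is larger than sizeof ifr->ifr_hwaddr.sa_data. An IPoIB hardware address is 20 bytes, while sa_data in struct sockaddr holds 14 bytes on Linux, so a check of that form rejects IPoIB interfaces. What follows is only a minimal user-space sketch of that length check, not the kernel code; the names sketch_copy_hwaddr, SKETCH_ETH_ALEN and SKETCH_INFINIBAND_ALEN are made up for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical names for illustration; the values mirror the usual
 * Ethernet (6-byte) and IPoIB (20-byte) hardware address lengths. */
#define SKETCH_ETH_ALEN        6
#define SKETCH_INFINIBAND_ALEN 20

/*
 * User-space imitation of the strict length check: refuse to copy a
 * hardware address that does not fit in sockaddr.sa_data.
 * Returns 0 on success or -EOVERFLOW if the address is too long.
 */
int sketch_copy_hwaddr(struct sockaddr *out,
                       const unsigned char *addr, size_t addr_len)
{
	if (addr_len > sizeof out->sa_data)
		return -EOVERFLOW;

	memset(out->sa_data, 0, sizeof out->sa_data);
	memcpy(out->sa_data, addr, addr_len);
	return 0;
}
```

Under these assumptions, a call with a 6-byte Ethernet address succeeds, while a call with a 20-byte IPoIB address fails the check, which is presumably why tools that query the hardware address have trouble with IPoIB interfaces on the affected kernel.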
From rdreier at cisco.com Tue Apr 11 13:32:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Apr 2006 13:32:28 -0700 Subject: [openib-general] RE: static rate encoding changes In-Reply-To: (Hoang-Nam Nguyen's message of "Tue, 11 Apr 2006 10:30:50 +0200") References: Message-ID: I applied this, thanks. Note that if you depend on this in libehca, then you won't work with old releases of libibverbs (which are in Fedora Extras and Debian for example). So you may want to test for the functions in your configure.in and include a local copy if you don't find them in libibverbs. Since this is just adding new functions, so it doesn't affect binary or source compatibility, it will be in the next libibverbs 1.0.3 release. - R. From rdreier at cisco.com Tue Apr 11 13:39:02 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Apr 2006 13:39:02 -0700 Subject: [openib-general] Re: [RFC] [PATCH 2/2 v2] ipoib: convert to use new multicast interface In-Reply-To: <20060411193636.GA27908@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 11 Apr 2006 22:36:36 +0300") References: <443BDE0C.7020700@ichips.intel.com> <20060411193636.GA27908@mellanox.co.il> Message-ID: Michael> Hmm, but this seems like 2.6.17 material. It should have Michael> the same effect with or without multicast group Michael> patch. Right? I don't think so. With the current code, it shouldn't be possible to get to that line with a join of the broadcast group pending. - R. From mst at mellanox.co.il Tue Apr 11 13:44:11 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Apr 2006 23:44:11 +0300 Subject: [openib-general] Re: [RFC] [PATCH 2/2 v2] ipoib: convert to use new multicast interface In-Reply-To: References: <443BDE0C.7020700@ichips.intel.com> <20060411193636.GA27908@mellanox.co.il> Message-ID: <20060411204411.GA28553@mellanox.co.il> Quoting r. 
Roland Dreier : > Subject: Re: [RFC] [PATCH 2/2 v2] ipoib: convert to use new multicast interface > > Michael> Hmm, but this seems like 2.6.17 material. It should have > Michael> the same effect with or without multicast group > Michael> patch. Right? > > I don't think so. With the current code, it shouldn't be possible to > get to that line with a join of the broadcast group pending. Not sure what changed - I thought new multicast has same API. I'd have to review the code again then - we had lot of subtle bugs in this area ... -- MST From or.gerlitz at gmail.com Tue Apr 11 13:43:18 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Tue, 11 Apr 2006 22:43:18 +0200 Subject: [openib-general] Re: problems cloning infiniband.git In-Reply-To: References: <443B567E.7000005@voltaire.com> Message-ID: <15ddcffd0604111343j697b003ck8dbf74d75b1df6f2@mail.gmail.com> On 4/11/06, Roland Dreier wrote: > Or> So i am downloading from a mirror of kernel.org and it might > Or> work with the a non mirror? what would be the url to download > Or> it from kernel.org, all the ones I've tried to derive from > > master.kernel.org is the main machine. but I'm not sure whether you > can clone directly from that. > > Are you able to clone with rsync:// or git:// URLs? I will ask our sysadmin to have make open in the firewall, give it a try and let you know. But it would be next week as we have a holiday now. Or. From mshefty at ichips.intel.com Tue Apr 11 13:45:31 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 11 Apr 2006 13:45:31 -0700 Subject: [openib-general] Re: [RFC] [PATCH 2/2 v2] ipoib: convert to use new multicast interface In-Reply-To: <20060411204411.GA28553@mellanox.co.il> References: <443BDE0C.7020700@ichips.intel.com> <20060411193636.GA27908@mellanox.co.il> <20060411204411.GA28553@mellanox.co.il> Message-ID: <443C156B.5030000@ichips.intel.com> Michael S. Tsirkin wrote: > Not sure what changed - I thought new multicast has same API. 
> I'd have to review the code again then - we had lot of subtle bugs in > this area ... The new multicast module won't go into 2.6.17. - Sean From rdreier at cisco.com Tue Apr 11 13:45:42 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Apr 2006 13:45:42 -0700 Subject: [openib-general] Re: [RFC] [PATCH 2/2 v2] ipoib: convert to use new multicast interface In-Reply-To: <20060411204411.GA28553@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 11 Apr 2006 23:44:11 +0300") References: <443BDE0C.7020700@ichips.intel.com> <20060411193636.GA27908@mellanox.co.il> <20060411204411.GA28553@mellanox.co.il> Message-ID: Michael> Not sure what changed - I thought new multicast has same Michael> API. I'd have to review the code again then - we had lot Michael> of subtle bugs in this area ... Yes, that was my though too: why does the new multicast handling end up scheduling the multicast join task again before the broadcast join completes? - R. From rheflin at atipa.com Tue Apr 11 13:47:42 2006 From: rheflin at atipa.com (Roger Heflin) Date: Tue, 11 Apr 2006 15:47:42 -0500 Subject: [openib-general] RHEL4ASU3 question Message-ID: <443C15EE.10609@atipa.com> Hello, I have noticed the recent tcpdump issue related with the RHEL4U3 kernel, I have a cluster to install with Infiniband support and Redhat EL as the distribution. I am not completely sure exactly what RHEL4U3 comes with in terms of the Infiniband support, it definitely has parts of the OpenIB stuff on it. Does anyone have any suggestions on exactly how to proceed, here are the options as I see it: Use what RHEL4U3 has. Replace RHEL4U3 kernel with newest released kernel.org kernel and compile the OpenIB userspace tools for it. Usage wise the customer is going to be using it as a HPC cluster, so mvapich definitely needs to work, but other things that don't matter to an HPC cluster don't matter so much. 
Roger Atipa Technologies From robert.j.woodruff at intel.com Tue Apr 11 14:03:50 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 11 Apr 2006 14:03:50 -0700 Subject: [openib-general] tcpdump command issue on IB network In-Reply-To: Message-ID: <000401c65dab$6f83b090$010fa8c0@amr.corp.intel.com> Scott wrote, >Bummer, there's no way to get just a tcpdump binary that will work with >RHEL4 kernel? >Scott You can also apply this patch to your RHEL4 kernel source and rebuild the kernel. diff -Naurp linux-2.6.9/net/core/dev.c linux-2.6.9-fixups/net/core/dev.c --- linux-2.6.9/net/core/dev.c 2006-04-11 10:38:26.000000000 -0700 +++ linux-2.6.9-fixups/net/core/dev.c 2006-04-11 10:41:03.000000000 -0700 @@ -2304,8 +2304,6 @@ static int dev_ifsioc(struct ifreq *ifr, return dev_set_mtu(dev, ifr->ifr_mtu); case SIOCGIFHWADDR: - if ((size_t) dev->addr_len > sizeof ifr->ifr_hwaddr.sa_data) - return -EOVERFLOW; memset(ifr->ifr_hwaddr.sa_data, 0, sizeof ifr->ifr_hwaddr.sa_data); memcpy(ifr->ifr_hwaddr.sa_data, dev->dev_addr, dev->addr_len); ifr->ifr_hwaddr.sa_family = dev->type; ~ From sean.hefty at intel.com Tue Apr 11 14:07:32 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 11 Apr 2006 14:07:32 -0700 Subject: [openib-general] [PATCH] index: more strongly indicate that key is opaque Message-ID: More strongly indicate that the index key is opaque by using void* / size_t in place of u8* / unsigned int in the API. 
Signed-off-by: Sean Hefty --- Index: core/index.c =================================================================== --- core/index.c (revision 6418) +++ core/index.c (working copy) @@ -39,7 +39,7 @@ MODULE_DESCRIPTION("Indexing service"); MODULE_LICENSE("Dual BSD/GPL"); -void index_init(struct index_root *root, unsigned int key_length, +void index_init(struct index_root *root, size_t key_length, gfp_t gfp_mask) { memset(root, 0, sizeof *root); @@ -55,14 +55,14 @@ } EXPORT_SYMBOL(index_destroy); -void *index_insert(struct index_root *root, void *data, u8 *key) +void *index_insert(struct index_root *root, void *data, void *key) { struct index_node *node, *new_node; struct index_leaf *leaf; int i, k, j; for (node = &root->node, k = 0; 1; node = node->child[i], k++) { - i = key[k]; + i = *((u8 *) key + k); if (!node->child[i]) { leaf = kzalloc(sizeof *leaf + root->key_length, root->gfp_mask); @@ -98,14 +98,14 @@ } EXPORT_SYMBOL(index_insert); -void *index_find(struct index_root *root, u8 *key) +void *index_find(struct index_root *root, void *key) { struct index_node *node; struct index_leaf *leaf; int i, k; for (node = &root->node, k = 0; node; node = node->child[i], k++) { - i = key[k]; + i = *((u8 *) key + k); if (node->child_type[i] == INDEX_LEAF) { leaf = node->child[i]; if ((root->key_length > k) && @@ -120,7 +120,7 @@ } EXPORT_SYMBOL(index_find); -void *index_find_replace(struct index_root *root, void *data, u8 *key) +void *index_find_replace(struct index_root *root, void *data, void *key) { struct index_node *node; struct index_leaf *leaf; @@ -128,7 +128,7 @@ int i, k; for (node = &root->node, k = 0; node; node = node->child[i], k++) { - i = key[k]; + i = *((u8 *) key + k); if (node->child_type[i] == INDEX_LEAF) { leaf = node->child[i]; if ((root->key_length > k) && @@ -145,7 +145,7 @@ } EXPORT_SYMBOL(index_find_replace); -void *index_remove(struct index_root *root, u8 *key) +void *index_remove(struct index_root *root, void *key) { struct index_node *node, 
*temp_node; struct index_leaf *leaf; @@ -153,7 +153,7 @@ int i, k; for (node = &root->node, k = 0; node; node = node->child[i], k++) { - i = key[k]; + i = *((u8 *) key + k); if (node->child_type[i] == INDEX_LEAF) { leaf = node->child[i]; if (!memcmp(leaf->key + k, key + k, @@ -169,7 +169,7 @@ temp_node = node; node = node->parent; kfree(temp_node); - i = key[--k]; + i = *((u8 *) key + --k); } } return data; Index: include/linux/index.h =================================================================== --- include/linux/index.h (revision 6418) +++ include/linux/index.h (working copy) @@ -55,7 +55,7 @@ struct index_root { struct index_node node; - unsigned int key_length; + size_t key_length; gfp_t gfp_mask; }; @@ -66,7 +66,7 @@ * @gfp_mask: GFP mask to use when allocating resources inserting items into the * index. */ -void index_init(struct index_root *root, unsigned int key_length, +void index_init(struct index_root *root, size_t key_length, gfp_t gfp_mask); /** @@ -84,7 +84,7 @@ * exists in the index with the same key, returns that item. Otherwise, an * error will be returned. */ -void *index_insert(struct index_root *root, void *data, u8 *key); +void *index_insert(struct index_root *root, void *data, void *key); /** * index_find - Return a data item in the index associated with the given key. @@ -93,7 +93,7 @@ * * If the key is not found in the index, returns NULL. */ -void *index_find(struct index_root *root, u8 *key); +void *index_find(struct index_root *root, void *key); /** * index_find_replace - Replace a data item in the index associated with the @@ -105,7 +105,7 @@ * If an existing item is not found in the index, the replacement fails, and * the function returns NULL. */ -void *index_find_replace(struct index_root *root, void *data, u8 *key); +void *index_find_replace(struct index_root *root, void *data, void *key); /** * index_remove - Remove a data item from the index. 
@@ -114,7 +114,7 @@ * * Returns the data item removed from the index, or NULL if no item was found. */ -void *index_remove(struct index_root *root, u8 *key); +void *index_remove(struct index_root *root, void *key); /** * index_remove_all - Remove all index values, invoking a user-specified routine From rdreier at cisco.com Tue Apr 11 14:14:17 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Apr 2006 14:14:17 -0700 Subject: [openib-general] Re: problems cloning infiniband.git In-Reply-To: <15ddcffd0604111343j697b003ck8dbf74d75b1df6f2@mail.gmail.com> (Or Gerlitz's message of "Tue, 11 Apr 2006 22:43:18 +0200") References: <443B567E.7000005@voltaire.com> <15ddcffd0604111343j697b003ck8dbf74d75b1df6f2@mail.gmail.com> Message-ID: It looks like some mirrors have synched up. I was just able to clone by http:// from a mirror with no problem. - R. From sweitzen at cisco.com Tue Apr 11 14:43:26 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 11 Apr 2006 14:43:26 -0700 Subject: [openib-general] tcpdump command issue on IB network Message-ID: Thanks, but most of our IB customers will want to use a stock RHEL4 kernel, so that's what our SQA has to test with. Scott > -----Original Message----- > From: Bob Woodruff [mailto:robert.j.woodruff at intel.com] > Sent: Tuesday, April 11, 2006 2:04 PM > To: Scott Weitzenkamp (sweitzen); Roland Dreier (rdreier) > Cc: Jerome Taylor; openib-general at openib.org > Subject: RE: [openib-general] tcpdump command issue on IB network > > Scott wrote, > >Bummer, there's no way to get just a tcpdump binary that > will work with > >RHEL4 kernel? > > >Scott > > You can also apply this patch to your RHEL4 kernel source and > rebuild the > kernel. 
> > > diff -Naurp linux-2.6.9/net/core/dev.c > linux-2.6.9-fixups/net/core/dev.c > --- linux-2.6.9/net/core/dev.c 2006-04-11 10:38:26.000000000 -0700 > +++ linux-2.6.9-fixups/net/core/dev.c 2006-04-11 > 10:41:03.000000000 -0700 > @@ -2304,8 +2304,6 @@ static int dev_ifsioc(struct ifreq *ifr, > return dev_set_mtu(dev, ifr->ifr_mtu); > > case SIOCGIFHWADDR: > - if ((size_t) dev->addr_len > sizeof > ifr->ifr_hwaddr.sa_data) > - return -EOVERFLOW; > memset(ifr->ifr_hwaddr.sa_data, 0, sizeof > ifr->ifr_hwaddr.sa_data); > memcpy(ifr->ifr_hwaddr.sa_data, dev->dev_addr, > dev->addr_len); > ifr->ifr_hwaddr.sa_family = dev->type; > ~ > From robert.j.woodruff at intel.com Tue Apr 11 15:20:29 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 11 Apr 2006 15:20:29 -0700 Subject: [openib-general] RHEL4ASU3 question In-Reply-To: <443C15EE.10609@atipa.com> Message-ID: <000501c65db6$24a015e0$010fa8c0@amr.corp.intel.com> Roger wrote, >Does anyone have any suggestions on exactly how to proceed, here >are the options as I see it: >Use what RHEL4U3 has. >Replace RHEL4U3 kernel with newest released kernel.org kernel and > compile the OpenIB userspace tools for it. I have tested what is on the RedHat EL4.0 U3 with Intel MPI and it worked ok, so RedHat EL4.0 U3 has all of the userspace libraries needed to run MVAPICH, although I have not tried it, but I suspect it will work. There is one issue that I ran into with the stock RedHat EL4 U3 release and that is with the new Mellenox DDR card I had some problems with rdma, using uDAPL and suspect you would see the same issues with MVAPICH with those cards. The SDR cards seem to work fine with the code that is on the RedHat CD. 
Using a kernel.org kernel is also an option or if you want/need to use the RedHat EL4.0 kernel, but want a later version of the openib code, you can try my backport patches and/or test RPMs, https://openib.org/svn/gen2/branches/backport-to-2.6.9/ woody From rjwalsh at pathscale.com Tue Apr 11 15:31:37 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Tue, 11 Apr 2006 15:31:37 -0700 Subject: [openib-general] CentOS machines? Message-ID: <1144794697.30135.8.camel@hematite.internal.keyresearch.com> Do we have any CentOS 4.2 or 4.3 machines? Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 483 bytes Desc: This is a digitally signed message part URL: From johnip at sgi.com Tue Apr 11 15:32:58 2006 From: johnip at sgi.com (John Partridge) Date: Tue, 11 Apr 2006 17:32:58 -0500 Subject: [openib-general] RC1 rpms for sles10 ia64 Message-ID: <443C2E9A.2090907@sgi.com> Would anyone be interested in binary RPM's for sles10 ia64 for RC1 ? I am now working on rc2 RPMS and performance baselines. 
I have the following RPM's tested on SGI Altix 3000 3000Bx2 A350 and A330 : libibat-0.9.0-1.sles10.ia64.rpm libibat-devel-0.9.0-1.sles10.ia64.rpm libibat-utils-0.9.0-1.sles10.ia64.rpm libibcm-0.9.0-1.sles10.ia64.rpm libibcm-devel-0.9.0-1.sles10.ia64.rpm libibcommon-1.0-1.sles10.ia64.rpm libibcommon-devel-1.0-1.sles10.ia64.rpm libibmad-1.0-1.sles10.ia64.rpm libibmad-devel-1.0-1.sles10.ia64.rpm libibumad-1.0-1.sles10.ia64.rpm libibumad-devel-1.0-1.sles10.ia64.rpm libibverbs-1.0-1.sles10.ia64.rpm libibverbs-devel-1.0-1.sles10.ia64.rpm libibverbs-utils-1.0-1.sles10.ia64.rpm libipathverbs-1.0-1.sles10.ia64.rpm libipathverbs-devel-1.0-1.sles10.ia64.rpm libmthca-1.0-1.sles10.ia64.rpm libmthca-devel-1.0-1.sles10.ia64.rpm librdmacm-0.9.0-1.sles10.ia64.rpm librdmacm-devel-0.9.0-1.sles10.ia64.rpm librdmacm-utils-0.9.0-1.sles10.ia64.rpm libsdp-0.9.0-1.sles10.ia64.rpm openib-diags-1.0-1.sles10.ia64.rpm opensm-1.2.0-1.sles10.ia64.rpm opensm-devel-1.2.0-1.sles10.ia64.rpm srptools-0.0.4-1.sles10.ia64.rpm I also have : mvapich-gen2-1.ia64.rpm -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From panda at cse.ohio-state.edu Tue Apr 11 15:42:11 2006 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Tue, 11 Apr 2006 18:42:11 -0400 (EDT) Subject: [openib-general] MVAPICH2 0.9.3-rc0 with multi-threading support and SVN access is available Message-ID: <200604112242.k3BMgBZQ021389@xi.cse.ohio-state.edu> The MVAPICH team is pleased to announce the availability of MVAPICH2 0.9.3-rc0 with the following new features: - Multi-threading support: This support is available for Gen2, VAPI and uDAPL transport interfaces. In addition, multi-threading support for TCP/IP interface (provided by MPICH2 stack) is also available. 
- Integrated with MPICH2 1.0.3 stack
- Advanced AVL tree-based Resource-aware registration cache
- Tuning and Optimization of various collective algorithms for a wide range of system sizes
- Processor affinity for intra-node shared memory communication

More details on all features and supported platforms can be obtained by visiting the project's web page -> Overview -> features.

Starting with this 0.9.3-rc0 release, the MVAPICH team is also pleased to announce the availability of the MVAPICH2 code base through anonymous SVN access. Nightly tarballs are also available. The mvapich-commit mailing list can also be used by users, developers and vendors to keep track of all commits happening to the SVN.

For downloading the MVAPICH2 0.9.3-rc0 package and accessing the anonymous SVN, please visit the following URL:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/

A stripped down version of this release is also available at the OpenIB SVN.

Under the download page of the above URL, the latest testing results of this rc0 version for different platforms and test suites are shown. It also shows the rigorous testing procedures being used by the team for MVAPICH and MVAPICH2 releases. As soon as the remaining tests are done, we will make a formal release for MVAPICH2 0.9.3.

All feedback, including bug reports, hints for performance tuning, patches and enhancements, is welcome. Please post it to the mvapich-discuss mailing list.

Thanks,

MVAPICH Team at OSU/NBCL

From rjwalsh at pathscale.com Tue Apr 11 15:50:22 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Tue, 11 Apr 2006 15:50:22 -0700 Subject: [openib-general] Re: CentOS machines? In-Reply-To: <1144794697.30135.8.camel@hematite.internal.keyresearch.com> References: <1144794697.30135.8.camel@hematite.internal.keyresearch.com> Message-ID: <1144795822.30135.10.camel@hematite.internal.keyresearch.com> On Tue, 2006-04-11 at 15:31 -0700, Robert Walsh wrote: > Do we have any CentOS 4.2 or 4.3 machines?
People on openib-general can ignore this :-) I meant to send it to an internal alias. Sorry for the bother. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043.

From vuhuong at mellanox.com Tue Apr 11 17:26:28 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Tue, 11 Apr 2006 17:26:28 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: References: Message-ID: <443C4934.7080400@mellanox.com>

Hi Roland,

Sorry to take this long to respond. Thanks for all the enhancements. I cc'ed an Engenio engineer who can help send the latest FW to you.

> This mostly works for me, but I still see one weird problem. If I
> make an FMR to cover IO of size more than 58 * 4096 bytes, the IO
> never completes. The SCSI midlayer times it out and aborts it, and
> the target responds to the task management command. I'm having a hard
> time imagining that this is an SRP initiator or even low-level HCA
> driver bug -- it seems more likely to be a target bug (I am using an
> Engenio target to test, and I may have down-rev firmware).

If you have Santricity, you can check what the current controller firmware version is and update it to the latest.

> I would be very happy to hear test reports with other targets,

Here is my status of testing this patch.

On an x86-64 system I got a data corruption problem reported after ~4 hrs of running Engenio's Smash test tool when I tested with Engenio storage.

On an ia64 system I got multiple async event 3 (IB_EVENT_QP_ACCESS_ERR) and event 1 (IB_EVENT_QP_FATAL); finally the error handling path kicked in and the system panicked.
Please see log below (I tested with Mellanox's srp target reference implementation - I don't see this error without the patch) Apr 7 18:15:10 lab105 kernel: ib_srp: QP event 3 Apr 7 18:15:10 lab105 kernel: ib_srp: failed receive status 5 Apr 7 18:15:13 lab105 kernel: ib_srp: connection closed Apr 7 18:15:43 lab105 kernel: SRP abort called Apr 7 18:15:43 lab105 kernel: Abort for req_index 0 Apr 7 18:15:43 lab105 kernel: SRP abort called Apr 7 18:15:43 lab105 kernel: Abort for req_index 1 Apr 7 18:15:43 lab105 kernel: SRP abort called Apr 7 18:15:43 lab105 kernel: Abort for req_index 2 Apr 7 18:15:43 lab105 kernel: SRP reset_device called Apr 7 18:15:43 lab105 kernel: Abort for req_index 1 Apr 7 18:15:43 lab105 kernel: SRP reset_device called Apr 7 18:15:43 lab105 kernel: Abort for req_index 2 Apr 7 18:15:48 lab105 kernel: SRP reset_device called Apr 7 18:15:48 lab105 kernel: Abort for req_index 0 Apr 7 18:15:48 lab105 kernel: ib_srp: failed receive status 5 Apr 7 18:15:50 lab105 kernel: ib_srp: connection closed Apr 7 18:15:53 lab105 kernel: ib_srp: SRP reset_host called Apr 7 18:15:55 lab105 kernel: ib_srp: connection closed Apr 7 18:16:05 lab105 kernel: ib_mthca 0000:05:00.0: CQ overrun on CQN 000082 Apr 7 18:16:05 lab105 kernel: ib_srp: QP event 1 Apr 7 18:16:05 lab105 last message repeated 3 times Apr 7 18:16:15 lab105 kernel: SRP abort called Apr 7 18:16:15 lab105 kernel: Abort for req_index 0 Apr 7 18:16:20 lab105 kernel: ib_srp: QP event 1 Apr 7 18:16:20 lab105 kernel: ib_srp: QP event 1 Apr 7 18:16:30 lab105 kernel: SRP abort called Apr 7 18:16:30 lab105 kernel: Abort for req_index 1 Apr 7 18:16:35 lab105 kernel: sd 2:0:0:7: scsi: Device offlined - not ready after error recovery Apr 7 18:16:35 lab105 kernel: sd 2:0:0:6: scsi: Device offlined - not ready after error recovery Apr 7 18:16:35 lab105 kernel: sd 2:0:0:7: rejecting I/O to offline device Apr 7 18:16:35 lab105 kernel: Buffer I/O error on device sdj, logical block 0 Apr 7 18:16:35 lab105 kernel: 
Buffer I/O error on device sdj, logical block 1 Apr 7 18:16:35 lab105 kernel: sd 2:0:0:6: rejecting I/O to offline device Apr 7 18:16:35 lab105 kernel: Buffer I/O error on device sdi, logical block 0 Apr 7 18:16:35 lab105 kernel: sd 2:0:0:6: rejecting I/O to offline device Apr 7 18:16:35 lab105 kernel: Buffer I/O error on device sdi, logical block 1 Apr 7 18:16:35 lab105 kernel: sd 2:0:0:7: rejecting I/O to offline device Apr 7 18:16:35 lab105 kernel: Buffer I/O error on device sdj, logical block 0 Apr 7 18:16:35 lab105 kernel: ib_srp: QP event 1 Apr 7 18:16:35 lab105 kernel: ib_srp: QP event 1 Apr 7 18:16:35 lab105 kernel: sd 2:0:0:6: rejecting I/O to offline device Apr 7 18:16:35 lab105 kernel: Buffer I/O error on device sdi, logical block 0 Apr 7 18:17:05 lab105 kernel: SRP abort called Apr 7 18:17:05 lab105 kernel: Abort for req_index 2 Apr 7 18:17:10 lab105 kernel: SRP reset_device called Apr 7 18:17:10 lab105 kernel: Abort for req_index 2 Apr 7 18:17:15 lab105 kernel: ib_srp: SRP reset_host called Apr 7 18:17:17 lab105 kernel: ib_srp: connection closed Apr 7 18:17:17 lab105 kernel: Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b6b Apr 7 18:17:17 lab105 kernel: scsi_eh_2[14050]: Oops 11012296146944 [1] Apr 7 18:17:17 lab105 kernel: Modules linked in: ib_srp ib_cm ib_sa ib_umad evdev joydev sg st sr_mod ide_cd cdrom usbserial parport_pc lp parport thermal processor fan button ipv6 binfmt_misc ib_mthca ib_mad ib_core usbhid ehci_hcd uhci_hcd usbcore i2c_i801 i2c_core e1000 nls_iso8859_1 nls_cp437 dm_mod reiserfs mptspi mptscsih mptbase sd_mod scsi_mod Apr 7 18:17:17 lab105 kernel: Apr 7 18:17:17 lab105 kernel: Pid: 14050, CPU 0, comm: scsi_eh_2 Apr 7 18:17:17 lab105 kernel: psr : 0000121008026018 ifs : 800000000000050d ip : [] Not tainted Apr 7 18:17:17 lab105 kernel: ip is at srp_reconnect_target+0x2b1/0x5c0 [ib_srp] Apr 7 18:17:17 lab105 kernel: unat: 0000000000000000 pfs : 000000000000050d rsc : 0000000000000003 Apr 7 18:17:17 lab105 
kernel: rnat: 0000000000000000 bsps: 0000000000000000 pr : 0000000000009941 Apr 7 18:17:17 lab105 kernel: ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f Apr 7 18:17:17 lab105 kernel: csd : 0000000000000000 ssd : 0000000000000000 Apr 7 18:17:17 lab105 kernel: b0 : a0000002022e54e0 b6 : a000000100003320 b7 : a0000002020a36a0 Apr 7 18:17:17 lab105 kernel: f6 : 1003e6b6b6b6b6b6b6b6b f7 : 0ffe7e694bf1a00000000 Apr 7 18:17:17 lab105 kernel: f8 : 1003e0000000000002418 f9 : 1003e0000000000000021 Apr 7 18:17:17 lab105 kernel: f10 : 1000483fffffff96976e2 f11 : 1003e0000000000000021 Apr 7 18:17:17 lab105 kernel: r1 : a0000002022e8278 r2 : e0000001c6e75a18 r3 : e0000001d0a64a10 Apr 7 18:17:17 lab105 kernel: r8 : e0000001c6e75a68 r9 : e0000001c6e758b8 r10 : a0000001008e9f00 Apr 7 18:17:17 lab105 kernel: r11 : 0000000000000001 r12 : e0000001c7f5fd30 r13 : e0000001c7f58000 Apr 7 18:17:17 lab105 kernel: r14 : a0000001008e9f08 r15 : e0000001c7f58000 r16 : 0000000000000000 Apr 7 18:17:17 lab105 kernel: r17 : 0000000000000000 r18 : e0000001c7f58d84 r19 : a0000001008e9f10 Apr 7 18:17:17 lab105 kernel: r20 : ffffffffffffffff r21 : 0000000000000008 r22 : e000000004790000 Apr 7 18:17:17 lab105 kernel: r23 : e0000001e05f7cd0 r24 : 0000000000000080 r25 : e00000000479001f Apr 7 18:17:17 lab105 kernel: r26 : a0000002020a36a0 r27 : e0000001efcca1e0 r28 : e0000001efcca000 Apr 7 18:17:17 lab105 kernel: r29 : e0000001e05f7c30 r30 : e0000001d0a64a88 r31 : e0000001d0a649f0 Apr 7 18:17:17 lab105 kernel: Apr 7 18:17:17 lab105 kernel: Call Trace: Apr 7 18:17:17 lab105 kernel: [] show_stack+0x80/0xa0 Apr 7 18:17:17 lab105 kernel: sp=e0000001c7f5f8b0 bsp=e0000001c7f59110 Apr 7 18:17:17 lab105 kernel: [] show_regs+0x840/0x880 Apr 7 18:17:17 lab105 kernel: sp=e0000001c7f5fa80 bsp=e0000001c7f590b0 Apr 7 18:17:17 lab105 kernel: [] die+0x1b0/0x240 Apr 7 18:17:17 lab105 kernel: sp=e0000001c7f5fa90 bsp=e0000001c7f59068 Apr 7 18:17:17 lab105 kernel: [] ia64_do_page_fault+0x970/0xae0 Apr 7 
18:17:17 lab105 kernel: sp=e0000001c7f5fab0 bsp=e0000001c7f59000
Apr 7 18:17:17 lab105 kernel: [] ia64_leave_kernel+0x0/0x280
Apr 7 18:17:17 lab105 kernel: sp=e0000001c7f5fb60 bsp=e0000001c7f59000
Apr 7 18:17:17 lab105 kernel: [] srp_reconnect_target+0x2b0/0x5c0 [ib_srp]
Apr 7 18:17:17 lab105 kernel: sp=e0000001c7f5fd30 bsp=e0000001c7f58f90
Apr 7 18:17:17 lab105 kernel: [] srp_reset_host+0x60/0xa0 [ib_srp]
Apr 7 18:17:17 lab105 kernel: sp=e0000001c7f5fdf0 bsp=e0000001c7f58f68
Apr 7 18:17:17 lab105 kernel: [] scsi_try_host_reset+0xd0/0x240 [scsi_mod]
Apr 7 18:17:17 lab105 kernel: sp=e0000001c7f5fdf0 bsp=e0000001c7f58f38
Apr 7 18:17:17 lab105 kernel: [] scsi_error_handler+0x1880/0x22c0 [scsi_mod]
Apr 7 18:17:17 lab105 kernel: sp=e0000001c7f5fdf0 bsp=e0000001c7f58e50
Apr 7 18:17:17 lab105 kernel: [] kthread+0x220/0x280
Apr 7 18:17:17 lab105 kernel: sp=e0000001c7f5fe10 bsp=e0000001c7f58e10
Apr 7 18:17:17 lab105 kernel: [] kernel_thread_helper+0xe0/0x100
Apr 7 18:17:17 lab105 kernel: sp=e0000001c7f5fe30 bsp=e0000001c7f58de0
Apr 7 18:17:17 lab105 kernel: [] start_kernel_thread+0x20/0x40
Apr 7 18:17:17 lab105 kernel: sp=e0000001c7f5fe30 bsp=e0000001c7f5

From rdreier at cisco.com Tue Apr 11 21:23:16 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 11 Apr 2006 21:23:16 -0700
Subject: [openib-general][patch review] srp: fmr implementation,
In-Reply-To: <443C4934.7080400@mellanox.com> (Vu Pham's message of "Tue, 11 Apr 2006 17:26:28 -0700")
References: <443C4934.7080400@mellanox.com>
Message-ID:

    Vu> Hi Roland, Sorry to take this long to response. Thanks for all
    Vu> the enhancements. I cced some Engenio's engineer who can help
    Vu> to send latest FW to you.

Thanks... I haven't been good about following up with Engenio about this issue (IOs with a single direct region of > 58 * 4096 bytes failing to complete).

    Vu> Here is my status of testing this patch. On x86-64 system I
    Vu> got data corruption problem reported after ~4 hrs of running
    Vu> Engenio's Smash test tool when I tested with Engenio storage
    Vu> On ia64 system I got multiple async event 3
    Vu> (IB_EVENT_QP_ACCESS_ERR) and even 1 (IB_EVENT_QP_FATAL),
    Vu> finally the error handling path kicked in and the system
    Vu> paniced. Please see log below (I tested with Mellanox's srp
    Vu> target reference implementation - I don't see this error
    Vu> without the patch)

Hmm, that's interesting. Did you see this type of problem with the original FMR patch you wrote (and did you do this level of stress testing)? I'm wondering whether the issue is in the SRP driver, or whether there is a bug in the FMR stuff at a lower level.

What kind of HCAs were you using? I assume on ia64 you're using PCI-X, what about on x86-64? PCIe or not? Memfree or not?
Another thing that might be useful if it's convenient for you would be to use an IB analyzer and trigger on a NAK to see what happens on the wire around the IB_EVENT_QP_ACCESS_ERR. Thanks, Roland From Don.Albert at Bull.com Tue Apr 11 22:05:43 2006 From: Don.Albert at Bull.com (Don.Albert at Bull.com) Date: Tue, 11 Apr 2006 22:05:43 -0700 Subject: [openib-general] RHEL4ASU3 question In-Reply-To: <000501c65db6$24a015e0$010fa8c0@amr.corp.intel.com> Message-ID: Bob, > I have tested what is on the RedHat EL4.0 U3 with Intel MPI and it > worked ok, so RedHat EL4.0 U3 has all of the userspace libraries needed > to run MVAPICH, although I have not tried it, but I suspect it will work. > There is one issue that I ran into with the stock RedHat EL4 U3 release > and that is with the new Mellenox DDR card I had some problems with rdma, > using uDAPL and suspect you would see the same issues with MVAPICH with > those cards. > The SDR cards seem to work fine with the code that is on the RedHat CD. We are running RHEL4 U3 and the MVAPICH version from the OpenIB gen2 trunk. We were able to run the OSU benchmark tests (osu_bw, osu_bibw, and osu_latency) with the Mellanox SDR cards successfully, but when we swapped out the cards for DDR cards, we ran into some problems. We can run some MPI jobs like the simple "calculate pi" job (cpi.c), and we can run an MPING application, but when we try to run the benchmark tests, we get the following: [koa] (ib) ib> mpirun_rsh -np 2 koa jatoba /home/ib/mpi/tests/osu/osu_bw # OSU MPI Bandwidth Test (Version 2.1) # Size Bandwidth (MB/s) [0] Abort: [koa.az05.bull.com:0] Got completion with error, code=1 at line 2148 in file viacheck.c mpirun_rsh: Abort signaled from [0] done. Looking at the viacheck.c file, it seems that this error is generated when a bad status is found in the status of a completion queue entry. From the "code=1" , it may be some sort of "length error". This could be coming from the driver or the card, I suppose? 
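For reference, a minimal sketch of how a numeric completion status such as the "code=1" above could be decoded. It assumes that viacheck.c prints the raw libibverbs ibv_wc_status value, which nothing in this thread confirms. The table mirrors the first entries of that enum as declared in infiniband/verbs.h; decode_wc_status is a hypothetical helper written for this illustration, not part of MVAPICH or libibverbs.

```python
# First entries of libibverbs' enum ibv_wc_status, in declaration order.
# Assumption: the "code=" value printed by viacheck.c is this raw enum value.
IBV_WC_STATUS_NAMES = [
    "IBV_WC_SUCCESS",         # 0
    "IBV_WC_LOC_LEN_ERR",     # 1: local length error
    "IBV_WC_LOC_QP_OP_ERR",   # 2
    "IBV_WC_LOC_EEC_OP_ERR",  # 3
    "IBV_WC_LOC_PROT_ERR",    # 4
    "IBV_WC_WR_FLUSH_ERR",    # 5
]


def decode_wc_status(code):
    """Hypothetical helper: map a numeric completion status to its enum name."""
    if 0 <= code < len(IBV_WC_STATUS_NAMES):
        return IBV_WC_STATUS_NAMES[code]
    return "unknown status %d" % code


print(decode_wc_status(1))
```

Under that assumption, code=1 would be a local length error, which fits the "length error" reading above; it does not by itself say whether the driver, the card, or the benchmark's buffer sizes are at fault.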
That's as far as I have gotten so far. Does this sound like any of the "issues" you referred to above relative to RHEL4 U3 and the DDR cards? If so, is there a fix?

-Don Albert-
Bull HN Info Systems

From sweitzen at cisco.com Tue Apr 11 22:08:10 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 11 Apr 2006 22:08:10 -0700
Subject: [openib-general] RHEL4ASU3 question
Message-ID:

> Use what RHEL4U3 has.
> Replace RHEL4U3 kernel with newest released kernel.org kernel and
> compile the OpenIB userspace tools for it.

A third option is to keep the RHEL4U3 kernel, and use the OpenIB code from IBED 1.0 rc3.

Scott

From sweitzen at cisco.com Tue Apr 11 22:22:04 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 11 Apr 2006 22:22:04 -0700
Subject: [openib-general] SDP working with Cheetah and IBED 1.0 x86_64 RHEL4 2.6.9-22?
Message-ID:

LionCub worked fine with rc2, but neither rc2 nor rc3 SDP is working with Cheetah HCA. netperf and netserver just hang, then continue when I try to attach with them using strace.

Anyone have this combo working?

Scott

From sweitzen at cisco.com Tue Apr 11 23:20:31 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 11 Apr 2006 23:20:31 -0700
Subject: [openib-general] uDAPL not supported on ppc64?
Message-ID:

I get this trying to compile uDAPL using install.sh with IBED 1.0 rc3 on RHEL4 U2 2.6.9-22 ppc64:

WARNING: Dapl is not supported on PPC64 arcitecture
WARNING: Dapl is not supported on PPC64 arcitecture

Scott

From schihei at de.ibm.com Wed Apr 12 02:46:42 2006
From: schihei at de.ibm.com (Heiko J Schick)
Date: Wed, 12 Apr 2006 11:46:42 +0200
Subject: [openib-general] EHCA crash on module unload?
In-Reply-To: <443BE16E.2080305@scl.ameslab.gov> References: <20060411163653.GA18625@scl.ameslab.gov> <9BA6DF44-3AC4-4BBC-8EEF-8DE9534DDFF8@schihei.de> <443BE16E.2080305@scl.ameslab.gov> Message-ID: <443CCC82.9030700@de.ibm.com> Hello Troy, it seems that you run into a race condition in our code where we thought it can never occur. Good catch! It looks like an event queue is destroyed while on an other CPU an interrupt is coming in. We will fix it soon as possible. Thanks for your help! Regards, Heiko Troy Benjegerdes wrote: > I had unplugged, then re-plugged the cable, and then ran the following: > > rmmod hcad_mod ib_mthca ib_uverbs ib_ipoib ib_sa ib_mad ib_core > > > Heiko J Schick wrote: > >> Hello Troy, >> >> did you unload first all OpenIB modules and then the eHCA module >> or the other way around? >> >> Can you see any other message (error data) in /var/log/messages? >> >> It looks like you unloaded the module during an interrupt came in. >> Can you sent us the steps / commands you've executed when the panic >> was caused? >> >> Regards, >> Heiko >> > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From dotanb at mellanox.co.il Wed Apr 12 01:01:43 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 12 Apr 2006 11:01:43 +0300 Subject: [openib-general] Re: [uDAPL] dtest server never ends when using the dapl provider "OpenIB-scm1" In-Reply-To: <443BEB2B.4080203@ichips.intel.com> References: <200604111506.18067.dotanb@mellanox.co.il> <443BEB2B.4080203@ichips.intel.com> Message-ID: <200604121101.43348.dotanb@mellanox.co.il> On Tuesday 11 April 2006 20:45, Arlin Davis wrote: > James Lentini wrote: > > > > > > > It sounds like the disconnect is being lost. Let me see if I can > > > >reproduce this. > > > >Arlin, have you ever seen this? > > > > > No. 
it runs fine on my systems. It looks like the ping pong test on the > server side did not finish. Can Dotan add a -v switch to the dtest to > help isolate? > > What svn version are we running? Do you have the latest uDAPL fixes > commited in 6393? > > -arlin > > Hi. thanks for the quick response. I executed the dtest with the -v parameter and here is the output of both sides. I added the test the '-l' parameter to be able to change to dapl provider in command line (if you wish i can post you a patch). full server output: ----------------------- sw043:/tmp/tsscr/svn.mlx_tp/gen2/userspace/ulps/udapl/dtest # ./dtest -l OpenIB-scm2 -v 23996 DAPL_PROVIDER is OpenIB-scm2 23996 Verbose 23996 Running as server 23996 Allocated RDMA buffers (r:0x8052390,s:0x8052618) len 64 23996 Opened Interface Adaptor 23996 Create Protection Zone 23996 Created Protection Zone 23996 Register RDMA memory 23996 Registered Receive RDMA Buffer 0x8052390 23996 Registered Send RDMA Buffer 0x8052618 23996 Register RDMA memory done 23996 Create events 23996 cr_evd created 0x805d6f8 23996 con_evd created 0x805d940 23996 dto_req_evd created 0x805dc18 23996 dto_rcv_evd created 0x805deb8 23996 Create events done 23996 EP created 0x805f518 23996 Registering send Message Buffer 0x804f6d0, len 24 23996 Registered send Message Buffer 0x804f6d0 23996 Registering Receive Message Buffer 0x804f700 23996 Registered Receive Message Buffer 0x804f700 23996 Posting Receive Message Buffer 0x804f700 23996 Registered Receive Message Buffer 0x804f700 23996 Posting Receive Message Buffer 0x804f718 23996 Registered Receive Message Buffer 0x804f700 23996 Posting Receive Message Buffer 0x804f730 23996 Registered Receive Message Buffer 0x804f700 23996 Creating service point for listen 23996 dat_psp_created for server listen 23996 Server waiting for connect request.. 
23996 dat_evd_wait for cr_evd completed 23996 Accepting connect request from client 23996 dat_cr_accept completed 23996 Waiting for connect response 23996 dat_evd_wait for h_conn_evd completed 23996 CONNECTED! 23996 Send RMR to remote: snd_msg: r_key_ctx=620439,pad=0,va=8052390,len=0x40 23996 calling post_send 23996 send_msg completed 23996 Waiting for remote to send RMR data 23996 dat_evd_wait h_dto_rcv_evd completed 23996 remote RMR data arrived! 23996 Received RMR from remote: r_iov: r_key_ctx=620436,pad=0,va=8051600,len=0x40 23996 connect_ep complete 23996 RDMA WRITE DATA with SEND MSG 23996 rdma_write # 1 completed 23996 rdma_write # 2 completed 23996 rdma_write # 3 completed 23996 rdma_write # 4 completed 23996 rdma_write # 5 completed 23996 rdma_write # 6 completed 23996 rdma_write # 7 completed 23996 rdma_write # 8 completed 23996 rdma_write # 9 completed 23996 rdma_write # 10 completed 23996 Sending completion message 23996 calling post_send 23996 send_msg completed 23996 waiting for message receive event 23996 inbound rdma_write; send message arrived! 23996 Received RMR from remote: r_iov: ctx=620436,pad=0,va=0x8051600,len=0x40 23996 inbound rdma_write; send msg event SUCCESS!!! 23996 SERVER: RDMA write buffer contains: client written data... 23996 do_rdma_write_with_msg complete 23996 RDMA READ DATA with SEND MSG 23996 waiting for rdma_read completion event 23996 rdma_read # 1 completed 23996 waiting for rdma_read completion event 23996 rdma_read # 2 completed 23996 waiting for rdma_read completion event 23996 rdma_read # 3 completed 23996 waiting for rdma_read completion event 23996 rdma_read # 4 completed 23996 Sending completion message 23996 calling post_send 23996 send_msg completed 23996 Waiting for inbound message.... 23996 waiting for message receive event 23996 inbound rdma_read; send message arrived! 23996 Received RMR from remote: r_iov: ctx=620436,pad=0,va=0x8051600,len=0x40 23996 inbound rdma_write; send msg event SUCCESS!!! 
23996 SERVER: RCV RDMA read buffer contains: server written data... 23996 do_rdma_read_with_msg complete 23996 PING DATA with SEND MSG 23996 Pre-posting Receive Message Buffers 0x8052390 23996 Posted Receive Message Buffer 0x8052390 23996 Pre-posting Receive Message Buffers 0x80523d0 23996 Posted Receive Message Buffer 0x80523d0 23996 Pre-posting Receive Message Buffers 0x8052410 23996 Posted Receive Message Buffer 0x8052410 23996 Pre-posting Receive Message Buffers 0x8052450 23996 Posted Receive Message Buffer 0x8052450 23996 Pre-posting Receive Message Buffers 0x8052490 23996 Posted Receive Message Buffer 0x8052490 23996 Pre-posting Receive Message Buffers 0x80524d0 23996 Posted Receive Message Buffer 0x80524d0 23996 Pre-posting Receive Message Buffers 0x8052510 23996 Posted Receive Message Buffer 0x8052510 23996 Pre-posting Receive Message Buffers 0x8052550 23996 Posted Receive Message Buffer 0x8052550 23996 Pre-posting Receive Message Buffers 0x8052590 23996 Posted Receive Message Buffer 0x8052590 23996 Pre-posting Receive Message Buffers 0x80525d0 23996 Posted Receive Message Buffer 0x80525d0 23996 waiting for message receive event 23996 inbound message; message arrived! 23996 SERVER: RCV buffer 0x8052390 contains: 0x55 len=64 23996 SERVER: SND buffer 0x8052618 contains: 0xffffffaa len=64 23996 calling post_send 23996 send_msg completed 23996 waiting for message receive event 23996 inbound message; message arrived! 23996 SERVER: RCV buffer 0x80523d0 contains: 0x55 len=64 23996 SERVER: SND buffer 0x8052658 contains: 0xffffffaa len=64 23996 calling post_send 23996 send_msg completed 23996 waiting for message receive event 23996 inbound message; message arrived! 23996 SERVER: RCV buffer 0x8052410 contains: 0x55 len=64 23996 SERVER: SND buffer 0x8052698 contains: 0xffffffaa len=64 23996 calling post_send 23996 send_msg completed 23996 waiting for message receive event 23996 inbound message; message arrived! 
23996 SERVER: RCV buffer 0x8052450 contains: 0x55 len=64 23996 SERVER: SND buffer 0x80526d8 contains: 0xffffffaa len=64 23996 calling post_send 23996 send_msg completed 23996 waiting for message receive event 23996 inbound message; message arrived! 23996 SERVER: RCV buffer 0x8052490 contains: 0x55 len=64 23996 SERVER: SND buffer 0x8052718 contains: 0xffffffaa len=64 23996 calling post_send 23996 send_msg completed 23996 waiting for message receive event 23996 inbound message; message arrived! 23996 SERVER: RCV buffer 0x80524d0 contains: 0x55 len=64 23996 SERVER: SND buffer 0x8052758 contains: 0xffffffaa len=64 23996 calling post_send 23996 send_msg completed 23996 waiting for message receive event 23996 inbound message; message arrived! 23996 SERVER: RCV buffer 0x8052510 contains: 0x55 len=64 23996 SERVER: SND buffer 0x8052798 contains: 0xffffffaa len=64 23996 calling post_send 23996 send_msg completed 23996 waiting for message receive event 23996 inbound message; message arrived! 23996 SERVER: RCV buffer 0x8052550 contains: 0x55 len=64 23996 SERVER: SND buffer 0x80527d8 contains: 0xffffffaa len=64 23996 calling post_send 23996 send_msg completed 23996 waiting for message receive event 23996 inbound message; message arrived! 23996 SERVER: RCV buffer 0x8052590 contains: 0x55 len=64 23996 SERVER: SND buffer 0x8052818 contains: 0xffffffaa len=64 23996 calling post_send 23996 send_msg completed 23996 waiting for message receive event 23996 inbound message; message arrived! 
23996 SERVER: RCV buffer 0x80525d0 contains: 0x55 len=64 23996 SERVER: SND buffer 0x8052858 contains: 0xffffffaa len=64 23996 calling post_send 23996 send_msg completed 23996 do_ping_pong_msg complete 23996 Disconnect and Free EP 0x805f518 full client output ----------------------- sw046:/tmp/tsscr/svn.mlx_tp/gen2/userspace/ulps/udapl/dtest # ./dtest -h 12.4.3.43 -l OpenIB-scm2 -v 30681 DAPL_PROVIDER is OpenIB-scm2 30681 Verbose 30681 Running as client 30681 Allocated RDMA buffers (r:0x8051600,s:0x8051888) len 64 30681 Opened Interface Adaptor 30681 Create Protection Zone 30681 Created Protection Zone 30681 Register RDMA memory 30681 Registered Receive RDMA Buffer 0x8051600 30681 Registered Send RDMA Buffer 0x8051888 30681 Register RDMA memory done 30681 Create events 30681 cr_evd created 0x8058708 30681 con_evd created 0x80589e0 30681 dto_req_evd created 0x8058cb8 30681 dto_rcv_evd created 0x805a3f0 30681 Create events done 30681 EP created 0x805a570 30681 Registering send Message Buffer 0x804f6d0, len 24 30681 Registered send Message Buffer 0x804f6d0 30681 Registering Receive Message Buffer 0x804f700 30681 Registered Receive Message Buffer 0x804f700 30681 Posting Receive Message Buffer 0x804f700 30681 Registered Receive Message Buffer 0x804f700 30681 Posting Receive Message Buffer 0x804f718 30681 Registered Receive Message Buffer 0x804f700 30681 Posting Receive Message Buffer 0x804f730 30681 Registered Receive Message Buffer 0x804f700 30681 Server Name: 12.4.3.43 30681 Server Net Address: 12.4.3.43 30681 Connecting to server 30681 dat_ep_connect completed 30681 Waiting for connect response 30681 dat_evd_wait for h_conn_evd completed 30681 CONNECTED! 30681 Send RMR to remote: snd_msg: r_key_ctx=620436,pad=0,va=8051600,len=0x40 30681 calling post_send 30681 send_msg completed 30681 Waiting for remote to send RMR data 30681 dat_evd_wait h_dto_rcv_evd completed 30681 remote RMR data arrived! 
30681 Received RMR from remote: r_iov: r_key_ctx=620439,pad=0,va=8052390,len=0x40 30681 connect_ep complete 30681 RDMA WRITE DATA with SEND MSG 30681 rdma_write # 1 completed 30681 rdma_write # 2 completed 30681 rdma_write # 3 completed 30681 rdma_write # 4 completed 30681 rdma_write # 5 completed 30681 rdma_write # 6 completed 30681 rdma_write # 7 completed 30681 rdma_write # 8 completed 30681 rdma_write # 9 completed 30681 rdma_write # 10 completed 30681 Sending completion message 30681 calling post_send 30681 send_msg completed 30681 waiting for message receive event 30681 inbound rdma_write; send message arrived! 30681 Received RMR from remote: r_iov: ctx=620439,pad=0,va=0x8052390,len=0x40 30681 inbound rdma_write; send msg event SUCCESS!!! 30681 CLIENT: RDMA write buffer contains: server written data... 30681 do_rdma_write_with_msg complete 30681 RDMA READ DATA with SEND MSG 30681 waiting for rdma_read completion event 30681 rdma_read # 1 completed 30681 waiting for rdma_read completion event 30681 rdma_read # 2 completed 30681 waiting for rdma_read completion event 30681 rdma_read # 3 completed 30681 waiting for rdma_read completion event 30681 rdma_read # 4 completed 30681 Sending completion message 30681 calling post_send 30681 send_msg completed 30681 Waiting for inbound message.... 30681 waiting for message receive event 30681 inbound rdma_read; send message arrived! 30681 Received RMR from remote: r_iov: ctx=620439,pad=0,va=0x8052390,len=0x40 30681 inbound rdma_write; send msg event SUCCESS!!! 30681 CLIENT: RCV RDMA read buffer contains: server read data... 
30681 do_rdma_read_with_msg complete 30681 PING DATA with SEND MSG 30681 Pre-posting Receive Message Buffers 0x8051600 30681 Posted Receive Message Buffer 0x8051600 30681 Pre-posting Receive Message Buffers 0x8051640 30681 Posted Receive Message Buffer 0x8051640 30681 Pre-posting Receive Message Buffers 0x8051680 30681 Posted Receive Message Buffer 0x8051680 30681 Pre-posting Receive Message Buffers 0x80516c0 30681 Posted Receive Message Buffer 0x80516c0 30681 Pre-posting Receive Message Buffers 0x8051700 30681 Posted Receive Message Buffer 0x8051700 30681 Pre-posting Receive Message Buffers 0x8051740 30681 Posted Receive Message Buffer 0x8051740 30681 Pre-posting Receive Message Buffers 0x8051780 30681 Posted Receive Message Buffer 0x8051780 30681 Pre-posting Receive Message Buffers 0x80517c0 30681 Posted Receive Message Buffer 0x80517c0 30681 Pre-posting Receive Message Buffers 0x8051800 30681 Posted Receive Message Buffer 0x8051800 30681 Pre-posting Receive Message Buffers 0x8051840 30681 Posted Receive Message Buffer 0x8051840 30681 CLIENT: SND buffer 0x8051888 contains: 0x55 len=64 30681 calling post_send 30681 send_msg completed 30681 waiting for message receive event 30681 inbound message; message arrived! 30681 CLIENT: RCV buffer 0x8051600 contains: 0xffffffaa len=64 30681 CLIENT: SND buffer 0x80518c8 contains: 0x55 len=64 30681 calling post_send 30681 send_msg completed 30681 waiting for message receive event 30681 inbound message; message arrived! 30681 CLIENT: RCV buffer 0x8051640 contains: 0xffffffaa len=64 30681 CLIENT: SND buffer 0x8051908 contains: 0x55 len=64 30681 calling post_send 30681 send_msg completed 30681 waiting for message receive event 30681 inbound message; message arrived! 30681 CLIENT: RCV buffer 0x8051680 contains: 0xffffffaa len=64 30681 CLIENT: SND buffer 0x8051948 contains: 0x55 len=64 30681 calling post_send 30681 send_msg completed 30681 waiting for message receive event 30681 inbound message; message arrived! 
30681 CLIENT: RCV buffer 0x80516c0 contains: 0xffffffaa len=64 30681 CLIENT: SND buffer 0x8051988 contains: 0x55 len=64 30681 calling post_send 30681 send_msg completed 30681 waiting for message receive event 30681 inbound message; message arrived! 30681 CLIENT: RCV buffer 0x8051700 contains: 0xffffffaa len=64 30681 CLIENT: SND buffer 0x80519c8 contains: 0x55 len=64 30681 calling post_send 30681 send_msg completed 30681 waiting for message receive event 30681 inbound message; message arrived! 30681 CLIENT: RCV buffer 0x8051740 contains: 0xffffffaa len=64 30681 CLIENT: SND buffer 0x8051a08 contains: 0x55 len=64 30681 calling post_send 30681 send_msg completed 30681 waiting for message receive event 30681 inbound message; message arrived! 30681 CLIENT: RCV buffer 0x8051780 contains: 0xffffffaa len=64 30681 CLIENT: SND buffer 0x8051a48 contains: 0x55 len=64 30681 calling post_send 30681 send_msg completed 30681 waiting for message receive event 30681 inbound message; message arrived! 30681 CLIENT: RCV buffer 0x80517c0 contains: 0xffffffaa len=64 30681 CLIENT: SND buffer 0x8051a88 contains: 0x55 len=64 30681 calling post_send 30681 send_msg completed 30681 waiting for message receive event 30681 inbound message; message arrived! 30681 CLIENT: RCV buffer 0x8051800 contains: 0xffffffaa len=64 30681 CLIENT: SND buffer 0x8051ac8 contains: 0x55 len=64 30681 calling post_send 30681 send_msg completed 30681 waiting for message receive event 30681 inbound message; message arrived! 
30681 CLIENT: RCV buffer 0x8051840 contains: 0xffffffaa len=64 30681 do_ping_pong_msg complete 30681 Disconnect and Free EP 0x805a570 30681 dat_ep_disconnect 30681 dat_ep_disconnect completed 30681 dat_evd_wait for h_conn_evd completed 30681 Unregister send message h_lmr 0x805c738 30681 Unregistered send message Buffer 30681 Unregister recv message h_lmr 0x805c800 30681 Unregistered recv message Buffer 30681 Free EP 0x805a570 30681 Freed EP 30681 destroy events 30681 Free cr EVD 0x8058708 30681 Freed cr EVD 30681 Free conn EVD 0x80589e0 30681 Freed conn EVD 30681 Free RCV dto EVD 0x805a3f0 30681 Freed dto EVD 30681 Free REQ dto EVD 0x8058cb8 30681 Freed dto EVD 30681 destroy events done 30681 Unregister h_lmr 0x80582b8 30681 Unregistered Recv Buffer 30681 Unregister h_lmr 0x8058658 30681 Unregistered send Buffer 30681 unregister_rdma_memory 30681 unregister_rdma_memory done 30681 Freeing pz 30681 Freed pz 30681 Closing Interface Adaptor 30681 Closed Interface Adaptor 30681: DAPL Test Complete. 30681: Message RTT: Total= 1160.92 usec, 10 bursts, itime= 116.09 usec, pc=0 30681: RDMA write: Total= 892.01 usec, 10 bursts, itime= 89.20 usec, pc=0 30681: RDMA read: Total= 244.90 usec, 4 bursts, itime= 79.05 usec, pc=0 30681: RDMA read: Total= 244.90 usec, 4 bursts, itime= 55.01 usec, pc=0 30681: RDMA read: Total= 244.90 usec, 4 bursts, itime= 55.95 usec, pc=0 30681: RDMA read: Total= 244.90 usec, 4 bursts, itime= 54.89 usec, pc=0 30681: open: 49977.00 usec 30681: close: 71868.94 usec 30681: PZ create: 96.11 usec 30681: PZ free: 14.01 usec 30681: LMR create: 232.01 usec 30681: LMR free: 95.09 usec 30681: EVD create: 21.01 usec 30681: EVD free: 142.07 usec 30681: EP create: 496.94 usec 30681: EP free: 312.04 usec 30681: TOTAL: 1299.16 usec I'm using driver IBED-1.0-rc3, openib_branch1.0-20060410-1551 (REV=6367) when i tried to execute this test on openib_gen2-20060410-1700 (REV=6369), i got the same result. 
when i tried to execute this test on openib_gen2-20060411-0800 (REV=6400), i got the same result.

thanks
Dotan

From dotanb at mellanox.co.il Wed Apr 12 01:22:48 2006
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Wed, 12 Apr 2006 11:22:48 +0300
Subject: [openib-general] [uDAPL] dat.conf generator
Message-ID: <200604121122.48646.dotanb@mellanox.co.il>

Hi.

I'm working on a dat.conf generator that will search for all of the IB devices and will create a valid (and updated) dat.conf.

Here is the generated file on a machine with 2 HCAs (2 ports in each device):

# DAT 1.2 configuration file
#
# Each entry should have the following fields:
#
# \
#
#
# Example for openib_cma and openib_scm
#
# For cma version you specify as:
# network address, network hostname, or netdev name and 0 for port
#
# For scm version you specify as actual device name and port
#
# Simple (OpenIB-cma) default with netdev name provided first on list
# to enable use of same dat.conf version on all nodes
#
OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" ""
OpenIB-cma0-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" ""
OpenIB-cma0-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 2" ""
OpenIB-cma1-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 1" ""
OpenIB-cma1-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 2" ""
OpenIB-cma-netdev0 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
OpenIB-cma-netdev1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib1 0" ""
OpenIB-cma-netdev2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib2 0" ""
OpenIB-cma-netdev3 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib3 0" ""
OpenIB-scm u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
OpenIB-scm0-1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
OpenIB-scm0-2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 2" ""
OpenIB-scm1-1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca1 1" ""
OpenIB-scm1-2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca1 2" ""
OpenIB-scm-netdev0 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib0 0" ""
OpenIB-scm-netdev1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib1 0" ""
OpenIB-scm-netdev2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib2 0" ""
OpenIB-scm-netdev3 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib3 0" ""

The names of the dapl providers are:
OpenIB-cma: default that uses cma
OpenIB-scm: default that uses scm
OpenIB-ZX-Y: uses device X (X is the index) and port Y, and connects using Z (cma or scm)
OpenIB-Z-netdevX: uses netdevice X (X is the index), and connects using Z (cma or scm)

Is this file good enough, or are more dapl provider names needed?

Dotan

From yathiraj at cisco.com Wed Apr 12 02:09:17 2006
From: yathiraj at cisco.com (Yathi Shetty (yathiraj))
Date: Wed, 12 Apr 2006 14:39:17 +0530
Subject: [openib-general] Query on Open IB drivers on DUAL HCA
Message-ID: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE1@xmb-blr-417.apac.cisco.com>

Hi,

Does the open Ib support DUAL HCA servers. I tried it on a Sun V20 Z with dual HCA, but got the HCA failed to come up. I get an yellow exclamation mark on the HCA. Uninstalling also doesn't help. Any suggestions ?

Thanks in advance.
Yathi

From dotanb at mellanox.co.il Wed Apr 12 02:18:50 2006
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Wed, 12 Apr 2006 12:18:50 +0300
Subject: [openib-general] Query on Open IB drivers on DUAL HCA
In-Reply-To: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE1@xmb-blr-417.apac.cisco.com>
References: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE1@xmb-blr-417.apac.cisco.com>
Message-ID: <200604121218.50539.dotanb@mellanox.co.il>

Hi.
On Wednesday 12 April 2006 12:09, Yathi Shetty (yathiraj) wrote: > Hi, > > Does the open Ib support DUAL HCA servers. I tried it on a Sun V20 Z > with dual HCA, but got the HCA failed to come up. I get an yellow > exclamation mark on the HCA. Uninstalling also doesn't help. Any > suggestions ? The openib stack (should) support more than one HCA in an host. In our lab we have several machines with 2 (Mellanox) HCAs, and they work without any problem. can you give some info about this issue? (which HCAs, which OS, which driver version, dump of the /var/log/messages about this problem ..) thanks Dotan From yathiraj at cisco.com Wed Apr 12 02:26:48 2006 From: yathiraj at cisco.com (Yathi Shetty (yathiraj)) Date: Wed, 12 Apr 2006 14:56:48 +0530 Subject: [openib-general] Query on Open IB drivers on DUAL HCA Message-ID: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE2@xmb-blr-417.apac.cisco.com> Hi, Thank you for the response. Mellanox 23108, Windows Cluster Compute Edn, Vr.295. Attached a screen hot. I successfully installed on a single HCA. Please let me know if you need more info. Yathi -----Original Message----- From: Dotan Barak [mailto:dotanb at mellanox.co.il] Sent: Wednesday, April 12, 2006 2:49 PM To: openib-general at openib.org Cc: Yathi Shetty (yathiraj) Subject: Re: [openib-general] Query on Open IB drivers on DUAL HCA Hi. On Wednesday 12 April 2006 12:09, Yathi Shetty (yathiraj) wrote: > Hi, > > Does the open Ib support DUAL HCA servers. I tried it on a Sun V20 Z > with dual HCA, but got the HCA failed to come up. I get an yellow > exclamation mark on the HCA. Uninstalling also doesn't help. Any > suggestions ? The openib stack (should) support more than one HCA in an host. In our lab we have several machines with 2 (Mellanox) HCAs, and they work without any problem. can you give some info about this issue? (which HCAs, which OS, which driver version, dump of the /var/log/messages about this problem ..) 
thanks Dotan -------------- next part -------------- A non-text attachment was scrubbed... Name: openib.JPG Type: image/jpeg Size: 178037 bytes Desc: openib.JPG URL: From dotanb at mellanox.co.il Wed Apr 12 02:43:46 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 12 Apr 2006 12:43:46 +0300 Subject: [openib-general] Query on Open IB drivers on DUAL HCA In-Reply-To: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE2@xmb-blr-417.apac.cisco.com> References: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE2@xmb-blr-417.apac.cisco.com> Message-ID: <200604121243.46223.dotanb@mellanox.co.il> On Wednesday 12 April 2006 12:26, Yathi Shetty (yathiraj) wrote: > Hi, > Thank you for the response. > > Mellanox 23108, Windows Cluster Compute Edn, Vr.295. Attached a screen > hot. do you see any error messages on screen or in the /var/log/messages? > > I successfully installed on a single HCA. did you check this for both of the HCA? i think that driver version can be usefull. Dotan From yathiraj at cisco.com Wed Apr 12 03:00:10 2006 From: yathiraj at cisco.com (Yathi Shetty (yathiraj)) Date: Wed, 12 Apr 2006 15:30:10 +0530 Subject: [openib-general] Query on Open IB drivers on DUAL HCA Message-ID: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE4@xmb-blr-417.apac.cisco.com> Well. I did not see any error message. The screen shot shows the event viewer message. I tried vr.300 also. Still the same result. I am using windows system and have only system logging in event viewer. Yathi -----Original Message----- From: Dotan Barak [mailto:dotanb at mellanox.co.il] Sent: Wednesday, April 12, 2006 3:14 PM To: Yathi Shetty (yathiraj) Cc: openib-general at openib.org Subject: Re: [openib-general] Query on Open IB drivers on DUAL HCA On Wednesday 12 April 2006 12:26, Yathi Shetty (yathiraj) wrote: > Hi, > Thank you for the response. > > Mellanox 23108, Windows Cluster Compute Edn, Vr.295. Attached a screen > hot. do you see any error messages on screen or in the /var/log/messages? 
> > I successfully installed on a single HCA. did you check this for both of the HCA? i think that driver version can be usefull. Dotan From vlad at mellanox.co.il Wed Apr 12 03:06:13 2006 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 12 Apr 2006 13:06:13 +0300 Subject: [openib-general] Re: [openfabrics-ewg] uDAPL not supported on ppc64? In-Reply-To: References: Message-ID: <443CD115.8020204@mellanox.co.il> Scott Weitzenkamp (sweitzen) wrote: > I get this trying to compile uDAPL using install.sh with IBED 1.0 rc3 > on RHEL4 U2 2.6.9-22 ppc64: > > WARNING: Dapl is not supported on PPC64 arcitecture > WARNING: Dapl is not supported on PPC64 arcitecture > > Scott > > ------------------------------------------------------------------------ > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > Hi Scott, I tested uDAPL compilation on PPC64 and it failed with the error that you can see below. The corresponding email was sent to udapl developers. In order to skip this issue in IBED I added warning that you got for PPC64 in the build_env.sh script and it skips uDAPL compilation on this platform. gcc -DHAVE_CONFIG_H -I. -I. -I. 
-I../libibverbs/include/infiniband -I../librdmacm/include -I../libibverbs/include -Wall -g -D_GNU_SOURCE -DOPENIB -DCQ_WAIT_OBJECT -I./dat/include/ -I./dapl/include/ -I./dapl/common -I./dapl/udapl/linux -I./dapl/ope nib_cma -g -O2 -MT dapl_udapl_libdaplcma_la-dapl_init.lo -MD -MP -MF .deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo -c dapl/ud apl/dapl_init.c -fPIC -DPIC -o .libs/dapl_udapl_libdaplcma_la-dapl_init.o In file included from ./dapl/include/dapl.h:50, from dapl/udapl/dapl_init.c:39: ./dapl/udapl/linux/dapl_osd.h:53:2: error: #error UNDEFINED ARCH make[2]: *** [dapl_udapl_libdaplcma_la-dapl_init.lo] Error 1 make[2]: Leaving directory `/var/tmp/IBED/tmp/openib/openib/src/userspace/dapl' make[1]: *** [all] Error 2 make[1]: Leaving directory `/var/tmp/IBED/tmp/openib/openib/src/userspace/dapl' make: *** [dapl] Error 2 Regards, Vladimir From mst at mellanox.co.il Wed Apr 12 03:08:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 12 Apr 2006 13:08:52 +0300 Subject: [openib-general] rma_destroy_id called twice Message-ID: <20060412100852.GA12312@mellanox.co.il> -- MST From dotanb at mellanox.co.il Wed Apr 12 03:07:22 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 12 Apr 2006 13:07:22 +0300 Subject: [openib-general] Query on Open IB drivers on DUAL HCA In-Reply-To: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE4@xmb-blr-417.apac.cisco.com> References: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE4@xmb-blr-417.apac.cisco.com> Message-ID: <200604121307.22331.dotanb@mellanox.co.il> On Wednesday 12 April 2006 13:00, Yathi Shetty (yathiraj) wrote: > Well. I did not see any error message. The screen shot shows the event > viewer message. I tried vr.300 also. Still the same result. > I am using windows system and have only system logging in event viewer. > > Yathi Which OS do you have: linux or windows? Dotan From mst at mellanox.co.il Wed Apr 12 03:10:08 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 12 Apr 2006 13:10:08 +0300 Subject: [openib-general] Re: rma_destroy_id called twice In-Reply-To: <20060412100852.GA12312@mellanox.co.il> References: <20060412100852.GA12312@mellanox.co.il> Message-ID: <20060412101008.GB12312@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: rma_destroy_id called twice Sent by mistake, please ignore. Sorry, -- MST From mst at mellanox.co.il Wed Apr 12 03:19:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 12 Apr 2006 13:19:04 +0300 Subject: [openib-general] Re: Query on Open IB drivers on DUAL HCA In-Reply-To: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE1@xmb-blr-417.apac.cisco.com> References: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE1@xmb-blr-417.apac.cisco.com> Message-ID: <20060412101904.GC12312@mellanox.co.il> Quoting r. Yathi Shetty (yathiraj) : > Subject: Query on Open IB drivers on DUAL HCA > > Hi, > > Does the open Ib support DUAL HCA servers. I tried it on a Sun V20 Z > with dual HCA, but got the HCA failed to come up. I get an yellow > exclamation mark on the HCA. Uninstalling also doesn't help. Any > suggestions ? > > > > Thanks in advance. > Yathi You want the openib-windows list. Subscription information here: http://openib.org/mailman/listinfo -- MST From yathiraj at cisco.com Wed Apr 12 03:45:07 2006 From: yathiraj at cisco.com (Yathi Shetty (yathiraj)) Date: Wed, 12 Apr 2006 16:15:07 +0530 Subject: [openib-general] Query on Open IB drivers on DUAL HCA Message-ID: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE5@xmb-blr-417.apac.cisco.com> I am using Windows compute cluster edtn (64bit) -----Original Message----- From: Dotan Barak [mailto:dotanb at mellanox.co.il] Sent: Wednesday, April 12, 2006 3:37 PM To: Yathi Shetty (yathiraj) Cc: openib-general at openib.org Subject: Re: [openib-general] Query on Open IB drivers on DUAL HCA On Wednesday 12 April 2006 13:00, Yathi Shetty (yathiraj) wrote: > Well. I did not see any error message. The screen shot shows the event > viewer message. 
I tried vr.300 also. Still the same result.
> I am using windows system and have only system logging in event viewer.
>
> Yathi

Which OS do you have: linux or windows?

Dotan

From jlentini at netapp.com Wed Apr 12 07:50:10 2006
From: jlentini at netapp.com (James Lentini)
Date: Wed, 12 Apr 2006 10:50:10 -0400 (EDT)
Subject: [openib-general] Re: [uDAPL] dat.conf generator
In-Reply-To: <200604121122.48646.dotanb@mellanox.co.il>
References: <200604121122.48646.dotanb@mellanox.co.il>
Message-ID:

On Wed, 12 Apr 2006, Dotan Barak wrote:

> Hi.
>
> I'm working on a dat.conf generator that will search for all of the
> IB devices and will create a valid (and updated) dat.conf.
>
> Here is the generated file on a machine with 2 HCAs (2 ports in each
> device):
>
> # DAT 1.2 configuration file
> #
> # Each entry should have the following fields:
> #
> # \
> #
> #
> # Example for openib_cma and openib_scm
> #
> # For cma version you specify as:
> # network address, network hostname, or netdev name and 0 for port
> #
> # For scm version you specify as actual device name and port
> #
> # Simple (OpenIB-cma) default with netdev name provided first on list
> # to enable use of same dat.conf version on all nodes
> #
> OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" ""
> OpenIB-cma0-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" ""
> OpenIB-cma0-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 2" ""
> OpenIB-cma1-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 1" ""
> OpenIB-cma1-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 2" ""
> OpenIB-cma-netdev0 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> OpenIB-cma-netdev1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib1 0" ""
> OpenIB-cma-netdev2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib2 0" ""
> OpenIB-cma-netdev3 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib3 0" ""
> OpenIB-scm u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
> OpenIB-scm0-1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
> OpenIB-scm0-2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 2" ""
> OpenIB-scm1-1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca1 1" ""
> OpenIB-scm1-2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca1 2" ""
> OpenIB-scm-netdev0 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib0 0" ""
> OpenIB-scm-netdev1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib1 0" ""
> OpenIB-scm-netdev2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib2 0" ""
> OpenIB-scm-netdev3 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib3 0" ""
>
>
> the names of the dapl providers are:
> OpenIB-cma: default that uses cma
> OpenIB-scm: default that uses scm
> OpenIB-ZX-Y: uses device X (X is the index) , and port Y that connect using Z (cma or scm)
> OpenIB-Z-netdevX : uses netdevice X (X in the index) that connect using Z (cma or scm)
>
> is this file is good enough or more dapl provider names are needed?

You've covered all the standard combinations.

Why did you include the OpenIB-Z-netdevX entries? Why would a user prefer netdevX over ethY? Just curious.

If you are willing to contribute this back to the uDAPL project, I'm sure the uDAPL community would find it very useful.
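
[Editorial sketch, not part of the original thread.] The naming scheme described in the quoted message can be illustrated with a short Python sketch. This is not Dotan's generator, which was not posted here; the function names `entry` and `generate` are hypothetical, the library paths and the fixed fields (`u1.2`, `nonthreadsafe`, `default`, `mv_dapl.1.2`) are copied from the quoted file, and device discovery is deliberately left out: the HCA and netdev lists are passed in by the caller.

```python
# Hypothetical sketch of a dat.conf generator following the naming scheme
# in the quoted message. Not the actual tool discussed in the thread.

# Provider library paths, copied from the quoted dat.conf.
LIB_PATHS = {
    "cma": "/usr/lib/libdaplcma.so",
    "scm": "/usr/lib/libdaplscm.so",
}


def entry(name, kind, device, port):
    """Format one dat.conf line in the layout used by the quoted file."""
    return (
        f"{name} u1.2 nonthreadsafe default {LIB_PATHS[kind]} "
        f'mv_dapl.1.2 "{device} {port}" ""'
    )


def generate(hcas, netdevs, kinds=("cma", "scm")):
    """Build dat.conf entries.

    hcas:    list of (device_name, port_count) tuples, e.g. ("mthca0", 2)
    netdevs: list of IPoIB netdev names, e.g. "ib0"
    """
    lines = []
    for kind in kinds:
        if hcas:
            # Default entry: first port of the first HCA.
            lines.append(entry(f"OpenIB-{kind}", kind, hcas[0][0], 1))
        # OpenIB-ZX-Y: device index X, port Y (ports are numbered from 1).
        for idx, (dev, nports) in enumerate(hcas):
            for port in range(1, nports + 1):
                lines.append(entry(f"OpenIB-{kind}{idx}-{port}", kind, dev, port))
        # OpenIB-Z-netdevX: netdev index X, port field fixed at 0.
        for idx, netdev in enumerate(netdevs):
            lines.append(entry(f"OpenIB-{kind}-netdev{idx}", kind, netdev, 0))
    return lines


if __name__ == "__main__":
    for line in generate([("mthca0", 2), ("mthca1", 2)],
                         ["ib0", "ib1", "ib2", "ib3"]):
        print(line)
```

Called with two dual-port HCAs and four netdevs, as in the `__main__` block, the sketch emits the same 18 entries in the same order as the quoted file. A real generator would also need to enumerate the devices on the host and write the comment header, both of which are omitted here.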
From ftillier at silverstorm.com Wed Apr 12 07:50:01 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Wed, 12 Apr 2006 07:50:01 -0700 Subject: [openib-general] Query on Open IB drivers on DUAL HCA In-Reply-To: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE1@xmb-blr-417.apac.cisco.com> References: <02FBED75ABE4D4469DE999A7CDFA8CE6F7ADE1@xmb-blr-417.apac.cisco.com> Message-ID: <79ae2f320604120750r4173750ft9f846ba03c5c3f83@mail.gmail.com> Hi Yathi, On 4/12/06, Yathi Shetty (yathiraj) wrote: > Hi, > > Does the open Ib support DUAL HCA servers. I tried it on a Sun V20 Z > with dual HCA, but got the HCA failed to come up. I get an yellow > exclamation mark on the HCA. Uninstalling also doesn't help. Any > suggestions ? The current VAPI-based driver does not support two HCAs at the same time. I don't know if Mellanox will fix it since the MTHCA driver is almost mature and will replace the VAPI-based driver since it supports all memfree HCAs (which the VAPI driver does not). - Fab From robert.j.woodruff at intel.com Wed Apr 12 08:28:49 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 12 Apr 2006 08:28:49 -0700 Subject: [openib-general] RHEL4ASU3 question In-Reply-To: Message-ID: <000601c65e45$cc8fac40$010fa8c0@amr.corp.intel.com> Don wrote, >Looking at the viacheck.c file, it seems that this error is generated when a bad status is found >in the status of a completion queue entry. From the "code=1" , it may be some sort of "length >error". This could be coming from the driver or the card, I suppose? That's as far as I have >gotten so far. >Does this sound like any of the "issues" you referred to above relative to RHEL4 U3 and the DDR >cards? If so, is there a fix? >-Don Albert- >Bull HN Info Systems Most likely it is the same problem I had with the DDR cards and the stock RedHat EL4.0 - U3 release. I reported the problem to RedHat, but they added a Bugzilla comment that they did not have any DDR hardware. Perhaps Mellanox could provide them with some. 
Michael ? My guess is that the version of code that they have in the release is too old and does not completely support the DDR card. I tried them with newer SVN code and it worked OK. You might try my backport patches and/or test RPMs that are based on RedHat EL4 - U3 kernel, but with a newer version of the SVN code or replace your RedHat kernel with a newer kernel.org kernel. Not sure if that is feasible or not in your environment, but I am pretty sure that newer SVN code will fix the problem. https://openib.org/svn/gen2/branches/backport-to-2.6.9/RPMS/ You may also need newer usermode libraries to go with the newer kernel, either kernel.org or my backport to 2.6.9-34EL. You will need to uninstall the redhat usermode RPMS and install either my usermode RPM or the ones from the release 1.0 RC2 should probably work as well. woody woody From bugzilla-daemon at openib.org Wed Apr 12 08:40:23 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 12 Apr 2006 08:40:23 -0700 (PDT) Subject: [openib-general] [Bug 35] New: 2c_OffloadCheckSum (NDIS) test gives Blue screen Message-ID: <20060412154023.E2AC42283D9@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=35 Summary: 2c_OffloadCheckSum (NDIS) test gives Blue screen Product: OpenIB Version: 1.0rc2 Platform: X86-64 OS/Version: Other Status: NEW Severity: normal Priority: P2 Component: IPoIB AssignedTo: bugzilla at openib.org ReportedBy: ksharma at silverstorm.com Build version: 3.0.0039.1 OS Version: Windows CCS 2003 - 64 bit Description: When i execute 2c_OffloadCheckSum test from NDIS test suite, after around 20-25 minutes of run, it generates a blue screen. PFA the minidump file. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at openib.org Wed Apr 12 08:42:20 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 12 Apr 2006 08:42:20 -0700 (PDT) Subject: [openib-general] [Bug 35] 2c_OffloadCheckSum (NDIS) test gives Blue screen Message-ID: <20060412154220.4E6612283D9@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=35 ------- Additional Comments From ksharma at silverstorm.com 2006-04-12 08:42 ------- Created an attachment (id=10) --> (http://openib.org/bugzilla/attachment.cgi?id=10&action=view) Minidump file for the blue screen ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From robert.j.woodruff at intel.com Wed Apr 12 08:35:35 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 12 Apr 2006 08:35:35 -0700 Subject: [openib-general] RHEL4ASU3 question In-Reply-To: Message-ID: <000701c65e46$be5c27b0$010fa8c0@amr.corp.intel.com> >A third option is to keep the RHEL4U3 kernel, and use the OpenIB code >from IBED 1.0 rc3. >Scott Can you provide more details about this IBED release ? What does the acronym IBED stand for ? Is it part of the OpenFabric's 1.0 release, a superset ? what distros it supports ? etc ? Seems like a lot of people on openib-general have been kept in the dark about what you are trying to do and I am sure that others on this list would be interested. 
woody From ftillier at silverstorm.com Wed Apr 12 09:40:18 2006 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 12 Apr 2006 09:40:18 -0700 Subject: [openib-general] Bugzilla Message-ID: <000f01c65e4f$ccd3fcb0$6401a8c0@infiniconsys.com> Hi Bryan, Could you rename the current "OpenIB" product to "OpenIB Linux", and create an "OpenIB Windows" project with the following components: IPoIB WSD IB Core MT23108 MTHCA Diagnostics OpenSM SRP Utils Thanks, - Fab From caitlinb at broadcom.com Wed Apr 12 09:33:54 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 12 Apr 2006 09:33:54 -0700 Subject: [openib-general] uDAPL not supported on ppc64? Message-ID: <54AD0F12E08D1541B826BE97C98F99F13CA1BA@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > I get this trying to compile uDAPL using install.sh with IBED > 1.0 rc3 on RHEL4 U2 2.6.9-22 ppc64: > > WARNING: Dapl is not supported on PPC64 arcitecture > WARNING: Dapl is not supported on PPC64 arcitecture > > Scott There are include files that map DAT-defined types to architecture appropriate choices. Just fill in the correct choices for PPC64 and submit a patch. Don't be afraid to ask for clarification on the semantics of any types, but with the examples already given it should be fairly clear. From weiny2 at llnl.gov Wed Apr 12 09:46:40 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 12 Apr 2006 09:46:40 -0700 Subject: [openib-general] RDMA RC QP returning "RNR Retry Counter Exceeded Error" Message-ID: <20060412094640.7457e097.weiny2@llnl.gov> I have started writing a simple RDMA app which uses the rdmacm. I have gotten the connection established, QP's and MR's set up, and have sent the RDMA ETH. However, more and more I am getting the RNR Retry Counter Exceeded error back from the "client's" post send of the RDMA ETH. About 1/10 times it will work but most of the time it does not. 
I have figured out that you can't set the IBV_QP_RNR_RETRY attribute unless you go from RTR to RTS. The state of the QP is RTS and the IBV_QP_RNR_RETRY value is 0 as set by the rdmacm. Do I have to, or can I, transition the QP from RTS to RTR and then back again to set the IBV_QP_RNR_RETRY? Thanks, Ira Weiny weiny2 at llnl.gov From bos at pathscale.com Wed Apr 12 09:48:24 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 12 Apr 2006 09:48:24 -0700 Subject: [openib-general] Re: Bugzilla In-Reply-To: <000f01c65e4f$ccd3fcb0$6401a8c0@infiniconsys.com> References: <000f01c65e4f$ccd3fcb0$6401a8c0@infiniconsys.com> Message-ID: <1144860504.2483.29.camel@localhost.localdomain> On Wed, 2006-04-12 at 09:40 -0700, Fab Tillier wrote: > Could you rename the current "OpenIB" product to "OpenIB Linux", and create an > "OpenIB Windows" project with the following components: > > IPoIB Added. > WSD I need a description for this. > IB Core Added. > MT23108 Need description. > MTHCA > Diagnostics > OpenSM > SRP > Utils Added. From bos at pathscale.com Wed Apr 12 09:50:51 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 12 Apr 2006 09:50:51 -0700 Subject: [openib-general] Re: Bugzilla In-Reply-To: <000f01c65e4f$ccd3fcb0$6401a8c0@infiniconsys.com> References: <000f01c65e4f$ccd3fcb0$6401a8c0@infiniconsys.com> Message-ID: <1144860651.2483.32.camel@localhost.localdomain> On Wed, 2006-04-12 at 09:40 -0700, Fab Tillier wrote: > Could you rename the current "OpenIB" product to "OpenIB Linux", and create an > "OpenIB Windows" project with the following components: I forgot to mention that I renamed "OpenIB" to "OpenFabrics Linux", and created "OpenFabrics Windows" for Windows stuff. If you have existing precanned queries that look for the "OpenIB" product, they will probably need to be updated, though I'm not sure of this. 
References: <000f01c65e4f$ccd3fcb0$6401a8c0@infiniconsys.com> <1144860504.2483.29.camel@localhost.localdomain> Message-ID: <79ae2f320604120951w5fdae72fl5a6a6998dc493ae4@mail.gmail.com> On 4/12/06, Bryan O'Sullivan wrote: > On Wed, 2006-04-12 at 09:40 -0700, Fab Tillier wrote: > > > WSD > > I need a description for this. Windows Sockets Direct provider, Microsoft's precursor to SDP. > > MT23108 > > Need description. VAPI-based Mellanox HCA driver for Tavor (and Arbel in Tavor compatibility mode) Thanks! - Fab From ftillier at silverstorm.com Wed Apr 12 09:53:13 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Wed, 12 Apr 2006 09:53:13 -0700 Subject: [openib-general] Re: Bugzilla In-Reply-To: <79ae2f320604120951w5fdae72fl5a6a6998dc493ae4@mail.gmail.com> References: <000f01c65e4f$ccd3fcb0$6401a8c0@infiniconsys.com> <1144860504.2483.29.camel@localhost.localdomain> <79ae2f320604120951w5fdae72fl5a6a6998dc493ae4@mail.gmail.com> Message-ID: <79ae2f320604120953v20f00989ib432611790604d88@mail.gmail.com> As a follow up, can you add the following operating systems: Windows Server 2003, x86 Windows Server 2003, x64 Windows Server 2003, IA64 Windows XP, x86 Windows XP, x64 Is there a way to make OS choices project specific, just like the components? Thanks! - Fab From bos at pathscale.com Wed Apr 12 09:55:30 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 12 Apr 2006 09:55:30 -0700 Subject: [openib-general] Re: Bugzilla In-Reply-To: <79ae2f320604120951w5fdae72fl5a6a6998dc493ae4@mail.gmail.com> References: <000f01c65e4f$ccd3fcb0$6401a8c0@infiniconsys.com> <1144860504.2483.29.camel@localhost.localdomain> <79ae2f320604120951w5fdae72fl5a6a6998dc493ae4@mail.gmail.com> Message-ID: <1144860930.2483.37.camel@localhost.localdomain> On Wed, 2006-04-12 at 09:51 -0700, Fabian Tillier wrote: > Windows Sockets Direct provider, Microsoft's precursor to SDP. Thanks. Added. > > > MT23108 > > > > Need description. 
> > VAPI-based Mellanox HCA driver for Tavor (and Arbel in Tavor compatibility mode) Mmm, rolls right off the tongue :-) Added. From bos at pathscale.com Wed Apr 12 09:58:09 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 12 Apr 2006 09:58:09 -0700 Subject: [openib-general] Re: Bugzilla In-Reply-To: <79ae2f320604120953v20f00989ib432611790604d88@mail.gmail.com> References: <000f01c65e4f$ccd3fcb0$6401a8c0@infiniconsys.com> <1144860504.2483.29.camel@localhost.localdomain> <79ae2f320604120951w5fdae72fl5a6a6998dc493ae4@mail.gmail.com> <79ae2f320604120953v20f00989ib432611790604d88@mail.gmail.com> Message-ID: <1144861089.2483.41.camel@localhost.localdomain> On Wed, 2006-04-12 at 09:53 -0700, Fabian Tillier wrote: > As a follow up, can you add the following operating systems: No, sorry. The version of Bugzilla at openib.org has a brain-dead schema where some categories of stuff live in the database, while others live in a hunk of Perl that you have to have shell access to edit. OSes are in the latter category, so I can't do anything with them. Perhaps Matt Leininger can help. References: <000f01c65e4f$ccd3fcb0$6401a8c0@infiniconsys.com> Message-ID: <79ae2f320604121000m659567fco718a3b6190f07bd0@mail.gmail.com> I forgot to give you maintainers, added below. On 4/12/06, Fab Tillier wrote: > IPoIB me > WSD me > IB Core me > MT23108 Leonid Keller (leonid at mellanox.co.il) > MTHCA Leonid Keller (leonid at mellanox.co.il) > Diagnostics I guess assign these to me for now. > OpenSM Probably same as for Linux? > SRP me > Utils I guess assign these to me for now. 
- Fab From halr at voltaire.com Wed Apr 12 10:45:51 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Apr 2006 13:45:51 -0400 Subject: [openib-general] [PATCH v2] mad: use GID/LID on requester side when matching responses to requests In-Reply-To: <200604101804.34043.jackm@mellanox.co.il> References: <200604101804.34043.jackm@mellanox.co.il> Message-ID: <1144863926.19061.90695.camel@hal.voltaire.com> On Mon, 2006-04-10 at 11:04, Jack Morgenstein wrote: A couple of commentary comments below... -- Hal > Index: src/drivers/infiniband/core/mad.c > =================================================================== > --- src/drivers/infiniband/core/mad.c (revision 6066) > +++ src/drivers/infiniband/core/mad.c (working copy) > struct ib_mad_send_wr_private* > ib_find_send_mad(struct ib_mad_agent_private *mad_agent_priv, > - struct ib_mad_recv_wc *mad_recv_wc) > + struct ib_mad_recv_wc *wc) > { > - struct ib_mad_send_wr_private *mad_send_wr; > + struct ib_mad_send_wr_private *wr; > struct ib_mad *mad; > > - mad = (struct ib_mad *)mad_recv_wc->recv_buf.mad; > + mad = (struct ib_mad *)wc->recv_buf.mad; > > - list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, > - agent_list) { > - if ((mad_send_wr->tid == mad->mad_hdr.tid) && > - rcv_has_same_class(mad_send_wr, mad_recv_wc) && > - rcv_has_same_gid(mad_send_wr, mad_recv_wc)) > - return mad_send_wr; > + list_for_each_entry(wr, &mad_agent_priv->wait_list, agent_list) { > + if ((wr->tid == mad->mad_hdr.tid) && > + rcv_has_same_class(wr, wc) && > + /* > + * Don't check GID for direct routed MADs. > + * These might have permissive LIDs. What's the relevance of the latter comment ? VL15 packets never have GRHs so there are no GIDs so I think the first comment is sufficient. 
> + */ > + (is_direct(wc->recv_buf.mad->mad_hdr.mgmt_class) || > + rcv_has_same_gid(mad_agent_priv, wr, wc))) > + return wr; > } > > /* > * It's possible to receive the response before we've > * been notified that the send has completed > */ > - list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, > - agent_list) { > - if (is_data_mad(mad_agent_priv, mad_send_wr->send_buf.mad) && > - mad_send_wr->tid == mad->mad_hdr.tid && > - mad_send_wr->timeout && > - rcv_has_same_class(mad_send_wr, mad_recv_wc) && > - rcv_has_same_gid(mad_send_wr, mad_recv_wc)) { > + list_for_each_entry(wr, &mad_agent_priv->send_list, agent_list) { > + if (is_data_mad(mad_agent_priv, wr->send_buf.mad) && > + wr->tid == mad->mad_hdr.tid && > + wr->timeout && > + rcv_has_same_class(wr, wc) && > + /* > + * Don't check GID for direct routed MADs. > + * These might have permissive LIDs. > + */ Same comment as above. > + (is_direct(wc->recv_buf.mad->mad_hdr.mgmt_class) || > + rcv_has_same_gid(mad_agent_priv, wr, wc))) > /* Verify request has not been canceled */ > - return (mad_send_wr->status == IB_WC_SUCCESS) ? > - mad_send_wr : NULL; > - } > + return (wr->status == IB_WC_SUCCESS) ? 
wr : NULL; > } > return NULL; > } From rdreier at cisco.com Wed Apr 12 11:37:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 12 Apr 2006 11:37:06 -0700 Subject: [openib-general] Re: [PATCH] mthca: fix max_srq_sge returned by ib_query_device for Tavor devices In-Reply-To: <200604111816.27204.jackm@mellanox.co.il> (Jack Morgenstein's message of "Tue, 11 Apr 2006 18:16:27 +0300") References: <200604111816.27204.jackm@mellanox.co.il> Message-ID: Thanks applied & queued for 2.6.17 From shahanse at cisco.com Wed Apr 12 12:02:34 2006 From: shahanse at cisco.com (Shawn Hansen (shahanse)) Date: Wed, 12 Apr 2006 12:02:34 -0700 Subject: [openib-general] Announcing the OpenFabrics Enterprise Distribution Message-ID: All, The OpenFabrics Enterprise Working Group is pleased to announce the creation of the OpenFabrics Enterprise Distribution (OFED). This was approved today by the Board of Directors. OFED is a distribution of InfiniBand software that includes, or is a superset of, the OpenFabrics 1.0 release, and adds other additional software outside of the scope of the OpenFabrics release, such as MPI. This development work will happen completely in the open in the Enterprise Working Group. To join this mailing list, please subscribe at: http://openib.org/mailman/listinfo/openfabrics-ewg The major reasons for this distribution are: 1) Need to unify vendors' snapshots of OpenFabrics release for interoperability; 2) Need to package software outside the scope of the main OpenFabrics release; 3) Need to coordinate vendors' bug fixes and prevent divergence. For frequently asked questions, a FAQ is included below. Please let us know if you have any additional questions. Regards, Shawn Hansen Chair, Enterprise Working Group Cisco Systems shawn.hansen at cisco.com -------------------------- Frequently Asked Questions -------------------------- Q: What is the Enterprise Working Group? The EWG is a group of hardware vendors that will sell products based on OpenFabrics. 
The purpose of this group is to coordinate how to provide a single commercially supportable distribution of OpenFabrics software to their customers that guarantees cross-vendor interoperability. Q: Why is OFED required? - Enterprise customers will have solution-level requirements that are outside the scope of the 1.0 release, such as the distribution of MPI stacks, support for pre-2.6.16 kernels, etc. The goal of OFED is to address this need. Without OFED, each InfiniBand vendor would create their own distribution of OpenFabrics to accomplish this goal, and may not be interoperable. Q: Does OFED compete with the OpenFabrics release? - No, there is only one OpenFabrics release. OFED is a distribution that includes the OpenFabrics 1.0 release. The OpenFabrics 1.0 release and OFED share the same user-level code (libraries, management utilities, etc.) The code for both is taken from the 1.0 branch. Q: Is OFED development happening in the open? - Yes, OFED uses the OpenFabrics bugzilla for bug reporting, and all discussions can be viewed on the Enterprise Working Group mailing list. All OFED development is done on the 1.0 branch under the ibed directory. Anyone can access release candidates, test them, observe bugs and discussions, report bugs, and comment. Q: How does OFED differ from the OpenFabrics release? - The OpenFabrics release contains only user-level code, while the OFED distribution also adds InfiniBand kernel modules that are under OpenFabrics development, including modules that are not part of the kernel (like iSER, RDS, and SDP). - OFED will include two MPI packages that are not part of Open Fabrics: OSU MPI and Open MPI. - OFED is packaged for end-user installation. - OFED supports distribution with older kernels (e.g. Redhat EL4 up2) Q: What is the software release process for OFED and how does it relate to the OpenFabrics release? The release build is done using the following method: 1. 
Any module that is already in the kernel will be taken from the git tree that is targeted for the next kernel release 2. Kernel modules that are not in the Linux kernel will be taken from the OpenFabrics SVN trunk or, in extraordinary cases, from SVN contrib. 3. All user space code is taken from the 1.0 branch. The OFED group will make sure the right patches from the trunk are merged into the branch. 4. MPI: Open MPI - Provided by OpenMPI developers. MVAPICH - Based on OSU release. Both tarballs are placed on the OpenFabrics web site. 5. OFED build & install scripts: all relevant scripts are placed under a specific directory for the OFED release under the 1.0 branch. 6. Back port patches: the patches directory will also be under the OFED directory in the 1.0 branch. The release process: The release coordinator will build the release candidate (OFED-rcX) and publish it on OpenFabrics (approximately every 2 weeks). Each OFED vendor is responsible for testing the components it owns. Bugs are reported through bugzilla and fixes are provided to the general list. Q: What is the anticipated release schedule? Mid-May Q: What components will be included in OFED and how is this decided? - Components will include: - HCA driver - mthca - HCA driver - ipath - Core - IPoIB - SDP - RDS - SRP initiator - iSER initiator - OSU MPI - Open MPI - uDAPL - OpenSM - Diagnostic tools - Performance tests The decision to include components is based on customer demand and level of robustness and stability. Components will be categorized in one of three ways: 1) Basic: GA components that are installed in a typical installation. 2) Add-on: Components that can be installed optionally. 3) Technology preview: Components where quality level is not GA, but can be used by customers for technology development. Q: When bugs are found, how will they be fixed? - Fixes to release candidates are coordinated by the OFED release coordinator and maintainers in a controlled fashion. 
Each bug found is first fixed on the trunk, and then merged into the release branch. - Patches will be made available as RC updates, and fed back to OpenFabrics SVN continuously. - Availability of patches will not be gated by acceptance of patches into OpenFabrics SVN. - Urgent bug fixes can be directly delivered to customers by the distros or vendors, but are rolled into a standard release as quickly as possible. The goal is to ensure that fixes are standardized and make it to the next general release. From halr at voltaire.com Wed Apr 12 12:12:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Apr 2006 15:12:28 -0400 Subject: [openib-general] Announcing the OpenFabrics Enterprise Distribution In-Reply-To: References: Message-ID: <1144869147.19061.91517.camel@hal.voltaire.com> On Wed, 2006-04-12 at 15:02, Shawn Hansen (shahanse) wrote: I have some questions about OFED: > -------------------------- > Frequently Asked Questions > -------------------------- > Q: Is OFED development happening in the open? > > - Yes, OFED uses the OpenFabrics bugzilla for bug reporting, and all > discussions can be viewed on the Enterprise Working Group mailing list. > All OFED development is done on the 1.0 branch under the ibed directory. How can that be ? The 1.0 branch contains no kernel code. Where is the OFED kernel code kept ? > Q: When bugs are found, how will they be fixed? > > - Fixes to release candidates are coordinated by the OFED release > coordinator and maintainers in a controlled fashion. Each bug found is > first fixed on the trunk, and then merged into the release branch. So there will never be a bug fixed on the 1.0 branch being merged back to the trunk ? For various components, the trunk has diverged and is ahead of the 1.0 branch. > - Patches will be made available as RC updates, and fed back to > OpenFabrics SVN continuously. Can someone explain the OFED location layout in the OF svn tree ? 
> - Availability of patches will not be gated by acceptance of patches
> into OpenFabrics SVN.

How do discrepancies in the accepted patches between OF and OFED get resolved ?

-- Hal

From rdreier at cisco.com Wed Apr 12 12:26:27 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 12 Apr 2006 12:26:27 -0700
Subject: [openib-general][patch review] srp: fmr implementation,
In-Reply-To: <443C4934.7080400@mellanox.com> (Vu Pham's message of "Tue, 11 Apr 2006 17:26:28 -0700")
References: <443C4934.7080400@mellanox.com>
Message-ID:

> Apr 7 18:17:17 lab105 kernel: Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b6b

I think I fixed the bug causing this oops (I was able to reproduce it, and I don't see it any more). I checked the following patch in and queued it for kernel 2.6.17:

diff-tree 44f29db23e1994bed2f3254dc6fef4185fdd0541 (from 59fef3b1e96217c6e736372ff8cc95cbcca1b6aa)
Author: Roland Dreier
Date: Wed Apr 12 12:20:51 2006 -0700

IB/srp: Remove request from list when SCSI abort succeeds

If a SCSI abort succeeds, then the aborted request should to be removed from the list of pending requests. This fixes list corruption after an abort occurs.
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 5f2b3f6..5bb5574 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -617,6 +617,14 @@ static void srp_unmap_data(struct scsi_c scmnd->sc_data_direction); } +static void srp_remove_req(struct srp_target_port *target, struct srp_request *req, + int index) +{ + list_del(&req->list); + req->next = target->req_head; + target->req_head = index; +} + static void srp_process_rsp(struct srp_target_port *target, struct srp_rsp *rsp) { struct srp_request *req; @@ -664,9 +672,7 @@ static void srp_process_rsp(struct srp_t scmnd->host_scribble = (void *) -1L; scmnd->scsi_done(scmnd); - list_del(&req->list); - req->next = target->req_head; - target->req_head = rsp->tag & ~SRP_TAG_TSK_MGMT; + srp_remove_req(target, req, rsp->tag & ~SRP_TAG_TSK_MGMT); } else req->cmd_done = 1; } @@ -1188,12 +1194,10 @@ static int srp_send_tsk_mgmt(struct scsi spin_lock_irq(target->scsi_host->host_lock); if (req->cmd_done) { - list_del(&req->list); - req->next = target->req_head; - target->req_head = req_index; - + srp_remove_req(target, req, req_index); scmnd->scsi_done(scmnd); } else if (!req->tsk_status) { + srp_remove_req(target, req, req_index); scmnd->result = DID_ABORT << 16; ret = SUCCESS; } From bugzilla-daemon at openib.org Wed Apr 12 12:40:02 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 12 Apr 2006 12:40:02 -0700 (PDT) Subject: [openib-general] [Bug 35] 2c_OffloadCheckSum (NDIS) test gives Blue screen Message-ID: <20060412194002.DA0B42283DA@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=35 ftillier at silverstorm.com changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|bugzilla at openib.org |ftillier at silverstorm.com Product|OpenFabrics Linux |OpenFabrics Windows Version|1.0rc2 |unspecified ------- You 
are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From viswa.krish at gmail.com Wed Apr 12 14:03:23 2006
From: viswa.krish at gmail.com (Viswanath Krishnamurthy)
Date: Wed, 12 Apr 2006 14:03:23 -0700
Subject: [openib-general] ibping broken in SVN 6446 ?
Message-ID: <4df28be40604121403y2c1fdd57n1cc440401a54759@mail.gmail.com>

When I do an ibping I get an error (on a 32 bit machine).

Linux Kernel: 2.6.16
infiniband directory replaced with SVN6446

When I enable debug in umad.c, I get the following error. The ioctl call to the umad driver (umad device) is failing.

return value for ioctl is -1, errno is -22 (EINVAL)
portid 0 registering qp 1 class 50 version 1 failed:
ibping: iberror: can't register to ping class 50

-Viswa

From weiny2 at llnl.gov Wed Apr 12 14:55:29 2006
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 12 Apr 2006 14:55:29 -0700
Subject: [openib-general] Announcing the OpenFabrics Enterprise Distribution
In-Reply-To: <1144869147.19061.91517.camel@hal.voltaire.com>
References: <1144869147.19061.91517.camel@hal.voltaire.com>
Message-ID: <20060412145529.65162e81.weiny2@llnl.gov>

On Wed, 12 Apr 2006 15:12:28 -0400 Hal Rosenstock wrote:

> On Wed, 2006-04-12 at 15:02, Shawn Hansen (shahanse) wrote:
>
> I have some questions about OFED:
>
> > --------------------------
> > Frequently Asked Questions
> > --------------------------
>
> > Q: Is OFED development happening in the open?
> >
> > - Yes, OFED uses the OpenFabrics bugzilla for bug reporting, and all
> > discussions can be viewed on the Enterprise Working Group mailing
> > list. All OFED development is done on the 1.0 branch under the ibed
> > directory.
>
> How can that be ? The 1.0 branch contains no kernel code. Where is the
> OFED kernel code kept ?
>

This is my one big question.

Also what happens to people who took the 1.0 branch and are actually using it?
That is what LLNL has done and now the kernel code is gone.

Confused,
Ira

From vuhuong at mellanox.com Wed Apr 12 15:03:24 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Wed, 12 Apr 2006 15:03:24 -0700
Subject: [openib-general][patch review] srp: fmr implementation,
In-Reply-To:
References: <443C4934.7080400@mellanox.com>
Message-ID: <443D792C.6030405@mellanox.com>

>
> Vu> Here is my status of testing this patch. On x86-64 system I
> Vu> got data corruption problem reported after ~4 hrs of running
> Vu> Engenio's Smash test tool when I tested with Engenio storage
> Vu> On ia64 system I got multiple async event 3
> Vu> (IB_EVENT_QP_ACCESS_ERR) and even 1 (IB_EVENT_QP_FATAL),
> Vu> finally the error handling path kicked in and the system
> Vu> panicked. Please see log below (I tested with Mellanox's srp
> Vu> target reference implementation - I don't see this error
> Vu> without the patch)
>
> Hmm, that's interesting. Did you see this type of problem with the
> original FMR patch you wrote (and did you do this level of stress
> testing)? I'm wondering whether the issue is in the SRP driver, or
> whether there is a bug in the FMR stuff at a lower level.
>

I stressed on x86_64 and did not see data corruption problem.
I restarted the test with your patch without any problem till now ~15 hrs When I tested with my original patch on ia64 I hit different problem per[0]: Oops 8813272891392 [1] Modules linked in: ib_srp ib_sa ib_cm ib_umad evdev joydev sg st sr_mod ide_cd cdrom usbserial parport_pc lp parport thermal processor ipv6 fan button ib_mthca ib_mad ib_core bd Pid: 0, CPU 0, comm: swapper psr : 0000101008022038 ifs : 8000000000000003 ip : [] Not tainted ip is at __copy_user+0x890/0x960 unat: 0000000000000000 pfs : 000000000000050d rsc : 0000000000000003 rnat: e0000001fd1cbb64 bsps: a0000001008e9ef8 pr : 80000000a96627a7 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f csd : 0000000000000000 ssd : 0000000000000000 b0 : a0000001003019f0 b6 : a000000100003320 b7 : a000000100302120 f6 : 000000000000000000000 f7 : 1003eff23971ce39d6000 f8 : 1003ef840500400886000 f9 : 100068000000000000000 f10 : 10005fffffffff0000000 f11 : 1003e0000000000000080 r1 : a000000100ae8b50 r2 : 0d30315052534249 r3 : 0d3031505253424a r8 : a000000100902570 r9 : 2d3031504db9c249 r10 : 0000000000544f53 r11 : e000000004998000 r12 : a0000001007bfb20 r13 : a0000001007b8000 r14 : a0007ffffdc00000 r15 : a000000100902540 r16 : a000000100902570 r17 : 0000000000000000 r18 : ffffffffffffffff r19 : e5c738e7c46c654d r20 : e5c738e758000000 r21 : ff23971ce39d6000 r22 : c202802004430000 r23 : e0000001e2fafd78 r24 : 6203002002030000 r25 : e0000001e6fec18b r26 : ffffffffffffff80 r27 : 0000000000000000 r28 : 0d30315052534000 r29 : 0000000000000001 r30 : ffffffffffffffff r31 : a0000001007480c8 Call Trace: [] show_stack+0x80/0xa0 sp=a0000001007bf6a0 bsp=a0000001007b94c0 [] show_regs+0x840/0x880 sp=a0000001007bf870 bsp=a0000001007b9460 [] die+0x1b0/0x240 sp=a0000001007bf880 bsp=a0000001007b9418 [] ia64_do_page_fault+0x970/0xae0 sp=a0000001007bf8a0 bsp=a0000001007b93a8 [] ia64_leave_kernel+0x0/0x280 sp=a0000001007bf950 bsp=a0000001007b93a8 [] __copy_user+0x890/0x960 sp=a0000001007bfb20 bsp=a0000001007b9390 [] 
unmap_single+0x90/0x2a0 sp=a0000001007bfb20 bsp=a0000001007b9388
[] init_task+0x7960/0x8000 sp=a0000001007bfb20 bsp=a0000001007b90e0
[] unmap_single+0x90/0x2a0 sp=a0000001007bfb20 bsp=a0000001007b8e38

> What kind of HCAs were you using? I assume on ia64 you're using
> PCI-X, what about on x86-64? PCIe or not? Memfree or not?
>

PCI-X on ia64 and PCIe without mem on x86_64

> Another thing that might be useful if it's convenient for you would be
> to use an IB analyzer and trigger on a NAK to see what happens on the
> wire around the IB_EVENT_QP_ACCESS_ERR.

I'll capture some log with analyzer when it's available

Vu

From vuhuong at mellanox.com Wed Apr 12 15:05:01 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Wed, 12 Apr 2006 15:05:01 -0700
Subject: [openib-general][patch review] srp: fmr implementation,
In-Reply-To:
References: <443C4934.7080400@mellanox.com>
Message-ID: <443D798D.4040001@mellanox.com>

Roland Dreier wrote:
>> Apr 7 18:17:17 lab105 kernel: Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b6b
>
> I think I fixed the bug causing this oops (I was able to reproduce it,
> and I don't see it any more). I checked the following patch in and
> queued it for kernel 2.6.17:
>

Thanks... I'll test it and let you know

From rheflin at atipa.com Wed Apr 12 15:07:16 2006
From: rheflin at atipa.com (Roger Heflin)
Date: Wed, 12 Apr 2006 17:07:16 -0500
Subject: [openib-general] Trying to compile mvapich RHEL4U3 for ib.
Message-ID: <443D7A14.8050807@atipa.com>

I am not having much luck with the default RHEL4U3 setup. I appear to have IP over IB running and appearing to work, but am unable to get any mpi variants to work directly with IB, I do have it working over tcp with ch_p4.

With mvapich-0.9.7 it errors out in the building stage with an error ibv_free_device_list/ibv_get_device_list missing, I cannot find any of the ib libraries on RHEL4U3 that appear to contain that library.
Using the older mvapich-0.9.6 there is no option to make an ib_gen2 version, and there does not appear to be any ch_gen2 device code.

Using the mvapich-gen2-1.src.rpm from openib.org results in these errors (on the first thing it tries to compile).

viainit.c: In function `create_cq':
viainit.c:118: error: too few arguments to function `ibv_create_cq'

I have verified that the include file prototype has more arguments than are contained in viainit.c.

Trying to use openmpi produced different but still failing results: everything compiled and linked and HPL would start but never produced any output from HPL itself; it did produce some things that looked like internal openmpi errors.

Any suggestion on what I am missing, or if there is another version that will work? It looks like there must be a lot of API differences between the different variants that I have.

Roger

From robert.j.woodruff at intel.com Wed Apr 12 15:16:57 2006
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Wed, 12 Apr 2006 15:16:57 -0700
Subject: [openib-general] Announcing the OpenFabrics EnterpriseDistribution
In-Reply-To: <20060412145529.65162e81.weiny2@llnl.gov>
Message-ID: <000801c65e7e$d0c66130$010fa8c0@amr.corp.intel.com>

Ira wrote,
> How can that be ? The 1.0 branch contains no kernel code. Where is the
> OFED kernel code kept ?
>
>This is my one big question.
>Also what happens to people who took the 1.0 branch and are actually using it?
>That is what LLNL has done and now the kernel code is gone.
>Confused,
>Ira

Uh me too. Where is the kernel code for OFED. Is it the code that was in the Release 1.0 tree or some other SVN rev ?

woody

From robert.j.woodruff at intel.com Wed Apr 12 15:22:32 2006
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Wed, 12 Apr 2006 15:22:32 -0700
Subject: [openib-general] Announcing the OpenFabrics Enterprise Distribution
In-Reply-To:
Message-ID: <000901c65e7f$97c167d0$010fa8c0@amr.corp.intel.com>

> Q: Why is OFED required?
>- Enterprise customers will have solution-level requirements that are
>outside the scope of the 1.0 release, such as the distribution of MPI
>stacks, support for pre-2.6.16 kernels, etc. The goal of OFED is to
>address this need. Without OFED, each InfiniBand vendor would create
>their own distribution of OpenFabrics to accomplish this goal, and may
>not be interoperable.

What if a customer wants to use one of the commercially available MPIs, such as Intel MPI or HP MPI, will OFED be validating that those MPIs will work with the OFED distribution ?

woody

From viswa.krish at gmail.com Wed Apr 12 15:25:03 2006
From: viswa.krish at gmail.com (Viswanath Krishnamurthy)
Date: Wed, 12 Apr 2006 15:25:03 -0700
Subject: [openib-general] Fix for ibping
Message-ID: <4df28be40604121525j22e085cdg8ab9c1a820b24e3a@mail.gmail.com>

The RMPP version needs to be 1.

[root at subnetmgr5 src]# svn diff ibping.c
Index: ibping.c
===================================================================
--- ibping.c	(revision 6446)
+++ ibping.c	(working copy)
@@ -336,7 +336,7 @@
 		exit(0);
 	}
 
-	if (mad_register_client(ping_class, 0) < 0)
+	if (mad_register_client(ping_class, 1) < 0)
 		IBERROR("can't register to ping class %d", ping_class);
 
 	if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) < 0)

From bos at pathscale.com Wed Apr 12 15:35:34 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Wed, 12 Apr 2006 15:35:34 -0700
Subject: [openib-general] [RFC] Would like to cut a new release candidate
Message-ID: <1144881334.22229.22.camel@chalcedony.pathscale.com>

Since the 1.0 RC2 tag in SVN contains kernel code (which led to confusion), and Hal wanted to get a few more diag-related changes into RC2 for people to use, we are tossing around the idea of tagging another release candidate.

I've suggested calling it RC2.1, since it's only slightly different from RC2: no kernel source at all, and some diag changes.
Does anyone have any objections?

By the way, Tziporet and I have discussed syncing up the release candidate numbers for the next IBED and OF release candidates, to reduce a possible source of confusion. This will mean that OF will hop from RC2.1 to RC4, while IBED will step to RC4. The posited date for the release of both is May 1.

OFED is focused on including open-source MPI in the distribution. Commercial MPI ISVs can definitely test against OFED, make statements about interoperability, and ask the community to make changes specific to their stack. Likewise, IB vendors may directly do their own testing with these MPI packages and/or support the ISV's efforts.

--Shawn

-----Original Message-----
From: Bob Woodruff [mailto:robert.j.woodruff at intel.com]
Sent: Wednesday, April 12, 2006 3:23 PM
To: Shawn Hansen (shahanse); openib-general at openib.org
Subject: RE: [openib-general] Announcing the OpenFabrics Enterprise Distribution

> Q: Why is OFED required?

>- Enterprise customers will have solution-level requirements that are
>outside the scope of the 1.0 release, such as the distribution of MPI
>stacks, support for pre-2.6.16 kernels, etc. The goal of OFED is to
>address this need. Without OFED, each InfiniBand vendor would create
>their own distribution of OpenFabrics to accomplish this goal, and may
>not be interoperable.

What if a customer wants to use one of the commercially available MPIs, such as Intel MPI or HP MPI, will OFED be validating that those MPIs will work with the OFED distribution ?
woody

From rdreier at cisco.com Wed Apr 12 15:39:30 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 12 Apr 2006 15:39:30 -0700
Subject: [openib-general] [GIT PULL] InfiniBand updates for 2.6.17-rc1
Message-ID:

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This includes changes that I've asked you to pull a couple of times, but which probably got lost in your mail queue because of your trip.

There are a couple of largish changes in here, but they are all needed:

 - the IPoIB ring size tunables fix horrible performance IBM sees
 - the static rate change fixes big problems on mixed rate networks

The exact changes and patch are:

Eli Cohen:
      IPoIB: Wait for join to finish before freeing mcast struct
      IPoIB: Close race in ipoib_flush_paths()

Jack Morgenstein:
      IB: simplify static rate encoding
      IB/mthca: Fix max_srq_sge returned by ib_query_device for Tavor devices

Michael S.
Tsirkin: IB/mad: fix oops in cancel_mads IPoIB: Consolidate private neighbour data handling IB/mthca: Disable tuning PCI read burst size IB/cache: Use correct pointer to calculate size Roland Dreier: IPoIB: Always build debugging code unless CONFIG_EMBEDDED=y IB/mthca: Always build debugging code unless CONFIG_EMBEDDED=y IB/srp: Fix memory leak in options parsing IPoIB: Use spin_lock_irq() instead of spin_lock_irqsave() Shirley Ma: IPoIB: Make send and receive queue sizes tunable drivers/infiniband/core/cache.c | 2 drivers/infiniband/core/mad.c | 2 drivers/infiniband/core/verbs.c | 34 ++++++++ drivers/infiniband/hw/mthca/Kconfig | 11 +-- drivers/infiniband/hw/mthca/Makefile | 4 - drivers/infiniband/hw/mthca/mthca_av.c | 100 ++++++++++++++++++++++++ drivers/infiniband/hw/mthca/mthca_cmd.c | 4 + drivers/infiniband/hw/mthca/mthca_cmd.h | 1 drivers/infiniband/hw/mthca/mthca_dev.h | 23 +++++- drivers/infiniband/hw/mthca/mthca_mad.c | 42 ++++++++++ drivers/infiniband/hw/mthca/mthca_main.c | 28 +++++++ drivers/infiniband/hw/mthca/mthca_provider.c | 2 drivers/infiniband/hw/mthca/mthca_provider.h | 3 - drivers/infiniband/hw/mthca/mthca_qp.c | 46 ++++++++--- drivers/infiniband/hw/mthca/mthca_srq.c | 27 ++++++ drivers/infiniband/ulp/ipoib/Kconfig | 3 - drivers/infiniband/ulp/ipoib/ipoib.h | 7 ++ drivers/infiniband/ulp/ipoib/ipoib_fs.c | 2 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 22 +++-- drivers/infiniband/ulp/ipoib/ipoib_main.c | 88 ++++++++++++++------- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 58 ++++++-------- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 6 + drivers/infiniband/ulp/srp/ib_srp.c | 1 include/rdma/ib_sa.h | 28 ------- include/rdma/ib_verbs.h | 28 +++++++ 25 files changed, 430 insertions(+), 142 deletions(-) diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index c57a387..50364c0 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -302,7 +302,7 @@ static void ib_cache_setup_one(struct ib 
kmalloc(sizeof *device->cache.pkey_cache * (end_port(device) - start_port(device) + 1), GFP_KERNEL); device->cache.gid_cache = - kmalloc(sizeof *device->cache.pkey_cache * + kmalloc(sizeof *device->cache.gid_cache * (end_port(device) - start_port(device) + 1), GFP_KERNEL); if (!device->cache.pkey_cache || !device->cache.gid_cache) { diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index ba54c85..3a702da 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -2311,6 +2311,7 @@ static void local_completions(void *data local = list_entry(mad_agent_priv->local_list.next, struct ib_mad_local_private, completion_list); + list_del(&local->completion_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); if (local->mad_priv) { recv_mad_agent = local->recv_mad_agent; @@ -2362,7 +2363,6 @@ local_send_completion: &mad_send_wc); spin_lock_irqsave(&mad_agent_priv->lock, flags); - list_del(&local->completion_list); atomic_dec(&mad_agent_priv->refcount); if (!recv) kmem_cache_free(ib_mad_cache, local->mad_priv); diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index cae0845..b78e7dc 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -45,6 +45,40 @@ #include #include +int ib_rate_to_mult(enum ib_rate rate) +{ + switch (rate) { + case IB_RATE_2_5_GBPS: return 1; + case IB_RATE_5_GBPS: return 2; + case IB_RATE_10_GBPS: return 4; + case IB_RATE_20_GBPS: return 8; + case IB_RATE_30_GBPS: return 12; + case IB_RATE_40_GBPS: return 16; + case IB_RATE_60_GBPS: return 24; + case IB_RATE_80_GBPS: return 32; + case IB_RATE_120_GBPS: return 48; + default: return -1; + } +} +EXPORT_SYMBOL(ib_rate_to_mult); + +enum ib_rate mult_to_ib_rate(int mult) +{ + switch (mult) { + case 1: return IB_RATE_2_5_GBPS; + case 2: return IB_RATE_5_GBPS; + case 4: return IB_RATE_10_GBPS; + case 8: return IB_RATE_20_GBPS; + case 12: return IB_RATE_30_GBPS; + case 16: return IB_RATE_40_GBPS; + 
case 24: return IB_RATE_60_GBPS; + case 32: return IB_RATE_80_GBPS; + case 48: return IB_RATE_120_GBPS; + default: return IB_RATE_PORT_CURRENT; + } +} +EXPORT_SYMBOL(mult_to_ib_rate); + /* Protection domains */ struct ib_pd *ib_alloc_pd(struct ib_device *device) diff --git a/drivers/infiniband/hw/mthca/Kconfig b/drivers/infiniband/hw/mthca/Kconfig index e88be85..9aa5a44 100644 --- a/drivers/infiniband/hw/mthca/Kconfig +++ b/drivers/infiniband/hw/mthca/Kconfig @@ -7,10 +7,11 @@ config INFINIBAND_MTHCA ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). config INFINIBAND_MTHCA_DEBUG - bool "Verbose debugging output" + bool "Verbose debugging output" if EMBEDDED depends on INFINIBAND_MTHCA - default n + default y ---help--- - This option causes the mthca driver produce a bunch of debug - messages. Select this is you are developing the driver or - trying to diagnose a problem. + This option causes debugging code to be compiled into the + mthca driver. The output can be turned on via the + debug_level module parameter (which can also be set after + the driver is loaded through sysfs). 
diff --git a/drivers/infiniband/hw/mthca/Makefile b/drivers/infiniband/hw/mthca/Makefile index 47ec5a7..e388d95 100644 --- a/drivers/infiniband/hw/mthca/Makefile +++ b/drivers/infiniband/hw/mthca/Makefile @@ -1,7 +1,3 @@ -ifdef CONFIG_INFINIBAND_MTHCA_DEBUG -EXTRA_CFLAGS += -DDEBUG -endif - obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ diff --git a/drivers/infiniband/hw/mthca/mthca_av.c b/drivers/infiniband/hw/mthca/mthca_av.c index bc5bdcb..b12aa03 100644 --- a/drivers/infiniband/hw/mthca/mthca_av.c +++ b/drivers/infiniband/hw/mthca/mthca_av.c @@ -42,6 +42,20 @@ #include "mthca_dev.h" +enum { + MTHCA_RATE_TAVOR_FULL = 0, + MTHCA_RATE_TAVOR_1X = 1, + MTHCA_RATE_TAVOR_4X = 2, + MTHCA_RATE_TAVOR_1X_DDR = 3 +}; + +enum { + MTHCA_RATE_MEMFREE_FULL = 0, + MTHCA_RATE_MEMFREE_QUARTER = 1, + MTHCA_RATE_MEMFREE_EIGHTH = 2, + MTHCA_RATE_MEMFREE_HALF = 3 +}; + struct mthca_av { __be32 port_pd; u8 reserved1; @@ -55,6 +69,90 @@ struct mthca_av { __be32 dgid[4]; }; +static enum ib_rate memfree_rate_to_ib(u8 mthca_rate, u8 port_rate) +{ + switch (mthca_rate) { + case MTHCA_RATE_MEMFREE_EIGHTH: + return mult_to_ib_rate(port_rate >> 3); + case MTHCA_RATE_MEMFREE_QUARTER: + return mult_to_ib_rate(port_rate >> 2); + case MTHCA_RATE_MEMFREE_HALF: + return mult_to_ib_rate(port_rate >> 1); + case MTHCA_RATE_MEMFREE_FULL: + default: + return mult_to_ib_rate(port_rate); + } +} + +static enum ib_rate tavor_rate_to_ib(u8 mthca_rate, u8 port_rate) +{ + switch (mthca_rate) { + case MTHCA_RATE_TAVOR_1X: return IB_RATE_2_5_GBPS; + case MTHCA_RATE_TAVOR_1X_DDR: return IB_RATE_5_GBPS; + case MTHCA_RATE_TAVOR_4X: return IB_RATE_10_GBPS; + default: return port_rate; + } +} + +enum ib_rate mthca_rate_to_ib(struct mthca_dev *dev, u8 mthca_rate, u8 port) +{ + if (mthca_is_memfree(dev)) { + /* Handle old Arbel FW */ + if (dev->limits.stat_rate_support == 0x3 && mthca_rate) + return IB_RATE_2_5_GBPS; + + return 
memfree_rate_to_ib(mthca_rate, dev->rate[port - 1]); + } else + return tavor_rate_to_ib(mthca_rate, dev->rate[port - 1]); +} + +static u8 ib_rate_to_memfree(u8 req_rate, u8 cur_rate) +{ + if (cur_rate <= req_rate) + return 0; + + /* + * Inter-packet delay (IPD) to get from rate X down to a rate + * no more than Y is (X - 1) / Y. + */ + switch ((cur_rate - 1) / req_rate) { + case 0: return MTHCA_RATE_MEMFREE_FULL; + case 1: return MTHCA_RATE_MEMFREE_HALF; + case 2: /* fall through */ + case 3: return MTHCA_RATE_MEMFREE_QUARTER; + default: return MTHCA_RATE_MEMFREE_EIGHTH; + } +} + +static u8 ib_rate_to_tavor(u8 static_rate) +{ + switch (static_rate) { + case IB_RATE_2_5_GBPS: return MTHCA_RATE_TAVOR_1X; + case IB_RATE_5_GBPS: return MTHCA_RATE_TAVOR_1X_DDR; + case IB_RATE_10_GBPS: return MTHCA_RATE_TAVOR_4X; + default: return MTHCA_RATE_TAVOR_FULL; + } +} + +u8 mthca_get_rate(struct mthca_dev *dev, int static_rate, u8 port) +{ + u8 rate; + + if (!static_rate || ib_rate_to_mult(static_rate) >= dev->rate[port - 1]) + return 0; + + if (mthca_is_memfree(dev)) + rate = ib_rate_to_memfree(ib_rate_to_mult(static_rate), + dev->rate[port - 1]); + else + rate = ib_rate_to_tavor(static_rate); + + if (!(dev->limits.stat_rate_support & (1 << rate))) + rate = 1; + + return rate; +} + int mthca_create_ah(struct mthca_dev *dev, struct mthca_pd *pd, struct ib_ah_attr *ah_attr, @@ -107,7 +205,7 @@ on_hca_fail: av->g_slid = ah_attr->src_path_bits; av->dlid = cpu_to_be16(ah_attr->dlid); av->msg_sr = (3 << 4) | /* 2K message */ - ah_attr->static_rate; + mthca_get_rate(dev, ah_attr->static_rate, ah_attr->port_num); av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); if (ah_attr->ah_flags & IB_AH_GRH) { av->g_slid |= 0x80; diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index 343eca5..1985b5d 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -965,6 +965,7 @@ int 
mthca_QUERY_DEV_LIM(struct mthca_dev u32 *outbox; u8 field; u16 size; + u16 stat_rate; int err; #define QUERY_DEV_LIM_OUT_SIZE 0x100 @@ -995,6 +996,7 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev #define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 #define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 #define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_RATE_SUPPORT_OFFSET 0x3c #define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f #define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 #define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 @@ -1086,6 +1088,8 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev dev_lim->num_ports = field & 0xf; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(stat_rate, outbox, QUERY_DEV_LIM_RATE_SUPPORT_OFFSET); + dev_lim->stat_rate_support = stat_rate; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); dev_lim->max_pkeys = 1 << (field & 0xf); MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.h b/drivers/infiniband/hw/mthca/mthca_cmd.h index e4ec35c..2f976f2 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.h +++ b/drivers/infiniband/hw/mthca/mthca_cmd.h @@ -146,6 +146,7 @@ struct mthca_dev_lim { int max_vl; int num_ports; int max_gids; + u16 stat_rate_support; int max_pkeys; u32 flags; int reserved_uars; diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index ad52edb..4c1dcb4 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -151,6 +151,7 @@ struct mthca_limits { int reserved_qps; int num_srqs; int max_srq_wqes; + int max_srq_sge; int reserved_srqs; int num_eecs; int reserved_eecs; @@ -172,6 +173,7 @@ struct mthca_limits { int reserved_pds; u32 page_size_cap; u32 flags; + u16 stat_rate_support; u8 port_width_cap; }; @@ -353,10 +355,24 @@ struct mthca_dev { struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; spinlock_t sm_lock; 
+ u8 rate[MTHCA_MAX_PORTS]; }; -#define mthca_dbg(mdev, format, arg...) \ - dev_dbg(&mdev->pdev->dev, format, ## arg) +#ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +extern int mthca_debug_level; + +#define mthca_dbg(mdev, format, arg...) \ + do { \ + if (mthca_debug_level) \ + dev_printk(KERN_DEBUG, &mdev->pdev->dev, format, ## arg); \ + } while (0) + +#else /* CONFIG_INFINIBAND_MTHCA_DEBUG */ + +#define mthca_dbg(mdev, format, arg...) do { (void) mdev; } while (0) + +#endif /* CONFIG_INFINIBAND_MTHCA_DEBUG */ + #define mthca_err(mdev, format, arg...) \ dev_err(&mdev->pdev->dev, format, ## arg) #define mthca_info(mdev, format, arg...) \ @@ -492,6 +508,7 @@ void mthca_free_srq(struct mthca_dev *de int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, enum ib_srq_attr_mask attr_mask); int mthca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr); +int mthca_max_srq_sge(struct mthca_dev *dev); void mthca_srq_event(struct mthca_dev *dev, u32 srqn, enum ib_event_type event_type); void mthca_free_srq_wqe(struct mthca_srq *srq, u32 wqe_addr); @@ -542,6 +559,8 @@ int mthca_read_ah(struct mthca_dev *dev, struct ib_ud_header *header); int mthca_ah_query(struct ib_ah *ibah, struct ib_ah_attr *attr); int mthca_ah_grh_present(struct mthca_ah *ah); +u8 mthca_get_rate(struct mthca_dev *dev, int static_rate, u8 port); +enum ib_rate mthca_rate_to_ib(struct mthca_dev *dev, u8 mthca_rate, u8 port); int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); diff --git a/drivers/infiniband/hw/mthca/mthca_mad.c b/drivers/infiniband/hw/mthca/mthca_mad.c index dfb482e..f235c7e 100644 --- a/drivers/infiniband/hw/mthca/mthca_mad.c +++ b/drivers/infiniband/hw/mthca/mthca_mad.c @@ -49,6 +49,30 @@ enum { MTHCA_VENDOR_CLASS2 = 0xa }; +int mthca_update_rate(struct mthca_dev *dev, u8 port_num) +{ + struct ib_port_attr *tprops = NULL; + int ret; + + tprops = kmalloc(sizeof *tprops, 
GFP_KERNEL); + if (!tprops) + return -ENOMEM; + + ret = ib_query_port(&dev->ib_dev, port_num, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed (%d) for %s port %d\n", + ret, dev->ib_dev.name, port_num); + goto out; + } + + dev->rate[port_num - 1] = tprops->active_speed * + ib_width_enum_to_int(tprops->active_width); + +out: + kfree(tprops); + return ret; +} + static void update_sm_ah(struct mthca_dev *dev, u8 port_num, u16 lid, u8 sl) { @@ -90,6 +114,7 @@ static void smp_snoop(struct ib_device * mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && mad->mad_hdr.method == IB_MGMT_METHOD_SET) { if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) { + mthca_update_rate(to_mdev(ibdev), port_num); update_sm_ah(to_mdev(ibdev), port_num, be16_to_cpup((__be16 *) (mad->data + 58)), (*(u8 *) (mad->data + 76)) & 0xf); @@ -246,6 +271,7 @@ int mthca_create_agents(struct mthca_dev { struct ib_mad_agent *agent; int p, q; + int ret; spin_lock_init(&dev->sm_lock); @@ -255,11 +281,23 @@ int mthca_create_agents(struct mthca_dev q ? IB_QPT_GSI : IB_QPT_SMI, NULL, 0, send_handler, NULL, NULL); - if (IS_ERR(agent)) + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); goto err; + } dev->send_agent[p][q] = agent; } + + for (p = 1; p <= dev->limits.num_ports; ++p) { + ret = mthca_update_rate(dev, p); + if (ret) { + mthca_err(dev, "Failed to obtain port %d rate." 
+ " aborting.\n", p); + goto err; + } + } + return 0; err: @@ -268,7 +306,7 @@ err: if (dev->send_agent[p][q]) ib_unregister_mad_agent(dev->send_agent[p][q]); - return PTR_ERR(agent); + return ret; } void __devexit mthca_free_agents(struct mthca_dev *dev) diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 266f347..9b9ff7b 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -52,6 +52,14 @@ MODULE_DESCRIPTION("Mellanox InfiniBand MODULE_LICENSE("Dual BSD/GPL"); MODULE_VERSION(DRV_VERSION); +#ifdef CONFIG_INFINIBAND_MTHCA_DEBUG + +int mthca_debug_level = 0; +module_param_named(debug_level, mthca_debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); + +#endif /* CONFIG_INFINIBAND_MTHCA_DEBUG */ + #ifdef CONFIG_PCI_MSI static int msi_x = 0; @@ -69,6 +77,10 @@ MODULE_PARM_DESC(msi, "attempt to use MS #endif /* CONFIG_PCI_MSI */ +static int tune_pci = 0; +module_param(tune_pci, int, 0444); +MODULE_PARM_DESC(tune_pci, "increase PCI burst from the default set by BIOS if nonzero"); + static const char mthca_version[] __devinitdata = DRV_NAME ": Mellanox InfiniBand HCA driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; @@ -90,6 +102,9 @@ static int __devinit mthca_tune_pci(stru int cap; u16 val; + if (!tune_pci) + return 0; + /* First try to max out Read Byte Count */ cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); if (cap) { @@ -176,6 +191,7 @@ static int __devinit mthca_dev_lim(struc mdev->limits.reserved_srqs = dev_lim->reserved_srqs; mdev->limits.reserved_eecs = dev_lim->reserved_eecs; mdev->limits.max_desc_sz = dev_lim->max_desc_sz; + mdev->limits.max_srq_sge = mthca_max_srq_sge(mdev); /* * Subtract 1 from the limit because we need to allocate a * spare CQE so the HCA HW can tell the difference between an @@ -191,6 +207,18 @@ static int __devinit mthca_dev_lim(struc mdev->limits.port_width_cap = dev_lim->max_port_width; 
mdev->limits.page_size_cap = ~(u32) (dev_lim->min_page_sz - 1); mdev->limits.flags = dev_lim->flags; + /* + * For old FW that doesn't return static rate support, use a + * value of 0x3 (only static rate values of 0 or 1 are handled), + * except on Sinai, where even old FW can handle static rate + * values of 2 and 3. + */ + if (dev_lim->stat_rate_support) + mdev->limits.stat_rate_support = dev_lim->stat_rate_support; + else if (mdev->mthca_flags & MTHCA_FLAG_SINAI_OPT) + mdev->limits.stat_rate_support = 0xf; + else + mdev->limits.stat_rate_support = 0x3; /* IB_DEVICE_RESIZE_MAX_WR not supported by driver. May be doable since hardware supports it for SRQ. diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 2c250bc..565a24b 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -106,7 +106,7 @@ static int mthca_query_device(struct ib_ props->max_res_rd_atom = props->max_qp_rd_atom * props->max_qp; props->max_srq = mdev->limits.num_srqs - mdev->limits.reserved_srqs; props->max_srq_wr = mdev->limits.max_srq_wqes; - props->max_srq_sge = mdev->limits.max_sg; + props->max_srq_sge = mdev->limits.max_srq_sge; props->local_ca_ack_delay = mdev->limits.local_ca_ack_delay; props->atomic_cap = mdev->limits.flags & DEV_LIM_FLAG_ATOMIC ? 
IB_ATOMIC_HCA : IB_ATOMIC_NONE; diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h index 2e7f521..6676a78 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.h +++ b/drivers/infiniband/hw/mthca/mthca_provider.h @@ -257,6 +257,8 @@ struct mthca_qp { atomic_t refcount; u32 qpn; int is_direct; + u8 port; /* for SQP and memfree use only */ + u8 alt_port; /* for memfree use only */ u8 transport; u8 state; u8 atomic_rd_en; @@ -278,7 +280,6 @@ struct mthca_qp { struct mthca_sqp { struct mthca_qp qp; - int port; int pkey_index; u32 qkey; u32 send_psn; diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 057c8e6..f37b0e3 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -248,6 +248,9 @@ void mthca_qp_event(struct mthca_dev *de return; } + if (event_type == IB_EVENT_PATH_MIG) + qp->port = qp->alt_port; + event.device = &dev->ib_dev; event.event = event_type; event.element.qp = &qp->ibqp; @@ -392,10 +395,16 @@ static void to_ib_ah_attr(struct mthca_d { memset(ib_ah_attr, 0, sizeof *path); ib_ah_attr->port_num = (be32_to_cpu(path->port_pkey) >> 24) & 0x3; + + if (ib_ah_attr->port_num == 0 || ib_ah_attr->port_num > dev->limits.num_ports) + return; + ib_ah_attr->dlid = be16_to_cpu(path->rlid); ib_ah_attr->sl = be32_to_cpu(path->sl_tclass_flowlabel) >> 28; ib_ah_attr->src_path_bits = path->g_mylmc & 0x7f; - ib_ah_attr->static_rate = path->static_rate & 0x7; + ib_ah_attr->static_rate = mthca_rate_to_ib(dev, + path->static_rate & 0x7, + ib_ah_attr->port_num); ib_ah_attr->ah_flags = (path->g_mylmc & (1 << 7)) ? 
IB_AH_GRH : 0; if (ib_ah_attr->ah_flags) { ib_ah_attr->grh.sgid_index = path->mgid_index & (dev->limits.gid_table_len - 1); @@ -455,8 +464,10 @@ int mthca_query_qp(struct ib_qp *ibqp, s qp_attr->cap.max_recv_sge = qp->rq.max_gs; qp_attr->cap.max_inline_data = qp->max_inline_data; - to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path); - to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path); + if (qp->transport == RC || qp->transport == UC) { + to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path); + to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path); + } qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 0x7f; qp_attr->alt_pkey_index = be32_to_cpu(context->alt_path.port_pkey) & 0x7f; @@ -484,11 +495,11 @@ out: } static int mthca_path_set(struct mthca_dev *dev, struct ib_ah_attr *ah, - struct mthca_qp_path *path) + struct mthca_qp_path *path, u8 port) { path->g_mylmc = ah->src_path_bits & 0x7f; path->rlid = cpu_to_be16(ah->dlid); - path->static_rate = !!ah->static_rate; + path->static_rate = mthca_get_rate(dev, ah->static_rate, port); if (ah->ah_flags & IB_AH_GRH) { if (ah->grh.sgid_index >= dev->limits.gid_table_len) { @@ -634,7 +645,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (qp->transport == MLX) qp_context->pri_path.port_pkey |= - cpu_to_be32(to_msqp(qp)->port << 24); + cpu_to_be32(qp->port << 24); else { if (attr_mask & IB_QP_PORT) { qp_context->pri_path.port_pkey |= @@ -657,7 +668,8 @@ int mthca_modify_qp(struct ib_qp *ibqp, } if (attr_mask & IB_QP_AV) { - if (mthca_path_set(dev, &attr->ah_attr, &qp_context->pri_path)) + if (mthca_path_set(dev, &attr->ah_attr, &qp_context->pri_path, + attr_mask & IB_QP_PORT ? 
attr->port_num : qp->port)) return -EINVAL; qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); @@ -681,7 +693,8 @@ int mthca_modify_qp(struct ib_qp *ibqp, return -EINVAL; } - if (mthca_path_set(dev, &attr->alt_ah_attr, &qp_context->alt_path)) + if (mthca_path_set(dev, &attr->alt_ah_attr, &qp_context->alt_path, + attr->alt_ah_attr.port_num)) return -EINVAL; qp_context->alt_path.port_pkey |= cpu_to_be32(attr->alt_pkey_index | @@ -791,6 +804,10 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp->atomic_rd_en = attr->qp_access_flags; if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) qp->resp_depth = attr->max_dest_rd_atomic; + if (attr_mask & IB_QP_PORT) + qp->port = attr->port_num; + if (attr_mask & IB_QP_ALT_PATH) + qp->alt_port = attr->alt_port_num; if (is_sqp(dev, qp)) store_attrs(to_msqp(qp), attr, attr_mask); @@ -802,13 +819,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (is_qp0(dev, qp)) { if (cur_state != IB_QPS_RTR && new_state == IB_QPS_RTR) - init_port(dev, to_msqp(qp)->port); + init_port(dev, qp->port); if (cur_state != IB_QPS_RESET && cur_state != IB_QPS_ERR && (new_state == IB_QPS_RESET || new_state == IB_QPS_ERR)) - mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + mthca_CLOSE_IB(dev, qp->port, &status); } /* @@ -1212,6 +1229,9 @@ int mthca_alloc_qp(struct mthca_dev *dev if (qp->qpn == -1) return -ENOMEM; + /* initialize port to zero for error-catching. 
*/ + qp->port = 0; + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, send_policy, qp); if (err) { @@ -1261,7 +1281,7 @@ int mthca_alloc_sqp(struct mthca_dev *de if (err) goto err_out; - sqp->port = port; + sqp->qp.port = port; sqp->qp.qpn = mqpn; sqp->qp.transport = MLX; @@ -1404,10 +1424,10 @@ static int build_mlx_header(struct mthca sqp->ud_header.lrh.source_lid = IB_LID_PERMISSIVE; sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); if (!sqp->qp.ibqp.qp_num) - ib_get_cached_pkey(&dev->ib_dev, sqp->port, + ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port, sqp->pkey_index, &pkey); else - ib_get_cached_pkey(&dev->ib_dev, sqp->port, + ib_get_cached_pkey(&dev->ib_dev, sqp->qp.port, wr->wr.ud.pkey_index, &pkey); sqp->ud_header.bth.pkey = cpu_to_be16(pkey); sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); diff --git a/drivers/infiniband/hw/mthca/mthca_srq.c b/drivers/infiniband/hw/mthca/mthca_srq.c index 2dd3aea..1cfb0fb 100644 --- a/drivers/infiniband/hw/mthca/mthca_srq.c +++ b/drivers/infiniband/hw/mthca/mthca_srq.c @@ -192,7 +192,7 @@ int mthca_alloc_srq(struct mthca_dev *de /* Sanity check SRQ size before proceeding */ if (attr->max_wr > dev->limits.max_srq_wqes || - attr->max_sge > dev->limits.max_sg) + attr->max_sge > dev->limits.max_srq_sge) return -EINVAL; srq->max = attr->max_wr; @@ -660,6 +660,31 @@ int mthca_arbel_post_srq_recv(struct ib_ return err; } +int mthca_max_srq_sge(struct mthca_dev *dev) +{ + if (mthca_is_memfree(dev)) + return dev->limits.max_sg; + + /* + * SRQ allocations are based on powers of 2 for Tavor, + * (although they only need to be multiples of 16 bytes). + * + * Therefore, we need to base the max number of sg entries on + * the largest power of 2 descriptor size that is <= to the + * actual max WQE descriptor size, rather than return the + * max_sg value given by the firmware (which is based on WQE + * sizes as multiples of 16, not powers of 2). 
+ * + * If SRQ implementation is changed for Tavor to be based on + * multiples of 16, the calculation below can be deleted and + * the FW max_sg value returned. + */ + return min(dev->limits.max_sg, + ((1 << (fls(dev->limits.max_desc_sz) - 1)) - + sizeof (struct mthca_next_seg)) / + sizeof (struct mthca_data_seg)); +} + int __devinit mthca_init_srq_table(struct mthca_dev *dev) { int err; diff --git a/drivers/infiniband/ulp/ipoib/Kconfig b/drivers/infiniband/ulp/ipoib/Kconfig index 8d2e04c..13d6d01 100644 --- a/drivers/infiniband/ulp/ipoib/Kconfig +++ b/drivers/infiniband/ulp/ipoib/Kconfig @@ -10,8 +10,9 @@ config INFINIBAND_IPOIB group: . config INFINIBAND_IPOIB_DEBUG - bool "IP-over-InfiniBand debugging" + bool "IP-over-InfiniBand debugging" if EMBEDDED depends on INFINIBAND_IPOIB + default y ---help--- This option causes debugging code to be compiled into the IPoIB driver. The output can be turned on via the diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index b640107..12a1e05 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -65,6 +65,8 @@ enum { IPOIB_RX_RING_SIZE = 128, IPOIB_TX_RING_SIZE = 64, + IPOIB_MAX_QUEUE_SIZE = 8192, + IPOIB_MIN_QUEUE_SIZE = 2, IPOIB_NUM_WC = 4, @@ -230,6 +232,9 @@ static inline struct ipoib_neigh **to_ip INFINIBAND_ALEN, sizeof(void *)); } +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +void ipoib_neigh_free(struct ipoib_neigh *neigh); + extern struct workqueue_struct *ipoib_workqueue; /* functions */ @@ -329,6 +334,8 @@ static inline void ipoib_unregister_debu #define ipoib_warn(priv, format, arg...) 
\ ipoib_printk(KERN_WARNING, priv, format , ## arg) +extern int ipoib_sendq_size; +extern int ipoib_recvq_size; #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG extern int ipoib_debug_level; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_fs.c b/drivers/infiniband/ulp/ipoib/ipoib_fs.c index 685258e..5dde380 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_fs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_fs.c @@ -213,7 +213,7 @@ static int ipoib_path_seq_show(struct se gid_buf, path.pathrec.dlid ? "yes" : "no"); if (path.pathrec.dlid) { - rate = ib_sa_rate_enum_to_int(path.pathrec.rate) * 25; + rate = ib_rate_to_mult(path.pathrec.rate) * 25; seq_printf(file, " DLID: 0x%04x\n" diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index ed65202..a54da42 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -161,7 +161,7 @@ static int ipoib_ib_post_receives(struct struct ipoib_dev_priv *priv = netdev_priv(dev); int i; - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + for (i = 0; i < ipoib_recvq_size; ++i) { if (ipoib_alloc_rx_skb(dev, i)) { ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); return -ENOMEM; @@ -187,7 +187,7 @@ static void ipoib_ib_handle_wc(struct ne if (wr_id & IPOIB_OP_RECV) { wr_id &= ~IPOIB_OP_RECV; - if (wr_id < IPOIB_RX_RING_SIZE) { + if (wr_id < ipoib_recvq_size) { struct sk_buff *skb = priv->rx_ring[wr_id].skb; dma_addr_t addr = priv->rx_ring[wr_id].mapping; @@ -252,9 +252,9 @@ static void ipoib_ib_handle_wc(struct ne struct ipoib_tx_buf *tx_req; unsigned long flags; - if (wr_id >= IPOIB_TX_RING_SIZE) { + if (wr_id >= ipoib_sendq_size) { ipoib_warn(priv, "completion event with wrid %d (> %d)\n", - wr_id, IPOIB_TX_RING_SIZE); + wr_id, ipoib_sendq_size); return; } @@ -275,7 +275,7 @@ static void ipoib_ib_handle_wc(struct ne spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; if (netif_queue_stopped(dev) && - priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE 
/ 2) + priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) netif_wake_queue(dev); spin_unlock_irqrestore(&priv->tx_lock, flags); @@ -344,13 +344,13 @@ void ipoib_send(struct net_device *dev, * means we have to make sure everything is properly recorded and * our state is consistent before we call post_send(). */ - tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)]; tx_req->skb = skb; addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, DMA_TO_DEVICE); pci_unmap_addr_set(tx_req, mapping, addr); - if (unlikely(post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), address->ah, qpn, addr, skb->len))) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; @@ -363,7 +363,7 @@ void ipoib_send(struct net_device *dev, address->last_send = priv->tx_head; ++priv->tx_head; - if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) { ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); netif_stop_queue(dev); } @@ -488,7 +488,7 @@ static int recvs_pending(struct net_devi int pending = 0; int i; - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + for (i = 0; i < ipoib_recvq_size; ++i) if (priv->rx_ring[i].skb) ++pending; @@ -527,7 +527,7 @@ int ipoib_ib_dev_stop(struct net_device */ while ((int) priv->tx_tail - (int) priv->tx_head < 0) { tx_req = &priv->tx_ring[priv->tx_tail & - (IPOIB_TX_RING_SIZE - 1)]; + (ipoib_sendq_size - 1)]; dma_unmap_single(priv->ca->dma_device, pci_unmap_addr(tx_req, mapping), tx_req->skb->len, @@ -536,7 +536,7 @@ int ipoib_ib_dev_stop(struct net_device ++priv->tx_tail; } - for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + for (i = 0; i < ipoib_recvq_size; ++i) if (priv->rx_ring[i].skb) { dma_unmap_single(priv->ca->dma_device, pci_unmap_addr(&priv->rx_ring[i], diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 9b0bd7c..cb078a7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -41,6 +41,7 @@ #include #include #include +#include #include /* For ARPHRD_xxx */ @@ -53,6 +54,14 @@ MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); MODULE_LICENSE("Dual BSD/GPL"); +int ipoib_sendq_size __read_mostly = IPOIB_TX_RING_SIZE; +int ipoib_recvq_size __read_mostly = IPOIB_RX_RING_SIZE; + +module_param_named(send_queue_size, ipoib_sendq_size, int, 0444); +MODULE_PARM_DESC(send_queue_size, "Number of descriptors in send queue"); +module_param_named(recv_queue_size, ipoib_recvq_size, int, 0444); +MODULE_PARM_DESC(recv_queue_size, "Number of descriptors in receive queue"); + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG int ipoib_debug_level; @@ -252,8 +261,8 @@ static void path_free(struct net_device */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - kfree(neigh); + + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -327,9 +336,8 @@ void ipoib_flush_paths(struct net_device struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_path *path, *tp; LIST_HEAD(remove_list); - unsigned long flags; - spin_lock_irqsave(&priv->lock, flags); + spin_lock_irq(&priv->lock); list_splice(&priv->path_list, &remove_list); INIT_LIST_HEAD(&priv->path_list); @@ -337,14 +345,15 @@ void ipoib_flush_paths(struct net_device list_for_each_entry(path, &remove_list, list) rb_erase(&path->rb_node, &priv->path_tree); - spin_unlock_irqrestore(&priv->lock, flags); - list_for_each_entry_safe(path, tp, &remove_list, list) { if (path->query) ib_sa_cancel_query(path->query_id, path->query); + spin_unlock_irq(&priv->lock); wait_for_completion(&path->done); path_free(dev, path); + spin_lock_irq(&priv->lock); } + spin_unlock_irq(&priv->lock); } static void path_rec_completion(int status, @@ -373,16 +382,9 @@ static void 
path_rec_completion(int stat struct ib_ah_attr av = { .dlid = be16_to_cpu(pathrec->dlid), .sl = pathrec->sl, - .port_num = priv->port + .port_num = priv->port, + .static_rate = pathrec->rate }; - int path_rate = ib_sa_rate_enum_to_int(pathrec->rate); - - if (path_rate > 0 && priv->local_rate > path_rate) - av.static_rate = (priv->local_rate - 1) / path_rate; - - ipoib_dbg(priv, "static_rate %d for local port %dX, path %dX\n", - av.static_rate, priv->local_rate, - ib_sa_rate_enum_to_int(pathrec->rate)); ah = ipoib_create_ah(dev, priv->pd, &av); } @@ -481,7 +483,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (!neigh) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -489,8 +491,6 @@ static void neigh_add_path(struct sk_buf } skb_queue_head_init(&neigh->queue); - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; /* * We can only be called from ipoib_start_xmit, so we're @@ -503,7 +503,7 @@ static void neigh_add_path(struct sk_buf path = path_rec_create(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); if (!path) - goto err; + goto err_path; __path_add(dev, path); } @@ -521,17 +521,17 @@ static void neigh_add_path(struct sk_buf __skb_queue_tail(&neigh->queue, skb); if (!path->query && path_rec_start(dev, path)) - goto err; + goto err_list; } spin_unlock(&priv->lock); return; -err: - *to_ipoib_neigh(skb->dst->neighbour) = NULL; +err_list: list_del(&neigh->list); - kfree(neigh); +err_path: + ipoib_neigh_free(neigh); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -763,8 +763,7 @@ static void ipoib_neigh_destructor(struc if (neigh->ah) ah = neigh->ah; list_del(&neigh->list); - *to_ipoib_neigh(n) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -773,6 +772,26 @@ static void ipoib_neigh_destructor(struc ipoib_put_ah(ah); } 
+struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) +{ + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) + return NULL; + + neigh->neighbour = neighbour; + *to_ipoib_neigh(neighbour) = neigh; + + return neigh; +} + +void ipoib_neigh_free(struct ipoib_neigh *neigh) +{ + *to_ipoib_neigh(neigh->neighbour) = NULL; + kfree(neigh); +} + static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { parms->neigh_destructor = ipoib_neigh_destructor; @@ -785,20 +804,19 @@ int ipoib_dev_init(struct net_device *de struct ipoib_dev_priv *priv = netdev_priv(dev); /* Allocate RX/TX "rings" to hold queued skbs */ - - priv->rx_ring = kzalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf), + priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring, GFP_KERNEL); if (!priv->rx_ring) { printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", - ca->name, IPOIB_RX_RING_SIZE); + ca->name, ipoib_recvq_size); goto out; } - priv->tx_ring = kzalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf), + priv->tx_ring = kzalloc(ipoib_sendq_size * sizeof *priv->tx_ring, GFP_KERNEL); if (!priv->tx_ring) { printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", - ca->name, IPOIB_TX_RING_SIZE); + ca->name, ipoib_sendq_size); goto out_rx_ring_cleanup; } @@ -866,7 +884,7 @@ static void ipoib_setup(struct net_devic dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; dev->addr_len = INFINIBAND_ALEN; dev->type = ARPHRD_INFINIBAND; - dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->tx_queue_len = ipoib_sendq_size * 2; dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; /* MTU will be reset when mcast join happens */ @@ -1118,6 +1136,14 @@ static int __init ipoib_init_module(void { int ret; + ipoib_recvq_size = roundup_pow_of_two(ipoib_recvq_size); + ipoib_recvq_size = min(ipoib_recvq_size, IPOIB_MAX_QUEUE_SIZE); + ipoib_recvq_size = max(ipoib_recvq_size, IPOIB_MIN_QUEUE_SIZE); + 
+ ipoib_sendq_size = roundup_pow_of_two(ipoib_sendq_size); + ipoib_sendq_size = min(ipoib_sendq_size, IPOIB_MAX_QUEUE_SIZE); + ipoib_sendq_size = max(ipoib_sendq_size, IPOIB_MIN_QUEUE_SIZE); + ret = ipoib_register_debugfs(); if (ret) return ret; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 93c462e..1dae4b2 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -114,8 +114,7 @@ static void ipoib_mcast_free(struct ipoi */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -251,6 +250,7 @@ static int ipoib_mcast_join_finish(struc .port_num = priv->port, .sl = mcast->mcmember.sl, .ah_flags = IB_AH_GRH, + .static_rate = mcast->mcmember.rate, .grh = { .flow_label = be32_to_cpu(mcast->mcmember.flow_label), .hop_limit = mcast->mcmember.hop_limit, @@ -258,17 +258,8 @@ static int ipoib_mcast_join_finish(struc .traffic_class = mcast->mcmember.traffic_class } }; - int path_rate = ib_sa_rate_enum_to_int(mcast->mcmember.rate); - av.grh.dgid = mcast->mcmember.mgid; - if (path_rate > 0 && priv->local_rate > path_rate) - av.static_rate = (priv->local_rate - 1) / path_rate; - - ipoib_dbg_mcast(priv, "static_rate %d for local port %dX, mcmember %dX\n", - av.static_rate, priv->local_rate, - ib_sa_rate_enum_to_int(mcast->mcmember.rate)); - ah = ipoib_create_ah(dev, priv->pd, &av); if (!ah) { ipoib_warn(priv, "ib_address_create failed\n"); @@ -618,6 +609,22 @@ int ipoib_mcast_start_thread(struct net_ return 0; } +static void wait_for_mcast_join(struct ipoib_dev_priv *priv, + struct ipoib_mcast *mcast) +{ + spin_lock_irq(&priv->lock); + if (mcast && mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + spin_unlock_irq(&priv->lock); + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + 
IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } + else + spin_unlock_irq(&priv->lock); +} + int ipoib_mcast_stop_thread(struct net_device *dev, int flush) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -637,28 +644,10 @@ int ipoib_mcast_stop_thread(struct net_d if (flush) flush_workqueue(ipoib_workqueue); - spin_lock_irq(&priv->lock); - if (priv->broadcast && priv->broadcast->query) { - ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); - priv->broadcast->query = NULL; - spin_unlock_irq(&priv->lock); - ipoib_dbg_mcast(priv, "waiting for bcast\n"); - wait_for_completion(&priv->broadcast->done); - } else - spin_unlock_irq(&priv->lock); + wait_for_mcast_join(priv, priv->broadcast); - list_for_each_entry(mcast, &priv->multicast_list, list) { - spin_lock_irq(&priv->lock); - if (mcast->query) { - ib_sa_cancel_query(mcast->query_id, mcast->query); - mcast->query = NULL; - spin_unlock_irq(&priv->lock); - ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", - IPOIB_GID_ARG(mcast->mcmember.mgid)); - wait_for_completion(&mcast->done); - } else - spin_unlock_irq(&priv->lock); - } + list_for_each_entry(mcast, &priv->multicast_list, list) + wait_for_mcast_join(priv, mcast); return 0; } @@ -772,13 +761,11 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (neigh) { kref_get(&mcast->ah->ref); neigh->ah = mcast->ah; - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; list_add_tail(&neigh->list, &mcast->neigh_list); } } @@ -913,6 +900,7 @@ void ipoib_mcast_restart_task(void *dev_ /* We have to cancel outside of the spinlock */ list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { + wait_for_mcast_join(priv, mcast); ipoib_mcast_leave(mcast->dev, mcast); ipoib_mcast_free(mcast); } diff --git 
a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 5f03880..1d49d16 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -159,8 +159,8 @@ int ipoib_transport_dev_init(struct net_ struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr init_attr = { .cap = { - .max_send_wr = IPOIB_TX_RING_SIZE, - .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_wr = ipoib_sendq_size, + .max_recv_wr = ipoib_recvq_size, .max_send_sge = 1, .max_recv_sge = 1 }, @@ -175,7 +175,7 @@ int ipoib_transport_dev_init(struct net_ } priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, - IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + ipoib_sendq_size + ipoib_recvq_size + 1); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); goto out_free_pd; diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index fd8a95a..5f2b3f6 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1434,6 +1434,7 @@ static int srp_parse_options(const char p = match_strdup(args); if (strlen(p) != 32) { printk(KERN_WARNING PFX "bad dest GID parameter '%s'\n", p); + kfree(p); goto out; } diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index f404fe2..ad63c21 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -91,34 +91,6 @@ enum ib_sa_selector { IB_SA_BEST = 3 }; -enum ib_sa_rate { - IB_SA_RATE_2_5_GBPS = 2, - IB_SA_RATE_5_GBPS = 5, - IB_SA_RATE_10_GBPS = 3, - IB_SA_RATE_20_GBPS = 6, - IB_SA_RATE_30_GBPS = 4, - IB_SA_RATE_40_GBPS = 7, - IB_SA_RATE_60_GBPS = 8, - IB_SA_RATE_80_GBPS = 9, - IB_SA_RATE_120_GBPS = 10 -}; - -static inline int ib_sa_rate_enum_to_int(enum ib_sa_rate rate) -{ - switch (rate) { - case IB_SA_RATE_2_5_GBPS: return 1; - case IB_SA_RATE_5_GBPS: return 2; - case IB_SA_RATE_10_GBPS: return 4; - case IB_SA_RATE_20_GBPS: return 8; - case IB_SA_RATE_30_GBPS: 
return 12; - case IB_SA_RATE_40_GBPS: return 16; - case IB_SA_RATE_60_GBPS: return 24; - case IB_SA_RATE_80_GBPS: return 32; - case IB_SA_RATE_120_GBPS: return 48; - default: return -1; - } -} - /* * Structures for SA records are named "struct ib_sa_xxx_rec." No * attempt is made to pack structures to match the physical layout of diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index c1ad627..6bbf1b3 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -314,6 +314,34 @@ enum ib_ah_flags { IB_AH_GRH = 1 }; +enum ib_rate { + IB_RATE_PORT_CURRENT = 0, + IB_RATE_2_5_GBPS = 2, + IB_RATE_5_GBPS = 5, + IB_RATE_10_GBPS = 3, + IB_RATE_20_GBPS = 6, + IB_RATE_30_GBPS = 4, + IB_RATE_40_GBPS = 7, + IB_RATE_60_GBPS = 8, + IB_RATE_80_GBPS = 9, + IB_RATE_120_GBPS = 10 +}; + +/** + * ib_rate_to_mult - Convert the IB rate enum to a multiple of the + * base rate of 2.5 Gbit/sec. For example, IB_RATE_5_GBPS will be + * converted to 2, since 5 Gbit/sec is 2 * 2.5 Gbit/sec. + * @rate: rate to convert. + */ +int ib_rate_to_mult(enum ib_rate rate) __attribute_const__; + +/** + * mult_to_ib_rate - Convert a multiple of 2.5 Gbit/sec to an IB rate + * enum. + * @mult: multiple to convert. + */ +enum ib_rate mult_to_ib_rate(int mult) __attribute_const__; + struct ib_ah_attr { struct ib_global_route grh; u16 dlid; From robert.j.woodruff at intel.com Wed Apr 12 16:21:25 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 12 Apr 2006 16:21:25 -0700 Subject: [openib-general] Announcing the OpenFabrics Enterprise Distribution In-Reply-To: Message-ID: <000a01c65e87$d2371920$010fa8c0@amr.corp.intel.com> >OFED is focused on including open-source MPI in the distribution. >Commercial MPI ISVs can definitely test against OFED, make statements >about interoperability, and ask the community to make changes specific >to their stack. Likewise, IB vendors may directly do their own testing >with these MPI packages and/or support the ISV's efforts. 
>--Shawn

Is there a release schedule for OFED? I did not see it in the announcement.

woody

From rminnich at lanl.gov Wed Apr 12 16:29:30 2006
From: rminnich at lanl.gov (Ronald G Minnich)
Date: Wed, 12 Apr 2006 17:29:30 -0600
Subject: [openib-general] thanks and a question
Message-ID: <443D8D5A.2070809@lanl.gov>

I was working with someone and watching a 256-node bproc cluster boot Friday. The openib folks have done a lot of very nice work. It booted quite well once we set hoq and slv to 17 in the voltaire switch. It was really snappy coming up. It was actually as fast to boot as a myrinet cluster, which was nice to see.

But a question. When hoq and slv were 16 in the voltaire switch, we saw tcp sessions hanging. Thinking back on the tcpdump we watched (would that I had saved it) it almost seems that the sender thought it had gotten an ack for a segment of 96 bytes, and discarded it; whereas the receiver thought it had only gotten 32 of the 96 bytes, and was sending back its idea of where the tcp stream was. So we sat and watched (via tcpdump on the receiver) the two hosts send each other differing ideas about the sequence numbers on the tcp connection.

Is this at all possible? Could something happen below the tcp stack, given a switch with too-low hoq and slv settings, such that the sender would discard a segment that the receiver would not have ever seen? Is there any switch involvement that could cause this? The whole situation was really odd.

Finally, this was one sender, one receiver, and the problem was very, very repeatable -- until we bumped 16->17.

Sorry I don't have more info.

thanks

ron

From robert.j.woodruff at intel.com Wed Apr 12 17:19:47 2006
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Wed, 12 Apr 2006 17:19:47 -0700
Subject: [openib-general] [RFC] Would like to cut a new release candidate
In-Reply-To: <1144881334.22229.22.camel@chalcedony.pathscale.com>
Message-ID: <000b01c65e8f$f9421670$010fa8c0@amr.corp.intel.com>

Bryan wrote,
>Does anyone have any objections?

>By the way, Tziporet and I have discussed syncing up the release
>candidate numbers for the next IBED and OF release candidates, to reduce
>a possible source of confusion. This will mean that OF will hop from
>RC2.1 to RC4, while IBED will step to RC4. The posited date for the
>release of both is May 1.

> (Ira Weiny's message of "Wed, 12 Apr 2006 14:55:29 -0700")
References: <1144869147.19061.91517.camel@hal.voltaire.com> <20060412145529.65162e81.weiny2@llnl.gov>
Message-ID:

    Ira> This is my one big question.

    Ira> Also what happens to people who took the 1.0 branch and are
    Ira> actually using it? That is what LLNL has done and now the
    Ira> kernel code is gone.

Kernel code was included on the branch by mistake. OpenFabrics is not in the business of releasing kernels so it was confusing to have that code there.

You should get kernels via whatever your normal mechanism is -- using your distribution's kernel or upstream kernels from kernel.org are two possibilities. Also, OFED will provide a distribution of IB driver modules if you want to use that.

Shawn's original email described the situation for kernel modules pretty well:

 > 1. Any module that is already in the kernel will be taken from the git
 > tree that is targeted for next kernel release
 > 2. Kernel modules that are not in Linux kernel will be taken from
 > openFabrics SVN trunk or in extraordinary cases, from SVN contrib.
 > 3. All user space code is taken from the 1.0 branch. OFED group will
 > make sure the right patches from the trunk are updated to the branch.

 - R.

From halr at voltaire.com Wed Apr 12 17:25:08 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Apr 2006 20:25:08 -0400
Subject: [openib-general] thanks and a question
In-Reply-To: <443D8D5A.2070809@lanl.gov>
References: <443D8D5A.2070809@lanl.gov>
Message-ID: <1144887715.19061.94174.camel@hal.voltaire.com>

Hi Ron,

On Wed, 2006-04-12 at 19:29, Ronald G Minnich wrote:
> I was working with someone and watching a 256-node bproc cluster boot
> friday. The openib folks have done a lot of very nice work. It booted
> quite well once we set hoq and slv to 17 in the voltaire switch.

hoq is HOQLife. Is slv the switch LifeTimeValue?

> It was
> really snappy coming up. It was actually as fast to boot as a myrinet
> cluster, which was nice to see.

Does that have anything to do with those settings?

> But a question. When hoq and slv were 16 in the voltaire switch, we saw
> tcp sessions hanging.

Truly hanging?

> Thinking back on the tcpdump we watched (would
> that i had saved it) it almost seems that the sender thought it had
> gotten an ack for a segment of 96 bytes, and discarded it; whereas the
> receiver thought it had only gotten 32 of the 96 bytes, and was sending
> back its idea of where the tcp stream was.

Switches might drop 64 bytes at a time based on those parameters.

> So we sat and watched (via
> tcpdump on the receiver) the two hosts send each other differing ideas
> about the sequence numbers on the tcp connection.
>
> is this at all possible? Could something happen below the tcp stack,
> given a switch with too-low hoq and slv settings, such that the sender
> would discard a segment that the receiver would not have ever seen?

Yes. The two directions are independent, so I think that dropping in one direction could cause this.

> Is
> there any switch involvment that could cause this? The whole situation
> was really odd.
>
> Finally, this was one sender, one receiver, and the problem was very,
> very repeatable -- until we bumped 16->17.

That effectively doubles the time before the drops would occur which probably eliminated the drops so you didn't see this. 16 = 268.435 msec 17 = 526.871 msec What doesn't make sense to me is the one flow. Are you sure there's no other data traffic ? If so, that doesn't make sense to me and hang together with the rest of this scenario. -- Hal > Sorry I don't have more info. > > thanks > > ron > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Wed Apr 12 17:46:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Apr 2006 20:46:33 -0400 Subject: [openib-general] Fix for ibping In-Reply-To: <4df28be40604121525j22e085cdg8ab9c1a820b24e3a@mail.gmail.com> References: <4df28be40604121525j22e085cdg8ab9c1a820b24e3a@mail.gmail.com> Message-ID: <1144888960.19061.94385.camel@hal.voltaire.com> On Wed, 2006-04-12 at 18:25, Viswanath Krishnamurthy wrote: > The RMPP version needs to be 1. Thanks. I'm not sure what changed here to require this. I need to do some more digging. 
-- Hal > [root at subnetmgr5 src]# svn diff ibping.c > Index: ibping.c > =================================================================== > --- ibping.c (revision 6446) > +++ ibping.c (working copy) > @@ -336,7 +336,7 @@ > exit(0); > } > > - if (mad_register_client(ping_class, 0) < 0) > + if (mad_register_client(ping_class, 1) < 0) > IBERROR("can't register to ping class %d", > ping_class); > > if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) > < 0) From viswa.krish at gmail.com Wed Apr 12 18:07:04 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Wed, 12 Apr 2006 18:07:04 -0700 Subject: [openib-general] Fix for ibping In-Reply-To: <1144888960.19061.94385.camel@hal.voltaire.com> References: <4df28be40604121525j22e085cdg8ab9c1a820b24e3a@mail.gmail.com> <1144888960.19061.94385.camel@hal.voltaire.com> Message-ID: <4df28be40604121807y231d3424i2f7ffc7f9252f45e@mail.gmail.com> The mad_register_agent function in mad.c kernel file was checking for rmpp_version. This was failing and this failure was propagated to umad (thru ioctl) On 12 Apr 2006 20:46:33 -0400, Hal Rosenstock wrote: > > On Wed, 2006-04-12 at 18:25, Viswanath Krishnamurthy wrote: > > The RMPP version needs to be 1. > > Thanks. I'm not sure what changed here to require this. I need to do > some more digging.
> > -- Hal > > > [root at subnetmgr5 src]# svn diff ibping.c > > Index: ibping.c > > =================================================================== > > --- ibping.c (revision 6446) > > +++ ibping.c (working copy) > > @@ -336,7 +336,7 @@ > > exit(0); > > } > > > > - if (mad_register_client(ping_class, 0) < 0) > > + if (mad_register_client(ping_class, 1) < 0) > > IBERROR("can't register to ping class %d", > > ping_class); > > > > if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) > > < 0) From surs at cse.ohio-state.edu Wed Apr 12 18:24:30 2006 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed, 12 Apr 2006 21:24:30 -0400 Subject: [openib-general] Trying to compile mvapich RHEL4U3 for ib. In-Reply-To: <443D7A14.8050807@atipa.com> References: <443D7A14.8050807@atipa.com> Message-ID: <20060413012426.GA12977@cse.ohio-state.edu> Hello Roger, > With mvapich-0.9.7 it errors out in the building > stage with an error ibv_free_device_list/ibv_get_device_list missing, > I cannot find any of the ib libraries on RHEL4U3 that appear to contain > that library. Thanks for trying out MVAPICH-0.9.7. Currently, we don't have any machine with RHEL4U3. We are installing two machines with RHEL4U3 and we will try out MVAPICH on that as soon as possible. The verbs `ibv_get_device_list' was introduced before the 1.0 branch. So, if you have either OpenIB installed from the trunk or from the 1.0 branch, you _should_ be able to see this verb in the library. I am wondering if you are trying out the default versions of the OpenIB rpms on RHEL4U3?
> Using the mvapich-gen2-1.src.rpm from openib.org results in > these errors (on the first thing it tries to compile). > viainit.c: In function `create_cq': > viainit.c:118: error: too few arguments to function `ibv_create_cq' This is also due to a verb change made a while back to the ibv_create_cq. I believe this version of mvapich-gen2 source rpm was created against the version of userspace support which is present in the very same .src.rpm (you may install those if you want, though they are a little old now). The userspace verbs changed after this src rpm was created. > I have verified that the include file prototype has more arguments, than > are contained in viainit.c. Yes, it seems that the RPM you have installed is from somewhere in between the ibv_create_cq verb change and the later introduction of the ibv_get_device list verb. I'm wondering if you could try it out with the latest 1.0 branch of OpenIB? In addition, we will get back to you asap with our testing on RHEL4U3. Thanks, Sayantan. -- http://www.cse.ohio-state.edu/~surs From surs at cse.ohio-state.edu Wed Apr 12 18:29:28 2006 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed, 12 Apr 2006 21:29:28 -0400 Subject: [openib-general] RHEL4ASU3 question In-Reply-To: References: <000501c65db6$24a015e0$010fa8c0@amr.corp.intel.com> Message-ID: <20060413012926.GB12977@cse.ohio-state.edu> Hello Don, > We are running RHEL4 U3 and the MVAPICH version from the OpenIB gen2 trunk. We > were able to run the OSU benchmark tests (osu_bw, osu_bibw, and osu_latency) > with the Mellanox SDR cards successfully, but when we swapped out the cards for > DDR cards, we ran into some problems. 
We can run some MPI jobs like the simple > "calculate pi" job (cpi.c), and we can run an MPING application, but when we > try to run the benchmark tests, we get the following: > > [koa] (ib) ib> mpirun_rsh -np 2 koa jatoba /home/ib/mpi/tests/osu/osu_bw > # OSU MPI Bandwidth Test (Version 2.1) > # Size Bandwidth (MB/s) > [0] Abort: [koa.az05.bull.com:0] Got completion with error, code=1 > at line 2148 in file viacheck.c > mpirun_rsh: Abort signaled from [0] > done. Currently we do not have RHEL4U3 installed on any machine. We are installing those and will get back to you. We have plenty of machines with RHEL4U2 and OpenIB (1.0 branch) seems to work just fine with them. Just a question, when you compiled MVAPICH, did you choose the DDR option when prompted by make.mvapich.gen2? Oh yes, another question. Is this reproducible with OpenIB built from the 1.0 branch or trunk instead of the stock RedHat RPM? Thanks, Sayantan. -- http://www.cse.ohio-state.edu/~surs From halr at voltaire.com Wed Apr 12 18:29:10 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Apr 2006 21:29:10 -0400 Subject: [openib-general] [PATCH] mad.c::ib_register_mad_agent: Fix RMPP version check during agent registration Message-ID: <1144891401.4539.28.camel@hal.voltaire.com> mad.c::ib_register_mad_agent: Fix RMPP version check during agent registration Only check that RMPP version is not specified when MAD class does not support RMPP Signed-off-by: Hal Rosenstock -- Roland, Can you push this fix upstream ? Thanks. 
Index: mad.c =================================================================== -- mad.c (revision 6445) +++ mad.c (working copy) @@ -228,10 +228,7 @@ struct ib_mad_agent *ib_register_mad_age goto error1; } /* Make sure class supplied is consistent with RMPP */ - if (ib_is_mad_class_rmpp(mad_reg_req->mgmt_class)) { - if (!rmpp_version) - goto error1; - } else { + if (!ib_is_mad_class_rmpp(mad_reg_req->mgmt_class)) { if (rmpp_version) goto error1; } From halr at voltaire.com Wed Apr 12 18:32:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Apr 2006 21:32:33 -0400 Subject: [openib-general] Fix for ibping In-Reply-To: <1144888960.19061.94385.camel@hal.voltaire.com> References: <4df28be40604121525j22e085cdg8ab9c1a820b24e3a@mail.gmail.com> <1144888960.19061.94385.camel@hal.voltaire.com> Message-ID: <1144891417.4539.41.camel@hal.voltaire.com> On Wed, 2006-04-12 at 20:46, Hal Rosenstock wrote: > On Wed, 2006-04-12 at 18:25, Viswanath Krishnamurthy wrote: > > The RMPP version needs to be 1. > > Thanks. I'm not sure what changed here to require this. I need to do > some more digging. I figured it out. The fix is in r6448. Can you update and try it ? Thanks.
-- Hal > -- Hal > > > [root at subnetmgr5 src]# svn diff ibping.c > > Index: ibping.c > > =================================================================== > > -- ibping.c (revision 6446) > > +++ ibping.c (working copy) > > @@ -336,7 +336,7 @@ > > exit(0); > > } > > > > - if (mad_register_client(ping_class, 0) < 0) > > + if (mad_register_client(ping_class, 1) < 0) > > IBERROR("can't register to ping class %d", > > ping_class); > > > > if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) > > < 0) From halr at voltaire.com Wed Apr 12 18:36:01 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Apr 2006 21:36:01 -0400 Subject: [openib-general] Fix for ibping In-Reply-To: <4df28be40604121807y231d3424i2f7ffc7f9252f45e@mail.gmail.com> References: <4df28be40604121525j22e085cdg8ab9c1a820b24e3a@mail.gmail.com> <1144888960.19061.94385.camel@hal.voltaire.com> <4df28be40604121807y231d3424i2f7ffc7f9252f45e@mail.gmail.com> Message-ID: <1144891447.4539.43.camel@hal.voltaire.com> On Wed, 2006-04-12 at 21:07, Viswanath Krishnamurthy wrote: > The mad_register_agent function in mad.c kernel file was checking for > rmpp_version. > This was failing and this failure was propagated to umad (thru ioctl) Right. Just because a class is allowed to use RMPP doesn't mean that rmpp_version needs to be set for the MAD agent to register. This was a recent change which was too pedantic.
I fixed this in r6448 so you can revert the ibping change you sent and it should work now as before. Thanks. -- Hal > > On 12 Apr 2006 20:46:33 -0400, Hal Rosenstock > wrote: > On Wed, 2006-04-12 at 18:25, Viswanath Krishnamurthy wrote: > > The RMPP version needs to be 1. > > Thanks. I'm not sure what changed here to require this. I need > to do > some more digging. > > -- Hal > > > [ root at subnetmgr5 src]# svn diff ibping.c > > Index: ibping.c > > > =================================================================== > > -- ibping.c (revision 6446) > > +++ ibping.c (working copy) > > @@ -336,7 +336,7 @@ > > exit(0); > > } > > > > - if (mad_register_client(ping_class, 0) < 0) > > + if (mad_register_client(ping_class, 1) < 0) > > IBERROR("can't register to ping class %d", > > ping_class); > > > > if (ib_resolve_portid_str(&portid, argv[0], > dest_type, sm_id) > > < 0) From rminnich at lanl.gov Wed Apr 12 20:46:24 2006 From: rminnich at lanl.gov (Ronald G Minnich) Date: Wed, 12 Apr 2006 21:46:24 -0600 Subject: [openib-general] thanks and a question In-Reply-To: <1144887715.19061.94174.camel@hal.voltaire.com> References: <443D8D5A.2070809@lanl.gov> <1144887715.19061.94174.camel@hal.voltaire.com> Message-ID: <443DC990.1080108@lanl.gov> Hal Rosenstock wrote: > hoq is HOQLife. Is slv the switch LifeTimeValue ? I believe so. > Does that have anything to do with those settings ? it would not work until hoq and slv were 17. > Truly hanging ? yes, and it was the only real connection at that point, from the bproc daemon on the slave node to the bproc daemon on the master. There was only 1 host powered up at that point.
It was very repeatable -- we tried to get it to boot many times. And, weirdly, it always hung at that same point. > Switches might drop 64 bytes at a time based on those parameters. But why does the sender think the segment has been acked, when the receiver has never seen that last 64 bytes? Where did the sender get that TCP-level ack? > That effectively doubles the time before the drops would occur which > probably eliminated the drops so you didn't see this. > > 16 = 268.435 msec > 17 = 526.871 msec which leads to another question. This is 1/2 second. Does it really mean that you could end up buffering 1/2 worth of flow on each port for all 256 ports? > > What doesn't make sense to me is the one flow. Are you sure there's no > other data traffic ? If so, that doesn't make sense to me and hang > together with the rest of this scenario. no other traffic that we could see, but there had been traffic prior to this. Thanks hal! ron From rdreier at cisco.com Wed Apr 12 22:39:50 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 12 Apr 2006 22:39:50 -0700 Subject: [openib-general] Re: [PATCH] git: updates to rdma_cm branch In-Reply-To: (Sean Hefty's message of "Mon, 10 Apr 2006 17:12:47 -0700") References: Message-ID: OK, I updated my rdma_cm branch with all of this. In addition I put the following in -- it's idiomatic in the kernel to let the compiler handle htons(A_CONSTANT) in code. Should I commit this to svn too? 
diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index 810fdd5..d294bbc 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -333,8 +333,8 @@ static int addr_arp_recv(struct sk_buff arp_hdr = (struct arphdr *) skb->nh.raw; - if (arp_hdr->ar_op == __constant_htons(ARPOP_REQUEST) || - arp_hdr->ar_op == __constant_htons(ARPOP_REPLY)) + if (arp_hdr->ar_op == htons(ARPOP_REQUEST) || + arp_hdr->ar_op == htons(ARPOP_REPLY)) set_timeout(jiffies); kfree_skb(skb); From rdreier at cisco.com Wed Apr 12 22:44:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 12 Apr 2006 22:44:18 -0700 Subject: [openib-general] Re: [PATCH] mad.c::ib_register_mad_agent: Fix RMPP version check during agent registration In-Reply-To: <1144891401.4539.28.camel@hal.voltaire.com> (Hal Rosenstock's message of "12 Apr 2006 21:29:10 -0400") References: <1144891401.4539.28.camel@hal.voltaire.com> Message-ID: OK, I applied this by hand ... your mailer turned all your tabs into spaces somewhere along the way, so the patch wouldn't apply. - R. From halr at voltaire.com Thu Apr 13 02:52:25 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2006 05:52:25 -0400 Subject: [openib-general] Re: [PATCH] mad.c::ib_register_mad_agent: Fix RMPP version check during agent registration In-Reply-To: References: <1144891401.4539.28.camel@hal.voltaire.com> Message-ID: <1144921945.4539.4848.camel@hal.voltaire.com> On Thu, 2006-04-13 at 01:44, Roland Dreier wrote: > OK, I applied this by hand ... your mailer turned all your tabs into > spaces somewhere along the way, so the patch wouldn't apply. Wow. That hasn't happened in a while. I used preformat on evolution the same as the other patches so I'm not sure what's up. Thanks for applying it. -- Hal > - R. 
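The RMPP version check that the mad.c patch in this thread relaxes can be modelled outside the kernel. This is only a sketch of the two conditions as they read in the posted diff: the function names are hypothetical, the boolean parameter stands in for the result of ib_is_mad_class_rmpp(), and other validation done by ib_register_mad_agent is not modelled.

```c
#include <stdbool.h>

/*
 * Check as it read before the patch: an RMPP-capable class had to
 * supply a non-zero rmpp_version, and any other class had to supply 0.
 * Returns true when registration would be allowed to proceed.
 */
static bool rmpp_check_before(bool class_supports_rmpp, unsigned int rmpp_version)
{
	if (class_supports_rmpp)
		return rmpp_version != 0;
	return rmpp_version == 0;
}

/*
 * Check as it reads after the patch: only a class that does not
 * support RMPP is rejected for specifying an rmpp_version.
 */
static bool rmpp_check_after(bool class_supports_rmpp, unsigned int rmpp_version)
{
	if (!class_supports_rmpp && rmpp_version)
		return false;
	return true;
}
```

With this model, an RMPP-capable class registering with version 0 is rejected by the earlier check and accepted by the patched one, which would be consistent with ibping working again without the mad_register_client(ping_class, 1) change.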
From halr at voltaire.com Thu Apr 13 02:57:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2006 05:57:54 -0400 Subject: [openib-general] thanks and a question In-Reply-To: <443DC990.1080108@lanl.gov> References: <443D8D5A.2070809@lanl.gov> <1144887715.19061.94174.camel@hal.voltaire.com> <443DC990.1080108@lanl.gov> Message-ID: <1144922274.4539.4905.camel@hal.voltaire.com> Hi again Ron, On Wed, 2006-04-12 at 23:46, Ronald G Minnich wrote: > Hal Rosenstock wrote: > > > hoq is HOQLife. Is slv the switch LifeTimeValue ? > > I believe so. > > > Does that have anything to do with those settings ? > > it would not work until hoq and slv were 17. > > > Truly hanging ? > > yes, and it was the only real connection at that point, from the bproc > daemon on the slave node to the bproc daemon on the master. There was > only 1 host powered up at that point. It was very repeatable -- we tried > to get it to boot many times. And, weirdly, it always hung at that same > point. > > > > Switches might drop 64 bytes at a time based on those parameters. > > But why does the sender think the segment has been acked, when the > receiver has never seen that last 64 bytes? Where did the sender get > that TCP-level ack? I don't know. It doesn't make sense. Dropping a buffer (64 bytes) in a packet should cause a CRC error which should mean the TCP packet is not valid. In any case, you should be able to see the drops in the various Port (error) counters. > > That effectively doubles the time before the drops would occur which > > probably eliminated the drops so you didn't see this. > > > > 16 = 268.435 msec > > 17 = 526.871 msec > > which leads to another question. This is 1/2 second. Does it really mean > that you could end up buffering 1/2 worth of flow on each port for all > 256 ports? It is limited by the number of buffers (per VL per port) which is no where near this so that could not occur. 
The credits advertised on the link are reduced by the buffers in use so the throughput would slow down on a congested port (meaning either congestion or a slow receiver). > > > > What doesn't make sense to me is the one flow. Are you sure there's no > > other data traffic ? If so, that doesn't make sense to me and hang > > together with the rest of this scenario. > > no other traffic that we could see, but there had been traffic prior to > this. I would recommend putting an IB analyzer on the last link towards that slave node and capturing the data traffic. -- Hal > Thanks hal! > > ron From rheflin at atipa.com Thu Apr 13 06:23:16 2006 From: rheflin at atipa.com (Roger Heflin) Date: Thu, 13 Apr 2006 08:23:16 -0500 Subject: [openib-general] Trying to compile mvapich RHEL4U3 for ib. In-Reply-To: <20060413012426.GA12977@cse.ohio-state.edu> References: <443D7A14.8050807@atipa.com> <20060413012426.GA12977@cse.ohio-state.edu> Message-ID: <443E50C4.7070804@atipa.com> Sayantan Sur wrote: > Hello Roger, > >> With mvapich-0.9.7 it errors out in the building >> stage with an error ibv_free_device_list/ibv_get_device_list missing, >> I cannot find any of the ib libraries on RHEL4U3 that appear to contain >> that library. > > Thanks for trying out MVAPICH-0.9.7. Currently, we don't have any > machine with RHEL4U3. We are installing two machines with RHEL4U3 and we > will try out MVAPICH on that as soon as possible. > > The verbs `ibv_get_device_list' was introduced before the 1.0 branch. > So, if you have either OpenIB installed from the trunk or from the 1.0 > branch, you _should_ be able to see this verb in the library. > > I am wondering if you are trying out the default versions of the OpenIB > rpms on RHEL4U3? Yes, I am trying the default version of RHEL4U3, alot of our customers would much rather use unmodified RHEL, though I can probably talk them out of it with a bit of work. 
They have some strange ideas that RHEL is somehow "guaranteed" to work right, and from what I can tell it won't completely work just because RH did not include a IB mpi variant, at least not one that I can find. > >> Using the mvapich-gen2-1.src.rpm from openib.org results in >> these errors (on the first thing it tries to compile). >> viainit.c: In function `create_cq': >> viainit.c:118: error: too few arguments to function `ibv_create_cq' > > This is also due to a verb change made a while back to the > ibv_create_cq. I believe this version of mvapich-gen2 source rpm was > created against the version of userspace support which is present in the > very same .src.rpm (you may install those if you want, though they are a > little old now). The userspace verbs changed after this src rpm was > created. > >> I have verified that the include file prototype has more arguments, than >> are contained in viainit.c. > > Yes, it seems that the RPM you have installed is from somewhere in > between the ibv_create_cq verb change and the later introduction of the > ibv_get_device list verb. > > I'm wondering if you could try it out with the latest 1.0 branch of > OpenIB? In addition, we will get back to you asap with our testing on > RHEL4U3. > > Thanks, > Sayantan. > Do you know if it would be possible to just replace the userspace section and not mess with the kernel part of OpenIB? I am guessing from what I have read that this is very possible, and only requires me to remove the already existing RHEL rpms for OpenIB userspace support. Thank you very much. If you guys need access I have 2 test machines that I can give access to to do whatever testing is needed. Roger From surs at cse.ohio-state.edu Thu Apr 13 06:55:59 2006 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Thu, 13 Apr 2006 09:55:59 -0400 Subject: [openib-general] Trying to compile mvapich RHEL4U3 for ib. 
In-Reply-To: <443E50C4.7070804@atipa.com> References: <443D7A14.8050807@atipa.com> <20060413012426.GA12977@cse.ohio-state.edu> <443E50C4.7070804@atipa.com> Message-ID: <443E586F.3070208@cse.ohio-state.edu> Hello Roger, > Do you know if it would be possible to just replace the userspace > section and not mess with the kernel part of OpenIB? I am guessing > from what I have read that this is very possible, and only requires > me to remove the already existing RHEL rpms for OpenIB userspace > support. IMHO, it should be possible. However, OpenIB userspace and kernel module authors should be able to exactly answer this question. Roland, any thoughts on which SVN version of userspace support may work with the RHEL default RPMs? > > Thank you very much. > > If you guys need access I have 2 test machines that I can give > access to to do whatever testing is needed. That's great! You can send the login information to me. Thanks, Sayantan. -- http://www.cse.ohio-state.edu/~surs From rdreier at cisco.com Thu Apr 13 07:43:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Apr 2006 07:43:36 -0700 Subject: [openib-general] Trying to compile mvapich RHEL4U3 for ib. In-Reply-To: <443E586F.3070208@cse.ohio-state.edu> (Sayantan Sur's message of "Thu, 13 Apr 2006 09:55:59 -0400") References: <443D7A14.8050807@atipa.com> <20060413012426.GA12977@cse.ohio-state.edu> <443E50C4.7070804@atipa.com> <443E586F.3070208@cse.ohio-state.edu> Message-ID: Sayantan> Roland, any thoughts on which SVN version of userspace Sayantan> support may work with the RHEL default RPMs? Any version should work. It might be simpler to use stable releases such as libibverbs-1.0.2 and libmthca-1.0.1. - R. From sweitzen at cisco.com Thu Apr 13 08:12:15 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 13 Apr 2006 08:12:15 -0700 Subject: [openib-general] Trying to compile mvapich RHEL4U3 for ib. 
Message-ID: > Yes, I am trying the default version of RHEL4U3, alot of our > customers would much rather use unmodified RHEL, though I can probably > talk them out of it with a bit of work. They have some strange > ideas that RHEL is somehow "guaranteed" to work right, and from > what I can tell it won't completely work just because RH did not > include a IB mpi variant, at least not one that I can find. I didn't try MVAPICH, but I had no luck getting Open MPI 1.0.1 to work with the RHEL4 U3 OpenIB code. The RHEL4 U3 relnotes are pretty clear that its included OpenIB is a technology preview not for production environments, and the APIs are subject to change (which they already did comparing RHEL4 U3 to OF 1.0). I think you are much better off trying the OF 1.0 code. Scott Weitzenkamp SQA Manager Cisco Systems From robert.j.woodruff at intel.com Thu Apr 13 08:19:52 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Thu, 13 Apr 2006 08:19:52 -0700 Subject: [openib-general] Trying to compile mvapich RHEL4U3 for ib. In-Reply-To: Message-ID: <000c01c65f0d$b69ace10$010fa8c0@amr.corp.intel.com> Scott wrote, >I didn't try MVAPICH, but I had no luck getting Open MPI 1.0.1 to work >with the RHEL4 U3 OpenIB code. Not sure if you are interested in a comercial MPI or not, but we did test Intel MPI with the RHEL4-U3 code and it worked fine, except on Mellanox DDR cards. woody From surs at cse.ohio-state.edu Thu Apr 13 09:27:33 2006 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Thu, 13 Apr 2006 12:27:33 -0400 Subject: [openib-general] Trying to compile mvapich RHEL4U3 for ib. In-Reply-To: <443E586F.3070208@cse.ohio-state.edu> References: <443D7A14.8050807@atipa.com> <20060413012426.GA12977@cse.ohio-state.edu> <443E50C4.7070804@atipa.com> <443E586F.3070208@cse.ohio-state.edu> Message-ID: <20060413162729.GA14020@cse.ohio-state.edu> Hello Roger, I'm just CC-ing this to openib-general for the community. Thanks for giving us access. 
I have verified that the `ibv_get_device_list' verb is indeed *missing* from the OpenIB install. I'm afraid that given this Redhat rpm, it is difficult to get mvapich to work (without patching it). As Roland and others have indicated, perhaps the best way is for you to upgrade to atleast the 1.0 branch. That should be the most stable OpenIB release yet. https://openib.org/svn/gen2/branches/1.0/src/userspace/ You should be able to keep the kernel stuff intact and just upgrade the user level support (management, libibverbs, libmthca). You may skip upgrading management, however it'll be best to upgrade it too, lest you face any OpenSM issues. Thanks, Sayantan. * On Apr,4 Sayantan Sur wrote : > Hello Roger, > > >Do you know if it would be possible to just replace the userspace > >section and not mess with the kernel part of OpenIB? I am guessing > >from what I have read that this is very possible, and only requires > >me to remove the already existing RHEL rpms for OpenIB userspace > >support. > > IMHO, it should be possible. However, OpenIB userspace and kernel module > authors should be able to exactly answer this question. > > Roland, any thoughts on which SVN version of userspace support may work > with the RHEL default RPMs? > > > > >Thank you very much. > > > >If you guys need access I have 2 test machines that I can give > >access to to do whatever testing is needed. > > That's great! You can send the login information to me. > > Thanks, > Sayantan. 
> > -- > http://www.cse.ohio-state.edu/~surs > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general -- http://www.cse.ohio-state.edu/~surs From rdreier at cisco.com Thu Apr 13 09:57:05 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Apr 2006 09:57:05 -0700 Subject: [openib-general] [PATCH] IB/ipath: Fix whitespace Message-ID: Signed-off-by: Roland Dreier --- Nothing but replacing spaces with tabs. Please apply to svn and let me know if it's OK to queue for upstream. BTW, any progress on reviewing the static function cleanups I sent earlier? drivers/infiniband/hw/ipath/ipath_intr.c | 4 + drivers/infiniband/hw/ipath/ipath_verbs.c | 114 +++++++++++++++-------------- 2 files changed, 59 insertions(+), 59 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 60f5f41..0bcb428 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -172,8 +172,8 @@ static void handle_e_ibstatuschanged(str "was %s\n", dd->ipath_unit, ib_linkstate(lstate), ib_linkstate((unsigned) - dd->ipath_lastibcstat - & IPATH_IBSTATE_MASK)); + dd->ipath_lastibcstat + & IPATH_IBSTATE_MASK)); } else { lstate = dd->ipath_lastibcstat & IPATH_IBSTATE_MASK; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index e3be492..8d2558a 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -1125,26 +1125,26 @@ static void __exit ipath_verbs_cleanup(v static ssize_t show_rev(struct class_device *cdev, char *buf) { - struct ipath_ibdev *dev = - container_of(cdev, struct ipath_ibdev, ibdev.class_dev); - int vendor, boardrev, majrev, minrev; - - ipath_layer_query_device(dev->dd, &vendor, &boardrev, - &majrev, 
&minrev); - return sprintf(buf, "%d.%d\n", majrev, minrev); + struct ipath_ibdev *dev = + container_of(cdev, struct ipath_ibdev, ibdev.class_dev); + int vendor, boardrev, majrev, minrev; + + ipath_layer_query_device(dev->dd, &vendor, &boardrev, + &majrev, &minrev); + return sprintf(buf, "%d.%d\n", majrev, minrev); } static ssize_t show_hca(struct class_device *cdev, char *buf) { - struct ipath_ibdev *dev = - container_of(cdev, struct ipath_ibdev, ibdev.class_dev); - int ret; - - ret = ipath_layer_get_boardname(dev->dd, buf, 128); - if (ret < 0) - goto bail; - strcat(buf, "\n"); - ret = strlen(buf); + struct ipath_ibdev *dev = + container_of(cdev, struct ipath_ibdev, ibdev.class_dev); + int ret; + + ret = ipath_layer_get_boardname(dev->dd, buf, 128); + if (ret < 0) + goto bail; + strcat(buf, "\n"); + ret = strlen(buf); bail: return ret; @@ -1152,40 +1152,40 @@ bail: static ssize_t show_stats(struct class_device *cdev, char *buf) { - struct ipath_ibdev *dev = - container_of(cdev, struct ipath_ibdev, ibdev.class_dev); - int i; - int len; - - len = sprintf(buf, - "RC resends %d\n" - "RC QACKs %d\n" - "RC ACKs %d\n" - "RC SEQ NAKs %d\n" - "RC RDMA seq %d\n" - "RC RNR NAKs %d\n" - "RC OTH NAKs %d\n" - "RC timeouts %d\n" - "RC RDMA dup %d\n" - "piobuf wait %d\n" - "no piobuf %d\n" - "PKT drops %d\n" - "WQE errs %d\n", - dev->n_rc_resends, dev->n_rc_qacks, dev->n_rc_acks, - dev->n_seq_naks, dev->n_rdma_seq, dev->n_rnr_naks, - dev->n_other_naks, dev->n_timeouts, - dev->n_rdma_dup_busy, dev->n_piowait, - dev->n_no_piobuf, dev->n_pkt_drops, dev->n_wqe_errs); - for (i = 0; i < ARRAY_SIZE(dev->opstats); i++) { + struct ipath_ibdev *dev = + container_of(cdev, struct ipath_ibdev, ibdev.class_dev); + int i; + int len; + + len = sprintf(buf, + "RC resends %d\n" + "RC QACKs %d\n" + "RC ACKs %d\n" + "RC SEQ NAKs %d\n" + "RC RDMA seq %d\n" + "RC RNR NAKs %d\n" + "RC OTH NAKs %d\n" + "RC timeouts %d\n" + "RC RDMA dup %d\n" + "piobuf wait %d\n" + "no piobuf %d\n" + "PKT drops %d\n" + 
"WQE errs %d\n", + dev->n_rc_resends, dev->n_rc_qacks, dev->n_rc_acks, + dev->n_seq_naks, dev->n_rdma_seq, dev->n_rnr_naks, + dev->n_other_naks, dev->n_timeouts, + dev->n_rdma_dup_busy, dev->n_piowait, + dev->n_no_piobuf, dev->n_pkt_drops, dev->n_wqe_errs); + for (i = 0; i < ARRAY_SIZE(dev->opstats); i++) { const struct ipath_opcode_stats *si = &dev->opstats[i]; - if (!si->n_packets && !si->n_bytes) - continue; - len += sprintf(buf + len, "%02x %llu/%llu\n", i, + if (!si->n_packets && !si->n_bytes) + continue; + len += sprintf(buf + len, "%02x %llu/%llu\n", i, (unsigned long long) si->n_packets, - (unsigned long long) si->n_bytes); - } - return len; + (unsigned long long) si->n_bytes); + } + return len; } static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); @@ -1194,25 +1194,25 @@ static CLASS_DEVICE_ATTR(board_id, S_IRU static CLASS_DEVICE_ATTR(stats, S_IRUGO, show_stats, NULL); static struct class_device_attribute *ipath_class_attributes[] = { - &class_device_attr_hw_rev, - &class_device_attr_hca_type, - &class_device_attr_board_id, - &class_device_attr_stats + &class_device_attr_hw_rev, + &class_device_attr_hca_type, + &class_device_attr_board_id, + &class_device_attr_stats }; static int ipath_verbs_register_sysfs(struct ib_device *dev) { - int i; + int i; int ret; - for (i = 0; i < ARRAY_SIZE(ipath_class_attributes); ++i) - if (class_device_create_file(&dev->class_dev, - ipath_class_attributes[i])) { - ret = 1; + for (i = 0; i < ARRAY_SIZE(ipath_class_attributes); ++i) + if (class_device_create_file(&dev->class_dev, + ipath_class_attributes[i])) { + ret = 1; goto bail; } - ret = 0; + ret = 0; bail: return ret; From vuhuong at mellanox.com Thu Apr 13 09:58:13 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Thu, 13 Apr 2006 09:58:13 -0700 Subject: [openib-general][PATCH] srp: tuned parameters, Message-ID: <443E8325.2000502@mellanox.com> Hi Roland, Please review this patch + introducing srp_sg_tablesize as module parameter + adjusting SRP_MAX_IU_LEN, 
SRP_MAX_INDIRECT from srp_sg_tablesize + throttling command per lun ie. max_cmd_per_lun can be passed in when adding target (same as max_sect) Signed-off-by: Vu Pham -------------- next part -------------- A non-text attachment was scrubbed... Name: srp-params.patch Type: text/x-patch Size: 3599 bytes Desc: not available URL: From viswa.krish at gmail.com Thu Apr 13 10:55:58 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Thu, 13 Apr 2006 10:55:58 -0700 Subject: [openib-general] Fix for ibping In-Reply-To: <1144891417.4539.41.camel@hal.voltaire.com> References: <4df28be40604121525j22e085cdg8ab9c1a820b24e3a@mail.gmail.com> <1144888960.19061.94385.camel@hal.voltaire.com> <1144891417.4539.41.camel@hal.voltaire.com> Message-ID: <4df28be40604131055g63ded54bv94821b3081d5cc14@mail.gmail.com> Works like a charm... -Viswa On 12 Apr 2006 21:32:33 -0400, Hal Rosenstock wrote: > > On Wed, 2006-04-12 at 20:46, Hal Rosenstock wrote: > > On Wed, 2006-04-12 at 18:25, Viswanath Krishnamurthy wrote: > > > The RMPP version needs to be 1. > > > > Thanks. I'm not sure what changed here to require this. I need to do > > some more digging. > > I figured it out. The fix is in r6448. Can you update and try it ? > Thanks. 
> > -- Hal > > > -- Hal > > > > > [root at subnetmgr5 src]# svn diff ibping.c > > > Index: ibping.c > > > =================================================================== > > > -- ibping.c (revision 6446) > > > +++ ibping.c (working copy) > > > @@ -336,7 +336,7 @@ > > > exit(0); > > > } > > > > > > - if (mad_register_client(ping_class, 0) < 0) > > > + if (mad_register_client(ping_class, 1) < 0) > > > IBERROR("can't register to ping class %d", > > > ping_class); > > > > > > if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) > > > < 0) > > > > > > > > > > > > ______________________________________________________________________ > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Thu Apr 13 11:27:40 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 13 Apr 2006 11:27:40 -0700 Subject: [openib-general] Re: [PATCH] git: updates to rdma_cm branch In-Reply-To: References: Message-ID: <443E981C.9020800@ichips.intel.com> Roland Dreier wrote: > OK, I updated my rdma_cm branch with all of this. > > In addition I put the following in -- it's idiomatic in the kernel to > let the compiler handle htons(A_CONSTANT) in code. Should I commit > this to svn too? This change is fine. Please commit to svn too. Thanks. 
- Sean From vuhuong at mellanox.com Thu Apr 13 11:38:00 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Thu, 13 Apr 2006 11:38:00 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: References: <443C4934.7080400@mellanox.com> Message-ID: <443E9A88.7020302@mellanox.com> Hi Roland, >> Apr 7 18:17:17 lab105 kernel: Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b6b > > I think I fixed the bug causing this oops (I was able to reproduce it, > and I don't see it any more). I checked the following patch in and > queued it for kernel 2.6.17: > My ia64 system still crashes with the patch applied. Please see log below Apr 13 13:10:21 lab105 kernel: Abort for req_index 1 Apr 13 13:10:26 lab105 kernel: ib_srp: SRP reset_host called Apr 13 13:10:28 lab105 kernel: ib_srp: connection closed Apr 13 13:10:28 lab105 kernel: Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b6b Apr 13 13:10:28 lab105 kernel: scsi_eh_2[13324]: Oops 11012296146944 [1] Apr 13 13:10:28 lab105 kernel: Modules linked in: ib_srp ib_cm ib_sa evdev joydev sg st sr_mod ide_cd cdrom usbserial parport_pc lp parport ipv6 thermal processor fan button binfmt_misc usbhid ib_mthca ib_mad ib_core ehci_hcd uhci_hcd usbcore i2c_i801 i2c_core e1000 nls_iso8859_1 nls_cp437 dm_mod reiserfs mptspi scsi_transport_spi mptscsih mptbase sd_mod scsi_mod Apr 13 13:10:28 lab105 kernel: Apr 13 13:10:28 lab105 kernel: Pid: 13324, CPU 1, comm: scsi_eh_2 Apr 13 13:10:28 lab105 kernel: psr : 0000121008026018 ifs : 800000000000050d ip : [] Not tainted Apr 13 13:10:28 lab105 kernel: ip is at srp_reconnect_target+0x2b1/0x5c0 [ib_srp] Apr 13 13:10:28 lab105 kernel: unat: 0000000000000000 pfs : 000000000000050d rsc : 0000000000000003 Apr 13 13:10:28 lab105 kernel: rnat: 0000000000000000 bsps: 0000000000000000 pr : 0000000000009541 Apr 13 13:10:28 lab105 kernel: ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f Apr 13 13:10:28 lab105 kernel: 
csd : 0000000000000000 ssd : 0000000000000000 Apr 13 13:10:28 lab105 kernel: b0 : a00000020235a060 b6 : a000000100003320 b7 : a0000002023ddd80 Apr 13 13:10:28 lab105 kernel: f6 : 1003e6b6b6b6b6b6b6b6b f7 : 0ffdd8000000000000000 Apr 13 13:10:28 lab105 kernel: f8 : 1003e0000000000003598 f9 : 1003e0000000000000118 Apr 13 13:10:28 lab105 kernel: f10 : 1003e0000000000000000 f11 : 1003e0000000000000000 Apr 13 13:10:28 lab105 kernel: r1 : a00000020235c200 r2 : e0000001e58f8b58 r3 : e00000018d748a40 Apr 13 13:10:28 lab105 kernel: r8 : e0000001e58f8ba8 r9 : e0000001e58f89f8 r10 : a000000100931338 Apr 13 13:10:28 lab105 kernel: r11 : 0000000000000001 r12 : e0000001ea8f7d00 r13 : e0000001ea8f0000 Apr 13 13:10:28 lab105 kernel: r14 : a000000100931340 r15 : e0000001ea8f0000 r16 : 0000000000000001 Apr 13 13:10:28 lab105 kernel: r17 : 0000000000000001 r18 : e0000001ea8f0f84 r19 : a000000100931348 Apr 13 13:10:28 lab105 kernel: r20 : ffffffffffffffff r21 : 0000000000000008 r22 : e00000000479c980 Apr 13 13:10:28 lab105 kernel: r23 : e0000001f5e7a920 r24 : 0000000000000080 r25 : e00000000479c99f Apr 13 13:10:28 lab105 kernel: r26 : a0000002023ddd80 r27 : e000000187d4c1e0 r28 : e000000187d4c000 Apr 13 13:10:28 lab105 kernel: r29 : e0000001f5e7a880 r30 : e00000018d748ab8 r31 : e00000018d748a20 Apr 13 13:10:28 lab105 kernel: Apr 13 13:10:28 lab105 kernel: Call Trace: Apr 13 13:10:28 lab105 kernel: [] show_stack+0x80/0xa0 Apr 13 13:10:28 lab105 kernel: sp=e0000001ea8f7880 bsp=e0000001ea8f1308 Apr 13 13:10:28 lab105 kernel: [] show_regs+0x840/0x880 Apr 13 13:10:28 lab105 kernel: sp=e0000001ea8f7a50 bsp=e0000001ea8f12a8 Apr 13 13:10:28 lab105 kernel: [] die+0x1b0/0x2e0 Apr 13 13:10:28 lab105 kernel: sp=e0000001ea8f7a60 bsp=e0000001ea8f1260 Apr 13 13:10:28 lab105 kernel: [] ia64_do_page_fault+0x9a0/0xb20 Apr 13 13:10:28 lab105 kernel: sp=e0000001ea8f7a80 bsp=e0000001ea8f11f0 Apr 13 13:10:28 lab105 kernel: [] ia64_leave_kernel+0x0/0x280 Apr 13 13:10:28 lab105 kernel: sp=e0000001ea8f7b30 
bsp=e0000001ea8f11f0 Apr 13 13:10:28 lab105 kernel: [] srp_reconnect_target+0x2b0/0x5c0 [ib_srp] Apr 13 13:10:28 lab105 kernel: sp=e0000001ea8f7d00 bsp=e0000001ea8f1188 Apr 13 13:10:28 lab105 kernel: [] srp_reset_host+0x60/0xa0 [ib_srp] Apr 13 13:10:28 lab105 kernel: sp=e0000001ea8f7dc0 bsp=e0000001ea8f1160 Apr 13 13:10:28 lab105 kernel: [] scsi_try_host_reset+0xd0/0x240 [scsi_mod] Apr 13 13:10:28 lab105 kernel: sp=e0000001ea8f7dc0 bsp=e0000001ea8f1130 Apr 13 13:10:28 lab105 kernel: [] scsi_error_handler+0x1860/0x2000 [scsi_mod] Apr 13 13:10:28 lab105 kernel: sp=e0000001ea8f7dc0 bsp=e0000001ea8f1040 Apr 13 13:10:28 lab105 kernel: [] kthread+0x220/0x280 Apr 13 13:10:28 lab105 kernel: sp=e0000001ea8f7e10 bsp=e0000001ea8f1000 Apr 13 13:10:28 lab105 kernel: [] kernel_thread_helper+0xe0/0x100 Apr 13 13:10:28 lab105 kernel: sp=e0000001ea8f7e30 bsp=e0000001ea8f0fd0 Apr 13 13:10:28 lab105 kernel: [] start_kernel_thread+0x20/0x40 Apr 13 13:10:28 lab105 kernel: sp=e0000001ea8f7e30 bsp=e0000001ea8f0fd0 Apr 13 13:10:35 lab105 kernel: <3>Slab corruption: start=e0000001e58f89f8, len=448 Apr 13 13:10:35 lab105 kernel: Redzone: 0x5a2cf071/0x5a2cf071. Apr 13 13:10:35 lab105 kernel: Last user: [](scsi_put_command+0x150/0x1c0 [scsi_mod]) Apr 13 13:10:35 lab105 kernel: 1b0: 00 00 08 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 Apr 13 13:10:35 lab105 kernel: Prev obj: start=e0000001e58f8820, len=448 Apr 13 13:10:35 lab105 kernel: Redzone: 0x5a2cf071/0x5a2cf071. 
Apr 13 13:10:35 lab105 kernel: Last user: [](scsi_put_command+0x150/0x1c0 [scsi_mod]) Apr 13 13:10:35 lab105 kernel: 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Apr 13 13:10:35 lab105 kernel: 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b From mshefty at ichips.intel.com Thu Apr 13 11:46:10 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 13 Apr 2006 11:46:10 -0700 Subject: [openib-general] RDMA RC QP returning "RNR Retry Counter Exceeded Error" In-Reply-To: <20060412094640.7457e097.weiny2@llnl.gov> References: <20060412094640.7457e097.weiny2@llnl.gov> Message-ID: <443E9C72.2030408@ichips.intel.com> Ira Weiny wrote: > I have started writing a simple RDMA app which uses the rdmacm. I have gotten > the connection established, QP's and MR's set up, and have sent the RDMA ETH. > However, more and more I am getting the RNR Retry Counter Exceeded error back > from the "client's" post send of the RDMA ETH. About 1/10 times it will work > but most of the time it does not. I have figured out that you can't set the > IBV_QP_RNR_RETRY attribute unless you go from RTR to RTS. The state of the QP > is RTS and the IBV_QP_RNR_RETRY value is 0 as set by the rdmacm. Do I have to, > or can I, transition the QP from RTS to RTR and then back again to set the > IBV_QP_RNR_RETRY? You cannot transition a QP from RTS to RTR. Did you post receive buffers before you complete the connection? Also, what's RDMA ETH? - Sean From halr at voltaire.com Thu Apr 13 12:06:57 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2006 15:06:57 -0400 Subject: [openib-general] New diags tool available Message-ID: <1144955216.4539.9924.camel@hal.voltaire.com> Hi, With svn r6460, a new diags tool is now available on the trunk. It is Ira Weiny's saquery. (Thanks for bearing with me on this). 
saquery tool obtains information based on node name: saquery -h Usage: saquery [-h -d -P -N -L -G][] Queries node records by default -d enable debugging -P get PathRecord info -N get NodeRecord info -L Return just the Lid of the name specified -G Return just the Guid of the name specified -- Hal From troy at scl.ameslab.gov Thu Apr 13 12:35:02 2006 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Thu, 13 Apr 2006 14:35:02 -0500 Subject: [openib-general] opensm issues on 64 node RHEL4 cluster? Message-ID: <20060413193502.GB18625@scl.ameslab.gov> We just moved a cluster over to the latest redhat release, and opensm seems to be having issues. This is running the redhat provided kernel and opensm packages [root at hal2004 troy]# uname -r 2.6.9-34.ELsmp [root at hal2004 troy]# cat /etc/redhat-release Red Hat Enterprise Linux WS release 4 (Nahant Update 3) [root at hal2004 troy]# rpm -qi opensm Name : opensm Relocations: (not relocatable) Version : 1.0 Vendor: Red Hat, Inc. Release : 0.4265.2.EL4 Build Date: Thu 02 Feb 2006 02:24:15 PM CST Install Date: Tue 14 Mar 2006 12:35:09 PM CST Build Host: hs20-bc1-7.build.redhat.com Group : System Environment/Base Source RPM: opensm-1.0-0.4265.2.EL4.src.rpm Size : 1122289 License: GPL/BSD Signature : DSA/SHA1, Thu 16 Feb 2006 01:45:15 PM CST, Key ID 219180cddb42a60e Packager : Red Hat, Inc. URL : https://openib.org/svn/gen2/trunk The opensm log file is at: http://scl.ameslab.gov/~troy/64-node-RHEL4-osm.log.gz Should I go ahead and grab the opensm from the latest subversion and see if it's any better? From halr at voltaire.com Thu Apr 13 12:58:20 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Apr 2006 15:58:20 -0400 Subject: [openib-general] opensm issues on 64 node RHEL4 cluster? 
In-Reply-To: <20060413193502.GB18625@scl.ameslab.gov> References: <20060413193502.GB18625@scl.ameslab.gov> Message-ID: <1144958299.4539.10477.camel@hal.voltaire.com> Hi Troy, On Thu, 2006-04-13 at 15:35, Troy Benjegerdes wrote: > We just moved a cluster over to the latest redhat release, and opensm > seems to be having issues. > > This is running the redhat provided kernel and opensm packages > > [root at hal2004 troy]# uname -r > 2.6.9-34.ELsmp > [root at hal2004 troy]# cat /etc/redhat-release > Red Hat Enterprise Linux WS release 4 (Nahant Update 3) > > [root at hal2004 troy]# rpm -qi opensm > Name : opensm Relocations: (not > relocatable) > Version : 1.0 Vendor: Red Hat, Inc. > Release : 0.4265.2.EL4 Build Date: Thu 02 Feb 2006 > 02:24:15 PM CST > Install Date: Tue 14 Mar 2006 12:35:09 PM CST Build Host: > hs20-bc1-7.build.redhat.com > Group : System Environment/Base Source RPM: > opensm-1.0-0.4265.2.EL4.src.rpm > Size : 1122289 License: GPL/BSD > Signature : DSA/SHA1, Thu 16 Feb 2006 01:45:15 PM CST, Key ID > 219180cddb42a60e > Packager : Red Hat, Inc. > URL : https://openib.org/svn/gen2/trunk > > The opensm log file is at: > > http://scl.ameslab.gov/~troy/64-node-RHEL4-osm.log.gz > > > Should I go ahead and grab the opensm from the latest subversion and see > if it's any better? If that is the technology preview, then using OpenSM from either OF 1.0 rc2 or from the trunk _should_ be much better, especially in your environment. Note that if you do this, you would also need the management libraries as well as OpenSM.
-- Hal From rdreier at cisco.com Thu Apr 13 13:41:49 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Apr 2006 13:41:49 -0700 Subject: [openib-general] [PATCH] RFC: start weaning userspace drivers from sysfs Message-ID: As part of the libibverbs 1.1 release, I would like to remove the dependency on libsysfs, since libsysfs is not very well maintained, not consistent across distros, and the simple sysfs stuff we need is easy to do directly. In this direction, I've already made some changes to libibverbs to reduce its internal use of sysfs. However, sysfs is embedded in the ABI between libibverbs and low-level drivers: libibverbs looks for a function in each driver with the name "openib_driver_init" and calls it with a struct sysfs_class_device *. To fix this in libibverbs 1.1 (which will break ABI from libibverbs 1.0), I propose to replace the driver entry point with a new entry point that looks like struct ibv_device *ibv_driver_init(const char *uverbs_sys_path, int abi_version); where uverbs_sys_path will be a string like "/sys/class/infiniband_verbs/uverbs0" and abi_version will be the contents of the file "abi_version" under that path, or 0 if the file is not present. (This last parameter is just to save every low-level driver from implementing the same code to read the standard abi_version sysfs attribute). However, we can move low-level drivers in this direction in a piecemeal, forwards and backwards compatible way: just add a new ibv_driver_init entry point, but leave the old openib_driver_init entry point there and make it a simple wrapper around the new function. As an example, here's a patch to libmthca that does that. Thoughts? Thanks, Roland --- src/userspace/libmthca/configure.in (revision 6431) +++ src/userspace/libmthca/configure.in (working copy) @@ -12,16 +12,21 @@ dnl Checks for programs AC_PROG_CC dnl Checks for libraries +AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], + AC_MSG_ERROR([ibv_get_device_list() not found. 
libmthca requires libibverbs.])) dnl Checks for header files. AC_CHECK_HEADER(infiniband/driver.h, [], - AC_MSG_ERROR([<infiniband/driver.h> not found. Is libibverbs installed?])) + AC_MSG_ERROR([<infiniband/driver.h> not found. libmthca requires libibverbs.])) AC_HEADER_STDC dnl Checks for typedefs, structures, and compiler characteristics. AC_C_CONST AC_CHECK_SIZEOF(long) +dnl Checks for library functions +AC_CHECK_FUNCS(ibv_read_sysfs_file) + AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then ac_cv_version_script=yes --- src/userspace/libmthca/src/mthca.map (revision 6431) +++ src/userspace/libmthca/src/mthca.map (working copy) @@ -1,4 +1,6 @@ { - global: openib_driver_init; + global: + ibv_driver_init; + openib_driver_init; local: *; }; --- src/userspace/libmthca/src/mthca.c (revision 6431) +++ src/userspace/libmthca/src/mthca.c (working copy) @@ -217,29 +217,53 @@ static struct ibv_device_ops mthca_dev_o .free_context = mthca_free_context }; -struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) +/* + * Keep a private implementation of HAVE_IBV_READ_SYSFS_FILE to handle + * old versions of libibverbs that didn't implement it. This can be + * removed when libibverbs 1.0.3 or newer is available "everywhere."
+ */ +#ifndef HAVE_IBV_READ_SYSFS_FILE +static int ibv_read_sysfs_file(const char *dir, const char *file, + char *buf, size_t size) +{ + char path[256]; + int fd; + int len; + + snprintf(path, sizeof path, "%s/%s", dir, file); + + fd = open(path, O_RDONLY); + if (fd < 0) + return -1; + + len = read(fd, buf, size); + + close(fd); + + if (len > 0 && buf[len - 1] == '\n') + buf[--len] = '\0'; + + return len; +} +#endif /* HAVE_IBV_READ_SYSFS_FILE */ + +struct ibv_device *ibv_driver_init(const char *uverbs_sys_path, + int abi_version) { - struct sysfs_device *pcidev; - struct sysfs_attribute *attr; + char value[8]; struct mthca_device *dev; unsigned vendor, device; int i; - pcidev = sysfs_get_classdev_device(sysdev); - if (!pcidev) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", + value, sizeof value) < 0) return NULL; + sscanf(value, "%i", &vendor); - attr = sysfs_get_device_attr(pcidev, "vendor"); - if (!attr) + if (ibv_read_sysfs_file(uverbs_sys_path, "device/device", + value, sizeof value) < 0) return NULL; - sscanf(attr->value, "%i", &vendor); - sysfs_close_attribute(attr); - - attr = sysfs_get_device_attr(pcidev, "device"); - if (!attr) - return NULL; - sscanf(attr->value, "%i", &device); - sysfs_close_attribute(attr); + sscanf(value, "%i", &device); for (i = 0; i < sizeof hca_table / sizeof hca_table[0]; ++i) if (vendor == hca_table[i].vendor && @@ -252,8 +276,8 @@ found: dev = malloc(sizeof *dev); if (!dev) { fprintf(stderr, PFX "Fatal: couldn't allocate device for %s\n", - sysdev->name); - abort(); + uverbs_sys_path); + return NULL; } dev->ibv_dev.ops = mthca_dev_ops; @@ -262,3 +286,15 @@ found: return &dev->ibv_dev; } + +struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) +{ + int abi_ver = 0; + char value[8]; + + if (ibv_read_sysfs_file(sysdev->path, "abi_version", + value, sizeof value) > 0) + abi_ver = strtol(value, NULL, 10); + + return ibv_driver_init(sysdev->path, abi_ver); +} --- src/userspace/libmthca/ChangeLog 
(revision 6431) +++ src/userspace/libmthca/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2006-04-11 Roland Dreier + + * src/mthca.c (ibv_driver_init, openib_driver_init): Add new + forward-compatible driver entry point. Make old entry point a + simple wrapper for the new one. + 2006-03-14 Roland Dreier * Release version 1.0.1. From rdreier at cisco.com Thu Apr 13 13:51:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Apr 2006 13:51:46 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: <443E9A88.7020302@mellanox.com> (Vu Pham's message of "Thu, 13 Apr 2006 11:38:00 -0700") References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> Message-ID: Hmm, it's clearly a use-after-free bug. Based on ip is at srp_reconnect_target+0x2b1/0x5c0 [ib_srp] can you guess where it is in the SRP driver or what it's accessing? Also this is happening because the connection is being reconnected, because SCSI commands are timing out. Do you have any idea why this is happening? What does the target see when this happens? - R. From rdreier at cisco.com Thu Apr 13 13:54:07 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Apr 2006 13:54:07 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: (Roland Dreier's message of "Thu, 13 Apr 2006 13:51:46 -0700") References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> Message-ID: Roland> Hmm, it's clearly a use-after-free bug. (...because 6b is the slab poisoning free value, and the oops is at 6b6b6b6b6b6b6b6b...) 
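The reasoning in the note above (a faulting address made entirely of 0x6b bytes points at freed slab memory) can be sketched in plain userspace C. This is only an illustration, not kernel code: it assumes the slab debugging poison values used by Linux (0x6b written over freed objects, 0xa5 as the final byte, which matches the hex dump in the oops log earlier in this thread), and the helper names looks_like_freed_pointer and is_poisoned_object are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed slab debug poison bytes (as used by the Linux slab allocator). */
#define POISON_FREE 0x6b /* written over every byte of a freed object */
#define POISON_END  0xa5 /* written as the last byte of a freed object */

/*
 * Hypothetical helper: returns 1 if every byte of a 64-bit faulting
 * address equals POISON_FREE, i.e. the "pointer" was most likely read
 * out of an object that had already been freed.
 */
static int looks_like_freed_pointer(uint64_t addr)
{
	int i;

	for (i = 0; i < 8; i++) {
		if (((addr >> (8 * i)) & 0xff) != POISON_FREE)
			return 0;
	}
	return 1;
}

/*
 * Hypothetical helper: returns 1 if a buffer still carries the intact
 * free pattern (all POISON_FREE, then POISON_END as the last byte).
 * A freed object that fails this check was written to after the free,
 * which is what a "Slab corruption" report indicates.
 */
static int is_poisoned_object(const unsigned char *obj, size_t len)
{
	size_t i;

	if (len == 0)
		return 0;
	for (i = 0; i + 1 < len; i++) {
		if (obj[i] != POISON_FREE)
			return 0;
	}
	return obj[len - 1] == POISON_END;
}
```

Under these assumptions, looks_like_freed_pointer(0x6b6b6b6b6b6b6b6b) is true for the address in the oops, while the dumped row starting "00 00 08 00 6b 6b ..." would fail is_poisoned_object, consistent with something writing into the command after scsi_put_command freed it.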
From rdreier at cisco.com Thu Apr 13 13:59:27 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Apr 2006 13:59:27 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: (Roland Dreier's message of "Thu, 13 Apr 2006 13:54:07 -0700") References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> Message-ID: One stupid but useful way to narrow this down would be to reproduce the crash with the following patch applied on top... Index: linux-kernel/infiniband/ulp/srp/ib_srp.c =================================================================== --- linux-kernel.orig/infiniband/ulp/srp/ib_srp.c 2006-04-12 12:24:37.398566000 -0700 +++ linux-kernel/infiniband/ulp/srp/ib_srp.c 2006-04-13 13:57:45.793412000 -0700 @@ -428,7 +428,12 @@ target->state = SRP_TARGET_CONNECTING; spin_unlock_irq(target->scsi_host->host_lock); + printk(KERN_ERR "%s/%d: about to disconnect...\n", __func__, __LINE__); + srp_disconnect_target(target); + + printk(KERN_ERR "%s/%d: disconnected...\n", __func__, __LINE__); + /* * Now get a new local CM ID so that we avoid confusing the * target in case things are really fouled up. 
@@ -442,23 +447,33 @@ ib_destroy_cm_id(target->cm_id); target->cm_id = new_cm_id; + printk(KERN_ERR "%s/%d: got a new CM ID...\n", __func__, __LINE__); + qp_attr.qp_state = IB_QPS_RESET; ret = ib_modify_qp(target->qp, &qp_attr, IB_QP_STATE); if (ret) goto err; + printk(KERN_ERR "%s/%d: Reset QP...\n", __func__, __LINE__); + ret = srp_init_qp(target, target->qp); if (ret) goto err; + printk(KERN_ERR "%s/%d: Init QP...\n", __func__, __LINE__); + while (ib_poll_cq(target->cq, 1, &wc) > 0) ; /* nothing */ + printk(KERN_ERR "%s/%d: cleared CQ...\n", __func__, __LINE__); + list_for_each_entry(req, &target->req_queue, list) { req->scmnd->result = DID_RESET << 16; req->scmnd->scsi_done(req->scmnd); } + printk(KERN_ERR "%s/%d: cleared request queue...\n", __func__, __LINE__); + target->rx_head = 0; target->tx_head = 0; target->tx_tail = 0; @@ -468,10 +483,14 @@ target->req_ring[SRP_SQ_SIZE - 1].next = -1; INIT_LIST_HEAD(&target->req_queue); + printk(KERN_ERR "%s/%d: reinited req ring...\n", __func__, __LINE__); + ret = srp_connect_target(target); if (ret) goto err; + printk(KERN_ERR "%s/%d: connected target...\n", __func__, __LINE__); + spin_lock_irq(target->scsi_host->host_lock); if (target->state == SRP_TARGET_CONNECTING) { ret = 0; From ardavis at ichips.intel.com Thu Apr 13 15:00:24 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 13 Apr 2006 15:00:24 -0700 Subject: [openib-general] Re: [uDAPL] dtest server never ends when using the dapl provider "OpenIB-scm1" In-Reply-To: <200604121101.43348.dotanb@mellanox.co.il> References: <200604111506.18067.dotanb@mellanox.co.il> <443BEB2B.4080203@ichips.intel.com> <200604121101.43348.dotanb@mellanox.co.il> Message-ID: <443EC9F8.8090603@ichips.intel.com> Dotan Barak wrote: > >Hi. > >thanks for the quick response. > >I executed the dtest with the -v parameter and here is the output of both sides. 
>I added the test the '-l' parameter to be able to change to dapl provider in command line (if you wish i can post you a patch). > >full server output: >----------------------- >sw043:/tmp/tsscr/svn.mlx_tp/gen2/userspace/ulps/udapl/dtest # ./dtest -l OpenIB-scm2 -v >23996 DAPL_PROVIDER is OpenIB-scm2 >23996 Verbose >23996 Running as server >23996 Allocated RDMA buffers (r:0x8052390,s:0x8052618) len 64 >23996 Opened Interface Adaptor >... >23996 waiting for message receive event >23996 inbound message; message arrived! >23996 SERVER: RCV buffer 0x80525d0 contains: 0x55 len=64 >23996 SERVER: SND buffer 0x8052858 contains: 0xffffffaa len=64 >23996 calling post_send >23996 send_msg completed >23996 do_ping_pong_msg complete >23996 Disconnect and Free EP 0x805f518 > > > Hmm, not sure what this thread is waiting on. I would expect to see the dat_ep_disconnect messages before the wait complete or at least the dat_ep_disconnect message indicating a blocking disconnect call. The next 3 messages expected are as follow: dat_ep_disconnect dat_ep_disconnect completed dat_evd_wait for h_conn_evd completed Can you attach to the server process with gdb and get me a back trace from each of the threads? What does driver IBED-1.0-rc3 consist of? Thanks, -arlin From mshefty at ichips.intel.com Thu Apr 13 15:04:42 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 13 Apr 2006 15:04:42 -0700 Subject: [openib-general] Re: [uDAPL] dtest server never ends when using the dapl provider "OpenIB-scm1" In-Reply-To: <443EC9F8.8090603@ichips.intel.com> References: <200604111506.18067.dotanb@mellanox.co.il> <443BEB2B.4080203@ichips.intel.com> <200604121101.43348.dotanb@mellanox.co.il> <443EC9F8.8090603@ichips.intel.com> Message-ID: <443ECAFA.4080803@ichips.intel.com> Arlin Davis wrote: > What does driver IBED-1.0-rc3 consist of? I think that we want all IBED release issues to go directly to the IBED release team. 
- Sean From mlleinin at hpcn.ca.sandia.gov Thu Apr 13 16:32:38 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 13 Apr 2006 16:32:38 -0700 Subject: [openib-general] Compile problems with core code and pathscale for svn6462 and linux-2.6.17-rc1 Message-ID: <1144971158.24662.213.camel@localhost> I'm trying to compile the svn 6462 snapshot with linux-2.6.17-rc1 on a RHEL4 based system. I get the following error for addr.c: CC [M] drivers/infiniband/core/index.o CC [M] drivers/infiniband/core/addr.o In file included from drivers/infiniband/core/addr.c:38: drivers/infiniband/include/rdma/ib_addr.h:43: error: field `dev_type' has incomplete type drivers/infiniband/core/addr.c: In function `copy_addr': drivers/infiniband/core/addr.c:95: error: `RDMA_NODE_IB_CA' undeclared (first use in this function) drivers/infiniband/core/addr.c:95: error: (Each undeclared identifier is reported only once drivers/infiniband/core/addr.c:95: error: for each function it appears in.) drivers/infiniband/core/addr.c:98: error: `RDMA_NODE_RNIC' undeclared (first use in this function) make[3]: *** [drivers/infiniband/core/addr.o] Error 1 make[2]: *** [drivers/infiniband/core] Error 2 make[1]: *** [drivers/infiniband] Error 2 If I remove include/rdma (which I had to do in the past) then some of the pathscale code fails to compile. 
Here is the error: LD [M] drivers/infiniband/core/rdma_ucm.o CC [M] drivers/infiniband/hw/ipath/ipath_cq.o In file included from drivers/infiniband/hw/ipath/ipath_cq.c:36: drivers/infiniband/hw/ipath/ipath_verbs.h:40:26: rdma/ib_pack.h: No such file or directory In file included from drivers/infiniband/hw/ipath/ipath_cq.c:36: drivers/infiniband/hw/ipath/ipath_verbs.h:128: error: field `grh' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:147: error: field `mgid' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:155: error: field `ibmr' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:161: error: field `ibfmr' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:168: error: field `ibpd' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:174: error: field `ibah' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:175: error: field `attr' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:223: error: field `ibcq' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:239: error: field `wr' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:269: error: field `ibsrq' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:284: error: field `ibqp' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:288: error: field `remote_ah_attr' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:331: error: field `path_mtu' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:412: error: field `ibdev' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:485: error: field `ibucontext' has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h: In function `to_imr': drivers/infiniband/hw/ipath/ipath_verbs.h:490: warning: type defaults to `int' in declaration of `__mptr' drivers/infiniband/hw/ipath/ipath_verbs.h:490: warning: initialization from incompatible pointer type drivers/infiniband/hw/ipath/ipath_verbs.h: In function `to_ifmr': 
drivers/infiniband/hw/ipath/ipath_verbs.h:495: warning: type defaults to `int' in declaration of `__mptr' drivers/infiniband/hw/ipath/ipath_verbs.h:495: warning: initialization from incompatible pointer type drivers/infiniband/hw/ipath/ipath_verbs.h: In function `to_ipd': drivers/infiniband/hw/ipath/ipath_verbs.h:500: warning: type defaults to `int' in declaration of `__mptr' drivers/infiniband/hw/ipath/ipath_verbs.h:500: warning: initialization from incompatible pointer type drivers/infiniband/hw/ipath/ipath_verbs.h: In function `to_iah': drivers/infiniband/hw/ipath/ipath_verbs.h:505: warning: type defaults to `int' in declaration of `__mptr' drivers/infiniband/hw/ipath/ipath_verbs.h:505: warning: initialization from incompatible pointer type drivers/infiniband/hw/ipath/ipath_verbs.h: In function `to_icq': drivers/infiniband/hw/ipath/ipath_verbs.h:510: warning: type defaults to `int' in declaration of `__mptr' drivers/infiniband/hw/ipath/ipath_verbs.h:510: warning: initialization from incompatible pointer type drivers/infiniband/hw/ipath/ipath_verbs.h: In function `to_isrq': drivers/infiniband/hw/ipath/ipath_verbs.h:515: warning: type defaults to `int' in declaration of `__mptr' drivers/infiniband/hw/ipath/ipath_verbs.h:515: warning: initialization from incompatible pointer type drivers/infiniband/hw/ipath/ipath_verbs.h: In function `to_iqp': drivers/infiniband/hw/ipath/ipath_verbs.h:520: warning: type defaults to `int' in declaration of `__mptr' drivers/infiniband/hw/ipath/ipath_verbs.h:520: warning: initialization from incompatible pointer type drivers/infiniband/hw/ipath/ipath_verbs.h: In function `to_idev': drivers/infiniband/hw/ipath/ipath_verbs.h:525: warning: type defaults to `int' in declaration of `__mptr' drivers/infiniband/hw/ipath/ipath_verbs.h:525: warning: initialization from incompatible pointer type drivers/infiniband/hw/ipath/ipath_verbs.h: At top level: drivers/infiniband/hw/ipath/ipath_verbs.h:533: warning: "struct ib_mad" declared inside 
parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:533: warning: its scope is only this definition or declaration, which is probably not what you want drivers/infiniband/hw/ipath/ipath_verbs.h: In function `to_iucontext': drivers/infiniband/hw/ipath/ipath_verbs.h:538: warning: type defaults to `int' in declaration of `__mptr' drivers/infiniband/hw/ipath/ipath_verbs.h:538: warning: initialization from incompatible pointer type drivers/infiniband/hw/ipath/ipath_verbs.h: At top level: drivers/infiniband/hw/ipath/ipath_verbs.h:564: warning: "struct ib_udata" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:564: warning: "struct ib_qp_init_attr" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:569: warning: "struct ib_qp_attr" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:572: warning: "struct ib_qp_init_attr" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:572: warning: "struct ib_qp_attr" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:594: warning: "struct ib_sge" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:624: warning: "struct ib_sge" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:624: error: conflicting types for 'ipath_lkey_ok' drivers/infiniband/hw/ipath/ipath_verbs.h:594: error: previous declaration of 'ipath_lkey_ok' was here drivers/infiniband/hw/ipath/ipath_verbs.h:624: error: conflicting types for 'ipath_lkey_ok' drivers/infiniband/hw/ipath/ipath_verbs.h:594: error: previous declaration of 'ipath_lkey_ok' was here drivers/infiniband/hw/ipath/ipath_verbs.h:630: warning: "struct ib_recv_wr" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:634: warning: "struct ib_udata" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:634: warning: "struct ib_srq_init_attr" declared inside parameter list 
drivers/infiniband/hw/ipath/ipath_verbs.h:637: warning: "enum ib_srq_attr_mask" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:637: warning: "struct ib_srq_attr" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:637: warning: parameter has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:639: warning: "struct ib_srq_attr" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:649: warning: "struct ib_udata" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:653: warning: "enum ib_cq_notify" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:653: warning: parameter has incomplete type drivers/infiniband/hw/ipath/ipath_verbs.h:655: warning: "struct ib_udata" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:661: warning: "struct ib_phys_buf" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:665: warning: "struct ib_udata" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:665: warning: "struct ib_umem" declared inside parameter list drivers/infiniband/hw/ipath/ipath_verbs.h:670: warning: "struct ib_fmr_attr" declared inside parameter list drivers/infiniband/hw/ipath/ipath_cq.c: In function `ipath_cq_enter': drivers/infiniband/hw/ipath/ipath_cq.c:60: error: storage size of 'ev' isn't known drivers/infiniband/hw/ipath/ipath_cq.c:64: error: `IB_EVENT_CQ_ERR' undeclared (first use in this function) drivers/infiniband/hw/ipath/ipath_cq.c:64: error: (Each undeclared identifier is reported only once drivers/infiniband/hw/ipath/ipath_cq.c:64: error: for each function it appears in.) 
drivers/infiniband/hw/ipath/ipath_cq.c:60: warning: unused variable `ev' drivers/infiniband/hw/ipath/ipath_cq.c:69: error: invalid use of undefined type `struct ib_wc' drivers/infiniband/hw/ipath/ipath_cq.c:69: error: dereferencing pointer to incomplete type drivers/infiniband/hw/ipath/ipath_cq.c:69: error: dereferencing pointer to incomplete type drivers/infiniband/hw/ipath/ipath_cq.c:72: error: `IB_CQ_NEXT_COMP' undeclared (first use in this function) drivers/infiniband/hw/ipath/ipath_cq.c:73: error: `IB_CQ_SOLICITED' undeclared (first use in this function) drivers/infiniband/hw/ipath/ipath_cq.c:85: error: dereferencing pointer to incomplete type drivers/infiniband/hw/ipath/ipath_cq.c:85: error: `IB_WC_SUCCESS' undeclared (first use in this function) drivers/infiniband/hw/ipath/ipath_cq.c: In function `ipath_poll_cq': drivers/infiniband/hw/ipath/ipath_cq.c:108: error: increment of pointer to unknown structure drivers/infiniband/hw/ipath/ipath_cq.c:108: error: arithmetic on pointer to an incomplete type drivers/infiniband/hw/ipath/ipath_cq.c:111: error: dereferencing pointer to incomplete type drivers/infiniband/hw/ipath/ipath_cq.c:111: error: invalid use of undefined type `struct ib_wc' drivers/infiniband/hw/ipath/ipath_cq.c:111: error: dereferencing pointer to incomplete type drivers/infiniband/hw/ipath/ipath_cq.c: At top level: drivers/infiniband/hw/ipath/ipath_cq.c:158: warning: "struct ib_udata" declared inside parameter list drivers/infiniband/hw/ipath/ipath_cq.c:159: error: conflicting types for 'ipath_create_cq' drivers/infiniband/hw/ipath/ipath_verbs.h:649: error: previous declaration of 'ipath_create_cq' was here drivers/infiniband/hw/ipath/ipath_cq.c:159: error: conflicting types for 'ipath_create_cq' drivers/infiniband/hw/ipath/ipath_verbs.h:649: error: previous declaration of 'ipath_create_cq' was here drivers/infiniband/hw/ipath/ipath_cq.c: In function `ipath_create_cq': drivers/infiniband/hw/ipath/ipath_cq.c:177: error: dereferencing pointer to 
incomplete type drivers/infiniband/hw/ipath/ipath_cq.c:189: error: `IB_CQ_NEXT_COMP' undeclared (first use in this function) drivers/infiniband/hw/ipath/ipath_cq.c: At top level: drivers/infiniband/hw/ipath/ipath_cq.c:232: warning: "enum ib_cq_notify" declared inside parameter list drivers/infiniband/hw/ipath/ipath_cq.c:233: error: parameter `notify' has incomplete type drivers/infiniband/hw/ipath/ipath_cq.c: In function `ipath_req_notify_cq': drivers/infiniband/hw/ipath/ipath_cq.c:242: error: `IB_CQ_NEXT_COMP' undeclared (first use in this function) drivers/infiniband/hw/ipath/ipath_cq.c: At top level: drivers/infiniband/hw/ipath/ipath_cq.c:248: warning: "struct ib_udata" declared inside parameter list drivers/infiniband/hw/ipath/ipath_cq.c:249: error: conflicting types for 'ipath_resize_cq' drivers/infiniband/hw/ipath/ipath_verbs.h:655: error: previous declaration of 'ipath_resize_cq' was here drivers/infiniband/hw/ipath/ipath_cq.c:249: error: conflicting types for 'ipath_resize_cq' drivers/infiniband/hw/ipath/ipath_verbs.h:655: error: previous declaration of 'ipath_resize_cq' was here drivers/infiniband/hw/ipath/ipath_cq.c: In function `ipath_resize_cq': drivers/infiniband/hw/ipath/ipath_cq.c:258: error: dereferencing pointer to incomplete type drivers/infiniband/hw/ipath/ipath_cq.c:276: error: invalid use of undefined type `struct ib_wc' drivers/infiniband/hw/ipath/ipath_cq.c:276: error: dereferencing pointer to incomplete type drivers/infiniband/hw/ipath/ipath_cq.c:276: error: invalid use of undefined type `struct ib_wc' drivers/infiniband/hw/ipath/ipath_cq.c:276: error: dereferencing pointer to incomplete type make[3]: *** [drivers/infiniband/hw/ipath/ipath_cq.o] Error 1 make[2]: *** [drivers/infiniband/hw/ipath] Error 2 make[1]: *** [drivers/infiniband] Error 2 make: *** [drivers] Error 2 From bos at pathscale.com Thu Apr 13 16:40:12 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 13 Apr 2006 16:40:12 -0700 Subject: [openib-general] Re: Compile 
problems with core code and pathscale for svn6462 and linux-2.6.17-rc1 In-Reply-To: <1144971158.24662.213.camel@localhost> References: <1144971158.24662.213.camel@localhost> Message-ID: <200604131640.12899.bos@pathscale.com> On Thursday 13 April 2006 16:32, Matt Leininger wrote: > I'm trying to compile the svn 6462 snapshot with linux-2.6.17-rc1 on a > RHEL4 based system. Are you building the ipath driver out of the kernel.org tree, or out of svn? If the latter, you have to patch the kernel and rebuild it first. References: <1144971158.24662.213.camel@localhost> <200604131640.12899.bos@pathscale.com> Message-ID: <1144972288.24662.216.camel@localhost> On Thu, 2006-04-13 at 16:40 -0700, Bryan O'Sullivan wrote: > On Thursday 13 April 2006 16:32, Matt Leininger wrote: > > I'm trying to compile the svn 6462 snapshot with linux-2.6.17-rc1 on a > > RHEL4 based system. > > Are you building the ipath driver out of the kernel.org tree, or out of svn? > If the latter, you have to patch the kernel and rebuild it first. Out of svn. I have the drivers/infiniband pointing to the svn tree. I'll try using the drivers in the kernel.org tree. - Matt From bos at pathscale.com Thu Apr 13 16:54:57 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 13 Apr 2006 16:54:57 -0700 Subject: [openib-general] Re: Compile problems with core code and pathscale for svn6462 and linux-2.6.17-rc1 In-Reply-To: <1144972288.24662.216.camel@localhost> References: <1144971158.24662.213.camel@localhost> <200604131640.12899.bos@pathscale.com> <1144972288.24662.216.camel@localhost> Message-ID: <200604131654.57734.bos@pathscale.com> On Thursday 13 April 2006 16:51, Matt Leininger wrote: > > Are you building the ipath driver out of the kernel.org tree, or out of > > svn? If the latter, you have to patch the kernel and rebuild it first. > > Out of svn. I have the drivers/infiniband pointing to the svn tree. Yes, that won't work, because the svn include directory has a bunch of stuff that's not upstream. 
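A note on the gcc diagnostics in the compile log above. The warning `"struct ib_udata" declared inside parameter list` usually means the struct tag was first seen inside a function prototype, so its scope ends with that prototype. A later declaration of the same function then refers to a different, unrelated struct, and gcc reports `conflicting types` (as it does for `ipath_lkey_ok`, `ipath_create_cq` and `ipath_resize_cq`). That is consistent with the diagnosis that the headers being picked up do not define what the svn driver expects. The sketch below is a minimal, self-contained illustration and not code from the ipath driver: `demo_sge`, `demo_lkey_ok` and `demo_report` are hypothetical names, and the check itself is a toy.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a type such as struct ib_sge. Declaring the tag
 * at file scope BEFORE any prototype mentions it is what keeps the two
 * prototypes below compatible. If this line is removed, gcc treats each
 * "struct demo_sge" in a parameter list as a new type local to that
 * prototype, warns "declared inside parameter list", and rejects the second
 * prototype with "conflicting types".
 */
struct demo_sge;

/* Declared twice on purpose, as a shared header and a .c file might do. */
int demo_lkey_ok(const struct demo_sge *sge);
int demo_lkey_ok(const struct demo_sge *sge);

/* The complete type; in a real build this would come from the right header. */
struct demo_sge {
    unsigned long addr;
    unsigned int length;
    unsigned int lkey;
};

/* Toy check (not the driver's logic): non-empty buffer with a non-zero key. */
int demo_lkey_ok(const struct demo_sge *sge)
{
    return sge != NULL && sge->length > 0 && sge->lkey != 0;
}

/* Exercises the function and prints the results; returns 0 on success. */
int demo_report(void)
{
    struct demo_sge good = { 0x1000UL, 64U, 7U };
    struct demo_sge empty = { 0x1000UL, 0U, 7U };

    assert(demo_lkey_ok(&good) == 1);
    assert(demo_lkey_ok(&empty) == 0);
    printf("good=%d empty=%d\n", demo_lkey_ok(&good), demo_lkey_ok(&empty));
    return 0;
}
```

Calling `demo_report()` from a `main` prints `good=1 empty=0`. In the build discussed here the equivalent of the file-scope declaration is simply having the matching `ib_verbs.h` on the include path, which is what the Makefile patch later in the thread arranges.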
References: <1144971158.24662.213.camel@localhost> <200604131640.12899.bos@pathscale.com> <1144972288.24662.216.camel@localhost> <200604131654.57734.bos@pathscale.com> Message-ID: <1144972598.24662.221.camel@localhost> On Thu, 2006-04-13 at 16:54 -0700, Bryan O'Sullivan wrote: > On Thursday 13 April 2006 16:51, Matt Leininger wrote: > > > > Are you building the ipath driver out of the kernel.org tree, or out of > > > svn? If the latter, you have to patch the kernel and rebuild it first. > > > > Out of svn. I have the drivers/infiniband pointing to the svn tree. > > Yes, that won't work, because the svn include directory has a bunch of stuff > that's no upstream. > Ok. So the current state is that the mainline devel branch will be broken for a while? BTW, the linux-2.6.17-rc1 in-kernel IB compiled fine. - Matt From bos at pathscale.com Thu Apr 13 16:59:43 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 13 Apr 2006 16:59:43 -0700 Subject: [openib-general] [PATCH] RFC: start weaning userspace drivers from sysfs In-Reply-To: References: Message-ID: <200604131659.43547.bos@pathscale.com> On Thursday 13 April 2006 13:41, Roland Dreier wrote: > As part of the libibverbs 1.1 release, I would like to remove the > dependency on libsysfs, since libsysfs is not very well maintained, > not consistent across distros, and the simple sysfs stuff we need is > easy to do directly. Sounds good. I've been hoping for this for a while. > However, we can move low-level drivers in this direction in a > piecemeal, forwards and backwards compatible way: just add a new > ibv_driver_init entry point, but leave the old openib_driver_init > entry point there and make it a simple wrapper around the new > function. Is the goal of this to make sure that new hardware-specific libraries will work with old libibverbs? How likely do you think that is to happen? 
I don't see much of a problem with simply breaking backwards compatibility here, since it seems unlikely that someone would update one, but not the other. References: <1144971158.24662.213.camel@localhost> <200604131654.57734.bos@pathscale.com> <1144972598.24662.221.camel@localhost> Message-ID: <200604131700.59046.bos@pathscale.com> On Thursday 13 April 2006 16:56, Matt Leininger wrote: > Ok. So the current state is that the mainline devel branch will be > broken for a while? I have no idea. The current situation is fairly annoying, though. References: Message-ID: <20060414001616.GA9445@cuprite.internal.keyresearch.com> > As part of the libibverbs 1.1 release, I would like to remove the > dependency on libsysfs I highly approve of this move. > the simple sysfs stuff we need is easy to do directly. I was looking at it earlier this week and came to the same conclusion. Johann From rdreier at cisco.com Thu Apr 13 17:28:38 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Apr 2006 17:28:38 -0700 Subject: [openib-general] Compile problems with core code and pathscale for svn6462 and linux-2.6.17-rc1 In-Reply-To: <1144971158.24662.213.camel@localhost> (Matt Leininger's message of "Thu, 13 Apr 2006 16:32:38 -0700") References: <1144971158.24662.213.camel@localhost> Message-ID: Matt> If I remove include/rdma (which I had to do in the past) Matt> then some of the pathscale code fails to compile. Here is Matt> the error: Yes, you need the patch below for the ipath directory. I sent this to pathscale a while ago but it seems to take a while for patches to make it from their internal repository to svn... 
--- infiniband/hw/ipath/Makefile (revision 6462)
+++ infiniband/hw/ipath/Makefile (working copy)
@@ -1,5 +1,6 @@
 EXTRA_CFLAGS += -DIPATH_IDSTR='"PathScale kernel.org driver"' \
-	-DIPATH_KERN_TYPE=0
+	-DIPATH_KERN_TYPE=0 \
+	-Idrivers/infiniband/include
 
 obj-$(CONFIG_IPATH_CORE) += ipath_core.o
 obj-$(CONFIG_INFINIBAND_IPATH) += ib_ipath.o

From rdreier at cisco.com Thu Apr 13 17:29:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Apr 2006 17:29:58 -0700 Subject: [openib-general] [PATCH] RFC: start weaning userspace drivers from sysfs In-Reply-To: <200604131659.43547.bos@pathscale.com> (Bryan O'Sullivan's message of "Thu, 13 Apr 2006 16:59:43 -0700") References: <200604131659.43547.bos@pathscale.com> Message-ID: Bryan> Is the goal of this to make sure that new hardware-specific Bryan> libraries will work with old libibverbs? How likely do you Bryan> think that is to happen? I don't see much of a problem Bryan> with simply breaking backwards compatibility here, since it Bryan> seems unlikely that someone would update one, but not the Bryan> other. I just want to decouple things as much as possible, so there doesn't have to be a flag day cut over from the new world to the old. This way we can get low-level drivers out everywhere and then change libibverbs. - R. From mlleinin at hpcn.ca.sandia.gov Thu Apr 13 17:32:44 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 13 Apr 2006 17:32:44 -0700 Subject: [openib-general] 2.6.17-rc1 IPoIB netperf results Message-ID: <1144974764.24662.231.camel@localhost> Here are the latest IPoIB results: For mthca I saw a range of 380-424 MB/s. The local CPU utilization on the send side dropped for the 380 MB/s, from 98% to 70% For ipath it was 310 MB/s. The local CPU utilization on the send side was always around 30%. 
- Matt

Mellanox benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
dual EM64T 3.2 GHz PCIe IB HCA (memfull)
patch 1 - remove changeset 314324121f9b94b2ca657a494cf2b9cb0e4a28cc
patch 2 - remove changeset b8259d9ad1d0f8d0c5ea0e37bb15080b0bd395b5
msi_x=1 for all tests

PathScale benchmarks are with RHEL4 x86_64 with HTX HCA
dual-socket dual-core Opteron 2.4 GHz

--------------------
netperf -f -M -c -C -H IP_ADDRESS

Kernel                 OpenIB      netperf (MB/s)
2.6.17-rc1             in-kernel   424 (mthca ipoib)
2.6.17-rc1             in-kernel   310 (ipath ipoib)
2.6.16                 svn 6307    367 (mthca ipoib)
2.6.16                 svn 6307    319 (ipath ipoib)
2.6.16                 svn 6083    371 (mthca ipoib)
2.6.16                 svn 6083    304 (ipath ipoib)
2.6.16                 svn 5938    380 (mthca ipoib)
2.6.16                 svn 5938    300 (ipath ipoib)
2.6.16                 in-kernel   364
2.6.16-rc5             in-kernel   367
2.6.15                 in-kernel   382
2.6.14-rc4 patch 12    in-kernel   436
2.6.14-rc4 patch 1     in-kernel   434
2.6.14-rc4             in-kernel   385
2.6.14-rc3             in-kernel   374
2.6.13.2               svn3627     386
2.6.13.2 patch 1       svn3627     446
2.6.13.2               in-kernel   394
2.6.13-rc3 patch 12    in-kernel   442
2.6.13-rc3 patch 1     in-kernel   450
2.6.13-rc3             in-kernel   395
2.6.13-rc2 patch 1     in-kernel   409
2.6.13-rc1 patch 1     in-kernel   408
2.6.12.5-lustre        in-kernel   399
2.6.12.5 patch 1       in-kernel   464
2.6.12.5               in-kernel   402
2.6.12                 in-kernel   406
2.6.12-rc6 patch 1     in-kernel   470
2.6.12-rc6             in-kernel   407
2.6.12-rc5             in-kernel   405
2.6.12-rc5 patch 1     in-kernel   474
2.6.12-rc4             in-kernel   470
2.6.12-rc3             in-kernel   466
2.6.12-rc2             in-kernel   469
2.6.12-rc1             in-kernel   466
2.6.11                 in-kernel   464
2.6.11                 svn3687     464
2.6.9-11.ELsmp         svn3513     425 (Woody's results, 3.6Ghz EM64T)

From worleys at gmail.com Fri Apr 14 07:19:38 2006 From: worleys at gmail.com (Chris Worley) Date: Fri, 14 Apr 2006 08:19:38 -0600 Subject: [openib-general] IB initialization Message-ID: I installed the SuSE 10 OpenIB RC2 RPMS. The installation went well, but I'm stuck at the startup. As an IBGD user, I'm used to an init file in /etc/init.d... but there was none. 
From the wiki, I was able to glean:

Make the udev file:

# cat > /etc/udev/rules.d/40-infiniband.rules
KERNEL="umad*", NAME="infiniband/%k"
KERNEL="issm*", NAME="infiniband/%k"

Install some modules:

modprobe ib_ucm
modprobe ib_cm
modprobe ib_uverbs
modprobe ib_umad

And make sure udev is running, and start the opensm.

I've done this on all nodes, and ibstat shows I have a link up and running on every node. Opensm doesn't show any scanning. It's been hung all night at:

# opensm --console
-------------------------------------------------
OpenSM Rev:openib-1.2.0
Based on OpenIB svn Exported revision
Command Line Arguments:
Enabling OpenSM interactive console
Log File: /var/log/osm.log
-------------------------------------------------
OpenSM Rev:openib-1.2.0 OpenIB svn Exported revision

Using default guid 0x2c9020020c3ce

OpenSM Console

$ Entering MASTER state

SUBNET UP

IPoIB isn't up. ibv_rc_pingpong doesn't work. Neither does ibv_devinfo.

Is there a definitive guide on the initialization of the drivers and fabric?

Also, is there an MVAPICH2 for SuSE 10 RPM?

Thanks,

Chris

From halr at voltaire.com Fri Apr 14 07:22:22 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Apr 2006 10:22:22 -0400 Subject: [openib-general] IB initialization In-Reply-To: References: Message-ID: <1145024542.4539.21761.camel@hal.voltaire.com> Hi Chris, On Fri, 2006-04-14 at 10:19, Chris Worley wrote: > I installed the SuSE 10 OpenIB RC2 RPMS. > > The installation went well, but I'm stuck at the startup. > > As an IBGD user, I'm used to an init file in /etc/init.d... but there was none. > > From the wiki, I was able to glean: > > Make the udev file: > > # cat > /etc/udev/rules.d/40-infiniband.rules > KERNEL="umad*", NAME="infiniband/%k" > KERNEL="issm*", NAME="infiniband/%k" > > Install some modules: > > modprobe ib_ucm > modprobe ib_cm > modprobe ib_uverbs > modprobe ib_umad > > And make sure udev is running, and start the opensm. 
> > I've done this on all nodes, and ibstat shows I have a link up and > running on every node. Opensm doesn't show any scanning. It's been > hung all night at: > > # opensm --console > ------------------------------------------------- > OpenSM Rev:openib-1.2.0 > Based on OpenIB svn Exported revision > Command Line Arguments: > Enabling OpenSM interactive console > Log File: /var/log/osm.log > ------------------------------------------------- > OpenSM Rev:openib-1.2.0 OpenIB svn Exported revision > > Using default guid 0x2c9020020c3ce > > OpenSM Console > > $ Entering MASTER state > > SUBNET UP Looks like everything is fine from the OpenSM standpoint. I see no indication that OpenSM is hung. You are in the console. Also, why do you say OpenSM isn't "scanning" ? What is in /var/log/osm.log ? Any errors ? If you want more verbose messages start OpenSM with -V. -- Hal > IPoIB isn't up. ibv_rc_pingpong doesn't work. Neither does ibv_devinfo. > > Is there a definitive guide on the initialization of the drivers and fabric? > > Also, is there an MVAPICH2 for SuSE 10 RPM? > > Thanks, > > Chris > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From worleys at gmail.com Fri Apr 14 07:39:24 2006 From: worleys at gmail.com (Chris Worley) Date: Fri, 14 Apr 2006 08:39:24 -0600 Subject: [openib-general] IB initialization In-Reply-To: <1145024542.4539.21761.camel@hal.voltaire.com> References: <1145024542.4539.21761.camel@hal.voltaire.com> Message-ID: Hal, You're correct... the results of the scans are in /var/log/osm.log. I was expecting the "-console" mode to show more. In looking at the /var/log/osm.log I'm seeing a lot of: Reporting Generic Notice type:4 num:144 For different GUIDs. Is there a place to look these up? 
I still don't have IPoIB running, and ibv_devinfo says I'm not setup right either (couldn't open a device). Thanks, Chris On 14 Apr 2006 10:22:22 -0400, Hal Rosenstock wrote: > Hi Chris, > > On Fri, 2006-04-14 at 10:19, Chris Worley wrote: > > I installed the SuSE 10 OpenIB RC2 RPMS. > > > > The installation went well, but I'm stuck at the startup. > > > > As an IBGD user, I'm used to an init file in /etc/init.d... but there was none. > > > > >From the wiki, I was able to glean: > > > > Make the udev file: > > > > # cat > /etc/udev/rules.d/40-infiniband.rules > > KERNEL="umad*", NAME="infiniband/%k" > > KERNEL="issm*", NAME="infiniband/%k" > > > > Install some modules: > > > > modprobe ib_ucm > > modprobe ib_cm > > modprobe ib_uverbs > > modprobe ib_umad > > > > And make sure udev is running, and start the opensm. > > > > I've done this on all nodes, and ibstat shows I have a link up and > > running on every node. Opensm doesn't show any scanning. It's been > > hung all night at: > > > > # opensm --console > > ------------------------------------------------- > > OpenSM Rev:openib-1.2.0 > > Based on OpenIB svn Exported revision > > Command Line Arguments: > > Enabling OpenSM interactive console > > Log File: /var/log/osm.log > > ------------------------------------------------- > > OpenSM Rev:openib-1.2.0 OpenIB svn Exported revision > > > > Using default guid 0x2c9020020c3ce > > > > OpenSM Console > > > > $ Entering MASTER state > > > > SUBNET UP > > Looks like everything is fine from the OpenSM standpoint. > > I see no indication that OpenSM is hung. You are in the console. > > Also, why do you say OpenSM isn't "scanning" ? > > What is in /var/log/osm.log ? Any errors ? > > If you want more verbose messages start OpenSM with -V. > > -- Hal > > > IPoIB isn't up. ibv_rc_pingpong doesn't work. Neither does ibv_devinfo. > > > > Is there a definitive guide on the initialization of the drivers and fabric? > > > > Also, is there an MVAPICH2 for SuSE 10 RPM? 
> > > > Thanks, > > > > Chris > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From halr at voltaire.com Fri Apr 14 07:49:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Apr 2006 10:49:18 -0400 Subject: [openib-general] [PATCH] OpenSM/osm_sa_mcmember_record.c::__osm_mcmr_rcv_respond: Fix MTU, rate, and PLL selectors Message-ID: <1145026157.4539.22174.camel@hal.voltaire.com>

OpenSM/osm_sa_mcmember_record.c::__osm_mcmr_rcv_respond: Fix MTU, rate, and PLL selectors to be exactly

Signed-off-by: Hal Rosenstock
---
Note this patch has been applied to both trunk and 1.0 branch.

Index: opensm/osm_sa_mcmember_record.c
===================================================================
--- opensm/osm_sa_mcmember_record.c (revision 6466)
+++ opensm/osm_sa_mcmember_record.c (working copy)
@@ -548,8 +548,11 @@ __osm_mcmr_rcv_respond(
   *p_resp_mcmember_rec = *p_mcmember_rec;
 
   /* Fill in the mtu, rate, and packet lifetime selectors */
+  p_resp_mcmember_rec->mtu &= 0x3f;
   p_resp_mcmember_rec->mtu |= 2<<6; /* exactly */
+  p_resp_mcmember_rec->rate &= 0x3f;
   p_resp_mcmember_rec->rate |= 2<<6; /* exactly */
+  p_resp_mcmember_rec->pkt_life &= 0x3f;
   p_resp_mcmember_rec->pkt_life |= 2<<6; /* exactly */
 
   status = osm_vendor_send(

From halr at voltaire.com Fri Apr 14 07:55:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Apr 2006 10:55:00 -0400 Subject: [openib-general] IB initialization In-Reply-To: References: <1145024542.4539.21761.camel@hal.voltaire.com> Message-ID: <1145026500.4539.22255.camel@hal.voltaire.com> Hi again Chris, On Fri, 2006-04-14 at 10:39, Chris Worley wrote: > Hal, > > You're correct... the results of the scans are in /var/log/osm.log. I > was expecting the "-console" mode to show more. 
> > In looking at the /var/log/osm.log I'm seeing a lot of: > > Reporting Generic Notice type:4 num:144 > > For different GUIDs. What's a lot ? One for each GUID ? What's the capability mask indicated ? > Is there a place to look these up? Yes, the IBA spec (volume 1). Trap 144 indicates that the capability mask at the indicated LID has changed. > I still don't have IPoIB running, and ibv_devinfo says I'm not setup > right either (couldn't open a device). I'm not sure why not. -- Hal > Thanks, > > Chris > On 14 Apr 2006 10:22:22 -0400, Hal Rosenstock wrote: > > Hi Chris, > > > > On Fri, 2006-04-14 at 10:19, Chris Worley wrote: > > > I installed the SuSE 10 OpenIB RC2 RPMS. > > > > > > The installation went well, but I'm stuck at the startup. > > > > > > As an IBGD user, I'm used to an init file in /etc/init.d... but there was none. > > > > > > >From the wiki, I was able to glean: > > > > > > Make the udev file: > > > > > > # cat > /etc/udev/rules.d/40-infiniband.rules > > > KERNEL="umad*", NAME="infiniband/%k" > > > KERNEL="issm*", NAME="infiniband/%k" > > > > > > Install some modules: > > > > > > modprobe ib_ucm > > > modprobe ib_cm > > > modprobe ib_uverbs > > > modprobe ib_umad > > > > > > And make sure udev is running, and start the opensm. > > > > > > I've done this on all nodes, and ibstat shows I have a link up and > > > running on every node. Opensm doesn't show any scanning. 
It's been > > > hung all night at: > > > > > > # opensm --console > > > ------------------------------------------------- > > > OpenSM Rev:openib-1.2.0 > > > Based on OpenIB svn Exported revision > > > Command Line Arguments: > > > Enabling OpenSM interactive console > > > Log File: /var/log/osm.log > > > ------------------------------------------------- > > > OpenSM Rev:openib-1.2.0 OpenIB svn Exported revision > > > > > > Using default guid 0x2c9020020c3ce > > > > > > OpenSM Console > > > > > > $ Entering MASTER state > > > > > > SUBNET UP > > > > Looks like everything is fine from the OpenSM standpoint. > > > > I see no indication that OpenSM is hung. You are in the console. > > > > Also, why do you say OpenSM isn't "scanning" ? > > > > What is in /var/log/osm.log ? Any errors ? > > > > If you want more verbose messages start OpenSM with -V. > > > > -- Hal > > > > > IPoIB isn't up. ibv_rc_pingpong doesn't work. Neither does ibv_devinfo. > > > > > > Is there a definitive guide on the initialization of the drivers and fabric? > > > > > > Also, is there an MVAPICH2 for SuSE 10 RPM? 
> > > > > > Thanks, > > > > > > Chris > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From worleys at gmail.com Fri Apr 14 08:29:52 2006 From: worleys at gmail.com (Chris Worley) Date: Fri, 14 Apr 2006 09:29:52 -0600 Subject: [openib-general] IB initialization In-Reply-To: <1145026500.4539.22255.camel@hal.voltaire.com> References: <1145024542.4539.21761.camel@hal.voltaire.com> <1145026500.4539.22255.camel@hal.voltaire.com> Message-ID: Hal, It looks like 1 per GUID. I don't see a capability mask. An example is:

Apr 14 07:28:18 879428 [40602960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0007 TID:0x0000000000000001
Apr 14 07:28:18 879513 [40602960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0007 GID:0xfe80000000000000,0x0002c9020020c3b6

Thanks, Chris On 14 Apr 2006 10:55:00 -0400, Hal Rosenstock wrote: > Hi again Chris, > > On Fri, 2006-04-14 at 10:39, Chris Worley wrote: > > Hal, > > > > You're correct... the results of the scans are in /var/log/osm.log. I > > was expecting the "-console" mode to show more. > > > > In looking at the /var/log/osm.log I'm seeing a lot of: > > > > Reporting Generic Notice type:4 num:144 > > > > For different GUIDs. > > What's a lot ? One for each GUID ? What's the capability mask indicated > ? > > > Is there a place to look these up? > > Yes, the IBA spec (volume 1). Trap 144 indicates that the capability > mask at the indicated LID has changed. 
> > > I still don't have IPoIB running, and ibv_devinfo says I'm not setup > > right either (couldn't open a device). > > I'm not sure why not. > > -- Hal > > > Thanks, > > > > Chris > > On 14 Apr 2006 10:22:22 -0400, Hal Rosenstock wrote: > > > Hi Chris, > > > > > > On Fri, 2006-04-14 at 10:19, Chris Worley wrote: > > > > I installed the SuSE 10 OpenIB RC2 RPMS. > > > > > > > > The installation went well, but I'm stuck at the startup. > > > > > > > > As an IBGD user, I'm used to an init file in /etc/init.d... but there was none. > > > > > > > > >From the wiki, I was able to glean: > > > > > > > > Make the udev file: > > > > > > > > # cat > /etc/udev/rules.d/40-infiniband.rules > > > > KERNEL="umad*", NAME="infiniband/%k" > > > > KERNEL="issm*", NAME="infiniband/%k" > > > > > > > > Install some modules: > > > > > > > > modprobe ib_ucm > > > > modprobe ib_cm > > > > modprobe ib_uverbs > > > > modprobe ib_umad > > > > > > > > And make sure udev is running, and start the opensm. > > > > > > > > I've done this on all nodes, and ibstat shows I have a link up and > > > > running on every node. Opensm doesn't show any scanning. It's been > > > > hung all night at: > > > > > > > > # opensm --console > > > > ------------------------------------------------- > > > > OpenSM Rev:openib-1.2.0 > > > > Based on OpenIB svn Exported revision > > > > Command Line Arguments: > > > > Enabling OpenSM interactive console > > > > Log File: /var/log/osm.log > > > > ------------------------------------------------- > > > > OpenSM Rev:openib-1.2.0 OpenIB svn Exported revision > > > > > > > > Using default guid 0x2c9020020c3ce > > > > > > > > OpenSM Console > > > > > > > > $ Entering MASTER state > > > > > > > > SUBNET UP > > > > > > Looks like everything is fine from the OpenSM standpoint. > > > > > > I see no indication that OpenSM is hung. You are in the console. > > > > > > Also, why do you say OpenSM isn't "scanning" ? > > > > > > What is in /var/log/osm.log ? Any errors ? 
> > > > > > If you want more verbose messages start OpenSM with -V. > > > > > > -- Hal > > > > > > > IPoIB isn't up. ibv_rc_pingpong doesn't work. Neither does ibv_devinfo. > > > > > > > > Is there a definitive guide on the initialization of the drivers and fabric? > > > > > > > > Also, is there an MVAPICH2 for SuSE 10 RPM? > > > > > > > > Thanks, > > > > > > > > Chris > > > > _______________________________________________ > > > > openib-general mailing list > > > > openib-general at openib.org > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From vuhuong at mellanox.com Fri Apr 14 08:44:31 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Fri, 14 Apr 2006 08:44:31 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> Message-ID: <443FC35F.6080301@mellanox.com> Roland Dreier wrote: > Hmm, it's clearly a use-after-free bug. Based on > > ip is at srp_reconnect_target+0x2b1/0x5c0 [ib_srp] > > can you guess where it is in the SRP driver or what it's accessing? > > Also this is happening because the connection is being reconnected, > because SCSI commands are timing out. Do you have any idea why this > is happening? What does the target see when this happens? It crashed in "cleared request queue" ie. 
	list_for_each_entry(req, &target->req_queue, list) {
		req->scmnd->result = DID_RESET << 16;
		req->scmnd->scsi_done(req->scmnd);
	}

Probably the SCSI command was already freed through abort; however, it's still in the request queue.

Vu

From halr at voltaire.com Fri Apr 14 08:38:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Apr 2006 11:38:21 -0400 Subject: [openib-general] IB initialization In-Reply-To: References: <1145024542.4539.21761.camel@hal.voltaire.com> <1145026500.4539.22255.camel@hal.voltaire.com> Message-ID: <1145028888.4539.22748.camel@hal.voltaire.com> Hi again Chris, On Fri, 2006-04-14 at 11:29, Chris Worley wrote: > Hal, > > It looks like 1 per GUID. I don't see a capability mask. An example is: > > Apr 14 07:28:18 879428 [40602960] -> __osm_trap_rcv_process_request: > Received Generic Notice type:0x04 num:144 Producer:1 f > rom LID:0x0007 TID:0x0000000000000001 > Apr 14 07:28:18 879513 [40602960] -> osm_report_notice: Reporting > Generic Notice type:4 num:144 from LID:0x0007 GID:0xfe800 > 00000000000,0x0002c9020020c3b6 Are you running with verbose (-V) ? You only see that extra info then. Just out of curiosity, how big is your subnet and what is the topology ? -- Hal > Thanks, > > Chris > On 14 Apr 2006 10:55:00 -0400, Hal Rosenstock wrote: > > Hi again Chris, > > > > On Fri, 2006-04-14 at 10:39, Chris Worley wrote: > > > Hal, > > > > > > You're correct... the results of the scans are in /var/log/osm.log. I > > > was expecting the "-console" mode to show more. > > > > > > In looking at the /var/log/osm.log I'm seeing a lot of: > > > > > > Reporting Generic Notice type:4 num:144 > > > > > > For different GUIDs. > > > > What's a lot ? One for each GUID ? What's the capability mask indicated > > ? > > > > > Is there a place to look these up? > > > > Yes, the IBA spec (volume 1). Trap 144 indicates that the capability > > mask at the indicated LID has changed. 
> > > > > I still don't have IPoIB running, and ibv_devinfo says I'm not setup > > > right either (couldn't open a device). > > > > I'm not sure why not. > > > > -- Hal > > > > > Thanks, > > > > > > Chris > > > On 14 Apr 2006 10:22:22 -0400, Hal Rosenstock wrote: > > > > Hi Chris, > > > > > > > > On Fri, 2006-04-14 at 10:19, Chris Worley wrote: > > > > > I installed the SuSE 10 OpenIB RC2 RPMS. > > > > > > > > > > The installation went well, but I'm stuck at the startup. > > > > > > > > > > As an IBGD user, I'm used to an init file in /etc/init.d... but there was none. > > > > > > > > > > >From the wiki, I was able to glean: > > > > > > > > > > Make the udev file: > > > > > > > > > > # cat > /etc/udev/rules.d/40-infiniband.rules > > > > > KERNEL="umad*", NAME="infiniband/%k" > > > > > KERNEL="issm*", NAME="infiniband/%k" > > > > > > > > > > Install some modules: > > > > > > > > > > modprobe ib_ucm > > > > > modprobe ib_cm > > > > > modprobe ib_uverbs > > > > > modprobe ib_umad > > > > > > > > > > And make sure udev is running, and start the opensm. > > > > > > > > > > I've done this on all nodes, and ibstat shows I have a link up and > > > > > running on every node. Opensm doesn't show any scanning. It's been > > > > > hung all night at: > > > > > > > > > > # opensm --console > > > > > ------------------------------------------------- > > > > > OpenSM Rev:openib-1.2.0 > > > > > Based on OpenIB svn Exported revision > > > > > Command Line Arguments: > > > > > Enabling OpenSM interactive console > > > > > Log File: /var/log/osm.log > > > > > ------------------------------------------------- > > > > > OpenSM Rev:openib-1.2.0 OpenIB svn Exported revision > > > > > > > > > > Using default guid 0x2c9020020c3ce > > > > > > > > > > OpenSM Console > > > > > > > > > > $ Entering MASTER state > > > > > > > > > > SUBNET UP > > > > > > > > Looks like everything is fine from the OpenSM standpoint. > > > > > > > > I see no indication that OpenSM is hung. You are in the console. 
> > > > > > > > Also, why do you say OpenSM isn't "scanning" ? > > > > > > > > What is in /var/log/osm.log ? Any errors ? > > > > > > > > If you want more verbose messages start OpenSM with -V. > > > > > > > > -- Hal > > > > > > > > > IPoIB isn't up. ibv_rc_pingpong doesn't work. Neither does ibv_devinfo. > > > > > > > > > > Is there a definitive guide on the initialization of the drivers and fabric? > > > > > > > > > > Also, is there an MVAPICH2 for SuSE 10 RPM? > > > > > > > > > > Thanks, > > > > > > > > > > Chris > > > > > _______________________________________________ > > > > > openib-general mailing list > > > > > openib-general at openib.org > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Fri Apr 14 09:04:25 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 14 Apr 2006 09:04:25 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: <443FC35F.6080301@mellanox.com> (Vu Pham's message of "Fri, 14 Apr 2006 08:44:31 -0700") References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com> Message-ID: Hmm, I don't understand what could be going on. 
srp_send_tsk_mgmt() currently has:

	if (req->cmd_done) {
		srp_remove_req(target, req, req_index);
		scmnd->scsi_done(scmnd);
	} else if (!req->tsk_status) {
		srp_remove_req(target, req, req_index);
		scmnd->result = DID_ABORT << 16;
		ret = SUCCESS;
	}

and otherwise it returns FAILED. So in both cases where it finishes
the command, it removes it from the list of pending requests.

Are you absolutely sure you saw the crash with a patched driver that
has that code in srp_send_tsk_mgmt()?

 - R.

From worleys at gmail.com Fri Apr 14 09:05:06 2006
From: worleys at gmail.com (Chris Worley)
Date: Fri, 14 Apr 2006 10:05:06 -0600
Subject: [openib-general] IB initialization
In-Reply-To: <1145028888.4539.22748.camel@hal.voltaire.com>
References: <1145024542.4539.21761.camel@hal.voltaire.com> <1145026500.4539.22255.camel@hal.voltaire.com> <1145028888.4539.22748.camel@hal.voltaire.com>
Message-ID:

Hal,

Note that I got an /etc/init.d/openibd script that's getting
everything running (I still don't have IPoIB or MVAPICH2... but I can
live without both).

Now, I'm running OpenSM with -V, and it looks as I expected.

This cluster is simple: 9 nodes in one switch.

Thanks,

Chris

From mshefty at ichips.intel.com Fri Apr 14 09:19:43 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 14 Apr 2006 09:19:43 -0700
Subject: [openib-general] Re: Compile problems with core code and pathscale for svn6462 and linux-2.6.17-rc1
In-Reply-To: <1144972598.24662.221.camel@localhost>
References: <1144971158.24662.213.camel@localhost> <200604131640.12899.bos@pathscale.com> <1144972288.24662.216.camel@localhost> <200604131654.57734.bos@pathscale.com> <1144972598.24662.221.camel@localhost>
Message-ID: <443FCB9F.8030604@ichips.intel.com>

Matt Leininger wrote:
> Ok. So the current state is that the mainline devel branch will be
> broken for a while?

The trunk is always supposed to work, let alone compile. This needs to be fixed
quickly, or the offending code moved to a branch.
- Sean

From bos at pathscale.com Fri Apr 14 09:22:00 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Fri, 14 Apr 2006 09:22:00 -0700
Subject: [openib-general] Re: Compile problems with core code and pathscale for svn6462 and linux-2.6.17-rc1
In-Reply-To: <443FCB9F.8030604@ichips.intel.com>
References: <1144971158.24662.213.camel@localhost> <200604131640.12899.bos@pathscale.com> <1144972288.24662.216.camel@localhost> <200604131654.57734.bos@pathscale.com> <1144972598.24662.221.camel@localhost> <443FCB9F.8030604@ichips.intel.com>
Message-ID: <1145031720.30490.13.camel@chalcedony.pathscale.com>

On Fri, 2006-04-14 at 09:19 -0700, Sean Hefty wrote:
> Matt Leininger wrote:
> > Ok. So the current state is that the mainline devel branch will be
> > broken for a while?
>
> The trunk is always supposed to work, let alone compile. This needs to be fixed
> quickly, or the offending code moved to a branch.

There is nothing that needs to be fixed. Matt was just not using the
right combination of bits when he was trying to compile the world.

References: <1145024542.4539.21761.camel@hal.voltaire.com> <1145026500.4539.22255.camel@hal.voltaire.com> <1145028888.4539.22748.camel@hal.voltaire.com>
Message-ID: <1145031298.4539.23152.camel@hal.voltaire.com>

Chris,

On Fri, 2006-04-14 at 12:05, Chris Worley wrote:
> Hal,
>
> Note that I got an /etc/init.d/openibd script that's getting
> everything running (I still don't have IPoIB or MVAPICH2... but I can
> live without both).
>
> Now, I'm running Opensm with -V, and it looks as I expected.

So what's the cap mask change being indicated? Are you sure there's no
embedded SM running on the switch?

-- Hal

>
> This cluster is simple: 9 nodes in one switch.
>
> Thanks,
>
> Chris

From vuhuong at mellanox.com Fri Apr 14 09:55:51 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Fri, 14 Apr 2006 09:55:51 -0700
Subject: [openib-general][patch review] srp: fmr implementation,
In-Reply-To:
References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com>
Message-ID: <443FD417.3010909@mellanox.com>

Roland Dreier wrote:
> Hmm, I don't understand what could be going on. srp_send_tsk_mgmt()
> currently has:
>
> 	if (req->cmd_done) {
> 		srp_remove_req(target, req, req_index);
> 		scmnd->scsi_done(scmnd);
> 	} else if (!req->tsk_status) {
> 		srp_remove_req(target, req, req_index);
> 		scmnd->result = DID_ABORT << 16;
> 		ret = SUCCESS;
> 	}
>
> and otherwise it returns FAILED. So in both cases where it finishes
> the command, it removes it from the list of pending requests.
>
> Are you absolutely sure you saw the crash with a patched driver that
> has that code in srp_send_tsk_mgmt()?

I'm sure that I patched srp driver revision 6036. It has the above code
in srp_send_tsk_mgmt().

I don't have time to work on this today. I'll get back with more debug
details on Monday.

Thanks,
Vu

From rheflin at atipa.com Fri Apr 14 11:36:45 2006
From: rheflin at atipa.com (Roger Heflin)
Date: Fri, 14 Apr 2006 13:36:45 -0500
Subject: [openib-general] Trying to compile mvapich RHEL4U3 for ib.
In-Reply-To: <20060413162729.GA14020@cse.ohio-state.edu>
References: <443D7A14.8050807@atipa.com> <20060413012426.GA12977@cse.ohio-state.edu> <443E50C4.7070804@atipa.com> <443E586F.3070208@cse.ohio-state.edu> <20060413162729.GA14020@cse.ohio-state.edu>
Message-ID: <443FEBBD.1040305@atipa.com>

Sayantan Sur wrote:
> Hello Roger,
>
> I'm just CC-ing this to openib-general for the community.
>
> Thanks for giving us access. I have verified that the
> `ibv_get_device_list' verb is indeed *missing* from the OpenIB install.
> I'm afraid that given this Redhat rpm, it is difficult to get mvapich to
> work (without patching it).
>
> As Roland and others have indicated, perhaps the best way is for you to
> upgrade to at least the 1.0 branch. That should be the most stable OpenIB
> release yet.
>
> https://openib.org/svn/gen2/branches/1.0/src/userspace/
>
> You should be able to keep the kernel stuff intact and just upgrade the
> user level support (management, libibverbs, libmthca). You may skip
> upgrading management, however it'll be best to upgrade it too, lest you
> face any OpenSM issues.
>
> Thanks,
> Sayantan.

I now have the machines running RHEL4U3 + kernel.org 2.6.16.5 + the
OpenIB 1.0 userspace. Since the RPM spec files did work for the OpenIB
tools, that made things pretty simple, and I have a reasonable set of RPMs
and tar files to execute the kernel+userspace update.
I have succeeded in getting OpenMPI to compile and execute HPL under
raw IB, and so far I am getting reasonable results and no corruption.

Mvapich compiles but appears not to have built the mpirun version for
InfiniBand, and it complains about that when attempting to start HPL. I have
not yet looked at that in detail to see what the nature of the failure
is.

Roger

From xma at us.ibm.com Fri Apr 14 12:20:57 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 14 Apr 2006 13:20:57 -0600
Subject: [openib-general] [PATCH] splitting IPoIB CQ
In-Reply-To:
Message-ID:

Hello Roland,

Here is the patch to split the IPoIB CQ into a send CQ and a recv CQ. Some
tests have been done over mthca and ehca. In a unidirectional stream test,
throughput improves by up to 15% with this patch on systems with over 4 CPUs.
Bidirectional tests could gain more. People might see different performance
improvements with different drivers and CPUs. I have attached the patch for
those who are willing to run the performance test with different drivers.
Please give your input.

The reason I have two separate wc handlers is that I am working on
another patch to optimize the send CQ and recv CQ separately.
Signed-off-by: Shirley Ma diff -urpN infiniband/ulp/ipoib/ipoib.h infiniband-cq/ulp/ipoib/ipoib.h --- infiniband/ulp/ipoib/ipoib.h 2006-04-05 17:43:18.000000000 -0700 +++ infiniband-cq/ulp/ipoib/ipoib.h 2006-04-12 16:55:57.000000000 -0700 @@ -151,7 +151,8 @@ struct ipoib_dev_priv { u16 pkey; struct ib_pd *pd; struct ib_mr *mr; - struct ib_cq *cq; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; struct ib_qp *qp; u32 qkey; @@ -171,7 +172,8 @@ struct ipoib_dev_priv { struct ib_sge tx_sge; struct ib_send_wr tx_wr; - struct ib_wc ibwc[IPOIB_NUM_WC]; + struct ib_wc *send_ibwc; + struct ib_wc *recv_ibwc; struct list_head dead_ahs; @@ -245,7 +247,8 @@ extern struct workqueue_struct *ipoib_wo /* functions */ -void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); +void ipoib_ib_send_completion(struct ib_cq *cq, void *dev_ptr); +void ipoib_ib_recv_completion(struct ib_cq *cq, void *dev_ptr); struct ipoib_ah *ipoib_create_ah(struct net_device *dev, struct ib_pd *pd, struct ib_ah_attr *attr); diff -urpN infiniband/ulp/ipoib/ipoib_ib.c infiniband-cq/ulp/ipoib/ipoib_ib.c --- infiniband/ulp/ipoib/ipoib_ib.c 2006-04-05 17:43:18.000000000 -0700 +++ infiniband-cq/ulp/ipoib/ipoib_ib.c 2006-04-14 12:49:51.113116736 -0700 @@ -50,7 +50,6 @@ MODULE_PARM_DESC(data_debug_level, "Enable data path debug tracing if > 0"); #endif -#define IPOIB_OP_RECV (1ul << 31) static DEFINE_MUTEX(pkey_mutex); @@ -108,7 +107,7 @@ static int ipoib_ib_post_receive(struct list.lkey = priv->mr->lkey; param.next = NULL; - param.wr_id = id | IPOIB_OP_RECV; + param.wr_id = id; param.sg_list = &list; param.num_sge = 1; @@ -175,8 +174,8 @@ static int ipoib_ib_post_receives(struct return 0; } -static void ipoib_ib_handle_wc(struct net_device *dev, - struct ib_wc *wc) +static void ipoib_ib_handle_recv_wc(struct net_device *dev, + struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned int wr_id = wc->wr_id; @@ -184,110 +183,129 @@ static void ipoib_ib_handle_wc(struct ne ipoib_dbg_data(priv, 
"called: id %d, op %d, status: %d\n", wr_id, wc->opcode, wc->status); - if (wr_id & IPOIB_OP_RECV) { - wr_id &= ~IPOIB_OP_RECV; - - if (wr_id < ipoib_recvq_size) { - struct sk_buff *skb = priv->rx_ring[wr_id].skb; - dma_addr_t addr = priv->rx_ring[wr_id].mapping; - - if (unlikely(wc->status != IB_WC_SUCCESS)) { - if (wc->status != IB_WC_WR_FLUSH_ERR) - ipoib_warn(priv, "failed recv event " - "(status=%d, wrid=%d vend_err %x)\n", - wc->status, wr_id, wc->vendor_err); - dma_unmap_single(priv->ca->dma_device, addr, - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); - dev_kfree_skb_any(skb); - priv->rx_ring[wr_id].skb = NULL; - return; - } - - /* - * If we can't allocate a new RX buffer, dump - * this packet and reuse the old buffer. - */ - if (unlikely(ipoib_alloc_rx_skb(dev, wr_id))) { - ++priv->stats.rx_dropped; - goto repost; - } - - ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", - wc->byte_len, wc->slid); + if (wr_id < ipoib_recvq_size) { + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + dma_addr_t addr = priv->rx_ring[wr_id].mapping; + + if (unlikely(wc->status != IB_WC_SUCCESS)) { + if (wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed recv event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); dma_unmap_single(priv->ca->dma_device, addr, IPOIB_BUF_SIZE, DMA_FROM_DEVICE); + dev_kfree_skb_any(skb); + priv->rx_ring[wr_id].skb = NULL; + return; + } - skb_put(skb, wc->byte_len); - skb_pull(skb, IB_GRH_BYTES); + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. 
+ */ + if (unlikely(ipoib_alloc_rx_skb(dev, wr_id))) { + ++priv->stats.rx_dropped; + goto repost; + } - if (wc->slid != priv->local_lid || - wc->src_qp != priv->qp->qp_num) { - skb->protocol = ((struct ipoib_header *) skb->data)->proto; - skb->mac.raw = skb->data; - skb_pull(skb, IPOIB_ENCAP_LEN); - - dev->last_rx = jiffies; - ++priv->stats.rx_packets; - priv->stats.rx_bytes += skb->len; - - skb->dev = dev; - /* XXX get correct PACKET_ type here */ - skb->pkt_type = PACKET_HOST; - netif_rx_ni(skb); - } else { - ipoib_dbg_data(priv, "dropping loopback packet\n"); - dev_kfree_skb_any(skb); - } + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); - repost: - if (unlikely(ipoib_ib_post_receive(dev, wr_id))) - ipoib_warn(priv, "ipoib_ib_post_receive failed " - "for buf %d\n", wr_id); - } else - ipoib_warn(priv, "completion event with wrid %d\n", - wr_id); + dma_unmap_single(priv->ca->dma_device, addr, + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); - } else { - struct ipoib_tx_buf *tx_req; - unsigned long flags; + skb_put(skb, wc->byte_len); + skb_pull(skb, IB_GRH_BYTES); - if (wr_id >= ipoib_sendq_size) { - ipoib_warn(priv, "completion event with wrid %d (> %d)\n", - wr_id, ipoib_sendq_size); - return; + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb->mac.raw = skb->data; + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); } - ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); + repost: + if (unlikely(ipoib_ib_post_receive(dev, wr_id))) + ipoib_warn(priv, "ipoib_ib_post_receive failed " + "for buf %d\n", wr_id); + } else + ipoib_warn(priv, "completion event with wrid %d\n", + 
wr_id); +} - tx_req = &priv->tx_ring[wr_id]; +static void ipoib_ib_handle_send_wc(struct net_device *dev, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + struct ipoib_tx_buf *tx_req; + unsigned long flags; - dma_unmap_single(priv->ca->dma_device, - pci_unmap_addr(tx_req, mapping), - tx_req->skb->len, - DMA_TO_DEVICE); + ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); - ++priv->stats.tx_packets; - priv->stats.tx_bytes += tx_req->skb->len; + if (wr_id >= ipoib_sendq_size) { + ipoib_warn(priv, "completion event with wrid %d (> %d)\n", + wr_id, ipoib_sendq_size); + return; + } - dev_kfree_skb_any(tx_req->skb); + ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); - spin_lock_irqsave(&priv->tx_lock, flags); - ++priv->tx_tail; - if (netif_queue_stopped(dev) && - priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) - netif_wake_queue(dev); - spin_unlock_irqrestore(&priv->tx_lock, flags); + tx_req = &priv->tx_ring[wr_id]; - if (wc->status != IB_WC_SUCCESS && - wc->status != IB_WC_WR_FLUSH_ERR) - ipoib_warn(priv, "failed send event " - "(status=%d, wrid=%d vend_err %x)\n", - wc->status, wr_id, wc->vendor_err); - } + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->tx_tail; + if (netif_queue_stopped(dev) && + priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) + netif_wake_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); +} + +void ipoib_ib_send_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct 
net_device *) dev_ptr;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int n, i;
+
+	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
+	do {
+		n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->send_ibwc);
+		for (i = 0; i < n; ++i)
+			ipoib_ib_handle_send_wc(dev, priv->send_ibwc + i);
+	} while (n == IPOIB_NUM_WC);
 }
 
-void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr)
+void ipoib_ib_recv_completion(struct ib_cq *cq, void *dev_ptr)
 {
 	struct net_device *dev = (struct net_device *) dev_ptr;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -295,9 +313,9 @@ void ipoib_ib_completion(struct ib_cq *c
 
 	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 	do {
-		n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc);
+		n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->recv_ibwc);
 		for (i = 0; i < n; ++i)
-			ipoib_ib_handle_wc(dev, priv->ibwc + i);
+			ipoib_ib_handle_recv_wc(dev, priv->recv_ibwc + i);
 	} while (n == IPOIB_NUM_WC);
 }
 
diff -urpN infiniband/ulp/ipoib/ipoib_main.c infiniband-cq/ulp/ipoib/ipoib_main.c
--- infiniband/ulp/ipoib/ipoib_main.c 2006-04-12 16:43:38.000000000 -0700
+++ infiniband-cq/ulp/ipoib/ipoib_main.c 2006-04-14 12:40:27.833748216 -0700
@@ -863,11 +863,25 @@ int ipoib_dev_init(struct net_device *de
 
 	/* priv->tx_head & tx_tail are already 0 */
 
-	if (ipoib_ib_dev_init(dev, ca, port))
+	priv->send_ibwc = kzalloc(IPOIB_NUM_WC * sizeof(struct ib_wc), GFP_KERNEL);
+	if (!priv->send_ibwc)
 		goto out_tx_ring_cleanup;
 
+	priv->recv_ibwc = kzalloc(IPOIB_NUM_WC * sizeof(struct ib_wc), GFP_KERNEL);
+	if (!priv->recv_ibwc)
+		goto out_send_ibwc_cleanup;
+
+	if (ipoib_ib_dev_init(dev, ca, port))
+		goto out_recv_ibwc_cleanup;
+
 	return 0;
 
+out_recv_ibwc_cleanup:
+	kfree(priv->recv_ibwc);
+
+out_send_ibwc_cleanup:
+	kfree(priv->send_ibwc);
+
 out_tx_ring_cleanup:
 	kfree(priv->tx_ring);
 
@@ -895,9 +909,15 @@ void ipoib_dev_cleanup(struct net_device
 
 	kfree(priv->rx_ring);
 	kfree(priv->tx_ring);
-
+
 	priv->rx_ring = NULL;
 	priv->tx_ring = NULL;
+
+	kfree(priv->send_ibwc);
+	kfree(priv->recv_ibwc);
+
+	priv->send_ibwc = NULL;
+	priv->recv_ibwc = NULL;
 }
 
 static void ipoib_setup(struct net_device *dev)
diff -urpN infiniband/ulp/ipoib/ipoib_verbs.c infiniband-cq/ulp/ipoib/ipoib_verbs.c
--- infiniband/ulp/ipoib/ipoib_verbs.c 2006-04-05 17:43:18.000000000 -0700
+++ infiniband-cq/ulp/ipoib/ipoib_verbs.c 2006-04-12 19:14:41.000000000 -0700
@@ -174,24 +174,35 @@ int ipoib_transport_dev_init(struct net_
 		return -ENODEV;
 	}
 
-	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev,
-				ipoib_sendq_size + ipoib_recvq_size + 1);
-	if (IS_ERR(priv->cq)) {
-		printk(KERN_WARNING "%s: failed to create CQ\n", ca->name);
+	priv->send_cq = ib_create_cq(priv->ca, ipoib_ib_send_completion, NULL, dev,
+				ipoib_sendq_size + 1);
+	if (IS_ERR(priv->send_cq)) {
+		printk(KERN_WARNING "%s: failed to create send CQ\n", ca->name);
 		goto out_free_pd;
 	}
 
-	if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP))
-		goto out_free_cq;
+	if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP))
+		goto out_free_send_cq;
+
+
+	priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_recv_completion, NULL, dev,
+				ipoib_recvq_size + 1);
+	if (IS_ERR(priv->recv_cq)) {
+		printk(KERN_WARNING "%s: failed to create recv CQ\n", ca->name);
+		goto out_free_send_cq;
+	}
+
+	if (ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP))
+		goto out_free_recv_cq;
 
 	priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE);
 	if (IS_ERR(priv->mr)) {
 		printk(KERN_WARNING "%s: ib_get_dma_mr failed\n", ca->name);
-		goto out_free_cq;
+		goto out_free_recv_cq;
 	}
 
-	init_attr.send_cq = priv->cq;
-	init_attr.recv_cq = priv->cq,
+	init_attr.send_cq = priv->send_cq;
+	init_attr.recv_cq = priv->recv_cq,
 
 	priv->qp = ib_create_qp(priv->pd, &init_attr);
 	if (IS_ERR(priv->qp)) {
@@ -215,8 +226,11 @@ int ipoib_transport_dev_init(struct net_
 out_free_mr:
 	ib_dereg_mr(priv->mr);
 
-out_free_cq:
-	ib_destroy_cq(priv->cq);
+out_free_recv_cq:
+	ib_destroy_cq(priv->recv_cq);
+
+out_free_send_cq:
+	ib_destroy_cq(priv->send_cq);
 
 out_free_pd:
 	ib_dealloc_pd(priv->pd);
@@ -238,7 +252,10 @@ void ipoib_transport_dev_cleanup(struct
 	if (ib_dereg_mr(priv->mr))
 		ipoib_warn(priv, "ib_dereg_mr failed\n");
 
-	if (ib_destroy_cq(priv->cq))
+	if (ib_destroy_cq(priv->send_cq))
+		ipoib_warn(priv, "ib_cq_destroy failed\n");
+
+	if (ib_destroy_cq(priv->recv_cq))
 		ipoib_warn(priv, "ib_cq_destroy failed\n");
 
 	if (ib_dealloc_pd(priv->pd))

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cq.test.patch
Type: application/octet-stream
Size: 12581 bytes
Desc: not available
URL:

From surs at cse.ohio-state.edu Fri Apr 14 14:01:49 2006
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Fri, 14 Apr 2006 17:01:49 -0400
Subject: [openib-general] Trying to compile mvapich RHEL4U3 for ib.
In-Reply-To: <443FEBBD.1040305@atipa.com>
References: <443D7A14.8050807@atipa.com> <20060413012426.GA12977@cse.ohio-state.edu> <443E50C4.7070804@atipa.com> <443E586F.3070208@cse.ohio-state.edu> <20060413162729.GA14020@cse.ohio-state.edu> <443FEBBD.1040305@atipa.com>
Message-ID: <20060414210147.GA15216@cse.ohio-state.edu>

Hi Roger,

> Mvapich compiles but appears to not have made the mpirun version for
> Infiniband, and yells about that when attempting to start HPL, I have
> not yet looked at that in detail to see what the nature of the failure
> is.

Thanks for reporting this. In fact, just today we fixed this in the
MVAPICH trunk. This problem was reported by another user on
mvapich-discuss.

http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2006-April/000098.html

If this was the error you got, we'll be glad if you could just `svn up'
your tree and give it a shot.

Please let us know if this worked for you.

Thanks,
Sayantan.

>
> Roger

--
http://www.cse.ohio-state.edu/~surs

From robert.j.woodruff at intel.com Fri Apr 14 14:37:55 2006
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 14 Apr 2006 14:37:55 -0700
Subject: [openib-general] IB initialization
In-Reply-To:
Message-ID: <000d01c6600b$b28c6550$010fa8c0@amr.corp.intel.com>

Chris wrote,
>As an IBGD user, I'm used to an init file in /etc/init.d... but there was none.
>From the wiki, I was able to glean:

Make the udev file:

># cat > /etc/udev/rules.d/40-infiniband.rules
>KERNEL="umad*", NAME="infiniband/%k"
>KERNEL="issm*", NAME="infiniband/%k"

Install some modules:

>modprobe ib_ucm
>modprobe ib_cm
>modprobe ib_uverbs
>modprobe ib_umad

>Is there a definitive guide on the initialization of the drivers and fabric?

FYI to anyone else trying to get things loaded and running.
Here is an init.d startup script that I use to load and start the IB
drivers. You can use it and/or edit it to load the drivers that you want.
My script makes the dev nodes manually, but if you have udev, you can use
that instead.

#!/bin/sh
#
# ib : A script to control openib.org kernel module start
#

# Set variables
module1=ib_mthca
module2=ib_mad
module3=ib_sa
module4=ib_ipoib
module5=ib_uverbs
module6=ib_umad
module7=ib_cm
module8=ib_ucm
module9=ib_sdp
module10=ib_srp
module11=rdma_cm
module12=rdma_ucm
# module13=kdapl deprecated
module14=iscsi_tcp
module15=ib_iser
device=infiniband
mode=666

# Set default module parameters
det_max_pages_percent=0
det_retry_time=0
det_window_size=0

usage()
{
    echo "Usage: $0 {start|stop|restart|reload} [module_parameters]"
}

verify_root_privilege()
{
    # id -u works in any POSIX shell; $UID is not set by plain sh
    if [ "$(id -u)" -ne 0 ]; then
        echo "You must be root to modify $device module state"
        exit 1
    fi
}

start()
{
    verify_root_privilege

    kernel_ver=$(uname -r)
    module1_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/hw/mthca/$module1.ko
    module2_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/core/$module2.ko
    module3_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/core/$module3.ko
    module4_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/ulp/ipoib/$module4.ko
    module5_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/core/$module5.ko
    module6_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/core/$module6.ko
    module7_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/core/$module7.ko
    module8_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/core/$module8.ko
    module9_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/ulp/sdp/$module9.ko
    module10_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/ulp/srp/$module10.ko
    module11_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/core/$module11.ko
    module12_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/core/$module12.ko
    # module13_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/ulp/kdapl/$module13.ko
    module14_path=/lib/modules/$kernel_ver/kernel/drivers/scsi/$module14.ko
    module15_path=/lib/modules/$kernel_ver/kernel/drivers/infiniband/ulp/iser/$module15.ko

    # Load each module only if its .ko file exists for the running kernel
    if test -e $module1_path; then
        echo "Loading $module1"
        sudo /sbin/modprobe $module1 "$@"
    else
        echo "Module $module1 not found ($module1_path does not exist)!"
    fi

    if test -e $module2_path; then
        echo "Loading $module2"
        sudo /sbin/modprobe $module2 "$@"
    else
        echo "Module $module2 not found ($module2_path does not exist)!"
    fi

    if test -e $module3_path; then
        echo "Loading $module3"
        sudo /sbin/modprobe $module3 "$@"
    else
        echo "Module $module3 not found ($module3_path does not exist)!"
    fi

    if test -e $module4_path; then
        echo "Loading $module4"
        sudo /sbin/modprobe $module4 "$@"
    else
        echo "Module $module4 not found ($module4_path does not exist)!"
    fi

    if test -e $module5_path; then
        echo "Loading $module5"
        sudo /sbin/modprobe $module5 "$@"
    else
        echo "Module $module5 not found ($module5_path does not exist)!"
    fi

    if test -e $module6_path; then
        echo "Loading $module6"
        sudo /sbin/modprobe $module6 "$@"
    else
        echo "Module $module6 not found ($module6_path does not exist)!"
    fi

    if test -e $module7_path; then
        echo "Loading $module7"
        sudo /sbin/modprobe $module7 "$@"
    else
        echo "Module $module7 not found ($module7_path does not exist)!"
    fi

    if test -e $module8_path; then
        echo "Loading $module8"
        sudo /sbin/modprobe $module8 "$@"
    else
        echo "Module $module8 not found ($module8_path does not exist)!"
    fi

    if test -e $module9_path; then
        echo "Loading $module9"
        sudo /sbin/modprobe $module9 "$@"
    else
        echo "Module $module9 not found ($module9_path does not exist)!"
    fi

    if test -e $module10_path; then
        echo "Loading $module10"
        sudo /sbin/modprobe $module10 "$@"
    else
        echo "Module $module10 not found ($module10_path does not exist)!"
    fi

    if test -e $module11_path; then
        echo "Loading $module11"
        sudo /sbin/modprobe $module11 "$@"
    else
        echo "Module $module11 not found ($module11_path does not exist)!"
    fi

    if test -e $module12_path; then
        echo "Loading $module12"
        sudo /sbin/modprobe $module12 "$@"
    else
        echo "Module $module12 not found ($module12_path does not exist)!"
    fi

    # if test -e $module13_path; then
    #     echo "Loading $module13"
    #     sudo /sbin/modprobe $module13 "$@"
    # else
    #     echo "Module $module13 not found ($module13_path does not exist)!"
    # fi

    if test -e $module14_path; then
        echo "Loading $module14"
        sudo /sbin/modprobe $module14 "$@"
    else
        echo "Module $module14 not found ($module14_path does not exist)!"
    fi

    if test -e $module15_path; then
        echo "Loading $module15"
        sudo /sbin/modprobe $module15 "$@"
    else
        echo "Module $module15 not found ($module15_path does not exist)!"
    fi
    #endif

    # remove stale device
    rm -rf /dev/$device
    mkdir /dev/infiniband
    /bin/mknod /dev/infiniband/umad0 c 231 0
    /bin/mknod /dev/infiniband/uverbs0 c 231 192
    /bin/mknod /dev/infiniband/uverbs1 c 231 193
    /bin/mknod /dev/infiniband/ucm0 c 231 224
    /bin/mknod /dev/infiniband/uat c 231 191
    /bin/mknod /dev/infiniband/rdma_cm c 10 62
    chmod 777 /dev/infiniband/*
    /usr/local/bin/ifup-ib ib0
    /usr/local/bin/ifup-ib ib1
    return 0
}

stop()
{
    verify_root_privilege

    # Unload in roughly the reverse of the load order
    echo "Removing $module15"
    /sbin/rmmod $module15.ko
    echo "Removing $module14"
    /sbin/rmmod $module14.ko
    # echo "Removing $module13"
    # /sbin/rmmod $module13.ko
    echo "Removing $module12"
    /sbin/rmmod $module12.ko
    echo "Removing $module11"
    /sbin/rmmod $module11.ko
    echo "Removing $module10"
    /sbin/rmmod $module10.ko
    echo "Removing $module9"
    /sbin/rmmod $module9.ko
    echo "Removing $module8"
    /sbin/rmmod $module8.ko
    echo "Removing $module7"
    /sbin/rmmod $module7.ko
    echo "Removing $module6"
    /sbin/rmmod $module6.ko
    echo "Removing $module5"
    /sbin/rmmod $module5.ko
    echo "Removing $module4"
    /sbin/rmmod $module4.ko
    echo "Removing $module3"
    /sbin/rmmod ib_local_sa
    /sbin/rmmod $module3.ko
    echo "Removing $module1"
    /sbin/rmmod $module1.ko
    echo "Removing $module2"
    /sbin/rmmod $module2.ko
    rm -rf /dev/$device
    return 0
}

restart()
{
    stop
    start "$@"
}

# "main"
next_state=$1
# Only shift when an argument was given; a bare shift aborts some shells
[ $# -gt 0 ] && shift

# See how we were called
case "$next_state" in
start)
    start "$@"
    ;;
stop)
    stop
    ;;
restart|reload)
    restart "$@"
    ;;
*)
    usage
    exit 1
esac

From rheflin at atipa.com Fri Apr 14 14:47:58
2006
From: rheflin at atipa.com (Roger Heflin)
Date: Fri, 14 Apr 2006 16:47:58 -0500
Subject: [openib-general] Trying to compile mvapich RHEL4U3 for ib.
In-Reply-To: <20060414210147.GA15216@cse.ohio-state.edu>
References: <443D7A14.8050807@atipa.com> <20060413012426.GA12977@cse.ohio-state.edu> <443E50C4.7070804@atipa.com> <443E586F.3070208@cse.ohio-state.edu> <20060413162729.GA14020@cse.ohio-state.edu> <443FEBBD.1040305@atipa.com> <20060414210147.GA15216@cse.ohio-state.edu>
Message-ID: <4440188E.1060106@atipa.com>

Sayantan Sur wrote:
> Hi Roger,
>
>> Mvapich compiles but appears to not have made the mpirun version for
>> Infiniband, and yells about that when attempting to start HPL, I have
>> not yet looked at that in detail to see what the nature of the failure
>> is.
>
> Thanks for reporting this. In fact, just today we fixed this in the
> MVAPICH trunk. This problem was reported by another user on
> mvapich-discuss.
>
> http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2006-April/000098.html
>
> If this was the error you got, we'll be glad if you could just `svn up'
> your tree and give it a shot.
>
> Please let us know if this worked for you.
>
> Thanks,
> Sayantan.

Yep, that is what I saw. I will try the newer version Monday.

Roger

From dotanb at mellanox.co.il Sun Apr 16 00:42:24 2006
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Sun, 16 Apr 2006 10:42:24 +0300
Subject: [openib-general] Re: [uDAPL] dat.conf generator
In-Reply-To:
References: <200604121122.48646.dotanb@mellanox.co.il>
Message-ID: <200604161042.25045.dotanb@mellanox.co.il>

On Wednesday 12 April 2006 17:50, James Lentini wrote:
> > OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" ""
> > OpenIB-cma0-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" ""
> > OpenIB-cma0-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 2" ""
> > OpenIB-cma1-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 1" ""
> > OpenIB-cma1-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 2" ""
> > OpenIB-cma-netdev0 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> > OpenIB-cma-netdev1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib1 0" ""
> > OpenIB-cma-netdev2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib2 0" ""
> > OpenIB-cma-netdev3 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib3 0" ""
> > OpenIB-scm u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
> > OpenIB-scm0-1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
> > OpenIB-scm0-2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 2" ""
> > OpenIB-scm1-1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca1 1" ""
> > OpenIB-scm1-2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca1 2" ""
> > OpenIB-scm-netdev0 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib0 0" ""
> > OpenIB-scm-netdev1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib1 0" ""
> > OpenIB-scm-netdev2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib2 0" ""
> > OpenIB-scm-netdev3 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib3 0" ""
> >
> > is this file good enough or are more dapl provider names needed?
>
> You've covered all the standard combinations. Why did you include the
> OpenIB-Z-netdevX entries? Why would a user prefer netdevX over ethY?
> Just curious.
>
> If you are willing to contribute this back to the uDAPL project, I'm
> sure the uDAPL community would find it very useful.

I tried to automatically create a configuration file that is filled in
manually today. Which lines do you think need to be in this configuration
file? (I'm open to suggestions).

This dat.conf generator is a new piece of code that (when it is ready)
will be checked in to the openib stack (I think to the dapl tree), so it
can be available to everyone ...

Dotan

From jackm at mellanox.co.il Sun Apr 16 00:46:28 2006
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 16 Apr 2006 10:46:28 +0300
Subject: [openib-general] [PATCH v2] mad: use GID/LID on requester side when matching responses to requests
In-Reply-To: <1144863926.19061.90695.camel@hal.voltaire.com>
References: <200604101804.34043.jackm@mellanox.co.il> <1144863926.19061.90695.camel@hal.voltaire.com>
Message-ID: <200604161046.28663.jackm@mellanox.co.il>

On Wednesday 12 April 2006 20:45, Hal Rosenstock wrote:
> > + if ((wr->tid == mad->mad_hdr.tid) &&
> > + rcv_has_same_class(wr, wc) &&
> > + /*
> > + * Don't check GID for direct routed MADs.
> > + * These might have permissive LIDs.
>
> What's the relevance of the latter comment ? VL15 packets never have
> GRHs so there are no GIDs so I think the first comment is sufficient.
>

My intent was "Don't check GID/LID". In the case of VL15 packets, only
the LID can be checked, since there is no GID (as you point out). Per IB
Spec, we are checking GIDs -- and if there is no GID, the LIDs are
compared.

- Jack

From ogerlitz at voltaire.com Sun Apr 16 02:52:12 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 16 Apr 2006 12:52:12 +0300
Subject: [openib-general] Re: [PATCH] git: updates to rdma_cm branch
In-Reply-To:
References:
Message-ID: <444213CC.2070008@voltaire.com>

Roland Dreier wrote:
> OK, I updated my rdma_cm branch with all of this.

I see that the for-2.6.18 branch contains the addr and cma includes &
code, so from the cma perspective the 2.6.18 push is ready? Also, if
someone wants to compile/run code (e.g. iSER) which is dependent on the
cma, they just need to do a git pull of the include/rdma and
drivers/infiniband directories from the for-2.6.18 branch - am I correct?

thanks,

Or.

From ogerlitz at voltaire.com Sun Apr 16 05:44:35 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 16 Apr 2006 15:44:35 +0300
Subject: [openib-general] iser upstream: further reviews and process towards 2.6.18
Message-ID:

Hi Roland,

Over the last two weeks we have posted another RFC on the iser code, to
both openib and linux scsi. Up to now no new comments were made other
than those provided (and fixed) by Christoph Hellwig on the first RFC we
posted to openib.

Do you find it needed/appropriate to post another RFC, this time to LKML?

Also (and related to both reviews and upstream) iser is dependent on two
change sets for 2.6.18: first, the cma, and second, the bunch of 2.6.18
iscsi updates (it is six patches, of which iser directly depends on two,
but we prefer to treat all six as needed for iser to compile/work).

The six iscsi patches were pushed to linux-scsi but are not yet in the
scsi-misc git tree. They can also be found under
https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/2.6.17
it is the 01 to 06 sequence.

As for the upstream process, I guess each of you and James would need to
do a git pull from the other one into the scsi-misc (2.6.18 rdma updates)
and the infiniband (2.6.18 scsi updates) trees so iser can be compiled
with both trees, or at least with your tree.

Please let me know what you think.

thanks,

Or.

From vlad at mellanox.co.il Sun Apr 16 08:05:00 2006
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Sun, 16 Apr 2006 18:05:00 +0300
Subject: [openib-general] Re: [openfabrics-ewg] IBED-1.0-rc3 installer feedback
In-Reply-To:
References:
Message-ID: <44425D1C.70104@mellanox.co.il>

Hi Scott,
Thanks for your feedback, see my comments inside marked with [VS].

Regards,
Vladimir

Scott Weitzenkamp (sweitzen) wrote:
> Here's some feedback on installation, should I file bugs/enhancements in
> bugzilla for these?
>
> 0) build.sh does not compile Open MPI, forcing me to run install.sh to
> compile Open MPI. This makes it harder to set up a build server used to
> just compile the code for installation elsewhere.
>

[VS] The Open MPI RPM package (current version) can be built only if all
required software is already installed under the final installation
prefix. So it can be done only during the installation process (by
running install.sh).

> 1) Too many references to Mellanox in the docs.
>

[VS] README.txt and docs/IBED_Installation_Guide.txt were taken from
previous releases of Mellanox software stacks and are still not updated.
They will be updated for the next OFED RC.

> # fgrep Mellanox README.txt docs/IBED_Installation_Guide.txt
> README.txt:Mellanox IBED Distribution v1.0 for Linux
> README.txt:This is the Mellanox InfiniBand Distribution (IBED) ver. 1.0 software package.
> README.txt:1) Server platform with InfiniBand HCA (see Mellanox IBED Distribution Release Notes for details)
> README.txt:2) Linux OS (see Mellanox IBED Distribution Release Notes for details)
> README.txt: o Firmware for Mellanox's switch and HCA products
> docs/IBED_Installation_Guide.txt:Mellanox IBED Distribution Installation Guide
> docs/IBED_Installation_Guide.txt: 2. Contents of the Mellanox IBED Distribution
> docs/IBED_Installation_Guide.txt:2. Contents of the Mellanox IBED Distribution
> docs/IBED_Installation_Guide.txt: o Firmware for Mellanox's switch and HCA products
> docs/IBED_Installation_Guide.txt: "Mellanox IBED Release Notes".
>
> 2) When I run install.sh or build.sh and tell it to compile both MPIs, I
> get the same questions twice, which I assume one is for MVAPICH and one
> for Open MPI, but this needs to be clearer:
>

[VS] You are right. I will add the MPI name to the question.

> The following compiler(s) on your system can be used to build/install
> MPI: gcc
>
> Do you wish to create/install an MPI RPM with gcc? [Y/n]:
>
> The following compiler(s) will be used to build the MPI RPM(s): gcc
>
> The following compiler(s) on your system can be used to build/install
> MPI: gcc
>
> Do you wish to create/install an MPI RPM with gcc? [Y/n]:
>
> 3) It would be nice if install.sh asked me if I wanted to configure IP
> address, rather than forcing me to.
>
> The default IPoIB interface configuration is based on a LAN interface
> configura.
> You may change this default configuration in the following steps.
>
> Enter LAN interface to be used for setting ib0 interface [eth0]:
>

[VS] A corresponding question will be added to the install script.

> 4) I would like to see /etc/infiniband/ifcfg-ib0 be in
> /etc/sysconfig/network-scripts.
>

[VS] Will be done for rc4, possibly not for all distributions.

> 5) I would like to see entries for ipoib and sdp in /etc/modprobe.conf.
>

[VS] Will be done for rc4, possibly not for all distributions (SuSE 10.0
has an issue with unloading modules).

> 6) It would be nice if install.sh offered to setup
> /etc/security/limits.conf.
>

[VS] I will check what can be done.

> 7) If I run install.sh on one machine, then install the resulting RPMS
> on a different machine, I get slightly different sets of files installed
> on the two machines in /usr/local/ibed:
>

[VS] You probably did not use the IBED install.sh script to install the
other machine. The following files are copied by install.sh:
BUILD_ID, LICENSE, README.txt, docs, ibed.conf, uninstall.sh
Regarding the header files, each one of them is part of the corresponding
devel package, e.g. common.h is included in libibcommon-devel ...rpm.
Please check if this RPM is installed.
Please let me know if I am wrong.

> < ./BUILD_ID
> < ./LICENSE
> < ./README.txt
> 70,71d66
> < ./docs
> < ./docs/IBED_Installation_Guide.txt
> 74d68
> < ./ibed.conf
> 114a109
> > ./include/infiniband/common.h
> 165a161
> > ./include/infiniband/mad.h
> 178a175
> > ./include/infiniband/umad.h
> 260a258
> > ./lib64/libibcommon.so
> 263a262
> > ./lib64/libibmad.so
> 266a266
> > ./lib64/libibumad.so
> 1425d1424
> < ./uninstall.sh
>
>
>> -----Original Message-----
>> From: openfabrics-ewg-bounces at openib.org
>> [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of
>> Vladimir Sokolovsky
>> Sent: Monday, April 10, 2006 9:55 AM
>> To: openfabrics-ewg at openib.org
>> Cc: openib-general
>> Subject: [openfabrics-ewg] IBED-1.0-rc3 is available
>>
>> Hi All,
>> We have prepared IBED 1.0 RC3.
>> Release location:
>> *https://openib.org/svn/gen2/branches/1.0/ibed/releases*
>> File: IBED-1.0-rc3.tgz
>> md5sum: 8e143fd4b63646ebc9f5c9f73d18394b
>>
>> *_BUILD_ID:_*
>> IBED-1.0-rc3:
>> OpenIB:
>> openib_branch1.0-20060410-1551 (REV=6367)
>> Userspace SVN path:
>> https://openib.org/svn/gen2/branches/1.0/src/userspace
>> IB Kernel modules SVN path:
>> https://openib.org/svn/gen2/branches/1.0/ibed/tags/rc3/linux-kernel
>> MPI:
>> openmpi-1.0.2a12-1
>> mpi_osu-0.9.7-mlx2.1.0
>> mpitests-1.0-0
>>
>> *OSes:*
>>
>> * RH EL4 up2: 2.6.9-22.ELsmp
>> * RH EL4 up3: 2.6.9-34.ELsmp
>> * Fedora C4: 2.6.11-1.1369_FC4
>> * SLES10 beta 7: 2.6.16-rc5-git9-2-smp
>> * SUSE 10 Pro: 2.6.13-15-smp
>> * kernel.org: 2.6.16
>>
>> *Systems:*
>>
>> * x86_64
>> * x86
>> * ia64
>> * ppc64
>>
>>
>> *Main changes from RC2:*
>>
>> 1. Added support for RH EL4 up3
>> 2. Added Open MPI package
>> 3. OSU MPI is now based on 0.97 release (was 0.95 in RC2)
>> 4. Added Pathscale (ipath) driver
>> 5. Added uDapl
>> 6. build based on the new method: Userlevel from openib branch 1.0
>> and kernel from openib trunk. (will be from the git in RC4)
>> 7. Added ibutils package
>> 8. Bug fixes
>>
>> *Package limitations:*
>>
>> 1. iSER is working on SuSE SLES 10 Beta8 only
>> 2. MPI OSU and Open MPI compilation fails on PPC64
>> 3. uDAPL is not supported on RH EL4 (up2 and up3) since rdma_ucm
>> module does not work on 2.6.9* kernels. If someone has
>> a patch we
>> will use it.
>> 4. ipath driver compilation fails on RH EL4 and FedoraC4.
>>
>> Please send me and Vlad any issue you encounter and testing results.
>>
>> Thanks
>> Tziporet & Vlad
>> _______________________________________________
>> openfabrics-ewg mailing list
>> openfabrics-ewg at openib.org
>> http://openib.org/mailman/listinfo/openfabrics-ewg
>>
>>
>

From bardov at gmail.com Mon Apr 17 03:19:46 2006
From: bardov at gmail.com (Dan Bar Dov)
Date: Mon, 17 Apr 2006 13:19:46 +0300
Subject: [openib-general] [Announce] open iSER target
Message-ID:

We would like to announce the Open iSER Target project as an integral
part of the OpenFabrics project.

Voltaire has recently contributed its iSER Target sources (located at
https://openib.org/svn/gen2/ulps/open-iser-target)

These sources will be used as the basis and reference code for the
implementation of the open iser target project.

A Wiki page for the project has been prepared at
https://openib.org/tiki/tiki-index.php?page=ISER-target

The open iser target is intended to be integrated with open iscsi target
projects such as IET (iSCSI Enterprise Target)
http://iscsitarget.sourceforge.net/ and TGT
http://www.linuxsymposium.org/2006/view_abstract.php?content_key=19.

We've been working with Network Appliance and Open Grid Computing to
start off this project and encourage more companies and individuals to
join it.

From tziporet at mellanox.co.il Mon Apr 17 04:06:42 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 17 Apr 2006 14:06:42 +0300
Subject: [openib-general] New diags tool available
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30100B65B@mtlexch01.mtl.com>

Do you want it to be part of the OFED release too?
If yes then please commit it to the 1.0 branch too.

Tziporet

-----Original Message-----
From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock
Sent: Thursday, April 13, 2006 10:07 PM
To: openib-general at openib.org
Subject: [openib-general] New diags tool available

Hi,

With svn r6460, a new diags tool is now available on the trunk. It is Ira
Weiny's saquery. (Thanks for bearing with me on this).

saquery tool obtains information based on node name:

saquery -h
Usage: saquery [-h -d -P -N -L -G][]
Queries node records by default
  -d enable debugging
  -P get PathRecord info
  -N get NodeRecord info
  -L Return just the Lid of the name specified
  -G Return just the Guid of the name specified

-- Hal

_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From tziporet at mellanox.co.il Mon Apr 17 08:01:18 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 17 Apr 2006 18:01:18 +0300
Subject: [openib-general] Announcing the OpenFabrics Enterprise Distribution
In-Reply-To: <1144869147.19061.91517.camel@hal.voltaire.com>
References: <1144869147.19061.91517.camel@hal.voltaire.com>
Message-ID: <4443ADBE.7000307@mellanox.co.il>

Hi Hal,
I was on Passover vacation so the reply was a little delayed. Hope it is
still useful.

Tziporet

>> Q: Is OFED development happening in the open?
> How can that be ? The 1.0 branch contains no kernel code. Where is the
> OFED kernel code kept ?
>

We take it from the git tree that is targeted for the next kernel release.

> So there will never be a bug fixed on the 1.0 branch being merged back
> to the trunk ? For various components, the trunk has diverged and is
> ahead of the 1.0 branch.
>
>

We will create the needed fix for the branch in case the trunk has
changed a lot.

> Can someone explain the OFED location layout in the OF svn tree ?
>

Look under the 1.0 branch for the ibed directory (ibed was the first name
of this release and we will need to change the directory name). If things
are not clear there please ask and we will provide more data.

>
> How do discrepancies in the accepted patches between OF and OFED get
> resolved ?
>
>

We have a fixes directory for patches that are critical for OFED and are
not yet accepted in the trunk. The install script applies all patches
when creating the RPMs. (This is also true for back-port patches.) Once a
fix makes it to the trunk we take it from there.

From rdreier at cisco.com Mon Apr 17 08:56:41 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 17 Apr 2006 08:56:41 -0700
Subject: [openib-general] Re: [PATCH] git: updates to rdma_cm branch
In-Reply-To: <444213CC.2070008@voltaire.com> (Or Gerlitz's message of "Sun, 16 Apr 2006 12:52:12 +0300")
References: <444213CC.2070008@voltaire.com>
Message-ID:

    Or> I see that the for-2.6.18 branch contains the addr and cma
    Or> includes & code, so from the cma perspective the 2.6.18 push
    Or> is ready? - also if someone want to compile/run code (eg iSER)
    Or> which is dependent on the cma they just need to do git pull of
    Or> the include/rdma and drivers/infiniband directories from the
    Or> for-2.6.18 branch - am i correct?

Yes, I have CMA queued in for-2.6.18 so unless something changes it will
get merged right after 2.6.17 is released.

There is no way to just pull a subset of directories, but the CMA patches
are the only changes in my for-2.6.18 branch right now.

 - R.

From rdreier at cisco.com Mon Apr 17 09:00:30 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 17 Apr 2006 09:00:30 -0700
Subject: [openib-general] Re: iser upstream: further reviews and process towards 2.6.18
In-Reply-To: (Or Gerlitz's message of "Sun, 16 Apr 2006 15:44:35 +0300")
References:
Message-ID:

    Or> Do you find it need/appropriate to post another RFC, this time
    Or> to LKML?

    Or> Also (and related to both reviewes and upstream) iser is
    Or> dependent on two change sets for 2.6.18: first, the cma and
    Or> second, the bunch of 2.6.18 iscsi updates (it is six patches
    Or> from which iser directly is dependent on two, but we prefer to
    Or> treat all the six as needed for iser to compile/work).

    Or> The six iscsi patches were pushed to linux-scsi but are not
    Or> yet in the scsi-misc git tree, they can also be found under
    Or> https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/2.6.17
    Or> it is the 01 to 06 sequence.

    Or> As for the upstream process, i guess each of you and James
    Or> would need to do git pull from the other one into the
    Or> scsi-misc (2.6.18 rdma updates) and the infiniband (2.6.18
    Or> scsi updates) so iser can be compiled with both trees, or at
    Or> least with your tree.

This is typically not the way things are done. The simplest way would be
to wait until the CMA and the iSCSI patches are all merged by Linus,
which should be right after 2.6.18 opens (assuming the iSCSI patches make
it to scsi-misc in time). Then either James or I could merge the iSER
changes on top of what's already in Linus's tree.

It's up to you whether you think it's more appropriate to merge iSER
through James's tree or my tree. In any case, the best way to proceed
would be to post a final set of patches sent To: the maintainer you want
to go through, Cc:ed to linux-kernel, linux-scsi and openib-general.

If you send it to me, I will put it into an iser branch in my git repo
and merge it onto for-2.6.18 after the iSCSI patches are merged by Linus.
I assume James would handle it the same way (although he would wait for
the CMA patches).

 - R.

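For anyone following the thread, the "post a final set of patches" step could be sketched roughly as below. This is only an illustration, not something posted by Roland: it assumes a reasonably current git with format-patch and send-email available, the "outgoing" directory and the helper name are made up, and the maintainer address and branch names are whatever you pass in. It prints the commands rather than running them, so nothing is mailed by accident.

```shell
#!/bin/sh
# Sketch only: prints the commands for review instead of running them.
# print_submit_cmds and the "outgoing" directory are hypothetical names.
print_submit_cmds() {
    maintainer=$1   # address of the maintainer the series is sent To:
    base=$2         # branch the series applies on top of, e.g. for-2.6.18
    topic=$3        # local branch holding the iSER patches (assumed name)

    # One numbered patch file per commit that is on $topic but not on $base.
    printf 'git format-patch -n -o outgoing %s..%s\n' "$base" "$topic"

    # To: the chosen maintainer, Cc: the three lists named in the mail above.
    printf 'git send-email --to=%s --cc=%s --cc=%s --cc=%s outgoing/*.patch\n' \
        "$maintainer" \
        linux-kernel@vger.kernel.org \
        linux-scsi@vger.kernel.org \
        openib-general@openib.org
}

# Example (address is a placeholder):
#   print_submit_cmds maintainer@example.org for-2.6.18 iser
```

Printing first makes it easy to check the To:/Cc: lines before anything goes out.
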
From bardov at gmail.com Mon Apr 17 09:14:32 2006
From: bardov at gmail.com (Dan Bar Dov)
Date: Mon, 17 Apr 2006 19:14:32 +0300
Subject: [openib-general] Re: iser upstream: further reviews and process towards 2.6.18
In-Reply-To:
References:
Message-ID:

On 4/17/06, Roland Dreier wrote:
>
> This is typically not the way things are done. The simplest way would
> be to wait for the CMA and the iSCSI patches are all merged by Linus,
> which should be right after 2.6.18 opens (assuming the iSCSI patches
> make it to scsi-misc in time). Then either James or I could merge the
> iSER changes on top of what's already in Linus's tree.

James has already merged the iSCSI patches into his git tree. We are now
waiting for him to merge the last iSER-related patch that we sent to
linux-scsi.

>
> It's up to you whether you think it's more appropriate to merge iSER
> through James's tree or my tree. In any case, the best way to proceed
> would be to post a final set of patches sent To: the maintainer you
> want to go through, Cc:ed to linux-kernel, linux-scsi and
> openib-general.
>
> If you send it to me, I will put it into an iser branch in my git repo
> and merge it onto for-2.6.18 after the iSCSI patches are merged by Linus.
> I assume James would handle it the same way (although he would wait
> for the CMA patches).

We think it's more appropriate to push iSER through you and openIB. To do
so we'll need a single git tree that has open-iscsi, cma and iser. This
is why we ask you to update your git tree with James's iscsi-related
updates - your tree will have both the CMA and iscsi, and then iser will
build against it.

Dan

>
> - R.

From rdreier at cisco.com Mon Apr 17 12:22:30 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 17 Apr 2006 12:22:30 -0700
Subject: [openib-general] Re: iser upstream: further reviews and process towards 2.6.18
In-Reply-To: (Dan Bar Dov's message of "Mon, 17 Apr 2006 19:14:32 +0300")
References:
Message-ID:

    Dan> We think its more appropriate to push ISER through you and
    Dan> openIB. To do so we'll need a single git tree that has
    Dan> open-iscsi, cma and iser. This is why we ask you to update
    Dan> your git tree with James's iscsi-related updates - your tree
    Dan> will have both the CMA, and iscsi, and then iser will build
    Dan> against it.

OK, let me know when James has a git tree that has everything needed for
iSER. I will pull that into my iser branch, and then add the patches you
send me on top. When James merges everything up to Linus, then I will ask
Linus to pull my iser branch too.

Does that seem like a good plan?

 - R.

From rdreier at cisco.com Mon Apr 17 13:12:38 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 17 Apr 2006 13:12:38 -0700
Subject: [openib-general] [PATCH] splitting IPoIB CQ
In-Reply-To: (Shirley Ma's message of "Fri, 14 Apr 2006 13:20:57 -0600")
References:
Message-ID:

    Shirley> Some tests have been done over mthca and
    Shirley> ehca. Unidirectional stream test, gains up to 15%
    Shirley> throughput with this patch on systems over 4 cpus.
    Shirley> Bidirectional could gain more. People might get different
    Shirley> performance improvement number under different drivers
    Shirley> and cpus. I have attached the patch for who are willing
    Shirley> to run the performance test with different drivers. And
    Shirley> please give your inputs.

Have you ever seen this hurt performance?
It seems that splitting receives and send CQs will increase the number
of events generated and possibly use more CPU.

Actually, do you have some explanation for why this helps performance?
My intuition would be that it just generates more interrupts for the
same workload.

One specific question:

 > -	struct ib_wc ibwc[IPOIB_NUM_WC];
 > +	struct ib_wc *send_ibwc;
 > +	struct ib_wc *recv_ibwc;

Why are you changing these to be dynamically allocated outside of the
main structure? Is it to avoid false sharing of cachelines?

It might be better to sort the whole structure so that we have all the
common, read-mostly stuff first, then TX stuff (marked with
____cacheline_aligned_in_smp) and then RX stuff, also marked to be
cacheline aligned.

 - R.

From jlentini at netapp.com Mon Apr 17 13:46:06 2006
From: jlentini at netapp.com (James Lentini)
Date: Mon, 17 Apr 2006 16:46:06 -0400 (EDT)
Subject: [openib-general] Re: [uDAPL] dat.conf generator
In-Reply-To: <200604161042.25045.dotanb@mellanox.co.il>
References: <200604121122.48646.dotanb@mellanox.co.il> <200604161042.25045.dotanb@mellanox.co.il>
Message-ID:

On Sun, 16 Apr 2006, Dotan Barak wrote:

> On Wednesday 12 April 2006 17:50, James Lentini wrote:
> > > OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" ""
> > > OpenIB-cma0-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" ""
> > > OpenIB-cma0-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 2" ""
> > > OpenIB-cma1-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 1" ""
> > > OpenIB-cma1-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 2" ""
> > > OpenIB-cma-netdev0 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> > > OpenIB-cma-netdev1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib1 0" ""
> > > OpenIB-cma-netdev2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib2 0" ""
> > > OpenIB-cma-netdev3 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib3 0" ""
> > > OpenIB-scm u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
> > > OpenIB-scm0-1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
> > > OpenIB-scm0-2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 2" ""
> > > OpenIB-scm1-1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca1 1" ""
> > > OpenIB-scm1-2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca1 2" ""
> > > OpenIB-scm-netdev0 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib0 0" ""
> > > OpenIB-scm-netdev1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib1 0" ""
> > > OpenIB-scm-netdev2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib2 0" ""
> > > OpenIB-scm-netdev3 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib3 0" ""
> > >
> > > is this file good enough or are more dapl provider names needed?
> >
> > You've covered all the standard combinations. Why did you include the
> > OpenIB-Z-netdevX entries? Why would a user prefer netdevX over ethY?
> > Just curious.
> >
> > If you are willing to contribute this back to the uDAPL project, I'm
> > sure the uDAPL community would find it very useful.
>
> I tried to create automatically a configuration file that is being
> filled manually today. Which lines do you think need to be in this
> configuration file? (I'm open for suggestions).

Why did you include the OpenIB-Z-netdevX entries? Why would a user
prefer netdevX over ethY?

> This dat.conf generator is a new piece of code that (when it will be
> ready) will be checked in to the openib stack (i think to the dapl
> tree), so it can be available to everyone ...

Thank you. We'll make a spot for it in the uDAPL tree.
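
[Editorial sketch, not part of the thread.] The per-port entries quoted
above all follow one layout, so a generator could plausibly build each
line from a provider name, an HCA index and a port number. The sketch
below assumes exactly the layout shown in the quoted file (version
"u1.2", "nonthreadsafe", "default", the /usr/lib library path and the
"mv_dapl.1.2" string). The function name format_dat_entry is
hypothetical, and this is not the generator Dotan describes; a real tool
would presumably discover the HCAs and ports on the host rather than
take them as arguments.

```c
#include <stdio.h>

/*
 * Format one per-port dat.conf line, e.g. for provider "cma", HCA 0,
 * port 1:
 *   OpenIB-cma0-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so
 *   mv_dapl.1.2 "mthca0 1" ""      (all on a single line)
 *
 * Hypothetical helper: the field values are copied from the file
 * quoted in this thread, not taken from any uDAPL source.
 *
 * Returns the number of characters written (excluding the trailing
 * NUL), or -1 if the buffer is too small or formatting fails.
 */
static int format_dat_entry(char *buf, size_t len, const char *provider,
                            int hca, int port)
{
    int n = snprintf(buf, len,
                     "OpenIB-%s%d-%d u1.2 nonthreadsafe default "
                     "/usr/lib/libdapl%s.so mv_dapl.1.2 "
                     "\"mthca%d %d\" \"\"",
                     provider, hca, port, provider, hca, port);

    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}
```

Calling format_dat_entry() in a loop over providers ("cma", "scm"), HCA
indexes and port numbers would reproduce the OpenIB-cmaX-Y and
OpenIB-scmX-Y entries; the unnumbered and netdev entries use a slightly
different name and device string and are left out of this sketch.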

From halr at voltaire.com Mon Apr 17 15:13:23 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 17 Apr 2006 18:13:23 -0400
Subject: [openib-general] Announcing the OpenFabrics Enterprise Distribution
In-Reply-To: <4443ADBE.7000307@mellanox.co.il>
References: <1144869147.19061.91517.camel@hal.voltaire.com> <4443ADBE.7000307@mellanox.co.il>
Message-ID: <1145312002.4539.72029.camel@hal.voltaire.com>

Hi Tziporet,

On Mon, 2006-04-17 at 11:01, Tziporet Koren wrote:
> Hi Hal,
> I was on Passover vacation so the reply was a little delayed.
> Hope it still useful.

Thanks.

> Tziporet
>
> >> Q: Is OFED development happening in the open?
> > How can that be ? The 1.0 branch contains no kernel code. Where is the
> > OFED kernel code kept ?
> >
> We take it from git that is targeted for next kernel release.

Where is this tree ?

Also, what about backports ? Are they part of OFED as well ?

> > So there will never be a bug fixed on the 1.0 branch being merged back
> > to the trunk ? For various components, the trunk has diverged and is
> > ahead of the 1.0 branch.
> >
> >
> We will create the needed fix for the branch in case the trunk is
> changed a lot.

So there may be cases of merging fixes from 1.0 back to the trunk (which
differs from what was written).

> > Can someone explain the OFED location layout in the OF svn tree ?
> >
> Look under 1.0 branch for the ibed directory (ibed was the first name of
> this release and we will need to change the directory name).
> If things are not clear there please ask and we will provide more data.

I see no such directory under 1.0. Can you provide more details ?

> > How do discrepancies in the accepted patches between OF and OFED get
> > resolved ?
> >
> >
> We have fixes directory for patches that are critical for OFED and are
> not yet accepted in the trunk.
> The install script apply all patches when creating the RPMs. (this is
> also correct for back-port patches)
> Once a fix makes it to the trunk we take it from there.

Is the process the same for userspace ?

-- Hal

From mlleinin at hpcn.ca.sandia.gov Mon Apr 17 20:33:14 2006
From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger)
Date: Mon, 17 Apr 2006 20:33:14 -0700
Subject: [openib-general] Re: Compile problems with core code and pathscale for svn6462 and linux-2.6.17-rc1
In-Reply-To: <1145031720.30490.13.camel@chalcedony.pathscale.com>
References: <1144971158.24662.213.camel@localhost> <200604131640.12899.bos@pathscale.com> <1144972288.24662.216.camel@localhost> <200604131654.57734.bos@pathscale.com> <1144972598.24662.221.camel@localhost> <443FCB9F.8030604@ichips.intel.com> <1145031720.30490.13.camel@chalcedony.pathscale.com>
Message-ID: <1145331194.24662.244.camel@localhost>

On Fri, 2006-04-14 at 09:22 -0700, Bryan O'Sullivan wrote:
> On Fri, 2006-04-14 at 09:19 -0700, Sean Hefty wrote:
> > Matt Leininger wrote:
> > > Ok. So the current state is that the mainline devel branch will be
> > > broken for a while?
> >
> > The trunk is always supposed to work, let alone compile. This needs to be fixed
> > quickly, or the offending code moved to a branch.
>
> There is nothing that needs to be fixed. Matt was just not using the
> right combination of bits when he was trying to compile the world.
>

I was using the right bits, I just had to use Roland's patch to the
Makefile to get things to compile.

 - Matt

From dotanb at mellanox.co.il Mon Apr 17 23:12:46 2006
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Tue, 18 Apr 2006 09:12:46 +0300
Subject: [openib-general] Re: [uDAPL] dat.conf generator
In-Reply-To:
References: <200604121122.48646.dotanb@mellanox.co.il> <200604161042.25045.dotanb@mellanox.co.il>
Message-ID: <200604180912.47016.dotanb@mellanox.co.il>

On Monday 17 April 2006 23:46, James Lentini wrote:
>
> On Sun, 16 Apr 2006, Dotan Barak wrote:
>
> > On Wednesday 12 April 2006 17:50, James Lentini wrote:
> > > > OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" ""
> > > > OpenIB-cma0-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" ""
> > > > OpenIB-cma0-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 2" ""
> > > > OpenIB-cma1-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 1" ""
> > > > OpenIB-cma1-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 2" ""
> > > > OpenIB-cma-netdev0 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> > > > OpenIB-cma-netdev1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib1 0" ""
> > > > OpenIB-cma-netdev2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib2 0" ""
> > > > OpenIB-cma-netdev3 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib3 0" ""
> > > > OpenIB-scm u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
> > > > OpenIB-scm0-1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
> > > > OpenIB-scm0-2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 2" ""
> > > > OpenIB-scm1-1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca1 1" ""
> > > > OpenIB-scm1-2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca1 2" ""
> > > > OpenIB-scm-netdev0 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib0 0" ""
> > > > OpenIB-scm-netdev1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib1 0" ""
> > > > OpenIB-scm-netdev2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib2 0" ""
> > > > OpenIB-scm-netdev3 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib3 0" ""
> > > >
> Why did you include the OpenIB-Z-netdevX entries? Why would a user
> prefer netdevX over ethY?

as i told you before, i tried to create the configuration file automatically.

i created the OpenIB-Z-netdevX entries for flexibility: for example
you have 2 HCAs in the host, and you want to use a specific one.
from other entries you can not be sure which specific HCA is being used.

I believe that you have much more experience than i have in using dapl, so i need
your help with this issue: which entries do you think that the dat.conf should have?

thanks
Dotan

From dotanb at mellanox.co.il Mon Apr 17 23:37:50 2006
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Tue, 18 Apr 2006 09:37:50 +0300
Subject: [openib-general] Re: [uDAPL] dtest server never ends when using the dapl provider "OpenIB-scm1"
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301EF04EC@mtlexch01.mtl.com>

>
>
> Hmm, not sure what this thread is waiting on. I would expect to see the
> dat_ep_disconnect messages before the wait complete or at least the
> dat_ep_disconnect message indicating a blocking disconnect call. The
> next 3 messages expected are as follow:
>
> dat_ep_disconnect
> dat_ep_disconnect completed
> dat_evd_wait for h_conn_evd completed
>
> Can you attach to the server process with gdb and get me a
> back trace from each of the threads?
>
> What does driver IBED-1.0-rc3 consist of?
>
> Thanks,
>
> -arlin
>
>

Here is a back trace of the hung process:

(gdb) bt
#0 0x00002aaaab31c86a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
#1 0x00002aaaab42ef5b in dapl_os_wait_object_wait (wait_obj=0x516650, timeout_val=) at dapl_osd.c:276
#2 0x00002aaaab42e9ab in dapl_evd_wait (evd_handle=0x516560, time_out=4294967295, threshold=1, event=0x7fffffdd7bf0, nmore=0x7fffffdd7c2c) at dapl_evd_wait.c:233
#3 0x00000000004021ab in disconnect_ep () at dtest.c:894
#4 0x0000000000404cad in main (argc=4, argv=) at dtest.c:375

please note that my version of dtest has some small changes, but line 894 is the following code:

	ret = dat_evd_wait( h_conn_evd, DAT_TIMEOUT_INFINITE, 1, &event, &nmore );
	if(ret != DAT_SUCCESS) {
		fprintf(stderr, "%d Error dat_evd_wait: %s\n", getpid(),DT_RetToString(ret));
	}
	else {
		LOGPRINTF("%d dat_evd_wait for h_conn_evd completed\n", getpid());
	}

thanks
Dotan

From bpradip at in.ibm.com Tue Apr 18 00:14:54 2006
From: bpradip at in.ibm.com (Pradipta Kumar Banerjee)
Date: Tue, 18 Apr 2006 12:44:54 +0530
Subject: [openib-general] [PATCH] amso1100/c2_vq.c : Corrected vq_wait_for_reply function
Message-ID: <20060418071453.GB3143@harry-potter.in.ibm.com>

Hi

This patch corrects the vq_wait_for_reply function so that it actually waits
for the reply till the timeout occurs.

Signed-off-by: Pradipta Kumar B
Signed-off-by: Krishna Kumar

---

Index: c2_vq.c
=====================================================================
--- c2_vq.c.org	2006-04-18 11:52:01.000000000 +0530
+++ c2_vq.c	2006-04-18 12:20:42.000000000 +0530
@@ -240,27 +240,12 @@ int vq_send_wr(struct c2_dev *c2dev, uni
  */
 int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req)
 {
-	wait_queue_t __wait;
-	int rc = 0;
-
-	/*
-	 * Add this request to the wait queue.
-	 */
-	init_waitqueue_entry(&__wait, current);
-	add_wait_queue(&req->wait_object, &__wait);
-	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (atomic_read(&req->reply_ready)) {
-			break;
-		}
-		if (schedule_timeout(60 * HZ) == 0) {
-			rc = -ETIMEDOUT;
-			break;
-		}
+	if (wait_event_timeout(req->wait_object,
+		(atomic_read(&req->reply_ready) == 1), 60*HZ) == 0) {
+		printk(KERN_ERR "Device timeout\n");
+		return -ETIMEDOUT;
 	}
-	set_current_state(TASK_RUNNING);
-	remove_wait_queue(&req->wait_object, &__wait);
-	return rc;
+	return 0;
 }
 
 /*

From hch at lst.de Tue Apr 18 06:07:31 2006
From: hch at lst.de (Christoph Hellwig)
Date: Tue, 18 Apr 2006 15:07:31 +0200
Subject: [openib-general] creative file descriptor abuse in uverbs
Message-ID: <20060418130731.GB31050@lst.de>

I'd like to get rid of get_empty_filp midterm. Because of that I've
looked at all the users, and the only modular and most creative one is
the uverbs code. Everything else only ever returns new fds from syscalls,
which is a good thing. Trying to shoe-horn fds into ->write otoh creates
various problems. Any chance you could change this to creating new files
in an uverbsfs or something when doing the next ABI revision for uverbs?
(I suspect there'll be a major change for iWarp integration?)

I plan to send a patch to mark get_empty_filp deprecated for modules to
akpm soon, but it will be a few months at least before it can go, and not
before we've found a solution for uverbs.

From tom at opengridcomputing.com Tue Apr 18 08:27:38 2006
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 18 Apr 2006 10:27:38 -0500
Subject: [openib-general] [PATCH] amso1100/c2_vq.c : Corrected vq_wait_for_reply function
In-Reply-To: <20060418071453.GB3143@harry-potter.in.ibm.com>
References: <20060418071453.GB3143@harry-potter.in.ibm.com>
Message-ID: <1145374058.20514.30.camel@trinity.ogc.int>

Thanks for the patch! This is a nice cleanup...

The code does actually wait.
If you go look at the implementation of wait_event_timeout you will see
that it is essentially the same thing. At the time this code was
written, the wait_event_timeout function didn't exist ...

On Tue, 2006-04-18 at 12:44 +0530, Pradipta Kumar Banerjee wrote:
> Hi
>
> This patch corrects the vq_wait_for_reply function so that it actually waits
> for the reply till the timeout occurs.
>
> Signed-off-by: Pradipta Kumar B
> Signed-off-by: Krishna Kumar
>
> ---
>
> Index: c2_vq.c
> =====================================================================
> --- c2_vq.c.org	2006-04-18 11:52:01.000000000 +0530
> +++ c2_vq.c	2006-04-18 12:20:42.000000000 +0530
> @@ -240,27 +240,12 @@ int vq_send_wr(struct c2_dev *c2dev, uni
>   */
>  int vq_wait_for_reply(struct c2_dev *c2dev, struct c2_vq_req *req)
>  {
> -	wait_queue_t __wait;
> -	int rc = 0;
> -
> -	/*
> -	 * Add this request to the wait queue.
> -	 */
> -	init_waitqueue_entry(&__wait, current);
> -	add_wait_queue(&req->wait_object, &__wait);
> -	for (;;) {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> -		if (atomic_read(&req->reply_ready)) {
> -			break;
> -		}
> -		if (schedule_timeout(60 * HZ) == 0) {
> -			rc = -ETIMEDOUT;
> -			break;
> -		}
> +	if (wait_event_timeout(req->wait_object,
> +		(atomic_read(&req->reply_ready) == 1), 60*HZ) == 0) {
> +		printk(KERN_ERR "Device timeout\n");
> +		return -ETIMEDOUT;
>  	}
> -	set_current_state(TASK_RUNNING);
> -	remove_wait_queue(&req->wait_object, &__wait);
> -	return rc;
> +	return 0;
>  }
>
>  /*

From or.gerlitz at gmail.com Tue Apr 18 08:28:40 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 18 Apr 2006 17:28:40 +0200
Subject: [openib-general] Re: iser upstream: further reviews and process towards 2.6.18
In-Reply-To:
References:
Message-ID: <15ddcffd0604180828i1b551dc6q4e24eef145167011@mail.gmail.com>

On 4/17/06, Roland Dreier wrote:

> It's up to you whether you think it's more appropriate to merge iSER
> through James's tree or my tree. In any case, the best way to proceed
> would be to post a final set of patches sent To: the maintainer you
> want to go through, Cc:ed to linux-kernel, linux-scsi and
> openib-general.

Indeed, as Dan commented we prefer to go via your tree. The code is
almost ready and once we are, i will send the final set of patches to
you and CC linux-kernel, openib-general and linux-scsi.

Or.

From xma at us.ibm.com Tue Apr 18 08:57:25 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 18 Apr 2006 08:57:25 -0700
Subject: [openib-general] [PATCH] splitting IPoIB CQ
In-Reply-To:
Message-ID:

Roland,

Roland Dreier wrote on 04/17/2006 01:12:38 PM:
> Have you ever seen this hurt performance? It seems that splitting
> receives and send CQs will increase the number of events generated and
> possibly use more CPU.

The performance gain was not free, it did cost cpu utilization 3-5% more.
I don't have the comparison of the number of interrupts with the same
throughput.

> Actually, do you have some explanation for why this helps performance?
> My intuition would be that it just generates more interrupts for the
> same workload.

The only lock contention in IPoIB I saw is tx_lock. When separating
the completion queue to have separate completion handler. It could improve
the performance. I didn't look at driver code, it might have some impact
there?

I did see high interrupts and I had patched IPoIB which I mentioned before
to have different NUM_WC under different workloads. It could reduce the
interrupts N times for the same throughput, and gain better throughput
under same cpu utilization. I am still investigating interrupts/cpu
utilization/throughput issues.

> One specific question:
>
> > -	struct ib_wc ibwc[IPOIB_NUM_WC];
> > +	struct ib_wc *send_ibwc;
> > +	struct ib_wc *recv_ibwc;
>
> Why are you changing these to be dynamically allocated outside of the
> main structure? Is it to avoid false sharing of cachelines?

Yep, this was one of the reasons.

> It might be better to sort the whole structure so that we have all the
> common, read-mostly stuff first, then TX stuff (marked with
> ____cacheline_aligned_in_smp) and then RX stuff, also marked to be
> cacheline aligned.
>
> - R.

Sure. I will replace it and rerun the test to see the difference.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From vuhuong at mellanox.com Tue Apr 18 09:10:49 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Tue, 18 Apr 2006 09:10:49 -0700
Subject: [openib-general][patch review] srp: fmr implementation,
In-Reply-To:
References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com>
Message-ID: <44450F89.7020500@mellanox.com>

Roland Dreier wrote:
> Hmm, I don't understand what could be going on. srp_send_tsk_mgmt()
> currently has:
>
> 	if (req->cmd_done) {
> 		srp_remove_req(target, req, req_index);
> 		scmnd->scsi_done(scmnd);
> 	} else if (!req->tsk_status) {
> 		srp_remove_req(target, req, req_index);
> 		scmnd->result = DID_ABORT << 16;
> 		ret = SUCCESS;
> 	}
>
> and otherwise it returns FAILED. So in both cases where it finishes
> the command, it removes it from the list of pending requests.

The return happened before reaching the code above. It happened at:

	if (!wait_for_completion_timeout(&req->done,
			msecs_to_jiffies(SRP_ABORT_TIMEOUT_MS)))
		return FAILED;

because the qp was in fatal state.
Therefore, the command was not finished or removed from the pending queue.

Vu

From rdreier at cisco.com Tue Apr 18 09:15:39 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 18 Apr 2006 09:15:39 -0700
Subject: [openib-general][patch review] srp: fmr implementation,
In-Reply-To: <44450F89.7020500@mellanox.com> (Vu Pham's message of "Tue, 18 Apr 2006 09:10:49 -0700")
References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com> <44450F89.7020500@mellanox.com>
Message-ID:

 > The return happened before reaching the code above. It happened at:
 > 	if (!wait_for_completion_timeout(&req->done,
 > 			msecs_to_jiffies(SRP_ABORT_TIMEOUT_MS)))
 > 		return FAILED;
 >
 > because the qp was in fatal state. Therefore, the command was not
 > finished or removed from the pending queue

Hmm... we're returning FAILED from the abort handler. That means the
SCSI midlayer should not free the associated SCSI command yet, right?

 - R.

From rdreier at cisco.com Tue Apr 18 09:30:19 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 18 Apr 2006 09:30:19 -0700
Subject: [openib-general] Re: creative file descriptor abuse in uverbs
In-Reply-To: <20060418130731.GB31050@lst.de> (Christoph Hellwig's message of "Tue, 18 Apr 2006 15:07:31 +0200")
References: <20060418130731.GB31050@lst.de>
Message-ID:

    Christoph> I'd like to get rid of get_empty_filp midterm. Because
    Christoph> of that I've looked at all the users, and the only
    Christoph> modular and most creative one is the uverbs code.
    Christoph> Everything else only ever returns new fds from
    Christoph> syscalls, which is a good thing. Trying to shoe-horn
    Christoph> fds into ->write otoh creates various problems. Any
    Christoph> chance you could change this to creating new files in
    Christoph> an uverbsfs or something when doing the next ABI
    Christoph> revision for uverbs? (I suspect there'll be a major
    Christoph> change for iWarp integration?)

Yes, I'm very open to cleaning up the uverbs interface, and I have
thought that iWARP integration would be a good time to break the world
(although I would want to leave the current interface for a kernel
release or two afterwards).

However I'm not sure that uverbsfs is the right answer. Currently the
reason uverbs creates file descriptors as part of the write() call is
to do the following:

 - userspace creates a "context" by opening a device special file
 - userspace creates 1 or more "event channels," which are new file
   descriptors, attached to that context by issuing a command via
   write() (other commands like "create queue pair" or "register
   memory" are also issued via write())

The closest analogy I can come up with is doing socket()/bind() and
then accept() on the socket, which returns a new file desc.

This socket analogy leads to adding a new syscall (or ~20 syscalls if
we want to make the whole uverbs interface into syscalls), and I'm not
sure that we understand the needs of iWARP integration well enough to
freeze things into syscalls...

 - R.

From vuhuong at mellanox.com Tue Apr 18 10:09:06 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Tue, 18 Apr 2006 10:09:06 -0700
Subject: [openib-general][patch review] srp: fmr implementation,
In-Reply-To:
References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com> <44450F89.7020500@mellanox.com>
Message-ID: <44451D32.1010106@mellanox.com>

Roland Dreier wrote:
> > The return happened before reaching the code above. It happened at:
> > 	if (!wait_for_completion_timeout(&req->done,
> > 			msecs_to_jiffies(SRP_ABORT_TIMEOUT_MS)))
> > 		return FAILED;
> >
> > because the qp was in fatal state. Therefore, the command was not
> > finished or removed from the pending queue
>
> Hmm... we're returning FAILED from the abort handler. That means the
> SCSI midlayer should not free the associated SCSI command yet, right?
>

Right for the abort command; however, I'm not sure after the scsi
midlayer called device_reset.

With some extra printk statements inserted I don't see the crash
anymore. The debug printks probably change the timing.

Vu

From rdreier at cisco.com Tue Apr 18 10:17:40 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 18 Apr 2006 10:17:40 -0700
Subject: [openib-general][patch review] srp: fmr implementation,
In-Reply-To: <44451D32.1010106@mellanox.com> (Vu Pham's message of "Tue, 18 Apr 2006 10:09:06 -0700")
References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com> <44450F89.7020500@mellanox.com> <44451D32.1010106@mellanox.com>
Message-ID:

    Vu> Right for abort command; however, I'm not sure after scsi
    Vu> midlayer called device_reset

Hmm, you may be right -- scsi_eh_bus_device_reset() in scsi_error.c
does seem to flush all commands. But do you see srp_reset_device()
being called? I didn't think I saw it in your trace.

From rdreier at cisco.com Tue Apr 18 10:19:08 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 18 Apr 2006 10:19:08 -0700
Subject: [openib-general][patch review] srp: fmr implementation,
In-Reply-To: (Roland Dreier's message of "Tue, 18 Apr 2006 10:17:40 -0700")
References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com> <44450F89.7020500@mellanox.com> <44451D32.1010106@mellanox.com>
Message-ID:

    Roland> Hmm, you may be right -- scsi_eh_bus_device_reset() in
    Roland> scsi_error.c does seem to flush all commands. But do you
    Roland> see srp_reset_device() being called? I didn't think I saw
    Roland> it in your trace.

And what if you comment out the line

	.eh_device_reset_handler = srp_reset_device,

does that fix it?

 - R.

From jlentini at netapp.com Tue Apr 18 11:11:49 2006
From: jlentini at netapp.com (James Lentini)
Date: Tue, 18 Apr 2006 14:11:49 -0400 (EDT)
Subject: [openib-general] [libmthca][PATCH] minor README fix
Message-ID:

I noticed this small mistake in the libmthca README

Signed-off-by: James Lentini

Index: libmthca/README
===================================================================
--- libmthca/README	(revision 6507)
+++ libmthca/README	(working copy)
@@ -20,7 +20,7 @@ libmthca currently supports HCAs based o
 
     MT23108 InfiniHost (PCI-X)
     MT25208 InfiniHost III Ex (PCI Express)
-    MT25208 InfiniHost III Lx (PCI Express)
+    MT25204 InfiniHost III Lx (PCI Express)
 
 Both non-DDR and DDR HCAs are supported, and the MT25208 is supported
 with both MT23108-compatible and native MemFree firmware.

From rdreier at cisco.com Tue Apr 18 13:37:40 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 18 Apr 2006 13:37:40 -0700
Subject: [openib-general] Re: creative file descriptor abuse in uverbs
In-Reply-To: (Roland Dreier's message of "Tue, 18 Apr 2006 09:30:19 -0700")
References: <20060418130731.GB31050@lst.de>
Message-ID:

    Roland> The closest analogy I can come up with is doing
    Roland> socket()/bind() and then accept() on the socket, which
    Roland> returns a new file desc.

Oh, I forgot one of the main points I wanted to talk about: doing
something like an open() on a file in uverbsfs to create a new event
channel attached to a given context means that we would have to do
some crazy permissions or namespace stuff to restrict access to the
right set of processes. And this would break if someone passed a
context FD through a unix domain socket or something bizarre like
that.

Maybe I'm wrong but that doesn't seem like the right way to do it to
me. Even if we were redesigning things today I don't think we would
create a socketfs with accept() replaced by open() on some magic
file...

 - R.
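
[Editorial sketch, not part of the thread.] The socket analogy Roland
draws can be seen in a few lines of ordinary userspace C: a listening
descriptor is created with socket()/bind()/listen(), and accept() then
hands back a second, distinct descriptor tied to it. The sketch below
uses only the standard BSD sockets calls over loopback TCP; it says
nothing about how uverbs itself is implemented, and the helper name
accept_returns_new_fd is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Listen on a loopback TCP port chosen by the kernel, connect to it
 * from the same process, and accept() the connection.
 *
 * Returns 0 if accept() produced a new descriptor distinct from both
 * the listening socket and the client socket, -1 on any error.
 */
static int accept_returns_new_fd(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int lfd = -1, cfd = -1, afd = -1, ret = -1;

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;          /* let the kernel pick a free port */

    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    /* Client side: connect to whatever port the listener was given. */
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 ||
        connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    /* The "new file desc" of the analogy comes from this call. */
    afd = accept(lfd, NULL, NULL);
    if (afd >= 0 && afd != lfd && afd != cfd)
        ret = 0;

out:
    if (afd >= 0)
        close(afd);
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    return ret;
}
```

The point of the comparison is that the new descriptor is returned by a
system call made on the parent descriptor, so only a process that
already holds the listening socket can obtain it; no filesystem name or
permission check is involved.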

From rdreier at cisco.com Tue Apr 18 13:39:28 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 18 Apr 2006 13:39:28 -0700
Subject: [openib-general] Re: [libmthca][PATCH] minor README fix
In-Reply-To: (James Lentini's message of "Tue, 18 Apr 2006 14:11:49 -0400 (EDT)")
References:
Message-ID:

Thanks, applied.

From rdreier at cisco.com Tue Apr 18 13:45:17 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 18 Apr 2006 13:45:17 -0700
Subject: [openib-general] [PATCH] splitting IPoIB CQ
In-Reply-To: (Shirley Ma's message of "Tue, 18 Apr 2006 08:57:25 -0700")
References:
Message-ID:

 > > Actually, do you have some explanation for why this helps performance?
 > > My intuition would be that it just generates more interrupts for the
 > > same workload.

 > The only lock contention in IPoIB I saw is tx_lock. When separating
 > the completion queue to have separate completion handler. It could improve
 > the performance. I didn't look at driver code, it might have some impact
 > there?

A clever way to avoid taking the TX lock on send completions would be
very nice, but I never saw a way to do it.

Does splitting the CQ reduce contention? I don't see why that would
be, since the contention is between sending and getting send
completions. The receive path of course never touches the tx_lock.

I'm a little uncomfortable splitting the IPoIB CQ until we have a
plausible theory as to why it improves performance. It may be that
understanding this would lead us to fix the underlying issue in a
better way and get even more performance.

By the way, are your numbers with mthca or ehca? I don't know ehca
very well, but at least with current mthca, all CQ events will be
delivered on the same interrupt and hence all CQ handling will run on
the same CPU.
So I'm puzzled why changing things from:

    -> interrupt
       -> CQ event handler
          -> handle all IPoIB completions

to:

    -> interrupt
       -> TX CQ event handler
          -> handle TX completions
    [possibly another interrupt]
       -> RX CQ event handler
          -> handle RX completions

helps throughput. It just seems like it's more CQ locking/unlocking
and in general more work.

 - R.

From wombat2 at us.ibm.com Tue Apr 18 13:48:28 2006
From: wombat2 at us.ibm.com (Bernard King-Smith)
Date: Tue, 18 Apr 2006 16:48:28 -0400
Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114
In-Reply-To: <20060418131811.C8E4B228402@openib.ca.sandia.gov>
Message-ID:

Shirley> Some tests have been done over mthca and
Shirley> ehca. Unidirectional stream test, gains up to 15%
Shirley> throughput with this patch on systems over 4 cpus.
Shirley> Bidirectional could gain more. People might get different
Shirley> performance improvement number under different drivers
Shirley> and cpus. I have attached the patch for who are willing
Shirley> to run the performance test with different drivers. And
Shirley> please give your inputs.

Roland> Have you ever seen this hurt performance? It seems that splitting
Roland> receives and send CQs will increase the number of events generated and
Roland> possibly use more CPU.

The problem occurs when you exceed the performance of a single CPU. We
have been running on multiple CPU systems, and this change actually helps
performance on 2 CPU running 4 hyperthreads using 2 sockets. One socket
for sending and one socket for receiving. If you look at recent IP
performance using IPoIB, you see that exchange bandwidth is not much
faster than unidirectional ( using Netperf ).

Roland> Actually, do you have some explanation for why this helps performance?
Roland> My intuition would be that it just generates more interrupts for the
Roland> same workload.

On a multiple CPU system looking at TOP you see one process consuming a
full CPU.
This happens to be the thread handling completion queue entries. I
suggested that we look at separate threads handling send completions vs.
receive completions. When we ran with the split completion queue patch,
we no longer see one process pegging the CPU at 100% and we get a speedup
of 65% going from STREAM to Duplex. Without the split completion queue,
we only saw a 15% speedup going from STREAM to Duplex.

The overall CPU utilization does increase with the split completion queue
handling, but proportional to the increased bandwidth it is no higher per
MB/s than a single handler.

You probably won't see any improvement in performance on 1 or 2 CPU
systems because you are already out of CPU at these bandwidths. However,
for machines with 4 or more CPUs, either hyperthreads or physical cores,
you will see the benefit in duplex bandwidth.

Bernie King-Smith
IBM Corporation
Server Group
Cluster System Performance
wombat2 at us.ibm.com (845)433-8483
Tie. 293-8483 or wombat2 on NOTES

"We are not responsible for the world we are born into, only for the
world we leave when we die. So we have to accept what has gone before us
and work to change the only thing we can, -- The Future." William Shatner

From rdreier at cisco.com Tue Apr 18 13:54:47 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 18 Apr 2006 13:54:47 -0700
Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114
In-Reply-To: (Bernard King-Smith's message of "Tue, 18 Apr 2006 16:48:28 -0400")
References:
Message-ID:

    Bernard> On a multiple CPU system looking at TOP you see one
    Bernard> process consuming a full CPU. This happens to be the
    Bernard> thread handling completion queue entries. I suggested
    Bernard> that we look at separate threads handling send completions
    Bernard> vs. receive completions. When we ran with the split
    Bernard> completion queue patch, we no longer see one process
    Bernard> pegging the CPU at 100% and we get a speedup of 65% going
    Bernard> from STREAM to Duplex.
Right now completions are not handled in a thread -- they are handled directly in the context of the CQ event interrupt handler. And splitting the CQ doesn't change the fact that there's only one interrupt handler, which will only be running on one CPU at a time, so I don't think this explanation holds up. Changing things to run from thread context might be an improvement, although the overhead of scheduling a thread might be a loss. But I could at least believe the explanation. But I think we're still in the dark about the patch here, since that's not what it does. What would a good netperf command line be if I wanted to reproduce your benchmark? I have 4-way and 8-way machines available, so that's no problem. - R. From jlentini at netapp.com Tue Apr 18 14:04:48 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 18 Apr 2006 17:04:48 -0400 (EDT) Subject: [openib-general] Re: [uDAPL] dat.conf generator In-Reply-To: <200604180912.47016.dotanb@mellanox.co.il> References: <200604121122.48646.dotanb@mellanox.co.il> <200604161042.25045.dotanb@mellanox.co.il> <200604180912.47016.dotanb@mellanox.co.il> Message-ID: On Tue, 18 Apr 2006, Dotan Barak wrote: > On Monday 17 April 2006 23:46, James Lentini wrote: > > > > On Sun, 16 Apr 2006, Dotan Barak wrote: > > > > > On Wednesday 12 April 2006 17:50, James Lentini wrote: > > > > > OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" "" > > > > > OpenIB-cma0-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" "" > > > > > OpenIB-cma0-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 2" "" > > > > > OpenIB-cma1-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 1" "" > > > > > OpenIB-cma1-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 2" "" > > > > > OpenIB-cma-netdev0 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" > > > > > OpenIB-cma-netdev1 u1.2 nonthreadsafe default 
/usr/lib/libdaplcma.so mv_dapl.1.2 "ib1 0" "" > > > > > OpenIB-cma-netdev2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib2 0" "" > > > > > OpenIB-cma-netdev3 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib3 0" "" > > > > > OpenIB-scm u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" "" > > > > > OpenIB-scm0-1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" "" > > > > > OpenIB-scm0-2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 2" "" > > > > > OpenIB-scm1-1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca1 1" "" > > > > > OpenIB-scm1-2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca1 2" "" > > > > > OpenIB-scm-netdev0 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib0 0" "" > > > > > OpenIB-scm-netdev1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib1 0" "" > > > > > OpenIB-scm-netdev2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib2 0" "" > > > > > OpenIB-scm-netdev3 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "ib3 0" "" > > > > > > > Why did you include the OpenIB-Z-netdevX entries? Why would a user > > prefer netdevX over ethY? > > as i told you before, i tried to create the configuration file automatically. > > i created the OpenIB-Z-netdevX entries for flexibility: for example > you have 2 HCAs in the host, and you want to use a specific one. > from other entries you can not be sure which specific HCA is being > used. How does netdevX mapped to an interface (e.g. ib0)? Is this information in sysfs somewhere? > I belive that you have much more experience than i have in using > dapl, so i need your help with this issue: which entries do you > think that the dat.conf should have? The OpenIB-cma, OpenIB-cmaX-Y, OpenIB-scm, and OpenIB-scmA-B entries are all useful. I'm still confused about what the *netdev* entries mean. 
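The per-HCA, per-port entries discussed above follow a fixed pattern, so a generator could be sketched roughly as below. This is only an illustration in userspace C, not the script Dotan wrote: the function name `format_dat_entry` is hypothetical, and the `mthca` device prefix, the `/usr/lib` library path and the `u1.2` / `mv_dapl.1.2` version fields are simply copied from the sample entries in this thread and would differ on other installations.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: writes one dat.conf line of the form used in the
 * thread above, e.g.
 *   OpenIB-cma0-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" ""
 * provider must be "cma" or "scm"; hca is the HCA index, port the port number.
 * Returns the line length, or -1 on bad input or if buf is too small.
 */
int format_dat_entry(char *buf, size_t len, const char *provider,
                     int hca, int port)
{
    int n;

    assert(buf != NULL && provider != NULL);

    /* Only the two providers that appear in the thread are accepted. */
    if (strcmp(provider, "cma") != 0 && strcmp(provider, "scm") != 0)
        return -1;
    if (hca < 0 || port < 1)
        return -1;

    n = snprintf(buf, len,
                 "OpenIB-%s%d-%d u1.2 nonthreadsafe default "
                 "/usr/lib/libdapl%s.so mv_dapl.1.2 \"mthca%d %d\" \"\"",
                 provider, hca, port, provider, hca, port);
    if (n < 0 || (size_t) n >= len)
        return -1;      /* output error or truncated line */
    return n;
}
```

Calling it once per (provider, HCA, port) combination would reproduce the OpenIB-cmaX-Y and OpenIB-scmA-B entries; the netdev entries are left out because their meaning is still the open question here.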
From xma at us.ibm.com Tue Apr 18 14:25:39 2006 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 18 Apr 2006 14:25:39 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: Message-ID: Bernie, Bernard King-Smith/Poughkeepsie/IBM wrote on 04/18/2006 01:48:28 PM: > When we ran with the split completion queue patch, we no longer see one > process pegging the CPU at 100% and we get a speedup of 65% going > from STREAM to Duplex. Without the split completion queue, we only > saw a 15% speedup going from STREAM to Duplex. This is another patch to gain huge performance on ehca driver. I haven't submitted yet. :-) Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Tue Apr 18 14:33:55 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 18 Apr 2006 14:33:55 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: (Shirley Ma's message of "Tue, 18 Apr 2006 14:25:39 -0700") References: Message-ID: Shirley> This is another patch to gain huge performance on ehca Shirley> driver. I haven't submitted yet. :-) What does the patch do? - R. From xma at us.ibm.com Tue Apr 18 14:40:20 2006 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 18 Apr 2006 14:40:20 -0700 Subject: [openib-general] [PATCH] splitting IPoIB CQ In-Reply-To: Message-ID: Roland Dreier wrote on 04/18/2006 01:45:17 PM: > I'm a little uncomfortable splitting the IPoIB CQ until we have a > plausible theory as to why it improves performance. It may be that > understanding this would lead us to fix the underlying issue in a > better way and get even more performance. Fair enough. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From xma at us.ibm.com Tue Apr 18 14:45:29 2006 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 18 Apr 2006 14:45:29 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: Message-ID: Roland Dreier wrote on 04/18/2006 02:33:55 PM: > Shirley> This is another patch to gain huge performance on ehca > Shirley> driver. I haven't submitted yet. :-) > > What does the patch do? > > - R. The patch allows you tuning send/recv NUM_WC per poll and add some cycles before polling to sync with the hardware. This is not a mature patch. But I have found it dramatically improve the performance and reduce N times of the interrupts on ehca driver. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Tue Apr 18 14:49:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 18 Apr 2006 14:49:34 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: (Shirley Ma's message of "Tue, 18 Apr 2006 14:45:29 -0700") References: Message-ID: Shirley> The patch allows you tuning send/recv NUM_WC per poll and Shirley> add some cycles before polling to sync with the hardware. I have no problem increasing NUM_WC to something much bigger. What do you mean by "add some cycles before polling"? - R. From xma at us.ibm.com Tue Apr 18 14:58:20 2006 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 18 Apr 2006 14:58:20 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: Message-ID: Roland Dreier wrote on 04/18/2006 02:49:34 PM: > Shirley> The patch allows you tuning send/recv NUM_WC per poll and > Shirley> add some cycles before polling to sync with the hardware. > > I have no problem increasing NUM_WC to something much bigger. What do > you mean by "add some cycles before polling"? > > - R. 
After completion handler receives the notification, don't poll the CQ right away, and wait for more WIKIs in CQ. That way can reduce the CQ lock overhead. I found that there is some problem to increase the NUM_WC on recv. It hurts the performance lots. Do you have any clue? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean.hefty at intel.com Tue Apr 18 15:00:16 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 18 Apr 2006 15:00:16 -0700 Subject: [openib-general] [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address Message-ID: Assign/reserve a port number when binding a cm_id. If no port number is given, assign one from the local port space. If a port number is given, reserve it. The RDMA port space is separate from that used for TCP. iWarp devices will need to coordinate between the port values assigned by the rdma_cm and those in use by TCP. SDP also has its own port space. Signed-off-by: Sean Hefty --- Index: cma.c =================================================================== --- cma.c (revision 6479) +++ cma.c (working copy) @@ -33,6 +33,9 @@ #include #include #include +#include +#include +#include #include #include #include @@ -58,6 +61,8 @@ static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DEFINE_MUTEX(lock); static struct workqueue_struct *cma_wq; +static DEFINE_IDR(sdp_ps); +static DEFINE_IDR(tcp_ps); struct cma_device { struct list_head list; @@ -81,6 +86,12 @@ enum cma_state { CMA_DESTROYING }; +struct rdma_bind_list { + struct idr *ps; + struct hlist_head owners; + unsigned short port; +}; + /* * Device removal can occur at anytime, so we need extra handling to * serialize notifying the user of device removal with other callbacks. 
@@ -90,6 +101,8 @@ enum cma_state { struct rdma_id_private { struct rdma_cm_id id; + struct rdma_bind_list *bind_list; + struct hlist_node node; struct list_head list; struct list_head listen_list; struct cma_device *cma_dev; @@ -460,6 +473,11 @@ static inline int cma_any_addr(struct so return cma_zero_addr(addr) || cma_loopback_addr(addr); } +static inline int cma_any_port(struct sockaddr *addr) +{ + return !((struct sockaddr_in *) addr)->sin_port; +} + static int cma_get_net_info(void *hdr, enum rdma_port_space ps, u8 *ip_ver, __u16 *port, union cma_ip_addr **src, union cma_ip_addr **dst) @@ -625,6 +643,22 @@ static void cma_cancel_operation(struct } } +static void cma_release_port(struct rdma_id_private *id_priv) +{ + struct rdma_bind_list *bind_list = id_priv->bind_list; + + if (!bind_list) + return; + + mutex_lock(&lock); + hlist_del(&id_priv->node); + if (hlist_empty(&bind_list->owners)) { + idr_remove(bind_list->ps, bind_list->port); + kfree(bind_list); + } + mutex_unlock(&lock); +} + void rdma_destroy_id(struct rdma_cm_id *id) { struct rdma_id_private *id_priv; @@ -648,6 +682,7 @@ void rdma_destroy_id(struct rdma_cm_id * mutex_unlock(&lock); } + cma_release_port(id_priv); atomic_dec(&id_priv->refcount); wait_event(id_priv->wait, !atomic_read(&id_priv->refcount)); @@ -918,21 +953,6 @@ static int cma_ib_listen(struct rdma_id_ return ret; } -static int cma_duplicate_listen(struct rdma_id_private *id_priv) -{ - struct rdma_id_private *cur_id_priv; - struct sockaddr_in *cur_addr, *new_addr; - - new_addr = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; - list_for_each_entry(cur_id_priv, &listen_any_list, listen_list) { - cur_addr = (struct sockaddr_in *) - &cur_id_priv->id.route.addr.src_addr; - if (cur_addr->sin_port == new_addr->sin_port) - return -EADDRINUSE; - } - return 0; -} - static int cma_listen_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) { @@ -955,9 +975,10 @@ static void cma_listen_on_dev(struct rdm return; dev_id_priv = 
container_of(id, struct rdma_id_private, id); - ret = rdma_bind_addr(id, &id_priv->id.route.addr.src_addr); - if (ret) - goto err; + + dev_id_priv->state = CMA_ADDR_BOUND; + memcpy(&id->route.addr.src_addr, &id_priv->id.route.addr.src_addr, + ip_addr_size(&id_priv->id.route.addr.src_addr)); cma_attach_to_dev(dev_id_priv, cma_dev); list_add_tail(&dev_id_priv->listen_list, &id_priv->listen_list); @@ -971,22 +992,15 @@ err: cma_destroy_listen(dev_id_priv); } -static int cma_listen_on_all(struct rdma_id_private *id_priv) +static void cma_listen_on_all(struct rdma_id_private *id_priv) { struct cma_device *cma_dev; - int ret; mutex_lock(&lock); - ret = cma_duplicate_listen(id_priv); - if (ret) - goto out; - list_add_tail(&id_priv->list, &listen_any_list); list_for_each_entry(cma_dev, &dev_list, list) cma_listen_on_dev(id_priv, cma_dev); -out: mutex_unlock(&lock); - return ret; } int rdma_listen(struct rdma_cm_id *id, int backlog) @@ -1002,16 +1016,15 @@ int rdma_listen(struct rdma_cm_id *id, i switch (rdma_node_get_transport(id->device->node_type)) { case RDMA_TRANSPORT_IB: ret = cma_ib_listen(id_priv); + if (ret) + goto err; break; default: ret = -ENOSYS; - break; + goto err; } } else - ret = cma_listen_on_all(id_priv); - - if (ret) - goto err; + cma_listen_on_all(id_priv); id_priv->backlog = backlog; return 0; @@ -1310,32 +1323,135 @@ err: } EXPORT_SYMBOL(rdma_resolve_addr); +static void cma_bind_port(struct rdma_bind_list *bind_list, + struct rdma_id_private *id_priv) +{ + struct sockaddr_in *sin; + + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + sin->sin_port = htons(bind_list->port); + id_priv->bind_list = bind_list; + hlist_add_head(&id_priv->node, &bind_list->owners); +} + +static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv, + unsigned short snum) +{ + struct rdma_bind_list *bind_list; + int port, start, ret; + + bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL); + if (!bind_list) + return -ENOMEM; + + start = snum ? 
snum : sysctl_local_port_range[0]; + + do { + ret = idr_get_new_above(ps, bind_list, start, &port); + } while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL)); + + if (ret) + goto err; + + if ((snum && port != snum) || + (!snum && port > sysctl_local_port_range[1])) { + idr_remove(ps, port); + ret = -EADDRNOTAVAIL; + goto err; + } + + bind_list->ps = ps; + bind_list->port = (unsigned short) port; + cma_bind_port(bind_list, id_priv); + return 0; +err: + kfree(bind_list); + return ret; +} + +static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv) +{ + struct rdma_id_private *cur_id; + struct sockaddr_in *sin, *cur_sin; + struct rdma_bind_list *bind_list; + struct hlist_node *node; + + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + bind_list = idr_find(ps, ntohs(sin->sin_port)); + if (!bind_list) + return cma_alloc_port(ps, id_priv, ntohs(sin->sin_port)); + + /* + * We don't support binding to any address if anyone is bound to + * a specific address on the same port. 
+ */ + if (cma_any_addr(&id_priv->id.route.addr.src_addr)) + return -EADDRNOTAVAIL; + + hlist_for_each_entry(cur_id, node, &bind_list->owners, node) { + if (cma_any_addr(&cur_id->id.route.addr.src_addr)) + return -EADDRNOTAVAIL; + + cur_sin = (struct sockaddr_in *) &cur_id->id.route.addr.src_addr; + if (sin->sin_addr.s_addr == cur_sin->sin_addr.s_addr) + return -EADDRINUSE; + } + + cma_bind_port(bind_list, id_priv); + return 0; +} + +static int cma_get_port(struct rdma_id_private *id_priv) +{ + struct idr *ps; + int ret; + + switch (id_priv->id.ps) { + case RDMA_PS_SDP: + ps = &sdp_ps; + break; + case RDMA_PS_TCP: + ps = &tcp_ps; + break; + default: + return -EPROTONOSUPPORT; + } + + mutex_lock(&lock); + if (cma_any_port(&id_priv->id.route.addr.src_addr)) + ret = cma_alloc_port(ps, id_priv, 0); + else + ret = cma_use_port(ps, id_priv); + mutex_unlock(&lock); + + return ret; +} + int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) { struct rdma_id_private *id_priv; - struct rdma_dev_addr *dev_addr; int ret; if (addr->sa_family != AF_INET) - return -EINVAL; + return -EAFNOSUPPORT; id_priv = container_of(id, struct rdma_id_private, id); if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_BOUND)) return -EINVAL; - if (cma_any_addr(addr)) - ret = 0; - else { - dev_addr = &id->route.addr.dev_addr; - ret = rdma_translate_ip(addr, dev_addr); + if (!cma_any_addr(addr)) { + ret = rdma_translate_ip(addr, &id->route.addr.dev_addr); if (!ret) ret = cma_acquire_dev(id_priv); + if (ret) + goto err; } + memcpy(&id->route.addr.src_addr, addr, ip_addr_size(addr)); + ret = cma_get_port(id_priv); if (ret) goto err; - memcpy(&id->route.addr.src_addr, addr, ip_addr_size(addr)); return 0; err: cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_IDLE); @@ -1699,6 +1815,8 @@ static void cma_cleanup(void) { ib_unregister_client(&cma_client); destroy_workqueue(cma_wq); + idr_destroy(&sdp_ps); + idr_destroy(&tcp_ps); } module_init(cma_init); From rdreier at cisco.com Tue Apr 18 15:01:57 2006 
From: rdreier at cisco.com (Roland Dreier) Date: Tue, 18 Apr 2006 15:01:57 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: (Shirley Ma's message of "Tue, 18 Apr 2006 14:58:20 -0700") References: Message-ID: Shirley> After completion handler receives the notification, don't Shirley> poll the CQ right away, and wait for more WIKIs in Shirley> CQ. That way can reduce the CQ lock overhead. That's interesting... it makes sense, and it argues in favor of deferring CQ polling to a kernel thread. Of course this will hurt ping-pong latency. Maybe it's better to just implement NAPI though... Shirley> I found that there is some problem to increase the NUM_WC Shirley> on recv. It hurts the performance lots. Do you have any Shirley> clue? Is this on ehca or mthca? I couldn't explain it on mthca, but on ehca maybe there's a bug in generating CQ events ?? - R. From xma at us.ibm.com Tue Apr 18 15:08:15 2006 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 18 Apr 2006 15:08:15 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: Message-ID: Roland Dreier wrote on 04/18/2006 03:01:57 PM: > Shirley> After completion handler receives the notification, don't > Shirley> poll the CQ right away, and wait for more WIKIs in > Shirley> CQ. That way can reduce the CQ lock overhead. > > That's interesting... it makes sense, and it argues in favor of > deferring CQ polling to a kernel thread. Of course this will hurt > ping-pong latency. Maybe it's better to just implement NAPI though... Let's try difference implementations to see the difference. > Shirley> I found that there is some problem to increase the NUM_WC > Shirley> on recv. It hurts the performance lots. Do you have any > Shirley> clue? > > Is this on ehca or mthca? I couldn't explain it on mthca, but on ehca > maybe there's a bug in generating CQ events ?? > > - R. It's on mthca. If you are interested. I can submit a test patch for your experimental. 
Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Tue Apr 18 15:06:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 18 Apr 2006 15:06:33 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: (Roland Dreier's message of "Tue, 18 Apr 2006 15:01:57 -0700") References: Message-ID: Shirley> After completion handler receives the notification, don't Shirley> poll the CQ right away, and wait for more WIKIs in Shirley> CQ. That way can reduce the CQ lock overhead. Roland> That's interesting... it makes sense, and it argues in Roland> favor of deferring CQ polling to a kernel thread. Of Roland> course this will hurt ping-pong latency. Maybe it's Roland> better to just implement NAPI though... And actually it argues against splitting the CQ, because having one CQ increases the number of CQ entries that we have a chance to poll at any one time, by lumping send and receive completions together... - R. From rdreier at cisco.com Tue Apr 18 15:07:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 18 Apr 2006 15:07:06 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: (Shirley Ma's message of "Tue, 18 Apr 2006 15:08:15 -0700") References: Message-ID: Shirley> It's on mthca. If you are interested. I can submit a test Shirley> patch for your experimental. Sure, that would be useful. - R. 
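To make the NUM_WC and CQ lock argument above concrete, here is a small userspace model of a drain loop. It is a sketch under stated assumptions, not the IPoIB code: `model_cq`, `model_poll_cq` and `model_drain` are hypothetical names, the queue is reduced to a counter, and each poll call is assumed to take the CQ lock exactly once and to be repeated until it returns fewer than `num_wc` entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a completion queue: only counts entries and locks. */
struct model_cq {
    size_t pending;     /* completions waiting to be polled */
    size_t lock_count;  /* number of times the CQ lock was taken */
};

/* Models one poll call: one lock acquisition, at most num_wc entries removed. */
static size_t model_poll_cq(struct model_cq *cq, size_t num_wc)
{
    size_t n = cq->pending < num_wc ? cq->pending : num_wc;

    cq->lock_count++;
    cq->pending -= n;
    return n;
}

/*
 * Drains a queue holding `pending` completions in batches of `num_wc`,
 * polling again whenever a full batch came back. Returns how many times
 * the modelled CQ lock was taken.
 */
size_t model_drain(size_t pending, size_t num_wc)
{
    struct model_cq cq = { pending, 0 };
    size_t n;

    assert(num_wc > 0);
    do {
        n = model_poll_cq(&cq, num_wc);
    } while (n == num_wc);

    return cq.lock_count;
}
```

In this model, 64 pending completions cost 17 lock acquisitions with a batch size of 4 and 5 with a batch size of 16, which is the effect being argued for when more entries are allowed to accumulate before polling. The model says nothing about the receive-side slowdown Shirley reports with a larger NUM_WC.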
From xma at us.ibm.com Tue Apr 18 15:14:08 2006 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 18 Apr 2006 15:14:08 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: Message-ID: Roland Dreier wrote on 04/18/2006 03:06:33 PM: > And actually it argues against splitting the CQ, because having one CQ > increases the number of CQ entries that we have a chance to poll at > any one time, by lumping send and receive completions together... > > - R. The send needs to obtain tx_lock to handle send wc. Receive doesn't need to. That might explain some of the performance. I have tried to remove tx_lock in the send wc handler by removing tx_ring. It did increase the send speed. I hit an ehca problem, so the test couldn't move on. And splitting the CQ really does benefit the duplex throughput. We are doing more testing now to dig out more info. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From xma at us.ibm.com Tue Apr 18 15:16:55 2006 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 18 Apr 2006 15:16:55 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: Message-ID: Roland Dreier wrote on 04/18/2006 03:07:06 PM: > Shirley> It's on mthca. If you are interested. I can submit a test > Shirley> patch for your experimental. > > Sure, that would be useful. > > - R. It is built on top of the split CQ patch. I will send you the patch tomorrow. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed...
URL: From bunk at stusta.de Tue Apr 18 15:15:35 2006 From: bunk at stusta.de (Adrian Bunk) Date: Wed, 19 Apr 2006 00:15:35 +0200 Subject: [openib-general] [2.6 patch] drivers/infiniband/hw/mthca/mthca_mad.c: make a function static Message-ID: <20060418221535.GX11582@stusta.de> This patch makes the needlessly global mthca_update_rate() static. Signed-off-by: Adrian Bunk --- linux-2.6.17-rc1-mm3-full/drivers/infiniband/hw/mthca/mthca_mad.c.old 2006-04-18 21:38:06.000000000 +0200 +++ linux-2.6.17-rc1-mm3-full/drivers/infiniband/hw/mthca/mthca_mad.c 2006-04-18 21:38:14.000000000 +0200 @@ -49,7 +49,7 @@ MTHCA_VENDOR_CLASS2 = 0xa }; -int mthca_update_rate(struct mthca_dev *dev, u8 port_num) +static int mthca_update_rate(struct mthca_dev *dev, u8 port_num) { struct ib_port_attr *tprops = NULL; int ret; From rdreier at cisco.com Tue Apr 18 15:18:21 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 18 Apr 2006 15:18:21 -0700 Subject: [openib-general] Re: [2.6 patch] drivers/infiniband/hw/mthca/mthca_mad.c: make a function static In-Reply-To: <20060418221535.GX11582@stusta.de> (Adrian Bunk's message of "Wed, 19 Apr 2006 00:15:35 +0200") References: <20060418221535.GX11582@stusta.de> Message-ID: Thanks, applied. Sorry for letting that sneak through... - R. From vuhuong at mellanox.com Tue Apr 18 15:23:35 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Tue, 18 Apr 2006 15:23:35 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com> <44450F89.7020500@mellanox.com> <44451D32.1010106@mellanox.com> Message-ID: <444566E7.8070907@mellanox.com> Roland Dreier wrote: > Roland> Hmm, you may be right -- scsi_eh_bus_device_reset() in > Roland> scsi_error.c does seem to flush all commands. But do you > Roland> see srp_reset_device() being called? I didn't think I saw > Roland> it in your trace. > It was in my trace. 
I probably left it out when I posted the email > And what if you comment out the line > > .eh_device_reset_handler = srp_reset_device, > > does that fix it? No I'll do some more debug Vu From rdreier at cisco.com Tue Apr 18 15:58:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 18 Apr 2006 15:58:45 -0700 Subject: [openib-general] [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: (Sean Hefty's message of "Tue, 18 Apr 2006 15:00:16 -0700") References: Message-ID: Should I add this on top of the CMA patches I have queued for 2.6.18? From mshefty at ichips.intel.com Tue Apr 18 16:02:02 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 18 Apr 2006 16:02:02 -0700 Subject: [openib-general] [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: References: Message-ID: <44456FEA.7020302@ichips.intel.com> Roland Dreier wrote: > Should I add this on top of the CMA patches I have queued for 2.6.18? I'd like to get some feedback on the implementation first, but we probably want to get this functionality in, even if the implementation gets tweaked a little. - Sean From tom at opengridcomputing.com Tue Apr 18 20:42:34 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 18 Apr 2006 22:42:34 -0500 Subject: [openib-general] [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: References: Message-ID: <1145418154.12345.8.camel@bigtime.es335.com> This looks like a great start. One part I didn't understand, however, was where the local port is assigned for children of the listening endpoint?
The local port for these endpoints will be the same as for the listening parent. So if cma_use_port is used to bind these child endpoints (i.e. add them to the owners list), then the logic will need to distinguish between rdma_cm_id attempting to bind as a listener vs. rdma_cm_id binding as connected children. On Tue, 2006-04-18 at 15:00 -0700, Sean Hefty wrote: > Assign/reserve a port number when binding a cm_id. If no port number is > given, assign one from the local port space. If a port number is given, > reserve it. > > The RDMA port space is separate from that used for TCP. iWarp devices > will need to coordinate between the port values assigned by the rdma_cm > and those in use by TCP. SDP also has its own port space. > > Signed-off-by: Sean Hefty > --- > Index: cma.c > =================================================================== > --- cma.c (revision 6479) > +++ cma.c (working copy) > @@ -33,6 +33,9 @@ > #include > #include > #include > +#include > +#include > +#include > #include > #include > #include > @@ -58,6 +61,8 @@ static LIST_HEAD(dev_list); > static LIST_HEAD(listen_any_list); > static DEFINE_MUTEX(lock); > static struct workqueue_struct *cma_wq; > +static DEFINE_IDR(sdp_ps); > +static DEFINE_IDR(tcp_ps); > > struct cma_device { > struct list_head list; > @@ -81,6 +86,12 @@ enum cma_state { > CMA_DESTROYING > }; > > +struct rdma_bind_list { > + struct idr *ps; > + struct hlist_head owners; > + unsigned short port; > +}; > + > /* > * Device removal can occur at anytime, so we need extra handling to > * serialize notifying the user of device removal with other callbacks. 
> @@ -90,6 +101,8 @@ enum cma_state { > struct rdma_id_private { > struct rdma_cm_id id; > > + struct rdma_bind_list *bind_list; > + struct hlist_node node; > struct list_head list; > struct list_head listen_list; > struct cma_device *cma_dev; > @@ -460,6 +473,11 @@ static inline int cma_any_addr(struct so > return cma_zero_addr(addr) || cma_loopback_addr(addr); > } > > +static inline int cma_any_port(struct sockaddr *addr) > +{ > + return !((struct sockaddr_in *) addr)->sin_port; > +} > + > static int cma_get_net_info(void *hdr, enum rdma_port_space ps, > u8 *ip_ver, __u16 *port, > union cma_ip_addr **src, union cma_ip_addr **dst) > @@ -625,6 +643,22 @@ static void cma_cancel_operation(struct > } > } > > +static void cma_release_port(struct rdma_id_private *id_priv) > +{ > + struct rdma_bind_list *bind_list = id_priv->bind_list; > + > + if (!bind_list) > + return; > + > + mutex_lock(&lock); > + hlist_del(&id_priv->node); > + if (hlist_empty(&bind_list->owners)) { > + idr_remove(bind_list->ps, bind_list->port); > + kfree(bind_list); > + } > + mutex_unlock(&lock); > +} > + > void rdma_destroy_id(struct rdma_cm_id *id) > { > struct rdma_id_private *id_priv; > @@ -648,6 +682,7 @@ void rdma_destroy_id(struct rdma_cm_id * > mutex_unlock(&lock); > } > > + cma_release_port(id_priv); > atomic_dec(&id_priv->refcount); > wait_event(id_priv->wait, !atomic_read(&id_priv->refcount)); > > @@ -918,21 +953,6 @@ static int cma_ib_listen(struct rdma_id_ > return ret; > } > > -static int cma_duplicate_listen(struct rdma_id_private *id_priv) > -{ > - struct rdma_id_private *cur_id_priv; > - struct sockaddr_in *cur_addr, *new_addr; > - > - new_addr = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; > - list_for_each_entry(cur_id_priv, &listen_any_list, listen_list) { > - cur_addr = (struct sockaddr_in *) > - &cur_id_priv->id.route.addr.src_addr; > - if (cur_addr->sin_port == new_addr->sin_port) > - return -EADDRINUSE; > - } > - return 0; > -} > - > static int 
cma_listen_handler(struct rdma_cm_id *id, > struct rdma_cm_event *event) > { > @@ -955,9 +975,10 @@ static void cma_listen_on_dev(struct rdm > return; > > dev_id_priv = container_of(id, struct rdma_id_private, id); > - ret = rdma_bind_addr(id, &id_priv->id.route.addr.src_addr); > - if (ret) > - goto err; > + > + dev_id_priv->state = CMA_ADDR_BOUND; > + memcpy(&id->route.addr.src_addr, &id_priv->id.route.addr.src_addr, > + ip_addr_size(&id_priv->id.route.addr.src_addr)); > > cma_attach_to_dev(dev_id_priv, cma_dev); > list_add_tail(&dev_id_priv->listen_list, &id_priv->listen_list); > @@ -971,22 +992,15 @@ err: > cma_destroy_listen(dev_id_priv); > } > > -static int cma_listen_on_all(struct rdma_id_private *id_priv) > +static void cma_listen_on_all(struct rdma_id_private *id_priv) > { > struct cma_device *cma_dev; > - int ret; > > mutex_lock(&lock); > - ret = cma_duplicate_listen(id_priv); > - if (ret) > - goto out; > - > list_add_tail(&id_priv->list, &listen_any_list); > list_for_each_entry(cma_dev, &dev_list, list) > cma_listen_on_dev(id_priv, cma_dev); > -out: > mutex_unlock(&lock); > - return ret; > } > > int rdma_listen(struct rdma_cm_id *id, int backlog) > @@ -1002,16 +1016,15 @@ int rdma_listen(struct rdma_cm_id *id, i > switch (rdma_node_get_transport(id->device->node_type)) { > case RDMA_TRANSPORT_IB: > ret = cma_ib_listen(id_priv); > + if (ret) > + goto err; > break; > default: > ret = -ENOSYS; > - break; > + goto err; > } > } else > - ret = cma_listen_on_all(id_priv); > - > - if (ret) > - goto err; > + cma_listen_on_all(id_priv); > > id_priv->backlog = backlog; > return 0; > @@ -1310,32 +1323,135 @@ err: > } > EXPORT_SYMBOL(rdma_resolve_addr); > > +static void cma_bind_port(struct rdma_bind_list *bind_list, > + struct rdma_id_private *id_priv) > +{ > + struct sockaddr_in *sin; > + > + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; > + sin->sin_port = htons(bind_list->port); > + id_priv->bind_list = bind_list; > + 
hlist_add_head(&id_priv->node, &bind_list->owners); > +} > + > +static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv, > + unsigned short snum) > +{ > + struct rdma_bind_list *bind_list; > + int port, start, ret; > + > + bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL); > + if (!bind_list) > + return -ENOMEM; > + > + start = snum ? snum : sysctl_local_port_range[0]; > + > + do { > + ret = idr_get_new_above(ps, bind_list, start, &port); > + } while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL)); > + > + if (ret) > + goto err; > + > + if ((snum && port != snum) || > + (!snum && port > sysctl_local_port_range[1])) { > + idr_remove(ps, port); > + ret = -EADDRNOTAVAIL; > + goto err; > + } > + > + bind_list->ps = ps; > + bind_list->port = (unsigned short) port; > + cma_bind_port(bind_list, id_priv); > + return 0; > +err: > + kfree(bind_list); > + return ret; > +} > + > +static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv) > +{ > + struct rdma_id_private *cur_id; > + struct sockaddr_in *sin, *cur_sin; > + struct rdma_bind_list *bind_list; > + struct hlist_node *node; > + > + sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; > + bind_list = idr_find(ps, ntohs(sin->sin_port)); > + if (!bind_list) > + return cma_alloc_port(ps, id_priv, ntohs(sin->sin_port)); > + > + /* > + * We don't support binding to any address if anyone is bound to > + * a specific address on the same port. 
> + */ > + if (cma_any_addr(&id_priv->id.route.addr.src_addr)) > + return -EADDRNOTAVAIL; > + > + hlist_for_each_entry(cur_id, node, &bind_list->owners, node) { > + if (cma_any_addr(&cur_id->id.route.addr.src_addr)) > + return -EADDRNOTAVAIL; > + > + cur_sin = (struct sockaddr_in *) &cur_id->id.route.addr.src_addr; > + if (sin->sin_addr.s_addr == cur_sin->sin_addr.s_addr) > + return -EADDRINUSE; > + } > + > + cma_bind_port(bind_list, id_priv); > + return 0; > +} > + > +static int cma_get_port(struct rdma_id_private *id_priv) > +{ > + struct idr *ps; > + int ret; > + > + switch (id_priv->id.ps) { > + case RDMA_PS_SDP: > + ps = &sdp_ps; > + break; > + case RDMA_PS_TCP: > + ps = &tcp_ps; > + break; > + default: > + return -EPROTONOSUPPORT; > + } > + > + mutex_lock(&lock); > + if (cma_any_port(&id_priv->id.route.addr.src_addr)) > + ret = cma_alloc_port(ps, id_priv, 0); > + else > + ret = cma_use_port(ps, id_priv); > + mutex_unlock(&lock); > + > + return ret; > +} > + > int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) > { > struct rdma_id_private *id_priv; > - struct rdma_dev_addr *dev_addr; > int ret; > > if (addr->sa_family != AF_INET) > - return -EINVAL; > + return -EAFNOSUPPORT; > > id_priv = container_of(id, struct rdma_id_private, id); > if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_BOUND)) > return -EINVAL; > > - if (cma_any_addr(addr)) > - ret = 0; > - else { > - dev_addr = &id->route.addr.dev_addr; > - ret = rdma_translate_ip(addr, dev_addr); > + if (!cma_any_addr(addr)) { > + ret = rdma_translate_ip(addr, &id->route.addr.dev_addr); > if (!ret) > ret = cma_acquire_dev(id_priv); > + if (ret) > + goto err; > } > > + memcpy(&id->route.addr.src_addr, addr, ip_addr_size(addr)); > + ret = cma_get_port(id_priv); > if (ret) > goto err; > > - memcpy(&id->route.addr.src_addr, addr, ip_addr_size(addr)); > return 0; > err: > cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_IDLE); > @@ -1699,6 +1815,8 @@ static void cma_cleanup(void) > { > 
ib_unregister_client(&cma_client); > destroy_workqueue(cma_wq); > + idr_destroy(&sdp_ps); > + idr_destroy(&tcp_ps); > } > > module_init(cma_init); > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From wombat2 at us.ibm.com Wed Apr 19 07:10:36 2006 From: wombat2 at us.ibm.com (Bernard King-Smith) Date: Wed, 19 Apr 2006 10:10:36 -0400 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: Message-ID: Shirley> After completion handler receives the notification, don't Shirley> poll the CQ right away, and wait for more WIKIs in Shirley> CQ. That way can reduce the CQ lock overhead. Roland> That's interesting... it makes sense, and it argues in Roland> favor of deferring CQ polling to a kernel thread. Of Roland> course this will hurt ping-pong latency. Maybe it's Roland> better to just implement NAPI though... Roland> And actually it argues against splitting the CQ, because having one CQ Roland> increases the number of CQ entries that we have a chance to poll at Roland> any one time, by lumping send and receive completions together... The assumption you have here is that one CPU is capable of handling the completions without impacting bandwidth. We have seen the opposite in that we end up with one CPU pegged at high throughput. The benefit you are working on is latency will be faster if we handle both send and receive processing off the same thread/interrupt, but you have to balance that with bandwidth limitations. You think 4X has a bandwdith problem using IPoIB, wait till 12X comes out. What per CPU utilization do you see on mthca on a multiple CPU machine running peak bandwidth? Roland> - R. Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 
293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." William Shatner From rdreier at cisco.com Wed Apr 19 07:29:37 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Apr 2006 07:29:37 -0700 Subject: [openib-general][patch review] srp: fmr implementation, References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com> <44450F89.7020500@mellanox.com> <44451D32.1010106@mellanox.com> <444566E7.8070907@mellanox.com> Message-ID: > > And what if you comment out the line > > .eh_device_reset_handler = srp_reset_device, > > does that fix it? > No Now I'm really confused. It seems we lose the connection to the target (BTW -- do you know why the connection is getting killed)? So the SCSI midlayer times out commands and tries to abort them. But we have no connection so the abort fails. The SCSI command shouldn't get freed now (at least if I'm understanding scsi_error.c correctly). Then we have no .eh_device_reset_handler so everything should fall through to calling our .eh_host_reset_handler without freeing any SCSI commands. And then we crash on a use-after-free of a SCSI command. So where is that command getting freed on us?? - R. From xma at us.ibm.com Wed Apr 19 08:18:30 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 19 Apr 2006 09:18:30 -0600 Subject: [openib-general] [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: Message-ID: Roland, Here is the patch. This patch: 1. separates the CQ into a send CQ and a recv CQ 2. increases both send/recv poll NUM_WC from 4 to 32 3. adds ____cacheline_aligned_in_smp to tx_ring, rx_ring and send_ibwc, recv_ibwc 4. adds a tunable poll interval for both send and recv. Example command line: modprobe ib_ipoib send_poll_interval=80 recv_poll_interval=10 The patch is also attached as a file for you to download.
Any problems let me know. Signed-off-by: Shirley Ma diff -urN infiniband/ulp/ipoib/ipoib.h infiniband-cq/ulp/ipoib/ipoib.h --- infiniband/ulp/ipoib/ipoib.h 2006-04-05 17:43:18.000000000 -0700 +++ infiniband-cq/ulp/ipoib/ipoib.h 2006-04-19 08:40:42.030284464 -0700 @@ -71,7 +71,8 @@ IPOIB_MAX_QUEUE_SIZE = 8192, IPOIB_MIN_QUEUE_SIZE = 2, - IPOIB_NUM_WC = 4, + IPOIB_NUM_SEND_WC = 32, + IPOIB_NUM_RECV_WC = 32, IPOIB_MAX_PATH_REC_QUEUE = 3, IPOIB_MAX_MCAST_QUEUE = 3, @@ -151,7 +152,8 @@ u16 pkey; struct ib_pd *pd; struct ib_mr *mr; - struct ib_cq *cq; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; struct ib_qp *qp; u32 qkey; @@ -162,16 +164,17 @@ unsigned int admin_mtu; unsigned int mcast_mtu; - struct ipoib_rx_buf *rx_ring; + struct ipoib_rx_buf *rx_ring ____cacheline_aligned_in_smp; spinlock_t tx_lock; - struct ipoib_tx_buf *tx_ring; + struct ipoib_tx_buf *tx_ring ____cacheline_aligned_in_smp; unsigned tx_head; unsigned tx_tail; struct ib_sge tx_sge; struct ib_send_wr tx_wr; - struct ib_wc ibwc[IPOIB_NUM_WC]; + struct ib_wc send_ibwc[IPOIB_NUM_SEND_WC] ____cacheline_aligned_in_smp; + struct ib_wc recv_ibwc[IPOIB_NUM_RECV_WC] ____cacheline_aligned_in_smp; struct list_head dead_ahs; @@ -243,9 +246,13 @@ extern struct workqueue_struct *ipoib_workqueue; +extern int ipoib_send_poll_interval; +extern int ipoib_recv_poll_interval; + /* functions */ -void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); +void ipoib_ib_send_completion(struct ib_cq *cq, void *dev_ptr); +void ipoib_ib_recv_completion(struct ib_cq *cq, void *dev_ptr); struct ipoib_ah *ipoib_create_ah(struct net_device *dev, struct ib_pd *pd, struct ib_ah_attr *attr); diff -urN infiniband/ulp/ipoib/ipoib_ib.c infiniband-cq/ulp/ipoib/ipoib_ib.c --- infiniband/ulp/ipoib/ipoib_ib.c 2006-04-05 17:43:18.000000000 -0700 +++ infiniband-cq/ulp/ipoib/ipoib_ib.c 2006-04-19 08:56:40.395590792 -0700 @@ -50,7 +50,6 @@ "Enable data path debug tracing if > 0"); #endif -#define IPOIB_OP_RECV (1ul << 31) static 
DEFINE_MUTEX(pkey_mutex); @@ -108,7 +107,7 @@ list.lkey = priv->mr->lkey; param.next = NULL; - param.wr_id = id | IPOIB_OP_RECV; + param.wr_id = id; param.sg_list = &list; param.num_sge = 1; @@ -175,8 +174,8 @@ return 0; } -static void ipoib_ib_handle_wc(struct net_device *dev, - struct ib_wc *wc) +static void ipoib_ib_handle_recv_wc(struct net_device *dev, + struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned int wr_id = wc->wr_id; @@ -184,121 +183,142 @@ ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", wr_id, wc->opcode, wc->status); - if (wr_id & IPOIB_OP_RECV) { - wr_id &= ~IPOIB_OP_RECV; - - if (wr_id < ipoib_recvq_size) { - struct sk_buff *skb = priv->rx_ring[wr_id].skb; - dma_addr_t addr = priv->rx_ring[wr_id].mapping; - - if (unlikely(wc->status != IB_WC_SUCCESS)) { - if (wc->status != IB_WC_WR_FLUSH_ERR) - ipoib_warn(priv, "failed recv event " - "(status=%d, wrid=%d vend_err %x)\n", - wc->status, wr_id, wc->vendor_err); - dma_unmap_single(priv->ca->dma_device, addr, - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); - dev_kfree_skb_any(skb); - priv->rx_ring[wr_id].skb = NULL; - return; - } - - /* - * If we can't allocate a new RX buffer, dump - * this packet and reuse the old buffer. 
- */ - if (unlikely(ipoib_alloc_rx_skb(dev, wr_id))) { - ++priv->stats.rx_dropped; - goto repost; - } - - ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", - wc->byte_len, wc->slid); + if (wr_id < ipoib_recvq_size) { + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + dma_addr_t addr = priv->rx_ring[wr_id].mapping; + + if (unlikely(wc->status != IB_WC_SUCCESS)) { + if (wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed recv event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); dma_unmap_single(priv->ca->dma_device, addr, IPOIB_BUF_SIZE, DMA_FROM_DEVICE); + dev_kfree_skb_any(skb); + priv->rx_ring[wr_id].skb = NULL; + return; + } - skb_put(skb, wc->byte_len); - skb_pull(skb, IB_GRH_BYTES); + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. + */ + if (unlikely(ipoib_alloc_rx_skb(dev, wr_id))) { + ++priv->stats.rx_dropped; + goto repost; + } - if (wc->slid != priv->local_lid || - wc->src_qp != priv->qp->qp_num) { - skb->protocol = ((struct ipoib_header *) skb->data)->proto; - skb->mac.raw = skb->data; - skb_pull(skb, IPOIB_ENCAP_LEN); - - dev->last_rx = jiffies; - ++priv->stats.rx_packets; - priv->stats.rx_bytes += skb->len; - - skb->dev = dev; - /* XXX get correct PACKET_ type here */ - skb->pkt_type = PACKET_HOST; - netif_rx_ni(skb); - } else { - ipoib_dbg_data(priv, "dropping loopback packet\n"); - dev_kfree_skb_any(skb); - } + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); - repost: - if (unlikely(ipoib_ib_post_receive(dev, wr_id))) - ipoib_warn(priv, "ipoib_ib_post_receive failed " - "for buf %d\n", wr_id); - } else - ipoib_warn(priv, "completion event with wrid %d\n", - wr_id); + dma_unmap_single(priv->ca->dma_device, addr, + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); - } else { - struct ipoib_tx_buf *tx_req; - unsigned long flags; + skb_put(skb, wc->byte_len); + skb_pull(skb, IB_GRH_BYTES); - if (wr_id >= ipoib_sendq_size) { - ipoib_warn(priv, 
"completion event with wrid %d (> %d)\n", - wr_id, ipoib_sendq_size); - return; + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb->mac.raw = skb->data; + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); } - ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); + repost: + if (unlikely(ipoib_ib_post_receive(dev, wr_id))) + ipoib_warn(priv, "ipoib_ib_post_receive failed " + "for buf %d\n", wr_id); + } else + ipoib_warn(priv, "completion event with wrid %d\n", + wr_id); +} + +static void ipoib_ib_handle_send_wc(struct net_device *dev, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + struct ipoib_tx_buf *tx_req; + unsigned long flags; + + ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); - tx_req = &priv->tx_ring[wr_id]; + if (wr_id >= ipoib_sendq_size) { + ipoib_warn(priv, "completion event with wrid %d (> %d)\n", + wr_id, ipoib_sendq_size); + return; + } - dma_unmap_single(priv->ca->dma_device, - pci_unmap_addr(tx_req, mapping), - tx_req->skb->len, - DMA_TO_DEVICE); + ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); - ++priv->stats.tx_packets; - priv->stats.tx_bytes += tx_req->skb->len; + tx_req = &priv->tx_ring[wr_id]; - dev_kfree_skb_any(tx_req->skb); + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); - spin_lock_irqsave(&priv->tx_lock, flags); - ++priv->tx_tail; - if (netif_queue_stopped(dev) && - priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) - netif_wake_queue(dev); - spin_unlock_irqrestore(&priv->tx_lock, flags); + 
++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; - if (wc->status != IB_WC_SUCCESS && - wc->status != IB_WC_WR_FLUSH_ERR) - ipoib_warn(priv, "failed send event " - "(status=%d, wrid=%d vend_err %x)\n", - wc->status, wr_id, wc->vendor_err); - } + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->tx_tail; + if (netif_queue_stopped(dev) && + priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) + netif_wake_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); +} + +void ipoib_ib_send_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + udelay(ipoib_send_poll_interval); + do { + n = ib_poll_cq(cq, IPOIB_NUM_SEND_WC, priv->send_ibwc); + for (i = 0; i < n; ++i) + ipoib_ib_handle_send_wc(dev, priv->send_ibwc + i); + } while (n == IPOIB_NUM_SEND_WC); } -void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) +void ipoib_ib_recv_completion(struct ib_cq *cq, void *dev_ptr) { struct net_device *dev = (struct net_device *) dev_ptr; struct ipoib_dev_priv *priv = netdev_priv(dev); int n, i; ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + udelay(ipoib_recv_poll_interval); do { - n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); + n = ib_poll_cq(cq, IPOIB_NUM_RECV_WC, priv->recv_ibwc); for (i = 0; i < n; ++i) - ipoib_ib_handle_wc(dev, priv->ibwc + i); - } while (n == IPOIB_NUM_WC); + ipoib_ib_handle_recv_wc(dev, priv->recv_ibwc + i); + } while (n == IPOIB_NUM_RECV_WC); } static inline int post_send(struct ipoib_dev_priv *priv, diff -urN infiniband/ulp/ipoib/ipoib_main.c infiniband-cq/ulp/ipoib/ipoib_main.c --- infiniband/ulp/ipoib/ipoib_main.c 2006-04-12 16:43:38.000000000 
-0700 +++ infiniband-cq/ulp/ipoib/ipoib_main.c 2006-04-19 08:44:27.192054672 -0700 @@ -56,12 +56,17 @@ int ipoib_sendq_size __read_mostly = IPOIB_TX_RING_SIZE; int ipoib_recvq_size __read_mostly = IPOIB_RX_RING_SIZE; +int ipoib_send_poll_interval __read_mostly = 0; +int ipoib_recv_poll_interval __read_mostly = 0; module_param_named(send_queue_size, ipoib_sendq_size, int, 0444); MODULE_PARM_DESC(send_queue_size, "Number of descriptors in send queue"); module_param_named(recv_queue_size, ipoib_recvq_size, int, 0444); MODULE_PARM_DESC(recv_queue_size, "Number of descriptors in receive queue"); +module_param_named(send_poll_interval, ipoib_send_poll_interval, int, 0444); +module_param_named(recv_poll_interval, ipoib_recv_poll_interval, int, 0444); + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG int ipoib_debug_level; @@ -895,7 +900,7 @@ kfree(priv->rx_ring); kfree(priv->tx_ring); - + priv->rx_ring = NULL; priv->tx_ring = NULL; } diff -urN infiniband/ulp/ipoib/ipoib_verbs.c infiniband-cq/ulp/ipoib/ipoib_verbs.c --- infiniband/ulp/ipoib/ipoib_verbs.c 2006-04-05 17:43:18.000000000 -0700 +++ infiniband-cq/ulp/ipoib/ipoib_verbs.c 2006-04-12 19:14:41.000000000 -0700 @@ -174,24 +174,35 @@ return -ENODEV; } - priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, - ipoib_sendq_size + ipoib_recvq_size + 1); - if (IS_ERR(priv->cq)) { - printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); + priv->send_cq = ib_create_cq(priv->ca, ipoib_ib_send_completion, NULL, dev, + ipoib_sendq_size + 1); + if (IS_ERR(priv->send_cq)) { + printk(KERN_WARNING "%s: failed to create send CQ\n", ca->name); goto out_free_pd; } - if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP)) - goto out_free_cq; + if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP)) + goto out_free_send_cq; + + + priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_recv_completion, NULL, dev, + ipoib_recvq_size + 1); + if (IS_ERR(priv->recv_cq)) { + printk(KERN_WARNING "%s: failed to create recv CQ\n", ca->name); + goto 
out_free_send_cq; + } + + if (ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP)) + goto out_free_recv_cq; priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); if (IS_ERR(priv->mr)) { printk(KERN_WARNING "%s: ib_get_dma_mr failed\n", ca->name); - goto out_free_cq; + goto out_free_recv_cq; } - init_attr.send_cq = priv->cq; - init_attr.recv_cq = priv->cq, + init_attr.send_cq = priv->send_cq; + init_attr.recv_cq = priv->recv_cq, priv->qp = ib_create_qp(priv->pd, &init_attr); if (IS_ERR(priv->qp)) { @@ -215,8 +226,11 @@ out_free_mr: ib_dereg_mr(priv->mr); -out_free_cq: - ib_destroy_cq(priv->cq); +out_free_recv_cq: + ib_destroy_cq(priv->recv_cq); + +out_free_send_cq: + ib_destroy_cq(priv->send_cq); out_free_pd: ib_dealloc_pd(priv->pd); @@ -238,7 +252,10 @@ if (ib_dereg_mr(priv->mr)) ipoib_warn(priv, "ib_dereg_mr failed\n"); - if (ib_destroy_cq(priv->cq)) + if (ib_destroy_cq(priv->send_cq)) + ipoib_warn(priv, "ib_cq_destroy failed\n"); + + if (ib_destroy_cq(priv->recv_cq)) ipoib_warn(priv, "ib_cq_destroy failed\n"); if (ib_dealloc_pd(priv->pd)) Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: cq.tune.patch Type: application/octet-stream Size: 13025 bytes Desc: not available URL: From rdreier at cisco.com Wed Apr 19 09:03:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Apr 2006 09:03:09 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: (Bernard King-Smith's message of "Wed, 19 Apr 2006 10:10:36 -0400") References: Message-ID: Bernard> The assumption you have here is that one CPU is capable Bernard> of handling the completions without impacting Bernard> bandwidth. We have seen the opposite in that we end up Bernard> with one CPU pegged at high throughput. 
The benefit you Bernard> are working on is latency will be faster if we handle Bernard> both send and receive processing off the same Bernard> thread/interrupt, but you have to balance that with Bernard> bandwidth limitations. You think 4X has a bandwdith Bernard> problem using IPoIB, wait till 12X comes out. I still don't understand why splitting the CQ allows you to use more than one CPU to handle completions. Both CQ events get handled on the same CPU -- you just have more overhead in getting to the CQ event handlers if there are two of them. Also, why is 12X any worse? With current hardware at least the 4X link is not the bottleneck anyway. Bernard> What per CPU utilization do you see on mthca on a Bernard> multiple CPU machine running peak bandwidth? I've never really measured it. It's especially tough to account for interrupt handler time. - R. From rdreier at cisco.com Wed Apr 19 09:05:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Apr 2006 09:05:34 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: (Shirley Ma's message of "Wed, 19 Apr 2006 09:18:30 -0600") References: Message-ID: > - struct ipoib_rx_buf *rx_ring; > + struct ipoib_rx_buf *rx_ring ____cacheline_aligned_in_smp; > > spinlock_t tx_lock; > - struct ipoib_tx_buf *tx_ring; > + struct ipoib_tx_buf *tx_ring ____cacheline_aligned_in_smp; > unsigned tx_head; > unsigned tx_tail; > struct ib_sge tx_sge; > struct ib_send_wr tx_wr; > > - struct ib_wc ibwc[IPOIB_NUM_WC]; > + struct ib_wc send_ibwc[IPOIB_NUM_SEND_WC] ____cacheline_aligned_in_smp; > + struct ib_wc recv_ibwc[IPOIB_NUM_RECV_WC] ____cacheline_aligned_in_smp; This doesn't look right. It puts tx_lock in the same cacheline as rx_ring, and then puts send_ibwc and recv_ibwc in completely different cachelines. Wouldn't it make more sense to sort the rx and tx stuff so that they're each in as few non-shared cachelines as possible? - R. 
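A minimal userspace sketch of the grouping suggested above; it is not code from the patch or from IPoIB. All demo_* names, the field sizes, and the 64-byte line size are assumptions made for illustration. The idea shown is to collect the receive-side fields in one aligned group and the send-side fields (including the lock) in another, so that the two paths never write to a shared cacheline.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cachelines. Kernel code would rely on
 * ____cacheline_aligned_in_smp instead of a hard-coded value. */
#define DEMO_CACHELINE 64

/* Hypothetical stand-in for struct ib_wc; the real one is larger. */
struct demo_wc {
	unsigned long wr_id;
	int status;
};

/* Everything the receive completion path touches. */
struct demo_rx {
	void *rx_ring;
	struct demo_wc recv_wc[4];
};

/* Everything the send path and the send completion path touch. */
struct demo_tx {
	int tx_lock;		/* stand-in for spinlock_t */
	void *tx_ring;
	unsigned tx_head;
	unsigned tx_tail;
	struct demo_wc send_wc[4];
};

struct demo_priv {
	int flags;		/* rarely written shared state */
	_Alignas(DEMO_CACHELINE) struct demo_rx rx;	/* starts a new line */
	_Alignas(DEMO_CACHELINE) struct demo_tx tx;	/* starts a new line */
};

/* Index of the cacheline that holds a given byte offset. */
static size_t demo_line_of(size_t offset)
{
	return offset / DEMO_CACHELINE;
}

/* Returns 1 if no cacheline holds both rx and tx data. */
static int demo_rx_tx_separated(void)
{
	size_t rx_last = offsetof(struct demo_priv, rx) +
			 sizeof(struct demo_rx) - 1;
	size_t tx_first = offsetof(struct demo_priv, tx);

	return demo_line_of(rx_last) < demo_line_of(tx_first);
}

/* Returns 1 if tx_lock sits on the same cacheline as tx_ring. */
static int demo_tx_lock_with_tx_ring(void)
{
	size_t base = offsetof(struct demo_priv, tx);
	size_t lock_off = base + offsetof(struct demo_tx, tx_lock);
	size_t ring_off = base + offsetof(struct demo_tx, tx_ring);

	return demo_line_of(lock_off) == demo_line_of(ring_off);
}
```

Both helpers should return 1 for this layout: the lock travels with the send-side data it protects, and the receive-side group ends before the send-side group begins. Whether this helps in practice would still need measuring on real hardware.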
From xma at us.ibm.com Wed Apr 19 09:26:49 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 19 Apr 2006 09:26:49 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: Message-ID: Roland, > I still don't understand why splitting the CQ allows you to use more > than one CPU to handle completions. Both CQ events get handled on the > same CPU -- you just have more overhead in getting to the CQ event > handlers if there are two of them. The send WC handler is different with recv WC handler. Even with some overhead we do see big improvement in bidirectional throughput. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From xma at us.ibm.com Wed Apr 19 09:35:33 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 19 Apr 2006 09:35:33 -0700 Subject: [openib-general] [PATCH] splitting IPoIB CQ In-Reply-To: Message-ID: Roland Dreier wrote on 04/18/2006 01:45:17 PM: > > > Actually, do you have some explanation for why this helps performance? > > > My intuition would be that it just generates more interrupts for the > > > same workload. > > > The only lock contension in IPoIB I saw is tx_lock. When seperating > > the completion queue to have seperate completion handler. It could improve > > the performance. I didn't look at driver code, it might have some impact > > there? > > A clever way to avoid taking the TX lock on send completions would be > very nice, but I never saw a way to do it. Does splitting the CQ > reduce contention? I don't see why that would be, since the > contention is between sending and getting send completions. The > receive path of course never touches the tx_lock. tx_lock contension will block the CQ handler to process next wiki in CQ, which could be recv wiki or send wiki. > By the way, are your numbers with mthca or ehca? 
I don't know ehca > very well, but at least with current mthca, all CQ events will be > delivered on the same interrupt and hence all CQ handling will run on > the same CPU. So I'm puzzled why changing things from: > > -> interrupt > -> CQ event handler > -> handle all IPoIB completions > > to: > > -> interrupt > -> TX CQ event handler > -> handle TX completions > [possibly another interrupt] > -> RX CQ event handler > -> handle RX completions > > helps throughput. It just seems like it's more CQ locking/unlocking > and in general more work. > > - R. If recv CQ and send CQ are at different rate, splitting CQ would reduce CQ locking unlocking. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From rdreier at cisco.com Wed Apr 19 09:33:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Apr 2006 09:33:52 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: (Shirley Ma's message of "Wed, 19 Apr 2006 09:26:49 -0700") References: Message-ID: Shirley> The send WC handler is different with recv WC Shirley> handler. Even with some overhead we do see big Shirley> improvement in bidirectional throughput. But how? There's only one CQ interrupt handler, which can only run on one CPU at a time. So the send WC handler and recv WC handler are serialized on a single CPU anyway. - R.
From Richard.Frank at oracle.com Wed Apr 19 09:38:43 2006 From: Richard.Frank at oracle.com (Richard Frank) Date: Wed, 19 Apr 2006 12:38:43 -0400 Subject: [openib-general] How do we prevent starvation say between TCP over IPOIB / and SRP traffic ? Message-ID: <1145464723.23142.104.camel@localhost.localdomain> Some application level protocols - require higher QoS levels than others - for various communication and I/O operations. For example, cluster inter-node health msgs have fixed latency requirements that if exceeded may result in unexpected node removals from the cluster. Are there any mechanisms available to the client process to manage the QoS level for the various supported ULPs (SDP,TCP,UDP,RDS,SRP,iSER,etc) either at the ULP level or some combination of process and ULP - or perhaps even at the connection level ? Using the same example, the cluster node monitors might set the priority / QoS level of the heart beats to be more important than normal SRP/iSER traffic to ensure no starvation ? From jlentini at netapp.com Wed Apr 19 09:38:33 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 19 Apr 2006 12:38:33 -0400 (EDT) Subject: [openib-general] Re: [PATCH] [DAPL] [RFC] - remove duplicate disconnect event. In-Reply-To: <1144276742.28591.82.camel@stevo-desktop> References: <1144276742.28591.82.camel@stevo-desktop> Message-ID: On Wed, 5 Apr 2006, Steve Wise wrote: > James, > > Running a 4 thread, 8 ep/thread dapltest (the last test in regress.sh), > I was intermittently seeing a seg fault in dapltest. This is running > over the chelsio rnic using the iwarp branch. After debugging I found > out that dapltest was freeing an already freed endpoint due to it > receiving duplicate disconnect events during test shutdown.
The code > assumes it will get exactly one disconnect event for every endpoint > (rightly so I guess). There should only be 1 disconnect event generated. dapltest should print out an error instead of crashing on this though. > I tracked this down to the code in dapl_ep_disconnect() that generates > its own disconnect event in certain circumstances. I removed this and > ran regress.sh over both mthca and cxgb3 with no problems. > > So my question to the dapl experts is: why is this code here? For our > iwarp devices, it ends up sometimes generating duplicate disconnect > events. I don't see why its needed. If anyone can explain the logic, > that would be great. I've looked into this. Some older verbs APIs didn't generate a disconnect on an abrupt close. I moved the support for these older APIs into a new location in revision 6517 and committed your changes in revision 6518. -james From iod00d at hp.com Wed Apr 19 09:42:26 2006 From: iod00d at hp.com (Grant Grundler) Date: Wed, 19 Apr 2006 09:42:26 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: References: Message-ID: <20060419164226.GB6430@esmail.cup.hp.com> On Wed, Apr 19, 2006 at 10:10:36AM -0400, Bernard King-Smith wrote: > The benefit you are > working on is latency will be faster if we handle both send and receive > processing off the same thread/interrupt, but you have to balance that with > bandwidth limitations. You think 4X has a bandwdith problem using IPoIB, > wait till 12X comes out. [ I've probably posted some of these results before...here's another take on this problem. ] I've looked at this tradeoff pretty closely with ia64 (1.5Ghz) by pinning netperf to a different CPU than the one handling interrupts. By moving netperf RX traffic off the CPU handling interrupts, the 1.5Ghz ia64 box goes from 2.8 Gb/s to around 3.5 Gb/s. But the "service demand" (CPU time per KB payload) goes up from ~2.3 usec/KB to ~3.1 usec/KB - cacheline misses go up dramatically. 
I expect splitting the RX/TX completions would achieve something similar since we are just "slicing" the same problem from a different angle. Apps typically do both RX and TX and will be running on one CPU. So on one path they will be missing cachelines. Anyway, my take is IPoIB perf isn't as critical as SDP and RDMA perf. If folks really care about perf, they have to migrate away from IPoIB to either SDP or directly use RDMA (uDAPL or something). Splitting RX/TX completions might help initial adoption, but it isn't where the big wins in perf are. Pinning netperf/netserver to a different CPU caused SDP perf to drop from 5.5 Gb/s to 5.4 Gb/s. Service Demand went from around 0.55 usec/KB to 0.56 usec/KB, i.e. a much smaller impact on cacheline misses. Keeping traffic local to the CPU that's taking the interrupt keeps the cachelines local. I don't want to discourage anyone from their pet projects. But the conclusion I drew from the above data is IPoIB is a good compatibility story but cacheline misses are going to make it hard to improve perf regardless of how we divide the workload. The IPoIB + TCP/IP code path just has a big footprint. > What per CPU utilization do you see on mthca on a multiple CPU machine > running peak bandwidth? I'm interested in those results as well. thanks, grant From Richard.Frank at oracle.com Wed Apr 19 09:43:41 2006 From: Richard.Frank at oracle.com (Richard Frank) Date: Wed, 19 Apr 2006 12:43:41 -0400 Subject: [openib-general] How do we prevent starvation say between TCP over IPOIB / and SRP traffic ? In-Reply-To: <1145464723.23142.104.camel@localhost.localdomain> References: <1145464723.23142.104.camel@localhost.localdomain> Message-ID: <1145465021.23142.106.camel@localhost.localdomain> This discussion assumes a single fabric (e.g. IB, or iWARP, etc) for network and file I/O between a set of nodes sharing storage.
On Wed, 2006-04-19 at 12:38 -0400, Richard Frank wrote: > Some application level protocols - require higher QoS levels than others > - for various communication and I/O operations. > > For example, cluster inter-node health msgs have fixed latency > requirements that if exceeded may result in unexpected node removals > from the cluster. > > Are there any mechanisms available to the client process to manage the > QoS level for the various supported ULPs (SDP,TCP,UDP,RDS,SRP,iSER,etc) > either at the ULP level or some combination of process and ULP - or > perhaps even at the connection level ? > > Using the same example, the cluster node monitors might set the > priority / QoS level of the heart beats to be more important than normal > SRP/iSER traffic to ensure no starvation ? > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From caitlinb at broadcom.com Wed Apr 19 09:44:56 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 19 Apr 2006 09:44:56 -0700 Subject: [openib-general] How do we prevent starvation say between TCP over IPOIB / and SRP traffic ? Message-ID: <54AD0F12E08D1541B826BE97C98F99F143A655@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Some application level protocols - require higher QoS levels than > others - for various communication and I/O operations. > > For example, cluster inter-node health msgs have fixed > latency requirements that if exceeded may result in > unexpected node removals from the cluster. > > Are there any mechanisms available to the client process to > manage the QoS level for the various supported ULPs > (SDP,TCP,UDP,RDS,SRP,iSER,etc) either at the ULP level or > some combination of process and ULP - or perhaps even at the > connection level ? 
> > Using the same example, the cluster node monitors might set > the priority / QoS level of the heart beats to be more > important than normal SRP/iSER traffic to ensure no starvation ? > > Working up from hardware capabilities and trying to generalize them probably won't lead anywhere. Do you have a model of the requirements for transport/device neutral QP prioritization that would meet your needs? From sean.hefty at intel.com Wed Apr 19 09:52:37 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 19 Apr 2006 09:52:37 -0700 Subject: [openib-general] [PATCH] RDMA CM: assign port numbers whenbinding a cm_id to an address In-Reply-To: <1145418154.12345.8.camel@bigtime.es335.com> Message-ID: >One part I didn't understand, however, was where the local port is >assigned for children of the listening endpoint? The local port for >these endpoints will be the same as for the listening parent. So if >cma_use_port is used to bind these child endpoints (i.e. add them to the >owners list), then the logic will need to distinguish between rdma_cm_id >attempting to bind as a listener vs. rdma_cm_id binding as connected >children. The child copies the local address information from the connection request. See cma_new_id() and calls to cma_get_net_info() and cma_save_net_info(). The child binding is not stored in the bound port space table (tcp_ps). I didn't see a need to do this, which avoids adding logic to distinguish between binding as a listener versus as a connected child. - Sean From wombat2 at us.ibm.com Wed Apr 19 10:03:58 2006 From: wombat2 at us.ibm.com (Bernard King-Smith) Date: Wed, 19 Apr 2006 13:03:58 -0400 Subject: [openib-general] How do we prevent starvation say between TCP over IPOIB / and SRP traffic ? 
Message-ID: Richard Frank wrote: Richard> Are there any mechanisms available to the client process to manage the Richard> QoS level for the various supported ULPs (SDP,TCP,UDP,RDS,SRP,iSER,etc) Richard> either at the ULP level or some combination of process and ULP - or Richard> perhaps even at the connection level ? To manage QoS the question is who knows about all the traffic traversing a specific adapter. For most kernel traversing protocols (IP, iSER, iSCSI, etc) you can sometimes do this in the device driver, where you can examine the headers as a packet is expedited and manage it there. Unfortunately you are adding processing in the driver which can end up impacting bandwidth on high speed adapters. You also introduce additional overhead hence higher CPU utilization. You can also introduce that in the adapter but I am not sure if the IB HCA spec has this capability defined. Richard> Using the same example, the cluster node monitors might set the Richard> priority / QoS level of the heart beats to be more important than normal Richard> SRP/iSER traffic to ensure no starvation ? When you add user space protocols such as SDP and MPI, then all bets are off. The kernel has no knowledge of the traffic from the users so can't properly manage QoS across all protocols. The user space traffic can mess up the kernel traversing protocols QoS calculations. The only place to handle QoS if user space protocols are used is in the adapter. Some switches can handle QoS once the traffic gets there but can't help the adapters in the nodes. Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." 
William Shatner From ftillier at silverstorm.com Wed Apr 19 10:24:29 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Wed, 19 Apr 2006 10:24:29 -0700 Subject: [openib-general] How do we prevent starvation say between TCP over IPOIB / and SRP traffic ? In-Reply-To: <1145464723.23142.104.camel@localhost.localdomain> References: <1145464723.23142.104.camel@localhost.localdomain> Message-ID: <79ae2f320604191024l49b11597s90830d6664971cd8@mail.gmail.com> Hi Rick, On 4/19/06, Richard Frank wrote: > Some application level protocols - require higher QoS levels than others > - for various communication and I/O operations. > > For example, cluster inter-node health msgs have fixed latency > requirements that if exceeded may result in unexpected node removals > from the cluster. > > Are there any mechanisms available to the client process to manage the > QoS level for the various supported ULPs (SDP,TCP,UDP,RDS,SRP,iSER,etc) > either at the ULP level or some combination of process and ULP - or > perhaps even at the connection level ? IB has the concept of Virtual Lanes (VL) at the hardware level, and Service Levels (SL) at the software level. There are always 16 SLs that map to however many VLs are supported by the hardware. IB hardware has at a minimum 2 VLs - VL0 and VL15, the latter being reserved for QP0 management traffic (for configuring the fabric). A module parameter to each ULP could assign it an SL to achieve the prioritization you are looking for. There could even be a limit to the SLs available to user-mode, enforced by the kernel for connected QPs, though I don't know if the same can be said for UD QPs. The SM configures the SL to VL mappings for each node, which causes somewhat of a problem - you don't know exactly what VL any particular SL is mapped to. Hardware that doesn't support all VLs could have multiple SLs mapped to any given VL. This means that if you pick SL0 for SRP and SL1 for IPoIB, both of those *may* map to VL0. 
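The collision described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the round-robin policy in fill_sl2vl() is made up for the example (the real SL to VL table is whatever the SM programs), and the helper names are hypothetical, not part of any OpenFabrics API.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SL 16	/* IB always defines 16 service levels */

/*
 * Hypothetical SM policy, for illustration only: spread the 16 SLs
 * round-robin over the data VLs the port supports. A real SM may
 * program any mapping it likes.
 */
static void fill_sl2vl(uint8_t table[NUM_SL], unsigned data_vls)
{
	unsigned sl;

	if (data_vls == 0)
		data_vls = 1;	/* every port has at least VL0 for data */
	for (sl = 0; sl < NUM_SL; sl++)
		table[sl] = (uint8_t)(sl % data_vls);
}

/* Returns 1 if SLs a and b land on the same VL under the policy above. */
static int sl_pair_collides(unsigned data_vls, unsigned a, unsigned b)
{
	uint8_t table[NUM_SL];

	assert(a < NUM_SL && b < NUM_SL);
	fill_sl2vl(table, data_vls);
	return table[a] == table[b];
}
```

With a single data VL, sl_pair_collides(1, 0, 1) is true, which is the "SRP on SL0, IPoIB on SL1" case above. With four data VLs under this made-up policy the two are kept apart, but SL0 and SL4 would then share a VL.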
- Fab From caitlinb at broadcom.com Wed Apr 19 10:30:55 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 19 Apr 2006 10:30:55 -0700 Subject: [openib-general] How do we prevent starvation say between TCP over IPOIB / and SRP traffic ? Message-ID: <54AD0F12E08D1541B826BE97C98F99F143A667@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Hi Rick, > > On 4/19/06, Richard Frank wrote: >> Some application level protocols - require higher QoS levels than >> others - for various communication and I/O operations. >> >> For example, cluster inter-node health msgs have fixed latency >> requirements that if exceeded may result in unexpected node removals >> from the cluster. >> >> Are there any mechanisms available to the client process to manage >> the QoS level for the various supported ULPs >> (SDP,TCP,UDP,RDS,SRP,iSER,etc) either at the ULP level or some >> combination of process and ULP - or perhaps even at the connection >> level ? > > IB has the concept of Virtual Lanes (VL) at the hardware > level, and Service Levels (SL) at the software level. There > are always 16 SLs that map to however many VLs are supported > by the hardware. IB hardware has at a minimum 2 VLs - VL0 > and VL15, the latter being reserved for QP0 management > traffic (for configuring the fabric). > > A module parameter to each ULP could assign it an SL to > achieve the prioritization you are looking for. There could > even be a limit to the SLs available to user-mode, enforced > by the kernel for connected QPs, though I don't know if the same can > be said for UD QPs. > > The SM configures the SL to VL mappings for each node, which > causes somewhat of a problem - you don't know exactly what VL > any particular SL is mapped to. Hardware that doesn't > support all VLs could have multiple SLs mapped to any given > VL. This means that if you pick SL0 for SRP and SL1 for > IPoIB, both of those *may* map to VL0. > Any given fabric will have solutions to this. 
The question is how the user of OpenFabrics ties their QPs and connections to the fabric-specific traffic management.

From xma at us.ibm.com Wed Apr 19 10:38:28 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 19 Apr 2006 10:38:28 -0700
Subject: [openib-general] [PATCH] splitting IPoIB CQ
In-Reply-To:
Message-ID:

Roland Dreier wrote on 04/19/2006 09:36:29 AM:

> But if the send CQ handler is running, the recv CQ handler can't run
> anyway, since there's only one interrupt which is serialized to run on
> one CPU at a time.

I thought the CQ handler is called in non-interrupt context. Why else would the recv CQ path use netif_rx_ni()?

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From xma at us.ibm.com Wed Apr 19 11:10:25 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 19 Apr 2006 11:10:25 -0700
Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval
In-Reply-To:
Message-ID:

Oops. You mean change priv to:

        struct ipoib_rx_buf *rx_ring ____cacheline_aligned_in_smp;
        struct ib_wc recv_ibwc[IPOIB_NUM_RECV_WC];

        spinlock_t tx_lock ____cacheline_aligned_in_smp;
        struct ipoib_tx_buf *tx_ring;
        unsigned tx_head;
        unsigned tx_tail;
        struct ib_sge tx_sge;
        struct ib_send_wr tx_wr;
        struct ib_wc ibwc[IPOIB_NUM_WC];

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From Thomas.Talpey at netapp.com Wed Apr 19 11:18:24 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 19 Apr 2006 14:18:24 -0400 Subject: [openib-general] NFS/RDMA for Linux: client and server update release 4 Message-ID: <7.0.1.0.2.20060418160548.04263cf8@netapp.com> Network Appliance is pleased to announce release 4 of the NFS/RDMA client and server for Linux 2.6.16.5. Following up on the client release of Feb 8 and client/server release of March 6, this update brings the server to full protocol functionality (inline, read, write and reply chunks are all supported), and completes the client memory registration functionality to support multisegment scatter/gather. These are both licensed under dual BSD/GPL2 terms, and available at the project's Sourceforge site: As before, both client and server employ the native OpenFabrics RDMA verbs API, and work equally for Infiniband and iWARP. They have been tested on several Mellanox-based Infiniband cards, as well as the Ammasso AMSO1100 and the Chelsio cxgb3 iWARP adapters. The client and server implement the IETF draft protocol and fully support direct (zero-copy, zero-touch) RDMA transfers at the RPC layer. The performance is greatly improved, both client and server are capable of performing operations in parallel with full RDMA offload. Both the client and server have been tested with NFSv3 and pass the Connectathon test suite. Additionally, they are able to run iozone and network stress tests with good stability. As in the previous versions, the patch procedure for applying the changes requires the addition of certain framework components to the Linux kernel, both for the OpenFabrics infrastructure and the RPC transport switch. The package README has details. Of course, we look forward to comments and feedback! Thanks again for all of it so far. Tom Talpey, for the various NFS/RDMA projects. 
From xma at us.ibm.com Wed Apr 19 11:31:32 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 19 Apr 2006 11:31:32 -0700
Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114
In-Reply-To: <20060419164226.GB6430@esmail.cup.hp.com>
Message-ID:

Hello Grant,

openib-general-bounces at openib.org wrote on 04/19/2006 09:42:26 AM:

> I've looked at this tradeoff pretty closely with ia64 (1.5GHz)
> by pinning netperf to a different CPU than the one handling interrupts.
> By moving netperf RX traffic off the CPU handling interrupts,
> the 1.5GHz ia64 box goes from 2.8 Gb/s to around 3.5 Gb/s.
> But the "service demand" (CPU time per KB payload) goes up
> from ~2.3 usec/KB to ~3.1 usec/KB - cacheline misses go up dramatically.

Yes, binding netperf/netserver to the same CPU definitely has a cacheline benefit. But that CPU will be the bottleneck: one CPU is not sufficient to drain a faster network device such as an HCA.

> I expect splitting the RX/TX completions would achieve something
> similar since we are just "slicing" the same problem from a different
> angle. Apps typically do both RX and TX and will be running on one
> CPU. So on one path they will be missing cachelines.

It's different. Binding to a CPU guarantees that packets go to the same CPU. The WC handler is not in interrupt context, so it could deliver packets to different CPUs.

> Anyway, my take is IPoIB perf isn't as critical as SDP and RDMA perf.
> If folks really care about perf, they have to migrate away from
> IPoIB to either SDP or directly use RDMA (uDAPL or something).
> Splitting RX/TX completions might help initial adoption, but
> that isn't where the big wins in perf are.

IPoIB perf is important for people who still use old applications. Under some workloads we do see IPoIB double its bidirectional performance with the patch that splits the CQ, tunes the poll interval, and polls more WC entries. It's a huge improvement.
Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From ssbyrn at yahoo.com Wed Apr 19 11:27:21 2006
From: ssbyrn at yahoo.com (susan)
Date: Wed, 19 Apr 2006 18:27:21 +0000 (UTC)
Subject: [openib-general] ib_send_cm_req failes with error -22
Message-ID:

hello,

i'm writing a sample kernel ulp driver to get acquainted with the openib stack on linux kernel 2.6.16.2 (fedora 5), using the openib gen 2 stack checked out from the openib.org website. the setup is two nodes with a point-to-point connection, viz. a primary and a secondary node. the secondary node starts in listen mode and waits until the primary node makes a connection and starts exchanging messages.

the problem that i am running into is that the ib_send_cm_req api fails with error -22. i'm using the local id to make the connection on port 1. the ib_send_cm_req() api calls cm_init_av_by_path(), which calls ib_find_cached_gid(). ib_find_cached_gid() fails because it can't locate the cached gid in the device's cache table.

below is the full control flow from both the primary & secondary node.

primary node                            secondary node
--------------                          -----------------
                                        ib_register_client()
                                        using active port = 1
                                        ib_create_cm_id()
                                        ib_cm_listen()
                                        listening .... (waiting)
ib_register_client()
using active port = 1
ib_create_cm_id()
ib_alloc_pd()
ib_create_cq()
ib_req_notify_cq()
ib_get_dma_mr()
ib_create_qp()
ib_query_gid()
source.lid = 0x1
dest.lid = 0x2
ib_sa_path_rec_get()
sa_path_rec handler returned success
ib_send_cm_req()
ib_send_cm_req() failed with error -22

how would i update the device's cache to get the cached gid? am i missing any steps?
here is output from ib* commands: from primary node: root at copa:~:23> sminfo sminfo: sm lid 0x2 sm guid 0x5ad0000030655, activity count 1707 priority 1 state SMINFO_MASTER 3 root at copa:~:26> ibhosts Ca : 0x0005ad0000030654 ports 2 "Topspin HCA" Ca : 0x0005ad0000030860 ports 2 "Topspin HCA" root at copa:~:28> ibstatus Infiniband device 'mthca0' port 1 status: default gid: fe80:0000:0000:0000:0005:ad00:0003:0861 base lid: 0x1 sm lid: 0x2 state: 4: ACTIVE phys state: 5: LinkUp rate: 10 Gb/sec (4X) Infiniband device 'mthca0' port 2 status: default gid: fe80:0000:0000:0000:0005:ad00:0003:0862 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 2: Polling rate: 2.5 Gb/sec (1X) root at copa:~:29> ibnetdiscover # # Topology file: generated on Tue Apr 18 12:36:39 2006 # # Max of 1 hops discovered # Initiated from node 0005ad0000030860 port 0005ad0000030861 vendid=0x5ad devid=0x5a44 sysimgguid=0x5ad000100d050 caguid=0x5ad0000030654 Ca 2 "H-0005ad0000030654" # Topspin HCA [1] "H-0005ad0000030860"[1] # lid 2 lmc 0 vendid=0x5ad devid=0x5a44 sysimgguid=0x5ad000100d050 caguid=0x5ad0000030860 Ca 2 "H-0005ad0000030860" # Topspin HCA [1] "H-0005ad0000030654"[1] # lid 1 lmc from secondary node: root at bana:~:8> sminfo sminfo: sm lid 0x2 sm guid 0x5ad0000030655, activity count 1738 priority 1 state SMINFO_MASTER 3 root at bana:~:15> ibhosts Ca : 0x0005ad0000030860 ports 2 "Topspin HCA" Ca : 0x0005ad0000030654 ports 2 "Topspin HCA" root at bana:~:16> ibstatus Infiniband device 'mthca0' port 1 status: default gid: fe80:0000:0000:0000:0005:ad00:0003:0655 base lid: 0x2 sm lid: 0x2 state: 4: ACTIVE phys state: 5: LinkUp rate: 10 Gb/sec (4X) Infiniband device 'mthca0' port 2 status: default gid: fe80:0000:0000:0000:0005:ad00:0003:0656 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 2: Polling rate: 2.5 Gb/sec (1X) root at bana:~:17> ibnetdiscover # # Topology file: generated on Tue Apr 18 12:36:15 2006 # # Max of 1 hops discovered # Initiated from node 0005ad0000030654 port 
0005ad0000030655

vendid=0x5ad
devid=0x5a44
sysimgguid=0x5ad000100d050
caguid=0x5ad0000030860
Ca 2 "H-0005ad0000030860" # Topspin HCA
[1] "H-0005ad0000030654"[1] # lid 1 lmc 0

vendid=0x5ad
devid=0x5a44
sysimgguid=0x5ad000100d050
caguid=0x5ad0000030654
Ca 2 "H-0005ad0000030654" # Topspin HCA
[1] "H-0005ad0000030860"[1] # lid 2 lmc 0

do you know what's wrong?

thanks,
susan

From rdreier at cisco.com Wed Apr 19 11:35:16 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 19 Apr 2006 11:35:16 -0700
Subject: [openib-general] [PATCH] splitting IPoIB CQ
In-Reply-To: (Shirley Ma's message of "Wed, 19 Apr 2006 10:38:28 -0700")
References:
Message-ID:

    Shirley> I thought CQ handler is called in the non-interrupt
    Shirley> context. Why in recv CQ use netif_rx_ni() anyway?

With the mthca driver, the CQ handler is definitely called directly from the CQ event interrupt. ehca may be different -- I need to look at the code.

The _ni variant is used to handle low-level drivers that may not be calling back from interrupt context. But it is certainly allowed to call back in interrupt context, and that is definitely what mthca does.

 - R.

From rdreier at cisco.com Wed Apr 19 11:35:58 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 19 Apr 2006 11:35:58 -0700
Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval
In-Reply-To: (Shirley Ma's message of "Wed, 19 Apr 2006 11:10:25 -0700")
References:
Message-ID:

 > struct ipoib_rx_buf *rx_ring ____cacheline_aligned_in_smp;
 > struct ib_wc recv_ibwc[IPOIB_NUM_RECV_WC];
 >
 > spinlock_t tx_lock ____cacheline_aligned_in_smp;
 > struct ipoib_tx_buf *tx_ring;
 > unsigned tx_head;
 > unsigned tx_tail;
 > struct ib_sge tx_sge;
 > struct ib_send_wr tx_wr;
 > struct ib_wc ibwc[IPOIB_NUM_WC];

Yes, that looks like the best thing to me. You have no false sharing between RX and TX, and things are laid out as compactly as possible.

 - R.
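
The "no false sharing between RX and TX" property can be checked in userspace with a small sketch. Everything here is an assumption for illustration: a 64-byte cache line, a GCC/Clang aligned attribute standing in for ____cacheline_aligned_in_smp, and a made-up struct demo_priv with placeholder members rather than the real ipoib_dev_priv.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size; the kernel derives its own value per arch. */
#define CACHELINE 64

/* Userspace stand-in for ____cacheline_aligned_in_smp (GCC/Clang). */
#define cacheline_aligned __attribute__((aligned(CACHELINE)))

struct rx_buf;	/* opaque placeholders for the IPoIB ring types */
struct tx_buf;

/* Hypothetical layout mirroring the RX group / TX group split. */
struct demo_priv {
	/* RX state starts on its own cache line */
	struct rx_buf *rx_ring cacheline_aligned;
	int            rx_wc[4];	/* stand-in for recv_ibwc[] */

	/* TX state starts on a fresh cache line */
	int            tx_lock cacheline_aligned; /* stand-in for spinlock_t */
	struct tx_buf *tx_ring;
	unsigned       tx_head;
	unsigned       tx_tail;
};

/* Index of the cache line that a byte offset falls into. */
static size_t line_of(size_t offset)
{
	return offset / CACHELINE;
}

/* Returns 1 if the last RX byte sits on an earlier line than the first TX byte. */
static int rx_tx_separated(void)
{
	size_t rx_last = offsetof(struct demo_priv, rx_wc)
			 + sizeof(((struct demo_priv *)0)->rx_wc) - 1;
	size_t tx_first = offsetof(struct demo_priv, tx_lock);

	return line_of(rx_last) < line_of(tx_first);
}
```

Dropping the second cacheline_aligned lets tx_lock pack in right behind rx_wc on the same line, which is the false-sharing case the alignment is meant to rule out.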
From bugzilla-daemon at openib.org Wed Apr 19 11:54:03 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 19 Apr 2006 11:54:03 -0700 (PDT) Subject: [openib-general] [Bug 42] New: OFED 1.0 rc3: infinibandeventfs warning on RHEL4 U2 Message-ID: <20060419185403.7656E22867B@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=42 Summary: OFED 1.0 rc3: infinibandeventfs warning on RHEL4 U2 Product: OpenFabrics Linux Version: 1.0rc2 Platform: All OS/Version: Other Status: NEW Severity: normal Priority: P3 Component: IB Core AssignedTo: bugzilla at openib.org ReportedBy: sweitzen at cisco.com I see this booting on 2.6.9-22 kernel (Dell PE 1850 x86_64): Starting portmap: [ OK ] Starting NFS statd: [ OK ] audit(1145468897.269:2): avc: denied { mount } for pid=2268 comm="modprobe" name="infinibandevent:" dev=infinibandeventfs ino=5761 scontext=user_u:system_r:initrc_t tcontext=system_u:object_r:unlabeled_t tclass=filesystem user_verbs: couldn't mount infinibandeventfs Loading HCA driver and Access Layer:[FAILED] Please open an issue in the http://openib.org/bugzilla and attach /tmp/ib_debug_info.log Starting RPC idmapd: [ OK ] Mounting NFS filesystems: [ OK ] ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From rdreier at cisco.com Wed Apr 19 11:44:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Apr 2006 11:44:03 -0700 Subject: [openib-general] [GIT PULL] InfiniBand fixes for 2.6.17-rc2 Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This is mostly (by total lines of patch) cleanups for the ipath driver, with a few other fixes thrown in. 
The exact changes and patch are: Adrian Bunk: IB/mthca: make a function static Hal Rosenstock: IB/mad: Fix RMPP version check during agent registration Roland Dreier: IB/srp: Remove request from list when SCSI abort succeeds IB/ipath: Make more names static IB/ipath: Fix whitespace drivers/infiniband/core/mad.c | 5 - drivers/infiniband/hw/ipath/ipath_diag.c | 12 --- drivers/infiniband/hw/ipath/ipath_driver.c | 2 drivers/infiniband/hw/ipath/ipath_intr.c | 4 - drivers/infiniband/hw/ipath/ipath_kernel.h | 1 drivers/infiniband/hw/ipath/ipath_layer.c | 2 drivers/infiniband/hw/ipath/ipath_pe800.c | 10 +- drivers/infiniband/hw/ipath/ipath_qp.c | 124 ++++++++++++++-------------- drivers/infiniband/hw/ipath/ipath_ud.c | 4 - drivers/infiniband/hw/ipath/ipath_verbs.c | 122 ++++++++++++++-------------- drivers/infiniband/hw/ipath/ipath_verbs.h | 5 - drivers/infiniband/hw/mthca/mthca_mad.c | 2 drivers/infiniband/ulp/srp/ib_srp.c | 18 ++-- 13 files changed, 147 insertions(+), 164 deletions(-) diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 3a702da..469b692 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -228,10 +228,7 @@ struct ib_mad_agent *ib_register_mad_age goto error1; } /* Make sure class supplied is consistent with RMPP */ - if (ib_is_mad_class_rmpp(mad_reg_req->mgmt_class)) { - if (!rmpp_version) - goto error1; - } else { + if (!ib_is_mad_class_rmpp(mad_reg_req->mgmt_class)) { if (rmpp_version) goto error1; } diff --git a/drivers/infiniband/hw/ipath/ipath_diag.c b/drivers/infiniband/hw/ipath/ipath_diag.c index cd533cf..7d3fb69 100644 --- a/drivers/infiniband/hw/ipath/ipath_diag.c +++ b/drivers/infiniband/hw/ipath/ipath_diag.c @@ -365,15 +365,3 @@ static ssize_t ipath_diag_write(struct f bail: return ret; } - -void ipath_diag_bringup_link(struct ipath_devdata *dd) -{ - if (diag_set_link || (dd->ipath_flags & IPATH_LINKACTIVE)) - return; - - diag_set_link = 1; - ipath_cdbg(VERBOSE, "Trying to set to set 
link active for " - "diag pkt\n"); - ipath_layer_set_linkstate(dd, IPATH_IB_LINKARM); - ipath_layer_set_linkstate(dd, IPATH_IB_LINKACTIVE); -} diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 58a94ef..e7617c3 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -1729,7 +1729,7 @@ void ipath_free_pddata(struct ipath_devd } } -int __init infinipath_init(void) +static int __init infinipath_init(void) { int ret; diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 60f5f41..0bcb428 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -172,8 +172,8 @@ static void handle_e_ibstatuschanged(str "was %s\n", dd->ipath_unit, ib_linkstate(lstate), ib_linkstate((unsigned) - dd->ipath_lastibcstat - & IPATH_IBSTATE_MASK)); + dd->ipath_lastibcstat + & IPATH_IBSTATE_MASK)); } else { lstate = dd->ipath_lastibcstat & IPATH_IBSTATE_MASK; diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 159d0ae..0ce5f19 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -528,7 +528,6 @@ extern spinlock_t ipath_devs_lock; extern struct ipath_devdata *ipath_lookup(int unit); extern u16 ipath_layer_rcv_opcode; -extern int ipath_verbs_registered; extern int __ipath_layer_intr(struct ipath_devdata *, u32); extern int ipath_layer_intr(struct ipath_devdata *, u32); extern int __ipath_layer_rcv(struct ipath_devdata *, void *, diff --git a/drivers/infiniband/hw/ipath/ipath_layer.c b/drivers/infiniband/hw/ipath/ipath_layer.c index 2cabf63..69ed110 100644 --- a/drivers/infiniband/hw/ipath/ipath_layer.c +++ b/drivers/infiniband/hw/ipath/ipath_layer.c @@ -52,7 +52,7 @@ static int (*layer_rcv)(void *, void *, static int (*layer_rcv_lid)(void *, void *); static int (*verbs_piobufavail)(void *); 
static void (*verbs_rcv)(void *, void *, void *, u32); -int ipath_verbs_registered; +static int ipath_verbs_registered; static void *(*layer_add_one)(int, struct ipath_devdata *); static void (*layer_remove_one)(void *); diff --git a/drivers/infiniband/hw/ipath/ipath_pe800.c b/drivers/infiniband/hw/ipath/ipath_pe800.c index e693a7a..e1dc4f7 100644 --- a/drivers/infiniband/hw/ipath/ipath_pe800.c +++ b/drivers/infiniband/hw/ipath/ipath_pe800.c @@ -305,8 +305,8 @@ static const struct ipath_cregs ipath_pe * we'll print them and continue. We reuse the same message buffer as * ipath_handle_errors() to avoid excessive stack usage. */ -void ipath_pe_handle_hwerrors(struct ipath_devdata *dd, char *msg, - size_t msgl) +static void ipath_pe_handle_hwerrors(struct ipath_devdata *dd, char *msg, + size_t msgl) { ipath_err_t hwerrs; u32 bits, ctrl; @@ -552,7 +552,7 @@ static int ipath_pe_boardname(struct ipa * freeze mode), and enable hardware errors as errors (along with * everything else) in errormask */ -void ipath_pe_init_hwerrors(struct ipath_devdata *dd) +static void ipath_pe_init_hwerrors(struct ipath_devdata *dd) { ipath_err_t val; u64 extsval; @@ -577,7 +577,7 @@ void ipath_pe_init_hwerrors(struct ipath * ipath_pe_bringup_serdes - bring up the serdes * @dd: the infinipath device */ -int ipath_pe_bringup_serdes(struct ipath_devdata *dd) +static int ipath_pe_bringup_serdes(struct ipath_devdata *dd) { u64 val, tmp, config1; int ret = 0, change = 0; @@ -694,7 +694,7 @@ int ipath_pe_bringup_serdes(struct ipath * @dd: the infinipath device * Called when driver is being unloaded */ -void ipath_pe_quiet_serdes(struct ipath_devdata *dd) +static void ipath_pe_quiet_serdes(struct ipath_devdata *dd) { u64 val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_serdesconfig0); diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index 6058d70..1889071 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ 
-188,8 +188,8 @@ static void free_qpn(struct ipath_qp_tab * Allocate the next available QPN and put the QP into the hash table. * The hash table holds a reference to the QP. */ -int ipath_alloc_qpn(struct ipath_qp_table *qpt, struct ipath_qp *qp, - enum ib_qp_type type) +static int ipath_alloc_qpn(struct ipath_qp_table *qpt, struct ipath_qp *qp, + enum ib_qp_type type) { unsigned long flags; u32 qpn; @@ -232,7 +232,7 @@ bail: * Remove the QP from the table so it can't be found asynchronously by * the receive interrupt routine. */ -void ipath_free_qp(struct ipath_qp_table *qpt, struct ipath_qp *qp) +static void ipath_free_qp(struct ipath_qp_table *qpt, struct ipath_qp *qp) { struct ipath_qp *q, **qpp; unsigned long flags; @@ -358,6 +358,65 @@ static void ipath_reset_qp(struct ipath_ } /** + * ipath_error_qp - put a QP into an error state + * @qp: the QP to put into an error state + * + * Flushes both send and receive work queues. + * QP r_rq.lock and s_lock should be held. + */ + +static void ipath_error_qp(struct ipath_qp *qp) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ib_wc wc; + + _VERBS_INFO("QP%d/%d in error state\n", + qp->ibqp.qp_num, qp->remote_qpn); + + spin_lock(&dev->pending_lock); + /* XXX What if its already removed by the timeout code? 
*/ + if (qp->timerwait.next != LIST_POISON1) + list_del(&qp->timerwait); + if (qp->piowait.next != LIST_POISON1) + list_del(&qp->piowait); + spin_unlock(&dev->pending_lock); + + wc.status = IB_WC_WR_FLUSH_ERR; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + + while (qp->s_last != qp->s_head) { + struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); + + wc.wr_id = wqe->wr.wr_id; + wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; + if (++qp->s_last >= qp->s_size) + qp->s_last = 0; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1); + } + qp->s_cur = qp->s_tail = qp->s_head; + qp->s_hdrwords = 0; + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + + wc.opcode = IB_WC_RECV; + while (qp->r_rq.tail != qp->r_rq.head) { + wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id; + if (++qp->r_rq.tail >= qp->r_rq.size) + qp->r_rq.tail = 0; + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + } +} + +/** * ipath_modify_qp - modify the attributes of a queue pair * @ibqp: the queue pair who's attributes we're modifying * @attr: the new attributes @@ -821,65 +880,6 @@ void ipath_sqerror_qp(struct ipath_qp *q } /** - * ipath_error_qp - put a QP into an error state - * @qp: the QP to put into an error state - * - * Flushes both send and receive work queues. - * QP r_rq.lock and s_lock should be held. - */ - -void ipath_error_qp(struct ipath_qp *qp) -{ - struct ipath_ibdev *dev = to_idev(qp->ibqp.device); - struct ib_wc wc; - - _VERBS_INFO("QP%d/%d in error state\n", - qp->ibqp.qp_num, qp->remote_qpn); - - spin_lock(&dev->pending_lock); - /* XXX What if its already removed by the timeout code? 
*/ - if (qp->timerwait.next != LIST_POISON1) - list_del(&qp->timerwait); - if (qp->piowait.next != LIST_POISON1) - list_del(&qp->piowait); - spin_unlock(&dev->pending_lock); - - wc.status = IB_WC_WR_FLUSH_ERR; - wc.vendor_err = 0; - wc.byte_len = 0; - wc.imm_data = 0; - wc.qp_num = qp->ibqp.qp_num; - wc.src_qp = 0; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = 0; - wc.sl = 0; - wc.dlid_path_bits = 0; - wc.port_num = 0; - - while (qp->s_last != qp->s_head) { - struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); - - wc.wr_id = wqe->wr.wr_id; - wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - if (++qp->s_last >= qp->s_size) - qp->s_last = 0; - ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1); - } - qp->s_cur = qp->s_tail = qp->s_head; - qp->s_hdrwords = 0; - qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; - - wc.opcode = IB_WC_RECV; - while (qp->r_rq.tail != qp->r_rq.head) { - wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id; - if (++qp->r_rq.tail >= qp->r_rq.size) - qp->r_rq.tail = 0; - ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); - } -} - -/** * ipath_get_credit - flush the send work queue of a QP * @qp: the qp who's send work queue to flush * @aeth: the Acknowledge Extended Transport Header diff --git a/drivers/infiniband/hw/ipath/ipath_ud.c b/drivers/infiniband/hw/ipath/ipath_ud.c index 5ff3de6..01cfb30 100644 --- a/drivers/infiniband/hw/ipath/ipath_ud.c +++ b/drivers/infiniband/hw/ipath/ipath_ud.c @@ -46,8 +46,8 @@ * This is called from ipath_post_ud_send() to forward a WQE addressed * to the same HCA. 
*/ -void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, - u32 length, struct ib_send_wr *wr, struct ib_wc *wc) +static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, + u32 length, struct ib_send_wr *wr, struct ib_wc *wc) { struct ipath_ibdev *dev = to_idev(sqp->ibqp.device); struct ipath_qp *qp; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 9f27fd3..8d2558a 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -41,7 +41,7 @@ /* Not static, because we don't want the compiler removing it */ const char ipath_verbs_version[] = "ipath_verbs " IPATH_IDSTR; -unsigned int ib_ipath_qp_table_size = 251; +static unsigned int ib_ipath_qp_table_size = 251; module_param_named(qp_table_size, ib_ipath_qp_table_size, uint, S_IRUGO); MODULE_PARM_DESC(qp_table_size, "QP table size"); @@ -87,7 +87,7 @@ const enum ib_wc_opcode ib_ipath_wc_opco /* * System image GUID. 
*/ -__be64 sys_image_guid; +static __be64 sys_image_guid; /** * ipath_copy_sge - copy data to SGE memory @@ -1110,7 +1110,7 @@ static void ipath_unregister_ib_device(v ib_dealloc_device(ibdev); } -int __init ipath_verbs_init(void) +static int __init ipath_verbs_init(void) { return ipath_verbs_register(ipath_register_ib_device, ipath_unregister_ib_device, @@ -1118,33 +1118,33 @@ int __init ipath_verbs_init(void) ipath_ib_timer); } -void __exit ipath_verbs_cleanup(void) +static void __exit ipath_verbs_cleanup(void) { ipath_verbs_unregister(); } static ssize_t show_rev(struct class_device *cdev, char *buf) { - struct ipath_ibdev *dev = - container_of(cdev, struct ipath_ibdev, ibdev.class_dev); - int vendor, boardrev, majrev, minrev; - - ipath_layer_query_device(dev->dd, &vendor, &boardrev, - &majrev, &minrev); - return sprintf(buf, "%d.%d\n", majrev, minrev); + struct ipath_ibdev *dev = + container_of(cdev, struct ipath_ibdev, ibdev.class_dev); + int vendor, boardrev, majrev, minrev; + + ipath_layer_query_device(dev->dd, &vendor, &boardrev, + &majrev, &minrev); + return sprintf(buf, "%d.%d\n", majrev, minrev); } static ssize_t show_hca(struct class_device *cdev, char *buf) { - struct ipath_ibdev *dev = - container_of(cdev, struct ipath_ibdev, ibdev.class_dev); - int ret; - - ret = ipath_layer_get_boardname(dev->dd, buf, 128); - if (ret < 0) - goto bail; - strcat(buf, "\n"); - ret = strlen(buf); + struct ipath_ibdev *dev = + container_of(cdev, struct ipath_ibdev, ibdev.class_dev); + int ret; + + ret = ipath_layer_get_boardname(dev->dd, buf, 128); + if (ret < 0) + goto bail; + strcat(buf, "\n"); + ret = strlen(buf); bail: return ret; @@ -1152,40 +1152,40 @@ bail: static ssize_t show_stats(struct class_device *cdev, char *buf) { - struct ipath_ibdev *dev = - container_of(cdev, struct ipath_ibdev, ibdev.class_dev); - int i; - int len; - - len = sprintf(buf, - "RC resends %d\n" - "RC QACKs %d\n" - "RC ACKs %d\n" - "RC SEQ NAKs %d\n" - "RC RDMA seq %d\n" - "RC RNR NAKs 
%d\n" - "RC OTH NAKs %d\n" - "RC timeouts %d\n" - "RC RDMA dup %d\n" - "piobuf wait %d\n" - "no piobuf %d\n" - "PKT drops %d\n" - "WQE errs %d\n", - dev->n_rc_resends, dev->n_rc_qacks, dev->n_rc_acks, - dev->n_seq_naks, dev->n_rdma_seq, dev->n_rnr_naks, - dev->n_other_naks, dev->n_timeouts, - dev->n_rdma_dup_busy, dev->n_piowait, - dev->n_no_piobuf, dev->n_pkt_drops, dev->n_wqe_errs); - for (i = 0; i < ARRAY_SIZE(dev->opstats); i++) { + struct ipath_ibdev *dev = + container_of(cdev, struct ipath_ibdev, ibdev.class_dev); + int i; + int len; + + len = sprintf(buf, + "RC resends %d\n" + "RC QACKs %d\n" + "RC ACKs %d\n" + "RC SEQ NAKs %d\n" + "RC RDMA seq %d\n" + "RC RNR NAKs %d\n" + "RC OTH NAKs %d\n" + "RC timeouts %d\n" + "RC RDMA dup %d\n" + "piobuf wait %d\n" + "no piobuf %d\n" + "PKT drops %d\n" + "WQE errs %d\n", + dev->n_rc_resends, dev->n_rc_qacks, dev->n_rc_acks, + dev->n_seq_naks, dev->n_rdma_seq, dev->n_rnr_naks, + dev->n_other_naks, dev->n_timeouts, + dev->n_rdma_dup_busy, dev->n_piowait, + dev->n_no_piobuf, dev->n_pkt_drops, dev->n_wqe_errs); + for (i = 0; i < ARRAY_SIZE(dev->opstats); i++) { const struct ipath_opcode_stats *si = &dev->opstats[i]; - if (!si->n_packets && !si->n_bytes) - continue; - len += sprintf(buf + len, "%02x %llu/%llu\n", i, + if (!si->n_packets && !si->n_bytes) + continue; + len += sprintf(buf + len, "%02x %llu/%llu\n", i, (unsigned long long) si->n_packets, - (unsigned long long) si->n_bytes); - } - return len; + (unsigned long long) si->n_bytes); + } + return len; } static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); @@ -1194,25 +1194,25 @@ static CLASS_DEVICE_ATTR(board_id, S_IRU static CLASS_DEVICE_ATTR(stats, S_IRUGO, show_stats, NULL); static struct class_device_attribute *ipath_class_attributes[] = { - &class_device_attr_hw_rev, - &class_device_attr_hca_type, - &class_device_attr_board_id, - &class_device_attr_stats + &class_device_attr_hw_rev, + &class_device_attr_hca_type, + &class_device_attr_board_id, + 
&class_device_attr_stats }; static int ipath_verbs_register_sysfs(struct ib_device *dev) { - int i; + int i; int ret; - for (i = 0; i < ARRAY_SIZE(ipath_class_attributes); ++i) - if (class_device_create_file(&dev->class_dev, - ipath_class_attributes[i])) { - ret = 1; + for (i = 0; i < ARRAY_SIZE(ipath_class_attributes); ++i) + if (class_device_create_file(&dev->class_dev, + ipath_class_attributes[i])) { + ret = 1; goto bail; } - ret = 0; + ret = 0; bail: return ret; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index b824632..fcafbc7 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -577,8 +577,6 @@ int ipath_init_qp_table(struct ipath_ibd void ipath_sqerror_qp(struct ipath_qp *qp, struct ib_wc *wc); -void ipath_error_qp(struct ipath_qp *qp); - void ipath_get_credit(struct ipath_qp *qp, u32 aeth); void ipath_do_rc_send(unsigned long data); @@ -607,9 +605,6 @@ void ipath_rc_rcv(struct ipath_ibdev *de void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc); -void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, - u32 length, struct ib_send_wr *wr, struct ib_wc *wc); - int ipath_post_ud_send(struct ipath_qp *qp, struct ib_send_wr *wr); void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, diff --git a/drivers/infiniband/hw/mthca/mthca_mad.c b/drivers/infiniband/hw/mthca/mthca_mad.c index f235c7e..4730863 100644 --- a/drivers/infiniband/hw/mthca/mthca_mad.c +++ b/drivers/infiniband/hw/mthca/mthca_mad.c @@ -49,7 +49,7 @@ enum { MTHCA_VENDOR_CLASS2 = 0xa }; -int mthca_update_rate(struct mthca_dev *dev, u8 port_num) +static int mthca_update_rate(struct mthca_dev *dev, u8 port_num) { struct ib_port_attr *tprops = NULL; int ret; diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 5f2b3f6..5bb5574 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ 
b/drivers/infiniband/ulp/srp/ib_srp.c @@ -617,6 +617,14 @@ static void srp_unmap_data(struct scsi_c scmnd->sc_data_direction); } +static void srp_remove_req(struct srp_target_port *target, struct srp_request *req, + int index) +{ + list_del(&req->list); + req->next = target->req_head; + target->req_head = index; +} + static void srp_process_rsp(struct srp_target_port *target, struct srp_rsp *rsp) { struct srp_request *req; @@ -664,9 +672,7 @@ static void srp_process_rsp(struct srp_t scmnd->host_scribble = (void *) -1L; scmnd->scsi_done(scmnd); - list_del(&req->list); - req->next = target->req_head; - target->req_head = rsp->tag & ~SRP_TAG_TSK_MGMT; + srp_remove_req(target, req, rsp->tag & ~SRP_TAG_TSK_MGMT); } else req->cmd_done = 1; } @@ -1188,12 +1194,10 @@ static int srp_send_tsk_mgmt(struct scsi spin_lock_irq(target->scsi_host->host_lock); if (req->cmd_done) { - list_del(&req->list); - req->next = target->req_head; - target->req_head = req_index; - + srp_remove_req(target, req, req_index); scmnd->scsi_done(scmnd); } else if (!req->tsk_status) { + srp_remove_req(target, req, req_index); scmnd->result = DID_ABORT << 16; ret = SUCCESS; } From bugzilla-daemon at openib.org Wed Apr 19 11:56:43 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 19 Apr 2006 11:56:43 -0700 (PDT) Subject: [openib-general] [Bug 42] OFED 1.0 rc3: infinibandeventfs warning on RHEL4 U2 Message-ID: <20060419185643.C2BFD22867D@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=42 ------- Additional Comments From sweitzen at cisco.com 2006-04-19 11:56 ------- Created an attachment (id=11) --> (http://openib.org/bugzilla/attachment.cgi?id=11&action=view) Debug log console asks for ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From vuhuong at mellanox.com Wed Apr 19 11:56:34 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Wed, 19 Apr 2006 11:56:34 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com> <44450F89.7020500@mellanox.com> <44451D32.1010106@mellanox.com> <444566E7.8070907@mellanox.com> Message-ID: <444687E2.8020103@mellanox.com> Roland Dreier wrote: > > > And what if you comment out the line > > > .eh_device_reset_handler = srp_reset_device, > > > does that fix it? > > > No > > Now I'm really confused. > Me too. > It seems we lose the connection to the target (BTW -- do you know why > the connection is getting killed)? I reported the error from my original email responding to your fmr patch. For ia64 system with pcix hca I got asyn event IB_EVENT_QP_ACCESS_ERR at the initiator (and I got cqe with IB_COMPLETION_STATUS_REMOTE_ACCESS_ERROR status at my target) I still have not had an IB analyzer trace (as you suggested) > > So the SCSI midlayer times out commands and tries to abort them. But > we have no connection so the abort fails. The SCSI command shouldn't > get freed now (at least if I'm understanding scsi_error.c correctly). > > Then we have no .eh_device_reset_handler so everything should fall > through to calling our .eh_host_reset_handler without freeing any SCSI > commands. And then we crash on a use-after-free of a SCSI command. > > So where is that command getting freed on us?? > The scsi command that is used by error handlers (.eh_abort_handler, .eh_host_reset_handler...) 
is not the same as use-after-free scsi command from req->scmnd There is some glitch that the scsi command from req->scmnd already freed by scsi midlayer; however, the request is still in our pending request queue Vu From worleys at gmail.com Wed Apr 19 12:01:27 2006 From: worleys at gmail.com (Chris Worley) Date: Wed, 19 Apr 2006 13:01:27 -0600 Subject: [openib-general] Are their any MVAPICH source or binary RPMs corresponding to the SuSE 10 x86_64 RPMs at red-bean? Message-ID: I'm using the RC2 binary RPMs for SuSE 10 from red-bean. I've tried the 4 MVAPICH source and binary RPMs from the OpenIB wiki, and the source RPM from the openib.org downloads; all have symbol conflicts with the verbs header files. Is there an MVAPICH RPM that matches the RC2 SuSE 10 RPMs? Thanks, Chris From sean.hefty at intel.com Wed Apr 19 12:05:50 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 19 Apr 2006 12:05:50 -0700 Subject: [openib-general] RFC userspace / MPI multicast support Message-ID: I'd like to get some feedback regarding the following approach to supporting multicast groups in userspace, and in particular for MPI. Based on side conversations, I need to know if this approach would meet the needs of MPI developers. To join / leave a multicast group, my proposal is to add the following APIs to the rdma_cm. (Note I haven't implemented this yet, so I'm just assuming that it's possible at this point.) /* Asynchronously join a multicast group. */ int rdma_set_option(struct rdma_cm_id *id, int level, int optname, void *optval, size_t optlen); /* Retrieve multicast group information - not usually called. */ int rdma_get_option(struct rdma_cm_id *id, int level, int optname, void *optval, size_t optlen); /* * Post a message on the QP associated with the cm_id for the * specified multicast address. 
*/ int rdma_sendto(struct rdma_cm_id *id, struct ibv_send_wr *send_wr, struct sockaddr *to); --- As an example of how these APIs would be used: /* The cm_id provides event handling and context. */ rdma_create_id(&id, context); /* Bind to a local interface to attach to a local device. */ rdma_bind_addr(id, local_addr); /* Allocate a PD, CQs, etc. */ pd = ibv_alloc_pd(id->verbs); ... /* * Create a UD QP associated with the cm_id. * TBD: automatically transition the QP to RTS for UD QP types? */ rdma_create_qp(id, pd, init_attr); /* Bind to multicast group. */ mcast_ip = 224.0.0.74.71; /* some fine mcast addr */ ip_mreq.imr_multiaddr = mcast_ip.in_addr; rdma_set_option(id, RDMA_PROTO_IP, IP_ADD_MEMBERSHIP, &ip_mreq, sizeof(ip_mreq)); /* Wait for join to complete. */ rdma_get_cm_event(&event); if (event->event == RDMA_CM_EVENT_JOIN_COMPLETE) /* join worked - we could call rdma_get_option() here */ /* The rdma_cm attached the QP to the multicast group for us. */ ... rdma_ack_cm_event(event); /* * Format a send wr. The ah, remote_qpn, and remote_qkey are * filled out by the rdma_cm based on the provided destination * address. */ rdma_sendto(id, send_wr, &mcast_ip); --- The multicast group information is created / managed by the rdma_cm. The rdma_cm defines the mgid, q_key, p_key, sl, flowlabel, tclass, and joinstate. Except for mgid, these would most likely match the values used by the ipoib broadcast group. The mgid mapping would be similar to that used by ipoib. The actual MCMember record would be available to the user by calling rdma_get_option. I don't believe that there would be any restriction on the use of the QP that is attached to the multicast group, but it would take more work to support more than one multicast group per QP. The purpose of the rdma_sendto() routine is to map a given IP address to an allocated address handle and Qkey. At this point, rdma_sendto would only work for multicast addresses that have been joined by the user. 
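To make the ipoib-style mgid mapping mentioned above concrete, here is a minimal sketch, assuming the IPv4 multicast layout that IPoIB uses (RFC 4391): a 0xff prefix, a flags/scope byte, the 0x401b signature, the P_Key, and the low 28 bits of the group address. The helper name ipoib_ipv4_to_mgid() and its parameters are made up for illustration; they are not part of the proposed rdma_cm API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not an rdma_cm call): build the 16-byte MGID for an
 * IPv4 multicast group, following the IPoIB layout, most significant byte
 * first:
 *
 *   0xff | flags+scope | 0x40 0x1b | P_Key (2 bytes) | zeros | 28-bit group
 *
 * ipv4_group is the group address in host byte order.
 * scope is the 4-bit scope value (0x2 = link-local).
 */
static void ipoib_ipv4_to_mgid(uint32_t ipv4_group, uint16_t pkey,
			       uint8_t scope, uint8_t mgid[16])
{
	/* Only the low 28 bits of the class D address are carried. */
	uint32_t group_id = ipv4_group & 0x0fffffffu;

	assert(mgid != NULL);
	memset(mgid, 0, 16);

	mgid[0] = 0xff;
	mgid[1] = (uint8_t)(0x10 | (scope & 0x0f));	/* flags nibble 1, then scope */
	mgid[2] = 0x40;					/* IPv4 signature 0x401b */
	mgid[3] = 0x1b;
	mgid[4] = (uint8_t)(pkey >> 8);
	mgid[5] = (uint8_t)(pkey & 0xff);
	/* bytes 6..11 stay zero */
	mgid[12] = (uint8_t)(group_id >> 24);
	mgid[13] = (uint8_t)(group_id >> 16);
	mgid[14] = (uint8_t)(group_id >> 8);
	mgid[15] = (uint8_t)group_id;
}
```

With a mapping of this kind every node derives the same mgid from the same IP multicast address and P_Key, so no out-of-band exchange of the mgid is needed before joining.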
If a user wanted more control over the multicast group, we could support a call such as:

struct ib_mreq {
	struct ib_sa_mcmember_rec	rec;
	ib_sa_comp_mask			comp_mask;
};

rdma_set_option(id, RDMA_PROTO_IB, IB_ADD_MEMBERSHIP,
		&ib_mreq, sizeof(ib_mreq));

Thoughts?

- Sean

From wombat2 at us.ibm.com Wed Apr 19 12:10:29 2006
From: wombat2 at us.ibm.com (Bernard King-Smith)
Date: Wed, 19 Apr 2006 15:10:29 -0400
Subject: [openib-general] Speeding up IPoIB.
In-Reply-To: <20060419164226.GB6430@esmail.cup.hp.com>
Message-ID:

[sorry if this forum is the wrong place to take this up]

Grant Grundler wrote:

Grant> [ I've probably posted some of these results before...here's another
Grant> take on this problem. ]

Hopefully not rehashing too much old information.

Grant> I expect splitting the RX/TX completions would achieve something
Grant> similar since we are just "slicing" the same problem from a different
Grant> angle. Apps typically do both RX and TX and will be running on one
Grant> CPU. So on one path they will be missing cachelines.

However, the event handler(s) handling the RX/TX completion are not guaranteed to run on the same CPU as the application unless you have the scheduler do some kind of affinity between the application and the event handler for the completion queue. In addition, if an application has multiple sockets then the event handlers are all over the place because each socket has its own completion queue. Does one event handler handle all completion queues?

Grant> Anyway, my take is IPoIB perf isn't as critical as SDP and RDMA perf.
Grant> If folks really care about perf, they have to migrate away from
Grant> IPoIB to either SDP or directly use RDMA (uDAPL or something).
Grant> Splitting RX/TX completions might help initial adoption, but
Grant> isn't where the big wins in perf are.

My take is, good enough is not good enough. If the cost to move from IP to SDP or RDMA is too great, then applications (particularly in the commercial sector) will not convert.
Hence if IPoIB is too slow they will go Ethernet. Currently we only get 40% of the link bandwidth compared to 85% for 10 GigE. (Yes, I know the cost differences, which favor IB.)

However, two things hurt user level protocols. First is scaling and memory requirements. Looking at parallel file systems on large clusters, SDP ended up consuming so much memory it couldn't be used. With N by N socket connections per node, the buffer space and QP memory that SDP required got out of control. There is something to be said for sharing buffer and QP space across lots of sockets.

The other issue is flow control across hundreds of autonomous sockets. In TCP/IP, traffic can be managed so that there is some fairness (multiplexing, QoS etc.) across all active sockets. For user level protocols like SDP and uDAPL, you can't manage traffic across multiple autonomous user application connections because there is nowhere to see all of them at the same time for management. This can lead to overrunning adapters or timeouts to the applications. This tends to be a large system problem when you have lots of CPUs.

SDP and uDAPL have some good ideas but have a way to go for anything except HPC and workloads that are not expected to scale to large configurations. For HPC you can use MPI for application message passing, but for the rest of the cluster traffic you need a good performing IP implementation for now. With time things can improve. There is also IPoIB-CM for much lower IPoIB overhead.

Grant> Pinning netperf/netserver to a different CPU caused SDP perf
Grant> to drop from 5.5 Gb/s to 5.4 Gb/s. Service Demand went from
Grant> around 0.55 usec/KB to 0.56 usec/KB. ie a much smaller impact
Grant> on cacheline misses.

I agree cacheline misses are something that has to be watched carefully. For some platforms we need better binding or affinity tools in Linux to solve some of the current problems. This is a bigger long term issue.
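As an illustration of the kind of CPU binding discussed above, here is a minimal sketch, assuming Linux with glibc's sched_setaffinity() and sched_getaffinity() wrappers. The helper names pin_to_cpu() and first_allowed_cpu() are made up for this example; they only show how a benchmark process could bind itself to one CPU before it opens its sockets, and they do nothing about where the completion handlers run.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Hypothetical helper: restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 on failure (errno is set by the kernel call).
 */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* pid 0 means "the calling thread" */
	return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Hypothetical helper: return the lowest-numbered CPU the calling thread
 * is currently allowed to run on, or -1 if the mask cannot be read.
 */
static int first_allowed_cpu(void)
{
	cpu_set_t set;
	int cpu;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}
```

Binding the application this way only helps if the interrupt for the completion queue is steered to the same CPU, which has to be arranged separately.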
Grant> Keeping traffic local to the CPU that's taking the interrupt Grant> keeps the cachelines local. I don't want to discourage anyone Grant> from their pet projects. But the conclusion I drew from the Grant> above data is IPoIB is a good compatibility story but cacheline Grant> misses are going to make it hard to improve perf regardless Grant> of how we divide the workload. IPoIB + TCP/IP code path just has Grant> a big foot print. The footprint of IPoIB + TCP/IP is large as on any system, However, as you get to higher CPU counts, the issue becomes less of a problem since more unused CPU cycles are available. However, affinity ( CPU and Memory) and cacheline miss issues get greater. Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." 
William Shatner

From ardavis at ichips.intel.com Wed Apr 19 12:17:15 2006
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Wed, 19 Apr 2006 12:17:15 -0700
Subject: [openib-general] Re: [uDAPL] dat.conf generator
In-Reply-To:
References: <200604121122.48646.dotanb@mellanox.co.il> <200604161042.25045.dotanb@mellanox.co.il> <200604180912.47016.dotanb@mellanox.co.il>
Message-ID: <44468CBB.7000501@ichips.intel.com>

James Lentini wrote:

>On Tue, 18 Apr 2006, Dotan Barak wrote:
>
>>On Monday 17 April 2006 23:46, James Lentini wrote:
>>
>>>On Sun, 16 Apr 2006, Dotan Barak wrote:
>>>
>>>>On Wednesday 12 April 2006 17:50, James Lentini wrote:
>>>>
>>>>>>OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" ""
>>>>>>OpenIB-cma0-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" ""
>>>>>>OpenIB-cma0-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 2" ""
>>>>>>OpenIB-cma1-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 1" ""
>>>>>>OpenIB-cma1-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 2" ""
>>>>>>

These entries are wrong. The cma version will only work with an IP address, network hostname, or netdev name. The port value is meaningless since the name gives you the device and port reference all in one. For cma the best flavor is the netdev name, as follows:

OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""

because it allows you to have identical dat.conf setups across your cluster if you intend to use the first IB interface on each node.

Look at dapl/doc/dat.conf for the correct examples.
-arlin

#
# DAT 1.2 configuration file
#
# Each entry should have the following fields:
#
# \
#
#
# Example for openib_cma and openib_scm
#
# For cma version you specify as:
# network address, network hostname, or netdev name and 0 for port
#
# For scm version you specify as actual device name and port
#
# Simple (OpenIB-cma) default with netdev name provided first on list
# to enable use of same dat.conf version on all nodes
#
OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
OpenIB-cma-ip u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "192.168.0.22 0" ""
OpenIB-cma-name u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "svr1-ib0 0" ""
OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
OpenIB-scm1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 1" ""
OpenIB-scm2 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so mv_dapl.1.2 "mthca0 2" ""

From mshefty at ichips.intel.com Wed Apr 19 12:17:34 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 19 Apr 2006 12:17:34 -0700
Subject: [openib-general] ib_send_cm_req failes with error -22
In-Reply-To:
References:
Message-ID: <44468CCE.6070202@ichips.intel.com>

susan wrote:
> the problem that i am running into is that the ib_send_cm_req
> api fails with errno 22. i'm using local id to make the
> connection on port 1. ib_send_cm_req() api calls function
> cm_init_av_by_path(), which calls ib_find_cached_gid().
> function ib_find_cached_gid() fails because it can't locate
> cached gid in device's cache table. below is full control
> flow from both primary & secondary node.

You may want to look at using the rdma_cm interface in place of the lower-level ib_cm interface.

> primary node secondary node
> -------------- -----------------
> ib_register_client()
> using active port = 1
> ib_create_cm_id()
> ib_cm_listen()
> listening ....
(waiting)
>
> ib_register_client()
> using active port = 1
> ib_create_cm_id()
> ib_alloc_pd()
> ib_create_cq()
> ib_req_notify_cq()
> ib_get_dma_mr()
> ib_create_qp()
> ib_query_gid()
> source.lid = 0x1
> dest.lid = 0x2
> ib_sa_path_rec_get()
> sa_path_rec handler returned success
> ib_send_cm_req()
> ib_send_cm_req() failed with error -22

-22 (EINVAL) indicates that one of the parameters is invalid. The initial checks done by the ib_cm are:

	/* peer-to-peer not supported */
	if (param->peer_to_peer)
		return -EINVAL;

	if (!param->primary_path)
		return -EINVAL;

	if (param->qp_type != IB_QPT_RC && param->qp_type != IB_QPT_UC)
		return -EINVAL;

	if (param->private_data &&
	    param->private_data_len > IB_CM_REQ_PRIVATE_DATA_SIZE)
		return -EINVAL;

	if (param->alternate_path &&
	    (param->alternate_path->pkey != param->primary_path->pkey ||
	     param->alternate_path->mtu != param->primary_path->mtu))
		return -EINVAL;

Can you verify that the input parameters would pass these tests? There are some more tests further down in the code that could also return this same error if these all pass.

Posting the actual code that calls ib_send_cm_req() may also help debug the problem.

- Sean

From xma at us.ibm.com Wed Apr 19 12:40:01 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 19 Apr 2006 12:40:01 -0700
Subject: [openib-general] [PATCH] splitting IPoIB CQ
In-Reply-To:
Message-ID:

Roland Dreier wrote on 04/19/2006 11:35:16 AM:

> With the mthca driver, CQ is definitely called directly from the CQ
> event interrupt. ehca may be different -- I need to look at the code.
>
> - R.

Is it possible to move the CQ handler out of interrupt context in mthca?

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From xma at us.ibm.com Wed Apr 19 12:46:51 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 19 Apr 2006 12:46:51 -0700
Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval
In-Reply-To:
Message-ID:

OK. I am going to split the patch without splitting CQ first. The WC handler is called in interrupt context; it is a myth to have bidirectional performance improvement with splitting CQ. More investigation is needed. If the WC handler can be moved out of interrupt context, splitting CQ is still an approach. A separate thread or NAPI support can be implemented later.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From xma at us.ibm.com Wed Apr 19 12:55:45 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 19 Apr 2006 13:55:45 -0600
Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval
In-Reply-To:
Message-ID:

Roland,

Here is the updated patch (cacheline aligned in SMP for both TX and RX) for your experiments.
Signed-off-by: Shirley Ma diff -urpN infiniband/ulp/ipoib/ipoib.h infiniband-cq/ulp/ipoib/ipoib.h --- infiniband/ulp/ipoib/ipoib.h 2006-04-05 17:43:18.000000000 -0700 +++ infiniband-cq/ulp/ipoib/ipoib.h 2006-04-19 13:45:15.534289600 -0700 @@ -71,7 +71,8 @@ enum { IPOIB_MAX_QUEUE_SIZE = 8192, IPOIB_MIN_QUEUE_SIZE = 2, - IPOIB_NUM_WC = 4, + IPOIB_NUM_SEND_WC = 32, + IPOIB_NUM_RECV_WC = 32, IPOIB_MAX_PATH_REC_QUEUE = 3, IPOIB_MAX_MCAST_QUEUE = 3, @@ -151,7 +152,8 @@ struct ipoib_dev_priv { u16 pkey; struct ib_pd *pd; struct ib_mr *mr; - struct ib_cq *cq; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; struct ib_qp *qp; u32 qkey; @@ -162,16 +164,16 @@ struct ipoib_dev_priv { unsigned int admin_mtu; unsigned int mcast_mtu; - struct ipoib_rx_buf *rx_ring; + struct ipoib_rx_buf *rx_ring ____cacheline_aligned_in_smp; + struct ib_wc recv_ibwc[IPOIB_NUM_RECV_WC]; - spinlock_t tx_lock; - struct ipoib_tx_buf *tx_ring; + spinlock_t tx_lock ____cacheline_aligned_in_smp; + struct ipoib_tx_buf *tx_ring; unsigned tx_head; unsigned tx_tail; struct ib_sge tx_sge; struct ib_send_wr tx_wr; - - struct ib_wc ibwc[IPOIB_NUM_WC]; + struct ib_wc send_ibwc[IPOIB_NUM_SEND_WC]; struct list_head dead_ahs; @@ -243,9 +245,13 @@ void ipoib_neigh_free(struct ipoib_neigh extern struct workqueue_struct *ipoib_workqueue; +extern int ipoib_send_poll_interval; +extern int ipoib_recv_poll_interval; + /* functions */ -void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); +void ipoib_ib_send_completion(struct ib_cq *cq, void *dev_ptr); +void ipoib_ib_recv_completion(struct ib_cq *cq, void *dev_ptr); struct ipoib_ah *ipoib_create_ah(struct net_device *dev, struct ib_pd *pd, struct ib_ah_attr *attr); diff -urpN infiniband/ulp/ipoib/ipoib_ib.c infiniband-cq/ulp/ipoib/ipoib_ib.c --- infiniband/ulp/ipoib/ipoib_ib.c 2006-04-05 17:43:18.000000000 -0700 +++ infiniband-cq/ulp/ipoib/ipoib_ib.c 2006-04-19 08:56:40.000000000 -0700 @@ -50,7 +50,6 @@ MODULE_PARM_DESC(data_debug_level, "Enable data path debug 
tracing if > 0"); #endif -#define IPOIB_OP_RECV (1ul << 31) static DEFINE_MUTEX(pkey_mutex); @@ -108,7 +107,7 @@ static int ipoib_ib_post_receive(struct list.lkey = priv->mr->lkey; param.next = NULL; - param.wr_id = id | IPOIB_OP_RECV; + param.wr_id = id; param.sg_list = &list; param.num_sge = 1; @@ -175,8 +174,8 @@ static int ipoib_ib_post_receives(struct return 0; } -static void ipoib_ib_handle_wc(struct net_device *dev, - struct ib_wc *wc) +static void ipoib_ib_handle_recv_wc(struct net_device *dev, + struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned int wr_id = wc->wr_id; @@ -184,121 +183,142 @@ static void ipoib_ib_handle_wc(struct ne ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", wr_id, wc->opcode, wc->status); - if (wr_id & IPOIB_OP_RECV) { - wr_id &= ~IPOIB_OP_RECV; - - if (wr_id < ipoib_recvq_size) { - struct sk_buff *skb = priv->rx_ring[wr_id].skb; - dma_addr_t addr = priv->rx_ring[wr_id].mapping; - - if (unlikely(wc->status != IB_WC_SUCCESS)) { - if (wc->status != IB_WC_WR_FLUSH_ERR) - ipoib_warn(priv, "failed recv event " - "(status=%d, wrid=%d vend_err %x)\n", - wc->status, wr_id, wc->vendor_err); - dma_unmap_single(priv->ca->dma_device, addr, - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); - dev_kfree_skb_any(skb); - priv->rx_ring[wr_id].skb = NULL; - return; - } - - /* - * If we can't allocate a new RX buffer, dump - * this packet and reuse the old buffer. 
- */ - if (unlikely(ipoib_alloc_rx_skb(dev, wr_id))) { - ++priv->stats.rx_dropped; - goto repost; - } - - ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", - wc->byte_len, wc->slid); + if (wr_id < ipoib_recvq_size) { + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + dma_addr_t addr = priv->rx_ring[wr_id].mapping; + + if (unlikely(wc->status != IB_WC_SUCCESS)) { + if (wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed recv event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); dma_unmap_single(priv->ca->dma_device, addr, IPOIB_BUF_SIZE, DMA_FROM_DEVICE); + dev_kfree_skb_any(skb); + priv->rx_ring[wr_id].skb = NULL; + return; + } - skb_put(skb, wc->byte_len); - skb_pull(skb, IB_GRH_BYTES); + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. + */ + if (unlikely(ipoib_alloc_rx_skb(dev, wr_id))) { + ++priv->stats.rx_dropped; + goto repost; + } - if (wc->slid != priv->local_lid || - wc->src_qp != priv->qp->qp_num) { - skb->protocol = ((struct ipoib_header *) skb->data)->proto; - skb->mac.raw = skb->data; - skb_pull(skb, IPOIB_ENCAP_LEN); - - dev->last_rx = jiffies; - ++priv->stats.rx_packets; - priv->stats.rx_bytes += skb->len; - - skb->dev = dev; - /* XXX get correct PACKET_ type here */ - skb->pkt_type = PACKET_HOST; - netif_rx_ni(skb); - } else { - ipoib_dbg_data(priv, "dropping loopback packet\n"); - dev_kfree_skb_any(skb); - } + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); - repost: - if (unlikely(ipoib_ib_post_receive(dev, wr_id))) - ipoib_warn(priv, "ipoib_ib_post_receive failed " - "for buf %d\n", wr_id); - } else - ipoib_warn(priv, "completion event with wrid %d\n", - wr_id); + dma_unmap_single(priv->ca->dma_device, addr, + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); - } else { - struct ipoib_tx_buf *tx_req; - unsigned long flags; + skb_put(skb, wc->byte_len); + skb_pull(skb, IB_GRH_BYTES); - if (wr_id >= ipoib_sendq_size) { - ipoib_warn(priv, 
"completion event with wrid %d (> %d)\n", - wr_id, ipoib_sendq_size); - return; + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb->mac.raw = skb->data; + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); } - ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); + repost: + if (unlikely(ipoib_ib_post_receive(dev, wr_id))) + ipoib_warn(priv, "ipoib_ib_post_receive failed " + "for buf %d\n", wr_id); + } else + ipoib_warn(priv, "completion event with wrid %d\n", + wr_id); +} + +static void ipoib_ib_handle_send_wc(struct net_device *dev, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + struct ipoib_tx_buf *tx_req; + unsigned long flags; + + ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); - tx_req = &priv->tx_ring[wr_id]; + if (wr_id >= ipoib_sendq_size) { + ipoib_warn(priv, "completion event with wrid %d (> %d)\n", + wr_id, ipoib_sendq_size); + return; + } - dma_unmap_single(priv->ca->dma_device, - pci_unmap_addr(tx_req, mapping), - tx_req->skb->len, - DMA_TO_DEVICE); + ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); - ++priv->stats.tx_packets; - priv->stats.tx_bytes += tx_req->skb->len; + tx_req = &priv->tx_ring[wr_id]; - dev_kfree_skb_any(tx_req->skb); + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); - spin_lock_irqsave(&priv->tx_lock, flags); - ++priv->tx_tail; - if (netif_queue_stopped(dev) && - priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) - netif_wake_queue(dev); - spin_unlock_irqrestore(&priv->tx_lock, flags); + 
++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; - if (wc->status != IB_WC_SUCCESS && - wc->status != IB_WC_WR_FLUSH_ERR) - ipoib_warn(priv, "failed send event " - "(status=%d, wrid=%d vend_err %x)\n", - wc->status, wr_id, wc->vendor_err); - } + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->tx_tail; + if (netif_queue_stopped(dev) && + priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) + netif_wake_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); +} + +void ipoib_ib_send_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + udelay(ipoib_send_poll_interval); + do { + n = ib_poll_cq(cq, IPOIB_NUM_SEND_WC, priv->send_ibwc); + for (i = 0; i < n; ++i) + ipoib_ib_handle_send_wc(dev, priv->send_ibwc + i); + } while (n == IPOIB_NUM_SEND_WC); } -void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) +void ipoib_ib_recv_completion(struct ib_cq *cq, void *dev_ptr) { struct net_device *dev = (struct net_device *) dev_ptr; struct ipoib_dev_priv *priv = netdev_priv(dev); int n, i; ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + udelay(ipoib_recv_poll_interval); do { - n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); + n = ib_poll_cq(cq, IPOIB_NUM_RECV_WC, priv->recv_ibwc); for (i = 0; i < n; ++i) - ipoib_ib_handle_wc(dev, priv->ibwc + i); - } while (n == IPOIB_NUM_WC); + ipoib_ib_handle_recv_wc(dev, priv->recv_ibwc + i); + } while (n == IPOIB_NUM_RECV_WC); } static inline int post_send(struct ipoib_dev_priv *priv, diff -urpN infiniband/ulp/ipoib/ipoib_main.c infiniband-cq/ulp/ipoib/ipoib_main.c --- infiniband/ulp/ipoib/ipoib_main.c 2006-04-12 16:43:38.000000000 
-0700 +++ infiniband-cq/ulp/ipoib/ipoib_main.c 2006-04-19 08:44:27.000000000 -0700 @@ -56,12 +56,17 @@ MODULE_LICENSE("Dual BSD/GPL"); int ipoib_sendq_size __read_mostly = IPOIB_TX_RING_SIZE; int ipoib_recvq_size __read_mostly = IPOIB_RX_RING_SIZE; +int ipoib_send_poll_interval __read_mostly = 0; +int ipoib_recv_poll_interval __read_mostly = 0; module_param_named(send_queue_size, ipoib_sendq_size, int, 0444); MODULE_PARM_DESC(send_queue_size, "Number of descriptors in send queue"); module_param_named(recv_queue_size, ipoib_recvq_size, int, 0444); MODULE_PARM_DESC(recv_queue_size, "Number of descriptors in receive queue"); +module_param_named(send_poll_interval, ipoib_send_poll_interval, int, 0444); +module_param_named(recv_poll_interval, ipoib_recv_poll_interval, int, 0444); + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG int ipoib_debug_level; @@ -895,7 +900,7 @@ void ipoib_dev_cleanup(struct net_device kfree(priv->rx_ring); kfree(priv->tx_ring); - + priv->rx_ring = NULL; priv->tx_ring = NULL; } diff -urpN infiniband/ulp/ipoib/ipoib_verbs.c infiniband-cq/ulp/ipoib/ipoib_verbs.c --- infiniband/ulp/ipoib/ipoib_verbs.c 2006-04-05 17:43:18.000000000 -0700 +++ infiniband-cq/ulp/ipoib/ipoib_verbs.c 2006-04-12 19:14:41.000000000 -0700 @@ -174,24 +174,35 @@ int ipoib_transport_dev_init(struct net_ return -ENODEV; } - priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, - ipoib_sendq_size + ipoib_recvq_size + 1); - if (IS_ERR(priv->cq)) { - printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); + priv->send_cq = ib_create_cq(priv->ca, ipoib_ib_send_completion, NULL, dev, + ipoib_sendq_size + 1); + if (IS_ERR(priv->send_cq)) { + printk(KERN_WARNING "%s: failed to create send CQ\n", ca->name); goto out_free_pd; } - if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP)) - goto out_free_cq; + if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP)) + goto out_free_send_cq; + + + priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_recv_completion, NULL, dev, + ipoib_recvq_size + 
1); + if (IS_ERR(priv->recv_cq)) { + printk(KERN_WARNING "%s: failed to create recv CQ\n", ca->name); + goto out_free_send_cq; + } + + if (ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP)) + goto out_free_recv_cq; priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); if (IS_ERR(priv->mr)) { printk(KERN_WARNING "%s: ib_get_dma_mr failed\n", ca->name); - goto out_free_cq; + goto out_free_recv_cq; } - init_attr.send_cq = priv->cq; - init_attr.recv_cq = priv->cq, + init_attr.send_cq = priv->send_cq; + init_attr.recv_cq = priv->recv_cq, priv->qp = ib_create_qp(priv->pd, &init_attr); if (IS_ERR(priv->qp)) { @@ -215,8 +226,11 @@ int ipoib_transport_dev_init(struct net_ out_free_mr: ib_dereg_mr(priv->mr); -out_free_cq: - ib_destroy_cq(priv->cq); +out_free_recv_cq: + ib_destroy_cq(priv->recv_cq); + +out_free_send_cq: + ib_destroy_cq(priv->send_cq); out_free_pd: ib_dealloc_pd(priv->pd); @@ -238,7 +252,10 @@ void ipoib_transport_dev_cleanup(struct if (ib_dereg_mr(priv->mr)) ipoib_warn(priv, "ib_dereg_mr failed\n"); - if (ib_destroy_cq(priv->cq)) + if (ib_destroy_cq(priv->send_cq)) + ipoib_warn(priv, "ib_cq_destroy failed\n"); + + if (ib_destroy_cq(priv->recv_cq)) ipoib_warn(priv, "ib_cq_destroy failed\n"); if (ib_dealloc_pd(priv->pd)) Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: cq.tune.patch Type: application/octet-stream Size: 13454 bytes Desc: not available URL: From bugzilla-daemon at openib.org Wed Apr 19 14:35:52 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Wed, 19 Apr 2006 14:35:52 -0700 (PDT) Subject: [openib-general] [Bug 42] OFED 1.0 rc3: infinibandeventfs warning on RHEL4 U2 Message-ID: <20060419213552.B446C228681@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=42 sweitzen at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|bugzilla at openib.org |vlad at mellanox.co.il Priority|P3 |P2 ------- Additional Comments From sweitzen at cisco.com 2006-04-19 14:35 ------- This bug prevents uverbs from loading, so I'm raising priority from P3 to P2. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From rdreier at cisco.com Wed Apr 19 14:32:47 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Apr 2006 14:32:47 -0700 Subject: [openib-general] [PATCH] splitting IPoIB CQ In-Reply-To: (Shirley Ma's message of "Wed, 19 Apr 2006 12:40:01 -0700") References: Message-ID: Shirley> Is that possible to move the CQ handler out of interrupt Shirley> context in mthca? Yes, but that seems like the wrong thing to do. I think it would be better to let consumers that want the increased latency defer things. - R. From rdreier at cisco.com Wed Apr 19 14:34:24 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Apr 2006 14:34:24 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: (Shirley Ma's message of "Wed, 19 Apr 2006 12:46:51 -0700") References: Message-ID: Shirley> OK. I am going to split the patch without splitting CQ Shirley> first. 
WC handler is called in the interrupt context, it Shirley> is a myth to have bidirectional performance improvement Shirley> with splitting CQ. More investigation is needed. But you did see performance improvement on mthca, right? The reason for that is what I would like to understand. - R. From Diego at Mellanox.com Wed Apr 19 15:12:58 2006 From: Diego at Mellanox.com (Diego Crupnicoff) Date: Wed, 19 Apr 2006 15:12:58 -0700 Subject: [openib-general] How do we prevent starvation say between TCP overIPOIB / and SRP traffic ? Message-ID: > > To manage QoS the question is who knows about all the traffic traversing a > specific adapter. For most kernel traversing protocols (IP, iSER, iSCSI, > etc) you can sometimes do this in the device driver, where you can examine > the headers as a packet is expedited and manage it there. Unfortunately > you > are adding processing in the driver which can end up impacting bandwidth > on > high speed adapters. You also introduce additional overhead hence higher > CPU utilization. Right. You do not want to do this in SW. Most IB HCA adapters can do this for you at wire speed with absolutely no toll on host CPU utilization. From robert.j.woodruff at intel.com Wed Apr 19 15:41:49 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 19 Apr 2006 15:41:49 -0700 Subject: [openib-general] Are their any MVAPICH source or binary RPMscorresponding to the SuSE 10 x86_64 RPMs at red-bean? Message-ID: <1AC79F16F5C5284499BB9591B33D6F00077A9FD6@orsmsx408> Chris wrote, >Is there an MVAPICH RPM that matches the RC2 SuSE 10 RPMs? I think that the IBED rc3 release has RPMs for Mvapich and OpenMPI that you might try. 
https://openib.org/svn/gen2/branches/1.0/ibed/releases/ woody From ardavis at ichips.intel.com Wed Apr 19 15:48:14 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 19 Apr 2006 15:48:14 -0700 Subject: [openib-general] Re: [uDAPL] dtest server never ends when using the dapl provider "OpenIB-scm1" In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301EF04EC@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301EF04EC@mtlexch01.mtl.com> Message-ID: <4446BE2E.4080305@ichips.intel.com> Dotan Barak wrote: >> >>Can you attach to the server process with gdb and get me a >>back trace from each of the threads? >> >>What does driver IBED-1.0-rc3 consist of? >> >>Thanks, >> >>-arlin >> >> >> >> > >Here is a back trace of the hanged process: >(gdb) bt >#0 0x00002aaaab31c86a in pthread_cond_wait@@GLIBC_2.3.2 () from >/lib64/tls/libpthread.so.0 >#1 0x00002aaaab42ef5b in dapl_os_wait_object_wait (wait_obj=0x516650, >timeout_val=) at dapl_osd.c:276 >#2 0x00002aaaab42e9ab in dapl_evd_wait (evd_handle=0x516560, >time_out=4294967295, threshold=1, event=0x7fffffdd7bf0, >nmore=0x7fffffdd7c2c) > at dapl_evd_wait.c:233 >#3 0x00000000004021ab in disconnect_ep () at dtest.c:894 >#4 0x0000000000404cad in main (argc=4, argv=) at > > > Yes, looks like the disconnect event was dropped. Couple of questions: Does this only happen with the scm provider? Can you reproduce on the OpenIB trunk or 1.0 branch? Thanks, -arlin From xma at us.ibm.com Wed Apr 19 15:56:00 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 19 Apr 2006 15:56:00 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: Message-ID: Roland Dreier wrote on 04/19/2006 02:34:24 PM: > Shirley> OK. I am going to split the patch without splitting CQ > Shirley> first. WC handler is called in the interrupt context, it > Shirley> is a myth to have bidirectional performance improvement > Shirley> with splitting CQ. More investigation is needed. 
> > But you did see performance improvement on mthca, right? The reason > for that is what I would like to understand. > > - R. Yes. We did see the performance improvement on mthca. The better way to understand the performance improvement maybe profiling the kernel. Also I am working on removal tx_ring, which requires CQ to be splited to remove recv WC wiki flag IPOIB_OP_RECV. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From iod00d at hp.com Wed Apr 19 15:56:30 2006 From: iod00d at hp.com (Grant Grundler) Date: Wed, 19 Apr 2006 15:56:30 -0700 Subject: [openib-general] Re: Speeding up IPoIB. In-Reply-To: References: <20060419164226.GB6430@esmail.cup.hp.com> Message-ID: <20060419225630.GG6430@esmail.cup.hp.com> On Wed, Apr 19, 2006 at 03:10:29PM -0400, Bernard King-Smith wrote: > Grant> I'm expect splitting the RX/TX completions would achieve something > Grant> similar since we are just "slicing" the same problem from a > different > Grant> angle. Apps typically do both RX and TX and will be running on one > Grant> CPU. So on one path they will be missing cachelines. > > However, the event handler(s) handling the RX/TX completion are not > guaranteed to run on the same CPU as the application unless you have the > scheduler do some kind of affinity between the application and the event > handler for the completion queue. In addition, if an application has > multiple sockets then the event handlers are all of the place because each > socket has its own completion queue. Does one event handler handle all > completion queues? This depends on the HCA. mthca only uses one AFAIK. I believe Roland just confirmed that in a previous email to Shirley Ma. > Grant> Anyway, my take is IPoIB perf isn't as critical as SDP and RDMA > perf. 
> Grant> If folks really care about perf, they have to migrate away from > Grant> IPoIB to either SDP or directly use RDMA (uDAPL or something). > Grant> Splitting RX/TX completions might help initial adoption, but > Grant> aren't were the big wins in perf are. > > My take is, good enough is not good enough. If the cost to move from IP to > SDP or RDMA is too great, then applications ( particularly in the > commercial sector ) will not convert. Hence if IPoIB is too slow they will > go Ethernet. I agree with that assessment. I'm just pointing out that IPoIB has a major tuning problem with TCP/IP stack. If so, then this is a deficiency in mthca that newer hca's can address in their MSI/MSI-X support. > Currently we only get 40% of the link bandwidth compared to > 85% for 10 GigE. (Yes I know the cost differences which favor IB ). 10gige is getting 85% without TOE? Or are they distributing event handling across several CPUs? > However, two things hurt user level protocols. First is scaling and memory > requirements. Looking at parallel file systems on large clusters, SDP ended > up consuming so much memory it couldn't be used. The N by N socket > connections per node, using SDP the required buffer space and QP memory got > out of control. There is something to be said for sharing buffer and QP > space across lots of sockets. My guess is it's an easier problem to fix SDP than reducing TCP/IP cache/CPU foot print. I realize only a subset of apps can (or will try to) use SDP because of setup/config issues. I still believe SDP is useful to a majority of apps without having to recompile them. > The other issue is flow control across hundreds of autonomous sockets. In > TCP/IP, traffic can be managed so that there is some fairness > (multiplexing, QoS etc.) across all active sockets. 
For user level > protocols like SDP and uDAPL, you can't manage traffic across multiple > autonomous user application connections because there is nowhere to see all > of them at the same time for management. This can lead to overrunning > adapters or timeouts to the applications. This tends to be a large system > problem when you have lots of CPUs. I'm not competent to disagree in detail. Fabian Tillier and Caitlin Bestler can (and have) addressed this. > SDP and uDAPL have some good ideas but have a way to go for anything except > HPC and workloads that are not expected to scale to large configurations. > For HPC you can use MPI for application message passing, but for the rest > of the cluster traffic you need a good performing IP implementation for > now. With time things can improve. There is also IPoIB-CM for much lower > IPoIB overhead. I had the impression that IB provided the Reliable Datagram semantics which are equivalent to what TCP provides. I'm sure it's not exactly the same but in general, that disagrees with your assertion above. > > Grant> Pinning netperf/netserver to a different CPU caused SDP perf > Grant> to drop from 5.5 Gb/s to 5.4 Gb/s. Service Demand went from > Grant> around 0.55 usec/KB to 0.56 usec/KB. ie a much smaller impact > Grant> on cacheline misses. > > I agree cacheline misses are something that has to be watched carefully. > For some platforms we need better binding or affinity tools in Linux to > solve some of the current problems. This is a bigger long term issue. taskset works fine. A GUI to "visualize" the application to IO path would be helpful when doing runtime tuning of a given workload. > The footprint of IPoIB + TCP/IP is large as on any system. However, as you > get to higher CPU counts, the issue becomes less of a problem since more > unused CPU cycles are available. However, affinity (CPU and Memory) and > cacheline miss issues get greater. 
Hrm...the concept of "unused CPU cycles" is bugging me as someone who occasionally gets to run benchmarks. If a system today has unused CPU cycles, then will adding a faster link change the CPU load if the application doesn't change? Anyway, I don't find this a good justification for using TCP if TCP can be avoided. thanks, grant From rdreier at cisco.com Wed Apr 19 15:57:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Apr 2006 15:57:52 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: (Shirley Ma's message of "Wed, 19 Apr 2006 15:56:00 -0700") References: Message-ID: Shirley> Also I am working on removal tx_ring, which requires CQ Shirley> to be splited to remove recv WC wiki flag IPOIB_OP_RECV. How are you removing the TX ring? Where do you store the skbs and DMA mappings to be freed when a send completes? - R. From xma at us.ibm.com Wed Apr 19 16:05:32 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 19 Apr 2006 16:05:32 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: Message-ID: Roland Dreier wrote on 04/19/2006 03:57:52 PM: > Shirley> Also I am working on removal tx_ring, which requires CQ > Shirley> to be splited to remove recv WC wiki flag IPOIB_OP_RECV. > > How are you removing the TX ring? Where do you store the skbs and DMA > mappings to be freed when a send completes? > > - R. Since I haven't found any kernel use 128 bit address, I use wr_id to save skb address, DMA mapping and other stuffs are saved in skb->cb, which is the private data for each protocal layer in skb. Same for rx_ring, so rx_buff and tx_buff is not necessary to be used. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From xma at us.ibm.com Wed Apr 19 16:07:55 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 19 Apr 2006 16:07:55 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: Message-ID: Roland, I can send half of the patch for pre-review which I have removed the tx_ring. But I was little bit confused about ah->last_send. Why not use reference count instead? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean.hefty at intel.com Wed Apr 19 16:18:47 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 19 Apr 2006 16:18:47 -0700 Subject: [openib-general] [PATCH] local SA: convert mutex to RW lock Message-ID: Convert from using a single mutex to a reader-writer lock to permit parallel processing of queries. Signed-off-by: Sean Hefty --- This will become more important when multiple path records are stored for a given destination. 
Index: local_sa.c =================================================================== --- local_sa.c (revision 6418) +++ local_sa.c (working copy) @@ -34,7 +34,7 @@ #include #include #include -#include +#include #include #include @@ -75,7 +75,7 @@ static struct ib_client sa_db_client = { }; static LIST_HEAD(dev_list); -static DEFINE_MUTEX(lock); +static DECLARE_RWSEM(lock); static unsigned long hold_time, update_delay; static struct workqueue_struct *sa_wq; @@ -163,18 +163,18 @@ static void update_path_rec(struct sa_db ib_sa_unpack_attr(sa_path, path, IB_SA_ATTR_PATH_REC); - mutex_lock(&lock); + down_write(&lock); old_path = index_find_replace(&port->index, sa_path, sa_path->dgid.raw); if (old_path) kfree(old_path); else if (index_insert(&port->index, sa_path, sa_path->dgid.raw)) { - mutex_unlock(&lock); + up_write(&lock); kfree(sa_path); return; } - mutex_unlock(&lock); + up_write(&lock); } } } @@ -325,7 +325,7 @@ int ib_get_path_rec(struct ib_device *de struct ib_sa_path_rec *path_rec; int ret = 0; - mutex_lock(&lock); + down_read(&lock); dev = ib_get_client_data(device, &sa_db_client); if (!dev) { ret = -ENODEV; @@ -346,7 +346,7 @@ int ib_get_path_rec(struct ib_device *de memcpy(rec, path_rec, sizeof *path_rec); unlock: - mutex_unlock(&lock); + up_read(&lock); return ret; } EXPORT_SYMBOL(ib_get_path_rec); @@ -393,9 +393,9 @@ static void sa_db_add_one(struct ib_devi dev->device = device; ib_set_client_data(device, &sa_db_client, dev); - mutex_lock(&lock); + down_write(&lock); list_add_tail(&dev->list, &dev_list); - mutex_unlock(&lock); + up_write(&lock); /* Initialization must be complete before cache updates can occur. 
*/ INIT_IB_EVENT_HANDLER(&dev->event_handler, device, handle_event); @@ -433,9 +433,9 @@ static void sa_db_remove_one(struct ib_d index_destroy(&dev->port[i].index); } - mutex_lock(&lock); + down_write(&lock); list_del(&dev->list); - mutex_unlock(&lock); + up_write(&lock); kfree(dev); } From rdreier at cisco.com Wed Apr 19 16:32:49 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Apr 2006 16:32:49 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: (Shirley Ma's message of "Wed, 19 Apr 2006 16:07:55 -0700") References: Message-ID: Shirley> But I was little bit confused about ah->last_send. Why Shirley> not use reference count instead? The reasons may be lost in the mists of time, but I think using last_send saves us from having to decrement a reference count when sends complete. Since last_send is only set in the serialized send path, we can do a plain old non-atomic integer assignment. - R. From rdreier at cisco.com Wed Apr 19 16:36:50 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Apr 2006 16:36:50 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: (Shirley Ma's message of "Wed, 19 Apr 2006 16:05:32 -0700") References: Message-ID: Shirley> Since I haven't found any kernel use 128 bit address, I Shirley> use wr_id to save skb address, DMA mapping and other Shirley> stuffs are saved in skb->cb, which is the private data Shirley> for each protocal layer in skb. Same for rx_ring, so Shirley> rx_buff and tx_buff is not necessary to be used. That sounds like a good plan. Although I guess we want to keep around some sort of rx ring if only to free receive buffers when bringing an interface down. But that could become a linked list with the list_head in skb->cb I guess. Does all this matter for performance? - R. 
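The wr_id / skb->cb scheme discussed in the message above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the IPoIB code: struct fake_skb, struct tx_cb, FAKE_CB_SIZE and all of the helper names are hypothetical stand-ins, and the size of the scratch area is made up. It shows the two points from the thread: a buffer pointer fits in the 64-bit wr_id, and the per-send state (here just a DMA address) can ride in the buffer's private area, so no tx_ring slot is needed to find it again at completion time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct sk_buff; the cb size is an assumption. */
#define FAKE_CB_SIZE 48

struct fake_skb {
	struct fake_skb *next;           /* assumed link, e.g. for a teardown list */
	unsigned int len;
	unsigned char cb[FAKE_CB_SIZE];  /* private scratch area, like skb->cb */
};

/* Hypothetical per-send state that would otherwise live in a tx_ring entry. */
struct tx_cb {
	uint64_t dma_addr;
};

_Static_assert(sizeof(struct tx_cb) <= FAKE_CB_SIZE, "tx_cb must fit in cb");
_Static_assert(sizeof(uintptr_t) <= sizeof(uint64_t), "pointer must fit in wr_id");

/* Save the DMA address in the buffer's scratch area (memcpy avoids alignment issues). */
void tx_cb_set_dma(struct fake_skb *skb, uint64_t dma)
{
	struct tx_cb cb = { .dma_addr = dma };

	assert(skb != NULL);
	memcpy(skb->cb, &cb, sizeof(cb));
}

/* Read the DMA address back, as a completion handler would before unmapping. */
uint64_t tx_cb_get_dma(const struct fake_skb *skb)
{
	struct tx_cb cb;

	memcpy(&cb, skb->cb, sizeof(cb));
	return cb.dma_addr;
}

/* Value that would be placed in the work request's wr_id when posting. */
uint64_t wr_id_from_skb(struct fake_skb *skb)
{
	return (uint64_t)(uintptr_t)skb;
}

/* Recover the buffer from the wr_id reported in a work completion. */
struct fake_skb *skb_from_wr_id(uint64_t wr_id)
{
	return (struct fake_skb *)(uintptr_t)wr_id;
}

/* Returns 1 if a buffer and its state survive the post -> completion round trip. */
int wr_id_roundtrip_ok(void)
{
	struct fake_skb skb = { .next = NULL, .len = 128 };
	struct fake_skb *done;

	tx_cb_set_dma(&skb, 0x1000);
	done = skb_from_wr_id(wr_id_from_skb(&skb));
	return done == &skb && tx_cb_get_dma(done) == 0x1000 && done->len == 128;
}
```

One consequence, which the thread also notes: once the ring is gone, nothing else tracks outstanding buffers, so some list (the assumed next link above) is still needed to free posted receive buffers when the interface is brought down.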
From mamidala at cse.ohio-state.edu Wed Apr 19 16:39:27 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Wed, 19 Apr 2006 19:39:27 -0400 (EDT) Subject: [openib-general] Re: RFC userspace / MPI multicast support In-Reply-To: Message-ID: Hi Sean, I have a few basic questions: 1. Does the API which waits for join to complete ensure that the multicast forwarding tables in the switches have been updated. This is one of the main problems that we had studied: (Please refer to the following EURO PVM/MPI paper for details) http://www.cse.ohio-state.edu/~mamidala/europvm.pdf > /* Wait for join to complete. */ > rdma_get_cm_event(&event); > if (event->event == RDMA_CM_EVENT_JOIN_COMPLETE) > /* join worked - we could call rdma_get_option() here */ > /* The rdma_cm attached the QP to the multicast group for us. */ > > rdma_ack_cm_event(event); 2. I am not clear on how to access the QP associated with the cm_id for multicast. This includes posting the receive descriptors etc. 3. If an multicast address is already used by an application running on the cluster and if another request is made by a different application with the same multicast address, does this generate an error? From the API, it looks like the application has to manage this aspect, Thanks, Amith On Wed, 19 Apr 2006, Sean Hefty wrote: > I'd like to get some feedback regarding the following approach to supporting > multicast groups in userspace, and in particular for MPI. Based on side > conversations, I need to know if this approach would meet the needs of MPI > developers. > > To join / leave a multicast group, my proposal is to add the following APIs to > the rdma_cm. (Note I haven't implemented this yet, so I'm just assuming that > it's possible at this point.) > > /* Asynchronously join a multicast group. */ > int rdma_set_option(struct rdma_cm_id *id, int level, int optname, > void *optval, size_t optlen); > > /* Retrieve multicast group information - not usually called. 
*/ > int rdma_get_option(struct rdma_cm_id *id, int level, int optname, > void *optval, size_t optlen); > > /* > * Post a message on the QP associated with the cm_id for the > * specified multicast address. > */ > int rdma_sendto(struct rdma_cm_id *id, struct ibv_send_wr *send_wr, > struct sockaddr *to); > > --- > > As an example of how these APIs would be used: > > /* The cm_id provides event handling and context. */ > rdma_create_id(&id, context); > > /* Bind to a local interface to attach to a local device. */ > rdma_bind_addr(id, local_addr); > > /* Allocate a PD, CQs, etc. */ > pd = ibv_alloc_pd(id->verbs); > .. > > /* > * Create a UD QP associated with the cm_id. > * TBD: automatically transition the QP to RTS for UD QP types? > */ > rdma_create_qp(id, pd, init_attr); > > /* Bind to multicast group. */ > mcast_ip = 224.0.0.74.71; /* some fine mcast addr */ > ip_mreq.imr_multiaddr = mcast_ip.in_addr; > rdma_set_option(id, RDMA_PROTO_IP, IP_ADD_MEMBERSHIP, &ip_mreq, > sizeof(ip_mreq)); > > /* > * Format a send wr. The ah, remote_qpn, and remote_qkey are > * filled out by the rdma_cm based on the provided destination > * address. > */ > rdma_sendto(id, send_wr, &mcast_ip); > > --- > > The multicast group information is created / managed by the rdma_cm. The > rdma_cm defines the mgid, q_key, p_key, sl, flowlabel, tclass, and joinstate. > Except for mgid, these would most likely match the values used by the ipoib > broadcast group. The mgid mapping would be similar to that used by ipoib. The > actual MCMember record would be available to the user by calling > rdma_get_option. > > I don't believe that there would be any restriction on the use of the QP that is > attached to the multicast group, but it would take more work to support more > than one multicast group per QP. The purpose of the rdma_sendto() routine is to > map a given IP address to an allocated address handle and Qkey. 
At this point, > rdma_sendto would only work for multicast addresses that have been joined by the > user. > > If a user wanted more control over the multicast group, we could support a call > such as: > > struct ib_mreq { > struct ib_sa_mcmember_rec rec; > ib_sa_comp_mask comp_mask; > } > > rdma_set_option(id, RDMA_PROTO_IB, IB_ADD_MEMBERSHIP, &ib_mreq, > sizeof(ib_mreq)); > > Thoughts? > > - Sean > From xma at us.ibm.com Wed Apr 19 16:56:12 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 19 Apr 2006 16:56:12 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: Message-ID: Roland Dreier wrote on 04/19/2006 04:36:50 PM: > Shirley> Since I haven't found any kernel use 128 bit address, I > Shirley> use wr_id to save skb address, DMA mapping and other > Shirley> stuffs are saved in skb->cb, which is the private data > Shirley> for each protocal layer in skb. Same for rx_ring, so > Shirley> rx_buff and tx_buff is not necessary to be used. > > That sounds like a good plan. Although I guess we want to keep around > some sort of rx ring if only to free receive buffers when bringing an > interface down. But that could become a linked list with the > list_head in skb->cb I guess. Yes, a list is needed for gracefully shutdown. > Does all this matter for performance? > > - R. It helps the performance about 10% for the touch netperf/netserver test, then hit driver errors. I notice that the send is faster than before. Let me send you the patch tomorrow, maybe you have the hints to identify the problem. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mshefty at ichips.intel.com Wed Apr 19 16:57:06 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 19 Apr 2006 16:57:06 -0700 Subject: [openib-general] Re: RFC userspace / MPI multicast support In-Reply-To: References: Message-ID: <4446CE52.3010102@ichips.intel.com> amith rajith mamidala wrote: > 1. Does the API which waits for join to complete > ensure that the multicast forwarding tables in the switches have been > updated. This is one of the main problems that we had studied: The join is asynchronous. Completion of the join would not be reported until the switch tables had been updated. (Note that this is really an SA issue outside the scope of the local code). > 2. I am not clear on how to access the QP associated with the cm_id for > multicast. This includes posting the receive descriptors etc. rdma_create_qp() returns a struct ibv_qp *. You would access the QP just as you normally would for posting receives or sends to non-multicast endpoints. > 3. If an multicast address is already used by an application running on > the cluster and if another request is made by a different application with > the same multicast address, does this generate an error? From the API, it > looks like the application has to manage this aspect, If two applications make a request to join the same group, they simply both join the same group. One of the key points to the proposal is that a multicast group is identified by an IP address from a user's perspective. MCMemberRecords are abstracted from the user. - Sean 
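To make "a multicast group is identified by an IP address" more concrete, here is a small sketch of the IPoIB-style mapping from an IPv4 multicast address to a 16-byte MGID, as I understand it from RFC 4391. The proposal in this thread only says the rdma_cm mapping "would be similar to that used by ipoib", so treat this as an assumption about the general shape, not as what the rdma_cm does. The function name ipv4_to_mgid() and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Build an MGID from an IPv4 multicast address (host byte order), a P_Key
 * and a scope, using the IPoIB layout:
 *
 *   ff | 1<scope> | 40 1b | P_Key | zeros | low 28 bits of the IPv4 address
 *
 * Hypothetical helper for illustration only.
 */
void ipv4_to_mgid(uint32_t ip_host_order, uint16_t pkey, uint8_t scope,
		  uint8_t mgid[16])
{
	assert(mgid != NULL);

	memset(mgid, 0, 16);
	mgid[0] = 0xff;                            /* multicast GID prefix */
	mgid[1] = (uint8_t)(0x10 | (scope & 0x0f)); /* transient flag + scope */
	mgid[2] = 0x40;                            /* IPv4 signature, high byte */
	mgid[3] = 0x1b;                            /* IPv4 signature, low byte */
	mgid[4] = (uint8_t)(pkey >> 8);
	mgid[5] = (uint8_t)(pkey & 0xff);
	/* Only the low 28 bits of the group address are carried. */
	mgid[12] = (uint8_t)((ip_host_order >> 24) & 0x0f);
	mgid[13] = (uint8_t)((ip_host_order >> 16) & 0xff);
	mgid[14] = (uint8_t)((ip_host_order >> 8) & 0xff);
	mgid[15] = (uint8_t)(ip_host_order & 0xff);
}
```

With this layout, two applications that join the same IP multicast address on the same partition compute the same MGID, which matches the statement above that they simply end up in the same group.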
From mst at mellanox.co.il Thu Apr 20 02:20:16 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 20 Apr 2006 12:20:16 +0300 Subject: [openib-general] Re: RFC userspace / MPI multicast support In-Reply-To: References: Message-ID: <20060420092016.GB1792@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RFC userspace / MPI multicast support > > I'd like to get some feedback regarding the following approach to supporting > multicast groups in userspace, and in particular for MPI. Based on side > conversations, I need to know if this approach would meet the needs of MPI > developers. > > To join / leave a multicast group, my proposal is to add the following APIs to > the rdma_cm. (Note I haven't implemented this yet, so I'm just assuming that > it's possible at this point.) Since there's no CM involved in creating UD QPs, and since there's no multicast rdma, rdma_cm seems like a weird place for multicast. Would it make more sense to create a separate module for multicast stuff?
-- MST From halr at voltaire.com Thu Apr 20 04:55:12 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Apr 2006 07:55:12 -0400 Subject: [openib-general] RFC userspace / MPI multicast support In-Reply-To: References: Message-ID: <1145534109.4539.115149.camel@hal.voltaire.com> Hi Sean, On Wed, 2006-04-19 at 15:05, Sean Hefty wrote: > I'd like to get some feedback regarding the following approach to supporting > multicast groups in userspace, and in particular for MPI. Based on side > conversations, I need to know if this approach would meet the needs of MPI > developers. > > To join / leave a multicast group, MC groups also need to be created and deleted as well. Creating and deleting the group are assumed under the covers (first joiner, last leaver) so the additional MC parameters for creation need to be available on all adds. > my proposal is to add the following APIs to > the rdma_cm. (Note I haven't implemented this yet, so I'm just assuming that > it's possible at this point.) > > /* Asynchronously join a multicast group. */ > int rdma_set_option(struct rdma_cm_id *id, int level, int optname, > void *optval, size_t optlen); > > /* Retrieve multicast group information - not usually called. */ > int rdma_get_option(struct rdma_cm_id *id, int level, int optname, > void *optval, size_t optlen); > > /* > * Post a message on the QP associated with the cm_id for the > * specified multicast address. > */ > int rdma_sendto(struct rdma_cm_id *id, struct ibv_send_wr *send_wr, > struct sockaddr *to); > > --- > > As an example of how these APIs would be used: > > /* The cm_id provides event handling and context. */ > rdma_create_id(&id, context); > > /* Bind to a local interface to attach to a local device. */ > rdma_bind_addr(id, local_addr); > > /* Allocate a PD, CQs, etc. */ > pd = ibv_alloc_pd(id->verbs); > ... > > /* > * Create a UD QP associated with the cm_id. > * TBD: automatically transition the QP to RTS for UD QP types? 
> */ > rdma_create_qp(id, pd, init_attr); > > /* Bind to multicast group. */ > mcast_ip = 224.0.0.74.71; /* some fine mcast addr */ How are the MGIDs formed from this IP address ? Is the same algorithm as IPoIB used ? Are the MGIDs constrained to use 0x401B in the signature part (and 0x601B if this is extended to IPv6) ? BTW, this example has too many bytes... > ip_mreq.imr_multiaddr = mcast_ip.in_addr; > rdma_set_option(id, RDMA_PROTO_IP, IP_ADD_MEMBERSHIP, &ip_mreq, > sizeof(ip_mreq)); The API only supports ADD/DROP. It lacks support for JoinStates. (I don't think the IP semantics are rich enough for IB; this was previously pointed out in the context of IP routers quite a while ago). > /* Wait for join to complete. */ > rdma_get_cm_event(&event); > if (event->event == RDMA_CM_EVENT_JOIN_COMPLETE) > /* join worked - we could call rdma_get_option() here */ > /* The rdma_cm attached the QP to the multicast group for us. */ > ... > rdma_ack_cm_event(event); > > /* > * Format a send wr. The ah, remote_qpn, and remote_qkey are > * filled out by the rdma_cm based on the provided destination > * address. > */ > rdma_sendto(id, send_wr, &mcast_ip); > > --- > > The multicast group information is created / managed by the rdma_cm. The > rdma_cm defines the mgid, q_key, p_key, sl, flowlabel, tclass, and joinstate. > Except for mgid, these would most likely match the values used by the ipoib > broadcast group. The mgid mapping would be similar to that used by ipoib. Does that limit the MGIDs to use IP signatures ? -- Hal > The > actual MCMember record would be available to the user by calling > rdma_get_option. > I don't believe that there would be any restriction on the use of the QP that is > attached to the multicast group, but it would take more work to support more > than one multicast group per QP. The purpose of the rdma_sendto() routine is to > map a given IP address to an allocated address handle and Qkey. 
At this point, > rdma_sendto would only work for multicast addresses that have been joined by the > user. > > If a user wanted more control over the multicast group, we could support a call > such as: > > struct ib_mreq { > struct ib_sa_mcmember_rec rec; > ib_sa_comp_mask comp_mask; > } > > rdma_set_option(id, RDMA_PROTO_IB, IB_ADD_MEMBERSHIP, &ib_mreq, > sizeof(ib_mreq)); > > Thoughts? > > - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Thu Apr 20 05:30:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 20 Apr 2006 15:30:25 +0300 Subject: [openib-general] Fwd: a bug Message-ID: <20060420123025.GK1792@mellanox.co.il> Roland, the following makes sense, does it not? The spec says: 14.2.5.5 GUIDInfo The AttributeModifier is a pointer to a block of 8 GUIDs to which this attribute applies. and each guid is 64 bit. ----- Forwarded message from Leonid Keller ----- Date: Tue, 18 Apr 2006 10:49:06 +0300 From: "Leonid Keller" look at mthca_query_gid() at the end. The statement memcpy(gid->raw + 8, out_mad->data + (index % 8) * 16, 8); is buggy, because the mad contains the block of GUIDs, not GIDs. Must be memcpy(gid->raw + 8, out_mad->data + (index % 8) * 8, 8); ----- End forwarded message ----- -- MST From tziporet at mellanox.co.il Thu Apr 20 06:46:58 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 20 Apr 2006 16:46:58 +0300 Subject: [openib-general] Announcing the OpenFabrics Enterprise Distribution In-Reply-To: <1145312002.4539.72029.camel@hal.voltaire.com> References: <1144869147.19061.91517.camel@hal.voltaire.com> <4443ADBE.7000307@mellanox.co.il> <1145312002.4539.72029.camel@hal.voltaire.com> Message-ID: <444790D2.6090006@mellanox.co.il> Hal Rosenstock wrote: > Where is this tree ? 
Also, what about backports ? Are they part of OFED > as well ? > > This is the git tree that Roland manages. Backports are included in the release. The backports are placed under the ibed dir in the 1.0 branch. > I see no such directory under 1.0. Can you provide more details ? > https://openib.org/svn/gen2/branches/1.0/ibed/ > Is the process the same for userspace ? > For user space we will prefer to apply the patches on the branch, but we can also put them under the fixes if needed. This is useful when there is a fix you wish to provide after the release was done. Tziporet
From taylor at hpc.ufl.edu Thu Apr 20 07:09:08 2006 From: taylor at hpc.ufl.edu (Charles Taylor) Date: Thu, 20 Apr 2006 10:09:08 -0400 Subject: [openib-general] IB + Dual-processor, dual-core Opteron + PCI-E Message-ID: <4F1EA4C3-6353-4956-B99D-28375425A2CC@hpc.ufl.edu>
We have a 202-node cluster where each node is configured as follows:

    dual-processor, dual-core Opteron 275
    Asus K8N-DRE Motherboard
    TopSpin/Cisco LionCub HCA in a 16x PCI-E slot
    4 GB RAM (DDR 400)

The IB fabric is a two-tiered fat tree with 14 Cisco 7000 switches on the edge and Cisco 7008s (2) in the first tier.

We can scale HPL runs up to about 136 nodes/544 cpus reliably on any set of nodes. Above that number of nodes/processors, our HPL runs begin to fail residuals. We can run across all 202 nodes successfully if we use only two procs/node but 4 procs/node will *always* fail residuals. It feels like a data corruption issue in the IB stack.

We have tried various combinations of the following software.

    Kernel: 2.6.9-22, 2.6.9-34
    IB stack: topspin 3.2.0b82, OpenIB (IBGD 1.8.2)
    MPI: mvapich 092/095 (topspin), mvapich 096 (osu), OpenMPI 1.0.2
    Blas Libs: Goto 1.00, 1.02, ACM 3.0.0

The result is the same in every case. We seem to be able to run HPL reliably up to about 544 - 548 processors. It doesn't matter whether we run one mpi task per processor or 1 mpi task per node with OMP_NUM_THREADS=4.
The result is always failed HPL residuals when we run across any subset of the cluster above about 136 nodes using all four procs. I'm wondering if anyone knows of any other large IB clusters using dual-processor, dual-core Opterons + PCI-E with more than 136 nodes and if so, have they been able to successfully scale MPI apps across their entire cluster? Charlie Taylor UF HPC Center From halr at voltaire.com Thu Apr 20 07:20:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Apr 2006 10:20:55 -0400 Subject: [openib-general] Re: RFC userspace / MPI multicast support In-Reply-To: References: Message-ID: <1145542453.4539.117342.camel@hal.voltaire.com> Hi Amith, On Wed, 2006-04-19 at 19:39, amith rajith mamidala wrote: > Hi Sean, > > I have a few basic questions: > > 1. Does the API which waits for join to complete > ensure that the multicast forwarding tables in the switches have been > updated. This is not an API issue. The IB spec (architecture) allows for lazy joining. I can cite the compliance if needed. This is based on the fact that any multicast sending (not just IB) is unreliable and the application needs to deal with lost transmissions if it cares ? Isn't this just another case of that ? > This is one of the main problems that we had studied: > (Please refer to the following EURO PVM/MPI paper for details) > http://www.cse.ohio-state.edu/~mamidala/europvm.pdf Can you summarize the issue that this causes ? I will look at the paper but this may take a little while. -- Hal > > /* Wait for join to complete. */ > > rdma_get_cm_event(&event); > > if (event->event == RDMA_CM_EVENT_JOIN_COMPLETE) > > /* join worked - we could call rdma_get_option() here */ > > /* The rdma_cm attached the QP to the multicast group for us. */ > > > > rdma_ack_cm_event(event); > > 2. I am not clear on how to access the QP associated with the cm_id for > multicast. This includes posting the receive descriptors etc. > > 3. 
If an multicast address is already used by an application running on > the cluster and if another request is made by a different application with > the same multicast address, does this generate an error? From the API, it > looks like the application has to manage this aspect, > > > Thanks, > Amith > > > On Wed, 19 Apr 2006, Sean Hefty wrote: > > > I'd like to get some feedback regarding the following approach to supporting > > multicast groups in userspace, and in particular for MPI. Based on side > > conversations, I need to know if this approach would meet the needs of MPI > > developers. > > > > To join / leave a multicast group, my proposal is to add the following APIs to > > the rdma_cm. (Note I haven't implemented this yet, so I'm just assuming that > > it's possible at this point.) > > > > /* Asynchronously join a multicast group. */ > > int rdma_set_option(struct rdma_cm_id *id, int level, int optname, > > void *optval, size_t optlen); > > > > /* Retrieve multicast group information - not usually called. */ > > int rdma_get_option(struct rdma_cm_id *id, int level, int optname, > > void *optval, size_t optlen); > > > > /* > > * Post a message on the QP associated with the cm_id for the > > * specified multicast address. > > */ > > int rdma_sendto(struct rdma_cm_id *id, struct ibv_send_wr *send_wr, > > struct sockaddr *to); > > > > --- > > > > As an example of how these APIs would be used: > > > > /* The cm_id provides event handling and context. */ > > rdma_create_id(&id, context); > > > > /* Bind to a local interface to attach to a local device. */ > > rdma_bind_addr(id, local_addr); > > > > /* Allocate a PD, CQs, etc. */ > > pd = ibv_alloc_pd(id->verbs); > > .. > > > > /* > > * Create a UD QP associated with the cm_id. > > * TBD: automatically transition the QP to RTS for UD QP types? > > */ > > rdma_create_qp(id, pd, init_attr); > > > > /* Bind to multicast group. 
*/ > > mcast_ip = 224.0.0.74.71; /* some fine mcast addr */ > > ip_mreq.imr_multiaddr = mcast_ip.in_addr; > > rdma_set_option(id, RDMA_PROTO_IP, IP_ADD_MEMBERSHIP, &ip_mreq, > > sizeof(ip_mreq)); > > > > /* > > * Format a send wr. The ah, remote_qpn, and remote_qkey are > > * filled out by the rdma_cm based on the provided destination > > * address. > > */ > > rdma_sendto(id, send_wr, &mcast_ip); > > > > --- > > > > The multicast group information is created / managed by the rdma_cm. The > > rdma_cm defines the mgid, q_key, p_key, sl, flowlabel, tclass, and joinstate. > > Except for mgid, these would most likely match the values used by the ipoib > > broadcast group. The mgid mapping would be similar to that used by ipoib. The > > actual MCMember record would be available to the user by calling > > rdma_get_option. > > > > I don't believe that there would be any restriction on the use of the QP that is > > attached to the multicast group, but it would take more work to support more > > than one multicast group per QP. The purpose of the rdma_sendto() routine is to > > map a given IP address to an allocated address handle and Qkey. At this point, > > rdma_sendto would only work for multicast addresses that have been joined by the > > user. > > > > If a user wanted more control over the multicast group, we could support a call > > such as: > > > > struct ib_mreq { > > struct ib_sa_mcmember_rec rec; > > ib_sa_comp_mask comp_mask; > > } > > > > rdma_set_option(id, RDMA_PROTO_IB, IB_ADD_MEMBERSHIP, &ib_mreq, > > sizeof(ib_mreq)); > > > > Thoughts? 
> > > > - Sean > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From iod00d at hp.com Thu Apr 20 08:16:27 2006 From: iod00d at hp.com (Grant Grundler) Date: Thu, 20 Apr 2006 08:16:27 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: References: <20060419164226.GB6430@esmail.cup.hp.com> Message-ID: <20060420151627.GA10891@esmail.cup.hp.com> Hi Shirley, On Wed, Apr 19, 2006 at 11:31:32AM -0700, Shirley Ma wrote: ... > > By moving netperf RX traffic off the CPU handling interrupts, > > the 1.5Ghz ia64 box goes from 2.8 Gb/s to around 3.5 Gb/s. > > But the "service demand" (CPU time per KB payload) goes up > > from ~2.3 usec/KB to ~3.1 usec/KB - cacheline misses go up dramatically. > > Yes, netperf/netserver binding to same cpu definitely has benefit > cacheline. But the cpu will be the bottleneck. One cpu is not > sufficient to drain out faster network device HCA. I agree. I don't think anyone is suggesting one CPU can handle TCP for a 10Gb/s (or faster) device. ... > IPoIB perf if important for people still use old application. We do see > under some workload IPoIB gain double bidirectional performance with > splitting CQ/tune poll interval/poll more entries from WC patch. Was this measured using ehca? If so, the result implies at least two interrupt vectors are used. And it seems reasonable for IPoIB to tune for that even if it costs mthca a slight amount of overhead. Roland might be more receptive if someone provided him with data showing perf and "service demand" for mthca doesn't substantial degrade because of this change. 
hth, grant From rdreier at cisco.com Thu Apr 20 09:04:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Apr 2006 09:04:32 -0700 Subject: [openib-general] [RFC] remove kernel drivers from svn Message-ID: What would people think of removing the kernel drivers from our svn tree? I see several reasons to do this: - Several groups are already using their own separate repositories anyway, so svn doesn't have all of the latest and greatest anyway. (For example both ehca and ipath are really maintained in private repositories) - Pulling patches out of svn to merge with a git tree generates extra time-wasting busy-work, both for the submitter and for me. - svn kernel drivers dilute testing attention from the upstream kernel, which means that upstream is not as high-quality as possible. Getting rid of svn would encourage new features to be developed as self-contained patch sets, which makes an eventual upstream merge much easier. And of course having git as our core repository would make developing on a branch far far easier. - R. From rdreier at cisco.com Thu Apr 20 09:05:41 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Apr 2006 09:05:41 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: (Shirley Ma's message of "Wed, 19 Apr 2006 16:56:12 -0700") References: Message-ID: Shirley> It helps the performance about 10% for the touch Shirley> netperf/netserver test, then hit driver errors. I notice Shirley> that the send is faster than before. Let me send you the Shirley> patch tomorrow, maybe you have the hints to identify the Shirley> problem. Without seeing the patch I'm going to guess that your speedup is coming by breaking the accounting of free space in the send queue, and so you run into problems when the queue fills up. - R. From mst at mellanox.co.il Thu Apr 20 09:17:15 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 20 Apr 2006 19:17:15 +0300 Subject: [openib-general] Re: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: References: Message-ID: <20060420161715.GT1792@mellanox.co.il> Quoting r. Sean Hefty : > Subject: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address > > Assign/reserve a port number when binding a cm_id. If no port number is > given, assign one from the local port space. If a port number is given, > reserve it. > > The RDMA port space is separate from that used for TCP. iWarp devices > will need to coordinate between the port values assigned by the rdma_cm > and those in use by TCP. SDP also has its own port space. So far, seems to work fine, here. I suggest you check it in to svn. One small note: ipv4 on linux does this: err = -EACCES; if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE)) goto out; disabling bind to ports 1-1023 for non-priveledged users. Do you want to add such a check in CMA, or does it belong in SDP in your opinion? -- MST From tom at opengridcomputing.com Thu Apr 20 09:32:36 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 20 Apr 2006 11:32:36 -0500 Subject: [openib-general] Re: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: <20060420161715.GT1792@mellanox.co.il> References: <20060420161715.GT1792@mellanox.co.il> Message-ID: <1145550756.27405.6.camel@trinity.ogc.int> On Thu, 2006-04-20 at 19:17 +0300, Michael S. Tsirkin wrote: > Quoting r. Sean Hefty : > > Subject: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address > > > > Assign/reserve a port number when binding a cm_id. If no port number is > > given, assign one from the local port space. If a port number is given, > > reserve it. > > > > The RDMA port space is separate from that used for TCP. iWarp devices > > will need to coordinate between the port values assigned by the rdma_cm > > and those in use by TCP. 
SDP also has its own port space. > > So far, seems to work fine, here. I suggest you check it in to svn. > > One small note: ipv4 on linux does this: > err = -EACCES; > if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE)) > goto out; > > disabling bind to ports 1-1023 for non-priveledged users. In my opinion it belongs in the CMA for consistency across all transports. Note that the global port range variable is used to determine where to start ephemeral port allocation. > > Do you want to add such a check in CMA, or does it belong in SDP in your > opinion? > From robert.j.woodruff at intel.com Thu Apr 20 09:27:16 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Thu, 20 Apr 2006 09:27:16 -0700 Subject: [openib-general] [RFC] remove kernel drivers from svn In-Reply-To: Message-ID: <000001c66497$49fe4e00$33a9070a@amr.corp.intel.com> Roland wrote, > - Several groups are already using their own separate repositories > anyway, so svn doesn't have all of the latest and greatest anyway. > (For example both ehca and ipath are really maintained in private > repositories) >- Pulling patches out of svn to merge with a git tree generates extra > time-wasting busy-work, both for the submitter and for me. > - svn kernel drivers dilute testing attention from the upstream > kernel, which means that upstream is not as high-quality as possible. >Getting rid of svn would encourage new features to be developed as >self-contained patch sets, which makes an eventual upstream merge much >easier. And of course having git as our core repository would make >developing on a branch far far easier. > - R. I don't have a problem moving from SVN to something like git if it makes the kernel development and pushing upstream easier as long as the database is available in the open and not kept in some private repository, which is not the open source way. 
I would also like to see all components, user and kernel, use the same source control tool so that I don't have to install, learn, and maintain lots of different tools. my 2 cents, woody
From rdreier at cisco.com Thu Apr 20 09:36:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Apr 2006 09:36:11 -0700 Subject: [openib-general] Re: Fwd: a bug In-Reply-To: <20060420123025.GK1792@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 20 Apr 2006 15:30:25 +0300") References: <20060420123025.GK1792@mellanox.co.il> Message-ID:
Yes, good catch. I fixed it like this:

diff-tree 9efa84579338184e0139f1881f0a1b6ad0063fcc (from 52824b6b5fa0533e2b2adc9df396d0e9ff6fb02a)
Author: Roland Dreier
Date:   Thu Apr 20 09:34:46 2006 -0700

    IB/mthca: Fix offset in query_gid method

    GuidInfo records have 8 byte GUIDs in them, so an index should be
    multiplied by 8 to get an offset. mthca_query_gid() was incorrectly
    multiplying by 16.

    Noticed by Leonid Keller.

    Signed-off-by: Roland Dreier

diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 565a24b..a2eae8a 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -306,7 +306,7 @@ static int mthca_query_gid(struct ib_dev
 		goto out;
 	}
 
-	memcpy(gid->raw + 8, out_mad->data + (index % 8) * 16, 8);
+	memcpy(gid->raw + 8, out_mad->data + (index % 8) * 8, 8);
 
  out:
 	kfree(in_mad);

From xma at us.ibm.com Thu Apr 20 09:41:15 2006 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 20 Apr 2006 09:41:15 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: Message-ID: Roland Dreier wrote on 04/20/2006 09:05:41 AM: > Shirley> It helps the performance about 10% for the touch > Shirley> netperf/netserver test, then hit driver errors. I notice > Shirley> that the send is faster than before.
Let me send you the > Shirley> patch tomorrow, maybe you have the hints to identify the > Shirley> problem. > > Without seeing the patch I'm going to guess that your speedup is > coming by breaking the accounting of free space in the send queue, and > so you run into problems when the queue fills up. > > - R. In this case, post_send() should return ENOMEM. I didn't see any error returns. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Thu Apr 20 09:39:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Apr 2006 09:39:56 -0700 Subject: [openib-general] [RFC] remove kernel drivers from svn In-Reply-To: <000001c66497$49fe4e00$33a9070a@amr.corp.intel.com> (Bob Woodruff's message of "Thu, 20 Apr 2006 09:27:16 -0700") References: <000001c66497$49fe4e00$33a9070a@amr.corp.intel.com> Message-ID: Bob> I don't have a problem moving from SVN to something like git Bob> if it makes the kernel development and pushing upstream Bob> easier as long as the database is available in the open and Bob> not kept in some private repository, which is not the open Bob> source way. Well the de facto situation today is that ehca and ipath are maintained in private repositories. Bob> I would also like to see all components user and kernel use Bob> the same source control tool so that I don't have to install Bob> learn and maintain lots of different tools. I guess I was trying to say that Linus's tree is really the kernel repository. Trying to pretend that our kernel drivers are independent of the kernel leads to a very confusing situation. I would really like to see people who want to run bleeding-edge stuff grab something along the lines of http://kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.17-rc2-git2.bz2 or if they're very brave pull my for-2.6.18 or for-mm tree. 
Otherwise we end up wasting their testing attention on something different from the upstream kernel. - R. From rdreier at cisco.com Thu Apr 20 09:40:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Apr 2006 09:40:33 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: (Shirley Ma's message of "Thu, 20 Apr 2006 09:41:15 -0700") References: Message-ID: Shirley> In this case, post_send() should return ENOMEM. I didn't Shirley> see any error returns. OK, it's just a guess without seeing the patch ;) From xma at us.ibm.com Thu Apr 20 09:46:19 2006 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 20 Apr 2006 09:46:19 -0700 Subject: [openib-general] Re: openib-general Digest, Vol 22, Issue 114 In-Reply-To: <20060420151627.GA10891@esmail.cup.hp.com> Message-ID: Hello Grant, Grant Grundler wrote on 04/20/2006 08:16:27 AM: > Was this measured using ehca? > If so, the result implies at least two interrupt vectors are used. > > And it seems reasonable for IPoIB to tune for that even if it > costs mthca a slight amount of overhead. Roland might be more > receptive if someone provided him with data showing perf > and "service demand" for mthca doesn't substantial degrade > because of this change. Yes. It was measured on ehca. I am working on getting more resources to run more tests (both unidirectional and bidirectional) on mthca. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed...
URL: From sean.hefty at intel.com Thu Apr 20 09:47:52 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 20 Apr 2006 09:47:52 -0700 Subject: [openib-general] RE: RFC userspace / MPI multicast support In-Reply-To: <20060420092016.GB1792@mellanox.co.il> Message-ID: >Since there's no CM involved in creating UD QPs, and since there's no multicast >rdma, rdma_cm seems like a weird place for multicast. The CM is involved in using UD QPs through the use of SIDR, and it also requires path record lookups, which is provided by the rdma_cm. >Would it make more sense to create a separate module for multicast stuff? I think this will result in duplicating functionality. The rdma_cm provides handling for device removal, event reporting, path record lookup, and mapping of IP addresses. UD QP users will want part, if not all, of these features. - Sean From sean.hefty at intel.com Thu Apr 20 09:58:26 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 20 Apr 2006 09:58:26 -0700 Subject: [openib-general] RFC userspace / MPI multicast support In-Reply-To: <1145534109.4539.115149.camel@hal.voltaire.com> Message-ID: >On Wed, 2006-04-19 at 15:05, Sean Hefty wrote: >> I'd like to get some feedback regarding the following approach to supporting >> multicast groups in userspace, and in particular for MPI. Based on side >> conversations, I need to know if this approach would meet the needs of MPI >> developers. >> >> To join / leave a multicast group, > >MC groups also need to be created and deleted as well. Creating and >deleting the group are assumed under the covers (first joiner, last >leaver) so the additional MC parameters for creation need to be >available on all adds. Creation / deletion would be automatic. The creation parameters for RDMA_PROTO_IP would use the same settings as the ipoib broadcast group. >> /* Bind to multicast group. */ >> mcast_ip = 224.0.0.74.71; /* some fine mcast addr */ > >How are the MGIDs formed from this IP address ? 
Is the same algorithm as >IPoIB used ? > >Are the MGIDs constrained to use 0x401B in the signature part (and >0x601B if this is extended to IPv6) ? The MGIDs would be formed using the same algorithm as ipoib. I hadn't decided on whether to use the same signature, or a different one. My initial thought was to use a different signature, but I'm not sure that it's necessary. >BTW, this example has too many bytes... Just a typo... >> ip_mreq.imr_multiaddr = mcast_ip.in_addr; >> rdma_set_option(id, RDMA_PROTO_IP, IP_ADD_MEMBERSHIP, &ip_mreq, >> sizeof(ip_mreq)); > >The API only supports ADD/DROP. It lacks support for JoinStates. >(I don't think the IP semantics are rich enough for IB; this was >previously pointed out in the context of IP routers quite a while ago). Additional join states are IB specific, so would be handled by using the RDMA_PROTO_IB option. As an alternative, we could replace IP_ADD_MEMBERSHIP with RDMA_ADD_FULL_MEMBER, RDMA_ADD_SEND_MEMBER, etc. >> The multicast group information is created / managed by the rdma_cm. The >> rdma_cm defines the mgid, q_key, p_key, sl, flowlabel, tclass, and joinstate. >> Except for mgid, these would most likely match the values used by the ipoib >> broadcast group. The mgid mapping would be similar to that used by ipoib. > >Does that limit the MGIDs to use IP signatures ? Yes - unless the RDMA_PROTO_IB option were used. - Sean From robert.j.woodruff at intel.com Thu Apr 20 10:00:09 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Thu, 20 Apr 2006 10:00:09 -0700 Subject: [openib-general] [RFC] remove kernel drivers from svn In-Reply-To: Message-ID: <000101c6649b$e1b14f00$33a9070a@amr.corp.intel.com> Roland wrote, >I guess I was trying to say that Linus's tree is really the kernel >repository. Trying to pretend that our kernel drivers are independent >of the kernel leads to a very confusing situation. 
>I would really like to see people who want to run bleeding-edge stuff
>grab something along the lines of http://kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.17-rc2-git2.bz2
>or if they're very brave pull my for-2.6.18 or for-mm tree. Otherwise
>we end up wasting their testing attention on something different from
>the upstream kernel.
> - R.

Having a local "open" repository (git or SVN) for code under development before it is pushed upstream seems like it is still needed for iSer, SDP, etc. Without this, people will not have access to anything that is under development until it is in Linus's tree. Development behind closed doors is not good and will prevent early adopters from providing testing and feedback on new components under development. I think that the working openib database (SVN or git) must be in the open. openib bylaws might even require it.

Again, I do not have a problem with moving everything from SVN to git as long as the git tree is available in the open and not kept in some private repository.

woody

From sean.hefty at intel.com Thu Apr 20 10:02:39 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 20 Apr 2006 10:02:39 -0700
Subject: [openib-general] RE: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address
In-Reply-To: <20060420161715.GT1792@mellanox.co.il>
Message-ID:

>One small note: ipv4 on linux does this:
>	err = -EACCES;
>	if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
>		goto out;
>
>disabling bind to ports 1-1023 for non-privileged users.
>
>Do you want to add such a check in CMA, or does it belong in SDP in your
>opinion?

I would think this check belongs in the kernel ucma, which would require adding it to SDP as well.

Which module is the check listed above done in? I want to understand where this check is made before adding it.
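For reference, the quoted ipv4 check can be modeled outside the kernel. This is only a minimal userspace sketch, not the af_inet, CMA, or SDP code: check_bind_port() and its has_net_bind_service flag are hypothetical stand-ins for the bind path and for capable(CAP_NET_BIND_SERVICE), and PROT_SOCK is assumed to be 1024 as in the ipv4 code.

```c
#include <errno.h>
#include <stdbool.h>

/* First non-privileged port; assumed to match the kernel's PROT_SOCK. */
#define PROT_SOCK 1024

/*
 * Hypothetical helper modeling the quoted check.
 * snum == 0 means "any port" and is always allowed; ports 1-1023
 * are refused unless the caller holds the bind-service capability.
 * Returns 0 on success or -EACCES, like the quoted kernel snippet.
 */
static int check_bind_port(unsigned short snum, bool has_net_bind_service)
{
	if (snum && snum < PROT_SOCK && !has_net_bind_service)
		return -EACCES;
	return 0;
}
```

Wherever the check ends up living, it would presumably have to run before the port is reserved, so that an unprivileged caller never holds a low port even briefly.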
- Sean From mlleinin at hpcn.ca.sandia.gov Thu Apr 20 10:05:32 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 20 Apr 2006 10:05:32 -0700 Subject: [openib-general] IB + Dual-processor, dual-core Opteron + PCI-E In-Reply-To: <4F1EA4C3-6353-4956-B99D-28375425A2CC@hpc.ufl.edu> References: <4F1EA4C3-6353-4956-B99D-28375425A2CC@hpc.ufl.edu> Message-ID: <1145552732.24662.393.camel@localhost> On Thu, 2006-04-20 at 10:09 -0400, Charles Taylor wrote: > > We have 202 node cluster where each node is configured as follows... > > dual-processor, dual-core Opteron 275 > Asus K8N-DRE Motherboard > TopSpin/Cisco LionCub HCA in a 16x PCI-E slot > 4 GB RAM (DDR 400) > > IB fabric is two-tiered fat tree with 14 Cisco 7000 switches on the > edge and Cisco 7008s (2) in the > first tier. > > We can scale HPL runs up to about 136 nodes/544 cpus reliably on any > set of nodes. Above that > number of nodes/processors, our HPL runs begin to fail residuals. > We can run across all 202 nodes > successfully if we use only two procs/node but 4 procs/node will > *always* fail residuals. It feels like > a data corruption issue in the IB stack. > > We have tried various combinations of the following software. > > Kernel: 2.6.9-22, 2.6.9-34 > IB stack: topspin 3.2.0b82, OpenIB (IBGD 1.8.2) > MPI: mvapich 092/095 (topspin), mvapich 096 (osu), OpenMPI 1.0.2 > Blas Libs: Goto 1.00, 1.02, ACM 3.0.0 > > The result is the same in every case. We seem to be able run HPL > reliably up to about 544 - 548 processors. It doesn't > matter whether we run one mpi task per processor or 1 mpi task per > node with OMP_NUM_THREADS=4. The result > is always failed HPL residuals when we run across any subset of the > cluster above about 136 nodes using all four procs. 
> > I'm wondering if anyone knows of any other large IB clusters using > dual-processor, dual-core Opterons + PCI-E with more > than 136 nodes and if so, have they been able to successfully scale > MPI apps across their entire cluster? > Charles, If you see the problem after trying various combinations of the software you listed above, then it's likely a hardware issue. I know of several ~256 node dual-proc dual-core Opteron IB clusters that are running linpack. I've heard there can be issues with "silent" data corruption on the Opteron CPUs if they get too hot. Are you monitoring the node/cpu temps? If CPU temp is an issue you should see a problem whether you are running a single linpack across all 202 nodes, or running simultaneous smaller linpacks (say 4 50 node runs). I'll see if I can find the bug report for this problem. - Matt From halr at voltaire.com Thu Apr 20 10:00:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Apr 2006 13:00:43 -0400 Subject: [openib-general] RFC userspace / MPI multicast support In-Reply-To: References: Message-ID: <1145552443.23359.631.camel@hal.voltaire.com> On Thu, 2006-04-20 at 12:58, Sean Hefty wrote: > >On Wed, 2006-04-19 at 15:05, Sean Hefty wrote: > >> I'd like to get some feedback regarding the following approach to supporting > >> multicast groups in userspace, and in particular for MPI. Based on side > >> conversations, I need to know if this approach would meet the needs of MPI > >> developers. > >> > >> To join / leave a multicast group, > > > >MC groups also need to be created and deleted as well. Creating and > >deleting the group are assumed under the covers (first joiner, last > >leaver) so the additional MC parameters for creation need to be > >available on all adds. > > Creation / deletion would be automatic. The creation parameters for > RDMA_PROTO_IP would use the same settings as the ipoib broadcast group. > > >> /* Bind to multicast group. 
*/ > >> mcast_ip = 224.0.0.74.71; /* some fine mcast addr */ > > > >How are the MGIDs formed from this IP address ? Is the same algorithm as > >IPoIB used ? > > > >Are the MGIDs constrained to use 0x401B in the signature part (and > >0x601B if this is extended to IPv6) ? > > The MGIDs would be formed using the same algorithm as ipoib. I hadn't decided > on whether to use the same signature, or a different one. My initial thought > was to use a different signature, but I'm not sure that it's necessary. Guess it comes down to how much control is needed over the entire MGID by MPI as well as whether they can share the IPoIB broadcast group characteristics for all their multicast groups. Also, is IPoIB always setup when running MPI ? -- Hal > >BTW, this example has too many bytes... > > Just a typo... > > >> ip_mreq.imr_multiaddr = mcast_ip.in_addr; > >> rdma_set_option(id, RDMA_PROTO_IP, IP_ADD_MEMBERSHIP, &ip_mreq, > >> sizeof(ip_mreq)); > > > >The API only supports ADD/DROP. It lacks support for JoinStates. > >(I don't think the IP semantics are rich enough for IB; this was > >previously pointed out in the context of IP routers quite a while ago). > > Additional join states are IB specific, so would be handled by using the > RDMA_PROTO_IB option. As an alternative, we could replace IP_ADD_MEMBERSHIP > with RDMA_ADD_FULL_MEMBER, RDMA_ADD_SEND_MEMBER, etc. > > >> The multicast group information is created / managed by the rdma_cm. The > >> rdma_cm defines the mgid, q_key, p_key, sl, flowlabel, tclass, and joinstate. > >> Except for mgid, these would most likely match the values used by the ipoib > >> broadcast group. The mgid mapping would be similar to that used by ipoib. > > > >Does that limit the MGIDs to use IP signatures ? > > Yes - unless the RDMA_PROTO_IB option were used. 
> > - Sean From sean.hefty at intel.com Thu Apr 20 10:15:32 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 20 Apr 2006 10:15:32 -0700 Subject: [openib-general] [RFC] remove kernel drivers from svn In-Reply-To: Message-ID: >What would people think of removing the kernel drivers from our svn >tree? I see several reasons to do this: I brought up moving from svn to git in a PathForward meeting yesterday. I was asked to send an email to the openib mail list describing the benefits of the move, most of which you just gave. I think as long as we can get a majority of the maintainers to agree to such a move, we can do it. The one concern that was brought up that I haven't looked at yet dealt with how different accounts are handled with git. - Sean From mst at mellanox.co.il Thu Apr 20 10:22:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 20 Apr 2006 20:22:04 +0300 Subject: [openib-general] Re: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: References: <20060420161715.GT1792@mellanox.co.il> Message-ID: <20060420172204.GA6876@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address > > >One small note: ipv4 on linux does this: > > err = -EACCES; > > if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE)) > > goto out; > > > >disabling bind to ports 1-1023 for non-priveledged users. > > > >Do you want to add such a check in CMA, or does it belong in SDP in your > >opinion? > > I would think this check belongs in the kernel ucma, which would require adding > it to SDP as well. > > Which module is the check listed above done in? I want to understand where this > check is made before adding it. For ipv4 sockets it's done in net/ipv4/af_inet.c. grep for CAP_NET_BIND_SERVICE and you'll see it for other protocols. -- MST From mst at mellanox.co.il Thu Apr 20 10:23:52 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 20 Apr 2006 20:23:52 +0300 Subject: [openib-general] Re: Fwd: a bug In-Reply-To: References: <20060420123025.GK1792@mellanox.co.il> Message-ID: <20060420172352.GB6876@mellanox.co.il> Quoting r. Roland Dreier : > GuidInfo records have 8 byte GUIDs in then, in them? -- MST From sean.hefty at intel.com Thu Apr 20 10:26:47 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 20 Apr 2006 10:26:47 -0700 Subject: [openib-general] RFC userspace / MPI multicast support In-Reply-To: <1145552443.23359.631.camel@hal.voltaire.com> Message-ID: >Guess it comes down to how much control is needed over the entire MGID >by MPI as well as whether they can share the IPoIB broadcast group >characteristics for all their multicast groups. Yes - it's a matter of control versus ease of use. An app that wants control would still have it, but then has to come up with their own sl, flowlabel, and tclass values when joining the group. At some point, I'd just not have to require a specialized IB network administrator in order to run every application... >Also, is IPoIB always setup when running MPI ? The RDMA_PROTO_IP option requires ipoib. RDMA_PROTO_IB works directly at the IB level, so wouldn't require it. Of course, ipoib is needed if RC QPs are connected over the rdma_cm. - Sean From rdreier at cisco.com Thu Apr 20 10:30:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Apr 2006 10:30:11 -0700 Subject: [openib-general] Re: Fwd: a bug In-Reply-To: <20060420172352.GB6876@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 20 Apr 2006 20:23:52 +0300") References: <20060420123025.GK1792@mellanox.co.il> <20060420172352.GB6876@mellanox.co.il> Message-ID: Michael> in them? yup... fixed in my kernel changelog (no way to fix svn) - R. 
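On the MGID question in the multicast thread above, a minimal sketch of the IPoIB-style mapping may help. It assumes the IPv4-over-IB layout from RFC 4391 (0xff, flags and scope, a 16-bit signature, the P_Key, then the low 28 bits of the group address). ipv4_mcast_to_mgid() and its parameters are hypothetical names, not rdma_cm API, and the signature is left as a parameter because the thread had not settled whether the rdma_cm would reuse 0x401B.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: build an IPoIB-style MGID for an IPv4 multicast
 * group. 'group' is the IPv4 address in host byte order, 'scope' is the
 * 4-bit multicast scope, 'signature' is 0x401B for IPoIB IPv4.
 */
static void ipv4_mcast_to_mgid(uint32_t group, uint16_t pkey, uint8_t scope,
			       uint16_t signature, uint8_t mgid[16])
{
	memset(mgid, 0, 16);
	mgid[0] = 0xff;				/* multicast prefix */
	mgid[1] = 0x10 | (scope & 0x0f);	/* flags = 1 (transient), scope */
	mgid[2] = signature >> 8;		/* signature, e.g. 0x40 */
	mgid[3] = signature & 0xff;		/* signature, e.g. 0x1b */
	mgid[4] = pkey >> 8;			/* P_Key of the partition */
	mgid[5] = pkey & 0xff;
	/* bytes 6..11 stay zero; only the low 28 bits of the group are used */
	mgid[12] = (group >> 24) & 0x0f;
	mgid[13] = (group >> 16) & 0xff;
	mgid[14] = (group >> 8) & 0xff;
	mgid[15] = group & 0xff;
}
```

Passing a different signature here is all it would take to keep rdma_cm groups apart from the ones ipoib creates, which is the open point Sean mentions.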
From mshefty at ichips.intel.com Thu Apr 20 10:31:32 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Apr 2006 10:31:32 -0700 Subject: [openib-general] [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: References: Message-ID: <4447C574.1090609@ichips.intel.com> I've committed this patch. Roland, can you please queue this on top of the rdma_cm patch for 2.6.18? - Sean From rdreier at cisco.com Thu Apr 20 10:31:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Apr 2006 10:31:56 -0700 Subject: [openib-general] [RFC] remove kernel drivers from svn In-Reply-To: <000101c6649b$e1b14f00$33a9070a@amr.corp.intel.com> (Bob Woodruff's message of "Thu, 20 Apr 2006 10:00:09 -0700") References: <000101c6649b$e1b14f00$33a9070a@amr.corp.intel.com> Message-ID: Bob> Having a local "open" repository (git or SVN) for code under Bob> development before it is pushed upstream seems like is still Bob> needed, iSer, SDP, etc. I'm happy to carry branches with development code in my git tree. And git is distributed, so it's easy for anyone to publish their WIP if they want. Bob> I think that the working openib database (SVN or git) must be Bob> in the open. openib bylaws might even require it. I don't think we should let the bylaws force us to do something that's technically wrong. "...the law is an ass..." - R. From jlentini at netapp.com Thu Apr 20 10:35:32 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 20 Apr 2006 13:35:32 -0400 (EDT) Subject: [openib-general] [RFC] remove kernel drivers from svn In-Reply-To: <000001c66497$49fe4e00$33a9070a@amr.corp.intel.com> References: <000001c66497$49fe4e00$33a9070a@amr.corp.intel.com> Message-ID: On Thu, 20 Apr 2006, Bob Woodruff wrote: > Roland wrote, > > - Several groups are already using their own separate repositories > > anyway, so svn doesn't have all of the latest and greatest anyway. 
> > (For example both ehca and ipath are really maintained in private > > repositories) > >- Pulling patches out of svn to merge with a git tree generates extra > > time-wasting busy-work, both for the submitter and for me. > > > - svn kernel drivers dilute testing attention from the upstream > > kernel, which means that upstream is not as high-quality as possible. > > >Getting rid of svn would encourage new features to be developed as > >self-contained patch sets, which makes an eventual upstream merge much > >easier. And of course having git as our core repository would make > >developing on a branch far far easier. > > > - R. > > I don't have a problem moving from SVN to something like git if it makes > the kernel development and pushing upstream easier as long as the database > is available in the open and not kept in some private repository, which is > not the open source way. > > I would also like to see all components user and kernel use the same source > control tool so that I don't have to install learn and maintain lots of > different tools. I also find having the user and kernel code in the same repository useful. > my 2 cents, > > woody From jlentini at netapp.com Thu Apr 20 10:39:13 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 20 Apr 2006 13:39:13 -0400 (EDT) Subject: [openib-general] [RFC] remove kernel drivers from svn In-Reply-To: References: Message-ID: On Thu, 20 Apr 2006, Roland Dreier wrote: > What would people think of removing the kernel drivers from our svn > tree? I see several reasons to do this: > > - Several groups are already using their own separate repositories > anyway, so svn doesn't have all of the latest and greatest anyway. > (For example both ehca and ipath are really maintained in private > repositories) I'm surprised. I didn't know that this was the case. Why are they using private repositories? Will moving to your git tree change this? 
> - Pulling patches out of svn to merge with a git tree generates extra > time-wasting busy-work, both for the submitter and for me. > > - svn kernel drivers dilute testing attention from the upstream > kernel, which means that upstream is not as high-quality as possible. > > Getting rid of svn would encourage new features to be developed as > self-contained patch sets, which makes an eventual upstream merge much > easier. And of course having git as our core repository would make > developing on a branch far far easier. > > - R. From mst at mellanox.co.il Thu Apr 20 10:52:08 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 20 Apr 2006 20:52:08 +0300 Subject: [openib-general] Re: Re: Fwd: a bug In-Reply-To: References: <20060420123025.GK1792@mellanox.co.il> <20060420172352.GB6876@mellanox.co.il> Message-ID: <20060420175208.GE6876@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Re: Fwd: a bug > > Michael> in them? > > yup... fixed in my kernel changelog (no way to fix svn) Actually, you can do this with svn. From the svn book: For example, you might want to replace the commit log message of an existing revision. 27 $ svn propset svn:log '* button.c: Fix a compiler warning.' -r11 --revprop Look for more examples under svn propset. -- MST From rdreier at cisco.com Thu Apr 20 10:54:16 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Apr 2006 10:54:16 -0700 Subject: [openib-general] Re: Fwd: a bug In-Reply-To: <20060420175208.GE6876@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 20 Apr 2006 20:52:08 +0300") References: <20060420123025.GK1792@mellanox.co.il> <20060420172352.GB6876@mellanox.co.il> <20060420175208.GE6876@mellanox.co.il> Message-ID: Michael> Look for more examples under svn propset. Yeah, but our repo doesn't have that enabled. - R. 
From jlentini at netapp.com Thu Apr 20 10:55:35 2006
From: jlentini at netapp.com (James Lentini)
Date: Thu, 20 Apr 2006 13:55:35 -0400 (EDT)
Subject: [openib-general] [RFC] remove kernel drivers from svn
In-Reply-To:
References:
Message-ID:

On Thu, 20 Apr 2006, Sean Hefty wrote:

> >What would people think of removing the kernel drivers from our svn
> >tree? I see several reasons to do this:
>
> I brought up moving from svn to git in a PathForward meeting
> yesterday. I was asked to send an email to the openib mail list
> describing the benefits of the move, most of which you just gave.
> I think as long as we can get a majority of the maintainers to agree
> to such a move, we can do it.

Roland is proposing moving the repository location (from OpenFabrics.org to kernel.org) and changing the version control tool.

Are you also in favor of doing both, or just changing the version control tool to git?

I'm indifferent with regards to the version control tool, but I would like to see the repository remain unified. As a developer of both a userspace library and kernel ULP, I find it very convenient to be able to pull all the code from one place.

Roland brings up some good points with regards to upstream testing and merge overhead.

With regards to upstream testing, would targeting the OpenFabrics repository at the latest prepatch release instead of the stable release help?

How much of the merging overhead would go away if OpenFabrics used git?

> The one concern that was brought up that I haven't looked at yet
> dealt with how different accounts are handled with git.

What about the revision history? Can we convert our svn logs to git?
From bos at pathscale.com Thu Apr 20 10:58:54 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 20 Apr 2006 10:58:54 -0700
Subject: [openib-general] [RFC] remove kernel drivers from svn
In-Reply-To:
References:
Message-ID: <1145555934.2057.18.camel@chalcedony.pathscale.com>

On Thu, 2006-04-20 at 09:04 -0700, Roland Dreier wrote:
> What would people think of removing the kernel drivers from our svn
> tree?

From a driver maintainer's perspective, this is okay by me. We just drop big patches into Subversion periodically to sync it up with our internal tree, while we feed "real" patches to you for your git tree, so this eliminates a step.

I don't think it will help with the confusion over what's going to go to Linus soon versus stuff that's just sitting in the tree for a long time, though. For example, if asked, I can't articulate why there's a one-line difference between the ipath driver in SVN and in git, other than "some headers haven't been pushed yet, for some reason".

From the perspective of people consuming this stuff (who greatly outnumber people actually hacking on the kernel tree), it might make sense to just drop a kernel snapshot into the Subversion tree on a regular basis, as a convenience, so that those who don't wish to need not put up with the idiosyncrasies of git.

(James Lentini's message of "Thu, 20 Apr 2006 13:39:13 -0400 (EDT)")
References:
Message-ID:

James> Why are they using private repositories? Will moving to
James> your git tree change this?

Don't know why. Moving to git doesn't fix this, but at least it emphasizes the "upstream or bust" message and gets rid of the confusing svn repository, which doesn't have the latest and greatest anyway and distracts from testing upstream.

- R.
From mshefty at ichips.intel.com Thu Apr 20 11:01:22 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Apr 2006 11:01:22 -0700 Subject: [openib-general] [RFC] remove kernel drivers from svn In-Reply-To: References: Message-ID: <4447CC72.60302@ichips.intel.com> Sean Hefty wrote: > The one concern that was brought up that I haven't looked at yet dealt with how > different accounts are handled with git. From what I can tell, it looks like git uses system accounts: http://www.kernel.org/pub/software/scm/git/docs/everyday.html#Repository%20Administration Matt mentioned that this may be a problem for the labs hosting the repository. - Sean From rdreier at cisco.com Thu Apr 20 11:03:23 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Apr 2006 11:03:23 -0700 Subject: [openib-general] [RFC] remove kernel drivers from svn In-Reply-To: <1145555934.2057.18.camel@chalcedony.pathscale.com> (Bryan O'Sullivan's message of "Thu, 20 Apr 2006 10:58:54 -0700") References: <1145555934.2057.18.camel@chalcedony.pathscale.com> Message-ID: Bryan> From a driver maintainer's perspective, this is okay by me. Bryan> We just drop big patches into Subversion periodically to Bryan> sync it up with our internal tree, while we feed "real" Bryan> patches to you for your git tree, so this eliminates a Bryan> step. Exactly. If you're not using svn for development anyway, then I don't see the point of having your driver there. Bryan> From the perspective of people consuming this stuff (who Bryan> greatly outnumber people actually hacking on the kernel Bryan> tree), it might make sense to just drop a kernel snapshot Bryan> into the Subversion tree on a regular basis, as a Bryan> convenience, so that those who don't wish to need not put Bryan> up with the idiosyncracies of git. We can just let people grab kernel snapshots from kernel.org I think. - R. 
From bos at pathscale.com Thu Apr 20 11:06:50 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 20 Apr 2006 11:06:50 -0700 Subject: [openib-general] [RFC] remove kernel drivers from svn In-Reply-To: References: Message-ID: <1145556410.2057.27.camel@chalcedony.pathscale.com> On Thu, 2006-04-20 at 13:39 -0400, James Lentini wrote: > > - Several groups are already using their own separate repositories > I'm surprised. I didn't know that this was the case. > > Why are they using private repositories? I can't speak for other driver authors, but we started working on our drivers long before OpenFabrics was a gleam in any consortium member's eye. Between that historical fact and the unsuitability of Subversion for serious development, the die was cast. > Will moving to your git tree change this? Realistically, no. We still need to vet patches, cherrypick them, and tidy them up before we send them along to Roland, and I suspect that other driver authors are in a similar position. References: <1145555934.2057.18.camel@chalcedony.pathscale.com> Message-ID: <1145556654.2057.32.camel@chalcedony.pathscale.com> On Thu, 2006-04-20 at 11:03 -0700, Roland Dreier wrote: > We can just let people grab kernel snapshots from kernel.org I think. That sounds OK to me, though I wonder how it will sit with non-kernel-developers. I think some downstream people are actually pulling kernel code from Subversion at the moment, which strikes me as probably a mistaken thing to do. References: Message-ID: <1145556844.14924.5.camel@phosphene.durables.org> On Thu, 2006-04-20 at 11:00 -0700, Roland Dreier wrote: > James> Why are they using private repositories? Will moving to > James> your git tree change this? > > Don't know why. Moving to git doesn't fix this, but at least it > emphasizes the "upstream or bust" message and gets rid of the > confusing svn repository, which doesn't have the latest and greatest > anyway and distracts from testing upstream. 
I can't speak for other driver authors, but from the PathScale perspective, we did this for a number of reasons: * We were developing the driver a long time before we got involved in OpenFabrics. So notch part of it up to inertia. * We want to support multiple kernel versions without resorting to patches: i.e. #ifdef'ing the code. We strip this out before submitting to subversion or git, but it's all there in our private tree and in the RPMs we ship to our customers. * It's useful for us to keep stuff we're not planning on announcing private until we want to announce it, but still accessible to everyone in PathScale for testing, etc. * We very briefly considered using Subversion when we got involved in OpenFabrics, but just about every developer here shot the idea down as a step backwards, development-wise. Chalk that one up to religion. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com QLogic Corporation Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From mshefty at ichips.intel.com Thu Apr 20 11:18:48 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Apr 2006 11:18:48 -0700 Subject: [openib-general] [RFC] remove kernel drivers from svn In-Reply-To: References: Message-ID: <4447D088.20204@ichips.intel.com> James Lentini wrote: > Roland is proposing moving the repository location (from > OpenFabrics.org to kernel.org) and changing the version control tool. > > Are you also in favor of doing both or just changing the version > control tool to git. I would like to see the version control tool changed to git. Moving the repository to kernel.org may be easier for the labs. > I'm indifferent with regards to the version control tool, but I would > like to see the repository remain unified. As a developer of both a > userspace library and kernel ULP, I find it very convenient to be able > to pull all the code from one place. 
For myself, I usually pull the kernel code and userspace code from svn separately, but only because I update my kernel code more frequently than my userspace snapshot. I would like to have an easy way to pull all userspace code and use the same version control tool, but I'm indifferent about having a unified kernel and userspace repository.

- Sean

From iod00d at hp.com Thu Apr 20 11:29:47 2006
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 20 Apr 2006 11:29:47 -0700
Subject: [openib-general] [RFC] remove kernel drivers from svn
In-Reply-To:
References: <000001c66497$49fe4e00$33a9070a@amr.corp.intel.com>
Message-ID: <20060420182947.GC10891@esmail.cup.hp.com>

On Thu, Apr 20, 2006 at 09:39:56AM -0700, Roland Dreier wrote:
> I guess I was trying to say that Linus's tree is really the kernel
> repository. Trying to pretend that our kernel drivers are independent
> of the kernel leads to a very confusing situation.

The openib SVN tree is the equivalent of the -mm trees and many other public trees used for development. parisc-linux isn't ready to obsolete its development tree because of significant outstanding differences with linus/andrew. The ia64-linux tree has been obsoleted. Tony Luck (ia64 maintainer) only maintains a patch tree like you are suggesting. I don't know where openib devel falls between those two and how the next round of "transport neutral" changes will affect the tree.

> I would really like to see people who want to run bleeding-edge stuff
> grab something along the lines of
> http://kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.17-rc2-git2.bz2
> or if they're very brave pull my for-2.6.18 or for-mm tree. Otherwise
> we end up wasting their testing attention on something different from
> the upstream kernel.

If using quilt or git makes that easy, that's fine with me. Update the wiki or "How to test this" web pages with the replacement for SVN. I'll also note that moving away from SVN will cause some turmoil for the Windows OpenIB efforts.
You have to trade off how hard it is to test the development trees (i.e. how easy it is to update/pull patches and integrate with a tree from linus) vs merging suitable patches back to linus. The size and number of patches outstanding at any given time should be a clue to how well this might work.

I also think testing upstream kernels isn't realistic for "the general population" since folks need (a) the HW and (b) some easy-to-run (test) applications. Asking openib developers to test both upstream and development trees isn't realistic IMHO. Don't forget that they eventually end up testing multiple distro trees (SuSE/RH) as well. But I know the risks of not testing kernel.org trees - those bits eventually land in a distro.

hth,
grant

From caitlinb at broadcom.com Thu Apr 20 11:33:49 2006
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Thu, 20 Apr 2006 11:33:49 -0700
Subject: [openib-general] RFC userspace / MPI multicast support
Message-ID: <54AD0F12E08D1541B826BE97C98F99F143A816@NT-SJCA-0751.brcm.ad.broadcom.com>

openib-general-bounces at openib.org wrote:
> I'd like to get some feedback regarding the following
> approach to supporting multicast groups in userspace, and in
> particular for MPI. Based on side conversations, I need to
> know if this approach would meet the needs of MPI developers.
>
> To join / leave a multicast group, my proposal is to add the
> following APIs to the rdma_cm. (Note I haven't implemented
> this yet, so I'm just assuming that it's possible at this point.)
>
> /* Asynchronously join a multicast group. */
> int rdma_set_option(struct rdma_cm_id *id, int level, int optname,
>                     void *optval, size_t optlen);
>
> /* Retrieve multicast group information - not usually called. */
> int rdma_get_option(struct rdma_cm_id *id, int level, int optname,
>                     void *optval, size_t optlen);
>
> /*
> * Post a message on the QP associated with the cm_id for the
> * specified multicast address.
> */
> int rdma_sendto(struct rdma_cm_id *id, struct ibv_send_wr *send_wr,
>                 struct sockaddr *to);
>

If we are going to add multicast group logic to rdma_cm then we probably should have a multicast definition that encompasses UDP as well as IB UD.

For the most part, implementing a "UD QP" over UDP in pure software would be fairly simple. The key issue is that most RNICs do not have stateful offload for UDP, and therefore the work completions for the "UD QP" would be generated on the host rather than on the RNIC.

I have a question for applications that want to use multicast and point-to-point. Is it acceptable to require that UD QPs feed *different* CQs than the CQs fed by RC QPs? Such a restriction would allow software "UD QPs" to feed software "UD CQs" without having to complicate the existing hardware CQs.

With that issue solved, the remaining issues look very mappable.

From robert.j.woodruff at intel.com Thu Apr 20 11:46:38 2006
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Thu, 20 Apr 2006 11:46:38 -0700
Subject: [openib-general] [RFC] remove kernel drivers from svn
In-Reply-To: <1145556654.2057.32.camel@chalcedony.pathscale.com>
Message-ID: <000201c664aa$c2cdc140$33a9070a@amr.corp.intel.com>

Bryan wrote,
>That sounds OK to me, though I wonder how it will sit with
>non-kernel-developers. I think some downstream people are actually
>pulling kernel code from Subversion at the moment, which strikes me as
>probably a mistaken thing to do.
>
References:
Message-ID:

On Tue, 18 Apr 2006, Sean Hefty wrote:

> Assign/reserve a port number when binding a cm_id. If no port number is
> given, assign one from the local port space. If a port number is given,
> reserve it.
>
> The RDMA port space is separate from that used for TCP. iWarp devices
> will need to coordinate between the port values assigned by the rdma_cm
> and those in use by TCP. SDP also has its own port space.

What rdma cm calls do you expect the active consumer to make?
In NFS-RDMA we do this: rdma_create_id() rdma_resolve_addr() rdma_resolve_route() rdma_create_qp() rdma_connect() We don't need to call rdma_bind_addr(), and hence won't have a port number assigned. Did you consider automatically assigning a port in connect? Something along the lines of if (cma_any_port()) cma_alloc_port() One comment below: > +static int cma_get_port(struct rdma_id_private *id_priv) > +{ > + struct idr *ps; > + int ret; > + > + switch (id_priv->id.ps) { > + case RDMA_PS_SDP: > + ps = &sdp_ps; > + break; > + case RDMA_PS_TCP: > + ps = &tcp_ps; > + break; > + default: > + return -EPROTONOSUPPORT; > + } Do you plan to add support for UDP and SCTP since they have rdma_port_space values? Is it as simple as adding a UDP and SCTP idr? > + > + mutex_lock(&lock); > + if (cma_any_port(&id_priv->id.route.addr.src_addr)) > + ret = cma_alloc_port(ps, id_priv, 0); > + else > + ret = cma_use_port(ps, id_priv); > + mutex_unlock(&lock); > + > + return ret; > +} From sean.hefty at intel.com Thu Apr 20 12:06:11 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 20 Apr 2006 12:06:11 -0700 Subject: [openib-general] [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: Message-ID: >rdma_create_id() >rdma_resolve_addr() >rdma_resolve_route() >rdma_create_qp() >rdma_connect() > >We don't need to call rdma_bind_addr(), and hence won't have a port >number assigned. > >Did you consider automatically assigning a port in connect? Something >along the lines of > > if (cma_any_port()) > cma_alloc_port() > rdma_bind_addr() is called for the user in rdma_resolve_addr(), unless the user has already bound the cm_id. 
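The assign-or-reserve behaviour described in the patch can be sketched in plain C. This is a simplified userspace model under stated assumptions, not the CMA code: struct port_space and the ps_*() helpers are hypothetical, the real patch keeps each port space in an idr under a mutex, and the dynamic range starting at 1024 is assumed for illustration only.

```c
#include <errno.h>
#include <stdbool.h>

#define PS_FIRST_DYNAMIC 1024	/* assumed start of the dynamic range */
#define PS_NUM_PORTS     65536

/* Hypothetical port space: one "in use" flag per port number. */
struct port_space {
	bool used[PS_NUM_PORTS];
};

/* No port given: hand out the lowest free port in the dynamic range. */
static int ps_alloc_port(struct port_space *ps)
{
	for (int port = PS_FIRST_DYNAMIC; port < PS_NUM_PORTS; port++) {
		if (!ps->used[port]) {
			ps->used[port] = true;
			return port;
		}
	}
	return -EADDRNOTAVAIL;
}

/* Port given: reserve exactly that port, or fail if it is taken. */
static int ps_use_port(struct port_space *ps, int port)
{
	if (port <= 0 || port >= PS_NUM_PORTS)
		return -EINVAL;
	if (ps->used[port])
		return -EADDRINUSE;
	ps->used[port] = true;
	return port;
}

/* port == 0 plays the role of the "any port" case in the patch. */
static int ps_get_port(struct port_space *ps, int port)
{
	return port == 0 ? ps_alloc_port(ps) : ps_use_port(ps, port);
}
```

Because each port space is its own object, reserving a port in one space (say the one used for RDMA_PS_TCP) does not affect the same port number in another (say RDMA_PS_SDP), which matches the separate sdp_ps and tcp_ps in the patch. An active-side caller that never binds explicitly would go through the port == 0 path.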
>> +static int cma_get_port(struct rdma_id_private *id_priv) >> +{ >> + struct idr *ps; >> + int ret; >> + >> + switch (id_priv->id.ps) { >> + case RDMA_PS_SDP: >> + ps = &sdp_ps; >> + break; >> + case RDMA_PS_TCP: >> + ps = &tcp_ps; >> + break; >> + default: >> + return -EPROTONOSUPPORT; >> + } > >Do you plan to add support for UDP and SCTP since they have >rdma_port_space values? Is it as simple as adding a UDP and SCTP idr? Adding the port space values should be as simple as adding the UDP / SCTP idr's. I'm just not as sure that SCTP support is handled in other locations in the code, and I don't think it makes sense to have UDP connections. I deferred both of these for now. - Sean From rdreier at cisco.com Thu Apr 20 12:09:49 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Apr 2006 12:09:49 -0700 Subject: [openib-general] [PATCH 1/2] Trivial warning fix for libehca Message-ID: At least on my system running the Debian testing toolchain, I get warnings "implicit declaration of function 'free'" when building libehca. This trivial patch adds includes of <stdlib.h> to pick up the declaration of free(). Signed-off-by: Roland Dreier --- src/userspace/libehca/src/ehca_u_mrmw.c (revision 6541) +++ src/userspace/libehca/src/ehca_u_mrmw.c (working copy) @@ -40,6 +40,7 @@ #define DEB_PREFIX "umrw" +#include <stdlib.h> #include #include "ehca_utools.h" --- src/userspace/libehca/src/ehca_umain.c (revision 6541) +++ src/userspace/libehca/src/ehca_umain.c (working copy) @@ -50,6 +50,7 @@ #include "ipzu_pt_fn.h" #include +#include <stdlib.h> #include #include #include From rdreier at cisco.com Thu Apr 20 12:13:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Apr 2006 12:13:28 -0700 Subject: [openib-general] [PATCH 2/2] Wean libehca off of libsysfs In-Reply-To: (Roland Dreier's message of "Thu, 20 Apr 2006 12:09:49 -0700") References: Message-ID: As discussed in I would like to start moving the libibverbs interface to lower-level drivers away from using libsysfs data structures.
This patch implements that scheme for libehca, and adds an ibv_driver_init() entry point in a backwards compatible way: it will work with existing releases of libibverbs 1.0, and should be source compatible with libibverbs 1.1. Compile tested only, as I don't have ehca hardware (yet...). Please test and apply if it looks good to you. Signed-off-by: Roland Dreier --- src/userspace/libehca/configure.in (revision 6541) +++ src/userspace/libehca/configure.in (working copy) @@ -16,6 +16,9 @@ AC_CHECK_LIB(ibverbs, [], AC_MSG_ERROR([libibverbs not installed])) +dnl Checks for library functions +AC_CHECK_FUNCS(ibv_read_sysfs_file) + dnl Checks for programs. AC_PROG_CC AC_OUTPUT([Makefile]) --- src/userspace/libehca/src/ehca_uinit.c (revision 6541) +++ src/userspace/libehca/src/ehca_uinit.c (working copy) @@ -38,12 +38,19 @@ * $Id: ehca_uinit.c,v 1.6 2006/04/11 13:45:31 nguyen Exp $ */ +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + #include #include #include #include #include #include +#include +#include +#include #include "ehca_uclasses.h" @@ -144,42 +151,58 @@ static struct ibv_device_ops ehcau_dev_o .free_context = ehcau_free_context }; -struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) +/* + * Keep a private implementation of HAVE_IBV_READ_SYSFS_FILE to handle + * old versions of libibverbs that didn't implement it. This can be + * removed when libibverbs 1.0.3 or newer is available "everywhere." 
+ */ +#ifndef HAVE_IBV_READ_SYSFS_FILE +static int ibv_read_sysfs_file(const char *dir, const char *file, + char *buf, size_t size) +{ + char path[256]; + int fd; + int len; + + snprintf(path, sizeof path, "%s/%s", dir, file); + + fd = open(path, O_RDONLY); + if (fd < 0) + return -1; + + len = read(fd, buf, size); + + close(fd); + + if (len > 0 && buf[len - 1] == '\n') + buf[--len] = '\0'; + + return len; +} +#endif /* HAVE_IBV_READ_SYSFS_FILE */ + +struct ibv_device *ibv_driver_init(const char *uverbs_sys_path, + int abi_version) { struct ehcau_device *my_dev = NULL; - struct sysfs_device *sysfs_dev = NULL; - struct sysfs_attribute *sysfs_attr = NULL; - char *dev_name = NULL; + char value[64]; int num_ports = 0; EDEB_EN(7, ""); - /* check devices existence */ - sysfs_dev = sysfs_get_classdev_device(sysdev); - if (sysfs_dev == NULL) { - return NULL; - } - sysfs_attr = sysfs_get_device_attr(sysfs_dev, "name"); - if (sysfs_attr == NULL) { + if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", + value, sizeof value) < 0) return NULL; - } - if (asprintf(&dev_name, "%s", sysfs_attr->value)<0) { - return NULL; - } - sysfs_close_attribute(sysfs_attr); - if (strcmp("lhca", str_strip(dev_name)) != 0) { - free(dev_name); + + if (strcmp("lhca", str_strip(value)) != 0) return NULL; - } - free(dev_name); - sysfs_attr = sysfs_get_device_attr(sysfs_dev, "num_ports"); - if (sysfs_attr == NULL) { + if (ibv_read_sysfs_file(uverbs_sys_path, "device/num_ports", + value, sizeof value) < 0) return NULL; - } - sscanf(sysfs_attr->value, "%i", &num_ports); - sysfs_close_attribute(sysfs_attr); + + sscanf(value, "%i", &num_ports); if (num_ports<1) { return NULL; } @@ -188,7 +211,7 @@ struct ibv_device *openib_driver_init(st my_dev = malloc(sizeof *my_dev); if (!my_dev) { fprintf(stderr, "Fatal: couldn't allocate device for %s\n", - sysdev->name); + uverbs_sys_path); abort(); } @@ -198,6 +221,18 @@ struct ibv_device *openib_driver_init(st return &my_dev->ibv_dev; } +struct ibv_device 
*openib_driver_init(struct sysfs_class_device *sysdev) +{ + int abi_ver = 0; + char value[8]; + + if (ibv_read_sysfs_file(sysdev->path, "abi_version", + value, sizeof value) > 0) + abi_ver = strtol(value, NULL, 10); + + return ibv_driver_init(sysdev->path, abi_ver); +} + /** @brief module initialization */ int libehca_trlevel = 5; --- src/userspace/libehca/src/libehca.map (revision 6541) +++ src/userspace/libehca/src/libehca.map (working copy) @@ -1,5 +1,6 @@ LIBEHCA_1.0 { global: + ibv_driver_init; openib_driver_init; ehcau_query_qp; ehcau_send_wr_trigger; From mlleinin at hpcn.ca.sandia.gov Thu Apr 20 12:47:21 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 20 Apr 2006 12:47:21 -0700 Subject: [openib-general] RFC userspace / MPI multicast support In-Reply-To: <1145552443.23359.631.camel@hal.voltaire.com> References: <1145552443.23359.631.camel@hal.voltaire.com> Message-ID: <1145562441.24662.404.camel@localhost> On Thu, 2006-04-20 at 13:00 -0400, Hal Rosenstock wrote: > On Thu, 2006-04-20 at 12:58, Sean Hefty wrote: > > >On Wed, 2006-04-19 at 15:05, Sean Hefty wrote: > > >> I'd like to get some feedback regarding the following approach to supporting > > >> multicast groups in userspace, and in particular for MPI. Based on side > > >> conversations, I need to know if this approach would meet the needs of MPI > > >> developers. > > >> > > >> To join / leave a multicast group, > > > > > >MC groups also need to be created and deleted as well. Creating and > > >deleting the group are assumed under the covers (first joiner, last > > >leaver) so the additional MC parameters for creation need to be > > >available on all adds. > > > > Creation / deletion would be automatic. The creation parameters for > > RDMA_PROTO_IP would use the same settings as the ipoib broadcast group. > > > > >> /* Bind to multicast group. */ > > >> mcast_ip = 224.0.0.74.71; /* some fine mcast addr */ > > > > > >How are the MGIDs formed from this IP address ? 
Is the same algorithm as > > >IPoIB used ? > > > > > >Are the MGIDs constrained to use 0x401B in the signature part (and > > >0x601B if this is extended to IPv6) ? > > > > The MGIDs would be formed using the same algorithm as ipoib. I hadn't decided > > on whether to use the same signature, or a different one. My initial thought > > was to use a different signature, but I'm not sure that it's necessary. > > Guess it comes down to how much control is needed over the entire MGID > by MPI as well as whether they can share the IPoIB broadcast group > characteristics for all their multicast groups. > > Also, is IPoIB always setup when running MPI ? Not always. For most of the older VAPI based stack we never turned on IPoIB (or did and it didn't work). I don't think we want to assume IPoIB is always set up when MPI is running. - Matt > > -- Hal > > > >BTW, this example has too many bytes... > > > > Just a typo... > > > > >> ip_mreq.imr_multiaddr = mcast_ip.in_addr; > > >> rdma_set_option(id, RDMA_PROTO_IP, IP_ADD_MEMBERSHIP, &ip_mreq, > > >> sizeof(ip_mreq)); > > > > > >The API only supports ADD/DROP. It lacks support for JoinStates. > > >(I don't think the IP semantics are rich enough for IB; this was > > >previously pointed out in the context of IP routers quite a while ago). > > > > Additional join states are IB specific, so would be handled by using the > > RDMA_PROTO_IB option. As an alternative, we could replace IP_ADD_MEMBERSHIP > > with RDMA_ADD_FULL_MEMBER, RDMA_ADD_SEND_MEMBER, etc. > > > > >> The multicast group information is created / managed by the rdma_cm. The > > >> rdma_cm defines the mgid, q_key, p_key, sl, flowlabel, tclass, and joinstate. > > >> Except for mgid, these would most likely match the values used by the ipoib > > >> broadcast group. The mgid mapping would be similar to that used by ipoib. > > > > > >Does that limit the MGIDs to use IP signatures ? > > > > Yes - unless the RDMA_PROTO_IB option were used. 
> > > > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Thu Apr 20 13:09:26 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 20 Apr 2006 13:09:26 -0700 Subject: [openib-general] RFC userspace / MPI multicast support In-Reply-To: <1145562441.24662.404.camel@localhost> Message-ID: > Not always. For most of the older VAPI based stack we never turned on >IPoIB (or did and it didn't work). I don't think we want to assume >IPoIB is always set up when MPI is running. This means that you can't use the rdma_cm to establish connections. - Sean From Richard.Frank at oracle.com Thu Apr 20 13:28:15 2006 From: Richard.Frank at oracle.com (Richard Frank) Date: Thu, 20 Apr 2006 16:28:15 -0400 Subject: [openib-general] How do we prevent starvation say between TCP over IPOIB / and SRP traffic ? In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F143A655@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F143A655@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1145564895.25346.58.camel@localhost.localdomain> Currently, all I have is a problem to resolve. We will think about a general model. For Oracle to support running RAC on a single fabric - assuming the fabric is utilized for both network (inter-node cluster communication) plus storage I/O - we need to limit / control latencies of cluster network msgs such that increasing the storage I/O load does not impact the QoS requirements for cluster network communication. A globally set service level / QoS for cluster network traffic such that it has better / more stringent QoS requirements than storage I/O traffic would meet our needs. However, having the capability to define service levels on a per connection or ULP or process basis is interesting too.
On Wed, 2006-04-19 at 09:44 -0700, Caitlin Bestler wrote: > openib-general-bounces at openib.org wrote: > > Some application level protocols - require higher QoS levels than > > others - for various communication and I/O operations. > > > > For example, cluster inter-node health msgs have fixed > > latency requirements that if exceeded may result in > > unexpected node removals from the cluster. > > > > Are there any mechanisms available to the client process to > > manage the QoS level for the various supported ULPs > > (SDP,TCP,UDP,RDS,SRP,iSER,etc) either at the ULP level or > > some combination of process and ULP - or perhaps even at the > > connection level ? > > > > Using the same example, the cluster node monitors might set > > the priority / QoS level of the heart beats to be more > > important than normal SRP/iSER traffic to ensure no starvation ? > > > > > > Working up from hardware capabilities and trying to generalize > them probably won't lead anywhere. > > Do you have a model of the requirements for transport/device > neutral QP prioritization that would meet your needs? > From tom at opengridcomputing.com Thu Apr 20 14:05:17 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 20 Apr 2006 16:05:17 -0500 Subject: [openib-general] [PATCH][UVERBS][RFC] Exporting device node_type to user mode Message-ID: <1145567117.27405.38.camel@trinity.ogc.int> In order to support transport independent behavior for user-mode RDMA CMA clients we need to export the node_type to the user mode device attributes structure. The reason for this is that the user-mode CMA needs to behave differently for iWARP vs. IB transports when migrating QP state at connection setup and tear down. This patch adds the node_type to the device attributes structure for user-mode clients. Please have a look-see and let me know if this seems like a reasonable approach. 
Signed-off-by: Tom Tucker Index: userspace/libibverbs/include/infiniband/verbs.h =================================================================== --- userspace/libibverbs/include/infiniband/verbs.h (revision 6536) +++ userspace/libibverbs/include/infiniband/verbs.h (working copy) @@ -70,9 +70,30 @@ enum ibv_node_type { IBV_NODE_CA = 1, IBV_NODE_SWITCH, - IBV_NODE_ROUTER + IBV_NODE_ROUTER, + IBV_NODE_RNIC }; +enum ibv_transport_type { + IBV_TRANSPORT_IB=1, + IBV_TRANSPORT_IWARP=2 +}; + +static inline enum ibv_transport_type +ibv_node_get_transport(enum ibv_node_type node_type) +{ + switch (node_type) { + case IBV_NODE_CA: + case IBV_NODE_SWITCH: + case IBV_NODE_ROUTER: + return IBV_TRANSPORT_IB; + case IBV_NODE_RNIC: + return IBV_TRANSPORT_IWARP; + default: + return 0; + } +} + enum ibv_device_cap_flags { IBV_DEVICE_RESIZE_MAX_WR = 1, IBV_DEVICE_BAD_PKEY_CNTR = 1 << 1, @@ -138,6 +159,7 @@ uint16_t max_pkeys; uint8_t local_ca_ack_delay; uint8_t phys_port_cnt; + uint8_t node_type; }; enum ibv_mtu { Index: userspace/libibverbs/include/infiniband/kern-abi.h =================================================================== --- userspace/libibverbs/include/infiniband/kern-abi.h (revision 6536) +++ userspace/libibverbs/include/infiniband/kern-abi.h (working copy) @@ -192,7 +192,8 @@ __u16 max_pkeys; __u8 local_ca_ack_delay; __u8 phys_port_cnt; - __u8 reserved[4]; + __u8 node_type; + __u8 reserved[3]; }; struct ibv_query_port { Index: userspace/libibverbs/src/cmd.c =================================================================== --- userspace/libibverbs/src/cmd.c (revision 6536) +++ userspace/libibverbs/src/cmd.c (working copy) @@ -151,6 +151,7 @@ device_attr->max_pkeys = resp.max_pkeys; device_attr->local_ca_ack_delay = resp.local_ca_ack_delay; device_attr->phys_port_cnt = resp.phys_port_cnt; + device_attr->node_type = resp.node_type; return 0; } Index: linux-kernel/infiniband/include/rdma/ib_user_verbs.h 
=================================================================== --- linux-kernel/infiniband/include/rdma/ib_user_verbs.h (revision 6536) +++ linux-kernel/infiniband/include/rdma/ib_user_verbs.h (working copy) @@ -177,7 +177,9 @@ __u16 max_pkeys; __u8 local_ca_ack_delay; __u8 phys_port_cnt; - __u8 reserved[4]; + __u8 node_type; + __u8 reserved[3]; }; struct ib_uverbs_query_port { Index: linux-kernel/infiniband/core/uverbs_cmd.c =================================================================== --- linux-kernel/infiniband/core/uverbs_cmd.c (revision 6536) +++ linux-kernel/infiniband/core/uverbs_cmd.c (working copy) @@ -197,6 +197,7 @@ resp.max_pkeys = attr.max_pkeys; resp.local_ca_ack_delay = attr.local_ca_ack_delay; resp.phys_port_cnt = file->device->ib_dev->phys_port_cnt; + resp.node_type = file->device->ib_dev->node_type; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) From rdreier at cisco.com Thu Apr 20 14:13:00 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Apr 2006 14:13:00 -0700 Subject: [openib-general] [PATCH][UVERBS][RFC] Exporting device node_type to user mode In-Reply-To: <1145567117.27405.38.camel@trinity.ogc.int> (Tom Tucker's message of "Thu, 20 Apr 2006 16:05:17 -0500") References: <1145567117.27405.38.camel@trinity.ogc.int> Message-ID: Tom> In order to support transport independent behavior for Tom> user-mode RDMA CMA clients we need to export the node_type to Tom> the user mode device attributes structure. The reason for Tom> this is that the user-mode CMA needs to behave differently Tom> for iWARP vs. IB transports when migrating QP state at Tom> connection setup and tear down. Adding node_type to the libibverbs API is OK (for the 1.1 release series...), but I think it would be better to read the existing /sys/class/infiniband//node_type field in sysfs rather than adding it in to the query stuff. - R. 
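[Editorial sketch, not part of the original thread.] The suggestion above, reading node_type from sysfs instead of extending the query response, might look roughly like this in user space. The helper names are hypothetical and are not libibverbs API; the enum values mirror the ones proposed in the patch earlier in this thread. The sketch assumes only that the attribute file begins with a decimal node type number, and that assumption should be checked against the kernel in use.

```c
/* Sketch only: helper names are hypothetical, not libibverbs API. */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Values mirror the enums proposed in the patch under discussion. */
enum node_type { NODE_CA = 1, NODE_SWITCH, NODE_ROUTER, NODE_RNIC };
enum transport { TRANSPORT_UNKNOWN = 0, TRANSPORT_IB, TRANSPORT_IWARP };

/* Read the leading decimal integer from an attribute file.
 * Returns 0 on success or a negative errno on failure. */
static int read_attr_int(const char *path, long *value)
{
	char buf[64];
	char *end;
	FILE *f;

	assert(path && value);

	f = fopen(path, "r");
	if (!f)
		return -errno;
	if (!fgets(buf, sizeof buf, f)) {
		fclose(f);
		return -EIO;
	}
	fclose(f);

	errno = 0;
	*value = strtol(buf, &end, 10);
	if (end == buf || errno)
		return -EINVAL;   /* no digits, or value out of range */
	return 0;
}

/* Map a node type to the transport it implies. */
static enum transport node_type_to_transport(long node_type)
{
	switch (node_type) {
	case NODE_CA:
	case NODE_SWITCH:
	case NODE_ROUTER:
		return TRANSPORT_IB;
	case NODE_RNIC:
		return TRANSPORT_IWARP;
	default:
		return TRANSPORT_UNKNOWN;
	}
}
```

A caller would pass the full path of the device's node_type attribute to read_attr_int() and feed the result to node_type_to_transport(); keeping the parsing separate from the mapping means the mapping can be reused if the value ever arrives by another route.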
From ssbyrn at yahoo.com Thu Apr 20 14:24:11 2006 From: ssbyrn at yahoo.com (susan) Date: Thu, 20 Apr 2006 21:24:11 +0000 (UTC) Subject: [openib-general] Re: ib_send_cm_req failes with error -22 References: <44468CCE.6070202@ichips.intel.com> Message-ID: Sean Hefty ichips.intel.com> writes: > > -22 (EINVAL) indicates that one of the parameters is invalid. > The initial checks done by the ib_cm are: > > /* peer-to-peer not supported */ > if (param->peer_to_peer) > return -EINVAL; > > if (!param->primary_path) > return -EINVAL; > > if (param->qp_type != IB_QPT_RC && param->qp_type != IB_QPT_UC) > return -EINVAL; > > if (param->private_data && > param->private_data_len > IB_CM_REQ_PRIVATE_DATA_SIZE) > return -EINVAL; > > if (param->alternate_path && > (param->alternate_path->pkey != param->primary_path->pkey || > param->alternate_path->mtu != param->primary_path->mtu)) > return -EINVAL; > > Can you verify that the input parameter would pass these tests? > There are some more tests further down in the code that could > also return this same error if these all pass. Posting the > actual code that calls ib_send_cm_req() may also help debug the > problem. > > - Sean > Sean, the error is returned by cm_init_av_by_path routine in ib_send_cm_req(). the function fails because it is unable to lookup gid in its cache -- function ib_find_cached_gid() fails. i don't why gid isn't cached. did i missed something. where can i download stable version of ib gen2 stack? thanks, susan. From lindahl at pathscale.com Thu Apr 20 14:26:40 2006 From: lindahl at pathscale.com (Greg Lindahl) Date: Thu, 20 Apr 2006 14:26:40 -0700 Subject: [openib-general] RFC userspace / MPI multicast support In-Reply-To: <1145552443.23359.631.camel@hal.voltaire.com> References: <1145552443.23359.631.camel@hal.voltaire.com> Message-ID: <20060420212640.GD2315@greglaptop.internal.keyresearch.com> On Thu, Apr 20, 2006 at 01:00:43PM -0400, Hal Rosenstock wrote: > Also, is IPoIB always setup when running MPI ? 
That's an easy one: No. -- greg From rdreier at cisco.com Thu Apr 20 14:27:47 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Apr 2006 14:27:47 -0700 Subject: [openib-general] Re: ib_send_cm_req failes with error -22 In-Reply-To: (susan's message of "Thu, 20 Apr 2006 21:24:11 +0000 (UTC)") References: <44468CCE.6070202@ichips.intel.com> Message-ID: susan> i don't why gid isn't cached. did i missed something. Are you sure the GID matches one of the port's GIDs? You could add a couple of printks to ib_find_cached_gid() to print the GID it's searching for and each of the GIDs it checks against, and see if it looks sane. susan> where can i download stable version of ib gen2 stack? It's in kernel 2.6.16, so that's a good stable version. - R. From ssbyrn at yahoo.com Thu Apr 20 14:40:27 2006 From: ssbyrn at yahoo.com (Susan) Date: Thu, 20 Apr 2006 21:40:27 +0000 (UTC) Subject: [openib-general] Re: ib_send_cm_req failes with error -22 References: <44468CCE.6070202@ichips.intel.com> Message-ID: Roland Dreier cisco.com> writes: > > susan> i don't why gid isn't cached. did i missed something. > > Are you sure the GID matches one of the port's GIDs? You could add a > couple of printks to ib_find_cached_gid() to print the GID it's > searching for and each of the GIDs it checks against, and see if it > looks sane. > > susan> where can i download stable version of ib gen2 stack? > > It's in kernel 2.6.16, so that's a good stable version. > > - R. > yes, it matches one of the port's gids, but there is nothing in the cache. from where should i download userland binaries -- opensm, ib* (ibaddr, ibstat) binaries?
should just do: svn co https://openib.org/svn/gen2/trunk Susan From sean.hefty at intel.com Thu Apr 20 15:28:23 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 20 Apr 2006 15:28:23 -0700 Subject: [openib-general] [PATCH] expand local SA cache to include all path records from source Message-ID: Modify the local SA cache to store all path records from the local SGID to all destinations. The current implementation retrieved only a single path to each destination. Add API calls that can be used to walk the list of known paths to a given DGID. This allows a user to select an alternate path for a connection, and provide their own filtering on available paths. Signed-off-by: Sean Hefty --- Index: include/rdma/ib_local_sa.h =================================================================== --- include/rdma/ib_local_sa.h (revision 6418) +++ include/rdma/ib_local_sa.h (working copy) @@ -52,4 +52,32 @@ int ib_get_path_rec(struct ib_device *device, u8 port_num, union ib_gid *sgid, union ib_gid *dgid, u16 pkey, struct ib_sa_path_rec *rec); +/** + * ib_create_path_cursor - Create a cursor that may be used to walk through + * a list of path records. + * @device: The local device to retrieve path records for. + * @port_num: The port of the local device. + * @dgid: The destination GID of the path record. + * + * This call allocates a cursor that is used to walk through a list of locally + * cached path records. All path records accessed by the cursor will have the + * specified DGID. User should not hold the cursor for an extended period of + * time, and must free it by calling ib_free_sa_cursor. + */ +struct ib_sa_cursor *ib_create_path_cursor(struct ib_device *device, + u8 port_num, union ib_gid *dgid); + +/** + * ib_free_sa_cursor - Release a cursor. + * @cursor: The cursor to free. + */ +void ib_free_sa_cursor(struct ib_sa_cursor *cursor); + +/** + * ib_get_next_sa_attr - Retrieve the next SA attribute referenced by a cursor. 
+ * @cursor: A reference to a cursor that points to the next attribute to + * retrieve. + */ +void *ib_get_next_sa_attr(struct ib_sa_cursor **cursor); + #endif /* IB_LOCAL_SA_H */ Index: core/local_sa.c =================================================================== --- core/local_sa.c (revision 6542) +++ core/local_sa.c (working copy) @@ -84,10 +84,10 @@ struct sa_db_port { struct ib_mad_agent *agent; struct index_root index; unsigned long update_time; + int update; struct work_struct work; union ib_gid gid; int port_num; - u16 pkey; }; struct sa_db_device { @@ -107,6 +107,25 @@ struct ib_path_rec { u8 reserved2[20]; }; +struct ib_sa_cursor { + struct ib_sa_cursor *next; +}; + +struct ib_sa_attr_list { + struct ib_sa_cursor cursor; + struct ib_sa_cursor *tail; + int update; +}; + +struct ib_path_rec_info { + struct ib_sa_cursor cursor; + struct ib_sa_path_rec rec; +}; + +enum { + IB_MAX_PATHS_PER_QUERY = 0x7F +}; + static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *mad_send_wc) { @@ -114,6 +133,53 @@ static void send_handler(struct ib_mad_a ib_free_send_mad(mad_send_wc->send_buf); } +static void free_attr_list(struct ib_sa_attr_list *attr_list) +{ + struct ib_sa_cursor *cur; + + for (cur = attr_list->cursor.next; cur; cur = attr_list->cursor.next) { + attr_list->cursor.next = cur->next; + kfree(cur); + } + attr_list->tail = &attr_list->cursor; +} + +static int insert_attr(struct index_root *index, int update, void *key, + struct ib_sa_cursor *cursor) +{ + struct ib_sa_attr_list *attr_list; + void *err; + + attr_list = index_find(index, key); + if (!attr_list) { + attr_list = kmalloc(sizeof *attr_list, GFP_KERNEL); + if (!attr_list) + return -ENOMEM; + + attr_list->cursor.next = NULL; + attr_list->tail = &attr_list->cursor; + attr_list->update = update; + + err = index_insert(index, attr_list, key); + if (err) { + kfree(attr_list); + return PTR_ERR(err); + } + } else if (attr_list->update != update) { + free_attr_list(attr_list); + 
attr_list->update = update; + } + + /* + * Assume that the SA returned the best attribute first, and insert + * attributes on the tail. + */ + attr_list->tail->next = cursor; + cursor->next = NULL; + attr_list->tail = cursor; + return 0; +} + /* * Copy a path record from a received MAD and insert it into our index. * The path record in the MAD is in network order, so must be swapped. It @@ -124,7 +190,7 @@ static void update_path_rec(struct sa_db { struct ib_mad_recv_buf *recv_buf; struct ib_sa_mad *mad = (void *) mad_recv_wc->recv_buf.mad; - struct ib_sa_path_rec *sa_path, *old_path; + struct ib_path_rec_info *path_info; struct ib_path_rec ib_path, *path = NULL; int i, attr_size, left, offset = 0; @@ -132,6 +198,8 @@ static void update_path_rec(struct sa_db if (attr_size < sizeof ib_path) return; + down_write(&lock); + port->update++; list_for_each_entry(recv_buf, &mad_recv_wc->rmpp_list, list) { for (i = 0; i < IB_MGMT_SA_DATA;) { mad = (struct ib_sa_mad *) recv_buf->mad; @@ -155,28 +223,25 @@ static void update_path_rec(struct sa_db } if (!path->slid) - return; + goto unlock; - sa_path = kmalloc(sizeof *sa_path, GFP_KERNEL); - if (!sa_path) - return; - - ib_sa_unpack_attr(sa_path, path, IB_SA_ATTR_PATH_REC); - - down_write(&lock); - old_path = index_find_replace(&port->index, sa_path, - sa_path->dgid.raw); - if (old_path) - kfree(old_path); - else if (index_insert(&port->index, sa_path, - sa_path->dgid.raw)) { - up_write(&lock); - kfree(sa_path); - return; + path_info = kmalloc(sizeof *path_info, GFP_KERNEL); + if (!path_info) + goto unlock; + + ib_sa_unpack_attr(&path_info->rec, path, + IB_SA_ATTR_PATH_REC); + + if (insert_attr(&port->index, port->update, + path_info->rec.dgid.raw, + &path_info->cursor)) { + kfree(path_info); + goto unlock; } - up_write(&lock); } } +unlock: + up_write(&lock); } static void recv_handler(struct ib_mad_agent *mad_agent, @@ -251,12 +316,10 @@ static void format_path_req(struct sa_db mad->mad_hdr.attr_id = 
cpu_to_be16(IB_SA_ATTR_PATH_REC); mad->mad_hdr.tid = form_tid(msg->mad_agent->hi_tid); - mad->sa_hdr.comp_mask = IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | - IB_SA_PATH_REC_NUMB_PATH; + mad->sa_hdr.comp_mask = IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH; path_rec.sgid = port->gid; - path_rec.pkey = port->pkey; - path_rec.numb_path = 1; + path_rec.numb_path = IB_MAX_PATHS_PER_QUERY; ib_sa_pack_attr(mad->data, &path_rec, IB_SA_ATTR_PATH_REC); } @@ -320,36 +383,73 @@ static void handle_event(struct ib_event int ib_get_path_rec(struct ib_device *device, u8 port_num, union ib_gid *sgid, union ib_gid *dgid, u16 pkey, struct ib_sa_path_rec *rec) { + struct ib_sa_cursor *cursor; + struct ib_sa_path_rec *path; + int ret = -ENODATA; + + cursor = ib_create_path_cursor(device, port_num, dgid); + if (IS_ERR(cursor)) + return PTR_ERR(cursor); + + for (path = ib_get_next_sa_attr(&cursor); path; + path = ib_get_next_sa_attr(&cursor)) { + if (pkey == path->pkey && + !memcmp(sgid, path->sgid.raw, sizeof *sgid)) { + memcpy(rec, path, sizeof *rec); + ret = 0; + break; + } + } + + ib_free_sa_cursor(cursor); + return ret; +} +EXPORT_SYMBOL(ib_get_path_rec); + +struct ib_sa_cursor *ib_create_path_cursor(struct ib_device *device, + u8 port_num, union ib_gid *dgid) +{ struct sa_db_device *dev; struct sa_db_port *port; - struct ib_sa_path_rec *path_rec; - int ret = 0; + struct ib_sa_attr_list *list; + int ret; down_read(&lock); dev = ib_get_client_data(device, &sa_db_client); if (!dev) { ret = -ENODEV; - goto unlock; + goto err; } port = &dev->port[port_num - 1]; - if (memcmp(&port->gid, sgid, sizeof *sgid) || port->pkey != pkey) { + list = index_find(&port->index, dgid->raw); + if (!list) { ret = -ENODATA; - goto unlock; + goto err; } - path_rec = index_find(&port->index, dgid->raw); - if (!path_rec) { - ret = -ENODATA; - goto unlock; - } + return &list->cursor; +err: + up_read(&lock); + return ERR_PTR(ret); +} +EXPORT_SYMBOL(ib_create_path_cursor); - memcpy(rec, path_rec, sizeof 
*path_rec); -unlock: +void ib_free_sa_cursor(struct ib_sa_cursor *cursor) +{ up_read(&lock); - return ret; } -EXPORT_SYMBOL(ib_get_path_rec); +EXPORT_SYMBOL(ib_free_sa_cursor); + +void *ib_get_next_sa_attr(struct ib_sa_cursor **cursor) +{ + *cursor = (*cursor)->next; + if (*cursor) + return ((void *)(*cursor)) + sizeof(**cursor); + else + return NULL; +} +EXPORT_SYMBOL(ib_get_next_sa_attr); static void sa_db_free_data(void *context, void *data) { @@ -375,11 +475,11 @@ static void sa_db_add_one(struct ib_devi port->dev = dev; port->port_num = i; port->update_time = jiffies - hold_time; + port->update = 0; INIT_WORK(&port->work, update_cache, port); index_init(&port->index, sizeof (union ib_gid), GFP_KERNEL); - if (ib_get_cached_gid(device, i, 0, &port->gid) || - ib_get_cached_pkey(device, i, 0, &port->pkey)) + if (ib_get_cached_gid(device, i, 0, &port->gid)) goto err; port->agent = ib_register_mad_agent(device, i, IB_QPT_GSI, From mshefty at ichips.intel.com Thu Apr 20 15:34:45 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Apr 2006 15:34:45 -0700 Subject: [openib-general] Re: ib_send_cm_req failes with error -22 In-Reply-To: References: <44468CCE.6070202@ichips.intel.com> Message-ID: <44480C85.1040704@ichips.intel.com> Susan wrote: > yes it matches t one of the ports gid, but there is nothing > in the cache. Is gid_cache->table_len 0? Do you see any kernel warnings that are reported? > from where should i download userland binaries -- opensm, ib* > (ibaddr, ibstat) binaries? should just do: > > svn co https://openib.org/svn/gen2/trunk This should work. - Sean From mlleinin at hpcn.ca.sandia.gov Thu Apr 20 15:47:24 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 20 Apr 2006 15:47:24 -0700 Subject: [openib-general] RFC userspace / MPI multicast support In-Reply-To: References: Message-ID: <1145573244.13733.83.camel@localhost> On Thu, 2006-04-20 at 13:09 -0700, Sean Hefty wrote: > > Not always. 
For most of the older VAPI based stack we never turned on > >IPoIB (or did and it didn't work). I don't think we want to assume > >IPoIB is always set up when MPI is running. > > This means that you can't use the rdma_cm to establish connections. Requiring IPoIB to be running for MPI to work is a new dependency that users are not accustomed to. Maybe we need to do this through the cm rather than rdma_cm. We shouldn't require IPoIB to be running for MPI to use IB. - Matt From mlleinin at hpcn.ca.sandia.gov Thu Apr 20 16:29:19 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Thu, 20 Apr 2006 16:29:19 -0700 Subject: [openib-general] IBED-1.0-rc3 README.txt Message-ID: <1145575760.13733.108.camel@localhost> Why does the README.txt refer to all things as "Mellanox"? Here is the title in the README --------------------------- Mellanox IBED Distribution v1.0 for Linux ******************************************************* ------------------------ more here ------------------------------- IBED Home Page: https://docs.mellanox.com/dm/ibgold/ReadMe.html Please email bugs and error reports to your local Field Application Engineer. even more a few lines down --------------------------------- 1. HW and SW Requirements: ========================== 1) Server platform with InfiniBand HCA (see Mellanox IBED Distribution Release Notes for details) 2) Linux OS (see Mellanox IBED Distribution Release Notes for details) ---------------------------------- This needs to be cleaned up and updated.
- Matt From mshefty at ichips.intel.com Thu Apr 20 16:44:40 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Apr 2006 16:44:40 -0700 Subject: [openib-general] RFC userspace / MPI multicast support In-Reply-To: <1145573244.13733.83.camel@localhost> References: <1145573244.13733.83.camel@localhost> Message-ID: <44481CE8.1040407@ichips.intel.com> Matt Leininger wrote: > Requiring IPoIB to be running for MPI to work is a new dependency that > users are not accustomed to. Maybe we need to do this through the cm > rather than rdma_cm. We shouldn't require IPoIB to be running for MPI > to use IB. Use of these new features already carries with it new dependencies, in the form of new libraries and kernel modules. Are MPI users even accustomed to having the IB CM loaded? How does MPI discover the remote nodes in the fabric today? How does it obtain the path records? How does it use multicast groups today? I'm trying to find a solution that can be both easy to use, yet allow for low level control. However, I don't necessarily want to restrict an implementation based on existing applications that are either hard-coded or coded for static configurations. - Sean From tom at opengridcomputing.com Thu Apr 20 17:57:08 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 20 Apr 2006 19:57:08 -0500 Subject: [openib-general] [PATCH][UVERBS][RFC] Exporting device node_type to user mode In-Reply-To: References: <1145567117.27405.38.camel@trinity.ogc.int> Message-ID: <1145581028.8968.9.camel@bigtime.es335.com> On Thu, 2006-04-20 at 14:13 -0700, Roland Dreier wrote: > Tom> In order to support transport independent behavior for > Tom> user-mode RDMA CMA clients we need to export the node_type to > Tom> the user mode device attributes structure. The reason for > Tom> this is that the user-mode CMA needs to behave differently > Tom> for iWARP vs. IB transports when migrating QP state at > Tom> connection setup and tear down. 
> > Adding node_type to the libibverbs API is OK (for the 1.1 release > series...), but I think it would be better to read the existing > /sys/class/infiniband//node_type field in sysfs rather than > adding it in to the query stuff. > Ok -- no problem. Are there rules/guidelines that govern the device attributes that belong in sys/class/infiniband vs. attributes that belong in device_attr? > - R. From wombat2 at us.ibm.com Thu Apr 20 18:03:29 2006 From: wombat2 at us.ibm.com (Bernard King-Smith) Date: Thu, 20 Apr 2006 21:03:29 -0400 Subject: [openib-general] Re: Speeding up IPoIB. In-Reply-To: <20060419225630.GG6430@esmail.cup.hp.com> Message-ID: Grant Grundler wrote: > Currently we only get 40% of the link bandwidth compared to > 85% for 10 GigE. (Yes I know the cost differences which favor IB ). Grant> 10gige is getting 85% without TOE? Grant> Or are they distributing event handling across several CPUs? On 10 GigE they are using large send to the adapter where a 60K buffer is read by the adapter and fragmented into 1500 or 9000 byte Ethernet packets. Essentially they offload fragmentation to Ethernet packets from TCP to the adapter. This is similar to RC mode in IB fragmenting larger buffers into link 2000 byte frames/packets. > However, two things hurt user level protocols. First is scaling and memory > requirements. Looking at parallel file systems on large clusters, SDP ended > up consuming so much memory it couldn't be used. The N by N socket > connections per node, using SDP the required buffer space and QP memory got > out of control. There is something to be said for sharing buffer and QP > space across lots of sockets. Grant> My guess is it's an easier problem to fix SDP than reducing TCP/IP Grant> cache/CPU foot print. I realize only a subset of apps can (or will Grant> try to) use SDP because of setup/config issues. I still believe SDP Grant> is useful to a majority of apps without having to recompile them. 
I agree that reducing any protocol footprint is a very challenging job; however, going to a larger MTU drops the overhead much faster. If IB supported a 60K MTU then the TCP/IP overhead would be 1/30 that of what we measure today. Traversing the TCP/IP stack once for a 60K packet costs much less than traversing it 30 times using 2000 byte packets for the same amount of data transmitted. > The other issue is flow control across hundreds of autonomous sockets. In > TCP/IP, traffic can be managed so that there is some fairness > (multiplexing, QoS etc.) across all active sockets. For user level > protocols like SDP and uDAPL, you can't manage traffic across multiple > autonomous user application connections because there is nowhere to see all > of them at the same time for management. This can lead to overrunning > adapters or timeouts to the applications. This tends to be a large system > problem when you have lots of CPUs. Grant> I'm not competent to disagree in detail. Grant> Fabian Tillier and Caitlin Bestler can (and have) addressed this. I would be very interested in any pointers to their work. > The footprint of IPoIB + TCP/IP is large as on any system. However, as you > get to higher CPU counts, the issue becomes less of a problem since more > unused CPU cycles are available. However, affinity (CPU and memory) and > cacheline miss issues get greater. Grant> Hrm...the concept of "unused CPU cycles" is bugging me as someone Grant> who occasionally gets to run benchmarks. If a system today has Grant> unused CPU cycles, then will adding a faster link change the CPU Grant> load if the application doesn't change? This goes back to systems where the system is busy doing nothing, generally when waiting for memory or a cache line miss, or I/O to disks. This is where hyperthreading has shown some speedups for benchmarks where previously they were totally CPU limited, and with hyperthreading there is a gain.
The unused cycles are "wait" cycles when something can run if it can get in quickly. You can't get a TCP stack in the wait, but small parts of the stack or driver could fit in the other thread. Yes I do benchmarking and was skeptical at first. Grant> thanks, Grant> grant Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." William Shatner From tom at opengridcomputing.com Thu Apr 20 18:34:18 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 20 Apr 2006 20:34:18 -0500 Subject: [dat-discussions] [openib-general] [RFC]DAT2.0immediatedataproposal In-Reply-To: References: Message-ID: <1145583258.8968.28.camel@bigtime.es335.com> On Wed, 2006-02-08 at 23:20 -0800, Sean Hefty wrote: > >Hmm. Can you put a number on how much better RDMA write with > >immediate is on current HCA hardware? How does using the underlying > >OpenIB verbs ability to post a list of work requests compare (ie > >posting an RDMA write followed by a send in one verbs call)? > >Maybe "post multiple" is a better direction for DAT. > > A "post multiple" call as a general API makes sense, but I think that's a > separate issue. > > Given that IB provides true immediate data with RDMA writes, a way should be > available to make use of it. I don't know what the performance numbers are between > using a write with immediate versus a write followed by a send, but I don't > think that anyone could argue that the write with immediate wouldn't perform > better. > > To me, the question is whether write with immediate is supported as a transport > specific extension, which was Arlin's original patch, or through some standard > API.
The attempt to make the API standard, so that iWarp could emulate it > (poorly in my view), is what appears to be driving the disagreements. > > It also appears to me that the decisions are coming down to one of the > following. If iWarp can emulate write with immediate, then a generic API should > be used. This opens Pandora's box. Should iWARP also emulate ATOMICs? Which should be emulated and which should not ... What are the criteria for deciding? > If iWarp cannot properly emulate write with immediate, then the API > should be transport specific. It should be transport specific because it is a transport specific feature. Although -- in this case -- it could be implemented in iWARP, in my view it _should_ not. > It's curious to me that in both cases, iWarp is > driving the API decision and design for something that is an IB specific > feature. Huh? > > - Sean > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Thu Apr 20 19:14:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Apr 2006 22:14:31 -0400 Subject: [openib-general] Re: Speeding up IPoIB. In-Reply-To: References: Message-ID: <1145585669.23359.7365.camel@hal.voltaire.com> On Thu, 2006-04-20 at 21:03, Bernard King-Smith wrote: > Grant Grundler wrote: > > > Currently we only get 40% of the link bandwidth compared to > > 85% for 10 GigE. (Yes I know the cost differences which favor IB ). > > Grant> 10gige is getting 85% without TOE? > Grant> Or are they distributing event handling across several CPUs? > > On 10 GigE they are using large send to the adapter where a 60K buffer is > read by the adapter and fragmented into 1500 or 9000 byte Ethernet packets. > Essentially they offload fragmentation to Ethernet packets from TCP to the > adapter.
This is similar to RC mode in IB fragmenting larger buffers into > link 2000 byte frames/packets. > > > However, two things hurt user level protocols. First is scaling and > memory > > requirements. Looking at parallel file systems on large clusters, SDP > ended > > up consuming so much memory it couldn't be used. The N by N socket > > connections per node, using SDP the required buffer space and QP memory > got > > out of control. There is something to be said for sharing buffer and QP > > space across lots of sockets. > > Grant> My guess is it's an easier problem to fix SDP than reducing TCP/IP > Grant> cache/CPU foot print. I realize only a subset of apps can (or will > Grant> try to) use SDP because of setup/config issues. I still believe SDP > Grant> is useful to a majority of apps without having to recompile them. > > I agree that reducing any protocol footprint is a very challenging job; > however, going to a larger MTU drops the overhead much faster. If IB > supported a 60K MTU then the TCP/IP overhead would be 1/30 that of what we > measure today. Then what you want is IPoIB-CM where the MTU can be much larger. -- Hal > Traversing the TCP/IP stack once for a 60K packet costs much > less than traversing it 30 times using 2000 byte packets for the same amount of data > transmitted. > > > The other issue is flow control across hundreds of autonomous sockets. In > > TCP/IP, traffic can be managed so that there is some fairness > > (multiplexing, QoS etc.) across all active sockets. For user level > > protocols like SDP and uDAPL, you can't manage traffic across multiple > > autonomous user application connections because there is nowhere to see > all > > of them at the same time for management. This can lead to overrunning > > adapters or timeouts to the applications. This tends to be a large system > > problem when you have lots of CPUs. > > Grant> I'm not competent to disagree in detail. > Grant> Fabian Tillier and Caitlin Bestler can (and have) addressed this.
> > I would be very interested in any pointers to their work. > > > The footprint of IPoIB + TCP/IP is large as on any system. However, as > you > > get to higher CPU counts, the issue becomes less of a problem since more > > unused CPU cycles are available. However, affinity (CPU and memory) and > > cacheline miss issues get greater. > > Grant> Hrm...the concept of "unused CPU cycles" is bugging me as someone > Grant> who occasionally gets to run benchmarks. If a system today has > Grant> unused CPU cycles, then will adding a faster link change the CPU > Grant> load if the application doesn't change? > > This goes back to systems where the system is busy doing nothing, generally > when waiting for memory or a cache line miss, or I/O to disks. This is > where hyperthreading has shown some speedups for benchmarks where > previously they were totally CPU limited, and with hyperthreading there is > a gain. The unused cycles are "wait" cycles when something can run if it > can get in quickly. You can't get a TCP stack in the wait, but small parts > of the stack or driver could fit in the other thread. Yes I do benchmarking > and was skeptical at first. > > Grant> thanks, > Grant> grant > > Bernie King-Smith > IBM Corporation > Server Group > Cluster System Performance > wombat2 at us.ibm.com (845)433-8483 > Tie. 293-8483 or wombat2 on NOTES > > "We are not responsible for the world we are born into, only for the world > we leave when we die. > So we have to accept what has gone before us and work to change the only > thing we can, > -- The Future."
William Shatner From halr at voltaire.com Thu Apr 20 20:06:46 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Apr 2006 23:06:46 -0400 Subject: [openib-general] Announcing the OpenFabrics Enterprise Distribution In-Reply-To: <444790D2.6090006@mellanox.co.il> References: <1144869147.19061.91517.camel@hal.voltaire.com> <4443ADBE.7000307@mellanox.co.il> <1145312002.4539.72029.camel@hal.voltaire.com> <444790D2.6090006@mellanox.co.il> Message-ID: <1145588214.23359.8047.camel@hal.voltaire.com> On Thu, 2006-04-20 at 09:46, Tziporet Koren wrote: > Hal Rosenstock wrote: > > Where is this tree ? Also, what about backports ? Are they part of OFED > > as well ? > > > > > This is the git tree that Roland manages. Backports are included in the > release. The backports are placed under the ibed dir in the 1.0 branch. > > I see no such directory under 1.0. Can you provide more details ? > > > https://openib.org/svn/gen2/branches/1.0/ibed/ Thanks. > > Is the process the same for userspace ? > > > For user space we will prefer to apply the patches on the branch, Maybe I misunderstand but: If this is without being accepted to the trunk, this seems inconsistent with the stated OFED policy of acceptance to trunk first prior to propagation to the release 1.0 branch second. > but we can also put them under the fixes if needed. > This is good for a case that there is a fix you wish to provide after > the release was done. Sure; that could handle fixes between releases. -- Hal > Tziporet From mst at mellanox.co.il Fri Apr 21 00:28:13 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Fri, 21 Apr 2006 10:28:13 +0300 Subject: [openib-general] [PATCH 1 of 2] core extensions to fix race at module unloading Message-ID: <20060421072813.GA16418@mellanox.co.il> Hi! The only viable approach left that I see to solve module unloading races is adding APIs to flush running callbacks. Please review the following. --- Add APIs to flush outstanding callbacks - required for loadable modules where module text could go away while callback is still running. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/include/rdma/ib_mad.h =================================================================== --- openib/drivers/infiniband/include/rdma/ib_mad.h (revision 6545) +++ openib/drivers/infiniband/include/rdma/ib_mad.h (working copy) @@ -482,6 +482,12 @@ struct ib_mad_agent *ib_register_mad_sno int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent); /** + * ib_flush_mad_agent - flush any callbacks in flight for this client. + * @mad_agent: Corresponding MAD registration request to flush. + */ +int ib_flush_mad_agent(struct ib_mad_agent *mad_agent); + +/** * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated * with the registered client. * @send_buf: Specifies the information needed to send the MAD(s). 
Index: openib/drivers/infiniband/include/rdma/ib_sa.h =================================================================== --- openib/drivers/infiniband/include/rdma/ib_sa.h (revision 6545) +++ openib/drivers/infiniband/include/rdma/ib_sa.h (working copy) @@ -253,6 +253,7 @@ struct ib_sa_service_rec { struct ib_sa_query; void ib_sa_cancel_query(int id, struct ib_sa_query *query); +void ib_sa_flush(struct ib_device *device); int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, Index: openib/drivers/infiniband/include/rdma/ib_addr.h =================================================================== --- openib/drivers/infiniband/include/rdma/ib_addr.h (revision 6545) +++ openib/drivers/infiniband/include/rdma/ib_addr.h (working copy) @@ -71,6 +71,7 @@ int rdma_resolve_ip(struct sockaddr *src void *context); void rdma_addr_cancel(struct rdma_dev_addr *addr); +void rdma_addr_flush(void); static inline int ip_addr_size(struct sockaddr *addr) { Index: openib/drivers/infiniband/core/addr.c =================================================================== --- openib/drivers/infiniband/core/addr.c (revision 6545) +++ openib/drivers/infiniband/core/addr.c (working copy) @@ -356,6 +356,12 @@ void rdma_addr_cancel(struct rdma_dev_ad } EXPORT_SYMBOL(rdma_addr_cancel); +void rdma_addr_flush(void) +{ + flush_workqueue(addr_wq); +} +EXPORT_SYMBOL(rdma_addr_flush); + static int addr_arp_recv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pkt, struct net_device *orig_dev) { Index: openib/drivers/infiniband/core/sa_query.c =================================================================== --- openib/drivers/infiniband/core/sa_query.c (revision 6545) +++ openib/drivers/infiniband/core/sa_query.c (working copy) @@ -440,6 +440,20 @@ void ib_sa_cancel_query(int id, struct i } EXPORT_SYMBOL(ib_sa_cancel_query); +void ib_sa_flush(struct ib_device *device) +{ + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + int i;
+ + if (!sa_dev) + return; + + for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { + ib_flush_mad_agent(sa_dev->port[i].agent); + } +} +EXPORT_SYMBOL(ib_sa_flush); + int ib_sa_pack_attr(void *dst, void *src, int attr_id) { switch (attr_id) { Index: openib/drivers/infiniband/core/mad.c =================================================================== --- openib/drivers/infiniband/core/mad.c (revision 6545) +++ openib/drivers/infiniband/core/mad.c (working copy) @@ -486,6 +486,13 @@ error1: } EXPORT_SYMBOL(ib_register_mad_snoop); +static void ib_flush_mad_agent(struct ib_mad_agent_private *mad_agent_priv) +{ + port_priv = mad_agent_priv->qp_info->port_priv; + flush_workqueue(port_priv->wq); +} +EXPORT_SYMBOL(ib_flush_mad_agent); + static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv) { struct ib_mad_port_private *port_priv; -- MST From mst at mellanox.co.il Fri Apr 21 00:38:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 21 Apr 2006 10:38:01 +0300 Subject: [openib-general] [PATCH 2 of 2] fix module unloading races Message-ID: <20060421073800.GA16479@mellanox.co.il> Please review. --- Make sure that callbacks have completed before allowing the module text to go away. Signed-off-by: Michael S. 
Tsirkin Index: openib/drivers/infiniband/core/cma.c =================================================================== --- openib.orig/drivers/infiniband/core/cma.c 2006-04-21 09:23:22.000000000 +0300 +++ openib/drivers/infiniband/core/cma.c 2006-04-21 10:04:53.000000000 +0300 @@ -1791,6 +1791,7 @@ static void cma_remove_one(struct ib_dev cma_process_remove(cma_dev); kfree(cma_dev); + ib_sa_flush(device); } static int cma_init(void) @@ -1817,6 +1818,7 @@ static void cma_cleanup(void) destroy_workqueue(cma_wq); idr_destroy(&sdp_ps); idr_destroy(&tcp_ps); + rdma_addr_flush(); } module_init(cma_init); Index: openib/drivers/infiniband/core/multicast.c =================================================================== --- openib.orig/drivers/infiniband/core/multicast.c 2006-04-21 09:23:22.000000000 +0300 +++ openib/drivers/infiniband/core/multicast.c 2006-04-21 10:03:41.000000000 +0300 @@ -649,6 +649,7 @@ static void mcast_remove_one(struct ib_d } kfree(dev); + ib_sa_flush(device); } static int __init mcast_init(void) Index: openib/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-04-21 09:23:19.000000000 +0300 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_main.c 2006-04-21 10:07:08.000000000 +0300 @@ -1174,6 +1174,7 @@ static void ipoib_remove_one(struct ib_d } kfree(dev_list); + ib_sa_flush(device); } static int __init ipoib_init_module(void) Index: openib/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- openib.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-04-21 09:23:19.000000000 +0300 +++ openib/drivers/infiniband/ulp/srp/ib_srp.c 2006-04-21 10:07:03.000000000 +0300 @@ -1748,6 +1748,7 @@ static void srp_remove_one(struct ib_dev } kfree(dev_list); + ib_sa_flush(device); } static int __init srp_init_module(void) -- MST ----- End forwarded message ----- -- MST From halr at voltaire.com 
Fri Apr 21 04:21:39 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Apr 2006 07:21:39 -0400 Subject: [openib-general] [PATCH] expand local SA cache to include all path records from source In-Reply-To: References: Message-ID: <1145618498.23359.13944.camel@hal.voltaire.com> On Thu, 2006-04-20 at 18:28, Sean Hefty wrote: > Modify the local SA cache to store all path records from the local > SGID to all destinations. The current implementation retrieved only > a single path to each destination. > > Add API calls that can be used to walk the list of known paths to a > given DGID. > > This allows a user to select an alternate path for a connection, and > provide their own filtering on available paths. As the end node cannot know the policies the SM used for routing, etc., path records are insufficient in terms of some of the interesting characteristics for filtering like path independence (as fault independent as possible). There's also a 1.2 erratum which enhances SA MultiPathRecord to support scoping of S/DGIDs. It supports explicit, same node, same system with and without high availability (separate HCAs). That could also be used when supported by the SA. So rather than all available paths to the DGID, it could be fault independent ones by making a different request underneath. The API looks fine to me. One thing to note is to keep an eye on the lookups done on the local database and possibly optimizing them if needed. 
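For readers following the path-walking API discussed above, the cursor walk can be modeled in plain userspace C roughly as follows. This is only a sketch under stated assumptions: `sa_cursor`, `cursor_node`, `first_attr`, `next_attr`, `count_attrs` and `free_cursor` are hypothetical names, and a small `int` payload stands in for a path record. Only the pointer arithmetic in `next_attr` is modeled on the `ib_get_next_sa_attr()` posted in the patch (advance the cursor, then return the data stored right behind the cursor header, or NULL at the end).

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the patch's struct ib_sa_cursor: a list node
 * whose attribute payload is stored immediately after the header. */
struct sa_cursor {
	struct sa_cursor *next;
};

/* Allocate one node and copy 'len' payload bytes in behind the header. */
static struct sa_cursor *cursor_node(const void *attr, size_t len,
				     struct sa_cursor *next)
{
	struct sa_cursor *c = malloc(sizeof(*c) + len);

	if (!c)
		return NULL;
	c->next = next;
	memcpy((char *)c + sizeof(*c), attr, len);
	return c;
}

/* Payload of the node the cursor currently points at (NULL if none). */
static void *first_attr(struct sa_cursor *cursor)
{
	return cursor ? (char *)cursor + sizeof(*cursor) : NULL;
}

/* Modeled on ib_get_next_sa_attr(): advance, then return payload or NULL.
 * The caller must not call this once the cursor has become NULL. */
static void *next_attr(struct sa_cursor **cursor)
{
	*cursor = (*cursor)->next;
	return *cursor ? (char *)*cursor + sizeof(**cursor) : NULL;
}

/* Walk every attribute reachable from 'head' and count them. */
static int count_attrs(struct sa_cursor *head)
{
	struct sa_cursor *cur = head;
	void *attr;
	int n = 0;

	for (attr = first_attr(cur); attr; attr = next_attr(&cur))
		n++;
	return n;
}

/* Release the whole list built with cursor_node(). */
static void free_cursor(struct sa_cursor *head)
{
	while (head) {
		struct sa_cursor *next = head->next;

		free(head);
		head = next;
	}
}
```

A consumer that wants an alternate path would apply its own filter inside the loop in `count_attrs` instead of counting. Because the walk is linear in the number of cached paths, lookup cost grows with the number of records held per destination.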
-- Hal > Signed-off-by: Sean Hefty From halr at voltaire.com Fri Apr 21 04:45:40 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Apr 2006 07:45:40 -0400 Subject: [openib-general] [PATCH] expand local SA cache to include all path records from source In-Reply-To: References: Message-ID: <1145619939.23359.14223.camel@hal.voltaire.com> On Thu, 2006-04-20 at 18:28, Sean Hefty wrote: > Index: core/local_sa.c > =================================================================== > --- core/local_sa.c (revision 6542) > +++ core/local_sa.c (working copy) > + > +enum { > + IB_MAX_PATHS_PER_QUERY = 0x7F > +}; > static void recv_handler(struct ib_mad_agent *mad_agent, > @@ -251,12 +316,10 @@ static void format_path_req(struct sa_db > mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); > mad->mad_hdr.tid = form_tid(msg->mad_agent->hi_tid); > > - mad->sa_hdr.comp_mask = IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | > - IB_SA_PATH_REC_NUMB_PATH; > + mad->sa_hdr.comp_mask = IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH; > > path_rec.sgid = port->gid; > - path_rec.pkey = port->pkey; > - path_rec.numb_path = 1; > + path_rec.numb_path = IB_MAX_PATHS_PER_QUERY; > ib_sa_pack_attr(mad->data, &path_rec, IB_SA_ATTR_PATH_REC); > } How does this work in a large subnet ? -- Hal From xma at us.ibm.com Fri Apr 21 06:22:11 2006 From: xma at us.ibm.com (Shirley Ma) Date: Fri, 21 Apr 2006 06:22:11 -0700 Subject: [openib-general] [PATCH] splitting IPoIB CQ In-Reply-To: Message-ID: Hello Roland, Roland Dreier wrote on 04/19/2006 02:32:47 PM: > Shirley> Is it possible to move the CQ handler out of interrupt > Shirley> context in mthca? > > Yes, but that seems like the wrong thing to do. I think it would be > better to let consumers that want the increased latency defer things. > > - R. What I meant is that I want to move the completion handler to the bottom half and leave the hardware interrupt in the top half. It will benefit not just IPoIB performance on mthca.
I will look at the code; if it is simple, I can implement it and test the performance. Any objections? Thanks, Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From Arkady.Kanevsky at netapp.com Fri Apr 21 06:23:06 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Fri, 21 Apr 2006 09:23:06 -0400 Subject: [dat-discussions] [openib-general][RFC]DAT2.0immediatedataproposal Message-ID: We need to do a better job coordinating between the 2 reflectors. The current position is that DAT provides a transport-independent API. We agreed on 2 transport extensions, one for IB and one for iWARP, that will be separate documents. DAT proper will have hooks to support extensions. Currently, the IB extensions include RDMA Write with immediate and Atomic ops, and the IW extension includes the Socket connection model. We also have advice to the Consumer on how to do things in a transport-independent way to make code portable, albeit at the cost of some performance. Feel free to review the current drafts of uDAPL 2.0, IB extension and IW extension on the DAT reflector or in the enclosed files. Thanks, Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Tom Tucker [mailto:tom at opengridcomputing.com] > Sent: Thursday, April 20, 2006 9:34 PM > To: Sean Hefty > Cc: 'Roland Dreier'; dat-discussions at yahoogroups.com; > openib-general at openib.org > Subject: RE: [dat-discussions] > [openib-general][RFC]DAT2.0immediatedataproposal > > On Wed, 2006-02-08 at 23:20 -0800, Sean Hefty wrote: > > >Hmm. Can you put a number on how much better RDMA write with > > >immediate is on current HCA hardware?
How does using the > underlying > > >OpenIB verbs ability to post a list of work requests compare (ie > > >posting an RDMA write followed by a send in one verbs call)? > > >Maybe "post multiple" is a better direction for DAT. > > > > A "post multiple" call as a general API makes sense, but I think > > that's a separate issue. > > > > Given that IB provides true immediate data with RDMA writes, a way > > should be available to make use of it. I don't know what the > > performance numbers are between using a write with immediate versus a > > write followed by a send, but I don't think that anyone could argue > > that the write with immediate wouldn't perform better. > > > > To me, the question is whether write with immediate is > supported as a > > transport specific extension, which was Arlin's original patch, or > > through some standard API. The attempt to make the API > standard, so > > that iWarp could emulate it (poorly in my view), is what > appears to be driving the disagreements. > > > > It also appears to me that the decisions are coming down to > one of the > > following. If iWarp can emulate write with immediate, then > a generic > > API should be used. > > This opens Pandora's box. Should iWARP also emulate ATOMICs? > Which should be emulated and which should not ... What are > the criteria for deciding? > > > If iWarp cannot properly emulate write with immediate, then the API > > should be transport specific. > > It should be transport specific because it is a transport > specific feature. Although -- in this case -- it could be > implemented in iWARP, in my view it _should_ not. > > > It's curious to me that in both cases, iWarp is driving the API > > decision and design for something that is an IB specific feature. > > Huh?
> > > > > - Sean > -------------- next part -------------- A non-text attachment was scrubbed... Name: uDAPLv20_042006.zip Type: application/x-zip-compressed Size: 1984003 bytes Desc: uDAPLv20_042006.zip URL: -------------- next part -------------- An embedded message was scrubbed... From: "Arlin Davis" Subject: [PATCH][RFC] uDAPL openIB provider with IB extensions based on latest DAT 2.0 draft Date: Wed, 5 Apr 2006 20:35:16 -0400 Size: 184385 URL: -------------- next part -------------- An embedded message was scrubbed... From: "Caitlin Bestler" Subject: [dat-discussions] IW Extensions Date: Fri, 31 Mar 2006 12:56:58 -0400 Size: 61748 URL: From worleys at gmail.com Fri Apr 21 08:55:48 2006 From: worleys at gmail.com (Chris Worley) Date: Fri, 21 Apr 2006 09:55:48 -0600 Subject: [openib-general] Are their any MVAPICH source or binary RPMscorresponding to the SuSE 10 x86_64 RPMs at red-bean? In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00077A9FD6@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F00077A9FD6@orsmsx408> Message-ID: Woody, I successfully built the IBED MVAPICH w.r.t. the SuSE 10 rc2 RPM libraries from red-bean. I re-made the Pallas benchmarks with this mpicc. In submitting a job a few times successively, I get the following output: :::::::::::::: IMB-MV.o270 :::::::::::::: mpirun: executable version 4 does not match our version 3. done.
:::::::::::::: IMB-MV.o271 :::::::::::::: [2] Abort: Error creating SRQ at line 258 in file viainit.c [4] Abort: Error creating SRQ at line 258 in file viainit.c mpirun: executable version 2 does not match our version 3. done. :::::::::::::: IMB-MV.o272 :::::::::::::: [2] Abort: Error creating SRQ at line 258 in file viainit.c [12] Abort: mpirun: executable version 0 does not match our version 3. done. :::::::::::::: IMB-MV.o273 :::::::::::::: [10] Abort: [4] Abort: [9] Abort: Error creating SRQ at line 258 in file viainit.c mpirun: executable version 10 does not match our version 3. done. :::::::::::::: IMB-MV.o274 :::::::::::::: [6] Abort: Error creating SRQ at line 258 in file viainit.c mpirun: executable version 6 does not match our version 3. done. :::::::::::::: IMB-MV.o275 :::::::::::::: [4] Abort: Error creating SRQ at line 258 in file viainit.c mpirun: executable version 4 does not match our version 3. done. :::::::::::::: IMB-MV.o276 :::::::::::::: [13] Abort: mpirun: executable version 1 does not match our version 3. [7] Abort: Error creating SRQ at line 258 in file viainit.c [4] Abort: Error creating SRQ at line 258 in file viainit.c done. :::::::::::::: IMB-MV.o277 :::::::::::::: [0] Abort: [12] Abort: Error creating SRQ at line 258 in file viainit.c [10] Abort: [8] Abort: Error creating SRQ at line 258 in file viainit.c mpirun: executable version 12 does not match our version 3. done. :::::::::::::: IMB-MV.o278 :::::::::::::: [0] Abort: [4] Abort: Error creating SRQ at line 258 in file viainit.c mpirun: executable version 4 does not match our version 3. done. That is one confused mpirun! Searching the source for PMGR_VERSION, I find 5 different #defines: 4 of them set it to 3, and the other sets it to 2. Chris On 4/19/06, Woodruff, Robert J wrote: > Chris wrote, > >Is there an MVAPICH RPM that matches the RC2 SuSE 10 RPMs? > > I think that the IBED rc3 release has RPMs for Mvapich and OpenMPI that > you might try.
>
>
> https://openib.org/svn/gen2/branches/1.0/ibed/releases/
>
>
> woody
>

From robert.j.woodruff at intel.com Fri Apr 21 09:02:36 2006
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 21 Apr 2006 09:02:36 -0700
Subject: [openib-general] Are their any MVAPICH source or binary RPMscorresponding to the SuSE 10 x86_64 RPMs at red-bean?
In-Reply-To:
Message-ID: <000001c6655d$02819b10$33a9070a@amr.corp.intel.com>

Chris wrote,

>::::::::::::::
>IMB-MV.o271
>::::::::::::::
>[2] Abort: Error creating SRQ
> at line 258 in file viainit.c
>[4] Abort: Error creating SRQ
> at line 258 in file viainit.c
>mpirun: executable version 2 does not match our version 3.
>done.

>::::::::::::::
>IMB-MV.o272
>::::::::::::::
>[2] Abort: Error creating SRQ
> at line 258 in file viainit.c
>[12] Abort: mpirun: executable version 0 does not match our version 3.
>done.

Looks to me like you might have a mismatch of software. Not sure.
Adding the openfabrics-ewg to the thread as they are the folks doing the IBED work.

woody

From worleys at gmail.com Fri Apr 21 09:10:47 2006
From: worleys at gmail.com (Chris Worley)
Date: Fri, 21 Apr 2006 10:10:47 -0600
Subject: [openib-general] Are their any MVAPICH source or binary RPMscorresponding to the SuSE 10 x86_64 RPMs at red-bean?
In-Reply-To: <000001c6655d$02819b10$33a9070a@amr.corp.intel.com>
References: <000001c6655d$02819b10$33a9070a@amr.corp.intel.com>
Message-ID:

Woody,

I've seen this mismatch before; it occurs when the IB drivers/libraries don't match the libraries that MVAPICH was built with.

But I've never seen it claim that the same executable was version 0, 10, 2,... all for the same executable.

And, I did build that MVAPICH w.r.t. the red-bean RC2 libraries for SuSE 10 (three times).
Chris

On 4/21/06, Bob Woodruff wrote:
> Chris wrote,
>
> >::::::::::::::
> >IMB-MV.o271
> >::::::::::::::
> >[2] Abort: Error creating SRQ
> > at line 258 in file viainit.c
> >[4] Abort: Error creating SRQ
> > at line 258 in file viainit.c
> >mpirun: executable version 2 does not match our version 3.
> >done.
>
> >::::::::::::::
> >IMB-MV.o272
> >::::::::::::::
> >[2] Abort: Error creating SRQ
> > at line 258 in file viainit.c
> >[12] Abort: mpirun: executable version 0 does not match our version 3.
> >done.
>
> Looks to me like you might have a mismatch of software. Not sure.
> Adding the openfabrics-ewg to the thread as they are the folks doing the
> IBED work.
>
> woody
>

From sweitzen at cisco.com Fri Apr 21 09:10:45 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Fri, 21 Apr 2006 09:10:45 -0700
Subject: [openfabrics-ewg] RE: [openib-general] Are their any MVAPICH sourceor binary RPMscorresponding to the SuSE 10 x86_64 RPMs at red-bean?
Message-ID:

> >::::::::::::::
> >IMB-MV.o271
> >::::::::::::::
> >[2] Abort: Error creating SRQ
> > at line 258 in file viainit.c
> >[4] Abort: Error creating SRQ
> > at line 258 in file viainit.c
> >mpirun: executable version 2 does not match our version 3.
> >done.

On RHEL4 w/IBED 1.0 rc3, I get a similar message if I have not configured /etc/security/limits.conf correctly; you have to add memory locking values such as:

* soft memlock unlimited
* hard memlock unlimited

W/o this, I get:

[0] Abort: Error creating CQ
 at line 235 in file viainit.c
mpirun: executable version 0 does not match our version 3.
done.

From worleys at gmail.com Fri Apr 21 09:15:14 2006
From: worleys at gmail.com (Chris Worley)
Date: Fri, 21 Apr 2006 10:15:14 -0600
Subject: [openfabrics-ewg] RE: [openib-general] Are their any MVAPICH sourceor binary RPMscorresponding to the SuSE 10 x86_64 RPMs at red-bean?
In-Reply-To:
References:
Message-ID:

I believe I have all nodes set to 7.5GB...

# pdsh -a "cat /etc/security/limits.conf | tail" | dshbak -c
----------------
n[1-8]
----------------
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4
*                hard    memlock         7510000
*                soft    memlock         7510000

# End of file

# pdsh -a ulimit -l | dshbak -c
----------------
n[1-8]
----------------
7510000

On 4/21/06, Scott Weitzenkamp (sweitzen) wrote:
>
> > >::::::::::::::
> > >IMB-MV.o271
> > >::::::::::::::
> > >[2] Abort: Error creating SRQ
> > > at line 258 in file viainit.c
> > >[4] Abort: Error creating SRQ
> > > at line 258 in file viainit.c
> > >mpirun: executable version 2 does not match our version 3.
> > >done.
>
> On RHEL4 w/IBED 1.0 rc3, I get a similar message if I have not
> configured /etc/security/limits.conf correctly, you have to add memory
> locking values such as:
>
> * soft memlock unlimited
> * hard memlock unlimited
>
> W/o this, I get:
>
> [0] Abort: Error creating CQ
> at line 235 in file viainit.c
> mpirun: executable version 0 does not match our version 3.
> done.
>

From sweitzen at cisco.com Fri Apr 21 09:18:14 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Fri, 21 Apr 2006 09:18:14 -0700
Subject: [openfabrics-ewg] RE: [openib-general] Are their any MVAPICH sourceor binary RPMscorresponding to the SuSE 10 x86_64 RPMs at red-bean?
Message-ID:

Does ibv_rc_pingpong work between two of the nodes? If so, does MVAPICH work between the same two nodes?

Scott

> -----Original Message-----
> From: Chris Worley [mailto:worleys at gmail.com]
> Sent: Friday, April 21, 2006 9:15 AM
> To: Scott Weitzenkamp (sweitzen)
> Cc: openib-general at openib.org; openfabrics-ewg at openib.org
> Subject: Re: [openfabrics-ewg] RE: [openib-general] Are their
> any MVAPICH sourceor binary RPMscorresponding to the SuSE 10
> x86_64 RPMs at red-bean?
>
> I believe I have all nodes set to 7.5GB...
> > # pdsh -a "cat /etc/security/limits.conf | tail" | dshbak -c > ---------------- > n[1-8] > ---------------- > #* hard rss 10000 > #@student hard nproc 20 > #@faculty soft nproc 20 > #@faculty hard nproc 50 > #ftp hard nproc 0 > #@student - maxlogins 4 > * hard memlock 7510000 > * soft memlock 7510000 > > # End of file > > # pdsh -a ulimit -l | dshbak -c > ---------------- > n[1-8] > ---------------- > 7510000 > > > On 4/21/06, Scott Weitzenkamp (sweitzen) wrote: > > > > > >:::::::::::::: > > > >IMB-MV.o271 > > > >:::::::::::::: > > > >[2] Abort: Error creating SRQ > > > > at line 258 in file viainit.c > > > >[4] Abort: Error creating SRQ > > > > at line 258 in file viainit.c > > > >mpirun: executable version 2 does not match our version 3. > > > >done. > > > > On RHEL4 w/IBED 1.0 rc3, I get a similar message if I have not > > configured /etc/security/limits.conf correctly, you have to > add memory > > locking values such as: > > > > * soft memlock unlimited > > * hard memlock unlimited > > > > W/o this, I get: > > > > [0] Abort: Error creating CQ > > at line 235 in file viainit.c > > mpirun: executable version 0 does not match our version 3. > > done. > > > From vuhuong at mellanox.com Fri Apr 21 09:20:24 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Fri, 21 Apr 2006 09:20:24 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: <444687E2.8020103@mellanox.com> References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com> <44450F89.7020500@mellanox.com> <44451D32.1010106@mellanox.com> <444566E7.8070907@mellanox.com> <444687E2.8020103@mellanox.com> Message-ID: <44490648.9070106@mellanox.com> Hi Roland, > > I reported the error from my original email responding to your fmr > patch. 
For ia64 system with pcix hca I got asyn event > IB_EVENT_QP_ACCESS_ERR at the initiator (and I got cqe with > IB_COMPLETION_STATUS_REMOTE_ACCESS_ERROR status at my target) > I still have not had an IB analyzer trace (as you suggested) I still have not had the IB trace yet. > >> >> So the SCSI midlayer times out commands and tries to abort them. But >> we have no connection so the abort fails. The SCSI command shouldn't >> get freed now (at least if I'm understanding scsi_error.c correctly). >> >> Then we have no .eh_device_reset_handler so everything should fall >> through to calling our .eh_host_reset_handler without freeing any SCSI >> commands. And then we crash on a use-after-free of a SCSI command. >> >> So where is that command getting freed on us?? >> > > The scsi command that is used by error handlers (.eh_abort_handler, > .eh_host_reset_handler...) is not the same as use-after-free scsi > command from req->scmnd > > There is some glitch that the scsi command from req->scmnd already freed > by scsi midlayer; however, the request is still in our pending request > queue With the following patch applied my ia64 system does not crash anymore I prepare this patch diffing from srp revision 6455 applied with srp-params.patch that I sent you last week Please let me know if you want this patch generated from current srp (revision 6550) What is the status for srp-params.patch (introducing tuned parameters) Thanks, Vu -------------- next part -------------- A non-text attachment was scrubbed... Name: srp_eh.patch Type: text/x-patch Size: 1673 bytes Desc: not available URL: From robert.j.woodruff at intel.com Fri Apr 21 09:22:48 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 21 Apr 2006 09:22:48 -0700 Subject: [openib-general] Are their any MVAPICH source or binary RPMscorresponding to the SuSE 10 x86_64 RPMs at red-bean? 
In-Reply-To: Message-ID: <000101c6655f$d4330610$33a9070a@amr.corp.intel.com> Chris wrote, >Woody, >I've seen this mismatch before; it occurs when the IB >drivers/libraries don't match the libraries that MVAPICH was built >with. Don't know. I use Intel MPI, not MVAPICH, but it does look like some mismatch of code. woody From aviram at mellanox.co.il Fri Apr 21 09:22:09 2006 From: aviram at mellanox.co.il (Aviram Gutman) Date: Fri, 21 Apr 2006 19:22:09 +0300 Subject: [openib-general] RE: [openfabrics-ewg] IBED-1.0-rc3 README.txt Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301EF09E0@mtlexch01.mtl.com> It is an old one and it will be fixed for the RC4. Regards, Aviram -----Original Message----- From: openfabrics-ewg-bounces at openib.org [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of Matt Leininger Sent: Friday, April 21, 2006 2:29 AM To: Openfabrics-ewg at openib.org Cc: openib-general Subject: [openfabrics-ewg] IBED-1.0-rc3 README.txt Why does the README.txt refer to all things as "Mellanox" Here is the title in the README --------------------------- Mellanox IBED Distribution v1.0 for Linux ******************************************************* ------------------------ more here ------------------------------- IBED Home Page: https://docs.mellanox.com/dm/ibgold/ReadMe.html Please email bugs and error reports to your local Field Application Engineer. even more a few lines down --------------------------------- 1. HW and SW Requirements: ========================== 1) Server platform with InfiniBand HCA (see Mellanox IBED Distribution Release Notes for details) 2) Linux OS (see Mellanox IBED Distribution Release Notes for details) ---------------------------------- This needs to be cleaned up and updated. 
- Matt _______________________________________________ openfabrics-ewg mailing list openfabrics-ewg at openib.org http://openib.org/mailman/listinfo/openfabrics-ewg From mshefty at ichips.intel.com Fri Apr 21 09:30:57 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 21 Apr 2006 09:30:57 -0700 Subject: [openib-general] [PATCH] expand local SA cache to include all path records from source In-Reply-To: <1145618498.23359.13944.camel@hal.voltaire.com> References: <1145618498.23359.13944.camel@hal.voltaire.com> Message-ID: <444908C1.4010302@ichips.intel.com> Hal Rosenstock wrote: > As the end node cannot know the policies the SM used for routing, etc., > path records are insufficient in terms of some of the interesting > characteristics for filtering like path independence (as fault > independent as possible). This patch doesn't handle policy. It only stores multiple path records. Also, it seems like the argument can go both ways here. The SM doesn't know the policies of the applications or their traffic patterns. A distributed application may want paths from nodes A to B to have path independence from paths from nodes C to D. Can the SM handle such requests? > There's also a 1.2 erratum which enhances SA MultiPathRecord to support > scoping of S/DGIDs. It supports explicit, same node, same system with > and without high availability (separate HCAs). That could also be used > when supported by the SA. It shouldn't be hard to change the path query from GetTable to GetMulti, once the MAD layer supports dual-RMPP. > The API looks fine to me. One thing to note is to keep an eye on the > lookups done on the local database and possibly optimizing them if > needed. Tracking path usage becomes difficult, and would likely require involving the CM, and possibly address handle creation (for UD QPs that didn't use SIDR). When the cache is updated, checks would need to be added to see if a path is currently in the cache, and invalid paths would need to be removed. 
There would be a lot of additional work needed to do this. - Sean From mshefty at ichips.intel.com Fri Apr 21 09:33:51 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 21 Apr 2006 09:33:51 -0700 Subject: [openib-general] [PATCH] expand local SA cache to include all path records from source In-Reply-To: <1145619939.23359.14223.camel@hal.voltaire.com> References: <1145619939.23359.14223.camel@hal.voltaire.com> Message-ID: <4449096F.2050201@ichips.intel.com> Hal Rosenstock wrote: >> path_rec.sgid = port->gid; >>- path_rec.pkey = port->pkey; >>- path_rec.numb_path = 1; >>+ path_rec.numb_path = IB_MAX_PATHS_PER_QUERY; >> ib_sa_pack_attr(mad->data, &path_rec, IB_SA_ATTR_PATH_REC); >> } > > How does this work in a large subnet ? It would retrieve a large number of paths, and the resulting cache would be huge. Thinking about this more, I will change IB_MAX_PATHS_PER_QUERY to be a module parameter. - Sean From ssbyrn at yahoo.com Fri Apr 21 09:35:18 2006 From: ssbyrn at yahoo.com (susan) Date: Fri, 21 Apr 2006 16:35:18 +0000 (UTC) Subject: [openib-general] Re: ib_send_cm_req failes with error -22 References: <44468CCE.6070202@ichips.intel.com> <44480C85.1040704@ichips.intel.com> Message-ID: Sean Hefty ichips.intel.com> writes: > > Susan wrote: > > yes it matches t one of the ports gid, but there is nothing > > in the cache. > > Is gid_cache->table_len 0? Do you see any kernel warnings that > are reported? > > > from where should i download userland binaries -- opensm, ib* > > (ibaddr, ibstat) binaries? should just do: > > > > svn co https://openib.org/svn/gen2/trunk > > This should work. > > - Sean > after downloading sources from https://openib.org/svn/gen2/trunk and recompiling ib_send_cm_req() works. thanks for help & suggestions. 
susan From ssbyrn at yahoo.com Fri Apr 21 09:44:06 2006 From: ssbyrn at yahoo.com (susan) Date: Fri, 21 Apr 2006 16:44:06 +0000 (UTC) Subject: [openib-general] Re: ib_send_cm_req failes with error -22 References: <44468CCE.6070202@ichips.intel.com> <44480C85.1040704@ichips.intel.com> Message-ID: susan yahoo.com> writes: > > > after downloading sources from https://openib.org/svn/gen2/trunk > and recompiling ib_send_cm_req() works. > > thanks for help & suggestions. > susan > > actually ib_send_cmd_req() return status 0, but remote node does not get any notification. another issue to solve now. thanks, susan From mshefty at ichips.intel.com Fri Apr 21 09:47:34 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 21 Apr 2006 09:47:34 -0700 Subject: [openib-general] [PATCH 1 of 2] core extensions to fix race at module unloading In-Reply-To: <20060421072813.GA16418@mellanox.co.il> References: <20060421072813.GA16418@mellanox.co.il> Message-ID: <44490CA6.6040201@ichips.intel.com> Michael S. Tsirkin wrote: > /** > + * ib_flush_mad_agent - flush any callbacks in flight for this client. > + * @mad_agent: Corresponding MAD registration request to flush. > + */ > +int ib_flush_mad_agent(struct ib_mad_agent *mad_agent); To restate what I've mentioned before, I don't see any issue with the ib_mad interface. It doesn't make sense to me to change this API, when the problem is in the client. But I'd like to hear what others think. - Sean From mshefty at ichips.intel.com Fri Apr 21 09:52:26 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 21 Apr 2006 09:52:26 -0700 Subject: [openib-general] Re: ib_send_cm_req failes with error -22 In-Reply-To: References: <44468CCE.6070202@ichips.intel.com> <44480C85.1040704@ichips.intel.com> Message-ID: <44490DCA.9080609@ichips.intel.com> susan wrote: > actually ib_send_cmd_req() return status 0, but remote node > does not get any notification. another issue to solve now. 
If you check out the linux-kernel/infiniband/util directory, there is an IB CM kernel test program (cmpost) that you can run. You will need to copy the util directory under the infiniband directory, and update the infiniband/Kconfig and infiniband/Makefile to include it in the build. - Sean From sean.hefty at intel.com Fri Apr 21 09:56:18 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 21 Apr 2006 09:56:18 -0700 Subject: [dat-discussions] [openib-general][RFC]DAT2.0immediatedataproposal In-Reply-To: Message-ID: >We need a better job coordinating between 2 reflectors. One issue is that someone must subscribe to the dat-discussion list to post to it. - Sean From vuhuong at mellanox.com Fri Apr 21 10:11:15 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Fri, 21 Apr 2006 10:11:15 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: <44490648.9070106@mellanox.com> References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com> <44450F89.7020500@mellanox.com> <44451D32.1010106@mellanox.com> <444566E7.8070907@mellanox.com> <444687E2.8020103@mellanox.com> <44490648.9070106@mellanox.com> Message-ID: <44491233.2010207@mellanox.com> > > With the following patch applied my ia64 system does not crash anymore > > I prepare this patch diffing from srp revision 6455 applied with > srp-params.patch that I sent you last week > > Please let me know if you want this patch generated from current srp > (revision 6550) > > What is the status for srp-params.patch (introducing tuned parameters) Sorry. The previous patch is generated in reversed order -------------- next part -------------- A non-text attachment was scrubbed... 
Name: srp_eh.patch Type: text/x-patch Size: 1673 bytes Desc: not available URL: From mshefty at ichips.intel.com Fri Apr 21 10:44:06 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 21 Apr 2006 10:44:06 -0700 Subject: [openib-general] [PATCH] expand local SA cache to include all path records from source In-Reply-To: <4449096F.2050201@ichips.intel.com> References: <1145619939.23359.14223.camel@hal.voltaire.com> <4449096F.2050201@ichips.intel.com> Message-ID: <444919E6.3000502@ichips.intel.com> Sean Hefty wrote: > It would retrieve a large number of paths, and the resulting cache would > be huge. Thinking about this more, I will change IB_MAX_PATHS_PER_QUERY > to be a module parameter. I've committed this patch after adding a module parameter (paths_per_dest) to limit the number of paths returned for a given DGID. This will still allow use of a local cache, but allow an administrator to limit the resulting size of the cache. - Sean From sean.hefty at intel.com Fri Apr 21 10:53:19 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 21 Apr 2006 10:53:19 -0700 Subject: [openib-general] [PATCH] RDMA CM: only allow privileged access to ports below 1024 In-Reply-To: <20060420172204.GA6876@mellanox.co.il> Message-ID: Restrict access to ports below 1024 to privileged users. Signed-off-by: Sean Hefty --- I added the check to the kernel rdma_cm. Does this look right to you? 

Index: cma.c
===================================================================
--- cma.c	(revision 6541)
+++ cma.c	(working copy)
@@ -1375,11 +1375,16 @@ static int cma_use_port(struct idr *ps,
 	struct sockaddr_in *sin, *cur_sin;
 	struct rdma_bind_list *bind_list;
 	struct hlist_node *node;
+	unsigned short snum;
 
 	sin = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr;
-	bind_list = idr_find(ps, ntohs(sin->sin_port));
+	snum = ntohs(sin->sin_port);
+	if (snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
+		return -EACCES;
+
+	bind_list = idr_find(ps, snum);
 	if (!bind_list)
-		return cma_alloc_port(ps, id_priv, ntohs(sin->sin_port));
+		return cma_alloc_port(ps, id_priv, snum);
 
 	/*
 	 * We don't support binding to any address if anyone is bound to

From ssbyrn at yahoo.com Fri Apr 21 10:59:30 2006
From: ssbyrn at yahoo.com (susan)
Date: Fri, 21 Apr 2006 17:59:30 +0000 (UTC)
Subject: [openib-general] Re: ib_send_cm_req failes with error -22
References: <44468CCE.6070202@ichips.intel.com> <44480C85.1040704@ichips.intel.com> <44490DCA.9080609@ichips.intel.com>
Message-ID:

Sean Hefty <mshefty at ichips.intel.com> writes:

>
>
> If you check out the linux-kernel/infiniband/util directory, there
> is an IB CM kernel test program (cmpost) that you can run. You will
> need to copy the util directory under the infiniband directory, and
> update the infiniband/Kconfig and infiniband/Makefile to include
> it in the build.
>
> - Sean
>

thanks Sean! i will give it a try.

susan

From mshefty at ichips.intel.com Fri Apr 21 11:46:56 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 21 Apr 2006 11:46:56 -0700
Subject: [openib-general] RFC userspace / MPI multicast support
In-Reply-To: <20060420212640.GD2315@greglaptop.internal.keyresearch.com>
References: <1145552443.23359.631.camel@hal.voltaire.com> <20060420212640.GD2315@greglaptop.internal.keyresearch.com>
Message-ID: <444928A0.6080305@ichips.intel.com>

Greg Lindahl wrote:
>>Also, is IPoIB always setup when running MPI ?
> > That's an easy one: No. Do the ones that don't have ipoib use a separate IP (Ethernet) network? - Sean From sweitzen at cisco.com Fri Apr 21 11:49:12 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Fri, 21 Apr 2006 11:49:12 -0700 Subject: [openib-general] RFC userspace / MPI multicast support Message-ID: > Greg Lindahl wrote: > >>Also, is IPoIB always setup when running MPI ? > > > > That's an easy one: No. > > Do the ones that don't have ipoib use a separate IP > (Ethernet) network? Yes. From halr at voltaire.com Fri Apr 21 12:10:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Apr 2006 15:10:31 -0400 Subject: [openib-general] [PATCH] expand local SA cache to include all path records from source In-Reply-To: <444908C1.4010302@ichips.intel.com> References: <1145618498.23359.13944.camel@hal.voltaire.com> <444908C1.4010302@ichips.intel.com> Message-ID: <1145646233.23359.19352.camel@hal.voltaire.com> On Fri, 2006-04-21 at 12:30, Sean Hefty wrote: > Hal Rosenstock wrote: > > As the end node cannot know the policies the SM used for routing, etc., > > path records are insufficient in terms of some of the interesting > > characteristics for filtering like path independence (as fault > > independent as possible). > > This patch doesn't handle policy. I'm not following you on what policy you are referring to here. > It only stores multiple path records. Yes, but the path records returned from MPR queries can have "better" properties than PR queries for things like APM, etc. > Also, it seems like the argument can go both ways here. The SM doesn't know the > policies of the applications or their traffic patterns. A distributed > application may want paths from nodes A to B to have path independence from > paths from nodes C to D. Can the SM handle such requests? I think that is usually addressed in the subnet configuration (e.g. richness of the topology) when the subnet is architected (and use of LMC). 
Another possibility would be an SM feature to be configured in some manner which may not be realizable based on the topology. > > There's also a 1.2 erratum which enhances SA MultiPathRecord to support > > scoping of S/DGIDs. It supports explicit, same node, same system with > > and without high availability (separate HCAs). That could also be used > > when supported by the SA. > > It shouldn't be hard to change the path query from GetTable to GetMulti, once > the MAD layer supports dual-RMPP. > > > The API looks fine to me. One thing to note is to keep an eye on the > > lookups done on the local database and possibly optimizing them if > > needed. > > Tracking path usage becomes difficult, and would likely require involving the > CM, and possibly address handle creation (for UD QPs that didn't use SIDR). > When the cache is updated, checks would need to be added to see if a path is > currently in the cache, and invalid paths would need to be removed. There would > be a lot of additional work needed to do this. I was referring to the types of lookups used (e.g. key variety). -- Hal > - Sean From rdreier at cisco.com Fri Apr 21 12:30:23 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 21 Apr 2006 12:30:23 -0700 Subject: [openib-general] [PATCH 1 of 2] core extensions to fix race at module unloading In-Reply-To: <44490CA6.6040201@ichips.intel.com> (Sean Hefty's message of "Fri, 21 Apr 2006 09:47:34 -0700") References: <20060421072813.GA16418@mellanox.co.il> <44490CA6.6040201@ichips.intel.com> Message-ID: Sean> To restate what I've mentioned before, I don't see any issue Sean> with the ib_mad interface. It doesn't make sense to me to Sean> change this API, when the problem is in the client. But I'd Sean> like to hear what others think. I agree. If we care about fixing this, it seems better to put it in the SA module, and make every consumer of the SA have to allocate a cookie before using it. Then we can wait for callbacks when freeing the cookie. 
Similarly I guess we just get rid of the "free my cm_id" return value from CM callbacks and make everyone free cm_ids from non-callback context. - R. From rdreier at cisco.com Fri Apr 21 12:30:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 21 Apr 2006 12:30:54 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: <44490648.9070106@mellanox.com> (Vu Pham's message of "Fri, 21 Apr 2006 09:20:24 -0700") References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com> <44450F89.7020500@mellanox.com> <44451D32.1010106@mellanox.com> <444566E7.8070907@mellanox.com> <444687E2.8020103@mellanox.com> <44490648.9070106@mellanox.com> Message-ID: Vu> What is the status for srp-params.patch (introducing tuned Vu> parameters) Sorry, haven't really looked at it. I've been spending my SRP time trying to get FMRs working. - R. From rdreier at cisco.com Fri Apr 21 12:33:02 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 21 Apr 2006 12:33:02 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: <44491233.2010207@mellanox.com> (Vu Pham's message of "Fri, 21 Apr 2006 10:11:15 -0700") References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com> <44450F89.7020500@mellanox.com> <44451D32.1010106@mellanox.com> <444566E7.8070907@mellanox.com> <444687E2.8020103@mellanox.com> <44490648.9070106@mellanox.com> <44491233.2010207@mellanox.com> Message-ID: Thanks. Can you explain what the bug causing the crash is? I'd like to understand the "why" of this patch. - R. From iod00d at hp.com Fri Apr 21 13:10:18 2006 From: iod00d at hp.com (Grant Grundler) Date: Fri, 21 Apr 2006 13:10:18 -0700 Subject: [openib-general] Re: Speeding up IPoIB. 
In-Reply-To: References: <20060419225630.GG6430@esmail.cup.hp.com> Message-ID: <20060421201018.GA11491@esmail.cup.hp.com> On Thu, Apr 20, 2006 at 09:03:29PM -0400, Bernard King-Smith wrote: > Grant> My guess is it's an easier problem to fix SDP than reducing TCP/IP > Grant> cache/CPU foot print. I realize only a subset of apps can (or will > Grant> try to) use SDP because of setup/config issues. I still believe SDP > Grant> is useful to a majority of apps without having to recompile them. > > I agree that reducing any protocol footprint is a very challenging job, > however, going to a larger MTU drops the overhead much faster. If IB > supported a 60K MTU then the TCP/IP overhead would be 1/30 that of what we > measure today. Traversng the TCP/IP stack once for a 60K packet is much > lower than 30 times using 2000 byte packets for the same amount of data > transmitted. I agree that's effective for workloads which send large messages. And that's typical for storage workloads. But the world is not just an NFS server. ;) > Grant> I'm not competent to disagree in detail. > Grant> Fabian Tillier and Caitlin Bestler can (and have) addressed this. > > I would be very interested in any pointers to their work. They have posted to this forum recently on this topic. The archives are here in case you want to look them up: http://www.openib.org/contact.html > This goes back to systems where the system is busy doing nothing, generally > when waiting for memory or a cache line miss, or I/O to disks. This is > where hyperthreading has shown some speedups for benchmarks where > previously they were totally CPU limited, and with hyperthreading there is > a gain. While there are workloads that benefit, I don't buy the hyperthreading argument in general. Co-workers have demonstrate several "normal" workloads that don't benefit and are faster with hyperthreading disabled. > The unused cycles are "wait" cycles when something can run if it > can get in quickly. 
You can't get a TCP stack in the wait, but small parts > of the stackor driver could fit in the other thread. Yes I do benchmarking > and was skeptical at first. ok. thanks, grant From wombat2 at us.ibm.com Fri Apr 21 13:46:46 2006 From: wombat2 at us.ibm.com (Bernard King-Smith) Date: Fri, 21 Apr 2006 16:46:46 -0400 Subject: [openib-general] Re: Speeding up IPoIB. In-Reply-To: <20060421201018.GA11491@esmail.cup.hp.com> Message-ID: Grant Grundler wrote: > Grant> My guess is it's an easier problem to fix SDP than reducing TCP/IP > Grant> cache/CPU foot print. I realize only a subset of apps can (or will > Grant> try to) use SDP because of setup/config issues. I still believe SDP > Grant> is useful to a majority of apps without having to recompile them. > > I agree that reducing any protocol footprint is a very challenging job, > however, going to a larger MTU drops the overhead much faster. If IB > supported a 60K MTU then the TCP/IP overhead would be 1/30 that of what we > measure today. Traversng the TCP/IP stack once for a 60K packet is much > lower than 30 times using 2000 byte packets for the same amount of data > transmitted. Grant> I agree that's effective for workloads which send large messages. Grant> And that's typical for storage workloads. Grant> But the world is not just an NFS server. ;) However, NFS is not the only large data transfer workload I come across. If IB wants to achieve high volumes there needs to be some kind of commercial workload that works well on it besides the large applications that can afford to port to SDP or uDAPL. HPC is a nitch that is hard to sustain a viable business in ( though everyone can point to a couple of companies, most are not long term, 15 years or more ). However, in clustering the areas that benefit from large block transfer efficiency include: File Serving ( NFS, GPFS, XFS etc.) 
Application backup Parallel databases Database upload/update flows Web server graphics Web server MP3s Web server streaming video Local workstation backup Collaboration software Local mail replication My concern is that if IB does not support these operations as well as Ethernet, then it is a hard sell into commercial accounts/workloads for IB. Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." William Shatner Grant Grundler To 04/21/2006 04:10 Bernard PM King-Smith/Poughkeepsie/IBM at IBMUS cc Grant Grundler , openib-general at openib.org, Roland Dreier Subject Re: Speeding up IPoIB. On Thu, Apr 20, 2006 at 09:03:29PM -0400, Bernard King-Smith wrote: > Grant> My guess is it's an easier problem to fix SDP than reducing TCP/IP > Grant> cache/CPU foot print. I realize only a subset of apps can (or will > Grant> try to) use SDP because of setup/config issues. I still believe SDP > Grant> is useful to a majority of apps without having to recompile them. > > I agree that reducing any protocol footprint is a very challenging job, > however, going to a larger MTU drops the overhead much faster. If IB > supported a 60K MTU then the TCP/IP overhead would be 1/30 that of what we > measure today. Traversng the TCP/IP stack once for a 60K packet is much > lower than 30 times using 2000 byte packets for the same amount of data > transmitted. I agree that's effective for workloads which send large messages. And that's typical for storage workloads. But the world is not just an NFS server. ;) > Grant> I'm not competent to disagree in detail. > Grant> Fabian Tillier and Caitlin Bestler can (and have) addressed this. > > I would be very interested in any pointers to their work. 
They have posted to this forum recently on this topic. The archives are here in case you want to look them up: http://www.openib.org/contact.html > This goes back to systems where the system is busy doing nothing, generally > when waiting for memory or a cache line miss, or I/O to disks. This is > where hyperthreading has shown some speedups for benchmarks where > previously they were totally CPU limited, and with hyperthreading there is > a gain. While there are workloads that benefit, I don't buy the hyperthreading argument in general. Co-workers have demonstrated several "normal" workloads that don't benefit and are faster with hyperthreading disabled. > The unused cycles are "wait" cycles when something can run if it > can get in quickly. You can't get a TCP stack in the wait, but small parts > of the stack or driver could fit in the other thread. Yes I do benchmarking > and was skeptical at first. ok. thanks, grant From vuhuong at mellanox.com Fri Apr 21 14:02:50 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Fri, 21 Apr 2006 14:02:50 -0700 Subject: [openib-general][patch review] srp: fmr implementation, In-Reply-To: References: <443C4934.7080400@mellanox.com> <443E9A88.7020302@mellanox.com> <443FC35F.6080301@mellanox.com> <44450F89.7020500@mellanox.com> <44451D32.1010106@mellanox.com> <444566E7.8070907@mellanox.com> <444687E2.8020103@mellanox.com> <44490648.9070106@mellanox.com> <44491233.2010207@mellanox.com> Message-ID: <4449487A.3080004@mellanox.com> > Thanks. Can you explain what the bug causing the crash is? I'd like > to understand the "why" of this patch. > 1. srp_unmap_data() and srp_remove_req() for .eh_abort_handler(scmnd) if a. the abort gets a timeout or b. req->cmd_done or c. !req->tsk_status 2. We should do step 1 for .eh_abort_handler(scmnd) only, and not for .eh_device_reset_handler(scmnd), since the same scsi command is used for all .eh_handler() 3.
The scsi command used in all .eh_handler() calls will be freed by the scsi midlayer at the end of the error handling sequence 4. If we don't do step 1, the scsi command that is used in all .eh_handler() calls and then freed is still in our pending queue and is referenced in srp_reconnect_target() / reinit request ring Vu From rdreier at cisco.com Fri Apr 21 16:29:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 21 Apr 2006 16:29:40 -0700 Subject: [openib-general] on vacation... Message-ID: I'll be going on vacation all next week, very out of touch with email. So don't expect a response from me until about May 1. Thanks, Roland From mst at mellanox.co.il Sat Apr 22 11:17:19 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 22 Apr 2006 21:17:19 +0300 Subject: [openib-general] [PATCH 1 of 2] core extensions to fix race at module unloading In-Reply-To: References: <20060421072813.GA16418@mellanox.co.il> <44490CA6.6040201@ichips.intel.com> Message-ID: <20060422181719.GA26462@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] [PATCH 1 of 2] core extensions to fix race at module unloading > > Sean> To restate what I've mentioned before, I don't see any issue > Sean> with the ib_mad interface. It doesn't make sense to me to > Sean> change this API, when the problem is in the client. > Sean> But I'd like to hear what others think. > > I agree. If we care about fixing this, it seems better to put it in > the SA module, and make every consumer of the SA have to allocate a > cookie before using it. Then we can wait for callbacks when freeing > the cookie. Similarly I guess we just get rid of the "free my cm_id" > return value from CM callbacks and make everyone free cm_ids from > non-callback context. Well, I don't see how to implement all this. You guys go ahead then. -- MST From mst at mellanox.co.il Sat Apr 22 11:21:54 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Sat, 22 Apr 2006 21:21:54 +0300 Subject: [openib-general] [PATCH 1 of 2] core extensions to fix race at module unloading In-Reply-To: <44490CA6.6040201@ichips.intel.com> References: <20060421072813.GA16418@mellanox.co.il> <44490CA6.6040201@ichips.intel.com> Message-ID: <20060422182154.GB26462@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] [PATCH 1 of 2] core extensions to fix race at module unloading > > Michael S. Tsirkin wrote: > > /** > >+ * ib_flush_mad_agent - flush any callbacks in flight for this client. > >+ * @mad_agent: Corresponding MAD registration request to flush. > >+ */ > >+int ib_flush_mad_agent(struct ib_mad_agent *mad_agent); > > To restate what I've mentioned before, I don't see any issue with the > ib_mad interface. It doesn't make sense to me to change this API, when the > problem is in the client. But I'd like to hear what others think. The API is there to be used, no? If it's useful for clients to flush callbacks, why shouldn't ib_mad provide the service? The implementation is trivial. -- MST From mst at mellanox.co.il Sun Apr 23 00:54:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 23 Apr 2006 10:54:21 +0300 Subject: [openib-general] [PATCH] cma nit Message-ID: <20060423075420.GW1792@mellanox.co.il> A ';' isn't necessary after a {} code block. Signed-off-by: Michael S.
Tsirkin Index: linux-2.6.16/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.16.orig/drivers/infiniband/core/cma.c 2006-04-20 17:47:05.000000000 +0300 +++ linux-2.6.16/drivers/infiniband/core/cma.c 2006-04-23 10:13:27.000000000 +0300 @@ -1031,7 +1031,7 @@ int rdma_listen(struct rdma_cm_id *id, i err: cma_comp_exch(id_priv, CMA_LISTEN, CMA_ADDR_BOUND); return ret; -}; +} EXPORT_SYMBOL(rdma_listen); static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, @@ -1662,7 +1662,7 @@ int rdma_reject(struct rdma_cm_id *id, c break; } return ret; -}; +} EXPORT_SYMBOL(rdma_reject); int rdma_disconnect(struct rdma_cm_id *id) -- MST From mst at mellanox.co.il Sun Apr 23 00:57:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 23 Apr 2006 10:57:52 +0300 Subject: [openib-general] Re: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: References: Message-ID: <20060423075752.GX1792@mellanox.co.il> Quoting r. Sean Hefty : > rdma_bind_addr() is called for the user in rdma_resolve_addr(), unless the user > has already bound the cm_id. BTW, Sean, if you intend to emulate socket interface, same is necessary for listen: listen without bind is equivalent to binding to anyport/anyinterface first. Do you think this logic belongs in CMA or in ULPs? -- MST From eitan at mellanox.co.il Sun Apr 23 04:10:07 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 23 Apr 2006 14:10:07 +0300 Subject: [openib-general] IPoIB interface for unauthorized partition In-Reply-To: <6a122cc00604100701n476272dfrb2f0527fb6f48a39@mail.gmail.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102BA2C@mtlexch01.mtl.com> <6a122cc00604100701n476272dfrb2f0527fb6f48a39@mail.gmail.com> Message-ID: <444B608F.5010200@mellanox.co.il> Hi Moni, Sorry it took me a while to get back to you (was out on vacation ...) 
Moni Levy wrote: > On 4/10/06, Eitan Zahavi wrote: > >>Hi Hal, >> >> >>>-----Original Message----- >>>From: Hal Rosenstock [mailto:halr at voltaire.com] >>>Sent: Monday, April 10, 2006 2:00 PM >>>To: Eitan Zahavi >>>Cc: Roland Dreier; openib-general at openib.org >>>Subject: Re: [openib-general] IPoIB interface for unauthorized >> >>partition >> >>>Hi Eitan, >>> >>>On Mon, 2006-04-10 at 02:35, Eitan Zahavi wrote: >>> >>>>Hi Roland, >>>> >>>>Roland Dreier wrote: >>>> >>>>> Eitan> I thought the intent of the IB spec when defining P_Key >>>>> Eitan> index usage (and not P_Key value) was that the P_Key >> >>values >> >>>>> Eitan> would never need to be known above the driver level. >> >>To >> >>>>> Eitan> avoid exposing the P_Key values we could use P_Key >> >>index >> >>>>> Eitan> for creating the IPoIB interfaces. >>>>> >>>>> Eitan> Does it make sense to work on a patch that would setup >>>>> Eitan> IPoIB interfaces by the P_Key index (and not by P_Key >>>>> Eitan> value)? >>>>> >>>>>I don't see how this is feasible. The index that a particular >> >>P_Key >> >>>>>lands at is completely undetermined -- if two nodes wanted to talk >> >>on >> >>>>>partition 0x8001 say, how does one know which interface to use >> >>without >> >>>>>knowing the index of that P_Key? >>>> >>>>OK, I get it. Actually the way IPoIB defines the broadcast group >> >>MGID exposes >> >>>P_Key anyway. >>> >>>>> Eitan> Also I think the expected behavior for IPoIB should be >> >>that >> >>>>> Eitan> IPoIB "child" interfaces should be "automatically" >>>>> Eitan> initialized by the code that brings up the interface >>>>> Eitan> (ifconfig scripts). All valid IPoIB partitions (valid = >>>>> Eitan> have corresponding broadcast groups) should be >>>>> Eitan> initialized. By doing so we provide a centralized >> >>control >> >>>>> Eitan> of the partitions and their IPoIB interfaces through >> >>the >> >>>>> Eitan> SM. >>>>> >>>>>Not sure if this is so. 
I may want a partition strictly for >> >>storage >> >>>>>traffic something like that, so it doesn't make sense to create an >>>>>IPoIB interface for that partition. >>>> >>>>OpenSM provides this capability in the "partition policy": >>>>Each partition is marked explicitly if to be used for IPoIB or not. >>>>So through one file one could actually control the IPoIB interfaces >>>>that will exist in the subnet. >>> >>>The end node does not know the SM policy for that partition though. >>> >>> >>>>My intent is to write some extension to ifup for IPoIB such that all >> >>sub >> >>>>interfaces will be automatically started (based on pre-availability >> >>of IPoIB >> >>>>broadcast MGID). > > > I'm not sure how ifup is related to that. From what I understand you'd > like ipoib driver to behave as follows: > > 1. Get an event ( or figure it out) when a new PKEY is added to the > relevant port partition table. I prefer not to rely on new events. Instead I would like to rely on existing IB Notices: If we register for multicast group create/delete events (traps 66/67), IPoIB can know about each new partition created. > 2. Try to join that new MC group with the MGID it created according to > the PKEY and the spec. (or maybe query for the MC group existence but > that's not atomic) Simply join the group. We rely on these groups being pre-created by the SM, which enforces the policy dictating which partitions should be used for IPoIB and which should not. > 3. In case it fails nothing is done (no relevant MC group was > pre-created in the SM). Exactly. > 4. In case it succeeds a new interface is created. > > Is that what you meant ? > > - Moni > > >>>If that were to be done, it would be cleanest if the child IPoIB >>>interface was created only if that IPoIB broadcast group for that >>>partition exists. >> >>[EZ] This is exactly what I had in mind. >> >>>-- Hal >>> >>> >>>>> - R.
>>>>> >>>> >>_______________________________________________ >>openib-general mailing list >>openib-general at openib.org >>http://openib.org/mailman/listinfo/openib-general >> >>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> From ogerlitz at voltaire.com Sun Apr 23 04:26:28 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 23 Apr 2006 14:26:28 +0300 (IDT) Subject: [openib-general] slab error while removing ib_mad Message-ID: I am getting the below trace on 2.6.17-rc2 / AMD x86_64 / PCIX HCA with both the IB sources that come with the kernel and svn trunk 6520. This happens if i just modprobe -r ib_mthca after fresh reboot, can anyone reproduce it on her/his system as well? The module does get modprobed out. Or. $ modprobe -r ib_mthca $ dmesg slab error in kmem_cache_destroy(): cache `ib_mad': Can't free all objects Call Trace: {kmem_cache_destroy+150} {:ib_mad:ib_mad_cleanup_module+25} {sys_delete_module+415} {__up_write+20} {sys_munmap+91} {system_call+126} ib_mad: Failed to destroy ib_mad cache From dotanb at mellanox.co.il Sun Apr 23 04:29:27 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 23 Apr 2006 14:29:27 +0300 Subject: [openib-general] Re: [uDAPL] dat.conf generator In-Reply-To: <44468CBB.7000501@ichips.intel.com> References: <200604121122.48646.dotanb@mellanox.co.il> <44468CBB.7000501@ichips.intel.com> Message-ID: <200604231429.27532.dotanb@mellanox.co.il> On Wednesday 19 April 2006 22:17, Arlin Davis wrote: > >>>>>>OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" "" > >>>>>>OpenIB-cma0-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" "" > >>>>>>OpenIB-cma0-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 2" "" > >>>>>>OpenIB-cma1-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 1" "" > >>>>>>OpenIB-cma1-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 2" "" > >>>>>> > 
These entries are wrong. The cma version will only work with an ip > address, network hostname, or netdev name. The port value is > meaningless since the name gives you the device and port reference all > in one. > > For cma the best flavor is netdev name as follows: > > OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" > > because it allows you to have identical dat.conf setups across your > cluster if you intend on using the first IB interface on each node. But what if one wants to work with the second ib I/F (or with the third)? Suppose I automatically generate a dat.conf that has the following lines (on a machine with 2 HCAs, 2 ports in each HCA): OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" OpenIB-cma0 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" OpenIB-cma1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib1 0" "" OpenIB-cma2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib2 0" "" OpenIB-cma3 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib3 0" "" In this file, there is a dapl provider name for every IPoIB I/F, and still there is a default entry (the first I/F). Is this useful? Dotan From monil at voltaire.com Sun Apr 23 04:40:44 2006 From: monil at voltaire.com (Moni Levy) Date: Sun, 23 Apr 2006 14:40:44 +0300 Subject: [openib-general] IPoIB interface for unauthorized partition In-Reply-To: <444B608F.5010200@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102BA2C@mtlexch01.mtl.com> <6a122cc00604100701n476272dfrb2f0527fb6f48a39@mail.gmail.com> <444B608F.5010200@mellanox.co.il> Message-ID: <6a122cc00604230440x6383067do6a3a04e69e263d89@mail.gmail.com> On 4/23/06, Eitan Zahavi wrote: > Hi Moni, > > Sorry it took me a while to get back to you (was out on vacation ...)
> > Moni Levy wrote: > > On 4/10/06, Eitan Zahavi wrote: > > > >>Hi Hal, > >> > >> > >>>-----Original Message----- > >>>From: Hal Rosenstock [mailto:halr at voltaire.com] > >>>Sent: Monday, April 10, 2006 2:00 PM > >>>To: Eitan Zahavi > >>>Cc: Roland Dreier; openib-general at openib.org > >>>Subject: Re: [openib-general] IPoIB interface for unauthorized > >> > >>partition > >> > >>>Hi Eitan, > >>> > >>>On Mon, 2006-04-10 at 02:35, Eitan Zahavi wrote: > >>> > >>>>Hi Roland, > >>>> > >>>>Roland Dreier wrote: > >>>> > >>>>> Eitan> I thought the intent of the IB spec when defining P_Key > >>>>> Eitan> index usage (and not P_Key value) was that the P_Key > >> > >>values > >> > >>>>> Eitan> would never need to be known above the driver level. > >> > >>To > >> > >>>>> Eitan> avoid exposing the P_Key values we could use P_Key > >> > >>index > >> > >>>>> Eitan> for creating the IPoIB interfaces. > >>>>> > >>>>> Eitan> Does it make sense to work on a patch that would setup > >>>>> Eitan> IPoIB interfaces by the P_Key index (and not by P_Key > >>>>> Eitan> value)? > >>>>> > >>>>>I don't see how this is feasible. The index that a particular > >> > >>P_Key > >> > >>>>>lands at is completely undetermined -- if two nodes wanted to talk > >> > >>on > >> > >>>>>partition 0x8001 say, how does one know which interface to use > >> > >>without > >> > >>>>>knowing the index of that P_Key? > >>>> > >>>>OK, I get it. Actually the way IPoIB defines the broadcast group > >> > >>MGID exposes > >> > >>>P_Key anyway. > >>> > >>>>> Eitan> Also I think the expected behavior for IPoIB should be > >> > >>that > >> > >>>>> Eitan> IPoIB "child" interfaces should be "automatically" > >>>>> Eitan> initialized by the code that brings up the interface > >>>>> Eitan> (ifconfig scripts). All valid IPoIB partitions (valid = > >>>>> Eitan> have corresponding broadcast groups) should be > >>>>> Eitan> initialized. 
By doing so we provide a centralized > >> > >>control > >> > >>>>> Eitan> of the partitions and their IPoIB interfaces through > >> > >>the > >> > >>>>> Eitan> SM. > >>>>> > >>>>>Not sure if this is so. I may want a partition strictly for > >> > >>storage > >> > >>>>>traffic something like that, so it doesn't make sense to create an > >>>>>IPoIB interface for that partition. > >>>> > >>>>OpenSM provides this capability in the "partition policy": > >>>>Each partition is marked explicitly if to be used for IPoIB or not. > >>>>So through one file one could actually control the IPoIB interfaces > >>>>that will exist in the subnet. > >>> > >>>The end node does not know the SM policy for that partition though. > >>> > >>> > >>>>My intent is to write some extension to ifup for IPoIB such that all > >> > >>sub > >> > >>>>interfaces will be automatically started (based on pre-availability > >> > >>of IPoIB > >> > >>>>broadcast MGID). > > > > > > I'm not sure how ifup is related to that. From what I understand you'd > > like ipoib driver to behave as follows: > > > > 1. Get an event ( or figure it out) when a new PKEY is added to the > > relevant port partition table. > I prefer not to rely on new events. Instead I would like to rely on existing IB Notices: > If we register to multicast group create/delete events (traps 66/67) IPoIB can know about each new partition created. I'm not sure that this is a good idea, because that way all of the IPoIB nodes will get that event and try to join every new MC group and partitioning by definition is good for separating a fabric. I think that the right thing should be that only the relevant nodes try to join the specific MCG. > > > 2. Try to join that new MC group with the MGID it created according to > > the PKEY and the spec. (or maybe query for the MC group existance but > > that's not atomic) > Simply join the group. 
We rely on these groups to be pre-created by the SM enforcing policy dictating which partitions should > be used for IPoIB and which not. If you let all the IPoIB nodes join every new group without checking their PKEY tables first, they may even get joined if the SM is not enforcing the MCG-to-port policy. Is that your plan ? > > > 3. In case it fails nothing is done (no relevant MC group was > > pre-created in the SM). > Exactly > > > 4. In case it succeeds a new interface is created. > > > > Is that what you meant ? > > > > - Moni > > > > > >>>If that were to be done, it would be cleanest if the child IPoIB > >>>interface was created only if that IPoIB broadcast group for that > >>>partition exists. > >> > >>[EZ] This is exactly what I had in mind. > >> > >>>-- Hal > >>> > >>> > >>>>> - R. > >>>>> > >>>> > >> > > From eitan at mellanox.co.il Sun Apr 23 04:49:17 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 23 Apr 2006 14:49:17 +0300 Subject: [openib-general] IPoIB interface for unauthorized partition In-Reply-To: <6a122cc00604230440x6383067do6a3a04e69e263d89@mail.gmail.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102BA2C@mtlexch01.mtl.com> <6a122cc00604100701n476272dfrb2f0527fb6f48a39@mail.gmail.com> <444B608F.5010200@mellanox.co.il> <6a122cc00604230440x6383067do6a3a04e69e263d89@mail.gmail.com> Message-ID: <444B69BD.4000705@mellanox.co.il> Moni Levy wrote: > On 4/23/06, Eitan Zahavi wrote: > >>Hi Moni, >> >>Sorry it took me a while to get back to you (was out on vacation ...)
>> >>Moni Levy wrote: >> >>>On 4/10/06, Eitan Zahavi wrote: >>> >>> >>>>Hi Hal, >>>> >>>> >>>> >>>>>-----Original Message----- >>>>>From: Hal Rosenstock [mailto:halr at voltaire.com] >>>>>Sent: Monday, April 10, 2006 2:00 PM >>>>>To: Eitan Zahavi >>>>>Cc: Roland Dreier; openib-general at openib.org >>>>>Subject: Re: [openib-general] IPoIB interface for unauthorized >>>> >>>>partition >>>> >>>> >>>>>Hi Eitan, >>>>> >>>>>On Mon, 2006-04-10 at 02:35, Eitan Zahavi wrote: >>>>> >>>>> >>>>>>Hi Roland, >>>>>> >>>>>>Roland Dreier wrote: >>>>>> >>>>>> >>>>>>> Eitan> I thought the intent of the IB spec when defining P_Key >>>>>>> Eitan> index usage (and not P_Key value) was that the P_Key >>>> >>>>values >>>> >>>> >>>>>>> Eitan> would never need to be known above the driver level. >>>> >>>>To >>>> >>>> >>>>>>> Eitan> avoid exposing the P_Key values we could use P_Key >>>> >>>>index >>>> >>>> >>>>>>> Eitan> for creating the IPoIB interfaces. >>>>>>> >>>>>>> Eitan> Does it make sense to work on a patch that would setup >>>>>>> Eitan> IPoIB interfaces by the P_Key index (and not by P_Key >>>>>>> Eitan> value)? >>>>>>> >>>>>>>I don't see how this is feasible. The index that a particular >>>> >>>>P_Key >>>> >>>> >>>>>>>lands at is completely undetermined -- if two nodes wanted to talk >>>> >>>>on >>>> >>>> >>>>>>>partition 0x8001 say, how does one know which interface to use >>>> >>>>without >>>> >>>> >>>>>>>knowing the index of that P_Key? >>>>>> >>>>>>OK, I get it. Actually the way IPoIB defines the broadcast group >>>> >>>>MGID exposes >>>> >>>> >>>>>P_Key anyway. >>>>> >>>>> >>>>>>> Eitan> Also I think the expected behavior for IPoIB should be >>>> >>>>that >>>> >>>> >>>>>>> Eitan> IPoIB "child" interfaces should be "automatically" >>>>>>> Eitan> initialized by the code that brings up the interface >>>>>>> Eitan> (ifconfig scripts). All valid IPoIB partitions (valid = >>>>>>> Eitan> have corresponding broadcast groups) should be >>>>>>> Eitan> initialized. 
By doing so we provide a centralized >>>> >>>>control >>>> >>>> >>>>>>> Eitan> of the partitions and their IPoIB interfaces through >>>> >>>>the >>>> >>>> >>>>>>> Eitan> SM. >>>>>>> >>>>>>>Not sure if this is so. I may want a partition strictly for >>>> >>>>storage >>>> >>>> >>>>>>>traffic something like that, so it doesn't make sense to create an >>>>>>>IPoIB interface for that partition. >>>>>> >>>>>>OpenSM provides this capability in the "partition policy": >>>>>>Each partition is marked explicitly if to be used for IPoIB or not. >>>>>>So through one file one could actually control the IPoIB interfaces >>>>>>that will exist in the subnet. >>>>> >>>>>The end node does not know the SM policy for that partition though. >>>>> >>>>> >>>>> >>>>>>My intent is to write some extension to ifup for IPoIB such that all >>>> >>>>sub >>>> >>>> >>>>>>interfaces will be automatically started (based on pre-availability >>>> >>>>of IPoIB >>>> >>>> >>>>>>broadcast MGID). >>> >>> >>>I'm not sure how ifup is related to that. From what I understand you'd >>>like ipoib driver to behave as follows: >>> >>>1. Get an event ( or figure it out) when a new PKEY is added to the >>>relevant port partition table. >> >>I prefer not to rely on new events. Instead I would like to rely on existing IB Notices: >>If we register to multicast group create/delete events (traps 66/67) IPoIB can know about each new partition created. > > > I'm not sure that this is a good idea, because that way all of the > IPoIB nodes will get that event and try to join every new MC group and > partitioning by definition is good for separating a fabric. I think > that the right thing should be that only the relevant nodes try to > join the specific MCG. A node that does not have a P_Key matching the multicast group would not receive the Report anyway. So there is no problem. If a node is not part of the partition it will never know about the new group. > > >>>2. 
Try to join that new MC group with the MGID it created according to >>>the PKEY and the spec. (or maybe query for the MC group existence but >>>that's not atomic) >> >>Simply join the group. We rely on these groups to be pre-created by the SM enforcing policy dictating which partitions should >>be used for IPoIB and which not. > > > If you let all the IPoIB nodes join every new group without checking > their PKEY tables first, they may even get joined if the SM is not > enforcing the MCG-to-port policy. > Is that your plan ? A compliant SM will never let them join and never report any new group to ports not in the partition. > > >>>3. In case it fails nothing is done (no relevant MC group was >>>pre-created in the SM). >> >>Exactly >> >> >>>4. In case it succeeds a new interface is created. >>> >>>Is that what you meant ? >>> >>>- Moni >>> >>> >>> >>>>>If that were to be done, it would be cleanest if the child IPoIB >>>>>interface was created only if that IPoIB broadcast group for that >>>>>partition exists. >>>> >>>>[EZ] This is exactly what I had in mind. >>>> >>>> >>>>>-- Hal >>>>> >>>>> >>>>> >>>>>>>- R. >>>>>>> >>>>>> >>>> >> >> From mst at mellanox.co.il Sun Apr 23 05:28:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 23 Apr 2006 15:28:18 +0300 Subject: [openib-general] Re: [PATCH] RDMA CM: only allow privileged access to ports below 1024 In-Reply-To: References: <20060420172204.GA6876@mellanox.co.il> Message-ID: <20060423122818.GG1792@mellanox.co.il> Quoting r. Sean Hefty : > Subject: [PATCH] RDMA CM: only allow privileged access to ports below 1024 > > Restrict access to ports below 1024 to privileged users. > > Signed-off-by: Sean Hefty Seems to work fine here, please check it in.
-- MST From mst at mellanox.co.il Sun Apr 23 05:42:59 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 23 Apr 2006 15:42:59 +0300 Subject: [openib-general] [PATCH] cma: remove unused header include Message-ID: <20060423124258.GH1792@mellanox.co.il> don't include a header we don't seem to use Signed-off-by: Michael S. Tsirkin Index: linux-2.6.16/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.16.orig/drivers/infiniband/core/cma.c 2006-04-23 15:04:05.000000000 +0300 +++ linux-2.6.16/drivers/infiniband/core/cma.c 2006-04-23 15:04:10.000000000 +0300 @@ -34,7 +34,6 @@ #include #include #include -#include #include #include #include -- MST From dotanb at mellanox.co.il Sun Apr 23 06:13:13 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 23 Apr 2006 16:13:13 +0300 Subject: [openib-general] Re: [uDAPL] dtest server never ends when using the dapl provider "OpenIB-scm1" In-Reply-To: <4446BE2E.4080305@ichips.intel.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301EF04EC@mtlexch01.mtl.com> <4446BE2E.4080305@ichips.intel.com> Message-ID: <200604231613.13970.dotanb@mellanox.co.il> On Thursday 20 April 2006 01:48, Arlin Davis wrote: > >Here is a back trace of the hanged process: > >(gdb) bt > >#0 0x00002aaaab31c86a in pthread_cond_wait@@GLIBC_2.3.2 () from > >/lib64/tls/libpthread.so.0 > >#1 0x00002aaaab42ef5b in dapl_os_wait_object_wait (wait_obj=0x516650, > >timeout_val=) at dapl_osd.c:276 > >#2 0x00002aaaab42e9ab in dapl_evd_wait (evd_handle=0x516560, > >time_out=4294967295, threshold=1, event=0x7fffffdd7bf0, > >nmore=0x7fffffdd7c2c) > > at dapl_evd_wait.c:233 > >#3 0x00000000004021ab in disconnect_ep () at dtest.c:894 > >#4 0x0000000000404cad in main (argc=4, argv=) at > > > > > > > Yes, looks like the disconnect event was dropped. Couple of questions: > > Does this only happen with the scm provider? > Can you reproduce on the OpenIB trunk or 1.0 branch? 
> > Thanks, > > -arlin > > I tried to execute this test on the following driver: openib_gen2-20060420-0800, svn REV=6520 (trunk) and I get the same results. Here are the results of this test execution with all of the dapl provider names: OpenIB-cma : passed OpenIB-cma-ip : passed OpenIB-cma-netdev : passed OpenIB-scm1 : never ends I guess that, as you suspected, the problem is with the scm provider. Dotan From leonida at voltaire.com Sun Apr 23 06:38:00 2006 From: leonida at voltaire.com (Leonid Arsh) Date: Sun, 23 Apr 2006 16:38:00 +0300 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: References: Message-ID: <444B8338.8000608@voltaire.com> Shirley, some additional information you may be interested in: According to our experience with the Voltaire IPoIB driver, splitting the CQ harmed the throughput (we checked with the iperf application, UDP mode). Splitting the CQ caused more interrupts, context switches and CQ polls. Note that the case is rather different from OpenIB mthca, since Voltaire IPoIB is based on the VAPI driver, where CQ completions are handled in a tasklet context, unlike mthca, where CQ completions are handled in the HW interrupt context. NAPI gave us some improvement. I think NAPI should improve much more in mthca, with the HW interrupt CQ completions. Shirley Ma wrote: > > OK. I am going to split the patch without splitting CQ first. > > WC handler is called in the interrupt context, it is a myth > to have bidirectional performance improvement with splitting CQ. > More investigation is needed. > > If WC handler can be moved from interrupt context, splitting CQ > is still an approach. Having a separate thread or NAPI support > can be implemented later.
> > Thanks > Shirley Ma > IBM Linux Technology Center > 15300 SW Koll Parkway > Beaverton, OR 97006-6063 > Phone(Fax): (503) 578-7638 From tziporet at mellanox.co.il Sun Apr 23 07:13:30 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 23 Apr 2006 17:13:30 +0300 Subject: [openib-general] slab error while removing ib_mad In-Reply-To: References: Message-ID: <444B8B8A.3010902@mellanox.co.il> Or Gerlitz wrote: > I am getting the below trace on 2.6.17-rc2 / AMD x86_64 / PCIX HCA > with both the IB sources that come with the kernel and svn trunk 6520. > > This happens if i just modprobe -r ib_mthca after fresh reboot, can > anyone reproduce it on her/his system as well? The module does get > modprobed out. > > Or. > > $ modprobe -r ib_mthca > > $ dmesg > > slab error in kmem_cache_destroy(): cache `ib_mad': Can't free all objects > > Call Trace: {kmem_cache_destroy+150} > {:ib_mad:ib_mad_cleanup_module+25} > {sys_delete_module+415} > {__up_write+20} > {sys_munmap+91} > {system_call+126} > > ib_mad: Failed to destroy ib_mad cache > > Hi Or, I tested here on Intel x86_64 systems with kernel from Fedora C4 last update (2.6.16-1.2069_FC4) and there was no issue in removing and loading ib_mthca. I tested the code in IBED-1.0-rc3.
Tziporet From sashak at voltaire.com Sun Apr 23 07:19:35 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 23 Apr 2006 17:19:35 +0300 Subject: [openib-general] [PATCH 0/4] opensm: improvements and fixes for pkey manager Message-ID: <20060423141935.15562.38762.stgit@sashak.voltaire.com> Hello Hal, Here are various improvements and fixes for the opensm pkey manager, as well as some cleanups. All patches are functionally independent, but the order matters for applying them cleanly. Sasha. From sashak at voltaire.com Sun Apr 23 07:26:18 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 23 Apr 2006 17:26:18 +0300 Subject: [openib-general] [PATCH 1/4] opensm: don't try to enforce partitions on router port In-Reply-To: <20060423141935.15562.38762.stgit@sashak.voltaire.com> References: <20060423141935.15562.38762.stgit@sashak.voltaire.com> Message-ID: <20060423142618.15562.31253.stgit@sashak.voltaire.com> When a router port is connected directly to a CA, don't try to handle it as a switch external port (update pkey table and enforce partitions). Router ports are handled by the partition manager as end ports.
Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_pkey_mgr.c | 43 ++++++++++++++++++++----------------------- 1 files changed, 20 insertions(+), 23 deletions(-) diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c index 938632e..bdb3ae4 100644 --- a/osm/opensm/osm_pkey_mgr.c +++ b/osm/opensm/osm_pkey_mgr.c @@ -307,7 +307,8 @@ __osm_pkey_mgr_process_physical_port( static void osm_pkey_mgr_update_peer_port( const osm_pkey_mgr_t * const p_mgr, - const osm_port_t * const p_port ) + const osm_port_t * const p_port, + boolean_t enforce) { osm_physp_t *p, *peer; osm_node_t *p_node; @@ -326,18 +327,25 @@ osm_pkey_mgr_update_peer_port( if ( !peer || !osm_physp_is_valid( peer ) ) return; p_node = osm_physp_get_node_ptr( peer ); - if ( osm_node_get_type( p_node ) == IB_NODE_TYPE_CA ) + if ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) return; - else if ( osm_node_get_type( p_node ) == IB_NODE_TYPE_SWITCH ) { - if (!(p_sw = osm_get_switch_by_guid( p_mgr->p_subn, - osm_node_get_node_guid( p_node ))) || - !(p_si = osm_switch_get_si_ptr( p_sw )) || - !p_si->enforce_cap) - return; + + p_sw = osm_get_switch_by_guid( p_mgr->p_subn, osm_node_get_node_guid( p_node )); + if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || + !p_si->enforce_cap) + return; + + if (osm_pkey_mgr_enforce_partition( p_mgr, peer, enforce ) != IB_SUCCESS) { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "osm_pkey_mgr_update_peer_port: " + "osm_pkey_mgr_enforce_partition() failed to update " + "node 0x%016" PRIx64 " port %u\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( peer ) ); } - if (p_mgr->p_subn->opt.no_partition_enforcement == TRUE) - goto _enforce_port; + if (enforce == FALSE) + return; p_pkey_tbl = osm_physp_get_pkey_tbl( p ); p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); @@ -377,18 +385,6 @@ osm_pkey_mgr_update_peer_port( cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( peer ) ); } - - _enforce_port: - if 
(osm_pkey_mgr_enforce_partition( p_mgr, peer, - p_mgr->p_subn->opt.no_partition_enforcement == FALSE ) != - IB_SUCCESS) { - osm_log( p_mgr->p_log, OSM_LOG_ERROR, - "osm_pkey_mgr_update_peer_port: " - "osm_pkey_mgr_enforce_partition() failed to update " - "node 0x%016" PRIx64 " port %u\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( peer ) ); - } } /********************************************************************** @@ -484,7 +480,8 @@ osm_pkey_mgr_process( if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH ) { - osm_pkey_mgr_update_peer_port( p_mgr, p_port ); + osm_pkey_mgr_update_peer_port( p_mgr, p_port, + !p_mgr->p_subn->opt.no_partition_enforcement); } } From sashak at voltaire.com Sun Apr 23 07:26:21 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 23 Apr 2006 17:26:21 +0300 Subject: [openib-general] [PATCH 2/4] opensm: remove unused osm_pkey_mgr_t object In-Reply-To: <20060423141935.15562.38762.stgit@sashak.voltaire.com> References: <20060423141935.15562.38762.stgit@sashak.voltaire.com> Message-ID: <20060423142620.15562.96611.stgit@sashak.voltaire.com> The structure osm_pkey_mgr_t is not used for pkey management - clean it up. Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_pkey_mgr.h | 183 ++---------------------------------- osm/include/opensm/osm_sm.h | 2 osm/include/opensm/osm_state_mgr.h | 11 -- osm/opensm/osm_pkey_mgr.c | 182 ++++++++++++++---------------------- osm/opensm/osm_sm.c | 12 -- osm/opensm/osm_state_mgr.c | 8 +- 6 files changed, 82 insertions(+), 316 deletions(-) diff --git a/osm/include/opensm/osm_pkey_mgr.h b/osm/include/opensm/osm_pkey_mgr.h index fef3667..cb0075d 100644 --- a/osm/include/opensm/osm_pkey_mgr.h +++ b/osm/include/opensm/osm_pkey_mgr.h @@ -1,4 +1,5 @@ /* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -35,9 +36,8 @@ /* * Abstract: - * Declaration of osm_pkey_mgr_t. - * This object represents the P_Key Manager object. - * This object is part of the OpenSM family of objects. + * Prototype for osm_pkey_mgr_process() function + * This is part of the OpenSM family of objects. * * Environment: * Linux User Mode @@ -49,10 +49,8 @@ #ifndef _OSM_PKEY_MGR_H_ #define _OSM_PKEY_MGR_H_ -#include -#include -#include -#include +#include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -64,166 +62,6 @@ #endif /* __cplusplus */ BEGIN_C_DECLS -/****h* OpenSM/P_Key Manager -* NAME -* P_Key Manager -* -* DESCRIPTION -* The P_Key Manager object manage the p_key tables of all -* objects in the subnet -* -* AUTHOR -* Ofer Gigi, Mellanox -* -*********/ -/****s* OpenSM: P_Key Manager/osm_pkey_mgr_t -* NAME -* osm_pkey_mgr_t -* -* DESCRIPTION -* p_Key Manager structure. -* -* -* SYNOPSIS -*/ - -typedef struct _osm_pkey_mgr -{ - osm_subn_t *p_subn; - osm_log_t *p_log; - osm_req_t *p_req; - cl_plock_t *p_lock; - -} osm_pkey_mgr_t; - -/* -* FIELDS -* p_subn -* Pointer to the Subnet object for this subnet. -* -* p_log -* Pointer to the log object. -* -* p_req -* Pointer to the Request object. -* -* p_lock -* Pointer to the serializing lock. -* -* SEE ALSO -* P_Key Manager object -*********/ - -/****** OpenSM: P_Key Manager/osm_pkey_mgr_construct -* NAME -* osm_pkey_mgr_construct -* -* DESCRIPTION -* This function constructs a P_Key Manager object. -* -* SYNOPSIS -*/ -void -osm_pkey_mgr_construct( - IN osm_pkey_mgr_t* const p_mgr ); -/* -* PARAMETERS -* p_mgr -* [in] Pointer to a P_Key Manager object to construct. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Allows calling osm_pkey_mgr_init, osm_pkey_mgr_destroy -* -* Calling osm_pkey_mgr_construct is a prerequisite to calling any other -* method except osm_pkey_mgr_init. 
-* -* SEE ALSO -* P_Key Manager object, osm_pkey_mgr_init, -* osm_pkey_mgr_destroy -*********/ - -/****f* OpenSM: P_Key Manager/osm_pkey_mgr_destroy -* NAME -* osm_pkey_mgr_destroy -* -* DESCRIPTION -* The osm_pkey_mgr_destroy function destroys the object, releasing -* all resources. -* -* SYNOPSIS -*/ -void -osm_pkey_mgr_destroy( - IN osm_pkey_mgr_t* const p_mgr ); -/* -* PARAMETERS -* p_mgr -* [in] Pointer to the object to destroy. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Performs any necessary cleanup of the specified -* P_Key Manager object. -* Further operations should not be attempted on the destroyed object. -* This function should only be called after a call to -* osm_pkey_mgr_construct or osm_pkey_mgr_init. -* -* SEE ALSO -* P_Key Manager object, osm_pkey_mgr_construct, -* osm_pkey_mgr_init -*********/ - -/****f* OpenSM: P_Key Manager/osm_pkey_mgr_init -* NAME -* osm_pkey_mgr_init -* -* DESCRIPTION -* The osm_pkey_mgr_init function initializes a -* P_Key Manager object for use. -* -* SYNOPSIS -*/ -ib_api_status_t -osm_pkey_mgr_init( - IN osm_pkey_mgr_t* const p_mgr, - IN osm_subn_t* const p_subn, - IN osm_log_t* const p_log, - IN osm_req_t* const p_req, - IN cl_plock_t* const p_lock ); -/* -* PARAMETERS -* p_mgr -* [in] Pointer to an osm_pkey_mgr_t object to initialize. -* -* p_subn -* [in] Pointer to the Subnet object for this subnet. -* -* p_log -* [in] Pointer to the log object. -* -* p_req -* [in] Pointer to an osm_req_t object. -* -* p_lock -* [in] Pointer to the OpenSM serializing lock. -* -* RETURN VALUES -* IB_SUCCESS if the P_Key Manager object was initialized -* successfully. -* -* NOTES -* Allows calling other P_Key Manager methods. 
-* -* SEE ALSO -* P_Key Manager object, osm_pkey_mgr_construct, -* osm_pkey_mgr_destroy -*********/ - /****f* OpenSM: P_Key Manager/osm_pkey_mgr_process * NAME * osm_pkey_mgr_process @@ -235,23 +73,18 @@ osm_pkey_mgr_init( */ osm_signal_t osm_pkey_mgr_process( - IN const osm_pkey_mgr_t* const p_mgr ); + IN osm_opensm_t *p_osm ); /* * PARAMETERS -* p_mgr -* [in] Pointer to an osm_pkey_mgr_t object. +* p_osm +* [in] Pointer to an osm_opensm_t object. * * RETURN VALUES * None * * NOTES -* Current Operations: -* - Inserts IB_DEFAULT_PKEY to all node objects that don't have -* IB_DEFAULT_PARTIAL_PKEY or IB_DEFAULT_PKEY as part -* of their p_key table * * SEE ALSO -* P_Key Manager *********/ END_C_DECLS diff --git a/osm/include/opensm/osm_sm.h b/osm/include/opensm/osm_sm.h index d9fbd8a..d6086d4 100644 --- a/osm/include/opensm/osm_sm.h +++ b/osm/include/opensm/osm_sm.h @@ -74,7 +74,6 @@ #include #include #include #include -#include #include #include #include @@ -162,7 +161,6 @@ typedef struct _osm_sm osm_link_mgr_t link_mgr; osm_state_mgr_t state_mgr; osm_drop_mgr_t drop_mgr; - osm_pkey_mgr_t pkey_mgr; osm_lft_rcv_t lft_rcv; osm_lft_rcv_ctrl_t lft_rcv_ctrl; osm_mft_rcv_t mft_rcv; diff --git a/osm/include/opensm/osm_state_mgr.h b/osm/include/opensm/osm_state_mgr.h index 92aa910..a9385d1 100644 --- a/osm/include/opensm/osm_state_mgr.h +++ b/osm/include/opensm/osm_state_mgr.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -60,7 +60,6 @@ #include #include #include #include -#include #include #include @@ -113,7 +112,6 @@ typedef struct _osm_state_mgr osm_mcast_mgr_t *p_mcast_mgr; osm_link_mgr_t *p_link_mgr; osm_drop_mgr_t *p_drop_mgr; - osm_pkey_mgr_t *p_pkey_mgr; osm_req_t *p_req; osm_stats_t *p_stats; struct _osm_sm_state_mgr *p_sm_state_mgr; @@ -151,9 +149,6 @@ typedef struct _osm_state_mgr * p_drop_mgr * Pointer to the Drop Manager object. * -* p_pkey_mgr -* Pointer to the P_Key Manager object. -* * p_req * Pointer to the Requester object sending SMPs. * @@ -379,7 +374,6 @@ osm_state_mgr_init( IN osm_mcast_mgr_t* const p_mcast_mgr, IN osm_link_mgr_t* const p_link_mgr, IN osm_drop_mgr_t* const p_drop_mgr, - IN osm_pkey_mgr_t* const p_pkey_mgr, IN osm_req_t* const p_req, IN osm_stats_t* const p_stats, IN struct _osm_sm_state_mgr* const p_sm_state_mgr, @@ -411,9 +405,6 @@ osm_state_mgr_init( * p_drop_mgr * [in] Pointer to the Drop Manager object. * -* p_pkey_mgr -* [in] Pointer to the P_Key Manager object. -* * p_req * [in] Pointer to the Request Controller object. * diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c index bdb3ae4..7b3da26 100644 --- a/osm/opensm/osm_pkey_mgr.c +++ b/osm/opensm/osm_pkey_mgr.c @@ -37,9 +37,8 @@ /* * Abstract: - * Implementation of osm_pkey_mgr_t. - * This object represents the P_Key Manager object. - * This object is part of the opensm family of objects. + * Implementation of the P_Key Manager (Partititon Manager). + * This is part of the OpenSM. 
* * Environment: * Linux User Mode @@ -58,62 +57,15 @@ #include #include #include #include - -/********************************************************************** - **********************************************************************/ -void -osm_pkey_mgr_construct( - IN osm_pkey_mgr_t * const p_mgr ) -{ - CL_ASSERT( p_mgr ); - cl_memclr( p_mgr, sizeof( *p_mgr ) ); -} - -/********************************************************************** - **********************************************************************/ -void -osm_pkey_mgr_destroy( - IN osm_pkey_mgr_t * const p_mgr ) -{ - CL_ASSERT( p_mgr ); - - OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_destroy ); - - OSM_LOG_EXIT( p_mgr->p_log ); -} - -/********************************************************************** - **********************************************************************/ -ib_api_status_t -osm_pkey_mgr_init( - IN osm_pkey_mgr_t * const p_mgr, - IN osm_subn_t * const p_subn, - IN osm_log_t * const p_log, - IN osm_req_t * const p_req, - IN cl_plock_t * const p_lock ) -{ - ib_api_status_t status = IB_SUCCESS; - - OSM_LOG_ENTER( p_log, osm_pkey_mgr_init ); - - osm_pkey_mgr_construct( p_mgr ); - - p_mgr->p_log = p_log; - p_mgr->p_subn = p_subn; - p_mgr->p_lock = p_lock; - p_mgr->p_req = p_req; - - OSM_LOG_EXIT( p_mgr->p_log ); - return ( status ); -} +#include /********************************************************************** **********************************************************************/ static ib_api_status_t -osm_pkey_mgr_update_pkey_entry( - IN const osm_pkey_mgr_t * const p_mgr, - IN const osm_physp_t * p_physp, - IN const ib_pkey_table_t * block, +pkey_mgr_update_pkey_entry( + IN const osm_req_t *p_req, + IN const osm_physp_t *p_physp, + IN const ib_pkey_table_t *block, IN const uint16_t block_index ) { osm_madw_context_t context; @@ -126,7 +78,7 @@ osm_pkey_mgr_update_pkey_entry( attr_mod = block_index; if ( osm_node_get_type( p_node ) == IB_NODE_TYPE_SWITCH ) attr_mod |= 
osm_physp_get_port_num( p_physp ) << 16; - return osm_req_set( p_mgr->p_req, osm_physp_get_dr_path_ptr( p_physp ), + return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ), ( uint8_t * ) block, sizeof( *block ), IB_MAD_ATTR_P_KEY_TABLE, cl_hton32( attr_mod ), CL_DISP_MSGID_NONE, &context ); @@ -135,9 +87,9 @@ osm_pkey_mgr_update_pkey_entry( /********************************************************************** **********************************************************************/ static ib_api_status_t -osm_pkey_mgr_enforce_partition( - IN const osm_pkey_mgr_t * const p_mgr, - IN const osm_physp_t * p_physp, +pkey_mgr_enforce_partition( + IN const osm_req_t *p_req, + IN const osm_physp_t *p_physp, IN const boolean_t enforce) { osm_madw_context_t context; @@ -168,7 +120,7 @@ osm_pkey_mgr_enforce_partition( context.pi_context.ignore_errors = FALSE; context.pi_context.light_sweep = FALSE; - return osm_req_set( p_mgr->p_req, osm_physp_get_dr_path_ptr( p_physp ), + return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ), payload, sizeof(payload), IB_MAD_ATTR_PORT_INFO, cl_hton32(osm_physp_get_port_num( p_physp )), @@ -184,10 +136,11 @@ osm_pkey_mgr_enforce_partition( */ static boolean_t -__osm_pkey_mgr_process_physical_port( - IN const osm_pkey_mgr_t * const p_mgr, +pkey_mgr_process_physical_port( + IN osm_log_t *p_log, + IN const osm_req_t *p_req, IN const ib_net16_t pkey, - IN osm_physp_t * p_physp ) + IN osm_physp_t *p_physp ) { boolean_t return_val = FALSE; /* TRUE if pkey was inserted or updated */ ib_api_status_t status; @@ -200,7 +153,7 @@ __osm_pkey_mgr_process_physical_port( uint32_t i; boolean_t block_found = FALSE; - OSM_LOG_ENTER( p_mgr->p_log, __osm_pkey_mgr_process_physical_port ); + OSM_LOG_ENTER( p_log, pkey_mgr_process_physical_port ); p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); @@ -209,10 +162,10 @@ __osm_pkey_mgr_process_physical_port( if ( p_orig_pkey && *p_orig_pkey == 
pkey ) { - if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + if ( osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) { - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, - "__osm_pkey_mgr_process_physical_port: " + osm_log( p_log, OSM_LOG_VERBOSE, + "pkey_mgr_process_physical_port: " "No need to insert pkey 0x%04x for node 0x%016" PRIx64 " port %u\n", cl_ntoh16( pkey ), @@ -258,8 +211,8 @@ __osm_pkey_mgr_process_physical_port( if ( block_found == FALSE ) { - osm_log( p_mgr->p_log, OSM_LOG_ERROR, - "__osm_pkey_mgr_process_physical_port: ERR 0501: " + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: ERR 0501: " "No empty pkey entry was found to insert 0x%04x for node " "0x%016" PRIx64 " port %u\n", cl_ntoh16( pkey ), @@ -269,13 +222,13 @@ __osm_pkey_mgr_process_physical_port( } status = - osm_pkey_mgr_update_pkey_entry( p_mgr, p_physp, block, block_index ); + pkey_mgr_update_pkey_entry( p_req, p_physp, block, block_index ); if ( status != IB_SUCCESS ) { - osm_log( p_mgr->p_log, OSM_LOG_ERROR, - "__osm_pkey_mgr_process_physical_port: " - "osm_pkey_mgr_update_pkey_entry() failed to update " + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_process_physical_port: " + "pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -285,10 +238,10 @@ __osm_pkey_mgr_process_physical_port( return_val = TRUE; /* pkey was inserted/updated */ - if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + if ( osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) { - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, - "__osm_pkey_mgr_process_physical_port: " + osm_log( p_log, OSM_LOG_VERBOSE, + "pkey_mgr_process_physical_port: " "pkey 0x%04x was inserted for node 0x%016" PRIx64 " port %u\n", cl_ntoh16( pkey ), @@ -297,7 +250,7 @@ __osm_pkey_mgr_process_physical_port( } _done: - OSM_LOG_EXIT( p_mgr->p_log ); + OSM_LOG_EXIT( p_log ); return ( return_val ); } @@ -305,9 +258,11 @@ 
__osm_pkey_mgr_process_physical_port( /********************************************************************** **********************************************************************/ static void -osm_pkey_mgr_update_peer_port( - const osm_pkey_mgr_t * const p_mgr, - const osm_port_t * const p_port, +pkey_mgr_update_peer_port( + osm_log_t *p_log, + const osm_req_t *p_req, + const osm_subn_t *p_subn, + const osm_port_t *p_port, boolean_t enforce) { osm_physp_t *p, *peer; @@ -330,15 +285,15 @@ osm_pkey_mgr_update_peer_port( if ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) return; - p_sw = osm_get_switch_by_guid( p_mgr->p_subn, osm_node_get_node_guid( p_node )); + p_sw = osm_get_switch_by_guid( p_subn, osm_node_get_node_guid( p_node )); if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || !p_si->enforce_cap) return; - if (osm_pkey_mgr_enforce_partition( p_mgr, peer, enforce ) != IB_SUCCESS) { - osm_log( p_mgr->p_log, OSM_LOG_ERROR, - "osm_pkey_mgr_update_peer_port: " - "osm_pkey_mgr_enforce_partition() failed to update " + if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_peer_port: " + "pkey_mgr_enforce_partition() failed to update " "node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( peer ) ); @@ -361,12 +316,12 @@ osm_pkey_mgr_update_peer_port( { cl_memcpy( peer_block, block, sizeof( *block ) ); status = - osm_pkey_mgr_update_pkey_entry( p_mgr, peer, peer_block, + pkey_mgr_update_pkey_entry( p_req, peer, peer_block, block_index ); if ( status != IB_SUCCESS ) - osm_log( p_mgr->p_log, OSM_LOG_ERROR, - "osm_pkey_mgr_update_peer_port: " - "osm_pkey_mgr_update_pkey_entry() failed to update " + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_peer_port: " + "pkey_mgr_update_pkey_entry() failed to update " "pkey table block %d for node 0x%016" PRIx64 " port %u\n", block_index, @@ -376,10 +331,10 @@ osm_pkey_mgr_update_peer_port( } if ( 
num_of_blocks && status == IB_SUCCESS && - osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) { - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, - "osm_pkey_mgr_update_peer_port: " + osm_log( p_log, OSM_LOG_VERBOSE, + "pkey_mgr_update_peer_port: " "pkey table was updated for node 0x%016" PRIx64 " port %u\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), @@ -390,9 +345,10 @@ osm_pkey_mgr_update_peer_port( /********************************************************************** **********************************************************************/ static boolean_t -osm_pkey_mgr_process_partition_table( - const osm_pkey_mgr_t * const p_mgr, - const osm_prtn_t * const p_prtn, +pkey_mgr_process_partition_table( + osm_log_t *p_log, + const osm_req_t *p_req, + const osm_prtn_t *p_prtn, const boolean_t full ) { const cl_map_t *p_tbl = full ? @@ -412,12 +368,12 @@ osm_pkey_mgr_process_partition_table( i_next = cl_map_next( i ); p_physp = cl_map_obj( i ); if ( p_physp && osm_physp_is_valid( p_physp ) && - __osm_pkey_mgr_process_physical_port( p_mgr, pkey, p_physp ) ) + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ) ) { result = TRUE; - if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, - "osm_pkey_mgr_process_partition_table: " + if ( osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) + osm_log( p_log, OSM_LOG_VERBOSE, + "pkey_mgr_process_partition_table: " "Adding 0x%04x to pkey table of node " "0x%016" PRIx64 " port %u\n", cl_ntoh16( pkey ), @@ -434,7 +390,7 @@ osm_pkey_mgr_process_partition_table( **********************************************************************/ osm_signal_t osm_pkey_mgr_process( - IN const osm_pkey_mgr_t * const p_mgr ) + IN osm_opensm_t *p_osm ) { cl_qmap_t *p_tbl; cl_map_item_t *p_next; @@ -442,20 +398,20 @@ osm_pkey_mgr_process( osm_port_t *p_port; osm_signal_t signal = OSM_SIGNAL_DONE; - CL_ASSERT( p_mgr ); + CL_ASSERT( p_osm ); - 
OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_process ); + OSM_LOG_ENTER( &p_osm->log, osm_pkey_mgr_process ); - CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); + CL_PLOCK_EXCL_ACQUIRE( &p_osm->lock ); - if ( osm_prtn_make_partitions( p_mgr->p_log, p_mgr->p_subn ) != IB_SUCCESS ) + if ( osm_prtn_make_partitions( &p_osm->log, &p_osm->subn ) != IB_SUCCESS ) { - osm_log( p_mgr->p_log, OSM_LOG_ERROR, "osm_pkey_mgr_process: " + osm_log( &p_osm->log, OSM_LOG_ERROR, "osm_pkey_mgr_process: " "osm_prtn_make_partitions() failed\n" ); goto _err; } - p_tbl = &p_mgr->p_subn->prtn_pkey_tbl; + p_tbl = &p_osm->subn.prtn_pkey_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -463,13 +419,13 @@ osm_pkey_mgr_process( p_prtn = ( osm_prtn_t * ) p_next; p_next = cl_qmap_next( p_next ); - if ( osm_pkey_mgr_process_partition_table( p_mgr, p_prtn, FALSE ) ) + if ( pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ) ) signal = OSM_SIGNAL_DONE_PENDING; - if ( osm_pkey_mgr_process_partition_table( p_mgr, p_prtn, TRUE ) ) + if ( pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ) ) signal = OSM_SIGNAL_DONE_PENDING; } - p_tbl = &p_mgr->p_subn->port_guid_tbl; + p_tbl = &p_osm->subn.port_guid_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) @@ -480,13 +436,13 @@ osm_pkey_mgr_process( if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != IB_NODE_TYPE_SWITCH ) { - osm_pkey_mgr_update_peer_port( p_mgr, p_port, - !p_mgr->p_subn->opt.no_partition_enforcement); + pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, &p_osm->subn, + p_port, !p_osm->subn.opt.no_partition_enforcement ); } } _err: - CL_PLOCK_RELEASE( p_mgr->p_lock ); - OSM_LOG_EXIT( p_mgr->p_log ); + CL_PLOCK_RELEASE( &p_osm->lock ); + OSM_LOG_EXIT( &p_osm->log ); return ( signal ); } diff --git a/osm/opensm/osm_sm.c b/osm/opensm/osm_sm.c index 9c10651..99e5627 100644 --- a/osm/opensm/osm_sm.c +++ b/osm/opensm/osm_sm.c @@ -1,5 
+1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -66,7 +66,6 @@ #include #include #include #include -#include #include #include #include @@ -160,7 +159,6 @@ osm_sm_construct( osm_state_mgr_construct( &p_sm->state_mgr ); osm_state_mgr_ctrl_construct( &p_sm->state_mgr_ctrl ); osm_drop_mgr_construct( &p_sm->drop_mgr ); - osm_pkey_mgr_construct( &p_sm->pkey_mgr ); osm_lft_rcv_construct( &p_sm->lft_rcv ); osm_lft_rcv_ctrl_construct( &p_sm->lft_rcv_ctrl ); osm_mft_rcv_construct( &p_sm->mft_rcv ); @@ -250,7 +248,6 @@ osm_sm_destroy( osm_ucast_mgr_destroy( &p_sm->ucast_mgr ); osm_link_mgr_destroy( &p_sm->link_mgr ); osm_drop_mgr_destroy( &p_sm->drop_mgr ); - osm_pkey_mgr_destroy( &p_sm->pkey_mgr ); osm_lft_rcv_destroy( &p_sm->lft_rcv ); osm_mft_rcv_destroy( &p_sm->mft_rcv ); osm_slvl_rcv_destroy( &p_sm->slvl_rcv ); @@ -408,7 +405,6 @@ osm_sm_init( &p_sm->mcast_mgr, &p_sm->link_mgr, &p_sm->drop_mgr, - &p_sm->pkey_mgr, &p_sm->req, p_stats, &p_sm->sm_state_mgr, @@ -431,12 +427,6 @@ osm_sm_init( if( status != IB_SUCCESS ) goto Exit; - status = osm_pkey_mgr_init( &p_sm->pkey_mgr, - p_sm->p_subn, - p_sm->p_log, &p_sm->req, p_sm->p_lock ); - if( status != IB_SUCCESS ) - goto Exit; - status = osm_lft_rcv_init( &p_sm->lft_rcv, p_subn, p_log, p_lock ); if( status != IB_SUCCESS ) goto Exit; diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index 083185c..1aefc0b 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -64,6 +64,7 @@ #include #include #include #include +#include #include #include #include @@ -108,7 +109,6 @@ osm_state_mgr_init( IN osm_mcast_mgr_t * const p_mcast_mgr, IN osm_link_mgr_t * const p_link_mgr, IN osm_drop_mgr_t * const p_drop_mgr, - IN osm_pkey_mgr_t * const p_pkey_mgr, IN osm_req_t * const p_req, IN osm_stats_t * const p_stats, IN osm_sm_state_mgr_t * const p_sm_state_mgr, @@ -128,7 +128,6 @@ osm_state_mgr_init( CL_ASSERT( p_mcast_mgr ); CL_ASSERT( p_link_mgr ); CL_ASSERT( p_drop_mgr ); - CL_ASSERT( p_pkey_mgr ); CL_ASSERT( p_req ); CL_ASSERT( p_stats ); CL_ASSERT( p_sm_state_mgr ); @@ -145,7 +144,6 @@ osm_state_mgr_init( p_mgr->p_mcast_mgr = p_mcast_mgr; p_mgr->p_link_mgr = p_link_mgr; p_mgr->p_drop_mgr = p_drop_mgr; - p_mgr->p_pkey_mgr = p_pkey_mgr; p_mgr->p_mad_ctrl = p_mad_ctrl; p_mgr->p_req = p_req; p_mgr->p_stats = p_stats; @@ -2235,7 +2233,7 @@ osm_state_mgr_process( OSM_SM_SIGNAL_DISCOVERY_COMPLETED ); /* the returned signal might be DONE or DONE_PENDING */ - signal = osm_pkey_mgr_process( p_mgr->p_pkey_mgr ); + signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm ); break; default: From sashak at voltaire.com Sun Apr 23 07:26:23 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 23 Apr 2006 17:26:23 +0300 Subject: [openib-general] [PATCH 3/4] opensm: pkey manager performance improvement In-Reply-To: <20060423141935.15562.38762.stgit@sashak.voltaire.com> References: <20060423141935.15562.38762.stgit@sashak.voltaire.com> Message-ID: <20060423142623.15562.89538.stgit@sashak.voltaire.com> Send changed pkey table blocks to ports only after full update and not after each pkey value change/update. 
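The idea of staging changes and sending each block at most once can be sketched as follows. This is a simplified model under stated assumptions, not the patch itself: the fixed-size struct, stage_pkey(), flush_changed_blocks() and the send callback are all hypothetical, a zero entry is treated as a free slot, and the comparison with memcmp() is an assumed way of detecting a changed block. The only detail taken from the patch is the split between 'blocks' (what the port reported) and 'new_blocks' (what the manager prepares).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PKEYS_PER_BLOCK 32	/* a P_Key table block carries 32 entries */
#define NUM_BLOCKS 2		/* arbitrary table size for this sketch */

/* Hypothetical, fixed-size model of a port's pkey table. */
struct pkey_tbl {
	uint16_t blocks[NUM_BLOCKS][PKEYS_PER_BLOCK];	  /* as reported by the port */
	uint16_t new_blocks[NUM_BLOCKS][PKEYS_PER_BLOCK]; /* staged by the manager */
};

/*
 * Stage a pkey in the first free (zero) slot of new_blocks.
 * Nothing is sent here. Returns 0 on success, -1 if the table is full.
 */
int stage_pkey(struct pkey_tbl *t, uint16_t pkey)
{
	int b, i;

	assert(t != NULL);
	for (b = 0; b < NUM_BLOCKS; b++)
		for (i = 0; i < PKEYS_PER_BLOCK; i++)
			if (t->new_blocks[b][i] == 0) {
				t->new_blocks[b][i] = pkey;
				return 0;
			}
	return -1;
}

/*
 * After all pkeys are staged, send every block that differs from what
 * the port currently holds. Returns the number of blocks sent.
 * Copying into 'blocks' here is a simplification; in a real SM the
 * current table would be refreshed from the port's response.
 */
int flush_changed_blocks(struct pkey_tbl *t,
			 void (*send)(int block_index, const uint16_t *block))
{
	int b, sent = 0;

	assert(t != NULL);
	for (b = 0; b < NUM_BLOCKS; b++) {
		if (memcmp(t->blocks[b], t->new_blocks[b],
			   sizeof(t->blocks[b])) == 0)
			continue;
		if (send)
			send(b, t->new_blocks[b]);
		memcpy(t->blocks[b], t->new_blocks[b], sizeof(t->blocks[b]));
		sent++;
	}
	return sent;
}
```

With this split, inserting several pkeys that land in the same block costs one send instead of one send per pkey, which is the saving the patch description refers to.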
Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_pkey.h | 51 +++++++++ osm/opensm/osm_pkey.c | 32 ++++++ osm/opensm/osm_pkey_mgr.c | 233 ++++++++++++++++++++--------------------- 3 files changed, 197 insertions(+), 119 deletions(-) diff --git a/osm/include/opensm/osm_pkey.h b/osm/include/opensm/osm_pkey.h index d4ee9a1..f5e8c11 100644 --- a/osm/include/opensm/osm_pkey.h +++ b/osm/include/opensm/osm_pkey.h @@ -90,16 +90,28 @@ struct _osm_physp; typedef struct _osm_pkey_tbl { cl_ptr_vector_t blocks; + cl_ptr_vector_t new_blocks; cl_map_t keys; } osm_pkey_tbl_t; /* * FIELDS * blocks -* The IBA defined blocks of pkey values +* The IBA defined blocks of pkey values, updated from the net +* +* new_blocks +* The blocks of pkey values, will be used for updates by SM * * keys * A set holding all keys * +* NOTES +* 'blocks' vector should be used to store pkey values obtained from +* the port and SM pkey manager should not change it directly, for this +* purpose 'new_blocks' should be used. +* +* The only pkey values stored in 'blocks' vector will be mapped with +* 'keys' map +* *********/ /****f* OpenSM: osm_pkey_tbl_construct @@ -214,6 +226,43 @@ static inline ib_pkey_table_t *osm_pkey_ * *********/ +/****f* OpenSM: osm_pkey_tbl_new_block_get +* NAME +* osm_pkey_tbl_new_block_get +* +* DESCRIPTION +* The same as above but for new block +* +* SYNOPSIS +*/ +static inline ib_pkey_table_t *osm_pkey_tbl_new_block_get( + const osm_pkey_tbl_t *p_pkey_tbl, uint16_t block) +{ + return (block < cl_ptr_vector_get_size(&p_pkey_tbl->new_blocks)) ? + cl_ptr_vector_get(&p_pkey_tbl->new_blocks, block) : NULL; +}; +/* + *********/ + +/****f* OpenSM: osm_pkey_tbl_sync_new_blocks +* NAME +* osm_pkey_tbl_sync_new_blocks +* +* DESCRIPTION +* Syncs new_blocks vector content with current pkey table blocks +* +* SYNOPSIS +*/ +void osm_pkey_tbl_sync_new_blocks( + const osm_pkey_tbl_t *p_pkey_tbl); +/* +* p_pkey_tbl +* [in] Pointer to osm_pkey_tbl_t object. 
+* +* NOTES +* +*********/ + /****f* OpenSM: osm_pkey_tbl_set * NAME * osm_pkey_tbl_set diff --git a/osm/opensm/osm_pkey.c b/osm/opensm/osm_pkey.c index 5a4ca0d..d661bd6 100644 --- a/osm/opensm/osm_pkey.c +++ b/osm/opensm/osm_pkey.c @@ -67,6 +67,7 @@ void osm_pkey_tbl_construct( IN osm_pkey_tbl_t *p_pkey_tbl) { cl_ptr_vector_construct( &p_pkey_tbl->blocks ); + cl_ptr_vector_construct( &p_pkey_tbl->new_blocks ); cl_map_construct( &p_pkey_tbl->keys ); } @@ -82,6 +83,11 @@ void osm_pkey_tbl_destroy( cl_free(cl_ptr_vector_get( &p_pkey_tbl->blocks, i )); cl_ptr_vector_destroy( &p_pkey_tbl->blocks ); + num_blocks = (uint16_t)(cl_ptr_vector_get_size( &p_pkey_tbl->new_blocks )); + for (i = 0; i < num_blocks; i++) + cl_free(cl_ptr_vector_get( &p_pkey_tbl->new_blocks, i )); + cl_ptr_vector_destroy( &p_pkey_tbl->new_blocks ); + cl_map_remove_all( &p_pkey_tbl->keys ); cl_map_destroy( &p_pkey_tbl->keys ); } @@ -92,12 +98,38 @@ int osm_pkey_tbl_init( IN osm_pkey_tbl_t *p_pkey_tbl) { cl_ptr_vector_init( &p_pkey_tbl->blocks, 0, 1); + cl_ptr_vector_init( &p_pkey_tbl->new_blocks, 0, 1); cl_map_init( &p_pkey_tbl->keys, 1 ); return(IB_SUCCESS); } /********************************************************************** **********************************************************************/ +void osm_pkey_tbl_sync_new_blocks( + IN const osm_pkey_tbl_t *p_pkey_tbl) +{ + ib_pkey_table_t *p_block, *p_new_block; + int16_t b, num_blocks, new_blocks; + + num_blocks = cl_ptr_vector_get_size(&p_pkey_tbl->blocks); + new_blocks = cl_ptr_vector_get_size(&p_pkey_tbl->new_blocks); + + for (b = 0; b < num_blocks; b++) { + p_block = cl_ptr_vector_get(&p_pkey_tbl->blocks, b); + if ( b < new_blocks ) + p_new_block = cl_ptr_vector_get(&p_pkey_tbl->new_blocks, b); + else { + p_new_block = (ib_pkey_table_t *)cl_zalloc(sizeof(*p_new_block)); + if (!p_new_block) + break; + cl_ptr_vector_set(&((osm_pkey_tbl_t *)p_pkey_tbl)->new_blocks, b, p_new_block); + } + cl_memcpy(p_new_block, p_block, 
sizeof(*p_new_block)); + } +} + +/********************************************************************** + **********************************************************************/ int osm_pkey_tbl_set( IN osm_pkey_tbl_t *p_pkey_tbl, IN uint16_t block, diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c index 7b3da26..da8dfa8 100644 --- a/osm/opensm/osm_pkey_mgr.c +++ b/osm/opensm/osm_pkey_mgr.c @@ -131,86 +131,45 @@ pkey_mgr_enforce_partition( **********************************************************************/ /* - * Send a new entry for the pkey table for this port when this pkey + * Prepare a new entry for the pkey table for this port when this pkey * does not exist. Update existed entry when membership was changed. */ -static boolean_t -pkey_mgr_process_physical_port( +static void pkey_mgr_process_physical_port( IN osm_log_t *p_log, IN const osm_req_t *p_req, IN const ib_net16_t pkey, IN osm_physp_t *p_physp ) { - boolean_t return_val = FALSE; /* TRUE if pkey was inserted or updated */ - ib_api_status_t status; osm_node_t *p_node = osm_physp_get_node_ptr( p_physp ); - ib_pkey_table_t *block = NULL; + ib_pkey_table_t *block; uint16_t block_index; uint16_t num_of_blocks; const osm_pkey_tbl_t *p_pkey_tbl; ib_net16_t *p_orig_pkey; + char *stat = NULL; uint32_t i; - boolean_t block_found = FALSE; - - OSM_LOG_ENTER( p_log, pkey_mgr_process_physical_port ); p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); p_orig_pkey = cl_map_get( &p_pkey_tbl->keys, ib_pkey_get_base( pkey ) ); - if ( p_orig_pkey && *p_orig_pkey == pkey ) - { - if ( osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) - { - osm_log( p_log, OSM_LOG_VERBOSE, - "pkey_mgr_process_physical_port: " - "No need to insert pkey 0x%04x for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh16( pkey ), - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p_physp ) ); - } - goto _done; - } - else if ( !p_orig_pkey ) + if ( 
!p_orig_pkey ) { for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) { if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) { block->pkey_entry[i] = pkey; - block_found = TRUE; - break; + stat = "inserted"; + goto _done; } } - if ( block_found ) - { - break; - } } - } - else - { - *p_orig_pkey = pkey; - for ( block_index = 0; block_index < num_of_blocks; block_index++ ) - { - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); - i = p_orig_pkey - block->pkey_entry; - if ( i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK ) - { - block_found = TRUE; - break; - } - } - } - - if ( block_found == FALSE ) - { osm_log( p_log, OSM_LOG_ERROR, "pkey_mgr_process_physical_port: ERR 0501: " "No empty pkey entry was found to insert 0x%04x for node " @@ -218,46 +177,40 @@ pkey_mgr_process_physical_port( cl_ntoh16( pkey ), cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); - goto _done; } - - status = - pkey_mgr_update_pkey_entry( p_req, p_physp, block, block_index ); - - if ( status != IB_SUCCESS ) + else if ( *p_orig_pkey != pkey ) { - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_process_physical_port: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p_physp ) ); - goto _done; + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + /* we need real block (not just new_block) in order + * to resolve block/pkey indices */ + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + i = p_orig_pkey - block->pkey_entry; + if (i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK) { + block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); + block->pkey_entry[i] = pkey; + stat = "updated"; + goto _done; + } + 
} } - return_val = TRUE; /* pkey was inserted/updated */ - - if ( osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) - { + _done: + if (stat) { osm_log( p_log, OSM_LOG_VERBOSE, "pkey_mgr_process_physical_port: " - "pkey 0x%04x was inserted for node 0x%016" PRIx64 + "pkey 0x%04x was %s for node 0x%016" PRIx64 " port %u\n", - cl_ntoh16( pkey ), + cl_ntoh16( pkey ), stat, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); } - - _done: - OSM_LOG_EXIT( p_log ); - return ( return_val ); } /********************************************************************** **********************************************************************/ -static void +static boolean_t pkey_mgr_update_peer_port( osm_log_t *p_log, const osm_req_t *p_req, @@ -274,21 +227,22 @@ pkey_mgr_update_peer_port( uint16_t block_index; uint16_t num_of_blocks; ib_api_status_t status = IB_SUCCESS; + boolean_t ret_val = FALSE; p = osm_port_get_default_phys_ptr( p_port ); if ( !osm_physp_is_valid( p ) ) - return; + return FALSE; peer = osm_physp_get_remote( p ); if ( !peer || !osm_physp_is_valid( peer ) ) - return; + return FALSE; p_node = osm_physp_get_node_ptr( peer ); if ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) - return; + return FALSE; p_sw = osm_get_switch_by_guid( p_subn, osm_node_get_node_guid( p_node )); if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || !p_si->enforce_cap) - return; + return FALSE; if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, @@ -300,7 +254,7 @@ pkey_mgr_update_peer_port( } if (enforce == FALSE) - return; + return FALSE; p_pkey_tbl = osm_physp_get_pkey_tbl( p ); p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); @@ -310,15 +264,15 @@ pkey_mgr_update_peer_port( for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { - block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); peer_block = 
osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); - if ( cl_memcmp( peer_block, block, sizeof( *block ) ) ) + if ( cl_memcmp( peer_block, block, sizeof( *peer_block ) ) ) { - cl_memcpy( peer_block, block, sizeof( *block ) ); status = - pkey_mgr_update_pkey_entry( p_req, peer, peer_block, - block_index ); - if ( status != IB_SUCCESS ) + pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); + if ( status == IB_SUCCESS ) + ret_val = TRUE; + else osm_log( p_log, OSM_LOG_ERROR, "pkey_mgr_update_peer_port: " "pkey_mgr_update_pkey_entry() failed to update " @@ -330,7 +284,7 @@ pkey_mgr_update_peer_port( } } - if ( num_of_blocks && status == IB_SUCCESS && + if ( ret_val == TRUE && osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) { osm_log( p_log, OSM_LOG_VERBOSE, @@ -340,11 +294,61 @@ pkey_mgr_update_peer_port( cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( peer ) ); } + + return ret_val; } /********************************************************************** **********************************************************************/ -static boolean_t +static boolean_t pkey_mgr_update_port( + osm_log_t *p_log, + osm_req_t *p_req, + const osm_port_t * const p_port ) +{ + osm_physp_t *p; + osm_node_t *p_node; + ib_pkey_table_t *block, *new_block; + const osm_pkey_tbl_t *p_pkey_tbl; + uint16_t block_index; + uint16_t num_of_blocks; + ib_api_status_t status; + boolean_t ret_val = FALSE; + + p = osm_port_get_default_phys_ptr( p_port ); + if ( !osm_physp_is_valid( p ) ) + return FALSE; + + p_pkey_tbl = osm_physp_get_pkey_tbl(p); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + new_block = osm_pkey_tbl_new_block_get( p_pkey_tbl, block_index ); + + if (!new_block || !cl_memcmp( new_block, block, sizeof( *block ) ) ) + continue; + + status = + pkey_mgr_update_pkey_entry( p_req, p, new_block, 
block_index ); + if (status == IB_SUCCESS) + ret_val = TRUE; + else + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: " + "pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p ) ); + } + + return ret_val; +} + +/********************************************************************** + **********************************************************************/ +static void pkey_mgr_process_partition_table( osm_log_t *p_log, const osm_req_t *p_req, @@ -356,7 +360,6 @@ pkey_mgr_process_partition_table( cl_map_iterator_t i, i_next; ib_net16_t pkey = p_prtn->pkey; osm_physp_t *p_physp; - boolean_t result = FALSE; if ( full ) pkey = cl_hton16( cl_ntoh16( pkey ) | 0x8000 ); @@ -367,23 +370,9 @@ pkey_mgr_process_partition_table( i = i_next; i_next = cl_map_next( i ); p_physp = cl_map_obj( i ); - if ( p_physp && osm_physp_is_valid( p_physp ) && - pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ) ) - { - result = TRUE; - if ( osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) - osm_log( p_log, OSM_LOG_VERBOSE, - "pkey_mgr_process_partition_table: " - "Adding 0x%04x to pkey table of node " - "0x%016" PRIx64 " port %u\n", - cl_ntoh16( pkey ), - cl_ntoh64( osm_node_get_node_guid - ( osm_physp_get_node_ptr( p_physp ) ) ), - osm_physp_get_port_num( p_physp ) ); - } + if ( p_physp && osm_physp_is_valid( p_physp ) ) + pkey_mgr_process_physical_port( p_log, p_req, pkey, p_physp ); } - - return result; } /********************************************************************** @@ -397,6 +386,7 @@ osm_pkey_mgr_process( osm_prtn_t *p_prtn; osm_port_t *p_port; osm_signal_t signal = OSM_SIGNAL_DONE; + osm_physp_t *p_physp; CL_ASSERT( p_osm ); @@ -411,34 +401,41 @@ osm_pkey_mgr_process( goto _err; } - p_tbl = &p_osm->subn.prtn_pkey_tbl; + p_tbl = &p_osm->subn.port_guid_tbl; + p_next = cl_qmap_head( p_tbl ); + while ( p_next != 
cl_qmap_end( p_tbl ) ) + { + p_port = ( osm_port_t * ) p_next; + p_next = cl_qmap_next( p_next ); + p_physp = osm_port_get_default_phys_ptr( p_port ); + if (osm_physp_is_valid( p_physp ) ) + osm_pkey_tbl_sync_new_blocks(osm_physp_get_pkey_tbl(p_physp)); + } + p_tbl = &p_osm->subn.prtn_pkey_tbl; p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) { p_prtn = ( osm_prtn_t * ) p_next; p_next = cl_qmap_next( p_next ); - - if ( pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ) ) - signal = OSM_SIGNAL_DONE_PENDING; - if ( pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ) ) - signal = OSM_SIGNAL_DONE_PENDING; + pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, FALSE ); + pkey_mgr_process_partition_table( &p_osm->log, &p_osm->sm.req, p_prtn, TRUE ); } p_tbl = &p_osm->subn.port_guid_tbl; - p_next = cl_qmap_head( p_tbl ); while ( p_next != cl_qmap_end( p_tbl ) ) { p_port = ( osm_port_t * ) p_next; p_next = cl_qmap_next( p_next ); - - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != - IB_NODE_TYPE_SWITCH ) - { - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, &p_osm->subn, - p_port, !p_osm->subn.opt.no_partition_enforcement ); - } + if (pkey_mgr_update_port(&p_osm->log, &p_osm->sm.req, p_port)) + signal = OSM_SIGNAL_DONE_PENDING; + if (osm_node_get_type( osm_port_get_parent_node( p_port ) ) != + IB_NODE_TYPE_SWITCH && + pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, + &p_osm->subn, p_port, + !p_osm->subn.opt.no_partition_enforcement )) + signal = OSM_SIGNAL_DONE_PENDING; } _err: From sashak at voltaire.com Sun Apr 23 07:26:25 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 23 Apr 2006 17:26:25 +0300 Subject: [openib-general] [PATCH 4/4] opensm: no need to wait for pkey_mgr In-Reply-To: <20060423141935.15562.38762.stgit@sashak.voltaire.com> References: <20060423141935.15562.38762.stgit@sashak.voltaire.com> Message-ID: 
<20060423142625.15562.25970.stgit@sashak.voltaire.com> Don't wait for pkey tables update responses in partition manager - we may just continue resweep process. Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_pkey_mgr.c | 66 +++++++++++++++++--------------------------- osm/opensm/osm_state_mgr.c | 41 ++------------------------- 2 files changed, 29 insertions(+), 78 deletions(-) diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c index da8dfa8..167b4c1 100644 --- a/osm/opensm/osm_pkey_mgr.c +++ b/osm/opensm/osm_pkey_mgr.c @@ -135,7 +135,8 @@ pkey_mgr_enforce_partition( * does not exist. Update existed entry when membership was changed. */ -static void pkey_mgr_process_physical_port( +static void +pkey_mgr_process_physical_port( IN osm_log_t *p_log, IN const osm_req_t *p_req, IN const ib_net16_t pkey, @@ -210,7 +211,7 @@ static void pkey_mgr_process_physical_po /********************************************************************** **********************************************************************/ -static boolean_t +static void pkey_mgr_update_peer_port( osm_log_t *p_log, const osm_req_t *p_req, @@ -227,22 +228,21 @@ pkey_mgr_update_peer_port( uint16_t block_index; uint16_t num_of_blocks; ib_api_status_t status = IB_SUCCESS; - boolean_t ret_val = FALSE; p = osm_port_get_default_phys_ptr( p_port ); if ( !osm_physp_is_valid( p ) ) - return FALSE; + return; peer = osm_physp_get_remote( p ); if ( !peer || !osm_physp_is_valid( peer ) ) - return FALSE; + return; p_node = osm_physp_get_node_ptr( peer ); if ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) - return FALSE; + return; p_sw = osm_get_switch_by_guid( p_subn, osm_node_get_node_guid( p_node )); if (!p_sw || !(p_si = osm_switch_get_si_ptr( p_sw )) || !p_si->enforce_cap) - return FALSE; + return; if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) { osm_log( p_log, OSM_LOG_ERROR, @@ -254,7 +254,7 @@ pkey_mgr_update_peer_port( } if (enforce == FALSE) - return FALSE; + 
return; p_pkey_tbl = osm_physp_get_pkey_tbl( p ); p_peer_pkey_tbl = osm_physp_get_pkey_tbl( peer ); @@ -271,36 +271,30 @@ pkey_mgr_update_peer_port( status = pkey_mgr_update_pkey_entry( p_req, peer, block, block_index ); if ( status == IB_SUCCESS ) - ret_val = TRUE; + osm_log( p_log, OSM_LOG_VERBOSE, + "pkey_mgr_update_peer_port: " + "pkey table block %u was updated for node 0x%016" PRIx64 + " port %u\n", + block_index, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( peer ) ); else osm_log( p_log, OSM_LOG_ERROR, "pkey_mgr_update_peer_port: " "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for node 0x%016" PRIx64 + "pkey table block %u for node 0x%016" PRIx64 " port %u\n", block_index, cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( peer ) ); } } - - if ( ret_val == TRUE && - osm_log_is_active( p_log, OSM_LOG_VERBOSE ) ) - { - osm_log( p_log, OSM_LOG_VERBOSE, - "pkey_mgr_update_peer_port: " - "pkey table was updated for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( peer ) ); - } - - return ret_val; } /********************************************************************** **********************************************************************/ -static boolean_t pkey_mgr_update_port( +static void +pkey_mgr_update_port( osm_log_t *p_log, osm_req_t *p_req, const osm_port_t * const p_port ) @@ -312,11 +306,10 @@ static boolean_t pkey_mgr_update_port( uint16_t block_index; uint16_t num_of_blocks; ib_api_status_t status; - boolean_t ret_val = FALSE; p = osm_port_get_default_phys_ptr( p_port ); if ( !osm_physp_is_valid( p ) ) - return FALSE; + return; p_pkey_tbl = osm_physp_get_pkey_tbl(p); num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); @@ -331,9 +324,7 @@ static boolean_t pkey_mgr_update_port( status = pkey_mgr_update_pkey_entry( p_req, p, new_block, block_index ); - if (status == IB_SUCCESS) - ret_val = TRUE; - else + if 
(status != IB_SUCCESS) osm_log( p_log, OSM_LOG_ERROR, "pkey_mgr_update_port: " "pkey_mgr_update_pkey_entry() failed to update " @@ -342,8 +333,6 @@ static boolean_t pkey_mgr_update_port( cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p ) ); } - - return ret_val; } /********************************************************************** @@ -385,7 +374,6 @@ osm_pkey_mgr_process( cl_map_item_t *p_next; osm_prtn_t *p_prtn; osm_port_t *p_port; - osm_signal_t signal = OSM_SIGNAL_DONE; osm_physp_t *p_physp; CL_ASSERT( p_osm ); @@ -428,18 +416,16 @@ osm_pkey_mgr_process( { p_port = ( osm_port_t * ) p_next; p_next = cl_qmap_next( p_next ); - if (pkey_mgr_update_port(&p_osm->log, &p_osm->sm.req, p_port)) - signal = OSM_SIGNAL_DONE_PENDING; + pkey_mgr_update_port(&p_osm->log, &p_osm->sm.req, p_port); if (osm_node_get_type( osm_port_get_parent_node( p_port ) ) != - IB_NODE_TYPE_SWITCH && - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, - &p_osm->subn, p_port, - !p_osm->subn.opt.no_partition_enforcement )) - signal = OSM_SIGNAL_DONE_PENDING; + IB_NODE_TYPE_SWITCH ) + pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, + &p_osm->subn, p_port, + !p_osm->subn.opt.no_partition_enforcement ); } _err: CL_PLOCK_RELEASE( &p_osm->lock ); OSM_LOG_EXIT( &p_osm->log ); - return ( signal ); + return OSM_SIGNAL_DONE; } diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index 1aefc0b..194e51e 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -2232,8 +2232,10 @@ osm_state_mgr_process( osm_sm_state_mgr_process( p_mgr->p_sm_state_mgr, OSM_SM_SIGNAL_DISCOVERY_COMPLETED ); - /* the returned signal might be DONE or DONE_PENDING */ + /* the returned signal will be always DONE */ signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm ); + p_mgr->state = OSM_SM_STATE_SET_PKEY_DONE; + break; default: @@ -2243,43 +2245,6 @@ osm_state_mgr_process( } break; - case OSM_SM_STATE_SET_PKEY: - switch ( signal ) - { - case 
OSM_SIGNAL_DONE: - p_mgr->state = OSM_SM_STATE_SET_PKEY_DONE; - break; - - case OSM_SIGNAL_DONE_PENDING: - /* - * There are outstanding transactions, so we - * must wait for the wire to clear. - */ - p_mgr->state = OSM_SM_STATE_SET_PKEY_WAIT; - signal = OSM_SIGNAL_NONE; - break; - - default: - __osm_state_mgr_signal_error( p_mgr, signal ); - signal = OSM_SIGNAL_NONE; - break; - } - break; - - case OSM_SM_STATE_SET_PKEY_WAIT: - switch ( signal ) - { - case OSM_SIGNAL_NO_PENDING_TRANSACTIONS: - p_mgr->state = OSM_SM_STATE_SET_PKEY_DONE; - break; - - default: - __osm_state_mgr_signal_error( p_mgr, signal ); - signal = OSM_SIGNAL_NONE; - break; - } - break; - - case OSM_SM_STATE_SET_PKEY_DONE: switch ( signal ) {

From ali at mellanox.co.il Sun Apr 23 08:51:27 2006
From: ali at mellanox.co.il (Ali Ayoub)
Date: Sun, 23 Apr 2006 18:51:27 +0300
Subject: [openib-general] cmpost test failures
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301BE7D47@mtlexch01.mtl.com>

Hi,

The cmpost test (under
https://openib.org/svn/gen2/utils/src/linux-kernel/infiniband/util/cmpost )
fails in some cases:

1. If I change the local and the remote timeout for ib_cm_req_param to 40
   (instead of 20, the default value), it causes a kernel oops.

2. With the following parameters:
     connections   = 3000
     message_size  = 200
     message_count = 10
     qp_type       = RC
   the test fails inconsistently; in some cases it causes a kernel oops.

3. In other cases the server fails because it receives some
   IB_CM_DREQ_ERROR when the client receives all the IB_CM_DREQ_RECEIVED.

Attached are the /var/log/messages files for the above failures.

OpenIB rev         : 6367
Host architecture  : x86_64
Linux Distribution : Fedora Core release 4 (Stentz)
Kernel Version     : 2.6.11-1.1369_FC4smp
Memory size        : 4102172 kB
Driver Version     : IBED-1.0
HCA ID(s)          : mthca0
HCA model(s)       : 25218
FW version(s)      : 5.1.400
Board(s)           : MT_0150000001

Ali;
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 1.var.log.messages.client.txt
URL: 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 2.var.log.messages.client.txt
URL: 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 3.var.log.messages.client.txt
URL: 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 3.var.log.messages.server.txt
URL: 

From sashak at voltaire.com Sun Apr 23 09:03:18 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Apr 2006 19:03:18 +0300
Subject: [openib-general] [TRIVIAL PATCH] osm/Makefile: fix for GNU make-3.81
Message-ID: <20060423160318.GB28367@sashak.voltaire.com>

Hello Hal,

In GNU make-3.81 (unlike 3.80) in action lists lines will be concatenated
without leading tab symbol. So lines like

fi\
done

become 'fidone'. There is fix for osm/Makefile.

Sasha.
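To spell out the concatenation described above, here is a minimal sketch in C. It assumes the recipe text reaches the shell with its backslash-newline pairs intact, and it mimics only the shell rule of deleting each such pair. The helper join_continued_lines() is made up for this illustration; it is not part of make or of any shell.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only: copy src to dst while
 * dropping every backslash-newline pair, the way a POSIX shell joins
 * continued lines. dst must hold at least strlen(src) + 1 bytes. */
static void join_continued_lines(const char *src, char *dst)
{
    while (*src) {
        if (src[0] == '\\' && src[1] == '\n') {
            src += 2;       /* the pair vanishes; nothing replaces it */
            continue;
        }
        *dst++ = *src++;    /* every other character is kept as-is */
    }
    *dst = '\0';
}
```

Fed "fi\" followed by a newline and "done", the helper yields "fidone"; with a space in front of the backslash it yields "fi done", which is why the patch adds that space.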
Fixes for GNU make-3.81 Signed-off-by: Sasha Khapyorsky diff --git a/Makefile b/Makefile index 3cc70f8..9c86916 100644 --- a/Makefile +++ b/Makefile @@ -22,25 +22,25 @@ origmake: .PHONY : orig automake: - @for i in $(LIBS); do\ - if [ -x $$i/autogen.sh ]; then\ - if !(cd $$i; ./autogen.sh && ./configure && make && make install); then exit 1; fi\ - fi\ + @for i in $(LIBS); do \ + if [ -x $$i/autogen.sh ]; then \ + if !(cd $$i; ./autogen.sh && ./configure && make && make install); then exit 1; fi \ + fi \ done - @for i in $(OSMLIBS); do\ - if [ -x $(OSM)/$$i/autogen.sh ]; then\ - if !(cd $(OSM)/$$i; ./autogen.sh && ./configure && make && make install); then exit 1; fi\ - fi\ + @for i in $(OSMLIBS); do \ + if [ -x $(OSM)/$$i/autogen.sh ]; then \ + if !(cd $(OSM)/$$i; ./autogen.sh && ./configure && make && make install); then exit 1; fi \ + fi \ done - @for i in $(DIAG) $(OSM)/opensm; do\ - if [ -x $$i/autogen.sh ]; then\ - if !(cd $$i; ./autogen.sh && ./configure); then exit 1; fi\ - fi\ + @for i in $(DIAG) $(OSM)/opensm; do \ + if [ -x $$i/autogen.sh ]; then \ + if !(cd $$i; ./autogen.sh && ./configure); then exit 1; fi \ + fi \ done - @for i in $(DIAG) $(OSM)/opensm; do\ - if [ -x $$i/autogen.sh ]; then\ - if !(cd $$i; make && make install); then exit 1; fi\ - fi\ + @for i in $(DIAG) $(OSM)/opensm; do \ + if [ -x $$i/autogen.sh ]; then \ + if !(cd $$i; make && make install); then exit 1; fi \ + fi \ done install: BUILD_TARG=install @@ -63,18 +63,18 @@ depend: rmdep subdirs .PHONY : subdirs subdirs: - @for i in $(SUBDIRS); do\ - if [ -e $$i/Makefile ]; then\ - if !(cd $$i; make $(BUILD_TARG)); then exit 1; fi\ - fi\ - done\ + @for i in $(SUBDIRS); do \ + if [ -e $$i/Makefile ]; then \ + if !(cd $$i; make $(BUILD_TARG)); then exit 1; fi \ + fi \ + done .PHONY : libs_install libs_install: - @for i in $(LIBS); do\ - if [ -e $$i/Makefile ]; then\ - if !(cd $$i; make install); then exit 1; fi\ - fi\ - done\ + @for i in $(LIBS); do \ + if [ -e $$i/Makefile ]; then \ + if 
!(cd $$i; make install); then exit 1; fi \ + fi \ + done export BUILD_TARG

From halr at voltaire.com Mon Apr 24 01:34:32 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 24 Apr 2006 04:34:32 -0400
Subject: [openib-general] Re: [TRIVIAL PATCH] osm/Makefile: fix for GNU make-3.81
In-Reply-To: <20060423160318.GB28367@sashak.voltaire.com>
References: <20060423160318.GB28367@sashak.voltaire.com>
Message-ID: <1145867671.23359.62231.camel@hal.voltaire.com>

Hi Sasha,

On Sun, 2006-04-23 at 12:03, Sasha Khapyorsky wrote:
> Hello Hal,
>
> In GNU make-3.81 (unlike 3.80) in action lists lines will be concatenated
> without leading tab symbol. So lines like
>
> fi\
> done
>
> become 'fidone'. There is fix for osm/Makefile.

Do you mean management/Makefile ?

-- Hal

From sashak at voltaire.com Mon Apr 24 03:22:46 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 24 Apr 2006 13:22:46 +0300
Subject: [openib-general] Re: [TRIVIAL PATCH] osm/Makefile: fix for GNU make-3.81
In-Reply-To: <1145867671.23359.62231.camel@hal.voltaire.com>
References: <20060423160318.GB28367@sashak.voltaire.com> <1145867671.23359.62231.camel@hal.voltaire.com>
Message-ID: <20060424102246.GA30918@sashak.voltaire.com>

On 04:34 Mon 24 Apr , Hal Rosenstock wrote:
> Hi Sasha,
>
> On Sun, 2006-04-23 at 12:03, Sasha Khapyorsky wrote:
> > Hello Hal,
> >
> > In GNU make-3.81 (unlike 3.80) in action lists lines will be concatenated
> > without leading tab symbol. So lines like
> >
> > fi\
> > done
> >
> > become 'fidone'. There is fix for osm/Makefile.
>
> Do you mean management/Makefile ?

Yes, sure.

Sasha.

From schihei at de.ibm.com Mon Apr 24 03:38:41 2006
From: schihei at de.ibm.com (Heiko J Schick)
Date: Mon, 24 Apr 2006 12:38:41 +0200
Subject: [openib-general] Unknown symbols on loading ib_sdp
Message-ID: <444CAAB1.3050406@de.ibm.com>

Hello,

when I run OpenIB rev.
6454 with Linux kernel 2.6.17-rc2
I can't load SDP, because it seems that the following
symbols are no longer exported:

[1627961.726113] ib_sdp: Unknown symbol devinet_ioctl
[1627961.726149] ib_sdp: Unknown symbol ip_rt_ioctl

Regards,
Heiko

From ogerlitz at voltaire.com Mon Apr 24 04:43:39 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 24 Apr 2006 14:43:39 +0300
Subject: [openib-general] Unknown symbols on loading ib_sdp
In-Reply-To: <444CAAB1.3050406@de.ibm.com>
References: <444CAAB1.3050406@de.ibm.com>
Message-ID: <444CB9EB.2060108@voltaire.com>

Heiko J Schick wrote:
> Hello,
>
> when I run OpenIB rev. 6454 with Linux kernel 2.6.17-rc2
> I can't load SDP, because it seems that the following
> symbols are no longer exported:
>
> [1627961.726113] ib_sdp: Unknown symbol devinet_ioctl
> [1627961.726149] ib_sdp: Unknown symbol ip_rt_ioctl

Indeed, this is caught by the kernel makefile at build time:

WARNING: "ip_rt_ioctl" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined!
WARNING: "devinet_ioctl" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined!

Or.

From mst at mellanox.co.il Mon Apr 24 04:51:39 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 24 Apr 2006 14:51:39 +0300
Subject: [openib-general] Re: Unknown symbols on loading ib_sdp
In-Reply-To: <444CAAB1.3050406@de.ibm.com>
References: <444CAAB1.3050406@de.ibm.com>
Message-ID: <20060424115139.GO1792@mellanox.co.il>

Quoting r. Heiko J Schick :
> Subject: Unknown symbols on loading ib_sdp
>
> Hello,
>
> when I run OpenIB rev. 6454 with Linux kernel 2.6.17-rc2
> I can't load SDP, because it seems that the following
> symbols are no longer exported:
>
> [1627961.726113] ib_sdp: Unknown symbol devinet_ioctl
> [1627961.726149] ib_sdp: Unknown symbol ip_rt_ioctl
>
> Regards,
> Heiko

I haven't looked at the port to 2.6.17 yet. Most likely you can just
comment out the relevant lines.

-- MST

From mst at mellanox.co.il Mon Apr 24 05:04:24 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 24 Apr 2006 15:04:24 +0300 Subject: [openib-general] [PATCH] cma: treat ANY address as loopback on connect Message-ID: <20060424120424.GQ1792@mellanox.co.il> Treat wildcard (ANY) addresses as loopback on connect - the way TCP sockets seem to do. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.16/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.16.orig/drivers/infiniband/core/cma.c 2006-04-21 12:10:32.000000000 +0300 +++ linux-2.6.16/drivers/infiniband/core/cma.c 2006-04-24 16:49:22.000000000 +0300 @@ -1306,7 +1306,7 @@ int rdma_resolve_addr(struct rdma_cm_id atomic_inc(&id_priv->refcount); memcpy(&id->route.addr.dst_addr, dst_addr, ip_addr_size(dst_addr)); - if (cma_loopback_addr(dst_addr)) + if (cma_any_addr(dst_addr)) ret = cma_resolve_loopback(id_priv); else ret = rdma_resolve_ip(&id->route.addr.src_addr, dst_addr, -- MST From jlentini at netapp.com Mon Apr 24 07:33:23 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 24 Apr 2006 10:33:23 -0400 (EDT) Subject: [openib-general] Re: [uDAPL] dat.conf generator In-Reply-To: <200604231429.27532.dotanb@mellanox.co.il> References: <200604121122.48646.dotanb@mellanox.co.il> <44468CBB.7000501@ichips.intel.com> <200604231429.27532.dotanb@mellanox.co.il> Message-ID: On Sun, 23 Apr 2006, Dotan Barak wrote: > On Wednesday 19 April 2006 22:17, Arlin Davis wrote: > > >>>>>>OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" "" > > >>>>>>OpenIB-cma0-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" "" > > >>>>>>OpenIB-cma0-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 2" "" > > >>>>>>OpenIB-cma1-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 1" "" > > >>>>>>OpenIB-cma1-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 2" "" > > >>>>>> > > These entries are wrong. 
The cma version will only work with an ip
> > address, network hostname, or netdev name. The port value is
> > meaningless since the name gives you the device and port reference all
> > in one.
> >
> > For cma the best flavor is netdev name as follows:
> >
> > OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> >
> > because it allows you to have identical dat.conf setups across your
> > cluster if you intend on using the first IB interface on each node.
>
> but, what if one wants to work with the second ib I/F (or with the third)?
>
> If in automatic way i will create a dat.conf that will have the following lines (on a machine with 2 HCAs, 2 port in each HCA):
>
> OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> OpenIB-cma0 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> OpenIB-cma1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib1 0" ""
> OpenIB-cma2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib2 0" ""
> OpenIB-cma3 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib3 0" ""
>
> In this file, there is a dapl provider name for every IPoIB I/F, and
> still there is a default entry (the first I/F).
>
> Is this useful?

Arlin was pointing out that the uDAPL CMA provider requires an IP
address value, not a device name string:

OpenIB-cma u1.2 nonthreadsafe default libdapl.so of_udapl.1.2 "192.168.0.47 1" ""

The uDAPL CMA provider only uses the port value to check the device
attributes, so the IP addr is the most important way to specify the
device.

Naming the uDAPL providers OpenIB-cma0, OpenIB-cma1, OpenIB-cma2, ...
makes sense, but remember that it is the IP address that will
determine which device they map to.

Having a default provider, OpenIB-cma, is useful.
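
As a rough illustration of the naming scheme above, a generator could emit one line per provider name and IP address. The sketch below is only an assumption about how such a tool might format a line: format_cma_entry() and its parameters are made up here, and the fixed fields are simply copied from the example entry above.

```c
#include <stdio.h>

/* Hypothetical helper, for illustration only: format one dat.conf line
 * for the uDAPL CMA provider. The fixed fields mirror the example entry
 * quoted in this message. Returns the snprintf() result, i.e. the number
 * of characters the complete line needs (excluding the terminating NUL). */
static int format_cma_entry(char *buf, size_t len, const char *provider,
                            const char *ip_addr, unsigned port)
{
    return snprintf(buf, len,
                    "%s u1.2 nonthreadsafe default libdapl.so of_udapl.1.2 "
                    "\"%s %u\" \"\"",
                    provider, ip_addr, port);
}
```

Calling it once for OpenIB-cma and once for each of OpenIB-cma0, OpenIB-cma1, ... with the address of the matching IPoIB interface would give the default entry plus one entry per interface.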
From halr at voltaire.com Mon Apr 24 07:46:06 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 24 Apr 2006 10:46:06 -0400
Subject: [openib-general] Re: [TRIVIAL PATCH] osm/Makefile: fix for GNU make-3.81
In-Reply-To: <20060423160318.GB28367@sashak.voltaire.com>
References: <20060423160318.GB28367@sashak.voltaire.com>
Message-ID: <1145889964.23359.67567.camel@hal.voltaire.com>

On Sun, 2006-04-23 at 12:03, Sasha Khapyorsky wrote:
> Hello Hal,
>
> In GNU make-3.81 (unlike 3.80) in action lists lines will be concatenated
> without leading tab symbol. So lines like
>
> fi\
> done
>
> become 'fidone'. There is fix for osm/Makefile.
>
> Sasha.
>
>
> Fixes for GNU make-3.81
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied (to both trunk and 1.0 branch) for management/Makefile

-- Hal

From jlentini at netapp.com Mon Apr 24 08:07:00 2006
From: jlentini at netapp.com (James Lentini)
Date: Mon, 24 Apr 2006 11:07:00 -0400 (EDT)
Subject: [openib-general] Re: Speeding up IPoIB.
In-Reply-To: 
References: 
Message-ID: 

On Fri, 21 Apr 2006, Bernard King-Smith wrote:

> Grant Grundler wrote:
> > Grant> My guess is it's an easier problem to fix SDP than reducing TCP/IP
> > Grant> cache/CPU foot print. I realize only a subset of apps can (or will
> > Grant> try to) use SDP because of setup/config issues. I still believe SDP
> > Grant> is useful to a majority of apps without having to recompile them.
> >
> > I agree that reducing any protocol footprint is a very challenging job,
> > however, going to a larger MTU drops the overhead much faster. If IB
> > supported a 60K MTU then the TCP/IP overhead would be 1/30 that of what we
> > measure today. Traversing the TCP/IP stack once for a 60K packet is much
> > lower than 30 times using 2000 byte packets for the same amount of data
> > transmitted.
>
> Grant> I agree that's effective for workloads which send large messages.
> Grant> And that's typical for storage workloads.
> Grant> But the world is not just an NFS server.
;)
>
> However, NFS is not the only large data transfer workload I come
> across. If IB wants to achieve high volumes there needs to be some
> kind of commercial workload that works well on it besides the large
> applications that can afford to port to SDP or uDAPL. HPC is a niche
> that is hard to sustain a viable business in ( though everyone can
> point to a couple of companies, most are not long term, 15 years or
> more ).

Running NFS on IPoIB isn't the only option. As an alternative, NFS can
be run directly on an RDMA network using the RPC-RDMA protocol.

> However, in clustering the areas that benefit from large block transfer
> efficiency include:
>
> File Serving ( NFS, GPFS, XFS etc.)
> Application backup
> Parallel databases
> Database upload/update flows
> Web server graphics
> Web server MP3s
> Web server streaming video
> Local workstation backup
> Collaboration software
> Local mail replication
>
> My concern is that if IB does not support these operations as well as
> Ethernet, then it is a hard sell into commercial accounts/workloads for IB.
>
> Bernie King-Smith
> IBM Corporation
> Server Group
> Cluster System Performance
> wombat2 at us.ibm.com (845)433-8483
> Tie. 293-8483 or wombat2 on NOTES
>
> "We are not responsible for the world we are born into, only for the world
> we leave when we die.
> So we have to accept what has gone before us and work to change the only
> thing we can,
> -- The Future." William Shatner
>
>
>
> Grant Grundler, 04/21/2006 04:10 PM
> To: Bernard King-Smith/Poughkeepsie/IBM at IBMUS
> cc: Grant Grundler, openib-general at openib.org, Roland Dreier
> Subject: Re: Speeding up IPoIB.
>
> On Thu, Apr 20, 2006 at 09:03:29PM -0400, Bernard King-Smith wrote:
> > Grant> My guess is it's an easier problem to fix SDP than reducing TCP/IP
> > Grant> cache/CPU foot print. I realize only a subset of apps can (or will
> > Grant> try to) use SDP because of setup/config issues.
I still believe > SDP > > Grant> is useful to a majority of apps without having to recompile them. > > > > I agree that reducing any protocol footprint is a very challenging job, > > however, going to a larger MTU drops the overhead much faster. If IB > > supported a 60K MTU then the TCP/IP overhead would be 1/30 that of what > we > > measure today. Traversng the TCP/IP stack once for a 60K packet is much > > lower than 30 times using 2000 byte packets for the same amount of data > > transmitted. > > I agree that's effective for workloads which send large messages. > And that's typical for storage workloads. > But the world is not just an NFS server. ;) > > > Grant> I'm not competent to disagree in detail. > > Grant> Fabian Tillier and Caitlin Bestler can (and have) addressed this. > > > > I would be very interested in any pointers to their work. > > They have posted to this forum recently on this topic. > The archives are here in case you want to look them up: > http://www.openib.org/contact.html > > > This goes back to systems where the system is busy doing nothing, > generally > > when waiting for memory or a cache line miss, or I/O to disks. This is > > where hyperthreading has shown some speedups for benchmarks where > > previously they were totally CPU limited, and with hyperthreading there > is > > a gain. > > While there are workloads that benefit, I don't buy the hyperthreading > argument in general. Co-workers have demonstrate several "normal" > workloads that don't benefit and are faster with hyperthreading > disabled. > > > The unused cycles are "wait" cycles when something can run if it > > can get in quickly. You can't get a TCP stack in the wait, but small > parts > > of the stackor driver could fit in the other thread. Yes I do > benchmarking > > and was skeptical at first. > > ok. 
> > thanks, > grant > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From dotanb at mellanox.co.il Mon Apr 24 08:16:01 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 24 Apr 2006 18:16:01 +0300 Subject: [openib-general] Re: [uDAPL] dat.conf generator In-Reply-To: References: <200604121122.48646.dotanb@mellanox.co.il> <200604231429.27532.dotanb@mellanox.co.il> Message-ID: <200604241816.01856.dotanb@mellanox.co.il> On Monday 24 April 2006 17:33, James Lentini wrote: > > On Sun, 23 Apr 2006, Dotan Barak wrote: > > > > > If in automatic way i will create a dat.conf that will have the following lines (on a machine with 2 HCAs, 2 port in each HCA): > > > > OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" > > OpenIB-cma0 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" > > OpenIB-cma1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib1 0" "" > > OpenIB-cma2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib2 0" "" > > OpenIB-cma3 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib3 0" "" > > > > In this file, there is a dapl provider name for every IPoIB I/F, and > > still there is a default entry (the first I/F). > > > > Is this is usefull? > > Arlin was pointing out that the uDAPL CMA provider requires an IP > address value, not a device name string: > > OpenIB-cma u1.2 nonthreadsafe default libdapl.so of_udapl.1.2 "192.168.0.47 1" "" > > The uDAPL CMA provider only uses the port value to check the device > attributes, so the IP addr is the most important way to specify the > device. > > Naming the uDAPL providers OpenIB-cma0, OpenIB-cma1, OpenIB-cma2, ... > makes sense, but remember that it is the IP address that will > determine which device they map to. 
> > Having a default provider, OpenIB-cma, is useful. o.k,, so the generated file will have the following line for all of the IPoIB I/Fs: OpenIB-cmaX u1.2 nonthreadsafe default libdapl.so of_udapl.1.2 "192.168.0.47 1" "" (X run from 0 to n) and a default OpenIB-cma which will be the first IPoIB I/F that was found. i have 2 more questions: 1) what is the number after the IP? (ib port? starts from which value?) 2) do you need any lines with the scm ? (if the answer is yes, can you please send me an example line?) thanks Dotan From caitlinb at broadcom.com Mon Apr 24 08:40:19 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 24 Apr 2006 08:40:19 -0700 Subject: [openib-general] Re: [uDAPL] dat.conf generator Message-ID: <54AD0F12E08D1541B826BE97C98F99F143AABD@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > On Sun, 23 Apr 2006, Dotan Barak wrote: > >> On Wednesday 19 April 2006 22:17, Arlin Davis wrote: >>>>>>>>> OpenIB-cma u1.2 nonthreadsafe default > /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" "" >>>>>>>>> OpenIB-cma0-1 u1.2 nonthreadsafe default > /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 1" "" >>>>>>>>> OpenIB-cma0-2 u1.2 nonthreadsafe default > /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca0 2" "" >>>>>>>>> OpenIB-cma1-1 u1.2 nonthreadsafe default > /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 1" "" >>>>>>>>> OpenIB-cma1-2 u1.2 nonthreadsafe default > /usr/lib/libdaplcma.so mv_dapl.1.2 "mthca1 2" "" >>>>>>>>> >>> These entries are wrong. The cma versopm will only work with an ip >>> address, network hostname, or netdev name. The port value is >>> meaningless since the name gives you the device and port reference >>> all in one. 
>>> >>> For cma the best flavor is netdev name as follow: >>> >>> OpenIB-cma u1.2 nonthreadsafe default > /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" >>> >>> because it allows you to have identical dat.conf setups on across >>> your cluster if you intend on using the first IB interface on each >>> node. >> >> but, what if one wants to work with the second ib I/F (or with the >> third)? >> >> If in automatic way i will create a dat.conf that will have > the following lines (on a machine with 2 HCAs, 2 port in each HCA): >> >> OpenIB-cma u1.2 nonthreadsafe default > /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" >> OpenIB-cma0 u1.2 nonthreadsafe default > /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" >> OpenIB-cma1 u1.2 nonthreadsafe default > /usr/lib/libdaplcma.so mv_dapl.1.2 "ib1 0" "" >> OpenIB-cma2 u1.2 nonthreadsafe default > /usr/lib/libdaplcma.so mv_dapl.1.2 "ib2 0" "" >> OpenIB-cma3 u1.2 nonthreadsafe default > /usr/lib/libdaplcma.so mv_dapl.1.2 "ib3 0" "" >> >> In this file, there is a dapl provider name for every IPoIB I/F, and >> still there is a default entry (the first I/F). >> >> Is this is usefull? > > Arlin was pointing out that the uDAPL CMA provider requires > an IP address value, not a device name string: > > OpenIB-cma u1.2 nonthreadsafe default libdapl.so of_udapl.1.2 > "192.168.0.47 1" "" > > The uDAPL CMA provider only uses the port value to check the > device attributes, so the IP addr is the most important way > to specify the device. > > Naming the uDAPL providers OpenIB-cma0, OpenIB-cma1, OpenIB-cma2, ... > makes sense, but remember that it is the IP address that will > determine which device they map to. > If the system administrator uses ifconfig to change an IP address for an interface, shouldn't DAPL adjust to the new IP address automatically? That would imply that the netdevice name should be used rather than the IP address, wouldn't it? 
From bos at pathscale.com Mon Apr 24 08:51:16 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 08:51:16 -0700 Subject: [openib-general] 1.0 RC3 schedule update Message-ID: <1145893876.23836.10.camel@localhost.localdomain> I'm still planning to release 1.0 RC3 on Monday, May 1. If you have userspace changes that you want to see included, please commit them to the 1.0 branch or send them to me as patches by 6pm GMT (10am California time) on Thursday, April 28. Thanks, From tom at opengridcomputing.com Mon Apr 24 09:11:40 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Apr 2006 11:11:40 -0500 Subject: [openib-general] [PATCH][UVERBS][RFC] Exporting device node_type to user mode In-Reply-To: <1145581028.8968.9.camel@bigtime.es335.com> References: <1145567117.27405.38.camel@trinity.ogc.int> <1145581028.8968.9.camel@bigtime.es335.com> Message-ID: <1145895100.18808.13.camel@trinity.ogc.int> Roland: Thinking about this a little more and having read some less than flattering commentary on various mailing list about sysfs, ABI, differences between distros, etc... Is using sysfs for device attributes the right approach here, or should be bite the bullet and update the kernel-abi? On Thu, 2006-04-20 at 19:57 -0500, Tom Tucker wrote: > On Thu, 2006-04-20 at 14:13 -0700, Roland Dreier wrote: > > Tom> In order to support transport independent behavior for > > Tom> user-mode RDMA CMA clients we need to export the node_type to > > Tom> the user mode device attributes structure. The reason for > > Tom> this is that the user-mode CMA needs to behave differently > > Tom> for iWARP vs. IB transports when migrating QP state at > > Tom> connection setup and tear down. > > > > Adding node_type to the libibverbs API is OK (for the 1.1 release > > series...), but I think it would be better to read the existing > > /sys/class/infiniband//node_type field in sysfs rather than > > adding it in to the query stuff. > > > > Ok -- no problem. 
Are there rules/guidelines that govern the device > attributes that belong in sys/class/infiniband vs. attributes that > belong in device_attr? > > > - R. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Mon Apr 24 08:54:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Apr 2006 11:54:33 -0400 Subject: [openib-general] 1.0 RC3 schedule update In-Reply-To: <1145893876.23836.10.camel@localhost.localdomain> References: <1145893876.23836.10.camel@localhost.localdomain> Message-ID: <1145893926.23359.68547.camel@hal.voltaire.com> On Mon, 2006-04-24 at 11:51, Bryan O'Sullivan wrote: > I'm still planning to release 1.0 RC3 on Monday, May 1. If you have > userspace changes that you want to see included, please commit them to > the 1.0 branch or send them to me as patches by 6pm GMT (10am California > time) on Thursday, April 28. Isn't it RC4 now (RC3 being skipped to sync with OFED) ? -- Hal > Thanks, > > References: <20060423075752.GX1792@mellanox.co.il> Message-ID: <444CFB6D.6030000@ichips.intel.com> Michael S. Tsirkin wrote: > BTW, Sean, if you intend to emulate socket interface, same is necessary for > listen: listen without bind is equivalent to binding to anyport/anyinterface > first. I'm aware that there are several areas where the CMA doesn't match sockets yet. I just need to know what's important enough to implement immediately. > Do you think this logic belongs in CMA or in ULPs? I think the CMA. 
- Sean From mshefty at ichips.intel.com Mon Apr 24 09:26:02 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 09:26:02 -0700 Subject: [openib-general] [PATCH] cma nit In-Reply-To: <20060423075420.GW1792@mellanox.co.il> References: <20060423075420.GW1792@mellanox.co.il> Message-ID: <444CFC1A.9020406@ichips.intel.com> Thanks - committed. From mst at mellanox.co.il Mon Apr 24 09:34:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 24 Apr 2006 19:34:15 +0300 Subject: [openib-general] Re: Re: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: <444CFB6D.6030000@ichips.intel.com> References: <20060423075752.GX1792@mellanox.co.il> <444CFB6D.6030000@ichips.intel.com> Message-ID: <20060424163415.GZ1792@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: Re: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address > > Michael S. Tsirkin wrote: > >BTW, Sean, if you intend to emulate socket interface, same is necessary for > >listen: listen without bind is equivalent to binding to > >anyport/anyinterface > >first. > > I'm aware that there are several areas where the CMA doesn't match sockets > yet. I just need to know what's important enough to implement immediately. I have put a simple version in SDP for now, so this bit can wait a bit. > >Do you think this logic belongs in CMA or in ULPs? > > I think the CMA. 
-- MST From bos at pathscale.com Mon Apr 24 09:33:14 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 09:33:14 -0700 Subject: [openib-general] 1.0 RC3 schedule update In-Reply-To: <1145893926.23359.68547.camel@hal.voltaire.com> References: <1145893876.23836.10.camel@localhost.localdomain> <1145893926.23359.68547.camel@hal.voltaire.com> Message-ID: <1145896394.23836.12.camel@localhost.localdomain> On Mon, 2006-04-24 at 11:54 -0400, Hal Rosenstock wrote: > On Mon, 2006-04-24 at 11:51, Bryan O'Sullivan wrote: > > I'm still planning to release 1.0 RC3 on Monday, May 1. If you have > > userspace changes that you want to see included, please commit them to > > the 1.0 branch or send them to me as patches by 6pm GMT (10am California > > time) on Thursday, April 28. > > Isn't it RC4 now (RC3 being skipped to sync with OFED) ? Sorry, yes. That was a before-enough-coffee thinko. References: <20060423124258.GH1792@mellanox.co.il> Message-ID: <444D006A.20209@ichips.intel.com> Thanks - this was left over from a previous iteration of code to handle port spaces. 
Committed - Sean From jlentini at netapp.com Mon Apr 24 10:04:58 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 24 Apr 2006 13:04:58 -0400 (EDT) Subject: [openib-general] Re: [uDAPL] dat.conf generator In-Reply-To: <200604241816.01856.dotanb@mellanox.co.il> References: <200604121122.48646.dotanb@mellanox.co.il> <200604231429.27532.dotanb@mellanox.co.il> <200604241816.01856.dotanb@mellanox.co.il> Message-ID: On Mon, 24 Apr 2006, Dotan Barak wrote: > On Monday 24 April 2006 17:33, James Lentini wrote: > > > > On Sun, 23 Apr 2006, Dotan Barak wrote: > > > > > > > > If in automatic way i will create a dat.conf that will have the following lines (on a machine with 2 HCAs, 2 port in each HCA): > > > > > > OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" > > > OpenIB-cma0 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" "" > > > OpenIB-cma1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib1 0" "" > > > OpenIB-cma2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib2 0" "" > > > OpenIB-cma3 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib3 0" "" > > > > > > In this file, there is a dapl provider name for every IPoIB I/F, and > > > still there is a default entry (the first I/F). > > > > > > Is this is usefull? > > > > Arlin was pointing out that the uDAPL CMA provider requires an IP > > address value, not a device name string: > > > > OpenIB-cma u1.2 nonthreadsafe default libdapl.so of_udapl.1.2 "192.168.0.47 1" "" > > > > The uDAPL CMA provider only uses the port value to check the device > > attributes, so the IP addr is the most important way to specify the > > device. > > > > Naming the uDAPL providers OpenIB-cma0, OpenIB-cma1, OpenIB-cma2, ... > > makes sense, but remember that it is the IP address that will > > determine which device they map to. > > > > Having a default provider, OpenIB-cma, is useful. 
> > o.k,, so the generated file will have the following line for all of the IPoIB I/Fs: > OpenIB-cmaX u1.2 nonthreadsafe default libdapl.so of_udapl.1.2 "192.168.0.47 1" "" (X run from 0 to n) > and a default OpenIB-cma which will be the first IPoIB I/F that was found. > > i have 2 more questions: > 1) what is the number after the IP? (ib port? starts from which value?) Device port. This is the value passed to ibv_query_port(). These should start at 1. > 2) do you need any lines with the scm ? (if the answer is yes, can > you please send me an example line?) I'd make the script check for the cma and scm library, for each installed library create the appropriate entries. Here's an example entry: OpenIB-scm1 u1.2 nonthreadsafe default /usr/lib/libdaplscm.so of_udapl.1.2 "mthca0 1" "" The first IA param is the device name and the second is the device port. From jlentini at netapp.com Mon Apr 24 10:06:01 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 24 Apr 2006 13:06:01 -0400 (EDT) Subject: [openib-general] Re: [uDAPL] dat.conf generator In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F143AABD@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F143AABD@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: On Mon, 24 Apr 2006, Caitlin Bestler wrote: > If the system administrator uses ifconfig to change an IP address > for an interface, shouldn't DAPL adjust to the new IP address > automatically? That would imply that the netdevice name should > be used rather than the IP address, wouldn't it? That is a good point. The netdevice name would be better. 
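A minimal sketch of the dat.conf generator discussed in this thread, assuming the netdev-name form of the CMA entry quoted earlier ("ib0 0"). The helper name format_cma_entry and the convention that a negative index yields the default OpenIB-cma entry are hypothetical, chosen only for illustration; the library path and provider version string are copied from the example lines in the thread and may differ on a given install.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of uDAPL): write one dat.conf line for the
 * CMA provider into buf. A negative index produces the default "OpenIB-cma"
 * entry; otherwise the entry is named "OpenIB-cma<index>".
 * Returns the snprintf() result, or -1 if netdev is empty.
 */
int format_cma_entry(char *buf, size_t len, int index, const char *netdev)
{
    /* Fixed fields copied from the example entries in the thread. */
    static const char fields[] =
        "u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2";

    assert(buf != NULL && netdev != NULL);
    if (strlen(netdev) == 0)
        return -1;

    if (index < 0)
        return snprintf(buf, len, "OpenIB-cma %s \"%s 0\" \"\"",
                        fields, netdev);
    return snprintf(buf, len, "OpenIB-cma%d %s \"%s 0\" \"\"",
                    index, fields, netdev);
}
```

Calling it once with a negative index for the first IPoIB interface, and then once per interface with indexes 0..n, would give the layout Dotan describes. Keying on the netdev name rather than the IP address means the file stays valid if an address is changed with ifconfig.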
From tom at opengridcomputing.com Mon Apr 24 10:46:00 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Apr 2006 12:46:00 -0500 Subject: [openib-general] [PATCH][UVERBS][RFC] Node Type in Userland Message-ID: <1145900760.18808.19.camel@trinity.ogc.int> This patch uses the sys/class entries to divine the node type and stores the result in the cma_id when the cm_id is bound to a device. The node_type can then be used by both the CMA and Applications to select transport specific code paths. Signed-off-by: Tom Tucker Index: libibverbs/include/infiniband/verbs.h =================================================================== --- libibverbs/include/infiniband/verbs.h (revision 6570) +++ libibverbs/include/infiniband/verbs.h (working copy) @@ -698,6 +698,12 @@ struct ibv_port_attr *port_attr); /** + * ibv_query_node_type - Get the device node_type + */ +int ibv_query_node_type(struct ibv_context *context, + enum ibv_node_type *node_type); + +/** * ibv_query_gid - Get a GID table entry */ int ibv_query_gid(struct ibv_context *context, uint8_t port_num, Index: libibverbs/src/libibverbs.map =================================================================== --- libibverbs/src/libibverbs.map (revision 6570) +++ libibverbs/src/libibverbs.map (working copy) @@ -9,6 +9,7 @@ ibv_get_async_event; ibv_ack_async_event; ibv_query_device; + ibv_query_node_type; ibv_query_port; ibv_query_gid; ibv_query_pkey; Index: libibverbs/src/verbs.c =================================================================== --- libibverbs/src/verbs.c (revision 6570) +++ libibverbs/src/verbs.c (working copy) @@ -89,6 +89,23 @@ return context->ops.query_port(context, port_num, port_attr); } +int ibv_query_node_type(struct ibv_context *context, + enum ibv_node_type *node_type) +{ + char node_desc[24]; + char node_str[24]; + + if (!context) + return -1; + + if (ibv_read_sysfs_file(context->device->ibdev->path, "node_type", + node_desc, sizeof(node_desc)) < 0) + return -1; + + 
sscanf(node_desc, "%d: %s\n", (int*)node_type, node_str); + return 0; +} + int ibv_query_gid(struct ibv_context *context, uint8_t port_num, int index, union ibv_gid *gid) { Index: librdmacm/include/rdma/rdma_cma.h =================================================================== --- librdmacm/include/rdma/rdma_cma.h (revision 6570) +++ librdmacm/include/rdma/rdma_cma.h (working copy) @@ -79,6 +79,7 @@ void *context; struct ibv_qp *qp; struct rdma_route route; + enum ibv_node_type node_type; uint8_t port_num; }; Index: librdmacm/src/cma.c =================================================================== --- librdmacm/src/cma.c (revision 6570) +++ librdmacm/src/cma.c (working copy) @@ -248,6 +248,8 @@ if (cma_dev->guid == guid) { id_priv->cma_dev = cma_dev; id_priv->id.verbs = cma_dev->verbs; + ibv_query_node_type(cma_dev->verbs, + &id_priv->id.node_type); return 0; } From mst at mellanox.co.il Mon Apr 24 10:45:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 24 Apr 2006 20:45:21 +0300 Subject: [openib-general] RFC: cma: need rdma_unbind Message-ID: <20060424174521.GA20743@mellanox.co.il> Sean, it seems that a rdma_unbind API is necessary for SDP SO_REUSEADDR support. This should only remove association between the port and the id, without affecting the CM state. Reason: when SO_REUSEADDR is set, the port becomes usable immediately after close call, but the QP must not be closed immediately - rather its closed after graceful close or after timeout. Thus I can not destroy the id, but the association with port needs to be cancelled. -- MST From mshefty at ichips.intel.com Mon Apr 24 11:05:37 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 11:05:37 -0700 Subject: [openib-general] cmpost test failures In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301BE7D47@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E301BE7D47@mtlexch01.mtl.com> Message-ID: <444D1371.5020405@ichips.intel.com> Ali Ayoub wrote: > 1. 
If I change the local and the remote timeout for ib_cm_req_param to > 40 (instead of 20, the default value) it causes kernel oops. The timeout is calculated as: 4.096 x 2 ^ timeout. In highly technical terms, going from 20 to 40 increases the timeout by a factor of a lot (from seconds to weeks). Since the oops occurred in cmpost, I'm not overly concerned with trying to debug this at the moment. (I will happily take a patch that fixes the issue, or will look at it more if it definitely looks like an ib_cm bug. Cmpost just isn't meant to be a robust test program.) > 2. With the following parameters: > > connections = 3000 > > message_size = 200 > > message_count = 10 > > qp_type = RC > > The test fails inconsistently; in some cases it causes a kernel oops, This setup will result in allocating a fair amount of memory, which could explain the failures. The oops may be related, but I can't tell just from the backtrace. I've never run into this myself though. Can you reproduce this issue using a smaller number of connections? Note that when simultaneously establishing a large number of connections, you will end up overrunning QP 1 on the remote side. This will result in a lot of dropped MADs, timeouts, and retries, which can make the results of the test unpredictable. > 3. In other cases the server fails because it receives some > IB_CM_DREQ_ERROR when the client receives all the IB_CM_DREQ_RECEIVED. This can occur, and is easier to reproduce for a large number of connections. A DREQ is retried until a DREP is received. However, since a DREP is not acked, once it has been sent, the disconnect is done from the client's perspective. If the DREP is lost, the server will see a DREQ timeout. There is code in the ib_cm to resend a DREP in response to a repeated DREQ, but the state needed to generate the DREP is only maintained while the old connection is in timewait. 
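To make the timeout arithmetic above concrete, here is a small sketch that evaluates 4.096 x 2 ^ timeout. It assumes the 4.096 is in microseconds, as is conventional for InfiniBand timeout encodings; the function name cm_timeout_seconds is hypothetical and is not an ib_cm call.

```c
#include <assert.h>

/*
 * Hypothetical helper: convert a CM timeout exponent into seconds,
 * using timeout = 4.096 us * 2^exponent.
 */
double cm_timeout_seconds(unsigned int exponent)
{
    /* Keep the shift well defined for a 64-bit operand. */
    assert(exponent < 64);
    return 4.096e-6 * (double)(1ULL << exponent);
}
```

With an exponent of 20 this gives roughly 4.3 seconds; with 40 it gives roughly 4.5 million seconds, or about 52 days, which is the jump from seconds to weeks mentioned above.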
- Sean From mshefty at ichips.intel.com Mon Apr 24 11:09:18 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 11:09:18 -0700 Subject: [openib-general] Re: [PATCH][UVERBS][RFC] Node Type in Userland In-Reply-To: <1145900760.18808.19.camel@trinity.ogc.int> References: <1145900760.18808.19.camel@trinity.ogc.int> Message-ID: <444D144E.3020506@ichips.intel.com> Tom Tucker wrote: > Index: librdmacm/include/rdma/rdma_cma.h > =================================================================== > --- librdmacm/include/rdma/rdma_cma.h (revision 6570) > +++ librdmacm/include/rdma/rdma_cma.h (working copy) > @@ -79,6 +79,7 @@ > void *context; > struct ibv_qp *qp; > struct rdma_route route; > + enum ibv_node_type node_type; > uint8_t port_num; > }; Rather than storing node_type with the userspace rdma_cm, would it make more sense to store it in ibv_context? - Sean From mshefty at ichips.intel.com Mon Apr 24 11:15:24 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 11:15:24 -0700 Subject: [openib-general] RFC: cma: need rdma_unbind In-Reply-To: <20060424174521.GA20743@mellanox.co.il> References: <20060424174521.GA20743@mellanox.co.il> Message-ID: <444D15BC.4040904@ichips.intel.com> Michael S. Tsirkin wrote: > Sean, it seems that a rdma_unbind API is necessary for SDP SO_REUSEADDR support. > This should only remove association between the port and the id, without > affecting the CM state. > > Reason: when SO_REUSEADDR is set, the port becomes usable immediately > after close call, but the QP must not be closed immediately - > rather its closed after graceful close or after timeout. > > Thus I can not destroy the id, but the association with > port needs to be cancelled. I'm in the process of adding rdma_set_option and rdma_get_option calls. If SO_REUSEADDR support were added through these calls, can we come up with a solution that doesn't require adding an rdma_unbind call? 
- Sean From mshefty at ichips.intel.com Mon Apr 24 11:17:32 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 11:17:32 -0700 Subject: [openib-general] [PATCH] cma: treat ANY address as loopback on connect In-Reply-To: <20060424120424.GQ1792@mellanox.co.il> References: <20060424120424.GQ1792@mellanox.co.il> Message-ID: <444D163C.7040809@ichips.intel.com> Thanks - committed. - Sean From mst at mellanox.co.il Mon Apr 24 11:23:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 24 Apr 2006 21:23:27 +0300 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <444D15BC.4040904@ichips.intel.com> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> Message-ID: <20060424182327.GA20831@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC: cma: need rdma_unbind > > Michael S. Tsirkin wrote: > >Sean, it seems that an rdma_unbind API is necessary for SDP SO_REUSEADDR > >support. > >This should only remove association between the port and the id, without > >affecting the CM state. > > > >Reason: when SO_REUSEADDR is set, the port becomes usable immediately > >after close call, but the QP must not be closed immediately - > >rather it's closed after graceful close or after timeout. > > > >Thus I cannot destroy the id, but the association with > >port needs to be cancelled. > > I'm in the process of adding rdma_set_option and rdma_get_option calls. Hmm. Not sure this is a good idea - won't proper APIs be better for kernel use? > If SO_REUSEADDR support were added through these calls, can we come up with a > solution that doesn't require adding an rdma_unbind call? I don't see how - e.g. the graceful close must be done at SDP level.
-- MST From mshefty at ichips.intel.com Mon Apr 24 12:04:58 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 12:04:58 -0700 Subject: [openib-general] Re: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: <20060424163415.GZ1792@mellanox.co.il> References: <20060423075752.GX1792@mellanox.co.il> <444CFB6D.6030000@ichips.intel.com> <20060424163415.GZ1792@mellanox.co.il> Message-ID: <444D215A.4060201@ichips.intel.com> Michael S. Tsirkin wrote: >>>BTW, Sean, if you intend to emulate socket interface, same is necessary for >>>listen: listen without bind is equivalent to binding to >>>anyport/anyinterface >>>first. >> >>I'm aware that there are several areas where the CMA doesn't match sockets >>yet. I just need to know what's important enough to implement immediately. > > > I have put a simple version in SDP for now, so this bit can wait a bit. I went to add this, but discovered that I don't have a good way to determine which address family to use. I may need to modify the create call to set this. - Sean From mst at mellanox.co.il Mon Apr 24 12:10:19 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 24 Apr 2006 22:10:19 +0300 Subject: [openib-general] Re: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: <444D215A.4060201@ichips.intel.com> References: <20060423075752.GX1792@mellanox.co.il> <444CFB6D.6030000@ichips.intel.com> <20060424163415.GZ1792@mellanox.co.il> <444D215A.4060201@ichips.intel.com> Message-ID: <20060424191019.GB20831@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address > > Michael S. Tsirkin wrote: > >>>BTW, Sean, if you intend to emulate socket interface, same is necessary > >>>for > >>>listen: listen without bind is equivalent to binding to > >>>anyport/anyinterface > >>>first. > >> > >>I'm aware that there are several areas where the CMA doesn't match > >>sockets yet. 
I just need to know what's important enough to implement > >>immediately. > > > > > >I have put a simple version in SDP for now, so this bit can wait a bit. > > I went to add this, but discovered that I don't have a good way to > determine which address family to use. I may need to modify the create > call to set this. Not sure how do you mean. Listen without bind is equivalent to bind on any port, any address. So why do you need address family? -- MST From tom at opengridcomputing.com Mon Apr 24 12:23:02 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Apr 2006 14:23:02 -0500 Subject: [openib-general] Re: [PATCH][UVERBS][RFC] Node Type in Userland In-Reply-To: <444D144E.3020506@ichips.intel.com> References: <1145900760.18808.19.camel@trinity.ogc.int> <444D144E.3020506@ichips.intel.com> Message-ID: <1145906582.18808.23.camel@trinity.ogc.int> On Mon, 2006-04-24 at 11:09 -0700, Sean Hefty wrote: > Tom Tucker wrote: > > Index: librdmacm/include/rdma/rdma_cma.h > > =================================================================== > > --- librdmacm/include/rdma/rdma_cma.h (revision 6570) > > +++ librdmacm/include/rdma/rdma_cma.h (working copy) > > @@ -79,6 +79,7 @@ > > void *context; > > struct ibv_qp *qp; > > struct rdma_route route; > > + enum ibv_node_type node_type; > > uint8_t port_num; > > }; > > Rather than storing node_type with the userspace rdma_cm, would it make more > sense to store it in ibv_context? Yes, I think it does. Roland? > > - Sean From tom at opengridcomputing.com Mon Apr 24 12:32:45 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Apr 2006 14:32:45 -0500 Subject: [openib-general] RFC: cma: need rdma_unbind In-Reply-To: <444D15BC.4040904@ichips.intel.com> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> Message-ID: <1145907165.18808.28.camel@trinity.ogc.int> On Mon, 2006-04-24 at 11:15 -0700, Sean Hefty wrote: > Michael S. 
Tsirkin wrote: > > Sean, it seems that an rdma_unbind API is necessary for SDP SO_REUSEADDR support. > > This should only remove association between the port and the id, without > > affecting the CM state. > > > > Reason: when SO_REUSEADDR is set, the port becomes usable immediately > > after close call, but the QP must not be closed immediately - > > rather it's closed after graceful close or after timeout. > > > > Thus I cannot destroy the id, but the association with > > port needs to be cancelled. > > I'm in the process of adding rdma_set_option and rdma_get_option calls. If > SO_REUSEADDR support were added through these calls, can we come up with a > solution that doesn't require adding an rdma_unbind call? There is no need for an unbind call -- the sockets code, for example, doesn't work this way. Instead, it looks at a 'reuse' flag and decides whether to allow a subsequent bind or use-port request based on the value. ...from inet_csk_get_port... tb_found: if (!hlist_empty(&tb->owners)) { if (sk->sk_reuse > 1) goto success; > > - Sean From jsquyres at cisco.com Mon Apr 24 12:45:44 2006 From: jsquyres at cisco.com (Jeff Squyres (jsquyres)) Date: Mon, 24 Apr 2006 15:45:44 -0400 Subject: [openib-general] Bugzilla component Message-ID: Can we add an "Open MPI" bugzilla component? It will be for OFED packaging issues. Thanks.
-- Jeff Squyres Server Virtualization Business Unit Cisco Systems From wombat2 at us.ibm.com Mon Apr 24 12:48:07 2006 From: wombat2 at us.ibm.com (Bernard King-Smith) Date: Mon, 24 Apr 2006 15:48:07 -0400 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: <20060423143449.5840D228716@openib.ca.sandia.gov> Message-ID: Leonid Arsh wrote: Leonid> Shirley, Leonid> some additional information you may be interested in: Leonid> According to our experience with the Voltaire IPoIB driver, Leonid> splitting CQ harmed the throughput (we checked with the iperf Leonid> application, UDP mode.) Splitting the CQ caused more interrupts, Leonid> context switches and CQ polls. Interesting results. I think some of Shirley's work reduced the number of interrupts on ehca, so this is starting to sound like a case where one size does not fit all drivers. I wonder what Pathscale would see if they split the completion queues? Leonid> Note, the case is rather different from OpenIB mthca, since Voltaire Leonid> IPoIB is based on the VAPI driver, Leonid> where CQ completions are handled in a tasklet context, Leonid> unlike mthca where CQ completions are handled in the HW interrupt Leonid> context. Another question is what we do about adapter-specific code, where each adapter type (ehca, mthca, Voltaire and Pathscale) can provide better performance if adapter-specific code and tuning are required? Leonid> NAPI gave us some improvement. I think NAPI should improve much more Leonid> in mthca, with the HW interrupt CQ completions. However, I don't believe that NAPI can provide the same benefit for all the driver models listed above. It may help in overall interrupt handling, but there is probably a need for additional adapter/driver-specific tuning. Some of this may end up requiring support in the OpenIB stack. There are many cases not covered by using Netperf and netpipe that show improved performance.
These cases are running multiple sockets per link/adapter, and the case where you have a larger machine where you have multiple adapters. I haven't seen any data recently on duplex traffic, only STREAM ( or unidirectional) either. Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." William Shatner From halr at voltaire.com Mon Apr 24 12:59:06 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Apr 2006 15:59:06 -0400 Subject: [openib-general] Re: [PATCH 1/4] opensm: don't try to enforce partitions on router port In-Reply-To: <20060423142618.15562.31253.stgit@sashak.voltaire.com> References: <20060423141935.15562.38762.stgit@sashak.voltaire.com> <20060423142618.15562.31253.stgit@sashak.voltaire.com> Message-ID: <1145908743.23359.71419.camel@hal.voltaire.com> On Sun, 2006-04-23 at 10:26, Sasha Khapyorsky wrote: > When router port is connected directly to CA don't try handle it as > switch external ports (update pkey table and enforce partitions). > Router ports are handled by partition manager as end ports. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied to trunk only. -- Hal From xma at us.ibm.com Mon Apr 24 13:23:13 2006 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 24 Apr 2006 13:23:13 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: <444B8338.8000608@voltaire.com> Message-ID: Hello Leonid, Leonid Arsh wrote on 04/23/2006 06:38:00 AM: > Shirley, > > some additional information you may be interested: > > According to our experience with the Voltaire IPoIB driver, > splitting CQ harmed the throughput (we checked with the iperf > application, UDP mode.) 
Splitting the CQ caused more interrupts, > context switches and CQ polls. > Note, the case is rather different from OpenIB mthca, since Voltaire > IPoIB is based on the VAPI driver, > where CQ completions are handled in a tasklet context, > unlike mthca where CQ completions are handled in the HW interrupt > context. That is expected, because only one tasklet is allowed to run across all CPUs at the same time. Have you tried using another SOFTIRQ instead of TASKLET_SOFTIRQ? My expectation is that the performance will be better, since there would be multiple softirqs running simultaneously. If it's a simple change to your code, could you please try it? I am thinking of splitting mthca CQ completion into HW interrupt and softirq context. > NAPI gave us some improvement. I think NAPI should improve much more > in mthca, with the HW interrupt CQ completions. Yes, with the hardware interrupts disabled. It would be interesting to compare CQ completion handling with NAPI and in softirq context. It all depends on how you implement NAPI. If you only implement NAPI without changing the sender, NAPI might not get better performance than softirq. The benefit of NAPI is that it has one dev->poll running across all CPUs, which prevents out-of-order packets entirely. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
From mshefty at ichips.intel.com Mon Apr 24 13:29:01 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 13:29:01 -0700 Subject: [openib-general] Re: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: <20060424191019.GB20831@mellanox.co.il> References: <20060423075752.GX1792@mellanox.co.il> <444CFB6D.6030000@ichips.intel.com> <20060424163415.GZ1792@mellanox.co.il> <444D215A.4060201@ichips.intel.com> <20060424191019.GB20831@mellanox.co.il> Message-ID: <444D350D.8090501@ichips.intel.com> Michael S. Tsirkin wrote: > Not sure how do you mean. > Listen without bind is equivalent to bind on any port, any address. > So why do you need address family? The socket() call takes an address family as input. Don't you still need to know if you're listening on TCPv4 or TCPv6? - Sean From tom at opengridcomputing.com Mon Apr 24 13:41:07 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Apr 2006 15:41:07 -0500 Subject: [openib-general] [PATCH][UVERBS][RFC] node type in ibv_context In-Reply-To: <444D144E.3020506@ichips.intel.com> References: <1145900760.18808.19.camel@trinity.ogc.int> <444D144E.3020506@ichips.intel.com> Message-ID: <1145911267.18808.36.camel@trinity.ogc.int> Here's a patch that puts a node_type in the ibv_context.
Signed-off-by: Tom Tucker Index: libibverbs/include/infiniband/verbs.h =================================================================== --- libibverbs/include/infiniband/verbs.h (revision 6570) +++ libibverbs/include/infiniband/verbs.h (working copy) @@ -68,11 +68,19 @@ }; enum ibv_node_type { + IBV_NODE_UNKNOWN=-1, IBV_NODE_CA = 1, IBV_NODE_SWITCH, - IBV_NODE_ROUTER + IBV_NODE_ROUTER, + IBV_NODE_RNIC }; +enum ibv_transport_type { + IBV_TRANSPORT_UNKNOWN=0, + IBV_TRANSPORT_IB=1, + IBV_TRANSPORT_IWARP=2 +}; + enum ibv_device_cap_flags { IBV_DEVICE_RESIZE_MAX_WR = 1, IBV_DEVICE_BAD_PKEY_CNTR = 1 << 1, @@ -615,6 +623,7 @@ }; struct ibv_context { + enum ibv_node_type node_type; struct ibv_device *device; struct ibv_context_ops ops; int cmd_fd; @@ -654,6 +663,33 @@ uint64_t ibv_get_device_guid(struct ibv_device *device); /** + * ibv_get_transport_type - Return device's network transport type + */ +static inline enum ibv_transport_type +ibv_get_transport_type(struct ibv_context *context) +{ + switch (context->node_type) { + case IBV_NODE_CA: + case IBV_NODE_SWITCH: + case IBV_NODE_ROUTER: + return IBV_TRANSPORT_IB; + case IBV_NODE_RNIC: + return IBV_TRANSPORT_IWARP; + default: + return IBV_TRANSPORT_UNKNOWN; + } +} + +/** + * ibv_get_node_type - Return device's node type + */ +static inline enum ibv_node_type +ibv_get_node_type(struct ibv_context *context) +{ + return context->node_type; +} + +/** * ibv_open_device - Initialize device for use */ struct ibv_context *ibv_open_device(struct ibv_device *device); Index: libibverbs/src/device.c =================================================================== --- libibverbs/src/device.c (revision 6570) +++ libibverbs/src/device.c (working copy) @@ -106,6 +106,23 @@ return htonll(guid); } +static enum ibv_node_type query_node_type(struct ibv_context *context) +{ + char node_desc[24]; + char node_str[24]; + int node_type; + + if (!context) + return IBV_NODE_UNKNOWN; + + if (ibv_read_sysfs_file(context->device->ibdev->path, 
"node_type", + node_desc, sizeof(node_desc)) < 0) + return IBV_NODE_UNKNOWN; + + sscanf(node_desc, "%d: %s\n", (int*)&node_type, node_str); + return (enum ibv_node_type) node_type; +} + struct ibv_context *ibv_open_device(struct ibv_device *device) { char *devpath; @@ -128,7 +145,7 @@ context->device = device; context->cmd_fd = cmd_fd; - + context->node_type = query_node_type(context); return context; err: From halr at voltaire.com Mon Apr 24 13:26:12 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Apr 2006 16:26:12 -0400 Subject: [openib-general] Re: [PATCH 2/4] opensm: remove unused osm_pkey_mgr_t object In-Reply-To: <20060423142620.15562.96611.stgit@sashak.voltaire.com> References: <20060423141935.15562.38762.stgit@sashak.voltaire.com> <20060423142620.15562.96611.stgit@sashak.voltaire.com> Message-ID: <1145910338.23359.71757.camel@hal.voltaire.com> On Sun, 2006-04-23 at 10:26, Sasha Khapyorsky wrote: > The structure osm_pkey_mgr_t is not used for pkey management - > clean it up. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied to trunk only. -- Hal From tom at opengridcomputing.com Mon Apr 24 13:54:58 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Apr 2006 15:54:58 -0500 Subject: [openib-general] Re: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: <444D350D.8090501@ichips.intel.com> References: <20060423075752.GX1792@mellanox.co.il> <444CFB6D.6030000@ichips.intel.com> <20060424163415.GZ1792@mellanox.co.il> <444D215A.4060201@ichips.intel.com> <20060424191019.GB20831@mellanox.co.il> <444D350D.8090501@ichips.intel.com> Message-ID: <1145912098.18808.45.camel@trinity.ogc.int> On Mon, 2006-04-24 at 13:29 -0700, Sean Hefty wrote: > Michael S. Tsirkin wrote: > > Not sure how do you mean. > > Listen without bind is equivalent to bind on any port, any address. > > So why do you need address family? > > The socket() call takes an address family as input. 
Don't you still need to > know if you're listening on TCPv4 or TCPv6? I think you do, and I think the types are AF_INET and AF_INET6. Right now, although the interfaces are sockaddr's they become sockaddr_in by the time you get to rdma_translate_ip. If we want to support IPv6 going forward then I think you'll want to separate support at the top and potentially have rdma_cm_id ops filled in the cm_id based on the address family specified when the rdma_cm_id was created. > > - Sean From sean.hefty at intel.com Mon Apr 24 13:49:45 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 13:49:45 -0700 Subject: [openib-general] [PATCH] RDMA CM: allow listen without prior binding to listen on any address In-Reply-To: <20060424191019.GB20831@mellanox.co.il> Message-ID: Will something like the following work for you for now? --- Allow calling rdma_listen() without calling rdma_bind_addr() beforehand. This will result in binding to any address / any port before listening.
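For comparison only, here is a minimal sketch in C of the plain-sockets behaviour that this mirrors. It assumes Linux TCP sockets, where listen() on a socket that was never bound makes the kernel pick the wildcard address and a free port; POSIX allows other systems to fail with EDESTADDRREQ instead. The helper name listen_unbound() is hypothetical and is not part of the RDMA CM.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: call listen() on a TCP socket that was never
 * bound and report the local address the kernel chose.
 * Returns 0 on success or -errno on failure.
 */
static int listen_unbound(struct sockaddr_in *local)
{
	socklen_t len = sizeof *local;
	int ret = 0;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -errno;

	memset(local, 0, sizeof *local);

	/* No bind() here: on Linux, listen() binds implicitly. */
	if (listen(fd, 1) < 0 ||
	    getsockname(fd, (struct sockaddr *) local, &len) < 0)
		ret = -errno;
	else
		assert(len == sizeof *local);	/* an AF_INET address came back */

	close(fd);
	return ret;
}
```

On a Linux host the address filled in by listen_unbound() should be INADDR_ANY with a non-zero, kernel-chosen port, which is the "any address / any port" result described above.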
Signed-off-by: Sean Hefty --- Index: cma.c =================================================================== --- cma.c (revision 6588) +++ cma.c (working copy) @@ -1002,12 +1002,27 @@ static void cma_listen_on_all(struct rdm mutex_unlock(&lock); } +static int cma_bind_any(struct rdma_cm_id *id, sa_family_t af) +{ + struct sockaddr_in addr_in; + + memset(&addr_in, 0, sizeof addr_in); + addr_in.sin_family = af; + return rdma_bind_addr(id, (struct sockaddr *) &addr_in); +} + int rdma_listen(struct rdma_cm_id *id, int backlog) { struct rdma_id_private *id_priv; int ret; id_priv = container_of(id, struct rdma_id_private, id); + if (id_priv->state == CMA_IDLE) { + ret = cma_bind_any(id, AF_INET); + if (ret) + return ret; + } + if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN)) return -EINVAL; @@ -1276,15 +1291,10 @@ err: static int cma_bind_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, struct sockaddr *dst_addr) { - struct sockaddr_in addr_in; - if (src_addr && src_addr->sa_family) return rdma_bind_addr(id, src_addr); - else { - memset(&addr_in, 0, sizeof addr_in); - addr_in.sin_family = dst_addr->sa_family; - return rdma_bind_addr(id, (struct sockaddr *) &addr_in); - } + else + return cma_bind_any(id, dst_addr->sa_family); } int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, From mshefty at ichips.intel.com Mon Apr 24 13:58:33 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 13:58:33 -0700 Subject: [openib-general] Re: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: <1145912098.18808.45.camel@trinity.ogc.int> References: <20060423075752.GX1792@mellanox.co.il> <444CFB6D.6030000@ichips.intel.com> <20060424163415.GZ1792@mellanox.co.il> <444D215A.4060201@ichips.intel.com> <20060424191019.GB20831@mellanox.co.il> <444D350D.8090501@ichips.intel.com> <1145912098.18808.45.camel@trinity.ogc.int> Message-ID: <444D3BF9.10702@ichips.intel.com> Tom Tucker wrote: > I think you do, and 
I think the types are AF_INET and AF_INET6. Right > now, although the interfaces are sockaddr's they become sockaddr_in by > the time you get to rdma_translate_ip. If we want to support IPv6 going > forward then I think you'll want to separate support at the top and > potentially have rdma_cm_id ops filled in the cm_id based on the address > family specified when the rdma_cm_id was created. I realize that the code assumes IPv4 in several places, but that should be a temporary limitation. I just haven't had time to add in IPv6 support yet. My original plan was that the sockaddr's should eventually be cast to the correct structure based on the sa_family type. But I don't think that this will be sufficient to handle all cases. Adding a pointer to a set of functions, or adding a set of function pointers to the rdma_cm_id may be the correct solution when IPv6 is fully supported. - Sean From mst at mellanox.co.il Mon Apr 24 14:02:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 25 Apr 2006 00:02:15 +0300 Subject: [openib-general] Re: Re: [PATCH] RDMA CM: assign port numbers when binding a cm_id to an address In-Reply-To: <1145912098.18808.45.camel@trinity.ogc.int> References: <20060423075752.GX1792@mellanox.co.il> <444CFB6D.6030000@ichips.intel.com> <20060424163415.GZ1792@mellanox.co.il> <444D215A.4060201@ichips.intel.com> <20060424191019.GB20831@mellanox.co.il> <444D350D.8090501@ichips.intel.com> <1145912098.18808.45.camel@trinity.ogc.int> Message-ID: <20060424210215.GC20831@mellanox.co.il> Quoting r. Tom Tucker : > If we want to support IPv6 going > forward One important special case is IPv6 tunneling over IPv4. I think this is actually how Java works by default, so I think we want to support this first of all. Full IPv6 is lower priority as far as I'm concerned. -- MST From mst at mellanox.co.il Mon Apr 24 14:19:11 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 25 Apr 2006 00:19:11 +0300 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <1145907165.18808.28.camel@trinity.ogc.int> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> Message-ID: <20060424211911.GD20831@mellanox.co.il> Quoting r. Tom Tucker : > Subject: Re: RFC: cma: need rdma_unbind > > On Mon, 2006-04-24 at 11:15 -0700, Sean Hefty wrote: > > Michael S. Tsirkin wrote: > > > Sean, it seems that an rdma_unbind API is necessary for SDP SO_REUSEADDR support. > > > This should only remove association between the port and the id, without > > > affecting the CM state. > > > > > > Reason: when SO_REUSEADDR is set, the port becomes usable immediately > > > after close call, but the QP must not be closed immediately - > > > rather it's closed after graceful close or after timeout. > > > > > > Thus I can not destroy the id, but the association with > > > port needs to be cancelled. > > > > I'm in the process of adding rdma_set_option and rdma_get_option calls. If > > SO_REUSEADDR support were added through these calls, can we come up with a > > solution that doesn't require adding an rdma_unbind call? > > There is no need for an unbind call -- the sockets code, for example, > doesn't work this way. Instead, it looks at a 'reuse' flag and decides > whether to allow a subsequent bind or use-port request based on the > value. I just tested this and this does not seem to be how it works: if the socket is in TIME_WAIT state, SO_REUSEADDR lets you bind another one to this port. If the socket is not in TIME_WAIT state, you still get EADDRINUSE. The socket FAQ seems to agree: http://www.unixguide.net/network/socketfaq/4.5.shtml And CMA does not know about socket state - that's managed by the ULP.
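For illustration only (this is not the test that was actually run): a minimal sketch in C, assuming Linux TCP sockets on the loopback interface, of the second case described above. Both sockets set SO_REUSEADDR, yet the second bind() is refused while the first socket is still listening. The helper names bind_reuse() and rebind_while_listening() are hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a TCP socket, set SO_REUSEADDR and bind it
 * to 127.0.0.1 on the given port (network byte order; 0 lets the kernel
 * choose).  Returns the fd on success or -errno on failure.
 */
static int bind_reuse(in_port_t port)
{
	struct sockaddr_in addr;
	int one = 1;
	int err;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -errno;

	memset(&addr, 0, sizeof addr);
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = port;

	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one) < 0 ||
	    bind(fd, (struct sockaddr *) &addr, sizeof addr) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}

/*
 * Bind and listen on a kernel-chosen port, then try to bind a second
 * SO_REUSEADDR socket to the same address and port.  Returns 0 if the
 * second bind succeeded, otherwise the (positive) errno it failed with;
 * setup failures are returned as negative errno values.
 */
static int rebind_while_listening(void)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof addr;
	int first, second, ret;

	first = bind_reuse(0);
	if (first < 0)
		return first;

	if (listen(first, 1) < 0 ||
	    getsockname(first, (struct sockaddr *) &addr, &len) < 0) {
		ret = -errno;
		close(first);
		return ret;
	}
	assert(addr.sin_port != 0);	/* kernel assigned a real port */

	second = bind_reuse(addr.sin_port);
	if (second >= 0) {
		close(second);
		ret = 0;
	} else {
		ret = -second;
	}
	close(first);
	return ret;
}
```

On a Linux host rebind_while_listening() is expected to return EADDRINUSE, since the first socket is in the LISTEN state rather than TIME_WAIT.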
-- MST From mshefty at ichips.intel.com Mon Apr 24 14:23:26 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 14:23:26 -0700 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <20060424211911.GD20831@mellanox.co.il> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> Message-ID: <444D41CE.1040207@ichips.intel.com> Michael S. Tsirkin wrote: > I just tested this and this does not seem to be how it works: > if the socket is in TIME_WAIT state, SO_REUSEADDR let's you bind another > one to this port. > If the socket is not in TIME_WAIT state, you still get EADDRINUSE. > > socketfaq seems to agree > http://www.unixguide.net/network/socketfaq/4.5.shtml > > And CMA does not know about socket state - that's managed by ULP. Would releasing the port from within rdma_disconnect() do what you need? - Sean From bos at pathscale.com Mon Apr 24 14:22:59 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:22:59 -0700 Subject: [openib-general] [PATCH 3 of 13] ipath - iterate over correct number of ports during reset In-Reply-To: Message-ID: <49f2286e0bdc0ee1cd23.1145913779@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff -r 1906950392f7 -r 49f2286e0bdc drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Wed Apr 19 15:24:36 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Apr 19 15:24:36 2006 -0700 @@ -1959,7 +1959,7 @@ int ipath_reset_device(int unit) } if (dd->ipath_pd) - for (i = 1; i < dd->ipath_portcnt; i++) { + for (i = 1; i < dd->ipath_cfgports; i++) { if (dd->ipath_pd[i] && dd->ipath_pd[i]->port_cnt) { ipath_dbg("unit %u port %d is in use " "(PID %u cmd %s), can't reset\n", From bos at pathscale.com Mon Apr 24 14:22:57 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:22:57 -0700 Subject: 
[openib-general] [PATCH 1 of 13] ipath - fix race with exposing reset file In-Reply-To: Message-ID: <61819d2519e0603795bf.1145913777@eng-12.pathscale.com> We were accidentally exposing the "reset" sysfs file more than once per device. Signed-off-by: Bryan O'Sullivan diff -r 8cc21848a9bb -r 61819d2519e0 drivers/infiniband/hw/ipath/ipath_diag.c --- a/drivers/infiniband/hw/ipath/ipath_diag.c Wed Apr 19 15:24:35 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_diag.c Wed Apr 19 15:24:36 2006 -0700 @@ -277,12 +277,13 @@ static int ipath_diag_open(struct inode bail: spin_unlock_irqrestore(&ipath_devs_lock, flags); - mutex_unlock(&ipath_mutex); /* Only expose a way to reset the device if we make it into diag mode. */ if (ret == 0) ipath_expose_reset(&dd->pcidev->dev); + + mutex_unlock(&ipath_mutex); return ret; } diff -r 8cc21848a9bb -r 61819d2519e0 drivers/infiniband/hw/ipath/ipath_sysfs.c --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c Wed Apr 19 15:24:35 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c Wed Apr 19 15:24:36 2006 -0700 @@ -711,10 +711,22 @@ static struct attribute_group dev_attr_g * enters diag mode. A device reset is quite likely to crash the * machine entirely, so we don't want to normally make it * available. + * + * Called with ipath_mutex held. 
*/ int ipath_expose_reset(struct device *dev) { - return device_create_file(dev, &dev_attr_reset); + static int exposed; + int ret; + + if (!exposed) { + ret = device_create_file(dev, &dev_attr_reset); + exposed = 1; + } + else + ret = 0; + + return ret; } int ipath_driver_create_group(struct device_driver *drv) From bos at pathscale.com Mon Apr 24 14:23:00 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:23:00 -0700 Subject: [openib-general] [PATCH 4 of 13] ipath - change handling of PIO buffers In-Reply-To: Message-ID: <8e724d49e74bc1155f4e.1145913780@eng-12.pathscale.com> Different ipath hardware types have different numbers of buffers available, so we decide on the counts ourselves unless we are specifically overridden with a module parameter. Signed-off-by: Bryan O'Sullivan diff -r 49f2286e0bdc -r 8e724d49e74b drivers/infiniband/hw/ipath/ipath_init_chip.c --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c Wed Apr 19 15:24:36 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c Wed Apr 19 15:24:36 2006 -0700 @@ -53,13 +53,19 @@ MODULE_PARM_DESC(cfgports, "Set max numb /* * Number of buffers reserved for driver (layered drivers and SMA - * send). Reserved at end of buffer list. + * send). Reserved at end of buffer list. Initialized based on + * number of PIO buffers if not set via module interface. + * The problem with this is that it's global, but we'll use different + * numbers for different chip types. So the default value is not + * very useful. I've redefined it for the 1.3 release so that it's + * zero unless set by the user to something else, in which case we + * try to respect it. 
*/ -static ushort ipath_kpiobufs = 32; +static ushort ipath_kpiobufs; static int ipath_set_kpiobufs(const char *val, struct kernel_param *kp); -module_param_call(kpiobufs, ipath_set_kpiobufs, param_get_uint, +module_param_call(kpiobufs, ipath_set_kpiobufs, param_get_ushort, &ipath_kpiobufs, S_IWUSR | S_IRUGO); MODULE_PARM_DESC(kpiobufs, "Set number of PIO buffers for driver"); @@ -531,8 +537,11 @@ static int init_housekeeping(struct ipat * Don't clear ipath_flags as 8bit mode was set before * entering this func. However, we do set the linkstate to * unknown, so we can watch for a transition. - */ - dd->ipath_flags |= IPATH_LINKUNK; + * PRESENT is set because we want register reads to work, + * and the kernel infrastructure saw it in config space; + * We clear it if we have failures. + */ + dd->ipath_flags |= IPATH_LINKUNK | IPATH_PRESENT; dd->ipath_flags &= ~(IPATH_LINKACTIVE | IPATH_LINKARMED | IPATH_LINKDOWN | IPATH_LINKINIT); @@ -560,6 +569,7 @@ static int init_housekeeping(struct ipat || (dd->ipath_uregbase & 0xffffffff) == 0xffffffff) { ipath_dev_err(dd, "Register read failures from chip, " "giving up initialization\n"); + dd->ipath_flags &= ~IPATH_PRESENT; ret = -ENODEV; goto done; } @@ -682,16 +692,14 @@ int ipath_init_chip(struct ipath_devdata */ dd->ipath_pioavregs = ALIGN(val, sizeof(u64) * BITS_PER_BYTE / 2) / (sizeof(u64) * BITS_PER_BYTE / 2); - if (!ipath_kpiobufs) /* have to have at least 1, for SMA */ - kpiobufs = ipath_kpiobufs = 1; - else if ((dd->ipath_piobcnt2k + dd->ipath_piobcnt4k) < - (dd->ipath_cfgports * IPATH_MIN_USER_PORT_BUFCNT)) { - dev_info(&dd->pcidev->dev, "Too few PIO buffers (%u) " - "for %u ports to have %u each!\n", - dd->ipath_piobcnt2k + dd->ipath_piobcnt4k, - dd->ipath_cfgports, IPATH_MIN_USER_PORT_BUFCNT); - kpiobufs = 1; /* reserve just the minimum for SMA/ether */ - } else + if (ipath_kpiobufs == 0) { + /* not set by user, or set explictly to default */ + if ((dd->ipath_piobcnt2k + dd->ipath_piobcnt4k) > 128) + kpiobufs = 
32; + else + kpiobufs = 16; + } + else kpiobufs = ipath_kpiobufs; if (kpiobufs > From bos at pathscale.com Mon Apr 24 14:23:01 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:23:01 -0700 Subject: [openib-general] [PATCH 5 of 13] ipath - use proper address translation routine In-Reply-To: Message-ID: <1ab168913f0fea5d18b4.1145913781@eng-12.pathscale.com> Move away from an obsolete, unportable routine for translating physical addresses. Signed-off-by: Bryan O'Sullivan diff -r 8e724d49e74b -r 1ab168913f0f drivers/infiniband/hw/ipath/ipath_keys.c --- a/drivers/infiniband/hw/ipath/ipath_keys.c Wed Apr 19 15:24:36 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_keys.c Wed Apr 19 15:24:36 2006 -0700 @@ -125,12 +125,12 @@ int ipath_lkey_ok(struct ipath_lkey_tabl /* * We use LKEY == zero to mean a physical kmalloc() address. - * This is a bit of a hack since we rely on dma_map_single() - * being reversible by calling bus_to_virt(). + * This is a bit of a hack since we rely on being able to + * reverse the mapping by calling phys_to_virt(). */ if (sge->lkey == 0) { isge->mr = NULL; - isge->vaddr = bus_to_virt(sge->addr); + isge->vaddr = phys_to_virt(sge->addr); isge->length = sge->length; isge->sge_length = sge->length; ret = 1; From bos at pathscale.com Mon Apr 24 14:23:02 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:23:02 -0700 Subject: [openib-general] [PATCH 6 of 13] ipath - fix verbs registration In-Reply-To: Message-ID: <3ff1e5ae1c6078b906fd.1145913782@eng-12.pathscale.com> Remember when the verbs layer unregisters from the lower-level code. Signed-off-by: Bryan O'Sullivan diff -r 1ab168913f0f -r 3ff1e5ae1c60 drivers/infiniband/hw/ipath/ipath_layer.c --- a/drivers/infiniband/hw/ipath/ipath_layer.c Wed Apr 19 15:24:36 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_layer.c Wed Apr 19 15:24:36 2006 -0700 @@ -46,13 +46,15 @@ /* Acquire before ipath_devs_lock. 
*/ static DEFINE_MUTEX(ipath_layer_mutex); +static int ipath_verbs_registered; + u16 ipath_layer_rcv_opcode; + static int (*layer_intr)(void *, u32); static int (*layer_rcv)(void *, void *, struct sk_buff *); static int (*layer_rcv_lid)(void *, void *); static int (*verbs_piobufavail)(void *); static void (*verbs_rcv)(void *, void *, void *, u32); -static int ipath_verbs_registered; static void *(*layer_add_one)(int, struct ipath_devdata *); static void (*layer_remove_one)(void *); @@ -585,6 +587,8 @@ void ipath_verbs_unregister(void) verbs_piobufavail = NULL; verbs_rcv = NULL; verbs_timer_cb = NULL; + + ipath_verbs_registered = 0; mutex_unlock(&ipath_layer_mutex); } From bos at pathscale.com Mon Apr 24 14:23:03 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:23:03 -0700 Subject: [openib-general] [PATCH 7 of 13] ipath - prevent hardware from being accessed during reset In-Reply-To: Message-ID: The reset code now turns off the PRESENT flag during a reset, so that other code won't attempt to access a device that's in mid-reset. Signed-off-by: Bryan O'Sullivan diff -r 3ff1e5ae1c60 -r ee2f95e99c27 drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Wed Apr 19 15:24:36 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Wed Apr 19 15:24:36 2006 -0700 @@ -719,11 +719,24 @@ irqreturn_t ipath_intr(int irq, void *da irqreturn_t ipath_intr(int irq, void *data, struct pt_regs *regs) { struct ipath_devdata *dd = data; - u32 istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); + u32 istat; ipath_err_t estat = 0; static unsigned unexpected = 0; irqreturn_t ret; + if(!(dd->ipath_flags & IPATH_PRESENT)) { + /* this is mostly so we don't try to touch the chip while + * it is being reset */ + /* + * This return value is perhaps odd, but we do not want the + * interrupt core code to remove our interrupt handler + * because we don't appear to be handling an interrupt + * during a chip reset. 
+ */ + return IRQ_HANDLED; + } + + istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); if (unlikely(!istat)) { ipath_stats.sps_nullintr++; ret = IRQ_NONE; /* not our interrupt, or already handled */ diff -r 3ff1e5ae1c60 -r ee2f95e99c27 drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Wed Apr 19 15:24:36 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Wed Apr 19 15:24:36 2006 -0700 @@ -731,7 +731,7 @@ static inline u32 ipath_read_ureg32(cons static inline u32 ipath_read_ureg32(const struct ipath_devdata *dd, ipath_ureg regno, int port) { - if (!dd->ipath_kregbase) + if (!dd->ipath_kregbase || !(dd->ipath_flags & IPATH_PRESENT)) return 0; return readl(regno + (u64 __iomem *) @@ -762,7 +762,7 @@ static inline u32 ipath_read_kreg32(cons static inline u32 ipath_read_kreg32(const struct ipath_devdata *dd, ipath_kreg regno) { - if (!dd->ipath_kregbase) + if (!dd->ipath_kregbase || !(dd->ipath_flags & IPATH_PRESENT)) return -1; return readl((u32 __iomem *) & dd->ipath_kregbase[regno]); } @@ -770,7 +770,7 @@ static inline u64 ipath_read_kreg64(cons static inline u64 ipath_read_kreg64(const struct ipath_devdata *dd, ipath_kreg regno) { - if (!dd->ipath_kregbase) + if (!dd->ipath_kregbase || !(dd->ipath_flags & IPATH_PRESENT)) return -1; return readq(&dd->ipath_kregbase[regno]); @@ -786,7 +786,7 @@ static inline u64 ipath_read_creg(const static inline u64 ipath_read_creg(const struct ipath_devdata *dd, ipath_sreg regno) { - if (!dd->ipath_kregbase) + if (!dd->ipath_kregbase || !(dd->ipath_flags & IPATH_PRESENT)) return 0; return readq(regno + (u64 __iomem *) @@ -797,7 +797,7 @@ static inline u32 ipath_read_creg32(cons static inline u32 ipath_read_creg32(const struct ipath_devdata *dd, ipath_sreg regno) { - if (!dd->ipath_kregbase) + if (!dd->ipath_kregbase || !(dd->ipath_flags & IPATH_PRESENT)) return 0; return readl(regno + (u64 __iomem *) (dd->ipath_cregbase + diff -r 3ff1e5ae1c60 -r ee2f95e99c27 
drivers/infiniband/hw/ipath/ipath_pe800.c --- a/drivers/infiniband/hw/ipath/ipath_pe800.c Wed Apr 19 15:24:36 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_pe800.c Wed Apr 19 15:24:36 2006 -0700 @@ -972,6 +972,8 @@ static int ipath_setup_pe_reset(struct i /* Use ERROR so it shows up in logs, etc. */ ipath_dev_err(dd, "Resetting PE-800 unit %u\n", dd->ipath_unit); + /* keep chip from being accessed in a few places */ + dd->ipath_flags &= ~(IPATH_INITTED|IPATH_PRESENT); val = dd->ipath_control | INFINIPATH_C_RESET; ipath_write_kreg(dd, dd->ipath_kregs->kr_control, val); mb(); @@ -997,6 +999,8 @@ static int ipath_setup_pe_reset(struct i if ((r = pci_enable_device(dd->pcidev))) ipath_dev_err(dd, "pci_enable_device failed after " "reset: %d\n", r); + /* whether it worked or not, mark as present, again */ + dd->ipath_flags |= IPATH_PRESENT; val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_revision); if (val == dd->ipath_revision) { ipath_cdbg(VERBOSE, "Got matching revision " From bos at pathscale.com Mon Apr 24 14:23:08 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:23:08 -0700 Subject: [openib-general] [PATCH 12 of 13] ipath - fix label name in interrupt handler In-Reply-To: Message-ID: Names that are the opposite of their intended meanings are not so helpful. 
Signed-off-by: Bryan O'Sullivan diff -r f23abcaaea84 -r e3f1bfd7ce46 drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Mon Apr 24 14:21:04 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Mon Apr 24 14:21:04 2006 -0700 @@ -665,14 +665,14 @@ static void handle_layer_pioavail(struct ret = __ipath_layer_intr(dd, IPATH_LAYER_INT_SEND_CONTINUE); if (ret > 0) - goto clear; + goto set; ret = __ipath_verbs_piobufavail(dd); if (ret > 0) - goto clear; + goto set; return; -clear: +set: set_bit(IPATH_S_PIOINTBUFAVAIL, &dd->ipath_sendctrl); ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, dd->ipath_sendctrl); From bos at pathscale.com Mon Apr 24 14:23:09 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:23:09 -0700 Subject: [openib-general] [PATCH 13 of 13] ipath - tidy up white space in a few files In-Reply-To: Message-ID: <895650567032e5b48153.1145913789@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff -r e3f1bfd7ce46 -r 895650567032 drivers/infiniband/hw/ipath/ipath_debug.h --- a/drivers/infiniband/hw/ipath/ipath_debug.h Mon Apr 24 14:21:04 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_debug.h Mon Apr 24 14:21:04 2006 -0700 @@ -60,11 +60,11 @@ #define __IPATH_KERNEL_SEND 0x2000 /* use kernel mode send */ #define __IPATH_EPKTDBG 0x4000 /* print ethernet packet data */ #define __IPATH_SMADBG 0x8000 /* sma packet debug */ -#define __IPATH_IPATHDBG 0x10000 /* Ethernet (IPATH) general debug on */ -#define __IPATH_IPATHWARN 0x20000 /* Ethernet (IPATH) warnings on */ -#define __IPATH_IPATHERR 0x40000 /* Ethernet (IPATH) errors on */ -#define __IPATH_IPATHPD 0x80000 /* Ethernet (IPATH) packet dump on */ -#define __IPATH_IPATHTABLE 0x100000 /* Ethernet (IPATH) table dump on */ +#define __IPATH_IPATHDBG 0x10000 /* Ethernet (IPATH) gen debug */ +#define __IPATH_IPATHWARN 0x20000 /* Ethernet (IPATH) warnings */ +#define __IPATH_IPATHERR 0x40000 /* Ethernet (IPATH) errors */ +#define 
__IPATH_IPATHPD 0x80000 /* Ethernet (IPATH) packet dump */ +#define __IPATH_IPATHTABLE 0x100000 /* Ethernet (IPATH) table dump */ #else /* _IPATH_DEBUGGING */ @@ -79,11 +79,12 @@ #define __IPATH_TRSAMPLE 0x0 /* generate trace buffer sample entries */ #define __IPATH_VERBDBG 0x0 /* very verbose debug */ #define __IPATH_PKTDBG 0x0 /* print packet data */ -#define __IPATH_PROCDBG 0x0 /* print process startup (init)/exit messages */ +#define __IPATH_PROCDBG 0x0 /* process startup (init)/exit messages */ /* print mmap/nopage stuff, not using VDBG any more */ #define __IPATH_MMDBG 0x0 #define __IPATH_EPKTDBG 0x0 /* print ethernet packet data */ -#define __IPATH_SMADBG 0x0 /* print process startup (init)/exit messages */#define __IPATH_IPATHDBG 0x0 /* Ethernet (IPATH) table dump on */ +#define __IPATH_SMADBG 0x0 /* process startup (init)/exit messages */ +#define __IPATH_IPATHDBG 0x0 /* Ethernet (IPATH) table dump on */ #define __IPATH_IPATHWARN 0x0 /* Ethernet (IPATH) warnings on */ #define __IPATH_IPATHERR 0x0 /* Ethernet (IPATH) errors on */ #define __IPATH_IPATHPD 0x0 /* Ethernet (IPATH) packet dump on */ diff -r e3f1bfd7ce46 -r 895650567032 drivers/infiniband/hw/ipath/ipath_registers.h --- a/drivers/infiniband/hw/ipath/ipath_registers.h Mon Apr 24 14:21:04 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_registers.h Mon Apr 24 14:21:04 2006 -0700 @@ -34,8 +34,9 @@ #define _IPATH_REGISTERS_H /* - * This file should only be included by kernel source, and by the diags. - * It defines the registers, and their contents, for the InfiniPath HT-400 chip + * This file should only be included by kernel source, and by the diags. It + * defines the registers, and their contents, for the InfiniPath HT-400 + * chip. 
*/ /* @@ -156,8 +157,10 @@ #define INFINIPATH_IBCC_FLOWCTRLWATERMARK_SHIFT 8 #define INFINIPATH_IBCC_LINKINITCMD_MASK 0x3ULL #define INFINIPATH_IBCC_LINKINITCMD_DISABLE 1 -#define INFINIPATH_IBCC_LINKINITCMD_POLL 2 /* cycle through TS1/TS2 till OK */ -#define INFINIPATH_IBCC_LINKINITCMD_SLEEP 3 /* wait for TS1, then go on */ +/* cycle through TS1/TS2 till OK */ +#define INFINIPATH_IBCC_LINKINITCMD_POLL 2 +/* wait for TS1, then go on */ +#define INFINIPATH_IBCC_LINKINITCMD_SLEEP 3 #define INFINIPATH_IBCC_LINKINITCMD_SHIFT 16 #define INFINIPATH_IBCC_LINKCMD_MASK 0x3ULL #define INFINIPATH_IBCC_LINKCMD_INIT 1 /* move to 0x11 */ @@ -182,7 +185,8 @@ #define INFINIPATH_IBCS_LINKSTATE_SHIFT 4 #define INFINIPATH_IBCS_TXREADY 0x40000000 #define INFINIPATH_IBCS_TXCREDITOK 0x80000000 -/* link training states (shift by INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) */ +/* link training states (shift by + INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) */ #define INFINIPATH_IBCS_LT_STATE_DISABLED 0x00 #define INFINIPATH_IBCS_LT_STATE_LINKUP 0x01 #define INFINIPATH_IBCS_LT_STATE_POLLACTIVE 0x02 @@ -267,10 +271,12 @@ /* kr_serdesconfig0 bits */ #define INFINIPATH_SERDC0_RESET_MASK 0xfULL /* overal reset bits */ #define INFINIPATH_SERDC0_RESET_PLL 0x10000000ULL /* pll reset */ -#define INFINIPATH_SERDC0_TXIDLE 0xF000ULL /* tx idle enables (per lane) */ -#define INFINIPATH_SERDC0_RXDETECT_EN 0xF0000ULL /* rx detect enables (per lane) */ -#define INFINIPATH_SERDC0_L1PWR_DN 0xF0ULL /* L1 Power down; use with RXDETECT, - Otherwise not used on IB side */ +/* tx idle enables (per lane) */ +#define INFINIPATH_SERDC0_TXIDLE 0xF000ULL +/* rx detect enables (per lane) */ +#define INFINIPATH_SERDC0_RXDETECT_EN 0xF0000ULL +/* L1 Power down; use with RXDETECT, Otherwise not used on IB side */ +#define INFINIPATH_SERDC0_L1PWR_DN 0xF0ULL /* kr_xgxsconfig bits */ #define INFINIPATH_XGXS_RESET 0x7ULL @@ -390,12 +396,13 @@ struct ipath_kregs { ipath_kreg kr_txintmemsize; ipath_kreg kr_xgxsconfig; ipath_kreg 
kr_ibpllcfg; - /* use these two (and the following N ports) only with ipath_k*_kreg64_port(); - * not *kreg64() */ + /* use these two (and the following N ports) only with + * ipath_k*_kreg64_port(); not *kreg64() */ ipath_kreg kr_rcvhdraddr; ipath_kreg kr_rcvhdrtailaddr; - /* remaining registers are not present on all types of infinipath chips */ + /* remaining registers are not present on all types of infinipath + chips */ ipath_kreg kr_rcvpktledcnt; ipath_kreg kr_pcierbuftestreg0; ipath_kreg kr_pcierbuftestreg1; diff -r e3f1bfd7ce46 -r 895650567032 drivers/infiniband/hw/ipath/ipath_ud.c --- a/drivers/infiniband/hw/ipath/ipath_ud.c Mon Apr 24 14:21:04 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ud.c Mon Apr 24 14:21:04 2006 -0700 @@ -46,8 +46,10 @@ * This is called from ipath_post_ud_send() to forward a WQE addressed * to the same HCA. */ -static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, - u32 length, struct ib_send_wr *wr, struct ib_wc *wc) +static void ipath_ud_loopback(struct ipath_qp *sqp, + struct ipath_sge_state *ss, + u32 length, struct ib_send_wr *wr, + struct ib_wc *wc) { struct ipath_ibdev *dev = to_idev(sqp->ibqp.device); struct ipath_qp *qp; From bos at pathscale.com Mon Apr 24 14:23:05 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:23:05 -0700 Subject: [openib-general] [PATCH 9 of 13] ipath - simplify RC send posting In-Reply-To: Message-ID: <4eabd5fc05bbd962e64e.1145913785@eng-12.pathscale.com> Remove some unnecessarily complicated tests. Signed-off-by: Bryan O'Sullivan diff -r fafcc38877ad -r 4eabd5fc05bb drivers/infiniband/hw/ipath/ipath_ruc.c --- a/drivers/infiniband/hw/ipath/ipath_ruc.c Mon Apr 24 14:21:04 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c Mon Apr 24 14:21:04 2006 -0700 @@ -531,19 +531,12 @@ int ipath_post_rc_send(struct ipath_qp * } wqe->wr.num_sge = j; qp->s_head = next; - /* - * Wake up the send tasklet if the QP is not waiting - * for an RNR timeout. 
- */ - next = qp->s_rnr_timeout; spin_unlock_irqrestore(&qp->s_lock, flags); - if (next == 0) { - if (qp->ibqp.qp_type == IB_QPT_UC) - ipath_do_uc_send((unsigned long) qp); - else - ipath_do_rc_send((unsigned long) qp); - } + if (qp->ibqp.qp_type == IB_QPT_UC) + ipath_do_uc_send((unsigned long) qp); + else + ipath_do_rc_send((unsigned long) qp); ret = 0; From bos at pathscale.com Mon Apr 24 14:23:04 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:23:04 -0700 Subject: [openib-general] [PATCH 8 of 13] ipath - fix a number of RC protocol bugs In-Reply-To: Message-ID: This change fixes a number of RC protocol bugs: 1. ipath_init_restart() could be called when the QP is already on the timeout list, thus triggering a bad BUG_ON. 2. If a RDMA read was received on a QP without remote read access, the s_lock spin lock was reentered. 3. If a sequence NAK was received for a PSN for the middle of a pending operation, the code to compute which operation to restart had a bug so that the wrong opcode/PSN was resent. This caused the RC connection to go into the error state. 4. If a RC connection was configured for shared receive queues (SRQ), the limit sequence number was not being handled correctly when RDMA reads, writes, or atomic operations were performed, thus causing the RC connection to hang. 
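Items 3 and 4 above both depend on small pieces of sequence-number arithmetic, so a minimal sketch may help when reading the diff. The names psn_cmp24 and lsn_advance are hypothetical helpers written for illustration only. They are not the driver's ipath_cmp24() or any function in this patch. The sketch assumes PSNs are 24-bit values that wrap around, and that an all-ones s_lsn is a sentinel that must not be incremented (the patch guards the increment with "qp->s_lsn != (u32) -1", but does not state what the sentinel means).

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the driver's ipath_cmp24()): compare two
 * 24-bit packet sequence numbers modulo 2^24. The result is negative,
 * zero, or positive when a is before, equal to, or after b.
 */
static int32_t psn_cmp24(uint32_t a, uint32_t b)
{
	/*
	 * Shift the difference left by 8 so that bit 23 of the 24-bit
	 * difference lands in the sign bit of a 32-bit word. The
	 * subtraction and shift are done unsigned to avoid undefined
	 * behavior; the final conversion assumes two's complement.
	 */
	uint32_t d = (a - b) << 8;

	return (int32_t)d;
}

/*
 * Hypothetical helper: advance a limit sequence number, leaving the
 * all-ones value untouched (assumed here to be a sentinel).
 */
static uint32_t lsn_advance(uint32_t lsn)
{
	return lsn == UINT32_MAX ? lsn : lsn + 1;
}
```

With a comparison of this kind, a PSN of 0x000001 sorts after 0xFFFFFF, which is the ordering the restart search in reset_psn() relies on when it walks the send queue looking for the work request that contains a given PSN.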
Signed-off-by: Ralph Campbell Signed-off-by: Bryan O'Sullivan diff -r ee2f95e99c27 -r fafcc38877ad drivers/infiniband/hw/ipath/ipath_rc.c --- a/drivers/infiniband/hw/ipath/ipath_rc.c Wed Apr 19 15:24:36 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_rc.c Mon Apr 24 14:21:04 2006 -0700 @@ -57,9 +57,8 @@ static void ipath_init_restart(struct ip qp->s_len = wqe->length - len; dev = to_idev(qp->ibqp.device); spin_lock(&dev->pending_lock); - if (qp->timerwait.next == LIST_POISON1) - list_add_tail(&qp->timerwait, - &dev->pending[dev->pending_index]); + BUG_ON(qp->timerwait.next != LIST_POISON1); + list_add_tail(&qp->timerwait, &dev->pending[dev->pending_index]); spin_unlock(&dev->pending_lock); } @@ -135,7 +134,8 @@ static inline u32 ipath_make_rc_ack(stru */ qp->r_state = OP(RDMA_READ_RESPONSE_LAST); qp->s_ack_state = OP(ACKNOWLEDGE); - return 0; + bth0 = 0; + goto bail; case OP(COMPARE_SWAP): case OP(FETCH_ADD): @@ -143,7 +143,7 @@ static inline u32 ipath_make_rc_ack(stru len = 0; qp->r_state = OP(SEND_LAST); qp->s_ack_state = OP(ACKNOWLEDGE); - bth0 = IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24; + bth0 = OP(ATOMIC_ACKNOWLEDGE) << 24; ohdr->u.at.aeth = ipath_compute_aeth(qp); ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->s_ack_atomic); hwords += sizeof(ohdr->u.at) / 4; @@ -162,6 +162,7 @@ static inline u32 ipath_make_rc_ack(stru qp->s_cur_sge = ss; qp->s_cur_size = len; +bail: return bth0; } @@ -257,7 +258,7 @@ static inline int ipath_make_rc_req(stru break; case IB_WR_RDMA_WRITE: - if (newreq) + if (newreq && qp->s_lsn != (u32) -1) qp->s_lsn++; /* FALLTHROUGH */ case IB_WR_RDMA_WRITE_WITH_IMM: @@ -283,8 +284,7 @@ static inline int ipath_make_rc_req(stru else { qp->s_state = OP(RDMA_WRITE_ONLY_WITH_IMMEDIATE); - /* Immediate data comes - * after RETH */ + /* Immediate data comes after RETH */ ohdr->u.rc.imm_data = wqe->wr.imm_data; hwords += 1; if (wqe->wr.send_flags & IB_SEND_SOLICITED) @@ -304,7 +304,8 @@ static inline int ipath_make_rc_req(stru qp->s_state = 
OP(RDMA_READ_REQUEST); hwords += sizeof(ohdr->u.rc.reth) / 4; if (newreq) { - qp->s_lsn++; + if (qp->s_lsn != (u32) -1) + qp->s_lsn++; /* * Adjust s_next_psn to count the * expected number of responses. @@ -335,7 +336,8 @@ static inline int ipath_make_rc_req(stru wqe->wr.wr.atomic.compare_add); hwords += sizeof(struct ib_atomic_eth) / 4; if (newreq) { - qp->s_lsn++; + if (qp->s_lsn != (u32) -1) + qp->s_lsn++; wqe->lpsn = wqe->psn; } if (++qp->s_cur == qp->s_size) @@ -355,6 +357,11 @@ static inline int ipath_make_rc_req(stru bth2 |= qp->s_psn++ & IPS_PSN_MASK; if ((int)(qp->s_psn - qp->s_next_psn) > 0) qp->s_next_psn = qp->s_psn; + /* + * Put the QP on the pending list so lost ACKs will cause + * a retry. More than one request can be pending so the + * QP may already be on the dev->pending list. + */ spin_lock(&dev->pending_lock); if (qp->timerwait.next == LIST_POISON1) list_add_tail(&qp->timerwait, @@ -364,8 +371,8 @@ static inline int ipath_make_rc_req(stru case OP(RDMA_READ_RESPONSE_FIRST): /* - * This case can only happen if a send is restarted. See - * ipath_restart_rc(). + * This case can only happen if a send is restarted. + * See ipath_restart_rc(). */ ipath_init_restart(qp, wqe); /* FALLTHROUGH */ @@ -496,29 +503,37 @@ done: return 0; } -static inline void ipath_make_rc_grh(struct ipath_qp *qp, - struct ib_global_route *grh, - u32 nwords) +/** + * ipath_make_rc_grh - construct a GRH header + * @dev: a pointer to the ipath device + * @hdr: a pointer to the GRH header being constructed + * @grh: the global route address to send to + * @hwords: the number of 32 bit words of header being sent + * @nwords: the number of 32 bit words of data being sent + * + * Return the size of the header in 32 bit words. + */ +static u32 ipath_make_rc_grh(struct ipath_ibdev *dev, + struct ib_grh *hdr, + struct ib_global_route *grh, + u32 hwords, + u32 nwords) { - struct ipath_ibdev *dev = to_idev(qp->ibqp.device); - - /* GRH header size in 32-bit words. 
*/ - qp->s_hdrwords += 10; - qp->s_hdr.u.l.grh.version_tclass_flow = + hdr->version_tclass_flow = cpu_to_be32((6 << 28) | (grh->traffic_class << 20) | grh->flow_label); - qp->s_hdr.u.l.grh.paylen = - cpu_to_be16(((qp->s_hdrwords - 12) + nwords + - SIZE_OF_CRC) << 2); + hdr->paylen = cpu_to_be16((hwords - 2 + nwords + SIZE_OF_CRC) << 2); /* next_hdr is defined by C8-7 in ch. 8.4.1 */ - qp->s_hdr.u.l.grh.next_hdr = 0x1B; - qp->s_hdr.u.l.grh.hop_limit = grh->hop_limit; + hdr->next_hdr = 0x1B; + hdr->hop_limit = grh->hop_limit; /* The SGID is 32-bit aligned. */ - qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; - qp->s_hdr.u.l.grh.sgid.global.interface_id = - ipath_layer_get_guid(dev->dd); - qp->s_hdr.u.l.grh.dgid = grh->dgid; + hdr->sgid.global.subnet_prefix = dev->gid_prefix; + hdr->sgid.global.interface_id = ipath_layer_get_guid(dev->dd); + hdr->dgid = grh->dgid; + + /* GRH header size in 32-bit words. */ + return sizeof(struct ib_grh) / sizeof(u32); } /** @@ -569,15 +584,6 @@ again: * If no PIO bufs are available, return. An interrupt will * call ipath_ib_piobufavail() when one is available. */ - _VERBS_INFO("h %u %p\n", qp->s_hdrwords, &qp->s_hdr); - _VERBS_INFO("d %u %p %u %p %u %u %u %u\n", qp->s_cur_size, - qp->s_cur_sge->sg_list, - qp->s_cur_sge->num_sge, - qp->s_cur_sge->sge.vaddr, - qp->s_cur_sge->sge.sge_length, - qp->s_cur_sge->sge.length, - qp->s_cur_sge->sge.m, - qp->s_cur_sge->sge.n); if (ipath_verbs_send(dev->dd, qp->s_hdrwords, (u32 *) &qp->s_hdr, qp->s_cur_size, qp->s_cur_sge)) { @@ -599,8 +605,16 @@ again: if (qp->s_ack_state != OP(ACKNOWLEDGE) && (bth0 = ipath_make_rc_ack(qp, ohdr, pmtu)) != 0) bth2 = qp->s_ack_psn++ & IPS_PSN_MASK; - else if (!ipath_make_rc_req(qp, ohdr, pmtu, &bth0, &bth2)) - goto done; + else if (!ipath_make_rc_req(qp, ohdr, pmtu, &bth0, &bth2)) { + /* + * Clear the busy bit before unlocking to avoid races with + * adding new work queue items and then failing to process + * them. 
+ */ + clear_bit(IPATH_S_BUSY, &qp->s_flags); + spin_unlock_irqrestore(&qp->s_lock, flags); + goto bail; + } spin_unlock_irqrestore(&qp->s_lock, flags); @@ -609,7 +623,9 @@ again: nwords = (qp->s_cur_size + extra_bytes) >> 2; lrh0 = IPS_LRH_BTH; if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { - ipath_make_rc_grh(qp, &qp->remote_ah_attr.grh, nwords); + qp->s_hdrwords += ipath_make_rc_grh(dev, &qp->s_hdr.u.l.grh, + &qp->remote_ah_attr.grh, + qp->s_hdrwords, nwords); lrh0 = IPS_LRH_GRH; } lrh0 |= qp->remote_ah_attr.sl << 4; @@ -627,8 +643,6 @@ again: /* Check for more work to do. */ goto again; -done: - spin_unlock_irqrestore(&qp->s_lock, flags); clear: clear_bit(IPATH_S_BUSY, &qp->s_flags); bail: @@ -640,32 +654,35 @@ static void send_rc_ack(struct ipath_qp struct ipath_ibdev *dev = to_idev(qp->ibqp.device); u16 lrh0; u32 bth0; + u32 hwords; + struct ipath_ib_header hdr; struct ipath_other_headers *ohdr; /* Construct the header. */ - ohdr = &qp->s_hdr.u.oth; + ohdr = &hdr.u.oth; lrh0 = IPS_LRH_BTH; /* header size in 32-bit words LRH+BTH+AETH = (8+12+4)/4. 
*/ - qp->s_hdrwords = 6; + hwords = 6; if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { - ipath_make_rc_grh(qp, &qp->remote_ah_attr.grh, 0); + hwords += ipath_make_rc_grh(dev, &hdr.u.l.grh, + &qp->remote_ah_attr.grh, + hwords, 0); ohdr = &qp->s_hdr.u.l.oth; lrh0 = IPS_LRH_GRH; } bth0 = ipath_layer_get_pkey(dev->dd, qp->s_pkey_index); ohdr->u.aeth = ipath_compute_aeth(qp); if (qp->s_ack_state >= OP(COMPARE_SWAP)) { - bth0 |= IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24; + bth0 |= OP(ATOMIC_ACKNOWLEDGE) << 24; ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->s_ack_atomic); - qp->s_hdrwords += sizeof(ohdr->u.at.atomic_ack_eth) / 4; - } - else + hwords += sizeof(ohdr->u.at.atomic_ack_eth) / 4; + } else bth0 |= OP(ACKNOWLEDGE) << 24; lrh0 |= qp->remote_ah_attr.sl << 4; - qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); - qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); - qp->s_hdr.lrh[2] = cpu_to_be16(qp->s_hdrwords + SIZE_OF_CRC); - qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); + hdr.lrh[0] = cpu_to_be16(lrh0); + hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); + hdr.lrh[2] = cpu_to_be16(hwords + SIZE_OF_CRC); + hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->dd)); ohdr->bth[0] = cpu_to_be32(bth0); ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); ohdr->bth[2] = cpu_to_be32(qp->s_ack_psn & IPS_PSN_MASK); @@ -673,12 +690,93 @@ static void send_rc_ack(struct ipath_qp /* * If we can send the ACK, clear the ACK state. */ - if (ipath_verbs_send(dev->dd, qp->s_hdrwords, (u32 *) &qp->s_hdr, - 0, NULL) == 0) { + if (ipath_verbs_send(dev->dd, hwords, (u32 *) &hdr, 0, NULL) == 0) { qp->s_ack_state = OP(ACKNOWLEDGE); + dev->n_unicast_xmit++; + } else dev->n_rc_qacks++; - dev->n_unicast_xmit++; - } +} + +/** + * reset_psn - reset the QP state to send starting from PSN + * @qp: the QP + * @psn: the packet sequence number to restart at + * + * This is called from ipath_rc_rcv() to process an incoming RC ACK + * for the given QP. 
+ * Called at interrupt level with the QP s_lock held. + */ +static void reset_psn(struct ipath_qp *qp, u32 psn) +{ + u32 n = qp->s_last; + struct ipath_swqe *wqe = get_swqe_ptr(qp, n); + u32 opcode; + + qp->s_cur = n; + + /* + * If we are starting the request from the beginning, + * let the normal send code handle initialization. + */ + if (ipath_cmp24(psn, wqe->psn) <= 0) { + qp->s_state = OP(SEND_LAST); + goto done; + } + + /* Find the work request opcode corresponding to the given PSN. */ + opcode = wqe->wr.opcode; + for (;;) { + int diff; + + if (++n == qp->s_size) + n = 0; + if (n == qp->s_tail) + break; + wqe = get_swqe_ptr(qp, n); + diff = ipath_cmp24(psn, wqe->psn); + if (diff < 0) + break; + qp->s_cur = n; + /* + * If we are starting the request from the beginning, + * let the normal send code handle initialization. + */ + if (diff == 0) { + qp->s_state = OP(SEND_LAST); + goto done; + } + opcode = wqe->wr.opcode; + } + + /* + * Set the state to restart in the middle of a request. + * Don't change the s_sge, s_cur_sge, or s_cur_size. + * See ipath_do_rc_send(). + */ + switch (opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + qp->s_state = OP(RDMA_READ_RESPONSE_FIRST); + break; + + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + qp->s_state = OP(RDMA_READ_RESPONSE_LAST); + break; + + case IB_WR_RDMA_READ: + qp->s_state = OP(RDMA_READ_RESPONSE_MIDDLE); + break; + + default: + /* + * This case shouldn't happen since its only + * one PSN per req. + */ + qp->s_state = OP(SEND_LAST); + } +done: + qp->s_psn = psn; } /** @@ -693,7 +791,6 @@ void ipath_restart_rc(struct ipath_qp *q { struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); struct ipath_ibdev *dev; - u32 n; /* * If there are no requests pending, we are done. 
@@ -735,130 +832,13 @@ void ipath_restart_rc(struct ipath_qp *q else dev->n_rc_resends += (int)qp->s_psn - (int)psn; - /* - * If we are starting the request from the beginning, let the normal - * send code handle initialization. - */ - qp->s_cur = qp->s_last; - if (ipath_cmp24(psn, wqe->psn) <= 0) { - qp->s_state = OP(SEND_LAST); - qp->s_psn = wqe->psn; - } else { - n = qp->s_cur; - for (;;) { - if (++n == qp->s_size) - n = 0; - if (n == qp->s_tail) { - if (ipath_cmp24(psn, qp->s_next_psn) >= 0) { - qp->s_cur = n; - wqe = get_swqe_ptr(qp, n); - } - break; - } - wqe = get_swqe_ptr(qp, n); - if (ipath_cmp24(psn, wqe->psn) < 0) - break; - qp->s_cur = n; - } - qp->s_psn = psn; - - /* - * Reset the state to restart in the middle of a request. - * Don't change the s_sge, s_cur_sge, or s_cur_size. - * See ipath_do_rc_send(). - */ - switch (wqe->wr.opcode) { - case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: - qp->s_state = OP(RDMA_READ_RESPONSE_FIRST); - break; - - case IB_WR_RDMA_WRITE: - case IB_WR_RDMA_WRITE_WITH_IMM: - qp->s_state = OP(RDMA_READ_RESPONSE_LAST); - break; - - case IB_WR_RDMA_READ: - qp->s_state = - OP(RDMA_READ_RESPONSE_MIDDLE); - break; - - default: - /* - * This case shouldn't happen since its only - * one PSN per req. - */ - qp->s_state = OP(SEND_LAST); - } - } + reset_psn(qp, psn); done: tasklet_hi_schedule(&qp->s_task); bail: return; -} - -/** - * reset_psn - reset the QP state to send starting from PSN - * @qp: the QP - * @psn: the packet sequence number to restart at - * - * This is called from ipath_rc_rcv() to process an incoming RC ACK - * for the given QP. - * Called at interrupt level with the QP s_lock held. 
- */ -static void reset_psn(struct ipath_qp *qp, u32 psn) -{ - struct ipath_swqe *wqe; - u32 n; - - n = qp->s_cur; - wqe = get_swqe_ptr(qp, n); - for (;;) { - if (++n == qp->s_size) - n = 0; - if (n == qp->s_tail) { - if (ipath_cmp24(psn, qp->s_next_psn) >= 0) { - qp->s_cur = n; - wqe = get_swqe_ptr(qp, n); - } - break; - } - wqe = get_swqe_ptr(qp, n); - if (ipath_cmp24(psn, wqe->psn) < 0) - break; - qp->s_cur = n; - } - qp->s_psn = psn; - - /* - * Set the state to restart in the middle of a - * request. Don't change the s_sge, s_cur_sge, or - * s_cur_size. See ipath_do_rc_send(). - */ - switch (wqe->wr.opcode) { - case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: - qp->s_state = OP(RDMA_READ_RESPONSE_FIRST); - break; - - case IB_WR_RDMA_WRITE: - case IB_WR_RDMA_WRITE_WITH_IMM: - qp->s_state = OP(RDMA_READ_RESPONSE_LAST); - break; - - case IB_WR_RDMA_READ: - qp->s_state = OP(RDMA_READ_RESPONSE_MIDDLE); - break; - - default: - /* - * This case shouldn't happen since its only - * one PSN per req. - */ - qp->s_state = OP(SEND_LAST); - } } /** @@ -1011,17 +991,7 @@ static int do_rc_ack(struct ipath_qp *qp dev->n_rc_resends += (int)qp->s_psn - (int)psn; - /* - * If we are starting the request from the beginning, let - * the normal send code handle initialization. - */ - qp->s_cur = qp->s_last; - wqe = get_swqe_ptr(qp, qp->s_cur); - if (ipath_cmp24(psn, wqe->psn) <= 0) { - qp->s_state = OP(SEND_LAST); - qp->s_psn = wqe->psn; - } else - reset_psn(qp, psn); + reset_psn(qp, psn); qp->s_rnr_timeout = ib_ipath_rnr_table[(aeth >> IPS_AETH_CREDIT_SHIFT) & @@ -1182,33 +1152,34 @@ static inline void ipath_rc_rcv_resp(str goto ack_done; } rdma_read: - if (unlikely(qp->s_state != OP(RDMA_READ_REQUEST))) - goto ack_done; - if (unlikely(tlen != (hdrsize + pmtu + 4))) - goto ack_done; - if (unlikely(pmtu >= qp->s_len)) - goto ack_done; - /* We got a response so update the timeout. 
*/ - if (unlikely(qp->s_last == qp->s_tail || - get_swqe_ptr(qp, qp->s_last)->wr.opcode != - IB_WR_RDMA_READ)) - goto ack_done; - spin_lock(&dev->pending_lock); - if (qp->s_rnr_timeout == 0 && - qp->timerwait.next != LIST_POISON1) - list_move_tail(&qp->timerwait, - &dev->pending[dev->pending_index]); - spin_unlock(&dev->pending_lock); - /* - * Update the RDMA receive state but do the copy w/o holding the - * locks and blocking interrupts. XXX Yet another place that - * affects relaxed RDMA order since we don't want s_sge modified. - */ - qp->s_len -= pmtu; - qp->s_last_psn = psn; - spin_unlock_irqrestore(&qp->s_lock, flags); - ipath_copy_sge(&qp->s_sge, data, pmtu); - goto bail; + if (unlikely(qp->s_state != OP(RDMA_READ_REQUEST))) + goto ack_done; + if (unlikely(tlen != (hdrsize + pmtu + 4))) + goto ack_done; + if (unlikely(pmtu >= qp->s_len)) + goto ack_done; + /* We got a response so update the timeout. */ + if (unlikely(qp->s_last == qp->s_tail || + get_swqe_ptr(qp, qp->s_last)->wr.opcode != + IB_WR_RDMA_READ)) + goto ack_done; + spin_lock(&dev->pending_lock); + if (qp->s_rnr_timeout == 0 && + qp->timerwait.next != LIST_POISON1) + list_move_tail(&qp->timerwait, + &dev->pending[dev->pending_index]); + spin_unlock(&dev->pending_lock); + /* + * Update the RDMA receive state but do the copy w/o + * holding the locks and blocking interrupts. + * XXX Yet another place that affects relaxed RDMA order + * since we don't want s_sge modified. + */ + qp->s_len -= pmtu; + qp->s_last_psn = psn; + spin_unlock_irqrestore(&qp->s_lock, flags); + ipath_copy_sge(&qp->s_sge, data, pmtu); + goto bail; case OP(RDMA_READ_RESPONSE_LAST): /* ACKs READ req. */ @@ -1255,9 +1226,12 @@ static inline void ipath_rc_rcv_resp(str if (do_rc_ack(qp, aeth, psn, OP(RDMA_READ_RESPONSE_LAST))) { /* * Change the state so we contimue - * processing new requests. + * processing new requests and wake up the + * tasklet if there are posted sends. 
*/ qp->s_state = OP(SEND_LAST); + if (qp->s_tail != qp->s_head) + tasklet_hi_schedule(&qp->s_task); } goto ack_done; } @@ -1296,6 +1270,8 @@ static inline int ipath_rc_rcv_error(str { struct ib_reth *reth; + spin_lock(&qp->s_lock); + if (diff > 0) { /* * Packet sequence error. @@ -1303,13 +1279,10 @@ static inline int ipath_rc_rcv_error(str * Don't queue the NAK if a RDMA read, atomic, or * NAK is pending though. */ - spin_lock(&qp->s_lock); if ((qp->s_ack_state >= OP(RDMA_READ_REQUEST) && - qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) || - qp->s_nak_state != 0) { - spin_unlock(&qp->s_lock); + qp->s_ack_state != OP(ACKNOWLEDGE)) || + qp->s_nak_state != 0) goto done; - } qp->s_ack_state = OP(SEND_ONLY); qp->s_nak_state = IB_NAK_PSN_ERROR; /* Use the expected PSN. */ @@ -1328,12 +1301,10 @@ static inline int ipath_rc_rcv_error(str * send the earliest so that RDMA reads can be restarted at * the requester's expected PSN. */ - spin_lock(&qp->s_lock); - if (qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE && + if (qp->s_ack_state != OP(ACKNOWLEDGE) && ipath_cmp24(psn, qp->s_ack_psn) >= 0) { - if (qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST) + if (qp->s_ack_state < OP(RDMA_READ_REQUEST)) qp->s_ack_psn = psn; - spin_unlock(&qp->s_lock); goto done; } switch (opcode) { @@ -1344,8 +1315,7 @@ static inline int ipath_rc_rcv_error(str * holding the s_lock. */ if (qp->s_ack_state != OP(ACKNOWLEDGE) && - qp->s_ack_state >= IB_OPCODE_RDMA_READ_REQUEST) { - spin_unlock(&qp->s_lock); + qp->s_ack_state >= OP(RDMA_READ_REQUEST)) { dev->n_rdma_dup_busy++; goto done; } @@ -1387,10 +1357,8 @@ static inline int ipath_rc_rcv_error(str * Check for the PSN of the last atomic operations * performed and resend the result if found. 
*/ - if ((psn & IPS_PSN_MASK) != qp->r_atomic_psn) { - spin_unlock(&qp->s_lock); + if ((psn & IPS_PSN_MASK) != qp->r_atomic_psn) goto done; - } qp->s_ack_atomic = qp->r_atomic_data; break; } @@ -1401,6 +1369,7 @@ resched: return 0; done: + spin_unlock(&qp->s_lock); return 1; } @@ -1493,22 +1462,23 @@ void ipath_rc_rcv(struct ipath_ibdev *de opcode == OP(SEND_LAST_WITH_IMMEDIATE)) break; nack_inv: - /* - * A NAK will ACK earlier sends and RDMA writes. Don't queue the - * NAK if a RDMA read, atomic, or NAK is pending though. - */ - spin_lock(&qp->s_lock); - if (qp->s_ack_state >= OP(RDMA_READ_REQUEST) && - qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) { - spin_unlock(&qp->s_lock); - goto done; - } - /* XXX Flush WQEs */ - qp->state = IB_QPS_ERR; - qp->s_ack_state = OP(SEND_ONLY); - qp->s_nak_state = IB_NAK_INVALID_REQUEST; - qp->s_ack_psn = qp->r_psn; - goto resched; + /* + * A NAK will ACK earlier sends and RDMA writes. + * Don't queue the NAK if a RDMA read, atomic, or NAK + * is pending though. + */ + spin_lock(&qp->s_lock); + if (qp->s_ack_state >= OP(RDMA_READ_REQUEST) && + qp->s_ack_state != OP(ACKNOWLEDGE)) { + spin_unlock(&qp->s_lock); + goto done; + } + /* XXX Flush WQEs */ + qp->state = IB_QPS_ERR; + qp->s_ack_state = OP(SEND_ONLY); + qp->s_nak_state = IB_NAK_INVALID_REQUEST; + qp->s_ack_psn = qp->r_psn; + goto resched; case OP(RDMA_WRITE_FIRST): case OP(RDMA_WRITE_MIDDLE): @@ -1557,9 +1527,8 @@ void ipath_rc_rcv(struct ipath_ibdev *de * is pending though. */ spin_lock(&qp->s_lock); - if (qp->s_ack_state >= - OP(RDMA_READ_REQUEST) && - qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) { + if (qp->s_ack_state >= OP(RDMA_READ_REQUEST) && + qp->s_ack_state != OP(ACKNOWLEDGE)) { spin_unlock(&qp->s_lock); goto done; } @@ -1675,10 +1644,10 @@ void ipath_rc_rcv(struct ipath_ibdev *de * read, atomic, or NAK is pending though. 
*/ spin_lock(&qp->s_lock); + nack_acc1: if (qp->s_ack_state >= OP(RDMA_READ_REQUEST) && - qp->s_ack_state != - IB_OPCODE_ACKNOWLEDGE) { + qp->s_ack_state != OP(ACKNOWLEDGE)) { spin_unlock(&qp->s_lock); goto done; } @@ -1716,9 +1685,16 @@ void ipath_rc_rcv(struct ipath_ibdev *de reth = (struct ib_reth *)data; data += sizeof(*reth); } + if (unlikely(!(qp->qp_access_flags & + IB_ACCESS_REMOTE_READ))) + goto nack_acc; + /* + * Ignore request if we already have an + * RDMA read or ATOMIC pending. + */ spin_lock(&qp->s_lock); if (qp->s_ack_state != OP(ACKNOWLEDGE) && - qp->s_ack_state >= IB_OPCODE_RDMA_READ_REQUEST) { + qp->s_ack_state >= OP(RDMA_READ_REQUEST)) { spin_unlock(&qp->s_lock); goto done; } @@ -1732,10 +1708,8 @@ void ipath_rc_rcv(struct ipath_ibdev *de ok = ipath_rkey_ok(dev, &qp->s_rdma_sge, qp->s_rdma_len, vaddr, rkey, IB_ACCESS_REMOTE_READ); - if (unlikely(!ok)) { - spin_unlock(&qp->s_lock); - goto nack_acc; - } + if (unlikely(!ok)) + goto nack_acc1; /* * Update the next expected PSN. We add 1 later * below, so only add the remainder here. 
@@ -1750,9 +1724,6 @@ void ipath_rc_rcv(struct ipath_ibdev *de qp->s_rdma_sge.sge.length = 0; qp->s_rdma_sge.sge.sge_length = 0; } - if (unlikely(!(qp->qp_access_flags & - IB_ACCESS_REMOTE_READ))) - goto nack_acc; /* * We need to increment the MSN here instead of when we * finish sending the result since a duplicate request would @@ -1822,7 +1793,7 @@ void ipath_rc_rcv(struct ipath_ibdev *de */ spin_lock(&qp->s_lock); if (qp->s_ack_state == OP(ACKNOWLEDGE) || - qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST) { + qp->s_ack_state < OP(RDMA_READ_REQUEST)) { qp->s_ack_state = opcode; qp->s_nak_state = 0; qp->s_ack_psn = psn; @@ -1844,6 +1815,8 @@ resched: (qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST || qp->s_ack_state >= IB_OPCODE_COMPARE_SWAP)) send_rc_ack(qp); + else + dev->n_rc_qacks++; rdmadone: spin_unlock(&qp->s_lock); From bos at pathscale.com Mon Apr 24 14:23:06 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:23:06 -0700 Subject: [openib-general] [PATCH 10 of 13] ipath - simplify IB timer usage In-Reply-To: Message-ID: <36447eb1f256c4c1d7bd.1145913786@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff -r 4eabd5fc05bb -r 36447eb1f256 drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Mon Apr 24 14:21:04 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Mon Apr 24 14:21:04 2006 -0700 @@ -449,7 +449,6 @@ static void ipath_ib_timer(void *arg) { struct ipath_ibdev *dev = (struct ipath_ibdev *) arg; struct ipath_qp *resend = NULL; - struct ipath_qp *rnr = NULL; struct list_head *last; struct ipath_qp *qp; unsigned long flags; @@ -465,32 +464,18 @@ static void ipath_ib_timer(void *arg) last = &dev->pending[dev->pending_index]; while (!list_empty(last)) { qp = list_entry(last->next, struct ipath_qp, timerwait); - if (last->next == LIST_POISON1 || - last->next != &qp->timerwait || - qp->timerwait.prev != last) { - INIT_LIST_HEAD(last); - } else { - list_del(&qp->timerwait); - 
qp->timerwait.prev = (struct list_head *) resend; - resend = qp; - atomic_inc(&qp->refcount); - } + list_del(&qp->timerwait); + qp->timer_next = resend; + resend = qp; + atomic_inc(&qp->refcount); } last = &dev->rnrwait; if (!list_empty(last)) { qp = list_entry(last->next, struct ipath_qp, timerwait); if (--qp->s_rnr_timeout == 0) { do { - if (last->next == LIST_POISON1 || - last->next != &qp->timerwait || - qp->timerwait.prev != last) { - INIT_LIST_HEAD(last); - break; - } list_del(&qp->timerwait); - qp->timerwait.prev = - (struct list_head *) rnr; - rnr = qp; + tasklet_hi_schedule(&qp->s_task); if (list_empty(last)) break; qp = list_entry(last->next, struct ipath_qp, @@ -530,8 +515,7 @@ static void ipath_ib_timer(void *arg) spin_unlock_irqrestore(&dev->pending_lock, flags); /* XXX What if timer fires again while this is running? */ - for (qp = resend; qp != NULL; - qp = (struct ipath_qp *) qp->timerwait.prev) { + for (qp = resend; qp != NULL; qp = qp->timer_next) { struct ib_wc wc; spin_lock_irqsave(&qp->s_lock, flags); @@ -545,9 +529,6 @@ static void ipath_ib_timer(void *arg) if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); } - for (qp = rnr; qp != NULL; - qp = (struct ipath_qp *) qp->timerwait.prev) - tasklet_hi_schedule(&qp->s_task); } /** @@ -556,9 +537,9 @@ static void ipath_ib_timer(void *arg) * * This is called from ipath_intr() at interrupt level when a PIO buffer is * available after ipath_verbs_send() returned an error that no buffers were - * available. Return 0 if we consumed all the PIO buffers and we still have + * available. Return 1 if we consumed all the PIO buffers and we still have * QPs waiting for buffers (for now, just do a tasklet_hi_schedule and - * return one). + * return zero). 
*/ static int ipath_ib_piobufavail(void *arg) { @@ -579,7 +560,7 @@ static int ipath_ib_piobufavail(void *ar spin_unlock_irqrestore(&dev->pending_lock, flags); bail: - return 1; + return 0; } static int ipath_query_device(struct ib_device *ibdev, @@ -1159,7 +1140,7 @@ static ssize_t show_stats(struct class_d len = sprintf(buf, "RC resends %d\n" - "RC QACKs %d\n" + "RC no QACK %d\n" "RC ACKs %d\n" "RC SEQ NAKs %d\n" "RC RDMA seq %d\n" diff -r 4eabd5fc05bb -r 36447eb1f256 drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Mon Apr 24 14:21:04 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Mon Apr 24 14:21:04 2006 -0700 @@ -282,7 +282,8 @@ struct ipath_srq { */ struct ipath_qp { struct ib_qp ibqp; - struct ipath_qp *next; /* link list for QPN hash table */ + struct ipath_qp *next; /* link list for QPN hash table */ + struct ipath_qp *timer_next; /* link list for ipath_ib_timer() */ struct list_head piowait; /* link for wait PIO buf */ struct list_head timerwait; /* link for waiting for timeouts */ struct ib_ah_attr remote_ah_attr; From bos at pathscale.com Mon Apr 24 14:23:07 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:23:07 -0700 Subject: [openib-general] [PATCH 11 of 13] ipath - improve sparse annotation In-Reply-To: Message-ID: Signed-off-by: Bryan O'Sullivan diff -r 36447eb1f256 -r f23abcaaea84 drivers/infiniband/hw/ipath/ips_common.h --- a/drivers/infiniband/hw/ipath/ips_common.h Mon Apr 24 14:21:04 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ips_common.h Mon Apr 24 14:21:04 2006 -0700 @@ -95,7 +95,7 @@ struct ether_header { __u8 seq_num; __le32 len; /* MUST be of word size due to PIO write requirements */ - __u32 csum; + __le32 csum; __le16 csum_offset; __le16 flags; __u16 first_2_bytes; From bos at pathscale.com Mon Apr 24 14:22:58 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:22:58 -0700 Subject: [openib-general] [PATCH 2 of 13] ipath - set up 
32-bit DMA mask if 64-bit setup fails In-Reply-To: Message-ID: <1906950392f7ef8c7d07.1145913778@eng-12.pathscale.com> Some systems do not set up 64-bit maps on systems with 2GB or less of memory installed, so we have to fall back to trying a 32-bit setup. Signed-off-by: Bryan O'Sullivan diff -r 61819d2519e0 -r 1906950392f7 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Wed Apr 19 15:24:36 2006 -0700 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Apr 19 15:24:36 2006 -0700 @@ -418,9 +418,19 @@ static int __devinit ipath_init_one(stru ret = pci_set_dma_mask(pdev, DMA_64BIT_MASK); if (ret) { - dev_info(&pdev->dev, "pci_set_dma_mask unit %u " - "fails: %d\n", dd->ipath_unit, ret); - goto bail_regions; + /* + * if the 64 bit setup fails, try 32 bit. Some systems + * do not setup 64 bit maps on systems with 2GB or less + * memory installed. + */ + ret = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (ret) { + dev_info(&pdev->dev, "pci_set_dma_mask unit %u " + "fails: %d\n", dd->ipath_unit, ret); + goto bail_regions; + } + else + ipath_dbg("No 64bit DMA mask, used 32 bit mask\n"); } pci_set_master(pdev); From bos at pathscale.com Mon Apr 24 14:22:56 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 24 Apr 2006 14:22:56 -0700 Subject: [openib-general] [PATCH 0 of 13] ipath - various fixes and cleanups Message-ID: Hi, Roland - Here is a set of bug fixes and cleanups for the ipath driver. Please apply. References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> Message-ID: <20060424212748.GE20831@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC: cma: need rdma_unbind > > Michael S. Tsirkin wrote: > >I just tested this and this does not seem to be how it works: > >if the socket is in TIME_WAIT state, SO_REUSEADDR let's you bind another > >one to this port. 
> >If the socket is not in TIME_WAIT state, you still get EADDRINUSE. > > > >socketfaq seems to agree > >http://www.unixguide.net/network/socketfaq/4.5.shtml > > > >And CMA does not know about socket state - that's managed by ULP. > > Would releasing the port from within rdma_disconnect() do what you need? No, I don't want to disconnect yet - I am in the process of graceful shutdown, I want to continue sending data. -- MST From mshefty at ichips.intel.com Mon Apr 24 14:36:35 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 14:36:35 -0700 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <20060424212748.GE20831@mellanox.co.il> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> Message-ID: <444D44E3.6000607@ichips.intel.com> Michael S. Tsirkin wrote: > No, I don't want to disconnect yet - I am in the process of graceful > shutdown, I want to continue sending data. It sounds like the problem is that both SDP and the CMA are trying to provide sockets-like behavior, which leads to this issue. SDP needs to release the port, but still wants to keep the connection active. This requires the CMA to do something different from sockets in order for SDP to maintain a sockets-like interface. Does this sound like a correct interpretation of the issue? This sounds like an SDP over CMA specific issue, correct? Meaning that an rdma_unbind call is only needed by SDP? - Sean From mst at mellanox.co.il Mon Apr 24 14:46:45 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 25 Apr 2006 00:46:45 +0300 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <444D44E3.6000607@ichips.intel.com> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> Message-ID: <20060424214645.GF20831@mellanox.co.il> Quoting r. Sean Hefty : > This sounds like an SDP over CMA specific issue, correct? Meaning that an > rdma_unbind call is only needed by SDP? Hmm. As I see it, keeping the QP alive for a given timeout after user requested close is a simple way to implement graceful close, not limited to sockets. On the other hand, we need to free the port immediately. So its easy for me to imagine another kernel ULP (besides SDP) wanting this. No idea whether it might be useful for userspace as well. -- MST From mst at mellanox.co.il Mon Apr 24 14:55:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 25 Apr 2006 00:55:03 +0300 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <444D44E3.6000607@ichips.intel.com> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> Message-ID: <20060424215503.GG20831@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC: cma: need rdma_unbind > > Michael S. Tsirkin wrote: > >No, I don't want to disconnect yet - I am in the process of graceful > >shutdown, I want to continue sending data. > > It sounds like the problem is that both SDP and the CMA are trying to > provide sockets-like behavior, which leads to this issue. As I see it, other ULPs besides sockets might need the same. 
> SDP needs to > release the port, but still wants to keep the connection active. Right. > This > requires the CMA to do something different from sockets in order for SDP to > maintain a sockets-like interface. Does this sound like a correct > interpretation of the issue? More low-level, I would say. By the way, SDP we now have in svn trunk actually provides an "unbind" socket option, as an extension. It turns out to be useful for libsdp. -- MST From mshefty at ichips.intel.com Mon Apr 24 14:56:58 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 14:56:58 -0700 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <20060424214645.GF20831@mellanox.co.il> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> <20060424214645.GF20831@mellanox.co.il> Message-ID: <444D49AA.2050708@ichips.intel.com> Michael S. Tsirkin wrote: > Hmm. As I see it, keeping the QP alive for a given timeout after user requested > close is a simple way to implement graceful close, not limited to sockets. > On the other hand, we need to free the port immediately. This seems to imply that there's another connection abstraction on top of the CMA. From the perspective of the CMA, the connection, and hence port, is still in use. For example, I don't think that rdma_unbind() could work if the underlying device were iWarp, or something that used the port number when mapping data. I also can't think of a better solution to this problem. - Sean From mst at mellanox.co.il Mon Apr 24 15:08:11 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 25 Apr 2006 01:08:11 +0300 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <444D49AA.2050708@ichips.intel.com> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> <20060424214645.GF20831@mellanox.co.il> <444D49AA.2050708@ichips.intel.com> Message-ID: <20060424220811.GH20831@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC: cma: need rdma_unbind > > Michael S. Tsirkin wrote: > >Hmm. As I see it, keeping the QP alive for a given timeout after user > >requested > >close is a simple way to implement graceful close, not limited to sockets. > >On the other hand, we need to free the port immediately. > > This seems to imply that there's another connection abstraction on top of > the CMA. Well, that's what ULPs are. > From the perspective of the CMA, the connection, and hence port, is still > in use. No. A socket is a 5 tuple (proto, local addr, local port, remote addr, remote port). unbind just says that you can reuse local addresses, so e.g. a new connection request will connect to a new socket. > For example, I don't think that rdma_unbind() could work if the > underlying device were iWarp, or something that used the port number when > mapping data. The 5 tuple still must be unique, so no problem I think. > I also can't think of a better solution to this problem. Maybe rename it unbind_local? 
-- MST From tom at opengridcomputing.com Mon Apr 24 15:21:50 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Apr 2006 17:21:50 -0500 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <20060424211911.GD20831@mellanox.co.il> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> Message-ID: <1145917310.18808.78.camel@trinity.ogc.int> On Tue, 2006-04-25 at 00:19 +0300, Michael S. Tsirkin wrote: > Quoting r. Tom Tucker : > > Subject: Re: RFC: cma: need rdma_unbind > > > > On Mon, 2006-04-24 at 11:15 -0700, Sean Hefty wrote: > > > Michael S. Tsirkin wrote: > > > > Sean, it seems that a rdma_unbind API is necessary for SDP SO_REUSEADDR support. > > > > This should only remove association between the port and the id, without > > > > affecting the CM state. > > > > > > > > Reason: when SO_REUSEADDR is set, the port becomes usable immediately > > > > after close call, but the QP must not be closed immediately - > > > > rather its closed after graceful close or after timeout. > > > > > > > > Thus I can not destroy the id, but the association with > > > > port needs to be cancelled. > > > > > > I'm in the process of adding rdma_set_option and rdma_get_option calls. If > > > SO_REUSEADDR support were added through these calls, can we come up with a > > > solution that doesn't require adding an rdma_unbind call? > > > > There is no need for an unbind call -- the sockets code, for example, > > doesn't work this way. Instead, it looks at a 'reuse' flag and decides > > whether to allow a subsequent bind or use-port request based on the > > value. > > I just tested this and this does not seem to be how it works: > if the socket is in TIME_WAIT state, SO_REUSEADDR let's you bind another > one to this port. > If the socket is not in TIME_WAIT state, you still get EADDRINUSE. 
The code snippet is taken out of a stack that makes all the checks that you refer to. My point is that the CMA doesn't need another API applications call to explicitly release ports because they are implicitly released when the app calls rdma_disconnect and/or rdma_destroy_id. What I thought we were discussing was controlling how quickly the port can be reused through configuring the REUSEADDR option. For IB, I would think the option is effectively a no-op since the TIMEWAIT state is a very short lived affair, not the 2MSL lifetime of a TCP/IP connection. In any event, it is my opinion that the policy for how ports are released belongs in the stack (the CMA in this case), not in the application where it is controlled with an API like rdma_unbind. Are you suggesting something else and I'm just missing it? > > socketfaq seems to agree > http://www.unixguide.net/network/socketfaq/4.5.shtml > > And CMA does not know about socket state - that's managed by ULP. > > > From swise at opengridcomputing.com Mon Apr 24 15:14:53 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 24 Apr 2006 17:14:53 -0500 Subject: [openib-general] dapltest question Message-ID: <1145916893.13656.9.camel@stevo-desktop> Hey James, When running something like this: #==================================================================== #client6 #==================================================================== ./dapltest -T T -s ${host} -D ${device} -i 10000 -t 4 -w 8 \ client SR 256 \ server RW 4096 \ server SR 256 \ client SR 256 \ server RW 4096 \ server SR 256 \ client SR 4096 \ server SR 256 Do the transactions execute in order? IE: Will each client first run the SR 256 test, then the server will run the RW 4096 test, etc? Or do the client and server run through their respective tests in parallel? I'm seeing a problem with the chelsio rnic where it appears that sometimes the server sends an FPDU first, and that is not allowed in the mpa spec. 
In other words, does the dapltest client always send the first FPDU? MPA draft: 4. MPA "Responder" mode implementations MUST receive and validate at least one FPDU before sending any FPDUs or markers. Thanks, Steve. From tom at opengridcomputing.com Mon Apr 24 15:26:43 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Apr 2006 17:26:43 -0500 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <20060424212748.GE20831@mellanox.co.il> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> Message-ID: <1145917603.18808.82.camel@trinity.ogc.int> On Tue, 2006-04-25 at 00:27 +0300, Michael S. Tsirkin wrote: > Quoting r. Sean Hefty : > > Subject: Re: RFC: cma: need rdma_unbind > > > > Michael S. Tsirkin wrote: > > >I just tested this and this does not seem to be how it works: > > >if the socket is in TIME_WAIT state, SO_REUSEADDR let's you bind another > > >one to this port. > > >If the socket is not in TIME_WAIT state, you still get EADDRINUSE. > > > > > >socketfaq seems to agree > > >http://www.unixguide.net/network/socketfaq/4.5.shtml > > > > > >And CMA does not know about socket state - that's managed by ULP. > > > > Would releasing the port from within rdma_disconnect() do what you need? > > No, I don't want to disconnect yet - I am in the process of graceful > shutdown, I want to continue sending data. > You can't (shouldn't be able to...) send data in TIME_WAIT. Are you trying to implement the TCP/IP half close semantics to pass some compatibility test? 
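For reference, the SO_REUSEADDR behaviour described in the quoted text above (a second bind still fails while the first socket is not in TIME_WAIT) can be checked from user space. The following is a minimal sketch, assuming Linux TCP sockets on the loopback interface; the helper names bind_reuse() and second_bind_errno() are made up for this illustration and have nothing to do with the CMA or SDP code.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket with SO_REUSEADDR set and bind
 * it to 127.0.0.1:port. Returns the fd, or -errno if any step fails. */
static int bind_reuse(unsigned short port)
{
	struct sockaddr_in sa;
	int one = 1;
	int err;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -errno;
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sa.sin_port = htons(port);
	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}

/* Bind a second SO_REUSEADDR socket to a port that a first socket is
 * still listening on. Returns 0 if the second bind succeeded, the
 * positive errno it failed with, or -1 if the setup itself failed. */
static int second_bind_errno(void)
{
	struct sockaddr_in sa;
	socklen_t len = sizeof(sa);
	int first, second, result;

	first = bind_reuse(0);	/* port 0: let the kernel pick a free port */
	if (first < 0)
		return -1;
	if (listen(first, 1) < 0 ||
	    getsockname(first, (struct sockaddr *)&sa, &len) < 0) {
		close(first);
		return -1;
	}
	assert(len == sizeof(sa));	/* AF_INET address came back intact */

	second = bind_reuse(ntohs(sa.sin_port));
	if (second >= 0) {
		close(second);
		result = 0;
	} else {
		result = -second;
	}
	close(first);
	return result;
}
```

On Linux the second bind is expected to fail with EADDRINUSE because the first socket is in the LISTEN state, not TIME_WAIT, even though both sockets set SO_REUSEADDR.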
From mshefty at ichips.intel.com Mon Apr 24 15:18:44 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 15:18:44 -0700 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <20060424220811.GH20831@mellanox.co.il> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> <20060424214645.GF20831@mellanox.co.il> <444D49AA.2050708@ichips.intel.com> <20060424220811.GH20831@mellanox.co.il> Message-ID: <444D4EC4.2020101@ichips.intel.com> Michael S. Tsirkin wrote: > No. A socket is a 5 tuple (proto, local addr, local port, remote addr, remote > port). unbind just says that you can reuse local addresses, so e.g. a new > connection request will connect to a new socket. I understand. But if both sides do this, then the local and remote ports become available for re-use, and a new connection between the systems could end up with the same tuple. > Maybe rename it unbind_local? I'm fine with the name rdma_unbind(). I will likely change rdma_bind_addr() to just rdma_bind() at some point. I'm just trying to determine what the real limitations are to rdma_unbind(), or if there's some other solution that we're missing that may make more sense, such as having SDP control its own port space. 
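The distinction being argued here, releasing the local port versus keeping the connection tuple unique, can be sketched as a toy user-space model. This is only an illustration under stated assumptions: struct port_space, ps_bind(), ps_unbind() and ps_connect() are hypothetical names, the protocol field of the tuple is assumed fixed, and nothing here reflects how the CMA actually stores its bind lists. It assumes that some layer keeps track of live tuples and refuses duplicates.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_CONNS 8
#define MAX_PORTS 64

/* Hypothetical connection identity (protocol assumed fixed). */
struct conn_tuple {
	uint32_t laddr, raddr;
	uint16_t lport, rport;
};

/* Toy port space: a table of bound local ports plus the live tuples. */
struct port_space {
	bool bound[MAX_PORTS];
	bool live[MAX_CONNS];
	struct conn_tuple conns[MAX_CONNS];
};

static bool tuple_eq(const struct conn_tuple *a, const struct conn_tuple *b)
{
	return a->laddr == b->laddr && a->raddr == b->raddr &&
	       a->lport == b->lport && a->rport == b->rport;
}

/* Reserve a local port; fails if it is already bound. */
static int ps_bind(struct port_space *ps, uint16_t port)
{
	if (port >= MAX_PORTS)
		return -EINVAL;
	if (ps->bound[port])
		return -EADDRINUSE;
	ps->bound[port] = true;
	return 0;
}

/* Release the port only; live connections are left untouched. */
static void ps_unbind(struct port_space *ps, uint16_t port)
{
	if (port < MAX_PORTS)
		ps->bound[port] = false;
}

/* Record a connection; an identical live tuple is rejected. */
static int ps_connect(struct port_space *ps, const struct conn_tuple *t)
{
	int i, slot = -1;

	assert(ps != NULL && t != NULL);
	for (i = 0; i < MAX_CONNS; i++) {
		if (ps->live[i]) {
			if (tuple_eq(&ps->conns[i], t))
				return -EADDRINUSE;
		} else if (slot < 0) {
			slot = i;
		}
	}
	if (slot < 0)
		return -ENOMEM;
	ps->conns[slot] = *t;
	ps->live[slot] = true;
	return 0;
}

/* Bind, connect, unbind while the connection stays alive, then re-bind
 * the same port and try a new connection that differs only in the
 * remote port. Returns the result of that last connect, or 1 if an
 * earlier setup step failed unexpectedly. */
static int reuse_after_unbind(uint16_t new_rport)
{
	struct port_space ps;
	struct conn_tuple old = { .laddr = 1, .raddr = 2, .lport = 10, .rport = 20 };
	struct conn_tuple fresh = old;

	memset(&ps, 0, sizeof(ps));
	fresh.rport = new_rport;
	if (ps_bind(&ps, old.lport) || ps_connect(&ps, &old))
		return 1;
	ps_unbind(&ps, old.lport);	/* port is free, connection is not */
	if (ps_bind(&ps, fresh.lport))
		return 1;		/* re-bind should succeed at once */
	return ps_connect(&ps, &fresh);
}
```

In this model the port can be re-bound immediately after ps_unbind(), but a new connection with the identical tuple is refused while the old one is still live. If no layer performs such a check, the collision described above (both sides unbinding and reconnecting with the same ports) becomes possible.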
- Sean From tom at opengridcomputing.com Mon Apr 24 15:32:54 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Apr 2006 17:32:54 -0500 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <20060424214645.GF20831@mellanox.co.il> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> <20060424214645.GF20831@mellanox.co.il> Message-ID: <1145917974.18808.90.camel@trinity.ogc.int> On Tue, 2006-04-25 at 00:46 +0300, Michael S. Tsirkin wrote: > Quoting r. Sean Hefty : > > This sounds like an SDP over CMA specific issue, correct? Meaning that an > > rdma_unbind call is only needed by SDP? > > Hmm. As I see it, keeping the QP alive for a given timeout after user requested > close is a simple way to implement graceful close, not limited to sockets. > On the other hand, we need to free the port immediately. > > So its easy for me to imagine another kernel ULP (besides SDP) wanting this. > No idea whether it might be useful for userspace as well. > What specifically are you trying to implement ... shutdown()? Is there some WHCL test that's failing? Help ;-)? From mst at mellanox.co.il Mon Apr 24 15:26:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 25 Apr 2006 01:26:34 +0300 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <1145917603.18808.82.camel@trinity.ogc.int> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <1145917603.18808.82.camel@trinity.ogc.int> Message-ID: <20060424222634.GI20831@mellanox.co.il> Quoting r. 
Tom Tucker : > Subject: Re: RFC: cma: need rdma_unbind > > On Tue, 2006-04-25 at 00:27 +0300, Michael S. Tsirkin wrote: > > Quoting r. Sean Hefty : > > > Subject: Re: RFC: cma: need rdma_unbind > > > > > > Michael S. Tsirkin wrote: > > > >I just tested this and this does not seem to be how it works: > > > >if the socket is in TIME_WAIT state, SO_REUSEADDR let's you bind another > > > >one to this port. > > > >If the socket is not in TIME_WAIT state, you still get EADDRINUSE. > > > > > > > >socketfaq seems to agree > > > >http://www.unixguide.net/network/socketfaq/4.5.shtml > > > > > > > >And CMA does not know about socket state - that's managed by ULP. > > > > > > Would releasing the port from within rdma_disconnect() do what you need? > > > > No, I don't want to disconnect yet - I am in the process of graceful > > shutdown, I want to continue sending data. > > > You can't (shouldn't be able to...) send data in TIME_WAIT. I should have said messages, not data. > Are you trying to implement the TCP/IP half close semantics to pass some > compatibility test? Yes. -- MST From mst at mellanox.co.il Mon Apr 24 15:30:39 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 25 Apr 2006 01:30:39 +0300 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <444D4EC4.2020101@ichips.intel.com> References: <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> <20060424214645.GF20831@mellanox.co.il> <444D49AA.2050708@ichips.intel.com> <20060424220811.GH20831@mellanox.co.il> <444D4EC4.2020101@ichips.intel.com> Message-ID: <20060424223039.GJ20831@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC: cma: need rdma_unbind > > Michael S. Tsirkin wrote: > >No. A socket is a 5 tuple (proto, local addr, local port, remote addr, > >remote > >port). 
unbind just says that you can reuse local addresses, so e.g. a new > >connection request will connect to a new socket. > > I understand. But if both sides do this, then the local and remote ports > become available for re-use, and a new connection between the systems could > end up with the same tuple. I think you are right. That's a known property of REUSEADDR I think. > >Maybe rename it unbind_local? > > I'm fine with the name rdma_unbind(). I will likely change > rdma_bind_addr() to just rdma_bind() at some point. I'm just trying to > determine what the real limitations are to rdma_unbind(), or if there's > some other solution that we're missing that may make more sense, such as > having SDP control its own port space. That's always possible I guess. But do let me know. -- MST From caitlinb at broadcom.com Mon Apr 24 15:28:56 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 24 Apr 2006 15:28:56 -0700 Subject: [openib-general] Re: RFC: cma: need rdma_unbind Message-ID: <54AD0F12E08D1541B826BE97C98F99F143AB7D@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Michael S. Tsirkin wrote: >> No. A socket is a 5 tuple (proto, local addr, local port, remote >> addr, remote port). unbind just says that you can reuse local >> addresses, so e.g. a new connection request will connect to a new >> socket. > > I understand. But if both sides do this, then the local and > remote ports become available for re-use, and a new > connection between the systems could end up with the same tuple. > Keep in mind that an iWARP QP CAN be re-used before the 5 tuple is eligible for re-use (i.e., the TCP layer connection establishment is responsible for not initiating/accepting a conflicting connection during the timewait period, but the QP can be re-used instantly for another connection). So we need to be careful to avoid requiring that the QP be time-waited, merely allow it. 
If SDP wants to do this generically it would probably be ok, since I don't think it would delay the re-use of that many iWARP QPs. From mst at mellanox.co.il Mon Apr 24 15:34:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 25 Apr 2006 01:34:25 +0300 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <1145917974.18808.90.camel@trinity.ogc.int> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> <20060424214645.GF20831@mellanox.co.il> <1145917974.18808.90.camel@trinity.ogc.int> Message-ID: <20060424223425.GK20831@mellanox.co.il> Quoting r. Tom Tucker : > What specifically are you trying to implement ... shutdown()? No, graceful close in SDP combined with REUSEADDR. -- MST From tom at opengridcomputing.com Mon Apr 24 15:43:23 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Apr 2006 17:43:23 -0500 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <20060424223425.GK20831@mellanox.co.il> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> <20060424214645.GF20831@mellanox.co.il> <1145917974.18808.90.camel@trinity.ogc.int> <20060424223425.GK20831@mellanox.co.il> Message-ID: <1145918603.18808.94.camel@trinity.ogc.int> On Tue, 2006-04-25 at 01:34 +0300, Michael S. Tsirkin wrote: > Quoting r. Tom Tucker : > > What specifically are you trying to implement ... shutdown()? > > No, graceful close in SDP combined with REUSEADDR. So, the app calls disconnect (or shutdown). 
You want to disallow application send and/or receive, but you still want to send the SDP DISCONNECT control message. Is that it? > From sean.hefty at intel.com Mon Apr 24 16:04:12 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 16:04:12 -0700 Subject: [openib-general] [PATCH] rdma cm: expose rdma_unbind to allow ULPs to release assigned port In-Reply-To: <20060424223039.GJ20831@mellanox.co.il> Message-ID: Can you try this out? I only did some limited testing. Expose port unbinding routine to allow ULPs to control when an assigned port is released by the CMA. This is needed by SDP in order to provide for graceful shutdown when SO_REUSEADDR has been set on the socket. Signed-off-by: Sean Hefty --- Index: core/cma.c =================================================================== --- core/cma.c (revision 6588) +++ core/cma.c (working copy) @@ -642,10 +642,13 @@ static void cma_cancel_operation(struct } } -static void cma_release_port(struct rdma_id_private *id_priv) +void rdma_unbind(struct rdma_cm_id *id) { - struct rdma_bind_list *bind_list = id_priv->bind_list; + struct rdma_id_private *id_priv; + struct rdma_bind_list *bind_list; + id_priv = container_of(id, struct rdma_id_private, id); + bind_list = id_priv->bind_list; if (!bind_list) return; @@ -655,8 +658,10 @@ static void cma_release_port(struct rdma idr_remove(bind_list->ps, bind_list->port); kfree(bind_list); } + id_priv->bind_list = NULL; mutex_unlock(&lock); } +EXPORT_SYMBOL(rdma_unbind); void rdma_destroy_id(struct rdma_cm_id *id) { @@ -681,7 +686,7 @@ void rdma_destroy_id(struct rdma_cm_id * mutex_unlock(&lock); } - cma_release_port(id_priv); + rdma_unbind(id); atomic_dec(&id_priv->refcount); wait_event(id_priv->wait, !atomic_read(&id_priv->refcount)); Index: include/rdma/rdma_cm.h =================================================================== --- include/rdma/rdma_cm.h (revision 6418) +++ include/rdma/rdma_cm.h (working copy) @@ -251,5 +251,13 @@ int rdma_reject(struct 
rdma_cm_id *id, c */ int rdma_disconnect(struct rdma_cm_id *id); +/** + * rdma_unbind - Release any allocated port binding with the rdma_cm_id. + * + * This releases any port that is bound to the specified rdma_cm_id. The + * associated port becomes available for immediate re-use. + */ +void rdma_unbind(struct rdma_cm_id *id); + #endif /* RDMA_CM_H */ From tom at opengridcomputing.com Mon Apr 24 16:15:26 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Apr 2006 18:15:26 -0500 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <1145918603.18808.94.camel@trinity.ogc.int> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> <20060424214645.GF20831@mellanox.co.il> <1145917974.18808.90.camel@trinity.ogc.int> <20060424223425.GK20831@mellanox.co.il> <1145918603.18808.94.camel@trinity.ogc.int> Message-ID: <1145920526.18808.112.camel@trinity.ogc.int> On Mon, 2006-04-24 at 17:43 -0500, Tom Tucker wrote: > On Tue, 2006-04-25 at 01:34 +0300, Michael S. Tsirkin wrote: > > Quoting r. Tom Tucker : > > > What specifically are you trying to implement ... shutdown()? > > > > No, graceful close in SDP combined with REUSEADDR. > > So, the app calls disconnect (or shutdown). You want to disallow > application send and/or receive, but you still want to send the SDP > DISCONNECT control message. Is that it? > > > > FYI, an application calling shutdown() does not release the port synchronously. It is only released after the peers have negotiated the close. For SDP, I think this means that both peers have sent and received the SDP DisConn message and/or AbortConn message _AND_ you have shutdown the LLP (called rdma_disconnect). It is not until after this point that you can expect the port to be reuse-able. 
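For readers unfamiliar with the half-close semantics this thread keeps referring to, here is a minimal user-space sketch. It uses an AF_UNIX socketpair as a stand-in for a TCP connection (an assumption made so the example needs no network setup), and half_close_demo() is a hypothetical name. It shows only that one side can stop sending while still receiving; it says nothing about when a port is released.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Half-close one end of a connected stream socket pair, then check that
 * the peer sees end-of-file and can still send data back.
 * Returns the number of bytes the half-closed side received, or -1. */
static int half_close_demo(void)
{
	int sv[2];
	char buf[8];
	ssize_t n;
	int result = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	/* Side 0 announces it has nothing more to send. */
	if (shutdown(sv[0], SHUT_WR) < 0)
		goto out;

	/* Side 1 should now read end-of-file (0 bytes). */
	n = read(sv[1], buf, sizeof(buf));
	if (n != 0)
		goto out;

	/* The other direction is still open: side 1 replies. */
	if (write(sv[1], "bye", 3) != 3)
		goto out;

	/* Side 0 can still receive after its own SHUT_WR. */
	n = read(sv[0], buf, sizeof(buf));
	assert(n <= (ssize_t)sizeof(buf));
	result = (int)n;
out:
	close(sv[0]);
	close(sv[1]);
	return result;
}
```

The graceful-close case discussed for SDP is analogous: the application has finished sending, yet the underlying connection has to stay up long enough for the remaining control messages to be exchanged.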
From xma at us.ibm.com Mon Apr 24 17:33:20 2006 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 24 Apr 2006 17:33:20 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: Message-ID: Leonid, > Have you tried to use other SOFTIRQ instead of TASKLET_SOFTIRQ? I have looked at the softirq list in interrupt.h: HI_SOFTIRQ=0, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ, BLOCK_SOFTIRQ, TASKLET_SOFTIRQ I couldn't see any softirq we could use for IB completion polling to run simultaneously on multiple CPUs. We might need to use work queues instead. Is the driver able to identify sendQ/recvQ completion interrupts? Can two tasklets (one for send, one for recv) be implemented in your driver? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From ssbyrn at yahoo.com Mon Apr 24 18:47:22 2006 From: ssbyrn at yahoo.com (susan) Date: Tue, 25 Apr 2006 01:47:22 +0000 (UTC) Subject: [openib-general] Re: ib_send_cm_req fails with error -22 References: <44468CCE.6070202@ichips.intel.com> <44480C85.1040704@ichips.intel.com> <44490DCA.9080609@ichips.intel.com> Message-ID: susan yahoo.com> writes: > > Sean Hefty ichips.intel.com> writes: > > > thanks Sean! i will give it a try. > > susan > > hi Sean, the cmpost utility works on my box. i found the following anomalies after looking at the cmpost.c code: 1. in my test code, i have to explicitly set path_rec.sgid for the ib_sa_path_rec_get routine to work. if it is not set, the ib_send_cm_req() call fails with error 22 (EINVAL), whereas cmpost doesn't set it. 2.
for test code below, i get IB_CM_REQ_ERROR after calling ib_send_cm_req. below is the trace of primary node (that connects to secondary node, which starts in a listen mode). ib call trace on primary node: xtd: ib_register_client xtd: set primary 1 xtd: using active port 1 xtd: ib_alloc_pd xtd: ib_create_cm_id xtd: ib_create_cq xtd: ib_req_notify_cq xtd: ib_create_qp xtd: connect 0x2 xtd: ib_query_gid xtd: sgid.prefix fe80000000000000, sgid.id 5ad0000030861 xtd: slid 0x1, dlid 0x2, port 1 xtd: ib_sa_path_rec_get xtd: waiting for ib_sa_path_rec_get completion xtd: ib_sa_path_rec_get completed with status 0, wakeup sleeping task xtd: ib_sa_path_rec_get operation completed xtd: ib_send_cm_req xtd: cm_id->state = 2 xtd: waiting for connection to establish xtd: xtd_cm_handler(0) xtd: xtd_cm_handler(IB_CM_REQ_ERROR) event not processed xtd: cm_id->state = 0 xtd: wait timed out ib call trace on secondary node: xtd: ib_register_client xtd: set secondary 2 xtd: using active port 1 xtd: ib_alloc_pd xtd: ib_create_cm_id xtd: ib_create_cq xtd: ib_req_notify_cq xtd: ib_create_qp xtd: ib_create_cm_id xtd: ib_cm_listen xtd: listening ... 
------------------------TEST CODE-------------------------------- #include #include #include #include #include #include #define XTD_DESCRIPTION "sample infiniband driver for xsft" #define XTD_AUTHOR "symantec software" MODULE_DESCRIPTION(XTD_DESCRIPTION); MODULE_AUTHOR(XTD_AUTHOR); MODULE_LICENSE("GPL"); #define PROCDIR "xtd" #define PROCFILE "info" #define MAX_CQ_WR 1 #define MAX_SEND_WR MAX_CQ_WR #define MAX_RECV_WR MAX_CQ_WR #define MAX_CQ_SIZE MAX_SEND_WR + MAX_RECV_WR #define MAX_TIMEOUT_MS 1000 #define MAX_BUFFER 256 #define MAX_RETRIES 1 #define MAX_WAIT_TIMEOUT (80 * HZ) #define SERVICE_ID 0x1000 enum xtd_loglevel_type { ERROR = 0, WARN, INFO1, INFO2, INFO3, INFO4, INFO5, INFO6, INFO7 }; enum xtd_role_type { NONE = 0, PRIMARY, SECONDARY }; enum xtd_state_type { UNKNOWN = 0, INIT, CONNECTING, ESTABLISHED, DISCONNECTING, DISCONNECTED }; typedef struct _xtd_msgbuf_t { int opcode; char data[MAX_BUFFER]; } xtd_msgbuf_t; typedef struct _xtd_device_t { struct ib_device *hcap; struct ib_cm_id *cm_idp; struct ib_cm_id *listen_idp; struct ib_pd *pdp; struct ib_cq *cqp; struct ib_mr *mrp; struct ib_qp *qpp; struct ib_sa_path_rec path_rec; wait_queue_head_t waitq; int cq_size; int conn_state; int path_rec_qstatus; int port; int role; xtd_msgbuf_t *msgbufp; u64 addr; u32 slid; u32 dlid; DECLARE_PCI_UNMAP_ADDR(mapping); } xtd_device_t; static int xtd_debuglevel = INFO6; static struct proc_dir_entry *xtd_procdir = NULL; static struct proc_dir_entry *xtd_procfile = NULL; static xtd_device_t xtd_device; static void xtd_add_one(struct ib_device *devicep); static void xtd_remove_one(struct ib_device *devicep); static struct ib_client xtd_client = { .name = "xtd", .add = xtd_add_one, .remove = xtd_remove_one }; static void xlog(int level, const char *format, ...) 
{ int ii = 0; char buffer[MAX_BUFFER]; va_list ap; if (level > xtd_debuglevel) { return; } va_start(ap, format); (void) vsprintf(&buffer[0], format, ap); va_end(ap); ii = strlen(buffer); buffer[ii++] = '\n'; buffer[ii] = '\0'; printk("xtd: %s", buffer); return; } static int free_recv_buffer(void) { dma_unmap_single(xtd_device.hcap->dma_device, pci_unmap_addr(&xtd_device, mapping), sizeof (xtd_msgbuf_t), DMA_FROM_DEVICE); if (xtd_device.mrp) { xlog(INFO3, "ib_dereg_mr"); ib_dereg_mr(xtd_device.mrp); xtd_device.mrp = NULL; } kfree(xtd_device.msgbufp); xtd_device.msgbufp = NULL; return 0; } static int alloc_recv_buffer(void) { struct ib_recv_wr recv_wr; struct ib_recv_wr *bad_recv_wrp; struct ib_sge sge; int retc; if (xtd_device.msgbufp) { kfree(xtd_device.msgbufp); } xtd_device.msgbufp = (xtd_msgbuf_t *) kzalloc(sizeof (xtd_msgbuf_t), GFP_ATOMIC); if (! xtd_device.msgbufp) { xlog(ERROR, "alloc for xtd_msgbuf_t failed"); return -ENOMEM; } xlog(INFO3, "ib_get_dma_mr"); xtd_device.mrp = ib_get_dma_mr(xtd_device.pdp, IB_ACCESS_LOCAL_WRITE); if (IS_ERR(xtd_device.mrp)) { retc = PTR_ERR(xtd_device.mrp); xlog(ERROR, "ib_get_dma_mr() failed with error %d", retc); kfree(xtd_device.msgbufp); xtd_device.msgbufp = NULL; xtd_device.mrp = NULL; return retc; } xtd_device.addr = dma_map_single(xtd_device.hcap->dma_device, xtd_device.msgbufp, sizeof (xtd_msgbuf_t), DMA_FROM_DEVICE); if (dma_mapping_error(xtd_device.addr)) { xlog(ERROR, "dma_map_single() failed"); ib_dereg_mr(xtd_device.mrp); xtd_device.mrp = NULL; kfree(xtd_device.msgbufp); xtd_device.msgbufp = NULL; return -ENOMEM; } pci_unmap_addr_set(&xtd_device, mapping, xtd_device.addr); sge.addr = xtd_device.addr; sge.length = sizeof (xtd_msgbuf_t); sge.lkey = xtd_device.mrp->lkey; recv_wr.next = NULL; recv_wr.sg_list = &sge; recv_wr.num_sge = 1; recv_wr.wr_id = (u64) &xtd_device; xlog(INFO3, "ib_post_recv"); retc = ib_post_recv(xtd_device.qpp, &recv_wr, &bad_recv_wrp); if (retc) { xlog(ERROR, "ib_post_recv() failed with error 
%d", retc); free_recv_buffer(); return retc; } return retc; } static int process_send(struct ib_wc *wcp) { return 0; } static int process_recv(struct ib_wc *wcp) { return 0; } static int process_completion(struct ib_wc *wcp) { int retc; switch(wcp->opcode) { case IB_WC_SEND: retc = process_send(wcp); return retc; case IB_WC_RECV: retc = process_recv(wcp); if (retc) { return retc; } retc = free_recv_buffer(); if (retc) { xlog(ERROR, "free_recv_buffer() failed"); return retc; } retc = alloc_recv_buffer(); return retc; default: xlog(ERROR, "process_completion(%d) operation not " "processed", wcp->opcode); return -EINVAL; } } static void xtd_path_rec_completion(int status, struct ib_sa_path_rec *resp, void *contextp) { xlog(INFO4, "ib_sa_path_rec_get completed with status %d, " "wakeup sleeping task", status); xtd_device.path_rec_qstatus = 0; wake_up(&xtd_device.waitq); return; } static void xtd_qp_event_handler(struct ib_event *eventp, void *contextp) { return; } static void xtd_completion_event_handler(struct ib_event *eventp, void *contextp) { return; } static void xtd_completion_handler(struct ib_cq *cqp, void *contextp) { struct ib_wc wc[MAX_CQ_WR]; int i; int retc; do { xlog(INFO3, "ib_poll_cq"); retc = ib_poll_cq(cqp, MAX_CQ_WR, wc); if (retc < 0) { xlog(ERROR, "ib_poll_cq() failed with error %d", retc); return; } for (i = 0; i < retc; i++) { retc = process_completion(&wc[i]); } } while (retc > 0); xlog(INFO3, "ib_req_notify_cq"); retc = ib_req_notify_cq(cqp, IB_CQ_NEXT_COMP); if (retc != 0) { xlog(ERROR, "ib_req_notify_cq() failed with error %d", retc); } return; } static int modify_qp(struct ib_cm_id *cm_idp, int state) { struct ib_qp_attr qp_attr; int mask; int retc; memset(&qp_attr, 0, sizeof qp_attr); qp_attr.qp_state = state; retc = ib_cm_init_qp_attr(cm_idp, &qp_attr, &mask); if (retc) { xlog(ERROR, "ib_cm_init_qp_attr() failed with error %d " "for state %d", retc, state); return retc; } if (state == IB_QPS_RTR) { qp_attr.rq_psn = xtd_device.qpp->qp_num; 
} retc = ib_modify_qp(xtd_device.qpp, &qp_attr, mask); if (retc) { xlog(ERROR, "ib_modify_qp() failed with error %d " "for state %d", retc, state); return retc; } return retc; } static int query_path_rec(void) { union ib_gid sgid; struct ib_sa_query *queryp; int wtimeout; int retc; xlog(INFO3, "ib_query_gid"); retc = ib_query_gid(xtd_device.hcap, xtd_device.port, 0, &sgid); if (retc) { xlog(ERROR, "ib_query_gid() failed with error %d", retc); return retc; } xlog(INFO4, "sgid.prefix %llx, sgid.id %llx", __constant_be64_to_cpu(sgid.global.subnet_prefix), __constant_be64_to_cpu(sgid.global.interface_id)); xlog(INFO4, "slid 0x%x, dlid 0x%x, port %d", xtd_device.slid, xtd_device.dlid, xtd_device.port); xtd_device.path_rec.dlid = cpu_to_be16(xtd_device.dlid); xtd_device.path_rec.slid = cpu_to_be16(xtd_device.slid); xtd_device.path_rec.sgid = sgid; xtd_device.path_rec.numb_path = 1; xlog(INFO3, "ib_sa_path_rec_get"); xtd_device.path_rec_qstatus = 1; retc = ib_sa_path_rec_get(xtd_device.hcap, xtd_device.port, &xtd_device.path_rec, IB_SA_PATH_REC_DLID | IB_SA_PATH_REC_SLID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH, MAX_TIMEOUT_MS, MAX_RETRIES, GFP_KERNEL, xtd_path_rec_completion, NULL, &queryp); if (retc < 0) { xlog(ERROR, "ib_sa_path_rec_get failed with error %d", retc); return retc; } xlog(INFO4, "waiting for ib_sa_path_rec_get completion"); wtimeout = wait_event_timeout(xtd_device.waitq, xtd_device.path_rec_qstatus <= 0, MAX_WAIT_TIMEOUT); if (wtimeout == 0) { xlog(ERROR, "wait timed out"); return -ETIME; } xlog(INFO4, "ib_sa_path_rec_get operation completed"); return retc; } static int find_active_port(struct ib_device *devicep) { int i; struct ib_port_attr pattr; int retc; for(i = 1; i <= devicep->phys_port_cnt; i++) { retc = ib_query_port(devicep, i, &pattr); if (retc) { xlog(ERROR, "ib_query_port() on port %d failed ", "with retc %d", i, retc); continue; } /* * use the first active found */ if (pattr.state == IB_PORT_ACTIVE) { xlog(INFO4, "using active port 
%d", i); xtd_device.port = i; return 0; } } return 1; } static int free_ib(void) { if (xtd_device.listen_idp) { xlog(INFO3, "ib_destroy_listen_id"); ib_destroy_cm_id(xtd_device.listen_idp); xtd_device.listen_idp = NULL; } if (xtd_device.qpp) { xlog(INFO3, "ib_destroy_qp"); ib_destroy_qp(xtd_device.qpp); xtd_device.qpp = NULL; } if (xtd_device.cqp) { xlog(INFO3, "ib_destroy_cq"); ib_destroy_cq(xtd_device.cqp); xtd_device.cqp = NULL; xtd_device.cq_size = 0; } if (xtd_device.cm_idp) { xlog(INFO3, "ib_destroy_cm_id"); ib_destroy_cm_id(xtd_device.cm_idp); xtd_device.cm_idp = NULL; } if (xtd_device.pdp) { xlog(INFO3, "ib_dealloc_pd"); ib_dealloc_pd(xtd_device.pdp); xtd_device.pdp = NULL; } return 0; } static int send_req(void) { struct ib_cm_req_param req; int wtimeout; int retc; memset(&req, 0, sizeof req); req.primary_path = &xtd_device.path_rec; req.service_id = __constant_cpu_to_be64(SERVICE_ID); req.qp_num = xtd_device.qpp->qp_num; req.qp_type = xtd_device.qpp->qp_type; req.srq = (xtd_device.qpp->srq != NULL); req.responder_resources = 1; req.initiator_depth = 1; req.remote_cm_response_timeout = 20; req.local_cm_response_timeout = 20; req.retry_count = 5; req.max_cm_retries = 5; req.starting_psn = xtd_device.qpp->qp_num; xlog(INFO3, "ib_send_cm_req"); retc = ib_send_cm_req(xtd_device.cm_idp, &req); if (retc) { xlog(ERROR, "ib_send_cm_req() failed with error %d", retc); return retc; } xlog(INFO5, "cm_id->state = %x", xtd_device.cm_idp->state); xtd_device.conn_state = CONNECTING; xlog(INFO4, "waiting for connection to establish"); wtimeout = wait_event_timeout(xtd_device.waitq, xtd_device.conn_state == ESTABLISHED, MAX_WAIT_TIMEOUT); xlog(INFO5, "cm_id->state = %x", xtd_device.cm_idp->state); if (wtimeout == 0) { xlog(ERROR, "wait timed out"); return -ETIME; } xlog(INFO4, "connection established"); return retc; } static int disconnect(void) { int wtimeout; int retc; xlog(INFO3, "ib_send_cm_dreq"); retc = ib_send_cm_dreq(xtd_device.cm_idp, NULL, 0); if (retc) { 
xlog(ERROR, "ib_send_cm_dreq() failed with error %d", retc); return retc; } xtd_device.conn_state = DISCONNECTING; xlog(INFO4, "waiting for connection to disconnect"); wtimeout = wait_event_timeout(xtd_device.waitq, xtd_device.conn_state == DISCONNECTED, MAX_WAIT_TIMEOUT); if (wtimeout == 0) { xlog(ERROR, "wait timed out"); return -ETIME; } xlog(INFO4, "connection disconnected"); retc = free_ib(); if (retc) { xlog(ERROR, "free_ib() failed with error %d", retc); return retc; } return retc; } static int connect(void) { int retc; retc = query_path_rec(); if (retc) { return retc; } retc = send_req(); return retc; } static int rtu_handler(struct ib_cm_id *cm_idp, struct ib_cm_event *eventp) { int retc; xlog(INFO3, "modify_qp(IB_QPS_RTS)"); retc = modify_qp(cm_idp, IB_QPS_RTS); if (retc) { return retc; } xtd_device.conn_state = ESTABLISHED; xlog(INFO4, "connection established"); return retc; } static int rep_handler(struct ib_cm_id *cm_idp, struct ib_cm_event *eventp) { int retc; xlog(INFO3, "modify_qp(IB_QPS_INIT)"); retc = modify_qp(cm_idp, IB_QPS_INIT); if (retc) { return retc; } xlog(INFO3, "modify_qp(IB_QPS_RTR)"); retc = modify_qp(cm_idp, IB_QPS_RTR); if (retc) { return retc; } xlog(INFO3, "modify_qp(IB_QPS_RTS)"); retc = modify_qp(cm_idp, IB_QPS_RTS); if (retc) { return retc; } retc = alloc_recv_buffer(); if (retc) { return retc; } xlog(INFO3, "ib_send_cm_rtu"); retc = ib_send_cm_rtu(xtd_device.cm_idp, NULL, 0); if (retc) { xlog(ERROR, "ib_send_cm_rtu() failed with error %d", retc); return retc; } xtd_device.conn_state = ESTABLISHED; xlog(INFO4, "connection established, wakeup sleeping task"); wake_up(&xtd_device.waitq); return retc; } static int req_handler(struct ib_cm_id *cm_idp, struct ib_cm_event *eventp) { struct ib_cm_req_event_param req; struct ib_cm_rep_param rep; int retc; xtd_device.cm_idp = cm_idp; req = eventp->param.req_rcvd; xlog(INFO3, "modify_qp(IB_QPS_INIT)"); retc = modify_qp(cm_idp, IB_QPS_INIT); if (retc) { return retc; } xlog(INFO3, 
"modify_qp(IB_QPS_RTR)"); retc = modify_qp(cm_idp, IB_QPS_RTR); if (retc) { return retc; } retc = alloc_recv_buffer(); if (retc) { return retc; } memset(&rep, 0, sizeof rep); rep.qp_num = xtd_device.qpp->qp_num; rep.srq = (xtd_device.qpp->srq != NULL); rep.starting_psn = xtd_device.qpp->qp_num; rep.responder_resources = req.responder_resources; rep.initiator_depth = req.initiator_depth; rep.target_ack_delay = 20; rep.flow_control = req.flow_control; rep.rnr_retry_count = req.rnr_retry_count; xlog(INFO3, "ib_send_cm_rep"); retc = ib_send_cm_rep(xtd_device.cm_idp, &rep); if (retc) { xlog(ERROR, "ib_send_cm_rep() failed with error %d", retc); return retc; } return retc; } static int xtd_cm_handler(struct ib_cm_id *cm_idp, struct ib_cm_event *eventp) { int retc; xlog(INFO4, "xtd_cm_handler(%d)", eventp->event); switch(eventp->event) { case IB_CM_REQ_RECEIVED: retc = req_handler(cm_idp, eventp); break; case IB_CM_REP_RECEIVED: retc = rep_handler(cm_idp, eventp); break; case IB_CM_RTU_RECEIVED: retc = rtu_handler(cm_idp, eventp); break; case IB_CM_DREQ_RECEIVED: xlog(INFO3, "ib_send_cm_drep"); retc = ib_send_cm_drep(cm_idp, NULL, 0); if (retc) { xlog(ERROR, "ib_send_cm_rtu() failed with " "error %d", retc); } free_ib(); break; case IB_CM_DREP_RECEIVED: xtd_device.conn_state = DISCONNECTED; xlog(INFO4, "drep received, wakeup sleeping task"); wake_up(&xtd_device.waitq); break; case IB_CM_REQ_ERROR: xlog(ERROR, "xtd_cm_handler(IB_CM_REQ_ERROR) event " "not processed", eventp->event); break; default: xlog(ERROR, "xtd_cm_handler(%d) event not processed", eventp->event); break; } return 0; } static int setrole_secondary(void) { int retc; xlog(INFO3, "ib_create_cm_id"); xtd_device.listen_idp = ib_create_cm_id(xtd_device.hcap, xtd_cm_handler, NULL); if (IS_ERR(xtd_device.listen_idp)) { retc = PTR_ERR(xtd_device.listen_idp); xlog(ERROR, "ib_create_cm_id(listen) failed with error %lld", retc); xtd_device.listen_idp = NULL; free_ib(); return retc; } xlog(INFO3, "ib_cm_listen"); 
retc = ib_cm_listen(xtd_device.listen_idp, __constant_cpu_to_be64(SERVICE_ID), 0, NULL); if (retc) { xlog(ERROR, "ib_cm_listen() failed with retc %d", retc); } xlog(INFO3, "listening ..."); return retc; } static int init_ib(int role) { struct ib_qp_init_attr qp_attr; int retc; retc = find_active_port(xtd_device.hcap); if (retc) { xlog(WARN, "xtd_add_one: no active port found, " "defaulting to port 1"); xtd_device.port = 1; } xlog(INFO3, "ib_alloc_pd"); xtd_device.pdp = ib_alloc_pd(xtd_device.hcap); if (IS_ERR(xtd_device.pdp)) { retc = PTR_ERR(xtd_device.pdp); xlog(ERROR, "ib_alloc_pd() failed with error %d", retc); xtd_device.pdp = NULL; return retc; } xlog(INFO3, "ib_create_cm_id"); xtd_device.cm_idp = ib_create_cm_id(xtd_device.hcap, xtd_cm_handler, NULL); if (IS_ERR(xtd_device.cm_idp)) { retc = PTR_ERR(xtd_device.cm_idp); xlog(ERROR, "ib_create_cm_id() failed with error %lld", retc); xtd_device.cm_idp = NULL; free_ib(); return retc; } xlog(INFO3, "ib_create_cq"); xtd_device.cqp = ib_create_cq(xtd_device.hcap, xtd_completion_handler, xtd_completion_event_handler, NULL, MAX_CQ_SIZE); if (IS_ERR(xtd_device.cqp)) { retc = PTR_ERR(xtd_device.cqp); xlog(ERROR, "ib_create_cq() failed with error %d", retc); xtd_device.cqp = NULL; free_ib(); return retc; } xtd_device.cq_size = xtd_device.cqp->cqe; xlog(INFO3, "ib_req_notify_cq"); retc = ib_req_notify_cq(xtd_device.cqp, IB_CQ_NEXT_COMP); if (retc != 0) { xlog(ERROR, "ib_req_notify_cq() failed with error %d", retc); free_ib(); return retc; } memset(&qp_attr, 0, sizeof qp_attr); qp_attr.event_handler = xtd_qp_event_handler; qp_attr.cap.max_send_wr = MAX_SEND_WR; qp_attr.cap.max_recv_wr = MAX_RECV_WR; qp_attr.cap.max_send_sge = 1; qp_attr.cap.max_recv_sge = 1; qp_attr.sq_sig_type = IB_SIGNAL_ALL_WR; qp_attr.qp_type = IB_QPT_RC; qp_attr.send_cq = xtd_device.cqp; qp_attr.recv_cq = xtd_device.cqp; xlog(INFO3, "ib_create_qp"); xtd_device.qpp = ib_create_qp(xtd_device.pdp, &qp_attr); if (IS_ERR(xtd_device.qpp)) { retc = 
PTR_ERR(xtd_device.qpp); xlog(ERROR, "ib_create_qp() failed with error %d", retc); xtd_device.qpp = NULL; free_ib(); return retc; } xtd_device.conn_state = INIT; init_waitqueue_head(&xtd_device.waitq); if (role == SECONDARY) { retc = setrole_secondary(); } return retc; } int xtd_proc_write(struct file *file, const char *buffer, unsigned long length, void *data) { char command[MAX_BUFFER]; int retc = 0; if (length > MAX_BUFFER) { xlog(ERROR, "user buffer overflow when writing to proc"); return -EINVAL; } if (! buffer) { xlog(ERROR, "user buffer was null when writing to proc"); return -EINVAL; } memset(&command, 0, sizeof command); retc = copy_from_user(&command, buffer, length); if (retc) { xlog(ERROR, "copy_from_user() failed"); return retc; } xlog(INFO1, "%s", command); if (strncmp("set primary", command, 11) == 0) { xtd_device.slid = simple_strtoul(&command[12], NULL, 16); retc = init_ib(PRIMARY); return length; } if (strncmp("set secondary", command, 13) == 0) { xtd_device.slid = simple_strtoul(&command[14], NULL, 16); retc = init_ib(SECONDARY); return length; } if (strncmp("connect", command, 7) == 0) { xtd_device.dlid = simple_strtoul(&command[8], NULL, 16); retc = connect(); return length; } if (strncmp("disconnect", command, 10) == 0) { retc = disconnect(); return length; } return length; } static int clean_procfs(void) { if (xtd_procfile) { remove_proc_entry(PROCDIR"/"PROCFILE, NULL); xtd_procfile = NULL; } if (xtd_procdir) { remove_proc_entry(PROCDIR, NULL); xtd_procdir = NULL; } return 0; } static int make_procfs(void) { int error = 0; xtd_procdir = proc_mkdir(PROCDIR, NULL); if (! xtd_procdir) { xlog(ERROR, "proc_mkdir(%s) failed", PROCDIR); error = -ENOMEM; goto err_out; } xtd_procfile = create_proc_entry(PROCFILE, 0644, xtd_procdir); if (! 
xtd_procfile) { xlog(ERROR, "create_proc_entry(%s) failed", PROCDIR"/"PROCFILE); error = -ENOMEM; clean_procfs(); goto err_out; } xtd_procfile->owner = THIS_MODULE; xtd_procfile->write_proc = xtd_proc_write; err_out: return error; } static void xtd_remove_one(struct ib_device *devicep) { memset(&xtd_device, 0, sizeof (xtd_device)); return; } static void xtd_add_one(struct ib_device *devicep) { int type; type = rdma_node_get_transport(devicep->node_type); if (type != RDMA_TRANSPORT_IB) { xlog(ERROR, "transport is not RDMA_TRANSPORT_IB, returned %d", type); return; } memset(&xtd_device, 0, sizeof (xtd_device)); xtd_device.hcap = devicep; return; } static int __init xtd_init(void) { int retc; retc = make_procfs(); if (retc) { return retc; } xlog(INFO3, "ib_register_client"); retc = ib_register_client(&xtd_client); if (retc) { xlog(ERROR, "ib_register_client failed with retc=%d", retc); clean_procfs(); return retc; } return retc; } static void __exit xtd_exit(void) { free_ib(); xlog(INFO3, "ib_unregister_client"); ib_unregister_client(&xtd_client); clean_procfs(); return; } module_init(xtd_init); module_exit(xtd_exit); From sean.hefty at intel.com Mon Apr 24 21:02:42 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 24 Apr 2006 21:02:42 -0700 Subject: [openib-general] Re: ib_send_cm_req fails with error -22 In-Reply-To: Message-ID: >static int >query_path_rec(void) >{ > union ib_gid sgid; > struct ib_sa_query *queryp; > int wtimeout; > int retc; > > xlog(INFO3, "ib_query_gid"); > retc = ib_query_gid(xtd_device.hcap, xtd_device.port, 0, &sgid); > if (retc) { > xlog(ERROR, "ib_query_gid() failed with error %d", retc); > return retc; > } > > xlog(INFO4, "sgid.prefix %llx, sgid.id %llx", > __constant_be64_to_cpu(sgid.global.subnet_prefix), > __constant_be64_to_cpu(sgid.global.interface_id)); > xlog(INFO4, "slid 0x%x, dlid 0x%x, port %d", > xtd_device.slid, xtd_device.dlid, xtd_device.port); > > xtd_device.path_rec.dlid = cpu_to_be16(xtd_device.dlid); > 
>    xtd_device.path_rec.slid = cpu_to_be16(xtd_device.slid);
>    xtd_device.path_rec.sgid = sgid;
>    xtd_device.path_rec.numb_path = 1;
>
>    xlog(INFO3, "ib_sa_path_rec_get");
>    xtd_device.path_rec_qstatus = 1;
>    retc = ib_sa_path_rec_get(xtd_device.hcap,
>        xtd_device.port,
>        &xtd_device.path_rec,
>        IB_SA_PATH_REC_DLID | IB_SA_PATH_REC_SLID |
>        IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH,

Using IB_SA_PATH_REC_SGID in the mask is what's requiring setting the sgid when querying for the path. If you remove this from the mask, you can also remove setting the sgid above.

>        MAX_TIMEOUT_MS,
>        MAX_RETRIES,
>        GFP_KERNEL,
>        xtd_path_rec_completion,
>        NULL,
>        &queryp);
>    if (retc < 0) {
>        xlog(ERROR, "ib_sa_path_rec_get failed with error %d", retc);
>        return retc;
>    }
>
>    xlog(INFO4, "waiting for ib_sa_path_rec_get completion");
>    wtimeout = wait_event_timeout(xtd_device.waitq,
>        xtd_device.path_rec_qstatus <= 0, MAX_WAIT_TIMEOUT);
>    if (wtimeout == 0) {
>        xlog(ERROR, "wait timed out");
>        return -ETIME;
>    }
>
>    xlog(INFO4, "ib_sa_path_rec_get operation completed");
>    return retc;
>}

>static int
>send_req(void)
>{
>    struct ib_cm_req_param req;
>    int wtimeout;
>    int retc;
>
>    memset(&req, 0, sizeof req);
>    req.primary_path = &xtd_device.path_rec;
>    req.service_id = __constant_cpu_to_be64(SERVICE_ID);
>    req.qp_num = xtd_device.qpp->qp_num;
>    req.qp_type = xtd_device.qpp->qp_type;
>    req.srq = (xtd_device.qpp->srq != NULL);
>
>    req.responder_resources = 1;
>    req.initiator_depth = 1;

Note: you only need to set responder_resources and initiator_depth if you'll be doing RDMA reads or atomic operations. I set them in cmpost only for testing.
>    req.remote_cm_response_timeout = 20;
>    req.local_cm_response_timeout = 20;
>    req.retry_count = 5;
>    req.max_cm_retries = 5;
>    req.starting_psn = xtd_device.qpp->qp_num;
>
>    xlog(INFO3, "ib_send_cm_req");
>    retc = ib_send_cm_req(xtd_device.cm_idp, &req);
>    if (retc) {
>        xlog(ERROR, "ib_send_cm_req() failed with error %d", retc);
>        return retc;
>    }
>
>    xlog(INFO5, "cm_id->state = %x", xtd_device.cm_idp->state);
>    xtd_device.conn_state = CONNECTING;
>    xlog(INFO4, "waiting for connection to establish");
>    wtimeout = wait_event_timeout(xtd_device.waitq,
>        xtd_device.conn_state == ESTABLISHED, MAX_WAIT_TIMEOUT);
>    xlog(INFO5, "cm_id->state = %x", xtd_device.cm_idp->state);
>    if (wtimeout == 0) {
>        xlog(ERROR, "wait timed out");
>        return -ETIME;
>    }
>
>    xlog(INFO4, "connection established");
>    return retc;
>}

Nothing jumps out at me as being off here. When the REQ error is reported, what value is reported in send_status?

Also, in the util directory, there's a program called madeye that will snoop MAD traffic sent and received on the nodes. Can you try loading the module with the following:

insmod ib_madeye.ko "smp=0" "gmp=0" "mgmt_class=7"?

You should load this on both the client and server nodes. This should cause madeye to display all IB CM messages that are sent or received on a system. I want to determine if the CM REQ is being received by the server.

- Sean

From sean.hefty at intel.com Mon Apr 24 21:06:09 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 24 Apr 2006 21:06:09 -0700
Subject: [openib-general] Re: ib_send_cm_req fails with error -22
In-Reply-To:
Message-ID:

>static void
>xtd_path_rec_completion(int status, struct ib_sa_path_rec *resp, void
>*contextp)
>{
>    xlog(INFO4, "ib_sa_path_rec_get completed with status %d, "
>        "wakeup sleeping task", status);

Insert:
    xtd_device.path_rec = *resp;

>    xtd_device.path_rec_qstatus = 0;
>    wake_up(&xtd_device.waitq);
>    return;
>}

Actually, you may be able to ignore my other reply.
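The `Insert:` line in the quoted callback copies the SA's answer over the record that was used to build the query. The record passed to the query only carries the fields the caller filled in (here the LIDs, the sgid and numb_path), while the response presumably carries the complete path. A minimal user-space sketch of that idea follows; `mock_path_rec`, `mock_device` and `mock_path_rec_completion` are hypothetical stand-ins for illustration, not the kernel's `struct ib_sa_path_rec` or its SA query API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for a path record. */
struct mock_path_rec {
    uint16_t dlid;
    uint16_t slid;
    uint16_t pkey;              /* only known once the SA has answered */
    uint8_t  mtu;               /* only known once the SA has answered */
};

/* Hypothetical stand-in for the driver's per-device state. */
struct mock_device {
    struct mock_path_rec path_rec; /* query input, later reused for the REQ */
    int path_rec_qstatus;          /* 1 = pending, 0 = done, <0 = error */
};

/* Completion callback modelled on xtd_path_rec_completion(). */
static void mock_path_rec_completion(int status,
                                     const struct mock_path_rec *resp,
                                     void *context)
{
    struct mock_device *dev = context;

    if (status == 0 && resp != NULL) {
        /* Keep the SA's full answer, not the partial query record. */
        dev->path_rec = *resp;
        dev->path_rec_qstatus = 0;
    } else {
        /* Report failure so the waiter does not use a stale record. */
        dev->path_rec_qstatus = (status < 0) ? status : -1;
    }
}
```

Unlike the quoted callback, this sketch also records a failed query as a negative status, so a waiter checking `path_rec_qstatus <= 0` can tell success from failure before building the REQ.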
You need to copy the response from the SA.

- Sean

From tziporet at mellanox.co.il Mon Apr 24 23:55:08 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 25 Apr 2006 09:55:08 +0300
Subject: [openib-general] [PATCH 0 of 13] ipath - various fixes and cleanups
In-Reply-To:
References:
Message-ID: <444DC7CC.4010506@mellanox.co.il>

Bryan O'Sullivan wrote:
> Hi, Roland -
>
> Here is a set of bug fixes and cleanups for the ipath driver. Please apply.
>
>

Hi Bryan,

Is it important to include those patches in OFED RC4? If yes, since Roland is not in this week, we can put them under the fixes directory and apply them during the install.

Tziporet

From akpm at osdl.org Tue Apr 25 00:56:54 2006
From: akpm at osdl.org (Andrew Morton)
Date: Tue, 25 Apr 2006 00:56:54 -0700
Subject: [openib-general] Re: [PATCH 8 of 13] ipath - fix a number of RC protocol bugs
In-Reply-To:
References:
Message-ID: <20060425005654.4c08481f.akpm@osdl.org>

"Bryan O'Sullivan" wrote:
>
> + BUG_ON(qp->timerwait.next != LIST_POISON1);
> + list_add_tail(&qp->timerwait, &dev->pending[dev->pending_index]);

Please don't play around with list_head internals like this - some speedfreak might legitimately choose to remove the list_head poisoning debug code, or make it Kconfigurable.

One option would be to always do list_del_init() on this thing, then do BUG_ON(!list_empty()).

From leonida at voltaire.com Tue Apr 25 02:14:09 2006
From: leonida at voltaire.com (Leonid Arsh)
Date: Tue, 25 Apr 2006 12:14:09 +0300
Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval
In-Reply-To:
References:
Message-ID: <444DE861.4010803@voltaire.com>

You are right - different HCAs and adapters may need specific tuning.

The Mellanox VAPI adapter, handling completions in a tasklet, will definitely suffer from CQ splitting, since there may be only one tasklet running across all the CPUs.
The mthca adapter is a completely different case - the completions are handled in HW interrupt context.

I'm not familiar with the other adapters, ehca and ipath. Do they notify completions via HW interrupt, tasklet or soft IRQ?

Bernard King-Smith wrote:
> Leonid Arsh wrote:
> Leonid> Shirley,
>
> Leonid> some additional information you may be interested in:
>
> Leonid> According to our experience with the Voltaire IPoIB driver,
> Leonid> splitting CQ harmed the throughput (we checked with the iperf
> Leonid> application, UDP mode.) Splitting the CQ caused more interrupts,
> Leonid> context switches and CQ polls.
>
> Interesting results. I think some of Shirley's work reduced the number of
> interrupts on ehca so this is starting to sound like a one size does not
> fit all driver approach. I wonder what Pathscale see if they split the
> completion queues?
>
> Leonid> Note, the case is rather different from OpenIB mthca, since Voltaire
> Leonid> IPoIB is based on the VAPI driver,
> Leonid> where CQ completions are handled in a tasklet context,
> Leonid> unlike mthca where CQ completions are handled in the HW interrupt
> Leonid> context.
>
> Another question is what do we do about adapter specific code where each
> adapter type ( ehca, mthca, Voltaire and Pathscale ) can all provide better
> performance if adapter specific code and tuning is required?
>
> Leonid> NAPI gave us some improvement. I think NAPI should improve much more
> Leonid> in mthca, with the HW interrupt CQ completions.
>
> However, I don't believe that NAPI can provide the same benefit for all the
> driver models listed above. It may help in overall interrupt handling, but
> there is probably a need for additional adapter/driver specific tuning.
> Some of these may end up requiring support in the OpenIB stack.
>
> There are many cases not covered by using Netperf, and netpipe that show
> improved performance.
> These cases are running multiple sockets per
> link/adapter, and the case where you have a larger machine where you have
> multiple adapters. I haven't seen any data recently on duplex traffic, only
> STREAM ( or unidirectional) either.
>
> Bernie King-Smith
> IBM Corporation
> Server Group
> Cluster System Performance
> wombat2 at us.ibm.com (845)433-8483
> Tie. 293-8483 or wombat2 on NOTES
>
> "We are not responsible for the world we are born into, only for the world
> we leave when we die.
> So we have to accept what has gone before us and work to change the only
> thing we can,
> -- The Future." William Shatner
>
>

From leonida at voltaire.com Tue Apr 25 02:15:10 2006
From: leonida at voltaire.com (Leonid Arsh)
Date: Tue, 25 Apr 2006 12:15:10 +0300
Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval
In-Reply-To:
References:
Message-ID: <444DE89E.3060104@voltaire.com>

Please correct me if I'm mistaken.

I think that soft IRQ is used by the kernel for handling the received packets - NET_RX_SOFTIRQ.

For the non-NAPI case, we poll the CQ for completions and call netif_rx() in the completion notification context - the HW interrupt context in case of mthca. netif_rx() queues the received skb and schedules the NET_RX_SOFTIRQ by calling netif_rx_schedule(). In the same context the driver requests the HCA to notify new completions. The skb will be passed to the TCP/IP stack later in the soft IRQ context by calling netif_receive_skb() (see the process_backlog() kernel function).

For the NAPI case, the driver shall only call netif_rx_schedule() in the completion notification context - the HW interrupt context in case of mthca. The CQ polling and passing the skb to the TCP/IP stack will be done in the NET_RX_SOFTIRQ context in the driver's poll() function. Only after that the driver will request the HCA to notify new completions.
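The two receive paths described above differ mainly in when completion notification is re-armed. As a rough illustration only, the toy model below counts interrupts for a steady stream of completions, one per tick. The `simulate` function, the tick model and the `rearm_delay` parameter are assumptions made for this sketch, not driver code: a delay of 0 stands for re-arming inside the interrupt handler (the non-NAPI path), and a larger delay stands for a poll routine that runs later in soft IRQ context and re-arms only after draining the queue.

```c
#include <stdbool.h>

struct sim_result {
    int interrupts;   /* interrupts taken */
    int processed;    /* completions handed to the stack */
};

/*
 * Toy model: one completion arrives per tick for n_completions ticks.
 * An interrupt fires only if notification is armed when a completion
 * arrives. The queue is drained and notification re-armed rearm_delay
 * ticks after the interrupt.
 */
static struct sim_result simulate(int n_completions, int rearm_delay)
{
    struct sim_result r = { 0, 0 };
    int queued = 0;
    int rearm_at = -1;
    bool armed = true;

    for (int tick = 0; tick < n_completions || queued > 0 || !armed; tick++) {
        if (tick < n_completions) {
            queued++;
            if (armed) {
                r.interrupts++;
                armed = false;               /* notification is one-shot */
                rearm_at = tick + rearm_delay;
            }
        }
        if (!armed && tick >= rearm_at) {
            r.processed += queued;           /* drain everything queued */
            queued = 0;
            armed = true;                    /* late re-arm */
        }
    }
    return r;
}
```

In this model `simulate(8, 0)` takes one interrupt per completion, while `simulate(8, 3)` handles the same eight completions with two interrupts, which is the mitigation effect the message attributes to the late notification request.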
The late completion notification request should cause the interrupt mitigation -- and this is the main NAPI advantage, in my opinion.

Of course, the NAPI implementation may need the CQ splitting. On the other hand, polling the CQ for the send and receive completions in the same context may be an advantage for bidirectional traffic.

BTW, the HW interrupt coalescing used to help us in some cases.

Regards,
Leonid

Shirley Ma wrote:
>
> Hello Leonid,
>
> Leonid Arsh wrote on 04/23/2006 06:38:00 AM:
> > Shirley,
> >
> > some additional information you may be interested in:
> >
> > According to our experience with the Voltaire IPoIB driver,
> > splitting CQ harmed the throughput (we checked with the iperf
> > application, UDP mode.) Splitting the CQ caused more interrupts,
> > context switches and CQ polls.
> > Note, the case is rather different from OpenIB mthca, since Voltaire
> > IPoIB is based on the VAPI driver,
> > where CQ completions are handled in a tasklet context,
> > unlike mthca where CQ completions are handled in the HW interrupt
> > context.
>
> That expected because only one tasklet is allowed running across all
> cpus in the same time.
> Have you tried to use other SOFTIRQ instead of TASKLET_SOFTIRQ?
> My expectation is the performance will be better since there would be
> multiple softirqs running simultaneously. If it's a simple change of your code,
> could you please try it?
>
> I am thinking to split mthca CQ completion into HW interrupt and
> softirq context.
>
> > NAPI gave us some improvement. I think NAPI should improve much more
> > in mthca, with the HW interrupt CQ completions.
> Yes, with the hardware interrupts are disabled.
>
> > It would be interesting to compare the completion CQ with NAPI and in
> softirq context.
>
> It all depends on how you implement NAPI. If you only implement NAPI
> without changing the sender, NAPI might not get better performance than softirq.
> The benefit of NAPI, it has one dev->poll running across all cpus to
> prevent packets out of order totally.
>
> Thanks
>
> Shirley Ma
> IBM Linux Technology Center
> 15300 SW Koll Parkway
> Beaverton, OR 97006-6063
> Phone(Fax): (503) 578-7638

From SCHICKHJ at de.ibm.com Tue Apr 25 02:43:33 2006
From: SCHICKHJ at de.ibm.com (Heiko J Schick)
Date: Tue, 25 Apr 2006 11:43:33 +0200
Subject: [openib-general] [PATCH 2/2] Wean libehca off of libsysfs
In-Reply-To:
Message-ID:

An HTML attachment was scrubbed...
URL:

From segher at kernel.crashing.org Tue Apr 25 02:32:24 2006
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Tue, 25 Apr 2006 11:32:24 +0200
Subject: [openib-general] Re: [PATCH 4 of 13] ipath - change handling of PIO buffers
In-Reply-To: <8e724d49e74bc1155f4e.1145913780@eng-12.pathscale.com>
References: <8e724d49e74bc1155f4e.1145913780@eng-12.pathscale.com>
Message-ID: <1687B2F9-E947-4400-872E-163133854E13@kernel.crashing.org>

> + * The problem with this is that it's global, but we'll use different
> + * numbers for different chip types. So the default value is not
> + * very useful. I've redefined it for the 1.3 release so that it's
---------------------------------------------^^^

Change this to 2.6.17?

> + * zero unless set by the user to something else, in which case we
> + * try to respect it.

Segher

From tziporet at mellanox.co.il Tue Apr 25 05:13:20 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 25 Apr 2006 15:13:20 +0300
Subject: [openib-general] do we want saquery utility from diags in the OFED release
Message-ID: <444E1260.1030006@mellanox.co.il>

I mean the new utility in:
https://openib.org/svn/gen2/trunk/src/userspace/management/diags/src/saquery.c

Tziporet

From halr at voltaire.com Tue Apr 25 05:30:38 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 25 Apr 2006 08:30:38 -0400
Subject: [openib-general] Re: do we want saquery utility from diags in the OFED release
In-Reply-To: <444E1260.1030006@mellanox.co.il>
References: <444E1260.1030006@mellanox.co.il>
Message-ID: <1145968236.2124.9385.camel@hal.voltaire.com>

On Tue, 2006-04-25 at 08:13, Tziporet Koren wrote:
> I mean the new utility in:
> https://openib.org/svn/gen2/trunk/src/userspace/management/diags/src/saquery.c

It's experimental IMO. I'm still working through an open issue or two. I'm not ready to support this with 1.0/OFED yet.

-- Hal

> Tziporet

From tziporet at mellanox.co.il Tue Apr 25 06:11:53 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 25 Apr 2006 16:11:53 +0300
Subject: [openib-general] RE: do we want saquery utility from diags in the OFED release
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA6E10@mtlexch01.mtl.com>

Please write me when you want it in (maybe in RC5 time frame)

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Tuesday, April 25, 2006 3:31 PM
To: Tziporet Koren
Cc: OPENIB
Subject: Re: do we want saquery utility from diags in the OFED release

On Tue, 2006-04-25 at 08:13, Tziporet Koren wrote:
> I mean the new utility in:
> https://openib.org/svn/gen2/trunk/src/userspace/management/diags/src/saquery.c

It's experimental IMO. I'm still working through an open issue or two. I'm not ready to support this with 1.0/OFED yet.
-- Hal > Tziporet From swise at opengridcomputing.com Tue Apr 25 06:14:00 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 25 Apr 2006 08:14:00 -0500 Subject: [openib-general] dapltest question References: <1145916893.13656.9.camel@stevo-desktop> Message-ID: <003c01c6686a$1e5ad850$020010ac@haggard> Never mind. I figured it out. dapltest does post sends from both the client and server just after the connection is setup. Thanks, Stevo. ----- Original Message ----- From: "Steve Wise" To: "James Lentini" Cc: "openib-general" Sent: Monday, April 24, 2006 5:14 PM Subject: [openib-general] dapltest question | Hey James, | | When running something like this: | | #==================================================================== | #client6 | #==================================================================== | ./dapltest -T T -s ${host} -D ${device} -i 10000 -t 4 -w 8 \ | client SR 256 \ | server RW 4096 \ | server SR 256 \ | client SR 256 \ | server RW 4096 \ | server SR 256 \ | client SR 4096 \ | server SR 256 | | | Do the transactions execute in order? IE: Will each client first run | the SR 256 test, then the server will run the RW 4096 test, etc? Or do | the client and server run through their respective tests in parallel? | I'm seeing a problem with the chelsio rnic where it appears that | sometimes the server sends an FPDU first, and that is not allowed in the | mpa spec. In other words, does the dapltest client always send the | first FPDU? | | | MPA draft: | | 4. MPA "Responder" mode implementations MUST receive and validate at | least one FPDU before sending any FPDUs or markers. | | | Thanks, | | Steve. 
| | | | | _______________________________________________ | openib-general mailing list | openib-general at openib.org | http://openib.org/mailman/listinfo/openib-general | | To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general | From jlentini at netapp.com Tue Apr 25 06:44:22 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 25 Apr 2006 09:44:22 -0400 (EDT) Subject: [openib-general] dapltest question In-Reply-To: <003c01c6686a$1e5ad850$020010ac@haggard> References: <1145916893.13656.9.camel@stevo-desktop> <003c01c6686a$1e5ad850$020010ac@haggard> Message-ID: A single dapltest invocation will result in multiple connections. The number depends on the command line. There is an initial connection to setup the test in which configuration information is exchanged (test type, parameters, etc.). Additional connections are created for -w # EPs. These connections execute the specified operations in order. For the initial connection, both the client and server should have posted receives ready before the connection is established. Once the connection occurs, both post their configuration information as sends. Let me know if you have more questions. james On Tue, 25 Apr 2006, Steve Wise wrote: > Never mind. I figured it out. dapltest does post sends from both the > client and server just after the connection is setup. > > Thanks, > > > Stevo. 
> > ----- Original Message ----- > From: "Steve Wise" > To: "James Lentini" > Cc: "openib-general" > Sent: Monday, April 24, 2006 5:14 PM > Subject: [openib-general] dapltest question > > > | Hey James, > | > | When running something like this: > | > | #==================================================================== > | #client6 > | #==================================================================== > | ./dapltest -T T -s ${host} -D ${device} -i 10000 -t 4 -w 8 \ > | client SR 256 \ > | server RW 4096 \ > | server SR 256 \ > | client SR 256 \ > | server RW 4096 \ > | server SR 256 \ > | client SR 4096 \ > | server SR 256 > | > | > | Do the transactions execute in order? IE: Will each client first run > | the SR 256 test, then the server will run the RW 4096 test, etc? Or > do > | the client and server run through their respective tests in parallel? > | I'm seeing a problem with the chelsio rnic where it appears that > | sometimes the server sends an FPDU first, and that is not allowed in > the > | mpa spec. In other words, does the dapltest client always send the > | first FPDU? > | > | > | MPA draft: > | > | 4. MPA "Responder" mode implementations MUST receive and validate at > | least one FPDU before sending any FPDUs or markers. > | > | > | Thanks, > | > | Steve. 
> | > | > | > | > | _______________________________________________ > | openib-general mailing list > | openib-general at openib.org > | http://openib.org/mailman/listinfo/openib-general > | > | To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > | > > From xma at us.ibm.com Tue Apr 25 07:22:03 2006 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 25 Apr 2006 07:22:03 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: <444DE89E.3060104@voltaire.com> Message-ID: Leonid, There is no doubt NAPI helping interrupts mitigation, throughput, packets out of order and balance between latency and throughput. But NAPI might not help all of the devices since different drivers have different implementations for CQ completion handler. I am working on a patch to use multiple threads work queue for ipoib completion polling. Have you tried to this on your driver? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Tue Apr 25 07:58:39 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 25 Apr 2006 17:58:39 +0300 Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: References: <444DE89E.3060104@voltaire.com> Message-ID: <20060425145839.GC31324@mellanox.co.il> Quoting r. Shirley Ma : > different drivers have different implementations for CQ completion handler. Maybe these drivers should be changed then? Its a bit hard for me to imagine a driver that doesn't get hardware interrupts in IRQ context. So why can't completion handler be called directly from there as well? 
-- MST From tziporet at mellanox.co.il Tue Apr 25 08:12:55 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 25 Apr 2006 18:12:55 +0300 Subject: [openib-general] 1.0 RC3 schedule update Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA6E19@mtlexch01.mtl.com> Hi Bryan, We will try to update all fixes from the trunk to the branch by this date but I am not sure all will be done. I will update you in case we will not complete it on time; in this case maybe you wish to delay the build to Monday. Tziporet -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Bryan O'Sullivan Sent: Monday, April 24, 2006 7:33 PM To: Hal Rosenstock Cc: openib-general Subject: Re: [openib-general] 1.0 RC3 schedule update On Mon, 2006-04-24 at 11:54 -0400, Hal Rosenstock wrote: > On Mon, 2006-04-24 at 11:51, Bryan O'Sullivan wrote: > > I'm still planning to release 1.0 RC3 on Monday, May 1. If you have > > userspace changes that you want to see included, please commit them to > > the 1.0 branch or send them to me as patches by 6pm GMT (10am California > > time) on Thursday, April 28. > > Isn't it RC4 now (RC3 being skipped to sync with OFED) ? Sorry, yes. That was a before-enough-coffee thinko. Message-ID: "Michael S. Tsirkin" wrote on 04/25/2006 07:58:39 AM: > Quoting r. Shirley Ma : > > different drivers have different implementations for CQ completion handler. > > Maybe these drivers should be changed then? Its a bit hard for me > to imagine a > driver that doesn't get hardware interrupts in IRQ context. So why can't > completion handler be called directly from there as well? > > -- > MST These completion handler are called directly from these, but under different contexts. The drivers does get hardware interrupts in IRQ context, but you can always split the handler into two parts, hardware interrupt context and software interrupt context. The more light weigh in hardware interrupts, the better. 
IPoIB completion polling would be very heavy if the HCA is fast enough. And the
driver implementation shouldn't prevent IPoIB's send and recv CQ handlers from
working simultaneously. We do see a dramatic performance improvement on ehca
with split CQs. The current mthca implementation, which polls CQs in hardware
interrupt context, does prevent the two CQ handlers from running simultaneously,
since there is only one hardware interrupt for both send and recv. I am working
on a patch to see whether using a work queue for IPoIB completion polling with
split CQs would improve performance for all drivers.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From mst at mellanox.co.il Tue Apr 25 08:23:12 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 25 Apr 2006 18:23:12 +0300
Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval
In-Reply-To:
References: <20060425145839.GC31324@mellanox.co.il>
Message-ID: <20060425152312.GE31324@mellanox.co.il>

Quoting r. Shirley Ma :
> The more light weigh in hardware interrupts, the better.

But we don't know whether the ULP will be doing lightweight things in its
completion handler or not.

So it seems better to report completion events in IRQ context from the low-level
driver and leave the decision on whether to use a tasklet, work queue or polling
like NAPI to the ULPs. No?
--
MST

From bos at pathscale.com Tue Apr 25 08:28:50 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Tue, 25 Apr 2006 08:28:50 -0700
Subject: [openib-general] 1.0 RC3 schedule update
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA6E19@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E301FA6E19@mtlexch01.mtl.com>
Message-ID: <1145978930.3683.10.camel@localhost.localdomain>

On Tue, 2006-04-25 at 18:12 +0300, Tziporet Koren wrote:

> We will try to update all fixes from the trunk to the branch by this
> date but I am not sure all will be done.

OK, thanks.

Message-ID:

Michael S. Tsirkin wrote:

Michael> Quoting r. Shirley Ma :
Michael> > different drivers have different implementations for CQ completion handler.

Michael> Maybe these drivers should be changed then? Its a bit hard for me to imagine a
Michael> driver that doesn't get hardware interrupts in IRQ context. So why can't
Michael> completion handler be called directly from there as well?

The problem we are seeing is not how we handle the hardware interrupt, but what
is handled where, after getting the interrupt. All packet processing is handled
by a single processor. On a multiprocessor machine with high traffic rate, the
bandwidth is limited by a single pro, and to different CPU's than the one
handling the interrupt. Running the 10 GigE adapter in classic packet mode
(1500 byte MTU) gets better performance when the packet handling is split
across multiple CPUs.

IB is suffering from this same problem in the completion handler. If you have a
4 CPU machine and run two sockets over the same adapter/link using NETPERF,
mpstat -P ALL will show one CPU pegged and the rest much lower in CPU
utilization. The CPU at 100% is the one running the interrupt handler, TCP/IP
stack and completion queue. Shirley's patch is to split the processing so that
completion queue handling can be done on a different CPU than the interrupt
handler.
Yes, there are locking issues with the completion queue being accessed by
different CPUs, but there is an overall gain in bandwidth, especially in duplex
mode. Having all completion queue handling done off the hardware interrupt means
you have serialized the completions for a duplex adapter.

Michael> --
Michael> MST

Bernie King-Smith
IBM Corporation
Server Group
Cluster System Performance
wombat2 at us.ibm.com (845)433-8483
Tie. 293-8483 or wombat2 on NOTES

"We are not responsible for the world we are born into, only for the world
we leave when we die. So we have to accept what has gone before us and work
to change the only thing we can, -- The Future." William Shatner

From xma at us.ibm.com Tue Apr 25 08:38:06 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 25 Apr 2006 08:38:06 -0700
Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval
In-Reply-To: <20060425152312.GE31324@mellanox.co.il>
Message-ID:

"Michael S. Tsirkin" wrote on 04/25/2006 08:23:12 AM:

> Quoting r. Shirley Ma :
> > The more light weigh in hardware interrupts, the better.
>
> But we don't know whether the ULP will be doing lightweight things in completion
> handler or not.
>
> So it seems better to report completion events in IRQ context from low level
> driver and leave the decision on whether to use tasklet, work queue or polling
> like NAPI to the ULPs. No?
>
> --
> MST

Yes. That's what I am working on right now. The first patch will use a work
queue in the IPoIB layer, not the driver layer, to see the performance numbers
with both send/recv polling handlers running simultaneously on mthca.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From mshefty at ichips.intel.com Tue Apr 25 09:40:34 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 25 Apr 2006 09:40:34 -0700
Subject: [openib-general] Re: [PATCH] RDMA CM: allow listen without prior binding to listen on any address
In-Reply-To:
References:
Message-ID: <444E5102.3080100@ichips.intel.com>

Sean Hefty wrote:
> Allow calling rdma_listen() without calling rdma_bind_addr() beforehand.
>
> This will result in binding to any address / any port before listening.

I've committed this change.

- Sean

From mshefty at ichips.intel.com Tue Apr 25 09:41:54 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 25 Apr 2006 09:41:54 -0700
Subject: [openib-general] Re: RFC: cma: need rdma_unbind
In-Reply-To: <1145920526.18808.112.camel@trinity.ogc.int>
References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> <20060424214645.GF20831@mellanox.co.il> <1145917974.18808.90.camel@trinity.ogc.int> <20060424223425.GK20831@mellanox.co.il> <1145918603.18808.94.camel@trinity.ogc.int> <1145920526.18808.112.camel@trinity.ogc.int>
Message-ID: <444E5152.4090708@ichips.intel.com>

Tom Tucker wrote:
> FYI, an application calling shutdown() does not release the port
> synchronously. It is only released after the peers have negotiated the
> close. For SDP, I think this means that both peers have sent and
> received the SDP DisConn message and/or AbortConn message _AND_ you have
> shutdown the LLP (called rdma_disconnect). It is not until after this
> point that you can expect the port to be reuse-able.

Based on Tom's comments, I'm planning on dropping my previous patch that
exposes rdma_unbind.
- Sean From ssbyrn at yahoo.com Tue Apr 25 09:55:16 2006 From: ssbyrn at yahoo.com (susan) Date: Tue, 25 Apr 2006 16:55:16 +0000 (UTC) Subject: [openib-general] Re: ib_send_cm_req fails with error -22 References: <44468CCE.6070202@ichips.intel.com> <44480C85.1040704@ichips.intel.com> <44490DCA.9080609@ichips.intel.com> Message-ID: susan yahoo.com> writes: > > susan yahoo.com> writes: > > > > > Sean Hefty ichips.intel.com> writes: > > > > > > thanks Sean! i will give it try. > > > > susan > > > > > hi Sean, problem solved. i missed setting path_rec correctly on sa path completion. thanks, susan. From mst at mellanox.co.il Tue Apr 25 10:16:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 25 Apr 2006 20:16:07 +0300 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <444E5152.4090708@ichips.intel.com> References: <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> <20060424214645.GF20831@mellanox.co.il> <1145917974.18808.90.camel@trinity.ogc.int> <20060424223425.GK20831@mellanox.co.il> <1145918603.18808.94.camel@trinity.ogc.int> <1145920526.18808.112.camel@trinity.ogc.int> <444E5152.4090708@ichips.intel.com> Message-ID: <20060425171607.GB532@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] Re: RFC: cma: need rdma_unbind > > Tom Tucker wrote: > >FYI, an application calling shutdown() does not release the port > >synchronously. It is only released after the peers have negotiated the > >close. For SDP, I think this means that both peers have sent and > >received the SDP DisConn message and/or AbortConn message _AND_ you have > >shutdown the LLP (called rdma_disconnect). It is not until after this > >point that you can expect the port to be reuse-able. > > Based on Tom's comments, I'm planning on dropping my previous patch that > exposes rdma_unbind. 
I'll need to think about the issue a bit more and conduct some tests till I'm ready to continue the discussion. Let's get back to it later. -- MST From tom at opengridcomputing.com Tue Apr 25 12:45:01 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 25 Apr 2006 14:45:01 -0500 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <444E5152.4090708@ichips.intel.com> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> <20060424214645.GF20831@mellanox.co.il> <1145917974.18808.90.camel@trinity.ogc.int> <20060424223425.GK20831@mellanox.co.il> <1145918603.18808.94.camel@trinity.ogc.int> <1145920526.18808.112.camel@trinity.ogc.int> <444E5152.4090708@ichips.intel.com> Message-ID: <1145994301.21968.25.camel@trinity.ogc.int> On Tue, 2006-04-25 at 09:41 -0700, Sean Hefty wrote: > Tom Tucker wrote: > > FYI, an application calling shutdown() does not release the port > > synchronously. It is only released after the peers have negotiated the > > close. For SDP, I think this means that both peers have sent and > > received the SDP DisConn message and/or AbortConn message _AND_ you have > > shutdown the LLP (called rdma_disconnect). It is not until after this > > point that you can expect the port to be reuse-able. > > Based on Tom's comments, I'm planning on dropping my previous patch that exposes > rdma_unbind. I actually prefer your approach Sean, which is to release the port in rdma_disconnect. 
> > - Sean From tom at opengridcomputing.com Tue Apr 25 12:46:28 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 25 Apr 2006 14:46:28 -0500 Subject: [openib-general] Re: RFC: cma: need rdma_unbind In-Reply-To: <444E5152.4090708@ichips.intel.com> References: <20060424174521.GA20743@mellanox.co.il> <444D15BC.4040904@ichips.intel.com> <1145907165.18808.28.camel@trinity.ogc.int> <20060424211911.GD20831@mellanox.co.il> <444D41CE.1040207@ichips.intel.com> <20060424212748.GE20831@mellanox.co.il> <444D44E3.6000607@ichips.intel.com> <20060424214645.GF20831@mellanox.co.il> <1145917974.18808.90.camel@trinity.ogc.int> <20060424223425.GK20831@mellanox.co.il> <1145918603.18808.94.camel@trinity.ogc.int> <1145920526.18808.112.camel@trinity.ogc.int> <444E5152.4090708@ichips.intel.com> Message-ID: <1145994388.21968.27.camel@trinity.ogc.int> On Tue, 2006-04-25 at 09:41 -0700, Sean Hefty wrote: > Tom Tucker wrote: > > FYI, an application calling shutdown() does not release the port > > synchronously. It is only released after the peers have negotiated the > > close. For SDP, I think this means that both peers have sent and > > received the SDP DisConn message and/or AbortConn message _AND_ you have > > shutdown the LLP (called rdma_disconnect). It is not until after this > > point that you can expect the port to be reuse-able. > > Based on Tom's comments, I'm planning on dropping my previous patch that exposes > rdma_unbind. Oops, maybe I misunderstood the term dropping. Sorry. > > - Sean From jdaley at systemfabricworks.com Tue Apr 25 12:59:35 2006 From: jdaley at systemfabricworks.com (Jan Daley) Date: Tue, 25 Apr 2006 14:59:35 -0500 Subject: [openib-general] [PATCH] libibmad - Fix bit field access of RMPP Header Message-ID: <001d01c668a2$ca24fc50$6b01a8c0@maverick> The wrong macro is being used to access the RMPP header in libibmad. Discovered this issue by using mad_set/get_field on an RMPP header. 
With this change, reading and writing the RMPP header fields is done correctly. Signed-off-by: Jan Daley Index: fields.c =================================================================== --- fields.c (revision 6631) +++ fields.c (working copy) @@ -241,11 +241,11 @@ /* * SA RMPP */ - [IB_SA_RMPP_VERS_F] {BITSOFFS(24*8+24, 8), "RmppVers", mad_dump_uint}, - [IB_SA_RMPP_TYPE_F] {BITSOFFS(24*8+16, 8), "RmppType", mad_dump_uint}, - [IB_SA_RMPP_RESP_F] {BITSOFFS(24*8+11, 5), "RmppResp", mad_dump_uint}, - [IB_SA_RMPP_FLAGS_F] {BITSOFFS(24*8+8, 3), "RmppFlags", mad_dump_hex}, - [IB_SA_RMPP_STATUS_F] {BITSOFFS(24*8+0, 8), "RmppStatus", mad_dump_hex}, + [IB_SA_RMPP_VERS_F] {BE_OFFS(24*8+24, 8), "RmppVers", mad_dump_uint}, + [IB_SA_RMPP_TYPE_F] {BE_OFFS(24*8+16, 8), "RmppType", mad_dump_uint}, + [IB_SA_RMPP_RESP_F] {BE_OFFS(24*8+11, 5), "RmppResp", mad_dump_uint}, + [IB_SA_RMPP_FLAGS_F] {BE_OFFS(24*8+8, 3), "RmppFlags", mad_dump_hex}, + [IB_SA_RMPP_STATUS_F] {BE_OFFS(24*8+0, 8), "RmppStatus", mad_dump_hex}, /* data1 */ [IB_SA_RMPP_D1_F] {28*8, 32, "RmppData1", mad_dump_hex}, Jan Daley System Fabric Works (512) 343-6101 x 14 From sean.hefty at intel.com Tue Apr 25 16:39:20 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 25 Apr 2006 16:39:20 -0700 Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: add rdma_get/set_option calls to get/set path records Message-ID: Add rdma_get_option / rdma_set_option framework to the kernel RDMA CM. These calls are intended to allow user control over cm_id communication options, and are based on the socket getsockopt / setsockopt calls. Using the get/set option framework, add the ability for a user to retrieve IB routes (i.e. path records) that are usable on a given cm_id. Also allow a user to set the IB route that a cm_id should use when establishing a connection. This feature allows a user to control the routes used by multiple connections based on their own routing criteria. 
Signed-off-by: Sean Hefty --- Index: include/rdma/rdma_cm.h =================================================================== --- include/rdma/rdma_cm.h (revision 6418) +++ include/rdma/rdma_cm.h (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Voltaire Inc. All rights reserved. - * Copyright (c) 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2005-2006 Intel Corporation. All rights reserved. * * This Software is licensed under one of the following licenses: * @@ -61,6 +61,17 @@ enum rdma_port_space { RDMA_PS_SCTP = 0x0183 }; +/* Protocol levels for get/set options. */ +enum { + RDMA_PROTO_IP = 0, + RDMA_PROTO_IB = 1, +}; + +/* IB specific option names for get/set. */ +enum { + IB_PATH_OPTIONS = 1, +}; + struct rdma_addr { struct sockaddr src_addr; u8 src_pad[sizeof(struct sockaddr_in6) - @@ -251,5 +262,28 @@ int rdma_reject(struct rdma_cm_id *id, c */ int rdma_disconnect(struct rdma_cm_id *id); +/** + * rdma_get_option - Retrieve options for an rdma_cm_id. + * @id: Communication identifier to retrieve option for. + * @level: Protocol level of the option to retrieve. + * @optname: Name of the option to retrieve. + * @optval: Buffer to receive the returned options. + * @optlen: On input, the size of the %optval buffer. On output, the + * size of the returned data. + */ +int rdma_get_option(struct rdma_cm_id *id, int level, int optname, + void *optval, size_t *optlen); + +/** + * rdma_set_option - Set options for an rdma_cm_id. + * @id: Communication identifier to set option for. + * @level: Protocol level of the option to set. + * @optname: Name of the option to set. + * @optval: Reference to the option data. + * @optlen: The size of the %optval buffer. 
+ */ +int rdma_set_option(struct rdma_cm_id *id, int level, int optname, + void *optval, size_t optlen); + #endif /* RDMA_CM_H */ Index: core/cma.c =================================================================== --- core/cma.c (revision 6627) +++ core/cma.c (working copy) @@ -39,6 +39,7 @@ #include #include #include +#include MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); @@ -1706,6 +1707,187 @@ out: } EXPORT_SYMBOL(rdma_disconnect); +static inline int cma_port_valid(struct rdma_cm_id *id) +{ + return id->device && id->port_num && + (id->port_num <= id->device->phys_port_cnt); +} + +static int cma_get_ib_paths(struct rdma_id_private *id_priv, + void *optval, size_t *optlen) +{ + struct ib_sa_cursor *cursor; + struct ib_sa_path_rec *path; + struct ib_user_path_rec user_path; + union ib_gid *gid; + int left, ret = 0; + u16 pkey; + + if (!cma_port_valid(&id_priv->id)) + return -ENODEV; + + gid = ib_addr_get_dgid(&id_priv->id.route.addr.dev_addr); + pkey = ib_addr_get_pkey(&id_priv->id.route.addr.dev_addr); + cursor = ib_create_path_cursor(id_priv->id.device, + id_priv->id.port_num, gid); + if (IS_ERR(cursor)) + return PTR_ERR(cursor); + + gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + left = *optlen; + *optlen = 0; + + for (path = ib_get_next_sa_attr(&cursor); path; + path = ib_get_next_sa_attr(&cursor)) { + if (pkey == path->pkey && + !memcmp(gid, path->sgid.raw, sizeof *gid)) { + if (optval) { + ib_copy_path_rec_to_user(&user_path, path); + if (copy_to_user((void __user *) optval, + &user_path, + sizeof(user_path))) { + ret = -EFAULT; + break; + } + left -= sizeof(user_path); + if (left < sizeof(user_path)) + break; + optval += sizeof(user_path); + } + *optlen += sizeof(user_path); + } + } + + ib_free_sa_cursor(cursor); + return ret; +} + +static int cma_get_ib_option(struct rdma_id_private *id_priv, int optname, + void *optval, size_t *optlen) +{ + int ret; + + switch (optname) { + case IB_PATH_OPTIONS: + ret = 
cma_get_ib_paths(id_priv, optval, optlen); + break; + default: + ret = -EINVAL; + break; + } + + return ret; +} + +int rdma_get_option(struct rdma_cm_id *id, int level, int optname, + void *optval, size_t *optlen) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + + switch (level) { + case RDMA_PROTO_IB: + ret = cma_get_ib_option(id_priv, optname, optval, optlen); + break; + case RDMA_PROTO_IP: + ret = -ENOSYS; + break; + default: + ret = -EINVAL; + break; + } + + return ret; +} +EXPORT_SYMBOL(rdma_get_option); + +static int cma_set_ib_paths(struct rdma_id_private *id_priv, + void *optval, size_t optlen) +{ + struct rdma_route *route = &id_priv->id.route; + struct ib_user_path_rec user_path; + int ret, i; + + if (!cma_comp_exch(id_priv, CMA_ADDR_RESOLVED, CMA_ROUTE_RESOLVED)) + return -EINVAL; + + if (optlen == sizeof(user_path)) + route->num_paths = 1; + else if (optlen == (sizeof(user_path) << 1)) + route->num_paths = 2; + else { + ret = -EINVAL; + goto err1; + } + + route->path_rec = kmalloc(sizeof *route->path_rec * route->num_paths, + GFP_KERNEL); + if (!route->path_rec) { + ret = -ENOMEM; + goto err2; + } + + for (i = 0; i < route->num_paths; i++, optval += sizeof(user_path)) { + if (copy_from_user(&user_path, (void __user *) optval, + sizeof(user_path))) { + ret = -EFAULT; + goto err3; + } + ib_copy_path_rec_from_user(&route->path_rec[i], &user_path); + } + return 0; +err3: + kfree(route->path_rec); +err2: + route->num_paths = 0; +err1: + cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_ADDR_RESOLVED); + return ret; +} + +static int cma_set_ib_option(struct rdma_id_private *id_priv, int optname, + void *optval, size_t optlen) +{ + int ret; + + switch (optname) { + case IB_PATH_OPTIONS: + ret = cma_set_ib_paths(id_priv, optval, optlen); + break; + default: + ret = -EINVAL; + break; + } + + return ret; +} + +int rdma_set_option(struct rdma_cm_id *id, int level, int optname, + void *optval, size_t 
optlen) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + + switch (level) { + case RDMA_PROTO_IB: + ret = cma_set_ib_option(id_priv, optname, optval, optlen); + break; + case RDMA_PROTO_IP: + ret = -ENOSYS; + break; + default: + ret = -EINVAL; + break; + } + + return ret; +} +EXPORT_SYMBOL(rdma_set_option); + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; From sean.hefty at intel.com Tue Apr 25 16:41:45 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 25 Apr 2006 16:41:45 -0700 Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: add rdma_get/set_optioncalls to get/set path records In-Reply-To: Message-ID: Expose rdma_get_option / rdma_set_option routines to userspace. Signed-off-by: Sean Hefty --- Index: include/rdma/rdma_user_cm.h =================================================================== --- include/rdma/rdma_user_cm.h (revision 6418) +++ include/rdma/rdma_user_cm.h (working copy) @@ -55,7 +55,9 @@ enum { RDMA_USER_CM_CMD_REJECT, RDMA_USER_CM_CMD_DISCONNECT, RDMA_USER_CM_CMD_INIT_QP_ATTR, - RDMA_USER_CM_CMD_GET_EVENT + RDMA_USER_CM_CMD_GET_EVENT, + RDMA_USER_CM_CMD_GET_OPTION, + RDMA_USER_CM_CMD_SET_OPTION, }; /* @@ -183,4 +185,25 @@ struct rdma_ucm_event_resp { __u8 private_data[RDMA_MAX_PRIVATE_DATA]; }; +struct rdma_ucm_get_option { + __u64 response; + __u64 optval; + __u32 id; + __u32 level; + __u32 optname; + __u32 optlen; +}; + +struct rdma_ucm_get_option_resp { + __u32 optlen; +}; + +struct rdma_ucm_set_option { + __u64 optval; + __u32 id; + __u32 level; + __u32 optname; + __u32 optlen; +}; + #endif /* RDMA_USER_CM_H */ Index: core/ucma.c =================================================================== --- core/ucma.c (revision 6418) +++ core/ucma.c (working copy) @@ -656,6 +656,61 @@ out: return ret; } +static ssize_t ucma_get_option(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct 
rdma_ucm_get_option cmd; + struct rdma_ucm_get_option_resp resp; + struct ucma_context *ctx; + int ret; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + resp.optlen = cmd.optlen; + ret = rdma_get_option(ctx->cm_id, cmd.level, cmd.optname, + (void *) (unsigned long) cmd.optval, + &resp.optlen); + if (ret) + goto out; + + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; +out: + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_set_option(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_set_option cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_set_option(ctx->cm_id, cmd.level, cmd.optname, + (void *) (unsigned long) cmd.optval, + cmd.optlen); + + ucma_put_ctx(ctx); + return ret; +} + static ssize_t (*ucma_cmd_table[])(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) = { @@ -671,7 +726,9 @@ static ssize_t (*ucma_cmd_table[])(struc [RDMA_USER_CM_CMD_REJECT] = ucma_reject, [RDMA_USER_CM_CMD_DISCONNECT] = ucma_disconnect, [RDMA_USER_CM_CMD_INIT_QP_ATTR] = ucma_init_qp_attr, - [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event + [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event, + [RDMA_USER_CM_CMD_GET_OPTION] = ucma_get_option, + [RDMA_USER_CM_CMD_SET_OPTION] = ucma_set_option }; static ssize_t ucma_write(struct file *filp, const char __user *buf, From sean.hefty at intel.com Tue Apr 25 16:43:41 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 25 Apr 2006 16:43:41 -0700 Subject: [openib-general] [RFC] [PATCH 3/3] RDMA CM: add rdma_get/set_option calls to userspace library In-Reply-To: Message-ID: Support rdma_get_option / 
rdma_set_option through the userspace RDMA CM library. Signed-off-by: Sean Hefty --- Index: include/rdma/rdma_cma_abi.h =================================================================== --- include/rdma/rdma_cma_abi.h (revision 6335) +++ include/rdma/rdma_cma_abi.h (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2005-2006 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -57,7 +57,9 @@ enum { UCMA_CMD_REJECT, UCMA_CMD_DISCONNECT, UCMA_CMD_INIT_QP_ATTR, - UCMA_CMD_GET_EVENT + UCMA_CMD_GET_EVENT, + UCMA_CMD_GET_OPTION, + UCMA_CMD_SET_OPTION, }; struct ucma_abi_cmd_hdr { @@ -182,4 +184,25 @@ struct ucma_abi_event_resp { __u8 private_data[RDMA_MAX_PRIVATE_DATA]; }; +struct ucma_abi_get_option { + __u64 response; + __u64 optval; + __u32 id; + __u32 level; + __u32 optname; + __u32 optlen; +}; + +struct ucma_abi_get_option_resp { + __u32 optlen; +}; + +struct ucma_abi_set_option { + __u64 optval; + __u32 id; + __u32 level; + __u32 optname; + __u32 optlen; +}; + #endif /* RDMA_CMA_ABI_H */ Index: include/rdma/rdma_cma.h =================================================================== --- include/rdma/rdma_cma.h (revision 5693) +++ include/rdma/rdma_cma.h (working copy) @@ -1,6 +1,6 @@ /* * Copyright (c) 2005 Voltaire Inc. All rights reserved. - * Copyright (c) 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2005-2006 Intel Corporation. All rights reserved. * * This Software is licensed under one of the following licenses: * @@ -54,6 +54,17 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_DEVICE_REMOVAL, }; +/* Protocol levels for get/set options. */ +enum { + RDMA_PROTO_IP = 0, + RDMA_PROTO_IB = 1, +}; + +/* IB specific option names for get/set. 
*/ +enum { + IB_PATH_OPTIONS = 1, +}; + struct ib_addr { union ibv_gid sgid; union ibv_gid dgid; @@ -219,4 +230,27 @@ int rdma_ack_cm_event(struct rdma_cm_eve int rdma_get_fd(void); +/** + * rdma_get_option - Retrieve options for an rdma_cm_id. + * @id: Communication identifier to retrieve option for. + * @level: Protocol level of the option to retrieve. + * @optname: Name of the option to retrieve. + * @optval: Buffer to receive the returned options. + * @optlen: On input, the size of the %optval buffer. On output, the + * size of the returned data. + */ +int rdma_get_option(struct rdma_cm_id *id, int level, int optname, + void *optval, size_t *optlen); + +/** + * rdma_set_option - Set options for an rdma_cm_id. + * @id: Communication identifier to set option for. + * @level: Protocol level of the option to set. + * @optname: Name of the option to set. + * @optval: Reference to the option data. + * @optlen: The size of the %optval buffer. + */ +int rdma_set_option(struct rdma_cm_id *id, int level, int optname, + void *optval, size_t optlen); + #endif /* RDMA_CMA_H */ Index: src/cma.c =================================================================== --- src/cma.c (revision 6628) +++ src/cma.c (working copy) @@ -952,3 +952,51 @@ int rdma_get_fd(void) return cma_fd; } + +int rdma_get_option(struct rdma_cm_id *id, int level, int optname, + void *optval, size_t *optlen) +{ + struct ucma_abi_get_option_resp *resp; + struct ucma_abi_get_option *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size; + + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_GET_OPTION, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + cmd->optval = (uintptr_t) optval; + cmd->level = level; + cmd->optname = optname; + cmd->optlen = *optlen; + + ret = write(cma_fd, msg, size); + if (ret != size) + return (ret > 0) ? 
-ENODATA : ret; + + *optlen = resp->optlen; + return 0; +} + +int rdma_set_option(struct rdma_cm_id *id, int level, int optname, + void *optval, size_t optlen) +{ + struct ucma_abi_set_option *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_SET_OPTION, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + cmd->optval = (uintptr_t) optval; + cmd->level = level; + cmd->optname = optname; + cmd->optlen = optlen; + + ret = write(cma_fd, msg, size); + if (ret != size) + return (ret > 0) ? -ENODATA : ret; + + return 0; +} Index: src/librdmacm.map =================================================================== --- src/librdmacm.map (revision 5693) +++ src/librdmacm.map (working copy) @@ -15,5 +15,7 @@ RDMACM_1.0 { rdma_get_cm_event; rdma_ack_cm_event; rdma_get_fd; + rdma_get_option; + rdma_set_option; local: *; }; From sean.hefty at intel.com Tue Apr 25 16:45:56 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 25 Apr 2006 16:45:56 -0700 Subject: [openib-general] [RFC] [PATCH 2/3] RDMA CM: expose rdma_get/set_option calls to userspace In-Reply-To: Message-ID: Resending with correct subject. Expose rdma_get_option / rdma_set_option routines to userspace. 
Signed-off-by: Sean Hefty --- Index: include/rdma/rdma_user_cm.h =================================================================== --- include/rdma/rdma_user_cm.h (revision 6418) +++ include/rdma/rdma_user_cm.h (working copy) @@ -55,7 +55,9 @@ enum { RDMA_USER_CM_CMD_REJECT, RDMA_USER_CM_CMD_DISCONNECT, RDMA_USER_CM_CMD_INIT_QP_ATTR, - RDMA_USER_CM_CMD_GET_EVENT + RDMA_USER_CM_CMD_GET_EVENT, + RDMA_USER_CM_CMD_GET_OPTION, + RDMA_USER_CM_CMD_SET_OPTION, }; /* @@ -183,4 +185,25 @@ struct rdma_ucm_event_resp { __u8 private_data[RDMA_MAX_PRIVATE_DATA]; }; +struct rdma_ucm_get_option { + __u64 response; + __u64 optval; + __u32 id; + __u32 level; + __u32 optname; + __u32 optlen; +}; + +struct rdma_ucm_get_option_resp { + __u32 optlen; +}; + +struct rdma_ucm_set_option { + __u64 optval; + __u32 id; + __u32 level; + __u32 optname; + __u32 optlen; +}; + #endif /* RDMA_USER_CM_H */ Index: core/ucma.c =================================================================== --- core/ucma.c (revision 6418) +++ core/ucma.c (working copy) @@ -656,6 +656,61 @@ out: return ret; } +static ssize_t ucma_get_option(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_get_option cmd; + struct rdma_ucm_get_option_resp resp; + struct ucma_context *ctx; + int ret; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + resp.optlen = cmd.optlen; + ret = rdma_get_option(ctx->cm_id, cmd.level, cmd.optname, + (void *) (unsigned long) cmd.optval, + &resp.optlen); + if (ret) + goto out; + + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; +out: + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_set_option(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_set_option cmd; + struct ucma_context *ctx; + 
int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_set_option(ctx->cm_id, cmd.level, cmd.optname, + (void *) (unsigned long) cmd.optval, + cmd.optlen); + + ucma_put_ctx(ctx); + return ret; +} + static ssize_t (*ucma_cmd_table[])(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) = { @@ -671,7 +726,9 @@ static ssize_t (*ucma_cmd_table[])(struc [RDMA_USER_CM_CMD_REJECT] = ucma_reject, [RDMA_USER_CM_CMD_DISCONNECT] = ucma_disconnect, [RDMA_USER_CM_CMD_INIT_QP_ATTR] = ucma_init_qp_attr, - [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event + [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event, + [RDMA_USER_CM_CMD_GET_OPTION] = ucma_get_option, + [RDMA_USER_CM_CMD_SET_OPTION] = ucma_set_option }; static ssize_t ucma_write(struct file *filp, const char __user *buf, From mst at mellanox.co.il Wed Apr 26 00:59:16 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 26 Apr 2006 10:59:16 +0300 Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: add rdma_get/set_optioncalls to get/set path records In-Reply-To: References: Message-ID: <20060426075916.GB8155@mellanox.co.il> Quoting r. Sean Hefty : > @@ -251,5 +262,28 @@ int rdma_reject(struct rdma_cm_id *id, c > */ > int rdma_disconnect(struct rdma_cm_id *id); > > +/** > + * rdma_get_option - Retrieve options for an rdma_cm_id. > + * @id: Communication identifier to retrieve option for. > + * @level: Protocol level of the option to retrieve. > + * @optname: Name of the option to retrieve. > + * @optval: Buffer to receive the returned options. > + * @optlen: On input, the size of the %optval buffer. On output, the > + * size of the returned data. > + */ > +int rdma_get_option(struct rdma_cm_id *id, int level, int optname, > + void *optval, size_t *optlen); > + > +/** > + * rdma_set_option - Set options for an rdma_cm_id. > + * @id: Communication identifier to set option for. 
> + * @level: Protocol level of the option to set. > + * @optname: Name of the option to set. > + * @optval: Reference to the option data. > + * @optlen: The size of the %optval buffer. > + */ > +int rdma_set_option(struct rdma_cm_id *id, int level, int optname, > + void *optval, size_t optlen); > + It seems optval is a user pointer. Should it be marked as such: void __user *? > #endif /* RDMA_CM_H */ > > Index: core/cma.c > =================================================================== > --- core/cma.c (revision 6627) > +++ core/cma.c (working copy) .... > +static int cma_set_ib_paths(struct rdma_id_private *id_priv, > + void *optval, size_t optlen) > +{ > + struct rdma_route *route = &id_priv->id.route; > + struct ib_user_path_rec user_path; > + int ret, i; > + > + if (!cma_comp_exch(id_priv, CMA_ADDR_RESOLVED, CMA_ROUTE_RESOLVED)) > + return -EINVAL; > + > + if (optlen == sizeof(user_path)) > + route->num_paths = 1; > + else if (optlen == (sizeof(user_path) << 1)) > + route->num_paths = 2; > + else { > + ret = -EINVAL; > + goto err1; > + } > + > + route->path_rec = kmalloc(sizeof *route->path_rec * route->num_paths, > + GFP_KERNEL); > + if (!route->path_rec) { > + ret = -ENOMEM; > + goto err2; > + } > + > + for (i = 0; i < route->num_paths; i++, optval += sizeof(user_path)) { > + if (copy_from_user(&user_path, (void __user *) optval, > + sizeof(user_path))) { Apparently you assume a userspace pointer here: so the interface is not intended for kernel users? So why is it not in ucma? > > + ret = -EFAULT; > + goto err3; > + } > + ib_copy_path_rec_from_user(&route->path_rec[i], &user_path); > + } > + return 0; > +err3: > + kfree(route->path_rec); > +err2: > + route->num_paths = 0; > +err1: > + cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_ADDR_RESOLVED); > + return ret; > +} -- MST From mst at mellanox.co.il Wed Apr 26 01:04:54 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 26 Apr 2006 11:04:54 +0300 Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: addrdma_get/set_optioncalls to get/set path records In-Reply-To: References: Message-ID: <20060426080454.GC8155@mellanox.co.il> Sean, what's up with patch numbering? I see two 1/3 patches. http://openib.org/pipermail/openib-general/2006-April/date.html # [openib-general] [RFC] [PATCH 1/3] RDMA CM: add rdma_get/set_option calls to get/set path records Sean Hefty # [openib-general] [RFC] [PATCH 1/3] RDMA CM: add rdma_get/set_option calls to get/set path records Sean Hefty # # [openib-general] [RFC] [PATCH 3/3] RDMA CM: add rdma_get/set_option calls to userspace library Sean Hefty # # [openib-general] [RFC] [PATCH 2/3] RDMA CM: expose rdma_get/set_option calls to userspace Sean Hefty And the first two patches are not identical either. I'm a bit confused as to what constitutes the series. Is it just me? More on the specific patch: Quoting r. Sean Hefty : > +static ssize_t ucma_set_option(struct ucma_file *file, const char __user *inbuf, > + int in_len, int out_len) > +{ > + struct rdma_ucm_set_option cmd; > + struct ucma_context *ctx; > + int ret; > + > + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) > + return -EFAULT; > + > + ctx = ucma_get_ctx(file, cmd.id); > + if (IS_ERR(ctx)) > + return PTR_ERR(ctx); > + > + ret = rdma_set_option(ctx->cm_id, cmd.level, cmd.optname, > + (void *) (unsigned long) cmd.optval, > + cmd.optlen); Casting a value from userspace to void * looks iffy. -- MST From oferg at mellanox.co.il Wed Apr 26 02:29:54 2006 From: oferg at mellanox.co.il (Ofer Gigi) Date: Wed, 26 Apr 2006 12:29:54 +0300 Subject: [openib-general] [PATCH] osm_sa_mcmember_record : MCMember Get/GetTable Trusted mode Message-ID: <07slo0d6r1.fsf@sw053.yok.mtl.com> Hi Hal, Small minor changes: 1. Adding () for the if statement 2. Clearer messages when duplicate guids are found Thanks Ofer G.
Signed-off-by: Ofer Gigi Index: osm/opensm/osm_node_info_rcv.c =================================================================== --- osm/opensm/osm_node_info_rcv.c (revision 6640) +++ osm/opensm/osm_node_info_rcv.c (working copy) @@ -129,8 +129,8 @@ __osm_ni_rcv_set_links( } else { - if( osm_node_has_any_link( p_node, port_num ) && - p_rcv->p_subn->force_immediate_heavy_sweep == FALSE ) + if( (osm_node_has_any_link( p_node, port_num )) && + (p_rcv->p_subn->force_immediate_heavy_sweep == FALSE) ) { /* Uh oh... @@ -205,13 +205,14 @@ __osm_ni_rcv_set_links( ); osm_log( p_rcv->p_log, OSM_LOG_SYS, - "Errors on subnet. SM found duplicated guids or 12x " - "link with lane reversal badly configured. " - "See osm log for more details\n"); + "Fatal: duplicated guids or 12x lane reversal.\n"); if ( p_rcv->p_subn->opt.exit_on_fatal == TRUE ) + { + osm_log( p_rcv->p_log, OSM_LOG_SYS,"Exiting ...\n"); exit( 1 ); } + } /* When there are only two nodes with exact same guids (connected back to
References: <001d01c668a2$ca24fc50$6b01a8c0@maverick> Message-ID: <1146047946.2124.25167.camel@hal.voltaire.com> On Tue, 2006-04-25 at 15:59, Jan Daley wrote: > The wrong macro is being used to access the RMPP header in libibmad. > Discovered > this issue by using mad_set/get_field on an RMPP header. With this > change, > reading and writing the RMPP header fields is done correctly. Thanks. > Signed-off-by: Jan Daley Patch was line wrapped but I modified the patch manually and applied it to both trunk and 1.0 branch.
-- Hal > Index: fields.c > =================================================================== > --- fields.c (revision 6631) > +++ fields.c (working copy) > @@ -241,11 +241,11 @@ > /* > * SA RMPP > */ > - [IB_SA_RMPP_VERS_F] {BITSOFFS(24*8+24, 8), > "RmppVers", mad_dump_uint}, > - [IB_SA_RMPP_TYPE_F] {BITSOFFS(24*8+16, 8), > "RmppType", mad_dump_uint}, > - [IB_SA_RMPP_RESP_F] {BITSOFFS(24*8+11, 5), > "RmppResp", mad_dump_uint}, > - [IB_SA_RMPP_FLAGS_F] {BITSOFFS(24*8+8, 3), > "RmppFlags", mad_dump_hex}, > - [IB_SA_RMPP_STATUS_F] {BITSOFFS(24*8+0, 8), > "RmppStatus", mad_dump_hex}, > + [IB_SA_RMPP_VERS_F] {BE_OFFS(24*8+24, 8), > "RmppVers", mad_dump_uint}, > + [IB_SA_RMPP_TYPE_F] {BE_OFFS(24*8+16, 8), > "RmppType", mad_dump_uint}, > + [IB_SA_RMPP_RESP_F] {BE_OFFS(24*8+11, 5), > "RmppResp", mad_dump_uint}, > + [IB_SA_RMPP_FLAGS_F] {BE_OFFS(24*8+8, 3), > "RmppFlags", mad_dump_hex}, > + [IB_SA_RMPP_STATUS_F] {BE_OFFS(24*8+0, 8), > "RmppStatus", mad_dump_hex}, > > /* data1 */ > [IB_SA_RMPP_D1_F] {28*8, 32, "RmppData1", > mad_dump_hex}, > > > Jan Daley > System Fabric Works > (512) 343-6101 x 14 > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Wed Apr 26 04:15:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Apr 2006 07:15:21 -0400 Subject: [openib-general] Re: [PATCH] osm_sa_mcmember_record : MCMember Get/GetTable Trusted mode In-Reply-To: <07slo0d6r1.fsf@sw053.yok.mtl.com> References: <07slo0d6r1.fsf@sw053.yok.mtl.com> Message-ID: <1146049592.2124.25547.camel@hal.voltaire.com> Hi Ofer, On Wed, 2006-04-26 at 05:29, Ofer Gigi wrote: > Hi Hal, > Small minor chnages: Subject is inconsistent with the patch below. > 1. Adding () for the if statement Those extra parentheses shouldn't be needed as && is lower precedence than ==. > 2. 
Clearer messages when duplicate guids are found Should Fatal be FATAL so it really stands out ? I applied the second portion of this patch with some cosmetic changes to both the trunk and 1.0 branch. -- Hal > Thanks > > Ofer G. > > Signed-off-by: Ofer Gigi From leonida at voltaire.com Wed Apr 26 04:33:57 2006 From: leonida at voltaire.com (Leonid Arsh) Date: Wed, 26 Apr 2006 14:33:57 +0300 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: References: Message-ID: <444F5AA5.40802@voltaire.com> Shirley Ma wrote: > > I am working on a patch to use multiple threads work queue for ipoib > completion polling. Have you tried to this on your driver? No, we made some experiments with NAPI, tried also to split CQ (as I already wrote, this didn't help with tasklet completion handling.) We also tried to handle completions in HW interrupts (pretty long ago), but this didn't give us any improvement then. Regards, Leonid From oferg at mellanox.co.il Wed Apr 26 06:00:04 2006 From: oferg at mellanox.co.il (Ofer Gigi) Date: Wed, 26 Apr 2006 16:00:04 +0300 Subject: [openib-general] RE: [PATCH] osm_node_info_rcv - cosmetic Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FDF631@mtlexch01.mtl.com> 1. It is up to you - I think it is clearer that way, and you don't have to think who has precedence. 2. Yes, FATAL is better - thanks. Thanks! Ofer -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Wednesday, April 26, 2006 2:15 PM To: Ofer Gigi Cc: OPENIB Subject: Re: [PATCH] osm_sa_mcmember_record : MCMember Get/GetTable Trustedmode Hi Ofer, On Wed, 2006-04-26 at 05:29, Ofer Gigi wrote: > Hi Hal, > Small minor chnages: Subject is inconsistent with the patch below. > 1. Adding () for the if statement Those extra parentheses shouldn't be needed as && is lower precedence than ==. > 2. Clearer messages when duplicate guids are found Should Fatal be FATAL so it really stands out ? 
I applied the second portion of this patch with some cosmetic changes to both the trunk and 1.0 branch. -- Hal > Thanks > > Ofer G. > > Signed-off-by: Ofer Gigi From leonida at voltaire.com Wed Apr 26 06:43:19 2006 From: leonida at voltaire.com (Leonid Arsh) Date: Wed, 26 Apr 2006 16:43:19 +0300 Subject: [openib-general] netperf for RDS needed Message-ID: Ranjit, could you please send us the netperf that works over RDS? We need the specific netperf version and the changes made to support RDS. We need it rather urgently. Thank you, Leonid Arsh SW Engineer www.voltaire.com The Grid Interconnect Company From halr at voltaire.com Wed Apr 26 07:00:09 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Apr 2006 10:00:09 -0400 Subject: [openib-general] [RFC] AT and user AT Message-ID: <1146060008.2124.27486.camel@hal.voltaire.com> As AT and user AT have been obsoleted (and superseded by CMA, which is now in the process of going upstream), any objections to removing AT and user AT from the trunk? If I don't hear back by COB Friday, I will presume this is OK. -- Hal From ogerlitz at voltaire.com Wed Apr 26 07:41:49 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 26 Apr 2006 17:41:49 +0300 (IDT) Subject: [openib-general] cma unitests / cma local connections Message-ID: Sean, [Sorry if this has been discussed in many emails in the past] I have two questions: +1 What is the recommended CMA kernel unit test? I recall there were cmatose and krping (I might be wrong about the test names). Also, can you please point me to the program's SVN location and, if there are any, to minimal running instructions? +2 Is it possible to open a connection with the CMA to 127.0.0.1, that is, the IP address of the lo network device?
That is, the passive side would create an id, bind to 127.0.0.1 (or not bind to anything), then call rdma_listen, and the active side would create an id and resolve the address 127.0.0.1, and so on. Will that work? thanks, Or. From xma at us.ibm.com Wed Apr 26 07:51:55 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 26 Apr 2006 07:51:55 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: <444F5AA5.40802@voltaire.com> Message-ID: Leonid, Leonid Arsh wrote on 04/26/2006 04:33:57 AM: > Shirley Ma wrote: > > > > I am working on a patch to use multiple threads work queue for ipoib > > completion polling. Have you tried this on your driver? > No, we made some experiments with NAPI, tried also to split CQ > (as I already wrote, this didn't help with tasklet completion handling.) > We also tried to handle completions in HW interrupts (pretty long ago), > but this didn't give us any improvement then. > > Regards, > Leonid > Without seeing your patch, I couldn't say anything. I guess your implementation didn't handle multiple threads simultaneously. If you only have one interrupt handler, I can't see any reason you would get better performance numbers by splitting CQs. Could you please post your NAPI patch here? As I mentioned, I will test my patch to see how the performance is. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From ishai at mellanox.co.il Wed Apr 26 07:50:01 2006 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Wed, 26 Apr 2006 17:50:01 +0300 Subject: [openib-general] [PATCH] SRP: fix crash in srp_process_rsp Message-ID: <20060426144817.GA21822@mellanox.co.il> Hi, srp_process_rsp crashes on a NULL pointer dereference. The following fixes the crash. Is this a correct fix? --- Avoid dereferencing a NULL pointer.
Signed-off-by: Ishai Rabinovitz Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c =================================================================== --- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c 2006-04-26 15:38:23.000000000 +0300 +++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c 2006-04-26 17:45:22.000000000 +0300 @@ -655,9 +655,11 @@ static void srp_process_rsp(struct srp_t complete(&req->done); } else { scmnd = req->scmnd; - if (!scmnd) + if (!scmnd) { printk(KERN_ERR "Null scmnd for RSP w/tag %016llx\n", (unsigned long long) rsp->tag); + goto unlock; + } scmnd->result = rsp->status; if (rsp->flags & SRP_RSP_FLAG_SNSVALID) { @@ -683,7 +685,7 @@ static void srp_process_rsp(struct srp_t } else req->cmd_done = 1; } - +unlock: spin_unlock_irqrestore(target->scsi_host->host_lock, flags); } -- Ishai Rabinovitz From jlentini at netapp.com Wed Apr 26 08:00:30 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 26 Apr 2006 11:00:30 -0400 (EDT) Subject: [openib-general] [RFC] AT and user AT In-Reply-To: <1146060008.2124.27486.camel@hal.voltaire.com> References: <1146060008.2124.27486.camel@hal.voltaire.com> Message-ID: On Wed, 26 Apr 2006, Hal Rosenstock wrote: > As AT and user AT have been obsoleted (and superceeded by CMA which > is now in the process of going upstream), any objections to removing > AT and user AT from the trunk ? If I don't hear back by COB Friday, > I will presume this is OK. I'm in agreement with moving this off of the trunk. Will the code still be available for reference? From rpandit at silverstorm.com Wed Apr 26 08:39:03 2006 From: rpandit at silverstorm.com (Pandit, Ranjit) Date: Wed, 26 Apr 2006 11:39:03 -0400 Subject: [openib-general] RE: netperf for RDS needed Message-ID: Attached is the patch. BTW, we sent all this information to Moni Levy couple of weeks back. 
patch -d netperf-2.4.1rc1 -p 1 < netperf_changes_for_rds Running netperf on RDS: Server: ./netserver -L -r Client: ./netperf -H ,4 -t UDP_STREAM -l 10 -r -- -L -m -M For example: UDP_STREAM test, 8k msg size run for 10sec Server: st29_ib == 192.168.99.59 Client: 192.168.99.60 ./netserver -L 192.168.99.59 -r ./netperf -H st29_ib,4 -t UDP_STREAM -l 10 -r -- -L 192.168.99.60 -m 8192 -M 8192 -----Original Message----- From: Leonid Arsh [mailto:leonida at voltaire.com] Sent: Wednesday, April 26, 2006 6:43 AM To: Pandit, Ranjit Cc: zach.brown at oracle.com; openib-general at openib.org Subject: netperf for RDS needed Ranjit, could you please send us the netperf working over RDS. We need the the netperf specific version and the changes are done to support RDS. We need it rather urgently. Thank you, Leonid Arsh SW Engineer www.voltaire.com The Grid Interconnect Company -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: netperf_changes_for_rds Type: application/octet-stream Size: 19112 bytes Desc: netperf_changes_for_rds URL: From mshefty at ichips.intel.com Wed Apr 26 08:51:46 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 26 Apr 2006 08:51:46 -0700 Subject: [openib-general] Re: cma unitests / cma local connections In-Reply-To: References: Message-ID: <444F9712.3030604@ichips.intel.com> Or Gerlitz wrote: > +1 what is the recommended cma kernel unitest, i recall there > was cmatose and krping (i might be wrong re the testname) also > can you please point me to the program SVN location and if there > are such to minimal running instructions... Both of those test programs are in the util directory. gen2/utils/src/linux-kernel/infiniband/util The only instructions for cmatose are in comments at the top of the source file. I'm not sure about krping. 
> +2 Is it possible to open a connection with the CMA to 127.0.0.1 > that is the ip address of the lo network device. > > That is the passive side would create id, bind to 127.0.0.1 (or not > bind to anything) then call rdma_listen, and the active side would create > id and call addr resolve to 127.0.0.1 etc etc. > > Will that work? This should work. If not, please let me know. - Sean From mshefty at ichips.intel.com Wed Apr 26 08:57:09 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 26 Apr 2006 08:57:09 -0700 Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: addrdma_get/set_optioncalls to get/set path records In-Reply-To: <20060426080454.GC8155@mellanox.co.il> References: <20060426080454.GC8155@mellanox.co.il> Message-ID: <444F9855.3030900@ichips.intel.com> Michael S. Tsirkin wrote: > Sean, what's up with patch numbering? The second patch labeled 1/3 is really 2/3. I resent 2/3 with the correct subject heading. >>+static ssize_t ucma_set_option(struct ucma_file *file, const char __user *inbuf, >>+ int in_len, int out_len) >>+{ >>+ struct rdma_ucm_set_option cmd; >>+ struct ucma_context *ctx; >>+ int ret; >>+ >>+ if (copy_from_user(&cmd, inbuf, sizeof(cmd))) >>+ return -EFAULT; >>+ >>+ ctx = ucma_get_ctx(file, cmd.id); >>+ if (IS_ERR(ctx)) >>+ return PTR_ERR(ctx); >>+ >>+ ret = rdma_set_option(ctx->cm_id, cmd.level, cmd.optname, >>+ (void *) (unsigned long) cmd.optval, >>+ cmd.optlen); > > > Casting a value from userspace to void * looks iffy. This should be a userspace pointer. The kernel setsockopt interface takes a char * for the option value. Maybe this would be better? - Sean From mst at mellanox.co.il Wed Apr 26 09:13:52 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 26 Apr 2006 19:13:52 +0300 Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: addrdma_get/set_optioncalls to get/set path records In-Reply-To: <444F9855.3030900@ichips.intel.com> References: <20060426080454.GC8155@mellanox.co.il> <444F9855.3030900@ichips.intel.com> Message-ID: <20060426161352.GO31324@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] [RFC] [PATCH 1/3] RDMA CM:?addrdma_get/set_optioncalls to get/set path records > > Michael S. Tsirkin wrote: > >Sean, what's up with patch numbering? > > The second patch labeled 1/3 is really 2/3. I resent 2/3 with the correct > subject heading. > > >>+static ssize_t ucma_set_option(struct ucma_file *file, const char __user > >>*inbuf, > >>+ int in_len, int out_len) > >>+{ > >>+ struct rdma_ucm_set_option cmd; > >>+ struct ucma_context *ctx; > >>+ int ret; > >>+ > >>+ if (copy_from_user(&cmd, inbuf, sizeof(cmd))) > >>+ return -EFAULT; > >>+ > >>+ ctx = ucma_get_ctx(file, cmd.id); > >>+ if (IS_ERR(ctx)) > >>+ return PTR_ERR(ctx); > >>+ > >>+ ret = rdma_set_option(ctx->cm_id, cmd.level, cmd.optname, > >>+ (void *) (unsigned long) cmd.optval, > >>+ cmd.optlen); > > > > > >Casting a value from userspace to void * looks iffy. > > This should be a userspace pointer. The kernel setsockopt interface takes > a char * for the option value. Where is it? > Maybe this would be better? > > - Sean > I think it would be better to have rdma_set_path/rdma_get_path, for kernel consumers, and have ucma handle copying from/to userspace. No? 
-- MST From halr at voltaire.com Wed Apr 26 09:15:42 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Apr 2006 12:15:42 -0400 Subject: [openib-general] [RFC] AT and user AT In-Reply-To: References: <1146060008.2124.27486.camel@hal.voltaire.com> Message-ID: <1146067835.2124.28965.camel@hal.voltaire.com> On Wed, 2006-04-26 at 11:00, James Lentini wrote: > On Wed, 26 Apr 2006, Hal Rosenstock wrote: > > > As AT and user AT have been obsoleted (and superseded by CMA, which > > is now in the process of going upstream), any objections to removing > > AT and user AT from the trunk? If I don't hear back by COB Friday, > > I will presume this is OK. > > I'm in agreement with moving this off of the trunk. > > Will the code still be available for reference? Sure. I can move it somewhere before deleting it from the trunk. -- Hal From mshefty at ichips.intel.com Wed Apr 26 09:26:49 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 26 Apr 2006 09:26:49 -0700 Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: add rdma_get/set_optioncalls to get/set path records In-Reply-To: <20060426075916.GB8155@mellanox.co.il> References: <20060426075916.GB8155@mellanox.co.il> Message-ID: <444F9F49.7060608@ichips.intel.com> Michael S. Tsirkin wrote: >>+int rdma_set_option(struct rdma_cm_id *id, int level, int optname, >>+ void *optval, size_t optlen); >>+ > > It seems optval is a user pointer. Should it be marked as such: > void __user *? The getsockopt / setsockopt calls both use char *optval in their interfaces. Internally, they do get_user(), put_user(), copy_to_user(), etc. It's my understanding, which could be way off, that both getsockopt and setsockopt are also callable from kernel modules. I have not had a chance to try these calls with kernel memory to see what copy_to_user() would do.
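The in/out @optlen convention from the rdma_get_option kernel-doc (buffer size on input, returned size on output) can be illustrated with a small userspace sketch. Everything named demo_* below is hypothetical and only stands in for the real structures. How the actual rdma_get_option treats a buffer that is too small is not shown in this thread, so returning -ENOSPC and reporting the required size is purely this sketch's assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a path record; not the real ib_sa_path_rec. */
struct demo_path {
	unsigned int dlid;
	unsigned int slid;
};

/* Hypothetical per-id state holding up to two resolved paths. */
struct demo_id {
	struct demo_path paths[2];
	size_t num_paths;
};

/*
 * demo_get_option - copy the stored paths into optval.
 * On input, *optlen is the size of the optval buffer.
 * On output, *optlen is the number of bytes the option occupies.
 * Returns 0 on success or a negative errno (assumed convention).
 */
int demo_get_option(const struct demo_id *id, void *optval, size_t *optlen)
{
	size_t needed = id->num_paths * sizeof(id->paths[0]);

	if (*optlen < needed) {
		*optlen = needed;	/* report how much room is required */
		return -ENOSPC;
	}

	memcpy(optval, id->paths, needed);
	*optlen = needed;
	return 0;
}
```

A caller sets the length to the size of its buffer before the call and reads it back afterwards to learn how many path records were returned.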
>>+static int cma_set_ib_paths(struct rdma_id_private *id_priv, >>+ void *optval, size_t optlen) >>+{ >>+ struct rdma_route *route = &id_priv->id.route; >>+ struct ib_user_path_rec user_path; >>+ int ret, i; >>+ >>+ if (!cma_comp_exch(id_priv, CMA_ADDR_RESOLVED, CMA_ROUTE_RESOLVED)) >>+ return -EINVAL; >>+ >>+ if (optlen == sizeof(user_path)) >>+ route->num_paths = 1; >>+ else if (optlen == (sizeof(user_path) << 1)) >>+ route->num_paths = 2; >>+ else { >>+ ret = -EINVAL; >>+ goto err1; >>+ } >>+ >>+ route->path_rec = kmalloc(sizeof *route->path_rec * route->num_paths, >>+ GFP_KERNEL); >>+ if (!route->path_rec) { >>+ ret = -ENOMEM; >>+ goto err2; >>+ } >>+ >>+ for (i = 0; i < route->num_paths; i++, optval += sizeof(user_path)) { >>+ if (copy_from_user(&user_path, (void __user *) optval, >>+ sizeof(user_path))) { > > > Apparently you assume userspace pointer here: so the interface is not intended > for kernel users? So why is it not in ucma? The call needs to be handled in the cma, as opposed to the ucma for a couple of reasons. For the get_option, I need to protect against device removal while accessing the list of available path records. For the set_option routine, the call changes the state of the rdma_cm_id. - Sean From mst at mellanox.co.il Wed Apr 26 09:44:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 26 Apr 2006 19:44:55 +0300 Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: add rdma_get/set_optioncalls to get/set path records In-Reply-To: <444F9F49.7060608@ichips.intel.com> References: <20060426075916.GB8155@mellanox.co.il> <444F9F49.7060608@ichips.intel.com> Message-ID: <20060426164455.GP31324@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] [RFC] [PATCH 1/3] RDMA CM: add?rdma_get/set_optioncalls to get/set path records > > Michael S. Tsirkin wrote: > >>+int rdma_set_option(struct rdma_cm_id *id, int level, int optname, > >>+ void *optval, size_t optlen); > >>+ > > > >It seems optval is a user pointer. 
Should it be parked as such > >void __user *. > > The getsockopt / setsockopt calls both use char *optval in their > interfaces. Internally, they do get_user(), put_user(), copy_to_user(), etc. > > It's my understanding, which could be way off, that both getsockopt and > setsockopt are also callable from kernel modules. I have not had a chance > to try these calls with kernel memory to see what copy_to_user() would do. I think this works on typical intel, but not so sure about other architectures. You might need to be in process context so that current pointer is valid. > >>+static int cma_set_ib_paths(struct rdma_id_private *id_priv, > >>+ void *optval, size_t optlen) > >>+{ > >>+ struct rdma_route *route = &id_priv->id.route; > >>+ struct ib_user_path_rec user_path; > >>+ int ret, i; > >>+ > >>+ if (!cma_comp_exch(id_priv, CMA_ADDR_RESOLVED, CMA_ROUTE_RESOLVED)) > >>+ return -EINVAL; > >>+ > >>+ if (optlen == sizeof(user_path)) > >>+ route->num_paths = 1; > >>+ else if (optlen == (sizeof(user_path) << 1)) > >>+ route->num_paths = 2; > >>+ else { > >>+ ret = -EINVAL; > >>+ goto err1; > >>+ } > >>+ > >>+ route->path_rec = kmalloc(sizeof *route->path_rec * route->num_paths, > >>+ GFP_KERNEL); > >>+ if (!route->path_rec) { > >>+ ret = -ENOMEM; > >>+ goto err2; > >>+ } > >>+ > >>+ for (i = 0; i < route->num_paths; i++, optval += sizeof(user_path)) { > >>+ if (copy_from_user(&user_path, (void __user *) optval, > >>+ sizeof(user_path))) { > > > > > >Apparently you assume userspace pointer here: so the interface is not > >intended > >for kernel users? So why is it not in ucma? > > The call needs to be handled in the cma, as opposed to the ucma for a > couple of reasons. For the get_option, I need to protect against device > removal while accessing the list of available path records. For the > set_option routine, the call changes the state of the rdma_cm_id. How about doing copy_from_user in ucma, and implementing rdma_set_path/rdma_get_path in cma? 
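A minimal userspace sketch of the split proposed above, for illustration only: a boundary function (playing the ucma role) validates the length and copies the caller's buffer, and a core function (playing the cma role) takes typed arguments only. The names path_rec, route, set_paths and set_paths_from_buffer are hypothetical, the struct layout is a stand-in rather than the real ib_user_path_rec, and memcpy() stands in for copy_from_user().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a path record; the real layout differs. */
struct path_rec {
	unsigned char dgid[16];
	unsigned char sgid[16];
	unsigned short dlid;
	unsigned short slid;
};

#define MAX_PATHS 2	/* primary plus one alternate, as in the quoted patch */

struct route {
	struct path_rec paths[MAX_PATHS];
	int num_paths;
};

/* Core layer ("cma" role): typed arguments, never sees a user pointer. */
int set_paths(struct route *route, const struct path_rec *paths, int num_paths)
{
	if (num_paths < 1 || num_paths > MAX_PATHS)
		return -EINVAL;
	memcpy(route->paths, paths, sizeof(*paths) * (size_t)num_paths);
	route->num_paths = num_paths;
	return 0;
}

/* Boundary layer ("ucma" role): checks the length, copies, then calls the core. */
int set_paths_from_buffer(struct route *route, const void *buf, size_t len)
{
	struct path_rec tmp[MAX_PATHS];
	int num_paths;

	/* Same rule as the quoted patch: exactly one or exactly two records. */
	if (len == sizeof(struct path_rec))
		num_paths = 1;
	else if (len == 2 * sizeof(struct path_rec))
		num_paths = 2;
	else
		return -EINVAL;

	/* In the kernel this copy would be copy_from_user(). */
	memcpy(tmp, buf, len);
	return set_paths(route, tmp, num_paths);
}
```

With this shape a kernel caller would use set_paths() directly, and only the boundary function would ever handle an untrusted buffer.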
-- MST From mshefty at ichips.intel.com Wed Apr 26 10:11:17 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 26 Apr 2006 10:11:17 -0700 Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: add rdma_get/set_optioncalls to get/set path records In-Reply-To: <20060426164455.GP31324@mellanox.co.il> References: <20060426075916.GB8155@mellanox.co.il> <444F9F49.7060608@ichips.intel.com> <20060426164455.GP31324@mellanox.co.il> Message-ID: <444FA9B5.8090507@ichips.intel.com> Michael S. Tsirkin wrote: >>The getsockopt / setsockopt calls both use char *optval in their >>interfaces. Internally, they do get_user(), put_user(), copy_to_user(), etc. >> >>It's my understanding, which could be way off, that both getsockopt and >>setsockopt are also callable from kernel modules. I have not had a chance >>to try these calls with kernel memory to see what copy_to_user() would do. > > > I think this works on typical intel, but not so sure about other architectures. > You might need to be in process context so that current pointer is valid. I did just test this, and it worked at least on my systems. I couldn't find anywhere in the kernel where s/getsockopt would be called without doing a get_user / put_user. Maybe one of the iWarp developers can help here? Is there a separate implementation of s/getsockopt for kernel users, versus userspace users? > How about doing copy_from_user in ucma, and implementing > rdma_set_path/rdma_get_path in cma? I don't think that we want to start adding a new set of APIs for every option that may eventually need to be supported. I _might_ be able to move the implementation into the ucma, but this would duplicate the s/get_option logic in both in ucma and cma. Regarding getting path records. For a large subnet, the number of paths can be fairly large, so I'd like to avoid multiple data copies. I also want to avoid needing to allocate a large kernel buffer (which may require a list of allocations) to get the paths to return to userspace. 
- Sean

From xma at us.ibm.com Wed Apr 26 10:23:18 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 26 Apr 2006 10:23:18 -0700
Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval
In-Reply-To:
Message-ID:

Leonid,

> As I mentioned I will test my patch to see how's the performance.

I have tested the prototype patch with splitting CQs + work queue in IPoIB layer, under 2-4 cpus, netperf throughput got more than 10% improvement on mthca without msi_x enabled. I hit a slab cache bug on ehca. I need to pass the test on ehca before I submit the patch.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From mst at mellanox.co.il Wed Apr 26 10:37:16 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 26 Apr 2006 20:37:16 +0300
Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: add rdma_get/set_optioncalls to get/set path records
In-Reply-To: <444FA9B5.8090507@ichips.intel.com>
References: <20060426075916.GB8155@mellanox.co.il> <444F9F49.7060608@ichips.intel.com> <20060426164455.GP31324@mellanox.co.il> <444FA9B5.8090507@ichips.intel.com>
Message-ID: <20060426173716.GA10098@mellanox.co.il>

Quoting r. Sean Hefty :
> >How about doing copy_from_user in ucma, and implementing
> >rdma_set_path/rdma_get_path in cma?
>
> I don't think that we want to start adding a new set of APIs for every
> option that may eventually need to be supported.

Why not?

-- MST

From caitlinb at broadcom.com Wed Apr 26 11:02:39 2006
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Wed, 26 Apr 2006 11:02:39 -0700
Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: add rdma_get/set_optioncalls to get/set path records
Message-ID: <54AD0F12E08D1541B826BE97C98F99F143AE40@NT-SJCA-0751.brcm.ad.broadcom.com>

openib-general-bounces at openib.org wrote:
> Quoting r.
Sean Hefty :
>>> How about doing copy_from_user in ucma, and implementing
>>> rdma_set_path/rdma_get_path in cma?
>>
>> I don't think that we want to start adding a new set of APIs for
>> every option that may eventually need to be supported.
>
> Why not?

At least on the iWARP side you most definitely do not want to make it easy to add new options casually. That will lead to users wanting to micro-manage the TCP stack using existing knobs and dials that were designed for host-based TCP stacks.

Example: you do NOT want to allow the user to enable Nagle for a TCP connection carrying iWARP/MPA. The only classic L4 layer sockopt that is even vaguely relevant to an RDMA capable device is enabling/disabling keepalive.

What about for IB HCAs? Are there a large number of options that have not yet been exposed but which are device independent and *might* be desirable to control? If not, then why introduce a "catchall" interface as opposed to specific interfaces that have to be justified on a per method basis?

From sean.hefty at intel.com Wed Apr 26 11:26:23 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 26 Apr 2006 11:26:23 -0700
Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: addrdma_get/set_optioncalls to get/set path records
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F143AE40@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID:

>What about for IB HCAs? Are there a large number
>of options that have not yet been exposed but
>which are device independent and *might* be
>desirable to control? If not, then why introduce
>a "catchall" interface as opposed to specific
>interfaces that have to be justified on a per
>method basis?

Sockets provides setsockopt/getsockopt calls, and there is an attempt here to emulate sockets. These calls provide a very simple way to extend functionality without needing to modify the user to kernel transition code with every feature.
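As a rough illustration of the setsockopt-style interface described above, here is a minimal sketch of a single entry point that dispatches on (level, optname), so that a new option only touches the dispatch table and not the call signature. Everything named here (conn_id, conn_set_option, LEVEL_IB, OPT_IB_PATHS, OPT_IB_TIMEOUT) is hypothetical and is not the rdma_cm API; returning -ENOPROTOOPT for unknown options is an assumption borrowed from sockets.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical level and option names, chosen only for this sketch. */
enum { LEVEL_IB = 1 };
enum { OPT_IB_PATHS = 1, OPT_IB_TIMEOUT = 2 };

/* Hypothetical connection object holding the state the options modify. */
struct conn_id {
	int timeout_ms;
	unsigned char paths[128];
	size_t paths_len;
};

/* Per-level handler: each option validates its own length and payload. */
static int set_ib_option(struct conn_id *id, int optname,
			 const void *optval, size_t optlen)
{
	switch (optname) {
	case OPT_IB_PATHS:
		if (optlen == 0 || optlen > sizeof(id->paths))
			return -EINVAL;
		memcpy(id->paths, optval, optlen);
		id->paths_len = optlen;
		return 0;
	case OPT_IB_TIMEOUT:
		if (optlen != sizeof(int))
			return -EINVAL;
		memcpy(&id->timeout_ms, optval, sizeof(int));
		return 0;
	default:
		return -ENOPROTOOPT;	/* unknown option at this level */
	}
}

/* Single entry point: the signature stays fixed as options are added. */
int conn_set_option(struct conn_id *id, int level, int optname,
		    const void *optval, size_t optlen)
{
	switch (level) {
	case LEVEL_IB:
		return set_ib_option(id, optname, optval, optlen);
	default:
		return -ENOPROTOOPT;	/* unknown level */
	}
}
```

The trade-off raised in the thread is visible here: the entry point never changes, but every option still needs its own validation code on the implementing side.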
Functionality that I would like to add through these calls in the short term include: * Bind to a device based on IB specific addresses (e.g. GIDs). * Getting usable path records between two nodes. * Setting primary and alternate paths for a connection. * Modify an alternate path for a connection. * Joining a multicast group identified by an IP address. * Leaving a multicast group identified by an IP address. * Joining a multicast group identified by an MCMemberRecord. * Leaving a multicast group identified by an MCMemberRecord. - Sean From caitlinb at broadcom.com Wed Apr 26 11:43:09 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 26 Apr 2006 11:43:09 -0700 Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: addrdma_get/set_optioncalls to get/set path records Message-ID: <54AD0F12E08D1541B826BE97C98F99F143AE57@NT-SJCA-0751.brcm.ad.broadcom.com> Sean Hefty wrote: >> What about for IB HCAs? Are there a large number of options that have >> not yet been exposed but which are device independent and *might* be >> desirable to control? If not, then why introduce a "catchall" >> interface as opposed to specific interfaces that have to justified >> on a per method basis? > > Sockets provides setsockopt/getsockopt calls, and there is an > attempt here to emulate sockets. These calls provide a very > simple way to extend functionality without needing to modify > the user to kernel transition code with every feature. > > Functionality that I would like to add through these calls in the > short term include: > > * Bind to a device based on IB specific addresses (e.g. GIDs). > * Getting usable path records between two nodes. > * Setting primary and alternate paths for a connection. > * Modify an alternate path for a connection. > * Joining a multicast group identified by an IP address. > * Leaving a multicast group identified by an IP address. > * Joining a multicast group identified by an MCMemberRecord. 
> * Leaving a multicast group identified by an MCMemberRecord.
>
> - Sean

Those all seem reasonable and on-topic. What you want to avoid is implying that you are creating a general purpose interface for peeking under the hood.

All of the cases cited are either fairly naturally transport independent, or clearly relevant only to a specific protocol. More importantly they are all describable in the problem-domain without reference to presumed implementation algorithms.

If you scan the list of sockopts for tcp sockets you'll see that those distinctions don't necessarily hold up in the long run (such as specifying the TCP receive buffer).

But on the latter set, aren't there already UDP interfaces that supply the required controls?

From mst at mellanox.co.il Wed Apr 26 11:54:42 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 26 Apr 2006 21:54:42 +0300
Subject: [openib-general] Re: [RFC] [PATCH 1/3] RDMA CM: addrdma_get/set_optioncalls to get/set path records
In-Reply-To:
References: <54AD0F12E08D1541B826BE97C98F99F143AE40@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <20060426185442.GB10098@mellanox.co.il>

Quoting r. Sean Hefty :
> Sockets provides setsockopt/getsockopt calls, and there is an attempt here to
> emulate sockets.

Maybe we should create a socket instead of a char device in ucma? Then we get bind/listen/connect/accept for free.

> These calls provide a very simple way to extend functionality
> without needing to modify the user to kernel transition code with every feature.

How? Any new option extends the user to kernel transition code anyway. No?
-- MST From mshefty at ichips.intel.com Wed Apr 26 12:05:39 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 26 Apr 2006 12:05:39 -0700 Subject: [openib-general] Re: [RFC] [PATCH 1/3] RDMA CM: addrdma_get/set_optioncalls to get/set path records In-Reply-To: <20060426185442.GB10098@mellanox.co.il> References: <54AD0F12E08D1541B826BE97C98F99F143AE40@NT-SJCA-0751.brcm.ad.broadcom.com> <20060426185442.GB10098@mellanox.co.il> Message-ID: <444FC483.5070006@ichips.intel.com> Michael S. Tsirkin wrote: >>Sockets provides setsockopt/getsockopt calls, and there is an attempt here to >>emulate sockets. > > Maybe we should create a socket instead of a char device in ucma? > Then we get bind/listen/connect/accept for free. This sounds worth considering. I'm not sure what it would take to do it without looking into it more though. >>These calls provide a very simple way to extend functionality >>without needing to modify the user to kernel transition code with every feature. > > > How? Any new option extends the user to kernel transition code anyway. No? The new options would need to be added to the library header file. The library source file and ucma module wouldn't change. The kernel would need to add the actual implementation. - Sean From mshefty at ichips.intel.com Wed Apr 26 12:14:24 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 26 Apr 2006 12:14:24 -0700 Subject: [openib-general] [RFC] [PATCH 1/3] RDMA CM: addrdma_get/set_optioncalls to get/set path records In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F143AE57@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F143AE57@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <444FC690.3090603@ichips.intel.com> Caitlin Bestler wrote: >>* Bind to a device based on IB specific addresses (e.g. GIDs). >>* Getting usable path records between two nodes. >>* Setting primary and alternate paths for a connection. >>* Modify an alternate path for a connection. 
>>* Joining a multicast group identified by an IP address.
>>* Leaving a multicast group identified by an IP address.
>>* Joining a multicast group identified by an MCMemberRecord.
>>* Leaving a multicast group identified by an MCMemberRecord.
>
> But on the latter set, aren't there already UDP interfaces that
> supply the required controls?

There's no easy way to do this from userspace. I'm suggesting that this functionality be exposed through the rdma_cm, rather than exposing the kernel APIs through separate kernel modules and libraries.

- Sean

From mst at mellanox.co.il Wed Apr 26 12:25:25 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 26 Apr 2006 22:25:25 +0300
Subject: [openib-general] Re: [RFC] [PATCH 1/3] RDMA CM: addrdma_get/set_optioncalls to get/set path records
In-Reply-To: <444FC483.5070006@ichips.intel.com>
References: <54AD0F12E08D1541B826BE97C98F99F143AE40@NT-SJCA-0751.brcm.ad.broadcom.com> <20060426185442.GB10098@mellanox.co.il> <444FC483.5070006@ichips.intel.com>
Message-ID: <20060426192525.GD10098@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: Re: [RFC] [PATCH 1/3] RDMA CM: addrdma_get/set_optioncalls to get/set path records
>
> Michael S. Tsirkin wrote:
> >>Sockets provides setsockopt/getsockopt calls, and there is an attempt
> >>here to
> >>emulate sockets.
> >
> >Maybe we should create a socket instead of a char device in ucma?
> >Then we get bind/listen/connect/accept for free.
>
> This sounds worth considering. I'm not sure what it would take to do it
> without looking into it more though.

The simplest way to do this is by calling inet_register_protosw. And you fill struct proto_ops in include/linux/net.h

Things like qp state or port space can be passed around by getsockopt/setsockopt.
-- MST From iod00d at hp.com Wed Apr 26 13:53:34 2006 From: iod00d at hp.com (Grant Grundler) Date: Wed, 26 Apr 2006 13:53:34 -0700 Subject: [openib-general] netperf for RDS needed In-Reply-To: References: Message-ID: <20060426205334.GF30826@esmail.cup.hp.com> On Wed, Apr 26, 2006 at 04:43:19PM +0300, Leonid Arsh wrote: > Ranjit, > could you please send us the netperf working over RDS. We need the > the netperf specific version and the changes are done to support RDS. Can you instead submit patches directly to netperf.org? Rick Jones (at my prodding) started an SVN tree at http://www.netperf.org/svn/netperf2/ and he welcomes patches for most things. However, I think most folks here would be more interested in the netperf4 SVN repository. netperf4 has lots of features that including multi-stream testing. I think of netperf4 as a network centric test harness with some ready made tests. (netperf4 is now working on linux, windows, hpux, and I think solaris even though it's still in beta stages now). hth, grant From sean.hefty at intel.com Wed Apr 26 14:28:35 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 26 Apr 2006 14:28:35 -0700 Subject: [openib-general] [PATCH] librdmacm: fix issue reporting connect events after connect error Message-ID: If librdmacm has an error processing an ESTABLISHED event, it will report a CONNECT_ERROR event to the user. However, the kernel will list the cm_id state as connected. This can result in librdmacm receiving a DISCONNECTED event on a cm_id that it failed to fully connect from userspace. This event should not be reported to the user. This condition can occur when the remote side establishes a connection, then disconnects before the local side processes the ESTABLISH event. When the ESTABLISH event is finally processed, the local side will get a failure transitioning the QP, causing the librdmacm to report a CONNECT_ERROR. 
Since the connection was never reported as ESTABLISHED to the user of the librdmacm, we should not report a DISCONNECTED event. A similar situation exists between the CONNECT_RESPONSE and REJECTED events. Signed-off-by: Sean Hefty --- Index: cma.c =================================================================== --- cma.c (revision 6628) +++ cma.c (working copy) @@ -109,6 +109,7 @@ struct cma_id_private { struct rdma_cm_id id; struct cma_device *cma_dev; int events_completed; + int connect_error; pthread_cond_t cond; pthread_mutex_t mut; uint32_t handle; @@ -920,17 +921,27 @@ retry: evt->status = ucma_process_conn_resp(id_priv); if (!evt->status) evt->event = RDMA_CM_EVENT_ESTABLISHED; - else + else { evt->event = RDMA_CM_EVENT_CONNECT_ERROR; + id_priv->connect_error = 1; + } break; case RDMA_CM_EVENT_ESTABLISHED: evt->status = ucma_process_establish(&id_priv->id); - if (evt->status) + if (evt->status) { evt->event = RDMA_CM_EVENT_CONNECT_ERROR; + id_priv->connect_error = 1; + } break; case RDMA_CM_EVENT_REJECTED: + if (id_priv->connect_error) + goto retry; ucma_modify_qp_err(evt->id); break; + case RDMA_CM_EVENT_DISCONNECTED: + if (id_priv->connect_error) + goto retry; + break; default: break; } From rick.jones2 at hp.com Wed Apr 26 14:36:58 2006 From: rick.jones2 at hp.com (Rick Jones) Date: Wed, 26 Apr 2006 14:36:58 -0700 Subject: [openib-general] RE: netperf for RDS needed In-Reply-To: References: Message-ID: <444FE7FA.2000609@hp.com> I think the patch could be improved if it created a separate nettest_rds.c test suite with RDS_STREAM and RDS_RR tests in. That way the test banners will say "RDS_STREAM" or "RDS_RR" rather than "UDP_STREAM" or "UDP_RR" and it would be _much_ more clear what was being run because the output and command lines would look different. Yes, there would be a bit more code duplication, but I think that is worth having the more clearly distinguished commands/output. 
I would be more than happy to incorporate such a patch into the netperf2 mainline.

happy benchmarking,

rick jones
mr netperf

From mshefty at ichips.intel.com Wed Apr 26 14:56:38 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 26 Apr 2006 14:56:38 -0700
Subject: [openib-general] slab error while removing ib_mad
In-Reply-To:
References:
Message-ID: <444FEC95.3020409@ichips.intel.com>

Or Gerlitz wrote:
> I am getting the below trace on 2.6.17-rc2 / AMD x86_64 / PCIX HCA
> with both the IB sources that come with the kernel and svn trunk 6520.
>
> This happens if i just modprobe -r ib_mthca after fresh reboot, can
> anyone reproduce it on her/his system as well? The module does get
> modprobed out.

Do you still see this issue? Can you reproduce it on 2.6.16? I haven't seen this issue myself on my systems.

- Sean

From sashak at voltaire.com Wed Apr 26 15:15:57 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 27 Apr 2006 01:15:57 +0300
Subject: [openib-general] [PATCH] opensm: handle stdout and stderr in osm_log
Message-ID: <20060426221557.GF2453@sashak.voltaire.com>

Hello Hal,

Here is a small patch for osm_log that makes it possible to send log output to stdout or stderr.

Sasha.

In osm_log, treat the words "stdout" and "stderr", when given as log file names, as requests to log to stdout or stderr respectively.
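The name handling described above can be sketched as a small standalone helper. This is only an illustration: open_log_stream and its append flag are hypothetical names and are not part of the OpenSM code, which sets p_log->out_port inside osm_log_init().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: map a log file name to a stream.
 * NULL or "stdout" selects stdout, "stderr" selects stderr,
 * and any other name is opened as a regular file.
 * Returns NULL if the file cannot be opened.
 */
FILE *open_log_stream(const char *log_file, int append)
{
	if (log_file == NULL || strcmp(log_file, "stdout") == 0)
		return stdout;
	if (strcmp(log_file, "stderr") == 0)
		return stderr;
	return fopen(log_file, append ? "a" : "w");
}
```

Keeping the special names in one place means the command-line parser can pass the option value through unchanged, which is what the main.c hunk of the patch does.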
Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_log.h | 6 +++++- osm/opensm/main.c | 6 +----- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/osm/include/opensm/osm_log.h b/osm/include/opensm/osm_log.h index 8cf8cf1..4a7c08c 100644 --- a/osm/include/opensm/osm_log.h +++ b/osm/include/opensm/osm_log.h @@ -224,10 +224,14 @@ osm_log_init( p_log->level = log_flags; p_log->flush = flush; - if (log_file == NULL) + if (log_file == NULL || !strcmp(log_file, "stdout")) { p_log->out_port = stdout; } + else if (!strcmp(log_file, "stderr")) + { + p_log->out_port = stderr; + } else { if (accum_log_file) diff --git a/osm/opensm/main.c b/osm/opensm/main.c index 6f2a857..d95c314 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -722,11 +722,7 @@ #endif break; case 'f': - if (!strcmp(optarg, "stdout")) - /* output should be to standard output */ - opt.log_file = NULL; - else - opt.log_file = optarg; + opt.log_file = optarg; break; case 'e': From or.gerlitz at gmail.com Wed Apr 26 15:14:36 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Thu, 27 Apr 2006 00:14:36 +0200 Subject: [openib-general] RE: netperf for RDS needed In-Reply-To: References: Message-ID: <15ddcffd0604261514q489792b8y540df5f433adb04f@mail.gmail.com> On 4/26/06, Pandit, Ranjit wrote: > Attached is the patch. > BTW, we sent all this information to Moni Levy couple of weeks back. 
> patch -d netperf-2.4.1rc1 -p 1 < netperf_changes_for_rds > diff -c -r netperf-2.4.1rc1/src/netlib.c netperf-2.4.1rc1.rds/src/netlib.c can you generate the patch in the standard unifined manner that is by $ diff -rup netperf-2.4.1rc1 netperf-2.4.1rc1.rds Or From arlin.r.davis at intel.com Wed Apr 26 16:10:32 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 26 Apr 2006 16:10:32 -0700 Subject: [openib-general] [PATCH] uDAPL openib_cma: fixed address bindings, getaddrinfo, and added debug messages for rejects Message-ID: James, Sean's port checking in the uCMA exposed a address binding issue in the openib_cma provider. Here is a patch to fix the port issue and a fix for getaddrinfo when running with a debug build. I also added some additional debug messages during connect errors and rejects. -arlin Signed-off by: Arlin Davis Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 6672) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -121,11 +121,12 @@ static int getipaddr(char *name, char *a if (getaddrinfo(name, NULL, NULL, &res)) { /* retry using network device name */ ret = getipaddr_netdev(name,addr,len); - if (ret) + if (ret) { dapl_dbg_log(DAPL_DBG_TYPE_WARN, " getipaddr: invalid name, addr, or netdev(%s)\n", name); - return ret; + return ret; + } } else { if (len >= res->ai_addrlen) memcpy(addr, res->ai_addr, res->ai_addrlen); Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 6672) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -274,11 +274,21 @@ static void dapli_cm_active_cb(struct da switch (event->event) { case RDMA_CM_EVENT_UNREACHABLE: case RDMA_CM_EVENT_CONNECT_ERROR: + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_cm_active_handler: CONN_ERR " + " event=0x%x status=%d\n", + event->event, event->status); + dapl_evd_connection_callback(conn, 
IB_CME_DESTINATION_UNREACHABLE, NULL, conn->ep); break; case RDMA_CM_EVENT_REJECTED: + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_cm_active_handler: REJECTED reason=%d\n", + event->status); dapl_evd_connection_callback(conn, IB_CME_DESTINATION_REJECT, NULL, conn->ep); break; @@ -320,6 +330,9 @@ static void dapli_cm_passive_cb(struct d struct rdma_cm_event *event) { struct dapl_cm_id *new_conn; +#ifdef DAPL_DBG + struct rdma_addr *ipaddr = &conn->cm_id->route.addr; +#endif dapl_dbg_log(DAPL_DBG_TYPE_CM, " passive_cb: conn %p id %d event %d\n", @@ -343,13 +356,58 @@ static void dapli_cm_passive_cb(struct d event->private_data, new_conn->sp); break; case RDMA_CM_EVENT_UNREACHABLE: + dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, + NULL, conn->sp); + case RDMA_CM_EVENT_CONNECT_ERROR: + + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_cm_passive_handler: CONN_ERR " + " event=0x%x status=%d\n", + event->event, event->status ); + + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_cm_passive_handler: CONN_ERR " + " on SRC 0x%x,0x%x DST 0x%x,0x%x \n", + ntohl(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_port), + ntohl(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_port) + ); + dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, NULL, conn->sp); break; + case RDMA_CM_EVENT_REJECTED: - dapls_cr_callback(conn, IB_CME_DESTINATION_REJECT, NULL, - conn->sp); + + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_cm_passive_handler: REJECTED reason=%d\n", + event->status); + + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_cm_passive_handler: REJECTED " + " on SRC 0x%x,0x%x DST 0x%x,0x%x \n", + ntohl(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_port), + ntohl(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in 
*) + &ipaddr->dst_addr)->sin_port) + ); + + dapls_cr_callback(conn, IB_CME_DESTINATION_REJECT, + NULL, conn->sp); break; case RDMA_CM_EVENT_ESTABLISHED: @@ -556,6 +614,7 @@ dapls_ib_setup_conn_listener(IN DAPL_IA { DAT_RETURN dat_status = DAT_SUCCESS; ib_cm_srvc_handle_t conn; + DAT_SOCK_ADDR6 addr; /* local binding address */ /* Allocate CM and initialize lock */ if ((conn = dapl_os_alloc(sizeof(*conn))) == NULL) @@ -571,11 +630,12 @@ dapls_ib_setup_conn_listener(IN DAPL_IA } /* open identifies the local device; per DAT specification */ - ((struct sockaddr_in *)&ia_ptr->hca_ptr->hca_address)->sin_port = - htons(MAKE_PORT(ServiceID)); + /* Get family and address then set port to consumer's ServiceID */ + dapl_os_memcpy(&addr, &ia_ptr->hca_ptr->hca_address, sizeof(addr)); + ((struct sockaddr_in *)&addr)->sin_port = htons(MAKE_PORT(ServiceID)); + - if (rdma_bind_addr(conn->cm_id, - (struct sockaddr *)&ia_ptr->hca_ptr->hca_address)) { + if (rdma_bind_addr(conn->cm_id,(struct sockaddr *)&addr)) { if (errno == EBUSY) dat_status = DAT_CONN_QUAL_IN_USE; else From rpandit at silverstorm.com Wed Apr 26 17:04:02 2006 From: rpandit at silverstorm.com (Ranjit Pandit) Date: Wed, 26 Apr 2006 17:04:02 -0700 Subject: [openib-general] RE: netperf for RDS needed In-Reply-To: <15ddcffd0604261514q489792b8y540df5f433adb04f@mail.gmail.com> References: <15ddcffd0604261514q489792b8y540df5f433adb04f@mail.gmail.com> Message-ID: <96f8e60e0604261704l5d38ae11ib495a84c50f53aa4@mail.gmail.com> > can you generate the patch in the standard unifined manner that is by > > $ diff -rup netperf-2.4.1rc1 netperf-2.4.1rc1.rds > Attached. -------------- next part -------------- diff -rup netperf-2.4.1rc1/src/netlib.c netperf-2.4.1rc1.rds/src/netlib.c --- netperf-2.4.1rc1/src/netlib.c 2005-09-07 16:39:34.000000000 -0700 +++ netperf-2.4.1rc1.rds/src/netlib.c 2006-04-26 19:00:18.000000000 -0700 @@ -1,6 +1,8 @@ char netlib_id[]="\ @(#)netlib.c (c) Copyright 1993-2004 Hewlett-Packard Company. 
Version 2.3pl2"; +#define AF_INET_OFFLOAD 30 +#define AF_INET_SDP 27 /****************************************************************/ /* */ /* netlib.c */ @@ -466,6 +468,12 @@ inet_ftos(int family) case AF_INET: return("AF_INET"); break; + case AF_INET_OFFLOAD: + return("AF_INET_OFFLOAD"); + break; + case AF_INET_SDP: + return("AF_INET_SDP"); + break; #if defined(AF_INET6) case AF_INET6: return("AF_INET6"); @@ -483,6 +491,8 @@ inet_nton(int af, const void *src, char switch (af) { case AF_INET: + case AF_INET_OFFLOAD: + case AF_INET_SDP: /* magic constants again... :) */ if (cnt >= 4) { memcpy(dst,src,4); diff -rup netperf-2.4.1rc1/src/netserver.c netperf-2.4.1rc1.rds/src/netserver.c --- netperf-2.4.1rc1/src/netserver.c 2005-09-07 16:40:49.000000000 -0700 +++ netperf-2.4.1rc1.rds/src/netserver.c 2006-04-26 19:00:18.000000000 -0700 @@ -149,11 +149,12 @@ FILE *afp; char listen_port[10]; extern char *optarg; extern int optind, opterr; +extern int rds_enable; #ifndef WIN32 -#define SERVER_ARGS "dL:n:p:v:46" +#define SERVER_ARGS "dL:n:p:rv:46" #else -#define SERVER_ARGS "dL:n:p:v:46I:i:" +#define SERVER_ARGS "dL:n:p:rv:46I:i:" #endif /* This routine implements the "main event loop" of the netperf */ @@ -249,11 +250,11 @@ process_requests() #endif /* DO_NBRR */ case DO_UDP_STREAM: - recv_udp_stream(); + recv_udp_stream(local_host_name); break; case DO_UDP_RR: - recv_udp_rr(); + recv_udp_rr(local_host_name); break; #ifdef WANT_DLPI @@ -526,7 +527,7 @@ set_up_server(char hostname[], char port printf("server_control: accept failed errno %d\n",errno); exit(1); } -#if defined(MPE) || defined(__VMS) +#if defined(MPE) || defined(__VMS) || 1 /* * Since we cannot fork this process , we cant fire any threads * as they all share the same global data . 
So we better allow @@ -748,6 +749,9 @@ main(int argc, char *argv[]) strncpy(listen_port,optarg,sizeof(listen_port)); not_inetd = 1; break; + case 'r': + rds_enable=1; + break; case '4': local_address_family = AF_INET; break; Only in netperf-2.4.1rc1.rds/src: netserver.o diff -rup netperf-2.4.1rc1/src/netsh.c netperf-2.4.1rc1.rds/src/netsh.c --- netperf-2.4.1rc1/src/netsh.c 2005-09-07 11:46:16.000000000 -0700 +++ netperf-2.4.1rc1.rds/src/netsh.c 2006-04-26 19:00:18.000000000 -0700 @@ -1,6 +1,9 @@ char netsh_id[]="\ @(#)netsh.c (c) Copyright 1993-2004 Hewlett-Packard Company. Version 2.4.0"; +#define WANT_RDS +#define AF_INET_OFFLOAD 30 +#define AF_INET_SDP 27 /****************************************************************/ /* */ @@ -94,7 +97,7 @@ double atof(const char *); /* Some of the args take optional parameters. Since we are using */ /* getopt to parse the command line, we will tell getopt that they do */ /* not take parms, and then look for them ourselves */ -#define GLOBAL_CMD_LINE_ARGS "A:a:b:CcdDf:F:H:hi:I:l:L:n:O:o:P:p:t:T:v:W:w:46" +#define GLOBAL_CMD_LINE_ARGS "A:a:b:CcdDf:F:H:hi:I:l:L:n:O:o:P:p:rt:T:v:W:w:46" /************************************************************************/ /* */ @@ -127,7 +130,9 @@ char test_port[PORTBUFSIZE]; /* where i char local_test_port[PORTBUFSIZE]; /* from whence we should start */ int address_family; /* which address family remote */ int local_address_family; /* which address family local */ - +#ifdef WANT_RDS +int rds_enable; /* enable RDS testing */ +#endif /* the source of data for filling the buffers */ char fill_file[BUFSIZ]; @@ -255,6 +260,7 @@ Global options:\n\ -n numcpu Set the number of processors for CPU util\n\ -p port,lport* Specify netserver port number and/or local port\n\ -P 0|1 Don't/Do display test headers\n\ + -r Enable RDS usage for UDP tests\n\ -t testname Specify test to perform\n\ -T lcpu,rcpu Request netperf/netserver be bound to local/remote cpu\n\ -v verbosity Specify the verbosity level\n\ 
@@ -619,6 +625,9 @@ scan_cmd_line(int argc, char *argv[]) /* the header question */ print_headers = convert(optarg); break; + case 'r': + rds_enable = 1; + break; case 't': /* set the test name */ strcpy(test_name,optarg); @@ -733,6 +742,8 @@ scan_cmd_line(int argc, char *argv[]) /* host_name was not set */ switch (address_family) { case AF_INET: + case AF_INET_OFFLOAD: + case AF_INET_SDP: strcpy(host_name,"localhost"); break; case AF_UNSPEC: @@ -740,6 +751,8 @@ scan_cmd_line(int argc, char *argv[]) suppose */ switch (local_address_family) { case AF_INET: + case AF_INET_OFFLOAD: + case AF_INET_SDP: case AF_UNSPEC: strcpy(host_name,"localhost"); break; @@ -771,11 +784,15 @@ scan_cmd_line(int argc, char *argv[]) if ('\0' == local_host_name[0]) { switch (local_address_family) { case AF_INET: + case AF_INET_OFFLOAD: + case AF_INET_SDP: strcpy(local_host_name,"0.0.0.0"); break; case AF_UNSPEC: switch (address_family) { case AF_INET: + case AF_INET_OFFLOAD: + case AF_INET_SDP: case AF_UNSPEC: strcpy(local_host_name,"0.0.0.0"); break; diff -rup netperf-2.4.1rc1/src/netsh.h netperf-2.4.1rc1.rds/src/netsh.h --- netperf-2.4.1rc1/src/netsh.h 2005-04-06 13:26:40.000000000 -0700 +++ netperf-2.4.1rc1.rds/src/netsh.h 2006-04-26 19:00:18.000000000 -0700 @@ -76,6 +76,12 @@ extern int remote_send_offset, remote_recv_offset; +#ifdef WANT_RDS +extern int rds_enable; +#define AF_INET_OFFLOAD 30 +#define AF_INET_SDP 27 +#endif + #ifdef WANT_INTERVALS extern int interval_usecs; extern int interval_wate; diff -rup netperf-2.4.1rc1/src/nettest_bsd.c netperf-2.4.1rc1.rds/src/nettest_bsd.c --- netperf-2.4.1rc1/src/nettest_bsd.c 2005-09-07 15:41:38.000000000 -0700 +++ netperf-2.4.1rc1.rds/src/nettest_bsd.c 2006-04-26 19:00:18.000000000 -0700 @@ -7,6 +7,12 @@ char nettest_id[]="\ #define WANT_INTERVALS #endif /* lint */ +#define WANT_RDS +#ifdef WANT_RDS +#define AF_INET_OFFLOAD 30 +#define AF_INET_SDP 27 +#endif + /****************************************************************/ /* */ /* 
nettest_bsd.c */ @@ -334,6 +340,12 @@ comma.\n"; int nf_to_af(int nf) { switch(nf) { + case NF_INET_OFFLOAD: + return AF_INET_OFFLOAD; + break; + case NF_INET_SDP: + return AF_INET_SDP; + break; case NF_INET: return AF_INET; break; @@ -357,6 +369,12 @@ int af_to_nf(int af) { switch(af) { + case AF_INET_OFFLOAD: + return NF_INET_OFFLOAD; + break; + case AF_INET_SDP: + return NF_INET_SDP; + break; case AF_INET: return NF_INET; break; @@ -430,7 +448,7 @@ complete_addrinfo(char *controlhost, cha hostname = data_address; else hostname = controlhost; - + if (debug) { fprintf(where, "complete_addrinfo using hostname %s port %s family %s type %s prot %s flags 0x%x\n", @@ -442,13 +460,20 @@ complete_addrinfo(char *controlhost, cha flags); fflush(where); } - + memset(&hints, 0, sizeof(hints)); - hints.ai_family = family; + +#if 0 + if (rds_enable && type == SOCK_DGRAM) + hints.ai_family = AF_INET_OFFLOAD; + else +#endif + hints.ai_family = family; + switch (type) { case 0: - case SOCK_STREAM: case SOCK_DGRAM: + case SOCK_STREAM: /* Right now most implementations only support these socket * types. 
*/ @@ -472,7 +497,7 @@ complete_addrinfo(char *controlhost, cha change_info |= CHANGE_PROTOCOL; break; } - + hints.ai_flags = flags|AI_CANONNAME; count = 0; do { @@ -527,7 +552,24 @@ complete_addrinfo(char *controlhost, cha dump_addrinfo(where, res, hostname, port, family); } - +#ifdef WANT_RDS + /* this should be done before getaddrinfo - but there is currently a bug */ + /* and getaddrinfo does not recognize AF_INET_OFFLOAD */ + if (rds_enable && type == SOCK_DGRAM) + { + res->ai_family = AF_INET_OFFLOAD; +#if 0 + res->ai_protocol = 0; +#endif + } + if (rds_enable && type == SOCK_STREAM) + { + res->ai_family = AF_INET_SDP; +#if 0 + res->ai_protocol = 0; +#endif + } +#endif return(res); } @@ -588,6 +630,8 @@ static unsigned short get_port_number(struct addrinfo *res) { switch(res->ai_family) { + case AF_INET_OFFLOAD: + case AF_INET_SDP: case AF_INET: { struct sockaddr_in *foo = (struct sockaddr_in *)res->ai_addr; return(ntohs(foo->sin_port)); @@ -614,6 +658,8 @@ void set_port_number(struct addrinfo *res, unsigned short port) { switch(res->ai_family) { + case AF_INET_OFFLOAD: + case AF_INET_SDP: case AF_INET: { struct sockaddr_in *foo = (struct sockaddr_in *)res->ai_addr; foo->sin_port = htons(port); @@ -849,6 +895,7 @@ create_data_socket(struct addrinfo *res) errno); fprintf(where," port: %d\n",get_port_number(res)); fflush(where); + exit(1); } } @@ -924,6 +971,8 @@ get_address_address(struct addrinfo *inf switch(info->ai_family) { case AF_INET: + case AF_INET_OFFLOAD: + case AF_INET_SDP: sin = (struct sockaddr_in *)info->ai_addr; return(&(sin->sin_addr)); break; @@ -4206,6 +4255,10 @@ recv_tcp_stream() loc_rcvavoid = tcp_stream_request->so_rcvavoid; loc_sndavoid = tcp_stream_request->so_sndavoid; + /* hack to deal with getaddrinfo not recognizing af_inet_offload */ + if (rds_enable) + tcp_stream_request->ipfamily = AF_INET; + set_hostname_and_port(local_name, port_buffer, nf_to_af(tcp_stream_request->ipfamily), @@ -5785,10 +5838,16 @@ bytes bytes secs # 
send_ring->buffer_ptr, send_size, 0)) != send_size) { - if ((len >= 0) || - SOCKET_EINTR(len)) - break; - if (errno == ENOBUFS) { + if ((len >= 0) || + SOCKET_EINTR(len)) { + + fprintf(where, "send_udp_stream: aborting send len %d, %d\n",len, + SOCKET_EINTR(len)); + fflush(where); + break; + } + if (errno == ENOBUFS || errno == EWOULDBLOCK) { + usleep(100); failed_sends++; continue; } @@ -5832,7 +5891,7 @@ bytes bytes secs # #endif /* WANT_INTERVALS */ } - + /* This is a timed test, so the remote will be returning to us after */ /* a time. We should not need to send any "strange" messages to tell */ /* the remote that the test is completed, unless we decide to add a */ @@ -5841,6 +5900,11 @@ bytes bytes secs # /* the test is over, so get stats and stuff */ cpu_stop(local_cpu_usage, &elapsed_time); +#if 1 + fprintf(where, "send_udp_stream: sent %d, failed %d\n",messages_sent, + failed_sends); + fflush(where); +#endif /* Get the statistics from the remote end */ recv_response(); @@ -6063,7 +6127,7 @@ bytes bytes secs # /* UDP_STREAM performance test. */ void -recv_udp_stream() +recv_udp_stream(char local_host[]) { struct ring_elt *recv_ring; @@ -6158,13 +6222,17 @@ recv_udp_stream() loc_rcvavoid = udp_stream_request->so_rcvavoid; loc_sndavoid = udp_stream_request->so_sndavoid; + /* hack to deal with getaddrinfo not recognizing af_inet_offload */ + if (rds_enable) + udp_stream_request->ipfamily = AF_INET; + set_hostname_and_port(local_name, port_buffer, nf_to_af(udp_stream_request->ipfamily), udp_stream_request->port); - local_res = complete_addrinfo(local_name, - local_name, + local_res = complete_addrinfo(rds_enable ? local_host : local_name, + rds_enable ? 
local_host : local_name, port_buffer, nf_to_af(udp_stream_request->ipfamily), SOCK_DGRAM, @@ -6275,6 +6343,9 @@ recv_udp_stream() { if ((len == SOCKET_ERROR) && !SOCKET_EINTR(len)) { netperf_response.content.serv_errno = errno; + fprintf(where,"recv_udp_stream: got error, len %d, %d\n", + len, SOCKET_EINTR(len)); +fflush(where); send_response(); exit(1); } @@ -6978,7 +7049,7 @@ bytes bytes bytes bytes secs. per /* this routine implements the receive side (netserver) of a UDP_RR */ /* test. */ void -recv_udp_rr() +recv_udp_rr(char local_host[]) { struct ring_elt *recv_ring; @@ -7087,13 +7158,17 @@ recv_udp_rr() loc_rcvavoid = udp_rr_request->so_rcvavoid; loc_sndavoid = udp_rr_request->so_sndavoid; + /* hack to deal with getaddrinfo not recognizing af_inet_offload */ + if (rds_enable) + udp_rr_request->ipfamily = AF_INET; + set_hostname_and_port(local_name, port_buffer, nf_to_af(udp_rr_request->ipfamily), udp_rr_request->port); - local_res = complete_addrinfo(local_name, - local_name, + local_res = complete_addrinfo(rds_enable ? local_host : local_name, + rds_enable ? 
local_host : local_name, port_buffer, nf_to_af(udp_rr_request->ipfamily), SOCK_DGRAM, @@ -7397,6 +7472,9 @@ recv_tcp_rr() loc_rcvavoid = tcp_rr_request->so_rcvavoid; loc_sndavoid = tcp_rr_request->so_sndavoid; + if (rds_enable) + tcp_rr_request->ipfamily = AF_INET; + set_hostname_and_port(local_name, port_buffer, nf_to_af(tcp_rr_request->ipfamily), diff -rup netperf-2.4.1rc1/src/nettest_bsd.h netperf-2.4.1rc1.rds/src/nettest_bsd.h --- netperf-2.4.1rc1/src/nettest_bsd.h 2005-04-18 16:39:16.000000000 -0700 +++ netperf-2.4.1rc1.rds/src/nettest_bsd.h 2006-04-26 19:00:18.000000000 -0700 @@ -14,6 +14,8 @@ #define NF_UNSPEC 0 #define NF_INET 4 #define NF_INET6 6 +#define NF_INET_OFFLOAD 30 +#define NF_INET_SDP 27 struct tcp_stream_request_struct { int send_buf_size; @@ -443,8 +445,8 @@ extern void recv_tcp_maerts(); extern void recv_tcp_rr(); extern void recv_tcp_conn_rr(); extern void recv_tcp_cc(); -extern void recv_udp_stream(); -extern void recv_udp_rr(); +extern void recv_udp_stream(char local_host[]); +extern void recv_udp_rr(char local_host[]); extern void loc_cpu_rate(); extern void rem_cpu_rate(); From hozer at hozed.org Wed Apr 26 21:13:24 2006 From: hozer at hozed.org (Troy Benjegerdes) Date: Wed, 26 Apr 2006 23:13:24 -0500 Subject: [openib-general] Re: TSO and IPoIB performance degradation In-Reply-To: <20060320.023704.70907203.davem@davemloft.net> References: <20060320090629.GA11352@mellanox.co.il> <20060320.015500.72136710.davem@davemloft.net> <20060320102234.GV29929@mellanox.co.il> <20060320.023704.70907203.davem@davemloft.net> Message-ID: <20060427041323.GX15855@narn.hozed.org> On Mon, Mar 20, 2006 at 02:37:04AM -0800, David S. Miller wrote: > From: "Michael S. Tsirkin" > Date: Mon, 20 Mar 2006 12:22:34 +0200 > > > Quoting r. David S. Miller : > > > The path an SKB can take is opaque and unknown until the very last > > > moment it is actually given to the device transmit function. > > > > Why, I was proposing looking at dst cache. 
If that's NULL, well,
> > we won't stretch ACKs. Worst case we apply the wrong optimization.
> > Right?
>
> Where you receive a packet from isn't very useful for determining
> even the full path on which that packet itself flowed.
>
> More importantly, packets also do not necessarily go back out over the
> same path on which packets are received for a connection. This is
> actually quite common.
>
> Maybe packets for this connection come in via IPoIB but go out via
> gigabit ethernet and another route altogether.
>
> > What I'd like to clarify, however: rfc2581 explicitly states that in
> > some cases it might be OK to generate ACKs less frequently than
> > every second full-sized segment. Given Matt's measurements, TCP on
> > top of IP over InfiniBand on Linux seems to hit one of these cases.
> > Do you agree to that?
>
> I disagree with Linux changing its behavior. It would be great to
> turn off congestion control completely over local gigabit networks,
> but that isn't determinable in any way, so we don't do that.
>
> The IPoIB situation is no different, you can set all the bits you want
> in incoming packets, the barrier to doing this remains the same.
>
> It hurts performance if any packet drop occurs because it will require
> an extra round trip for recovery to begin to be triggered at the
> sender.
>
> The network is a black box, routes to and from a destination are
> arbitrary, and so is packet rewriting and reflection, so being able to
> say "this all occurs on IPoIB" is simply infeasible.
>
> I don't know how else to say this, we simply cannot special case IPoIB
> or any other topology type.

David is right. If you care about performance, you are already using SDP
or verbs layer for the transport anyway. If I am going to be doing IPoIB,
it's because eventually I expect the packet might get off the IB network
and onto some other network and go halfway across the country.
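
[Illustration for readers of the archive, not part of the original thread.]
The rule being argued about above is the RFC 2581 receiver guideline: an ACK
should be generated for at least every second full-sized segment, and within
500 ms of the first unacknowledged segment. A minimal sketch of that decision
in C follows. It is a toy model, not the Linux TCP code; the function name
should_ack_now and the "stretch" parameter are hypothetical and exist only to
show what "stretching" ACKs would mean.

```c
#include <stdbool.h>

/* Toy model only; this is NOT how the Linux TCP stack is written. */
#define ACK_EVERY_N_SEGMENTS 2u   /* RFC 2581: at least every 2nd full-sized segment */
#define MAX_ACK_DELAY_MS     500u /* RFC 2581: never delay an ACK longer than this  */

/*
 * Hypothetical helper: should the receiver emit an ACK now?
 *   full_segments_unacked  - full-sized segments received but not yet ACKed
 *   ms_since_first_unacked - age of the oldest unACKed segment, in ms
 *   stretch                - 1 for standard delayed ACKs, >1 models a
 *                            "stretched" ACK (fewer ACKs per segment)
 */
static bool should_ack_now(unsigned full_segments_unacked,
                           unsigned ms_since_first_unacked,
                           unsigned stretch)
{
    if (full_segments_unacked == 0)
        return false;                 /* nothing to acknowledge */
    if (ms_since_first_unacked >= MAX_ACK_DELAY_MS)
        return true;                  /* delay bound reached, ACK regardless */
    if (stretch == 0)
        stretch = 1;                  /* treat 0 as "no stretching" */
    return full_segments_unacked >= ACK_EVERY_N_SEGMENTS * stretch;
}
```

With stretch = 1 the receiver ACKs after two full segments; with stretch = 4
it waits for eight. In this model a stretched receiver that sees a loss keeps
the sender waiting longer for feedback, which is the cost David describes.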

From monil at voltaire.com Wed Apr 26 21:45:28 2006
From: monil at voltaire.com (Moni Levy)
Date: Thu, 27 Apr 2006 07:45:28 +0300
Subject: [openib-general] RE: netperf for RDS needed
In-Reply-To:
References:
Message-ID: <6a122cc00604262145x5607bbd8l216c5b05f350c285@mail.gmail.com>

Ranjit,

>
> BTW, we sent all this information to Moni Levy couple of weeks back.
>

I guess it's something with my mailbox, because I never received it.

-- Moni

From eitan at mellanox.co.il Wed Apr 26 23:35:01 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 27 Apr 2006 09:35:01 +0300
Subject: [openib-general] RE: [PATCH] opensm: handle stdout and stderr in osm_log
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102BAA1@mtlexch01.mtl.com>

Looks like a good idea to me.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> Sent: Thursday, April 27, 2006 1:16 AM
> To: Hal Rosenstock
> Cc: openib-general at openib.org; Eitan Zahavi; Yael Kalka; Ofer Gigi
> Subject: [PATCH] opensm: handle stdout and stderr in osm_log
>
> Hello Hal,
>
> There is small patch for osm_log, this provide possibility to drop log
> output to stdout or stderr.
>
> Sasha.
>
>
> Handle in osm_log "stdout" or "stderr" words as log file names to
> indicate logging to stdout or stderr respectively.
>
> Signed-off-by: Sasha Khapyorsky
> ---
>
>  osm/include/opensm/osm_log.h |    6 +++++-
>  osm/opensm/main.c            |    6 +-----
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/osm/include/opensm/osm_log.h b/osm/include/opensm/osm_log.h
> index 8cf8cf1..4a7c08c 100644
> --- a/osm/include/opensm/osm_log.h
> +++ b/osm/include/opensm/osm_log.h
> @@ -224,10 +224,14 @@ osm_log_init(
>    p_log->level = log_flags;
>    p_log->flush = flush;
>
> -  if (log_file == NULL)
> +  if (log_file == NULL || !strcmp(log_file, "stdout"))
>    {
>      p_log->out_port = stdout;
>    }
> +  else if (!strcmp(log_file, "stderr"))
> +  {
> +    p_log->out_port = stderr;
> +  }
>    else
>    {
>      if (accum_log_file)
> diff --git a/osm/opensm/main.c b/osm/opensm/main.c
> index 6f2a857..d95c314 100644
> --- a/osm/opensm/main.c
> +++ b/osm/opensm/main.c
> @@ -722,11 +722,7 @@ #endif
>        break;
>
>      case 'f':
> -      if (!strcmp(optarg, "stdout"))
> -        /* output should be to standard output */
> -        opt.log_file = NULL;
> -      else
> -        opt.log_file = optarg;
> +      opt.log_file = optarg;
>        break;
>
>      case 'e':

From or.gerlitz at gmail.com Wed Apr 26 23:51:04 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Thu, 27 Apr 2006 09:51:04 +0300
Subject: [openib-general] slab error while removing ib_mad
In-Reply-To: <444FEC95.3020409@ichips.intel.com>
References: <444FEC95.3020409@ichips.intel.com>
Message-ID: <15ddcffd0604262351o78d74022tbeeb85fa9743202f@mail.gmail.com>

On 4/27/06, Sean Hefty wrote:
> Or Gerlitz wrote:
> > I am getting the below trace on 2.6.17-rc2 / AMD x86_64 / PCIX HCA
> > with both the IB sources that come with the kernel and svn trunk 6520.
> >
> > This happens if i just modprobe -r ib_mthca after fresh reboot, can
> > anyone reproduce it on her/his system as well? The module does get
> > modprobed out.
>
> Do you still see this issue? Can you reproduce it on 2.6.16?
>
> I haven't seen this issue myself on my systems.

OK, I see it now also with 2.6.17-rc3, which is more or less the latest
Linux kernel. I will try to reproduce it with a plain (IB-unrelated)
module doing kmem_cache_create/alloc/free/destroy.

Or.

From lindahl at pathscale.com Thu Apr 27 00:20:41 2006
From: lindahl at pathscale.com (Greg Lindahl)
Date: Thu, 27 Apr 2006 00:20:41 -0700
Subject: [openib-general] [PATCH] opensm: handle stdout and stderr in osm_log
In-Reply-To: <20060426221557.GF2453@sashak.voltaire.com>
References: <20060426221557.GF2453@sashak.voltaire.com>
Message-ID: <20060427072041.GA1805@greglaptop.hsd1.ca.comcast.net>

On Thu, Apr 27, 2006 at 01:15:57AM +0300, Sasha Khapyorsky wrote:

> There is small patch for osm_log, this provide possibility to drop log
> output to stdout or stderr.

Isn't the Unix convention to use "--" to mean stdout? Or you can use
/dev/fd/{0,1}...

-- greg

From lindahl at pathscale.com Thu Apr 27 00:23:52 2006
From: lindahl at pathscale.com (Greg Lindahl)
Date: Thu, 27 Apr 2006 00:23:52 -0700
Subject: [openib-general] Re: TSO and IPoIB performance degradation
In-Reply-To: <20060427041323.GX15855@narn.hozed.org>
References: <20060320090629.GA11352@mellanox.co.il> <20060320.015500.72136710.davem@davemloft.net> <20060320102234.GV29929@mellanox.co.il> <20060320.023704.70907203.davem@davemloft.net> <20060427041323.GX15855@narn.hozed.org>
Message-ID: <20060427072352.GB1805@greglaptop.hsd1.ca.comcast.net>

On Wed, Apr 26, 2006 at 11:13:24PM -0500, Troy Benjegerdes wrote:

> David is right. If you care about performance, you are already using SDP
> or verbs layer for the transport anyway. If I am going to be doing IPoIB,
> it's because eventually I expect the packet might get off the IB network
> and onto some other network and go halfway across the country.

This is going to be a surprise to lots of people who want high-speed
gateways from IB to ethernet -- many clusters connect to fileservers
and other performance-sensitive gizmos that way.

-- greg

From leonida at voltaire.com Thu Apr 27 01:24:49 2006
From: leonida at voltaire.com (Leonid Arsh)
Date: Thu, 27 Apr 2006 11:24:49 +0300
Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval
In-Reply-To:
References:
Message-ID: <44507FD1.9040801@voltaire.com>

Shirley Ma wrote:
> Without seeing your patch, I couldn't say anything. I guess your
> implementation didn't handle multiple threads simultaneously. If you
> only have one interrupt handler, I couldn't see any reason you can get
> a better performance number with splitting CQs.

Shirley, you are right. I just wanted to share our experience with you.
All the tests were made on our own IPoIB driver, so our NAPI
implementation isn't relevant here.
Unfortunately, we don't plan to work on IPoIB performance in the near
future, so I can't implement NAPI on the OpenIB driver right now.

I think it would be very interesting to compare the NAPI performance
against the work queue.
Please let me know if you are planning to do it yourself.

>
> Could you please post your NAPI patch here?
>
> As I mentioned I will test my patch to see how's the performance.

Thank you,
Leonid

From ogerlitz at voltaire.com Thu Apr 27 01:29:31 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 27 Apr 2006 11:29:31 +0300 (IDT)
Subject: [openib-general] possible bug in kmem_cache related code
Message-ID:

With 2.6.17-rc3 I'm running into something which seems to be a bug
related to kmem_cache.
Doing some allocations/deallocations from a kmem_cache and later
attempting to destroy it yields the following message and trace:

============================================================================
slab error in kmem_cache_destroy(): cache `my_cache': Can't free all objects

Call Trace: {kmem_cache_destroy+150}
       {:my_kcache:kcache_cleanup_module+51}
       {sys_delete_module+415} {__up_write+20}
       {sys_munmap+91} {system_call+126}
Failed to destroy cache
============================================================================

I was hitting it as an InfiniBand/iSCSI user, since the IB/iSCSI/SCSI code
uses kmem_caches, but because the failure happens in code which works fine
on 2.6.16, I decided to try it with a synthetic module and hit the same
error.

Below is sample code that reproduces it. If I only do kmem_cache_create
and later destroy the cache, it does not happen. Attached is my .config;
please note that some of the CONFIG_DEBUG_ options are enabled.

Please CC openib-general at openib.org, at least with the resolution of
the matter, since it is kind of hard to do testing over 2.6.17-rcX with
this issue: the tests run fine, but some modules crash on rmmod, so a
reboot is needed.

thanks,

Or.

This is the related slab info line once the module is loaded:

my_cache 256 264 328 12 1 : tunables 32 16 8 : slabdata 22 22 0 : globalstat 264 264 22 0

--- /dev/null	1970-01-01 02:00:00.000000000 +0200
+++ kcache/kcache.c	2006-04-27 10:43:18.000000000 +0300
@@ -0,0 +1,61 @@
+#include
+#include
+
+kmem_cache_t *cache;
+
+struct foo {
+	char bar[300];
+};
+
+
+#define TRIES 256
+
+struct foo *foo_arr[TRIES];
+
+static int __init kcache_init_module(void)
+{
+	int i, j;
+
+	cache = kmem_cache_create("my_cache",
+				  sizeof (struct foo),
+				  0,
+				  SLAB_HWCACHE_ALIGN,
+				  NULL,
+				  NULL);
+	if (!cache) {
+		printk(KERN_ERR "couldn't create cache\n");
+		goto error1;
+	}
+
+	for (i = 0; i < TRIES; i++) {
+		foo_arr[i] = kmem_cache_alloc(cache, GFP_KERNEL);
+		if (foo_arr[i] == NULL) {
+			printk(KERN_ERR "couldn't allocate from cache\n");
+			goto error2;
+		}
+	}
+
+	return 0;
+error2:
+	for (j = 0; j < i; j++)
+		kmem_cache_free(cache, foo_arr[j]);
+error1:
+	return -ENOMEM;
+}
+
+static void __exit kcache_cleanup_module(void)
+{
+	int i;
+
+	for (i = 0; i < TRIES; i++)
+		kmem_cache_free(cache, foo_arr[i]);
+
+	if (kmem_cache_destroy(cache)) {
+		printk(KERN_DEBUG "Failed to destroy cache\n");
+	}
+}
+
+MODULE_LICENSE("GPL");
+
+module_init(kcache_init_module);
+module_exit(kcache_cleanup_module);

-------------- next part --------------
# # Automatically generated make config: don't edit # Linux kernel version: 2.6.17-rc3 # Thu Apr 27 10:16:01 2006 # CONFIG_X86_64=y CONFIG_64BIT=y CONFIG_X86=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_MMU=y CONFIG_RWSEM_GENERIC_SPINLOCK=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_CMPXCHG=y CONFIG_EARLY_PRINTK=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set CONFIG_SYSCTL=y CONFIG_AUDIT=y CONFIG_AUDITSYSCALL=y # CONFIG_IKCONFIG is not set # CONFIG_CPUSETS is not set # CONFIG_RELAY is not set CONFIG_INITRAMFS_SOURCE="" CONFIG_UID16=y CONFIG_VM86=y CONFIG_CC_OPTIMIZE_FOR_SIZE=y # CONFIG_EMBEDDED is not set CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_ALL is not set CONFIG_KALLSYMS_EXTRA_PASS=y CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 # CONFIG_SLOB is not set CONFIG_OBSOLETE_INTERMODULE=m # # Loadable module support # CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y # CONFIG_MODULE_FORCE_UNLOAD is not set CONFIG_MODVERSIONS=y # CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y CONFIG_STOP_MACHINE=y # # Block layer # CONFIG_LBD=y # CONFIG_BLK_DEV_IO_TRACE is not set # CONFIG_LSF is not set # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y CONFIG_DEFAULT_AS=y # CONFIG_DEFAULT_DEADLINE is not set # CONFIG_DEFAULT_CFQ is not set # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="anticipatory" # # Processor type and features # CONFIG_X86_PC=y # CONFIG_X86_VSMP is not set # CONFIG_MK8 is not set # CONFIG_MPSC is not set CONFIG_GENERIC_CPU=y CONFIG_X86_L1_CACHE_BYTES=128 CONFIG_X86_L1_CACHE_SHIFT=7 CONFIG_X86_INTERNODE_CACHE_BYTES=128 CONFIG_X86_TSC=y CONFIG_X86_GOOD_APIC=y CONFIG_MICROCODE=m CONFIG_X86_MSR=y CONFIG_X86_CPUID=y CONFIG_X86_HT=y CONFIG_X86_IO_APIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_MTRR=y CONFIG_SMP=y CONFIG_SCHED_SMT=y CONFIG_SCHED_MC=y CONFIG_PREEMPT_NONE=y # CONFIG_PREEMPT_VOLUNTARY is not set # CONFIG_PREEMPT is not set CONFIG_PREEMPT_BKL=y CONFIG_NUMA=y CONFIG_K8_NUMA=y CONFIG_NODES_SHIFT=6 CONFIG_X86_64_ACPI_NUMA=y # CONFIG_NUMA_EMU is not set CONFIG_ARCH_DISCONTIGMEM_ENABLE=y CONFIG_ARCH_DISCONTIGMEM_DEFAULT=y CONFIG_ARCH_SPARSEMEM_ENABLE=y 
CONFIG_SELECT_MEMORY_MODEL=y # CONFIG_FLATMEM_MANUAL is not set CONFIG_DISCONTIGMEM_MANUAL=y # CONFIG_SPARSEMEM_MANUAL is not set CONFIG_DISCONTIGMEM=y CONFIG_FLAT_NODE_MEM_MAP=y CONFIG_NEED_MULTIPLE_NODES=y # CONFIG_SPARSEMEM_STATIC is not set CONFIG_SPLIT_PTLOCK_CPUS=4 CONFIG_MIGRATION=y CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y CONFIG_NR_CPUS=8 # CONFIG_HOTPLUG_CPU is not set CONFIG_HPET_TIMER=y CONFIG_HPET_EMULATE_RTC=y CONFIG_GART_IOMMU=y CONFIG_SWIOTLB=y CONFIG_X86_MCE=y CONFIG_X86_MCE_INTEL=y CONFIG_X86_MCE_AMD=y # CONFIG_KEXEC is not set CONFIG_CRASH_DUMP=y CONFIG_PHYSICAL_START=0x1000000 CONFIG_SECCOMP=y # CONFIG_HZ_100 is not set CONFIG_HZ_250=y # CONFIG_HZ_1000 is not set CONFIG_HZ=250 # CONFIG_REORDER is not set CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_ISA_DMA_API=y CONFIG_GENERIC_PENDING_IRQ=y # # Power management options # CONFIG_PM=y CONFIG_PM_LEGACY=y # CONFIG_PM_DEBUG is not set # # ACPI (Advanced Configuration and Power Interface) Support # CONFIG_ACPI=y CONFIG_ACPI_AC=m CONFIG_ACPI_BATTERY=m CONFIG_ACPI_BUTTON=m CONFIG_ACPI_VIDEO=y # CONFIG_ACPI_HOTKEY is not set CONFIG_ACPI_FAN=y CONFIG_ACPI_PROCESSOR=y CONFIG_ACPI_THERMAL=y CONFIG_ACPI_NUMA=y CONFIG_ACPI_ASUS=m CONFIG_ACPI_IBM=y # CONFIG_ACPI_IBM_DOCK is not set CONFIG_ACPI_TOSHIBA=m CONFIG_ACPI_BLACKLIST_YEAR=0 # CONFIG_ACPI_DEBUG is not set CONFIG_ACPI_EC=y CONFIG_ACPI_POWER=y CONFIG_ACPI_SYSTEM=y CONFIG_X86_PM_TIMER=y # CONFIG_ACPI_CONTAINER is not set # CONFIG_ACPI_HOTPLUG_MEMORY is not set # # CPU Frequency scaling # CONFIG_CPU_FREQ=y CONFIG_CPU_FREQ_TABLE=y # CONFIG_CPU_FREQ_DEBUG is not set CONFIG_CPU_FREQ_STAT=y # CONFIG_CPU_FREQ_STAT_DETAILS is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y CONFIG_CPU_FREQ_GOV_PERFORMANCE=y CONFIG_CPU_FREQ_GOV_POWERSAVE=m CONFIG_CPU_FREQ_GOV_USERSPACE=y CONFIG_CPU_FREQ_GOV_ONDEMAND=m # CONFIG_CPU_FREQ_GOV_CONSERVATIVE is not set # # CPUFreq processor 
drivers # CONFIG_X86_POWERNOW_K8=y CONFIG_X86_POWERNOW_K8_ACPI=y CONFIG_X86_SPEEDSTEP_CENTRINO=y CONFIG_X86_SPEEDSTEP_CENTRINO_ACPI=y CONFIG_X86_ACPI_CPUFREQ=y # # shared options # # CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set # CONFIG_X86_SPEEDSTEP_LIB is not set # # Bus options (PCI etc.) # CONFIG_PCI=y CONFIG_PCI_DIRECT=y CONFIG_PCI_MMCONFIG=y # CONFIG_PCIEPORTBUS is not set CONFIG_PCI_MSI=y # CONFIG_PCI_DEBUG is not set # # PCCARD (PCMCIA/CardBus) support # # CONFIG_PCCARD is not set # # PCI Hotplug Support # CONFIG_HOTPLUG_PCI=y # CONFIG_HOTPLUG_PCI_FAKE is not set CONFIG_HOTPLUG_PCI_ACPI=m CONFIG_HOTPLUG_PCI_ACPI_IBM=m # CONFIG_HOTPLUG_PCI_CPCI is not set CONFIG_HOTPLUG_PCI_SHPC=m # CONFIG_HOTPLUG_PCI_SHPC_POLL_EVENT_MODE is not set # # Executable file formats / Emulations # CONFIG_BINFMT_ELF=y CONFIG_BINFMT_MISC=y CONFIG_IA32_EMULATION=y # CONFIG_IA32_AOUT is not set CONFIG_COMPAT=y CONFIG_SYSVIPC_COMPAT=y # # Networking # CONFIG_NET=y # # Networking options # # CONFIG_NETDEBUG is not set CONFIG_PACKET=y CONFIG_PACKET_MMAP=y CONFIG_UNIX=y CONFIG_XFRM=y CONFIG_XFRM_USER=y CONFIG_NET_KEY=m CONFIG_INET=y CONFIG_IP_MULTICAST=y CONFIG_IP_ADVANCED_ROUTER=y CONFIG_ASK_IP_FIB_HASH=y # CONFIG_IP_FIB_TRIE is not set CONFIG_IP_FIB_HASH=y CONFIG_IP_MULTIPLE_TABLES=y CONFIG_IP_ROUTE_MULTIPATH=y # CONFIG_IP_ROUTE_MULTIPATH_CACHED is not set CONFIG_IP_ROUTE_VERBOSE=y # CONFIG_IP_PNP is not set CONFIG_NET_IPIP=m CONFIG_NET_IPGRE=m CONFIG_NET_IPGRE_BROADCAST=y CONFIG_IP_MROUTE=y CONFIG_IP_PIMSM_V1=y CONFIG_IP_PIMSM_V2=y # CONFIG_ARPD is not set CONFIG_SYN_COOKIES=y CONFIG_INET_AH=m CONFIG_INET_ESP=m CONFIG_INET_IPCOMP=m CONFIG_INET_XFRM_TUNNEL=m CONFIG_INET_TUNNEL=m CONFIG_INET_DIAG=y CONFIG_INET_TCP_DIAG=y # CONFIG_TCP_CONG_ADVANCED is not set CONFIG_TCP_CONG_BIC=y CONFIG_IPV6=m CONFIG_IPV6_PRIVACY=y # CONFIG_IPV6_ROUTER_PREF is not set CONFIG_INET6_AH=m CONFIG_INET6_ESP=m CONFIG_INET6_IPCOMP=m CONFIG_INET6_XFRM_TUNNEL=m CONFIG_INET6_TUNNEL=m CONFIG_IPV6_TUNNEL=m # 
CONFIG_NETFILTER is not set # # DCCP Configuration (EXPERIMENTAL) # # CONFIG_IP_DCCP is not set # # SCTP Configuration (EXPERIMENTAL) # CONFIG_IP_SCTP=m # CONFIG_SCTP_DBG_MSG is not set # CONFIG_SCTP_DBG_OBJCNT is not set # CONFIG_SCTP_HMAC_NONE is not set # CONFIG_SCTP_HMAC_SHA1 is not set CONFIG_SCTP_HMAC_MD5=y # # TIPC Configuration (EXPERIMENTAL) # # CONFIG_TIPC is not set CONFIG_ATM=m CONFIG_ATM_CLIP=m # CONFIG_ATM_CLIP_NO_ICMP is not set CONFIG_ATM_LANE=m # CONFIG_ATM_MPOA is not set CONFIG_ATM_BR2684=m # CONFIG_ATM_BR2684_IPFILTER is not set CONFIG_BRIDGE=m CONFIG_VLAN_8021Q=m # CONFIG_DECNET is not set CONFIG_LLC=y # CONFIG_LLC2 is not set # CONFIG_IPX is not set # CONFIG_ATALK is not set # CONFIG_X25 is not set # CONFIG_LAPB is not set CONFIG_NET_DIVERT=y # CONFIG_ECONET is not set # CONFIG_WAN_ROUTER is not set # # QoS and/or fair queueing # CONFIG_NET_SCHED=y CONFIG_NET_SCH_CLK_JIFFIES=y # CONFIG_NET_SCH_CLK_GETTIMEOFDAY is not set # CONFIG_NET_SCH_CLK_CPU is not set # # Queueing/Scheduling # CONFIG_NET_SCH_CBQ=m CONFIG_NET_SCH_HTB=m CONFIG_NET_SCH_HFSC=m CONFIG_NET_SCH_ATM=m CONFIG_NET_SCH_PRIO=m CONFIG_NET_SCH_RED=m CONFIG_NET_SCH_SFQ=m CONFIG_NET_SCH_TEQL=m CONFIG_NET_SCH_TBF=m CONFIG_NET_SCH_GRED=m CONFIG_NET_SCH_DSMARK=m CONFIG_NET_SCH_NETEM=m CONFIG_NET_SCH_INGRESS=m # # Classification # CONFIG_NET_CLS=y # CONFIG_NET_CLS_BASIC is not set CONFIG_NET_CLS_TCINDEX=m CONFIG_NET_CLS_ROUTE4=m CONFIG_NET_CLS_ROUTE=y CONFIG_NET_CLS_FW=m CONFIG_NET_CLS_U32=m CONFIG_CLS_U32_PERF=y CONFIG_NET_CLS_RSVP=m CONFIG_NET_CLS_RSVP6=m # CONFIG_NET_EMATCH is not set # CONFIG_NET_CLS_ACT is not set CONFIG_NET_CLS_POLICE=y CONFIG_NET_CLS_IND=y CONFIG_NET_ESTIMATOR=y # # Network testing # # CONFIG_NET_PKTGEN is not set # CONFIG_HAMRADIO is not set # CONFIG_IRDA is not set CONFIG_BT=m CONFIG_BT_L2CAP=m CONFIG_BT_SCO=m CONFIG_BT_RFCOMM=m CONFIG_BT_RFCOMM_TTY=y CONFIG_BT_BNEP=m CONFIG_BT_BNEP_MC_FILTER=y CONFIG_BT_BNEP_PROTO_FILTER=y CONFIG_BT_HIDP=m # # Bluetooth device 
drivers # CONFIG_BT_HCIUSB=m CONFIG_BT_HCIUSB_SCO=y CONFIG_BT_HCIUART=m CONFIG_BT_HCIUART_H4=y CONFIG_BT_HCIUART_BCSP=y CONFIG_BT_HCIBCM203X=m # CONFIG_BT_HCIBPA10X is not set CONFIG_BT_HCIBFUSB=m CONFIG_BT_HCIVHCI=m CONFIG_IEEE80211=m # CONFIG_IEEE80211_DEBUG is not set # CONFIG_IEEE80211_CRYPT_WEP is not set # CONFIG_IEEE80211_CRYPT_CCMP is not set CONFIG_IEEE80211_CRYPT_TKIP=m # CONFIG_IEEE80211_SOFTMAC is not set CONFIG_WIRELESS_EXT=y # # Device Drivers # # # Generic Driver Options # CONFIG_STANDALONE=y CONFIG_PREVENT_FIRMWARE_BUILD=y CONFIG_FW_LOADER=y # CONFIG_DEBUG_DRIVER is not set # # Connector - unified userspace <-> kernelspace linker # # CONFIG_CONNECTOR is not set # # Memory Technology Devices (MTD) # CONFIG_MTD=m # CONFIG_MTD_DEBUG is not set CONFIG_MTD_CONCAT=m CONFIG_MTD_PARTITIONS=y CONFIG_MTD_REDBOOT_PARTS=m CONFIG_MTD_REDBOOT_DIRECTORY_BLOCK=-1 # CONFIG_MTD_REDBOOT_PARTS_UNALLOCATED is not set # CONFIG_MTD_REDBOOT_PARTS_READONLY is not set CONFIG_MTD_CMDLINE_PARTS=y # # User Modules And Translation Layers # CONFIG_MTD_CHAR=m CONFIG_MTD_BLOCK=m CONFIG_MTD_BLOCK_RO=m CONFIG_FTL=m CONFIG_NFTL=m CONFIG_NFTL_RW=y # CONFIG_INFTL is not set # CONFIG_RFD_FTL is not set # # RAM/ROM/Flash chip drivers # CONFIG_MTD_CFI=m CONFIG_MTD_JEDECPROBE=m CONFIG_MTD_GEN_PROBE=m # CONFIG_MTD_CFI_ADV_OPTIONS is not set CONFIG_MTD_MAP_BANK_WIDTH_1=y CONFIG_MTD_MAP_BANK_WIDTH_2=y CONFIG_MTD_MAP_BANK_WIDTH_4=y # CONFIG_MTD_MAP_BANK_WIDTH_8 is not set # CONFIG_MTD_MAP_BANK_WIDTH_16 is not set # CONFIG_MTD_MAP_BANK_WIDTH_32 is not set CONFIG_MTD_CFI_I1=y CONFIG_MTD_CFI_I2=y # CONFIG_MTD_CFI_I4 is not set # CONFIG_MTD_CFI_I8 is not set CONFIG_MTD_CFI_INTELEXT=m CONFIG_MTD_CFI_AMDSTD=m CONFIG_MTD_CFI_STAA=m CONFIG_MTD_CFI_UTIL=m CONFIG_MTD_RAM=m CONFIG_MTD_ROM=m CONFIG_MTD_ABSENT=m # CONFIG_MTD_OBSOLETE_CHIPS is not set # # Mapping drivers for chip access # CONFIG_MTD_COMPLEX_MAPPINGS=y # CONFIG_MTD_PHYSMAP is not set # CONFIG_MTD_PNC2000 is not set CONFIG_MTD_SC520CDP=m 
CONFIG_MTD_NETSC520=m # CONFIG_MTD_TS5500 is not set # CONFIG_MTD_SBC_GXX is not set # CONFIG_MTD_AMD76XROM is not set CONFIG_MTD_ICHXROM=m # CONFIG_MTD_SCB2_FLASH is not set # CONFIG_MTD_NETtel is not set # CONFIG_MTD_DILNETPC is not set # CONFIG_MTD_L440GX is not set # CONFIG_MTD_PCI is not set # CONFIG_MTD_PLATRAM is not set # # Self-contained MTD device drivers # # CONFIG_MTD_PMC551 is not set # CONFIG_MTD_SLRAM is not set # CONFIG_MTD_PHRAM is not set CONFIG_MTD_MTDRAM=m CONFIG_MTDRAM_TOTAL_SIZE=4096 CONFIG_MTDRAM_ERASE_SIZE=128 # CONFIG_MTD_BLOCK2MTD is not set # # Disk-On-Chip Device Drivers # # CONFIG_MTD_DOC2000 is not set # CONFIG_MTD_DOC2001 is not set # CONFIG_MTD_DOC2001PLUS is not set # # NAND Flash Device Drivers # CONFIG_MTD_NAND=m # CONFIG_MTD_NAND_VERIFY_WRITE is not set CONFIG_MTD_NAND_IDS=m # CONFIG_MTD_NAND_DISKONCHIP is not set # CONFIG_MTD_NAND_NANDSIM is not set # # OneNAND Flash Device Drivers # # CONFIG_MTD_ONENAND is not set # # Parallel port support # CONFIG_PARPORT=m CONFIG_PARPORT_PC=m CONFIG_PARPORT_SERIAL=m # CONFIG_PARPORT_PC_FIFO is not set # CONFIG_PARPORT_PC_SUPERIO is not set CONFIG_PARPORT_NOT_PC=y # CONFIG_PARPORT_GSC is not set CONFIG_PARPORT_1284=y # # Plug and Play support # # CONFIG_PNP is not set # # Block devices # CONFIG_BLK_DEV_FD=m # CONFIG_PARIDE is not set CONFIG_BLK_CPQ_DA=m CONFIG_BLK_CPQ_CISS_DA=m CONFIG_CISS_SCSI_TAPE=y CONFIG_BLK_DEV_DAC960=m # CONFIG_BLK_DEV_UMEM is not set # CONFIG_BLK_DEV_COW_COMMON is not set CONFIG_BLK_DEV_LOOP=m CONFIG_BLK_DEV_CRYPTOLOOP=m CONFIG_BLK_DEV_NBD=m CONFIG_BLK_DEV_SX8=m # CONFIG_BLK_DEV_UB is not set CONFIG_BLK_DEV_RAM=y CONFIG_BLK_DEV_RAM_COUNT=16 CONFIG_BLK_DEV_RAM_SIZE=16384 CONFIG_BLK_DEV_INITRD=y # CONFIG_CDROM_PKTCDVD is not set # CONFIG_ATA_OVER_ETH is not set # # ATA/ATAPI/MFM/RLL support # CONFIG_IDE=y CONFIG_BLK_DEV_IDE=y # # Please see Documentation/ide.txt for help/info on IDE drives # # CONFIG_BLK_DEV_IDE_SATA is not set # CONFIG_BLK_DEV_HD_IDE is not set 
CONFIG_BLK_DEV_IDEDISK=y CONFIG_IDEDISK_MULTI_MODE=y CONFIG_BLK_DEV_IDECD=y # CONFIG_BLK_DEV_IDETAPE is not set CONFIG_BLK_DEV_IDEFLOPPY=y CONFIG_BLK_DEV_IDESCSI=m # CONFIG_IDE_TASK_IOCTL is not set # # IDE chipset support/bugfixes # CONFIG_IDE_GENERIC=y # CONFIG_BLK_DEV_CMD640 is not set CONFIG_BLK_DEV_IDEPCI=y CONFIG_IDEPCI_SHARE_IRQ=y # CONFIG_BLK_DEV_OFFBOARD is not set CONFIG_BLK_DEV_GENERIC=y # CONFIG_BLK_DEV_OPTI621 is not set CONFIG_BLK_DEV_RZ1000=y CONFIG_BLK_DEV_IDEDMA_PCI=y # CONFIG_BLK_DEV_IDEDMA_FORCED is not set CONFIG_IDEDMA_PCI_AUTO=y # CONFIG_IDEDMA_ONLYDISK is not set CONFIG_BLK_DEV_AEC62XX=y CONFIG_BLK_DEV_ALI15X3=y # CONFIG_WDC_ALI15X3 is not set CONFIG_BLK_DEV_AMD74XX=y CONFIG_BLK_DEV_ATIIXP=y CONFIG_BLK_DEV_CMD64X=y CONFIG_BLK_DEV_TRIFLEX=y CONFIG_BLK_DEV_CY82C693=y CONFIG_BLK_DEV_CS5520=y CONFIG_BLK_DEV_CS5530=y CONFIG_BLK_DEV_HPT34X=y # CONFIG_HPT34X_AUTODMA is not set CONFIG_BLK_DEV_HPT366=y # CONFIG_BLK_DEV_SC1200 is not set CONFIG_BLK_DEV_PIIX=y # CONFIG_BLK_DEV_IT821X is not set # CONFIG_BLK_DEV_NS87415 is not set CONFIG_BLK_DEV_PDC202XX_OLD=y # CONFIG_PDC202XX_BURST is not set CONFIG_BLK_DEV_PDC202XX_NEW=y CONFIG_BLK_DEV_SVWKS=y CONFIG_BLK_DEV_SIIMAGE=y CONFIG_BLK_DEV_SIS5513=y CONFIG_BLK_DEV_SLC90E66=y # CONFIG_BLK_DEV_TRM290 is not set CONFIG_BLK_DEV_VIA82CXXX=y # CONFIG_IDE_ARM is not set CONFIG_BLK_DEV_IDEDMA=y # CONFIG_IDEDMA_IVB is not set CONFIG_IDEDMA_AUTO=y # CONFIG_BLK_DEV_HD is not set # # SCSI device support # # CONFIG_RAID_ATTRS is not set CONFIG_SCSI=m CONFIG_SCSI_PROC_FS=y # # SCSI support type (disk, tape, CD-ROM) # CONFIG_BLK_DEV_SD=m CONFIG_CHR_DEV_ST=m CONFIG_CHR_DEV_OSST=m CONFIG_BLK_DEV_SR=m CONFIG_BLK_DEV_SR_VENDOR=y CONFIG_CHR_DEV_SG=m # CONFIG_CHR_DEV_SCH is not set # # Some SCSI devices (e.g. 
CD jukebox) support multiple LUNs # # CONFIG_SCSI_MULTI_LUN is not set CONFIG_SCSI_CONSTANTS=y CONFIG_SCSI_LOGGING=y # # SCSI Transport Attributes # CONFIG_SCSI_SPI_ATTRS=m CONFIG_SCSI_FC_ATTRS=m CONFIG_SCSI_ISCSI_ATTRS=m # CONFIG_SCSI_SAS_ATTRS is not set # # SCSI low-level drivers # CONFIG_ISCSI_TCP=m CONFIG_BLK_DEV_3W_XXXX_RAID=m CONFIG_SCSI_3W_9XXX=m CONFIG_SCSI_ACARD=m CONFIG_SCSI_AACRAID=m CONFIG_SCSI_AIC7XXX=m CONFIG_AIC7XXX_CMDS_PER_DEVICE=4 CONFIG_AIC7XXX_RESET_DELAY_MS=15000 # CONFIG_AIC7XXX_DEBUG_ENABLE is not set CONFIG_AIC7XXX_DEBUG_MASK=0 # CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set CONFIG_SCSI_AIC7XXX_OLD=m CONFIG_SCSI_AIC79XX=m CONFIG_AIC79XX_CMDS_PER_DEVICE=4 CONFIG_AIC79XX_RESET_DELAY_MS=15000 # CONFIG_AIC79XX_ENABLE_RD_STRM is not set # CONFIG_AIC79XX_DEBUG_ENABLE is not set CONFIG_AIC79XX_DEBUG_MASK=0 # CONFIG_AIC79XX_REG_PRETTY_PRINT is not set CONFIG_MEGARAID_NEWGEN=y CONFIG_MEGARAID_MM=m CONFIG_MEGARAID_MAILBOX=m # CONFIG_MEGARAID_LEGACY is not set # CONFIG_MEGARAID_SAS is not set CONFIG_SCSI_SATA=m CONFIG_SCSI_SATA_AHCI=m CONFIG_SCSI_SATA_SVW=m CONFIG_SCSI_ATA_PIIX=m # CONFIG_SCSI_SATA_MV is not set CONFIG_SCSI_SATA_NV=m # CONFIG_SCSI_PDC_ADMA is not set # CONFIG_SCSI_SATA_QSTOR is not set CONFIG_SCSI_SATA_PROMISE=m CONFIG_SCSI_SATA_SX4=m CONFIG_SCSI_SATA_SIL=m # CONFIG_SCSI_SATA_SIL24 is not set CONFIG_SCSI_SATA_SIS=m # CONFIG_SCSI_SATA_ULI is not set CONFIG_SCSI_SATA_VIA=m CONFIG_SCSI_SATA_VITESSE=m CONFIG_SCSI_SATA_INTEL_COMBINED=y # CONFIG_SCSI_BUSLOGIC is not set # CONFIG_SCSI_DMX3191D is not set # CONFIG_SCSI_EATA is not set # CONFIG_SCSI_FUTURE_DOMAIN is not set CONFIG_SCSI_GDTH=m CONFIG_SCSI_IPS=m CONFIG_SCSI_INITIO=m # CONFIG_SCSI_INIA100 is not set CONFIG_SCSI_PPA=m CONFIG_SCSI_IMM=m # CONFIG_SCSI_IZIP_EPP16 is not set # CONFIG_SCSI_IZIP_SLOW_CTR is not set CONFIG_SCSI_SYM53C8XX_2=m CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1 CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16 CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64 CONFIG_SCSI_SYM53C8XX_MMIO=y # 
CONFIG_SCSI_IPR is not set CONFIG_SCSI_QLOGIC_1280=m CONFIG_SCSI_QLA_FC=m # CONFIG_SCSI_QLA2XXX_EMBEDDED_FIRMWARE is not set CONFIG_SCSI_LPFC=m # CONFIG_SCSI_DC395x is not set # CONFIG_SCSI_DC390T is not set # CONFIG_SCSI_DEBUG is not set # # Multi-device support (RAID and LVM) # CONFIG_MD=y CONFIG_BLK_DEV_MD=y CONFIG_MD_LINEAR=m CONFIG_MD_RAID0=m CONFIG_MD_RAID1=m CONFIG_MD_RAID10=m CONFIG_MD_RAID5=m # CONFIG_MD_RAID5_RESHAPE is not set CONFIG_MD_RAID6=m CONFIG_MD_MULTIPATH=m # CONFIG_MD_FAULTY is not set CONFIG_BLK_DEV_DM=m CONFIG_DM_CRYPT=m CONFIG_DM_SNAPSHOT=m CONFIG_DM_MIRROR=m CONFIG_DM_ZERO=m CONFIG_DM_MULTIPATH=m CONFIG_DM_MULTIPATH_EMC=m # # Fusion MPT device support # # CONFIG_FUSION is not set # CONFIG_FUSION_SPI is not set # CONFIG_FUSION_FC is not set # CONFIG_FUSION_SAS is not set # # IEEE 1394 (FireWire) support # # CONFIG_IEEE1394 is not set # # I2O device support # CONFIG_I2O=m # CONFIG_I2O_LCT_NOTIFY_ON_CHANGES is not set CONFIG_I2O_EXT_ADAPTEC=y CONFIG_I2O_EXT_ADAPTEC_DMA64=y CONFIG_I2O_CONFIG=m CONFIG_I2O_CONFIG_OLD_IOCTL=y # CONFIG_I2O_BUS is not set CONFIG_I2O_BLOCK=m CONFIG_I2O_SCSI=m CONFIG_I2O_PROC=m # # Network device support # CONFIG_NETDEVICES=y CONFIG_DUMMY=m CONFIG_BONDING=m # CONFIG_EQUALIZER is not set CONFIG_TUN=m # # ARCnet devices # # CONFIG_ARCNET is not set # # PHY device support # # CONFIG_PHYLIB is not set # # Ethernet (10 or 100Mbit) # CONFIG_NET_ETHERNET=y CONFIG_MII=m CONFIG_HAPPYMEAL=m CONFIG_SUNGEM=m # CONFIG_CASSINI is not set CONFIG_NET_VENDOR_3COM=y CONFIG_VORTEX=m CONFIG_TYPHOON=m # # Tulip family network device support # CONFIG_NET_TULIP=y CONFIG_DE2104X=m CONFIG_TULIP=m # CONFIG_TULIP_MWI is not set CONFIG_TULIP_MMIO=y # CONFIG_TULIP_NAPI is not set CONFIG_DE4X5=m CONFIG_WINBOND_840=m CONFIG_DM9102=m # CONFIG_ULI526X is not set # CONFIG_HP100 is not set CONFIG_NET_PCI=y CONFIG_PCNET32=m CONFIG_AMD8111_ETH=m CONFIG_AMD8111E_NAPI=y CONFIG_ADAPTEC_STARFIRE=m CONFIG_ADAPTEC_STARFIRE_NAPI=y CONFIG_B44=m 
CONFIG_FORCEDETH=m # CONFIG_DGRS is not set CONFIG_EEPRO100=m CONFIG_E100=m CONFIG_FEALNX=m CONFIG_NATSEMI=m CONFIG_NE2K_PCI=m CONFIG_8139CP=m CONFIG_8139TOO=m CONFIG_8139TOO_PIO=y # CONFIG_8139TOO_TUNE_TWISTER is not set CONFIG_8139TOO_8129=y # CONFIG_8139_OLD_RX_RESET is not set CONFIG_SIS900=m CONFIG_EPIC100=m # CONFIG_SUNDANCE is not set CONFIG_VIA_RHINE=m CONFIG_VIA_RHINE_MMIO=y # CONFIG_NET_POCKET is not set # # Ethernet (1000 Mbit) # CONFIG_ACENIC=m # CONFIG_ACENIC_OMIT_TIGON_I is not set CONFIG_DL2K=m CONFIG_E1000=m CONFIG_E1000_NAPI=y # CONFIG_E1000_DISABLE_PACKET_SPLIT is not set CONFIG_NS83820=m # CONFIG_HAMACHI is not set # CONFIG_YELLOWFIN is not set CONFIG_R8169=m CONFIG_R8169_NAPI=y # CONFIG_R8169_VLAN is not set # CONFIG_SIS190 is not set # CONFIG_SKGE is not set # CONFIG_SKY2 is not set CONFIG_SK98LIN=m CONFIG_VIA_VELOCITY=m CONFIG_TIGON3=m # CONFIG_BNX2 is not set # # Ethernet (10000 Mbit) # # CONFIG_CHELSIO_T1 is not set CONFIG_IXGB=m CONFIG_IXGB_NAPI=y CONFIG_S2IO=m CONFIG_S2IO_NAPI=y # # Token Ring devices # CONFIG_TR=y CONFIG_IBMOL=m CONFIG_3C359=m CONFIG_TMS380TR=m CONFIG_TMSPCI=m CONFIG_ABYSS=m # # Wireless LAN (non-hamradio) # CONFIG_NET_RADIO=y # CONFIG_NET_WIRELESS_RTNETLINK is not set # # Obsolete Wireless cards support (pre-802.11) # # CONFIG_STRIP is not set # # Wireless 802.11b ISA/PCI cards support # CONFIG_IPW2100=m # CONFIG_IPW2100_MONITOR is not set # CONFIG_IPW2100_DEBUG is not set CONFIG_IPW2200=m # CONFIG_IPW2200_MONITOR is not set # CONFIG_IPW_QOS is not set # CONFIG_IPW2200_DEBUG is not set # CONFIG_AIRO is not set CONFIG_HERMES=m CONFIG_PLX_HERMES=m CONFIG_TMD_HERMES=m # CONFIG_NORTEL_HERMES is not set CONFIG_PCI_HERMES=m CONFIG_ATMEL=m CONFIG_PCI_ATMEL=m # # Prism GT/Duette 802.11(a/b/g) PCI/Cardbus support # CONFIG_PRISM54=m # CONFIG_HOSTAP is not set CONFIG_NET_WIRELESS=y # # Wan interfaces # # CONFIG_WAN is not set # # ATM drivers # # CONFIG_ATM_DUMMY is not set CONFIG_ATM_TCP=m CONFIG_ATM_LANAI=m CONFIG_ATM_ENI=m # 
CONFIG_ATM_ENI_DEBUG is not set # CONFIG_ATM_ENI_TUNE_BURST is not set CONFIG_ATM_FIRESTREAM=m # CONFIG_ATM_ZATM is not set CONFIG_ATM_IDT77252=m # CONFIG_ATM_IDT77252_DEBUG is not set # CONFIG_ATM_IDT77252_RCV_ALL is not set CONFIG_ATM_IDT77252_USE_SUNI=y CONFIG_ATM_AMBASSADOR=m # CONFIG_ATM_AMBASSADOR_DEBUG is not set CONFIG_ATM_HORIZON=m # CONFIG_ATM_HORIZON_DEBUG is not set CONFIG_ATM_FORE200E_MAYBE=m # CONFIG_ATM_FORE200E_PCA is not set CONFIG_ATM_HE=m # CONFIG_ATM_HE_USE_SUNI is not set CONFIG_FDDI=y # CONFIG_DEFXX is not set # CONFIG_SKFP is not set # CONFIG_HIPPI is not set # CONFIG_PLIP is not set CONFIG_PPP=m CONFIG_PPP_MULTILINK=y CONFIG_PPP_FILTER=y CONFIG_PPP_ASYNC=m CONFIG_PPP_SYNC_TTY=m CONFIG_PPP_DEFLATE=m # CONFIG_PPP_BSDCOMP is not set # CONFIG_PPP_MPPE is not set CONFIG_PPPOE=m CONFIG_PPPOATM=m # CONFIG_SLIP is not set CONFIG_NET_FC=y # CONFIG_SHAPER is not set CONFIG_NETCONSOLE=m CONFIG_NETPOLL=y # CONFIG_NETPOLL_RX is not set CONFIG_NETPOLL_TRAP=y CONFIG_NET_POLL_CONTROLLER=y # # ISDN subsystem # # CONFIG_ISDN is not set # # Telephony Support # # CONFIG_PHONE is not set # # Input device support # CONFIG_INPUT=y # # Userland interfaces # CONFIG_INPUT_MOUSEDEV=y # CONFIG_INPUT_MOUSEDEV_PSAUX is not set CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024 CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768 CONFIG_INPUT_JOYDEV=m # CONFIG_INPUT_TSDEV is not set CONFIG_INPUT_EVDEV=y # CONFIG_INPUT_EVBUG is not set # # Input Device Drivers # CONFIG_INPUT_KEYBOARD=y CONFIG_KEYBOARD_ATKBD=y # CONFIG_KEYBOARD_SUNKBD is not set # CONFIG_KEYBOARD_LKKBD is not set # CONFIG_KEYBOARD_XTKBD is not set # CONFIG_KEYBOARD_NEWTON is not set CONFIG_INPUT_MOUSE=y CONFIG_MOUSE_PS2=y CONFIG_MOUSE_SERIAL=m CONFIG_MOUSE_VSXXXAA=m CONFIG_INPUT_JOYSTICK=y # CONFIG_JOYSTICK_ANALOG is not set # CONFIG_JOYSTICK_A3D is not set # CONFIG_JOYSTICK_ADI is not set # CONFIG_JOYSTICK_COBRA is not set # CONFIG_JOYSTICK_GF2K is not set # CONFIG_JOYSTICK_GRIP is not set # CONFIG_JOYSTICK_GRIP_MP is not set # 
CONFIG_JOYSTICK_GUILLEMOT is not set # CONFIG_JOYSTICK_INTERACT is not set # CONFIG_JOYSTICK_SIDEWINDER is not set # CONFIG_JOYSTICK_TMDC is not set # CONFIG_JOYSTICK_IFORCE is not set # CONFIG_JOYSTICK_WARRIOR is not set # CONFIG_JOYSTICK_MAGELLAN is not set # CONFIG_JOYSTICK_SPACEORB is not set # CONFIG_JOYSTICK_SPACEBALL is not set # CONFIG_JOYSTICK_STINGER is not set # CONFIG_JOYSTICK_TWIDJOY is not set # CONFIG_JOYSTICK_DB9 is not set # CONFIG_JOYSTICK_GAMECON is not set # CONFIG_JOYSTICK_TURBOGRAFX is not set # CONFIG_JOYSTICK_JOYDUMP is not set CONFIG_INPUT_TOUCHSCREEN=y CONFIG_TOUCHSCREEN_GUNZE=m # CONFIG_TOUCHSCREEN_ELO is not set # CONFIG_TOUCHSCREEN_MTOUCH is not set # CONFIG_TOUCHSCREEN_MK712 is not set CONFIG_INPUT_MISC=y CONFIG_INPUT_PCSPKR=m CONFIG_INPUT_UINPUT=m # # Hardware I/O ports # CONFIG_SERIO=y CONFIG_SERIO_I8042=y CONFIG_SERIO_SERPORT=y # CONFIG_SERIO_CT82C710 is not set # CONFIG_SERIO_PARKBD is not set # CONFIG_SERIO_PCIPS2 is not set CONFIG_SERIO_LIBPS2=y # CONFIG_SERIO_RAW is not set # CONFIG_GAMEPORT is not set # # Character devices # CONFIG_VT=y CONFIG_VT_CONSOLE=y CONFIG_HW_CONSOLE=y CONFIG_SERIAL_NONSTANDARD=y # CONFIG_COMPUTONE is not set # CONFIG_ROCKETPORT is not set # CONFIG_CYCLADES is not set # CONFIG_DIGIEPCA is not set # CONFIG_MOXA_INTELLIO is not set # CONFIG_MOXA_SMARTIO is not set # CONFIG_ISI is not set # CONFIG_SYNCLINK is not set # CONFIG_SYNCLINKMP is not set # CONFIG_SYNCLINK_GT is not set CONFIG_N_HDLC=m # CONFIG_SPECIALIX is not set # CONFIG_SX is not set CONFIG_STALDRV=y # # Serial drivers # CONFIG_SERIAL_8250=y CONFIG_SERIAL_8250_CONSOLE=y CONFIG_SERIAL_8250_PCI=y CONFIG_SERIAL_8250_NR_UARTS=4 CONFIG_SERIAL_8250_RUNTIME_UARTS=1 CONFIG_SERIAL_8250_EXTENDED=y # CONFIG_SERIAL_8250_MANY_PORTS is not set CONFIG_SERIAL_8250_SHARE_IRQ=y CONFIG_SERIAL_8250_DETECT_IRQ=y CONFIG_SERIAL_8250_RSA=y # # Non-8250 serial port support # CONFIG_SERIAL_CORE=y CONFIG_SERIAL_CORE_CONSOLE=y # CONFIG_SERIAL_JSM is not set 
CONFIG_UNIX98_PTYS=y # CONFIG_LEGACY_PTYS is not set CONFIG_PRINTER=m CONFIG_LP_CONSOLE=y CONFIG_PPDEV=m # CONFIG_TIPAR is not set # # IPMI # CONFIG_IPMI_HANDLER=m # CONFIG_IPMI_PANIC_EVENT is not set CONFIG_IPMI_DEVICE_INTERFACE=m CONFIG_IPMI_SI=m CONFIG_IPMI_WATCHDOG=m CONFIG_IPMI_POWEROFF=m # # Watchdog Cards # CONFIG_WATCHDOG=y # CONFIG_WATCHDOG_NOWAYOUT is not set # # Watchdog Device Drivers # CONFIG_SOFT_WATCHDOG=m CONFIG_ACQUIRE_WDT=m CONFIG_ADVANTECH_WDT=m CONFIG_ALIM1535_WDT=m CONFIG_ALIM7101_WDT=m CONFIG_SC520_WDT=m CONFIG_EUROTECH_WDT=m CONFIG_IB700_WDT=m # CONFIG_IBMASR is not set CONFIG_WAFER_WDT=m # CONFIG_I6300ESB_WDT is not set CONFIG_I8XX_TCO=m CONFIG_SC1200_WDT=m # CONFIG_60XX_WDT is not set # CONFIG_SBC8360_WDT is not set CONFIG_CPU5_WDT=m CONFIG_W83627HF_WDT=m CONFIG_W83877F_WDT=m # CONFIG_W83977F_WDT is not set CONFIG_MACHZ_WDT=m # CONFIG_SBC_EPX_C3_WATCHDOG is not set # # PCI-based Watchdog Cards # CONFIG_PCIPCWATCHDOG=m CONFIG_WDTPCI=m CONFIG_WDT_501_PCI=y # # USB-based Watchdog Cards # CONFIG_USBPCWATCHDOG=m CONFIG_HW_RANDOM=m # CONFIG_NVRAM is not set CONFIG_RTC=y CONFIG_DTLK=m # CONFIG_R3964 is not set # CONFIG_APPLICOM is not set # # Ftape, the floppy tape device driver # CONFIG_AGP=y CONFIG_AGP_AMD64=y # CONFIG_AGP_INTEL is not set # CONFIG_AGP_VIA is not set CONFIG_DRM=y # CONFIG_DRM_TDFX is not set CONFIG_DRM_R128=m CONFIG_DRM_RADEON=m CONFIG_DRM_MGA=m # CONFIG_DRM_SIS is not set # CONFIG_DRM_VIA is not set # CONFIG_DRM_SAVAGE is not set # CONFIG_MWAVE is not set CONFIG_RAW_DRIVER=y CONFIG_MAX_RAW_DEVS=8192 # CONFIG_HPET is not set CONFIG_HANGCHECK_TIMER=m # # TPM devices # # CONFIG_TCG_TPM is not set # CONFIG_TELCLOCK is not set # # I2C support # CONFIG_I2C=m CONFIG_I2C_CHARDEV=m # # I2C Algorithms # CONFIG_I2C_ALGOBIT=m CONFIG_I2C_ALGOPCF=m CONFIG_I2C_ALGOPCA=m # # I2C Hardware Bus support # CONFIG_I2C_ALI1535=m CONFIG_I2C_ALI1563=m CONFIG_I2C_ALI15X3=m CONFIG_I2C_AMD756=m # CONFIG_I2C_AMD756_S4882 is not set CONFIG_I2C_AMD8111=m 
CONFIG_I2C_I801=m CONFIG_I2C_I810=m # CONFIG_I2C_PIIX4 is not set CONFIG_I2C_ISA=m CONFIG_I2C_NFORCE2=m # CONFIG_I2C_PARPORT is not set # CONFIG_I2C_PARPORT_LIGHT is not set CONFIG_I2C_PROSAVAGE=m CONFIG_I2C_SAVAGE4=m CONFIG_I2C_SIS5595=m CONFIG_I2C_SIS630=m CONFIG_I2C_SIS96X=m # CONFIG_I2C_STUB is not set CONFIG_I2C_VIA=m CONFIG_I2C_VIAPRO=m CONFIG_I2C_VOODOO3=m # CONFIG_I2C_PCA_ISA is not set # # Miscellaneous I2C Chip support # # CONFIG_SENSORS_DS1337 is not set # CONFIG_SENSORS_DS1374 is not set CONFIG_SENSORS_EEPROM=m CONFIG_SENSORS_PCF8574=m # CONFIG_SENSORS_PCA9539 is not set CONFIG_SENSORS_PCF8591=m # CONFIG_SENSORS_MAX6875 is not set # CONFIG_I2C_DEBUG_CORE is not set # CONFIG_I2C_DEBUG_ALGO is not set # CONFIG_I2C_DEBUG_BUS is not set # CONFIG_I2C_DEBUG_CHIP is not set # # SPI support # # CONFIG_SPI is not set # CONFIG_SPI_MASTER is not set # # Dallas's 1-wire bus # # CONFIG_W1 is not set # # Hardware Monitoring support # CONFIG_HWMON=y CONFIG_HWMON_VID=m CONFIG_SENSORS_ADM1021=m CONFIG_SENSORS_ADM1025=m # CONFIG_SENSORS_ADM1026 is not set CONFIG_SENSORS_ADM1031=m # CONFIG_SENSORS_ADM9240 is not set CONFIG_SENSORS_ASB100=m # CONFIG_SENSORS_ATXP1 is not set CONFIG_SENSORS_DS1621=m # CONFIG_SENSORS_F71805F is not set CONFIG_SENSORS_FSCHER=m # CONFIG_SENSORS_FSCPOS is not set CONFIG_SENSORS_GL518SM=m # CONFIG_SENSORS_GL520SM is not set CONFIG_SENSORS_IT87=m # CONFIG_SENSORS_LM63 is not set CONFIG_SENSORS_LM75=m CONFIG_SENSORS_LM77=m CONFIG_SENSORS_LM78=m CONFIG_SENSORS_LM80=m CONFIG_SENSORS_LM83=m CONFIG_SENSORS_LM85=m # CONFIG_SENSORS_LM87 is not set CONFIG_SENSORS_LM90=m # CONFIG_SENSORS_LM92 is not set CONFIG_SENSORS_MAX1619=m # CONFIG_SENSORS_PC87360 is not set # CONFIG_SENSORS_SIS5595 is not set CONFIG_SENSORS_SMSC47M1=m # CONFIG_SENSORS_SMSC47B397 is not set CONFIG_SENSORS_VIA686A=m # CONFIG_SENSORS_VT8231 is not set CONFIG_SENSORS_W83781D=m # CONFIG_SENSORS_W83792D is not set CONFIG_SENSORS_W83L785TS=m CONFIG_SENSORS_W83627HF=m # 
CONFIG_SENSORS_W83627EHF is not set # CONFIG_SENSORS_HDAPS is not set # CONFIG_HWMON_DEBUG_CHIP is not set # # Misc devices # # CONFIG_IBM_ASM is not set # # Multimedia devices # CONFIG_VIDEO_DEV=m # # Video For Linux # # # Video Adapters # # CONFIG_VIDEO_ADV_DEBUG is not set # CONFIG_VIDEO_BT848 is not set # CONFIG_VIDEO_BWQCAM is not set # CONFIG_VIDEO_CQCAM is not set # CONFIG_VIDEO_W9966 is not set # CONFIG_VIDEO_CPIA is not set # CONFIG_VIDEO_CPIA2 is not set # CONFIG_VIDEO_SAA5246A is not set # CONFIG_VIDEO_SAA5249 is not set # CONFIG_TUNER_3036 is not set # CONFIG_VIDEO_STRADIS is not set # CONFIG_VIDEO_ZORAN is not set # CONFIG_VIDEO_SAA7134 is not set # CONFIG_VIDEO_MXB is not set # CONFIG_VIDEO_DPC is not set # CONFIG_VIDEO_HEXIUM_ORION is not set # CONFIG_VIDEO_HEXIUM_GEMINI is not set # CONFIG_VIDEO_CX88 is not set CONFIG_VIDEO_OVCAMCHIP=m # # Encoders and Decoders # # CONFIG_VIDEO_MSP3400 is not set # CONFIG_VIDEO_CS53L32A is not set # CONFIG_VIDEO_WM8775 is not set # CONFIG_VIDEO_WM8739 is not set # CONFIG_VIDEO_CX25840 is not set # CONFIG_VIDEO_SAA711X is not set # CONFIG_VIDEO_SAA7127 is not set # CONFIG_VIDEO_UPD64031A is not set # CONFIG_VIDEO_UPD64083 is not set # # V4L USB devices # # CONFIG_VIDEO_EM28XX is not set CONFIG_USB_DSBR=m CONFIG_VIDEO_USBVIDEO=m CONFIG_USB_VICAM=m CONFIG_USB_IBMCAM=m CONFIG_USB_KONICAWC=m # CONFIG_USB_ET61X251 is not set CONFIG_USB_OV511=m CONFIG_USB_SE401=m CONFIG_USB_SN9C102=m CONFIG_USB_STV680=m CONFIG_USB_W9968CF=m # CONFIG_USB_ZC0301 is not set CONFIG_USB_PWC=m # # Radio Adapters # # CONFIG_RADIO_GEMTEK_PCI is not set # CONFIG_RADIO_MAXIRADIO is not set # CONFIG_RADIO_MAESTRO is not set # # Digital Video Broadcasting Devices # # CONFIG_DVB is not set CONFIG_USB_DABUSB=m # # Graphics support # CONFIG_FB=y CONFIG_FB_CFB_FILLRECT=y CONFIG_FB_CFB_COPYAREA=y CONFIG_FB_CFB_IMAGEBLIT=y # CONFIG_FB_MACMODES is not set CONFIG_FB_FIRMWARE_EDID=y CONFIG_FB_MODE_HELPERS=y # CONFIG_FB_TILEBLITTING is not set 
CONFIG_FB_CIRRUS=m # CONFIG_FB_PM2 is not set # CONFIG_FB_CYBER2000 is not set # CONFIG_FB_ARC is not set # CONFIG_FB_ASILIANT is not set # CONFIG_FB_IMSTT is not set CONFIG_FB_VGA16=m CONFIG_FB_VESA=y CONFIG_VIDEO_SELECT=y # CONFIG_FB_HGA is not set # CONFIG_FB_S1D13XXX is not set # CONFIG_FB_NVIDIA is not set CONFIG_FB_RIVA=m # CONFIG_FB_RIVA_I2C is not set # CONFIG_FB_RIVA_DEBUG is not set # CONFIG_FB_MATROX is not set # CONFIG_FB_RADEON is not set # CONFIG_FB_ATY128 is not set # CONFIG_FB_ATY is not set # CONFIG_FB_SAVAGE is not set # CONFIG_FB_SIS is not set # CONFIG_FB_NEOMAGIC is not set CONFIG_FB_KYRO=m # CONFIG_FB_3DFX is not set # CONFIG_FB_VOODOO1 is not set # CONFIG_FB_TRIDENT is not set # CONFIG_FB_GEODE is not set # CONFIG_FB_VIRTUAL is not set # # Console display driver support # CONFIG_VGA_CONSOLE=y # CONFIG_VGACON_SOFT_SCROLLBACK is not set CONFIG_DUMMY_CONSOLE=y CONFIG_FRAMEBUFFER_CONSOLE=y # CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set # CONFIG_FONTS is not set CONFIG_FONT_8x8=y CONFIG_FONT_8x16=y # # Logo configuration # CONFIG_LOGO=y # CONFIG_LOGO_LINUX_MONO is not set # CONFIG_LOGO_LINUX_VGA16 is not set CONFIG_LOGO_LINUX_CLUT224=y # CONFIG_BACKLIGHT_LCD_SUPPORT is not set # # Sound # CONFIG_SOUND=m # # Advanced Linux Sound Architecture # CONFIG_SND=m CONFIG_SND_TIMER=m CONFIG_SND_PCM=m CONFIG_SND_HWDEP=m CONFIG_SND_RAWMIDI=m CONFIG_SND_SEQUENCER=m CONFIG_SND_SEQ_DUMMY=m CONFIG_SND_OSSEMUL=y CONFIG_SND_MIXER_OSS=m CONFIG_SND_PCM_OSS=m CONFIG_SND_PCM_OSS_PLUGINS=y CONFIG_SND_SEQUENCER_OSS=y CONFIG_SND_RTCTIMER=m CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y # CONFIG_SND_DYNAMIC_MINORS is not set # CONFIG_SND_SUPPORT_OLD_API is not set CONFIG_SND_VERBOSE_PROCFS=y # CONFIG_SND_VERBOSE_PRINTK is not set # CONFIG_SND_DEBUG is not set # # Generic devices # CONFIG_SND_MPU401_UART=m CONFIG_SND_OPL3_LIB=m CONFIG_SND_VX_LIB=m CONFIG_SND_AC97_CODEC=m CONFIG_SND_AC97_BUS=m CONFIG_SND_DUMMY=m CONFIG_SND_VIRMIDI=m CONFIG_SND_MTPAV=m # CONFIG_SND_SERIAL_U16550 is not 
set CONFIG_SND_MPU401=m # # PCI devices # # CONFIG_SND_AD1889 is not set # CONFIG_SND_ALS300 is not set CONFIG_SND_ALS4000=m CONFIG_SND_ALI5451=m CONFIG_SND_ATIIXP=m CONFIG_SND_ATIIXP_MODEM=m CONFIG_SND_AU8810=m CONFIG_SND_AU8820=m CONFIG_SND_AU8830=m CONFIG_SND_AZT3328=m CONFIG_SND_BT87X=m # CONFIG_SND_BT87X_OVERCLOCK is not set # CONFIG_SND_CA0106 is not set CONFIG_SND_CMIPCI=m CONFIG_SND_CS4281=m CONFIG_SND_CS46XX=m CONFIG_SND_CS46XX_NEW_DSP=y CONFIG_SND_EMU10K1=m # CONFIG_SND_EMU10K1X is not set CONFIG_SND_ENS1370=m CONFIG_SND_ENS1371=m CONFIG_SND_ES1938=m CONFIG_SND_ES1968=m CONFIG_SND_FM801=m CONFIG_SND_FM801_TEA575X=m # CONFIG_SND_HDA_INTEL is not set CONFIG_SND_HDSP=m # CONFIG_SND_HDSPM is not set CONFIG_SND_ICE1712=m CONFIG_SND_ICE1724=m CONFIG_SND_INTEL8X0=m CONFIG_SND_INTEL8X0M=m CONFIG_SND_KORG1212=m CONFIG_SND_MAESTRO3=m CONFIG_SND_MIXART=m CONFIG_SND_NM256=m # CONFIG_SND_PCXHR is not set # CONFIG_SND_RIPTIDE is not set CONFIG_SND_RME32=m CONFIG_SND_RME96=m CONFIG_SND_RME9652=m CONFIG_SND_SONICVIBES=m CONFIG_SND_TRIDENT=m CONFIG_SND_VIA82XX=m # CONFIG_SND_VIA82XX_MODEM is not set CONFIG_SND_VX222=m CONFIG_SND_YMFPCI=m # # USB devices # CONFIG_SND_USB_AUDIO=m CONFIG_SND_USB_USX2Y=m # # Open Sound System # # CONFIG_SOUND_PRIME is not set # # USB support # CONFIG_USB_ARCH_HAS_HCD=y CONFIG_USB_ARCH_HAS_OHCI=y CONFIG_USB_ARCH_HAS_EHCI=y CONFIG_USB=y # CONFIG_USB_DEBUG is not set # # Miscellaneous USB options # CONFIG_USB_DEVICEFS=y # CONFIG_USB_BANDWIDTH is not set # CONFIG_USB_DYNAMIC_MINORS is not set CONFIG_USB_SUSPEND=y # CONFIG_USB_OTG is not set # # USB Host Controller Drivers # CONFIG_USB_EHCI_HCD=m CONFIG_USB_EHCI_SPLIT_ISO=y CONFIG_USB_EHCI_ROOT_HUB_TT=y # CONFIG_USB_ISP116X_HCD is not set CONFIG_USB_OHCI_HCD=m # CONFIG_USB_OHCI_BIG_ENDIAN is not set CONFIG_USB_OHCI_LITTLE_ENDIAN=y CONFIG_USB_UHCI_HCD=m # CONFIG_USB_SL811_HCD is not set # # USB Device Class drivers # CONFIG_USB_ACM=m CONFIG_USB_PRINTER=m # # NOTE: USB_STORAGE enables SCSI, and 
'SCSI disk support' # # # may also be needed; see USB_STORAGE Help for more information # CONFIG_USB_STORAGE=m # CONFIG_USB_STORAGE_DEBUG is not set CONFIG_USB_STORAGE_DATAFAB=y CONFIG_USB_STORAGE_FREECOM=y CONFIG_USB_STORAGE_ISD200=y CONFIG_USB_STORAGE_DPCM=y # CONFIG_USB_STORAGE_USBAT is not set CONFIG_USB_STORAGE_SDDR09=y CONFIG_USB_STORAGE_SDDR55=y CONFIG_USB_STORAGE_JUMPSHOT=y # CONFIG_USB_STORAGE_ALAUDA is not set # CONFIG_USB_LIBUSUAL is not set # # USB Input Devices # CONFIG_USB_HID=y CONFIG_USB_HIDINPUT=y # CONFIG_USB_HIDINPUT_POWERBOOK is not set CONFIG_HID_FF=y CONFIG_HID_PID=y CONFIG_LOGITECH_FF=y CONFIG_THRUSTMASTER_FF=y CONFIG_USB_HIDDEV=y CONFIG_USB_AIPTEK=m CONFIG_USB_WACOM=m # CONFIG_USB_ACECAD is not set CONFIG_USB_KBTAB=m CONFIG_USB_POWERMATE=m # CONFIG_USB_TOUCHSCREEN is not set # CONFIG_USB_YEALINK is not set CONFIG_USB_XPAD=m CONFIG_USB_ATI_REMOTE=m # CONFIG_USB_ATI_REMOTE2 is not set # CONFIG_USB_KEYSPAN_REMOTE is not set # CONFIG_USB_APPLETOUCH is not set # # USB Imaging devices # CONFIG_USB_MDC800=m CONFIG_USB_MICROTEK=m # # USB Network Adapters # CONFIG_USB_CATC=m CONFIG_USB_KAWETH=m CONFIG_USB_PEGASUS=m CONFIG_USB_RTL8150=m CONFIG_USB_USBNET=m CONFIG_USB_NET_AX8817X=m CONFIG_USB_NET_CDCETHER=m # CONFIG_USB_NET_GL620A is not set CONFIG_USB_NET_NET1080=m # CONFIG_USB_NET_PLUSB is not set # CONFIG_USB_NET_RNDIS_HOST is not set # CONFIG_USB_NET_CDC_SUBSET is not set CONFIG_USB_NET_ZAURUS=m # CONFIG_USB_ZD1201 is not set CONFIG_USB_MON=y # # USB port drivers # CONFIG_USB_USS720=m # # USB Serial Converter support # CONFIG_USB_SERIAL=m CONFIG_USB_SERIAL_GENERIC=y # CONFIG_USB_SERIAL_AIRPRIME is not set # CONFIG_USB_SERIAL_ANYDATA is not set CONFIG_USB_SERIAL_BELKIN=m # CONFIG_USB_SERIAL_WHITEHEAT is not set CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m # CONFIG_USB_SERIAL_CP2101 is not set # CONFIG_USB_SERIAL_CYPRESS_M8 is not set CONFIG_USB_SERIAL_EMPEG=m CONFIG_USB_SERIAL_FTDI_SIO=m # CONFIG_USB_SERIAL_FUNSOFT is not set CONFIG_USB_SERIAL_VISOR=m 
CONFIG_USB_SERIAL_IPAQ=m CONFIG_USB_SERIAL_IR=m CONFIG_USB_SERIAL_EDGEPORT=m CONFIG_USB_SERIAL_EDGEPORT_TI=m # CONFIG_USB_SERIAL_GARMIN is not set # CONFIG_USB_SERIAL_IPW is not set CONFIG_USB_SERIAL_KEYSPAN_PDA=m CONFIG_USB_SERIAL_KEYSPAN=m CONFIG_USB_SERIAL_KEYSPAN_MPR=y CONFIG_USB_SERIAL_KEYSPAN_USA28=y CONFIG_USB_SERIAL_KEYSPAN_USA28X=y CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y CONFIG_USB_SERIAL_KEYSPAN_USA19=y CONFIG_USB_SERIAL_KEYSPAN_USA18X=y CONFIG_USB_SERIAL_KEYSPAN_USA19W=y CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y CONFIG_USB_SERIAL_KEYSPAN_USA49W=y CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y CONFIG_USB_SERIAL_KLSI=m CONFIG_USB_SERIAL_KOBIL_SCT=m CONFIG_USB_SERIAL_MCT_U232=m # CONFIG_USB_SERIAL_NAVMAN is not set CONFIG_USB_SERIAL_PL2303=m # CONFIG_USB_SERIAL_HP4X is not set CONFIG_USB_SERIAL_SAFE=m CONFIG_USB_SERIAL_SAFE_PADDED=y # CONFIG_USB_SERIAL_TI is not set CONFIG_USB_SERIAL_CYBERJACK=m CONFIG_USB_SERIAL_XIRCOM=m CONFIG_USB_SERIAL_OMNINET=m CONFIG_USB_EZUSB=y # # USB Miscellaneous drivers # CONFIG_USB_EMI62=m # CONFIG_USB_EMI26 is not set CONFIG_USB_AUERSWALD=m CONFIG_USB_RIO500=m CONFIG_USB_LEGOTOWER=m CONFIG_USB_LCD=m CONFIG_USB_LED=m # CONFIG_USB_CYTHERM is not set # CONFIG_USB_PHIDGETKIT is not set CONFIG_USB_PHIDGETSERVO=m # CONFIG_USB_IDMOUSE is not set # CONFIG_USB_SISUSBVGA is not set # CONFIG_USB_LD is not set CONFIG_USB_TEST=m # # USB DSL modem support # CONFIG_USB_ATM=m CONFIG_USB_SPEEDTOUCH=m # CONFIG_USB_CXACRU is not set # CONFIG_USB_UEAGLEATM is not set # CONFIG_USB_XUSBATM is not set # # USB Gadget Support # # CONFIG_USB_GADGET is not set # # MMC/SD Card support # # CONFIG_MMC is not set # # LED devices # # CONFIG_NEW_LEDS is not set # # LED drivers # # # LED Triggers # # # InfiniBand support # CONFIG_INFINIBAND=m CONFIG_INFINIBAND_USER_MAD=m CONFIG_INFINIBAND_USER_ACCESS=m CONFIG_INFINIBAND_MTHCA=m CONFIG_INFINIBAND_MTHCA_DEBUG=y CONFIG_IPATH_CORE=m # CONFIG_INFINIBAND_IPATH is 
not set CONFIG_INFINIBAND_IPOIB=m CONFIG_INFINIBAND_IPOIB_DEBUG=y # CONFIG_INFINIBAND_IPOIB_DEBUG_DATA is not set CONFIG_INFINIBAND_SRP=m # # EDAC - error detection and reporting (RAS) (EXPERIMENTAL) # # CONFIG_EDAC is not set # # Real Time Clock # # CONFIG_RTC_CLASS is not set CONFIG_KCACHE=m # # Firmware Drivers # CONFIG_EDD=m # CONFIG_DELL_RBU is not set CONFIG_DCDBAS=m # # File systems # CONFIG_EXT2_FS=y CONFIG_EXT2_FS_XATTR=y CONFIG_EXT2_FS_POSIX_ACL=y CONFIG_EXT2_FS_SECURITY=y # CONFIG_EXT2_FS_XIP is not set CONFIG_EXT3_FS=y CONFIG_EXT3_FS_XATTR=y CONFIG_EXT3_FS_POSIX_ACL=y CONFIG_EXT3_FS_SECURITY=y CONFIG_JBD=y # CONFIG_JBD_DEBUG is not set CONFIG_FS_MBCACHE=y CONFIG_REISERFS_FS=y # CONFIG_REISERFS_CHECK is not set # CONFIG_REISERFS_PROC_INFO is not set CONFIG_REISERFS_FS_XATTR=y CONFIG_REISERFS_FS_POSIX_ACL=y # CONFIG_REISERFS_FS_SECURITY is not set # CONFIG_JFS_FS is not set CONFIG_FS_POSIX_ACL=y # CONFIG_XFS_FS is not set # CONFIG_OCFS2_FS is not set # CONFIG_MINIX_FS is not set # CONFIG_ROMFS_FS is not set CONFIG_INOTIFY=y CONFIG_QUOTA=y # CONFIG_QFMT_V1 is not set CONFIG_QFMT_V2=y CONFIG_QUOTACTL=y CONFIG_DNOTIFY=y # CONFIG_AUTOFS_FS is not set CONFIG_AUTOFS4_FS=m # CONFIG_FUSE_FS is not set # # CD-ROM/DVD Filesystems # CONFIG_ISO9660_FS=y CONFIG_JOLIET=y CONFIG_ZISOFS=y CONFIG_ZISOFS_FS=y CONFIG_UDF_FS=m CONFIG_UDF_NLS=y # # DOS/FAT/NT Filesystems # CONFIG_FAT_FS=m CONFIG_MSDOS_FS=m CONFIG_VFAT_FS=m CONFIG_FAT_DEFAULT_CODEPAGE=437 CONFIG_FAT_DEFAULT_IOCHARSET="ascii" # CONFIG_NTFS_FS is not set # # Pseudo filesystems # CONFIG_PROC_FS=y CONFIG_PROC_KCORE=y # CONFIG_PROC_VMCORE is not set CONFIG_SYSFS=y CONFIG_TMPFS=y CONFIG_HUGETLBFS=y CONFIG_HUGETLB_PAGE=y CONFIG_RAMFS=y # CONFIG_CONFIGFS_FS is not set # # Miscellaneous filesystems # # CONFIG_ADFS_FS is not set # CONFIG_AFFS_FS is not set CONFIG_HFS_FS=m CONFIG_HFSPLUS_FS=m # CONFIG_BEFS_FS is not set # CONFIG_BFS_FS is not set # CONFIG_EFS_FS is not set # CONFIG_JFFS_FS is not set # CONFIG_JFFS2_FS is 
not set CONFIG_CRAMFS=m CONFIG_VXFS_FS=m # CONFIG_HPFS_FS is not set # CONFIG_QNX4FS_FS is not set # CONFIG_SYSV_FS is not set # CONFIG_UFS_FS is not set # # Network File Systems # CONFIG_NFS_FS=m CONFIG_NFS_V3=y # CONFIG_NFS_V3_ACL is not set CONFIG_NFS_V4=y CONFIG_NFS_DIRECTIO=y CONFIG_NFSD=m CONFIG_NFSD_V3=y # CONFIG_NFSD_V3_ACL is not set CONFIG_NFSD_V4=y CONFIG_NFSD_TCP=y CONFIG_LOCKD=m CONFIG_LOCKD_V4=y CONFIG_EXPORTFS=m CONFIG_NFS_COMMON=y CONFIG_SUNRPC=m CONFIG_SUNRPC_GSS=m CONFIG_RPCSEC_GSS_KRB5=m CONFIG_RPCSEC_GSS_SPKM3=m CONFIG_SMB_FS=m # CONFIG_SMB_NLS_DEFAULT is not set CONFIG_CIFS=m # CONFIG_CIFS_STATS is not set CONFIG_CIFS_XATTR=y CONFIG_CIFS_POSIX=y # CONFIG_CIFS_EXPERIMENTAL is not set # CONFIG_NCP_FS is not set # CONFIG_CODA_FS is not set # CONFIG_AFS_FS is not set # CONFIG_9P_FS is not set # # Partition Types # CONFIG_PARTITION_ADVANCED=y # CONFIG_ACORN_PARTITION is not set CONFIG_OSF_PARTITION=y # CONFIG_AMIGA_PARTITION is not set # CONFIG_ATARI_PARTITION is not set CONFIG_MAC_PARTITION=y CONFIG_MSDOS_PARTITION=y CONFIG_BSD_DISKLABEL=y CONFIG_MINIX_SUBPARTITION=y CONFIG_SOLARIS_X86_PARTITION=y CONFIG_UNIXWARE_DISKLABEL=y # CONFIG_LDM_PARTITION is not set CONFIG_SGI_PARTITION=y # CONFIG_ULTRIX_PARTITION is not set CONFIG_SUN_PARTITION=y # CONFIG_KARMA_PARTITION is not set CONFIG_EFI_PARTITION=y # # Native Language Support # CONFIG_NLS=y CONFIG_NLS_DEFAULT="utf8" CONFIG_NLS_CODEPAGE_437=y CONFIG_NLS_CODEPAGE_737=m CONFIG_NLS_CODEPAGE_775=m CONFIG_NLS_CODEPAGE_850=m CONFIG_NLS_CODEPAGE_852=m CONFIG_NLS_CODEPAGE_855=m CONFIG_NLS_CODEPAGE_857=m CONFIG_NLS_CODEPAGE_860=m CONFIG_NLS_CODEPAGE_861=m CONFIG_NLS_CODEPAGE_862=m CONFIG_NLS_CODEPAGE_863=m CONFIG_NLS_CODEPAGE_864=m CONFIG_NLS_CODEPAGE_865=m CONFIG_NLS_CODEPAGE_866=m CONFIG_NLS_CODEPAGE_869=m CONFIG_NLS_CODEPAGE_936=m CONFIG_NLS_CODEPAGE_950=m CONFIG_NLS_CODEPAGE_932=m CONFIG_NLS_CODEPAGE_949=m CONFIG_NLS_CODEPAGE_874=m CONFIG_NLS_ISO8859_8=m CONFIG_NLS_CODEPAGE_1250=m 
CONFIG_NLS_CODEPAGE_1251=m CONFIG_NLS_ASCII=y CONFIG_NLS_ISO8859_1=m CONFIG_NLS_ISO8859_2=m CONFIG_NLS_ISO8859_3=m CONFIG_NLS_ISO8859_4=m CONFIG_NLS_ISO8859_5=m CONFIG_NLS_ISO8859_6=m CONFIG_NLS_ISO8859_7=m CONFIG_NLS_ISO8859_9=m CONFIG_NLS_ISO8859_13=m CONFIG_NLS_ISO8859_14=m CONFIG_NLS_ISO8859_15=m CONFIG_NLS_KOI8_R=m CONFIG_NLS_KOI8_U=m CONFIG_NLS_UTF8=m # # Instrumentation Support # CONFIG_PROFILING=y CONFIG_OPROFILE=m CONFIG_KPROBES=y # # Kernel hacking # # CONFIG_PRINTK_TIME is not set CONFIG_MAGIC_SYSRQ=y CONFIG_DEBUG_KERNEL=y CONFIG_LOG_BUF_SHIFT=17 CONFIG_DETECT_SOFTLOCKUP=y # CONFIG_SCHEDSTATS is not set CONFIG_DEBUG_SLAB=y # CONFIG_DEBUG_SLAB_LEAK is not set CONFIG_DEBUG_MUTEXES=y CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_SPINLOCK_SLEEP=y # CONFIG_DEBUG_KOBJECT is not set CONFIG_DEBUG_INFO=y # CONFIG_DEBUG_FS is not set # CONFIG_DEBUG_VM is not set # CONFIG_FRAME_POINTER is not set # CONFIG_UNWIND_INFO is not set CONFIG_FORCED_INLINING=y # CONFIG_RCU_TORTURE_TEST is not set # CONFIG_DEBUG_RODATA is not set # CONFIG_IOMMU_DEBUG is not set # # Security options # CONFIG_KEYS=y CONFIG_KEYS_DEBUG_PROC_KEYS=y CONFIG_SECURITY=y CONFIG_SECURITY_NETWORK=y # CONFIG_SECURITY_NETWORK_XFRM is not set CONFIG_SECURITY_CAPABILITIES=y # CONFIG_SECURITY_ROOTPLUG is not set # CONFIG_SECURITY_SECLVL is not set CONFIG_SECURITY_SELINUX=y CONFIG_SECURITY_SELINUX_BOOTPARAM=y CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1 CONFIG_SECURITY_SELINUX_DISABLE=y CONFIG_SECURITY_SELINUX_DEVELOP=y CONFIG_SECURITY_SELINUX_AVC_STATS=y CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1 # # Cryptographic options # CONFIG_CRYPTO=y CONFIG_CRYPTO_HMAC=y CONFIG_CRYPTO_NULL=m CONFIG_CRYPTO_MD4=m CONFIG_CRYPTO_MD5=y CONFIG_CRYPTO_SHA1=y CONFIG_CRYPTO_SHA256=m CONFIG_CRYPTO_SHA512=m CONFIG_CRYPTO_WP512=m # CONFIG_CRYPTO_TGR192 is not set CONFIG_CRYPTO_DES=m CONFIG_CRYPTO_BLOWFISH=m CONFIG_CRYPTO_TWOFISH=m CONFIG_CRYPTO_SERPENT=m # CONFIG_CRYPTO_AES is not set # CONFIG_CRYPTO_AES_X86_64 is not set 
CONFIG_CRYPTO_CAST5=m CONFIG_CRYPTO_CAST6=m CONFIG_CRYPTO_TEA=m CONFIG_CRYPTO_ARC4=m CONFIG_CRYPTO_KHAZAD=m # CONFIG_CRYPTO_ANUBIS is not set CONFIG_CRYPTO_DEFLATE=m CONFIG_CRYPTO_MICHAEL_MIC=m CONFIG_CRYPTO_CRC32C=m # CONFIG_CRYPTO_TEST is not set # # Hardware crypto devices # # # Library routines # CONFIG_CRC_CCITT=m # CONFIG_CRC16 is not set CONFIG_CRC32=y CONFIG_LIBCRC32C=m CONFIG_ZLIB_INFLATE=y CONFIG_ZLIB_DEFLATE=m

From ogerlitz at voltaire.com Thu Apr 27 01:40:20 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 27 Apr 2006 11:40:20 +0300 (IDT)
Subject: [openib-general] possible bug in kmem_cache related code
Message-ID:

With 2.6.17-rc3 I'm running into what seems to be a bug related to
kmem_cache. Doing some allocations/deallocations from a kmem_cache and
later attempting to destroy it yields the following message and trace:

============================================================================
slab error in kmem_cache_destroy(): cache `my_cache': Can't free all objects

Call Trace: {kmem_cache_destroy+150} {:my_kcache:kcache_cleanup_module+51}
       {sys_delete_module+415} {__up_write+20}
       {sys_munmap+91} {system_call+126}
Failed to destroy cache
============================================================================

I was hitting it as an InfiniBand/iSCSI user, as the IB/iSCSI/SCSI code
uses kmem_caches, but since the failure happens in code which works fine
on 2.6.16, I decided to try it with a synthetic module and hit it there
as well.

Below is sample code that reproduces it. If I only do kmem_cache_create
and later destroy the cache, it does not happen. Attached is my .config;
please note that some of the CONFIG_DEBUG_ options are enabled.

Please CC openib-general at openib.org at least with the resolution of
the matter, since it is kind of hard to do testing over 2.6.17-rcX with
this issue: the tests run fine, but some modules crash on rmmod, so a
reboot is needed...

thanks,

Or.
This is the related slab info line once the module is loaded

my_cache 256 264 328 12 1 : tunables 32 16 8 : slabdata 22 22 0 : globalstat 264 264 22 0

--- /deb/null	1970-01-01 02:00:00.000000000 +0200
+++ kcache/kcache.c	2006-04-27 10:43:18.000000000 +0300
@@ -0,0 +1,61 @@
+#include
+#include
+
+kmem_cache_t *cache;
+
+struct foo {
+	char bar[300];
+};
+
+
+#define TRIES 256
+
+struct foo *foo_arr[TRIES];
+
+static int __init kcache_init_module(void)
+{
+	int i, j;
+
+	cache = kmem_cache_create("my_cache",
+				  sizeof (struct foo),
+				  0,
+				  SLAB_HWCACHE_ALIGN,
+				  NULL,
+				  NULL);
+	if (!cache) {
+		printk(KERN_ERR "couldn't create cache\n");
+		goto error1;
+	}
+
+	for (i = 0; i < TRIES; i++) {
+		foo_arr[i] = kmem_cache_alloc(cache, GFP_KERNEL);
+		if (foo_arr[i] == NULL) {
+			printk(KERN_ERR "couldn't allocate from cache\n");
+			goto error2;
+		}
+	}
+
+	return 0;
+error2:
+	for (j = 0; j < i; j++)
+		kmem_cache_free(cache, foo_arr[j]);
+error1:
+	return -ENOMEM;
+}
+
+static void __exit kcache_cleanup_module(void)
+{
+	int i;
+
+	for (i = 0; i < TRIES; i++)
+		kmem_cache_free(cache, foo_arr[i]);
+
+	if (kmem_cache_destroy(cache)) {
+		printk(KERN_DEBUG "Failed to destroy cache\n");
+	}
+}
+
+MODULE_LICENSE("GPL");
+
+module_init(kcache_init_module);
+module_exit(kcache_cleanup_module);
-------------- next part --------------
A non-text attachment was scrubbed...
Name: kmem-cache-config.bz2
Type: application/x-bzip2
Size: 10879 bytes
Desc:
URL:

From schihei at de.ibm.com Thu Apr 27 05:05:12 2006
From: schihei at de.ibm.com (Heiko J Schick)
Date: Thu, 27 Apr 2006 14:05:12 +0200
Subject: [openib-general] [PATCH 00/16] ehca: IBM eHCA InfiniBand Device Driver
Message-ID: <4450B378.9000705@de.ibm.com>

Hello,

many thanks for your comments. They are very helpful for us.

All 17 patches have to be applied, otherwise the driver won't compile.

We added an initial version to support large pages.
At the moment we verified it only for 4K pages, because we're struggling
to get a Linux kernel with 64K pages running properly on our system.

We would appreciate any comments and feedback.

Signed-off-by: Heiko J Schick
Changelog-by: Heiko J Schick

Changelog:

Differences to PatchSet http://openib.org/pipermail/openib-general/2006-March/018144.html
Differences to PatchSet http://openib.org/pipermail/openib-general/2006-March/017412.html

- Renamed module param tracelevel to debug_level
- Reformatted MODULE_PARM_DESC in ehca_main.c
- Removed EHCA_CACHE_CREATE / EHCA_CACHE_DESTROY macros
- Renamed debug_level sysfs entry to debug_mask
- debug_mask sysfs entry now has only one value
- Added LARGEPAGE support (EXPERIMENTAL)
- Changed locking for internal IDRs (for CQs and QPs)
- ehca_poll_eqs now uses mod_timer instead of add_timer
- Removed compile warnings in libehca because of missing header files
- Added function ibv_read_sysfs_file to be compatible with libibverbs 1.0
- Removed libsysfs usage in libehca
- Rename HCALLs defines from StudlyCaps to SHUTTING_CAPS
- Improve scaling code for completion queue
- Remove use of struct ib_gid in firmware bridge
- Improve coding style in firmware bridge
- Rework static rate encoding
- Removed ehca_kv_to_g()
- Split remaining shared kernel/userspace files
- Removed defines in user space to reuse kernel files
- Removed struct ehca_qp_core, ehca_cq_core
- Removed all trailing blanks found
- Fixed sparse warnings
- Improved eq SMP scaling
- Added fork access protection to queue entries

 Kconfig                |    6
 Makefile               |   29
 ehca_av.c              |  309 ++++++
 ehca_classes.h         |  314 ++++++
 ehca_classes_pSeries.h |  253 ++++
 ehca_cq.c              |  445 ++++++++
 ehca_eq.c              |  225 ++++
 ehca_hca.c             |  286 +++++
 ehca_irq.c             |  712 ++++++++++++++
 ehca_irq.h             |   79 +
 ehca_iverbs.h          |  183 +++
 ehca_kernel.h          |  162 +++
 ehca_main.c            |  973 +++++++++++++++++++
 ehca_mcast.c           |  198 +++
 ehca_mrmw.c            | 2492 +++++++++++++++++++++++++++++++++++++++++++++++++
 ehca_mrmw.h            |  145 ++
 ehca_pd.c              |  122 ++
 ehca_qes.h             |  278 +++++
 ehca_qp.c              | 1592 +++++++++++++++++++++++++++++++
 ehca_reqs.c            |  685 +++++++++++++
 ehca_sqp.c             |  126 ++
 ehca_tools.h           |  387 +++++++
 ehca_uverbs.c          |  409 ++++++++
 hcp_if.c               | 2028 +++++++++++++++++++++++++++++++++++++++
 hcp_if.h               |  398 +++++++
 hcp_phyp.c             |   97 +
 hcp_phyp.h             |   97 +
 hipz_fns.h             |   73 +
 hipz_fns_core.h        |  126 ++
 hipz_hw.h              |  398 +++++++
 ipz_pt_fn.c            |  184 +++
 ipz_pt_fn.h            |  258 +++++
 32 files changed, 14069 insertions(+)

From schihei at de.ibm.com Thu Apr 27 05:05:24 2006
From: schihei at de.ibm.com (Heiko J Schick)
Date: Thu, 27 Apr 2006 14:05:24 +0200
Subject: [openib-general] [PATCH 01/16] ehca: integration in Linux kernel build system
Message-ID: <4450B384.4020601@de.ibm.com>

Signed-off-by: Heiko J Schick

 Kconfig  |    6 ++++++
 Makefile |   29 +++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

--- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/Kconfig	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/Kconfig	2006-01-04 16:29:05.000000000 +0100
@@ -0,0 +1,6 @@
+config INFINIBAND_EHCA
+	tristate "eHCA support"
+	depends on IBMEBUS && INFINIBAND
+	---help---
+	This is a low level device driver for the IBM
+	GX based Host channel adapters (HCAs)
\ No newline at end of file
--- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/Makefile	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/Makefile	2006-03-06 12:26:36.000000000 +0100
@@ -0,0 +1,29 @@
+# Authors: Heiko J Schick
+#          Christoph Raisch
+#
+# Copyright (c) 2005 IBM Corporation
+#
+# All rights reserved.
+#
+# This source code is distributed under a dual license of GPL v2.0 and OpenIB BSD.
+
+obj-$(CONFIG_INFINIBAND_EHCA) += hcad_mod.o
+
+hcad_mod-objs = ehca_main.o \
+		ehca_hca.o \
+		ehca_mcast.o \
+		ehca_pd.o \
+		ehca_av.o \
+		ehca_eq.o \
+		ehca_cq.o \
+		ehca_qp.o \
+		ehca_sqp.o \
+		ehca_mrmw.o \
+		ehca_reqs.o \
+		ehca_irq.o \
+		ehca_uverbs.o \
+		hcp_if.o \
+		hcp_phyp.o \
+		ipz_pt_fn.o
+
+CFLAGS += -DEHCA_USE_HCALL -DEHCA_USE_HCALL_KERNEL

From oferg at mellanox.co.il Thu Apr 27 03:45:36 2006
From: oferg at mellanox.co.il (Ofer Gigi)
Date: Thu, 27 Apr 2006 13:45:36 +0300
Subject: [openib-general] [PATCH] osm_port_info_rcv.c : clear client reregister bit
Message-ID: <07fyjzguun.fsf@sw053.yok.mtl.com>

Hi Hal,

Bug fix: on receipt of a client reregister, clear the reregister bit so
that the reregister request won't be sent again and again.

Please apply to trunk and branch.

Thanks
Ofer G.

Signed-off-by: Ofer Gigi

Index: osm_port_info_rcv.c
===================================================================
--- osm_port_info_rcv.c (revision 6640)
+++ osm_port_info_rcv.c (working copy)
@@ -666,6 +666,17 @@ osm_pi_rcv_process(
   p_smp = osm_madw_get_smp_ptr( p_madw );
   p_context = osm_madw_get_pi_context_ptr( p_madw );
   p_pi = (ib_port_info_t*)ib_smp_get_payload_ptr( p_smp );
+
+  /* On receive of client reregister - clear the reregister bit - so
+     reregistering won't be sent again and again*/
+  if (ib_port_info_get_client_rereg(p_pi))
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+             "osm_pi_rcv_process: "
+             "client reregister received on response\n");
+    ib_port_info_set_client_rereg(p_pi,0);
+  }
+
   port_num = (uint8_t)cl_ntoh32( p_smp->attr_mod );
 
   port_guid = p_context->port_guid;

From schihei at de.ibm.com Thu Apr 27 03:48:05 2006
From: schihei at de.ibm.com (Heiko J Schick)
Date: Thu, 27 Apr 2006 12:48:05 +0200
Subject: [openib-general] [PATCH 02/16] ehca: module infrastructure
Message-ID: <4450A165.4000701@de.ibm.com>

Signed-off-by: Heiko J Schick

 ehca_main.c |  973 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 973 insertions(+)

---
linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_main.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_main.c 2006-04-27 10:55:45.000000000 +0200 @@ -0,0 +1,973 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * module start stop, hca detection + * + * Authors: Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_main.c,v 1.35 2006/04/25 08:59:43 schickhj Exp $ + */ + +#define DEB_PREFIX "shca" + +#include "ehca_classes.h" +#include "ehca_iverbs.h" +#include "ehca_kernel.h" +#include "ehca_mrmw.h" +#include "ehca_tools.h" +#include "hcp_if.h" + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Christoph Raisch "); +MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver"); +MODULE_VERSION("SVNEHCA_0005"); + +struct ehca_comp_pool* ehca_pool; + +int ehca_open_aqp1 = 0; +int ehca_debug_level = -1; +int ehca_hw_level = 0; +int ehca_nr_ports = 2; +int ehca_use_hp_mr = 0; +int ehca_port_act_time = 30; +int ehca_poll_all_eqs = 1; +int ehca_static_rate = -1; + +module_param_named(open_aqp1, ehca_open_aqp1, int, 0); +module_param_named(debug_level, ehca_debug_level, int, 0); +module_param_named(hw_level, ehca_hw_level, int, 0); +module_param_named(nr_ports, ehca_nr_ports, int, 0); +module_param_named(use_hp_mr, ehca_use_hp_mr, int, 0); +module_param_named(port_act_time, ehca_port_act_time, int, 0); +module_param_named(poll_all_eqs, ehca_poll_all_eqs, int, 0); +module_param_named(static_rate, ehca_static_rate, int, 0); + +MODULE_PARM_DESC(open_aqp1, + "AQP1 on startup (0: no (default), 1: yes)"); +MODULE_PARM_DESC(debug_level, + "debug level" + " (0: node, 6: only errors (default), 9: all)"); +MODULE_PARM_DESC(hw_level, + "hardware level" + " (0: autosensing (default), 1: v. 0.20, 2: v. 0.21)"); +MODULE_PARM_DESC(nr_ports, + "number of connected ports (default: 2)"); +MODULE_PARM_DESC(use_hp_mr, + "high performance MRs (0: no (default), 1: yes)"); +MODULE_PARM_DESC(port_act_time, + "time to wait for port activation (default: 30 sec)"); +MODULE_PARM_DESC(poll_all_eqs, + "polls all event queues periodically" + " (0: no, 1: yes (default))"); +MODULE_PARM_DESC(static_rate, + "set permanent static rate (default: disabled)"); + +/* This external trace mask controls what will end up in the + * kernel ring buffer. 
Number 6 means, that everything between + * 0 and 5 will be stored. + */ +u8 ehca_edeb_mask[EHCA_EDEB_TRACE_MASK_SIZE]={6, 6, 6, 6, + 6, 6, 6, 6, + 6, 6, 6, 6, + 6, 6, 6, 6, + 6, 6, 6, 6, + 6, 6, 6, 6, + 6, 6, 6, 6, + 6, 6, 0, 0}; + +spinlock_t ehca_qp_idr_lock; +spinlock_t ehca_cq_idr_lock; +DEFINE_IDR(ehca_qp_idr); +DEFINE_IDR(ehca_cq_idr); + +struct ehca_module ehca_module; + +void ehca_init_trace(void) +{ + EDEB_EN(7, ""); + + if (ehca_debug_level != -1) { + int i; + for (i = 0; i < EHCA_EDEB_TRACE_MASK_SIZE; i++) + ehca_edeb_mask[i] = ehca_debug_level; + } + + EDEB_EX(7, ""); +} + +int ehca_create_slab_caches(struct ehca_module *ehca_module) +{ + int ret = 0; + + EDEB_EN(7, ""); + + ehca_module->cache_pd = + kmem_cache_create("ehca_cache_pd", + sizeof(struct ehca_pd), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (ehca_module->cache_pd == NULL) { + EDEB_ERR(4, "Cannot create PD SLAB cache."); + ret = -ENOMEM; + goto create_slab_caches1; + } + + ehca_module->cache_cq = + kmem_cache_create("ehca_cache_cq", + sizeof(struct ehca_cq), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (ehca_module->cache_cq == NULL) { + EDEB_ERR(4, "Cannot create CQ SLAB cache."); + ret = -ENOMEM; + goto create_slab_caches2; + } + + ehca_module->cache_qp = + kmem_cache_create("ehca_cache_qp", + sizeof(struct ehca_qp), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (ehca_module->cache_qp == NULL) { + EDEB_ERR(4, "Cannot create QP SLAB cache."); + ret = -ENOMEM; + goto create_slab_caches3; + } + + ehca_module->cache_av = + kmem_cache_create("ehca_cache_av", + sizeof(struct ehca_av), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (ehca_module->cache_av == NULL) { + EDEB_ERR(4, "Cannot create AV SLAB cache."); + ret = -ENOMEM; + goto create_slab_caches4; + } + + ehca_module->cache_mw = + kmem_cache_create("ehca_cache_mw", + sizeof(struct ehca_mw), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (ehca_module->cache_mw == NULL) { + EDEB_ERR(4, "Cannot create MW SLAB cache."); + ret = -ENOMEM; + goto 
create_slab_caches5; + } + + ehca_module->cache_mr = + kmem_cache_create("ehca_cache_mr", + sizeof(struct ehca_mr), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (ehca_module->cache_mr == NULL) { + EDEB_ERR(4, "Cannot create MR SLAB cache."); + ret = -ENOMEM; + goto create_slab_caches6; + } + + EDEB_EX(7, "ret=%x", ret); + + return ret; + +create_slab_caches6: + kmem_cache_destroy(ehca_module->cache_mw); + +create_slab_caches5: + kmem_cache_destroy(ehca_module->cache_av); + +create_slab_caches4: + kmem_cache_destroy(ehca_module->cache_qp); + +create_slab_caches3: + kmem_cache_destroy(ehca_module->cache_cq); + +create_slab_caches2: + kmem_cache_destroy(ehca_module->cache_pd); + +create_slab_caches1: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +int ehca_destroy_slab_caches(struct ehca_module *ehca_module) +{ + int ret; + + EDEB_EN(7, ""); + + ret = kmem_cache_destroy(ehca_module->cache_pd); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy PD SLAB cache. ret=%x", ret); + + ret = kmem_cache_destroy(ehca_module->cache_cq); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy CQ SLAB cache. ret=%x", ret); + + ret = kmem_cache_destroy(ehca_module->cache_qp); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy QP SLAB cache. ret=%x", ret); + + ret = kmem_cache_destroy(ehca_module->cache_av); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy AV SLAB cache. ret=%x", ret); + + ret = kmem_cache_destroy(ehca_module->cache_mw); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy MW SLAB cache. ret=%x", ret); + + ret = kmem_cache_destroy(ehca_module->cache_mr); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy MR SLAB cache. 
ret=%x", ret); + + EDEB_EX(7, ""); + + return 0; +} + +#define EHCA_HCAAVER EHCA_BMASK_IBM(32,39) +#define EHCA_REVID EHCA_BMASK_IBM(40,63) + +int ehca_sense_attributes(struct ehca_shca *shca) +{ + int ret = -EINVAL; + u64 rc = H_SUCCESS; + struct hipz_query_hca *rblock; + + EDEB_EN(7, "shca=%p", shca); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Cannot allocate rblock memory."); + ret = -ENOMEM; + goto num_ports0; + } + + rc = hipz_h_query_hca(shca->ipz_hca_handle, rblock); + if (rc != H_SUCCESS) { + EDEB_ERR(4, "Cannot query device properties.rc=%lx", rc); + ret = -EPERM; + goto num_ports1; + } + + if (ehca_nr_ports == 1) + shca->num_ports = 1; + else + shca->num_ports = (u8) rblock->num_ports; + + EDEB(6, " ... found %x ports", rblock->num_ports); + + if (ehca_hw_level == 0) { + u32 hcaaver; + u32 revid; + + hcaaver = EHCA_BMASK_GET(EHCA_HCAAVER, rblock->hw_ver); + revid = EHCA_BMASK_GET(EHCA_REVID, rblock->hw_ver); + + EDEB(6, " ... hardware version=%x:%x", + hcaaver, revid); + + if ((hcaaver == 1) && (revid == 0)) + shca->hw_level = 0; + else if ((hcaaver == 1) && (revid == 1)) + shca->hw_level = 1; + else if ((hcaaver == 1) && (revid == 2)) + shca->hw_level = 2; + } + EDEB(6, " ... 
hardware level=%x", shca->hw_level); + + shca->sport[0].rate = IB_RATE_30_GBPS; + shca->sport[1].rate = IB_RATE_30_GBPS; + + ret = 0; + +num_ports1: + kfree(rblock); + +num_ports0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static int init_node_guid(struct ehca_shca* shca) +{ + int ret = 0; + struct hipz_query_hca *rblock; + + EDEB_EN(7, ""); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto init_node_guid0; + } + + if (hipz_h_query_hca(shca->ipz_hca_handle, rblock) != H_SUCCESS) { + EDEB_ERR(4, "Can't query device properties"); + ret = -EINVAL; + goto init_node_guid1; + } + + memcpy(&shca->ib_device.node_guid, &rblock->node_guid, (sizeof(u64))); + +init_node_guid1: + kfree(rblock); + +init_node_guid0: + EDEB_EX(7, "node_guid=%lx ret=%x", shca->ib_device.node_guid, ret); + + return ret; +} + +int ehca_register_device(struct ehca_shca *shca) +{ + int ret = 0; + + EDEB_EN(7, "shca=%p", shca); + + ret = init_node_guid(shca); + if (ret != 0) + return ret; + + strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX); + shca->ib_device.owner = THIS_MODULE; + + shca->ib_device.uverbs_abi_ver = 5; + shca->ib_device.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_QUERY_QP) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) | + (1ull << IB_USER_VERBS_CMD_DETACH_MCAST); + + shca->ib_device.node_type = RDMA_NODE_IB_CA; + 
shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; + shca->ib_device.query_pkey = ehca_query_pkey; + /* shca->in_device.modify_device = ehca_modify_device */ + shca->ib_device.modify_port = ehca_modify_port; + shca->ib_device.alloc_ucontext = ehca_alloc_ucontext; + shca->ib_device.dealloc_ucontext = ehca_dealloc_ucontext; + shca->ib_device.alloc_pd = ehca_alloc_pd; + shca->ib_device.dealloc_pd = ehca_dealloc_pd; + shca->ib_device.create_ah = ehca_create_ah; + /* shca->ib_device.modify_ah = ehca_modify_ah; */ + shca->ib_device.query_ah = ehca_query_ah; + shca->ib_device.destroy_ah = ehca_destroy_ah; + shca->ib_device.create_qp = ehca_create_qp; + shca->ib_device.modify_qp = ehca_modify_qp; + shca->ib_device.query_qp = ehca_query_qp; + shca->ib_device.destroy_qp = ehca_destroy_qp; + shca->ib_device.post_send = ehca_post_send; + shca->ib_device.post_recv = ehca_post_recv; + shca->ib_device.create_cq = ehca_create_cq; + shca->ib_device.destroy_cq = ehca_destroy_cq; + shca->ib_device.resize_cq = ehca_resize_cq; + shca->ib_device.poll_cq = ehca_poll_cq; + /* shca->ib_device.peek_cq = ehca_peek_cq; */ + shca->ib_device.req_notify_cq = ehca_req_notify_cq; + /* shca->ib_device.req_ncomp_notif = ehca_req_ncomp_notif; */ + shca->ib_device.get_dma_mr = ehca_get_dma_mr; + shca->ib_device.reg_phys_mr = ehca_reg_phys_mr; + shca->ib_device.reg_user_mr = ehca_reg_user_mr; + shca->ib_device.query_mr = ehca_query_mr; + shca->ib_device.dereg_mr = ehca_dereg_mr; + shca->ib_device.rereg_phys_mr = ehca_rereg_phys_mr; + shca->ib_device.alloc_mw = ehca_alloc_mw; + shca->ib_device.bind_mw = ehca_bind_mw; + shca->ib_device.dealloc_mw = ehca_dealloc_mw; + shca->ib_device.alloc_fmr = ehca_alloc_fmr; + shca->ib_device.map_phys_fmr = ehca_map_phys_fmr; + shca->ib_device.unmap_fmr = 
ehca_unmap_fmr; + shca->ib_device.dealloc_fmr = ehca_dealloc_fmr; + shca->ib_device.attach_mcast = ehca_attach_mcast; + shca->ib_device.detach_mcast = ehca_detach_mcast; + /* shca->ib_device.process_mad = ehca_process_mad; */ + shca->ib_device.mmap = ehca_mmap; + + ret = ib_register_device(&shca->ib_device); + + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static int ehca_create_aqp1(struct ehca_shca *shca, u32 port) +{ + struct ehca_sport *sport; + struct ib_cq *ibcq; + struct ib_qp *ibqp; + struct ib_qp_init_attr qp_init_attr; + int ret = 0; + + EDEB_EN(7, "shca=%p port=%x", shca, port); + + sport = &shca->sport[port - 1]; + + if (sport->ibcq_aqp1 != NULL) { + EDEB_ERR(4, "AQP1 CQ is already created."); + return -EPERM; + } + + ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10); + if (IS_ERR(ibcq)) { + EDEB_ERR(4, "Cannot create AQP1 CQ."); + return PTR_ERR(ibcq); + } + sport->ibcq_aqp1 = ibcq; + + if (sport->ibqp_aqp1 != NULL) { + EDEB_ERR(4, "AQP1 QP is already created."); + ret = -EPERM; + goto create_aqp1; + } + + memset(&qp_init_attr, 0, sizeof(struct ib_qp_init_attr)); + qp_init_attr.send_cq = ibcq; + qp_init_attr.recv_cq = ibcq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = 100; + qp_init_attr.cap.max_recv_wr = 100; + qp_init_attr.cap.max_send_sge = 2; + qp_init_attr.cap.max_recv_sge = 1; + qp_init_attr.qp_type = IB_QPT_GSI; + qp_init_attr.port_num = port; + qp_init_attr.qp_context = NULL; + qp_init_attr.event_handler = NULL; + qp_init_attr.srq = NULL; + + ibqp = ib_create_qp(&shca->pd->ib_pd, &qp_init_attr); + if (IS_ERR(ibqp)) { + EDEB_ERR(4, "Cannot create AQP1 QP."); + ret = PTR_ERR(ibqp); + goto create_aqp1; + } + sport->ibqp_aqp1 = ibqp; + + EDEB_EX(7, "ret=%x", ret); + + return ret; + +create_aqp1: + ib_destroy_cq(sport->ibcq_aqp1); + + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static int ehca_destroy_aqp1(struct ehca_sport *sport) +{ + int ret = 0; + + EDEB_EN(7, "sport=%p", sport); + + 
ret = ib_destroy_qp(sport->ibqp_aqp1); + if (ret != 0) { + EDEB_ERR(4, "Cannot destroy AQP1 QP. ret=%x", ret); + goto destroy_aqp1; + } + + ret = ib_destroy_cq(sport->ibcq_aqp1); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy AQP1 CQ. ret=%x", ret); + +destroy_aqp1: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static ssize_t ehca_show_debug_mask(struct device_driver *ddp, char *buf) +{ + int i; + int total = 0; + total += snprintf(buf + total, PAGE_SIZE - total, "%d", + ehca_edeb_mask[0]); + for (i = 1; i < EHCA_EDEB_TRACE_MASK_SIZE; i++) { + total += snprintf(buf + total, PAGE_SIZE - total, "%d", + ehca_edeb_mask[i]); + } + + total += snprintf(buf + total, PAGE_SIZE - total, "\n"); + + return total; +} + +static ssize_t ehca_store_debug_mask(struct device_driver *ddp, + const char *buf, size_t count) +{ + int i; + for (i = 0; i < EHCA_EDEB_TRACE_MASK_SIZE; i++) { + char value = buf[i] - '0'; + if ((value <= 9) && (count >= i)) { + ehca_edeb_mask[i] = value; + } + } + return count; +} +DRIVER_ATTR(debug_mask, S_IRUSR | S_IWUSR, + ehca_show_debug_mask, ehca_store_debug_mask); + +void ehca_create_driver_sysfs(struct ibmebus_driver *drv) +{ + driver_create_file(&drv->driver, &driver_attr_debug_mask); +} + +void ehca_remove_driver_sysfs(struct ibmebus_driver *drv) +{ + driver_remove_file(&drv->driver, &driver_attr_debug_mask); +} + +#define EHCA_RESOURCE_ATTR(name) \ +static ssize_t ehca_show_##name(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + struct ehca_shca *shca; \ + struct hipz_query_hca *rblock; \ + int data; \ + \ + shca = dev->driver_data; \ + \ + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); \ + if (!rblock) { \ + EDEB_ERR(4, "Can't allocate rblock memory."); \ + return 0; \ + } \ + \ + if (hipz_h_query_hca(shca->ipz_hca_handle, rblock) != H_SUCCESS) { \ + EDEB_ERR(4, "Can't query device properties"); \ + kfree(rblock); \ + return 0; \ + } \ + \ + data = rblock->name; \ + kfree(rblock); \ + \ + if ((strcmp(#name, 
"num_ports") == 0) && (ehca_nr_ports == 1)) \ + return snprintf(buf, 256, "1\n"); \ + else \ + return snprintf(buf, 256, "%d\n", data); \ + \ +} \ +static DEVICE_ATTR(name, S_IRUGO, ehca_show_##name, NULL); + +EHCA_RESOURCE_ATTR(num_ports); +EHCA_RESOURCE_ATTR(hw_ver); +EHCA_RESOURCE_ATTR(max_eq); +EHCA_RESOURCE_ATTR(cur_eq); +EHCA_RESOURCE_ATTR(max_cq); +EHCA_RESOURCE_ATTR(cur_cq); +EHCA_RESOURCE_ATTR(max_qp); +EHCA_RESOURCE_ATTR(cur_qp); +EHCA_RESOURCE_ATTR(max_mr); +EHCA_RESOURCE_ATTR(cur_mr); +EHCA_RESOURCE_ATTR(max_mw); +EHCA_RESOURCE_ATTR(cur_mw); +EHCA_RESOURCE_ATTR(max_pd); +EHCA_RESOURCE_ATTR(max_ah); + +static ssize_t ehca_show_adapter_handle(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct ehca_shca *shca = dev->driver_data; + + return sprintf(buf, "%lx\n", shca->ipz_hca_handle.handle); + +} +static DEVICE_ATTR(adapter_handle, S_IRUGO, ehca_show_adapter_handle, NULL); + + + +void ehca_create_device_sysfs(struct ibmebus_dev *dev) +{ + device_create_file(&dev->ofdev.dev, &dev_attr_adapter_handle); + device_create_file(&dev->ofdev.dev, &dev_attr_num_ports); + device_create_file(&dev->ofdev.dev, &dev_attr_hw_ver); + device_create_file(&dev->ofdev.dev, &dev_attr_max_eq); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_eq); + device_create_file(&dev->ofdev.dev, &dev_attr_max_cq); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_cq); + device_create_file(&dev->ofdev.dev, &dev_attr_max_qp); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_qp); + device_create_file(&dev->ofdev.dev, &dev_attr_max_mr); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_mr); + device_create_file(&dev->ofdev.dev, &dev_attr_max_mw); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_mw); + device_create_file(&dev->ofdev.dev, &dev_attr_max_pd); + device_create_file(&dev->ofdev.dev, &dev_attr_max_ah); +} + +void ehca_remove_device_sysfs(struct ibmebus_dev *dev) +{ + device_remove_file(&dev->ofdev.dev, &dev_attr_adapter_handle); + 
device_remove_file(&dev->ofdev.dev, &dev_attr_num_ports); + device_remove_file(&dev->ofdev.dev, &dev_attr_hw_ver); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_eq); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_eq); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_cq); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_cq); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_qp); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_qp); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_mr); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_mr); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_mw); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_mw); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_pd); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_ah); +} + +static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) +{ + struct ehca_shca *shca; + u64 *handle; + struct ib_pd *ibpd; + int ret = 0; + + EDEB_EN(7, "name=%s", dev->name); + + handle = (u64 *)get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + EDEB_ERR(4, "Cannot get eHCA handle for adapter: %s.", + dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + EDEB_ERR(4, "Wrong eHCA handle for adapter: %s.", + dev->ofdev.node->full_name); + return -ENODEV; + } + + shca = (struct ehca_shca *)ib_alloc_device(sizeof(*shca)); + if (shca == NULL) { + EDEB_ERR(4, "Cannot allocate shca memory."); + return -ENOMEM; + } + + shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; + dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { + EDEB_ERR(4, "Cannot sense eHCA attributes."); + goto probe1; + } + + /* create event queues */ + ret = ehca_create_eq(shca, &shca->eq, EHCA_EQ, 2048); + if (ret != 0) { + EDEB_ERR(4, "Cannot create EQ."); + goto probe1; + } + + ret = ehca_create_eq(shca, &shca->neq, EHCA_NEQ, 513); + if (ret != 0) { + EDEB_ERR(4, "Cannot create NEQ."); 
+ goto probe2; + } + + /* create internal protection domain */ + ibpd = ehca_alloc_pd(&shca->ib_device, (void*)(-1), NULL); + if (IS_ERR(ibpd)) { + EDEB_ERR(4, "Cannot create internal PD."); + ret = PTR_ERR(ibpd); + goto probe3; + } + + shca->pd = container_of(ibpd, struct ehca_pd, ib_pd); + shca->pd->ib_pd.device = &shca->ib_device; + + /* create internal max MR */ + if (shca->maxmr == 0) { + struct ehca_mr *e_maxmr = NULL; + ret = ehca_reg_internal_maxmr(shca, shca->pd, &e_maxmr); + if (ret != 0) { + EDEB_ERR(4, "Cannot create internal MR. ret=%x", ret); + goto probe4; + } + shca->maxmr = e_maxmr; + } + + ret = ehca_register_device(shca); + if (ret != 0) { + EDEB_ERR(4, "Cannot register Infiniband device."); + goto probe5; + } + + /* create AQP1 for port 1 */ + if (ehca_open_aqp1 == 1) { + shca->sport[0].port_state = IB_PORT_DOWN; + ret = ehca_create_aqp1(shca, 1); + if (ret != 0) { + EDEB_ERR(4, "Cannot create AQP1 for port 1."); + goto probe6; + } + } + + /* create AQP1 for port 2 */ + if ((ehca_open_aqp1 == 1) && (shca->num_ports == 2)) { + shca->sport[1].port_state = IB_PORT_DOWN; + ret = ehca_create_aqp1(shca, 2); + if (ret != 0) { + EDEB_ERR(4, "Cannot create AQP1 for port 2."); + goto probe7; + } + } + + ehca_create_device_sysfs(dev); + + spin_lock(&ehca_module.shca_lock); + list_add(&shca->shca_list, &ehca_module.shca_list); + spin_unlock(&ehca_module.shca_lock); + + EDEB_EX(7, "ret=%x", ret); + + return 0; + +probe7: + ret = ehca_destroy_aqp1(&shca->sport[0]); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy AQP1 for port 1. ret=%x", ret); + +probe6: + ib_unregister_device(&shca->ib_device); + +probe5: + ret = ehca_dereg_internal_maxmr(shca); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy internal MR. ret=%x", ret); + +probe4: + ret = ehca_dealloc_pd(&shca->pd->ib_pd); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy internal PD. ret=%x", ret); + +probe3: + ret = ehca_destroy_eq(shca, &shca->neq); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy NEQ. 
ret=%x", ret);
+
+probe2:
+	ret = ehca_destroy_eq(shca, &shca->eq);
+	if (ret != 0)
+		EDEB_ERR(4, "Cannot destroy EQ. ret=%x", ret);
+
+probe1:
+	ib_dealloc_device(&shca->ib_device);
+
+	EDEB_EX(4, "ret=%x", ret);
+
+	return -EINVAL;
+}
+
+static int __devexit ehca_remove(struct ibmebus_dev *dev)
+{
+	struct ehca_shca *shca = dev->ofdev.dev.driver_data;
+	int ret;
+
+	EDEB_EN(7, "shca=%p", shca);
+
+	ehca_remove_device_sysfs(dev);
+
+	if (ehca_open_aqp1 == 1) {
+		int i;
+
+		for (i = 0; i < shca->num_ports; i++) {
+			ret = ehca_destroy_aqp1(&shca->sport[i]);
+			if (ret != 0)
+				EDEB_ERR(4, "Cannot destroy AQP1 for port %x."
+					 " ret=%x", i, ret);
+		}
+	}
+
+	ib_unregister_device(&shca->ib_device);
+
+	ret = ehca_dereg_internal_maxmr(shca);
+	if (ret != 0)
+		EDEB_ERR(4, "Cannot destroy internal MR. ret=%x", ret);
+
+	ret = ehca_dealloc_pd(&shca->pd->ib_pd);
+	if (ret != 0)
+		EDEB_ERR(4, "Cannot destroy internal PD. ret=%x", ret);
+
+	ret = ehca_destroy_eq(shca, &shca->eq);
+	if (ret != 0)
+		EDEB_ERR(4, "Cannot destroy EQ. ret=%x", ret);
+
+	ret = ehca_destroy_eq(shca, &shca->neq);
+	if (ret != 0)
+		EDEB_ERR(4, "Cannot destroy NEQ.
ret=%x", ret); + + ib_dealloc_device(&shca->ib_device); + + spin_lock(&ehca_module.shca_lock); + list_del(&shca->shca_list); + spin_unlock(&ehca_module.shca_lock); + + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static struct of_device_id ehca_device_table[] = +{ + { + .name = "lhca", + .compatible = "IBM,lhca", + }, + {}, +}; + +static struct ibmebus_driver ehca_driver = { + .name = "ehca", + .id_table = ehca_device_table, + .probe = ehca_probe, + .remove = ehca_remove, +}; + +int __init ehca_module_init(void) +{ + int ret = 0; + + printk(KERN_INFO "eHCA Infiniband Device Driver " + "(Rel.: SVNEHCA_0005)\n"); + EDEB_EN(7, ""); + + idr_init(&ehca_qp_idr); + idr_init(&ehca_cq_idr); + spin_lock_init(&ehca_qp_idr_lock); + spin_lock_init(&ehca_cq_idr_lock); + + INIT_LIST_HEAD(&ehca_module.shca_list); + spin_lock_init(&ehca_module.shca_lock); + + ehca_init_trace(); + + ehca_pool = ehca_create_comp_pool(); + if (ehca_pool == NULL) { + EDEB_ERR(4, "Cannot create comp pool."); + ret = -EINVAL; + goto module_init0; + } + + if ((ret = ehca_create_slab_caches(&ehca_module)) != 0) { + EDEB_ERR(4, "Cannot create SLAB caches"); + ret = -ENOMEM; + goto module_init1; + } + + if ((ret = ibmebus_register_driver(&ehca_driver)) != 0) { + EDEB_ERR(4, "Cannot register eHCA device driver"); + ret = -EINVAL; + goto module_init2; + } + + ehca_create_driver_sysfs(&ehca_driver); + + if (ehca_poll_all_eqs != 1) { + EDEB_ERR(4, "WARNING!!!"); + EDEB_ERR(4, "It is possible to lose interrupts."); + + return 0; + } + + init_timer(&ehca_module.timer); + ehca_module.timer.function = ehca_poll_eqs; + ehca_module.timer.data = (unsigned long)(void*)&ehca_module; + ehca_module.timer.expires = jiffies + HZ; + add_timer(&ehca_module.timer); + + EDEB_EX(7, "ret=%x", ret); + + return 0; + +module_init2: + ehca_destroy_slab_caches(&ehca_module); + +module_init1: + ehca_destroy_comp_pool(ehca_pool); + +module_init0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +}; + +void __exit ehca_module_exit(void) 
+{ + EDEB_EN(7, ""); + + if (ehca_poll_all_eqs == 1) + del_timer_sync(&ehca_module.timer); + + ehca_remove_driver_sysfs(&ehca_driver); + ibmebus_unregister_driver(&ehca_driver); + + if (ehca_destroy_slab_caches(&ehca_module) != 0) + EDEB_ERR(4, "Cannot destroy SLAB caches"); + + ehca_destroy_comp_pool(ehca_pool); + + idr_destroy(&ehca_cq_idr); + idr_destroy(&ehca_qp_idr); + + EDEB_EX(7, ""); +}; + +module_init(ehca_module_init); +module_exit(ehca_module_exit); From schihei at de.ibm.com Thu Apr 27 03:48:13 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 27 Apr 2006 12:48:13 +0200 Subject: [openib-general] [PATCH 03/16] ehca: structure definitions Message-ID: <4450A16D.7000008@de.ibm.com> Signed-off-by: Heiko J Schick ehca_classes.h | 314 +++++++++++++++++++++++++++++++++++++++++++++++++ ehca_classes_pSeries.h | 253 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 567 insertions(+) --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_classes.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_classes.h 2006-04-24 15:12:03.000000000 +0200 @@ -0,0 +1,314 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Struct definition for eHCA internal structures + * + * Authors:Heiko J Schick + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. 
+ * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_classes.h,v 1.23 2006/04/24 13:12:03 schickhj Exp $ + */ + +#ifndef __EHCA_CLASSES_H__ +#define __EHCA_CLASSES_H__ + +#include "ehca_kernel.h" +#include "ipz_pt_fn.h" + +struct ehca_module; +struct ehca_qp; +struct ehca_cq; +struct ehca_eq; +struct ehca_mr; +struct ehca_mw; +struct ehca_pd; +struct ehca_av; + +#ifdef CONFIG_PPC64 +#include "ehca_classes_pSeries.h" +#endif + +#include +#include + +#include "ehca_irq.h" + +struct ehca_module { + struct list_head shca_list; + spinlock_t shca_lock; + struct timer_list timer; + kmem_cache_t *cache_pd; + kmem_cache_t *cache_cq; + kmem_cache_t *cache_qp; + kmem_cache_t *cache_av; + kmem_cache_t *cache_mr; + kmem_cache_t *cache_mw; + struct ehca_pfmodule pf; +}; + +struct ehca_eq { + u32 length; + struct ipz_queue ipz_queue; + struct ipz_eq_handle ipz_eq_handle; + struct work_struct work; + struct h_galpas galpas; + int is_initialized; + struct ehca_pfeq pf; + spinlock_t spinlock; + struct tasklet_struct interrupt_task; + u32 ist; +}; + +struct ehca_sport { + struct ib_cq *ibcq_aqp1; + struct ib_qp *ibqp_aqp1; + enum ib_rate rate; + enum ib_port_state port_state; +}; + +struct ehca_shca { + struct ib_device ib_device; + struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; + struct ipz_adapter_handle ipz_hca_handle; + struct ehca_sport sport[2]; + struct ehca_eq eq; + struct ehca_eq neq; + struct ehca_mr *maxmr; + struct ehca_pd *pd; + struct ehca_pfshca pf; + struct h_galpas galpas; +}; + +struct ehca_pd { + struct ib_pd ib_pd; + struct ipz_pd fw_pd; + struct ehca_pfpd pf; + u32 ownpid; +}; + +struct ehca_qp { + struct ib_qp ib_qp; + u32 qp_type; + struct ipz_queue ipz_squeue; + struct ipz_queue ipz_rqueue; + struct h_galpas galpas; + u32 qkey; + u32 real_qp_num; + u32 token; + spinlock_t spinlock_s; + spinlock_t spinlock_r; + u32 sq_max_inline_data_size; + struct ipz_qp_handle ipz_qp_handle; + struct ehca_pfqp pf; + struct ib_qp_init_attr init_attr; + u64 
uspace_squeue; + u64 uspace_rqueue; + u64 uspace_fwh; + struct ehca_cq* send_cq; + unsigned int sqerr_purgeflag; + struct hlist_node list_entries; +}; + +/* must be power of 2 */ +#define QP_HASHTAB_LEN 8 + +struct ehca_cq { + struct ib_cq ib_cq; + struct ipz_queue ipz_queue; + struct h_galpas galpas; + spinlock_t spinlock; + u32 cq_number; + u32 token; + u32 nr_of_entries; + struct ipz_cq_handle ipz_cq_handle; + struct ehca_pfcq pf; + spinlock_t cb_lock; + u64 uspace_queue; + u64 uspace_fwh; + struct hlist_head qp_hashtab[QP_HASHTAB_LEN]; + struct list_head entry; + u32 nr_callbacks; + spinlock_t task_lock; + u32 ownpid; +}; + +enum ehca_mr_flag { + EHCA_MR_FLAG_FMR = 0x80000000, /* FMR, created with ehca_alloc_fmr */ + EHCA_MR_FLAG_MAXMR = 0x40000000, /* max-MR */ +}; + +struct ehca_mr { + union { + struct ib_mr ib_mr; /* must always be first in ehca_mr */ + struct ib_fmr ib_fmr; /* must always be first in ehca_mr */ + } ib; + + spinlock_t mrlock; + + /* !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + * !!! ehca_mr_deletenew() memsets from flags to end of structure + * !!! DON'T move flags or insert another field before. + * !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
*/ + + enum ehca_mr_flag flags; + u32 num_pages; /* number of MR pages */ + u32 num_4k; /* number of 4k "page" portions to form MR */ + int acl; /* ACL (stored here for usage in reregister) */ + u64 *start; /* virtual start address (stored here for */ + /* usage in reregister) */ + u64 size; /* size (stored here for usage in reregister) */ + u32 fmr_page_size; /* page size for FMR */ + u32 fmr_max_pages; /* max pages for FMR */ + u32 fmr_max_maps; /* max outstanding maps for FMR */ + u32 fmr_map_cnt; /* map counter for FMR */ + /* fw specific data */ + struct ipz_mrmw_handle ipz_mr_handle; /* MR handle for h-calls */ + struct h_galpas galpas; + /* data for userspace bridge */ + u32 nr_of_pages; + void *pagearray; + + struct ehca_pfmr pf; /* platform specific part of MR */ +}; + +struct ehca_mw { + struct ib_mw ib_mw; /* gen2 mw, must always be first in ehca_mw */ + spinlock_t mwlock; + + u8 never_bound; /* indication MW was never bound */ + struct ipz_mrmw_handle ipz_mw_handle; /* MW handle for h-calls */ + struct h_galpas galpas; + + struct ehca_pfmw pf; /* platform specific part of MW */ +}; + +enum ehca_mr_pgi_type { + EHCA_MR_PGI_PHYS = 1, /* type of ehca_reg_phys_mr, + * ehca_rereg_phys_mr, + * ehca_reg_internal_maxmr */ + EHCA_MR_PGI_USER = 2, /* type of ehca_reg_user_mr */ + EHCA_MR_PGI_FMR = 3 /* type of ehca_map_phys_fmr */ +}; + +struct ehca_mr_pginfo { + enum ehca_mr_pgi_type type; + u64 num_pages; + u64 page_cnt; + u64 num_4k; /* number of 4k "page" portions */ + u64 page_4k_cnt; /* counter for 4k "page" portions */ + u64 next_4k; /* next 4k "page" portion in buffer/chunk/listelem */ + + /* type EHCA_MR_PGI_PHYS section */ + int num_phys_buf; + struct ib_phys_buf *phys_buf_array; + u64 next_buf; + + /* type EHCA_MR_PGI_USER section */ + struct ib_umem *region; + struct ib_umem_chunk *next_chunk; + u64 next_nmap; + + /* type EHCA_MR_PGI_FMR section */ + u64 *page_list; + u64 next_listelem; + /* next_4k also used within EHCA_MR_PGI_FMR */ +}; + +struct 
ehca_av { + struct ib_ah ib_ah; + struct ehca_ud_av av; +}; + +struct ehca_ucontext { + struct ib_ucontext ib_ucontext; +}; + +struct ehca_module *ehca_module_new(void); + +int ehca_module_delete(struct ehca_module *me); + +int ehca_eq_ctor(struct ehca_eq *eq); + +int ehca_eq_dtor(struct ehca_eq *eq); + +struct ehca_shca *ehca_shca_new(void); + +int ehca_shca_delete(struct ehca_shca *me); + +struct ehca_sport *ehca_sport_new(struct ehca_shca *anchor); + +extern spinlock_t ehca_qp_idr_lock; +extern spinlock_t ehca_cq_idr_lock; +extern struct idr ehca_qp_idr; +extern struct idr ehca_cq_idr; + +struct ipzu_queue_resp { + u64 queue; /* points to first queue entry */ + u32 qe_size; /* queue entry size */ + u32 act_nr_of_sg; + u32 queue_length; /* queue length allocated in bytes */ + u32 pagesize; + u32 toggle_state; + u32 dummy; /* padding for 8 byte alignment */ +}; + +struct ehca_create_cq_resp { + u32 cq_number; + u32 token; + struct ipzu_queue_resp ipz_queue; + struct h_galpas galpas; +}; + +struct ehca_create_qp_resp { + u32 qp_num; + u32 token; + u32 qp_type; + u32 qkey; + /* qp_num assigned by ehca: sqp0/1 may have got different numbers */ + u32 real_qp_num; + u32 dummy; /* padding for 8 byte alignment */ + struct ipzu_queue_resp ipz_squeue; + struct ipzu_queue_resp ipz_rqueue; + struct h_galpas galpas; +}; + +int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp); +int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int qp_num); +struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int qp_num); + +#endif --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h 2006-02-27 18:00:32.000000000 +0100 @@ -0,0 +1,253 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * pSeries interface definitions + * + * Authors: Waleri Fomin + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights 
reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_classes_pSeries.h,v 1.3 2006/02/27 17:00:32 nguyen Exp $ + */ + +#ifndef __EHCA_CLASSES_PSERIES_H__ +#define __EHCA_CLASSES_PSERIES_H__ + +#include "hcp_phyp.h" +#include "ipz_pt_fn.h" + + +struct ehca_pfmodule { +}; + +struct ehca_pfshca { +}; + +struct ehca_pfqp { + struct ipz_qpt sqpt; + struct ipz_qpt rqpt; +}; + +struct ehca_pfcq { + struct ipz_qpt qpt; + u32 cqnr; +}; + +struct ehca_pfeq { + struct ipz_qpt qpt; + struct h_galpa galpa; + u32 eqnr; +}; + +struct ehca_pfpd { +}; + +struct ehca_pfmr { +}; + +struct ehca_pfmw { +}; + +struct ipz_adapter_handle { + u64 handle; +}; + +struct ipz_cq_handle { + u64 handle; +}; + +struct ipz_eq_handle { + u64 handle; +}; + +struct ipz_qp_handle { + u64 handle; +}; +struct ipz_mrmw_handle { + u64 handle; +}; + +struct ipz_pd { + u32 value; +}; + +struct hcp_modify_qp_control_block { + u32 qkey; /* 00 */ + u32 rdd; /* reliable datagram domain */ + u32 send_psn; /* 02 */ + u32 receive_psn; /* 03 */ + u32 prim_phys_port; /* 04 */ + u32 alt_phys_port; /* 05 */ + u32 prim_p_key_idx; /* 06 */ + u32 alt_p_key_idx; /* 07 */ + u32 rdma_atomic_ctrl; /* 08 */ + u32 qp_state; /* 09 */ + u32 reserved_10; /* 10 */ + u32 rdma_nr_atomic_resp_res; /* 11 */ + u32 path_migration_state; /* 12 */ + u32 rdma_atomic_outst_dest_qp; /* 13 */ + u32 dest_qp_nr; /* 14 */ + u32 min_rnr_nak_timer_field; /* 15 */ + u32 service_level; /* 16 */ + u32 send_grh_flag; /* 17 */ + u32 retry_count; /* 18 */ + u32 timeout; /* 19 */ + u32 path_mtu; /* 20 */ + u32 max_static_rate; /* 21 */ + u32 dlid; /* 22 */ + u32 rnr_retry_count; /* 23 */ + u32 source_path_bits; /* 24 */ + u32 traffic_class; /* 25 */ + u32 hop_limit; /* 26 */ + u32 source_gid_idx; /* 27 */ + u32 flow_label; /* 28 */ + u32 reserved_29; /* 29 */ + union { /* 30 */ + u64 dw[2]; + u8 byte[16]; + } dest_gid; + u32 service_level_al; /* 34 */ + u32 send_grh_flag_al; /* 35 */ + u32 retry_count_al; /* 36 */ + u32 timeout_al; /* 37 */ + u32 max_static_rate_al; /* 38 */ + u32 dlid_al; 
/* 39 */ + u32 rnr_retry_count_al; /* 40 */ + u32 source_path_bits_al; /* 41 */ + u32 traffic_class_al; /* 42 */ + u32 hop_limit_al; /* 43 */ + u32 source_gid_idx_al; /* 44 */ + u32 flow_label_al; /* 45 */ + u32 reserved_46; /* 46 */ + u32 reserved_47; /* 47 */ + union { /* 48 */ + u64 dw[2]; + u8 byte[16]; + } dest_gid_al; + u32 max_nr_outst_send_wr; /* 52 */ + u32 max_nr_outst_recv_wr; /* 53 */ + u32 disable_ete_credit_check; /* 54 */ + u32 qp_number; /* 55 */ + u64 send_queue_handle; /* 56 */ + u64 recv_queue_handle; /* 58 */ + u32 actual_nr_sges_in_sq_wqe; /* 60 */ + u32 actual_nr_sges_in_rq_wqe; /* 61 */ + u32 qp_enable; /* 62 */ + u32 curr_srq_limit; /* 63 */ + u64 qp_aff_asyn_ev_log_reg; /* 64 */ + u64 shared_rq_hndl; /* 66 */ + u64 trigg_doorbell_qp_hndl; /* 68 */ + u32 reserved_70_127[58]; /* 70 */ +}; + +#define MQPCB_MASK_QKEY EHCA_BMASK_IBM(0,0) +#define MQPCB_MASK_SEND_PSN EHCA_BMASK_IBM(2,2) +#define MQPCB_MASK_RECEIVE_PSN EHCA_BMASK_IBM(3,3) +#define MQPCB_MASK_PRIM_PHYS_PORT EHCA_BMASK_IBM(4,4) +#define MQPCB_PRIM_PHYS_PORT EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_ALT_PHYS_PORT EHCA_BMASK_IBM(5,5) +#define MQPCB_MASK_PRIM_P_KEY_IDX EHCA_BMASK_IBM(6,6) +#define MQPCB_PRIM_P_KEY_IDX EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_ALT_P_KEY_IDX EHCA_BMASK_IBM(7,7) +#define MQPCB_MASK_RDMA_ATOMIC_CTRL EHCA_BMASK_IBM(8,8) +#define MQPCB_MASK_QP_STATE EHCA_BMASK_IBM(9,9) +#define MQPCB_QP_STATE EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES EHCA_BMASK_IBM(11,11) +#define MQPCB_MASK_PATH_MIGRATION_STATE EHCA_BMASK_IBM(12,12) +#define MQPCB_MASK_RDMA_ATOMIC_OUTST_DEST_QP EHCA_BMASK_IBM(13,13) +#define MQPCB_MASK_DEST_QP_NR EHCA_BMASK_IBM(14,14) +#define MQPCB_MASK_MIN_RNR_NAK_TIMER_FIELD EHCA_BMASK_IBM(15,15) +#define MQPCB_MASK_SERVICE_LEVEL EHCA_BMASK_IBM(16,16) +#define MQPCB_MASK_SEND_GRH_FLAG EHCA_BMASK_IBM(17,17) +#define MQPCB_MASK_RETRY_COUNT EHCA_BMASK_IBM(18,18) +#define MQPCB_MASK_TIMEOUT EHCA_BMASK_IBM(19,19) +#define 
MQPCB_MASK_PATH_MTU EHCA_BMASK_IBM(20,20) +#define MQPCB_PATH_MTU EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_MAX_STATIC_RATE EHCA_BMASK_IBM(21,21) +#define MQPCB_MAX_STATIC_RATE EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_DLID EHCA_BMASK_IBM(22,22) +#define MQPCB_DLID EHCA_BMASK_IBM(16,31) +#define MQPCB_MASK_RNR_RETRY_COUNT EHCA_BMASK_IBM(23,23) +#define MQPCB_RNR_RETRY_COUNT EHCA_BMASK_IBM(29,31) +#define MQPCB_MASK_SOURCE_PATH_BITS EHCA_BMASK_IBM(24,24) +#define MQPCB_SOURCE_PATH_BITS EHCA_BMASK_IBM(25,31) +#define MQPCB_MASK_TRAFFIC_CLASS EHCA_BMASK_IBM(25,25) +#define MQPCB_TRAFFIC_CLASS EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_HOP_LIMIT EHCA_BMASK_IBM(26,26) +#define MQPCB_HOP_LIMIT EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_SOURCE_GID_IDX EHCA_BMASK_IBM(27,27) +#define MQPCB_SOURCE_GID_IDX EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_FLOW_LABEL EHCA_BMASK_IBM(28,28) +#define MQPCB_FLOW_LABEL EHCA_BMASK_IBM(12,31) +#define MQPCB_MASK_DEST_GID EHCA_BMASK_IBM(30,30) +#define MQPCB_MASK_SERVICE_LEVEL_AL EHCA_BMASK_IBM(31,31) +#define MQPCB_SERVICE_LEVEL_AL EHCA_BMASK_IBM(28,31) +#define MQPCB_MASK_SEND_GRH_FLAG_AL EHCA_BMASK_IBM(32,32) +#define MQPCB_SEND_GRH_FLAG_AL EHCA_BMASK_IBM(31,31) +#define MQPCB_MASK_RETRY_COUNT_AL EHCA_BMASK_IBM(33,33) +#define MQPCB_RETRY_COUNT_AL EHCA_BMASK_IBM(29,31) +#define MQPCB_MASK_TIMEOUT_AL EHCA_BMASK_IBM(34,34) +#define MQPCB_TIMEOUT_AL EHCA_BMASK_IBM(27,31) +#define MQPCB_MASK_MAX_STATIC_RATE_AL EHCA_BMASK_IBM(35,35) +#define MQPCB_MAX_STATIC_RATE_AL EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_DLID_AL EHCA_BMASK_IBM(36,36) +#define MQPCB_DLID_AL EHCA_BMASK_IBM(16,31) +#define MQPCB_MASK_RNR_RETRY_COUNT_AL EHCA_BMASK_IBM(37,37) +#define MQPCB_RNR_RETRY_COUNT_AL EHCA_BMASK_IBM(29,31) +#define MQPCB_MASK_SOURCE_PATH_BITS_AL EHCA_BMASK_IBM(38,38) +#define MQPCB_SOURCE_PATH_BITS_AL EHCA_BMASK_IBM(25,31) +#define MQPCB_MASK_TRAFFIC_CLASS_AL EHCA_BMASK_IBM(39,39) +#define MQPCB_TRAFFIC_CLASS_AL EHCA_BMASK_IBM(24,31) +#define 
MQPCB_MASK_HOP_LIMIT_AL EHCA_BMASK_IBM(40,40) +#define MQPCB_HOP_LIMIT_AL EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_SOURCE_GID_IDX_AL EHCA_BMASK_IBM(41,41) +#define MQPCB_SOURCE_GID_IDX_AL EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_FLOW_LABEL_AL EHCA_BMASK_IBM(42,42) +#define MQPCB_FLOW_LABEL_AL EHCA_BMASK_IBM(12,31) +#define MQPCB_MASK_DEST_GID_AL EHCA_BMASK_IBM(44,44) +#define MQPCB_MASK_MAX_NR_OUTST_SEND_WR EHCA_BMASK_IBM(45,45) +#define MQPCB_MAX_NR_OUTST_SEND_WR EHCA_BMASK_IBM(16,31) +#define MQPCB_MASK_MAX_NR_OUTST_RECV_WR EHCA_BMASK_IBM(46,46) +#define MQPCB_MAX_NR_OUTST_RECV_WR EHCA_BMASK_IBM(16,31) +#define MQPCB_MASK_DISABLE_ETE_CREDIT_CHECK EHCA_BMASK_IBM(47,47) +#define MQPCB_DISABLE_ETE_CREDIT_CHECK EHCA_BMASK_IBM(31,31) +#define MQPCB_QP_NUMBER EHCA_BMASK_IBM(8,31) +#define MQPCB_MASK_QP_ENABLE EHCA_BMASK_IBM(48,48) +#define MQPCB_QP_ENABLE EHCA_BMASK_IBM(31,31) +#define MQPCB_MASK_CURR_SQR_LIMIT EHCA_BMASK_IBM(49,49) +#define MQPCB_CURR_SQR_LIMIT EHCA_BMASK_IBM(15,31) +#define MQPCB_MASK_QP_AFF_ASYN_EV_LOG_REG EHCA_BMASK_IBM(50,50) +#define MQPCB_MASK_SHARED_RQ_HNDL EHCA_BMASK_IBM(51,51) + +#endif /* __EHCA_CLASSES_PSERIES_H__ */ From schihei at de.ibm.com Thu Apr 27 03:48:22 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 27 Apr 2006 12:48:22 +0200 Subject: [openib-general] [PATCH 04/16] ehca: userspace support Message-ID: <4450A176.9000008@de.ibm.com> Signed-off-by: Heiko J Schick ehca_uverbs.c | 409 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 409 insertions(+) --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_uverbs.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_uverbs.c 2006-04-24 15:12:03.000000000 +0200 @@ -0,0 +1,409 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * userspace support verbs + * + * Authors: Christoph Raisch + * Hoang-Nam Nguyen + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All 
rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_uverbs.c,v 1.18 2006/04/24 13:12:03 schickhj Exp $ + */ + +#undef DEB_PREFIX +#define DEB_PREFIX "uver" + +#include + +#include "ehca_classes.h" +#include "ehca_iverbs.h" +#include "ehca_kernel.h" +#include "ehca_mrmw.h" +#include "ehca_tools.h" +#include "hcp_if.h" + +struct ib_ucontext *ehca_alloc_ucontext(struct ib_device *device, + struct ib_udata *udata) +{ + struct ehca_ucontext *my_context = NULL; + + EHCA_CHECK_ADR_P(device); + EDEB_EN(7, "device=%p name=%s", device, device->name); + + my_context = kzalloc(sizeof *my_context, GFP_KERNEL); + if (NULL == my_context) { + EDEB_ERR(4, "Out of memory device=%p", device); + return ERR_PTR(-ENOMEM); + } + + EDEB_EX(7, "device=%p ucontext=%p", device, my_context); + + return &my_context->ib_ucontext; +} + +int ehca_dealloc_ucontext(struct ib_ucontext *context) +{ + struct ehca_ucontext *my_context = NULL; + EHCA_CHECK_ADR(context); + EDEB_EN(7, "ucontext=%p", context); + my_context = container_of(context, struct ehca_ucontext, ib_ucontext); + kfree(my_context); + EDEB_EN(7, "ucontext=%p", context); + return 0; +} + +struct page *ehca_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) +{ + struct page *mypage = NULL; + u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; + u32 idr_handle = fileoffset >> 32; + u32 q_type = (fileoffset >> 28) & 0xF; /* CQ, QP,... 
*/ + u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ + u32 cur_pid = current->tgid; + unsigned long flags; + + EDEB_EN(7, + "vm_start=%lx vm_end=%lx vm_page_prot=%lx vm_fileoff=%lx " + "address=%lx", + vma->vm_start, vma->vm_end, vma->vm_page_prot, fileoffset, + address); + + if (q_type == 1) { /* CQ */ + struct ehca_cq *cq = NULL; + u64 offset; + void *vaddr = NULL; + + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + cq = idr_find(&ehca_cq_idr, idr_handle); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + + /* make sure this mmap really belongs to the authorized user */ + if (cq == 0) { + EDEB_ERR(4, "cq is NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } + + if (cq->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, cq->ownpid); + return NOPAGE_SIGBUS; + } + if (rsrc_type == 2) { + EDEB(6, "cq=%p cq queuearea", cq); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&cq->ipz_queue, offset); + EDEB(6, "offset=%lx vaddr=%p", offset, vaddr); + mypage = virt_to_page(vaddr); + } + } else if (q_type == 2) { /* QP */ + struct ehca_qp *qp = NULL; + struct ehca_pd *pd = NULL; + u64 offset; + void *vaddr = NULL; + + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + qp = idr_find(&ehca_qp_idr, idr_handle); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + + + /* make sure this mmap really belongs to the authorized user */ + if (qp == NULL) { + EDEB_ERR(4, "qp is NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } + + pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (pd->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, pd->ownpid); + return NOPAGE_SIGBUS; + } + if (rsrc_type == 2) { /* rqueue */ + EDEB(6, "qp=%p qp rqueuearea", qp); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&qp->ipz_rqueue, offset); + EDEB(6, "offset=%lx vaddr=%p", offset, vaddr); + mypage = virt_to_page(vaddr); + } else if (rsrc_type == 3) { /* squeue */ + EDEB(6, "qp=%p qp 
squeuearea", qp); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&qp->ipz_squeue, offset); + EDEB(6, "offset=%lx vaddr=%p", offset, vaddr); + mypage = virt_to_page(vaddr); + } + } + + if (mypage == NULL) { + EDEB_ERR(4, "Invalid page adr==NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } + get_page(mypage); + EDEB_EX(7, "page adr=%p", mypage); + return mypage; +} + +static struct vm_operations_struct ehcau_vm_ops = { + .nopage = ehca_nopage, +}; + +int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; + + + u32 idr_handle = fileoffset >> 32; + u32 q_type = (fileoffset >> 28) & 0xF; /* CQ, QP,... */ + u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ + int ret = -EFAULT; /* assume the worst */ + u64 vsize = 0; /* must be calculated/set below */ + u64 physical = 0; /* must be calculated/set below */ + u32 cur_pid = current->tgid; + unsigned long flags; + + EDEB_EN(7, "vm_start=%lx vm_end=%lx vm_page_prot=%lx vm_fileoff=%lx", + vma->vm_start, vma->vm_end, vma->vm_page_prot, fileoffset); + + if (q_type == 1) { /* CQ */ + struct ehca_cq *cq; + + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + cq = idr_find(&ehca_cq_idr, idr_handle); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + + /* make sure this mmap really belongs to the authorized user */ + if (cq == 0) + return -EINVAL; + + if (cq->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, cq->ownpid); + return -ENOMEM; + } + if (cq->ib_cq.uobject == 0) + return -EINVAL; + if (cq->ib_cq.uobject->context != context) + return -EINVAL; + if (rsrc_type == 1) { /* galpa fw handle */ + EDEB(6, "cq=%p cq triggerarea", cq); + vma->vm_flags |= VM_RESERVED; + vsize = vma->vm_end - vma->vm_start; + if (vsize != 4096) { + EDEB_ERR(4, "invalid vsize=%lx", + vma->vm_end - vma->vm_start); + ret = -EINVAL; + goto mmap_exit0; + } + + physical = cq->galpas.user.fw_handle; + vma->vm_page_prot = 
pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_IO | VM_RESERVED; + + EDEB(6, "vsize=%lx physical=%lx", vsize, + physical); + ret = + remap_pfn_range(vma, vma->vm_start, + physical >> PAGE_SHIFT, vsize, + vma->vm_page_prot); + if (ret != 0) { + EDEB_ERR(4, + "Error: remap_pfn_range() returned %x!", + ret); + ret = -ENOMEM; + } + goto mmap_exit0; + } else if (rsrc_type == 2) { /* cq queue_addr */ + EDEB(6, "cq=%p cq q_addr", cq); + /* vma->vm_page_prot = + * pgprot_noncached(vma->vm_page_prot); */ + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + ret = 0; + goto mmap_exit0; + } else { + EDEB_ERR(6, "bad resource type %x", rsrc_type); + ret = -EINVAL; + goto mmap_exit0; + } + } else if (q_type == 2) { /* QP */ + struct ehca_qp *qp = NULL; + struct ehca_pd *pd = NULL; + + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + qp = idr_find(&ehca_qp_idr, idr_handle); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + + + /* make sure this mmap really belongs to the authorized user */ + if (qp == NULL || qp->ib_qp.uobject == NULL || + qp->ib_qp.uobject->context != context) { + EDEB(6, "qp=%p, context=%p", + qp, context); + ret = -EINVAL; + goto mmap_exit0; + } + + pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (pd->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, pd->ownpid); + return -ENOMEM; + } + if (rsrc_type == 1) { /* galpa fw handle */ + EDEB(6, "qp=%p qp triggerarea", qp); + vma->vm_flags |= VM_RESERVED; + vsize = vma->vm_end - vma->vm_start; + if (vsize != 4096) { + EDEB_ERR(4, "invalid vsize=%lx", + vma->vm_end - vma->vm_start); + ret = -EINVAL; + goto mmap_exit0; + } + + physical = qp->galpas.user.fw_handle; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_IO | VM_RESERVED; + + EDEB(6, "vsize=%lx physical=%lx", vsize, + physical); + ret = + remap_pfn_range(vma, vma->vm_start, + physical >> PAGE_SHIFT, vsize, + 
vma->vm_page_prot); + if (ret != 0) { + EDEB_ERR(4, + "Error: remap_pfn_range() returned %x!", + ret); + ret = -ENOMEM; + } + goto mmap_exit0; + } else if (rsrc_type == 2) { /* qp rqueue_addr */ + EDEB(6, "qp=%p qp rqueue_addr", qp); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + ret = 0; + goto mmap_exit0; + } else if (rsrc_type == 3) { /* qp squeue_addr */ + EDEB(6, "qp=%p qp squeue_addr", qp); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + ret = 0; + goto mmap_exit0; + } else { + EDEB_ERR(4, "bad resource type %x", + rsrc_type); + ret = -EINVAL; + goto mmap_exit0; + } + } else { + EDEB_ERR(4, "bad queue type %x", q_type); + ret = -EINVAL; + goto mmap_exit0; + } + +mmap_exit0: + EDEB_EX(7, "ret=%x", ret); + return ret; +} + +int ehca_mmap_nopage(u64 foffset,u64 length,void ** mapped,struct vm_area_struct ** vma) +{ + EDEB_EN(7, "foffset=%lx length=%lx", foffset, length); + down_write(&current->mm->mmap_sem); + *mapped=(void*) + do_mmap(NULL,0, + length, + PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, + foffset); + up_write(&current->mm->mmap_sem); + if (*mapped) { + *vma = find_vma(current->mm,(u64)*mapped); + if (*vma) { + (*vma)->vm_flags |= VM_RESERVED; + (*vma)->vm_ops = &ehcau_vm_ops; + } else { + EDEB_ERR(4,"couldn't find queue vma queue=%p", + *mapped); + } + } else { + EDEB_ERR(4,"couldn't create mmap length=%lx",length); + } + EDEB_EX(7, "mapped=%p", *mapped); + return 0; +} + +int ehca_mmap_register(u64 physical,void ** mapped,struct vm_area_struct ** vma) +{ + int ret; + unsigned long vsize; + ehca_mmap_nopage(0,4096,mapped,vma); + (*vma)->vm_flags |= VM_RESERVED; + vsize = (*vma)->vm_end - (*vma)->vm_start; + if (vsize != 4096) { + EDEB_ERR(4, "invalid vsize=%lx", + (*vma)->vm_end - (*vma)->vm_start); + ret = -EINVAL; + return ret; + } + + (*vma)->vm_page_prot = pgprot_noncached((*vma)->vm_page_prot); + (*vma)->vm_flags |= VM_IO | VM_RESERVED; + + EDEB(6, "vsize=%lx physical=%lx", vsize, + physical); + ret = + remap_pfn_range((*vma), 
(*vma)->vm_start, + physical >> PAGE_SHIFT, vsize, + (*vma)->vm_page_prot); + if (ret != 0) { + EDEB_ERR(4, + "Error: remap_pfn_range() returned %x!", + ret); + ret = -ENOMEM; + } + return ret; + +} + +int ehca_munmap(unsigned long addr, size_t len) { + int ret=0; + struct mm_struct *mm = current->mm; + if (mm!=0) { + down_write(&mm->mmap_sem); + ret = do_munmap(mm, addr, len); + up_write(&mm->mmap_sem); + } + return ret; +} From schihei at de.ibm.com Thu Apr 27 03:48:29 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 27 Apr 2006 12:48:29 +0200 Subject: [openib-general] [PATCH 05/16] ehca: InfiniBand query and multicast functionality Message-ID: <4450A17D.4030708@de.ibm.com> Signed-off-by: Heiko J Schick ehca_hca.c | 286 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ehca_mcast.c | 198 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 484 insertions(+) --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_hca.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_hca.c 2006-04-25 09:32:54.000000000 +0200 @@ -0,0 +1,286 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * HCA query functions + * + * Authors: Heiko J Schick + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_hca.c,v 1.13 2006/04/25 07:32:54 schickhj Exp $ + */ + +#undef DEB_PREFIX +#define DEB_PREFIX "shca" + +#include "ehca_kernel.h" +#include "ehca_tools.h" + +#include "hcp_if.h" + +#define TO_MAX_INT(dest, src) \ + if (src >= INT_MAX) \ + dest = INT_MAX; \ + else \ + dest = src + +int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) +{ + int ret = 0; + struct ehca_shca *shca; + struct hipz_query_hca *rblock; + + EDEB_EN(7, ""); + + memset(props, 0, sizeof(struct ib_device_attr)); + shca = container_of(ibdev, struct ehca_shca, ib_device); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto query_device0; + } + + if (hipz_h_query_hca(shca->ipz_hca_handle, rblock) != H_SUCCESS) { + EDEB_ERR(4, "Can't query device properties"); + ret = -EINVAL; + goto query_device1; + } + props->fw_ver = rblock->hw_ver; + props->max_mr_size = rblock->max_mr_size; + props->vendor_id = rblock->vendor_id >> 8; + props->vendor_part_id = rblock->vendor_part_id >> 16; + props->hw_ver = rblock->hw_ver; + TO_MAX_INT(props->max_qp, (rblock->max_qp - rblock->cur_qp)); + 
TO_MAX_INT(props->max_qp_wr, rblock->max_wqes_wq); + props->max_sge = rblock->max_sge; + props->max_sge_rd = rblock->max_sge_rd; + TO_MAX_INT(props->max_cq, (rblock->max_cq - rblock->cur_cq)); + props->max_cqe = rblock->max_cqe; + TO_MAX_INT(props->max_mr, (rblock->max_mr - rblock->cur_mr)); + TO_MAX_INT(props->max_pd, rblock->max_pd); + props->max_mw = rblock->max_mw; + TO_MAX_INT(props->max_mw, (rblock->max_mw - rblock->cur_mw)); + props->max_raw_ipv6_qp = rblock->max_raw_ipv6_qp; + props->max_raw_ethy_qp = rblock->max_raw_ethy_qp; + props->max_mcast_grp = rblock->max_mcast_grp; + props->max_mcast_qp_attach = rblock->max_qps_attached_mcast_grp; + props->max_total_mcast_qp_attach = rblock->max_qps_attached_all_mcast_grp; + TO_MAX_INT(props->max_ah, rblock->max_ah); + props->max_fmr = rblock->max_mr; + props->max_srq = 0; + props->max_srq_wr = 0; + props->max_srq_sge = 0; + props->max_pkeys = 16; + props->local_ca_ack_delay = rblock->local_ca_ack_delay; + +query_device1: + kfree(rblock); + +query_device0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +int ehca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + int ret = 0; + struct ehca_shca *shca; + struct hipz_query_port *rblock; + + EDEB_EN(7, "port=%x", port); + + memset(props, 0, sizeof(struct ib_port_attr)); + shca = container_of(ibdev, struct ehca_shca, ib_device); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto query_port0; + } + + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { + EDEB_ERR(4, "Can't query port properties"); + ret = -EINVAL; + goto query_port1; + } + + props->state = rblock->state; + + switch (rblock->max_mtu) { + case 0x1: + props->active_mtu = props->max_mtu = IB_MTU_256; + break; + case 0x2: + props->active_mtu = props->max_mtu = IB_MTU_512; + break; + case 0x3: + props->active_mtu = props->max_mtu = IB_MTU_1024; + break; + case 0x4: + 
props->active_mtu = props->max_mtu = IB_MTU_2048; + break; + case 0x5: + props->active_mtu = props->max_mtu = IB_MTU_4096; + break; + default: + EDEB_ERR(4, "Unknown MTU size: %x.", rblock->max_mtu); + } + + props->gid_tbl_len = rblock->gid_tbl_len; + props->max_msg_sz = rblock->max_msg_sz; + props->bad_pkey_cntr = rblock->bad_pkey_cntr; + props->qkey_viol_cntr = rblock->qkey_viol_cntr; + props->pkey_tbl_len = rblock->pkey_tbl_len; + props->lid = rblock->lid; + props->sm_lid = rblock->sm_lid; + props->lmc = rblock->lmc; + props->sm_sl = rblock->sm_sl; + props->subnet_timeout = rblock->subnet_timeout; + props->init_type_reply = rblock->init_type_reply; + + props->active_width = IB_WIDTH_12X; + props->active_speed = 0x1; + +query_port1: + kfree(rblock); + +query_port0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey) +{ + int ret = 0; + struct ehca_shca *shca; + struct hipz_query_port *rblock; + + EDEB_EN(7, "port=%x index=%x", port, index); + + if (index >= 16) { + EDEB_ERR(4, "Invalid index: %x.", index); + ret = -EINVAL; + goto query_pkey0; + } + + shca = container_of(ibdev, struct ehca_shca, ib_device); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto query_pkey0; + } + + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { + EDEB_ERR(4, "Can't query port properties"); + ret = -EINVAL; + goto query_pkey1; + } + + memcpy(pkey, &rblock->pkey_entries[index], sizeof(u16)); + +query_pkey1: + kfree(rblock); + +query_pkey0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +int ehca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + int ret = 0; + struct ehca_shca *shca; + struct hipz_query_port *rblock; + + EDEB_EN(7, "port=%x index=%x", port, index); + + if (index > 255) { + EDEB_ERR(4, "Invalid index: %x.", index); + ret = -EINVAL; + goto query_gid0; + 
} + + shca = container_of(ibdev, struct ehca_shca, ib_device); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto query_gid0; + } + + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { + EDEB_ERR(4, "Can't query port properties"); + ret = -EINVAL; + goto query_gid1; + } + + memcpy(&gid->raw[0], &rblock->gid_prefix, sizeof(u64)); + memcpy(&gid->raw[8], &rblock->guid_entries[index], sizeof(u64)); + +query_gid1: + kfree(rblock); + +query_gid0: + EDEB_EX(7, "ret=%x GID=%lx%lx", ret, + *(u64 *) & gid->raw[0], + *(u64 *) & gid->raw[8]); + + return ret; +} + +int ehca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + int ret = 0; + + EDEB_EN(7, "port=%x", port); + + /* Not implemented yet. */ + + EDEB_EX(7, "ret=%x", ret); + + return ret; +} --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_mcast.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_mcast.c 2006-04-04 23:52:30.000000000 +0200 @@ -0,0 +1,198 @@ + +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * mcast functions + * + * Authors: Khadija Souissi + * Waleri Fomin + * Reinhard Ernst + * Hoang-Nam Nguyen + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. 
+ * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_mcast.c,v 1.9 2006/04/04 21:52:30 nguyen Exp $ + */ + +#define DEB_PREFIX "mcas" + +#include +#include +#include "ehca_kernel.h" +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "ehca_qes.h" +#include "ehca_iverbs.h" + +#include "hcp_if.h" + +#define MAX_MC_LID 0xFFFE +#define MIN_MC_LID 0xC000 /* Multicast limits */ +#define EHCA_VALID_MULTICAST_GID(gid) ((gid)[0] == 0xFF) +#define EHCA_VALID_MULTICAST_LID(lid) (((lid) >= MIN_MC_LID) && ((lid) <= MAX_MC_LID)) + +int ehca_attach_mcast(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + union ib_gid my_gid; + u64 hipz_rc = H_SUCCESS; + int retcode = 0; + + EHCA_CHECK_ADR(ibqp); + EHCA_CHECK_ADR(gid); + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + + EHCA_CHECK_QP(my_qp); + if (ibqp->qp_type != IB_QPT_UD) { + EDEB_ERR(4, "invalid qp_type %x gid, retcode=%x", + ibqp->qp_type, EINVAL); + return (-EINVAL); + } + + shca = container_of(ibqp->pd->device, struct ehca_shca, ib_device); + EHCA_CHECK_ADR(shca); + + if (!(EHCA_VALID_MULTICAST_GID(gid->raw))) { + EDEB_ERR(4, "gid is not valid multicast gid retcode=%x", + EINVAL); + return (-EINVAL); + } else if ((lid < MIN_MC_LID) || (lid > MAX_MC_LID)) { + EDEB_ERR(4, "lid=%x is not valid multicast lid retcode=%x", + lid, EINVAL); + return (-EINVAL); + } + + memcpy(&my_gid.raw, gid->raw, sizeof(union ib_gid)); + + hipz_rc = hipz_h_attach_mcqp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + my_qp->galpas.kernel, + lid, my_gid.global.subnet_prefix, + my_gid.global.interface_id); + if (H_SUCCESS != hipz_rc) { + EDEB_ERR(4, + "ehca_qp=%p qp_num=%x hipz_h_attach_mcqp() failed " + "hipz_rc=%lx", my_qp, ibqp->qp_num, hipz_rc); + } + retcode = ehca2ib_return_code(hipz_rc); + + EDEB_EX(7, "mcast attach retcode=%x\n" + "ehca_qp=%p qp_num=%x lid=%x\n" + "my_gid= %x %x %x %x\n" + " %x %x %x %x\n" + " %x %x %x %x\n" + " %x %x %x %x\n", + retcode, my_qp, ibqp->qp_num, lid, + 
my_gid.raw[0], my_gid.raw[1], + my_gid.raw[2], my_gid.raw[3], + my_gid.raw[4], my_gid.raw[5], + my_gid.raw[6], my_gid.raw[7], + my_gid.raw[8], my_gid.raw[9], + my_gid.raw[10], my_gid.raw[11], + my_gid.raw[12], my_gid.raw[13], + my_gid.raw[14], my_gid.raw[15]); + + return retcode; +} + +int ehca_detach_mcast(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + union ib_gid my_gid; + u64 hipz_rc = H_SUCCESS; + int retcode = 0; + + EHCA_CHECK_ADR(ibqp); + EHCA_CHECK_ADR(gid); + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + + EHCA_CHECK_QP(my_qp); + if (ibqp->qp_type != IB_QPT_UD) { + EDEB_ERR(4, "invalid qp_type %x gid, retcode=%x", + ibqp->qp_type, EINVAL); + return (-EINVAL); + } + + shca = container_of(ibqp->pd->device, struct ehca_shca, ib_device); + EHCA_CHECK_ADR(shca); + + if (!(EHCA_VALID_MULTICAST_GID(gid->raw))) { + EDEB_ERR(4, "gid is not valid multicast gid retcode=%x", + EINVAL); + return (-EINVAL); + } else if ((lid < MIN_MC_LID) || (lid > MAX_MC_LID)) { + EDEB_ERR(4, "lid=%x is not valid multicast lid retcode=%x", + lid, EINVAL); + return (-EINVAL); + } + + EDEB_EN(7, "dgid=%p qp_num=%x lid=%x", + gid, ibqp->qp_num, lid); + + memcpy(&my_gid.raw, gid->raw, sizeof(union ib_gid)); + + hipz_rc = hipz_h_detach_mcqp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + my_qp->galpas.kernel, + lid, my_gid.global.subnet_prefix, + my_gid.global.interface_id); + if (H_SUCCESS != hipz_rc) { + EDEB_ERR(4, + "ehca_qp=%p qp_num=%x hipz_h_detach_mcqp() failed " + "hipz_rc=%lx", my_qp, ibqp->qp_num, hipz_rc); + } + retcode = ehca2ib_return_code(hipz_rc); + + EDEB_EX(7, "mcast detach retcode=%x\n" + "ehca_qp=%p qp_num=%x lid=%x\n" + "my_gid= %x %x %x %x\n" + " %x %x %x %x\n" + " %x %x %x %x\n" + " %x %x %x %x\n", + retcode, my_qp, ibqp->qp_num, lid, + my_gid.raw[0], my_gid.raw[1], + my_gid.raw[2], my_gid.raw[3], + my_gid.raw[4], my_gid.raw[5], + my_gid.raw[6], my_gid.raw[7], + my_gid.raw[8], 
my_gid.raw[9], + my_gid.raw[10], my_gid.raw[11], + my_gid.raw[12], my_gid.raw[13], + my_gid.raw[14], my_gid.raw[15]); + + return retcode; +} From schihei at de.ibm.com Thu Apr 27 03:48:35 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 27 Apr 2006 12:48:35 +0200 Subject: [openib-general] [PATCH 06/16] ehca: common include files Message-ID: <4450A183.6030405@de.ibm.com> Signed-off-by: Heiko J Schick ehca_iverbs.h | 183 +++++++++++++++++++++++++++ ehca_kernel.h | 162 ++++++++++++++++++++++++ ehca_tools.h | 387 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 732 insertions(+) --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_iverbs.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_iverbs.h 2006-04-04 23:51:52.000000000 +0200 @@ -0,0 +1,183 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Function definitions for internal functions + * + * Authors: Heiko J Schick + * Dietmar Decker + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_iverbs.h,v 1.8 2006/04/04 21:51:52 nguyen Exp $ + */ + +#ifndef __EHCA_IVERBS_H__ +#define __EHCA_IVERBS_H__ + +#include "ehca_classes.h" + +int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props); + +int ehca_query_port(struct ib_device *ibdev, u8 port, + struct ib_port_attr *props); + +int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 * pkey); + +int ehca_query_gid(struct ib_device *ibdev, u8 port, int index, + union ib_gid *gid); + +int ehca_modify_port(struct ib_device *ibdev, u8 port, int port_modify_mask, + struct ib_port_modify *props); + +struct ib_pd *ehca_alloc_pd(struct ib_device *device, + struct ib_ucontext *context, + struct ib_udata *udata); + +int ehca_dealloc_pd(struct ib_pd *pd); + +struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); + +int ehca_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +int ehca_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +int ehca_destroy_ah(struct ib_ah *ah); + +struct ib_mr *ehca_get_dma_mr(struct ib_pd *pd, int mr_access_flags); + +struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, u64 *iova_start); + +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, + struct ib_umem *region, + int mr_access_flags, struct ib_udata *udata); + +int ehca_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, 
+ struct ib_phys_buf *phys_buf_array, + int num_phys_buf, int mr_access_flags, u64 *iova_start); + +int ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); + +int ehca_dereg_mr(struct ib_mr *mr); + +struct ib_mw *ehca_alloc_mw(struct ib_pd *pd); + +int ehca_bind_mw(struct ib_qp *qp, struct ib_mw *mw, + struct ib_mw_bind *mw_bind); + +int ehca_dealloc_mw(struct ib_mw *mw); + +struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + +int ehca_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, int list_len, u64 iova); + +int ehca_unmap_fmr(struct list_head *fmr_list); + +int ehca_dealloc_fmr(struct ib_fmr *fmr); + +enum ehca_eq_type { + EHCA_EQ = 0, /* Event Queue */ + EHCA_NEQ /* Notification Event Queue */ +}; + +int ehca_create_eq(struct ehca_shca *shca, struct ehca_eq *eq, + enum ehca_eq_type type, const u32 length); + +int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq); + +void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq); + + +struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, + struct ib_ucontext *context, + struct ib_udata *udata); + +int ehca_destroy_cq(struct ib_cq *cq); + +int ehca_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata); + +int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc); + +int ehca_peek_cq(struct ib_cq *cq, int wc_cnt); + +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify); + +struct ib_qp *ehca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata); + +int ehca_destroy_qp(struct ib_qp *qp); + +int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); + +int ehca_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, + int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr); + +int ehca_post_send(struct ib_qp *qp, struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + +int ehca_post_recv(struct ib_qp *qp, struct ib_recv_wr 
*recv_wr, + struct ib_recv_wr **bad_recv_wr); + +u64 ehca_define_sqp(struct ehca_shca *shca, struct ehca_qp *ibqp, + struct ib_qp_init_attr *qp_init_attr); + +int ehca_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +int ehca_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +struct ib_ucontext *ehca_alloc_ucontext(struct ib_device *device, + struct ib_udata *udata); + +int ehca_dealloc_ucontext(struct ib_ucontext *context); + +int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma); + +void ehca_poll_eqs(unsigned long data); + +int ehca_mmap_nopage(u64 foffset,u64 length,void **mapped, + struct vm_area_struct **vma); + +int ehca_mmap_register(u64 physical,void **mapped, + struct vm_area_struct **vma); + +int ehca_munmap(unsigned long addr, size_t len); + +#endif --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_kernel.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_kernel.h 2006-04-03 08:40:54.000000000 +0200 @@ -0,0 +1,162 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Generalized functions for code shared between kernel and userspace + * + * Authors: Christoph Raisch + * Hoang-Nam Nguyen + * Khadija Souissi + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_kernel.h,v 1.13 2006/04/03 06:40:54 schickhj Exp $ + */ + +#ifndef _EHCA_KERNEL_H_ +#define _EHCA_KERNEL_H_ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +/** + * ehca_adr_bad - Handle to be used for address translation mechanisms, + * currently a placeholder. + */ +inline static int ehca_adr_bad(void *adr) +{ + return (adr == 0); +}; + +/* We will remove these lines in SVN when they are included in the Linux kernel. + * We don't want to introduce unnecessary dependencies on a patched kernel. 
+ */ +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,17) +#include +#define H_SUCCESS 0 +#define H_BUSY 1 +#define H_CONSTRAINED 4 +#define H_LONG_BUSY_ORDER_1_MSEC 9900 +#define H_LONG_BUSY_ORDER_10_MSEC 9901 +#define H_LONG_BUSY_ORDER_100_MSEC 9902 +#define H_LONG_BUSY_ORDER_1_SEC 9903 +#define H_LONG_BUSY_ORDER_10_SEC 9904 +#define H_LONG_BUSY_ORDER_100_SEC 9905 + +#define H_IS_LONG_BUSY(x) ((x >= H_LongBusyStartRange) && (x <= H_LongBusyEndRange)) + +#define H_PARTIAL_STORE 16 +#define H_PAGE_REGISTERED 15 +#define H_IN_PROGRESS 14 +#define H_PARTIAL 5 +#define H_NOT_AVAILABLE 3 +#define H_Closed 2 + +#define H_HARDWARE -1 +#define H_PARAMETER -4 +#define H_NO_MEM -9 +#define H_RESOURCE -16 + +#define H_ADAPTER_PARM -17 +#define H_RH_PARM -18 +#define H_RCQ_PARM -19 +#define H_SCQ_PARM -20 +#define H_EQ_PARM -21 +#define H_RT_PARM -22 +#define H_ST_PARM -23 +#define H_SIGT_PARM -24 +#define H_TOKEN_PARM -25 +#define H_MLENGTH_PARM -27 +#define H_MEM_PARM -28 +#define H_MEM_ACCESS_PARM -29 +#define H_ATTR_PARM -30 +#define H_PORT_PARM -31 +#define H_MCG_PARM -32 +#define H_VL_PARM -33 +#define H_TSIZE_PARM -34 +#define H_TRACE_PARM -35 +#define H_MASK_PARM -37 +#define H_MCG_FULL -38 +#define H_ALIAS_EXIST -39 +#define H_P_COUNTER -40 +#define H_TABLE_FULL -41 +#define H_ALT_TABLE -42 +#define H_MR_CONDITION -43 +#define H_NOT_ENOUGH_RESOURCES -44 +#define H_R_STATE -45 +#define H_RESCINDEND -46 + +/* H call defines to be moved to kernel */ +#define H_RESET_EVENTS 0x15C +#define H_ALLOC_RESOURCE 0x160 +#define H_FREE_RESOURCE 0x164 +#define H_MODIFY_QP 0x168 +#define H_QUERY_QP 0x16C +#define H_REREGISTER_PMR 0x170 +#define H_REGISTER_SMR 0x174 +#define H_QUERY_MR 0x178 +#define H_QUERY_MW 0x17C +#define H_QUERY_HCA 0x180 +#define H_QUERY_PORT 0x184 +#define H_MODIFY_PORT 0x188 +#define H_DEFINE_AQP1 0x18C +#define H_GET_TRACE_BUFFER 0x190 +#define H_DEFINE_AQP0 0x194 +#define H_RESIZE_MR 0x198 +#define H_ATTACH_MCQP 0x19C +#define H_DETACH_MCQP 0x1A0 +#define 
H_CREATE_RPT 0x1A4 +#define H_REMOVE_RPT 0x1A8 +#define H_REGISTER_RPAGES 0x1AC +#define H_DISABLE_AND_GETC 0x1B0 +#define H_ERROR_DATA 0x1B4 +#define H_GET_HCA_INFO 0x1B8 +#define H_GET_PERF_COUNT 0x1BC +#define H_MANAGE_TRACE 0x1C0 +#define H_QUERY_INT_STATE 0x1E4 +#define H_CB_ALIGNMENT 4096 +#endif /* LINUX_VERSION_CODE */ + +#endif /* _EHCA_KERNEL_H_ */ --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_tools.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_tools.h 2006-03-30 14:36:54.000000000 +0200 @@ -0,0 +1,387 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * auxiliary functions + * + * Authors: Christoph Raisch + * Hoang-Nam Nguyen + * Khadija Souissi + * Waleri Fomin + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_tools.h,v 1.12 2006/03/30 12:36:54 schickhj Exp $ + */ + + +#ifndef EHCA_TOOLS_H +#define EHCA_TOOLS_H + +#define EHCA_EDEB_TRACE_MASK_SIZE 32 +extern u8 ehca_edeb_mask[EHCA_EDEB_TRACE_MASK_SIZE]; +#define EDEB_ID_TO_U32(str4) (str4[3] | (str4[2] << 8) | (str4[1] << 16) | \ + (str4[0] << 24)) + +inline static u64 ehca_edeb_filter(const u32 level, + const u32 id, const u32 line) +{ + u64 ret = 0; + u32 filenr = 0; + u32 filter_level = 9; + u32 dynamic_level = 0; + + /* This code is written for the gcc -O2 optimizer, which should collapse + * it to two single ints. filter_level is the first level kicked out by + * the compiler, which means: trace everything below 6. 
*/ + if (id == EDEB_ID_TO_U32("ehav")) { + filenr = 0x01; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("clas")) { + filenr = 0x02; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("cqeq")) { + filenr = 0x03; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("shca")) { + filenr = 0x05; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("eirq")) { + filenr = 0x06; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("lMad")) { + filenr = 0x07; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("mcas")) { + filenr = 0x08; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("mrmw")) { + filenr = 0x09; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("vpd ")) { + filenr = 0x0a; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("e_qp")) { + filenr = 0x0b; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("uqes")) { + filenr = 0x0c; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("PHYP")) { + filenr = 0x0d; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("hcpi")) { + filenr = 0x0e; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("iptz")) { + filenr = 0x0f; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("spta")) { + filenr = 0x10; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("simp")) { + filenr = 0x11; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("reqs")) { + filenr = 0x12; + filter_level = 8; + } + + if ((filenr - 1) > sizeof(ehca_edeb_mask)) { + filenr = 0; + } + + if (filenr == 0) { + filter_level = 9; + } /* default */ + ret = filenr * 0x10000 + line; + if (filter_level <= level) { + return (ret | 0x100000000L); /* this is the flag to not trace */ + } + dynamic_level = ehca_edeb_mask[filenr]; + if (likely(dynamic_level <= level)) { + ret = ret | 0x100000000L; + }; + return ret; +} + +#ifdef EHCA_USE_HCALL_KERNEL +#ifdef CONFIG_PPC_PSERIES + +#include + +/** + * IS_EDEB_ON - Checks if debug is on for the given level. 
+ */ +#define IS_EDEB_ON(level) \ + ((ehca_edeb_filter(level, EDEB_ID_TO_U32(DEB_PREFIX), __LINE__) & 0x100000000L)==0) + +#define EDEB_P_GENERIC(level,idstring,format,args...) \ +do { \ + u64 ehca_edeb_filterresult = \ + ehca_edeb_filter(level, EDEB_ID_TO_U32(DEB_PREFIX), __LINE__);\ + if ((ehca_edeb_filterresult & 0x100000000L) == 0) \ + printk("PU%04x %08x:%s " idstring " "format "\n", \ + get_paca()->paca_index, (u32)(ehca_edeb_filterresult), \ + __func__, ##args); \ +} while (1==0) + +#elif REAL_HCALL + +#define EDEB_P_GENERIC(level,idstring,format,args...) \ +do { \ + u64 ehca_edeb_filterresult = \ + ehca_edeb_filter(level, EDEB_ID_TO_U32(DEB_PREFIX), __LINE__); \ + if ((ehca_edeb_filterresult & 0x100000000L) == 0) \ + printk("%08x:%s " idstring " "format "\n", \ + (u32)(ehca_edeb_filterresult), \ + __func__, ##args); \ +} while (1==0) + +#endif +#else + +#define IS_EDEB_ON(level) (1) + +#define EDEB_P_GENERIC(level,idstring,format,args...) \ +do { \ + printk("%s " idstring " "format "\n", \ + __func__, ##args); \ +} while (1==0) + +#endif + +/** + * EDEB - Trace output macro. + * @level tracelevel + * @format optional format string, use "" if not desired + * @args printf like arguments for trace, use %Lx for u64, %x for u32 + * %p for pointer + */ +#define EDEB(level,format,args...) \ + EDEB_P_GENERIC(level,"",format,##args) +#define EDEB_ERR(level,format,args...) \ + EDEB_P_GENERIC(level,"HCAD_ERROR ",format,##args) +#define EDEB_EN(level,format,args...) \ + EDEB_P_GENERIC(level,">>>",format,##args) +#define EDEB_EX(level,format,args...) \ + EDEB_P_GENERIC(level,"<<<",format,##args) + +/** + * EDEB macro to dump a memory block, whose length is n*8 bytes. + * Each line has the following layout: + * adr=X ofs=Y <8 bytes hex> <8 bytes hex> + */ +#define EDEB_DMP(level,adr,len,format,args...) 
\ + do { \ + unsigned int x; \ + unsigned int l = (unsigned int)(len); \ + unsigned char *deb = (unsigned char*)(adr); \ + for (x = 0; x < l; x += 16) { \ + EDEB(level, format " adr=%p ofs=%04x %016lx %016lx", \ + ##args, deb, x, *((u64 *)&deb[0]), *((u64 *)&deb[8])); \ + deb += 16; \ + } \ + } while (0) + +/* define a bitmask, little endian version */ +#define EHCA_BMASK(pos,length) (((pos)<<16)+(length)) +/* define a bitmask, the ibm way... */ +#define EHCA_BMASK_IBM(from,to) (((63-to)<<16)+((to)-(from)+1)) +/* internal function, don't use */ +#define EHCA_BMASK_SHIFTPOS(mask) (((mask)>>16)&0xffff) +/* internal function, don't use */ +#define EHCA_BMASK_MASK(mask) (0xffffffffffffffffULL >> ((64-(mask))&0xffff)) +/* return value shifted and masked by mask\n + * variable|=EHCA_BMASK_SET(MY_MASK,0x4711) ORs the bits in variable\n + * variable&=~EHCA_BMASK_SET(MY_MASK,-1) clears the bits from the mask + * in variable + */ +#define EHCA_BMASK_SET(mask,value) \ + ((EHCA_BMASK_MASK(mask) & ((u64)(value)))<<EHCA_BMASK_SHIFTPOS(mask)) + +#define EHCA_BMASK_GET(mask,value) \ + (EHCA_BMASK_MASK(mask) & (((u64)(value))>>EHCA_BMASK_SHIFTPOS(mask))) + +#define PARANOIA_MODE +#ifdef PARANOIA_MODE + +#define EHCA_CHECK_ADR_P(adr) \ + if (unlikely(adr==0)) { \ + EDEB_ERR(4, "adr=%p check failed line %i", adr, \ + __LINE__); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_ADR(adr) \ + if (unlikely(adr==0)) { \ + EDEB_ERR(4, "adr=%p check failed line %i", adr, \ + __LINE__); \ + return -EFAULT; } + +#define EHCA_CHECK_DEVICE_P(device) \ + if (unlikely(device==0)) { \ + EDEB_ERR(4, "device=%p check failed", device); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_DEVICE(device) \ + if (unlikely(device==0)) { \ + EDEB_ERR(4, "device=%p check failed", device); \ + return -EFAULT; } + +#define EHCA_CHECK_PD(pd) \ + if (unlikely(pd==0)) { \ + EDEB_ERR(4, "pd=%p check failed", pd); \ + return -EFAULT; } + +#define EHCA_CHECK_PD_P(pd) \ + if (unlikely(pd==0)) { \ + EDEB_ERR(4, "pd=%p check failed", pd); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_AV(av) \ + if (unlikely(av==0)) { 
\ + EDEB_ERR(4, "av=%p check failed", av); \ + return -EFAULT; } + +#define EHCA_CHECK_AV_P(av) \ + if (unlikely(av==0)) { \ + EDEB_ERR(4, "av=%p check failed", av); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_CQ(cq) \ + if (unlikely(cq==0)) { \ + EDEB_ERR(4, "cq=%p check failed", cq); \ + return -EFAULT; } + +#define EHCA_CHECK_CQ_P(cq) \ + if (unlikely(cq==0)) { \ + EDEB_ERR(4, "cq=%p check failed", cq); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_EQ(eq) \ + if (unlikely(eq==0)) { \ + EDEB_ERR(4, "eq=%p check failed", eq); \ + return -EFAULT; } + +#define EHCA_CHECK_EQ_P(eq) \ + if (unlikely(eq==0)) { \ + EDEB_ERR(4, "eq=%p check failed", eq); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_QP(qp) \ + if (unlikely(qp==0)) { \ + EDEB_ERR(4, "qp=%p check failed", qp); \ + return -EFAULT; } + +#define EHCA_CHECK_QP_P(qp) \ + if (unlikely(qp==0)) { \ + EDEB_ERR(4, "qp=%p check failed", qp); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_MR(mr) \ + if (unlikely(mr==0)) { \ + EDEB_ERR(4, "mr=%p check failed", mr); \ + return -EFAULT; } + +#define EHCA_CHECK_MR_P(mr) \ + if (unlikely(mr==0)) { \ + EDEB_ERR(4, "mr=%p check failed", mr); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_MW(mw) \ + if (unlikely(mw==0)) { \ + EDEB_ERR(4, "mw=%p check failed", mw); \ + return -EFAULT; } + +#define EHCA_CHECK_MW_P(mw) \ + if (unlikely(mw==0)) { \ + EDEB_ERR(4, "mw=%p check failed", mw); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_FMR(fmr) \ + if (unlikely(fmr==0)) { \ + EDEB_ERR(4, "fmr=%p check failed", fmr); \ + return -EFAULT; } + +#define EHCA_CHECK_FMR_P(fmr) \ + if (unlikely(fmr==0)) { \ + EDEB_ERR(4, "fmr=%p check failed", fmr); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_REGISTER_PD(device,pd) +#define EHCA_REGISTER_AV(pd,av) +#define EHCA_DEREGISTER_PD(PD) +#define EHCA_DEREGISTER_AV(av) +#else +#define EHCA_CHECK_DEVICE_P(device) + +#define EHCA_CHECK_PD(pd) +#define EHCA_REGISTER_PD(device,pd) +#define 
EHCA_DEREGISTER_PD(PD) +#endif + +/** + * ehca2ib_return_code - Returns ib return code corresponding to the given + * ehca return code. + */ +static inline int ehca2ib_return_code(u64 ehca_rc) +{ + switch (ehca_rc) { + case H_SUCCESS: + return 0; + case H_BUSY: + return -EBUSY; + case H_NO_MEM: + return -ENOMEM; + default: + return -EINVAL; + } +} + +#endif /* EHCA_TOOLS_H */ From schihei at de.ibm.com Thu Apr 27 03:48:54 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 27 Apr 2006 12:48:54 +0200 Subject: [openib-general] [PATCH 07/16] ehca: interrupt handling routines Message-ID: <4450A196.2050901@de.ibm.com> Signed-off-by: Heiko J Schick ehca_irq.c | 712 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ehca_irq.h | 79 ++++++ 2 files changed, 791 insertions(+) --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_irq.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_irq.h 2006-04-11 09:29:54.000000000 +0200 @@ -0,0 +1,79 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Function definitions and structs for EQs, NEQs and interrupts + * + * Authors: Heiko J Schick + * Khadija Souissi + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_irq.h,v 1.10 2006/04/11 07:29:54 schickhj Exp $ + */ + +#ifndef __EHCA_IRQ_H +#define __EHCA_IRQ_H + + +struct ehca_shca; + +#include +#include +#include + +int ehca_error_data(struct ehca_shca *shca, void *data, u64 resource); + +irqreturn_t ehca_interrupt_neq(int irq, void *dev_id, struct pt_regs *regs); +void ehca_tasklet_neq(unsigned long data); + +irqreturn_t ehca_interrupt_eq(int irq, void *dev_id, struct pt_regs *regs); +void ehca_tasklet_eq(unsigned long data); + +struct ehca_cpu_comp_task { + wait_queue_head_t wait_queue; + struct list_head cq_list; + struct task_struct *task; + spinlock_t task_lock; +}; + +struct ehca_comp_pool { + struct ehca_cpu_comp_task *cpu_comp_tasks; + int last_cpu; + spinlock_t last_cpu_lock; +}; + +struct ehca_comp_pool *ehca_create_comp_pool(void); +void ehca_destroy_comp_pool(struct ehca_comp_pool *pool); +void ehca_queue_comp_task(struct ehca_comp_pool *pool, struct ehca_cq *__cq); + +#endif --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_irq.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_irq.c 2006-04-24 15:12:03.000000000 +0200 @@ -0,0 +1,712 @@ +/* + * IBM eServer 
eHCA Infiniband device driver for Linux on POWER + * + * Functions for EQs, NEQs and interrupts + * + * Authors: Heiko J Schick + * Khadija Souissi + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_irq.c,v 1.23 2006/04/24 13:12:03 schickhj Exp $ + */ +#define DEB_PREFIX "eirq" + +#include "ehca_classes.h" +#include "ehca_irq.h" +#include "ehca_iverbs.h" +#include "ehca_kernel.h" +#include "ehca_tools.h" +#include "hcp_if.h" +#include "hipz_fns.h" + +#define EQE_COMPLETION_EVENT EHCA_BMASK_IBM(1,1) +#define EQE_CQ_QP_NUMBER EHCA_BMASK_IBM(8,31) +#define EQE_EE_IDENTIFIER EHCA_BMASK_IBM(2,7) +#define EQE_CQ_NUMBER EHCA_BMASK_IBM(8,31) +#define EQE_QP_NUMBER EHCA_BMASK_IBM(8,31) +#define EQE_QP_TOKEN EHCA_BMASK_IBM(32,63) +#define EQE_CQ_TOKEN EHCA_BMASK_IBM(32,63) + +#define NEQE_COMPLETION_EVENT EHCA_BMASK_IBM(1,1) +#define NEQE_EVENT_CODE EHCA_BMASK_IBM(2,7) +#define NEQE_PORT_NUMBER EHCA_BMASK_IBM(8,15) +#define NEQE_PORT_AVAILABILITY EHCA_BMASK_IBM(16,16) + +#define ERROR_DATA_LENGTH EHCA_BMASK_IBM(52,63) +#define ERROR_DATA_TYPE EHCA_BMASK_IBM(0,7) + +static inline void comp_event_callback(struct ehca_cq *cq) +{ + EDEB_EN(7, "cq=%p", cq); + + if (cq->ib_cq.comp_handler == NULL) + return; + + spin_lock(&cq->cb_lock); + cq->ib_cq.comp_handler(&cq->ib_cq, cq->ib_cq.cq_context); + spin_unlock(&cq->cb_lock); + + EDEB_EX(7, "cq=%p", cq); + + return; +} + +static void print_error_data(struct ehca_shca * shca, void* data, + u64* rblock, int length) +{ + u64 type = EHCA_BMASK_GET(ERROR_DATA_TYPE, rblock[2]); + u64 resource = rblock[1]; + + EDEB_EN(7, "shca=%p data=%p rblock=%p length=%x", + shca, data, rblock, length); + + switch (type) { + case 0x1: /* Queue Pair */ + { + struct ehca_qp *qp = (struct ehca_qp*)data; + + /* only print error data if AER is set */ + if (rblock[6] == 0) + return; + + EDEB_ERR(4, "QP 0x%x (resource=%lx) has errors.", + qp->ib_qp.qp_num, resource); + break; + } + case 0x4: /* Completion Queue */ + { + struct ehca_cq *cq = (struct ehca_cq*)data; + + EDEB_ERR(4, "CQ 0x%x (resource=%lx) has errors.", + cq->cq_number, resource); + break; + } + default: + EDEB_ERR(4, "Unknown error type: %lx on %s.", + type, 
shca->ib_device.name); + break; + } + + EDEB_ERR(4, "Error data is available: %lx.", resource); + EDEB_ERR(4, "EHCA ----- error data begin " + "---------------------------------------------------"); + EDEB_DMP(4, rblock, length, "resource=%lx", resource); + EDEB_ERR(4, "EHCA ----- error data end " + "----------------------------------------------------"); + + EDEB_EX(7, ""); + + return; +} + +int ehca_error_data(struct ehca_shca *shca, void *data, + u64 resource) +{ + + unsigned long ret = 0; + u64 *rblock; + unsigned long block_count; + + EDEB_EN(7, "shca=%p data=%p resource=%lx", shca, data, resource); + + rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!rblock) { + EDEB_ERR(4, "Cannot allocate rblock memory."); + ret = -ENOMEM; + goto error_data1; + } + + ret = hipz_h_error_data(shca->ipz_hca_handle, + resource, + rblock, + &block_count); + + if (ret == H_R_STATE) { + EDEB_ERR(4, "No error data is available: %lx.", resource); + } + else if (ret == H_SUCCESS) { + int length; + + length = EHCA_BMASK_GET(ERROR_DATA_LENGTH, rblock[0]); + + if (length > PAGE_SIZE) + length = PAGE_SIZE; + + print_error_data(shca, data, rblock, length); + } + else { + EDEB_ERR(4, "Error data could not be fetched: %lx", resource); + } + + kfree(rblock); + +error_data1: + return ret; + +} + +static void qp_event_callback(struct ehca_shca *shca, + u64 eqe, + enum ib_event_type event_type) +{ + struct ib_event event; + struct ehca_qp *qp; + unsigned long flags; + u32 token = EHCA_BMASK_GET(EQE_QP_TOKEN, eqe); + + EDEB_EN(7, "eqe=%lx", eqe); + + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + qp = idr_find(&ehca_qp_idr, token); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + + + if (qp == NULL) + return; + + ehca_error_data(shca, qp, qp->ipz_qp_handle.handle); + + if (qp->ib_qp.event_handler == NULL) + return; + + event.device = &shca->ib_device; + event.event = event_type; + event.element.qp = &qp->ib_qp; + + qp->ib_qp.event_handler(&event, qp->ib_qp.qp_context); + + EDEB_EX(7, 
"qp=%p", qp); + + return; +} + +static void cq_event_callback(struct ehca_shca *shca, + u64 eqe) +{ + struct ehca_cq *cq; + unsigned long flags; + u32 token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe); + + EDEB_EN(7, "eqe=%lx", eqe); + + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + cq = idr_find(&ehca_cq_idr, token); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + + if (cq == NULL) + return; + + ehca_error_data(shca, cq, cq->ipz_cq_handle.handle); + + EDEB_EX(7, "cq=%p", cq); + + return; +} + +static void parse_identifier(struct ehca_shca *shca, u64 eqe) +{ + u8 identifier = EHCA_BMASK_GET(EQE_EE_IDENTIFIER, eqe); + + EDEB_EN(7, "shca=%p eqe=%lx", shca, eqe); + + switch (identifier) { + case 0x02: /* path migrated */ + qp_event_callback(shca, eqe, IB_EVENT_PATH_MIG); + break; + case 0x03: /* communication established */ + qp_event_callback(shca, eqe, IB_EVENT_COMM_EST); + break; + case 0x04: /* send queue drained */ + qp_event_callback(shca, eqe, IB_EVENT_SQ_DRAINED); + break; + case 0x05: /* QP error */ + case 0x06: /* QP error */ + qp_event_callback(shca, eqe, IB_EVENT_QP_FATAL); + break; + case 0x07: /* CQ error */ + case 0x08: /* CQ error */ + cq_event_callback(shca, eqe); + break; + case 0x09: /* MRMWPTE error */ + EDEB_ERR(4, "MRMWPTE error."); + break; + case 0x0A: /* port event */ + EDEB_ERR(4, "Port event."); + break; + case 0x0B: /* MR access error */ + EDEB_ERR(4, "MR access error."); + break; + case 0x0C: /* EQ error */ + EDEB_ERR(4, "EQ error."); + break; + case 0x0D: /* P/Q_Key mismatch */ + EDEB_ERR(4, "P/Q_Key mismatch."); + break; + case 0x10: /* sampling complete */ + EDEB_ERR(4, "Sampling complete."); + break; + case 0x11: /* unaffiliated access error */ + EDEB_ERR(4, "Unaffiliated access error."); + break; + case 0x12: /* path migrating error */ + EDEB_ERR(4, "Path migration error."); + break; + case 0x13: /* interface trace stopped */ + EDEB_ERR(4, "Interface trace stopped."); + break; + case 0x14: /* first error capture info available */ + 
default: + EDEB_ERR(4, "Unknown identifier: %x on %s.", + identifier, shca->ib_device.name); + break; + } + + EDEB_EX(7, "eqe=%lx identifier=%x", eqe, identifier); + + return; +} + +static void parse_ec(struct ehca_shca *shca, u64 eqe) +{ + struct ib_event event; + u8 ec = EHCA_BMASK_GET(NEQE_EVENT_CODE, eqe); + u8 port = EHCA_BMASK_GET(NEQE_PORT_NUMBER, eqe); + + EDEB_EN(7, "shca=%p eqe=%lx", shca, eqe); + + switch (ec) { + case 0x30: /* port availability change */ + if (EHCA_BMASK_GET(NEQE_PORT_AVAILABILITY, eqe)) { + EDEB(4, "%s: port %x is active.", + shca->ib_device.name, port); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ACTIVE; + event.element.port_num = port; + shca->sport[port - 1].port_state = IB_PORT_ACTIVE; + ib_dispatch_event(&event); + } else { + EDEB(4, "%s: port %x is inactive.", + shca->ib_device.name, port); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ERR; + event.element.port_num = port; + shca->sport[port - 1].port_state = IB_PORT_DOWN; + ib_dispatch_event(&event); + } + break; + case 0x31: + /* port configuration change */ + /* disruptive change is caused by */ + /* LID, PKEY or SM change */ + EDEB(4, "EHCA disruptive port %x " + "configuration change.", port); + + EDEB(4, "%s: port %x is inactive.", + shca->ib_device.name, port); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ERR; + event.element.port_num = port; + shca->sport[port - 1].port_state = IB_PORT_DOWN; + ib_dispatch_event(&event); + + EDEB(4, "%s: port %x is active.", + shca->ib_device.name, port); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ACTIVE; + event.element.port_num = port; + shca->sport[port - 1].port_state = IB_PORT_ACTIVE; + ib_dispatch_event(&event); + break; + case 0x32: /* adapter malfunction */ + EDEB_ERR(4, "Adapter malfunction."); + break; + case 0x33: /* trace stopped */ + EDEB_ERR(4, "Trace stopped."); + break; + default: + EDEB_ERR(4, "Unknown event code: %x on %s.", + ec, 
shca->ib_device.name); + break; + } + + EDEB_EX(7, "eqe=%lx ec=%x", eqe, ec); + + return; +} + +static inline void reset_eq_pending(struct ehca_cq *cq) +{ + u64 CQx_EP = 0; + struct h_galpa gal = cq->galpas.kernel; + + EDEB_EN(7, "cq=%p", cq); + + hipz_galpa_store_cq(gal, cqx_ep, 0x0); + CQx_EP = hipz_galpa_load(gal, CQTEMM_OFFSET(cqx_ep)); + EDEB(7, "CQx_EP=%lx", CQx_EP); + + EDEB_EX(7, "cq=%p", cq); + + return; +} + +irqreturn_t ehca_interrupt_neq(int irq, void *dev_id, struct pt_regs *regs) +{ + struct ehca_shca *shca = (struct ehca_shca*)dev_id; + + EDEB_EN(7, "dev_id=%p", dev_id); + + tasklet_hi_schedule(&shca->neq.interrupt_task); + + EDEB_EX(7, ""); + + return IRQ_HANDLED; +} + +void ehca_tasklet_neq(unsigned long data) +{ + struct ehca_shca *shca = (struct ehca_shca*)data; + struct ehca_eqe *eqe; + u64 ret = H_SUCCESS; + + EDEB_EN(7, "shca=%p", shca); + + eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->neq); + + while (eqe) { + if (!EHCA_BMASK_GET(NEQE_COMPLETION_EVENT, eqe->entry)) + parse_ec(shca, eqe->entry); + + eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->neq); + } + + ret = hipz_h_reset_event(shca->ipz_hca_handle, + shca->neq.ipz_eq_handle, 0xFFFFFFFFFFFFFFFFL); + + if (ret != H_SUCCESS) + EDEB_ERR(4, "Can't clear notification events."); + + EDEB_EX(7, "shca=%p", shca); + + return; +} + +irqreturn_t ehca_interrupt_eq(int irq, void *dev_id, struct pt_regs *regs) +{ + struct ehca_shca *shca = (struct ehca_shca*)dev_id; + + EDEB_EN(7, "dev_id=%p", dev_id); + + tasklet_hi_schedule(&shca->eq.interrupt_task); + + EDEB_EX(7, ""); + + return IRQ_HANDLED; +} + +void ehca_tasklet_eq(unsigned long data) +{ + struct ehca_shca *shca = (struct ehca_shca*)data; + struct ehca_eqe *eqe; + int int_state; + + EDEB_EN(7, "shca=%p", shca); + + do { + eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->eq); + + if ((shca->hw_level >= 2) && (eqe != NULL)) + int_state = 1; + else + int_state = 0; + + while ((int_state == 1) || (eqe != 0)) { + while (eqe) { + u64 
eqe_value = eqe->entry; + + EDEB(7, "eqe_value=%lx", eqe_value); + + /* TODO: better structure */ + if (EHCA_BMASK_GET(EQE_COMPLETION_EVENT, + eqe_value)) { + extern struct ehca_comp_pool* ehca_pool; + extern struct idr ehca_cq_idr; + unsigned long flags; + u32 token; + struct ehca_cq *cq; + + EDEB(6, "... completion event"); + token = + EHCA_BMASK_GET(EQE_CQ_TOKEN, + eqe_value); + spin_lock_irqsave(&ehca_cq_idr_lock, + flags); + cq = idr_find(&ehca_cq_idr, token); + + if (cq == NULL) { + /* restore the IRQ state saved by spin_lock_irqsave() */ + spin_unlock_irqrestore(&ehca_cq_idr_lock, + flags); + break; + } + + reset_eq_pending(cq); + ehca_queue_comp_task(ehca_pool, cq); + spin_unlock_irqrestore(&ehca_cq_idr_lock, + flags); + } else { + EDEB(6, "... non completion event"); + parse_identifier(shca, eqe_value); + } + eqe = + (struct ehca_eqe *)ehca_poll_eq(shca, + &shca->eq); + } + + if (shca->hw_level >= 2) + int_state = + hipz_h_query_int_state(shca->ipz_hca_handle, + shca->eq.ist); + eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->eq); + + } + } while (int_state != 0); + + EDEB_EX(7, "shca=%p", shca); + + return; +} + +static inline int find_next_online_cpu(struct ehca_comp_pool* pool) +{ + unsigned long flags_last_cpu; + + spin_lock_irqsave(&pool->last_cpu_lock, flags_last_cpu); + pool->last_cpu = next_cpu(pool->last_cpu, cpu_online_map); + + if (pool->last_cpu == NR_CPUS) + pool->last_cpu = 0; + + spin_unlock_irqrestore(&pool->last_cpu_lock, flags_last_cpu); + + return pool->last_cpu; +} + +void ehca_queue_comp_task(struct ehca_comp_pool *pool, struct ehca_cq *__cq) +{ + int cpu; + int cpu_id; + struct ehca_cpu_comp_task *cct; + unsigned long flags_cct; + unsigned long flags_cq; + + cpu = get_cpu(); + cpu_id = find_next_online_cpu(pool); + + EDEB_EN(7, "pool=%p cq=%p cq_nr=%x CPU=%x:%x:%x:%x", + pool, __cq, __cq->cq_number, + cpu, cpu_id, num_online_cpus(), num_possible_cpus()); + + BUG_ON(!cpu_online(cpu_id)); + + cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu_id); + + spin_lock_irqsave(&cct->task_lock, flags_cct); + 
spin_lock_irqsave(&__cq->task_lock, flags_cq); + + if (__cq->nr_callbacks == 0) { + __cq->nr_callbacks++; + list_add_tail(&__cq->entry, &cct->cq_list); + wake_up(&cct->wait_queue); + } + else + __cq->nr_callbacks++; + + spin_unlock_irqrestore(&__cq->task_lock, flags_cq); + spin_unlock_irqrestore(&cct->task_lock, flags_cct); + + put_cpu(); + + EDEB_EX(7, "cct=%p", cct); + + return; +} + +static void run_comp_task(struct ehca_cpu_comp_task* cct) +{ + struct ehca_cq *cq = NULL; + unsigned long flags_cct; + unsigned long flags_cq; + + + EDEB_EN(7, "cct=%p", cct); + + spin_lock_irqsave(&cct->task_lock, flags_cct); + + while (!list_empty(&cct->cq_list)) { + cq = list_entry(cct->cq_list.next, struct ehca_cq, entry); + spin_unlock_irqrestore(&cct->task_lock, flags_cct); + comp_event_callback(cq); + spin_lock_irqsave(&cct->task_lock, flags_cct); + + spin_lock_irqsave(&cq->task_lock, flags_cq); + cq->nr_callbacks--; + if (cq->nr_callbacks == 0) + list_del_init(cct->cq_list.next); + spin_unlock_irqrestore(&cq->task_lock, flags_cq); + + } + + spin_unlock_irqrestore(&cct->task_lock, flags_cct); + + EDEB_EX(7, "cct=%p cq=%p", cct, cq); + + return; +} + +static int comp_task(void *__cct) +{ + struct ehca_cpu_comp_task* cct = __cct; + DECLARE_WAITQUEUE(wait, current); + + EDEB_EN(7, "cct=%p", cct); + + set_current_state(TASK_INTERRUPTIBLE); + while(!kthread_should_stop()) { + add_wait_queue(&cct->wait_queue, &wait); + + if (list_empty(&cct->cq_list)) + schedule(); + else + __set_current_state(TASK_RUNNING); + + remove_wait_queue(&cct->wait_queue, &wait); + + if (!list_empty(&cct->cq_list)) + run_comp_task(__cct); + + set_current_state(TASK_INTERRUPTIBLE); + } + __set_current_state(TASK_RUNNING); + + EDEB_EX(7, ""); + + return 0; +} + +static struct task_struct *create_comp_task(struct ehca_comp_pool *pool, + int cpu) +{ + struct ehca_cpu_comp_task *cct; + + EDEB_EN(7, "cpu=%d:%d", cpu, NR_CPUS); + + cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu); + spin_lock_init(&cct->task_lock); + 
INIT_LIST_HEAD(&cct->cq_list); + init_waitqueue_head(&cct->wait_queue); + cct->task = kthread_create(comp_task, cct, "ehca_comp/%d", cpu); + + EDEB_EX(7, "cct/%d=%p", cpu, cct); + + return cct->task; +} + +static void destroy_comp_task(struct ehca_comp_pool *pool, + int cpu) +{ + struct ehca_cpu_comp_task *cct; + struct task_struct *task; + + EDEB_EN(7, "pool=%p cpu=%d:%d", pool, cpu, NR_CPUS); + + cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu); + /* save the task pointer before clearing it, so the thread is stopped */ + task = cct->task; + cct->task = NULL; + + if (task) + kthread_stop(task); + + EDEB_EX(7, ""); + + return; +} + +struct ehca_comp_pool *ehca_create_comp_pool(void) +{ + struct ehca_comp_pool *pool; + int cpu; + struct task_struct *task; + + EDEB_EN(7, ""); + + pool = kzalloc(sizeof(struct ehca_comp_pool), GFP_KERNEL); + if (pool == NULL) + return NULL; + + spin_lock_init(&pool->last_cpu_lock); + pool->last_cpu = any_online_cpu(cpu_online_map); + + pool->cpu_comp_tasks = alloc_percpu(struct ehca_cpu_comp_task); + if (pool->cpu_comp_tasks == NULL) { + kfree(pool); + return NULL; + } + + for_each_online_cpu(cpu) { + task = create_comp_task(pool, cpu); + if (task) { + kthread_bind(task, cpu); + wake_up_process(task); + } + } + + EDEB_EX(7, "pool=%p", pool); + + return pool; +} + +void ehca_destroy_comp_pool(struct ehca_comp_pool *pool) +{ + int i; + + EDEB_EN(7, "pool=%p", pool); + + for (i = 0; i < NR_CPUS; i++) { + if (cpu_online(i)) + destroy_comp_task(pool, i); + } + + EDEB_EX(7, ""); + + return; +} From schihei at de.ibm.com Thu Apr 27 03:49:10 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 27 Apr 2006 12:49:10 +0200 Subject: [openib-general] [PATCH 09/16] ehca: protection domain and address vector Message-ID: <4450A1A6.6030201@de.ibm.com> Signed-off-by: Heiko J Schick ehca_av.c | 309 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ehca_pd.c | 122 ++++++++++++++++++++++++ 2 files changed, 431 insertions(+) --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_av.c 1970-01-01 
01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_av.c 2006-04-11 15:25:50.000000000 +0200 @@ -0,0 +1,309 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * address vector functions + * + * Authors: Hoang-Nam Nguyen + * Khadija Souissi + * Reinhard Ernst + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_av.c,v 1.12 2006/04/11 13:25:50 nguyen Exp $ + */ + + +#define DEB_PREFIX "ehav" + +#include + +#include "ehca_kernel.h" +#include "ehca_tools.h" +#include "ehca_iverbs.h" +#include "hcp_if.h" + +struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + extern struct ehca_module ehca_module; + extern int ehca_static_rate; + int retcode = 0; + struct ehca_av *av = NULL; + struct ehca_shca *shca = NULL; + + EHCA_CHECK_PD_P(pd); + EHCA_CHECK_ADR_P(ah_attr); + + shca = container_of(pd->device, struct ehca_shca, ib_device); + + EDEB_EN(7,"pd=%p ah_attr=%p", pd, ah_attr); + + av = kmem_cache_alloc(ehca_module.cache_av, SLAB_KERNEL); + if (av == NULL) { + EDEB_ERR(4,"Out of memory pd=%p ah_attr=%p", pd, ah_attr); + retcode = -ENOMEM; + goto create_ah_exit0; + } + + av->av.sl = ah_attr->sl; + av->av.dlid = ntohs(ah_attr->dlid); + av->av.slid_path_bits = ah_attr->src_path_bits; + + if (ehca_static_rate < 0) { + int ah_mult = ib_rate_to_mult(ah_attr->static_rate); + int ehca_mult = + ib_rate_to_mult(shca->sport[ah_attr->port_num].rate ); + + if (ah_mult >= ehca_mult) + av->av.ipd = 0; + else + av->av.ipd = (ah_mult > 0) ? 
+ ((ehca_mult - 1) / ah_mult) : 0; + } else + av->av.ipd = ehca_static_rate; + + EDEB(7,"IPD av->av.ipd set =%x ah_attr->static_rate=%x " + "shca_ib_rate=%x ",av->av.ipd, ah_attr->static_rate, + shca->sport[ah_attr->port_num].rate); + + av->av.lnh = ah_attr->ah_flags; + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_IPVERSION_MASK, 6); + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_TCLASS_MASK, + ah_attr->grh.traffic_class); + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_FLOWLABEL_MASK, + ah_attr->grh.flow_label); + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_HOPLIMIT_MASK, + ah_attr->grh.hop_limit); + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_NEXTHEADER_MASK, 0x1B); + /* IB transport */ + av->av.grh.word_0 = be64_to_cpu(av->av.grh.word_0); + /* set sgid in grh.word_1 */ + if (ah_attr->ah_flags & IB_AH_GRH) { + int rc = 0; + struct ib_port_attr port_attr; + union ib_gid gid; + memset(&port_attr, 0, sizeof(port_attr)); + rc = ehca_query_port(pd->device, ah_attr->port_num, + &port_attr); + if (rc != 0) { /* invalid port number */ + retcode = -EINVAL; + EDEB_ERR(4, "Invalid port number " + "ehca_query_port() returned %x " + "pd=%p ah_attr=%p", rc, pd, ah_attr); + goto create_ah_exit1; + } + memset(&gid, 0, sizeof(gid)); + rc = ehca_query_gid(pd->device, + ah_attr->port_num, + ah_attr->grh.sgid_index, &gid); + if (rc != 0) { + retcode = -EINVAL; + EDEB_ERR(4, "Failed to retrieve sgid " + "ehca_query_gid() returned %x " + "pd=%p ah_attr=%p", rc, pd, ah_attr); + goto create_ah_exit1; + } + memcpy(&av->av.grh.word_1, &gid, sizeof(gid)); + } + /* for the time being we use a hard coded PMTU of 2048 Bytes */ + av->av.pmtu = 4; + + /* dgid comes in grh.word_3 */ + memcpy(&av->av.grh.word_3, &ah_attr->grh.dgid, + sizeof(ah_attr->grh.dgid)); + + EHCA_REGISTER_AV(device, pd); + + EDEB_EX(7,"pd=%p ah_attr=%p av=%p", pd, ah_attr, av); + return (&av->ib_ah); + +create_ah_exit1: + kmem_cache_free(ehca_module.cache_av, av); + +create_ah_exit0: + EDEB_EX(7,"retcode=%x pd=%p ah_attr=%p", retcode, pd, 
ah_attr); + return ERR_PTR(retcode); +} + +int ehca_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + struct ehca_av *av = NULL; + struct ehca_ud_av new_ehca_av; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + int ret = 0; + + EHCA_CHECK_AV(ah); + EHCA_CHECK_ADR(ah_attr); + + EDEB_EN(7,"ah=%p ah_attr=%p", ah, ah_attr); + + my_pd = container_of(ah->pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject!=NULL && my_pd->ib_pd.uobject->context!=NULL && + my_pd->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; + } + + memset(&new_ehca_av, 0, sizeof(new_ehca_av)); + new_ehca_av.sl = ah_attr->sl; + new_ehca_av.dlid = ntohs(ah_attr->dlid); + new_ehca_av.slid_path_bits = ah_attr->src_path_bits; + new_ehca_av.ipd = ah_attr->static_rate; + new_ehca_av.lnh = EHCA_BMASK_SET(GRH_FLAG_MASK, + ((ah_attr->ah_flags & IB_AH_GRH) > 0)); + new_ehca_av.grh.word_0 = EHCA_BMASK_SET(GRH_TCLASS_MASK, + ah_attr->grh.traffic_class); + new_ehca_av.grh.word_0 |= EHCA_BMASK_SET(GRH_FLOWLABEL_MASK, + ah_attr->grh.flow_label); + new_ehca_av.grh.word_0 |= EHCA_BMASK_SET(GRH_HOPLIMIT_MASK, + ah_attr->grh.hop_limit); + new_ehca_av.grh.word_0 |= EHCA_BMASK_SET(GRH_NEXTHEADER_MASK, 0x1b); + new_ehca_av.grh.word_0 = be64_to_cpu(new_ehca_av.grh.word_0); + + /* set sgid in grh.word_1 */ + if (ah_attr->ah_flags & IB_AH_GRH) { + int rc = 0; + struct ib_port_attr port_attr; + union ib_gid gid; + memset(&port_attr, 0, sizeof(port_attr)); + rc = ehca_query_port(ah->device, ah_attr->port_num, + &port_attr); + if (rc != 0) { /* invalid port number */ + ret = -EINVAL; + EDEB_ERR(4, "Invalid port number " + "ehca_query_port() returned %x " + "ah=%p ah_attr=%p port_num=%x", + rc, ah, ah_attr, ah_attr->port_num); + goto modify_ah_exit1; + } + memset(&gid, 0, sizeof(gid)); + rc = ehca_query_gid(ah->device, + ah_attr->port_num, + ah_attr->grh.sgid_index, &gid); + if (rc != 0) { + ret = -EINVAL; + EDEB_ERR(4, + "Failed to 
retrieve sgid " + "ehca_query_gid() returned %x " + "ah=%p ah_attr=%p port_num=%x " + "sgid_index=%x", + rc, ah, ah_attr, ah_attr->port_num, + ah_attr->grh.sgid_index); + goto modify_ah_exit1; + } + memcpy(&new_ehca_av.grh.word_1, &gid, sizeof(gid)); + } + + new_ehca_av.pmtu = 4; /* see also comment in create_ah() */ + + memcpy(&new_ehca_av.grh.word_3, &ah_attr->grh.dgid, + sizeof(ah_attr->grh.dgid)); + + av = container_of(ah, struct ehca_av, ib_ah); + av->av = new_ehca_av; + +modify_ah_exit1: + EDEB_EX(7,"ret=%x ah=%p ah_attr=%p", ret, ah, ah_attr); + + return ret; +} + +int ehca_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + int ret = 0; + struct ehca_av *av = NULL; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + + EHCA_CHECK_AV(ah); + EHCA_CHECK_ADR(ah_attr); + + EDEB_EN(7,"ah=%p ah_attr=%p", ah, ah_attr); + + my_pd = container_of(ah->pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject!=NULL && my_pd->ib_pd.uobject->context!=NULL && + my_pd->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; + } + + av = container_of(ah, struct ehca_av, ib_ah); + memcpy(&ah_attr->grh.dgid, &av->av.grh.word_3, + sizeof(ah_attr->grh.dgid)); + ah_attr->sl = av->av.sl; + + ah_attr->dlid = av->av.dlid; + + ah_attr->src_path_bits = av->av.slid_path_bits; + ah_attr->static_rate = av->av.ipd; + ah_attr->ah_flags = EHCA_BMASK_GET(GRH_FLAG_MASK, av->av.lnh); + ah_attr->grh.traffic_class = EHCA_BMASK_GET(GRH_TCLASS_MASK, + av->av.grh.word_0); + ah_attr->grh.hop_limit = EHCA_BMASK_GET(GRH_HOPLIMIT_MASK, + av->av.grh.word_0); + ah_attr->grh.flow_label = EHCA_BMASK_GET(GRH_FLOWLABEL_MASK, + av->av.grh.word_0); + + EDEB_EX(7,"ah=%p ah_attr=%p ret=%x", ah, ah_attr, ret); + return ret; +} + +int ehca_destroy_ah(struct ib_ah *ah) +{ + extern struct ehca_module ehca_module; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + int ret = 0; + + EHCA_CHECK_AV(ah); + EHCA_DEREGISTER_AV(ah); + + 
EDEB_EN(7,"ah=%p", ah); + + my_pd = container_of(ah->pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject!=NULL && my_pd->ib_pd.uobject->context!=NULL && + my_pd->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; + } + + kmem_cache_free(ehca_module.cache_av, + container_of(ah, struct ehca_av, ib_ah)); + + EDEB_EX(7,"ret=%x ah=%p", ret, ah); + return ret; +} --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_pd.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_pd.c 2006-03-27 11:23:18.000000000 +0200 @@ -0,0 +1,122 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * PD functions + * + * Authors: Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_pd.c,v 1.5 2006/03/27 09:23:18 schickhj Exp $ + */ + + +#define DEB_PREFIX "vpd " + +#include + +#include "ehca_kernel.h" +#include "ehca_tools.h" +#include "ehca_iverbs.h" + +struct ib_pd *ehca_alloc_pd(struct ib_device *device, + struct ib_ucontext *context, struct ib_udata *udata) +{ + extern struct ehca_module ehca_module; + struct ib_pd *mypd = NULL; + struct ehca_pd *pd = NULL; + + EDEB_EN(7, "device=%p context=%p udata=%p", device, context, udata); + + EHCA_CHECK_DEVICE_P(device); + + pd = kmem_cache_alloc(ehca_module.cache_pd, SLAB_KERNEL); + if (pd == NULL) { + EDEB_ERR(4, "ERROR device=%p context=%p pd=%p" + " out of memory", device, context, mypd); + return ERR_PTR(-ENOMEM); + } + + memset(pd, 0, sizeof(struct ehca_pd)); + pd->ownpid = current->tgid; + + /* Kernel PD: when device = -1, 0 + * User PD: when context != -1 + */ + if (context == NULL) { + /* Kernel PDs after init reuses always + * the one created in ehca_shca_reopen() + */ + struct ehca_shca *shca = container_of(device, struct ehca_shca, + ib_device); + pd->fw_pd.value = shca->pd->fw_pd.value; + } else { + pd->fw_pd.value = (u64)pd; + } + + mypd = &pd->ib_pd; + + EHCA_REGISTER_PD(device, pd); + + EDEB_EX(7, "device=%p context=%p pd=%p", device, context, mypd); + + return (mypd); +} + +int ehca_dealloc_pd(struct ib_pd *pd) +{ + extern struct ehca_module ehca_module; + int ret = 0; + u32 cur_pid = current->tgid; + struct ehca_pd *my_pd = NULL; + 
+ EDEB_EN(7, "pd=%p", pd); + + EHCA_CHECK_PD(pd); + my_pd = container_of(pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject!=NULL && my_pd->ib_pd.uobject->context!=NULL && + my_pd->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; + } + + EHCA_DEREGISTER_PD(pd); + + kmem_cache_free(ehca_module.cache_pd, + container_of(pd, struct ehca_pd, ib_pd)); + + EDEB_EX(7, "pd=%p", pd); + + return ret; +} From schihei at de.ibm.com Thu Apr 27 03:49:02 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 27 Apr 2006 12:49:02 +0200 Subject: [openib-general] [PATCH 08/16] ehca: memory region Message-ID: <4450A19E.1000604@de.ibm.com> Signed-off-by: Heiko J Schick ehca_mrmw.c | 2492 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ehca_mrmw.h | 145 +++ 2 files changed, 2637 insertions(+) --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_mrmw.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_mrmw.h 2006-02-28 09:49:25.000000000 +0100 @@ -0,0 +1,145 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * MR/MW declarations and inline functions + * + * Authors: Dietmar Decker + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_mrmw.h,v 1.9 2006/02/28 08:49:25 schickhj Exp $ + */ + +#ifndef _EHCA_MRMW_H_ +#define _EHCA_MRMW_H_ + +#undef DEB_PREFIX +#define DEB_PREFIX "mrmw" + +int ehca_reg_mr(struct ehca_shca *shca, + struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + int acl, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, + u32 *rkey); + +int ehca_reg_mr_rpages(struct ehca_shca *shca, + struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo); + +int ehca_rereg_mr(struct ehca_shca *shca, + struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + int mr_access_flags, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, + u32 *rkey); + +int ehca_unmap_one_fmr(struct ehca_shca *shca, + struct ehca_mr *e_fmr); + +int ehca_reg_smr(struct ehca_shca *shca, + struct ehca_mr *e_origmr, + struct ehca_mr *e_newmr, + u64 *iova_start, + int acl, + struct ehca_pd *e_pd, + u32 *lkey, + u32 *rkey); + +int ehca_reg_internal_maxmr(struct ehca_shca *shca, + struct ehca_pd *e_pd, + struct ehca_mr **maxmr); + +int ehca_reg_maxmr(struct ehca_shca *shca, + struct ehca_mr *e_newmr, + u64 *iova_start, + int acl, + struct ehca_pd *e_pd, + u32 *lkey, + u32 *rkey); + 
+int ehca_dereg_internal_maxmr(struct ehca_shca *shca); + +int ehca_mr_chk_buf_and_calc_size(struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + u64 *iova_start, + u64 *size); + +int ehca_fmr_check_page_list(struct ehca_mr *e_fmr, + u64 *page_list, + int list_len); + +int ehca_set_pagebuf(struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo, + u32 number, + u64 *kpage); + +int ehca_set_pagebuf_1(struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo, + u64 *rpage); + +int ehca_mr_is_maxmr(u64 size, + u64 *iova_start); + +void ehca_mrmw_map_acl(int ib_acl, + u32 *hipz_acl); + +void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl); + +void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl, + int *ib_acl); + +int ehca_mrmw_map_rc_alloc(const u64 rc); + +int ehca_mrmw_map_rc_rrpg_last(const u64 rc); + +int ehca_mrmw_map_rc_rrpg_notlast(const u64 rc); + +int ehca_mrmw_map_rc_query_mr(const u64 rc); + +int ehca_mrmw_map_rc_free_mr(const u64 rc); + +int ehca_mrmw_map_rc_free_mw(const u64 rc); + +int ehca_mrmw_map_rc_reg_smr(const u64 rc); + +void ehca_mr_deletenew(struct ehca_mr *mr); + +#endif /*_EHCA_MRMW_H_*/ --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_mrmw.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_mrmw.c 2006-04-25 16:51:33.000000000 +0200 @@ -0,0 +1,2492 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * MR/MW functions + * + * Authors: Dietmar Decker + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. 
+ * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_mrmw.c,v 1.24 2006/04/25 14:51:33 decker Exp $ + */ + +#undef DEB_PREFIX +#define DEB_PREFIX "mrmw" + +#include + +#include "ehca_kernel.h" +#include "ehca_iverbs.h" +#include "ehca_mrmw.h" +#include "hcp_if.h" +#include "hipz_hw.h" + +extern int ehca_use_hp_mr; + +static struct ehca_mr *ehca_mr_new(void) +{ + extern struct ehca_module ehca_module; + struct ehca_mr *me; + + me = kmem_cache_alloc(ehca_module.cache_mr, SLAB_KERNEL); + if (me) { + memset(me, 0, sizeof(struct ehca_mr)); + spin_lock_init(&me->mrlock); + EDEB_EX(7, "ehca_mr=%p sizeof(ehca_mr_t)=%x", me, + (u32) sizeof(struct ehca_mr)); + } else { + EDEB_ERR(3, "alloc failed"); + } + + return me; +} + +static void ehca_mr_delete(struct ehca_mr *me) +{ + extern struct ehca_module ehca_module; + + kmem_cache_free(ehca_module.cache_mr, me); +} + +static struct ehca_mw *ehca_mw_new(void) +{ + extern struct ehca_module ehca_module; + struct ehca_mw *me; + + me = kmem_cache_alloc(ehca_module.cache_mw, SLAB_KERNEL); + if (me) { + memset(me, 0, sizeof(struct ehca_mw)); + spin_lock_init(&me->mwlock); + EDEB_EX(7, "ehca_mw=%p sizeof(ehca_mw_t)=%x", me, + (u32) sizeof(struct ehca_mw)); + } else { + EDEB_ERR(3, "alloc failed"); + } + + return me; +} + +static void ehca_mw_delete(struct ehca_mw *me) +{ + extern struct ehca_module ehca_module; + + kmem_cache_free(ehca_module.cache_mw, me); +} + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +struct ib_mr *ehca_get_dma_mr(struct ib_pd *pd, int mr_access_flags) +{ + struct ib_mr *ib_mr = NULL; + int retcode = 0; + struct ehca_mr *e_maxmr = NULL; + struct ehca_pd *e_pd = NULL; + struct ehca_shca *shca = NULL; + + EDEB_EN(7, "pd=%p mr_access_flags=%x", pd, mr_access_flags); + + EHCA_CHECK_PD_P(pd); + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + if (shca->maxmr) { + e_maxmr = 
ehca_mr_new(); + if (!e_maxmr) { + EDEB_ERR(4, "out of memory"); + ib_mr = ERR_PTR(-ENOMEM); + goto get_dma_mr_exit0; + } + + retcode = ehca_reg_maxmr(shca, e_maxmr, + (u64 *)KERNELBASE, + mr_access_flags, e_pd, + &e_maxmr->ib.ib_mr.lkey, + &e_maxmr->ib.ib_mr.rkey); + if (retcode != 0) { + ib_mr = ERR_PTR(retcode); + goto get_dma_mr_exit0; + } + ib_mr = &e_maxmr->ib.ib_mr; + } else { + EDEB_ERR(4, "no internal max-MR exist!"); + ib_mr = ERR_PTR(-EINVAL); + goto get_dma_mr_exit0; + } + +get_dma_mr_exit0: + if (IS_ERR(ib_mr)) + EDEB_EX(4, "rc=%lx pd=%p mr_access_flags=%x ", + PTR_ERR(ib_mr), pd, mr_access_flags); + else + EDEB_EX(7, "ib_mr=%p lkey=%x rkey=%x", + ib_mr, ib_mr->lkey, ib_mr->rkey); + return (ib_mr); +} /* end ehca_get_dma_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_mr *ib_mr = NULL; + int retcode = 0; + struct ehca_mr *e_mr = NULL; + struct ehca_shca *shca = NULL; + struct ehca_pd *e_pd = NULL; + u64 size = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + u32 num_pages_mr = 0; + u32 num_pages_4k = 0; /* 4k portion "pages" */ + + EDEB_EN(7, "pd=%p phys_buf_array=%p num_phys_buf=%x " + "mr_access_flags=%x iova_start=%p", pd, phys_buf_array, + num_phys_buf, mr_access_flags, iova_start); + + EHCA_CHECK_PD_P(pd); + if ((num_phys_buf <= 0) || ehca_adr_bad(phys_buf_array)) { + EDEB_ERR(4, "bad input values: num_phys_buf=%x " + "phys_buf_array=%p", num_phys_buf, phys_buf_array); + ib_mr = ERR_PTR(-EINVAL); + goto reg_phys_mr_exit0; + } + if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || + ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE))) { + /* Remote Write 
Access requires Local Write Access */ + /* Remote Atomic Access requires Local Write Access */ + EDEB_ERR(4, "bad input values: mr_access_flags=%x", + mr_access_flags); + ib_mr = ERR_PTR(-EINVAL); + goto reg_phys_mr_exit0; + } + + /* check physical buffer list and calculate size */ + retcode = ehca_mr_chk_buf_and_calc_size(phys_buf_array, num_phys_buf, + iova_start, &size); + if (retcode != 0) { + ib_mr = ERR_PTR(retcode); + goto reg_phys_mr_exit0; + } + if ((size == 0) || + (((u64)iova_start + size) < (u64)iova_start)) { + EDEB_ERR(4, "bad input values: size=%lx iova_start=%p", + size, iova_start); + ib_mr = ERR_PTR(-EINVAL); + goto reg_phys_mr_exit0; + } + + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + e_mr = ehca_mr_new(); + if (!e_mr) { + EDEB_ERR(4, "out of memory"); + ib_mr = ERR_PTR(-ENOMEM); + goto reg_phys_mr_exit0; + } + + /* determine number of MR pages */ + num_pages_mr = ( (((u64)iova_start % PAGE_SIZE) + size + + PAGE_SIZE - 1) / PAGE_SIZE ); + num_pages_4k = ( (((u64)iova_start % EHCA_PAGESIZE) + size + + EHCA_PAGESIZE - 1) / EHCA_PAGESIZE ); + + /* register MR on HCA */ + if (ehca_mr_is_maxmr(size, iova_start)) { + e_mr->flags |= EHCA_MR_FLAG_MAXMR; + retcode = ehca_reg_maxmr(shca, e_mr, iova_start, + mr_access_flags, e_pd, + &e_mr->ib.ib_mr.lkey, + &e_mr->ib.ib_mr.rkey); + if (retcode != 0) { + ib_mr = ERR_PTR(retcode); + goto reg_phys_mr_exit1; + } + } else { + pginfo.type = EHCA_MR_PGI_PHYS; + pginfo.num_pages = num_pages_mr; + pginfo.num_4k = num_pages_4k; + pginfo.num_phys_buf = num_phys_buf; + pginfo.phys_buf_array = phys_buf_array; + pginfo.next_4k = ( ((u64)iova_start & ~PAGE_MASK) / + EHCA_PAGESIZE); + + retcode = ehca_reg_mr(shca, e_mr, iova_start, size, + mr_access_flags, e_pd, &pginfo, + &e_mr->ib.ib_mr.lkey, + &e_mr->ib.ib_mr.rkey); + if (retcode != 0) { + ib_mr = ERR_PTR(retcode); + goto reg_phys_mr_exit1; + } + } + + /* successful registration of all pages */ + 
ib_mr = &e_mr->ib.ib_mr; + goto reg_phys_mr_exit0; + +reg_phys_mr_exit1: + ehca_mr_delete(e_mr); +reg_phys_mr_exit0: + if (IS_ERR(ib_mr)) + EDEB_EX(4, "rc=%lx pd=%p phys_buf_array=%p " + "num_phys_buf=%x mr_access_flags=%x iova_start=%p", + PTR_ERR(ib_mr), pd, phys_buf_array, + num_phys_buf, mr_access_flags, iova_start); + else + EDEB_EX(7, "ib_mr=%p lkey=%x rkey=%x", + ib_mr, ib_mr->lkey, ib_mr->rkey); + return (ib_mr); +} /* end ehca_reg_phys_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, + struct ib_umem *region, + int mr_access_flags, + struct ib_udata *udata) +{ + struct ib_mr *ib_mr = NULL; + struct ehca_mr *e_mr = NULL; + struct ehca_shca *shca = NULL; + struct ehca_pd *e_pd = NULL; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + int retcode = 0; + u32 num_pages_mr = 0; + u32 num_pages_4k = 0; /* 4k portion "pages" */ + + EDEB_EN(7, "pd=%p region=%p mr_access_flags=%x udata=%p", + pd, region, mr_access_flags, udata); + + EHCA_CHECK_PD_P(pd); + if (ehca_adr_bad(region)) { + EDEB_ERR(4, "bad input values: region=%p", region); + ib_mr = ERR_PTR(-EINVAL); + goto reg_user_mr_exit0; + } + if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || + ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE))) { + /* Remote Write Access requires Local Write Access */ + /* Remote Atomic Access requires Local Write Access */ + EDEB_ERR(4, "bad input values: mr_access_flags=%x", + mr_access_flags); + ib_mr = ERR_PTR(-EINVAL); + goto reg_user_mr_exit0; + } + EDEB(7, "user_base=%lx virt_base=%lx length=%lx offset=%x page_size=%x " + "chunk_list.next=%p", + region->user_base, region->virt_base, region->length, + region->offset, region->page_size, region->chunk_list.next); + if (region->page_size != 
PAGE_SIZE) { + EDEB_ERR(4, "page size not supported, region->page_size=%x", + region->page_size); + ib_mr = ERR_PTR(-EINVAL); + goto reg_user_mr_exit0; + } + + if ((region->length == 0) || + ((region->virt_base + region->length) < region->virt_base)) { + EDEB_ERR(4, "bad input values: length=%lx virt_base=%lx", + region->length, region->virt_base); + ib_mr = ERR_PTR(-EINVAL); + goto reg_user_mr_exit0; + } + + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + e_mr = ehca_mr_new(); + if (!e_mr) { + EDEB_ERR(4, "out of memory"); + ib_mr = ERR_PTR(-ENOMEM); + goto reg_user_mr_exit0; + } + + /* determine number of MR pages */ + num_pages_mr = ( ((region->virt_base % PAGE_SIZE) + region->length + + PAGE_SIZE - 1) / PAGE_SIZE ); + num_pages_4k = ( ((region->virt_base % EHCA_PAGESIZE) + region->length + + EHCA_PAGESIZE - 1) / EHCA_PAGESIZE ); + + /* register MR on HCA */ + pginfo.type = EHCA_MR_PGI_USER; + pginfo.num_pages = num_pages_mr; + pginfo.num_4k = num_pages_4k; + pginfo.region = region; + pginfo.next_4k = region->offset / EHCA_PAGESIZE; + pginfo.next_chunk = list_prepare_entry(pginfo.next_chunk, + (&region->chunk_list), + list); + + retcode = ehca_reg_mr(shca, e_mr, (u64 *)region->virt_base, + region->length, mr_access_flags, e_pd, &pginfo, + &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey); + if (retcode != 0) { + ib_mr = ERR_PTR(retcode); + goto reg_user_mr_exit1; + } + + /* successful registration of all pages */ + ib_mr = &e_mr->ib.ib_mr; + goto reg_user_mr_exit0; + +reg_user_mr_exit1: + ehca_mr_delete(e_mr); +reg_user_mr_exit0: + if (IS_ERR(ib_mr)) + EDEB_EX(4, "rc=%lx pd=%p region=%p mr_access_flags=%x " + "udata=%p", + PTR_ERR(ib_mr), pd, region, mr_access_flags, udata); + else + EDEB_EX(7, "ib_mr=%p lkey=%x rkey=%x", + ib_mr, ib_mr->lkey, ib_mr->rkey); + return (ib_mr); +} /* end ehca_reg_user_mr() */ + +/*----------------------------------------------------------------------*/
+/*----------------------------------------------------------------------*/ + +int ehca_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + int retcode = 0; + struct ehca_shca *shca = NULL; + struct ehca_mr *e_mr = NULL; + u64 new_size = 0; + u64 *new_start = NULL; + u32 new_acl = 0; + struct ehca_pd *new_pd = NULL; + u32 tmp_lkey = 0; + u32 tmp_rkey = 0; + unsigned long sl_flags; + u32 num_pages_mr = 0; + u32 num_pages_4k = 0; /* 4k portion "pages" */ + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + + EDEB_EN(7, "mr=%p mr_rereg_mask=%x pd=%p phys_buf_array=%p " + "num_phys_buf=%x mr_access_flags=%x iova_start=%p", + mr, mr_rereg_mask, pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + EHCA_CHECK_MR(mr); + my_pd = container_of(mr->pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject!=NULL && my_pd->ib_pd.uobject->context!=NULL && + my_pd->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + retcode = -EINVAL; + goto rereg_phys_mr_exit0; + } + + if (!(mr_rereg_mask & IB_MR_REREG_TRANS)) { + /* TODO not supported, because PHYP rereg hCall needs pages*/ + /* TODO: We will follow this with Tom ....*/ + EDEB_ERR(4, "rereg without IB_MR_REREG_TRANS not supported yet," + " mr_rereg_mask=%x", mr_rereg_mask); + retcode = -EINVAL; + goto rereg_phys_mr_exit0; + } + + e_mr = container_of(mr, struct ehca_mr, ib.ib_mr); + if (mr_rereg_mask & IB_MR_REREG_PD) { + EHCA_CHECK_PD(pd); + } + + if ((mr_rereg_mask & + ~(IB_MR_REREG_TRANS | IB_MR_REREG_PD | IB_MR_REREG_ACCESS)) || + (mr_rereg_mask == 0)) { + retcode = -EINVAL; + goto rereg_phys_mr_exit0; + } + + shca = container_of(mr->device, struct ehca_shca, ib_device); + + /* check other parameters */ + if (e_mr == shca->maxmr) { + /* should be impossible, 
however reject to be sure */ + EDEB_ERR(3, "rereg internal max-MR impossible, mr=%p " + "shca->maxmr=%p mr->lkey=%x", + mr, shca->maxmr, mr->lkey); + retcode = -EINVAL; + goto rereg_phys_mr_exit0; + } + if (mr_rereg_mask & IB_MR_REREG_TRANS) { /* transl., i.e. addr/size */ + if (e_mr->flags & EHCA_MR_FLAG_FMR) { + EDEB_ERR(4, "not supported for FMR, mr=%p flags=%x", + mr, e_mr->flags); + retcode = -EINVAL; + goto rereg_phys_mr_exit0; + } + if (ehca_adr_bad(phys_buf_array) || num_phys_buf <= 0) { + EDEB_ERR(4, "bad input values: mr_rereg_mask=%x " + "phys_buf_array=%p num_phys_buf=%x", + mr_rereg_mask, phys_buf_array, num_phys_buf); + retcode = -EINVAL; + goto rereg_phys_mr_exit0; + } + } + if ((mr_rereg_mask & IB_MR_REREG_ACCESS) && /* change ACL */ + (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || + ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)))) { + /* Remote Write Access requires Local Write Access */ + /* Remote Atomic Access requires Local Write Access */ + EDEB_ERR(4, "bad input values: mr_rereg_mask=%x " + "mr_access_flags=%x", mr_rereg_mask, mr_access_flags); + retcode = -EINVAL; + goto rereg_phys_mr_exit0; + } + + /* set requested values dependent on rereg request */ + spin_lock_irqsave(&e_mr->mrlock, sl_flags); /* get lock TODO for MR */ + new_start = e_mr->start; /* new == old address */ + new_size = e_mr->size; /* new == old length */ + new_acl = e_mr->acl; /* new == old access control */ + new_pd = container_of(mr->pd,struct ehca_pd,ib_pd); /*new == old PD*/ + + if (mr_rereg_mask & IB_MR_REREG_TRANS) { + new_start = iova_start; /* change address */ + /* check physical buffer list and calculate size */ + retcode = ehca_mr_chk_buf_and_calc_size(phys_buf_array, + num_phys_buf, + iova_start, &new_size); + if (retcode != 0) + goto rereg_phys_mr_exit1; + if ((new_size == 0) || + (((u64)iova_start + new_size) < (u64)iova_start)) { + EDEB_ERR(4, "bad input values: 
new_size=%lx " + "iova_start=%p", new_size, iova_start); + retcode = -EINVAL; + goto rereg_phys_mr_exit1; + } + num_pages_mr = ( (((u64)new_start % PAGE_SIZE) + new_size + + PAGE_SIZE - 1) / PAGE_SIZE ); + num_pages_4k = ( (((u64)new_start % EHCA_PAGESIZE) + new_size + + EHCA_PAGESIZE - 1) / EHCA_PAGESIZE ); + pginfo.type = EHCA_MR_PGI_PHYS; + pginfo.num_pages = num_pages_mr; + pginfo.num_4k = num_pages_4k; + pginfo.num_phys_buf = num_phys_buf; + pginfo.phys_buf_array = phys_buf_array; + pginfo.next_4k = ( ((u64)iova_start & ~PAGE_MASK) / + EHCA_PAGESIZE); + } + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + new_acl = mr_access_flags; + if (mr_rereg_mask & IB_MR_REREG_PD) + new_pd = container_of(pd, struct ehca_pd, ib_pd); + + EDEB(7, "mr=%p new_start=%p new_size=%lx new_acl=%x new_pd=%p " + "num_pages_mr=%x num_pages_4k=%x", e_mr, new_start, new_size, + new_acl, new_pd, num_pages_mr, num_pages_4k); + + retcode = ehca_rereg_mr(shca, e_mr, new_start, new_size, new_acl, + new_pd, &pginfo, &tmp_lkey, &tmp_rkey); + if (retcode != 0) + goto rereg_phys_mr_exit1; + + /* successful reregistration */ + if (mr_rereg_mask & IB_MR_REREG_PD) + mr->pd = pd; + mr->lkey = tmp_lkey; + mr->rkey = tmp_rkey; + +rereg_phys_mr_exit1: + spin_unlock_irqrestore(&e_mr->mrlock, sl_flags); /* free spin lock */ +rereg_phys_mr_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x mr=%p mr_rereg_mask=%x pd=%p " + "phys_buf_array=%p num_phys_buf=%x mr_access_flags=%x " + "iova_start=%p", + retcode, mr, mr_rereg_mask, pd, phys_buf_array, + num_phys_buf, mr_access_flags, iova_start); + else + EDEB_EX(7, "mr=%p mr_rereg_mask=%x pd=%p phys_buf_array=%p " + "num_phys_buf=%x mr_access_flags=%x iova_start=%p", + mr, mr_rereg_mask, pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + return (retcode); +} /* end ehca_rereg_phys_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int 
ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) +{ + int retcode = 0; + u64 rc = H_SUCCESS; + struct ehca_shca *shca = NULL; + struct ehca_mr *e_mr = NULL; + struct ipz_pd fwpd; /* Firmware PD */ + u32 access_ctrl = 0; + u64 tmp_remote_size = 0; + u64 tmp_remote_len = 0; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + + unsigned long sl_flags; + + EDEB_EN(7, "mr=%p mr_attr=%p", mr, mr_attr); + + EHCA_CHECK_MR(mr); + + my_pd = container_of(mr->pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject!=NULL && my_pd->ib_pd.uobject->context!=NULL && + my_pd->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + retcode = -EINVAL; + goto query_mr_exit0; + } + + e_mr = container_of(mr, struct ehca_mr, ib.ib_mr); + if (ehca_adr_bad(mr_attr)) { + EDEB_ERR(4, "bad input values: mr_attr=%p", mr_attr); + retcode = -EINVAL; + goto query_mr_exit0; + } + if ((e_mr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not supported for FMR, mr=%p e_mr=%p " + "e_mr->flags=%x", mr, e_mr, e_mr->flags); + retcode = -EINVAL; + goto query_mr_exit0; + } + + shca = container_of(mr->device, struct ehca_shca, ib_device); + memset(mr_attr, 0, sizeof(struct ib_mr_attr)); + spin_lock_irqsave(&e_mr->mrlock, sl_flags); /* get spin lock TODO?? 
*/ + + rc = hipz_h_query_mr(shca->ipz_hca_handle, &e_mr->pf, + &e_mr->ipz_mr_handle, &mr_attr->size, + &mr_attr->device_virt_addr, &tmp_remote_size, + &tmp_remote_len, &access_ctrl, &fwpd, + &mr_attr->lkey, &mr_attr->rkey); + if (rc != H_SUCCESS) { + EDEB_ERR(4, "hipz_mr_query failed, rc=%lx mr=%p " + "hca_hndl=%lx mr_hndl=%lx lkey=%x", + rc, mr, shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle, mr->lkey); + retcode = ehca_mrmw_map_rc_query_mr(rc); + goto query_mr_exit1; + } + ehca_mrmw_reverse_map_acl(&access_ctrl, &mr_attr->mr_access_flags); + mr_attr->pd = mr->pd; + +query_mr_exit1: + spin_unlock_irqrestore(&e_mr->mrlock, sl_flags); /* free spin lock */ +query_mr_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x mr=%p mr_attr=%p", retcode, mr, mr_attr); + else + EDEB_EX(7, "pd=%p device_virt_addr=%lx size=%lx " + "mr_access_flags=%x lkey=%x rkey=%x", + mr_attr->pd, mr_attr->device_virt_addr, + mr_attr->size, mr_attr->mr_access_flags, + mr_attr->lkey, mr_attr->rkey); + return (retcode); +} /* end ehca_query_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_dereg_mr(struct ib_mr *mr) +{ + int retcode = 0; + u64 rc = H_SUCCESS; + struct ehca_shca *shca = NULL; + struct ehca_mr *e_mr = NULL; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + + EDEB_EN(7, "mr=%p", mr); + + EHCA_CHECK_MR(mr); + my_pd = container_of(mr->pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject!=NULL && my_pd->ib_pd.uobject->context!=NULL && + my_pd->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + retcode = -EINVAL; + goto dereg_mr_exit0; + } + + e_mr = container_of(mr, struct ehca_mr, ib.ib_mr); + shca = container_of(mr->device, struct ehca_shca, ib_device); + + if ((e_mr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not supported for FMR, mr=%p e_mr=%p " + "e_mr->flags=%x", mr, e_mr, e_mr->flags); + 
retcode = -EINVAL; + goto dereg_mr_exit0; + } else if (e_mr == shca->maxmr) { + /* should be impossible, however reject to be sure */ + EDEB_ERR(3, "dereg internal max-MR impossible, mr=%p " + "shca->maxmr=%p mr->lkey=%x", + mr, shca->maxmr, mr->lkey); + retcode = -EINVAL; + goto dereg_mr_exit0; + } + + /* TODO: BUSY: MR still has bound window(s) */ + rc = hipz_h_free_resource_mr(shca->ipz_hca_handle, &e_mr->pf, + &e_mr->ipz_mr_handle); + if (rc != H_SUCCESS) { + EDEB_ERR(4, "hipz_free_mr failed, rc=%lx shca=%p e_mr=%p" + " hca_hndl=%lx mr_hndl=%lx mr->lkey=%x", + rc, shca, e_mr, shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle, mr->lkey); + retcode = ehca_mrmw_map_rc_free_mr(rc); + goto dereg_mr_exit0; + } + + /* successful deregistration */ + ehca_mr_delete(e_mr); + +dereg_mr_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x mr=%p", retcode, mr); + else + EDEB_EX(7, ""); + return (retcode); +} /* end ehca_dereg_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +struct ib_mw *ehca_alloc_mw(struct ib_pd *pd) +{ + struct ib_mw *ib_mw = NULL; + u64 rc = H_SUCCESS; + struct ehca_shca *shca = NULL; + struct ehca_mw *e_mw = NULL; + struct ehca_pd *e_pd = NULL; + + EDEB_EN(7, "pd=%p", pd); + + EHCA_CHECK_PD_P(pd); + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + e_mw = ehca_mw_new(); + if (!e_mw) { + ib_mw = ERR_PTR(-ENOMEM); + goto alloc_mw_exit0; + } + + rc = hipz_h_alloc_resource_mw(shca->ipz_hca_handle, &e_mw->pf, + &shca->pf, e_pd->fw_pd, + &e_mw->ipz_mw_handle, &e_mw->ib_mw.rkey); + if (rc != H_SUCCESS) { + EDEB_ERR(4, "hipz_mw_allocate failed, rc=%lx shca=%p " + "hca_hndl=%lx mw=%p", rc, shca, + shca->ipz_hca_handle.handle, e_mw); + ib_mw = ERR_PTR(ehca_mrmw_map_rc_alloc(rc)); + goto alloc_mw_exit1; + } + /* save R_Key in local copy */ + /* TODO????? 
mw->rkey = *rkey_p; */ + + /* successful MW allocation */ + ib_mw = &e_mw->ib_mw; + goto alloc_mw_exit0; + +alloc_mw_exit1: + ehca_mw_delete(e_mw); +alloc_mw_exit0: + if (IS_ERR(ib_mw)) + EDEB_EX(4, "rc=%lx pd=%p", PTR_ERR(ib_mw), pd); + else + EDEB_EX(7, "ib_mw=%p rkey=%x", ib_mw, ib_mw->rkey); + return (ib_mw); +} /* end ehca_alloc_mw() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + int retcode = 0; + + /* TODO: not supported up to now */ + EDEB_ERR(4, "bind MW currently not supported by HCAD"); + retcode = -EPERM; + goto bind_mw_exit0; + +bind_mw_exit0: + if (retcode) + EDEB_EX(4, "rc=%x qp=%p mw=%p mw_bind=%p", + retcode, qp, mw, mw_bind); + else + EDEB_EX(7, "qp=%p mw=%p mw_bind=%p", qp, mw, mw_bind); + return (retcode); +} /* end ehca_bind_mw() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_dealloc_mw(struct ib_mw *mw) +{ + int retcode = 0; + u64 rc = H_SUCCESS; + struct ehca_shca *shca = NULL; + struct ehca_mw *e_mw = NULL; + + EDEB_EN(7, "mw=%p", mw); + + EHCA_CHECK_MW(mw); + e_mw = container_of(mw, struct ehca_mw, ib_mw); + shca = container_of(mw->device, struct ehca_shca, ib_device); + + rc = hipz_h_free_resource_mw(shca->ipz_hca_handle, &e_mw->pf, + &e_mw->ipz_mw_handle); + if (rc != H_SUCCESS) { + EDEB_ERR(4, "hipz_free_mw failed, rc=%lx shca=%p mw=%p " + "rkey=%x hca_hndl=%lx mw_hndl=%lx", + rc, shca, mw, mw->rkey, shca->ipz_hca_handle.handle, + e_mw->ipz_mw_handle.handle); + retcode = ehca_mrmw_map_rc_free_mw(rc); + goto dealloc_mw_exit0; + } + /* successful deallocation */ + ehca_mw_delete(e_mw); + +dealloc_mw_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x mw=%p", retcode, mw); + else + EDEB_EX(7, ""); + return (retcode); +} /* end 
ehca_dealloc_mw() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct ib_fmr *ib_fmr = NULL; + struct ehca_shca *shca = NULL; + struct ehca_mr *e_fmr = NULL; + int retcode = 0; + struct ehca_pd *e_pd = NULL; + u32 tmp_lkey = 0; + u32 tmp_rkey = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + + EDEB_EN(7, "pd=%p mr_access_flags=%x fmr_attr=%p", + pd, mr_access_flags, fmr_attr); + + EHCA_CHECK_PD_P(pd); + if (ehca_adr_bad(fmr_attr)) { + EDEB_ERR(4, "bad input values: fmr_attr=%p", fmr_attr); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + + EDEB(7, "max_pages=%x max_maps=%x page_shift=%x", + fmr_attr->max_pages, fmr_attr->max_maps, fmr_attr->page_shift); + + /* check other parameters */ + if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || + ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE))) { + /* Remote Write Access requires Local Write Access */ + /* Remote Atomic Access requires Local Write Access */ + EDEB_ERR(4, "bad input values: mr_access_flags=%x", + mr_access_flags); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + if (mr_access_flags & IB_ACCESS_MW_BIND) { + EDEB_ERR(4, "bad input values: mr_access_flags=%x", + mr_access_flags); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + if ((fmr_attr->max_pages == 0) || (fmr_attr->max_maps == 0)) { + EDEB_ERR(4, "bad input values: fmr_attr->max_pages=%x " + "fmr_attr->max_maps=%x fmr_attr->page_shift=%x", + fmr_attr->max_pages, fmr_attr->max_maps, + fmr_attr->page_shift); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + if ( ((1 << fmr_attr->page_shift) != EHCA_PAGESIZE) && + ((1 << fmr_attr->page_shift) != PAGE_SIZE) ) { + EDEB_ERR(4, 
"unsupported fmr_attr->page_shift=%x", + fmr_attr->page_shift); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + e_fmr = ehca_mr_new(); + if (!e_fmr) { + ib_fmr = ERR_PTR(-ENOMEM); + goto alloc_fmr_exit0; + } + e_fmr->flags |= EHCA_MR_FLAG_FMR; + + /* register MR on HCA */ + retcode = ehca_reg_mr(shca, e_fmr, NULL, + fmr_attr->max_pages * (1 << fmr_attr->page_shift), + mr_access_flags, e_pd, &pginfo, + &tmp_lkey, &tmp_rkey); + if (retcode != 0) { + ib_fmr = ERR_PTR(retcode); + goto alloc_fmr_exit1; + } + + /* successful */ + e_fmr->fmr_page_size = 1 << fmr_attr->page_shift; + e_fmr->fmr_max_pages = fmr_attr->max_pages; + e_fmr->fmr_max_maps = fmr_attr->max_maps; + e_fmr->fmr_map_cnt = 0; + ib_fmr = &e_fmr->ib.ib_fmr; + goto alloc_fmr_exit0; + +alloc_fmr_exit1: + ehca_mr_delete(e_fmr); +alloc_fmr_exit0: + if (IS_ERR(ib_fmr)) + EDEB_EX(4, "rc=%lx pd=%p mr_access_flags=%x " + "fmr_attr=%p", PTR_ERR(ib_fmr), pd, + mr_access_flags, fmr_attr); + else + EDEB_EX(7, "ib_fmr=%p tmp_lkey=%x tmp_rkey=%x", + ib_fmr, tmp_lkey, tmp_rkey); + return (ib_fmr); +} /* end ehca_alloc_fmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, + int list_len, + u64 iova) +{ + int retcode = 0; + struct ehca_shca *shca = NULL; + struct ehca_mr *e_fmr = NULL; + struct ehca_pd *e_pd = NULL; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + u32 tmp_lkey = 0; + u32 tmp_rkey = 0; + /* TODO unsigned long sl_flags; */ + + EDEB_EN(7, "fmr=%p page_list=%p list_len=%x iova=%lx", + fmr, page_list, list_len, iova); + + EHCA_CHECK_FMR(fmr); + e_fmr = container_of(fmr, struct ehca_mr, ib.ib_fmr); + shca = container_of(fmr->device, struct ehca_shca, ib_device); + e_pd = 
container_of(fmr->pd, struct ehca_pd, ib_pd); + + if (!(e_fmr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not a FMR, e_fmr=%p e_fmr->flags=%x", + e_fmr, e_fmr->flags); + retcode = -EINVAL; + goto map_phys_fmr_exit0; + } + retcode = ehca_fmr_check_page_list(e_fmr, page_list, list_len); + if (retcode != 0) + goto map_phys_fmr_exit0; + if (iova % e_fmr->fmr_page_size) { + /* only whole-numbered pages */ + EDEB_ERR(4, "bad iova, iova=%lx fmr_page_size=%x", + iova, e_fmr->fmr_page_size); + retcode = -EINVAL; + goto map_phys_fmr_exit0; + } + if (e_fmr->fmr_map_cnt >= e_fmr->fmr_max_maps) { + /* HCAD does not limit the maps, however trace this anyway */ + EDEB(6, "map limit exceeded, fmr=%p e_fmr->fmr_map_cnt=%x " + "e_fmr->fmr_max_maps=%x", + fmr, e_fmr->fmr_map_cnt, e_fmr->fmr_max_maps); + } + + pginfo.type = EHCA_MR_PGI_FMR; + pginfo.num_pages = list_len; + pginfo.page_list = page_list; + pginfo.next_4k = ( (iova & (e_fmr->fmr_page_size-1)) / + EHCA_PAGESIZE); + + /* TODO spin_lock_irqsave(&e_fmr->mrlock, sl_flags); */ + + retcode = ehca_rereg_mr(shca, e_fmr, (u64 *)iova, + list_len * e_fmr->fmr_page_size, + e_fmr->acl, e_pd, &pginfo, + &tmp_lkey, &tmp_rkey); + if (retcode != 0) { + /* TODO spin_unlock_irqrestore(&fmr->mrlock, sl_flags); */ + goto map_phys_fmr_exit0; + } + /* successful reregistration */ + e_fmr->fmr_map_cnt++; + /* TODO spin_unlock_irqrestore(&fmr->mrlock, sl_flags); */ + + e_fmr->ib.ib_fmr.lkey = tmp_lkey; + e_fmr->ib.ib_fmr.rkey = tmp_rkey; + +map_phys_fmr_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x fmr=%p page_list=%p list_len=%x " + "iova=%lx", + retcode, fmr, page_list, list_len, iova); + else + EDEB_EX(7, "lkey=%x rkey=%x", + e_fmr->ib.ib_fmr.lkey, e_fmr->ib.ib_fmr.rkey); + return (retcode); +} /* end ehca_map_phys_fmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_unmap_fmr(struct list_head *fmr_list) +{ + int retcode 
= 0; + struct ib_fmr *ib_fmr = NULL; + struct ehca_shca *shca = NULL; + struct ehca_shca *prev_shca = NULL; + struct ehca_mr *e_fmr = NULL; + u32 num_fmr = 0; + u32 unmap_fmr_cnt = 0; + /* TODO unsigned long sl_flags; */ + + EDEB_EN(7, "fmr_list=%p", fmr_list); + + /* check all FMR belong to same SHCA, and check internal flag */ + list_for_each_entry(ib_fmr, fmr_list, list) { + prev_shca = shca; + shca = container_of(ib_fmr->device, struct ehca_shca, + ib_device); + EHCA_CHECK_FMR(ib_fmr); + e_fmr = container_of(ib_fmr, struct ehca_mr, ib.ib_fmr); + if ((shca != prev_shca) && (prev_shca != 0)) { + EDEB_ERR(4, "SHCA mismatch, shca=%p prev_shca=%p " + "e_fmr=%p", shca, prev_shca, e_fmr); + retcode = -EINVAL; + goto unmap_fmr_exit0; + } + if (!(e_fmr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not a FMR, e_fmr=%p e_fmr->flags=%x", + e_fmr, e_fmr->flags); + retcode = -EINVAL; + goto unmap_fmr_exit0; + } + num_fmr++; + } + + /* loop over all FMRs to unmap */ + list_for_each_entry(ib_fmr, fmr_list, list) { + unmap_fmr_cnt++; + e_fmr = container_of(ib_fmr, struct ehca_mr, ib.ib_fmr); + shca = container_of(ib_fmr->device, struct ehca_shca, + ib_device); + /* TODO??? spin_lock_irqsave(&fmr->mrlock, sl_flags); */ + retcode = ehca_unmap_one_fmr(shca, e_fmr); + /* TODO???? 
spin_unlock_irqrestore(&fmr->mrlock, sl_flags); */ + if (retcode != 0) { + /* unmap failed, stop unmapping of rest of FMRs */ + EDEB_ERR(4, "unmap of one FMR failed, stop rest, " + "e_fmr=%p num_fmr=%x unmap_fmr_cnt=%x lkey=%x", + e_fmr, num_fmr, unmap_fmr_cnt, + e_fmr->ib.ib_fmr.lkey); + goto unmap_fmr_exit0; + } + } + +unmap_fmr_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x fmr_list=%p num_fmr=%x unmap_fmr_cnt=%x", + retcode, fmr_list, num_fmr, unmap_fmr_cnt); + else + EDEB_EX(7, "num_fmr=%x", num_fmr); + return (retcode); +} /* end ehca_unmap_fmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_dealloc_fmr(struct ib_fmr *fmr) +{ + int retcode = 0; + u64 rc = H_SUCCESS; + struct ehca_shca *shca = NULL; + struct ehca_mr *e_fmr = NULL; + + EDEB_EN(7, "fmr=%p", fmr); + + EHCA_CHECK_FMR(fmr); + e_fmr = container_of(fmr, struct ehca_mr, ib.ib_fmr); + shca = container_of(fmr->device, struct ehca_shca, ib_device); + + if (!(e_fmr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not a FMR, e_fmr=%p e_fmr->flags=%x", + e_fmr, e_fmr->flags); + retcode = -EINVAL; + goto free_fmr_exit0; + } + + rc = hipz_h_free_resource_mr(shca->ipz_hca_handle, &e_fmr->pf, + &e_fmr->ipz_mr_handle); + if (rc != H_SUCCESS) { + EDEB_ERR(4, "hipz_free_mr failed, rc=%lx e_fmr=%p " + "hca_hndl=%lx fmr_hndl=%lx fmr->lkey=%x", + rc, e_fmr, shca->ipz_hca_handle.handle, + e_fmr->ipz_mr_handle.handle, fmr->lkey); + /* propagate the mapped error; otherwise retcode stays 0 on failure */ + retcode = ehca_mrmw_map_rc_free_mr(rc); + goto free_fmr_exit0; + } + /* successful deregistration */ + ehca_mr_delete(e_fmr); + +free_fmr_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x fmr=%p", retcode, fmr); + else + EDEB_EX(7, ""); + return (retcode); +} /* end ehca_dealloc_fmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_reg_mr(struct ehca_shca *shca, + 
struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + int acl, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, /*OUT*/ + u32 *rkey) /*OUT*/ +{ + int retcode = 0; + u64 rc = H_SUCCESS; + struct ehca_pfmr *pfmr = &e_mr->pf; + u32 hipz_acl = 0; + + EDEB_EN(7, "shca=%p e_mr=%p iova_start=%p size=%lx acl=%x e_pd=%p " + "pginfo=%p num_pages=%lx num_4k=%lx", shca, e_mr, iova_start, + size, acl, e_pd, pginfo, pginfo->num_pages, pginfo->num_4k); + + ehca_mrmw_map_acl(acl, &hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + if (ehca_use_hp_mr == 1) + hipz_acl |= 0x00000001; + + rc = hipz_h_alloc_resource_mr(shca->ipz_hca_handle, pfmr, &shca->pf, + (u64)iova_start, size, hipz_acl, + e_pd->fw_pd, &e_mr->ipz_mr_handle, + lkey, rkey); + if (rc != H_SUCCESS) { + EDEB_ERR(4, "hipz_alloc_mr failed, rc=%lx hca_hndl=%lx " + "mr_hndl=%lx", rc, shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle); + retcode = ehca_mrmw_map_rc_alloc(rc); + goto ehca_reg_mr_exit0; + } + + retcode = ehca_reg_mr_rpages(shca, e_mr, pginfo); + if (retcode != 0) + goto ehca_reg_mr_exit1; + + /* successful registration */ + e_mr->num_pages = pginfo->num_pages; + e_mr->num_4k = pginfo->num_4k; + e_mr->start = iova_start; + e_mr->size = size; + e_mr->acl = acl; + goto ehca_reg_mr_exit0; + +ehca_reg_mr_exit1: + rc = hipz_h_free_resource_mr(shca->ipz_hca_handle, pfmr, + &e_mr->ipz_mr_handle); + if (rc != H_SUCCESS) { + EDEB_ERR(1, "rc=%lx shca=%p e_mr=%p iova_start=%p " + "size=%lx acl=%x e_pd=%p lkey=%x pginfo=%p " + "num_pages=%lx num_4k=%lx retcode=%x", rc, shca, e_mr, + iova_start, size, acl, e_pd, *lkey, pginfo, + pginfo->num_pages, pginfo->num_4k, retcode); + EDEB_ERR(1, "internal error in ehca_reg_mr, not recoverable"); + } +ehca_reg_mr_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x shca=%p e_mr=%p iova_start=%p " + "size=%lx acl=%x e_pd=%p pginfo=%p num_pages=%lx " + "num_4k=%lx", retcode, shca, e_mr, iova_start, size, + acl, e_pd, pginfo, pginfo->num_pages, 
pginfo->num_4k); + else + EDEB_EX(7, "retcode=%x lkey=%x rkey=%x", retcode, *lkey, *rkey); + return (retcode); +} /* end ehca_reg_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_reg_mr_rpages(struct ehca_shca *shca, + struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo) +{ + int retcode = 0; + u64 rc = H_SUCCESS; + struct ehca_pfmr *pfmr = &e_mr->pf; + u32 rnum = 0; + u64 rpage = 0; + u32 i; + u64 *kpage = NULL; + + EDEB_EN(7, "shca=%p e_mr=%p pginfo=%p num_pages=%lx num_4k=%lx", + shca, e_mr, pginfo, pginfo->num_pages, pginfo->num_4k); + + kpage = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!kpage) { + EDEB_ERR(4, "kpage alloc failed"); + retcode = -ENOMEM; + goto ehca_reg_mr_rpages_exit0; + } + + /* max 512 pages per shot */ + for (i = 0; i < ((pginfo->num_4k + 512 - 1) / 512); i++) { + + if (i == ((pginfo->num_4k + 512 - 1) / 512) - 1) { + rnum = pginfo->num_4k % 512; /* last shot */ + if (rnum == 0) + rnum = 512; /* last shot is full */ + } else + rnum = 512; + + if (rnum > 1) { + retcode = ehca_set_pagebuf(e_mr, pginfo, rnum, kpage); + if (retcode) { + EDEB_ERR(4, "ehca_set_pagebuf bad rc, " + "retcode=%x rnum=%x kpage=%p", + retcode, rnum, kpage); + retcode = -EFAULT; + goto ehca_reg_mr_rpages_exit1; + } + rpage = virt_to_abs(kpage); + if (rpage == 0) { + EDEB_ERR(4, "kpage=%p i=%x", kpage, i); + retcode = -EFAULT; + goto ehca_reg_mr_rpages_exit1; + } + } else { /* rnum==1 */ + retcode = ehca_set_pagebuf_1(e_mr, pginfo, &rpage); + if (retcode) { + EDEB_ERR(4, "ehca_set_pagebuf_1 bad rc, " + "retcode=%x i=%x", retcode, i); + retcode = -EFAULT; + goto ehca_reg_mr_rpages_exit1; + } + } + + EDEB(9, "i=%x rnum=%x rpage=%lx", i, rnum, rpage); + + rc = hipz_h_register_rpage_mr(shca->ipz_hca_handle, + &e_mr->ipz_mr_handle, pfmr, + &shca->pf, + 0, /* pagesize hardcoded to 4k */ + 0, rpage, rnum); + + if (i == ((pginfo->num_4k + 512 - 1) / 512) 
- 1) { + /* check for 'registration complete'==H_SUCCESS */ + /* and for 'page registered'==H_PAGE_REGISTERED */ + if (rc != H_SUCCESS) { + EDEB_ERR(4, "last hipz_reg_rpage_mr failed, " + "rc=%lx e_mr=%p i=%x hca_hndl=%lx " + "mr_hndl=%lx lkey=%x", rc, e_mr, i, + shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle, + e_mr->ib.ib_mr.lkey); + retcode = ehca_mrmw_map_rc_rrpg_last(rc); + break; + } else + retcode = 0; + } else if (rc != H_PAGE_REGISTERED) { + EDEB_ERR(4, "hipz_reg_rpage_mr failed, rc=%lx e_mr=%p " + "i=%x lkey=%x hca_hndl=%lx mr_hndl=%lx", + rc, e_mr, i, e_mr->ib.ib_mr.lkey, + shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle); + retcode = ehca_mrmw_map_rc_rrpg_notlast(rc); + break; + } else + retcode = 0; + } /* end for(i) */ + + +ehca_reg_mr_rpages_exit1: + kfree(kpage); +ehca_reg_mr_rpages_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x shca=%p e_mr=%p pginfo=%p " + "num_pages=%lx num_4k=%lx", retcode, shca, e_mr, + pginfo, pginfo->num_pages, pginfo->num_4k); + else + EDEB_EX(7, "retcode=%x", retcode); + return (retcode); +} /* end ehca_reg_mr_rpages() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, + struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + u32 acl, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, /*OUT*/ + u32 *rkey) /*OUT*/ +{ + int retcode = 0; + u64 rc = H_SUCCESS; + struct ehca_pfmr *pfmr = &e_mr->pf; + u64 iova_start_out = 0; + u32 hipz_acl = 0; + u64 *kpage = NULL; + u64 rpage = 0; + struct ehca_mr_pginfo pginfo_save; + + EDEB_EN(7, "shca=%p e_mr=%p iova_start=%p size=%lx acl=%x " + "e_pd=%p pginfo=%p num_pages=%lx num_4k=%lx", shca, e_mr, + iova_start, size, acl, e_pd, pginfo, pginfo->num_pages, + pginfo->num_4k); + + ehca_mrmw_map_acl(acl, &hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + + kpage = 
kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!kpage) { + EDEB_ERR(4, "kpage alloc failed"); + retcode = -ENOMEM; + goto ehca_rereg_mr_rereg1_exit0; + } + + pginfo_save = *pginfo; + retcode = ehca_set_pagebuf(e_mr, pginfo, pginfo->num_4k, kpage); + if (retcode != 0) { + EDEB_ERR(4, "set pagebuf failed, e_mr=%p pginfo=%p type=%x " + "num_pages=%lx num_4k=%lx kpage=%p", e_mr, pginfo, + pginfo->type, pginfo->num_pages, pginfo->num_4k,kpage); + goto ehca_rereg_mr_rereg1_exit1; + } + rpage = virt_to_abs(kpage); + if (rpage == 0) { + EDEB_ERR(4, "kpage=%p", kpage); + retcode = -EFAULT; + goto ehca_rereg_mr_rereg1_exit1; + } + rc = hipz_h_reregister_pmr(shca->ipz_hca_handle, pfmr, &shca->pf, + &e_mr->ipz_mr_handle, (u64)iova_start, + size, hipz_acl, e_pd->fw_pd, rpage, + &iova_start_out, lkey, rkey); + if (rc != H_SUCCESS) { + /* reregistration unsuccessful, */ + /* try it again with the 3 hCalls, */ + /* e.g. this is required in case H_MR_CONDITION */ + /* (MW bound or MR is shared) */ + EDEB(6, "hipz_h_reregister_pmr failed (Rereg1), rc=%lx " + "e_mr=%p", rc, e_mr); + *pginfo = pginfo_save; + retcode = -EAGAIN; + } else if ((u64 *)iova_start_out != iova_start) { + EDEB_ERR(4, "PHYP changed iova_start in rereg_pmr, " + "iova_start=%p iova_start_out=%lx e_mr=%p " + "mr_handle=%lx lkey=%x", iova_start, iova_start_out, + e_mr, e_mr->ipz_mr_handle.handle, e_mr->ib.ib_mr.lkey); + retcode = -EFAULT; + } else { + /* successful reregistration */ + /* note: start and start_out are identical for eServer HCAs */ + e_mr->num_pages = pginfo->num_pages; + e_mr->num_4k = pginfo->num_4k; + e_mr->start = iova_start; + e_mr->size = size; + e_mr->acl = acl; + } + +ehca_rereg_mr_rereg1_exit1: + kfree(kpage); +ehca_rereg_mr_rereg1_exit0: + if ((retcode) && (retcode != -EAGAIN)) + EDEB_EX(4, "retcode=%x rc=%lx lkey=%x rkey=%x pginfo=%p " + "num_pages=%lx num_4k=%lx", retcode, rc, *lkey, + *rkey, pginfo, pginfo->num_pages, pginfo->num_4k); + else + EDEB_EX(7, "retcode=%x rc=%lx lkey=%x rkey=%x 
pginfo=%p " + "num_pages=%lx num_4k=%lx", retcode, rc, *lkey, + *rkey, pginfo, pginfo->num_pages, pginfo->num_4k); + return (retcode); +} /* end ehca_rereg_mr_rereg1() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_rereg_mr(struct ehca_shca *shca, + struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + int acl, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, + u32 *rkey) +{ + int retcode = 0; + u64 rc = H_SUCCESS; + struct ehca_pfmr *pfmr = &e_mr->pf; + int Rereg1Hcall = 1; /* 1: use hipz_h_reregister_pmr directly */ + int Rereg3Hcall = 0; /* 1: use 3 hipz calls for reregistration */ + + EDEB_EN(7, "shca=%p e_mr=%p iova_start=%p size=%lx acl=%x " + "e_pd=%p pginfo=%p num_pages=%lx num_4k=%lx", shca, e_mr, + iova_start, size, acl, e_pd, pginfo, pginfo->num_pages, + pginfo->num_4k); + + /* first determine reregistration hCall(s) */ + if ((pginfo->num_4k > 512) || (e_mr->num_4k > 512) || + (pginfo->num_4k > e_mr->num_4k)) { + EDEB(7, "Rereg3 case, pginfo->num_4k=%lx " + "e_mr->num_4k=%x", pginfo->num_4k, e_mr->num_4k); + Rereg1Hcall = 0; + Rereg3Hcall = 1; + } + + if (e_mr->flags & EHCA_MR_FLAG_MAXMR) { /* check for max-MR */ + Rereg1Hcall = 0; + Rereg3Hcall = 1; + e_mr->flags &= ~EHCA_MR_FLAG_MAXMR; + EDEB(4, "Rereg MR for max-MR! 
e_mr=%p", e_mr); + } + + if (Rereg1Hcall) { + retcode = ehca_rereg_mr_rereg1(shca, e_mr, iova_start, size, + acl, e_pd, pginfo, lkey, rkey); + if (retcode != 0) { + if (retcode == -EAGAIN) + Rereg3Hcall = 1; + else + goto ehca_rereg_mr_exit0; + } + } + + if (Rereg3Hcall) { + struct ehca_mr save_mr; + + /* first deregister old MR */ + rc = hipz_h_free_resource_mr(shca->ipz_hca_handle, pfmr, + &e_mr->ipz_mr_handle); + if (rc != H_SUCCESS) { + EDEB_ERR(4, "hipz_free_mr failed, rc=%lx e_mr=%p " + "hca_hndl=%lx mr_hndl=%lx mr->lkey=%x", + rc, e_mr, shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle, + e_mr->ib.ib_mr.lkey); + retcode = ehca_mrmw_map_rc_free_mr(rc); + goto ehca_rereg_mr_exit0; + } + /* clean ehca_mr_t, without changing struct ib_mr and lock */ + save_mr = *e_mr; + ehca_mr_deletenew(e_mr); + + /* set some MR values */ + e_mr->flags = save_mr.flags; + e_mr->fmr_page_size = save_mr.fmr_page_size; + e_mr->fmr_max_pages = save_mr.fmr_max_pages; + e_mr->fmr_max_maps = save_mr.fmr_max_maps; + e_mr->fmr_map_cnt = save_mr.fmr_map_cnt; + + retcode = ehca_reg_mr(shca, e_mr, iova_start, size, acl, + e_pd, pginfo, lkey, rkey); + if (retcode != 0) { + u32 offset = (u64)(&e_mr->flags) - (u64)e_mr; + memcpy(&e_mr->flags, &(save_mr.flags), + sizeof(struct ehca_mr) - offset); + goto ehca_rereg_mr_exit0; + } + } + +ehca_rereg_mr_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x shca=%p e_mr=%p iova_start=%p size=%lx " + "acl=%x e_pd=%p pginfo=%p num_pages=%lx lkey=%x " + "rkey=%x Rereg1Hcall=%x Rereg3Hcall=%x", + retcode, shca, e_mr, iova_start, size, acl, e_pd, + pginfo, pginfo->num_pages, *lkey, *rkey, Rereg1Hcall, + Rereg3Hcall); + else + EDEB_EX(7, "retcode=%x shca=%p e_mr=%p iova_start=%p size=%lx " + "acl=%x e_pd=%p pginfo=%p num_pages=%lx lkey=%x " + "rkey=%x Rereg1Hcall=%x Rereg3Hcall=%x", + retcode, shca, e_mr, iova_start, size, acl, e_pd, + pginfo, pginfo->num_pages, *lkey, *rkey, Rereg1Hcall, + Rereg3Hcall); + + return (retcode); +} /* end ehca_rereg_mr() */ 
+ +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_unmap_one_fmr(struct ehca_shca *shca, + struct ehca_mr *e_fmr) +{ + int retcode = 0; + u64 rc = H_SUCCESS; + struct ehca_pfmr *pfmr = &e_fmr->pf; + int Rereg1Hcall = 1; /* 1: use hipz_mr_reregister directly */ + int Rereg3Hcall = 0; /* 1: use 3 hipz calls for unmapping */ + struct ehca_pd *e_pd = NULL; + struct ehca_mr save_fmr; + u32 tmp_lkey = 0; + u32 tmp_rkey = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + + EDEB_EN(7, "shca=%p e_fmr=%p", shca, e_fmr); + + /* first check if reregistration hCall can be used for unmap */ + if (e_fmr->fmr_max_pages > 512) { + Rereg1Hcall = 0; + Rereg3Hcall = 1; + } + + e_pd = container_of(e_fmr->ib.ib_fmr.pd, struct ehca_pd, ib_pd); + + if (Rereg1Hcall) { + /* note: after using rereg hcall with len=0, */ + /* rereg hcall must be used again for registering pages */ + u64 start_out = 0; + rc = hipz_h_reregister_pmr(shca->ipz_hca_handle, pfmr, + &shca->pf, &e_fmr->ipz_mr_handle, 0, + 0, 0, e_pd->fw_pd, 0, &start_out, + &tmp_lkey, &tmp_rkey); + if (rc != H_SUCCESS) { + /* should not happen, because length checked above, */ + /* FMRs are not shared and no MW bound to FMRs */ + EDEB_ERR(4, "hipz_reregister_pmr failed (Rereg1), " + "rc=%lx e_fmr=%p hca_hndl=%lx mr_hndl=%lx " + "lkey=%x", rc, e_fmr, + shca->ipz_hca_handle.handle, + e_fmr->ipz_mr_handle.handle, + e_fmr->ib.ib_fmr.lkey); + Rereg3Hcall = 1; + } else { + /* successful reregistration */ + e_fmr->start = NULL; + e_fmr->size = 0; + } + } + + if (Rereg3Hcall) { + struct ehca_mr save_mr; + + /* first free old FMR */ + rc = hipz_h_free_resource_mr(shca->ipz_hca_handle, pfmr, + &e_fmr->ipz_mr_handle); + if (rc != H_SUCCESS) { + EDEB_ERR(4, "hipz_free_mr failed, rc=%lx e_fmr=%p " + "hca_hndl=%lx mr_hndl=%lx lkey=%x", rc, e_fmr, + shca->ipz_hca_handle.handle, + e_fmr->ipz_mr_handle.handle, 
+ e_fmr->ib.ib_fmr.lkey); + retcode = ehca_mrmw_map_rc_free_mr(rc); + goto ehca_unmap_one_fmr_exit0; + } + /* clean ehca_mr_t, without changing lock */ + save_fmr = *e_fmr; + ehca_mr_deletenew(e_fmr); + + /* set some MR values */ + e_fmr->flags = save_fmr.flags; + e_fmr->fmr_page_size = save_fmr.fmr_page_size; + e_fmr->fmr_max_pages = save_fmr.fmr_max_pages; + e_fmr->fmr_max_maps = save_fmr.fmr_max_maps; + e_fmr->fmr_map_cnt = save_fmr.fmr_map_cnt; + e_fmr->acl = save_fmr.acl; + + pginfo.type = EHCA_MR_PGI_FMR; + pginfo.num_pages = 0; + pginfo.num_4k = 0; + retcode = ehca_reg_mr(shca, e_fmr, NULL, + (e_fmr->fmr_max_pages * + e_fmr->fmr_page_size), + e_fmr->acl, e_pd, &pginfo, &tmp_lkey, + &tmp_rkey); + if (retcode != 0) { + /* restore from save_fmr, the copy taken above */ + u32 offset = (u64)(&e_fmr->flags) - (u64)e_fmr; + memcpy(&e_fmr->flags, &(save_fmr.flags), + sizeof(struct ehca_mr) - offset); + goto ehca_unmap_one_fmr_exit0; + } + } + +ehca_unmap_one_fmr_exit0: + EDEB_EX(7, "retcode=%x tmp_lkey=%x tmp_rkey=%x fmr_max_pages=%x " + "Rereg1Hcall=%x Rereg3Hcall=%x", retcode, tmp_lkey, tmp_rkey, + e_fmr->fmr_max_pages, Rereg1Hcall, Rereg3Hcall); + return (retcode); +} /* end ehca_unmap_one_fmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_reg_smr(struct ehca_shca *shca, + struct ehca_mr *e_origmr, + struct ehca_mr *e_newmr, + u64 *iova_start, + int acl, + struct ehca_pd *e_pd, + u32 *lkey, /*OUT*/ + u32 *rkey) /*OUT*/ +{ + int retcode = 0; + u64 rc = H_SUCCESS; + struct ehca_pfmr *pfmr = &e_newmr->pf; + u32 hipz_acl = 0; + + EDEB_EN(7,"shca=%p e_origmr=%p e_newmr=%p iova_start=%p acl=%x e_pd=%p", + shca, e_origmr, e_newmr, iova_start, acl, e_pd); + + ehca_mrmw_map_acl(acl, &hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + + rc = hipz_h_register_smr(shca->ipz_hca_handle, pfmr, &e_origmr->pf, + &shca->pf, &e_origmr->ipz_mr_handle, + (u64)iova_start, hipz_acl, e_pd->fw_pd, + 
&e_newmr->ipz_mr_handle, lkey, rkey); + if (rc != H_SUCCESS) { + EDEB_ERR(4, "hipz_reg_smr failed, rc=%lx shca=%p e_origmr=%p " + "e_newmr=%p iova_start=%p acl=%x e_pd=%p hca_hndl=%lx " + "mr_hndl=%lx lkey=%x", rc, shca, e_origmr, e_newmr, + iova_start, acl, e_pd, shca->ipz_hca_handle.handle, + e_origmr->ipz_mr_handle.handle, + e_origmr->ib.ib_mr.lkey); + retcode = ehca_mrmw_map_rc_reg_smr(rc); + goto ehca_reg_smr_exit0; + } + /* successful registration */ + e_newmr->num_pages = e_origmr->num_pages; + e_newmr->num_4k = e_origmr->num_4k; + e_newmr->start = iova_start; + e_newmr->size = e_origmr->size; + e_newmr->acl = acl; + goto ehca_reg_smr_exit0; + +ehca_reg_smr_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x shca=%p e_origmr=%p e_newmr=%p " + "iova_start=%p acl=%x e_pd=%p", retcode, + shca, e_origmr, e_newmr, iova_start, acl, e_pd); + else + EDEB_EX(7, "retcode=%x lkey=%x rkey=%x", + retcode, *lkey, *rkey); + return (retcode); +} /* end ehca_reg_smr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* register internal max-MR to internal SHCA */ +int ehca_reg_internal_maxmr( + struct ehca_shca *shca, + struct ehca_pd *e_pd, + struct ehca_mr **e_maxmr) /*OUT*/ +{ + int retcode = 0; + struct ehca_mr *e_mr = NULL; + u64 *iova_start = NULL; + u64 size_maxmr = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + struct ib_phys_buf ib_pbuf; + u32 num_pages_mr = 0; + u32 num_pages_4k = 0; /* 4k portion "pages" */ + + EDEB_EN(7, "shca=%p e_pd=%p e_maxmr=%p", shca, e_pd, e_maxmr); + + if (ehca_adr_bad(shca) || ehca_adr_bad(e_pd) || ehca_adr_bad(e_maxmr)) { + EDEB_ERR(4, "bad input values: shca=%p e_pd=%p e_maxmr=%p", + shca, e_pd, e_maxmr); + retcode = -EINVAL; + goto ehca_reg_internal_maxmr_exit0; + } + + e_mr = ehca_mr_new(); + if (!e_mr) { + EDEB_ERR(4, "out of memory"); + retcode = -ENOMEM; + goto ehca_reg_internal_maxmr_exit0; + } 
+ e_mr->flags |= EHCA_MR_FLAG_MAXMR; + + /* register internal max-MR on HCA */ + size_maxmr = (u64)high_memory - PAGE_OFFSET; + EDEB(9, "high_memory=%p PAGE_OFFSET=%lx", high_memory, PAGE_OFFSET); + iova_start = (u64 *)KERNELBASE; + ib_pbuf.addr = 0; + ib_pbuf.size = size_maxmr; + num_pages_mr = ( (((u64)iova_start % PAGE_SIZE) + size_maxmr + + PAGE_SIZE - 1) / PAGE_SIZE); + num_pages_4k = ( (((u64)iova_start % EHCA_PAGESIZE) + size_maxmr + + EHCA_PAGESIZE - 1) / EHCA_PAGESIZE ); + + pginfo.type = EHCA_MR_PGI_PHYS; + pginfo.num_pages = num_pages_mr; + pginfo.num_4k = num_pages_4k; + pginfo.num_phys_buf = 1; + pginfo.phys_buf_array = &ib_pbuf; + + retcode = ehca_reg_mr(shca, e_mr, iova_start, size_maxmr, 0, e_pd, + &pginfo, &e_mr->ib.ib_mr.lkey, + &e_mr->ib.ib_mr.rkey); + if (retcode != 0) { + EDEB_ERR(4, "reg of internal max MR failed, e_mr=%p " + "iova_start=%p size_maxmr=%lx num_pages_mr=%x " + "num_pages_4k=%x", e_mr, iova_start, size_maxmr, + num_pages_mr, num_pages_4k); + goto ehca_reg_internal_maxmr_exit1; + } + + /* successful registration of all pages */ + e_mr->ib.ib_mr.device = e_pd->ib_pd.device; + e_mr->ib.ib_mr.pd = &e_pd->ib_pd; + e_mr->ib.ib_mr.uobject = NULL; + atomic_inc(&(e_pd->ib_pd.usecnt)); + atomic_set(&(e_mr->ib.ib_mr.usecnt), 0); + *e_maxmr = e_mr; + goto ehca_reg_internal_maxmr_exit0; + +ehca_reg_internal_maxmr_exit1: + ehca_mr_delete(e_mr); +ehca_reg_internal_maxmr_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x shca=%p e_pd=%p e_maxmr=%p", + retcode, shca, e_pd, e_maxmr); + else + EDEB_EX(7, "*e_maxmr=%p lkey=%x rkey=%x", + *e_maxmr, (*e_maxmr)->ib.ib_mr.lkey, + (*e_maxmr)->ib.ib_mr.rkey); + return (retcode); +} /* end ehca_reg_internal_maxmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_reg_maxmr(struct ehca_shca *shca, + struct ehca_mr *e_newmr, + u64 *iova_start, + int acl, + struct ehca_pd *e_pd, + u32 *lkey, + 
u32 *rkey) +{ + int retcode = 0; + u64 rc = H_SUCCESS; + struct ehca_pfmr *pfmr = &e_newmr->pf; + struct ehca_mr *e_origmr = shca->maxmr; + u32 hipz_acl = 0; + + EDEB_EN(7,"shca=%p e_origmr=%p e_newmr=%p iova_start=%p acl=%x e_pd=%p", + shca, e_origmr, e_newmr, iova_start, acl, e_pd); + + ehca_mrmw_map_acl(acl, &hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + + rc = hipz_h_register_smr(shca->ipz_hca_handle, pfmr, &e_origmr->pf, + &shca->pf, &e_origmr->ipz_mr_handle, + (u64)iova_start, hipz_acl, e_pd->fw_pd, + &e_newmr->ipz_mr_handle, lkey, rkey); + if (rc != H_SUCCESS) { + EDEB_ERR(4, "hipz_reg_smr failed, rc=%lx e_origmr=%p " + "hca_hndl=%lx mr_hndl=%lx lkey=%x", + rc, e_origmr, shca->ipz_hca_handle.handle, + e_origmr->ipz_mr_handle.handle, + e_origmr->ib.ib_mr.lkey); + retcode = ehca_mrmw_map_rc_reg_smr(rc); + goto ehca_reg_maxmr_exit0; + } + /* successful registration */ + e_newmr->num_pages = e_origmr->num_pages; + e_newmr->num_4k = e_origmr->num_4k; + e_newmr->start = iova_start; + e_newmr->size = e_origmr->size; + e_newmr->acl = acl; + +ehca_reg_maxmr_exit0: + EDEB_EX(7, "retcode=%x lkey=%x rkey=%x", retcode, *lkey, *rkey); + return (retcode); +} /* end ehca_reg_maxmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_dereg_internal_maxmr(struct ehca_shca *shca) +{ + int retcode = 0; + struct ehca_mr *e_maxmr = NULL; + struct ib_pd *ib_pd = NULL; + + EDEB_EN(7, "shca=%p shca->maxmr=%p", shca, shca->maxmr); + + if (shca->maxmr == 0) { + EDEB_ERR(4, "bad call, shca=%p", shca); + retcode = -EINVAL; + goto ehca_dereg_internal_maxmr_exit0; + } + + e_maxmr = shca->maxmr; + ib_pd = e_maxmr->ib.ib_mr.pd; + shca->maxmr = NULL; /* remove internal max-MR indication from SHCA */ + + retcode = ehca_dereg_mr(&e_maxmr->ib.ib_mr); + if (retcode != 0) { + EDEB_ERR(3, "dereg internal max-MR failed, " + "retcode=%x e_maxmr=%p shca=%p lkey=%x", + 
retcode, e_maxmr, shca, e_maxmr->ib.ib_mr.lkey); + shca->maxmr = e_maxmr; + goto ehca_dereg_internal_maxmr_exit0; + } + + atomic_dec(&ib_pd->usecnt); + +ehca_dereg_internal_maxmr_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x shca=%p shca->maxmr=%p", + retcode, shca, shca->maxmr); + else + EDEB_EX(7, ""); + return (retcode); +} /* end ehca_dereg_internal_maxmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* check physical buffer array of MR verbs for validness and + * calculates MR size + */ +int ehca_mr_chk_buf_and_calc_size(struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + u64 *iova_start, + u64 *size) +{ + struct ib_phys_buf *pbuf = phys_buf_array; + u64 size_count = 0; + u32 i; + + if (num_phys_buf == 0) { + EDEB_ERR(4, "bad phys buf array len, num_phys_buf=0"); + return (-EINVAL); + } + /* check first buffer */ + if (((u64)iova_start & ~PAGE_MASK) != (pbuf->addr & ~PAGE_MASK)) { + EDEB_ERR(4, "iova_start/addr mismatch, iova_start=%p " + "pbuf->addr=%lx pbuf->size=%lx", + iova_start, pbuf->addr, pbuf->size); + return (-EINVAL); + } + if (((pbuf->addr + pbuf->size) % PAGE_SIZE) && + (num_phys_buf > 1)) { + EDEB_ERR(4, "addr/size mismatch in 1st buf, pbuf->addr=%lx " + "pbuf->size=%lx", pbuf->addr, pbuf->size); + return (-EINVAL); + } + + for (i = 0; i < num_phys_buf; i++) { + if ((i > 0) && (pbuf->addr % PAGE_SIZE)) { + EDEB_ERR(4, "bad address, i=%x pbuf->addr=%lx " + "pbuf->size=%lx", i, pbuf->addr, pbuf->size); + return (-EINVAL); + } + if (((i > 0) && /* not 1st */ + (i < (num_phys_buf - 1)) && /* not last */ + (pbuf->size % PAGE_SIZE)) || (pbuf->size == 0)) { + EDEB_ERR(4, "bad size, i=%x pbuf->size=%lx", + i, pbuf->size); + return (-EINVAL); + } + size_count += pbuf->size; + pbuf++; + } + + *size = size_count; + return (0); +} /* end ehca_mr_chk_buf_and_calc_size() */ + 
+/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* check page list of map FMR verb for validity */ +int ehca_fmr_check_page_list(struct ehca_mr *e_fmr, + u64 *page_list, + int list_len) +{ + u32 i; + u64 *page = NULL; + + if (ehca_adr_bad(page_list)) { + EDEB_ERR(4, "bad page_list, page_list=%p fmr=%p", + page_list, e_fmr); + return (-EINVAL); + } + + if ((list_len == 0) || (list_len > e_fmr->fmr_max_pages)) { + EDEB_ERR(4, "bad list_len, list_len=%x e_fmr->fmr_max_pages=%x " + "fmr=%p", list_len, e_fmr->fmr_max_pages, e_fmr); + return (-EINVAL); + } + + /* each page must be aligned */ + page = page_list; + for (i = 0; i < list_len; i++) { + if (*page % e_fmr->fmr_page_size) { + EDEB_ERR(4, "bad page, i=%x *page=%lx page=%p " + "fmr=%p fmr_page_size=%x", + i, *page, page, e_fmr, e_fmr->fmr_page_size); + return (-EINVAL); + } + page++; + } + + return (0); +} /* end ehca_fmr_check_page_list() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* set up page buffer from page info */ +int ehca_set_pagebuf(struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo, + u32 number, + u64 *kpage) +{ + int retcode = 0; + struct ib_umem_chunk *prev_chunk = NULL; + struct ib_umem_chunk *chunk = NULL; + struct ib_phys_buf *pbuf = NULL; + u64 *fmrlist = NULL; + u64 num4k = 0; + u64 pgaddr = 0; + u64 offs4k = 0; + u32 i = 0; + u32 j = 0; + + EDEB_EN(7, "pginfo=%p type=%x num_pages=%lx num_4k=%lx next_buf=%lx " + "next_4k=%lx number=%x kpage=%p page_cnt=%lx page_4k_cnt=%lx " + "next_listelem=%lx region=%p next_chunk=%p next_nmap=%lx", + pginfo, pginfo->type, pginfo->num_pages, pginfo->num_4k, + pginfo->next_buf, pginfo->next_4k, number, kpage, + pginfo->page_cnt, pginfo->page_4k_cnt, pginfo->next_listelem, + pginfo->region, pginfo->next_chunk, pginfo->next_nmap); + + if
(pginfo->type == EHCA_MR_PGI_PHYS) { + /* loop over desired phys_buf_array entries */ + while (i < number) { + pbuf = pginfo->phys_buf_array + pginfo->next_buf; + num4k = ((pbuf->addr % EHCA_PAGESIZE) + pbuf->size + + EHCA_PAGESIZE - 1) / EHCA_PAGESIZE; + offs4k = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; + while (pginfo->next_4k < offs4k + num4k) { + /* sanity check */ + if ((pginfo->page_cnt >= pginfo->num_pages) || + (pginfo->page_4k_cnt >= pginfo->num_4k)) { + EDEB_ERR(4, "page_cnt >= num_pages, " + "page_cnt=%lx num_pages=%lx " + "page_4k_cnt=%lx num_4k=%lx " + "i=%x", pginfo->page_cnt, + pginfo->num_pages, + pginfo->page_4k_cnt, + pginfo->num_4k, i); + retcode = -EFAULT; + } + *kpage = phys_to_abs( + (pbuf->addr & (EHCA_PAGESIZE-1)) + + (pginfo->next_4k * EHCA_PAGESIZE)); + if ((*kpage == 0) && (pbuf->addr != 0)) { + EDEB_ERR(4, "pbuf->addr=%lx " + "pbuf->size=%lx next_4k=%lx", + pbuf->addr, pbuf->size, + pginfo->next_4k); + retcode = -EFAULT; + goto ehca_set_pagebuf_exit0; + } + (pginfo->page_4k_cnt)++; + (pginfo->next_4k)++; + if(pginfo->next_4k >= PAGE_SIZE/EHCA_PAGESIZE) + (pginfo->page_cnt)++; + kpage++; + i++; + if (i >= number) break; + } + if (pginfo->next_4k >= offs4k + num4k) { + (pginfo->next_buf)++; + pginfo->next_4k = 0; + } + } + } else if (pginfo->type == EHCA_MR_PGI_USER) { + /* loop over desired chunk entries */ + chunk = pginfo->next_chunk; + prev_chunk = pginfo->next_chunk; + list_for_each_entry_continue(chunk, + (&(pginfo->region->chunk_list)), + list) { + EDEB(9, "chunk->page_list[0]=%lx", + (u64)sg_dma_address(&chunk->page_list[0])); + for (i = pginfo->next_nmap; i < chunk->nmap; ) { + pgaddr = ( page_to_pfn(chunk->page_list[i].page) + << PAGE_SHIFT ); + *kpage = phys_to_abs(pgaddr + + (pginfo->next_4k * + EHCA_PAGESIZE)); + EDEB(9,"pgaddr=%lx *kpage=%lx next_4k=%lx", + pgaddr, *kpage, pginfo->next_4k); + if (*kpage == 0) { + EDEB_ERR(4, "pgaddr=%lx " + "chunk->page_list[i]=%lx i=%x " + "next_4k=%lx mr=%p", pgaddr, + 
(u64)sg_dma_address( + &chunk->page_list[i]), + i, pginfo->next_4k, e_mr); + retcode = -EFAULT; + goto ehca_set_pagebuf_exit0; + } + (pginfo->page_4k_cnt)++; + (pginfo->next_4k)++; + kpage++; + if (pginfo->next_4k >= PAGE_SIZE/EHCA_PAGESIZE) { + (pginfo->page_cnt)++; + (pginfo->next_nmap)++; + pginfo->next_4k = 0; + i++; + } + j++; + if (j >= number) break; + } + if ( (pginfo->next_nmap >= chunk->nmap) && + (j >= number) ) { + pginfo->next_nmap = 0; + prev_chunk = chunk; + break; + } else if (pginfo->next_nmap >= chunk->nmap) { + pginfo->next_nmap = 0; + prev_chunk = chunk; + } else if (j >= number) + break; + else + prev_chunk = chunk; + } + pginfo->next_chunk = + list_prepare_entry(prev_chunk, + (&(pginfo->region->chunk_list)), + list); + } else if (pginfo->type == EHCA_MR_PGI_FMR) { + /* loop over desired page_list entries */ + fmrlist = pginfo->page_list + pginfo->next_listelem; + for (i = 0; i < number; i++) { + *kpage = phys_to_abs((*fmrlist & (EHCA_PAGESIZE-1)) + + pginfo->next_4k * EHCA_PAGESIZE); + if (*kpage == 0) { + EDEB_ERR(4, "*fmrlist=%lx fmrlist=%p " + "next_listelem=%lx next_4k=%lx", + *fmrlist, fmrlist, + pginfo->next_listelem,pginfo->next_4k); + retcode = -EFAULT; + goto ehca_set_pagebuf_exit0; + } + (pginfo->page_4k_cnt)++; + (pginfo->next_4k)++; + kpage++; + if ( pginfo->next_4k >= + ((e_mr->fmr_page_size) / EHCA_PAGESIZE) ) { + (pginfo->page_cnt)++; + (pginfo->next_listelem)++; + fmrlist++; + pginfo->next_4k = 0; + } + } + } else { + EDEB_ERR(4, "bad pginfo->type=%x", pginfo->type); + retcode = -EFAULT; + goto ehca_set_pagebuf_exit0; + } + +ehca_set_pagebuf_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " + "num_4k=%lx next_buf=%lx next_4k=%lx number=%x " + "kpage=%p page_cnt=%lx page_4k_cnt=%lx i=%x " + "next_listelem=%lx region=%p next_chunk=%p " + "next_nmap=%lx", retcode, e_mr, pginfo, pginfo->type, + pginfo->num_pages, pginfo->num_4k, pginfo->next_buf, + pginfo->next_4k, number, kpage, 
pginfo->page_cnt, + pginfo->page_4k_cnt, i, pginfo->next_listelem, + pginfo->region, pginfo->next_chunk, pginfo->next_nmap); + else + EDEB_EX(7, "retcode=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " + "num_4k=%lx next_buf=%lx next_4k=%lx number=%x " + "kpage=%p page_cnt=%lx page_4k_cnt=%lx i=%x " + "next_listelem=%lx region=%p next_chunk=%p " + "next_nmap=%lx", retcode, e_mr, pginfo, pginfo->type, + pginfo->num_pages, pginfo->num_4k, pginfo->next_buf, + pginfo->next_4k, number, kpage, pginfo->page_cnt, + pginfo->page_4k_cnt, i, pginfo->next_listelem, + pginfo->region, pginfo->next_chunk, pginfo->next_nmap); + return (retcode); +} /* end ehca_set_pagebuf() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* setup 1 page from page info page buffer */ +int ehca_set_pagebuf_1(struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo, + u64 *rpage) +{ + int retcode = 0; + struct ib_phys_buf *tmp_pbuf = NULL; + u64 *fmrlist = NULL; + struct ib_umem_chunk *chunk = NULL; + struct ib_umem_chunk *prev_chunk = NULL; + u64 pgaddr = 0; + u64 num4k = 0; + u64 offs4k = 0; + + EDEB_EN(7, "pginfo=%p type=%x num_pages=%lx num_4k=%lx next_buf=%lx " + "next_4k=%lx rpage=%p page_cnt=%lx page_4k_cnt=%lx " + "next_listelem=%lx region=%p next_chunk=%p next_nmap=%lx", + pginfo, pginfo->type, pginfo->num_pages, pginfo->num_4k, + pginfo->next_buf, pginfo->next_4k, rpage, pginfo->page_cnt, + pginfo->page_4k_cnt, pginfo->next_listelem, pginfo->region, + pginfo->next_chunk, pginfo->next_nmap); + + if (pginfo->type == EHCA_MR_PGI_PHYS) { + /* sanity check */ + if ((pginfo->page_cnt >= pginfo->num_pages) || + (pginfo->page_4k_cnt >= pginfo->num_4k)) { + EDEB_ERR(4, "page_cnt >= num_pages, page_cnt=%lx " + "num_pages=%lx page_4k_cnt=%lx num_4k=%lx", + pginfo->page_cnt, pginfo->num_pages, + pginfo->page_4k_cnt, pginfo->num_4k); + retcode = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + 
tmp_pbuf = pginfo->phys_buf_array + pginfo->next_buf; + num4k = ((tmp_pbuf->addr % EHCA_PAGESIZE) + tmp_pbuf->size + + EHCA_PAGESIZE - 1) / EHCA_PAGESIZE; + offs4k = (tmp_pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; + *rpage = phys_to_abs((tmp_pbuf->addr & (EHCA_PAGESIZE-1)) + + (pginfo->next_4k * EHCA_PAGESIZE)); + if ((*rpage == 0) && (tmp_pbuf->addr != 0)) { + EDEB_ERR(4, "tmp_pbuf->addr=%lx" + " tmp_pbuf->size=%lx next_4k=%lx", + tmp_pbuf->addr, tmp_pbuf->size, + pginfo->next_4k); + retcode = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + (pginfo->page_4k_cnt)++; + (pginfo->next_4k)++; + if(pginfo->next_4k >= PAGE_SIZE/EHCA_PAGESIZE) + (pginfo->page_cnt)++; + if (pginfo->next_4k >= offs4k + num4k) { + (pginfo->next_buf)++; + pginfo->next_4k = 0; + } + } else if (pginfo->type == EHCA_MR_PGI_USER) { + chunk = pginfo->next_chunk; + prev_chunk = pginfo->next_chunk; + list_for_each_entry_continue(chunk, + (&(pginfo->region->chunk_list)), + list) { + pgaddr = ( page_to_pfn(chunk->page_list[ + pginfo->next_nmap].page) + << PAGE_SHIFT); + *rpage = phys_to_abs(pgaddr + + (pginfo->next_4k * EHCA_PAGESIZE)); + EDEB(9,"pgaddr=%lx *rpage=%lx next_4k=%lx", pgaddr, + *rpage, pginfo->next_4k); + if (*rpage == 0) { + EDEB_ERR(4, "pgaddr=%lx chunk->page_list[]=%lx " + "next_nmap=%lx next_4k=%lx mr=%p", + pgaddr, (u64)sg_dma_address( + &chunk->page_list[ + pginfo->next_nmap]), + pginfo->next_nmap, pginfo->next_4k, + e_mr); + retcode = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + (pginfo->page_4k_cnt)++; + (pginfo->next_4k)++; + if (pginfo->next_4k >= PAGE_SIZE/EHCA_PAGESIZE) { + (pginfo->page_cnt)++; + (pginfo->next_nmap)++; + pginfo->next_4k = 0; + } + if (pginfo->next_nmap >= chunk->nmap) { + pginfo->next_nmap = 0; + prev_chunk = chunk; + } + break; + } + pginfo->next_chunk = + list_prepare_entry(prev_chunk, + (&(pginfo->region->chunk_list)), + list); + } else if (pginfo->type == EHCA_MR_PGI_FMR) { + fmrlist = pginfo->page_list + pginfo->next_listelem; + *rpage = 
phys_to_abs((*fmrlist & (EHCA_PAGESIZE-1)) + + pginfo->next_4k * EHCA_PAGESIZE); + if (*rpage == 0) { + EDEB_ERR(4, "*fmrlist=%lx fmrlist=%p next_listelem=%lx " + "next_4k=%lx", *fmrlist, fmrlist, + pginfo->next_listelem, pginfo->next_4k); + retcode = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + (pginfo->page_4k_cnt)++; + (pginfo->next_4k)++; + if (pginfo->next_4k >= (e_mr->fmr_page_size)/EHCA_PAGESIZE) { + (pginfo->page_cnt)++; + (pginfo->next_listelem)++; + pginfo->next_4k = 0; + } + } else { + EDEB_ERR(4, "bad pginfo->type=%x", pginfo->type); + retcode = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + +ehca_set_pagebuf_1_exit0: + if (retcode) + EDEB_EX(4, "retcode=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " + "num_4k=%lx next_buf=%lx next_4k=%lx rpage=%p " + "page_cnt=%lx page_4k_cnt=%lx next_listelem=%lx " + "region=%p next_chunk=%p next_nmap=%lx", retcode, e_mr, + pginfo, pginfo->type, pginfo->num_pages, pginfo->num_4k, + pginfo->next_buf, pginfo->next_4k, rpage, + pginfo->page_cnt, pginfo->page_4k_cnt, + pginfo->next_listelem, pginfo->region, + pginfo->next_chunk, pginfo->next_nmap); + else + EDEB_EX(7, "retcode=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " + "num_4k=%lx next_buf=%lx next_4k=%lx rpage=%p " + "page_cnt=%lx page_4k_cnt=%lx next_listelem=%lx " + "region=%p next_chunk=%p next_nmap=%lx", retcode, e_mr, + pginfo, pginfo->type, pginfo->num_pages, pginfo->num_4k, + pginfo->next_buf, pginfo->next_4k, rpage, + pginfo->page_cnt, pginfo->page_4k_cnt, + pginfo->next_listelem, pginfo->region, + pginfo->next_chunk, pginfo->next_nmap); + return (retcode); +} /* end ehca_set_pagebuf_1() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* check MR if it is a max-MR, i.e. 
uses whole memory; + * returns 1 if it is a max-MR, else 0 + */ +int ehca_mr_is_maxmr(u64 size, + u64 *iova_start) +{ + /* an MR is treated as a max-MR only if it meets the following conditions: */ + if ((size == ((u64)high_memory - PAGE_OFFSET)) && + (iova_start == (void*)KERNELBASE)) { + EDEB(6, "this is a max-MR"); + return (1); + } else + return (0); +} /* end ehca_mr_is_maxmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ +/* map access control for MR/MW. This routine is used for MR and MW. */ +void ehca_mrmw_map_acl(int ib_acl, + u32 *hipz_acl) +{ + *hipz_acl = 0; + if (ib_acl & IB_ACCESS_REMOTE_READ) + *hipz_acl |= HIPZ_ACCESSCTRL_R_READ; + if (ib_acl & IB_ACCESS_REMOTE_WRITE) + *hipz_acl |= HIPZ_ACCESSCTRL_R_WRITE; + if (ib_acl & IB_ACCESS_REMOTE_ATOMIC) + *hipz_acl |= HIPZ_ACCESSCTRL_R_ATOMIC; + if (ib_acl & IB_ACCESS_LOCAL_WRITE) + *hipz_acl |= HIPZ_ACCESSCTRL_L_WRITE; + if (ib_acl & IB_ACCESS_MW_BIND) + *hipz_acl |= HIPZ_ACCESSCTRL_MW_BIND; +} /* end ehca_mrmw_map_acl() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* sets page size in hipz access control for MR/MW. */ +void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl) /*INOUT*/ +{ + return; /* HCA supports only 4k */ +} /* end ehca_mrmw_set_pgsize_hipz_acl() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* reverse map access control for MR/MW. + * This routine is used for MR and MW.
+ */ +void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl, + int *ib_acl) /*OUT*/ +{ + *ib_acl = 0; + if (*hipz_acl & HIPZ_ACCESSCTRL_R_READ) + *ib_acl |= IB_ACCESS_REMOTE_READ; + if (*hipz_acl & HIPZ_ACCESSCTRL_R_WRITE) + *ib_acl |= IB_ACCESS_REMOTE_WRITE; + if (*hipz_acl & HIPZ_ACCESSCTRL_R_ATOMIC) + *ib_acl |= IB_ACCESS_REMOTE_ATOMIC; + if (*hipz_acl & HIPZ_ACCESSCTRL_L_WRITE) + *ib_acl |= IB_ACCESS_LOCAL_WRITE; + if (*hipz_acl & HIPZ_ACCESSCTRL_MW_BIND) + *ib_acl |= IB_ACCESS_MW_BIND; +} /* end ehca_mrmw_reverse_map_acl() */ + + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* map HIPZ rc to IB retcodes for MR/MW allocations + * Used for hipz_mr_reg_alloc and hipz_mw_alloc. + */ +int ehca_mrmw_map_rc_alloc(const u64 rc) +{ + switch (rc) { + case H_SUCCESS: /* successful completion */ + return (0); + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RT_PARM: /* invalid resource type */ + case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */ + case H_MLENGTH_PARM: /* invalid memory length */ + case H_MEM_ACCESS_PARM: /* invalid access controls */ + case H_CONSTRAINED: /* resource constraint */ + return (-EINVAL); + case H_BUSY: /* long busy */ + return (-EBUSY); + default: + return (-EINVAL); + } +} /* end ehca_mrmw_map_rc_alloc() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* map HIPZ rc to IB retcodes for MR register rpage + * Used for hipz_h_register_rpage_mr at registering last page + */ +int ehca_mrmw_map_rc_rrpg_last(const u64 rc) +{ + switch (rc) { + case H_SUCCESS: /* registration complete */ + return (0); + case H_PAGE_REGISTERED: /* page registered */ + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RH_PARM: /* invalid resource handle */ +/* case H_QT_PARM: invalid queue type */ + case H_PARAMETER: /* invalid 
logical address, */ + /* or count zero or greater than 512 */ + case H_TABLE_FULL: /* page table full */ + case H_HARDWARE: /* HCA not operational */ + return (-EINVAL); + case H_BUSY: /* long busy */ + return (-EBUSY); + default: + return (-EINVAL); + } +} /* end ehca_mrmw_map_rc_rrpg_last() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* map HIPZ rc to IB retcodes for MR register rpage + * Used for hipz_h_register_rpage_mr when registering a page other than the last page + */ +int ehca_mrmw_map_rc_rrpg_notlast(const u64 rc) +{ + switch (rc) { + case H_PAGE_REGISTERED: /* page registered */ + return (0); + case H_SUCCESS: /* registration complete */ + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RH_PARM: /* invalid resource handle */ +/* case H_QT_PARM: invalid queue type */ + case H_PARAMETER: /* invalid logical address, */ + /* or count zero or greater than 512 */ + case H_TABLE_FULL: /* page table full */ + case H_HARDWARE: /* HCA not operational */ + return (-EINVAL); + case H_BUSY: /* long busy */ + return (-EBUSY); + default: + return (-EINVAL); + } +} /* end ehca_mrmw_map_rc_rrpg_notlast() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* map HIPZ rc to IB retcodes for MR query. Used for hipz_mr_query.
*/ +int ehca_mrmw_map_rc_query_mr(const u64 rc) +{ + switch (rc) { + case H_SUCCESS: /* successful completion */ + return (0); + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RH_PARM: /* invalid resource handle */ + return (-EINVAL); + case H_BUSY: /* long busy */ + return (-EBUSY); + default: + return (-EINVAL); + } +} /* end ehca_mrmw_map_rc_query_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* map HIPZ rc to IB retcodes for freeing MR resource + * Used for hipz_h_free_resource_mr + */ +int ehca_mrmw_map_rc_free_mr(const u64 rc) +{ + switch (rc) { + case H_SUCCESS: /* resource freed */ + return (0); + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RH_PARM: /* invalid resource handle */ + case H_R_STATE: /* invalid resource state */ + case H_HARDWARE: /* HCA not operational */ + return (-EINVAL); + case H_RESOURCE: /* Resource in use */ + case H_BUSY: /* long busy */ + return (-EBUSY); + default: + return (-EINVAL); + } +} /* end ehca_mrmw_map_rc_free_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* map HIPZ rc to IB retcodes for freeing MW resource + * Used for hipz_h_free_resource_mw + */ +int ehca_mrmw_map_rc_free_mw(const u64 rc) +{ + switch (rc) { + case H_SUCCESS: /* resource freed */ + return (0); + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RH_PARM: /* invalid resource handle */ + case H_R_STATE: /* invalid resource state */ + case H_HARDWARE: /* HCA not operational */ + return (-EINVAL); + case H_RESOURCE: /* Resource in use */ + case H_BUSY: /* long busy */ + return (-EBUSY); + default: + return (-EINVAL); + } +} /* end ehca_mrmw_map_rc_free_mw() */ + +/*----------------------------------------------------------------------*/ 
+/*----------------------------------------------------------------------*/ + +/* map HIPZ rc to IB retcodes for SMR registrations + * Used for hipz_h_register_smr. + */ +int ehca_mrmw_map_rc_reg_smr(const u64 rc) +{ + switch (rc) { + case H_SUCCESS: /* successful completion */ + return (0); + case H_ADAPTER_PARM: /* invalid adapter handle */ + case H_RH_PARM: /* invalid resource handle */ + case H_MEM_PARM: /* invalid MR virtual address */ + case H_MEM_ACCESS_PARM: /* invalid access controls */ + case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */ + return (-EINVAL); + case H_BUSY: /* long busy */ + return (-EBUSY); + default: + return (-EINVAL); + } +} /* end ehca_mrmw_map_rc_reg_smr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/* MR destructor and constructor + * used in Reregister MR verb, memsets ehca_mr_t to 0, + * except struct ib_mr and spinlock + */ +void ehca_mr_deletenew(struct ehca_mr *mr) +{ + u32 offset = (u64)(&mr->flags) - (u64)mr; + memset(&mr->flags, 0, sizeof(*mr) - offset); +} /* end ehca_mr_deletenew() */ From schihei at de.ibm.com Thu Apr 27 03:49:17 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 27 Apr 2006 12:49:17 +0200 Subject: [openib-general] [PATCH 10/16] ehca: event queue Message-ID: <4450A1AD.7040506@de.ibm.com> Signed-off-by: Heiko J Schick ehca_eq.c | 225 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 225 insertions(+) --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_eq.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_eq.c 2006-04-27 08:34:02.000000000 +0200 @@ -0,0 +1,225 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Event queue handling + * + * Authors: Waleri Fomin + * Khadija Souissi + * Reinhard Ernst + * Heiko J Schick + * Hoang-Nam Nguyen + * + * + * Copyright (c) 2005 IBM 
Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_eq.c,v 1.17 2006/04/27 06:34:02 schickhj Exp $ + */ + +#define DEB_PREFIX "e_eq" + +#include "ehca_classes.h" +#include "ehca_irq.h" +#include "ehca_iverbs.h" +#include "ehca_kernel.h" +#include "ehca_qes.h" +#include "hcp_if.h" +#include "ipz_pt_fn.h" + +int ehca_create_eq(struct ehca_shca *shca, + struct ehca_eq *eq, + const enum ehca_eq_type type, const u32 length) +{ + u64 ret = H_SUCCESS; + u32 nr_pages = 0; + u32 i; + void *vpage = NULL; + + EDEB_EN(7, "shca=%p eq=%p length=%x", shca, eq, length); + EHCA_CHECK_ADR(shca); + EHCA_CHECK_ADR(eq); + + spin_lock_init(&eq->spinlock); + eq->is_initialized = 0; + + if (type!=EHCA_EQ && type!=EHCA_NEQ) { + EDEB_ERR(4, "Invalid EQ type %x. eq=%p", type, eq); + return -EINVAL; + } + if (length==0) { + EDEB_ERR(4, "EQ length must not be zero. eq=%p", eq); + return -EINVAL; + } + + ret = hipz_h_alloc_resource_eq(shca->ipz_hca_handle, + &eq->pf, + type, + length, + &eq->ipz_eq_handle, + &eq->length, + &nr_pages, &eq->ist); + + if (ret != H_SUCCESS) { + EDEB_ERR(4, "Can't allocate EQ / NEQ. eq=%p", eq); + return -EINVAL; + } + + ret = ipz_queue_ctor(&eq->ipz_queue, nr_pages, + EHCA_PAGESIZE, sizeof(struct ehca_eqe), 0); + if (!ret) { + EDEB_ERR(4, "Can't allocate EQ pages. 
eq=%p", eq); + goto create_eq_exit1; + } + + for (i = 0; i < nr_pages; i++) { + u64 rpage; + + if (!(vpage = ipz_qpageit_get_inc(&eq->ipz_queue))) { + ret = H_RESOURCE; + goto create_eq_exit2; + } + + rpage = virt_to_abs(vpage); + ret = hipz_h_register_rpage_eq(shca->ipz_hca_handle, + eq->ipz_eq_handle, + &eq->pf, + 0, 0, rpage, 1); + + if (i == (nr_pages - 1)) { + /* last page */ + vpage = ipz_qpageit_get_inc(&eq->ipz_queue); + if ((ret != H_SUCCESS) || (vpage != 0)) + goto create_eq_exit2; + } else { + if ((ret != H_PAGE_REGISTERED) || (vpage == 0)) + goto create_eq_exit2; + } + } + + ipz_qeit_reset(&eq->ipz_queue); + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { + ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, + SA_INTERRUPT, "ehca_eq", + (void *)shca); + if (ret < 0) + EDEB_ERR(4, "Can't map interrupt handler."); + + tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { + ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + SA_INTERRUPT, "ehca_neq", + (void *)shca); + if (ret < 0) + EDEB_ERR(4, "Can't map interrupt handler."); + + tasklet_init(&eq->interrupt_task, ehca_tasklet_neq, (long)shca); + } + + eq->is_initialized = 1; + + EDEB_EX(7, "ret=%lx", ret); + + return 0; + +create_eq_exit2: + ipz_queue_dtor(&eq->ipz_queue); + +create_eq_exit1: + hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + + EDEB_EX(7, "ret=%lx", ret); + + return -EINVAL; +} + +void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq) +{ + unsigned long flags = 0; + void *eqe = NULL; + + EDEB_EN(7, "shca=%p eq=%p", shca, eq); + EHCA_CHECK_ADR_P(shca); + EHCA_CHECK_EQ_P(eq); + + spin_lock_irqsave(&eq->spinlock, flags); + eqe = ipz_eqit_eq_get_inc_valid(&eq->ipz_queue); + spin_unlock_irqrestore(&eq->spinlock, flags); + + EDEB_EX(7, "eq=%p eqe=%p", eq, eqe); + + return eqe; +} + +void ehca_poll_eqs(unsigned long data) +{ + struct ehca_shca *shca; + struct ehca_module *module = (struct 
ehca_module*)data; + + spin_lock(&module->shca_lock); + list_for_each_entry(shca, &module->shca_list, shca_list) { + if (shca->eq.is_initialized) + ehca_tasklet_eq((unsigned long)(void*)shca); + } + mod_timer(&module->timer, jiffies + HZ); + spin_unlock(&module->shca_lock); + + return; +} + +int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq) +{ + unsigned long flags = 0; + u64 retcode = H_SUCCESS; + + EDEB_EN(7, "shca=%p eq=%p", shca, eq); + EHCA_CHECK_ADR(shca); + EHCA_CHECK_EQ(eq); + + spin_lock_irqsave(&eq->spinlock, flags); + ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + retcode = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + + spin_unlock_irqrestore(&eq->spinlock, flags); + + if (retcode != H_SUCCESS) { + EDEB_ERR(4, "Can't free EQ resources."); + return -EINVAL; + } + ipz_queue_dtor(&eq->ipz_queue); + + EDEB_EX(7, "retcode=%lx", retcode); + + return 0; +} From schihei at de.ibm.com Thu Apr 27 03:49:23 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 27 Apr 2006 12:49:23 +0200 Subject: [openib-general] [PATCH 11/16] ehca: completion queue Message-ID: <4450A1B3.5000100@de.ibm.com> Signed-off-by: Heiko J Schick ehca_cq.c | 445 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 445 insertions(+) --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_cq.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_cq.c 2006-04-24 15:12:03.000000000 +0200 @@ -0,0 +1,445 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Completion queue handling + * + * Authors: Waleri Fomin + * Khadija Souissi + * Reinhard Ernst + * Heiko J Schick + * Hoang-Nam Nguyen + * + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. 
+ * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_cq.c,v 1.24 2006/04/24 13:12:03 schickhj Exp $ + */ + +#define DEB_PREFIX "e_cq" + +#include + +#include "ehca_kernel.h" +#include "ehca_iverbs.h" +#include "ehca_classes.h" +#include "ehca_irq.h" +#include "hcp_if.h" + +int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp) +{ + unsigned int qp_num = qp->real_qp_num; + unsigned int key = qp_num & (QP_HASHTAB_LEN-1); + unsigned long spl_flags = 0; + + spin_lock_irqsave(&cq->spinlock, spl_flags); + hlist_add_head(&qp->list_entries, &cq->qp_hashtab[key]); + spin_unlock_irqrestore(&cq->spinlock, spl_flags); + + EDEB(7, "cq_num=%x real_qp_num=%x", cq->cq_number, qp_num); + + return 0; +} + +int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int real_qp_num) +{ + int ret = -EINVAL; + unsigned int key = real_qp_num & (QP_HASHTAB_LEN-1); + struct hlist_node *iter = NULL; + struct ehca_qp *qp = NULL; + unsigned long spl_flags = 0; + + spin_lock_irqsave(&cq->spinlock, spl_flags); + hlist_for_each(iter, &cq->qp_hashtab[key]) { + qp = hlist_entry(iter, struct ehca_qp, list_entries); + if (qp->real_qp_num == real_qp_num) { + hlist_del(iter); + EDEB(7, "removed qp from cq, cq_num=%x real_qp_num=%x", + cq->cq_number, real_qp_num); + ret = 0; + break; + } + } + spin_unlock_irqrestore(&cq->spinlock, spl_flags); + if (ret != 0) { + EDEB_ERR(4, "qp not found cq_num=%x real_qp_num=%x", + cq->cq_number, real_qp_num); + } + + return ret; +} + +struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int real_qp_num) +{ + struct ehca_qp *ret = NULL; + unsigned int key = real_qp_num & (QP_HASHTAB_LEN-1); + struct hlist_node *iter = NULL; + struct ehca_qp *qp = NULL; + hlist_for_each(iter, &cq->qp_hashtab[key]) { + qp = hlist_entry(iter, struct ehca_qp, list_entries); + if (qp->real_qp_num == real_qp_num) { + ret = qp; + break; + } + } + return ret; +} + +struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + extern struct ehca_module ehca_module; +
struct ib_cq *cq = NULL; + struct ehca_cq *my_cq = NULL; + u32 number_of_entries = cqe; + struct ehca_shca *shca = NULL; + struct ipz_adapter_handle adapter_handle; + struct ipz_eq_handle eq_handle; + struct ipz_cq_handle *cq_handle_ref = NULL; + u32 act_nr_of_entries = 0; + u32 act_pages = 0; + u32 counter = 0; + void *vpage = NULL; + u64 rpage = 0; + struct h_galpa gal; + u64 cqx_fec = 0; + u64 hipz_rc = H_SUCCESS; + int ipz_rc = 0; + int ret = 0; + const u32 additional_cqe = 20; + int i = 0; + unsigned long flags; + + EHCA_CHECK_DEVICE_P(device); + EDEB_EN(7, "device=%p cqe=%x context=%p", device, cqe, context); + + /* The maximum CQ depth is 4GB-64, but we need 20 additional entries + * as a buffer for receiving error CQEs. + */ + if (cqe >= 0xFFFFFFFF - 64 - additional_cqe) + return ERR_PTR(-EINVAL); + number_of_entries += additional_cqe; + + my_cq = kmem_cache_alloc(ehca_module.cache_cq, SLAB_KERNEL); + if (my_cq == NULL) { + cq = ERR_PTR(-ENOMEM); + EDEB_ERR(4, "Out of memory for ehca_cq struct device=%p", + device); + goto create_cq_exit0; + } + + memset(my_cq, 0, sizeof(struct ehca_cq)); + spin_lock_init(&my_cq->spinlock); + spin_lock_init(&my_cq->cb_lock); + spin_lock_init(&my_cq->task_lock); + my_cq->ownpid = current->tgid; + + cq = &my_cq->ib_cq; + + shca = container_of(device, struct ehca_shca, ib_device); + adapter_handle = shca->ipz_hca_handle; + eq_handle = shca->eq.ipz_eq_handle; + cq_handle_ref = &my_cq->ipz_cq_handle; + + do { + if (!idr_pre_get(&ehca_cq_idr, GFP_KERNEL)) { + cq = ERR_PTR(-ENOMEM); + EDEB_ERR(4, + "Can't reserve idr resources. " + "device=%p", device); + goto create_cq_exit1; + } + + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + ret = idr_get_new(&ehca_cq_idr, my_cq, &my_cq->token); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + + } while (ret == -EAGAIN); + + if (ret) { + cq = ERR_PTR(-ENOMEM); + EDEB_ERR(4, + "Can't allocate new idr entry.
" + "device=%p", device); + goto create_cq_exit1; + } + + hipz_rc = hipz_h_alloc_resource_cq(adapter_handle, + &my_cq->pf, + eq_handle, + my_cq->token, + number_of_entries, + cq_handle_ref, + &act_nr_of_entries, + &act_pages, + &my_cq->galpas); + if (hipz_rc != H_SUCCESS) { + EDEB_ERR(4, + "hipz_h_alloc_resource_cq() failed " + "hipz_rc=%lx device=%p", hipz_rc, device); + cq = ERR_PTR(ehca2ib_return_code(hipz_rc)); + goto create_cq_exit2; + } + + ipz_rc = ipz_queue_ctor(&my_cq->ipz_queue, act_pages, + EHCA_PAGESIZE, sizeof(struct ehca_cqe), 0); + if (!ipz_rc) { + EDEB_ERR(4, + "ipz_queue_ctor() failed " + "ipz_rc=%x device=%p", ipz_rc, device); + cq = ERR_PTR(-EINVAL); + goto create_cq_exit3; + } + + for (counter = 0; counter < act_pages; counter++) { + vpage = ipz_qpageit_get_inc(&my_cq->ipz_queue); + if (!vpage) { + EDEB_ERR(4, "ipz_qpageit_get_inc() " + "returns NULL device=%p", device); + cq = ERR_PTR(-EAGAIN); + goto create_cq_exit4; + } + rpage = virt_to_abs(vpage); + + hipz_rc = hipz_h_register_rpage_cq(adapter_handle, + my_cq->ipz_cq_handle, + &my_cq->pf, + 0, + 0, + rpage, + 1, + my_cq->galpas. 
+ kernel); + + if (hipz_rc < H_SUCCESS) { + EDEB_ERR(4, "hipz_h_register_rpage_cq() failed " + "ehca_cq=%p cq_num=%x hipz_rc=%lx " + "counter=%i act_pages=%i", + my_cq, my_cq->cq_number, + hipz_rc, counter, act_pages); + cq = ERR_PTR(-EINVAL); + goto create_cq_exit4; + } + + if (counter == (act_pages - 1)) { + vpage = ipz_qpageit_get_inc( + &my_cq->ipz_queue); + if ((hipz_rc != H_SUCCESS) || (vpage != 0)) { + EDEB_ERR(4, "Registration of pages not " + "complete ehca_cq=%p cq_num=%x " + "hipz_rc=%lx", + my_cq, my_cq->cq_number, hipz_rc); + cq = ERR_PTR(-EAGAIN); + goto create_cq_exit4; + } + } else { + if (hipz_rc != H_PAGE_REGISTERED) { + EDEB_ERR(4, "Registration of page failed " + "ehca_cq=%p cq_num=%x hipz_rc=%lx" + "counter=%i act_pages=%i", + my_cq, my_cq->cq_number, + hipz_rc, counter, act_pages); + cq = ERR_PTR(-ENOMEM); + goto create_cq_exit4; + } + } + } + + ipz_qeit_reset(&my_cq->ipz_queue); + + gal = my_cq->galpas.kernel; + cqx_fec = hipz_galpa_load(gal, CQTEMM_OFFSET(cqx_fec)); + EDEB(8, "ehca_cq=%p cq_num=%x CQX_FEC=%lx", + my_cq, my_cq->cq_number, cqx_fec); + + my_cq->ib_cq.cqe = my_cq->nr_of_entries = + act_nr_of_entries-additional_cqe; + my_cq->cq_number = (my_cq->ipz_cq_handle.handle) & 0xffff; + + for (i = 0; i < QP_HASHTAB_LEN; i++) + INIT_HLIST_HEAD(&my_cq->qp_hashtab[i]); + + if (context) { + struct ipz_queue *ipz_queue = &my_cq->ipz_queue; + struct ehca_create_cq_resp resp; + struct vm_area_struct *vma = NULL; + memset(&resp, 0, sizeof(resp)); + resp.cq_number = my_cq->cq_number; + resp.token = my_cq->token; + resp.ipz_queue.qe_size = ipz_queue->qe_size; + resp.ipz_queue.act_nr_of_sg = ipz_queue->act_nr_of_sg; + resp.ipz_queue.queue_length = ipz_queue->queue_length; + resp.ipz_queue.pagesize = ipz_queue->pagesize; + resp.ipz_queue.toggle_state = ipz_queue->toggle_state; + ehca_mmap_nopage(((u64) (my_cq->token) << 32) | 0x12000000, + ipz_queue->queue_length, + ((void**)&resp.ipz_queue.queue), + &vma); + my_cq->uspace_queue = 
resp.ipz_queue.queue; + resp.galpas = my_cq->galpas; + ehca_mmap_register(my_cq->galpas.user.fw_handle, + ((void**)&resp.galpas.kernel.fw_handle), + &vma); + my_cq->uspace_fwh = (u64)resp.galpas.kernel.fw_handle; + if (ib_copy_to_udata(udata, &resp, sizeof(resp))) { + EDEB_ERR(4, "Copy to udata failed."); + goto create_cq_exit4; + } + } + + EDEB_EX(7,"retcode=%p ehca_cq=%p cq_num=%x cq_size=%x", + cq, my_cq, my_cq->cq_number, act_nr_of_entries); + return cq; + +create_cq_exit4: + ipz_queue_dtor(&my_cq->ipz_queue); + +create_cq_exit3: + hipz_rc = hipz_h_destroy_cq(adapter_handle, my_cq, 1); + EDEB(3, "hipz_h_destroy_cq() failed ehca_cq=%p cq_num=%x hipz_rc=%lx", + my_cq, my_cq->cq_number, hipz_rc); + +create_cq_exit2: + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + idr_remove(&ehca_cq_idr, my_cq->token); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + +create_cq_exit1: + kmem_cache_free(ehca_module.cache_cq, my_cq); + +create_cq_exit0: + EDEB_EX(7, "An error has occured retcode=%p ", cq); + return cq; +} + +int ehca_destroy_cq(struct ib_cq *cq) +{ + extern struct ehca_module ehca_module; + u64 hipz_rc = H_SUCCESS; + int retcode = 0; + struct ehca_cq *my_cq = NULL; + int cq_num = 0; + struct ib_device *device = NULL; + struct ehca_shca *shca = NULL; + struct ipz_adapter_handle adapter_handle; + u32 cur_pid = current->tgid; + unsigned long flags; + + EHCA_CHECK_CQ(cq); + my_cq = container_of(cq, struct ehca_cq, ib_cq); + cq_num = my_cq->cq_number; + device = cq->device; + EHCA_CHECK_DEVICE(device); + shca = container_of(device, struct ehca_shca, ib_device); + adapter_handle = shca->ipz_hca_handle; + EDEB_EN(7, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + while (my_cq->nr_callbacks != 0) + yield(); + + idr_remove(&ehca_cq_idr, my_cq->token); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + + if (my_cq->uspace_queue!=0 && my_cq->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + 
cur_pid, my_cq->ownpid); + return -EINVAL; + } + + /* un-mmap if vma alloc */ + if (my_cq->uspace_queue!=0) { + retcode = ehca_munmap(my_cq->uspace_queue, + my_cq->ipz_queue.queue_length); + retcode = ehca_munmap(my_cq->uspace_fwh, 4096); + } + + hipz_rc = hipz_h_destroy_cq(adapter_handle, my_cq, 0); + if (hipz_rc == H_R_STATE) { + /* cq in err: read err data and destroy it forcibly */ + EDEB(4, "ehca_cq=%p cq_num=%x ressource=%lx in err state. " + "Try to delete it forcibly.", + my_cq, my_cq->cq_number, my_cq->ipz_cq_handle.handle); + ehca_error_data(shca, my_cq, my_cq->ipz_cq_handle.handle); + hipz_rc = hipz_h_destroy_cq(adapter_handle, my_cq, 1); + if (hipz_rc == H_SUCCESS) { + EDEB(4, "ehca_cq=%p cq_num=%x deleted successfully.", + my_cq, my_cq->cq_number); + } + } + if (hipz_rc != H_SUCCESS) { + EDEB_ERR(4,"hipz_h_destroy_cq() failed " + "hipz_rc=%lx ehca_cq=%p cq_num=%x", + hipz_rc, my_cq, my_cq->cq_number); + retcode = ehca2ib_return_code(hipz_rc); + goto destroy_cq_exit0; + } + ipz_queue_dtor(&my_cq->ipz_queue); + kmem_cache_free(ehca_module.cache_cq, my_cq); + +destroy_cq_exit0: + EDEB_EX(7, "ehca_cq=%p cq_num=%x retcode=%x ", + my_cq, cq_num, retcode); + return retcode; +} + +int ehca_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata) +{ + int retcode = 0; + struct ehca_cq *my_cq = NULL; + u32 cur_pid = current->tgid; + + if (unlikely(NULL == cq)) { + EDEB_ERR(4, "cq is NULL"); + return -EFAULT; + } + + my_cq = container_of(cq, struct ehca_cq, ib_cq); + EDEB_EN(7, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + + if (my_cq->uspace_queue!=0 && my_cq->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_cq->ownpid); + return -EINVAL; + } + + /* TODO: proper resize needs to be done */ + retcode = -EFAULT; + EDEB_ERR(4, "not implemented yet"); + + EDEB_EX(7, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + return retcode; +} From schihei at de.ibm.com Thu Apr 27 03:49:29 2006 From: schihei at de.ibm.com (Heiko 
J Schick) Date: Thu, 27 Apr 2006 12:49:29 +0200 Subject: [openib-general] [PATCH 12/16] ehca: queue pair Message-ID: <4450A1B9.9070301@de.ibm.com> Signed-off-by: Heiko J Schick ehca_qes.h | 278 ++++++++++ ehca_qp.c | 1592 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ehca_reqs.c | 685 +++++++++++++++++++++++++ ehca_sqp.c | 126 ++++ 4 files changed, 2681 insertions(+) --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_qes.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_qes.h 2006-02-27 19:54:41.000000000 +0100 @@ -0,0 +1,278 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Hardware request structures + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_qes.h,v 1.3 2006/02/27 18:54:41 nguyen Exp $ + */ + + +#ifndef _EHCA_QES_H_ +#define _EHCA_QES_H_ + +/** DON'T include any kernel related files here!!! + * This file is used commonly in user and kernel space!!! + */ + +/** + * virtual scatter gather entry to specify remote addresses with length + */ +struct ehca_vsgentry { + u64 vaddr; + u32 lkey; + u32 length; +}; + +#define GRH_FLAG_MASK EHCA_BMASK_IBM(7,7) +#define GRH_IPVERSION_MASK EHCA_BMASK_IBM(0,3) +#define GRH_TCLASS_MASK EHCA_BMASK_IBM(4,12) +#define GRH_FLOWLABEL_MASK EHCA_BMASK_IBM(13,31) +#define GRH_PAYLEN_MASK EHCA_BMASK_IBM(32,47) +#define GRH_NEXTHEADER_MASK EHCA_BMASK_IBM(48,55) +#define GRH_HOPLIMIT_MASK EHCA_BMASK_IBM(56,63) + +/** + * Unreliable Datagram Address Vector Format + * see IBTA Vol1 chapter 8.3 Global Routing Header + */ +struct ehca_ud_av { + u8 sl; + u8 lnh; + u16 dlid; + u8 reserved1; + u8 reserved2; + u8 reserved3; + u8 slid_path_bits; + u8 reserved4; + u8 ipd; + u8 reserved5; + u8 pmtu; + u32 reserved6; + u64 reserved7; + union { + struct { + u64 word_0; /* always set to 6 */ + /*should be 0x1B for IB transport */ + u64 word_1; + u64 word_2; + u64 word_3; + u64 word_4; + } grh; + struct { + u32 wd_0; + u32 wd_1; + /* DWord_1 --> SGID */ + + u32 sgid_wd3; + /* bits 127 - 96 */ + + u32 sgid_wd2; + /* bits 95 - 64 */ + /* DWord_2 */ + + u32 sgid_wd1; + /* bits 63 - 32 */ + + u32 sgid_wd0; + /* bits 31 - 0 */ + /* DWord_3 --> DGID */ 
+ + u32 dgid_wd3; + /* bits 127 - 96 + **/ + u32 dgid_wd2; + /* bits 95 - 64 + DWord_4 */ + u32 dgid_wd1; + /* bits 63 - 32 */ + + u32 dgid_wd0; + /* bits 31 - 0 */ + } grh_l; + }; +}; + +/* maximum number of sg entries allowed in a WQE */ +#define MAX_WQE_SG_ENTRIES 252 + +#define WQE_OPTYPE_SEND 0x80 +#define WQE_OPTYPE_RDMAREAD 0x40 +#define WQE_OPTYPE_RDMAWRITE 0x20 +#define WQE_OPTYPE_CMPSWAP 0x10 +#define WQE_OPTYPE_FETCHADD 0x08 +#define WQE_OPTYPE_BIND 0x04 + +#define WQE_WRFLAG_REQ_SIGNAL_COM 0x80 +#define WQE_WRFLAG_FENCE 0x40 +#define WQE_WRFLAG_IMM_DATA_PRESENT 0x20 +#define WQE_WRFLAG_SOLIC_EVENT 0x10 + +#define WQEF_CACHE_HINT 0x80 +#define WQEF_CACHE_HINT_RD_WR 0x40 +#define WQEF_TIMED_WQE 0x20 +#define WQEF_PURGE 0x08 +#define WQEF_HIGH_NIBBLE 0xF0 + +#define MW_BIND_ACCESSCTRL_R_WRITE 0x40 +#define MW_BIND_ACCESSCTRL_R_READ 0x20 +#define MW_BIND_ACCESSCTRL_R_ATOMIC 0x10 + +struct ehca_wqe { + u64 work_request_id; + u8 optype; + u8 wr_flag; + u16 pkeyi; + u8 wqef; + u8 nr_of_data_seg; + u16 wqe_provided_slid; + u32 destination_qp_number; + u32 resync_psn_sqp; + u32 local_ee_context_qkey; + u32 immediate_data; + union { + struct { + u64 remote_virtual_adress; + u32 rkey; + u32 reserved; + u64 atomic_1st_op_dma_len; + u64 atomic_2nd_op; + struct ehca_vsgentry sg_list[MAX_WQE_SG_ENTRIES]; + + } nud; + struct { + u64 ehca_ud_av_ptr; + u64 reserved1; + u64 reserved2; + u64 reserved3; + struct ehca_vsgentry sg_list[MAX_WQE_SG_ENTRIES]; + } ud_avp; + struct { + struct ehca_ud_av ud_av; + struct ehca_vsgentry sg_list[MAX_WQE_SG_ENTRIES - + 2]; + } ud_av; + struct { + u64 reserved0; + u64 reserved1; + u64 reserved2; + u64 reserved3; + struct ehca_vsgentry sg_list[MAX_WQE_SG_ENTRIES]; + } all_rcv; + + struct { + u64 reserved; + u32 rkey; + u32 old_rkey; + u64 reserved1; + u64 reserved2; + u64 virtual_address; + u32 reserved3; + u32 length; + u32 reserved4; + u16 reserved5; + u8 reserved6; + u8 lr_ctl; + u32 lkey; + u32 reserved7; + u64 reserved8; + u64 
reserved9; + u64 reserved10; + u64 reserved11; + } bind; + struct { + u64 reserved12; + u64 reserved13; + u32 size; + u32 start; + } inline_data; + } u; + +}; + +#define WC_SEND_RECEIVE EHCA_BMASK_IBM(0,0) +#define WC_IMM_DATA EHCA_BMASK_IBM(1,1) +#define WC_GRH_PRESENT EHCA_BMASK_IBM(2,2) +#define WC_SE_BIT EHCA_BMASK_IBM(3,3) +#define WC_STATUS_ERROR_BIT 0x80000000 +#define WC_STATUS_REMOTE_ERROR_FLAGS 0x0000F800 +#define WC_STATUS_PURGE_BIT 0x10 + +struct ehca_cqe { + u64 work_request_id; + u8 optype; + u8 w_completion_flags; + u16 reserved1; + u32 nr_bytes_transferred; + u32 immediate_data; + u32 local_qp_number; + u8 freed_resource_count; + u8 service_level; + u16 wqe_count; + u32 qp_token; + u32 qkey_ee_token; + u32 remote_qp_number; + u16 dlid; + u16 rlid; + u16 reserved2; + u16 pkey_index; + u32 cqe_timestamp; + u32 wqe_timestamp; + u8 wqe_timestamp_valid; + u8 reserved3; + u8 reserved4; + u8 cqe_flags; + u32 status; +}; + +struct ehca_eqe { + u64 entry; +}; + +struct ehca_mrte { + u64 starting_va; + u64 length; /* length of memory region in bytes*/ + u32 pd; + u8 key_instance; + u8 pagesize; + u8 mr_control; + u8 local_remote_access_ctrl; + u8 reserved[0x20 - 0x18]; + u64 at_pointer[4]; +}; +#endif /*_EHCA_QES_H_*/ --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_qp.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_qp.c 2006-04-25 10:30:36.000000000 +0200 @@ -0,0 +1,1592 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * QP functions + * + * Authors: Waleri Fomin + * Hoang-Nam Nguyen + * Reinhard Ernst + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. 
+ * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_qp.c,v 1.30 2006/04/25 08:30:36 schickhj Exp $ + */ + + +#define DEB_PREFIX "e_qp" + +#include + +#include "ehca_kernel.h" +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "ehca_qes.h" +#include "ehca_iverbs.h" +#include "hcp_if.h" +#include "hipz_fns.h" + +/** + * attributes not supported by query qp + */ +#define QP_ATTR_QUERY_NOT_SUPPORTED (IB_QP_MAX_DEST_RD_ATOMIC | \ + IB_QP_MAX_QP_RD_ATOMIC | \ + IB_QP_ACCESS_FLAGS | \ + IB_QP_EN_SQD_ASYNC_NOTIFY) + +/** + * ehca (internal) qp state values + */ +enum ehca_qp_state { + EHCA_QPS_RESET = 1, + EHCA_QPS_INIT = 2, + EHCA_QPS_RTR = 3, + EHCA_QPS_RTS = 5, + EHCA_QPS_SQD = 6, + EHCA_QPS_SQE = 8, + EHCA_QPS_ERR = 128 +}; + +/** + * qp state transitions as defined by IB Arch Rel 1.1 page 431 + */ +enum ib_qp_statetrans { + IB_QPST_ANY2RESET, + IB_QPST_ANY2ERR, + IB_QPST_RESET2INIT, + IB_QPST_INIT2RTR, + IB_QPST_INIT2INIT, + IB_QPST_RTR2RTS, + IB_QPST_RTS2SQD, + IB_QPST_RTS2RTS, + IB_QPST_SQD2RTS, + IB_QPST_SQE2RTS, + IB_QPST_SQD2SQD, + IB_QPST_MAX /* nr of transitions, this must be last!!! 
*/ +}; + +/** + * ib2ehca_qp_state - maps IB to ehca qp_state + * returns ehca qp state corresponding to given ib qp state + */ +static inline enum ehca_qp_state ib2ehca_qp_state(enum ib_qp_state ib_qp_state) +{ + switch (ib_qp_state) { + case IB_QPS_RESET: + return EHCA_QPS_RESET; + case IB_QPS_INIT: + return EHCA_QPS_INIT; + case IB_QPS_RTR: + return EHCA_QPS_RTR; + case IB_QPS_RTS: + return EHCA_QPS_RTS; + case IB_QPS_SQD: + return EHCA_QPS_SQD; + case IB_QPS_SQE: + return EHCA_QPS_SQE; + case IB_QPS_ERR: + return EHCA_QPS_ERR; + default: + EDEB_ERR(4, "invalid ib_qp_state=%x", ib_qp_state); + return -EINVAL; + } +} + +/** + * ehca2ib_qp_state - maps ehca to IB qp_state + * returns ib qp state corresponding to given ehca qp state + */ +static inline enum ib_qp_state ehca2ib_qp_state(enum ehca_qp_state + ehca_qp_state) +{ + switch (ehca_qp_state) { + case EHCA_QPS_RESET: + return IB_QPS_RESET; + case EHCA_QPS_INIT: + return IB_QPS_INIT; + case EHCA_QPS_RTR: + return IB_QPS_RTR; + case EHCA_QPS_RTS: + return IB_QPS_RTS; + case EHCA_QPS_SQD: + return IB_QPS_SQD; + case EHCA_QPS_SQE: + return IB_QPS_SQE; + case EHCA_QPS_ERR: + return IB_QPS_ERR; + default: + EDEB_ERR(4,"invalid ehca_qp_state=%x",ehca_qp_state); + return -EINVAL; + } +} + +/** + * ehca_qp_type - used as index for req_attr and opt_attr of + * struct ehca_modqp_statetrans + */ +enum ehca_qp_type { + QPT_RC = 0, + QPT_UC = 1, + QPT_UD = 2, + QPT_SQP = 3, + QPT_MAX +}; + +/** + * ib2ehcaqptype - maps Ib to ehca qp_type + * returns ehca qp type corresponding to ib qp type + */ +static inline enum ehca_qp_type ib2ehcaqptype(enum ib_qp_type ibqptype) +{ + switch (ibqptype) { + case IB_QPT_SMI: + case IB_QPT_GSI: + return QPT_SQP; + case IB_QPT_RC: + return QPT_RC; + case IB_QPT_UC: + return QPT_UC; + case IB_QPT_UD: + return QPT_UD; + default: + EDEB_ERR(4,"Invalid ibqptype=%x", ibqptype); + return -EINVAL; + } +} + +static inline enum ib_qp_statetrans get_modqp_statetrans(int ib_fromstate, + int 
ib_tostate) +{ + int index = -EINVAL; + switch (ib_tostate) { + case IB_QPS_RESET: + index = IB_QPST_ANY2RESET; + break; + case IB_QPS_INIT: + if (ib_fromstate == IB_QPS_RESET) + index = IB_QPST_RESET2INIT; + else if (ib_fromstate == IB_QPS_INIT) + index = IB_QPST_INIT2INIT; + break; + case IB_QPS_RTR: + if (ib_fromstate == IB_QPS_INIT) + index = IB_QPST_INIT2RTR; + break; + case IB_QPS_RTS: + if (ib_fromstate == IB_QPS_RTR) + index = IB_QPST_RTR2RTS; + else if (ib_fromstate == IB_QPS_RTS) + index = IB_QPST_RTS2RTS; + else if (ib_fromstate == IB_QPS_SQD) + index = IB_QPST_SQD2RTS; + else if (ib_fromstate == IB_QPS_SQE) + index = IB_QPST_SQE2RTS; + break; + case IB_QPS_SQD: + if (ib_fromstate == IB_QPS_RTS) + index = IB_QPST_RTS2SQD; + break; + case IB_QPS_SQE: + break; + case IB_QPS_ERR: + index = IB_QPST_ANY2ERR; + break; + default: + return -EINVAL; + } + return index; +} + +enum ehca_service_type { + ST_RC = 0, + ST_UC = 1, + ST_RD = 2, + ST_UD = 3 +}; + +/** + * ibqptype2servicetype - returns hcp service type corresponding to given + * ib qp type used by create_qp() + */ +static inline int ibqptype2servicetype(enum ib_qp_type ibqptype) +{ + switch (ibqptype) { + case IB_QPT_SMI: + case IB_QPT_GSI: + return ST_UD; + case IB_QPT_RC: + return ST_RC; + case IB_QPT_UC: + return ST_UC; + case IB_QPT_UD: + return ST_UD; + case IB_QPT_RAW_IPV6: + return -EINVAL; + case IB_QPT_RAW_ETY: + return -EINVAL; + default: + EDEB_ERR(4, "Invalid ibqptype=%x", ibqptype); + return -EINVAL; + } +} + +/** + * init_qp_queues - Initializes/constructs r/squeue and registers queue pages. 
+ */ +static inline int init_qp_queues(struct ipz_adapter_handle ipz_hca_handle, + struct ehca_qp *my_qp, + int nr_sq_pages, + int nr_rq_pages, + int swqe_size, + int rwqe_size, + int nr_send_sges, int nr_receive_sges) +{ + int ret = -EINVAL; + int cnt = 0; + void *vpage = NULL; + u64 rpage = 0; + int ipz_rc = -1; + u64 hipz_rc = H_PARAMETER; + + ipz_rc = ipz_queue_ctor(&my_qp->ipz_squeue, + nr_sq_pages, + EHCA_PAGESIZE, swqe_size, nr_send_sges); + if (!ipz_rc) { + EDEB_ERR(4, "Cannot allocate page for squeue. ipz_rc=%x", + ipz_rc); + ret = -EBUSY; + return ret; + } + + ipz_rc = ipz_queue_ctor(&my_qp->ipz_rqueue, + nr_rq_pages, + EHCA_PAGESIZE, rwqe_size, nr_receive_sges); + if (!ipz_rc) { + EDEB_ERR(4, "Cannot allocate page for rqueue. ipz_rc=%x", + ipz_rc); + ret = -EBUSY; + goto init_qp_queues0; + } + /* register SQ pages */ + for (cnt = 0; cnt < nr_sq_pages; cnt++) { + vpage = ipz_qpageit_get_inc(&my_qp->ipz_squeue); + if (!vpage) { + EDEB_ERR(4, "SQ ipz_qpageit_get_inc() " + "failed p_vpage= %p", vpage); + ret = -EINVAL; + goto init_qp_queues1; + } + rpage = virt_to_abs(vpage); + + hipz_rc = hipz_h_register_rpage_qp(ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, 0, 0, + rpage, 1, + my_qp->galpas.kernel); + if (hipz_rc < H_SUCCESS) { + EDEB_ERR(4,"SQ hipz_qp_register_rpage() faield " + " rc=%lx", hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto init_qp_queues1; + } + } + + ipz_qeit_reset(&my_qp->ipz_squeue); + + /* register RQ pages */ + for (cnt = 0; cnt < nr_rq_pages; cnt++) { + vpage = ipz_qpageit_get_inc(&my_qp->ipz_rqueue); + if (!vpage) { + EDEB_ERR(4,"RQ ipz_qpageit_get_inc() " + "failed p_vpage = %p", vpage); + hipz_rc = H_RESOURCE; + ret = -EINVAL; + goto init_qp_queues1; + } + + rpage = virt_to_abs(vpage); + + hipz_rc = hipz_h_register_rpage_qp(ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, 0, 1, + rpage, 1, + my_qp->galpas.kernel); + if (hipz_rc < H_SUCCESS) { + EDEB_ERR(4, "RQ hipz_qp_register_rpage() failed " + "rc=%lx", 
hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto init_qp_queues1; + } + if (cnt == (nr_rq_pages - 1)) { /* last page! */ + if (hipz_rc != H_SUCCESS) { + EDEB_ERR(4,"RQ hipz_qp_register_rpage() " + "hipz_rc= %lx ", hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto init_qp_queues1; + } + vpage = ipz_qpageit_get_inc(&my_qp->ipz_rqueue); + if (vpage != NULL) { + EDEB_ERR(4,"ipz_qpageit_get_inc() " + "should not succeed vpage=%p", + vpage); + ret = -EINVAL; + goto init_qp_queues1; + } + } else { + if (hipz_rc != H_PAGE_REGISTERED) { + EDEB_ERR(4,"RQ hipz_qp_register_rpage() " + "hipz_rc= %lx ", hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto init_qp_queues1; + } + } + } + + ipz_qeit_reset(&my_qp->ipz_rqueue); + + return 0; + +init_qp_queues1: + ipz_queue_dtor(&my_qp->ipz_rqueue); +init_qp_queues0: + ipz_queue_dtor(&my_qp->ipz_squeue); + return ret; +} + + +struct ib_qp *ehca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata) +{ + extern struct ehca_module ehca_module; + static int da_msg_size[]={ 128, 256, 512, 1024, 2048, 4096 }; + int ret = -EINVAL; + int servicetype = 0; + int sigtype = 0; + + struct ehca_qp *my_qp = NULL; + struct ehca_pd *my_pd = NULL; + struct ehca_shca *shca = NULL; + struct ehca_cq *recv_ehca_cq = NULL; + struct ehca_cq *send_ehca_cq = NULL; + struct ib_ucontext *context = NULL; + u64 hipz_rc = H_PARAMETER; + int max_send_sge; + int max_recv_sge; + /* h_call's out parameters */ + u16 act_nr_send_wqes = 0, act_nr_recv_wqes = 0; + u8 act_nr_send_sges = 0, act_nr_recv_sges = 0; + u32 qp_nr = 0, nr_sq_pages = 0, swqe_size = 0; + u32 nr_rq_pages = 0, rwqe_size = 0; + u8 daqp_completion; + u8 isdaqp; + unsigned long flags; + + EDEB_EN(7,"pd=%p init_attr=%p", pd, init_attr); + EHCA_CHECK_PD_P(pd); + EHCA_CHECK_ADR_P(init_attr); + + if (init_attr->sq_sig_type != IB_SIGNAL_REQ_WR && + init_attr->sq_sig_type != IB_SIGNAL_ALL_WR) { + EDEB_ERR(4, "init_attr->sg_sig_type=%x not allowed", + 
init_attr->sq_sig_type); + return ERR_PTR(-EINVAL); + } + + /* save daqp completion bits */ + daqp_completion = init_attr->qp_type & 0x60; + /* save daqp bit */ + isdaqp = (init_attr->qp_type & 0x80) ? 1 : 0; + init_attr->qp_type = init_attr->qp_type & 0x1F; + + if (init_attr->qp_type != IB_QPT_UD && + init_attr->qp_type != IB_QPT_SMI && + init_attr->qp_type != IB_QPT_GSI && + init_attr->qp_type != IB_QPT_UC && + init_attr->qp_type != IB_QPT_RC) { + EDEB_ERR(4,"wrong QP Type=%x",init_attr->qp_type); + return ERR_PTR(-EINVAL); + } + if (init_attr->qp_type != IB_QPT_RC && isdaqp != 0) { + EDEB_ERR(4,"unsupported LL QP Type=%x",init_attr->qp_type); + return ERR_PTR(-EINVAL); + } + + if (pd->uobject && udata != NULL) + context = pd->uobject->context; + + my_qp = kmem_cache_alloc(ehca_module.cache_qp, SLAB_KERNEL); + if (my_qp == NULL) { + EDEB_ERR(4, "pd=%p not enough memory to alloc qp", pd); + return ERR_PTR(-ENOMEM); + } + + memset(my_qp, 0, sizeof(struct ehca_qp)); + spin_lock_init(&my_qp->spinlock_s); + spin_lock_init(&my_qp->spinlock_r); + + my_pd = container_of(pd, struct ehca_pd, ib_pd); + + shca = container_of(pd->device, struct ehca_shca, ib_device); + recv_ehca_cq = container_of(init_attr->recv_cq, struct ehca_cq, ib_cq); + send_ehca_cq = container_of(init_attr->send_cq, struct ehca_cq, ib_cq); + + my_qp->init_attr = *init_attr; + + do { + if (!idr_pre_get(&ehca_qp_idr, GFP_KERNEL)) { + ret = -ENOMEM; + EDEB_ERR(4, "Can't reserve idr resources."); + goto create_qp_exit0; + } + + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + ret = idr_get_new(&ehca_qp_idr, my_qp, &my_qp->token); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + + } while (ret == -EAGAIN); + + if (ret) { + ret = -ENOMEM; + EDEB_ERR(4, "Can't allocate new idr entry."); + goto create_qp_exit0; + } + + servicetype = ibqptype2servicetype(init_attr->qp_type); + if (servicetype < 0) { + ret = -EINVAL; + EDEB_ERR(4, "Invalid qp_type=%x", init_attr->qp_type); + goto create_qp_exit0; + } + + if 
(init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) + sigtype = HCALL_SIGT_EVERY; + else + sigtype = HCALL_SIGT_BY_WQE; + + /* UD_AV CIRCUMVENTION */ + max_send_sge = init_attr->cap.max_send_sge; + max_recv_sge = init_attr->cap.max_recv_sge; + if (IB_QPT_UD == init_attr->qp_type || + IB_QPT_GSI == init_attr->qp_type || + IB_QPT_SMI == init_attr->qp_type) { + max_send_sge += 2; + max_recv_sge += 2; + } + + EDEB(7, "isdaqp=%x daqp_completion=%x", isdaqp, daqp_completion); + + hipz_rc = hipz_h_alloc_resource_qp(shca->ipz_hca_handle, + &my_qp->pf, + servicetype, + isdaqp | daqp_completion, + sigtype, 0, /* no ud ad lkey ctrl */ + send_ehca_cq->ipz_cq_handle, + recv_ehca_cq->ipz_cq_handle, + shca->eq.ipz_eq_handle, + my_qp->token, + my_pd->fw_pd, + (u16) init_attr->cap.max_send_wr + 1, + (u16) init_attr->cap.max_recv_wr + 1, + (u8) max_send_sge, + (u8) max_recv_sge, + 0, /* ignored, ud ad lkey ctrl==0 */ + &my_qp->ipz_qp_handle, + &qp_nr, + &act_nr_send_wqes, + &act_nr_recv_wqes, + &act_nr_send_sges, + &act_nr_recv_sges, + &nr_sq_pages, + &nr_rq_pages, + &my_qp->galpas); + if (hipz_rc != H_SUCCESS) { + EDEB_ERR(4, "h_alloc_resource_qp() failed rc=%lx", hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto create_qp_exit1; + } + + /* store real qp_num as we got from ehca */ + my_qp->real_qp_num = qp_nr; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + if (isdaqp == 0) { + swqe_size = offsetof(struct ehca_wqe, + u.nud.sg_list[(act_nr_send_sges)]); + rwqe_size = offsetof(struct ehca_wqe, + u.nud.sg_list[(act_nr_recv_sges)]); + } else { /* for daqp we need to use msg size, not wqe size */ + swqe_size = da_msg_size[max_send_sge]; + rwqe_size = da_msg_size[max_recv_sge]; + act_nr_send_sges = 1; + act_nr_recv_sges = 1; + } + break; + case IB_QPT_UC: + swqe_size = offsetof(struct ehca_wqe, + u.nud.sg_list[(act_nr_send_sges)]); + rwqe_size = offsetof(struct ehca_wqe, + u.nud.sg_list[(act_nr_recv_sges)]); + break; + + case IB_QPT_UD: + case IB_QPT_GSI: + case IB_QPT_SMI: + /* UD 
circumvention */ + act_nr_recv_sges -= 2; + act_nr_send_sges -= 2; + swqe_size = offsetof(struct ehca_wqe, + u.ud_av.sg_list[(act_nr_send_sges)]); + rwqe_size = offsetof(struct ehca_wqe, + u.ud_av.sg_list[(act_nr_recv_sges)]); + + if (IB_QPT_GSI == init_attr->qp_type || + IB_QPT_SMI == init_attr->qp_type) { + act_nr_send_wqes = init_attr->cap.max_send_wr; + act_nr_recv_wqes = init_attr->cap.max_recv_wr; + act_nr_send_sges = init_attr->cap.max_send_sge; + act_nr_recv_sges = init_attr->cap.max_recv_sge; + qp_nr = (init_attr->qp_type == IB_QPT_SMI) ? 0 : 1; + } + + break; + + default: + break; + } + + /* initializes r/squeue and registers queue pages */ + ret = init_qp_queues(shca->ipz_hca_handle, my_qp, + nr_sq_pages, nr_rq_pages, + swqe_size, rwqe_size, + act_nr_send_sges, act_nr_recv_sges); + if (ret != 0) { + EDEB_ERR(4,"Couldn't initialize r/squeue and pages ret=%x", + ret); + goto create_qp_exit2; + } + + my_qp->ib_qp.pd = &my_pd->ib_pd; + my_qp->ib_qp.device = my_pd->ib_pd.device; + + my_qp->ib_qp.recv_cq = init_attr->recv_cq; + my_qp->ib_qp.send_cq = init_attr->send_cq; + + my_qp->ib_qp.qp_num = qp_nr; + my_qp->ib_qp.qp_type = init_attr->qp_type; + + my_qp->qp_type = init_attr->qp_type; + my_qp->ib_qp.srq = init_attr->srq; + + my_qp->ib_qp.qp_context = init_attr->qp_context; + my_qp->ib_qp.event_handler = init_attr->event_handler; + + init_attr->cap.max_inline_data = 0; /* not supported yet */ + init_attr->cap.max_recv_sge = act_nr_recv_sges; + init_attr->cap.max_recv_wr = act_nr_recv_wqes; + init_attr->cap.max_send_sge = act_nr_send_sges; + init_attr->cap.max_send_wr = act_nr_send_wqes; + + /* NOTE: define_apq0() not supported yet */ + if (init_attr->qp_type == IB_QPT_GSI) { + if ((hipz_rc = ehca_define_sqp(shca, my_qp, init_attr))) { + EDEB_ERR(4, "ehca_define_sqp() failed rc=%lx", + hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto create_qp_exit3; + } + } + + if (init_attr->send_cq != NULL) { + struct ehca_cq *cq = container_of(init_attr->send_cq, + 
struct ehca_cq, ib_cq); + ret = ehca_cq_assign_qp(cq, my_qp); + if (ret != 0) { + EDEB_ERR(4, "Couldn't assign qp to send_cq ret=%x", + ret); + goto create_qp_exit3; + } + my_qp->send_cq = cq; + } + + /* copy queues, galpa data to user space */ + if (context != NULL && udata != NULL) { + struct ipz_queue *ipz_rqueue = &my_qp->ipz_rqueue; + struct ipz_queue *ipz_squeue = &my_qp->ipz_squeue; + struct ehca_create_qp_resp resp; + struct vm_area_struct * vma; + memset(&resp, 0, sizeof(resp)); + resp.qp_num = qp_nr; + resp.token = my_qp->token; + resp.qp_type = my_qp->qp_type; + resp.qkey = my_qp->qkey; + resp.real_qp_num = my_qp->real_qp_num; + /* rqueue properties */ + resp.ipz_rqueue.qe_size = ipz_rqueue->qe_size; + resp.ipz_rqueue.act_nr_of_sg = ipz_rqueue->act_nr_of_sg; + resp.ipz_rqueue.queue_length = ipz_rqueue->queue_length; + resp.ipz_rqueue.pagesize = ipz_rqueue->pagesize; + resp.ipz_rqueue.toggle_state = ipz_rqueue->toggle_state; + ehca_mmap_nopage(((u64) (my_qp->token) << 32) | 0x22000000, + ipz_rqueue->queue_length, + ((void**)&resp.ipz_rqueue.queue), + &vma); + my_qp->uspace_rqueue = resp.ipz_rqueue.queue; + /* squeue properties */ + resp.ipz_squeue.qe_size = ipz_squeue->qe_size; + resp.ipz_squeue.act_nr_of_sg = ipz_squeue->act_nr_of_sg; + resp.ipz_squeue.queue_length = ipz_squeue->queue_length; + resp.ipz_squeue.pagesize = ipz_squeue->pagesize; + resp.ipz_squeue.toggle_state = ipz_squeue->toggle_state; + ehca_mmap_nopage(((u64) (my_qp->token) << 32) | 0x23000000, + ipz_squeue->queue_length, + ((void**)&resp.ipz_squeue.queue), + &vma); + my_qp->uspace_squeue = resp.ipz_squeue.queue; + /* fw_handle */ + resp.galpas = my_qp->galpas; + ehca_mmap_register(my_qp->galpas.user.fw_handle, + ((void**)&resp.galpas.kernel.fw_handle), + &vma); + my_qp->uspace_fwh = (u64)resp.galpas.kernel.fw_handle; + + if (ib_copy_to_udata(udata, &resp, sizeof resp)) { + EDEB_ERR(4, "Copy to udata failed"); + ret = -EINVAL; + goto create_qp_exit3; + } + } + + EDEB_EX(7, "ehca_qp=%p 
qp_num=%x, token=%x", + my_qp, qp_nr, my_qp->token); + return (&my_qp->ib_qp); + +create_qp_exit3: + ipz_queue_dtor(&my_qp->ipz_rqueue); + ipz_queue_dtor(&my_qp->ipz_squeue); + +create_qp_exit2: + hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); + +create_qp_exit1: + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + idr_remove(&ehca_qp_idr, my_qp->token); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + +create_qp_exit0: + kmem_cache_free(ehca_module.cache_qp, my_qp); + EDEB_EX(4, "failed ret=%x", ret); + return ERR_PTR(ret); + +} + +/** + * prepare_sqe_rts - called by internal_modify_qp() at trans sqe -> rts + * set purge bit of bad wqe and subsequent wqes to avoid reentering sqe + * returns total number of bad wqes in bad_wqe_cnt + */ +static int prepare_sqe_rts(struct ehca_qp *my_qp, struct ehca_shca *shca, + int *bad_wqe_cnt) +{ + int ret = 0; + u64 hipz_rc = H_SUCCESS; + struct ipz_queue *squeue = NULL; + void *bad_send_wqe_p = NULL; + void *bad_send_wqe_v = NULL; + void *squeue_start_p = NULL; + void *squeue_end_p = NULL; + void *squeue_start_v = NULL; + void *squeue_end_v = NULL; + struct ehca_wqe *wqe = NULL; + int qp_num = my_qp->ib_qp.qp_num; + + EDEB_EN(7, "ehca_qp=%p qp_num=%x ", my_qp, qp_num); + + /* get send wqe pointer */ + hipz_rc = hipz_h_disable_and_get_wqe(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, &my_qp->pf, + &bad_send_wqe_p, NULL, 2); + if (hipz_rc != H_SUCCESS) { + EDEB_ERR(4, "hipz_h_disable_and_get_wqe() failed " + "ehca_qp=%p qp_num=%x hipz_rc=%lx", + my_qp, qp_num, hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto prepare_sqe_rts_exit1; + } + bad_send_wqe_p = (void*)((u64)bad_send_wqe_p & (~(1L<<63))); + EDEB(7, "qp_num=%x bad_send_wqe_p=%p", qp_num, bad_send_wqe_p); + /* convert wqe pointer to vadr */ + bad_send_wqe_v = abs_to_virt((u64)bad_send_wqe_p); + EDEB_DMP(6, bad_send_wqe_v, 32, "qp_num=%x bad_wqe", qp_num); + squeue = &my_qp->ipz_squeue; + squeue_start_p = (void*)virt_to_abs(ipz_qeit_calc(squeue, 0L)); + squeue_end_p 
= squeue_start_p+squeue->queue_length; + squeue_start_v = abs_to_virt((u64)squeue_start_p); + squeue_end_v = abs_to_virt((u64)squeue_end_p); + EDEB(6, "qp_num=%x squeue_start_v=%p squeue_end_v=%p", + qp_num, squeue_start_v, squeue_end_v); + + /* loop sets wqe's purge bit */ + wqe = (struct ehca_wqe*)bad_send_wqe_v; + *bad_wqe_cnt = 0; + while (wqe->optype != 0xff && wqe->wqef != 0xff) { + EDEB_DMP(6, wqe, 32, "qp_num=%x wqe", qp_num); + wqe->nr_of_data_seg = 0; /* suppress data access */ + wqe->wqef = WQEF_PURGE; /* WQE to be purged */ + wqe = (struct ehca_wqe*)((u8*)wqe+squeue->qe_size); + *bad_wqe_cnt = (*bad_wqe_cnt)+1; + if ((void*)wqe >= squeue_end_v) { + wqe = squeue_start_v; + } + } + /* bad wqe will be reprocessed and ignored when poll_cq() is called, + * i.e. nr of wqes with flush error status is one less + */ + EDEB(6, "qp_num=%x flusherr_wqe_cnt=%x", qp_num, (*bad_wqe_cnt)-1); + wqe->wqef = 0; + +prepare_sqe_rts_exit1: + + EDEB_EX(7, "ehca_qp=%p qp_num=%x ret=%x", my_qp, qp_num, ret); + return ret; +} + +/** + * internal_modify_qp - with circumvention to handle aqp0 properly + * smi_reset2init indicates if this is an internal reset-to-init-call for + * smi. This flag must always be zero if called from ehca_modify_qp()! + * This internal func was introduced to avoid recursion of ehca_modify_qp()!
+ */ +static int internal_modify_qp(struct ib_qp *ibqp, + struct ib_qp_attr *attr, + int attr_mask, int smi_reset2init) +{ + enum ib_qp_state qp_cur_state = 0, qp_new_state = 0; + int cnt = 0, qp_attr_idx = 0, retcode = 0; + + enum ib_qp_statetrans statetrans; + struct hcp_modify_qp_control_block *mqpcb = NULL; + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + u64 update_mask = 0; + u64 hipz_rc = H_SUCCESS; + int bad_wqe_cnt = 0; + int squeue_locked = 0; + unsigned long spl_flags = 0; + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + shca = container_of(ibqp->pd->device, struct ehca_shca, ib_device); + + EDEB_EN(7, "ehca_qp=%p qp_num=%x ibqp_type=%x " + "new qp_state=%x attribute_mask=%x", + my_qp, ibqp->qp_num, ibqp->qp_type, + attr->qp_state, attr_mask); + + /* do query_qp to obtain current attr values */ + mqpcb = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (mqpcb == NULL) { + retcode = -ENOMEM; + EDEB_ERR(4, "Could not get zeroed page for mqpcb " + "ehca_qp=%p qp_num=%x ", my_qp, ibqp->qp_num); + goto modify_qp_exit0; + } + + hipz_rc = hipz_h_query_qp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + mqpcb, my_qp->galpas.kernel); + if (hipz_rc != H_SUCCESS) { + EDEB_ERR(4, "hipz_h_query_qp() failed " + "ehca_qp=%p qp_num=%x hipz_rc=%lx", + my_qp, ibqp->qp_num, hipz_rc); + retcode = ehca2ib_return_code(hipz_rc); + goto modify_qp_exit1; + } + EDEB(7, "ehca_qp=%p qp_num=%x ehca_qp_state=%x", + my_qp, ibqp->qp_num, mqpcb->qp_state); + + qp_cur_state = ehca2ib_qp_state(mqpcb->qp_state); + + if (qp_cur_state == -EINVAL) { /* invalid qp state */ + retcode = -EINVAL; + EDEB_ERR(4, "Invalid current ehca_qp_state=%x " + "ehca_qp=%p qp_num=%x", + mqpcb->qp_state, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + /* circumvention to set aqp0 initial state to init + as expected by IB spec */ + if (smi_reset2init == 0 && + ibqp->qp_type == IB_QPT_SMI && + qp_cur_state == IB_QPS_RESET && + (attr_mask & IB_QP_STATE) && + attr->qp_state == 
IB_QPS_INIT) { /* RESET -> INIT */ + struct ib_qp_attr smiqp_attr = { + .qp_state = IB_QPS_INIT, + .port_num = my_qp->init_attr.port_num, + .pkey_index = 0, + .qkey = 0 + }; + int smiqp_attr_mask = IB_QP_STATE | IB_QP_PORT | + IB_QP_PKEY_INDEX | IB_QP_QKEY; + int smirc = internal_modify_qp( + ibqp, &smiqp_attr, smiqp_attr_mask, 1); + if (smirc != 0) { + EDEB_ERR(4, "SMI RESET -> INIT failed. " + "ehca_modify_qp() rc=%x", smirc); + retcode = H_PARAMETER; + goto modify_qp_exit1; + } + qp_cur_state = IB_QPS_INIT; + EDEB(7, "SMI RESET -> INIT succeeded"); + } + /* is transmitted current state equal to "real" current state */ + if ((attr_mask & IB_QP_CUR_STATE) && + qp_cur_state != attr->cur_qp_state) { + retcode = -EINVAL; + EDEB_ERR(4, "Invalid IB_QP_CUR_STATE attr->curr_qp_state=%x <>" + " actual cur_qp_state=%x. ehca_qp=%p qp_num=%x", + attr->cur_qp_state, qp_cur_state, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + + EDEB(7, "ehca_qp=%p qp_num=%x current qp_state=%x " + "new qp_state=%x attribute_mask=%x", + my_qp, ibqp->qp_num, qp_cur_state, attr->qp_state, attr_mask); + + qp_new_state = attr_mask & IB_QP_STATE ? 
attr->qp_state : qp_cur_state; + if (!smi_reset2init && + !ib_modify_qp_is_ok(qp_cur_state, qp_new_state, ibqp->qp_type, + attr_mask)) { + retcode = -EINVAL; + EDEB_ERR(4, "Invalid qp transition new_state=%x cur_state=%x " + "ehca_qp=%p qp_num=%x attr_mask=%x", + qp_new_state, qp_cur_state, my_qp, ibqp->qp_num, + attr_mask); + goto modify_qp_exit1; + } + + if ((mqpcb->qp_state = ib2ehca_qp_state(qp_new_state))) + update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_STATE, 1); + else { + retcode = -EINVAL; + EDEB_ERR(4, "Invalid new qp state=%x " + "ehca_qp=%p qp_num=%x", + qp_new_state, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + + /* retrieve state transition struct to get req and opt attrs */ + statetrans = get_modqp_statetrans(qp_cur_state, qp_new_state); + if (statetrans < 0) { + retcode = -EINVAL; + EDEB_ERR(4, " qp_cur_state=%x " + "new_qp_state=%x State_xsition=%x " + "ehca_qp=%p qp_num=%x", + qp_cur_state, qp_new_state, + statetrans, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + + qp_attr_idx = ib2ehcaqptype(ibqp->qp_type); + + if (qp_attr_idx < 0) { + retcode = qp_attr_idx; + EDEB_ERR(4, "Invalid QP type=%x ehca_qp=%p qp_num=%x", + ibqp->qp_type, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + + EDEB(7, "ehca_qp=%p qp_num=%x qp_state_xsit=%x", + my_qp, ibqp->qp_num, statetrans); + + /* sqe -> rts: set purge bit of bad wqe before actual trans */ + if ((my_qp->qp_type == IB_QPT_UD || + my_qp->qp_type == IB_QPT_GSI || + my_qp->qp_type == IB_QPT_SMI) && + statetrans == IB_QPST_SQE2RTS) { + /* mark next free wqe if kernel */ + if (my_qp->uspace_squeue == 0) { + struct ehca_wqe *wqe = NULL; + /* lock send queue */ + spin_lock_irqsave(&my_qp->spinlock_s, spl_flags); + squeue_locked = 1; + /* mark next free wqe */ + wqe = (struct ehca_wqe*) + ipz_qeit_get(&my_qp->ipz_squeue); + wqe->optype = wqe->wqef = 0xff; + EDEB(7, "qp_num=%x next_free_wqe=%p", + ibqp->qp_num, wqe); + } + retcode = prepare_sqe_rts(my_qp, shca, &bad_wqe_cnt); + if (retcode != 0) { + 
EDEB_ERR(4, "prepare_sqe_rts() failed " + "ehca_qp=%p qp_num=%x ret=%x", + my_qp, ibqp->qp_num, retcode); + goto modify_qp_exit2; + } + } + + /* enable RDMA_Atomic_Control if reset->init and connected QP (RC/UC) + this is necessary since gen2 does not provide that flag, + but pHyp requires it */ + if (statetrans == IB_QPST_RESET2INIT && + (ibqp->qp_type == IB_QPT_RC || ibqp->qp_type == IB_QPT_UC)) { + mqpcb->rdma_atomic_ctrl = 3; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RDMA_ATOMIC_CTRL, 1); + } + /* circumvention: pHyp requires #RDMA/Atomic Resp Res for UC INIT -> RTR */ + if (statetrans == IB_QPST_INIT2RTR && + (ibqp->qp_type == IB_QPT_UC) && + !(attr_mask & IB_QP_MAX_DEST_RD_ATOMIC)) { + mqpcb->rdma_nr_atomic_resp_res = 1; /* default to 1 */ + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES, 1); + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + mqpcb->prim_p_key_idx = attr->pkey_index; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PRIM_P_KEY_IDX, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_PKEY_INDEX update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_PORT) { + if (attr->port_num < 1 || attr->port_num > shca->num_ports) { + retcode = -EINVAL; + EDEB_ERR(4, "Invalid port=%x.
" + "ehca_qp=%p qp_num=%x num_ports=%x", + attr->port_num, my_qp, ibqp->qp_num, + shca->num_ports); + goto modify_qp_exit2; + } + mqpcb->prim_phys_port = attr->port_num; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PRIM_PHYS_PORT, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_PORT update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_QKEY) { + mqpcb->qkey = attr->qkey; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_QKEY, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_QKEY update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_AV) { + int ah_mult = ib_rate_to_mult(attr->ah_attr.static_rate); + int ehca_mult = ib_rate_to_mult(shca->sport[my_qp-> + init_attr.port_num].rate); + + mqpcb->dlid = attr->ah_attr.dlid; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DLID, 1); + mqpcb->source_path_bits = attr->ah_attr.src_path_bits; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SOURCE_PATH_BITS, 1); + mqpcb->service_level = attr->ah_attr.sl; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SERVICE_LEVEL, 1); + + if (ah_mult < ehca_mult) + mqpcb->max_static_rate = (ah_mult > 0) ? + ((ehca_mult - 1) / ah_mult) : 0; + else + mqpcb->max_static_rate = 0; + + EDEB(7, " ipd=mqpcb->max_static_rate set %x " + " ah_mult=%x ehca_mult=%x " + " attr->ah_attr.static_rate=%x", + mqpcb->max_static_rate,ah_mult,ehca_mult, + attr->ah_attr.static_rate); + + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_MAX_STATIC_RATE, 1); + + /* only if GRH is TRUE we might consider SOURCE_GID_IDX + * and DEST_GID otherwise phype will return H_ATTR_PARM!!! 
+ */ + if (attr->ah_attr.ah_flags == IB_AH_GRH) { + mqpcb->send_grh_flag = 1 << 31; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG, 1); + mqpcb->source_gid_idx = attr->ah_attr.grh.sgid_index; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SOURCE_GID_IDX, 1); + + for (cnt = 0; cnt < 16; cnt++) + mqpcb->dest_gid.byte[cnt] = + attr->ah_attr.grh.dgid.raw[cnt]; + + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DEST_GID, 1); + mqpcb->flow_label = attr->ah_attr.grh.flow_label; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_FLOW_LABEL, 1); + mqpcb->hop_limit = attr->ah_attr.grh.hop_limit; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_HOP_LIMIT, 1); + mqpcb->traffic_class = attr->ah_attr.grh.traffic_class; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_TRAFFIC_CLASS, 1); + } + + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_AV update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_PATH_MTU) { + mqpcb->path_mtu = attr->path_mtu; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PATH_MTU, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_PATH_MTU update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_TIMEOUT) { + mqpcb->timeout = attr->timeout; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_TIMEOUT, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_TIMEOUT update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_RETRY_CNT) { + mqpcb->retry_count = attr->retry_cnt; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RETRY_COUNT, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_RETRY_CNT update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_RNR_RETRY) { + mqpcb->rnr_retry_count = attr->rnr_retry; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RNR_RETRY_COUNT, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_RNR_RETRY update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_RQ_PSN) { + mqpcb->receive_psn = attr->rq_psn; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RECEIVE_PSN, 1); + EDEB(7, "ehca_qp=%p 
qp_num=%x IB_QP_RQ_PSN update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) { + mqpcb->rdma_nr_atomic_resp_res = attr->max_dest_rd_atomic < 3 ? + attr->max_dest_rd_atomic : 2; /* max is 2 */ + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_MAX_DEST_RD_ATOMIC " + "update_mask=%lx", my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) { + mqpcb->rdma_atomic_outst_dest_qp = attr->max_rd_atomic; + update_mask |= + EHCA_BMASK_SET + (MQPCB_MASK_RDMA_ATOMIC_OUTST_DEST_QP, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_MAX_QP_RD_ATOMIC " + "update_mask=%lx", my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_ALT_PATH) { + int ah_mult = ib_rate_to_mult(attr->alt_ah_attr.static_rate); + int ehca_mult = ib_rate_to_mult( + shca->sport[my_qp->init_attr.port_num].rate); + + mqpcb->dlid_al = attr->alt_ah_attr.dlid; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DLID_AL, 1); + mqpcb->source_path_bits_al = attr->alt_ah_attr.src_path_bits; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SOURCE_PATH_BITS_AL, 1); + mqpcb->service_level_al = attr->alt_ah_attr.sl; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SERVICE_LEVEL_AL, 1); + + if (ah_mult < ehca_mult) + mqpcb->max_static_rate_al = (ah_mult > 0) ? + ((ehca_mult - 1) / ah_mult) : 0; + else + mqpcb->max_static_rate_al = 0; + + EDEB(7, " ipd=mqpcb->max_static_rate_al set %x," + " ah_mult=%x ehca_mult=%x", + mqpcb->max_static_rate_al,ah_mult,ehca_mult); + + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_MAX_STATIC_RATE_AL, 1); + + /* only if GRH is TRUE we might consider SOURCE_GID_IDX + * and DEST_GID otherwise pHyp will return H_ATTR_PARM!!!
+ */ + if (attr->alt_ah_attr.ah_flags == IB_AH_GRH) { + mqpcb->send_grh_flag_al = 1 << 31; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG_AL, 1); + mqpcb->source_gid_idx_al = + attr->alt_ah_attr.grh.sgid_index; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SOURCE_GID_IDX_AL, 1); + + for (cnt = 0; cnt < 16; cnt++) + mqpcb->dest_gid_al.byte[cnt] = + attr->alt_ah_attr.grh.dgid.raw[cnt]; + + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_DEST_GID_AL, 1); + mqpcb->flow_label_al = attr->alt_ah_attr.grh.flow_label; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_FLOW_LABEL_AL, 1); + mqpcb->hop_limit_al = attr->alt_ah_attr.grh.hop_limit; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_HOP_LIMIT_AL, 1); + mqpcb->traffic_class_al = + attr->alt_ah_attr.grh.traffic_class; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_TRAFFIC_CLASS_AL, 1); + } + + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_ALT_PATH update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + mqpcb->min_rnr_nak_timer_field = attr->min_rnr_timer; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_MIN_RNR_NAK_TIMER_FIELD, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_MIN_RNR_TIMER update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_SQ_PSN) { + mqpcb->send_psn = attr->sq_psn; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SEND_PSN, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_SQ_PSN update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_DEST_QPN) { + mqpcb->dest_qp_nr = attr->dest_qp_num; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DEST_QP_NR, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_DEST_QPN update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_PATH_MIG_STATE) { + mqpcb->path_migration_state = attr->path_mig_state; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_PATH_MIGRATION_STATE, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_PATH_MIG_STATE update_mask=%lx", my_qp, + ibqp->qp_num, 
update_mask); + } + + if (attr_mask & IB_QP_CAP) { + mqpcb->max_nr_outst_send_wr = attr->cap.max_send_wr+1; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_MAX_NR_OUTST_SEND_WR, 1); + mqpcb->max_nr_outst_recv_wr = attr->cap.max_recv_wr+1; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_MAX_NR_OUTST_RECV_WR, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_CAP update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + /* no support for max_send/recv_sge yet */ + } + + EDEB_DMP(7, mqpcb, 4*70, "ehca_qp=%p qp_num=%x", my_qp, ibqp->qp_num); + + hipz_rc = hipz_h_modify_qp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + update_mask, + mqpcb, my_qp->galpas.kernel); + + if (hipz_rc != H_SUCCESS) { + retcode = ehca2ib_return_code(hipz_rc); + EDEB_ERR(4, "hipz_h_modify_qp() failed rc=%lx " + "ehca_qp=%p qp_num=%x", + hipz_rc, my_qp, ibqp->qp_num); + goto modify_qp_exit2; + } + + if ((my_qp->qp_type == IB_QPT_UD || + my_qp->qp_type == IB_QPT_GSI || + my_qp->qp_type == IB_QPT_SMI) && + statetrans == IB_QPST_SQE2RTS) { + /* doorbell to reprocessing wqes */ + iosync(); /* serialize GAL register access */ + hipz_update_sqa(my_qp, bad_wqe_cnt-1); + EDEB(6, "doorbell for %x wqes", bad_wqe_cnt); + } + + if (statetrans == IB_QPST_RESET2INIT || + statetrans == IB_QPST_INIT2INIT) { + mqpcb->qp_enable = 1; + mqpcb->qp_state = EHCA_QPS_INIT; + update_mask = 0; + update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_ENABLE, 1); + + EDEB(7, "ehca_qp=%p qp_num=%x " + "RESET_2_INIT needs an additional enable " + "-> update_mask=%lx", my_qp, ibqp->qp_num, update_mask); + + hipz_rc = hipz_h_modify_qp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + update_mask, + mqpcb, + my_qp->galpas.kernel); + + if (hipz_rc != H_SUCCESS) { + retcode = ehca2ib_return_code(hipz_rc); + EDEB_ERR(4, "ENABLE in context of " + "RESET_2_INIT failed! 
" + "Maybe you didn't get a LID" + "hipz_rc=%lx ehca_qp=%p qp_num=%x", + hipz_rc, my_qp, ibqp->qp_num); + goto modify_qp_exit2; + } + } + + if (statetrans == IB_QPST_ANY2RESET) { + ipz_qeit_reset(&my_qp->ipz_rqueue); + ipz_qeit_reset(&my_qp->ipz_squeue); + } + + if (attr_mask & IB_QP_QKEY) + my_qp->qkey = attr->qkey; + +modify_qp_exit2: + if (squeue_locked) { /* this means: sqe -> rts */ + spin_unlock_irqrestore(&my_qp->spinlock_s, spl_flags); + my_qp->sqerr_purgeflag = 1; + } + +modify_qp_exit1: + kfree(mqpcb); + +modify_qp_exit0: + EDEB_EX(7, "ehca_qp=%p qp_num=%x ibqp_type=%x retcode=%x", + my_qp, ibqp->qp_num, ibqp->qp_type, retcode); + return retcode; +} + +int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + int ret = 0; + struct ehca_qp *my_qp = NULL; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + + EHCA_CHECK_ADR(ibqp); + EHCA_CHECK_ADR(attr); + EHCA_CHECK_ADR(ibqp->device); + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + + EDEB_EN(7, "ehca_qp=%p qp_num=%x ibqp_type=%x attr_mask=%x", + my_qp, ibqp->qp_num, ibqp->qp_type, attr_mask); + + my_pd = container_of(my_qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject!=NULL && my_pd->ib_pd.uobject->context!=NULL && + my_pd->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + ret = -EINVAL; + } else + ret = internal_modify_qp(ibqp, attr, attr_mask, 0); + + EDEB_EX(7, "ehca_qp=%p qp_num=%x ibqp_type=%x ret=%x", + my_qp, ibqp->qp_num, ibqp->qp_type, ret); + return ret; +} + +int ehca_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr) +{ + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + struct hcp_modify_qp_control_block *qpcb = NULL; + struct ipz_adapter_handle adapter_handle; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + int cnt = 0, retcode = 0; + u64 hipz_rc = H_SUCCESS; + + EHCA_CHECK_ADR(qp); + 
EHCA_CHECK_ADR(qp_attr); + EHCA_CHECK_DEVICE(qp->device); + + my_qp = container_of(qp, struct ehca_qp, ib_qp); + + EDEB_EN(7, "ehca_qp=%p qp_num=%x " + "qp_attr=%p qp_attr_mask=%x qp_init_attr=%p", + my_qp, qp->qp_num, qp_attr, qp_attr_mask, qp_init_attr); + + my_pd = container_of(my_qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject!=NULL && my_pd->ib_pd.uobject->context!=NULL && + my_pd->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + retcode = -EINVAL; + goto query_qp_exit0; + } + + shca = container_of(qp->device, struct ehca_shca, ib_device); + adapter_handle = shca->ipz_hca_handle; + + if (qp_attr_mask & QP_ATTR_QUERY_NOT_SUPPORTED) { + retcode = -EINVAL; + EDEB_ERR(4,"Invalid attribute mask " + "ehca_qp=%p qp_num=%x qp_attr_mask=%x ", + my_qp, qp->qp_num, qp_attr_mask); + goto query_qp_exit0; + } + + qpcb = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL ); + if (qpcb == NULL) { + retcode = -ENOMEM; + EDEB_ERR(4,"Out of memory for qpcb " + "ehca_qp=%p qp_num=%x", my_qp, qp->qp_num); + goto query_qp_exit0; + } + + hipz_rc = hipz_h_query_qp(adapter_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + qpcb, my_qp->galpas.kernel); + + if (hipz_rc != H_SUCCESS) { + retcode = ehca2ib_return_code(hipz_rc); + EDEB_ERR(4,"hipz_h_query_qp() failed " + "ehca_qp=%p qp_num=%x hipz_rc=%lx", + my_qp, qp->qp_num, hipz_rc); + goto query_qp_exit1; + } + + qp_attr->cur_qp_state = ehca2ib_qp_state(qpcb->qp_state); + qp_attr->qp_state = qp_attr->cur_qp_state; + if (qp_attr->cur_qp_state == -EINVAL) { + retcode = -EINVAL; + EDEB_ERR(4,"Got invalid ehca_qp_state=%x " + "ehca_qp=%p qp_num=%x", + qpcb->qp_state, my_qp, qp->qp_num); + goto query_qp_exit1; + } + + if (qp_attr->qp_state == IB_QPS_SQD) + qp_attr->sq_draining = 1; + + qp_attr->qkey = qpcb->qkey; + qp_attr->path_mtu = qpcb->path_mtu; + qp_attr->path_mig_state = qpcb->path_migration_state; + qp_attr->rq_psn = qpcb->receive_psn; + qp_attr->sq_psn = qpcb->send_psn; + 
qp_attr->min_rnr_timer = qpcb->min_rnr_nak_timer_field; + qp_attr->cap.max_send_wr = qpcb->max_nr_outst_send_wr-1; + qp_attr->cap.max_recv_wr = qpcb->max_nr_outst_recv_wr-1; + /* UD_AV CIRCUMVENTION */ + if (my_qp->qp_type == IB_QPT_UD) { + qp_attr->cap.max_send_sge = + qpcb->actual_nr_sges_in_sq_wqe - 2; + qp_attr->cap.max_recv_sge = + qpcb->actual_nr_sges_in_rq_wqe - 2; + } else { + qp_attr->cap.max_send_sge = + qpcb->actual_nr_sges_in_sq_wqe; + qp_attr->cap.max_recv_sge = + qpcb->actual_nr_sges_in_rq_wqe; + } + + qp_attr->cap.max_inline_data = my_qp->sq_max_inline_data_size; + qp_attr->dest_qp_num = qpcb->dest_qp_nr; + + qp_attr->pkey_index = + EHCA_BMASK_GET(MQPCB_PRIM_P_KEY_IDX, qpcb->prim_p_key_idx); + + qp_attr->port_num = + EHCA_BMASK_GET(MQPCB_PRIM_PHYS_PORT, qpcb->prim_phys_port); + + qp_attr->timeout = qpcb->timeout; + qp_attr->retry_cnt = qpcb->retry_count; + qp_attr->rnr_retry = qpcb->rnr_retry_count; + + qp_attr->alt_pkey_index = + EHCA_BMASK_GET(MQPCB_PRIM_P_KEY_IDX, qpcb->alt_p_key_idx); + + qp_attr->alt_port_num = qpcb->alt_phys_port; + qp_attr->alt_timeout = qpcb->timeout_al; + + /* primary av */ + qp_attr->ah_attr.sl = qpcb->service_level; + + if (qpcb->send_grh_flag) { + qp_attr->ah_attr.ah_flags = IB_AH_GRH; + } + + qp_attr->ah_attr.static_rate = qpcb->max_static_rate; + qp_attr->ah_attr.dlid = qpcb->dlid; + qp_attr->ah_attr.src_path_bits = qpcb->source_path_bits; + qp_attr->ah_attr.port_num = qp_attr->port_num; + + /* primary GRH */ + qp_attr->ah_attr.grh.traffic_class = qpcb->traffic_class; + qp_attr->ah_attr.grh.hop_limit = qpcb->hop_limit; + qp_attr->ah_attr.grh.sgid_index = qpcb->source_gid_idx; + qp_attr->ah_attr.grh.flow_label = qpcb->flow_label; + + for (cnt = 0; cnt < 16; cnt++) + qp_attr->ah_attr.grh.dgid.raw[cnt] = + qpcb->dest_gid.byte[cnt]; + + /* alternate AV */ + qp_attr->alt_ah_attr.sl = qpcb->service_level_al; + if (qpcb->send_grh_flag_al) { + qp_attr->alt_ah_attr.ah_flags = IB_AH_GRH; + } + + qp_attr->alt_ah_attr.static_rate = 
qpcb->max_static_rate_al; + qp_attr->alt_ah_attr.dlid = qpcb->dlid_al; + qp_attr->alt_ah_attr.src_path_bits = qpcb->source_path_bits_al; + + /* alternate GRH */ + qp_attr->alt_ah_attr.grh.traffic_class = qpcb->traffic_class_al; + qp_attr->alt_ah_attr.grh.hop_limit = qpcb->hop_limit_al; + qp_attr->alt_ah_attr.grh.sgid_index = qpcb->source_gid_idx_al; + qp_attr->alt_ah_attr.grh.flow_label = qpcb->flow_label_al; + + for (cnt = 0; cnt < 16; cnt++) + qp_attr->alt_ah_attr.grh.dgid.raw[cnt] = + qpcb->dest_gid_al.byte[cnt]; + + /* return init attributes given in ehca_create_qp */ + if (qp_init_attr != NULL) + *qp_init_attr = my_qp->init_attr; + + EDEB(7, "ehca_qp=%p qp_number=%x dest_qp_number=%x " + "dlid=%x path_mtu=%x dest_gid=%lx_%lx " + "service_level=%x qp_state=%x", + my_qp, qpcb->qp_number, qpcb->dest_qp_nr, + qpcb->dlid, qpcb->path_mtu, + qpcb->dest_gid.dw[0], qpcb->dest_gid.dw[1], + qpcb->service_level, qpcb->qp_state); + + EDEB_DMP(7, qpcb, 4*70, "ehca_qp=%p qp_num=%x", my_qp, qp->qp_num); + +query_qp_exit1: + kfree(qpcb); + +query_qp_exit0: + EDEB_EX(7, "ehca_qp=%p qp_num=%x retcode=%x", + my_qp, qp->qp_num, retcode); + return retcode; +} + +int ehca_destroy_qp(struct ib_qp *ibqp) +{ + extern struct ehca_module ehca_module; + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + struct ehca_pfqp *qp_pf = NULL; + struct ehca_pd *my_pd = NULL; + u32 cur_pid = current->tgid; + u32 qp_num = 0; + int retcode = 0; + u64 hipz_ret = H_SUCCESS; + u8 port_num = 0; + enum ib_qp_type qp_type; + unsigned long flags; + + EHCA_CHECK_ADR(ibqp); + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + qp_num = ibqp->qp_num; + qp_pf = &my_qp->pf; + + shca = container_of(ibqp->device, struct ehca_shca, ib_device); + + EDEB_EN(7, "ehca_qp=%p qp_num=%x", my_qp, ibqp->qp_num); + + my_pd = container_of(my_qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (my_pd->ib_pd.uobject!=NULL && my_pd->ib_pd.uobject->context!=NULL && + my_pd->ownpid!=cur_pid) { + EDEB_ERR(4, "Invalid caller 
pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; + } + + if (my_qp->send_cq != NULL) { + retcode = ehca_cq_unassign_qp(my_qp->send_cq, + my_qp->real_qp_num); + if (retcode != 0) { + EDEB_ERR(4, "Couldn't unassign qp from send_cq " + "ret=%x qp_num=%x cq_num=%x", + retcode, my_qp->ib_qp.qp_num, + my_qp->send_cq->cq_number); + goto destroy_qp_exit0; + } + } + + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + idr_remove(&ehca_qp_idr, my_qp->token); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + + /* un-mmap if vma alloc */ + if (my_qp->uspace_rqueue != 0) { + retcode = ehca_munmap(my_qp->uspace_rqueue, + my_qp->ipz_rqueue.queue_length); + retcode = ehca_munmap(my_qp->uspace_squeue, + my_qp->ipz_squeue.queue_length); + retcode = ehca_munmap(my_qp->uspace_fwh, 4096); + } + + hipz_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); + if (hipz_ret != H_SUCCESS) { + EDEB_ERR(4, "hipz_h_destroy_qp() failed " + "rc=%lx ehca_qp=%p qp_num=%x", + hipz_ret, qp_pf, qp_num); + goto destroy_qp_exit0; + } + + port_num = my_qp->init_attr.port_num; + qp_type = my_qp->init_attr.qp_type; + + /* no support for IB_QPT_SMI yet */ + if (qp_type == IB_QPT_GSI) { + struct ib_event event; + + EDEB(4, "device %s: port %x is inactive.", + shca->ib_device.name, port_num); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ERR; + event.element.port_num = port_num; + shca->sport[port_num - 1].port_state = IB_PORT_DOWN; + ib_dispatch_event(&event); + } + + ipz_queue_dtor(&my_qp->ipz_rqueue); + ipz_queue_dtor(&my_qp->ipz_squeue); + kmem_cache_free(ehca_module.cache_qp, my_qp); + +destroy_qp_exit0: + retcode = ehca2ib_return_code(hipz_ret); + EDEB_EX(7,"ret=%x", retcode); + return retcode; +} --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_reqs.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_reqs.c 2006-04-25 10:30:36.000000000 +0200 @@ -0,0 +1,685 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on 
POWER + * + * post_send/recv, poll_cq, req_notify + * + * Authors: Waleri Fomin + * Hoang-Nam Nguyen + * Reinhard Ernst + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_reqs.c,v 1.17 2006/04/25 08:30:36 schickhj Exp $ + */ + + +#define DEB_PREFIX "reqs" + +#include "ehca_kernel.h" +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "ehca_qes.h" +#include "ehca_iverbs.h" +#include "hcp_if.h" +#include "hipz_fns.h" + +static inline int ehca_write_rwqe(struct ipz_queue *ipz_rqueue, + struct ehca_wqe *wqe_p, + struct ib_recv_wr *recv_wr) +{ + u8 cnt_ds; + if (unlikely((recv_wr->num_sge < 0) || + (recv_wr->num_sge > ipz_rqueue->act_nr_of_sg))) { + EDEB_ERR(4, "Invalid number of WQE SGE. " + "num_sqe=%x max_nr_of_sg=%x", + recv_wr->num_sge, ipz_rqueue->act_nr_of_sg); + return (-EINVAL); /* invalid SG list length */ + } + + /* clear wqe header until sglist */ + memset(wqe_p, 0, offsetof(struct ehca_wqe, u.ud_av.sg_list)); + + wqe_p->work_request_id = be64_to_cpu(recv_wr->wr_id); + wqe_p->nr_of_data_seg = recv_wr->num_sge; + + for (cnt_ds = 0; cnt_ds < recv_wr->num_sge; cnt_ds++) { + wqe_p->u.all_rcv.sg_list[cnt_ds].vaddr = + be64_to_cpu(recv_wr->sg_list[cnt_ds].addr); + wqe_p->u.all_rcv.sg_list[cnt_ds].lkey = + ntohl(recv_wr->sg_list[cnt_ds].lkey); + wqe_p->u.all_rcv.sg_list[cnt_ds].length = + ntohl(recv_wr->sg_list[cnt_ds].length); + } + + if (IS_EDEB_ON(7)) { + EDEB(7, "RECEIVE WQE written into ipz_rqueue=%p", ipz_rqueue); + EDEB_DMP(7, wqe_p, 16*(6 + wqe_p->nr_of_data_seg), "recv wqe"); + } + + return 0; +} + +#if defined(DEBUG_GSI_SEND_WR) + +/* need ib_mad struct */ +#include + +static void trace_send_wr_ud(const struct ib_send_wr *send_wr) +{ + int idx = 0; + int j = 0; + while (send_wr != NULL) { + struct ib_mad_hdr *mad_hdr = send_wr->wr.ud.mad_hdr; + struct ib_sge *sge = send_wr->sg_list; + EDEB(4, "send_wr#%x wr_id=%lx num_sge=%x " + "send_flags=%x opcode=%x",idx, send_wr->wr_id, + send_wr->num_sge, send_wr->send_flags, send_wr->opcode); + if (mad_hdr != NULL) { + EDEB(4, "send_wr#%x mad_hdr base_version=%x " + "mgmt_class=%x class_version=%x method=%x " + "status=%x class_specific=%x tid=%lx 
attr_id=%x " + "resv=%x attr_mod=%x", + idx, mad_hdr->base_version, mad_hdr->mgmt_class, + mad_hdr->class_version, mad_hdr->method, + mad_hdr->status, mad_hdr->class_specific, + mad_hdr->tid, mad_hdr->attr_id, mad_hdr->resv, + mad_hdr->attr_mod); + } + for (j = 0; j < send_wr->num_sge; j++) { + u8 *data = (u8 *) abs_to_virt(sge->addr); + EDEB(4, "send_wr#%x sge#%x addr=%p length=%x lkey=%x", + idx, j, data, sge->length, sge->lkey); + /* assume length is n*16 */ + EDEB_DMP(4, data, sge->length, "send_wr#%x sge#%x", + idx, j); + sge++; + } /* eof for j */ + idx++; + send_wr = send_wr->next; + } /* eof while send_wr */ +} + +#endif /* DEBUG_GSI_SEND_WR */ + +static inline int ehca_write_swqe(struct ehca_qp *qp, + struct ehca_wqe *wqe_p, + const struct ib_send_wr *send_wr) +{ + u32 idx; + u64 dma_length; + struct ehca_av *my_av; + u32 remote_qkey = send_wr->wr.ud.remote_qkey; + + if (unlikely((send_wr->num_sge < 0) || + (send_wr->num_sge > qp->ipz_squeue.act_nr_of_sg))) { + EDEB_ERR(4, "Invalid number of WQE SGE. 
" + "num_sqe=%x max_nr_of_sg=%x", + send_wr->num_sge, qp->ipz_squeue.act_nr_of_sg); + return (-EINVAL); /* invalid SG list length */ + } + + /* clear wqe header until sglist */ + memset(wqe_p, 0, offsetof(struct ehca_wqe, u.ud_av.sg_list)); + + wqe_p->work_request_id = be64_to_cpu(send_wr->wr_id); + + switch (send_wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + wqe_p->optype = WQE_OPTYPE_SEND; + break; + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + wqe_p->optype = WQE_OPTYPE_RDMAWRITE; + break; + case IB_WR_RDMA_READ: + wqe_p->optype = WQE_OPTYPE_RDMAREAD; + break; + default: + EDEB_ERR(4, "Invalid opcode=%x", send_wr->opcode); + return (-EINVAL); /* invalid opcode */ + } + + wqe_p->wqef = (send_wr->opcode) & WQEF_HIGH_NIBBLE; + + wqe_p->wr_flag = 0; + + if (send_wr->send_flags & IB_SEND_SIGNALED) + wqe_p->wr_flag |= WQE_WRFLAG_REQ_SIGNAL_COM; + + if (send_wr->opcode == IB_WR_SEND_WITH_IMM || + send_wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) { + /* this might not work as long as HW does not support it */ + wqe_p->immediate_data = send_wr->imm_data; + wqe_p->wr_flag |= WQE_WRFLAG_IMM_DATA_PRESENT; + } + + wqe_p->nr_of_data_seg = send_wr->num_sge; + + switch (qp->qp_type) { + case IB_QPT_SMI: + case IB_QPT_GSI: + /* no break is intential here */ + case IB_QPT_UD: + /* IB 1.2 spec C10-15 compliance */ + if (send_wr->wr.ud.remote_qkey & 0x80000000) + remote_qkey = qp->qkey; + + wqe_p->destination_qp_number = + ntohl(send_wr->wr.ud.remote_qpn << 8); + wqe_p->local_ee_context_qkey = ntohl(remote_qkey); + if (send_wr->wr.ud.ah==NULL) { + EDEB_ERR(4, "wr.ud.ah is NULL. 
qp=%p", qp); + return (-EINVAL); + } + my_av = container_of(send_wr->wr.ud.ah, struct ehca_av, ib_ah); + wqe_p->u.ud_av.ud_av = my_av->av; + + /* omitted check of IB_SEND_INLINE + since HW does not support it */ + for (idx = 0; idx < send_wr->num_sge; idx++) { + wqe_p->u.ud_av.sg_list[idx].vaddr = + be64_to_cpu(send_wr->sg_list[idx].addr); + wqe_p->u.ud_av.sg_list[idx].lkey = + ntohl(send_wr->sg_list[idx].lkey); + wqe_p->u.ud_av.sg_list[idx].length = + ntohl(send_wr->sg_list[idx].length); + } /* eof for idx */ + if (qp->qp_type == IB_QPT_SMI || + qp->qp_type == IB_QPT_GSI) + wqe_p->u.ud_av.ud_av.pmtu = 1; + if (qp->qp_type == IB_QPT_GSI) { + wqe_p->pkeyi = + ntohs(send_wr->wr.ud.pkey_index); +#ifdef DEBUG_GSI_SEND_WR + trace_send_wr_ud(send_wr); +#endif /* DEBUG_GSI_SEND_WR */ + } + break; + + case IB_QPT_UC: + if (send_wr->send_flags & IB_SEND_FENCE) + wqe_p->wr_flag |= WQE_WRFLAG_FENCE; + /* no break is intentional here */ + case IB_QPT_RC: + /* TODO: atomic not implemented */ + wqe_p->u.nud.remote_virtual_adress = + be64_to_cpu(send_wr->wr.rdma.remote_addr); + wqe_p->u.nud.rkey = ntohl(send_wr->wr.rdma.rkey); + + /* omitted checking of IB_SEND_INLINE + since HW does not support it */ + dma_length = 0; + for (idx = 0; idx < send_wr->num_sge; idx++) { + wqe_p->u.nud.sg_list[idx].vaddr = + be64_to_cpu(send_wr->sg_list[idx].addr); + wqe_p->u.nud.sg_list[idx].lkey = + ntohl(send_wr->sg_list[idx].lkey); + wqe_p->u.nud.sg_list[idx].length = + ntohl(send_wr->sg_list[idx].length); + dma_length += send_wr->sg_list[idx].length; + } /* eof idx */ + wqe_p->u.nud.atomic_1st_op_dma_len = be64_to_cpu(dma_length); + + break; + + default: + EDEB_ERR(4, "Invalid qptype=%x", qp->qp_type); + return -EINVAL; + } + + if (IS_EDEB_ON(7)) { + EDEB(7, "SEND WQE written into queue qp=%p ", qp); + EDEB_DMP(7, wqe_p, 16*(6 + wqe_p->nr_of_data_seg), "send wqe"); + } + return 0; +} + +/** map_ib_wc_status - convert raw cqe_status to ib_wc_status + */ +static inline void map_ib_wc_status(u32 
cqe_status, + enum ib_wc_status *wc_status) +{ + if (unlikely(cqe_status & WC_STATUS_ERROR_BIT)) { + switch (cqe_status & 0x3F) { + case 0x01: + case 0x21: + *wc_status = IB_WC_LOC_LEN_ERR; + break; + case 0x02: + case 0x22: + *wc_status = IB_WC_LOC_QP_OP_ERR; + break; + case 0x03: + case 0x23: + *wc_status = IB_WC_LOC_EEC_OP_ERR; + break; + case 0x04: + case 0x24: + *wc_status = IB_WC_LOC_PROT_ERR; + break; + case 0x05: + case 0x25: + *wc_status = IB_WC_WR_FLUSH_ERR; + break; + case 0x06: + *wc_status = IB_WC_MW_BIND_ERR; + break; + case 0x07: /* remote error - look into bits 20:24 */ + switch ((cqe_status + & WC_STATUS_REMOTE_ERROR_FLAGS) >> 11) { + case 0x0: + /* PSN Sequence Error! + couldn't find a matching status! */ + *wc_status = IB_WC_GENERAL_ERR; + break; + case 0x1: + *wc_status = IB_WC_REM_INV_REQ_ERR; + break; + case 0x2: + *wc_status = IB_WC_REM_ACCESS_ERR; + break; + case 0x3: + *wc_status = IB_WC_REM_OP_ERR; + break; + case 0x4: + *wc_status = IB_WC_REM_INV_RD_REQ_ERR; + break; + } + break; + case 0x08: + *wc_status = IB_WC_RETRY_EXC_ERR; + break; + case 0x09: + *wc_status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case 0x0A: + case 0x2D: + *wc_status = IB_WC_REM_ABORT_ERR; + break; + case 0x0B: + case 0x2E: + *wc_status = IB_WC_INV_EECN_ERR; + break; + case 0x0C: + case 0x2F: + *wc_status = IB_WC_INV_EEC_STATE_ERR; + break; + case 0x0D: + *wc_status = IB_WC_BAD_RESP_ERR; + break; + case 0x10: + /* WQE purged */ + *wc_status = IB_WC_WR_FLUSH_ERR; + break; + default: + *wc_status = IB_WC_FATAL_ERR; + + } + } else + *wc_status = IB_WC_SUCCESS; +} + +int ehca_post_send(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + struct ehca_qp *my_qp = NULL; + struct ib_send_wr *cur_send_wr = NULL; + struct ehca_wqe *wqe_p = NULL; + int wqe_cnt = 0; + int retcode = 0; + unsigned long spl_flags = 0; + + EHCA_CHECK_ADR(qp); + my_qp = container_of(qp, struct ehca_qp, ib_qp); + EHCA_CHECK_QP(my_qp); + EHCA_CHECK_ADR(send_wr); + 
EDEB_EN(7, "ehca_qp=%p qp_num=%x send_wr=%p bad_send_wr=%p", + my_qp, qp->qp_num, send_wr, bad_send_wr); + + /* LOCK the QUEUE */ + spin_lock_irqsave(&my_qp->spinlock_s, spl_flags); + + /* loop processes list of send reqs */ + for (cur_send_wr = send_wr; cur_send_wr != NULL; + cur_send_wr = cur_send_wr->next) { + u64 start_offset = my_qp->ipz_squeue.current_q_offset; + /* get pointer next to free WQE */ + wqe_p = ipz_qeit_get_inc(&my_qp->ipz_squeue); + if (unlikely(wqe_p == NULL)) { + /* too many posted work requests: queue overflow */ + if (bad_send_wr != NULL) + *bad_send_wr = cur_send_wr; + if (wqe_cnt==0) { + retcode = -ENOMEM; + EDEB_ERR(4, "Too many posted WQEs qp_num=%x", + qp->qp_num); + } + goto post_send_exit0; + } + /* write a SEND WQE into the QUEUE */ + retcode = ehca_write_swqe(my_qp, wqe_p, cur_send_wr); + /* if something failed, + reset the free entry pointer to the start value + */ + if (unlikely(retcode != 0)) { + my_qp->ipz_squeue.current_q_offset = start_offset; + *bad_send_wr = cur_send_wr; + if (wqe_cnt==0) { + retcode = -EINVAL; + EDEB_ERR(4, "Could not write WQE qp_num=%x", + qp->qp_num); + } + goto post_send_exit0; + } + wqe_cnt++; + EDEB(7, "ehca_qp=%p qp_num=%x wqe_cnt=%d", + my_qp, qp->qp_num, wqe_cnt); + } /* eof for cur_send_wr */ + +post_send_exit0: + /* UNLOCK the QUEUE */ + spin_unlock_irqrestore(&my_qp->spinlock_s, spl_flags); + iosync(); /* serialize GAL register access */ + hipz_update_sqa(my_qp, wqe_cnt); + EDEB_EX(7, "ehca_qp=%p qp_num=%x ret=%x wqe_cnt=%d", + my_qp, qp->qp_num, retcode, wqe_cnt); + return retcode; +} + +int ehca_post_recv(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr) +{ + struct ehca_qp *my_qp = NULL; + struct ib_recv_wr *cur_recv_wr = NULL; + struct ehca_wqe *wqe_p = NULL; + int wqe_cnt = 0; + int retcode = 0; + unsigned long spl_flags = 0; + + EHCA_CHECK_ADR(qp); + my_qp = container_of(qp, struct ehca_qp, ib_qp); + EHCA_CHECK_QP(my_qp); + EHCA_CHECK_ADR(recv_wr); + 
EDEB_EN(7, "ehca_qp=%p qp_num=%x recv_wr=%p bad_recv_wr=%p", + my_qp, qp->qp_num, recv_wr, bad_recv_wr); + + /* LOCK the QUEUE */ + spin_lock_irqsave(&my_qp->spinlock_r, spl_flags); + + /* loop processes list of recv reqs */ + for (cur_recv_wr = recv_wr; cur_recv_wr != NULL; + cur_recv_wr = cur_recv_wr->next) { + u64 start_offset = my_qp->ipz_rqueue.current_q_offset; + /* get pointer to next free WQE */ + wqe_p = ipz_qeit_get_inc(&my_qp->ipz_rqueue); + if (unlikely(wqe_p == NULL)) { + /* too many posted work requests: queue overflow */ + if (bad_recv_wr != NULL) + *bad_recv_wr = cur_recv_wr; + if (wqe_cnt==0) { + retcode = -ENOMEM; + EDEB_ERR(4, "Too many posted WQEs qp_num=%x", + qp->qp_num); + } + goto post_recv_exit0; + } + /* write a RECV WQE into the QUEUE */ + retcode = ehca_write_rwqe(&my_qp->ipz_rqueue, wqe_p, + cur_recv_wr); + /* if something failed, + reset the free entry pointer to the start value + */ + if (unlikely(retcode != 0)) { + my_qp->ipz_rqueue.current_q_offset = start_offset; + if (bad_recv_wr != NULL) + *bad_recv_wr = cur_recv_wr; + if (wqe_cnt==0) { + retcode = -EINVAL; + EDEB_ERR(4, "Could not write WQE qp_num=%x", + qp->qp_num); + } + goto post_recv_exit0; + } + wqe_cnt++; + EDEB(7, "ehca_qp=%p qp_num=%x wqe_cnt=%d", + my_qp, qp->qp_num, wqe_cnt); + } /* eof for cur_recv_wr */ + +post_recv_exit0: + spin_unlock_irqrestore(&my_qp->spinlock_r, spl_flags); + iosync(); /* serialize GAL register access */ + hipz_update_rqa(my_qp, wqe_cnt); + EDEB_EX(7, "ehca_qp=%p qp_num=%x ret=%x wqe_cnt=%d", + my_qp, qp->qp_num, retcode, wqe_cnt); + return retcode; +} + +/** + * ib_wc_opcode - Table converts ehca wc opcode to ib + * Zero marks an invalid opcode, so each entry stores the ib opcode plus one + * and the looked-up value must be decremented.
+ */ +static const u8 ib_wc_opcode[255] = { + [0x01] = IB_WC_RECV+1, + [0x02] = IB_WC_RECV_RDMA_WITH_IMM+1, + [0x04] = IB_WC_BIND_MW+1, + [0x08] = IB_WC_FETCH_ADD+1, + [0x10] = IB_WC_COMP_SWAP+1, + [0x20] = IB_WC_RDMA_WRITE+1, + [0x40] = IB_WC_RDMA_READ+1, + [0x80] = IB_WC_SEND+1 +}; + +/** + * internal function to poll one entry of cq + */ +static inline int ehca_poll_cq_one(struct ib_cq *cq, struct ib_wc *wc) +{ + int retcode = 0; + struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); + struct ehca_cqe *cqe = NULL; + int cqe_count = 0; + + EDEB_EN(7, "ehca_cq=%p cq_num=%x wc=%p", my_cq, my_cq->cq_number, wc); + +poll_cq_one_read_cqe: + cqe = (struct ehca_cqe *) + ipz_qeit_get_inc_valid(&my_cq->ipz_queue); + if (cqe == NULL) { + retcode = -EAGAIN; + EDEB(7, "Completion queue is empty ehca_cq=%p cq_num=%x " + "retcode=%x", my_cq, my_cq->cq_number, retcode); + goto poll_cq_one_exit0; + } + cqe_count++; + if (unlikely(cqe->status & WC_STATUS_PURGE_BIT)) { + struct ehca_qp *qp=ehca_cq_get_qp(my_cq, cqe->local_qp_number); + int purgeflag = 0; + unsigned long spl_flags = 0; + if (qp==NULL) { /* should not happen */ + EDEB_ERR(4, "cq_num=%x qp_num=%x " + "could not find qp -> ignore cqe", + my_cq->cq_number, cqe->local_qp_number); + EDEB_DMP(4, cqe, 64, "cq_num=%x qp_num=%x", + my_cq->cq_number, cqe->local_qp_number); + /* ignore this purged cqe */ + goto poll_cq_one_read_cqe; + } + spin_lock_irqsave(&qp->spinlock_s, spl_flags); + purgeflag = qp->sqerr_purgeflag; + spin_unlock_irqrestore(&qp->spinlock_s, spl_flags); + if (purgeflag!=0) { + EDEB(6, "Got CQE with purged bit qp_num=%x src_qp=%x", + cqe->local_qp_number, cqe->remote_qp_number); + EDEB_DMP(6, cqe, 64, "qp_num=%x src_qp=%x", + cqe->local_qp_number, cqe->remote_qp_number); + /* ignore this to avoid double cqes of bad wqe + that caused sqe and turn off purge flag */ + qp->sqerr_purgeflag = 0; + goto poll_cq_one_read_cqe; + } + } + + /* tracing cqe */ + if (IS_EDEB_ON(7)) { + EDEB(7, "Received 
COMPLETION ehca_cq=%p cq_num=%x -----", + my_cq, my_cq->cq_number); + EDEB_DMP(7, cqe, 64, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + EDEB(7, "ehca_cq=%p cq_num=%x -------------------------", + my_cq, my_cq->cq_number); + } + + /* we got a completion! */ + wc->wr_id = cqe->work_request_id; + + /* eval ib_wc_opcode */ + wc->opcode = ib_wc_opcode[cqe->optype]-1; + if (unlikely(wc->opcode == -1)) { + EDEB_ERR(4, "Invalid cqe->OPType=%x cqe->status=%x " + "ehca_cq=%p cq_num=%x", + cqe->optype, cqe->status, my_cq, my_cq->cq_number); + /* dump cqe for other infos */ + EDEB_DMP(4, cqe, 64, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + /* update also queue adder to throw away this entry!!! */ + goto poll_cq_one_exit0; + } + /* eval ib_wc_status */ + if (unlikely(cqe->status & WC_STATUS_ERROR_BIT)) { /* complete with errors */ + map_ib_wc_status(cqe->status, &wc->status); + wc->vendor_err = wc->status; + } else + wc->status = IB_WC_SUCCESS; + + wc->qp_num = cqe->local_qp_number; + wc->byte_len = ntohl(cqe->nr_bytes_transferred); + wc->pkey_index = cqe->pkey_index; + wc->slid = cqe->rlid; + wc->dlid_path_bits = cqe->dlid; + wc->src_qp = cqe->remote_qp_number; + wc->wc_flags = cqe->w_completion_flags; + wc->imm_data = cqe->immediate_data; + wc->sl = cqe->service_level; + + if (wc->status != IB_WC_SUCCESS) + EDEB(6, "ehca_cq=%p cq_num=%x WARNING unsuccessful cqe " + "OPType=%x status=%x qp_num=%x src_qp=%x wr_id=%lx cqe=%p", + my_cq, my_cq->cq_number, cqe->optype, cqe->status, + cqe->local_qp_number, cqe->remote_qp_number, + cqe->work_request_id, cqe); + +poll_cq_one_exit0: + if (cqe_count > 0) + hipz_update_feca(my_cq, cqe_count); + + EDEB_EX(7, "retcode=%x ehca_cq=%p cq_number=%x wc=%p " + "status=%x opcode=%x qp_num=%x byte_len=%x", + retcode, my_cq, my_cq->cq_number, wc, wc->status, + wc->opcode, wc->qp_num, wc->byte_len); + + return retcode; +} + +int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc) +{ + struct ehca_cq *my_cq = NULL; + 
int nr = 0; + struct ib_wc *current_wc = NULL; + int retcode = 0; + unsigned long spl_flags = 0; + + EHCA_CHECK_CQ(cq); + EHCA_CHECK_ADR(wc); + + my_cq = container_of(cq, struct ehca_cq, ib_cq); + EHCA_CHECK_CQ(my_cq); + + EDEB_EN(7, "ehca_cq=%p cq_num=%x num_entries=%d wc=%p", + my_cq, my_cq->cq_number, num_entries, wc); + + if (num_entries < 1) { + EDEB_ERR(4, "Invalid num_entries=%d ehca_cq=%p cq_num=%x", + num_entries, my_cq, my_cq->cq_number); + retcode = -EINVAL; + goto poll_cq_exit0; + } + + current_wc = wc; + spin_lock_irqsave(&my_cq->spinlock, spl_flags); + for (nr = 0; nr < num_entries; nr++) { + retcode = ehca_poll_cq_one(cq, current_wc); + if (0 != retcode) + break; + current_wc++; + } /* eof for nr */ + spin_unlock_irqrestore(&my_cq->spinlock, spl_flags); + if (-EAGAIN == retcode || 0 == retcode) + retcode = nr; + +poll_cq_exit0: + EDEB_EX(7, "ehca_cq=%p cq_num=%x retcode=%x wc=%p nr_entries=%d", + my_cq, my_cq->cq_number, retcode, wc, nr); + + return retcode; +} + +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) +{ + struct ehca_cq *my_cq = NULL; + int retcode = 0; + + EHCA_CHECK_CQ(cq); + my_cq = container_of(cq, struct ehca_cq, ib_cq); + EHCA_CHECK_CQ(my_cq); + EDEB_EN(7, "ehca_cq=%p cq_num=%x cq_notif=%x", + my_cq, my_cq->cq_number, cq_notify); + + switch (cq_notify) { + case IB_CQ_SOLICITED: + hipz_set_cqx_n0(my_cq, 1); + break; + case IB_CQ_NEXT_COMP: + hipz_set_cqx_n1(my_cq, 1); + break; + default: + retcode = -EINVAL; + } + + EDEB_EX(7, "ehca_cq=%p cq_num=%x retcode=%x", + my_cq, my_cq->cq_number, retcode); + + return retcode; +} --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ehca_sqp.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ehca_sqp.c 2006-03-30 14:36:54.000000000 +0200 @@ -0,0 +1,126 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * SQP functions + * + * Authors: Khadija Souissi + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + 
* + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_sqp.c,v 1.10 2006/03/30 12:36:54 schickhj Exp $ + */ + + +#define DEB_PREFIX "e_qp" + +#include +#include +#include "ehca_kernel.h" +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "ehca_qes.h" +#include "ehca_iverbs.h" +#include "hcp_if.h" + + +extern int ehca_create_aqp1(struct ehca_shca *shca, struct ehca_sport *sport); +extern int ehca_destroy_aqp1(struct ehca_sport *sport); + +extern int ehca_port_act_time; + +/** + * ehca_define_sqp - Defines special queue pair 1 (GSI QP). 
When the special queue + * pair is created successfully, the corresponding port becomes active. + * + * Defining special queue pair 0 (SMI QP) is not yet supported. + * + * @qp_init_attr: Queue pair init attributes with port and queue pair type + */ + +u64 ehca_define_sqp(struct ehca_shca *shca, + struct ehca_qp *ehca_qp, + struct ib_qp_init_attr *qp_init_attr) +{ + + u32 pma_qp_nr = 0; + u32 bma_qp_nr = 0; + u64 ret = H_SUCCESS; + u8 port = qp_init_attr->port_num; + int counter = 0; + + EDEB_EN(7, "port=%x qp_type=%x", + port, qp_init_attr->qp_type); + + shca->sport[port - 1].port_state = IB_PORT_DOWN; + + switch (qp_init_attr->qp_type) { + case IB_QPT_SMI: + /* function not supported yet */ + break; + case IB_QPT_GSI: + ret = hipz_h_define_aqp1(shca->ipz_hca_handle, + ehca_qp->ipz_qp_handle, + ehca_qp->galpas.kernel, + (u32) qp_init_attr->port_num, + &pma_qp_nr, &bma_qp_nr); + + if (ret != H_SUCCESS) { + EDEB_ERR(4, "Can't define AQP1 for port %x. rc=%lx", + port, ret); + goto ehca_define_aqp1; + } + break; + default: + ret = H_PARAMETER; + goto ehca_define_aqp1; + } + + while ((shca->sport[port - 1].port_state != IB_PORT_ACTIVE) && + (counter < ehca_port_act_time)) { + EDEB(6, "... 
wait until port %x is active", + port); + msleep_interruptible(1000); + counter++; + } + + if (counter == ehca_port_act_time) { + EDEB_ERR(4, "Port %x is not active.", port); + ret = H_HARDWARE; + } + + ehca_define_aqp1: + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} From schihei at de.ibm.com Thu Apr 27 03:49:36 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 27 Apr 2006 12:49:36 +0200 Subject: [openib-general] [PATCH 13/16] ehca: firmware InfiniBand interface Message-ID: <4450A1C0.3080209@de.ibm.com> Signed-off-by: Heiko J Schick hcp_if.c | 2028 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ hcp_if.h | 398 ++++++++++++ 2 files changed, 2426 insertions(+) --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/hcp_if.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/hcp_if.h 2006-04-04 23:52:30.000000000 +0200 @@ -0,0 +1,398 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Firmware Infiniband Interface code for POWER + * + * Authors: Christoph Raisch + * Hoang-Nam Nguyen + * Gerd Bayer + * Waleri Fomin + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: hcp_if.h,v 1.12 2006/04/04 21:52:30 nguyen Exp $ + */ + +#ifndef __HCP_IF_H__ +#define __HCP_IF_H__ + +#include "ehca_classes.h" +#include "hipz_hw.h" + +/** + * hipz_h_alloc_resource_eq - Allocate EQ resources in HW and FW, initialize + * resources, create the empty EQPT (ring). + * + * @eq_handle: eq handle for this queue + * @act_nr_of_entries: actual number of queue entries + * @act_pages: actual number of queue pages + * @eq_ist: used by hcp_H_XIRR() call + */ +u64 hipz_h_alloc_resource_eq(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfeq *pfeq, + const u32 neq_control, + const u32 number_of_entries, + struct ipz_eq_handle *eq_handle, + u32 * act_nr_of_entries, + u32 * act_pages, + u32 * eq_ist); + +u64 hipz_h_reset_event(const struct ipz_adapter_handle adapter_handle, + struct ipz_eq_handle eq_handle, + const u64 event_mask); +/** + * hipz_h_alloc_resource_cq - Allocate CQ resources in HW and FW, initialize + * resources, create the empty CQPT (ring). 
+ * + * @eq_handle: eq handle to use for this cq + * @cq_handle: cq handle for this queue + * @act_nr_of_entries: actual number of queue entries + * @act_pages: actual number of queue pages + * @galpas: contains logical address of priv. storage and + * log_user_storage + */ +u64 hipz_h_alloc_resource_cq(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfcq *pfcq, + const struct ipz_eq_handle eq_handle, + const u32 cq_token, + const u32 number_of_entries, + struct ipz_cq_handle *cq_handle, + u32 * act_nr_of_entries, + u32 * act_pages, + struct h_galpas *galpas); + +/** + * hipz_h_alloc_resource_qp - Allocate QP resources in HW and FW, + * initialize resources, create empty QPPTs (2 rings). + * + * @h_galpas: used to access HCA resident QP attributes + */ +u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfqp *pfqp, + const u8 servicetype, + const u8 daqp_ctrl, + const u8 signalingtype, + const u8 ud_av_l_key_ctl, + const struct ipz_cq_handle send_cq_handle, + const struct ipz_cq_handle receive_cq_handle, + const struct ipz_eq_handle async_eq_handle, + const u32 qp_token, + const struct ipz_pd pd, + const u16 max_nr_send_wqes, + const u16 max_nr_receive_wqes, + const u8 max_nr_send_sges, + const u8 max_nr_receive_sges, + const u32 ud_av_l_key, + struct ipz_qp_handle *qp_handle, + u32 * qp_nr, + u16 * act_nr_send_wqes, + u16 * act_nr_receive_wqes, + u8 * act_nr_send_sges, + u8 * act_nr_receive_sges, + u32 * nr_sq_pages, + u32 * nr_rq_pages, + struct h_galpas *h_galpas); + +u64 hipz_h_query_port(const struct ipz_adapter_handle adapter_handle, + const u8 port_id, + struct hipz_query_port *query_port_response_block); + +u64 hipz_h_query_hca(const struct ipz_adapter_handle adapter_handle, + struct hipz_query_hca *query_hca_rblock); + +/** + * hipz_h_register_rpage - hcp_if.h internal function for all + * hcp_H_REGISTER_RPAGE calls. 
+ * + * @logical_address_of_page: kv transformation to GX address in this routine + */ +u64 hipz_h_register_rpage(const struct ipz_adapter_handle adapter_handle, + const u8 pagesize, + const u8 queue_type, + const u64 resource_handle, + const u64 logical_address_of_page, + u64 count); + +u64 hipz_h_register_rpage_eq(const struct ipz_adapter_handle adapter_handle, + const struct ipz_eq_handle eq_handle, + struct ehca_pfeq *pfeq, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count); + +u32 hipz_h_query_int_state(const struct ipz_adapter_handle + hcp_adapter_handle, + u32 ist); + +u64 hipz_h_register_rpage_cq(const struct ipz_adapter_handle adapter_handle, + const struct ipz_cq_handle cq_handle, + struct ehca_pfcq *pfcq, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count, + const struct h_galpa gal); + +u64 hipz_h_register_rpage_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count, + const struct h_galpa galpa); + +u64 hipz_h_remove_rpt_cq(const struct ipz_adapter_handle adapter_handle, + const struct ipz_cq_handle cq_handle, + struct ehca_pfcq *pfcq); + +u64 hipz_h_remove_rpt_eq(const struct ipz_adapter_handle adapter_handle, + const struct ipz_eq_handle eq_handle, + struct ehca_pfeq *pfeq); + +u64 hipz_h_remove_rpt_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp); + +u64 hipz_h_disable_and_get_wqe(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + void **log_addr_next_sq_wqe_tb_processed, + void **log_addr_next_rq_wqe_tb_processed, + int dis_and_get_function_code); +enum hcall_sigt { + HCALL_SIGT_NO_CQE = 0, + HCALL_SIGT_BY_WQE = 1, + HCALL_SIGT_EVERY = 2 +}; + +u64 
hipz_h_modify_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + const u64 update_mask, + struct hcp_modify_qp_control_block *mqpcb, + struct h_galpa gal); + +u64 hipz_h_query_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + struct hcp_modify_qp_control_block *qqpcb, + struct h_galpa gal); + +u64 hipz_h_destroy_qp(const struct ipz_adapter_handle adapter_handle, + struct ehca_qp *qp); + +u64 hipz_h_define_aqp0(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u32 port); + +u64 hipz_h_define_aqp1(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u32 port, u32 * pma_qp_nr, + u32 * bma_qp_nr); + +u64 hipz_h_attach_mcqp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u16 mcg_dlid, + u64 subnet_prefix, u64 interface_id); + +u64 hipz_h_detach_mcqp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u16 mcg_dlid, + u64 subnet_prefix, u64 interface_id); + +u64 hipz_h_destroy_cq(const struct ipz_adapter_handle adapter_handle, + struct ehca_cq *cq, + u8 force_flag); + +u64 hipz_h_destroy_eq(const struct ipz_adapter_handle adapter_handle, + struct ehca_eq *eq); + +/** + * hipz_h_alloc_resource_mr - Allocate MR resources in HW and FW, initialize + * resources. 
+ * + * @pfmr: platform specific for MR + * @pfshca: platform specific for SHCA + * @vaddr: Memory Region I/O Virtual Address + * @length: Memory Region Length + * @access_ctrl: Memory Region Access Controls + * @pd: Protection Domain + * @mr_handle: Memory Region Handle + */ +u64 hipz_h_alloc_resource_mr(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmr *pfmr, + struct ehca_pfshca *pfshca, + const u64 vaddr, + const u64 length, + const u32 access_ctrl, + const struct ipz_pd pd, + struct ipz_mrmw_handle *mr_handle, + u32 * lkey, + u32 * rkey); + +/** + * hipz_h_register_rpage_mr - Register MR resource page in HW and FW. + * + * @pfmr: platform specific for MR + * @pfshca: platform specific for SHCA + * @queue_type: must be zero for MR + */ +u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle, + const struct ipz_mrmw_handle *mr_handle, + struct ehca_pfmr *pfmr, + struct ehca_pfshca *pfshca, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count); + +/** + * hipz_h_query_mr - Query MR in HW and FW. + * + * @pfmr: platform specific for MR + * @mr_handle: Memory Region Handle + * @mr_local_length: Local MR Length + * @mr_local_vaddr: Local MR I/O Virtual Address + * @mr_remote_length: Remote MR Length + * @mr_remote_vaddr: Remote MR I/O Virtual Address + * @access_ctrl: Memory Region Access Controls + * @pd: Protection Domain + * @lkey: L_Key + * @rkey: R_Key + */ +u64 hipz_h_query_mr(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmr *pfmr, + const struct ipz_mrmw_handle *mr_handle, + u64 * mr_local_length, + u64 * mr_local_vaddr, + u64 * mr_remote_length, + u64 * mr_remote_vaddr, + u32 * access_ctrl, + struct ipz_pd *pd, + u32 * lkey, + u32 * rkey); + +/** + * hipz_h_free_resource_mr - Free MR resources in HW and FW. 
+ * + * @pfmr: platform specific for MR + * @mr_handle: Memory Region Handle + */ +u64 hipz_h_free_resource_mr(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmr *pfmr, + const struct ipz_mrmw_handle *mr_handle); + +/** + * hipz_h_reregister_pmr - Reregister MR in HW and FW. + * + * @pfmr: platform specific for MR + * @pfshca: platform specific for SHCA + * @mr_handle: Memory Region Handle + * @vaddr_in: Memory Region I/O Virtual Address + * @length: Memory Region Length + * @access_ctrl: Memory Region Access Controls + * @pd: Protection Domain + * @mr_addr_cb: Logical Address of MR Control Block + * @vaddr_out: Memory Region I/O Virtual Address + * @lkey: L_Key + * @rkey: R_Key + * + */ +u64 hipz_h_reregister_pmr(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmr *pfmr, + struct ehca_pfshca *pfshca, + const struct ipz_mrmw_handle *mr_handle, + const u64 vaddr_in, + const u64 length, + const u32 access_ctrl, + const struct ipz_pd pd, + const u64 mr_addr_cb, + u64 * vaddr_out, + u32 * lkey, + u32 * rkey); + +/** + * hipz_h_register_smr - Register shared MR in HW and FW. 
+ * + * @pfmr: platform specific for new shared MR + * @orig_pfmr: platform specific for original MR + * @pfshca: platform specific for SHCA + * @orig_mr_handle: Memory Region Handle of original MR + * @vaddr_in: Memory Region I/O Virtual Address of new shared MR + * @access_ctrl: Memory Region Access Controls of new shared MR + * @pd: Protection Domain of new shared MR + * @mr_handle: Memory Region Handle of new shared MR + * @lkey: L_Key of new shared MR + * @rkey: R_Key of new shared MR + */ +u64 hipz_h_register_smr(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmr *pfmr, + struct ehca_pfmr *orig_pfmr, + struct ehca_pfshca *pfshca, + const struct ipz_mrmw_handle *orig_mr_handle, + const u64 vaddr_in, + const u32 access_ctrl, + const struct ipz_pd pd, + struct ipz_mrmw_handle *mr_handle, + u32 * lkey, + u32 * rkey); + +u64 hipz_h_alloc_resource_mw(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmw *pfmw, + struct ehca_pfshca *pfshca, + const struct ipz_pd pd, + struct ipz_mrmw_handle *mw_handle, + u32 * rkey); + +u64 hipz_h_query_mw(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmw *pfmw, + const struct ipz_mrmw_handle *mw_handle, + u32 * rkey, + struct ipz_pd *pd); + +u64 hipz_h_free_resource_mw(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmw *pfmw, + const struct ipz_mrmw_handle *mw_handle); + +u64 hipz_h_error_data(const struct ipz_adapter_handle adapter_handle, + const u64 ressource_handle, + void *rblock, + unsigned long *byte_count); + +#endif /* __HCP_IF_H__ */ --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/hcp_if.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/hcp_if.c 2006-04-04 23:52:30.000000000 +0200 @@ -0,0 +1,2028 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Firmware Infiniband Interface code for POWER + * + * Authors: Christoph Raisch + * Hoang-Nam Nguyen + * Gerd Bayer + * Waleri Fomin + * + * 
Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: hcp_if.c,v 1.18 2006/04/04 21:52:30 nguyen Exp $ + */ + +#define DEB_PREFIX "hcpi" + +#include +#include "ehca_kernel.h" +#include "ehca_tools.h" +#include "hcp_if.h" +#include "hcp_phyp.h" +#include "hipz_fns.h" + +#define H_ALL_RES_QP_ENHANCED_OPS EHCA_BMASK_IBM(9,11) +#define H_ALL_RES_QP_PTE_PIN EHCA_BMASK_IBM(12,12) +#define H_ALL_RES_QP_SERVICE_TYPE EHCA_BMASK_IBM(13,15) +#define H_ALL_RES_QP_LL_RQ_CQE_POSTING EHCA_BMASK_IBM(18,18) +#define H_ALL_RES_QP_LL_SQ_CQE_POSTING EHCA_BMASK_IBM(19,21) +#define H_ALL_RES_QP_SIGNALING_TYPE EHCA_BMASK_IBM(22,23) +#define H_ALL_RES_QP_UD_AV_LKEY_CTRL EHCA_BMASK_IBM(31,31) +#define H_ALL_RES_QP_RESOURCE_TYPE EHCA_BMASK_IBM(56,63) + +#define H_ALL_RES_QP_MAX_OUTST_SEND_WR EHCA_BMASK_IBM(0,15) +#define H_ALL_RES_QP_MAX_OUTST_RECV_WR EHCA_BMASK_IBM(16,31) +#define H_ALL_RES_QP_MAX_SEND_SGE EHCA_BMASK_IBM(32,39) +#define H_ALL_RES_QP_MAX_RECV_SGE EHCA_BMASK_IBM(40,47) + +#define H_ALL_RES_QP_ACT_OUTST_SEND_WR EHCA_BMASK_IBM(16,31) +#define H_ALL_RES_QP_ACT_OUTST_RECV_WR EHCA_BMASK_IBM(48,63) +#define H_ALL_RES_QP_ACT_SEND_SGE EHCA_BMASK_IBM(8,15) +#define H_ALL_RES_QP_ACT_RECV_SGE EHCA_BMASK_IBM(24,31) + +#define H_ALL_RES_QP_SQUEUE_SIZE_PAGES EHCA_BMASK_IBM(0,31) +#define H_ALL_RES_QP_RQUEUE_SIZE_PAGES EHCA_BMASK_IBM(32,63) + +/* direct access qp controls */ +#define DAQP_CTRL_ENABLE 0x01 +#define DAQP_CTRL_SEND_COMPLETION 0x20 +#define DAQP_CTRL_RECV_COMPLETION 0x40 + +/* We will remove these lines in SVN when this is included in the Linux kernel. + * We don't want to introduce unnecessary dependencies on a patched kernel.
+ */ +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,17) +struct hcall { + u64 regs[11]; +}; + +static long plpar_hcall_7arg_7ret(unsigned long opcode, + unsigned long arg1, /* handle, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &act_nr_of_entries_out, /* r7 */ + &act_pages_out, /* r8 */ + &eq_ist_out, /* r8 */ + &dummy); + + *act_nr_of_entries = (u32) act_nr_of_entries_out; + *act_pages = (u32) act_pages_out; + *eq_ist = (u32) eq_ist_out; + +#endif /* EHCA_USE_HCALL */ + + if (retcode == H_NOT_ENOUGH_RESOURCES) { + EDEB_ERR(4, "Not enough resource - retcode=%lx ", retcode); + } + + EDEB_EX(7, "act_nr_of_entries=%x act_pages=%x eq_ist=%x", + *act_nr_of_entries, *act_pages, *eq_ist); + + return retcode; +} + +u64 hipz_h_reset_event(const struct ipz_adapter_handle adapter_handle, + struct ipz_eq_handle eq_handle, + const u64 event_mask) +{ + u64 retcode = 0; + u64 dummy; + + EDEB_EN(7, "eq_handle=%lx, adapter_handle=%lx event_mask=%lx", + eq_handle.handle, adapter_handle.handle, event_mask); + + retcode = ehca_hcall_7arg_7ret(H_RESET_EVENTS, + adapter_handle.handle, /* r4 */ + eq_handle.handle, /* r5 */ + event_mask, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB(7, "retcode=%lx", retcode); + + return retcode; +} + +u64 hipz_h_alloc_resource_cq(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfcq *pfcq, + const struct ipz_eq_handle eq_handle, + const u32 cq_token, + const u32 number_of_entries, + struct ipz_cq_handle *cq_handle, + u32 * act_nr_of_entries, + u32 * act_pages, + struct h_galpas *galpas) +{ + u64 retcode = 0; + u64 dummy; + u64 act_nr_of_entries_out; + u64 act_pages_out; + u64 g_la_privileged_out; + u64 g_la_user_out; + /* stack location is a unique identifier for a process from beginning + * to end of this frame */ + u32 x = (u64)(&x); + + EDEB_EN(7, "pfcq=%p adapter_handle=%lx eq_handle=%lx cq_token=%x" + " number_of_entries=%x", + pfcq, adapter_handle.handle, eq_handle.handle, + 
cq_token, number_of_entries); + +#ifndef EHCA_USE_HCALL + retcode = simp_h_alloc_resource_cq(adapter_handle, + pfcq, + eq_handle, + cq_token, + number_of_entries, + cq_handle, + act_nr_of_entries, + act_pages, galpas); +#else + retcode = ehca_hcall_7arg_7ret(H_ALLOC_RESOURCE, + adapter_handle.handle, /* r4 */ + 2, /* r5 */ + eq_handle.handle, /* r6 */ + cq_token, /* r7 */ + number_of_entries, /* r8 */ + 0, 0, + &cq_handle->handle, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &act_nr_of_entries_out,/* r7 */ + &act_pages_out, /* r8 */ + &g_la_privileged_out, /* r9 */ + &g_la_user_out); /* r10 */ + + *act_nr_of_entries = (u32) act_nr_of_entries_out; + *act_pages = (u32) act_pages_out; + + if (retcode == 0) { + hcp_galpas_ctor(galpas, g_la_privileged_out, g_la_user_out); + } +#endif /* EHCA_USE_HCALL */ + + if (retcode == H_NOT_ENOUGH_RESOURCES) { + EDEB_ERR(4, "Not enough resources. retcode=%lx", retcode); + } + + EDEB_EX(7, "cq_handle=%lx act_nr_of_entries=%x act_pages=%x", + cq_handle->handle, *act_nr_of_entries, *act_pages); + + return retcode; +} + +u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfqp *pfqp, + const u8 servicetype, + const u8 daqp_ctrl, + const u8 signalingtype, + const u8 ud_av_l_key_ctl, + const struct ipz_cq_handle send_cq_handle, + const struct ipz_cq_handle receive_cq_handle, + const struct ipz_eq_handle async_eq_handle, + const u32 qp_token, + const struct ipz_pd pd, + const u16 max_nr_send_wqes, + const u16 max_nr_receive_wqes, + const u8 max_nr_send_sges, + const u8 max_nr_receive_sges, + const u32 ud_av_l_key, + struct ipz_qp_handle *qp_handle, + u32 * qp_nr, + u16 * act_nr_send_wqes, + u16 * act_nr_receive_wqes, + u8 * act_nr_send_sges, + u8 * act_nr_receive_sges, + u32 * nr_sq_pages, + u32 * nr_rq_pages, + struct h_galpas *h_galpas) +{ + u64 retcode = H_SUCCESS; + u64 allocate_controls; + u64 max_r10_reg; + u64 dummy = 0; + u64 qp_nr_out = 0; + u64 r6_out = 0; + u64 r7_out = 0; + u64 r8_out 
= 0; + u64 g_la_user_out = 0; + u64 r11_out = 0; + + EDEB_EN(7, "pfqp=%p adapter_handle=%lx servicetype=%x signalingtype=%x" + " ud_av_l_key=%x send_cq_handle=%lx receive_cq_handle=%lx" + " async_eq_handle=%lx qp_token=%x pd=%x max_nr_send_wqes=%x" + " max_nr_receive_wqes=%x max_nr_send_sges=%x" + " max_nr_receive_sges=%x ud_av_l_key=%x galpa.pid=%x", + pfqp, adapter_handle.handle, servicetype, signalingtype, + ud_av_l_key, send_cq_handle.handle, + receive_cq_handle.handle, async_eq_handle.handle, qp_token, + pd.value, max_nr_send_wqes, max_nr_receive_wqes, + max_nr_send_sges, max_nr_receive_sges, ud_av_l_key, + h_galpas->pid); + +#ifndef EHCA_USE_HCALL + retcode = simp_h_alloc_resource_qp(adapter_handle, + pfqp, + servicetype, + signalingtype, + ud_av_l_key_ctl, + send_cq_handle, + receive_cq_handle, + async_eq_handle, + qp_token, + pd, + max_nr_send_wqes, + max_nr_receive_wqes, + max_nr_send_sges, + max_nr_receive_sges, + ud_av_l_key, + qp_handle, + qp_nr, + act_nr_send_wqes, + act_nr_receive_wqes, + act_nr_send_sges, + act_nr_receive_sges, + nr_sq_pages, nr_rq_pages, h_galpas); + +#else + allocate_controls = + EHCA_BMASK_SET(H_ALL_RES_QP_ENHANCED_OPS, + (daqp_ctrl & DAQP_CTRL_ENABLE) ? 1 : 0) + | EHCA_BMASK_SET(H_ALL_RES_QP_PTE_PIN, 0) + | EHCA_BMASK_SET(H_ALL_RES_QP_SERVICE_TYPE, servicetype) + | EHCA_BMASK_SET(H_ALL_RES_QP_SIGNALING_TYPE, signalingtype) + | EHCA_BMASK_SET(H_ALL_RES_QP_LL_RQ_CQE_POSTING, + (daqp_ctrl & DAQP_CTRL_RECV_COMPLETION) ? 1 : 0) + | EHCA_BMASK_SET(H_ALL_RES_QP_LL_SQ_CQE_POSTING, + (daqp_ctrl & DAQP_CTRL_SEND_COMPLETION) ? 
1 : 0) + | EHCA_BMASK_SET(H_ALL_RES_QP_UD_AV_LKEY_CTRL, + ud_av_l_key_ctl) + | EHCA_BMASK_SET(H_ALL_RES_QP_RESOURCE_TYPE, 1); + + max_r10_reg = + EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_SEND_WR, + max_nr_send_wqes) + | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_RECV_WR, + max_nr_receive_wqes) + | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_SEND_SGE, + max_nr_send_sges) + | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_RECV_SGE, + max_nr_receive_sges); + + + retcode = ehca_hcall_9arg_9ret(H_ALLOC_RESOURCE, + adapter_handle.handle, /* r4 */ + allocate_controls, /* r5 */ + send_cq_handle.handle, /* r6 */ + receive_cq_handle.handle, /* r7 */ + async_eq_handle.handle, /* r8 */ + ((u64) qp_token << 32) + | pd.value, /* r9 */ + max_r10_reg, /* r10 */ + ud_av_l_key, /* r11 */ + 0, + &qp_handle->handle, /* r4 */ + &qp_nr_out, /* r5 */ + &r6_out, /* r6 */ + &r7_out, /* r7 */ + &r8_out, /* r8 */ + &dummy, /* r9 */ + &g_la_user_out, /* r10 */ + &r11_out, + &dummy); + + /* extract outputs */ + *qp_nr = (u32) qp_nr_out; + *act_nr_send_wqes = (u16) + EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_SEND_WR, + r6_out); + *act_nr_receive_wqes = (u16) + EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_RECV_WR, + r6_out); + *act_nr_send_sges = + (u8) EHCA_BMASK_GET(H_ALL_RES_QP_ACT_SEND_SGE, + r7_out); + *act_nr_receive_sges = + (u8) EHCA_BMASK_GET(H_ALL_RES_QP_ACT_RECV_SGE, + r7_out); + *nr_sq_pages = + (u32) EHCA_BMASK_GET(H_ALL_RES_QP_SQUEUE_SIZE_PAGES, + r8_out); + *nr_rq_pages = + (u32) EHCA_BMASK_GET(H_ALL_RES_QP_RQUEUE_SIZE_PAGES, + r8_out); + if (retcode == 0) { + hcp_galpas_ctor(h_galpas, g_la_user_out, g_la_user_out); + } +#endif /* EHCA_USE_HCALL */ + + if (retcode == H_NOT_ENOUGH_RESOURCES) { + EDEB_ERR(4, "Not enough resources. 
retcode=%lx", + retcode); + } + + EDEB_EX(7, "qp_nr=%x act_nr_send_wqes=%x" + " act_nr_receive_wqes=%x act_nr_send_sges=%x" + " act_nr_receive_sges=%x nr_sq_pages=%x" + " nr_rq_pages=%x galpa.user=%lx galpa.kernel=%lx", + *qp_nr, *act_nr_send_wqes, *act_nr_receive_wqes, + *act_nr_send_sges, *act_nr_receive_sges, *nr_sq_pages, + *nr_rq_pages, h_galpas->user.fw_handle, + h_galpas->kernel.fw_handle); + + return (retcode); +} + +u64 hipz_h_query_port(const struct ipz_adapter_handle adapter_handle, + const u8 port_id, + struct hipz_query_port *query_port_response_block) +{ + u64 retcode = H_SUCCESS; + u64 dummy; + u64 r_cb; + + EDEB_EN(7, "adapter_handle=%lx port_id %x", + adapter_handle.handle, port_id); + + if ((((u64)query_port_response_block) & 0xfff) != 0) { + EDEB_ERR(4, "response block not page aligned"); + retcode = H_PARAMETER; + return (retcode); + } + + r_cb = virt_to_abs(query_port_response_block); + + retcode = ehca_hcall_7arg_7ret(H_QUERY_PORT, + adapter_handle.handle, /* r4 */ + port_id, /* r5 */ + r_cb, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB_DMP(7, query_port_response_block, 64, "query_port_response_block"); + EDEB(7, "offset31=%x offset35=%x offset36=%x", + ((u32 *) query_port_response_block)[32], + ((u32 *) query_port_response_block)[36], + ((u32 *) query_port_response_block)[37]); + EDEB(7, "offset200=%x offset201=%x offset202=%x " + "offset203=%x", + ((u32 *) query_port_response_block)[0x200], + ((u32 *) query_port_response_block)[0x201], + ((u32 *) query_port_response_block)[0x202], + ((u32 *) query_port_response_block)[0x203]); + + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +u64 hipz_h_query_hca(const struct ipz_adapter_handle adapter_handle, + struct hipz_query_hca *query_hca_rblock) +{ + u64 retcode = 0; + u64 dummy; + u64 r_cb; + EDEB_EN(7, "adapter_handle=%lx", adapter_handle.handle); + + if ((((u64)query_hca_rblock) & 0xfff) != 0) { + EDEB_ERR(4, "response block 
not page aligned"); + retcode = H_PARAMETER; + return (retcode); + } + + r_cb = virt_to_abs(query_hca_rblock); + + retcode = ehca_hcall_7arg_7ret(H_QUERY_HCA, + adapter_handle.handle, /* r4 */ + r_cb, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB(7, "offset0=%x offset1=%x offset2=%x offset3=%x", + ((u32 *) query_hca_rblock)[0], + ((u32 *) query_hca_rblock)[1], + ((u32 *) query_hca_rblock)[2], ((u32 *) query_hca_rblock)[3]); + EDEB(7, "offset4=%x offset5=%x offset6=%x offset7=%x", + ((u32 *) query_hca_rblock)[4], + ((u32 *) query_hca_rblock)[5], + ((u32 *) query_hca_rblock)[6], ((u32 *) query_hca_rblock)[7]); + EDEB(7, "offset8=%x offset9=%x offseta=%x offsetb=%x", + ((u32 *) query_hca_rblock)[8], + ((u32 *) query_hca_rblock)[9], + ((u32 *) query_hca_rblock)[10], ((u32 *) query_hca_rblock)[11]); + EDEB(7, "offsetc=%x offsetd=%x offsete=%x offsetf=%x", + ((u32 *) query_hca_rblock)[12], + ((u32 *) query_hca_rblock)[13], + ((u32 *) query_hca_rblock)[14], ((u32 *) query_hca_rblock)[15]); + EDEB(7, "offset136=%x offset192=%x offset204=%x", + ((u32 *) query_hca_rblock)[32], + ((u32 *) query_hca_rblock)[48], ((u32 *) query_hca_rblock)[51]); + EDEB(7, "offset231=%x offset235=%x", + ((u32 *) query_hca_rblock)[57], ((u32 *) query_hca_rblock)[58]); + EDEB(7, "offset200=%x offset201=%x offset202=%x offset203=%x", + ((u32 *) query_hca_rblock)[0x201], + ((u32 *) query_hca_rblock)[0x202], + ((u32 *) query_hca_rblock)[0x203], + ((u32 *) query_hca_rblock)[0x204]); + + EDEB_EX(7, "retcode=%lx adapter_handle=%lx", + retcode, adapter_handle.handle); + + return retcode; +} + +u64 hipz_h_register_rpage(const struct ipz_adapter_handle adapter_handle, + const u8 pagesize, + const u8 queue_type, + const u64 resource_handle, + const u64 logical_address_of_page, + u64 count) +{ + u64 retcode = 0; + u64 dummy; + + EDEB_EN(7, "adapter_handle=%lx pagesize=%x queue_type=%x" + " resource_handle=%lx logical_address_of_page=%lx count=%lx", 
+ adapter_handle.handle, pagesize, queue_type, + resource_handle, logical_address_of_page, count); + + retcode = ehca_hcall_7arg_7ret(H_REGISTER_RPAGES, + adapter_handle.handle, /* r4 */ + queue_type | pagesize << 8, /* r5 */ + resource_handle, /* r6 */ + logical_address_of_page, /* r7 */ + count, /* r8 */ + 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +u64 hipz_h_register_rpage_eq(const struct ipz_adapter_handle adapter_handle, + const struct ipz_eq_handle eq_handle, + struct ehca_pfeq *pfeq, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count) +{ + u64 retcode = 0; + + EDEB_EN(7, "pfeq=%p adapter_handle=%lx eq_handle=%lx pagesize=%x" + " queue_type=%x logical_address_of_page=%lx count=%lx", + pfeq, adapter_handle.handle, eq_handle.handle, pagesize, + queue_type,logical_address_of_page, count); + +#ifndef EHCA_USE_HCALL + retcode = + simp_h_register_rpage_eq(adapter_handle, eq_handle, pfeq, + pagesize, queue_type, + logical_address_of_page, count); +#else + if (count != 1) { + EDEB_ERR(4, "Page counter=%lx", count); + return (H_PARAMETER); + } + retcode = hipz_h_register_rpage(adapter_handle, + pagesize, + queue_type, + eq_handle.handle, + logical_address_of_page, count); +#endif /* EHCA_USE_HCALL */ + + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +u32 hipz_h_query_int_state(const struct ipz_adapter_handle adapter_handle, + u32 ist) +{ + u32 rc = 0; + u64 dummy = 0; + + EDEB_EN(7, "ist=%x", ist); + + rc = ehca_hcall_7arg_7ret(H_QUERY_INT_STATE, + adapter_handle.handle, /* r4 */ + ist, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + if ((rc != H_SUCCESS) && (rc != H_BUSY)) + EDEB_ERR(4, "Could not query interrupt state."); + + EDEB_EX(7, "interrupt state: %x", rc); + + return rc; +} + +u64 hipz_h_register_rpage_cq(const struct ipz_adapter_handle 
adapter_handle, + const struct ipz_cq_handle cq_handle, + struct ehca_pfcq *pfcq, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count, + const struct h_galpa gal) +{ + u64 retcode = 0; + + EDEB_EN(7, "pfcq=%p adapter_handle=%lx cq_handle=%lx pagesize=%x" + " queue_type=%x logical_address_of_page=%lx count=%lx", + pfcq, adapter_handle.handle, cq_handle.handle, pagesize, + queue_type, logical_address_of_page, count); + +#ifndef EHCA_USE_HCALL + retcode = + simp_h_register_rpage_cq(adapter_handle, cq_handle, pfcq, + pagesize, queue_type, + logical_address_of_page, count, gal); +#else + if (count != 1) { + EDEB_ERR(4, "Page counter=%lx", count); + return (H_PARAMETER); + } + + retcode = + hipz_h_register_rpage(adapter_handle, pagesize, queue_type, + cq_handle.handle, logical_address_of_page, + count); +#endif /* EHCA_USE_HCALL */ + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +u64 hipz_h_register_rpage_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count, + const struct h_galpa galpa) +{ + u64 retcode = 0; + + EDEB_EN(7, "pfqp=%p adapter_handle=%lx qp_handle=%lx pagesize=%x" + " queue_type=%x logical_address_of_page=%lx count=%lx", + pfqp, adapter_handle.handle, qp_handle.handle, pagesize, + queue_type, logical_address_of_page, count); + +#ifndef EHCA_USE_HCALL + retcode = simp_h_register_rpage_qp(adapter_handle, + qp_handle, + pfqp, + pagesize, + queue_type, + logical_address_of_page, + count, galpa); +#else + if (count != 1) { + EDEB_ERR(4, "Page counter=%lx", count); + return (H_PARAMETER); + } + + retcode = hipz_h_register_rpage(adapter_handle, + pagesize, + queue_type, + qp_handle.handle, + logical_address_of_page, count); +#endif /* EHCA_USE_HCALL */ + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +u64 
hipz_h_remove_rpt_cq(const struct ipz_adapter_handle adapter_handle, + const struct ipz_cq_handle cq_handle, + struct ehca_pfcq *pfcq) +{ + u64 retcode = 0; + + EDEB_EN(7, "pfcq=%p adapter_handle=%lx cq_handle=%lx", + pfcq, adapter_handle.handle, cq_handle.handle); + +#ifndef EHCA_USE_HCALL + retcode = simp_h_remove_rpt_cq(adapter_handle, cq_handle, pfcq); +#endif + EDEB_EX(7, "retcode=%lx", retcode); + + return 0; +} + +u64 hipz_h_remove_rpt_eq(const struct ipz_adapter_handle adapter_handle, + const struct ipz_eq_handle eq_handle, + struct ehca_pfeq *pfeq) +{ + u64 retcode = 0; + + EDEB_EN(7, "adapter_handle=%lx eq_handle=%lx", + adapter_handle.handle, eq_handle.handle); + +#ifndef EHCA_USE_HCALL + retcode = simp_h_remove_rpt_eq(adapter_handle, eq_handle, pfeq); +#endif + EDEB_EX(7, "retcode=%lx", retcode); + + return 0; +} + +u64 hipz_h_remove_rpt_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp) +{ + u64 retcode = 0; + + EDEB_EN(7, "pfqp=%p adapter_handle=%lx qp_handle=%lx", + pfqp, adapter_handle.handle, qp_handle.handle); + +#ifndef EHCA_USE_HCALL + retcode = simp_h_remove_rpt_qp(adapter_handle, qp_handle, pfqp); +#endif + EDEB_EX(7, "retcode=%lx", retcode); + + return 0; +} + +u64 hipz_h_disable_and_get_wqe(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + void **log_addr_next_sq_wqe2processed, + void **log_addr_next_rq_wqe2processed, + int dis_and_get_function_code) +{ + u64 retcode = 0; + u8 function_code = 1; + u64 dummy, dummy1, dummy2; + + EDEB_EN(7, "pfqp=%p adapter_handle=%lx function=%x qp_handle=%lx", + pfqp, adapter_handle.handle, function_code, qp_handle.handle); + + if (log_addr_next_sq_wqe2processed==NULL) { + log_addr_next_sq_wqe2processed = (void**)&dummy1; + } + if (log_addr_next_rq_wqe2processed==NULL) { + log_addr_next_rq_wqe2processed = (void**)&dummy2; + } +#ifndef EHCA_USE_HCALL + retcode = + 
simp_h_disable_and_get_wqe(adapter_handle, qp_handle, pfqp, + log_addr_next_sq_wqe2processed, + log_addr_next_rq_wqe2processed); +#else + + retcode = ehca_hcall_7arg_7ret(H_DISABLE_AND_GETC, + adapter_handle.handle, /* r4 */ + dis_and_get_function_code, /* r5 */ + /* function code 1-disQP ret + * SQ RQ wqe ptr + * 2- ret SQ wqe ptr + * 3- ret. RQ count */ + qp_handle.handle, /* r6 */ + 0, 0, 0, 0, + (void*)log_addr_next_sq_wqe2processed, + (void*)log_addr_next_rq_wqe2processed, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif /* EHCA_USE_HCALL */ + EDEB_EX(7, "retcode=%lx ladr_next_sq_wqe_out=%p" + " ladr_next_rq_wqe_out=%p", retcode, + *log_addr_next_sq_wqe2processed, + *log_addr_next_rq_wqe2processed); + + return retcode; +} + +u64 hipz_h_modify_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + const u64 update_mask, + struct hcp_modify_qp_control_block *mqpcb, + struct h_galpa gal) +{ + u64 retcode = 0; + u64 invalid_attribute_identifier = 0; + u64 rc_attrib_mask = 0; + u64 dummy; + u64 r_cb; + EDEB_EN(7, "pfqp=%p adapter_handle=%lx qp_handle=%lx" + " update_mask=%lx qp_state=%x mqpcb=%p", + pfqp, adapter_handle.handle, qp_handle.handle, + update_mask, mqpcb->qp_state, mqpcb); + +#ifndef EHCA_USE_HCALL + simp_h_modify_qp(adapter_handle, qp_handle, pfqp, update_mask, + mqpcb, gal); +#else + r_cb = virt_to_abs(mqpcb); + retcode = ehca_hcall_7arg_7ret(H_MODIFY_QP, + adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + update_mask, /* r6 */ + r_cb, /* r7 */ + 0, 0, 0, + &invalid_attribute_identifier, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &dummy, /* r7 */ + &dummy, /* r8 */ + &rc_attrib_mask, /* r9 */ + &dummy); +#endif + if (retcode == H_NOT_ENOUGH_RESOURCES) { + EDEB_ERR(4, "Insufficient resources retcode=%lx", retcode); + } + + EDEB_EX(7, "retcode=%lx invalid_attribute_identifier=%lx" + " invalid_attribute_MASK=%lx", retcode, + invalid_attribute_identifier, 
rc_attrib_mask); + + return retcode; +} + +u64 hipz_h_query_qp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + struct hcp_modify_qp_control_block *qqpcb, + struct h_galpa gal) +{ + u64 retcode = 0; + u64 dummy; + u64 r_cb; + EDEB_EN(7, "adapter_handle=%lx qp_handle=%lx", + adapter_handle.handle, qp_handle.handle); + +#ifndef EHCA_USE_HCALL + simp_h_query_qp(adapter_handle, qp_handle, qqpcb, gal); +#else + r_cb = virt_to_abs(qqpcb); + EDEB(7, "r_cb=%lx", r_cb); + + retcode = ehca_hcall_7arg_7ret(H_QUERY_QP, + adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + r_cb, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + +#endif + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +u64 hipz_h_destroy_qp(const struct ipz_adapter_handle adapter_handle, + struct ehca_qp *qp) +{ + u64 retcode = 0; + u64 dummy; + u64 ladr_next_sq_wqe_out; + u64 ladr_next_rq_wqe_out; + + EDEB_EN(7, "qp = %p ,ipz_qp_handle=%lx adapter_handle=%lx", + qp, qp->ipz_qp_handle.handle, adapter_handle.handle); + +#ifndef EHCA_USE_HCALL + retcode = + simp_h_destroy_qp(adapter_handle, qp, + qp->galpas.user); +#else + + retcode = hcp_galpas_dtor(&qp->galpas); + + retcode = ehca_hcall_7arg_7ret(H_DISABLE_AND_GETC, + adapter_handle.handle, /* r4 */ + /* function code */ + 1, /* r5 */ + qp->ipz_qp_handle.handle, /* r6 */ + 0, 0, 0, 0, + &ladr_next_sq_wqe_out, /* r4 */ + &ladr_next_rq_wqe_out, /* r5 */ + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + if (retcode == H_HARDWARE) { + EDEB_ERR(4, "HCA not operational. retcode=%lx", retcode); + } + + retcode = ehca_hcall_7arg_7ret(H_FREE_RESOURCE, + adapter_handle.handle, /* r4 */ + qp->ipz_qp_handle.handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif /* EHCA_USE_HCALL */ + + if (retcode == H_RESOURCE) { + EDEB_ERR(4, "Resource still in use. 
retcode=%lx", retcode); + } + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +u64 hipz_h_define_aqp0(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u32 port) +{ + u64 retcode = 0; + u64 dummy; + + EDEB_EN(7, "port=%x ipz_qp_handle=%lx adapter_handle=%lx", + port, qp_handle.handle, adapter_handle.handle); + + retcode = ehca_hcall_7arg_7ret(H_DEFINE_AQP0, + adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + port, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +u64 hipz_h_define_aqp1(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u32 port, u32 * pma_qp_nr, + u32 * bma_qp_nr) +{ + u64 retcode = 0; + u64 dummy; + u64 pma_qp_nr_out; + u64 bma_qp_nr_out; + + EDEB_EN(7, "port=%x qp_handle=%lx adapter_handle=%lx", + port, qp_handle.handle, adapter_handle.handle); + + retcode = ehca_hcall_7arg_7ret(H_DEFINE_AQP1, + adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + port, /* r6 */ + 0, 0, 0, 0, + &pma_qp_nr_out, /* r4 */ + &bma_qp_nr_out, /* r5 */ + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + *pma_qp_nr = (u32) pma_qp_nr_out; + *bma_qp_nr = (u32) bma_qp_nr_out; + + if (retcode == H_ALIAS_EXIST) { + EDEB_ERR(4, "AQP1 already exists. retcode=%lx", retcode); + } + + EDEB_EX(7, "retcode=%lx pma_qp_nr=%i bma_qp_nr=%i", + retcode, (int)*pma_qp_nr, (int)*bma_qp_nr); + + return retcode; +} + +u64 hipz_h_attach_mcqp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u16 mcg_dlid, + u64 subnet_prefix, u64 interface_id) +{ + u64 retcode = 0; + u64 dummy; + u8 *dgid_sp = (u8*)&subnet_prefix; + u8 *dgid_ii = (u8*)&interface_id; + + EDEB_EN(7, "qp_handle=%lx adapter_handle=%lx\nMCG_DGID =" + " %d.%d.%d.%d.%d.%d.%d.%d." 
+ " %d.%d.%d.%d.%d.%d.%d.%d\n", + qp_handle.handle, adapter_handle.handle, + dgid_sp[0], dgid_sp[1], + dgid_sp[2], dgid_sp[3], + dgid_sp[4], dgid_sp[5], + dgid_sp[6], dgid_sp[7], + dgid_ii[0], dgid_ii[1], + dgid_ii[2], dgid_ii[3], + dgid_ii[4], dgid_ii[5], + dgid_ii[6], dgid_ii[7]); + + retcode = ehca_hcall_7arg_7ret(H_ATTACH_MCQP, + adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + mcg_dlid, /* r6 */ + interface_id, /* r7 */ + subnet_prefix, /* r8 */ + 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + if (retcode == H_NOT_ENOUGH_RESOURCES) { + EDEB_ERR(4, "Not enough resources. retcode=%lx", retcode); + } + + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +u64 hipz_h_detach_mcqp(const struct ipz_adapter_handle adapter_handle, + const struct ipz_qp_handle qp_handle, + struct h_galpa gal, + u16 mcg_dlid, + u64 subnet_prefix, u64 interface_id) +{ + u64 retcode = 0; + u64 dummy; + u8 *dgid_sp = (u8*)&subnet_prefix; + u8 *dgid_ii = (u8*)&interface_id; + + EDEB_EN(7, "qp_handle=%lx adapter_handle=%lx\nMCG_DGID =" + " %d.%d.%d.%d.%d.%d.%d.%d." 
+ " %d.%d.%d.%d.%d.%d.%d.%d\n", + qp_handle.handle, adapter_handle.handle, + dgid_sp[0], dgid_sp[1], + dgid_sp[2], dgid_sp[3], + dgid_sp[4], dgid_sp[5], + dgid_sp[6], dgid_sp[7], + dgid_ii[0], dgid_ii[1], + dgid_ii[2], dgid_ii[3], + dgid_ii[4], dgid_ii[5], + dgid_ii[6], dgid_ii[7]); + retcode = ehca_hcall_7arg_7ret(H_DETACH_MCQP, + adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + mcg_dlid, /* r6 */ + interface_id, /* r7 */ + subnet_prefix, /* r8 */ + 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB(7, "retcode=%lx", retcode); + + return retcode; +} + +u64 hipz_h_destroy_cq(const struct ipz_adapter_handle adapter_handle, + struct ehca_cq *cq, + u8 force_flag) +{ + u64 retcode = 0; + u64 dummy; + + EDEB_EN(7, "cq->pf=%p cq=.%p ipz_cq_handle=%lx adapter_handle=%lx", + &cq->pf, cq, cq->ipz_cq_handle.handle, adapter_handle.handle); + +#ifndef EHCA_USE_HCALL + simp_h_destroy_cq(adapter_handle, cq, + cq->galpas.kernel); +#else + retcode = hcp_galpas_dtor(&cq->galpas); + if (retcode != 0) { + EDEB_ERR(4, "Could not destruct cq->galpas"); + return (H_RESOURCE); + } + + retcode = ehca_hcall_7arg_7ret(H_FREE_RESOURCE, + adapter_handle.handle, /* r4 */ + cq->ipz_cq_handle.handle, /* r5 */ + force_flag!=0 ? 
1L : 0L, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif + + if (retcode == H_RESOURCE) { + EDEB(4, "retcode=%lx ", retcode); + } + + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +u64 hipz_h_destroy_eq(const struct ipz_adapter_handle adapter_handle, + struct ehca_eq *eq) +{ + u64 retcode = 0; + u64 dummy; + + EDEB_EN(7, "eq->pf=%p eq=%p ipz_eq_handle=%lx adapter_handle=%lx", + &eq->pf, eq, eq->ipz_eq_handle.handle, + adapter_handle.handle); + + retcode = hcp_galpas_dtor(&eq->galpas); + if (retcode != 0) { + EDEB_ERR(4, "Could not destruct eq->galpas"); + return (H_RESOURCE); + } + + retcode = ehca_hcall_7arg_7ret(H_FREE_RESOURCE, + adapter_handle.handle, /* r4 */ + eq->ipz_eq_handle.handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + + if (retcode == H_RESOURCE) { + EDEB_ERR(4, "Resource in use. retcode=%lx ", retcode); + } + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +u64 hipz_h_alloc_resource_mr(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmr *pfmr, + struct ehca_pfshca *pfshca, + const u64 vaddr, + const u64 length, + const u32 access_ctrl, + const struct ipz_pd pd, + struct ipz_mrmw_handle *mr_handle, + u32 * lkey, + u32 * rkey) +{ + u64 rc = H_SUCCESS; + u64 dummy; + u64 lkey_out; + u64 rkey_out; + + EDEB_EN(7, "adapter_handle=%lx pfmr=%p vaddr=%lx length=%lx" + " access_ctrl=%x pd=%x pfshca=%p", + adapter_handle.handle, pfmr, vaddr, length, access_ctrl, + pd.value, pfshca); + +#ifndef EHCA_USE_HCALL + rc = simp_hcz_h_alloc_resource_mr(adapter_handle, + pfmr, + pfshca, + vaddr, + length, + access_ctrl, + pd, + (struct hcz_mrmw_handle *)mr_handle, + lkey, rkey); + EDEB_EX(7, "rc=%lx mr_handle.mrwpte=%p mr_handle.page_index=%x" + " lkey=%x rkey=%x", + rc, mr_handle->mrwpte, mr_handle->page_index, *lkey, *rkey); +#else + + rc = ehca_hcall_7arg_7ret(H_ALLOC_RESOURCE, + adapter_handle.handle, /* r4 */ + 
5, /* r5 */ + vaddr, /* r6 */ + length, /* r7 */ + ((((u64) access_ctrl) << 32ULL)), /* r8 */ + pd.value, /* r9 */ + 0, + &mr_handle->handle, /* r4 */ + &dummy, /* r5 */ + &lkey_out, /* r6 */ + &rkey_out, /* r7 */ + &dummy, + &dummy, + &dummy); + *lkey = (u32) lkey_out; + *rkey = (u32) rkey_out; + + EDEB_EX(7, "rc=%lx mr_handle=%lx lkey=%x rkey=%x", + rc, mr_handle->handle, *lkey, *rkey); +#endif /* EHCA_USE_HCALL */ + + return rc; +} + +u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle, + const struct ipz_mrmw_handle *mr_handle, + struct ehca_pfmr *pfmr, + struct ehca_pfshca *pfshca, + const u8 pagesize, + const u8 queue_type, + const u64 logical_address_of_page, + const u64 count) +{ + u64 rc = H_SUCCESS; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "adapter_handle=%lx pfmr=%p mr_handle.mrwpte=%p" + " mr_handle.page_index=%x pagesize=%x queue_type=%x " + " logical_address_of_page=%lx count=%lx pfshca=%p", + adapter_handle.handle, pfmr, mr_handle->mrwpte, + mr_handle->page_index, pagesize, queue_type, + logical_address_of_page, count, pfshca); + + rc = simp_hcz_h_register_rpage_mr(adapter_handle, + (struct hcz_mrmw_handle *)mr_handle, + pfmr, + pfshca, + pagesize, + queue_type, + logical_address_of_page, count); +#else + EDEB_EN(7, "adapter_handle=%lx pfmr=%p mr_handle=%lx pagesize=%x" + " queue_type=%x logical_address_of_page=%lx count=%lx", + adapter_handle.handle, pfmr, mr_handle->handle, pagesize, + queue_type, logical_address_of_page, count); + + if ((count > 1) && (logical_address_of_page & 0xfff)) { + EDEB_ERR(4, "logical_address_of_page not on a 4k boundary " + "adapter_handle=%lx pfmr=%p mr_handle=%lx " + "pagesize=%x queue_type=%x logical_address_of_page=%lx" + " count=%lx", + adapter_handle.handle, pfmr, mr_handle->handle, + pagesize, queue_type, logical_address_of_page, count); + rc = H_PARAMETER; + } else { + rc = hipz_h_register_rpage(adapter_handle, pagesize, + queue_type, mr_handle->handle, + logical_address_of_page, count); + } 
+#endif /* EHCA_USE_HCALL */ + EDEB_EX(7, "rc=%lx", rc); + + return rc; +} + +u64 hipz_h_query_mr(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmr *pfmr, + const struct ipz_mrmw_handle *mr_handle, + u64 * mr_local_length, + u64 * mr_local_vaddr, + u64 * mr_remote_length, + u64 * mr_remote_vaddr, + u32 * access_ctrl, + struct ipz_pd *pd, + u32 * lkey, + u32 * rkey) +{ + u64 rc = H_SUCCESS; + u64 dummy; + u64 acc_ctrl_pd_out; + u64 r9_out; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "adapter_handle=%lx pfmr=%p mr_handle.mrwpte=%p" + " mr_handle.page_index=%x", + adapter_handle.handle, pfmr, mr_handle->mrwpte, + mr_handle->page_index); + + rc = simp_hcz_h_query_mr(adapter_handle, + pfmr, + mr_handle, + mr_local_length, + mr_local_vaddr, + mr_remote_length, + mr_remote_vaddr, access_ctrl, pd, lkey, rkey); + + EDEB_EX(7, "rc=%lx mr_local_length=%lx mr_local_vaddr=%lx" + " mr_remote_length=%lx mr_remote_vaddr=%lx access_ctrl=%x" + " pd=%x lkey=%x rkey=%x", + rc, *mr_local_length, *mr_local_vaddr, *mr_remote_length, + *mr_remote_vaddr, *access_ctrl, pd->value, *lkey, *rkey); +#else + EDEB_EN(7, "adapter_handle=%lx pfmr=%p mr_handle=%lx", + adapter_handle.handle, pfmr, mr_handle->handle); + + + rc = ehca_hcall_7arg_7ret(H_QUERY_MR, + adapter_handle.handle, /* r4 */ + mr_handle->handle, /* r5 */ + 0, 0, 0, 0, 0, + mr_local_length, /* r4 */ + mr_local_vaddr, /* r5 */ + mr_remote_length, /* r6 */ + mr_remote_vaddr, /* r7 */ + &acc_ctrl_pd_out, /* r8 */ + &r9_out, + &dummy); + + *access_ctrl = acc_ctrl_pd_out >> 32; + pd->value = (u32) acc_ctrl_pd_out; + *lkey = (u32) (r9_out >> 32); + *rkey = (u32) (r9_out & (0xffffffff)); + + EDEB_EX(7, "rc=%lx mr_local_length=%lx mr_local_vaddr=%lx" + " mr_remote_length=%lx mr_remote_vaddr=%lx access_ctrl=%x" + " pd=%x lkey=%x rkey=%x", + rc, *mr_local_length, *mr_local_vaddr, *mr_remote_length, + *mr_remote_vaddr, *access_ctrl, pd->value, *lkey, *rkey); +#endif /* EHCA_USE_HCALL */ + + return rc; +} + +u64 
hipz_h_free_resource_mr(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmr *pfmr, + const struct ipz_mrmw_handle *mr_handle) +{ + u64 rc = H_SUCCESS; + u64 dummy; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "adapter_handle=%lx pfmr=%p mr_handle.mrwpte=%p" + " mr_handle.page_index=%x", + adapter_handle.handle, pfmr, mr_handle->mrwpte, + mr_handle->page_index); + + rc = simp_hcz_h_free_resource_mr(adapter_handle, pfmr, mr_handle); +#else + EDEB_EN(7, "adapter_handle=%lx pfmr=%p mr_handle=%lx", + adapter_handle.handle, pfmr, mr_handle->handle); + + rc = ehca_hcall_7arg_7ret(H_FREE_RESOURCE, + adapter_handle.handle, /* r4 */ + mr_handle->handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif /* EHCA_USE_HCALL */ + EDEB_EX(7, "rc=%lx", rc); + + return rc; +} + +u64 hipz_h_reregister_pmr(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmr *pfmr, + struct ehca_pfshca *pfshca, + const struct ipz_mrmw_handle *mr_handle, + const u64 vaddr_in, + const u64 length, + const u32 access_ctrl, + const struct ipz_pd pd, + const u64 mr_addr_cb, + u64 * vaddr_out, + u32 * lkey, + u32 * rkey) +{ + u64 rc = H_SUCCESS; + u64 dummy; + u64 lkey_out; + u64 rkey_out; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "adapter_handle=%lx pfmr=%p pfshca=%p" + " mr_handle.mrwpte=%p mr_handle.page_index=%x vaddr_in=%lx" + " length=%lx access_ctrl=%x pd=%x mr_addr_cb=%lx", + adapter_handle.handle, pfmr, pfshca, mr_handle->mrwpte, + mr_handle->page_index, vaddr_in, length, access_ctrl, + pd.value, mr_addr_cb); + + rc = simp_hcz_h_reregister_pmr(adapter_handle, pfmr, pfshca, + mr_handle, vaddr_in, length, access_ctrl, + pd, mr_addr_cb, vaddr_out, lkey, rkey); +#else + EDEB_EN(7, "adapter_handle=%lx pfmr=%p pfshca=%p mr_handle=%lx " + "vaddr_in=%lx length=%lx access_ctrl=%x pd=%x mr_addr_cb=%lx", + adapter_handle.handle, pfmr, pfshca, mr_handle->handle, + vaddr_in, length, access_ctrl, pd.value, mr_addr_cb); + + rc = 
ehca_hcall_7arg_7ret(H_REREGISTER_PMR, + adapter_handle.handle, /* r4 */ + mr_handle->handle, /* r5 */ + vaddr_in, /* r6 */ + length, /* r7 */ + /* r8 */ + ((((u64) access_ctrl) << 32ULL) | pd.value), + mr_addr_cb, /* r9 */ + 0, + &dummy, /* r4 */ + vaddr_out, /* r5 */ + &lkey_out, /* r6 */ + &rkey_out, /* r7 */ + &dummy, + &dummy, + &dummy); + + *lkey = (u32) lkey_out; + *rkey = (u32) rkey_out; +#endif /* EHCA_USE_HCALL */ + + EDEB_EX(7, "rc=%lx vaddr_out=%lx lkey=%x rkey=%x", + rc, *vaddr_out, *lkey, *rkey); + return rc; +} + +u64 hipz_h_register_smr(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmr *pfmr, + struct ehca_pfmr *orig_pfmr, + struct ehca_pfshca *pfshca, + const struct ipz_mrmw_handle *orig_mr_handle, + const u64 vaddr_in, + const u32 access_ctrl, + const struct ipz_pd pd, + struct ipz_mrmw_handle *mr_handle, + u32 * lkey, + u32 * rkey) +{ + u64 rc = H_SUCCESS; + u64 dummy; + u64 lkey_out; + u64 rkey_out; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "adapter_handle=%lx pfmr=%p orig_pfmr=%p pfshca=%p" + " orig_mr_handle.mrwpte=%p orig_mr_handle.page_index=%x" + " vaddr_in=%lx access_ctrl=%x pd=%x", + adapter_handle.handle, pfmr, orig_pfmr, pfshca, + orig_mr_handle->mrwpte, orig_mr_handle->page_index, + vaddr_in, access_ctrl, pd.value); + + rc = simp_hcz_h_register_smr(adapter_handle, pfmr, orig_pfmr, + pfshca, orig_mr_handle, vaddr_in, + access_ctrl, pd, + (struct hcz_mrmw_handle *)mr_handle, lkey, + rkey); + EDEB_EX(7, "rc=%lx mr_handle.mrwpte=%p mr_handle.page_index=%x" + " lkey=%x rkey=%x", + rc, mr_handle->mrwpte, mr_handle->page_index, *lkey, *rkey); +#else + EDEB_EN(7, "adapter_handle=%lx orig_pfmr=%p pfshca=%p" + " orig_mr_handle=%lx vaddr_in=%lx access_ctrl=%x pd=%x", + adapter_handle.handle, orig_pfmr, pfshca, + orig_mr_handle->handle, vaddr_in, access_ctrl, pd.value); + + + rc = ehca_hcall_7arg_7ret(H_REGISTER_SMR, + adapter_handle.handle, /* r4 */ + orig_mr_handle->handle, /* r5 */ + vaddr_in, /* r6 */ + ((((u64) access_ctrl) << 
32ULL)), /* r7 */ + pd.value, /* r8 */ + 0, 0, + &mr_handle->handle, /* r4 */ + &dummy, /* r5 */ + &lkey_out, /* r6 */ + &rkey_out, /* r7 */ + &dummy, + &dummy, + &dummy); + *lkey = (u32) lkey_out; + *rkey = (u32) rkey_out; + + EDEB_EX(7, "rc=%lx mr_handle=%lx lkey=%x rkey=%x", + rc, mr_handle->handle, *lkey, *rkey); +#endif /* EHCA_USE_HCALL */ + + return rc; +} + +u64 hipz_h_alloc_resource_mw(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmw *pfmw, + struct ehca_pfshca *pfshca, + const struct ipz_pd pd, + struct ipz_mrmw_handle *mw_handle, + u32 * rkey) +{ + u64 rc = H_SUCCESS; + u64 dummy; + u64 rkey_out; + + EDEB_EN(7, "adapter_handle=%lx pfmw=%p pd=%x pfshca=%p", + adapter_handle.handle, pfmw, pd.value, pfshca); + +#ifndef EHCA_USE_HCALL + + rc = simp_hcz_h_alloc_resource_mw(adapter_handle, pfmw, pfshca, pd, + (struct hcz_mrmw_handle *)mw_handle, + rkey); + EDEB_EX(7, "rc=%lx mw_handle.mrwpte=%p mw_handle.page_index=%x rkey=%x", + rc, mw_handle->mrwpte, mw_handle->page_index, *rkey); +#else + rc = ehca_hcall_7arg_7ret(H_ALLOC_RESOURCE, + adapter_handle.handle, /* r4 */ + 6, /* r5 */ + pd.value, /* r6 */ + 0, 0, 0, 0, + &mw_handle->handle, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &rkey_out, /* r7 */ + &dummy, + &dummy, + &dummy); + + *rkey = (u32) rkey_out; + + EDEB_EX(7, "rc=%lx mw_handle=%lx rkey=%x", + rc, mw_handle->handle, *rkey); +#endif /* EHCA_USE_HCALL */ + return rc; +} + +u64 hipz_h_query_mw(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmw *pfmw, + const struct ipz_mrmw_handle *mw_handle, + u32 * rkey, + struct ipz_pd *pd) +{ + u64 rc = H_SUCCESS; + u64 dummy; + u64 pd_out; + u64 rkey_out; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "adapter_handle=%lx pfmw=%p mw_handle.mrwpte=%p" + " mw_handle.page_index=%x", + adapter_handle.handle, pfmw, mw_handle->mrwpte, + mw_handle->page_index); + + rc = simp_hcz_h_query_mw(adapter_handle, pfmw, mw_handle, rkey, pd); + + EDEB_EX(7, "rc=%lx rkey=%x pd=%x", rc, *rkey, 
pd->value); +#else + EDEB_EN(7, "adapter_handle=%lx pfmw=%p mw_handle=%lx", + adapter_handle.handle, pfmw, mw_handle->handle); + + rc = ehca_hcall_7arg_7ret(H_QUERY_MW, + adapter_handle.handle, /* r4 */ + mw_handle->handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &rkey_out, /* r7 */ + &pd_out, /* r8 */ + &dummy, + &dummy); + *rkey = (u32) rkey_out; + pd->value = (u32) pd_out; + + EDEB_EX(7, "rc=%lx rkey=%x pd=%x", rc, *rkey, pd->value); +#endif /* EHCA_USE_HCALL */ + + return rc; +} + +u64 hipz_h_free_resource_mw(const struct ipz_adapter_handle adapter_handle, + struct ehca_pfmw *pfmw, + const struct ipz_mrmw_handle *mw_handle) +{ + u64 rc = H_SUCCESS; + u64 dummy; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "adapter_handle=%lx pfmw=%p mw_handle.mrwpte=%p" + " mw_handle.page_index=%x", + adapter_handle.handle, pfmw, mw_handle->mrwpte, + mw_handle->page_index); + + rc = simp_hcz_h_free_resource_mw(adapter_handle, pfmw, mw_handle); +#else + EDEB_EN(7, "adapter_handle=%lx pfmw=%p mw_handle=%lx", + adapter_handle.handle, pfmw, mw_handle->handle); + + rc = ehca_hcall_7arg_7ret(H_FREE_RESOURCE, + adapter_handle.handle, /* r4 */ + mw_handle->handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif /* EHCA_USE_HCALL */ + EDEB_EX(7, "rc=%lx", rc); + + return rc; +} + +u64 hipz_h_error_data(const struct ipz_adapter_handle adapter_handle, + const u64 ressource_handle, + void *rblock, + unsigned long *byte_count) +{ + u64 rc = H_SUCCESS; + u64 dummy; + u64 r_cb; + + EDEB_EN(7, "adapter_handle=%lx ressource_handle=%lx rblock=%p", + adapter_handle.handle, ressource_handle, rblock); + + if ((((u64)rblock) & 0xfff) != 0) { + EDEB_ERR(4, "rblock not page aligned."); + rc = H_PARAMETER; + return rc; + } + + r_cb = virt_to_abs(rblock); + + rc = ehca_hcall_7arg_7ret(H_ERROR_DATA, + adapter_handle.handle, + ressource_handle, + r_cb, + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + 
&dummy, + &dummy, + &dummy); + + EDEB_EX(7, "rc=%lx", rc); + + return rc; +} From schihei at de.ibm.com Thu Apr 27 03:49:44 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 27 Apr 2006 12:49:44 +0200 Subject: [openib-general] [PATCH 14/16] ehca: hardware interface Message-ID: <4450A1C8.7090407@de.ibm.com> Signed-off-by: Heiko J Schick hipz_fns.h | 73 ++++++++++ hipz_fns_core.h | 126 +++++++++++++++++ hipz_hw.h | 398 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 597 insertions(+) --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/hipz_fns.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/hipz_fns.h 2006-04-25 10:31:46.000000000 +0200 @@ -0,0 +1,73 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * HW abstraction register functions + * + * Authors: Christoph Raisch + * Reinhard Ernst + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: hipz_fns.h,v 1.8 2006/04/25 08:31:46 schickhj Exp $ + */ + +#ifndef __HIPZ_FNS_H__ +#define __HIPZ_FNS_H__ + +#include "ehca_classes.h" +#include "hipz_hw.h" +#ifndef EHCA_USE_HCALL +#include "sim_gal.h" +#endif + +#include "hipz_fns_core.h" + +#define hipz_galpa_store_eq(gal, offset, value) \ + hipz_galpa_store(gal, EQTEMM_OFFSET(offset), value) + +#define hipz_galpa_load_eq(gal, offset) \ + hipz_galpa_load(gal, EQTEMM_OFFSET(offset)) + +#define hipz_galpa_store_qped(gal, offset, value) \ + hipz_galpa_store(gal, QPEDMM_OFFSET(offset), value) + +#define hipz_galpa_load_qped(gal, offset) \ + hipz_galpa_load(gal, QPEDMM_OFFSET(offset)) + +#define hipz_galpa_store_mrmw(gal, offset, value) \ + hipz_galpa_store(gal, MRMWMM_OFFSET(offset), value) + +#define hipz_galpa_load_mrmw(gal, offset) \ + hipz_galpa_load(gal, MRMWMM_OFFSET(offset)) + +#endif --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/hipz_fns_core.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/hipz_fns_core.h 2006-03-31 13:43:52.000000000 +0200 @@ -0,0 +1,126 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * HW abstraction register functions + * + * Authors: Christoph Raisch + * Heiko J Schick + * Hoang-Nam Nguyen + * Reinhard Ernst + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. 
+ * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: hipz_fns_core.h,v 1.6 2006/03/31 11:43:52 nguyen Exp $ + */ + +#ifndef __HIPZ_FNS_CORE_H__ +#define __HIPZ_FNS_CORE_H__ + +#include "hcp_phyp.h" +#include "hipz_hw.h" + +#define hipz_galpa_store_cq(gal,offset,value)\ + hipz_galpa_store(gal,CQTEMM_OFFSET(offset),value) +#define hipz_galpa_load_cq(gal,offset)\ + hipz_galpa_load(gal,CQTEMM_OFFSET(offset)) + +#define hipz_galpa_store_qp(gal,offset,value)\ + hipz_galpa_store(gal,QPTEMM_OFFSET(offset),value) +#define hipz_galpa_load_qp(gal,offset)\ + hipz_galpa_load(gal,QPTEMM_OFFSET(offset)) + +inline static void hipz_update_sqa(struct ehca_qp *qp, u16 nr_wqes) +{ + struct h_galpa gal; + + EDEB_EN(7, "qp=%p", qp); + gal = qp->galpas.kernel; + /* ringing doorbell :-) */ + hipz_galpa_store_qp(gal, qpx_sqa, EHCA_BMASK_SET(QPX_SQADDER, nr_wqes)); + EDEB_EX(7, "qp=%p QPx_SQA = %i", qp, nr_wqes); +} + +inline static void hipz_update_rqa(struct ehca_qp *qp, u16 nr_wqes) +{ + struct h_galpa gal; + + EDEB_EN(7, "qp=%p", qp); + gal = qp->galpas.kernel; + /* ringing doorbell :-) */ + hipz_galpa_store_qp(gal, qpx_rqa, EHCA_BMASK_SET(QPX_RQADDER, nr_wqes)); + EDEB_EX(7, "qp=%p QPx_RQA = %i", qp, nr_wqes); +} + +inline static void hipz_update_feca(struct ehca_cq *cq, u32 nr_cqes) +{ + struct h_galpa gal; + + EDEB_EN(7, "cq=%p", cq); + gal = cq->galpas.kernel; + hipz_galpa_store_cq(gal, cqx_feca, + EHCA_BMASK_SET(CQX_FECADDER, nr_cqes)); + EDEB_EX(7, "cq=%p CQx_FECA = %i", cq, nr_cqes); +} + +inline static void hipz_set_cqx_n0(struct ehca_cq *cq, u32 value) +{ + struct h_galpa gal; + u64 CQx_N0_reg = 0; + + EDEB_EN(7, "cq=%p event on solicited completion -- write CQx_N0", + cq); + gal = cq->galpas.kernel; + hipz_galpa_store_cq(gal, cqx_n0, + EHCA_BMASK_SET(CQX_N0_GENERATE_SOLICITED_COMP_EVENT, + value)); + CQx_N0_reg = hipz_galpa_load_cq(gal, cqx_n0); + EDEB_EX(7, "cq=%p loaded CQx_N0=%lx", cq, + (unsigned long)CQx_N0_reg); +} + +inline static void hipz_set_cqx_n1(struct ehca_cq *cq, u32 value) +{ + struct h_galpa gal; + 
u64 CQx_N1_reg = 0; + + EDEB_EN(7, "cq=%p event on completion -- write CQx_N1", + cq); + gal = cq->galpas.kernel; + hipz_galpa_store_cq(gal, cqx_n1, + EHCA_BMASK_SET(CQX_N1_GENERATE_COMP_EVENT, value)); + CQx_N1_reg = hipz_galpa_load_cq(gal, cqx_n1); + EDEB_EX(7, "cq=%p loaded CQx_N1=%lx", cq, + (unsigned long)CQx_N1_reg); +} + +#endif /* __HIPZ_FNC_CORE_H__ */ --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/hipz_hw.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/hipz_hw.h 2006-03-13 14:07:20.000000000 +0100 @@ -0,0 +1,398 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * eHCA register definitions + * + * Authors: Waleri Fomin + * Christoph Raisch + * Reinhard Ernst + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: hipz_hw.h,v 1.8 2006/03/13 13:07:20 fomin Exp $ + */ + +#ifndef __HIPZ_HW_H__ +#define __HIPZ_HW_H__ + +#include "ehca_tools.h" +#include "ehca_kernel.h" + +/** QP Table Entry Memory Map + */ +struct hipz_qptemm { + u64 qpx_hcr; + u64 qpx_c; + u64 qpx_herr; + u64 qpx_aer; +/* 0x20*/ + u64 qpx_sqa; + u64 qpx_sqc; + u64 qpx_rqa; + u64 qpx_rqc; +/* 0x40*/ + u64 qpx_st; + u64 qpx_pmstate; + u64 qpx_pmfa; + u64 qpx_pkey; +/* 0x60*/ + u64 qpx_pkeya; + u64 qpx_pkeyb; + u64 qpx_pkeyc; + u64 qpx_pkeyd; +/* 0x80*/ + u64 qpx_qkey; + u64 qpx_dqp; + u64 qpx_dlidp; + u64 qpx_portp; +/* 0xa0*/ + u64 qpx_slidp; + u64 qpx_slidpp; + u64 qpx_dlida; + u64 qpx_porta; +/* 0xc0*/ + u64 qpx_slida; + u64 qpx_slidpa; + u64 qpx_slvl; + u64 qpx_ipd; +/* 0xe0*/ + u64 qpx_mtu; + u64 qpx_lato; + u64 qpx_rlimit; + u64 qpx_rnrlimit; +/* 0x100*/ + u64 qpx_t; + u64 qpx_sqhp; + u64 qpx_sqptp; + u64 qpx_nspsn; +/* 0x120*/ + u64 qpx_nspsnhwm; + u64 reserved1; + u64 qpx_sdsi; + u64 qpx_sdsbc; +/* 0x140*/ + u64 qpx_sqwsize; + u64 qpx_sqwts; + u64 qpx_lsn; + u64 qpx_nssn; +/* 0x160 */ + u64 qpx_mor; + u64 qpx_cor; + u64 qpx_sqsize; + u64 qpx_erc; +/* 0x180*/ + u64 qpx_rnrrc; + u64 qpx_ernrwt; + u64 qpx_rnrresp; + u64 qpx_lmsna; +/* 0x1a0 */ + u64 qpx_sqhpc; + u64 qpx_sqcptp; + u64 qpx_sigt; + u64 qpx_wqecnt; +/* 0x1c0*/ + + u64 qpx_rqhp; + u64 qpx_rqptp; + u64 qpx_rqsize; + u64 qpx_nrr; +/* 0x1e0*/ + u64 qpx_rdmac; + u64 qpx_nrpsn; + u64 qpx_lapsn; + u64 
qpx_lcr; +/* 0x200*/ + u64 qpx_rwc; + u64 qpx_rwva; + u64 qpx_rdsi; + u64 qpx_rdsbc; +/* 0x220*/ + u64 qpx_rqwsize; + u64 qpx_crmsn; + u64 qpx_rdd; + u64 qpx_larpsn; +/* 0x240*/ + u64 qpx_pd; + u64 qpx_scqn; + u64 qpx_rcqn; + u64 qpx_aeqn; +/* 0x260*/ + u64 qpx_aaelog; + u64 qpx_ram; + u64 qpx_rdmaqe0; + u64 qpx_rdmaqe1; +/* 0x280*/ + u64 qpx_rdmaqe2; + u64 qpx_rdmaqe3; + u64 qpx_nrpsnhwm; +/* 0x298*/ + u64 reserved[(0x400 - 0x298) / 8]; +/* 0x400 extended data */ + u64 reserved_ext[(0x500 - 0x400) / 8]; +/* 0x500 */ + u64 reserved2[(0x1000 - 0x500) / 8]; +/* 0x1000 */ +}; + +#define QPX_SQADDER EHCA_BMASK_IBM(48,63) +#define QPX_RQADDER EHCA_BMASK_IBM(48,63) + +#define QPTEMM_OFFSET(x) offsetof(struct hipz_qptemm,x) + +/** MRMWPT Entry Memory Map + */ +struct hipz_mrmwmm { + /* 0x00 */ + u64 mrx_hcr; + + u64 mrx_c; + u64 mrx_herr; + u64 mrx_aer; + /* 0x20 */ + u64 mrx_pp; + u64 reserved1; + u64 reserved2; + u64 reserved3; + /* 0x40 */ + u64 reserved4[(0x200 - 0x40) / 8]; + /* 0x200 */ + u64 mrx_ctl[64]; + +}; + +#define MRX_HCR_LPARID_VALID EHCA_BMASK_IBM(0,0) + +#define MRMWMM_OFFSET(x) offsetof(struct hipz_mrmwmm,x) + +struct hipz_qpedmm { + /* 0x00 */ + u64 reserved0[(0x400) / 8]; + /* 0x400 */ + u64 qpedx_phh; + u64 qpedx_ppsgp; + /* 0x410 */ + u64 qpedx_ppsgu; + u64 qpedx_ppdgp; + /* 0x420 */ + u64 qpedx_ppdgu; + u64 qpedx_aph; + /* 0x430 */ + u64 qpedx_apsgp; + u64 qpedx_apsgu; + /* 0x440 */ + u64 qpedx_apdgp; + u64 qpedx_apdgu; + /* 0x450 */ + u64 qpedx_apav; + u64 qpedx_apsav; + /* 0x460 */ + u64 qpedx_hcr; + u64 reserved1[4]; + /* 0x488 */ + u64 qpedx_rrl0; + /* 0x490 */ + u64 qpedx_rrrkey0; + u64 qpedx_rrva0; + /* 0x4a0 */ + u64 reserved2; + u64 qpedx_rrl1; + /* 0x4b0 */ + u64 qpedx_rrrkey1; + u64 qpedx_rrva1; + /* 0x4c0 */ + u64 reserved3; + u64 qpedx_rrl2; + /* 0x4d0 */ + u64 qpedx_rrrkey2; + u64 qpedx_rrva2; + /* 0x4e0 */ + u64 reserved4; + u64 qpedx_rrl3; + /* 0x4f0 */ + u64 qpedx_rrrkey3; + u64 qpedx_rrva3; +}; + +#define QPEDMM_OFFSET(x) 
offsetof(struct hipz_qpedmm,x) + +/** CQ Table Entry Memory Map + */ +struct hipz_cqtemm { + u64 cqx_hcr; + u64 cqx_c; + u64 cqx_herr; + u64 cqx_aer; +/* 0x20 */ + u64 cqx_ptp; + u64 cqx_tp; + u64 cqx_fec; + u64 cqx_feca; +/* 0x40 */ + u64 cqx_ep; + u64 cqx_eq; +/* 0x50 */ + u64 reserved1; + u64 cqx_n0; +/* 0x60 */ + u64 cqx_n1; + u64 reserved2[(0x1000 - 0x60) / 8]; +/* 0x1000 */ +}; + +#define CQX_FEC_CQE_CNT EHCA_BMASK_IBM(32,63) +#define CQX_FECADDER EHCA_BMASK_IBM(32,63) +#define CQX_N0_GENERATE_SOLICITED_COMP_EVENT EHCA_BMASK_IBM(0,0) +#define CQX_N1_GENERATE_COMP_EVENT EHCA_BMASK_IBM(0,0) + +#define CQTEMM_OFFSET(x) offsetof(struct hipz_cqtemm,x) + +/** EQ Table Entry Memory Map + */ +struct hipz_eqtemm { + u64 eqx_hcr; + u64 eqx_c; + + u64 eqx_herr; + u64 eqx_aer; +/* 0x20 */ + u64 eqx_ptp; + u64 eqx_tp; + u64 eqx_ssba; + u64 eqx_psba; + +/* 0x40 */ + u64 eqx_cec; + u64 eqx_meql; + u64 eqx_xisbi; + u64 eqx_xisc; +/* 0x60 */ + u64 eqx_it; + +}; + +#define EQTEMM_OFFSET(x) offsetof(struct hipz_eqtemm,x) + +/* access control defines for MR/MW */ +#define HIPZ_ACCESSCTRL_L_WRITE 0x00800000 +#define HIPZ_ACCESSCTRL_R_WRITE 0x00400000 +#define HIPZ_ACCESSCTRL_R_READ 0x00200000 +#define HIPZ_ACCESSCTRL_R_ATOMIC 0x00100000 +#define HIPZ_ACCESSCTRL_MW_BIND 0x00080000 + +/* query hca response block */ +struct hipz_query_hca { + u32 cur_reliable_dg; + u32 cur_qp; + u32 cur_cq; + u32 cur_eq; + u32 cur_mr; + u32 cur_mw; + u32 cur_ee_context; + u32 cur_mcast_grp; + u32 cur_qp_attached_mcast_grp; + u32 reserved1; + u32 cur_ipv6_qp; + u32 cur_eth_qp; + u32 cur_hp_mr; + u32 reserved2[3]; + u32 max_rd_domain; + u32 max_qp; + u32 max_cq; + u32 max_eq; + u32 max_mr; + u32 max_hp_mr; + u32 max_mw; + u32 max_mrwpte; + u32 max_special_mrwpte; + u32 max_rd_ee_context; + u32 max_mcast_grp; + u32 max_qps_attached_all_mcast_grp; + u32 max_qps_attached_mcast_grp; + u32 max_raw_ipv6_qp; + u32 max_raw_ethy_qp; + u32 internal_clock_frequency; + u32 max_pd; + u32 max_ah; + u32 max_cqe; + 
u32 max_wqes_wq; + u32 max_partitions; + u32 max_rr_ee_context; + u32 max_rr_qp; + u32 max_rr_hca; + u32 max_act_wqs_ee_context; + u32 max_act_wqs_qp; + u32 max_sge; + u32 max_sge_rd; + u32 memory_page_size_supported; + u64 max_mr_size; + u32 local_ca_ack_delay; + u32 num_ports; + u32 vendor_id; + u32 vendor_part_id; + u32 hw_ver; + u64 node_guid; + u64 hca_cap_indicators; + u32 data_counter_register_size; + u32 max_shared_rq; + u32 max_isns_eq; + u32 max_neq; +} __attribute__ ((packed)); + +/* query port response block */ +struct hipz_query_port { + u32 state; + u32 bad_pkey_cntr; + u32 lmc; + u32 lid; + u32 subnet_timeout; + u32 qkey_viol_cntr; + u32 sm_sl; + u32 sm_lid; + u32 capability_mask; + u32 init_type_reply; + u32 pkey_tbl_len; + u32 gid_tbl_len; + u64 gid_prefix; + u32 port_nr; + u16 pkey_entries[16]; + u8 reserved1[32]; + u32 trent_size; + u32 trbuf_size; + u64 max_msg_sz; + u32 max_mtu; + u32 vl_cap; + u8 reserved2[1900]; + u64 guid_entries[255]; +} __attribute__ ((packed)); + +#endif From schihei at de.ibm.com Thu Apr 27 03:49:50 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 27 Apr 2006 12:49:50 +0200 Subject: [openib-general] [PATCH 15/16] ehca: queue page table handling Message-ID: <4450A1CE.80503@de.ibm.com> Signed-off-by: Heiko J Schick ipz_pt_fn.c | 184 ++++++++++++++++++++++++++++++++++++++++++ ipz_pt_fn.h | 258 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 442 insertions(+) --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ipz_pt_fn.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ipz_pt_fn.h 2006-03-27 00:48:21.000000000 +0200 @@ -0,0 +1,258 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * internal queue handling + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. 
+ * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ipz_pt_fn.h,v 1.6 2006/03/26 22:48:21 nguyen Exp $ + */ + +#ifndef __IPZ_PT_FN_H__ +#define __IPZ_PT_FN_H__ + +#include "ehca_qes.h" +#define EHCA_PAGESHIFT 12 +#define EHCA_PAGESIZE 4096UL +#define EHCA_PT_ENTRIES 512UL + +#include "ehca_tools.h" +#include "ehca_qes.h" + +/* struct generic ehca page + */ +struct ipz_page { + u8 entries[EHCA_PAGESIZE]; +}; + +/* struct generic queue in linux kernel virtual memory (kv) + */ +struct ipz_queue { + u64 current_q_offset; /* current queue entry */ + + struct ipz_page **queue_pages; /* array of pages belonging to queue */ + u32 qe_size; /* queue entry size */ + u32 act_nr_of_sg; + u32 queue_length; /* queue length allocated in bytes */ + u32 pagesize; + u32 toggle_state; /* toggle flag - per page */ + u32 dummy3; /* 64 bit alignment */ +}; + +/* return current Queue Entry for a certain q_offset + * returns address (kv) of Queue Entry + */ +static inline void *ipz_qeit_calc(struct ipz_queue *queue, u64 q_offset) +{ + struct ipz_page *current_page = NULL; + if (q_offset >= queue->queue_length) { + return NULL; + } + current_page = (queue->queue_pages)[q_offset >> EHCA_PAGESHIFT]; + return &current_page->entries[q_offset & (EHCA_PAGESIZE - 1)]; +} + +/* return current Queue Entry + * returns address (kv) of Queue Entry + */ +static inline void *ipz_qeit_get(struct ipz_queue *queue) +{ + return ipz_qeit_calc(queue, queue->current_q_offset); +} + +/* return current Queue Page , increment Queue Page iterator from + * page to page in struct ipz_queue, last increment will return 0! 
and + * NOT wrap + * returns address (kv) of Queue Page + * warning don't use in parallel with ipz_QE_get_inc() + */ +void *ipz_qpageit_get_inc(struct ipz_queue *queue); + +/* return current Queue Entry, increment Queue Entry iterator by one + * step in struct ipz_queue, will wrap in ringbuffer + * @returns address (kv) of Queue Entry BEFORE increment + * warning don't use in parallel with ipz_qpageit_get_inc() + * warning unpredictable results may occur if steps>act_nr_of_queue_entries + */ +static inline void *ipz_qeit_get_inc(struct ipz_queue *queue) +{ + void *retvalue = NULL; + + retvalue = ipz_qeit_get(queue); + queue->current_q_offset += queue->qe_size; + if (queue->current_q_offset >= queue->queue_length) { + queue->current_q_offset = 0; + /* toggle the valid flag */ + queue->toggle_state = (~queue->toggle_state) & 1; + } + + EDEB(7, "queue=%p retvalue=%p new current_q_addr=%lx qe_size=%x", + queue, retvalue, queue->current_q_offset, queue->qe_size); + + return (retvalue); +} + +/* return current Queue Entry, increment Queue Entry iterator by one + * step in struct ipz_queue, will wrap in ringbuffer + * returns address (kv) of Queue Entry BEFORE increment + * returns 0 and does not increment, if wrong valid state + * warning don't use in parallel with ipz_qpageit_get_inc() + * warning unpredictable results may occur if steps>act_nr_of_queue_entries + */ +inline static void *ipz_qeit_get_inc_valid(struct ipz_queue *queue) +{ + void *retvalue = ipz_qeit_get(queue); + u32 qe = ((struct ehca_cqe *)retvalue)->cqe_flags; + if ((qe >> 7) == (queue->toggle_state & 1)) { + /* this is a good one */ + ipz_qeit_get_inc(queue); + } else + retvalue = NULL; + return (retvalue); +} + +/* returns and resets Queue Entry iterator + * returns address (kv) of first Queue Entry + */ +static inline void *ipz_qeit_reset(struct ipz_queue *queue) +{ + queue->current_q_offset = 0; + return (ipz_qeit_get(queue)); +} + +/** struct generic page table + */ +struct ipz_pt { + u64 
entries[EHCA_PT_ENTRIES]; +}; + +/* struct page table for a queue, only to be used in pf + */ +struct ipz_qpt { + /* queue page tables (kv), use u64 because we know the element length */ + u64 *qpts; + u32 allocated_qpts_entries; + u32 nr_of_PTEs; /* number of page table entries PTE iterators */ + u64 *current_pte_addr; +}; + +/* constructor for a ipz_queue_t, placement new for ipz_queue_t, + * new for all dependent datastructors + * + * all QP Tables are the same + * flow: + * allocate+pin queue + * see ipz_qpt_ctor() + * returns true if ok, false if out of memory + */ +int ipz_queue_ctor(struct ipz_queue *queue, const u32 nr_of_pages, const u32 pagesize, const u32 qe_size, /* queue entry size */ + const u32 nr_of_sg); + +/* destructor for a ipz_queue_t + * -# free queue + * see ipz_queue_ctor() + * returns true if ok, false if queue was NULL-ptr of free failed + */ +int ipz_queue_dtor(struct ipz_queue *queue); + +/* constructor for a ipz_qpt_t, + * placement new for struct ipz_queue, new for all dependent datastructors + * + * all QP Tables are the same, + * flow: + * -# allocate+pin queue + * -# initialise ptcb + * -# allocate+pin PTs + * -# link PTs to a ring, according to HCA Arch, set bit62 id needed + * -# the ring must have room for exactly nr_of_PTEs + * see ipz_qpt_ctor() + */ +void ipz_qpt_ctor(struct ipz_qpt *qpt, + const u32 nr_of_QEs, + const u32 pagesize, + const u32 qe_size, + const u8 lowbyte, const u8 toggle, + u32 * act_nr_of_QEs, u32 * act_nr_of_pages); + +/* return current Queue Entry, increment Queue Entry iterator by one + * step in struct ipz_queue, will wrap in ringbuffer + * returns address (kv) of Queue Entry BEFORE increment + * warning don't use in parallel with ipz_qpageit_get_inc() + * warning unpredictable results may occur if steps>act_nr_of_queue_entries + * + * fix EQ page problems + */ +void *ipz_qeit_eq_get_inc(struct ipz_queue *queue); + +/* return current Event Queue Entry, increment Queue Entry iterator + * by one step in 
struct ipz_queue if valid, will wrap in ringbuffer
+ * returns address (kv) of Queue Entry BEFORE increment
+ * returns 0 and does not increment, if wrong valid state
+ * warning don't use in parallel with ipz_queue_QPageit_get_inc()
+ * warning unpredictable results may occur if steps>act_nr_of_queue_entries
+ */
+inline static void *ipz_eqit_eq_get_inc_valid(struct ipz_queue *queue)
+{
+	void *retvalue = ipz_qeit_get(queue);
+	u32 qe = *(u8 *) retvalue;
+	EDEB(7, "ipz_QEit_EQ_get_inc_valid qe=%x", qe);
+	if ((qe >> 7) == (queue->toggle_state & 1)) {
+		/* this is a good one */
+		ipz_qeit_eq_get_inc(queue);
+	} else {
+		retvalue = NULL;
+	}
+	return (retvalue);
+}
+
+/*
+ * returns address (GX) of first queue entry
+ */
+inline static u64 ipz_qpt_get_firstpage(struct ipz_qpt *qpt)
+{
+	return (be64_to_cpu(qpt->qpts[0]));
+}
+
+/*
+ * returns address (kv) of first page of queue page table
+ */
+inline static void *ipz_qpt_get_qpt(struct ipz_qpt *qpt)
+{
+	return (qpt->qpts);
+}
+
+#endif /* __IPZ_PT_FN_H__ */
--- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/ipz_pt_fn.c 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/ipz_pt_fn.c 2006-04-12 16:20:47.000000000 +0200
@@ -0,0 +1,184 @@
+/*
+ * IBM eServer eHCA Infiniband device driver for Linux on POWER
+ *
+ * internal queue handling
+ *
+ * Authors: Waleri Fomin
+ *          Reinhard Ernst
+ *          Christoph Raisch
+ *
+ * Copyright (c) 2005 IBM Corporation
+ *
+ * This source code is distributed under a dual license of GPL v2.0 and OpenIB
+ * BSD.
+ *
+ * OpenIB BSD License
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials
+ * provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+ * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * $Id: ipz_pt_fn.c,v 1.7 2006/04/12 14:20:47 nguyen Exp $
+ */
+
+#define DEB_PREFIX "iptz"
+
+#include "ehca_kernel.h"
+#include "ehca_tools.h"
+#include "ipz_pt_fn.h"
+
+extern int ehca_hwlevel;
+
+void *ipz_qpageit_get_inc(struct ipz_queue *queue)
+{
+	void *retvalue = NULL;
+
+	retvalue = ipz_qeit_get(queue);
+	queue->current_q_offset += queue->pagesize;
+	if (queue->current_q_offset > queue->queue_length) {
+		queue->current_q_offset -= queue->pagesize;
+		retvalue = NULL;
+	}
+	if ((((u64) retvalue) % EHCA_PAGESIZE) != 0) {
+		EDEB(4, "ERROR!!
not at PAGE-Boundary");
+		return (NULL);
+	}
+	EDEB(7, "queue=%p retvalue=%p", queue, retvalue);
+	return (retvalue);
+}
+
+void *ipz_qeit_eq_get_inc(struct ipz_queue *queue)
+{
+	void *retvalue = NULL;
+	u64 last_entry_in_q = queue->queue_length - queue->qe_size;
+
+	retvalue = ipz_qeit_get(queue);
+	queue->current_q_offset += queue->qe_size;
+	if (queue->current_q_offset > last_entry_in_q) {
+		queue->current_q_offset = 0;
+		queue->toggle_state = (~queue->toggle_state) & 1;
+	}
+
+	EDEB(7, "queue=%p retvalue=%p new current_q_offset=%lx qe_size=%x",
+	     queue, retvalue, queue->current_q_offset, queue->qe_size);
+
+	return (retvalue);
+}
+
+int ipz_queue_ctor(struct ipz_queue *queue,
+		   const u32 nr_of_pages,
+		   const u32 pagesize, const u32 qe_size, const u32 nr_of_sg)
+{
+	int pages_per_kpage = PAGE_SIZE >> EHCA_PAGESHIFT;
+	int f;
+
+	EDEB_EN(7, "nr_of_pages=%x pagesize=%x qe_size=%x",
+		nr_of_pages, pagesize, qe_size);
+	if (pagesize > PAGE_SIZE) {
+		EDEB_ERR(4, "FATAL ERROR: pagesize=%x is greater than "
+			 "kernel page size", pagesize);
+		return 0;
+	}
+	if (pages_per_kpage == 0) {
+		EDEB_ERR(4, "FATAL ERROR: invalid kernel page size. "
+			 "pages_per_kpage=%x", pages_per_kpage);
+		return 0;
+	}
+	queue->queue_length = nr_of_pages * pagesize;
+	queue->queue_pages = vmalloc(nr_of_pages * sizeof(void *));
+	if (queue->queue_pages == NULL) {
+		EDEB(4, "ERROR!!
didn't get the memory");
+		return 0;
+	}
+	memset(queue->queue_pages, 0, nr_of_pages * sizeof(void *));
+	/* allocate pages for queue:
+	   while loop allocates whole kernel pages
+	   if cond allocates so much mem needed for the rest of queue pages,
+	   which is nr_of_pages % pages_per_kpage
+	 */
+	f = 0;
+	while (f + pages_per_kpage <= nr_of_pages) {
+		u8 *kpage = kzalloc(PAGE_SIZE, GFP_KERNEL); /*@@TODO get_zeroed_page(GFP_KERNEL);*/
+		int k;
+		if (kpage == NULL)
+			goto ipz_queue_ctor_exit0; /*NOMEM*/
+		for (k = 0; k < pages_per_kpage; k++) {
+			(queue->queue_pages)[f] = (struct ipz_page *)kpage;
+			kpage += EHCA_PAGESIZE;
+			f++;
+		}
+	}
+	if (f < nr_of_pages) {
+		u8 *kpage = kzalloc((nr_of_pages - f) * EHCA_PAGESIZE,
+				    GFP_KERNEL);
+		if (kpage == NULL)
+			goto ipz_queue_ctor_exit0; /*NOMEM*/
+		while (f < nr_of_pages) {
+			(queue->queue_pages)[f] = (struct ipz_page *)kpage;
+			kpage += EHCA_PAGESIZE;
+			f++;
+		}
+	}
+
+	queue->current_q_offset = 0;
+	queue->qe_size = qe_size;
+	queue->act_nr_of_sg = nr_of_sg;
+	queue->pagesize = pagesize;
+	queue->toggle_state = 1;
+	EDEB_EX(7, "queue_length=%x queue_pages=%p qe_size=%x"
+		" act_nr_of_sg=%x", queue->queue_length, queue->queue_pages,
+		queue->qe_size, queue->act_nr_of_sg);
+	return 1;
+
+ ipz_queue_ctor_exit0:
+	EDEB_ERR(4, "Couldn't get alloc pages queue=%p f=%x nr_of_pages=%x",
+		 queue, f, nr_of_pages);
+	for (f = 0; f < nr_of_pages; f += pages_per_kpage) {
+		if ((queue->queue_pages)[f] == NULL)
+			break;
+		kfree((queue->queue_pages)[f]);
+	}
+	return 0;
+}
+
+int ipz_queue_dtor(struct ipz_queue *queue)
+{
+	int pages_per_kpage = PAGE_SIZE >> EHCA_PAGESHIFT;
+	int g;
+	int nr_pages;
+
+	EDEB_EN(7, "ipz_queue pointer=%p", queue);
+	if (queue == NULL || queue->queue_pages == NULL) {
+		EDEB_ERR(4, "queue or queue_pages is NULL");
+		return 0;
+	}
+	EDEB(7, "destructing a queue with the following "
+	     "properties:\n nr_of_pages=%x pagesize=%x qe_size=%x",
+	     queue->act_nr_of_sg, queue->pagesize, queue->qe_size);
+	nr_pages =
queue->queue_length / queue->pagesize;
+	for (g = 0; g < nr_pages; g += pages_per_kpage)
+		kfree((queue->queue_pages)[g]);
+	vfree(queue->queue_pages);
+
+	EDEB_EX(7, "queue freed!");
+	return 1;
+}

From schihei at de.ibm.com Thu Apr 27 03:49:56 2006
From: schihei at de.ibm.com (Heiko J Schick)
Date: Thu, 27 Apr 2006 12:49:56 +0200
Subject: [openib-general] [PATCH 16/16] ehca: PHYP abstraction layer
Message-ID: <4450A1D4.3030200@de.ibm.com>

Signed-off-by: Heiko J Schick

 hcp_phyp.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hcp_phyp.h | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 194 insertions(+)

--- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/hcp_phyp.h 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/hcp_phyp.h 2006-03-09 15:05:14.000000000 +0100
@@ -0,0 +1,97 @@
+/*
+ * IBM eServer eHCA Infiniband device driver for Linux on POWER
+ *
+ * Firmware calls
+ *
+ * Authors: Christoph Raisch
+ *          Hoang-Nam Nguyen
+ *          Waleri Fomin
+ *          Gerd Bayer
+ *
+ * Copyright (c) 2005 IBM Corporation
+ *
+ * All rights reserved.
+ *
+ * This source code is distributed under a dual license of GPL v2.0 and OpenIB
+ * BSD.
+ *
+ * OpenIB BSD License
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials
+ * provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+ * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * $Id: hcp_phyp.h,v 1.5 2006/03/09 14:05:14 schickhj Exp $
+ */
+
+#ifndef __HCP_PHYP_H__
+#define __HCP_PHYP_H__
+
+
+/* eHCA page (mapped into memory)
+   resource to access eHCA register pages in CPU address space
+*/
+struct h_galpa {
+	u64 fw_handle;
+	/* for pSeries this is a 64bit memory address where
+	   I/O memory is mapped into CPU address space (kv) */
+};
+
+/*
+   resource to access eHCA address space registers, all types
+*/
+struct h_galpas {
+	u32 pid;		/*PID of userspace galpa checking */
+	struct h_galpa user;	/* user space accessible resource,
+				   set to 0 if unused */
+	struct h_galpa kernel;	/* kernel space accessible resource,
+				   set to 0 if unused */
+};
+
+inline static u64 hipz_galpa_load(struct h_galpa galpa, u32 offset)
+{
+	u64 addr = galpa.fw_handle + offset;
+	u64 out;
+	EDEB_EN(7, "addr=%lx offset=%x ", addr, offset);
+	out = *(u64 *) addr;
+	EDEB_EX(7, "addr=%lx value=%lx", addr, out);
+	return out;
+};
+
+inline static void hipz_galpa_store(struct h_galpa galpa, u32 offset, u64 value)
+{
+	u64 addr = galpa.fw_handle + offset;
+	EDEB(7, "addr=%lx offset=%x value=%lx", addr,
+	     offset, value);
+	*(u64 *) addr = value;
+};
+
+int hcp_galpas_ctor(struct
h_galpas *galpas,
+		    u64 paddr_kernel, u64 paddr_user);
+
+int hcp_galpas_dtor(struct h_galpas *galpas);
+
+int hcall_map_page(u64 physaddr, u64 * mapaddr);
+
+int hcall_unmap_page(u64 mapaddr);
+
+#endif
--- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/hcp_phyp.c 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/hcp_phyp.c 2006-03-17 13:31:35.000000000 +0100
@@ -0,0 +1,97 @@
+/*
+ * IBM eServer eHCA Infiniband device driver for Linux on POWER
+ *
+ * load store abstraction for ehca register access with tracing
+ *
+ * Authors: Christoph Raisch
+ *          Hoang-Nam Nguyen
+ *
+ * Copyright (c) 2005 IBM Corporation
+ *
+ * All rights reserved.
+ *
+ * This source code is distributed under a dual license of GPL v2.0 and OpenIB
+ * BSD.
+ *
+ * OpenIB BSD License
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials
+ * provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+ * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * $Id: hcp_phyp.c,v 1.7 2006/03/17 12:31:35 nguyen Exp $
+ */
+
+#define DEB_PREFIX "PHYP"
+
+#include "ehca_kernel.h"
+#include "ehca_classes.h"
+#include "hipz_hw.h"
+
+int hcall_map_page(u64 physaddr, u64 * mapaddr)
+{
+	*mapaddr = (u64)(ioremap(physaddr, 4096));
+
+	EDEB(7, "ioremap physaddr=%lx mapaddr=%lx", physaddr, *mapaddr);
+	return 0;
+}
+
+int hcall_unmap_page(u64 mapaddr)
+{
+	EDEB(7, "mapaddr=%lx", mapaddr);
+	iounmap((volatile void __iomem*)mapaddr);
+	return 0;
+}
+
+int hcp_galpas_ctor(struct h_galpas *galpas,
+		    u64 paddr_kernel, u64 paddr_user)
+{
+	int rc = 0;
+
+	rc = hcall_map_page(paddr_kernel, &galpas->kernel.fw_handle);
+	if (rc != 0)
+		return (rc);
+
+	galpas->user.fw_handle = paddr_user;
+
+	EDEB(7, "paddr_kernel=%lx paddr_user=%lx galpas->kernel=%lx"
+	     " galpas->user=%lx",
+	     paddr_kernel, paddr_user, galpas->kernel.fw_handle,
+	     galpas->user.fw_handle);
+
+	return (rc);
+}
+
+int hcp_galpas_dtor(struct h_galpas *galpas)
+{
+	int rc = 0;
+
+	if (galpas->kernel.fw_handle != 0)
+		rc = hcall_unmap_page(galpas->kernel.fw_handle);
+
+	if (rc != 0)
+		return (rc);
+
+	galpas->user.fw_handle = galpas->kernel.fw_handle = 0;
+
+	return rc;
+}

From arnd.bergmann at de.ibm.com Thu Apr 27 04:07:36 2006
From: arnd.bergmann at de.ibm.com (Arnd Bergmann)
Date: Thu, 27 Apr 2006 13:07:36 +0200
Subject: [openib-general] Re: [PATCH 01/16] ehca: integration in Linux kernel build system
In-Reply-To:
 <4450B384.4020601@de.ibm.com>
References: <4450B384.4020601@de.ibm.com>
Message-ID: <200604271307.36987.arnd.bergmann@de.ibm.com>

On Thursday 27 April 2006 14:05, Heiko J Schick wrote:
> Signed-off-by: Heiko J Schick
>

Missing any description whatsoever.

>
>  Kconfig  |    6 ++++++
>  Makefile |   29 +++++++++++++++++++++++++++++
>  2 files changed, 35 insertions(+)
>

It would be more practical to put this patch last instead of first
so you don't break the build system with partial applies.

> --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/Kconfig 1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/Kconfig 2006-01-04 16:29:05.000000000 +0100
> @@ -0,0 +1,6 @@
> +config INFINIBAND_EHCA
> +       tristate "eHCA support"
> +       depends on IBMEBUS && INFINIBAND
> +       ---help---
> +       This is a low level device driver for the IBM
> +       GX based Host channel adapters (HCAs)
> \ No newline at end of file
> --- linux-2.6.17-rc2-orig/drivers/infiniband/hw/ehca/Makefile 1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.17-rc2/drivers/infiniband/hw/ehca/Makefile 2006-03-06 12:26:36.000000000 +0100
> @@ -0,0 +1,29 @@
> +#  Authors: Heiko J Schick
> +#           Christoph Raisch
> +#
> +#  Copyright (c) 2005 IBM Corporation
> +#
> +#  All rights reserved.
> +#
> +#  This source code is distributed under a dual license of GPL v2.0 and OpenIB BSD.
> +
> +obj-$(CONFIG_INFINIBAND_EHCA) += hcad_mod.o
> +
> +hcad_mod-objs = ehca_main.o   \
> +               ehca_hca.o    \
> +               ehca_mcast.o  \
> +               ehca_pd.o     \
> +               ehca_av.o     \
> +               ehca_eq.o     \
> +               ehca_cq.o     \
> +               ehca_qp.o     \
> +               ehca_sqp.o    \
> +               ehca_mrmw.o   \
> +               ehca_reqs.o   \
> +               ehca_irq.o    \
> +               ehca_uverbs.o \
> +               hcp_if.o      \
> +               hcp_phyp.o    \
> +               ipz_pt_fn.o
> +
> +CFLAGS += -DEHCA_USE_HCALL -DEHCA_USE_HCALL_KERNEL
>

Do these need to be on the command line? If you always set them
anyways, you can probably get rid of the #ifdef checking for them.
If you want to keep the code for some reason, it might be better
to have a CONFIG_EHCA_USE_HCALL symbol that is set unconditionally
from Kconfig.

	Arnd <><

From arnd.bergmann at de.ibm.com Thu Apr 27 04:19:06 2006
From: arnd.bergmann at de.ibm.com (Arnd Bergmann)
Date: Thu, 27 Apr 2006 13:19:06 +0200
Subject: [openib-general] Re: [PATCH 06/16] ehca: common include files
In-Reply-To: <4450A183.6030405@de.ibm.com>
References: <4450A183.6030405@de.ibm.com>
Message-ID: <200604271319.06844.arnd.bergmann@de.ibm.com>

On Thursday 27 April 2006 12:48, Heiko J Schick wrote:
> +/**
> + * ehca_adr_bad - Handle to be used for adress translation mechanisms,
> + * currently a placeholder.
> + */
> +inline static int ehca_adr_bad(void *adr)

'static inline', not 'inline static', by convention.

> +/* We will remove this lines in SVN when it is included in the Linux kernel.
> + * We don't want to introducte unnecessary dependencies to a patched kernel.
> + */
> +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,17)

Well, you should also remove this for submission, I guess ;-)

> +#define EHCA_EDEB_TRACE_MASK_SIZE 32
> +extern u8 ehca_edeb_mask[EHCA_EDEB_TRACE_MASK_SIZE];
> +#define EDEB_ID_TO_U32(str4) (str4[3] | (str4[2] << 8) | (str4[1] << 16) | \
> +			      (str4[0] << 24))
> +
> +inline static u64 ehca_edeb_filter(const u32 level,
> +				   const u32 id, const u32 line)

'static inline' again. best grep all your source for this, there
are probably more places.

> +{
> +	u64 ret = 0;
> +	u32 filenr = 0;
> +	u32 filter_level = 9;
> +	u32 dynamic_level = 0;
> +
> +	/* This is code written for the gcc -O2 optimizer which should colapse
> +	 * to two single ints filter_level is the first level kicked out by
> +	 * compiler means trace everythin below 6. */
> +	if (id == EDEB_ID_TO_U32("ehav")) {
> +		filenr = 0x01;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("clas")) {
> +		filenr = 0x02;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("cqeq")) {
> +		filenr = 0x03;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("shca")) {
> +		filenr = 0x05;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("eirq")) {
> +		filenr = 0x06;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("lMad")) {
> +		filenr = 0x07;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("mcas")) {
> +		filenr = 0x08;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("mrmw")) {
> +		filenr = 0x09;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("vpd ")) {
> +		filenr = 0x0a;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("e_qp")) {
> +		filenr = 0x0b;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("uqes")) {
> +		filenr = 0x0c;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("PHYP")) {
> +		filenr = 0x0d;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("hcpi")) {
> +		filenr = 0x0e;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("iptz")) {
> +		filenr = 0x0f;
> +
filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("spta")) {
> +		filenr = 0x10;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("simp")) {
> +		filenr = 0x11;
> +		filter_level = 8;
> +	}
> +	if (id == EDEB_ID_TO_U32("reqs")) {
> +		filenr = 0x12;
> +		filter_level = 8;
> +	}

I guess you can convert that to

	switch (id) {
	case EDEB_ID_CLAS: ...
	case EDEB_ID_CQEQ: ...
	}

> +
> +#ifdef EHCA_USE_HCALL_KERNEL
> +#ifdef CONFIG_PPC_PSERIES
> +
> +#include
> +

You could make everything down from here a separate header for hcall.

From penberg at cs.helsinki.fi Thu Apr 27 04:19:49 2006
From: penberg at cs.helsinki.fi (Pekka Enberg)
Date: Thu, 27 Apr 2006 14:19:49 +0300
Subject: [openib-general] Re: possible bug in kmem_cache related code
In-Reply-To:
References:
Message-ID: <84144f020604270419s10696877he2ec27ae6d52e486@mail.gmail.com>

On 4/27/06, Or Gerlitz wrote:
> With 2.6.17-rc3 I'm running into something which seems as a bug related
> to kmem_cache. Doing some allocations/deallocations from a kmem_cache and
> later attempting to destroy it yields the following message and trace

Tested on 2.6.16.7 and works ok. Christoph, could this be related to
the cache draining patches that went in 2.6.17-rc1?

				Pekka

>
> ============================================================================
> slab error in kmem_cache_destroy(): cache `my_cache': Can't free all objects
>
> Call Trace: {kmem_cache_destroy+150}
> {:my_kcache:kcache_cleanup_module+51}
> {sys_delete_module+415} {__up_write+20}
> {sys_munmap+91} {system_call+126}
>
> Failed to destroy cache
> ============================================================================
>
> I was hitting it as an Infiniband/iSCSI user as IB/iSCSI/SCSI code use
> kmem_caches, but since the failure happens on a code which works fine on
> 2.6.16 i have decided to try it with a synthetic module and had this hit...
>
> Below is a sample code that reproduces it, if i only do kmem_cache_create
> and later destroy it does not happen, attached is my .config please note
> that some of the CONFIG_DEBUG_ options are open.
>
> Please CC openib-general at openib.org at least with the resolution of the
> matter since it kind of hard to do testing over 2.6.17-rcX with this
> issue, the tests run fine but some modules are crashing on rmmod so a
> reboot it needed...
>
> thanks,
>
> Or.
>
> This is the related slab info line once the module is loaded
>
> my_cache 256 264 328 12 1 : tunables 32 16 8
> : slabdata 22 22 0 : globalstat 264 264 22 0
>
> --- /deb/null 1970-01-01 02:00:00.000000000 +0200
> +++ kcache/kcache.c 2006-04-27 10:43:18.000000000 +0300
> @@ -0,0 +1,61 @@
> +#include
> +#include
> +
> +kmem_cache_t *cache;
> +
> +struct foo {
> +	char bar[300];
> +};
> +
> +
> +#define TRIES 256
> +
> +struct foo *foo_arr[TRIES];
> +
> +static int __init kcache_init_module(void)
> +{
> +	int i, j;
> +
> +	cache = kmem_cache_create("my_cache",
> +				  sizeof (struct foo),
> +				  0,
> +				  SLAB_HWCACHE_ALIGN,
> +				  NULL,
> +				  NULL);
> +	if (!cache) {
> +		printk(KERN_ERR "couldn't create cache\n");
> +		goto error1;
> +	}
> +
> +	for (i = 0; i < TRIES; i++) {
> +		foo_arr[i] = kmem_cache_alloc(cache, GFP_KERNEL);
> +		if (foo_arr[i] == NULL) {
> +			printk(KERN_ERR "couldn't allocate from cache\n");
> +			goto error2;
> +		}
> +	}
> +
> +	return 0;
> +error2:
> +	for (j = 0; j < i; j++)
> +		kmem_cache_free(cache, foo_arr[j]);
> +error1:
> +	return -ENOMEM;
> +}
> +
> +static void __exit kcache_cleanup_module(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < TRIES; i++)
> +		kmem_cache_free(cache, foo_arr[i]);
> +
> +	if (kmem_cache_destroy(cache)) {
> +		printk(KERN_DEBUG "Failed to destroy cache\n");
> +	}
> +}
> +
> +MODULE_LICENSE("GPL");
> +
> +module_init(kcache_init_module);
> +module_exit(kcache_cleanup_module);
>
>
>
>
>

From eitan at mellanox.co.il Thu Apr 27 04:30:13 2006
From: eitan at mellanox.co.il
 (Eitan Zahavi)
Date: 27 Apr 2006 14:30:13 +0300
Subject: [openib-general] [PATCH] osm: missing space before Makefile.am back-slash
Message-ID: <8664kv2r3u.fsf@mtl066.yok.mtl.com>

Hi Hal

I followed Sasha's patch for matching the new gnu make requirement
for space before a back slash.
I double checked for Makefile.am back slashes that miss a space.
I have found the one below.

Eitan

Signed-off-by: Eitan Zahavi

Index: osm/opensm/Makefile.am
===================================================================
--- osm/opensm/Makefile.am (revision 6695)
+++ osm/opensm/Makefile.am (working copy)
@@ -84,7 +84,7 @@ opensm_SOURCES = main.c osm_console.c os
 	osm_prtn.c osm_prtn_config.c \
 	osm_trap_rcv.c osm_trap_rcv_ctrl.c \
 	osm_ucast_mgr.c osm_ucast_updn.c \
-	osm_vl15intf.c osm_vl_arb_rcv.c\
+	osm_vl15intf.c osm_vl_arb_rcv.ci \
 	osm_vl_arb_rcv_ctrl.c st.c
 opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
 opensm_CXXFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1

From joern at wohnheim.fh-wedel.de Thu Apr 27 04:41:04 2006
From: joern at wohnheim.fh-wedel.de (Jörn Engel)
Date: Thu, 27 Apr 2006 13:41:04 +0200
Subject: [openib-general] Re: [PATCH 05/16] ehca: InfiniBand query and multicast functionality
In-Reply-To: <4450A17D.4030708@de.ibm.com>
References: <4450A17D.4030708@de.ibm.com>
Message-ID: <20060427114104.GA32127@wohnheim.fh-wedel.de>

Some small stuff.

On Thu, 27 April 2006 12:48:29 +0200, Heiko J Schick wrote:
>
> + * This source code is distributed under a dual license of GPL v2.0 and
> OpenIB

Line wrap. You might want to check your mailer or switch to a
different one.

> +	return (-EINVAL);

Remove brackets.

> +	if (H_SUCCESS != hipz_rc) {

To frown upon reversed grammar, I tend. Hard to understand, such code
is. With a decent compiler, there is zero advantage to put the
constant first - assuming you don't ignore warnings.
But it makes the code just as hard to read as the Yoda-speak above.

> +	return retcode;

People tend to use the shorter "ret" or "err".

Jörn

--
You can take my soul, but not my lack of enthusiasm.
-- Wally

From joern at wohnheim.fh-wedel.de Thu Apr 27 04:43:55 2006
From: joern at wohnheim.fh-wedel.de (Jörn Engel)
Date: Thu, 27 Apr 2006 13:43:55 +0200
Subject: [openib-general] Re: [PATCH 04/16] ehca: userspace support
In-Reply-To: <4450A176.9000008@de.ibm.com>
References: <4450A176.9000008@de.ibm.com>
Message-ID: <20060427114355.GB32127@wohnheim.fh-wedel.de>

More minors.

On Thu, 27 April 2006 12:48:22 +0200, Heiko J Schick wrote:
> +
> +	EDEB_EN(7,
> +		"vm_start=%lx vm_end=%lx vm_page_prot=%lx vm_fileoff=%lx "
> +		"address=%lx",
> +		vma->vm_start, vma->vm_end, vma->vm_page_prot, fileoffset,
> +		address);

Gesundheit!

Seriously, I suspect "EDEB_EN" is not the best possible name to pick.

> +	if (cq->ownpid!=cur_pid) {

Coding style would require spaces around binary operators.

Jörn

--
He that composes himself is wiser than he that composes a book.
-- B. Franklin

From joern at wohnheim.fh-wedel.de Thu Apr 27 04:48:28 2006
From: joern at wohnheim.fh-wedel.de (Jörn Engel)
Date: Thu, 27 Apr 2006 13:48:28 +0200
Subject: [openib-general] Re: [PATCH 10/16] ehca: event queue
In-Reply-To: <4450A1AD.7040506@de.ibm.com>
References: <4450A1AD.7040506@de.ibm.com>
Message-ID: <20060427114828.GC32127@wohnheim.fh-wedel.de>

On Thu, 27 April 2006 12:49:17 +0200, Heiko J Schick wrote:
>
> +	ret = hipz_h_alloc_resource_eq(shca->ipz_hca_handle,

Indentation?

> +				       &eq->pf,
> +				       type,
> +				       length,
> +				       &eq->ipz_eq_handle,
> +				       &eq->length,
> +				       &nr_pages, &eq->ist);
> +
> +	if (ret != H_SUCCESS) {

Common convention is to return 0 on success and -ESOMETHING on error.
You might want to follow that and remove H_SUCCESS from the complete
code.

> +	if (!(vpage = ipz_qpageit_get_inc(&eq->ipz_queue))) {

I personally despise assignments in conditionals.
Not sure if this is documented in CodingStyle, but IME most kernel
hackers tend to dislike it as well.

Jörn

--
Don't patch bad code, rewrite it.
-- Kernighan and Pike, according to Rusty

From joern at wohnheim.fh-wedel.de Thu Apr 27 04:57:49 2006
From: joern at wohnheim.fh-wedel.de (Jörn Engel)
Date: Thu, 27 Apr 2006 13:57:49 +0200
Subject: [openib-general] Re: [PATCH 03/16] ehca: structure definitions
In-Reply-To: <4450A16D.7000008@de.ibm.com>
References: <4450A16D.7000008@de.ibm.com>
Message-ID: <20060427115749.GD32127@wohnheim.fh-wedel.de>

On Thu, 27 April 2006 12:48:13 +0200, Heiko J Schick wrote:
> +
> +#ifdef CONFIG_PPC64
> +#include "ehca_classes_pSeries.h"
> +#endif

Is the #ifdef necessary? Such conditions around header includes often
indicate problems with the included header. If that is the case here,
you should fix the header in question in a separate patch.

> +struct ehca_shca {
> +	struct ib_device ib_device;
> +	struct ibmebus_dev *ibmebus_dev;
> +	u8 num_ports;
	^^
> +	int hw_level;

This will cause some wasted space due to alignment. There don't seem
to be other u8 or chars to consolidate with here, though. Still, you
could take another look that your structures have fields on natural
alignment borders and don't grow without you noticing.

> +struct ehca_mr {
> +	union {
> +		struct ib_mr ib_mr;	/* must always be first in ehca_mr */
> +		struct ib_fmr ib_fmr;	/* must always be first in ehca_mr */
> +	} ib;
> +
> +	spinlock_t mrlock;
> +
> +	/* !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> +	 * !!! ehca_mr_deletenew() memsets from flags to end of structure
> +	 * !!! DON'T move flags or insert another field before.
> +	 * !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> */

Yuck! Do you have really good reasons to play such games?

> +struct ehca_pfpd {
> +};
> +
> +struct ehca_pfmr {
> +};
> +
> +struct ehca_pfmw {
> +};

Why these?

Jörn

--
Those who come seeking peace without a treaty are plotting.
-- Sun Tzu

From hch at infradead.org Thu Apr 27 04:59:15 2006
From: hch at infradead.org (Christoph Hellwig)
Date: Thu, 27 Apr 2006 12:59:15 +0100
Subject: [openib-general] Re: [PATCH 03/16] ehca: structure definitions
In-Reply-To: <20060427115749.GD32127@wohnheim.fh-wedel.de>
References: <4450A16D.7000008@de.ibm.com> <20060427115749.GD32127@wohnheim.fh-wedel.de>
Message-ID: <20060427115915.GA15520@infradead.org>

On Thu, Apr 27, 2006 at 01:57:49PM +0200, Jörn Engel wrote:
> On Thu, 27 April 2006 12:48:13 +0200, Heiko J Schick wrote:
> > +
> > +#ifdef CONFIG_PPC64
> > +#include "ehca_classes_pSeries.h"
> > +#endif
>
> Is the #ifdef necessary? Such conditions around header includes often
> indicate problems with the included header. If that is the case here,
> you should fix the header in question in a separate patch.

The real question is what is that ifdef for at all? The code submitted
isn't built on anything but ppc64.

From arnd at arndb.de Thu Apr 27 05:05:36 2006
From: arnd at arndb.de (Arnd Bergmann)
Date: Thu, 27 Apr 2006 14:05:36 +0200
Subject: [openib-general] Re: [PATCH 05/16] ehca: InfiniBand query and multicast functionality
In-Reply-To: <20060427114104.GA32127@wohnheim.fh-wedel.de>
References: <4450A17D.4030708@de.ibm.com> <20060427114104.GA32127@wohnheim.fh-wedel.de>
Message-ID: <200604271405.36588.arnd@arndb.de>

On Thursday 27 April 2006 13:41, Jörn Engel wrote:
> On Thu, 27 April 2006 12:48:29 +0200, Heiko J Schick wrote:
> >
> > + *  This source code is distributed under a dual license of GPL v2.0 and
> > OpenIB
>
> Line wrap.  You might want to check your mailer or switch to a
> different one.
>

Looks correct here.
Maybe you need to check yours ;-)

	Arnd <><

From joern at wohnheim.fh-wedel.de Thu Apr 27 05:09:51 2006
From: joern at wohnheim.fh-wedel.de (Jörn Engel)
Date: Thu, 27 Apr 2006 14:09:51 +0200
Subject: [openib-general] Re: [PATCH 02/16] ehca: module infrastructure
In-Reply-To: <4450A165.4000701@de.ibm.com>
References: <4450A165.4000701@de.ibm.com>
Message-ID: <20060427120951.GE32127@wohnheim.fh-wedel.de>

On Thu, 27 April 2006 12:48:05 +0200, Heiko J Schick wrote:
>
> +	if (ehca_module->cache_pd == NULL) {

Hmm.

> +	ret = kmem_cache_destroy(ehca_module->cache_pd);
> +	if (ret != 0)

The " != 0" is completely superfluous. Above NULL check is a matter
of taste, this one demonstrates lack of boolean algebra understanding.

> +	rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
> +	if (!rblock) {

Hmm. And your taste seems to change. :)

> +	if (ehca_hw_level == 0) {

And since we're on the subject. Ignoring the recent discussion
involving akpm, viro and others, the kernel historically used int
both for integer and boolean, plus return values as a special kind of
"boolean with error indication attached".

For boolean, it is nicer to do things like "if (!error)", for
integers, a comparison as above is nicer. Return codes fall into the
boolean category. Pointers after kmalloc() and similar are viewed as
rich boolean by some people, but not by all.

Jörn

--
Money does not bring happiness. Happiness does not fill the stomach.
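
A minimal user-space sketch of the conventions described in the message
above: return codes and pointers are tested like booleans, genuine
integers are compared explicitly, and functions return 0 on success or a
negative errno value on failure. All names here (demo_queue,
demo_queue_init, demo_queue_free, demo_setup) are hypothetical and do not
come from the ehca patches; the default value assigned to hw_level is
likewise only an assumption for illustration.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical resource, used only for this illustration. */
struct demo_queue {
	int hw_level;	/* a genuine integer: compare it explicitly */
	void *pages;	/* a pointer: may be tested like a boolean */
};

/* Returns 0 on success or a negative errno value on failure. */
static int demo_queue_init(struct demo_queue *q, size_t len, int hw_level)
{
	if (!q || len == 0)
		return -EINVAL;

	q->pages = calloc(1, len);
	if (!q->pages)		/* pointer check without "== NULL" */
		return -ENOMEM;

	q->hw_level = hw_level;
	return 0;
}

static void demo_queue_free(struct demo_queue *q)
{
	free(q->pages);
	q->pages = NULL;
}

/* Caller side: the return code is treated as "boolean with error attached". */
static int demo_setup(struct demo_queue *q, size_t len, int hw_level)
{
	int ret;

	ret = demo_queue_init(q, len, hw_level);
	if (ret)		/* rather than "if (ret != 0)" */
		return ret;

	if (q->hw_level == 0)	/* integer value: explicit comparison */
		q->hw_level = 1;	/* hypothetical default level */

	return 0;
}
```

A caller would invoke demo_setup() and propagate any non-zero result
unchanged, then release the buffer with demo_queue_free(). Whether
pointers should be tested as "!ptr" or "ptr == NULL" remains, as the
message says, a matter of taste.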
From joern at wohnheim.fh-wedel.de  Thu Apr 27 05:16:57 2006
From: joern at wohnheim.fh-wedel.de (Jörn Engel)
Date: Thu, 27 Apr 2006 14:16:57 +0200
Subject: [openib-general] Re: [PATCH 05/16] ehca: InfiniBand query and multicast functionality
In-Reply-To: <200604271405.36588.arnd@arndb.de>
References: <4450A17D.4030708@de.ibm.com> <20060427114104.GA32127@wohnheim.fh-wedel.de> <200604271405.36588.arnd@arndb.de>
Message-ID: <20060427121657.GF32127@wohnheim.fh-wedel.de>

On Thu, 27 April 2006 14:05:36 +0200, Arnd Bergmann wrote:
> On Thursday 27 April 2006 13:41, Jörn Engel wrote:
> > On Thu, 27 April 2006 12:48:29 +0200, Heiko J Schick wrote:
> > >
> > > + *  This source code is distributed under a dual license of GPL v2.0 and
> > > OpenIB
> >
> > Line wrap.  You might want to check your mailer or switch to a
> > different one.
> >
>
> Looks correct here. Maybe you need to check yours ;-)

Weird.  I didn't change anything in the last couple of years and never
had problems before.

Jörn

-- 
The cheapest, fastest and most reliable components of a computer
system are those that aren't there.
-- Gordon Bell, DEC laboratories

From halr at voltaire.com  Thu Apr 27 05:09:48 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Apr 2006 08:09:48 -0400
Subject: [openib-general] Re: [PATCH] osm: missing space before Makefile.am back-slash
In-Reply-To: <8664kv2r3u.fsf@mtl066.yok.mtl.com>
References: <8664kv2r3u.fsf@mtl066.yok.mtl.com>
Message-ID: <1146139784.2124.46880.camel@hal.voltaire.com>

On Thu, 2006-04-27 at 07:30, Eitan Zahavi wrote:
> Hi Hal
>
> I followed Sasha's patch for matching the new gnu make requirement
> for space before a back slash.
> I double checked for Makefile.am back slashes that miss a space.
> I have found the one below.

Thanks.
> Eitan > > Signed-off-by: Eitan Zahavi > > Index: osm/opensm/Makefile.am > =================================================================== > --- osm/opensm/Makefile.am (revision 6695) > +++ osm/opensm/Makefile.am (working copy) > @@ -84,7 +84,7 @@ opensm_SOURCES = main.c osm_console.c os > osm_prtn.c osm_prtn_config.c \ > osm_trap_rcv.c osm_trap_rcv_ctrl.c \ > osm_ucast_mgr.c osm_ucast_updn.c \ > - osm_vl15intf.c osm_vl_arb_rcv.c\ > + osm_vl15intf.c osm_vl_arb_rcv.ci \ ^ Typo. Fixed by hand. Applied to both trunk and 1.0 branch. -- Hal > osm_vl_arb_rcv_ctrl.c st.c > opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > opensm_CXXFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 > From mst at mellanox.co.il Thu Apr 27 05:24:24 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 27 Apr 2006 15:24:24 +0300 Subject: [openib-general] SDP hello ack header Message-ID: <20060427122424.GS31324@mellanox.co.il> Sean, CMA does not seem to set MajV/MinV in SDP hello ack header (REP). It does do this for hello header (REQ). Should SDP do this then? -- MST From ogerlitz at voltaire.com Thu Apr 27 05:30:03 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 27 Apr 2006 15:30:03 +0300 (IDT) Subject: [openib-general] [PATCH 0/6] iSER (iSCSI Extensions for RDMA) initiator Message-ID: Roland, The patch series that follows contains the iSER code which we want to submit upstream for 2.6.18. Below i have placed some general information on iser for LKML reviewers (please CC openib-general at openib.org on your responses). Details are provided here on new some code which iser is dependent upon, and is expected in 2.6.18, (i have communicated them already to you but prefer to repeat it again for clarity). iSER is dependent on three new changesets/functionalities which are expected in 2.6.18, two in iscsi and one in infiniband. 
+1 libiscsi - a kernel library (module) implementing a lot of the code
   common to iscsi_tcp and iscsi_iser

+2 iscsi transport ep callbacks - the first patch in this RFC, which
   enables an iscsi transport to establish/disconnect its connection
   from the kernel

+3 the rdma cm (CMA) - a module that implements RDMA transport neutral
   Address Translation and Communication Management (CM). iSER, like
   most of the in-progress IB RC ULPs (eg SDP, NFSoRDMA, Lustre, etc),
   is coded to the CMA api.

The patch adding libiscsi is one of 5 iSCSI patches already present in
the scsi-misc git tree, to which the ep callbacks patch is expected to
be pushed by the end of this week. The CMA is present in the infiniband
git tree.

To compile the code you would need to patch 2.6.17-rcX with the 6 iscsi
patches I have described above (iser is directly dependent on only two
of them, but the patches might apply only in the full order). The
patches are also present under
https://openib.org/svn/gen2/branches/backport/2.6.17

The code has been tested with 2.6.16 and 2.6.17-rc3 (drivers/infiniband
and include/rdma being latest openib) and the user space part of latest
open-iscsi. The only patches over this setting were the iscsi updates
for 2.6.18.

During the 2.6.17 testing a kmem_cache_destroy crash which seems
unrelated to iSER popped up; I have sent a bug report on the matter
today.

The iSER targets in this testing were of two types: Voltaire's IB/FC
router and Voltaire's Native IB storage box. An open source iSER target
was also kicked off recently.

OK, here is some general information on iSER:

iSER (iSCSI Extensions for RDMA) is defined by the IETF IP Storage (IPS)
working group; an iSER annex was also recently approved to appear in the
IB spec.
This driver is an iSER transport implementation for the Open iSCSI
initiator (www.open-iscsi.org), whose kernel portion and TCP transport
provider are merged in as of 2.6.15 (scsi_transport_iscsi & iscsi_tcp,
and with 2.6.18 also libiscsi).

Hence iSER is both a provider of the Linux iSCSI transport api
(scsi/scsi_transport_iscsi.h) and a SCSI LLD (Low Level Driver) of the
Linux SCSI midlayer api (scsi/scsi_host.h).

The Open iSCSI initiator carries out discovery of targets and login
into a target from user space; once the login negotiation is done, the
transport connection is bound to an iSCSI connection. The diagram under
http://www.open-iscsi.org/docs/open-iscsi-1.jpg shows the connecting
sequence for TCP.

Up to 2.6.18, the transport is expected to use a socket for the
connection, since Linux has the means to move a socket from user to
kernel space.

This restriction, the inability to move an IB QP (Queue-Pair) from user
to kernel space, and the wish to integrate with more transports such as
iSCSI offloads led to a change in iscsi under which the transport is
allowed to create/connect its native "end point" either from user space
(eg TCP/socket) or from the kernel (iSER/QP). Later, the transport
connection is bound to an iSCSI connection.
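The "create the end point first, bind it to an iSCSI connection later"
idea can be sketched in a few lines of userspace C.  This is a sketch
under stated assumptions, not the driver code: struct endpoint,
ep_table, ep_connect(), ep_lookup() and conn_bind() are all
hypothetical names.  The one point it borrows from the iSER patches is
that the end point handle travels through user space, so it is checked
against the set of known end points before it is used.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_EPS 4

/* Hypothetical end point record; stands in for a socket or an IB QP. */
struct endpoint {
	int in_use;
	int bound_conn;		/* id of the bound iSCSI conn, -1 if none */
};

static struct endpoint ep_table[MAX_EPS];

/* The transport creates its end point and returns an opaque handle. */
static uint64_t ep_connect(void)
{
	size_t i;

	for (i = 0; i < MAX_EPS; i++) {
		if (!ep_table[i].in_use) {
			ep_table[i].in_use = 1;
			ep_table[i].bound_conn = -1;
			return (uint64_t)(uintptr_t)&ep_table[i];
		}
	}
	return 0;		/* no free end point */
}

/* The handle comes back from user space, so verify it before use. */
static struct endpoint *ep_lookup(uint64_t handle)
{
	size_t i;

	for (i = 0; i < MAX_EPS; i++) {
		if (ep_table[i].in_use &&
		    handle == (uint64_t)(uintptr_t)&ep_table[i])
			return &ep_table[i];
	}
	return NULL;
}

/* Bind an iSCSI connection to a previously connected end point. */
static int conn_bind(int conn_id, uint64_t handle)
{
	struct endpoint *ep = ep_lookup(handle);

	if (!ep)
		return -1;	/* unknown or stale handle */
	ep->bound_conn = conn_id;
	return 0;
}
```

Walking the table rather than casting the handle straight back to a
pointer is the design choice that matters here: a bogus value from user
space is rejected instead of being dereferenced.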
Basically, it goes like:

+1 target discovery over TCP/IP with the discovery server
+2.TCP  socket create/bind/setopt/connect to the target
+2.iSER CMA_ID/QP create/connect to the target
+3 iscsi session create
+4 iscsi connection create
+5 bind iscsi connection to the transport connection
+6 login request/response negotiation
+7 iscsi connection start
+8 the SCSI midlayer starts its inquiry and so on

Or Gerlitz

From ogerlitz at voltaire.com  Thu Apr 27 05:30:32 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 27 Apr 2006 15:30:32 +0300 (IDT)
Subject: [openib-general] [PATCH 1/6] iSER's Makefile and Kconfig
In-Reply-To: 
Message-ID: 

Signed-off-by: Or Gerlitz

--- /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser-x/Makefile	1970-01-01 02:00:00.000000000 +0200
+++ /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser/Makefile	2006-04-27 15:12:33.000000000 +0300
@@ -0,0 +1,6 @@
+obj-$(CONFIG_INFINIBAND_ISER)	+= ib_iser.o
+
+ib_iser-y			:= iser_verbs.o \
+				   iser_initiator.o \
+				   iser_memory.o \
+				   iscsi_iser.o
--- /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser-x/Kconfig	1970-01-01 02:00:00.000000000 +0200
+++ /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser/Kconfig	2006-04-16 11:04:42.000000000 +0300
@@ -0,0 +1,12 @@
+config INFINIBAND_ISER
+	tristate "ISCSI RDMA Protocol"
+	depends on INFINIBAND && SCSI
+	select SCSI_ISCSI_ATTRS
+	---help---
+
+	  Support for the ISCSI RDMA Protocol over InfiniBand.  This
+	  allows you to access storage devices that speak ISER/ISCSI
+	  over InfiniBand.
+
+	  The ISER protocol is defined by IETF.
+	  See .

From ogerlitz at voltaire.com  Thu Apr 27 05:31:17 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 27 Apr 2006 15:31:17 +0300 (IDT)
Subject: [openib-general] [PATCH 2/6] iscsi_iser header file
In-Reply-To: 
Message-ID: 

iscsi_iser is the buddy of drivers/scsi/iscsi_tcp; with the
introduction of libiscsi, much of the code (which was common to both)
was moved into it.
Signed-off-by: Or Gerlitz --- /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser-x/iscsi_iser.h 1970-01-01 02:00:00.000000000 +0200 +++ /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser/iscsi_iser.h 2006-04-27 10:07:04.000000000 +0300 @@ -0,0 +1,355 @@ +/* + * iSER transport for the Open iSCSI Initiator & iSER transport internals + * + * Copyright (C) 2004 Dmitry Yusupov + * Copyright (C) 2004 Alex Aizman + * Copyright (C) 2005 Mike Christie + * based on code maintained by open-iscsi at googlegroups.com + * + * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: iscsi_iser.h 6643 2006-04-26 10:01:01Z ogerlitz $ + */ +#ifndef __ISCSI_ISER_H__ +#define __ISCSI_ISER_H__ + +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include + +#define PFX "iser:" + +#define iser_dbg(fmt, arg...) \ + do { \ + if (iser_debug_level > 0) \ + printk(KERN_DEBUG PFX "%s:" fmt,\ + __func__ , ## arg); \ + } while (0) + +#define iser_err(fmt, arg...) \ + do { \ + printk(KERN_ERR PFX "%s:" fmt, \ + __func__ , ## arg); \ + } while (0) + +#define iser_bug(fmt,arg...) \ + do { \ + printk(KERN_ERR PFX "%s: PANIC! " fmt, \ + __func__ , ## arg); \ + BUG(); \ + } while(0) + + /* support upto 512KB in one RDMA */ +#define ISCSI_ISER_SG_TABLESIZE (0x80000 >> PAGE_SHIFT) +#define ISCSI_ISER_MAX_LUN 256 +#define ISCSI_ISER_MAX_CMD_LEN 16 + +/* QP settings */ +/* Maximal bounds on received asynchronous PDUs */ +#define ISER_MAX_RX_MISC_PDUS 4 /* NOOP_IN(2) , ASYNC_EVENT(2) */ + +#define ISER_MAX_TX_MISC_PDUS 6 /* NOOP_OUT(2), TEXT(1), * + * SCSI_TMFUNC(2), LOGOUT(1) */ + +#define ISER_QP_MAX_RECV_DTOS (ISCSI_XMIT_CMDS_MAX + \ + ISER_MAX_RX_MISC_PDUS + \ + ISER_MAX_TX_MISC_PDUS) + +/* the max TX (send) WR supported by the iSER QP is defined by * + * max_send_wr = T * (1 + D) + C ; D is how many inflight dataouts we expect * + * to have at max for SCSI command. The tx posting & completion handling code * + * supports -EAGAIN scheme where tx is suspended till the QP has room for more * + * send WR. 
D=8 comes from 64K/8K */ + +#define ISER_INFLIGHT_DATAOUTS 8 + +#define ISER_QP_MAX_REQ_DTOS (ISCSI_XMIT_CMDS_MAX * \ + (1 + ISER_INFLIGHT_DATAOUTS) + \ + ISER_MAX_TX_MISC_PDUS + \ + ISER_MAX_RX_MISC_PDUS) + +#define ISER_VER 0x10 +#define ISER_WSV 0x08 +#define ISER_RSV 0x04 + +struct iser_hdr { + u8 flags; + u8 rsvd[3]; + __be32 write_stag; /* write rkey */ + __be64 write_va; + __be32 read_stag; /* read rkey */ + __be64 read_va; +} __attribute__((packed)); + + +/* Length of an object name string */ +#define ISER_OBJECT_NAME_SIZE 64 + +enum iser_ib_conn_state { + ISER_CONN_INIT, /* descriptor allocd, no conn */ + ISER_CONN_PENDING, /* in the process of being established */ + ISER_CONN_UP, /* up and running */ + ISER_CONN_TERMINATING, /* in the process of being terminated */ + ISER_CONN_DOWN, /* shut down */ + ISER_CONN_STATES_NUM +}; + +enum iser_task_status { + ISER_TASK_STATUS_INIT = 0, + ISER_TASK_STATUS_STARTED, + ISER_TASK_STATUS_COMPLETED +}; + +enum iser_data_dir { + ISER_DIR_IN = 0, /* to initiator */ + ISER_DIR_OUT, /* from initiator */ + ISER_DIRS_NUM +}; + +struct iser_data_buf { + void *buf; /* pointer to the sg list */ + unsigned int size; /* num entries of this sg */ + unsigned long data_len; /* total data len */ + unsigned int dma_nents; /* returned by dma_map_sg */ + char *copy_buf; /* allocated copy buf for SGs unaligned * + * for rdma which are copied */ + struct scatterlist sg_single; /* SG-ified clone of a non SG SC or * + * unaligned SG */ + }; + +/* fwd declarations */ +struct iser_device; +struct iscsi_iser_conn; +struct iscsi_iser_cmd_task; + +struct iser_mem_reg { + u32 lkey; + u32 rkey; + u64 va; + u64 len; + void *mem_h; +}; + +struct iser_regd_buf { + struct iser_mem_reg reg; /* memory registration info */ + void *virt_addr; + struct iser_device *device; /* device->device for dma_unmap */ + dma_addr_t dma_addr; /* if non zero, addr for dma_unmap */ + enum dma_data_direction direction; /* direction for dma_unmap */ + unsigned int 
data_size; + atomic_t ref_count; /* refcount, freed when dec to 0 */ +}; + +#define MAX_REGD_BUF_VECTOR_LEN 2 + +struct iser_dto { + struct iscsi_iser_cmd_task *ctask; + struct iscsi_iser_conn *conn; + int notify_enable; + + /* vector of registered buffers */ + unsigned int regd_vector_len; + struct iser_regd_buf *regd[MAX_REGD_BUF_VECTOR_LEN]; + + /* offset into the registered buffer may be specified */ + unsigned int offset[MAX_REGD_BUF_VECTOR_LEN]; + + /* a smaller size may be specified, if 0, then full size is used */ + unsigned int used_sz[MAX_REGD_BUF_VECTOR_LEN]; +}; + +enum iser_desc_type { + ISCSI_RX, + ISCSI_TX_CONTROL , + ISCSI_TX_SCSI_COMMAND, + ISCSI_TX_DATAOUT +}; + +struct iser_desc { + struct iser_hdr iser_header; + struct iscsi_hdr iscsi_header; + struct iser_regd_buf hdr_regd_buf; + void *data; /* used by RX & TX_CONTROL */ + struct iser_regd_buf data_regd_buf; /* used by RX & TX_CONTROL */ + enum iser_desc_type type; + struct iser_dto dto; +}; + +struct iser_device { + struct ib_device *ib_device; + struct ib_pd *pd; + struct ib_cq *cq; + struct ib_mr *mr; + struct tasklet_struct cq_tasklet; + struct list_head ig_list; /* entry in ig devices list */ + int refcount; +}; + +struct iser_conn +{ + struct iscsi_iser_conn *iser_conn; /* iser conn for upcalls */ + atomic_t state; /* rdma connection state */ + struct iser_device *device; /* device context */ + struct rdma_cm_id *cma_id; /* CMA ID */ + struct ib_qp *qp; /* QP */ + struct ib_fmr_pool *fmr_pool; /* pool of IB FMRs */ + int disc_evt_flag; /* disconn event delivered */ + wait_queue_head_t wait; /* waitq for conn/disconn */ + atomic_t post_recv_buf_count; /* posted rx count */ + atomic_t post_send_buf_count; /* posted tx count */ + struct work_struct comperror_work; /* conn term sleepable ctx*/ + char name[ISER_OBJECT_NAME_SIZE]; + struct iser_page_vec *page_vec; /* represents SG to fmr maps* + * maps serialized as tx is*/ + struct list_head conn_list; /* entry in ig conn list */ +}; + +struct 
iscsi_iser_conn { + struct iscsi_conn *iscsi_conn;/* ptr to iscsi conn */ + struct iser_conn *ib_conn; /* iSER IB conn */ + + rwlock_t lock; +}; + +struct iscsi_iser_cmd_task { + struct iser_desc desc; + struct iscsi_iser_conn *iser_conn; + int rdma_data_count;/* RDMA bytes */ + enum iser_task_status status; + int command_sent; /* set if command sent */ + int dir[ISER_DIRS_NUM]; /* set if dir use*/ + struct iser_regd_buf rdma_regd[ISER_DIRS_NUM];/* regd rdma buf */ + struct iser_data_buf data[ISER_DIRS_NUM]; /* orig. data des*/ + struct iser_data_buf data_copy[ISER_DIRS_NUM];/* contig. copy */ +}; + +struct iser_page_vec { + u64 *pages; + int length; + int offset; + int data_size; +}; + +struct iser_global { + struct mutex device_list_mutex;/* */ + struct list_head device_list; /* all iSER devices */ + struct mutex connlist_mutex; + struct list_head connlist; /* all iSER IB connections */ + + kmem_cache_t *desc_cache; +}; + +extern struct iser_global ig; +extern int iser_debug_level; + +/* allocate connection resources needed for rdma functionality */ +int iser_conn_set_full_featured_mode(struct iscsi_conn *conn); + +int iser_send_control(struct iscsi_conn *conn, + struct iscsi_mgmt_task *mtask); + +int iser_send_command(struct iscsi_conn *conn, + struct iscsi_cmd_task *ctask); + +int iser_send_data_out(struct iscsi_conn *conn, + struct iscsi_cmd_task *ctask, + struct iscsi_data *hdr); + +void iscsi_iser_recv(struct iscsi_conn *conn, + struct iscsi_hdr *hdr, + char *rx_data, + int rx_data_len); + +int iser_conn_init(struct iser_conn **ib_conn); + +void iser_conn_terminate(struct iser_conn *ib_conn); + +void iser_conn_release(struct iser_conn *ib_conn); + +void iser_rcv_completion(struct iser_desc *desc, + unsigned long dto_xfer_len); + +void iser_snd_completion(struct iser_desc *desc); + +void iser_ctask_rdma_init(struct iscsi_iser_cmd_task *ctask); + +void iser_ctask_rdma_finalize(struct iscsi_iser_cmd_task *ctask); + +void iser_dto_buffs_release(struct iser_dto 
*dto); + +int iser_regd_buff_release(struct iser_regd_buf *regd_buf); + +void iser_reg_single(struct iser_device *device, + struct iser_regd_buf *regd_buf, + enum dma_data_direction direction); + +int iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task *ctask, + enum iser_data_dir cmd_dir); + +void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *ctask, + enum iser_data_dir cmd_dir); + +int iser_reg_rdma_mem(struct iscsi_iser_cmd_task *ctask, + enum iser_data_dir cmd_dir); + +int iser_connect(struct iser_conn *ib_conn, + struct sockaddr_in *src_addr, + struct sockaddr_in *dst_addr, + int non_blocking); + +int iser_reg_page_vec(struct iser_conn *ib_conn, + struct iser_page_vec *page_vec, + struct iser_mem_reg *mem_reg); + +void iser_unreg_mem(struct iser_mem_reg *mem_reg); + +int iser_post_recv(struct iser_desc *rx_desc); +int iser_post_send(struct iser_desc *tx_desc); +#endif From ogerlitz at voltaire.com Thu Apr 27 05:31:52 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 27 Apr 2006 15:31:52 +0300 (IDT) Subject: [openib-general] [PATCH 3/6] open iscsi iser transport provider code In-Reply-To: Message-ID: Signed-off-by: Or Gerlitz --- /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser-x/iscsi_iser.c 1970-01-01 02:00:00.000000000 +0200 +++ /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser/iscsi_iser.c 2006-04-26 12:50:11.000000000 +0300 @@ -0,0 +1,800 @@ +/* + * iSCSI Initiator over iSER Data-Path + * + * Copyright (C) 2004 Dmitry Yusupov + * Copyright (C) 2004 Alex Aizman + * Copyright (C) 2005 Mike Christie + * Copyright (c) 2005, 2006 Voltaire, Inc. All rights reserved. + * maintained by openib-general at openib.org + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Credits: + * Christoph Hellwig + * FUJITA Tomonori + * Arne Redlich + * Zhenyu Wang + * Modified by: + * Erez Zilber + * + * + * $Id: iscsi_iser.c 6643 2006-04-26 10:01:01Z ogerlitz $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "iscsi_iser.h" + +static unsigned int iscsi_max_lun = 512; +module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO); + +#define DRV_VER "$Rev: 227 $" +#define DRV_DATE "$LastChangedDate: 2006-03-22 16:47:30 +0200 (Wed, 22 Mar 2006) $" + +int iser_debug_level = 0; + +MODULE_DESCRIPTION("iSER (iSCSI Extensions for RDMA) Datamover " + "v" DRV_VER "(" DRV_DATE ")"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Alex Nezhinsky, Dan Bar Dov, Or Gerlitz"); + +module_param_named(debug_level, iser_debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0 (default:disabled)"); + +struct iser_global ig; + +void +iscsi_iser_recv(struct iscsi_conn *conn, + struct iscsi_hdr *hdr, char *rx_data, int rx_data_len) +{ + int rc = 0; + uint32_t ret_itt; + int datalen; + int ahslen; + + /* verify PDU length */ + datalen = ntoh24(hdr->dlength); + if (datalen != rx_data_len) { + printk(KERN_ERR "iscsi_iser: datalen %d (hdr) != %d (IB) \n", + datalen, rx_data_len); + rc = ISCSI_ERR_DATALEN; + goto error; + } + + /* read AHS */ + ahslen = hdr->hlength * 4; + + /* verify itt (itt encoding: age+cid+itt) */ + rc = iscsi_verify_itt(conn, hdr, &ret_itt); + + if (!rc) + rc = iscsi_complete_pdu(conn, hdr, rx_data, rx_data_len); + + if (rc && rc != ISCSI_ERR_NO_SCSI_CMD) + goto error; + + return; +error: + iscsi_conn_failure(conn, rc); +} + + +/** + * iscsi_iser_cmd_init - Initialize iSCSI SCSI_READ or SCSI_WRITE commands + * + **/ +static void +iscsi_iser_cmd_init(struct iscsi_cmd_task *ctask) +{ + struct iscsi_iser_conn 
*iser_conn = ctask->conn->dd_data; + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + struct scsi_cmnd *sc = ctask->sc; + + iser_ctask->command_sent = 0; + iser_ctask->iser_conn = iser_conn; + + if (sc->sc_data_direction == DMA_TO_DEVICE) { + BUG_ON(ctask->total_length == 0); + /* bytes to be sent via RDMA operations */ + iser_ctask->rdma_data_count = ctask->total_length - + ctask->imm_count - + ctask->unsol_count; + + debug_scsi("cmd [itt %x total %d imm %d imm_data %d " + "rdma_data %d]\n", + ctask->itt, ctask->total_length, ctask->imm_count, + ctask->unsol_count, ctask->rdma_data_count); + } else + /* bytes to be sent via RDMA operations */ + iser_ctask->rdma_data_count = ctask->total_length; + + iser_ctask_rdma_init(iser_ctask); +} + +/** + * iscsi_mtask_xmit - xmit management(immediate) task + * @conn: iscsi connection + * @mtask: task management task + * + * Notes: + * The function can return -EAGAIN in which case caller must + * call it again later, or recover. '0' return code means successful + * xmit. + * + **/ +static int +iscsi_iser_mtask_xmit(struct iscsi_conn *conn, + struct iscsi_mgmt_task *mtask) +{ + int error = 0; + + debug_scsi("mtask deq [cid %d itt 0x%x]\n", conn->id, mtask->itt); + + error = iser_send_control(conn, mtask); + + /* since iser xmits control with zero copy, mtasks can not be recycled + * right after sending them. 
+ * The recycling scheme is based on whether a response is expected + * - if yes, the mtask is recycled at iscsi_complete_pdu + * - if no, the mtask is recycled at iser_snd_completion + */ + if (error && error != -EAGAIN) + iscsi_conn_failure(conn, ISCSI_ERR_CONN_FAILED); + + return error; +} + +static int +iscsi_iser_ctask_xmit_unsol_data(struct iscsi_conn *conn, + struct iscsi_cmd_task *ctask) +{ + struct iscsi_data hdr; + int error = 0; + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + + /* Send data-out PDUs while there's still unsolicited data to send */ + while (ctask->unsol_count > 0) { + iscsi_prep_unsolicit_data_pdu(ctask, &hdr, + iser_ctask->rdma_data_count); + + debug_scsi("Sending data-out: itt 0x%x, data count %d\n", + hdr.itt, ctask->data_count); + + /* the buffer description has been passed with the command */ + /* Send the command */ + error = iser_send_data_out(conn, ctask, &hdr); + if (error) { + ctask->unsol_datasn--; + goto iscsi_iser_ctask_xmit_unsol_data_exit; + } + ctask->unsol_count -= ctask->data_count; + debug_scsi("Need to send %d more as data-out PDUs\n", + ctask->unsol_count); + } + +iscsi_iser_ctask_xmit_unsol_data_exit: + return error; +} + +static int +iscsi_iser_ctask_xmit(struct iscsi_conn *conn, + struct iscsi_cmd_task *ctask) +{ + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + int error = 0; + + debug_scsi("ctask deq [cid %d itt 0x%x]\n", + conn->id, ctask->itt); + + /* + * serialize with TMF AbortTask + */ + if (ctask->mtask) + return error; + + /* Send the cmd PDU */ + if (!iser_ctask->command_sent) { + error = iser_send_command(conn, ctask); + if (error) + goto iscsi_iser_ctask_xmit_exit; + iser_ctask->command_sent = 1; + } + + /* Send unsolicited data-out PDU(s) if necessary */ + if (ctask->unsol_count) + error = iscsi_iser_ctask_xmit_unsol_data(conn, ctask); + + iscsi_iser_ctask_xmit_exit: + if (error && error != -EAGAIN) + iscsi_conn_failure(conn, ISCSI_ERR_CONN_FAILED); + return error; +} + +static 
void +iscsi_iser_cleanup_ctask(struct iscsi_conn *conn, struct iscsi_cmd_task *ctask) +{ + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + + if (iser_ctask->status == ISER_TASK_STATUS_STARTED) { + iser_ctask->status = ISER_TASK_STATUS_COMPLETED; + iser_ctask_rdma_finalize(iser_ctask); + } +} + +static struct iser_conn * +iscsi_iser_ib_conn_lookup(__u64 ep_handle) +{ + struct iser_conn *ib_conn; + struct iser_conn *uib_conn = (struct iser_conn *)(unsigned long)ep_handle; + + mutex_lock(&ig.connlist_mutex); + list_for_each_entry(ib_conn, &ig.connlist, conn_list) { + if (ib_conn == uib_conn) { + mutex_unlock(&ig.connlist_mutex); + return ib_conn; + } + } + mutex_unlock(&ig.connlist_mutex); + iser_err("no conn exists for eph %llx\n",(unsigned long long)ep_handle); + return NULL; +} + +static struct iscsi_cls_conn * +iscsi_iser_conn_create(struct iscsi_cls_session *cls_session, uint32_t conn_idx) +{ + struct iscsi_conn *conn; + struct iscsi_cls_conn *cls_conn; + struct iscsi_iser_conn *iser_conn; + + cls_conn = iscsi_conn_setup(cls_session, conn_idx); + if (!cls_conn) + return NULL; + conn = cls_conn->dd_data; + + /* + * due to issues with the login code re iser sematics + * this not set in iscsi_conn_setup - FIXME + */ + conn->max_recv_dlength = 128; + + iser_conn = kzalloc(sizeof(*iser_conn), GFP_KERNEL); + if (!iser_conn) + goto conn_alloc_fail; + + /* currently this is the only field which need to be initiated */ + rwlock_init(&iser_conn->lock); + + conn->recv_lock = &iser_conn->lock; + + conn->dd_data = iser_conn; + iser_conn->iscsi_conn = conn; + + return cls_conn; + +conn_alloc_fail: + iscsi_conn_teardown(cls_conn); + return NULL; +} + +static void +iscsi_iser_conn_destroy(struct iscsi_cls_conn *cls_conn) +{ + struct iscsi_conn *conn = cls_conn->dd_data; + struct iscsi_iser_conn *iser_conn = conn->dd_data; + + iscsi_conn_teardown(cls_conn); + kfree(iser_conn); +} + +static int +iscsi_iser_conn_bind(struct iscsi_cls_session *cls_session, + struct 
iscsi_cls_conn *cls_conn, uint64_t transport_eph, + int is_leading) +{ + struct iscsi_conn *conn = cls_conn->dd_data; + struct iscsi_iser_conn *iser_conn; + struct iser_conn *ib_conn; + int error; + + error = iscsi_conn_bind(cls_session, cls_conn, is_leading); + if (error) + return error; + + if (conn->stop_stage != STOP_CONN_SUSPEND) { + /* the transport ep handle comes from user space so it must be + * verified against the global ib connections list */ + ib_conn = iscsi_iser_ib_conn_lookup(transport_eph); + if (!ib_conn) { + iser_err("can't bind eph %llx\n", + (unsigned long long)transport_eph); + return -EINVAL; + } + /* binds the iSER connection retrieved from the previously + * connected ep_handle to the iSCSI layer connection. exchanges + * connection pointers */ + iser_err("binding iscsi conn %p to iser_conn %p\n",conn,ib_conn); + iser_conn = conn->dd_data; + ib_conn->iser_conn = iser_conn; + iser_conn->ib_conn = ib_conn; + } + + return 0; +} + +static int +iscsi_iser_conn_start(struct iscsi_cls_conn *cls_conn) +{ + struct iscsi_conn *conn = cls_conn->dd_data; + int err; + + err = iscsi_conn_start(cls_conn); + if (err) + return err; + + return iser_conn_set_full_featured_mode(conn); +} + +static void +iscsi_iser_conn_terminate(struct iscsi_conn *conn) +{ + struct iscsi_iser_conn *iser_conn = conn->dd_data; + struct iser_conn *ib_conn = iser_conn->ib_conn; + + BUG_ON(!ib_conn); + /* starts conn teardown process, waits until all previously * + * posted buffers get flushed, deallocates all conn resources */ + iser_conn_terminate(ib_conn); + iser_conn->ib_conn = NULL; + conn->recv_lock = NULL; +} + + +static struct iscsi_transport iscsi_iser_transport; + +static struct iscsi_cls_session * +iscsi_iser_session_create(struct iscsi_transport *iscsit, + struct scsi_transport_template *scsit, + uint32_t initial_cmdsn, uint32_t *hostno) +{ + struct iscsi_cls_session *cls_session; + struct iscsi_session *session; + int i; + uint32_t hn; + struct iscsi_cmd_task *ctask; + 
struct iscsi_mgmt_task *mtask; + struct iscsi_iser_cmd_task *iser_ctask; + struct iser_desc *desc; + + cls_session = iscsi_session_setup(iscsit, scsit, + sizeof(struct iscsi_iser_cmd_task), + sizeof(struct iser_desc), + initial_cmdsn, &hn); + if (!cls_session) + return NULL; + + *hostno = hn; + session = class_to_transport_session(cls_session); + + /* libiscsi setup itts, data and pool so just set desc fields */ + for (i = 0; i < session->cmds_max; i++) { + ctask = session->cmds[i]; + iser_ctask = ctask->dd_data; + ctask->hdr = (struct iscsi_cmd *)&iser_ctask->desc.iscsi_header; + } + + for (i = 0; i < session->mgmtpool_max; i++) { + mtask = session->mgmt_cmds[i]; + desc = mtask->dd_data; + mtask->hdr = &desc->iscsi_header; + desc->data = mtask->data; + } + + return cls_session; +} + +static int +iscsi_iser_conn_set_param(struct iscsi_cls_conn *cls_conn, + enum iscsi_param param, uint32_t value) +{ + struct iscsi_conn *conn = cls_conn->dd_data; + struct iscsi_session *session = conn->session; + + spin_lock_bh(&session->lock); + if (conn->c_stage != ISCSI_CONN_INITIAL_STAGE && + conn->stop_stage != STOP_CONN_RECOVER) { + printk(KERN_ERR "iscsi_iser: can not change parameter [%d]\n", + param); + spin_unlock_bh(&session->lock); + return 0; + } + spin_unlock_bh(&session->lock); + + switch (param) { + case ISCSI_PARAM_MAX_RECV_DLENGTH: + /* TBD */ + break; + case ISCSI_PARAM_MAX_XMIT_DLENGTH: + conn->max_xmit_dlength = value; + break; + case ISCSI_PARAM_HDRDGST_EN: + if (value) { + printk(KERN_ERR "DataDigest wasn't negotiated to None"); + return -EPROTO; + } + break; + case ISCSI_PARAM_DATADGST_EN: + if (value) { + printk(KERN_ERR "DataDigest wasn't negotiated to None"); + return -EPROTO; + } + break; + case ISCSI_PARAM_INITIAL_R2T_EN: + session->initial_r2t_en = value; + break; + case ISCSI_PARAM_IMM_DATA_EN: + session->imm_data_en = value; + break; + case ISCSI_PARAM_FIRST_BURST: + session->first_burst = value; + break; + case ISCSI_PARAM_MAX_BURST: + 
session->max_burst = value; + break; + case ISCSI_PARAM_PDU_INORDER_EN: + session->pdu_inorder_en = value; + break; + case ISCSI_PARAM_DATASEQ_INORDER_EN: + session->dataseq_inorder_en = value; + break; + case ISCSI_PARAM_ERL: + session->erl = value; + break; + case ISCSI_PARAM_IFMARKER_EN: + if (value) { + printk(KERN_ERR "IFMarker wasn't negotiated to No"); + return -EPROTO; + } + break; + case ISCSI_PARAM_OFMARKER_EN: + if (value) { + printk(KERN_ERR "OFMarker wasn't negotiated to No"); + return -EPROTO; + } + break; + default: + break; + } + + return 0; +} + +static int +iscsi_iser_session_get_param(struct iscsi_cls_session *cls_session, + enum iscsi_param param, uint32_t *value) +{ + struct Scsi_Host *shost = iscsi_session_to_shost(cls_session); + struct iscsi_session *session = iscsi_hostdata(shost->hostdata); + + switch (param) { + case ISCSI_PARAM_INITIAL_R2T_EN: + *value = session->initial_r2t_en; + break; + case ISCSI_PARAM_MAX_R2T: + *value = session->max_r2t; + break; + case ISCSI_PARAM_IMM_DATA_EN: + *value = session->imm_data_en; + break; + case ISCSI_PARAM_FIRST_BURST: + *value = session->first_burst; + break; + case ISCSI_PARAM_MAX_BURST: + *value = session->max_burst; + break; + case ISCSI_PARAM_PDU_INORDER_EN: + *value = session->pdu_inorder_en; + break; + case ISCSI_PARAM_DATASEQ_INORDER_EN: + *value = session->dataseq_inorder_en; + break; + case ISCSI_PARAM_ERL: + *value = session->erl; + break; + case ISCSI_PARAM_IFMARKER_EN: + *value = 0; + break; + case ISCSI_PARAM_OFMARKER_EN: + *value = 0; + break; + default: + return ISCSI_ERR_PARAM_NOT_FOUND; + } + + return 0; +} + +static int +iscsi_iser_conn_get_param(struct iscsi_cls_conn *cls_conn, + enum iscsi_param param, uint32_t *value) +{ + struct iscsi_conn *conn = cls_conn->dd_data; + + switch(param) { + case ISCSI_PARAM_MAX_RECV_DLENGTH: + *value = conn->max_recv_dlength; + break; + case ISCSI_PARAM_MAX_XMIT_DLENGTH: + *value = conn->max_xmit_dlength; + break; + case ISCSI_PARAM_HDRDGST_EN: + 
*value = 0; + break; + case ISCSI_PARAM_DATADGST_EN: + *value = 0; + break; + /*case ISCSI_PARAM_TARGET_RECV_DLENGTH: + *value = conn->target_recv_dlength; + break; + case ISCSI_PARAM_INITIATOR_RECV_DLENGTH: + *value = conn->initiator_recv_dlength; + break;*/ + default: + return ISCSI_ERR_PARAM_NOT_FOUND; + } + + return 0; +} + + +static void +iscsi_iser_conn_get_stats(struct iscsi_cls_conn *cls_conn, struct iscsi_stats *stats) +{ + struct iscsi_conn *conn = cls_conn->dd_data; + + stats->txdata_octets = conn->txdata_octets; + stats->rxdata_octets = conn->rxdata_octets; + stats->scsicmd_pdus = conn->scsicmd_pdus_cnt; + stats->dataout_pdus = conn->dataout_pdus_cnt; + stats->scsirsp_pdus = conn->scsirsp_pdus_cnt; + stats->datain_pdus = conn->datain_pdus_cnt; /* always 0 */ + stats->r2t_pdus = conn->r2t_pdus_cnt; /* always 0 */ + stats->tmfcmd_pdus = conn->tmfcmd_pdus_cnt; + stats->tmfrsp_pdus = conn->tmfrsp_pdus_cnt; + stats->custom_length = 3; + strcpy(stats->custom[0].desc, "qp_tx_queue_full"); + stats->custom[0].value = 0; /* TB iser_conn->qp_tx_queue_full; */ + strcpy(stats->custom[1].desc, "fmr_map_not_avail"); + stats->custom[1].value = 0; /* TB iser_conn->fmr_map_not_avail */; + strcpy(stats->custom[2].desc, "eh_abort_cnt"); + stats->custom[2].value = conn->eh_abort_cnt; +} + +static int +iscsi_iser_ep_connect(struct sockaddr *dst_addr, int non_blocking, + __u64 *ep_handle) +{ + int err; + struct iser_conn *ib_conn; + + err = iser_conn_init(&ib_conn); + if (err) + goto out; + + err = iser_connect(ib_conn, NULL, (struct sockaddr_in *)dst_addr, non_blocking); + if (!err) + *ep_handle = (__u64)(unsigned long)ib_conn; + +out: + return err; +} + +static int +iscsi_iser_ep_poll(__u64 ep_handle, int timeout_ms) +{ + struct iser_conn *ib_conn = iscsi_iser_ib_conn_lookup(ep_handle); + int rc; + + if (!ib_conn) + return -EINVAL; + + rc = wait_event_interruptible_timeout(ib_conn->wait, + atomic_read(&ib_conn->state) == ISER_CONN_UP, + msecs_to_jiffies(timeout_ms)); + + /* 
if conn establishment failed, return error code to iscsi */ + if (!rc && + (atomic_read(&ib_conn->state) == ISER_CONN_TERMINATING || + atomic_read(&ib_conn->state) == ISER_CONN_DOWN)) + rc = -1; + + iser_err("ib conn %p rc = %d\n", ib_conn, rc); + + if (rc > 0) + return 1; /* success, this is the equivalent of POLLOUT */ + else if (!rc) + return 0; /* timeout */ + else + return rc; /* signal */ +} + +static void +iscsi_iser_ep_disconnect(__u64 ep_handle) +{ + struct iser_conn *ib_conn = iscsi_iser_ib_conn_lookup(ep_handle); + + if (!ib_conn) + return; + + iser_err("ib conn %p state %d\n",ib_conn, atomic_read(&ib_conn->state)); + + if (atomic_read(&ib_conn->state) == ISER_CONN_UP) + iser_conn_terminate(ib_conn); + + iser_conn_release(ib_conn); +} + +static struct scsi_host_template iscsi_iser_sht = { + .name = "iSCSI Initiator over iSER, v." + ISCSI_VERSION_STR, + .queuecommand = iscsi_queuecommand, + .can_queue = ISCSI_XMIT_CMDS_MAX - 1, + .sg_tablesize = ISCSI_ISER_SG_TABLESIZE, + .cmd_per_lun = ISCSI_MAX_CMD_PER_LUN, + .eh_abort_handler = iscsi_eh_abort, + .eh_host_reset_handler = iscsi_eh_host_reset, + .use_clustering = DISABLE_CLUSTERING, + .proc_name = "iscsi_iser", + .this_id = -1, +}; + +static struct iscsi_transport iscsi_iser_transport = { + .owner = THIS_MODULE, + .name = "iser", + .caps = CAP_RECOVERY_L0 | CAP_MULTI_R2T, + .param_mask = ISCSI_MAX_RECV_DLENGTH | + ISCSI_MAX_XMIT_DLENGTH | + ISCSI_HDRDGST_EN | + ISCSI_DATADGST_EN | + ISCSI_INITIAL_R2T_EN | + ISCSI_MAX_R2T | + ISCSI_IMM_DATA_EN | + ISCSI_FIRST_BURST | + ISCSI_MAX_BURST | + ISCSI_PDU_INORDER_EN | + ISCSI_DATASEQ_INORDER_EN, + .host_template = &iscsi_iser_sht, + .conndata_size = sizeof(struct iscsi_conn), + .max_lun = ISCSI_ISER_MAX_LUN, + .max_cmd_len = ISCSI_ISER_MAX_CMD_LEN, + /* session management */ + .create_session = iscsi_iser_session_create, + .destroy_session = iscsi_session_teardown, + /* connection management */ + .create_conn = iscsi_iser_conn_create, + .bind_conn = 
iscsi_iser_conn_bind, + .destroy_conn = iscsi_iser_conn_destroy, + .set_param = iscsi_iser_conn_set_param, + .get_conn_param = iscsi_iser_conn_get_param, + .get_session_param = iscsi_iser_session_get_param, + .start_conn = iscsi_iser_conn_start, + .stop_conn = iscsi_conn_stop, + /* these are called as part of conn recovery */ + .suspend_conn_recv = NULL, /* FIXME is/how this relvant to iser? */ + .terminate_conn = iscsi_iser_conn_terminate, + /* IO */ + .send_pdu = iscsi_conn_send_pdu, + .get_stats = iscsi_iser_conn_get_stats, + .init_cmd_task = iscsi_iser_cmd_init, + .xmit_cmd_task = iscsi_iser_ctask_xmit, + .xmit_mgmt_task = iscsi_iser_mtask_xmit, + .cleanup_cmd_task = iscsi_iser_cleanup_ctask, + /* recovery */ + .session_recovery_timedout = iscsi_session_recovery_timedout, + + .ep_connect = iscsi_iser_ep_connect, + .ep_poll = iscsi_iser_ep_poll, + .ep_disconnect = iscsi_iser_ep_disconnect +}; + +static int __init iser_init(void) +{ + int err; + + iser_dbg("Starting iSER datamover...\n"); + + if (iscsi_max_lun < 1) { + printk(KERN_ERR "Invalid max_lun value of %u\n", iscsi_max_lun); + return -EINVAL; + } + + iscsi_iser_transport.max_lun = iscsi_max_lun; + + memset(&ig, 0, sizeof(struct iser_global)); + + ig.desc_cache = kmem_cache_create("iser_descriptors", + sizeof (struct iser_desc), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (ig.desc_cache == NULL) + return -ENOMEM; + + /* device init is called only after the first addr resolution */ + mutex_init(&ig.device_list_mutex); + INIT_LIST_HEAD(&ig.device_list); + mutex_init(&ig.connlist_mutex); + INIT_LIST_HEAD(&ig.connlist); + + if (!iscsi_register_transport(&iscsi_iser_transport)) { + iser_err("iscsi_register_transport failed\n"); + err = -EINVAL; + goto register_transport_failure; + } + + return 0; + +register_transport_failure: + kmem_cache_destroy(ig.desc_cache); + + return err; +} + +static void __exit iser_exit(void) +{ + iser_dbg("Removing iSER datamover...\n"); + 
iscsi_unregister_transport(&iscsi_iser_transport); + kmem_cache_destroy(ig.desc_cache); +} + +module_init(iser_init); +module_exit(iser_exit); From ogerlitz at voltaire.com Thu Apr 27 05:32:25 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 27 Apr 2006 15:32:25 +0300 (IDT) Subject: [openib-general] [PATCH 4/6] iser initiator In-Reply-To: Message-ID: The main entry points to this code are iser_send_control, iser_send_command and iser_send_data_out for the flow coming from iscsi_iser.c, and iser_snd_completion/iser_rcv_completion for handling completions towards iscsi_iser.c Signed-off-by: Or Gerlitz --- /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser-x/iser_initiator.c 1970-01-01 02:00:00.000000000 +0200 +++ /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser/iser_initiator.c 2006-04-26 12:50:11.000000000 +0300 @@ -0,0 +1,732 @@ +/* + * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: iser_initiator.c 6643 2006-04-26 10:01:01Z ogerlitz $ + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "iscsi_iser.h" + +/* Constant PDU lengths calculations */ +#define ISER_TOTAL_HEADERS_LEN (sizeof (struct iser_hdr) + \ + sizeof (struct iscsi_hdr)) + +/* iser_dto_add_regd_buff - increments the reference count for * + * the registered buffer & adds it to the DTO object */ +static void iser_dto_add_regd_buff(struct iser_dto *dto, + struct iser_regd_buf *regd_buf, + unsigned long use_offset, + unsigned long use_size) +{ + int add_idx; + + atomic_inc(&regd_buf->ref_count); + + add_idx = dto->regd_vector_len; + dto->regd[add_idx] = regd_buf; + dto->used_sz[add_idx] = use_size; + dto->offset[add_idx] = use_offset; + + dto->regd_vector_len++; +} + +static int iser_dma_map_task_data(struct iscsi_iser_cmd_task *iser_ctask, + struct iser_data_buf *data, + enum iser_data_dir iser_dir, + enum dma_data_direction dma_dir) +{ + struct device *dma_device; + + iser_ctask->dir[iser_dir] = 1; + dma_device = iser_ctask->iser_conn->ib_conn->device->ib_device->dma_device; + + data->dma_nents = dma_map_sg(dma_device, data->buf, data->size, dma_dir); + if (data->dma_nents == 0) { + iser_err("dma_map_sg failed!!!\n"); + return -EINVAL; + } + return 0; +} + +static void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *iser_ctask) +{ + struct device *dma_device; + struct iser_data_buf *data; + + dma_device = iser_ctask->iser_conn->ib_conn->device->ib_device->dma_device; + + if (iser_ctask->dir[ISER_DIR_IN]) { + data = &iser_ctask->data[ISER_DIR_IN]; + dma_unmap_sg(dma_device, data->buf, data->size, DMA_FROM_DEVICE); + } + + if
(iser_ctask->dir[ISER_DIR_OUT]) { + data = &iser_ctask->data[ISER_DIR_OUT]; + dma_unmap_sg(dma_device, data->buf, data->size, DMA_TO_DEVICE); + } +} + +/* Register user buffer memory and initialize passive rdma + * dto descriptor. Total data size is stored in + * iser_ctask->data[ISER_DIR_IN].data_len + */ +static int iser_prepare_read_cmd(struct iscsi_cmd_task *ctask, + unsigned int edtl) + +{ + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + struct iser_regd_buf *regd_buf; + int err; + struct iser_hdr *hdr = &iser_ctask->desc.iser_header; + struct iser_data_buf *buf_in = &iser_ctask->data[ISER_DIR_IN]; + + err = iser_dma_map_task_data(iser_ctask, + buf_in, + ISER_DIR_IN, + DMA_FROM_DEVICE); + if (err) + return err; + + if (edtl > iser_ctask->data[ISER_DIR_IN].data_len) { + iser_err("Total data length: %ld, less than EDTL: " + "%d, in READ cmd BHS itt: %d, conn: 0x%p\n", + iser_ctask->data[ISER_DIR_IN].data_len, edtl, + ctask->itt, iser_ctask->iser_conn); + return -EINVAL; + } + + err = iser_reg_rdma_mem(iser_ctask,ISER_DIR_IN); + if (err) { + iser_err("Failed to set up Data-IN RDMA\n"); + return err; + } + regd_buf = &iser_ctask->rdma_regd[ISER_DIR_IN]; + + hdr->flags |= ISER_RSV; + hdr->read_stag = cpu_to_be32(regd_buf->reg.rkey); + hdr->read_va = cpu_to_be64(regd_buf->reg.va); + + iser_dbg("Cmd itt:%d READ tags RKEY:%#.4X VA:%#llX\n", + ctask->itt, regd_buf->reg.rkey, + (unsigned long long)regd_buf->reg.va); + + return 0; +} + +/* Register user buffer memory and initialize passive rdma + * dto descriptor. 
Total data size is stored in + * ctask->data[ISER_DIR_OUT].data_len + */ +static int +iser_prepare_write_cmd(struct iscsi_cmd_task *ctask, + unsigned int imm_sz, + unsigned int unsol_sz, + unsigned int edtl) +{ + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + struct iser_regd_buf *regd_buf; + int err; + struct iser_dto *send_dto = &iser_ctask->desc.dto; + struct iser_hdr *hdr = &iser_ctask->desc.iser_header; + struct iser_data_buf *buf_out = &iser_ctask->data[ISER_DIR_OUT]; + + err = iser_dma_map_task_data(iser_ctask, + buf_out, + ISER_DIR_OUT, + DMA_TO_DEVICE); + if (err) + return err; + + if (edtl > iser_ctask->data[ISER_DIR_OUT].data_len) { + iser_err("Total data length: %ld, less than EDTL: %d, " + "in WRITE cmd BHS itt: %d, conn: 0x%p\n", + iser_ctask->data[ISER_DIR_OUT].data_len, + edtl, ctask->itt, ctask->conn); + return -EINVAL; + } + + err = iser_reg_rdma_mem(iser_ctask,ISER_DIR_OUT); + if (err != 0) { + iser_err("Failed to register write cmd RDMA mem\n"); + return err; + } + + regd_buf = &iser_ctask->rdma_regd[ISER_DIR_OUT]; + + if (unsol_sz < edtl) { + hdr->flags |= ISER_WSV; + hdr->write_stag = cpu_to_be32(regd_buf->reg.rkey); + hdr->write_va = cpu_to_be64(regd_buf->reg.va + unsol_sz); + + iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X " + "VA:%#llX + unsol:%d\n", + ctask->itt, regd_buf->reg.rkey, + (unsigned long long)regd_buf->reg.va, unsol_sz); + } + + if (imm_sz > 0) { + iser_dbg("Cmd itt:%d, WRITE, adding imm.data sz: %d\n", + ctask->itt, imm_sz); + iser_dto_add_regd_buff(send_dto, + regd_buf, + 0, + imm_sz); + } + + return 0; +} + +/** + * iser_post_receive_control - allocates, initializes and posts receive DTO. 
+ */ +static int iser_post_receive_control(struct iscsi_conn *conn) +{ + struct iscsi_iser_conn *iser_conn = conn->dd_data; + struct iser_desc *rx_desc; + struct iser_regd_buf *regd_hdr; + struct iser_regd_buf *regd_data; + struct iser_dto *recv_dto = NULL; + struct iser_device *device = iser_conn->ib_conn->device; + int rx_data_size, err = 0; + + rx_desc = kmem_cache_alloc(ig.desc_cache, GFP_KERNEL); + if (rx_desc == NULL) { + iser_err("Failed to alloc desc for post recv\n"); + return -ENOMEM; + } + rx_desc->type = ISCSI_RX; + + /* for the login sequence we must support rx of upto 8K */ + if (conn->c_stage == ISCSI_CONN_INITIAL_STAGE) + rx_data_size = DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH; + else /* FIXME till user space sets conn->max_recv_dlength correctly */ + rx_data_size = 128; + + rx_desc->data = kmalloc(rx_data_size, GFP_KERNEL); + if (rx_desc->data == NULL) { + iser_err("Failed to alloc data buf for post recv\n"); + err = -ENOMEM; + goto post_rx_kmalloc_failure; + } + + recv_dto = &rx_desc->dto; + recv_dto->conn = iser_conn; + recv_dto->regd_vector_len = 0; + + regd_hdr = &rx_desc->hdr_regd_buf; + memset(regd_hdr, 0, sizeof(struct iser_regd_buf)); + regd_hdr->device = device; + regd_hdr->virt_addr = rx_desc; /* == &rx_desc->iser_header */ + regd_hdr->data_size = ISER_TOTAL_HEADERS_LEN; + + iser_reg_single(device, regd_hdr, DMA_FROM_DEVICE); + + iser_dto_add_regd_buff(recv_dto, regd_hdr, 0, 0); + + regd_data = &rx_desc->data_regd_buf; + memset(regd_data, 0, sizeof(struct iser_regd_buf)); + regd_data->device = device; + regd_data->virt_addr = rx_desc->data; + regd_data->data_size = rx_data_size; + + iser_reg_single(device, regd_data, DMA_FROM_DEVICE); + + iser_dto_add_regd_buff(recv_dto, regd_data, 0, 0); + + err = iser_post_recv(rx_desc); + if (!err) + return 0; + + /* iser_post_recv failed */ + iser_dto_buffs_release(recv_dto); + kfree(rx_desc->data); +post_rx_kmalloc_failure: + kmem_cache_free(ig.desc_cache, rx_desc); + return err; +} + +/* creates a new 
tx descriptor and adds header regd buffer */ +static void iser_create_send_desc(struct iscsi_iser_conn *iser_conn, + struct iser_desc *tx_desc) +{ + struct iser_regd_buf *regd_hdr = &tx_desc->hdr_regd_buf; + struct iser_dto *send_dto = &tx_desc->dto; + + memset(regd_hdr, 0, sizeof(struct iser_regd_buf)); + regd_hdr->device = iser_conn->ib_conn->device; + regd_hdr->virt_addr = tx_desc; /* == &tx_desc->iser_header */ + regd_hdr->data_size = ISER_TOTAL_HEADERS_LEN; + + send_dto->conn = iser_conn; + send_dto->notify_enable = 1; + send_dto->regd_vector_len = 0; + + memset(&tx_desc->iser_header, 0, sizeof(struct iser_hdr)); + tx_desc->iser_header.flags = ISER_VER; + + iser_dto_add_regd_buff(send_dto, regd_hdr, 0, 0); +} + +/** + * iser_conn_set_full_featured_mode - (iSER API) + */ +int iser_conn_set_full_featured_mode(struct iscsi_conn *conn) +{ + struct iscsi_iser_conn *iser_conn = conn->dd_data; + + int i; + /* no need to keep it in a var, we are after login so if this should + * be negotiated, by now the result should be available here */ + int initial_post_recv_bufs_num = ISER_MAX_RX_MISC_PDUS; + + iser_dbg("Initially post: %d\n", initial_post_recv_bufs_num); + + /* Check that there is no posted recv or send buffers left - */ + /* they must be consumed during the login phase */ + if (atomic_read(&iser_conn->ib_conn->post_recv_buf_count) != 0) + iser_bug("Number of currently posted recv bufs non-zero\n"); + if (atomic_read(&iser_conn->ib_conn->post_send_buf_count) != 0) + iser_bug("Number of currently posted send bufs non-zero\n"); + + /* Initial post receive buffers */ + for (i = 0; i < initial_post_recv_bufs_num; i++) { + if (iser_post_receive_control(conn) != 0) { + iser_err("Failed to post recv bufs at:%d conn:0x%p\n", + i, conn); + return -ENOMEM; + } + } + iser_dbg("Posted %d post recv bufs, conn:0x%p\n", i, conn); + return 0; +} + +static int +iser_check_xmit(struct iscsi_conn *conn, void *task) +{ + int rc = 0; + struct iscsi_iser_conn *iser_conn = 
conn->dd_data; + + write_lock_bh(conn->recv_lock); + if (atomic_read(&iser_conn->ib_conn->post_send_buf_count) == + ISER_QP_MAX_REQ_DTOS) { + iser_dbg("%ld can't xmit task %p, suspending tx\n",jiffies,task); + set_bit(ISCSI_SUSPEND_BIT, &conn->suspend_tx); + rc = -EAGAIN; + } + write_unlock_bh(conn->recv_lock); + return rc; +} + + +/** + * iser_send_command - send command PDU + */ +int iser_send_command(struct iscsi_conn *conn, + struct iscsi_cmd_task *ctask) +{ + struct iscsi_iser_conn *iser_conn = conn->dd_data; + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + struct iser_dto *send_dto = NULL; + unsigned long edtl; + int err = 0; + struct iser_data_buf *data_buf; + + struct iscsi_cmd *hdr = ctask->hdr; + struct scsi_cmnd *sc = ctask->sc; + + if (atomic_read(&iser_conn->ib_conn->state) != ISER_CONN_UP) { + iser_err("Failed to send, conn: 0x%p is not up\n", iser_conn->ib_conn); + return -EPERM; + } + if (iser_check_xmit(conn, ctask)) + return -EAGAIN; + + edtl = ntohl(hdr->data_length); + + /* build the tx desc regd header and add it to the tx desc dto */ + iser_ctask->desc.type = ISCSI_TX_SCSI_COMMAND; + send_dto = &iser_ctask->desc.dto; + send_dto->ctask = iser_ctask; + iser_create_send_desc(iser_conn, &iser_ctask->desc); + + if (hdr->flags & ISCSI_FLAG_CMD_READ) + data_buf = &iser_ctask->data[ISER_DIR_IN]; + else + data_buf = &iser_ctask->data[ISER_DIR_OUT]; + + if (sc->use_sg) { /* using a scatter list */ + data_buf->buf = sc->request_buffer; + data_buf->size = sc->use_sg; + } else { /* using a single buffer - convert it into one entry SG */ + sg_init_one(&data_buf->sg_single, + sc->request_buffer, sc->request_bufflen); + data_buf->buf = &data_buf->sg_single; + data_buf->size = 1; + } + + data_buf->data_len = sc->request_bufflen; + + if (hdr->flags & ISCSI_FLAG_CMD_READ) { + err = iser_prepare_read_cmd(ctask, edtl); + if (err) + goto send_command_error; + } + if (hdr->flags & ISCSI_FLAG_CMD_WRITE) { + err = iser_prepare_write_cmd(ctask, + 
ctask->imm_count, + ctask->imm_count + + ctask->unsol_count, + edtl); + if (err) + goto send_command_error; + } + + iser_reg_single(iser_conn->ib_conn->device, + send_dto->regd[0], DMA_TO_DEVICE); + + if (iser_post_receive_control(conn) != 0) { + iser_err("post_recv failed!\n"); + err = -ENOMEM; + goto send_command_error; + } + + iser_ctask->status = ISER_TASK_STATUS_STARTED; + + err = iser_post_send(&iser_ctask->desc); + if (!err) + return 0; + +send_command_error: + iser_dto_buffs_release(send_dto); + iser_err("conn %p failed ctask->itt %d err %d\n",conn, ctask->itt, err); + return err; +} + +/** + * iser_send_data_out - send data out PDU + */ +int iser_send_data_out(struct iscsi_conn *conn, + struct iscsi_cmd_task *ctask, + struct iscsi_data *hdr) +{ + struct iscsi_iser_conn *iser_conn = conn->dd_data; + struct iscsi_iser_cmd_task *iser_ctask = ctask->dd_data; + struct iser_desc *tx_desc = NULL; + struct iser_dto *send_dto = NULL; + unsigned long buf_offset; + unsigned long data_seg_len; + unsigned int itt; + int err = 0; + + if (atomic_read(&iser_conn->ib_conn->state) != ISER_CONN_UP) { + iser_err("Failed to send, conn: 0x%p is not up\n", iser_conn->ib_conn); + return -EPERM; + } + + if (iser_check_xmit(conn, ctask)) + return -EAGAIN; + + itt = ntohl(hdr->itt); + data_seg_len = ntoh24(hdr->dlength); + buf_offset = ntohl(hdr->offset); + + iser_dbg("%s itt %d dseg_len %d offset %d\n", + __func__,(int)itt,(int)data_seg_len,(int)buf_offset); + + tx_desc = kmem_cache_alloc(ig.desc_cache, GFP_KERNEL); + if (tx_desc == NULL) { + iser_err("Failed to alloc desc for post dataout\n"); + return -ENOMEM; + } + + tx_desc->type = ISCSI_TX_DATAOUT; + memcpy(&tx_desc->iscsi_header, hdr, sizeof(struct iscsi_hdr)); + + /* build the tx desc regd header and add it to the tx desc dto */ + send_dto = &tx_desc->dto; + send_dto->ctask = iser_ctask; + iser_create_send_desc(iser_conn, tx_desc); + + iser_reg_single(iser_conn->ib_conn->device, + send_dto->regd[0], DMA_TO_DEVICE); + + /* 
all data was registered for RDMA, we can use the lkey */ + iser_dto_add_regd_buff(send_dto, + &iser_ctask->rdma_regd[ISER_DIR_OUT], + buf_offset, + data_seg_len); + + if (buf_offset + data_seg_len > iser_ctask->data[ISER_DIR_OUT].data_len) { + iser_err("Offset:%ld & DSL:%ld in Data-Out " + "inconsistent with total len:%ld, itt:%d\n", + buf_offset, data_seg_len, + iser_ctask->data[ISER_DIR_OUT].data_len, itt); + err = -EINVAL; + goto send_data_out_error; + } + iser_dbg("data-out itt: %d, offset: %ld, sz: %ld\n", + itt, buf_offset, data_seg_len); + + + err = iser_post_send(tx_desc); + if (!err) + return 0; + +send_data_out_error: + iser_dto_buffs_release(send_dto); + kmem_cache_free(ig.desc_cache, tx_desc); + iser_err("conn %p failed err %d\n",conn, err); + return err; +} + +int iser_send_control(struct iscsi_conn *conn, + struct iscsi_mgmt_task *mtask) +{ + struct iscsi_iser_conn *iser_conn = conn->dd_data; + struct iser_desc *mdesc = mtask->dd_data; + struct iser_dto *send_dto = NULL; + unsigned int itt; + unsigned long data_seg_len; + int err = 0; + unsigned char opcode; + struct iser_regd_buf *regd_buf; + struct iser_device *device; + + if (atomic_read(&iser_conn->ib_conn->state) != ISER_CONN_UP) { + iser_err("Failed to send, conn: 0x%p is not up\n", iser_conn->ib_conn); + return -EPERM; + } + + if (iser_check_xmit(conn,mtask)) + return -EAGAIN; + + /* build the tx desc regd header and add it to the tx desc dto */ + mdesc->type = ISCSI_TX_CONTROL; + send_dto = &mdesc->dto; + send_dto->ctask = NULL; + iser_create_send_desc(iser_conn, mdesc); + + device = iser_conn->ib_conn->device; + + iser_reg_single(device, send_dto->regd[0], DMA_TO_DEVICE); + + itt = ntohl(mtask->hdr->itt); + opcode = mtask->hdr->opcode & ISCSI_OPCODE_MASK; + data_seg_len = ntoh24(mtask->hdr->dlength); + + if (data_seg_len > 0) { + regd_buf = &mdesc->data_regd_buf; + memset(regd_buf, 0, sizeof(struct iser_regd_buf)); + regd_buf->device = device; + regd_buf->virt_addr = mtask->data; + 
regd_buf->data_size = mtask->data_count; + iser_reg_single(device, regd_buf, + DMA_TO_DEVICE); + iser_dto_add_regd_buff(send_dto, regd_buf, + 0, + data_seg_len); + } + + if (iser_post_receive_control(conn) != 0) { + iser_err("post_rcv_buff failed!\n"); + err = -ENOMEM; + goto send_control_error; + } + + err = iser_post_send(mdesc); + if (!err) + return 0; + +send_control_error: + iser_dto_buffs_release(send_dto); + iser_err("conn %p failed err %d\n",conn, err); + return err; +} + +/** + * iser_rcv_dto_completion - recv DTO completion + */ +void iser_rcv_completion(struct iser_desc *rx_desc, + unsigned long dto_xfer_len) +{ + struct iser_dto *dto = &rx_desc->dto; + struct iscsi_iser_conn *conn = dto->conn; + struct iscsi_session *session = conn->iscsi_conn->session; + struct iscsi_cmd_task *ctask; + struct iscsi_iser_cmd_task *iser_ctask; + struct iscsi_hdr *hdr; + char *rx_data = NULL; + int rx_data_len = 0; + unsigned int itt; + unsigned char opcode; + + hdr = &rx_desc->iscsi_header; + + iser_dbg("op 0x%x itt 0x%x\n", hdr->opcode,hdr->itt); + + if (dto_xfer_len > ISER_TOTAL_HEADERS_LEN) { /* we have data */ + rx_data_len = dto_xfer_len - ISER_TOTAL_HEADERS_LEN; + rx_data = dto->regd[1]->virt_addr; + rx_data += dto->offset[1]; + } + + opcode = hdr->opcode & ISCSI_OPCODE_MASK; + + if (opcode == ISCSI_OP_SCSI_CMD_RSP) { + itt = hdr->itt & ISCSI_ITT_MASK; /* mask out cid and age bits */ + if (!(itt < session->cmds_max)) + iser_err("itt can't be matched to task!!!" 
+ "conn %p opcode %d cmds_max %d itt %d\n", + conn->iscsi_conn,opcode,session->cmds_max,itt); + /* use the mapping given with the cmds array indexed by itt */ + ctask = (struct iscsi_cmd_task *)session->cmds[itt]; + iser_ctask = ctask->dd_data; + iser_dbg("itt %d ctask %p\n",itt,ctask); + iser_ctask->status = ISER_TASK_STATUS_COMPLETED; + iser_ctask_rdma_finalize(iser_ctask); + } + + iser_dto_buffs_release(dto); + + iscsi_iser_recv(conn->iscsi_conn, hdr, rx_data, rx_data_len); + + kfree(rx_desc->data); + kmem_cache_free(ig.desc_cache, rx_desc); + + /* decrementing conn->post_recv_buf_count only --after-- freeing the * + * task eliminates the need to worry on tasks which are completed in * + * parallel to the execution of iser_conn_term. So the code that waits * + * for the posted rx bufs refcount to become zero handles everything */ + atomic_dec(&conn->ib_conn->post_recv_buf_count); +} + +void iser_snd_completion(struct iser_desc *tx_desc) +{ + struct iser_dto *dto = &tx_desc->dto; + struct iscsi_iser_conn *iser_conn = dto->conn; + struct iscsi_conn *conn = iser_conn->iscsi_conn; + struct iscsi_mgmt_task *mtask; + + iser_dbg("Initiator, Data sent dto=0x%p\n", dto); + + iser_dto_buffs_release(dto); + + if (tx_desc->type == ISCSI_TX_DATAOUT) + kmem_cache_free(ig.desc_cache, tx_desc); + + atomic_dec(&iser_conn->ib_conn->post_send_buf_count); + + write_lock(conn->recv_lock); + if (conn->suspend_tx) { + iser_dbg("%ld resuming tx\n",jiffies); + clear_bit(ISCSI_SUSPEND_BIT, &conn->suspend_tx); + scsi_queue_work(conn->session->host, &conn->xmitwork); + } + write_unlock(conn->recv_lock); + + if (tx_desc->type == ISCSI_TX_CONTROL) { + /* this arithmetic is legal by libiscsi dd_data allocation */ + mtask = (void *) ((long)(void *)tx_desc - + sizeof(struct iscsi_mgmt_task)); + if (mtask->hdr->itt == cpu_to_be32(ISCSI_RESERVED_TAG)) { + struct iscsi_session *session = conn->session; + + spin_lock(&conn->session->lock); + list_del(&mtask->running); + 
__kfifo_put(session->mgmtpool.queue, (void*)&mtask, + sizeof(void*)); + spin_unlock(&session->lock); + } + } +} + +void iser_ctask_rdma_init(struct iscsi_iser_cmd_task *iser_ctask) + +{ + iser_ctask->status = ISER_TASK_STATUS_INIT; + + iser_ctask->dir[ISER_DIR_IN] = 0; + iser_ctask->dir[ISER_DIR_OUT] = 0; + + iser_ctask->data[ISER_DIR_IN].data_len = 0; + iser_ctask->data[ISER_DIR_OUT].data_len = 0; + + memset(&iser_ctask->rdma_regd[ISER_DIR_IN], 0, + sizeof(struct iser_regd_buf)); + memset(&iser_ctask->rdma_regd[ISER_DIR_OUT], 0, + sizeof(struct iser_regd_buf)); +} + +void iser_ctask_rdma_finalize(struct iscsi_iser_cmd_task *iser_ctask) +{ + int deferred; + + /* if we were reading, copy back to unaligned sglist, + * anyway dma_unmap and free the copy + */ + if (iser_ctask->data_copy[ISER_DIR_IN].copy_buf != NULL) + iser_finalize_rdma_unaligned_sg(iser_ctask, ISER_DIR_IN); + if (iser_ctask->data_copy[ISER_DIR_OUT].copy_buf != NULL) + iser_finalize_rdma_unaligned_sg(iser_ctask, ISER_DIR_OUT); + + if (iser_ctask->dir[ISER_DIR_IN]) { + deferred = iser_regd_buff_release + (&iser_ctask->rdma_regd[ISER_DIR_IN]); + if (deferred) + iser_bug("References remain for BUF-IN rdma reg\n"); + } + + if (iser_ctask->dir[ISER_DIR_OUT]) { + deferred = iser_regd_buff_release + (&iser_ctask->rdma_regd[ISER_DIR_OUT]); + if (deferred) + iser_bug("References remain for BUF-OUT rdma reg\n"); + } + + iser_dma_unmap_task_data(iser_ctask); +} + +void iser_dto_buffs_release(struct iser_dto *dto) +{ + int i; + + for (i = 0; i < dto->regd_vector_len; i++) + iser_regd_buff_release(dto->regd[i]); +} + From ogerlitz at voltaire.com Thu Apr 27 05:32:55 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 27 Apr 2006 15:32:55 +0300 (IDT) Subject: [openib-general] [PATCH 5/6] iser RDMA CM (CMA) and IB verbs interaction In-Reply-To: Message-ID: This code does the low-level work with the IB verbs and the CMA, e.g. + establish/disconnect the iser connection + create/destroy IB resources: PD, DMA MR, CQ,
QP, FMR pool + do fast registration (FMR) of SG list associated with the SC + post rx and tx requests to the QP and reap completions from the CQ Signed-off-by: Or Gerlitz --- /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser-x/iser_verbs.c 1970-01-01 02:00:00.000000000 +0200 +++ /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser/iser_verbs.c 2006-04-26 12:50:11.000000000 +0300 @@ -0,0 +1,804 @@ +/* + * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: iser_verbs.c 6643 2006-04-26 10:01:01Z ogerlitz $ + */ +#include +#include +#include +#include +#include +#include + +#include "iscsi_iser.h" + +#define ISCSI_ISER_MAX_CONN 8 +#define ISER_MAX_CQ_LEN ((ISER_QP_MAX_RECV_DTOS + \ + ISER_QP_MAX_REQ_DTOS) * \ + ISCSI_ISER_MAX_CONN) + +static void iser_cq_tasklet_fn(unsigned long data); +static void iser_cq_callback(struct ib_cq *cq, void *cq_context); +static void iser_comp_error_worker(void *data); + +static void iser_cq_event_callback(struct ib_event *cause, void *context) +{ + iser_err("got cq event %d \n", cause->event); +} + +static void iser_qp_event_callback(struct ib_event *cause, void *context) +{ + iser_err("got qp event %d\n",cause->event); +} + +/** + * iser_create_device_ib_res - creates Protection Domain (PD), Completion + * Queue (CQ), DMA Memory Region (DMA MR) with the device associated with + * the adapator. + * + * returns 0 on success, -1 on failure + */ +static int iser_create_device_ib_res(struct iser_device *device) +{ + device->pd = ib_alloc_pd(device->ib_device); + if (IS_ERR(device->pd)) + goto pd_err; + + device->cq = ib_create_cq(device->ib_device, + iser_cq_callback, + iser_cq_event_callback, + (void *)device, + ISER_MAX_CQ_LEN); + if (IS_ERR(device->cq)) + goto cq_err; + + if (ib_req_notify_cq(device->cq, IB_CQ_NEXT_COMP)) + goto cq_arm_err; + + tasklet_init(&device->cq_tasklet, + iser_cq_tasklet_fn, + (unsigned long)device); + + device->mr = ib_get_dma_mr(device->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(device->mr)) + goto dma_mr_err; + + return 0; + +dma_mr_err: + tasklet_kill(&device->cq_tasklet); +cq_arm_err: + ib_destroy_cq(device->cq); +cq_err: + ib_dealloc_pd(device->pd); +pd_err: + iser_err("failed to allocate an IB resource\n"); + return -1; +} + +/** + * iser_free_device_ib_res - destory/dealloc/dereg the DMA MR, + * CQ and PD created with the device associated with the adapator. 
+ * + * returns 0 on success, -1 on failure + */ +static int iser_free_device_ib_res(struct iser_device *device) +{ + BUG_ON(device->mr == NULL); + + tasklet_kill(&device->cq_tasklet); + + (void)ib_dereg_mr(device->mr); + (void)ib_destroy_cq(device->cq); + (void)ib_dealloc_pd(device->pd); + + device->mr = NULL; + device->cq = NULL; + device->pd = NULL; + return 0; +} + +/** + * iser_create_ib_conn_res - Creates FMR pool and Queue-Pair (QP) + * + * returns 0 on success, -1 on failure + */ +static int iser_create_ib_conn_res(struct iser_conn *ib_conn) +{ + struct iser_device *device; + struct ib_qp_init_attr init_attr; + int ret; + struct ib_fmr_pool_param params; + + BUG_ON(ib_conn->device == NULL); + + device = ib_conn->device; + + ib_conn->page_vec = kmalloc(sizeof(struct iser_page_vec) + + (sizeof(u64) * (ISCSI_ISER_SG_TABLESIZE +1)), + GFP_KERNEL); + if (!ib_conn->page_vec) { + ret = -ENOMEM; + goto alloc_err; + } + ib_conn->page_vec->pages = (u64 *) (ib_conn->page_vec + 1); + + params.page_shift = PAGE_SHIFT; + /* when the first/last SG element are not start/end * + * page aligned, the map would be of N+1 pages */ + params.max_pages_per_fmr = ISCSI_ISER_SG_TABLESIZE + 1; + /* make the pool size twice the max number of SCSI commands * + * the ML is expected to queue, watermark for unmap at 50% */ + params.pool_size = ISCSI_XMIT_CMDS_MAX * 2; + params.dirty_watermark = ISCSI_XMIT_CMDS_MAX; + params.cache = 0; + params.flush_function = NULL; + params.access = (IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE | + IB_ACCESS_REMOTE_READ); + + ib_conn->fmr_pool = ib_create_fmr_pool(device->pd, &params); + if (IS_ERR(ib_conn->fmr_pool)) { + ret = PTR_ERR(ib_conn->fmr_pool); + goto fmr_pool_err; + } + + memset(&init_attr, 0, sizeof init_attr); + + init_attr.event_handler = iser_qp_event_callback; + init_attr.qp_context = (void *)ib_conn; + init_attr.send_cq = device->cq; + init_attr.recv_cq = device->cq; + init_attr.cap.max_send_wr = ISER_QP_MAX_REQ_DTOS; +
init_attr.cap.max_recv_wr = ISER_QP_MAX_RECV_DTOS; + init_attr.cap.max_send_sge = MAX_REGD_BUF_VECTOR_LEN; + init_attr.cap.max_recv_sge = 2; + init_attr.sq_sig_type = IB_SIGNAL_REQ_WR; + init_attr.qp_type = IB_QPT_RC; + + ret = rdma_create_qp(ib_conn->cma_id, device->pd, &init_attr); + if (ret) + goto qp_err; + + ib_conn->qp = ib_conn->cma_id->qp; + iser_err("setting conn %p cma_id %p: fmr_pool %p qp %p\n", + ib_conn, ib_conn->cma_id, + ib_conn->fmr_pool, ib_conn->cma_id->qp); + return ret; + +qp_err: + (void)ib_destroy_fmr_pool(ib_conn->fmr_pool); +fmr_pool_err: + kfree(ib_conn->page_vec); +alloc_err: + iser_err("unable to alloc mem or create resource, err %d\n", ret); + return ret; +} + +/** + * releases the FMR pool, QP and CMA ID objects, returns 0 on success, + * -1 on failure + */ +static int iser_free_ib_conn_res(struct iser_conn *ib_conn) +{ + BUG_ON(ib_conn == NULL); + + iser_err("freeing conn %p cma_id %p fmr pool %p qp %p\n", + ib_conn, ib_conn->cma_id, + ib_conn->fmr_pool, ib_conn->qp); + + /* qp is created only once both addr & route are resolved */ + if (ib_conn->fmr_pool != NULL) + ib_destroy_fmr_pool(ib_conn->fmr_pool); + + if (ib_conn->qp != NULL) + rdma_destroy_qp(ib_conn->cma_id); + + if (ib_conn->cma_id != NULL) + rdma_destroy_id(ib_conn->cma_id); + + ib_conn->fmr_pool = NULL; + ib_conn->qp = NULL; + ib_conn->cma_id = NULL; + kfree(ib_conn->page_vec); + + return 0; +} + +/** + * based on the resolved device node GUID see if there already allocated + * device for this device. If there's no such, create one. 
+ */ +static +struct iser_device *iser_device_find_by_ib_device(struct rdma_cm_id *cma_id) +{ + struct list_head *p_list; + struct iser_device *device = NULL; + + mutex_lock(&ig.device_list_mutex); + + p_list = ig.device_list.next; + while (p_list != &ig.device_list) { + device = list_entry(p_list, struct iser_device, ig_list); + /* find if there's a match using the node GUID */ + if (device->ib_device->node_guid == cma_id->device->node_guid) + break; + } + + if (device == NULL) { + device = kzalloc(sizeof *device, GFP_KERNEL); + if (device == NULL) + goto end; + /* assign this device to the device */ + device->ib_device = cma_id->device; + /* init the device and link it into ig device list */ + if (iser_create_device_ib_res(device)) { + kfree(device); + device = NULL; + goto end; + } + list_add(&device->ig_list, &ig.device_list); + } +end: + BUG_ON(device == NULL); + device->refcount++; + mutex_unlock(&ig.device_list_mutex); + return device; +} + +/* if there's no demand for this device, release it */ +static void iser_device_try_release(struct iser_device *device) +{ + mutex_lock(&ig.device_list_mutex); + device->refcount--; + iser_err("device %p refcount %d\n",device,device->refcount); + if (!device->refcount) { + iser_free_device_ib_res(device); + list_del(&device->ig_list); + kfree(device); + } + mutex_unlock(&ig.device_list_mutex); +} + +/** + * triggers start of the disconnect procedures and wait for them to be done + */ +void iser_conn_terminate(struct iser_conn *ib_conn) +{ + int err = 0; + + atomic_set(&ib_conn->state, ISER_CONN_TERMINATING); + err = rdma_disconnect(ib_conn->cma_id); + if (err) + iser_bug("Failed to disconnect, conn: 0x%p err %d\n",ib_conn,err); + wait_event_interruptible(ib_conn->wait, + (atomic_read(&ib_conn->state) == ISER_CONN_DOWN)); + + mutex_lock(&ig.connlist_mutex); + list_del(&ib_conn->conn_list); + mutex_unlock(&ig.connlist_mutex); + + iser_conn_release(ib_conn); +} + +static void iser_connect_error(struct rdma_cm_id *cma_id) +{ 
+ struct iser_conn *ib_conn; + ib_conn = (struct iser_conn *)cma_id->context; + + if (atomic_read(&ib_conn->state) == ISER_CONN_PENDING) { + atomic_set(&ib_conn->state, ISER_CONN_DOWN); + wake_up_interruptible(&ib_conn->wait); + } else + iser_err("Unexpected evt for conn.state: %d\n", + atomic_read(&ib_conn->state)); +} + +static void iser_addr_handler(struct rdma_cm_id *cma_id) +{ + struct iser_device *device; + struct iser_conn *ib_conn; + int ret; + + device = iser_device_find_by_ib_device(cma_id); + ib_conn = (struct iser_conn *)cma_id->context; + ib_conn->device = device; + + ret = rdma_resolve_route(cma_id, 1000); + if (ret) { + iser_err("resolve route failed: %d\n", ret); + iser_connect_error(cma_id); + } + return; +} + +static void iser_route_handler(struct rdma_cm_id *cma_id) +{ + struct rdma_conn_param conn_param; + int ret; + + ret = iser_create_ib_conn_res((struct iser_conn *)cma_id->context); + if (ret) + goto failure; + + iser_dbg("path.mtu is %d setting it to %d\n", + cma_id->route.path_rec->mtu, IB_MTU_1024); + + /* we must set the MTU to 1024 as this is what the target is assuming */ + if (cma_id->route.path_rec->mtu > IB_MTU_1024) + cma_id->route.path_rec->mtu = IB_MTU_1024; + + memset(&conn_param, 0, sizeof conn_param); + conn_param.responder_resources = 4; + conn_param.initiator_depth = 1; + conn_param.retry_count = 7; + conn_param.rnr_retry_count = 6; + + ret = rdma_connect(cma_id, &conn_param); + if (ret) { + iser_err("failure connecting: %d\n", ret); + goto failure; + } + + return; +failure: + iser_connect_error(cma_id); +} + +static void iser_connected_handler(struct rdma_cm_id *cma_id) +{ + struct iser_conn *ib_conn; + + ib_conn = (struct iser_conn *)cma_id->context; + atomic_set(&ib_conn->state, ISER_CONN_UP); + wake_up_interruptible(&ib_conn->wait); +} + +static void iser_disconnected_handler(struct rdma_cm_id *cma_id) +{ + struct iser_conn *ib_conn; + + ib_conn = (struct iser_conn *)cma_id->context; + ib_conn->disc_evt_flag = 1; + + /* 
If this event is unsolicited this means that the conn is being */ + /* terminated asynchronously from the iSCSI layer's perspective. */ + if (atomic_read(&ib_conn->state) == ISER_CONN_PENDING) { + atomic_set(&ib_conn->state, ISER_CONN_DOWN); + wake_up_interruptible(&ib_conn->wait); + } else { + if (atomic_read(&ib_conn->state) == ISER_CONN_UP) { + atomic_set(&ib_conn->state, ISER_CONN_TERMINATING); + iscsi_conn_failure(ib_conn->iser_conn->iscsi_conn, + ISCSI_ERR_CONN_FAILED); + } + /* Complete the termination process if no posts are pending */ + if ((atomic_read(&ib_conn->post_recv_buf_count) == 0) && + (atomic_read(&ib_conn->post_send_buf_count) == 0)) { + atomic_set(&ib_conn->state, ISER_CONN_DOWN); + wake_up_interruptible(&ib_conn->wait); + } + } +} + +static int iser_cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) +{ + int ret = 0; + + iser_err("event %d conn %p id %p\n",event->event,cma_id->context,cma_id); + + switch (event->event) { + case RDMA_CM_EVENT_ADDR_RESOLVED: + iser_addr_handler(cma_id); + break; + case RDMA_CM_EVENT_ROUTE_RESOLVED: + iser_route_handler(cma_id); + break; + case RDMA_CM_EVENT_ESTABLISHED: + iser_connected_handler(cma_id); + break; + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_ROUTE_ERROR: + case RDMA_CM_EVENT_CONNECT_ERROR: + case RDMA_CM_EVENT_UNREACHABLE: + case RDMA_CM_EVENT_REJECTED: + iser_err("event: %d, error: %d\n", event->event, event->status); + iser_connect_error(cma_id); + break; + case RDMA_CM_EVENT_DISCONNECTED: + iser_disconnected_handler(cma_id); + break; + case RDMA_CM_EVENT_DEVICE_REMOVAL: + iser_bug("device removal is not handled yet\n"); + break; + case RDMA_CM_EVENT_CONNECT_RESPONSE: + iser_bug("not expecting cma to deliver the REP!!!\n"); + break; + case RDMA_CM_EVENT_CONNECT_REQUEST: + default: + break; + } + return ret; +} + +int iser_conn_init(struct iser_conn **ibconn) +{ + struct iser_conn *ib_conn; + + ib_conn = kzalloc(sizeof *ib_conn, GFP_KERNEL); + if (!ib_conn) { + 
iser_err("can't alloc memory for struct iser_conn\n"); + return -ENOMEM; + } + atomic_set(&ib_conn->state, ISER_CONN_INIT); + init_waitqueue_head(&ib_conn->wait); + atomic_set(&ib_conn->post_recv_buf_count, 0); + atomic_set(&ib_conn->post_send_buf_count, 0); + INIT_WORK(&ib_conn->comperror_work, iser_comp_error_worker, + ib_conn); + + *ibconn = ib_conn; + return 0; +} + + /** + * starts the process of connecting to the target + * sleeps untill the connection is established or rejected + */ +int iser_connect(struct iser_conn *ib_conn, + struct sockaddr_in *src_addr, + struct sockaddr_in *dst_addr, + int non_blocking) +{ + struct sockaddr *src, *dst; + int err = 0; + + sprintf(ib_conn->name,"%d.%d.%d.%d:%d", + NIPQUAD(dst_addr->sin_addr.s_addr), dst_addr->sin_port); + + /* the device is known only --after-- address resolution */ + ib_conn->device = NULL; + + iser_err("connecting to: %d.%d.%d.%d, port 0x%x\n", + NIPQUAD(dst_addr->sin_addr), dst_addr->sin_port); + + atomic_set(&ib_conn->state, ISER_CONN_PENDING); + + ib_conn->cma_id = rdma_create_id(iser_cma_handler, + (void *)ib_conn, + RDMA_PS_TCP); + if (IS_ERR(ib_conn->cma_id)) { + err = PTR_ERR(ib_conn->cma_id); + iser_err("rdma_create_id failed: %d\n", err); + goto id_failure; + } + + src = (struct sockaddr *)src_addr; + dst = (struct sockaddr *)dst_addr; + err = rdma_resolve_addr(ib_conn->cma_id, src, dst, 1000); + if (err) { + iser_err("rdma_resolve_addr failed: %d\n", err); + goto addr_failure; + } + + if (!non_blocking) { + wait_event_interruptible(ib_conn->wait, + atomic_read(&ib_conn->state) != ISER_CONN_PENDING); + + if (atomic_read(&ib_conn->state) != ISER_CONN_UP) { + err = -EIO; + goto connect_failure; + } + } + + mutex_lock(&ig.connlist_mutex); + list_add(&ib_conn->conn_list, &ig.connlist); + mutex_unlock(&ig.connlist_mutex); + return 0; + +id_failure: + ib_conn->cma_id = NULL; +addr_failure: + atomic_set(&ib_conn->state, ISER_CONN_DOWN); +connect_failure: + iser_conn_release(ib_conn); + return err; +} 
+ +/** + * Frees all conn objects and deallocs conn descriptor + */ +void iser_conn_release(struct iser_conn *ib_conn) +{ + struct iser_device *device = ib_conn->device; + + BUG_ON(atomic_read(&ib_conn->state) != ISER_CONN_DOWN); + + iser_free_ib_conn_res(ib_conn); + ib_conn->device = NULL; + /* on EVENT_ADDR_ERROR there's no device yet for this conn */ + if (device != NULL) + iser_device_try_release(device); + kfree(ib_conn); +} + + +/** + * iser_reg_page_vec - Register physical memory + * + * returns: 0 on success, errno code on failure + */ +int iser_reg_page_vec(struct iser_conn *ib_conn, + struct iser_page_vec *page_vec, + struct iser_mem_reg *mem_reg) +{ + struct ib_pool_fmr *mem; + u64 io_addr; + u64 *page_list; + int status; + + page_list = page_vec->pages; + io_addr = page_list[0]; + + mem = ib_fmr_pool_map_phys(ib_conn->fmr_pool, + page_list, + page_vec->length, + &io_addr); + + if (IS_ERR(mem)) { + status = (int)PTR_ERR(mem); + iser_err("ib_fmr_pool_map_phys failed: %d\n", status); + return status; + } + + mem_reg->lkey = mem->fmr->lkey; + mem_reg->rkey = mem->fmr->rkey; + mem_reg->len = page_vec->length * PAGE_SIZE; + mem_reg->va = io_addr; + mem_reg->mem_h = (void *)mem; + + mem_reg->va += page_vec->offset; + mem_reg->len = page_vec->data_size; + + iser_dbg("PHYSICAL Mem.register, [PHYS p_array: 0x%p, sz: %d, " + "entry[0]: (0x%08lx,%ld)] -> " + "[lkey: 0x%08X mem_h: 0x%p va: 0x%08lX sz: %ld]\n", + page_vec, page_vec->length, + (unsigned long)page_vec->pages[0], + (unsigned long)page_vec->data_size, + (unsigned int)mem_reg->lkey, mem_reg->mem_h, + (unsigned long)mem_reg->va, (unsigned long)mem_reg->len); + return 0; +} + +/** + * Unregister (previosuly registered) memory. 
+ */ +void iser_unreg_mem(struct iser_mem_reg *reg) +{ + int ret; + + iser_dbg("PHYSICAL Mem.Unregister mem_h %p\n",reg->mem_h); + + ret = ib_fmr_pool_unmap((struct ib_pool_fmr *)reg->mem_h); + if (ret) + iser_err("ib_fmr_pool_unmap failed %d\n", ret); + + reg->mem_h = NULL; +} + +/** + * iser_dto_to_iov - builds IOV from a dto descriptor + */ +static void iser_dto_to_iov(struct iser_dto *dto, struct ib_sge *iov, int iov_len) +{ + int i; + struct ib_sge *sge; + struct iser_regd_buf *regd_buf; + + if (dto->regd_vector_len > iov_len) + iser_bug("iov size %d too small for posting dto of len %d\n", + iov_len, dto->regd_vector_len); + + for (i = 0; i < dto->regd_vector_len; i++) { + sge = &iov[i]; + regd_buf = dto->regd[i]; + + sge->addr = regd_buf->reg.va; + sge->length = regd_buf->reg.len; + sge->lkey = regd_buf->reg.lkey; + + if (dto->used_sz[i] > 0) /* Adjust size */ + sge->length = dto->used_sz[i]; + + /* offset and length should not exceed the regd buf length */ + if (sge->length + dto->offset[i] > regd_buf->reg.len) { + iser_bug("Used len:%ld + offset:%d, exceed reg.buf.len:" + "%ld in dto:0x%p [%d], va:0x%08lX\n", + (unsigned long)sge->length, dto->offset[i], + (unsigned long)regd_buf->reg.len, dto, i, + (unsigned long)sge->addr); + } + + sge->addr += dto->offset[i]; /* Adjust offset */ + } +} + +/** + * iser_post_recv - Posts a receive buffer. 
+ * + * returns 0 on success, -1 on failure + */ +int iser_post_recv(struct iser_desc *rx_desc) +{ + int ib_ret, ret_val = 0; + struct ib_recv_wr recv_wr, *recv_wr_failed; + struct ib_sge iov[2]; + struct iser_conn *ib_conn; + struct iser_dto *recv_dto = &rx_desc->dto; + + /* Retrieve conn */ + ib_conn = recv_dto->conn->ib_conn; + + iser_dto_to_iov(recv_dto, iov, 2); + + recv_wr.next = NULL; + recv_wr.sg_list = iov; + recv_wr.num_sge = recv_dto->regd_vector_len; + recv_wr.wr_id = (unsigned long)rx_desc; + + atomic_inc(&ib_conn->post_recv_buf_count); + ib_ret = ib_post_recv(ib_conn->qp, &recv_wr, &recv_wr_failed); + if (ib_ret) { + iser_err("ib_post_recv failed ret=%d\n", ib_ret); + atomic_dec(&ib_conn->post_recv_buf_count); + ret_val = -1; + } + + return ret_val; +} + +/** + * iser_start_send - Initiate a Send DTO operation + * + * returns 0 on success, -1 on failure + */ +int iser_post_send(struct iser_desc *tx_desc) +{ + int ib_ret, ret_val = 0; + struct ib_send_wr send_wr, *send_wr_failed; + struct ib_sge iov[MAX_REGD_BUF_VECTOR_LEN]; + struct iser_conn *ib_conn; + struct iser_dto *dto = &tx_desc->dto; + + ib_conn = dto->conn->ib_conn; + + iser_dto_to_iov(dto, iov, MAX_REGD_BUF_VECTOR_LEN); + + send_wr.next = NULL; + send_wr.wr_id = (unsigned long)tx_desc; + send_wr.sg_list = iov; + send_wr.num_sge = dto->regd_vector_len; + send_wr.opcode = IB_WR_SEND; + send_wr.send_flags = dto->notify_enable ? 
IB_SEND_SIGNALED : 0; + + atomic_inc(&ib_conn->post_send_buf_count); + + ib_ret = ib_post_send(ib_conn->qp, &send_wr, &send_wr_failed); + if (ib_ret) { + iser_err("Failed to start SEND DTO, dto: 0x%p, IOV len: %d\n", + dto, dto->regd_vector_len); + iser_err("ib_post_send failed, ret:%d\n", ib_ret); + atomic_dec(&ib_conn->post_send_buf_count); + ret_val = -1; + } + + return ret_val; +} + +static void iser_comp_error_worker(void *data) +{ + struct iser_conn *ib_conn = data; + + if (atomic_read(&ib_conn->state) == ISER_CONN_UP) { + atomic_set(&ib_conn->state, ISER_CONN_TERMINATING); + iscsi_conn_failure(ib_conn->iser_conn->iscsi_conn, + ISCSI_ERR_CONN_FAILED); + } + + /* complete the termination process if disconnect event was delivered * + * note there are no more non completed posts to the QP */ + if (ib_conn->disc_evt_flag) { + atomic_set(&ib_conn->state, ISER_CONN_DOWN); + wake_up_interruptible(&ib_conn->wait); + } +} + +static void iser_handle_comp_error(struct iser_desc *desc) +{ + struct iser_dto *dto = &desc->dto; + struct iser_conn *ib_conn = dto->conn->ib_conn; + + iser_dto_buffs_release(dto); + + if (desc->type == ISCSI_RX) { + kfree(desc->data); + kmem_cache_free(ig.desc_cache, desc); + atomic_dec(&ib_conn->post_recv_buf_count); + } else { /* type is TX control/command/dataout */ + if (desc->type == ISCSI_TX_DATAOUT) + kmem_cache_free(ig.desc_cache, desc); + atomic_dec(&ib_conn->post_send_buf_count); + } + + if (atomic_read(&ib_conn->post_recv_buf_count) == 0 && + atomic_read(&ib_conn->post_send_buf_count) == 0) + schedule_work(&ib_conn->comperror_work); +} + +static void iser_cq_tasklet_fn(unsigned long data) +{ + struct iser_device *device = (struct iser_device *)data; + struct ib_cq *cq = device->cq; + struct ib_wc wc; + struct iser_desc *desc; + unsigned long xfer_len; + + while (ib_poll_cq(cq, 1, &wc) == 1) { + desc = (struct iser_desc *) (unsigned long) wc.wr_id; + + if (desc == NULL) + iser_bug("NULL desc\n"); + + if (wc.status == IB_WC_SUCCESS) { + 
if (desc->type == ISCSI_RX) { + xfer_len = (unsigned long)wc.byte_len; + iser_rcv_completion(desc, xfer_len); + } else /* type == ISCSI_TX_CONTROL/SCSI_CMD/DOUT */ + iser_snd_completion(desc); + } else { + iser_err("comp w. error op %d status %d\n",desc->type,wc.status); + iser_handle_comp_error(desc); + } + } + /* #warning "it is assumed here that arming CQ only once its empty" * + * " would not cause interrupts to be missed" */ + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); +} + +static void iser_cq_callback(struct ib_cq *cq, void *cq_context) +{ + struct iser_device *device = (struct iser_device *)cq_context; + + tasklet_schedule(&device->cq_tasklet); +} From ogerlitz at voltaire.com Thu Apr 27 05:33:23 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 27 Apr 2006 15:33:23 +0300 (IDT) Subject: [openib-general] [PATCH 6/6] iser handling of memory for RDMA In-Reply-To: Message-ID: the code has the ability to handle the case of SG lists which are not aligned for RDMA in the sense that one VA and RKEY pair can NOT be produced for them by ANY of the ib verbs memory registration apis. from our experience such lists are very rare and over time less then 0.1% of the data sent down by the SCSI ML is represented by such SGs Signed-off-by: Or Gerlitz --- /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser-x/iser_memory.c 1970-01-01 02:00:00.000000000 +0200 +++ /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser/iser_memory.c 2006-04-26 12:50:11.000000000 +0300 @@ -0,0 +1,403 @@ +/* + * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: iser_memory.c 6643 2006-04-26 10:01:01Z ogerlitz $ + */ +#include +#include +#include +#include +#include +#include +#include + +#include "iscsi_iser.h" + +#define ISER_KMALLOC_THRESHOLD 0x20000 /* 128K - kmalloc limit */ +/** + * Decrements the reference count for the + * registered buffer & releases it + * + * returns 0 if released, 1 if deferred + */ +int iser_regd_buff_release(struct iser_regd_buf *regd_buf) +{ + struct device *dma_device; + + if ((atomic_read(&regd_buf->ref_count) == 0) || + atomic_dec_and_test(&regd_buf->ref_count)) { + /* if we used the dma mr, unreg is just NOP */ + if (regd_buf->reg.rkey != 0) + iser_unreg_mem(&regd_buf->reg); + + if (regd_buf->dma_addr) { + dma_device = regd_buf->device->ib_device->dma_device; + dma_unmap_single(dma_device, + regd_buf->dma_addr, + regd_buf->data_size, + regd_buf->direction); + } + /* else this regd buf is associated with task which we */ + /* dma_unmap_single/sg later */ + return 0; + } else { + iser_dbg("Release deferred, regd.buff: 0x%p\n", regd_buf); + return 1; + } +} + +/** + * iser_reg_single - fills registered buffer descriptor with + * registration information + */ +void iser_reg_single(struct iser_device *device, + struct iser_regd_buf *regd_buf, + enum dma_data_direction direction) +{ + dma_addr_t dma_addr; + + dma_addr = dma_map_single(device->ib_device->dma_device, + regd_buf->virt_addr, + regd_buf->data_size, direction); + if (dma_mapping_error(dma_addr)) + iser_bug("dma_map_single failed at %p\n", regd_buf->virt_addr); + + regd_buf->reg.lkey = device->mr->lkey; + regd_buf->reg.rkey = 0; /* indicate there's no need to unreg */ + regd_buf->reg.len = regd_buf->data_size; + regd_buf->reg.va = dma_addr; + + regd_buf->dma_addr = dma_addr; + regd_buf->direction = direction; +} + +/** + * iser_start_rdma_unaligned_sg + */ +int iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task *iser_ctask, + enum iser_data_dir cmd_dir) +{ + int dma_nents; + struct device *dma_device; + char *mem = NULL; +
struct iser_data_buf *data = &iser_ctask->data[cmd_dir]; + unsigned long cmd_data_len = data->data_len; + + if (cmd_data_len > ISER_KMALLOC_THRESHOLD) + mem = (void *)__get_free_pages(GFP_KERNEL, + long_log2(roundup_pow_of_two(cmd_data_len)) - PAGE_SHIFT); + else + mem = kmalloc(cmd_data_len, GFP_KERNEL); + + if (mem == NULL) { + iser_err("Failed to allocate mem size %d %d for copying sglist\n", + data->size,(int)cmd_data_len); + return -ENOMEM; + } + + if (cmd_dir == ISER_DIR_OUT) { + /* copy the unaligned sg the buffer which is used for RDMA */ + struct scatterlist *sg = (struct scatterlist *)data->buf; + int i; + char *p, *from; + + for (p = mem, i = 0; i < data->size; i++) { + from = kmap_atomic(sg[i].page, KM_USER0); + memcpy(p, + from + sg[i].offset, + sg[i].length); + kunmap_atomic(from, KM_USER0); + p += sg[i].length; + } + } + + sg_init_one(&iser_ctask->data_copy[cmd_dir].sg_single, mem, cmd_data_len); + iser_ctask->data_copy[cmd_dir].buf = + &iser_ctask->data_copy[cmd_dir].sg_single; + iser_ctask->data_copy[cmd_dir].size = 1; + + iser_ctask->data_copy[cmd_dir].copy_buf = mem; + + dma_device = iser_ctask->iser_conn->ib_conn->device->ib_device->dma_device; + + if (cmd_dir == ISER_DIR_OUT) + dma_nents = dma_map_sg(dma_device, + &iser_ctask->data_copy[cmd_dir].sg_single, + 1, DMA_TO_DEVICE); + else + dma_nents = dma_map_sg(dma_device, + &iser_ctask->data_copy[cmd_dir].sg_single, + 1, DMA_FROM_DEVICE); + + if (dma_nents == 0) + iser_bug("dma_map_sg failed at %p\n", mem); + + iser_ctask->data_copy[cmd_dir].dma_nents = dma_nents; + return 0; +} + +/** + * iser_finalize_rdma_unaligned_sg + */ +void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *iser_ctask, + enum iser_data_dir cmd_dir) +{ + struct device *dma_device; + struct iser_data_buf *mem_copy; + unsigned long cmd_data_len; + + dma_device = iser_ctask->iser_conn->ib_conn->device->ib_device->dma_device; + mem_copy = &iser_ctask->data_copy[cmd_dir]; + + if (cmd_dir == ISER_DIR_OUT) + 
dma_unmap_sg(dma_device, &mem_copy->sg_single, 1, + DMA_TO_DEVICE); + else + dma_unmap_sg(dma_device, &mem_copy->sg_single, 1, + DMA_FROM_DEVICE); + + if (cmd_dir == ISER_DIR_IN) { + char *mem; + struct scatterlist *sg; + unsigned char *p, *to; + unsigned int sg_size; + int i; + + /* copy back read RDMA to unaligned sg */ + mem = mem_copy->copy_buf; + + sg = (struct scatterlist *)iser_ctask->data[ISER_DIR_IN].buf; + sg_size = iser_ctask->data[ISER_DIR_IN].size; + + for (p = mem, i = 0; i < sg_size; i++){ + to = kmap_atomic(sg[i].page, KM_SOFTIRQ0); + memcpy(to + sg[i].offset, + p, + sg[i].length); + kunmap_atomic(to, KM_SOFTIRQ0); + p += sg[i].length; + } + } + + cmd_data_len = iser_ctask->data[cmd_dir].data_len; + + if (cmd_data_len > ISER_KMALLOC_THRESHOLD) + free_pages((unsigned long)mem_copy->copy_buf, + long_log2(roundup_pow_of_two(cmd_data_len)) - PAGE_SHIFT); + else + kfree(mem_copy->copy_buf); + + mem_copy->copy_buf = NULL; +} + +/** + * iser_sg_to_page_vec - Translates scatterlist entries to physical addresses + * and returns the length of resulting physical address array (may be less than + * the original due to possible compaction). + * + * we build a "page vec" under the assumption that the SG meets the RDMA + * alignment requirements. Other then the first and last SG elements, all + * the "internal" elements can be compacted into a list whose elements are + * dma addresses of physical pages. The code supports also the weird case + * where --few fragments of the same page-- are present in the SG as + * consecutive elements. Also, it handles one entry SG. 
+ */ +static int iser_sg_to_page_vec(struct iser_data_buf *data, + struct iser_page_vec *page_vec) +{ + struct scatterlist *sg = (struct scatterlist *)data->buf; + dma_addr_t first_addr, last_addr, page; + int start_aligned, end_aligned; + unsigned int cur_page = 0; + unsigned long total_sz = 0; + int i; + + /* compute the offset of first element */ + page_vec->offset = (u64) sg[0].offset; + + for (i = 0; i < data->dma_nents; i++) { + total_sz += sg_dma_len(&sg[i]); + + first_addr = sg_dma_address(&sg[i]); + last_addr = first_addr + sg_dma_len(&sg[i]); + + start_aligned = !(first_addr & ~PAGE_MASK); + end_aligned = !(last_addr & ~PAGE_MASK); + + /* continue to collect page fragments till aligned or SG ends */ + while (!end_aligned && (i + 1 < data->dma_nents)) { + i++; + total_sz += sg_dma_len(&sg[i]); + last_addr = sg_dma_address(&sg[i]) + sg_dma_len(&sg[i]); + end_aligned = !(last_addr & ~PAGE_MASK); + } + + first_addr = first_addr & PAGE_MASK; + + for (page = first_addr; page < last_addr; page += PAGE_SIZE) + page_vec->pages[cur_page++] = page; + + } + page_vec->data_size = total_sz; + iser_dbg("page_vec->data_size:%d cur_page %d\n", page_vec->data_size,cur_page); + return cur_page; +} + +#define MASK_4K ((1UL << 12) - 1) /* 0xFFF */ +#define IS_4K_ALIGNED(addr) ((((unsigned long)addr) & MASK_4K) == 0) + +/** + * iser_data_buf_aligned_len - Tries to determine the maximal correctly aligned + * for RDMA sub-list of a scatter-gather list of memory buffers, and returns + * the number of entries which are aligned correctly. Supports the case where + * consecutive SG elements are actually fragments of the same physcial page. 
+ */ +static unsigned int iser_data_buf_aligned_len(struct iser_data_buf *data) +{ + struct scatterlist *sg; + dma_addr_t end_addr, next_addr; + int i, cnt; + unsigned int ret_len = 0; + + sg = (struct scatterlist *)data->buf; + + for (cnt = 0, i = 0; i < data->dma_nents; i++, cnt++) { + /* iser_dbg("Checking sg iobuf [%d]: phys=0x%08lX " + "offset: %ld sz: %ld\n", i, + (unsigned long)page_to_phys(sg[i].page), + (unsigned long)sg[i].offset, + (unsigned long)sg[i].length); */ + end_addr = sg_dma_address(&sg[i]) + + sg_dma_len(&sg[i]); + /* iser_dbg("Checking sg iobuf end address " + "0x%08lX\n", end_addr); */ + if (i + 1 < data->dma_nents) { + next_addr = sg_dma_address(&sg[i+1]); + /* are i, i+1 fragments of the same page? */ + if (end_addr == next_addr) + continue; + else if (!IS_4K_ALIGNED(end_addr)) { + ret_len = cnt + 1; + break; + } + } + } + if (i == data->dma_nents) + ret_len = cnt; /* loop ended */ + iser_dbg("Found %d aligned entries out of %d in sg:0x%p\n", + ret_len, data->dma_nents, data); + return ret_len; +} + +static void iser_data_buf_dump(struct iser_data_buf *data) +{ + struct scatterlist *sg = (struct scatterlist *)data->buf; + int i; + + for (i = 0; i < data->size; i++) + iser_err("sg[%d] dma_addr:0x%lX page:0x%p " + "off:%d sz:%d dma_len:%d\n", + i, (unsigned long)sg_dma_address(&sg[i]), + sg[i].page, sg[i].offset, + sg[i].length,sg_dma_len(&sg[i])); +} + +static void iser_dump_page_vec(struct iser_page_vec *page_vec) +{ + int i; + + iser_err("page vec length %d data size %d\n", + page_vec->length, page_vec->data_size); + for (i = 0; i < page_vec->length; i++) + iser_err("%d %lx\n",i,(unsigned long)page_vec->pages[i]); +} + +static void iser_page_vec_build(struct iser_data_buf *data, + struct iser_page_vec *page_vec) +{ + int page_vec_len = 0; + + page_vec->length = 0; + page_vec->offset = 0; + + iser_dbg("Translating sg sz: %d\n", data->dma_nents); + page_vec_len = iser_sg_to_page_vec(data,page_vec); + iser_dbg("sg len %d page_vec_len %d\n", 
data->dma_nents,page_vec_len); + + page_vec->length = page_vec_len; + + if (page_vec_len * 4096 < page_vec->data_size) { + iser_err("dumping sg\n"); + iser_data_buf_dump(data); + iser_dump_page_vec(page_vec); + iser_bug("page_vec too short to hold this SG\n"); + } +} + +/** + * iser_reg_rdma_mem - Registers memory intended for RDMA, + * obtaining rkey and va + * + * returns 0 on success, errno code on failure + */ +int iser_reg_rdma_mem(struct iscsi_iser_cmd_task *iser_ctask, + enum iser_data_dir cmd_dir) +{ + struct iser_conn *ib_conn = iser_ctask->iser_conn->ib_conn; + struct iser_data_buf *mem = &iser_ctask->data[cmd_dir]; + struct iser_regd_buf *regd_buf; + int aligned_len; + int err; + + regd_buf = &iser_ctask->rdma_regd[cmd_dir]; + + aligned_len = iser_data_buf_aligned_len(mem); + if (aligned_len != mem->size) { + iser_err("rdma alignment violation %d/%d aligned\n", + aligned_len, mem->size); + iser_data_buf_dump(mem); + /* allocate copy buf, if we are writing, copy the */ + /* unaligned scatterlist, dma map the copy */ + if (iser_start_rdma_unaligned_sg(iser_ctask, cmd_dir) != 0) + return -ENOMEM; + mem = &iser_ctask->data_copy[cmd_dir]; + } + + iser_page_vec_build(mem, ib_conn->page_vec); + err = iser_reg_page_vec(ib_conn, ib_conn->page_vec, &regd_buf->reg); + if (err) + return err; + + /* take a reference on this regd buf such that it will not be released * + * (eg in send dto completion) before we get the scsi response */ + atomic_inc(&regd_buf->ref_count); + return 0; +} From joern at wohnheim.fh-wedel.de Thu Apr 27 05:37:01 2006 From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel) Date: Thu, 27 Apr 2006 14:37:01 +0200 Subject: [openib-general] Re: [PATCH 13/16] ehca: firmware InfiniBand interface In-Reply-To: <4450A1C0.3080209@de.ibm.com> References: <4450A1C0.3080209@de.ibm.com> Message-ID: <20060427123701.GG32127@wohnheim.fh-wedel.de> On Thu, 27 April 2006 12:49:36 +0200, Heiko J Schick wrote: > +u64 hipz_h_alloc_resource_qp(const struct
ipz_adapter_handle > adapter_handle, > + struct ehca_pfqp *pfqp, > + const u8 servicetype, > + const u8 daqp_ctrl, > + const u8 signalingtype, > + const u8 ud_av_l_key_ctl, > + const struct ipz_cq_handle send_cq_handle, > + const struct ipz_cq_handle receive_cq_handle, > + const struct ipz_eq_handle async_eq_handle, > + const u32 qp_token, > + const struct ipz_pd pd, > + const u16 max_nr_send_wqes, > + const u16 max_nr_receive_wqes, > + const u8 max_nr_send_sges, > + const u8 max_nr_receive_sges, > + const u32 ud_av_l_key, > + struct ipz_qp_handle *qp_handle, > + u32 * qp_nr, > + u16 * act_nr_send_wqes, > + u16 * act_nr_receive_wqes, > + u8 * act_nr_send_sges, > + u8 * act_nr_receive_sges, > + u32 * nr_sq_pages, > + u32 * nr_rq_pages, > + struct h_galpas *h_galpas); 25 parameters? If you tell me which drugs were involved in this code, I know what to stay away from. Might be the current record for any code ever proposed for inclusion. The whole patch is full of parameter-happy functions with this one being the ugly top of the iceberg. I sincerely hope this is not a defined ABI and can still be changed. Jörn -- Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. 
Kernighan From joern at wohnheim.fh-wedel.de Thu Apr 27 05:39:49 2006 From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel) Date: Thu, 27 Apr 2006 14:39:49 +0200 Subject: [openib-general] Re: [PATCH 01/16] ehca: integration in Linux kernel build system In-Reply-To: <4450B384.4020601@de.ibm.com> References: <4450B384.4020601@de.ibm.com> Message-ID: <20060427123949.GH32127@wohnheim.fh-wedel.de> On Thu, 27 April 2006 14:05:24 +0200, Heiko J Schick wrote: > + > +hcad_mod-objs = ehca_main.o \ > + ehca_hca.o \ > + ehca_mcast.o \ > + ehca_pd.o \ > + ehca_av.o \ > + ehca_eq.o \ > + ehca_cq.o \ > + ehca_qp.o \ > + ehca_sqp.o \ > + ehca_mrmw.o \ > + ehca_reqs.o \ > + ehca_irq.o \ > + ehca_uverbs.o \ > + hcp_if.o \ > + hcp_phyp.o \ > + ipz_pt_fn.o If you don't consolidate this into 2-3 lines, Sam might turn you into a toad. Jörn -- Audacity augments courage; hesitation, fear. -- Publilius Syrus From jbglaw at lug-owl.de Thu Apr 27 05:40:01 2006 From: jbglaw at lug-owl.de (Jan-Benedict Glaw) Date: Thu, 27 Apr 2006 14:40:01 +0200 Subject: [openib-general] Re: [PATCH 1/6] iSER's Makefile and Kconfig In-Reply-To: References: Message-ID: <20060427124001.GZ25520@lug-owl.de> On Thu, 2006-04-27 15:30:32 +0300, Or Gerlitz wrote: > Signed-off-by: Or Gerlitz > > --- /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser-x/Makefile 1970-01-01 02:00:00.000000000 +0200 > +++ /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser/Makefile 2006-04-27 15:12:33.000000000 +0300 > @@ -0,0 +1,6 @@ > +obj-$(CONFIG_INFINIBAND_ISER) += ib_iser.o > + > +ib_iser-y := iser_verbs.o \ > + iser_initiator.o \ > + iser_memory.o \ > + iscsi_iser.o > --- /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser-x/Kconfig 1970-01-01 02:00:00.000000000 +0200 > +++ /usr/src/linux-2.6.17-rc3/drivers/infiniband/ulp/iser/Kconfig 2006-04-16 11:04:42.000000000 +0300 > @@ -0,0 +1,12 @@ > +config INFINIBAND_ISER > + tristate "ISCSI RDMA Protocol" > + depends on INFINIBAND && SCSI > + select SCSI_ISCSI_ATTRS > + 
---help--- > + > + Support for the ISCSI RDMA Protocol over InfiniBand. This > + allows you to access storage devices that speak ISER/ISCSI > + over InfiniBand. > + > + The ISER protocol is defined by IETF. > + See . Please always send patches in an order such that the kernel still compiles after each one. E.g. with your first patch introducing the Makefile stuff (while the C files are still not there), this will break and thus make it harder to automatically trace down unrelated breakages. MfG, JBG -- Jan-Benedict Glaw jbglaw at lug-owl.de . +49-172-7608481 _ O _ "Eine Freie Meinung in einem Freien Kopf | Gegen Zensur | Gegen Krieg _ _ O für einen Freien Staat voll Freier Bürger" | im Internet! | im Irak! O O O ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA)); From joern at wohnheim.fh-wedel.de Thu Apr 27 05:41:44 2006 From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel) Date: Thu, 27 Apr 2006 14:41:44 +0200 Subject: [openib-general] Re: [PATCH 14/16] ehca: hardware interface In-Reply-To: <4450A1C8.7090407@de.ibm.com> References: <4450A1C8.7090407@de.ibm.com> Message-ID: <20060427124144.GI32127@wohnheim.fh-wedel.de> On Thu, 27 April 2006 12:49:44 +0200, Heiko J Schick wrote: > +#ifndef EHCA_USE_HCALL > +#include "sim_gal.h" > +#endif Again, something's fishy. And in this case, your own code seems to be. ;) Jörn -- Simplicity is prerequisite for reliability. -- Edsger W.
Dijkstra From ogerlitz at voltaire.com Thu Apr 27 05:44:12 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 27 Apr 2006 15:44:12 +0300 Subject: [openib-general] Re: [PATCH 1/6] iSER's Makefile and Kconfig In-Reply-To: <20060427124001.GZ25520@lug-owl.de> References: <20060427124001.GZ25520@lug-owl.de> Message-ID: <4450BC9C.6080409@voltaire.com> Jan-Benedict Glaw wrote: > Please always send patches in an order so that the kernel still is > compilable. > > Eg. with your first patch introducing the Makefile stuff (while the C > files are still not there), this will break and thus make it harder to > automatically trace down unrelated breakages. OK, i understand it, thanks, Or. From joern at wohnheim.fh-wedel.de Thu Apr 27 05:52:07 2006 From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel) Date: Thu, 27 Apr 2006 14:52:07 +0200 Subject: [openib-general] Re: [PATCH 15/16] ehca: queue page table handling In-Reply-To: <4450A1CE.80503@de.ibm.com> References: <4450A1CE.80503@de.ibm.com> Message-ID: <20060427125207.GJ32127@wohnheim.fh-wedel.de> On Thu, 27 April 2006 12:49:50 +0200, Heiko J Schick wrote: > +inline static void *ipz_qeit_get_inc_valid(struct ipz_queue *queue) > +{ > + void *retvalue = ipz_qeit_get(queue); > + u32 qe = ((struct ehca_cqe *)retvalue)->cqe_flags; > + if ((qe >> 7) == (queue->toggle_state & 1)) { > + /* this is a good one */ > + ipz_qeit_get_inc(queue); > + } else > + retvalue = NULL; > + return (retvalue); > +} How about: static inline void *ipz_qeit_get_inc_valid(struct ipz_queue *queue) { struct ehca_cqe *cqe = ipz_qeit_get(queue); u32 flags = cqe->cqe_flags; if ((flags >> 7) != (queue->toggle_state & 1)) return NULL; ipz_qeit_get_inc(queue); return cqe; } o "static inline", as Arnd requested, o no cast for cqe, o possibly useful identifier for "retvalue", o trivial to identify error path (hint: only error path is indented), o directly returns NULL instead of assigning to a variable, o removed brackets around return value. 
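The early-return shape listed above can be tried outside the kernel with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in rather than the ehca driver's own definitions: the types `struct cqe` and `struct queue`, the helpers `queue_peek`, `queue_advance` and `queue_get_valid`, and the assumption that the expected toggle bit flips each time the queue wraps around. Unlike the snippet above, the sketch also masks the shifted flags down to a single bit before comparing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the real ehca structures. */
struct cqe {
	uint32_t flags;		/* bit 7 is assumed to carry the toggle bit */
};

struct queue {
	struct cqe *entries;
	size_t nr_entries;
	size_t current;		/* index of the next entry to inspect */
	uint32_t toggle_state;	/* toggle value expected in a valid entry */
};

/* Look at the current entry without consuming it. */
static struct cqe *queue_peek(struct queue *q)
{
	return &q->entries[q->current];
}

/* Consume the current entry; assumed to flip the toggle on wrap-around. */
static void queue_advance(struct queue *q)
{
	q->current++;
	if (q->current == q->nr_entries) {
		q->current = 0;
		q->toggle_state ^= 1;
	}
}

/*
 * Return the current entry and advance if its toggle bit matches the
 * expected state, otherwise return NULL and leave the queue untouched.
 * Only the failure path is indented.
 */
static struct cqe *queue_get_valid(struct queue *q)
{
	struct cqe *cqe = queue_peek(q);

	if (((cqe->flags >> 7) & 1) != (q->toggle_state & 1))
		return NULL;

	queue_advance(q);
	return cqe;
}
```

A caller would poll `queue_get_valid()` in a loop and stop at the first NULL, which in this sketch means the producer has not yet written the next entry.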
I'm still not happy with "ehca_cqe" (just try to pronounce it) and the weird condition. But you should get the general idea. Same goes for other functions. Jörn -- The cheapest, fastest and most reliable components of a computer system are those that aren't there. -- Gordon Bell, DEC labratories From joern at wohnheim.fh-wedel.de Thu Apr 27 05:57:26 2006 From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel) Date: Thu, 27 Apr 2006 14:57:26 +0200 Subject: [openib-general] Re: [PATCH 00/16] ehca: IBM eHCA InfiniBand Device Driver In-Reply-To: <4450B378.9000705@de.ibm.com> References: <4450B378.9000705@de.ibm.com> Message-ID: <20060427125726.GK32127@wohnheim.fh-wedel.de> On Thu, 27 April 2006 14:05:12 +0200, Heiko J Schick wrote: > > many thanks for your comments. They are very helpful for us. All > 17 patches have to be applied, otherwise the driver won't compile. Don't expect much cheer and rejoicing over this. I suspect that akpm or Linus will either want the 17 patches merged into one or have a patchset where every single patch leaves the kernel in a working state, including working eHCA driver. Generally, there seemed to be a bit more SHOUTING when compared to other kernel code. Might be something to look at as well. Jörn -- Rules of Optimization: Rule 1: Don't do it. Rule 2 (for experts only): Don't do it yet. -- M.A. Jackson From halr at voltaire.com Thu Apr 27 05:47:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Apr 2006 08:47:49 -0400 Subject: [openib-general] Re: [PATCH] osm_port_info_rcv.c : clear client reregister bit In-Reply-To: <07fyjzguun.fsf@sw053.yok.mtl.com> References: <07fyjzguun.fsf@sw053.yok.mtl.com> Message-ID: <1146142062.2124.47456.camel@hal.voltaire.com> Hi Ofer, On Thu, 2006-04-27 at 06:45, Ofer Gigi wrote: > Hi Hal, > Bug Fix: > On receive of client reregister - clear the reregister bit - so > reregistering won't be sent again and again > > Please apply to trunk and branch. Thanks. 
Applied to both trunk and 1.0 branch with some cosmetic changes. -- Hal > > Thanks > > Ofer G. > > Signed-off-by: Ofer Gigi From penberg at cs.helsinki.fi Thu Apr 27 06:42:52 2006 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Thu, 27 Apr 2006 16:42:52 +0300 Subject: [openib-general] Re: [PATCH 13/16] ehca: firmware InfiniBand interface In-Reply-To: <20060427123701.GG32127@wohnheim.fh-wedel.de> References: <4450A1C0.3080209@de.ibm.com> <20060427123701.GG32127@wohnheim.fh-wedel.de> Message-ID: <84144f020604270642j788be2ecp82841ac3b3ebcaad@mail.gmail.com> On 4/27/06, Jörn Engel wrote: > The whole patch is full of parameter-happy functions with this one > being the ugly top of the iceberg. I sincerely hope this is not a > defined ABI and can still be changed. It's not in mainline, so it can be changed. Pekka From hch at infradead.org Thu Apr 27 06:45:25 2006 From: hch at infradead.org (Christoph Hellwig) Date: Thu, 27 Apr 2006 14:45:25 +0100 Subject: [openib-general] Re: [PATCH 05/16] ehca: InfiniBand query and multicast functionality In-Reply-To: <200604271405.36588.arnd@arndb.de> References: <4450A17D.4030708@de.ibm.com> <20060427114104.GA32127@wohnheim.fh-wedel.de> <200604271405.36588.arnd@arndb.de> Message-ID: <20060427134525.GA20966@infradead.org> On Thu, Apr 27, 2006 at 02:05:36PM +0200, Arnd Bergmann wrote: > On Thursday 27 April 2006 13:41, J?rn Engel wrote: > > On Thu, 27 April 2006 12:48:29 +0200, Heiko J Schick wrote: > > > > > > + * ?This source code is distributed under a dual license of GPL v2.0 and > > > OpenIB > > > > Line wrap. ?You might want to check your mailer or switch to a > > different one. > > > > Looks correct here. Maybe you need to check yours ;-) It's linewrapped here, too. And the mailer on this box hasn't changed for more than three years. 
OTOH I got strangely looking mails from you recently :) From joern at wohnheim.fh-wedel.de Thu Apr 27 06:47:03 2006 From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel) Date: Thu, 27 Apr 2006 15:47:03 +0200 Subject: [openib-general] Re: [PATCH 13/16] ehca: firmware InfiniBand interface In-Reply-To: <84144f020604270642j788be2ecp82841ac3b3ebcaad@mail.gmail.com> References: <4450A1C0.3080209@de.ibm.com> <20060427123701.GG32127@wohnheim.fh-wedel.de> <84144f020604270642j788be2ecp82841ac3b3ebcaad@mail.gmail.com> Message-ID: <20060427134703.GL32127@wohnheim.fh-wedel.de> On Thu, 27 April 2006 16:42:52 +0300, Pekka Enberg wrote: > On 4/27/06, Jörn Engel wrote: > > The whole patch is full of parameter-happy functions with this one > > being the ugly top of the iceberg. I sincerely hope this is not a > > defined ABI and can still be changed. > > It's not in mainline, so it can be changed. I was thinking more about firmware ABI. Jörn -- But this is not to say that the main benefit of Linux and other GPL software is lower-cost. Control is the main benefit--cost is secondary. -- Bruce Perens From Arkady.Kanevsky at netapp.com Thu Apr 27 06:50:06 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 27 Apr 2006 09:50:06 -0400 Subject: [openib-general] mthca_reset question Message-ID: Here is an extract from the mthca_reset.c /* * Reset the chip. This is somewhat ugly because we have to * save off the PCI header before reset and then restore it * after the chip reboots. We skip config space offsets 22 * and 23 since those have a special meaning. * * To make matters worse, for Tavor (PCI-X HCA) we have to * find the associated bridge device and save off its PCI * header as well. */ if (!(mdev->mthca_flags & MTHCA_FLAG_PCIE)) { /* Look for the bridge -- its device ID will be 2 more than HCA's device ID. 
*/ while ((bridge = pci_get_device(mdev->pdev->vendor, mdev->pdev->device + 2, bridge)) != NULL) { if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && bridge->subordinate == mdev->pdev->bus) { mthca_dbg(mdev, "Found bridge: %s\n", pci_name(bridge)); break; } } First, why do we check for "not PCIe" instead of checking for PCI-X? Second, why a while loop instead of an if? Third, and most interesting: why is the bridge's device ID 2 more than the HCA's device ID? What does this hack rely on? Can we instead find the device's parent, which should be a bridge? Thanks, Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 From sashak at voltaire.com Thu Apr 27 07:05:16 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 27 Apr 2006 17:05:16 +0300 Subject: [openib-general] [PATCH] opensm: handle stdout and stderr in osm_log In-Reply-To: <20060427072041.GA1805@greglaptop.hsd1.ca.comcast.net> References: <20060426221557.GF2453@sashak.voltaire.com> <20060427072041.GA1805@greglaptop.hsd1.ca.comcast.net> Message-ID: <20060427140516.GA19060@sashak.voltaire.com> On 00:20 Thu 27 Apr , Greg Lindahl wrote: > On Thu, Apr 27, 2006 at 01:15:57AM +0300, Sasha Khapyorsky wrote: > > > There is small patch for osm_log, this provide possibility to drop log > > output to stdout or stderr. > > Isn't the Unix convention to use "--" to mean stdout? I guess that "-" means stdout? Yes, I think this is nice to have; I will add it too. > Or you can use > /dev/fd/{0,1}... Those are not different from regular files, nothing special is needed. Sasha. From leonida at voltaire.com Thu Apr 27 07:21:52 2006 From: leonida at voltaire.com (Leonid Arsh) Date: Thu, 27 Apr 2006 17:21:52 +0300 Subject: [openib-general] Re: netperf for RDS needed In-Reply-To: References: Message-ID: <4450D380.4000303@voltaire.com> Ranjit, thank you again for the patch. I applied it and was able to run it. Looks very nice.
These are the results with RDS:

Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

262144    8192   10.01      653574      1    4280.59
118784           10.01      653574           4280.59

These are the results without RDS:

Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

262144    8192   10.00      356180      0    2333.90
118784           10.00      211005           1382.63

During the run we get error messages in dmesg on the server side. Have you seen anything like this? Please see the dmesg output below:

swapper: page allocation failure. order:1, mode:0x20 Call Trace: {__alloc_pages+662} {smp_apic_timer_interrupt+54} {apic_timer_interrupt+132} {cache_grow+288} {cache_alloc_refill+419} {kmem_cache_alloc+87} {:ib_rds:rds_alloc_buf+16} {:ib_rds:rds_alloc_recv_buffer+12} {:ib_rds:rds_post_new_recv+23} {:ib_rds:rds_recv_completion+85} {:ib_rds:rds_cq_callback+87} {:ib_mthca:mthca_eq_int+119} {do_IRQ+50} {ret_from_intr+0} {:ib_mthca:mthca_tavor_interrupt+91} {handle_IRQ_event+41} {__do_IRQ+156} {do_IRQ+45} {ret_from_intr+0} {mwait_idle+54} {cpu_idle+93} {start_secondary+1131} Mem-info: Node 0 DMA per-cpu: cpu 0 hot: low 0, high 0, batch 1 used:0 cpu 0 cold: low 0, high 0, batch 1 used:0 cpu 1 hot: low 0, high 0, batch 1 used:0 cpu 1 cold: low 0, high 0, batch 1 used:0 cpu 2 hot: low 0, high 0, batch 1 used:0 cpu 2 cold: low 0, high 0, batch 1 used:0 cpu 3 hot: low 0, high 0, batch 1 used:0 cpu 3 cold: low 0, high 0, batch 1 used:0 Node 0 DMA32 per-cpu: cpu 0 hot: low 0, high 186, batch 31 used:162 cpu 0 cold: low 0, high 62, batch 15 used:39 cpu 1 hot: low 0, high 186, batch 31 used:114 cpu 1 cold: low 0, high 62, batch 15 used:51 cpu 2 hot: low 0, high 186, batch 31 used:36 cpu 2 cold: low 0, high 62, batch 15 used:37 cpu 3 hot: low 0, high 186, batch 31 used:21 cpu 3 cold: low 0, high 62, batch 15 used:31 Node 0 Normal per-cpu: empty Node 0 HighMem per-cpu: empty Free pages: 69832kB (0kB HighMem) Active:40222 inactive:13980 dirty:12
writeback:0 unstable:0 free:17462 slab:176826 mapped:37439 pagetables:1756 Node 0 DMA free:3980kB min:44kB low:52kB high:64kB active:0kB inactive:0kB present:11224kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 990 990 990 Node 0 DMA32 free:65868kB min:4000kB low:5000kB high:6000kB active:160888kB inactive:55920kB present:1013924kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 73*4kB 1*8kB 0*16kB 1*32kB 1*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3980kB Node 0 DMA32: 16225*4kB 17*8kB 8*16kB 2*32kB 1*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 65932kB Node 0 Normal: empty Node 0 HighMem: empty Swap cache: add 72, delete 72, find 33/38, race 0+0 Free swap = 2096460kB Total swap = 2096472kB Free swap: 2096460kB 262112 pages of RAM 6465 reserved pages 130652 pages shared 0 pages swap cached rds: kmem_cache <0xffff810024f52900> returned NULL netserver: page allocation failure.
order:1, mode:0x20 Call Trace: {__alloc_pages+662} {cache_grow+288} {cache_alloc_refill+419} {kmem_cache_alloc+87} {:ib_rds:rds_alloc_buf+16} {:ib_rds:rds_alloc_recv_buffer+12} {:ib_rds:rds_post_new_recv+23} {:ib_rds:rds_recv_completion+85} {:ib_rds:rds_cq_callback+87} {:ib_mthca:mthca_eq_int+119} {do_IRQ+50} {ret_from_intr+0} {:ib_mthca:mthca_tavor_interrupt+91} {handle_IRQ_event+41} {__do_IRQ+156} {do_IRQ+45} {ret_from_intr+0} {copy_user_generic+59} {:ib_rds:rds_recvmsg+566} {sock_common_recvmsg+45} {sock_recvmsg+271} {__alloc_pages+101} {_read_unlock_irq+6} {find_get_page+65} {autoremove_wake_function+0} {sys_recvfrom+182} {_spin_unlock_irq+10} {_spin_unlock_irq+7} {thread_return+167} {do_setitimer+333} {system_call+126} Mem-info: Node 0 DMA per-cpu: cpu 0 hot: low 0, high 0, batch 1 used:0 cpu 0 cold: low 0, high 0, batch 1 used:0 cpu 1 hot: low 0, high 0, batch 1 used:0 cpu 1 cold: low 0, high 0, batch 1 used:0 cpu 2 hot: low 0, high 0, batch 1 used:0 cpu 2 cold: low 0, high 0, batch 1 used:0 cpu 3 hot: low 0, high 0, batch 1 used:0 cpu 3 cold: low 0, high 0, batch 1 used:0 Node 0 DMA32 per-cpu: cpu 0 hot: low 0, high 186, batch 31 used:160 cpu 0 cold: low 0, high 62, batch 15 used:39 cpu 1 hot: low 0, high 186, batch 31 used:179 cpu 1 cold: low 0, high 62, batch 15 used:55 cpu 1 cold: low 0, high 62, batch 15 used:55 cpu 2 hot: low 0, high 186, batch 31 used:46 cpu 2 cold: low 0, high 62, batch 15 used:37 cpu 3 hot: low 0, high 186, batch 31 used:44 cpu 3 cold: low 0, high 62, batch 15 used:31 Node 0 Normal per-cpu: empty Node 0 HighMem per-cpu: empty Free pages: 69484kB (0kB HighMem) Active:40244 inactive:13258 dirty:1 writeback:1 unstable:0 free:17371 slab:177517 mapped:37438 pagetables:1756 Node 0 DMA free:3980kB min:44kB low:52kB high:64kB active:0kB inactive:0kB present:11224kB pages_scanned:0 all_unreclaimable? 
yes lowmem_reserve[]: 0 990 990 990 Node 0 DMA32 free:65504kB min:4000kB low:5000kB high:6000kB active:160976kB inactive:53032kB present:1013924kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable?no lowmem_reserve[]: 0 0 0 0 Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 73*4kB 1*8kB 0*16kB 1*32kB 1*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3980kB Node 0 DMA32: 16188*4kB 0*8kB 1*16kB 1*32kB 1*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 65504kB Node 0 Normal: empty Node 0 HighMem: empty Swap cache: add 72, delete 72, find 33/38, race 0+0 Free swap = 2096460kB Total swap = 2096472kB Free swap: 2096460kB 262112 pages of RAM 6465 reserved pages 130733 pages shared 0 pages swap cached rds: kmem_cache <0xffff810024f52900> returned NULL From arnd at arndb.de Thu Apr 27 07:39:25 2006 From: arnd at arndb.de (Arnd Bergmann) Date: Thu, 27 Apr 2006 16:39:25 +0200 Subject: [openib-general] Re: [PATCH 05/16] ehca: InfiniBand query and multicast functionality In-Reply-To: <20060427134525.GA20966@infradead.org> References: <4450A17D.4030708@de.ibm.com> <200604271405.36588.arnd@arndb.de> <20060427134525.GA20966@infradead.org> Message-ID: <200604271639.26235.arnd@arndb.de> On Thursday 27 April 2006 15:45, Christoph Hellwig wrote: > It's linewrapped here, too. And the mailer on this box hasn't changed > for more than three years. OTOH I got strangely looking mails from > you recently :) > > Hmm. I don't see line wrap problems on http://patchwork.ozlabs.org/linuxppc/patch?id=5174 . Maybe I'm just blind. However, /something/ went wrong with the way the patch showed up there. Half of it ended up in the comment section instead of the patch itself. 
Arnd <>< From sashak at voltaire.com Thu Apr 27 08:03:50 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 27 Apr 2006 18:03:50 +0300 Subject: [openib-general] [PATCH] opensm: handle stdout and stderr in osm_log In-Reply-To: <20060427140516.GA19060@sashak.voltaire.com> References: <20060426221557.GF2453@sashak.voltaire.com> <20060427072041.GA1805@greglaptop.hsd1.ca.comcast.net> <20060427140516.GA19060@sashak.voltaire.com> Message-ID: <20060427150350.GC19060@sashak.voltaire.com> Hello Hal, "-" is added as suggested by Greg (handled as stdout). Sasha. Handle in osm_log "-", "stdout" or "stderr" words as log file names to indicate logging to stdout or stderr respectively. Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_log.h | 7 ++++++- osm/opensm/main.c | 6 +----- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/osm/include/opensm/osm_log.h b/osm/include/opensm/osm_log.h index 8cf8cf1..c4744b5 100644 --- a/osm/include/opensm/osm_log.h +++ b/osm/include/opensm/osm_log.h @@ -224,10 +224,15 @@ osm_log_init( p_log->level = log_flags; p_log->flush = flush; - if (log_file == NULL) + if (log_file == NULL || !strcmp(log_file, "-") || + !strcmp(log_file, "stdout")) { p_log->out_port = stdout; } + else if (!strcmp(log_file, "stderr")) + { + p_log->out_port = stderr; + } else { if (accum_log_file) diff --git a/osm/opensm/main.c b/osm/opensm/main.c index 6f2a857..d95c314 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -722,11 +722,7 @@ #endif break; case 'f': - if (!strcmp(optarg, "stdout")) - /* output should be to standard output */ - opt.log_file = NULL; - else - opt.log_file = optarg; + opt.log_file = optarg; break; case 'e': From sean.hefty at intel.com Thu Apr 27 08:12:23 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 27 Apr 2006 08:12:23 -0700 Subject: [openib-general] RE: SDP hello ack header In-Reply-To: <20060427122424.GS31324@mellanox.co.il> Message-ID: >Sean, CMA does not seem to set MajV/MinV in SDP hello 
ack header (REP). >It does do this for hello header (REQ). Should SDP do this then? I don't think that the CMA cares about the hello ack, but I think it makes more sense for it to set it, since it does for the hh. Which is your preference? - Sean From mst at mellanox.co.il Thu Apr 27 08:20:44 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 27 Apr 2006 18:20:44 +0300 Subject: [openib-general] Re: SDP hello ack header In-Reply-To: References: <20060427122424.GS31324@mellanox.co.il> Message-ID: <20060427152044.GX31324@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: SDP hello ack header > > >Sean, CMA does not seem to set MajV/MinV in SDP hello ack header (REP). > >It does do this for hello header (REQ). Should SDP do this then? > > I don't think that the CMA cares about the hello ack, but I think it makes more > sense for it to set it, since it does for the hh. Which is your preference? I don't care much. It seems to make sense for CMA to set it. -- MST From shd at zakalwe.fi Thu Apr 27 08:29:05 2006 From: shd at zakalwe.fi (Heikki Orsila) Date: Thu, 27 Apr 2006 15:29:05 +0000 Subject: [openib-general] Re: [PATCH 02/16] ehca: module infrastructure In-Reply-To: <4450A165.4000701@de.ibm.com> References: <4450A165.4000701@de.ibm.com> Message-ID: <20060427152905.GB14413@zakalwe.fi> On Thu, Apr 27, 2006 at 12:48:05PM +0200, Heiko J Schick wrote: > + * This source code is distributed under a dual license of GPL v2.0 and > OpenIB > + * BSD. > + * > + * OpenIB BSD License > + * > + * Redistribution and use in source and binary forms, with or without > + * modification, are permitted provided that the following conditions are > met: > + * > + * Redistributions of source code must retain the above copyright notice, > this > + * list of conditions and the following disclaimer. 
> + * > + * Redistributions in binary form must reproduce the above copyright > notice, > + * this list of conditions and the following disclaimer in the > documentation > + * and/or other materials > + * provided with the distribution. > + * > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS > IS" > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, > THE > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR > PURPOSE > + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE > + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR > + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF > + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR > + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, > WHETHER > + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR > OTHERWISE) > + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF > THE > + * POSSIBILITY OF SUCH DAMAGE. Would you please keep the full license in only one file? It seems to be duplicated in all source modules. It's wasty. > + * $Id: ehca_main.c,v 1.35 2006/04/25 08:59:43 schickhj Exp $ This shouldn't be here. The kernel project has its own versioning. -- Heikki Orsila Barbie's law: heikki.orsila at iki.fi "Math is hard, let's go shopping!" 
http://www.iki.fi/shd From halr at voltaire.com Thu Apr 27 08:25:32 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Apr 2006 11:25:32 -0400 Subject: [openib-general] [PATCH] opensm: handle stdout and stderr in osm_log In-Reply-To: <20060427150350.GC19060@sashak.voltaire.com> References: <20060426221557.GF2453@sashak.voltaire.com> <20060427072041.GA1805@greglaptop.hsd1.ca.comcast.net> <20060427140516.GA19060@sashak.voltaire.com> <20060427150350.GC19060@sashak.voltaire.com> Message-ID: <1146151525.2124.49935.camel@hal.voltaire.com> On Thu, 2006-04-27 at 11:03, Sasha Khapyorsky wrote: > Hello Hal, > > "-" is added as suggested by Greg (handled as stdout). > > Sasha. > > > Handle in osm_log "-", "stdout" or "stderr" words as log file names to > indicate logging to stdout or stderr respectively. Thanks. Applied to both trunk and 1.0 branch. -- Hal > Signed-off-by: Sasha Khapyorsky From xma at us.ibm.com Thu Apr 27 09:10:32 2006 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 27 Apr 2006 09:10:32 -0700 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: <44507FD1.9040801@voltaire.com> Message-ID: Leonid Arsh wrote on 04/27/2006 01:24:49 AM: > Shirley Ma wrote: > > Without seeing your patch, I coudn't say anything. I guess your > > implemention > > didn't handler multithreads simultanously. If you only have one > > interrupt handler, > > couldn't see any reason you can get better performance number with > > splitting CQs. > Shirley, you are right. > I just wanted share our experience with you. > > All the tests we made on our IPoIB driver, so our NAPI implementation > isn't relevant here. > Unfortunately, we didn't plan to work on the IPoIB performance in the > nearest future, so I can't > implement NAPI on the OpenIB driver right now. > > I think it would be very interesting to compare the NAPI performance > against the work queue. > Please let me know if you are planning to do it yourself. 
> > > > Could you please post your NAPI patch here? > > > > As I mentioned I will test my patch to see how's the performance. > > > Thank you, > Leonid How many percentage throughput you got from your NAPI implementation? So far work queue gives very consistent 15% througput increase in my local test with one dual core cpu over mthca. I am planning to add one more cpu to see the difference. Yes, NAPI is in our plan. We can see NAPI vs. work queue results soon. Actualy if we can combine work queue with NAPI, that would be more interesting. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From sweitzen at cisco.com Thu Apr 27 09:17:09 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 27 Apr 2006 09:17:09 -0700 Subject: [openib-general] Open MPI component in bugzilla Message-ID: Bryan, would you please add an "Open MPI" component to the OpenIB bugzilla, with jsquyres at cisco.com as the owner? Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean.hefty at intel.com Thu Apr 27 09:22:20 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 27 Apr 2006 09:22:20 -0700 Subject: [openib-general] RE: SDP hello ack header In-Reply-To: <20060427152044.GX31324@mellanox.co.il> Message-ID: >I don't care much. It seems to make sense for CMA to set it. I created a patch for this. The easiest fix was to set the version in the private_data passed to the CMA by SDP; however, the private_data is declared as const void *. This made me stop and think about the problem more. Both the CMA and SDP must agree on which version of the headers are used. If the CMA sets the version, then there's no way for SDP to use anything different, or to know which version will be used by the CMA. 
This will make it difficult or impossible for SDP to support multiple versions. I think that a better solution is to have SDP set the version information for all headers. The CMA can then check the version to see if it can support it, and set the other fields appropriately. Thoughts? - Sean From sean.hefty at intel.com Thu Apr 27 09:41:18 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 27 Apr 2006 09:41:18 -0700 Subject: [openib-general] [PATCH] rdma_cm: let SDP control the SDP version in the hello header In-Reply-To: Message-ID: These are the changes to the CMA that I was considering. This patch lets SDP determine which version of the SDP headers to use. The CMA will check that it can support that version, and set the address fields appropriately. Signed-off-by: Sean Hefty --- Index: cma.c =================================================================== --- cma.c (revision 6627) +++ cma.c (working copy) @@ -1473,8 +1473,8 @@ err: } EXPORT_SYMBOL(rdma_bind_addr); -static void cma_format_hdr(void *hdr, enum rdma_port_space ps, - struct rdma_route *route) +static int cma_format_hdr(void *hdr, enum rdma_port_space ps, + struct rdma_route *route) { struct sockaddr_in *src4, *dst4; struct cma_hdr *cma_hdr; @@ -1486,7 +1486,8 @@ static void cma_format_hdr(void *hdr, en switch (ps) { case RDMA_PS_SDP: sdp_hdr = hdr; - sdp_hdr->sdp_version = SDP_VERSION; + if (sdp_hdr->sdp_version != SDP_VERSION) + return -EINVAL; sdp_set_ip_ver(sdp_hdr, 4); sdp_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr; sdp_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr; @@ -1501,6 +1502,7 @@ static void cma_format_hdr(void *hdr, en cma_hdr->port = src4->sin_port; break; } + return 0; } static int cma_connect_ib(struct rdma_id_private *id_priv, @@ -1530,7 +1532,9 @@ static int cma_connect_ib(struct rdma_id } route = &id_priv->id.route; - cma_format_hdr(private_data, id_priv->id.ps, route); + ret = cma_format_hdr(private_data, id_priv->id.ps, route); + if (ret) + goto out; 
req.private_data = private_data; req.primary_path = &route->path_rec[0]; From mdidomenico at silverstorm.com Thu Apr 27 10:10:25 2006 From: mdidomenico at silverstorm.com (Di Domenico, Michael) Date: Thu, 27 Apr 2006 13:10:25 -0400 Subject: [openib-general] ib_srp Message-ID: Hi, Is there a way to remove a target from the SRP configuration without unloading the driver module (which seems to have partially removed the disk, but appears to be hanging)? I didn't see anything in the documentation (not to mention the links from the SRP wiki seem to be pointing to the wrong emails on the mailing list archives) Thanks - Michael -------------- next part -------------- An HTML attachment was scrubbed... URL: From sweitzen at cisco.com Thu Apr 27 10:13:56 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 27 Apr 2006 10:13:56 -0700 Subject: [openib-general] ib_srp Message-ID: I see mention of an SRP daemon coming in 1.0 RC4, does this daemon handle target addition/removal? Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Di Domenico, Michael Sent: Thursday, April 27, 2006 10:10 AM To: openib-general at openib.org Subject: [openib-general] ib_srp Hi, Is there a way to remove a target from the SRP configuration without unloading the driver module (which seems to have partially removed the disk, but appears to be hanging)? I didn't see anything in the documentation (not to mention the links from the SRP wiki seem to be pointing to the wrong emails on the mailing list archives) Thanks - Michael -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mrmacman_g4 at mac.com Thu Apr 27 10:13:54 2006 From: mrmacman_g4 at mac.com (Kyle Moffett) Date: Thu, 27 Apr 2006 13:13:54 -0400 Subject: [openib-general] Re: [PATCH 13/16] ehca: firmware InfiniBand interface In-Reply-To: <20060427123701.GG32127@wohnheim.fh-wedel.de> References: <4450A1C0.3080209@de.ibm.com> <20060427123701.GG32127@wohnheim.fh-wedel.de> Message-ID: <28E154B8-B491-4C5D-9976-5082718FEF26@mac.com> On Apr 27, 2006, at 08:37:01, Jörn Engel wrote: > On Thu, 27 April 2006 12:49:36 +0200, Heiko J Schick wrote: >> +u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle >> adapter_handle, >> + struct ehca_pfqp *pfqp, >> + const u8 servicetype, >> + const u8 daqp_ctrl, >> + const u8 signalingtype, >> + const u8 ud_av_l_key_ctl, >> + const struct ipz_cq_handle send_cq_handle, >> + const struct ipz_cq_handle receive_cq_handle, >> + const struct ipz_eq_handle async_eq_handle, >> + const u32 qp_token, >> + const struct ipz_pd pd, >> + const u16 max_nr_send_wqes, >> + const u16 max_nr_receive_wqes, >> + const u8 max_nr_send_sges, >> + const u8 max_nr_receive_sges, >> + const u32 ud_av_l_key, >> + struct ipz_qp_handle *qp_handle, >> + u32 * qp_nr, >> + u16 * act_nr_send_wqes, >> + u16 * act_nr_receive_wqes, >> + u8 * act_nr_send_sges, >> + u8 * act_nr_receive_sges, >> + u32 * nr_sq_pages, >> + u32 * nr_rq_pages, >> + struct h_galpas *h_galpas); > > 25 parameters? If you tell me which drugs were involved in this > code, I know what to stay away from. Might be the current record > for any code ever proposed for inclusion. > > The whole patch is full of parameter-happy functions with this one > being the ugly top of the iceberg. I sincerely hope this is not a > defined ABI and can still be changed. What's worse; look at the stack usage on that sucker alone: 10 pointers, 6 u8, 2 u16, 2 u32, and topped off with 3 unknown-sized "struct ipz_cq_handle", an unknown-sized "struct ipz_pd". 
The alignment alone probably chews up at least another couple bytes in there somewhere too. That's at _least_ 98 + 3*sizeof(struct ipz_cq_handle) + sizeof(struct ipz_pd) on a 64-bit platform (58 + 3*sizeof(struct ipz_cq_handle) + sizeof(struct ipz_pd) on 32-bit). Not to mention the fact that you totally screwed the compiler's chances of ever passing the important stuff in registers. And you haven't even gotten into local variables yet. Cheers, Kyle Moffett From rpandit at silverstorm.com Thu Apr 27 10:19:37 2006 From: rpandit at silverstorm.com (Ranjit Pandit) Date: Thu, 27 Apr 2006 10:19:37 -0700 Subject: [openib-general] Re: netperf for RDS needed In-Reply-To: <4450D380.4000303@voltaire.com> References: <4450D380.4000303@voltaire.com> Message-ID: <96f8e60e0604271019v2a9a7b6ei9c10a7dd507fbcb4@mail.gmail.com> On 4/27/06, Leonid Arsh wrote: > Ranjit > thank you for the patch again. I applied it and succeeded to run. > Looks very nice. > > This are the results with for RDS : > Socket Message Elapsed Messages > Size Size Time Okay Errors Throughput > bytes bytes secs # # 10^6bits/sec > 262144 8192 10.01 653574 1 4280.59 > 118784 10.01 653574 4280.59 > > This are the results without RDS: > Socket Message Elapsed Messages > Size Size Time Okay Errors Throughput > bytes bytes secs # # 10^6bits/sec > 262144 8192 10.00 356180 0 2333.90 > 118784 10.00 211005 1382.63 > What kind of systems are you running on, cpu and memory? Are the results without-RDS on IPoIB? The second line of the output is more interesting as it shows the "useful" b/w (as seen by the receiver) and therefore accounts for any lost/dropped pkts. Rds shows 3x improvement on recvr side b/w (4280.59 Vs 1382.63). > > > During the run we get error messages in dmesg on the server side. > Have you seen anything like this? > Please see the dmesg output below: What kernel are you on? 32bit or 64bit system? I will see if I can reproduce it. > > > > swapper: page allocation failure. 
order:1, mode:0x20 > > Call Trace: {__alloc_pages+662} > {smp_apic_timer_interrupt+54} > {apic_timer_interrupt+132} > {cache_grow+288} > {cache_alloc_refill+419} > {kmem_cache_alloc+87} > {:ib_rds:rds_alloc_buf+16} > {:ib_rds:rds_alloc_recv_buffer+12} > {:ib_rds:rds_post_new_recv+23} > {:ib_rds:rds_recv_completion+85} > {:ib_rds:rds_cq_callback+87} > {:ib_mthca:mthca_eq_int+119} > {do_IRQ+50} {ret_from_intr+0} > {:ib_mthca:mthca_tavor_interrupt+91} > {handle_IRQ_event+41} > {__do_IRQ+156} > {do_IRQ+45} {ret_from_intr+0} > {mwait_idle+54} > {cpu_idle+93} > {start_secondary+1131} > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Thu Apr 27 12:43:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 27 Apr 2006 22:43:52 +0300 Subject: [openib-general] [PATCH] rdma_cm: let SDP control the SDP version in the hello header In-Reply-To: References: Message-ID: <20060427194351.GA6813@mellanox.co.il> Quoting r. Sean Hefty : > Subject: [openib-general] [PATCH] rdma_cm: let SDP control the SDP version in the hello header > > These are the changes to the CMA that I was considering. > > This patch lets SDP determine which version of the SDP headers to use. > The CMA will check that it can support that version, and set the > address fields appropriately. > > Signed-off-by: Sean Hefty > --- > Index: cma.c > =================================================================== > --- cma.c (revision 6627) > +++ cma.c (working copy) > @@ -1486,7 +1486,8 @@ static void cma_format_hdr(void *hdr, en > switch (ps) { > case RDMA_PS_SDP: > sdp_hdr = hdr; > - sdp_hdr->sdp_version = SDP_VERSION; > + if (sdp_hdr->sdp_version != SDP_VERSION) > + return -EINVAL; I think if you want to do it this way you should check only the MajV. 
SDP spec says: A4.3.2.1.1 MAJOR PROTOCOL VERSION NUMBER (MAJV) - 4 BITS The current specification requires MajV to be set to 2. See section A4.5.1 Connection Setup on page 1218 for additional information. CA4-15: The accepting peer shall reject the connection if MajV in the HH does not match its local value. CA4-16: The accepting peer shall not reject the connection, solely on the basis that MinV of the HH does not match its local value. This enables future protocol extensions which are upwardly compatible. and I think the same applies to CMA. -- MST From sean.hefty at intel.com Thu Apr 27 11:01:15 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 27 Apr 2006 11:01:15 -0700 Subject: [openib-general] [PATCH] rdma_cm: let SDP control the SDP version in the hello header In-Reply-To: <20060427194351.GA6813@mellanox.co.il> Message-ID: >I think if you want to do it this way you should check only the MajV. >SDP spec says: > >A4.3.2.1.1 MAJOR PROTOCOL VERSION NUMBER (MAJV) - 4 BITS >The current specification requires MajV to be set to 2. See section A4.5.1 >Connection Setup on page 1218 for additional information. >CA4-15: The accepting peer shall reject the connection if MajV in the HH >does not match its local value. >CA4-16: The accepting peer shall not reject the connection, solely on the >basis that MinV of the HH does not match its local value. >This enables future protocol extensions which are upwardly compatible. I agree, the CMA should only check the major version number then. If we're in agreement on this approach, I will update the CMA to check the major version, and expect that SDP will set the version information. - Sean From mst at mellanox.co.il Thu Apr 27 12:56:40 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 27 Apr 2006 22:56:40 +0300 Subject: [openib-general] Re: SDP hello ack header In-Reply-To: References: <20060427152044.GX31324@mellanox.co.il> Message-ID: <20060427195640.GB6813@mellanox.co.il> Quoting r. 
Sean Hefty : > I think that a better solution is to have SDP set the version information for > all headers. The CMA can then check the version to see if it can support it, > and set the other fields appropriately. Thoughts? Fine. I'll add that to SDP. This will prevent the hard to catch bug like what I had with MajV unset. BTW, does CMA MajV in incoming messages? It does not seem to. If not this needs to be corrected: CA4-15: The accepting peer shall reject the connection if MajV in the HH does not match its local value. and CA4-22: The connecting peer shall terminate the connection attempt if MajV does not match its local value, i.e., it sends a REJ back to the remote peer, instead of RTU. And I think this test clearly belongs in CMA. But please also note that CA4-16/CA4-23 say that MinV should not be checked. -- MST From mst at mellanox.co.il Thu Apr 27 12:58:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 27 Apr 2006 22:58:03 +0300 Subject: [openib-general] [PATCH] rdma_cm: let SDP control the SDP version in the hello header In-Reply-To: References: Message-ID: <20060427195803.GC6813@mellanox.co.il> Quoting r. Sean Hefty : > @@ -1486,7 +1486,8 @@ static void cma_format_hdr(void *hdr, en > switch (ps) { > case RDMA_PS_SDP: > sdp_hdr = hdr; > - sdp_hdr->sdp_version = SDP_VERSION; > + if (sdp_hdr->sdp_version != SDP_VERSION) > + return -EINVAL; > sdp_set_ip_ver(sdp_hdr, 4); > sdp_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr; > sdp_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr; BTW, a diagnostic message would be helpful here: this could only happen as a result of a bug after all. -- MST From mst at mellanox.co.il Thu Apr 27 13:07:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 27 Apr 2006 23:07:55 +0300 Subject: [openib-general] [PATCH] rdma_cm: let SDP control the SDP version in the hello header In-Reply-To: References: <20060427194351.GA6813@mellanox.co.il> Message-ID: <20060427200755.GD6813@mellanox.co.il> Quoting r. 
Sean Hefty : > I agree, the CMA should only check the major version number then. If we're in > agreement on this approach, I will update the CMA to check the major version, > and expect that SDP will set the version information. Go ahead. BTW, I'm reasonably sure CMA does not check MajV at least in incoming HelloAck. I think since you must check it in incoming Hello in CMA, its best to check it in incoming HelloAck in CMA as well. Another validation check needed in CMA: CA4-17: The accepting peer shall reject the connection if IPV of the HH has a value other than 0x4 or 0x6. -- MST From sean.hefty at intel.com Thu Apr 27 11:21:06 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 27 Apr 2006 11:21:06 -0700 Subject: [openib-general] RE: SDP hello ack header In-Reply-To: <20060427195640.GB6813@mellanox.co.il> Message-ID: >BTW, does CMA MajV in incoming messages? It does not seem to. >If not this needs to be corrected: It checks both the major and minor version on an incoming REQ. See cma_get_net_info() in the cma. A failure will result in sending a REJ, but probably not with the right reject reason / data. >But please also note that CA4-16/CA4-23 say that MinV should not be checked. I will need to update all checks to remove minV. - Sean From mst at mellanox.co.il Thu Apr 27 13:14:06 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 27 Apr 2006 23:14:06 +0300 Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: References: <44507FD1.9040801@voltaire.com> Message-ID: <20060427201406.GE6813@mellanox.co.il> Quoting r. Shirley Ma : > So far work queue gives very consistent 15% througput increase in my > local test with one dual core cpu over mthca. What happens to the CPU utilization? And latency? > I am planning to add one more cpu to see the difference. And what happens on UP? > Yes, NAPI is in our plan. We can see NAPI vs. work queue results soon. 
> Actualy if we can combine work queue with NAPI, that would be more > interesting. Theoretically NAPI might make workqueues unnecessary. We shall see. -- MST From sean.hefty at intel.com Thu Apr 27 11:25:09 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 27 Apr 2006 11:25:09 -0700 Subject: [openib-general] [PATCH] rdma_cm: let SDP control the SDP version in the hello header In-Reply-To: <20060427200755.GD6813@mellanox.co.il> Message-ID: >Go ahead. BTW, I'm reasonably sure CMA does not check MajV at least in >incoming HelloAck. I think since you must check it in incoming Hello >in CMA, its best to check it in incoming HelloAck in CMA as well. > >Another validation check needed in CMA: > >CA4-17: The accepting peer shall reject the connection if IPV of the HH >has a value other than 0x4 or 0x6. Note that I would like to keep as much of the SDP protocol out of the CMA as possible. I agree that this check should be in the CMA. - Sean From mst at mellanox.co.il Thu Apr 27 13:16:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 27 Apr 2006 23:16:35 +0300 Subject: [openib-general] Re: SDP hello ack header In-Reply-To: References: <20060427195640.GB6813@mellanox.co.il> Message-ID: <20060427201635.GF6813@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: SDP hello ack header > > >BTW, does CMA MajV in incoming messages? It does not seem to. > >If not this needs to be corrected: > > It checks both the major and minor version on an incoming REQ. See > cma_get_net_info() in the cma. A failure will result in sending a REJ, but > probably not with the right reject reason / data. Right, but I think CMA should also check the REP. > >But please also note that CA4-16/CA4-23 say that MinV should not be checked. > > I will need to update all checks to remove minV. Please also add check that ip version is 4 or 6. -- MST From mst at mellanox.co.il Thu Apr 27 13:19:10 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 27 Apr 2006 23:19:10 +0300 Subject: [openib-general] [PATCH] rdma_cm: let SDP control the SDP version in the hello header In-Reply-To: References: <20060427200755.GD6813@mellanox.co.il> Message-ID: <20060427201910.GA7144@mellanox.co.il> Quoting r. Sean Hefty : > Note that I would like to keep as much of the SDP protocol out of the CMA as > possible. Sure. Things like MaxExtAdvert validation belong in SDP. I simply think that since CMA validates MajV in REQ its ugly not to do it in REP - reading ULP code, it will be hard to understand the asymmetry. -- MST From swise at opengridcomputing.com Thu Apr 27 11:59:50 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 27 Apr 2006 13:59:50 -0500 Subject: [openib-general] [PATCH] - kernel cmatose fix Message-ID: <1146164390.6444.48.camel@stevo-desktop> cmatose cannot bind to the same local port for all connections. This will fail on iwarp devices. This patch simply doesn't select the port for the client side. Signed-off-by: Steve Wise Index: cmatose.c =================================================================== --- cmatose.c (revision 6421) +++ cmatose.c (working copy) @@ -86,7 +86,7 @@ }; static struct cmatest test; -static char *src_ip = "000.000.000.000"; +static char *src_ip = "x00.000.000.000"; static char *dst_ip = "x00.000.000.000"; static int connections = 1; static int message_size = 100; @@ -447,9 +447,10 @@ message_count = 0; test.src_in.sin_family = AF_INET; - test.src_in.sin_port = 7471; - if (src_ip) + if (src_ip[0] != 'x') { + test.src_in.sin_port = 7471; test.src_in.sin_addr.s_addr = in_aton(src_ip); + } test.src_addr = (struct sockaddr *) &test.src_in; if (dst_ip[0] != 'x') { From sean.hefty at intel.com Thu Apr 27 12:21:35 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 27 Apr 2006 12:21:35 -0700 Subject: [openib-general] [PATCH] - kernel cmatose fix In-Reply-To: <1146164390.6444.48.camel@stevo-desktop> Message-ID: Thanks - applied. 
From caitlinb at broadcom.com Thu Apr 27 12:53:19 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 27 Apr 2006 12:53:19 -0700 Subject: [openib-general] [PATCH] rdma_cm: let SDP control the SDP version in the hello header Message-ID: <54AD0F12E08D1541B826BE97C98F99F143B005@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: >> Go ahead. BTW, I'm reasonably sure CMA does not check MajV at least >> in incoming HelloAck. I think since you must check it in incoming >> Hello in CMA, its best to check it in incoming HelloAck in CMA as >> well. >> >> Another validation check needed in CMA: >> >> CA4-17: The accepting peer shall reject the connection if IPV of the >> HH has a value other than 0x4 or 0x6. > > Note that I would like to keep as much of the SDP protocol > out of the CMA as possible. I agree that this check should be in the > CMA. > Could the CMA handoff parsing/evaluation of the Hello/HelloAck exchange much the way it hands off private data to the ULP? 
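[Editor's note, not part of the original thread.] The validation rules quoted in this thread (CA4-15/CA4-22: reject on a MajV mismatch; CA4-16/CA4-23: do not reject on MinV alone; CA4-17: IPV must be 0x4 or 0x6) can be sketched roughly as below. This is a minimal illustration only, not the CMA or SDP code. The names ex_sdp_majv, ex_sdp_check_hello and EX_SDP_MAJ_VERSION are hypothetical, and the sketch assumes that MajV occupies the high four bits of the one-byte version field, which the thread itself does not state.

```c
#include <stdint.h>

/* Hypothetical constant: the spec text quoted in the thread requires MajV == 2. */
#define EX_SDP_MAJ_VERSION 2

/* Assumption: MajV is the high nibble of the version byte, MinV the low nibble. */
static uint8_t ex_sdp_majv(uint8_t version)
{
	return (uint8_t)(version >> 4);
}

/*
 * Hypothetical helper: returns 0 if the hello header fields are acceptable,
 * -1 if the connection should be rejected.
 */
static int ex_sdp_check_hello(uint8_t version, uint8_t ipv)
{
	/* CA4-15 / CA4-22: the major version must match the local value. */
	if (ex_sdp_majv(version) != EX_SDP_MAJ_VERSION)
		return -1;

	/* CA4-17: the IP version must be 4 or 6. */
	if (ipv != 4 && ipv != 6)
		return -1;

	/* CA4-16 / CA4-23: MinV is deliberately not examined. */
	return 0;
}
```

Under these assumptions a version byte of 0x21 (MajV 2, MinV 1) would be accepted, while 0x10 (MajV 1) or an IPV of 9 would be rejected. Whether such a check lives in the CMA or in the ULP is the open question in the surrounding messages.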
From jlentini at netapp.com Thu Apr 27 12:57:56 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 27 Apr 2006 15:57:56 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL openib_cma: fixed address bindings, getaddrinfo, and added debug messages for rejects In-Reply-To: References: Message-ID: On Wed, 26 Apr 2006, Arlin Davis wrote: > Index: dapl/openib_cma/dapl_ib_cm.c > =================================================================== > --- dapl/openib_cma/dapl_ib_cm.c (revision 6672) > +++ dapl/openib_cma/dapl_ib_cm.c (working copy) > @@ -343,13 +356,58 @@ static void dapli_cm_passive_cb(struct d > event->private_data, new_conn->sp); > break; > case RDMA_CM_EVENT_UNREACHABLE: > + dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, > + NULL, conn->sp); > + > case RDMA_CM_EVENT_CONNECT_ERROR: > + > + dapl_dbg_log( > + DAPL_DBG_TYPE_WARN, > + " dapli_cm_passive_handler: CONN_ERR " > + " event=0x%x status=%d\n", > + event->event, event->status ); > + > + dapl_dbg_log( > + DAPL_DBG_TYPE_WARN, > + " dapli_cm_passive_handler: CONN_ERR " > + " on SRC 0x%x,0x%x DST 0x%x,0x%x \n", > + ntohl(((struct sockaddr_in *) > + &ipaddr->src_addr)->sin_addr.s_addr), > + ntohs(((struct sockaddr_in *) > + &ipaddr->src_addr)->sin_port), > + ntohl(((struct sockaddr_in *) > + &ipaddr->dst_addr)->sin_addr.s_addr), > + ntohs(((struct sockaddr_in *) > + &ipaddr->dst_addr)->sin_port) > + ); > + > dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, > NULL, conn->sp); > break; Why not combine these two into a signel dapl_dbg_log call? 
> + > case RDMA_CM_EVENT_REJECTED: > - dapls_cr_callback(conn, IB_CME_DESTINATION_REJECT, NULL, > - conn->sp); > + > + dapl_dbg_log( > + DAPL_DBG_TYPE_WARN, > + " dapli_cm_passive_handler: REJECTED reason=%d\n", > + event->status); > + > + dapl_dbg_log( > + DAPL_DBG_TYPE_WARN, > + " dapli_cm_passive_handler: REJECTED " > + " on SRC 0x%x,0x%x DST 0x%x,0x%x \n", > + ntohl(((struct sockaddr_in *) > + &ipaddr->src_addr)->sin_addr.s_addr), > + ntohs(((struct sockaddr_in *) > + &ipaddr->src_addr)->sin_port), > + ntohl(((struct sockaddr_in *) > + &ipaddr->dst_addr)->sin_addr.s_addr), > + ntohs(((struct sockaddr_in *) > + &ipaddr->dst_addr)->sin_port) > + ); > + > + dapls_cr_callback(conn, IB_CME_DESTINATION_REJECT, > + NULL, conn->sp); > break; > case RDMA_CM_EVENT_ESTABLISHED: > ditto From info at schihei.de Thu Apr 27 12:50:28 2006 From: info at schihei.de (Heiko Joerg Schick) Date: Thu, 27 Apr 2006 21:50:28 +0200 Subject: [openib-general] Re: [PATCH 00/16] ehca: IBM eHCA InfiniBand Device Driver References: <4450B378.9000705@de.ibm.com> <20060427125726.GK32127@wohnheim.fh-wedel.de> Message-ID: On 2006-04-27 14:57:26 +0200, Jörn Engel said: > Don't expect much cheer and rejoicing over this. I suspect that akpm > or Linus will either want the 17 patches merged into one or have a > patchset where every single patch leaves the kernel in a working > state, including working eHCA driver. I don't like the idea to put the whole driver in one patch file. I would propose to put the patch "ehca: integration in Linux kernel" last instead of first, as Arnd mentioned. With that change we leave the kernel in a working state when applying the patches. 
Regards, Heiko From vuhuong at mellanox.com Thu Apr 27 14:18:14 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Thu, 27 Apr 2006 14:18:14 -0700 Subject: [openib-general] ib_srp In-Reply-To: References: Message-ID: <44513516.8010202@mellanox.com> Scott Weitzenkamp (sweitzen) wrote: > I see mention of an SRP daemon coming in 1.0 RC4, does this daemon > handle target addition/removal? > Yes From info at schihei.de Thu Apr 27 14:26:59 2006 From: info at schihei.de (Heiko J Schick) Date: Thu, 27 Apr 2006 23:26:59 +0200 Subject: [openib-general] Re: [PATCH 01/16] ehca: integration in Linux kernel build system References: <4450B384.4020601@de.ibm.com> <200604271307.36987.arnd.bergmann@de.ibm.com> Message-ID: On 2006-04-27 13:07:36 +0200, Arnd Bergmann said: > It would be more practical to put this patch last instead of > first so you don't break the build system with partial applies. I agree. I Will change set for the next patchset. From sean.hefty at intel.com Thu Apr 27 14:31:04 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 27 Apr 2006 14:31:04 -0700 Subject: [openib-general] [PATCH] rdma_cm: let SDP control the SDP version in the hello header In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F143B005@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: >Could the CMA handoff parsing/evaluation of the Hello/HelloAck >exchange much the way it hands off private data to the ULP? It's slightly different with SDP. With other ULPs, the CMA inserts its own header at the start of the private data, and then strips it off on the remote side. However, with SDP, the header was already defined. The CMA updates the SDP header with the address information, but this is in the middle of the SDP header. So, the CMA must be aware of the format and version of the SDP header. The result is that the CMA is slightly aware of the SDP protocol, but the benefit is that code duplication is minimized. 
- Sean From info at schihei.de Thu Apr 27 14:31:41 2006 From: info at schihei.de (Heiko J Schick) Date: Thu, 27 Apr 2006 23:31:41 +0200 Subject: [openib-general] Re: [PATCH 06/16] ehca: common include files References: <4450A183.6030405@de.ibm.com> <200604271319.06844.arnd.bergmann@de.ibm.com> Message-ID: On 2006-04-27 13:19:06 +0200, Arnd Bergmann said: > Well, you should also remove this for submission, I guess ;-) Yeah, I've planed to remove this lines when 2.6.17 official came out. It is still included because we don't want to introduce unnecessary dependencies. I will remove it for the next patchset. Regards, Heiko From arlin.r.davis at intel.com Thu Apr 27 14:48:41 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 27 Apr 2006 14:48:41 -0700 Subject: [openib-general] RE: [PATCH2] uDAPL openib_cma: fixed address bindings, getaddrinfo, and added debug messages for rejects In-Reply-To: Message-ID: James, Here is a new patch with your recommended changes. -arlin Signed-off by: Arlin Davis Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 6672) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -121,11 +121,12 @@ static int getipaddr(char *name, char *a if (getaddrinfo(name, NULL, NULL, &res)) { /* retry using network device name */ ret = getipaddr_netdev(name,addr,len); - if (ret) + if (ret) { dapl_dbg_log(DAPL_DBG_TYPE_WARN, " getipaddr: invalid name, addr, or netdev(%s)\n", name); - return ret; + return ret; + } } else { if (len >= res->ai_addrlen) memcpy(addr, res->ai_addr, res->ai_addrlen); Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 6672) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -274,11 +274,21 @@ static void dapli_cm_active_cb(struct da switch (event->event) { case RDMA_CM_EVENT_UNREACHABLE: case RDMA_CM_EVENT_CONNECT_ERROR: + 
dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_cm_active_handler: CONN_ERR " + " event=0x%x status=%d\n", + event->event, event->status); + dapl_evd_connection_callback(conn, IB_CME_DESTINATION_UNREACHABLE, NULL, conn->ep); break; case RDMA_CM_EVENT_REJECTED: + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_cm_active_handler: REJECTED reason=%d\n", + event->status); dapl_evd_connection_callback(conn, IB_CME_DESTINATION_REJECT, NULL, conn->ep); break; @@ -320,6 +330,9 @@ static void dapli_cm_passive_cb(struct d struct rdma_cm_event *event) { struct dapl_cm_id *new_conn; +#ifdef DAPL_DBG + struct rdma_addr *ipaddr = &conn->cm_id->route.addr; +#endif dapl_dbg_log(DAPL_DBG_TYPE_CM, " passive_cb: conn %p id %d event %d\n", @@ -343,13 +356,48 @@ static void dapli_cm_passive_cb(struct d event->private_data, new_conn->sp); break; case RDMA_CM_EVENT_UNREACHABLE: + dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, + NULL, conn->sp); + case RDMA_CM_EVENT_CONNECT_ERROR: + + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_cm_passive: CONN_ERR " + " event=0x%x status=%d", + " on SRC 0x%x,0x%x DST 0x%x,0x%x\n", + event->event, event->status, + ntohl(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_port), + ntohl(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_port)); + dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, NULL, conn->sp); break; + case RDMA_CM_EVENT_REJECTED: - dapls_cr_callback(conn, IB_CME_DESTINATION_REJECT, NULL, - conn->sp); + + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_cm_passive: REJECTED reason=%d" + " on SRC 0x%x,0x%x DST 0x%x,0x%x\n", + event->status, + ntohl(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_port), + ntohl(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + 
&ipaddr->dst_addr)->sin_port)); + + dapls_cr_callback(conn, IB_CME_DESTINATION_REJECT, + NULL, conn->sp); break; case RDMA_CM_EVENT_ESTABLISHED: @@ -556,6 +604,7 @@ dapls_ib_setup_conn_listener(IN DAPL_IA { DAT_RETURN dat_status = DAT_SUCCESS; ib_cm_srvc_handle_t conn; + DAT_SOCK_ADDR6 addr; /* local binding address */ /* Allocate CM and initialize lock */ if ((conn = dapl_os_alloc(sizeof(*conn))) == NULL) @@ -571,11 +620,12 @@ dapls_ib_setup_conn_listener(IN DAPL_IA } /* open identifies the local device; per DAT specification */ - ((struct sockaddr_in *)&ia_ptr->hca_ptr->hca_address)->sin_port = - htons(MAKE_PORT(ServiceID)); + /* Get family and address then set port to consumer's ServiceID */ + dapl_os_memcpy(&addr, &ia_ptr->hca_ptr->hca_address, sizeof(addr)); + ((struct sockaddr_in *)&addr)->sin_port = htons(MAKE_PORT(ServiceID)); + - if (rdma_bind_addr(conn->cm_id, - (struct sockaddr *)&ia_ptr->hca_ptr->hca_address)) { + if (rdma_bind_addr(conn->cm_id,(struct sockaddr *)&addr)) { if (errno == EBUSY) dat_status = DAT_CONN_QUAL_IN_USE; else From clameter at sgi.com Thu Apr 27 15:22:53 2006 From: clameter at sgi.com (Christoph Lameter) Date: Thu, 27 Apr 2006 15:22:53 -0700 (PDT) Subject: [openib-general] Re: possible bug in kmem_cache related code In-Reply-To: <84144f020604270419s10696877he2ec27ae6d52e486@mail.gmail.com> References: <84144f020604270419s10696877he2ec27ae6d52e486@mail.gmail.com> Message-ID: On Thu, 27 Apr 2006, Pekka Enberg wrote: > On 4/27/06, Or Gerlitz wrote: > > With 2.6.17-rc3 I'm running into something which seems as a bug related > > to kmem_cache. Doing some allocations/deallocations from a kmem_cache and > > later attempting to destroy it yields the following message and trace > > Tested on 2.6.16.7 and works ok. Christoph, could this be related to > the cache draining patches that went in 2.6.17-rc1? What happened to that part of the slab allocator? Looks completely changed to when I saw it the last time? 
This directly fails in kmem_cache_destroy? So it tries to free all the slab entries from the free list and then returns 1 or 2 if there are entries left on the partial and full list? So the bug happens if cache entries are left. Guess the reason for this failure is then that not all cache entries have been freed before calling kmem_cache_destroy()? From paulus at samba.org Thu Apr 27 15:42:14 2006 From: paulus at samba.org (Paul Mackerras) Date: Fri, 28 Apr 2006 08:42:14 +1000 Subject: [openib-general] Re: [PATCH 13/16] ehca: firmware InfiniBand interface In-Reply-To: <20060427123701.GG32127@wohnheim.fh-wedel.de> References: <4450A1C0.3080209@de.ibm.com> <20060427123701.GG32127@wohnheim.fh-wedel.de> Message-ID: <17489.18630.75412.66803@cargo.ozlabs.ibm.com> Jörn Engel writes: > 25 parameters? If you tell me which drugs were involved in this code, > I know what to stay away from. You really need to ask the firmware architects that, since this is basically a single firmware call. Mind you, since a lot of the parameters are used to return individual bytes or half-words, which are then put into structures, it might be better to pass the pointers to the structures and let the wrapper put the values straight into the structures. Paul. From segher at kernel.crashing.org Thu Apr 27 15:55:10 2006 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Fri, 28 Apr 2006 00:55:10 +0200 Subject: [openib-general] Re: [PATCH 10/16] ehca: event queue In-Reply-To: <20060427114828.GC32127@wohnheim.fh-wedel.de> References: <4450A1AD.7040506@de.ibm.com> <20060427114828.GC32127@wohnheim.fh-wedel.de> Message-ID: <6C0D7610-D6A3-43CC-9327-926948167D43@kernel.crashing.org> >> + if (ret != H_SUCCESS) { > > Common convention is to return 0 on success and -ESOMETHING on eror. > You might want to follow that and remove H_SUCCESS from the complete > code. This return code doesn't come from anywhere within the kernel, though. 
If we change this, we should get rid of _every_ #define bladibla 0 Do we want that? (I do ;-) ) >> + if (!(vpage = ipz_qpageit_get_inc(&eq->ipz_queue))) { > > I personally despise assignments in conditionals. Not sure if this is > documented in CodingStyle, but IME most kernel hackers tend to dislike > it as well. In this case it's obvious; put it on a separate line. Segher From michael at ellerman.id.au Thu Apr 27 15:36:28 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Fri, 28 Apr 2006 08:36:28 +1000 Subject: [openib-general] Re: [PATCH 04/16] ehca: userspace support In-Reply-To: <20060427114355.GB32127@wohnheim.fh-wedel.de> References: <4450A176.9000008@de.ibm.com> <20060427114355.GB32127@wohnheim.fh-wedel.de> Message-ID: <1146177388.19236.1.camel@localhost.localdomain> On Thu, 2006-04-27 at 13:43 +0200, Jörn Engel wrote: > More minors. > > On Thu, 27 April 2006 12:48:22 +0200, Heiko J Schick wrote: > > + > > + EDEB_EN(7, > > + "vm_start=%lx vm_end=%lx vm_page_prot=%lx vm_fileoff=%lx " > > + "address=%lx", > > + vma->vm_start, vma->vm_end, vma->vm_page_prot, fileoffset, > > + address); > > Gesundheit! Seriously, I suspect "EDEB_EN" is not the best possible > name to pick. Try pr_debug() in include/linux/kernel.h cheers -- Michael Ellerman IBM OzLabs wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 191 bytes Desc: This is a digitally signed message part URL: From xma at us.ibm.com Thu Apr 27 16:19:32 2006 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 27 Apr 2006 16:19:32 -0700 Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: <20060427201406.GE6813@mellanox.co.il> Message-ID: MIchael, "Michael S. 
Tsirkin" wrote on 04/27/2006 01:14:06 PM: > Quoting r. Shirley Ma : > > So far work queue gives very consistent 15% througput increase in my > > local test with one dual core cpu over mthca. > What happens to the CPU utilization? And latency? The CPU utilization were doubled under one dual core cpu. Finally I found that's the problem tx_ring blocked the sender. After I tune send/recv queues, I got more than double throughput for unidirectional netperf with 10-15% more cpu utilization. I believe after I apply my other removing tx_ring patch, the performance would be better. > > I am planning to add one more cpu to see the difference. > > And what happens on UP? Don't have a UP. The patch needs to be verified on large cluster to see how's packets out of order. I have tried on 8 cpus, it did good. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From iod00d at hp.com Thu Apr 27 16:22:40 2006 From: iod00d at hp.com (Grant Grundler) Date: Thu, 27 Apr 2006 16:22:40 -0700 Subject: [openib-general] Re: TSO and IPoIB performance degradation In-Reply-To: <20060427072352.GB1805@greglaptop.hsd1.ca.comcast.net> References: <20060320090629.GA11352@mellanox.co.il> <20060320.015500.72136710.davem@davemloft.net> <20060320102234.GV29929@mellanox.co.il> <20060320.023704.70907203.davem@davemloft.net> <20060427041323.GX15855@narn.hozed.org> <20060427072352.GB1805@greglaptop.hsd1.ca.comcast.net> Message-ID: <20060427232240.GB3265@esmail.cup.hp.com> On Thu, Apr 27, 2006 at 12:23:52AM -0700, Greg Lindahl wrote: > On Wed, Apr 26, 2006 at 11:13:24PM -0500, Troy Benjegerdes wrote: > > > David is right. If you care about performance, you are already using SDP > > or verbs layer for the transport anyway. 
If I am going to be doing IPoIB, > > it's because eventually I expect the packet might get off the IB network > > and onto some other network and go halfway across the country. > > This is going to be a surprise to lots of people who want high-speed > gateways from IB to ethernet -- many clusters connect to fileservers > and other performance-sensitive gizmos that way. Anything preventing such a gateway from routing SDP to ethernet? Those gateways obviously will grok IB protocols. I'm asking because I don't understand/know if there is a real barrier to an IB -> ethernet gateway _without_ IPoIB. thanks, grant From xma at us.ibm.com Thu Apr 27 17:17:44 2006 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 27 Apr 2006 17:17:44 -0700 Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: Message-ID: I was lying. The 8 cpus throughput was bad. I am looking at the problem now. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From lindahl at pathscale.com Thu Apr 27 17:16:29 2006 From: lindahl at pathscale.com (Greg Lindahl) Date: Thu, 27 Apr 2006 17:16:29 -0700 Subject: [openib-general] Re: TSO and IPoIB performance degradation In-Reply-To: <20060427232240.GB3265@esmail.cup.hp.com> References: <20060320090629.GA11352@mellanox.co.il> <20060320.015500.72136710.davem@davemloft.net> <20060320102234.GV29929@mellanox.co.il> <20060320.023704.70907203.davem@davemloft.net> <20060427041323.GX15855@narn.hozed.org> <20060427072352.GB1805@greglaptop.hsd1.ca.comcast.net> <20060427232240.GB3265@esmail.cup.hp.com> Message-ID: <20060428001629.GA3364@greglaptop.internal.keyresearch.com> On Thu, Apr 27, 2006 at 04:22:40PM -0700, Grant Grundler wrote: > Anything preventing such a gateway from routing SDP to ethernet? > Those gateways obviously will grok IB protocols.
> I'm asking because I don't understand/know if there is a real > barrier to an IB -> ethernet gateway _without_ IPoIB. I don't know if an SDP to ethernet gateway even exists, but I do know that it's a lot more work than just an IPoIB to ethernet gateway -- the gateway is going to have to pass all its data through a TCP stack. So I would expect SDP to ethernet to not run very fast, especially on a gateway with lots of streams going. -- greg From mst at mellanox.co.il Thu Apr 27 23:48:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 28 Apr 2006 09:48:54 +0300 Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: References: <20060427201406.GE6813@mellanox.co.il> Message-ID: <20060428064854.GB7144@mellanox.co.il> Quoting r. Shirley Ma : > > And what happens on UP? > > Don't have a UP. It's easy enough to disable SMP support in kernel. -- MST From mst at mellanox.co.il Thu Apr 27 23:50:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 28 Apr 2006 09:50:51 +0300 Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: References: <20060427201406.GE6813@mellanox.co.il> Message-ID: <20060428065051.GC7144@mellanox.co.il> Quoting r. Shirley Ma : > > > So far work queue gives very consistent 15% throughput increase in my > > > local test with one dual core cpu over mthca. > > > > What happens to the CPU utilization? And latency? > > The CPU utilization was doubled under one dual core cpu. > Finally I found that's the problem tx_ring blocked the sender. > > After I tune send/recv queues, I got more than double throughput for > unidirectional netperf with 10-15% more cpu utilization. > > I believe after I apply my other removing tx_ring patch, the performance > would be better. Need to also see what happens to latency. There should be some cost, hopefully not too much.
-- MST From mrmacman_g4 at mac.com Thu Apr 27 22:11:11 2006 From: mrmacman_g4 at mac.com (Kyle Moffett) Date: Fri, 28 Apr 2006 01:11:11 -0400 Subject: [openib-general] Re: [PATCH 06/16] ehca: common include files In-Reply-To: <4450A183.6030405@de.ibm.com> References: <4450A183.6030405@de.ibm.com> Message-ID: <12A9A218-DC60-4944-892D-150DF2D88F0C@mac.com> On Apr 27, 2006, at 06:48:35, Heiko J Schick wrote: > +#define EHCA_EDEB_TRACE_MASK_SIZE 32 > +extern u8 ehca_edeb_mask[EHCA_EDEB_TRACE_MASK_SIZE]; > +#define EDEB_ID_TO_U32(str4) (str4[3] | (str4[2] << 8) | (str4[1] > << 16) | \ > + (str4[0] << 24)) > + > +inline static u64 ehca_edeb_filter(const u32 level, > + const u32 id, const u32 line) > +{ > + u64 ret = 0; > + u32 filenr = 0; > + u32 filter_level = 9; > + u32 dynamic_level = 0; > + > + /* This is code written for the gcc -O2 optimizer which should > collapse > + * to two single ints filter_level is the first level kicked out by > + * compiler means trace everything below 6. */ > + if (id == EDEB_ID_TO_U32("ehav")) { > + filenr = 0x01; > + filter_level = 8; > + } > [...] This whole mess should be a lot simpler with a table and a loop: struct edeb_filter_entry { const char *id; u32 filenr; u32 filter_level; }; # define EDEB_FILTER_ENTRY(name,nr,level) { .id = name, .filenr = nr, .filter_level = level } static const struct edeb_filter_entry edeb_filter_table[] = { EDEB_FILTER_ENTRY("clas", 0x02, 8), [...] }; Then just iterate over that table in a loop. The end result is much smaller code and data, and much clearer as to intent as well.
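A minimal userspace sketch of the table-and-loop approach, for illustration only. It assumes each entry carries its four-character file tag (as the .id initializer in the macro implies) and packs that tag the same way EDEB_ID_TO_U32 does. The names edeb_id_to_u32() and edeb_filter_lookup() are hypothetical and not part of the ehca patch; the two table rows use the only values visible in this thread ("ehav" and "clas").

```c
#include <stddef.h>
#include <stdint.h>

/* Pack a four-character tag into a 32-bit id, first character in the
 * top byte, mirroring the EDEB_ID_TO_U32() macro quoted above. */
static uint32_t edeb_id_to_u32(const char *s)
{
	return ((uint32_t)(unsigned char)s[0] << 24) |
	       ((uint32_t)(unsigned char)s[1] << 16) |
	       ((uint32_t)(unsigned char)s[2] << 8) |
	        (uint32_t)(unsigned char)s[3];
}

struct edeb_filter_entry {
	const char *id;		/* four-character file tag */
	uint32_t filenr;
	uint32_t filter_level;
};

/* One row per source file; only the rows quoted in the thread are shown. */
static const struct edeb_filter_entry edeb_filter_table[] = {
	{ "ehav", 0x01, 8 },
	{ "clas", 0x02, 8 },
};

/* Hypothetical helper: linear scan over the table.
 * Returns the matching entry, or NULL when the id is unknown. */
static const struct edeb_filter_entry *edeb_filter_lookup(uint32_t id)
{
	size_t i;
	size_t n = sizeof(edeb_filter_table) / sizeof(edeb_filter_table[0]);

	for (i = 0; i < n; i++)
		if (edeb_id_to_u32(edeb_filter_table[i].id) == id)
			return &edeb_filter_table[i];
	return NULL;
}
```

A caller would presumably fall back to the defaults in the quoted patch (filenr = 0, filter_level = 9) when the lookup returns NULL.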
Cheers, Kyle Moffett From penberg at cs.Helsinki.FI Thu Apr 27 23:03:44 2006 From: penberg at cs.Helsinki.FI (Pekka J Enberg) Date: Fri, 28 Apr 2006 09:03:44 +0300 (EEST) Subject: [openib-general] Re: possible bug in kmem_cache related code In-Reply-To: References: <84144f020604270419s10696877he2ec27ae6d52e486@mail.gmail.com> Message-ID: On 4/27/06, Or Gerlitz wrote: > > > With 2.6.17-rc3 I'm running into something which seems as a bug related > > > to kmem_cache. Doing some allocations/deallocations from a kmem_cache and > > > later attempting to destroy it yields the following message and trace On Thu, 27 Apr 2006, Pekka Enberg wrote: > > Tested on 2.6.16.7 and works ok. Christoph, could this be related to > > the cache draining patches that went in 2.6.17-rc1? On Thu, 27 Apr 2006, Christoph Lameter wrote: > What happened to that part of the slab allocator? Looks completely > changed to when I saw it the last time? > > This directly fails in kmem_cache_destroy? > > So it tries to free all the slab entries from the free list and then > returns 1 or 2 if there are entries left on the partial and full > list? So the bug happens if cache entries are left. > > Guess the reason for this failure is then that not all cache entries have > been freed before calling kmem_cache_destroy()? Yes, but if you look at Or's test case, there's no obvious reason why that's happening. I'll see if I can reproduce the problem with 2.6.17-rc3. 
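The failure mode discussed above (destroying a cache while some of its objects are still allocated) can be illustrated with a toy userspace model. This is a sketch under stated assumptions, not the mm/slab.c implementation: struct toy_cache and the toy_cache_*() functions are hypothetical, and the cache here only counts outstanding objects instead of tracking free, partial and full slab lists.

```c
#include <stdlib.h>

/* Toy stand-in for a slab cache: it only counts live objects. */
struct toy_cache {
	size_t obj_size;
	size_t outstanding;	/* allocated but not yet freed */
};

static void *toy_cache_alloc(struct toy_cache *c)
{
	void *obj = malloc(c->obj_size);

	if (obj)
		c->outstanding++;
	return obj;
}

static void toy_cache_free(struct toy_cache *c, void *obj)
{
	if (!obj)
		return;
	free(obj);
	c->outstanding--;
}

/* Refuse to tear the cache down while objects are still in use.
 * Returns 0 on success, -1 if any object is outstanding. */
static int toy_cache_destroy(struct toy_cache *c)
{
	if (c->outstanding != 0)
		return -1;
	c->obj_size = 0;
	return 0;
}
```

In this model a destroy that follows a free for every alloc always succeeds, which is why a failure in a test case that appears to free everything points at the allocator's own bookkeeping.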
Pekka From info at schihei.de Thu Apr 27 23:11:08 2006 From: info at schihei.de (Heiko J Schick) Date: Fri, 28 Apr 2006 08:11:08 +0200 Subject: [openib-general] Re: [PATCH 04/16] ehca: userspace support In-Reply-To: <1146177388.19236.1.camel@localhost.localdomain> References: <4450A176.9000008@de.ibm.com> <20060427114355.GB32127@wohnheim.fh-wedel.de> <1146177388.19236.1.camel@localhost.localdomain> Message-ID: <6C4A3B96-4752-4FF9-8FBE-C383B00AE014@schihei.de> Hello Michael, On 28.04.2006, at 00:36, Michael Ellerman wrote: > Try pr_debug() in include/linux/kernel.h The problem I see with pr_debug() is that it could only activated via a compile flag. To use the debug outputs you have to re-compile / compile your own kernel. Regards, Heiko From xma at us.ibm.com Thu Apr 27 23:28:13 2006 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 27 Apr 2006 23:28:13 -0700 Subject: [openib-general] Re: possible bug in kmem_cache related code In-Reply-To: Message-ID: I hit a similar problem while calling kzalloc(). it happened on linux-2.6.17-rc1 + ppc64. kernel BUG in __cache_alloc_node at mm/slab.c:2934! which is BUG_ON(slabp->inuse == cachep->num); 3:mon> expr cpu 0x3: Vector: 700 (Program Check) at [c0000000dac87870] pc: c0000000000b75b0: .__cache_alloc_node+0x94/0x180 lr: c0000000000b756c: .__cache_alloc_node+0x50/0x180 sp: c0000000dac87af0 msr: 8000000000021032 current = 0xc0000000f1d3a040 paca = 0xc000000000439480 pid = 9508, comm = ipoib_comp/3 Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From penberg at cs.helsinki.fi Thu Apr 27 23:32:40 2006 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Fri, 28 Apr 2006 09:32:40 +0300 Subject: [openib-general] Re: [PATCH 04/16] ehca: userspace support In-Reply-To: <6C4A3B96-4752-4FF9-8FBE-C383B00AE014@schihei.de> References: <4450A176.9000008@de.ibm.com> <20060427114355.GB32127@wohnheim.fh-wedel.de> <1146177388.19236.1.camel@localhost.localdomain> <6C4A3B96-4752-4FF9-8FBE-C383B00AE014@schihei.de> Message-ID: <84144f020604272332s6101032cy6936096230f3637c@mail.gmail.com> Hi Heiko, On 4/28/06, Heiko J Schick wrote: > The problem I see with pr_debug() is that it could only activated via > a compile flag. To use the debug outputs you have to re-compile / > compile your own kernel. Do you really need this heavy debug logging in the first place? You can use kprobes for arbitrary run-time inspection anyway, so logging everything seems wasteful. Pekka From clameter at sgi.com Thu Apr 27 23:46:55 2006 From: clameter at sgi.com (Christoph Lameter) Date: Thu, 27 Apr 2006 23:46:55 -0700 (PDT) Subject: [openib-general] Re: possible bug in kmem_cache related code In-Reply-To: References: Message-ID: On Thu, 27 Apr 2006, Shirley Ma wrote: > I hit a similar problem while calling kzalloc(). it happened on > linux-2.6.17-rc1 + ppc64. > > kernel BUG in __cache_alloc_node at mm/slab.c:2934! > which is > BUG_ON(slabp->inuse == cachep->num); More entries were added to a slab than allowed? This suggests a race on slabp->inuse. 
From michael at ellerman.id.au Fri Apr 28 00:11:32 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Fri, 28 Apr 2006 17:11:32 +1000 Subject: [openib-general] Re: [PATCH 04/16] ehca: userspace support In-Reply-To: <84144f020604272332s6101032cy6936096230f3637c@mail.gmail.com> References: <4450A176.9000008@de.ibm.com> <20060427114355.GB32127@wohnheim.fh-wedel.de> <1146177388.19236.1.camel@localhost.localdomain> <6C4A3B96-4752-4FF9-8FBE-C383B00AE014@schihei.de> <84144f020604272332s6101032cy6936096230f3637c@mail.gmail.com> Message-ID: <1146208292.6307.4.camel@localhost.localdomain> On Fri, 2006-04-28 at 09:32 +0300, Pekka Enberg wrote: > Hi Heiko, > > On 4/28/06, Heiko J Schick wrote: > > The problem I see with pr_debug() is that it could only activated via > > a compile flag. To use the debug outputs you have to re-compile / > > compile your own kernel. > > Do you really need this heavy debug logging in the first place? You > can use kprobes for arbitrary run-time inspection anyway, so logging > everything seems wasteful. Yeah, I really don't think you want to be running with that kind of debugging gunk in a production kernel. Is someone who can't build their own kernel really going to be able to make sense of the output? cheers -- Michael Ellerman IBM OzLabs wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 191 bytes Desc: This is a digitally signed message part URL: From penberg at cs.Helsinki.FI Fri Apr 28 01:10:00 2006 From: penberg at cs.Helsinki.FI (Pekka J Enberg) Date: Fri, 28 Apr 2006 11:10:00 +0300 (EEST) Subject: [openib-general] Re: possible bug in kmem_cache related code In-Reply-To: References: <84144f020604270419s10696877he2ec27ae6d52e486@mail.gmail.com> Message-ID: On 4/27/06, Or Gerlitz wrote: > > > With 2.6.17-rc3 I'm running into something which seems as a bug related > > > to kmem_cache. Doing some allocations/deallocations from a kmem_cache and > > > later attempting to destroy it yields the following message and trace On Thu, 27 Apr 2006, Pekka Enberg wrote: > > Tested on 2.6.16.7 and works ok. Christoph, could this be related to > > the cache draining patches that went in 2.6.17-rc1? On Thu, 27 Apr 2006, Christoph Lameter wrote: > What happened to that part of the slab allocator? Looks completely > changed to when I saw it the last time? > > This directly fails in kmem_cache_destroy? > > So it tries to free all the slab entries from the free list and then > returns 1 or 2 if there are entries left on the partial and full > list? So the bug happens if cache entries are left. > > Guess the reason for this failure is then that not all cache entries have > been freed before calling kmem_cache_destroy()? I can't reproduce this with Linus' git head on User-mode Linux running on UP i386. Or, can you reproduce this at will? Any local modifications? Can we see your .config, please. Pekka From jlentini at netapp.com Fri Apr 28 06:37:59 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 28 Apr 2006 09:37:59 -0400 (EDT) Subject: [openib-general] RE: [PATCH2] uDAPL openib_cma: fixed address bindings, getaddrinfo, and added debug messages for rejects In-Reply-To: References: Message-ID: On Thu, 27 Apr 2006, Arlin Davis wrote: > Here is a new patch with your recommended changes. Thanks Arlin. 
Committed in revision 6736 of the trunk and revision 6737 of the 1.0 release tree. From halr at voltaire.com Fri Apr 28 09:02:19 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Apr 2006 12:02:19 -0400 Subject: [openib-general] Re: [PATCH 3/4] opensm: pkey manager performance improvement In-Reply-To: <20060423142623.15562.89538.stgit@sashak.voltaire.com> References: <20060423141935.15562.38762.stgit@sashak.voltaire.com> <20060423142623.15562.89538.stgit@sashak.voltaire.com> Message-ID: <1146240126.2124.74884.camel@hal.voltaire.com> On Sun, 2006-04-23 at 10:26, Sasha Khapyorsky wrote: > Send changed pkey table blocks to ports only after full update and not > after each pkey value change/update. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (to trunk only) with some cosmetic changes. -- Hal From swise at opengridcomputing.com Fri Apr 28 11:22:38 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 28 Apr 2006 13:22:38 -0500 Subject: [openib-general] [PATCH] [RFC] dapltest change for iwarp Message-ID: <1146248558.28503.20.camel@stevo-desktop> James, This patch changes the dapltest transaction test to force the client side (the side that dat_ep_connect()) to send the first RDMA message. This ensures that the IWARP MPA protocol requirements are met. I'm presenting this for discussion and possible inclusion in the trunk. A transport independent application should be designed to work over all transports and should therefore utilize the only the common features. This implies that the application should always initiate RDMA exchanges starting with the client, to avoid MPA problems. Comments? Signed-off-by: Steve Wise Index: test/dapltest/test/dapl_transaction_test.c =================================================================== --- test/dapltest/test/dapl_transaction_test.c (revision 6733) +++ test/dapltest/test/dapl_transaction_test.c (working copy) @@ -978,55 +978,110 @@ test_ptr->base_port, test_ptr->is_server ? 
"Server" : "Client")); - /* post the send buffer */ - if (!DT_post_send_buffer (phead, + if (!test_ptr->is_server ) { + + /* post the send buffer */ + if (!DT_post_send_buffer (phead, test_ptr->ep_context[i].ep_handle, test_ptr->ep_context[i].bp, RMI_SEND_BUFFER_ID, buff_size)) - { - /* error message printed by DT_post_send_buffer */ - goto test_failure; - } - /* reap the send and verify it */ - dto_cookie.as_64 = LZERO; - dto_cookie.as_ptr = - (DAT_PVOID) DT_Bpool_GetBuffer ( - test_ptr->ep_context[i].bp, - RMI_SEND_BUFFER_ID); - if (!DT_dto_event_wait (phead, test_ptr->reqt_evd_hdl, &dto_stat) || - !DT_dto_check ( phead, + { + /* error message printed by DT_post_send_buffer */ + goto test_failure; + } + /* reap the send and verify it */ + dto_cookie.as_64 = LZERO; + dto_cookie.as_ptr = + (DAT_PVOID) DT_Bpool_GetBuffer ( + test_ptr->ep_context[i].bp, + RMI_SEND_BUFFER_ID); + if (!DT_dto_event_wait (phead, test_ptr->reqt_evd_hdl, &dto_stat) || + !DT_dto_check ( phead, &dto_stat, test_ptr->ep_context[i].ep_handle, buff_size, dto_cookie, test_ptr->is_server ? "Client_Mem_Info_Send" : "Server_Mem_Info_Send")) - { - goto test_failure; - } + { + goto test_failure; + } - /* - * Recv the other side's info - */ - DT_Tdep_PT_Debug (1,(phead,"Test[" F64x "]: Waiting for %s Memory Info\n", + /* + * Recv the other side's info + */ + DT_Tdep_PT_Debug (1,(phead,"Test[" F64x "]: Waiting for %s Memory Info\n", test_ptr->base_port, test_ptr->is_server ? 
"Client" : "Server")); - dto_cookie.as_64 = LZERO; - dto_cookie.as_ptr = - (DAT_PVOID) DT_Bpool_GetBuffer ( - test_ptr->ep_context[i].bp, - RMI_RECV_BUFFER_ID); - if (!DT_dto_event_wait (phead, test_ptr->recv_evd_hdl, &dto_stat) || - !DT_dto_check ( phead, + dto_cookie.as_64 = LZERO; + dto_cookie.as_ptr = + (DAT_PVOID) DT_Bpool_GetBuffer ( + test_ptr->ep_context[i].bp, + RMI_RECV_BUFFER_ID); + if (!DT_dto_event_wait (phead, test_ptr->recv_evd_hdl, &dto_stat) || + !DT_dto_check ( phead, &dto_stat, test_ptr->ep_context[i].ep_handle, buff_size, dto_cookie, test_ptr->is_server ? "Client_Mem_Info_Recv" : "Server_Mem_Info_Recv")) - { - goto test_failure; + { + goto test_failure; + } + } else { + + /* + * Recv the other side's info + */ + DT_Tdep_PT_Debug (1,(phead,"Test[" F64x "]: Waiting for %s Memory Info\n", + test_ptr->base_port, + test_ptr->is_server ? "Client" : "Server")); + dto_cookie.as_64 = LZERO; + dto_cookie.as_ptr = + (DAT_PVOID) DT_Bpool_GetBuffer ( + test_ptr->ep_context[i].bp, + RMI_RECV_BUFFER_ID); + if (!DT_dto_event_wait (phead, test_ptr->recv_evd_hdl, &dto_stat) || + !DT_dto_check ( phead, + &dto_stat, + test_ptr->ep_context[i].ep_handle, + buff_size, + dto_cookie, + test_ptr->is_server ? "Client_Mem_Info_Recv" + : "Server_Mem_Info_Recv")) + { + goto test_failure; + } + + /* post the send buffer */ + if (!DT_post_send_buffer (phead, + test_ptr->ep_context[i].ep_handle, + test_ptr->ep_context[i].bp, + RMI_SEND_BUFFER_ID, + buff_size)) + { + /* error message printed by DT_post_send_buffer */ + goto test_failure; + } + /* reap the send and verify it */ + dto_cookie.as_64 = LZERO; + dto_cookie.as_ptr = + (DAT_PVOID) DT_Bpool_GetBuffer ( + test_ptr->ep_context[i].bp, + RMI_SEND_BUFFER_ID); + if (!DT_dto_event_wait (phead, test_ptr->reqt_evd_hdl, &dto_stat) || + !DT_dto_check ( phead, + &dto_stat, + test_ptr->ep_context[i].ep_handle, + buff_size, + dto_cookie, + test_ptr->is_server ? 
"Client_Mem_Info_Send" + : "Server_Mem_Info_Send")) + { + goto test_failure; + } } /* From mshefty at ichips.intel.com Fri Apr 28 11:32:55 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 28 Apr 2006 11:32:55 -0700 Subject: [openib-general] [PATCH] [RFC] dapltest change for iwarp In-Reply-To: <1146248558.28503.20.camel@stevo-desktop> References: <1146248558.28503.20.camel@stevo-desktop> Message-ID: <44525FD7.4020406@ichips.intel.com> Steve Wise wrote: > This patch changes the dapltest transaction test to force the client > side (the side that dat_ep_connect()) to send the first RDMA message. > This ensures that the IWARP MPA protocol requirements are met. > > I'm presenting this for discussion and possible inclusion in the > trunk. > > A transport independent application should be designed to work over all > transports and should therefore utilize the only the common features. > This implies that the application should always initiate RDMA exchanges > starting with the client, to avoid MPA problems. What if someone comes up with an RDMA transport that requires the server side to send the first message? - Sean From swise at opengridcomputing.com Fri Apr 28 11:43:11 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 28 Apr 2006 13:43:11 -0500 Subject: [openib-general] [PATCH] [RFC] dapltest change for iwarp In-Reply-To: <44525FD7.4020406@ichips.intel.com> References: <1146248558.28503.20.camel@stevo-desktop> <44525FD7.4020406@ichips.intel.com> Message-ID: <1146249791.28503.29.camel@stevo-desktop> On Fri, 2006-04-28 at 11:32 -0700, Sean Hefty wrote: > Steve Wise wrote: > > This patch changes the dapltest transaction test to force the client > > side (the side that dat_ep_connect()) to send the first RDMA message. > > This ensures that the IWARP MPA protocol requirements are met. > > > > I'm presenting this for discussion and possible inclusion in the > > trunk. 
> > > > A transport independent application should be designed to work over all > > transports and should therefore utilize the only the common features. > > This implies that the application should always initiate RDMA exchanges > > starting with the client, to avoid MPA problems. > > What if someone comes up with an RDMA transport that requires the server side to > send the first message? > We shoot them. ;-) Seriously... Good point. The Chelsio RNIC has this issue. If the server sends the first FPDU _before_ the client driver has moved the connection/qp into RDMA mode, then the data is placed as streaming data and the connection must be terminated (dapltest 6 exposes this intermittently). Ammasso doesn't have this issue, but other RNIC's probably will. One thing I'm experimenting with is to delay the ESTABLISHED event on the server side until the first FPDU is received. However, we still probably need a way for an application to know whether the client has to send first (or the server as you pointed out). I believe rnic-pi has an attribute that indicated this behavior... Steve. From mshefty at ichips.intel.com Fri Apr 28 12:00:09 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 28 Apr 2006 12:00:09 -0700 Subject: [openib-general] [PATCH] [RFC] dapltest change for iwarp In-Reply-To: <1146249791.28503.29.camel@stevo-desktop> References: <1146248558.28503.20.camel@stevo-desktop> <44525FD7.4020406@ichips.intel.com> <1146249791.28503.29.camel@stevo-desktop> Message-ID: <44526639.6040905@ichips.intel.com> Steve Wise wrote: > The Chelsio RNIC has this issue. If the server sends the first FPDU > _before_ the client driver has moved the connection/qp into RDMA mode, > then the data is placed as streaming data and the connection must be > terminated (dapltest 6 exposes this intermittently). Ammasso doesn't > have this issue, but other RNIC's probably will. 
> > One thing I'm experimenting with is to delay the ESTABLISHED event on > the server side until the first FPDU is received. However, we still > probably need a way for an application to know whether the client has to > send first (or the server as you pointed out). > > I believe rnic-pi has an attribute that indicated this behavior... Can this be hidden from the user? If the client side needs to send the first message, couldn't the connection protocol send a 0-byte message that the server strips out? The ESTABLISHED event would be delayed until this message is received. Also, wouldn't this issue need to be resolved based on both the local and remote device capabilities? I.e. if the server is using an Ammasso RNIC, it needs to know the behavior of the remote RNIC, in case it's Chelsio. Is this information carried as part of the connection, or does iWarp force the client to initiate the first send? - Sean From or.gerlitz at gmail.com Fri Apr 28 12:24:39 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Fri, 28 Apr 2006 21:24:39 +0200 Subject: [openib-general] Re: possible bug in kmem_cache related code In-Reply-To: References: <84144f020604270419s10696877he2ec27ae6d52e486@mail.gmail.com> Message-ID: <15ddcffd0604281224i4308b08fs93f9ebaf7e9a16b3@mail.gmail.com> On 4/28/06, Pekka J Enberg wrote: > On 4/27/06, Or Gerlitz wrote: > > > > With 2.6.17-rc3 I'm running into something which seems as a bug related > > > > to kmem_cache. Doing some allocations/deallocations from a kmem_cache and > > > > later attempting to destroy it yields the following message and trace > On Thu, 27 Apr 2006, Pekka Enberg wrote: > > > Tested on 2.6.16.7 and works ok. Christoph, could this be related to > > > the cache draining patches that went in 2.6.17-rc1? > I can't reproduce this with Linus' git head on User-mode Linux running on > UP i386. Or, can you reproduce this at will? Any local modifications? Can > we see your .config, please. 
Yes, I can reproduce this at will, with no local modifications. My system is a dual AMD x86_64; I attached my .config to the first email of this thread and also mentioned that some CONFIG_DEBUG_ options are set, including one related to slab debugging. Also, by "User mode Linux" do you mean a Linux kernel that runs as a user process on your system? Or. From jlentini at netapp.com Fri Apr 28 13:33:07 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 28 Apr 2006 16:33:07 -0400 (EDT) Subject: [openib-general] [PATCH] [RFC] dapltest change for iwarp In-Reply-To: <1146249791.28503.29.camel@stevo-desktop> References: <1146248558.28503.20.camel@stevo-desktop> <44525FD7.4020406@ichips.intel.com> <1146249791.28503.29.camel@stevo-desktop> Message-ID: On Fri, 28 Apr 2006, Steve Wise wrote: > On Fri, 2006-04-28 at 11:32 -0700, Sean Hefty wrote: > > Steve Wise wrote: > > > This patch changes the dapltest transaction test to force the client > > > side (the side that calls dat_ep_connect()) to send the first RDMA message. > > > This ensures that the IWARP MPA protocol requirements are met. > > > > > > I'm presenting this for discussion and possible inclusion in the > > > trunk. > > > > > > A transport independent application should be designed to work over all > > > transports and should therefore utilize only the common features. > > > This implies that the application should always initiate RDMA exchanges > > > starting with the client, to avoid MPA problems. > > > > What if someone comes up with an RDMA transport that requires the server side to > > send the first message? > > > > We shoot them. ;-) > > Seriously... Good point. > > The Chelsio RNIC has this issue. If the server sends the first FPDU > _before_ the client driver has moved the connection/qp into RDMA mode, > then the data is placed as streaming data and the connection must be > terminated (dapltest 6 exposes this intermittently). Ammasso doesn't > have this issue, but other RNICs probably will.
I'm confused. Is this an iWARP requirement or a Chelsio requirement? It sounds like iWARP supports data transfers being initiated by the server but the Chelsio implementation does not. > One thing I'm experimenting with is to delay the ESTABLISHED event on > the server side until the first FPDU is received. However, we still > probably need a way for an application to know whether the client has to > send first (or the server as you pointed out). > > I believe rnic-pi has an attribute that indicated this behavior... > > Steve. From swise at opengridcomputing.com Fri Apr 28 13:39:06 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 28 Apr 2006 15:39:06 -0500 Subject: [openib-general] [PATCH] [RFC] dapltest change for iwarp In-Reply-To: <44526639.6040905@ichips.intel.com> References: <1146248558.28503.20.camel@stevo-desktop> <44525FD7.4020406@ichips.intel.com> <1146249791.28503.29.camel@stevo-desktop> <44526639.6040905@ichips.intel.com> Message-ID: <1146256746.9586.8.camel@stevo-desktop> On Fri, 2006-04-28 at 12:00 -0700, Sean Hefty wrote: > Steve Wise wrote: > > The Chelsio RNIC has this issue. If the server sends the first FPDU > > _before_ the client driver has moved the connection/qp into RDMA mode, > > then the data is placed as streaming data and the connection must be > > terminated (dapltest 6 exposes this intermittently). Ammasso doesn't > > have this issue, but other RNIC's probably will. > > > > One thing I'm experimenting with is to delay the ESTABLISHED event on > > the server side until the first FPDU is received. However, we still > > probably need a way for an application to know whether the client has to > > send first (or the server as you pointed out). > > > > I believe rnic-pi has an attribute that indicated this behavior... > > Can this be hidden from the user? If the client side needs to send the first > message, couldn't the connection protocol send a 0-byte message that the server > strips out? 
The ESTABLISHED event would be delayed until this message is received. > Yea I guess. A zero-length RDMA Write could be sent by the MPA initiator after it moves the QP/connection into RDMA mode. This will only work if the driver has a way to know if these sorts of FDPUs arrive at the rnic. For Chelsio, the host driver gets no notification of successful zero-length reads or writes. And the host driver needs to generate the ESTABLISHED event. The only thing the Chelsio driver can trigger on is a RECV completion notification. > Also, wouldn't this issue need to be resolved based on both the local and remote > device capabilities? I.e. if the server is using an Ammasso RNIC, it needs to > know the behavior of the remote RNIC, in case it's Chelsio. Is this information > carried as part of the connection, or does iWarp force the client to initiate > the first send? > The MPA layer of the iWARP stack forces this by requiring that the MPA responder (the server) not send an FPDU until it first receives one from the client. Stevo. From swise at opengridcomputing.com Fri Apr 28 13:42:00 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 28 Apr 2006 15:42:00 -0500 Subject: [openib-general] [PATCH] [RFC] dapltest change for iwarp In-Reply-To: References: <1146248558.28503.20.camel@stevo-desktop> <44525FD7.4020406@ichips.intel.com> <1146249791.28503.29.camel@stevo-desktop> Message-ID: <1146256920.9586.12.camel@stevo-desktop> > I'm confused. Is this an iWARP requirement or a Chelsio requirement? > It sounds like iWARP supports data transfers being initiated by the > server but the Chelsio implementation does not. > This is an iWARP requirement. (I will _not_ argue that its a reasonable requirement, however :) See: http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-03.txt Page 35: 4. MPA "Responder" mode implementations MUST receive and validate at least one FPDU before sending any FPDUs or markers. 
Note: this requirement is present to allow the Initiator time to get its receiver into full operation before an FPDU arrives, avoiding potential race conditions at the initiator. This was also subject to some debate in the work group before rough consensus was reached. Eliminating this requirement would allow faster startup in some types of applications. However, that would also make certain implementations (particularly "Dual Stack") much harder. From swise at opengridcomputing.com Fri Apr 28 13:53:43 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 28 Apr 2006 15:53:43 -0500 Subject: [openib-general] [PATCH] [RFC] dapltest change for iwarp In-Reply-To: <1146256920.9586.12.camel@stevo-desktop> References: <1146248558.28503.20.camel@stevo-desktop> <44525FD7.4020406@ichips.intel.com> <1146249791.28503.29.camel@stevo-desktop> <1146256920.9586.12.camel@stevo-desktop> Message-ID: <1146257623.9586.15.camel@stevo-desktop> How this requirement is enforced is really the issue, I guess... On Fri, 2006-04-28 at 15:42 -0500, Steve Wise wrote: > > I'm confused. Is this an iWARP requirement or a Chelsio requirement? > > It sounds like iWARP supports data transfers being initiated by the > > server but the Chelsio implementation does not. > > > > This is an iWARP requirement. (I will _not_ argue that its a reasonable > requirement, however :) > > See: > > http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-03.txt > > Page 35: > > 4. MPA "Responder" mode implementations MUST receive and validate at > least one FPDU before sending any FPDUs or markers. > > Note: this requirement is present to allow the Initiator time to > get its receiver into full operation before an FPDU arrives, > avoiding potential race conditions at the initiator. This > was also subject to some debate in the work group before > rough consensus was reached. Eliminating this requirement > would allow faster startup in some types of applications. 
> However, that would also make certain implementations > (particularly "Dual Stack") much harder. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From caitlinb at broadcom.com Fri Apr 28 13:55:46 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 28 Apr 2006 13:55:46 -0700 Subject: [openib-general] [PATCH] [RFC] dapltest change for iwarp Message-ID: <54AD0F12E08D1541B826BE97C98F99F143B133@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: >> I'm confused. Is this an iWARP requirement or a Chelsio requirement? >> It sounds like iWARP supports data transfers being initiated by the >> server but the Chelsio implementation does not. >> > > This is an iWARP requirement. (I will _not_ argue that its a > reasonable requirement, however :) > There is an extensive googlable discussion on this topic, because the rationale first cited was incorrect -- although the conclusion ended up being correct. The curious can read rddp archives rather than repeating that discussion here. From sean.hefty at intel.com Fri Apr 28 13:58:49 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 28 Apr 2006 13:58:49 -0700 Subject: [openib-general] [PATCH] rdma cm: add sdp version checks Message-ID: Let SDP determine which version of the SDP headers to use. The CMA will check that it can support that version, and set the address fields appropriately. The CMA will also verify that the major SDP and IP versions are supported on received REQ and REP messages. If an unsupported version is received, the connection will be aborted. This will result in sending a consumer REJ message when the cm_id is destroyed. 
Signed-off-by: Sean Hefty ip_version = (ip_ver << 4) | (hdr->ip_version & 0xF); } +static inline u8 sdp_get_majv(u8 sdp_version) +{ + return sdp_version >> 4; +} + static inline u8 sdp_get_ip_ver(struct sdp_hh *hh) { return hh->ip_version >> 4; @@ -483,7 +493,8 @@ static int cma_get_net_info(void *hdr, e { switch (ps) { case RDMA_PS_SDP: - if (((struct sdp_hh *) hdr)->sdp_version != SDP_VERSION) + if (sdp_get_majv(((struct sdp_hh *) hdr)->sdp_version) != + SDP_MAJ_VERSION) return -EINVAL; *ip_ver = sdp_get_ip_ver(hdr); @@ -501,6 +512,9 @@ static int cma_get_net_info(void *hdr, e *dst = &((struct cma_hdr *) hdr)->dst_addr; break; } + + if (*ip_ver != 4 && *ip_ver != 6) + return -EINVAL; return 0; } @@ -714,6 +728,16 @@ reject: return ret; } +static int cma_verify_rep(struct rdma_id_private *id_priv, void *data) +{ + if (id_priv->id.ps == RDMA_PS_SDP && + sdp_get_majv(((struct sdp_hah *) data)->sdp_version) != + SDP_MAJ_VERSION) + return -EINVAL; + + return 0; +} + static int cma_rtu_recv(struct rdma_id_private *id_priv) { int ret; @@ -748,7 +772,10 @@ static int cma_ib_handler(struct ib_cm_i status = -ETIMEDOUT; break; case IB_CM_REP_RECEIVED: - if (id_priv->id.qp) { + status = cma_verify_rep(id_priv, ib_event->private_data); + if (status) + event = RDMA_CM_EVENT_CONNECT_ERROR; + else if (id_priv->id.qp) { status = cma_rep_recv(id_priv); event = status ? 
RDMA_CM_EVENT_CONNECT_ERROR : RDMA_CM_EVENT_ESTABLISHED; @@ -1473,8 +1500,8 @@ err: } EXPORT_SYMBOL(rdma_bind_addr); -static void cma_format_hdr(void *hdr, enum rdma_port_space ps, - struct rdma_route *route) +static int cma_format_hdr(void *hdr, enum rdma_port_space ps, + struct rdma_route *route) { struct sockaddr_in *src4, *dst4; struct cma_hdr *cma_hdr; @@ -1486,7 +1513,8 @@ static void cma_format_hdr(void *hdr, en switch (ps) { case RDMA_PS_SDP: sdp_hdr = hdr; - sdp_hdr->sdp_version = SDP_VERSION; + if (sdp_get_majv(sdp_hdr->sdp_version) != SDP_MAJ_VERSION) + return -EINVAL; sdp_set_ip_ver(sdp_hdr, 4); sdp_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr; sdp_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr; @@ -1501,6 +1529,7 @@ static void cma_format_hdr(void *hdr, en cma_hdr->port = src4->sin_port; break; } + return 0; } static int cma_connect_ib(struct rdma_id_private *id_priv, @@ -1530,7 +1559,9 @@ static int cma_connect_ib(struct rdma_id } route = &id_priv->id.route; - cma_format_hdr(private_data, id_priv->id.ps, route); + ret = cma_format_hdr(private_data, id_priv->id.ps, route); + if (ret) + goto out; req.private_data = private_data; req.primary_path = &route->path_rec[0]; From caitlinb at broadcom.com Fri Apr 28 14:03:01 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 28 Apr 2006 14:03:01 -0700 Subject: [openib-general] [PATCH] [RFC] dapltest change for iwarp Message-ID: <54AD0F12E08D1541B826BE97C98F99F143B136@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Steve Wise wrote: >> The Chelsio RNIC has this issue. If the server sends the first FPDU >> _before_ the client driver has moved the connection/qp into RDMA >> mode, then the data is placed as streaming data and the connection >> must be terminated (dapltest 6 exposes this intermittently). Ammasso >> doesn't have this issue, but other RNIC's probably will. 
>> >> One thing I'm experimenting with is to delay the ESTABLISHED event on >> the server side until the first FPDU is received. However, we still >> probably need a way for an application to know whether the client has >> to send first (or the server as you pointed out). >> >> I believe rnic-pi has an attribute that indicated this behavior... > > Can this be hidden from the user? If the client side needs > to send the first message, couldn't the connection protocol > send a 0-byte message that the server strips out? The > ESTABLISHED event would be delayed until this message is received. > A zero byte Send/Receive message is not transparent; it consumes a work request. A zero byte write avoids this, and if you read only the MPA specification you would believe that it would work. Unfortunately, the RDMAC verbs assumed there was no harm in requiring the active side to do the first Send, and clearly encourage that less accommodating approach in the firmware. So relying on a zero byte RDMA Write to clear the problem away would not be advisable. Now any given application can avoid this problem entirely by simply having a NOP request that can be sent by the active side when it doesn't have anything to do at startup. But that's a few layers too high to influence when the Established event is generated. > Also, wouldn't this issue need to be resolved based on both > the local and remote device capabilities? I.e. if the server > is using an Ammasso RNIC, it needs to know the behavior of > the remote RNIC, in case it's Chelsio. Is this information > carried as part of the connection, or does iWarp force the > client to initiate the first send? > On an IP network you don't know what model NIC your peer is using. The real question is how the server knows when it can post to the Send Queue. Having a Provider Attribute allows this to be handled at the verbs or middleware layer so that the application can ignore these variations.
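For readers following the thread, the ordering constraint discussed above (the MPA responder must not send until it has received a first FPDU from the initiator) can be sketched as a small gate in front of the send path. This is only an illustrative sketch, not code from any RNIC driver or from DAPL: the names sketch_conn, sketch_post_send and sketch_first_fpdu_received are hypothetical, and deferring sends in a counter stands in for whatever queueing a real provider or middleware layer would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical connection states for this sketch only. */
enum sketch_conn_state {
    SKETCH_MPA_DONE,     /* MPA negotiation finished, no FPDU seen yet */
    SKETCH_ESTABLISHED   /* first inbound FPDU seen; sends are allowed */
};

struct sketch_conn {
    enum sketch_conn_state state;
    bool is_responder;   /* true on the passive (server) side */
    int deferred_sends;  /* sends requested before the first FPDU arrived */
};

/*
 * Returns true if the send may go out immediately, false if it was
 * deferred because this side is the responder and has not yet received
 * an FPDU from the initiator.
 */
static bool sketch_post_send(struct sketch_conn *c)
{
    assert(c != NULL);
    if (c->is_responder && c->state != SKETCH_ESTABLISHED) {
        c->deferred_sends++;
        return false;
    }
    return true;
}

/*
 * Called when the first FPDU from the initiator has been received.
 * Marks the connection established and returns how many deferred
 * sends can now be flushed.
 */
static int sketch_first_fpdu_received(struct sketch_conn *c)
{
    int flushed;

    assert(c != NULL);
    flushed = c->deferred_sends;
    c->state = SKETCH_ESTABLISHED;
    c->deferred_sends = 0;
    return flushed;
}
```

The initiator side is never gated in this sketch, which matches the point made above that an application-level NOP sent by the active side is enough to unblock the responder.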
From sean.hefty at intel.com Fri Apr 28 15:20:13 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 28 Apr 2006 15:20:13 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests Message-ID: Today, a request MAD received by the MAD layer is handed to a client. The client processes the MAD, and generates a response. If the client is slow to process the MAD, the request may have been resent. The duplicate request is also handed to the client. The result is that clients perform duplicate processing of the MAD or must detect the duplicates themselves. I'd like to propose that the MAD layer detect duplicate requests. After a request MAD has been handed to a client, its context would be maintained until the user calls ib_free_recv_mad(), allowing duplicate requests to be discarded. One drawback to this approach are that the MAD layer may discard a MAD as a duplicate that wasn't, but I'm not sure if this would happen in practice. A second drawback is that the receive MAD would need to be kept around until the send completed (as opposed to the send started). Finally, a way would need to be found for when to call ib_free_recv_mad() for userspace clients. - Sean From mshefty at ichips.intel.com Fri Apr 28 15:44:03 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 28 Apr 2006 15:44:03 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: References: Message-ID: <44529AB3.7080905@ichips.intel.com> Sean Hefty wrote: > I'd like to propose that the MAD layer detect duplicate requests. After a > request MAD has been handed to a client, its context would be maintained until > the user calls ib_free_recv_mad(), allowing duplicate requests to be discarded. I should add that this also provides context that the MAD layer can use when performing DS RMPP. On the initiator side, DS RMPP would be detected by an RMPP request that expected a response. (This assumes that the response is also RMPP.) 
On the responder side, DS RMPP is detected when an RMPP response is sent in response to an RMPP request. - Sean From sean.hefty at intel.com Fri Apr 28 16:05:41 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 28 Apr 2006 16:05:41 -0700 Subject: [openib-general] [PATCH 5/6] iser RDMA CM (CMA) and IB verbsinteraction In-Reply-To: Message-ID: >+static int iser_free_device_ib_res(struct iser_device *device) >+{ >+ BUG_ON(device->mr == NULL); >+ >+ tasklet_kill(&device->cq_tasklet); >+ >+ (void)ib_dereg_mr(device->mr); >+ (void)ib_destroy_cq(device->cq); >+ (void)ib_dealloc_pd(device->pd); >+ >+ device->mr = NULL; >+ device->cq = NULL; >+ device->pd = NULL; >+ return 0; >+} Can you eliminate the return code? >+static int iser_free_ib_conn_res(struct iser_conn *ib_conn) >+{ >+ BUG_ON(ib_conn == NULL); >+ >+ iser_err("freeing conn %p cma_id %p fmr pool %p qp %p\n", >+ ib_conn, ib_conn->cma_id, >+ ib_conn->fmr_pool, ib_conn->qp); >+ >+ /* qp is created only once both addr & route are resolved */ >+ if (ib_conn->fmr_pool != NULL) >+ ib_destroy_fmr_pool(ib_conn->fmr_pool); >+ >+ if (ib_conn->qp != NULL) >+ rdma_destroy_qp(ib_conn->cma_id); >+ >+ if (ib_conn->cma_id != NULL) >+ rdma_destroy_id(ib_conn->cma_id); Are the NULL checks needed above? Neither iser_create_device_ib_res() or iser_create_ib_conn_res() set the values to NULL if an error occurred. >+ >+ ib_conn->fmr_pool = NULL; >+ ib_conn->qp = NULL; >+ ib_conn->cma_id = NULL; >+ kfree(ib_conn->page_vec); >+ >+ return 0; >+} >+ >+/** >+ * based on the resolved device node GUID see if there already allocated >+ * device for this device. If there's no such, create one. 
>+ */ >+static >+struct iser_device *iser_device_find_by_ib_device(struct rdma_cm_id *cma_id) >+{ >+ struct list_head *p_list; >+ struct iser_device *device = NULL; >+ >+ mutex_lock(&ig.device_list_mutex); >+ >+ p_list = ig.device_list.next; >+ while (p_list != &ig.device_list) { >+ device = list_entry(p_list, struct iser_device, ig_list); >+ /* find if there's a match using the node GUID */ >+ if (device->ib_device->node_guid == cma_id->device->node_guid) >+ break; >+ } >+ >+ if (device == NULL) { >+ device = kzalloc(sizeof *device, GFP_KERNEL); >+ if (device == NULL) >+ goto end; goto out; // see below >+ /* assign this device to the device */ >+ device->ib_device = cma_id->device; >+ /* init the device and link it into ig device list */ >+ if (iser_create_device_ib_res(device)) { >+ kfree(device); >+ device = NULL; >+ goto end; >+ } >+ list_add(&device->ig_list, &ig.device_list); >+ } >+end: >+ BUG_ON(device == NULL); >+ device->refcount++; out: >+ mutex_unlock(&ig.device_list_mutex); >+ return device; >+} >+ >+static void iser_disconnected_handler(struct rdma_cm_id *cma_id) >+{ >+ struct iser_conn *ib_conn; >+ >+ ib_conn = (struct iser_conn *)cma_id->context; >+ ib_conn->disc_evt_flag = 1; >+ >+ /* If this event is unsolicited this means that the conn is being */ >+ /* terminated asynchronously from the iSCSI layer's perspective. 
*/ >+ if (atomic_read(&ib_conn->state) == ISER_CONN_PENDING) { >+ atomic_set(&ib_conn->state, ISER_CONN_DOWN); >+ wake_up_interruptible(&ib_conn->wait); >+ } else { >+ if (atomic_read(&ib_conn->state) == ISER_CONN_UP) { >+ atomic_set(&ib_conn->state, ISER_CONN_TERMINATING); >+ iscsi_conn_failure(ib_conn->iser_conn->iscsi_conn, >+ ISCSI_ERR_CONN_FAILED); >+ } >+ /* Complete the termination process if no posts are pending */ >+ if ((atomic_read(&ib_conn->post_recv_buf_count) == 0) && >+ (atomic_read(&ib_conn->post_send_buf_count) == 0)) { >+ atomic_set(&ib_conn->state, ISER_CONN_DOWN); >+ wake_up_interruptible(&ib_conn->wait); >+ } >+ } Are there races here between reading ib_conn->state and setting it? Could it have changed in between the atomic_read() and atomic_set()? >+ src = (struct sockaddr *)src_addr; >+ dst = (struct sockaddr *)dst_addr; >+ err = rdma_resolve_addr(ib_conn->cma_id, src, dst, 1000); >+ if (err) { >+ iser_err("rdma_resolve_addr failed: %d\n", err); >+ goto addr_failure; >+ } >+ >+ if (!non_blocking) { >+ wait_event_interruptible(ib_conn->wait, >+ atomic_read(&ib_conn->state) != ISER_CONN_PENDING); >+ >+ if (atomic_read(&ib_conn->state) != ISER_CONN_UP) { >+ err = -EIO; >+ goto connect_failure; >+ } >+ } >+ >+ mutex_lock(&ig.connlist_mutex); >+ list_add(&ib_conn->conn_list, &ig.connlist); >+ mutex_unlock(&ig.connlist_mutex); Not sure if there's a race here or not, but rdma_resolve_addr() will result in a callback from a separate thread. That callback could occur before the ib_conn is added to the ig.connlist. Do you assume that ib_conn is in the connlist in any of the callbacks? 
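On the read-then-set question raised above for ib_conn->state: one common way to close that kind of window is to make the check and the update a single compare-and-swap, so the transition only happens if the state is still what the caller observed. The following is a userspace sketch using C11 atomics, not the iSER code; the enum values and sketch_state_transition are hypothetical names, and in kernel code the analogous primitive would be atomic_cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states mirroring the ones discussed in the review. */
enum sketch_state {
    SKETCH_CONN_PENDING,
    SKETCH_CONN_UP,
    SKETCH_CONN_TERMINATING,
    SKETCH_CONN_DOWN
};

/*
 * Atomically move *state from 'from' to 'to'.
 * Returns true only for the one caller that performed the transition;
 * any caller that lost the race (state already changed) gets false.
 */
static bool sketch_state_transition(atomic_int *state, int from, int to)
{
    int expected = from;

    assert(state != NULL);
    return atomic_compare_exchange_strong(state, &expected, to);
}
```

With this shape, a handler would call the failure path (for example reporting the connection failure) only when sketch_state_transition(&state, SKETCH_CONN_UP, SKETCH_CONN_TERMINATING) returns true, so two contexts racing on the same connection cannot both act on the UP state.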
>+int iser_post_recv(struct iser_desc *rx_desc) >+{ >+ int ib_ret, ret_val = 0; >+ struct ib_recv_wr recv_wr, *recv_wr_failed; >+ struct ib_sge iov[2]; >+ struct iser_conn *ib_conn; >+ struct iser_dto *recv_dto = &rx_desc->dto; >+ >+ /* Retrieve conn */ >+ ib_conn = recv_dto->conn->ib_conn; >+ >+ iser_dto_to_iov(recv_dto, iov, 2); >+ >+ recv_wr.next = NULL; >+ recv_wr.sg_list = iov; >+ recv_wr.num_sge = recv_dto->regd_vector_len; >+ recv_wr.wr_id = (unsigned long)rx_desc; Nit - position of "=" signs above is weird. >+static void iser_comp_error_worker(void *data) >+{ >+ struct iser_conn *ib_conn = data; >+ >+ if (atomic_read(&ib_conn->state) == ISER_CONN_UP) { >+ atomic_set(&ib_conn->state, ISER_CONN_TERMINATING); >+ iscsi_conn_failure(ib_conn->iser_conn->iscsi_conn, >+ ISCSI_ERR_CONN_FAILED); >+ } Potential race reading/setting state? - Sean From troy at scl.ameslab.gov Fri Apr 28 16:15:53 2006 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Fri, 28 Apr 2006 18:15:53 -0500 Subject: [openib-general] ibv_rc_pingpong debugging.. Message-ID: <4452A229.5020901@scl.ameslab.gov> So, how do I start debugging this? ibv_devinfo reports the port as active.. what else would cause this? (I am running the userspace modules from http://openib.red-bean.com/rc2/SOURCES/ , and kernel 2.6.16.11) [root at node3 netpipe3-dev]# ibv_rc_pingpong -n 1 node4 local address: LID 0x0050, QPN 0x040404, PSN 0xd70996 remote address: LID 0x0061, QPN 0x050404, PSN 0xe5357a Failed status 12 for wr_id 2 From albertt at broadcom.com Fri Apr 28 16:41:41 2006 From: albertt at broadcom.com (Albert To) Date: Fri, 28 Apr 2006 16:41:41 -0700 Subject: [openib-general] Problem running mpdboot command in MVAPICH2 v0.9.3-RC0 Message-ID: Hi, I downloaded and compiled the MVAPICH2 v0.9.3-RC0 using make.mvapich2.gen2 script. The script finished without any errors. 
However, I received "mpdboot: error while loading shared libraries: libibverbs.so.1: cannot open shared object file: No such file or directory" error while executing mpdboot -n 2 -f mpd.hosts. I checked library file libibverbs.so.1 and found it in /usr/local/lib folder. LD_LIBRARY_PATH is already set to /usr/local/bin, but that didn't help. Is there another environment variable that I need to set to make mpdboot works? Thanks in advance for your help. -Albert From lindahl at pathscale.com Fri Apr 28 17:13:48 2006 From: lindahl at pathscale.com (Greg Lindahl) Date: Fri, 28 Apr 2006 17:13:48 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: References: Message-ID: <20060429001348.GA3843@greglaptop.internal.keyresearch.com> On Fri, Apr 28, 2006 at 03:20:13PM -0700, Sean Hefty wrote: > I'd like to propose that the MAD layer detect duplicate requests. Sean, You can't add this kind of thing piecemeal to a protocol and have it work. If the sender doesn't see a response (perhaps the response was lost, or was slow coming), and sends another MAD, this 2nd MAD will have a different sequence number. How does the recipient know it's the same request? If the response was lost the first time, eating the 2nd MAD without sending a response will result in another timeout and a 3rd MAD... so maybe the recipient remembers the response and sends it again. Will that work? Well, no, it's not guaranteed, because the sender may reject a stale response received after sending the 2nd MAD... Really, it's up to the MAD client to deal with duplicates in its own way. And yes, this class of issues shows up in practice. Ask anyone who's ever worked on a large distributed system. "Execute exactly once" semantics require end-to-end design.
-- greg From sean.hefty at intel.com Fri Apr 28 23:23:44 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 28 Apr 2006 23:23:44 -0700 Subject: [openib-general] RFC: detecting duplicate MAD requests In-Reply-To: <20060429001348.GA3843@greglaptop.internal.keyresearch.com> Message-ID: >You can't add this kind of thing piecemeal to a protocol and have it >work. If the sender doesn't see a response (perhaps the response was >lost, or was slow coming), and sends another MAD, this 2nd MAD will >have a different sequence number. How does the recipient know it's the If a MAD is sent with a different sequence number (transaction ID), then it's a different transaction or request. There is a real issue that is seen when a duplicate request (same TID, SGID, mgmt class) is received at the client, resulting in a duplicate response. The MAD layer cannot allow the duplicate response to be sent because of RMPP issues. The most efficient solution is to detect the duplicate request, and avoid all of the processing overhead of generating a response that must be discarded. No change to the MAD protocol is being proposed. Ib_free_recv_mad() already exists, and must be called by each client. The only change being proposed is that until ib_free_recv_mad() is called, another message with the same TID, SGID, and mgmt class is treated as a duplicate. I believe that this is consistent with C13-18.1.1. >same request? If the response was lost the first time, eating the 2nd >MAD without sending a response will result in another timeout and a >3rd MAD... so maybe the recipient remembers the response and sends it The proposal is to only discard duplicate requests while a response to the first request is being generated. Just because a client sends a request 3 times before we can send a response doesn't mean that we need to send 3 responses. Such an implementation is suboptimal, and the responses that are of most concern use RMPP anyway. 
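To make the proposal concrete, here is a rough sketch of the bookkeeping it implies: a request is remembered by (TID, SGID, management class) from the time it is handed to the client until the client frees it, and a second request with the same key in that window is treated as a duplicate. This is not the ib_mad implementation; the struct layout, the fixed-size table, and the names sketch_request_received and sketch_request_freed are all assumptions made for illustration, with sketch_request_freed standing in for the point where ib_free_recv_mad() would be called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_OUTSTANDING 16

/* Hypothetical key: the fields named in the proposal. */
struct sketch_req_key {
    uint64_t tid;        /* transaction ID */
    uint8_t  sgid[16];   /* source GID */
    uint8_t  mgmt_class; /* management class */
};

struct sketch_req_table {
    struct sketch_req_key keys[SKETCH_MAX_OUTSTANDING];
    bool in_use[SKETCH_MAX_OUTSTANDING];
};

static bool sketch_key_equal(const struct sketch_req_key *a,
                             const struct sketch_req_key *b)
{
    return a->tid == b->tid &&
           a->mgmt_class == b->mgmt_class &&
           memcmp(a->sgid, b->sgid, sizeof a->sgid) == 0;
}

/*
 * Returns true if the request should be handed to the client, false if
 * it duplicates a request that is still outstanding. If the table is
 * full the request is delivered untracked rather than dropped.
 */
static bool sketch_request_received(struct sketch_req_table *t,
                                    const struct sketch_req_key *k)
{
    int free_slot = -1;

    assert(t != NULL && k != NULL);
    for (int i = 0; i < SKETCH_MAX_OUTSTANDING; i++) {
        if (t->in_use[i]) {
            if (sketch_key_equal(&t->keys[i], k))
                return false;           /* duplicate: discard */
        } else if (free_slot < 0) {
            free_slot = i;
        }
    }
    if (free_slot >= 0) {
        t->keys[free_slot] = *k;
        t->in_use[free_slot] = true;
    }
    return true;
}

/* Forget a request once the client has released the receive. */
static void sketch_request_freed(struct sketch_req_table *t,
                                 const struct sketch_req_key *k)
{
    assert(t != NULL && k != NULL);
    for (int i = 0; i < SKETCH_MAX_OUTSTANDING; i++) {
        if (t->in_use[i] && sketch_key_equal(&t->keys[i], k)) {
            t->in_use[i] = false;
            return;
        }
    }
}
```

Note that a retransmission arriving after the entry has been freed is delivered again in this sketch, which is consistent with the statement above that duplicates are only discarded while the first response is still being generated.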
>Really, it's up to the MAD client to deal with duplicates in its own >way. A client is still restricted from sending a duplicate response while a previous response is in progress. RMPP cannot handle this case. - Sean From info at schihei.de Fri Apr 28 23:38:46 2006 From: info at schihei.de (Heiko J Schick) Date: Sat, 29 Apr 2006 08:38:46 +0200 Subject: [openib-general] Re: [PATCH 04/16] ehca: userspace support In-Reply-To: <84144f020604272332s6101032cy6936096230f3637c@mail.gmail.com> References: <4450A176.9000008@de.ibm.com> <20060427114355.GB32127@wohnheim.fh-wedel.de> <1146177388.19236.1.camel@localhost.localdomain> <6C4A3B96-4752-4FF9-8FBE-C383B00AE014@schihei.de> <84144f020604272332s6101032cy6936096230f3637c@mail.gmail.com> Message-ID: <4044CACC-FB5A-415E-8974-27136269B5C1@schihei.de> Hello, On 28.04.2006, at 08:32, Pekka Enberg wrote: >> The problem I see with pr_debug() is that it could only activated via >> a compile flag. To use the debug outputs you have to re-compile / >> compile your own kernel. > > Do you really need this heavy debug logging in the first place? You > can use kprobes for arbitrary run-time inspection anyway, so logging > everything seems wasteful. The problem I see with kprobes is that you have to set several kernel configuration options (e.g. CONFIG_KPROBES, CONFIG_DEBUG_INFO, etc.) on compile time to use it. Same problem with pr_debug(). 
Regards, Heiko From penberg at cs.helsinki.fi Fri Apr 28 23:44:15 2006 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Sat, 29 Apr 2006 09:44:15 +0300 Subject: [openib-general] Re: possible bug in kmem_cache related code In-Reply-To: <15ddcffd0604281224i4308b08fs93f9ebaf7e9a16b3@mail.gmail.com> References: <84144f020604270419s10696877he2ec27ae6d52e486@mail.gmail.com> <15ddcffd0604281224i4308b08fs93f9ebaf7e9a16b3@mail.gmail.com> Message-ID: <1146293055.11279.2.camel@localhost> On Fri, 2006-04-28 at 21:24 +0200, Or Gerlitz wrote: > Yes, i can reproduce this at will, no local modifications, my system > is amd dual x86_64, i have attached my .config to the first email of > this thread, and also mentioned that some CONFIG_DEBUG_ options are > set, including one related to slab debugging. > > Also, by "User mode Linux" you mean linux kernel that runs as a user > process on your system? Yeah, arch/um/. Unfortunately I don't have a SMP box, so I probably can't reproduce this. You could try git bisect to isolate the offending changeset. Pekka From info at schihei.de Fri Apr 28 23:49:40 2006 From: info at schihei.de (Heiko J Schick) Date: Sat, 29 Apr 2006 08:49:40 +0200 Subject: [openib-general] ibv_rc_pingpong debugging.. In-Reply-To: <4452A229.5020901@scl.ameslab.gov> References: <4452A229.5020901@scl.ameslab.gov> Message-ID: <0AA7C7EE-4A7E-49E6-85F7-E9AD2FF9BB47@schihei.de> Hello Troy, On 29.04.2006, at 01:15, Troy Benjegerdes wrote: > So, how do I start debugging this? > > ibv_devinfo reports the port as active.. what else would cause this? > (I am running the userspace modules from http://openib.red-bean.com/ > rc2/SOURCES/ , and kernel 2.6.16.11) > > [root at node3 netpipe3-dev]# ibv_rc_pingpong -n 1 node4 > local address: LID 0x0050, QPN 0x040404, PSN 0xd70996 > remote address: LID 0x0061, QPN 0x050404, PSN 0xe5357a > Failed status 12 for wr_id 2 If I counted correctly, WQE status 12 means IBV_WC_RETRY_EXC_ERR. 
As you can see in src/userspace/libibverbs/include/infiniband/verbs.h: enum ibv_wc_status { IBV_WC_SUCCESS, IBV_WC_LOC_LEN_ERR, IBV_WC_LOC_QP_OP_ERR, IBV_WC_LOC_EEC_OP_ERR, IBV_WC_LOC_PROT_ERR, IBV_WC_WR_FLUSH_ERR, IBV_WC_MW_BIND_ERR, IBV_WC_BAD_RESP_ERR, IBV_WC_LOC_ACCESS_ERR, IBV_WC_REM_INV_REQ_ERR, IBV_WC_REM_ACCESS_ERR, IBV_WC_REM_OP_ERR, IBV_WC_RETRY_EXC_ERR, IBV_WC_RNR_RETRY_EXC_ERR, IBV_WC_LOC_RDD_VIOL_ERR, IBV_WC_REM_INV_RD_REQ_ERR, IBV_WC_REM_ABORT_ERR, IBV_WC_INV_EECN_ERR, IBV_WC_INV_EEC_STATE_ERR, IBV_WC_FATAL_ERR, IBV_WC_RESP_TIMEOUT_ERR, IBV_WC_GENERAL_ERR }; Perhaps you have problems with node4. I'm not sure for 100%, but I think this error can be caused when node4 has not set up his IB resources (QPs, etc.) properly. Do you see any errors on node4, too? Regards, Heiko From heiko.carstens at de.ibm.com Sat Apr 29 00:08:26 2006 From: heiko.carstens at de.ibm.com (Heiko Carstens) Date: Sat, 29 Apr 2006 09:08:26 +0200 Subject: [openib-general] Re: [PATCH 04/16] ehca: userspace support In-Reply-To: <4044CACC-FB5A-415E-8974-27136269B5C1@schihei.de> References: <4450A176.9000008@de.ibm.com> <20060427114355.GB32127@wohnheim.fh-wedel.de> <1146177388.19236.1.camel@localhost.localdomain> <6C4A3B96-4752-4FF9-8FBE-C383B00AE014@schihei.de> <84144f020604272332s6101032cy6936096230f3637c@mail.gmail.com> <4044CACC-FB5A-415E-8974-27136269B5C1@schihei.de> Message-ID: <20060429070826.GA9463@osiris.boeblingen.de.ibm.com> > >>The problem I see with pr_debug() is that it could only activated via > >>a compile flag. To use the debug outputs you have to re-compile / > >>compile your own kernel. > > > >Do you really need this heavy debug logging in the first place? You > >can use kprobes for arbitrary run-time inspection anyway, so logging > >everything seems wasteful. > > The problem I see with kprobes is that you have to set several kernel > configuration options (e.g. CONFIG_KPROBES, CONFIG_DEBUG_INFO, etc.) > on compile time to use it. Same problem with pr_debug(). 
It might be worth to move the s390 debug feature to common code. At least it has proven many times to be very useful in device driver debugging... See Documentation/s390/s390dbf.txt and arch/s390/kernel/debug.c. From mst at mellanox.co.il Sat Apr 29 12:39:13 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 29 Apr 2006 22:39:13 +0300 Subject: [openib-general] Re: [PATCH v2] mad: use GID/LID on requester side when matching responses to requests In-Reply-To: <443A9F98.9060604@ichips.intel.com> References: <200604101804.34043.jackm@mellanox.co.il> <443A9F98.9060604@ichips.intel.com> Message-ID: <20060429193913.GA9584@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH v2] mad: use GID/LID on requester side when matching responses to requests > > Jack Morgenstein wrote: > >Corrected and cleaner version. > > > >Check GID/LID for requester side when searching for request which matches > >received response. This, in order to guarantee uniqueness if use same TID > >when requesting via multiple source LIDs (when LMC is not zero). To perform > >check, add LMC to cache. > > > >Further, do not perform LID check for direct-routed packets, since > >permissive > >LID makes a proper check impossible. > > Thanks - I'll look at this within the next couple of days. Could this patch be merged please? Sean? -- MST From mst at mellanox.co.il Sat Apr 29 12:48:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 29 Apr 2006 22:48:15 +0300 Subject: [openib-general] Re: RFC: detecting duplicate MAD requests In-Reply-To: References: Message-ID: <20060429194815.GB9584@mellanox.co.il> Quoting r. Sean Hefty : > I'd like to propose that the MAD layer detect duplicate requests. After a > request MAD has been handed to a client, its context would be maintained until > the user calls ib_free_recv_mad(), allowing duplicate requests to be discarded. 
I understand that this is along the lines of the approach proposed by Jack a while ago: https://openib.org/svn/trunk/contrib/mellanox/gen2/patches/mad_rmpp_requester_retry.patch with the difference that duplicate request detection will be handled by full GID/TID/class/request/response matching, and not just TID matching. Right? -- MST From xma at us.ibm.com Sat Apr 29 11:43:12 2006 From: xma at us.ibm.com (Shirley Ma) Date: Sat, 29 Apr 2006 11:43:12 -0700 Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: Message-ID: Michael, smp kernel on UP result is very bad. It dropped 40% throughput. up kernel on UP thoughput dropped with cpu utilization dropped from 75% idle to 52% idle. I didn't see latency difference. I used TCP_RR test. The patch was tested on linux 2.6.16 kernel. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From mst at mellanox.co.il Sat Apr 29 15:23:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 30 Apr 2006 01:23:51 +0300 Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: References: Message-ID: <20060429222351.GC9584@mellanox.co.il> Quoting r. Shirley Ma : > Subject: Re: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval > > > Michael, > > smp kernel on UP result is very bad. It dropped 40% throughput. > up kernel on UP thoughput dropped with cpu utilization dropped from 75% idle to 52% idle. Hmm. So far it seems the approach only works well on 2 CPUs. > I didn't see latency difference. I used TCP_RR test. This is somewhat surprising, isn't it? One would explain the extra context switch to have some effect on latency, would one not?
-- MST From xma at us.ibm.com Sat Apr 29 14:02:33 2006 From: xma at us.ibm.com (Shirley Ma) Date: Sat, 29 Apr 2006 14:02:33 -0700 Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: <20060429222351.GC9584@mellanox.co.il> Message-ID: "Michael S. Tsirkin" wrote on 04/29/2006 03:23:51 PM: > Quoting r. Shirley Ma : > > Subject: Re: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ,? > increase both send/recv poll NUM_WC & interval > > > > > > Michael, > > > > smp kernel on UP result is very bad. It dropped 40% throughput. > > up kernel on UP thoughput dropped with cpu utilization dropped > from 75% idle to 52% idle. > > Hmm. So far it seems the approach only works well on 2 CPUs. Yes, since one cpu can only handler one handler. The easy solution would use ifdef SMP. Some mistake here, the cpu utilization dropped from 85%(15% idle) to 52%(48% idle). > > I didn't see latency difference. I used TCP_RR test. > > This is somewhat surprising, isn't it? One would explain the extra > context switch to have some effect on latency, would one not? > > -- > MST Thread context switch is pretty light. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From huanwei at cse.ohio-state.edu Sat Apr 29 14:09:06 2006 From: huanwei at cse.ohio-state.edu (wei huang) Date: Sat, 29 Apr 2006 17:09:06 -0400 (EDT) Subject: [openib-general] Problem running mpdboot command in MVAPICH2 v0.9.3-RC0 In-Reply-To: Message-ID: Hi Albert, Not sure if you export /usr/local/lib to LD_LIBRARY_PATH manually or it is in your bashrc. Could you please try to put export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH in your .bashrc (assume using bash) and try again? Thanks. Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. 
of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Fri, 28 Apr 2006, Albert To wrote: > Hi, > > I downloaded and compiled the MVAPICH2 v0.9.3-RC0 using > make.mvapich2.gen2 script. The script finished without any errors. > However, I received "mpdboot: error while loading shared libraries: > libibverbs.so.1: cannot open shared object file: No such file or > directory" error while executing mpdboot -n 2 -f mpd.hosts. I checked > library file libibverbs.so.1 and found it in /usr/local/lib folder. > LD_LIBRARY_PATH is already set to /usr/local/bin, but that didn't help. > > Is there another environment variable that I need to set to make mpdboot > works? Thanks in advance for your help. > > -Albert > From xma at us.ibm.com Sat Apr 29 16:04:09 2006 From: xma at us.ibm.com (Shirley Ma) Date: Sat, 29 Apr 2006 16:04:09 -0700 Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: <20060429222351.GC9584@mellanox.co.il> Message-ID: Michael, "Michael S. Tsirkin" wrote on 04/29/2006 03:23:51 PM: > Quoting r. Shirley Ma : > > Subject: Re: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ,? > increase both send/recv poll NUM_WC & interval > > > > > > Michael, > > > > smp kernel on UP result is very bad. It dropped 40% throughput. > > up kernel on UP thoughput dropped with cpu utilization dropped > from 75% idle to 52% idle. > > Hmm. So far it seems the approach only works well on 2 CPUs. Did a clean 2.6.16 uniprocessor kernel build on both sides, + patch1 (splitting CQ & handler) + patch2 (tune CQ polling interval) + patch3 (use work queue in CQ handler) + patch4 (remove tx_ring) (rx_ring removal hasn't done yet) Without tuning, i got 1-3% throughput increase with average 10% cpu utiilzation reduce on netserver side. W/O patches, netperf side is 100% cpu utilization. 
The best result I got so far with tuning is a 25% throughput increase plus a 2-5% cpu utilization saving on the netperf side. > > I didn't see latency difference. I used TCP_RR test. > > This is somewhat surprising, isn't it? One would expect the extra > context switch to have some effect on latency, would one not? > > -- > MST I got around 4% latency decrease on UP with less cpu utilization. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From mst at mellanox.co.il Sat Apr 29 18:01:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 30 Apr 2006 04:01:41 +0300 Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: References: <20060429222351.GC9584@mellanox.co.il> Message-ID: <20060430010141.GA15439@mellanox.co.il> Quoting r. Shirley Ma : > Subject: Re: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval > > > Michael, > > "Michael S. Tsirkin" wrote on 04/29/2006 03:23:51 PM: > > Quoting r. Shirley Ma : > > > Subject: Re: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, > > increase both send/recv poll NUM_WC & interval > > > > > > > > > Michael, > > > > > > smp kernel on UP result is very bad. It dropped 40% throughput. > > > up kernel on UP throughput dropped with cpu utilization dropped > > from 75% idle to 52% idle. > > > > Hmm. So far it seems the approach only works well on 2 CPUs. > > Did a clean 2.6.16 uniprocessor kernel build on both sides, > + patch1 (splitting CQ & handler) > + patch2 (tune CQ polling interval) > + patch3 (use work queue in CQ handler) > + patch4 (remove tx_ring) (rx_ring removal hasn't been done yet) > > Without tuning, i got 1-3% throughput increase with average 10% > cpu utilization reduce on netserver side. W/O patches, netperf side > is 100% cpu utilization.
> > The best result I got so far with tunning, 25% throughput increase > + 2-5% cpu utilization saving in netperf side. Is the difference with previous result the tx_ring removal? > > > I didn't see latency difference. I used TCP_RR test. > > > > This is somewhat surprising, isn't it? One would explain the extra > > context switch to have some effect on latency, would one not? > > I got around 4% latency decrease on UP with less cpu utilization. You mean, latency actually got better? If so, that is surprising. -- MST From xma at us.ibm.com Sat Apr 29 16:29:29 2006 From: xma at us.ibm.com (Shirley Ma) Date: Sat, 29 Apr 2006 16:29:29 -0700 Subject: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: <20060430010141.GA15439@mellanox.co.il> Message-ID: "Michael S. Tsirkin" wrote on 04/29/2006 06:01:41 PM: > Quoting r. Shirley Ma : > > Subject: Re: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ,? > increase both send/recv poll NUM_WC & interval > > > > > > Michael, > > > > "Michael S. Tsirkin" wrote on 04/29/2006 03:23:51 PM: > > > Quoting r. Shirley Ma : > > > > Subject: Re: [openib-general] Re: Re: [PATCH] IPoIB splitting CQ,? > > > increase both send/recv poll NUM_WC & interval > > > > > > > > > > > > Michael, > > > > > > > > smp kernel on UP result is very bad. It dropped 40% throughput. > > > > up kernel on UP thoughput dropped with cpu utilization dropped > > > from 75% idle to 52% idle. > > > > > > Hmm. So far it seems the approach only works well on 2 CPUs. > > > > Did a clean 2.6.16 uniprocessor kernel build on both sides, > > + patch1 (splitting CQ & handler) > > + patch2 (tune CQ polling interval) > > + patch3 (use work queue in CQ handler) > > + patch4 (remove tx_ring) (rx_ring removal hasn't done yet) > > > > Without tuning, i got 1-3% throughput increase with average 10% > > cpu utiilzation reduce on netserver side. W/O patches, netperf side > > is 100% cpu utilization. 
> > > > The best result I got so far with tuning, 25% throughput increase > > + 2-5% cpu utilization saving in netperf side. > > Is the difference with previous result the tx_ring removal? The previous comparison test was based on one node UP with 4x mthca and one node SMP with 12x ehca, without tx_ring removal, since one of my machines was dead. The poor result bothered me, so I fixed the other node. This time I made a clean UP kernel build, used mthca on both netperf and netserver, and reran the test W/O above patches. > > > > I didn't see latency difference. I used TCP_RR test. > > > > > > This is somewhat surprising, isn't it? One would expect the extra > > > context switch to have some effect on latency, would one not? > > > > I got around 4% latency decrease on UP with less cpu utilization. > > You mean, latency actually got better? If so, that is surprising. > > -- > MST Sorry, I should have said latency increased by around 4% with all of these patches, with less cpu utilization. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From sean.hefty at intel.com Sat Apr 29 16:30:34 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 29 Apr 2006 16:30:34 -0700 Subject: [openib-general] Re: [PATCH v2] mad: use GID/LID on requester sidewhen matching responses to requests In-Reply-To: <20060429193913.GA9584@mellanox.co.il> Message-ID: >> >Check GID/LID for requester side when searching for request which matches >> >received response. This, in order to guarantee uniqueness if use same TID >> >when requesting via multiple source LIDs (when LMC is not zero). To perform >> >check, add LMC to cache. >> > >> >Further, do not perform LID check for direct-routed packets, since >> >permissive >> >LID makes a proper check impossible. >> >> Thanks - I'll look at this within the next couple of days.
> >Could this patch be merged please? Sean? There was a request to submit the LMC cache piece as a separate patch. I can merge in the MAD changes after the LMC cache has been accepted. - Sean From sean.hefty at intel.com Sat Apr 29 16:34:17 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 29 Apr 2006 16:34:17 -0700 Subject: [openib-general] RE: RFC: detecting duplicate MAD requests In-Reply-To: <20060429194815.GB9584@mellanox.co.il> Message-ID: >I understand that this is along the lines of the approach poposed by Jack a >while ago: > >https://openib.org/svn/trunk/contrib/mellanox/gen2/patches/mad_rmpp_requester_r >etry.patch > >with the difference that duplicate request detection will be handled by full >GID/TID/class/request/response matching, and not just TID matching. >Right? Correct - a large part of the motivation is based on the issues that Jack reported a while ago. I've been trying to come up with a more efficient solution (which can avoid the duplicate processing), and that can also be used with DS RMPP. - Sean From mst at mellanox.co.il Sat Apr 29 18:25:31 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 30 Apr 2006 04:25:31 +0300 Subject: [openib-general] Re: Re: [PATCH v2] mad: use GID/LID on requester sidewhen matching responses to requests In-Reply-To: References: <20060429193913.GA9584@mellanox.co.il> Message-ID: <20060430012531.GA15584@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: Re: [PATCH v2] mad: use GID/LID on requester sidewhen matching responses to requests > > >> >Check GID/LID for requester side when searching for request which matches > >> >received response. This, in order to guarantee uniqueness if use same TID > >> >when requesting via multiple source LIDs (when LMC is not zero). To perform > >> >check, add LMC to cache. > >> > > >> >Further, do not perform LID check for direct-routed packets, since > >> >permissive > >> >LID makes a proper check impossible. 
> >> > >> Thanks - I'll look at this within the next couple of days. > > > >Could this patch be merged please? Sean? > > There was a request to submit the LMC cache piece as a separate patch. I can > merge in the MAD changes after the LMC cache has been accepted. Roland said: "The lmc_cache stuff looks fine to me. It probably would be better to commit it as a separate patch -- one idea per patch." so I understand he's fine with it, and the comment was with regard to how to commit this - first core files, then MAD files. Anyway, its trivial to split the patch, if you want help with that let me know. -- MST From mst at mellanox.co.il Sat Apr 29 18:28:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 30 Apr 2006 04:28:34 +0300 Subject: [openib-general] [PATCH 1 of 2] add lmc cache Message-ID: <20060430012834.GA15657@mellanox.co.il> Add LMC cache. Signed-off-by: Jack Morgenstein Index: src/drivers/infiniband/include/rdma/ib_verbs.h =================================================================== --- src/drivers/infiniband/include/rdma/ib_verbs.h (revision 6066) +++ src/drivers/infiniband/include/rdma/ib_verbs.h (working copy) @@ -822,6 +822,7 @@ struct ib_cache { struct ib_event_handler event_handler; struct ib_pkey_cache **pkey_cache; struct ib_gid_cache **gid_cache; + u8 *lmc_cache; }; struct ib_device { Index: src/drivers/infiniband/include/rdma/ib_cache.h =================================================================== --- src/drivers/infiniband/include/rdma/ib_cache.h (revision 6066) +++ src/drivers/infiniband/include/rdma/ib_cache.h (working copy) @@ -102,4 +102,17 @@ int ib_find_cached_pkey(struct ib_device u16 pkey, u16 *index); +/** + * ib_get_cached_lmc - Returns a cached lmc table entry + * @device: The device to query. + * @port_num: The port number of the device to query. + * @lmc: The lmc value for the specified port for that device. 
+ * + * ib_get_cached_lmc() fetches the specified lmc table entry stored in + * the local software cache. + */ +int ib_get_cached_lmc(struct ib_device *device, + u8 port_num, + u8 *lmc); + #endif /* _IB_CACHE_H */ Index: src/drivers/infiniband/core/cache.c =================================================================== --- src/drivers/infiniband/core/cache.c (revision 6066) +++ src/drivers/infiniband/core/cache.c (working copy) @@ -191,6 +195,24 @@ int ib_find_cached_pkey(struct ib_device } EXPORT_SYMBOL(ib_find_cached_pkey); +int ib_get_cached_lmc(struct ib_device *device, + u8 port_num, + u8 *lmc) +{ + unsigned long flags; + int ret = 0; + + if (port_num < start_port(device) || port_num > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + *lmc = device->cache.lmc_cache[port_num - start_port(device)]; + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_get_cached_lmc); + static void ib_cache_update(struct ib_device *device, u8 port) { @@ -251,6 +273,8 @@ static void ib_cache_update(struct ib_de device->cache.pkey_cache[port - start_port(device)] = pkey_cache; device->cache.gid_cache [port - start_port(device)] = gid_cache; + device->cache.lmc_cache[port - start_port(device)] = tprops->lmc; + write_unlock_irq(&device->cache.lock); kfree(old_pkey_cache); @@ -305,7 +329,13 @@ static void ib_cache_setup_one(struct ib kmalloc(sizeof *device->cache.pkey_cache * (end_port(device) - start_port(device) + 1), GFP_KERNEL); - if (!device->cache.pkey_cache || !device->cache.gid_cache) { + device->cache.lmc_cache = kmalloc(sizeof *device->cache.lmc_cache * + (end_port(device) - + start_port(device) + 1), + GFP_KERNEL); + + if (!device->cache.pkey_cache || !device->cache.gid_cache || + !device->cache.lmc_cache) { printk(KERN_WARNING "Couldn't allocate cache " "for %s\n", device->name); goto err; @@ -333,6 +363,7 @@ err_cache: err: kfree(device->cache.pkey_cache); kfree(device->cache.gid_cache); + 
kfree(device->cache.lmc_cache); } static void ib_cache_cleanup_one(struct ib_device *device) @@ -349,6 +380,7 @@ static void ib_cache_cleanup_one(struct kfree(device->cache.pkey_cache); kfree(device->cache.gid_cache); + kfree(device->cache.lmc_cache); } static struct ib_client cache_client = { -- MST From mst at mellanox.co.il Sat Apr 29 18:29:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 30 Apr 2006 04:29:22 +0300 Subject: [openib-general] [PATCH 2 of 2] mad: check GID/LID when searching for request Message-ID: <20060430012922.GA15663@mellanox.co.il> Check GID/LID for requester side when searching for request which matches received response. This, in order to guarantee uniqueness if use same TID when requesting via multiple source LIDs (when LMC is not zero). To perform check, use LMC to cache. Further, do not perform LID check for direct-routed packets, since permissive LID makes a proper check impossible. Signed-off-by: Jack Morgenstein Index: src/drivers/infiniband/core/mad.c =================================================================== --- src/drivers/infiniband/core/mad.c (revision 6066) +++ src/drivers/infiniband/core/mad.c (working copy) @@ -34,6 +34,7 @@ * $Id$ */ #include +#include #include "mad_priv.h" #include "mad_rmpp.h" @@ -1669,20 +1670,21 @@ static inline int rcv_has_same_class(str rwc->recv_buf.mad->mad_hdr.mgmt_class; } -static inline int rcv_has_same_gid(struct ib_mad_send_wr_private *wr, +static inline int rcv_has_same_gid(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *wr, struct ib_mad_recv_wc *rwc ) { struct ib_ah_attr attr; u8 send_resp, rcv_resp; + union ib_gid sgid; + struct ib_device *device = mad_agent_priv->agent.device; + u8 port_num = mad_agent_priv->agent.port_num; + u8 lmc; send_resp = ((struct ib_mad *)(wr->send_buf.mad))-> mad_hdr.method & IB_MGMT_METHOD_RESP; rcv_resp = rwc->recv_buf.mad->mad_hdr.method & IB_MGMT_METHOD_RESP; - if (!send_resp && rcv_resp) - /* is 
request/response. GID/LIDs are both local (same). */ - return 1; - if (send_resp == rcv_resp) /* both requests, or both responses. GIDs different */ return 0; @@ -1691,48 +1693,78 @@ static inline int rcv_has_same_gid(struc /* Assume not equal, to avoid false positives. */ return 0; - if (!(attr.ah_flags & IB_AH_GRH) && !(rwc->wc->wc_flags & IB_WC_GRH)) - return attr.dlid == rwc->wc->slid; - else if ((attr.ah_flags & IB_AH_GRH) && - (rwc->wc->wc_flags & IB_WC_GRH)) - return memcmp(attr.grh.dgid.raw, - rwc->recv_buf.grh->sgid.raw, 16) == 0; - else + if (!!(attr.ah_flags & IB_AH_GRH) != + !!(rwc->wc->wc_flags & IB_WC_GRH)) /* one has GID, other does not. Assume different */ return 0; + + if (!send_resp && rcv_resp) { + /* is request/response. */ + if (!(attr.ah_flags & IB_AH_GRH)) { + if (ib_get_cached_lmc(device, port_num, &lmc)) + return 0; + return (!lmc || !((attr.src_path_bits ^ + rwc->wc->dlid_path_bits) & + ((1 << lmc) - 1))); + } else { + if (ib_get_cached_gid(device, port_num, + attr.grh.sgid_index, &sgid)) + return 0; + return !memcmp(sgid.raw, rwc->recv_buf.grh->dgid.raw, + 16); + } + } + + if (!(attr.ah_flags & IB_AH_GRH)) + return attr.dlid == rwc->wc->slid; + else + return !memcmp(attr.grh.dgid.raw, rwc->recv_buf.grh->sgid.raw, + 16); } + +static inline int is_direct(u8 class) +{ + return (class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE); +} + struct ib_mad_send_wr_private* ib_find_send_mad(struct ib_mad_agent_private *mad_agent_priv, - struct ib_mad_recv_wc *mad_recv_wc) + struct ib_mad_recv_wc *wc) { - struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wr_private *wr; struct ib_mad *mad; - mad = (struct ib_mad *)mad_recv_wc->recv_buf.mad; + mad = (struct ib_mad *)wc->recv_buf.mad; - list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, - agent_list) { - if ((mad_send_wr->tid == mad->mad_hdr.tid) && - rcv_has_same_class(mad_send_wr, mad_recv_wc) && - rcv_has_same_gid(mad_send_wr, mad_recv_wc)) - return mad_send_wr; + 
list_for_each_entry(wr, &mad_agent_priv->wait_list, agent_list) { + if ((wr->tid == mad->mad_hdr.tid) && + rcv_has_same_class(wr, wc) && + /* + * Don't check GID for direct routed MADs. + * These might have permissive LIDs. + */ + (is_direct(wc->recv_buf.mad->mad_hdr.mgmt_class) || + rcv_has_same_gid(mad_agent_priv, wr, wc))) + return wr; } /* * It's possible to receive the response before we've * been notified that the send has completed */ - list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, - agent_list) { - if (is_data_mad(mad_agent_priv, mad_send_wr->send_buf.mad) && - mad_send_wr->tid == mad->mad_hdr.tid && - mad_send_wr->timeout && - rcv_has_same_class(mad_send_wr, mad_recv_wc) && - rcv_has_same_gid(mad_send_wr, mad_recv_wc)) { + list_for_each_entry(wr, &mad_agent_priv->send_list, agent_list) { + if (is_data_mad(mad_agent_priv, wr->send_buf.mad) && + wr->tid == mad->mad_hdr.tid && + wr->timeout && + rcv_has_same_class(wr, wc) && + /* + * Don't check GID for direct routed MADs. + * These might have permissive LIDs. + */ + (is_direct(wc->recv_buf.mad->mad_hdr.mgmt_class) || + rcv_has_same_gid(mad_agent_priv, wr, wc))) /* Verify request has not been canceled */ - return (mad_send_wr->status == IB_WC_SUCCESS) ? - mad_send_wr : NULL; - } + return (wr->status == IB_WC_SUCCESS) ? wr : NULL; } return NULL; } _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST From mst at mellanox.co.il Sat Apr 29 18:30:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 30 Apr 2006 04:30:21 +0300 Subject: [openib-general] Re: [PATCH v2] mad: use GID/LID on requester sidewhen matching responses to requests In-Reply-To: References: <20060429193913.GA9584@mellanox.co.il> Message-ID: <20060430013021.GA15684@mellanox.co.il> Quoting r. 
Sean Hefty : > >Could this patch be merged please? Sean? > > There was a request to submit the LMC cache piece as a separate patch. I can > merge in the MAD changes after the LMC cache has been accepted. I've just sent the patch in two chunks. Roland already approved the cache chunk, please commit both. -- MST From dotanb at mellanox.co.il Sun Apr 30 01:10:42 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 30 Apr 2006 11:10:42 +0300 Subject: [openib-general] the QP attribute have a double defenition of the QP primary port Message-ID: <200604301110.42718.dotanb@mellanox.co.il> Hi. I think that we have a problem with the primary port number in the QP attributes. In the IB spec: in the transition RESET->INIT, the primary port is required; in the transition INIT->RTR, the address vector is required for connected QPs. In the driver: in the transition RESET->INIT, the primary port is required (the mask IBV_QP_PORT is required); in the transition INIT->RTR, the port number is one of the attributes of the address vector (which means there are 2 attributes that define the QP port number). I think that there are 2 problems with this implementation: 1) the user can use two different values for those port numbers (in the mthca driver, the port number that was defined in the address vector will be used); 2) the user can define / change the QP port number in the transition INIT->RTR (which is an IB spec violation). What do you think about this issue? Thanks, Dotan From dotanb at mellanox.co.il Sun Apr 30 01:29:55 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 30 Apr 2006 11:29:55 +0300 Subject: [openib-general] [rds]: there is a kernel oops while loading the RDS module in kernel 2.6.11 Message-ID: <200604301129.55041.dotanb@mellanox.co.il> When loading the rds module, there is a kernel oops.
here are the machine props: --------------------------------------- Host Architecture : x86_64 Linux Distribution: Fedora Core release 4 (Stentz) Kernel Version : 2.6.11-1.1369_FC4smp Memory size : 4102172 kB Driver Version : openib_gen2-20060430-0800 (REV=6741) HCA ID(s) : mthca0 HCA model(s) : 25218 FW version(s) : 5.1.400 Board(s) : MT_0150000001 here is the dump from the /var/log/messages: -------------------------------------------------------------- Apr 30 11:21:05 sw084 kernel: ----------- [cut here ] --------- [please bite here ] --------- Apr 30 11:21:05 sw084 kernel: Kernel BUG at "kernel/workqueue.c":311 Apr 30 11:21:05 sw084 kernel: invalid operand: 0000 [1] SMP Apr 30 11:21:05 sw084 kernel: CPU 1 Apr 30 11:21:05 sw084 kernel: Modules linked in: ib_local_sa(U) findex(U) ib_ipoib(U) ib_sa(U) ib_uverbs(U) ib_umad(U) ib_mthca(U) ib_mad(U) ib_core(U) mst_pciconf(U) mst_pci(U) nfsd exportfs md5 ipv6 parport_pc lp parport autofs4 nfs lockd rfcomm l2cap bluetooth sunrpc pcmcia ye nta_socket rsrc_nonstatic pcmcia_core dm_mod video button battery ac ohci_hcd ehci_hcd i2c_nforce2 i2c_core tg3 ext3 jbd sata_nv libata mpts csih mptbase sd_mod scsi_mod Apr 30 11:21:05 sw084 kernel: Pid: 31497, comm: modprobe Not tainted 2.6.11-1.1369_FC4smp Apr 30 11:21:05 sw084 kernel: RIP: 0010:[] {__create_workqueue+41} Apr 30 11:21:05 sw084 kernel: RSP: 0000:ffff81011220bf18 EFLAGS: 00010202 Apr 30 11:21:05 sw084 kernel: RAX: 000000000000000b RBX: 00000000fffffff4 RCX: 0000000000000bb8 Apr 30 11:21:05 sw084 kernel: RDX: 000000000000007f RSI: 0000000000000001 RDI: ffffffff88199d4f Apr 30 11:21:05 sw084 kernel: RBP: ffffffff88199d4f R08: ffff81000000f300 R09: ffff81000000e000 Apr 30 11:21:05 sw084 kernel: R10: 0000000000000000 R11: ffff81013ef685a0 R12: 0000000000000001 Apr 30 11:21:05 sw084 kernel: R13: 00002aaaaaad5010 R14: ffffffff804334c0 R15: 0000000000518568 Apr 30 11:21:05 sw084 kernel: FS: 00002aaaaaad43c0(0000) GS:ffffffff80510700(0000) knlGS:00000000f7fcd6c0 Apr 30 
11:21:05 sw084 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Apr 30 11:21:05 sw084 kernel: CR2: 0000003792691ec0 CR3: 0000000110009000 CR4: 00000000000006e0 Apr 30 11:21:05 sw084 kernel: Process modprobe (pid: 31497, threadinfo ffff81011220a000, task ffff8100a743e130) Apr 30 11:21:05 sw084 kernel: Stack: ffffffff80433140 00000000fffffff4 000000000003384b ffffffff8819c480 Apr 30 11:21:05 sw084 kernel: 00002aaaaaad5010 ffffffff8819f094 ffffffff80433500 ffffffff80157af4 Apr 30 11:21:05 sw084 kernel: 0000000000000000 0000000000517d80 Apr 30 11:21:05 sw084 kernel: Call Trace:{:ib_local_sa:sa_db_init+148} {sys_init_module+292} Apr 30 11:21:05 sw084 kernel: {tracesys+209} Apr 30 11:21:05 sw084 kernel: Apr 30 11:21:05 sw084 kernel: Code: 0f 0b 50 b6 37 80 ff ff ff ff 37 01 48 8b 3d 5c 8b 2e 00 be Apr 30 11:21:05 sw084 kernel: RIP {__create_workqueue+41} RSP Apr 30 11:21:05 sw084 kernel: <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43 Apr 30 11:21:05 sw084 kernel: in_atomic():0, irqs_disabled():1 Apr 30 11:21:05 sw084 kernel: Apr 30 11:21:05 sw084 kernel: Call Trace:{profile_task_exit+21} {do_exit+34} Apr 30 11:21:05 sw084 kernel: {do_unblank_screen+40} {die+77} Apr 30 11:21:05 sw084 kernel: {do_invalid_op+163} {__create_workqueue+41} Apr 30 11:21:05 sw084 kernel: {thread_return+139} {error_exit+0} Apr 30 11:21:05 sw084 kernel: {__create_workqueue+41} {__create_workqueue+35} Apr 30 11:21:05 sw084 kernel: {:ib_local_sa:sa_db_init+148} {sys_init_module+292} Apr 30 11:21:05 sw084 kernel: {tracesys+209} thanks Dotan From oibleo at gmail.com Sun Apr 30 03:51:03 2006 From: oibleo at gmail.com (Leonid Arsh) Date: Sun, 30 Apr 2006 13:51:03 +0300 Subject: [openib-general] Re: [PATCH] IPoIB splitting CQ, increase both send/recv poll NUM_WC & interval In-Reply-To: References: <44507FD1.9040801@voltaire.com> Message-ID: <10e223bf0604300351r3cd8767fmddd7d8b7a2e130ad@mail.gmail.com> On 4/27/06, Shirley Ma wrote: > How many percentage 
throughput you got from your NAPI implementation? Shirley, I couldn't find our exact test results; we ran the tests quite a long time ago. As far as I remember, we gained only a few percent, up to 3-4%. No surprise that you get better results - we handled completions from a tasklet context. From ogerlitz at voltaire.com Sun Apr 30 05:30:17 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 30 Apr 2006 15:30:17 +0300 Subject: [openib-general] [PATCH 5/6] iser RDMA CM (CMA) and IB verbsinteraction In-Reply-To: References: Message-ID: <4454ADD9.20909@voltaire.com> Sean Hefty wrote: >> +static int iser_free_device_ib_res(struct iser_device *device) >> +{ >> + BUG_ON(device->mr == NULL); >> + >> + tasklet_kill(&device->cq_tasklet); >> + >> + (void)ib_dereg_mr(device->mr); >> + (void)ib_destroy_cq(device->cq); >> + (void)ib_dealloc_pd(device->pd); >> + >> + device->mr = NULL; >> + device->cq = NULL; >> + device->pd = NULL; >> + return 0; >> +} > > Can you eliminate the return code? Yes >> +static int iser_free_ib_conn_res(struct iser_conn *ib_conn) >> +{ >> + BUG_ON(ib_conn == NULL); >> + >> + iser_err("freeing conn %p cma_id %p fmr pool %p qp %p\n", >> + ib_conn, ib_conn->cma_id, >> + ib_conn->fmr_pool, ib_conn->qp); >> + >> + /* qp is created only once both addr & route are resolved */ >> + if (ib_conn->fmr_pool != NULL) >> + ib_destroy_fmr_pool(ib_conn->fmr_pool); >> + >> + if (ib_conn->qp != NULL) >> + rdma_destroy_qp(ib_conn->cma_id); >> + >> + if (ib_conn->cma_id != NULL) >> + rdma_destroy_id(ib_conn->cma_id); > Are the NULL checks needed above? Neither iser_create_device_ib_res() or > iser_create_ib_conn_res() set the values to NULL if an error occurred. We are dealing here with connection resources, so the device resources (shared among ib conns) are irrelevant. The ib conn struct is kzalloc-ed on creation, and later iser_free_ib_conn_res() can be called when only a ***subset*** of the resources was allocated.
Examples are instant error from rdma_addr_resolve() or getting ADDR/ROUTE ERROR vs. CONNECT ERROR cma events, in the first three cases only the cma id should be destroyed while on the latter there's a need to destroy the fmr pool and the qp. >> +/** >> + * based on the resolved device node GUID see if there already allocated >> + * device for this device. If there's no such, create one. >> + */ >> +static >> +struct iser_device *iser_device_find_by_ib_device(struct rdma_cm_id *cma_id) >> +{ >> + struct list_head *p_list; >> + struct iser_device *device = NULL; >> + >> + mutex_lock(&ig.device_list_mutex); >> + >> + p_list = ig.device_list.next; >> + while (p_list != &ig.device_list) { >> + device = list_entry(p_list, struct iser_device, ig_list); >> + /* find if there's a match using the node GUID */ >> + if (device->ib_device->node_guid == cma_id->device->node_guid) >> + break; >> + } >> + >> + if (device == NULL) { >> + device = kzalloc(sizeof *device, GFP_KERNEL); >> + if (device == NULL) >> + goto end; > goto out; // see below >> + /* assign this device to the device */ >> + device->ib_device = cma_id->device; >> + /* init the device and link it into ig device list */ >> + if (iser_create_device_ib_res(device)) { >> + kfree(device); >> + device = NULL; >> + goto end; >> + } >> + list_add(&device->ig_list, &ig.device_list); >> + } >> +end: >> + BUG_ON(device == NULL); >> + device->refcount++; > > out: OK Or. 
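The partial-teardown argument above (a zeroed connection struct, so that each NULL check tells the free routine whether a resource was ever created) can be sketched in user-space C. This is only an illustration under stated assumptions: `struct conn_res`, `conn_alloc()`, `release()` and `conn_free_res()` are hypothetical names, `calloc()` stands in for `kzalloc()`, and `malloc()`/`free()` stand in for the real create/destroy calls. It is not the iser code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a connection owning up to three resources. */
struct conn_res {
	void *cma_id;   /* created first */
	void *fmr_pool; /* created only after addr and route are resolved */
	void *qp;       /* created only after addr and route are resolved */
};

/* Zeroed allocation (like kzalloc): all resource pointers start out NULL. */
static struct conn_res *conn_alloc(void)
{
	return calloc(1, sizeof(struct conn_res));
}

/* Release one resource if it was ever created; returns 1 if it was freed. */
static int release(void **res)
{
	if (*res == NULL)
		return 0;
	free(*res);
	*res = NULL;
	return 1;
}

/*
 * Free whatever subset of the resources was set up, in the same order
 * as the quoted routine (fmr pool, qp, then cma id).
 * Returns the number of resources released.
 */
static int conn_free_res(struct conn_res *c)
{
	int freed = 0;

	assert(c != NULL); /* plays the role of BUG_ON() */

	freed += release(&c->fmr_pool);
	freed += release(&c->qp);
	freed += release(&c->cma_id);
	return freed;
}
```

In this sketch an early failure (for example an address resolution error) leaves only `cma_id` set, so `conn_free_res()` releases a single resource, while a failure at connect time releases all three. Calling it again is harmless because every pointer is reset to NULL.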
From leonida at voltaire.com Sun Apr 30 05:47:15 2006 From: leonida at voltaire.com (Leonid Arsh) Date: Sun, 30 Apr 2006 15:47:15 +0300 Subject: [openib-general] Re: netperf for RDS needed In-Reply-To: <96f8e60e0604271019v2a9a7b6ei9c10a7dd507fbcb4@mail.gmail.com> References: <4450D380.4000303@voltaire.com> <96f8e60e0604271019v2a9a7b6ei9c10a7dd507fbcb4@mail.gmail.com> Message-ID: <4454B1D3.5080105@voltaire.com> Ranjit, we run it on dual CPU Intel(R) Xeon(TM) CPU 3.00GHz x86_64, with hyper-threading enabled (two hyper-threads on every CPU) We run IBED-1.0-rc3 on both machines. One machine runs SUSE Linux Enterprise Server 10 Beta8, kernel 2.6.16-rc6-git1-4-smp; The second one runs Red Hat Enterprise Linux AS release 4 (Nahant Update 1) with kernel 2.6.15 from kernel.org. We get approx. the same dmesg output on the sever side, doesn't matter which machine is server. The test without RDS was done over IPoIB (the same run line, but without '-r' command line switch.) Regards, Leonid Ranjit Pandit wrote: > On 4/27/06, Leonid Arsh wrote: > >> Ranjit >> thank you for the patch again. I applied it and succeeded to run. >> Looks very nice. >> >> This are the results with for RDS : >> Socket Message Elapsed Messages >> Size Size Time Okay Errors Throughput >> bytes bytes secs # # 10^6bits/sec >> 262144 8192 10.01 653574 1 4280.59 >> 118784 10.01 653574 4280.59 >> >> This are the results without RDS: >> Socket Message Elapsed Messages >> Size Size Time Okay Errors Throughput >> bytes bytes secs # # 10^6bits/sec >> 262144 8192 10.00 356180 0 2333.90 >> 118784 10.00 211005 1382.63 >> >> > > What kind of systems are you running on, cpu and memory? > > Are the results without-RDS on IPoIB? > > The second line of the output is more interesting as it shows the > "useful" b/w (as seen by the receiver) and therefore accounts for any > lost/dropped pkts. > > Rds shows 3x improvement on recvr side b/w (4289.59 Vs 1382.63). 
>
>
>> During the run we get error messages in dmesg on the server side.
>> Have you seen anything like this?
>> Please see the dmesg output below:
>>
>
> What kernel are you on?
> 32bit or 64bit system?
>
> I will see if I can reproduce it.
>
>
>>
>> swapper: page allocation failure. order:1, mode:0x20
>>
>> Call Trace: {__alloc_pages+662}
>> {smp_apic_timer_interrupt+54}
>> {apic_timer_interrupt+132}
>> {cache_grow+288}
>> {cache_alloc_refill+419}
>> {kmem_cache_alloc+87}
>> {:ib_rds:rds_alloc_buf+16}
>> {:ib_rds:rds_alloc_recv_buffer+12}
>> {:ib_rds:rds_post_new_recv+23}
>> {:ib_rds:rds_recv_completion+85}
>> {:ib_rds:rds_cq_callback+87}
>> {:ib_mthca:mthca_eq_int+119}
>> {do_IRQ+50} {ret_from_intr+0}
>> {:ib_mthca:mthca_tavor_interrupt+91}
>> {handle_IRQ_event+41}
>> {__do_IRQ+156}
>> {do_IRQ+45} {ret_from_intr+0}
>> {mwait_idle+54}
>> {cpu_idle+93}
>> {start_secondary+1131}
>> ______________________________________________
>>

From swise at opengridcomputing.com Sun Apr 30 10:24:16 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 30 Apr 2006 12:24:16 -0500
Subject: [openib-general] [PATCH] rping.c - indicate failure when rdma_resolve_addr returns an error
Message-ID: <1146417856.852.23.camel@stevo-desktop>

Move the state to ERROR if rdma_resolve_route() fails.
Signed-off-by: Steve Wise

Index: rping.c
===================================================================
--- rping.c	(revision 6741)
+++ rping.c	(working copy)
@@ -161,6 +161,7 @@
 		cb->state = ADDR_RESOLVED;
 		ret = rdma_resolve_route(cma_id, 2000);
 		if (ret) {
+			cb->state = ERROR;
 			fprintf(stderr, "rdma_resolve_route error %d\n", ret);
 			sem_post(&cb->sem);
 		}

From swise at opengridcomputing.com Sun Apr 30 10:26:38 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 30 Apr 2006 12:26:38 -0500
Subject: [openib-general] [PATCH] rping.c - indicate failure when rdma_resolve_addr returns an error
In-Reply-To: <1146417856.852.23.camel@stevo-desktop>
References: <1146417856.852.23.camel@stevo-desktop>
Message-ID: <1146417998.852.27.camel@stevo-desktop>

Committed revision 6787.

On Sun, 2006-04-30 at 12:24 -0500, Steve Wise wrote:
> Move the state to ERROR if rdma_resolve_route() fails.
>
> Signed-off-by: Steve Wise
>
>
>
> Index: rping.c
> ===================================================================
> --- rping.c	(revision 6741)
> +++ rping.c	(working copy)
> @@ -161,6 +161,7 @@
>  		cb->state = ADDR_RESOLVED;
>  		ret = rdma_resolve_route(cma_id, 2000);
>  		if (ret) {
> +			cb->state = ERROR;
>  			fprintf(stderr, "rdma_resolve_route error %d\n", ret);
>  			sem_post(&cb->sem);
>  		}

From lindahl at pathscale.com Sun Apr 30 22:28:59 2006
From: lindahl at pathscale.com (Greg Lindahl)
Date: Sun, 30 Apr 2006 22:28:59 -0700
Subject: [openib-general] RFC: detecting duplicate MAD requests
In-Reply-To: 
References: <20060429001348.GA3843@greglaptop.internal.keyresearch.com>
Message-ID: <20060501052859.GA1317@greglaptop.hsd1.ca.comcast.net>

On Fri, Apr 28, 2006 at 11:23:44PM -0700, Sean Hefty wrote:

> The proposal is to only discard
duplicate requests while a response to the first
> request is being generated.

Ah. Does that happen that often?

-- greg

From lindahl at pathscale.com Sun Apr 30 22:32:07 2006
From: lindahl at pathscale.com (Greg Lindahl)
Date: Sun, 30 Apr 2006 22:32:07 -0700
Subject: [openib-general] Re: [PATCH 04/16] ehca: userspace support
In-Reply-To: <4044CACC-FB5A-415E-8974-27136269B5C1@schihei.de>
References: <4450A176.9000008@de.ibm.com> <20060427114355.GB32127@wohnheim.fh-wedel.de> <1146177388.19236.1.camel@localhost.localdomain> <6C4A3B96-4752-4FF9-8FBE-C383B00AE014@schihei.de> <84144f020604272332s6101032cy6936096230f3637c@mail.gmail.com> <4044CACC-FB5A-415E-8974-27136269B5C1@schihei.de>
Message-ID: <20060501053207.GB1317@greglaptop.hsd1.ca.comcast.net>

> >Do you really need this heavy debug logging in the first place? You
> >can use kprobes for arbitrary run-time inspection anyway, so logging
> >everything seems wasteful.
>
> The problem I see with kprobes is that you have to set several kernel
> configuration options (e.g. CONFIG_KPROBES, CONFIG_DEBUG_INFO, etc.)
> on compile time to use it.

Same problem with pr_debug().

Note that one usage of debug code is for a vendor to ask a customer to turn it on to figure out weird problems that the vendor can't replicate. Customers are more likely to cooperate if the effort is small... rebuilding the kernel is not a small effort compared to turning on debug that's already compiled in.

-- greg

From mst at mellanox.co.il Sun Apr 30 22:40:05 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 1 May 2006 08:40:05 +0300
Subject: [openib-general] Re: RFC: detecting duplicate MAD requests
In-Reply-To: <20060501052859.GA1317@greglaptop.hsd1.ca.comcast.net>
References: <20060429001348.GA3843@greglaptop.internal.keyresearch.com> <20060501052859.GA1317@greglaptop.hsd1.ca.comcast.net>
Message-ID: <20060501054005.GA28537@mellanox.co.il>

Quoting r.
Greg Lindahl :
> > The proposal is to only discard duplicate requests while a response to the
> > first request is being generated.
>
> Ah. Does that happen that often?

Often enough with large RMPP.

-- 
MST

From oferg at mellanox.co.il Sun Apr 30 23:59:16 2006
From: oferg at mellanox.co.il (Ofer Gigi)
Date: Mon, 1 May 2006 09:59:16 +0300
Subject: [openib-general] [PATCH] osm_port_info_rcv.c : clear clientreregister bit
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301FE0027@mtlexch01.mtl.com>

Hi Hal,

Did you apply this one below?
I forgot to CC you, and as far as I can see it is not applied.

Thanks!
Ofer

-----Original Message-----
From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Ofer Gigi
Sent: Thursday, April 27, 2006 1:46 PM
To: OPENIB
Subject: [openib-general] [PATCH] osm_port_info_rcv.c : clear clientreregister bit

Hi Hal,

Bug Fix: On receive of client reregister - clear the reregister bit - so reregistering won't be sent again and again

Please apply to trunk and branch.

Thanks
Ofer G.
Signed-off-by: Ofer Gigi

Index: osm_port_info_rcv.c
===================================================================
--- osm_port_info_rcv.c	(revision 6640)
+++ osm_port_info_rcv.c	(working copy)
@@ -666,6 +666,17 @@ osm_pi_rcv_process(
   p_smp = osm_madw_get_smp_ptr( p_madw );
   p_context = osm_madw_get_pi_context_ptr( p_madw );
   p_pi = (ib_port_info_t*)ib_smp_get_payload_ptr( p_smp );
+
+  /* On receive of client reregister - clear the reregister bit - so
+     reregistering won't be sent again and again*/
+  if (ib_port_info_get_client_rereg(p_pi))
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+             "osm_pi_rcv_process: "
+             "client reregister received on response\n");
+    ib_port_info_set_client_rereg(p_pi,0);
+  }
+
   port_num = (uint8_t)cl_ntoh32( p_smp->attr_mod );
   port_guid = p_context->port_guid;
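The idea behind the patch above (test a one-shot flag on a received attribute and clear it so it is not acted on again) can be sketched in standalone C. This is only an illustration under assumptions: the bit position (`REREG_BIT` as the top bit of a byte shared with a 5-bit timeout value) and the helper names `get_rereg()`, `set_rereg()` and `consume_rereg()` are hypothetical and are not the OpenSM `ib_port_info_*` accessors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout for illustration: bit 7 = reregister flag,
 * bits 0-4 = a timeout value that must survive the update. */
#define REREG_BIT 0x80u

/* Return 1 if the reregister flag is set in the packed byte. */
static int get_rereg(uint8_t field)
{
	return (field & REREG_BIT) != 0;
}

/* Return the packed byte with the flag set or cleared, other bits kept. */
static uint8_t set_rereg(uint8_t field, int on)
{
	return (uint8_t)((field & ~REREG_BIT) | (on ? REREG_BIT : 0u));
}

/*
 * If the flag is set, clear it in place and report that it was seen.
 * Returns 1 when the flag was consumed, 0 when there was nothing to do.
 */
static int consume_rereg(uint8_t *field)
{
	assert(field != NULL);

	if (!get_rereg(*field))
		return 0;

	*field = set_rereg(*field, 0);
	return 1;
}
```

A second call on the same byte reports nothing to do, which is the "won't be sent again and again" behaviour the patch description asks for; the remaining bits of the byte are left untouched.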